1
Nearest Neighbor Methods
  • Similar outputs for similar inputs

2
Nearest neighbor classification: the statistical approach
  • Assume you have known classifications of a set of
    exemplars
  • NN classification assumes that the best guess at
    the classification of an unknown item is the
    classification of its nearest neighbors (those
    known items that are most similar to the unknown
    item).
  • k-NN methods use the k nearest neighbors for that
    guess.
  • For example, choose the most common class among its neighbors (this majority-vote rule is sketched below).
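
A minimal sketch of that majority-vote rule, assuming numeric feature vectors and (for now) plain Euclidean distance; the function name and toy data are illustrative, not from the slides.

    import numpy as np
    from collections import Counter

    def knn_classify(query, exemplars, labels, k=3):
        """Return the most common label among the k exemplars nearest to query."""
        dists = np.linalg.norm(exemplars - query, axis=1)   # distance to every known exemplar
        nearest = np.argsort(dists)[:k]                     # indices of the k closest
        votes = Counter(labels[i] for i in nearest)
        return votes.most_common(1)[0][0]                   # ties broken arbitrarily

    # toy usage: two known classes, one unknown item
    X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
    y = ["A", "A", "B", "B"]
    print(knn_classify(np.array([0.2, 0.1]), X, y, k=3))    # -> "A"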

3
Issues
  • How large to make the neighborhood?
  • Which neighbors are nearest?
  • In other words, how do we measure distance?
  • How to apply to continuous outputs?

4
Example
5
Example
6
Number of neighbors
  • Big neighborhoods produce smoother categorization boundaries.
  • Small neighborhoods can produce overfitting.
  • But small neighborhoods can make finer distinctions.
  • Solution: cross-validate to find the right size (a sketch follows below).
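
One way to make the cross-validation suggestion concrete: a leave-one-out sketch that scores a few candidate neighborhood sizes on the known exemplars and keeps the best. The candidate values and toy data are illustrative only.

    import numpy as np
    from collections import Counter

    def loo_accuracy(X, y, k):
        """Leave-one-out accuracy of k-NN classification on exemplars X with labels y."""
        correct = 0
        for i in range(len(X)):
            dists = np.linalg.norm(X - X[i], axis=1)
            dists[i] = np.inf                               # hold out the item itself
            nearest = np.argsort(dists)[:k]
            guess = Counter(y[j] for j in nearest).most_common(1)[0][0]
            correct += (guess == y[i])
        return correct / len(X)

    X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
    y = ["A", "A", "A", "B", "B", "B"]
    best_k = max([1, 3, 5], key=lambda k: loo_accuracy(X, y, k))
    print(best_k)                                           # the k with the best held-out accuracy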

7
How to measure distance?
  • Various metrics
  • City-block
  • Euclidean
  • Squared Euclidean
  • Chebychev
  • Will discuss some of these later with RBFs (the metrics are sketched below).
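
The metrics listed above, written out for two feature vectors (a sketch; the function names are mine). City block and Euclidean are the r = 1 and r = 2 cases of the Minkowski metric that comes up again later.

    import numpy as np

    def city_block(a, b):         # sum of absolute coordinate differences (Minkowski r = 1)
        return np.sum(np.abs(a - b))

    def euclidean(a, b):          # straight-line distance (Minkowski r = 2)
        return np.sqrt(np.sum((a - b) ** 2))

    def squared_euclidean(a, b):  # Euclidean without the square root
        return np.sum((a - b) ** 2)

    def chebychev(a, b):          # largest single-coordinate difference
        return np.max(np.abs(a - b))

    a, b = np.array([0.0, 0.0]), np.array([1.0, 1.0])
    print(city_block(a, b), euclidean(a, b), squared_euclidean(a, b), chebychev(a, b))
    # 2.0  1.414...  2.0  1.0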

8
How to apply the nearest neighbor method to continuous outputs?
  • Simple method - average the outputs of the k
    nearest neighbors
  • Complex method - weighted average of outputs of
    the k nearest neighbors
  • The weights are based on each neighbor's distance (both methods are sketched below).
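
A sketch of both methods for continuous outputs: a plain average of the k nearest outputs, and a distance-weighted average. Inverse-distance weighting is one common choice of weighting scheme, not something the slide specifies.

    import numpy as np

    def knn_regress(query, exemplars, targets, k=3, weighted=False, eps=1e-12):
        """Predict a continuous output from the k nearest exemplars."""
        dists = np.linalg.norm(exemplars - query, axis=1)
        nearest = np.argsort(dists)[:k]
        if not weighted:
            return targets[nearest].mean()                  # simple method: plain average
        w = 1.0 / (dists[nearest] + eps)                    # closer neighbors count more
        return np.sum(w * targets[nearest]) / np.sum(w)     # complex method: weighted average

    X = np.array([[0.0], [1.0], [2.0], [3.0]])
    y = np.array([0.0, 1.0, 4.0, 9.0])                      # outputs sampled from y = x**2
    print(knn_regress(np.array([1.5]), X, y, k=2))                   # 2.5
    print(knn_regress(np.array([1.2]), X, y, k=2, weighted=True))    # 1.6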

9
Advantages and Disadvantages
  • Advantage
  • Nonparametric - assumes nothing about shape of
    boundaries in classification.
  • Disadvantages
  • Does not incorporate domain knowledge (if it
    exists).
  • Irrelevant features have a negative impact on
    distance metric (the curse of dimensionality).
  • Exceptions or errors in training set can have too
    much influence on fit.
  • Computation-intensive to compute all distances
    (good algorithms exist, though).
  • Memory-intensive - must store original exemplars.

10
Hybrid networks
  • Issue: time to train.
  • Backpropagation: the problem with networks with hidden units is that learning the right hidden unit representation takes a long time.
  • Note: the previous discussion of nearest-neighbor methods becomes relevant in a particular type of hybrid network, RBF nets.

11
Counterpropagation networks
  • A form of hybrid network
  • Includes supervised and unsupervised learning
  • Hidden unit representations are formed by
    unsupervised learning.
  • Mapping hidden unit representations to network
    output is accomplished using supervised learning
  • Single layer can use the delta rule
  • Multiple layers could use backprop
  • (At least one layer was made faster using
    unsupervised learning!)
  • The unsupervised portion of the network attempts
    to find the structure in the input.
  • It's a preprocessor.
  • Could use competitive learning, or other forms
  • Some hybrid networks (including some RBFs) don't use unsupervised learning at all; the mapping from input to hidden unit representations is predetermined.

12
Disadvantage of the hybrid network approach
  • The hidden unit representation is independent of
    the task to be learned
  • Its structure is not optimized for the task
  • Similar inputs are mapped to similar hidden unit representations, which can produce problems for networks that must make fine distinctions.

13
Basis function networks: details
  • Consider a network with a set of m hidden units, each of which has its own transfer function, hj, that determines how it reacts to its input.
  • These m functions are the basis functions of the
    network.
  • The weights are learned using the delta rule (a sketch follows below).
  • How you define these basis functions determines
    the behavior of the network.
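
A sketch of the setup just described, assuming the usual least-mean-squares form of the delta rule: the network's output is a weighted sum of the m fixed basis-function responses, and only those output weights are learned. The learning rate and epoch count are illustrative.

    import numpy as np

    def train_basis_net(basis_fns, X, y, lr=0.05, epochs=200):
        """Learn output weights for fixed basis functions h_j with the delta rule."""
        w = np.zeros(len(basis_fns))
        for _ in range(epochs):
            for x, t in zip(X, y):
                h = np.array([h_j(x) for h_j in basis_fns])  # hidden-unit responses
                err = t - np.dot(w, h)                       # target minus network output
                w += lr * err * h                            # delta rule update
        return w

    # with h1(x) = 1 and h2(x) = x this is Example 1 on the next slide
    basis = [lambda x: 1.0, lambda x: x]
    X = np.array([0.0, 1.0, 2.0, 3.0])
    y = 2.0 * X + 1.0
    print(train_basis_net(basis, X, y))   # approaches [1, 2]: intercept and slope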

14
Example 1
  • 2 hidden units, one input x, one output y.
  • h1 = 1, h2 = x
  • What does this compute?
  • simple linear regression
  • Weight for h1 is the intercept
  • Weight for h2 is the slope

15
Example 2
  • 3 hidden units, one input x, one output y.
  • h1 = 1, h2 = x, h3 = x²
  • A form of polynomial regression: fitting a quadratic

16
Example 3
  • m hidden units, m - 1 inputs X, one output y.
  • h1 = X1, h2 = X2, ..., hm-1 = Xm-1, hm = 1
  • multiple regression (Examples 1-3 are sketched below)
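
The three examples differ only in which basis functions fill the design matrix. The sketch below fits Examples 1 and 2 with a closed-form least-squares solve, which reaches the same weights the delta rule converges to for these linear-in-the-weights models; Example 3 is the same construction with one column per input dimension plus a constant column. The toy data are illustrative.

    import numpy as np

    x = np.linspace(0.0, 3.0, 8)
    y = 0.5 * x**2 - x + 2.0                          # data generated by a quadratic

    H_linear = np.column_stack([np.ones_like(x), x])          # Example 1: h = [1, x]
    H_quad = np.column_stack([np.ones_like(x), x, x**2])      # Example 2: h = [1, x, x**2]

    w_linear, *_ = np.linalg.lstsq(H_linear, y, rcond=None)   # best straight line
    w_quad, *_ = np.linalg.lstsq(H_quad, y, rcond=None)       # recovers [2, -1, 0.5]
    print(w_linear)
    print(w_quad)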

17
Aside - unlearned basis functions
  • There is no learning of which functions are the
    most appropriate in these examples
  • The models that I'll be discussing (e.g., ALCOVE) are almost exclusively static models (no unsupervised learning).
  • This means that it's important to choose a good set of basis functions to begin with.
  • It also means that the learning is essentially single layer: FAST learning.

18
Example 4
  • The basis functions can also involve sigmoids
    (aka squashing or logistic functions)
  • This is a simple, single-layer neural network

19
Radial basis functions are a special class of
hidden unit functions
  • Their response decreases monotonically with
    distance from a central point
  • This functions like a receptive field for each
    hidden unit
  • Each hidden unit has a center
  • The center is the input pattern that maximally activates the unit.
  • Activity level decreases as the input pattern grows increasingly dissimilar to the hidden unit's center.

20
RBF Parameters
  • The locations of the centers for each unit (hj),
    the shape of the RBFs, and the width of the RBFs
    are all parameters of the model.
  • These parameters are fixed in advance if there is
    no unsupervised learning.
  • One or more of these parameters can change by
    using unsupervised learning methods.
  • The precise shape of the function depends on a distance/similarity metric: how close is the input pattern to the hidden unit's desired input pattern?

21
Distance/similarity can be measured in many ways
  • Euclidean
  • City-block
  • More generally: the Minkowski metric
  • Minkowski metric is readily extended to many
    input dimensions.

22
Minkowski metric
  • The Minkowski distance between input a and center hj is dj = (Σi |ai - hji|^r)^(1/r). When r = 1, city block; when r = 2, Euclidean.

23
Example of city block
  • For r = 1:
  • hj = (0, 0), a = (1, 1)
  • For i = 1: |0 - 1|^1 = 1
  • For i = 2: |0 - 1|^1 = 1
  • Add those two and raise to the power 1/1: 1 + 1 = 2

24
Example Euclidean distance
  • For r = 2:
  • hj = (0, 0), a = (1, 1)
  • For i = 1: |0 - 1|^2 = 1
  • For i = 2: |0 - 1|^2 = 1
  • Add those two and raise to the 1/2 power (square root): sqrt(1 + 1) ≈ 1.414 (both examples are verified in the sketch below)
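
A sketch of the Minkowski metric that both worked examples apply, using the slide's numbers (center hj at the origin, input a = (1, 1)):

    import numpy as np

    def minkowski(a, h, r):
        """Minkowski distance: (sum over i of |a_i - h_i|**r) ** (1/r)."""
        return np.sum(np.abs(a - h) ** r) ** (1.0 / r)

    h_j = np.array([0.0, 0.0])      # hidden unit's center
    a = np.array([1.0, 1.0])        # input pattern
    print(minkowski(a, h_j, r=1))   # city block: 1 + 1 = 2
    print(minkowski(a, h_j, r=2))   # Euclidean: sqrt(1 + 1) = 1.414...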

25
Turning distance into similarity
  • Desired properties
  • Similarity = f(distance).
  • Maximal similarity when distance = 0.
  • Function with these properties:
  • sij = e^(-c·dij)
  • When dij = 0, sij = 1.
  • When dij = 1 and c = 1, sij = e^(-1) = 1/e.
  • This function defines the generalization gradient.
  • c defines the width of the receptive field (sketched below).
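
The exponential mapping above, written out; a larger c shrinks similarity faster with distance, i.e. a narrower receptive field. The particular c values are illustrative.

    import numpy as np

    def similarity(d, c=1.0):
        """Turn distance into similarity: s = exp(-c * d)."""
        return np.exp(-c * d)

    d = np.array([0.0, 0.5, 1.0, 2.0])
    print(similarity(d, c=1.0))   # 1.0 at d = 0, 1/e at d = 1, ...
    print(similarity(d, c=4.0))   # same shape, much narrower receptive field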

26
Shape of receptive fields
27
RBF issues - conceptual
  • RBF networks pave the representational space
    with a set of hidden units/receptive fields, each
    of which is maximally active for a particular
    input pattern.
  • When the receptive fields overlap, you have a
    distributed representation (any given unit
    participates in representing multiple input
    patterns)
  • When the receptive fields do not overlap, you
    have a highly localized representation
  • The degree of overlap is a modifiable parameter
    and allows you to decrease the localization of
    your hidden unit representations

28
Catastrophic retroactive interference
  • Networks with sigmoids divide up representational
    space into halves
  • This produces a highly distributed representation
    in which each unit participates in most internal
    representations
  • RBFs have more localized receptive fields.
  • Interference occurs only when new patterns are
    very similar to ones with which the network has
    already been trained.

29
Design issue: Where to put the RBF centers?
  • Could be random or evenly distributed.
  • Some models (e.g., ALCOVE) place the RBF centers
    where the training exemplars are.

30
Design issue: How to move the centers?
  • Vector quantization approaches
  • Example: winner-take-all
  • The closest hidden unit's center moves toward the input pattern (sketched below).
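
A sketch of the winner-take-all idea: for each input pattern, only the closest center is nudged toward it. The learning rate, epoch count, and toy two-cluster data are illustrative.

    import numpy as np

    def winner_take_all(centers, inputs, lr=0.1, epochs=20):
        """Move only the closest center a small step toward each input pattern."""
        centers = centers.copy()
        for _ in range(epochs):
            for x in inputs:
                win = np.argmin(np.linalg.norm(centers - x, axis=1))  # winning hidden unit
                centers[win] += lr * (x - centers[win])               # move it toward x
        return centers

    rng = np.random.default_rng(0)
    inputs = np.vstack([rng.normal(0.0, 0.1, (20, 2)),      # cluster near (0, 0)
                        rng.normal(3.0, 0.1, (20, 2))])     # cluster near (3, 3)
    centers = inputs[[0, -1]].copy()                        # start centers at two patterns
    print(winner_take_all(centers, inputs))                 # one center settles in each cluster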

31
Design issue: How to determine the width of the receptive fields?
  • Usually ad hoc.
  • Commonly, chosen based on averaged distance
    between training exemplars.
  • Some learning algorithms have been proposed (see Moody & Darken's work).

32
Comparison of RBF to nearest neighbor methods
  • Nearest neighbor methods are exemplar-based
  • RBFs are prototype-based.
  • The network doesn't require storage of every training exemplar; it summarizes the data.
  • RBF and nearest neighbor are nearly identical
    when RBFs are located at each and every training
    exemplar.

33
Comparison of RBF to MLP
  • Behavior of hidden units
  • MLP: hidden units use a weighted linear summation of the input, transformed by a sigmoid (though it can be Gaussian).
  • RBF: hidden units use the distance to a prototype vector, followed by transformation by a localized function.
  • Local vs. distributed
  • MLP: hidden units form a strongly distributed representation (exception: Gaussian activation functions).
  • RBF: hidden unit representations are more localized.
  • Learning
  • MLP: all of the parameters (weights) are determined at the same time as part of a single global training strategy.
  • RBF: training of the hidden unit representations is decoupled from that of the output units.

34
Advantages of RBFs
  • Learn quickly
  • Can use unsupervised learning to learn the hidden
    unit representations
  • Are not subject to catastrophic retroactive
    interference
  • Are not overly sensitive to linear boundaries
    (Kruschke, 1993)
  • Example: the XOR problem
  • The receptive-field notion is more biological.

35
Disadvantages of RBFs
  • Hidden unit representations are general, not specific to the problem to be learned.
  • This makes it difficult to map similar input representations to different outputs.
  • May not quite achieve the accuracy of a backprop
    network, but it gets close quickly!
  • Extrapolation!

36
Kruschke's ALCOVE model
  • Attention Learning COVEring map
  • A connectionist implementation of Nosofsky's GCM
  • Category learning model
  • RBF network that uses the Minkowski metric
  • Includes a c parameter, which he calls specificity
  • This parameter determines the fixed width of the
    RBFs

37
Attention learning
  • Includes attentional weights and attention
    learning to determine those attentional weights
  • Attentional weights are learned via
    backpropagation
  • Makes for slow learning of attention weights
  • More recent Kruschke models acquire attention weights without using backprop.
  • Attentional weights serve to stretch the representational space (a hedged sketch follows below).
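
A hedged sketch of how attention weights can stretch the space, in the spirit of ALCOVE's attention-weighted (city-block) distance; the exact equations and the attention-learning rule are in Kruschke's papers, and the numbers here are only illustrative.

    import numpy as np

    def attention_distance(x, center, alpha, r=1):
        """Attention-weighted Minkowski distance; alpha stretches the attended dimensions."""
        return np.sum(alpha * np.abs(x - center) ** r) ** (1.0 / r)

    def hidden_activation(x, center, alpha, c=1.0):
        """Exponential receptive field around the center, with specificity c."""
        return np.exp(-c * attention_distance(x, center, alpha))

    x = np.array([0.2, 0.9])
    center = np.array([0.0, 0.0])
    equal_attention = np.array([0.5, 0.5])
    only_dim_1 = np.array([1.0, 0.0])                 # attend only to the first dimension
    print(hidden_activation(x, center, equal_attention))
    print(hidden_activation(x, center, only_dim_1))   # the second dimension no longer matters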

38
Other applications of RBFs to psychology
  • Function learning
  • DeLosh, Busemeyer, & McDaniel's EXAM (extrapolation-association model)
  • Object recognition
  • Edelman's model
  • Recognizing handwriting
  • Lee, Yuchun (1991).
  • Compares k nearest-neighbor, radial-basis
    function, and backpropagation neural networks.
  • Sonar discrimination by dolphins
  • Au, Andersen, Rasmussen, & Roitblat (1995)