1
Learning: Nearest Neighbor, Perceptrons, Neural Nets
  • Artificial Intelligence
  • CSPP 56553
  • February 4, 2004

2
Nearest Neighbor Example II
  • Credit Rating
  • Classifier: Good / Poor
  • Features:
  • L: late payments/yr
  • R: Income/Expenses

Name L R G/P
A 0 1.2 G
B 25 0.4 P
C 5 0.7 G
D 20 0.8 P
E 30 0.85 P
F 11 1.2 G
G 7 1.15 G
H 15 0.8 P
3
Nearest Neighbor Example II
Name L R G/P
A 0 1.2 G
B 25 0.4 P
C 5 0.7 G
D 20 0.8 P
E 30 0.85 P
F 11 1.2 G
G 7 1.15 G
H 15 0.8 P

[Figure: the eight labeled instances plotted in the L-R plane (L on the horizontal axis, 0-30; R on the vertical axis, up to about 1.2)]
4
Nearest Neighbor Example II
Name L R G/P
I 6 1.15 ?
J 22 0.45 ?
K 15 1.2 ?

[Figure: the new instances I, J, K plotted among the labeled training instances in the L-R plane]

Distance measure: sqrt((L1 - L2)^2 + (sqrt(10)(R1 - R2))^2), a scaled distance (see the sketch after this slide).
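A minimal sketch of this nearest-neighbor classification with the scaled distance above, using the training table from the previous slide (function names are illustrative, not from the slides):

```python
import math

# Training data from the example: (name, L, R, class)
train = [("A", 0, 1.2, "G"), ("B", 25, 0.4, "P"), ("C", 5, 0.7, "G"),
         ("D", 20, 0.8, "P"), ("E", 30, 0.85, "P"), ("F", 11, 1.2, "G"),
         ("G", 7, 1.15, "G"), ("H", 15, 0.8, "P")]

def scaled_distance(l1, r1, l2, r2):
    # Scaled distance: sqrt((L1-L2)^2 + (sqrt(10)*(R1-R2))^2)
    return math.sqrt((l1 - l2) ** 2 + (math.sqrt(10) * (r1 - r2)) ** 2)

def classify(l, r):
    # Predict the class of the closest training instance (O(n) scan)
    nearest = min(train, key=lambda t: scaled_distance(l, r, t[1], t[2]))
    return nearest[3]

# New instances I, J, K from this slide
for name, l, r in [("I", 6, 1.15), ("J", 22, 0.45), ("K", 15, 1.2)]:
    print(name, classify(l, r))
```

The sqrt(10) factor is presumably there so that differences in R (which spans roughly 0.4-1.2) and in L (0-30) contribute on a more comparable scale.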
5
Nearest Neighbor Issues
  • Prediction can be expensive if there are many features
  • Affected by classification and feature noise
  • One entry can change the prediction
  • Definition of the distance metric
  • How to combine different features
  • Different types and ranges of values
  • Sensitive to feature selection

6
Efficient Implementations
  • Classification cost:
  • Find the nearest neighbor: O(n)
  • Compute the distance between the unknown and all instances
  • Compare the distances
  • Problematic for large data sets
  • Alternative:
  • Use binary search (e.g., k-d trees) to reduce to O(log n)

7
Efficient Implementation: K-D Trees
  • Divide instances into sets based on features
  • Binary branching: e.g., feature > value
  • 2^d leaves with a split path of depth d; with n ≈ 2^d leaves,
  • d = O(log n)
  • To split cases into sets,
  • If there is one element in the set, stop
  • Otherwise pick a feature to split on
  • Find average position of two middle objects on
    that dimension
  • Split remaining objects based on average position
  • Recursively split the subsets (see the sketch below)
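A rough sketch of the splitting procedure above, under one common assumption the slide leaves open: the feature to split on is simply cycled by depth.

```python
def build_kd(points, depth=0):
    """points: list of (feature_vector, label) pairs; returns a nested dict."""
    if len(points) <= 1:
        return {"leaf": points}                      # one element in the set: stop
    d = depth % len(points[0][0])                    # pick a feature to split on (cycled here)
    points = sorted(points, key=lambda p: p[0][d])
    mid = len(points) // 2
    # average position of the two middle objects on that dimension
    split = (points[mid - 1][0][d] + points[mid][0][d]) / 2.0
    left = [p for p in points if p[0][d] <= split]
    right = [p for p in points if p[0][d] > split]
    if not left or not right:                        # all values equal: cannot split further
        return {"leaf": points}
    return {"dim": d, "split": split,
            "left": build_kd(left, depth + 1),       # recursively split the subsets
            "right": build_kd(right, depth + 1)}

# Example with a few of the credit instances, features (L, R):
tree = build_kd([((0, 1.2), "G"), ((25, 0.4), "P"), ((5, 0.7), "G"), ((20, 0.8), "P")])
```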

8
K-D Trees: Classification
[Figure: k-d tree for the credit data, a sequence of yes/no feature tests whose leaves are labeled Good or Poor]
9
Efficient Implementation: Parallel Hardware
  • Classification cost:
  • Distance computations: constant time with O(n) processors
  • Cost of finding the closest:
  • Compute pairwise minimums, successively
  • O(log n) time (see the sketch below)
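A sequential simulation of the pairwise-minimum idea: each round halves the number of candidates, which is why O(n) processors would finish in O(log n) time.

```python
def pairwise_min(values):
    # Simulate the parallel reduction: each round, adjacent pairs are
    # compared "simultaneously"; after O(log n) rounds one minimum remains.
    while len(values) > 1:
        values = [min(values[i], values[i + 1]) if i + 1 < len(values) else values[i]
                  for i in range(0, len(values), 2)]
    return values[0]

print(pairwise_min([4.2, 1.5, 3.3, 0.9, 2.7]))  # 0.9
```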

10
Nearest Neighbor Analysis
  • Issue
  • What features should we use?
  • E.g., credit rating has many possible features
  • Tax bracket, debt burden, retirement savings, etc.
  • Nearest neighbor uses ALL
  • Irrelevant feature(s) could mislead
  • Fundamental problem with nearest neighbor

11
Nearest Neighbor Advantages
  • Fast training
  • Just record the (feature vector, output value) set
  • Can model wide variety of functions
  • Complex decision boundaries
  • Weak inductive bias
  • Very generally applicable

12
Summary: Nearest Neighbor
  • Nearest neighbor
  • Training: record input vectors and output values
  • Prediction: output of the closest training instance to the new data
  • Efficient implementations
  • Pros: fast training, very general, little bias
  • Cons: distance metric (scaling), sensitivity to noise and extraneous features

13
Learning: Perceptrons
  • Artificial Intelligence
  • CSPP 56553
  • February 4, 2004

14
Agenda
  • Neural Networks
  • Biological analogy
  • Perceptrons: single-layer networks
  • Perceptron training
  • Perceptron convergence theorem
  • Perceptron limitations
  • Conclusions

15
Neurons: The Concept
[Figure: a biological neuron, with dendrites, cell body, nucleus, and axon]
Neurons receive inputs from other neurons (via synapses). When the input
exceeds a threshold, the neuron fires and sends output along its axon to
other neurons. Brain: ~10^11 neurons, ~10^16 synapses.
16
Artificial Neural Nets
  • Simulated Neuron
  • Node connected to other nodes via links
  • Links: axon + synapse -> link
  • Links associated with weight (like synapse)
  • Multiplied by output of node
  • Node combines input via activation function
  • E.g., the sum of weighted inputs passed through a threshold
  • Simpler than real neuronal processes

17
Artificial Neural Net
[Figure: inputs x, each multiplied by a weight w, combined by a sum and passed through a threshold]
18
Perceptrons
  • Single neuron-like element
  • Binary inputs
  • Binary outputs
  • Output 1 if the weighted sum of inputs > threshold

19
Perceptron Structure
[Figure: perceptron with inputs x0 = 1, x1, x2, x3, ..., xn, weights w0, w1, ..., wn, and output y; the constant input x0 = 1 with weight w0 compensates for the threshold]
20
Perceptron Example
  • Logical-OR: linearly separable
  • (0,0) -> 0; (0,1) -> 1; (1,0) -> 1; (1,1) -> 1

[Figure: OR plotted in the x1-x2 plane; a single line separates the positive (+) points from (0,0)]
21
Perceptron Convergence Procedure
  • Straightforward training procedure
  • Learns linearly separable functions
  • Until the perceptron yields the correct output for all samples:
  • If the perceptron is correct, do nothing
  • If the perceptron is wrong:
  • If it incorrectly says yes,
  • subtract the input vector from the weight vector
  • Otherwise, add the input vector to the weight vector

22
Perceptron Convergence Example
  • LOGICAL-OR
  • Sample x0 x1 x2 Desired Output
  • 1 1 0 0 0
  • 2 1 0 1 1
  • 3 1 1 0 1
  • 4 1 1 1 1
  • Initial w = (0,0,0); after S2: w = w + s2 = (1,0,1)
  • Pass 2: S1: w = w - s1 = (0,0,1); S3: w = w + s3 = (1,1,1)
  • Pass 3: S1: w = w - s1 = (0,1,1) (reproduced by the sketch below)
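A minimal sketch of the procedure on the Logical-OR samples above; with the "fires if the weighted sum is greater than 0" convention assumed, it reproduces the weight sequence shown on this slide.

```python
samples = [((1, 0, 0), 0), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 1)]
w = [0, 0, 0]                       # initial weights (w0, w1, w2)

def output(w, x):
    # Fires (1) if the weighted sum exceeds the threshold of 0
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

changed = True
while changed:                      # until correct on all samples
    changed = False
    for x, desired in samples:
        if output(w, x) == desired:
            continue                # correct: do nothing
        delta = 1 if desired == 1 else -1   # add input if it wrongly said no, subtract if it wrongly said yes
        w = [wi + delta * xi for wi, xi in zip(w, x)]
        changed = True
        print("w =", w)             # prints (1,0,1), (0,0,1), (1,1,1), (0,1,1)
```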

23
Perceptron Convergence Theorem
  • If there exists a (unit) weight vector v and a margin d > 0 s.t. v·x >= d
    for all positive examples x,
  • Perceptron training will find a separating weight vector
  • Proof sketch (k = number of weight updates):
  • Each update adds such an x to w, so v·w grows by at least d: v·w >= kd
  • |w|^2 increases by at most |x|^2 <= R^2 in each iteration (updates occur
    only when w·x <= 0), so |w|^2 <= kR^2
  • Since v·w / |w| <= 1, we need kd / (R sqrt(k)) <= 1
  • Converges in k <= (R/d)^2 steps (spelled out below)
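Spelling out the bound under the assumptions above (v a unit vector with margin $\delta$ on the positive examples, $R = \max \|x\|$, $k$ the number of updates):

```latex
\[
v \cdot w_k \ge k\delta,
\qquad \|w_k\|^2 \le k R^2
\;\Longrightarrow\;
1 \ge \frac{v \cdot w_k}{\|w_k\|} \ge \frac{k\delta}{R\sqrt{k}}
\;\Longrightarrow\;
k \le \left(\frac{R}{\delta}\right)^{2}.
\]
```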

24
Perceptron Learning
  • Perceptrons learn linear decision boundaries
  • E.g., a separable arrangement of + and 0 points, but not XOR:

[Figure: + and 0 points in the x1-x2 plane; XOR's positive and negative points cannot be separated by a single line]

With inputs xi in {-1, 1} and weights w1, w2:
x1 = -1, x2 = -1:  w1x1 + w2x2 < 0  (so w1 + w2 > 0)
x1 =  1, x2 = -1:  w1x1 + w2x2 > 0  => implies w1 > 0
x1 = -1, x2 =  1:  w1x1 + w2x2 > 0  => implies w2 > 0
x1 =  1, x2 =  1:  then w1x1 + w2x2 = w1 + w2 > 0, but XOR should be false
25
Perceptron Example
  • Digit recognition
  • Assume a display with 8 lightable bars
  • Inputs: on/off; threshold output
  • 65 steps to recognize the digit 8

26
Perceptron Summary
  • Motivated by neuron activation
  • Simple training procedure
  • Guaranteed to converge
  • IF linearly separable

27
Neural Nets
  • Multi-layer perceptrons
  • Inputs: real-valued
  • Intermediate "hidden" nodes
  • Output(s): one (or more) discrete-valued

[Figure: feed-forward network with inputs X1-X4, two hidden layers, and outputs Y1, Y2]
28
Neural Nets
  • Pro: more general than perceptrons
  • Not restricted to linear discriminants
  • Multiple outputs: one classification each
  • Con: no simple, guaranteed training procedure
  • Use greedy, hill-climbing procedure to train
  • Gradient descent, Backpropagation

29
Solving the XOR Problem
Network topology: 2 hidden nodes (o1, o2), 1 output (y)

[Figure: inputs x1, x2 feed hidden nodes o1 and o2 (weights w11, w21, w12, w22), which feed the output y (weights w13, w23); each node also has a bias input of -1 weighted by w01, w02, or w03]

Desired behavior:
x1 x2 o1 o2 y
 0  0  0  0 0
 1  0  0  1 1
 0  1  0  1 1
 1  1  1  1 0

Weights: w11 = w12 = 1, w21 = w22 = 1, w01 = 3/2, w02 = 1/2, w03 = 1/2, w13 = -1, w23 = 1
(checked in the sketch below)
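A quick check of these weights (helper names are illustrative); each node fires when its weighted input sum exceeds its w0 threshold, and the loop reproduces the desired-behavior table:

```python
def step(z):
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    # Hidden nodes: o1 computes AND (threshold 3/2), o2 computes OR (threshold 1/2)
    o1 = step(1 * x1 + 1 * x2 - 3 / 2)     # w11 = w21 = 1, w01 = 3/2
    o2 = step(1 * x1 + 1 * x2 - 1 / 2)     # w12 = w22 = 1, w02 = 1/2
    # Output node: y = o2 AND NOT o1, i.e. XOR (w13 = -1, w23 = 1, w03 = 1/2)
    y = step(-1 * o1 + 1 * o2 - 1 / 2)
    return o1, o2, y

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_net(x1, x2))
```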
30
Neural Net Applications
  • Speech recognition
  • Handwriting recognition
  • NETtalk: letter-to-sound rules
  • ALVINN: autonomous driving

31
ALVINN
  • Driving as a neural network
  • Inputs:
  • Image pixel intensities
  • I.e., lane lines
  • 5 hidden nodes
  • Outputs:
  • Steering actions
  • E.g., turn left/right and how far
  • Training:
  • Observe human behavior: sample images and steering

32
Backpropagation
  • Greedy, Hill-climbing procedure
  • Weights are parameters to change
  • The original hill-climbing changes one parameter per step
  • Slow
  • If the function is smooth, change all parameters per step
  • Gradient descent
  • Backpropagation Computes current output, works
    backward to correct error

33
Producing a Smooth Function
  • Key problem:
  • The pure step threshold is discontinuous
  • Not differentiable
  • Solution:
  • Sigmoid ("squashed S" function): the logistic function (see the sketch below)
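A small sketch of the logistic sigmoid together with the derivative identity noted on a later slide:

```python
import math

def sigmoid(z):
    # Logistic function: a smooth, differentiable "squashed S" threshold
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_deriv(z):
    # Derivative of the sigmoid: s'(z) = s(z) * (1 - s(z))
    s = sigmoid(z)
    return s * (1 - s)

print(sigmoid(0.0), sigmoid_deriv(0.0))   # 0.5 0.25
```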

34
Neural Net Training
  • Goal
  • Determine how to change weights to get correct
    output
  • Large change in weight to produce large reduction
    in error
  • Approach
  • Compute the actual output o
  • Compare it to the desired output d
  • Determine the effect of each weight w on the error (d - o)
  • Adjust weights

35
Neural Net Example
x^i: the i-th sample input vector; w: the weight vector;
y^i: the desired output for the i-th sample
Sum-of-squares error over the training samples:
E(w) = sum_i (y^i - o(x^i, w))^2,
where o(x^i, w) is the full expression of the output in terms of the inputs and weights
From the 6.034 notes (Lozano-Pérez)
36
Gradient Descent
  • Error: sum-of-squares error of the inputs with the current weights
  • Compute the rate of change of the error w.r.t. each weight
  • Which weights have the greatest effect on the error?
  • Effectively, the partial derivatives of the error w.r.t. the weights
  • These in turn depend on other weights => chain rule

37
Gradient Descent
  • E = G(w): error as a function of the weights
  • Find the rate of change of the error, dG/dw
  • Follow the steepest rate of change
  • Change the weights so that the error is minimized (illustrated below)

[Figure: error curve E = G(w) over the weights w, with steps from w0 to w1 descending toward a local minimum]
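A toy illustration of following the steepest rate of change; the one-dimensional error function G here is made up purely for illustration:

```python
def G(w):                 # toy error function with a minimum at w = 2
    return (w - 2.0) ** 2

def dG_dw(w):             # its rate of change
    return 2.0 * (w - 2.0)

w, r = 0.0, 0.1           # initial weight and rate parameter
for _ in range(50):
    w -= r * dG_dw(w)     # step against the gradient, reducing the error
print(w, G(w))            # w approaches 2, G(w) approaches 0
```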
38
Gradient of Error
Note: derivative of the sigmoid: ds(z1)/dz1 = s(z1)(1 - s(z1))
From the 6.034 notes (Lozano-Pérez)
39
From Effect to Update
  • Gradient computation
  • How each weight contributes to performance
  • To train
  • Need to determine how to CHANGE weight based on
    contribution to performance
  • Need to determine how MUCH change to make per
    iteration
  • Rate parameter r
  • Large enough to learn quickly
  • Small enough reach but not overshoot target values

40
Backpropagation Procedure
[Figure: nodes i -> j -> k along a path from the inputs toward the output]
  • Pick a rate parameter r
  • Until performance is good enough:
  • Do a forward computation to calculate the outputs
  • Compute Beta in the output node
  • Compute Beta in all other nodes
  • Compute the change for all weights (see the formulas sketched below)
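A standard sigmoid-unit formulation of these updates (assumed here rather than copied from the slide), with o_i the output of node i, d the desired output, and r the rate parameter:

```latex
\begin{aligned}
\beta_z &= d - o_z && \text{for the output node } z \\
\beta_j &= \sum_k w_{j \to k}\, o_k (1 - o_k)\, \beta_k && \text{for every other node } j \\
\Delta w_{i \to j} &= r\, o_i\, o_j (1 - o_j)\, \beta_j
\end{aligned}
```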

41
Backprop Example
Forward propagation: compute the z_i and y_i given the inputs x_k and weights w_l
42
Backpropagation Observations
  • Procedure is (relatively) efficient
  • All computations are local
  • Use inputs and outputs of current node
  • What is good enough?
  • Rarely reach target (0 or 1) outputs
  • Typically, train until within 0.1 of target

43
Neural Net Summary
  • Training
  • Backpropagation procedure
  • Gradient descent strategy (usual problems)
  • Prediction
  • Compute outputs from the input vector and the weights
  • Pros: very general, fast prediction
  • Cons: training can be VERY slow (1000s of epochs); overfitting

44
Training Strategies
  • Online training
  • Update weights after each sample
  • Offline (batch training)
  • Compute error over all samples
  • Then update weights
  • Online training is noisy
  • Sensitive to individual instances
  • However, it may escape local minima (see the sketch below)
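A sketch of the two update schedules, assuming some per-sample gradient(w, sample) helper (hypothetical):

```python
def online_epoch(w, samples, gradient, r):
    # Online training: update the weights after each sample
    for s in samples:
        w = [wi - r * gi for wi, gi in zip(w, gradient(w, s))]
    return w

def batch_epoch(w, samples, gradient, r):
    # Offline (batch) training: accumulate the gradient over all samples,
    # then make a single weight update
    total = [0.0] * len(w)
    for s in samples:
        total = [ti + gi for ti, gi in zip(total, gradient(w, s))]
    return [wi - r * ti for wi, ti in zip(w, total)]

# Example with a 1-D squared-error gradient (made up for illustration):
grad = lambda w, s: [2 * (w[0] - s)]
print(online_epoch([0.0], [1.0, 2.0, 3.0], grad, 0.1))
print(batch_epoch([0.0], [1.0, 2.0, 3.0], grad, 0.1))
```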

45
Training Strategy
  • To avoid overfitting:
  • Split the data into training, validation, and test sets
  • Also, avoid excess weights (fewer weights than samples)
  • Initialize with small random weights
  • Small changes have a noticeable effect
  • Use offline training
  • Train until the validation-set error reaches its minimum
  • Then evaluate on the test set
  • No more weight changes after that

46
Classification
  • Neural networks are best for classification tasks
  • Single output -> binary classifier
  • Multiple outputs -> multiway classification
  • Applied successfully to learning pronunciation
  • The sigmoid pushes outputs toward binary classification
  • Not good for regression

47
Neural Net Example
  • NETtalk: letter-to-sound by neural net
  • Inputs:
  • Need context to pronounce
  • 7-letter window: predict the sound of the middle letter
  • 29 possible characters: alphabet + space + ',' + '.'
  • 7 x 29 = 203 inputs
  • 80 hidden nodes
  • Output: generate 60 phones
  • Output nodes map to 26 units: 21 articulatory, 5 stress/syllable
  • Vector quantization of acoustic space

48
Neural Net Example: NETtalk
  • Learning to talk:
  • 5 iterations over 1024 training words: word boundaries and stress
  • 10 iterations: intelligible
  • 400 new test words: 80% correct
  • Not as good as DECtalk, but automatic

49
Neural Net Conclusions
  • Simulation based on neurons in the brain
  • Perceptrons (single neuron):
  • Guaranteed to find a linear discriminant
  • IF one exists -> problem: XOR
  • Neural nets (multi-layer perceptrons):
  • Very general
  • Backpropagation training procedure
  • Gradient descent: local minima and overfitting issues