Learning: Nearest Neighbor, Perceptrons

Slides: 50

Provided by: peopleCs

Learn more at: http://people.cs.uchicago.edu

Transcript and Presenter's Notes

Title: Learning: Nearest Neighbor, Perceptrons

1
Learning: Nearest Neighbor, Perceptrons & Neural Nets

Artificial Intelligence
CSPP 56553
February 4, 2004

2
Nearest Neighbor Example II

Credit Rating
Classifier: Good / Poor
Features:
L: late payments/yr
R: Income/Expenses

Name L R G/P
A 0 1.2 G
B 25 0.4 P
C 5 0.7 G
D 20 0.8 P
E 30 0.85 P
F 11 1.2 G
G 7 1.15 G
H 15 0.8 P
3
Nearest Neighbor Example II
[Figure: scatter plot of the instances A-H from the table on the previous slide, with L (ticks at 10, 20, 30) on the horizontal axis and R (tick at 1) on the vertical axis]
4
Nearest Neighbor Example II
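Name L R G/P
I 6 1.15 G
J 22 0.45 P
K 15 1.2 ??

Distance Measure: scaled distance
$d = \sqrt{(L_1 - L_2)^2 + \sqrt{10}\,(R_1 - R_2)^2}$

[Figure: scatter plot of A-H with the new points I, J, K added; L (ticks at 10, 20, 30) on the horizontal axis, R (tick at 1) on the vertical axis]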
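
The lookup on this slide can be sketched in a few lines of Python. This is a minimal sketch, not the lecture's own code: the function names are hypothetical, and the placement of the sqrt(10) factor (weighting the squared R difference) is an assumed reading of the slide's distance measure.

```python
import math

# Training instances from the credit-rating table: name -> (L, R, label).
TRAIN = {
    "A": (0, 1.2, "G"), "B": (25, 0.4, "P"), "C": (5, 0.7, "G"),
    "D": (20, 0.8, "P"), "E": (30, 0.85, "P"), "F": (11, 1.2, "G"),
    "G": (7, 1.15, "G"), "H": (15, 0.8, "P"),
}

def scaled_distance(p, q, r_scale=math.sqrt(10)):
    """Distance between (L, R) pairs; r_scale weights the squared R difference.

    The sqrt(10) default is an assumed reading of the slide's formula.
    """
    return math.sqrt((p[0] - q[0]) ** 2 + r_scale * (p[1] - q[1]) ** 2)

def nearest_neighbor(query, train=TRAIN):
    """Return (name, label) of the closest training instance via an O(n) scan."""
    name = min(train, key=lambda k: scaled_distance(query, train[k][:2]))
    return name, train[name][2]

# Classify the three new points I, J, K from the slide.
for point in [(6, 1.15), (22, 0.45), (15, 1.2)]:
    print(point, nearest_neighbor(point))
```

Changing `r_scale` changes how much the income/expense ratio counts relative to late payments, which is the scaling issue the distance measure is meant to address.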
5
Nearest Neighbor Issues

Prediction can be expensive if many features
Affected by classification, feature noise
One entry can change prediction
Definition of distance metric
How to combine different features
Different types, ranges of values
Sensitive to feature selection

6
Efficient Implementations

Classification cost:
Find nearest neighbor: O(n)
Compute distance between unknown and all instances
Compare distances
Problematic for large data sets
Alternative:
Use binary search to reduce to O(log n)

7
Efficient Implementation: K-D Trees

Divide instances into sets based on features
Binary branching, e.g. feature > value
2^d leaves with d-split path = n
d = O(log n)
To split cases into sets,
If there is one element in the set, stop
Otherwise pick a feature to split on
Find average position of two middle objects on that dimension
Split remaining objects based on average position
Recursively split subsets
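
The splitting steps above can be sketched as follows. This is an illustrative sketch under stated assumptions: the slide only says "pick a feature", so cycling through the features by depth is an assumption, and `build_kdtree` and `classify` are hypothetical names. The descent follows the tests to a single leaf without backtracking, so it returns a nearby instance rather than a guaranteed nearest one.

```python
# Credit-rating instances as ((L, R), label).
POINTS = [
    ((0, 1.2), "G"), ((25, 0.4), "P"), ((5, 0.7), "G"), ((20, 0.8), "P"),
    ((30, 0.85), "P"), ((11, 1.2), "G"), ((7, 1.15), "G"), ((15, 0.8), "P"),
]

def build_kdtree(points, depth=0):
    """Recursively split points until each set holds one element."""
    if len(points) <= 1:
        return {"leaf": points}
    dim = depth % len(points[0][0])  # assumed rule: cycle through features
    pts = sorted(points, key=lambda p: p[0][dim])
    mid = len(pts) // 2
    # Average position of the two middle objects on this dimension.
    split = (pts[mid - 1][0][dim] + pts[mid][0][dim]) / 2
    left = [p for p in pts if p[0][dim] <= split]
    right = [p for p in pts if p[0][dim] > split]
    if not right:  # all remaining values tie on this dimension
        return {"leaf": pts}
    return {"dim": dim, "split": split,
            "left": build_kdtree(left, depth + 1),
            "right": build_kdtree(right, depth + 1)}

def classify(tree, x):
    """Follow the threshold tests down to a leaf and return its label."""
    while "leaf" not in tree:
        branch = "left" if x[tree["dim"]] <= tree["split"] else "right"
        tree = tree[branch]
    return tree["leaf"][0][1]

tree = build_kdtree(POINTS)
print(classify(tree, (6, 1.15)), classify(tree, (22, 0.45)))
```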

8
K-D Trees: Classification
[Figure: k-d tree for the credit data; internal nodes are threshold tests with Yes/No branches, and leaves are labeled Good or Poor]
9
Efficient Implementation: Parallel Hardware

Classification cost
distance computations
Const time if O(n) processors
Cost of finding closest
Compute pairwise minimum, successively
O(log n) time
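
The successive pairwise minimum can be simulated sequentially to see where the O(log n) comes from. This is only a sketch: `tournament_min` is a hypothetical name, each round stands in for one parallel step, and the input is assumed to be non-empty.

```python
def tournament_min(values):
    """Return (minimum, rounds) using successive pairwise minimums.

    Each pass of the loop models one parallel step in which every pair
    is compared by its own processor.
    """
    level = list(values)
    rounds = 0
    while len(level) > 1:
        nxt = [min(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # odd element out advances unchanged
            nxt.append(level[-1])
        level = nxt
        rounds += 1
    return level[0], rounds

print(tournament_min([5, 3, 8, 1, 9, 2, 7, 4]))
```

With 8 distances the list shrinks 8, 4, 2, 1, so three rounds suffice.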

10
Nearest Neighbor Analysis

Issue
What features should we use?
E.g. credit rating: many possible features
Tax bracket, debt burden, retirement savings, etc.
Nearest neighbor uses ALL
Irrelevant feature(s) could mislead
Fundamental problem with nearest neighbor

11
Nearest Neighbor Advantages

Fast training
Just record feature vector - output value set
Can model wide variety of functions
Complex decision boundaries
Weak inductive bias
Very generally applicable

12
Summary: Nearest Neighbor

Nearest neighbor
Training: record input vectors + output value
Prediction: closest training instance to new data
Efficient implementations
Pros: fast training, very general, little bias
Cons: distance metric (scaling), sensitivity to noise & extraneous features

13
Learning: Perceptrons

Artificial Intelligence
CSPP 56553
February 4, 2004

14
Agenda

Neural Networks
Biological analogy
Perceptrons: Single layer networks
Perceptron training
Perceptron convergence theorem
Perceptron limitations
Conclusions

15
Neurons: The Concept
[Figure: neuron with labeled dendrites, axon, nucleus, and cell body]
Neurons: Receive inputs from other neurons (via synapses)
When input exceeds threshold, "fires"
Sends output along axon to other neurons
Brain: 10^11 neurons, 10^16 synapses
16
Artificial Neural Nets

Simulated Neuron
Node connected to other nodes via links
Links: axon + synapse = link
Links associated with weight (like synapse)
Multiplied by output of node
Node combines input via activation function
E.g. sum of weighted inputs passed thru
threshold
Simpler than real neuronal processes

17
Artificial Neural Net
[Figure: inputs x, each multiplied by a weight w, feeding a Sum unit followed by a Threshold unit]
18
Perceptrons

Single neuron-like element
Binary inputs
Binary outputs
Weighted sum of inputs > threshold

19
Perceptron Structure
[Figure: perceptron with inputs x0 = 1, x1, x2, x3, ..., xn, weights w0, w1, w2, w3, ..., wn, and output y]
x0 w0 compensates for threshold
20
Perceptron Example

Logical-OR: Linearly separable
00 -> 0; 01 -> 1; 10 -> 1; 11 -> 1

[Figure: two plots of "or" in the (x1, x2) plane]
21
Perceptron Convergence Procedure

Straightforward training procedure
Learns linearly separable functions
Until perceptron yields correct output for all training examples:
If the perceptron is correct, do nothing
If the perceptron is wrong,
If it incorrectly says "yes",
Subtract input vector from weight vector
Otherwise, add input vector to weight vector
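
The procedure above can be sketched as runnable Python. Assumptions are labeled: the unit says "yes" when the weighted sum is strictly greater than 0, x0 = 1 is the constant input that absorbs the threshold, and `max_passes` is a hypothetical safety cap so the loop stops on data that is not linearly separable.

```python
def predict(w, x):
    """Threshold unit: output 1 ("yes") when w . x > 0 (assumed convention)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

def train_perceptron(samples, max_passes=100):
    """Repeat passes over the samples until every output is correct."""
    w = [0] * len(samples[0][0])
    for _ in range(max_passes):
        changed = False
        for x, desired in samples:
            out = predict(w, x)
            if out == desired:
                continue                    # correct: do nothing
            sign = -1 if out == 1 else 1    # wrong "yes": subtract, else add
            w = [wi + sign * xi for wi, xi in zip(w, x)]
            changed = True
        if not changed:
            return w
    raise RuntimeError("no convergence; data may not be linearly separable")

# Logical OR with inputs (x0, x1, x2) and x0 = 1.
OR_SAMPLES = [((1, 0, 0), 0), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 1)]
print(train_perceptron(OR_SAMPLES))  # -> [0, 1, 1]
```

Starting from zero weights, the run ends with w = (0, 1, 1) after three passes that change the weights and one clean pass.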

22
Perceptron Convergence Example

LOGICAL-OR
Sample x0 x1 x2 Desired Output
1 1 0 0 0
2 1 0 1 1
3 1 1 0 1
4 1 1 1 1
Initial: w = (0 0 0); after S2, w = w + s2 = (1 0 1)
Pass 2: S1: w = w - s1 = (0 0 1); S3: w = w + s3 = (1 1 1)
Pass 3: S1: w = w - s1 = (0 1 1)

23
Perceptron Convergence Theorem

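If there exists a vector W s.t. [condition not preserved in transcript]
Perceptron training will find it
Assume [condition not preserved in transcript] for all +ive examples x
$\|w\|^2$ increases by at most $\|x\|^2$ in each iteration
$\|w + x\|^2 \le \|w\|^2 + \|x\|^2 \le k\,\|x\|^2$
$v \cdot w / \|w\| \ge$ [expression not preserved in transcript] $\le 1$

Converges in $k \le O(\cdot)$ steps [argument of O not preserved in transcript]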
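
The surviving fragments are consistent with the standard textbook convergence argument, sketched here for reference. The symbols are assumptions filled in from that argument, not recovered from the slide: $v$ is a unit-length separating vector, $\delta$ is its margin, $R$ bounds $\|x\|$, and negative examples are taken as negated so that every update adds $x$.

```latex
\begin{align*}
&\text{Assume } \|v\| = 1,\quad v \cdot x \ge \delta > 0,\quad \|x\| \le R
  \text{ for all examples } x. \\
&\text{An update happens only when } w \cdot x \le 0 \text{, so} \\
&\|w + x\|^2 = \|w\|^2 + 2\,w \cdot x + \|x\|^2 \le \|w\|^2 + R^2
  \;\Rightarrow\; \|w\|^2 \le k R^2 \text{ after } k \text{ updates from } w = 0. \\
&v \cdot (w + x) \ge v \cdot w + \delta
  \;\Rightarrow\; v \cdot w \ge k\delta. \\
&\frac{k\delta}{\sqrt{k}\,R} \le \frac{v \cdot w}{\|w\|} \le 1
  \;\Rightarrow\; k \le \frac{R^2}{\delta^2} = O\!\left(\frac{1}{\delta^2}\right).
\end{align*}
```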

24
Perceptron Learning

Perceptrons learn linear decision boundaries
E.g. [Figure: a linearly separable function in the (x1, x2) plane]
But not xor
[Figure: xor in the (x1, x2) plane]

X1 X2
-1 -1: w1x1 + w2x2 < 0
 1 -1: w1x1 + w2x2 > 0 => implies w1 > 0
 1  1: w1x1 + w2x2 > 0 => but should be false
-1  1: w1x1 + w2x2 > 0 => implies w2 > 0
25
Perceptron Example

Digit recognition
Assume display: 8 lightable bars
Inputs: on/off + threshold
65 steps to recognize "8"

26
Perceptron Summary

Motivated by neuron activation
Simple training procedure
Guaranteed to converge
IF linearly separable

27
Neural Nets

Multi-layer perceptrons
Inputs: real-valued
Intermediate "hidden" nodes
Output(s): one (or more) discrete-valued

[Figure: network with inputs X1-X4, two hidden layers, and outputs Y1, Y2]
28
Neural Nets

Pro: More general than perceptrons
Not restricted to linear discriminants
Multiple outputs: one classification each
Con: No simple, guaranteed training procedure
Use greedy, hill-climbing procedure to train
"Gradient descent", "Backpropagation"

29
Solving the XOR Problem
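Network Topology: 2 hidden nodes, 1 output
[Figure: network with inputs x1, x2, hidden nodes o1, o2, and output y; weights w11, w12, w21, w22, w13, w23, plus a -1 bias input on each node weighted by w01, w02, w03]

Desired behavior:
x1 x2 o1 o2 y
0 0 0 0 0
1 0 0 1 1
0 1 0 1 1
1 1 1 1 0

Weights: w11 = w12 = 1; w21 = w22 = 1; w01 = 3/2; w02 = 1/2; w03 = 1/2; w13 = -1; w23 = 1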
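
The listed weights can be checked against the desired-behavior table with a short sketch. Two assumptions are made here: each node fires when its weighted sum, including the -1 bias input times w0j, is strictly greater than 0, and w1j / w2j are taken as the weights from a node's first and second input. Since w11 = w12 = w21 = w22 = 1, the exact wiring of those four does not affect the result.

```python
def step(z):
    """Hard threshold: 1 when z > 0, else 0 (assumed convention)."""
    return 1 if z > 0 else 0

# Weights as listed on the slide.
W = {"w11": 1, "w12": 1, "w21": 1, "w22": 1,
     "w01": 1.5, "w02": 0.5, "w03": 0.5, "w13": -1, "w23": 1}

def xor_net(x1, x2, w=W):
    """Return (o1, o2, y) for the 2-hidden-node network."""
    o1 = step(w["w11"] * x1 + w["w21"] * x2 - w["w01"])  # behaves like AND
    o2 = step(w["w12"] * x1 + w["w22"] * x2 - w["w02"])  # behaves like OR
    y = step(w["w13"] * o1 + w["w23"] * o2 - w["w03"])   # OR and not AND
    return o1, o2, y

for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(x1, x2, *xor_net(x1, x2))
```

The hidden nodes compute AND and OR, and the output fires for "OR but not AND", which is XOR.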
30
Neural Net Applications

Speech recognition
Handwriting recognition
NETtalk: Letter-to-sound rules
ALVINN: Autonomous driving

31
ALVINN

Driving as a neural network
Inputs:
Image pixel intensities
I.e. lane lines
5 Hidden nodes
Outputs:
Steering actions
E.g. turn left/right; how far
Training:
Observe human behavior: sample images, steering

32
Backpropagation

Greedy, Hill-climbing procedure
Weights are parameters to change
Original hill-climb changes one parameter/step
Slow
If smooth function, change all parameters/step
Gradient descent
Backpropagation: Computes current output, works backward to correct error

33
Producing a Smooth Function

Key problem
Pure step threshold is discontinuous
Not differentiable
Solution
Sigmoid (squashed 's' function): Logistic fn

34
Neural Net Training

Goal
Determine how to change weights to get correct
output
Large change in weight to produce large reduction
in error
Approach
Compute actual output: o
Compare to desired output: d
Determine effect of each weight w on error = d - o
Adjust weights

35
Neural Net Example
x_i: ith sample input vector
w: weight vector
y_i: desired output for ith sample
Sum of squares error over training samples
[Equation not preserved in transcript]
Full expression of output in terms of input and weights
[Equation not preserved in transcript]
From 6.034 notes, Lozano-Perez
36
Gradient Descent

Error: Sum of squares error of inputs with current weights
Compute rate of change of error wrt each weight
Which weights have greatest effect on error?
Effectively, partial derivatives of error wrt weights
In turn, depend on other weights => chain rule

37
Gradient Descent
E = G(w)
Error as function of weights
Find rate of change of error: dG/dw
Follow steepest rate of change
Change weights s.t. error is minimized

[Figure: error curve E = G(w) plotted against w, with successive weights w0, w1 and local minima marked]
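
Gradient descent on the error can be sketched for the simplest case. This is a minimal sketch under stated assumptions, not the lecture's network: a single sigmoid unit, a sum-of-squares error with a factor of 1/2 (a common convention the transcript does not confirm), and a learning rate and step count chosen only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def output(w, x):
    """Sigmoid of the weighted sum of inputs."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

def error(w, samples):
    """Sum-of-squares error E = 1/2 * sum (d - o)^2 (1/2 factor assumed)."""
    return 0.5 * sum((d - output(w, x)) ** 2 for x, d in samples)

def gradient(w, samples):
    """Partial derivatives dE/dw_j, via the chain rule through the sigmoid."""
    grad = [0.0] * len(w)
    for x, d in samples:
        o = output(w, x)
        delta = -(d - o) * o * (1 - o)  # dE/dz for this sample
        for j, xj in enumerate(x):
            grad[j] += delta * xj
    return grad

def gradient_descent(samples, rate=0.5, steps=2000):
    """Change all weights each step, moving against the gradient."""
    w = [0.0] * len(samples[0][0])
    for _ in range(steps):
        g = gradient(w, samples)
        w = [wi - rate * gi for wi, gi in zip(w, g)]
    return w

OR_SAMPLES = [((1, 0, 0), 0), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 1)]
w = gradient_descent(OR_SAMPLES)
print(error([0.0, 0.0, 0.0], OR_SAMPLES), error(w, OR_SAMPLES))
```

Unlike one-parameter-at-a-time hill climbing, every weight moves on each step, which is what the smooth error function makes possible.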
38
Gradient of Error
Note: Derivative of sigmoid: $\frac{d\,s(z_1)}{dz_1} = s(z_1)\,(1 - s(z_1))$
From 6.034 notes, Lozano-Perez
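
The derivative identity in the note can be checked numerically. This is a small illustrative sketch; the helper names and the central-difference step `h` are assumptions made for the check.

```python
import math

def sigmoid(z):
    """Logistic function s(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Closed form from the note: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def numeric_derivative(f, z, h=1e-6):
    """Central-difference estimate of df/dz."""
    return (f(z + h) - f(z - h)) / (2 * h)

for z in [-2.0, 0.0, 0.5, 3.0]:
    print(z, sigmoid_prime(z), numeric_derivative(sigmoid, z))
```

The slope is largest at z = 0, where it equals 0.25, and shrinks toward zero as the unit saturates in either direction.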
39
From Effect to Update