Title: Learning with Perceptrons and Neural Networks
1 Learning with Perceptrons and Neural Networks
- Artificial Intelligence
- CMSC 25000
- February 14, 2002
2 Agenda
- Neural Networks
- Biological analogy
- Perceptrons: Single-layer networks
- Perceptron training: Perceptron convergence theorem
- Perceptron limitations
- Neural Networks: Multilayer perceptrons
- Neural net training: Backpropagation
- Strengths & Limitations
- Conclusions
3 Neurons: The Concept
[Diagram: a neuron, with the dendrites, cell body, nucleus, and axon labeled]
- Neurons receive inputs from other neurons (via synapses)
- When input exceeds a threshold, the neuron fires
- It sends output along its axon to other neurons
- Brain: ~10^11 neurons, ~10^16 synapses
4 Artificial Neural Nets
- Simulated Neuron
- Node connected to other nodes via links
- Links play the role of axon + synapse
- Links associated with a weight (like a synapse)
- Weight is multiplied by the output of the source node
- Node combines its inputs via an activation function
- E.g., sum of weighted inputs passed through a threshold
- Simpler than real neuronal processes
5 Artificial Neural Net
[Diagram: inputs x1 ... xn, each multiplied by its weight w1 ... wn, summed, and passed through a threshold to give the output; a minimal code sketch follows]
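A minimal sketch of the weighted-sum-and-threshold unit pictured above; the function name and sample values are illustrative, not from the lecture:

```python
def threshold_unit(inputs, weights, threshold=0.0):
    """Fire (output 1) when the weighted sum of inputs exceeds the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > threshold else 0

# With equal weights and threshold 1.5, two binary inputs act like logical AND
print(threshold_unit([1, 1], [1.0, 1.0], threshold=1.5))  # 1
print(threshold_unit([1, 0], [1.0, 1.0], threshold=1.5))  # 0
```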
6 Perceptrons
- Single neuron-like element
- Binary inputs
- Binary outputs
- Weighted sum of inputs > threshold
- (Possibly logic box between inputs and weights)
7 Perceptron Structure
[Diagram: inputs x1, x2, x3, ..., xn with weights w1, w2, w3, ..., wn feeding a single unit that outputs y; an extra input x0 = -1 with weight w0 compensates for the threshold]
8 Perceptron Convergence Procedure
- Straightforward training procedure
- Learns linearly separable functions
- Until the perceptron yields the correct output for all samples:
- If the perceptron is correct, do nothing
- If the perceptron is wrong:
- If it incorrectly says "yes", subtract the input vector from the weight vector
- Otherwise, add the input vector to the weight vector
9 Perceptron Convergence Example
- LOGICAL-OR
- Sample: x1 x2 x3 -> Desired output
- 1: 0 0 1 -> 0
- 2: 0 1 1 -> 1
- 3: 1 0 1 -> 1
- 4: 1 1 1 -> 1
- Initial w = (0 0 0); after S2, w = w + s2 = (0 1 1)
- Pass 2: S1: w = w - s1 = (0 1 0); S3: w = w + s3 = (1 1 1)
- Pass 3: S1: w = w - s1 = (1 1 0)
- (A code check of this run appears below)
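A sketch of the convergence procedure run on the LOGICAL-OR data above, assuming the unit outputs 1 when w · x > 0 and that x3 = 1 serves as a bias input:

```python
samples = [((0, 0, 1), 0),   # sample 1
           ((0, 1, 1), 1),   # sample 2
           ((1, 0, 1), 1),   # sample 3
           ((1, 1, 1), 1)]   # sample 4

w = [0, 0, 0]
changed = True
while changed:                              # until correct on all samples
    changed = False
    for x, desired in samples:
        output = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0
        if output == desired:
            continue                        # correct: do nothing
        sign = 1 if desired == 1 else -1    # wrong "no": add input; wrong "yes": subtract
        w = [wi + sign * xi for wi, xi in zip(w, x)]
        changed = True

print(w)  # [1, 1, 0], matching the final weights in the trace above
```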
10 Perceptron Convergence Theorem
- If there exists a weight vector v that classifies all the examples correctly, perceptron training will find such a vector
- Sketch (assume |v| = 1 and inputs normalized so |x| <= 1):
- Assume v · x > δ for all positive examples x, for some margin δ > 0
- After k mistakes, w = x1 + x2 + ... + xk, so v · w > δk
- |w|^2 increases by at most 1 per mistake (|w + x|^2 <= |w|^2 + 1 on a mislabel), so |w|^2 <= k
- v · w / |w| > δk / sqrt(k) = δ sqrt(k), but v · w / |w| <= 1
- Converges in k < (1/δ)^2 steps
11 Perceptron Learning
- Perceptrons learn linear decision boundaries
- E.g., a line in the (x1, x2) plane separating the two classes
[Diagram: points in the x1-x2 plane split by a linear boundary. But not XOR:]
- x1 = -1, x2 = -1: need w1x1 + w2x2 < 0 (output false)
- x1 = 1, x2 = -1: need w1x1 + w2x2 > 0 => with the first constraint, implies w1 > 0
- x1 = -1, x2 = 1: need w1x1 + w2x2 > 0 => with the first constraint, implies w2 > 0
- x1 = 1, x2 = 1: then w1x1 + w2x2 = w1 + w2 > 0, so the unit says true, but XOR should be false
12 Neural Nets
- Multi-layer perceptrons
- Inputs: real-valued
- Intermediate "hidden" nodes
- Output(s): one (or more) discrete-valued
[Diagram: inputs X1-X4 feeding two layers of hidden nodes, which feed outputs Y1 and Y2]
13 Neural Nets
- Pro: More general than perceptrons
- Not restricted to linear discriminants
- Multiple outputs: one classification each
- Con: No simple, guaranteed training procedure
- Use a greedy, hill-climbing procedure to train
- Gradient descent, backpropagation
14 Solving the XOR Problem
[Diagram: network topology with 2 hidden nodes (o1, o2) and 1 output node (y); x1 and x2 feed both hidden nodes, the hidden nodes feed y, and each node has a bias input of -1 with weight w01, w02, or w03]
- Desired behavior:
- x1 x2 | o1 o2 | y
- 0  0  | 0  0  | 0
- 1  0  | 0  1  | 1
- 0  1  | 0  1  | 1
- 1  1  | 1  1  | 0
- Weights: w11 = w12 = 1, w21 = w22 = 1, w01 = 3/2, w02 = 1/2, w03 = 1/2, w13 = -1, w23 = 1
- (A code check of this network appears below)
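A sketch checking the hand-built XOR network above, assuming each node is a hard-threshold unit and the -1 bias input is weighted by w01, w02, and w03 respectively:

```python
def step(z):
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    # Hidden units: o1 fires only when both inputs are on; o2 fires when at least one is
    o1 = step(1 * x1 + 1 * x2 - (3 / 2))   # w11 = w21 = 1, bias weight w01 = 3/2
    o2 = step(1 * x1 + 1 * x2 - (1 / 2))   # w12 = w22 = 1, bias weight w02 = 1/2
    # Output: fires when o2 is on and o1 is off, i.e. exactly one input is on
    y = step(-1 * o1 + 1 * o2 - (1 / 2))   # w13 = -1, w23 = 1, bias weight w03 = 1/2
    return o1, o2, y

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_net(x1, x2))  # reproduces the desired-behavior table above
```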
15 Backpropagation
- Greedy, hill-climbing procedure
- The weights are the parameters to change
- Original hill-climbing changes one parameter per step
- Slow
- If the function is smooth, change all parameters per step
- Gradient descent
- Backpropagation: computes the current output, then works backward to correct the error
16 Producing a Smooth Function
- Key problem
- Pure step threshold is discontinuous
- Not differentiable
- Solution
- Sigmoid ("squashed S" function): the logistic function s(z) = 1 / (1 + e^-z), sketched below
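A brief sketch of the logistic activation and its derivative, assuming the standard form named on this slide:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_deriv(z):
    # ds/dz = s(z) * (1 - s(z)), the identity used on the Gradient of Error slide
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid(0.0), sigmoid_deriv(0.0))  # 0.5 0.25
```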
17 Neural Net Training
- Goal
- Determine how to change weights to get correct output
- Large change in weight to produce large reduction in error
- Approach
- Compute actual output o
- Compare to desired output d
- Determine effect of each weight w on the error d - o
- Adjust weights
18 Neural Net Example
- Notation: x^i = ith sample input vector, w = weight vector, y^i = desired output for the ith sample
- Sum-of-squares error over the training samples: E(w) = Σ_i (y^i - o(x^i, w))^2
- The full expression of the output o(x^i, w) in terms of the inputs and weights is a nested composition: sigmoids of weighted sums of the layer below
19 Gradient Descent
- Error: sum-of-squares error of the inputs with the current weights
- Compute the rate of change of the error with respect to each weight
- Which weights have the greatest effect on the error?
- Effectively, partial derivatives of the error with respect to the weights
- These in turn depend on other weights => chain rule
- (A minimal gradient-descent sketch follows)
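A minimal sketch of the gradient-descent step itself, on a toy one-weight error function E(w) = (w - 3)^2 (the target value 3 is purely illustrative):

```python
def error_gradient(w):
    return 2.0 * (w - 3.0)   # dE/dw for E(w) = (w - 3)^2

w = 0.0
rate = 0.1                   # the rate parameter r discussed on a later slide
for _ in range(100):
    w -= rate * error_gradient(w)   # step downhill along the gradient

print(round(w, 4))           # ~3.0, the minimum of E
```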
20 Gradient of Error
Note: derivative of the sigmoid: ds(z1)/dz1 = s(z1)(1 - s(z1))
21 From Effect to Update
- Gradient computation
- How each weight contributes to performance
- To train
- Need to determine how to CHANGE a weight based on its contribution to performance
- Need to determine how MUCH change to make per iteration
- Rate parameter r
- Large enough to learn quickly
- Small enough to reach, but not overshoot, the target values
22 Backpropagation Procedure
[Diagram: three connected nodes labeled i, j, k, with node i feeding node j and node j feeding node k]
- Pick rate parameter r
- Until performance is good enough:
- Do a forward computation to calculate the outputs
- Compute beta in the output node
- Compute beta in all other nodes
- Compute the change for all weights
- (A sketch of one common formulation of these updates follows)
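A sketch of one pass of this procedure for a tiny 2-2-1 sigmoid network trained on XOR. The slide's beta and weight-change formulas were given as equations; the ones below (beta_output = d - o; beta_j = Σ_k w_jk · o_k(1 - o_k) · beta_k; Δw_ij = r · o_i · o_j(1 - o_j) · beta_j) are a standard formulation and should be read as an assumed reconstruction:

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

random.seed(0)
# w_hidden[j][i]: weight from input i (plus bias) to hidden node j
# w_out[j]: weight from hidden node j (plus bias) to the single output node
w_hidden = [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]
w_out = [random.uniform(-0.5, 0.5) for _ in range(3)]
r = 0.5                                        # rate parameter
samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]   # XOR

for _ in range(20000):                         # "until performance is good enough"
    for (x1, x2), d in samples:
        inputs = [x1, x2, -1]                  # -1 is the bias input
        # Forward computation
        h = [sigmoid(sum(w * v for w, v in zip(w_hidden[j], inputs))) for j in range(2)]
        h_in = h + [-1]
        o = sigmoid(sum(w * v for w, v in zip(w_out, h_in)))
        # Beta in the output node, then in the hidden nodes
        beta_o = d - o
        beta_h = [w_out[j] * o * (1 - o) * beta_o for j in range(2)]
        # Weight changes
        for j in range(3):
            w_out[j] += r * h_in[j] * o * (1 - o) * beta_o
        for j in range(2):
            for i in range(3):
                w_hidden[j][i] += r * inputs[i] * h[j] * (1 - h[j]) * beta_h[j]

for (x1, x2), d in samples:
    inputs = [x1, x2, -1]
    h = [sigmoid(sum(w * v for w, v in zip(w_hidden[j], inputs))) for j in range(2)]
    o = sigmoid(sum(w * v for w, v in zip(w_out, h + [-1])))
    print(x1, x2, round(o, 2))   # usually within ~0.1 of the XOR targets; an unlucky start can stall
```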
23 Backpropagation Observations
- Procedure is (relatively) efficient
- All computations are local
- Use inputs and outputs of current node
- What is good enough?
- Rarely reach target (0 or 1) outputs
- Typically, train until within 0.1 of target
24 Neural Net Summary
- Training
- Backpropagation procedure
- Gradient descent strategy (usual problems)
- Prediction
- Compute outputs based on the input vector and weights
- Pros: Very general; fast prediction
- Cons: Training can be VERY slow (1000s of epochs); overfitting
25 Training Strategies
- Online training
- Update the weights after each sample
- Offline (batch) training
- Compute the error over all samples
- Then update the weights
- Online training is noisy
- Sensitive to individual instances
- However, it may escape local minima
- (A sketch contrasting the two follows)
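A small sketch contrasting the two strategies on a generic one-weight model; the gradient(w, sample) function and the toy data are illustrative placeholders:

```python
def online_epoch(w, samples, rate, gradient):
    # Online: adjust the weight after every individual sample
    for s in samples:
        w -= rate * gradient(w, s)
    return w

def batch_epoch(w, samples, rate, gradient):
    # Offline (batch): accumulate the gradient over all samples, then update once
    total = sum(gradient(w, s) for s in samples)
    return w - rate * total

# Toy usage: squared error against each sample target, dE/dw = 2 * (w - target)
samples = [2.0, 4.0, 6.0]
grad = lambda w, target: 2 * (w - target)
print(online_epoch(0.0, samples, 0.1, grad), batch_epoch(0.0, samples, 0.1, grad))
```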
26 Training Strategy
- To avoid overfitting
- Split the data into training, validation, and test sets
- Also, avoid excess weights (fewer weights than samples)
- Initialize with small random weights
- Small changes have a noticeable effect
- Use offline training
- Train until the error on the validation set reaches its minimum
- Then evaluate on the test set
- No more weight changes after that
- (A sketch of this recipe follows)
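A sketch of this early-stopping recipe; train_epoch, compute_error, and the data splits are assumed placeholders, not functions from the lecture:

```python
def train_with_early_stopping(weights, train_set, validation_set,
                              train_epoch, compute_error, max_epochs=1000):
    best_weights, best_error = weights, float("inf")
    for _ in range(max_epochs):
        weights = train_epoch(weights, train_set)        # one offline (batch) update
        error = compute_error(weights, validation_set)   # error on held-out data
        if error < best_error:                           # validation minimum so far
            best_weights, best_error = weights, error
    return best_weights                                  # evaluate these on the test set
```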
27 Classification
- Neural networks are best suited to classification tasks
- Single output -> binary classifier
- Multiple outputs -> multiway classification
- Applied successfully to learning pronunciation
- The sigmoid pushes outputs toward binary classification
- Not good for regression
28 Neural Net Conclusions
- Simulation based on neurons in the brain
- Perceptrons (single neuron)
- Guaranteed to find a linear discriminant IF one exists -> problem: XOR
- Neural nets (multi-layer perceptrons)
- Very general
- Backpropagation training procedure
- Gradient descent: local minima, overfitting issues