Title: Outline
1. Outline
- Announcement
- Midterm Review
2. Announcement
- The second exam will be on Nov. 17, 2004
- Please come to class a bit earlier so that we can start on time
- I will be here about 10 minutes before class
- When most of you are present, we will start so that you can have some extra time if you need it
- You need to bring a calculator
- The exam is closed-book, closed-note, and closed-neighbors
- A double-sided sheet of notes no larger than letter size is allowed
3. Hopfield Network
- A closely related neural network is the Hopfield network
- It is a recurrent network
- It is more powerful than one-layer feed-forward networks as an associative memory
4. Hopfield Network Architecture cont.
- Here each neuron is a simple perceptron neuron with the hardlims transfer function
5. Hopfield Network Architecture
- It is a single-layer recurrent network
- Each neuron is a perceptron unit with a hardlims transfer function
6. Hopfield Network as Associative Memory
- One pattern p1
- The condition for the pattern to be stable is
- This can be satisfied by
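The stability condition and the weight choice appeared as equations on the original slide; a standard reconstruction (assuming an R-dimensional bipolar pattern p1, so that p1ᵀp1 = R; the 1/R scaling is immaterial under hardlims) is:

$$ \mathrm{hardlims}(W p_1) = p_1, \qquad W = \frac{1}{R}\, p_1 p_1^{\mathsf T}, \qquad W p_1 = \frac{1}{R}\, p_1 \big(p_1^{\mathsf T} p_1\big) = p_1 $$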
7. Hopfield Network cont.
- Three cases arise when presenting a pattern p to a Hopfield network that stores only one pattern p1:
- p1 will be recalled perfectly if h(p, p1) < R/2
- -p1 will be recalled if h(p, p1) > R/2
- What will happen if h(p, p1) = R/2?
- Here h(p, p1) is the Hamming distance between p and p1
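A minimal NumPy sketch of these three cases; the 8-bit pattern, the hardlims convention at 0, and the iteration count are all illustrative assumptions, not the lecture's own code:

```python
import numpy as np

def hardlims(n):
    """Symmetric hard limit: +1 for n >= 0, -1 otherwise."""
    return np.where(n >= 0, 1, -1)

R = 8                                    # hypothetical pattern length
p1 = np.array([1, -1, 1, 1, -1, 1, -1, -1])
W = np.outer(p1, p1) / R                 # store the single pattern

def recall(p, steps=10):
    a = p.copy()
    for _ in range(steps):               # iterate the recurrent network
        a = hardlims(W @ a)
    return a

near = p1.copy(); near[0] = -near[0]     # h(p, p1) = 1 < R/2
far = -p1.copy(); far[0] = -far[0]       # h(p, p1) = 7 > R/2
print(np.array_equal(recall(near), p1))  # True: p1 is recalled
print(np.array_equal(recall(far), -p1))  # True: -p1 is recalled
```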
8. Hopfield Network as Associative Memory
- Many patterns
- Matrix form
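The matrix form appeared as an equation; a standard reconstruction (assuming bipolar patterns stacked as the columns of P = [p1 ... pQ]) is:

$$ W = \frac{1}{R} \sum_{q=1}^{Q} p_q p_q^{\mathsf T} = \frac{1}{R}\, P P^{\mathsf T} $$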
9. Hopfield Network cont.
- Storage capacity of a Hopfield network
- For random patterns, we can estimate the maximum number of patterns a Hopfield network can store with an acceptable error
- This depends on how we define acceptable error
10. Hopfield Network cont.
- Here the acceptable error is the error of each bit
11. Hopfield Network cont.
- If we require the error for each pattern: for error < 0.01/R, Qmax ≈ R/(2 log R)
- If we require the error for all the patterns: for error < 0.01/(QR), Qmax ≈ R/(4 log R)
12. Hopfield Network cont.
- Spurious states
- Hopfield networks can have stable configurations other than the given patterns:
- Reversed states
- Mixture states
- For a large Q, there are also stable configurations not correlated with any of the stored patterns
13. Widrow-Hoff Learning
- The neural network we will discuss here is called ADALINE
- It is very similar to the perceptron, except that its transfer function is linear
- It is the same as the linear associator
- ADALINE with its learning algorithm, LMS, is widely used in digital signal processing
14. ADALINE Network
15. Two-Input ADALINE
16. Mean Square Error
Training Set
Input
Target
Notation
Mean Square Error
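The definitions on this slide were images; in the usual Widrow-Hoff notation (an assumption consistent with the rest of the lecture) they read:

$$ \{p_1, t_1\}, \{p_2, t_2\}, \dots, \{p_Q, t_Q\}, \qquad x = \begin{bmatrix} w \\ b \end{bmatrix}, \quad z = \begin{bmatrix} p \\ 1 \end{bmatrix}, \quad a = x^{\mathsf T} z $$

$$ F(x) = E[e^2] = E[(t - a)^2] = E[(t - x^{\mathsf T} z)^2] $$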
17. Error Analysis
If a unique solution exists, it will be given by
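A standard reconstruction of the missing analysis, using the notation above:

$$ F(x) = c - 2 x^{\mathsf T} h + x^{\mathsf T} R x, \qquad c = E[t^2], \quad h = E[t z], \quad R = E[z z^{\mathsf T}] $$

$$ \nabla F(x) = -2h + 2Rx = 0 \;\Longrightarrow\; x^{*} = R^{-1} h $$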
18. Approximate Steepest Descent
Approximate mean square error (one sample)
Approximate (stochastic) gradient
19. Approximate Gradient Calculation
20. LMS Algorithm
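The update equations did not survive extraction; the standard LMS updates, consistent with the approximate gradient above, are:

$$ \hat{F}(x) = e^2(k) = \big(t(k) - a(k)\big)^2 $$

$$ W(k+1) = W(k) + 2\alpha\, e(k)\, p^{\mathsf T}(k), \qquad b(k+1) = b(k) + 2\alpha\, e(k) $$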
21. Multiple-Neuron Case
Matrix Form
22. Properties and Advantages of the LMS Algorithm
- Compared to the analytical solution, the LMS algorithm provides some advantages:
- It does not require calculating the inverse of a (potentially large) matrix
- It is more flexible in that it does not require all the training examples to be available at the beginning
- On-line learning is possible (see the sketch below)
- Compared to backpropagation, the LMS algorithm converges to a unique solution, as long as the learning rate is not too large
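A minimal sketch of on-line LMS, assuming hypothetical noisy linear data (the target weights [1, -2] and bias 0.5 are made up for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data: targets from a noisy linear map.
true_w = np.array([1.0, -2.0])
P = rng.normal(size=(100, 2))                 # inputs, one row per sample
T = P @ true_w + 0.5 + 0.01 * rng.normal(size=100)

w = np.zeros(2)                               # ADALINE weights
b = 0.0                                       # bias
alpha = 0.04                                  # learning rate (must not be too large)

for p, t in zip(P, T):                        # on-line: one sample at a time
    a = w @ p + b                             # linear transfer function
    e = t - a                                 # error for this sample
    w = w + 2 * alpha * e * p                 # LMS weight update
    b = b + 2 * alpha * e                     # LMS bias update

print(w, b)                                   # should approach [1, -2] and 0.5
```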
23. Learning Rate
- Note that here the learning rate is important
- If the learning rate is too small, it will take a lot of iterations for the algorithm to converge
- If the learning rate is too large, the algorithm may not converge
24. Example
25. Conditions for Stability
(where λi is an eigenvalue of R)
Therefore the stability condition simplifies to
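The missing equations, in the standard LMS stability analysis, are:

$$ |1 - 2\alpha\lambda_i| < 1 \;\;\text{for all } i \;\Longrightarrow\; 0 < \alpha < \frac{1}{\lambda_{\max}} $$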
26. Example
Banana
Apple
27. Iteration One
Banana
28. Iteration Two
Apple
29. Iteration Three
30. Backpropagation
- Backpropagation is a direct generalization of the LMS algorithm
- Both are steepest-descent algorithms based on an approximate squared error
- Backpropagation becomes the LMS algorithm when applied to the ADALINE network
- The main difference is how the gradients are calculated
- In practice, backpropagation is more powerful
- Note that nonlinearity is essential for multilayer neural networks
- However, there is no guarantee that the backpropagation algorithm will converge to the globally optimal solution
31. Multilayer Network
32. Performance Index
Training Set
Mean Square Error
Vector Case
Approximate Mean Square Error (Single Sample)
Approximate Steepest Descent
33. Chain Rule
Example
Application to Gradient Calculation
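As an illustration (this particular example is a hypothetical choice, shown only to make the pattern concrete):

$$ f(n) = e^{n}, \quad n = 2w \;\Longrightarrow\; \frac{df}{dw} = \frac{df}{dn}\,\frac{dn}{dw} = e^{2w} \cdot 2 $$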
34. Gradient Calculation
Sensitivity
Gradient
35. Steepest Descent
Next Step: Compute the Sensitivities (Backpropagation)
36. Jacobian Matrix
37. Backpropagation (Sensitivities)
The sensitivities are computed by starting at the last layer and then propagating backwards through the network to the first layer.
38. Initialization (Last Layer)
39. Summary
Forward Propagation
Backpropagation
Weight Update
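The three sets of equations were images; in the standard multilayer notation they read:

Forward propagation:
$$ a^0 = p, \qquad a^{m+1} = f^{m+1}\big(W^{m+1} a^m + b^{m+1}\big), \quad m = 0, 1, \dots, M-1, \qquad a = a^M $$

Backpropagation of sensitivities:
$$ s^M = -2 \dot{F}^M(n^M)\,(t - a), \qquad s^m = \dot{F}^m(n^m)\,(W^{m+1})^{\mathsf T} s^{m+1}, \quad m = M-1, \dots, 1 $$

Weight update:
$$ W^m(k+1) = W^m(k) - \alpha\, s^m (a^{m-1})^{\mathsf T}, \qquad b^m(k+1) = b^m(k) - \alpha\, s^m $$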
40. Example: Function Approximation
[Figure: 1-2-1 network with input p, output a, target t, and error e = t - a]
41. Network
42. Initial Conditions
43. Forward Propagation
44. Transfer Function Derivatives
45. Backpropagation
46. Weight Update
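Slides 42-46 stepped through one training iteration with equations that did not survive extraction; below is a minimal NumPy sketch of the same forward/backpropagate/update cycle for a 1-2-1 network (log-sigmoid hidden layer, linear output). The initial values and sample are hypothetical stand-ins in the spirit of the in-class example:

```python
import numpy as np

def logsig(n):
    """Log-sigmoid transfer function for the hidden layer."""
    return 1.0 / (1.0 + np.exp(-n))

# Hypothetical 1-2-1 network parameters and a single training sample.
W1 = np.array([[-0.27], [-0.41]]); b1 = np.array([[-0.48], [-0.13]])
W2 = np.array([[0.09, -0.17]]);    b2 = np.array([[0.48]])
p = np.array([[1.0]]); t = np.array([[1.0 + np.sin(np.pi / 4)]])
alpha = 0.1

# Forward propagation.
a1 = logsig(W1 @ p + b1)
a2 = W2 @ a1 + b2                      # linear output layer
e = t - a2

# Backpropagation of sensitivities.
# Derivative of logsig is a(1 - a); derivative of the linear output is 1.
s2 = -2 * 1.0 * e
s1 = (a1 * (1 - a1)) * (W2.T @ s2)

# Weight update (steepest descent).
W2 -= alpha * s2 @ a1.T; b2 -= alpha * s2
W1 -= alpha * s1 @ p.T;  b1 -= alpha * s1
print(W1.ravel(), b1.ravel(), W2.ravel(), b2.ravel())
```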
47. Other Gradient Descent Algorithms
- The steepest gradient descent algorithm can be used to derive learning algorithms for different kinds of networks
- The key step is how to calculate the gradients
48. Other Gradient Descent Algorithms
49. Other Gradient Descent Algorithms
50. Practical Issues
- While in theory a multilayer neural network with nonlinear transfer functions trained using backpropagation is sufficient to approximate any function or solve any recognition problem, there are practical issues:
- What is the optimal architecture for a particular problem/application?
- What is the performance on unknown test data?
- Will the network converge to a good solution?
- How long does it take to train a network?
51. Choice of Architecture
[Figure: 1-3-1 network fits for i = 1, 2, 4, 8]
52. Choice of Network Architecture
[Figure: fits for 1-2-1, 1-3-1, 1-4-1, and 1-5-1 networks]
53. The Issue of Generalization
- We are interested in whether a neural network trained only on a training set will work well on novel, unseen test data
- For example, for face recognition, a neural network that can only recognize the faces in the training set is not very useful
- Generalization is one of the most fundamental problems in neural networks and many other recognition techniques
54. Generalization
[Figure: 1-2-1 network fit vs. 1-9-1 network fit]
55. Improving Generalization
- Heuristics
- A neural network should have fewer parameters than the number of data points in the training set
- Simpler neural networks should be preferred over complicated ones, known as Occam's razor
- Use more domain-specific knowledge
- Cross validation
- Divide the labeled examples into training and validation sets
- Stop training when the error on the validation set increases
56. Cross Validation
57. Convergence Issues
- A neural network may converge to a bad solution
- Train several neural networks from different initial conditions
58. Convergence
59. Convergence Issues
- A neural network may converge to a bad solution
- Train several neural networks from different initial conditions
- Convergence can be slow
- Practical techniques
- Variations of the basic backpropagation algorithm
60. Practical Techniques for Improving Backpropagation
- Scaling inputs
- We can standardize each feature component to have zero mean and the same variance
- Target values
- For pattern recognition applications, use 1 for the target category and -1 for the non-target category
- Training with noise
61. Practical Techniques for Improving Backpropagation
- Manufacturing data
- If we have knowledge about the sources of variation among the inputs, we can manufacture additional training data
- For face detection, we can rotate and enlarge/shrink the training images
- Initializing weights
- If we use standardized data, we want both positive and negative weights, drawn from a uniform distribution
- Uniform learning
62. Practical Techniques for Improving Backpropagation
- Training protocols
- An epoch corresponds to a single presentation of all the patterns in the training set
- Stochastic training
- Training samples are chosen randomly from the set, and the weights are updated after each sample
- Batch training
- All the training samples are presented to the network before the weights are updated
- On-line training
- Each training sample is presented once and only once
- There is no memory for storing training samples
63. Speeding up Convergence
- Heuristics
- Momentum
- Variable learning rate
- Conjugate gradient
- Second-order methods
- Newton's method
- Levenberg-Marquardt algorithm
64. Performance Surface Example
Network Architecture
Nominal Function
Parameter Values
65. Squared Error vs. w1_{1,1} and w2_{1,1}
66. Momentum
Filter
Example
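The filter and the resulting update appeared as equations; in the standard momentum formulation they are:

First-order filter:
$$ y(k) = \gamma\, y(k-1) + (1 - \gamma)\, w(k), \qquad 0 \le \gamma < 1 $$

Momentum backpropagation update:
$$ \Delta W^m(k) = \gamma\, \Delta W^m(k-1) - (1 - \gamma)\,\alpha\, s^m (a^{m-1})^{\mathsf T} $$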
67. Momentum Backpropagation
[Figure: trajectories for Steepest Descent Backpropagation (SDBP) and Momentum Backpropagation (MOBP)]
68. Momentum Backpropagation
Using standard BP
69. Momentum Backpropagation
70. Variable Learning Rate (VLBP)
- If the squared error (over the entire training set) increases by more than some set percentage ζ after a weight update, then the weight update is discarded, the learning rate is multiplied by some factor ρ (0 < ρ < 1), and the momentum coefficient γ is set to zero.
- If the squared error decreases after a weight update, then the weight update is accepted and the learning rate is multiplied by some factor η > 1. If γ has been previously set to zero, it is reset to its original value.
- If the squared error increases by less than ζ, then the weight update is accepted, but the learning rate and the momentum coefficient are unchanged.
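A self-contained sketch of this three-case logic, exercised on a simple quadratic error surface; the parameter values (ζ = 0.04, ρ = 0.7, η = 1.05, γ = 0.9) are illustrative assumptions, not values from the lecture:

```python
import numpy as np

# VLBP demo on F(w) = w.w, a stand-in for the squared error surface.
F = lambda w: float(w @ w)
grad = lambda w: 2 * w

w = np.array([1.5, -2.0]); dw = np.zeros_like(w)
alpha, gamma, gamma0 = 0.4, 0.9, 0.9
zeta, rho, eta = 0.04, 0.7, 1.05
err = F(w)

for _ in range(50):
    dw_try = gamma * dw - (1 - gamma) * alpha * grad(w)   # momentum step
    err_try = F(w + dw_try)
    if err_try > err * (1 + zeta):
        alpha *= rho; gamma = 0.0          # error grew too much: reject, shrink rate
    else:
        if err_try < err:
            alpha *= eta                   # error decreased: accept, grow rate
            if gamma == 0.0:
                gamma = gamma0             # restore momentum if it was zeroed
        # (error grew by less than zeta: accept, leave alpha and gamma alone)
        w, dw, err = w + dw_try, dw_try, err_try

print(w, err)   # w should approach the minimum at the origin
```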
71. VLBP Example
72. VLBP Example
73. Associative Learning
- The goal is to learn associations between a system's inputs and its outputs
- In this chapter the association is learned between things that occur simultaneously
- The inputs, also called stimuli, are divided into:
- Unconditioned inputs (whose weights are fixed), corresponding to the food presented to the dog in Pavlov's experiment
- Conditioned inputs (whose weights will be learned), corresponding to the bell in Pavlov's experiment
74. Unsupervised Learning
- In unsupervised learning, the network's weights and biases are updated according to the inputs only
- There are no target values for the input patterns
- The training now consists of a sequence of input patterns p(1), p(2), ..., p(Q)
75. Simple Associative Network
76. Banana Associator
Unconditioned Stimulus
Conditioned Stimulus
77. Unsupervised Hebb Rule
Vector Form
- Local learning rule: a rule that uses only signals available within the layer containing the weights being updated
- Is backpropagation a local learning rule?
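The rule itself appeared as equations; the standard unsupervised Hebb rule is:

$$ w_{ij}(q) = w_{ij}(q-1) + \alpha\, a_i(q)\, p_j(q), \qquad W(q) = W(q-1) + \alpha\, a(q)\, p^{\mathsf T}(q) $$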
78. Banana Recognition Example
Initial Weights
Training Sequence
α = 1
First Iteration (sight fails)
79. Example
Second Iteration (sight works)
Third Iteration (sight fails)
The banana will now be detected if either sensor works.
80. Problems with the Hebb Rule
- Weights can become arbitrarily large when inputs are presented again and again
- There is no mechanism for weights to decrease
- Noise in the inputs or outputs can cause the network to respond to any stimulus
81. Hebb Rule with Decay
This keeps the weight matrix from growing without bound, which can be demonstrated by setting both ai and pj to 1:
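A standard reconstruction of the decay rule and the bound it implies:

$$ W(q) = W(q-1) + \alpha\, a(q)\, p^{\mathsf T}(q) - \gamma\, W(q-1) $$

Setting ai = pj = 1 at the fixed point:
$$ w_{ij}^{\max} = (1 - \gamma)\, w_{ij}^{\max} + \alpha \;\Longrightarrow\; w_{ij}^{\max} = \frac{\alpha}{\gamma} $$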
82. Example: Banana Associator
γ = 0.1, α = 1
First Iteration (sight fails)
Second Iteration (sight works)
83. Example
Third Iteration (sight fails)
Hebb Rule
Hebb with Decay
84. Problem of Hebb with Decay
- Associations will decay away if stimuli are not occasionally presented.
If ai = 0, then wij(q) = (1 - γ) wij(q-1)
If γ = 0.1, this becomes wij(q) = 0.9 wij(q-1)
Therefore the weight decays by 10% at each iteration where there is no stimulus.
85. Instar Network
- Instar network
- Architecture-wise, identical to the simple perceptron network
- A single-layer network
- However, in the instar, the bias is given and the weights are learned using the instar rule
86. Instar (Recognition Network)
87. Instar Operation
The instar will be active when
or
For normalized vectors, the largest inner product occurs when the angle between the weight vector and the input vector is zero -- the input vector is equal to the weight vector.
The rows of a weight matrix represent patterns to be recognized.
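The two activation conditions above appeared as equations; they are standardly written as:

$$ {}_{1}w^{\mathsf T} p \ge -b \qquad\text{or}\qquad \lVert {}_{1}w \rVert\, \lVert p \rVert \cos\theta \ge -b $$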
88. Vector Recognition
If we set b = -||1w|| ||p||, the instar will only be active when θ = 0.
If we set b > -||1w|| ||p||, the instar will be active for a range of angles.
As b is increased, there will be more patterns (over a wider range of θ) that activate the instar.
89. Instar Rule
Hebb rule
Modify it so that learning and forgetting occur only when the neuron is active - the instar rule
or
Vector Form
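The missing equations, in the standard instar formulation:

$$ w_{ij}(q) = w_{ij}(q-1) + \alpha\, a_i(q)\big(p_j(q) - w_{ij}(q-1)\big) $$

Vector form:
$$ {}_{i}w(q) = {}_{i}w(q-1) + \alpha\, a_i(q)\big(p(q) - {}_{i}w(q-1)\big) $$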
90. Graphical Representation
For the case where the instar is active (ai = 1): iw(q) = iw(q-1) + α(p(q) - iw(q-1)), or iw(q) = (1 - α) iw(q-1) + α p(q)
For the case where the instar is inactive (ai = 0): iw(q) = iw(q-1)
91. Example
92. Training
First Iteration (a = 1)
93. Further Training
94. Outstar (Recall Network)
95. Outstar Operation
Suppose we want the outstar to recall a certain pattern a* whenever the input p = 1 is presented to the network. Let W = a*.
Then, when p = 1, a = satlins(Wp) = satlins(a*) = a* (assuming the elements of a* lie in [-1, 1]), and the pattern is correctly recalled.
The columns of a weight matrix represent patterns to be recalled.
96. Outstar Rule
For the instar rule, we made the weight decay term of the Hebb rule proportional to the output of the network. For the outstar rule, we make the weight decay term proportional to the input of the network.
If we make the decay rate γ equal to the learning rate α, we obtain the outstar rule.
Vector Form
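The missing equations, in the standard outstar formulation:

$$ w_{ij}(q) = w_{ij}(q-1) + \alpha\, a_i(q)\, p_j(q) - \gamma\, w_{ij}(q-1)\, p_j(q) $$

With γ = α (element and vector forms, where wj is the j-th column of W):
$$ w_{ij}(q) = w_{ij}(q-1) + \alpha \big(a_i(q) - w_{ij}(q-1)\big) p_j(q), \qquad w_j(q) = w_j(q-1) + \alpha \big(a(q) - w_j(q-1)\big) p_j(q) $$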
97. Example: Pineapple Recall
98. Definitions
99. Iteration 1
α = 1
100. Convergence
101. Hamming Network
102. Hamming Network cont.
- Layer 1
- Consists of multiple instar neurons, so that it can recognize more than one pattern
- The output of each neuron is the inner product between its weight vector (a prototype) and the input vector
- The output of the first layer indicates the correlation between each prototype pattern and the input vector
- It is feedforward
103. Layer 1 (Correlation)
We want the network to recognize the following prototype vectors.
The first layer weight matrix and bias vector are given by
The response of the first layer is
The prototype closest to the input vector produces the largest response.
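The matrices themselves were images; the standard two-prototype construction (for R-dimensional inputs and prototypes p1, p2) is:

$$ W^1 = \begin{bmatrix} p_1^{\mathsf T} \\ p_2^{\mathsf T} \end{bmatrix}, \qquad b^1 = \begin{bmatrix} R \\ R \end{bmatrix}, \qquad a^1 = W^1 p + b^1 = \begin{bmatrix} p_1^{\mathsf T} p + R \\ p_2^{\mathsf T} p + R \end{bmatrix} $$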
104. Hamming Network cont.
- Layer 2
- It is a recurrent network, called a competitive network
- The neurons in this layer compete with each other to determine a winner
- After the competition, only one neuron will have a nonzero output
- The winning neuron indicates which category of input was presented to the network
105. Layer 2 (Competition)
The second layer is initialized with the output of the first layer.
The neuron with the largest initial condition will win the competition.
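The recurrence appeared as equations; the standard competitive-layer dynamics for S neurons (W2 shown here for S = 2) are:

$$ a^2(0) = a^1, \qquad a^2(t+1) = \mathrm{poslin}\big(W^2 a^2(t)\big), \qquad W^2 = \begin{bmatrix} 1 & -\varepsilon \\ -\varepsilon & 1 \end{bmatrix}, \quad 0 < \varepsilon < \frac{1}{S-1} $$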
106. Hamming Network cont.
- Lateral inhibition
- This competition is called a winner-take-all competition
- Because the neuron with the largest value decreases the slowest, it remains positive when all the others have become zero
- What will happen if there are ties?
107. Classification Example
108. Competitive Layer
- In a competitive layer, each neuron excites itself and inhibits all the other neurons
- A transfer function can do the job of a recurrent competitive layer
- It works by finding the neuron with the largest net input and setting its output to 1 (in case of ties, the neuron with the lowest index wins); all other outputs are set to 0
109. Competitive Layer
110. Competitive Learning
- A learning rule to train the weights in a competitive network
- Instar rule
- In other words (see the Kohonen rule and the sketch below):
- For the competitive network, the winning neuron has an output of 1, and the other neurons have an output of 0.
111. Competitive Learning
Kohonen Rule
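The rule appeared as an equation; for the winning neuron i* the standard Kohonen rule reads:

$$ {}_{i^*}w(q) = {}_{i^*}w(q-1) + \alpha\big(p(q) - {}_{i^*}w(q-1)\big) = (1 - \alpha)\, {}_{i^*}w(q-1) + \alpha\, p(q) $$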
112. Graphical Representation
113. Example
114. Four Iterations
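Slides 112-114 showed the iterations graphically; here is a minimal NumPy sketch of the same competitive training loop, assuming hypothetical two-cluster data and two neurons:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 2-D inputs in two rough clusters, normalized to unit length.
P = rng.normal(loc=[[1.0, 1.0]] * 10 + [[-1.0, 1.0]] * 10, scale=0.2)
P /= np.linalg.norm(P, axis=1, keepdims=True)

# Two competing neurons; initializing prototypes from data points avoids dead units.
W = P[[0, 10]].copy()
alpha = 0.5

for _ in range(4):                              # a few passes over the data
    for p in P:
        i_star = np.argmax(W @ p)               # competition: largest inner product wins
        W[i_star] += alpha * (p - W[i_star])    # Kohonen rule: winner moves toward p
        W[i_star] /= np.linalg.norm(W[i_star])  # keep the prototype normalized

print(W)    # each row should end up near one cluster's center direction
```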
115. Problems with Competitive Layers
- Choice of learning rate
- A learning rate near zero results in slow but stable learning
- A learning rate near one results in fast learning, but the weights may oscillate
- Stability problems arise when clusters are close to each other
- Dead neurons
- A neuron whose initial weight vector is so far from any input vector that it never wins the competition
- The number of classes must be known
- These limitations can be overcome by feature maps, LVQ networks, and ART networks
116. Choice of Learning Rates
- When the learning rate is small, learning is stable but slow
- When the learning rate is close to 1, learning is fast but may be unstable
- An adaptive learning rate can be used
- Start with a large learning rate and gradually decrease it
117. Stability
If the input vectors don't fall into nice clusters, then for large learning rates the presentation of each input vector may modify the configuration, so that the system undergoes continual evolution.
[Figure: input vectors p1-p8 with weight vectors 1w and 2w shown at iterations 0 and 8]
118. Another Stability Example
119. Typical Convergence (Clustering)
[Figure: weights and input vectors, before and after training]
120. Dead Units
One problem with competitive learning is that neurons with initial weights far from any input vector may never win.
121. Dead Units cont.
- Solution
- Add a negative bias to each neuron, and increase the magnitude of the bias as the neuron wins
- This makes it harder for a neuron to win if it has won often
- This is called a conscience
122. Competitive Layers in Biology
On-Center/Off-Surround Connections for Competition
Weights in the competitive layer of the Hamming
network
Weights assigned based on distance
123. Mexican-Hat Function
124. Feature Maps
Update the weight vectors in a neighborhood of the winning neuron.
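The update appeared as an equation; the standard SOFM rule updates every neuron i in the neighborhood of the winner i*:

$$ {}_{i}w(q) = {}_{i}w(q-1) + \alpha \big(p(q) - {}_{i}w(q-1)\big), \qquad i \in N_{i^*}(d) = \{\, j : \mathrm{dist}(i^*, j) \le d \,\} $$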
125. Example
126. Self-Organizing Feature Maps cont.
127. Self-Organizing Feature Maps cont.
128. Self-Organizing Feature Maps cont.
129. Self-Organizing Feature Maps cont.
130. Improving SOFM
- Speeding up the convergence of SOFM
- Variable neighborhood size
- Use a larger neighborhood size initially and gradually reduce it until it includes only the winning neuron
- Variable learning rate
- Use a larger learning rate initially (close to 1) and decrease it toward 0 asymptotically
- Let the winning neuron use a larger rate than the neighboring ones
- One can use distance as the net input instead of the inner product
131. Learning Vector Quantization
The net input is not computed by taking an inner product of the prototype vectors with the input. Instead, the net input is the negative of the distance between the prototype vectors and the input.
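Written out (a standard reconstruction), the first-layer net input and output of the LVQ network are:

$$ n^1_i = -\lVert p - {}_{i}w^1 \rVert, \qquad a^1 = \mathrm{compet}(n^1) $$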
132. Subclass
For the LVQ network, the winning neuron in the first layer indicates the subclass that the input vector belongs to. There may be several different neurons (subclasses) that make up each class.
The second layer of the LVQ network combines subclasses into a single class. The columns of W2 represent subclasses, and the rows represent classes. W2 has a single 1 in each column, with the other elements set to zero. The row in which the 1 occurs indicates which class the corresponding subclass belongs to.
133. Example
Subclasses 1, 3, and 4 belong to class 1. Subclass 2 belongs to class 2. Subclasses 5 and 6 belong to class 3.
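Written out, this class assignment is a direct transcription into the second-layer matrix (rows = classes, columns = subclasses):

$$ W^2 = \begin{bmatrix} 1 & 0 & 1 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 \end{bmatrix} $$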
A single-layer competitive network can create convex classification regions. The second layer of the LVQ network can combine the convex regions to create more complex categories.
134. LVQ Design Example
135. LVQ Design Example
136. LVQ Design Example
137. LVQ Learning
LVQ learning combines competitive learning with supervision. It requires a training set of examples of proper network behavior.
If the input pattern is classified correctly, then move the winning weight vector toward the input vector according to the Kohonen rule.
If the input pattern is classified incorrectly, then move the winning weight vector away from the input vector.
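A minimal sketch of this supervised rule (LVQ1); the prototypes, the subclass-to-class matrix, and the sample below are hypothetical:

```python
import numpy as np

def lvq1_step(W1, W2, p, t, alpha):
    """One LVQ learning step. W1 holds prototype rows, W2 maps
    subclasses to classes, t is the one-hot target class vector."""
    i_star = np.argmin(np.linalg.norm(W1 - p, axis=1))  # winner: closest prototype
    k_star = np.argmax(W2[:, i_star])                   # class of the winning subclass
    if t[k_star] == 1:
        W1[i_star] += alpha * (p - W1[i_star])  # correct: move prototype toward input
    else:
        W1[i_star] -= alpha * (p - W1[i_star])  # wrong: move prototype away from input
    return W1

# Hypothetical setup: four prototypes, two classes
# (subclasses 1, 2 -> class 1; subclasses 3, 4 -> class 2).
W1 = np.array([[-1.0, -1.0], [1.0, 1.0], [-1.0, 1.0], [1.0, -1.0]])
W2 = np.array([[1, 1, 0, 0], [0, 0, 1, 1]], dtype=float)
W1 = lvq1_step(W1, W2, p=np.array([0.8, 0.9]), t=np.array([1, 0]), alpha=0.5)
print(W1)
```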
138. Example
139. First Iteration
140. Second Layer
This is the correct class; therefore, the weight vector is moved toward the input vector.
141. Figure
142. Final Decision Regions
143. LVQ2
If the winning neuron in the hidden layer incorrectly classifies the current input, we move its weight vector away from the input vector, as before. However, we also adjust the weights of the closest neuron to the input vector that does classify it properly; the weights of this second neuron should be moved toward the input vector.
When the network correctly classifies an input vector, the weights of only one neuron are moved toward the input vector. However, if the input vector is incorrectly classified, the weights of two neurons are updated: one weight vector is moved away from the input vector, and the other is moved toward it.
The resulting algorithm is called LVQ2.
144. LVQ2 Example