Title: Outline
1. Outline
- Announcement
- Midterm Review
2. Announcement
- The second exam will be on Nov. 17, 2004
- Please come to class a bit earlier so that we can start on time
- I will be here about 10 minutes before class
- When most of you are present, we will start so that you can have some extra time if you need it
- You need to bring a calculator
- The exam is closed-book, closed-note, and closed-neighbors
- A double-sided sheet of notes no larger than letter size is allowed
3. Hopfield Network
- A closely related neural network is the Hopfield network
- It is a recurrent network
- It is more powerful than one-layer feed-forward networks as an associative memory
4. Hopfield Network Architecture cont.
- Here each neuron is a simple perceptron neuron with the hardlims transfer function
5. Hopfield Network Architecture
- It is a single-layer recurrent network
- Each neuron is a perceptron unit with a hardlims transfer function
6. Hopfield Network as Associative Memory
- One pattern p1
- The condition for the pattern to be stable is
- This can be satisfied by
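The stability condition and the weight choice appeared as equations on the original slide; a standard reconstruction (assuming an R-dimensional bipolar pattern p1, so that p1ᵀp1 = R; the 1/R scaling is immaterial under hardlims) is:

$$ \mathrm{hardlims}(W p_1) = p_1, \qquad W = \frac{1}{R}\, p_1 p_1^{\mathsf T}, \qquad W p_1 = \frac{1}{R}\, p_1 \big(p_1^{\mathsf T} p_1\big) = p_1 $$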
7. Hopfield Network cont.
- Three cases arise when presenting a pattern p to a Hopfield network that stores only one pattern p1:
- p1 will be recalled perfectly if h(p, p1) < R/2
- -p1 will be recalled if h(p, p1) > R/2
- What will happen if h(p, p1) = R/2?
- Here h(p, p1) is the Hamming distance between p and p1
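A minimal NumPy sketch of these three cases; the 8-bit pattern, the hardlims convention at 0, and the iteration count are all illustrative assumptions, not the lecture's own code:

```python
import numpy as np

def hardlims(n):
    """Symmetric hard limit: +1 for n >= 0, -1 otherwise."""
    return np.where(n >= 0, 1, -1)

R = 8                                    # hypothetical pattern length
p1 = np.array([1, -1, 1, 1, -1, 1, -1, -1])
W = np.outer(p1, p1) / R                 # store the single pattern

def recall(p, steps=10):
    a = p.copy()
    for _ in range(steps):               # iterate the recurrent network
        a = hardlims(W @ a)
    return a

near = p1.copy(); near[0] = -near[0]     # h(p, p1) = 1 < R/2
far = -p1.copy(); far[0] = -far[0]       # h(p, p1) = 7 > R/2
print(np.array_equal(recall(near), p1))  # True: p1 is recalled
print(np.array_equal(recall(far), -p1))  # True: -p1 is recalled
```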
8. Hopfield Network as Associative Memory
- Many patterns
- Matrix form
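The matrix form appeared as an equation; a standard reconstruction (assuming bipolar patterns stacked as the columns of P = [p1 ... pQ]) is:

$$ W = \frac{1}{R} \sum_{q=1}^{Q} p_q p_q^{\mathsf T} = \frac{1}{R}\, P P^{\mathsf T} $$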
9. Hopfield Network cont.
- Storage capacity of a Hopfield network
- For random patterns, we can estimate the maximum number of patterns a Hopfield network can store with an acceptable error
- This depends on how we define acceptable error
10. Hopfield Network cont.
- Here the acceptable error is the error of each bit
11. Hopfield Network cont.
- If we require the error for each pattern: for error < 0.01/R, Qmax ≈ R/(2 log R)
- If we require the error for all the patterns: for error < 0.01/(QR), Qmax ≈ R/(4 log R)
12. Hopfield Network cont.
- Spurious states
- Hopfield networks can have stable configurations other than the given patterns:
- Reversed states
- Mixture states
- For a large Q, there are also stable configurations not correlated with any of the stored patterns
13. Widrow-Hoff Learning
- The neural network we will discuss here is called ADALINE
- It is very similar to the perceptron, except that its transfer function is linear
- It is the same as the linear associator
- ADALINE with its learning algorithm, LMS, is widely used in digital signal processing
14. ADALINE Network
15. Two-Input ADALINE
16. Mean Square Error
Training Set
Input
Target
Notation
Mean Square Error
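The definitions on this slide were images; in the usual Widrow-Hoff notation (an assumption consistent with the rest of the lecture) they read:

$$ \{p_1, t_1\}, \{p_2, t_2\}, \dots, \{p_Q, t_Q\}, \qquad x = \begin{bmatrix} w \\ b \end{bmatrix}, \quad z = \begin{bmatrix} p \\ 1 \end{bmatrix}, \quad a = x^{\mathsf T} z $$

$$ F(x) = E[e^2] = E[(t - a)^2] = E[(t - x^{\mathsf T} z)^2] $$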
17. Error Analysis
If a unique solution exists, it will be given by
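A standard reconstruction of the missing analysis, using the notation above:

$$ F(x) = c - 2 x^{\mathsf T} h + x^{\mathsf T} R x, \qquad c = E[t^2], \quad h = E[t z], \quad R = E[z z^{\mathsf T}] $$

$$ \nabla F(x) = -2h + 2Rx = 0 \;\Longrightarrow\; x^{*} = R^{-1} h $$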
18. Approximate Steepest Descent
Approximate mean square error (one sample)
Approximate (stochastic) gradient
19. Approximate Gradient Calculation
20. LMS Algorithm
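The update equations did not survive extraction; the standard LMS updates, consistent with the approximate gradient above, are:

$$ \hat{F}(x) = e^2(k) = \big(t(k) - a(k)\big)^2 $$

$$ W(k+1) = W(k) + 2\alpha\, e(k)\, p^{\mathsf T}(k), \qquad b(k+1) = b(k) + 2\alpha\, e(k) $$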
21. Multiple-Neuron Case
Matrix Form
22. Properties and Advantages of the LMS Algorithm
- Compared to the analytical solution, the LMS algorithm provides some advantages:
- It does not require calculating the inverse of a (potentially large) matrix
- It is more flexible in that it does not require all the training examples to be available at the beginning
- On-line learning is possible (see the sketch below)
- Compared to backpropagation, the LMS algorithm converges to a unique solution, as long as the learning rate is not too large
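A minimal sketch of on-line LMS, assuming hypothetical noisy linear data (the target weights [1, -2] and bias 0.5 are made up for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data: targets from a noisy linear map.
true_w = np.array([1.0, -2.0])
P = rng.normal(size=(100, 2))                 # inputs, one row per sample
T = P @ true_w + 0.5 + 0.01 * rng.normal(size=100)

w = np.zeros(2)                               # ADALINE weights
b = 0.0                                       # bias
alpha = 0.04                                  # learning rate (must not be too large)

for p, t in zip(P, T):                        # on-line: one sample at a time
    a = w @ p + b                             # linear transfer function
    e = t - a                                 # error for this sample
    w = w + 2 * alpha * e * p                 # LMS weight update
    b = b + 2 * alpha * e                     # LMS bias update

print(w, b)                                   # should approach [1, -2] and 0.5
```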
23. Learning Rate
- Note that here the learning rate is important
- If the learning rate is too small, it will take a lot of iterations for the algorithm to converge
- If the learning rate is too large, the algorithm may not converge
24. Example
25. Conditions for Stability
(where λi is an eigenvalue of R)
Therefore the stability condition simplifies to
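The missing equations, in the standard LMS stability analysis, are:

$$ |1 - 2\alpha\lambda_i| < 1 \;\;\text{for all } i \;\Longrightarrow\; 0 < \alpha < \frac{1}{\lambda_{\max}} $$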
26. Example
Banana
Apple
27. Iteration One
Banana
28. Iteration Two
Apple
29. Iteration Three
30. Backpropagation
- Backpropagation is a direct generalization of the LMS algorithm
- Both are steepest-descent algorithms based on an approximate squared error
- Backpropagation becomes the LMS algorithm when applied to the ADALINE network
- The main difference is how the gradients are calculated
- In practice, backpropagation is more powerful
- Note that nonlinearity is essential for multilayer neural networks
- However, there is no guarantee that the backpropagation algorithm will converge to the globally optimal solution
31. Multilayer Network
32. Performance Index
Training Set
Mean Square Error
Vector Case
Approximate Mean Square Error (Single Sample)
Approximate Steepest Descent
33. Chain Rule
Example
Application to Gradient Calculation
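As an illustration (this particular example is a hypothetical choice, shown only to make the pattern concrete):

$$ f(n) = e^{n}, \quad n = 2w \;\Longrightarrow\; \frac{df}{dw} = \frac{df}{dn}\,\frac{dn}{dw} = e^{2w} \cdot 2 $$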
34. Gradient Calculation
Sensitivity
Gradient
35. Steepest Descent
Next Step: Compute the Sensitivities (Backpropagation)
36. Jacobian Matrix
37. Backpropagation (Sensitivities)
The sensitivities are computed by starting at the last layer and then propagating backwards through the network to the first layer.
38. Initialization (Last Layer)
39. Summary
Forward Propagation
Backpropagation
Weight Update
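The three sets of equations were images; in the standard multilayer notation they read:

Forward propagation:
$$ a^0 = p, \qquad a^{m+1} = f^{m+1}\big(W^{m+1} a^m + b^{m+1}\big), \quad m = 0, 1, \dots, M-1, \qquad a = a^M $$

Backpropagation of sensitivities:
$$ s^M = -2 \dot{F}^M(n^M)\,(t - a), \qquad s^m = \dot{F}^m(n^m)\,(W^{m+1})^{\mathsf T} s^{m+1}, \quad m = M-1, \dots, 1 $$

Weight update:
$$ W^m(k+1) = W^m(k) - \alpha\, s^m (a^{m-1})^{\mathsf T}, \qquad b^m(k+1) = b^m(k) - \alpha\, s^m $$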
40. Example: Function Approximation
[Figure: 1-2-1 network with input p, output a, target t, and error e = t - a]
41. Network
42. Initial Conditions
43. Forward Propagation
44. Transfer Function Derivatives
45. Backpropagation
46. Weight Update
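Slides 42-46 stepped through one training iteration with equations that did not survive extraction; below is a minimal NumPy sketch of the same forward/backpropagate/update cycle for a 1-2-1 network (log-sigmoid hidden layer, linear output). The initial values and sample are hypothetical stand-ins in the spirit of the in-class example:

```python
import numpy as np

def logsig(n):
    """Log-sigmoid transfer function for the hidden layer."""
    return 1.0 / (1.0 + np.exp(-n))

# Hypothetical 1-2-1 network parameters and a single training sample.
W1 = np.array([[-0.27], [-0.41]]); b1 = np.array([[-0.48], [-0.13]])
W2 = np.array([[0.09, -0.17]]);    b2 = np.array([[0.48]])
p = np.array([[1.0]]); t = np.array([[1.0 + np.sin(np.pi / 4)]])
alpha = 0.1

# Forward propagation.
a1 = logsig(W1 @ p + b1)
a2 = W2 @ a1 + b2                      # linear output layer
e = t - a2

# Backpropagation of sensitivities.
# Derivative of logsig is a(1 - a); derivative of the linear output is 1.
s2 = -2 * 1.0 * e
s1 = (a1 * (1 - a1)) * (W2.T @ s2)

# Weight update (steepest descent).
W2 -= alpha * s2 @ a1.T; b2 -= alpha * s2
W1 -= alpha * s1 @ p.T;  b1 -= alpha * s1
print(W1.ravel(), b1.ravel(), W2.ravel(), b2.ravel())
```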
47. Other Gradient Descent Algorithms
- The steepest gradient descent algorithm can be used to derive learning algorithms for different kinds of networks
- The key step is how to calculate the gradients
48. Other Gradient Descent Algorithms
49. Other Gradient Descent Algorithms
50. Practical Issues
- While in theory a multilayer neural network with nonlinear transfer functions trained using backpropagation is sufficient to approximate any function or solve any recognition problem, there are practical issues:
- What is the optimal architecture for a particular problem/application?
- What is the performance on unknown test data?
- Will the network converge to a good solution?
- How long does it take to train a network?
51. Choice of Architecture
[Figure: 1-3-1 network fits for i = 1, 2, 4, 8]
52. Choice of Network Architecture
[Figure: fits for 1-2-1, 1-3-1, 1-4-1, and 1-5-1 networks]
53. The Issue of Generalization
- We are interested in whether a neural network trained only on a training set will work well on novel, unseen test data
- For example, for face recognition, a neural network that can only recognize the faces in the training set is not very useful
- Generalization is one of the most fundamental problems in neural networks and many other recognition techniques
54. Generalization
[Figure: 1-2-1 network fit vs. 1-9-1 network fit]
55. Improving Generalization
- Heuristics
- A neural network should have fewer parameters than the number of data points in the training set
- Simpler neural networks should be preferred over complicated ones, known as Occam's razor
- Use more domain-specific knowledge
- Cross validation
- Divide the labeled examples into training and validation sets
- Stop training when the error on the validation set increases
56. Cross Validation
57. Convergence Issues
- A neural network may converge to a bad solution
- Train several neural networks from different initial conditions
58. Convergence
59. Convergence Issues
- A neural network may converge to a bad solution
- Train several neural networks from different initial conditions
- Convergence can be slow
- Practical techniques
- Variations of the basic backpropagation algorithm
60. Practical Techniques for Improving Backpropagation
- Scaling inputs
- We can standardize each feature component to have zero mean and the same variance
- Target values
- For pattern recognition applications, use 1 for the target category and -1 for the non-target category
- Training with noise
61. Practical Techniques for Improving Backpropagation
- Manufacturing data
- If we have knowledge about the sources of variation among the inputs, we can manufacture additional training data
- For face detection, we can rotate and enlarge/shrink the training images
- Initializing weights
- If we use standardized data, we want both positive and negative weights, drawn from a uniform distribution
- Uniform learning
62. Practical Techniques for Improving Backpropagation
- Training protocols
- An epoch corresponds to a single presentation of all the patterns in the training set
- Stochastic training
- Training samples are chosen randomly from the set, and the weights are updated after each sample
- Batch training
- All the training samples are presented to the network before the weights are updated
- On-line training
- Each training sample is presented once and only once
- There is no memory for storing training samples
63. Speeding up Convergence
- Heuristics
- Momentum
- Variable learning rate
- Conjugate gradient
- Second-order methods
- Newton's method
- Levenberg-Marquardt algorithm
64. Performance Surface Example
Network Architecture
Nominal Function
Parameter Values
65. Squared Error vs. w1_{1,1} and w2_{1,1}
66. Momentum
Filter
Example
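The filter and the resulting update appeared as equations; in the standard momentum formulation they are:

First-order filter:
$$ y(k) = \gamma\, y(k-1) + (1 - \gamma)\, w(k), \qquad 0 \le \gamma < 1 $$

Momentum backpropagation update:
$$ \Delta W^m(k) = \gamma\, \Delta W^m(k-1) - (1 - \gamma)\,\alpha\, s^m (a^{m-1})^{\mathsf T} $$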
67. Momentum Backpropagation
[Figure: trajectories for Steepest Descent Backpropagation (SDBP) and Momentum Backpropagation (MOBP)]
68. Momentum Backpropagation
Using standard BP
69. Momentum Backpropagation
70. Variable Learning Rate (VLBP)
- If the squared error (over the entire training set) increases by more than some set percentage ζ after a weight update, then the weight update is discarded, the learning rate is multiplied by some factor ρ (0 < ρ < 1), and the momentum coefficient γ is set to zero.
- If the squared error decreases after a weight update, then the weight update is accepted and the learning rate is multiplied by some factor η > 1. If γ has been previously set to zero, it is reset to its original value.
- If the squared error increases by less than ζ, then the weight update is accepted, but the learning rate and the momentum coefficient are unchanged.
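A self-contained sketch of this three-case logic, exercised on a simple quadratic error surface; the parameter values (ζ = 0.04, ρ = 0.7, η = 1.05, γ = 0.9) are illustrative assumptions, not values from the lecture:

```python
import numpy as np

# VLBP demo on F(w) = w.w, a stand-in for the squared error surface.
F = lambda w: float(w @ w)
grad = lambda w: 2 * w

w = np.array([1.5, -2.0]); dw = np.zeros_like(w)
alpha, gamma, gamma0 = 0.4, 0.9, 0.9
zeta, rho, eta = 0.04, 0.7, 1.05
err = F(w)

for _ in range(50):
    dw_try = gamma * dw - (1 - gamma) * alpha * grad(w)   # momentum step
    err_try = F(w + dw_try)
    if err_try > err * (1 + zeta):
        alpha *= rho; gamma = 0.0          # error grew too much: reject, shrink rate
    else:
        if err_try < err:
            alpha *= eta                   # error decreased: accept, grow rate
            if gamma == 0.0:
                gamma = gamma0             # restore momentum if it was zeroed
        # (error grew by less than zeta: accept, leave alpha and gamma alone)
        w, dw, err = w + dw_try, dw_try, err_try

print(w, err)   # w should approach the minimum at the origin
```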
71. VLBP Example
72. VLBP Example
73. Associative Learning
- The goal is to learn associations between a system's inputs and its outputs
- In this chapter the association is learned between things that occur simultaneously
- The inputs, also called stimuli, are divided into:
- Unconditioned inputs (whose weights are fixed), corresponding to the food presented to the dog in Pavlov's experiment
- Conditioned inputs (whose weights will be learned), corresponding to the bell in Pavlov's experiment
74. Unsupervised Learning
- In unsupervised learning, the network's weights and biases are updated according to the inputs only
- There are no target values for the input patterns
- The training now consists of a sequence of input patterns p(1), p(2), ..., p(Q)
75. Simple Associative Network
76. Banana Associator
Unconditioned Stimulus
Conditioned Stimulus
77. Unsupervised Hebb Rule
Vector Form
- Local learning rule: a rule that uses only signals available within the layer containing the weights being updated
- Is backpropagation a local learning rule?
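The rule itself appeared as equations; the standard unsupervised Hebb rule is:

$$ w_{ij}(q) = w_{ij}(q-1) + \alpha\, a_i(q)\, p_j(q), \qquad W(q) = W(q-1) + \alpha\, a(q)\, p^{\mathsf T}(q) $$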
78. Banana Recognition Example
Initial Weights
Training Sequence
α = 1
First Iteration (sight fails)
79. Example
Second Iteration (sight works)
Third Iteration (sight fails)
The banana will now be detected if either sensor works.
80. Problems with the Hebb Rule
- Weights can become arbitrarily large when inputs are presented again and again
- There is no mechanism for weights to decrease
- Noise in the inputs or outputs can cause the network to respond to any stimulus
81. Hebb Rule with Decay
This keeps the weight matrix from growing without bound, which can be demonstrated by setting both ai and pj to 1:
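A standard reconstruction of the decay rule and the bound it implies:

$$ W(q) = W(q-1) + \alpha\, a(q)\, p^{\mathsf T}(q) - \gamma\, W(q-1) $$

Setting ai = pj = 1 at the fixed point:
$$ w_{ij}^{\max} = (1 - \gamma)\, w_{ij}^{\max} + \alpha \;\Longrightarrow\; w_{ij}^{\max} = \frac{\alpha}{\gamma} $$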
82. Example: Banana Associator
γ = 0.1, α = 1
First Iteration (sight fails)
Second Iteration (sight works)
83. Example
Third Iteration (sight fails)
Hebb Rule
Hebb with Decay
84. Problem of Hebb with Decay
- Associations will decay away if stimuli are not occasionally presented.
If ai = 0, then wij(q) = (1 - γ) wij(q-1)
If γ = 0.1, this becomes wij(q) = 0.9 wij(q-1)
Therefore the weight decays by 10% at each iteration where there is no stimulus.
85. Instar Network
- Instar network
- Architecture-wise, identical to the simple perceptron network
- A single-layer network
- However, in the instar, the bias is given and the weights are learned using the instar rule
86. Instar (Recognition Network)
87. Instar Operation
The instar will be active when
or
For normalized vectors, the largest inner product occurs when the angle between the weight vector and the input vector is zero -- the input vector is equal to the weight vector.
The rows of a weight matrix represent patterns to be recognized.
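The two activation conditions above appeared as equations; they are standardly written as:

$$ {}_{1}w^{\mathsf T} p \ge -b \qquad\text{or}\qquad \lVert {}_{1}w \rVert\, \lVert p \rVert \cos\theta \ge -b $$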
88. Vector Recognition
If we set b = -||1w|| ||p||, the instar will only be active when θ = 0.
If we set b > -||1w|| ||p||, the instar will be active for a range of angles.
As b is increased, there will be more patterns (over a wider range of θ) that activate the instar.
89. Instar Rule
Hebb rule
Modify it so that learning and forgetting occur only when the neuron is active - the instar rule
or
Vector Form
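The missing equations, in the standard instar formulation:

$$ w_{ij}(q) = w_{ij}(q-1) + \alpha\, a_i(q)\big(p_j(q) - w_{ij}(q-1)\big) $$

Vector form:
$$ {}_{i}w(q) = {}_{i}w(q-1) + \alpha\, a_i(q)\big(p(q) - {}_{i}w(q-1)\big) $$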
90. Graphical Representation
For the case where the instar is active (ai = 1): iw(q) = iw(q-1) + α(p(q) - iw(q-1)), or iw(q) = (1 - α) iw(q-1) + α p(q)
For the case where the instar is inactive (ai = 0): iw(q) = iw(q-1)
91. Example
92. Training
First Iteration (a = 1)
93. Further Training
94. Outstar (Recall Network)
95. Outstar Operation
Suppose we want the outstar to recall a certain pattern a* whenever the input p = 1 is presented to the network. Let W = a*.
Then, when p = 1, a = satlins(Wp) = satlins(a*) = a* (assuming the elements of a* lie in [-1, 1]), and the pattern is correctly recalled.
The columns of a weight matrix represent patterns to be recalled.
96. Outstar Rule
For the instar rule, we made the weight decay term of the Hebb rule proportional to the output of the network. For the outstar rule, we make the weight decay term proportional to the input of the network.
If we make the decay rate γ equal to the learning rate α, we obtain the outstar rule.
Vector Form
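The missing equations, in the standard outstar formulation:

$$ w_{ij}(q) = w_{ij}(q-1) + \alpha\, a_i(q)\, p_j(q) - \gamma\, w_{ij}(q-1)\, p_j(q) $$

With γ = α (element and vector forms, where wj is the j-th column of W):
$$ w_{ij}(q) = w_{ij}(q-1) + \alpha \big(a_i(q) - w_{ij}(q-1)\big) p_j(q), \qquad w_j(q) = w_j(q-1) + \alpha \big(a(q) - w_j(q-1)\big) p_j(q) $$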
97. Example: Pineapple Recall
98. Definitions
99. Iteration 1
α = 1
100. Convergence
101. Hamming Network
102. Hamming Network cont.
- Layer 1
- Consists of multiple instar neurons, so that it can recognize more than one pattern
- The output of each neuron is the inner product between its weight vector (a prototype) and the input vector
- The output of the first layer indicates the correlation between each prototype pattern and the input vector
- It is feedforward
103. Layer 1 (Correlation)
We want the network to recognize the following prototype vectors.
The first layer weight matrix and bias vector are given by
The response of the first layer is
The prototype closest to the input vector produces the largest response.
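The matrices themselves were images; the standard two-prototype construction (for R-dimensional inputs and prototypes p1, p2) is:

$$ W^1 = \begin{bmatrix} p_1^{\mathsf T} \\ p_2^{\mathsf T} \end{bmatrix}, \qquad b^1 = \begin{bmatrix} R \\ R \end{bmatrix}, \qquad a^1 = W^1 p + b^1 = \begin{bmatrix} p_1^{\mathsf T} p + R \\ p_2^{\mathsf T} p + R \end{bmatrix} $$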
104. Hamming Network cont.
- Layer 2
- It is a recurrent network, called a competitive network
- The neurons in this layer compete with each other to determine a winner
- After the competition, only one neuron will have a nonzero output
- The winning neuron indicates which category of input was presented to the network
105. Layer 2 (Competition)
The second layer is initialized with the output of the first layer.
The neuron with the largest initial condition will win the competition.
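The recurrence appeared as equations; the standard competitive-layer dynamics for S neurons (W2 shown here for S = 2) are:

$$ a^2(0) = a^1, \qquad a^2(t+1) = \mathrm{poslin}\big(W^2 a^2(t)\big), \qquad W^2 = \begin{bmatrix} 1 & -\varepsilon \\ -\varepsilon & 1 \end{bmatrix}, \quad 0 < \varepsilon < \frac{1}{S-1} $$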
106. Hamming Network cont.
- Lateral inhibition
- This competition is called a winner-take-all competition
- Because the neuron with the largest value decreases the slowest, it remains positive when all the others have become zero
- What will happen if there are ties?
107. Classification Example
108. Competitive Layer
- In a competitive layer, each neuron excites itself and inhibits all the other neurons
- A transfer function can do the job of a recurrent competitive layer
- It works by finding the neuron with the largest net input and setting its output to 1 (in case of ties, the neuron with the lowest index wins); all other outputs are set to 0
109. Competitive Layer
110. Competitive Learning
- A learning rule to train the weights in a competitive network
- Instar rule
- In other words (see the Kohonen rule and the sketch below):
- For the competitive network, the winning neuron has an output of 1, and the other neurons have an output of 0.
111. Competitive Learning
Kohonen Rule
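The rule appeared as an equation; for the winning neuron i* the standard Kohonen rule reads:

$$ {}_{i^*}w(q) = {}_{i^*}w(q-1) + \alpha\big(p(q) - {}_{i^*}w(q-1)\big) = (1 - \alpha)\, {}_{i^*}w(q-1) + \alpha\, p(q) $$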
112. Graphical Representation
113. Example
114. Four Iterations
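Slides 112-114 showed the iterations graphically; here is a minimal NumPy sketch of the same competitive training loop, assuming hypothetical two-cluster data and two neurons:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 2-D inputs in two rough clusters, normalized to unit length.
P = rng.normal(loc=[[1.0, 1.0]] * 10 + [[-1.0, 1.0]] * 10, scale=0.2)
P /= np.linalg.norm(P, axis=1, keepdims=True)

# Two competing neurons; initializing prototypes from data points avoids dead units.
W = P[[0, 10]].copy()
alpha = 0.5

for _ in range(4):                              # a few passes over the data
    for p in P:
        i_star = np.argmax(W @ p)               # competition: largest inner product wins
        W[i_star] += alpha * (p - W[i_star])    # Kohonen rule: winner moves toward p
        W[i_star] /= np.linalg.norm(W[i_star])  # keep the prototype normalized

print(W)    # each row should end up near one cluster's center direction
```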
115. Problems with Competitive Layers
- Choice of learning rate
- A learning rate near zero results in slow but stable learning
- A learning rate near one results in fast learning, but the weights may oscillate
- Stability problems arise when clusters are close to each other
- Dead neurons
- A neuron whose initial weight vector is so far from any input vector that it never wins the competition
- The number of classes must be known
- These limitations can be overcome by feature maps, LVQ networks, and ART networks
116. Choice of Learning Rates
- When the learning rate is small, learning is stable but slow
- When the learning rate is close to 1, learning is fast but may be unstable
- An adaptive learning rate can be used
- Start with a large learning rate and gradually decrease it
117. Stability
If the input vectors don't fall into nice clusters, then for large learning rates the presentation of each input vector may modify the configuration, so that the system undergoes continual evolution.
[Figure: input vectors p1-p8 with weight vectors 1w and 2w shown at iterations 0 and 8]
118. Another Stability Example
119. Typical Convergence (Clustering)
[Figure: weights and input vectors, before and after training]
120. Dead Units
One problem with competitive learning is that neurons with initial weights far from any input vector may never win.
121. Dead Units cont.
- Solution
- Add a negative bias to each neuron, and increase the magnitude of the bias as the neuron wins
- This makes it harder for a neuron to win if it has won often
- This is called a conscience
122. Competitive Layers in Biology
On-Center/Off-Surround Connections for Competition
Weights in the competitive layer of the Hamming
network
Weights assigned based on distance
123. Mexican-Hat Function
124. Feature Maps
Update the weight vectors in a neighborhood of the winning neuron.
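The update appeared as an equation; the standard SOFM rule updates every neuron i in the neighborhood of the winner i*:

$$ {}_{i}w(q) = {}_{i}w(q-1) + \alpha \big(p(q) - {}_{i}w(q-1)\big), \qquad i \in N_{i^*}(d) = \{\, j : \mathrm{dist}(i^*, j) \le d \,\} $$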
125. Example
126. Self-Organizing Feature Maps cont.
127. Self-Organizing Feature Maps cont.
128. Self-Organizing Feature Maps cont.
129. Self-Organizing Feature Maps cont.
130. Improving SOFM
- Speeding up the convergence of SOFM
- Variable neighborhood size
- Use a larger neighborhood size initially and gradually reduce it until it includes only the winning neuron
- Variable learning rate
- Use a larger learning rate initially (close to 1) and decrease it toward 0 asymptotically
- Let the winning neuron use a larger rate than the neighboring ones
- One can use distance as the net input instead of the inner product
131. Learning Vector Quantization
The net input is not computed by taking an inner product of the prototype vectors with the input. Instead, the net input is the negative of the distance between the prototype vectors and the input.
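Written out (a standard reconstruction), the first-layer net input and output of the LVQ network are:

$$ n^1_i = -\lVert p - {}_{i}w^1 \rVert, \qquad a^1 = \mathrm{compet}(n^1) $$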
132. Subclass
For the LVQ network, the winning neuron in the first layer indicates the subclass that the input vector belongs to. There may be several different neurons (subclasses) that make up each class.
The second layer of the LVQ network combines subclasses into a single class. The columns of W2 represent subclasses, and the rows represent classes. W2 has a single 1 in each column, with the other elements set to zero. The row in which the 1 occurs indicates which class the corresponding subclass belongs to.
133. Example
Subclasses 1, 3, and 4 belong to class 1. Subclass 2 belongs to class 2. Subclasses 5 and 6 belong to class 3.
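Written out, this class assignment is a direct transcription into the second-layer matrix (rows = classes, columns = subclasses):

$$ W^2 = \begin{bmatrix} 1 & 0 & 1 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 \end{bmatrix} $$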
A single-layer competitive network can create convex classification regions. The second layer of the LVQ network can combine the convex regions to create more complex categories.
134. LVQ Design Example
135. LVQ Design Example
136. LVQ Design Example
137. LVQ Learning
LVQ learning combines competitive learning with supervision. It requires a training set of examples of proper network behavior.
If the input pattern is classified correctly, then move the winning weight vector toward the input vector according to the Kohonen rule.
If the input pattern is classified incorrectly, then move the winning weight vector away from the input vector.
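A minimal sketch of this supervised rule (LVQ1); the prototypes, the subclass-to-class matrix, and the sample below are hypothetical:

```python
import numpy as np

def lvq1_step(W1, W2, p, t, alpha):
    """One LVQ learning step. W1 holds prototype rows, W2 maps
    subclasses to classes, t is the one-hot target class vector."""
    i_star = np.argmin(np.linalg.norm(W1 - p, axis=1))  # winner: closest prototype
    k_star = np.argmax(W2[:, i_star])                   # class of the winning subclass
    if t[k_star] == 1:
        W1[i_star] += alpha * (p - W1[i_star])  # correct: move prototype toward input
    else:
        W1[i_star] -= alpha * (p - W1[i_star])  # wrong: move prototype away from input
    return W1

# Hypothetical setup: four prototypes, two classes
# (subclasses 1, 2 -> class 1; subclasses 3, 4 -> class 2).
W1 = np.array([[-1.0, -1.0], [1.0, 1.0], [-1.0, 1.0], [1.0, -1.0]])
W2 = np.array([[1, 1, 0, 0], [0, 0, 1, 1]], dtype=float)
W1 = lvq1_step(W1, W2, p=np.array([0.8, 0.9]), t=np.array([1, 0]), alpha=0.5)
print(W1)
```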
138. Example
139. First Iteration
140. Second Layer
This is the correct class; therefore, the weight vector is moved toward the input vector.
141. Figure
142. Final Decision Regions
143. LVQ2
If the winning neuron in the hidden layer incorrectly classifies the current input, we move its weight vector away from the input vector, as before. However, we also adjust the weights of the closest neuron to the input vector that does classify it properly; the weights of this second neuron should be moved toward the input vector.
When the network correctly classifies an input vector, the weights of only one neuron are moved toward the input vector. However, if the input vector is incorrectly classified, the weights of two neurons are updated: one weight vector is moved away from the input vector, and the other is moved toward it.
The resulting algorithm is called LVQ2.
144. LVQ2 Example