Transcript and Presenter's Notes

Title: Outline


1
Outline
  • Announcement
  • Midterm Review

2
Announcement
  • The second exam will be on Nov. 17, 2004
  • Please come to class a bit earlier so that we can
    start on time
  • I will be here about 10 minutes before class
  • When most of you are present, we will start so
    that you can have some extra time if you need it
  • You need to bring a calculator
  • The exam is closed-book, closed-note, and
    closed-neighbors
  • A double-sided sheet of notes no larger than
    letter size is allowed

3
Hopfield Network
  • A closely related neural network is the Hopfield
    network
  • It is a recurrent network
  • As an associative memory, it is more powerful
    than the one-layer feed-forward networks

4
Hopfield Network Architecture cont.
  • Here each neuron is a simple perceptron neuron
    with the hardlims transfer function

5
Hopfield Network Architecture
  • It is a single-layer recurrent network
  • Each neuron is a perceptron unit with a hardlims
    transfer function

6
Hopfield Network as Associative Memory
  • One pattern p1
  • The condition for the pattern to be stable is
  • This can be satisfied by

7
Hopfield Network cont.
  • Three cases when presenting a pattern p to a
    Hopfield network that stored only one pattern p1
  • p1 will be recalled perfectly if h(p, p1) < R/2
  • -p1 will be recalled if h(p, p1) > R/2
  • What will happen if h(p, p1) = R/2?
  • Here h(p, p1) is the Hamming distance between p
    and p1 (see the sketch below)
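A minimal sketch of this single-pattern case, assuming bipolar (±1)
components, outer-product storage W = p1 p1ᵀ with zero bias, and
hardlims recall; the pattern values are hypothetical:

```python
import numpy as np

def hardlims(n):
    return np.where(n >= 0, 1, -1)

p1 = np.array([1, -1, 1, -1, 1, -1])   # hypothetical stored pattern, R = 6
W = np.outer(p1, p1)                   # outer-product (Hebbian) storage

p = p1.copy()
p[0] = -p[0]                           # noisy probe: h(p, p1) = 1 < R/2
a = p
for _ in range(10):                    # iterate until the state is stable
    a_next = hardlims(W @ a)
    if np.array_equal(a_next, a):
        break
    a = a_next
print(a)                               # recovers p1
```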

8
Hopfield Network as Associative Memory
  • Many patterns
  • Matrix form

9
Hopfield Network cont.
  • Storage capacity of a Hopfield network
  • For random patterns, we can estimate the maximum
    number of patterns a Hopfield network can store
    with an acceptable error
  • This depends on how we define acceptable error

10
Hopfield Network cont.
  • Here the acceptable error is the error of each bit

11
Hopfield Network cont.
  • If we require the error for each pattern
  • For error < 0.01/R, Qmax ≈ R/(2 log R)
  • If we require the error for all the patterns
  • For error < 0.01/(QR), Qmax ≈ R/(4 log R)

12
Hopfield Network cont.
  • Spurious states
  • Hopfield networks can have stable configurations
    other than the given patterns
  • Reversed states
  • Mixture states
  • For a large Q, there are also stable
    configurations not correlated with any of the
    stored patterns

13
Widrow-Hoff Learning
  • The neural network we will discuss here is
    called ADALINE
  • It is very similar to the perceptron except that
    its transfer function is linear
  • It is the same as the linear associator
  • ADALINE, with its learning algorithm LMS, is
    widely used in digital signal processing

14
ADALINE Network
15
Two-Input ADALINE
16
Mean Square Error
Training Set
Input
Target
Notation
Mean Square Error
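The equations for this slide were images in the original; a
reconstruction following the standard ADALINE development (the exact
slide notation may differ) is:

$$\{p_1, t_1\}, \{p_2, t_2\}, \ldots, \{p_Q, t_Q\}$$

$$x = \begin{bmatrix} {}_1w \\ b \end{bmatrix}, \qquad
z = \begin{bmatrix} p \\ 1 \end{bmatrix}, \qquad
a = {}_1w^T p + b = x^T z$$

$$F(x) = E[e^2] = E[(t - a)^2] = E[(t - x^T z)^2]$$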
17
Error Analysis
If a unique solution exists, it will be given by
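The missing expressions here, reconstructed from the same development:
expanding the mean square error as a quadratic in x gives

$$F(x) = c - 2 x^T h + x^T R x, \qquad h = E[t\,z], \quad R = E[z z^T],$$

so if R is nonsingular the unique stationary point is

$$x^* = R^{-1} h.$$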
18
Approximate Steepest Descent
Approximate mean square error (one sample)
Approximate (stochastic) gradient
19
Approximate Gradient Calculation
20
LMS Algorithm
21
Multiple-Neuron Case
Matrix Form
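The update equations for slides 20 and 21 were images; in the standard
LMS form they read

$$W(k+1) = W(k) + 2\alpha\, e(k)\, p^T(k), \qquad
b(k+1) = b(k) + 2\alpha\, e(k).$$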
22
Properties and Advantages of the LMS Algorithm
  • Compared to the analytical solution, the LMS
    algorithm provides some advantages
  • It does not require calculating the inverse of a
    (potentially large) matrix
  • It is more flexible in that it does not require
    all the training examples to be available at the
    beginning
  • On-line learning is possible (see the sketch
    below)
  • Compared to backpropagation, the LMS algorithm
    converges to a unique solution as long as the
    learning rate is not too large
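A minimal sketch of on-line LMS training for a single-output ADALINE,
using the update above; the prototype vectors and targets are assumed
to match the banana/apple example that follows:

```python
import numpy as np

def lms_train(patterns, targets, alpha=0.04, epochs=20):
    R = patterns.shape[1]
    W, b = np.zeros((1, R)), np.zeros(1)
    for _ in range(epochs):
        for p, t in zip(patterns, targets):   # one sample at a time
            a = W @ p + b                     # linear (purelin) output
            e = t - a                         # error for this sample
            W += 2 * alpha * np.outer(e, p)   # LMS weight update
            b += 2 * alpha * e                # LMS bias update
    return W, b

P = np.array([[-1.0, 1.0, -1.0],   # banana (assumed prototype)
              [ 1.0, 1.0, -1.0]])  # apple (assumed prototype)
T = np.array([-1.0, 1.0])
W, b = lms_train(P, T)
```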

23
Learning Rate
  • Note that here the learning rate is important
  • If the learning rate is too small, it will take a
    lot of iterations for the algorithm to converge
  • If the learning rate is too large, the algorithm
    may not converge

24
Example
25
Conditions for Stability
(where λi is an eigenvalue of R)
Therefore the stability condition simplifies to
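The stability equations were images; reconstructed from the standard
analysis of steepest descent on the mean square error, they should be

$$\left|1 - 2\alpha \lambda_i\right| < 1 \;\;\text{for all } i
\qquad \Longrightarrow \qquad 0 < \alpha < \frac{1}{\lambda_{\max}}.$$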
26
Example
Banana
Apple
27
Iteration One
Banana
28
Iteration Two
Apple
29
Iteration Three
30
Backpropagation
  • Backpropagation is a direct generalization of the
    LMS algorithm
  • Both are steepest-descent algorithms based on an
    approximate squared error
  • Backpropagation reduces to the LMS algorithm when
    applied to an ADALINE network
  • The main difference is how the gradients are
    calculated
  • In practice, backpropagation is more powerful
  • Note that nonlinearity is essential for
    multilayer neural networks
  • However, there is no guarantee that the
    backpropagation algorithm will converge to the
    globally optimal solution

31
Multilayer Network
32
Performance Index
Training Set
Mean Square Error
Vector Case
Approximate Mean Square Error (Single Sample)
Approximate Steepest Descent
33
Chain Rule
Example
Application to Gradient Calculation
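The chain-rule equations were images; applied to the gradient of the
approximate squared error with respect to a weight and a bias in layer
m, they presumably read

$$\frac{\partial \hat F}{\partial w^m_{i,j}} =
\frac{\partial \hat F}{\partial n^m_i} \cdot
\frac{\partial n^m_i}{\partial w^m_{i,j}}, \qquad
\frac{\partial \hat F}{\partial b^m_i} =
\frac{\partial \hat F}{\partial n^m_i} \cdot
\frac{\partial n^m_i}{\partial b^m_i}.$$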
34
Gradient Calculation
Sensitivity
Gradient
35
Steepest Descent
Next Step Compute the Sensitivities
(Backpropagation)
36
Jacobian Matrix
37
Backpropagation (Sensitivities)
The sensitivities are computed by starting at the
last layer, and then propagating backwards
through the network to the first layer.
38
Initialization (Last Layer)
39
Summary
Forward Propagation
Backpropagation
Weight Update
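A minimal sketch of this forward/backward/update cycle for the 1-2-1
example that follows (logsig hidden layer, linear output); the initial
weights, input, and target here are hypothetical:

```python
import numpy as np

def logsig(n):
    return 1.0 / (1.0 + np.exp(-n))

W1 = np.array([[-0.27], [-0.41]]); b1 = np.array([-0.48, -0.13])
W2 = np.array([[0.09, -0.17]]);    b2 = np.array([0.48])
alpha, p, t = 0.1, np.array([1.0]), np.array([1.0])

# Forward propagation
a1 = logsig(W1 @ p + b1)
a2 = W2 @ a1 + b2                       # purelin output layer

# Backpropagation of sensitivities, starting at the last layer
s2 = -2 * (t - a2)                      # purelin derivative is 1
s1 = (a1 * (1 - a1)) * (W2.T @ s2)      # logsig derivative is a1(1 - a1)

# Weight update: W <- W - alpha * s * (a_prev)^T
W2 -= alpha * np.outer(s2, a1); b2 -= alpha * s2
W1 -= alpha * np.outer(s1, p);  b1 -= alpha * s1
```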
40
Example Function Approximation
[Figure: 1-2-1 network approximating a function; the error e = t - a
drives the weight updates]
1-2-1 Network
41
Network
42
Initial Conditions
43
Forward Propagation
44
Transfer Function Derivatives
45
Backpropagation
46
Weight Update
47
Other Gradient Descent Algorithms
  • The steepest gradient descent algorithm can be
    used to derive learning algorithms for different
    kinds of networks
  • The key step is how to calculate the gradients

48
Other Gradient Descent Algorithms
49
Other Gradient Descent Algorithms
50
Practical Issues
  • While in theory a multilayer neural network with
    nonlinear transfer functions trained using
    backpropagation is sufficient to approximate any
    function or solve any recognition problem, there
    are practical issues
  • What is the optimal architecture for a particular
    problem/application?
  • What is the performance on unknown test data?
  • Will the network converge to a good solution?
  • How long does it take to train a network?

51
Choice of Architecture
1-3-1 Network
[Figure: target functions for i = 1, 2, 4, 8]
52
Choice of Network Architecture
[Figure: approximations by 1-2-1, 1-3-1, 1-4-1, and 1-5-1 networks]
53
The Issue of Generalization
  • We are interested in whether a neural network
    trained only on a training set will work well on
    novel, unseen test data
  • For example, a face-recognition network that can
    only recognize the faces in its training set is
    not very useful
  • Generalization is one of the most fundamental
    problems in neural networks and many other
    recognition techniques

54
Generalization
[Figure: function fits by a 1-2-1 network and a 1-9-1 network]
55
Improving Generalization
  • Heuristics
  • A neural network should have fewer parameters
    than the number of data points in the training
    set
  • Simpler neural networks should be preferred over
    complicated ones, a principle known as Occam's
    razor
  • More domain specific knowledge
  • Cross validation
  • Divide the labeled examples into training and
    validation sets
  • Stop training when the error on the validation
    set increases

56
Cross Validation
57
Convergence Issues
  • A neural network may converge to a bad solution
  • Train several neural networks from different
    initial conditions

58
Convergence
[Figure: squared-error contour with two training trajectories,
iteration points labeled 0 through 5]
59
Convergence Issues
  • A neural network may converge to a bad solution
  • Train several neural networks from different
    initial conditions
  • Convergence can be slow
  • Practical techniques
  • Variations of basic backpropagation algorithms

60
Practical Techniques for Improving Backpropagation
  • Scaling input
  • We can standardize each feature component to have
    zero mean and the same variance
  • Target values
  • For pattern recognition applications, use 1 for
    the target category and -1 for non-target
    category
  • Training with noise

61
Practical Techniques for Improving Backpropagation
  • Manufacturing data
  • If we have knowledge about the sources of
    variation among the inputs, we can manufacture
    training data
  • For face detection, we can rotate and enlarge /
    shrink the training images
  • Initializing weights
  • If we use standardized data, we want both
    positive and negative initial weights, drawn
    from a uniform distribution
  • Uniform learning

62
Practical Techniques for Improving Backpropagation
  • Training protocols
  • An epoch corresponds to a single presentation of
    all the patterns in the training set
  • Stochastic training
  • Training samples are chosen randomly from the set
    and the weights are updated after each sample
  • Batch training
  • All the training samples are presented to the
    network before weights are updated
  • On-line training
  • Each training sample is presented once and only
    once
  • There is no memory for storing training samples

63
Speeding up Convergence
  • Heuristics
  • Momentum
  • Variable learning rate
  • Conjugate gradient
  • Second-order methods
  • Newton's method
  • Levenberg-Marquardt algorithm

64
Performance Surface Example
Network Architecture
Nominal Function
Parameter Values
65
Squared Error vs. w¹₁,₁ and w²₁,₁
[Figure: squared-error surface and contour plot over the two weights]
66
Momentum
Filter
Example
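The filter and update equations were images; in the usual formulation,
with γ the momentum coefficient, they are

$$y(k) = \gamma\, y(k-1) + (1 - \gamma)\, w(k), \qquad 0 \le \gamma < 1,$$

$$\Delta W^m(k) = \gamma\, \Delta W^m(k-1)
- (1 - \gamma)\, \alpha\, s^m (a^{m-1})^T, \qquad
\Delta b^m(k) = \gamma\, \Delta b^m(k-1) - (1 - \gamma)\, \alpha\, s^m.$$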
67
Momentum Backpropagation
[Figure: trajectories of Steepest Descent Backpropagation (SDBP) and
Momentum Backpropagation (MOBP) on the squared-error contour]
68
Momentum Backpropagation
Using standard BP
69
Momentum Backpropagation
70
Variable Learning Rate (VLBP)
  • If the squared error (over the entire training
    set) increases by more than some set percentage ζ
    after a weight update, then the weight update is
    discarded, the learning rate is multiplied by
    some factor ρ (0 < ρ < 1), and the momentum
    coefficient γ is set to zero.
  • If the squared error decreases after a weight
    update, then the weight update is accepted and
    the learning rate is multiplied by some factor
    η > 1. If γ has been previously set to zero, it
    is reset to its original value.
  • If the squared error increases by less than ζ,
    then the weight update is accepted, but the
    learning rate and the momentum coefficient are
    unchanged. (See the sketch after this list.)
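A minimal sketch of the three rules above, applied to a generic
parameter vector; F and grad_F are hypothetical stand-ins for the
batch squared error and its gradient, and the default ζ, ρ, η values
are typical choices, not prescribed by the slide:

```python
import numpy as np

def vlbp_step(w, dw_prev, alpha, gamma, F, grad_F,
              gamma0=0.9, zeta=0.04, rho=0.7, eta=1.05):
    old_error = F(w)
    dw = gamma * dw_prev - (1 - gamma) * alpha * grad_F(w)  # momentum step
    w_new = w + dw
    new_error = F(w_new)
    if new_error > old_error * (1 + zeta):
        # Error grew by more than zeta: discard the update, shrink the
        # learning rate, and switch momentum off.
        return w, np.zeros_like(dw), alpha * rho, 0.0
    if new_error < old_error:
        # Error decreased: accept the update, grow the learning rate,
        # and restore momentum if it had been zeroed.
        return w_new, dw, alpha * eta, gamma if gamma > 0 else gamma0
    # Error grew by less than zeta: accept the update, leave the
    # learning rate and momentum coefficient unchanged.
    return w_new, dw, alpha, gamma
```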

71
VLBP Example
[Figure: VLBP trajectory on the squared-error contour]
72
VLBP Example
73
Associative Learning
  • To learn associations between a system's inputs
    and its outputs
  • In this chapter the association is learned
    between things that occur simultaneously
  • The inputs, also called stimuli, are divided into
  • Unconditioned inputs (whose weights are fixed),
    corresponding to the food presented to the dog in
    Pavlov's experiment
  • Conditioned inputs (whose weights will be
    learned), corresponding to the bell in Pavlov's
    experiment

74
Unsupervised Learning
  • In unsupervised learning, the network's weights
    and biases are updated according to the inputs
    only
  • There are no target values for the input
    patterns
  • The training now consists of a sequence of input
    patterns, given by

75
Simple Associative Network
76
Banana Associator
Unconditioned Stimulus
Conditioned Stimulus
77
Unsupervised Hebb Rule
Vector Form
  • Local learning rule: a rule that uses only
    signals available within the layer containing the
    weights being updated
  • Is backpropagation a local learning rule? (See
    the sketch below)
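A minimal sketch of the unsupervised Hebb rule on the banana associator
that follows, with W(q) = W(q-1) + α a(q) p(q)ᵀ reduced to a scalar
conditioned weight; the fixed weight w0 = 1, the bias -0.5, and α = 1
are an assumed reading of the example:

```python
def hardlim(n):
    return 1.0 if n >= 0 else 0.0

w0, w, alpha = 1.0, 0.0, 1.0        # unconditioned / conditioned weights
# training sequence of (shape sensor p0, smell sensor p) pairs
sequence = [(0.0, 1.0), (1.0, 1.0), (0.0, 1.0)]
for p0, p in sequence:
    a = hardlim(w0 * p0 + w * p - 0.5)   # banana detector output
    w = w + alpha * a * p                # Hebb update, conditioned weight
    print(f"a = {a}, w = {w}")
```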

78
Banana Recognition Example
Initial Weights
Training Sequence
α = 1
First Iteration (sight fails)
79
Example
Second Iteration (sight works)
Third Iteration (sight fails)
Banana will now be detected if either sensor
works.
80
Problems with Hebb Rule
  • Weights can become arbitrarily large
  • When inputs are presented again and again
  • There is no mechanism for weights to decrease
  • Noise in the inputs or outputs can cause the
    network to respond to any stimulus

81
Hebb Rule with Decay
This keeps the weight matrix from growing without
bound, which can be demonstrated by setting both
ai and pj to 1: the update becomes wij(q) =
(1 - γ) wij(q-1) + α, which converges to the
maximum weight wij = α/γ.
82
Example Banana Associator
γ = 0.1
α = 1
First Iteration (sight fails)
Second Iteration (sight works)
83
Example
Third Iteration (sight fails)
Hebb Rule
Hebb with Decay
84
Problem of Hebb with Decay
  • Associations will decay away if stimuli are not
    occasionally presented.

If ai = 0, then wij(q) = (1 - γ) wij(q-1).
If γ = 0.1, this becomes wij(q) = 0.9 wij(q-1).
Therefore the weight decays by 10% at each
iteration where there is no stimulus.
85
Instar Network
  • Instar network
  • Architecture-wise, identical to the simple
    perceptron network
  • A single-layer network
  • However, in the instar, the bias is given and the
    weights are learned using the instar rule

86
Instar (Recognition Network)
87
Instar Operation
The instar will be active when 1wᵀ p ≥ -b,
or ‖1w‖ ‖p‖ cos θ ≥ -b.
For normalized vectors, the largest inner product
occurs when the angle between the weight vector
and the input vector is zero -- the input vector
is equal to the weight vector.
The rows of a weight matrix represent patterns to
be recognized.
88
Vector Recognition
If we set b = -‖1w‖ ‖p‖,
the instar will only be active when θ = 0.
If we set b greater than -‖1w‖ ‖p‖,
the instar will be active for a range of angles.
As b is increased, more patterns (over a wider
range of θ) will activate the instar.
89
Instar Rule
Hebb rule
Modify so that learning and forgetting will only
occur when the neuron is active - Instar Rule
or
Vector Form
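The rule equations were images; the standard instar rule (the Hebb rule
with its decay term gated by the output, taking the decay rate equal to
the learning rate α) reads

$$w_{ij}(q) = w_{ij}(q-1)
+ \alpha\, a_i(q)\,\big(p_j(q) - w_{ij}(q-1)\big),$$

or, in vector form for row i of the weight matrix,

$${}_i w(q) = {}_i w(q-1)
+ \alpha\, a_i(q)\,\big(p(q) - {}_i w(q-1)\big).$$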
90
Graphical Representation
For the case where the instar is active (ai = 1)
or
For the case where the instar is inactive (ai = 0)
91
Example
92
Training
First Iteration (α = 1)
93
Further Training
94
Outstar (Recall Network)
95
Outstar Operation
Suppose we want the outstar to recall a certain
pattern a whenever the input p = 1 is presented
to the network. Let
Then, when p = 1,
and the pattern is correctly recalled.
The columns of a weight matrix represent patterns
to be recalled.
96
Outstar Rule
For the instar rule we made the weight decay term
of the Hebb rule proportional to the output of
the network. For the outstar rule we make the
weight decay term proportional to the input of
the network.
If we make the decay rate γ equal to the learning
rate α,
Vector Form
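The missing equations, reconstructed on the same pattern as the instar
rule (decay gated by the input, decay rate equal to α):

$$w_{ij}(q) = w_{ij}(q-1)
+ \alpha\,\big(a_i(q) - w_{ij}(q-1)\big)\, p_j(q),$$

or, in vector form for column j of the weight matrix,

$$w_j(q) = w_j(q-1) + \alpha\,\big(a(q) - w_j(q-1)\big)\, p_j(q).$$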
97
Example - Pineapple Recall
98
Definitions
99
Iteration 1
α = 1
100
Convergence
101
Hamming Network
102
Hamming Network cont.
  • Layer 1
  • Consists of multiple instar neurons to recognize
    more than one pattern
  • The output of a neuron is the inner product of
    the weight vector (prototype) and the input
    vector
  • The output from the first layer indicates the
    correlation between the prototype pattern and the
    input vector
  • It is feedforward

103
Layer 1 (Correlation)
We want the network to recognize the following
prototype vectors
The first layer weight matrix and bias vector are
given by
The response of the first layer is
The prototype closest to the input vector
produces the largest response.
104
Hamming Network cont.
  • Layer 2
  • It is a recurrent network, called a competitive
    network
  • The neurons in this layer compete with each other
    to determine a winner
  • After competition, only one neuron will have a
    nonzero output
  • The winning neuron indicates which category of
    input was presented to the network

105
Layer 2 (Competition)
The second layer is initialized with the
output of the first layer.
The neuron with the largest initial condition
will win the competition.
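A minimal sketch of this competition, assuming the usual lateral
inhibition weights W² = I - ε(11ᵀ - I) with ε < 1/(S-1) and poslin
updates; the initial layer-1 outputs are hypothetical:

```python
import numpy as np

def poslin(n):
    return np.maximum(0.0, n)

S, eps = 3, 0.4                      # need eps < 1/(S-1)
W2 = np.eye(S) - eps * (np.ones((S, S)) - np.eye(S))

a = np.array([0.9, 0.5, 0.3])        # initialized with layer-1 output
while np.count_nonzero(a) > 1:       # iterate until one neuron survives
    a = poslin(W2 @ a)
print(a)                             # largest initial entry wins
```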
106
Hamming Network cont.
  • Lateral inhibition
  • This competition is called a winner-take-all
    competition
  • Because the one with the largest value decreases
    the slowest, it remains positive when all others
    become zero
  • What will happen if there are ties?

107
Classification Example
108
Competitive Layer
  • In a competitive layer, each neuron excites
    itself and inhibits all the other neurons
  • A transfer function can do the job of a
    recurrent competitive layer
  • It works by finding the neuron with the largest
    net input and setting its output to 1 (in case of
    ties, the neuron with the lowest index wins); all
    other outputs are set to 0

109
Competitive Layer
110
Competitive Learning
  • A learning rule to train the weights in a
    competitive network
  • Instar rule
  • In other words,
  • For the competitive network, the winning neuron
    has an output of 1, and the other neurons have an
    output of 0.

111
Competitive Learning
Kohonen Rule
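The rule equation was an image; the Kohonen rule is the instar rule
restricted to the winning neuron i*, whose output is 1:

$${}_{i^*}w(q) = {}_{i^*}w(q-1)
+ \alpha\,\big(p(q) - {}_{i^*}w(q-1)\big)
= (1-\alpha)\,{}_{i^*}w(q-1) + \alpha\, p(q),$$

with all other rows left unchanged.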
112
Graphical Representation
113
Example
114
Four Iterations
115
Problems with Competitive Layers
  • Choice of learning rate
  • A learning rate near zero results in slow but
    stable learning
  • A learning rate near one results in fast learning
    that may oscillate
  • Stability problem when clusters are close to each
    other
  • Dead neuron
  • A neuron whose initial weight vector is so far
    from any input vectors that it never wins the
    competition
  • The number of classes must be known
  • These limitations can be overcome by feature
    maps, LVQ networks, and ART networks

116
Choice of Learning Rates
  • When the learning rate is small, learning is
    stable but slow
  • When the learning rate is close to 1, learning is
    fast but may be unstable
  • An adaptive learning rate can be used
  • Start with a large learning rate and gradually
    decrease it

117
Stability
If the input vectors don't fall into nice
clusters, then for large learning rates the
presentation of each input vector may modify the
configuration, so that the system undergoes
continual evolution.
[Figure: input vectors p1 through p8 with weight vectors 1w and 2w
shown at iterations 0 and 8; the weights keep moving rather than
settling]
118
Another Stability Example
119
Typical Convergence (Clustering)
Weights
Input Vectors
Before Training
After Training
120
Dead Units
One problem with competitive learning is that
neurons with initial weights far from any input
vector may never win.
121
Dead Units cont.
  • Solution
  • Add a negative bias to each neuron, and increase
    the magnitude of the bias each time the neuron
    wins
  • This makes it harder for a neuron that has won
    often to keep winning
  • This is called a conscience

122
Competitive Layers in Biology
On-Center/Off-Surround Connections for Competition
Weights in the competitive layer of the Hamming
network
Weights assigned based on distance
123
Mexican-Hat Function
124
Feature Maps
Update weight vectors in a neighborhood of the
winning neuron.
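A minimal sketch of one feature-map update; the 1-D grid, the inner
product as the matching criterion, and the parameter values are
assumptions for illustration:

```python
import numpy as np

def sofm_step(W, p, alpha=0.5, d=1):
    winner = int(np.argmax(W @ p))        # best-matching (winning) neuron
    for i in range(W.shape[0]):
        if abs(i - winner) <= d:          # neighborhood on a 1-D grid
            W[i] += alpha * (p - W[i])    # Kohonen rule for neighbors
    return W

W = np.random.rand(5, 3)                  # 5 neurons, 3-D inputs
p = np.array([1.0, 0.0, 0.0])
W = sofm_step(W, p)
```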
125
Example
126
Self-Organizing Feature Maps cont.
127
Self-Organizing Feature Maps cont.
128
Self-Organizing Feature Maps cont.
129
Self-Organizing Feature Maps cont.
130
Improving SOFM
  • Convergence speed-up of SOFM
  • Variable neighborhood size
  • Use a larger neighborhood size initially and
    gradually reduce it until it includes only the
    winning neuron
  • Variable learning rate
  • Use a larger learning rate initially (close to 1)
    and decrease it toward 0 asymptotically
  • Let the winning neuron use a larger rate than the
    neighboring ones
  • One can use distance as the net input instead of
    the inner product

131
Learning Vector Quantization
The net input is not computed by taking an inner
product of the prototype vectors with the input.
Instead, the net input is the negative of the
distance between the prototype vectors and the
input.
132
Subclass
For the LVQ network, the winning neuron in the
first layer indicates the subclass which the
input vector belongs to. There may be several
different neurons (subclasses) which make up each
class.
The second layer of the LVQ network combines
subclasses into a single class. The columns of W2
represent subclasses, and the rows represent
classes. W2 has a single 1 in each column,
with the other elements set to zero. The row in
which the 1 occurs indicates which class the
appropriate subclass belongs to.
133
Example
Subclasses 1, 3 and 4 belong to class 1.
Subclass 2 belongs to class 2.
Subclasses 5 and 6 belong to class 3.
A single-layer competitive network can create
convex classification regions. The second layer
of the LVQ network can combine the convex regions
to create more complex categories.
134
LVQ Design Example
135
LVQ Design Example
136
LVQ Design Example
137
LVQ Learning
LVQ learning combines competitive learning with
supervision. It requires a training set of
examples of proper network behavior.
If the input pattern is classified correctly,
then move the winning weight toward the input
vector according to the Kohonen rule.
If the input pattern is classified incorrectly,
then move the winning weight away from the input
vector.
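A minimal sketch of this supervised rule (LVQ1); the prototypes, the
subclass-to-class map (playing the role of W²), and the learning rate
are hypothetical:

```python
import numpy as np

def lvq1_step(W1, labels, p, target_class, alpha=0.1):
    # winner = prototype row with the smallest distance to the input
    winner = int(np.argmin(np.linalg.norm(W1 - p, axis=1)))
    if labels[winner] == target_class:
        W1[winner] += alpha * (p - W1[winner])   # move toward the input
    else:
        W1[winner] -= alpha * (p - W1[winner])   # move away from the input
    return W1

W1 = np.array([[0.0, 1.0], [1.0, 0.0]])   # hypothetical prototype rows
labels = [0, 1]                           # subclass -> class assignment
W1 = lvq1_step(W1, np.array([0.9, 0.2]), target_class=1)
```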
138
Example
139
First Iteration
140
Second Layer
This is the correct class, therefore the weight
vector is moved toward the input vector.
141
Figure
142
Final Decision Regions
143
LVQ2
If the winning neuron in the hidden layer
incorrectly classifies the current input, we move
its weight vector away from the input vector, as
before. However, we also adjust the weights of
the closest neuron to the input vector that does
classify it properly; the weights for this second
neuron should be moved toward the input vector.

When the network correctly classifies an input
vector, the weights of only one neuron are moved
toward the input vector. However, if the input
vector is incorrectly classified, the weights of
two neurons are updated: one weight vector is
moved away from the input vector, and the other
one is moved toward the input vector. The
resulting algorithm is called LVQ2.
144
LVQ2 Example