Title: CENG 569 NEUROCOMPUTING
1. CENG 569 NEUROCOMPUTING
Erol Sahin
Dept. of Computer Engineering
Middle East Technical University
Inonu Bulvari, 06531, Ankara, TURKEY
2. The history
- 1962 - Frank Rosenblatt: back-propagating error-correction procedures, in Principles of Neurodynamics.
- 1969 - Marvin Minsky and Seymour Papert: Perceptrons, MIT Press.
- 1974 - Paul Werbos: Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. Ph.D. thesis, Harvard University.
- 1986 - D. E. Rumelhart, G. E. Hinton, and R. J. Williams: Learning Internal Representations by Error Propagation, published in Parallel Distributed Processing, Volumes I and II, by the PDP group of UCSD.
- 1986 - today: interest in neural networks is on the rise.
3. Back-propagating error correction
- The procedure described here is called the back-propagating error correction procedure since it takes its cue from the error of the R-units, propagating corrections back towards the sensory end of the network if it fails to make a satisfactory correction quickly at the response end.
- The rules for the back-propagating correction procedure are:
  - For each R-unit, set Er = R - r, where R is the required response and r is the obtained response.
  - For each association unit ai, Ei is computed as follows for each stimulus: begin with Ei = 0.
  - If ai is active, and the connection cir terminates on an R-unit with a non-zero error Er which differs in sign from vir, add -1 to Ei with probability ?i.
  - . . .
4. Perceptrons
- In the popular history of neural networks, first came the classical period of the perceptron, when it seemed as if neural networks could do anything. A hundred algorithms bloomed, a hundred schools of learning machines contended. Then came the onset of the dark ages, where suddenly, research on neural networks was unloved, unwanted, and most important, unfunded. A precipitating factor in this sharp decline was the publication of the book Perceptrons by Minsky and Papert in 1969.
- (These) authors expressed the strong belief that limitations of the kind they discovered for simple perceptrons would be held to be true for perceptron variants, more specifically, multilayer systems.
- . . . This conjecture . . . thoroughly dampened the enthusiasm of granting agencies to support future research. Why bother, since more complex versions would have the same problems? Unfortunately, this conjecture now seems to be wrong.
- AR, Introduction to Chapter 13.
5. Limitation of perceptrons
- The problem with two-layer perceptrons: they can only map similar inputs to similar outputs.
- Minsky and Papert have provided a very careful analysis of conditions under which such systems are capable of carrying out the required mappings. They show that in a large number of interesting cases, networks of this kind are incapable of solving the problems (Rumelhart, Hinton and Williams, 1986).
The XOR (parity) problem
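To make the limitation concrete, the following sketch (not part of the original slides; helper names are illustrative) runs the perceptron learning rule on the four XOR patterns. Because no line separates the two classes, at least one pattern is misclassified in every epoch, no matter how long training runs.

```python
import numpy as np

# The four XOR patterns: two inputs and a binary target.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
T = np.array([0, 1, 1, 0])

def step(net):
    return 1 if net >= 0 else 0

w, b, eta = np.zeros(2), 0.0, 0.1
for epoch in range(1000):
    errors = 0
    for x, t in zip(X, T):
        o = step(w @ x + b)          # thresholded output of the single unit
        w = w + eta * (t - o) * x    # perceptron learning rule
        b = b + eta * (t - o)
        errors += int(o != t)
    if errors == 0:                  # would stop here if XOR were linearly separable
        break

print("misclassified patterns in the final epoch:", errors)  # always >= 1 for XOR
```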
6. Hidden units can help!
- On the other hand, as Minsky and Papert also pointed out, if there is a layer of simple perceptron-like hidden units . . . there is always a recoding (i.e. an internal representation) of the input patterns in the hidden units in which the similarity of the patterns among the hidden units can support any required mapping from the input to the output units (Rumelhart, Hinton and Williams, 1986).
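As a concrete illustration of such a recoding, the sketch below wires a hidden layer by hand (the weight values are chosen for illustration, not taken from the slides): one hidden unit computes OR, the other AND, and in that hidden representation XOR becomes linearly separable for a single output unit.

```python
import numpy as np

def step(net):
    return (net >= 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer recodes the input: h1 = OR(x1, x2), h2 = AND(x1, x2).
W_hidden = np.array([[1.0, 1.0],     # weights into h1 (OR unit)
                     [1.0, 1.0]])    # weights into h2 (AND unit)
b_hidden = np.array([-0.5, -1.5])    # thresholds: 0.5 for OR, 1.5 for AND

# Output unit computes h1 AND NOT h2, which is exactly XOR.
w_out = np.array([1.0, -1.0])
b_out = -0.5

H = step(X @ W_hidden.T + b_hidden)  # hidden representation of the four patterns
y = step(H @ w_out + b_out)
print(y)                             # [0 1 1 0] -- the required XOR mapping
```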
7. Weights to hidden units: how to learn?
- The problem, as noted by Minsky and Papert, is that whereas there is a very simple guaranteed learning rule for all problems that can be solved without hidden units, namely the perceptron convergence procedure, there is no equally powerful rule for learning in networks with hidden units (Rumelhart, Hinton and Williams, 1986).
- The learning rules defined for the perceptron and ADALINE update the weights to the output units using the error between the actual output and the desired output. The challenge is: how do you compute the error for the hidden units?
8. The delta rule
- For a network without hidden units, the weight wij from input i to output unit j is changed in proportion to the output error:
  Δp wij = η (tpj - opj) ipi = η δpj ipi
  where tpj is the target and opj the obtained output of unit j for pattern p, ipi is the i-th input, and η is the learning rate.
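A minimal sketch of the rule above, assuming a single threshold unit and the linearly separable AND problem as the training set (both choices are illustrative, not from the slides):

```python
import numpy as np

# Illustrative training set: the linearly separable AND problem.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
T = np.array([0, 0, 0, 1])

eta = 0.1
w = np.zeros(2)
b = 0.0                                   # bias, treated as a weight on a constant input of 1

for epoch in range(100):
    errors = 0
    for x, t in zip(X, T):
        o = 1 if (w @ x + b) >= 0 else 0  # obtained output opj
        w = w + eta * (t - o) * x         # delta rule: eta * (tpj - opj) * ipi
        b = b + eta * (t - o)
        errors += int(o != t)
    if errors == 0:
        break

print("weights:", w, "bias:", b, "learned in", epoch + 1, "epochs")
```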
9. The generalized delta rule
- For a network with hidden units and differentiable activation functions, the weight change keeps the same form:
  Δp wij = η δpj opi
  where opi is the output of unit i, netpj = Σi wij opi, and δpj = -∂Ep/∂netpj for the pattern error Ep = ½ Σj (tpj - opj)².
10. Contd.
- For an output unit the error signal can be computed directly from the target:
  δpj = (tpj - opj) f'j(netpj)
12. For a hidden unit j, the error signal is obtained by propagating the δ values of the units k that j feeds into back through the weights wjk:
  δpj = f'j(netpj) Σk δpk wjk
13. In short..
- Weight update: Δp wij = η δpj opi
- Output units: δpj = (tpj - opj) f'j(netpj)
- Hidden units: δpj = f'j(netpj) Σk δpk wjk
14. Two phases of back-propagation
- Forward phase: activations are propagated from the input units, through the hidden units, to the output units.
- Backward phase: the error signals δ are propagated from the output units back towards the input units.
15. Activation and Error back-propagation
16. Weight updates
- Once the δ values are available, every weight is updated with the same rule, for the input-to-hidden weights wij and the hidden-to-output weights wjk alike:
  wij ← wij + η δpj opi,  wjk ← wjk + η δpk opj
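Putting the forward and backward phases together, a small two-layer network can be trained with these equations. The sketch below is one possible implementation, assuming logistic units in both layers, incremental updates, and the XOR training set; all variable names are illustrative, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# XOR training set.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0.0], [1.0], [1.0], [0.0]])

def f(net):                                  # logistic activation
    return 1.0 / (1.0 + np.exp(-net))

eta, n_hidden = 0.5, 2
W1 = rng.uniform(-0.5, 0.5, (2, n_hidden))   # input-to-hidden weights wij
b1 = np.zeros(n_hidden)
W2 = rng.uniform(-0.5, 0.5, (n_hidden, 1))   # hidden-to-output weights wjk
b2 = np.zeros(1)

for epoch in range(20000):
    for x, t in zip(X, T):
        # Forward phase: propagate activations input -> hidden -> output.
        h = f(x @ W1 + b1)
        o = f(h @ W2 + b2)

        # Backward phase: propagate error signals output -> hidden.
        delta_out = (t - o) * o * (1 - o)            # (tpk - opk) f'(netpk)
        delta_hid = h * (1 - h) * (W2 @ delta_out)   # f'(netpj) * sum_k deltapk wjk

        # Incremental weight updates: delta_w = eta * delta * (presynaptic output).
        W2 += eta * np.outer(h, delta_out)
        b2 += eta * delta_out
        W1 += eta * np.outer(x, delta_hid)
        b1 += eta * delta_hid

# Outputs typically approach [0 1 1 0]; a bad initialization can still get
# stuck in a local minimum (see the later slides).
print(np.round(f(f(X @ W1 + b1) @ W2 + b2).ravel(), 2))
```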
17. Two schemes of training
- There are two schemes of updating weights:
  - Batch: update weights after all patterns have been presented (an epoch).
  - Incremental: update weights after each pattern is presented.
- Although the batch update scheme implements the true gradient descent, the second scheme is often preferred since:
  - it requires less storage,
  - it has more noise, hence is less likely to get stuck in a local minimum (which is a problem with nonlinear activation functions).
- In the incremental update scheme, the order of presentation matters! (The two schemes are contrasted in the sketch below.)
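In the sketch below, `grad` is a placeholder per-pattern gradient for a linear unit with squared error (not from the slides); only the update schedule differs between the two functions.

```python
import numpy as np

def grad(w, x, t):
    # Placeholder per-pattern gradient: squared error of a linear unit o = w . x
    return -(t - w @ x) * x

def batch_epoch(w, X, T, eta):
    # Batch scheme: accumulate the gradient over the whole epoch, update once.
    g = sum(grad(w, x, t) for x, t in zip(X, T))
    return w - eta * g

def incremental_epoch(w, X, T, eta, rng):
    # Incremental scheme: update after every pattern; the order of presentation
    # matters, so the patterns are reshuffled each epoch.
    for i in rng.permutation(len(X)):
        w = w - eta * grad(w, X[i], T[i])
    return w

rng = np.random.default_rng(0)
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
T = np.array([1.0, 2.0, 3.0])

w_batch, w_incr = np.zeros(2), np.zeros(2)
for _ in range(200):
    w_batch = batch_epoch(w_batch, X, T, eta=0.1)
    w_incr = incremental_epoch(w_incr, X, T, eta=0.1, rng=rng)

print(w_batch, w_incr)   # both approach the exact solution [1, 2]
```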
18. Problems of back-propagation
- It is extremely slow, if it does converge.
- It may get stuck in a local minimum.
- It is sensitive to initial conditions.
- It may start oscillating.
- etc.
19. The local minima problem
- Unlike LMS, the error function is not smooth with a single minimum: local minima can occur, in which case a true gradient descent is not desirable. Momentum, incremental updates, and a large learning rate produce a jiggly path that can avoid local minima.
20. Some variations
- True gradient descent assumes an infinitesimal learning rate (η). If η is too small, then learning is very slow. If it is too large, then the system's learning may never converge.
- Some of the possible solutions to this problem are:
  - Add a momentum term to allow a larger learning rate.
  - Use a different activation function.
  - Use a different error function.
  - Use an adaptive learning rate.
  - Use a good weight initialization procedure.
  - Use a different minimization procedure.
21. Momentum
- The most widely used trick is to remember the direction of earlier steps. The weight update becomes:
  Δwij(n+1) = η (δpj opi) + α Δwij(n)
- The momentum parameter α is chosen between 0 and 1, typically 0.9. This allows one to use higher learning rates. The momentum term filters out high-frequency oscillations on the error surface.
- What would the effective learning rate be in a deep valley? (See the sketch below.)
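A sketch of the momentum update in isolation, assuming the gradient term δpj·opi has already been computed (names and values are illustrative). It also hints at an answer to the valley question: with a constant gradient direction, the accumulated step approaches η/(1 − α) times the raw step, i.e. a ten-fold larger effective learning rate for α = 0.9.

```python
import numpy as np

def momentum_step(w, grad_term, prev_dw, eta=0.25, alpha=0.9):
    """One update: dw(n+1) = eta * grad_term + alpha * dw(n).

    grad_term stands for deltapj * opi from the generalized delta rule;
    prev_dw is the weight change applied at the previous step."""
    dw = eta * grad_term + alpha * prev_dw
    return w + dw, dw

# In a long valley the gradient keeps pointing the same way, so the steps
# accumulate towards eta / (1 - alpha) times the raw step.
w, dw = np.zeros(1), np.zeros(1)
for n in range(50):
    w, dw = momentum_step(w, grad_term=np.array([1.0]), prev_dw=dw)

print(dw)   # approaches eta / (1 - alpha) = 2.5 for alpha = 0.9
```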
22. Choice of the activation function
- The computational power is increased by the use of a squashing function. In the original paper the logistic function
  f(x) = 1 / (1 + e^(-x))
  is used.
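A short sketch of the logistic squashing function together with its derivative f'(x) = f(x)(1 − f(x)), which is the f'(net) factor used in the δ computations (the derivative identity is standard, though not spelled out on the slide):

```python
import numpy as np

def logistic(x):
    """f(x) = 1 / (1 + e^(-x)): squashes any input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def logistic_derivative(x):
    """f'(x) = f(x) * (1 - f(x)): computable from the activation alone."""
    fx = logistic(x)
    return fx * (1.0 - fx)

x = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(logistic(x))              # squashed activations
print(logistic_derivative(x))   # largest at 0, vanishing for large |x|
```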
23. Activation function
24. Alternative activation functions
25. Alternative error functions
26. Adaptive parameters
27. Weight initialization
28. Other minimization procedures
29. Using the Hessian
30. Contd.
31. Using steepest descent
32. Conjugate gradient method
33. Genetic algorithms
34. Modifying architecture
35. What do Minsky and Papert NOW think?
- In preparing this edition we were tempted to bring (our) theories up to date. But when we found that little of significance had changed since 1969, when the book was first published, we concluded that it would be more useful to keep the original text (with its corrections of 1972) and add an epilogue, so that the book could still be read in its original form.
- Minsky and Papert's prologue to the 1988 edition of Perceptrons
- Perceptrons - Expanded Edition: An Introduction to Computational Geometry. Marvin L. Minsky and Seymour A. Papert. December 1987. ISBN 0-262-63111-3. 6 x 9, 275 pp.
36. Readings for next week
- Original back-prop paper of Rumelhart et al. When reading it, note the reversed indexing of the weights: what we call wij is denoted as wji in their notation.
- Prologue and Epilogue of the book Perceptrons.
37. First project
- Due Feb 27, 13:40.
- Implement the delta rule to train a single perceptron with two inputs. Use the following training set, which consists of four patterns.
- Show how the decision surface of the perceptron evolves at each iteration until all four patterns are learned. (A small plotting helper is sketched below.)
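A possible helper for visualising the evolving decision surface (purely illustrative; it only draws the line w1·x1 + w2·x2 + b = 0 and leaves the training loop and the four given patterns to the assignment):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_decision_line(w, b, iteration, xlim=(-0.5, 1.5)):
    """Draw the decision line w[0]*x1 + w[1]*x2 + b = 0 of a two-input perceptron."""
    x1 = np.linspace(xlim[0], xlim[1], 100)
    if abs(w[1]) > 1e-12:
        x2 = -(w[0] * x1 + b) / w[1]
        plt.plot(x1, x2, label=f"iteration {iteration}")
    elif abs(w[0]) > 1e-12:                 # vertical line when w[1] is zero
        plt.axvline(-b / w[0], label=f"iteration {iteration}")

# Typical use: call plot_decision_line(w, b, n) after each weight update inside
# the training loop, then show the accumulated lines once training is done.
plot_decision_line(np.array([1.0, 1.0]), -0.5, iteration=0)
plt.legend()
plt.show()
```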
38. Contd.
- Implement a two-layer perceptron network and the back-propagation learning algorithm.
- Train the network with the XOR problem and show that it can learn it.
- Train the network with the training data given and evaluate it with the testing data.
- How does the number of hidden units affect the results?