Transcript and Presenter's Notes

Title: CENG 569 NEUROCOMPUTING


1
CENG 569 NEUROCOMPUTING
Erol Sahin, Dept. of Computer Engineering, Middle
East Technical University, Inonu Bulvari, 06531,
Ankara, TURKEY
  • Week 2

2
The history
  • 1962 - Frank Rosenblatt: "Back-propagating
    error-correction procedures", in Principles of
    Neurodynamics.
  • 1969 - Marvin Minsky and Seymour Papert:
    Perceptrons, MIT Press.
  • 1974 - Paul Werbos: "Beyond regression: new tools
    for prediction and analysis in the behavioral
    sciences", Ph.D. thesis, Harvard University.
  • 1986 - D. E. Rumelhart, G. E. Hinton, and R. J.
    Williams: "Learning Internal Representations by
    Error Propagation", published in Parallel
    Distributed Processing, Volumes I and II, by the
    PDP group at UCSD.
  • 1986 - today: interest in neural networks is on
    the rise.

3
Back-propagating error correction
  • The procedure described here is called the
    back-propagating error correction procedure since
    it takes its cue from the error of the R-units,
    propagating corrections back towards the sensory
    end of the network if it fails to make a
    satisfactory correction quickly at the response
    end.
  • The rules for the back-propagating correction
    procedure are:
  • For each R-unit, set Er = R - r, where R is
    the required response and r is the obtained
    response.
  • For each association unit ai, Ei is computed as
    follows for each stimulus: begin with Ei = 0.
  • If ai is active, and the connection cir
    terminates on an R-unit with a non-zero error Er
    which differs in sign from vir, add -1 to Ei with
    probability ?i
  • . . .

4
Perceptrons
  • In the popular history of neural networks, first
    came the classical period of perceptron, when it
    seemed as if neural networks could do anything. A
    hundred algorithms bloomed, a hundred schools of
    learning machines contended. Then came the onset
    of dark ages, where suddenly, research on neural
    networks was unloved, unwanted, and most
    important, unfunded. A precipitating factor in
    this sharp decline was the publication of the
    book Perceptrons by Minsky and Papert in 1969.
  • (These) authors expressed the strong belief that
    limitations of the kind they discovered for
    simple perceptrons would be held to be true for
    perceptron variants, more specifically, multilayer
    systems.
  • . . . This conjecture . . . thoroughly dampened
    the enthusiasm of granting agencies to support
    future research. Why bother, since more complex
    versions would have the same problems?
    Unfortunately, this conjecture now seems to be
    wrong.
  • AR Introduction to chapter 13.

5
Limitation of perceptrons
  • The problem with two-layer perceptrons: they can
    only map similar inputs to similar outputs.
  • Minsky and Papert have provided a very careful
    analysis of conditions under which such systems
    are capable of carrying out the required
    mappings. They show that in a large number of
    interesting cases, networks of this kind are
    incapable of solving the problems (Rumelhart,
    Hinton and Williams 1986)

The XOR (parity) problem
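
To make the XOR limitation concrete, here is the standard four-inequality argument (not from the slides) showing that no single threshold unit with weights w1, w2 and threshold θ can realize XOR:

```latex
% A single threshold unit outputs 1 iff  w_1 x_1 + w_2 x_2 \ge \theta.
% Realizing XOR would require all four of:
\begin{aligned}
(0,0)\mapsto 0 &:\quad 0 < \theta\\
(1,0)\mapsto 1 &:\quad w_1 \ge \theta\\
(0,1)\mapsto 1 &:\quad w_2 \ge \theta\\
(1,1)\mapsto 0 &:\quad w_1 + w_2 < \theta
\end{aligned}
% Adding the middle two inequalities gives  w_1 + w_2 \ge 2\theta,
% while the last one demands  w_1 + w_2 < \theta,  so  \theta < 0,
% contradicting the first line.  Hence no two-layer perceptron
% computes XOR.
```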
6
Hidden units can help!
  • On the other hand, as Minsky and Papert also
    pointed out, if there is a layer of simple
    perceptron-like hidden units . . . there is
    always a recoding (i.e. an internal
    representation) of the input patterns in the
    hidden units in which the similarity of the
    patterns among the hidden units can support any
    required mapping from the input to the output
    units (Rumelhart, Hinton and Williams 1986)

7
Weights to hidden units: how to learn?
  • The problem, as noted by Minsky and Papert, is
    that whereas there is a very simple guaranteed
    learning rule for all problems that can be solved
    without hidden units, namely the perceptron
    convergence procedure, there is no equally
    powerful rule for learning in networks with
    hidden units (Rumelhart, Hinton and Williams
    1986)
  • The learning rules defined for the perceptron and
    ADALINE update the weights to the output units
    using the error between the actual output and the
    desired output. The challenge is:
  • How do you compute the error for the hidden
    units?

8
The delta rule
  • For a network without hidden units, gradient
    descent on the squared error gives the delta rule:
    for pattern p, the weight from input unit i to
    output unit j changes by
  • Δp wij = η δpj opi,   with   δpj = tpj - opj
  • where η is the learning rate, tpj the target and
    opj the actual output of unit j, and opi the
    activation of input unit i. (A minimal sketch in
    code follows below.)
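
A minimal sketch of the delta rule in Python for a single linear output unit; the function and variable names (delta_rule_epoch, eta, patterns, targets) are illustrative, not from the slides:

```python
import numpy as np

def delta_rule_epoch(w, patterns, targets, eta=0.1):
    """One pass of the delta rule over all patterns.

    w        : weight vector (a bias can be folded in as a constant input of 1)
    patterns : array of shape (n_patterns, n_inputs)
    targets  : array of shape (n_patterns,)
    """
    for x, t in zip(patterns, targets):
        o = np.dot(w, x)          # actual output o_pj of the linear unit
        delta = t - o             # delta_pj = t_pj - o_pj
        w = w + eta * delta * x   # delta rule: w_ij += eta * delta_pj * o_pi
    return w
```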
9
The generalized delta rule
  • Define the error on pattern p as
  • Ep = ½ Σj (tpj - opj)²
  • where the output of unit j is opj = f(netpj) and
    netpj = Σi wij opi.
  • Gradient descent on Ep keeps the same form of
    update as the delta rule,
  • Δp wij = η δpj opi,   with   δpj = -∂Ep/∂netpj,
  • but δpj must now be computed differently for
    output units and for hidden units.
10
Contd
  • For an output unit j, the chain rule gives
  • δpj = (tpj - opj) f'(netpj).
11
(No Transcript)
12
  • For a hidden unit j, the error signal is collected
    from the units k that it feeds into through the
    weights wjk:
  • δpj = f'(netpj) Σk δpk wjk.
13
In short..
  • Hidden-to-output weights:
    Δp wjk = η δpk opj,   δpk = (tpk - opk) f'(netpk).
  • Input-to-hidden weights:
    Δp wij = η δpj opi,   δpj = f'(netpj) Σk δpk wjk.
14
Two phases of back-propagation
15
Activation and Error back-propagation
16
Weight updates
  • After the forward pass (activations) and the
    backward pass (deltas), both layers are updated
    with the same rule:
  • Δwjk = η δpk opj   and   Δwij = η δpj opi.
    (A sketch of one full pass in code follows below.)
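
A compact sketch of one forward/backward pass for a single pattern with logistic units; the names (backprop_step, W_ih, W_ho) are illustrative, biases are omitted for brevity, and this is a plain NumPy reading of the equations above rather than the original course code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(x, t, W_ih, W_ho, eta=0.5):
    """One incremental back-propagation step for a single pattern.

    x    : input vector             (n_in,)
    t    : target vector            (n_out,)
    W_ih : input-to-hidden weights  (n_hidden, n_in)
    W_ho : hidden-to-output weights (n_out, n_hidden)
    """
    # Phase 1 - forward pass: hidden and output activations.
    o_h = sigmoid(W_ih @ x)       # o_pj for the hidden units
    o_k = sigmoid(W_ho @ o_h)     # o_pk for the output units

    # Phase 2 - backward pass: output deltas, then hidden deltas.
    delta_k = (t - o_k) * o_k * (1 - o_k)            # (t_pk - o_pk) f'(net_pk)
    delta_h = (W_ho.T @ delta_k) * o_h * (1 - o_h)   # f'(net_pj) * sum_k delta_pk w_jk

    # Weight updates (incremental scheme).
    W_ho = W_ho + eta * np.outer(delta_k, o_h)       # Δw_jk = eta * delta_pk * o_pj
    W_ih = W_ih + eta * np.outer(delta_h, x)         # Δw_ij = eta * delta_pj * o_pi
    return W_ih, W_ho, o_k
```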
17
Two schemes of training
  • There are two schemes for updating the weights:
  • Batch: update the weights after all patterns have
    been presented (an epoch).
  • Incremental: update the weights after each pattern
    is presented.
  • Although the batch scheme implements true
    gradient descent, the incremental scheme is often
    preferred since:
  • it requires less storage,
  • it is noisier, and hence less likely to get
    stuck in a local minimum (which is a problem with
    nonlinear activation functions). In the
    incremental scheme, the order of presentation
    matters! (Both schemes are sketched below.)
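
A sketch of the two schemes, assuming a hypothetical grad(w, x, t) that returns the gradient of the single-pattern error (written out here for a simple linear unit as a stand-in for the back-propagated gradient of a full network):

```python
import numpy as np

def grad(w, x, t):
    """Gradient of the pattern error 1/2 (t - w.x)^2 for a linear unit."""
    return -(t - np.dot(w, x)) * x

def train_batch(w, patterns, targets, eta=0.1, n_epochs=100):
    """Batch scheme: accumulate the gradient over a whole epoch, update once."""
    for _ in range(n_epochs):
        total = sum(grad(w, x, t) for x, t in zip(patterns, targets))
        w = w - eta * total                # true gradient-descent step
    return w

def train_incremental(w, patterns, targets, eta=0.1, n_epochs=100):
    """Incremental scheme: update after every pattern; presentation order matters."""
    for _ in range(n_epochs):
        for x, t in zip(patterns, targets):
            w = w - eta * grad(w, x, t)    # noisy per-pattern step
    return w
```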

18
Problems of back-propagation
  • It is extremely slow, if it converges at all.
  • It may get stuck in a local minimum.
  • It is sensitive to initial conditions.
  • It may start oscillating.
  • etc.

19
The local minima problem
  • Unlike in LMS, the error function is not a smooth
    surface with a single minimum. Local minima can
    occur, in which case true gradient descent is not
    desirable. Momentum, incremental updates, and a
    large learning rate produce a jiggly path that
    can avoid local minima.

20
Some variations
  • True gradient descent assumes an infinitesimal
    learning rate (η). If η is too small then
    learning is very slow. If it is too large, then
    the system's learning may never converge.
  • Some of the possible solutions to this problem
    are:
  • Add a momentum term to allow a large learning
    rate.
  • Use a different activation function
  • Use a different error function
  • Use an adaptive learning rate
  • Use a good weight initialization procedure.
  • Use a different minimization procedure

21
Momentum
  • The most widely used trick is to remember the
    direction of earlier steps. The weight update
    becomes:
  • Δwij(n+1) = η δpj opi + α Δwij(n)
  • The momentum parameter α is chosen between 0 and
    1, typically 0.9. This allows one to use higher
    learning rates. The momentum term filters out
    high-frequency oscillations on the error surface.
  • What would the learning rate be in a deep valley?
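
A sketch of the momentum update in isolation (the names momentum_update and dw_prev are illustrative). In a long, deep valley the gradient direction stays roughly constant, so the remembered steps accumulate geometrically towards an effective step of η/(1 - α) times the gradient, i.e. roughly a ten-fold larger learning rate for α = 0.9:

```python
import numpy as np

def momentum_update(w, dw_prev, delta_pj, o_pi, eta=0.25, alpha=0.9):
    """Δw(n+1) = eta * delta_pj * o_pi + alpha * Δw(n)."""
    dw = eta * np.outer(delta_pj, o_pi) + alpha * dw_prev
    return w + dw, dw   # return updated weights and the step to remember
```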

22
Choice of the activation function
  • The computational power is increased by the use
    of a squashing function. In the original paper the
    logistic function
  • f(x) = 1/(1 + e^(-x))
  • is used.
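
A small sketch of the logistic function and its derivative; the convenient identity f'(x) = f(x)(1 - f(x)) is what makes the o(1 - o) factors in the deltas cheap to compute:

```python
import numpy as np

def logistic(x):
    """Squashing function f(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def logistic_derivative(x):
    """f'(x) = f(x) * (1 - f(x)), the o(1 - o) factor used in the deltas."""
    fx = logistic(x)
    return fx * (1.0 - fx)
```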

23
Activation function
24
Alternative activation functions
25
Alternative error functions
26
Adaptive parameters
27
Weight initialization
28
Other minimization procedures
29
Using the Hessian
30
Contd
31
Using steepest descent
32
Conjugate gradient method
33
Genetic algorithms
34
Modifying architecture
35
What do Minsky and Papert NOW think?
  • In preparing this edition we were tempted to
    bring (our) theories up to date. But when we
    found that little of significance had changed
    since 1969, when the book was first published, we
    concluded that it would be more useful to keep
    the original text (with its corrections of 1972)
    and add an epilogue, so that the book could still
    be read in its original form.
  • - Minsky and Papert's prologue to the 1988 edition
    of Perceptrons


Perceptrons - Expanded Edition: An Introduction to
Computational Geometry. Marvin L. Minsky and
Seymour A. Papert. December 1987. ISBN
0-262-63111-3. 6 x 9, 275 pp.
36
Readings for next week
  • The original back-propagation paper of Rumelhart
    et al. When reading it, note the reversed indexing
    of the weights: what we call wij is denoted wji in
    their notation.
  • Prologue and Epilogue of the book Perceptrons.

37
First project
  • Due Feb 27, 13:40.
  • Implement the delta rule to train a single
    perceptron with two inputs. Use the following
    training set, which consists of four patterns:
  • Show how the decision surface of the perceptron
    evolves at each iteration until all four patterns
    are learned.

38
Contd
  • Implement a two-layer perceptron network and the
    back-propagation learning algorithm.
  • Train the network on the XOR problem and show
    that it can learn it (see the sketch after this
    list).
  • Train the network with the training data given
    and evaluate it with the testing data.
  • How does the number of hidden units affect the
    results?
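
As one possible self-contained starting point for the XOR part (the names and the initialization here are illustrative; training may need many epochs and can occasionally stall in a local minimum, in which case re-initialize the weights):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# XOR training set: four patterns, one binary target each.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([0, 1, 1, 0], dtype=float)

rng = np.random.default_rng(1)
W_ih = rng.uniform(-0.5, 0.5, size=(2, 2))   # input-to-hidden weights
b_h = np.zeros(2)                            # hidden biases
W_ho = rng.uniform(-0.5, 0.5, size=(2,))     # hidden-to-output weights
b_o = 0.0                                    # output bias
eta = 0.5

for epoch in range(20000):                   # incremental updates
    for x, t in zip(X, T):
        o_h = sigmoid(W_ih @ x + b_h)                  # forward pass
        o = sigmoid(W_ho @ o_h + b_o)
        d_o = (t - o) * o * (1 - o)                    # output delta
        d_h = d_o * W_ho * o_h * (1 - o_h)             # hidden deltas
        W_ho += eta * d_o * o_h                        # weight and bias updates
        b_o += eta * d_o
        W_ih += eta * np.outer(d_h, x)
        b_h += eta * d_h

for x in X:                                  # outputs should approach 0, 1, 1, 0
    print(x, sigmoid(W_ho @ sigmoid(W_ih @ x + b_h) + b_o))
```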