Transcript and Presenter's Notes

Title: CENG 569 NEUROCOMPUTING


1
CENG 569 NEUROCOMPUTING
Erol Sahin
Dept. of Computer Engineering
Middle East Technical University
Inonu Bulvari, 06531, Ankara, TURKEY
  • Week 2

2
The history
  • 1962 - Frank Rosenblatt: "Back-propagating error-correction procedures," in Principles of Neurodynamics.
  • 1969 - Marvin Minsky and Seymour Papert: Perceptrons, MIT Press.
  • 1974 - Paul Werbos: Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. Ph.D. thesis, Harvard University.
  • 1986 - D. E. Rumelhart, G. E. Hinton, and R. J. Williams: "Learning Internal Representations by Error Propagation," in Parallel Distributed Processing, volumes I and II, by the PDP group at UCSD.
  • 1986 - today: interest in neural networks is on the rise.

3
Back propagating error correction
  • The procedure described here is called the back-propagating error correction procedure since it takes its cue from the error of the R-units, propagating corrections back towards the sensory end of the network if it fails to make a satisfactory correction quickly at the response end.
  • The rules for the back-propagating correction procedure are:
  • For each R-unit, set Er = R - r, where R is the required response and r is the obtained response.
  • For each association unit ai, Ei is computed as follows for each stimulus: begin with Ei = 0.
  • If ai is active, and the connection cir terminates on an R-unit with a non-zero error Er which differs in sign from vir, add -1 to Ei with probability ?i.
  • . . .

4
Perceptrons
  • In the popular history of neural networks, first
    came the classical period of perceptron, when it
    seemed as if neural networks could do anything. A
    hundred algorithms bloomed, a hundred schools of
    learning machines contended. Then came the onset
    of dark ages, where suddenly, research on neural
    networks was unloved, unwanted, and most
    important, unfunded. A precipitating factor in
    this sharp decline was the publication of the
    book Perceptrons by Minsky and Papert in 1969.
  • (These) authors expressed the strong belief that limitations of the kind they discovered for simple perceptrons would be held to be true for perceptron variants, more specifically, multilayer systems.
  • . . . This conjecture . . . thoroughly dampened the enthusiasm of granting agencies to support future research. Why bother, since more complex versions would have the same problems? Unfortunately, this conjecture now seems to be wrong.
  • - AR, Introduction to chapter 13.

5
Limitation of perceptrons
  • The problem with two-layer perceptrons: they can only map similar inputs to similar outputs.
  • Minsky and Papert have provided a very careful analysis of conditions under which such systems are capable of carrying out the required mappings. They show that in a large number of interesting cases, networks of this kind are incapable of solving the problems. (Rumelhart, Hinton and Williams 1986)

The XOR (parity) problem
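  • To see the limitation concretely, the short Python sketch below searches a coarse grid of weights for a single threshold unit and finds none that reproduces XOR. The grid range and step are arbitrary choices for illustration, not an exhaustive proof.

    import itertools

    # XOR truth table: ((x1, x2), target)
    patterns = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

    def solves_xor(w1, w2, b):
        # A single threshold unit: output 1 iff w1*x1 + w2*x2 + b > 0.
        return all(((w1 * x1 + w2 * x2 + b) > 0) == bool(t) for (x1, x2), t in patterns)

    grid = [k / 4.0 for k in range(-8, 9)]       # weights and bias in [-2, 2]
    found = any(solves_xor(*p) for p in itertools.product(grid, repeat=3))
    print("single-unit solution found:", found)  # prints False: XOR is not linearly separable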
6
Hidden units can help!
  • On the other hand, as Minsky and Papert also
    pointed out, if there is a layer of simple
    perceptron-like hidden units . . . there is
    always a recoding (i.e. an internal
    representation) of the input patterns in the
    hidden units in which the similarity of the
    patterns among the hidden units can support any
    required mapping from the input to the output
    units (Rumelhart, Hinton and Williams 1986)

7
Weights to hidden units: how to learn?
  • The problem, as noted by Minsky and Papert, is
    that whereas there is a very simple guaranteed
    learning rule for all problems that can be solved
    without hidden units, namely the perceptron
    convergence procedure, there is no equally
    powerful rule for learning in networks with
    hidden units (Rumelhart, Hinton and Williams
    1986)
  • The learning rules defined for the perceptron and ADALINE update the weights to the output units using the error between the actual output and the desired output. The challenge is:
  • How do you compute the error for the hidden units?

8
The delta rule
  • The delta rule changes weight wij in proportion to the error at output unit j and the activation of input unit i:
  • Δp wij = η δpj opi,   where δpj = tpj - opj (target minus obtained output for pattern p).
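  • As a concrete illustration, here is a minimal Python sketch of delta-rule training for a single linear unit with two inputs. The learning rate, the bias handling, and the small training set are assumptions for illustration and are not taken from the slides.

    eta = 0.1                                  # assumed learning rate
    w = [0.0, 0.0, 0.0]                        # w[0] acts as a bias weight

    # Hypothetical training patterns ((x1, x2), target); the course data set is not shown here.
    patterns = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 1.0)]

    for epoch in range(50):
        for (x1, x2), t in patterns:
            x = (1.0, x1, x2)                              # constant 1.0 input for the bias
            o = sum(wi * xi for wi, xi in zip(w, x))       # linear output o = w . x
            delta = t - o                                  # delta_pj = t_pj - o_pj
            w = [wi + eta * delta * xi for wi, xi in zip(w, x)]   # w_ij += eta * delta_pj * o_pi

    print(w)                                   # learned weights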
9
The generalized delta rule
  • With nonlinear units, each unit j computes a net input and squashes it: netpj = Σi wij opi, opj = f(netpj).
  • The error for pattern p is Ep = ½ Σj (tpj - opj)², and gradient descent requires Δp wij ∝ -∂Ep/∂wij.
10
Contd
  • Applying the chain rule, ∂Ep/∂wij = (∂Ep/∂netpj)(∂netpj/∂wij) = -δpj opi, so the update keeps the form Δp wij = η δpj opi.
11
(No Transcript)
12
  • For an output unit j: δpj = (tpj - opj) f'(netpj).
  • For a hidden unit j: δpj = f'(netpj) Σk δpk wjk, where the sum runs over the output units k that j feeds through the weights wjk.
13
In short..
  • Every weight is updated by the same rule, Δp wij = η δpj opi; only δpj differs: for output units it is the error times f', for hidden units it is the back-propagated error times f'.
14
Two phases of back-propagation
15
Activation and Error back-propagation
16
Weight updates
  • Hidden-to-output weights: Δp wjk = η δpk opj.
  • Input-to-hidden weights: Δp wij = η δpj opi.
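  • To make the two phases concrete, the following is a minimal Python sketch of one back-propagation step for a small 2-2-1 network of logistic units. The network size, the initial weights, the learning rate, and the omission of bias terms are assumptions for illustration, not taken from the slides.

    import math

    def f(x):
        # Logistic activation: f(x) = 1 / (1 + e^-x)
        return 1.0 / (1.0 + math.exp(-x))

    eta = 0.5                                # assumed learning rate
    w_ih = [[0.1, -0.2], [0.3, 0.4]]         # w_ih[i][j]: input i -> hidden j
    w_ho = [0.2, -0.1]                       # w_ho[j]: hidden j -> output (biases omitted)

    def backprop_step(x, t):
        global w_ho
        # Phase 1: forward pass -- activations flow from the inputs to the output.
        o_h = [f(sum(x[i] * w_ih[i][j] for i in range(2))) for j in range(2)]
        o_out = f(sum(o_h[j] * w_ho[j] for j in range(2)))
        # Phase 2: backward pass -- the deltas flow from the output back to the hidden layer.
        d_out = (t - o_out) * o_out * (1.0 - o_out)                          # output delta
        d_h = [o_h[j] * (1.0 - o_h[j]) * d_out * w_ho[j] for j in range(2)]  # hidden deltas
        # Weight updates: delta_w = eta * delta_j * o_i
        w_ho = [w_ho[j] + eta * d_out * o_h[j] for j in range(2)]
        for i in range(2):
            for j in range(2):
                w_ih[i][j] += eta * d_h[j] * x[i]
        return o_out

    backprop_step([1.0, 0.0], 1.0)           # one incremental update on one pattern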
17
Two schemes of training
  • There are two schemes of updating the weights (contrasted in the sketch after this list):
  • Batch: update the weights after all patterns have been presented (one epoch).
  • Incremental: update the weights after each pattern is presented.
  • Although the batch scheme implements true gradient descent, the incremental scheme is often preferred since:
  • it requires less storage,
  • it is noisier, and hence less likely to get stuck in a local minimum (which is a problem with nonlinear activation functions). In the incremental scheme, the order of presentation matters!
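  • A minimal sketch of the two update loops, assuming a gradient(w, p) helper that returns the per-pattern gradient dEp/dw; the helper, the learning rate, and the demo data below are placeholders, not part of the original slides.

    import random

    def train_batch(w, patterns, gradient, eta=0.1, epochs=100):
        for _ in range(epochs):
            total = [0.0] * len(w)
            for p in patterns:                                 # accumulate over the whole epoch
                total = [t + g for t, g in zip(total, gradient(w, p))]
            w = [wi - eta * t for wi, t in zip(w, total)]      # one true gradient step per epoch
        return w

    def train_incremental(w, patterns, gradient, eta=0.1, epochs=100):
        for _ in range(epochs):
            random.shuffle(patterns)                           # order of presentation matters here
            for p in patterns:
                w = [wi - eta * g for wi, g in zip(w, gradient(w, p))]   # update after every pattern
        return w

    # Tiny usage example with a made-up per-pattern error E_p = (w[0] - p)^2.
    demo = [1.0, 2.0, 3.0]
    print(train_batch([0.0], demo, lambda w, p: [2 * (w[0] - p)]))
    print(train_incremental([0.0], demo, lambda w, p: [2 * (w[0] - p)]))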

18
Problems of back-propagation
  • It is extremely slow, if it does converge.
  • It may get stuck in a local minimum.
  • It is sensitive to initial conditions.
  • It may start oscillating.
  • etc.

19
The local minima problem
  • Unlike LMS, the error function is not smooth with a single minimum. Local minima can occur, in which case true gradient descent is not desirable. Momentum, incremental updates, and a large learning rate make for a jiggly path that can help escape local minima.

20
Some variations
  • True gradient descent assumes an infinitesimal learning rate (η). If η is too small, learning is very slow; if it is too large, the learning may never converge.
  • Some of the possible solutions to this problem are:
  • Add a momentum term to allow a large learning
    rate.
  • Use a different activation function
  • Use a different error function
  • Use an adaptive learning rate
  • Use a good weight initialization procedure.
  • Use a different minimization procedure

21
Momentum
  • The most widely used trick is to remember the direction of earlier steps. The weight update becomes (see the sketch below):
  • Δwij(n+1) = η (δpj opi) + α Δwij(n)
  • The momentum parameter α is chosen between 0 and 1, typically 0.9. This allows one to use higher learning rates. The momentum term filters out high-frequency oscillations on the error surface.
  • What would the learning rate be in a deep valley?
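  • A minimal sketch of the momentum update for a single weight, with assumed values η = 0.25 and α = 0.9; the sequence of gradient terms is made up for illustration.

    eta, alpha = 0.25, 0.9                 # assumed learning rate and momentum parameter
    w, dw_prev = 0.0, 0.0                  # one weight and its previous update

    # Hypothetical sequence of (delta_pj * o_pi) terms seen by this weight.
    grads = [1.0, 0.9, 1.1, 1.0, 0.95]

    for g in grads:
        dw = eta * g + alpha * dw_prev     # dw(n+1) = eta * (delta_pj * o_pi) + alpha * dw(n)
        w += dw
        dw_prev = dw

    # When the gradient direction stays consistent, the steps grow towards eta / (1 - alpha)
    # per unit gradient -- here an effectively ten-fold larger learning rate.
    print(w, dw_prev)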

22
Choice of the activation function
  • The computational power is increased by the use of a squashing function. In the original paper the logistic function
  • f(x) = 1 / (1 + e^-x)
  • is used.
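  • A small Python sketch of the logistic function; its derivative takes the convenient form f'(x) = f(x) (1 - f(x)), which is what makes it cheap to evaluate inside the δ terms. The numerical check below is only illustrative.

    import math

    def logistic(x):
        # f(x) = 1 / (1 + e^-x): squashes any input into (0, 1)
        return 1.0 / (1.0 + math.exp(-x))

    def logistic_prime(x):
        # Derivative expressed through the output itself: f'(x) = f(x) * (1 - f(x))
        fx = logistic(x)
        return fx * (1.0 - fx)

    # Check against a central finite difference at a few points.
    h = 1e-6
    for x in (-2.0, 0.0, 3.0):
        approx = (logistic(x + h) - logistic(x - h)) / (2 * h)
        print(x, logistic(x), logistic_prime(x), approx)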

23
Activation function
24
Alternative activation functions
25
Alternative error functions
26
Adaptive parameters
27
Weight initialization
28
Other minimization procedures
29
Using the Hessian
30
Contd
31
Using steepest descent
32
Conjugate gradient method
33
Genetic algorithms
34
Modifying architecture
35
What do Minsky and Papert NOW think?
  • In preparing this edition we were tempted to bring (our) theories up to date. But when we found that little of significance had changed since 1969, when the book was first published, we concluded that it would be more useful to keep the original text (with its corrections of 1972) and add an epilogue, so that the book could still be read in its original form.
  • - Minsky and Papert's prologue to the 1988 edition of Perceptrons


Perceptrons - Expanded Edition: An Introduction to Computational Geometry
Marvin L. Minsky and Seymour A. Papert
December 1987, ISBN 0-262-63111-3, 6 x 9, 275 pp.
36
Readings for next week
  • The original back-propagation paper of Rumelhart et al. When reading it, note the reversed indexing of the weights: what we call wij is denoted wji in their notation.
  • Prologue and Epilogue of the book Perceptrons.

37
First project
  • Due Feb 27, 13:40.
  • Implement the delta rule to train a single perceptron with two inputs. Use the following training set, which consists of four patterns.
  • Show how the decision surface of the perceptron
    evolves at each iteration until all four patterns
    are learned.

38
Contd
  • Implement a two-layer perceptron network and the
    back-propagation learning algorithm.
  • Train the network with the XOR problem and show
    that it can learn it.
  • Train the network with the training data given
    and evaluate it with the testing data.
  • How does the number of hidden units affect the results?