Practical Aspects of Backpropagation

1
Practical Aspects of Backpropagation
  • H. Ten Eikelder, A. Cristea

2
(No Transcript)
3
(No Transcript)
4
Minimizing Squared Error
  • E(W)
  • W all weights

5
(No Transcript)
6
  • iff ∃ a perceptron that computes the training set
    without errors ⇒
  • ∃ a global minimum W of E with E(W) = 0
  • However
  • often we end up with W where E(W) > 0
  • ⇒ a local (non-global) minimum of E
  • ⇒ ∃ a better set of weights

7
(No Transcript)
8
2 local minima and 1 saddle point
9
(No Transcript)
10
  • if ∄ a perceptron that computes the training set
    without errors, it is possible that W' is a global
    minimum, but there is NO guarantee!!
  • In practice the error surface may contain many
    local minima and finding a global minimum can be
    difficult.

11
Many local minima, example f(x, y) →
x² sin(xy − 2) − sin(x − y)
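This surface can be explored numerically. A minimal sketch (the function and gradient code are mine, not from the slides) that evaluates f, its analytic gradient, and runs plain gradient descent; different starting points can settle in different local minima:

```python
import math

def f(x, y):
    # the slides' example surface: f(x, y) = x^2 sin(xy - 2) - sin(x - y)
    return x * x * math.sin(x * y - 2) - math.sin(x - y)

def grad(x, y):
    # analytic partial derivatives of f
    dfx = (2 * x * math.sin(x * y - 2)
           + x * x * y * math.cos(x * y - 2)
           - math.cos(x - y))
    dfy = x ** 3 * math.cos(x * y - 2) + math.cos(x - y)
    return dfx, dfy

def descend(x, y, eta=0.01, steps=2000):
    # plain gradient descent on f; where it ends depends on the start
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - eta * gx, y - eta * gy
    return x, y
```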
12
Gradient graph for previous function
13
What is a Local Minimum?
  • Relative Minimum: a place where the nearby points
    are all at least as high.
  • g(x) ≥ g(x₀) ∀x: |x − x₀| ≤ δ

14
(No Transcript)
15
(No Transcript)
16
What is a Local Minimum?
  • Relative Minimum ✓
  • Strict Relative Minimum: a place where the nearby
    points are all higher.
  • g(x) > g(x₀) ∀x: 0 < |x − x₀| < δ

17
But
  • convergence of error backpropagation (even to a
    local minimum) may be slow
  • e.g., if the error surface is very flat
  • then the derivatives ∂E/∂w are very small
  • ⇒ very small weight changes / adaptations

18
(No Transcript)
19
Definition of Local Minimum
  • Relative Minimum ✓
  • Strict Relative Minimum ✓
  • Regional Minimum: a point w represents a
    regional minimum with value E₀ if E₀ is the
    maximum value such that for all points w₀
    reachable by non-ascending trajectories w → w₀,
    E(w₀) ≥ E(w) = E₀

20
And yet
  • On the other hand, a highly curved error surface
    may lead to
  • ⇒ large weight adaptations that "overshoot" the
    minimum

21
Example of overshooting
step = const.
Error f(x, y) → x² sin(xy − 2) − sin(x − y)
22
In practice
  • training is often repeated with different
    parameters / different initial weights.
  • the addition of some random aspects during
    training, e.g., selection of the next training
    pair, can be useful.
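A hedged sketch of the repeated-training idea (the `train_once` interface is my own invention, not from the slides): rerun training with different random initial weights and keep the best run:

```python
import random

def train_with_restarts(train_once, n_restarts=5, seed=0):
    # train_once(rng) -> (weights, final_error): one full training run
    # starting from random initial weights drawn with rng.
    rng = random.Random(seed)
    best_w, best_err = None, float("inf")
    for _ in range(n_restarts):
        w, err = train_once(rng)
        if err < best_err:
            best_w, best_err = w, err
    return best_w, best_err
```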

23
Improving Convergence Speed
  • Notations
  • W(t): weights after t learning steps
  • E(t): error after t learning steps
  • Δv(t): update of weight v in step t
  • v(t+1) = v(t) + Δv(t)

24
Momentum term
  • Δv(t) = −η (∂E(t)/∂v) + α Δv(t−1)
  • its effect is that previous weight changes have
    some contribution to the current weight change.
  • α ∈ (0,1) determines the strength of the momentum term.
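The rule in code (a minimal per-weight sketch, not from the slides):

```python
def momentum_update(grad, prev_dv, eta=0.1, alpha=0.9):
    # Delta v(t) = -eta * (dE(t)/dv) + alpha * Delta v(t-1)
    return -eta * grad + alpha * prev_dv
```

Unrolled over two steps this gives Δv(t) = −η (g(t) + α g(t−1)) + α² Δv(t−2), the expansion used on the next slide.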

25
From Δv(t) = −η (∂E(t)/∂v) + α Δv(t−1)
  • Δv(t) = −η (∂E(t)/∂v + α ∂E(t−1)/∂v)
    + α² Δv(t−2)
  •       = −η Σ_{k=0..l−1} α^k ∂E(t−k)/∂v
    + α^l Δv(t−l)
  • if l → ∞: α^l → 0 ⇒ α^l Δv(t−l) → 0
  • ⇒ the weight update is a weighted sum of the
    gradients of E for the last values of the
    weights.

26
From Δv(t) = −η Σ_{k=0..l−1} α^k ∂E(t−k)/∂v
  • If the derivatives of E are ≈ constant (flat error surface):
  • ∂E(t−k)/∂v ≈ ∂E(t)/∂v = const.
  • Δv(t) ≈ −η (∂E(t)/∂v) Σ_{k=0..l−1} α^k
    = −η (∂E(t)/∂v) (1 − α^l) / (1 − α)
    →(l→∞) −η (∂E(t)/∂v) / (1 − α)
  • ⇒ the momentum-term weight update is 1/(1 − α) times
    higher than the standard BP learning rule.
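The 1/(1 − α) factor can be checked numerically (a quick sketch under the flat-surface assumption, with constant gradient g = 1):

```python
eta, alpha, g = 0.1, 0.9, 1.0
dv = 0.0
for _ in range(500):
    # momentum rule with a constant gradient (flat error surface)
    dv = -eta * g + alpha * dv
# geometric series: dv converges to -eta * g / (1 - alpha),
# i.e. 1 / (1 - alpha) = 10x the plain update -eta * g
```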

27
Variants of Backpropagation
  • variable learning parameter η
  • in regions with a small, "stable" gradient of E we
    may use a large η to prevent extremely slow
    learning.
  • for a large gradient of the error it is better to
    use a small η to prevent overshooting of the
    minimum.

28
Delta bar learning
  • ηv: individual learning parameter for weight v
  • Δv(t) = −ηv (∂E(t)/∂v) = −ηv δv(t)
  • Δηv(t) = κ, if the gradient is ≈ constant for a few
    steps
  •        = −φ ηv(t), if the gradient changes sign
  •        = 0, otherwise
  • where κ > 0, φ ∈ (0,1)

29
How to check the gradient sign?
  • δv(t) = ∂E(t)/∂v, the current gradient
  • δ̄v(t) = (1−θ) δv(t) + θ δ̄v(t−1)
  • where θ ∈ (0,1) and δ̄v(−1) = 0
  • ⇒ δ̄v(t) is a weighted average of the δv(s) over
    the time points s ≤ t
  • gradient ≈ constant: δ̄v(t−1) δv(t) > 0
  • gradient changes sign: δ̄v(t−1) δv(t) < 0
  • So: the Delta Bar rule!
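The two slides above combine into the following sketch of one adaptation step for a single weight (parameter values κ, φ, θ here are illustrative, not from the slides):

```python
def delta_bar_delta_step(eta, dbar, grad, kappa=0.01, phi=0.5, theta=0.7):
    # sign test dbar(t-1) * grad(t):
    #   > 0 -> gradient kept its sign, grow eta additively
    #   < 0 -> gradient changed sign, shrink eta multiplicatively
    #  == 0 -> leave eta unchanged
    if dbar * grad > 0:
        eta += kappa
    elif dbar * grad < 0:
        eta -= phi * eta
    dv = -eta * grad                            # weight update Delta v(t)
    dbar = (1 - theta) * grad + theta * dbar    # averaged gradient delta-bar
    return eta, dbar, dv
```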

30
QuickProp
  • uses second derivatives of E for a better
    approximation of the weight updates.
  • From a Taylor expansion:
    ∂E(t+1)/∂v ≈ ∂E(t)/∂v + Δv(t) ∂²E(t)/∂v²
  • We want ∂E(t+1)/∂v = 0 (local minimum)
  • ⇒ Δv(t) = −(∂E(t)/∂v) / (∂²E(t)/∂v²)
  • This can be computed by approximating the second
    derivative
  • Quickprop often converges faster than BP
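A sketch of the update: the second derivative is approximated by the secant through the last two gradients (the function name and the plain-gradient fallback for the first step are my assumptions; real Quickprop implementations also limit the step growth):

```python
def quickprop_step(grad, prev_grad, prev_dv, eta=0.1):
    # d2E/dv2 is approximated by (grad - prev_grad) / prev_dv, so
    # dv = -grad / (d2E/dv2) = prev_dv * grad / (prev_grad - grad)
    if prev_dv != 0.0 and prev_grad != grad:
        return prev_dv * grad / (prev_grad - grad)
    return -eta * grad  # first step (or flat secant): plain gradient step

# on a quadratic E(v) = v^2 (dE/dv = 2v) the parabola model is exact,
# so the second step jumps straight to the minimum
v, prev_g, prev_dv = 1.0, 0.0, 0.0
for _ in range(2):
    g = 2 * v
    dv = quickprop_step(g, prev_g, prev_dv)
    v, prev_g, prev_dv = v + dv, g, dv
```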

31
Other variants of BP
  • the number of neurons may change during learning
  • pruning: neurons are deleted
  • constructive methods: neurons are added

32
Attention
  • Despite extensive training with a good learning
    algorithm, the resulting network may still have a
    large error on the training set.
  • only local minima of the error may have been found
    (a global minimum with a small attraction region).
  • but it is also possible that networks of the given
    topology cannot compute the function with a small
    error.

33
Example X = {(1, 0.9), (2, 0.1), (3, 0.9)}
((x, t): x input, t target)
  • a one-layer continuous perceptron can only
    compute monotonic functions.
  • with s1 = f⁻¹(0.9), s2 = f⁻¹(0.1):
  • f(w + w0) = 0.9 ⇒ w + w0 = s1
  • f(2w + w0) = 0.1 ⇒ 2w + w0 = s2
  • f(3w + w0) = 0.9 ⇒ 3w + w0 = s1
  • but w + w0 = s1 = 3w + w0 ⇒ w = 0, which makes
    f(2w + w0) = 0.9 ≠ 0.1: no weights fit this set.
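The monotonicity claim is easy to check numerically (a small sketch; the logistic activation is my assumption for f):

```python
import math

def neuron(x, w, w0):
    # one continuous perceptron: logistic f applied to w*x + w0
    return 1.0 / (1.0 + math.exp(-(w * x + w0)))

# for any weights, the outputs at x = 1, 2, 3 are monotone in x,
# so the target pattern 0.9, 0.1, 0.9 can never be matched
for w, w0 in [(1.5, -2.0), (-0.7, 3.0), (0.0, 0.5)]:
    ys = [neuron(x, w, w0) for x in (1, 2, 3)]
    assert ys == sorted(ys) or ys == sorted(ys, reverse=True)
```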
34
Overtraining
  • large networks with many layers ⇒ more types of
    functions, but training is more complicated
  • In general the training set size must be related to
    the number of weights in the network.
  • Intensive training of a large network with a
    small training set may lead to a network that
    memorizes the training set.
  • Overtraining: a network that performs very well
    on the training set but very badly on other points.

35
Generalization
  • A network is said to generalize if it gives
    correct results for inputs not in the training
    set.
  • In practice generalization is tested by using,
    besides the training set, an additional test set.
  • Stop criterion: the error on the test set
  • Often the training set and test set are obtained by
    splitting the original data set into 2 parts.

36
Example of bad generalization
  • 2 feedforward NNs, each with 2 layers; hidden layer
    with 2 resp. 25 neurons
  • Function: sin(x) on the interval −π/2 ≤ x ≤ π/2
  • training set X = {(−π/2 + iπ/8, sin(−π/2 + iπ/8)) |
    0 ≤ i ≤ 8}
  • 1st layer: standard activation function
  • 2nd layer: one linear neuron

37
Training results
38
Comments on bad example
  • Both NNs give a good approximation to sin(x) at the
    training set points.
  • Small NN generalizes well, big NN very badly:
  • it memorized the training set but gives wrong
    results for other inputs ⇒ overtraining
  • Big NN: 76 weights, too many for a 9-point
    training set.
  • the use of an additional validation set can
    prevent bad generalization

39
Solutions to problems, chapt. 2
  • http://wwwis.win.tue.nl/alex/
  • http://wwwis.win.tue.nl/alex/HTML/NN/NNpb2.html

40
Example of signal traveling through axon
41
Example of single layer perceptron learning
  • [0,1]: check learning rate, momentum,
    number of hidden layers
  • Perceptron, BP, etc.: [0,1]

42
Example of multiple layer perceptron (MLP)
learning
  • MLP [0,1]
  • MLP [-1,1]

43
Generalization with MLP
  • MLP [0,1]
  • MLP [-1,1]

44
OCR with MLP
45
Prediction with MLP