Title: Practical Aspects of Backpropagation
Slide 1: Practical Aspects of Backpropagation
- H. Ten Eikelder, A. Cristea
Slides 2-3: (figures only, no transcript)
Slide 4: Minimizing Squared Error
Slide 5: (figure only, no transcript)
Slide 6
- ∃ a perceptron that computes the training set without errors ⇔ ∃ a global minimum W of E with E(W) = 0
- However:
- often we end up with a W with E(W) > 0
- i.e., a local (non-global) minimum of E
- ⇒ ∃ a better set of weights
Slide 7: (figure only, no transcript)
Slide 8: 2 local minima and 1 saddle point
Slide 9: (figure only, no transcript)
Slide 10
- Even if ∃ a perceptron that computes the training set without errors, the weights W' found by training may be a global minimum, but there is NO guarantee!
- In practice the error surface may contain many local minima, and finding a global minimum can be difficult.
Slide 11: Many local minima. Example: f(x, y) = x²·sin(xy − 2) − sin(x − y)
Slide 12: Gradient graph for the previous function
Slide 13: What is a Local Minimum?
- Relative minimum: a place where the nearby points are all at least as high.
- g(x) ≥ g(x₀) ∀x with |x − x₀| ≤ ε
Slides 14-15: (figures only, no transcript)
Slide 16: What is a Local Minimum?
- Relative minimum ✓
- Strict relative minimum: a place where the nearby points are all higher.
- g(x) > g(x₀) ∀x with 0 < |x − x₀| < ε
Slide 17: But...
- convergence of error backpropagation (even to a local minimum) may be slow
- e.g., if the error surface is very flat
- then the derivatives ∂E/∂w are very small
- ⇒ very small weight changes / adaptations
Slide 18: (figure only, no transcript)
Slide 19: Definition of Local Minimum
- Relative minimum ✓
- Strict relative minimum ✓
- Regional minimum: a point w represents a regional minimum with value E₀ if E₀ is the maximal value such that for all points w₀ reachable from w by non-ascending trajectories w → w₀, E(w₀) ≤ E(w) = E₀
Slide 20: And yet...
- on the other hand, a highly curved error surface may lead to
- ⇒ large weight adaptations that "overshoot" the minimum
Slide 21: Example of overshooting
- step size: constant
- error function: f(x, y) = x²·sin(xy − 2) − sin(x − y)
Slide 22: In practice
- training is often repeated with different parameters / different initial weights.
- the addition of some random aspects during training, e.g., in the selection of the next training pair, can be useful.
Slide 23: Improving Convergence Speed
- Notations:
- W(t): weights after t learning steps
- E(t): error after t learning steps
- Δv(t): update of weight v in step t
- v(t+1) = v(t) + Δv(t)
Slide 24: Momentum term
- Δv(t) = −η (∂E(t)/∂v) + α Δv(t−1)
- its effect is that previous weight changes contribute to the current weight change.
- α ∈ (0,1) determines the strength of the momentum term.
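A minimal sketch of the momentum update rule on a one-dimensional quadratic error E(v) = v²; the choice of error function, η and α is mine, for illustration:

```python
def momentum_step(grad, prev_update, eta=0.05, alpha=0.8):
    """Delta v(t) = -eta * dE(t)/dv + alpha * Delta v(t-1)."""
    return -eta * grad + alpha * prev_update

# minimize the illustrative error E(v) = v^2, whose gradient is 2v
v, update = 5.0, 0.0
for _ in range(100):
    update = momentum_step(2 * v, update)
    v += update
# v has moved very close to the minimum at 0
```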
Slide 25: From Δv(t) = −η (∂E(t)/∂v) + α Δv(t−1)
- Δv(t) = −η (∂E(t)/∂v + α ∂E(t−1)/∂v) + α² Δv(t−2)
- = −η Σ_{k=0..l−1} α^k (∂E(t−k)/∂v) + α^l Δv(t−l)
- if l → ∞, then α^l → 0 ⇒ α^l Δv(t−l) → 0
- ⇒ the weight update is a weighted sum of the gradients of E for the last values of the weights.
Slide 26: From Δv(t) = −η Σ_{k=0..l−1} α^k (∂E(t−k)/∂v)
- If the derivatives of E are constant (flat error surface):
- ∂E(t−k)/∂v = ∂E(t)/∂v = constant
- Δv(t) = −η (∂E(t)/∂v) Σ_{k=0..l−1} α^k = −η (∂E(t)/∂v) (1 − α^l) / (1 − α) → (as l → ∞) −η (∂E(t)/∂v) · 1 / (1 − α)
- ⇒ with a momentum term the weight update is 1/(1 − α) times higher than with the standard BP learning rule.
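The 1/(1 − α) factor can be checked numerically: with a constant gradient g, the update is a geometric series converging to −ηg/(1 − α). The values of η, α and g below are arbitrary illustration values:

```python
# illustrative values; g plays the role of the constant gradient dE/dv
eta, alpha, g = 0.1, 0.9, 2.0

update = 0.0
for _ in range(200):
    update = -eta * g + alpha * update

# limit of the geometric series: -eta * g / (1 - alpha),
# i.e. 1/(1 - alpha) = 10 times the plain-BP update -eta * g
limit = -eta * g / (1 - alpha)
```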
Slide 27: Variants of Backpropagation
- variable learning parameter η:
- in regions with a small, "stable" gradient of E we may use a large η to prevent extremely slow learning.
- for a large gradient of the error it is better to use a small η to prevent overshooting the minimum.
Slide 28: Delta bar learning
- ηv: individual learning parameter for weight v
- Δv(t) = −ηv(t) (∂E(t)/∂v)
- Δηv(t) = κ, if the gradient keeps its sign for a few steps
- Δηv(t) = −β ηv(t), if the gradient changes sign
- Δηv(t) = 0, otherwise
- where κ > 0, β ∈ (0,1)
Slide 29: How to check the gradient sign?
- δv(t) = ∂E(t)/∂v: the current gradient
- δ̄v(t) = (1 − θ) δv(t) + θ δ̄v(t−1)
- where θ ∈ (0,1) and δ̄v(−1) = 0
- ⇒ δ̄v(t) is a weighted average of the δv(s) over the time points s ≤ t
- gradient constant: δ̄v(t−1) · δv(t) > 0
- gradient changes sign: δ̄v(t−1) · δv(t) < 0
- So: the Delta Bar rule!
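Slides 28-29 combine into one rule per weight; a sketch for a single weight, with illustrative values of κ, β and θ chosen by me:

```python
def delta_bar_step(grad, eta_v, bar_prev, kappa=0.01, beta=0.5, theta=0.7):
    """One delta-bar step for a single weight v.
    grad     = dE(t)/dv, the current gradient
    bar_prev = weighted average of past gradients (bar-delta at t-1)
    """
    if bar_prev * grad > 0:        # gradient kept its sign: speed up
        eta_v = eta_v + kappa
    elif bar_prev * grad < 0:      # gradient changed sign: slow down
        eta_v = eta_v - beta * eta_v
    update = -eta_v * grad
    bar = (1 - theta) * grad + theta * bar_prev   # new weighted average
    return update, eta_v, bar

# a stable gradient sign makes eta grow ...
eta, bar = 0.1, 0.0
for g in [1.0, 1.0, 1.0, 1.0]:
    _, eta, bar = delta_bar_step(g, eta, bar)

# ... while an oscillating gradient makes eta shrink
eta2, bar2 = 0.1, 0.0
for g in [1.0, -1.0, 1.0, -1.0]:
    _, eta2, bar2 = delta_bar_step(g, eta2, bar2)
```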
Slide 30: QuickProp
- uses second derivatives of E for a better approximation of the weight updates.
- From a Taylor expansion:
- ∂E(t+1)/∂v ≈ ∂E(t)/∂v + Δv(t) ∂²E(t)/∂v²
- We want ∂E(t+1)/∂v = 0 (local minimum)
- ⇒ Δv(t) = −(∂E(t)/∂v) / (∂²E(t)/∂v²)
- This can be computed by approximating the second derivative
- QuickProp often converges faster than BP
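The slide does not say how the second derivative is approximated; one common choice, assumed here, is the secant between two successive gradients. On a quadratic error the secant is exact, so QuickProp then jumps to the minimum in a single step:

```python
def quickprop_update(grad_now, grad_prev, update_prev):
    """Approximate d2E/dv2 by (grad_now - grad_prev) / update_prev,
    then apply Delta v(t) = -dE(t)/dv / (d2E(t)/dv2)."""
    curvature = (grad_now - grad_prev) / update_prev
    return -grad_now / curvature

# quadratic error E(v) = (v - 3)^2, gradient 2*(v - 3), minimum at v = 3
v = 0.0
g_prev = 2 * (v - 3)
update = 0.6            # a first ordinary gradient step: -0.1 * g_prev
v += update
g_now = 2 * (v - 3)

v += quickprop_update(g_now, g_prev, update)   # lands exactly on v = 3
```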
Slide 31: Other variants of BP
- the number of neurons may change during learning:
- pruning: neurons are deleted
- constructive methods: neurons are added
Slide 32: Attention
- Despite extensive training with a good learning algorithm, the resulting network may still have a large error on the training set:
- only local minima of the error have been found (a global minimum with a small attraction region may be missed)
- but it is also possible that networks of the given topology cannot compute the function with a small error.
Slide 33: Example X = {(1, 0.9), (2, 0.1), (3, 0.9)}
- (x, t): x input, t target
- a one-layer continuous perceptron can only compute monotonic functions:
- f(w + w₀) = 0.9 ⇒ w + w₀ = s₁
- f(2w + w₀) = 0.1 ⇒ 2w + w₀ = s₂
- f(3w + w₀) = 0.9 ⇒ 3w + w₀ = s₁
- subtracting the first equation from the third gives 2w = 0, so w = 0 and all outputs are equal: contradiction.
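The argument can also be checked numerically: since a sigmoid output σ(wx + w₀) is monotonic in x, the squared error on these targets stays bounded away from zero. A brute-force grid search (my own construction, with arbitrary search ranges) confirms this:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([1.0, 2.0, 3.0])
T = np.array([0.9, 0.1, 0.9])   # non-monotonic targets

# search over weight w and bias w0; sigmoid(w*x + w0) is monotonic in x,
# so the middle target 0.1 can never dip below its neighbours
best = min(
    float(np.sum((sigmoid(w * X + w0) - T) ** 2))
    for w in np.linspace(-10, 10, 201)
    for w0 in np.linspace(-20, 20, 201)
)
# the infimum over monotone output triples is 0.32, so best stays above 0.3
```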
Slide 34: Overtraining
- large networks with many layers ⇒ more types of functions, but training is more complicated
- in general, the size of the training set must be related to the number of weights in the network
- intensive training of a large network with a small training set may lead to a network that memorizes the training set
- overtraining: a network that performs very well on the training set but very badly on other points.
Slide 35: Generalization
- A network is said to generalize if it gives correct results for inputs not in the training set.
- In practice, generalization is tested by using, besides the training set, an additional test set.
- Stop criterion: the error on the test set.
- Often the training set and the test set are obtained by separating the original data set into 2 parts.
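A sketch of the split-and-stop procedure. The 80/20 split ratio and the "no improvement for a few epochs" patience criterion are my own concretizations of "stop criterion: the error on the test set":

```python
import numpy as np

def split(data, train_fraction=0.8, seed=0):
    """Shuffle and separate the original data set into 2 parts."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    idx = rng.permutation(len(data))
    cut = int(train_fraction * len(data))
    return data[idx[:cut]], data[idx[cut:]]

def should_stop(test_errors, patience=3):
    """Stop when the test error has not improved for `patience` epochs."""
    if len(test_errors) <= patience:
        return False
    return min(test_errors[-patience:]) >= min(test_errors[:-patience])

train, test = split(np.arange(10))
```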
Slide 36: Example of bad generalization
- 2 feedforward NNs, each with 2 layers; hidden layer with 2, resp. 25 neurons
- function sin(x), interval −π/2 ≤ x ≤ π/2
- training set X = {(−π/2 + iπ/8, sin(−π/2 + iπ/8)) | 0 ≤ i ≤ 8}
- 1st layer: standard activation function
- 2nd layer: one linear neuron
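The 9-point training set from this slide can be written out directly:

```python
import math

# X = {(-pi/2 + i*pi/8, sin(-pi/2 + i*pi/8)) | 0 <= i <= 8}
X = [(-math.pi / 2 + i * math.pi / 8, math.sin(-math.pi / 2 + i * math.pi / 8))
     for i in range(9)]
```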
Slide 37: Training results
Slide 38: Comments on the bad example
- Both NNs give a good approximation to sin(x) at the training set points.
- The small NN generalizes well; the big NN very badly:
- it memorized the training set but gives wrong results for other inputs ⇒ overtraining
- Big NN: 76 weights, too many for a 9-point training set.
- the use of an additional validation set can prevent bad generalization
Slide 39: Solutions to the problems of chapter 2
- http://wwwis.win.tue.nl/alex/
- http://wwwis.win.tue.nl/alex/HTML/NN/NNpb2.html
Slide 40: Example of a signal traveling through an axon
Slide 41: Example of single-layer perceptron learning
- [0,1]: check the learning rate, the momentum, and the number of hidden layers
- Perceptron, BP, etc.: [0,1]
Slide 42: Example of multiple-layer perceptron (MLP) learning
Slide 43: Generalization with MLP
Slide 44: OCR with MLP
Slide 45: Prediction with MLP