Title: Practical Aspects of Backpropagation
Slide 1: Practical Aspects of Backpropagation
- H. Ten Eikelder, A. Cristea
Slides 2-3: (figures only, no transcript)
Slide 4: Minimizing Squared Error
Slide 5: (figure only, no transcript)
Slide 6
- ∃ a perceptron that computes the training set without errors ⇔ ∃ a global minimum W of E with E(W) = 0
- However:
- often we end up with a W with E(W) > 0
- i.e., a local (non-global) minimum of E
- ⇒ ∃ a better set of weights
Slide 7: (figure only, no transcript)
Slide 8: 2 local minima and 1 saddle point
Slide 9: (figure only, no transcript)
Slide 10
- Even if ∃ a perceptron that computes the training set without errors, the weights W' found by training may be a global minimum, but there is NO guarantee!
- In practice the error surface may contain many local minima, and finding a global minimum can be difficult.
Slide 11: Many local minima. Example: f(x, y) = x²·sin(xy − 2) − sin(x − y)
Slide 12: Gradient graph for the previous function
Slide 13: What is a Local Minimum?
- Relative minimum: a place where the nearby points are all at least as high.
- g(x) ≥ g(x₀) ∀x with |x − x₀| ≤ ε
Slides 14-15: (figures only, no transcript)
Slide 16: What is a Local Minimum?
- Relative minimum ✓
- Strict relative minimum: a place where the nearby points are all higher.
- g(x) > g(x₀) ∀x with 0 < |x − x₀| < ε
Slide 17: But...
- convergence of error backpropagation (even to a local minimum) may be slow
- e.g., if the error surface is very flat
- then the derivatives ∂E/∂w are very small
- ⇒ very small weight changes / adaptations
Slide 18: (figure only, no transcript)
Slide 19: Definition of Local Minimum
- Relative minimum ✓
- Strict relative minimum ✓
- Regional minimum: a point w represents a regional minimum with value E₀ if E₀ is the maximal value such that for all points w₀ reachable from w by non-ascending trajectories w → w₀, E(w₀) ≤ E(w) = E₀
Slide 20: And yet...
- on the other hand, a highly curved error surface may lead to
- ⇒ large weight adaptations that "overshoot" the minimum
Slide 21: Example of overshooting
- step size: constant
- error function: f(x, y) = x²·sin(xy − 2) − sin(x − y)
Slide 22: In practice
- training is often repeated with different parameters / different initial weights.
- the addition of some random aspects during training, e.g., in the selection of the next training pair, can be useful.
Slide 23: Improving Convergence Speed
- Notations:
- W(t): weights after t learning steps
- E(t): error after t learning steps
- Δv(t): update of weight v in step t
- v(t+1) = v(t) + Δv(t)
Slide 24: Momentum term
- Δv(t) = −η (∂E(t)/∂v) + α Δv(t−1)
- its effect is that previous weight changes contribute to the current weight change.
- α ∈ (0,1) determines the strength of the momentum term.
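A minimal sketch of the momentum update rule on a one-dimensional quadratic error E(v) = v²; the choice of error function, η and α is mine, for illustration:

```python
def momentum_step(grad, prev_update, eta=0.05, alpha=0.8):
    """Delta v(t) = -eta * dE(t)/dv + alpha * Delta v(t-1)."""
    return -eta * grad + alpha * prev_update

# minimize the illustrative error E(v) = v^2, whose gradient is 2v
v, update = 5.0, 0.0
for _ in range(100):
    update = momentum_step(2 * v, update)
    v += update
# v has moved very close to the minimum at 0
```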
Slide 25: From Δv(t) = −η (∂E(t)/∂v) + α Δv(t−1)
- Δv(t) = −η (∂E(t)/∂v + α ∂E(t−1)/∂v) + α² Δv(t−2)
- = −η Σ_{k=0..l−1} α^k (∂E(t−k)/∂v) + α^l Δv(t−l)
- if l → ∞, then α^l → 0 ⇒ α^l Δv(t−l) → 0
- ⇒ the weight update is a weighted sum of the gradients of E for the last values of the weights.
Slide 26: From Δv(t) = −η Σ_{k=0..l−1} α^k (∂E(t−k)/∂v)
- If the derivatives of E are constant (flat error surface):
- ∂E(t−k)/∂v = ∂E(t)/∂v = constant
- Δv(t) = −η (∂E(t)/∂v) Σ_{k=0..l−1} α^k = −η (∂E(t)/∂v) (1 − α^l) / (1 − α) → (as l → ∞) −η (∂E(t)/∂v) · 1 / (1 − α)
- ⇒ with a momentum term the weight update is 1/(1 − α) times higher than with the standard BP learning rule.
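The 1/(1 − α) factor can be checked numerically: with a constant gradient g, the update is a geometric series converging to −ηg/(1 − α). The values of η, α and g below are arbitrary illustration values:

```python
# illustrative values; g plays the role of the constant gradient dE/dv
eta, alpha, g = 0.1, 0.9, 2.0

update = 0.0
for _ in range(200):
    update = -eta * g + alpha * update

# limit of the geometric series: -eta * g / (1 - alpha),
# i.e. 1/(1 - alpha) = 10 times the plain-BP update -eta * g
limit = -eta * g / (1 - alpha)
```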
Slide 27: Variants of Backpropagation
- variable learning parameter η:
- in regions with a small, "stable" gradient of E we may use a large η to prevent extremely slow learning.
- for a large gradient of the error it is better to use a small η to prevent overshooting the minimum.
Slide 28: Delta bar learning
- ηv: individual learning parameter for weight v
- Δv(t) = −ηv(t) (∂E(t)/∂v)
- Δηv(t) = κ, if the gradient keeps its sign for a few steps
- Δηv(t) = −β ηv(t), if the gradient changes sign
- Δηv(t) = 0, otherwise
- where κ > 0, β ∈ (0,1)
Slide 29: How to check the gradient sign?
- δv(t) = ∂E(t)/∂v: the current gradient
- δ̄v(t) = (1 − θ) δv(t) + θ δ̄v(t−1)
- where θ ∈ (0,1) and δ̄v(−1) = 0
- ⇒ δ̄v(t) is a weighted average of the δv(s) over the time points s ≤ t
- gradient constant: δ̄v(t−1) · δv(t) > 0
- gradient changes sign: δ̄v(t−1) · δv(t) < 0
- So: the Delta Bar rule!
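Slides 28-29 combine into one rule per weight; a sketch for a single weight, with illustrative values of κ, β and θ chosen by me:

```python
def delta_bar_step(grad, eta_v, bar_prev, kappa=0.01, beta=0.5, theta=0.7):
    """One delta-bar step for a single weight v.
    grad     = dE(t)/dv, the current gradient
    bar_prev = weighted average of past gradients (bar-delta at t-1)
    """
    if bar_prev * grad > 0:        # gradient kept its sign: speed up
        eta_v = eta_v + kappa
    elif bar_prev * grad < 0:      # gradient changed sign: slow down
        eta_v = eta_v - beta * eta_v
    update = -eta_v * grad
    bar = (1 - theta) * grad + theta * bar_prev   # new weighted average
    return update, eta_v, bar

# a stable gradient sign makes eta grow ...
eta, bar = 0.1, 0.0
for g in [1.0, 1.0, 1.0, 1.0]:
    _, eta, bar = delta_bar_step(g, eta, bar)

# ... while an oscillating gradient makes eta shrink
eta2, bar2 = 0.1, 0.0
for g in [1.0, -1.0, 1.0, -1.0]:
    _, eta2, bar2 = delta_bar_step(g, eta2, bar2)
```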
Slide 30: QuickProp
- uses second derivatives of E for a better approximation of the weight updates.
- From a Taylor expansion:
- ∂E(t+1)/∂v ≈ ∂E(t)/∂v + Δv(t) ∂²E(t)/∂v²
- We want ∂E(t+1)/∂v = 0 (local minimum)
- ⇒ Δv(t) = −(∂E(t)/∂v) / (∂²E(t)/∂v²)
- This can be computed by approximating the second derivative
- QuickProp often converges faster than BP
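The slide does not say how the second derivative is approximated; one common choice, assumed here, is the secant between two successive gradients. On a quadratic error the secant is exact, so QuickProp then jumps to the minimum in a single step:

```python
def quickprop_update(grad_now, grad_prev, update_prev):
    """Approximate d2E/dv2 by (grad_now - grad_prev) / update_prev,
    then apply Delta v(t) = -dE(t)/dv / (d2E(t)/dv2)."""
    curvature = (grad_now - grad_prev) / update_prev
    return -grad_now / curvature

# quadratic error E(v) = (v - 3)^2, gradient 2*(v - 3), minimum at v = 3
v = 0.0
g_prev = 2 * (v - 3)
update = 0.6            # a first ordinary gradient step: -0.1 * g_prev
v += update
g_now = 2 * (v - 3)

v += quickprop_update(g_now, g_prev, update)   # lands exactly on v = 3
```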
Slide 31: Other variants of BP
- the number of neurons may change during learning:
- pruning: neurons are deleted
- constructive methods: neurons are added
Slide 32: Attention
- Despite extensive training with a good learning algorithm, the resulting network may still have a large error on the training set:
- only local minima of the error have been found (a global minimum with a small attraction region may be missed)
- but it is also possible that networks of the given topology cannot compute the function with a small error.
Slide 33: Example X = {(1, 0.9), (2, 0.1), (3, 0.9)}
- (x, t): x input, t target
- a one-layer continuous perceptron can only compute monotonic functions:
- f(w + w₀) = 0.9 ⇒ w + w₀ = s₁
- f(2w + w₀) = 0.1 ⇒ 2w + w₀ = s₂
- f(3w + w₀) = 0.9 ⇒ 3w + w₀ = s₁
- subtracting the first equation from the third gives 2w = 0, so w = 0 and all outputs are equal: contradiction.
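The argument can also be checked numerically: since a sigmoid output σ(wx + w₀) is monotonic in x, the squared error on these targets stays bounded away from zero. A brute-force grid search (my own construction, with arbitrary search ranges) confirms this:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([1.0, 2.0, 3.0])
T = np.array([0.9, 0.1, 0.9])   # non-monotonic targets

# search over weight w and bias w0; sigmoid(w*x + w0) is monotonic in x,
# so the middle target 0.1 can never dip below its neighbours
best = min(
    float(np.sum((sigmoid(w * X + w0) - T) ** 2))
    for w in np.linspace(-10, 10, 201)
    for w0 in np.linspace(-20, 20, 201)
)
# the infimum over monotone output triples is 0.32, so best stays above 0.3
```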
Slide 34: Overtraining
- large networks with many layers ⇒ more types of functions, but training is more complicated
- in general, the size of the training set must be related to the number of weights in the network
- intensive training of a large network with a small training set may lead to a network that memorizes the training set
- overtraining: a network that performs very well on the training set but very badly on other points.
Slide 35: Generalization
- A network is said to generalize if it gives correct results for inputs not in the training set.
- In practice, generalization is tested by using, besides the training set, an additional test set.
- Stop criterion: the error on the test set.
- Often the training set and the test set are obtained by separating the original data set into 2 parts.
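A sketch of the split-and-stop procedure. The 80/20 split ratio and the "no improvement for a few epochs" patience criterion are my own concretizations of "stop criterion: the error on the test set":

```python
import numpy as np

def split(data, train_fraction=0.8, seed=0):
    """Shuffle and separate the original data set into 2 parts."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    idx = rng.permutation(len(data))
    cut = int(train_fraction * len(data))
    return data[idx[:cut]], data[idx[cut:]]

def should_stop(test_errors, patience=3):
    """Stop when the test error has not improved for `patience` epochs."""
    if len(test_errors) <= patience:
        return False
    return min(test_errors[-patience:]) >= min(test_errors[:-patience])

train, test = split(np.arange(10))
```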
Slide 36: Example of bad generalization
- 2 feedforward NNs, each with 2 layers; hidden layer with 2, resp. 25 neurons
- function sin(x), interval −π/2 ≤ x ≤ π/2
- training set X = {(−π/2 + iπ/8, sin(−π/2 + iπ/8)) | 0 ≤ i ≤ 8}
- 1st layer: standard activation function
- 2nd layer: one linear neuron
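The 9-point training set from this slide can be written out directly:

```python
import math

# X = {(-pi/2 + i*pi/8, sin(-pi/2 + i*pi/8)) | 0 <= i <= 8}
X = [(-math.pi / 2 + i * math.pi / 8, math.sin(-math.pi / 2 + i * math.pi / 8))
     for i in range(9)]
```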
Slide 37: Training results
Slide 38: Comments on the bad example
- Both NNs give a good approximation to sin(x) at the training set points.
- The small NN generalizes well; the big NN very badly:
- it memorized the training set but gives wrong results for other inputs ⇒ overtraining
- Big NN: 76 weights, too many for a 9-point training set.
- the use of an additional validation set can prevent bad generalization
Slide 39: Solutions to the problems of chapter 2
- http://wwwis.win.tue.nl/alex/
- http://wwwis.win.tue.nl/alex/HTML/NN/NNpb2.html
Slide 40: Example of a signal traveling through an axon
Slide 41: Example of single-layer perceptron learning
- [0,1]: check the learning rate, the momentum, and the number of hidden layers
- Perceptron, BP, etc.: [0,1]
Slide 42: Example of multiple-layer perceptron (MLP) learning
Slide 43: Generalization with MLP
Slide 44: OCR with MLP
Slide 45: Prediction with MLP