Title: 2L490 Backpropagation 1
1. Error Backpropagation
- All learning algorithms for (layered) feed-forward networks are based on a technique called error backpropagation.
- This is a form of corrective supervised learning which consists of two phases. In the first (forward) phase the output of each neuron is computed; in the second (backward) phase the partial derivatives of the error function with respect to the weights are computed, after which the weights are updated.
2. Approach
- The approach we take
  - is a minor variation of the one in R. Rojas, Neural Networks, Springer, 1996,
  - applies to general feed-forward networks,
  - allows distinct activation functions for each of the neurons,
  - uses a graphical method called B-diagrams to illustrate how partial derivatives of the error function can be computed.
3. General Feed-forward Networks
- A general feed-forward network consists of
  - n input nodes (numbered 1, ..., n)
  - l hidden neurons (numbered n+1, ..., n+l)
  - m output neurons (numbered n+l+1, ..., n+l+m)
  - a set of connections such that the network does not contain cycles. Hence the hidden neurons can be topologically sorted, i.e. numbered such that, for every connection (i, j), i < j, n < j, and i < n+l+1.
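A minimal sketch of this representation and of the numbering condition above (my own illustration; the sizes and weights are made up):

import random  # not needed; plain Python suffices

# inputs 1..n, hidden n+1..n+l, outputs n+l+1..n+l+m
n, l, m = 2, 2, 1
# weights[(i, j)] = w_ij for each connection (i, j); indices are 1-based
weights = {
    (1, 3): 0.5, (2, 3): -0.3,   # inputs 1, 2 feed hidden neuron 3
    (1, 4): 0.8, (3, 4): 1.2,    # input 1 and neuron 3 feed hidden neuron 4
    (3, 5): -1.0, (4, 5): 0.7,   # neurons 3, 4 feed output neuron 5
}
# check the topological-numbering condition: i < j, n < j, i <= n + l
assert all(i < j and j > n and i <= n + l for (i, j) in weights)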
4. (No transcript)
5. B-diagrams
- A B-diagram is a directed acyclic network containing four types of nodes:
  - Fan-in nodes
  - Fan-out nodes
  - Product nodes
  - Function nodes
- The forward phase computes function composition, the backward phase computes partial derivatives.
6. B-diagram (fan-in node)
(Figure: forward phase / backward phase)
7. B-diagram (fan-out node)
(Figure: forward phase / backward phase)
8. B-diagram (product node)
(Figure: forward phase / backward phase)
9. B-diagram (function node)
(Figure: forward phase / backward phase)
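The figures for the four node types are not reproduced here; the following sketch (my own reconstruction, following Rojas's construction) summarizes the usual forward and backward rules they depict:

class FanIn:                        # addition node
    def forward(self, xs):          # sums its inputs
        self.n = len(xs)
        return sum(xs)
    def backward(self, d):          # passes the traversing value unchanged to every input
        return [d] * self.n

class FanOut:                       # copy node
    def forward(self, x, k):        # copies its input to k outgoing edges
        return [x] * k
    def backward(self, ds):         # sums the traversing values arriving from its outputs
        return sum(ds)

class Product:                      # models an edge with weight w
    def __init__(self, w):
        self.w = w
    def forward(self, x):           # multiplies the signal by the weight
        return self.w * x
    def backward(self, d):          # multiplies the traversing value by the weight
        return self.w * d

class Function:                     # node for a differentiable activation f
    def __init__(self, f, fprime):
        self.f, self.fprime = f, fprime
    def forward(self, x):           # evaluates f and stores f'(x) for the backward phase
        self.stored = self.fprime(x)
        return self.f(x)
    def backward(self, d):          # multiplies the traversing value by the stored derivative
        return self.stored * d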
10. Chain-rule
(g ∘ f)(x) = g(f(x))
(g ∘ f)'(x) = g'(f(x)) · f'(x)
11. Remark
Note that the product node, the fan-in node, and the function node are all special cases of a more general node for functions with an arbitrary number of arguments that stores all partial derivatives.
(Figure: a node for f(x1, x2))
12. (No transcript)
13. Translation scheme
- As a first step in the development of the error backpropagation algorithm we show how to translate a general feed-forward net into a B-diagram:
  - Replace each input node by a fan-out node
  - Replace each edge by a product node
  - Replace each neuron by a fan-in node, followed by a function node, followed by a fan-out node
14. Translation of a neuron
Note that this translation only captures the activation function and connection pattern of a neuron. The weights are modeled by separate product nodes.
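As an illustration (my own example, not the slide's figure): a neuron with inputs x_1, x_2, weights w_1, w_2 and activation f is translated into two product nodes, one fan-in node and one function node. The forward phase then computes, and the backward phase (fed with 1 at the output) delivers,

y = f(w_1 x_1 + w_2 x_2), \qquad
\frac{\partial y}{\partial x_1} = f'(w_1 x_1 + w_2 x_2)\, w_1, \qquad
\frac{\partial y}{\partial x_2} = f'(w_1 x_1 + w_2 x_2)\, w_2 .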
15. Simplifications
- The B-diagram of a general feed-forward net can be simplified as follows:
  - Neurons with a single output do not require a fan-out node
  - Neurons with a single input do not require a fan-in node
  - Neurons with activation function f(z) = z do not require a function node
  - Edges with weight 1 do not require a product node
16. Backpropagation theorem
Let B be the B-diagram of a general feed-forward net N that computes a function F: R^n → R. Presenting value x_i at input node i of B and performing the forward phase of each node (in the order indicated by the numbering of the nodes of N) will result in the value F(x) at the output of B. Subsequently presenting the value 1 at the output node and performing the backward phase will result in the partial derivative ∂F(x)/∂x_i at input i.
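A small numerical check of the theorem (my own example, with made-up weights): the B-diagram of F(x1, x2) = tanh(w1·x1 + w2·x2) is run forward and then backward with 1 at the output, and the result is compared with a finite-difference estimate of ∂F/∂x1.

import math

w1, w2 = 0.7, -1.3
x1, x2 = 0.4, 0.9

# forward phase: product nodes, fan-in node, function node
p1, p2 = w1 * x1, w2 * x2          # product nodes
s = p1 + p2                        # fan-in node
y = math.tanh(s)                   # function node; stores f'(s) = 1 - tanh(s)^2
stored = 1.0 - y * y

# backward phase: present 1 at the output
d_s = 1.0 * stored                 # function node multiplies by the stored derivative
d_p1, d_p2 = d_s, d_s              # fan-in node copies the traversing value
d_x1, d_x2 = w1 * d_p1, w2 * d_p2  # product nodes multiply by their weights

# compare with a finite-difference approximation of dF/dx1
eps = 1e-6
F = lambda a, b: math.tanh(w1 * a + w2 * b)
print(d_x1, (F(x1 + eps, x2) - F(x1 - eps, x2)) / (2 * eps))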
17. Error function
Consider a general FFN that computes F: R^n → R^m, with training set {(x_q, t_q)}. Then the error of training pair q is defined as follows.
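Presumably the usual sum-of-squares error (the formula itself is not shown here), with F(x_q) the network output and t_q the target of pair q:

E_q = \tfrac{1}{2}\,\lVert F(x_q) - t_q \rVert^{2}, \qquad E = \sum_{q} E_q .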
18. FFNs that compute Error Functions
Hidden neurons
19. Error Dependence on Weight w_ij
20. E(rror)B(ack)P(ropagation) Learning
21. EBP learning (forward phase)
22. EBP learning (backward phase)
23. EBP learning (update phase)
Beware: a weight update can only be performed after all errors that depend on that weight have been computed. A separate phase trivially guarantees this requirement.
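To make the caveat concrete, a hedged sketch (made-up values, my own code): gradients are accumulated during the backward phase and only applied in a separate update phase, so no delta is ever computed with partially updated weights.

weights = {(1, 3): 0.5, (2, 3): -0.3}   # connection (i, j) -> w_ij
output  = {1: 0.4, 2: 0.9}              # forward-phase outputs y_i (made up)
delta   = {3: 0.25}                     # backward-phase errors delta_j (made up)

# backward phase: only accumulate gradients, do not modify 'weights' yet
grad = {(i, j): delta[j] * output[i] for (i, j) in weights}   # dE/dw_ij = delta_j * y_i

# separate update phase: apply all updates at once
eta = 0.1
for key in weights:
    weights[key] -= eta * grad[key]
print(weights)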
24. Layered version of EBP
- To obtain a version of the error backpropagation algorithm for layered feed-forward networks, i.e. multi-layer perceptrons, we
  - introduce a layer-oriented node numbering,
  - visit the nodes on a layer-by-layer basis,
  - introduce vector notation for quantities pertaining to a single layer.
25. Layer-oriented Node Numbers
- Assume that the nodes of the network can be organized in r+1 layers, numbered 0, ..., r
- For 0 ≤ s ≤ r+1, let n_s denote the number of nodes in layers 0, ..., (s-1). Hence node i lies in layer s iff n_s < i ≤ n_{s+1}
- Renumber the nodes according to this scheme
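A small worked example of this numbering (my own, for illustration): a network with 2 input nodes and layers of 3 and 1 neurons has r = 2 and layer sizes 2, 3, 1, so

n_0 = 0, \quad n_1 = 2, \quad n_2 = 5, \quad n_3 = 6,

and nodes 1-2 form layer 0, nodes 3-5 layer 1, and node 6 layer 2.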
26. Weight Matrix of Layer s
Let W^s be the (n_s × n_{s-1})-matrix of weights on the connections from layer s-1 to layer s. Note that for the sake of simplicity we have added zero weights, so that there exists a connection between any pair of nodes in successive layers. For convenience we write w^s_ij instead of (W^s)_ij.
27. EBP (forward phase, layered)
28. EBP (backward phase, layered)
29. EBP (update phase, layered)
30. Vector notation
For a continuous and differentiable function f: R → R and a vector z ∈ R^n of arbitrary dimension n, define the n-dimensional vector f(z) and the corresponding diagonal matrix as follows.
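The standard componentwise definitions, which are presumably what the slide shows:

f(z) = \bigl(f(z_1), \dots, f(z_n)\bigr)^{\mathsf T}, \qquad
D_f(z) = \operatorname{diag}\bigl(f'(z_1), \dots, f'(z_n)\bigr).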
31. EBP (layered and vectorized)
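A sketch of the layered, vectorized EBP step that slides 27-31 describe (the slide formulas are not reproduced here, so the sigmoid activation, sum-of-squares error, learning rate and variable names are my own assumptions):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ebp_step(W, x, t, eta=0.1):
    """One forward/backward/update pass. W is a list of weight matrices,
    W[s] mapping the outputs of layer s to the inputs of layer s+1."""
    # forward phase: y[s] is the output vector of layer s, z[s] its net input
    y, z = [x], []
    for Ws in W:
        z.append(Ws @ y[-1])
        y.append(sigmoid(z[-1]))

    # backward phase: delta[s] = D_f(z[s]) * (backpropagated error)
    deriv = [ys * (1.0 - ys) for ys in y[1:]]          # sigmoid' written via its output
    delta = [None] * len(W)
    delta[-1] = deriv[-1] * (y[-1] - t)                # output layer
    for s in range(len(W) - 2, -1, -1):                # hidden layers, back to front
        delta[s] = deriv[s] * (W[s + 1].T @ delta[s + 1])

    # update phase: Delta W[s] = -eta * delta[s] * y[s]^T
    for s in range(len(W)):
        W[s] -= eta * np.outer(delta[s], y[s])
    return 0.5 * np.sum((y[-1] - t) ** 2)

# tiny usage example with made-up sizes and data
rng = np.random.default_rng(0)
W = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
for _ in range(5):
    err = ebp_step(W, np.array([0.2, 0.8]), np.array([1.0]))
print(err)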
32. Practical Aspects
- Convergence improvements
  - Elementary improvements
  - Advanced first-order methods
  - Second-order methods
- Generalization
  - Overtraining
  - Training with cross validation
33. Elementary Improvements
- Momentum term
- Resilient backpropagation
  - the gradient determines the sign of the weight updates
  - the learning rate increases for a stable gradient
  - the learning rate decreases for an alternating gradient
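Hedged sketches of both improvements (my own code; the slides give no specific constants, so the defaults below are common but assumed):

import numpy as np

def momentum_update(w, grad, velocity, eta=0.1, alpha=0.9):
    """Gradient descent with a momentum term: part of the previous update
    is re-applied, which damps oscillations in narrow valleys."""
    velocity = alpha * velocity - eta * grad
    return w + velocity, velocity

def rprop_update(w, grad, prev_grad, step, inc=1.2, dec=0.5):
    """Resilient backpropagation: only the sign of the gradient is used;
    the per-weight step grows while the sign stays stable and shrinks
    when it alternates."""
    same_sign = grad * prev_grad
    step = np.where(same_sign > 0, step * inc,
           np.where(same_sign < 0, step * dec, step))
    return w - np.sign(grad) * step, step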
34. First-order Methods
- Steepest descent, where the step size is chosen such that the error along the descent direction is minimal (a line search).
- Conjugate gradient methods: the search directions are given by the negative gradient plus a suitably chosen multiple of the previous direction.
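In the usual notation (a reconstruction, since the slide's formulas are omitted): steepest descent performs a line search along the negative gradient,

w_{k+1} = w_k - \eta_k \nabla E(w_k), \qquad
\eta_k = \arg\min_{\eta \ge 0} E\bigl(w_k - \eta\,\nabla E(w_k)\bigr),

while conjugate gradient methods use search directions

d_k = -\nabla E(w_k) + \beta_k\, d_{k-1},

with \beta_k suitably chosen (e.g. by the Fletcher-Reeves or Polak-Ribière formula).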
35. Second-order Methods (derivation)
- Consider the Taylor expansion of the error function around w_0.
- Ignore third- and higher-order terms and choose the weight change such that the resulting quadratic approximation is minimal, i.e.
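Written out (a reconstruction of the omitted formulas), with H the Hessian of E at w_0:

E(w_0 + \Delta w) \approx E(w_0) + \nabla E(w_0)^{\mathsf T}\,\Delta w
  + \tfrac{1}{2}\,\Delta w^{\mathsf T} H\,\Delta w,
\qquad
\Delta w = -H^{-1}\,\nabla E(w_0).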
36. (Quasi-)Newton methods
- Quasi-Newton methods use the update rule given below.
- Fast convergence (Newton's method requires 1 iteration for a quadratic error function)
- Solving the above equation is time consuming
- The Hessian matrix H can be very large
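The update rule is presumably the Newton step

H(w_k)\,\Delta w = -\nabla E(w_k), \qquad w_{k+1} = w_k + \Delta w,

where quasi-Newton methods (e.g. BFGS) replace H or its inverse by an approximation built up from successive gradients.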
37. Levenberg-Marquardt Methods
- LM-methods use the update rule given below.
- This is a combination of gradient descent and Newton's method:
  - if the damping parameter is small, the step approaches the Newton step
  - if it is large, the step approaches a small gradient-descent step
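The rule referred to is presumably

\Delta w = -\bigl(H + \lambda I\bigr)^{-1}\,\nabla E(w),

so that for small \lambda the step approaches the Newton step, while for large \lambda it approaches -\tfrac{1}{\lambda}\nabla E(w), i.e. gradient descent with a small step size. (For sum-of-squares errors, H is often approximated by J^T J with J the Jacobian of the residuals.)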
38. Generalization
- Generalization addresses how well a net performs on fresh samples from the population (samples not part of the training set).
- Generalization is influenced by three factors:
  - The architecture of the network
  - The size of the training set
  - The complexity of the problem
39. Overtraining
- Overtraining is the situation in which the network memorizes the data of the training set but generalizes poorly.
- The size of the training set must be related to the amount of data the network can memorize (i.e. the number of weights).
- Vice versa: in order to prevent overtraining, the number of weights must be kept in proportion to the size of the training set.
40. Cross Validation
- To protect against overtraining a technique called cross-validation can be used. It involves
  - an additional data set called the validation set,
  - computing the error made by the net on this validation set while training with the training set,
  - stopping training when the error on the validation set starts increasing.
- Usually the size of the validation set is chosen to be roughly half the size of the training set.
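A hedged sketch of this early-stopping procedure (the net object and its methods train_one_epoch, error, get_weights and set_weights are hypothetical names, not from the slides):

def train_with_validation(net, train_set, val_set, max_epochs=1000):
    """Train with ordinary EBP, but monitor the validation error and stop
    as soon as it starts increasing; keep the best weights seen so far."""
    best_err, best_weights = float("inf"), net.get_weights()
    for epoch in range(max_epochs):
        net.train_one_epoch(train_set)        # EBP on the training set
        val_err = net.error(val_set)          # error on the validation set
        if val_err > best_err:                # validation error starts increasing
            break                             # -> stop training
        best_err, best_weights = val_err, net.get_weights()
    net.set_weights(best_weights)
    return best_err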
41. Practical Aspects
- Preprocessing
  - Normalization
  - Decorrelation
- Network pruning
  - Magnitude-based
  - Optimal brain damage
  - Optimal brain surgeon
42. Preprocessing
- Normalization
- Decorrelation
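A hedged sketch of both preprocessing steps (my own code): normalization to zero mean and unit variance per input component, and decorrelation by rotating the inputs onto the principal axes of their covariance matrix.

import numpy as np

def normalize(X):
    """X: array of shape (num_samples, num_inputs)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def decorrelate(X):
    """Rotate the centered inputs onto the eigenvectors of their covariance
    matrix, so that the components become uncorrelated."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)
    return Xc @ eigvecs

X = np.random.default_rng(1).normal(size=(100, 3))
print(np.cov(decorrelate(normalize(X)), rowvar=False).round(3))  # ~diagonal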
43. Pruning
- Pruning is a technique to increase network performance by elimination (pruning in the strict sense) or addition (pruning in the broad sense) of neurons and/or connections.

  training set error   validation set error   action taken
  too large            irrelevant             add neurons
  small                too large              remove neurons
  small                small                  stop pruning
44. Pruning connections
Optimal Brain Damage
Optimal Brain Surgeon
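For reference (the slide's formulas are not transcribed): Optimal Brain Damage estimates the saliency of a weight from a diagonal approximation of the Hessian, while Optimal Brain Surgeon uses the full inverse Hessian,

s_i \approx \tfrac{1}{2}\, H_{ii}\, w_i^{2} \quad \text{(OBD)},
\qquad
s_i = \frac{w_i^{2}}{2\,[H^{-1}]_{ii}} \quad \text{(OBS)},

and the connection with the smallest saliency is pruned (OBS additionally adjusts the remaining weights after each removal).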