2L490 Backpropagation 1

1
Error Backpropagation
  • All learning algorithms for (layered)
    feed-forward networks are based on a technique
    called error backpropagation
  • This is a form of corrective supervised learning
    which consists of two phases. In the first
    (forward) phase the output of each neuron is
    computed; in the second (backward) phase the
    partial derivatives of the error function with
    respect to the weights are computed, after which
    the weights are updated

2
Approach
  • The approach we take
  • is a minor variation of the one in R. Rojas,
    Neural Networks, Springer, 1996
  • applies to general feed-forward networks
  • allows distinct activation functions for each of
    the neurons
  • uses a graphical method called B-diagrams
    to illustrate how partial derivatives of the
    error function can be computed

3
General Feed-forward Networks
  • A general feed-forward network consists of
  • n input nodes (numbered 1, …, n)
  • l hidden neurons (numbered n+1, …, n+l)
  • m output neurons (numbered n+l+1, …, n+l+m)
  • a set of connections such that the network does
    not contain cycles. Hence the hidden neurons
    can be topologically sorted, i.e. numbered such
    that (i, j) is a connection only if
    i < j and n < j and i < n+l+1 (see the sketch
    below)
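
A minimal sketch of this numbering condition; the function name and the range check are illustrative, not from the slides:

    def valid_connection(i, j, n, l, m):
        # nodes are numbered 1..n (inputs), n+1..n+l (hidden),
        # and n+l+1..n+l+m (outputs)
        assert 1 <= i <= n + l + m and 1 <= j <= n + l + m
        # a connection (i, j) must run forward in the numbering,
        # must not end in an input node, and must not start in
        # an output node
        return i < j and j > n and i < n + l + 1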

4
(No Transcript)
5
B-diagrams
  • A B-diagram is a directed acyclic network
    containing four types of nodes
  • Fan-in nodes
  • Fan-out nodes
  • Product nodes
  • Function nodes
  • The forward phase computes function composition,
    the backward phase computes partial derivatives.

6
B-diagram (fan-in node)
Forward phase
Backward phase
7
B-diagram (fan-out node)
Forward phase Backward phase
8
B-diagram (product node)
Forward phase Backward phase
9
B-diagram (function node)
Forward phase Backward phase
10
Chain-rule
Forward phase: presenting x at the input yields
(g ∘ f)(x) = g(f(x)) at the output.
Backward phase: presenting 1 at the output yields
(g ∘ f)′(x) = g′(f(x)) · f′(x) at the input.
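
A minimal numeric illustration of the two phases for a composition g(f(x)), here with f(x) = x² and g(y) = sin(y) chosen purely as an example:

    import math

    def forward_backward(x):
        # forward phase: compute and store the intermediate values
        y = x * x              # f(x)
        z = math.sin(y)        # g(f(x))
        # backward phase: feed 1 into the output and multiply by
        # the stored local derivatives on the way back
        dz_dy = math.cos(y)    # g'(f(x))
        dy_dx = 2 * x          # f'(x)
        return z, 1.0 * dz_dy * dy_dx   # value and d(g∘f)/dx

    print(forward_backward(0.5))  # approximately (0.2474, 0.9689)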
11
Remark
Note that the product node, the fan-in node,
and the function node are all special cases of a
more general node for functions with an arbitrary
number of arguments (illustrated here for
f(x1, x2)) that stores all partial derivatives.
12
(No Transcript)
13
Translation scheme
  • As a first step in the development of the error
    backpropagation algorithm we show how to
    translate a general feed-forward net into a
    B-diagram
  • Replace each input node by a fan-out node
  • Replace each edge by a product node
  • Replace each neuron by a fan-in node, followed
    by a function node, followed by a fan-out node

14
Translation of a neuron
Note that this translation only captures the
activation function and connection pattern of a
neuron. The weights are modeled by separate
product nodes.
15
Simplifications
  • The B-diagram of a general feed-forward net can
    be simplified as follows
  • Neurons with a single output do not require a
    fan-out node
  • Neurons with a single input do not require a
    fan-in node
  • Neurons with activation function f(z) = z do not
    require a function node
  • Edges with weight 1 do not require a product
    node

16
Backpropagation theorem
Let B be the B-diagram of a general
feed-forward net N that computes a function
F : Rⁿ → R. Presenting the value xᵢ at input
node i of B and performing the forward phase
of each node (in the order indicated by the
numbering of the nodes of N) will result in the
value F(x) at the output of B. Subsequently
presenting the value 1 at the output node and
performing the backward phase will result in the
partial derivative ∂F(x)/∂xᵢ at input i.
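
A minimal sketch of the theorem for a single sigmoid neuron F(x1, x2) = σ(w1·x1 + w2·x2); the weights, inputs and the finite-difference check are illustrative, not from the slides:

    import math

    def sigma(z):
        return 1.0 / (1.0 + math.exp(-z))

    w1, w2 = 0.3, -0.8
    x1, x2 = 1.5, 0.4

    # forward phase through the B-diagram nodes
    p1, p2 = w1 * x1, w2 * x2             # product nodes
    z = p1 + p2                           # fan-in node
    F = sigma(z)                          # function node: F(x)

    # backward phase: present the value 1 at the output
    dF_dz = sigma(z) * (1.0 - sigma(z))   # derivative stored by the function node
    dF_dx1 = 1.0 * dF_dz * w1             # each product node multiplies by its weight
    dF_dx2 = 1.0 * dF_dz * w2

    # numerical check of ∂F/∂x1 by central differences
    eps = 1e-6
    approx = (sigma(w1*(x1+eps) + w2*x2) - sigma(w1*(x1-eps) + w2*x2)) / (2*eps)
    print(F, dF_dx1, approx)   # dF_dx1 and approx should agree closely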
17
Error function
Consider a general FFN that computes
with training set
Then the error of training pair q is defined by
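
The formulas of this slide are missing from the transcript; a standard squared-error formulation consistent with the remaining slides would be

    F : Rⁿ → Rᵐ,   training set  { (x_q, t_q) | q = 1, …, Q },

    E_q = ½ ‖F(x_q) − t_q‖²  =  ½ Σ_k ( F_k(x_q) − t_{q,k} )²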
18
FFNs that compute Error Functions
Hidden neurons
19
Error Dependence on Weight wij
20
E(rror)B(ack)P(ropagation) Learning
21
EBP learning (forward phase)
22
EBP learning (backward phase)
23
EBP learning (update phase)
Beware: a weight update can only be
performed after all errors that depend on that
weight have been computed. A separate update
phase trivially guarantees this requirement.
24
Layered version of EBP
  • To obtain a version of the error backpropagation
    algorithm for layered feed-forward networks, i.e.
    multi-layer perceptrons, we
  • introduce a layer-oriented node numbering
  • visit the nodes on a layer-by-layer basis
  • introduce vector notation for quantities
    pertaining to a single layer

25
Layer-oriented Node Numbers
  • Assume that the nodes of the network can be
    organized in r+1 layers, numbered 0, …, r
  • For 0 ≤ s ≤ r+1, let n_s denote the number
    of nodes in layers 0, …, s−1. Hence node i
    lies in layer s iff n_s < i ≤ n_{s+1}
  • Renumber the nodes according to this scheme

26
Weight Matrix of Layer s
Let W_s be the (n_s × n_{s−1})-matrix containing
the weights of the connections from layer s−1 to
layer s. Note that for the sake of simplicity we
have added zero weights such that there exists a
connection between any pair of nodes in
successive layers. For convenience we write
w^s_ij instead of (W_s)_ij.
27
EBP (forward phase, layered)
28
EBP (backward phase, layered)
29
EBP (update phase, layered)
30
Vector notation
For a continuous and differentiable function
f : R → R and a vector z ∈ Rⁿ of arbitrary
dimension n, define the n-dimensional vector F(z)
by applying f to each component of z, and define
the associated diagonal matrix from the
derivatives f′(z_i) (see below).
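
The two defining formulas are missing from the transcript; the standard definitions they presumably correspond to are

    F(z) = ( f(z_1), f(z_2), …, f(z_n) )ᵀ

    D(z) = diag( f′(z_1), f′(z_2), …, f′(z_n) )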
31
EBP (layered and vectorized)
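
The algorithm is given only as formulas on the original slide; a minimal NumPy sketch of one layered, vectorized forward, backward and update pass for a small 2-3-1 network (the layer sizes, the sigmoid activation and the squared-error loss are assumptions, not taken from the slides):

    import numpy as np

    rng = np.random.default_rng(0)

    def f(z):            # sigmoid activation, applied componentwise
        return 1.0 / (1.0 + np.exp(-z))

    def df(z):           # its derivative
        s = f(z)
        return s * (1.0 - s)

    # W[s] maps the outputs of layer s-1 to the inputs of layer s
    W = [None, rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
    x = np.array([0.5, -1.0])     # input vector (layer 0)
    t = np.array([1.0])           # target output
    eta = 0.1                     # learning rate

    # forward phase, layer by layer
    o, z = [x], [None]
    for s in (1, 2):
        z.append(W[s] @ o[s - 1])
        o.append(f(z[s]))

    # backward phase: delta[s] combines the local derivatives with
    # the error back-propagated from the layer above
    delta = [None, None, None]
    delta[2] = df(z[2]) * (o[2] - t)             # output layer, squared error
    delta[1] = df(z[1]) * (W[2].T @ delta[2])    # hidden layer

    # update phase: the gradient w.r.t. W[s] is delta[s] · o[s-1]^T
    for s in (1, 2):
        W[s] -= eta * np.outer(delta[s], o[s - 1])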
32
Practical Aspects
  • Convergence improvements
  • Elementary improvements
  • Advanced first-order methods
  • Second order methods
  • Generalization
  • Overtraining
  • Training with cross validation

33
Elementary Improvements
  • Momentum term
  • Resilient backpropagation (see the sketch below)
  • the gradient determines only the sign of the
    weight update
  • the learning rate increases while the gradient
    sign is stable
  • the learning rate decreases when the gradient
    sign alternates
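
A minimal sketch of the momentum update and the resilient-backpropagation sign rule for a single weight; all constants are typical illustrative values, not taken from the slides:

    import math

    def momentum_step(w, grad, prev_dw, eta=0.1, alpha=0.9):
        # momentum: part of the previous update is carried over
        dw = -eta * grad + alpha * prev_dw
        return w + dw, dw

    def rprop_step(w, grad, prev_grad, step, up=1.2, down=0.5):
        # resilient backpropagation: only the sign of the gradient is used
        if grad * prev_grad > 0:       # stable sign: increase the step size
            step *= up
        elif grad * prev_grad < 0:     # alternating sign: decrease the step size
            step *= down
        if grad != 0:
            w -= math.copysign(step, grad)
        return w, step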

34
First-order Methods
  • Steepest descent: in each step the learning rate
    is chosen such that the error along the negative
    gradient direction is minimal (see the formulas
    below)
  • Conjugate gradient methods: the search directions
    combine the new negative gradient with the
    previous direction, with a suitably chosen
    coefficient
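
The formulas are missing from the transcript; the standard forms they presumably show are

    steepest descent:     w_{k+1} = w_k − α_k ∇E(w_k),
                          with α_k chosen such that E(w_k − α_k ∇E(w_k)) is minimal

    conjugate gradient:   d_{k+1} = −∇E(w_{k+1}) + β_k d_k,
                          with β_k suitably chosen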

35
Second-order Methods (derivation)
  • Consider the Taylor expansion of the error
    function around w_0
  • Ignore third- and higher-order terms and choose
    the weight change such that the resulting
    quadratic approximation is minimal (see the
    reconstruction below)
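
A hedged reconstruction of the elided derivation, writing H for the Hessian of the error function at w_0:

    E(w_0 + Δw) ≈ E(w_0) + ∇E(w_0)ᵀ Δw + ½ Δwᵀ H Δw

    minimizing the right-hand side over Δw gives

    H Δw = −∇E(w_0),   i.e.   Δw = −H⁻¹ ∇E(w_0)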

36
(Quasi) Newton methods
  • Quasi-Newton methods use the Newton update rule
    with the (inverse) Hessian replaced by an
    approximation that is built up iteratively
  • Fast convergence (Newton's method requires one
    iteration for a quadratic error function)
  • Solving the Newton equation exactly is time
    consuming
  • The Hessian matrix H can be very large

37
Levenberg-Marquardt Methods
  • LM methods use an update rule (see below) that
    is a combination of gradient descent and
    Newton's method
  • If the damping parameter λ is small, the update
    approaches Newton's method
  • If λ is large, the update approaches gradient
    descent with a small step size
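
The update rule itself is missing from the transcript; a standard Levenberg-Marquardt form consistent with the bullets above is

    Δw = −(H + λI)⁻¹ ∇E(w)

    λ → 0:      Δw → −H⁻¹ ∇E(w)         (Newton step)
    λ large:    Δw ≈ −(1/λ) ∇E(w)       (small gradient-descent step)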

38
Generalization
  • Generalization addresses the question of how well
    a net performs on fresh samples from the
    population, i.e. samples that are not part of the
    training set.
  • Generalization is influenced by three factors
  • The architecture of the network
  • The size of the training set
  • The complexity of the problem

39
Overtraining
  • Overtraining is the situation in which the
    network memorizes the data of the training set,
    but generalizes poorly.
  • The size of the training set must be related to
    the amount of data the network can memorize (i.e.
    the number of weights).
  • Conversely, in order to prevent overtraining, the
    number of weights must be kept in proportion to
    the size of the training set.

40
Cross Validation
  • To protect against overtraining a technique
    called cross-validation can be used. It involves
  • an additional data set called the validation set
  • computing the error made by the net on this
    validation set while training with the training
    set
  • stopping training when the error on the
    validation set starts increasing (see the sketch
    below)
  • Usually the size of the validation set is chosen
    to be roughly half the size of the training set.
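
A minimal sketch of this stopping rule; the two callbacks stand in for the surrounding training code and their names are illustrative only:

    def train_with_cross_validation(train_one_epoch, validation_error,
                                    max_epochs=1000):
        # stop as soon as the error on the validation set starts increasing
        best = float("inf")
        for epoch in range(max_epochs):
            train_one_epoch()
            err = validation_error()
            if err > best:
                return epoch       # validation error went up: stop training
            best = err
        return max_epochs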

41
Practical Aspects
  • Preprocessing
  • Normalization
  • Decorrelation
  • Network pruning
  • Magnitude-based
  • Optimal brain damage
  • Optimal brain surgeon

42
Preprocessing
  • Normalization
  • Decorrelation

43
Pruning
  • Pruning is a technique to increase network
    performance by elimination (pruning in the strict
    sense) or addition (pruning in the broad sense)
    of neurons and/or connections.

  training-set error    validation-set error    action taken
  too large             irrelevant              add neurons
  small                 too large               remove neurons
  small                 small                   stop pruning
44
Pruning connections
Optimal Brain Damage
Optimal Brain Surgeon