Title: Neural Networks
Neural Networks and Pattern Recognition
Giansalvo EXIN Cirrincione, Sabine Van Huffel
Unit 5
The multi-layer perceptron
Feed-forward network mappings
Feed-forward neural networks provide a general
framework for representing non-linear functional
mappings.
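A minimal sketch of such a mapping (the layer sizes, tanh hidden units and linear output units are illustrative assumptions, not taken from the slides):

import numpy as np

# Two-layer feed-forward mapping: y = W2 * g(W1 * x + b1) + b2
d, M, c = 3, 5, 2                                  # inputs, hidden units, outputs
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(M, d)), np.zeros(M)      # first-layer weights and biases
W2, b2 = rng.normal(size=(c, M)), np.zeros(c)      # second-layer weights and biases

def forward(x):
    a1 = W1 @ x + b1        # activations of the hidden units
    z = np.tanh(a1)         # non-linear hidden-unit outputs
    return W2 @ z + b2      # linear output units

print(forward(np.ones(d)))  # the outputs are deterministic functions of the inputs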
The multi-layer perceptron
Feed-forward network mappings
Layered networks
The multi-layer perceptron
Feed-forward network mappings
(Figure: an example network with six layers.)
The multi-layer perceptron
Feed-forward network mappings
Hinton diagram
The size of a square is proportional to the
magnitude of the corresponding parameter and the
square is black or white according to whether the
parameter is positive or negative.
The multi-layer perceptron
Feed-forward network mappings
The outputs can be expressed as deterministic
functions of the inputs
General topologies
Threshold units
A two-layer network can generate any Boolean
function provided the number M of hidden units is
sufficiently large.
Binary inputs
Threshold units
Binary inputs
Each hidden unit acts as a template for the corresponding input pattern and only generates an output when the input pattern matches the template pattern.
(Figure: template weights of +1 or -1 matching the pattern bits, with hidden-unit bias 1 - b.)
No generalization.
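A hedged sketch of this template construction for an arbitrary Boolean function (the strict threshold convention g(a) = 1 for a > 0, the helper names and the XOR example are assumptions, not part of the slides):

import numpy as np

# One hidden threshold unit per input pattern with target 1, with weights
# +1/-1 matching that pattern and bias 1 - b, where b is the number of ones
# in the pattern; the output unit is an OR of the hidden units.

def step(a):
    return (a > 0).astype(float)    # strict threshold assumed here

def build_boolean_net(patterns):
    """patterns: binary tuples on which the target function is 1."""
    P = np.array(patterns, dtype=float)
    W1 = 2.0 * P - 1.0              # +1 where the template bit is 1, -1 where it is 0
    b1 = 1.0 - P.sum(axis=1)        # bias 1 - b for each template
    w2 = np.ones(len(patterns))     # OR: fire if any template matches exactly
    b2 = -0.5
    def f(x):
        z = step(W1 @ np.asarray(x, dtype=float) + b1)
        return step(w2 @ z + b2)
    return f

xor = build_boolean_net([(0, 1), (1, 0)])      # patterns mapped to 1 by XOR
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, xor(x))                           # 0, 1, 1, 0

Each hidden unit fires only on its own template, so the truth table is reproduced exactly but, as noted above, nothing is generalized to unseen patterns.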
Possible decision boundaries
Continuous inputs
Single convex region: the output unit computes an AND of the M hidden units (output bias -M); a minimal sketch follows below.
Relaxing this, more general decision boundaries can be constructed.
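A minimal sketch of the AND construction (the particular hyperplanes and the threshold convention g(a) = 1 for a >= 0 are assumptions chosen for illustration):

import numpy as np

# A single convex region realised as the AND of threshold hidden units,
# each corresponding to one hyperplane.  With g(a) = 1 for a >= 0, an
# output bias of -M fires only when all M hidden units are active.

def step(a):
    return (a >= 0).astype(float)

# Three half-planes whose intersection is a triangle (rows: weights; b1: biases)
W1 = np.array([[ 0.0,  1.0],    # y >= -1
               [ 1.0, -1.0],    # x - y >= -1
               [-1.0, -1.0]])   # -x - y >= -1
b1 = np.array([1.0, 1.0, 1.0])

M = W1.shape[0]
w2 = np.ones(M)      # second-layer weights all set to 1
b2 = -float(M)       # output bias -M implements the AND

def network(x):
    z = step(W1 @ x + b1)        # hidden threshold units, one per hyperplane
    return step(w2 @ z + b2)     # fires only inside the convex intersection

print(network(np.array([0.0, 0.0])))   # inside the triangle -> 1.0
print(network(np.array([5.0, 5.0])))   # outside              -> 0.0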
Possible decision boundaries
Continuous inputs
(Figure: hyperplanes corresponding to the hidden units; each hidden unit's activation transitions from 0 to 1 across its hyperplane. The second-layer weights are all set to 1, so the region labels (2, 3, 4) give the value of the linear sum presented to the output unit.)
Possible decision boundaries
Continuous inputs
(Figure: with an output unit bias of -3.5 the network yields a non-convex decision boundary; the region labels again give the linear sum presented to the output unit.)
Possible decision boundaries
Continuous inputs
(Figure: with an output unit bias of -4.5 the network yields a disjoint decision region.)
IMPOSSIBLE decision boundaries: an example
Continuous inputs
However, any given decision boundary can be approximated arbitrarily closely by a two-layer network having sigmoidal activation functions.
Possible decision boundaries
Arbitrary decision region
Continuous inputs
Possible decision boundaries
divide the input space into a fine grid of
hypercubes
Continuous inputs
Possible decision boundaries
Continuous inputs
CONCLUSION: Feed-forward neural networks with threshold units can generate arbitrarily complex decision boundaries.
Problem: classify a dichotomy.
For N data points in general position in d-dimensional space, a network with ⌈N/d⌉ hidden units in a single hidden layer can separate them correctly into two classes.
Sigmoidal units
Linear transformation
A neural network using tanh activation functions
is equivalent to one using logistic activation
functions but having different values for the
weights and biases. Empirically, tanh activation
functions often give rise to faster convergence
of training algorithms than logistic functions.
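The linear transformation can be made explicit. With the logistic sigmoid g(a) = 1/(1 + e^{-a}),

\[ \tanh(a) \;=\; \frac{e^{a}-e^{-a}}{e^{a}+e^{-a}} \;=\; 2\,g(2a) - 1 , \]

so a tanh hidden unit with weights w_{ji} and bias b_j behaves as a logistic unit with weights 2w_{ji} and bias 2b_j, rescaled by 2 and shifted by -1; the rescaling and shift can be absorbed into the second-layer weights and biases of the equivalent logistic network.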
Sigmoidal units
linear output units
Three-layer networks
They approximate, to arbitrary accuracy, any smooth mapping.
Sigmoidal units
Two-layer networks:
- approximate arbitrarily well any continuous functional mapping
- approximate arbitrarily well any decision boundary
- approximate arbitrarily well both a function and its derivative
Sigmoidal units
two-layer networks
Generalized Mapping Regressor (GMR)
(Figure: Pollock, Convergence.)
Weight-space symmetries
Error back-propagation
Credit assignment problem
- Hessian matrix evaluation
- Jacobian evaluation
- several error functions
- several kinds of networks
Back-propagation supplies the error derivatives, which are then used by an optimisation scheme (e.g. gradient descent).
Error back-propagation
- arbitrary feed-forward topology
- arbitrary differentiable non-linear activation function
- arbitrary differentiable error function
Error back-propagation
First step: forward propagation
Error back-propagation
δ computation
hidden unit
output unit
Error back-propagation
example
Error back-propagation
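A compact sketch of the two passes for a two-layer network (tanh hidden units, linear output units and a sum-of-squares error are assumptions; the layer sizes and names are illustrative):

import numpy as np

rng = np.random.default_rng(1)
d, M, c = 3, 4, 2
W1, b1 = rng.normal(size=(M, d)), np.zeros(M)
W2, b2 = rng.normal(size=(c, M)), np.zeros(c)

def backprop(x, t):
    # first step: forward propagation
    a1 = W1 @ x + b1
    z = np.tanh(a1)
    y = W2 @ z + b2
    # delta for the linear output units: dE/da_k = y_k - t_k
    delta_out = y - t
    # back-propagate to the hidden units: delta_j = g'(a_j) * sum_k w_kj delta_k
    delta_hid = (1.0 - z ** 2) * (W2.T @ delta_out)
    # each derivative is the delta of the unit fed by the weight
    # multiplied by the activation sent along that weight
    return {"W1": np.outer(delta_hid, x), "b1": delta_hid,
            "W2": np.outer(delta_out, z), "b2": delta_out}

grads = backprop(rng.normal(size=d), rng.normal(size=c))
print({name: g.shape for name, g in grads.items()})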
Homework 1
Show, for a feedforward network with tanh hidden
unit activation functions and a sum of squares
error function, that the origin in weight space
is a stationary point of the error function.
Homework 2
Let W be the total number of weights and biases. Show that, for each input pattern, the cost of back-propagation for the evaluation of all the derivatives is O(W) (if the derivatives are instead evaluated numerically by forward propagation, the total cost is O(W²)).
Numerical differentiation
- finite differences (perturb each weight in turn): O(W²)
- symmetrical central finite differences: O(W²)
- node perturbation: O(MW)
Mainly useful as a correctness check on back-propagation.
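A self-contained sketch of the symmetrical central-difference check (the tiny network, the value of epsilon and the variable names are assumptions):

import numpy as np

# Each of the W weights is perturbed twice and every perturbation costs one
# O(W) forward pass, so the whole check is O(W^2): fine for verifying
# back-propagation, too slow for routine training.

rng = np.random.default_rng(2)
x, t = rng.normal(size=3), rng.normal(size=2)
shapes = [(4, 3), (4,), (2, 4), (2,)]                    # W1, b1, W2, b2
w = rng.normal(size=sum(int(np.prod(s)) for s in shapes))  # flat weight vector

def unpack(w):
    out, i = [], 0
    for s in shapes:
        n = int(np.prod(s))
        out.append(w[i:i + n].reshape(s))
        i += n
    return out

def error(w):
    W1, b1, W2, b2 = unpack(w)
    y = W2 @ np.tanh(W1 @ x + b1) + b2
    return 0.5 * np.sum((y - t) ** 2)                    # sum-of-squares error

eps = 1e-6
grad_fd = np.array([(error(w + eps * e) - error(w - eps * e)) / (2 * eps)
                    for e in np.eye(w.size)])            # central differences
print(grad_fd[:5])

In practice the result is compared element by element against the derivatives returned by back-propagation.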
The Jacobian matrix
It provides a measure of the local sensitivity of
the outputs to changes in each of the input
variables.
It is valid only for small perturbations of the
inputs and the Jacobian must be re-evaluated for
each new input vector.
forward propagation
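A sketch for a two-layer network with tanh hidden units and linear outputs (an assumption), where the Jacobian J_ki = dy_k/dx_i follows from one application of the chain rule and is checked against central differences:

import numpy as np

rng = np.random.default_rng(3)
d, M, c = 3, 5, 2
W1, b1 = rng.normal(size=(M, d)), rng.normal(size=M)
W2, b2 = rng.normal(size=(c, M)), rng.normal(size=c)

def forward(x):
    return W2 @ np.tanh(W1 @ x + b1) + b2

def jacobian(x):
    a1 = W1 @ x + b1
    # chain rule through the hidden layer: J = W2 diag(1 - tanh(a1)^2) W1
    return W2 @ ((1.0 - np.tanh(a1) ** 2)[:, None] * W1)

x = rng.normal(size=d)
eps = 1e-6
J_fd = np.column_stack([(forward(x + eps * e) - forward(x - eps * e)) / (2 * eps)
                        for e in np.eye(d)])
print(np.max(np.abs(jacobian(x) - J_fd)))   # close to zero; re-evaluate J for each new x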
The Hessian matrix
1. Several non-linear optimization algorithms used for training neural networks are based on the second-order properties of the error surface.
2. The Hessian forms the basis of a fast procedure for re-training a feed-forward network following a small change in the training data.
3. The inverse Hessian is used to identify the least significant weights in a network as part of a pruning algorithm.
4. The inverse Hessian is used to assign error bars to the predictions made by a trained network.
Diagonal approximation
The inverse of a diagonal matrix is trivial to compute. O(W)
Regression problems
Levenberg-Marquardt approximation (outer product approximation): O(W²); straightforward extension.
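A sketch of the outer-product accumulation H ≈ Σ_n g_n g_nᵀ with g_n = dy^n/dw (a single linear output and the tiny network are assumptions chosen for illustration):

import numpy as np

rng = np.random.default_rng(4)
d, M, N = 3, 4, 20
W1, b1 = rng.normal(size=(M, d)), np.zeros(M)
w2, b2 = rng.normal(size=M), 0.0
X = rng.normal(size=(N, d))

def output_gradient(x):
    # gradient of the single output y = w2 . tanh(W1 x + b1) + b2
    # with respect to all weights, flattened in the order [W1, b1, w2, b2]
    z = np.tanh(W1 @ x + b1)
    dz = (1.0 - z ** 2) * w2          # dy/da1_j = w2_j g'(a1_j)
    dW1 = np.outer(dz, x)             # dy/dW1_ji = w2_j g'(a1_j) x_i
    return np.concatenate([dW1.ravel(), dz, z, [1.0]])

W = M * d + M + M + 1
H = np.zeros((W, W))
for x in X:
    g = output_gradient(x)
    H += np.outer(g, g)               # O(W^2) work per pattern
print(H.shape)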
Sherman-Morrison-Woodbury formula
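For the rank-one updates that arise when an outer-product Hessian is accumulated pattern by pattern, the relevant special case reads (a standard identity, stated here as context rather than taken from the slides):

\[ \bigl(H + g\,g^{T}\bigr)^{-1} \;=\; H^{-1} - \frac{H^{-1} g\, g^{T} H^{-1}}{1 + g^{T} H^{-1} g} , \]

which allows the inverse Hessian to be maintained incrementally without ever inverting the full matrix.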
BP check
O(W²)
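One common O(W²) check applies symmetrical central differences to the first derivatives delivered by back-propagation (a standard identity, not specific to these slides):

\[ \frac{\partial^{2} E}{\partial w_{ji}\,\partial w_{lk}} \;=\; \frac{1}{2\epsilon} \left\{ \left.\frac{\partial E}{\partial w_{ji}}\right|_{w_{lk}+\epsilon} - \left.\frac{\partial E}{\partial w_{ji}}\right|_{w_{lk}-\epsilon} \right\} + O(\epsilon^{2}) . \]

Each of the W perturbations requires one O(W) backward pass, giving O(W²) overall.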
Exact evaluation of the Hessian
- arbitrary feed-forward topology
- arbitrary differentiable activation function
- arbitrary differentiable error function
O(W²)
(Case in which w_ij does not occur on any forward propagation path connecting unit l to the outputs of the network.)
Initial conditions: for each unit j (except for input units) set h_jj = 1 and set h_kj = 0 ∀ k ≠ j (units which do not lie on any forward propagation path starting from unit j).
forward propagation
back propagation
ALGORITHM
1. Evaluate the activations of all of the hidden and output units, for a given input pattern, by forward propagation. Similarly, compute the initial conditions for the h_kj and forward propagate through the network to find the remaining non-zero elements of h_kj.
2. Evaluate δ_k for the output units and, similarly, evaluate the H_kk' for all the output units.
3. Use BP to find δ_j for all hidden units. Similarly, back propagate to find the b_lj by using the given initial conditions.
4. Evaluate the elements of the Hessian for this input pattern.
5. Repeat the above steps for each pattern in the TS and then sum to obtain the full Hessian.
Exact Hessian for two-layer network
Legend:
- indices i and i' denote inputs
- indices j and j' denote hidden units
- indices k and k' denote outputs
Homework
Consider a feed-forward network which has been trained to a minimum of some error function E, corresponding to a set of weights w_j. Suppose that all of the input values x_i^n and target values t_k^n in the TS are perturbed by small amounts Δx_i^n and Δt_k^n respectively. This causes the minimum of the error function to change to a new set of weight values given by w_j + Δw_j. Write down the Taylor expansion (to second order in the Δs) of the new error function.
By minimizing this expression w.r.t. the Δw_j, show that the new set of weights which minimizes the error function can be calculated from the original set of weights by adding corrections Δw_j which are given by solutions of the following equation, where H_lj are the elements of the Hessian matrix, and we have defined
Projection pursuit regression
Parameters are optimized cyclically in groups.
Specifically, training takes place for one hidden
unit at a time, and for each hidden unit the
second-layer weights are optimized first (OLS),
followed by the activation function
(one-dimensional curve fitting, e.g. cubic
splines), followed by the first-layer weights
(non-linear techniques). The process is repeated
for each hidden unit in turn until a stopping
criterion is satisfied.
Several generalizations to more than one output variable are possible, depending on whether the outputs share common basis functions f_j and, if not, whether the separate basis functions f_jk (where k labels the outputs) share common projection directions.
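A rough sketch of this cyclic loop for a single output (the model form y(x) = Σ_j w_j φ_j(u_jᵀ x), the cubic-polynomial curve fit in place of cubic splines, and the Nelder-Mead optimiser for the projection directions are all illustrative assumptions):

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
N, d, M = 200, 3, 2
X = rng.normal(size=(N, d))
t = np.sin(X @ np.array([1.0, -1.0, 0.5])) + 0.1 * rng.normal(size=N)

U = rng.normal(size=(M, d))                  # projection directions (first layer)
w = np.zeros(M)                              # second-layer weights
phi = [np.poly1d([0.0]) for _ in range(M)]   # 1-D ridge functions

def unit_output(j, u=None):
    u = U[j] if u is None else u
    return phi[j](X @ u)

for sweep in range(3):                       # repeat until a stopping criterion
    for j in range(M):                       # one hidden unit at a time
        r = t - sum(w[k] * unit_output(k) for k in range(M) if k != j)
        # (a) second-layer weight by ordinary least squares on the residual
        zj = unit_output(j)
        w[j] = (zj @ r) / (zj @ zj) if zj @ zj > 1e-12 else 1.0
        # (b) refit the 1-D activation function (cubic polynomial fit here)
        s = X @ U[j]
        phi[j] = np.poly1d(np.polyfit(s, r / w[j], deg=3))
        # (c) projection direction by a general non-linear optimiser
        def sse(u):
            return np.sum((r - w[j] * phi[j](X @ u)) ** 2)
        U[j] = minimize(sse, U[j], method="Nelder-Mead").x

print(np.mean((t - sum(w[k] * unit_output(k) for k in range(M))) ** 2))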
END