Title: Learning Rules
1. Topic 3.
- Learning Rules of Artificial Neural Networks.
2. Multilayer Perceptron.
- The first layer is the input layer, and the last layer is the output layer.
- All other layers, with no direct connections from or to the outside, are called hidden layers.
3. Multilayer Perceptron.
- The input is processed and relayed from one layer to the next, until the final result has been computed.
- This process represents the feedforward scheme.
4. Multilayer Perceptron.
- The structural credit assignment problem: when an error is made at the output of a network, how is credit (or blame) to be assigned to neurons deep within the network?
- One of the most popular techniques to train the hidden neurons is error backpropagation, whereby the error of the output units is propagated back to yield estimates of how much a given hidden unit contributed to the output error.
5. Multilayer Perceptron.
- The error function of the multilayer perceptron.
The best performance of the network corresponds to the minimum of the total squared error, and during network training we adjust the weights of connections in order to get to that minimum (a common way to write this error is sketched below).
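As a sketch of that total squared error, assuming X_jp denotes the actual output of output unit j for training pattern p and D_jp the corresponding desired output, the error function can be written as

E = \frac{1}{2} \sum_{p} \sum_{j} ( D_{jp} - X_{jp} )^{2},

and the output errors used later are e_{jp} = D_{jp} - X_{jp}.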
6. Multilayer Perceptron.
- The combination of the weights, including those of the hidden neurons, which minimises the error function E is considered to be a solution of the multilayer perceptron learning problem.
7. Multilayer Perceptron.
- The error function of the multilayer perceptron.
- The backpropagation algorithm looks for the minimum of the multi-variable error function E in the space of weights of connections w using the method of gradient descent.
8. Multilayer Perceptron.
- Following calculus, a local minimum of a function of two or more variables is defined by equality to zero of its gradient:
\nabla E = 0, i.e. \partial E / \partial w_{ht}^{k} = 0 for every weight,
where \partial E / \partial w_{ht}^{k} is the partial derivative of the error function E with respect to the weight of connection between the h-th unit in layer k and the t-th unit in the previous layer, number k-1.
9. Multilayer Perceptron.
- We would like to go in the direction opposite to the gradient, to most rapidly minimise E. Therefore, during the iterative process of gradient descent, each weight of connection, including the hidden ones, is updated using the increment
\Delta w_{ht}^{k} = -C \, \partial E / \partial w_{ht}^{k},
where C represents the learning rate.
10. Multilayer Perceptron.
- Since calculus-based methods of minimisation rest on the taking of derivatives, their application to network training requires the error function E to be a differentiable function.
11. Multilayer Perceptron.
- Since calculus-based methods of minimisation rest on the taking of derivatives, their application to network training requires the error function E to be a differentiable function, which requires the network output X_jp to be differentiable, which requires the activation functions f(S) to be differentiable.
12. Multilayer Perceptron.
- Since calculus-based methods of minimisation rest on the taking of derivatives, their application to network training requires the error function E to be a differentiable function, which requires the network output X_jp to be differentiable, which requires the activation functions f(S) to be differentiable.
- This provides a powerful motivation for using continuous and differentiable activation functions f(w,a).
13. Multilayer Perceptron.
- Since calculus-based methods of minimisation rest on the taking of derivatives, their application to network training requires the activation functions f(S) to be differentiable.
- To enable a multilayer perceptron to learn, a useful generic sigmoid activation function is associated with each hidden or output neuron.
14. Multilayer Perceptron.
- Since calculus-based methods of minimisation rest on the taking of derivatives, their application to network training requires the activation functions f(S) to be differentiable.
- To enable a multilayer perceptron to learn, a useful generic sigmoid activation function is associated with each hidden or output neuron.
- The important thing about the generic sigmoid function is that it is differentiable, with a very simple and easy-to-compute derivative (a common form is sketched below).
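As a sketch, assuming the generic sigmoid referred to here is the standard logistic function of the weighted sum S of a neuron's inputs, the activation function and its derivative are

f(S) = \frac{1}{1 + e^{-S}}, \qquad f'(S) = f(S) (1 - f(S)),

so the derivative is obtained directly from the neuron's already-computed output.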
15. Multilayer Perceptron.
- Since calculus-based methods of minimisation rest on the taking of derivatives, their application to network training requires the activation functions f(S) to be differentiable.
- To enable a multilayer perceptron to learn, a useful generic sigmoid activation function is associated with each hidden or output neuron.
- If all activation functions f(S) in the network are differentiable then, according to the chain rule of calculus, differentiating the error function E with respect to the weight of connection in consideration, we can express the corresponding partial derivative of the error function (a sketch of this expansion follows).
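As a sketch of that chain-rule expansion, assuming S_{h}^{k} = \sum_{t} w_{ht}^{k} X_{t}^{k-1} is the weighted input of unit h in layer k and X_{h}^{k} = f(S_{h}^{k}) is its output, the derivative factorises as

\frac{\partial E}{\partial w_{ht}^{k}} = \frac{\partial E}{\partial S_{h}^{k}} \cdot \frac{\partial S_{h}^{k}}{\partial w_{ht}^{k}} = \delta_{h}^{k} X_{t}^{k-1},

where \delta_{h}^{k} = \partial E / \partial S_{h}^{k} is the local error of unit h, obtained by propagating the output errors back from the layers above.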
18. Multilayer Perceptron.
- Thus, the correction to the hidden weight of connection between the h-th unit in the k-th layer and the t-th unit in the previous (k-1)-th layer can be found; the resulting learning rule is summarised on the next slide.
19. Multilayer Perceptron Learning rule!!!
- The correction is defined by (a formula sketch follows the list):
- the output layer errors e_jp,
- the derivatives of the activation functions of all neurons in the upper layers with numbers p > k,
- the derivative of the activation function of the neuron h itself in layer k,
- the activation (output) of the connected neuron t in the previous layer (k-1).
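As a sketch of the rule in formula form, using the notation of the earlier slides (learning rate C, output errors e_{jp}, weighted inputs S and outputs X of each unit), the correction can be written recursively as

\Delta w_{ht}^{k} = C \, \delta_{h}^{k} \, X_{t}^{k-1}, \qquad \delta_{j}^{out} = e_{jp} f'(S_{j}^{out}), \qquad \delta_{h}^{k} = f'(S_{h}^{k}) \sum_{m} \delta_{m}^{k+1} w_{mh}^{k+1},

where each hidden unit's \delta collects the \delta values of the layer above through the connecting weights, exactly as the list above describes.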
20. Multilayer Perceptron Learning rule!!!
We can easily measure the output errors of the network, and it is up to us to define all the activation functions. If we also know the derivatives of the activation functions, then we can easily find all the corrections to the weights of connections of all neurons in the network, including the hidden ones, during the second run back through the network.
21. Multilayer Perceptron Training.
The training process of a multilayer perceptron consists of two phases. Initial values of the weights of connections are set up randomly. Then, during the first, feedforward phase, starting from the input layer and proceeding layer by layer, the outputs of every unit in the network are computed together with the corresponding derivatives.
Figure: Directions of the two basic signal flows in a multilayer perceptron: forward propagation of function signals and back-propagation of error signals.
22. Multilayer Perceptron Training.
The training process of a multilayer perceptron consists of two phases. Initial values of the weights of connections are set up randomly. Then, during the first, feedforward phase, starting from the input layer and proceeding layer by layer, the outputs of every unit in the network are computed together with the corresponding derivatives. In the second, feedback phase, corrections to all weights of connections of all units, including the hidden ones, are computed using the outputs and derivatives computed during the feedforward phase (a code sketch of the two phases follows the figure).
Figure: Directions of the two basic signal flows in a multilayer perceptron: forward propagation of function signals and back-propagation of error signals.
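As a minimal sketch of these two phases, assuming a single hidden layer, logistic sigmoid activations and the squared-error function above (the layer sizes, variable names and the XOR training set here are illustrative, not from the source):

import numpy as np

def sigmoid(s):
    # generic sigmoid activation f(S) = 1 / (1 + exp(-S))
    return 1.0 / (1.0 + np.exp(-s))

# Tiny illustrative training set: the XOR problem (2 inputs, 1 output).
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
D = np.array([[0.], [1.], [1.], [0.]])

rng = np.random.default_rng(0)
C = 0.5                                  # learning rate
W1 = rng.uniform(-1, 1, size=(2, 3))     # input  -> hidden weights (random start)
b1 = rng.uniform(-1, 1, size=(1, 3))     # hidden biases
W2 = rng.uniform(-1, 1, size=(3, 1))     # hidden -> output weights
b2 = rng.uniform(-1, 1, size=(1, 1))     # output bias

for epoch in range(10000):
    # Feedforward phase: compute every unit's output; the derivatives
    # f'(S) = f(S) * (1 - f(S)) follow directly from these outputs.
    H = sigmoid(X @ W1 + b1)             # hidden layer outputs
    Y = sigmoid(H @ W2 + b2)             # output layer outputs

    # Output errors e = D - Y; total squared error E = 0.5 * sum(e^2).
    e = D - Y

    # Feedback (backpropagation) phase: deltas for each layer.
    delta_out = e * Y * (1 - Y)                   # e * f'(S) at the output layer
    delta_hid = (delta_out @ W2.T) * H * (1 - H)  # propagated back to hidden units

    # Corrections: learning rate * delta * output of the previous layer.
    W2 += C * H.T @ delta_out
    b2 += C * delta_out.sum(axis=0, keepdims=True)
    W1 += C * X.T @ delta_hid
    b1 += C * delta_hid.sum(axis=0, keepdims=True)

print(np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), 2))  # outputs after training

Each weight correction follows the learning rule above: the learning rate times the unit's delta times the output of the connected unit in the previous layer; occasionally a different random start (or more epochs) is needed if the descent lands in a relative minimum.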
23. Multilayer Perceptron Training.
To understand the second, error back-propagation
phase of computing corrections to the weights,
let us follow an example of a small three-layer
perceptron.
24. Multilayer Perceptron Training.
To understand the second, error back-propagation
phase of computing corrections to the weights,
let us follow an example of a small three-layer
perceptron.
Suppose that we have found all outputs and
corresponding derivatives of activation functions
of all computing units including the hidden ones
in the network.
25. Multilayer Perceptron Training.
We shall mark values of the layer in consideration with its layer number as an upper index, and values of the layer previous to the one in consideration with the previous layer's number.
26. Multilayer Perceptron Training.
The weight of connection between unit number 1 (first lower index) in the output layer (layer number 2, shown as the upper index) and unit number 0 (second lower index) in the previous layer (number 1 = 2-1), after presentation of a training pattern, would have a correction (sketched below).
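As a sketch, applying the learning rule above and assuming e_1 is the measured error of output unit 1 for the presented pattern, that correction is

\Delta w_{10}^{2} = C \, e_{1} \, f'(S_{1}^{2}) \, X_{0}^{1}.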
27. Multilayer Perceptron Training.
Analogously, corrections to all six weights of connections between the output layer and the hidden layer are obtained in the same way (the general form is sketched below).
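As a sketch, under the same assumptions, each of those corrections has the form

\Delta w_{jt}^{2} = C \, e_{j} \, f'(S_{j}^{2}) \, X_{t}^{1},

with j running over the output units and t over the units of the hidden layer.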
28. Multilayer Perceptron Training. Corrections to hidden units' connections.
We shall mark values of the layer in consideration, values of the layer previous to the one in consideration, and values of the layers above the one in consideration, each with the corresponding layer number as an upper index.
29. Multilayer Perceptron Training. Corrections to hidden units' connections.
The weight of connection between unit number 1 (first lower index) in the hidden layer (layer number 1, shown as the upper index) and unit number 0 (second lower index) in the previous, input layer would have a correction of the same form, with the output errors propagated back through the output layer.
30. Multilayer Perceptron Training. Corrections to hidden units' connections.
Analogously, corrections are obtained for all six weights of connections between the hidden layer and the input layer (the general form is sketched below).
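As a sketch, under the same assumptions, a correction to a hidden weight collects the output-layer errors back through the connecting weights:

\Delta w_{ht}^{1} = C \Big( \sum_{j} e_{j} \, f'(S_{j}^{2}) \, w_{jh}^{2} \Big) f'(S_{h}^{1}) \, X_{t}^{0},

with h running over the hidden units and t over the input units; X_{t}^{0} denotes the t-th component of the input pattern.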
31. Multilayer Perceptron Training. Corrections to hidden units' connections.
- In this way, going backwards through the network, one obtains the corrections to all weights.
32. Multilayer Perceptron Training. Corrections to hidden units' connections.
- In this way, going backwards through the network, one obtains the corrections to all weights,
- then updates the weights.
33. Multilayer Perceptron Training. Corrections to hidden units' connections.
- In this way, going backwards through the network, one obtains the corrections to all weights,
- then updates the weights.
- After that, with the new weights, go forward to get new outputs.
34. Multilayer Perceptron Training. Corrections to hidden units' connections.
- In this way, going backwards through the network, one obtains the corrections to all weights,
- then updates the weights.
- After that, with the new weights, go forward to get new outputs.
- Find the new error, go backwards, and so on.
35. Multilayer Perceptron Training.
- In this way, going backwards through the network, one obtains the corrections to all weights, then updates the weights.
- After that, with the new weights, go forward to get new outputs.
- Find the new error, go backwards, and so on.
- Hopefully, sooner or later the iterative procedure will arrive at the output with the minimum error, i.e. the absolute minimum of the error function E.
36. Multilayer Perceptron Training.
- In this way, going backwards through the network, one obtains the corrections to all weights, then updates the weights.
- After that, with the new weights, go forward to get new outputs.
- Find the new error, go backwards, and so on.
- Hopefully, sooner or later the iterative procedure will arrive at the output with the minimum error, i.e. the absolute minimum of the error function E.
- Unfortunately, as a function of many variables, the error function might have more than one minimum, and one may get not to the absolute minimum but to a relative one.
37. Multilayer Perceptron Training.
- Unfortunately, as a function of many variables, the error function might have more than one minimum, and one may get not to the absolute minimum but to a relative one.
- If this happens, the error function stops decreasing regardless of the number of iterations.
- Some measures must be taken to get out of the relative minimum of the function, for example, adding small random values, i.e. noise, to one or more of the weights (a sketch follows).
- Then the iterative procedure starts from that new point, to get to the absolute minimum eventually.
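As a minimal sketch of that measure, assuming the weights are kept in numpy arrays as in the training sketch above (the function name and noise scale are illustrative):

import numpy as np

def jitter(weights, scale=0.01, rng=np.random.default_rng()):
    # Add small random values (noise) to every weight array, to push the
    # search out of a relative (local) minimum of the error function E.
    return [w + rng.normal(0.0, scale, size=w.shape) for w in weights]

# e.g. W1, b1, W2, b2 = jitter([W1, b1, W2, b2]), then continue the iterations.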
38. Multilayer Perceptron Training.
- Finally, after successful training, the perceptron is able to produce the desired responses to all input patterns of the training set.
39. Multilayer Perceptron Training.
- Finally, after successful training, the perceptron is able to produce the desired responses to all input patterns of the training set.
- Then all the network weights of connections are fixed,
40. Multilayer Perceptron Training.
- Finally, after successful training, the perceptron is able to produce the desired responses to all input patterns of the training set.
- Then all the network weights of connections are fixed,
- and the network is presented with inputs it must recognise, i.e. inputs not from the training set.
41. Multilayer Perceptron Training.
- Finally, after successful training, the perceptron is able to produce the desired responses to all input patterns of the training set.
- Then all the network weights of connections are fixed,
- and the network is presented with inputs it must recognise, i.e. inputs not from the training set.
- If the input in consideration produces an output similar to one of the training set, such an input is said to belong to the same type or cluster of inputs as the corresponding one of the training set.
42. Multilayer Perceptron Training.
- Then all the network weights of connections are fixed,
- and the network is presented with inputs it must recognise, i.e. inputs not from the training set.
- If the input in consideration produces an output similar to one of the training set, such an input is said to belong to the same type or cluster of inputs as the corresponding one of the training set (a sketch of this decision follows).
- If the network produces an output not similar to any of the training set, then such an input is said not to have been recognised.
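As a minimal sketch of this recognition decision, assuming the network's outputs for the training patterns have been stored and using a hypothetical similarity tolerance (both the function and the tolerance are illustrative, not from the source):

import numpy as np

def recognise(output, training_outputs, tol=0.1):
    # Compare the output for a new input with the outputs for the training
    # patterns; return the index of the most similar one, or None if no
    # training output is similar enough (input not recognised).
    dists = [float(np.max(np.abs(output - t))) for t in training_outputs]
    best = int(np.argmin(dists))
    return best if dists[best] <= tol else None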
43. Multilayer Perceptron Training. Conclusion.
- In 1969 Minsky and Papert not only found the solution to the XOR problem in the form of a multilayer perceptron, they also gave a very thorough mathematical analysis of the time it takes to train such networks.
- Minsky and Papert emphasized that training times increase very rapidly for certain problems as the number of input lines and weights of connections increases.
44. Multilayer Perceptron Training. Conclusion.
- Minsky and Papert emphasized that training times increase very rapidly for certain problems as the number of input lines and weights of connections increases.
- The difficulties were seized upon by opponents of the subject. In particular, this was true of those working in the field of artificial intelligence (AI), who at that time did not want to concern themselves with the underlying wetware of the brain, but only with the functional aspects, regarded by them solely as logical processing.
- Due to the limitations of funding, competition between the AI and neural network communities could have only one victor.
45. Multilayer Perceptron Training. Conclusion.
- Due to the limitations of funding, competition between the AI and neural network communities could have only one victor.
- Neural networks then went into a relative quietude for more than fifteen years, with only a few devotees still working on them.
46. Multilayer Perceptron Training. Conclusion.
- Due to the limitations of funding, competition between the AI and neural network communities could have only one victor.
- Neural networks then went into a relative quietude for more than fifteen years, with only a few devotees still working on them.
- Then new vigour came from various sources. One was the increasing power of computers, allowing simulations of otherwise intractable problems.
47. Multilayer Perceptron Training. Conclusion.
- New vigour came from various sources. One was the increasing power of computers, allowing simulations of otherwise intractable problems.
- Finally, the backpropagation algorithm, established by the mid-80s, solved the difficulty of training hidden neurons.
48. Multilayer Perceptron Training. Conclusion.
- New vigour came from various sources. One was the increasing power of computers, allowing simulations of otherwise intractable problems.
- Finally, the backpropagation algorithm, established by the mid-80s, solved the difficulty of training hidden neurons.
- Nowadays, the perceptron is an effective tool for recognising protein and amino-acid sequences and processing other complex biological data.