Title: The Perceptron
1. The Perceptron
2. Prehistory
- W.S. McCulloch & W. Pitts (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5, 115-137.
- This seminal paper pointed out that simple artificial neurons could be made to perform basic logical operations such as AND, OR and NOT.
[Figure: a two-input threshold unit computing AND. Inputs x and y each have weight 1; the weighted sum x + y - 2 is thresholded: output 0 if the sum is less than 0, else 1.]
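The logic-gate neurons above can be sketched in a few lines of Python (the function and gate names are my own; the threshold convention follows the slide: output 0 if the sum is below 0, else 1):

```python
# Sketch of a McCulloch-Pitts threshold unit (names are my own).
# Convention from the slide: output 0 if the weighted sum is below 0, else 1.

def mcp_unit(inputs, weights, bias):
    s = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 0 if s < 0 else 1

# AND: weights (1, 1), bias -2 -> fires only when both inputs are 1
AND = lambda x, y: mcp_unit([x, y], [1, 1], -2)
# OR: weights (1, 1), bias -1 -> fires when at least one input is 1
OR = lambda x, y: mcp_unit([x, y], [1, 1], -1)
# NOT: weight -1, bias 0 -> inverts a single input
NOT = lambda x: mcp_unit([x], [-1], 0)
```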
3. Nervous Systems as Logical Circuits
- Groups of these neuronal logic gates could carry out any computation, even though each neuron was very limited.
- Could computers built from these simple units reproduce the computational power of biological brains?
- Were biological neurons performing logical operations?
[Figure: the same unit configured as OR. Inputs x and y each have weight 1; the weighted sum x + y - 1 is thresholded: output 0 if the sum is less than 0, else 1.]
4. The Perceptron
Frank Rosenblatt (1962). Principles of Neurodynamics, Spartan, New York, NY.
Subsequent progress was inspired by the invention of learning rules drawing on ideas from neuroscience. Rosenblatt's Perceptron could automatically learn to categorise or classify input vectors into types.
It obeyed the following rule: if the sum of the weighted inputs exceeds a threshold, output 1, else output -1:
  output = 1 if Σi inputi × weighti > threshold
  output = -1 if Σi inputi × weighti < threshold
5. Activation functions
- Sign function (sometimes called a step or threshold function)
- Sigmoid function: 1/(1 + e^-x)
- w0,j controls the threshold location (a0 is generally set to 1)
[Figure: the activation g(inj) plotted against inj, stepping from -1 to 1 at the threshold.]
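The two activation functions listed above can be written directly (a small sketch; function names are my own):

```python
import math

def sign(a):
    """Sign/step activation: -1 below zero, +1 otherwise."""
    return -1 if a < 0 else 1

def sigmoid(a):
    """Logistic activation g(a) = 1 / (1 + e^-a), squashing into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))
```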
6. Exercise
- Represent the following Boolean functions by a perceptron.
[Figure: perceptron diagrams to complete, one per Boolean function; the solution slides that follow cover a one-input function of a and two-input functions of a1 and a2.]
7. Exercise - solution
[Figure: one possible set of weights and biases for each perceptron: a one-input unit (NOT) and two two-input units (AND, OR), with weights of ±1 and a bias input fixed at 1.]
8. Exercise - not just one solution
[Figure: alternative weights (e.g. ±0.5) implementing the same functions, showing that the solution is not unique.]
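Since the slide's exact weights are not recoverable from the figures, the following sketch uses one standard choice of weights, plus a scaled alternative, to illustrate the point of slide 8: several different weight settings realise the same Boolean function.

```python
# NOTE: the slide's exact weights are not recoverable, so these are
# illustrative choices showing that different weights give the same function.

def perceptron(inputs, weights, bias):
    s = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if s > 0 else 0

def truth_table(f):
    return [f(a1, a2) for a1 in (0, 1) for a2 in (0, 1)]

AND_v1 = lambda a1, a2: perceptron([a1, a2], [1, 1], -1.5)
AND_v2 = lambda a1, a2: perceptron([a1, a2], [0.5, 0.5], -0.75)  # scaled copy
OR_v1 = lambda a1, a2: perceptron([a1, a2], [1, 1], -0.5)
OR_v2 = lambda a1, a2: perceptron([a1, a2], [0.5, 0.5], -0.25)   # scaled copy
```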
9. Classifier
- Consider a network as a classifier.
- Network parameters are adapted so that it discriminates between classes.
- For m classes, the classifier partitions the feature space into m decision regions.
- The line or curve separating the classes is the decision boundary. In more than 2 dimensions this is a surface (e.g., a hyperplane).
- For 2 classes we can view the net output as a discriminant function y(x, w) where
  - y(x, w) = 1 if x is in C1
  - y(x, w) = -1 if x is in C2
- We need some training data with known classes to generate an error function for the network.
- We need a (supervised) learning algorithm to adjust the weights.
10. Linear discriminant functions
A linear discriminant function is a mapping which partitions feature space using a linear function (a straight line, or a hyperplane). Thus in 2 dimensions the decision boundary is a straight line.
A simple form of classifier: separate the two classes using a straight line in feature space.
11. The Perceptron as a Classifier
For d-dimensional data the perceptron consists of d weights, a bias and a thresholding activation function. For 2D data we have
  a = w0 + w1 x1 + w2 x2
  y = g(a), with y in {-1, 1} giving the output class decision.
[Figure: inputs x1 and x2 with weights w1 and w2, plus a bias weight w0 from a constant input of 1, feeding the activation g.]
1. Take the weighted sum of the inputs.
2. Pass it through the Heaviside function: T(a) = -1 if a < 0, T(a) = 1 if a > 0.
View the bias as another weight from an input which is constantly on. If we group the weights as a vector w, we therefore have the net output y given by
  y = g(w · x + w0)
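The two steps above (weighted sum, then threshold) can be sketched as follows; the specific weights are a hypothetical example defining the boundary x1 + x2 - 1 = 0:

```python
# Sketch of the 2D perceptron classifier y = g(w . x + w0).
# The weights below are a hypothetical example, giving the decision
# boundary x1 + x2 - 1 = 0.

def predict(x, w, w0):
    a = w0 + sum(wi * xi for wi, xi in zip(w, x))  # 1. weighted sum of inputs
    return -1 if a < 0 else 1                      # 2. threshold activation

w, w0 = [1.0, 1.0], -1.0
```

Points above the line, e.g. (2, 2), are assigned class 1; points below, e.g. (0, 0), class -1.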
12.
[Figure: a single-layer network with inputs x1 ... xd and outputs y1 ... yk. The weight to output j from input k is wjk; each output also has a bias weight wj0 from an input that is constantly 1.]
  yj = g(Σk wjk xk + wj0)
- The perceptron can be extended to discriminate between k classes by having k output nodes.
- x is in class Cj if yj(x) > yk(x) for all k ≠ j.
- The resulting decision boundaries divide the feature space into convex decision regions.
[Figure: a 2D feature space partitioned into convex regions C1, C2 and C3.]
13. Other activation functions can also be used (usually chosen to be monotonic). NB: the discriminant is still linear.
Using the sigmoidal logistic activation function g(a) = 1/(1 + e^-a) together with data drawn from Gaussian or Bernoulli class-conditional distributions P(x | Ck) means that the network outputs can be interpreted as the posterior probabilities P(Ck | x).
Generalised linear discriminants: linear discriminants can be made more general by including non-linear functions (basis functions) φk which transform the input data. The outputs then become
  yj = g(Σk wjk φk(x) + wj0)
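A generalised linear discriminant can be sketched as below; the quadratic basis functions and the toy weights are illustrative choices of mine, not taken from the slide:

```python
import math

# Sketch of a generalised linear discriminant: fixed non-linear basis
# functions phi_k transform the input, then a linear-in-the-weights sum is
# passed through the logistic g. The quadratic basis is my own illustration.

def g(a):
    return 1.0 / (1.0 + math.exp(-a))  # logistic activation

def basis(x):
    x1, x2 = x
    return [x1, x2, x1 * x1, x2 * x2, x1 * x2]  # phi_1 .. phi_5

def discriminant(x, w, w0):
    a = w0 + sum(wk * pk for wk, pk in zip(w, basis(x)))
    return g(a)  # interpretable as the posterior probability P(C1 | x)
```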
14. Network Learning
The standard procedure for training the weights is gradient descent. For this we have a set of training data from known classes, used in conjunction with an error function E(w) (e.g. a sum-of-squares error) to specify an error for each instantiation of the network. Then do
  w_new = w_old - η ∇E(w)
where ∇E(w) is a vector representing the gradient and η is the learning rate (small, positive).
1. This moves us downhill in the direction -∇E(w) (steepest downhill, since ∇E(w) is the direction of steepest increase).
2. How far we go is determined by the value of η.
15. Moving Downhill: move in the direction of the negative derivative.
[Figure: the error curve E(w) plotted against w1, with the slope dE(w)/dw1 at the current point and the direction of decreasing E(w) marked.]
Here dE(w)/dw1 > 0, so the update w1 ← w1 - η dE(w)/dw1 decreases w1, i.e., the rule decreases w1.
16. Moving Downhill: move in the direction of the negative derivative.
[Figure: the same error curve, but with the current w1 on the other side of the minimum.]
Here dE(w)/dw1 < 0, so the update w1 ← w1 - η dE(w)/dw1 increases w1, i.e., the rule increases w1.
17-20. Illustration of Gradient Descent
[Figure, built up over four slides: the error surface E(w) above the (w0, w1) weight plane. The direction of steepest descent is the direction of the negative gradient; one step moves from the original point in weight space to a new point further downhill.]
21. General Gradient Descent Algorithm
- Define an objective function E(w).
- Algorithm:
- pick an initial set of weights w, e.g. randomly
- evaluate the gradient ∇E(w) at w
- note this can be done numerically or in closed form
- update all the weights: w_new = w_old - η ∇E(w)
- check if ∇E(w) is approximately 0
- if so, we have converged to a flat minimum
- if not, we move again in weight space
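The algorithm above can be sketched on a toy objective (the quadratic E(w) and its closed-form gradient are my own example):

```python
# Gradient descent on a toy quadratic E(w) = (w0 - 3)^2 + (w1 + 1)^2,
# whose closed-form gradient is (2(w0 - 3), 2(w1 + 1)). Objective is my own.

def grad_E(w):
    return [2 * (w[0] - 3), 2 * (w[1] + 1)]

def gradient_descent(w, eta=0.1, tol=1e-6, max_steps=10000):
    for _ in range(max_steps):
        g = grad_E(w)
        if all(abs(gi) < tol for gi in g):           # gradient ~ 0: converged
            break
        w = [wi - eta * gi for wi, gi in zip(w, g)]  # w_new = w_old - eta*grad
    return w
```

Starting from w = (0, 0), the iterates approach the minimum at (3, -1).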
22.
- Equivalent to hill-climbing (in reverse).
- There can be problems knowing when to stop.
- Local minima:
- an error surface can have multiple local minima (note: for the perceptron, E(w) has only a single global minimum, so this is not a problem)
- gradient descent goes to the closest local minimum
- solution: random restarts from multiple places in weight space
23. Sequential Gradient Descent
In standard gradient descent (the batch version) we get the network output for all data points and estimate the error gradient from the difference between outputs and targets (for the current weights).
In sequential gradient descent we get an approximation to the full gradient based on the ith training vector xi only, using
  w_new = w_old - η ∇Ei(w)
where Ei is the error due to xi. This allows us to update the weights as we cycle through each input:
- tends to be faster in practice
- don't have to store all outputs and vectors
- can be used to adapt weights on-line
- can track slow-moving changes in the data
- stochasticity can help to escape from local minima
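The sequential update can be sketched on a toy model (a linear unit with squared error; the data and learning rate are my own illustration):

```python
# Sequential (stochastic) gradient descent on a toy linear unit y = w . x
# with per-example squared error Ei = (1/2)(w . xi - ti)^2, so that
# grad Ei = (w . xi - ti) * xi. Data and learning rate are my own example.

def sequential_gd(data, w, eta=0.05, epochs=500):
    for _ in range(epochs):
        for x, t in data:                 # update after every single example
            y = sum(wi * xi for wi, xi in zip(w, x))
            w = [wi - eta * (y - t) * xi for wi, xi in zip(w, x)]
    return w

# Data generated from t = 2*x1 - 1*x2:
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
```

Because the weights move after every example rather than once per pass, the trajectory is noisier than batch descent but typically cheaper per update.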
24. Error function
We need to define an error function to start the training procedure. We also need to define targets ti for each input pattern xi in the training data set X:
  ti = 1 if pattern xi is in C1, and ti = -1 if xi is in C2.
An obvious starting point is to use the number of training patterns that are currently misclassified. This is equivalent to the sum-of-squares error function
  E(w) = (1/2) Σi |y(xi) - ti| = (1/4) Σi (y(xi) - ti)^2
However, thinking about the resulting error surface highlights some bad properties of this error for gradient descent.
25. In particular, a smooth change in the weights Δw will not result in a smooth change in the error.
[Figure: a piecewise-constant error surface E against w, stepping between values such as 4 and 5 as the weight changes by Δw; alongside, a plot of patterns (x, o) in which a small shift of the boundary either leaves every pattern classified as before or reclassifies one of them.]
- Either the weight change has no effect on the error,
- or a pattern is reclassified, causing a discontinuity in the error surface.
This means we get no information from the error gradient (not great for a gradient descent procedure): we cannot distinguish between nearby boundaries that classify the data identically.
26. Learning of the perceptron
- Perceptron learning is to find values of the weights that will lead to the minimal classification error.
- Let us call:
  - wi the ith weight of the neuron
  - xm = (a1, ..., aN) an input vector of size N
  - {x1 → y1, x2 → y2, ..., xL → yL} a learning set with L examples
  - hw(xm) the output computed for xm
- Given the input x and the correct output y, the output of a perceptron can be written as a function
  hw(xm) = g(Σi wi ai)
- And the square error for one example is
  E = (1/2) (ym - hw(xm))^2
27. Learning of the perceptron
- The partial derivative of the error with respect to each weight (treating the output as linear in the weights) is
  ∂E/∂wi = -(ym - hw(xm)) ai
- With the gradient descent algorithm, the weights are incrementally updated as follows:
  wi ← wi + η (ym - hw(xm)) ai
- where η is the learning rate.
28. Perceptron weight learning algorithm
- Suppose:
  - wi is the ith weight of the neuron
  - xm = (a1, ..., aN) is an input vector of size N
  - {x1 → y1, x2 → y2, ..., xL → yL} is a learning set with L examples
  - g is the evaluation function
- Repeat
  - for each xm do
    - compute the output hw(xm) = g(Σi wi ai)
    - for each wi do
      - wi ← wi + η (ym - hw(xm)) ai
- Until the number of epochs is reached
- Return the weights
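The loop above, with the update rule from slide 27 and the bias folded in as a weight on a constant input, can be sketched as:

```python
# Sketch of the slide's learning loop; the bias is folded in as weight w0
# on a constant input a0 = 1, and sgn(0) is taken as 1.

def sgn(a):
    return 1 if a >= 0 else -1

def train(examples, w, eta=0.1, epochs=10):
    """examples: list of (input vector with leading 1, target) pairs."""
    for _ in range(epochs):
        for x, y in examples:
            h = sgn(sum(wi * ai for wi, ai in zip(w, x)))
            # wi <- wi + eta * (y - h) * ai
            w = [wi + eta * (y - h) * ai for wi, ai in zip(w, x)]
    return w
```

Running it on slide 29's learning set {2 → 1, 3 → -1} with w0 = 1.5, w1 = 0 and η = 0.1 reproduces the weights w0 = 1.3, w1 = -0.6 quoted at the start of Epoch 2.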
29. Learning of the perceptron: Epoch 1
- Epoch 1:
  - g() is sgn() (taking sgn(0) = 1)
  - w0 = 1.5 and w1 = 0 are the initial values of the weights of the neuron
  - x is the input vector of size 1
  - {2 → 1, 3 → -1} is the learning set with 2 examples
  - η = 0.1 is the learning rate
- For x1 = 2: hw(x1) = sgn(1.5 + 0 × 2) = 1 = y1, so the weights are unchanged.
- For x2 = 3: hw(x2) = sgn(1.5 + 0 × 3) = 1 ≠ y2 = -1, so
  w0 ← 1.5 + 0.1 × (-1 - 1) × 1 = 1.3 and w1 ← 0 + 0.1 × (-1 - 1) × 3 = -0.6.
30. Learning of the perceptron: Epoch 2
- Epoch 2:
  - w0 = 1.3 and w1 = -0.6
  - x is the input vector of size 1
  - {2 → 1, 3 → -1} is the learning set with 2 examples
  - η = 0.1 is the learning rate
- For x1 = 2: hw(x1) = sgn(1.3 - 0.6 × 2) = sgn(0.1) = 1 = y1, so no update.
- For x2 = 3: hw(x2) = sgn(1.3 - 0.6 × 3) = sgn(-0.5) = -1 = y2, so no update: both examples are now correctly classified.
31. The Fall of the Perceptron
- Marvin Minsky & Seymour Papert (1969). Perceptrons, MIT Press, Cambridge, MA.
- Before long, researchers had begun to discover the Perceptron's limitations.
- Unless input categories were linearly separable, a perceptron could not learn to discriminate between them.
- Unfortunately, it appeared that many important categories were not linearly separable.
- E.g., those inputs to an XOR gate that give an output of 1 (namely (1,0) and (0,1)) are not linearly separable from those that do not ((0,0) and (1,1)).
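The XOR claim can be checked empirically: the perceptron rule never reaches zero errors on XOR data, since no linear boundary separates the two classes (the training settings here are my own):

```python
# Empirical check of the XOR limitation: the perceptron rule cannot reach
# zero errors, because no line separates {(0,1),(1,0)} from {(0,0),(1,1)}.
# Training settings (eta, epochs, initial weights) are my own.

def sgn(a):
    return 1 if a >= 0 else -1

# Inputs carry a leading constant 1 for the bias; targets are in {-1, 1}.
xor_data = [([1, 0, 0], -1), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], -1)]

def errors(w, data):
    return sum(1 for x, y in data
               if sgn(sum(wi * ai for wi, ai in zip(w, x))) != y)

w = [0.0, 0.0, 0.0]
for _ in range(100):  # perceptron learning rule, 100 epochs
    for x, y in xor_data:
        h = sgn(sum(wi * ai for wi, ai in zip(w, x)))
        w = [wi + 0.1 * (y - h) * ai for wi, ai in zip(w, x)]
```

However many epochs are run, errors(w, xor_data) never reaches 0.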
32. The Fall of the Perceptron
[Figure: a scatter plot of academics and successful footballers arranged in an XOR-like pattern; the axis labels are not recoverable.]
In this example, a perceptron would not be able to discriminate between the footballers and the academics, despite the simplicity of their relationship.
This failure caused the majority of researchers to walk away.
33. The simple XOR example masks a deeper problem...
[Figure: four shapes (1-4), each either connected or disconnected; dashed circles mark the two regions from which the perceptron takes its inputs.]
Consider a perceptron classifying shapes as connected or disconnected, taking inputs from the dashed circles in shape 1.
- In going from 1 to 2, the change at the right-hand end only must be sufficient to change the classification (raise/lower the linear sum through 0).
- Similarly, the change at the left-hand end only must be sufficient to change the classification.
- Therefore changing both ends must take the sum even further across the threshold, even when the correct classification is unchanged.
The problem arises because, with a single layer of processing, local knowledge cannot be combined into global knowledge. So add more layers...
34. THE PERCEPTRON CONTROVERSY
There is no doubt that Minsky and Papert's book was a block to the funding of research in neural networks for more than ten years. The book was widely interpreted as showing that neural networks are basically limited and fatally flawed. What IS controversial is whether Minsky and Papert shared and/or promoted this belief.
Following the rebirth of interest in artificial neural networks, Minsky and Papert claimed that they had not intended such a broad interpretation of the conclusions they reached in the book Perceptrons. However, Jianfeng was present at MIT in 1974, and reached a different conclusion on the basis of the internal reports circulating at MIT. What were Minsky and Papert actually saying to their colleagues in the period after the publication of their book?
35. Minsky and Papert describe a neural network with a hidden layer as follows:
  GAMBA PERCEPTRON: A number of linear threshold systems have their outputs connected to the inputs of a linear threshold system. Thus we have a linear threshold function of many linear threshold functions.
Minsky and Papert then state:
  Virtually nothing is known about the computational capabilities of this latter kind of machine. We believe that it can do little more than can a low order perceptron. (This, in turn, would mean, roughly, that although they could recognize [sic] some relations between the points of a picture, they could not handle relations between such relations to any significant extent.) That we cannot understand mathematically the Gamba perceptron very well is, we feel, symptomatic of the early state of development of elementary computational theories.
36. In summary, Minsky and Papert, with intellectual honesty, confessed that they were not able to prove that, even with hidden layers, feed-forward neural nets were useless, but they expressed strong confidence that they were quite inadequate computational learning devices.
NB: Minsky and Papert restrict discussion to "linear threshold" units rather than the sigmoid threshold functions prevalent in ANNs.
Conclusion? Don't believe everything you hear.