Title: Perceptron
1Perceptron
2Neural Networks
- A large number of very simple neuron-like processing elements
- A large number of weighted connections between the elements
- Highly parallel, distributed control
- An emphasis on learning internal representations automatically
3Why Neural Nets?
- Solving problems under constraints similar to those of the brain may lead to solutions to AI problems that might otherwise be overlooked.
- Individual neurons operate relatively slowly, but make up for that with massive parallelism.
4The Parts of a Neuron
5How it Works
- Each neuron has branching from it a number of
small fibers called dendrites and a single long
fiber, the axon.
6How it Works
- The axon eventually splits and ends in a number
of synapses which connect the axon to the
dendrites of other neurons.
7How it Works
- Communication between neurons occurs along these
paths. When the electric potential in a neuron
rises above a threshold, the neuron activates.
8How it Works
- The neuron sends the electrical impulse down the
axon to the synapses.
9How it Works
- A synapse can either add to the electrical
potential or subtract from the electrical
potential.
10How it Works
- The pulse then enters the connected neurons' dendrites, and the process begins again.
11neural network
- A neural network is made up of the interconnection of a large number of nonlinear processing units (neurons)
- The network may consist of feedforward and feedback paths
- Interesting properties
  - nonlinearity
  - learning
12McCulloch and Pitts, 1943
- The modern era of neural networks starts in the 1940s, when Warren McCulloch (a psychiatrist and neuroanatomist) and Walter Pitts (a mathematician) explored the computational capabilities of networks made of very simple neurons.
- A McCulloch-Pitts neuron fires if the sum of its excitatory inputs exceeds its threshold, as long as it does not receive an inhibitory input.
- Using a network of such neurons, they showed that it was possible to construct any logical function.
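The firing rule described on this slide can be sketched in a few lines. This is only an illustration: the function names, the 0/1 input encoding, and the particular thresholds chosen for AND, OR and NOT are assumptions made for the sketch, not taken from the slides.

```python
def mcculloch_pitts(excitatory, inhibitory, threshold):
    """Return 1 if the unit fires, else 0. All inputs are 0 or 1."""
    if any(inhibitory):
        # A single active inhibitory input prevents firing.
        return 0
    # Otherwise the unit fires when the excitatory sum exceeds the threshold.
    return 1 if sum(excitatory) > threshold else 0


def and_gate(a, b):
    # Both inputs must be active to exceed a threshold of 1.
    return mcculloch_pitts([a, b], [], threshold=1)


def or_gate(a, b):
    # One active input is enough to exceed a threshold of 0.
    return mcculloch_pitts([a, b], [], threshold=0)


def not_gate(a):
    # A constant excitatory input of 1, vetoed when the inhibitory input is active.
    return mcculloch_pitts([1], [a], threshold=0)
```

Combining such gates in layers is one way to see why any logical function can be built from these units.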
13Each logical function can be computed by a two-layered McCulloch-Pitts network. Every finite automaton can be simulated by a network of (recurrent) McCulloch-Pitts cells.
14Hebb, 1949
- In his book The Organization of Behavior, Donald Hebb introduced his postulate of learning (a.k.a. Hebbian learning), which states that the effectiveness of a variable synapse between two neurons is increased by the repeated activation of one neuron by the other across that synapse.
- The Hebbian rule has a strong similarity to the biological process in which a neural pathway is strengthened each time it is used.
15Rosenblatt, 1958
- Frank Rosenblatt introduced the perceptron, the simplest form of a neural network.
- The perceptron consists of a single neuron with adjustable synaptic weights and a threshold activation function.
- Rosenblatt's original perceptron in fact consisted of three layers (sensory, association and response), of which only one layer had variable weights.
16Rosenblatt,1958-continuation
- Rosenblatt also developed an error-correction
rule to adapt these weights (a.k.a. the
perceptron learning rule), and proved that if the
(two) classes were linearly separable, the
algorithm would converge to a solution (a.k.a.
the perceptron convergence theorem)
18Inputs To Neurons
- Arise from other neurons or from outside the network
- Nodes whose inputs arise outside the network are called input nodes and simply copy values
- An input may excite or inhibit the response of the neuron to which it is applied, depending upon the weight of the connection
19Weights
- Represent synaptic efficacy and may be excitatory or inhibitory
- Normally, positive weights are considered excitatory while negative weights are thought of as inhibitory
- Learning is the process of modifying the weights in order to produce a network that performs some function
20Output
- The response function is normally nonlinear
- Examples include
  - Sigmoid
  - Piecewise linear
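The two response functions named here can be sketched as follows. The slide does not say which piecewise-linear variant is meant, so the one below (unit slope around zero, saturating at 0 and 1) is an assumption, as are the function names.

```python
import math


def sigmoid(z):
    """Smooth, S-shaped response with values in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def piecewise_linear(z):
    """Assumed variant: 0 below -0.5, 1 above 0.5, linear in between."""
    return max(0.0, min(1.0, z + 0.5))
```

Both functions are nonlinear overall, which is the property the slide emphasizes; the sigmoid is also differentiable everywhere.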
23Representational Power of Perceptrons
- Perceptrons can represent the logical AND, OR, and NOT functions as above.
- We consider +1 to represent True and -1 to represent False.
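As an illustration of this claim, here is a minimal sketch of single perceptrons for AND, OR and NOT using the +1/-1 encoding. The specific weight values are assumptions picked for the sketch (many other choices work); the first entry of each weight list plays the role of the bias weight on a constant input of 1.

```python
def perceptron(weights, inputs):
    """Threshold unit: weights[0] is the bias weight, output is +1 or -1."""
    total = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if total > 0 else -1


# Assumed weight choices; each defines one separating line.
AND_W = [-1.0, 1.0, 1.0]   # positive only when both inputs are +1
OR_W = [1.0, 1.0, 1.0]     # negative only when both inputs are -1
NOT_W = [0.0, -1.0]        # flips the sign of its single input

for x1 in (1, -1):
    for x2 in (1, -1):
        print(x1, x2, perceptron(AND_W, [x1, x2]), perceptron(OR_W, [x1, x2]))
```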
24- Here there is no way to draw a single line that separates the "+" (true) values from the "-" (false) values.
25The Good and the Bad News
- Good: every Boolean function can be represented by some network of perceptrons only two levels deep.
- Bad: any single perceptron can only represent linearly separable functions.
- Good: there is a perceptron algorithm that will learn any linearly separable function.
26train a perceptron
- To train a perceptron, Rosenblatt developed a procedure for changing the synaptic weights.
- $Y(t) = \operatorname{sgn}\left[\sum_i X_i(t)\,W_i(t) - \theta\right]$
- sgn: 1 if its argument is positive, otherwise -1
- $X_i(t)$: the input signals
- $W_i(t)$: the synaptic weights
- $\theta$: the threshold for that node
- If the sum of the weighted inputs $\sum_i X_i W_i$ exceeds the threshold, $Y(t) = 1$; otherwise $Y(t) = -1$.
27train a perceptron -continuation
- At the start of the experiment, W(0) and the threshold $\theta$ are set to random values.
- Then the training begins, with the objective of teaching the node to differentiate two classes of inputs, I and II.
- The goal is to have the node output y(t) = 1 if the input is of class I, and y(t) = -1 if the input is of class II.
- You are free to choose any inputs (Xi) and to designate them as being of class I or II.
28train a perceptron - continuation
- If the node outputs 1 when given a class I input, or -1 when given a class II input, the weights Wi do not change.
- If the node outputs -1 when given a class I input, or 1 when given a class II input, the weights Wi change according to the rule
29train a perceptron - continuation
- $W_i(t+1) = W_i(t) + r\,[d(t) - y(t)]\,X_i(t)$
- d(t): desired or target output (1 or -1)
- Since d and y can each be 1 or -1, the difference, if not zero, can only equal 2 or -2.
- r is a positive learning rate (no greater than 1 or 2)
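The update rule on this slide can be sketched as a small function. The names `sgn`, `predict` and `update` are illustrative, and the sketch assumes the threshold is folded into the weight of a constant input X0 = 1 (the convention the worked example uses), so the output is compared against 0.

```python
def sgn(v):
    """1 if the argument is positive, otherwise -1."""
    return 1 if v > 0 else -1


def predict(weights, inputs):
    # inputs[0] is assumed to be the constant bias input X0 = 1.
    return sgn(sum(w * x for w, x in zip(weights, inputs)))


def update(weights, inputs, target, rate):
    """Return the new weights W_i + r * (d - y) * X_i."""
    y = predict(weights, inputs)
    # When y equals the target, (target - y) is 0 and nothing changes.
    return [w + rate * (target - y) * x for w, x in zip(weights, inputs)]


# One misclassified pattern: output is 1 but the target is -1.
print(update([0.1, 0.1, 0.1], [1, 1, -1], target=-1, rate=0.1))
```

Because the correction is proportional to d - y, a correctly classified pattern leaves the weights untouched without needing a separate test.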
30Example
- Let's say we want to figure out the appropriate weights to model the AND function we discussed above (1 AND 1 = 1, 1 AND (-1) = -1, (-1) AND (-1) = -1).
- We're assuming, of course, that no one gave us the weights.
- We set up a perceptron with two inputs (three, if we include X0). Now let's guess some weights.
31reminder
- d(t): desired or target output (1 or -1)
- $Y(t) = 1$ if $\sum_i X_i(t)\,W_i(t) > 0$, and $-1$ otherwise
- Change the weights if $d(t) \neq y(t)$
- $W_i(t+1) = W_i(t) + r\,[d(t) - y(t)]\,X_i(t)$
32Example - continuation
- W0 = 0.1, W1 = 0.1, W2 = 0.1
- And let's let our learning rate r be 0.1.
- Our first training example has X1 = X2 = 1, so the output of the perceptron should be 1.
- Fortunately, that is the output of the perceptron, so no modifications are needed.
33Example - continuation
- Our second training example has X1 = 1, X2 = -1, so the target output of the perceptron should be -1.
- Unfortunately, the actual output of the perceptron is 1. So we need to modify the weights.
- Following the equations above, we calculate
- W0 = 0.1 + (0.1)(-2)(1) = -0.1
- W1 = 0.1 + (0.1)(-2)(1) = -0.1
- W2 = 0.1 + (0.1)(-2)(-1) = 0.3
34Example - continuation
- Now we get a third training example, X1 = X2 = -1, for which the target output is -1.
- Fortunately, that is the actual output, so the weights need not be modified.
35Example - continuation
- Our fourth training example has X1 = -1, X2 = 1, so the target output of the perceptron should be -1.
- Unfortunately, the actual output is 1.
- So it's time for more modification of the weights.
- Again following the equations above, we calculate
- W0 = -0.1 + (0.1)(-2)(1) = -0.3
- W1 = -0.1 + (0.1)(-2)(-1) = 0.1
- W2 = 0.3 + (0.1)(-2)(1) = 0.1
36Example - continuation
- Our fifth training example has X1 = X2 = 1, so the output of the perceptron should be 1.
- This time the output is -1.
- So it's time for more modification of the weights.
- Again following the equations above, we calculate
- W0 = -0.3 + (0.1)(2)(1) = -0.1
- W1 = 0.1 + (0.1)(2)(1) = 0.3
- W2 = 0.1 + (0.1)(2)(1) = 0.3
37Example - continuation
- Our sixth training example has X1 = 1, X2 = -1, so the target output of the perceptron should be -1.
- Indeed, that's what the perceptron produces.
38Example - continuation
- Our seventh example has X1 = X2 = -1, for which the target output is -1.
- Fortunately, that is the actual output, so the weights need not be modified.
- Our eighth example has X1 = -1, X2 = 1, so the target output of the perceptron should be -1.
- Again, it is!
- We've converged on appropriate weights!
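The eight steps of this example can be replayed with a short script. It is a sketch under the same assumptions as the slides (initial weights of 0.1, learning rate 0.1, X0 = 1, and the four patterns presented in the order used above); the function and variable names are illustrative.

```python
def sgn(v):
    return 1 if v > 0 else -1


def train(weights, rate, examples, passes):
    """Apply the perceptron rule; return final weights and the update count."""
    updates = 0
    for _ in range(passes):
        for x1, x2, target in examples:
            inputs = (1, x1, x2)  # X0 = 1 is the bias input
            y = sgn(sum(w * x for w, x in zip(weights, inputs)))
            if y != target:
                weights = [w + rate * (target - y) * x
                           for w, x in zip(weights, inputs)]
                updates += 1
    return weights, updates


# AND in the +1/-1 encoding, in the order the slides present the examples.
examples = [(1, 1, 1), (1, -1, -1), (-1, -1, -1), (-1, 1, -1)]
weights, updates = train([0.1, 0.1, 0.1], 0.1, examples, passes=2)
print([round(w, 1) for w in weights], updates)  # prints [-0.1, 0.3, 0.3] 3
```

Two passes over the four patterns correspond to the eight training examples; the three updates happen on the second, fourth and fifth examples, matching the slides.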
39Single Layer Perceptron
40Single Layer Perceptron
- For a problem which calls for more than two classes, several perceptrons can be combined into a network.
- Can distinguish only linearly separable functions
41Single Layer Perceptron
Single layer, five nodes: 2 inputs and 3 outputs.
Recognizes 3 linearly separable classes by means of 2 features.
42 For a general problem we have to resort to a multi-layer network, as in our brain.
[Figure labels: "Perceptron can do it" and "Perceptron cannot do it"]
43Multi-Layer Networks
44Multi-Layer Networks
- A multi-layer perceptron can classify problems that are not linearly separable.
- A multilayer (feedforward) network has one or more hidden layers.
45Multi-layer networks
[Figure: inputs x1, x2, ..., xn (visual input) feed the hidden layers, which produce the output (motor output).]
46XOR
47XOR
Activation function: if (input > threshold), fire; else, don't fire.
All weights are 1, unless otherwise labeled.
[Figure, repeated on slides 47-50 with different input and output values: a threshold network for XOR. Surviving labels: 1, 1, 0, 0, 2, 1, 1, -2.]
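A two-layer threshold network of the kind the XOR slides illustrate can be sketched as follows. The activation rule, the default weight of 1 and the -2 connection follow the slides; the exact thresholds (1.5 for the hidden unit, 0.5 for the output unit) are assumptions chosen so the sketch works with a strict "greater than" test, since the figure layout did not survive.

```python
def fires(total_input, threshold):
    """Activation rule from the slides: fire (1) if input > threshold, else 0."""
    return 1 if total_input > threshold else 0


def xor_net(x1, x2):
    # Hidden unit: fires only when both inputs are 1 (assumed threshold 1.5).
    hidden = fires(x1 + x2, 1.5)
    # Output unit: inputs arrive with weight 1, the hidden unit with weight -2.
    return fires(x1 + x2 - 2 * hidden, 0.5)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))
```

The hidden unit cancels the case where both inputs are on, which is exactly the case a single perceptron cannot separate from the others.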
51Network Topology
52Feedforward Networks
- Feedforward networks
  - Solutions are known
  - Weights are learned
  - Evolves in the weight space
- Mostly used for
  - Interpolation
  - System modeling
  - Classification, for example of faces, handwriting and voice
  - Adaptive filtering
  - Nonlinear control
53Training Multilayer Perceptron
- The training of multilayer networks raises some important issues.
- How many layers? How many neurons per layer?
- Too few neurons make the network unable to learn the desired behavior. Too many neurons increase the complexity of the learning algorithm.
54Training Multilayer Perceptron
- A desired property of a neural network is its ability to generalize from the training set.
- If there are too many neurons, there is the danger of overfitting.
- Does there exist an effective training algorithm?
55Neural Network Model Building (Supervised
Learning)
56The Backpropagation Algorithm
57Backpropagation Algorithm
- It is a gradient-descent method, a generalization of the LMS rule.
- Requires that the function describing the neural network be differentiable. In particular, the activation function should be differentiable.
- An activation function that is often used is the sigmoid function.
58Gradient Descent Learning Rule
- Consider a linear unit without threshold and with continuous output o (not just -1, 1)
- $o = w_0 + w_1 x_1 + \dots + w_n x_n$
- Train the $w_i$ such that they minimize the squared error (LMS, least mean square)
- $E[w_1, \dots, w_n] = \frac{1}{2} \sum_{d \in S} (t_d - o_d)^2$
- where S is the set of training examples
- The opposite of hill climbing.
59Gradient Descent
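$S = \{\langle(1,1),1\rangle, \langle(-1,-1),1\rangle, \langle(1,-1),-1\rangle, \langle(-1,1),-1\rangle\}$
$\Delta w = -\eta\,\nabla E[w]$
$\Delta w_i = -\eta\,\partial E / \partial w_i$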
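A minimal sketch of batch gradient descent for a linear unit on the training set S shown on this slide. The starting weights, the step size of 0.05 and the number of steps are assumptions picked for illustration. For this particular S the targets are uncorrelated with the bias and with each input, so the minimum-error linear unit has all weights equal to zero and an error of 2; the sketch converges there.

```python
def output(w, x):
    """Linear unit: o = w0 + w1*x1 + ... + wn*xn."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))


def squared_error(w, samples):
    return 0.5 * sum((t - output(w, x)) ** 2 for x, t in samples)


def gradient_descent(w, samples, eta, steps):
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, t in samples:
            err = t - output(w, x)
            inputs = (1.0,) + tuple(x)  # constant 1 pairs with w0
            for i, xi in enumerate(inputs):
                grad[i] += -err * xi    # dE/dw_i = -sum_d (t_d - o_d) x_{i,d}
        # Step against the gradient: delta w_i = -eta * dE/dw_i
        w = [wi - eta * gi for wi, gi in zip(w, grad)]
    return w


S = [((1, 1), 1), ((-1, -1), 1), ((1, -1), -1), ((-1, 1), -1)]
w = gradient_descent([0.5, -0.3, 0.2], S, eta=0.05, steps=200)
print(w, squared_error(w, S))
```

The result shows gradient descent settling on the best available linear fit even though no linear unit can reproduce these targets.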
60Sigmoid Unit
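[Figure: inputs $x_0 = 1, x_1, \dots, x_n$ with weights $w_0, w_1, w_2, \dots, w_n$ feed a summation node that produces the output o.]
$z = \sum_{i=0}^{n} w_i x_i$
$o = \sigma(z) = 1/(1 + e^{-z})$
$\sigma(z) = 1/(1 + e^{-z})$ is the sigmoid function.
$d\sigma(z)/dz = \sigma(z)\,(1 - \sigma(z))$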
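The derivative identity for the sigmoid can be checked numerically. The sketch below compares the closed form with a central finite difference; the step size `h` and the sample points are arbitrary choices made for the check.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    """Closed form: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


def numeric_derivative(f, z, h=1e-5):
    """Central finite-difference approximation of f'(z)."""
    return (f(z + h) - f(z - h)) / (2.0 * h)


for z in (-2.0, 0.0, 0.5, 3.0):
    print(z, sigmoid_derivative(z), numeric_derivative(sigmoid, z))
```

This identity is what makes the sigmoid convenient for backpropagation: the derivative is available from the unit's output alone.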
61Backpropagation Preparation
- Training set: a collection of input-output patterns that are used to train the network
- Testing set: a collection of input-output patterns that are used to assess network performance
- Learning rate $\eta$: a scalar parameter, analogous to step size in numerical integration, used to set the rate of adjustments
62A Pseudo-Code Algorithm
- Randomly choose the initial weights
- While error is too large (E > E-acceptable)
  - For each training pattern
    - Apply the inputs to the network
    - Calculate the output for every neuron from the input layer, through the hidden layer(s), to the output layer
    - Calculate the error at the outputs
    - Use the output error to compute error signals for pre-output layers
    - Use the error signals to compute weight adjustments
    - Apply the weight adjustments
63Backpropagation Math
- Consider the squared error
- $E_S[w] = \frac{1}{2} \sum_{d \in S} \sum_{k \in \text{output}} (t_{d,k} - o_{d,k})^2$
- Gradient: $\nabla E_S[w]$
- Update: $w = w - \eta\,\nabla E_S[w]$
- How do we compute the gradient?
- Use the chain rule to compute the gradient
64Calculate The Error Signal For Each Output Neuron
- The output neuron error signal $\delta_{pj}$ is given by $\delta_{pj} = (T_{pj} - O_{pj})\,O_{pj}\,(1 - O_{pj})$
- $T_{pj}$ is the target value of output neuron j for pattern p
- $O_{pj}$ is the actual output value of output neuron j for pattern p
65Calculate The Error Signal For Each Hidden Neuron
- The hidden neuron error signal $\delta_{pj}$ is given by $\delta_{pj} = O_{pj}\,(1 - O_{pj}) \sum_k \delta_{pk}\,W_{kj}$
- where $\delta_{pk}$ is the error signal of a post-synaptic neuron k and $W_{kj}$ is the weight of the connection from hidden neuron j to the post-synaptic neuron k
66Calculate And Apply Weight Adjustments
- Compute weight adjustments $\Delta W_{ji}$ by $\Delta W_{ji} = \eta\,\delta_{pj}\,O_{pi}$
- Apply weight adjustments according to $W_{ji} \leftarrow W_{ji} + \Delta W_{ji}$
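The three steps (output error signals, hidden error signals, weight adjustments) can be put together in one small sketch for a network with a single hidden layer of sigmoid units. The 2-2-1 layout, the random initial weights, the learning rate of 0.5 and all function names are assumptions for illustration; index 0 of every weight row is a bias weight on a constant input of 1.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w_hidden, w_out, x):
    """Return (hidden activations, output activations)."""
    h = [sigmoid(r[0] + sum(w * xi for w, xi in zip(r[1:], x))) for r in w_hidden]
    o = [sigmoid(r[0] + sum(w * hi for w, hi in zip(r[1:], h))) for r in w_out]
    return h, o


def backprop_step(w_hidden, w_out, x, target, eta):
    """Update both weight lists in place for one training pattern."""
    h, o = forward(w_hidden, w_out, x)
    # Output error signals: (T - O) * O * (1 - O)
    d_out = [(t - ok) * ok * (1 - ok) for t, ok in zip(target, o)]
    # Hidden error signals: O * (1 - O) * sum_k delta_k * W_kj (pre-update weights)
    d_hid = [hj * (1 - hj) * sum(d_out[k] * w_out[k][j + 1] for k in range(len(w_out)))
             for j, hj in enumerate(h)]
    # Weight adjustments: eta * delta_j * O_i, where O_i = 1 for the bias
    for k, row in enumerate(w_out):
        row[0] += eta * d_out[k]
        for j, hj in enumerate(h):
            row[j + 1] += eta * d_out[k] * hj
    for j, row in enumerate(w_hidden):
        row[0] += eta * d_hid[j]
        for i, xi in enumerate(x):
            row[i + 1] += eta * d_hid[j] * xi


def pattern_error(w_hidden, w_out, x, target):
    _, o = forward(w_hidden, w_out, x)
    return 0.5 * sum((t - ok) ** 2 for t, ok in zip(target, o))


random.seed(0)
w_hidden = [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]
w_out = [[random.uniform(-0.5, 0.5) for _ in range(3)]]
x, target = [1.0, 0.0], [1.0]
before = pattern_error(w_hidden, w_out, x, target)
backprop_step(w_hidden, w_out, x, target, eta=0.5)
after = pattern_error(w_hidden, w_out, x, target)
print(before, after)
```

A single step on one pattern should lower the error on that pattern; a full training run would repeat the step over all patterns until the error is acceptable.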
67Backpropagation The Momentum
- Backpropagation has the disadvantage of being too slow if $\eta$ is small, and it can oscillate too widely if $\eta$ is large.
- To solve this problem, we can add a momentum term ($\alpha$) to give each connection some inertia, forcing it to change in the direction of the downhill force.
- Weight change depends on the current gradient and the previous weight change
- New delta rule
- $\Delta W_{ji}(t+1) = -\eta\,\partial E / \partial W_{ji} + \alpha\,\Delta W_{ji}(t)$
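The momentum rule can be tried on a one-dimensional toy problem. The error surface E(w) = 0.5 * w^2, the starting point, and the values 0.1 for the learning rate and 0.9 for the momentum coefficient are all hypothetical choices for this sketch, not values from the slides.

```python
def descend_with_momentum(grad, w, eta, alpha, steps):
    """Minimize a 1-D function, given its gradient, using the momentum rule."""
    delta = 0.0  # previous weight change, initially zero
    for _ in range(steps):
        # New change = gradient step plus a fraction of the previous change.
        delta = -eta * grad(w) + alpha * delta
        w += delta
    return w


# Hypothetical error surface E(w) = 0.5 * w**2, whose gradient is simply w.
w_final = descend_with_momentum(lambda w: w, w=1.0, eta=0.1, alpha=0.9, steps=500)
print(w_final)
```

With the momentum coefficient set to 0 the same function reduces to plain gradient descent, which makes it easy to compare the two behaviors.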
68Backpropagation Summary
- Gradient descent over the entire network weight vector
- Finds a local, not necessarily global, error minimum
  - in practice often works well
  - can require multiple invocations with different initial weights
- Training is fairly slow, yet prediction is fast
69Problems with training
- Nets get stuck
  - Not enough degrees of freedom
  - Hidden layer is too small
- Training becomes unstable
  - Too many degrees of freedom
  - Hidden layer is too big / too many hidden layers
- Over-fitting
  - Can find every pattern, but not all are significant. If a neural net is over-fit, it will not generalize well to the testing dataset.
70Comparison Perceptron and Gradient Descent Rule
- Perceptron learning rule is guaranteed to succeed if
  - Training examples are linearly separable
  - No guarantee otherwise
- Linear unit using gradient descent
  - Converges to the hypothesis with minimum squared error
  - Given a sufficiently small learning rate $\eta$
  - Even when training data contains noise
  - Even when training data is not linearly separable