Title: Feed-Forward Neural Networks
1. Feed-Forward Neural Networks
2. Content
- Introduction
- Single-Layer Perceptron Networks
- Learning Rules for Single-Layer Perceptron Networks
  - Perceptron Learning Rule
  - Adaline Learning Rule
  - δ-Learning Rule
- Multilayer Perceptron
- Back-Propagation Learning Algorithm
3. Feed-Forward Neural Networks
4. Historical Background
- 1943: McCulloch and Pitts proposed the first computational model of the neuron.
- 1949: Hebb proposed the first learning rule.
- 1958: Rosenblatt's work on perceptrons.
- 1969: Minsky and Papert exposed the limitations of the theory.
- 1970s: Decade of dormancy for neural networks.
- 1980s-90s: Neural networks return (self-organization, back-propagation algorithms, etc.)
5. Nervous Systems
- The human brain contains about 10^11 neurons.
- Each neuron is connected to about 10^4 others.
- Some scientists have compared the brain to a complex, nonlinear, parallel computer.
- The largest modern neural networks achieve a complexity comparable to the nervous system of a fly.
6. Neurons
- The main purpose of neurons is to receive, analyze, and transmit information in the form of signals (electric pulses).
- When a neuron sends information, we say that the neuron fires.
7. Neurons
Acting through specialized projections known as dendrites and axons, neurons carry information throughout the neural network.
This animation demonstrates the firing of a synapse between the pre-synaptic terminal of one neuron and the soma (cell body) of another neuron.
8. A Model of an Artificial Neuron
9. A Model of an Artificial Neuron
10. Feed-Forward Neural Networks
- Graph representation:
  - nodes: neurons
  - arrows: signal flow directions
- A neural network that does not contain cycles (feedback loops) is called a feed-forward network (or perceptron).
11. Layered Structure
Hidden Layer(s)
12. Knowledge and Memory
- The output behavior of a network is determined by the weights.
- Weights: the memory of an NN.
- Knowledge: distributed across the network.
- A large number of nodes
  - increases the storage capacity,
  - ensures that the knowledge is robust, and
  - provides fault tolerance.
- New information is stored by changing weights.
13. Pattern Classification
output pattern y
- Function x → y
- The NN's output is used to distinguish between and recognize different input patterns.
- Different output patterns correspond to particular classes of input patterns.
- Networks with hidden layers can be used to solve more complex problems than just linear pattern classification.
input pattern x
14. Training
Training Set
Goal
15. Generalization
- A properly trained neural network may produce reasonable answers for input patterns not seen during training (generalization).
- Generalization is particularly useful for the analysis of noisy data (e.g., time series).
17. Applications
- Pattern classification
- Object recognition
- Function approximation
- Data compression
- Time series analysis and forecasting
- . . .
18. Feed-Forward Neural Networks
- Single-Layer Perceptron Networks
19. The Single-Layer Perceptron
20. Training a Single-Layer Perceptron
Training Set
Goal
21. Learning Rules
- Linear Threshold Units (LTUs): Perceptron Learning Rule
- Linearly Graded Units (LGUs): Widrow-Hoff Learning Rule
Training Set
Goal
22. Feed-Forward Neural Networks
- Learning Rules for Single-Layer Perceptron Networks
  - Perceptron Learning Rule
  - Adaline Learning Rule
  - δ-Learning Rule
23. Perceptron
Linear Threshold Unit: y = sgn(w^T x)
24. Perceptron
Goal
Linear Threshold Unit: y = sgn(w^T x)
25. Example
Goal
Class 1
g(x) = -2x1 + x2 + 2 = 0
Class 2
26. Augmented Input Vector
Goal
Class 1 (+1)
Class 2 (-1)
27. Augmented Input Vector
Goal
28. Augmented Input Vector
Goal
A plane passes through the origin of the augmented input space.
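The augmentation can be sketched in code. This is a minimal sketch in which the bias component is fixed to -1 and the names `augment`, `w`, and the sample values are all illustrative choices, not taken from the slides:

```python
import numpy as np

# Minimal sketch: augmenting an input vector with a fixed bias component so
# that the decision boundary w^T x = 0 passes through the origin of the
# augmented input space. The bias value -1 is an assumed convention.

def augment(x):
    """Append the fixed bias component -1 to an input vector."""
    return np.append(np.asarray(x, dtype=float), -1.0)

# With augmented weights w = (w1, w2, threshold), the test
# w1*x1 + w2*x2 > threshold becomes simply w^T x_aug > 0.
w = np.array([1.0, 1.0, 4.0])
x_aug = augment([2.0, 3.0])
print(x_aug)             # [ 2.  3. -1.]
print(float(w @ x_aug))  # 2 + 3 - 4 = 1.0
```

The benefit is that the threshold is learned like any other weight, so every separating hyperplane passes through the origin of the augmented space.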
29. Linearly Separable vs. Linearly Non-Separable
AND: Linearly Separable
OR: Linearly Separable
XOR: Linearly Non-Separable
30. Goal
- Given training sets T1 ⊂ C1 and T2 ⊂ C2 with elements of the form x = (x1, x2, ..., x_{m-1}, x_m)^T, where x1, x2, ..., x_{m-1} ∈ R and x_m = -1.
- Assume T1 and T2 are linearly separable.
- Find w = (w1, w2, ..., w_m)^T such that w^T x > 0 for x ∈ T1 and w^T x < 0 for x ∈ T2.
31. Goal
w^T x = 0 is a hyperplane passing through the origin of the augmented input space.
- Given training sets T1 ⊂ C1 and T2 ⊂ C2 with elements of the form x = (x1, x2, ..., x_{m-1}, x_m)^T, where x1, x2, ..., x_{m-1} ∈ R and x_m = -1.
- Assume T1 and T2 are linearly separable.
- Find w = (w1, w2, ..., w_m)^T such that w^T x > 0 for x ∈ T1 and w^T x < 0 for x ∈ T2.
32. Observation
Which w's correctly classify x?
What trick can be used?
33. Observation
Is this w ok? (Decision boundary: w1x1 + w2x2 = 0.)
34. Observation
Is this w ok? (Decision boundary: w1x1 + w2x2 = 0.)
35. Observation
Is this w ok? (Decision boundary: w1x1 + w2x2 = 0.)
How should w be adjusted? Δw = ?
36. Observation
Is this w ok?
Is Δw = -ηx reasonable? (For a misclassified negative example, it moves w^T x from > 0 toward < 0.)
37. Observation
Is this w ok?
Is Δw = ηx reasonable? (For a misclassified positive example, it moves w^T x from < 0 toward > 0.)
38. Observation
Is this w ok?
Adjust with Δw = ηx or Δw = -ηx, depending on the class of the misclassified example.
39. Perceptron Learning Rule
Upon misclassification of an input:
Define the error
40. Perceptron Learning Rule
Define the error
41. Perceptron Learning Rule
42. Summary: Perceptron Learning Rule
Based on the general weight-learning rule.
correct
incorrect
43. Summary: Perceptron Learning Rule
Does it converge?
44. Perceptron Convergence Theorem
If the given training set is linearly separable, the learning process will converge in a finite number of steps.
- Exercise: consult papers or textbooks to prove the theorem.
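The rule and the convergence theorem can be illustrated concretely. This is a minimal sketch; the AND training set, the bipolar targets, and the choice η = 1 are illustrative, not taken from the slides:

```python
import numpy as np

# Sketch of the perceptron learning rule on the linearly separable AND
# problem, with augmented inputs (bias component fixed to -1) and eta = 1.

def sgn(v):
    return 1 if v >= 0 else -1

X = np.array([[0, 0, -1], [0, 1, -1], [1, 0, -1], [1, 1, -1]], dtype=float)
d = np.array([-1, -1, -1, 1])       # AND with bipolar targets

w = np.zeros(3)
eta = 1.0
for epoch in range(100):
    errors = 0
    for x, t in zip(X, d):
        if sgn(w @ x) != t:         # update only on misclassification
            w += eta * t * x        # Delta w = eta * d * x
            errors += 1
    if errors == 0:                 # converged, as the theorem guarantees
        break

assert all(sgn(w @ x) == t for x, t in zip(X, d))
```

Because AND is linearly separable, the loop reaches an error-free epoch after finitely many updates; on XOR it would cycle forever.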
45. The Learning Scenario
Linearly separable.
46. The Learning Scenario
47. The Learning Scenario
48. The Learning Scenario
49. The Learning Scenario
50. The Learning Scenario
(Figure: weight vectors w3 and w4.)
51. The Learning Scenario
(Figure: final weight vector w.)
52. The Learning Scenario
The demonstration is in augmented space.
Conceptually, in augmented space, we adjust the weight vector to fit the data.
53. Weight Space
A weight vector in the shaded area gives correct classification for the positive example.
54. Weight Space
A weight vector in the shaded area gives correct classification for the positive example.
Δw = ηx
55. Weight Space
A weight vector not in the shaded area gives correct classification for the negative example.
56. Weight Space
A weight vector not in the shaded area gives correct classification for the negative example.
Δw = -ηx
57. The Learning Scenario in Weight Space
58-69. The Learning Scenario in Weight Space
To correctly classify the training set, the weight must move into the shaded area.
(Animation: the successive weight vectors w0, w1, ..., w11 step into the feasible region.)
70. The Learning Scenario in Weight Space
To correctly classify the training set, the weight must move into the shaded area.
Conceptually, in weight space, we move the weight into the feasible region.
71. Feed-Forward Neural Networks
- Learning Rules for Single-Layer Perceptron Networks
  - Perceptron Learning Rule
  - Adaline Learning Rule
  - δ-Learning Rule
72. Adaline (Adaptive Linear Element)
Widrow, 1962
73. Adaline (Adaptive Linear Element)
Under what condition is the goal reachable?
Goal
Widrow, 1962
74. LMS (Least Mean Square)
Minimize the cost function (error function):
75. Gradient Descent Algorithm
Our goal is to go downhill.
(Figure: contour map of the cost surface over (w1, w2), with the step Δw.)
76. Gradient Descent Algorithm
Our goal is to go downhill.
How do we find the steepest descent direction?
(Figure: contour map of the cost surface over (w1, w2), with the step Δw.)
77. Gradient Operator
Let f(w) = f(w1, w2, ..., wm) be a function over R^m.
Define df = ∇f · dw, where ∇f = (∂f/∂w1, ..., ∂f/∂wm)^T.
78. Gradient Operator
df positive: go uphill
df zero: flat
df negative: go downhill
79. The Steepest Descent Direction
To minimize f, we choose Δw = -η∇f.
df positive: go uphill
df zero: flat
df negative: go downhill
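The steepest-descent step Δw = -η∇f can be demonstrated on a toy cost function. In this minimal sketch the quadratic f, the starting point, and the learning rate are all illustrative choices:

```python
import numpy as np

# Sketch of steepest descent on f(w) = w1^2 + 2*w2^2, whose minimum is at
# the origin. Each step moves opposite the gradient: Delta w = -eta * grad f.

def grad_f(w):
    return np.array([2 * w[0], 4 * w[1]])  # gradient of w1^2 + 2*w2^2

w = np.array([3.0, 2.0])
eta = 0.1
for _ in range(200):
    w = w - eta * grad_f(w)                # downhill step

print(np.round(w, 6))                      # very close to the minimizer (0, 0)
```

Each coordinate shrinks by a constant factor per step (0.8 and 0.6 here), so the iterate converges geometrically to the minimum as long as η is small enough.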
80. LMS (Least Mean Square)
Minimize the cost function (error function):
81. Adaline Learning Rule
Minimize the cost function (error function):
82. Learning Modes
- Batch Learning Mode
- Incremental Learning Mode
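The two modes can be contrasted on the same data. This is a sketch under assumptions: the three training pairs, η, and the epoch count are illustrative; batch mode accumulates the gradient over the whole set before one update, while incremental mode updates after every pattern:

```python
import numpy as np

# Sketch contrasting batch and incremental Adaline/LMS learning.
# Data is illustrative (and happens to be exactly fittable by a line).

X = np.array([[1.0, 2.0], [2.0, 1.0], [1.0, -1.0]])
d = np.array([1.0, 0.0, -1.0])
eta = 0.05

w_batch = np.zeros(2)
for _ in range(500):
    # gradient of (1/2) * sum (w.x - d)^2 over the whole training set
    grad = sum((w_batch @ x - t) * x for x, t in zip(X, d))
    w_batch -= eta * grad               # one update per epoch

w_inc = np.zeros(2)
for _ in range(500):
    for x, t in zip(X, d):
        w_inc += eta * (t - w_inc @ x) * x   # one update per pattern
```

With a small η both modes settle at the same least-squares solution; incremental updates react faster per pattern, batch updates follow the true gradient.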
83. Summary: Adaline Learning Rule
δ-Learning Rule = LMS Algorithm = Widrow-Hoff Learning Rule
Does it converge?
84. LMS Convergence
- Based on the independence theory (Widrow, 1976):
- The successive input vectors are statistically independent.
- At time t, the input vector x(t) is statistically independent of all previous samples of the desired response, namely d(1), d(2), ..., d(t-1).
- At time t, the desired response d(t) depends on x(t) but is statistically independent of all previous values of the desired response.
- The input vector x(t) and desired response d(t) are drawn from Gaussian-distributed populations.
85. LMS Convergence
It can be shown that LMS converges if 0 < η < 2/λmax, where λmax is the largest eigenvalue of the correlation matrix Rx of the inputs.
86. LMS Convergence
It can be shown that LMS converges if 0 < η < 2/λmax, where λmax is the largest eigenvalue of the correlation matrix Rx of the inputs.
Since λmax is hardly available, we commonly use 0 < η < 2/tr(Rx) instead, which is safe because tr(Rx) ≥ λmax.
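The step-size bound can be checked numerically. A sketch with synthetic, illustrative data, computing both the eigenvalue bound and the easier trace-based fallback (valid since tr(Rx) ≥ λmax for a positive semidefinite Rx):

```python
import numpy as np

# Sketch: estimating the LMS step-size bound 2/lambda_max from data, and
# the looser trace-based bound 2/tr(Rx). The data here is illustrative.

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))        # 1000 input vectors in R^3
Rx = (X.T @ X) / len(X)               # correlation-matrix estimate
lam_max = np.max(np.linalg.eigvalsh(Rx))
bound_eig = 2.0 / lam_max             # theoretical bound on eta
bound_tr = 2.0 / np.trace(Rx)         # practical, more conservative bound
assert 0 < bound_tr <= bound_eig
```

The trace is cheap to accumulate online (it is the average input power), which is why it is preferred in practice over an eigendecomposition.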
87. Comparisons
             Perceptron Learning Rule   Adaline Learning Rule
Fundamental  Hebbian assumption         Gradient descent
Convergence  In finite steps            Converges asymptotically
Constraint   Linearly separable         Linear independence
88. Feed-Forward Neural Networks
- Learning Rules for Single-Layer Perceptron Networks
  - Perceptron Learning Rule
  - Adaline Learning Rule
  - δ-Learning Rule
89. Adaline
90. Unipolar Sigmoid
91. Bipolar Sigmoid
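The two sigmoids, and their derivatives expressed through the output value (the form the later learning rules use), can be sketched as follows; the function names are illustrative:

```python
import numpy as np

# Sketch of the unipolar and bipolar sigmoid activation functions and their
# derivatives written in terms of the output y rather than the net input.

def unipolar(net):
    return 1.0 / (1.0 + np.exp(-net))        # output in (0, 1)

def bipolar(net):
    return 2.0 / (1.0 + np.exp(-net)) - 1.0  # output in (-1, 1)

def unipolar_deriv(y):
    return y * (1.0 - y)                     # f'(net) = y (1 - y)

def bipolar_deriv(y):
    return 0.5 * (1.0 + y) * (1.0 - y)       # f'(net) = (1 - y^2) / 2

print(unipolar(0.0), bipolar(0.0))           # 0.5 0.0
```

Expressing the derivative through y is what makes the weight-update formulas cheap: the forward pass already computed y.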
92. Goal
Minimize
93. Gradient Descent Algorithm
Minimize
94. The Gradient
Minimize
Depends on the activation function used.
95. Weight Modification Rule
Minimize
Batch Learning Rule
Incremental Learning Rule
96. The Learning Efficacy
Minimize
Sigmoid: Unipolar, Bipolar
Adaline
Exercise
97. Learning Rule: Unipolar Sigmoid
Minimize
98. Comparisons
Adaline: batch and incremental forms
Sigmoid: batch and incremental forms
99. The Learning Efficacy
Sigmoid: depends on output
Adaline: constant
100. The Learning Efficacy
The learning efficacy of Adaline is constant, meaning that Adaline will never get saturated.
Sigmoid: depends on output
Adaline: constant
101. The Learning Efficacy
The sigmoid will get saturated if its output value nears the two extremes.
Sigmoid: depends on output
Adaline: constant
102. Initialization for Sigmoid Neurons
Before training, the weights must be sufficiently small. Why?
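The reason can be shown numerically. A sketch with illustrative weight magnitudes: when the weights are large, the sigmoid saturates and its derivative y(1 - y) vanishes, so the weight updates stall:

```python
import numpy as np

# Sketch of why sigmoid weights are initialized small: large weights push
# the net input deep into the flat tails of the sigmoid, where the
# derivative y(1 - y) is effectively zero and learning stops.

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

x = np.ones(10)
y_small = sigmoid(np.full(10, 0.1) @ x)   # net = 1.0
y_large = sigmoid(np.full(10, 5.0) @ x)   # net = 50.0, saturated
print(y_small * (1 - y_small))            # healthy gradient, about 0.2
print(y_large * (1 - y_large))            # vanishing gradient, about 0
```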
103. Feed-Forward Neural Networks
104. Multilayer Perceptron
Output Layer
Hidden Layer
Input Layer
105. Multilayer Perceptron
Where does the knowledge come from?
Classification
Output
Analysis
Learning
Input
106. How Does an MLP Work?
Example: XOR
- Not linearly separable.
- Is a single-layer perceptron workable?
107-109. How Does an MLP Work?
(Figure: the XOR input points 00, 01, 11 and the separating lines.)
110. How Does an MLP Work?
Example
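The idea can be made concrete with hand-chosen weights. This is a minimal sketch, not the slides' construction: a 2-2-1 network of threshold units in which the hidden units compute OR and NAND and the output unit ANDs them, yielding XOR; all weight values are illustrative:

```python
import numpy as np

# Sketch: a 2-2-1 MLP of threshold units solving XOR, which no single-layer
# perceptron can. Hidden layer: h1 = OR(x1, x2), h2 = NAND(x1, x2).
# Output layer: y = AND(h1, h2) = XOR(x1, x2).

def step(v):
    return (v >= 0).astype(float)

def mlp_xor(x1, x2):
    x = np.array([x1, x2, 1.0])              # input plus bias component
    W_hidden = np.array([[1.0, 1.0, -0.5],   # h1 fires iff x1 + x2 >= 0.5
                         [-1.0, -1.0, 1.5]]) # h2 fires iff x1 + x2 <= 1.5
    h = step(W_hidden @ x)
    w_out = np.array([1.0, 1.0, -1.5])       # fires iff h1 + h2 >= 1.5
    return float(step(w_out @ np.append(h, 1.0)))

print([mlp_xor(a, b) for a in (0, 1) for b in (0, 1)])  # [0.0, 1.0, 1.0, 0.0]
```

The hidden layer carves the plane into regions with two lines; the output layer then labels the regions, which is exactly the partition-and-encode picture developed in the next slides.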
111. Parity Problem
Is the problem linearly separable?
112. Parity Problem
(Figure: the input cube over (x1, x2, x3) cut by planes P1, P2, P3.)
113-115. Parity Problem
(Figure: the corner points 000, 001, 011, 111 and the plane P4.)
116. Parity Problem
(Figure: the resulting partition with P4.)
117. General Problem
118. General Problem
119. Hyperspace Partition
120. Region Encoding
(Figure: regions labeled 000, 001, 010, 100, 101, 110, 111.)
121. Hyperspace Partition: Region Encoding Layer
122-128. Region Identification Layer
(Animation frames.)
129. Classification
(Figure: output labels 1, 0, -1.)
130. Feed-Forward Neural Networks
- Back-Propagation Learning Algorithm
131. Activation Function: Sigmoid
Remember this.
132. Supervised Learning
Training Set
Output Layer
Hidden Layer
Input Layer
133. Supervised Learning
Training Set
Sum of Squared Errors
Goal: Minimize
134. Back-Propagation Learning Algorithm
- Learning on output neurons
- Learning on hidden neurons
135. Learning on Output Neurons
136. Learning on Output Neurons
Depends on the activation function.
137. Learning on Output Neurons
Using the sigmoid,
138. Learning on Output Neurons
Using the sigmoid,
139. Learning on Output Neurons
140. Learning on Output Neurons
How do we train the weights connecting to the output neurons?
141. Learning on Hidden Neurons
142. Learning on Hidden Neurons
143. Learning on Hidden Neurons
144. Learning on Hidden Neurons
145. Learning on Hidden Neurons
146. Back Propagation
147. Back Propagation
148. Back Propagation
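The two learning steps can be combined into one training loop. This is a minimal sketch of back-propagation for a 2-2-1 network of unipolar sigmoid units trained incrementally on XOR; the layer sizes, learning rate, epoch count, and initialization range are illustrative choices, not the slides' settings:

```python
import numpy as np

# Sketch of back-propagation: output deltas use (d - y) * y * (1 - y);
# hidden deltas back-propagate the output delta through the output weights.

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def sse(W1, W2, X, D):
    """Sum of squared errors over the training set."""
    total = 0.0
    for x, d in zip(X, D):
        h = sigmoid(W1 @ np.append(x, 1.0))
        y = sigmoid(W2 @ np.append(h, 1.0))[0]
        total += (d - y) ** 2
    return total

rng = np.random.default_rng(1)
W1 = rng.uniform(-0.5, 0.5, size=(2, 3))   # hidden layer (incl. bias column)
W2 = rng.uniform(-0.5, 0.5, size=(1, 3))   # output layer (incl. bias column)
eta = 0.5

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([0.0, 1.0, 1.0, 0.0])

err_before = sse(W1, W2, X, D)
for _ in range(10000):
    for x, d in zip(X, D):
        x1 = np.append(x, 1.0)             # input plus bias
        h = sigmoid(W1 @ x1)
        h1 = np.append(h, 1.0)             # hidden outputs plus bias
        y = sigmoid(W2 @ h1)[0]
        delta_o = (d - y) * y * (1.0 - y)                  # output delta
        delta_h = h * (1.0 - h) * W2[0, :2] * delta_o      # hidden deltas
        W2 += eta * delta_o * h1           # learning on output neurons
        W1 += eta * np.outer(delta_h, x1)  # learning on hidden neurons
err_after = sse(W1, W2, X, D)
```

Note that with so few hidden units the run can settle in a local minimum for some initializations, which is exactly why the next slide lists initial weights among the learning factors; the error nevertheless decreases from its starting value.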
149. Learning Factors
- Initial weights
- Learning constant (η)
- Cost functions
- Momentum
- Update rules
- Training data and generalization
- Number of layers
- Number of hidden nodes
150. Reading Assignments
- Shi Zhong and Vladimir Cherkassky, "Factors Controlling Generalization Ability of MLP Networks." In Proc. IEEE Int. Joint Conf. on Neural Networks, vol. 1, pp. 625-630, Washington DC, July 1999. (http://www.cse.fau.edu/zhong/pubs.htm)
- Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). "Learning Internal Representations by Error Propagation," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. I, D. E. Rumelhart, J. L. McClelland, and the PDP Research Group. MIT Press, Cambridge, 1986. (http://www.cnbc.cmu.edu/plaut/85-419/papers/RumelhartETAL86.backprop.pdf)