Multiple Layer Perceptron

1
Multiple Layer Perceptron
  • February 2004
  • KAIST

2
Limitations of Single Layer Perceptron
  • The nonlinearity used in the perceptron (sign
    function) is not differentiable → cannot be
    applied to multilayer networks
  • Solves only linearly separable cases
  • Only simple problems can be solved
  • Not all Boolean logic functions can be
    implemented by a single perceptron
  • AND, OR, NAND, NOR are OK, but not XOR
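The XOR limitation can be checked directly. A minimal sketch of the perceptron rule (the +/-1 encoding, learning rate, and epoch budget are illustrative assumptions): AND, being linearly separable, converges; XOR never does.

```python
import numpy as np

def train_perceptron(X, y, epochs=100, eta=0.1):
    w = np.zeros(X.shape[1] + 1)              # bias in w[0], weights in w[1:]
    for _ in range(epochs):
        errors = 0
        for xi, target in zip(X, y):
            pred = 1 if w[0] + xi @ w[1:] >= 0 else -1   # sign activation
            if pred != target:                # perceptron error-correction rule
                w[1:] += eta * target * xi
                w[0] += eta * target
                errors += 1
        if errors == 0:                       # an epoch with no mistakes: converged
            return w, True
    return w, False

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
_, and_ok = train_perceptron(X, np.array([-1, -1, -1, 1]))   # AND in +/-1 coding
_, xor_ok = train_perceptron(X, np.array([-1, 1, 1, -1]))    # XOR in +/-1 coding
print(and_ok, xor_ok)   # True False
```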

3
Multi-Layer Perceptron (MLP)
  • Feed-forward network with one or more hidden
    layers
  • The network consists of
  • an input layer of source neurons
  • hidden layer(s) of computational neurons
  • an output layer of computational neurons
  • Input signals propagate toward the output nodes
  • Can be used for arbitrarily complex function
    mapping
  • all functions of Boolean logic, combinations of
    linear functions
  • Differentiable non-linear activation function with
    a relatively simple training algorithm: the error
    back-propagation algorithm

4
Expressive power of MLP
  • Can every decision function be implemented by a
    three-layer network?
  • Yes. Any continuous function from input to output
    can be implemented with a sufficient number of
    hidden neurons
  • Kolmogorov's theorem, Fourier's theorem

5
Expressive power of MLP
6
Expressive power of MLP
7
MLP Distinctive Characteristics
  • Non-linear activation function
  • differentiable
  • mostly sigmoidal functions
  • nonlinearity prevents reduction to a single-layer
    perceptron
  • One or more layers of hidden neurons
  • progressively extract more meaningful features
    from input patterns
  • High degree of connectivity
  • Nonlinearity and the high degree of connectivity
    make theoretical analysis difficult
  • Learning process is hard to visualize
  • BP is a landmark in NN: computationally efficient
    training

8
Error back-propagation algorithm
  • Supervised, error-correction learning algorithm
    based on the delta rule
  • Two computations in training
  • Forward pass
  • computation of function signals
  • input vector is applied to the input nodes
  • its effect propagates through the network
    layer by layer
  • with fixed synaptic weights
  • Backward pass
  • synaptic weights are adjusted to reduce the error
    signal
  • computation of an estimate of the gradient vector
  • gradient of the error surface with respect to the
    weights
  • error signal propagates backward, in a
    layer-by-layer fashion

9
Notation: Three-layer back-propagation neural
network
(Figure: input signals x1 … xi … xn enter the input
layer of n nodes; weights wji connect input node i
to hidden node j in the hidden layer of m nodes;
weights wkj connect hidden node j to output node k;
outputs z1 … zk … zl leave the output layer of l
nodes. Function signals flow from input to output;
error signals propagate backward.)
10
Back-Propagation Algorithm
  • Error signal for neuron j at iteration n
  • Total error energy
  • C is the set of output nodes
  • Average squared error energy
  • average over all training samples
  • cost function as a measure of learning
    performance
  • Objective of the learning process
  • adjust NN parameters (synaptic weights) to
    minimize Eav
  • Weights are updated on a pattern-by-pattern basis
    until one epoch is complete
  • a complete presentation of the entire training set
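The quantities named above can be written out (a reconstruction in standard back-propagation notation, since the slide's equations did not survive; d_j is the desired response, y_j the output of neuron j, N the number of training samples):

```latex
e_j(n)  = d_j(n) - y_j(n)                        % error signal for neuron j
E(n)    = \tfrac{1}{2} \sum_{j \in C} e_j^2(n)   % total error energy
E_{av}  = \frac{1}{N} \sum_{n=1}^{N} E(n)        % average squared error energy
```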

11
Notation
12
Back Propagation Algorithm
  • Gradient Descent
  • For notational simplicity, we will drop the time
    index n

13
BPA update rule for j→k (output node)
  • Gradient
  • determines the direction of search in weight space
  • Sensitivity
  • describes how the overall error changes with the
    unit's net activation

14
BPA update rule for i→j (hidden node)
  • Sensitivity
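The update rules of slides 13 and 14 can be reconstructed in standard notation consistent with the deck's labels (w_ji for input-to-hidden, w_kj for hidden-to-output; f is the activation, η the learning rate, y_j the hidden outputs, z_k the network outputs):

```latex
% gradient descent:  \Delta w = -\eta \, \partial E / \partial w
% output node k (sensitivity \delta_k, presynaptic signal y_j):
\delta_k = (d_k - z_k)\, f'(net_k), \qquad \Delta w_{kj} = \eta \, \delta_k \, y_j
% hidden node j (sensitivities fed back from the output layer):
\delta_j = f'(net_j) \sum_k \delta_k \, w_{kj}, \qquad \Delta w_{ji} = \eta \, \delta_j \, x_i
```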

15
BP Summary
  • forward pass
  • backward pass
  • recursively compute the local gradient δ
  • from the output layer toward the input layer
  • synaptic weight change by the delta rule
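The two passes summarized above can be sketched end-to-end in NumPy (the XOR task, network shape, learning rate, and epoch count are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([[0], [1], [1], [0]], dtype=float)     # XOR targets

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

W1 = rng.uniform(-1, 1, (2, 8)); b1 = np.zeros(8)   # input -> hidden (w_ji)
W2 = rng.uniform(-1, 1, (8, 1)); b2 = np.zeros(1)   # hidden -> output (w_kj)
eta = 0.5

for _ in range(30000):
    # forward pass: weights fixed, signals propagate layer by layer
    y = sigmoid(X @ W1 + b1)                 # hidden-layer outputs
    z = sigmoid(y @ W2 + b2)                 # network outputs
    # backward pass: local gradients (deltas), output layer toward input layer
    delta_k = (d - z) * z * (1 - z)          # output-node sensitivity
    delta_j = (delta_k @ W2.T) * y * (1 - y) # hidden-node sensitivity
    # delta rule: weight change = eta * delta * presynaptic signal
    W2 += eta * (y.T @ delta_k); b2 += eta * delta_k.sum(axis=0)
    W1 += eta * (X.T @ delta_j); b1 += eta * delta_j.sum(axis=0)

mse = float(np.mean((d - z) ** 2))
print(mse)   # small after training if the run converged
```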

16
With Activation Functions
  • Sigmoid function
  • Hyperbolic tangent function
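The two activation functions named above, with the derivatives the backward pass needs (a minimal sketch; expressing each derivative in terms of the function's own output is the standard trick):

```python
import numpy as np

def logistic(a):
    """Logistic sigmoid: f(a) = 1 / (1 + exp(-a)), range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def logistic_deriv(f):
    """f'(a) = f(a) * (1 - f(a)), given the already-computed output f."""
    return f * (1.0 - f)

def tanh_deriv(f):
    """For f(a) = tanh(a): f'(a) = 1 - f(a)^2, given the output f."""
    return 1.0 - f * f

a = 0.0
print(logistic(a), logistic_deriv(logistic(a)))   # 0.5 0.25
print(np.tanh(a), tanh_deriv(np.tanh(a)))         # 0.0 1.0
```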

17
Output as Probabilities
  • Modeling posteriors
  • 0–1 target values
  • With infinite training data, outputs will
    approximate posterior probabilities
  • Sum of outputs should be 1
  • Exponential activation function
  • Normalize outputs to sum to 1.0
  • SOFTMAX: a soft winner-takes-all
  • the max value is transformed toward 1, the others
    toward 0
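A minimal sketch of the exponentiate-then-normalize step described above (subtracting the max first is a standard numerical-stability addition, not from the slide):

```python
import numpy as np

def softmax(net):
    e = np.exp(net - np.max(net))   # shift for numerical stability
    return e / e.sum()              # normalize outputs to sum to 1.0

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p.sum())      # probabilities sum to 1
print(p.argmax())   # largest net activation wins, in the soft sense
```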

18
Feature Detection
  • Hidden neurons act as feature detectors
  • As learning progresses, hidden neurons gradually
    discover salient features that characterize the
    training data
  • Nonlinear transformation of input data to feature
    space
  • Close resemblance to Fisher's linear discriminant

19
Approximation of Functions
  • Non-linear input-output mapping
  • M0-dimensional input space to ML-dimensional
    output space
  • What is the minimum number of hidden layers in an
    MLP that can approximate any continuous mapping?
  • Universal Approximation Theorem
  • existence of an approximation of an arbitrary
    continuous function
  • a single hidden layer is sufficient for an MLP to
    compute a uniform ε approximation to a given
    training set
  • does not say a single layer is optimum in the
    sense of training time, ease of implementation, or
    generalization
  • Bound on approximation errors of single-hidden-layer
    NNs
  • the larger the number of hidden nodes, the more
    accurate the approximation
  • the smaller the number of hidden nodes, the more
    accurate the empirical fit

20
Training Set Size for Generalization
  • Generalization
  • input-output mapping is correct for data never
    seen before
  • Overfitting / overtraining
  • memorizes the training data, not the essence of
    the training data
  • learns idiosyncrasies and noise
  • Occam's Razor
  • find the simplest function among those which
    satisfy the given conditions
  • Generalization is influenced by
  • size of the training set
  • architecture of the neural network
  • Given the architecture, determine the size of the
    training set for good generalization
  • Given the set of training samples, determine the
    best architecture for good generalization
  • VC dimension provides the theoretical basis

21
Cross-Validation
  • Validate the learned model on a different set to
    assess generalization performance
  • guards against overfitting
  • Partition the training set into
  • an estimation subset
  • a validation subset
  • Cross-validation for
  • best model selection
  • determining when to stop training

22
Model selection
Practical Techniques in Improving BP
  • Choosing the MLP with the best number of weights
    given N training samples
  • Issue is to choose r
  • to minimize the classification error of the model
    trained on the estimation set when it is tested
    with the validation set
  • Kearns (1996): qualitative properties of the
    optimum r
  • analysis with VC dimension
  • for problems of small complexity (desired response
    small compared to N), the performance of
    cross-validation is insensitive to r
  • a single fixed r is nearly optimal for a wide
    range of target functions
  • suggests r = 0.2
  • 80% of the training set is the estimation set
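The r = 0.2 split above can be sketched directly (function and variable names are assumptions for illustration):

```python
import numpy as np

def split_estimation_validation(X, y, r=0.2, seed=0):
    """Hold out a fraction r of the N samples for validation, keep the rest
    for estimation, after a random shuffle."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(r * len(X))
    val, est = idx[:n_val], idx[n_val:]
    return X[est], y[est], X[val], y[val]

X = np.arange(100).reshape(50, 2)
y = np.arange(50)
X_est, y_est, X_val, y_val = split_estimation_validation(X, y)
print(len(X_est), len(X_val))   # 40 10
```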

23
Training Protocol
  • Stochastic training
  • select samples randomly
  • Batch training
  • epoch: a single presentation of all training
    samples
  • weights are updated once per epoch
  • robust to outliers
  • Sequential mode
  • synaptic weights are updated after each training
    sample
  • requires less storage
  • converges much faster, particularly when the
    training data is redundant
  • risky - less controllable
  • random order makes trapping at a local minimum
    less likely
  • Online training when
  • training data is abundant
  • memory cost is high, storing is impossible
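The batch and sequential protocols above can be contrasted on a generic gradient (linear least squares here, purely illustrative; both converge to the same weight on this toy problem):

```python
import numpy as np

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])     # exactly fit by w = 2
eta = 0.05

def grad(w, xi, yi):
    return (xi @ w - yi) * xi     # gradient of 0.5 * (xi.w - yi)^2

w_batch = np.zeros(1)
for _ in range(100):              # batch: accumulate over the epoch, update once
    g = sum(grad(w_batch, xi, yi) for xi, yi in zip(X, y))
    w_batch -= eta * g

w_seq = np.zeros(1)
rng = np.random.default_rng(0)
for _ in range(100):              # sequential: update after every sample,
    for i in rng.permutation(len(X)):   # presented in random order
        w_seq -= eta * grad(w_seq, X[i], y[i])

print(float(w_batch[0]), float(w_seq[0]))   # both approach 2.0
```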

24
Practical Techniques in Improving BP
  • Selection of activation function
  • Parameters for the sigmoid
  • Scaling input
  • Target values
  • Training with noise
  • Manufacturing data
  • Number of hidden units
  • Number of hidden layers
  • Initializing weights
  • Learning rates
  • Momentum
  • Weight decay
  • Learning with hints
  • Stopping training
  • Other criterion function
  • Speeding up the learning

25
Selection of Activation function
Practical Techniques in Improving BP
  • If there are good reasons to select a particular
    activation function, then do it
  • mixture of Gaussians → Gaussian activation
    function
  • Properties of an activation function
  • non-linear
  • saturates at some max and min values
  • continuity and smoothness
  • monotonicity: nonessential
  • The sigmoid function has all the good properties
  • Distributed representation vs. local
    representation
  • whether an input yields activity spread across
    several hidden units or not

26
Parameters of Sigmoid
Practical Techniques in Improving BP
  • Centered at zero
  • Anti-symmetric
  • f(-net) = -f(net)
  • faster learning
  • Overall range and slope are not important
  • Avoid letting f'(.) become zero
  • network paralysis

27
Scaling Input / Target value
Practical Techniques in Improving BP
  • Standardize
  • with large scale differences, the error depends
    mostly on the large-scale features
  • shift to zero mean, unit variance
  • needs the full data set
  • Target values
  • if targets sit at the saturated values, the output
    never reaches them during training
  • full training never terminates
  • (+1 for the target category, -1 for non-target
    categories) is suggested
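The preprocessing above can be sketched in a few lines (names and the toy data are assumptions): standardize each input feature, and encode class targets as +1 / -1.

```python
import numpy as np

def standardize(X):
    """Shift each feature to zero mean and unit variance (needs full data set)."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

X = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
Z = standardize(X)
print(Z.mean(axis=0), Z.std(axis=0))   # ~[0 0] and [1 1]

labels = np.array([0, 2, 1])           # class indices for 3 samples, 3 classes
targets = np.where(np.arange(3)[None, :] == labels[:, None], 1.0, -1.0)
print(targets[0])                      # +1 for the target category, -1 elsewhere
```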

28
Training with Noise / Manufacturing Data
Practical Techniques in Improving BP
  • Training with noise
  • generate virtual or surrogate training patterns
  • e.g., add d-dimensional Gaussian random noise
  • variance of the added noise < 1 (e.g., 0.1)
  • Manufacturing data
  • if we know the source of variation, we can
    manufacture data
  • e.g., rotation for OCR, image processing to
    simulate bold-face characters
  • memory requirement is large

29
Number of hidden units
Practical Techniques in Improving BP
  • The number of hidden units governs the expressive
    power of the net and the complexity of the
    decision boundary
  • well-separated classes → fewer hidden nodes
  • complicated, highly interspersed densities → many
    hidden nodes
  • Heuristic rules of thumb
  • more training data yields better results
  • (# weights) < (# training data)
  • (# weights) ≈ (# training data) / 10
  • adjust (# weights) in response to the training
    data
  • e.g., start with a large number of hidden nodes,
    then decay or prune weights

30
Number of Hidden Layers
Practical Techniques in Improving BP
  • Three, four, or more layers are OK with
    differentiable activation functions
  • but three layers are sufficient
  • more layers → more chances of local minima
  • Single hidden layer vs. double (multiple) hidden
    layers
  • a single-HL NN is good for approximating any
    continuous function
  • a double-HL NN may be better in some cases
  • Double (multiple) hidden layers
  • first hidden layer - local feature detection
  • second hidden layer - global feature detection
  • Problem-specific reasons for more layers
  • each layer learns different aspects
  • e.g., the neocognitron case: translation,
    rotation, ...

31
Initializing Weights
Practical Techniques in Improving BP
  • Do not set weights to zero: no learning takes
    place
  • Selection of a good seed for fast and uniform
    learning
  • reach final equilibrium values at about the same
    time
  • For standardized data
  • choose randomly from a single distribution
  • give positive and negative values equally:
    -ŵ < w < +ŵ
  • if ŵ is too small, net activation is small:
    linear model
  • if ŵ is too large, hidden units will saturate
    before learning begins
  • For a d-input unit network
  • input weights
  • hidden-to-output weights
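A sketch of the rule above for standardized inputs. The slide ties the range to the number of inputs d; a common concrete choice (an assumption here, in the spirit of the slide) is a uniform draw from ±1/√(fan-in), so positive and negative values are equally likely and net activations start neither tiny nor saturating:

```python
import numpy as np

def init_weights(fan_in, fan_out, seed=0):
    """Uniform draw in (-1/sqrt(fan_in), +1/sqrt(fan_in)) per weight."""
    limit = 1.0 / np.sqrt(fan_in)
    rng = np.random.default_rng(seed)
    return rng.uniform(-limit, limit, (fan_in, fan_out))

W = init_weights(fan_in=100, fan_out=10)
print(W.shape, float(np.abs(W).max()))   # all weights within +/- 0.1
```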

32
Momentum term
Practical Techniques in Improving BP
  • Benefit of preventing the learning process from
    terminating in a shallow local minimum
  • where α is the momentum constant
  • converges if 0 ≤ |α| < 1; typical value 0.9
  • if the partial derivative has the same sign on
    consecutive iterations, the momentum term grows in
    magnitude - accelerated descent
  • if it has the opposite sign, the momentum term
    shrinks - stabilizing effect
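The update with the momentum term, reconstructed in standard notation since the slide's equation did not survive (α is the momentum constant, η the learning rate):

```latex
\Delta w_{ji}(n) = \alpha \, \Delta w_{ji}(n-1) + \eta \, \delta_j(n) \, y_i(n),
\qquad 0 \le |\alpha| < 1
```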

33
Learning Rate η
Practical Techniques in Improving BP
  • A smaller learning-rate parameter makes a
    smoother path
  • increase the rate of learning while avoiding the
    danger of instability
  • first choice: η ≈ 0.1
  • η of the last layer should be assigned a smaller
    value
  • the last layer has a large local gradient (by the
    limiting effect) and learns fast
  • LeCun's suggestion: the learning rate should be
    inversely proportional to the square root of the
    number of synaptic connections (η ∝ m^(-1/2))
  • may change during training

34
Heuristics of Acceleration with learning rate
parameter
  • Every adjustable weight should have its own
    learning-rate parameter
  • Learning-rate parameters should be allowed to
    vary from iteration to iteration
  • If the sign of the derivative is the same for
    several iterations, the learning-rate parameter
    should be increased
  • apply the momentum idea even to the learning-rate
    parameters
  • If the sign of the derivative alternates for
    several iterations, the learning-rate parameter
    should be decreased
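The sign heuristic above can be sketched as a delta-bar-delta-style rule (the increase/decrease factors and function name are illustrative assumptions): each weight keeps its own rate, grown when consecutive gradients agree in sign and shrunk when they alternate.

```python
import numpy as np

def adapt_rates(eta, grad, prev_grad, up=1.1, down=0.5):
    """Per-weight learning rates: grow on same sign, shrink on sign flip."""
    prod = grad * prev_grad
    return np.where(prod > 0, eta * up,
                    np.where(prod < 0, eta * down, eta))

eta = np.full(3, 0.1)
eta = adapt_rates(eta, np.array([1.0, -2.0, 0.0]), np.array([0.5, 3.0, 1.0]))
print(eta)   # grown, shrunk, unchanged
```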

35
Weight Decay
Practical Techniques in Improving BP
  • Heuristic: keep the weights small
  • in order to simplify the network and avoid
    overfitting
  • Start with many weights and decay them during
    training - simple!
  • Small weights are eliminated
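One standard way to realize the heuristic above (a reconstruction; the slide gives no formula) is to add a quadratic weight penalty to the cost, so every update shrinks each weight toward zero:

```latex
E(\mathbf{w}) = E_0(\mathbf{w}) + \frac{\lambda}{2} \sum_i w_i^2
\qquad\Longrightarrow\qquad
\Delta w_i = -\eta \frac{\partial E_0}{\partial w_i} - \eta \lambda \, w_i
```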

36
Weight Sharing (tying)
  • A set of cells in one layer uses the same
    incoming weights
  • leads to all cells detecting the same feature,
    though at different positions in the image
    (receptive fields)
  • Reduces the number of parameters
  • better generalization
  • Effect of convolution with a kernel defined by
    the weights

37
Network Pruning
Practical Techniques in Improving BP
  • Minimizing the network improves generalization
  • less likely to learn idiosyncrasies or noise
  • Network pruning
  • eliminate synaptic weights with small magnitude
  • Complexity regularization
  • tradeoff between reliability of the training data
    and goodness of the model
  • supervised learning by minimizing the risk
    function R(w) = Es(w) + λ Ec(w), where Es is the
    standard error measure and Ec is the complexity
    penalty

38
Wald Statistics
Practical Techniques in Improving BP
  • Estimate the importance of each parameter in a
    model, then eliminate parameters based on the
    estimates
  • Hessian-based network pruning
  • Optimal Brain Surgeon
  • Optimal Brain Damage
  • identify parameters whose deletion will cause the
    least increase in Eav
  • by Taylor series expansion

39
Optimal Brain Surgeon
Practical Techniques in Improving BP
  • Solve the optimization problem
  • Saliency of wi
  • represents the increase in the mean-squared error
    caused by deleting wi
  • OBS procedure
  • the weight with the smallest saliency is deleted
  • requires computation of the inverse of the Hessian
  • updating rule after pruning
  • Optimal Brain Damage
  • OBS with the assumption that the Hessian matrix is
    diagonal
  • computationally simple
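The saliency named above has a standard closed form (Hassibi & Stork's OBS, matching the slide's terms; H is the Hessian of the error with respect to the weights):

```latex
S_i = \frac{w_i^2}{2\,[\mathbf{H}^{-1}]_{ii}}
% OBD assumes a diagonal Hessian, giving S_i = \tfrac{1}{2} h_{ii} \, w_i^2
```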

40
Hints
Practical Techniques in Improving BP
  • Add output units addressing an ancillary problem
  • a different but related problem
  • trained on the original classification problem and
    the ancillary one simultaneously
  • after training, the hint units are discarded
  • Benefits
  • feature selection
  • improved hidden-unit representation

41
Stopping Criteria
Practical Techniques in Improving BP
  • No well-defined stopping criteria
  • Terminate when the gradient vector g(w) ≈ 0
  • located at a local or global minimum
  • Terminate when the error measure is stationary
  • Terminate if the NN's generalization performance
    is adequate
  • Excessive training leads to poor generalization
  • training progresses from small initial weights
  • beginning: linearity
  • as training progresses: non-linearity is picked up
  • therefore, early termination of training behaves
    like weight decay

42
Stopping w/ Separate validation set
Practical Techniques in Improving BP
  • Early-stopping method
  • after some training, compute the validation error
    with the synaptic weights fixed
  • resume training after computing the validation
    error
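The train/validate loop above can be sketched with a patience rule (a common variant; `train_some_epochs`, `validation_error`, and the toy error curve are assumed placeholders, not from the slides):

```python
def early_stopping(train_some_epochs, validation_error, patience=3, max_rounds=100):
    """Alternate training periods with fixed-weight validation checks; stop
    after `patience` checks without improvement."""
    best_err, best_round, waited = float("inf"), 0, 0
    for r in range(1, max_rounds + 1):
        train_some_epochs()            # resume training between checks
        err = validation_error()       # weights held fixed here
        if err < best_err:
            best_err, best_round, waited = err, r, 0
        else:
            waited += 1
            if waited >= patience:     # validation error stopped improving
                break
    return best_round, best_err

# toy validation curve that improves, then degrades (illustrative only)
errs = iter([0.9, 0.5, 0.3, 0.35, 0.4, 0.45, 0.5])
best_round, best_err = early_stopping(lambda: None, lambda: next(errs))
print(best_round, best_err)   # 3 0.3
```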

43
Stopping method
Practical Techniques in Improving BP
  • Amari (1996)
  • for N < W
  • early stopping improves generalization
  • for N < 30W
  • overfitting occurs
  • example: W = 100, r = 0.07
  • 93% for estimation, 7% for validation
  • for N > 30W
  • the improvement from early stopping is small
  • Leave-one-out method for large W
44
NeoCognitron
45
Speeding up the learning
  • Use 2nd-order analysis
  • Hessian matrix
  • Newton's method
  • Quick-prop
  • Conjugate gradient descent