Title: Training Data
1. Concept Map: Practical Design Issues
- Learning algorithm
- Training data
- Topology
- Initial weights
- Fast learning
- Network size
- Generalization: noise, weight sharing, small net size, increased training data, cross-validation, early stopping
- Occam's Razor: network growing, network pruning, brain damage, weight decay
2. Concept Map: Fast Learning
- BP variants: cost function, activation function
- Training data: normalize, scale, present at random; no weight learning for correctly classified patterns
- Adaptive slope
- Momentum
- Architecture: modular, committee
- Other minimization methods: Fahlman's, conjugate gradient
3. Chapter 4. Designing and Training MLPs: Practical Issues
- Performance = f(training data, topology, initial weights, learning algorithm, ...)
- Training error, net size, generalization.
- How do we prepare training data and test data?
  - The training set must contain enough information to learn the task.
  - Eliminate redundancy, e.g. by data clustering.
  - Training-set size: N > W / ε (N: number of training samples, W: number of weights, ε: classification error permitted on the test data, i.e. the generalization error).
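The N > W / ε rule of thumb can be wrapped in a one-line helper; the function name and the use of exact values in the example are my own illustrative choices:

```python
import math

def min_training_samples(num_weights, eps):
    """Smallest integer N satisfying the rule of thumb N > W / eps,
    where W is the number of weights and eps the permitted test error."""
    return math.floor(num_weights / eps) + 1
```

For example, a net with 100 weights and 25% permitted classification error would need more than 400 training samples under this rule.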
4. Example: Modes of Preparing Training Data for Robot Control
The importance of the training data for tracking performance cannot be overemphasized. Three modes of training-data selection are considered here. In the regular mode, the training data are obtained by tessellating the robot's workspace and taking the grid points, as shown on the next page. For better generalization, a sufficient amount of random training data may instead be obtained by observing the light positions in response to uniformly random Cartesian commands to the robot; this is the random mode. The best generalization is achieved by the semi-random mode, which evenly tessellates the workspace into many cubes and chooses a randomly selected training point within each cube. This mode is essentially a blend of the regular and random modes.
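A minimal sketch of the semi-random mode, assuming a box-shaped workspace; the bounds and grid resolution are illustrative, not taken from the original experiment:

```python
import random

def semi_random_samples(bounds, divisions):
    """Tessellate a box-shaped 3-D workspace into divisions**3 cubes and
    draw one uniformly random training point inside each cube."""
    (xlo, xhi), (ylo, yhi), (zlo, zhi) = bounds
    dx = (xhi - xlo) / divisions
    dy = (yhi - ylo) / divisions
    dz = (zhi - zlo) / divisions
    pts = []
    for i in range(divisions):
        for j in range(divisions):
            for k in range(divisions):
                # one random point within cube (i, j, k)
                pts.append((xlo + (i + random.random()) * dx,
                            ylo + (j + random.random()) * dy,
                            zlo + (k + random.random()) * dz))
    return pts
```

Setting divisions = 1 recovers the pure random mode for a single cube, while replacing `random.random()` with 0.5 recovers the regular (grid) mode.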
5. Training Data Acquisition Modes
- Regular mode
- Random mode
- Semi-random mode
6. Fig. 10. Comparison of training errors and generalization errors for the random and semi-random training methods.
7. Optimal Implementation
A. Network Size
- Occam's Razor: any learning machine should be sufficiently large to solve a given problem, but not larger.
- A scientific model should favor simplicity, i.e. shave off the fat in the model.
- Occam: a 14th-century British monk.
8. a. Network Growing: start with a few hidden nodes and add more.
(Ref. Kim, "Modified Error BP Adding Neurons to Hidden Layer," J. of KIEE 92/4)
If E > ε1 and ΔE < ε2, add a hidden node. Use the current weights for the existing weights and small random values for the newly added weights as initial weights for the next round of learning.
b. Network Pruning
- Remove unimportant connections; after this "brain damage," retrain the network.
- Improves generalization.
- Apply weight decay after each epoch.
c. Size Reduction by Dimensionality Reduction or Sparse Connectivity in the Input Layer, e.g. use 4 random connections instead of 8.
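Weight decay and magnitude-based pruning can be sketched in a few lines; the decay constant and pruning threshold below are illustrative values, not ones from the reference:

```python
def decay_weights(weights, decay=1e-4):
    """Weight decay applied after each epoch: w <- (1 - decay) * w,
    which gradually pushes unimportant weights toward zero."""
    return [(1.0 - decay) * w for w in weights]

def prune_weights(weights, threshold=1e-2):
    """Network pruning ('brain damage'): zero out connections whose
    magnitude has decayed below a small threshold, then retrain."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]
```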
10. B. Generalization: train (memorize), then apply to an actual problem (generalize).
(Figure: a good fit and a poor fit through the same training points (X), judged on test points (O); overfitting arises from too many weights relative to the training samples, plus noise.)
(Venn diagram: U = universe, T = training data, X = test data, R = NN with good generalization, R' = NN with poor generalization.)
11. For good generalization, train on a learning subset and check on a validation subset (about 10% of the data) every 5-10 iterations; determine the best structure based on the validation subset, then train further with the full training set and evaluate on the test set.
- The statistics of the training (validation) data must be similar to those of the test (actual problem) data.
- There is a tradeoff between training error and generalization!
- Stopping criterion: for classification, stop upon no error; for function approximation, check the validation error.
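The train-then-validate cycle can be sketched as an early-stopping loop; `train_step`, `val_error`, and the patience counter are illustrative names of my own, not from the slides:

```python
def train_with_early_stopping(train_step, val_error, max_epochs=1000,
                              check_every=5, patience=3):
    """Train on the learning subset, but every `check_every` epochs measure
    the error on the validation subset; stop once the validation error has
    not improved for `patience` consecutive checks, and report the best."""
    best, stale = float("inf"), 0
    for epoch in range(max_epochs):
        train_step(epoch)                 # one pass over the learning subset
        if epoch % check_every == 0:
            err = val_error()             # error on the held-out validation set
            if err < best:
                best, stale = err, 0
            else:
                stale += 1
                if stale >= patience:
                    break                 # validation error stopped improving
    return best
```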
12. An example showing how to prepare the various data sets to learn an unknown function from data samples.
13. Other measures to improve generalization:
- Add noise (1-5%) to the training data or weights.
- Hard (soft) weight sharing (using equal values for groups of weights) can improve generalization.
- For fixed training data, the smaller the net, the better the generalization.
- Increase the training set to improve generalization.
- For insufficient training data, use the leave-one(-some)-out method: select an example, train the net without it, and evaluate with this unused example.
- If the net still does not generalize well, retrain with the new problem data.
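Adding 1-5% input noise can be sketched as below; this assumes inputs already scaled to about unit range, and the 3% level is just one value in the suggested band:

```python
import random

def jitter(sample, level=0.03, rng=random):
    """Return a copy of a training vector with zero-mean Gaussian noise
    (standard deviation = level) added to every component."""
    return [x + rng.gauss(0.0, level) for x in sample]
```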
C. Speeding Up (Accelerating Convergence)
- Ref. the book by Hertz; AI Expert Magazine 91/7.
- To speed up the calculation itself: reduce floating-point operations by using fixed-point arithmetic, and use a piecewise-linear approximation of the sigmoid.
14. Students' Questions from 2005
- What will happen if more than 5-10% validation data are used?
- Consider two industrial assembly robots for precision jobs, made by the same company to an identical spec. If the same NN is used for both, the robots will act differently. Do we need better generalization methods to compensate for this difference?
- Large N may increase noisy data. However, wouldn't large N offset the problem by yielding more reliability?
- How big an influence would noise have upon misguided learning?
- I wonder what measures can prevent local-minimum traps.
15. - Is there any mathematical validation for the existence of a stopping point in validation samples?
- The number of hidden nodes is adjusted by a human. An NN is supposed to self-learn, so there should be a way to automatically adjust the number of hidden nodes.
17. Normalize inputs, scale outputs: make them zero mean, decorrelate (PCA), and equalize covariances.
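Zero-mean, unit-variance normalization can be sketched in a few lines; full PCA decorrelation and covariance equalization are omitted from this sketch:

```python
def normalize_columns(data):
    """Shift each input column to zero mean and scale it to unit variance.
    `data` is a list of rows (samples); a list of normalized rows is returned."""
    cols = list(zip(*data))
    out_cols = []
    for col in cols:
        n = len(col)
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / n
        std = var ** 0.5 or 1.0          # guard against constant columns
        out_cols.append([(x - mean) / std for x in col])
    return [list(row) for row in zip(*out_cols)]
```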
18. - Start with small uniform random initial weights for tanh.
- Present training patterns in random (shuffled) order, or mix the different classes.
- Alternative cost or activation functions (the example equations and target values are not recoverable from this transcript).
19. - Chen and Mars: differential step size. Cf. Principe's book recommends a particular value (not recoverable here); it is best to try different values.
- "Accelerating BP Algorithm through Omitting Redundant Learning," J. of KIEE 92/9: if the pattern error Ep < θ, do not update the weights on the p-th training pattern (no BP for that pattern).
20. - Ahalt: modular net.
- Ahalt: adapt the slope (sharpness) parameters, i.e. vary the slope of the sigmoid during learning.
21. Jacobs: Learning-Rate Adaptation
Ref. Neural Networks, Vol. 1, No. 4, 1988.
a. Momentum: on a plateau, with momentum constant α and learning rate η, the weight step settles so that η / (1 - α) is the effective learning rate.
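With momentum α, a constant-gradient plateau gives an effective learning rate of η / (1 - α); a one-weight sketch:

```python
def momentum_step(w, grad, velocity, lr=0.1, alpha=0.9):
    """Gradient step with momentum: dw(n) = alpha * dw(n-1) - lr * dE/dw.
    Returns the updated weight and the new velocity (step)."""
    v = alpha * velocity - lr * grad
    return w + v, v
```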
22. b. Delta-bar-delta rule: per-weight adaptive learning rates (the defining equations are not recoverable from this transcript). For the actual parameter values to use, consult Jacobs' paper and also "Getting a Fast Break with Backprop," Tveter, AI Expert Magazine, excerpted in the PDF files that I provided.
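A single-weight sketch of the delta-bar-delta idea: the rate grows additively while the gradient keeps its sign and shrinks multiplicatively when it flips. The κ, φ, θ values below are illustrative only; consult Jacobs' paper for the recommended ones.

```python
def delta_bar_delta(grad, grad_bar, lr, kappa=0.01, phi=0.5, theta=0.7):
    """One update of a per-weight learning rate.  grad_bar is an
    exponential average of past gradients; returns (new_lr, new_grad_bar)."""
    if grad * grad_bar > 0:          # same sign as the trend: speed up additively
        lr = lr + kappa
    elif grad * grad_bar < 0:        # sign flip: slow down multiplicatively
        lr = lr * phi
    grad_bar = (1.0 - theta) * grad + theta * grad_bar
    return lr, grad_bar
```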
23. Students' Questions from 2005
- Is there any way to design a spherical error surface for faster convergence?
- Momentum provides inertia to jump over a small peak.
- Parameter-optimization techniques seem to be a good help to NN design, but I am afraid that optimizing even the sigmoid slope and the learning rate may expedite overfitting.
- In what respect is it more manageable to remove the mean, decorrelate, etc.?
- How does using a bigger learning rate for the output layer help learning?
- Does the solution always converge if we use gradient descent?
24. - Are there any shortcomings in using fast learning algorithms?
- In Ahalt's modular net, is it faster for a single output only, or for all the outputs, compared with an MLP?
- Various fast learning methods have been proposed. Which is the best one? Is it problem-dependent?
- The Jacobs method cannot find the global minimum for certain error surfaces (the example figure is not in the transcript).
25. Conjugate Gradient (Fletcher-Reeves) and Line Search
- If the step size η is fixed: gradient descent.
- If η is chosen by a line search along the negative gradient: steepest descent.
26. Gradient Descent, Steepest Descent, Conjugate Gradient
- Gradient descent + line search = steepest descent.
- Steepest descent + a momentum-like reuse of the previous direction = conjugate gradient.
(Figure: weight trajectories w(n), w(n+1), w(n+2) under GD, SD, momentum, and CG; s(n+1) is the CG search direction.)
27. Conjugate Gradient
1) Line search along the current direction s(n).
2) Choose β(n) so that successive directions are conjugate:
   s(n+1) = -g(n+1) + β(n) s(n),
   with the Polak-Ribiere rule
   β(n) = g(n+1)^T [g(n+1) - g(n)] / [g(n)^T g(n)].
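A self-contained sketch of conjugate gradient with the Polak-Ribiere β on a 2-variable quadratic; the test function and the exact line-search formula for quadratics are my additions:

```python
def cg_polak_ribiere(A, b, w, steps):
    """Minimize the quadratic 0.5*w'Aw - b'w by conjugate gradient.
    Direction update: s(n+1) = -g(n+1) + beta * s(n), with the
    Polak-Ribiere beta = g(n+1)'(g(n+1)-g(n)) / g(n)'g(n).
    The line search is exact for a quadratic: alpha = -(g's)/(s'As)."""
    def mv(M, v):    # matrix-vector product
        return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]
    def dot(u, v):
        return sum(a * c for a, c in zip(u, v))
    g = [gi - bi for gi, bi in zip(mv(A, w), b)]       # gradient Aw - b
    s = [-gi for gi in g]
    for _ in range(steps):
        As = mv(A, s)
        alpha = -dot(g, s) / dot(s, As)                # exact line search
        w = [wi + alpha * si for wi, si in zip(w, s)]
        g_new = [gi - bi for gi, bi in zip(mv(A, w), b)]
        beta = max(0.0, dot(g_new, [a - c for a, c in zip(g_new, g)]) / dot(g, g))
        s = [-gn + beta * si for gn, si in zip(g_new, s)]
        g = g_new
        if dot(g, g) < 1e-20:                          # converged
            break
    return w
```

For an N-variable quadratic this converges in at most N steps; the `max(0.0, ...)` is the common non-negative restart variant of Polak-Ribiere.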
28. (Flowchart: start, initialize, then repeat line search and direction update until the convergence tests pass.)
29. Comparison of SD and CG
- Each CG step takes a line search. For an N-variable quadratic function, CG converges in at most N steps.
- Recommended: alternate the two, i.e. steepest descent, then n steps of conjugate gradient, then steepest descent, then n steps of conjugate gradient, and so on.
30. X. Swarm Intelligence
- What is swarm intelligence and why is it interesting?
- Two kinds of swarm intelligence:
  - particle swarm optimization
  - ant colony optimization
- Some applications
- Discussion
31. What is Swarm Intelligence?
- Swarm intelligence is a property of systems of non-intelligent agents exhibiting collectively intelligent behavior.
- Characteristics of a swarm:
  - distributed; no central control or data source
  - no (explicit) model of the environment
  - perception of the environment
  - ability to change the environment
- "I can't do it; we can do it."
32. A group of friends, each with a metal detector, is on a treasure-finding mission. Each can communicate the detector signal and current position to the n nearest neighbors. If your neighbor is closer to the treasure than you are, you can move toward that neighbor, thereby improving your own chance of finding the treasure; the treasure may also be found more easily than if you were on your own. Individuals in a swarm interact to solve a global objective more efficiently than a single individual could. A swarm is defined as a structured collection of interacting organisms (ants, bees, wasps, termites, fish in schools, and birds in flocks) or agents. Within a swarm, individuals are simple in structure, but their collective behavior can be quite complex: the global behavior of a swarm emerges in a nonlinear manner from the behavior of the individuals in that swarm. The interaction among individuals plays a vital role in shaping the swarm's behavior; it refines experiential knowledge about the environment and enhances the swarm's progress toward optimality. The interaction is determined genetically or through social interaction. Applications: function optimization, optimal route finding, scheduling, image and data analysis.
33. Why is it interesting?
- Robust nature of animal problem-solving:
  - simple creatures exhibit complex behavior
  - behavior is modified by a dynamic environment
  - e.g. ants, bees, birds, fishes, etc.
34. Two Kinds of Swarm Intelligence
- Particle swarm optimization
  - proposed in 1995 by J. Kennedy and R. C. Eberhart
  - based on the behavior of bird flocks and fish schools
- Ant colony optimization
  - defined in 1999 by Dorigo, Di Caro and Gambardella
  - based on the behavior of ant colonies
35. 1. Particle Swarm Optimization
- A population-based method
- Three main principles:
  - a particle has a movement (velocity)
  - the particle wants to go back to the best position it has previously visited
  - the particle tries to move toward the position of the best-positioned particles
36. Four types of neighborhood:
- star (global): all particles are neighbors of all particles
- ring (circle): each particle has a fixed number of neighbors K (usually 2)
- wheel: only one particle is connected to all particles and acts as a hub
- random: N random connections are made between the particles
38. PSO loop:
- Initialization: xid(0) = random value, vid(0) = 0.
- Calculate performance F(xid(t)).
- Update the best particle: if F(xid(t)) is better than the pbest, set pbest = F(xid(t)) and pid = xid(t); likewise for the gbest.
- Move each particle (see the next slide).
- Repeat until the system converges.
39. For convergence, c1 + c2 < 4 (Kennedy 1998).
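The loop above can be sketched in plain Python; the inertia weight, acceleration constants, swarm size, and bounds below are illustrative choices (the c1 + c2 < 4 condition concerns the no-inertia form of the update):

```python
import random

def pso_minimize(f, dim=2, n_particles=20, iters=200,
                 c1=1.5, c2=1.5, inertia=0.7,
                 lo=-5.0, hi=5.0, seed=0):
    """Bare-bones global-best PSO: each particle's velocity is pulled toward
    its own best position (pbest) and the swarm's best (gbest)."""
    rng = random.Random(seed)
    xs = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vs = [[0.0] * dim for _ in range(n_particles)]
    pbest = [x[:] for x in xs]                   # personal best positions
    pcost = [f(x) for x in xs]                   # personal best costs
    g = min(range(n_particles), key=pcost.__getitem__)
    gbest, gcost = pbest[g][:], pcost[g]         # global best
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                vs[i][d] = (inertia * vs[i][d]
                            + c1 * rng.random() * (pbest[i][d] - xs[i][d])
                            + c2 * rng.random() * (gbest[d] - xs[i][d]))
                xs[i][d] += vs[i][d]
            cost = f(xs[i])
            if cost < pcost[i]:
                pbest[i], pcost[i] = xs[i][:], cost
                if cost < gcost:
                    gbest, gcost = xs[i][:], cost
    return gbest, gcost
```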
42. http://uk.geocities.com/markcsinclair/pso.html
http://www.engr.iupui.edu/shi/PSO/AppletGUI.html
43. - Fuzzy control of the learning rate and slope (Principe, Chap. 4.16).
- Local-minimum problem:
  - Restart with different initial weights, learning rates, and numbers of hidden nodes.
  - Add (and anneal) a little zero-mean white Gaussian noise to the weights or to the training data (desired output or input), which also improves generalization.
  - Use simulated annealing or genetic-algorithm optimization, then BP.
- Design aided by a graphical user interface, an "NN oscilloscope": look at the internal weights/node activities with color coding.
44. Students' Questions from 2005
- When the learning rate is optimized and initialized, there must be a rough boundary for it. Is finding it just an empirical matter?
- In conjugate gradient, s(n+1) = -g(n+1) + β(n) s(n).
- Learning-rate annealing just keeps decreasing the rate as n grows, without looking at where on the error surface the current weights are. Is this OK?
- Conjugate gradient is similar to momentum in that the old search direction is used in determining the new one; it is also similar to the delta-bar-delta rule in using the past trend.
- Is CG always faster to converge than SD?
- Do different initial values of the weights affect the output results? How can we choose them?