Title: Introduction to Predictive Learning
1. Introduction to Predictive Learning
- LECTURE SET 6: Neural Network Learning
- Electrical and Computer Engineering
2. OUTLINE
- Objectives
- - introduce biologically inspired NN learning methods for clustering, regression and classification
- - explain similarities and differences between statistical and NN methods
- - show examples using synthetic and real-life data
- Brief history and motivation for artificial neural networks
- Sequential estimation of model parameters
- Methods for supervised learning
- Methods for unsupervised learning
- Summary and discussion
3. Brief history and motivation for ANN
- Huge interest in understanding the nature and mechanism of biological/human learning
- Biologists and psychologists do not adopt classical parametric statistical learning, because
- - parametric modeling is not biologically plausible
- - biological information processing is clearly different from algorithmic models of computation
- Mid-1980s: growing interest in applying biologically inspired computational models to
- - developing computer models (of the human brain)
- - various engineering applications
- → new field: Artificial Neural Networks (1986 - 1987)
- ANNs represent nonlinear estimators implementing the ERM approach (usually with a squared-loss function)
4. History and motivation (cont'd)
- Relationship to the problem of inductive learning
- The same learning problem setting
- Neural-style learning algorithm
- - on-line (flow-through)
- - simple processing
- Biological terminology
5. Neural vs Algorithmic computation
- Biological systems do not use the principles of digital circuits:

                      Digital         Biological
  Connectivity        1-10            ~10,000
  Signal              digital         analog
  Timing              synchronous     asynchronous
  Signal propagation  feedforward     feedback
  Redundancy          no              yes
  Parallel proc.      no              yes
  Learning            no              yes
  Noise tolerance     no              yes
6. Neural vs Algorithmic computation
- Computers excel at algorithmic tasks (well-posed mathematical problems)
- Biological systems are superior to digital systems for ill-posed problems with noisy data
- Example: object recognition [Hopfield, 1987]
- - PIGEON: ~10^9 neurons, cycle time ~0.1 sec, each neuron sends 2 bits to ~1K other neurons → ~2x10^13 bit operations per sec
- - OLD PC: ~10^7 gates, cycle time 10^-7 sec, connectivity ~2 → ~10^14 bit operations per sec
- Both have similar raw processing capability, but pigeons are better at recognition tasks
7. Neural terminology and artificial neurons
- Some general descriptions of ANNs:
- http://www.doc.ic.ac.uk/nd/surprise_96/journal/vol4/cs11/report.html
- http://en.wikipedia.org/wiki/Neural_network
- McCulloch-Pitts neuron (1943)
- Threshold (indicator) function of a weighted sum of inputs
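Written out (our notation; the slide shows this only as a figure), the McCulloch-Pitts neuron computes a threshold (indicator) function of a weighted sum of its inputs:

```latex
y = I\!\left(\sum_{i=1}^{d} w_i x_i \ge \theta\right) =
\begin{cases}
1, & \text{if } \sum_{i=1}^{d} w_i x_i \ge \theta,\\
0, & \text{otherwise,}
\end{cases}
```

where the weights $w_i$ and the threshold $\theta$ are fixed in the original 1943 model; learning rules for adapting them came later.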
8. Goals of ANNs
- Develop models of computation inspired by biological systems
- Study computational capabilities of networks of interconnected neurons
- Apply these models to real-life applications
- Learning in NNs: modification (adaptation) of synaptic connections (weights) in response to external inputs
9. Historical highlights of ANN
- 1943: McCulloch-Pitts neuron
- 1949: Hebbian learning
- 1960s: Rosenblatt (perceptron), Widrow
- 60s-70s: dominance of "hard" AI
- 1980s: resurgence of interest (PDP group, MLP, SOM, etc.)
- 1990s: connection to statistics/VC-theory
- 2000s: mature field/fragmentation
10. OUTLINE
- Objectives
- Brief history and motivation for artificial neural networks
- Sequential estimation of model parameters
- Methods for supervised learning
- Methods for unsupervised learning
- Summary and discussion
11. Sequential estimation of model parameters
- Batch vs on-line (iterative) learning
- - algorithmic (statistical) approaches: batch
- - neural-network-inspired methods: on-line
- BUT the only difference is at the implementation level (so both types of methods should yield similar generalization)
- Recall the ERM inductive principle (for regression): minimize the empirical risk $R_{emp}(w) = \frac{1}{n}\sum_{k=1}^{n}\bigl(y_k - f(x_k, w)\bigr)^2$
- Assume dictionary parameterization with fixed basis functions: $f(x, w) = \sum_{j=1}^{m} w_j\, g_j(x)$
12. Sequential (on-line) least-squares minimization
- Training pairs $(x(k), y(k))$ are presented sequentially
- On-line update equations for minimizing the empirical risk (MSE) wrt parameters w (gradient-descent learning): $w(k+1) = w(k) + \gamma_k\,\bigl(y(k) - \hat{y}(k)\bigr)\,\nabla_w f\bigl(x(k), w(k)\bigr)$
- where the gradient is computed via the chain rule
- the learning rate $\gamma_k$ is a small positive value (decreasing with k)
13. On-line least-squares minimization algorithm
- Known as the delta rule (Widrow and Hoff, 1960)
- Given initial parameter estimates w(0), update parameters during each presentation of the k-th training sample (x(k), y(k)):
- Step 1: forward pass computation
- - estimated output $\hat{y}(k) = \sum_{j=1}^{m} w_j(k)\, g_j\bigl(x(k)\bigr)$
- Step 2: backward pass computation
- - error term (delta) $\delta(k) = y(k) - \hat{y}(k)$, used to update each weight: $w_j(k+1) = w_j(k) + \gamma_k\,\delta(k)\, g_j\bigl(x(k)\bigr)$
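As a concrete illustration of the delta rule, here is a minimal NumPy sketch for a linear-in-parameters model $f(x, w) = \sum_j w_j g_j(x)$. The Gaussian basis functions, the 1/k learning-rate schedule, and the toy data are illustrative assumptions, not choices made in the slides.

```python
import numpy as np

def delta_rule(X, y, basis, n_epochs=50, gamma0=0.5):
    """On-line (sequential) least-squares estimation of the linear weights w
    in f(x, w) = sum_j w_j * g_j(x) via the Widrow-Hoff delta rule."""
    w = np.zeros(len(basis(X[0])))       # initial parameter estimates w(0)
    k = 0
    for _ in range(n_epochs):            # repeated presentations (recycling)
        for i in range(len(X)):
            k += 1
            g = basis(X[i])              # basis function outputs for x(k)
            y_hat = w @ g                # Step 1 (forward pass): estimated output
            delta = y[i] - y_hat         # Step 2 (backward pass): error term (delta)
            w += (gamma0 / k) * delta * g    # gradient-descent weight update
    return w

# Toy usage: noisy sin(x) samples fitted with 5 fixed Gaussian basis functions.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 2 * np.pi, 30)
y = np.sin(X) + rng.normal(0.0, 0.1, 30)
centers = np.linspace(0.0, 2 * np.pi, 5)
basis = lambda x: np.exp(-0.5 * (x - centers) ** 2)
w = delta_rule(X, y, basis)
```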
14. Neural network interpretation of delta rule
- (Figure: forward pass and backward pass through a single linear neuron)
15. Theoretical basis for on-line learning
- Standard inductive learning: given training data, find the model $f(x, w)$ providing a minimum of the prediction risk $R(w) = \int L\bigl(y, f(x, w)\bigr)\, p(x, y)\, dx\, dy$
- Stochastic approximation guarantees minimization of the risk (asymptotically), under general conditions on the learning rate $\gamma_k$
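The slide refers to "general conditions on the learning rate" without showing them; the standard stochastic-approximation (Robbins-Monro) conditions are:

```latex
\gamma_k \to 0, \qquad
\sum_{k=1}^{\infty} \gamma_k = \infty, \qquad
\sum_{k=1}^{\infty} \gamma_k^{2} < \infty ,
```

i.e., the steps must shrink, but not so fast that the total step length stays finite; a schedule such as $\gamma_k = c/k$ satisfies these conditions.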
16. Practical issues for on-line learning
- Given a finite training set (n samples): this set is presented sequentially to a learning algorithm many times. Each presentation of the n samples is called an epoch, and the process of repeated presentations is called recycling (of training data).
- Learning rate schedule: initially set large, then slowly decreasing with k (iteration number). Typically, good learning rate schedules are data-dependent.
- Stopping conditions:
- (1) monitor the gradient (i.e., stop when the gradient falls below some small threshold)
- (2) early stopping can be used for complexity control
17. OUTLINE
- Objectives
- Brief history and motivation for artificial neural networks
- Sequential estimation of model parameters
- Methods for supervised learning
- - MultiLayer Perceptron (MLP) networks
- - Radial Basis Function (RBF) networks
- Methods for unsupervised learning
- Summary and discussion
18. Multilayer Perceptrons (MLP)
- Recall the graphical NN representation for dictionary methods, $f(x, w, V) = \sum_{j=1}^{m} w_j\, g_j(x, v_j)$, where each basis function $g_j$ is a sigmoid unit with its own (adjustable) parameters $v_j$
- How to estimate parameters (weights) via ERM?
19. Learning for a single neuron (delta rule)
- (Figure: forward pass and backward pass for a single neuron)
- How to implement gradient-descent learning in a network of neurons?
20. Backpropagation training
- Minimization of the empirical risk (MSE) $R_{emp}(W, V) = \frac{1}{n}\sum_{k=1}^{n}\bigl(y(k) - f(x(k), W, V)\bigr)^2$ with respect to parameters (weights) W, V
- Gradient-descent optimization, where the partial derivatives with respect to W and V are obtained via the chain rule
- Careful application of gradient descent leads to the backpropagation algorithm
21. Backpropagation, forward pass: for training input x(k), estimate the predicted output $\hat{y}(k)$
22. Backpropagation, backward pass: update the weights by propagating the error
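To make the two passes concrete, here is a short NumPy sketch of on-line backpropagation for an MLP with one sigmoid hidden layer and a linear output unit, trained with squared loss. The layer size, the small random initialization, and the fixed learning rate are illustrative assumptions, not values taken from the slides.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def train_mlp(X, y, m_hidden=5, gamma=0.05, n_epochs=200, seed=0):
    """On-line backpropagation for f(x) = w . s(V x + v0) + w0, squared loss."""
    rng = np.random.default_rng(seed)
    V = rng.normal(0.0, 0.1, (m_hidden, X.shape[1]))  # input-to-hidden weights (small init)
    v0 = np.zeros(m_hidden)
    w = rng.normal(0.0, 0.1, m_hidden)                # hidden-to-output weights
    w0 = 0.0
    for _ in range(n_epochs):
        for i in rng.permutation(len(X)):
            x, t = X[i], y[i]
            # Forward pass: for training input x(k), estimate the predicted output
            z = sigmoid(V @ x + v0)                   # hidden-unit outputs
            y_hat = w @ z + w0
            # Backward pass: update the weights by propagating the error
            d_out = y_hat - t                         # output-layer error
            d_hid = d_out * w * z * (1.0 - z)         # hidden deltas (chain rule)
            w  -= gamma * d_out * z
            w0 -= gamma * d_out
            V  -= gamma * np.outer(d_hid, x)
            v0 -= gamma * d_hid
    return V, v0, w, w0
```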
23. Details of backpropagation
- Sigmoid activation: has a simple derivative (written out after this list)
- → poor behaviour for large t: saturation
- How to avoid saturation?
- - proper initialization (small weights)
- - pre-scaling of inputs (zero mean, unit variance)
- Learning rate schedule (initial, final)
- Stopping rules, number of epochs
- Number of hidden units
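The "simple derivative" of the logistic sigmoid mentioned in the list above is (standard form, not reproduced in the slide text):

```latex
s(t) = \frac{1}{1 + e^{-t}}, \qquad s'(t) = s(t)\,\bigl(1 - s(t)\bigr),
```

so $s'(t) \to 0$ as $|t| \to \infty$; this vanishing gradient for large $|t|$ is the saturation effect that small initial weights and input pre-scaling are meant to avoid.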
24. Regularization effect of backpropagation
- Backpropagation ~ iterative optimization
- Final model (weights) depends on
- - initial point + final point (stopping rules)
- → initialization and/or stopping rules can be used for model complexity control
25. Various forms of complexity control
- MLP topology: number of hidden units
- Constraints on parameters (weights): weight decay
- Type of optimization algorithm (many versions of backprop., other optimization methods)
- Stopping rules
- Initial conditions (initial small weights)
- Multiple factors make it difficult to control complexity; usually vary one complexity parameter while keeping all others fixed
26. Example: univariate regression
- Data set: 30 samples generated using the sine-squared target function with Gaussian noise (st. deviation 0.1)
- MLP network (two hidden units) → underfitting
27. Example: univariate regression
- Data set: 30 samples generated using the sine-squared target function with Gaussian noise (st. deviation 0.1)
- MLP network (five hidden units) → near optimal
28. Example: univariate regression
- Data set: 30 samples generated using the sine-squared target function with Gaussian noise (st. deviation 0.1)
- MLP network (20 hidden units) → little overfitting
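The flavor of these toy experiments can be reproduced with scikit-learn's MLPRegressor. The exact data generation and training settings used for the slides are not given, so the target $\sin^2(2\pi x)$ on $[0, 1]$, the tanh activation, and the LBFGS solver below are assumptions for illustration only.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 30)
y = np.sin(2 * np.pi * x) ** 2 + rng.normal(0.0, 0.1, 30)   # sine-squared target + noise

for m in (2, 5, 20):            # under-fitting, near-optimal, slight over-fitting
    net = MLPRegressor(hidden_layer_sizes=(m,), activation="tanh",
                       solver="lbfgs", max_iter=5000, random_state=0)
    net.fit(x.reshape(-1, 1), y)
    mse = np.mean((net.predict(x.reshape(-1, 1)) - y) ** 2)
    print(f"{m} hidden units: training MSE = {mse:.4f}")
```

Note that the training error alone keeps decreasing with the number of hidden units; judging under- vs over-fitting, as in the figures, requires an independent test set or a resampling estimate.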
29. Backpropagation for classification
- Original MLP is for regression (as shown)
- For classification:
- - sigmoid output unit (~ logistic regression using log-likelihood loss; see textbook)
- - during training, use real values 0/1 for class labels
- - during operation, threshold the output of a trained MLP classifier at 0.5 to predict class labels
30. Classification example (Ripley's data set)
- Data set: 250 samples ~ mixture of Gaussians, where Class 0 data has centers (-0.3, 0.7) and (0.4, 0.7), and Class 1 data has centers (-0.7, 0.3) and (0.3, 0.3). The variance of all Gaussians is 0.03.
- MLP classifier (two hidden units) → underfitting
31. Classification example
- MLP classifier (three hidden units) → near-optimal solution
32. Classification example
- MLP classifier (six hidden units) → some overfitting
33. MLP software
- MLP software is widely available in the public domain
- Can handle multi-class problems
- For example, the Netlab toolbox (in Matlab) at http://www1.aston.ac.uk/eas/research/groups/ncrg/resources/netlab/
- Many commercial products (full of marketing hype), e.g., "Nearly 80% Accurate Market Forecasting Software. Get FREE up-to-date predictions and see for yourself!"
34. NetTalk (Sejnowski and Rosenberg, 1987)
- One of the first successful applications of backpropagation
- http://www.cnl.salk.edu/ParallelNetsPronounce/index.php
- Goal: learning to read (English text) aloud, i.e.
- learn the mapping English text → phonemes, using an MLP classifier network
- Network inputs encode a 7-letter window (the 4th letter in the middle needs to be pronounced)
- Network outputs (26 units) encode phonemes that drive a speech synthesizer
- The MLP network is trained using labeled data (both individual words and unrestricted text)
35. NetTalk architecture
- Input encoding: 7x29 = 203 units
- Output encoding: 26 units (phonemes)
- Hidden layer: 80 hidden units
36. Listening to NetTalk-generated speech
- Listen to tape recordings illustrating NETtalk operation, available on YouTube: http://www.youtube.com/watch?v=gakJlr3GecE
- These recordings contain 3 different audio outputs of NETtalk:
- (a) during the first 5 minutes of training, starting with weights initialized to zero
- (b) after training using the set of 10,000 words; this training set corresponds to 20 passes (epochs) over a 500-word text
- (c) generated with new text input that was not part of the training set
- After listening to these recordings, answer and comment on the following questions:
- - can you recognize words in recordings (a), (b) and (c)? Explain why.
- - compare the quality of outputs (b) and (c). Which one seems closer to human speech, and why?
- Question for discussion (Problem 6.8): why does NETtalk use a seven-letter window?
37. Radial Basis Function (RBF) Networks
- Dictionary parameterization $f(x, w) = \sum_{j=1}^{m} w_j\, g\!\left(\dfrac{\|x - c_j\|}{\sigma_j}\right)$
- - each basis function is (usually) local, with center $c_j$ and width $\sigma_j$
- - e.g., Gaussian $g(t) = \exp(-t^2 / 2)$
- Typically used for regression or classification
38. RBF network training
- RBF training (learning) = estimation of
- (1) RBF parameters (centers, widths)
- (2) linear weights w's
- Non-adaptive implementation:
- (1) estimate RBF parameters via unsupervised learning (using only x-values of the training data); can use SOM, GLA, etc.
- (2) estimate weights w via linear least squares
- Advantages:
- - fast training
- - useful when x-samples are plentiful, but (x,y) data are few
- Limitations: cannot discard irrelevant inputs; the curse of dimensionality
39. Non-adaptive RBF training algorithm
- 1. Choose the number of basis functions (centers) m.
- 2. Estimate centers $c_j$ using x-values of the training data, via unsupervised learning (SOM, GLA, clustering, etc.)
- 3. Determine the width parameters $\sigma_j$ using the heuristic: for a given center $c_j$,
- - (a) find the distance to the closest center: $r_j = \min_{i \ne j} \|c_i - c_j\|$
- - (b) set the width parameter $\sigma_j = \lambda\, r_j$, where the parameter $\lambda$ controls the degree of overlap between adjacent basis functions
- 4. Estimate weights w via linear least squares (minimization of the empirical risk, MSE).
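A sketch of this non-adaptive procedure in Python, using k-means for the centers, the closest-center width heuristic, and linear least squares (with a bias term) for the weights; the Gaussian basis form and the overlap value $\lambda = 1$ are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def rbf_design(X, centers, widths):
    """Gaussian RBF design matrix with an appended bias column."""
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
    G = np.exp(-0.5 * (dist / widths) ** 2)
    return np.hstack([G, np.ones((len(X), 1))])

def train_rbf(X, y, m=5, overlap=1.0, seed=0):
    # Steps 1-2: choose m and estimate centers via unsupervised learning (k-means)
    centers = KMeans(n_clusters=m, n_init=10, random_state=seed).fit(X).cluster_centers_
    # Step 3: width of each RBF = overlap * distance to its closest center
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    widths = overlap * d.min(axis=1)
    # Step 4: estimate the linear weights w via least squares (empirical MSE)
    w, *_ = np.linalg.lstsq(rbf_design(X, centers, widths), y, rcond=None)
    return centers, widths, w

def rbf_predict(X, centers, widths, w):
    return rbf_design(X, centers, widths) @ w
```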
40. RBF network complexity control
- RBF model complexity can be controlled by:
- - the number of RBFs; goal: select the optimal number of units (RBFs)
- - the RBF width; goal: select the optimal width parameter (for a large number of RBFs)
- - penalization of large weights w's
- See toy examples next (using the number of units as the complexity parameter)
41. Example: RBF regression
- Data set: 30 samples generated using the sine-squared target function with Gaussian noise (st. deviation 0.1)
- RBF network, automatic width selection (via cross-validation)
- 2 RBFs → underfitting
42. Example: RBF regression
- Data set: 30 samples generated using the sine-squared target function with Gaussian noise (st. deviation 0.1)
- RBF network, automatic width selection
- 5 RBFs → optimal
43. Example: RBF regression
- Data set: 30 samples generated using the sine-squared target function with Gaussian noise (st. deviation 0.1)
- RBF network, automatic width selection
- 20 RBFs → overfitting
44. RBF classification example (Ripley's data)
- Data set: 250 samples ~ mixture of Gaussians, where Class 0 data has centers (-0.3, 0.7) and (0.4, 0.7), and Class 1 data has centers (-0.7, 0.3) and (0.3, 0.3). The variance of all Gaussians is 0.03.
- RBF classifier (4 units) → some underfitting
45. RBF classification example (cont'd)
- RBF classifier (9 units) → optimal
46. RBF classification example (cont'd)
- RBF classifier (25 units) → little overfitting
47. OUTLINE
- Objectives
- Brief history and motivation for artificial neural networks
- Sequential estimation of model parameters
- Methods for supervised learning
- Methods for unsupervised learning
- - clustering and vector quantization
- - Self-Organizing Maps (SOM)
- - application example
- Summary and discussion
48. Overview
- Recall from Lecture Set 2:
- - unsupervised learning
- - data reduction approach
- Example: training data represented by 3 "centers"
49. Two types of problems
- 1. Data reduction: VQ + clustering
- - Model: m points (prototypes)
- - Vector quantizer Q
- - VQ setting: given n training samples $X = \{x_1, \dots, x_n\}$, find the coordinates $c_j$ of m centers (prototypes) such that the total squared error distortion is minimized: $D = \frac{1}{n}\sum_{i=1}^{n} \min_{j} \|x_i - c_j\|^2$
50. Dimensionality reduction
- 2. Dimensionality reduction: linear or nonlinear
- Model: projection of high-dimensional data onto a low-dimensional space
- Note: the goal is to estimate a mapping from the d-dimensional input space (d = 2 in the example) to a low-dimensional feature space (of dimension 1)
51. Vector Quantization and Clustering
- Two complementary goals of VQ:
- 1. partition the input space into disjoint regions
- 2. find positions of units (coordinates of prototypes)
- Note: optimal partitioning into regions is according to the nearest-neighbor rule (→ the Voronoi regions)
52. Generalized Lloyd Algorithm (GLA) for VQ
- Given data points $x(k)$, a loss function L (e.g., squared loss), and initial centers $c_j(0)$, perform the following updates upon presentation of $x(k)$:
- 1. Find the nearest center to the data point (the winning unit): $j^* = \arg\min_j \|x(k) - c_j(k)\|$
- 2. Update the winning unit coordinates (only) via $c_{j^*}(k+1) = c_{j^*}(k) + \gamma_k\,\bigl[x(k) - c_{j^*}(k)\bigr]$
- Increment k and iterate steps (1)-(2) above
- Note: the learning rate $\gamma_k$ decreases with iteration number k
- - biological interpretations of steps (1)-(2) exist
53. Batch version of GLA
- Given data points $x_i$, a loss function L (e.g., squared loss), and initial centers, iterate the following two steps:
- 1. Partition the data (assign sample $x_i$ to unit j) using the nearest-neighbor rule, giving the partitioning matrix Q: $q_{ij} = 1$ if unit j is the nearest center to $x_i$, and $q_{ij} = 0$ otherwise
- 2. Update unit coordinates as centroids of the data: $c_j = \dfrac{\sum_{i} q_{ij}\, x_i}{\sum_{i} q_{ij}}$
- Note: the final solution may depend on initialization (local minima), a potential problem for both on-line and batch GLA
54. Numeric example of univariate VQ
- Given data {2, 4, 10, 12, 3, 20, 30, 11, 25}, set m = 2
- Initialization (random): c1 = 3, c2 = 4
- Iteration 1: Projection: P1 = {2, 3}, P2 = {4, 10, 12, 20, 30, 11, 25}; Expectation (averaging): c1 = 2.5, c2 = 16
- Iteration 2: Projection: P1 = {2, 3, 4}, P2 = {10, 12, 20, 30, 11, 25}; Expectation (averaging): c1 = 3, c2 = 18
- Iteration 3: Projection: P1 = {2, 3, 4, 10}, P2 = {12, 20, 30, 11, 25}; Expectation (averaging): c1 = 4.75, c2 = 19.6
- Iteration 4: Projection: P1 = {2, 3, 4, 10, 11, 12}, P2 = {20, 30, 25}; Expectation (averaging): c1 = 7, c2 = 25
- Stop, as the algorithm has stabilized with these values
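The batch GLA of slide 53 and the univariate example above can be reproduced in a few lines of NumPy; this is only an illustrative sketch of the two alternating steps for 1-D data.

```python
import numpy as np

def batch_gla_1d(x, centers, n_iter=20):
    """Batch GLA: alternate nearest-center partitioning and centroid updates."""
    c = np.array(centers, dtype=float)
    for _ in range(n_iter):
        # Step 1: partition the data using the nearest-neighbor rule
        assign = np.argmin(np.abs(x[:, None] - c[None, :]), axis=1)
        # Step 2: update unit coordinates as centroids of the assigned data
        new_c = np.array([x[assign == j].mean() if np.any(assign == j) else c[j]
                          for j in range(len(c))])
        if np.allclose(new_c, c):          # stop once the centers have stabilized
            break
        c = new_c
    return c

x = np.array([2, 4, 10, 12, 3, 20, 30, 11, 25], dtype=float)
print(batch_gla_1d(x, centers=[3, 4]))     # -> [ 7. 25.], matching the iterations above
```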
55. GLA Example 1
- Modeling a doughnut distribution using 5 units
- (a) initialization, (b) final position (of units)
56. GLA Example 2
- Modeling a doughnut distribution using 3 units
- Bad initialization → poor local minimum
57. GLA Example 3
- Modeling a doughnut distribution using 20 units
- 7 units were never moved by the GLA
- → the problem of unused units (dead units)
58. Avoiding local minima with GLA
- Starting with many random initializations, and then choosing the best GLA solution
- Conscience mechanism: forcing dead units to participate in competition by keeping a frequency count (of past winnings) for each unit, i.e., for the on-line version of GLA, in Step 1
- Self-Organizing Map: introduce a topological relationship (map), thus forcing the neighbors of the winning unit to move towards the data
59. Clustering methods
- Clustering: separating a data set into several groups (clusters) according to some measure of similarity
- Goals of clustering:
- - interpretation (of resulting clusters)
- - exploratory data analysis
- - preprocessing for supervised learning
- - often the goal is not formally stated
- VQ-style methods (GLA) are often used for clustering, e.g., k-means or c-means
- Many other clustering methods as well
60. Clustering (cont'd)
- Clustering: partition a set of n objects (samples) into k disjoint groups, based on some similarity measure. Assumptions:
- - similarity ~ distance metric dist(i, j)
- - usually k is given a priori (but not always!)
- Intuitive motivation:
- - similar objects go into one cluster
- - dissimilar objects go into different clusters
- - → the goal is not formally stated
- The similarity (distance) measure is critical, but usually hard to define (objectively). Distance needs to be defined for different types of input variables.
61. Self-Organizing Maps
- History and biological motivation:
- The brain changes its internal structure to reflect life experiences → interaction with the environment is critical at early stages of brain development (first 1-2 years of life)
- Existence of various regions (maps) in the brain
- How may these maps be formed? i.e., an information-processing model leading to map formation
- T. Kohonen (early 1980s) proposed SOM
62. Goal of SOM
- Dimensionality reduction: project given (high-dimensional) data onto a low-dimensional space (called a map)
- Feature space (Z-space) is 1D or 2D and is discretized as a number of units, e.g., a 10x10 map
- Z-space has a distance metric → ordering of units
- Similarities and differences between VQ and SOM
63. Self-Organizing Map
- Discretization of 2D space via a 10x10 map. In this discrete space, distance relations exist between all pairs of units. Distance relation ~ map topology.
64. SOM Algorithm (flow-through)
- Given data points $x(k)$, a distance metric in the input space (e.g., Euclidean), a map topology (in z-space), and initial positions of the units (in x-space), perform the following updates upon presentation of $x(k)$:
- 1. Find the nearest unit to the data point (the winning unit, denoted z(k)): the unit whose center $c_j(k)$ is closest to $x(k)$
- 2. Update all units around the winning unit via $c_j(k+1) = c_j(k) + \gamma_k\, K_{\beta_k}\bigl(z_j, z(k)\bigr)\,\bigl[x(k) - c_j(k)\bigr]$, where $K_{\beta_k}$ is the neighborhood function of width $\beta_k$
- Increment k, decrease the learning rate and the neighborhood width, and repeat steps (1)-(2) above
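A compact NumPy sketch of the flow-through SOM above, for a 1-D map with a Gaussian neighborhood; the exponential decrease schedules for the learning rate and the neighborhood width, and all numeric settings, are common choices assumed here rather than taken from the slide.

```python
import numpy as np

def som_1d(X, n_units=10, n_iter=5000,
           gamma=(0.5, 0.02), width=(3.0, 0.3), seed=0):
    """Flow-through SOM: 1-D map, Gaussian neighborhood, exponential schedules.
    gamma and width hold (initial, final) values."""
    rng = np.random.default_rng(seed)
    z = np.arange(n_units)                              # unit coordinates in z-space
    C = X[rng.choice(len(X), n_units)].astype(float)    # initial unit positions (x-space)
    for k in range(n_iter):
        frac = k / max(n_iter - 1, 1)
        g = gamma[0] * (gamma[1] / gamma[0]) ** frac    # decreasing learning rate
        b = width[0] * (width[1] / width[0]) ** frac    # shrinking neighborhood width
        x = X[rng.integers(len(X))]                     # present a data point x(k)
        # Step 1: find the winning unit (nearest to x in the input space)
        win = np.argmin(np.linalg.norm(C - x, axis=1))
        # Step 2: move all units towards x, weighted by the map-space neighborhood
        K = np.exp(-0.5 * ((z - z[win]) / b) ** 2)
        C += g * K[:, None] * (x - C)
    return C

# Toy usage: map 300 samples from the unit square onto a chain of 10 units.
units = som_1d(np.random.default_rng(1).uniform(0, 1, (300, 2)))
```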
65. SOM example (one iteration)
- (Figure: Step 1 and Step 2 of the update)
66. SOM example (next iteration)
- (Figure: Step 1, Step 2, and the final map)
67. Hyper-parameters of SOM
- SOM performance depends on parameters (user-defined):
- Map dimension and topology (usually 1D or 2D)
- Number of SOM units ~ quantization level (of z-space)
- Neighborhood function: usually rectangular or Gaussian (the exact shape is not important)
- Neighborhood width decrease schedule (important), e.g., exponential decrease for a Gaussian neighborhood, with user-defined initial and final widths (a linear decrease of the neighborhood width is also used); a typical form is written out after this list
- Learning rate schedule (important) (also a linear decrease)
- Note: the learning rate and neighborhood decrease schedules should be set jointly
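The exponential decrease mentioned above is usually written as follows (a commonly used form with user-defined initial and final values; the slide does not show the formula):

```latex
\beta(k) = \beta_{\mathrm{initial}}
           \left(\frac{\beta_{\mathrm{final}}}{\beta_{\mathrm{initial}}}\right)^{k / k_{\max}},
\qquad
\gamma(k) = \gamma_{\mathrm{initial}}
            \left(\frac{\gamma_{\mathrm{final}}}{\gamma_{\mathrm{initial}}}\right)^{k / k_{\max}},
```

where $\beta(k)$ is the neighborhood width, $\gamma(k)$ the learning rate, and $k_{\max}$ the total number of iterations.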
68. Modeling uniform distribution via SOM
- (a) 300 random samples; (b) 10x10 map
- SOM neighborhood: Gaussian; learning rate: linear decrease
69. Position of SOM units: (a) initial, (b) after 50 iterations, (c) after 100 iterations, (d) after 10,000 iterations
70. Batch SOM (similar to batch GLA)
- Given data points $x_i$, a distance metric (e.g., squared loss), a map topology, and initial centers, iterate the following two steps:
- 1. Partition the data into clusters using the minimum-distance rule. This results in the assignment of n samples to m clusters (units) according to the assignment matrix Q.
- 2. Update the center coordinates as the weighted average of all data samples, weighted by the neighborhood function in map space: $c_j = \dfrac{\sum_i K_{\beta}\bigl(z_j, z(x_i)\bigr)\, x_i}{\sum_i K_{\beta}\bigl(z_j, z(x_i)\bigr)}$
- Decrease the neighborhood width, and iterate.
71. Example: effect of the final neighborhood width
- (Figures for final neighborhood widths of 90, 50, and 10)
72. SOM Applications
- Two types of applications:
- - vector quantization
- - clustering of multivariate data
- Main web site: http://www.cis.hut.fi/research/som-research/
- Numerous applications:
- - marketing surveys/segmentation
- - financial/stock market data
- - text data/document map: WEBSOM
- - image data/picture map: PicSOM
- - see the HUT web site
73. Practical Issues for SOM
- Pre-scaling of inputs, usually to [0, 1] range. Why?
- Map topology: usually 1D or 2D
- Number of map units (per dimension)
- Learning rate schedule (for the on-line version)
- Neighborhood type and schedule:
- - initial size (1), final size
- The final neighborhood size and the number of units affect model complexity.
74. Modeling US states using 1D SOM (performed by Feng Cai)
- Purpose: clustering of US states
- Data encoding: each state described by 5 socio-economic indicators: obesity index, result of the 2004 presidential elections, median income, mean NAEP, IQ score
- Data scaling: each input scaled independently to [0, 1] range
- SOM specs: 1D map, 9 units, initial neighborhood width 1, final width 0.05
75. (figure only)
76. (figure only)
77. SOM Modeling 1 of US states
78. (figure only)
79. SOM Modeling 2 of US states: remove the voting input and apply 1D SOM
80. SOM Modeling 2 of US states (cont'd): remove the voting input and apply 1D SOM
81. Clustering of European Languages
- Background: historical linguistics studies relatedness between languages based on phonology, morphology, syntax and lexicon
- Difficulty of the problem is due to the evolving nature of human languages and globalization
- Hypothesis: similarity based on analysis of a small stable word set
- See glottochronology, Swadesh list, at http://en.wikipedia.org/wiki/Glottochronology
82. SOM Clustering of European Languages
- Modeling approach: language ~ 10-word set
- Assuming words in different languages are encoded in the same alphabet, it is possible to perform clustering using some distance measure
- Issues:
- - selection of a stable word set
- - data encoding + distance metric
- Stable word set: numbers 1 to 10
- Data encoding: Latin alphabet, use the first 3 letters (of each word)
83. "Numbers" word set in 18 European languages
- Each language is a feature vector encoding 10 words
84. Data Encoding
- Word → feature vector encoding its first 3 letters
- Alphabet: 26 letters + 1 symbol for BLANK → 27-dimensional vector encoding
- For example, ONE: O = 15, N = 14, E = 05
85. Word Encoding (cont'd)
- Word → 27-dimensional feature vector
- Encoding is insensitive to the order (of the 3 letters)
- Encoding of the 10-word set: concatenate the feature vectors of all words, "one", "two", ..., "ten"
- → word set encoded as a vector of dimension 1 x 270
86. SOM Modeling Approach
- 2-dimensional SOM (batch algorithm)
- Number of units per dimension: 4
- Initial neighborhood: 1; final neighborhood: 0.15
- Total number of iterations: 70
87. OUTLINE
- Objectives
- Brief history and motivation for artificial neural networks
- Sequential estimation of model parameters
- Methods for supervised learning
- Methods for unsupervised learning
- Summary and discussion
88. Summary and Discussion
- Neural network methods (vs statistical approaches):
- - new techniques/gradient-descent-style methods
- - simple (brute-force) computational approaches
- - black-box models (e.g., MLP network)
- - biological motivation
- The same fundamental issues: small-sample problems, curse of dimensionality, nonlinear optimization, complexity control
- Neural network methods implement the ERM or SRM approach (under the predictive learning setting)
- Hype and controversy