Title: Introduction to Predictive Learning
1. Introduction to Predictive Learning
- LECTURE SET 6: Neural Network Learning
- Electrical and Computer Engineering
2. OUTLINE
- Objectives
- - introduce biologically inspired NN learning methods for clustering, regression and classification
- - explain similarities and differences between statistical and NN methods
- - show examples using synthetic and real-life data
- Brief history and motivation for artificial neural networks
- Sequential estimation of model parameters
- Methods for supervised learning
- Methods for unsupervised learning
- Summary and discussion
3. Brief history and motivation for ANN
- Huge interest in understanding the nature and mechanism of biological/human learning
- Biologists and psychologists do not adopt classical parametric statistical learning, because
- - parametric modeling is not biologically plausible
- - biological information processing is clearly different from algorithmic models of computation
- Mid-1980s: growing interest in applying biologically inspired computational models to
- - developing computer models (of the human brain)
- - various engineering applications
- → new field: Artificial Neural Networks (1986 - 1987)
- ANNs represent nonlinear estimators implementing the ERM approach (usually with a squared-loss function)
4. History and motivation (cont'd)
- Relationship to the problem of inductive learning
- The same learning problem setting
- Neural-style learning algorithm
- - on-line (flow-through)
- - simple processing
- Biological terminology
5. Neural vs Algorithmic computation
- Biological systems do not use the principles of digital circuits:

                      Digital         Biological
  Connectivity        1-10            ~10,000
  Signal              digital         analog
  Timing              synchronous     asynchronous
  Signal propagation  feedforward     feedback
  Redundancy          no              yes
  Parallel proc.      no              yes
  Learning            no              yes
  Noise tolerance     no              yes
6. Neural vs Algorithmic computation
- Computers excel at algorithmic tasks (well-posed mathematical problems)
- Biological systems are superior to digital systems for ill-posed problems with noisy data
- Example: object recognition [Hopfield, 1987]
- - PIGEON: ~10^9 neurons, cycle time ~0.1 sec, each neuron sends 2 bits to ~1K other neurons → ~2x10^13 bit operations per sec
- - OLD PC: ~10^7 gates, cycle time 10^-7 sec, connectivity ~2 → ~10^14 bit operations per sec
- Both have similar raw processing capability, but pigeons are better at recognition tasks
7. Neural terminology and artificial neurons
- Some general descriptions of ANNs:
- http://www.doc.ic.ac.uk/nd/surprise_96/journal/vol4/cs11/report.html
- http://en.wikipedia.org/wiki/Neural_network
- McCulloch-Pitts neuron (1943)
- Threshold (indicator) function of a weighted sum of inputs
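Written out (our notation; the slide shows this only as a figure), the McCulloch-Pitts neuron computes a threshold (indicator) function of a weighted sum of its inputs:

```latex
y = I\!\left(\sum_{i=1}^{d} w_i x_i \ge \theta\right) =
\begin{cases}
1, & \text{if } \sum_{i=1}^{d} w_i x_i \ge \theta,\\
0, & \text{otherwise,}
\end{cases}
```

where the weights $w_i$ and the threshold $\theta$ are fixed in the original 1943 model; learning rules for adapting them came later.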
8. Goals of ANNs
- Develop models of computation inspired by biological systems
- Study computational capabilities of networks of interconnected neurons
- Apply these models to real-life applications
- Learning in NNs: modification (adaptation) of synaptic connections (weights) in response to external inputs
9. Historical highlights of ANN
- 1943: McCulloch-Pitts neuron
- 1949: Hebbian learning
- 1960s: Rosenblatt (perceptron), Widrow
- 60s-70s: dominance of "hard" AI
- 1980s: resurgence of interest (PDP group, MLP, SOM, etc.)
- 1990s: connection to statistics/VC-theory
- 2000s: mature field/fragmentation
10. OUTLINE
- Objectives
- Brief history and motivation for artificial neural networks
- Sequential estimation of model parameters
- Methods for supervised learning
- Methods for unsupervised learning
- Summary and discussion
11. Sequential estimation of model parameters
- Batch vs on-line (iterative) learning
- - algorithmic (statistical) approaches: batch
- - neural-network-inspired methods: on-line
- BUT the only difference is at the implementation level (so both types of methods should yield similar generalization)
- Recall the ERM inductive principle (for regression): minimize the empirical risk $R_{emp}(w) = \frac{1}{n}\sum_{k=1}^{n}\bigl(y_k - f(x_k, w)\bigr)^2$
- Assume dictionary parameterization with fixed basis functions: $f(x, w) = \sum_{j=1}^{m} w_j\, g_j(x)$
12. Sequential (on-line) least-squares minimization
- Training pairs $(x(k), y(k))$ are presented sequentially
- On-line update equations for minimizing the empirical risk (MSE) wrt parameters w (gradient-descent learning): $w(k+1) = w(k) + \gamma_k\,\bigl(y(k) - \hat{y}(k)\bigr)\,\nabla_w f\bigl(x(k), w(k)\bigr)$
- where the gradient is computed via the chain rule
- the learning rate $\gamma_k$ is a small positive value (decreasing with k)
13. On-line least-squares minimization algorithm
- Known as the delta rule (Widrow and Hoff, 1960)
- Given initial parameter estimates w(0), update parameters during each presentation of the k-th training sample (x(k), y(k)):
- Step 1: forward pass computation
- - estimated output $\hat{y}(k) = \sum_{j=1}^{m} w_j(k)\, g_j\bigl(x(k)\bigr)$
- Step 2: backward pass computation
- - error term (delta) $\delta(k) = y(k) - \hat{y}(k)$, used to update each weight: $w_j(k+1) = w_j(k) + \gamma_k\,\delta(k)\, g_j\bigl(x(k)\bigr)$
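As a concrete illustration of the delta rule, here is a minimal NumPy sketch for a linear-in-parameters model $f(x, w) = \sum_j w_j g_j(x)$. The Gaussian basis functions, the 1/k learning-rate schedule, and the toy data are illustrative assumptions, not choices made in the slides.

```python
import numpy as np

def delta_rule(X, y, basis, n_epochs=50, gamma0=0.5):
    """On-line (sequential) least-squares estimation of the linear weights w
    in f(x, w) = sum_j w_j * g_j(x) via the Widrow-Hoff delta rule."""
    w = np.zeros(len(basis(X[0])))       # initial parameter estimates w(0)
    k = 0
    for _ in range(n_epochs):            # repeated presentations (recycling)
        for i in range(len(X)):
            k += 1
            g = basis(X[i])              # basis function outputs for x(k)
            y_hat = w @ g                # Step 1 (forward pass): estimated output
            delta = y[i] - y_hat         # Step 2 (backward pass): error term (delta)
            w += (gamma0 / k) * delta * g    # gradient-descent weight update
    return w

# Toy usage: noisy sin(x) samples fitted with 5 fixed Gaussian basis functions.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 2 * np.pi, 30)
y = np.sin(X) + rng.normal(0.0, 0.1, 30)
centers = np.linspace(0.0, 2 * np.pi, 5)
basis = lambda x: np.exp(-0.5 * (x - centers) ** 2)
w = delta_rule(X, y, basis)
```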
14. Neural network interpretation of delta rule
- (Figure: forward pass and backward pass through a single linear neuron)
15. Theoretical basis for on-line learning
- Standard inductive learning: given training data, find the model $f(x, w)$ providing a minimum of the prediction risk $R(w) = \int L\bigl(y, f(x, w)\bigr)\, p(x, y)\, dx\, dy$
- Stochastic approximation guarantees minimization of the risk (asymptotically), under general conditions on the learning rate $\gamma_k$
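The slide refers to "general conditions on the learning rate" without showing them; the standard stochastic-approximation (Robbins-Monro) conditions are:

```latex
\gamma_k \to 0, \qquad
\sum_{k=1}^{\infty} \gamma_k = \infty, \qquad
\sum_{k=1}^{\infty} \gamma_k^{2} < \infty ,
```

i.e., the steps must shrink, but not so fast that the total step length stays finite; a schedule such as $\gamma_k = c/k$ satisfies these conditions.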
16. Practical issues for on-line learning
- Given a finite training set (n samples): this set is presented sequentially to a learning algorithm many times. Each presentation of the n samples is called an epoch, and the process of repeated presentations is called recycling (of training data).
- Learning rate schedule: initially set large, then slowly decreasing with k (iteration number). Typically, good learning rate schedules are data-dependent.
- Stopping conditions:
- (1) monitor the gradient (i.e., stop when the gradient falls below some small threshold)
- (2) early stopping can be used for complexity control
17. OUTLINE
- Objectives
- Brief history and motivation for artificial neural networks
- Sequential estimation of model parameters
- Methods for supervised learning
- - MultiLayer Perceptron (MLP) networks
- - Radial Basis Function (RBF) networks
- Methods for unsupervised learning
- Summary and discussion
18. Multilayer Perceptrons (MLP)
- Recall the graphical NN representation for dictionary methods, $f(x, w, V) = \sum_{j=1}^{m} w_j\, g_j(x, v_j)$, where each basis function $g_j$ is a sigmoid unit with its own (adjustable) parameters $v_j$
- How to estimate parameters (weights) via ERM?
19. Learning for a single neuron (delta rule)
- (Figure: forward pass and backward pass for a single neuron)
- How to implement gradient-descent learning in a network of neurons?
20. Backpropagation training
- Minimization of the empirical risk (MSE) $R_{emp}(W, V) = \frac{1}{n}\sum_{k=1}^{n}\bigl(y(k) - f(x(k), W, V)\bigr)^2$ with respect to parameters (weights) W, V
- Gradient-descent optimization, where the partial derivatives with respect to W and V are obtained via the chain rule
- Careful application of gradient descent leads to the backpropagation algorithm
21. Backpropagation, forward pass: for training input x(k), estimate the predicted output $\hat{y}(k)$
22. Backpropagation, backward pass: update the weights by propagating the error
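To make the two passes concrete, here is a short NumPy sketch of on-line backpropagation for an MLP with one sigmoid hidden layer and a linear output unit, trained with squared loss. The layer size, the small random initialization, and the fixed learning rate are illustrative assumptions, not values taken from the slides.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def train_mlp(X, y, m_hidden=5, gamma=0.05, n_epochs=200, seed=0):
    """On-line backpropagation for f(x) = w . s(V x + v0) + w0, squared loss."""
    rng = np.random.default_rng(seed)
    V = rng.normal(0.0, 0.1, (m_hidden, X.shape[1]))  # input-to-hidden weights (small init)
    v0 = np.zeros(m_hidden)
    w = rng.normal(0.0, 0.1, m_hidden)                # hidden-to-output weights
    w0 = 0.0
    for _ in range(n_epochs):
        for i in rng.permutation(len(X)):
            x, t = X[i], y[i]
            # Forward pass: for training input x(k), estimate the predicted output
            z = sigmoid(V @ x + v0)                   # hidden-unit outputs
            y_hat = w @ z + w0
            # Backward pass: update the weights by propagating the error
            d_out = y_hat - t                         # output-layer error
            d_hid = d_out * w * z * (1.0 - z)         # hidden deltas (chain rule)
            w  -= gamma * d_out * z
            w0 -= gamma * d_out
            V  -= gamma * np.outer(d_hid, x)
            v0 -= gamma * d_hid
    return V, v0, w, w0
```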
23. Details of backpropagation
- Sigmoid activation: has a simple derivative (written out after this list)
- → poor behaviour for large t: saturation
- How to avoid saturation?
- - proper initialization (small weights)
- - pre-scaling of inputs (zero mean, unit variance)
- Learning rate schedule (initial, final)
- Stopping rules, number of epochs
- Number of hidden units
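The "simple derivative" of the logistic sigmoid mentioned in the list above is (standard form, not reproduced in the slide text):

```latex
s(t) = \frac{1}{1 + e^{-t}}, \qquad s'(t) = s(t)\,\bigl(1 - s(t)\bigr),
```

so $s'(t) \to 0$ as $|t| \to \infty$; this vanishing gradient for large $|t|$ is the saturation effect that small initial weights and input pre-scaling are meant to avoid.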
24. Regularization effect of backpropagation
- Backpropagation ~ iterative optimization
- Final model (weights) depends on
- - initial point + final point (stopping rules)
- → initialization and/or stopping rules can be used for model complexity control
25. Various forms of complexity control
- MLP topology: number of hidden units
- Constraints on parameters (weights): weight decay
- Type of optimization algorithm (many versions of backprop., other optimization methods)
- Stopping rules
- Initial conditions (initial small weights)
- Multiple factors make it difficult to control complexity; usually vary one complexity parameter while keeping all others fixed
26. Example: univariate regression
- Data set: 30 samples generated using the sine-squared target function with Gaussian noise (st. deviation 0.1)
- MLP network (two hidden units) → underfitting
27. Example: univariate regression
- Data set: 30 samples generated using the sine-squared target function with Gaussian noise (st. deviation 0.1)
- MLP network (five hidden units) → near optimal
28. Example: univariate regression
- Data set: 30 samples generated using the sine-squared target function with Gaussian noise (st. deviation 0.1)
- MLP network (20 hidden units) → little overfitting
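The flavor of these toy experiments can be reproduced with scikit-learn's MLPRegressor. The exact data generation and training settings used for the slides are not given, so the target $\sin^2(2\pi x)$ on $[0, 1]$, the tanh activation, and the LBFGS solver below are assumptions for illustration only.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 30)
y = np.sin(2 * np.pi * x) ** 2 + rng.normal(0.0, 0.1, 30)   # sine-squared target + noise

for m in (2, 5, 20):            # under-fitting, near-optimal, slight over-fitting
    net = MLPRegressor(hidden_layer_sizes=(m,), activation="tanh",
                       solver="lbfgs", max_iter=5000, random_state=0)
    net.fit(x.reshape(-1, 1), y)
    mse = np.mean((net.predict(x.reshape(-1, 1)) - y) ** 2)
    print(f"{m} hidden units: training MSE = {mse:.4f}")
```

Note that the training error alone keeps decreasing with the number of hidden units; judging under- vs over-fitting, as in the figures, requires an independent test set or a resampling estimate.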
29. Backpropagation for classification
- Original MLP is for regression (as shown)
- For classification:
- - sigmoid output unit (~ logistic regression using log-likelihood loss; see textbook)
- - during training, use real values 0/1 for class labels
- - during operation, threshold the output of a trained MLP classifier at 0.5 to predict class labels
30. Classification example (Ripley's data set)
- Data set: 250 samples ~ mixture of Gaussians, where Class 0 data has centers (-0.3, 0.7) and (0.4, 0.7), and Class 1 data has centers (-0.7, 0.3) and (0.3, 0.3). The variance of all Gaussians is 0.03.
- MLP classifier (two hidden units) → underfitting
31. Classification example
- MLP classifier (three hidden units) → near-optimal solution
32. Classification example
- MLP classifier (six hidden units) → some overfitting
33. MLP software
- MLP software is widely available in the public domain
- Can handle multi-class problems
- For example, the Netlab toolbox (in Matlab) at http://www1.aston.ac.uk/eas/research/groups/ncrg/resources/netlab/
- Many commercial products (full of marketing hype), e.g., "Nearly 80% Accurate Market Forecasting Software. Get FREE up-to-date predictions and see for yourself!"
34. NetTalk (Sejnowski and Rosenberg, 1987)
- One of the first successful applications of backpropagation
- http://www.cnl.salk.edu/ParallelNetsPronounce/index.php
- Goal: learning to read (English text) aloud, i.e.
- learn the mapping English text → phonemes, using an MLP classifier network
- Network inputs encode a 7-letter window (the 4th letter in the middle needs to be pronounced)
- Network outputs (26 units) encode phonemes that drive a speech synthesizer
- The MLP network is trained using labeled data (both individual words and unrestricted text)
35. NetTalk architecture
- Input encoding: 7x29 = 203 units
- Output encoding: 26 units (phonemes)
- Hidden layer: 80 hidden units
36. Listening to NetTalk-generated speech
- Listen to tape recordings illustrating NETtalk operation, available on YouTube: http://www.youtube.com/watch?v=gakJlr3GecE
- These recordings contain 3 different audio outputs of NETtalk:
- (a) during the first 5 minutes of training, starting with weights initialized to zero
- (b) after training using the set of 10,000 words; this training set corresponds to 20 passes (epochs) over a 500-word text
- (c) generated with new text input that was not part of the training set
- After listening to these recordings, answer and comment on the following questions:
- - can you recognize words in recordings (a), (b) and (c)? Explain why.
- - compare the quality of outputs (b) and (c). Which one seems closer to human speech, and why?
- Question for discussion (Problem 6.8): why does NETtalk use a seven-letter window?
37. Radial Basis Function (RBF) Networks
- Dictionary parameterization $f(x, w) = \sum_{j=1}^{m} w_j\, g\!\left(\dfrac{\|x - c_j\|}{\sigma_j}\right)$
- - each basis function is (usually) local, with center $c_j$ and width $\sigma_j$
- - e.g., Gaussian $g(t) = \exp(-t^2 / 2)$
- Typically used for regression or classification
38. RBF network training
- RBF training (learning) = estimation of
- (1) RBF parameters (centers, widths)
- (2) linear weights w's
- Non-adaptive implementation:
- (1) estimate RBF parameters via unsupervised learning (using only x-values of the training data); can use SOM, GLA, etc.
- (2) estimate weights w via linear least squares
- Advantages:
- - fast training
- - useful when x-samples are plentiful, but (x,y) data are few
- Limitations: cannot discard irrelevant inputs; the curse of dimensionality
39. Non-adaptive RBF training algorithm
- 1. Choose the number of basis functions (centers) m.
- 2. Estimate centers $c_j$ using x-values of the training data, via unsupervised learning (SOM, GLA, clustering, etc.)
- 3. Determine the width parameters $\sigma_j$ using the heuristic: for a given center $c_j$,
- - (a) find the distance to the closest center: $r_j = \min_{i \ne j} \|c_i - c_j\|$
- - (b) set the width parameter $\sigma_j = \lambda\, r_j$, where the parameter $\lambda$ controls the degree of overlap between adjacent basis functions
- 4. Estimate weights w via linear least squares (minimization of the empirical risk, MSE).
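A sketch of this non-adaptive procedure in Python, using k-means for the centers, the closest-center width heuristic, and linear least squares (with a bias term) for the weights; the Gaussian basis form and the overlap value $\lambda = 1$ are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def rbf_design(X, centers, widths):
    """Gaussian RBF design matrix with an appended bias column."""
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
    G = np.exp(-0.5 * (dist / widths) ** 2)
    return np.hstack([G, np.ones((len(X), 1))])

def train_rbf(X, y, m=5, overlap=1.0, seed=0):
    # Steps 1-2: choose m and estimate centers via unsupervised learning (k-means)
    centers = KMeans(n_clusters=m, n_init=10, random_state=seed).fit(X).cluster_centers_
    # Step 3: width of each RBF = overlap * distance to its closest center
    d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    widths = overlap * d.min(axis=1)
    # Step 4: estimate the linear weights w via least squares (empirical MSE)
    w, *_ = np.linalg.lstsq(rbf_design(X, centers, widths), y, rcond=None)
    return centers, widths, w

def rbf_predict(X, centers, widths, w):
    return rbf_design(X, centers, widths) @ w
```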
40. RBF network complexity control
- RBF model complexity can be controlled by:
- - the number of RBFs; goal: select the optimal number of units (RBFs)
- - the RBF width; goal: select the optimal width parameter (for a large number of RBFs)
- - penalization of large weights w's
- See toy examples next (using the number of units as the complexity parameter)
41. Example: RBF regression
- Data set: 30 samples generated using the sine-squared target function with Gaussian noise (st. deviation 0.1)
- RBF network, automatic width selection (via cross-validation)
- 2 RBFs → underfitting
42. Example: RBF regression
- Data set: 30 samples generated using the sine-squared target function with Gaussian noise (st. deviation 0.1)
- RBF network, automatic width selection
- 5 RBFs → optimal
43. Example: RBF regression
- Data set: 30 samples generated using the sine-squared target function with Gaussian noise (st. deviation 0.1)
- RBF network, automatic width selection
- 20 RBFs → overfitting
44. RBF classification example (Ripley's data)
- Data set: 250 samples ~ mixture of Gaussians, where Class 0 data has centers (-0.3, 0.7) and (0.4, 0.7), and Class 1 data has centers (-0.7, 0.3) and (0.3, 0.3). The variance of all Gaussians is 0.03.
- RBF classifier (4 units) → some underfitting
45. RBF classification example (cont'd)
- RBF classifier (9 units) → optimal
46. RBF classification example (cont'd)
- RBF classifier (25 units) → little overfitting
47. OUTLINE
- Objectives
- Brief history and motivation for artificial neural networks
- Sequential estimation of model parameters
- Methods for supervised learning
- Methods for unsupervised learning
- - clustering and vector quantization
- - Self-Organizing Maps (SOM)
- - application example
- Summary and discussion
48. Overview
- Recall from Lecture Set 2:
- - unsupervised learning
- - data reduction approach
- Example: training data represented by 3 "centers"
49. Two types of problems
- 1. Data reduction: VQ + clustering
- - Model: m points (prototypes)
- - Vector quantizer Q
- - VQ setting: given n training samples $X = \{x_1, \dots, x_n\}$, find the coordinates $c_j$ of m centers (prototypes) such that the total squared error distortion is minimized: $D = \frac{1}{n}\sum_{i=1}^{n} \min_{j} \|x_i - c_j\|^2$
50. Dimensionality reduction
- 2. Dimensionality reduction: linear or nonlinear
- Model: projection of high-dimensional data onto a low-dimensional space
- Note: the goal is to estimate a mapping from the d-dimensional input space (d = 2 in the example) to a low-dimensional feature space (of dimension 1)
51. Vector Quantization and Clustering
- Two complementary goals of VQ:
- 1. partition the input space into disjoint regions
- 2. find positions of units (coordinates of prototypes)
- Note: optimal partitioning into regions is according to the nearest-neighbor rule (→ the Voronoi regions)
52. Generalized Lloyd Algorithm (GLA) for VQ
- Given data points $x(k)$, a loss function L (e.g., squared loss), and initial centers $c_j(0)$, perform the following updates upon presentation of $x(k)$:
- 1. Find the nearest center to the data point (the winning unit): $j^* = \arg\min_j \|x(k) - c_j(k)\|$
- 2. Update the winning unit coordinates (only) via $c_{j^*}(k+1) = c_{j^*}(k) + \gamma_k\,\bigl[x(k) - c_{j^*}(k)\bigr]$
- Increment k and iterate steps (1)-(2) above
- Note: the learning rate $\gamma_k$ decreases with iteration number k
- - biological interpretations of steps (1)-(2) exist
53. Batch version of GLA
- Given data points $x_i$, a loss function L (e.g., squared loss), and initial centers, iterate the following two steps:
- 1. Partition the data (assign sample $x_i$ to unit j) using the nearest-neighbor rule, giving the partitioning matrix Q: $q_{ij} = 1$ if unit j is the nearest center to $x_i$, and $q_{ij} = 0$ otherwise
- 2. Update unit coordinates as centroids of the data: $c_j = \dfrac{\sum_{i} q_{ij}\, x_i}{\sum_{i} q_{ij}}$
- Note: the final solution may depend on initialization (local minima), a potential problem for both on-line and batch GLA
54. Numeric example of univariate VQ
- Given data {2, 4, 10, 12, 3, 20, 30, 11, 25}, set m = 2
- Initialization (random): c1 = 3, c2 = 4
- Iteration 1: Projection: P1 = {2, 3}, P2 = {4, 10, 12, 20, 30, 11, 25}; Expectation (averaging): c1 = 2.5, c2 = 16
- Iteration 2: Projection: P1 = {2, 3, 4}, P2 = {10, 12, 20, 30, 11, 25}; Expectation (averaging): c1 = 3, c2 = 18
- Iteration 3: Projection: P1 = {2, 3, 4, 10}, P2 = {12, 20, 30, 11, 25}; Expectation (averaging): c1 = 4.75, c2 = 19.6
- Iteration 4: Projection: P1 = {2, 3, 4, 10, 11, 12}, P2 = {20, 30, 25}; Expectation (averaging): c1 = 7, c2 = 25
- Stop, as the algorithm has stabilized with these values
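The batch GLA of slide 53 and the univariate example above can be reproduced in a few lines of NumPy; this is only an illustrative sketch of the two alternating steps for 1-D data.

```python
import numpy as np

def batch_gla_1d(x, centers, n_iter=20):
    """Batch GLA: alternate nearest-center partitioning and centroid updates."""
    c = np.array(centers, dtype=float)
    for _ in range(n_iter):
        # Step 1: partition the data using the nearest-neighbor rule
        assign = np.argmin(np.abs(x[:, None] - c[None, :]), axis=1)
        # Step 2: update unit coordinates as centroids of the assigned data
        new_c = np.array([x[assign == j].mean() if np.any(assign == j) else c[j]
                          for j in range(len(c))])
        if np.allclose(new_c, c):          # stop once the centers have stabilized
            break
        c = new_c
    return c

x = np.array([2, 4, 10, 12, 3, 20, 30, 11, 25], dtype=float)
print(batch_gla_1d(x, centers=[3, 4]))     # -> [ 7. 25.], matching the iterations above
```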
55. GLA Example 1
- Modeling a doughnut distribution using 5 units
- (a) initialization, (b) final position (of units)
56. GLA Example 2
- Modeling a doughnut distribution using 3 units
- Bad initialization → poor local minimum
57. GLA Example 3
- Modeling a doughnut distribution using 20 units
- 7 units were never moved by the GLA
- → the problem of unused units (dead units)
58. Avoiding local minima with GLA
- Starting with many random initializations, and then choosing the best GLA solution
- Conscience mechanism: forcing dead units to participate in competition by keeping a frequency count (of past winnings) for each unit, i.e., for the on-line version of GLA, in Step 1
- Self-Organizing Map: introduce a topological relationship (map), thus forcing the neighbors of the winning unit to move towards the data
59. Clustering methods
- Clustering: separating a data set into several groups (clusters) according to some measure of similarity
- Goals of clustering:
- - interpretation (of resulting clusters)
- - exploratory data analysis
- - preprocessing for supervised learning
- - often the goal is not formally stated
- VQ-style methods (GLA) are often used for clustering, e.g., k-means or c-means
- Many other clustering methods as well
60. Clustering (cont'd)
- Clustering: partition a set of n objects (samples) into k disjoint groups, based on some similarity measure. Assumptions:
- - similarity ~ distance metric dist(i, j)
- - usually k is given a priori (but not always!)
- Intuitive motivation:
- - similar objects go into one cluster
- - dissimilar objects go into different clusters
- - → the goal is not formally stated
- The similarity (distance) measure is critical, but usually hard to define (objectively). Distance needs to be defined for different types of input variables.
61. Self-Organizing Maps
- History and biological motivation:
- The brain changes its internal structure to reflect life experiences → interaction with the environment is critical at early stages of brain development (first 1-2 years of life)
- Existence of various regions (maps) in the brain
- How may these maps be formed? i.e., an information-processing model leading to map formation
- T. Kohonen (early 1980s) proposed SOM
62. Goal of SOM
- Dimensionality reduction: project given (high-dimensional) data onto a low-dimensional space (called a map)
- Feature space (Z-space) is 1D or 2D and is discretized as a number of units, e.g., a 10x10 map
- Z-space has a distance metric → ordering of units
- Similarities and differences between VQ and SOM
63. Self-Organizing Map
- Discretization of 2D space via a 10x10 map. In this discrete space, distance relations exist between all pairs of units. Distance relation ~ map topology.
64. SOM Algorithm (flow-through)
- Given data points $x(k)$, a distance metric in the input space (e.g., Euclidean), a map topology (in z-space), and initial positions of the units (in x-space), perform the following updates upon presentation of $x(k)$:
- 1. Find the nearest unit to the data point (the winning unit, denoted z(k)): the unit whose center $c_j(k)$ is closest to $x(k)$
- 2. Update all units around the winning unit via $c_j(k+1) = c_j(k) + \gamma_k\, K_{\beta_k}\bigl(z_j, z(k)\bigr)\,\bigl[x(k) - c_j(k)\bigr]$, where $K_{\beta_k}$ is the neighborhood function of width $\beta_k$
- Increment k, decrease the learning rate and the neighborhood width, and repeat steps (1)-(2) above
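A compact NumPy sketch of the flow-through SOM above, for a 1-D map with a Gaussian neighborhood; the exponential decrease schedules for the learning rate and the neighborhood width, and all numeric settings, are common choices assumed here rather than taken from the slide.

```python
import numpy as np

def som_1d(X, n_units=10, n_iter=5000,
           gamma=(0.5, 0.02), width=(3.0, 0.3), seed=0):
    """Flow-through SOM: 1-D map, Gaussian neighborhood, exponential schedules.
    gamma and width hold (initial, final) values."""
    rng = np.random.default_rng(seed)
    z = np.arange(n_units)                              # unit coordinates in z-space
    C = X[rng.choice(len(X), n_units)].astype(float)    # initial unit positions (x-space)
    for k in range(n_iter):
        frac = k / max(n_iter - 1, 1)
        g = gamma[0] * (gamma[1] / gamma[0]) ** frac    # decreasing learning rate
        b = width[0] * (width[1] / width[0]) ** frac    # shrinking neighborhood width
        x = X[rng.integers(len(X))]                     # present a data point x(k)
        # Step 1: find the winning unit (nearest to x in the input space)
        win = np.argmin(np.linalg.norm(C - x, axis=1))
        # Step 2: move all units towards x, weighted by the map-space neighborhood
        K = np.exp(-0.5 * ((z - z[win]) / b) ** 2)
        C += g * K[:, None] * (x - C)
    return C

# Toy usage: map 300 samples from the unit square onto a chain of 10 units.
units = som_1d(np.random.default_rng(1).uniform(0, 1, (300, 2)))
```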
65. SOM example (one iteration)
- (Figure: Step 1 and Step 2 of the update)
66. SOM example (next iteration)
- (Figure: Step 1, Step 2, and the final map)
67. Hyper-parameters of SOM
- SOM performance depends on parameters (user-defined):
- Map dimension and topology (usually 1D or 2D)
- Number of SOM units ~ quantization level (of z-space)
- Neighborhood function: usually rectangular or Gaussian (the exact shape is not important)
- Neighborhood width decrease schedule (important), e.g., exponential decrease for a Gaussian neighborhood, with user-defined initial and final widths (a linear decrease of the neighborhood width is also used); a typical form is written out after this list
- Learning rate schedule (important) (also a linear decrease)
- Note: the learning rate and neighborhood decrease schedules should be set jointly
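The exponential decrease mentioned above is usually written as follows (a commonly used form with user-defined initial and final values; the slide does not show the formula):

```latex
\beta(k) = \beta_{\mathrm{initial}}
           \left(\frac{\beta_{\mathrm{final}}}{\beta_{\mathrm{initial}}}\right)^{k / k_{\max}},
\qquad
\gamma(k) = \gamma_{\mathrm{initial}}
            \left(\frac{\gamma_{\mathrm{final}}}{\gamma_{\mathrm{initial}}}\right)^{k / k_{\max}},
```

where $\beta(k)$ is the neighborhood width, $\gamma(k)$ the learning rate, and $k_{\max}$ the total number of iterations.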
68. Modeling uniform distribution via SOM
- (a) 300 random samples; (b) 10x10 map
- SOM neighborhood: Gaussian; learning rate: linear decrease
69. Position of SOM units: (a) initial, (b) after 50 iterations, (c) after 100 iterations, (d) after 10,000 iterations
70. Batch SOM (similar to batch GLA)
- Given data points $x_i$, a distance metric (e.g., squared loss), a map topology, and initial centers, iterate the following two steps:
- 1. Partition the data into clusters using the minimum-distance rule. This results in the assignment of n samples to m clusters (units) according to the assignment matrix Q.
- 2. Update the center coordinates as the weighted average of all data samples, weighted by the neighborhood function in map space: $c_j = \dfrac{\sum_i K_{\beta}\bigl(z_j, z(x_i)\bigr)\, x_i}{\sum_i K_{\beta}\bigl(z_j, z(x_i)\bigr)}$
- Decrease the neighborhood width, and iterate.
71. Example: effect of the final neighborhood width
- (Figures for final neighborhood widths of 90, 50, and 10)
72. SOM Applications
- Two types of applications:
- - vector quantization
- - clustering of multivariate data
- Main web site: http://www.cis.hut.fi/research/som-research/
- Numerous applications:
- - marketing surveys/segmentation
- - financial/stock market data
- - text data/document map: WEBSOM
- - image data/picture map: PicSOM
- - see the HUT web site
73. Practical Issues for SOM
- Pre-scaling of inputs, usually to [0, 1] range. Why?
- Map topology: usually 1D or 2D
- Number of map units (per dimension)
- Learning rate schedule (for the on-line version)
- Neighborhood type and schedule:
- - initial size (1), final size
- The final neighborhood size and the number of units affect model complexity.
74. Modeling US states using 1D SOM (performed by Feng Cai)
- Purpose: clustering of US states
- Data encoding: each state described by 5 socio-economic indicators: obesity index, result of the 2004 presidential elections, median income, mean NAEP, IQ score
- Data scaling: each input scaled independently to [0, 1] range
- SOM specs: 1D map, 9 units, initial neighborhood width 1, final width 0.05
75. (figure only)
76. (figure only)
77. SOM Modeling 1 of US states
78. (figure only)
79. SOM Modeling 2 of US states: remove the voting input and apply 1D SOM
80. SOM Modeling 2 of US states (cont'd): remove the voting input and apply 1D SOM
81. Clustering of European Languages
- Background: historical linguistics studies relatedness between languages based on phonology, morphology, syntax and lexicon
- Difficulty of the problem is due to the evolving nature of human languages and globalization
- Hypothesis: similarity based on analysis of a small stable word set
- See glottochronology, Swadesh list, at http://en.wikipedia.org/wiki/Glottochronology
82. SOM Clustering of European Languages
- Modeling approach: language ~ 10-word set
- Assuming words in different languages are encoded in the same alphabet, it is possible to perform clustering using some distance measure
- Issues:
- - selection of a stable word set
- - data encoding + distance metric
- Stable word set: numbers 1 to 10
- Data encoding: Latin alphabet, use the first 3 letters (of each word)
83. "Numbers" word set in 18 European languages
- Each language is a feature vector encoding 10 words
84. Data Encoding
- Word → feature vector encoding its first 3 letters
- Alphabet: 26 letters + 1 symbol for BLANK → 27-dimensional vector encoding
- For example, ONE: O = 15, N = 14, E = 05
85. Word Encoding (cont'd)
- Word → 27-dimensional feature vector
- Encoding is insensitive to the order (of the 3 letters)
- Encoding of the 10-word set: concatenate the feature vectors of all words, "one", "two", ..., "ten"
- → word set encoded as a vector of dimension 1 x 270
86. SOM Modeling Approach
- 2-dimensional SOM (batch algorithm)
- Number of units per dimension: 4
- Initial neighborhood: 1; final neighborhood: 0.15
- Total number of iterations: 70
87. OUTLINE
- Objectives
- Brief history and motivation for artificial neural networks
- Sequential estimation of model parameters
- Methods for supervised learning
- Methods for unsupervised learning
- Summary and discussion
88. Summary and Discussion
- Neural network methods (vs statistical approaches):
- - new techniques/gradient-descent-style methods
- - simple (brute-force) computational approaches
- - black-box models (e.g., MLP network)
- - biological motivation
- The same fundamental issues: small-sample problems, curse of dimensionality, nonlinear optimization, complexity control
- Neural network methods implement the ERM or SRM approach (under the predictive learning setting)
- Hype and controversy