Provided by: Vlad8
1
Introduction to Predictive Learning
  • LECTURE SET 6
  • Neural Network Learning

Electrical and Computer Engineering
2
OUTLINE
  • Objectives
  • - introduce biologically inspired NN learning
    methods for clustering, regression and
    classification
  • - explain similarities and differences between
    statistical and NN methods
  • - show examples using synthetic and real-life
    data
  • Brief history and motivation for artificial
    neural networks
  • Sequential estimation of model parameters
  • Methods for supervised learning
  • Methods for unsupervised learning
  • Summary and discussion

3
Brief history and motivation for ANN
  • Huge interest in understanding the nature and
    mechanism of biological/human learning
  • Biologists and psychologists do not adopt classical
    parametric statistical learning, because
  • - parametric modeling is not biologically
    plausible
  • - biological information processing is clearly
    different from algorithmic models of computation
  • Mid-1980s: growing interest in applying
    biologically inspired computational models to
  • - developing computer models (of the human brain)
  • - various engineering applications
  • → New field: Artificial Neural Networks (1986-1987)
  • ANNs represent nonlinear estimators implementing
    the ERM approach (usually with a squared-loss function)

4
History and motivation (contd)
  • Relationship to the problem of inductive
    learning
  • The same learning problem setting
  • Neural-style learning algorithm
  • - on-line (flow through)
  • - simple processing
  • Biological terminology

5
Neural vs Algorithmic computation
  • Biological systems do not use principles of
    digital circuits
  • Property          Digital        Biological
  • Connectivity      1-10           10,000
  • Signal            digital        analog
  • Timing            synchronous    asynchronous
  • Signal propag.    feedforward    feedback
  • Redundancy        no             yes
  • Parallel proc.    no             yes
  • Learning          no             yes
  • Noise tolerance   no             yes

6
Neural vs Algorithmic computation
  • Computers excel at algorithmic tasks (well-posed
    mathematical problems)
  • Biological systems are superior to digital
    systems for ill-posed problems with noisy data
  • Example: object recognition (Hopfield, 1987)
  • PIGEON: 10^9 neurons, cycle time 0.1 sec,
  • each neuron sends 2 bits to 1K other neurons
  • → 2x10^13 bit operations per sec
  • OLD PC: 10^7 gates, cycle time 10^-7 sec,
    connectivity = 2
  • → 10x10^14 bit operations per sec
  • Both have similar raw processing capability, but
    pigeons are better at recognition tasks

7
Neural terminology and artificial neurons
  • Some general descriptions of ANNs:
  • http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html
  • http://en.wikipedia.org/wiki/Neural_network
  • McCulloch-Pitts neuron (1943)
  • Threshold (indicator) function of weighted sum of
    inputs
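
A minimal sketch of the threshold unit described above. The function name, the weights and the threshold value are illustrative assumptions, not taken from the slides.

```python
def mcculloch_pitts(inputs, weights, threshold):
    """Return 1 if the weighted sum of the inputs reaches the threshold, else 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum >= threshold else 0

# Illustrative choice: unit weights with threshold 2 act as a logical AND,
# and the same weights with threshold 1 act as a logical OR.
pairs = [(a, b) for a in (0, 1) for b in (0, 1)]
and_outputs = [mcculloch_pitts(p, (1, 1), 2) for p in pairs]
or_outputs = [mcculloch_pitts(p, (1, 1), 1) for p in pairs]
```

Changing only the threshold changes the logical function computed, which is why the weights and threshold are the quantities adapted during learning.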

8
Goals of ANNs
  • Develop models of computation inspired by
    biological systems
  • Study computational capabilities of networks of
    interconnected neurons
  • Apply these models to real-life applications
  • Learning in NNs: modification (adaptation) of
    synaptic connections (weights) in response to
    external inputs

9
Historical highlights of ANN
  • 1943: McCulloch-Pitts neuron
  • 1949: Hebbian learning
  • 1960s: Rosenblatt (perceptron), Widrow
  • 1960s-70s: dominance of hard AI
  • 1980s: resurgence of interest (PDP group, MLP,
    SOM etc.)
  • 1990s: connection to statistics/VC-theory
  • 2000s: mature field/fragmentation

10
OUTLINE
  • Objectives
  • Brief history and motivation for artificial
    neural networks
  • Sequential estimation of model parameters
  • Methods for supervised learning
  • Methods for unsupervised learning
  • Summary and Discussion

11
Sequential estimation of model parameters
  • Batch vs on-line (iterative) learning
  • - Algorithmic (statistical) approaches: batch
  • - Neural-network inspired methods: on-line
  • BUT the only difference is at the implementation
    level (so both types of methods should yield
    similar generalization)
  • Recall the ERM inductive principle (for regression)
  • Assume dictionary parameterization with fixed
    basis functions

12
Sequential (on-line) least squares minimization
  • Training pairs (x(k), y(k)) are presented
    sequentially
  • On-line update equations for minimizing empirical
    risk (MSE) with respect to parameters w are
  • (gradient descent learning)
  • where the gradient is computed via the chain
    rule
  • the learning rate is a small positive
    value (decreasing with k)
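
The formulas for this slide are not present in the transcript. The block below is a hedged reconstruction in standard notation, assuming squared loss and the dictionary model with fixed basis functions recalled on the previous slide. The symbols $g_j$, $\gamma_k$ and $\hat{y}$, and the factor $\tfrac{1}{2}$, are assumed notation.

```latex
\begin{align}
f(\mathbf{x}, \mathbf{w}) &= \sum_{j=0}^{m} w_j\, g_j(\mathbf{x})
  && \text{dictionary model with fixed basis functions } g_j \\
L(\mathbf{x}(k), y(k), \mathbf{w}) &= \tfrac{1}{2}\bigl(y(k) - f(\mathbf{x}(k), \mathbf{w})\bigr)^2
  && \text{squared loss for the $k$-th sample} \\
\mathbf{w}(k+1) &= \mathbf{w}(k) - \gamma_k\,
  \frac{\partial L}{\partial \mathbf{w}}\bigg|_{\mathbf{w} = \mathbf{w}(k)}
  && \text{gradient descent step} \\
\frac{\partial L}{\partial w_j} &= \frac{\partial L}{\partial \hat{y}}\,
  \frac{\partial \hat{y}}{\partial w_j}
  = -\bigl(y(k) - \hat{y}(k)\bigr)\, g_j(\mathbf{x}(k))
  && \text{chain rule, with } \hat{y}(k) = f(\mathbf{x}(k), \mathbf{w}(k))
\end{align}
```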

13
On-line least-squares minimization algorithm
  • Known as the delta rule (Widrow and Hoff, 1960)
  • Given initial parameter estimates w(0), update
    the parameters during each presentation of the k-th
    training sample (x(k), y(k))
  • Step 1: forward pass computation
  • - estimated output
  • Step 2: backward pass computation
  • - error term (delta)
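
The two steps can be sketched for a single linear unit as follows. This is an illustration under stated assumptions, not the lecture's own code: the learning-rate schedule, the number of epochs and the noise-free toy data are all hypothetical choices.

```python
import numpy as np

def delta_rule(X, y, epochs=200, gamma0=0.1):
    """On-line least squares (delta rule) for a linear unit y_hat = w . x."""
    n, d = X.shape
    w = np.zeros(d)                            # initial estimates w(0)
    k = 0
    for _ in range(epochs):                    # present the training set repeatedly
        for x_k, y_k in zip(X, y):
            gamma_k = gamma0 / (1.0 + k / n)   # assumed schedule, decreasing with k
            y_hat = w @ x_k                    # Step 1: forward pass (estimated output)
            delta = gamma_k * (y_hat - y_k)    # Step 2: backward pass (error term)
            w = w - delta * x_k                # update the weights
            k += 1
    return w

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(-1, 1, 50)])  # bias input plus one variable
y = 1.0 + 2.0 * X[:, 1]                        # noise-free toy target (assumption)
w_hat = delta_rule(X, y)
```

With these settings the estimate should end up close to the generating weights (1, 2). Noisy data would leave some residual error.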

14
Neural network interpretation of delta rule
  • Forward pass Backward pass
  • Biological learning

15
Theoretical basis for on-line learning
  • Standard inductive learning: given training data,
    find the model providing the minimum of
    prediction risk
  • Stochastic Approximation guarantees minimization
    of risk (asymptotically)
  • under general conditions
  • on the learning rate
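
The conditions themselves are missing from the transcript. Presumably they are the standard stochastic approximation (Robbins-Monro) conditions on the learning rate $\gamma_k$, stated here in assumed notation:

```latex
\lim_{k \to \infty} \gamma_k = 0, \qquad
\sum_{k=1}^{\infty} \gamma_k = \infty, \qquad
\sum_{k=1}^{\infty} \gamma_k^2 < \infty
```

For example, $\gamma_k = 1/k$ satisfies all three conditions.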

16
Practical issues for on-line learning
  • Given finite training set (n samples)
  • this set is presented sequentially to a learning
    algorithm many times. Each presentation of n
    samples is called an epoch, and the process of
    repeated presentations is called recycling (of
    training data)
  • Learning rate schedule: initially set large, then
    slowly decreasing with k (iteration number).
    Typically good learning rate schedules are
    data-dependent.
  • Stopping conditions:
  • (1) monitor the gradient (i.e., stop when the
    gradient falls below some small threshold)
  • (2) early stopping can be used for complexity
    control

17
OUTLINE
  • Objectives
  • Brief history and motivation for artificial
    neural networks
  • Sequential estimation of model parameters
  • Methods for supervised learning
  • - MultiLayer Perceptron (MLP) networks
  • - Radial Basis Function (RBF) Networks
  • Methods for unsupervised learning
  • Summary and discussion

18
Multilayer Perceptrons (MLP)
  • Recall the graphical NN representation for
    dictionary methods
  • where
  • How to estimate parameters (weights) via ERM?

19
Learning for a single neuron (delta rule)
  • Forward pass Backward pass
  • How to implement gradient-descent learning in a
    network of neurons?

20
Backpropagation training
  • Minimization of
  • with respect to parameters (weights) W, V
  • Gradient descent optimization for
  • where
  • Careful application of gradient descent
    leads to the backpropagation algorithm

21
Backpropagation forward pass: for training input
x(k), estimate predicted output
22
Backpropagation backward pass: update the weights
by propagating the error
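
A minimal sketch of one forward and backward pass for a network with a single hidden layer, assuming logistic sigmoid hidden units, a linear output unit and squared loss. The names `V` (input-to-hidden weights) and `w` (hidden-to-output weights) follow the W, V notation only loosely, and the toy values are assumptions.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def backprop_step(x, y, V, w, gamma):
    """One on-line update for y_hat = w . s(V x) with loss 0.5 * (y_hat - y)^2."""
    # Forward pass: compute the predicted output
    a = V @ x                                   # hidden-unit activations
    z = sigmoid(a)                              # hidden-unit outputs
    y_hat = w @ z                               # linear output unit
    # Backward pass: propagate the error and update the weights
    delta_out = y_hat - y                       # error at the output
    delta_hid = delta_out * w * z * (1.0 - z)   # error propagated to hidden units
    w_new = w - gamma * delta_out * z
    V_new = V - gamma * np.outer(delta_hid, x)
    return V_new, w_new, 0.5 * delta_out ** 2

rng = np.random.default_rng(0)
V = rng.normal(scale=0.1, size=(5, 2))          # small initial weights, 5 hidden units
w = rng.normal(scale=0.1, size=5)
x = np.array([1.0, 0.3])                        # first component acts as a bias input
V1, w1, loss0 = backprop_step(x, 0.7, V, w, gamma=0.1)
_, _, loss1 = backprop_step(x, 0.7, V1, w1, gamma=0.1)  # loss after one update
```

The factor `z * (1.0 - z)` is the derivative of the logistic sigmoid, so a different activation function would change only that term.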
23
Details of backpropagation
  • Sigmoid activation - picture?
  • simple derivative
  • → Poor behaviour for large t: saturation
  • How to avoid saturation?
  • - Proper initialization (small weights)
  • - Pre-scaling of inputs (zero mean, unit
    variance)
  • Learning rate schedule (initial, final)
  • Stopping rules, number of epochs
  • Number of hidden units
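
A short illustration of the saturation issue, assuming the logistic sigmoid s(t) = 1 / (1 + exp(-t)), whose derivative is s(t)(1 - s(t)). The slide does not say which sigmoid is meant, so this choice is an assumption.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def sigmoid_derivative(t):
    s = sigmoid(t)
    return s * (1.0 - s)       # the "simple derivative" of the logistic function

# The slope is largest at t = 0 and nearly vanishes for large |t| (saturation),
# so gradient updates become very small when inputs or weights are large.
slopes = {t: sigmoid_derivative(t) for t in (0.0, 2.0, 10.0)}
```

This is why small initial weights and pre-scaled inputs help: they keep the units in the region where the slope is not negligible.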

24
Regularization Effect of Backpropagation
  • Backpropagation: iterative optimization
  • Final model (weights) depends on
  • - initial point and final point (stopping rules)
  • → initialization and/or stopping rules can be
    used for model complexity control

25
Various forms of complexity control
  • MLP topology: number of hidden units
  • Constraints on parameters (weights): weight
    decay
  • Type of optimization algorithm (many versions of
    backprop., other opt. methods)
  • Stopping rules
  • Initial conditions (initial small weights)
  • Multiple factors make it difficult to control
    complexity: usually vary one complexity parameter
    while keeping all others fixed

26
Example univariate regression
  • Data set: 30 samples generated using a sine-squared
    target function with Gaussian noise (standard
    deviation 0.1).
  • MLP network
  • (two hidden units)
  • underfitting

27
Example univariate regression
  • Data set: 30 samples generated using a sine-squared
    target function with Gaussian noise (standard
    deviation 0.1).
  • MLP network
  • (five hidden units)
  • near optimal

28
Example univariate regression
  • Data set: 30 samples generated using a sine-squared
    target function with Gaussian noise (standard
    deviation 0.1).
  • MLP network
  • (20 hidden units)
  • little overfitting

29
Backpropagation for classification
  • Original MLP is for regression
  • (as shown)
  • For classification:
  • - sigmoid output unit (~ logistic regression
    using log-likelihood loss; see textbook)
  • - during training, use real values 0/1 for class
    labels
  • - during operation, threshold the output of a
    trained MLP classifier at 0.5 to predict class
    labels

30
Classification example (Ripley's data set)
  • Data set: 250 samples from a mixture of Gaussians,
    where Class 0 data has centers (-0.3, 0.7) and
    (0.4, 0.7), and Class 1 data has centers (-0.7,
    0.3) and (0.3, 0.3). The variance of all
    Gaussians is 0.03.
  • MLP classifier
  • (two hidden units)
  • underfitting

31
Classification Example
  • MLP classifier (three hidden units)
  • near optimal solution

32
Classification Example
  • MLP classifier (six hidden units)
  • some overfitting

33
MLP software
  • MLP software is widely available in the public domain
  • Can handle multi-class problems
  • For example, Netlab toolbox (in Matlab) at
    http://www1.aston.ac.uk/eas/research/groups/ncrg/resources/netlab/
  • Many commercial products (full of marketing hype),
  • e.g. "Nearly 80% Accurate Market Forecasting
    Software. Get FREE up to date predictions and see
    for yourself!"

34
NetTalk (Sejnowski and Rosenberg, 1987)
  • One of the first successful applications of
    backpropagation
  • http://www.cnl.salk.edu/ParallelNetsPronounce/index.php
  • Goal: learning to read (English text) aloud, i.e.
  • learn the mapping English text → phonemes
  • using an MLP classifier network
  • Network inputs encode a 7-letter window (the 4th
    letter, in the middle, needs to be pronounced)
  • Network outputs (26 units) encode phonemes that
    drive a speech synthesizer
  • The MLP network is trained using labeled data
    (both individual words and unrestricted text)

35
NetTalk architecture
  • Input encoding: 7x29 = 203 units
  • Output encoding: 26 units (phonemes)
  • Hidden layer: 80 hidden units
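
The 7x29 input layout can be sketched as a one-hot encoding of a sliding window. The slides do not list the 29 symbols, so the alphabet below (26 letters plus space, comma and period) and the function name are hypothetical.

```python
import string

# Assumed 29-symbol alphabet: 26 letters plus three extra symbols (hypothetical choice).
SYMBOLS = string.ascii_lowercase + " ,."
WINDOW = 7

def encode_window(text, center):
    """One-hot encode the 7-character window centred on text[center]."""
    half = WINDOW // 2
    padded = " " * half + text.lower() + " " * half   # pad so every letter has a full window
    window = padded[center:center + WINDOW]
    units = []
    for ch in window:
        one_hot = [0] * len(SYMBOLS)
        one_hot[SYMBOLS.index(ch)] = 1                # raises ValueError for unknown symbols
        units.extend(one_hot)
    return window, units

window, units = encode_window("neural nets", 3)       # window centred on the letter "r"
```

Each window yields 7 x 29 = 203 input values with exactly seven of them set to 1. Sliding the window one letter at a time produces one training input per letter to be pronounced.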
36
Listening to NetTalk-generated speech
  • Listen to tape recordings illustrating NETtalk
    operation, available on YouTube:
  • http://www.youtube.com/watch?v=gakJlr3GecE
  • The three recordings contain 3 different audio
    outputs of NETtalk:
  • (a) during the first 5 minutes of training,
    starting with weights initialized to zero.
  • (b) after training using the set of 10,000 words.
    This training set corresponds to 20 passes
    (epochs) over a 500-word text.
  • (c) generated with new text input that was not
    part of the training set.
  • After listening to these recordings, answer and
    comment on the following questions:
  • - can you recognize words in recordings (a),
    (b) and (c)? Explain why.
  • - compare the quality of outputs (b) and (c).
    Which one seems closer to human speech and why?
  • Question for discussion (Problem 6.8):
  • - Why does NETtalk use a seven-letter window?

37
Radial Basis Function (RBF) Networks
  • Dictionary parameterization
  • - each basis function is (usually) local
  • - center and width
  • e.g. Gaussian
  • Typically used for regression or classification

38
RBF network training
  • RBF training (learning): estimation of
  • (1) RBF parameters (centers, width)
  • (2) linear weights w
  • Non-adaptive implementation
  • (1) Estimate RBF parameters via unsupervised
    learning (only x-values of training data); can
    use SOM, GLA etc.
  • (2) Estimate weights w via linear least squares
  • Advantages
  • - fast training
  • - useful when x-samples are plentiful, but (x,y) data are
    few
  • Limitations: cannot discard irrelevant inputs
  • → the curse of dimensionality

39
Non-adaptive RBF training algorithm
  • 1. Choose the number of basis functions (centers) m.
  • 2. Estimate centers using x-values of training
    data via unsupervised learning (SOM, GLA,
    clustering etc.)
  • 3. Determine width parameters using a heuristic:
  • For a given center
  • (a) find the distance to the closest center
  • among all other centers
  • (b) set the width parameter
  • where the parameter controls the degree of overlap
    between adjacent basis functions. Typically
  • 4. Estimate weights w via linear least squares
    (minimization of the empirical risk MSE).
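
The four steps might be sketched as below for univariate inputs. This is a rough illustration under assumptions. Plain k-means stands in for the unsupervised step, the width is taken as `alpha` times the distance to the closest other center with `alpha=2.0` chosen arbitrarily, and the target sin^2(2*pi*x) is an assumed form of the sine-squared function used in the lecture's examples.

```python
import numpy as np

def kmeans_1d(x, m, iters=50, seed=0):
    """Batch k-means on 1-D inputs; stands in for any unsupervised method."""
    rng = np.random.default_rng(seed)
    centers = rng.choice(x, size=m, replace=False)
    for _ in range(iters):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        for j in range(m):
            if np.any(labels == j):              # unused units stay where they are
                centers[j] = x[labels == j].mean()
    return np.sort(centers)

def rbf_design(x, centers, widths):
    """Gaussian basis functions plus a constant (bias) column."""
    G = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2.0 * widths[None, :] ** 2))
    return np.column_stack([np.ones(len(x)), G])

def fit_rbf(x, y, m, alpha=2.0):
    centers = kmeans_1d(x, m)                                # step 2: centers from x only
    gaps = np.abs(centers[:, None] - centers[None, :])
    np.fill_diagonal(gaps, np.inf)
    widths = alpha * gaps.min(axis=1)                        # step 3: width heuristic
    w, *_ = np.linalg.lstsq(rbf_design(x, centers, widths), y, rcond=None)  # step 4
    return centers, widths, w

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * x) ** 2 + rng.normal(scale=0.1, size=30)
centers, widths, w = fit_rbf(x, y, m=5)                      # step 1: m = 5
mse = np.mean((rbf_design(x, centers, widths) @ w - y) ** 2)
```

Only step 4 uses the y-values, which is why this scheme is fast and can exploit unlabeled x-samples.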

40
RBF network complexity control
  • RBF model complexity can be controlled by
  • The number of RBFs
  • Goal: select the optimal number of units (RBFs)
  • RBF width
  • Goal: select the optimal width parameter (for a large
    number of RBFs)
  • Penalization of large weights w
  • See toy examples next (using the number of units
    as the complexity parameter)

41
Example RBF regression
  • Data set: 30 samples generated using a sine-squared
    target function with Gaussian noise (standard
    deviation 0.1).
  • RBF network: automatic width selection (via
    cross-validation)
  • 2 RBFs
  • underfitting

42
Example RBF regression
  • Data set: 30 samples generated using a sine-squared
    target function with Gaussian noise (standard
    deviation 0.1).
  • RBF network: automatic width selection
  • 5 RBFs
  • optimal

43
Example RBF regression
  • Data set: 30 samples generated using a sine-squared
    target function with Gaussian noise (standard
    deviation 0.1).
  • RBF network: automatic width selection
  • 20 RBFs
  • overfitting

44
RBF Classification example (Ripley's data)
  • Data set: 250 samples from a mixture of Gaussians,
    where Class 0 data has centers (-0.3, 0.7) and
    (0.4, 0.7), and Class 1 data has centers (-0.7,
    0.3) and (0.3, 0.3). The variance of all
    Gaussians is 0.03.
  • RBF classifier
  • (4 units)
  • some underfitting

45
RBF Classification example (contd)
  • RBF classifier (9 units)
  • Optimal

46
RBF Classification example (contd)
  • RBF classifier (25 units)
  • Little overfitting

47
OUTLINE
  • Objectives
  • Brief history and motivation for artificial
    neural networks
  • Sequential estimation of model parameters
  • Methods for supervised learning
  • Methods for unsupervised learning
  • - clustering and vector quantization
  • - Self-Organizing Maps (SOM)
  • - Application example
  • Summary and discussion

48
Overview
  • Recall from Lecture Set 2
  • unsupervised learning
  • data reduction approach
  • Example: training data represented by 3 centers

49
Two types of problems
  • 1. Data reduction
  • VQ and clustering
  • Model: m points
  • Vector Quantizer Q
  • VQ setting: given n training samples,
  • find the coordinates of m centers
    (prototypes) such that the total squared error
    distortion is minimized

50
  • 2. Dimensionality reduction
  • linear and nonlinear
  • Model: projection of high-dim. data onto
    low-dim. space.
  • Note: the goal is to estimate a mapping from
    d-dimensional input space (d = 2) to
    low-dimensional feature space (d = 1)

51
Vector Quantization and Clustering
  • Two complementary goals of VQ
  • 1. partition the input space into disjoint
    regions
  • 2. find positions of units (coordinates of
    prototypes)
  • Note: optimal partitioning into regions is
    according to the nearest-neighbor rule (the
    Voronoi regions)

52
Generalized Lloyd Algorithm (GLA) for VQ
  • Given data points x(k), a loss function L (e.g.,
    squared loss) and initial centers c_j(0)
  • Perform the following updates upon presentation
    of x(k)
  • 1. Find the nearest center to the data point
    (the winning unit)
  • 2. Update the winning unit coordinates (only)
    via a gradient-descent step on L
  • Increment k and iterate steps (1)-(2) above
  • Note: - the learning rate decreases with
    iteration number k
  • - biological interpretations of steps (1)-(2)
    exist
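
A minimal sketch of the on-line updates for squared loss, where the gradient step moves the winning unit toward the presented sample. The linearly decreasing learning rate, the number of passes and the toy data are assumptions made for illustration.

```python
import numpy as np

def online_gla(X, centers, passes=20, gamma0=0.5):
    """On-line GLA (flow-through vector quantization) with squared loss."""
    centers = np.array(centers, dtype=float)
    k, k_max = 0, passes * len(X)
    for _ in range(passes):
        for x in X:
            gamma_k = gamma0 * (1.0 - k / k_max)                # assumed decreasing rate
            j = np.argmin(np.sum((centers - x) ** 2, axis=1))   # 1. find the winning unit
            centers[j] += gamma_k * (x - centers[j])            # 2. move the winner only
            k += 1
    return centers

# Two well-separated toy clusters and two units
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]])
centers = online_gla(X, centers=[[0.2, 0.2], [0.8, 0.8]])
```

Each unit should settle inside the cluster whose samples it keeps winning. A unit that never wins is never moved.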

53
Batch version of GLA
  • Given data points x_i, a loss function L (e.g.,
    squared loss) and initial centers c_j
  • Iterate the following two steps
  • 1. Partition the data (assign sample x_i to
    unit j) using the nearest neighbor rule.
    Partitioning matrix Q
  • 2. Update unit coordinates as centroids of the
    data
  • Note: the final solution may depend on initialization
    (local minimum); this is a potential problem for both
    on-line and batch GLA

54
Numeric Example of univariate VQ
  • Given data {2, 4, 10, 12, 3, 20, 30, 11, 25}, set m = 2
  • Initialization (random): c1 = 3, c2 = 4
  • Iteration 1
  • Projection: P1 = {2, 3}, P2 = {4, 10, 12, 20, 30, 11, 25}
  • Expectation (averaging): c1 = 2.5, c2 = 16
  • Iteration 2
  • Projection: P1 = {2, 3, 4}, P2 = {10, 12, 20, 30, 11, 25}
  • Expectation (averaging): c1 = 3, c2 = 18
  • Iteration 3
  • Projection: P1 = {2, 3, 4, 10}, P2 = {12, 20, 30, 11, 25}
  • Expectation (averaging): c1 = 4.75, c2 = 19.6
  • Iteration 4
  • Projection: P1 = {2, 3, 4, 10, 11, 12}, P2 = {20, 30, 25}
  • Expectation (averaging): c1 = 7, c2 = 25
  • Stop, as the algorithm has stabilized with these
    values
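
The worked example can be reproduced with a short batch GLA sketch for scalar data. The function name and the stopping test (centers unchanged between iterations) are illustrative choices.

```python
def batch_gla_1d(data, centers, max_iter=100):
    """Batch GLA (k-means) for scalar data; returns final centers and their history."""
    centers = list(centers)
    history = []
    for _ in range(max_iter):
        # Projection: assign each sample to its nearest center
        partitions = [[] for _ in centers]
        for x in data:
            j = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            partitions[j].append(x)
        # Expectation: move each center to the mean of its partition
        new_centers = [sum(p) / len(p) if p else c
                       for p, c in zip(partitions, centers)]
        if new_centers == centers:       # stabilized
            break
        centers = new_centers
        history.append(centers)
    return centers, history

data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
centers, history = batch_gla_1d(data, [3, 4])
# centers -> [7.0, 25.0] after four updates, matching the iterations listed above
```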

55
GLA Example 1
  • Modeling doughnut distribution using 5 units
  • (a) initialization (b) final position (of units)

56
GLA Example 2
  • Modeling doughnut distribution using 3 units
  • Bad initialization → poor local minimum

57
GLA Example 3
  • Modeling doughnut distribution using 20 units
  • 7 units were never moved by the GLA
  • → the problem of unused units (dead units)

58
Avoiding local minima with GLA
  • Starting with many random initializations, and
    then choosing the best GLA solution
  • Conscience mechanism: forcing dead units to
    participate in competition, by keeping the
    frequency count (of past winnings) for each unit,
  • i.e. for the on-line version of GLA, in Step 1
  • Self-Organizing Map: introduce a topological
    relationship (map), thus forcing the neighbors of
    the winning unit to move towards the data.

59
Clustering methods
  • Clustering: separating a data set into several
    groups (clusters) according to some measure of
    similarity
  • Goals of clustering
  • interpretation (of resulting clusters)
  • exploratory data analysis
  • preprocessing for supervised learning
  • often the goal is not formally stated
  • VQ-style methods (GLA) are often used for clustering,
    e.g. k-means or c-means
  • Many other clustering methods as well

60
Clustering (contd)
  • Clustering: partition a set of n objects
    (samples) into k disjoint groups, based on some
    similarity measure. Assumptions:
  • - similarity ~ distance metric dist(i,j)
  • - usually k is given a priori (but not always!)
  • Intuitive motivation
  • similar objects into one cluster
  • dissimilar objects into different clusters
  • → the goal is not formally stated
  • Similarity (distance) measure is critical
  • but usually hard to define (objectively).
    Distance needs to be defined for different types
    of input variables.

61
Self-Organizing Maps
  • History and biological motivation
  • The brain changes its internal structure to reflect
    life experiences → interaction with the environment
    is critical at early stages of brain development
    (first 1-2 years of life)
  • Existence of various regions (maps) in the brain
  • How may these maps be formed?
  • i.e. information-processing model leading to map
    formation
  • T. Kohonen (early 1980s) proposed SOM

62
Goal of SOM
  • Dimensionality reduction: project given
    (high-dim.) data onto a low-dimensional space
    (called a map)
  • Feature space (Z-space) is 1D or 2D and is
    discretized as a number of units, e.g., a 10x10 map
  • Z-space has a distance metric → ordering of units
  • Similarities and differences between VQ and SOM

63
Self-Organizing Map
  • Discretization of 2D space via a 10x10 map. In this
    discrete space, distance relations exist between
    all pairs of units.
  • Distance relation ~ map topology

64
SOM Algorithm (flow through)
  • Given data points x(k), a distance metric in the
    input space (e.g., Euclidean), map topology (in
    z-space), initial position of units (in x-space)
  • Perform the following updates upon presentation
    of x(k)
  • 1. Find the nearest unit to the data point (the
    winning unit, denoted as z(k))
  • 2. Update all units around the winning unit,
    moving them toward the data point in proportion
    to a neighborhood function in z-space
  • Increment k, decrease the learning rate and the
    neighborhood width, and repeat steps (1)-(2)
    above
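
A minimal flow-through SOM sketch with a 1D map, under stated assumptions. It uses a Gaussian neighborhood in z-space, exponentially decreasing learning rate and neighborhood width, random presentation order and units initialized at randomly chosen samples. All parameter values are illustrative.

```python
import numpy as np

def online_som_1d(X, n_units, n_iter=2000, seed=0,
                  gamma_init=1.0, gamma_final=0.05,
                  width_init=1.0, width_final=0.05):
    """Flow-through SOM with a 1-D map; unit j has map coordinate z_j in [0, 1]."""
    rng = np.random.default_rng(seed)
    z = np.linspace(0.0, 1.0, n_units)                        # map topology (z-space)
    units = X[rng.choice(len(X), n_units, replace=False)].astype(float)  # x-space positions
    for k in range(n_iter):
        frac = k / (n_iter - 1)
        gamma = gamma_init * (gamma_final / gamma_init) ** frac   # learning rate schedule
        width = width_init * (width_final / width_init) ** frac   # neighborhood width schedule
        x = X[rng.integers(len(X))]
        winner = np.argmin(np.sum((units - x) ** 2, axis=1))      # 1. nearest unit in x-space
        K = np.exp(-(z - z[winner]) ** 2 / (2.0 * width ** 2))    # neighborhood in z-space
        units += gamma * K[:, None] * (x - units)                 # 2. move winner and neighbors
    return units

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(300, 2))      # toy data: uniform samples in the unit square
units = online_som_1d(X, n_units=10)
```

With a wide initial neighborhood nearly all units move together. As the width shrinks, the update approaches the on-line GLA step for the winner alone.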

65
SOM example (one iteration)
Step 1
Step 2
66
SOM example (next iteration)
Step 1
Step 2
Final map
67
Hyper-parameters of SOM
  • SOM performance depends on parameters
    (user-defined):
  • Map dimension and topology (usually 1D or 2D)
  • Number of SOM units ~ quantization level (of
    z-space)
  • Neighborhood function: usually
    rectangular or Gaussian (shape not important)
  • Neighborhood width decrease schedule (important),
    e.g. exponential decrease for Gaussian
  • with user-defined parameters
  • Also linear decrease of neighborhood width
  • Learning rate schedule (important)
  • (also linear decrease)
  • Note: learning rate and neighborhood decrease
    should be set jointly

68
Modeling uniform distribution via SOM
  • (a) 300 random samples (b) 10x10 map
  • SOM neighborhood: Gaussian
  • Learning rate: linear decrease
69
Position of SOM units (a) initial, (b) after 50
iterations, (c) after 100 iterations, (d) after
10,000 iterations
70
Batch SOM (similar to Batch GLA)
  • Given data points x_i, a distance metric (e.g.,
    squared loss), map topology and initial centers
  • Iterate the following two steps
  • 1. Partition the data into clusters using the
    minimum distance rule. This results in assignment
    of n samples to m clusters (units) according to
    assignment matrix Q
  • 2. Update center coordinates as the weighted
    average of all data samples (in each cluster)
  • Decrease the neighborhood width, and iterate.
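
A hedged sketch of the batch iteration for a 1D map. It assumes a Gaussian neighborhood in z-space for the averaging weights and an exponentially decreasing neighborhood width. The slides do not specify either, so these are illustrative choices.

```python
import numpy as np

def batch_som_1d(X, n_units, n_iter=30, width_init=1.0, width_final=0.1, seed=0):
    """Batch SOM with a 1-D map and a Gaussian neighborhood in z-space."""
    rng = np.random.default_rng(seed)
    z = np.linspace(0.0, 1.0, n_units)                  # map coordinates of the units
    units = X[rng.choice(len(X), n_units, replace=False)].astype(float)
    for it in range(n_iter):
        width = width_init * (width_final / width_init) ** (it / (n_iter - 1))
        # 1. Partition: assign every sample to its nearest unit (minimum distance rule)
        d2 = ((X[:, None, :] - units[None, :, :]) ** 2).sum(axis=2)
        winners = d2.argmin(axis=1)
        # 2. Update: each unit becomes a weighted average of all samples, weighted by
        #    the map-space closeness of the unit to each sample's winning unit
        K = np.exp(-(z[:, None] - z[winners][None, :]) ** 2 / (2.0 * width ** 2))
        units = (K @ X) / K.sum(axis=1, keepdims=True)
    return units

rng = np.random.default_rng(2)
X = rng.uniform(0.0, 1.0, size=(300, 2))
units = batch_som_1d(X, n_units=10)
```

As the width shrinks, the weights concentrate on each unit's own cluster and the update approaches the centroid step of batch GLA.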

71
Example: effect of the final neighborhood width
(panels: 90, 50, 10)
72
SOM Applications
  • Two types of applications
  • Vector Quantization
  • Clustering of multivariate data
  • Main web site: http://www.cis.hut.fi/research/som-research/
  • Numerous applications
  • Marketing surveys/segmentation
  • Financial/stock market data
  • Text data / document map: WEBSOM
  • Image data / picture map: PicSOM
  • see HUT web site

73
Practical Issues for SOM
  • Pre-scaling of inputs, usually to the [0, 1] range.
    Why?
  • Map topology: usually 1D or 2D
  • Number of map units (per dimension)
  • Learning rate schedule (for on-line version)
  • Neighborhood type and schedule
  • Initial size (1), final size
  • Final neighborhood size and the number of units
    affect model complexity.

74
Modeling US states using 1D SOM (performed by
Feng Cai)
  • Purpose: clustering of US states
  • Data encoding: each state is described by 5
    socio-economic indicators: obesity index, result
    of 2004 presidential elections, median income,
    mean NAEP, IQ score
  • Data scaling: each input scaled independently to
    the [0,1] range
  • SOM specs: 1D map, 9 units, initial neighborhood
    width 1, final width 0.05

75
(No Transcript)
76
(No Transcript)
77
SOM Modeling 1 of US states
78
(No Transcript)
79
SOM Modeling 2 of US states: remove the voting
input and apply 1D SOM
80
SOM Modeling 2 of US states (contd): remove the
voting input and apply 1D SOM
81
Clustering of European Languages
  • Background: historical linguistics studies
    relatedness between languages based on
  • phonology, morphology, syntax and lexicon
  • Difficulty of the problem: due to the evolving nature
    of human languages and globalization.
  • Hypothesis: similarity based on analysis of a
    small stable word set.
  • See glottochronology, Swadesh list, at
  • http://en.wikipedia.org/wiki/Glottochronology

82
SOM Clustering of European Languages
  • Modeling approach: language ~ 10-word set.
  • Assuming words in different languages are encoded
    in the same alphabet, it is possible to perform
    clustering using some distance measure.
  • Issues
  • selection of a stable word set
  • data encoding and distance metric
  • Stable word set: numbers 1 to 10
  • Data encoding: Latin alphabet, use the first 3
    letters (in each word)

83
Numbers word set in 18 European languages
  • Each language is a feature vector encoding 10
    words

84
Data Encoding
  • Word ~ feature vector encoding the first 3 letters
  • Alphabet: 26 letters + 1 symbol BLANK
  • vector encoding
  • For example, ONE: O = 15, N = 14, E = 05

85
Word Encoding (contd)
  • Word → 27-dimensional feature vector
  • Encoding is insensitive to order (of 3 letters)
  • Encoding of 10-word set: concatenate feature
    vectors of all words (one, two, ..., ten)
  • → word set encoded as vector of dim. 1 x 270
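
One way this encoding might look in code. Using binary indicators (rather than letter counts) and placing BLANK at index 0 are assumptions, since the transcript does not show the vector layout. Letters outside A-Z would need to be transliterated first.

```python
import string

BLANK_INDEX = 0   # assumed position of the BLANK symbol; letters A..Z take 1..26

def encode_word(word):
    """27-dim indicator vector of the first three letters (order is ignored)."""
    vec = [0] * 27
    for ch in word.upper()[:3].ljust(3):      # pad short words with BLANK
        idx = BLANK_INDEX if ch == " " else string.ascii_uppercase.index(ch) + 1
        vec[idx] = 1
    return vec

def encode_language(words):
    """Concatenate the word vectors of a 10-word set into one 270-dim vector."""
    if len(words) != 10:
        raise ValueError("expected the numbers 1 to 10")
    return [bit for word in words for bit in encode_word(word)]

english = ["one", "two", "three", "four", "five",
           "six", "seven", "eight", "nine", "ten"]
vector = encode_language(english)
```

Because only the presence of a letter is recorded, words sharing their first three letters in any order receive identical vectors, which is the order insensitivity noted above.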

86
SOM Modeling Approach
  • 2-Dimensional SOM (Batch Algorithm)
  • Number of units per dimension = 4
  • Initial neighborhood = 1, final neighborhood =
    0.15
  • Total number of iterations = 70

87
OUTLINE
  • Objectives
  • Brief history and motivation for artificial
    neural networks
  • Sequential estimation of model parameters
  • Methods for supervised learning
  • Methods for unsupervised learning
  • Summary and discussion

88
Summary and Discussion
  • Neural Network methods (vs statistical
    approaches)
  • - new techniques/ grad descent style methods
  • - simple (brute-force) computational approaches
  • - black-box models (e.g. MLP network)
  • - biological motivation
  • The same fundamental issues: small-sample
    problems, curse of dimensionality, non-linear
    optimization, complexity control
  • Neural network methods implement ERM or SRM
    approach (under predictive learning setting)
  • Hype and controversy