Title: Introduction to Radial Basis Function Networks
1Introduction to Radial Basis Function Networks
2Content
- Overview
- The Models of Function Approximator
- The Radial Basis Function Networks
- RBFNs for Function Approximation
- The Projection Matrix
- Learning the Kernels
- Bias-Variance Dilemma
- The Effective Number of Parameters
- Model Selection
- Incremental Operations
3Introduction to Radial Basis Function Networks
4Typical Applications of NN
- Pattern Classification
- Function Approximation
- Time-Series Forecasting
5Function Approximation
Unknown
Approximator
6Supervised Learning
Unknown Function
Neural Network
7Neural Networks as Universal Approximators
- Feedforward neural networks with a single hidden layer of sigmoidal units are capable of approximating uniformly any continuous multivariate function, to any desired degree of accuracy.
  - Hornik, K., Stinchcombe, M., and White, H. (1989). "Multilayer Feedforward Networks are Universal Approximators," Neural Networks, 2(5), 359-366.
- Like feedforward neural networks with a single hidden layer of sigmoidal units, it can be shown that RBF networks are universal approximators.
  - Park, J. and Sandberg, I. W. (1991). "Universal Approximation Using Radial-Basis-Function Networks," Neural Computation, 3(2), 246-257.
  - Park, J. and Sandberg, I. W. (1993). "Approximation and Radial-Basis-Function Networks," Neural Computation, 5(2), 305-316.
8Statistics vs. Neural Networks
9Introduction to Radial Basis Function Networks
- The Model of Function Approximator
10Linear Models
Weights
Fixed Basis Functions
11Linear Models
Linearly weighted output
Output Units
- Decomposition
- Feature Extraction
- Transformation
Hidden Units
Inputs
Feature Vectors
12Linear Models
Can you say some bases?
y
Linearly weighted output
Output Units
w2
w1
wm
- Decomposition
- Feature Extraction
- Transformation
Hidden Units
$\phi_1$
$\phi_2$
$\phi_m$
Inputs
Feature Vectors
x1
x2
xn
x
13Example Linear Models
Are they orthogonal bases?
- Polynomial
- Fourier Series
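A minimal sketch of the linear model on these slides: fixed basis functions in the hidden layer, and an output that is a linearly weighted sum of them. The function names, the target function, and the use of ordinary least squares to find the weights are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def polynomial_design(x, degree):
    """Columns are the fixed basis functions 1, x, x^2, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def fourier_design(x, n_harmonics):
    """Columns are 1, cos(kx), sin(kx) for k = 1..n_harmonics."""
    cols = [np.ones_like(x)]
    for k in range(1, n_harmonics + 1):
        cols += [np.cos(k * x), np.sin(k * x)]
    return np.column_stack(cols)

def fit_linear_model(Phi, y):
    """Least-squares weights w for the model f(x) = sum_j w_j * phi_j(x)."""
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

# Hypothetical target used only for illustration.
x = np.linspace(0.0, 2.0 * np.pi, 50)
y = 1.0 + 2.0 * np.sin(x) - 0.5 * np.cos(2.0 * x)

w_fourier = fit_linear_model(fourier_design(x, 2), y)   # order: 1, cos x, sin x, cos 2x, sin 2x
w_poly = fit_linear_model(polynomial_design(x, 3), y)   # cubic fit to the same data
```

Because the target lies in the span of the Fourier basis, the Fourier weights recover its coefficients, while the cubic polynomial can only approximate it.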
14Single-Layer Perceptrons as Universal Approximators
With a sufficient number of sigmoidal units, it can be a universal approximator.
Hidden Units
15Radial Basis Function Networks as Universal Approximators
With a sufficient number of radial-basis-function units, it can also be a universal approximator.
Hidden Units
16Non-Linear Models
Weights
Adjusted by the Learning process
17Introduction to Radial Basis Function Networks
- The Radial Basis Function Networks
18Radial Basis Functions
Three parameters for a radial function
$\phi_i(\mathbf{x}) = \phi(\|\mathbf{x} - \mathbf{x}_i\|)$
- Center: $\mathbf{x}_i$
- Distance Measure: $r = \|\mathbf{x} - \mathbf{x}_i\|$
- Shape: $\phi$
19Typical Radial Functions
- Gaussian
- Hardy Multiquadric
- Inverse Multiquadric
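The formulas for these three radial functions did not survive in the slide text. The sketch below uses commonly quoted textbook forms; the exact normalization (for example dividing by c) and the parameter names `sigma` and `c` are assumptions, and the slides may have used a different scaling.

```python
import numpy as np

def gaussian(r, sigma=1.0):
    """Gaussian: exp(-r^2 / (2 sigma^2)); largest at r = 0, decays with distance."""
    return np.exp(-r**2 / (2.0 * sigma**2))

def multiquadric(r, c=1.0):
    """Hardy multiquadric: sqrt(r^2 + c^2) / c; grows with distance."""
    return np.sqrt(r**2 + c**2) / c

def inverse_multiquadric(r, c=1.0):
    """Inverse multiquadric: c / sqrt(r^2 + c^2); decays with distance."""
    return c / np.sqrt(r**2 + c**2)

def radial_basis(x, center, shape=gaussian, **params):
    """Evaluate shape(||x - center||) for one point or a batch of points (rows of x)."""
    r = np.linalg.norm(np.atleast_2d(x) - center, axis=-1)
    return shape(r, **params)

# Two 2-D points evaluated against a center at the origin.
points = np.array([[0.0, 0.0], [3.0, 4.0]])
values = radial_basis(points, np.array([0.0, 0.0]), shape=gaussian, sigma=5.0)
```

The three ingredients from the previous slide appear directly: `center`, the Euclidean distance `r`, and the `shape` function.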
20Gaussian Basis Function ($\sigma$ = 0.5, 1.0, 1.5)
21Inverse Multiquadric
[Figure: inverse multiquadric curves for c = 1, 2, 3, 4, 5]
22Most General RBF
Basis $\{\phi_i : i = 1, 2, \dots\}$ is near orthogonal.
23Properties of RBFs
- On-Center, Off Surround
- Analogies with localized receptive fields found in several biological structures, e.g.,
  - visual cortex
  - ganglion cells
24The Topology of RBF
As a function approximator
Output Units
Interpolation
Hidden Units
Projection
Feature Vectors
Inputs
25The Topology of RBF
As a pattern classifier.
Output Units
Classes
Hidden Units
Subclasses
Feature Vectors
Inputs
26Introduction to Radial Basis Function Networks
- RBFNs for Function Approximation
27The idea
[Figure sequence built up over slides 27-31: y plotted against x]
32Radial Basis Function Networks as Universal Approximators
Training set
Goal
for all k
33Learn the Optimal Weight Vector
Training set
Goal
for all k
34Regularization
Training set
If regularization is unneeded, set $\lambda = 0$.
Goal
for all k
35Learn the Optimal Weight Vector
Minimize
36Learn the Optimal Weight Vector
Define
37Learn the Optimal Weight Vector
Define
38Learn the Optimal Weight Vector
39Learn the Optimal Weight Vector
Design Matrix
Variance Matrix
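The equations on slides 32 to 39 were lost in extraction. The block below is a reconstruction in the notation of Orr (1996), which this deck lists in its references; the symbols $p$ (number of training patterns), $m$ (number of hidden units), $\lambda_j$, $\Phi$, $\Lambda$ and $A$ are assumptions taken from that source rather than from the slide text.

```latex
\begin{align*}
T &= \{(\mathbf{x}^{(k)}, y^{(k)})\}_{k=1}^{p}, \qquad
f(\mathbf{x}) = \sum_{j=1}^{m} w_j \phi_j(\mathbf{x}), \qquad
y^{(k)} \approx f(\mathbf{x}^{(k)}) \text{ for all } k \\
C &= \sum_{k=1}^{p} \bigl(y^{(k)} - f(\mathbf{x}^{(k)})\bigr)^2
   + \sum_{j=1}^{m} \lambda_j w_j^2 \\
\Phi_{kj} &= \phi_j(\mathbf{x}^{(k)}), \qquad
\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_m), \qquad
A = \Phi^{\top}\Phi + \Lambda \\
\frac{\partial C}{\partial \mathbf{w}} = \mathbf{0}
 &\;\Rightarrow\; \Phi^{\top}\Phi\,\mathbf{w} + \Lambda\,\mathbf{w} = \Phi^{\top}\mathbf{y}
 \;\Rightarrow\; \hat{\mathbf{w}} = A^{-1}\Phi^{\top}\mathbf{y}
\end{align*}
```

In this notation $\Phi$ is the design matrix and $A^{-1}$ is the variance matrix named on the slide.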
40Summary
Training set
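As a rough illustration of the summary, the sketch below fits the output weights of an RBFN with Gaussian hidden units by regularized least squares. It assumes fixed, evenly spaced centers, a single shared width `sigma`, and a single regularization parameter `lam`; the data and all names are hypothetical.

```python
import numpy as np

def design_matrix(X, centers, sigma):
    """Phi[k, j] = exp(-||x_k - c_j||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma**2))

def fit_rbfn(X, y, centers, sigma, lam=0.0):
    """Solve (Phi^T Phi + lam I) w = Phi^T y for the output weights."""
    Phi = design_matrix(X, centers, sigma)
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ y)

def predict_rbfn(X, centers, sigma, w):
    """Network output: linearly weighted sum of the hidden-unit activations."""
    return design_matrix(X, centers, sigma) @ w

# Hypothetical noisy training set on [0, 1].
rng = np.random.default_rng(0)
X = np.linspace(0.0, 1.0, 40)[:, None]
y = np.sin(2.0 * np.pi * X[:, 0]) + 0.05 * rng.standard_normal(40)

centers = np.linspace(0.0, 1.0, 8)[:, None]
w = fit_rbfn(X, y, centers, sigma=0.15, lam=1e-3)
y_hat = predict_rbfn(X, centers, 0.15, w)
rms_error = np.sqrt(np.mean((y - y_hat) ** 2))
```

Setting `lam=0.0` gives the unregularized least-squares solution, provided the design matrix has full column rank.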
41Introduction to Radial Basis Function Networks
- The Projection Matrix
42The Empirical-Error Vector
43The Empirical-Error Vector
Error Vector
44Sum-Squared-Error
If $\lambda = 0$, the RBFN's learning algorithm minimizes the SSE (equivalently, the MSE).
Error Vector
45The Projection Matrix
Error Vector
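A small numerical check of how the projection matrix relates to the error vector, assuming the form $P = I_p - \Phi A^{-1} \Phi^{\top}$ with $A = \Phi^{\top}\Phi + \lambda I$ from Orr (1996). The random design matrix, targets, and the value of `lam` are hypothetical.

```python
import numpy as np

def projection_matrix(Phi, lam=0.0):
    """P = I - Phi (Phi^T Phi + lam I)^{-1} Phi^T."""
    p, m = Phi.shape
    A_inv = np.linalg.inv(Phi.T @ Phi + lam * np.eye(m))
    return np.eye(p) - Phi @ A_inv @ Phi.T

rng = np.random.default_rng(1)
Phi = rng.standard_normal((20, 4))   # hypothetical design matrix (p = 20, m = 4)
y = rng.standard_normal(20)          # hypothetical target vector
lam = 0.5

P = projection_matrix(Phi, lam)
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(4), Phi.T @ y)
e = y - Phi @ w                      # empirical-error vector
sse = e @ e                          # sum-squared-error

P0 = projection_matrix(Phi, 0.0)     # without regularization P is a true projection
```

The error vector equals `P @ y`, so the SSE can be written as `y @ P @ P @ y` without computing the weights explicitly.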
46Introduction to Radial Basis Function Networks
- Learning the Kernels
47RBFNs as Universal Approximators
Training set
Kernels
48What to Learn?
- Weights $w_{ij}$'s
- Centers $\mu_j$'s of $\phi_j$'s
- Widths $\sigma_j$'s of $\phi_j$'s
- Number of $\phi_j$'s → Model Selection
49One-Stage Learning
50One-Stage Learning
The simultaneous update of all three sets of parameters may be suitable for non-stationary environments or on-line settings.
51Two-Stage Training
Step 1
- Determines
  - Centers $\mu_j$'s of $\phi_j$'s.
  - Widths $\sigma_j$'s of $\phi_j$'s.
  - Number of $\phi_j$'s.
Step 2
- Determines $w_{ij}$'s, e.g., using batch learning.
52Train the Kernels
53Unsupervised Training
54Methods
- Subset Selection
  - Random Subset Selection
  - Forward Selection
  - Backward Elimination
- Clustering Algorithms
  - KMEANS
  - LVQ
- Mixture Models
  - GMM
55Subset Selection
56Random Subset Selection
- Randomly choose a subset of points from the training set.
- Sensitive to the initially chosen points.
- Use some adaptive techniques to tune
  - Centers
  - Widths
  - Number of points
57Clustering Algorithms
Partition the data points into K clusters.
58Clustering Algorithms
Is such a partition satisfactory?
59Clustering Algorithms
How about this?
60Clustering Algorithms
[Figure: data partitioned into four clusters with centers $\mu_1$, $\mu_2$, $\mu_3$, $\mu_4$]
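One way to obtain kernel centers by clustering is plain k-means, sketched below with NumPy only. The width heuristic (root-mean-square distance of a cluster's members to its center) is an assumption added for illustration and is not stated on the slides; the two-blob data set is hypothetical.

```python
import numpy as np

def assign(X, centers):
    """Index of the nearest center for every row of X."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate nearest-center assignment and mean update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign(X, centers)
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:          # keep the old center if a cluster empties
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assign(X, centers)

def cluster_widths(X, centers, labels):
    """Assumed heuristic: RMS distance of each cluster's members to its center."""
    return np.array([
        np.sqrt(((X[labels == j] - c) ** 2).sum(axis=1).mean())
        for j, c in enumerate(centers)
    ])

# Hypothetical data: two well-separated blobs.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
                    rng.normal(10.0, 0.5, size=(50, 2))])
mu, labels = kmeans(X_demo, k=2)
sigma = cluster_widths(X_demo, mu, labels)
```

The resulting `mu` and `sigma` would then be fixed while the output weights are fitted in the second training stage.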
61Introduction to Radial Basis Function Networks
- Bias-Variance Dilemma
62Goal Revisit
- Ultimate Goal: Generalization
Minimize Prediction Error
- Goal of Our Learning Procedure
Minimize Empirical Error
63Badness of Fit
- Underfitting
  - A model (e.g., network) that is not sufficiently complex can fail to detect fully the signal in a complicated data set, leading to underfitting.
  - Produces excessive bias in the outputs.
- Overfitting
  - A model (e.g., network) that is too complex may fit the noise, not just the signal, leading to overfitting.
  - Produces excessive variance in the outputs.
64Underfitting/Overfitting Avoidance
- Model selection
- Jittering
- Early stopping
- Weight decay
- Regularization
- Ridge Regression
- Bayesian learning
- Combining networks
65Best Way to Avoid Overfitting
- Use lots of training data, e.g.,
  - 30 times as many training cases as there are weights in the network.
  - For noise-free data, 5 times as many training cases as weights may be sufficient.
- Don't arbitrarily reduce the number of weights for fear of underfitting.
66Badness of Fit
Underfit
Overfit
67Badness of Fit
Underfit
Overfit
68Bias-Variance Dilemma
However, it's not really a dilemma.
Underfit
Overfit
Large bias
Small bias
Small variance
Large variance
69Bias-Variance Dilemma
- More on overfitting
  - Easily leads to predictions that are far beyond the range of the training data.
  - Produces wild predictions in multilayer perceptrons even with noise-free data.
70Bias-Variance Dilemma
It's not really a dilemma.
71Bias-Variance Dilemma
The mean of the bias?
The variance of the bias?
The true model
bias
bias
bias
E.g., depend on hidden nodes used.
72Bias-Variance Dilemma
The mean of the bias?
The variance of the bias?
Variance
The true model
E.g., depend on hidden nodes used.
73Model Selection
Reduce the effective number of parameters.
Reduce the number of hidden nodes.
Variance
The true model
E.g., depend on hidden nodes used.
74Bias-Variance Dilemma
Goal
The true model
E.g., depend on hidden nodes used.
75Bias-Variance Dilemma
Goal
Goal
0
constant
76Bias-Variance Dilemma
0
77Bias-Variance Dilemma
Goal
$\text{bias}^2$
variance
Minimize both $\text{bias}^2$ and variance
noise
Cannot be minimized
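The decomposition behind these labels is missing from the slide text. A standard form is given below, assuming data generated as $y = g(\mathbf{x}) + \varepsilon$ with zero-mean noise of variance $\sigma^2$, and a fitted model $\hat{f}$ that depends on the random training set; the symbols $g$, $\varepsilon$, $\hat{f}$ and $\sigma^2$ are notational assumptions.

```latex
\begin{align*}
\mathbb{E}\bigl[(y - \hat{f}(\mathbf{x}))^2\bigr]
 &= \underbrace{\bigl(g(\mathbf{x}) - \mathbb{E}[\hat{f}(\mathbf{x})]\bigr)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\bigl[(\hat{f}(\mathbf{x}) - \mathbb{E}[\hat{f}(\mathbf{x})])^2\bigr]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
\end{align*}
```

The cross terms vanish in expectation because the noise has zero mean and is independent of the training set, which is why only three terms remain and why the noise term is unaffected by the choice of model.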
78Model Complexity vs. Bias-Variance
Goal
$\text{bias}^2$
variance
noise
Model Complexity (Capacity)
79Bias-Variance Dilemma
Goal
$\text{bias}^2$
variance
noise
80Example (Polynomial Fits)
81Example (Polynomial Fits)
82Example (Polynomial Fits)
Degree 1
Degree 5
Degree 10
Degree 15
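The plots for this example are not recoverable, but the experiment can be sketched: fit polynomials of degree 1, 5, 10 and 15 to the same noisy sample and compare training and test error. The target function, noise level, and sample size below are hypothetical choices, not the ones used on the slides.

```python
import numpy as np
from numpy.polynomial import Polynomial

def true_fn(x):
    """Hypothetical underlying function."""
    return np.sin(2.0 * np.pi * x)

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0.0, 1.0, 30))
y_train = true_fn(x_train) + 0.2 * rng.standard_normal(30)
x_test = np.linspace(x_train.min(), x_train.max(), 200)
y_test = true_fn(x_test)

results = {}
for degree in (1, 5, 10, 15):
    # Polynomial.fit rescales x internally, which keeps high degrees better conditioned.
    poly = Polynomial.fit(x_train, y_train, degree)
    train_mse = np.mean((poly(x_train) - y_train) ** 2)
    test_mse = np.mean((poly(x_test) - y_test) ** 2)
    results[degree] = (train_mse, test_mse)
    print(f"degree {degree:2d}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

Training error can only go down as the degree grows, because each lower-degree model is contained in the higher-degree one; test error is the quantity that reveals underfitting at low degree and overfitting at high degree.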
83Introduction to Radial Basis Function Networks
- The Effective Number of Parameters
84Variance Estimation
Mean
Variance
85Variance Estimation
Mean
Variance
Loses 1 degree of freedom
86Simple Linear Regression
87Simple Linear Regression
Minimize
88Mean Squared Error (MSE)
Minimize
Loses 2 degrees of freedom
89Variance Estimation
Loses m degrees of freedom
m parameters of the model
90The Number of Parameters
m
degrees of freedom
91The Effective Number of Parameters ($\gamma$)
The projection Matrix
92The Effective Number of Parameters ($\gamma$)
Facts
(Proof)
The projection Matrix
93Regularization
The effective number of parameters
Penalize models with large weights
SSE
94Regularization
The effective number of parameters
Without penalty ($\lambda_i = 0$), there are m degrees of freedom to minimize SSE (Cost). The effective number of parameters $\gamma = m$.
Penalize models with large weights
SSE
95Regularization
The effective number of parameters
With penalty ($\lambda_i > 0$), the liberty to minimize SSE will be reduced. The effective number of parameters $\gamma < m$.
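A quick numerical illustration of the two cases, assuming the definition $\gamma = p - \operatorname{trace}(P)$ from Orr (1996) and, for simplicity, a single shared penalty `lam` in place of per-weight penalties. The random design matrix is hypothetical.

```python
import numpy as np

def effective_parameters(Phi, lam):
    """gamma = p - trace(P), with P = I - Phi (Phi^T Phi + lam I)^{-1} Phi^T."""
    p, m = Phi.shape
    A_inv = np.linalg.inv(Phi.T @ Phi + lam * np.eye(m))
    P = np.eye(p) - Phi @ A_inv @ Phi.T
    return p - np.trace(P)

rng = np.random.default_rng(2)
Phi = rng.standard_normal((30, 6))            # hypothetical design matrix: p = 30, m = 6

gamma_unpenalized = effective_parameters(Phi, 0.0)   # equals m when there is no penalty
gamma_penalized = effective_parameters(Phi, 5.0)     # strictly between 0 and m
```

Increasing `lam` shrinks the weights and lowers the effective number of parameters even though the number of hidden units stays the same.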
Penalize models with large weights
SSE
96Variance Estimation
The effective number of parameters
Loses $\gamma$ degrees of freedom
97Variance Estimation
The effective number of parameters
98Introduction to Radial Basis Function Networks
- Model Selection
99Model Selection
- Goal
- Choose the fittest model
- Criteria
- Least prediction error
- Main Tools (Estimate Model Fitness)
- Cross validation
- Projection matrix
- Methods
- Weight decay (Ridge regression)
- Pruning and Growing RBFNs
100Empirical Error vs. Model Fitness
- Ultimate Goal: Generalization
Minimize Prediction Error
- Goal of Our Learning Procedure
Minimize Empirical Error
(MSE)
Minimize Prediction Error
101Estimating Prediction Error
- When you have plenty of data, use independent test sets.
  - E.g., use the same training set to train different models, and choose the best model by comparing on the test set.
- When data is scarce, use
  - Cross-Validation
  - Bootstrap
102Cross Validation
- Simplest and most widely used method for estimating prediction error.
- Partition the original set in several different ways and compute an average score over the different partitions, e.g.,
  - K-fold Cross-Validation
  - Leave-One-Out Cross-Validation
  - Generalized Cross-Validation
103K-Fold CV
- Split the set, say, D of available input-output patterns into k mutually exclusive subsets, say D1, D2, ..., Dk.
- Train and test the learning algorithm k times; each time it is trained on D\Di and tested on Di.
104K-Fold CV
Available Data
105K-Fold CV
Test Set
Training Set
Available Data
D1
D2
D3
. . .
Dk
Estimate $\sigma^2$
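The k-fold procedure can be sketched generically as below. The `fit` and `predict` callables, the random shuffling before splitting, and the linear toy model are assumptions made for illustration; any learner with the same two operations could be plugged in.

```python
import numpy as np

def k_fold_mse(X, y, k, fit, predict, seed=0):
    """Average squared test error over k mutually exclusive folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    sq_errors = []
    for i in range(k):
        test_idx = folds[i]                                   # D_i
        train_idx = np.concatenate(
            [folds[j] for j in range(k) if j != i])           # D \ D_i
        model = fit(X[train_idx], y[train_idx])
        sq_errors.append((predict(model, X[test_idx]) - y[test_idx]) ** 2)
    return np.mean(np.concatenate(sq_errors))

# Hypothetical learner: ordinary least squares on the raw inputs.
def fit_ls(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def predict_ls(w, X):
    return X @ w

rng = np.random.default_rng(4)
X = rng.standard_normal((20, 2))
y = X @ np.array([1.0, -2.0])          # noise-free linear target
cv_error = k_fold_mse(X, y, k=5, fit=fit_ls, predict=predict_ls)
```

With `k` equal to the number of patterns, the same function performs leave-one-out cross-validation.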
106Leave-One-Out CV
A special case of k-fold CV.
- Split the p available input-output patterns into a training set of size $p - 1$ and a test set of size 1.
- Average the squared error on the left-out pattern over the p possible ways of partition.
107Error Variance Predicted by LOO
A special case of k-fold CV.
The estimate for the variance of prediction error
using LOO
Error-square for the left-out element.
108Error Variance Predicted by LOO
A special case of k-fold CV.
Given a model, the function with least empirical
error for Di.
Serves as an index of the model's fitness. We want to find a model that also minimizes this.
The estimate for the variance of prediction error
using LOO
Error-square for the left-out element.
109Error Variance Predicted by LOO
A special case of k-fold CV.
Are there any efficient ways?
How to estimate?
The estimate for the variance of prediction error
using LOO
Error-square for the left-out element.
110Error Variance Predicted by LOO
Error-square for the left-out element.
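For a linear-in-the-weights model with a fixed penalty, the left-out error does not require p separate fits: it equals the ordinary residual divided by the corresponding diagonal entry of the projection matrix. The sketch below checks this shortcut against brute-force retraining, assuming $P = I - \Phi(\Phi^{\top}\Phi + \lambda I)^{-1}\Phi^{\top}$ as in Orr (1996); the data are hypothetical.

```python
import numpy as np

def loo_variance_fast(Phi, y, lam):
    """LOO estimate from a single fit: mean of ((P y)_i / P_ii)^2."""
    p, m = Phi.shape
    A_inv = np.linalg.inv(Phi.T @ Phi + lam * np.eye(m))
    P = np.eye(p) - Phi @ A_inv @ Phi.T
    loo_residuals = (P @ y) / np.diag(P)
    return np.mean(loo_residuals ** 2)

def loo_variance_brute(Phi, y, lam):
    """LOO estimate by refitting p times, each time leaving one pattern out."""
    p, m = Phi.shape
    sq_errors = []
    for i in range(p):
        keep = np.arange(p) != i
        A = Phi[keep].T @ Phi[keep] + lam * np.eye(m)
        w = np.linalg.solve(A, Phi[keep].T @ y[keep])
        sq_errors.append((y[i] - Phi[i] @ w) ** 2)
    return np.mean(sq_errors)

rng = np.random.default_rng(5)
Phi = rng.standard_normal((25, 5))     # hypothetical design matrix
y = rng.standard_normal(25)            # hypothetical targets
fast = loo_variance_fast(Phi, y, lam=0.3)
brute = loo_variance_brute(Phi, y, lam=0.3)
```

The two estimates agree to numerical precision, while the fast version needs only one matrix inversion.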
111Generalized Cross-Validation
112More Criteria Based on CV
GCV (Generalized CV)
Akaike's Information Criterion
UEV (Unbiased estimate of variance)
FPE (Final Prediction Error)
BIC (Bayesian Information Criterion)
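The formulas for these criteria are missing from the slide text. The sketch below uses the forms given in Orr (1996), written in terms of the sum-squared-error $\mathbf{y}^{\top}P^2\mathbf{y}$ and the effective number of parameters $\gamma$; treat them as a reconstruction from that reference, with a single shared penalty `lam` assumed for simplicity.

```python
import numpy as np

def selection_criteria(Phi, y, lam):
    """Variance estimates used for model selection (forms as in Orr, 1996)."""
    p, m = Phi.shape
    A_inv = np.linalg.inv(Phi.T @ Phi + lam * np.eye(m))
    P = np.eye(p) - Phi @ A_inv @ Phi.T
    sse = y @ P @ P @ y                 # y^T P^2 y
    gamma = p - np.trace(P)             # effective number of parameters
    return {
        "UEV": sse / (p - gamma),
        "FPE": (p + gamma) / (p * (p - gamma)) * sse,
        "GCV": p * sse / (p - gamma) ** 2,
        "BIC": (p + (np.log(p) - 1.0) * gamma) / (p * (p - gamma)) * sse,
    }

rng = np.random.default_rng(6)
Phi = rng.standard_normal((40, 5))      # hypothetical design matrix
y = rng.standard_normal(40)             # hypothetical targets
criteria = selection_criteria(Phi, y, lam=0.1)
```

All four are the same SSE multiplied by a different penalty factor in $p$ and $\gamma$, so they differ only in how strongly they punish model complexity.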
113More Criteria Based on CV
114More Criteria Based on CV
115Regularization
Standard Ridge Regression,
Penalize models with large weights
SSE
116Regularization
Standard Ridge Regression,
Penalize models with large weights
SSE
117Solution Review
Used to compute model selection criteria
118Example
Width of RBF r = 0.5
119Example
Width of RBF r = 0.5
120Example
Width of RBF r = 0.5
How to determine the optimal regularization parameter effectively?
121Optimizing the Regularization Parameter
Re-Estimation Formula
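The re-estimation formula itself did not survive in the slide text, so it is not reproduced here. As a simpler stand-in that pursues the same goal, the sketch below picks the regularization parameter by evaluating GCV on a grid of candidate values. The GCV form follows Orr (1996); the data, centers, width, and grid are hypothetical.

```python
import numpy as np

def gcv(Phi, y, lam):
    """Generalized cross-validation score: p * y^T P^2 y / trace(P)^2."""
    p, m = Phi.shape
    A_inv = np.linalg.inv(Phi.T @ Phi + lam * np.eye(m))
    P = np.eye(p) - Phi @ A_inv @ Phi.T
    e = P @ y
    return p * (e @ e) / np.trace(P) ** 2

def best_lambda(Phi, y, grid):
    """Grid search (not the re-estimation formula): return the GCV minimizer."""
    scores = np.array([gcv(Phi, y, lam) for lam in grid])
    return grid[scores.argmin()], scores

# Hypothetical 1-D problem with Gaussian hidden units.
rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2.0 * np.pi * x) + 0.1 * rng.standard_normal(50)
centers = np.linspace(0.0, 1.0, 10)
Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2.0 * 0.1**2))

grid = np.logspace(-6, 2, 17)
lam_best, scores = best_lambda(Phi, y, grid)
```

A grid search evaluates every candidate and so cannot get trapped by the starting point, at the cost of one matrix inversion per candidate value.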
122Local Ridge Regression
Re-Estimation Formula
123Example
Width of RBF
124Example
Width of RBF
125Example
Width of RBF
There are two local-minima.
Using the above re-estimation formula, it will be stuck at the nearest local minimum.
That is, the solution depends on the initial setting.
126Example
Width of RBF
There are two local-minima.
127Example
Width of RBF
There are two local-minima.
128Example
Width of RBF
RMSE: Root Mean Squared Error. In real cases, it is not available.
129Example
Width of RBF
RMSE: Root Mean Squared Error. In real cases, it is not available.
130Local Ridge Regression
Standard Ridge Regression
Local Ridge Regression
131Local Ridge Regression
Standard Ridge Regression
$\lambda_j \to \infty$ implies that $\phi_j(\cdot)$ can be removed.
Local Ridge Regression
132The Solutions
Used to compute model selection criteria
133Optimizing the Regularization Parameters
Incremental Operation
$P$: the current projection matrix.
$P_j$: the projection matrix obtained by removing $\phi_j(\cdot)$.
134Optimizing the Regularization Parameters
Solve
Subject to
135Optimizing the Regularization Parameters
Solve
Subject to
136Optimizing the Regularization Parameters
Remove $\phi_j(\cdot)$
Solve
Subject to
137The Algorithm
- Initialize $\lambda_i$'s.
  - e.g., by performing standard ridge regression.
- Repeat the following until GCV converges
  - Randomly select j and compute
  - Perform local ridge regression
  - If GCV is reduced, remove $\phi_j(\cdot)$
138References
Orr, M. J. L. (April 1996), "Introduction to Radial Basis Function Networks," http://www.anc.ed.ac.uk/mjo/intro/intro.html.
Kohavi, R. (1995), "A study of cross-validation and bootstrap for accuracy estimation and model selection," International Joint Conference on Artificial Intelligence (IJCAI).
139Introduction to Radial Basis Function Networks