Title: Radial-Basis Function Networks
1. Radial-Basis Function Networks
- CS/CMPE 537 Neural Networks
2. Introduction
- Typical tasks performed by neural networks are association, classification, and filtering. This categorization has historical significance as well.
- These tasks involve input-output mappings, and the network is designed to learn the mapping from knowledge of the problem environment.
- Thus, the design of a neural network can be viewed as a curve-fitting or function approximation problem.
- This viewpoint is the motivation for radial-basis function networks.
3. Radial-Basis Function Networks
- RBF networks are two-layer networks: input source nodes, hidden neurons with nonlinear basis functions, and output neurons with linear/nonlinear activation functions.
- The theory of radial-basis function networks is built upon function approximation theory in mathematics.
- RBF networks were first used in 1988. Major work was done by Moody and Darken (1989) and Poggio and Girosi (1990).
- In RBF networks, the mapping from the input space to the high-dimensional hidden space is nonlinear, while that from the hidden space to the output space is linear.
- What is the basis for this?
4. Radial-Basis Function Network
[Figure: network diagram - source nodes x1, ..., xp feed hidden neurons with RBF activation functions φ1(.), ..., φM(.), which connect through weights w to output neurons y1, y2]
5. Cover's Theorem (1)
- Cover's theorem (1965) provides the motivation for RBF networks.
- Cover's theorem on the separability of patterns:
- A complex pattern-classification problem cast nonlinearly in a high-dimensional space is more likely to be linearly separable than in a low-dimensional space.
6. Cover's Theorem (2)
- Consider a set X of N p-dimensional vectors (input patterns) x_1 to x_N. Let X+ and X- be a binary partition of X, and let φ(x) = [φ_1(x), φ_2(x), ..., φ_M(x)]^T.
- Cover's theorem:
- A binary partition (dichotomy) (X+, X-) of X is said to be φ-separable if there exists an M-dimensional vector w such that
- w^T φ(x) ≥ 0 when x belongs to X+
- w^T φ(x) < 0 when x belongs to X-
- Decision boundary or surface:
- w^T φ(x) = 0
7. Cover's Theorem (3)
8. Example (1)
- Consider the XOR problem to illustrate the significance of φ-separability and Cover's theorem.
- Define a pair of Gaussian hidden functions:
- φ_1(x) = exp(-‖x - t_1‖^2), t_1 = [1, 1]^T
- φ_2(x) = exp(-‖x - t_2‖^2), t_2 = [0, 0]^T
- The output of these functions for each pattern is computed in the sketch below.
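A minimal Python sketch (an illustration, not part of the original slides) that evaluates φ_1 and φ_2 at the four XOR patterns. In the (φ_1, φ_2) plane, (1,1) and (0,0) map to opposite corners while (0,1) and (1,0) coincide near the middle, so the two XOR classes become linearly separable:

```python
import numpy as np

# The four XOR input patterns and the two Gaussian centers from the slide
patterns = np.array([[1, 1], [0, 1], [0, 0], [1, 0]], dtype=float)
t1 = np.array([1.0, 1.0])
t2 = np.array([0.0, 0.0])

# phi_i(x) = exp(-||x - t_i||^2)
for x in patterns:
    phi1 = np.exp(-np.sum((x - t1) ** 2))
    phi2 = np.exp(-np.sum((x - t2) ** 2))
    print(x, round(phi1, 4), round(phi2, 4))
# (1,1) -> (1.0, 0.1353), (0,1) and (1,0) -> (0.3679, 0.3679), (0,0) -> (0.1353, 1.0)
```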
9. Example (2)
10. Function Approximation (1)
- Function approximation seeks to describe the behavior of complex functions by ensembles of simpler functions.
- Describe f(x) by F(x).
- F(x) can be described in a compact region of the input space by
- F(x) = Σ_{i=1}^N w_i φ_i(x)
- such that
- ‖f(x) - F(x)‖ < ε
- ε can be made arbitrarily small.
- Choice of the functions φ(.)?
11. Function Approximation (2)
- Find F(x) that best approximates the map/function f. The best approximation is problem dependent; it can be strict interpolation or good generalization (regularized interpolation).
- Design decisions:
- Choice of elementary functions φ(.)
- How to compute the weights w?
- How many elementary functions to use (i.e. what should N be)?
- How to obtain good generalization?
12. Choice of Elementary Functions φ
- Let f(x) belong to the function space L^2(R^p) (true for almost all physical systems).
- We want the φ_i to form a basis of L^2.
- What is meant by a basis?
- A set of functions φ_i (i = 1, ..., M) is a basis of L^2 if linear superpositions of the φ_i can generate any function in L^2. Moreover, they must be linearly independent:
- w_1 φ_1 + w_2 φ_2 + ... + w_M φ_M = 0 iff w_i = 0 for all i
13. Interpolation Problem (1)
- In general, the map from an input space to an output space is given by
- f: R^p → R^q
- p and q are the input and output space dimensions; f is the map or hypersurface.
- Strict interpolation problem:
- Given a set of N different points x_i (i = 1, ..., N) and a corresponding set of N real numbers d_i (i = 1, ..., N), find a function F: R^p → R^1 that satisfies the interpolation condition
- F(x_i) = d_i,  i = 1, ..., N
- The function F passes through all the points.
14. Interpolation Problem (2)
- A common choice of φ(.) is a radially symmetric basis function:
- F(x) = Σ_{i=1}^N w_i φ(‖x - x_i‖)
- Substituting the data and writing in matrix form:
- Φw = d
- Φ = {φ_ji} (i, j = 1, ..., N) is the interpolation matrix, with φ_ji = φ(‖x_j - x_i‖)
- w: linear weight vector; d: desired response vector (a small numerical sketch follows below)
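A small numerical sketch of strict interpolation (not part of the slides; the toy 1-D data and the Gaussian width are assumptions): build the interpolation matrix Φ, solve Φw = d, and verify that F passes through every data point.

```python
import numpy as np

def gaussian(r, sigma=1.0):
    # Radial-basis function phi(r) = exp(-r^2 / (2 sigma^2)); sigma chosen arbitrarily
    return np.exp(-r ** 2 / (2.0 * sigma ** 2))

# Toy 1-D data: N points to be interpolated exactly
X = np.array([[0.0], [0.5], [1.0], [1.5], [2.0]])   # inputs x_i
d = np.sin(2.0 * X[:, 0])                           # desired responses d_i

# Interpolation matrix Phi_ji = phi(||x_j - x_i||)
r = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
Phi = gaussian(r)

# Strict interpolation: w = Phi^{-1} d (np.linalg.solve is preferred over an explicit inverse)
w = np.linalg.solve(Phi, d)

# F(x_j) = sum_i w_i phi(||x_j - x_i||) reproduces every d_j
print(np.allclose(Phi @ w, d))   # True
```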
15. Interpolation Problem (3)
- Φ is known to be positive definite for a certain class of radial-basis functions. Thus,
- w = Φ^{-1} d
- In theory, w can be computed. In practice, however, Φ is often close to singular.
- Then what?
- Regularization theory can be used to perturb Φ and make it non-singular.
- But there is another problem: poor generalization or overfitting.
16. Ill-Posed Problems
- Supervised learning is an ill-posed problem:
- There is not enough information in the training data to reconstruct the input-output mapping uniquely.
- The presence of noise or imprecision in the input data adds uncertainty to the reconstruction of the input-output mapping.
- To achieve good generalization, additional information about the domain is needed.
- In other words, the input-output patterns should exhibit redundancy.
- Redundancy is achieved when the physical generator of the data is smooth, and thus can be used to generate redundant input-output examples.
17. Regularization Theory (1)
- How to make an ill-posed problem well-posed?
- By constraining the mapping with additional information (e.g. smoothness) in the form of a nonnegative functional.
- Proposed by Tikhonov in 1963 in the context of function approximation in mathematics.
18. Regularization Theory (2)
- Input-output examples: (x_i, d_i), i = 1, ..., N
- Find the mapping F(x): R^p → R^1 for the input-output examples.
- In regularization theory, F is found by minimizing the cost functional ξ(F):
- ξ(F) = ξ_s(F) + λ ξ_c(F)
- Standard error term:
- ξ_s(F) = 0.5 Σ_{i=1}^N (d_i - y_i)^2 = 0.5 Σ_{i=1}^N (d_i - F(x_i))^2
- Regularization term:
- ξ_c(F) = 0.5 ‖P F(x)‖^2
- P: linear differential operator
19. Regularization Theory (3)
- The regularization term depends on the geometric properties of the approximating function.
- The selection of the operator P is therefore problem dependent, based on prior knowledge of the geometric properties of the actual function f(x) (e.g. the smoothness of f(x)).
- Regularization parameter λ: a positive real number.
- This parameter indicates the sufficiency of the given input-output examples in capturing the underlying function f(x).
- The solution of the regularization problem is a function F(x) of a particular type.
- We won't go into the details of how to find F, as it requires a good understanding of functional analysis.
20. Regularization Theory (4)
- The solution of the regularization problem yields
- F(x) = (1/λ) Σ_{i=1}^N [d_i - F(x_i)] G(x, x_i) = Σ_{i=1}^N w_i G(x, x_i)
- G(x, x_i): Green's function centered on x_i
- In matrix form:
- F = Gw
- or
- (G + λI)w = d
- and
- w = (G + λI)^{-1} d
- G depends only on the operator P. (A numerical sketch follows below.)
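A hedged numerical sketch of the regularized solution (the Gaussian Green's function, its width, the value of λ, and the noisy toy data are all assumptions made for illustration):

```python
import numpy as np

def green_gaussian(X, centers, sigma=1.0):
    # G(x, x_i) = exp(-||x - x_i||^2 / (2 sigma^2)); sigma is an assumed width
    r = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return np.exp(-r ** 2 / (2.0 * sigma ** 2))

# Noisy samples of an underlying smooth function
rng = np.random.default_rng(0)
X = np.linspace(0.0, 3.0, 25).reshape(-1, 1)
d = np.sin(2.0 * X[:, 0]) + 0.1 * rng.standard_normal(25)

G = green_gaussian(X, X)      # N x N matrix, one Green's function per example
lam = 0.5                     # regularization parameter lambda (assumed value)

# w = (G + lambda I)^{-1} d
w = np.linalg.solve(G + lam * np.eye(len(X)), d)

# F(x_j) = sum_i w_i G(x_j, x_i): no longer interpolates exactly, but is smoother
F = G @ w
print(float(np.mean((F - d) ** 2)))   # nonzero training error, traded for smoothness
```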
21. Type of Function G(x; x_i)
- If P is translationally invariant, then G(x; x_i) depends only on the difference of x and x_i, i.e.
- G(x; x_i) = G(x - x_i)
- If P is both translationally and rotationally invariant, then G(x; x_i) depends only on the Euclidean norm of the difference vector x - x_i, i.e.
- G(x; x_i) = G(‖x - x_i‖)
- This is a radial-basis function.
- If P is further constrained, and G(x; x_i) is positive definite, then we have the Gaussian radial-basis function, i.e.
- G(x; x_i) = exp(-(1/(2σ^2)) ‖x - x_i‖^2)
22. Regularization Network (1)
23. Regularization Network (2)
- The regularization network is based on the regularized interpolation problem:
- F(x) = Σ_{i=1}^N w_i G(x, x_i)
- It has 3 layers:
- Input layer of p source nodes, where p is the dimension of the input vector x (or the number of independent variables).
- Hidden layer with N neurons, where N is the number of input-output examples. Each neuron uses the activation function G(x; x_i).
- Output layer with q neurons, where q is the output dimension.
- The unknowns are the weights w (only) from the hidden layer to the output layer.
24. RBF Networks (in Practice) (1)
- The regularization network requires N hidden neurons, which becomes computationally expensive for large N.
- The complexity of the network is reduced to obtain an approximate solution to the regularization problem.
- The approximate solution F(x) is then given by
- F(x) = Σ_{i=1}^M w_i φ_i(x)
- φ_i(x) (i = 1, ..., M): a new set of basis functions; M is typically less than N.
- Using radial-basis functions:
- F(x) = Σ_{i=1}^M w_i φ(‖x - t_i‖) (see the least-squares sketch below)
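A brief sketch of the reduced network with M < N (the grid placement of the centers, the width, and the toy data are assumptions): with fewer basis functions than examples the system Φw = d is overdetermined, so the weights are fit in the least-squares sense.

```python
import numpy as np

def design_matrix(X, centers, sigma):
    # N x M matrix of phi(||x_j - t_i||) with a Gaussian basis (sigma assumed)
    r = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return np.exp(-r ** 2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(1)
X = np.linspace(0.0, 3.0, 100).reshape(-1, 1)          # N = 100 examples
d = np.sin(2.0 * X[:, 0]) + 0.1 * rng.standard_normal(100)

# M = 10 centers t_i placed on a grid over the input range (one of several possible choices)
centers = np.linspace(0.0, 3.0, 10).reshape(-1, 1)
Phi = design_matrix(X, centers, sigma=0.4)             # N x M, M < N

# Overdetermined system: fit the linear weights in the least-squares sense
w, *_ = np.linalg.lstsq(Phi, d, rcond=None)
print(Phi.shape, w.shape)                              # (100, 10) (10,)
```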
25. RBF Networks (2)
26. RBF Networks (3)
- Unknowns in the RBF network:
- M, the number of hidden neurons (M < N)
- The centers t_i of the radial-basis functions
- And the weights w
27. How to Train RBF Networks - Learning
- Normally the training of the hidden-layer parameters (number of hidden neurons, centers, and variance of the Gaussians) is done prior to the training of the weights (i.e. on a different time scale).
- This is justified by the fact that the hidden layer performs a different task (nonlinear) than the output-layer weights (linear).
- The weights are learned by supervised learning using an appropriate algorithm (LMS or BP).
- The hidden-layer parameters are learned by (in general, but not always) unsupervised learning.
28. Fixed Centers Selected at Random
- Randomly select M inputs as centers for the activation functions.
- Fix the variance of the Gaussians based on the distance between the selected centers. A radial-basis function centered at t_i is then given by
- φ(‖x - t_i‖) = exp(-(M/d^2) ‖x - t_i‖^2)
- d: maximum distance between the chosen centers
- The width or standard deviation of the functions is fixed, given by σ = d/√(2M).
- The linear weights are then computed by solving the regularization problem or by using supervised learning (see the sketch below).
29. Self-Organized Selection of Centers
- Use a self-organizing or clustering technique to determine the number and centers of the Gaussian functions.
- A common algorithm is the k-means algorithm, which assigns each vector x to the nearest of the k cluster centers and then updates each center to be the mean of the vectors assigned to it.
- Then compute the weights using supervised error-correction learning such as LMS (sketched below).
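A compact sketch combining plain k-means for the centers with LMS for the output weights (the toy data, the number of clusters, the common width σ, and the learning rate are all assumptions):

```python
import numpy as np

def kmeans(X, k, iters=50, rng=None):
    # Plain k-means: assign each x to its nearest center, then move each center
    # to the mean of the vectors assigned to it.
    rng = rng or np.random.default_rng(0)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

rng = np.random.default_rng(3)
X = rng.uniform(-1.0, 1.0, size=(300, 2))
d = np.sign(X[:, 0] * X[:, 1])                 # toy XOR-like targets

centers = kmeans(X, k=10, rng=rng)             # hidden-layer centers
sigma = 0.5                                    # assumed common width
Phi = np.exp(-np.linalg.norm(X[:, None] - centers[None], axis=2) ** 2 / (2 * sigma ** 2))

# LMS (delta rule) for the linear output weights
w = np.zeros(len(centers))
eta = 0.05                                     # learning rate (assumed)
for epoch in range(100):
    for phi, target in zip(Phi, d):
        w += eta * (target - phi @ w) * phi
print(np.mean((Phi @ w - d) ** 2))             # training mean-squared error
```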
30. Supervised Selection of Centers
- All unknown parameters are trained using error-correcting supervised learning.
- A gradient-descent approach is used to find the minimum of the cost function with respect to the weights w_i, the activation-function centers t_i, and the spreads σ_i (a minimal sketch follows below).
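A minimal batch gradient-descent sketch over all three parameter sets, using the cost 0.5 Σ_j e_j^2 with e_j = d_j - F(x_j) (the toy data, initialization, learning rate, and the floor on the spreads are assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.uniform(0.0, 3.0, size=(100, 1))               # N one-dimensional inputs
d = np.sin(2.0 * X[:, 0])                              # targets
N, M = len(X), 8

t = X[rng.choice(N, size=M, replace=False)].copy()     # centers t_i, started at data points
sig = np.full(M, 0.5)                                  # spreads sigma_i (assumed start)
w = np.zeros(M)                                        # linear weights w_i
eta = 0.1                                              # learning rate (assumed)

for epoch in range(2000):
    diff = X[:, None, :] - t[None, :, :]               # (N, M, p)
    sq = np.sum(diff ** 2, axis=2)                     # ||x_j - t_i||^2
    Phi = np.exp(-sq / (2.0 * sig ** 2))               # hidden-layer outputs
    e = d - Phi @ w                                    # errors e_j = d_j - F(x_j)

    # Batch gradients of 0.5 * sum_j e_j^2 with respect to w_i, t_i and sigma_i
    grad_w = -(Phi.T @ e) / N
    grad_t = -np.einsum('j,ji,jid->id', e, Phi, diff) * (w / sig ** 2)[:, None] / N
    grad_s = -(w / sig ** 3) * (e[:, None] * Phi * sq).sum(axis=0) / N

    w = w - eta * grad_w
    t = t - eta * grad_t
    sig = np.maximum(sig - eta * grad_s, 0.05)         # keep the spreads positive

print(float(0.5 * np.sum(e ** 2)))                     # training cost after the last epoch
```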
31. Example (1)
- Classify between two overlapping two-dimensional, Gaussian-distributed patterns.
- Conditional probability density functions for the two classes:
- f(x | C1) = (1/(2πσ_1^2)) exp(-(1/(2σ_1^2)) ‖x - μ_1‖^2)
- μ_1 = mean = [0, 0]^T and σ_1^2 = variance = 1
- f(x | C2) = (1/(2πσ_2^2)) exp(-(1/(2σ_2^2)) ‖x - μ_2‖^2)
- μ_2 = mean = [2, 0]^T and σ_2^2 = variance = 4
- x = [x_1, x_2]^T: two-dimensional input
- C1 and C2: class labels
32. Example (2)
33. Example (3)
34. Example (4)
- Consider a two-input, M-hidden-neuron, two-output RBF network.
- Decision rule: an input x is classified to C1 if y1 > 0.
- The training set is generated from the probability distribution functions.
- Using the perceptron algorithm, the network is trained for minimum mean-square error.
- The testing set is generated from the probability distribution functions.
- The trained network is tested for correct classification.
- For other implementation details, see the Matlab code (a rough Python rendering is sketched below).
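A rough Python rendering of the experiment, since the Matlab code itself is not reproduced here (the number of centers M, the width, the sample sizes, and the least-squares fit of the output layer, used here in place of the perceptron/minimum-MSE training mentioned above, are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)

def sample(n, mu, var):
    # Draw n points from an isotropic 2-D Gaussian N(mu, var * I)
    return mu + np.sqrt(var) * rng.standard_normal((n, 2))

# Training and test sets drawn from the two class-conditional densities of the slide
mu1, var1 = np.array([0.0, 0.0]), 1.0
mu2, var2 = np.array([2.0, 0.0]), 4.0
Xtr = np.vstack([sample(500, mu1, var1), sample(500, mu2, var2)])
Dtr = np.vstack([np.tile([1.0, -1.0], (500, 1)), np.tile([-1.0, 1.0], (500, 1))])
Xte = np.vstack([sample(500, mu1, var1), sample(500, mu2, var2)])
yte = np.array([0] * 500 + [1] * 500)            # 0 = C1, 1 = C2

# Hidden layer: M centers chosen at random from the training inputs (one simple option)
M, sigma = 20, 2.0
centers = Xtr[rng.choice(len(Xtr), size=M, replace=False)]

def hidden(X):
    r = np.linalg.norm(X[:, None] - centers[None], axis=2)
    return np.exp(-r ** 2 / (2 * sigma ** 2))

# Output layer: two linear outputs fit by least squares
W, *_ = np.linalg.lstsq(hidden(Xtr), Dtr, rcond=None)

# Decision rule from the slide: x is assigned to C1 if y1 > 0
Y = hidden(Xte) @ W
pred = np.where(Y[:, 0] > 0, 0, 1)
print("test accuracy:", np.mean(pred == yte))
```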
35. Example: Function Approximation (1)
- Approximate the relationship between a car's fuel economy (in miles per gallon) and its characteristics.
- Input data description: 9 independent discrete-valued, boolean, and continuous variables
- X1: number of cylinders
- X2: displacement
- X3: horsepower
- X4: weight
- X5: acceleration
- X6: model year
- X7: Made in US? (0,1)
- X8: Made in Europe? (0,1)
- X9: Made in Japan? (0,1)
- Output f(X) is fuel economy in miles per gallon.
36. Example: Function Approximation (2)
- Using the NNET toolbox, create and train an RBF network with the function newrb.
- The function parameters allow you to set the mean-squared-error goal of the training, the spread of the radial-basis functions, and the maximum number of hidden-layer neurons.
- newrb uses the following approach to find the unknowns (it is a self-organizing approach):
- Start with one hidden neuron; compute the network error.
- Add another neuron with its center equal to the input vector that produced the maximum error; compute the network error.
- If the network error does not improve significantly, stop; otherwise go to the previous step and add another neuron. (A rough sketch of this greedy scheme is given below.)
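A hedged Python sketch of that greedy strategy (this is not MathWorks' newrb implementation; the basis function, the spread, the error goal, and the least-squares refit at each step are assumptions):

```python
import numpy as np

def rbf_design(X, centers, spread):
    r = np.linalg.norm(X[:, None] - centers[None], axis=2)
    return np.exp(-(r / spread) ** 2)

def incremental_rbf(X, d, goal=1e-3, spread=1.0, max_neurons=50):
    # Start with one hidden neuron, then keep adding a neuron centered on the
    # worst-fit input until the mean-squared-error goal or the neuron limit is met.
    centers = [X[0]]                                   # arbitrary first center
    while True:
        Phi = rbf_design(X, np.array(centers), spread)
        w, *_ = np.linalg.lstsq(Phi, d, rcond=None)    # refit the output weights
        err = d - Phi @ w
        mse = float(np.mean(err ** 2))
        if mse <= goal or len(centers) >= max_neurons:
            return np.array(centers), w, mse
        # Add a neuron centered on the worst-fit input that is not already a center
        for j in np.argsort(-np.abs(err)):
            if not any(np.allclose(X[j], c) for c in centers):
                centers.append(X[j])
                break
        else:                                          # every input is already a center
            return np.array(centers), w, mse

X = np.linspace(0.0, 3.0, 80).reshape(-1, 1)
d = np.sin(2.0 * X[:, 0])
centers, w, mse = incremental_rbf(X, d, goal=1e-4, spread=0.6)
print(len(centers), mse)
```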
37. Comparison of RBF Network and MLP (1)
- Both are universal approximators. Thus, an RBF network exists for every MLP, and vice versa.
- An RBF network has a single hidden layer, while an MLP can have multiple hidden layers.
- The computational neurons of an MLP all share the same model, while the neurons in the hidden and output layers of an RBF network have different models.
- The activation functions of the hidden nodes of an RBF network are based on the Euclidean norm of the input with respect to a center, while those of an MLP are based on the inner product of the input and the weights.
38. Comparison of RBF Network and MLP (2)
- MLPs construct global approximations to nonlinear input-output mappings. This is a consequence of the global (sigmoidal) activation functions used in MLPs.
- As a result, MLPs can perform generalization in regions where input data are not available (i.e. extrapolation).
- RBF networks construct local approximations to the input-output data. This is a consequence of the local Gaussian functions.
- As a result, RBF networks are capable of fast learning from the training data.