Title: Radial Basis Functions
1. Radial Basis Functions
- RBF: Chapter 5 of the text
- Also Martin Hagan's book, Neural Network Design
2. RBF
- We have seen that a BPNN (backpropagation neural network) can be used for function approximation and classification
- An RBFNN (radial basis function neural network) is another network that can be used for both kinds of problems
3. RBFNN vs. BPNN
- An RBFNN has only two layers, whereas a BPNN can have any number
- Normally all nodes of a BPNN use the same neuron model, whereas an RBFNN uses two different models: a nonlinear layer and a linear layer
- The argument of an RBFNN neuron is a Euclidean norm, whereas a BPNN neuron computes a dot product
- An RBFNN trains faster than a BPNN
4. Dot Product
- If we view the weights of a neuron as a vector W = [w1, w2, ..., wn]^T, and the input to the neuron as a vector X = [x1, x2, ..., xn]^T, then the dot product is
  W^T X = w1*x1 + w2*x2 + ... + wn*xn
5. Euclidean Norm
- If we view the weights of a neuron as a vector W = [w1, w2, ..., wn]^T, and the input to the neuron as a vector X = [x1, x2, ..., xn]^T, then the Euclidean norm is
  ||X - W|| = sqrt( (x1 - w1)^2 + (x2 - w2)^2 + ... + (xn - wn)^2 )
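- A minimal numeric sketch of the two quantities in plain MATLAB (the vector values here are made up for illustration):

  % Hypothetical 3-element weight and input vectors
  W = [0.5; -1.0; 2.0];
  X = [1.0;  0.5; 1.5];

  dotProduct    = W' * X;       % BPNN-style net input: W^T X
  euclideanNorm = norm(X - W);  % RBFNN-style net input: ||X - W||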
6. RBFNN vs. BPNN
- An RBFNN often leads to better decision boundaries
- Hidden-layer units of an RBFNN have a much more natural interpretation
- The RBFNN learning phase may be unsupervised and thus could lose information
- A BPNN may give a more compact representation
7. (No transcript)
8. (No transcript)
9. RBF
- There is considerable theory behind the application of RBFNNs, or simply RBFs
- Much of it is covered in the text
- Our emphasis will be on the results, so we will not cover the derivations, etc.
10. RBF
- There are 2 general categories of RBFNNs:
- Classification: separation of hyperspace (first portion of the chapter)
- Function approximation: use an RBFNN to approximate some nonlinear function
11. RBF - Classification
- Cover's Theorem on the separability of patterns:
- A complex pattern-classification problem re-cast nonlinearly in a high(er)-dimensional space is more likely to be linearly separable than in a low-dimensional space, provided that the space is not densely populated.
12. Cover's Theorem
- Given an input vector X = [x1, ..., xk] (of dimension k), if we recast it using some set of nonlinear transfer functions φ on each of the input parameters (into a dimension m > k), then it is more likely that this new set will be linearly separable
- In some cases, simply using a nonlinear mapping and not changing (increasing) the dimensionality (m) is sufficient
13. Cover's Theorem - Corollary
- The expected (average) maximum number of randomly assigned patterns (vectors) that are linearly separable in a space of dimensionality m is 2m
- Stated another way:
- 2m is a definition of the separating capacity of a family of decision surfaces having m degrees of freedom
14. Cover's Theorem
15. Cover's Theorem
- Another way to think of it (Cover's theorem) in terms of probability is as follows:
- Let's say we have N points in an m0-dimensional space. Furthermore, let's say we randomly assign those N points to one of two classes. Let P(N, m1) denote the probability that this random assignment is linearly separable for some set of functions φ.
16. Cover's Theorem
- The author gives a rather complicated expression for P(N, m1)
- Think of m1 as the number of nonlinear transformations
- However, what is important is that he shows that:
- P(2*m1, m1) = 1/2
- As m1 gets larger, P(N, m1) approaches 1
17. Thesis Topic
- Use an evolutionary approach to find a minimum set of φs that makes X linearly separable.
18. XOR Problem
- In the XOR problem (x1 XOR x2), we want a 1 output when the inputs are not equal, else a 0 output.
- This is not a linearly separable problem
- Observe what happens if we use two nonlinear functions applied to (x1, x2)
19. XOR
20. XOR
21. XOR
- If one plots these new points (next slide), one can see that they are now linearly separable
22. (Figure: plot of the transformed points, labeled (1,1), (0,1), (1,0), (0,0), showing that they are linearly separable)
23. XOR
- What the previous slides show is that by introducing a nonlinearity, even without increasing the dimensionality, the space becomes linearly separable
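- A small sketch of this mapping in plain MATLAB, assuming the Gaussian functions with centers (1,1) and (0,0) that reappear in the XOR example on slide 74 (the slides with the actual φ definitions were not transcribed):

  % The four XOR input patterns and two Gaussian transfer functions
  X  = [1 1; 0 1; 0 0; 1 0];          % input patterns (rows)
  t1 = [1 1];  t2 = [0 0];            % assumed centers
  phi1 = exp(-sum((X - t1).^2, 2));   % phi1(x) = exp(-||x - t1||^2)
  phi2 = exp(-sum((X - t2).^2, 2));   % phi2(x) = exp(-||x - t2||^2)
  disp([phi1 phi2])                   % transformed points; now linearly separable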
24. Regression
- In a paper by Scott (1992), he showed that in the presence of noisy data, a good approximation of the relation between input/output pairs is a weighted (kernel) average of the observed outputs
25. Regression
- After much work, it can be shown that in the discrete domain (assuming Gaussian noise) the estimate takes the form
  F(x) = sum_i d_i * exp(-||x - x_i||^2 / (2*σ^2)) / sum_j exp(-||x - x_j||^2 / (2*σ^2))
26. Regression
- The function above is called a normalized Gaussian, and also a normalized RBF. Often the denominator is left off.
- What it says is that a good approximation or regression function is an RBF-based network of such functions
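- A minimal sketch of this normalized-RBF estimate in plain MATLAB (the data, the width σ, and the query point are invented for illustration):

  % Hypothetical noisy training data and a query point
  x = [-2 -1 0 1 2];            % training inputs
  d = [0.1 0.9 2.1 2.9 4.2];    % noisy target outputs
  sigma = 0.8;                  % kernel width (assumed)
  xq = 0.5;                     % query point

  k = exp(-(xq - x).^2 / (2*sigma^2));  % Gaussian kernel values
  F = sum(d .* k) / sum(k);             % normalized-RBF estimate at xq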
27. Regularization
- An important point to remember is that we always have noise in the data, so it is probably not a good idea to develop a function that exactly fits the data.
- Another way to think of it: fitting the data too closely will likely give poor generalization
- The next slide shows a bad fit, and the one after that a good one
28-33. (No transcripts: figures not captured)
34. Matlab RBF
- The following slide shows Matlab's view of an RBF network.
35. (No transcript)
36. Spread and Bias in Matlab
- From the preceding figure, the output of the RB (radial basis) neuron is
  a = radbas(||W - P|| * b) = exp(-(||W - P|| * b)^2)
37. Bias in Matlab RBF
- In the Matlab toolbox for radial basis networks, the authors use a bias b, instead of the spread (σ) that some authors use, to be consistent with their other network models
- The relationship is
  b = 0.8326 / spread   (so that exp(-(b*spread)^2) = 0.5)
38. Bias / Spread in Matlab RBF
- Let's say we set b = 0.1 in a radial basis neuron. What does that mean?
- In the RB evaluation, it means we will have
  a = exp(-(0.1 * ||W - P||)^2)
39. Bias / Spread in Matlab RBF
- It means the neuron's output will be 0.5 when the input is a distance of 0.8326/0.1 = 8.326 from its weight (center) vector
40. More RBF
- Consider the following RBF network (in Matlab notation)
- The next figure shows the network output as the input varies over [-2, 2]
41. (No transcript)
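- Since the network definition itself was not captured, here is a small stand-in in plain MATLAB (the weights, biases, and second-layer parameters are our own choices) showing how such an output curve can be produced:

  % Stand-in two-neuron RBF network; all parameter values are illustrative only
  p  = -2:0.01:2;                 % input range
  w1 = [-1; 1];  b1 = [2; 2];     % first-layer centers and biases
  w2 = [1 0.5];  b2 = 0;          % linear second-layer weights and bias

  a1 = exp(-(b1 .* abs(w1 - p)).^2);  % radial basis layer: radbas(||w - p|| * b)
  a2 = w2 * a1 + b2;                  % linear output layer
  plot(p, a2)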
42. More RBF
- The next slide shows how the network output varies as the various parameters are varied.
- Remember: a = exp(-(b * ||W - P||)^2)
43. (No transcript)
44. Matlab dist(W,P)
- Calculates the Euclidean distance from W to P
- W is S x R (S rows by R columns)
- S = number of weight vectors; each row is an R-dimensional weight vector
- P is R x Q (R rows by Q column vectors)
- Each column of P is an R-dimensional input vector
45. Matlab dist(W,P)
- Assume we have a first layer of 4 neurons, with each neuron having a 3-dimensional input vector (P)
- To calculate dist for each neuron (see the sketch below):
- W = 4 rows (weight vectors) of 3 columns each
- P = 3 rows by one column (for a single P vector)
46. (No transcript)
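- A small sketch of what that call computes (the weight and input values are made up; dist itself is a Neural Network Toolbox function, so an equivalent plain-MATLAB computation is shown as well):

  % Hypothetical 4x3 weight matrix and a single 3x1 input vector
  W = [1 0 0;
       0 1 0;
       0 0 1;
       1 1 1];
  P = [0.5; 0.5; 0.5];

  D1 = dist(W, P);                 % toolbox call: 4x1 vector of distances
  D2 = sqrt(sum((W - P').^2, 2));  % the same distances in plain MATLAB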
47. Parameters, Spread, etc.
- The following slide shows the standard notation for a MATLAB RBF neuron
- In such a neuron, we use a = radbas(||W - P|| * b) = exp(-(||W - P|| * b)^2)
48. (No transcript)
49. Question
- What happens to the RBF output as b and/or the distance gets bigger?
- Bigger (larger maximum)?
- Smaller (smaller maximum)?
- Narrower?
- Wider?
- ???
50. (No transcript)
51. Interpolation
- There is sort of a converse to Cover's theorem about separability. It is:
- One can often use a nonlinear mapping to transform a difficult filtering (regression) problem into one that can be solved linearly.
52. More Regression/Interpolation
- In a sense, the regression problem builds an equation that gives us the input/output relationship on a set of data. Given one of the inputs we trained on, we should be able to use this equation to get the output to within some error
- The interpolation problem addresses the question of what value(s) our network gives us for inputs that we have not trained on
53. More Regression/Interpolation
- Given a set of N different points {x_i} and a set of N corresponding real numbers {d_i}, find a function F that satisfies
- F(x_i) = d_i,  i = 1, 2, ..., N
- For strict interpolation, F should pass through all data points.
54. RBF Regression
- The RBF approach to regression chooses F such that
  F(x) = sum_{i=1..N} w_i * φ(||x - x_i||)
  where φ is a radial basis function and the known data points x_i serve as its centers
55. RBF Regression
- From the preceding, we get the following simultaneous linear equations for the unknown weights:
  sum_{i=1..N} w_i * φ(||x_j - x_i||) = d_j,  j = 1, 2, ..., N
  or, writing φ_ji = φ(||x_j - x_i||), in matrix form: Φ w = d
56. RBF Regression
- The next slide shows the author's pictorial of an RBF network
57. (No transcript)
58. RBF Regression
- Thus we have (in matrix form) Φ w = d, so w = Φ^(-1) d
- The question is whether or not Φ is invertible
59. Micchelli's Theorem
- Micchelli's theorem defines a class of functions φ(x) for which the interpolation matrix Φ is invertible.
- There is a large class of radial basis functions covered by Micchelli's theorem
- In what follows, it is required that all of the data points be distinct, i.e., no two points are at the same location in space.
- NOTE: radial basis functions are also called kernel functions
60. Micchelli's Theorem
61. RBF
- The radial basis function most commonly used that satisfies Micchelli's theorem is the Gaussian:
  φ(||x - x_i||) = exp(-||x - x_i||^2 / (2*σ^2))
62. RBF
- In the preceding, there will be one RBF neuron for each data point (N). Thus, the Φ matrix will be N x N, with each row holding the φ values for that data point (a sketch of this construction follows below).
- Thus, the preceding equation (A) is guaranteed to pass through every point of the data if:
- There is one RBF neuron for every x point
- The sensitivity (spread) of each neuron is such that at each data point there is only one neuron with a non-zero value
- The center of each neuron corresponds to a unique data point
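- A compact sketch of this exact-interpolation construction in plain MATLAB (the 1-D data set and the spread are invented for illustration):

  % Exact interpolation with Gaussian RBFs: one neuron per data point
  x = [-2 -1 0 1 2]';              % N data points, also used as centers
  d = [0.2 0.8 1.0 0.7 0.1]';      % N target values
  sigma = 0.7;                     % spread (assumed)

  R   = abs(x - x');               % N x N matrix of pairwise distances
  Phi = exp(-R.^2 / (2*sigma^2));  % Gaussian interpolation matrix
  w   = Phi \ d;                   % solve Phi*w = d (invertible by Micchelli's theorem)

  xq = 0.5;                                    % a query point
  Fq = exp(-(xq - x').^2 / (2*sigma^2)) * w;   % interpolant evaluated at xq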
63. RBF
- For this size of network:
- One neuron for each data point usually means an overly trained (overfit) network
- There is also the problem of inverting Φ, an N x N matrix; the inversion time grows as O(N^3). Thus, for 1,000 data points the cost would be on the order of 10^9 operations
64. RBF
- Thus, we generally accept a suboptimal solution in which the number of basis functions (RBF neurons) is < the number of data points (N), i.e., the number of basis-function centers is < N
- The question then becomes one of choosing the best number of neurons and the best centers for those neurons
65. Classification
- RBFs can also be used as classifiers.
- Remember: BPNNs tend to divide the input space (figure (a), next slide), whereas RBFs tend to kernelize the space (figure (b), next slide)
66. (Figure: (a) BPNN, (b) RBFNN)
67. Learning Strategies
- Typically, the weights of the two layers are determined separately, i.e., find the RBF weights (centers) first, and then find the output-layer weights
- There are several methods for parameterizing the RBFs, selecting the centers, and selecting the output-layer weights
68. Learning Strategies
- Earlier, we showed one approach in which there was one RBF neuron for each data point. For that network, we only have to choose the sensitivity (spread) of each neuron so that its RBF does not overlap those of the other neurons
- Of course, we still have the problem of inverting a really big matrix.
69. RBF Layer - Another Strategy
- In this case we simply select (randomly) a subset of the data points and use them as the centers, i.e., the x_i's
- How many to select?
- Which ones to select?
- The following approach defines the way to adjust the sensitivity (spread)
70. Fixed Centers
- With randomly selected fixed centers, a common choice (from Haykin's text) is to set the spread to σ = d_max / sqrt(2K), where d_max is the maximum distance between the chosen centers and K is the number of centers
71. Fixed Centers
- From this, what parameters are left to find?
72. Fixed Centers
- From this, what parameters are left to find?
- The w's of the output layer
- For this, we need a weight matrix W such that Φ W = d
73. Fixed Centers
- The problem with the preceding is that Φ is probably not square, and hence can't be inverted. So how do we find W?
- To do this we can again use the pseudo-inverse, and thus
  W = (Φ^T Φ)^(-1) Φ^T d
74. Fixed Centers Example - XOR
- For the XOR, we will use the following simplified Gaussian: φ_i(x) = exp(-||x - c_i||^2)
- The centers of the RBFs will be c1 = (1,1) and c2 = (0,0)
75. Fixed Centers Example - XOR
- We want:
- data point   input pattern x   desired output
-     1            (1,1)               0
-     2            (0,1)               1
-     3            (0,0)               0
-     4            (1,0)               1
76. XOR Example
- What we want is to find W given Φ and d
- Remember: Φ W = d
- Since we have chosen 2 RBF centers, our Φ matrix will have 4 rows (there are 4 data points) and three columns. The first two columns are for the two RBF centers, and the third column is 1, for the bias multiplier.
77. XOR Example
- Each row of Φ is [φ1(x)  φ2(x)  1] = [exp(-||x - (1,1)||^2)  exp(-||x - (0,0)||^2)  1]
78. XOR Example
- We have the following (after some algebra):
  Φ = [ 1.0000  0.1353  1
        0.3679  0.3679  1
        0.1353  1.0000  1
        0.3679  0.3679  1 ]
79. XOR Example
- We want W such that Φ W = d
- W = Φ^(-1) d, but Φ is not invertible, i.e., it is not square
- We can use what is called the pseudo-inverse:
- W = Φ⁺ d = (Φ^T Φ)^(-1) Φ^T d
80. XOR Example
- Something like W = (Φ^T Φ)^(-1) Φ^T d looks difficult; however, with Matlab it is simple
81. (No transcript)
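- The original slide was not transcribed; a sketch of the computation in plain MATLAB (pinv computes the pseudo-inverse) would look like this:

  % XOR example: 2 Gaussian RBF centers plus a bias column
  X  = [1 1; 0 1; 0 0; 1 0];           % the four input patterns (rows)
  d  = [0; 1; 0; 1];                   % desired outputs
  c1 = [1 1];  c2 = [0 0];             % fixed centers

  phi1 = exp(-sum((X - c1).^2, 2));    % first column of Phi
  phi2 = exp(-sum((X - c2).^2, 2));    % second column of Phi
  Phi  = [phi1 phi2 ones(4,1)];        % 4x3 design matrix (bias column of 1s)

  W = pinv(Phi) * d;                   % pseudo-inverse solution: approx [-2.50; -2.50; 2.84]
  y = Phi * W;                         % network outputs: approx [0; 1; 0; 1]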
82. XOR Example
- We now have everything, including
- W^T = [-2.5019  -2.5019  2.8404]
- Remember, we're using Φ W = d
- For a given input, say x1 = (1,1), we have
  [1  0.1353  1] * [-2.5019; -2.5019; 2.8404] ≈ 0
83. XOR Example
- We now have everything, including
- W^T = [-2.5019  -2.5019  2.8404]
- Remember, we're using Φ W = d
- Input x   desired output   actual output
- (1,1)          0            3.3E-15 (≈ 0)
- (0,1)          1            0.999 (≈ 1)
- (0,0)          0            3.3E-15 (≈ 0)
- (1,0)          1            0.999 (≈ 1)
84. Self-Organized Center Selection
- Fixed centers may require a relatively large number of randomly selected centers to work well
- This (self-organized) method uses a two-stage, iterative process:
- Self-organized learning stage: estimate the RBF centers
- Supervised learning stage: estimate the linear weights of the output layer
85. Self-Organized Learning Stage
- For this stage we use a clustering algorithm that divides or partitions the data points into subgroups
- It needs to be unsupervised, so it is simply a clustering approach
- Commonly, K-means clustering is used
- Place RBF centers only in those regions of the data space where there are significant concentrations of data (input vectors)
86. Self-Organized Learning Stage
- "Clustering is a form of unsupervised learning whereby a set of observations (i.e., data points) is partitioned into natural groupings or clusters of patterns in such a way that the measure of similarity between any pair of observations assigned to each cluster minimizes a specified cost function." (Text/Haykin)
87. Self-Organized Learning Stage
- Let us assume we have K clusters, where K < the number of observations (data points)
- What we want is to find the minimum of a cost function J(C), where C(i) denotes the cluster to which observation i is assigned. This is the cost function we use:
  J(C) = sum_{j=1..K} sum_{i: C(i)=j} ||x_i - μ_j||^2
88. Self-Organized Learning Stage
- The inner summation over C(i) = j means that a given x_i is assigned to cluster j, i.e., the one it is closest to
- The μ_j term is the center or mean of the x_i's assigned to that cluster
- Thus we want to minimize the total of the squared distances from the cluster centers
89. Self-Organized Learning Stage
- The expression sum_{i: C(i)=j} ||x_i - μ_j||^2 can be thought of as the (unnormalized) variance of a cluster.
- Now, how do we minimize J(C)?
90. Self-Organized Learning Stage
- "Given a set of N observations (data points), find the encoder C that assigns these observations to the K clusters in such a way that, within each cluster, the average measure of dissimilarity of the assigned observations from the cluster mean is minimized. Hence, K-means clustering." (text)
91. Finding the Clusters
- The easiest way to figure K-means clustering out is with a numeric example.
92. Self-Organized Learning Stage
- Initialization: choose random values for the initial set of centers t_k(0), subject to the restriction that these initial values must all be different
- Sampling: randomly draw a sample vector x from the input space
- Similarity matching: let k(x) denote the best-matching (closest) center for input x
94. Self-Organized Learning Stage
- Find k(x) at iteration n using the minimum-distance Euclidean criterion:
  k(x) = arg min_k ||x(n) - t_k(n)||
95. Self-Organized Learning Stage
- Updating: adjust the closest RBF center as
  t_k(n+1) = t_k(n) + η [x(n) - t_k(n)]   for k = k(x); all other centers are unchanged
- Continuation: repeat until no noticeable changes in the centers occur
- Generally, reduce the learning rate η over time
96. Another Clustering Technique
- In this example, we will use data values as the initial cluster centers and then iterate from there.
- Assume we have 4 data points:
- P1 = (1,1)
- P2 = (2,1)
- P3 = (4,3)
- P4 = (5,4)
97. Another Clustering Technique
- We will assume there are 2 clusters and choose P1 and P2 as the initial cluster centers
98. (No transcript)
99. Another Clustering Technique
- We begin by calculating the distance from each centroid to each data point
- Each (row, column) entry (R,C) in D is the distance from centroid R to data point C; e.g., D(1,3) = 3.61 is the distance from centroid 1 (initially, point 1) to point 3
100. Another Clustering Technique
- We can now assign each point to one of the 2 clusters (whichever centroid is closer)
101. Another Clustering Technique
- We now compute the new cluster centers based on this grouping
102. Another Clustering Technique
- We now compute the new cluster centers based on this grouping
103. Another Clustering Technique
- Now we again calculate the distance from each new centroid to each data point
- For each point (column), we assign it to the cluster (row) that is closest
104. Another Clustering Technique
- We now compute the new cluster centers based on this new grouping
105. Another Clustering Technique
- If we repeat this process again, we will find that there is no change to the clustering, so we quit with the two cluster centers as given:
- (1.5, 1) and (4.5, 3.5)
106. (No transcript)
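- A compact K-means sketch in plain MATLAB that reproduces this small example (the data points and the choice of two clusters come from the slides; the loop structure and iteration count are ours):

  % K-means on the 4-point example; initial centers are P1 and P2
  P = [1 1; 2 1; 4 3; 5 4];          % data points (rows)
  C = P(1:2, :);                     % initial cluster centers
  for it = 1:10
      D = sqrt((C(:,1) - P(:,1)').^2 + (C(:,2) - P(:,2)').^2);  % 2x4 distance matrix
      [~, idx] = min(D, [], 1);      % assign each point to its nearest center
      for k = 1:2
          C(k,:) = mean(P(idx == k, :), 1);   % recompute each center
      end
  end
  disp(C)                            % converges to (1.5, 1) and (4.5, 3.5)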
107. K-means Clustering
- It should be noted that K-means clustering is very dependent on:
- The number of centers
- The initial locations of the centers
108. Adaptive K-means with Dynamic Initialization
- Randomly pick a set of centers c1, ..., ck
- 1. Disable all cluster centers
- 2. Read an input vector x
- 3. If the closest enabled cluster center ci is within distance r of x, or if all cluster centers are already enabled, update ci as
   ci = ci + η (x - ci)
109. Adaptive K-means with Dynamic Initialization - Continued
- 4. Otherwise, enable a new cluster center ck and set it equal to x
- 5. Continue for a fixed number of iterations, or until the learning rate has decayed to 0
- Yes, you still have the challenge of setting r
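- A sketch of our reading of steps 1-5 in plain MATLAB (the data, r, η, and the maximum cluster count K are all illustrative assumptions):

  % Adaptive K-means with dynamic initialization (single pass over the data)
  X   = rand(100, 2);        % hypothetical 2-D input vectors (rows)
  K   = 5;                   % maximum number of clusters
  r   = 0.3;                 % enabling radius
  eta = 0.1;                 % learning rate
  C       = zeros(K, 2);     % cluster centers
  enabled = false(K, 1);     % step 1: all centers start disabled

  for n = 1:size(X, 1)
      x = X(n, :);                                  % step 2: read an input vector
      dmin = inf;  i = 0;
      if any(enabled)
          idxEn = find(enabled);
          d = sqrt(sum((C(idxEn,:) - x).^2, 2));    % distances to enabled centers
          [dmin, j] = min(d);
          i = idxEn(j);                             % closest enabled center
      end
      if dmin <= r || all(enabled)                  % step 3: update the closest center
          C(i,:) = C(i,:) + eta * (x - C(i,:));
      else                                          % step 4: enable a new center at x
          i = find(~enabled, 1);
          C(i,:) = x;
          enabled(i) = true;
      end
  end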
110. Next Step
- Once the centers (centroids) of the first (RBF) layer have been chosen, the next step is to set the widths of the centers, i.e., the variances
- How?
111. Next Step
- We can use a P-nearest-neighbor heuristic:
- Given a cluster center c_k,
- select the P nearest neighboring cluster centers, and
- set σ_k to the root mean square of the distances to these clusters
112. Next Step
- σ_k = sqrt( (1/P) * sum_{j=1..P} ||c_k - c_j||^2 ),  where the c_j are the P nearest neighboring centers
113. Finally
- The last step is to set the weights of the output layer
- Remember, the output layer is a linear neuron (or neurons)
- Since it's linear, we could just use the delta training rule
- Note also that there is no real requirement that the output layer be linear; e.g., it could be a BPNN
114. Linear Least Squares
- Having set the parameters of the first (RB) layer, we can use linear least squares to determine the weights of the second layer
- We are given the training pairs
- {p1, t1}, ..., {pQ, tQ}
- Q = number of training points, p = input, t = target
115. Linear Least Squares
- For each training input p_q, collect the first-layer (RBF) outputs, with a 1 appended for the bias, as row q of a matrix U; collect the targets t_q into a vector t; and let x denote the vector of second-layer weights and bias
116. Linear Least Squares
- We want to minimize the sum-squared-error performance index
  F = sum_{q=1..Q} (t_q - a_q)^2,  where a_q is the network output for input p_q
117. Linear Least Squares
- With the preceding substitutions, the sum-squared error becomes
  F(x) = (t - U x)^T (t - U x)
118. Regularization
- Regularization is one of many techniques used to improve the generalization capability of a network.
- It is done by modifying the sum-squared-error index so that it includes a penalty for network complexity
119. Regularization
- There are two ways to think of network complexity:
- The number of neurons
- The magnitude of the weights
- When the weights are large, the network can have large slopes and is more likely to overfit, i.e., it is not as smooth
120. Regularization
- Thus, in the error function F(x) we try to minimize, we include such a penalty term:
  F(x) = (t - U x)^T (t - U x) + ρ x^T x
121. Linear Least Squares
- Going back to slide 117, with the regularization term added we have
  F(x) = (t - U x)^T (t - U x) + ρ x^T x
122. Linear Least Squares
- If we set the gradient to 0 and solve, we obtain the minimum:
  x = (U^T U + ρ I)^(-1) U^T t
123. LLS Example
- Let's say the function to approximate has the following input/output pairs:
- P = [-2, -1.2, -0.4, 0.4, 1.2, 2]
- T = [0, 0.19, 0.69, 1.3, 1.8, 2]
- Choose the basis centers to be equally spaced throughout the input range [-2, 2] (see the sketch after this list)
- Choose the biases to be 1/(spacing between centers)
124. LLS Example (not transcribed)
125. LLS Example (not transcribed)
126. (No transcript)
127. (No transcript)
128. Matlab
- newrbe
- Creates one neuron for each data item
- newrb
- Adds neurons as needed, one at a time, until the error goal (maximum allowed error) is reached
- The smaller the error goal, the more neurons needed
129. Matlab - NEWRBE
- >> net = newrbe(P, T, SPREAD)
- P = input vectors; if there are Q input vectors, there will be Q RBF neurons
- T = target vector
- SPREAD = spread constant for the RBF neurons
130. Matlab - SPREAD
- The bias value for each RBF neuron is set to 0.8326/SPREAD
- This means a neuron's output will be 0.5 if the weighted input (distance) is +/- SPREAD
- A SPREAD of 4 means that a neuron at a distance of 4 from an input vector will output 0.5 for that vector
- SPREAD should be large enough that for each input X several neurons are active
131. Matlab - SPREAD
- How does one determine what SPREAD should be to meet the preceding criterion?
132. Testing the Network
- Once a network has been defined, one can test it as:
- Y = sim(net, X)
- X = test data
- net = network created with net = newrbe(...)
- Y = array of outputs
133. NEWRB
- newrb is like newrbe, except that it starts with one neuron and adds neurons until the mean squared error is less than some maximum
- net = newrb(input, target, errormax, spread)
134. Matlab Demos
- demorb1
- Solves a regression problem with 5 neurons
- demorb3
- Sets the spread too small
- demorb4
- Sets the spread too large