Title: Radial Basis Functions
1. Radial Basis Functions
- RBF: Chapter 5 of the text
- Also Martin Hagan's book, Neural Network Design
2. RBF
- We have seen that a BPNN (backpropagation neural network) can be used for function approximation and classification
- An RBFNN (radial basis function neural network) is another network that can be used for both kinds of problems
3. RBFNN vs. BPNN
- An RBFNN has only two layers, whereas a BPNN can have any number
- Normally all nodes of a BPNN use the same neuron model, whereas an RBFNN uses two different models: a nonlinear layer and a linear layer
- The argument of an RBFNN neuron is a Euclidean norm, whereas a BPNN neuron computes a dot product
- An RBFNN trains faster than a BPNN
4. Dot Product
- If we view the weights of a neuron as a vector W = [w1, w2, ..., wn]^T, and the input to the neuron as a vector X = [x1, x2, ..., xn]^T, then the dot product is
  W^T X = w1*x1 + w2*x2 + ... + wn*xn
5. Euclidean Norm
- If we view the weights of a neuron as a vector W = [w1, w2, ..., wn]^T, and the input to the neuron as a vector X = [x1, x2, ..., xn]^T, then the Euclidean norm is
  ||X - W|| = sqrt( (x1 - w1)^2 + (x2 - w2)^2 + ... + (xn - wn)^2 )
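- A minimal numeric sketch of the two quantities in plain MATLAB (the vector values here are made up for illustration):

  % Hypothetical 3-element weight and input vectors
  W = [0.5; -1.0; 2.0];
  X = [1.0;  0.5; 1.5];

  dotProduct    = W' * X;       % BPNN-style net input: W^T X
  euclideanNorm = norm(X - W);  % RBFNN-style net input: ||X - W||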
6. RBFNN vs. BPNN
- An RBFNN often leads to better decision boundaries
- Hidden-layer units of an RBFNN have a much more natural interpretation
- The RBFNN learning phase may be unsupervised and thus could lose information
- A BPNN may give a more compact representation
7. (No transcript)
8. (No transcript)
9. RBF
- There is considerable theory behind the application of RBFNNs, or simply RBFs
- Much of it is covered in the text
- Our emphasis will be on the results, so we will not cover the derivations, etc.
10. RBF
- There are 2 general categories of RBFNNs:
- Classification: separation of hyperspace (first portion of the chapter)
- Function approximation: use an RBFNN to approximate some nonlinear function
11. RBF - Classification
- Cover's Theorem on the separability of patterns:
- A complex pattern-classification problem re-cast nonlinearly in a high(er)-dimensional space is more likely to be linearly separable than in a low-dimensional space, provided that the space is not densely populated.
12. Cover's Theorem
- Given an input vector X = [x1, ..., xk] (of dimension k), if we recast it using some set of nonlinear transfer functions φ on each of the input parameters (into a dimension m > k), then it is more likely that this new set will be linearly separable
- In some cases, simply using a nonlinear mapping and not changing (increasing) the dimensionality (m) is sufficient
13. Cover's Theorem - Corollary
- The expected (average) maximum number of randomly assigned patterns (vectors) that are linearly separable in a space of dimensionality m is 2m
- Stated another way:
- 2m is a definition of the separating capacity of a family of decision surfaces having m degrees of freedom
14. Cover's Theorem
15. Cover's Theorem
- Another way to think of it (Cover's theorem) in terms of probability is as follows:
- Let's say we have N points in an m0-dimensional space. Furthermore, let's say we randomly assign those N points to one of two classes. Let P(N, m1) denote the probability that this random assignment is linearly separable for some set of functions φ.
16. Cover's Theorem
- The author gives a rather complicated expression for P(N, m1)
- Think of m1 as the number of nonlinear transformations
- However, what is important is that he shows that:
- P(2*m1, m1) = 1/2
- As m1 gets larger, P(N, m1) approaches 1
17. Thesis Topic
- Use an evolutionary approach to find a minimum set of φs that makes X linearly separable.
18. XOR Problem
- In the XOR problem (x1 XOR x2), we want a 1 output when the inputs are not equal, else a 0 output.
- This is not a linearly separable problem
- Observe what happens if we use two nonlinear functions applied to (x1, x2)
19. XOR
20. XOR
21. XOR
- If one plots these new points (next slide), one can see that they are now linearly separable
22. (Figure: plot of the transformed points, labeled (1,1), (0,1), (1,0), (0,0), showing that they are linearly separable)
23. XOR
- What the previous slides show is that by introducing a nonlinearity, even without increasing the dimensionality, the space becomes linearly separable
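- A small sketch of this mapping in plain MATLAB, assuming the Gaussian functions with centers (1,1) and (0,0) that reappear in the XOR example on slide 74 (the slides with the actual φ definitions were not transcribed):

  % The four XOR input patterns and two Gaussian transfer functions
  X  = [1 1; 0 1; 0 0; 1 0];          % input patterns (rows)
  t1 = [1 1];  t2 = [0 0];            % assumed centers
  phi1 = exp(-sum((X - t1).^2, 2));   % phi1(x) = exp(-||x - t1||^2)
  phi2 = exp(-sum((X - t2).^2, 2));   % phi2(x) = exp(-||x - t2||^2)
  disp([phi1 phi2])                   % transformed points; now linearly separable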
24. Regression
- In a paper by Scott (1992), he showed that in the presence of noisy data, a good approximation of the relation between input/output pairs is a weighted (kernel) average of the observed outputs
25. Regression
- After much work, it can be shown that in the discrete domain (assuming Gaussian noise) the estimate takes the form
  F(x) = sum_i d_i * exp(-||x - x_i||^2 / (2*σ^2)) / sum_j exp(-||x - x_j||^2 / (2*σ^2))
26. Regression
- The function above is called a normalized Gaussian, and also a normalized RBF. Often the denominator is left off.
- What it says is that a good approximation or regression function is an RBF-based network of such functions
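- A minimal sketch of this normalized-RBF estimate in plain MATLAB (the data, the width σ, and the query point are invented for illustration):

  % Hypothetical noisy training data and a query point
  x = [-2 -1 0 1 2];            % training inputs
  d = [0.1 0.9 2.1 2.9 4.2];    % noisy target outputs
  sigma = 0.8;                  % kernel width (assumed)
  xq = 0.5;                     % query point

  k = exp(-(xq - x).^2 / (2*sigma^2));  % Gaussian kernel values
  F = sum(d .* k) / sum(k);             % normalized-RBF estimate at xq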
27. Regularization
- An important point to remember is that we always have noise in the data, so it is probably not a good idea to develop a function that exactly fits the data.
- Another way to think of it: fitting the data too closely will likely give poor generalization
- The next slide shows a bad fit, and the one after that a good one
28-33. (No transcripts: figures not captured)
34. Matlab RBF
- The following slide shows Matlab's view of an RBF network.
35. (No transcript)
36. Spread and Bias in Matlab
- From the preceding figure, the output of the RB (radial basis) neuron is
  a = radbas(||W - P|| * b) = exp(-(||W - P|| * b)^2)
37. Bias in Matlab RBF
- In the Matlab toolbox for radial basis networks, the authors use a bias b, instead of the spread (σ) that some authors use, to be consistent with their other network models
- The relationship is
  b = 0.8326 / spread   (so that exp(-(b*spread)^2) = 0.5)
38. Bias / Spread in Matlab RBF
- Let's say we set b = 0.1 in a radial basis neuron. What does that mean?
- In the RB evaluation, it means we will have
  a = exp(-(0.1 * ||W - P||)^2)
39. Bias / Spread in Matlab RBF
- It means the neuron's output will be 0.5 when the input is a distance of 0.8326/0.1 = 8.326 from its weight (center) vector
40. More RBF
- Consider the following RBF network (in Matlab notation)
- The next figure shows the network output as the input varies over [-2, 2]
41. (No transcript)
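- Since the network definition itself was not captured, here is a small stand-in in plain MATLAB (the weights, biases, and second-layer parameters are our own choices) showing how such an output curve can be produced:

  % Stand-in two-neuron RBF network; all parameter values are illustrative only
  p  = -2:0.01:2;                 % input range
  w1 = [-1; 1];  b1 = [2; 2];     % first-layer centers and biases
  w2 = [1 0.5];  b2 = 0;          % linear second-layer weights and bias

  a1 = exp(-(b1 .* abs(w1 - p)).^2);  % radial basis layer: radbas(||w - p|| * b)
  a2 = w2 * a1 + b2;                  % linear output layer
  plot(p, a2)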
42. More RBF
- The next slide shows how the network output varies as the various parameters are varied.
- Remember: a = exp(-(b * ||W - P||)^2)
43. (No transcript)
44. Matlab dist(W,P)
- Calculates the Euclidean distance from W to P
- W is S x R (S rows by R columns)
- S = number of weight vectors; each row is an R-dimensional weight vector
- P is R x Q (R rows by Q column vectors)
- Each column of P is an R-dimensional input vector
45. Matlab dist(W,P)
- Assume we have a first layer of 4 neurons, with each neuron having a 3-dimensional input vector (P)
- To calculate dist for each neuron (see the sketch below):
- W = 4 rows (weight vectors) of 3 columns each
- P = 3 rows by one column (for a single P vector)
46. (No transcript)
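- A small sketch of what that call computes (the weight and input values are made up; dist itself is a Neural Network Toolbox function, so an equivalent plain-MATLAB computation is shown as well):

  % Hypothetical 4x3 weight matrix and a single 3x1 input vector
  W = [1 0 0;
       0 1 0;
       0 0 1;
       1 1 1];
  P = [0.5; 0.5; 0.5];

  D1 = dist(W, P);                 % toolbox call: 4x1 vector of distances
  D2 = sqrt(sum((W - P').^2, 2));  % the same distances in plain MATLAB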
47. Parameters, Spread, etc.
- The following slide shows the standard notation for a MATLAB RBF neuron
- In such a neuron, we use a = radbas(||W - P|| * b) = exp(-(||W - P|| * b)^2)
48. (No transcript)
49. Question
- What happens to the RBF output as b and/or the distance gets bigger?
- Bigger (larger maximum)?
- Smaller (smaller maximum)?
- Narrower?
- Wider?
- ???
50. (No transcript)
51. Interpolation
- There is sort of a converse to Cover's theorem about separability. It is:
- One can often use a nonlinear mapping to transform a difficult filtering (regression) problem into one that can be solved linearly.
52. More Regression/Interpolation
- In a sense, the regression problem builds an equation that gives us the input/output relationship on a set of data. Given one of the inputs we trained on, we should be able to use this equation to get the output to within some error
- The interpolation problem addresses the question of what value(s) our network gives us for inputs that we have not trained on
53. More Regression/Interpolation
- Given a set of N different points {x_i} and a set of N corresponding real numbers {d_i}, find a function F that satisfies
- F(x_i) = d_i,  i = 1, 2, ..., N
- For strict interpolation, F should pass through all data points.
54. RBF Regression
- The RBF approach to regression chooses F such that
  F(x) = sum_{i=1..N} w_i * φ(||x - x_i||)
  where φ is a radial basis function and the known data points x_i serve as its centers
55. RBF Regression
- From the preceding, we get the following simultaneous linear equations for the unknown weights:
  sum_{i=1..N} w_i * φ(||x_j - x_i||) = d_j,  j = 1, 2, ..., N
  or, writing φ_ji = φ(||x_j - x_i||), in matrix form: Φ w = d
56. RBF Regression
- The next slide shows the author's pictorial of an RBF network
57. (No transcript)
58. RBF Regression
- Thus we have (in matrix form) Φ w = d, so w = Φ^(-1) d
- The question is whether or not Φ is invertible
59. Micchelli's Theorem
- Micchelli's theorem defines a class of functions φ(x) for which the interpolation matrix Φ is invertible.
- There is a large class of radial basis functions covered by Micchelli's theorem
- In what follows, it is required that all of the data points be distinct, i.e., no two points are at the same location in space.
- NOTE: radial basis functions are also called kernel functions
60. Micchelli's Theorem
61. RBF
- The radial basis function most commonly used that satisfies Micchelli's theorem is the Gaussian:
  φ(||x - x_i||) = exp(-||x - x_i||^2 / (2*σ^2))
62. RBF
- In the preceding, there will be one RBF neuron for each data point (N). Thus, the Φ matrix will be N x N, with each row holding the φ values for that data point (a sketch of this construction follows below).
- Thus, the preceding equation (A) is guaranteed to pass through every point of the data if:
- There is one RBF neuron for every x point
- The sensitivity (spread) of each neuron is such that at each data point there is only one neuron with a non-zero value
- The center of each neuron corresponds to a unique data point
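- A compact sketch of this exact-interpolation construction in plain MATLAB (the 1-D data set and the spread are invented for illustration):

  % Exact interpolation with Gaussian RBFs: one neuron per data point
  x = [-2 -1 0 1 2]';              % N data points, also used as centers
  d = [0.2 0.8 1.0 0.7 0.1]';      % N target values
  sigma = 0.7;                     % spread (assumed)

  R   = abs(x - x');               % N x N matrix of pairwise distances
  Phi = exp(-R.^2 / (2*sigma^2));  % Gaussian interpolation matrix
  w   = Phi \ d;                   % solve Phi*w = d (invertible by Micchelli's theorem)

  xq = 0.5;                                    % a query point
  Fq = exp(-(xq - x').^2 / (2*sigma^2)) * w;   % interpolant evaluated at xq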
63. RBF
- For this size of network:
- One neuron for each data point usually means an overly trained (overfit) network
- There is also the problem of inverting Φ, an N x N matrix; the inversion time grows as O(N^3). Thus, for 1,000 data points the cost would be on the order of 10^9 operations
64. RBF
- Thus, we generally accept a suboptimal solution in which the number of basis functions (RBF neurons) is < the number of data points (N), i.e., the number of basis-function centers is < N
- The question then becomes one of choosing the best number of neurons and the best centers for those neurons
65. Classification
- RBFs can also be used as classifiers.
- Remember: BPNNs tend to divide the input space (figure (a), next slide), whereas RBFs tend to kernelize the space (figure (b), next slide)
66. (Figure: (a) BPNN, (b) RBFNN)
67. Learning Strategies
- Typically, the weights of the two layers are determined separately, i.e., find the RBF weights (centers) first, and then find the output-layer weights
- There are several methods for parameterizing the RBFs, selecting the centers, and selecting the output-layer weights
68. Learning Strategies
- Earlier, we showed one approach in which there was one RBF neuron for each data point. For that network, we only have to choose the sensitivity (spread) of each neuron so that its RBF does not overlap those of the other neurons
- Of course, we still have the problem of inverting a really big matrix.
69. RBF Layer - Another Strategy
- In this case we simply select (randomly) a subset of the data points and use them as the centers, i.e., the x_i's
- How many to select?
- Which ones to select?
- The following approach defines the way to adjust the sensitivity (spread)
70. Fixed Centers
- With randomly selected fixed centers, a common choice (from Haykin's text) is to set the spread to σ = d_max / sqrt(2K), where d_max is the maximum distance between the chosen centers and K is the number of centers
71. Fixed Centers
- From this, what parameters are left to find?
72. Fixed Centers
- From this, what parameters are left to find?
- The w's of the output layer
- For this, we need a weight matrix W such that Φ W = d
73. Fixed Centers
- The problem with the preceding is that Φ is probably not square, and hence can't be inverted. So how do we find W?
- To do this we can again use the pseudo-inverse, and thus
  W = (Φ^T Φ)^(-1) Φ^T d
74. Fixed Centers Example - XOR
- For the XOR, we will use the following simplified Gaussian: φ_i(x) = exp(-||x - c_i||^2)
- The centers of the RBFs will be c1 = (1,1) and c2 = (0,0)
75. Fixed Centers Example - XOR
- We want:
- data point   input pattern x   desired output
-     1            (1,1)               0
-     2            (0,1)               1
-     3            (0,0)               0
-     4            (1,0)               1
76. XOR Example
- What we want is to find W given Φ and d
- Remember: Φ W = d
- Since we have chosen 2 RBF centers, our Φ matrix will have 4 rows (there are 4 data points) and three columns. The first two columns are for the two RBF centers, and the third column is 1, for the bias multiplier.
77. XOR Example
- Each row of Φ is [φ1(x)  φ2(x)  1] = [exp(-||x - (1,1)||^2)  exp(-||x - (0,0)||^2)  1]
78. XOR Example
- We have the following (after some algebra):
  Φ = [ 1.0000  0.1353  1
        0.3679  0.3679  1
        0.1353  1.0000  1
        0.3679  0.3679  1 ]
79. XOR Example
- We want W such that Φ W = d
- W = Φ^(-1) d, but Φ is not invertible, i.e., it is not square
- We can use what is called the pseudo-inverse:
- W = Φ⁺ d = (Φ^T Φ)^(-1) Φ^T d
80. XOR Example
- Something like W = (Φ^T Φ)^(-1) Φ^T d looks difficult; however, with Matlab it is simple
81. (No transcript)
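- The original slide was not transcribed; a sketch of the computation in plain MATLAB (pinv computes the pseudo-inverse) would look like this:

  % XOR example: 2 Gaussian RBF centers plus a bias column
  X  = [1 1; 0 1; 0 0; 1 0];           % the four input patterns (rows)
  d  = [0; 1; 0; 1];                   % desired outputs
  c1 = [1 1];  c2 = [0 0];             % fixed centers

  phi1 = exp(-sum((X - c1).^2, 2));    % first column of Phi
  phi2 = exp(-sum((X - c2).^2, 2));    % second column of Phi
  Phi  = [phi1 phi2 ones(4,1)];        % 4x3 design matrix (bias column of 1s)

  W = pinv(Phi) * d;                   % pseudo-inverse solution: approx [-2.50; -2.50; 2.84]
  y = Phi * W;                         % network outputs: approx [0; 1; 0; 1]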
82. XOR Example
- We now have everything, including
- W^T = [-2.5019  -2.5019  2.8404]
- Remember, we're using Φ W = d
- For a given input, say x1 = (1,1), we have
  [1  0.1353  1] * [-2.5019; -2.5019; 2.8404] ≈ 0
83. XOR Example
- We now have everything, including
- W^T = [-2.5019  -2.5019  2.8404]
- Remember, we're using Φ W = d
- Input x   desired output   actual output
- (1,1)          0            3.3E-15 (≈ 0)
- (0,1)          1            0.999 (≈ 1)
- (0,0)          0            3.3E-15 (≈ 0)
- (1,0)          1            0.999 (≈ 1)
84. Self-Organized Center Selection
- Fixed centers may require a relatively large number of randomly selected centers to work well
- This (self-organized) method uses a two-stage, iterative process:
- Self-organized learning stage: estimate the RBF centers
- Supervised learning stage: estimate the linear weights of the output layer
85. Self-Organized Learning Stage
- For this stage we use a clustering algorithm that divides or partitions the data points into subgroups
- It needs to be unsupervised, so it is simply a clustering approach
- Commonly, K-means clustering is used
- Place RBF centers only in those regions of the data space where there are significant concentrations of data (input vectors)
86. Self-Organized Learning Stage
- "Clustering is a form of unsupervised learning whereby a set of observations (i.e., data points) is partitioned into natural groupings or clusters of patterns in such a way that the measure of similarity between any pair of observations assigned to each cluster minimizes a specified cost function." (Text/Haykin)
87. Self-Organized Learning Stage
- Let us assume we have K clusters, where K < the number of observations (data points)
- What we want is to find the minimum of a cost function J(C), where C(i) denotes the cluster to which observation i is assigned. This is the cost function we use:
  J(C) = sum_{j=1..K} sum_{i: C(i)=j} ||x_i - μ_j||^2
88. Self-Organized Learning Stage
- The inner summation over C(i) = j means that a given x_i is assigned to cluster j, i.e., the one it is closest to
- The μ_j term is the center or mean of the x_i's assigned to that cluster
- Thus we want to minimize the total of the squared distances from the cluster centers
89. Self-Organized Learning Stage
- The expression sum_{i: C(i)=j} ||x_i - μ_j||^2 can be thought of as the (unnormalized) variance of a cluster.
- Now, how do we minimize J(C)?
90. Self-Organized Learning Stage
- "Given a set of N observations (data points), find the encoder C that assigns these observations to the K clusters in such a way that, within each cluster, the average measure of dissimilarity of the assigned observations from the cluster mean is minimized. Hence, K-means clustering." (text)
91. Finding the Clusters
- The easiest way to figure K-means clustering out is with a numeric example.
92. Self-Organized Learning Stage
- Initialization: choose random values for the initial set of centers t_k(0), subject to the restriction that these initial values must all be different
- Sampling: randomly draw a sample vector x from the input space
- Similarity matching: let k(x) denote the best-matching (closest) center for input x
94. Self-Organized Learning Stage
- Find k(x) at iteration n using the minimum-distance Euclidean criterion:
  k(x) = arg min_k ||x(n) - t_k(n)||
95. Self-Organized Learning Stage
- Updating: adjust the closest RBF center as
  t_k(n+1) = t_k(n) + η [x(n) - t_k(n)]   for k = k(x); all other centers are unchanged
- Continuation: repeat until no noticeable changes in the centers occur
- Generally, reduce the learning rate η over time
96. Another Clustering Technique
- In this example, we will use data values as the initial cluster centers and then iterate from there.
- Assume we have 4 data points:
- P1 = (1,1)
- P2 = (2,1)
- P3 = (4,3)
- P4 = (5,4)
97. Another Clustering Technique
- We will assume there are 2 clusters and choose P1 and P2 as the initial cluster centers
98. (No transcript)
99. Another Clustering Technique
- We begin by calculating the distance from each centroid to each data point
- Each (row, column) entry (R,C) in D is the distance from centroid R to data point C; e.g., D(1,3) = 3.61 is the distance from centroid 1 (initially, point 1) to point 3
100. Another Clustering Technique
- We can now assign each point to one of the 2 clusters (whichever centroid is closer)
101. Another Clustering Technique
- We now compute the new cluster centers based on this grouping
102. Another Clustering Technique
- We now compute the new cluster centers based on this grouping
103. Another Clustering Technique
- Now we again calculate the distance from each new centroid to each data point
- For each point (column), we assign it to the cluster (row) that is closest
104. Another Clustering Technique
- We now compute the new cluster centers based on this new grouping
105. Another Clustering Technique
- If we repeat this process again, we will find that there is no change to the clustering, so we quit with the two cluster centers as given:
- (1.5, 1) and (4.5, 3.5)
106. (No transcript)
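- A compact K-means sketch in plain MATLAB that reproduces this small example (the data points and the choice of two clusters come from the slides; the loop structure and iteration count are ours):

  % K-means on the 4-point example; initial centers are P1 and P2
  P = [1 1; 2 1; 4 3; 5 4];          % data points (rows)
  C = P(1:2, :);                     % initial cluster centers
  for it = 1:10
      D = sqrt((C(:,1) - P(:,1)').^2 + (C(:,2) - P(:,2)').^2);  % 2x4 distance matrix
      [~, idx] = min(D, [], 1);      % assign each point to its nearest center
      for k = 1:2
          C(k,:) = mean(P(idx == k, :), 1);   % recompute each center
      end
  end
  disp(C)                            % converges to (1.5, 1) and (4.5, 3.5)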
107. K-means Clustering
- It should be noted that K-means clustering is very dependent on:
- The number of centers
- The initial locations of the centers
108. Adaptive K-means with Dynamic Initialization
- Randomly pick a set of centers c1, ..., ck
- 1. Disable all cluster centers
- 2. Read an input vector x
- 3. If the closest enabled cluster center ci is within distance r of x, or if all cluster centers are already enabled, update ci as
   ci = ci + η (x - ci)
109. Adaptive K-means with Dynamic Initialization - Continued
- 4. Otherwise, enable a new cluster center ck and set it equal to x
- 5. Continue for a fixed number of iterations, or until the learning rate has decayed to 0
- Yes, you still have the challenge of setting r
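- A sketch of our reading of steps 1-5 in plain MATLAB (the data, r, η, and the maximum cluster count K are all illustrative assumptions):

  % Adaptive K-means with dynamic initialization (single pass over the data)
  X   = rand(100, 2);        % hypothetical 2-D input vectors (rows)
  K   = 5;                   % maximum number of clusters
  r   = 0.3;                 % enabling radius
  eta = 0.1;                 % learning rate
  C       = zeros(K, 2);     % cluster centers
  enabled = false(K, 1);     % step 1: all centers start disabled

  for n = 1:size(X, 1)
      x = X(n, :);                                  % step 2: read an input vector
      dmin = inf;  i = 0;
      if any(enabled)
          idxEn = find(enabled);
          d = sqrt(sum((C(idxEn,:) - x).^2, 2));    % distances to enabled centers
          [dmin, j] = min(d);
          i = idxEn(j);                             % closest enabled center
      end
      if dmin <= r || all(enabled)                  % step 3: update the closest center
          C(i,:) = C(i,:) + eta * (x - C(i,:));
      else                                          % step 4: enable a new center at x
          i = find(~enabled, 1);
          C(i,:) = x;
          enabled(i) = true;
      end
  end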
110. Next Step
- Once the centers (centroids) of the first (RBF) layer have been chosen, the next step is to set the widths of the centers, i.e., the variances
- How?
111. Next Step
- We can use a P-nearest-neighbor heuristic:
- Given a cluster center c_k,
- select the P nearest neighboring cluster centers, and
- set σ_k to the root mean square of the distances to these clusters
112. Next Step
- σ_k = sqrt( (1/P) * sum_{j=1..P} ||c_k - c_j||^2 ),  where the c_j are the P nearest neighboring centers
113. Finally
- The last step is to set the weights of the output layer
- Remember, the output layer is a linear neuron (or neurons)
- Since it's linear, we could just use the delta training rule
- Note also that there is no real requirement that the output layer be linear; e.g., it could be a BPNN
114. Linear Least Squares
- Having set the parameters of the first (RB) layer, we can use linear least squares to determine the weights of the second layer
- We are given the training pairs
- {p1, t1}, ..., {pQ, tQ}
- Q = number of training points, p = input, t = target
115. Linear Least Squares
- For each training input p_q, collect the first-layer (RBF) outputs, with a 1 appended for the bias, as row q of a matrix U; collect the targets t_q into a vector t; and let x denote the vector of second-layer weights and bias
116. Linear Least Squares
- We want to minimize the sum-squared-error performance index
  F = sum_{q=1..Q} (t_q - a_q)^2,  where a_q is the network output for input p_q
117. Linear Least Squares
- With the preceding substitutions, the sum-squared error becomes
  F(x) = (t - U x)^T (t - U x)
118. Regularization
- Regularization is one of many techniques used to improve the generalization capability of a network.
- It is done by modifying the sum-squared-error index so that it includes a penalty for network complexity
119. Regularization
- There are two ways to think of network complexity:
- The number of neurons
- The magnitude of the weights
- When the weights are large, the network can have large slopes and is more likely to overfit, i.e., it is not as smooth
120. Regularization
- Thus, in the error function F(x) we try to minimize, we include such a penalty term:
  F(x) = (t - U x)^T (t - U x) + ρ x^T x
121. Linear Least Squares
- Going back to slide 117, with the regularization term added we have
  F(x) = (t - U x)^T (t - U x) + ρ x^T x
122. Linear Least Squares
- If we set the gradient to 0 and solve, we obtain the minimum:
  x = (U^T U + ρ I)^(-1) U^T t
123. LLS Example
- Let's say the function to approximate has the following input/output pairs:
- P = [-2, -1.2, -0.4, 0.4, 1.2, 2]
- T = [0, 0.19, 0.69, 1.3, 1.8, 2]
- Choose the basis centers to be equally spaced throughout the input range [-2, 2] (see the sketch after this list)
- Choose the biases to be 1/(spacing between centers)
124. LLS Example (not transcribed)
125. LLS Example (not transcribed)
126. (No transcript)
127. (No transcript)
128. Matlab
- newrbe
- Creates one neuron for each data item
- newrb
- Adds neurons as needed, one at a time, until the error goal (maximum allowed error) is reached
- The smaller the error goal, the more neurons needed
129. Matlab - NEWRBE
- >> net = newrbe(P, T, SPREAD)
- P = input vectors; if there are Q input vectors, there will be Q RBF neurons
- T = target vector
- SPREAD = spread constant for the RBF neurons
130. Matlab - SPREAD
- The bias value for each RBF neuron is set to 0.8326/SPREAD
- This means a neuron's output will be 0.5 if the weighted input (distance) is +/- SPREAD
- A SPREAD of 4 means that a neuron at a distance of 4 from an input vector will output 0.5 for that vector
- SPREAD should be large enough that for each input X several neurons are active
131. Matlab - SPREAD
- How does one determine what SPREAD should be to meet the preceding criterion?
132. Testing the Network
- Once a network has been defined, one can test it as:
- Y = sim(net, X)
- X = test data
- net = network created with net = newrbe(...)
- Y = array of outputs
133. NEWRB
- newrb is like newrbe, except that it starts with one neuron and adds neurons until the mean squared error is less than some maximum
- net = newrb(input, target, errormax, spread)
134. Matlab Demos
- demorb1
- Solves a regression problem with 5 neurons
- demorb3
- Sets the spread too small
- demorb4
- Sets the spread too large