Radial Basis Functions
Transcript and Presenter's Notes

1
Radial Basis Functions
  • RBF: Chapter 5 of text
  • Also Marty Hagan's book, Neural Network Design

2
RBF
  • We have seen that a BPNN (backpropagation neural
    network) can be used for function approximation
    and classification
  • An RBFNN (radial basis function neural network) is
    another network that can be used for both such
    problems

3
RBFNN vs BPNN
  • RBFNN has only two layers, whereas BPNN can have
    any number
  • Normally all nodes of the BPNN have the same
    model, whereas the RBFNN has two different models
  • RBFNN has a nonlinear and a linear layer
  • The argument of an RBFNN neuron is a Euclidean
    norm (distance), whereas a BPNN neuron computes a dot product
  • RBFNN trains faster than BPNN

4
Dot Product
  • If we view the weights of a neuron as a vector
    W = [w1, w2, ..., wn]T, and the input to a neuron
    as a vector X = [x1, x2, ..., xn]T, then the dot
    product WTX is
  • WTX = w1x1 + w2x2 + ... + wnxn

5
Euclidean Norm
  • If we view the weights of a neuron as a vector
    W = [w1, w2, ..., wn]T, and the input to a neuron
    as a vector X = [x1, x2, ..., xn]T, then the
    Euclidean norm is
  • ||X - W|| = sqrt((x1 - w1)^2 + ... + (xn - wn)^2)
    (both quantities are contrasted in the sketch below)
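  A minimal MATLAB sketch contrasting the two arguments (the vector values
  are illustrative, not from the slides):

    % Weight and input vectors, written as column vectors as above
    W = [1; 2; 3];
    X = [4; 5; 6];
    dotProduct = W' * X;        % BPNN argument: w1*x1 + w2*x2 + ... + wn*xn
    euclidNorm = norm(X - W);   % RBFNN argument: sqrt(sum((xi - wi).^2))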

6
RBFNN vs BPNN
  • RBFNN often leads to better decision boundaries
  • Hidden layer units of RBFNN have a much more
    natural interpretation
  • RBFNN learning phase may be unsupervised and thus
    could lose information
  • BPNN may give a more compact representation

7
(No Transcript)
8
(No Transcript)
9
RBF
  • There is considerable theory behind the
    application of RBFNNs or simply RBFs.
  • Much of it is covered in the text
  • Our emphasis will be on the results and so we
    will not cover the derivations, etc.

10
RBF
  • There are 2 general categories of RBFNN applications
  • Classification (separation of hyperspace) - first
    portion of chapter
  • Function approximation - use an RBFNN to
    approximate some non-linear function

11
RBF-Classification
  • Cover's Theorem: separability of patterns
  • A complex pattern classification problem re-cast
    in a high(er)-dimensional space nonlinearly is
    more likely to be linearly separable than in a
    low-dimension space, provided that the space is
    not densely populated.

12
Cover's Theorem
  • Given an input vector X = [x1, ..., xk] (of dimension
    k), if we recast it using some set of nonlinear
    transfer functions on each of the input
    parameters (giving a new dimension m > k), then it is
    more likely that this new set will be linearly
    separable
  • In some cases simply using a nonlinear mapping
    and not changing (increasing) the dimensionality
    (m) is sufficient

13
Cover's Theorem - Corollary
  • The expected (average) maximum number of randomly
    assigned patterns (vectors) that are linearly
    separable in a space of dimensionality m is 2m
  • Stated another way
  • 2m is a definition of the separating capacity of
    a family of decision surfaces having m degrees of
    freedom

14
Cover's Theorem
  • Explain in terms of

15
Cover's Theorem
  • Another way to think of Cover's theorem, in
    terms of probabilities, is as follows
  • Let's say we have N points in an m0-dimensional
    space. Furthermore, let's say we randomly assign
    those N points to one of two classes. Let P(N, m1)
    denote the probability that this random assignment
    is linearly separable for some set of m1 functions.

16
Cover's Theorem
  • The author gives a rather complicated expression
    for P(N, m1); it is reproduced below for reference
  • Think of m1 as the number of non-linear
    transformations
  • However, what is important is that he shows that
  • P(2m1, m1) = 1/2
  • As m1 gets larger, P(N, m1) approaches 1
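  For reference, the standard expression from Cover (1965), as given in
  Haykin's text (reproduced here from the standard result rather than from
  the missing slide graphic), is

    P(N, m_1) = \left(\frac{1}{2}\right)^{N-1} \sum_{m=0}^{m_1 - 1} \binom{N-1}{m}

  Setting N = 2m_1 makes the sum equal to half of 2^{N-1}, which gives the
  quoted value of 1/2.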

17
Thesis Topic
  • Use an evolutionary approach to find a minimum
    set of nonlinear functions to make X linearly separable.

18
XOR Problem
  • In the XOR problem (x1 XOR x2) we want an output of 1
    when the inputs are not equal, else an output of 0.
  • This is not a linearly separable problem
  • Observe what happens if we use two non-linear
    functions applied to x1 and x2

19
XOR
20
XOR
21
XOR
  • If one plots these new points (next slide), one
    can see that they are now linearly separable

22
(Plot of the four points (0,0), (0,1), (1,0), (1,1) in the transformed
space, where a straight line now separates the two classes)
23
XOR
  • What the previous slides show is that by
    introducing a nonlinearity, even without
    increasing the dimensionality, the space becomes
    linearly separable (a small numerical sketch follows)
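  A minimal MATLAB sketch of this mapping, using the two Gaussian centers
  (1,1) and (0,0) that the worked XOR example later in the deck also uses:

    X  = [0 0; 0 1; 1 0; 1 1];                      % the four XOR inputs, one per row
    t1 = [1 1];  t2 = [0 0];                        % centers of the two nonlinear functions
    phi1 = exp(-sum((X - repmat(t1,4,1)).^2, 2));   % phi1(x) = exp(-||x - t1||^2)
    phi2 = exp(-sum((X - repmat(t2,4,1)).^2, 2));   % phi2(x) = exp(-||x - t2||^2)
    disp([phi1 phi2])   % (0,0) and (1,1) map near the corners, (0,1) and (1,0) to the
                        % middle, so one straight line in the phi1-phi2 plane separates them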

24
Regression
  • In a paper by Scott (1992), it is shown that in the
    presence of noisy data, a good approximation of
    the relation between input/output pairs is

25
  • After much work, it can be shown that in the
    discrete domain (assuming Gaussian noise)

26
Regression
  • The function is called a normalized
    Gaussian, and also a normalized RBF. Often
    the denominator is left off.
  • What it says is that a good approximation or
    regression function is an RBF-based network of
    such functions

27
Regularization
  • An important point to remember is that we always
    have noise in the data, and thus it is probably
    not a good idea to develop a function that
    exactly fits the data.
  • Another way to think of it is that fitting the
    data too closely will likely give poor
    generalization
  • The next slide shows a bad (overfitted) result, and
    the one after that a good one

28
(No Transcript)
29
(No Transcript)
30
(No Transcript)
31
(No Transcript)
32
(No Transcript)
33
(No Transcript)
34
Matlab RBF
  • The following slide shows Matlab's view of an
    RBF network.

35
(No Transcript)
36
Spread and Bias on Matlab
  • From the preceding figure we have that the output
    of the RB neuron is

37
Bias in Matlab RBF
  • In the Matlab toolbox for radial basis networks,
    the authors use a bias b instead of the spread (s)
    used by some authors, to be consistent.
  • The relationship is

38
Bias Spread in Matlab RBF
  • Let's say we set b = 0.1 in a radial basis
    neuron. What does that mean?
  • In the RB evaluation, it means we will have

39
Bias Spread in Matlab RBF
  • It means its output will be 0.5 when the input is at
    a distance of about 8.33 from the weight (center)
    vector, as in the sketch below
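  A minimal sketch of that claim, assuming the Matlab radial basis transfer
  function a = exp(-(b*dist)^2):

    b = 0.1;
    d_half = sqrt(log(2)) / b    % distance at which the output falls to 0.5 (about 8.33)
    a = exp(-(b * d_half)^2)     % returns 0.5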

40
More RBF
  • Consider the following for an RBF network (in
    Matlab notation)
  • The next figure shows the network output as the
    input varies over [-2, 2]

41
(No Transcript)
42
More RBF
  • The next slide shows how the network output
    varies as the various parameters vary.
  • Remember

43
(No Transcript)
44
Matlab dist(W,P)
  • Calculates the Euclidean distance from W to P
  • W is SxR: S rows x R columns
  • S = number of weight vectors, and each row is an
    R-dimension weight vector
  • P is RxQ: R rows x Q column vectors
  • Each column of P is an R-dimension vector

45
Matlab dist(W,P)
  • Assume we have a first layer of 4 neurons, with
    each neuron having a 3-dimension input vector (P)
  • To calculate dist for each neuron (see the sketch below)
  • W = 4 rows (weight vectors) of 3 columns each
  • P = 3 rows of one column (for a single P vector)
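  A minimal MATLAB sketch of that calculation (random values used only to
  show the dimensions):

    W = rand(4,3);                  % 4 weight vectors (rows), each of dimension R = 3
    P = rand(3,1);                  % one input column vector of dimension 3
    D = zeros(4,1);
    for s = 1:4
        D(s) = norm(W(s,:)' - P);   % Euclidean distance from neuron s's weights to P
    end
    D                               % with the toolbox, dist(W,P) should give the same 4x1 result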

46
(No Transcript)
47
Parameters, Spread, etc.
  • The following slide shows the standard notation
    for a MATLAB RBF neuron
  • In such a neuron, we use

48
(No Transcript)
49
Question
  • What happens to the RBF as b and/or the dist
    gets bigger?
  • Bigger (larger maximum)
  • Smaller (smaller maximum)
  • Narrower
  • Wider
  • ???

50
(No Transcript)
51
Interpolation
  • There is sort of a converse to Cover's theorem
    about separability. It is
  • One can often use a nonlinear mapping to
    transform a difficult filtering (regression)
    problem into one that can be solved linearly.

52
More Regression/Interpolation
  • In a sense, the regression problem builds an
    equation to give us the input/output relationship
    on a set of data. Given one of the inputs we
    trained on, we should be able to use this
    equation to get the output within some error
  • The interpolation problem addresses the issue of
    what value(s) does our network give us for inputs
    that we have not trained on

53
More Regression/Interpolation
  • Given a set of N different points xi, and a set of N
    corresponding real numbers di, find a function F
    that satisfies
  • F(xi) = di,  i = 1, 2, ..., N
  • For strict interpolation, F should pass through
    all data points.

54
RBF Regression
  • The RBF approach to regression chooses F such
    that

55
RBF Regression
  • From the preceding, we get the following
    simultaneous linear equations for the unknown
    weights

56
RBF Regression
  • The next slide represents the author's pictorial
    of an RBF network

57
(No Transcript)
58
RBF Regression
  • Thus we have (in matrix form)
  • The question is whether or not Φ is invertible
    (see the sketch below)
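  A minimal MATLAB sketch of the strict interpolation case, with one Gaussian
  per data point (the 1-D data and the spread sigma are illustrative):

    x = [-2 -1 0 1 2]';                       % N training inputs
    d = [0.1 0.4 1.0 0.4 0.1]';               % N desired outputs
    sigma = 1;
    [Xi, Xj] = meshgrid(x, x);
    Phi = exp(-(Xi - Xj).^2 / (2*sigma^2));   % N x N matrix of phi(||xi - xj||)
    w = Phi \ d;                              % solve Phi*w = d (possible when Phi is invertible)
    F = @(xq) exp(-(xq - x').^2 / (2*sigma^2)) * w;
    F(0)                                      % returns 1.0: F passes through every training point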

59
Micchelli's Theorem
  • Micchelli's theorem defines a class of radial basis
    functions φ(x) for which the interpolation matrix Φ
    is invertible.
  • There is a large class of radial basis functions
    covered by Micchelli's theorem
  • In what follows, it is required that all of
    the data points be distinct, i.e. no two points
    are at the same location in space.
  • NOTE: Radial basis functions are also called
    kernel functions

60
Micchelli's Theorem
61
RBF
  • The radial basis function most commonly used that
    satisfies Micchelli's theorem is

62
RBF
  • In the preceding, there will be one RBF neuron
    for each data point (N). Thus, the Φ matrix will
    be NxN, with each row holding the φ's for that
    data point.
  • Thus, the preceding equation (A) is guaranteed to
    pass through every point of the data if
  • There is one RBF neuron for every x point.
  • The sensitivity of each neuron is such that at
    each data point there is only one neuron with a
    non-zero value
  • The center of each neuron corresponds to a unique
    data point

63
RBF
  • For this size network
  • One neuron for each data point usually means an
    overtrained (overfitted) network
  • There is also the problem of inverting Φ, an NxN
    matrix - the inversion time grows as O(N^3). Thus
    for 1,000 data points we would need on the order
    of 1,000,000,000 operations

64
RBF
  • Thus, generally we accept a suboptimal solution
    in which the number of basis functions or RBF
    neurons is less than the number of data points (N),
    and the number of centers of basis functions is less than N
  • The question then becomes one of choosing the
    best number of neurons and the best centers for
    those neurons

65
Classification
  • RBFs can also be used as classifiers.
  • Remember BPNNs tend to divide the input space
    (figure a next slide) whereas RBFs tend to
    kernelize the space (figure b next slide)

66
(a) BPNN (b) RBFNN
67
Learning Strategies
  • Typically, the weights of the two layers are
    determined separately, i.e. find RBF weights
    (centers), and then find output layer weights
  • There are several methods for parameterizing the
    RBFs, selecting the centers, and selecting the
    output layer weights

68
Learning Strategies
  • Earlier, we showed one way in which there was one
    RBF neuron for each data point. For this network,
    we only have to choose the sensitivity or spread
    for a neuron so that it does not make its RBF
    function overlap with another neuron
  • Of course we always have the problem of inverting
    a really big matrix.

69
RBF Layer Another Strategy
  • In this case we simply select (randomly) from the
    data set a subset of data points and use them as
    the centers, i.e. the xi's
  • How many to select?
  • Which ones to select?
  • The following approach defines the way to adjust
    the sensitivity

70
Fixed Centers
71
Fixed Centers
  • From this, what parameters are left to find?

72
Fixed Centers
  • From this, what parameters are left to find?
  • The w's of the output layer
  • For this, we need a weight matrix W such that

73
Fixed Centers
  • The problem with the preceding is that Φ is
    probably not square, and hence can't be inverted.
    So how do we find W?
  • To do this we can again use the pseudo-inverse,
    and thus

74
Fixed Centers Example - XOR
  • For the XOR, we will use the following simplified
    Gaussian
  • The centers of the RBFs will be (1,1) and (0,0)

75
Fixed Centers Example - XOR
  • We want
  • data point Input pattern x desired output
  • 1 (1,1) 0
  • 2 (0,1) 1
  • 3 (0,0) 0
  • 4 (1,0) 1

76
XOR Example
  • What we want is to find W given Φ and d
  • Remember
  • Since we have chosen 2 RBF centers, our Φ matrix
    will have 4 rows (there are 4 data points) and
    three columns. The first two columns are for the
    two RBF centers, and the third column is 1
    for the bias multiplier.

77
XOR Example
78
XOR Example
  • We have the following (after some algebra)

79
XOR Example
  • We want W such that ΦW = d
  • W = Φ^-1 d, but Φ is not invertible, i.e. not
    square
  • We can use what is called the pseudo-inverse
  • W = Φ+ d = (ΦT Φ)^-1 ΦT d

80
XOR Example
  • Something like W = (ΦT Φ)^-1 ΦT d looks
    difficult; however, with Matlab it is simple (a
    sketch is given below)
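  The Matlab slide itself is not in the transcript; the following sketch is
  one way to carry out the computation, using the simplified Gaussian
  exp(-||x - c||^2) with centers (1,1) and (0,0) and a bias column of ones:

    X = [1 1; 0 1; 0 0; 1 0];                        % the four input patterns
    d = [0; 1; 0; 1];                                % desired XOR outputs
    c1 = [1 1];  c2 = [0 0];
    phi1 = exp(-sum((X - repmat(c1,4,1)).^2, 2));
    phi2 = exp(-sum((X - repmat(c2,4,1)).^2, 2));
    Phi  = [phi1 phi2 ones(4,1)];                    % 4 x 3 matrix
    W = pinv(Phi) * d                                % pseudo-inverse solution
    % W comes out close to [-2.5019; -2.5019; 2.8404] (differences are rounding)
    Phi * W                                          % approximately [0; 1; 0; 1]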

81
(No Transcript)
82
XOR Example
  • We now have everything, including
  • WT = [-2.5019  -2.5019  2.8404]
  • Remember, we're using ΦW = d
  • For a given input, say x1 = (1,1), we have
  • [1  0.1353  1] [-2.5019  -2.5019  2.8404]T = 0 (approximately)

83
XOR Example
  • We now have everything, including
  • WT = [-2.5019  -2.5019  2.8404]
  • Remember, we're using ΦW = d
  • Input x    desired output    actual output
  • (1,1)      0                 3.3E-15 (0)
  • (0,1)      1                 0.999 (1)
  • (0,0)      0                 3.3E-15 (0)
  • (1,0)      1                 0.999 (1)

84
Self-Organized Center Selection
  • Fixed centers may require a relatively large
    number of randomly selected centers to work well
  • This method (self-organized) uses a two stage
    process (iterative)
  • Self-organized learning stage estimate RBF
    centers
  • Supervised Learning estimate linear weights of
    output layer

85
Self-Organized Learning Stage
  • For this stage we use a clustering algorithm that
    divides or partitions the data points into
    subgroups
  • It needs to be unsupervised, so that it is simply
    a clustering approach
  • Commonly use K-means clustering
  • Place RBF centers only in those regions of the
    data space where there are significant
    concentrations of data (input vectors)

86
Self-Organized Learning Stage
  • Clustering is a form of unsupervised learning
    whereby a set of observations (i.e., data points)
    is partitioned into natural groupings or
    clusterings of patterns in such a way that the
    measure of similarity between any pair of
    observations assigned to each cluster minimizes a
    specified cost function. Text/Haykin

87
Self-Organized Learning Stage
  • Let us assume we have K clusters, where K < the
    number of observations, i.e. data points
  • What we want is to find the minimum of a cost
    function J(C)
  • C(i) = j means observation xi is assigned to cluster j.
    This is the cost function we use:
  • J(C) = sum over clusters j of sum over the xi with
    C(i) = j of ||xi - µj||^2, where µj is the mean of cluster j

88
Self-Organized Learning Stage
  • The first summation, over C(i) = j, means that a
    given xi is assigned to cluster j, i.e. the one
    it is closest to.
  • The µ term is considered the center or mean of
    the xi's assigned to that cluster.
  • Thus we want the cluster centers that minimize the
    total squared distance of points from their cluster centers

89
Self-Organized Learning Stage
  • The expression
  • Can be thought of as the variance for a cluster.
  • Now how do we minimize J(C)?

90
Self-Organized Learning Stage
  • Given a set of N observations (data points), find
    the encoder C that assigns these observations to
    the K clusters in such a way that, within each
    cluster, the average measure of dissimilarity of
    the assigned observations from the cluster mean
    is minimized. Hence, K-means clustering. (text)

91
Finding the Clusters
  • The easiest way to figure out k-means clustering
    is with a numeric example.

92
Self-Organized Learning Stage
  • Initialization: choose random values for the
    initial set of centers tk(0), subject to the
    restriction that these initial values must all be
    different
  • Sampling: randomly draw a sample vector from the
    input space
  • Similarity matching: let k(x) denote the best
    matching (closest) center for input x

93
Self-Organized Learning Stage
  • Initialization choose random values for the
    initial set of centers tk(0) subject to the
    restriction that these initial values must all be
    different
  • Randomly draw a sample vector from the input
    space
  • Similarity matching Let k(x) denote the best
    matching center for input x (closest)

94
Self-Organized Learning Stage
  • Find k(x) at iteration n using the minimum
    distance Euclidean criterion

95
Self-Organized Learning Stage
  • Updating: adjust the closest RBF center as
  • Continuation: repeat until no noticeable changes
    in the centers occur
  • Generally, reduce the learning rate over time

96
Another Clustering Technique
  • In this example, we will use the data values as
    the initial cluster centers and then iterate from
    there.
  • Assume we have 4 data points
  • P1 (1,1)
  • P2 (2,1)
  • P3 (4,3)
  • P4 (5,4)

97
Another Clustering Technique
  • We will assume there are 2 clusters and choose
    P1 and P2 as the initial cluster centers

98
(No Transcript)
99
Another Clustering Technique
  • We begin by calculating the distance from each
    centroid to each data point
  • Each (row, column) entry (R,C) in D is the distance
    from centroid R to data point C, e.g. D(1,3) = 3.61 is
    the distance from centroid 1 (initially data
    point 1) to point 3

100
Another Clustering Technique
  • We can now assign each point to one of the 2
    clusters

101
Another Clustering Technique
  • We now compute the new cluster centers based on
    this grouping

102
Another Clustering Technique
  • We now compute the new cluster centers based on
    this grouping

103
Another Clustering Technique
  • Now we again calculate the distance from each new
    centroid to each data point
  • For each point (column), we assign it to the
    cluster (row) that is closest

104
Another Clustering Technique
  • We now compute the new cluster centers based on
    this new grouping

105
Another Clustering Technique
  • If we repeat this process again, we will find that
    there is no change to the clustering, and so we
    quit with the two cluster centers as given (a short
    Matlab sketch of the iteration follows)
  • (1.5, 1) and (4.5, 3.5)
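  A minimal MATLAB sketch of the iteration just described (plain K-means,
  no toolbox functions):

    P = [1 1; 2 1; 4 3; 5 4];          % the four data points, one per row
    C = P(1:2, :);                     % initial centers: P1 and P2
    idx = zeros(1, 4);
    for iter = 1:10
        for i = 1:4                    % assign each point to its nearest center
            [~, idx(i)] = min(sum((repmat(P(i,:), 2, 1) - C).^2, 2));
        end
        for k = 1:2                    % recompute each center as the mean of its points
            C(k, :) = mean(P(idx == k, :), 1);
        end
    end
    C                                  % converges to [1.5 1; 4.5 3.5], as above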

106
(No Transcript)
107
K-means Clustering
  • It should be noted that K-means clustering is
    very dependent on
  • Number of centers
  • Initial location of centers

108
Adaptive K-means with dynamic initialization
  • Randomly pick a set of centers c1, ..., ck
  • 1. Disable all cluster centers
  • 2. Read an input vector x
  • 3. If the closest enabled cluster center ci is
    within distance r of x, or if all cluster centers
    are already enabled, update ci as

109
Adaptive K-means with dynamic initialization -
Continued
  • 4. Otherwise, enable a new cluster center ck and set
    it equal to x
  • 5. Continue for a fixed number of iterations,
    or until the learning rate has decayed to 0
  • Yes, you still have the challenge of setting r

110
Next Step
  • Once the centers or centroids have been chosen in
    the first layer (the RBF layer), the next step is
    to set the widths of the centers, i.e. the
    variance
  • How?

111
Next Step
  • Can use a P-nearest-neighbor heuristic (sketched below)
  • Given a cluster center ck,
  • select the P nearest neighboring cluster centers
  • Set the width of ck to the root mean square of the
    distances to these clusters
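  A minimal MATLAB sketch of this heuristic (the centers and P = 2 are
  illustrative):

    C = [1.5 1; 4.5 3.5; 0 4; 5 0];    % cluster centers, one per row
    Pnn = 2;                            % number of nearest neighboring clusters to use
    K = size(C, 1);
    sigma = zeros(K, 1);
    for k = 1:K
        d = sqrt(sum((repmat(C(k,:), K, 1) - C).^2, 2));   % distances to all centers
        d = sort(d);                                       % d(1) = 0 is the center itself
        sigma(k) = sqrt(mean(d(2:Pnn+1).^2));              % RMS of the Pnn nearest distances
    end
    sigma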

112
Next Step
113
Finally
  • The last step is to set the weights of the output
    layer
  • Remember, the output layer is a linear neuron(s)
  • Since it's linear, we could just use the delta
    training rule
  • Note also, that there is no real requirement that
    the output layer be linear, e.g. could be a BPNN

114
Linear Least Squares
  • Having set the parameters of the first (RB)
    layer, we can use linear least squares to
    determine the weights for the second layer
  • Instead of
  • {p1, t1}, ..., {pQ, tQ}
  • Q = number of training points, p = network input,
    t = target

115
Linear Least Squares
  • We have

116
Linear Least Squares
  • We want to minimize the sum square error
    performance

117
Linear Least Squares
  • With the preceding substitutions, the sum square
    error becomes

118
Regularization
  • Regularization is one of many techniques used to
    improve the generalization capabilities of a
    network.
  • It is done by modifying the sum squared error
    index so that it includes a penalty for network
    complexity

119
Regularization
  • There are two ways to think of network complexity
  • Number of neurons
  • Magnitude of weights
  • When the weights are large, the network can have
    large slopes and is more likely to overfit,
    i.e. it's not as smooth

120
Regularization
  • Thus, in the F(x) error function we try to
    minimize, we include such a term

121
Linear Least Squares
  • Going back to slide 117, we have

122
Linear Least Squares
  • If we set the gradient to 0 and solve, we will have
    the minimum (see the sketch below)
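  A minimal MATLAB sketch of the resulting weight computation, assuming the
  regularized sum-square-error index F = (t - U*x)'*(t - U*x) + rho*x'*x,
  where U holds the first-layer outputs (plus a bias column) and x holds the
  second-layer weights (sizes here are illustrative):

    U   = rand(6, 4);                  % 6 training points, 3 RBF neurons + bias column
    t   = rand(6, 1);                  % targets
    rho = 0.01;                        % regularization parameter
    x = (U'*U + rho*eye(4)) \ (U'*t)   % zero-gradient solution of the regularized index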

123
LLS Example
  • Let's say the function to approximate has the
    following input/output pairs
  • P = [-2, -1.2, -0.4, 0.4, 1.2, 2]
  • T = [0, 0.19, 0.69, 1.3, 1.8, 2]
  • Choose the basis centers to be equally spaced
    throughout the input range [-2, 2]
  • Choose the biases to be 1/(spacing between centers)
    (a sketch follows)
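  A minimal MATLAB sketch of this example. The transcript does not say how
  many neurons are used; three centers at -2, 0, 2 are assumed here, so the
  spacing is 2 and the bias is 1/2:

    P = [-2 -1.2 -0.4 0.4 1.2 2]';
    T = [0 0.19 0.69 1.3 1.8 2]';
    c = [-2 0 2];                                          % assumed centers
    b = 1/2;                                               % bias = 1/spacing
    A = exp(-(b * (repmat(P,1,3) - repmat(c,6,1))).^2);    % first-layer outputs, a = exp(-(b*dist)^2)
    U = [A ones(6,1)];                                     % add the bias column for the linear layer
    w = U \ T                                              % linear-least-squares output weights
    U * w                                                  % fitted values at the training inputs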

124
LLS Example
125
LLS Example
126
(No Transcript)
127
(No Transcript)
128
Matlab
  • Newrbe
  • One neuron for each data item
  • Newrb
  • Adds neurons as needed one at a time until the
    error maximum is reached
  • The smaller the error maximum the more neurons
    needed

129
Matlab - NEWRBE
  • >> net = newrbe(P,T,SPREAD)
  • P = input vectors; if there are Q input vectors,
    there will be Q RBF neurons
  • T = target vector
  • SPREAD = spread constant for the RBF neurons

130
Matlab - SPREAD
  • The bias value for each RBF neuron is set to
    0.8326/SPREAD
  • This means a neuron's output will be 0.5 if the
    weighted input (distance) is +/- SPREAD.
  • A SPREAD of 4 means that a neuron at distance 4
    from an input vector will output 0.5 for that vector.
  • SPREAD should be large enough that for each input
    X several neurons will be active.

131
Matlab SPREAD
  • How does one determine what SPREAD should be to
    meet the preceding criteria?

132
Testing the network
  • Once a network has been defined, one can test it
    as
  • Y = sim(net, X)
  • X = test data
  • net = network created with net = newrbe(...)
  • Y = array of outputs

133
NEWRB
  • newrb is like newrbe except it starts with one
    neuron and adds neurons until the mean squared
    error is less than some maximum
  • net = newrb(input, target, errormax, spread)
    (a usage sketch follows)
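  A minimal usage sketch (newrbe, newrb, and sim as found in older versions
  of the Neural Network Toolbox; the data are illustrative):

    P = -2:0.4:2;                   % training inputs (row vector)
    T = sin(P);                     % training targets
    net1 = newrbe(P, T, 1);         % exact design: one RBF neuron per input, SPREAD = 1
    net2 = newrb(P, T, 0.01, 1);    % adds neurons until the MSE goal 0.01 is met, SPREAD = 1
    X = -2:0.1:2;                   % test data
    Y = sim(net2, X);               % network outputs for the test data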

134
Matlab Demos
  • Demorb1
  • Solves a regression with 5 neurons
  • Demorb3
  • Sets the spread too small
  • Demorb4
  • Sets the spread too large