Radial-Basis Function Networks
Transcript and Presenter's Notes

1
Radial-Basis Function Networks
  • CS/CMPE 537 Neural Networks

2
Introduction
  • Typical tasks performed by neural networks are
    association, classification, and filtering. This
    categorization has historical significance as
    well.
  • These tasks involve input-output mappings, and
    the network is designed to learn the mapping from
    knowledge of the problem environment
  • Thus, the design of a neural network can be
    viewed as a curve-fitting or function
    approximation problem
  • This viewpoint is the motivation for radial-basis
    function networks

3
Radial-Basis Function Networks
  • RBF networks are two-layer networks: input source
    nodes, hidden neurons with (nonlinear) basis
    functions, and output neurons with linear/nonlinear
    activation functions
  • The theory of radial-basis function networks is
    built upon function approximation theory in
    mathematics
  • RBF networks were first used in 1988. Major work
    was done by Moody and Darken (1989) and Poggio
    and Girosi (1990)
  • In RBF networks, the mapping from the input space
    to the high-dimensional hidden space is nonlinear,
    while that from the hidden space to the output
    space is linear
  • What is the basis for this?

4
Radial-Basis Function Network
[Figure: two-layer RBF network architecture. Source nodes x_1, ..., x_p
feed hidden neurons with RBF activation functions φ_1(.), ..., φ_M(.);
their outputs are combined through weights w to produce the outputs
y_1 and y_2.]
5
Cover's Theorem (1)
  • Cover's theorem (1965) gives the motivation for
    RBF networks
  • Cover's theorem on the separability of patterns:
  • A complex pattern-classification problem cast
    nonlinearly into a high-dimensional space is more
    likely to be linearly separable than in a
    low-dimensional space

6
Cover's Theorem (2)
  • Consider a set X of N p-dimensional vectors (input
    patterns) x_1 to x_N. Let X+ and X- be a binary
    partition of X, and φ(x) = [φ_1(x), φ_2(x), ...,
    φ_M(x)]^T.
  • Cover's theorem:
  • A binary partition (dichotomy) {X+, X-} of X is
    said to be φ-separable if there exists an
    M-dimensional vector w such that
  • w^T φ(x) ≥ 0 when x belongs to X+
  • w^T φ(x) < 0 when x belongs to X-
  • Decision boundary or surface:
  • w^T φ(x) = 0

7
Cover's Theorem (3)
8
Example (1)
  • Consider the XOR problem to illustrate the
    significance of φ-separability and Cover's
    theorem.
  • Define a pair of Gaussian hidden functions:
  • φ_1(x) = exp(−||x − t_1||^2), t_1 = [1, 1]^T
  • φ_2(x) = exp(−||x − t_2||^2), t_2 = [0, 0]^T
  • Output of these functions for each pattern (see
    the sketch below)
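  A minimal MATLAB sketch of this computation (the centers and the
  unit-width Gaussians follow the slide; the rest is illustrative):

    % Hidden-unit outputs for the four XOR patterns
    X  = [0 0; 0 1; 1 0; 1 1];            % input patterns (one per row)
    t1 = [1 1];  t2 = [0 0];              % Gaussian centers from the slide
    phi1 = exp(-sum((X - t1).^2, 2));     % phi_1(x) = exp(-||x - t1||^2)
    phi2 = exp(-sum((X - t2).^2, 2));     % phi_2(x) = exp(-||x - t2||^2)
    disp([phi1 phi2])                     % in (phi_1, phi_2) space the two XOR
                                          % classes become linearly separable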

9
Example (2)
10
Function Approximation (1)
  • Function approximation seeks to describe the
    behavior of complex functions by ensembles of
    simpler functions
  • Describe f(x) by F(x)
  • F(x) can be described in a compact region of
    input space by
  • F(x) = Σ_{i=1}^{N} w_i φ_i(x)
  • Such that
  • |f(x) − F(x)| < ε
  • ε can be made arbitrarily small
  • Choice of functions φ(.)?

11
Function Approximation (2)
  • Find F(x) that best approximates the
    map/function f. The best approximation is problem
    dependent, and it can be strict interpolation or
    good generalization (regularized interpolation).
  • Design decisions
  • Choice of elementary functions φ(.)
  • How to compute the weights w?
  • How many elementary functions to use (i.e., what
    should N be)?
  • How to obtain good generalization?

12
Choice of Elementary Functions φ
  • Let f(x) belong to the function space L^2(R^p)
    (true for almost all physical systems)
  • We want the φ_i to form a basis of L^2
  • What is meant by a basis?
  • A set of functions φ_i (i = 1, ..., M) is a basis
    of L^2 if a linear superposition of the φ_i can
    generate any function in L^2. Moreover, they must
    be linearly independent:
  • w_1 φ_1 + w_2 φ_2 + ... + w_M φ_M = 0 iff w_i = 0
    for all i

13
Interpolation Problem (1)
  • In general, the map from an input space to an
    output space is given by
  • f: R^p → R^q
  • p and q: input and output space dimensions; f: map
    or hypersurface
  • Strict interpolation problem:
  • Given a set of N different points x_i (i = 1, ..., N)
    and a corresponding set of N real numbers d_i
    (i = 1, ..., N), find a function F: R^p → R^1 that
    satisfies the interpolation condition
  • F(x_i) = d_i,  i = 1, ..., N
  • The function F passes through all the points

14
Interpolation Problem (2)
  • A common choice of φ(.) is a radially symmetric
    basis function
  • F(x) = Σ_{i=1}^{N} w_i φ(||x − x_i||)
  • Substituting and writing in matrix form:
  • Φ w = d
  • Φ = {φ_ji} (i, j = 1, ..., N): interpolation matrix,
    with φ_ji = φ(||x_j − x_i||)
  • w: linear weight vector; d: desired response vector
    (see the solve sketch below)
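  A minimal MATLAB sketch of strict interpolation; the Gaussian form of
  φ, its width, and the toy data are assumptions for illustration:

    X = [0 0; 0 1; 1 0; 1 1];             % N = 4 training inputs (rows)
    d = [0; 1; 1; 0];                     % desired responses
    sigma = 1;                            % assumed width of the Gaussian phi
    N = size(X, 1);
    Phi = zeros(N, N);                    % interpolation matrix phi_ji
    for j = 1:N
        for i = 1:N
            Phi(j, i) = exp(-norm(X(j,:) - X(i,:))^2 / (2*sigma^2));
        end
    end
    w = Phi \ d;                          % solves Phi*w = d, i.e. w = Phi^(-1) d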

15
Interpolation Problem (3)
  • Φ is known to be positive definite for a certain
    class of radial-basis functions. Thus,
  • w = Φ^(-1) d
  • In theory, w can be computed. In practice,
    however, Φ is often close to singular
  • Then what?
  • Use regularization theory to perturb Φ and make it
    non-singular
  • But there is another problem: poor
    generalization or overfitting

16
Ill-Posed Problems
  • Supervised learning is an ill-posed problem
  • There is not enough information in the training
    data to reconstruct the input-output mapping
    uniquely
  • The presence of noise or imprecision in the input
    data adds uncertainty to the reconstruction of
    the input-output mapping
  • To achieve good generalization, additional
    information about the domain is needed
  • In other words, the input-output patterns should
    exhibit redundancy
  • Redundancy is achieved when the physical
    generator of data is smooth, and thus can be used
    to generate redundant input-output examples

17
Regularization Theory (1)
  • How to make an ill-posed problem well-posed?
  • By constraining the mapping with additional
    information (e.g. smoothness) in the form of a
    nonnegative functional
  • Proposed by Tikhonov in 1963 in the context of
    function approximation in mathematics

18
Regularization Theory (2)
  • Input-output examples: (x_i, d_i), i = 1, ..., N
  • Find the mapping F(x): R^p → R^1 for the
    input-output examples
  • In regularization theory, F is found by
    minimizing the cost functional E(F)
  • E(F) = E_s(F) + λ E_c(F)
  • Standard error term:
  • E_s(F) = 0.5 Σ_{i=1}^{N} (d_i − y_i)^2
    = 0.5 Σ_{i=1}^{N} (d_i − F(x_i))^2
  • Regularization term:
  • E_c(F) = 0.5 ||P F(x)||^2
  • P: linear differential operator

19
Regularization Theory (3)
  • Regularization term depends on the geometric
    properties of the approximating function
  • The selection of the operator P is therefore
    problem dependent based on prior knowledge of the
    geometric properties of the actual function f(x)
    (e.g. the smoothness of f(x))
  • Regularization parameter λ: a positive real
    number
  • This parameter indicates the sufficiency of the
    given input-output examples in capturing the
    underlying function f(x)
  • The solution of the regularization problem is a
    function F(x)
  • We won't go into the details of how to find F, as
    that requires a good understanding of functional
    analysis

20
Regularization Theory (4)
  • Solution of the regularization problem yields
  • F(x) = (1/λ) Σ_{i=1}^{N} [d_i − F(x_i)] G(x, x_i)
    = Σ_{i=1}^{N} w_i G(x, x_i)
  • G(x, x_i): Green's function centered on x_i
  • In matrix form:
  • F = G w
  • or
  • (G + λI) w = d
  • and
  • w = (G + λI)^(-1) d
  • G depends only on the operator P (a solve sketch
    follows)
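  A minimal MATLAB sketch of this solve, assuming a Gaussian Green's
  function; the toy data, λ, and σ are illustrative values:

    X = randn(20, 2);  d = randn(20, 1);  % toy inputs and desired responses
    lambda = 0.1;  sigma = 1;             % regularization parameter and width
    N = size(X, 1);
    G = zeros(N, N);                      % Green's matrix G(x_i, x_j)
    for i = 1:N
        for j = 1:N
            G(i, j) = exp(-norm(X(i,:) - X(j,:))^2 / (2*sigma^2));
        end
    end
    w = (G + lambda * eye(N)) \ d;        % w = (G + lambda*I)^(-1) d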

21
Type of Function G(x, x_i)
  • If P is translationally invariant, then G(x, x_i)
    depends only on the difference of x and x_i, i.e.
  • G(x, x_i) = G(x − x_i)
  • If P is both translationally and rotationally
    invariant, then G(x, x_i) depends only on the
    Euclidean norm of the difference vector x − x_i,
    i.e.
  • G(x, x_i) = G(||x − x_i||)
  • This is a radial-basis function
  • If P is further constrained, and G(x, x_i) is
    positive definite, then we have the Gaussian
    radial-basis function, i.e.
  • G(x, x_i) = exp(−(1/(2σ^2)) ||x − x_i||^2) (see the
    helper sketch below)
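  As a one-line MATLAB helper (sigma is a user-chosen width, not a
  value from the slides):

    gaussG = @(x, xi, sigma) exp(-sum((x - xi).^2) / (2*sigma^2));  % G(x, x_i)
    gaussG([1 2], [0 0], 1)                                         % example call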

22
Regularization Network (1)
23
Regularization Network (2)
  • The regularization network is based on the
    regularized interpolation problem
  • F(x) = Σ_{i=1}^{N} w_i G(x, x_i)
  • It has 3 layers
  • Input layer of p source nodes, where p is the
    dimension of the input vector x (or number of
    independent variables)
  • Hidden layer with N neurons, where N is the
    number of input-output examples. Each neuron uses
    the activation function G(x, x_i)
  • Output layer with q neurons, where q is the
    output dimension
  • The unknowns are the weights w (only) from the
    hidden layer to the output layer

24
RBF Networks (in Practice) (1)
  • The regularization network requires N hidden
    neurons, which becomes computationally expensive
    for large N
  • The complexity of the network is reduced to
    obtain an approximate solution to the
    regularization problem
  • The approximate solution F(x) is then given by
  • F(x) = Σ_{i=1}^{M} w_i φ_i(x)
  • φ_i(x) (i = 1, ..., M): new set of basis functions;
    M is typically less than N
  • Using radial-basis functions (a forward-pass sketch
    follows):
  • F(x) = Σ_{i=1}^{M} w_i φ(||x − t_i||)
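  A small MATLAB sketch of this reduced forward pass, with hypothetical
  centers t_i, weights w_i, and a common Gaussian width:

    T = [0 0; 1 1; 2 0];                  % M = 3 centers t_i (rows), illustrative
    w = [0.5; -0.2; 1.0];                 % linear weights w_i
    sigma = 1;                            % common width
    F = @(x) sum(w .* exp(-sum((T - x).^2, 2) / (2*sigma^2)));
    F([1 0])                              % F(x) = sum_i w_i * phi(||x - t_i||)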

25
RBF Networks (2)
26
RBF Networks (3)
  • Unknowns in the RBF network
  • M, the number of hidden neurons (M < N)
  • The centers t_i of the radial-basis functions
  • And the weights w

27
How to Train RBF Networks - Learning
  • Normally the training of the hidden layer
    parameters (number of hidden neurons, centers and
    variance of Gaussian) is done prior to the
    training of the weights (i.e. on a different
    time scale)
  • This is justified based on the fact that the
    hidden layer performs a different task
    (nonlinear) than the output layer weights
    (linear)
  • The weights are learned by supervised learning
    using an appropriate algorithm (LMS or BP)
  • The hidden layer parameters are learned by (in
    general, but not always) unsupervised learning

28
Fixed Centers Selected at Random
  • Randomly select M inputs as centers for the
    activation functions
  • Fix the variance of the Gaussians based on the
    distance between the selected centers. A
    radial-basis function centered at t_i is then
    given by
  • φ(||x − t_i||) = exp(−(M/d^2) ||x − t_i||^2)
  • d: maximum distance between the chosen centers
  • The width or standard deviation of the functions
    is fixed, given by σ = d / √(2M)
  • The linear weights are then computed by solving
    the regularization problem or by using supervised
    learning (see the sketch below)
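  A MATLAB sketch of this procedure on toy data; the data, M, and the
  use of a pseudo-inverse for the weights are assumptions:

    N = 200;  M = 10;
    X = randn(N, 2);                      % toy training inputs
    d = sin(X(:,1)) + 0.1*randn(N, 1);    % toy desired responses
    T = X(randperm(N, M), :);             % M centers picked at random from the inputs
    dmax = 0;                             % maximum distance between chosen centers
    for i = 1:M
        for j = 1:M
            dmax = max(dmax, norm(T(i,:) - T(j,:)));
        end
    end
    sigma = dmax / sqrt(2*M);             % width from the slide (sigma = d/sqrt(2M))
    G = zeros(N, M);                      % design matrix of hidden-unit outputs
    for n = 1:N
        for i = 1:M
            % exp(-(M/d^2)||x - t_i||^2) is the same Gaussian written with sigma
            G(n, i) = exp(-(M / dmax^2) * norm(X(n,:) - T(i,:))^2);
        end
    end
    w = pinv(G) * d;                      % least-squares linear weights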

29
Self-Organized Selection of Centers
  • Use a self-organizing or clustering technique to
    determine the number and centers of the Gaussian
    functions
  • A common algorithm is the k-means clustering
    algorithm, which partitions the input vectors into
    k clusters and uses the cluster means as the
    centers
  • Then compute the weights using a supervised
    error-correction learning rule such as LMS (see
    the sketch below)
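  A compact MATLAB sketch, assuming a basic batch k-means loop and toy
  data; the iteration counts, width, and LMS step size are illustrative:

    N = 200;  M = 8;  sigma = 1;  eta = 0.05;
    X = randn(N, 2);
    d = sin(X(:,1)) + 0.1*randn(N, 1);
    T = X(randperm(N, M), :);             % initialize centers from random inputs
    for it = 1:50                         % k-means: assign points, update means
        D = zeros(N, M);
        for i = 1:M
            D(:, i) = sum((X - T(i,:)).^2, 2);   % squared distance to center i
        end
        [~, c] = min(D, [], 2);           % nearest-center index for each input
        for i = 1:M
            if any(c == i)
                T(i, :) = mean(X(c == i, :), 1);
            end
        end
    end
    w = zeros(M, 1);                      % LMS on the linear output weights
    for epoch = 1:100
        for n = 1:N
            g = exp(-sum((T - X(n,:)).^2, 2) / (2*sigma^2));  % hidden outputs
            e = d(n) - w' * g;            % error for pattern n
            w = w + eta * e * g;          % LMS update
        end
    end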

30
Supervised Selection of Centers
  • All unknown parameters are trained using
    error-correcting supervised learning
  • A gradient-descent approach is used to find the
    minimum of the cost function with respect to the
    weights w_i, the activation-function centers t_i,
    and the spread σ (see the sketch below)
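  A MATLAB sketch of the gradient-descent updates for the weights and
  centers (single output, fixed common width; the data, learning rates,
  and epoch count are assumptions):

    N = 200;  M = 5;  sigma = 1;
    etaW = 0.02;  etaT = 0.01;            % learning rates (illustrative)
    X = randn(N, 2);
    d = sin(X(:,1)) + 0.1*randn(N, 1);
    T = X(randperm(N, M), :);             % initial centers
    w = zeros(M, 1);
    for epoch = 1:200
        for n = 1:N
            g = exp(-sum((T - X(n,:)).^2, 2) / (2*sigma^2));  % phi_i(x_n)
            e = d(n) - w' * g;                                % instantaneous error
            w = w + etaW * e * g;                             % step on the weights
            % center step: dE/dt_i = -e * w_i * phi_i(x_n) * (x_n - t_i)/sigma^2
            T = T + etaT * e * (w .* g) .* (X(n,:) - T) / sigma^2;
        end
    end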

31
Example (1)
  • Classify between two overlapping
    two-dimensional, Gaussian-distributed patterns
  • Conditional probability density functions for the
    two classes:
  • f(x|C_1) = 1/(2πσ_1^2) exp(−(1/(2σ_1^2)) ||x − μ_1||^2)
  • μ_1 = mean = [0, 0]^T and σ_1^2 = variance = 1
  • f(x|C_2) = 1/(2πσ_2^2) exp(−(1/(2σ_2^2)) ||x − μ_2||^2)
  • μ_2 = mean = [2, 0]^T and σ_2^2 = variance = 4
  • x = [x_1, x_2]^T: two-dimensional input
  • C_1 and C_2: class labels (a sampling sketch follows)
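  A small MATLAB sketch for drawing samples from these two densities
  (the sample counts are illustrative):

    N1 = 500;  N2 = 500;
    X1 = randn(N1, 2);                    % C1: mean [0 0], variance 1 per dimension
    X2 = 2*randn(N2, 2) + [2 0];          % C2: mean [2 0], variance 4 (std 2)
    plot(X1(:,1), X1(:,2), 'b.', X2(:,1), X2(:,2), 'r.')   % visualize the overlap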

32
Example (2)
33
Example (3)
34
Example (4)
  • Consider an RBF network with two inputs, M hidden
    neurons, and two outputs
  • Decision rule: an input x is classified to C_1 if
    y_1 > 0
  • The training set is generated from the
    probability distribution functions
  • Using the perceptron algorithm, the network is
    trained for minimum mean-square-error
  • The testing set is generated from the probability
    distribution functions
  • The trained network is tested for correct
    classification
  • For other implementation details, see the Matlab
    code (a simplified sketch follows)
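  A simplified end-to-end MATLAB sketch of this experiment. It uses
  randomly chosen fixed centers and a least-squares output layer in
  place of the exact training procedure of the original Matlab code;
  M, sigma, and the sample sizes are assumptions:

    M = 20;  sigma = 2;  Ntr = 500;                    % per-class training size
    Xtr = [randn(Ntr,2); 2*randn(Ntr,2) + [2 0]];      % C1 then C2 samples
    Dtr = [repmat([1 -1], Ntr, 1); repmat([-1 1], Ntr, 1)];  % two-output targets
    T = Xtr(randperm(2*Ntr, M), :);                    % random centers
    G = zeros(2*Ntr, M);
    for i = 1:M
        G(:, i) = exp(-sum((Xtr - T(i,:)).^2, 2) / (2*sigma^2));
    end
    W = pinv(G) * Dtr;                                 % M x 2 output weights
    % test on fresh samples; decision rule from the slide: C1 if y1 > 0
    Xte = [randn(Ntr,2); 2*randn(Ntr,2) + [2 0]];
    labels = [ones(Ntr,1); 2*ones(Ntr,1)];
    Gte = zeros(2*Ntr, M);
    for i = 1:M
        Gte(:, i) = exp(-sum((Xte - T(i,:)).^2, 2) / (2*sigma^2));
    end
    Y = Gte * W;
    pred = 1 + (Y(:,1) <= 0);                          % class 1 if y1 > 0, else 2
    accuracy = mean(pred == labels)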

35
Example Function Approximation (1)
  • Approximate the relationship between a car's fuel
    economy (in miles per gallon) and its
    characteristics
  • Input data description: 9 independent
    discrete-valued, boolean, and continuous variables
  • X1 number of cylinders
  • X2 displacement
  • X3 horsepower
  • X4 weight
  • X5 acceleration
  • X6 model year
  • X7 Made in US? (0,1)
  • X8 Made in Europe? (0,1)
  • X9 Made in Japan? (0,1)
  • Output f(X) is fuel economy in miles per gallon

36
Example Function Approximation (2)
  • Using the NNET toolbox, create and train an RBF
    network with the function newrb
  • The function parameters allow you to set the
    mean-squared-error goal of the training, the
    spread of the radial-basis functions, and the
    maximum number of hidden-layer neurons.
  • newrb uses the following approach to find the
    unknowns (it is a self-organizing approach):
  • Start with one hidden neuron; compute the network
    error
  • Add another neuron with its center equal to the
    input vector that produced the maximum error;
    compute the network error
  • If the network error does not improve
    significantly, stop; otherwise go to the previous
    step and add another neuron (a usage sketch follows)
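  A hedged usage sketch of newrb with placeholder data; in the actual
  example P would be the 9 x N matrix of car attributes (one column per
  car) and T the 1 x N vector of miles-per-gallon values. The goal,
  spread, and neuron cap are illustrative:

    P = rand(9, 100);                     % placeholder inputs (columns are examples)
    T = rand(1, 100);                     % placeholder targets
    goal   = 0.01;                        % mean-squared-error goal
    spread = 1.0;                         % spread of the radial-basis functions
    maxN   = 40;                          % maximum number of hidden neurons
    net = newrb(P, T, goal, spread, maxN);% create and train the RBF network
    Y   = sim(net, P);                    % network outputs for the training inputs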

37
Comparison of RBF Network and MLP (1)
  • Both are universal approximators. Thus, an RBF
    network exists for every MLP, and vice versa
  • An RBF network has a single hidden layer, while an
    MLP can have multiple hidden layers
  • The computational neurons of an MLP all share the
    same model, while the neurons in the hidden and
    output layers of an RBF network have different
    models
  • The activation functions of the hidden nodes of
    an RBF network are based on the Euclidean norm of
    the input with respect to a center, while those of
    an MLP are based on the inner product of the input
    and the weights

38
Comparison of RBF Network and MLP (2)
  • MLPs construct global approximations to nonlinear
    input-output mappings. This is a consequence of
    the global (sigmoidal) activation functions used
    in MLPs
  • As a result, MLPs can generalize in regions of the
    input space where no input data are available
    (i.e., extrapolation)
  • RBF networks construct local approximations to
    input-output data. This is a consequence of the
    local Gaussian functions
  • As a result, RBF networks are capable of fast
    learning from the training data