Sparse Kernel Machines - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Sparse Kernel Machines


1
Sparse Kernel Machines
  • Christopher M. Bishop,
  • Pattern Recognition and Machine Learning

2
Outline
  • Introduction to kernel methods
  • Support vector machines (SVM)
  • Relevance vector machines (RVM)
  • Applications
  • Conclusions

3
Supervised Learning
  • In machine learning, applications in which the
    training data comprises examples of the input
    vectors along with their corresponding target
    vectors are called supervised learning

Example training pairs (x, t): (1, 60, pass), (2, 53, fail), (3, 77, pass), (4, 34, fail); from these we learn a model y(x) whose output is the predicted target for a new input.
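A minimal scikit-learn sketch of this supervised-learning setup; interpreting the second number in each pair as the input feature and pass/fail as the target is an assumption about the example, and the query input 65 is illustrative.

```python
from sklearn.linear_model import LogisticRegression

# Training pairs from the slide, read as (input feature, target label);
# this interpretation of the example data is an assumption.
X = [[60], [53], [77], [34]]
t = [1, 0, 1, 0]          # 1 = pass, 0 = fail

model = LogisticRegression().fit(X, t)   # learn y(x) from the (x, t) pairs
print(model.predict([[65]]))             # output: predicted target for a new x
```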
4
Classification
Figure: two-class data in the (x1, x2) plane; the decision boundary y(x) = 0 separates the region y(x) > 0 (class t = +1) from y(x) < 0 (class t = -1).
5
Regression
Figure: regression example; a curve fitted to training data (x, t), with x in [0, 1] and t in [-1, 1], predicts the target for a new x.
6
Linear Models
  • Linear models for regression and classification: y(x) = w^T x + b, where w is the model parameter vector and x is the input
  • If we apply feature extraction with a mapping φ, the model becomes y(x) = w^T φ(x) + b
7
Problems with Feature Space
  • Why feature extraction? Working in a high-dimensional feature space makes it possible to express complex functions
  • Problems
  •   - computational cost: we must work with very large feature vectors
  •   - curse of dimensionality

8
Kernel Methods (1)
  • Kernel function: an inner product in some feature space → a nonlinear similarity measure, k(x, x') = φ(x)^T φ(x')
  • Examples (see the sketch below)
  •   - polynomial: k(x, x') = (x^T x' + c)^d
  •   - Gaussian: k(x, x') = exp(-||x - x'||^2 / 2σ^2)
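A minimal numpy sketch of the two example kernels; the degree, offset c, and width σ used here are illustrative choices, not values from the slides.

```python
import numpy as np

def polynomial_kernel(x, z, degree=2, c=1.0):
    """Polynomial kernel k(x, z) = (x.z + c)^degree."""
    return (np.dot(x, z) + c) ** degree

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel k(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(polynomial_kernel(x, z))   # nonlinear similarity under the polynomial kernel
print(gaussian_kernel(x, z))     # nonlinear similarity under the Gaussian kernel
```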

9
Kernel Methods (2)
  • Many linear models can be reformulated using a dual representation in which the kernel function arises naturally → predictions only require inner products between data points (inputs)

10
Kernel Methods (3)
  • We can benefit from the kernel trick (see the sketch below)
  •   - choosing a kernel function is equivalent to choosing the feature map φ → no need to specify which features are being used
  •   - we can save computation by not explicitly mapping the data to feature space, but instead evaluating the kernel directly on the inputs, which gives the same inner product
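A small numpy sketch of the kernel trick for the homogeneous degree-2 polynomial kernel on 2-D inputs, where the explicit feature map is known in closed form; the particular vectors are illustrative.

```python
import numpy as np

def phi(x):
    """Explicit feature map of the homogeneous degree-2 polynomial kernel (2-D input)."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 3.0])
z = np.array([2.0, -1.0])

explicit = np.dot(phi(x), phi(z))   # inner product computed in feature space
trick = np.dot(x, z) ** 2           # kernel k(x, z) = (x.z)^2 evaluated on the inputs
print(np.isclose(explicit, trick))  # True: same value without mapping to feature space
```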

11
Kernel Methods (4)
  • Kernel methods exploit information about the inner products between data items
  • We can construct kernels indirectly by choosing a feature space mapping φ, or directly by choosing a valid kernel function
  • If a bad kernel function is chosen, it maps the data to a space with many irrelevant features, so we need some prior knowledge of the target problem

12
Kernel Methods (5)
  • Two basic modules for kernel methods
  •   - a general-purpose learning model
  •   - a problem-specific kernel function
13
Kernel Methods (6)
  • Limitation: the kernel function k(x_n, x_m) must be evaluated for all possible pairs x_n and x_m of training points when making predictions for new data points
  • A sparse kernel machine makes predictions using only a subset of the training data points

14
Outline
  • Introduction to kernel methods
  • Support vector machines (SVM)
  • Relevance vector machines (RVM)
  • Applications
  • Conclusions

15
Support Vector Machines (1)
  • Support Vector Machines are a system for efficiently training linear machines in kernel-induced feature spaces, while respecting the insights of generalization theory and exploiting optimization theory
  • Generalization theory describes how to control the learning machine so that it does not overfit

16
Support Vector Machines (2)
  • To avoid overfitting, SVM modify the error function to a regularized form E(w) = E_D(w) + λ E_W(w), where the hyperparameter λ balances the trade-off between the data term E_D and the regularizer E_W
  • The aim of E_W is to restrict the estimated function to smooth functions
  • As a side effect, SVM obtain a sparse model

17
Support Vector Machines (3)
Fig. 1 Architecture of SVM
18
SVM for Classification (1)
  • The mechanism that prevents overfitting in classification is the maximum margin classifier
  • The SVM is fundamentally a two-class classifier

19
Maximum Margin Classifiers (1)
  • The aim of classification is to find a (D-1)-dimensional hyperplane that separates the data in a D-dimensional space
  • 2-D example:

20
Maximum Margin Classifiers (2)
Figure: maximum-margin separating hyperplane; the margin is the distance from the boundary to the closest training points, the support vectors.
21
Maximum Margin Classifiers (3)
Figure: comparison of a small-margin and a large-margin separating hyperplane.
22
Maximum Margin Classifiers (4)
  • Intuitively, it is a robust solution
  •   - if we've made a small error in the location of the boundary, this gives us the least chance of causing a misclassification
  • The concept of maximum margin is usually justified using Vapnik's statistical learning theory
  • Empirically, it works well

23
SVM for Classification (2)
  • After the optimization process, we obtain the prediction model y(x) = Σ_{n=1}^{N} a_n t_n k(x, x_n) + b, where (x_n, t_n) are the N training data (see the sketch below)
  • We find that a_n is zero except for the support vectors → sparse
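A minimal scikit-learn sketch of a kernel SVM classifier on synthetic two-class data; the Gaussian blobs, the RBF kernel, and C = 1.0 are illustrative choices standing in for the slide's example.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two synthetic Gaussian blobs as a stand-in for the two-class data
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
t = np.array([-1] * 50 + [1] * 50)

clf = SVC(kernel="rbf", C=1.0).fit(X, t)

# Only the support vectors carry non-zero coefficients a_n; the remaining
# training points play no role in the prediction y(x) -> a sparse model.
print(len(clf.support_vectors_), "support vectors out of", len(X), "training points")
print(clf.predict([[0.0, 0.5]]))   # class prediction for a new input
```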

24
SVM for Classification (3)
Fig. 2 Data from two classes in two dimensions, showing contours of constant y(x) obtained from an SVM with a Gaussian kernel function.
25
SVM for Classification (4)
  • For overlapping class distributions, SVM allow some of the training points to be misclassified, at a penalty → soft margin
26
SVM for Classification (5)
  • For multiclass problems, there are several methods to combine multiple two-class SVMs (see the sketch after the figure caption below)
  •   - one versus the rest
  •   - one versus one → more binary classifiers, hence more training time

Fig. 3 Problems in multiclass classification
using multiple SVMs
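A short scikit-learn sketch comparing the two combination schemes; the digits dataset (K = 10 classes) is an illustrative choice, not part of the slides.

```python
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, t = load_digits(return_X_y=True)   # K = 10 classes

ovr = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, t)   # K binary SVMs
ovo = OneVsOneClassifier(SVC(kernel="rbf")).fit(X, t)    # K(K-1)/2 binary SVMs

print(len(ovr.estimators_), "one-vs-rest classifiers")   # 10
print(len(ovo.estimators_), "one-vs-one classifiers")    # 45 -> more to train
```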
27
SVM for Regression (1)
  • For regression problems, the mechanism that prevents overfitting is the ε-insensitive error function (see the sketch below)

Figure: quadratic error function vs. ε-insensitive error function
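A minimal numpy sketch of the two error functions on the residual y(x) − t; the tube width ε = 0.1 is an illustrative choice.

```python
import numpy as np

def quadratic_error(r):
    """Quadratic error on the residual r = y(x) - t."""
    return 0.5 * r ** 2

def eps_insensitive_error(r, eps=0.1):
    """epsilon-insensitive error: zero inside the tube |r| < eps, linear outside it."""
    return np.maximum(0.0, np.abs(r) - eps)

r = np.linspace(-1.0, 1.0, 5)
print(quadratic_error(r))        # grows for every non-zero residual
print(eps_insensitive_error(r))  # exactly zero for residuals inside the tube
```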
28
SVM for Regression (2)

Fig. 4 The ε-tube: points with |y(x) - t| < ε incur no error; points outside the tube incur error |y(x) - t| - ε.
29
SVM for Regression (3)
  • After the optimization process, we obtain the prediction model y(x) = Σ_{n=1}^{N} (a_n - â_n) k(x, x_n) + b (see the sketch below)
  • We find that the coefficients are zero except for the support vectors → sparse
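A minimal scikit-learn sketch of support vector regression on synthetic data; the sine curve, noise level, ε = 0.1, and C = 10 are illustrative choices standing in for the slide's example.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 1.0, (50, 1)), axis=0)
t = np.sin(2 * np.pi * X).ravel() + rng.normal(0.0, 0.1, 50)

reg = SVR(kernel="rbf", epsilon=0.1, C=10.0).fit(X, t)

# Points strictly inside the epsilon-tube get zero coefficients and drop out;
# only the support vectors on or outside the tube define the model.
print(len(reg.support_), "support vectors out of", len(X), "training points")
print(reg.predict([[0.25]]))   # prediction for a new input x
```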

30
SVM for Regression (4)
Fig. 5 Regression results. The support vectors lie on the boundary of the ε-tube or outside the tube.
31
Disadvantages
  • It is not sparse enough, since the number of support vectors required typically grows linearly with the size of the training set
  • Predictions are not probabilistic
  • The error/margin trade-off parameters must be estimated by cross-validation, which wastes computation
  • Kernel functions are limited (they must be valid, positive-definite kernels)
  • Multiclass classification requires combining multiple binary SVMs

32
Outline
  • Introduction to kernel methods
  • Support vector machines (SVM)
  • Relevance vector machines (RVM)
  • Applications
  • Conclusions

33
Relevance Vector Machines (1)
  • The relevance vector machine (RVM) is a Bayesian sparse kernel technique that shares many of the characteristics of the SVM whilst avoiding its principal limitations
  • The RVM is based on a Bayesian formulation and provides posterior probabilistic outputs, as well as having much sparser solutions than the SVM

34
Relevance Vector Machines (2)
  • The RVM mirrors the structure of the SVM but uses a Bayesian treatment to remove the limitations of the SVM
  •   - the kernel functions are simply treated as basis functions, rather than as dot products in some feature space

35
Bayesian Inference
  • Bayesian inference allows one to model
    uncertainty about the world and outcomes of
    interest by combining common-sense knowledge and
    observational evidence.

36
Relevance Vector Machines (3)
  • In the Bayesian framework, we use a prior distribution over w to avoid overfitting: p(w | α) = N(w | 0, α^{-1} I)
  • Here α is a hyperparameter (a precision) that controls the model parameters w

37
Relevance Vector Machines (4)
  • Goal: find the most probable α and β and compute the predictive distribution over t_new for a new input x_new, i.e. p(t_new | x_new, X, t, α, β), written out below
  • Maximize the marginal likelihood p(t | X, α, β) to obtain α and β
  • (X and t denote the training data and their target values)
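A standard way to write the predictive distribution the slide refers to, following Bishop's treatment: the model parameters w are integrated out under their posterior, evaluated at the optimized hyperparameters α*, β* (notation assumed from the book).

```latex
p(t_{\mathrm{new}} \mid x_{\mathrm{new}}, \mathbf{X}, \mathbf{t}, \alpha^{*}, \beta^{*})
  = \int p(t_{\mathrm{new}} \mid x_{\mathrm{new}}, \mathbf{w}, \beta^{*})\,
         p(\mathbf{w} \mid \mathbf{X}, \mathbf{t}, \alpha^{*}, \beta^{*})\, d\mathbf{w}
```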
38
Relevance Vector Machines (5)
  • The RVM utilizes automatic relevance determination (ARD) to achieve sparsity: p(w | α) = Π_m N(w_m | 0, α_m^{-1}), where α_m represents the precision of w_m (see the sketch below)
  • In the procedure of estimating the α_m, some α_m become infinite, which drives the corresponding w_m to zero → only the relevance vectors remain!
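A small sketch of ARD-style pruning using scikit-learn's ARDRegression on a design matrix of Gaussian basis functions centred at the training points, in the spirit of the RVM; this is not Tipping's exact algorithm, and the data, basis width, and pruning threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import ARDRegression

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-1.0, 1.0, 40))
t = np.sinc(3 * X) + rng.normal(0.0, 0.05, 40)

def design(x, centres, sigma=0.2):
    """One Gaussian basis function (kernel column) per training point."""
    return np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * sigma ** 2))

Phi = design(X, X)
ard = ARDRegression().fit(Phi, t)

# Weights whose precision alpha_m grows large are pruned towards zero;
# the surviving basis functions play the role of relevance vectors.
relevant = np.sum(np.abs(ard.coef_) > 1e-3)
print(relevant, "relevance vectors out of", len(X), "basis functions")
```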

39
Comparisons - Regression
Figure: SVM regression vs. RVM regression (the RVM plot shows one standard deviation of the predictive distribution).
40
Comparisons - Regression
41
Comparison - Classification
Figure: classification comparison of the SVM and the RVM.
42
Comparison - Classification
43
Comparisons
  • The RVM is much sparser and makes probabilistic predictions
  • The RVM gives better generalization in regression
  • The SVM gives better generalization in classification
  • The RVM is computationally demanding during training

44
Outline
  • Introduction to kernel methods
  • Support vector machines (SVM)
  • Relevance vector machines (RVM)
  • Applications
  • Conclusions

45
Applications (1)
  • SVM for face detection

46
Applications (2)
Marti Hearst, Support Vector Machines, 1998
47
Applications (3)
  • In the feature-matching based object tracking,
    SVM are used to detect false feature matches

Weiyu Zhu et al., Tracking of Object with SVM Regression, 2001
48
Applications (4)
  • Recovering 3D human poses by RVM

A. Agarwal and B. Triggs, 3D Human Pose from Silhouettes by Relevance Vector Regression, 2004
49
Outline
  • Introduction to kernel methods
  • Support vector machines (SVM)
  • Relevance vector machines (RVM)
  • Applications
  • Conclusions

50
Conclusions
  • The SVM is a learning machine, based on kernel methods and generalization theory, which can perform binary classification and real-valued function approximation tasks
  • The RVM has the same model form as the SVM but provides probabilistic predictions and sparser solutions

51
References
  • www.support-vector.net
  • N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, 2000
  • M. E. Tipping, Sparse Bayesian Learning and the Relevance Vector Machine, Journal of Machine Learning Research, 2001

52
Underfitting and Overfitting
Figure: underfitting (model too simple) vs. overfitting (model too complex), evaluated on new data. Adapted from http://www.dtreg.com/svm.htm