Title: Sparse Kernel Machines
1 Sparse Kernel Machines
- Christopher M. Bishop, Pattern Recognition and Machine Learning
2 Outline
- Introduction to kernel methods
- Support vector machines (SVM)
- Relevance vector machines (RVM)
- Applications
- Conclusions
3 Supervised Learning
- In machine learning, applications in which the
training data comprises examples of the input
vectors along with their corresponding target
vectors are called supervised learning
[Figure: training pairs (x, t) such as (1, 60, pass),
(2, 53, fail), (3, 77, pass), (4, 34, fail) are used to
learn a function y(x) whose output predicts the target
for a new input]
4 Classification
[Figure: two-class data in the (x1, x2) plane; the
decision boundary y(x) = 0 separates the region
y(x) > 0 (class t = +1) from y(x) < 0 (class t = -1)]
5 Regression
[Figure: regression example; training points (x, t) on a
sinusoidal curve with t in [-1, 1] and x in [0, 1], and a
prediction made at a new x]
6 Linear Models
- Linear models for regression and classification
take the form $y(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b$
- If we apply feature extraction $\phi$, the model
becomes $y(\mathbf{x}) = \mathbf{w}^\top \phi(\mathbf{x}) + b$,
where $\mathbf{w}$ is the model parameter and
$\mathbf{x}$ is the input (a sketch follows below)
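A minimal sketch of such a linear model with Gaussian basis functions, fit by least squares; the basis centers, width, and toy sinusoidal data are illustrative assumptions, not values from the slides:

```python
import numpy as np

def gaussian_basis(x, centers, width=1.0):
    """phi(x): one Gaussian bump per center, plus a bias term."""
    phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * width ** 2))
    return np.hstack([np.ones((len(x), 1)), phi])  # prepend bias column

# Fit w by least squares for y(x) = w^T phi(x)
x_train = np.linspace(0, 1, 10)
t_train = np.sin(2 * np.pi * x_train) + 0.1 * np.random.randn(10)
centers = np.linspace(0, 1, 5)
Phi = gaussian_basis(x_train, centers)
w, *_ = np.linalg.lstsq(Phi, t_train, rcond=None)

# Predict at a new input
x_new = np.array([0.5])
print(gaussian_basis(x_new, centers) @ w)
```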
7 Problems with Feature Space
- Why feature extraction? Working in high-dimensional
feature spaces solves the problem of expressing
complex functions
- Problems
  - computational cost (working with very large vectors)
  - curse of dimensionality
8 Kernel Methods (1)
- A kernel function is an inner product in some feature
space, $k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^\top \phi(\mathbf{x}')$,
i.e. a nonlinear similarity measure
- Examples (see the sketch below)
  - polynomial: $k(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^\top \mathbf{x}' + c)^M$
  - Gaussian: $k(\mathbf{x}, \mathbf{x}') = \exp(-\|\mathbf{x} - \mathbf{x}'\|^2 / 2\sigma^2)$
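A hedged sketch of the two example kernels as plain functions; the degree, offset, and width parameters below are illustrative choices:

```python
import numpy as np

def polynomial_kernel(x, z, degree=2, c=1.0):
    # k(x, z) = (x^T z + c)^M
    return (x @ z + c) ** degree

def gaussian_kernel(x, z, sigma=1.0):
    # k(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(polynomial_kernel(x, z), gaussian_kernel(x, z))
```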
9 Kernel Methods (2)
- Many linear models can be reformulated using a
dual representation where the kernel functions
arise naturally, so they only require inner products
between data (input) points
- In the dual representation the prediction takes the
form $y(\mathbf{x}) = \sum_n a_n k(\mathbf{x}, \mathbf{x}_n)$
10 Kernel Methods (3)
- We can benefit from the kernel trick
  - choosing a kernel function is equivalent to
choosing $\phi$, so there is no need to specify what
features are being used
  - we can save computation by not explicitly
mapping the data to feature space, but just
working out the inner product directly in the data
space (see the verification sketch below)
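A small check of the kernel trick, assuming the quadratic kernel $k(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^\top \mathbf{z})^2$ in two dimensions: the kernel evaluated in data space matches the inner product of the explicit feature map $\phi(\mathbf{x}) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$, so the mapping never has to be computed:

```python
import numpy as np

def phi(x):
    # Explicit degree-2 feature map for 2-D inputs
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print((x @ z) ** 2)     # kernel computed in data space
print(phi(x) @ phi(z))  # same value via explicit features
```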
11 Kernel Methods (4)
- Kernel methods exploit information about the
inner products between data items
- We can construct kernels indirectly by choosing a
feature space mapping $\phi$, or directly by choosing a
valid kernel function
- If a bad kernel function is chosen, it will map
to a space with many irrelevant features, so we
need some prior knowledge of the target
12 Kernel Methods (5)
- Two basic modules for kernel methods
  - a general purpose learning model
  - a problem specific kernel function
13 Kernel Methods (6)
- Limitation: the kernel function $k(\mathbf{x}_n, \mathbf{x}_m)$ must be
evaluated for all possible pairs $\mathbf{x}_n$ and $\mathbf{x}_m$ of
training points when making predictions for new
data points
- A sparse kernel machine makes predictions using only
a subset of the training data points
14 Outline
- Introduction to kernel methods
- Support vector machines (SVM)
- Relevance vector machines (RVM)
- Applications
- Conclusions
15 Support Vector Machines (1)
- Support vector machines are a system for
efficiently training linear machines in
kernel-induced feature spaces, respecting the
insights provided by generalization theory and
exploiting optimization theory
- Generalization theory describes how to control
the learning machine to prevent it from
overfitting
16 Support Vector Machines (2)
- To avoid overfitting, SVMs modify the error
function to a regularized form
$E(\mathbf{w}) = E_D(\mathbf{w}) + \lambda E_W(\mathbf{w})$
where the hyperparameter $\lambda$ balances the trade-off
- The aim of $E_W$ is to limit the estimated functions
to smooth functions (a minimal sketch follows below)
- As a side effect, SVMs obtain a sparse model
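A minimal sketch of this regularized error for a linear model $y(\mathbf{x}) = \mathbf{w}^\top \phi(\mathbf{x})$; the value of lambda is an illustrative assumption:

```python
import numpy as np

def regularized_error(w, Phi, t, lam=0.1):
    # E_D: sum-of-squares fit to the training targets
    data_term = 0.5 * np.sum((Phi @ w - t) ** 2)
    # E_W: penalizes large weights, favoring smooth functions
    weight_term = 0.5 * w @ w
    return data_term + lam * weight_term
```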
17 Support Vector Machines (3)
Fig. 1 Architecture of an SVM
18 SVM for Classification (1)
- The mechanism to prevent overfitting in
classification is the maximum margin classifier
- The SVM is fundamentally a two-class classifier
19 Maximum Margin Classifiers (1)
- The aim of classification is to find a (D-1)-dimensional
hyperplane that separates the data in a D-dimensional
space
- 2D example: a line $\mathbf{w}^\top \mathbf{x} + b = 0$
separating the plane
20 Maximum Margin Classifiers (2)
[Figure: the margin is the distance between the decision
boundary and the closest data points; the points lying on
the margin are the support vectors]
21 Maximum Margin Classifiers (3)
[Figure: two separating boundaries compared, one with a
small margin and one with a large margin]
22 Maximum Margin Classifiers (4)
- Intuitively, it is a robust solution
  - if we've made a small error in the location of
the boundary, this gives us the least chance of
causing a misclassification
- The concept of max margin is usually justified
using Vapnik's statistical learning theory
- Empirically it works well
23 SVM for Classification (2)
- After the optimization process, we obtain the
prediction model (see the sketch below)
$y(\mathbf{x}) = \sum_{n=1}^{N} a_n t_n k(\mathbf{x}, \mathbf{x}_n) + b$
where $(\mathbf{x}_n, t_n)$ are the N training data points
- We find that $a_n$ is zero except for the support
vectors, hence the model is sparse
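An illustrative sketch using scikit-learn's SVC (not the slides' own implementation) on synthetic data, showing that only a subset of the training points end up as support vectors:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic two-class data
X, t = make_blobs(n_samples=200, centers=2, random_state=0)

clf = SVC(kernel="rbf", gamma=1.0).fit(X, t)
print(len(clf.support_), "support vectors out of", len(X), "points")
# clf.dual_coef_ stores a_n * t_n for the support vectors only
```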
24 SVM for Classification (3)
Fig. 2 Data from two classes in two dimensions,
showing contours of constant y(x) obtained from an
SVM with a Gaussian kernel function
25 SVM for Classification (4)
- For overlapping class distributions, SVMs allow
some of the training points to be misclassified,
giving a soft margin: slack variables $\xi_n \ge 0$
enter the objective through a penalty term
$C \sum_n \xi_n$
26 SVM for Classification (5)
- For multiclass problems, there are methods to
combine multiple two-class SVMs (see the sketch below)
  - one versus the rest
  - one versus one (more training time)
Fig. 3 Problems in multiclass classification
using multiple SVMs
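A sketch of the one-versus-the-rest strategy via scikit-learn's wrapper; the iris data set is just a convenient three-class example:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, t = load_iris(return_X_y=True)

# Trains one binary SVM per class, each separating that
# class from the rest
ovr = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, t)
print(len(ovr.estimators_))  # 3 binary SVMs for 3 classes
```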
27 SVM for Regression (1)
- For regression problems, the mechanism to prevent
overfitting is the $\epsilon$-insensitive error function
(see the sketch below)
$E_\epsilon(y(\mathbf{x}) - t) = \begin{cases} 0 & \text{if } |y(\mathbf{x}) - t| < \epsilon \\ |y(\mathbf{x}) - t| - \epsilon & \text{otherwise} \end{cases}$
[Figure: the quadratic error function vs. the
$\epsilon$-insensitive error function]
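A short sketch comparing the two error functions; the value of epsilon is an illustrative assumption:

```python
import numpy as np

def eps_insensitive(residual, eps=0.1):
    # Zero inside the tube |y(x) - t| <= eps, linear outside
    return np.maximum(0.0, np.abs(residual) - eps)

def quadratic(residual):
    return residual ** 2

r = np.linspace(-0.3, 0.3, 7)
print(eps_insensitive(r))  # zeros where |r| <= 0.1
print(quadratic(r))        # penalizes every nonzero residual
```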
28 SVM for Regression (2)
Fig. 4 The $\epsilon$-tube: points inside the tube incur no
error; points outside incur error $|y(\mathbf{x}) - t| - \epsilon$
29 SVM for Regression (3)
- After the optimization process, we obtain the
prediction model (see the sketch below)
$y(\mathbf{x}) = \sum_{n=1}^{N} (a_n - \hat{a}_n) k(\mathbf{x}, \mathbf{x}_n) + b$
- We find that $a_n$ and $\hat{a}_n$ are zero except for
the support vectors, hence the model is sparse
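An illustrative sketch with scikit-learn's SVR on toy sinusoidal data, showing that dual coefficients are kept only for the support vectors:

```python
import numpy as np
from sklearn.svm import SVR

x = np.linspace(0, 1, 50)[:, None]
t = np.sin(2 * np.pi * x).ravel() + 0.1 * np.random.randn(50)

reg = SVR(kernel="rbf", epsilon=0.1).fit(x, t)
print(len(reg.support_), "support vectors out of", len(x))
# reg.dual_coef_ holds (a_n - a_hat_n) for support vectors only
```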
30 SVM for Regression (4)
Fig. 5 Regression results. Support vectors lie on
the boundary of the tube or outside the tube
31 Disadvantages
- It is not sparse enough, since the number of
support vectors required typically grows linearly
with the size of the training set
- Predictions are not probabilistic
- The estimation of the error/margin trade-off
parameters must use cross-validation, which
wastes computation
- Kernel functions are limited (they must be valid
Mercer kernels)
- Multiclass classification is handled only
indirectly
32 Outline
- Introduction to kernel methods
- Support vector machines (SVM)
- Relevance vector machines (RVM)
- Applications
- Conclusions
33 Relevance Vector Machines (1)
- The relevance vector machine (RVM) is a Bayesian
sparse kernel technique that shares many of the
characteristics of the SVM whilst avoiding its
principal limitations
- The RVM is based on a Bayesian formulation and
provides posterior probabilistic outputs, as well
as having much sparser solutions than the SVM
34 Relevance Vector Machines (2)
- RVMs are intended to mirror the structure of the
SVM and use a Bayesian treatment to remove the
limitations of the SVM
- The kernel functions are simply treated as basis
functions, rather than as dot products in some
feature space
35 Bayesian Inference
- Bayesian inference allows one to model
uncertainty about the world and outcomes of
interest by combining common-sense knowledge and
observational evidence.
36 Relevance Vector Machines (3)
- In the Bayesian framework, we use a prior
distribution over $\mathbf{w}$ to avoid overfitting
$p(\mathbf{w} \mid \boldsymbol{\alpha}) = \prod_m \mathcal{N}(w_m \mid 0, \alpha_m^{-1})$
where $\boldsymbol{\alpha}$ is a hyperparameter which
controls the model parameter $\mathbf{w}$
37 Relevance Vector Machines (4)
- Goal: find the most probable $\boldsymbol{\alpha}$ and $\beta$
to compute the predictive distribution over $t_{new}$
for a new input $\mathbf{x}_{new}$, i.e.
$p(t_{new} \mid \mathbf{x}_{new}, \mathbf{X}, \mathbf{t}, \boldsymbol{\alpha}, \beta)$
- Maximize the marginal likelihood
$p(\mathbf{t} \mid \mathbf{X}, \boldsymbol{\alpha}, \beta)$
to obtain $\boldsymbol{\alpha}$ and $\beta$, where $\mathbf{X}$ and
$\mathbf{t}$ are the training data and their target values
38 Relevance Vector Machines (5)
- RVMs utilize automatic relevance determination to
achieve sparsity, where $\alpha_m$ represents the
precision of $w_m$
- In the procedure of finding $\alpha_m$, some $\alpha_m$
become infinite, which drives the corresponding $w_m$
to zero; the examples that remain are the relevance
vectors (see the sketch below)
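An RVM-style regression sketch under stated assumptions: scikit-learn has no RVM class, so this uses a kernel design matrix as basis functions with ARDRegression, whose automatic relevance determination prunes weights whose precision grows large; the kernel width and the pruning threshold below are illustrative choices:

```python
import numpy as np
from sklearn.linear_model import ARDRegression
from sklearn.metrics.pairwise import rbf_kernel

x = np.linspace(0, 1, 50)[:, None]
t = np.sin(2 * np.pi * x).ravel() + 0.1 * np.random.randn(50)

# One kernel-valued basis function per training point
Phi = rbf_kernel(x, x, gamma=10.0)

# ARD drives most weights toward zero; the survivors play
# the role of relevance vectors
rvm = ARDRegression().fit(Phi, t)
relevant = np.sum(np.abs(rvm.coef_) > 1e-3)
print(relevant, "relevance vectors out of", len(x))
```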
39 Comparisons - Regression
[Figure: SVM regression vs. RVM regression; the RVM panel
shows one standard deviation of the predictive distribution]
40 Comparisons - Regression
[Figure: regression comparison, continued]
41 Comparison - Classification
[Figure: SVM classification vs. RVM classification]
42 Comparison - Classification
[Figure: classification comparison, continued]
43 Comparisons
- RVMs are much sparser and make probabilistic
predictions
- The RVM gives better generalization in regression
- The SVM gives better generalization in classification
- The RVM is computationally demanding during training
44 Outline
- Introduction to kernel methods
- Support vector machines (SVM)
- Relevance vector machines (RVM)
- Applications
- Conclusions
45 Applications (1)
46 Applications (2)
Marti Hearst, "Support Vector Machines", 1998
47 Applications (3)
- In feature-matching based object tracking, SVMs
are used to detect false feature matches
Weiyu Zhu et al., "Tracking of Object with SVM
Regression", 2001
48 Applications (4)
- Recovering 3D human poses with the RVM
A. Agarwal and B. Triggs, "3D Human Pose from
Silhouettes by Relevance Vector Regression", 2004
49 Outline
- Introduction to kernel methods
- Support vector machines (SVM)
- Relevance vector machines (RVM)
- Applications
- Conclusions
50 Conclusions
- The SVM is a learning machine based on kernel
methods and generalization theory which can
perform binary classification and real-valued
function approximation tasks
- The RVM has the same model form as the SVM but
provides probabilistic predictions and sparser
solutions
51 References
- www.support-vector.net
- N. Cristianini and J. Shawe-Taylor, An
Introduction to Support Vector Machines and Other
Kernel-based Learning Methods, Cambridge
University Press, 2000
- M. E. Tipping, Sparse Bayesian Learning and the
Relevance Vector Machine, Journal of Machine
Learning Research, 2001
52 Underfitting and Overfitting
[Figure: an underfitting model (too simple) and an
overfitting model (too complex) evaluated on new data]
Adapted from http://www.dtreg.com/svm.htm