Introduction to SVMs - PowerPoint PPT Presentation

Transcript and Presenter's Notes


1
Introduction to SVMs
  • Rebecca Fiebrink
  • IFT6080
  • 13 March 2006

2
Disclaimer
  • This is a high-level overview.
  • I am not an SVM expert.
  • Images are liberally stolen from tutorials by
    Schölkopf and Burges.

3
Separating planes: Basic idea
  • A separating plane is great for doing 2-class
    classification on linearly separable data
  • It's easy to reason about, too
  • Can discuss formal bounds on generalization error
  • Can think of straightforward ways to compute this
    plane

4
SVM: Basic idea
  • What about when the data is not linearly
    separable?
  • Map it into a higher dimension where it is
    separable
  • Use the kernel trick to implicitly map up,
    compute stuff, and map back in one step.

5
Contents of this presentation
  • Introduce idea of linear maximum-margin
    hyperplane for separable data
  • Extend to non-separable data
  • Extend to non-linear hyperplanes
  • The kernel trick
  • Finding the maximum-margin hyperplane
  • Theoretical and practical implications
  • Extensions of SVMs

6
Linear plane, separable data
  • Each data instance xi is a vector in R^d
  • Each xi has a class label yi ∈ {-1, +1}
  • A separating hyperplane can be defined by normal
    vector w and scalar b
  • The plane is specified so that
  • sgn(⟨xi, w⟩ + b) = sgn(yi)
  • and |⟨xi, w⟩ + b| ≥ 1
  • New data x can be classified as sgn(⟨x, w⟩ + b)
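  • A minimal sketch (illustrative values; w and b are assumed to be given,
    e.g. found by training) of classifying new points with this rule:

    import numpy as np

    # Hypothetical hyperplane parameters: normal vector w and offset b
    w = np.array([2.0, -1.0])
    b = -0.5

    def classify(x):
        # Sign of <x, w> + b (+1 or -1 for points off the plane)
        return int(np.sign(np.dot(x, w) + b))

    print(classify(np.array([1.0, 0.0])))   #  2.0 - 0.5 =  1.5 -> +1
    print(classify(np.array([0.0, 1.0])))   # -1.0 - 0.5 = -1.5 -> -1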

7
Maximum margin hyperplane
  • A maximum-margin hyperplane minimizes ||w|| while
    requiring that all training points are correctly
    classified (i.e., yi(⟨xi, w⟩ + b) ≥ 1 for all i); see
    the optimization problem sketched after this list
  • There is one solution for this plane (or
    equivalent global solutions)
  • The training points that lie on the margin
    (circled) are the support vectors
  • Removal of any of these points from the training
    set will change the solution hyperplane
  • Each support vector may have a different
    importance to the solution
  • (Only) these points can be used to classify new
    data
  • They specify the hyperplane
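  • Written out (a reconstruction of the standard hard-margin problem, not
    copied from the slide images):

    \min_{w,\,b}\ \tfrac{1}{2}\|w\|^2
    \quad \text{subject to} \quad
    y_i(\langle w, x_i \rangle + b) \ge 1,\ \ i = 1, \dots, n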

8
Linear plane, non-separable data
  • Introduce non-negative slack variables ξi (one
    for each training point)
  • This relaxes the requirement that all training points
    be correctly classified
  • A cost parameter (often written C) controls the balance
    between the size of the ξi's and the number of training
    errors (see the sketch after this list)
  • Classify new data in the same way as before
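  • The usual soft-margin formulation (a standard reconstruction; C is the
    cost parameter referred to above as the balance control):

    \min_{w,\,b,\,\xi}\ \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i
    \quad \text{subject to} \quad
    y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0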

9
Mechanical analogy
  • Consider L = R^2
  • The hyperplane is a stiff sheet
  • Each support vector exerts a force normal to this
    sheet, scaled according to its importance
  • System is in equilibrium
  • Sum of forces is 0
  • Sum of torques is 0
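  • In terms of the multipliers αi introduced later (roughly following
    Burges' version of the analogy; stated here as an interpretation, not a
    quote), these two conditions correspond to

    \sum_i \alpha_i y_i = 0
    \qquad \text{and} \qquad
    w = \sum_i \alpha_i y_i\, x_i

    where the sums run over the support vectors.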

10
Nonlinear SVM
  • Map data from the lower-dimensional space L = R^d to a
    (much) higher-dimensional space H using some
    function Φ : L → H
  • Compute the maximum-margin hyperplane in H
    exactly as before

11
Nonlinear SVM
  • Could explicitly construct Φ, map all the
    training data into H, compute the hyperplane, map all
    the testing data into H to classify, etc.
  • But, the equations we use to find w and b can be
    expressed using the training data only in
    dot-product form
  • Additionally, the equations we use to classify
    new points can be expressed using only the
    support vectors and only in dot-product form
  • So
  • We can take a shortcut: use a function that
    directly computes dot products in H on vectors in
    L
  • We don't even have to know what H is, or how to
    explicitly get there; H just has to have an inner
    product defined
  • This is called the kernel trick
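  • A quick illustration (not from the original slides): the degree-2
    polynomial kernel K(x, z) = ⟨x, z⟩² on R² computes the same value as an
    explicit dot product after the map Φ(x) = (x1², √2·x1·x2, x2²):

    import numpy as np

    def phi(x):
        # Explicit feature map into H for the degree-2 polynomial kernel
        return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

    def k(x, z):
        # Kernel: the same dot product, computed directly in L = R^2
        return np.dot(x, z) ** 2

    x = np.array([1.0, 2.0])
    z = np.array([3.0, 4.0])
    print(np.dot(phi(x), phi(z)))   # 121.0 -- dot product in H
    print(k(x, z))                  # 121.0 -- same value, never leaving L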

12
Kernel Trick
  • Replace all dot-products in L with dot-products
    in H as follows
  • If we have a Φ we know and love
  • Change ⟨xi, xj⟩ to ⟨Φ(xi), Φ(xj)⟩
  • Call K(xi, xj) = ⟨Φ(xi), Φ(xj)⟩
  • Otherwise, if we have some K such that we know
    K(xi, xj) = ⟨Φ(xi), Φ(xj)⟩ for some Φ and H, somewhere
  • Change ⟨xi, xj⟩ to K(xi, xj) directly
  • Do the above for both the training and the
    testing of the SVM.

13
Mercer's condition
  • How do we know if K(xi, xj) computes a
    dot-product in some higher-dimensional space?
  • Mercer's condition: if it holds, K(x, y) is a valid
    kernel (see the statement after this list)
  • Mercer's condition doesn't tell us anything about
    Φ or H, only that they exist
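  • One common statement of the condition (following Burges' tutorial; the
    slide showed it only as an image): K is a valid kernel if

    \int\!\!\int K(x, y)\, g(x)\, g(y)\, dx\, dy \;\ge\; 0
    \quad \text{for every } g \text{ with } \int g(x)^2\, dx < \infty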


14
A few notes about the kernel trick
  • The image of Φ may live in a space of very high
    dimension, but it is just a (possibly very
    contorted) surface whose intrinsic dimension is
    just that of L
  • w typically doesn't have a representation in L;
    otherwise we could just classify linearly
  • Kernels aren't just for SVMs!

15
Choosing a kernel
  • Common kernels
  • Polynomial
  • Radial basis function
  • 2-layer sigmoidal neural network
  • Choosing a kernel corresponds to choosing a
    similarity measure for the data (Schölkopf)
  • "No free lunch" for kernels (Schölkopf)
  • The number of parameters one needs to set (tune)
    is a key consideration for practical use of SVMs
    (Hsu et al.)
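  • For concreteness (an illustrative sketch, not from the slides; the
    parameter names degree, c, gamma, and kappa stand for the tunable kernel
    parameters mentioned above):

    import numpy as np

    def polynomial_kernel(x, z, degree=3, c=1.0):
        # K(x, z) = (<x, z> + c)^degree
        return (np.dot(x, z) + c) ** degree

    def rbf_kernel(x, z, gamma=0.5):
        # K(x, z) = exp(-gamma * ||x - z||^2)
        return np.exp(-gamma * np.dot(x - z, x - z))

    def sigmoid_kernel(x, z, kappa=1.0, c=-1.0):
        # K(x, z) = tanh(kappa * <x, z> + c); the "2-layer sigmoidal neural
        # network" kernel (only satisfies Mercer's condition for some
        # parameter settings)
        return np.tanh(kappa * np.dot(x, z) + c)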

16
Quadratic programming
  • Finding the maximum-margin separating hyperplane
    is a quadratic programming (QP) problem
  • QP: a type of optimization where the objective
    function is allowed to have quadratic terms
  • In general, QP involves minimizing an objective f(x)
    w.r.t. x under some constraints (see the form sketched
    after this list)
  • If E is positive definite, then f(x) is a convex
    function; combined with linear constraints, the whole
    problem is convex
  • There are well-understood circumstances under
    which a solution is optimal: the Karush-Kuhn-Tucker
    (KKT) conditions
  • There is a collection of known methods for
    solving QP problems.
  • Thank you, Wikipedia.
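  • A sketch of the generic QP form implied above (a reconstruction; E, c,
    A, and b are the conventional names, with E the matrix referred to on
    this and the next slide):

    \min_{x}\ f(x) = \tfrac{1}{2}\, x^{\top} E\, x + c^{\top} x
    \quad \text{subject to} \quad A x \le b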

17
QP: what's it mean for SVM?
  • If Mercer's condition is met, then E is positive
    semi-definite, so the objective is convex.
  • This means that any solution we find is global.
  • We can identify (check) a solution using KKT
    conditions.
  • The solution process is still sort of ugly.

18
What are all those equations?
  • Problem is formulated as primal and dual
    Lagrangians
  • Primal and dual are complementary equations
  • You can solve either one
  • Primal: the objective function involves n variables
    and is subject to m constraints
  • Want to maximize (or minimize) the value of the
    objective function subject to the constraints
  • A solution is a vector of n values that achieves
    this maximum
  • Dual: the objective function involves m variables (one
    arising from each of the m primal constraints) and is
    subject to n constraints
  • Again, maximize/minimize the objective function
    subject to the constraints
  • Lagrange multipliers
  • A method for dealing with constraints
  • An unknown scalar multiplier αi is assigned to
    each constraint (see the sketch after this list)
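  • As a generic sketch (standard form, not copied from the slides): to
    minimize f(x) subject to constraints gi(x) ≥ 0, introduce one multiplier
    αi ≥ 0 per constraint and form the Lagrangian

    L(x, \alpha) = f(x) - \sum_{i=1}^{m} \alpha_i\, g_i(x)

    which is minimized over x and maximized over the αi.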

19
The details
  • Separable case and non-separable case (the formulas
    shown on the slide are reconstructed below)
  • Why?
  • Want a maximum margin, and margin = 1/||w||
  • Separable: require yi(⟨w, xi⟩ + b) ≥ 1
  • Non-separable: require yi(⟨w, xi⟩ + b) ≥ 1 - ξi
  • and, control complexity (balance between
    training errors and the size of the ξi's)

Primal: minimize
Dual: maximize
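  • Reconstructed in standard form (the slide showed these as images;
    K(xi, xj) = ⟨Φ(xi), Φ(xj)⟩ as before):

    \text{Primal:}\quad
    \min_{w,\,b,\,\xi}\ \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i
    \quad \text{s.t.} \quad
    y_i(\langle w, x_i\rangle + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0

    \text{Dual:}\quad
    \max_{\alpha}\ \sum_i \alpha_i
      - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, K(x_i, x_j)
    \quad \text{s.t.} \quad
    0 \le \alpha_i \le C,\ \ \sum_i \alpha_i y_i = 0

    (In the separable case, drop the ξi and replace 0 ≤ αi ≤ C with αi ≥ 0.)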
20
Theoretical implications
  • Structural risk minimization (SRM): we'd like to
    find a machine for which the training error is low
    and the bound on its generalization error is tight
  • A formal bound on the generalization performance of a
    learner (reconstructed after this list)
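  • The bound the slide refers to is presumably the usual VC bound (Vapnik;
    see Burges' tutorial): with probability at least 1 - η, for a machine
    trained on l examples,

    R(\alpha) \;\le\; R_{emp}(\alpha)
      + \sqrt{\frac{h\left(\log(2l/h) + 1\right) - \log(\eta/4)}{l}}

    where R is the true risk, R_emp the empirical (training) risk, and h
    the VC dimension (next slide).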

21
VC dimension
  • h in the previous slide
  • Measures the capacity of a classification
    algorithm
  • Capacity relates to ability to learn perfectly
    (shatter), which relates to (in)ability to
    generalize

22
SVM and VC
  • VC dimension of an SVM is dependent on the kernel
    used
  • In general, capacity is very high (or infinite)
  • This isn't necessarily a problem; other
    algorithms also have infinite VC dimension (e.g., kNN)

23
Gap-tolerant classifiers
  • We can come up with meaningful (but looser)
    bounds for gap-tolerant classifiers, a sort of
    idealized and generalized version of SVMs
  • In practice, these bounds can tell us useful
    things about the performance
  • According to Burges, the SVM's magic happens in the
    maximization of the margin

24
Practical implications for SVMs
  • Training can take a really long time
  • There are methods for solving the QP problem more
    efficiently (e.g., chunking), and these are used
    in practice
  • Nominal attributes are typically binarized, which
    results in more attributes (see the sketch after this list)
  • Unlike other classifiers (e.g., neural nets),
    SVMs always find a global optimum
  • Special things must be done to handle multi-class
    problems
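  • An illustrative binarization (one-hot) sketch, with made-up data:

    # Hypothetical nominal attribute with three possible values
    colors = ["red", "green", "blue", "red"]
    categories = sorted(set(colors))             # ['blue', 'green', 'red']
    # Each nominal value becomes a vector of binary attributes
    binarized = [[1 if c == cat else 0 for cat in categories] for c in colors]
    print(binarized)  # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 0, 1]]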

25
Extensions: More than one class
  • There are a variety of ways to handle this, some
    of them designed for extending 2-class
    classifiers in general
  • 1-vs-1 (pairwise): construct an SVM for every
    pair of the N classes (N(N-1)/2 in total). The class
    that gets the most votes wins (max-wins; see the
    sketch after this list)
  • 1-vs-all (1-vs-rest): construct an SVM for each
    class. The class that gets the strongest vote in
    its favor wins.
  • Others: e.g., Platt's DAGSVM
  • There are also a variety of ways to get more
    interesting outputs (e.g., probabilities)
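  • A minimal sketch of 1-vs-1 max-wins voting, assuming a dict of
    already-trained binary classifiers (one per class pair, each returning
    the winning label for a point); all names here are hypothetical:

    from collections import Counter
    from itertools import combinations

    def one_vs_one_predict(x, classes, pairwise_classifiers):
        # pairwise_classifiers[(a, b)](x) returns either a or b
        votes = Counter()
        for a, b in combinations(classes, 2):
            votes[pairwise_classifiers[(a, b)](x)] += 1
        return votes.most_common(1)[0][0]   # most-voted class wins

  • With N classes this queries N(N-1)/2 binary SVMs per test point.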

26
Extensions: Large datasets
  • Scalability is a problem.
  • Some algorithms use selective sampling/active
    learning to sample the training data
    intelligently
  • Some approaches reformulate the QP problem so
    that it can be solved more efficiently
  • Portions of the basic SVM computation can be
    parallelized

From Yu et al. 2003
27
For more information
  • My website: http://www.music.mcgill.ca/~rebecca/6080/SVM_bib.htm

28
References
  • Burges, C. 1999. A tutorial on support vector
    machines for pattern recognition. Given at DAGM.
    Available online: http://www.kernel-machines.org/tutorial.html.
  • Duda, R., P. Hart, and D. Stork. 2001. Pattern
    classification. 2nd ed. New York: John Wiley & Sons.
  • Hsu, C., C. Chang, and C. Lin. A practical guide
    to support vector classification. Available online:
    http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf.
  • Hsu, C., and C. Lin. 2002. A comparison of
    methods for multiclass support vector machines.
    IEEE Transactions on Neural Networks 13(2): 415-25.
  • Platt, J. 1999. Probabilities for support vector
    machines. In Advances in large margin classifiers,
    A. Smola, P. Bartlett, B. Schölkopf, and D.
    Schuurmans, eds. Cambridge, MA: MIT Press. 61-74. Original
    title: Probabilistic outputs for support vector
    machines and comparisons to regularized
    likelihood methods. Available online:
    http://research.microsoft.com/~jplatt/abstracts/SVprob.html.
  • Schölkopf, B. 2000. A short tutorial on kernels.
    Tutorial given at the NIPS'00 Kernel Workshop.
    Available online:
    http://www.dcs.rhbnc.ac.uk/colt/nips2000/kernels-tutorial.ps.gz.
  • Schölkopf, B., and A. Smola. 2002. Learning with
    kernels: Support vector machines, regularization,
    optimization, and beyond. Cambridge, MA: MIT Press.
  • www.kernel-machines.org
  • www.wikipedia.org
  • Yu, H., J. Yang, and J. Han. 2003. Classifying
    large data sets using SVMs with hierarchical
    clusters. Proceedings of SIGKDD 2003.