Title: Introduction to Support Vector Machines
1. Introduction to Support Vector Machines
2. Introduction
- The Support Vector Machine (SVM) is a learning methodology based on Vapnik's statistical learning theory
- Developed in the 1990s
- Addresses problems of traditional statistical learning (overfitting, capacity control)
- Has achieved the best performance in practical applications such as
  - Handwritten digit recognition
  - Text categorization
3. Classification Problem
- Given a training set $S = \{(x_1, y_1), (x_2, y_2), \dots, (x_l, y_l)\}$, where $x_i \in X \subseteq \mathbb{R}^n$ and $y_i \in Y = \{1, -1\}$, $i = 1, 2, \dots, l$
- The goal is to learn a function $g(x)$ so that the decision function $f(x) = \mathrm{sgn}(g(x))$ can classify a new input $x$
- So this is a supervised batch learning method
4. Linear Classifier
5. Maximum Margin Classifier
6. Maximum Margin Classifier
- Let us select two points lying on the two margin hyperplanes $w \cdot x + b = 1$ and $w \cdot x + b = -1$, respectively (see the derivation sketch below).
- The distance from the hyperplane $w \cdot x + b = 0$ to the origin is $|b| / \lVert w \rVert$.
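The margin computation the slide gestures at can be made explicit (a short worked step, not on the original slide): taking $x_+$ on $w \cdot x + b = 1$ and $x_-$ on $w \cdot x + b = -1$ and subtracting,
$$w \cdot (x_+ - x_-) = 2 \quad \Longrightarrow \quad \frac{w}{\lVert w \rVert} \cdot (x_+ - x_-) = \frac{2}{\lVert w \rVert}$$
so the distance between the two hyperplanes, measured along the unit normal $w / \lVert w \rVert$, is $2 / \lVert w \rVert$.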
7. Maximum Margin Classifier
8. Then
The margin, i.e. the distance between the two hyperplanes, is equal to $2 / \lVert w \rVert$, so maximizing the margin is equal to minimizing $\frac{1}{2} \lVert w \rVert^2$.
Note we have the constraints $y_i (w \cdot x_i + b) \geq 1$, $i = 1, \dots, l$ (by scaling $w$ and $b$ we can set the functional margin to $1$).
This gives the primal problem
$$\min_{w, b} \ \frac{1}{2} \lVert w \rVert^2 \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \geq 1, \ i = 1, \dots, l$$
9. Lagrange Duality
For the problem
$$\min_w f(w) \quad \text{s.t.} \quad g_i(w) \leq 0, \ i = 1, \dots, k, \qquad h_i(w) = 0, \ i = 1, \dots, m$$
we can write the Lagrangian form
$$L(w, \alpha, \beta) = f(w) + \sum_{i=1}^{k} \alpha_i g_i(w) + \sum_{i=1}^{m} \beta_i h_i(w)$$
10. Let Us Review the Generalized Lagrangian
Using the Lagrangian, let us consider the quantity
$$\theta_P(w) = \max_{\alpha, \beta : \alpha_i \geq 0} L(w, \alpha, \beta)$$
Note that the constraints must be satisfied; otherwise $\max L$ will be infinite (if some $g_i(w) > 0$, take $\alpha_i \to \infty$; if some $h_i(w) \neq 0$, take $\beta_i$ to $\pm\infty$).
11. Let Us Review the Generalized Lagrangian
If the constraints are satisfied, then we must have
$$\theta_P(w) = f(w)$$
Now you can see that $\max L$ takes the same value as the objective of our problem, $f(w)$. Therefore we can consider the minimization problem
$$\min_w \theta_P(w) = \min_w \max_{\alpha, \beta : \alpha_i \geq 0} L(w, \alpha, \beta)$$
Let us define the optimal value of the primal problem as $p^*$. Then let us define the dual problem
$$\max_{\alpha, \beta : \alpha_i \geq 0} \theta_D(\alpha, \beta) = \max_{\alpha, \beta : \alpha_i \geq 0} \min_w L(w, \alpha, \beta)$$
The two problems are similar; only the order of $\max$ and $\min$ is exchanged. Now we define the optimal value of the dual problem as $d^*$.
12. Relationship Between the Primal and Dual Problems
$$d^* = \max_{\alpha, \beta : \alpha_i \geq 0} \min_w L(w, \alpha, \beta) \ \leq \ \min_w \max_{\alpha, \beta : \alpha_i \geq 0} L(w, \alpha, \beta) = p^*$$
Why? Just remember it (a short proof sketch follows below).
Then, if under some conditions $d^* = p^*$, we can solve the dual problem in lieu of the primal problem.
What are those conditions?
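The "just remember it" inequality in fact has a two-line proof (added here for completeness, not on the original slide): for any $w'$ and any $(\alpha, \beta)$ with $\alpha_i \geq 0$,
$$\theta_D(\alpha, \beta) = \min_w L(w, \alpha, \beta) \leq L(w', \alpha, \beta) \leq \max_{\alpha, \beta : \alpha_i \geq 0} L(w', \alpha, \beta) = \theta_P(w')$$
Taking the maximum over $(\alpha, \beta)$ on the left and the minimum over $w'$ on the right gives $d^* \leq p^*$.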
13. The Famous KKT Conditions (Karush-Kuhn-Tucker Conditions)
Suppose $f$ and the $g_i$ are convex, the $h_i$ are affine, and the constraints $g_i$ are strictly feasible. Then there exist $w^*, \alpha^*, \beta^*$ such that $w^*$ solves the primal problem, $(\alpha^*, \beta^*)$ solves the dual problem, $p^* = d^* = L(w^*, \alpha^*, \beta^*)$, and the KKT conditions hold:
$$\frac{\partial}{\partial w_i} L(w^*, \alpha^*, \beta^*) = 0, \qquad \frac{\partial}{\partial \beta_i} L(w^*, \alpha^*, \beta^*) = 0,$$
$$\alpha_i^* g_i(w^*) = 0, \qquad g_i(w^*) \leq 0, \qquad \alpha_i^* \geq 0$$
What does this imply?
14. The Famous KKT Conditions (Karush-Kuhn-Tucker Conditions)
The complementary slackness condition $\alpha_i^* g_i(w^*) = 0$ implies that $\alpha_i^* > 0$ only for points where the constraint is active, i.e. $g_i(w^*) = 0$. For the SVM these are exactly the support vectors.
Very important!
15. Return to Our Problem
For our problem the Lagrangian is
$$L(w, b, \alpha) = \frac{1}{2} \lVert w \rVert^2 - \sum_{i=1}^{l} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right]$$
Let us first solve $\min_{w, b} L$ with respect to $w$ and $b$. Setting the derivatives to zero gives
$$w = \sum_{i=1}^{l} \alpha_i y_i x_i, \qquad \sum_{i=1}^{l} \alpha_i y_i = 0$$
Substitute the two equations back into $L(w, b, \alpha)$.
16. We have
$$L(w, b, \alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle$$
Then we obtain the maximization problem with respect to $\alpha$:
$$\max_\alpha \ \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle \quad \text{s.t.} \quad \alpha_i \geq 0, \ \sum_{i=1}^{l} \alpha_i y_i = 0$$
Now we have only the parameters $\alpha$. We can solve for them, then recover $w = \sum_i \alpha_i y_i x_i$, and then $b$, because
$$b = -\frac{\max_{i : y_i = -1} w \cdot x_i + \min_{i : y_i = 1} w \cdot x_i}{2}$$
17. How to Predict
For a new sample $x$, we can predict its label by
$$f(x) = \mathrm{sgn}\left( \sum_{i=1}^{l} \alpha_i y_i \langle x_i, x \rangle + b \right)$$
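A minimal NumPy sketch of this decision function, assuming the multipliers alpha, labels y, training inputs X, and intercept b have already been computed (all names here are illustrative, not from the slides):

```python
import numpy as np

def predict(x, X, y, alpha, b):
    """f(x) = sgn( sum_i alpha_i * y_i * <x_i, x> + b ).

    X: (l, n) training inputs; y: (l,) labels in {+1, -1};
    alpha: (l,) Lagrange multipliers; b: intercept.
    """
    g = np.sum(alpha * y * (X @ x)) + b  # g(x), the decision value
    return 1 if g >= 0 else -1
```

Note that only the support vectors (the $x_i$ with $\alpha_i > 0$) actually contribute to the sum.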
18. Non-Separable Case
What is the non-separable case? No example is given here; I suppose you know it: no hyperplane can separate the two classes, so some points must be allowed to violate the margin.
Then the optimization problem, with slack variables $\xi_i$, is
$$\min_{w, b, \xi} \ \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{l} \xi_i \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \geq 1 - \xi_i, \ \xi_i \geq 0, \ i = 1, \dots, l$$
Next, by forming the Lagrangian,
$$L(w, b, \xi, \alpha, r) = \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{l} \xi_i - \sum_{i=1}^{l} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 + \xi_i \right] - \sum_{i=1}^{l} r_i \xi_i$$
19. Dual Form
$$\max_\alpha \ \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j \langle x_i, x_j \rangle \quad \text{s.t.} \quad 0 \leq \alpha_i \leq C, \ \sum_{i=1}^{l} \alpha_i y_i = 0$$
What is the difference from the previous form? Only the upper bound $\alpha_i \leq C$ has been added.
Also note the following KKT dual-complementarity conditions (writing $g(x) = w \cdot x + b$):
$$\alpha_i = 0 \Rightarrow y_i g(x_i) \geq 1, \qquad \alpha_i = C \Rightarrow y_i g(x_i) \leq 1, \qquad 0 < \alpha_i < C \Rightarrow y_i g(x_i) = 1$$
20. How to Train an SVM: How to Solve the Optimization Problem
The sequential minimal optimization (SMO) algorithm, due to John Platt.
First, let us introduce the coordinate ascent algorithm (a runnable sketch follows below):

Loop until convergence:
  For $i = 1, \dots, m$:
    $\alpha_i := \arg\max_{\hat{\alpha}_i} L(\alpha_1, \dots, \alpha_{i-1}, \hat{\alpha}_i, \alpha_{i+1}, \dots, \alpha_m)$
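A minimal sketch of generic coordinate ascent, assuming a helper argmax_coord(alpha, i) that maximizes the objective over the single coordinate i (both names are illustrative, not from the slides):

```python
import numpy as np

def coordinate_ascent(alpha, argmax_coord, tol=1e-8, max_iter=1000):
    """Maximize L(alpha) one coordinate at a time.

    alpha: (m,) starting point; argmax_coord(alpha, i) returns the
    scalar maximizing L over alpha[i] with all other entries fixed.
    """
    for _ in range(max_iter):
        old = alpha.copy()
        for i in range(len(alpha)):
            alpha[i] = argmax_coord(alpha, i)  # 1-D inner maximization
        if np.linalg.norm(alpha - old) < tol:  # no coordinate moved: done
            break
    return alpha
```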
21. Is Coordinate Ascent OK?
Is it OK here? No: the equality constraint $\sum_{i=1}^{l} \alpha_i y_i = 0$ means that once $\alpha_2, \dots, \alpha_l$ are held fixed, $\alpha_1$ is completely determined, so we cannot improve the objective by changing a single $\alpha_i$ alone.
22. SMO
Change the algorithm as follows; this is just SMO:

Repeat until convergence:
1. Select some pair $\alpha_i$ and $\alpha_j$ to update next (using a heuristic that tries to pick the two that will allow us to make the biggest progress towards the global maximum).
2. Reoptimize $L(\alpha)$ with respect to $\alpha_i$ and $\alpha_j$, while holding all the other $\alpha_k$ fixed.
23. SMO (2)
Say we update $\alpha_1$ and $\alpha_2$. Holding $\alpha_3, \dots, \alpha_l$ fixed, the constraint $\sum_i \alpha_i y_i = 0$ gives $\alpha_1 y_1 + \alpha_2 y_2 = \zeta$ for some constant $\zeta$, so $\alpha_1 = (\zeta - \alpha_2 y_2) y_1$. Substituting into $L$, the objective is a quadratic function in $\alpha_2$, i.e. it can be written as
$$a \alpha_2^2 + b \alpha_2 + c$$
for some constants $a$, $b$, $c$.
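The box constraints $0 \leq \alpha_1, \alpha_2 \leq C$ restrict $\alpha_2$ to an interval $[L, H]$ along the constraint line; the standard bounds (spelled out in Platt's paper, not on the slide) are
$$y_1 \neq y_2: \quad L = \max(0, \alpha_2 - \alpha_1), \qquad H = \min(C, C + \alpha_2 - \alpha_1)$$
$$y_1 = y_2: \quad L = \max(0, \alpha_1 + \alpha_2 - C), \qquad H = \min(C, \alpha_1 + \alpha_2)$$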
24. Solving for $\alpha_2$
For the quadratic function, we can simply solve it by setting its derivative to zero. Let us call the resulting value $\alpha_2^{new, unclipped}$. The box constraint then requires clipping it to $[L, H]$:
$$\alpha_2^{new} = \begin{cases} H, & \alpha_2^{new, unclipped} > H \\ \alpha_2^{new, unclipped}, & L \leq \alpha_2^{new, unclipped} \leq H \\ L, & \alpha_2^{new, unclipped} < L \end{cases}$$
Having found $\alpha_2$, we can go back and find the optimal $\alpha_1$. Please read Platt's paper if you want more details.
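A sketch of this two-variable step, using the closed-form update from Platt's paper; here K is a precomputed kernel (inner-product) matrix and E[k] = g(x_k) - y_k are prediction errors (variable names are illustrative):

```python
def smo_pair_update(i, j, alpha, y, K, E, C):
    """One SMO step: analytically reoptimize alpha[i] and alpha[j].

    K: precomputed kernel matrix; E[k] = g(x_k) - y_k prediction errors;
    C: box-constraint bound. Returns True if the pair was updated.
    """
    if i == j:
        return False
    # Interval [L, H] keeping both multipliers in [0, C] on the constraint line.
    if y[i] != y[j]:
        L = max(0.0, alpha[j] - alpha[i])
        H = min(C, C + alpha[j] - alpha[i])
    else:
        L = max(0.0, alpha[i] + alpha[j] - C)
        H = min(C, alpha[i] + alpha[j])
    if L >= H:
        return False
    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]  # curvature of the 1-D quadratic
    if eta <= 0:
        return False  # not strictly concave along this direction; skip pair
    a_j = alpha[j] + y[j] * (E[i] - E[j]) / eta  # unclipped maximizer
    a_j = min(H, max(L, a_j))                    # clip to [L, H]
    alpha[i] += y[i] * y[j] * (alpha[j] - a_j)   # keep sum_k alpha_k y_k fixed
    alpha[j] = a_j
    return True
```

In a full implementation the intercept b and the errors E must also be refreshed after each accepted step; Platt's paper gives the details.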
25. Kernel
1. Why a kernel?
2. What is a feature-space mapping $\Phi$? What is a kernel function $K(x, z) = \langle \Phi(x), \Phi(z) \rangle$?
With a kernel, what is more interesting to us?
26. We Can Compute the Kernel Without Calculating the Mapping
Replace all $\langle x, z \rangle$ by $K(x, z) = \langle \Phi(x), \Phi(z) \rangle$.
Naively we would need to compute $\Phi(x)$ first, and that may be expensive. But with a kernel we can skip that step. Why? Because in both training and testing, the inputs appear only through the expression $\langle x, z \rangle$.
For example, the kernel $K(x, z) = (x \cdot z)^2$ can be evaluated in $O(n)$ time, even though the corresponding feature map $\Phi(x)$ consists of all $n^2$ products $x_i x_j$ (a numeric check follows below).
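A small numeric check of this example, the quadratic kernel being the standard illustration from the cited lecture notes:

```python
import numpy as np

def phi(x):
    """Explicit feature map for K(x, z) = (x . z)^2: all products x_i * x_j."""
    return np.outer(x, x).ravel()  # n^2-dimensional feature vector

def kernel(x, z):
    """Quadratic kernel: computed in O(n) without ever forming phi."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0, 3.0])
z = np.array([4.0, 5.0, 6.0])
# Both routes agree: <phi(x), phi(z)> == (x . z)^2
print(np.dot(phi(x), phi(z)), kernel(x, z))  # 1024.0 1024.0
```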
27. References
- Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag New York, 1998.
- Andrew Ng. CS229 Lecture Notes, Part V: Support Vector Machines. Lectures from 10/19/03 to 10/26/03.
- Christopher J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2, 121-167 (1998). Kluwer Academic Publishers, Boston.
- Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.
28. People
- Vladimir Vapnik
- John Platt
- Nello Cristianini
- John Shawe-Taylor
- Christopher J. C. Burges
- Thorsten Joachims
- Etc.