Title: Support Vector Machines
Support Vector Machines
- 1. Introduction to SVMs
- 2. Linear SVMs
- 3. Non-linear SVMs
References
1. S.Y. Kung, M.W. Mak, and S.H. Lin. Biometric Authentication: A Machine Learning Approach. Prentice Hall, to appear.
2. S.R. Gunn. Support Vector Machines for Classification and Regression, 1998. (http://www.isis.ecs.soton.ac.uk/resources/svminfo/)
3. Bernhard Schölkopf. Statistical Learning and Kernel Methods. MSR-TR-2000-23, Microsoft Research, 2000. (ftp://ftp.research.microsoft.com/pub/tr/tr-2000-23.pdf)
4. For more resources on support vector machines, see http://www.kernel-machines.org/
Introduction
- SVMs were developed by Vapnik in 1995 and are becoming popular due to their attractive features and promising performance.
- Conventional neural networks are based on empirical risk minimization, where the network weights are determined by minimizing the mean squared error between the actual outputs and the desired outputs.
- SVMs are based on the structural risk minimization principle, where the parameters are optimized by minimizing a bound on the generalization error rather than just the error on the training data.
- SVMs have been shown to possess better generalization capability than conventional neural networks.
Introduction (Cont.)
- Given $N$ labeled empirical data points
$$(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N) \in \mathcal{X} \times \{-1, +1\}, \tag{1}$$
where $\mathcal{X} \subseteq \mathbb{R}^d$ is the set (domain) of input data and the $y_i$ are the class labels.
[Figure: labeled training data of the two classes in the domain $\mathcal{X}$]
Introduction (Cont.)
- We construct a simple classifier by computing the means of the two classes:
$$\mathbf{c}_1 = \frac{1}{N_1}\sum_{i:\,y_i=+1}\mathbf{x}_i, \qquad \mathbf{c}_2 = \frac{1}{N_2}\sum_{i:\,y_i=-1}\mathbf{x}_i, \tag{2}$$
where $N_1$ and $N_2$ are the numbers of data points in the classes with positive and negative labels, respectively.
- We assign a new point $\mathbf{x}$ to the class whose mean is closer to it.
- To achieve this, we compute the midpoint $\mathbf{c} = (\mathbf{c}_1 + \mathbf{c}_2)/2$ of the two means.
Introduction (Cont.)
- Then, we determine the class of $\mathbf{x}$ by checking whether the vector connecting $\mathbf{x}$ and $\mathbf{c}$ encloses an angle smaller than $\pi/2$ with the vector $\mathbf{w} = \mathbf{c}_1 - \mathbf{c}_2$:
$$y = \operatorname{sgn}\,\langle \mathbf{x}-\mathbf{c},\; \mathbf{w}\rangle = \operatorname{sgn}\,\langle \mathbf{x}-\mathbf{c},\; \mathbf{c}_1-\mathbf{c}_2\rangle = \operatorname{sgn}\bigl(\langle\mathbf{x},\mathbf{c}_1\rangle - \langle\mathbf{x},\mathbf{c}_2\rangle + b\bigr),$$
where $b = \tfrac{1}{2}\bigl(\lVert\mathbf{c}_2\rVert^{2} - \lVert\mathbf{c}_1\rVert^{2}\bigr)$.
[Figure: the point $\mathbf{x}$, the class means $\mathbf{c}_1$ and $\mathbf{c}_2$, and their midpoint $\mathbf{c}$ in the domain $\mathcal{X}$]
Introduction (Cont.)
- In the special case where $b = 0$, we have
$$y = \operatorname{sgn}\left(\frac{1}{N_1}\sum_{i:\,y_i=+1}\langle\mathbf{x},\mathbf{x}_i\rangle - \frac{1}{N_2}\sum_{i:\,y_i=-1}\langle\mathbf{x},\mathbf{x}_i\rangle\right). \tag{3}$$
- This means that we use ALL data points $\mathbf{x}_i$, each weighted equally by $1/N_1$ or $1/N_2$, to define the decision plane, as sketched below.
Introduction (Cont.)
[Figure: the decision plane separating the two classes in the domain $\mathcal{X}$]
Introduction (Cont.)
- However, we might want to remove the influence of patterns that are far away from the decision boundary, because their influence is usually small.
- We may also select only a few important data points (called support vectors) and weight them differently.
- Then, we have a support vector machine.
Introduction (Cont.)
[Figure: a maximum-margin decision plane in the domain $\mathcal{X}$; the support vectors lie on the margin]
- We aim to find a decision plane that maximizes the margin.
Linear SVMs
- Assume that all training data satisfy the constraints
$$\mathbf{w}^{T}\mathbf{x}_i + b \ge +1 \;\; \text{for } y_i = +1, \qquad \mathbf{w}^{T}\mathbf{x}_i + b \le -1 \;\; \text{for } y_i = -1, \tag{4}$$
which means
$$y_i(\mathbf{w}^{T}\mathbf{x}_i + b) - 1 \ge 0 \quad \forall i. \tag{5}$$
- Training data points for which the above equality holds lie on hyperplanes parallel to the decision plane.
Linear SVMs (Cont.)
[Figure: the margin $d$ between the two bounding hyperplanes]
- The two bounding hyperplanes are separated by the margin $d = 2/\lVert\mathbf{w}\rVert$.
- Therefore, maximizing the margin is equivalent to minimizing $\lVert\mathbf{w}\rVert^{2}$.
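The margin value can be justified in one line from the distance between parallel hyperplanes; this derivation is standard and is reconstructed here rather than copied from the slide. For the hyperplanes $\mathbf{w}^{T}\mathbf{x} + b = +1$ and $\mathbf{w}^{T}\mathbf{x} + b = -1$,
$$d = \frac{\bigl|(+1) - (-1)\bigr|}{\lVert\mathbf{w}\rVert} = \frac{2}{\lVert\mathbf{w}\rVert},$$
so maximizing $d$ is equivalent to minimizing $\lVert\mathbf{w}\rVert^{2}$ (the square, and the factor $\tfrac{1}{2}$ used later, are only for convenience of the optimization).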
Linear SVMs (Lagrangian)
- We minimize $\lVert\mathbf{w}\rVert^{2}$ subject to the constraint that
$$y_i(\mathbf{w}^{T}\mathbf{x}_i + b) \ge 1, \quad i = 1, \ldots, N. \tag{6}$$
- This can be achieved by introducing Lagrange multipliers $\alpha_i \ge 0$ and a Lagrangian
$$L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\lVert\mathbf{w}\rVert^{2} - \sum_{i=1}^{N}\alpha_i\bigl[y_i(\mathbf{w}^{T}\mathbf{x}_i + b) - 1\bigr]. \tag{7}$$
- The Lagrangian has to be minimized with respect to $\mathbf{w}$ and $b$ and maximized with respect to $\alpha_i \ge 0$.
Linear SVMs (Lagrangian)
- Setting the derivatives of the Lagrangian with respect to $b$ and $\mathbf{w}$ to zero gives
$$\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{N}\alpha_i y_i = 0, \qquad \frac{\partial L}{\partial \mathbf{w}} = \mathbf{0} \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{N}\alpha_i y_i \mathbf{x}_i. \tag{8}$$
- Patterns for which $\alpha_i > 0$ are called support vectors. By the complementarity condition $\alpha_i\bigl[y_i(\mathbf{w}^{T}\mathbf{x}_i + b) - 1\bigr] = 0$, these vectors lie on the margin and satisfy
$$y_i(\mathbf{w}^{T}\mathbf{x}_i + b) = 1, \quad i \in S,$$
where $S$ contains the indexes of the support vectors.
- Patterns for which $\alpha_i = 0$ are considered to be irrelevant to the classification.
Linear SVMs (Wolfe Dual)
- Substituting (8) into (7), we obtain the Wolfe dual: maximize
$$W(\boldsymbol{\alpha}) = \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j\, \mathbf{x}_i^{T}\mathbf{x}_j \tag{9}$$
subject to $\alpha_i \ge 0$ and $\sum_{i=1}^{N}\alpha_i y_i = 0$.
- The decision hyperplane is thus
$$\mathbf{w}^{T}\mathbf{x} + b = \sum_{i \in S}\alpha_i y_i\, \mathbf{x}_i^{T}\mathbf{x} + b = 0.$$
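As a hedged illustration (not part of the original slides), the Python sketch below approximates the hard-margin solution of Eq. (9) with scikit-learn's SVC, using a linear kernel and a very large C, and reads off the support vectors, the signed multipliers α_i y_i, and the resulting w and b; the toy data set is assumed.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (illustrative assumption)
X = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0],
              [3.0, 3.0], [3.5, 2.5], [2.5, 3.5]])
y = np.array([-1, -1, -1, +1, +1, +1])

# A very large C approximates the hard-margin SVM of Eq. (9)
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

print("Support vectors:\n", clf.support_vectors_)
print("alpha_i * y_i:", clf.dual_coef_)   # signed multipliers of the SVs
print("w =", clf.coef_[0])                # w = sum_i alpha_i y_i x_i
print("b =", clf.intercept_[0])

# Decision rule f(x) = sgn(w.x + b) on a new point
x_new = np.array([[1.0, 1.0]])
print("f(x_new) =", np.sign(clf.decision_function(x_new)))
```

Note that `dual_coef_` stores the products α_i y_i rather than α_i alone, so `dual_coef_ @ support_vectors_` reproduces the weight vector of Eq. (8).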
Linear SVMs (Example)
- Analytical example (3-point problem)
Linear SVMs (Example)
- We introduce another Lagrange multiplier $\lambda$ for the equality constraint $\sum_i \alpha_i y_i = 0$ to obtain the Lagrangian
$$F(\boldsymbol{\alpha}, \lambda) = \sum_{i}\alpha_i - \frac{1}{2}\sum_{i}\sum_{j}\alpha_i\alpha_j y_i y_j\, \mathbf{x}_i^{T}\mathbf{x}_j + \lambda\sum_{i}\alpha_i y_i.$$
- Differentiating $F(\boldsymbol{\alpha}, \lambda)$ with respect to $\lambda$ and $\alpha_i$ and setting the results to zero, we obtain a set of linear equations whose solution gives the $\alpha_i$ and $\lambda$.
Linear SVMs (Example)
- Substituting the resulting Lagrange multipliers into Eq. (8) gives the weight vector $\mathbf{w}$; the bias $b$ then follows from the margin condition $y_i(\mathbf{w}^{T}\mathbf{x}_i + b) = 1$ on any support vector, as worked through in the sketch below.
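The slides' actual three points are not reproduced in this text, so the sketch below runs the same procedure on an assumed 3-point problem in Python: build the stationarity equations of F(α, λ), solve the linear system, and recover w and b from Eq. (8). The points, labels, and variable names are illustrative assumptions.

```python
import numpy as np

# Assumed 3-point linearly separable problem (not the slides' original data)
X = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
y = np.array([-1.0, 1.0, 1.0])

# H_ij = y_i y_j <x_i, x_j>
H = (y[:, None] * y[None, :]) * (X @ X.T)

# Stationarity of F(alpha, lambda):
#   dF/dalpha_i = 0  ->  (H alpha)_i - lambda * y_i = 1
#   dF/dlambda  = 0  ->  sum_i alpha_i y_i = 0
# Assemble as one linear system in (alpha_1, alpha_2, alpha_3, lambda).
A = np.block([[H, -y[:, None]],
              [y[None, :], np.zeros((1, 1))]])
rhs = np.array([1.0, 1.0, 1.0, 0.0])
sol = np.linalg.solve(A, rhs)
alpha, lam = sol[:3], sol[3]
assert np.all(alpha >= 0), "valid only if every point is a support vector"

# Eq. (8): w = sum_i alpha_i y_i x_i; b from y_k (w.x_k + b) = 1 on any SV
w = (alpha * y) @ X
b = y[1] - w @ X[1]
print("alpha =", alpha, "lambda =", lam)   # alpha = [1.0, 0.5, 0.5]
print("w =", w, "b =", b)                  # w = [1.0, 1.0], b = -1.0
```

Solving the stationarity equations directly is only valid when every α_i turns out non-negative (here all three points are support vectors); otherwise the active set has to be determined as in a general quadratic program.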
Linear SVMs (Example)
- 4-point linearly separable problem.
[Figures: two configurations of four points, one solved with 4 support vectors and one with 3 support vectors]
Linear SVMs (Non-linearly separable)
- Non-linearly separable patterns are patterns that cannot be separated by a linear decision boundary without incurring classification errors.
[Figure: overlapping classes; data that cause classification errors in linear SVMs]
Linear SVMs (Non-linearly separable)
- We introduce a set of slack variables $\xi_i \ge 0$, $i = 1, \ldots, N$.
- The slack variables allow some data to violate the constraints defined for the linearly separable case (Eq. 6).
- Therefore, for some $\xi_i > 0$ we have
$$y_i(\mathbf{w}^{T}\mathbf{x}_i + b) \ge 1 - \xi_i.$$
Linear SVMs (Non-linearly separable)
- E.g., $\xi_{10} > 0$ and $\xi_{19} > 0$ because $\mathbf{x}_{10}$ and $\mathbf{x}_{19}$ are inside the margins, i.e., they violate the constraint (Eq. 6).
Linear SVMs (Non-linearly separable)
- The optimization problem becomes
$$\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \;\; \frac{1}{2}\lVert\mathbf{w}\rVert^{2} + C\sum_{i=1}^{N}\xi_i \quad \text{subject to} \quad y_i(\mathbf{w}^{T}\mathbf{x}_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0,$$
where $C$ is a user-defined penalty parameter that penalizes any violation of the margins.
Linear SVMs (Non-linearly separable)
- The output weight vector and bias term are
$$\mathbf{w} = \sum_{i \in S}\alpha_i y_i \mathbf{x}_i, \qquad b = y_k - \mathbf{w}^{T}\mathbf{x}_k \;\; \text{for any } k \text{ with } 0 < \alpha_k < C,$$
where $S$ is the set of support vectors; in practice $b$ is averaged over all such unbounded support vectors, as in the sketch below.
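To connect these formulas to a library implementation, the hedged sketch below fits scikit-learn's linear SVC on assumed overlapping data and rebuilds w and b from the support vectors and their multipliers, checking them against the solver's own `coef_` and `intercept_`; the data and tolerance values are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Assumed overlapping 2-D data, so some slack variables are non-zero
X = np.vstack([rng.normal([0, 0], 1.2, (40, 2)),
               rng.normal([2, 2], 1.2, (40, 2))])
y = np.hstack([-np.ones(40), np.ones(40)])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

alpha_y = clf.dual_coef_[0]            # alpha_i * y_i for the support vectors
sv = clf.support_vectors_

# w = sum_{i in S} alpha_i y_i x_i
w = alpha_y @ sv

# b averaged over unbounded SVs (0 < alpha_i < C); assumes at least one exists
on_margin = np.abs(np.abs(alpha_y) - clf.C) > 1e-8
y_sv = y[clf.support_]
b = np.mean(y_sv[on_margin] - sv[on_margin] @ w)

print("w (manual)  =", w, "  b (manual)  =", b)
print("w (sklearn) =", clf.coef_[0], "  b (sklearn) =", clf.intercept_[0])
```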
2. Linear SVMs (Types of SVs)
- Three types of support vectors (see the sketch after this list):
  1. On the margin
  2. Inside the margin
  3. Outside the margin
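As an illustrative, assumed way to see these three categories with scikit-learn, the sketch below computes the slack ξ_i = max(0, 1 − y_i f(x_i)) for each support vector of a fitted linear SVC and bins the vectors into on-margin (ξ ≈ 0), inside-margin (0 < ξ ≤ 1), and beyond-the-margin (ξ > 1) groups; the data and tolerances are assumptions, not slide material.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
# Assumed overlapping classes so that all three SV types appear
X = np.vstack([rng.normal([0, 0], 1.5, (60, 2)),
               rng.normal([2, 2], 1.5, (60, 2))])
y = np.hstack([-np.ones(60), np.ones(60)])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

sv_idx = clf.support_
f = clf.decision_function(X[sv_idx])          # f(x_i) = w.x_i + b
xi = np.maximum(0.0, 1.0 - y[sv_idx] * f)     # slack xi_i for each SV

tol = 1e-6
on_margin = np.sum(xi <= tol)                 # type 1: xi = 0
inside = np.sum((xi > tol) & (xi <= 1.0))     # type 2: 0 < xi <= 1
outside = np.sum(xi > 1.0)                    # type 3: xi > 1 (misclassified)
print(f"on margin: {on_margin}, inside margin: {inside}, beyond margin: {outside}")
```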
2. Linear SVMs (Types of SVs)
2. Linear SVMs (Types of SVs)
- Swapping Class 1 and Class 2.
2. Linear SVMs (Types of SVs)
[Figures: support vectors and decision boundaries for C = 0.1 and C = 100]
3. Non-linear SVMs
- In case the training data $\mathcal{X}$ are not linearly separable, we may map the data from the input space to a higher-dimensional feature space where they become linearly separable; the mapping is carried out implicitly through a kernel function, as sketched below.
[Figure: a non-linear decision boundary in the input space (domain $\mathcal{X}$) corresponds to a linear decision boundary in the feature space]
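To make the feature-space mapping concrete, the hedged sketch below maps assumed 1-D data that are not linearly separable (one class surrounded by the other) through φ(x) = (x, x²), after which a linear SVM separates them; the mapping, data, and parameter values are illustrative assumptions, not from the slides.

```python
import numpy as np
from sklearn.svm import SVC

# Assumed 1-D data: class +1 near the origin, class -1 further out
x = np.array([-3.0, -2.5, -2.0, -0.5, 0.0, 0.5, 2.0, 2.5, 3.0])
y = np.array([-1, -1, -1, +1, +1, +1, -1, -1, -1])

# Explicit feature map phi(x) = (x, x^2): the classes become linearly
# separable by a horizontal line in the (x, x^2) plane.
Phi = np.column_stack([x, x ** 2])

clf = SVC(kernel="linear", C=1e6).fit(Phi, y)
print("training accuracy in feature space:", clf.score(Phi, y))  # 1.0

# A degree-2 polynomial kernel achieves a comparable mapping implicitly,
# working directly on the raw 1-D inputs.
clf_kernel = SVC(kernel="poly", degree=2, coef0=1.0, C=1e6).fit(x.reshape(-1, 1), y)
print("training accuracy with poly kernel:", clf_kernel.score(x.reshape(-1, 1), y))
```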
3. Non-linear SVMs (Cont.)
- The decision function becomes
$$f(\mathbf{x}) = \operatorname{sgn}\!\left(\sum_{i \in S}\alpha_i y_i\, \boldsymbol{\phi}(\mathbf{x}_i)^{T}\boldsymbol{\phi}(\mathbf{x}) + b\right),$$
where $\boldsymbol{\phi}(\cdot)$ is the mapping to the feature space.
3. Non-linear SVMs (Cont.)
3. Non-linear SVMs (Cont.)
- The decision function becomes
$$f(\mathbf{x}) = \operatorname{sgn}\!\left(\sum_{i \in S}\alpha_i y_i\, K(\mathbf{x}_i, \mathbf{x}) + b\right),$$
where $K(\mathbf{x}_i, \mathbf{x}) = \boldsymbol{\phi}(\mathbf{x}_i)^{T}\boldsymbol{\phi}(\mathbf{x})$ is the kernel function.
3. Non-linear SVMs (Cont.)
- The optimization problem becomes: maximize
$$W(\boldsymbol{\alpha}) = \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j\, K(\mathbf{x}_i, \mathbf{x}_j)$$
subject to $0 \le \alpha_i \le C$ and $\sum_{i=1}^{N}\alpha_i y_i = 0$.
- The decision function becomes
$$f(\mathbf{x}) = \operatorname{sgn}\!\left(\sum_{i \in S}\alpha_i y_i\, K(\mathbf{x}_i, \mathbf{x}) + b\right).$$
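As a hedged check of the kernelized decision function (not part of the original slides), the sketch below fits scikit-learn's SVC with an RBF kernel on assumed ring-shaped data and re-evaluates Σ_{i∈S} α_i y_i K(x_i, x) + b by hand from `dual_coef_`, `support_vectors_`, and `intercept_`, comparing it with `decision_function`; the data and parameter values are assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(3)
# Assumed ring-shaped data: inner class +1, outer class -1
r = np.hstack([rng.uniform(0, 1, 100), rng.uniform(2, 3, 100)])
t = rng.uniform(0, 2 * np.pi, 200)
X = np.column_stack([r * np.cos(t), r * np.sin(t)])
y = np.hstack([np.ones(100), -np.ones(100)])

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=10.0).fit(X, y)

# Manual evaluation of f(x) = sum_{i in S} alpha_i y_i K(x_i, x) + b
X_test = rng.normal(0, 2, (5, 2))
K = rbf_kernel(X_test, clf.support_vectors_, gamma=gamma)   # K(x, x_i)
f_manual = K @ clf.dual_coef_[0] + clf.intercept_[0]

print("manual :", f_manual)
print("sklearn:", clf.decision_function(X_test))            # should match
```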
3. Non-linear SVMs (Cont.)
- The effect of varying C on RBF-SVMs.
[Figures: RBF-SVM decision boundaries for C = 1000 and C = 10]
3. Non-linear SVMs (Cont.)
- The effect of varying C on polynomial-SVMs.
[Figures: polynomial-SVM decision boundaries for C = 1000 and C = 10]
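Since the original figures are not reproduced in this text, the hedged sketch below sweeps C for RBF and polynomial SVMs on assumed noisy data and reports the number of support vectors and the training accuracy, which is one way to observe the same effect numerically; the data, kernel parameters, and C values are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
# Assumed noisy two-class data
X = np.vstack([rng.normal([0, 0], 1.3, (100, 2)),
               rng.normal([2, 2], 1.3, (100, 2))])
y = np.hstack([-np.ones(100), np.ones(100)])

# A larger C penalizes margin violations more heavily, giving a more
# complex boundary that fits the training data more tightly; a smaller C
# yields a smoother boundary with more (bounded) support vectors.
for kernel, params in [("rbf", {"gamma": 0.5}), ("poly", {"degree": 3, "coef0": 1.0})]:
    for C in (10.0, 1000.0):
        clf = SVC(kernel=kernel, C=C, **params).fit(X, y)
        print(f"{kernel:4s}  C={C:7.1f}  #SV={clf.n_support_.sum():3d}  "
              f"train acc={clf.score(X, y):.3f}")
```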