Title: 7. Support Vector Machines (SVMs)
- Basic Idea
- Transform the data with a non-linear mapping φ so that it is linearly separable. Cf. Cover's theorem: non-linearly separable data can be transformed into a new feature space which is linearly separable if 1) the mapping is non-linear and 2) the dimensionality of the feature space is high enough
- Construct the optimal hyperplane (a linear weighted sum of the outputs of the first layer) which maximises the degree of separation (the margin of separation, denoted by r) between the 2 classes
MLPs and RBFNs stop training when all points are classified correctly. Thus the decision surfaces are not optimised in the sense that the generalization error is not minimized
[Figure: decision boundaries found by an MLP, an RBF network and an SVM for the same data; only the SVM boundary is placed to maximise the margin of separation r]
[Figure: SVM network architecture. The m0-dimensional input vector x = (x1, ..., xm0) is mapped by m1 non-linear functions φ1(x), ..., φm1(x) into the feature space; a linear output unit with weights w1, ..., wm1 and bias b = w0 forms the output y = Σ_i w_i φ_i(x) + b = w^T φ(x)]
The first layer performs a mapping from the input space into a feature space of higher dimension, where the data is now linearly separable, using a set of m1 non-linear functions (cf. RBFNs)
- After learning, both RBFN and MLP decision surfaces might not be at the optimal position. For example, as shown in the figure, both learning rules will not perform further iterations (learning) since the error criterion is satisfied (cf. the perceptron)
- In contrast, the SVM algorithm generates the optimal decision boundary (the dotted line) by maximising the distance between the classes r, which is specified by the distance between the decision boundary and the nearest data points
- Points which lie exactly r/2 away from the decision boundary are known as Support Vectors
- The intuition is that these are the most important points, since moving them moves the decision boundary
Moving a support vector moves the decision boundary
Moving the other vectors has no effect
The algorithm to generate the weights proceeds in such a way that only the support vectors determine the weights and thus the boundary
However, we shall see that the output of the SVM can also be interpreted as a weighted sum of the inner (dot) products of the images of the input x and the support vectors xi in the feature space, each computed by an inner-product kernel function K(x, xi)

[Figure: kernel interpretation of the SVM. The input vector x = (x1, ..., xm0) feeds N kernel units K_i(x) = K(x, x_i) = φ^T(x) φ(x_i), whose outputs are weighted by a_i d_i and summed with the bias b to give y = Σ_i a_i d_i K(x, x_i) + b]

where φ(x) = (φ1(x), φ2(x), ..., φm1(x))^T is the image of x in feature space and d_i = +/-1 depending on the class of x_i
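As a minimal sketch of this output form (the support vectors, labels, multipliers and the Gaussian kernel below are illustrative assumptions, not values from the slides), the SVM output is just a weighted sum of kernel evaluations:

```python
import numpy as np

def rbf_kernel(x, xi, sigma=1.0):
    # Inner-product kernel K(x, xi) = phi(x)^T phi(xi); here a Gaussian/RBF kernel is assumed
    return np.exp(-np.sum((x - xi) ** 2) / (2.0 * sigma ** 2))

def svm_output(x, support_vectors, d, a, b):
    # y = sum_i a_i d_i K(x, x_i) + b  -- only the support vectors contribute
    return sum(ai * di * rbf_kernel(x, xi)
               for ai, di, xi in zip(a, d, support_vectors)) + b

# Hypothetical support vectors, class labels (+/-1), multipliers and bias
sv = [np.array([0.0, 1.0]), np.array([1.0, 0.0])]
d  = [+1, -1]
a  = [0.5, 0.5]
b  = 0.0
print(np.sign(svm_output(np.array([0.2, 0.9]), sv, d, a, b)))  # predicted class
```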
Why should inner-product kernels be involved in pattern recognition? The intuition is that they provide some measure of similarity. Cf. the inner product in 2D between 2 vectors of unit length, which returns the cosine of the angle between them. E.g. for x = (1, 0)^T and y = (0, 1)^T: if two vectors are parallel the inner product is 1 (x^T x = x.x = 1); if they are perpendicular the inner product is 0 (x^T y = x.y = 0)
- Differs from MLP (etc.) approaches in a fundamental way
- In MLPs complexity is controlled by keeping the number of hidden nodes small
- Here complexity is controlled independently of dimensionality
- The mapping means that the decision surface is constructed in a very high (often infinite) dimensional space
- However, the curse of dimensionality (which makes finding the optimal weights difficult) is avoided by using the notion of an inner-product kernel (see the kernel trick, later) and optimising the weights in the input space
SVMs are a superclass of network containing both MLPs and RBFNs (and both can be generated using the SV algorithm)

Strengths:
- As on the previous slide, complexity/capacity is independent of the dimensionality of the data, thus avoiding the curse of dimensionality
- Statistically motivated: can get bounds on the error and use the theory of VC dimension and structural risk minimisation (theory which characterises the generalisation abilities of learning machines)
- Finding the weights is a quadratic programming problem guaranteed to find a minimum of the error surface. Thus the algorithm is efficient and SVMs generate near-optimal classification and are insensitive to overtraining
- Obtain good generalisation performance due to the high dimension of the feature space
- Most important(?): by using a suitable kernel, the SVM automatically computes all network parameters for that kernel. E.g. an RBF SVM automatically selects the number and position of hidden nodes (and weights and bias)

Weaknesses:
- Scale (metric) dependent
- Slow training (compared to RBFNs/MLPs) due to the computationally intensive solution of the QP problem, especially for large amounts of training data -> need special algorithms
- Generates complex solutions (normally > 60% of the training points are used as support vectors), especially for large amounts of training data. E.g. from Haykin: an increase in performance of 1.5% over an MLP, but where the MLP used 2 hidden nodes the SVM used 285
- Difficult to incorporate prior knowledge
The SVM was proposed by Vapnik and colleagues in the 70s but has only recently become popular (early 90s). It (and other kernel techniques) is currently a very active (and trendy) topic of research. See for example http://www.kernel-machines.org or (book) "An Introduction to Support Vector Machines (and other kernel-based learning methods)", N. Cristianini and J. Shawe-Taylor, Cambridge University Press, 2000, ISBN 0 521 78019 5, for recent developments
First consider a linearly separable problem where the decision boundary is given by

g(x) = w^T x + b = 0

and a set of training data X = {(x_i, d_i), i = 1, ..., N}, where d_i = +1 if x_i is in class 1 and -1 if it is in class 2. Let the optimal weight-bias combination be w_0 and b_0

[Figure: a point x decomposed into x_p, its projection onto the decision boundary, plus a normal component x_n in the direction of w]

Now x = x_p + x_n = x_p + r w_0/||w_0||, where r = ||x_n|| is the algebraic distance of x from the boundary. Since g(x_p) = 0,

g(x) = w_0^T (x_p + r w_0/||w_0||) + b_0 = r w_0^T w_0/||w_0|| = r ||w_0||

or r = g(x)/||w_0||
- Thus, as g(x) gives us the algebraic distance to the hyperplane, we want
- g(x_i) = w_0^T x_i + b_0 >= 1 for d_i = +1
- and g(x_i) = w_0^T x_i + b_0 <= -1 for d_i = -1
- (remembering that w_0 and b_0 can be rescaled without changing the boundary) with equality for the support vectors x_s. Thus, considering points on the boundary and that
- r = g(x)/||w_0||
- we have
- r = 1/||w_0|| for d_s = +1 and r = -1/||w_0|| for d_s = -1
- and so the margin of separation is
- r = 2/||w_0||
- Thus, the solution w_0 maximises the margin of separation
- Maximising this margin is equivalent to minimising ||w||
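A short numeric check of r = g(x)/||w|| and of the margin 2/||w|| (the weight vector, bias and point below are made up purely for illustration):

```python
import numpy as np

w = np.array([3.0, 4.0])   # hypothetical weight vector, ||w|| = 5
b = -2.0                   # hypothetical bias

def g(x):
    # algebraic output of the hyperplane; distance is g(x) scaled by 1/||w||
    return w @ x + b

x = np.array([2.0, 1.0])
r = g(x) / np.linalg.norm(w)        # signed geometric distance of x to the hyperplane
margin = 2.0 / np.linalg.norm(w)    # margin of separation when the SVs satisfy |g| = 1
print(r, margin)                    # 1.6, 0.4
```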
We now need a computationally efficient algorithm to find w_0 and b_0 using the training data (x_i, d_i). That is, we want to minimise

F(w) = 1/2 w^T w

subject to d_i(w^T x_i + b) >= 1 for i = 1, ..., N

which is known as the primal problem. Note that the cost function F is convex in w (=> a unique solution) and that the constraints are linear in w. Thus we can solve for w using the technique of Lagrange multipliers (a technique for solving constrained optimisation problems). For a geometrical interpretation of Lagrange multipliers see Bishop, 95, Appendix C.
First we construct the Lagrangian function

L(w, b, a) = 1/2 w^T w - Σ_i a_i [d_i(w^T x_i + b) - 1]

where the a_i are the Lagrange multipliers. L must be minimised with respect to w and b and maximised with respect to the a_i (it can be shown that such problems have a saddle point at the optimal solution). Note that the Karush-Kuhn-Tucker (or, intuitively, the maximisation/constraint) conditions mean that at the optimum

a_i [d_i(w^T x_i + b) - 1] = 0

This means that unless the data point is a support vector, a_i = 0 and the respective points are not involved in the optimisation. We then set the partial derivatives of L with respect to b and w to zero to obtain the conditions of optimality

w = Σ_i a_i d_i x_i  and  Σ_i a_i d_i = 0
Given such a constrained convex problem, we can reform the primal problem using the optimality conditions to get the equivalent dual problem: given the training data sample {(x_i, d_i), i = 1, ..., N}, find the Lagrange multipliers a_i which maximise

Q(a) = Σ_i a_i - 1/2 Σ_i Σ_j a_i a_j d_i d_j x_i^T x_j

subject to the constraints Σ_i a_i d_i = 0 and a_i >= 0

Notice that the input vectors are only involved as an inner product
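In matrix form the dual objective is Q(a) = Σ_i a_i - 1/2 a^T (D G D) a, with Gram matrix G_ij = x_i^T x_j and D = diag(d). A small sketch of evaluating it (the toy inputs and candidate multipliers below are assumptions for illustration):

```python
import numpy as np

X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -1.0]])  # toy inputs
d = np.array([1.0, 1.0, -1.0, -1.0])                                # class labels +/-1

G = X @ X.T              # Gram matrix of inner products x_i^T x_j
H = np.outer(d, d) * G   # H_ij = d_i d_j x_i^T x_j

def Q(a):
    # Dual objective: sum_i a_i - 1/2 sum_ij a_i a_j d_i d_j x_i^T x_j
    return a.sum() - 0.5 * a @ H @ a

a = np.array([0.1, 0.0, 0.05, 0.05])   # a candidate satisfying sum(a*d) = 0, a >= 0
print(Q(a))
```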
Once the optimal Lagrange multipliers a_0,i have been found we can use them to find the optimal w,

w_0 = Σ_i a_0,i d_i x_i

and the optimal bias from the fact that for a positive support vector

w_0^T x_i + b_0 = 1  =>  b_0 = 1 - w_0^T x_i

However, from a numerical perspective it is better to take the mean value of b_0 resulting from all such data points in the sample. Since a_0,i = 0 if x_i is not a support vector, ONLY the support vectors determine the optimal hyperplane, which was our intuition
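A quick sketch of this on toy separable data, assuming scikit-learn is available (not part of the original slides): the fitted model exposes the support vectors, the products a_0,i d_i (as dual_coef_), the bias, and, for a linear kernel, the explicit w_0.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data with labels +/-1 (illustrative values)
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [-2.0, -2.0], [-2.5, -3.0], [-3.0, -2.5]])
d = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel='linear', C=1e6).fit(X, d)   # very large C approximates the hard-margin case

print(clf.support_vectors_)        # only these points determine the hyperplane
print(clf.dual_coef_)              # a_0,i * d_i for the support vectors
print(clf.coef_, clf.intercept_)   # w_0 = sum_i a_0,i d_i x_i, and b_0
```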
For a non-linearly separable problem we have to first map the data onto a feature space so that it is linearly separable there:

x_i -> φ(x_i)

with the procedure for determining w the same except that x_i is replaced by φ(x_i). That is, given the training data sample {(x_i, d_i), i = 1, ..., N}, find the optimum values of the weight vector w and bias b,

w = Σ_i a_0,i d_i φ(x_i)

where the a_0,i are the optimal Lagrange multipliers determined by maximising the objective function

Q(a) = Σ_i a_i - 1/2 Σ_i Σ_j a_i a_j d_i d_j φ(x_i)^T φ(x_j)

subject to the constraints Σ_i a_i d_i = 0 and a_i >= 0
Example: the XOR problem revisited. Let the nonlinear mapping be

φ(x) = (1, x1^2, 2^(1/2) x1 x2, x2^2, 2^(1/2) x1, 2^(1/2) x2)^T

and φ(x_i) = (1, x_i1^2, 2^(1/2) x_i1 x_i2, x_i2^2, 2^(1/2) x_i1, 2^(1/2) x_i2)^T. Therefore the feature space is 6-dimensional, with the input data in 2D:

x_1 = (-1, -1), d_1 = -1
x_2 = (-1, +1), d_2 = +1
x_3 = (+1, -1), d_3 = +1
x_4 = (+1, +1), d_4 = -1
Q(a) = Σ_i a_i - 1/2 Σ_i Σ_j a_i a_j d_i d_j φ(x_i)^T φ(x_j)
     = a_1 + a_2 + a_3 + a_4 - 1/2 (9 a_1 a_1 - 2 a_1 a_2 - 2 a_1 a_3 + 2 a_1 a_4 + 9 a_2 a_2 + 2 a_2 a_3 - 2 a_2 a_4 + 9 a_3 a_3 - 2 a_3 a_4 + 9 a_4 a_4)

(e.g. φ(x_1)^T φ(x_1) = 1 + 1 + 2 + 1 + 2 + 2 = 9 and φ(x_1)^T φ(x_2) = 1 + 1 - 2 + 1 + 2 - 2 = 1)

To maximise Q we only need to set ∂Q/∂a_i = 0 (due to the optimality conditions), which gives

1 = 9 a_1 - a_2 - a_3 + a_4
1 = -a_1 + 9 a_2 + a_3 - a_4
1 = -a_1 + a_2 + 9 a_3 - a_4
1 = a_1 - a_2 - a_3 + 9 a_4
The solution of this system gives the optimal values

a_0,1 = a_0,2 = a_0,3 = a_0,4 = 1/8

w_0 = Σ_i a_0,i d_i φ(x_i) = 1/8 [-φ(x_1) + φ(x_2) + φ(x_3) - φ(x_4)] = (0, 0, -1/2^(1/2), 0, 0, 0)^T

where the first element of w_0 gives the bias b (here b = 0)
From earlier we have that the optimal hyperplane is defined by

w_0^T φ(x) = 0

that is,

w_0^T φ(x) = -x_1 x_2 = 0

which is the optimal decision boundary for the XOR problem. Furthermore, we note that the solution is unique since the optimal decision boundary is unique
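This XOR solution can be checked numerically (a verification sketch, not part of the original slides): with K(x_i, x_j) = (x_i^T x_j + 1)^2 the optimality equations above are linear in a, and solving them recovers a_i = 1/8 and the decision function -x_1 x_2.

```python
import numpy as np

X = np.array([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]])
d = np.array([-1.0, 1.0, 1.0, -1.0])     # XOR labels

K = (X @ X.T + 1.0) ** 2                 # degree-2 polynomial kernel matrix
H = np.outer(d, d) * K                   # H_ij = d_i d_j K(x_i, x_j)

a = np.linalg.solve(H, np.ones(4))       # dQ/da = 0  =>  H a = 1
print(a)                                 # [0.125 0.125 0.125 0.125]

def f(x):
    # Decision function sum_i a_i d_i K(x, x_i); the bias (first element of w_0) is 0 here
    return sum(ai * di * (x @ xi + 1.0) ** 2 for ai, di, xi in zip(a, d, X))

print([round(f(x), 3) for x in X])       # [-1, 1, 1, -1], i.e. -x1*x2 at the four corners
```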
[Figure: SVM output surfaces for the XOR problem using polynomial and RBF kernels]
- SVM building procedure:
- Pick a nonlinear mapping φ
- Solve for the optimal weight vector
- However, how do we pick the function φ?
- In practical applications, if it is not totally impossible to find φ, it is very hard
- In the previous example the function φ is quite complex: how would we find it?
- Answer: the kernel trick
Notice that in the dual problem the images of the input vectors are only involved as an inner product, meaning that the optimisation can be performed in the (lower dimensional) input space and that the inner product can be replaced by an inner-product kernel:

Q(a) = Σ_i a_i - 1/2 Σ_i Σ_j a_i a_j d_i d_j φ(x_i)^T φ(x_j)
     = Σ_i a_i - 1/2 Σ_i Σ_j a_i a_j d_i d_j K(x_i, x_j)

How do we relate the output of the SVM to the kernel K? Look at the equation of the boundary in the feature space and use the optimality conditions derived from the Lagrangian formulation
In the XOR problem we chose to use the kernel function

K(x, x_i) = (x^T x_i + 1)^2
          = 1 + x_1^2 x_i1^2 + 2 x_1 x_2 x_i1 x_i2 + x_2^2 x_i2^2 + 2 x_1 x_i1 + 2 x_2 x_i2

which implied the form of our nonlinear functions

φ(x) = (1, x_1^2, 2^(1/2) x_1 x_2, x_2^2, 2^(1/2) x_1, 2^(1/2) x_2)^T
φ(x_i) = (1, x_i1^2, 2^(1/2) x_i1 x_i2, x_i2^2, 2^(1/2) x_i1, 2^(1/2) x_i2)^T

However, we did not need to calculate φ at all and could simply have used the kernel to calculate

Q(a) = Σ_i a_i - 1/2 Σ_i Σ_j a_i a_j d_i d_j K(x_i, x_j)

maximised it, solved for the a_i, and derived the hyperplane via Σ_i a_0,i d_i K(x, x_i) = 0
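A small numerical check (illustrative only, with arbitrary test points) that the degree-2 polynomial kernel really does equal the inner product of the explicit 6-D images φ(x), so φ never has to be formed:

```python
import numpy as np

def phi(x):
    # Explicit feature map implied by K(x, z) = (x^T z + 1)^2 in 2-D
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2)*x1*x2, x2**2, np.sqrt(2)*x1, np.sqrt(2)*x2])

def K(x, z):
    return (x @ z + 1.0) ** 2

x = np.array([0.3, -1.2])
z = np.array([2.0, 0.5])
print(K(x, z), phi(x) @ phi(z))   # identical values: the kernel computes the inner product in feature space
```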
We therefore only need a suitable choice of kernel function, cf. Mercer's Theorem: let K(x, y) be a continuous symmetric kernel defined on the closed interval [a, b]. The kernel K can be expanded in the form K(x, y) = φ(x)^T φ(y) provided it is positive definite. Some of the usual choices for K are:

Polynomial SVM: (x^T x_i + 1)^p, with p specified by the user
RBF SVM: exp(-1/(2σ^2) ||x - x_i||^2), with σ specified by the user
MLP SVM: tanh(s0 x^T x_i + s1) -- Mercer's theorem is not satisfied for all s0 and s1
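The three standard kernels above written out as functions (a sketch; the default parameter values are arbitrary and would normally be supplied by the user):

```python
import numpy as np

def poly_kernel(x, xi, p=2):
    # Polynomial SVM: K(x, xi) = (x^T xi + 1)^p, degree p chosen by the user
    return (x @ xi + 1.0) ** p

def rbf_kernel(x, xi, sigma=1.0):
    # RBF SVM: K(x, xi) = exp(-||x - xi||^2 / (2 sigma^2)), width sigma chosen by the user
    return np.exp(-np.sum((x - xi) ** 2) / (2.0 * sigma ** 2))

def mlp_kernel(x, xi, s0=1.0, s1=-1.0):
    # MLP (sigmoid) SVM: K(x, xi) = tanh(s0 x^T xi + s1); Mercer's condition holds only for some s0, s1
    return np.tanh(s0 * (x @ xi) + s1)
```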
- How to recover φ from a given K? Not essential that we do
- Further development:
- 1. In practical applications, it is found that the support vector machine can outperform other learning machines
- 2. How to choose the kernel?
- 3. How much better is the SVM compared with traditional machines?
- Feng J. and Williams P. M. (2001) "The generalization error of the symmetric and scaled support vector machine", IEEE Transactions on Neural Networks, Vol. 12, No. 5, 1255-1260
What about regularisation? It is important that we don't allow noise to spoil our generalisation: we want a soft margin of separation. Introduce slack variables ξ_i >= 0 such that

d_i(w^T x_i + b) >= 1 - ξ_i for i = 1, ..., N

rather than d_i(w^T x_i + b) >= 1

[Figure: three kinds of points relative to the soft margin: 0 < ξ_i < 1 (inside the margin but correctly classified), ξ_i > 1 (misclassified), and ξ_i = 0 (on the margin). All 3 are support vectors, since d_i(w^T x_i + b) = 1 - ξ_i]
Thus the slack variables measure our deviation from the ideal of pattern separability and also allow us some freedom in specifying the hyperplane. Therefore we formulate a new problem: minimise

F(w, ξ) = 1/2 w^T w + C Σ_i ξ_i

subject to d_i(w^T x_i + b) >= 1 - ξ_i for i = 1, ..., N and ξ_i >= 0

where C acts as an (inverse) regularisation parameter which can be determined experimentally or analytically.
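A short sketch of the effect of C, assuming scikit-learn and synthetic overlapping data (all values illustrative): small C tolerates more slack and a wider margin, large C approaches the hard-margin case.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Two overlapping Gaussian blobs, relabelled to +/-1
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)
d = np.where(y == 1, 1, -1)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, d)
    # Larger C penalises slack more heavily; typically fewer support vectors remain
    print(C, len(clf.support_))
```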
The solution proceeds in the same way as before (form the Lagrangian, formulate the dual and maximise) to obtain the optimal a_i for

Q(a) = Σ_i a_i - 1/2 Σ_i Σ_j a_i a_j d_i d_j K(x_i, x_j)

subject to the constraints Σ_i a_i d_i = 0 and 0 <= a_i <= C

Thus the nonseparable problem differs from the separable one only in that the second constraint is more stringent. Again the optimal solution is

w_0 = Σ_i a_0,i d_i φ(x_i)

However, this time the KKT conditions imply that ξ_i = 0 if a_i < C
SVMs for non-linear regression: SVMs can also be used for non-linear regression. However, unlike MLPs and RBFs, the formulation does not follow directly from the classification case. Starting point: we have input data

X = {(x_1, d_1), ..., (x_N, d_N)}

where x_i is D-dimensional and d_i is a scalar. We want to find a robust function f(x) that has at most ε deviation from the targets d, while at the same time being as flat (in the regularisation sense of a smooth boundary) as possible.
Thus, setting f(x) = w^T φ(x) + b, the problem becomes: minimise

1/2 w^T w (for flatness -- think of the gradient between (0,0) and (1,1) if the weights are (1,1) vs (1000, 1000))

subject to

d_i - w^T φ(x_i) - b <= ε
w^T φ(x_i) + b - d_i <= ε
This formulation is called ε-insensitive regression, as it is equivalent to minimising the empirical risk (the amount you might be wrong) using an ε-insensitive loss function

L(f, d, x) = |f(x) - d| - ε   if |f(x) - d| >= ε
           = 0                otherwise
- Comparing the ε-insensitive loss function to the least squares loss function (used for MLPs/RBFNs):
- More robust (robust to small changes in the data/model)
- Less sensitive to outliers
- Non-continuous derivative
- The cost function is
- C Σ_i L(f, d_i, x_i)
- where C can be viewed as a regularisation parameter
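A minimal sketch of the ε-insensitive loss next to the squared loss (ε = 0.1 and the sample predictions are arbitrary); the linear growth outside the tube is what makes it less sensitive to outliers:

```python
import numpy as np

def eps_insensitive_loss(f_x, d, eps=0.1):
    # L = |f(x) - d| - eps when the deviation exceeds eps, and 0 inside the eps-tube
    dev = abs(f_x - d)
    return dev - eps if dev >= eps else 0.0

def squared_loss(f_x, d):
    return (f_x - d) ** 2

for f_x in (1.0, 1.05, 1.3, 3.0):   # target d = 1.0; the last prediction acts like an outlier
    print(f_x, eps_insensitive_loss(f_x, 1.0), squared_loss(f_x, 1.0))
```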
[Figure: ε-insensitive regression fits for ε = 0.1, 0.2 and 0.5, showing the original function O and the tube boundaries O + ε and O - ε; the function selected is the flattest]
We now introduce 2 slack variables, ξ_i and ξ_i', as in the case of nonlinearly separable data, and write

d_i - w^T φ(x_i) - b <= ε + ξ_i
w^T φ(x_i) + b - d_i <= ε + ξ_i'

where ξ_i, ξ_i' >= 0. Thus

C Σ_i L(f, d_i, x_i) = C Σ_i (ξ_i + ξ_i')

and the problem becomes to minimise

F(w, ξ, ξ') = 1/2 w^T w + C Σ_i (ξ_i + ξ_i')

subject to the two constraints above and ξ_i, ξ_i' >= 0
We now form the Lagrangian and find the dual. Note that this time there will be 2 sets of Lagrange multipliers, a_i and a_i', as there are 2 constraints. The dual to be maximised is

Q(a, a') = Σ_i d_i (a_i - a_i') - ε Σ_i (a_i + a_i') - 1/2 Σ_i Σ_j (a_i - a_i')(a_j - a_j') K(x_i, x_j)

where ε and C are free parameters that control the approximating function

f(x) = w^T φ(x) = Σ_i (a_i - a_i') K(x, x_i)
From the KKT conditions we now have

a_i (ε + ξ_i - d_i + w^T φ(x_i) + b) = 0
a_i' (ε + ξ_i' + d_i - w^T φ(x_i) - b) = 0

This means that the Lagrange multipliers will only be non-zero for points where |f(x_i) - d_i| >= ε, that is, only for points outside (or on) the ε-tube. Thus these points are the support vectors and we have a sparse expansion of w in terms of x
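A sketch of ε-insensitive regression, assuming scikit-learn and a synthetic noisy sine curve (all values illustrative): only points outside or touching the ε-tube become support vectors, so widening the tube gives a sparser expansion.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 80)
d = np.sin(x) + 0.1 * rng.standard_normal(80)   # noisy 1-D regression targets
X = x.reshape(-1, 1)

for eps in (0.02, 0.1, 0.2, 0.5):
    svr = SVR(kernel='rbf', C=10.0, epsilon=eps).fit(X, d)
    # Larger epsilon -> wider tube -> fewer support vectors in the expansion
    print(eps, len(svr.support_))
```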
[Figure: SVR fits for ε = 0.02, 0.1, 0.2 and 0.5, marking the data points and the support vectors; ε controls the number of SVs selected]
Only non-zero a's can contribute: the Lagrange multipliers act like forces on the regression. However, they can only be applied at points outside or touching the ε-tube

[Figure: regression curve with arrows marking the points where the forces act]
- One note of warning:
- Regression is much harder than classification, for 2 reasons
- 1. Regression is intrinsically more difficult than classification
- 2. ε and C must be tuned simultaneously
- Research issues:
- Incorporation of prior knowledge, e.g.
-   1. train a machine,
-   2. add in virtual support vectors, which incorporate known invariances of the SVs found in 1,
-   3. retrain
- Speeding up training time?
- Various techniques, mainly to deal with reducing the size of the data set. Chunking: use subsets of the data at a time and only keep the SVs. Also more sophisticated versions which use linear combinations of the training points as inputs
- Optimisation packages/techniques?
- Off-the-shelf ones are not brilliant (including the MATLAB one). Sequential Minimal Optimisation (SMO) is widely used. For details of that and others see:
- A. J. Smola and B. Schölkopf, "A Tutorial on Support Vector Regression", NeuroCOLT Technical Report NC-TR-98-030, Royal Holloway College, University of London, UK, 1998.
- Selection of ε and C