1
Topics in Machine Learning
  • 5th lecture
  • Support Vector Machines

2
Motivation
  • We first consider the problem of learning a binary
    classification, i.e. a function f: R^n → {-1, 1},
    based on examples (x_i, y_i)

[Diagram: input x → fixed nonlinear mapping Φ into a
high-dimensional vector space → Φ(x) → perceptron → output y]
perceptron alone → only linearly separable functions
→ for an appropriately nonlinear Φ and high dimensionality,
the problem likely becomes linearly separable
3
Motivation
  • Example where a nonlinear Φ helps

XOR is not linearly separable
nonlinear mapping Φ(a1, a2) = (a1, a2, a1·a2)
perceptron solution: w = (-2, -2, 4), b = -1
(checked numerically below)
One can show: d points in R^n, mapped to the monomials of
degree d, form linearly independent vectors (in particular,
they are linearly separable for arbitrary outputs)
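A minimal numerical check of this example (a sketch, assuming inputs encoded as ±1 and the decision rule sign(w^T Φ(a) − b) used elsewhere on these slides):

```python
import numpy as np

# Check of the slide's XOR example: feature map, weights and bias are
# taken from the slide, the decision is the sign of w^T Phi(a) - b.
def phi(a1, a2):
    # nonlinear feature map from the slide: (a1, a2) -> (a1, a2, a1*a2)
    return np.array([a1, a2, a1 * a2])

w = np.array([-2.0, -2.0, 4.0])   # perceptron solution from the slide
b = -1.0

for a1, a2 in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    activation = w @ phi(a1, a2) - b
    print(f"input ({a1:+d}, {a2:+d}) -> activation {activation:+.0f}")

# Output: +9, -3, -3, +1. The two points with equal inputs land on one side
# of the hyperplane and the two points with different inputs on the other,
# so the XOR-type labelling, which is not linearly separable in the original
# 2D space, becomes linearly separable after the mapping Phi.
```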
4
Motivation
[Diagram as before: x → Φ(x) → perceptron → y]
high-dimensional feature space → inefficient computation of Φ
and of the dot products in the perceptron
high-dimensional feature space → curse of dimensionality,
i.e. generalization to new data is probably very bad
5
SVM-trick 1: maximum margin
  • Assume you are already in the feature space, with
    vectors (x_i, y_i)

a linear separator with maximum distance to the data points
... maximum possible noise tolerance
some linear separator
... not very robust with respect to noise → bad generalization
the VC dimension of the SVM is inversely related to the margin
→ a dimensionality-independent but margin-dependent
generalization bound → details in the session on learnability
6
SVM-trick 1: maximum margin
  • Avoid the curse of dimensionality by choosing the
    solution with maximum margin, i.e. maximum distance to
    the data points

Training problem: find (w, b) such that all points are
classified correctly and the margin is maximized.
In a formula: (x_i, y_i) is classified correctly for all i
⇔ w^T x_i − b > 0 if y_i = 1 and w^T x_i − b < 0 if y_i = −1, for all i
w.l.o.g. w^T x_i − b ≥ 1 if y_i = 1 and w^T x_i − b ≤ −1 if y_i = −1, for all i
⇔ (w^T x_i − b) · y_i ≥ 1 for all i ← constraint: all points are correct
7
SVM-trick 1: maximum margin
  • What's the margin?

margin = ||x_i − x|| for the point x with w^T x − b = 0 and x = x_i + t·w
hence x = x_i + ((b − w^T x_i) / (w^T w)) · w
hence margin = |b − w^T x_i| / (w^T w) · ||w|| = |b − w^T x_i| / ||w||
[Figure: hyperplane with normal vector w, data point x_i and its
projection x onto the hyperplane; annotations: the margin is the
quantity to optimize, the activation (w^T x_i − b) · y_i is at least 1]
maximize margin ⇒ minimize ||w|| ⇒ minimize ||w||²/2
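A quick sanity check of the distance formula (a sketch; the hyperplane and the point below are made-up illustration values, not from the slides):

```python
import numpy as np

# Sanity check of margin = |b - w^T x_i| / ||w||: the foot point
# x = x_i + ((b - w^T x_i) / (w^T w)) * w lies on the hyperplane w^T x = b,
# and its distance to x_i equals the closed-form expression.
w = np.array([3.0, 4.0])
b = 2.0
x_i = np.array([5.0, -1.0])

t = (b - w @ x_i) / (w @ w)
x = x_i + t * w                                # projection of x_i onto the hyperplane

print(np.isclose(w @ x, b))                    # True: x lies on the hyperplane
print(np.linalg.norm(x - x_i))                 # 1.8
print(abs(b - w @ x_i) / np.linalg.norm(w))    # 1.8 as well
```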
8
SVM-trick 1: maximum margin
  • Training problem: find (w, b) such that all points are
    classified correctly and the margin is maximized

Nice problem: quadratic objective + linear constraints, so any
local optimum is also a global one
Equivalent (primal) training problem: minimize ||w||²/2 such that
(w^T x_i − b) · y_i ≥ 1 for all i
→ linear constraints
→ convex function to be minimized (f is convex iff the line
segment through f(a) and f(b), for any two points a and b, lies
above the function values in between)
9
General optimization
Primal training problem: minimize f(w,b) = ||w||²/2 such that
g_i(w,b) = 1 − (w^T x_i − b) · y_i ≤ 0 for all i

Primal-dual formulation: maximize_α Q(α) where
Q(α) = min_(w,b) L(w,b,α) = min_(w,b) [ f(w,b) + Σ_i α_i g_i(w,b) ] with α_i ≥ 0
L is the Lagrange function
For feasible (w,b) of the primal problem and feasible α of the
primal-dual formulation we find
Q(α) = min_(w,b) L(w,b,α) ≤ L(w,b,α) = f(w,b) + Σ_i α_i g_i(w,b) ≤ f(w,b)
10
General optimization
Primal training problem: minimize f(w,b) = ||w||²/2 such that
g_i(w,b) = 1 − (w^T x_i − b) · y_i ≤ 0 for all i

Primal-dual formulation: maximize_α Q(α) where
Q(α) = min_(w,b) L(w,b,α) = min_(w,b) [ f(w,b) + Σ_i α_i g_i(w,b) ] with α_i ≥ 0
L is the Lagrange function
For convex f and linear g we have equality:
max_α Q(α) = min_(w,b) f(w,b); thus the optima fulfill
Q(α) = f(w,b) + Σ_i α_i g_i(w,b) = f(w,b), in particular the
Kuhn-Tucker conditions α_i g_i(w,b) = 0 for all i
→ look at a textbook on convex optimization!
11
SVM-trick 1: maximum margin
Primal-dual formulation: maximize_α Q(α) where
Q(α) = min_(w,b) L(w,b,α) = min_(w,b) [ f(w,b) + Σ_i α_i g_i(w,b) ]
with α_i ≥ 0 and Kuhn-Tucker conditions α_i g_i(w,b) = 0 for all i
Setting the derivative w.r.t. w and b to 0 gives
w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0
hence L(w,b,α) = −Σ_ij α_i α_j y_i y_j x_i^T x_j / 2 + Σ_i α_i
→ blackboard (the substitution is sketched below)
Dual formulation: maximize_α −Σ_ij α_i α_j y_i y_j x_i^T x_j / 2 + Σ_i α_i
with α_i ≥ 0, Σ_i α_i y_i = 0
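The blackboard step can be reconstructed as follows (a standard substitution, not part of the original slides): plugging w = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0 into the Lagrange function gives the stated dual objective.

```latex
\begin{aligned}
L(w,b,\alpha) &= \tfrac{1}{2}\,w^T w + \sum_i \alpha_i\bigl(1 - (w^T x_i - b)\,y_i\bigr) \\
              &= \tfrac{1}{2}\,w^T w - w^T \sum_i \alpha_i y_i x_i
                 + b \underbrace{\sum_i \alpha_i y_i}_{=0} + \sum_i \alpha_i \\
              &= \tfrac{1}{2}\sum_{i,j} \alpha_i\alpha_j y_i y_j x_i^T x_j
                 - \sum_{i,j} \alpha_i\alpha_j y_i y_j x_i^T x_j + \sum_i \alpha_i \\
              &= -\tfrac{1}{2}\sum_{i,j} \alpha_i\alpha_j y_i y_j x_i^T x_j + \sum_i \alpha_i .
\end{aligned}
```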
12
SVM-trick 1: maximum margin
Final dual formulation of training: maximize_α
−Σ_ij α_i α_j y_i y_j x_i^T x_j / 2 + Σ_i α_i with α_i ≥ 0, Σ_i α_i y_i = 0
This problem can be solved efficiently with different methods:
  • simplistic and not very fast: drop the bias (i.e. the
    constraint Σ_i α_i y_i = 0) and perform a gradient ascent on Q
    with the barriers α_i ≥ 0 (a sketch follows after this list)
  • good and also appropriate for large data sets: sequential
    minimal optimization → seminar paper
  • several interior point methods; intuition: substitute the
    constraints by a continuous barrier (which is made more and
    more precise while training) and solve the unconstrained
    problem with some fast unconstrained optimization algorithm
    such as conjugate gradients
  • fast and nice: the kernel adatron → later slide
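A minimal sketch of the "simplistic" variant (my own illustration, not the course's code): drop the bias so the equality constraint disappears, and do projected gradient ascent on Q(α) = Σ_i α_i − (1/2) Σ_ij α_i α_j y_i y_j x_i^T x_j, clipping α at 0.

```python
import numpy as np

def dual_gradient_ascent(X, y, lr=0.01, n_iter=1000):
    """Projected gradient ascent on the bias-free dual
    Q(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j x_i^T x_j,
    keeping alpha_i >= 0 by clipping after every step."""
    n = X.shape[0]
    G = (y[:, None] * X) @ (y[:, None] * X).T   # G_ij = y_i y_j x_i^T x_j
    alpha = np.zeros(n)
    for _ in range(n_iter):
        grad = 1.0 - G @ alpha                  # dQ/dalpha_i
        alpha = np.maximum(alpha + lr * grad, 0.0)
    w = (alpha * y) @ X                         # w = sum_i alpha_i y_i x_i
    return alpha, w

# Tiny usage example with two separable points (illustration only):
X = np.array([[2.0, 0.0], [-2.0, 0.0]])
y = np.array([1.0, -1.0])
alpha, w = dual_gradient_ascent(X, y)
print(alpha, w)   # decision here is sign(w^T x), since the bias was dropped
```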
13
SVM-trick 1: maximum margin
  • We have obtained a feasible dual problem which gives us the
    optimum Lagrange multipliers α_i.
  • But we are NOT interested in the dual variables; what's the
    classifier?
  • Remember w = Σ_i α_i y_i x_i → we can recover the weight
    vector from the α_i
  • KKT conditions: α_i g_i(w,b) = α_i (1 − (w^T x_i − b) y_i) = 0
    ⇒ b = w^T x_i − y_i for any i with α_i ≠ 0
  • Thus:

Solve the dual problem in the Lagrange parameters α_i, compute
w = Σ_i α_i y_i x_i and b = w^T x_i − y_i for some i with α_i ≠ 0
(a sketch of this recovery follows below).
Interesting: KKT ⇒ α_i = 0 or (w^T x_i − b) y_i = 1 → support vectors
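A small illustration of this recovery step (assuming the dual has already been solved by one of the methods above; the helper names are mine):

```python
import numpy as np

def recover_classifier(alpha, X, y, tol=1e-8):
    """Recover (w, b) from the dual solution:
    w = sum_i alpha_i y_i x_i, and b = w^T x_i - y_i for any support vector
    (a point with alpha_i != 0). Averaging over all support vectors is a
    common, numerically more robust variant."""
    w = (alpha * y) @ X
    sv = alpha > tol                  # indices of the support vectors
    b = np.mean(X[sv] @ w - y[sv])
    return w, b

def predict(w, b, X_new):
    # decision rule used on these slides: sign(w^T x - b)
    return np.sign(X_new @ w - b)
```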
14
SVM-trick 1: maximum margin
Support vectors (SVs): the points with minimum distance from the
hyperplane. They determine the classifier; the solution is sparse
and formulated only in terms of the SVs.
→ example: blackboard
15
SVM-trick 2: kernel functions
  • Φ(x) instead of x should be used, i.e.:
  • Note: we need only the dot product Φ(x)^T Φ(y) in the
    feature space!

Final dual formulation of training: maximize_α
−Σ_ij α_i α_j y_i y_j Φ(x_i)^T Φ(x_j) / 2 + Σ_i α_i with α_i ≥ 0, Σ_i α_i y_i = 0
classifier: w = Σ_i α_i y_i Φ(x_i) and
b = w^T Φ(x_j) − y_j = Σ_i α_i y_i Φ(x_i)^T Φ(x_j) − y_j for a support vector x_j,
i.e. x ↦ w^T Φ(x) − b = Σ_i α_i y_i Φ(x_i)^T Φ(x) − b
16
SVM-trick 2: kernel functions
A kernel k: R^n × R^n → R is a function such that some (possibly
high-dimensional) Hilbert space X and a function Φ: R^n → X exist
with k(x,y) = Φ(x)^T Φ(y), i.e. a kernel is a potential shortcut
for computing the dot product efficiently even in high-dimensional
spaces.
  • Example:
  • the polynomial kernel k(x,y) = (x^T y + 1)^d corresponds to an
    X whose dimensionality increases exponentially with d
  • ((x1,x2)^T (y1,y2) + 1)² = (x1·y1 + x2·y2 + 1)²
    = x1²y1² + x2²y2² + 2·x1y1·x2y2 + 2·x1y1 + 2·x2y2 + 1
  • = (x1², x2², √2·x1x2, √2·x1, √2·x2, 1)^T
      (y1², y2², √2·y1y2, √2·y1, √2·y2, 1) = Φ(x)^T Φ(y)
    (this identity is checked numerically below)
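A numerical check of this identity (a small sketch; the helper names and random test points are mine):

```python
import numpy as np

# Check (x^T y + 1)^2 == Phi(x)^T Phi(y) for the explicit 6-dimensional
# feature map given on the slide.
def phi(v):
    v1, v2 = v
    s = np.sqrt(2.0)
    return np.array([v1**2, v2**2, s*v1*v2, s*v1, s*v2, 1.0])

rng = np.random.default_rng(0)
for _ in range(5):
    x, y = rng.normal(size=2), rng.normal(size=2)
    kernel_value = (x @ y + 1.0) ** 2
    explicit_value = phi(x) @ phi(y)
    print(np.isclose(kernel_value, explicit_value))   # True every time
```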

17
SVM-trick 2: kernel functions
  • More kernels:
  • polynomial kernel
  • RBF kernel k(x,y) = exp(−||x − y||² / (2σ²)), which corresponds
    to an infinite-dimensional feature space X
  • sigmoidal kernel k(x,y) = tanh(κ·x^T y + b) for specific
    choices of κ, b
  • recently, several kernels designed specifically for the given
    data, e.g. string kernel, convolutional kernel, Fisher kernel, ...

→ seminar paper
Mercer condition: a symmetric and continuous function
k: R^n × R^n → R is a kernel if for all functions g: R^n → R with
∫ g(x)² dx < ∞ the condition ∫∫ k(x,y) g(x) g(y) dx dy ≥ 0 holds.
A possibility to design kernels without designing Φ (a finite-sample
sanity check is sketched below).
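The Mercer condition cannot be verified exhaustively in code, but its finite-sample analogue, positive semi-definiteness of the Gram matrix on any point set, can be checked empirically; a small sketch for the RBF kernel (helper names and the random sample are mine):

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    # RBF kernel from the slide: k(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

# Finite-sample analogue of the Mercer condition: the Gram matrix
# K_ij = k(x_i, x_j) on any point set should be positive semi-definite.
rng = np.random.default_rng(1)
points = rng.normal(size=(20, 3))
K = np.array([[rbf_kernel(p, q) for q in points] for p in points])
eigenvalues = np.linalg.eigvalsh(K)
print(eigenvalues.min() >= -1e-10)   # True: no significantly negative eigenvalue
```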
18
SVM - final basic algorithm
Final dual kernelized version: maximize_α
−Σ_ij α_i α_j y_i y_j k(x_i, x_j) / 2 + Σ_i α_i with α_i ≥ 0, Σ_i α_i y_i = 0
Support vectors: the points with α_i ≠ 0
Classifier: determine b = Σ_i α_i y_i k(x_i, x_j) − y_j for a
support vector x_j; then x ↦ Σ_i α_i y_i k(x_i, x) − b
(a direct transcription follows below)
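A direct transcription of this classifier (a sketch, assuming the dual variables α are already available from some solver; the function names are mine):

```python
import numpy as np

def build_classifier(alpha, X, y, kernel):
    """Kernelized classifier from the slide:
    b = sum_i alpha_i y_i k(x_i, x_j) - y_j for some support vector x_j,
    decision: x -> sign(sum_i alpha_i y_i k(x_i, x) - b)."""
    j = int(np.argmax(alpha))          # any index j with alpha_j != 0
    b = sum(a * yi * kernel(xi, X[j]) for a, yi, xi in zip(alpha, y, X)) - y[j]

    def decide(x):
        return np.sign(sum(a * yi * kernel(xi, x)
                           for a, yi, xi in zip(alpha, y, X)) - b)

    return decide
```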
19
SVM: kernel adatron
Reasonably fast (exponential convergence rate) and very nice:
the kernel adatron
initialize α_i = 1
repeat
  compute z_i = Σ_j α_j y_j k(x_i, x_j)
  compute γ_i = y_i (z_i − b)      ← should be positive, gives the margin
  compute Δα_i = η (1 − γ_i)       ← negative for good points
  if α_i + Δα_i < 0 set α_i = 0, else set α_i = α_i + Δα_i
  compute b = (min z_i⁺ + max z_i⁻) / 2
      ← bias as compromise between the most critical pos/neg point
  stop if all points are correct!
Thereby, η is a small learning rate and z_i⁺ / z_i⁻ refer to the
positive / negative examples. (A runnable sketch follows below.)
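A runnable sketch of this pseudocode (my own transcription, assuming one full sweep over all points per iteration and the initial bias set to 0; it is not the course's reference implementation):

```python
import numpy as np

def kernel_adatron(X, y, kernel, eta=0.01, max_iter=1000):
    """Kernel adatron as sketched on the slide (transcribed, not verbatim)."""
    n = len(y)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])  # Gram matrix
    alpha = np.ones(n)                    # initialize alpha_i = 1
    b = 0.0                               # initial bias (assumption)
    for _ in range(max_iter):
        z = K @ (alpha * y)               # z_i = sum_j alpha_j y_j k(x_i, x_j)
        gamma = y * (z - b)               # should become >= 1 for all points
        alpha = np.maximum(alpha + eta * (1.0 - gamma), 0.0)  # clip at 0
        z = K @ (alpha * y)
        b = (z[y > 0].min() + z[y < 0].max()) / 2.0  # bias between critical pos/neg point
        if np.all(y * (z - b) >= 1.0 - 1e-6):        # stop if all points are correct
            break
    return alpha, b
```

Predictions are then made as on the previous slide, x ↦ sign(Σ_i α_i y_i k(x_i, x) − b).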
20
SVM: allow errors
For simplicity, non-kernelized versions in the following. A
feasible solution with (w^T x_i − b) y_i ≥ 1 for all i might not
exist. Observation: for these data points, α_i explodes. Solution:
allow errors by introducing slack variables.
Primal training problem: minimize ||w||²/2 + C Σ_i ξ_i such that
(w^T x_i − b) y_i ≥ 1 − ξ_i and ξ_i ≥ 0 for all i
A positive ξ_i indicates an error or a too small margin; C > 0
controls the number of allowed errors.
Dual problem: maximize_α −Σ_ij α_i α_j y_i y_j x_i^T x_j / 2 + Σ_i α_i
with C ≥ α_i ≥ 0, Σ_i α_i y_i = 0
→ the Lagrange multipliers can no longer explode!
(the corresponding one-line change to the adatron sketch is shown below)
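In the adatron-style sketch above, one straightforward way to respect the box constraint 0 ≤ α_i ≤ C is to clip the update on both sides; the slides do not spell out a soft-margin adatron, so this is an assumption of mine, and the value of C below is purely illustrative.

```python
import numpy as np

def soft_margin_update(alpha, gamma, eta=0.01, C=10.0):
    """Soft-margin variant of the alpha update from the adatron sketch above:
    the box constraint 0 <= alpha_i <= C replaces one-sided clipping at 0,
    so the multipliers can no longer explode. (C = 10.0 is an illustration
    value; in practice it is chosen by model selection.)"""
    return np.clip(alpha + eta * (1.0 - gamma), 0.0, C)
```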
21
SVM: regression
Outputs y_i are real numbers. Approximation function:
x ↦ w^T x − b (not yet kernelized). Errors are measured via an ε-tube.
Primal training problem: minimize ||w||²/2 + C Σ_i (ξ_i + ξ_i*)
such that
(w^T x_i − b) − y_i ≤ ε + ξ_i and ξ_i ≥ 0 for all i
y_i − (w^T x_i − b) ≤ ε + ξ_i* and ξ_i* ≥ 0 for all i
22
SVM: regression
Dual training problem for regression: maximize
−Σ_ij (α_i − α_i*)(α_j − α_j*) x_i^T x_j / 2 − ε Σ_i (α_i + α_i*)
+ Σ_i y_i (α_i − α_i*) such that Σ_i (α_i − α_i*) = 0, 0 ≤ α_i, α_i* ≤ C
Regression function: x ↦ Σ_i (α_i − α_i*) x_i^T x − b, where b is
determined such that the support vectors are mapped onto the ε-tube
boundary (a sketch of the predictor follows below).
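A direct transcription of this regression function (a sketch, assuming the dual variables have been obtained from some QP solver; names are mine):

```python
import numpy as np

def svr_predict(alpha, alpha_star, X, b, x_new):
    """Support vector regression predictor from the slide:
    x -> sum_i (alpha_i - alpha_i^*) x_i^T x - b (linear, not yet kernelized)."""
    w = (alpha - alpha_star) @ X      # w = sum_i (alpha_i - alpha_i^*) x_i
    return x_new @ w - b
```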
23
SVM: further issues
  • alternative loss functions → seminar
  • more than one class by tricky combinations of binary
    classifiers → seminar
  • kernelization of other methods like PCA
  • SVM for unsupervised learning tasks like novelty detection → seminar
  • solutions which do not scale with the number of parameters
  • input pruning
  • training algorithms and large-scale SVMs
  • adaptive kernels
  • ...
see also www.kernel-machines.org