Title: Online Learning of Maximum Margin Classifiers
1. Online Learning of p-Norm Maximum Margin Classifiers with Bias
- Kohei HATANO
- Kyushu University
- (Joint work with K. Ishibashi and M. Takeda)
- COLT 2008
2. Plan of this talk
- Introduction
- Preliminaries
  - ROMMA
- Our result
  - Our new algorithm PUMMA
  - Our implicit reduction
- Experiments
3. Maximum Margin Classification
- SVMs [Boser et al. 92]: 2-norm margin
- Boosting [Freund & Schapire 97]: ∞-norm margin (approximately)
- Why maximum (or large) margin?
  - Good generalization [Schapire et al. 98; Shawe-Taylor et al. 98]
- Formulated as convex optimization problems (QP, LP).
[Figure: a maximum margin separating hyperplane.]
4. Scaling up Max. Margin Classification
- Decomposition methods (for SVMs)
  - Break the original QP into smaller QPs.
  - SMO [Platt 99], SVMlight [Joachims 99], LIBSVM [Chang & Lin 01]
  - State-of-the-art implementations.
- Online learning (our approach)
5. Online Learning
- Online learning algorithm:
  - For t = 1 to T:
    - Receive an instance x_t ∈ R^n.
    - Guess a label ŷ_t = sign(w_t·x_t + b_t).
    - Receive the label y_t ∈ {−1, +1}.
    - Update (w_{t+1}, b_{t+1}) = UPDATE_RULE(w_t, b_t, x_t, y_t).
  - end
- Advantages of online learning:
  - Simple & easy to implement.
  - Uses less memory.
  - Adaptive to changing concepts.
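The protocol above can be sketched in a few lines of Python. The slide leaves UPDATE_RULE abstract, so the classic perceptron rule below is only a hypothetical stand-in, and the separable data stream is made up for illustration:

```python
# Sketch of the generic online learning protocol from the slide.
import numpy as np

def online_learn(stream, n, update_rule):
    """Generic online protocol: predict, receive the true label, update."""
    w, b = np.zeros(n), 0.0                  # (w_t, b_t)
    mistakes = 0
    for x, y in stream:                      # receive an instance x_t in R^n
        y_hat = 1 if w @ x + b >= 0 else -1  # guess sign(w_t . x_t + b_t)
        if y_hat != y:                       # receive the label y_t in {-1,+1}
            w, b = update_rule(w, b, x, y)   # (w_{t+1}, b_{t+1}) = UPDATE_RULE(...)
            mistakes += 1
    return w, b, mistakes

def perceptron_update(w, b, x, y):
    # Classic perceptron rule, one possible UPDATE_RULE: add y_t * x_t.
    return w + y * x, b + y

# Tiny separable stream where the bias matters: label = sign(x[0] - 2),
# with points too close to the boundary filtered out to guarantee a margin.
rng = np.random.default_rng(0)
xs = rng.uniform(0, 4, size=(300, 2))
data = [(x, 1 if x[0] > 2 else -1) for x in xs if abs(x[0] - 2) > 0.2]
w, b, m = online_learn(data * 20, 2, perceptron_update)  # several passes
accuracy = np.mean([(1 if w @ x + b >= 0 else -1) == y for x, y in data])
```

The loop touches one example at a time and keeps only (w, b), which is where the "uses less memory" advantage comes from.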
6. Online Learning Algorithms for maximum margin classification
- Max Margin Perceptron [Kowalczyk 00]
- ROMMA [Li & Long 02]
- ALMA [Gentile 01]
- LASVM [Bordes et al. 05]
- MICRA [Tsampouka & Shawe-Taylor 07]
- Pegasos [Shalev-Shwartz et al. 07]
- Etc.
[Figure: a hyperplane with bias vs. a hyperplane without bias (through the origin).]
7. Typical Reduction to deal with bias [cf. Cristianini & Shawe-Taylor 00]
- Add an extra dimension corresponding to the bias: an instance x in the original space becomes (x, R) in the augmented space, and a hyperplane (w, b) with bias corresponds to the hyperplane (w, b/R) without bias.
- NOTE: the margin (over normalized instances) can shrink under this reduction.
[Figure: instances and hyperplanes in the original vs. augmented space.]
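A minimal sketch of this reduction; the constant R = 1 for the extra coordinate is a common choice, not necessarily the one in the slide's figure:

```python
# Sketch of the standard bias reduction: append a constant coordinate R to
# every instance so a homogeneous hyperplane in R^{n+1} encodes the bias.
import numpy as np

R = 1.0  # constant used for the extra coordinate (an assumed choice)

def augment(x):
    """Map x in R^n to (x, R) in R^{n+1}."""
    return np.append(x, R)

# A hyperplane with bias in the original space ...
w, b = np.array([2.0, -1.0]), 0.5
# ... corresponds to the homogeneous hyperplane (w, b/R) in augmented space.
w_aug = np.append(w, b / R)

x = np.array([0.3, 1.2])
assert np.isclose(w @ x + b, w_aug @ augment(x))  # same decision value
```

The decision values agree exactly, but the norms of the augmented instances and weights change, which is why the margin over normalized instances can shrink.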
8. Our New Online Learning Algorithm
- PUMMA (P-norm Utilizing Maximum Margin Algorithm)
- PUMMA can learn maximum margin classifiers with bias directly (without using the typical reduction!).
- Margin is defined as p-norm (p ≥ 2).
  - For p = 2, similar to Perceptron.
  - For p = O(ln n) [Gentile 03], similar to Winnow [Littlestone 89].
    - Fast when the target is sparse.
- Extended to the linearly inseparable case (omitted): soft margin with 2-norm slack variables.
9. Problem of finding the p-norm maximum margin hyperplane [cf. Mangasarian 99]
- Given a (linearly separable) sample S = ((x_1, y_1), ..., (x_T, y_T)):
- Goal: find an approximate solution (w, b) of
    min ‖w‖_q²  subject to  y_t(w·x_t + b) ≥ 1  (t = 1, ..., T),
  where q is the dual norm: 1/p + 1/q = 1 (e.g. p = 2 ⇒ q = 2; p = ∞ ⇒ q = 1).
- We want an online algorithm solving the problem with a small # of updates.
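The dual-norm pairing above can be checked numerically; the q-norm of the weights pairs with the p-norm of the instances because of Hoelder's inequality. The particular values of p below are illustrative:

```python
# Quick numeric check of the dual-norm pairing 1/p + 1/q = 1.
import numpy as np

def dual(p):
    """Dual exponent q with 1/p + 1/q = 1."""
    if p == np.inf:
        return 1.0
    if p == 1:
        return np.inf
    return p / (p - 1)

assert dual(2) == 2 and dual(np.inf) == 1  # the two cases named on the slide

# Hoelder's inequality: |w . x| <= ||w||_q * ||x||_p for every dual pair.
rng = np.random.default_rng(1)
w, x = rng.normal(size=5), rng.normal(size=5)
for p in (2.0, 4.0, np.inf):
    q = dual(p)
    assert abs(w @ x) <= np.linalg.norm(w, q) * np.linalg.norm(x, p) + 1e-12
```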
10. ROMMA (Relaxed Online Maximum Margin Algorithm) [Li & Long 02]
- Given S = ((x_1, y_1), ..., (x_{t−1}, y_{t−1})) and x_t:
- Predict ŷ_t = sign(w_t·x_t), and receive y_t.
- If y_t(w_t·x_t) < 1−δ (margin is insufficient), update:
    w_{t+1} = argmin ‖w‖₂²  subject to
      y_t(w·x_t) ≥ 1  (constraint over the last example which caused an update), and
      w·w_t ≥ ‖w_t‖₂²  (constraint over the last hyperplane).
- Otherwise, w_{t+1} = w_t.
- NOTE: the bias is fixed to 0.
11. ROMMA [Li & Long 02]
[Figure: in weight space, ROMMA's successive hyperplanes w_0, w_1, w_2, w_3, w_4 approach the feasible region of the SVM.]
12. Solution of ROMMA
- The solution of ROMMA is an additive update: w_{t+1} is a linear combination of w_t and x_t.
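Assuming both constraints are tight at the optimum, the minimizer can be written in closed form by solving the 2×2 linear system the two active constraints induce on the coefficients; the sketch below (standard Lagrangian algebra, not copied from the slides) makes the "additive" claim concrete:

```python
# Closed-form sketch of ROMMA's update: the min-norm w satisfying
# y*(w . x) = 1 and w . w_t = ||w_t||^2 is a combination c*w_t + d*x_t.
import numpy as np

def romma_update(w, x, y):
    A, B, a = w @ w, x @ x, w @ x   # ||w_t||^2, ||x_t||^2, w_t . x_t
    D = A * B - a ** 2              # > 0 unless x_t and w_t are parallel
    c = (A * B - y * a) / D
    d = A * (y - a) / D
    return c * w + d * x

rng = np.random.default_rng(2)
w_t, x_t, y_t = rng.normal(size=4), rng.normal(size=4), 1.0
w_next = romma_update(w_t, x_t, y_t)

assert np.isclose(y_t * (w_next @ x_t), 1.0)   # example constraint is tight
assert np.isclose(w_next @ w_t, w_t @ w_t)     # old-hyperplane constraint is tight
```

On rounds where the old-hyperplane constraint is not active the actual update differs, but the solution stays in span{w_t, x_t}, i.e. additive.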
13. PUMMA
- Given S = ((x_1, y_1), ..., (x_{t−1}, y_{t−1})) and x_t:
- Predict ŷ_t = sign(w_t·x_t + b_t), and receive y_t.
- If y_t(w_t·x_t + b_t) < 1−δ (margin is insufficient), update; otherwise, (w_{t+1}, b_{t+1}) = (w_t, b_t).
- The update uses the q-norm (1/p + 1/q = 1) and the link function of Grove et al. 97.
- x_t^pos, x_t^neg: the last positive and negative examples which incurred updates.
- The bias is optimized (not fixed to 0).
14. Solution of PUMMA
- The solution of PUMMA is found numerically.
- x_t^pos, x_t^neg: the last positive and negative examples which incurred updates.
- Observation: for p = 2, the solution is the same as that of ROMMA run on z_t = x_t^pos − x_t^neg.
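The link function mentioned above can be sketched as follows. The identities checked at the end, norm preservation and invertibility under swapping p and q, are standard properties of this map; the test vector and the dual pair (p, q) = (3, 1.5) are arbitrary choices:

```python
# Numeric sketch of the p-norm link function of Grove et al. 97:
#   f(w)_i = sign(w_i) * |w_i|^{q-1} / ||w||_q^{q-2},
# which maps the q-norm weight space to its dual p-norm space (1/p + 1/q = 1).
import numpy as np

def link(w, q):
    return np.sign(w) * np.abs(w) ** (q - 1) / np.linalg.norm(w, q) ** (q - 2)

rng = np.random.default_rng(3)
w = rng.normal(size=6)
p, q = 3.0, 1.5            # a dual pair: 1/3 + 1/1.5 = 1
v = link(w, q)

assert np.isclose(np.linalg.norm(v, p), np.linalg.norm(w, q))  # norm preserved
assert np.allclose(link(v, p), w)                              # inverse with p, q swapped
```

For q = 2 the link function is the identity, which is consistent with the observation that PUMMA with p = 2 reduces to ROMMA on the differences z_t.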
15. Our (implicit) reduction, which preserves the margin
[Figure: a hyperplane with bias over the original instances corresponds to a hyperplane without bias over pairs of positive and negative instances, with the same margin.]
16. Main Result
- Thm: Suppose that, given S = ((x_1, y_1), ..., (x_T, y_T)), there exists a linear classifier (u, b) s.t. y_t(u·x_t + b) ≥ 1 for t = 1, ..., T. Then
    (# of updates of PUMMA_p(δ)) ≤ (p−1)‖u‖_q² R² / δ².
- After (p−1)‖u‖_q² R² / δ² updates, PUMMA_p(δ) outputs a hypothesis with p-norm margin ≥ (1−δ)γ (γ: margin of (u, b)).
- The bound is similar to those of previous algorithms.
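Plugging illustrative numbers into the bound shows the trade-off behind the p = 2 ln n choice for sparse targets; the specific n, target u, and δ below are hypothetical, with instances assumed to lie in {−1, +1}^n as in the artificial-data experiment:

```python
# Evaluating the update bound (p-1) * ||u||_q^2 * R^2 / delta^2 for two
# choices of p, where R is the largest p-norm of an instance in {-1,+1}^n.
import numpy as np

def update_bound(p, u, R, delta):
    q = p / (p - 1)   # dual exponent: 1/p + 1/q = 1
    return (p - 1) * np.linalg.norm(u, q) ** 2 * R ** 2 / delta ** 2

n = 100
u = np.zeros(n); u[0] = 1.0          # a 1-sparse target (hypothetical)
R = lambda p: n ** (1.0 / p)         # p-norm of a {-1,+1}^n instance

b_2   = update_bound(2.0, u, R(2.0), 0.1)
b_log = update_bound(2 * np.log(n), u, R(2 * np.log(n)), 0.1)
assert b_log < b_2   # the sparse target favors p = 2 ln n (Winnow-like)
```

For p = 2 the bound is dominated by R² = n, while for p = 2 ln n the instance norm R² stays O(1) and only the (p−1) factor grows, which is why the p = O(ln n) setting is fast for sparse targets.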
17. Experiment over artificial data
- Each example (x, y):
  - x: an n(=100)-dimensional {−1, +1}-valued vector
  - y = f(x), where f is a fixed linear threshold function with bias b
- Generate 1000 examples randomly.
- 3 datasets: b = 1 (small), 9 (medium), 15 (large).
- Compare with ROMMA (p = 2) and ALMA (p = 2 ln n).
18. Results over Artificial Data
[Figure: margin vs. # of updates; left panel (p = 2): PUMMA vs. ROMMA; right panel (p = 2 ln n): PUMMA vs. ALMA.]
- NOTE 1: the margin is defined over the original space (w/o the reduction).
- NOTE 2: we omit the results for b = 9 for clarity.
19. Computation Time
[Figure: computation time (sec.) vs. bias (from large to small); left panel (p = 2): PUMMA vs. ROMMA; right panel (p = 2 ln n): PUMMA vs. ALMA.]
- For p = 2, PUMMA is faster than ROMMA. For p = 2 ln n, PUMMA is faster than ALMA even though PUMMA uses Newton's method.
20. Results over UCI Adult data

  dataset: adult (# of data: 32561)

  algorithm     sec.     margin rate (%)
  SVMlight      5893     100
  ROMMA (99%)   71296    99.03
  PUMMA (99%)   44480    99.14

- Fix p = 2.
- 2-norm soft margin formulation for linearly inseparable data.
- Run ROMMA and PUMMA until they achieve 99% of the maximum margin.
21. Results over MNIST data

  dataset: MNIST (# of data: —)

  algorithm     sec.      margin rate (%)
  SVMlight      401.36    100
  ROMMA (99%)   1715.57   93.5
  PUMMA (99%)   1971.30   99.2

- Fix p = 2.
- Use polynomial kernels.
- 2-norm soft margin formulation for linearly inseparable data.
- Run ROMMA and PUMMA until they achieve 99% of the maximum margin.
22. Summary
- PUMMA can learn p-norm maximum margin classifiers with bias directly.
- The # of updates is similar to those of previous algorithms.
- PUMMA achieves (1−δ) times the maximum p-norm margin.
- PUMMA outperforms other online algorithms when the underlying hyperplane has a large bias.
23. Future work
- Maximizing the ∞-norm margin directly.
- Tighter bounds on the # of updates:
  - In our experiments, PUMMA is faster especially when the bias is large (like Winnow).
  - Our current bound does not reflect this fact.