Title: Multiple Kernel Learning
1. Multiple Kernel Learning
Manik Varma, Microsoft Research India
2. Object Categorization
Is this a cat or a dog?
3. Object Detection
Where is the cat?
4. Predicting Facial Attractiveness
[Photos: Anandan, Kentaro, Kenta Rao?]
How hot are they?
5. Object Categorization Results: Caltech 101
- Adding the Gist kernel and some post-processing gives 98.2 ± 0.3 [Bosch et al., IJCV submitted]
6. Object Detection Results: PASCAL VOC 2007
7-11. MSRI Top 5 (revealed progressively over slides 7-11)
- Mr. MSRI: Venkatesh (8.44)
- 1st Runner Up: Vishnu (8.30)
- 2nd Runner Up: Satya (8.30)
- Number 4: Shanmuga (8.00)
- Number 5: Prasad (7.97)
12. Predicting Facial Attractiveness
[Photos with predicted scores: 7.92 and 7.63]
Failure mode?
13. SVM Classification
[Margin diagram]
- K(x_i, x_j) = φ(x_i)ᵀ φ(x_j)
- Margin = 2 / √(wᵀw)
- Separating hyperplane wᵀφ(x) + b = 0, with margin hyperplanes wᵀφ(x) + b = ±1
- Support vectors lie on the margin (ξ = 0); points with ξ < 1 violate the margin; a misclassified point has ξ > 1
14. The C-SVM Primal Formulation
- Minimise ½ wᵀw + C Σ_i ξ_i
- Subject to
  - y_i (wᵀφ(x_i) + b) ≥ 1 - ξ_i
  - ξ_i ≥ 0
- where
  - (x_i, y_i) is the i-th training point.
  - C is the misclassification penalty.
- Decision function: f(x) = sign(wᵀφ(x) + b)
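A minimal Python sketch of this formulation using scikit-learn's soft-margin SVC; the toy data, the RBF kernel and the value of C are illustrative assumptions, not settings from the talk.

# Train a soft-margin C-SVM and evaluate the decision function f(x).
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + 2, rng.randn(50, 2) - 2])   # two toy blobs
y = np.hstack([np.ones(50), -np.ones(50)])

clf = SVC(C=1.0, kernel="rbf", gamma=0.5)      # C is the misclassification penalty
clf.fit(X, y)

# f(x) = sign(w' phi(x) + b); decision_function returns the signed value.
print(np.sign(clf.decision_function(X[:5])))
print("support vectors:", clf.support_vectors_.shape[0])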
15. The C-SVM Dual Formulation
- Maximise 1ᵀα - ½ αᵀ Y K Y α
- Subject to
  - 1ᵀ Y α = 0
  - 0 ≤ α ≤ C
- where
  - α are the Lagrange multipliers corresponding to the support vector coefficients.
  - Y is a diagonal matrix such that Y_ii = y_i
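A sketch of the dual view: train on a precomputed Gram matrix and check the constraints 0 ≤ α ≤ C and 1ᵀYα = 0 numerically. The data and the RBF bandwidth are assumptions.

# Inspect the Lagrange multipliers of an SVM trained on a precomputed kernel.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(40, 3) + 1, rng.randn(40, 3) - 1])
y = np.hstack([np.ones(40), -np.ones(40)])

K = rbf_kernel(X, X, gamma=0.5)                 # Gram matrix K(x_i, x_j)
clf = SVC(C=10.0, kernel="precomputed").fit(K, y)

# dual_coef_ holds y_i * alpha_i for the support vectors, so 0 <= alpha_i <= C.
alpha = np.abs(clf.dual_coef_).ravel()
print("alpha in [0, C]:", alpha.min() >= 0, alpha.max() <= 10.0)
print("sum_i y_i alpha_i (should be ~0):", clf.dual_coef_.sum())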
16. Multiple Kernel Learning (MKL)
- Given multiple base kernels K_k, learn (see the sketch after this list)
  - K_opt = Σ_k d_k K_k
  - subject to regularization on d_k
- Various formulations
  - Kernel Target Alignment.
  - MKL via Semi-Definite Programming.
  - MKL Block l1 via Sequential Minimal Optimization.
  - MKL Block l1 via Semi-Infinite Linear Programming.
  - MKL via Projected Gradient Descent.
  - Formulations based on Boosting, Hyperkernels, etc.
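A sketch of the core idea: form K_opt = Σ_k d_k K_k from a few base kernels and train a standard SVM on the combination. The base kernels and the hand-set weights d are assumptions; MKL learns d from data.

# Combine base kernels with fixed weights and train on the resulting kernel.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel

rng = np.random.RandomState(0)
X = rng.randn(80, 5)
y = np.sign(X[:, 0] + 0.5 * X[:, 1] ** 2 - 0.3)

base_kernels = [linear_kernel(X, X),
                polynomial_kernel(X, X, degree=2),
                rbf_kernel(X, X, gamma=0.2)]
d = np.array([0.2, 0.3, 0.5])                   # kernel weights (to be learnt by MKL)

K_opt = sum(dk * Kk for dk, Kk in zip(d, base_kernels))
clf = SVC(C=1.0, kernel="precomputed").fit(K_opt, y)
print("training accuracy:", clf.score(K_opt, y))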
17. Kernel Target Alignment
- Kernel Target Alignment [Cristianini et al. 2001]
- Alignment
  - A(K1, K2) = ⟨K1, K2⟩ / √(⟨K1, K1⟩ ⟨K2, K2⟩)
  - where ⟨K1, K2⟩ = Σ_i Σ_j K1(x_i, x_j) K2(x_i, x_j)
- Ideal kernel: K_ideal = yyᵀ
- Alignment to the ideal kernel
  - A(K) = ⟨K, yyᵀ⟩ / (n √⟨K, K⟩)
- Optimal kernel
  - K_opt = Σ_k d_k K_k where K_k = v_k v_kᵀ (rank 1)
18. Kernel Target Alignment
- Optimal alignment
  - A(K_opt) = Σ_k d_k ⟨v_k, y⟩² / (n √(Σ_k d_k²))
- Assume Σ_k d_k² = 1.
- Lagrangian
  - L(λ, d) = Σ_k d_k ⟨v_k, y⟩² - λ(Σ_k d_k² - 1)
- Optimal weights: d_k ∝ ⟨v_k, y⟩² (see the sketch after this slide)
- Some generalisation bounds have been given, but the task is not directly related to classification.
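A rough sketch of alignment-based weighting: score each base kernel against the ideal kernel yyᵀ and normalise the scores into weights. This uses whole base kernels rather than the rank-1 eigenvector kernels of the derivation above, so the weights are alignment scores rather than the closed-form d_k; the data and kernels are assumptions.

# Kernel-target alignment A(K1, K2) and alignment-based kernel weights.
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel

def alignment(K1, K2):
    """A(K1, K2) = <K1, K2>_F / sqrt(<K1, K1>_F <K2, K2>_F)."""
    return np.sum(K1 * K2) / np.sqrt(np.sum(K1 * K1) * np.sum(K2 * K2))

rng = np.random.RandomState(0)
X = rng.randn(60, 4)
y = np.sign(X[:, 0])
K_ideal = np.outer(y, y)                        # ideal kernel yy^T

base_kernels = [linear_kernel(X, X),
                rbf_kernel(X, X, gamma=0.1),
                rbf_kernel(X, X, gamma=10.0)]
scores = np.array([alignment(K, K_ideal) for K in base_kernels])
d = np.clip(scores, 0.0, None)
d /= d.sum()                                    # normalise into kernel weights
print("alignment scores:", scores)
print("weights:", d)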
19. Multiple Kernel Learning: SDP
[Diagram over the kernel weights (d1, d2) for K = d1 K1 + d2 K2; labels: NP-hard region, Σ_k d_k = 1, K ⪰ 0 (SDP), brute-force search, Lanckriet et al.]
20. Multiple Kernel Learning: SDP
- Multiple Kernel Learning [Lanckriet et al. 2002]
- Minimise ½ wᵀw + C Σ_i ξ_i
- Subject to
  - y_i (wᵀφ_d(x_i) + b) ≥ 1 - ξ_i
  - ξ_i ≥ 0
  - K = Σ_k d_k K_k is positive semi-definite
  - trace(K) = constant
- Optimisation is an SDP (an SOCP if d_k ≥ 0).
- Other loss functions are possible (squared hinge, KTA).
21. Multiple Kernel Learning: Block l1
[Diagram over the kernel weights (d1, d2) for K = d1 K1 + d2 K2; labels: NP-hard region, Σ_k d_k = 1, K ⪰ 0 (SDP region), brute-force search, Bach et al. / Sonnenburg et al. / Rakotomamonjy et al.]
22. MKL Block l1: SMO
- MKL Block l1 via SMO [Bach et al. 2004]
- Min ½ (Σ_k d_k ||w_k||_2)² + C Σ_i ξ_i + ½ Σ_k a_k² ||w_k||_2²
- Subject to
  - y_i (Σ_k w_kᵀ φ_k(x_i) + b) ≥ 1 - ξ_i
  - ξ_i ≥ 0
- Moreau-Yosida regularisation ensures differentiability (for SMO).
- Block l1 regularisation ensures sparsity (of the kernel weights, and enables SMO).
- Optimisation is carried out via iterative SMO.
23. MKL Block l1: SILP
- MKL Block l1 via SILP [Sonnenburg et al. 2005]
- Min ½ (Σ_k d_k ||w_k||_2)² + C Σ_i ξ_i
- Subject to
  - y_i (Σ_k w_kᵀ φ_k(x_i) + b) ≥ 1 - ξ_i
  - ξ_i ≥ 0
  - Σ_k d_k = 1
- Iterative SILP-QP solution.
- Solves a 10-million-point problem with 20 kernels.
- Generalises to regression, novelty detection, etc.
24. Other Formulations
- Hyperkernels [Ong and Smola 2002]
  - Learn a kernel per training point (not per class).
  - SDP formulation improved to an SOCP [Tsang and Kwok 2006].
- Boosting
  - Exp/Log loss over pairs of distances [Crammer et al. 2002]
  - LPBoost [Bi et al. 2004]
  - KernelBoost [Hertz et al. 2006]
- Multi-class MKL [Zien and Ong 2007]
25. Multiple Kernel Learning [Varma and Ray 2007]
[Diagram over the kernel weights (d1, d2) for K = d1 K1 + d2 K2; labels: NP-hard region, d ≥ 0 (SOCP region), K ⪰ 0 (SDP region), brute-force search]
26. Our Primal Formulation
- Minimise ½ wᵀw + C Σ_i ξ_i + Σ_k σ_k d_k
- Subject to
  - y_i (wᵀφ_d(x_i) + b) ≥ 1 - ξ_i
  - ξ_i ≥ 0
  - d_k ≥ 0
  - K(x_i, x_j) = Σ_k d_k K_k(x_i, x_j)
- Very efficient gradient descent based solution.
- Similar to Rakotomamonjy et al. 2007, but more accurate as our search space is larger.
27. Final Algorithm
- Initialise d^0 randomly.
- Repeat until convergence
  - Form K(x, y) = Σ_k d_k^n K_k(x, y)
  - Use any SVM solver to solve the standard SVM problem with kernel K and obtain α.
  - Update d_k^(n+1) = d_k^n - η_n (σ_k - ½ αᵀ Y K_k Y α)
  - Project d^(n+1) back onto the feasible set if it does not satisfy the constraints d^(n+1) ≥ 0
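A sketch of this alternating loop, assuming the sum kernel K(d) = Σ_k d_k K_k and the l1 regulariser Σ_k σ_k d_k from the primal above, with scikit-learn's SVC as the off-the-shelf inner solver; the step-size schedule, iteration count and toy data are placeholders.

# Alternate between solving a standard SVM for alpha and a projected gradient
# step on the kernel weights d, as in the algorithm above.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel

def mkl_gradient_descent(base_kernels, y, C=1.0, sigma=1.0, eta=0.05, n_iter=50):
    d = np.ones(len(base_kernels)) / len(base_kernels)      # initial weights
    for n in range(n_iter):
        K = sum(dk * Kk for dk, Kk in zip(d, base_kernels))
        svm = SVC(C=C, kernel="precomputed").fit(K, y)       # standard SVM solve
        a = np.zeros(len(y))                                 # a_i = y_i * alpha_i
        a[svm.support_] = svm.dual_coef_.ravel()
        # Gradient w.r.t. d_k is sigma_k - 0.5 * a' K_k a; step, then project onto d >= 0.
        grad = np.array([sigma - 0.5 * a @ Kk @ a for Kk in base_kernels])
        d = np.clip(d - (eta / np.sqrt(n + 1)) * grad, 0.0, None)
        if not d.any():                                      # all weights pruned
            break
    return d

rng = np.random.RandomState(0)
X = rng.randn(80, 4)
y = np.sign(X[:, 0] * X[:, 1])
kernels = [linear_kernel(X, X), rbf_kernel(X, X, gamma=0.5), rbf_kernel(X, X, gamma=5.0)]
print("learnt kernel weights:", mkl_gradient_descent(kernels, y))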
28. Kernel Generalizations
- The learnt kernel can now have any functional form as long as
  - ∇_d K(d) exists and is continuous.
  - K(d) is strictly positive definite for feasible d.
- For example, K(d) = Σ_k d_k0 Π_l exp(-d_kl χ_l²)
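A sketch of one such generalised kernel, a single product of feature-wise exponentials, together with its gradient in closed form (which is all the gradient descent step needs). The plain squared-difference distances used here stand in for whatever per-feature distances (e.g. χ²) the application would use; they are an assumption for illustration.

# A product-form kernel K(d) = prod_l exp(-d_l * D_l) and its derivative dK/dd_l.
import numpy as np

def feature_distances(X):
    """D[l, i, j] = (x_il - x_jl)^2, one squared-distance matrix per feature l."""
    return np.stack([(X[:, l, None] - X[None, :, l]) ** 2 for l in range(X.shape[1])])

def product_kernel(D, d):
    """K(d) = prod_l exp(-d_l * D_l) = exp(-sum_l d_l * D_l)."""
    return np.exp(-np.tensordot(d, D, axes=1))

def product_kernel_grad(D, d):
    """dK/dd_l = -D_l * K(d); exists and is continuous in d, as required."""
    K = product_kernel(D, d)
    return np.array([-D_l * K for D_l in D])

X = np.random.RandomState(0).randn(10, 3)
D = feature_distances(X)
d = np.array([0.5, 1.0, 0.1])
print(product_kernel(D, d).shape, product_kernel_grad(D, d).shape)   # (10, 10) (3, 10, 10)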
29. Regularization Generalizations
- Any regularizer can be used as long as it has a continuous first derivative with respect to d.
- We can now put Gaussian rather than Laplacian priors on the kernel weights.
- We can, once again, have negative weights.
30. Loss Function Generalizations
- The loss function can be generalized to handle
- Regression.
- Novelty detection (1 class SVM).
- Multi-class classification.
- Ordinal Regression.
- Ranking.
31. Penalties
- Our formulation is no longer convex.
- Somehow, this seems not to make much of a difference.
- Furthermore, early termination can sometimes improve results.
32. Conclusions
- Gradient descent optimization of MKL
  - Is efficient and can scale to large problems.
  - Can be implemented using off-the-shelf SVM packages.
  - Allows generalization of the loss function to ranking, regression, novelty detection, etc.
  - Allows learning of more general kernel combinations than sums of base kernels.
  - Allows other priors on the kernel weights than the standard zero-mean Laplacian prior.
33. Regression on Hot or Not Training Data
[Photos with attractiveness scores: 7.3, 6.5, 9.4, 7.5, 7.7, 7.7, 6.5, 6.9, 7.4, 8.7]