Title: Multiple Kernel Learning
1. Multiple Kernel Learning
Manik Varma, Microsoft Research India
2. Object Categorization
Is this a cat or a dog?
3. Object Detection
Where is the cat?
4. Predicting Facial Attractiveness
[Photos: Anandan, Kentaro, Kenta Rao?]
How hot are they?
5. Object Categorization Results: Caltech 101
- Adding the Gist kernel and some post-processing gives 98.2 ± 0.3 [Bosch et al., IJCV submitted]
6. Object Detection Results: PASCAL VOC 2007
7-11. MSRI Top 5 (revealed progressively over slides 7-11)
- Mr. MSRI: Venkatesh (8.44)
- 1st Runner Up: Vishnu (8.30)
- 2nd Runner Up: Satya (8.30)
- Number 4: Shanmuga (8.00)
- Number 5: Prasad (7.97)
12. Predicting Facial Attractiveness
[Photos with predicted scores: 7.92 and 7.63]
Failure mode?
13. SVM Classification
[Margin diagram]
- K(x_i, x_j) = φ(x_i)ᵀ φ(x_j)
- Margin = 2 / √(wᵀw)
- Separating hyperplane wᵀφ(x) + b = 0, with margin hyperplanes wᵀφ(x) + b = ±1
- Support vectors lie on the margin (ξ = 0); points with ξ < 1 violate the margin; a misclassified point has ξ > 1
14. The C-SVM Primal Formulation
- Minimise ½ wᵀw + C Σ_i ξ_i
- Subject to
  - y_i (wᵀφ(x_i) + b) ≥ 1 - ξ_i
  - ξ_i ≥ 0
- where
  - (x_i, y_i) is the i-th training point.
  - C is the misclassification penalty.
- Decision function: f(x) = sign(wᵀφ(x) + b)
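A minimal Python sketch of this formulation using scikit-learn's soft-margin SVC; the toy data, the RBF kernel and the value of C are illustrative assumptions, not settings from the talk.

# Train a soft-margin C-SVM and evaluate the decision function f(x).
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2) + 2, rng.randn(50, 2) - 2])   # two toy blobs
y = np.hstack([np.ones(50), -np.ones(50)])

clf = SVC(C=1.0, kernel="rbf", gamma=0.5)      # C is the misclassification penalty
clf.fit(X, y)

# f(x) = sign(w' phi(x) + b); decision_function returns the signed value.
print(np.sign(clf.decision_function(X[:5])))
print("support vectors:", clf.support_vectors_.shape[0])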
15. The C-SVM Dual Formulation
- Maximise 1ᵀα - ½ αᵀ Y K Y α
- Subject to
  - 1ᵀ Y α = 0
  - 0 ≤ α ≤ C
- where
  - α are the Lagrange multipliers corresponding to the support vector coefficients.
  - Y is a diagonal matrix such that Y_ii = y_i
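A sketch of the dual view: train on a precomputed Gram matrix and check the constraints 0 ≤ α ≤ C and 1ᵀYα = 0 numerically. The data and the RBF bandwidth are assumptions.

# Inspect the Lagrange multipliers of an SVM trained on a precomputed kernel.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(40, 3) + 1, rng.randn(40, 3) - 1])
y = np.hstack([np.ones(40), -np.ones(40)])

K = rbf_kernel(X, X, gamma=0.5)                 # Gram matrix K(x_i, x_j)
clf = SVC(C=10.0, kernel="precomputed").fit(K, y)

# dual_coef_ holds y_i * alpha_i for the support vectors, so 0 <= alpha_i <= C.
alpha = np.abs(clf.dual_coef_).ravel()
print("alpha in [0, C]:", alpha.min() >= 0, alpha.max() <= 10.0)
print("sum_i y_i alpha_i (should be ~0):", clf.dual_coef_.sum())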
16. Multiple Kernel Learning (MKL)
- Given multiple base kernels K_k, learn (see the sketch after this list)
  - K_opt = Σ_k d_k K_k
  - subject to regularization on d_k
- Various formulations
  - Kernel Target Alignment.
  - MKL via Semi-Definite Programming.
  - MKL Block l1 via Sequential Minimal Optimization.
  - MKL Block l1 via Semi-Infinite Linear Programming.
  - MKL via Projected Gradient Descent.
  - Formulations based on Boosting, Hyperkernels, etc.
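A sketch of the core idea: form K_opt = Σ_k d_k K_k from a few base kernels and train a standard SVM on the combination. The base kernels and the hand-set weights d are assumptions; MKL learns d from data.

# Combine base kernels with fixed weights and train on the resulting kernel.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel

rng = np.random.RandomState(0)
X = rng.randn(80, 5)
y = np.sign(X[:, 0] + 0.5 * X[:, 1] ** 2 - 0.3)

base_kernels = [linear_kernel(X, X),
                polynomial_kernel(X, X, degree=2),
                rbf_kernel(X, X, gamma=0.2)]
d = np.array([0.2, 0.3, 0.5])                   # kernel weights (to be learnt by MKL)

K_opt = sum(dk * Kk for dk, Kk in zip(d, base_kernels))
clf = SVC(C=1.0, kernel="precomputed").fit(K_opt, y)
print("training accuracy:", clf.score(K_opt, y))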
17. Kernel Target Alignment
- Kernel Target Alignment [Cristianini et al. 2001]
- Alignment
  - A(K1, K2) = ⟨K1, K2⟩ / √(⟨K1, K1⟩ ⟨K2, K2⟩)
  - where ⟨K1, K2⟩ = Σ_i Σ_j K1(x_i, x_j) K2(x_i, x_j)
- Ideal kernel: K_ideal = yyᵀ
- Alignment to the ideal kernel
  - A(K) = ⟨K, yyᵀ⟩ / (n √⟨K, K⟩)
- Optimal kernel
  - K_opt = Σ_k d_k K_k where K_k = v_k v_kᵀ (rank 1)
18. Kernel Target Alignment
- Optimal alignment
  - A(K_opt) = Σ_k d_k ⟨v_k, y⟩² / (n √(Σ_k d_k²))
- Assume Σ_k d_k² = 1.
- Lagrangian
  - L(λ, d) = Σ_k d_k ⟨v_k, y⟩² - λ(Σ_k d_k² - 1)
- Optimal weights: d_k ∝ ⟨v_k, y⟩² (see the sketch after this slide)
- Some generalisation bounds have been given, but the task is not directly related to classification.
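A rough sketch of alignment-based weighting: score each base kernel against the ideal kernel yyᵀ and normalise the scores into weights. This uses whole base kernels rather than the rank-1 eigenvector kernels of the derivation above, so the weights are alignment scores rather than the closed-form d_k; the data and kernels are assumptions.

# Kernel-target alignment A(K1, K2) and alignment-based kernel weights.
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel

def alignment(K1, K2):
    """A(K1, K2) = <K1, K2>_F / sqrt(<K1, K1>_F <K2, K2>_F)."""
    return np.sum(K1 * K2) / np.sqrt(np.sum(K1 * K1) * np.sum(K2 * K2))

rng = np.random.RandomState(0)
X = rng.randn(60, 4)
y = np.sign(X[:, 0])
K_ideal = np.outer(y, y)                        # ideal kernel yy^T

base_kernels = [linear_kernel(X, X),
                rbf_kernel(X, X, gamma=0.1),
                rbf_kernel(X, X, gamma=10.0)]
scores = np.array([alignment(K, K_ideal) for K in base_kernels])
d = np.clip(scores, 0.0, None)
d /= d.sum()                                    # normalise into kernel weights
print("alignment scores:", scores)
print("weights:", d)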
19. Multiple Kernel Learning: SDP
[Diagram over the kernel weights (d1, d2) for K = d1 K1 + d2 K2; labels: NP-hard region, Σ_k d_k = 1, K ⪰ 0 (SDP), brute-force search, Lanckriet et al.]
20. Multiple Kernel Learning: SDP
- Multiple Kernel Learning [Lanckriet et al. 2002]
- Minimise ½ wᵀw + C Σ_i ξ_i
- Subject to
  - y_i (wᵀφ_d(x_i) + b) ≥ 1 - ξ_i
  - ξ_i ≥ 0
  - K = Σ_k d_k K_k is positive semi-definite
  - trace(K) = constant
- Optimisation is an SDP (an SOCP if d_k ≥ 0).
- Other loss functions are possible (squared hinge, KTA).
21. Multiple Kernel Learning: Block l1
[Diagram over the kernel weights (d1, d2) for K = d1 K1 + d2 K2; labels: NP-hard region, Σ_k d_k = 1, K ⪰ 0 (SDP region), brute-force search, Bach et al. / Sonnenburg et al. / Rakotomamonjy et al.]
22. MKL Block l1: SMO
- MKL Block l1 via SMO [Bach et al. 2004]
- Min ½ (Σ_k d_k ||w_k||_2)² + C Σ_i ξ_i + ½ Σ_k a_k² ||w_k||_2²
- Subject to
  - y_i (Σ_k w_kᵀ φ_k(x_i) + b) ≥ 1 - ξ_i
  - ξ_i ≥ 0
- Moreau-Yosida regularisation ensures differentiability (for SMO).
- Block l1 regularisation ensures sparsity (of the kernel weights, and enables SMO).
- Optimisation is carried out via iterative SMO.
23. MKL Block l1: SILP
- MKL Block l1 via SILP [Sonnenburg et al. 2005]
- Min ½ (Σ_k d_k ||w_k||_2)² + C Σ_i ξ_i
- Subject to
  - y_i (Σ_k w_kᵀ φ_k(x_i) + b) ≥ 1 - ξ_i
  - ξ_i ≥ 0
  - Σ_k d_k = 1
- Iterative SILP-QP solution.
- Solves a 10-million-point problem with 20 kernels.
- Generalises to regression, novelty detection, etc.
24. Other Formulations
- Hyperkernels [Ong and Smola 2002]
  - Learn a kernel per training point (not per class).
  - SDP formulation improved to an SOCP [Tsang and Kwok 2006].
- Boosting
  - Exp/Log loss over pairs of distances [Crammer et al. 2002]
  - LPBoost [Bi et al. 2004]
  - KernelBoost [Hertz et al. 2006]
- Multi-class MKL [Zien and Ong 2007]
25. Multiple Kernel Learning [Varma and Ray 2007]
[Diagram over the kernel weights (d1, d2) for K = d1 K1 + d2 K2; labels: NP-hard region, d ≥ 0 (SOCP region), K ⪰ 0 (SDP region), brute-force search]
26. Our Primal Formulation
- Minimise ½ wᵀw + C Σ_i ξ_i + Σ_k σ_k d_k
- Subject to
  - y_i (wᵀφ_d(x_i) + b) ≥ 1 - ξ_i
  - ξ_i ≥ 0
  - d_k ≥ 0
  - K(x_i, x_j) = Σ_k d_k K_k(x_i, x_j)
- Very efficient gradient descent based solution.
- Similar to Rakotomamonjy et al. 2007, but more accurate as our search space is larger.
27. Final Algorithm
- Initialise d^0 randomly.
- Repeat until convergence
  - Form K(x, y) = Σ_k d_k^n K_k(x, y)
  - Use any SVM solver to solve the standard SVM problem with kernel K and obtain α.
  - Update d_k^(n+1) = d_k^n - η_n (σ_k - ½ αᵀ Y K_k Y α)
  - Project d^(n+1) back onto the feasible set if it does not satisfy the constraints d^(n+1) ≥ 0
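A sketch of this alternating loop, assuming the sum kernel K(d) = Σ_k d_k K_k and the l1 regulariser Σ_k σ_k d_k from the primal above, with scikit-learn's SVC as the off-the-shelf inner solver; the step-size schedule, iteration count and toy data are placeholders.

# Alternate between solving a standard SVM for alpha and a projected gradient
# step on the kernel weights d, as in the algorithm above.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel

def mkl_gradient_descent(base_kernels, y, C=1.0, sigma=1.0, eta=0.05, n_iter=50):
    d = np.ones(len(base_kernels)) / len(base_kernels)      # initial weights
    for n in range(n_iter):
        K = sum(dk * Kk for dk, Kk in zip(d, base_kernels))
        svm = SVC(C=C, kernel="precomputed").fit(K, y)       # standard SVM solve
        a = np.zeros(len(y))                                 # a_i = y_i * alpha_i
        a[svm.support_] = svm.dual_coef_.ravel()
        # Gradient w.r.t. d_k is sigma_k - 0.5 * a' K_k a; step, then project onto d >= 0.
        grad = np.array([sigma - 0.5 * a @ Kk @ a for Kk in base_kernels])
        d = np.clip(d - (eta / np.sqrt(n + 1)) * grad, 0.0, None)
        if not d.any():                                      # all weights pruned
            break
    return d

rng = np.random.RandomState(0)
X = rng.randn(80, 4)
y = np.sign(X[:, 0] * X[:, 1])
kernels = [linear_kernel(X, X), rbf_kernel(X, X, gamma=0.5), rbf_kernel(X, X, gamma=5.0)]
print("learnt kernel weights:", mkl_gradient_descent(kernels, y))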
28. Kernel Generalizations
- The learnt kernel can now have any functional form as long as
  - ∇_d K(d) exists and is continuous.
  - K(d) is strictly positive definite for feasible d.
- For example, K(d) = Σ_k d_k0 Π_l exp(-d_kl χ_l²)
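A sketch of one such generalised kernel, a single product of feature-wise exponentials, together with its gradient in closed form (which is all the gradient descent step needs). The plain squared-difference distances used here stand in for whatever per-feature distances (e.g. χ²) the application would use; they are an assumption for illustration.

# A product-form kernel K(d) = prod_l exp(-d_l * D_l) and its derivative dK/dd_l.
import numpy as np

def feature_distances(X):
    """D[l, i, j] = (x_il - x_jl)^2, one squared-distance matrix per feature l."""
    return np.stack([(X[:, l, None] - X[None, :, l]) ** 2 for l in range(X.shape[1])])

def product_kernel(D, d):
    """K(d) = prod_l exp(-d_l * D_l) = exp(-sum_l d_l * D_l)."""
    return np.exp(-np.tensordot(d, D, axes=1))

def product_kernel_grad(D, d):
    """dK/dd_l = -D_l * K(d); exists and is continuous in d, as required."""
    K = product_kernel(D, d)
    return np.array([-D_l * K for D_l in D])

X = np.random.RandomState(0).randn(10, 3)
D = feature_distances(X)
d = np.array([0.5, 1.0, 0.1])
print(product_kernel(D, d).shape, product_kernel_grad(D, d).shape)   # (10, 10) (3, 10, 10)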
29. Regularization Generalizations
- Any regularizer can be used as long as it has a continuous first derivative with respect to d.
- We can now put Gaussian rather than Laplacian priors on the kernel weights.
- We can, once again, have negative weights.
30. Loss Function Generalizations
- The loss function can be generalized to handle
- Regression.
- Novelty detection (1 class SVM).
- Multi-class classification.
- Ordinal Regression.
- Ranking.
31. Penalties
- Our formulation is no longer convex.
- Somehow, this seems not to make much of a difference.
- Furthermore, early termination can sometimes improve results.
32. Conclusions
- Gradient descent optimization of MKL
  - Is efficient and can scale to large problems.
  - Can be implemented using off-the-shelf SVM packages.
  - Allows generalization of the loss function to ranking, regression, novelty detection, etc.
  - Allows learning of more general kernel combinations than sums of base kernels.
  - Allows other priors on the kernel weights than the standard zero-mean Laplacian prior.
33. Regression on Hot or Not Training Data
[Photos with attractiveness scores: 7.3, 6.5, 9.4, 7.5, 7.7, 7.7, 6.5, 6.9, 7.4, 8.7]