Transcript and Presenter's Notes

Title: Multiple Kernel Learning


1
Multiple Kernel Learning
Manik Varma Microsoft Research India
2
Object Categorization
Is this a cat or a dog?
3
Object Detection
Where is the cat?
4
Predicting Facial Attractiveness
[Photos: Anandan, Kentaro, Kenta Rao?]
How hot are they?
5
Object Categorization Results: Caltech 101
  • Adding the Gist kernel and some post-processing gives 98.2 ± 0.3 (Bosch et al., IJCV submitted).

6
Object Detection Results: PASCAL VOC 2007
7
MSRI Top 5
Number 5: Prasad (7.97)
8
MSRI Top 5
Number 5: Prasad (7.97)
Number 4: Shanmuga (8.00)
9
MSRI Top 5
2nd Runner Up: Satya (8.30)
Number 5: Prasad (7.97)
Number 4: Shanmuga (8.00)
10
MSRI Top 5
1st Runner Up: Vishnu (8.30)
2nd Runner Up: Satya (8.30)
Number 5: Prasad (7.97)
Number 4: Shanmuga (8.00)
11
MSRI Top 5
Mr. MSRI: Venkatesh (8.44)
1st Runner Up: Vishnu (8.30)
2nd Runner Up: Satya (8.30)
Number 5: Prasad (7.97)
Number 4: Shanmuga (8.00)
12
Predicting Facial Attractiveness
[Photos with predicted scores 7.92 and 7.63]
Failure mode?
13
SVM Classification
[Figure: soft-margin SVM classification with kernel K(xᵢ, xⱼ) = φᵀ(xᵢ)φ(xⱼ). The hyperplanes wᵀφ(x) + b = −1, 0, +1 are drawn with the normal w and bias b; the margin is 2 / √(wᵀw). Support vectors lie on the margin (ξ = 0), margin violations have ξ < 1, and misclassified points have ξ > 1.]
14
The C-SVM Primal Formulation
  • Minimise ½wᵀw + C Σᵢ ξᵢ
  • Subject to
  • yᵢ (wᵀφ(xᵢ) + b) ≥ 1 − ξᵢ
  • ξᵢ ≥ 0
  • where
  • (xᵢ, yᵢ) is the ith training point.
  • C is the misclassification penalty.
  • Decision function: f(x) = sign(wᵀφ(x) + b) (a small numeric sketch follows below).

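A minimal numpy sketch of the primal objective and decision function above; the weight vector, bias, slacks and C are purely illustrative values, and a linear feature map φ(x) = x is assumed.

```python
import numpy as np

def primal_objective(w, xi, C):
    """C-SVM primal objective: ½ wᵀw + C Σᵢ ξᵢ."""
    return 0.5 * float(w @ w) + C * float(np.sum(xi))

def decision_function(w, b, phi_X):
    """f(x) = sign(wᵀφ(x) + b), with φ(x) supplied explicitly as rows of phi_X."""
    return np.sign(phi_X @ w + b)

# Purely illustrative values with the identity feature map φ(x) = x in 2-D.
w, b = np.array([1.0, -0.5]), 0.2
phi_X = np.array([[0.3, 0.1], [-1.0, 2.0]])
xi = np.array([0.0, 0.3])
print(primal_objective(w, xi, C=10.0))
print(decision_function(w, b, phi_X))
```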
15
The C-SVM Dual Formulation
  • Maximise 1ᵀα − ½αᵀYKYα (a precomputed-kernel solve is sketched below)
  • Subject to
  • 1ᵀYα = 0
  • 0 ≤ α ≤ C
  • where
  • α are the Lagrange multipliers corresponding to the support vector coefficients
  • Y is a diagonal matrix such that Yᵢᵢ = yᵢ

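In practice this dual is handed to an off-the-shelf SVM package. The sketch below uses scikit-learn's SVC with a precomputed Gram matrix (an assumed tooling choice, not something prescribed by the slides); dual_coef_ then holds the non-zero entries of Yα.

```python
import numpy as np
from sklearn.svm import SVC  # assumed to be available; any dual SVM solver works

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = np.where(X[:, 0] + 0.1 * rng.normal(size=40) > 0, 1, -1)

# Precompute an RBF Gram matrix K(xi, xj) = exp(-gamma * ||xi - xj||^2).
gamma = 0.5
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq_dists)

clf = SVC(C=1.0, kernel="precomputed").fit(K, y)
# clf.dual_coef_ stores y_i * alpha_i for the support vectors (the non-zero entries of Y alpha),
# clf.support_ gives their indices and clf.intercept_ is the bias b.
print(len(clf.support_), clf.intercept_)
```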
16
Multiple Kernel Learning (MKL)
  • Given multiple base kernels Kₖ, learn
  • Kopt = Σₖ dₖKₖ
  • subject to regularization on the dₖ (forming Kopt is sketched below).
  • Various formulations
  • Kernel Target Alignment.
  • MKL via Semi-Definite Programming.
  • MKL Block l1 via Sequential Minimal Optimization.
  • MKL Block l1 via Semi-Infinite Linear Programming.
  • MKL via Projected Gradient Descent.
  • Formulations based on Boosting, Hyperkernels, etc.

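The combined kernel itself is just a weighted sum of base Gram matrices; a small sketch, where the two RBF bandwidths and the weights are illustrative assumptions.

```python
import numpy as np

def combine_kernels(base_kernels, d):
    """Kopt = sum_k d_k * K_k for a list of base Gram matrices and weights d."""
    return sum(dk * Kk for dk, Kk in zip(d, base_kernels))

# Toy example: two RBF kernels with different bandwidths on 1-D inputs.
x = np.linspace(-1.0, 1.0, 5)[:, None]
sq = (x - x.T) ** 2
K1, K2 = np.exp(-sq / 0.1), np.exp(-sq / 1.0)
K_opt = combine_kernels([K1, K2], d=[0.3, 0.7])
print(np.round(K_opt, 3))
```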
17
Kernel Target Alignment
  • Kernel Target Alignment (Cristianini et al. 2001)
  • Alignment:
  • A(K1, K2) = <K1, K2> / (<K1, K1><K2, K2>)^½
  • where <K1, K2> = Σᵢ Σⱼ K1(xᵢ, xⱼ) K2(xᵢ, xⱼ)
  • Ideal kernel: Kideal = yyᵀ
  • Alignment to the ideal kernel:
  • A(K) = <K, yyᵀ> / (n <K, K>^½)
  • Optimal kernel:
  • Kopt = Σₖ dₖKₖ where the Kₖ = vₖvₖᵀ are rank 1 (both alignments are sketched below)

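A direct numpy transcription of the two alignment formulas above; the function names are my own.

```python
import numpy as np

def alignment(K1, K2):
    """A(K1, K2) = <K1, K2> / (<K1, K1><K2, K2>)^0.5 with Frobenius inner products."""
    return np.sum(K1 * K2) / np.sqrt(np.sum(K1 * K1) * np.sum(K2 * K2))

def alignment_to_ideal(K, y):
    """A(K) = <K, yy^T> / (n <K, K>^0.5) for labels y in {-1, +1}."""
    n = len(y)
    return np.sum(K * np.outer(y, y)) / (n * np.sqrt(np.sum(K * K)))

# Sanity check: the ideal kernel yy^T has alignment 1 with the labels.
y = np.array([1, 1, -1, -1])
print(alignment_to_ideal(np.outer(y, y).astype(float), y))  # 1.0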
18
Kernel Target Alignment
  • Optimal alignment:
  • A(Kopt) = Σₖ dₖ<vₖ, y>² / (n (Σₖ dₖ²)^½)
  • Assume Σₖ dₖ² = 1.
  • Lagrangian:
  • L(λ, d) = Σₖ dₖ<vₖ, y>² − λ(Σₖ dₖ² − 1)
  • Optimal weights: dₖ ∝ <vₖ, y>² (computed in the sketch below)
  • Some generalisation bounds have been given, but the task is not directly related to classification.

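The closed-form weights are cheap to compute. A small sketch in which the rank-one directions vₖ are the columns of V, normalised to satisfy the Σₖ dₖ² = 1 assumption above; the toy data are my own.

```python
import numpy as np

def kta_weights(V, y):
    """d_k proportional to <v_k, y>^2 for rank-one base kernels K_k = v_k v_k^T
    (columns of V), normalised so that sum_k d_k^2 = 1 as assumed above."""
    scores = (V.T @ y) ** 2
    return scores / np.linalg.norm(scores)

# Toy example: the direction aligned with y gets all the weight.
y = np.array([1.0, 1.0, -1.0, -1.0])
V = np.column_stack([y, np.array([1.0, -1.0, 1.0, -1.0]), np.ones(4)])
print(kta_weights(V, y))  # -> [1., 0., 0.]
```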
19
Multiple Kernel Learning: SDP
[Figure: the (d1, d2) plane of kernel combinations K = d1K1 + d2K2, showing the NP-hard region (brute force search), the region K ⪰ 0 (an SDP, Lanckriet et al.) and the line Σₖ dₖ = 1.]
20
Multiple Kernel Learning: SDP
  • Multiple Kernel Learning (Lanckriet et al. 2002)
  • Minimise ½wᵀw + C Σᵢ ξᵢ
  • Subject to
  • yᵢ (wᵀφd(xᵢ) + b) ≥ 1 − ξᵢ
  • ξᵢ ≥ 0
  • K = Σₖ dₖKₖ is positive semi-definite
  • trace(K) = constant
  • The optimisation is an SDP (an SOCP if dₖ ≥ 0); both kernel constraints are checked numerically in the sketch below.
  • Other loss functions are possible (square hinge, KTA).

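The two kernel constraints are easy to verify numerically. The helper names below are my own; the example also illustrates why the PSD constraint is needed when the dₖ may be negative, whereas dₖ ≥ 0 keeps K in the PSD cone automatically.

```python
import numpy as np

def is_psd(K, tol=1e-10):
    """Check K >= 0 (positive semi-definite) via the smallest eigenvalue of the symmetric Gram matrix."""
    return float(np.linalg.eigvalsh(K).min()) >= -tol

def trace_normalise(K, target=1.0):
    """Rescale K so that trace(K) equals the constant used in the formulation."""
    return K * (target / np.trace(K))

K1, K2 = np.eye(2), np.array([[1.0, 0.9], [0.9, 1.0]])
print(is_psd(1.0 * K1 - 1.5 * K2))   # False: a mixed-sign combination can be indefinite
print(is_psd(0.5 * K1 + 0.5 * K2))   # True: nonnegative weights preserve K >= 0
print(np.trace(trace_normalise(0.5 * K1 + 0.5 * K2)))  # 1.0
```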
21
Multiple Kernel Learning: Block l1
[Figure: the same (d1, d2) plane for K = d1K1 + d2K2. Bach et al., Sonnenburg et al. and Rakotomamonjy et al. restrict the search to the simplex Σₖ dₖ = 1, which lies inside the SDP region K ⪰ 0; the surrounding NP-hard region would require brute force search.]
22
MKL Block l1 SMO
  • MKL Block l1 via SMO (Bach et al. 2004)
  • Minimise ½ (Σₖ dₖ‖wₖ‖₂)² + C Σᵢ ξᵢ + ½ Σₖ aₖ²‖wₖ‖₂²
  • Subject to
  • yᵢ (Σₖ wₖᵀφₖ(xᵢ) + b) ≥ 1 − ξᵢ
  • ξᵢ ≥ 0
  • The Moreau-Yosida regularisation ensures differentiability (for SMO).
  • The block l1 regularisation ensures sparsity (of the kernel weights, and enables SMO).
  • Optimisation is carried out via iterative SMO (the objective is spelled out in the sketch below).

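To make the terms of the objective as reconstructed above explicit, here is a small numpy sketch; the variable names and toy values are assumptions, not from the talk.

```python
import numpy as np

def block_l1_objective(W, d, a, xi, C):
    """0.5*(sum_k d_k ||w_k||_2)^2 + C*sum_i xi_i + 0.5*sum_k a_k^2 ||w_k||_2^2,
    with W a list of per-kernel weight vectors w_k; the last term is the
    Moreau-Yosida smoothing that makes SMO applicable."""
    norms = np.array([np.linalg.norm(w) for w in W])
    d, a = np.asarray(d, float), np.asarray(a, float)
    return (0.5 * float(d @ norms) ** 2
            + C * float(np.sum(xi))
            + 0.5 * float(np.sum(a ** 2 * norms ** 2)))

W = [np.array([1.0, 0.0]), np.array([0.5, 0.5])]
print(block_l1_objective(W, d=[1.0, 1.0], a=[0.1, 0.1], xi=[0.0, 0.2], C=10.0))
```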
23
MKL Block l1 SILP
  • MKL Block l1 via SILP (Sonnenburg et al. 2005)
  • Minimise ½ (Σₖ dₖ‖wₖ‖₂)² + C Σᵢ ξᵢ
  • Subject to
  • yᵢ (Σₖ wₖᵀφₖ(xᵢ) + b) ≥ 1 − ξᵢ
  • ξᵢ ≥ 0
  • Σₖ dₖ = 1
  • Iterative SILP-QP solution.
  • Can solve a 10M point problem with 20 kernels.
  • Generalizes to regression, novelty detection, etc.

24
Other Formulations
  • Hyperkernels (Ong and Smola 2002)
  • Learn a kernel per training point (not per class).
  • The SDP formulation was improved to an SOCP by Tsang and Kwok 2006.
  • Boosting
  • Exp/Log loss over pairs of distances (Crammer et al. 2002).
  • LPBoost (Bi et al. 2004).
  • KernelBoost (Hertz et al. 2006).
  • Multi-class MKL (Zien and Ong 2007).

25
Multiple Kernel Learning (Varma & Ray 2007)
[Figure: the (d1, d2) plane for K = d1K1 + d2K2. Varma & Ray search over the region d ≥ 0 (an SOCP region), which lies inside the SDP region K ⪰ 0; the surrounding NP-hard region would require brute force search.]
26
Our Primal Formulation
  • Minimise ½wᵀw + C Σᵢ ξᵢ + Σₖ σₖdₖ
  • Subject to
  • yᵢ (wᵀφd(xᵢ) + b) ≥ 1 − ξᵢ
  • ξᵢ ≥ 0
  • dₖ ≥ 0
  • where K(xᵢ, xⱼ) = Σₖ dₖ Kₖ(xᵢ, xⱼ)
  • Very efficient gradient descent based solution.
  • Similar to Rakotomamonjy et al. 2007, but more accurate as our search space is larger.

27
Final Algorithm
  • Initialise d⁰ randomly.
  • Repeat until convergence:
  • Form K(x, y) = Σₖ dₖⁿ Kₖ(x, y).
  • Use any SVM solver to solve the standard SVM problem with kernel K and obtain α.
  • Update dₖⁿ⁺¹ = dₖⁿ − εₙ(σₖ − ½αᵀYKₖYα).
  • Project dⁿ⁺¹ back onto the feasible set if it does not satisfy the constraints dⁿ⁺¹ ≥ 0 (a runnable sketch of this loop follows below).

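A minimal runnable sketch of this loop, using scikit-learn's SVC as the "any SVM solver" with precomputed kernels (an assumed tooling choice); the step size, iteration count and uniform σₖ are illustrative rather than the schedule from the paper, and binary labels are assumed.

```python
import numpy as np
from sklearn.svm import SVC  # stands in for "any SVM solver" with a precomputed-kernel mode

def mkl_projected_gradient(base_kernels, y, C=1.0, sigma=1.0, step=1e-3, n_iter=50, seed=0):
    """Alternate between a standard SVM solve with K = sum_k d_k K_k and a projected
    gradient step on d, with gradient components sigma_k - 0.5 * alpha^T Y K_k Y alpha."""
    d = np.abs(np.random.default_rng(seed).normal(size=len(base_kernels)))  # random d^0
    sig = np.full(len(base_kernels), sigma, dtype=float)

    for _ in range(n_iter):
        K = sum(dk * Kk for dk, Kk in zip(d, base_kernels))   # K = sum_k d_k^n K_k
        svm = SVC(C=C, kernel="precomputed").fit(K, y)        # standard SVM solve -> alpha
        sv = svm.support_
        ay = svm.dual_coef_.ravel()                           # y_i * alpha_i on the support vectors

        grad = np.array([s - 0.5 * ay @ Kk[np.ix_(sv, sv)] @ ay   # sigma_k - 0.5 * alpha^T Y K_k Y alpha
                         for s, Kk in zip(sig, base_kernels)])
        d = np.maximum(d - step * grad, 0.0)                  # gradient step, then project onto d >= 0
    return d

# Usage sketch: d = mkl_projected_gradient([K1, K2], y) with binary labels y in {-1, +1}.
```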
28
Kernel Generalizations
  • The learnt kernel can now have any functional form as long as
  • ∇dK(d) exists and is continuous, and
  • K(d) is strictly positive definite for feasible d.
  • For example, K(d) = Σₖ dₖ₀ Πₗ exp(−dₖₗ χ²ₗ) (sketched below).

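A sketch of one such generalised kernel, reconstructed from the formula above; the argument names and the reading of χ²ₗ as per-channel pairwise distance matrices are assumptions.

```python
import numpy as np

def generalised_kernel(d0, D, dists):
    """K(d) = sum_k d_k0 * prod_l exp(-d_kl * chi2_l): a weighted sum over k of
    products over feature channels l of exponentiated pairwise (e.g. chi-squared)
    distance matrices. d0: (K,) outer weights; D: (K, L) inner weights;
    dists: list of L pairwise distance matrices."""
    D = np.asarray(D, dtype=float)
    K_total = np.zeros_like(dists[0], dtype=float)
    for k in range(len(d0)):
        prod = np.ones_like(dists[0], dtype=float)
        for l, dist in enumerate(dists):
            prod *= np.exp(-D[k, l] * dist)
        K_total += d0[k] * prod
    return K_total
```

The derivative with respect to every dₖ₀ and dₖₗ exists and is continuous, matching the first condition above; for χ²-type distances and positive weights each summand is a product of exponential kernels, so positive definiteness is preserved.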
29
Regularization Generalizations
  • Any regularizer can be used as long as it has a continuous first derivative with respect to d.
  • We can now put Gaussian rather than Laplacian priors on the kernel weights (a small sketch follows below).
  • We can, once again, have negative weights.

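As an illustration of what a Gaussian prior on the weights might look like, a sketch under my own parameterisation; the slide does not specify one.

```python
import numpy as np

def gaussian_prior(d, mu=0.0, tau=1.0):
    """r(d) = 0.5 * tau * ||d - mu||^2: a Gaussian (l2) prior on the kernel weights,
    replacing the Laplacian (l1) term sum_k sigma_k d_k; negative d_k are now permitted."""
    d = np.asarray(d, dtype=float)
    return 0.5 * tau * float(np.sum((d - mu) ** 2))

def gaussian_prior_grad(d, mu=0.0, tau=1.0):
    """dr/dd = tau * (d - mu): continuous, as the formulation requires."""
    return tau * (np.asarray(d, dtype=float) - mu)
```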
30
Loss Function Generalizations
  • The loss function can be generalized to handle
  • Regression.
  • Novelty detection (1 class SVM).
  • Multi-class classification.
  • Ordinal Regression.
  • Ranking.

31
Penalties
  • Our formulation is no longer convex.
  • Somehow, this seems not to make much of a difference.
  • Furthermore, early termination can sometimes improve results.

32
Conclusions
  • Gradient descent optimization of MKL
  • Is efficient and can scale to large problems.
  • Can be implemented using off-the-shelf SVM
    packages.
  • Allows generalization of the loss function to
    ranking, regression, novelty detection, etc.
  • Allows learning of more general kernel
    combinations than sums of base kernels.
  • Allows other priors on the kernel weights than
    the standard zero mean Laplacian prior.

33
Regression on Hot or Not Training Data
[Training images with Hot or Not ratings: 7.3, 6.5, 9.4, 7.5, 7.7, 7.7, 6.5, 6.9, 7.4, 8.7]