Title: Kernels, Margins, and Low-dimensional Mappings
1. Kernels, Margins, and Low-dimensional Mappings
- Maria-Florina Balcan, Avrim Blum, Santosh Vempala
NIPS 2007 Workshop on TOPOLOGY LEARNING
2. Generic problem
- Given a set of images, want to learn a linear separator to distinguish men from women.
- Problem: the pixel representation is no good.
- Old-style advice:
  - Pick a better set of features!
  - But seems ad hoc. Not scientific.
- New-style advice:
  - Use a kernel! K(x,y) = φ(x)·φ(y). φ is an implicit, high-dimensional mapping.
  - Feels more scientific. Many algorithms can be kernelized. Use the magic of the implicit high-dimensional space. Don't pay for it if a large-margin separator exists.
3. Generic problem
- Old-style advice:
  - Pick a better set of features!
  - But seems ad hoc. Not scientific.
- New-style advice:
  - Use a kernel! K(x,y) = φ(x)·φ(y). φ is an implicit, high-dimensional mapping.
  - Feels more scientific. Many algorithms can be kernelized. Use the magic of the implicit high-dimensional space. Don't pay for it if a large-margin separator exists.
  - E.g., K(x,y) = (x·y + 1)^m. φ: (n-dimensional space) → (n^m-dimensional space).
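As a concrete check of the kernel-as-implicit-mapping picture, here is a minimal sketch (not from the slides) comparing the degree-2 polynomial kernel K(x,y) = (x·y + 1)² against one standard explicit feature map for it; the function names are illustrative.

```python
import numpy as np
import itertools

def poly_kernel(x, y, m=2):
    """Degree-m polynomial kernel K(x,y) = (x.y + 1)^m, used as a black box."""
    return (np.dot(x, y) + 1.0) ** m

def explicit_phi(x):
    """One explicit feature map for m = 2: constant, scaled linear, squares,
    and scaled cross terms, so that phi(x).phi(y) = (x.y + 1)^2."""
    n = len(x)
    feats = [1.0]
    feats += [np.sqrt(2.0) * xi for xi in x]              # linear terms
    feats += [xi * xi for xi in x]                        # squared terms
    feats += [np.sqrt(2.0) * x[i] * x[j]                  # cross terms
              for i, j in itertools.combinations(range(n), 2)]
    return np.array(feats)

rng = np.random.default_rng(0)
x, y = rng.standard_normal(5), rng.standard_normal(5)
print(poly_kernel(x, y), explicit_phi(x) @ explicit_phi(y))  # the two agree
```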
4. Claim
- Can view the new method as a way of conducting the old method.
- Given a kernel as a black-box program K(x,y) and access to typical inputs (samples from D):
- Claim: can run K and reverse-engineer an explicit (small) set of features, such that if K is good (∃ a large-margin separator in φ-space for (D,c)), then this is a good feature set (∃ an almost-as-good separator).
- You give me a kernel, I give you a set of features.
- Do this using the idea of random projection.
5. Claim
- Can view the new method as a way of conducting the old method.
- Given a kernel as a black-box program K(x,y) and access to typical inputs (samples from D):
- Claim: can run K and reverse-engineer an explicit (small) set of features, such that if K is good (∃ a large-margin separator in φ-space for (D,c)), then this is a good feature set (∃ an almost-as-good separator).
- E.g., sample z1,...,zd from D. Given x, define xi = K(x,zi).
- Implications:
  - Practical: an alternative to kernelizing the algorithm.
  - Conceptual: view the kernel as a (principled) way of doing feature generation. View it as a similarity function, rather than as the magic power of an implicit high-dimensional space.
6. Basic setup, definitions
- Distribution D, target c. Use P = (D,c).
- P is separable with margin γ in φ-space if ∃ w (with |w| = 1) s.t. Pr_{(x,ℓ)~P}[ ℓ·(w·φ(x))/|φ(x)| < γ ] = 0.
- Error ε at margin γ: replace the 0 with ε.
- Goal: use K to get a mapping to a low-dimensional space.
7. One idea: Johnson-Lindenstrauss lemma
- If P is separable with margin γ in φ-space, then with probability ≥ 1-δ, a random linear projection down to a space of dimension d = O((1/γ²) log(1/(δε))) will have a linear separator of error < ε. [Arriaga-Vempala]
- If the projection vectors are r1, r2, ..., rd, then we can view them as features: xi = φ(x)·ri.
- Problem: this uses φ. Can we do it directly, using K as a black box, without computing φ?
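To make the JL idea concrete, here is a minimal sketch (not from the slides) of the features xi = φ(x)·ri, assuming φ is available explicitly; all dimensions and names are illustrative, and the point of the following slides is precisely to avoid needing this explicit φ.

```python
import numpy as np

def jl_project(Phi, d, rng):
    """Project rows of Phi (points already written in phi-space, shape (m, N))
    onto d random Gaussian directions r1,...,rd: new feature x_i = phi(x).r_i."""
    N = Phi.shape[1]
    R = rng.standard_normal((N, d)) / np.sqrt(d)   # columns are the (scaled) r_i
    return Phi @ R

rng = np.random.default_rng(0)
Phi = rng.standard_normal((100, 10_000))           # 100 points in a large phi-space
X_low = jl_project(Phi, d=200, rng=rng)            # 200 random-projection features
# Dot products and margins are approximately preserved, but this needs phi
# explicitly, which is exactly what the kernel trick is meant to avoid.
```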
8. Three methods (from simplest to best)
- (1) Draw d examples z1,...,zd from D. Use F(x) = (K(x,z1), ..., K(x,zd)); so xi = K(x,zi). For d = (8/ε)(1/γ² + ln(1/δ)), if P was separable with margin γ in φ-space, then whp this will be separable with error ε. (But this method doesn't preserve the margin; see the sketch after this slide.)
- (2) Same d, but a little more complicated. Separable with error ε at margin γ/2.
- (3) Combine (2) with a further projection as in the JL lemma. Get dimension with log dependence on 1/ε, rather than linear. So, can set ε ≈ 1/d.
- All these methods need access to D, unlike JL. Can this requirement be removed? We show NO for a generic K, but it may be possible for natural K.
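A minimal sketch of method (1), assuming only black-box access to a kernel K and a sample of landmark points from D; the RBF kernel, sizes, and names below are illustrative, not part of the slides.

```python
import numpy as np

def mapping1(K, landmarks):
    """Method (1): turn a black-box kernel K into an explicit feature map
    F(x) = (K(x, z1), ..., K(x, zd)) using landmarks z1,...,zd drawn from D."""
    def F(x):
        return np.array([K(x, z) for z in landmarks])
    return F

# Illustrative use with a Gaussian (RBF) kernel; any black-box K works the same way.
def rbf(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(0)
data = rng.standard_normal((500, 20))        # stand-in for samples from D
landmarks = data[rng.choice(len(data), size=50, replace=False)]
F = mapping1(rbf, landmarks)
features = np.array([F(x) for x in data])    # now train any ordinary linear separator
```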
9. Key fact
- Claim: if ∃ a perfect w of margin γ in φ-space, then if we draw z1,...,zd ∈ D for d = (8/ε)(1/γ² + ln(1/δ)), whp (≥ 1-δ) there exists w' in span(φ(z1),...,φ(zd)) of error ≤ ε at margin γ/2.
- Proof sketch: let S = examples drawn so far. Assume |w| = 1 and |φ(z)| ≤ 1 ∀ z.
  - w_in = proj(w, span(S)), w_out = w − w_in.
  - Say w_out is "large" if Pr_z( |w_out·φ(z)| ≥ γ/2 ) ≥ ε; else "small".
  - If small, then done: w' = w_in.
  - Else, the next z has at least ε probability of improving S: |w_out|² drops by at least (γ/2)².
  - This can happen at most 4/γ² times. ∎
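The update step above, written out as a worked inequality (a restatement of the slide's own argument, no new claims):

```latex
% While w_out is "large", a fresh draw z ~ D improves S with probability >= epsilon,
% and each improvement shrinks the part of w outside span(S):
\[
  \|w_{\mathrm{out}}^{\mathrm{new}}\|^{2} \;\le\; \|w_{\mathrm{out}}\|^{2} - (\gamma/2)^{2},
  \qquad \|w_{\mathrm{out}}\|^{2} \le \|w\|^{2} = 1,
\]
\[
  \text{so at most } 4/\gamma^{2} \text{ improvements can occur, and }
  d = \tfrac{8}{\varepsilon}\!\left(\tfrac{1}{\gamma^{2}} + \ln\tfrac{1}{\delta}\right)
  \text{ draws suffice with probability } \ge 1-\delta .
\]
```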
10. So....
- If we draw z1,...,zd ∈ D for d = (8/ε)(1/γ² + ln(1/δ)), then whp there exists w' in span(φ(z1),...,φ(zd)) of error ≤ ε at margin γ/2.
- So, for some w' = α1·φ(z1) + ... + αd·φ(zd),
- Pr_{(x,ℓ)~P}[ sign(w'·φ(x)) ≠ ℓ ] ≤ ε.
- But notice that w'·φ(x) = α1·K(x,z1) + ... + αd·K(x,zd).
- ⇒ The vector (α1,...,αd) is an ε-good separator in the feature space xi = K(x,zi).
- But the margin is not preserved, because the lengths of the target and of the examples are not preserved.
11. How to preserve the margin? (mapping 2)
- We know ∃ w' in span(φ(z1),...,φ(zd)) of error ≤ ε at margin γ/2.
- So, given a new x, we just want to orthogonally project φ(x) into that span. (This preserves the dot product and can only decrease |φ(x)|, so it only increases the margin.)
- Run K(zi,zj) for all i,j = 1,...,d. Get the matrix M.
- Decompose M = UᵀU.
- (Mapping 2) = (mapping 1)·U⁻¹ (see the sketch below). ∎
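A minimal sketch of mapping 2 (not from the slides), assuming the kernel matrix M on the landmarks is positive definite (a small jitter is added for numerical safety); here the Cholesky factor L plays the role of Uᵀ, and all names are illustrative.

```python
import numpy as np

def mapping2(K, landmarks, jitter=1e-10):
    """Mapping 2: compose mapping 1 with U^{-1}, where M = U^T U is the
    Cholesky factorization of the kernel matrix on the landmarks."""
    d = len(landmarks)
    M = np.array([[K(zi, zj) for zj in landmarks] for zi in landmarks])
    L = np.linalg.cholesky(M + jitter * np.eye(d))   # M = L L^T, so U = L^T
    def F2(x):
        f1 = np.array([K(x, z) for z in landmarks])  # mapping 1
        # F2(x) = f1 U^{-1}, i.e. solve L y = f1 with L lower triangular
        return np.linalg.solve(L, f1)
    return F2
```

With this construction F2(zi)·F2(zj) = K(zi,zj) on the landmarks, and F2(x)·F2(zi) = K(x,zi) for a new x, which is the orthogonal-projection property used on the next slides.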
12. Mapping 2, Details
- Draw a set S = {z1, ..., zd} of d = (8/ε)(1/γ² + ln(1/δ)) unlabeled examples from D.
- Run K(x,y) for all x,y ∈ S; get M(S) = (K(zi,zj))_{zi,zj ∈ S}.
- Place S into d-dimensional space based on K (or M(S)).
[Figure: z1, z2, z3 embedded as F2(z1), F2(z2), F2(z3) in R^d, with |F2(zi)|² = K(zi,zi) and F2(zi)·F2(zj) = K(zi,zj).]
13. Mapping 2, Details, cont'd
- What to do with new points?
- Extend the embedding of S to all of X: consider F2: X → R^d defined as follows. For x ∈ X, let F2(x) ∈ R^d be the point of smallest length such that F2(x)·F2(zi) = K(x,zi) for all i ∈ {1, ..., d}.
- This mapping is equivalent to orthogonally projecting φ(x) down to span(φ(z1), ..., φ(zd)). (A sketch of this least-norm view follows.)
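A minimal sketch (not from the slides) of the least-norm definition just given, assuming the landmarks have already been embedded (e.g., the rows of Z_embedded could be F2(z1),...,F2(zd) from the Cholesky sketch above); np.linalg.lstsq returns the minimum-norm solution, which agrees with the triangular-solve formula when M(S) has full rank.

```python
import numpy as np

def embed_new_point(K, x, landmarks, Z_embedded):
    """F2(x): the smallest-length v in R^d with v.F2(zi) = K(x, zi) for all i.
    Z_embedded has shape (d, d); its row i is F2(zi)."""
    k_x = np.array([K(x, z) for z in landmarks])
    v, *_ = np.linalg.lstsq(Z_embedded, k_x, rcond=None)  # minimum-norm solution
    return v

# Usage idea: Z_embedded = np.array([F2(z) for z in landmarks]) with F2 from mapping2.
```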
14. How to improve the dimension?
- The current mapping (F2) gives d = (8/ε)(1/γ² + ln(1/δ)).
- Johnson-Lindenstrauss gives d1 = O((1/γ²) log(1/(δε))). Nice because the dependence on 1/ε is only logarithmic, so d1 can be much smaller than 1/ε.
- Answer: just combine the two...
- Run mapping 2, then do a random projection down from that.
- Gives us the desired dimension (# features), though the sample complexity remains as in mapping 2.
15. [Figure: positive (X) and negative (O) points in R^N, mapped by F2 into R^d, then by a JL random projection into R^{d1}; the composed map is F.]
16. Mapping 3
- Do JL(mapping2(x)).
- JL says: fix y and w. A random projection M down to a space of dimension O((1/γ²) log(1/δ')) will, with probability ≥ 1-δ', preserve the margin of y up to γ/4.
- Use δ' = εδ.
- ⇒ For all y, Pr_M[failure on y] < εδ,
- ⇒ Pr_{D,M}[failure on y] < εδ,
- ⇒ Pr_M[fail on more than an ε fraction of the probability mass] < δ (by Markov).
- So we get the desired dimension (# features; a sketch follows), though the sample complexity remains as in mapping 2.
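A minimal sketch of mapping 3 under the same assumptions as the earlier sketches: the mapping 2 construction composed with a Gaussian random projection. The target dimension d1 and all names are illustrative.

```python
import numpy as np

def mapping3(K, landmarks, d1, rng, jitter=1e-10):
    """Mapping 3: JL(mapping2(x)). Embed via the kernel matrix on the
    landmarks, then project onto d1 random Gaussian directions."""
    d = len(landmarks)
    M = np.array([[K(zi, zj) for zj in landmarks] for zi in landmarks])
    L = np.linalg.cholesky(M + jitter * np.eye(d))     # mapping 2 factor
    R = rng.standard_normal((d, d1)) / np.sqrt(d1)     # JL projection matrix
    def F3(x):
        f1 = np.array([K(x, z) for z in landmarks])    # mapping 1
        f2 = np.linalg.solve(L, f1)                    # mapping 2
        return f2 @ R                                  # random projection
    return F3
```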
17. Lower bound (on the necessity of access to D)
- For an arbitrary black-box kernel K, one can't hope to convert to a small feature space without access to D.
- Consider X = {0,1}^n, a random X' ⊆ X of size 2^{n/2}, and D uniform over X'.
- c = an arbitrary function (so learning is hopeless).
- But we have this magic kernel K(x,y) = φ(x)·φ(y):
  - φ(x) = (1, 0) if x ∉ X'.
  - φ(x) = (-½, √3/2) if x ∈ X' and c(x) = pos.
  - φ(x) = (-½, -√3/2) if x ∈ X' and c(x) = neg.
- P is separable with margin √3/2 in φ-space.
- But without access to D, whp all attempts at running K(x,y) will give an answer of 1.
18. Open problems
- For specific natural kernels, like K(x,y) = (1 + x·y)^m, is there an efficient analog of JL, without needing access to D?
- Or, at the very least, can one reduce the sample complexity? (Use fewer accesses to D.)
- Can one extend the results (e.g., mapping 1: x → (K(x,z1), ..., K(x,zd))) to more general similarity functions K?
- It is not exactly clear what the theorem statement would look like.