1
Kernels, Margins, and Low-dimensional Mappings
  • Maria-Florina Balcan, Avrim Blum, Santosh Vempala

NIPS 2007 Workshop on TOPOLOGY LEARNING
2
Generic problem
  • Given a set of images, want to learn a linear
    separator to distinguish men from women.
  • Problem: pixel representation no good.
  • Old-style advice:
  • Pick a better set of features!
  • But seems ad hoc. Not scientific.
  • New-style advice:
  • Use a kernel! K(x,y) = φ(x)·φ(y). φ is an
    implicit, high-dimensional mapping.
  • Feels more scientific. Many algorithms can be
    kernelized. Use the magic of the implicit
    high-diml space. Don't pay for it if there
    exists a large-margin separator.

3
Generic problem
  • Old-style advice:
  • Pick a better set of features!
  • But seems ad hoc. Not scientific.
  • New-style advice:
  • Use a kernel! K(x,y) = φ(x)·φ(y). φ is an
    implicit, high-dimensional mapping.
  • Feels more scientific. Many algorithms can be
    kernelized. Use the magic of the implicit
    high-diml space. Don't pay for it if there
    exists a large-margin separator.
  • E.g., K(x,y) = (x·y + 1)^m maps φ: (n-diml
    space) → (n^m-diml space). (See the small
    explicit check sketched below.)
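
A minimal numpy check (not from the slides) that the polynomial
kernel K(x,y) = (x·y + 1)^m really is an inner product of explicit
monomial features, for the small case n = 2, m = 2; the explicit
feature map used here is the standard one for this kernel:

import numpy as np

def poly_kernel(x, y, m=2):
    """The slide's example kernel: K(x,y) = (x.y + 1)^m."""
    return (np.dot(x, y) + 1) ** m

def explicit_phi(x):
    """Explicit features for n = 2, m = 2:
    phi(x) = (x1^2, x2^2, sqrt(2) x1 x2, sqrt(2) x1, sqrt(2) x2, 1)."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([x1**2, x2**2, s * x1 * x2, s * x1, s * x2, 1.0])

x, y = np.array([0.3, -1.2]), np.array([2.0, 0.5])
assert np.isclose(poly_kernel(x, y), explicit_phi(x) @ explicit_phi(y))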

4
Claim
  • Can view the new method as a way of conducting
    the old method.
  • Given a kernel as a black-box program K(x,y)
    and access to typical inputs (samples from D):
  • Claim: can run K and reverse-engineer an explicit
    (small) set of features, such that if K is good
    (∃ a large-margin separator in φ-space for D,c),
    then this is a good feature set (∃ an
    almost-as-good separator).
  • "You give me a kernel, I give you a set of
    features."
  • Do this using the idea of random projection.

5
Claim
  • Can view the new method as a way of conducting
    the old method.
  • Given a kernel as a black-box program K(x,y)
    and access to typical inputs (samples from D):
  • Claim: can run K and reverse-engineer an explicit
    (small) set of features, such that if K is good
    (∃ a large-margin separator in φ-space for D,c),
    then this is a good feature set (∃ an
    almost-as-good separator).
  • E.g., sample z1,...,zd from D. Given x, define
    xi = K(x,zi). (See the sketch below.)
  • Implications:
  • Practical: an alternative to kernelizing the
    algorithm.
  • Conceptual: view the kernel as a (principled) way
    of doing feature generation. View it as a
    similarity function, rather than as the magic
    power of an implicit high-dimensional space.
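
A minimal sketch of this feature construction ("mapping 1") in
numpy; the kernel, the data array, and the number of landmarks d
are placeholders to be supplied by the user:

import numpy as np

def mapping1(K, landmarks, X):
    """Mapping 1 from the talk: the row for each x in X is
    (K(x,z1), ..., K(x,zd)), where z1,...,zd are unlabeled samples
    drawn from D."""
    return np.array([[K(x, z) for z in landmarks] for x in X])

# Usage sketch (data, d, and some_kernel are assumptions):
# rng = np.random.default_rng(0)
# Z = data[rng.choice(len(data), size=d, replace=False)]  # z1..zd ~ D
# features = mapping1(some_kernel, Z, data)  # feed to any linear learner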

6
Basic setup, definitions
  • Instance space X.
  • Distribution D, target c. Use P = (D,c).
  • K(x,y) = φ(x)·φ(y).
  • P is separable with margin γ in φ-space if
    ∃ w, |w| = 1, s.t.
    Pr_{(x,l)~P}[ l·(w·φ(x)/|φ(x)|) < γ ] = 0.
  • Error ε at margin γ: replace the 0 with ε.
    (See the sketch below.)

Goal is to use K to get a mapping to a low-diml
space.
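
An empirical version of this definition as a small numpy helper (an
illustration with explicit φ, not from the slides); Phi holds φ(x)
as rows and labels are ±1:

import numpy as np

def error_at_margin(w, Phi, labels, gamma):
    """Fraction of examples with l * (w . phi(x) / |phi(x)|) < gamma,
    assuming |w| = 1: the gamma-margin error from the definition above."""
    margins = labels * (Phi @ w) / np.linalg.norm(Phi, axis=1)
    return np.mean(margins < gamma)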
7
One idea: the Johnson-Lindenstrauss lemma
  • If P is separable with margin γ in φ-space, then
    with prob 1-δ, a random linear projection down to
    a space of dimension d = O((1/γ²) log[1/(δε)])
    will have a linear separator of error < ε.
    [Arriaga & Vempala]
  • If the random vectors are r1,r2,...,rd, then can
    view as features xi = φ(x)·ri. (Sketch below.)
  • Problem: this uses φ. Can we do it directly, using
    K as a black-box, without computing φ?
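
What that projection would look like if φ were explicit (a sketch;
explicit access to φ is exactly what the last bullet says we do not
want to assume):

import numpy as np

def jl_project(Phi, d, seed=0):
    """Random linear projection of explicit phi-space vectors (rows of
    Phi, in R^N) down to d dimensions: feature x_i = phi(x) . r_i for
    Gaussian r_i, scaled by 1/sqrt(d) as is standard for JL."""
    rng = np.random.default_rng(seed)
    N = Phi.shape[1]
    R = rng.normal(size=(N, d)) / np.sqrt(d)  # columns are r_1,...,r_d
    return Phi @ R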

8
3 methods (from simplest to best)
  • 1. Draw d examples z1,...,zd from D. Use
    F(x) = (K(x,z1), ..., K(x,zd)). So, xi = K(x,zi).
    For d = (8/ε)[1/γ² + ln 1/δ], if P was separable
    with margin γ in φ-space, then whp this will be
    separable with error ε. (But this method doesn't
    preserve the margin.)
  • 2. Same d, but a little more complicated.
    Separable with error ε at margin γ/2.
  • 3. Combine (2) with a further projection as in the
    JL lemma. Get d with log dependence on 1/ε,
    rather than linear. So, can set ε ≪ 1/d.

All these methods need access to D, unlike JL.
Can this be removed? We show NO for generic K,
but may be possible for natural K.
9
Key fact
  • Claim: If ∃ a perfect w of margin γ in φ-space,
    then if we draw z1,...,zd from D for
    d = (8/ε)[1/γ² + ln 1/δ], whp (1-δ) there exists
    w' in span(φ(z1),...,φ(zd)) of error ε at
    margin γ/2.
  • Proof: Let S = examples drawn so far. Assume
    |w| = 1 and |φ(z)| = 1 ∀ z.
  • w_in = proj(w, span(S)), w_out = w - w_in.
    (Helper sketched below.)
  • Say w_out is "large" if Pr_z(w_out·φ(z) ≥ γ/2) ≥ ε;
    else "small".
  • If small, then done: w' = w_in.
  • Else, the next z has at least ε prob of improving S:

        |w_out|² ← |w_out|² - (γ/2)²

  • This can happen at most 4/γ² times. ∎
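
A tiny numpy helper (an illustration with explicit φ, not part of
the talk) for the w_in / w_out decomposition used in this proof:

import numpy as np

def split_w(w, Z_phi):
    """Decompose w into w_in = proj(w, span(S)) and w_out = w - w_in,
    where the rows of Z_phi are phi(z) for the z in S."""
    a, *_ = np.linalg.lstsq(Z_phi.T, w, rcond=None)  # coefficients on phi(z_i)
    w_in = Z_phi.T @ a
    return w_in, w - w_in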

10
So....
  • If we draw z1,...,zd from D for
    d = (8/ε)[1/γ² + ln 1/δ], then whp there exists w'
    in span(φ(z1),...,φ(zd)) of error ε at margin γ/2.
  • So, for some w' = α1φ(z1) + ... + αdφ(zd),
  • Pr_{(x,l)~P}[ sign(w'·φ(x)) ≠ l ] ≤ ε.
  • But notice that w'·φ(x) = α1K(x,z1) + ... +
    αdK(x,zd).
  • ⇒ the vector (α1,...,αd) is an ε-good separator in
    the feature space xi = K(x,zi). (Sketch below.)
  • But the margin is not preserved, because the
    lengths of the target and of the examples are not
    preserved.
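
The observation above, as a sketch: classifying with such a w' needs
only kernel evaluations against z1,...,zd (here `alphas` and
`landmarks` are whatever a linear learner produced on the mapping-1
features):

import numpy as np

def predict_with_alphas(K, landmarks, alphas, x):
    """sign(w'.phi(x)) = sign(sum_i alpha_i K(x, z_i))."""
    return np.sign(sum(a * K(x, z) for a, z in zip(alphas, landmarks)))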

11
How to preserve margin? (mapping 2)
  • We know ∃ w' in span(φ(z1),...,φ(zd)) of error
    ε at margin γ/2.
  • So, given a new x, just want to do an orthogonal
    projection of φ(x) into that span. (This preserves
    the dot-product with w' and can only decrease
    |φ(x)|, so it only increases the margin.)
  • Run K(zi,zj) for all i,j = 1,...,d. Get the
    matrix M.
  • Decompose M = UᵀU.
  • (Mapping 2) = (mapping 1) U⁻¹. ∎

12
Mapping 2, Details
  • Draw a set S = {z1, ..., zd} of
    d = (8/ε)[1/γ² + ln 1/δ] unlabeled examples
    from D.
  • Run K(x,y) for all x,y ∈ S; get
    M(S) = (K(zi,zj))_{zi,zj ∈ S}.
  • Place S into d-dim. space based on K (or M(S)).

[Figure: z1, z2, z3 placed in R^d by the embedding,
with |F2(zi)|² = K(zi,zi) and F2(zi)·F2(zj) = K(zi,zj).]
13
Mapping 2, Details (cont'd)
  • What to do with new points?
  • Extend the embedding F1 to all of X:
  • consider F2: X → R^d defined as follows: for x ∈ X,
    let F2(x) ∈ R^d be the point of smallest length
    such that F2(x)·F2(zi) = K(x,zi) for all
    i ∈ {1, ..., d}.
  • This mapping is equivalent to orthogonally
    projecting φ(x) down to span(φ(z1),..., φ(zd)).
    (Sketch below.)
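
A numpy sketch of Mapping 2 following the recipe above (the small
`jitter` is an assumption of this sketch, only there to keep M
numerically positive definite):

import numpy as np

def fit_mapping2(K, landmarks, jitter=1e-10):
    """Build M = (K(z_i, z_j)), factor M = U^T U by Cholesky, and
    return F2 with F2(x) . F2(z_i) = K(x, z_i) for every landmark."""
    d = len(landmarks)
    M = np.array([[K(zi, zj) for zj in landmarks] for zi in landmarks])
    L = np.linalg.cholesky(M + jitter * np.eye(d))  # M = L L^T, so U = L^T

    def F2(x):
        k_x = np.array([K(x, z) for z in landmarks])  # this is mapping 1
        return np.linalg.solve(L, k_x)                # (mapping 1) U^{-1}
    return F2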

14
How to improve dimension?
  • The current mapping (F2) gives
    d = (8/ε)[1/γ² + ln 1/δ].
  • Johnson-Lindenstrauss gives
    d1 = O((1/γ²) log 1/(δε)). Nice, because d1
    depends only logarithmically on 1/ε.
  • Answer: just combine the two...
  • Run Mapping 2, then do a random projection down
    from that.
  • Gives us the desired dimension (# features),
    though the sample-complexity remains as in
    mapping 2.

15
[Figure: labeled points (x's and o's) in φ-space R^N
are mapped by F2 into R^d, then by a JL random
projection into R^{d1}; F denotes the combined
mapping.]
16
Mapping 3
  • Do JL(mapping2(x)). (Sketch below.)
  • JL says: fix y,w. A random projection M down to a
    space of dimension O((1/γ²) log 1/δ') will, with
    prob (1-δ'), preserve the margin of y up to γ/4.
  • Use δ' = εδ.
  • ⇒ For all y, Pr_M[failure on y] < εδ,
  • ⇒ Pr_{D,M}[failure on y] < εδ,
  • ⇒ Pr_M[fail on prob mass ≥ ε] < δ.
  • So, we get the desired dimension (# features),
    though the sample-complexity remains as in
    mapping 2.
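
A sketch of the composition (reusing fit_mapping2 from the earlier
sketch; d is the number of landmarks, d1 the target JL dimension):

import numpy as np

def mapping3(F2, d, d1, seed=0):
    """Mapping 3 = JL o Mapping 2: map x into R^d with F2, then apply
    a Gaussian random projection down to R^{d1}."""
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(d, d1)) / np.sqrt(d1)
    return lambda x: F2(x) @ R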

17
Lower bound (on necessity of access to D)
  • For an arbitrary black-box kernel K, we can't hope
    to convert to a small feature space without access
    to D.
  • Consider X = {0,1}^n, a random X' ⊆ X of size
    2^{n/2}, and D uniform over X'.
  • c = an arbitrary function (so learning is
    hopeless).
  • But we have this magic kernel K(x,y) = φ(x)·φ(y):
  • φ(x) = (1, 0) if x ∉ X'.
  • φ(x) = (-1/2, √3/2) if x ∈ X', c(x) = pos.
  • φ(x) = (-1/2, -√3/2) if x ∈ X', c(x) = neg.
  • P is separable with margin √3/2 in φ-space.
  • But, without access to D, whp all attempts at
    running K(x,y) will give an answer of 1.
    (Sketch below.)
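
The construction in code form (a sketch; `in_Xprime` and `c` are
stand-in oracles for the hidden random set X' and the arbitrary
target, which the learner never gets to see):

import numpy as np

def make_phi(in_Xprime, c):
    """phi from the construction above: (1,0) off X', and one of two
    unit vectors 120 degrees apart on X', according to the +/-1 label."""
    s = np.sqrt(3.0) / 2.0
    def phi(x):
        if not in_Xprime(x):
            return np.array([1.0, 0.0])
        return np.array([-0.5, s]) if c(x) > 0 else np.array([-0.5, -s])
    return phi

# K(x,y) = phi(x) . phi(y).  Without samples from D (supported on X'),
# any x, y the learner constructs lie outside X' with overwhelming
# probability, so every query returns K(x,y) = 1 and reveals nothing.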

18
Open Problems
  • For specific natural kernels, like K(x,y) =
    (1 + x·y)^m, is there an efficient analog to JL
    without needing access to D?
  • Or, at least, can one reduce the sample
    complexity? (Use fewer accesses to D.)
  • Can one extend the results (e.g., mapping 1:
    x → (K(x,z1), ..., K(x,zd))) to more general
    similarity functions K?
  • It is not exactly clear what the theorem statement
    would look like.