On Kernels, Margins, and Low-dimensional Mappings

Transcript and Presenter's Notes

1
On Kernels, Margins, and Low-dimensional Mappings
  • or
  • Kernels versus features

Nina Balcan (CMU), Avrim Blum (CMU), Santosh Vempala (MIT)
2
Generic problem
  • Given a set of images, want to learn a linear
    separator to distinguish men from women.
  • Problem: pixel representation no good.
  • Old style advice:
  • Pick a better set of features!
  • But seems ad hoc. Not scientific.
  • New style advice:
  • Use a Kernel! K(x,y) = f(x)·f(y). f is an
    implicit, high-dimensional mapping.
  • Sounds more scientific. Many algorithms can be
    kernelized. Use the "magic" of the implicit
    high-dimensional space. Don't pay for it if
    there exists a large-margin separator.

3
Generic problem
  • Old style advice:
  • Pick a better set of features!
  • But seems ad hoc. Not scientific.
  • New style advice:
  • Use a Kernel! K(x,y) = f(x)·f(y). f is an
    implicit, high-dimensional mapping.
  • Sounds more scientific. Many algorithms can be
    kernelized. Use the "magic" of the implicit
    high-dimensional space. Don't pay for it if
    there exists a large-margin separator.
  • E.g., K(x,y) = (x·y + 1)^m. f: (n-dimensional
    space) → (n^m-dimensional space). (Sketch below.)
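As an illustration of the last bullet, here is a minimal numpy sketch (not from the slides; n = 3 and m = 2 are arbitrary choices) checking that the implicit kernel value (x·y + 1)^2 equals a dot product of explicit monomial features:

    import numpy as np
    from itertools import combinations

    def poly_kernel(x, y, m=2):
        # Implicit evaluation: K(x, y) = (x.y + 1)^m
        return (np.dot(x, y) + 1.0) ** m

    def explicit_features_deg2(x):
        # Explicit map for m = 2: constant, linear, squared, and cross terms,
        # scaled so that phi(x).phi(y) == (x.y + 1)^2
        feats = [1.0]
        feats += list(np.sqrt(2.0) * x)
        feats += list(x ** 2)
        feats += [np.sqrt(2.0) * x[i] * x[j]
                  for i, j in combinations(range(len(x)), 2)]
        return np.array(feats)

    rng = np.random.default_rng(0)
    x, y = rng.normal(size=3), rng.normal(size=3)
    # The two numbers agree: the kernel is a dot product in a higher-dim space.
    print(poly_kernel(x, y), explicit_features_deg2(x) @ explicit_features_deg2(y))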

4
Main point of this work
  • Can view the new method as a way of conducting
    the old method.
  • Given a kernel [as a black-box program K(x,y)]
    and access to typical inputs [samples from D],
  • Claim: can run K and reverse-engineer an explicit
    (small) set of features, such that if K is good
    [∃ a large-margin separator in f-space for D, c],
    then this is a good feature set [∃ an
    almost-as-good separator].
  • "You give me a kernel, I give you a set of
    features."

5
Main point of this work
  • Can view the new method as a way of conducting
    the old method.
  • Given a kernel [as a black-box program K(x,y)]
    and access to typical inputs [samples from D],
  • Claim: can run K and reverse-engineer an explicit
    (small) set of features, such that if K is good
    [∃ a large-margin separator in f-space for D, c],
    then this is a good feature set [∃ an
    almost-as-good separator].
  • E.g., sample z1,...,zd from D. Given x, define
    xi = K(x,zi). (Sketch below.)
  • Implications:
  • Practical: alternative to kernelizing the
    algorithm.
  • Conceptual: view the kernel as a (principled) way
    of doing feature generation. View it as a
    similarity function, rather than as the magic
    power of an implicit high-dimensional space.
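A minimal sketch of this feature construction (mapping 1). The RBF kernel, the data, and the sizes below are illustrative stand-ins, not anything prescribed by the slides:

    import numpy as np

    def mapping1(K, X_new, landmarks):
        # Mapping 1: x -> (K(x, z1), ..., K(x, zd)) for landmarks z1,...,zd drawn from D.
        return np.array([[K(x, z) for z in landmarks] for x in X_new])

    K = lambda x, y: np.exp(-np.sum((x - y) ** 2))   # any black-box kernel works
    rng = np.random.default_rng(0)
    data = rng.normal(size=(200, 5))                  # stand-in for samples from D
    landmarks = data[rng.choice(len(data), size=20, replace=False)]   # z1,...,zd
    features = mapping1(K, data, landmarks)           # 200 x 20 explicit feature matrix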

6
Basic setup, definitions
  • Instance space X.
  • Distribution D, target c. Use P = (D,c).
  • K(x,y) = f(x)·f(y).
  • P is separable with margin γ in f-space if ∃ w
    s.t. Pr_{(x,ℓ)~P}[ℓ·(w·f(x)) < γ] = 0
    (normalizing ‖w‖ = 1, ‖f(x)‖ = 1).
  • Error ε at margin γ: replace the 0 with ε.
    (See the small check below.)

Goal: use K to get a mapping to a low-dimensional
space.
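A small numeric rendering of these definitions, assuming an explicit matrix F of (unit-length) f(x) rows and labels ℓ in {−1, +1}; the helper name and setup are illustrative only:

    import numpy as np

    def error_at_margin(w, F, labels, gamma):
        # Empirical 'error eps at margin gamma': the fraction of examples with
        # l * (w . f(x)) < gamma, assuming ||w|| = ||f(x)|| = 1.
        margins = labels * (F @ w)
        return np.mean(margins < gamma)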
7
Idea: Johnson-Lindenstrauss lemma
  • If P is separable with margin γ in f-space, then
    with probability 1−δ, a random linear projection
    down to a space of dimension
    d = O((1/γ²) log(1/(δε))) will have a linear
    separator of error < ε. [AV]
  • If the projection vectors are r1,r2,...,rd, then
    can view them as features: xi = f(x)·ri.
    (Sketch below.)
  • Problem: this uses f. Can we do it directly,
    using K as a black box, without computing f?
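One standard way to realize this projection, as a hedged sketch: it still assumes explicit access to the f(x) rows, which is exactly the obstacle raised in the last bullet:

    import numpy as np

    def jl_project(F, d, seed=0):
        # Random linear projection to d dimensions: x_i = f(x) . r_i,
        # with the r_i i.i.d. Gaussian (one common choice of JL projection).
        rng = np.random.default_rng(seed)
        R = rng.normal(size=(F.shape[1], d)) / np.sqrt(d)   # columns play the role of r_i
        return F @ R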

8
3 methods (from simplest to best)
  • Draw d examples z1,...,zd from D. Use
  • F(x) = (K(x,z1), ..., K(x,zd)). So, xi = K(x,zi).
  • For d = (8/ε)[1/γ² + ln(1/δ)], if P was
    separable with margin γ in f-space, then w.h.p.
    this will be separable with error ε (but this
    method doesn't preserve the margin). (See the
    helper sketch below.)
  • Same d, but a little more complicated: separable
    with error ε at margin γ/2.
  • Combine (2) with a further projection as in the
    JL lemma. Get d with a log dependence on 1/ε,
    rather than linear. So, can set ε = 1/d.

All these methods need access to D, unlike JL.
Can this be removed? We show the answer is NO for a
generic K, but it may be possible for natural K.
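For concreteness, the sample size d stated in method (1) can be computed as follows (a direct transcription of the slide's formula; the example numbers are arbitrary):

    import numpy as np

    def num_landmarks(eps, gamma, delta):
        # d = (8/eps) * (1/gamma^2 + ln(1/delta)), as on the slide.
        return int(np.ceil((8.0 / eps) * (1.0 / gamma ** 2 + np.log(1.0 / delta))))

    print(num_landmarks(eps=0.1, gamma=0.2, delta=0.05))   # -> 2240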
9
Actually, the argument is pretty easy...
  • (though we did try a lot of things first that
    didn't work...)

10
Key fact
  • Claim: If ∃ a perfect w of margin γ in f-space,
    then if we draw z1,...,zd from D for
    d = (8/ε)[1/γ² + ln(1/δ)], w.h.p. (1−δ) there
    exists w' in span(f(z1),...,f(zd)) of error ε at
    margin γ/2.
  • Proof: Let S = the examples drawn so far. Assume
    ‖w‖ = 1, ‖f(z)‖ = 1 ∀ z.
  • w_in = proj(w, span(S)), w_out = w − w_in.
  • Say w_out is large if Pr_z(w_out·f(z) ≥ γ/2) ≥ ε;
    else small.
  • If small, then done: w' = w_in.
  • Else, the next z has probability at least ε of
    improving S:

    ‖w_out‖² ← ‖w_out‖² − (γ/2)²
  • Since ‖w_out‖² starts at most 1, this can happen
    at most 4/γ² times. ∎ (Simulation sketch below.)
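A small simulation of this potential argument. It is purely illustrative: it works directly in an explicit f-space (which the actual construction avoids), with a unit target w and unit examples filtered to have margin at least γ; each time a drawn z passes the γ/2 test, ‖w_out‖² drops by at least (γ/2)²:

    import numpy as np

    rng = np.random.default_rng(0)
    n, gamma = 20, 0.3

    w = rng.normal(size=n); w /= np.linalg.norm(w)     # unit target in f-space

    def draw_z():
        # Rejection-sample unit vectors whose margin w.r.t. w is at least gamma.
        while True:
            z = rng.normal(size=n); z /= np.linalg.norm(z)
            if abs(w @ z) >= gamma:
                return z

    S = np.zeros((0, n))                               # the f(z_i) added to the span so far
    for t in range(200):
        w_in = S.T @ np.linalg.lstsq(S.T, w, rcond=None)[0] if len(S) else np.zeros(n)
        w_out = w - w_in                               # part of w orthogonal to span(S)
        z = draw_z()
        if abs(w_out @ z) >= gamma / 2:                # z "improves" the span
            S = np.vstack([S, z])
            print(t, len(S), round(float(np.linalg.norm(w_out) ** 2), 3))
    # ||w_out||^2 starts at 1 and shrinks by at least (gamma/2)^2 per addition,
    # so additions can happen at most 4/gamma^2 times.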

11
So....
  • If we draw z1,...,zd from D for
    d = (8/ε)[1/γ² + ln(1/δ)], then w.h.p. there
    exists w' in span(f(z1),...,f(zd)) of error ε at
    margin γ/2.
  • So, for some w' = a1·f(z1) + ... + ad·f(zd),
  • Pr_{(x,ℓ)~P}[sign(w'·f(x)) ≠ ℓ] ≤ ε.
  • But notice that w'·f(x) = a1·K(x,z1) + ... +
    ad·K(x,zd).
  • ⇒ the vector (a1,...,ad) is an ε-good separator
    in the feature space xi = K(x,zi). (Learning
    sketch below.)
  • But the margin is not preserved, because of the
    lengths of the target and the examples.
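A sketch of learning the coefficient vector (a1,...,ad) directly in this feature space, using a plain perceptron as a stand-in for whatever linear-separator algorithm one prefers (the slides do not prescribe one). F is the output of mapping 1 from the earlier sketch; labels are in {−1, +1}:

    import numpy as np

    def learn_coefficients(F, labels, epochs=20):
        # Perceptron over the explicit features x_i = K(x, z_i).
        a = np.zeros(F.shape[1])
        for _ in range(epochs):
            for x, l in zip(F, labels):
                if l * (a @ x) <= 0:        # mistake -> update
                    a += l * x
        return a

    # a @ [K(x, z_1), ..., K(x, z_d)] then plays the role of
    # w'.f(x) = a_1 K(x, z_1) + ... + a_d K(x, z_d).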

12
How to preserve margin? (mapping 2)
  • We know ∃ w' in span(f(z1),...,f(zd)) of error
    ε at margin γ/2.
  • So, given a new x, just do an orthogonal
    projection of f(x) into that span. (This
    preserves the dot product with w' and can only
    decrease ‖f(x)‖, so it only increases the
    margin.)
  • Run K(zi,zj) for all i,j = 1,...,d. Get the
    matrix M.
  • Decompose M = UᵀU.
  • (Mapping 2) = (Mapping 1)·U⁻¹. ∎ (Sketch below.)
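A sketch of mapping 2 following this recipe. The Cholesky factorization supplies M = UᵀU (here M = LLᵀ with U = Lᵀ); the tiny ridge added to M is only for numerical safety and is not part of the slide:

    import numpy as np

    def mapping2(K, X_new, landmarks, jitter=1e-10):
        # Mapping 2 = (Mapping 1) * U^{-1}, where M[i,j] = K(z_i, z_j) and M = U^T U.
        # Inner products of mapping-2 vectors equal dot products of the orthogonal
        # projections of f(x) onto span(f(z_1), ..., f(z_d)).
        F1 = np.array([[K(x, z) for z in landmarks] for x in X_new])    # mapping 1
        M = np.array([[K(zi, zj) for zj in landmarks] for zi in landmarks])
        L = np.linalg.cholesky(M + jitter * np.eye(len(landmarks)))     # M = L L^T, U = L^T
        return np.linalg.solve(L, F1.T).T                               # = F1 @ U^{-1}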

13
How to improve dimension?
  • The current mapping gives
    d = (8/ε)[1/γ² + ln(1/δ)].
  • Johnson-Lindenstrauss gives
    d = O((1/γ²) log(1/(δε))).
  • JL is nice because we can have ε = 1/d. Good if
    the algorithm wants the data to be perfectly
    separable.
  • (Learning a separator of margin γ can be done in
    time poly(1/γ), but if no perfect separator
    exists, minimizing error is NP-hard.)
  • Answer: just combine the two...

14
[Figure: labeled points (X's and O's) in R^N are mapped by F1 into R^{d1}, and then by a JL projection into R^d.]
15
Mapping 3
  • Do JL(mapping2(x)).
  • JL says: fix y, w. A random projection M down to
    a space of dimension O((1/γ²) log(1/δ)) will,
    with probability (1−δ), preserve the margin of y
    up to γ/4.
  • Use δ' = εδ (apply JL with failure probability εδ).
  • ⇒ For all y, Pr_M[failure on y] < εδ,
  • ⇒ Pr_{D,M}[failure on y] < εδ,
  • ⇒ Pr_M[fail on ≥ ε probability mass] < δ.
  • So, we get the desired dimension (# of features),
    though the sample complexity remains as in
    mapping 2. (Sketch below.)
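A sketch of mapping 3, composing the mapping-2 sketch above with a random (JL) projection down to the target dimension d_out; the Gaussian projection is one standard choice:

    import numpy as np

    def mapping3(K, X_new, landmarks, d_out, seed=0):
        # Mapping 3 = JL(mapping2(x)): project the mapping-2 output down to
        # d_out = O((1/gamma^2) log(1/(delta*eps))) dimensions.
        F2 = mapping2(K, X_new, landmarks)              # from the mapping-2 sketch
        rng = np.random.default_rng(seed)
        R = rng.normal(size=(F2.shape[1], d_out)) / np.sqrt(d_out)
        return F2 @ R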

16
Lower bound (on necessity of access to D)
  • For an arbitrary black-box kernel K, we can't
    hope to convert to a small feature space without
    access to D.
  • Consider X = {0,1}^n, a random X' ⊆ X of size
    2^{n/2}, and D uniform over X'.
  • c = an arbitrary function on X' (so learning is
    hopeless).
  • But we have this magic kernel K(x,y) = f(x)·f(y):
  • f(x) = (1, 0) if x ∉ X'.
  • f(x) = (−1/2, √3/2) if x ∈ X', c(x) = pos.
  • f(x) = (−1/2, −√3/2) if x ∈ X', c(x) = neg.
  • P is separable with margin √3/2 in f-space.
  • But, without access to D, all attempts at running
    K(x,y) will give an answer of 1 (w.h.p. no
    queried point lands in the random set X').
    (Sketch below.)
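A tiny numeric check of this construction. The separator w = (0, 1) is one concrete choice (not named on the slide) attaining margin √3/2 on the X' points, and any kernel query between points outside X' returns 1:

    import numpy as np

    f_outside = np.array([1.0, 0.0])                   # f(x) for x not in X'
    f_pos = np.array([-0.5,  np.sqrt(3) / 2])          # x in X', c(x) = pos
    f_neg = np.array([-0.5, -np.sqrt(3) / 2])          # x in X', c(x) = neg

    w = np.array([0.0, 1.0])                           # unit separator for the X' points
    print(w @ f_pos, w @ f_neg)                        # +0.866..., -0.866...: margin sqrt(3)/2
    print(f_outside @ f_outside)                       # 1.0 = K(x, y) for any x, y outside X'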

17
Open Problems
  • For specific, natural kernels,
  • like K(x,y) = (1 + x·y)^m,
  • is there an efficient (probability distribution
    over) mapping that is good for any P = (c,D) for
    which the kernel is good?
  • I.e., an efficient analog of JL for these
    kernels.
  • Or, at least, can these mappings be constructed
    using less sample complexity (fewer accesses to
    D)?