1
Random projection, margins, kernels, and
feature-selection
  • Avrim Blum
  • Carnegie Mellon University

Portions of this are joint work with Nina Balcan
and Santosh Vempala
and portions have nothing to do with me at all
PASCAL workshop on Subspace, Latent Structure
and Feature Selection 2005
2
Random Projection
  • A simple technique that's been very useful in
    approximation algorithms.
  • Plan for today: discuss how it can give insight
    into problems/topics in machine learning:
  • Margins
  • Why is having a large margin such a good thing?
  • Kernels
  • What are kernels really doing for us?
  • Feature selection/construction
  • Especially the connection to kernels and margins

3
Random Projection
  • Given n points in Euclidean space like R^n,
    project down to a random k-dimensional subspace
    for k << n.
  • If k is medium-size, like O((1/ε²) log n), then
    this approximately preserves many interesting
    quantities.
  • If k is small, like 1, then we can often still
    get something useful.

We'll see aspects of both here.
4
Uses in approximation algorithms
  • Random projection is used in two main ways:
  • Dimensionality reduction via the
    Johnson-Lindenstrauss Lemma
  • Given n points in Euclidean space, if we project
    randomly to a space of dimension k = O((1/ε²) log n),
    then whp all relative distances are preserved up to
    1 ± ε. (A small sketch follows this list.)
  • E.g., use for fast approximate nearest-neighbor.
  • Randomized rounding (e.g., of SDPs)
  • Max-cut, graph coloring, graph layout problems
  • E.g., to round max-cut, just pick a random
    hyperplane.
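
To make the dimensionality-reduction use concrete, here is a minimal NumPy sketch (my addition, not from the slides) that projects points onto a random k-dimensional subspace and checks how well pairwise distances are preserved; the point count, dimensions, ε, and the constant inside k are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n, N, eps = 5000, 100, 0.25                 # ambient dimension, number of points, distortion
k = int(np.ceil(16 * np.log(N) / eps**2))   # k = O((1/eps^2) log N); the constant 16 is just a safe illustrative choice

X = rng.normal(size=(N, n))                 # N arbitrary points in R^n

# Random Gaussian projection, scaled so squared lengths are preserved in expectation.
R = rng.normal(size=(n, k)) / np.sqrt(k)
Y = X @ R

def pairwise_dists(Z):
    sq = (Z ** 2).sum(axis=1)
    D2 = sq[:, None] + sq[None, :] - 2 * (Z @ Z.T)
    return np.sqrt(np.maximum(D2, 0.0))

iu = np.triu_indices(N, k=1)
ratios = pairwise_dists(Y)[iu] / pairwise_dists(X)[iu]
print("distance ratios lie in [%.3f, %.3f]" % (ratios.min(), ratios.max()))
# Whp, all ratios stay within (or very near) [1 - eps, 1 + eps].
```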

Now, on to how it can be used in machine learning.
5
Basic Supervised learning setting
  • Examples are points x in an instance space, like R^n.
  • Labeled + or -.
  • Assume examples are drawn from some probability
    distribution:
  • A distribution P over pairs (x, l), or
  • A distribution D over x, labeled by a target
    function c.
  • Given labeled training data, we want the algorithm
    to do well on new data.

6
Margins
  • If the data is separable by a large margin γ,
    that's a good thing: we need sample size only
    Õ(1/γ²).
  • Some ways to see it:
  • The perceptron algorithm does well: it makes only
    1/γ² mistakes (a small sketch follows below).
  • Modern margin bounds: whp, all consistent
    large-margin separators have low true error.
  • Random projection

w·x/‖x‖ ≥ γ,  ‖w‖ = 1
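
As a concrete illustration of the perceptron bullet above, here is a minimal sketch (mine, not part of the slides) on synthetic unit-length data with margin γ; the classic bound says the number of mistakes is at most 1/γ².

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic linearly separable data with margin gamma (illustrative setup).
n, N, gamma = 20, 500, 0.1
w_star = rng.normal(size=n); w_star /= np.linalg.norm(w_star)
X = rng.normal(size=(N, n)); X /= np.linalg.norm(X, axis=1, keepdims=True)
m = X @ w_star
keep = np.abs(m) >= gamma                  # keep only points with margin >= gamma
X, y = X[keep], np.sign(m[keep])

# Perceptron: update on every mistake; the mistake bound is 1/gamma^2.
w, mistakes = np.zeros(n), 0
for _ in range(50):                        # a few passes over the data
    for x, label in zip(X, y):
        if label * (w @ x) <= 0:
            w += label * x
            mistakes += 1

print("mistakes:", mistakes, "   bound 1/gamma^2:", int(1 / gamma**2))
```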
7
JL Lemma
  • Given n points in R^n, if we project randomly to
    R^k, for k = O((1/ε²) log n), then whp all pairwise
    distances are preserved up to 1 ± ε (after scaling
    by √(n/k)).
  • Cleanest proofs: [IM98], [DG99]

8
JL Lemma
Given n points in R^n, if we project randomly to R^k,
for k = O((1/ε²) log n), then whp all pairwise
distances are preserved up to 1 ± ε (after
scaling). Cleanest proofs: [IM98], [DG99]
  • Proof intuition:
  • Consider a random unit-length vector
    (x1, x2, ..., xn) ∈ R^n. What does the x1
    coordinate look like?
  • E[x1²] = 1/n; typically x1 is on the order of
    ±1/√n.
  • If the coordinates were independent,
    Pr( |x1² + ... + xk² − k/n| ≥ εk/n ) ≤ e^{−O(kε²)}.
  • So, at k = O((1/ε²) log n), with probability
    1 − 1/poly(n), the projection onto the first k
    coordinates has length √(k/n)·(1 ± ε).
  • Now, apply this to each vector v_ij = p_i − p_j,
    projecting onto a random k-dimensional space.

Whp all v_ij project to length √(k/n)·(1 ± ε)·‖v_ij‖.
9
JL Lemma, cont.
  • The proof is easiest for a slightly different
    projection:
  • Pick k vectors u1, ..., uk iid from the
    n-dimensional Gaussian.
  • Map p → (p·u1, ..., p·uk).
  • What happens to v_ij = p_i − p_j?
  • It becomes (v_ij·u1, ..., v_ij·uk).
  • Each component is an iid draw from a 1-dimensional
    Gaussian, scaled by ‖v_ij‖.
  • Plug in the concentration bound for a sum of
    squares of iid Gaussian RVs.
  • So, whp all lengths are approximately preserved,
    and in fact it's not hard to see that whp all
    angles are approximately preserved too.

10
Random projection and margins
  • A natural connection [AV99]:
  • Suppose we have a set S of points in R^n,
    separable by margin γ.
  • The JL lemma says that if we project to a random
    k-dimensional space for k = O((1/γ²) log |S|), whp
    the data is still separable (by margin γ/2).
  • Think of projecting the points and the target
    vector w.
  • Angles between the p_i and w change by at most
    ≈ γ/2.
  • We could have picked the projection before
    sampling the data.
  • So, it's really just a k-dimensional problem
    after all.

So, that's one way random projections can help us
think about margins.
11
Random projection and margins
  • Here's another way random projections can help us
    think about why a large margin is a good thing.
  • Consider the following simple learning algorithm:
  • Pick a random hyperplane.
  • See if it is any good.
  • If it is a weak learner (error rate ≤ 1/2 − γ/4),
    plug it into boosting. Else don't. Repeat.
  • Claim: if the data has a large-margin separator,
    there's a reasonable chance a random hyperplane
    will be a weak learner.

12
Random projection and margins
  • Claim: if the data has a separator of margin γ,
    there's a reasonable chance a random hyperplane
    will have error ≤ 1/2 − γ/4. (A small sketch of
    this test follows below.)
  • Proof:
  • Pick a (positive) example x. Consider the 2-d
    plane defined by x and the target w.
  • Pr_h( h·x ≤ 0 | h·w ≥ 0 )
    ≤ (π/2 − γ)/π = 1/2 − γ/π.
  • So, E_h[ min(err(h), err(−h)) ] ≤ 1/2 − γ/π.
  • Since min(err(h), err(−h)) is bounded between 0 and
    1/2, there must be a reasonable chance of success.
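
A minimal sketch (not from the talk) of the "pick a random hyperplane and test it" idea in the claim above; the synthetic data, the margin, and the 1/2 − γ/4 threshold are all chosen just for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic unit-length data separable by w_star with margin gamma (illustrative).
n, N, gamma = 10, 2000, 0.15
w_star = rng.normal(size=n); w_star /= np.linalg.norm(w_star)
X = rng.normal(size=(N, n)); X /= np.linalg.norm(X, axis=1, keepdims=True)
m = X @ w_star
mask = np.abs(m) >= gamma
X, y = X[mask], np.sign(m[mask])

def error(h):
    return np.mean(np.sign(X @ h) != y)

# Repeatedly pick a random hyperplane; keep it if it (or its negation) is a weak learner.
threshold = 0.5 - gamma / 4
for trial in range(1000):
    h = rng.normal(size=n)
    err = min(error(h), error(-h))
    if err <= threshold:
        print("trial %d: weak learner with error %.3f <= %.3f" % (trial, err, threshold))
        break
```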

13
Application to Semi-Supervised Learning [BB]
  • In Co-Training, under admittedly strong
    assumptions (independence given the label), one can
    boost a weak hypothesis h from unlabeled data. [BM]
  • Iterative Co-Training: use labeled data to make an
    initial h, then unlabeled data to bootstrap.
  • Random projection shows: if the target is a
    large-margin separator, we can randomly choose
    initial hypotheses, use unlabeled data to
    bootstrap, and then use labeled data to pick.
  • Only requires O(1) labeled examples.
  • Can even do this without needing a large margin,
    using fancier tricks (outlier removal, rescaling).

Of course, this just shows how strong the assumption is.
14
OK, now on to kernels and feature selection
15
Generic problem
  • Given a set of images, we want to learn a linear
    separator to distinguish men from women.
  • Problem: the pixel representation is no good.
  • Old-style advice:
  • Pick a better set of features!
  • But seems ad hoc. Not scientific.
  • New-style advice:
  • Use a kernel! K(x, y) = φ(x)·φ(y). φ is an
    implicit, high-dimensional mapping.
  • Feels more scientific. Many algorithms can be
    kernelized. Use the magic of the implicit
    high-dimensional space. Don't pay for it if there
    exists a large-margin separator.

16
Generic problem
  • Old-style advice:
  • Pick a better set of features!
  • But seems ad hoc. Not scientific.
  • New-style advice:
  • Use a kernel! K(x, y) = φ(x)·φ(y). φ is an
    implicit, high-dimensional mapping.
  • Feels more scientific. Many algorithms can be
    kernelized. Use the magic of the implicit
    high-dimensional space. Don't pay for it if there
    exists a large-margin separator.
  • E.g., K(x,y) = (x·y + 1)^m. φ: (n-dimensional
    space) → (n^m-dimensional space). (A small sketch
    of this kernel follows below.)
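
A small sketch (my addition, not in the slides) checking the polynomial-kernel idea for m = 2: the explicit degree-2 feature map below is one standard choice, and the point is that K(x, y) equals the dot product in that φ-space without ever constructing the map.

```python
import numpy as np

rng = np.random.default_rng(3)

def poly_kernel(x, y, m=2):
    """K(x, y) = (x.y + 1)^m, computed without any explicit feature map."""
    return (x @ y + 1.0) ** m

def phi_degree2(x):
    """One explicit feature map for m = 2: all products x_i*x_j, then sqrt(2)*x_i, then 1."""
    return np.concatenate([np.outer(x, x).ravel(), np.sqrt(2.0) * x, [1.0]])

x, y = rng.normal(size=5), rng.normal(size=5)
print(poly_kernel(x, y, m=2))             # kernel value, O(n) work
print(phi_degree2(x) @ phi_degree2(y))    # same value via the explicit (n^2 + n + 1)-dim map
```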

17
Claim
  • We can view the new method as a way of conducting
    the old method.
  • Given a kernel as a black-box program K(x,y)
    and access to typical inputs (samples from D),
  • Claim: we can run K and reverse-engineer an
    explicit (small) set of features, such that if K is
    good (∃ a large-margin separator in φ-space for
    (D,c)), then this is a good feature set (∃ an
    almost-as-good separator).
  • You give me a kernel, I give you a set of
    features.
  • We do this using the idea of random projection.

18
Claim
  • We can view the new method as a way of conducting
    the old method.
  • Given a kernel as a black-box program K(x,y)
    and access to typical inputs (samples from D),
  • Claim: we can run K and reverse-engineer an
    explicit (small) set of features, such that if K is
    good (∃ a large-margin separator in φ-space for
    (D,c)), then this is a good feature set (∃ an
    almost-as-good separator).
  • E.g., sample z1,...,zd from D. Given x, define
    x_i = K(x, z_i). (A small sketch follows this
    list.)
  • Implications:
  • Practical: an alternative to kernelizing the
    algorithm.
  • Conceptual: view the kernel as a (principled) way
    of doing feature generation; view it as a
    similarity function, rather than as the magic power
    of an implicit high-dimensional space.
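
A minimal sketch of the feature construction just described (my own, with made-up data and an RBF kernel standing in for the black-box K): draw z1,...,zd from D, map each x to (K(x,z1), ..., K(x,zd)), and train an ordinary linear separator on those features.

```python
import numpy as np

rng = np.random.default_rng(4)

def K(x, z, sigma2=5.0):
    """Black-box kernel; an RBF kernel with bandwidth sigma2 is used purely as an example."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma2))

# Synthetic distribution D and target c (two shifted Gaussian blobs, illustrative only).
N, n, d = 400, 5, 30
X = rng.normal(size=(N, n)) + np.where(rng.random(N) < 0.5, 1.5, -1.5)[:, None]
y = np.sign(X.sum(axis=1))

# Mapping 1: draw z_1,...,z_d from D and use F(x) = (K(x,z_1), ..., K(x,z_d)).
Z = X[rng.choice(N, size=d, replace=False)]
F = np.array([[K(x, z) for z in Z] for x in X])

# Train any ordinary linear separator on the new features (least squares here).
w, *_ = np.linalg.lstsq(F, y, rcond=None)
print("training error:", np.mean(np.sign(F @ w) != y))
```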

19
Basic setup, definitions
  • Instance space X.
  • Distribution D, target c. Use P = (D,c).
  • K(x,y) = φ(x)·φ(y).
  • P is separable with margin γ in φ-space if ∃ w
    s.t. Pr_{(x,l)∈P}[ l·(w·φ(x)/‖φ(x)‖) < γ ] = 0
    (with ‖w‖ = 1).
  • Error ε at margin γ: replace the 0 with ε.

Goal: use K to get a mapping to a low-dimensional
space.
20
One idea: the Johnson-Lindenstrauss lemma
  • If P is separable with margin γ in φ-space, then
    with probability 1−δ, a random linear projection
    down to a space of dimension d = O((1/γ²) log 1/(δε))
    will have a linear separator of error < ε. [AV]
  • If the projection vectors are r1, r2, ..., rd, then
    we can view the features as x_i = φ(x)·r_i.
  • Problem: this uses φ. Can we do it directly, using
    K as a black box, without computing φ?

21
3 methods (from simplest to best)
  • (1) Draw d examples z1,...,zd from D. Use
    F(x) = (K(x,z1), ..., K(x,zd)). So, x_i = K(x,z_i).
  • For d = (8/ε)[1/γ² + ln 1/δ], if P was separable
    with margin γ in φ-space, then whp this will be
    separable with error ε. (But this method doesn't
    preserve the margin.)
  • (2) Same d, but a little more complicated.
    Separable with error ε at margin γ/2.
  • (3) Combine (2) with a further projection as in the
    JL lemma. Get d with logarithmic dependence on 1/ε,
    rather than linear. So, we can set ε = 1/d.

All these methods need access to D, unlike JL.
Can this be removed? We show NO for generic K,
but it may be possible for natural K.
22
Actually, the argument is pretty easy...
  • (though we did try a lot of things first that
    didn't work...)

23
Key fact
  • Claim: If ∃ a perfect w of margin γ in φ-space,
    then if we draw z1,...,zd ∈ D for
    d = (8/ε)[1/γ² + ln 1/δ], whp (1−δ) there exists
    w' in span(φ(z1),...,φ(zd)) of error ε at margin
    γ/2.
  • Proof: Not hard, but it's getting late...

24
Key fact
  • Claim: If ∃ a perfect w of margin γ in φ-space,
    then if we draw z1,...,zd ∈ D for
    d = (8/ε)[1/γ² + ln 1/δ], whp (1−δ) there exists
    w' in span(φ(z1),...,φ(zd)) of error ε at margin
    γ/2.
  • Proof: Let S = the examples drawn so far. Assume
    ‖w‖ = 1 and ‖φ(z)‖ ≤ 1 ∀ z.
  • w_in = proj(w, span(S)), w_out = w − w_in.
  • Say w_out is large if Pr_z( |w_out·φ(z)| ≥ γ/2 ) ≥ ε;
    else small.
  • If small, then we're done: use w_in.
  • Else, the next z has probability at least ε of
    improving S:

‖w_out‖² ← ‖w_out‖² − (γ/2)²
  • This can happen at most 4/γ² times. ∎

25
So....
  • If we draw z1,...,zd ∈ D for
    d = (8/ε)[1/γ² + ln 1/δ], then whp there exists w'
    in span(φ(z1),...,φ(zd)) of error ε at margin γ/2.
  • So, for some w' = a1·φ(z1) + ... + ad·φ(zd),
  • Pr_{(x,l)∈P}[ sign(w'·φ(x)) ≠ l ] ≤ ε.
  • But notice that w'·φ(x) = a1·K(x,z1) + ... +
    ad·K(x,zd).
  • ⇒ the vector (a1,...,ad) is an ε-good separator in
    the feature space x_i = K(x,z_i).
  • But the margin is not preserved, because the
    lengths of the target and of the examples are not
    preserved.

26
How to preserve margin? (mapping 2)
  • We know ∃ w' in span(φ(z1),...,φ(zd)) of error ε
    at margin γ/2.
  • So, given a new x, we just want to do an orthogonal
    projection of φ(x) onto that span. (This preserves
    the dot product with w', and can only decrease
    ‖φ(x)‖, so it only increases the margin.)
  • Run K(z_i,z_j) for all i,j = 1,...,d. Get the
    matrix M.
  • Decompose M = UᵀU.
  • (Mapping 2) = (Mapping 1)·U⁻¹. ∎ (A small sketch
    follows below.)
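
A sketch of Mapping 2 (mine, again using an RBF kernel and random z's purely as stand-ins): build the kernel matrix M with M_ij = K(z_i, z_j), factor it as M = UᵀU (a Cholesky factorization works when M is positive definite), and post-multiply Mapping 1 by U⁻¹, so that dot products between the new feature vectors match dot products between the orthogonal projections of the φ(x)'s onto span(φ(z1),...,φ(zd)).

```python
import numpy as np

rng = np.random.default_rng(5)

def K(x, z, sigma2=5.0):
    """Black-box kernel; an RBF kernel is used here only as an example."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma2))

n, d = 5, 20
Z = rng.normal(size=(d, n))                     # z_1,...,z_d drawn from D (made up here)

# Kernel matrix M_ij = K(z_i, z_j), factored as M = U^T U.
M = np.array([[K(zi, zj) for zj in Z] for zi in Z])
M += 1e-10 * np.eye(d)                          # tiny jitter so the Cholesky factorization succeeds
U = np.linalg.cholesky(M).T                     # upper-triangular U with M = U^T U

def mapping1(x):
    return np.array([K(x, z) for z in Z])       # (K(x,z_1), ..., K(x,z_d))

def mapping2(x):
    # Mapping 2 = Mapping 1 * U^{-1}; solve a triangular system instead of inverting U.
    return np.linalg.solve(U.T, mapping1(x))

x, y = rng.normal(size=n), rng.normal(size=n)
# Sanity check: mapping2(x).mapping2(y) equals mapping1(x)^T M^{-1} mapping1(y),
# the inner product of the projections of phi(x) and phi(y) onto the span.
print(mapping2(x) @ mapping2(y))
print(mapping1(x) @ np.linalg.solve(M, mapping1(y)))
```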

27
How to improve dimension?
  • The current mapping gives d = (8/ε)[1/γ² + ln 1/δ].
  • Johnson-Lindenstrauss gives d = O((1/γ²) log 1/(δε)).
    Nice because we can have d ≪ 1/ε.
  • Answer: just combine the two...
  • Run Mapping 2, then do a random projection down
    from that.
  • Gives us the desired dimension (number of
    features), though the sample complexity remains as
    in Mapping 2.

28
[Figure: labeled points (x's and o's) in R^N are mapped by F1 into R^{d1}, and then
by a JL random projection into R^d.]
29
Mapping 3
  • Do JL(mapping2(x)).
  • JL says fix y,w. Random projection M down to
    space of dimension O(1/g2 log 1/d) will with
    prob (1-d) preserve margin of y up to g/4.
  • Use d ed.
  • ) For all y, PrMfailure on y lt ed,
  • ) PrD, Mfailure on y lt ed,
  • ) PrMfail on prob mass e lt d.
  • So, we get desired dimension ( features), though
    sample-complexity remains as in mapping 2.
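
A sketch of Mapping 3 (mine, with made-up sizes): take the Mapping 2 features and push them through a further Gaussian random projection down to the target dimension, as in the JL lemma. The RBF kernel again just stands in for the black-box K.

```python
import numpy as np

rng = np.random.default_rng(6)

def K(x, z, sigma2=5.0):                        # example kernel (RBF), as before
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma2))

n, d1, d = 5, 40, 10                            # ambient dim, Mapping-2 dim, final JL dim
Z = rng.normal(size=(d1, n))                    # z_1,...,z_{d1} drawn from D (made up)

M = np.array([[K(a, b) for b in Z] for a in Z]) + 1e-10 * np.eye(d1)
U = np.linalg.cholesky(M).T                     # M = U^T U

def mapping2(x):                                # features whose dot products match projections onto span(phi(z_i))
    return np.linalg.solve(U.T, np.array([K(x, z) for z in Z]))

# Mapping 3 = JL o Mapping 2: a random Gaussian projection from R^{d1} down to R^d.
R = rng.normal(size=(d1, d)) / np.sqrt(d)

def mapping3(x):
    return mapping2(x) @ R

x = rng.normal(size=n)
print(mapping3(x).shape)                        # (d,) features, computed using only K and samples
```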

30
Lower bound (on necessity of access to D)
  • For an arbitrary black-box kernel K, we can't hope
    to convert to a small feature space without access
    to D.
  • Consider X = {0,1}^n, a random X' ⊆ X of size
    2^{n/2}, and D uniform over X'.
  • c = an arbitrary function on X' (so learning is
    hopeless).
  • But we have this magic kernel K(x,y) = φ(x)·φ(y):
  • φ(x) = (1, 0) if x ∉ X'.
  • φ(x) = (−1/2, √3/2) if x ∈ X', c(x) = pos.
  • φ(x) = (−1/2, −√3/2) if x ∈ X', c(x) = neg.
  • P is separable with margin √3/2 in φ-space.
  • But, without access to D, whp every x we construct
    ourselves lies outside X', so all attempts at
    running K(x,y) will give an answer of 1.

31
Open Problems
  • For specific natural kernels, like
    K(x,y) = (1 + x·y)^m, is there an efficient analog
    of JL, without needing access to D?
  • Or, can one at least reduce the sample complexity?
    (Use fewer accesses to D.)
  • Can one extend the results (e.g., Mapping 1:
    x → (K(x,z1), ..., K(x,zd))) to more general
    similarity functions K?
  • It's not exactly clear what the theorem statement
    would look like.