1
Random projection, margins, kernels, and
feature-selection
  • Avrim Blum
  • Carnegie Mellon University

Portions of this are joint work with Nina Balcan
and Santosh Vempala
and portions have nothing to do with me at all
PASCAL workshop on Subspace, Latent Structure
and Feature Selection 2005
2
Random Projection
  • A simple technique that's been very useful in
    approximation algorithms.
  • Plan for today: discuss how it can give insight
    into problems/topics in machine learning:
  • Margins
  • Why is having a large margin such a good thing?
  • Kernels
  • What are kernels really doing for us?
  • Feature selection/construction
  • Especially the connection to kernels and margins

3
Random Projection
  • Given n points in Euclidean space like R^n,
    project down to a random k-dimensional subspace
    for k << n.
  • If k is medium-size, like O((1/ε²) log n), then
    this approximately preserves many interesting
    quantities.
  • If k is small, like 1, then we can often still
    get something useful.

We'll see aspects of both here.
4
Uses in approximation algorithms
  • Random projection is used in two main ways:
  • Dimensionality reduction via the
    Johnson-Lindenstrauss Lemma
  • Given n points in Euclidean space, if we project
    randomly to a space of dimension k = O((1/ε²) log n),
    then whp all relative distances are preserved up to
    1 ± ε. (A small sketch follows this list.)
  • E.g., use for fast approximate nearest-neighbor.
  • Randomized rounding (e.g., of SDPs)
  • Max-cut, graph coloring, graph layout problems
  • E.g., to round max-cut, just pick a random
    hyperplane.
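
To make the dimensionality-reduction use concrete, here is a minimal NumPy sketch (my addition, not from the slides) that projects points onto a random k-dimensional subspace and checks how well pairwise distances are preserved; the point count, dimensions, ε, and the constant inside k are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n, N, eps = 5000, 100, 0.25                 # ambient dimension, number of points, distortion
k = int(np.ceil(16 * np.log(N) / eps**2))   # k = O((1/eps^2) log N); the constant 16 is just a safe illustrative choice

X = rng.normal(size=(N, n))                 # N arbitrary points in R^n

# Random Gaussian projection, scaled so squared lengths are preserved in expectation.
R = rng.normal(size=(n, k)) / np.sqrt(k)
Y = X @ R

def pairwise_dists(Z):
    sq = (Z ** 2).sum(axis=1)
    D2 = sq[:, None] + sq[None, :] - 2 * (Z @ Z.T)
    return np.sqrt(np.maximum(D2, 0.0))

iu = np.triu_indices(N, k=1)
ratios = pairwise_dists(Y)[iu] / pairwise_dists(X)[iu]
print("distance ratios lie in [%.3f, %.3f]" % (ratios.min(), ratios.max()))
# Whp, all ratios stay within (or very near) [1 - eps, 1 + eps].
```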

Now, on to how it can be used in machine learning.
5
Basic Supervised learning setting
  • Examples are points x in an instance space, like R^n.
  • Labeled + or -.
  • Assume examples are drawn from some probability
    distribution:
  • A distribution P over pairs (x, l), or
  • A distribution D over x, labeled by a target
    function c.
  • Given labeled training data, we want the algorithm
    to do well on new data.

6
Margins
  • If the data is separable by a large margin γ,
    that's a good thing: we need sample size only
    Õ(1/γ²).
  • Some ways to see it:
  • The perceptron algorithm does well: it makes only
    1/γ² mistakes (a small sketch follows below).
  • Modern margin bounds: whp, all consistent
    large-margin separators have low true error.
  • Random projection

w·x/‖x‖ ≥ γ,  ‖w‖ = 1
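
As a concrete illustration of the perceptron bullet above, here is a minimal sketch (mine, not part of the slides) on synthetic unit-length data with margin γ; the classic bound says the number of mistakes is at most 1/γ².

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic linearly separable data with margin gamma (illustrative setup).
n, N, gamma = 20, 500, 0.1
w_star = rng.normal(size=n); w_star /= np.linalg.norm(w_star)
X = rng.normal(size=(N, n)); X /= np.linalg.norm(X, axis=1, keepdims=True)
m = X @ w_star
keep = np.abs(m) >= gamma                  # keep only points with margin >= gamma
X, y = X[keep], np.sign(m[keep])

# Perceptron: update on every mistake; the mistake bound is 1/gamma^2.
w, mistakes = np.zeros(n), 0
for _ in range(50):                        # a few passes over the data
    for x, label in zip(X, y):
        if label * (w @ x) <= 0:
            w += label * x
            mistakes += 1

print("mistakes:", mistakes, "   bound 1/gamma^2:", int(1 / gamma**2))
```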
7
JL Lemma
  • Given n points in R^n, if we project randomly to
    R^k, for k = O((1/ε²) log n), then whp all pairwise
    distances are preserved up to 1 ± ε (after scaling
    by √(n/k)).
  • Cleanest proofs: [IM98], [DG99]

8
JL Lemma
Given n points in R^n, if we project randomly to R^k,
for k = O((1/ε²) log n), then whp all pairwise
distances are preserved up to 1 ± ε (after
scaling). Cleanest proofs: [IM98], [DG99]
  • Proof intuition:
  • Consider a random unit-length vector
    (x1, x2, ..., xn) ∈ R^n. What does the x1
    coordinate look like?
  • E[x1²] = 1/n; typically x1 is on the order of
    ±1/√n.
  • If the coordinates were independent,
    Pr( |x1² + ... + xk² − k/n| ≥ εk/n ) ≤ e^{−O(kε²)}.
  • So, at k = O((1/ε²) log n), with probability
    1 − 1/poly(n), the projection onto the first k
    coordinates has length √(k/n)·(1 ± ε).
  • Now, apply this to each vector v_ij = p_i − p_j,
    projecting onto a random k-dimensional space.

Whp all v_ij project to length √(k/n)·(1 ± ε)·‖v_ij‖.
9
JL Lemma, cont.
  • The proof is easiest for a slightly different
    projection:
  • Pick k vectors u1, ..., uk iid from the
    n-dimensional Gaussian.
  • Map p → (p·u1, ..., p·uk).
  • What happens to v_ij = p_i − p_j?
  • It becomes (v_ij·u1, ..., v_ij·uk).
  • Each component is an iid draw from a 1-dimensional
    Gaussian, scaled by ‖v_ij‖.
  • Plug in the concentration bound for a sum of
    squares of iid Gaussian RVs.
  • So, whp all lengths are approximately preserved,
    and in fact it's not hard to see that whp all
    angles are approximately preserved too.

10
Random projection and margins
  • A natural connection [AV99]:
  • Suppose we have a set S of points in R^n,
    separable by margin γ.
  • The JL lemma says that if we project to a random
    k-dimensional space for k = O((1/γ²) log |S|), whp
    the data is still separable (by margin γ/2).
  • Think of projecting the points and the target
    vector w.
  • Angles between the p_i and w change by at most
    ≈ γ/2.
  • We could have picked the projection before
    sampling the data.
  • So, it's really just a k-dimensional problem
    after all.

So, that's one way random projections can help us
think about margins.
11
Random projection and margins
  • Here's another way random projections can help us
    think about why a large margin is a good thing.
  • Consider the following simple learning algorithm:
  • Pick a random hyperplane.
  • See if it is any good.
  • If it is a weak learner (error rate ≤ 1/2 − γ/4),
    plug it into boosting. Else don't. Repeat.
  • Claim: if the data has a large-margin separator,
    there's a reasonable chance a random hyperplane
    will be a weak learner.

12
Random projection and margins
  • Claim: if the data has a separator of margin γ,
    there's a reasonable chance a random hyperplane
    will have error ≤ 1/2 − γ/4. (A small sketch of
    this test follows below.)
  • Proof:
  • Pick a (positive) example x. Consider the 2-d
    plane defined by x and the target w.
  • Pr_h( h·x ≤ 0 | h·w ≥ 0 )
    ≤ (π/2 − γ)/π = 1/2 − γ/π.
  • So, E_h[ min(err(h), err(−h)) ] ≤ 1/2 − γ/π.
  • Since min(err(h), err(−h)) is bounded between 0 and
    1/2, there must be a reasonable chance of success.
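
A minimal sketch (not from the talk) of the "pick a random hyperplane and test it" idea in the claim above; the synthetic data, the margin, and the 1/2 − γ/4 threshold are all chosen just for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic unit-length data separable by w_star with margin gamma (illustrative).
n, N, gamma = 10, 2000, 0.15
w_star = rng.normal(size=n); w_star /= np.linalg.norm(w_star)
X = rng.normal(size=(N, n)); X /= np.linalg.norm(X, axis=1, keepdims=True)
m = X @ w_star
mask = np.abs(m) >= gamma
X, y = X[mask], np.sign(m[mask])

def error(h):
    return np.mean(np.sign(X @ h) != y)

# Repeatedly pick a random hyperplane; keep it if it (or its negation) is a weak learner.
threshold = 0.5 - gamma / 4
for trial in range(1000):
    h = rng.normal(size=n)
    err = min(error(h), error(-h))
    if err <= threshold:
        print("trial %d: weak learner with error %.3f <= %.3f" % (trial, err, threshold))
        break
```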

13
Application to Semi-Supervised Learning [BB]
  • In Co-Training, under admittedly strong
    assumptions (independence given the label), one can
    boost a weak hypothesis h from unlabeled data. [BM]
  • Iterative Co-Training: use labeled data to make an
    initial h, then unlabeled data to bootstrap.
  • Random projection shows: if the target is a
    large-margin separator, we can randomly choose
    initial hypotheses, use unlabeled data to
    bootstrap, and then use labeled data to pick.
  • Only requires O(1) labeled examples.
  • Can even do this without needing a large margin,
    using fancier tricks (outlier removal, rescaling).

Of course, this just shows how strong the assumption is.
14
OK, now on to kernels and feature selection
15
Generic problem
  • Given a set of images, we want to learn a linear
    separator to distinguish men from women.
  • Problem: the pixel representation is no good.
  • Old-style advice:
  • Pick a better set of features!
  • But seems ad hoc. Not scientific.
  • New-style advice:
  • Use a kernel! K(x, y) = φ(x)·φ(y). φ is an
    implicit, high-dimensional mapping.
  • Feels more scientific. Many algorithms can be
    kernelized. Use the magic of the implicit
    high-dimensional space. Don't pay for it if there
    exists a large-margin separator.

16
Generic problem
  • Old-style advice:
  • Pick a better set of features!
  • But seems ad hoc. Not scientific.
  • New-style advice:
  • Use a kernel! K(x, y) = φ(x)·φ(y). φ is an
    implicit, high-dimensional mapping.
  • Feels more scientific. Many algorithms can be
    kernelized. Use the magic of the implicit
    high-dimensional space. Don't pay for it if there
    exists a large-margin separator.
  • E.g., K(x,y) = (x·y + 1)^m. φ: (n-dimensional
    space) → (n^m-dimensional space). (A small sketch
    of this kernel follows below.)
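
A small sketch (my addition, not in the slides) checking the polynomial-kernel idea for m = 2: the explicit degree-2 feature map below is one standard choice, and the point is that K(x, y) equals the dot product in that φ-space without ever constructing the map.

```python
import numpy as np

rng = np.random.default_rng(3)

def poly_kernel(x, y, m=2):
    """K(x, y) = (x.y + 1)^m, computed without any explicit feature map."""
    return (x @ y + 1.0) ** m

def phi_degree2(x):
    """One explicit feature map for m = 2: all products x_i*x_j, then sqrt(2)*x_i, then 1."""
    return np.concatenate([np.outer(x, x).ravel(), np.sqrt(2.0) * x, [1.0]])

x, y = rng.normal(size=5), rng.normal(size=5)
print(poly_kernel(x, y, m=2))             # kernel value, O(n) work
print(phi_degree2(x) @ phi_degree2(y))    # same value via the explicit (n^2 + n + 1)-dim map
```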

17
Claim
  • We can view the new method as a way of conducting
    the old method.
  • Given a kernel as a black-box program K(x,y)
    and access to typical inputs (samples from D),
  • Claim: we can run K and reverse-engineer an
    explicit (small) set of features, such that if K is
    good (∃ a large-margin separator in φ-space for
    (D,c)), then this is a good feature set (∃ an
    almost-as-good separator).
  • You give me a kernel, I give you a set of
    features.
  • We do this using the idea of random projection.

18
Claim
  • We can view the new method as a way of conducting
    the old method.
  • Given a kernel as a black-box program K(x,y)
    and access to typical inputs (samples from D),
  • Claim: we can run K and reverse-engineer an
    explicit (small) set of features, such that if K is
    good (∃ a large-margin separator in φ-space for
    (D,c)), then this is a good feature set (∃ an
    almost-as-good separator).
  • E.g., sample z1,...,zd from D. Given x, define
    x_i = K(x, z_i). (A small sketch follows this
    list.)
  • Implications:
  • Practical: an alternative to kernelizing the
    algorithm.
  • Conceptual: view the kernel as a (principled) way
    of doing feature generation; view it as a
    similarity function, rather than as the magic power
    of an implicit high-dimensional space.
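
A minimal sketch of the feature construction just described (my own, with made-up data and an RBF kernel standing in for the black-box K): draw z1,...,zd from D, map each x to (K(x,z1), ..., K(x,zd)), and train an ordinary linear separator on those features.

```python
import numpy as np

rng = np.random.default_rng(4)

def K(x, z, sigma2=5.0):
    """Black-box kernel; an RBF kernel with bandwidth sigma2 is used purely as an example."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma2))

# Synthetic distribution D and target c (two shifted Gaussian blobs, illustrative only).
N, n, d = 400, 5, 30
X = rng.normal(size=(N, n)) + np.where(rng.random(N) < 0.5, 1.5, -1.5)[:, None]
y = np.sign(X.sum(axis=1))

# Mapping 1: draw z_1,...,z_d from D and use F(x) = (K(x,z_1), ..., K(x,z_d)).
Z = X[rng.choice(N, size=d, replace=False)]
F = np.array([[K(x, z) for z in Z] for x in X])

# Train any ordinary linear separator on the new features (least squares here).
w, *_ = np.linalg.lstsq(F, y, rcond=None)
print("training error:", np.mean(np.sign(F @ w) != y))
```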

19
Basic setup, definitions
  • Instance space X.
  • Distribution D, target c. Use P = (D,c).
  • K(x,y) = φ(x)·φ(y).
  • P is separable with margin γ in φ-space if ∃ w
    s.t. Pr_{(x,l)∈P}[ l·(w·φ(x)/‖φ(x)‖) < γ ] = 0
    (with ‖w‖ = 1).
  • Error ε at margin γ: replace the 0 with ε.

Goal: use K to get a mapping to a low-dimensional
space.
20
One idea: the Johnson-Lindenstrauss lemma
  • If P is separable with margin γ in φ-space, then
    with probability 1−δ, a random linear projection
    down to a space of dimension d = O((1/γ²) log 1/(δε))
    will have a linear separator of error < ε. [AV]
  • If the projection vectors are r1, r2, ..., rd, then
    we can view the features as x_i = φ(x)·r_i.
  • Problem: this uses φ. Can we do it directly, using
    K as a black box, without computing φ?

21
3 methods (from simplest to best)
  • (1) Draw d examples z1,...,zd from D. Use
    F(x) = (K(x,z1), ..., K(x,zd)). So, x_i = K(x,z_i).
  • For d = (8/ε)[1/γ² + ln 1/δ], if P was separable
    with margin γ in φ-space, then whp this will be
    separable with error ε. (But this method doesn't
    preserve the margin.)
  • (2) Same d, but a little more complicated.
    Separable with error ε at margin γ/2.
  • (3) Combine (2) with a further projection as in the
    JL lemma. Get d with logarithmic dependence on 1/ε,
    rather than linear. So, we can set ε = 1/d.

All these methods need access to D, unlike JL.
Can this be removed? We show NO for generic K,
but it may be possible for natural K.
22
Actually, the argument is pretty easy...
  • (though we did try a lot of things first that
    didn't work...)

23
Key fact
  • Claim: If ∃ a perfect w of margin γ in φ-space,
    then if we draw z1,...,zd ∈ D for
    d = (8/ε)[1/γ² + ln 1/δ], whp (1−δ) there exists
    w' in span(φ(z1),...,φ(zd)) of error ε at margin
    γ/2.
  • Proof: Not hard, but it's getting late...

24
Key fact
  • Claim: If ∃ a perfect w of margin γ in φ-space,
    then if we draw z1,...,zd ∈ D for
    d = (8/ε)[1/γ² + ln 1/δ], whp (1−δ) there exists
    w' in span(φ(z1),...,φ(zd)) of error ε at margin
    γ/2.
  • Proof: Let S = the examples drawn so far. Assume
    ‖w‖ = 1 and ‖φ(z)‖ ≤ 1 ∀ z.
  • w_in = proj(w, span(S)), w_out = w − w_in.
  • Say w_out is large if Pr_z( |w_out·φ(z)| ≥ γ/2 ) ≥ ε;
    else small.
  • If small, then we're done: use w_in.
  • Else, the next z has probability at least ε of
    improving S:

‖w_out‖² ← ‖w_out‖² − (γ/2)²
  • This can happen at most 4/γ² times. ∎

25
So....
  • If we draw z1,...,zd ∈ D for
    d = (8/ε)[1/γ² + ln 1/δ], then whp there exists w'
    in span(φ(z1),...,φ(zd)) of error ε at margin γ/2.
  • So, for some w' = a1·φ(z1) + ... + ad·φ(zd),
  • Pr_{(x,l)∈P}[ sign(w'·φ(x)) ≠ l ] ≤ ε.
  • But notice that w'·φ(x) = a1·K(x,z1) + ... +
    ad·K(x,zd).
  • ⇒ the vector (a1,...,ad) is an ε-good separator in
    the feature space x_i = K(x,z_i).
  • But the margin is not preserved, because the
    lengths of the target and of the examples are not
    preserved.

26
How to preserve margin? (mapping 2)
  • We know ∃ w' in span(φ(z1),...,φ(zd)) of error ε
    at margin γ/2.
  • So, given a new x, we just want to do an orthogonal
    projection of φ(x) onto that span. (This preserves
    the dot product with w', and can only decrease
    ‖φ(x)‖, so it only increases the margin.)
  • Run K(z_i,z_j) for all i,j = 1,...,d. Get the
    matrix M.
  • Decompose M = UᵀU.
  • (Mapping 2) = (Mapping 1)·U⁻¹. ∎ (A small sketch
    follows below.)
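
A sketch of Mapping 2 (mine, again using an RBF kernel and random z's purely as stand-ins): build the kernel matrix M with M_ij = K(z_i, z_j), factor it as M = UᵀU (a Cholesky factorization works when M is positive definite), and post-multiply Mapping 1 by U⁻¹, so that dot products between the new feature vectors match dot products between the orthogonal projections of the φ(x)'s onto span(φ(z1),...,φ(zd)).

```python
import numpy as np

rng = np.random.default_rng(5)

def K(x, z, sigma2=5.0):
    """Black-box kernel; an RBF kernel is used here only as an example."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma2))

n, d = 5, 20
Z = rng.normal(size=(d, n))                     # z_1,...,z_d drawn from D (made up here)

# Kernel matrix M_ij = K(z_i, z_j), factored as M = U^T U.
M = np.array([[K(zi, zj) for zj in Z] for zi in Z])
M += 1e-10 * np.eye(d)                          # tiny jitter so the Cholesky factorization succeeds
U = np.linalg.cholesky(M).T                     # upper-triangular U with M = U^T U

def mapping1(x):
    return np.array([K(x, z) for z in Z])       # (K(x,z_1), ..., K(x,z_d))

def mapping2(x):
    # Mapping 2 = Mapping 1 * U^{-1}; solve a triangular system instead of inverting U.
    return np.linalg.solve(U.T, mapping1(x))

x, y = rng.normal(size=n), rng.normal(size=n)
# Sanity check: mapping2(x).mapping2(y) equals mapping1(x)^T M^{-1} mapping1(y),
# the inner product of the projections of phi(x) and phi(y) onto the span.
print(mapping2(x) @ mapping2(y))
print(mapping1(x) @ np.linalg.solve(M, mapping1(y)))
```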

27
How to improve dimension?
  • The current mapping gives d = (8/ε)[1/γ² + ln 1/δ].
  • Johnson-Lindenstrauss gives d = O((1/γ²) log 1/(δε)).
    Nice because we can have d ≪ 1/ε.
  • Answer: just combine the two...
  • Run Mapping 2, then do a random projection down
    from that.
  • Gives us the desired dimension (number of
    features), though the sample complexity remains as
    in Mapping 2.

28
[Figure: labeled points (x's and o's) in R^N are mapped by F1 into R^{d1}, and then
by a JL random projection into R^d.]
29
Mapping 3
  • Do JL(mapping2(x)).
  • JL says fix y,w. Random projection M down to
    space of dimension O(1/g2 log 1/d) will with
    prob (1-d) preserve margin of y up to g/4.
  • Use d ed.
  • ) For all y, PrMfailure on y lt ed,
  • ) PrD, Mfailure on y lt ed,
  • ) PrMfail on prob mass e lt d.
  • So, we get desired dimension ( features), though
    sample-complexity remains as in mapping 2.
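
A sketch of Mapping 3 (mine, with made-up sizes): take the Mapping 2 features and push them through a further Gaussian random projection down to the target dimension, as in the JL lemma. The RBF kernel again just stands in for the black-box K.

```python
import numpy as np

rng = np.random.default_rng(6)

def K(x, z, sigma2=5.0):                        # example kernel (RBF), as before
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma2))

n, d1, d = 5, 40, 10                            # ambient dim, Mapping-2 dim, final JL dim
Z = rng.normal(size=(d1, n))                    # z_1,...,z_{d1} drawn from D (made up)

M = np.array([[K(a, b) for b in Z] for a in Z]) + 1e-10 * np.eye(d1)
U = np.linalg.cholesky(M).T                     # M = U^T U

def mapping2(x):                                # features whose dot products match projections onto span(phi(z_i))
    return np.linalg.solve(U.T, np.array([K(x, z) for z in Z]))

# Mapping 3 = JL o Mapping 2: a random Gaussian projection from R^{d1} down to R^d.
R = rng.normal(size=(d1, d)) / np.sqrt(d)

def mapping3(x):
    return mapping2(x) @ R

x = rng.normal(size=n)
print(mapping3(x).shape)                        # (d,) features, computed using only K and samples
```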

30
Lower bound (on necessity of access to D)
  • For an arbitrary black-box kernel K, we can't hope
    to convert to a small feature space without access
    to D.
  • Consider X = {0,1}^n, a random X' ⊆ X of size
    2^{n/2}, and D uniform over X'.
  • c = an arbitrary function on X' (so learning is
    hopeless).
  • But we have this magic kernel K(x,y) = φ(x)·φ(y):
  • φ(x) = (1, 0) if x ∉ X'.
  • φ(x) = (−1/2, √3/2) if x ∈ X', c(x) = pos.
  • φ(x) = (−1/2, −√3/2) if x ∈ X', c(x) = neg.
  • P is separable with margin √3/2 in φ-space.
  • But, without access to D, whp every x we construct
    ourselves lies outside X', so all attempts at
    running K(x,y) will give an answer of 1.

31
Open Problems
  • For specific natural kernels, like
    K(x,y) = (1 + x·y)^m, is there an efficient analog
    of JL, without needing access to D?
  • Or, can one at least reduce the sample complexity?
    (Use fewer accesses to D.)
  • Can one extend the results (e.g., Mapping 1:
    x → (K(x,z1), ..., K(x,zd))) to more general
    similarity functions K?
  • It's not exactly clear what the theorem statement
    would look like.