Title: Random projection, margins, kernels, and feature selection
1. Random projection, margins, kernels, and feature selection
- Avrim Blum
- Carnegie Mellon University
Portions of this are joint work with Nina Balcan
and Santosh Vempala
and portions have nothing to do with me at all
PASCAL workshop on Subspace, Latent Structure
and Feature Selection 2005
2. Random Projection
- A simple technique that's been very useful in approximation algorithms.
- Plan for today: discuss how it can give insight into problems/topics in machine learning:
- Margins
- Why is having a large margin such a good thing?
- Kernels
- What are kernels really doing for us?
- Feature selection/construction
- Especially the connection to kernels and margins.
3. Random Projection
- Given n points in Euclidean space like R^n, project down to a random k-dimensional subspace for k << n.
- If k is medium-size, like O(ε⁻² log n), then the projection approximately preserves many interesting quantities.
- If k is small, like 1, then we can often still get something useful.
We'll see aspects of both here.
4. Uses in approximation algorithms
- Random projection is used in two main ways:
- Dimensionality reduction via the Johnson-Lindenstrauss Lemma
- Given n points in Euclidean space, if we project randomly to a space of dimension k = O(ε⁻² log n), then whp all relative distances are preserved up to 1 ± ε.
- E.g., use for fast approximate nearest-neighbor search.
- Randomized rounding (e.g., of SDPs)
- Max-cut, graph coloring, graph layout problems.
- E.g., to round max-cut, just pick a random hyperplane.
Now, on to how it can be used in machine learning.
5. Basic supervised learning setting
- Examples are points x in an instance space, like R^n.
- Labeled + or -.
- Assume examples are drawn from some probability distribution:
- a distribution P over (x, l), or
- a distribution D over x, labeled by a target function c.
- Given labeled training data, we want the algorithm to do well on new data.
6. Margins
- If the data is separable by a large margin γ, that's a good thing: we need a sample size of only Õ(1/γ²).
- Some ways to see it:
- The perceptron algorithm does well: it makes only 1/γ² mistakes (see the sketch below).
- Modern margin bounds: whp, all consistent large-margin separators have low true error.
- Random projection.
Margin condition: w·x/|x| ≥ γ, with |w| = 1.
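As a concrete illustration of the perceptron bullet, here is a minimal sketch (my own, in NumPy; the data-generation details are made up for the example): on unit-length examples separable by a unit-length target with margin γ, the number of perceptron mistakes stays below 1/γ².

import numpy as np

def perceptron(X, y, max_passes=100):
    # classic perceptron: update on each mistake, count total mistakes
    w = np.zeros(X.shape[1])
    mistakes = 0
    for _ in range(max_passes):
        clean = True
        for x, label in zip(X, y):
            if label * np.dot(w, x) <= 0:
                w += label * x
                mistakes += 1
                clean = False
        if clean:
            break
    return w, mistakes

rng = np.random.default_rng(0)
gamma = 0.2
w_star = np.array([1.0, 0.0, 0.0])                 # unit-length target
X = rng.normal(size=(500, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)      # unit-length examples
X = X[np.abs(X @ w_star) >= gamma]                 # enforce margin gamma
y = np.sign(X @ w_star)

w, mistakes = perceptron(X, y)
print(mistakes, "mistakes vs. bound 1/gamma^2 =", 1 / gamma**2)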
7. JL Lemma
- Given n points in R^n, if we project randomly to R^k, for k = O(ε⁻² log n), then whp all pairwise distances are preserved up to 1 ± ε (after scaling by √(n/k)).
- Cleanest proofs: [IM98], [DG99].
8. JL Lemma
Given n points in R^n, if we project randomly to R^k, for k = O(ε⁻² log n), then whp all pairwise distances are preserved up to 1 ± ε (after scaling). Cleanest proofs: [IM98], [DG99].
- Proof intuition:
- Consider a random unit-length vector (x1, x2, ..., xn) ∈ R^n. What does the x1 coordinate look like?
- E[x1²] = 1/n; typically x1² ≈ c/n for some constant c.
- If the coordinates were independent, Pr[ |x1² + ... + xk² − k/n| ≥ εk/n ] ≤ e^(-O(kε²)).
- So, at k = O(ε⁻² log n), with probability 1 − 1/poly(n), the projection onto the first k coordinates has length √(k/n)·(1 ± ε).
- Now, apply this to the vectors vij = pi − pj, projecting onto a random k-dimensional space.
Whp all vij project to length √(k/n)·(1 ± ε)·|vij|.
9. JL Lemma, cont'd
- The proof is easiest for a slightly different projection:
- Pick k vectors u1, ..., uk iid from an n-dimensional Gaussian.
- Map p → (p·u1, ..., p·uk).
- What happens to vij = pi − pj?
- It becomes (vij·u1, ..., vij·uk).
- Each component is iid from a 1-dimensional Gaussian, scaled by |vij|.
- Plug in the concentration bound for a sum of squares of iid Gaussian RVs.
- So, whp all lengths are approximately preserved, and in fact it is not hard to see that whp all angles are approximately preserved too. (A numerical sketch follows.)
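Here is a small numerical sketch of that Gaussian version (my own illustration; the constant 8 in the choice of k is an arbitrary stand-in, not the lemma's constant): project a set of points onto k iid Gaussian directions, rescale by 1/√k, and check the pairwise distance distortion.

import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, n_points, eps = 10_000, 50, 0.5
k = int(8 * np.log(n_points) / eps**2)        # k = O(eps^-2 log(#points)), constant chosen ad hoc

P = rng.normal(size=(n_points, n))            # points p_1,...,p_m in R^n
U = rng.normal(size=(n, k))                   # u_1,...,u_k iid from an n-diml Gaussian
Q = (P @ U) / np.sqrt(k)                      # p -> (p.u_1, ..., p.u_k), rescaled

ratios = [np.linalg.norm(Q[i] - Q[j]) / np.linalg.norm(P[i] - P[j])
          for i, j in combinations(range(n_points), 2)]
print("k =", k, " distance ratios in [%.3f, %.3f]" % (min(ratios), max(ratios)))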
10. Random projection and margins
- A natural connection [AV99]:
- Suppose we have a set S of points in R^n, separable by margin γ.
- The JL lemma says that if we project to a random k-dimensional space for k = O(γ⁻² log |S|), whp the data is still separable (by margin γ/2).
- Think of projecting the points and the target vector w.
- Angles between the pi and w change by at most about γ/2.
- We could have picked the projection before sampling the data.
- So, it's really just a k-dimensional problem after all.
So, that's one way random projections can help us think about margins.
11. Random projection and margins
- Here's another way random projections can help us think about why a large margin is a good thing. Consider the following simple learning algorithm:
- Pick a random hyperplane.
- See if it is any good.
- If it is a weak learner (error rate ≤ ½ − γ/4), plug it into boosting. Else don't. Repeat.
- Claim: if the data has a large-margin separator, there's a reasonable chance a random hyperplane will be a weak learner.
12. Random projection and margins
- Claim: if the data has a separator of margin γ, there's a reasonable chance a random hyperplane will have error ≤ ½ − γ/4.
- Proof:
- Pick a (positive) example x. Consider the 2-d plane defined by x and the target w.
- Pr_h(h·x ≤ 0 | h·w ≥ 0) ≤ (π/2 − γ)/π = ½ − γ/π.
- So, E_h[min(err(h), err(-h))] ≤ ½ − γ/π.
- Since min(err(h), err(-h)) is bounded between 0 and ½, there must be a reasonable chance of success. (A small simulation follows.)
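A small simulation of the claim (my own sketch; the dimension, margin, and trial counts are arbitrary choices): generate unit-length data with margin γ around a fixed target, then see how often a random hyperplane h (or its negation) has error at most ½ − γ/4.

import numpy as np

rng = np.random.default_rng(0)
n, gamma = 20, 0.1

w_star = np.zeros(n); w_star[0] = 1.0              # unit-length target
X = rng.normal(size=(3000, n))
X /= np.linalg.norm(X, axis=1, keepdims=True)      # unit-length examples
X = X[np.abs(X @ w_star) >= gamma]                 # keep only margin-gamma points
y = np.sign(X @ w_star)

def err(h):
    return np.mean(np.sign(X @ h) != y)

trials = 2000
good = sum(min(err(h), err(-h)) <= 0.5 - gamma / 4
           for h in rng.normal(size=(trials, n)))
print("fraction of random hyperplanes that are weak learners:", good / trials)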
13. Application to semi-supervised learning [BB]
- In co-training, under admittedly strong assumptions (independence given the label), one can boost a weak h from unlabeled data [BM].
- Iterative co-training: use labeled data to make an initial h, then unlabeled data to bootstrap.
- Random projection shows that if the target is a large-margin separator, one can randomly choose initial hypotheses, use unlabeled data to bootstrap, and then use labeled data to pick among them.
- Only requires O(1) labeled examples.
- Can even do without needing a large margin, using fancier tricks (outlier removal, rescaling).
Of course, this just shows how strong the assumption is.
14. OK, now on to kernels and feature selection
15. Generic problem
- Given a set of images, we want to learn a linear separator to distinguish men from women.
- Problem: the pixel representation is no good.
- Old-style advice:
- Pick a better set of features!
- But this seems ad hoc. Not scientific.
- New-style advice:
- Use a kernel! K(x, y) = φ(x)·φ(y), where φ is an implicit, high-dimensional mapping.
- Feels more scientific. Many algorithms can be kernelized. Use the magic of the implicit high-dimensional space. Don't pay for it if there exists a large-margin separator.
16. Generic problem
- Old-style advice:
- Pick a better set of features!
- But this seems ad hoc. Not scientific.
- New-style advice:
- Use a kernel! K(x, y) = φ(x)·φ(y), where φ is an implicit, high-dimensional mapping.
- Feels more scientific. Many algorithms can be kernelized. Use the magic of the implicit high-dimensional space. Don't pay for it if there exists a large-margin separator.
- E.g., K(x, y) = (x·y + 1)^m: φ maps the n-dimensional space into an n^m-dimensional space. (A small check of this follows.)
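To make the last bullet concrete, here is a tiny check (my own, for m = 2): the kernel K(x, y) = (x·y + 1)² equals the dot product of an explicit, roughly n²-dimensional feature expansion φ, which the kernel never has to write down.

import numpy as np

def phi(x):
    # explicit feature map for K(x,y) = (x.y + 1)^2: all products x_i x_j,
    # the linear terms scaled by sqrt(2), and a constant coordinate
    return np.concatenate([np.outer(x, x).ravel(), np.sqrt(2) * x, [1.0]])

rng = np.random.default_rng(0)
x, y = rng.normal(size=5), rng.normal(size=5)
print((x @ y + 1) ** 2)       # kernel value, computed in R^5
print(phi(x) @ phi(y))        # the same number, via the explicit ~n^2-diml map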
17. Claim
- We can view the new method as a way of conducting the old method.
- Given a kernel as a black-box program K(x,y), and access to typical inputs (samples from D):
- Claim: we can run K and reverse-engineer an explicit (small) set of features, such that if K is good (∃ a large-margin separator in φ-space for D, c), then this is a good feature set (∃ an almost-as-good separator).
- "You give me a kernel, I give you a set of features."
- We do this using the idea of random projection.
18. Claim
- We can view the new method as a way of conducting the old method.
- Given a kernel as a black-box program K(x,y), and access to typical inputs (samples from D):
- Claim: we can run K and reverse-engineer an explicit (small) set of features, such that if K is good (∃ a large-margin separator in φ-space for D, c), then this is a good feature set (∃ an almost-as-good separator).
- E.g., sample z1,...,zd from D. Given x, define xi = K(x, zi). (A sketch of this mapping appears below.)
- Implications:
- Practical: an alternative to kernelizing the algorithm.
- Conceptual: view a kernel as a (principled) way of doing feature generation. View it as a similarity function, rather than as the magic power of an implicit high-dimensional space.
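Here is a minimal sketch of that construction (my own illustration; the polynomial kernel, the quadratic target, and the least-squares fit are stand-ins, not part of the talk): draw z1,...,zd from D, map each x to (K(x,z1),...,K(x,zd)), and hand the resulting explicit features to any ordinary linear learner.

import numpy as np

def K(x, z, m=2):
    # black-box kernel; a degree-m polynomial kernel is used as a stand-in
    return (np.dot(x, z) + 1) ** m

rng = np.random.default_rng(0)
n, d, n_train = 10, 80, 500

Z = rng.normal(size=(d, n))                        # z_1,...,z_d drawn from D
X = rng.normal(size=(n_train, n))                  # labeled sample from D
y = np.sign(X[:, 0] * X[:, 1] + 0.1)               # a target that is linear in phi-space

def mapping1(x):
    # x -> (K(x,z_1), ..., K(x,z_d))
    return np.array([K(x, z) for z in Z])

F = np.array([mapping1(x) for x in X])

# any off-the-shelf linear learner can now run on the explicit features F;
# a plain least-squares fit is used here just to show the plumbing
w, *_ = np.linalg.lstsq(F, y, rcond=None)
print("training error:", np.mean(np.sign(F @ w) != y))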
19. Basic setup, definitions
- Distribution D, target c. Use P = (D, c).
- P is separable with margin γ in φ-space if ∃ w with |w| = 1 such that Pr_{(x,l)~P}[ l·(w·φ(x)/|φ(x)|) < γ ] = 0.
- Error ε at margin γ: replace the 0 with ε.
The goal is to use K to get a mapping to a low-dimensional space.
20. One idea: the Johnson-Lindenstrauss lemma
- If P is separable with margin γ in φ-space, then with probability 1-δ, a random linear projection down to a space of dimension d = O((1/γ²) log(1/(δε))) will have a linear separator of error < ε. [AV]
- If the projection vectors are r1, r2, ..., rd, then we can view these as features: xi = φ(x)·ri.
- Problem: this uses φ. Can we do it directly, using K as a black box, without computing φ?
21. Three methods (from simplest to best)
- (1) Draw d examples z1,...,zd from D. Use F(x) = (K(x,z1), ..., K(x,zd)); so xi = K(x,zi).
- For d = (8/ε)[1/γ² + ln(1/δ)], if P was separable with margin γ in φ-space, then whp this will be separable with error ε (but this method doesn't preserve the margin).
- (2) Same d, but a little more complicated. Separable with error ε at margin γ/2.
- (3) Combine (2) with a further projection as in the JL lemma. Get a d with logarithmic dependence on 1/ε, rather than linear. So, we can set ε as small as about 1/d.
All these methods need access to D, unlike JL. Can this be removed? We show NO for a generic K, but it may be possible for natural K.
22. Actually, the argument is pretty easy...
- (though we did try a lot of things first that didn't work...)
23. Key fact
- Claim: If ∃ a perfect separator w of margin γ in φ-space, then if we draw z1,...,zd from D for d = (8/ε)[1/γ² + ln(1/δ)], whp (1-δ) there exists w' in span(φ(z1),...,φ(zd)) of error ε at margin γ/2.
- Proof: Not hard, but it's getting late...
24. Key fact
- Claim: If ∃ a perfect separator w of margin γ in φ-space, then if we draw z1,...,zd from D for d = (8/ε)[1/γ² + ln(1/δ)], whp (1-δ) there exists w' in span(φ(z1),...,φ(zd)) of error ε at margin γ/2.
- Proof: Let S = {examples drawn so far}. Assume |w| = 1 and |φ(z)| ≤ 1 for all z.
- Let w_in = proj(w, span(S)) and w_out = w − w_in.
- Say w_out is "large" if Pr_z(|w_out·φ(z)| ≥ γ/2) ≥ ε; else it is "small".
- If small, then we are done: take w' = w_in.
- Else, the next z has at least an ε chance of improving S: |w_out|² ← |w_out|² − (γ/2)².
- This can happen at most 4/γ² times. ∎ (The counting is spelled out below.)
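Spelled out (my paraphrase of the counting in the last two bullets, in the notation above):

\[
\Pr_{z \sim D}\big[\,|w_{\mathrm{out}}\cdot\phi(z)| \ge \gamma/2\,\big] \ge \varepsilon
\quad\Longrightarrow\quad
\text{with prob.\ at least } \varepsilon \text{ per draw,}\ \
\|w_{\mathrm{out}}\|^2 \;\leftarrow\; \|w_{\mathrm{out}}\|^2 - (\gamma/2)^2 .
\]

Since \(\|w_{\mathrm{out}}\|^2 \le \|w\|^2 = 1\), at most \(4/\gamma^2\) such drops can ever occur. So if \(w_{\mathrm{out}}\) stayed "large" for all \(d = (8/\varepsilon)[\,1/\gamma^2 + \ln(1/\delta)\,]\) draws, the expected number of drops would be at least \(\varepsilon d \ge 8/\gamma^2\), and a Chernoff bound makes fewer than \(4/\gamma^2\) drops happen with probability at most \(\delta\). Hence whp \(w_{\mathrm{out}}\) becomes "small" at some point, and then \(w' = w_{\mathrm{in}}\) has error at most \(\varepsilon\) at margin \(\gamma/2\).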
25. So....
- If we draw z1,...,zd from D for d = (8/ε)[1/γ² + ln(1/δ)], then whp there exists w' in span(φ(z1),...,φ(zd)) of error ε at margin γ/2.
- So, for some w' = a1·φ(z1) + ... + ad·φ(zd),
- Pr_{(x,l)~P}[ sign(w'·φ(x)) ≠ l ] ≤ ε.
- But notice that w'·φ(x) = a1·K(x,z1) + ... + ad·K(x,zd).
- ⇒ The vector (a1,...,ad) is an ε-good separator in the feature space xi = K(x,zi).
- But the margin is not preserved, because the lengths of the target and of the examples are not preserved.
26. How to preserve the margin? (mapping 2)
- We know ∃ w' in span(φ(z1),...,φ(zd)) of error ε at margin γ/2.
- So, given a new x, we just want to do an orthogonal projection of φ(x) into that span (this preserves the dot product and decreases |φ(x)|, so it only increases the margin).
- Run K(zi,zj) for all i,j = 1,...,d. Get the matrix M.
- Decompose M = U^T U.
- (Mapping 2) = (mapping 1)·U^(-1). ∎ (A sketch of this construction follows.)
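A minimal sketch of mapping 2 (my own; the polynomial kernel and the tiny ridge term are assumptions for the example, and Cholesky is one way to get M = U^T U):

import numpy as np

def K(x, z, m=2):
    # black-box kernel; a degree-m polynomial kernel is used as a stand-in
    return (np.dot(x, z) + 1) ** m

rng = np.random.default_rng(0)
n, d = 10, 40
Z = rng.normal(size=(d, n))                      # z_1,...,z_d drawn from D

def mapping1(x):
    return np.array([K(x, z) for z in Z])        # (K(x,z_1), ..., K(x,z_d))

# M_ij = K(z_i,z_j); write M = U^T U via Cholesky (M = L L^T, so U = L^T)
M = np.array([[K(zi, zj) for zj in Z] for zi in Z])
L = np.linalg.cholesky(M + 1e-8 * np.eye(d))     # tiny ridge in case M is near-singular

def mapping2(x):
    # mapping1(x) U^(-1) = L^(-1) mapping1(x): solve the system L t = mapping1(x)
    return np.linalg.solve(L, mapping1(x))

# sanity check: on the z's themselves the orthogonal projection is exact, so
# dot products in the new feature space reproduce the kernel values
print(mapping2(Z[0]) @ mapping2(Z[1]), K(Z[0], Z[1]))

For a general x, mapping2(x)·mapping2(x') equals the inner product of the orthogonal projections of φ(x) and φ(x') onto span(φ(z1),...,φ(zd)), which is exactly the projection step described above.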
27. How to improve the dimension?
- The current mapping gives d = (8/ε)[1/γ² + ln(1/δ)].
- Johnson-Lindenstrauss gives d = O((1/γ²) log(1/(δε))). Nice because we can have d << 1/ε.
- Answer: just combine the two...
- Run mapping 2, then do a random projection down from that.
- This gives us the desired dimension (number of features), though the sample complexity remains as in mapping 2.
28. [Diagram: labeled points (x's and o's) in R^N are mapped by F1 into R^d1, and then by a JL random projection down to R^d; the composition is the overall mapping F.]
29. Mapping 3
- Do JL(mapping2(x)).
- JL says: fix y and w. A random projection M down to a space of dimension O((1/γ²) log(1/δ)) will, with probability 1-δ, preserve the margin of y up to γ/4.
- Apply this with failure probability εδ in place of δ.
- ⇒ For all y, Pr_M[failure on y] < εδ,
- ⇒ Pr_{D,M}[failure on y] < εδ,
- ⇒ Pr_M[failure on more than an ε fraction of the probability mass] < δ (by Markov).
- So, we get the desired dimension (number of features), though the sample complexity remains as in mapping 2. (A quick dimension comparison follows.)
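As a rough sense of what mapping 3 buys (my own back-of-the-envelope numbers, ignoring the hidden constants in the O(·)): the feature count after mapping 2 grows linearly in 1/ε, while after the extra JL step it grows only logarithmically.

import numpy as np

gamma, eps, delta = 0.3, 0.001, 0.05

d2 = (8 / eps) * (1 / gamma**2 + np.log(1 / delta))   # features after mapping 2: linear in 1/eps
d3 = (1 / gamma**2) * np.log(1 / (delta * eps))       # after the JL step: logarithmic in 1/eps

print(f"mapping 2: ~{d2:,.0f} features;   mapping 3: ~{d3:,.0f} features")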
30. Lower bound (on the necessity of access to D)
- For an arbitrary black-box kernel K, we can't hope to convert to a small feature space without access to D.
- Consider X = {0,1}^n, a random X' ⊂ X of size 2^(n/2), and D uniform over X'.
- Let c be an arbitrary function (so learning is hopeless).
- But we have this magic kernel K(x,y) = φ(x)·φ(y):
- φ(x) = (1, 0) if x ∉ X'.
- φ(x) = (-½, √3/2) if x ∈ X' and c(x) = positive.
- φ(x) = (-½, -√3/2) if x ∈ X' and c(x) = negative.
- P is separable with margin √3/2 in φ-space.
- But, without access to D, all attempts at running K(x,y) will give an answer of 1.
31. Open Problems
- For specific natural kernels, like K(x,y) = (1 + x·y)^m, is there an efficient analog to JL, without needing access to D?
- Or, at least, can one reduce the sample complexity (use fewer accesses to D)?
- Can one extend the results (e.g., mapping 1: x → (K(x,z1), ..., K(x,zd))) to more general similarity functions K?
- It is not exactly clear what the theorem statement would look like.