Features, Kernels, and Similarity functions
Transcript and Presenter's Notes of a PowerPoint presentation (32 slides). Provided by: avrim. Learn more at: http://www.cs.cmu.edu
1
Features, Kernels, and Similarity functions
Avrim Blum, Machine Learning Lunch, 03/05/07
2
Suppose you want to
  • use learning to solve some classification
    problem.
  • E.g., given a set of images, learn a
    rule to distinguish men from women.
  • The first thing you need to do is decide what you
    want as features.
  • Or, for algs like SVM and Perceptron, can use a
    kernel function, which provides an implicit
    feature space. But then what kernel to use?
  • Can Theory provide any help or guidance?

3
Plan for this talk
  • Discuss a few ways theory might be of help
  • Algorithms designed to do well in large feature
    spaces when only a small number of features are
    actually useful.
  • So you can pile a lot on when you don't know much
    about the domain.
  • Kernel functions. Standard theoretical view,
    plus new one that may provide more guidance.
  • Bridge between implicit mapping and similarity
    function views. Talk about quality of a kernel
    in terms of more tangible properties. (Work with
    Nina Balcan.)
  • Combining the above. Using kernels to generate
    explicit features.

4
A classic conceptual question
  • How is it possible to learn anything quickly when
    there is so much irrelevant information around?
  • Must there be some hard-coded focusing mechanism,
    or can learning handle it?

5
A classic conceptual question
  • Let's try a very simple theoretical model.
  • Have n boolean features. Labels are + or -.
  • 1001101110  +
  • 1100111101  +
  • 0111010111  -
  • Assume the distinction is based on just one feature.
  • How many prediction mistakes do you need to make
    before you've figured out which one it is?
  • Can take a majority vote over all possibilities
    consistent with the data so far. Each mistake
    crosses off at least half, so O(log n) mistakes
    total. (A sketch of this halving scheme follows
    below.)
  • log(n) is good: doubling n only adds 1 more
    mistake.
  • Can't do better (consider log(n) random strings
    with random labels; whp there is a consistent
    feature in hindsight).
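A minimal sketch of this majority-vote (halving) scheme, assuming the setup above (n boolean features, label equal to exactly one of them); the function and variable names here are illustrative, not from the talk:

```python
import numpy as np

def halving_single_feature(examples):
    """Online halving learner for the 'label = one unknown feature' setting.
    `examples` yields (x, y) pairs with x a 0/1 numpy array and y in {0, 1}.
    Returns the number of prediction mistakes made."""
    candidates = None   # indices of features still consistent with all data
    mistakes = 0
    for x, y in examples:
        if candidates is None:
            candidates = np.arange(len(x))
        # Predict by majority vote of the surviving candidate features.
        pred = 1 if 2 * x[candidates].sum() >= len(candidates) else 0
        if pred != y:
            mistakes += 1   # the wrong majority is crossed off below,
                            # so each mistake removes at least half
        # Keep only the candidates that agree with the observed label.
        candidates = candidates[x[candidates] == y]
    return mistakes
```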

6
A classic conceptual question
  • What about more interesting classes of functions
    (not just target = a single feature)?

7
Littlestone's Winnow algorithm [MLJ 1988]
  • Motivated by the question: what if the target is an
    OR of r << n features?
  • A majority-vote scheme over all ~n^r possibilities
    would make O(r log n) mistakes but is totally
    impractical. Can you do this efficiently?
  • Winnow is a simple, efficient algorithm that meets
    this bound.
  • More generally, if there exists an LTF such that
  • positives satisfy w1x1 + w2x2 + ... + wnxn ≥ c,
  • negatives satisfy w1x1 + w2x2 + ... + wnxn ≤ c - γ,
    (W = Σi wi)
  • then mistakes = O((W/γ)^2 log n).
  • E.g., if the target is a "k of r" function, get O(r^2
    log n).
  • Key point: still only log dependence on n.

Example: x = 100101011001101011, target: x4 ∨ x7 ∨ x10
8
Littlestone's Winnow algorithm [MLJ 1988]
[Example input vector: 1001011011001]
  • How does it work? Balanced version (sketch below):
  • Maintain weight vectors w+ and w-.
  • Initialize all weights to 1. Classify based on
    whether w+·x or w-·x is larger. (Include a constant
    feature x0 ≡ 1.)
  • If we make a mistake on a positive x, then for each xi = 1:
  • wi+ ← (1+ε)wi+,  wi- ← (1-ε)wi-.
  • And vice-versa for a mistake on a negative x.
  • Other properties:
  • Can show this approximates maxent constraints.
  • In the other direction, [Ng '04] shows that maxent
    with L1 regularization gets Winnow-like bounds.
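A minimal sketch of the balanced update just described, assuming boolean inputs x in {0,1}^n; the class name and the default value of ε are illustrative choices, not from the talk:

```python
import numpy as np

class BalancedWinnow:
    """Balanced Winnow: two weight vectors w+ and w-, with multiplicative,
    mistake-driven updates on the coordinates where x_i = 1."""
    def __init__(self, n, eps=0.1):
        self.eps = eps
        self.w_pos = np.ones(n)   # w+
        self.w_neg = np.ones(n)   # w-

    def predict(self, x):
        # Classify by whichever of w+ . x, w- . x is larger.
        return 1 if self.w_pos @ x >= self.w_neg @ x else -1

    def update(self, x, y):
        """y in {+1, -1}; weights change only on a mistake."""
        if self.predict(x) == y:
            return
        on = (x == 1)
        if y == 1:    # mistake on a positive example
            self.w_pos[on] *= (1 + self.eps)
            self.w_neg[on] *= (1 - self.eps)
        else:         # mistake on a negative example (vice-versa)
            self.w_pos[on] *= (1 - self.eps)
            self.w_neg[on] *= (1 + self.eps)
```

In an online run one calls update on each (x, y) in turn; on a batch problem one can cycle through the data several times with a shrinking ε, as the "Practical issues" slide below notes.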

9
Practical issues
  • On a batch problem, may want to cycle through the data,
    each time with a smaller ε.
  • Can also do a margin version: update if just barely
    correct.
  • If you want to output a likelihood, a natural choice is
    e^(w+·x) / (e^(w+·x) + e^(w-·x)). Can extend to
    multiclass too.
  • William & Vitor have a paper with some other nice
    practical adjustments.

10
Winnow versus Perceptron/SVM
  • Winnow is similar at a high level to Perceptron
    updates. What's the difference?
  • Suppose the data is linearly separable by w·x = 0
    with |w·x| ≥ γ.
  • For Perceptron, mistakes/samples bounded by
    O((L2(w)·L2(x)/γ)^2).
  • For Winnow, mistakes/samples bounded by
    O((L1(w)·L∞(x)/γ)^2 · log n). (Both bound formulas
    are sketched below.)
  • For boolean features, L∞(x) = 1; L2(x) can be
    sqrt(n).
  • If the target is sparse and examples are dense, Winnow is
    better.
  • E.g., x random in {0,1}^n, f(x) = x1: the Perceptron bound
    is O(n) mistakes.
  • If target is dense (most features are relevant)
    and examples are sparse, then Perceptron wins.
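A small calculator for the two bound expressions above (constants dropped); this is illustrative only, not an implementation of either algorithm:

```python
import numpy as np

def perceptron_bound(w, X, gamma):
    """O((L2(w) * max_x L2(x) / gamma)^2), constants dropped."""
    return (np.linalg.norm(w, 2) * max(np.linalg.norm(x, 2) for x in X) / gamma) ** 2

def winnow_bound(w, X, gamma):
    """O((L1(w) * max_x Linf(x) / gamma)^2 * log n), constants dropped."""
    n = len(w)
    return (np.linalg.norm(w, 1) * max(np.abs(x).max() for x in X) / gamma) ** 2 * np.log(n)
```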

11
OK, now on to kernels
12
Generic problem
  • Given a set of images, want to
    learn a linear separator to distinguish men from
    women.
  • Problem: the pixel representation is no good.
  • One approach:
  • Pick a better set of features! But seems ad-hoc.
  • Instead:
  • Use a kernel! K(x, y) = φ(x)·φ(y). φ is an implicit,
    high-dimensional mapping.
  • Perceptron/SVM only interact with data through
    dot-products, so they can be kernelized. If the data is
    separable in φ-space by a large L2 margin, you don't
    have to pay for it.

13
Kernels
  • E.g., the kernel K(x,y) = (1 + x·y)^d for the case
    of n=2, d=2 corresponds to the implicit mapping
    φ(x) = (1, √2·x1, √2·x2, x1^2, x2^2, √2·x1x2),
    verified in the sketch below.
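As a quick numerical sanity check (not from the slides), this explicit 6-dimensional mapping does reproduce K(x,y) = (1 + x·y)^2 for n = 2:

```python
import numpy as np

def phi(x):
    """phi(x) = (1, sqrt(2)x1, sqrt(2)x2, x1^2, x2^2, sqrt(2)x1x2)."""
    x1, x2 = x
    s = np.sqrt(2)
    return np.array([1.0, s * x1, s * x2, x1**2, x2**2, s * x1 * x2])

x = np.array([0.3, -1.2])
y = np.array([2.0, 0.7])
assert np.isclose((1 + x @ y) ** 2, phi(x) @ phi(y))   # K(x,y) == phi(x).phi(y)
```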

14
Kernels
  • Perceptron/SVM only interact with data through
    dot-products, so they can be kernelized. If the data is
    separable in φ-space by a large L2 margin, you don't
    have to pay for it.
  • E.g., K(x,y) = (1 + x·y)^d:
  • φ: (n-diml space) → (n^d-diml space).
  • E.g., K(x,y) = e^(-||x-y||^2).
  • Conceptual warning: you're not really getting
    all the power of the high-dimensional space
    without paying for it. The margin matters.
  • E.g., K(x,y) = 1 if x = y, K(x,y) = 0 otherwise.
    Corresponds to a mapping where every example gets
    its own coordinate. Everything is linearly
    separable, but no generalization.

15
Question: do we need the notion of an implicit
space to understand what makes a kernel helpful
for learning?
16
Focus on batch setting
  • Assume examples drawn from some probability
    distribution
  • Distribution D over x, labeled by target function
    c.
  • Or distribution P over (x, l)
  • Will call P (or (c,D)) our learning problem.
  • Given labeled training data, want algorithm to do
    well on new data.

17
Something funny about theory of kernels
  • On the one hand, operationally a kernel is just a
    similarity function K(x,y) ∈ [-1,1], with some
    extra requirements. [Here I'm scaling so that
    |φ(x)| ≤ 1.]
  • And in practice, people think of a good kernel as
    a good measure of similarity between data points
    for the task at hand.
  • But theory talks about margins in an implicit
    high-dimensional φ-space: K(x,y) = φ(x)·φ(y).

18
I want to use ML to classify protein structures
and I'm trying to decide on a similarity fn to
use. Any help?
It should be pos. semidefinite, and should result
in your data having a large-margin separator in an
implicit high-diml space you probably can't even
calculate.
19
Umm... thanks, I guess.
It should be pos. semidefinite, and should result
in your data having a large-margin separator in an
implicit high-diml space you probably can't even
calculate.
20
Something funny about theory of kernels
  • Theory talks about margins in an implicit
    high-dimensional φ-space: K(x,y) = φ(x)·φ(y).
  • Not great for intuition (do I expect this kernel
    or that one to work better for me?).
  • Can we connect better with the idea of a good kernel
    being one that is a good notion of similarity for
    the problem at hand?
  • Motivation [BBV]: If margin γ in φ-space, then
    can pick Õ(1/γ^2) random examples y1,…,yn
    ("landmarks"), and do the mapping x →
    [K(x,y1),…,K(x,yn)]. Whp the data in this space will
    be approximately linearly separable.

21
Goal: a notion of "good similarity function" that
  • Talks in terms of more intuitive properties (no
    implicit high-diml spaces, no requirement of
    positive-semidefiniteness, etc.)
  • If K satisfies these properties for our given
    problem, then it has implications for learning.
  • Is broad: includes the usual notion of a good kernel
    (one that induces a large-margin separator in
    φ-space).
  • If so, then this can help with designing the K.

Recent work with Nina, with extensions by Nati
Srebro
22
Proposal satisfying (1) and (2)
  • Say we have a learning problem P (distribution D
    over examples labeled by an unknown target f).
  • A sim fn K(x,y) → [-1,1] is (ε,γ)-good for P if at
    least a 1-ε fraction of examples x satisfy

E_{y~D}[K(x,y) | l(y)=l(x)]  ≥  E_{y~D}[K(x,y) | l(y)≠l(x)] + γ
  • Q: how could you use this to learn?

23
How to use it
  • At least a 1-ε prob. mass of x satisfy
  • E_{y~D}[K(x,y) | l(y)=l(x)] ≥ E_{y~D}[K(x,y) | l(y)≠l(x)]
    + γ
  • Draw S+ of O((1/γ^2) ln(1/δ^2)) positive examples.
  • Draw S- of O((1/γ^2) ln(1/δ^2)) negative examples.
  • Classify x based on which gives the better score
    (sketch below).
  • Hoeffding: for any given "good" x, the prob. of error
    over the draw of S+, S- is at most δ^2.
  • So, there is at most a δ chance our draw is bad on more
    than a δ fraction of the good x.
  • With prob. ≥ 1-δ, error rate ≤ ε + δ.
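A minimal sketch of this classifier, assuming K is any similarity function returning values in [-1,1] and S_pos, S_neg are the drawn samples; the names are illustrative, not from the talk:

```python
import numpy as np

def make_similarity_classifier(K, S_pos, S_neg):
    """Predict the label whose sample gives the higher average similarity."""
    def predict(x):
        score_pos = np.mean([K(x, y) for y in S_pos])
        score_neg = np.mean([K(x, y) for y in S_neg])
        return +1 if score_pos >= score_neg else -1
    return predict
```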

24
But not broad enough
[Figure: example with two 30° angles]
  • K(x,y) = x·y has a good separator but doesn't satisfy
    the defn. (half of the positives are more similar to
    negatives than to a typical positive).

25
But not broad enough
[Figure: same two 30° angles as the previous slide]
  • Idea: this would work if we didn't pick the y's from the
    top-left.
  • Broaden to say: OK if ∃ a large region R s.t. most
    x are on average more similar to y ∈ R of the same
    label than to y ∈ R of the other label (even if we don't
    know R in advance).

26
Broader defn
  • Say K(x,y) → [-1,1] is an (ε,γ)-good similarity
    function for P if there exists a weighting function
    w(y) ∈ [0,1] s.t. at least a 1-ε frac. of x satisfy

E_{y~D}[w(y)K(x,y) | l(y)=l(x)]  ≥  E_{y~D}[w(y)K(x,y) | l(y)≠l(x)] + γ
  • Can still use this for learning:
  • Draw S+ = {y1,…,yn}, S- = {z1,…,zn}, n = Õ(1/γ^2).
  • Use these to "triangulate" the data:
  • x → [K(x,y1), …, K(x,yn), K(x,z1), …, K(x,zn)].
  • Whp, there exists a good separator in this space: w =
    [w(y1),…,w(yn), -w(z1),…,-w(zn)].

27
Broader defn
  • Say K(x,y) → [-1,1] is an (ε,γ)-good similarity
    function for P if there exists a weighting function
    w(y) ∈ [0,1] s.t. at least a 1-ε frac. of x satisfy

E_{y~D}[w(y)K(x,y) | l(y)=l(x)]  ≥  E_{y~D}[w(y)K(x,y) | l(y)≠l(x)] + γ
  • So, take a new set of examples, project to this
    space, and run your favorite linear separator
    learning algorithm.
  • Whp, there exists a good separator in this space: w =
    [w(y1),…,w(yn), -w(z1),…,-w(zn)].

Technically, the bounds are better if we adjust the
definition to penalize more heavily the examples that fail
the inequality badly.
28
Broader defn
Algorithm
  • Draw S+ = {y1, …, yd}, S- = {z1, …, zd}, d = O((1/γ^2)
    ln(1/δ^2)). Think of these as "landmarks".
  • Use them to "triangulate" the data (sketch below):

x → [K(x,y1), …, K(x,yd), K(x,z1), …, K(x,zd)].
Guarantee: with prob. ≥ 1-δ, there exists a linear
separator of error ≤ ε at margin γ/4.
  • Actually, the margin is good in both the L1 and L2
    senses.
  • This particular approach requires "wasting"
    examples for use as the landmarks. But one could
    use unlabeled data for this part.
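A sketch of this two-stage algorithm under stated assumptions: K is a similarity function, `landmarks` is the drawn sample of landmark examples, and scikit-learn's LinearSVC stands in for "your favorite linear separator learning algorithm":

```python
import numpy as np
from sklearn.svm import LinearSVC   # stand-in linear-separator learner

def triangulate(K, X, landmarks):
    """Map each x to the vector [K(x, p1), ..., K(x, pm)] of similarities
    to the landmark examples."""
    return np.array([[K(x, p) for p in landmarks] for x in X])

def learn_with_similarity(K, landmarks, X_train, y_train):
    """Project the labeled data onto similarities to the landmarks,
    then train a linear separator in that space."""
    Z = triangulate(K, X_train, landmarks)
    clf = LinearSVC()
    clf.fit(Z, y_train)
    predict = lambda X_new: clf.predict(triangulate(K, X_new, landmarks))
    return clf, predict
```

Note that the landmarks in this sketch do not need labels: the linear learner recovers the signs itself, which matches the point above that unlabeled data could be used for this part.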

29
Interesting property of the definition
  • An (ε,γ)-good kernel (at least a 1-ε fraction of x
    have margin γ) is an (ε',γ')-good sim fn under
    this definition.
  • But our current proofs suffer a penalty: ε' = ε +
    ε_extra, γ' = γ^3·ε_extra.
  • So, at a qualitative level, we can have a theory of
    similarity functions that doesn't require
    implicit spaces.

Nati Srebro has improved this to γ^2, which is tight,
and extended it to hinge-loss.
30
Approach we're investigating
  • With Nina & Mugizi
  • Take a problem where the original features are already
    pretty good, plus you have a couple of reasonable
    similarity functions K1, K2, …
  • Take some unlabeled data as landmarks, and use it to
    enlarge the feature space: [K1(x,y1), K2(x,y1),
    K1(x,y2), …] (sketch below).
  • Run Winnow on the result.
  • Can prove guarantees if some convex combination
    of the Ki is good.
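A sketch of the feature-space enlargement, assuming `sims` is a list of similarity functions K1, K2, … and `landmarks` are unlabeled examples (the names are illustrative); Winnow or another L1-friendly linear learner would then be run on the result:

```python
import numpy as np

def enlarged_features(sims, X, landmarks):
    """Concatenate the original features with [K1(x,y1), K2(x,y1),
    K1(x,y2), ...] for every similarity function Ki and landmark yj."""
    sim_feats = np.array([[K(x, y) for y in landmarks for K in sims]
                          for x in X])
    return np.hstack([np.asarray(X, dtype=float), sim_feats])
```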

31
Open questions
  • This view gives some sufficient conditions for a
    similarity function to be useful for learning, but it
    doesn't have direct implications for direct use in,
    say, an SVM.
  • Can one define other interesting, reasonably
    intuitive, sufficient conditions for a similarity
    function to be useful for learning?