Features, Kernels, and Similarity functions
Transcript and Presenter's Notes of a PowerPoint presentation (32 slides). Provided by: avrim. Learn more at: http://www.cs.cmu.edu
1
Features, Kernels, and Similarity functions
Avrim Blum, Machine Learning Lunch, 03/05/07
2
Suppose you want to
  • use learning to solve some classification
    problem.
  • E.g., given a set of images, learn a
    rule to distinguish men from women.
  • The first thing you need to do is decide what you
    want as features.
  • Or, for algs like SVM and Perceptron, can use a
    kernel function, which provides an implicit
    feature space. But then what kernel to use?
  • Can Theory provide any help or guidance?

3
Plan for this talk
  • Discuss a few ways theory might be of help
  • Algorithms designed to do well in large feature
    spaces when only a small number of features are
    actually useful.
  • So you can pile a lot on when you don't know much
    about the domain.
  • Kernel functions. Standard theoretical view,
    plus new one that may provide more guidance.
  • Bridge between implicit mapping and similarity
    function views. Talk about quality of a kernel
    in terms of more tangible properties. (Work with
    Nina Balcan.)
  • Combining the above. Using kernels to generate
    explicit features.

4
A classic conceptual question
  • How is it possible to learn anything quickly when
    there is so much irrelevant information around?
  • Must there be some hard-coded focusing mechanism,
    or can learning handle it?

5
A classic conceptual question
  • Let's try a very simple theoretical model.
  • Have n boolean features. Labels are + or -.
  • 1001101110  +
  • 1100111101  +
  • 0111010111  -
  • Assume the distinction is based on just one feature.
  • How many prediction mistakes do you need to make
    before you've figured out which one it is?
  • Can take a majority vote over all possibilities
    consistent with the data so far. Each mistake
    crosses off at least half, so O(log n) mistakes
    total. (A sketch of this halving scheme follows
    below.)
  • log(n) is good: doubling n only adds 1 more
    mistake.
  • Can't do better (consider log(n) random strings
    with random labels; whp there is a consistent
    feature in hindsight).
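A minimal sketch of this majority-vote (halving) scheme, assuming the setup above (n boolean features, label equal to exactly one of them); the function and variable names here are illustrative, not from the talk:

```python
import numpy as np

def halving_single_feature(examples):
    """Online halving learner for the 'label = one unknown feature' setting.
    `examples` yields (x, y) pairs with x a 0/1 numpy array and y in {0, 1}.
    Returns the number of prediction mistakes made."""
    candidates = None   # indices of features still consistent with all data
    mistakes = 0
    for x, y in examples:
        if candidates is None:
            candidates = np.arange(len(x))
        # Predict by majority vote of the surviving candidate features.
        pred = 1 if 2 * x[candidates].sum() >= len(candidates) else 0
        if pred != y:
            mistakes += 1   # the wrong majority is crossed off below,
                            # so each mistake removes at least half
        # Keep only the candidates that agree with the observed label.
        candidates = candidates[x[candidates] == y]
    return mistakes
```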

6
A classic conceptual question
  • What about more interesting classes of functions
    (not just target = a single feature)?

7
Littlestone's Winnow algorithm [MLJ 1988]
  • Motivated by the question: what if the target is an
    OR of r << n features?
  • A majority-vote scheme over all ~n^r possibilities
    would make O(r log n) mistakes but is totally
    impractical. Can you do this efficiently?
  • Winnow is a simple, efficient algorithm that meets
    this bound.
  • More generally, if there exists an LTF such that
  • positives satisfy w1x1 + w2x2 + ... + wnxn ≥ c,
  • negatives satisfy w1x1 + w2x2 + ... + wnxn ≤ c - γ,
    (W = Σi wi)
  • then mistakes = O((W/γ)^2 log n).
  • E.g., if the target is a "k of r" function, get O(r^2
    log n).
  • Key point: still only log dependence on n.

Example: x = 100101011001101011, target: x4 ∨ x7 ∨ x10
8
Littlestone's Winnow algorithm [MLJ 1988]
[Example input vector: 1001011011001]
  • How does it work? Balanced version (sketch below):
  • Maintain weight vectors w+ and w-.
  • Initialize all weights to 1. Classify based on
    whether w+·x or w-·x is larger. (Include a constant
    feature x0 ≡ 1.)
  • If we make a mistake on a positive x, then for each xi = 1:
  • wi+ ← (1+ε)wi+,  wi- ← (1-ε)wi-.
  • And vice-versa for a mistake on a negative x.
  • Other properties:
  • Can show this approximates maxent constraints.
  • In the other direction, [Ng '04] shows that maxent
    with L1 regularization gets Winnow-like bounds.
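A minimal sketch of the balanced update just described, assuming boolean inputs x in {0,1}^n; the class name and the default value of ε are illustrative choices, not from the talk:

```python
import numpy as np

class BalancedWinnow:
    """Balanced Winnow: two weight vectors w+ and w-, with multiplicative,
    mistake-driven updates on the coordinates where x_i = 1."""
    def __init__(self, n, eps=0.1):
        self.eps = eps
        self.w_pos = np.ones(n)   # w+
        self.w_neg = np.ones(n)   # w-

    def predict(self, x):
        # Classify by whichever of w+ . x, w- . x is larger.
        return 1 if self.w_pos @ x >= self.w_neg @ x else -1

    def update(self, x, y):
        """y in {+1, -1}; weights change only on a mistake."""
        if self.predict(x) == y:
            return
        on = (x == 1)
        if y == 1:    # mistake on a positive example
            self.w_pos[on] *= (1 + self.eps)
            self.w_neg[on] *= (1 - self.eps)
        else:         # mistake on a negative example (vice-versa)
            self.w_pos[on] *= (1 - self.eps)
            self.w_neg[on] *= (1 + self.eps)
```

In an online run one calls update on each (x, y) in turn; on a batch problem one can cycle through the data several times with a shrinking ε, as the "Practical issues" slide below notes.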

9
Practical issues
  • On a batch problem, may want to cycle through the data,
    each time with a smaller ε.
  • Can also do a margin version: update if just barely
    correct.
  • If you want to output a likelihood, a natural choice is
    e^(w+·x) / (e^(w+·x) + e^(w-·x)). Can extend to
    multiclass too.
  • William & Vitor have a paper with some other nice
    practical adjustments.

10
Winnow versus Perceptron/SVM
  • Winnow is similar at a high level to Perceptron
    updates. What's the difference?
  • Suppose the data is linearly separable by w·x = 0
    with |w·x| ≥ γ.
  • For Perceptron, mistakes/samples bounded by
    O((L2(w)·L2(x)/γ)^2).
  • For Winnow, mistakes/samples bounded by
    O((L1(w)·L∞(x)/γ)^2 · log n). (Both bound formulas
    are sketched below.)
  • For boolean features, L∞(x) = 1; L2(x) can be
    sqrt(n).
  • If the target is sparse and examples are dense, Winnow is
    better.
  • E.g., x random in {0,1}^n, f(x) = x1: the Perceptron bound
    is O(n) mistakes.
  • If target is dense (most features are relevant)
    and examples are sparse, then Perceptron wins.
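A small calculator for the two bound expressions above (constants dropped); this is illustrative only, not an implementation of either algorithm:

```python
import numpy as np

def perceptron_bound(w, X, gamma):
    """O((L2(w) * max_x L2(x) / gamma)^2), constants dropped."""
    return (np.linalg.norm(w, 2) * max(np.linalg.norm(x, 2) for x in X) / gamma) ** 2

def winnow_bound(w, X, gamma):
    """O((L1(w) * max_x Linf(x) / gamma)^2 * log n), constants dropped."""
    n = len(w)
    return (np.linalg.norm(w, 1) * max(np.abs(x).max() for x in X) / gamma) ** 2 * np.log(n)
```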

11
OK, now on to kernels
12
Generic problem
  • Given a set of images, want to
    learn a linear separator to distinguish men from
    women.
  • Problem: the pixel representation is no good.
  • One approach:
  • Pick a better set of features! But seems ad-hoc.
  • Instead:
  • Use a kernel! K(x, y) = φ(x)·φ(y). φ is an implicit,
    high-dimensional mapping.
  • Perceptron/SVM only interact with data through
    dot-products, so they can be kernelized. If the data is
    separable in φ-space by a large L2 margin, you don't
    have to pay for it.

13
Kernels
  • E.g., the kernel K(x,y) = (1 + x·y)^d for the case
    of n=2, d=2 corresponds to the implicit mapping
    φ(x) = (1, √2·x1, √2·x2, x1^2, x2^2, √2·x1x2),
    verified in the sketch below.
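As a quick numerical sanity check (not from the slides), this explicit 6-dimensional mapping does reproduce K(x,y) = (1 + x·y)^2 for n = 2:

```python
import numpy as np

def phi(x):
    """phi(x) = (1, sqrt(2)x1, sqrt(2)x2, x1^2, x2^2, sqrt(2)x1x2)."""
    x1, x2 = x
    s = np.sqrt(2)
    return np.array([1.0, s * x1, s * x2, x1**2, x2**2, s * x1 * x2])

x = np.array([0.3, -1.2])
y = np.array([2.0, 0.7])
assert np.isclose((1 + x @ y) ** 2, phi(x) @ phi(y))   # K(x,y) == phi(x).phi(y)
```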

14
Kernels
  • Perceptron/SVM only interact with data through
    dot-products, so they can be kernelized. If the data is
    separable in φ-space by a large L2 margin, you don't
    have to pay for it.
  • E.g., K(x,y) = (1 + x·y)^d:
  • φ: (n-diml space) → (n^d-diml space).
  • E.g., K(x,y) = e^(-||x-y||^2).
  • Conceptual warning: you're not really getting
    all the power of the high-dimensional space
    without paying for it. The margin matters.
  • E.g., K(x,y) = 1 if x = y, K(x,y) = 0 otherwise.
    Corresponds to a mapping where every example gets
    its own coordinate. Everything is linearly
    separable, but no generalization.

15
Question: do we need the notion of an implicit
space to understand what makes a kernel helpful
for learning?
16
Focus on batch setting
  • Assume examples drawn from some probability
    distribution
  • Distribution D over x, labeled by target function
    c.
  • Or distribution P over (x, l)
  • Will call P (or (c,D)) our learning problem.
  • Given labeled training data, want algorithm to do
    well on new data.

17
Something funny about theory of kernels
  • On the one hand, operationally a kernel is just a
    similarity function K(x,y) ∈ [-1,1], with some
    extra requirements. [Here I'm scaling so that
    |φ(x)| ≤ 1.]
  • And in practice, people think of a good kernel as
    a good measure of similarity between data points
    for the task at hand.
  • But theory talks about margins in an implicit
    high-dimensional φ-space: K(x,y) = φ(x)·φ(y).

18
I want to use ML to classify protein structures
and I'm trying to decide on a similarity fn to
use. Any help?
It should be pos. semidefinite, and should result
in your data having a large-margin separator in an
implicit high-diml space you probably can't even
calculate.
19
Umm... thanks, I guess.
It should be pos. semidefinite, and should result
in your data having a large-margin separator in an
implicit high-diml space you probably can't even
calculate.
20
Something funny about theory of kernels
  • Theory talks about margins in an implicit
    high-dimensional φ-space: K(x,y) = φ(x)·φ(y).
  • Not great for intuition (do I expect this kernel
    or that one to work better for me?).
  • Can we connect better with the idea of a good kernel
    being one that is a good notion of similarity for
    the problem at hand?
  • Motivation [BBV]: If margin γ in φ-space, then
    can pick Õ(1/γ^2) random examples y1,…,yn
    ("landmarks"), and do the mapping x →
    [K(x,y1),…,K(x,yn)]. Whp the data in this space will
    be approximately linearly separable.

21
Goal: a notion of "good similarity function" that
  • Talks in terms of more intuitive properties (no
    implicit high-diml spaces, no requirement of
    positive-semidefiniteness, etc.)
  • If K satisfies these properties for our given
    problem, then it has implications for learning.
  • Is broad: includes the usual notion of a good kernel
    (one that induces a large-margin separator in
    φ-space).
  • If so, then this can help with designing the K.

Recent work with Nina, with extensions by Nati
Srebro
22
Proposal satisfying (1) and (2)
  • Say we have a learning problem P (distribution D
    over examples labeled by an unknown target f).
  • A sim fn K(x,y) → [-1,1] is (ε,γ)-good for P if at
    least a 1-ε fraction of examples x satisfy

E_{y~D}[K(x,y) | l(y)=l(x)]  ≥  E_{y~D}[K(x,y) | l(y)≠l(x)] + γ
  • Q: how could you use this to learn?

23
How to use it
  • At least a 1-ε prob. mass of x satisfy
  • E_{y~D}[K(x,y) | l(y)=l(x)] ≥ E_{y~D}[K(x,y) | l(y)≠l(x)]
    + γ
  • Draw S+ of O((1/γ^2) ln(1/δ^2)) positive examples.
  • Draw S- of O((1/γ^2) ln(1/δ^2)) negative examples.
  • Classify x based on which gives the better score
    (sketch below).
  • Hoeffding: for any given "good" x, the prob. of error
    over the draw of S+, S- is at most δ^2.
  • So, there is at most a δ chance our draw is bad on more
    than a δ fraction of the good x.
  • With prob. ≥ 1-δ, error rate ≤ ε + δ.
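A minimal sketch of this classifier, assuming K is any similarity function returning values in [-1,1] and S_pos, S_neg are the drawn samples; the names are illustrative, not from the talk:

```python
import numpy as np

def make_similarity_classifier(K, S_pos, S_neg):
    """Predict the label whose sample gives the higher average similarity."""
    def predict(x):
        score_pos = np.mean([K(x, y) for y in S_pos])
        score_neg = np.mean([K(x, y) for y in S_neg])
        return +1 if score_pos >= score_neg else -1
    return predict
```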

24
But not broad enough
[Figure: example with two 30° angles]
  • K(x,y) = x·y has a good separator but doesn't satisfy
    the defn. (half of the positives are more similar to
    negatives than to a typical positive).

25
But not broad enough
[Figure: same two 30° angles as the previous slide]
  • Idea: this would work if we didn't pick the y's from the
    top-left.
  • Broaden to say: OK if ∃ a large region R s.t. most
    x are on average more similar to y ∈ R of the same
    label than to y ∈ R of the other label (even if we don't
    know R in advance).

26
Broader defn
  • Say K(x,y) → [-1,1] is an (ε,γ)-good similarity
    function for P if there exists a weighting function
    w(y) ∈ [0,1] s.t. at least a 1-ε frac. of x satisfy

E_{y~D}[w(y)K(x,y) | l(y)=l(x)]  ≥  E_{y~D}[w(y)K(x,y) | l(y)≠l(x)] + γ
  • Can still use this for learning:
  • Draw S+ = {y1,…,yn}, S- = {z1,…,zn}, n = Õ(1/γ^2).
  • Use these to "triangulate" the data:
  • x → [K(x,y1), …, K(x,yn), K(x,z1), …, K(x,zn)].
  • Whp, there exists a good separator in this space: w =
    [w(y1),…,w(yn), -w(z1),…,-w(zn)].

27
Broader defn
  • Say K(x,y) → [-1,1] is an (ε,γ)-good similarity
    function for P if there exists a weighting function
    w(y) ∈ [0,1] s.t. at least a 1-ε frac. of x satisfy

E_{y~D}[w(y)K(x,y) | l(y)=l(x)]  ≥  E_{y~D}[w(y)K(x,y) | l(y)≠l(x)] + γ
  • So, take a new set of examples, project to this
    space, and run your favorite linear separator
    learning algorithm.
  • Whp, there exists a good separator in this space: w =
    [w(y1),…,w(yn), -w(z1),…,-w(zn)].

Technically, the bounds are better if we adjust the
definition to penalize more heavily the examples that fail
the inequality badly.
28
Broader defn
Algorithm
  • Draw S+ = {y1, …, yd}, S- = {z1, …, zd}, d = O((1/γ^2)
    ln(1/δ^2)). Think of these as "landmarks".
  • Use them to "triangulate" the data (sketch below):

x → [K(x,y1), …, K(x,yd), K(x,z1), …, K(x,zd)].
Guarantee: with prob. ≥ 1-δ, there exists a linear
separator of error ≤ ε at margin γ/4.
  • Actually, the margin is good in both the L1 and L2
    senses.
  • This particular approach requires "wasting"
    examples for use as the landmarks. But one could
    use unlabeled data for this part.
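A sketch of this two-stage algorithm under stated assumptions: K is a similarity function, `landmarks` is the drawn sample of landmark examples, and scikit-learn's LinearSVC stands in for "your favorite linear separator learning algorithm":

```python
import numpy as np
from sklearn.svm import LinearSVC   # stand-in linear-separator learner

def triangulate(K, X, landmarks):
    """Map each x to the vector [K(x, p1), ..., K(x, pm)] of similarities
    to the landmark examples."""
    return np.array([[K(x, p) for p in landmarks] for x in X])

def learn_with_similarity(K, landmarks, X_train, y_train):
    """Project the labeled data onto similarities to the landmarks,
    then train a linear separator in that space."""
    Z = triangulate(K, X_train, landmarks)
    clf = LinearSVC()
    clf.fit(Z, y_train)
    predict = lambda X_new: clf.predict(triangulate(K, X_new, landmarks))
    return clf, predict
```

Note that the landmarks in this sketch do not need labels: the linear learner recovers the signs itself, which matches the point above that unlabeled data could be used for this part.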

29
Interesting property of the definition
  • An (ε,γ)-good kernel (at least a 1-ε fraction of x
    have margin γ) is an (ε',γ')-good sim fn under
    this definition.
  • But our current proofs suffer a penalty: ε' = ε +
    ε_extra, γ' = γ^3·ε_extra.
  • So, at a qualitative level, we can have a theory of
    similarity functions that doesn't require
    implicit spaces.

Nati Srebro has improved this to γ^2, which is tight,
and extended it to hinge-loss.
30
Approach we're investigating
  • With Nina & Mugizi
  • Take a problem where the original features are already
    pretty good, plus you have a couple of reasonable
    similarity functions K1, K2, …
  • Take some unlabeled data as landmarks, and use it to
    enlarge the feature space: [K1(x,y1), K2(x,y1),
    K1(x,y2), …] (sketch below).
  • Run Winnow on the result.
  • Can prove guarantees if some convex combination
    of the Ki is good.
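A sketch of the feature-space enlargement, assuming `sims` is a list of similarity functions K1, K2, … and `landmarks` are unlabeled examples (the names are illustrative); Winnow or another L1-friendly linear learner would then be run on the result:

```python
import numpy as np

def enlarged_features(sims, X, landmarks):
    """Concatenate the original features with [K1(x,y1), K2(x,y1),
    K1(x,y2), ...] for every similarity function Ki and landmark yj."""
    sim_feats = np.array([[K(x, y) for y in landmarks for K in sims]
                          for x in X])
    return np.hstack([np.asarray(X, dtype=float), sim_feats])
```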

31
Open questions
  • This view gives some sufficient conditions for a
    similarity function to be useful for learning, but it
    doesn't have direct implications for direct use in,
    say, an SVM.
  • Can one define other interesting, reasonably
    intuitive, sufficient conditions for a similarity
    function to be useful for learning?