1
On a Theory of Similarity functions for Learning
and Clustering
  • Avrim Blum
  • Carnegie Mellon University
  • This talk is based on joint work with Nina
    Balcan, Nati Srebro, and Santosh Vempala

Theory and Practice of Computational Learning,
2009
2
2-minute version
  • Suppose we are given a set of images, and want
    to learn a rule to distinguish men from women.
    Problem: the pixel representation is not so good.
  • A powerful technique for such settings is to use
    a kernel: a special kind of pairwise similarity
    function K(x,y).
  • But the theory is in terms of implicit mappings.

Q: Can we develop a theory that just views K as a
measure of similarity? Develop a more general and
intuitive theory of when K is useful for learning?
3
2-minute version
  • Suppose we are given a set of images, and want
    to learn a rule to distinguish men from women.
    Problem: the pixel representation is not so good.
  • A powerful technique for such settings is to use
    a kernel: a special kind of pairwise similarity
    function K(x,y).
  • But the theory is in terms of implicit mappings.

Q: What if we only have unlabeled data (i.e.,
clustering)? Can we develop a theory of properties
that are sufficient to be able to cluster well?
4
2-minute version
  • Suppose we are given a set of images, and want
    to learn a rule to distinguish men from women.
    Problem: the pixel representation is not so good.
  • A powerful technique for such settings is to use
    a kernel: a special kind of pairwise similarity
    function K(x,y).
  • But the theory is in terms of implicit mappings.

Develop a kind of PAC model for clustering.
5
Part 1: On similarity functions for learning
6
Theme of this part
  • A theory of natural sufficient conditions for
    similarity functions to be useful for
    classification learning problems.
  • Doesn't require PSD or implicit spaces, but
    includes the notion of a large-margin kernel.
  • At a formal level, can even allow you to learn
    more: one can define classes of functions that
    have no large-margin kernel (even allowing
    substantial hinge loss) but that do have a good
    similarity function under this notion.

7
Kernels
  • We have a lot of great algorithms for learning
    linear separators (perceptron, SVM, ...). But a
    lot of the time, data is not linearly separable.
  • Old answer: use a multi-layer neural network.
  • New answer: use a kernel function!
  • Many algorithms only interact with the data via
    dot-products.
  • So, let's just re-define the dot-product.
  • E.g., K(x,y) = (1 + x·y)^d.
  • K(x,y) = φ(x)·φ(y), where φ(·) is an implicit
    mapping into an n^d-dimensional space.
  • The algorithm acts as if the data is in φ-space.
    This allows it to produce a non-linear curve in
    the original space.

8
Kernels
A kernel K is a legal definition of a dot-product:
i.e., there exists an implicit mapping φ such that
K(x,y) = φ(x)·φ(y).
E.g., for K(x,y) = (x·y + 1)^d, φ maps the original
n-dimensional space into an n^d-dimensional space.
Why kernels are so useful: many algorithms interact
with data only via dot-products. So, if we replace
x·y with K(x,y), they act implicitly as if the data
were in the higher-dimensional φ-space.
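To make the dot-product point concrete, here is a minimal
sketch of a kernelized algorithm: a dual (kernel) perceptron
that touches the data only through K. The polynomial kernel
and the toy XOR data are illustrative choices, not taken from
the slides.

import numpy as np

def poly_kernel(x, y, d=2):
    """K(x,y) = (1 + x.y)^d -- the polynomial kernel from the slide."""
    return (1.0 + np.dot(x, y)) ** d

def kernel_perceptron(X, y, K, epochs=10):
    """Dual perceptron: the data is accessed only via K(xi, xj)."""
    n = len(X)
    alpha = np.zeros(n)                      # per-example mistake counts
    G = np.array([[K(X[i], X[j]) for j in range(n)] for i in range(n)])
    for _ in range(epochs):
        for i in range(n):
            pred = np.sign(np.sum(alpha * y * G[:, i])) or -1.0
            if pred != y[i]:
                alpha[i] += 1                # update on a mistake
    return alpha

# XOR-like data: not linearly separable in the original space.
X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
y = np.array([1, 1, -1, -1], dtype=float)
alpha = kernel_perceptron(X, y, poly_kernel)
predict = lambda x: np.sign(sum(alpha[i] * y[i] * poly_kernel(X[i], x) for i in range(len(X))))
print([predict(x) for x in X])   # should recover the labels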
9
Example
For n=2, d=2, the kernel K(x,y) = (x·y)^2
corresponds to an explicit mapping from the
original 2-dimensional space into φ-space.
(Figure: the data in the original space and its
image in φ-space.)
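For this case the implicit mapping can be written out
explicitly (a standard computation, added for concreteness):

\[
K(x,y) = (x\cdot y)^2 = (x_1 y_1 + x_2 y_2)^2
       = x_1^2 y_1^2 + 2 x_1 x_2\, y_1 y_2 + x_2^2 y_2^2
       = \varphi(x)\cdot\varphi(y),
\qquad \varphi(x) = \bigl(x_1^2,\; x_2^2,\; \sqrt{2}\, x_1 x_2\bigr).
\]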
10
Moreover, they generalize well given a good margin
  • If the data is linearly separable by a large
    margin in φ-space, then good sample complexity.

If the margin is γ in φ-space (with ‖φ(x)‖ ≤ 1),
then a sample of size only Õ(1/γ²) is needed for
confidence in generalization, with no dependence
on the dimension.
  • Kernels are useful in practice for dealing with
    many, many different kinds of data.

11
Limitations of the Current Theory
In practice, kernels are constructed by viewing
them as measures of similarity.
Existing theory is in terms of margins in implicit
spaces.
  • Not the best for intuition.
  • The kernel requirement rules out many natural
    similarity functions.
Is there an alternative, perhaps more general,
theoretical explanation?
12
A notion of a good similarity function that is:
[Balcan-Blum, ICML 2006; Balcan-Blum-Srebro, MLJ
2008; Balcan-Blum-Srebro, COLT 2008]
  1. In terms of natural, direct quantities:
     • no implicit high-dimensional spaces,
     • no requirement that K(x,y) = φ(x)·φ(y),
     just that K can be used to learn well.
  2. Broad: includes the usual notion of a good
     kernel (one that has a large-margin separator
     in φ-space).
  3. Even formally allows you to do more.
(Diagram on the slide relates the "main notion",
"good kernels", and a "first attempt".)
13
A First Attempt
P = distribution over labeled examples (x, l(x)).
Goal: output a classification rule that is good
for P.
Intuition: K is good if most x are on average more
similar to points y of their own type than to
points y of the other type.
Definition: K is (ε,γ)-good for P if a 1−ε
probability mass of x satisfy
  E_{y~P}[K(x,y) | l(y)=l(x)]
     ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ,
i.e., the average similarity to points of the same
label beats the average similarity to points of
the opposite label by a gap of at least γ.
14
A First Attempt
K is (ε,γ)-good for P if a 1−ε probability mass of
x satisfy
  E_{y~P}[K(x,y) | l(y)=l(x)]
     ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ.
E.g., most images of men are on average γ-more
similar to random images of men than to random
images of women, and vice-versa.
15
A First Attempt
K is (ε,γ)-good for P if a 1−ε probability mass of
x satisfy
  E_{y~P}[K(x,y) | l(y)=l(x)]
     ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ.
Algorithm
  • Draw sets S+ and S− of positive and negative
    examples.
  • Classify x based on its average similarity to
    S+ versus to S−.
16
A First Attempt
K is (ε,γ)-good for P if a 1−ε probability mass of
x satisfy
  E_{y~P}[K(x,y) | l(y)=l(x)]
     ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ.
Algorithm
  • Draw sets S+ and S− of positive and negative
    examples.
  • Classify x based on its average similarity to
    S+ versus to S−.

Theorem
If |S+| and |S−| are Ω((1/γ²) ln(1/(εδ))), then
with probability ≥ 1−δ, the error is at most ε + δ.
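A minimal sketch of this averaging classifier (illustrative
only; the similarity function K and the toy data are
placeholder assumptions):

import numpy as np

def average_similarity_classifier(K, S_plus, S_minus):
    """Label x by comparing its average similarity to the
    positive sample S_plus vs. the negative sample S_minus."""
    def classify(x):
        avg_pos = np.mean([K(x, y) for y in S_plus])
        avg_neg = np.mean([K(x, y) for y in S_minus])
        return +1 if avg_pos >= avg_neg else -1
    return classify

# Example with a cosine-style similarity (an arbitrary choice).
K = lambda x, y: float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12))
S_plus  = [np.array([1.0, 0.1]), np.array([0.9, 0.2])]
S_minus = [np.array([-1.0, 0.1]), np.array([-0.8, -0.2])]
f = average_similarity_classifier(K, S_plus, S_minus)
print(f(np.array([0.7, 0.0])))   # expected +1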
17
A First Attempt: Not Broad Enough
  E_{y~P}[K(x,y) | l(y)=l(x)]
     ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
Example (figure: data arranged at 30° angles): the
similarity function K(x,y) = x·y
  • has a large-margin separator, but
  • does not satisfy our definition (the two
    averages compare as ½ versus ½·1 + ½·(−½) = ¼).
18
A First Attempt: Not Broad Enough
  E_{y~P}[K(x,y) | l(y)=l(x)]
     ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
Broaden: ∃ a non-negligible set R such that most x
are on average more similar to the y ∈ R of the
same label than to the y ∈ R of the other label,
even if we do not know R in advance.
19
Broader Definition
  • K is (ε, γ, τ)-good if ∃ a set R of "reasonable"
    y (allowed to be probabilistic) such that a 1−ε
    fraction of x satisfy
      E_{y~P}[K(x,y) | l(y)=l(x), R(y)]
         ≥ E_{y~P}[K(x,y) | l(y)≠l(x), R(y)] + γ
    (technically, γ hinge loss), and at least a τ
    probability mass of y are reasonable, among both
    positives and negatives.
Algorithm
  • Draw a set S = {y1, ..., yd} of landmarks and
    re-represent the data:
      x  →  F(x) = [K(x,y1), ..., K(x,yd)].
  • If there are enough landmarks (d = Ω(1/(γ²τ))),
    then with high probability there exists a good
    large-L1-margin linear separator in this space,
    e.g., w = [0, 0, 1/n+, 1/n+, 0, 0, 0, −1/n−, 0, 0].
20
Broader Definition
  • K is (ε, γ, τ)-good if ∃ a set R of "reasonable"
    y (allowed to be probabilistic) such that a 1−ε
    fraction of x satisfy
      E_{y~P}[K(x,y) | l(y)=l(x), R(y)]
         ≥ E_{y~P}[K(x,y) | l(y)≠l(x), R(y)] + γ
    (technically, γ hinge loss), and at least a τ
    probability mass of y are reasonable, among both
    positives and negatives.
Algorithm
  • Draw a set S = {y1, ..., yd} of (unlabeled)
    landmarks, with du = Õ(1/(γ²τ)), and
    re-represent the data:
      x  →  F(x) = [K(x,y1), ..., K(x,yd)].
  • Take a new set of dl = O((1/(γ² ε_acc²)) ln du)
    labeled examples, project them into this space,
    and run a good L1 linear separator algorithm
    (e.g., Winnow), as sketched in the code below.
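A minimal sketch of the landmark mapping followed by an L1
linear learner. L1-regularized logistic regression is used
here as a stand-in for Winnow, and the similarity function,
landmark count, and regularization strength are illustrative
assumptions:

import numpy as np
from sklearn.linear_model import LogisticRegression

def similarity_map(K, landmarks):
    """Map x to the empirical similarity features [K(x,y1), ..., K(x,yd)]."""
    def F(x):
        return np.array([K(x, y) for y in landmarks])
    return F

# Toy data and a toy similarity (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
K = lambda a, b: float(np.exp(-np.sum((a - b) ** 2)))   # any similarity; PSD not required

landmarks = X[rng.choice(len(X), size=30, replace=False)]  # unlabeled landmark sample
F = similarity_map(K, landmarks)
Z = np.vstack([F(x) for x in X])                           # re-represented data

# L1-regularized linear separator in the landmark space (stand-in for Winnow).
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(Z, y)
print("training accuracy:", clf.score(Z, y))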

21
Kernels and Similarity Functions
Theorem
A good kernel K is also a good similarity function
(but γ gets squared): if K has margin γ in the
implicit space, then for any ε, K is
(ε, γ², ε)-good in our sense.
22
Kernels and Similarity Functions
Theorem
A good kernel K is also a good similarity function
(but γ gets squared).
Can also show a separation.
Theorem
There exist a class C and a distribution D such
that ∃ a similarity function with large γ for all
f in C, but no large-margin kernel function exists.
23
Kernels and Similarity Functions
Theorem
For any class C of pairwise uncorrelated functions,
∃ a similarity function good for all f in C, but no
such good kernel function exists.
  • In principle, one should be able to learn from
    O(ε⁻¹ log(|C|/δ)) labeled examples.
  • Claim 1: can define a generic (0, 1, 1/|C|)-good
    similarity function achieving this bound
    (assuming D is not too concentrated).
  • Claim 2: there is no (ε,γ)-good kernel in hinge
    loss, even if ε = 1/2 and γ = 1/|C|^{1/2}. So the
    margin-based sample complexity is Ω(|C|).

24
Generic Similarity Function
  • Partition X into regions R1, ..., R_|C| with
    P(Ri) > 1/poly(|C|).
  • Ri will play the role of R for target fi.
  • For y in Ri, define K(x,y) = fi(x)fi(y).
  • So, for any target fi in C and any x, we get
    E_y[ l(x) l(y) K(x,y) | y ∈ Ri ] = E[ l(x)² l(y)² ] = 1.
  • So K is (0, 1, 1/poly(|C|))-good.

This gives the bound O(ε⁻¹ log |C|).
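A one-line check of the expectation above, just expanding the
definition of K on R_i (with l = f_i for target f_i, and
labels in {−1,+1} as in the standard setup):

\[
\mathbb{E}_{y}\bigl[\, l(x)\, l(y)\, K(x,y) \mid y \in R_i \bigr]
 = \mathbb{E}_{y}\bigl[\, f_i(x) f_i(y) \cdot f_i(x) f_i(y) \bigr]
 = \mathbb{E}_{y}\bigl[\, f_i(x)^2 f_i(y)^2 \bigr] = 1 .
\]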
25
Similarity Functions for Classification:
Algorithmic Implications
  • Can use non-PSD similarities; no need to
    transform them into PSD functions and plug them
    into an SVM. Instead, use the empirical
    similarity map.
    (E.g., Liao and Noble, Journal of Computational
    Biology.)
  • Gives justification for this rule.
  • Shows that anything learnable with an SVM is
    also learnable this way.

26
Learning with Multiple Similarity Functions
  • Let K1, ..., Kr be similarity functions such
    that some (unknown) convex combination of them
    is (ε,γ)-good.

Algorithm
  • Draw a set S = {y1, ..., yd} of landmarks and
    concatenate the features:
      F(x) = [K1(x,y1), ..., Kr(x,y1), ...,
              K1(x,yd), ..., Kr(x,yd)].
  • Run the same L1 optimization algorithm as before
    in this new feature space.

27
Learning with Multiple Similarity Functions
  • Let K1, ..., Kr be similarity functions such
    that some (unknown) convex combination of them
    is (ε,γ)-good.

Algorithm
  • Draw a set S = {y1, ..., yd} of landmarks and
    concatenate the features:
      F(x) = [K1(x,y1), ..., Kr(x,y1), ...,
              K1(x,yd), ..., Kr(x,yd)].

Guarantee: with high probability, the induced
distribution F(P) in R^{dr} has a separator of
error ≤ ε at L1 margin at least γ/4.
The sample complexity only increases by a log(r)
factor! (A sketch of the feature map follows.)
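A small sketch of the concatenated feature map F(x) for
multiple similarity functions; the two similarities and the
landmarks are placeholders, and the resulting dr-dimensional
vector is what gets fed to the L1 learner sketched earlier:

import numpy as np

def multi_similarity_map(similarities, landmarks):
    """F(x) = [K1(x,y1), ..., Kr(x,y1), ..., K1(x,yd), ..., Kr(x,yd)]."""
    def F(x):
        return np.array([K(x, y) for y in landmarks for K in similarities])
    return F

# Two illustrative similarity functions.
K1 = lambda a, b: float(np.dot(a, b))                      # linear similarity
K2 = lambda a, b: float(np.exp(-np.sum((a - b) ** 2)))     # Gaussian-style similarity

landmarks = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
F = multi_similarity_map([K1, K2], landmarks)
print(F(np.array([0.5, 0.5])))   # a vector with d*r = 3*2 = 6 coordinates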
28
Learning with Multiple Similarity Functions
  • Let K1, ..., Kr be similarity functions such
    that some (unknown) convex combination of them
    is (ε,γ)-good.

Algorithm
  • Draw a set S = {y1, ..., yd} of landmarks and
    concatenate the features:
      F(x) = [K1(x,y1), ..., Kr(x,y1), ...,
              K1(x,yd), ..., Kr(x,yd)].

Guarantee: with high probability, the induced
distribution F(P) in R^{dr} has a separator of
error ≤ ε at L1 margin at least γ/4.

Proof: imagine the mapping F°(x) = [K°(x,y1), ...,
K°(x,yd)] for the good similarity function
K° = α1 K1 + ... + αr Kr. Consider w° = (w1, ..., wd)
of L1 norm 1 and margin γ/4. The vector
w = (α1 w1, α2 w1, ..., αr w1, ..., α1 wd, α2 wd,
..., αr wd) also has L1 norm 1 and satisfies
w·F(x) = w°·F°(x).
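Writing out the two claims at the end of the proof, using
that the α_i are nonnegative and sum to 1:

\[
w\cdot F(x) = \sum_{j=1}^{d}\sum_{i=1}^{r} \alpha_i w_j\, K_i(x,y_j)
 = \sum_{j=1}^{d} w_j \sum_{i=1}^{r} \alpha_i K_i(x,y_j)
 = \sum_{j=1}^{d} w_j\, K^{\circ}(x,y_j)
 = w^{\circ}\cdot F^{\circ}(x),
\]
\[
\|w\|_1 = \sum_{j=1}^{d}\sum_{i=1}^{r} \alpha_i |w_j|
 = \Bigl(\sum_i \alpha_i\Bigr)\Bigl(\sum_j |w_j|\Bigr) = 1 .
\]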
29
Learning with Multiple Similarity Functions
  • Because the property is defined in terms of L1,
    there is no change in margin!
  • Only a log(r) penalty for concatenating the
    feature spaces.
  • With L2, the margin would drop by a factor of
    r^{1/2}, giving an O(r) penalty in sample
    complexity.
  • The algorithm is also very simple (just
    concatenate).
  • Alternative algorithm: do a joint optimization:
  • solve for K° = α1 K1 + ... + αr Kr and a vector
    w° such that w° has a good L1 margin in the
    space defined by F°(x) = [K°(x,y1), ..., K°(x,yd)].
  • The bound also holds here, since the capacity is
    only lower.
  • But we don't know how to do this efficiently.

30
Learning with Multiple Similarity Functions
  • Interesting fact: because the property is
    defined in terms of L1, there is no change in
    margin!
  • Only a log(r) penalty for concatenating the
    feature spaces.
  • With L2, the margin would drop by a factor of
    r^{1/2}, giving an O(r) penalty in sample
    complexity.
  • Also, since any large-margin kernel is also a
    good similarity function,
  • the log(r) penalty applies to the "concatenate
    and optimize L1 margin" algorithm for kernels
    too.
  • But γ is potentially squared in the translation,
    and we add an extra ε to the hinge loss at a
    1/ε cost in unlabeled data.
  • Nonetheless, if r is large, this can be a good
    tradeoff!

31
Open questions (part I)
  • Can we deal (efficiently?) with a general convex
    class K of similarity functions?
  • Not just K = {α1 K1 + ... + αr Kr : αi ≥ 0,
    α1 + ... + αr = 1}.
  • Can we efficiently implement the direct joint
    optimization for the convex-combination case?
  • Alternatively, can we use the concatenation
    algorithm to extract a good convex combination K°?
  • Two quite different algorithm styles; is there
    anything in between?
  • Use this approach for transfer learning?

32
Part 2: Can we use this angle to help think about
clustering?
33
Clustering comes up in many places
  • Given a set of documents or search results,
    cluster them by topic.
  • Given a collection of protein sequences, cluster
    them by function.
  • Given a set of images of people, cluster by who
    is in them.

34
Can model clustering like this
  • Given a data set S of n objects (e.g., news
    articles).
  • There is some (unknown) ground-truth clustering
    (e.g., by topic: sports, politics, ...).
  • Goal: produce a hypothesis clustering
    C1, C2, ..., Ck that matches the target as much
    as possible (minimize mistakes, up to
    renumbering of indices).
  • Problem: no labeled data!
  • But we do have a measure of similarity.
35
Can model clustering like this
What conditions on a similarity measure would be
enough to allow one to cluster well?
  • Given a data set S of n objects (e.g., news
    articles).
  • There is some (unknown) ground-truth clustering
    (e.g., by topic: sports, politics, ...).
  • Goal: produce a hypothesis clustering
    C1, C2, ..., Ck that matches the target as much
    as possible (minimize mistakes, up to
    renumbering of indices).
  • Problem: no labeled data!
  • But we do have a measure of similarity.
36
What conditions on a similarity measure would be
enough to allow one to cluster well?
  • Contrast with the more standard approach to
    clustering analysis:
  • View the similarity/distance information as
    ground truth.
  • Analyze the ability of algorithms to achieve
    different optimization criteria (min-sum,
    k-means, k-median, ...).
  • Or, assume a generative model, like a mixture
    of Gaussians.
  • Here, no generative assumptions. Instead: given
    the data, how powerful a K do we need to be able
    to cluster it well?
37
Here is a condition that trivially works
What conditions on a similarity measure would be
enough to allow one to cluster well?
  • Suppose K has the property that:
  • K(x,y) > 0 for all x,y such that C(x) = C(y),
  • K(x,y) < 0 for all x,y such that C(x) ≠ C(y).
  • If we have such a K, then clustering is easy.
  • Now, let's try to make this condition a little
    weaker.

38
What conditions on a similarity measure would be
enough to allow one to cluster well?
  • Suppose K has the property that all x are more
    similar to all points y in their own cluster
    than to any y in other clusters.
  • Still a very strong condition.
  • Problem: the same K can satisfy this for two
    very different clusterings of the same data!
    (Figure: clusters of baseball and basketball
    documents.)
39
What conditions on a similarity measure would be
enough to allow one to cluster well?
  • Suppose K has the property that all x are more
    similar to all points y in their own cluster
    than to any y in other clusters.
  • Still a very strong condition.
  • Problem: the same K can satisfy this for two
    very different clusterings of the same data!
    (Figure: baseball, basketball, math, and physics
    documents; the property can hold both for the
    four-topic clustering and for a coarser two-way
    split.)
40
Let's weaken our goals a bit
  • OK to produce a hierarchical clustering (tree)
    such that the target clustering is approximately
    some pruning of it.
  • E.g., in the case from the last slide.
  • Can view this as saying: "if any of these
    clusters is too broad, just click and I will
    split it for you."
  • Or, OK to output a small number of clusterings
    such that at least one has low error (like
    list-decoding), but we won't talk about this one
    today.

41
Then you can start getting somewhere.
  1. "All x are more similar to all y in their own
     cluster than to any y from any other cluster"
     is sufficient to get a hierarchical clustering
     such that the target is some pruning of the
     tree. (Kruskal's / single-linkage works.)

42
Then you can start getting somewhere.
  1. "All x are more similar to all y in their own
     cluster than to any y from any other cluster"
     is sufficient to get a hierarchical clustering
     such that the target is some pruning of the
     tree. (Kruskal's / single-linkage works.)

  2. Weaker condition: the ground truth is "stable".
     For all clusters C, C', and for all A ⊆ C,
     A' ⊆ C': A and A' are not both more similar on
     average to each other than to the rest of their
     own clusters.
     (View K(x,y) as an "attraction" between x and
     y; plus technical conditions at the boundary.)
     This is sufficient to get a good tree using the
     average single-linkage algorithm.
43
Analysis for slightly simpler version
Assume that for all C, C', and all A ⊊ C, A' ⊆ C',
we have K(A, C−A) > K(A, A'), where K(A,B) denotes
Avg_{x∈A, y∈B} K(x,y), and say K is symmetric.
  • Algorithm: average single-linkage.
  • Like Kruskal's, but at each step merge the pair
    of clusters whose average similarity is highest
    (see the sketch after this slide).
  • Analysis: all clusters made are laminar with
    respect to the target.
  • Failure iff we merge C1, C2 such that C1 ⊊ C
    and C2 ∩ C = ∅.
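A minimal sketch of average single-linkage on a precomputed
similarity matrix (illustrative: it repeatedly merges the
most-similar pair of current clusters and records the merge
tree):

import numpy as np

def average_single_linkage(S):
    """Build a merge tree bottom-up from an n x n similarity matrix S.
    At each step, merge the pair of current clusters with the highest
    average pairwise similarity. Returns the list of merges."""
    clusters = [[i] for i in range(len(S))]
    merges = []
    while len(clusters) > 1:
        best, best_pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                avg = np.mean([S[i][j] for i in clusters[a] for j in clusters[b]])
                if avg > best:
                    best, best_pair = avg, (a, b)
        a, b = best_pair
        merges.append((clusters[a], clusters[b]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

# Tiny example: two obvious groups {0,1} and {2,3}.
S = np.array([[1.0, 0.9, 0.1, 0.2],
              [0.9, 1.0, 0.2, 0.1],
              [0.1, 0.2, 1.0, 0.8],
              [0.2, 0.1, 0.8, 1.0]])
print(average_single_linkage(S))   # {0,1} and {2,3} merge before the final merge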

44
Analysis for slightly simpler version
Assume that for all C, C', and all A ⊊ C, A' ⊆ C',
we have K(A, C−A) > K(A, A'), where K(A,B) denotes
Avg_{x∈A, y∈B} K(x,y), and say K is symmetric.
(Figure: C1 and C3 inside the target cluster C,
with C2 outside.)
  • Algorithm: average single-linkage.
  • Like Kruskal's, but at each step merge the pair
    of clusters whose average similarity is highest.
  • Analysis: all clusters made are laminar with
    respect to the target.
  • Failure iff we merge C1, C2 such that C1 ⊊ C
    and C2 ∩ C = ∅.
  • But then there must exist a cluster C3 ⊊ C that
    is at least as similar to C1 as the average.
    Contradiction.

45
More sufficient properties
  3. "All x are more similar to all y in their own
     cluster than to any y from any other cluster,"
     but with noisy data added.
  • Noisy data can ruin bottom-up algorithms, but
    one can show that a generate-and-test style
    algorithm works:
  • Create a collection of plausible clusters.
  • Use a series of pairwise tests to remove/shrink
    clusters until they are consistent with a tree.

46
More sufficient properties
  3. "All x are more similar to all y in their own
     cluster than to any y from any other cluster,"
     but with noisy data added.
  4. Implicit assumptions made by the optimization
     approach: any approximately-optimal (e.g.,
     k-median) solution is close (in terms of how
     points are clustered) to the target.
     [Nina Balcan's talk on Saturday]
47
Can also analyze inductive setting
  • Can use regularity-type results of [AFKK] to
    argue that w.h.p., a reasonable-size sample S
    will give good estimates of all desired
    quantities.
  • Once S is hierarchically partitioned, can insert
    new points as they arrive.

48
Like a PAC model for clustering
  • A "property" is a relation between the target
    and the similarity information (data). Like a
    data-dependent concept class in learning.
  • Given data and a similarity function K, a
    property induces a "concept class" C of all
    clusterings c such that (c, K) is consistent
    with the property.
  • Tree model: want a tree T such that the set of
    prunings of T forms an ε-cover of C.
  • In the inductive model, want this with
    probability ≥ 1−δ.

49
Summary (part II)
  • Exploring the question: what does an algorithm
    need in order to cluster well?
  • What natural properties allow a similarity
    measure to be useful for clustering?
  • To get a good theory, it helps to relax what we
    mean by "useful for clustering".
  • The user can then decide how specific to be in
    each part of the domain.
  • Analyze a number of natural properties and prove
    guarantees on algorithms able to use them.

50
Wrap-up
  • A tour through learning and clustering by
    similarity functions.
  • A user with some knowledge of the problem domain
    comes up with a pairwise similarity measure
    K(x,y) that makes sense for the given problem.
  • The algorithm uses this (together with labeled
    data, in the case of learning) to find a good
    solution.
  • Goals of a theory:
  • Give guidance to the similarity-function
    designer (what properties to shoot for?).
  • Understand what properties are sufficient for
    learning/clustering, and by what algorithms.
  • For learning: get a theory of kernels without
    the need for implicit spaces.
  • For clustering: reverses the usual view;
    suggests giving the algorithm some slack (tree
    vs. partitioning).
  • A lot of interesting questions are still open in
    these areas.