1
On a Theory of Similarity functions for Learning
and Clustering
  • Avrim Blum
  • Carnegie Mellon University
  • This talk is based on joint work with Nina
    Balcan, Nati Srebro, and Santosh Vempala

Theory and Practice of Computational Learning,
2009
2
2-minute version
  • Suppose we are given a set of images, and want
    to learn a rule to distinguish men from women.
    Problem: the pixel representation is not so good.
  • A powerful technique for such settings is to use
    a kernel: a special kind of pairwise similarity
    function K(x,y).
  • But the theory is in terms of implicit mappings.

Q: Can we develop a theory that just views K as a
measure of similarity? Develop a more general and
intuitive theory of when K is useful for learning?
3
2-minute version
  • Suppose we are given a set of images, and want
    to learn a rule to distinguish men from women.
    Problem: the pixel representation is not so good.
  • A powerful technique for such settings is to use
    a kernel: a special kind of pairwise similarity
    function K(x,y).
  • But the theory is in terms of implicit mappings.

Q: What if we only have unlabeled data (i.e.,
clustering)? Can we develop a theory of properties
that are sufficient to be able to cluster well?
4
2-minute version
  • Suppose we are given a set of images, and want
    to learn a rule to distinguish men from women.
    Problem: the pixel representation is not so good.
  • A powerful technique for such settings is to use
    a kernel: a special kind of pairwise similarity
    function K(x,y).
  • But the theory is in terms of implicit mappings.

Develop a kind of PAC model for clustering.
5
Part 1: On similarity functions for learning
6
Theme of this part
  • A theory of natural sufficient conditions for
    similarity functions to be useful for
    classification learning problems.
  • Doesn't require PSD or implicit spaces, but
    includes the notion of a large-margin kernel.
  • At a formal level, can even allow you to learn
    more: one can define classes of functions that
    have no large-margin kernel (even allowing
    substantial hinge loss) but that do have a good
    similarity function under this notion.

7
Kernels
  • We have a lot of great algorithms for learning
    linear separators (perceptron, SVM, ...). But a
    lot of the time, data is not linearly separable.
  • Old answer: use a multi-layer neural network.
  • New answer: use a kernel function!
  • Many algorithms only interact with the data via
    dot-products.
  • So, let's just re-define the dot-product.
  • E.g., K(x,y) = (1 + x·y)^d.
  • K(x,y) = φ(x)·φ(y), where φ(·) is an implicit
    mapping into an n^d-dimensional space.
  • The algorithm acts as if the data is in φ-space.
    This allows it to produce a non-linear curve in
    the original space.

8
Kernels
A kernel K is a legal definition of a dot-product:
i.e., there exists an implicit mapping φ such that
K(x,y) = φ(x)·φ(y).
E.g., for K(x,y) = (x·y + 1)^d, φ maps the original
n-dimensional space into an n^d-dimensional space.
Why kernels are so useful: many algorithms interact
with data only via dot-products. So, if we replace
x·y with K(x,y), they act implicitly as if the data
were in the higher-dimensional φ-space.
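To make the dot-product point concrete, here is a minimal
sketch of a kernelized algorithm: a dual (kernel) perceptron
that touches the data only through K. The polynomial kernel
and the toy XOR data are illustrative choices, not taken from
the slides.

import numpy as np

def poly_kernel(x, y, d=2):
    """K(x,y) = (1 + x.y)^d -- the polynomial kernel from the slide."""
    return (1.0 + np.dot(x, y)) ** d

def kernel_perceptron(X, y, K, epochs=10):
    """Dual perceptron: the data is accessed only via K(xi, xj)."""
    n = len(X)
    alpha = np.zeros(n)                      # per-example mistake counts
    G = np.array([[K(X[i], X[j]) for j in range(n)] for i in range(n)])
    for _ in range(epochs):
        for i in range(n):
            pred = np.sign(np.sum(alpha * y * G[:, i])) or -1.0
            if pred != y[i]:
                alpha[i] += 1                # update on a mistake
    return alpha

# XOR-like data: not linearly separable in the original space.
X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
y = np.array([1, 1, -1, -1], dtype=float)
alpha = kernel_perceptron(X, y, poly_kernel)
predict = lambda x: np.sign(sum(alpha[i] * y[i] * poly_kernel(X[i], x) for i in range(len(X))))
print([predict(x) for x in X])   # should recover the labels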
9
Example
For n=2, d=2, the kernel K(x,y) = (x·y)^2
corresponds to an explicit mapping from the
original 2-dimensional space into φ-space.
(Figure: the data in the original space and its
image in φ-space.)
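For this case the implicit mapping can be written out
explicitly (a standard computation, added for concreteness):

\[
K(x,y) = (x\cdot y)^2 = (x_1 y_1 + x_2 y_2)^2
       = x_1^2 y_1^2 + 2 x_1 x_2\, y_1 y_2 + x_2^2 y_2^2
       = \varphi(x)\cdot\varphi(y),
\qquad \varphi(x) = \bigl(x_1^2,\; x_2^2,\; \sqrt{2}\, x_1 x_2\bigr).
\]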
10
Moreover, they generalize well given a good margin
  • If the data is linearly separable by a large
    margin in φ-space, then good sample complexity.

If the margin is γ in φ-space (with ‖φ(x)‖ ≤ 1),
then a sample of size only Õ(1/γ²) is needed for
confidence in generalization, with no dependence
on the dimension.
  • Kernels are useful in practice for dealing with
    many, many different kinds of data.

11
Limitations of the Current Theory
In practice, kernels are constructed by viewing
them as measures of similarity.
Existing theory is in terms of margins in implicit
spaces.
  • Not the best for intuition.
  • The kernel requirement rules out many natural
    similarity functions.
Is there an alternative, perhaps more general,
theoretical explanation?
12
A notion of a good similarity function that is:
[Balcan-Blum, ICML 2006; Balcan-Blum-Srebro, MLJ
2008; Balcan-Blum-Srebro, COLT 2008]
  1. In terms of natural, direct quantities:
     • no implicit high-dimensional spaces,
     • no requirement that K(x,y) = φ(x)·φ(y),
     just that K can be used to learn well.
  2. Broad: includes the usual notion of a good
     kernel (one that has a large-margin separator
     in φ-space).
  3. Even formally allows you to do more.
(Diagram on the slide relates the "main notion",
"good kernels", and a "first attempt".)
13
A First Attempt
P = distribution over labeled examples (x, l(x)).
Goal: output a classification rule that is good
for P.
Intuition: K is good if most x are on average more
similar to points y of their own type than to
points y of the other type.
Definition: K is (ε,γ)-good for P if a 1−ε
probability mass of x satisfy
  E_{y~P}[K(x,y) | l(y)=l(x)]
     ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ,
i.e., the average similarity to points of the same
label beats the average similarity to points of
the opposite label by a gap of at least γ.
14
A First Attempt
K is (ε,γ)-good for P if a 1−ε probability mass of
x satisfy
  E_{y~P}[K(x,y) | l(y)=l(x)]
     ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ.
E.g., most images of men are on average γ-more
similar to random images of men than to random
images of women, and vice-versa.
15
A First Attempt
K is (ε,γ)-good for P if a 1−ε probability mass of
x satisfy
  E_{y~P}[K(x,y) | l(y)=l(x)]
     ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ.
Algorithm
  • Draw sets S+ and S− of positive and negative
    examples.
  • Classify x based on its average similarity to
    S+ versus to S−.
16
A First Attempt
K is (ε,γ)-good for P if a 1−ε probability mass of
x satisfy
  E_{y~P}[K(x,y) | l(y)=l(x)]
     ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ.
Algorithm
  • Draw sets S+ and S− of positive and negative
    examples.
  • Classify x based on its average similarity to
    S+ versus to S−.

Theorem
If |S+| and |S−| are Ω((1/γ²) ln(1/(εδ))), then
with probability ≥ 1−δ, the error is at most ε + δ.
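A minimal sketch of this averaging classifier (illustrative
only; the similarity function K and the toy data are
placeholder assumptions):

import numpy as np

def average_similarity_classifier(K, S_plus, S_minus):
    """Label x by comparing its average similarity to the
    positive sample S_plus vs. the negative sample S_minus."""
    def classify(x):
        avg_pos = np.mean([K(x, y) for y in S_plus])
        avg_neg = np.mean([K(x, y) for y in S_minus])
        return +1 if avg_pos >= avg_neg else -1
    return classify

# Example with a cosine-style similarity (an arbitrary choice).
K = lambda x, y: float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-12))
S_plus  = [np.array([1.0, 0.1]), np.array([0.9, 0.2])]
S_minus = [np.array([-1.0, 0.1]), np.array([-0.8, -0.2])]
f = average_similarity_classifier(K, S_plus, S_minus)
print(f(np.array([0.7, 0.0])))   # expected +1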
17
A First Attempt: Not Broad Enough
  E_{y~P}[K(x,y) | l(y)=l(x)]
     ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
Example (figure: data arranged at 30° angles): the
similarity function K(x,y) = x·y
  • has a large-margin separator, but
  • does not satisfy our definition (the two
    averages compare as ½ versus ½·1 + ½·(−½) = ¼).
18
A First Attempt: Not Broad Enough
  E_{y~P}[K(x,y) | l(y)=l(x)]
     ≥ E_{y~P}[K(x,y) | l(y)≠l(x)] + γ
Broaden: ∃ a non-negligible set R such that most x
are on average more similar to the y ∈ R of the
same label than to the y ∈ R of the other label,
even if we do not know R in advance.
19
Broader Definition
  • K is (ε, γ, τ)-good if ∃ a set R of "reasonable"
    y (allowed to be probabilistic) such that a 1−ε
    fraction of x satisfy
      E_{y~P}[K(x,y) | l(y)=l(x), R(y)]
         ≥ E_{y~P}[K(x,y) | l(y)≠l(x), R(y)] + γ
    (technically, γ hinge loss), and at least a τ
    probability mass of y are reasonable, among both
    positives and negatives.
Algorithm
  • Draw a set S = {y1, ..., yd} of landmarks and
    re-represent the data:
      x  →  F(x) = [K(x,y1), ..., K(x,yd)].
  • If there are enough landmarks (d = Ω(1/(γ²τ))),
    then with high probability there exists a good
    large-L1-margin linear separator in this space,
    e.g., w = [0, 0, 1/n+, 1/n+, 0, 0, 0, −1/n−, 0, 0].
20
Broader Definition
  • K is (ε, γ, τ)-good if ∃ a set R of "reasonable"
    y (allowed to be probabilistic) such that a 1−ε
    fraction of x satisfy
      E_{y~P}[K(x,y) | l(y)=l(x), R(y)]
         ≥ E_{y~P}[K(x,y) | l(y)≠l(x), R(y)] + γ
    (technically, γ hinge loss), and at least a τ
    probability mass of y are reasonable, among both
    positives and negatives.
Algorithm
  • Draw a set S = {y1, ..., yd} of (unlabeled)
    landmarks, with du = Õ(1/(γ²τ)), and
    re-represent the data:
      x  →  F(x) = [K(x,y1), ..., K(x,yd)].
  • Take a new set of dl = O((1/(γ² ε_acc²)) ln du)
    labeled examples, project them into this space,
    and run a good L1 linear separator algorithm
    (e.g., Winnow), as sketched in the code below.
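A minimal sketch of the landmark mapping followed by an L1
linear learner. L1-regularized logistic regression is used
here as a stand-in for Winnow, and the similarity function,
landmark count, and regularization strength are illustrative
assumptions:

import numpy as np
from sklearn.linear_model import LogisticRegression

def similarity_map(K, landmarks):
    """Map x to the empirical similarity features [K(x,y1), ..., K(x,yd)]."""
    def F(x):
        return np.array([K(x, y) for y in landmarks])
    return F

# Toy data and a toy similarity (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
K = lambda a, b: float(np.exp(-np.sum((a - b) ** 2)))   # any similarity; PSD not required

landmarks = X[rng.choice(len(X), size=30, replace=False)]  # unlabeled landmark sample
F = similarity_map(K, landmarks)
Z = np.vstack([F(x) for x in X])                           # re-represented data

# L1-regularized linear separator in the landmark space (stand-in for Winnow).
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(Z, y)
print("training accuracy:", clf.score(Z, y))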

21
Kernels and Similarity Functions
Theorem
A good kernel K is also a good similarity function
(but γ gets squared): if K has margin γ in the
implicit space, then for any ε, K is
(ε, γ², ε)-good in our sense.
22
Kernels and Similarity Functions
Theorem
A good kernel K is also a good similarity function
(but γ gets squared).
Can also show a separation.
Theorem
There exist a class C and a distribution D such
that ∃ a similarity function with large γ for all
f in C, but no large-margin kernel function exists.
23
Kernels and Similarity Functions
Theorem
For any class C of pairwise uncorrelated functions,
∃ a similarity function good for all f in C, but no
such good kernel function exists.
  • In principle, one should be able to learn from
    O(ε⁻¹ log(|C|/δ)) labeled examples.
  • Claim 1: can define a generic (0, 1, 1/|C|)-good
    similarity function achieving this bound
    (assuming D is not too concentrated).
  • Claim 2: there is no (ε,γ)-good kernel in hinge
    loss, even if ε = 1/2 and γ = 1/|C|^{1/2}. So the
    margin-based sample complexity is Ω(|C|).

24
Generic Similarity Function
  • Partition X into regions R1, ..., R_|C| with
    P(Ri) > 1/poly(|C|).
  • Ri will play the role of R for target fi.
  • For y in Ri, define K(x,y) = fi(x)fi(y).
  • So, for any target fi in C and any x, we get
    E_y[ l(x) l(y) K(x,y) | y ∈ Ri ] = E[ l(x)² l(y)² ] = 1.
  • So K is (0, 1, 1/poly(|C|))-good.

This gives the bound O(ε⁻¹ log |C|).
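A one-line check of the expectation above, just expanding the
definition of K on R_i (with l = f_i for target f_i, and
labels in {−1,+1} as in the standard setup):

\[
\mathbb{E}_{y}\bigl[\, l(x)\, l(y)\, K(x,y) \mid y \in R_i \bigr]
 = \mathbb{E}_{y}\bigl[\, f_i(x) f_i(y) \cdot f_i(x) f_i(y) \bigr]
 = \mathbb{E}_{y}\bigl[\, f_i(x)^2 f_i(y)^2 \bigr] = 1 .
\]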
25
Similarity Functions for Classification:
Algorithmic Implications
  • Can use non-PSD similarities; no need to
    transform them into PSD functions and plug them
    into an SVM. Instead, use the empirical
    similarity map.
    (E.g., Liao and Noble, Journal of Computational
    Biology.)
  • Gives justification for this rule.
  • Shows that anything learnable with an SVM is
    also learnable this way.

26
Learning with Multiple Similarity Functions
  • Let K1, ..., Kr be similarity functions such
    that some (unknown) convex combination of them
    is (ε,γ)-good.

Algorithm
  • Draw a set S = {y1, ..., yd} of landmarks and
    concatenate the features:
      F(x) = [K1(x,y1), ..., Kr(x,y1), ...,
              K1(x,yd), ..., Kr(x,yd)].
  • Run the same L1 optimization algorithm as before
    in this new feature space.

27
Learning with Multiple Similarity Functions
  • Let K1, ..., Kr be similarity functions such
    that some (unknown) convex combination of them
    is (ε,γ)-good.

Algorithm
  • Draw a set S = {y1, ..., yd} of landmarks and
    concatenate the features:
      F(x) = [K1(x,y1), ..., Kr(x,y1), ...,
              K1(x,yd), ..., Kr(x,yd)].

Guarantee: with high probability, the induced
distribution F(P) in R^{dr} has a separator of
error ≤ ε at L1 margin at least γ/4.
The sample complexity only increases by a log(r)
factor! (A sketch of the feature map follows.)
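A small sketch of the concatenated feature map F(x) for
multiple similarity functions; the two similarities and the
landmarks are placeholders, and the resulting dr-dimensional
vector is what gets fed to the L1 learner sketched earlier:

import numpy as np

def multi_similarity_map(similarities, landmarks):
    """F(x) = [K1(x,y1), ..., Kr(x,y1), ..., K1(x,yd), ..., Kr(x,yd)]."""
    def F(x):
        return np.array([K(x, y) for y in landmarks for K in similarities])
    return F

# Two illustrative similarity functions.
K1 = lambda a, b: float(np.dot(a, b))                      # linear similarity
K2 = lambda a, b: float(np.exp(-np.sum((a - b) ** 2)))     # Gaussian-style similarity

landmarks = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
F = multi_similarity_map([K1, K2], landmarks)
print(F(np.array([0.5, 0.5])))   # a vector with d*r = 3*2 = 6 coordinates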
28
Learning with Multiple Similarity Functions
  • Let K1, ..., Kr be similarity functions such
    that some (unknown) convex combination of them
    is (ε,γ)-good.

Algorithm
  • Draw a set S = {y1, ..., yd} of landmarks and
    concatenate the features:
      F(x) = [K1(x,y1), ..., Kr(x,y1), ...,
              K1(x,yd), ..., Kr(x,yd)].

Guarantee: with high probability, the induced
distribution F(P) in R^{dr} has a separator of
error ≤ ε at L1 margin at least γ/4.

Proof: imagine the mapping F°(x) = [K°(x,y1), ...,
K°(x,yd)] for the good similarity function
K° = α1 K1 + ... + αr Kr. Consider w° = (w1, ..., wd)
of L1 norm 1 and margin γ/4. The vector
w = (α1 w1, α2 w1, ..., αr w1, ..., α1 wd, α2 wd,
..., αr wd) also has L1 norm 1 and satisfies
w·F(x) = w°·F°(x).
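Writing out the two claims at the end of the proof, using
that the α_i are nonnegative and sum to 1:

\[
w\cdot F(x) = \sum_{j=1}^{d}\sum_{i=1}^{r} \alpha_i w_j\, K_i(x,y_j)
 = \sum_{j=1}^{d} w_j \sum_{i=1}^{r} \alpha_i K_i(x,y_j)
 = \sum_{j=1}^{d} w_j\, K^{\circ}(x,y_j)
 = w^{\circ}\cdot F^{\circ}(x),
\]
\[
\|w\|_1 = \sum_{j=1}^{d}\sum_{i=1}^{r} \alpha_i |w_j|
 = \Bigl(\sum_i \alpha_i\Bigr)\Bigl(\sum_j |w_j|\Bigr) = 1 .
\]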
29
Learning with Multiple Similarity Functions
  • Because the property is defined in terms of L1,
    there is no change in margin!
  • Only a log(r) penalty for concatenating the
    feature spaces.
  • With L2, the margin would drop by a factor of
    r^{1/2}, giving an O(r) penalty in sample
    complexity.
  • The algorithm is also very simple (just
    concatenate).
  • Alternative algorithm: do a joint optimization:
  • solve for K° = α1 K1 + ... + αr Kr and a vector
    w° such that w° has a good L1 margin in the
    space defined by F°(x) = [K°(x,y1), ..., K°(x,yd)].
  • The bound also holds here, since the capacity is
    only lower.
  • But we don't know how to do this efficiently.

30
Learning with Multiple Similarity Functions
  • Interesting fact: because the property is
    defined in terms of L1, there is no change in
    margin!
  • Only a log(r) penalty for concatenating the
    feature spaces.
  • With L2, the margin would drop by a factor of
    r^{1/2}, giving an O(r) penalty in sample
    complexity.
  • Also, since any large-margin kernel is also a
    good similarity function,
  • the log(r) penalty applies to the "concatenate
    and optimize L1 margin" algorithm for kernels
    too.
  • But γ is potentially squared in the translation,
    and we add an extra ε to the hinge loss at a
    1/ε cost in unlabeled data.
  • Nonetheless, if r is large, this can be a good
    tradeoff!

31
Open questions (part I)
  • Can we deal (efficiently?) with a general convex
    class K of similarity functions?
  • Not just K = {α1 K1 + ... + αr Kr : αi ≥ 0,
    α1 + ... + αr = 1}.
  • Can we efficiently implement the direct joint
    optimization for the convex-combination case?
  • Alternatively, can we use the concatenation
    algorithm to extract a good convex combination K°?
  • Two quite different algorithm styles; is there
    anything in between?
  • Use this approach for transfer learning?

32
Part 2: Can we use this angle to help think about
clustering?
33
Clustering comes up in many places
  • Given a set of documents or search results,
    cluster them by topic.
  • Given a collection of protein sequences, cluster
    them by function.
  • Given a set of images of people, cluster by who
    is in them.

34
Can model clustering like this
  • Given a data set S of n objects (e.g., news
    articles).
  • There is some (unknown) ground-truth clustering
    (e.g., by topic: sports, politics, ...).
  • Goal: produce a hypothesis clustering
    C1, C2, ..., Ck that matches the target as much
    as possible (minimize mistakes, up to
    renumbering of indices).
  • Problem: no labeled data!
  • But we do have a measure of similarity.
35
Can model clustering like this
What conditions on a similarity measure would be
enough to allow one to cluster well?
  • Given a data set S of n objects (e.g., news
    articles).
  • There is some (unknown) ground-truth clustering
    (e.g., by topic: sports, politics, ...).
  • Goal: produce a hypothesis clustering
    C1, C2, ..., Ck that matches the target as much
    as possible (minimize mistakes, up to
    renumbering of indices).
  • Problem: no labeled data!
  • But we do have a measure of similarity.
36
What conditions on a similarity measure would be
enough to allow one to cluster well?
  • Contrast with the more standard approach to
    clustering analysis:
  • View the similarity/distance information as
    ground truth.
  • Analyze the ability of algorithms to achieve
    different optimization criteria (min-sum,
    k-means, k-median, ...).
  • Or, assume a generative model, like a mixture
    of Gaussians.
  • Here, no generative assumptions. Instead: given
    the data, how powerful a K do we need to be able
    to cluster it well?
37
Here is a condition that trivially works
What conditions on a similarity measure would be
enough to allow one to cluster well?
  • Suppose K has the property that:
  • K(x,y) > 0 for all x,y such that C(x) = C(y),
  • K(x,y) < 0 for all x,y such that C(x) ≠ C(y).
  • If we have such a K, then clustering is easy.
  • Now, let's try to make this condition a little
    weaker.

38
What conditions on a similarity measure would be
enough to allow one to cluster well?
  • Suppose K has the property that all x are more
    similar to all points y in their own cluster
    than to any y in other clusters.
  • Still a very strong condition.
  • Problem: the same K can satisfy this for two
    very different clusterings of the same data!
    (Figure: clusters of baseball and basketball
    documents.)
39
What conditions on a similarity measure would be
enough to allow one to cluster well?
  • Suppose K has the property that all x are more
    similar to all points y in their own cluster
    than to any y in other clusters.
  • Still a very strong condition.
  • Problem: the same K can satisfy this for two
    very different clusterings of the same data!
    (Figure: baseball, basketball, math, and physics
    documents; the property can hold both for the
    four-topic clustering and for a coarser two-way
    split.)
40
Let's weaken our goals a bit
  • OK to produce a hierarchical clustering (tree)
    such that the target clustering is approximately
    some pruning of it.
  • E.g., in the case from the last slide.
  • Can view this as saying: "if any of these
    clusters is too broad, just click and I will
    split it for you."
  • Or, OK to output a small number of clusterings
    such that at least one has low error (like
    list-decoding), but we won't talk about this one
    today.

41
Then you can start getting somewhere.
  1. "All x are more similar to all y in their own
     cluster than to any y from any other cluster"
     is sufficient to get a hierarchical clustering
     such that the target is some pruning of the
     tree. (Kruskal's / single-linkage works.)

42
Then you can start getting somewhere.
  1. "All x are more similar to all y in their own
     cluster than to any y from any other cluster"
     is sufficient to get a hierarchical clustering
     such that the target is some pruning of the
     tree. (Kruskal's / single-linkage works.)

  2. Weaker condition: the ground truth is "stable".
     For all clusters C, C', and for all A ⊆ C,
     A' ⊆ C': A and A' are not both more similar on
     average to each other than to the rest of their
     own clusters.
     (View K(x,y) as an "attraction" between x and
     y; plus technical conditions at the boundary.)
     This is sufficient to get a good tree using the
     average single-linkage algorithm.
43
Analysis for slightly simpler version
Assume that for all C, C', and all A ⊊ C, A' ⊆ C',
we have K(A, C−A) > K(A, A'), where K(A,B) denotes
Avg_{x∈A, y∈B} K(x,y), and say K is symmetric.
  • Algorithm: average single-linkage.
  • Like Kruskal's, but at each step merge the pair
    of clusters whose average similarity is highest
    (see the sketch after this slide).
  • Analysis: all clusters made are laminar with
    respect to the target.
  • Failure iff we merge C1, C2 such that C1 ⊊ C
    and C2 ∩ C = ∅.
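A minimal sketch of average single-linkage on a precomputed
similarity matrix (illustrative: it repeatedly merges the
most-similar pair of current clusters and records the merge
tree):

import numpy as np

def average_single_linkage(S):
    """Build a merge tree bottom-up from an n x n similarity matrix S.
    At each step, merge the pair of current clusters with the highest
    average pairwise similarity. Returns the list of merges."""
    clusters = [[i] for i in range(len(S))]
    merges = []
    while len(clusters) > 1:
        best, best_pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                avg = np.mean([S[i][j] for i in clusters[a] for j in clusters[b]])
                if avg > best:
                    best, best_pair = avg, (a, b)
        a, b = best_pair
        merges.append((clusters[a], clusters[b]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

# Tiny example: two obvious groups {0,1} and {2,3}.
S = np.array([[1.0, 0.9, 0.1, 0.2],
              [0.9, 1.0, 0.2, 0.1],
              [0.1, 0.2, 1.0, 0.8],
              [0.2, 0.1, 0.8, 1.0]])
print(average_single_linkage(S))   # {0,1} and {2,3} merge before the final merge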

44
Analysis for slightly simpler version
Assume that for all C, C', and all A ⊊ C, A' ⊆ C',
we have K(A, C−A) > K(A, A'), where K(A,B) denotes
Avg_{x∈A, y∈B} K(x,y), and say K is symmetric.
(Figure: C1 and C3 inside the target cluster C,
with C2 outside.)
  • Algorithm: average single-linkage.
  • Like Kruskal's, but at each step merge the pair
    of clusters whose average similarity is highest.
  • Analysis: all clusters made are laminar with
    respect to the target.
  • Failure iff we merge C1, C2 such that C1 ⊊ C
    and C2 ∩ C = ∅.
  • But then there must exist a cluster C3 ⊊ C that
    is at least as similar to C1 as the average.
    Contradiction.

45
More sufficient properties
  3. "All x are more similar to all y in their own
     cluster than to any y from any other cluster,"
     but with noisy data added.
  • Noisy data can ruin bottom-up algorithms, but
    one can show that a generate-and-test style
    algorithm works:
  • Create a collection of plausible clusters.
  • Use a series of pairwise tests to remove/shrink
    clusters until they are consistent with a tree.

46
More sufficient properties
  3. "All x are more similar to all y in their own
     cluster than to any y from any other cluster,"
     but with noisy data added.
  4. Implicit assumptions made by the optimization
     approach: any approximately-optimal (e.g.,
     k-median) solution is close (in terms of how
     points are clustered) to the target.
     [Nina Balcan's talk on Saturday]
47
Can also analyze inductive setting
  • Can use regularity-type results of [AFKK] to
    argue that w.h.p., a reasonable-size sample S
    will give good estimates of all desired
    quantities.
  • Once S is hierarchically partitioned, can insert
    new points as they arrive.

48
Like a PAC model for clustering
  • A "property" is a relation between the target
    and the similarity information (data). Like a
    data-dependent concept class in learning.
  • Given data and a similarity function K, a
    property induces a "concept class" C of all
    clusterings c such that (c, K) is consistent
    with the property.
  • Tree model: want a tree T such that the set of
    prunings of T forms an ε-cover of C.
  • In the inductive model, want this with
    probability ≥ 1−δ.

49
Summary (part II)
  • Exploring the question: what does an algorithm
    need in order to cluster well?
  • What natural properties allow a similarity
    measure to be useful for clustering?
  • To get a good theory, it helps to relax what we
    mean by "useful for clustering".
  • The user can then decide how specific to be in
    each part of the domain.
  • Analyze a number of natural properties and prove
    guarantees on algorithms able to use them.

50
Wrap-up
  • A tour through learning and clustering by
    similarity functions.
  • A user with some knowledge of the problem domain
    comes up with a pairwise similarity measure
    K(x,y) that makes sense for the given problem.
  • The algorithm uses this (together with labeled
    data, in the case of learning) to find a good
    solution.
  • Goals of a theory:
  • Give guidance to the similarity-function
    designer (what properties to shoot for?).
  • Understand what properties are sufficient for
    learning/clustering, and by what algorithms.
  • For learning: get a theory of kernels without
    the need for implicit spaces.
  • For clustering: reverses the usual view;
    suggests giving the algorithm some slack (tree
    vs. partitioning).
  • A lot of interesting questions are still open in
    these areas.