1
A Theory of Learning and Clustering via Similarity Functions
Maria-Florina Balcan
Carnegie Mellon University
Joint work with Avrim Blum and Santosh Vempala
09/17/2007
2
2-Minute Version
Generic classification problem: learn to distinguish men from women.
Problem: the pixel representation is not so good.
Powerful technique: use a kernel, a special kind of similarity function; there is a nice SLT theory for it.
But the theory is in terms of implicit mappings.
Can we develop a theory that views K as a measure of similarity? What are general sufficient conditions for K to be useful for learning?
3
2-Minute Version
Generic classification problem: learn to distinguish men from women.
Problem: the pixel representation is not so good.
Powerful technique: use a kernel, a special kind of similarity function.
What if we don't have any labeled data (i.e., clustering)? Can we develop a theory of conditions sufficient for K to be useful now?
4
Part I: On Similarity Functions for Classification
5
Kernel Functions and Learning
E.g., given images labeled by gender, learn a rule to distinguish men from women.
Goal: do well on new data.
Problem: our best algorithms learn linear separators, which are not good for data in its natural representation.
Old approach: learn a more complex class of functions.
New approach: use a kernel.
6
Kernels, Kernelizable Algorithms
  • K is a kernel if there exists an implicit mapping φ s.t. K(x,y) = φ(x)·φ(y).

Point: many algorithms interact with data only via dot products.
  • If we replace x·y with K(x,y), the algorithm acts implicitly as if the data were in the higher-dimensional φ-space.
  • If the data is linearly separable by a large margin in φ-space, we don't have to pay for that space in terms of sample complexity or computation time.

If the margin is γ in φ-space, only ~1/γ² examples are needed to learn well.
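To make "replace dot products with K" concrete, here is a minimal sketch (not from the slides) of a kernelized perceptron; the RBF kernel and all names are illustrative assumptions.

  import numpy as np

  def rbf_kernel(x, y, width=1.0):
      # One standard example of a legal kernel; its implicit phi-space is infinite-dimensional.
      return np.exp(-width * np.sum((x - y) ** 2))

  def kernel_perceptron(X, labels, K, epochs=10):
      # The learned rule is sign(sum_j alpha_j * l(x_j) * K(x_j, x)): the ordinary
      # perceptron with every dot product x_j . x replaced by K(x_j, x).
      n = len(X)
      alpha = np.zeros(n)
      for _ in range(epochs):
          for i in range(n):
              score = sum(alpha[j] * labels[j] * K(X[j], X[i]) for j in range(n))
              if labels[i] * score <= 0:   # mistake -> strengthen this example's weight
                  alpha[i] += 1
      return alpha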
7
Kernels and Similarity Functions
Kernels: useful for many kinds of data; elegant SLT.
Our work: analyze more general similarity functions.
Characterization of good similarity functions:
1) In terms of natural, direct properties:
  • no implicit high-dimensional spaces
  • no requirement of positive-semidefiniteness

2) If K satisfies these properties, it can be used for learning.
3) Is broad: includes the usual notion of a good kernel (one with a large-margin separator in φ-space).
8
A First Attempt: a Definition Satisfying (1) and (2)
P: distribution over labeled examples (x, l(x)).
  • K(x,y) → [-1,1] is an (ε,γ)-good similarity for P if a 1-ε prob. mass of x satisfy:

E_{y~P}[K(x,y) | l(y) = l(x)] ≥ E_{y~P}[K(x,y) | l(y) ≠ l(x)] + γ
  • E.g., K(x,y) = 0.2 if l(x) = l(y); K(x,y) random in {-1,1} if l(x) ≠ l(y).

Note: K might not be a legal kernel.
9
A First Attempt: a Definition Satisfying (1) and (2). How to use it?
  • K(x,y) → [-1,1] is an (ε,γ)-good similarity for P if a 1-ε prob. mass of x satisfy:

E_{y~P}[K(x,y) | l(y) = l(x)] ≥ E_{y~P}[K(x,y) | l(y) ≠ l(x)] + γ
Algorithm
  • Draw S+ of O((1/γ²) ln(1/δ²)) positive examples.
  • Draw S− of O((1/γ²) ln(1/δ²)) negative examples.
  • Classify x based on which set gives the better average score.

Guarantee: with probability ≥ 1-δ, error ≤ ε + δ.
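A minimal sketch of this classifier (my illustration, not the authors' code); S_plus and S_minus are the drawn positive and negative samples and K is the given similarity function.

  import numpy as np

  def classify(x, S_plus, S_minus, K):
      # Compare the average similarity of x to the positive sample vs. the negative
      # sample and predict the side with the higher average score.
      pos_score = np.mean([K(x, y) for y in S_plus])
      neg_score = np.mean([K(x, z) for z in S_minus])
      return +1 if pos_score >= neg_score else -1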
10
A First Attempt: a Definition Satisfying (1) and (2). How to use it?
  • K(x,y) → [-1,1] is an (ε,γ)-good similarity for P if a 1-ε prob. mass of x satisfy:

E_{y~P}[K(x,y) | l(y) = l(x)] ≥ E_{y~P}[K(x,y) | l(y) ≠ l(x)] + γ
Guarantee: with probability ≥ 1-δ, error ≤ ε + δ.
  • Hoeffding: for any given good x, the prob. of error w.r.t. x (over the draw of S+, S−) is ≤ δ².
  • So there is at most a δ chance that the error rate over the GOOD points is ≥ δ.
  • Overall error rate ≤ ε + δ.
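For completeness, the Hoeffding bound being invoked (a reconstruction, with d the number of examples drawn per class and K ∈ [-1,1]); for the positive sample S+ = {y1, …, yd},

  \Pr\left[\,\Bigl|\tfrac{1}{d}\sum_{i=1}^{d} K(x,y_i) - \mathbb{E}_{y\sim P}\bigl[K(x,y)\mid l(y)=+1\bigr]\Bigr| \ge \tfrac{\gamma}{2}\,\right] \;\le\; 2e^{-d\gamma^{2}/8}.

Taking d = O((1/γ²) ln(1/δ²)) makes this at most δ²/2 for each of S+ and S−; a union bound over the two samples then bounds the misclassification probability of any fixed good x by δ², as claimed above.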

11
A First Attempt: Not Broad Enough
  • K(x,y) → [-1,1] is an (ε,γ)-good similarity for P if a 1-ε prob. mass of x satisfy:

E_{y~P}[K(x,y) | l(y) = l(x)] ≥ E_{y~P}[K(x,y) | l(y) ≠ l(x)] + γ
  • K(x,y) = x·y has a large-margin separator but doesn't satisfy our definition.
12
A First Attempt: Not Broad Enough
  • K(x,y) → [-1,1] is an (ε,γ)-good similarity for P if a 1-ε prob. mass of x satisfy:

E_{y~P}[K(x,y) | l(y) = l(x)] ≥ E_{y~P}[K(x,y) | l(y) ≠ l(x)] + γ
(Figure: data with positive and negative examples and a region R.)
Broaden: it is OK if there exists a non-negligible region R s.t. most x are on average more similar to points y ∈ R of the same label than to points y ∈ R of the other label.
13
Broader/Main Definition
  • K(x,y) → [-1,1] is an (ε,γ)-good similarity for P if there exists a weighting function w(y) ∈ [0,1] s.t. a 1-ε prob. mass of x satisfy:

E_{y~P}[w(y)K(x,y) | l(y) = l(x)] ≥ E_{y~P}[w(y)K(x,y) | l(y) ≠ l(x)] + γ
Algorithm
  • Draw S+ = {y1, …, yd} and S− = {z1, …, zd}, with d = O((1/γ²) ln(1/δ²)).
  • Triangulate the data:

F(x) = [K(x,y1), …, K(x,yd), K(x,z1), …, K(x,zd)].
  • Take a new set of labeled examples, project them into this space, and run any algorithm for learning linear separators.

Theorem: with probability ≥ 1-δ, there exists a linear separator of error ≤ ε + δ at margin γ/4.
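A minimal sketch of the triangulation step and the downstream linear learner (an illustration under assumed names; any off-the-shelf linear-separator learner can be used for the last step, scikit-learn's LinearSVC here).

  import numpy as np
  from sklearn.svm import LinearSVC   # any linear-separator learner will do

  def triangulate(x, S_plus, S_minus, K):
      # F(x) = [K(x,y1), ..., K(x,yd), K(x,z1), ..., K(x,zd)]
      return np.array([K(x, y) for y in S_plus] + [K(x, z) for z in S_minus])

  def learn_separator(labeled_X, labeled_y, S_plus, S_minus, K):
      # Project a fresh labeled set into the landmark space and learn a linear separator there.
      F = np.array([triangulate(x, S_plus, S_minus, K) for x in labeled_X])
      return LinearSVC().fit(F, labeled_y)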
14
Main Definition: Algorithm, Implications
  • S+ = {y1, …, yd}, S− = {z1, …, zd}, d = O((1/γ²) ln(1/δ²)).
  • Triangulate the data:

F(x) = [K(x,y1), …, K(x,yd), K(x,z1), …, K(x,zd)].
Theorem: with prob. ≥ 1-δ, there exists a linear separator of error ≤ ε + δ at margin γ/4.
(Diagram: an arbitrary (ε,γ)-good sim. function K yields, via the mapping F, a legal kernel that is an (ε′, γ/4)-good kernel function.)
Theorem
Any (ε,γ)-good kernel is an (ε′,γ′)-good similarity function
(with some penalty: ε′ = ε + ε_extra, γ′ = γ² ε_extra).
15
Similarity Functions for Classification, Summary
  • A formal way of understanding kernels as similarity functions.
  • Algorithms and guarantees for general similarity functions that aren't necessarily PSD.

16
Part II: Can we use this angle to help think about Clustering?
17
What if only unlabeled examples available?
(Figure: documents/images to be grouped by topic, e.g., sports vs. fashion.)
S: a set of n objects.
There is some (unknown) ground-truth clustering; each object has a true label l(x) in {1,…,t}.
Goal: find a hypothesis h of low error, up to isomorphism of the label names:
Err(h) = min_σ Pr_{x∈S}[σ(h(x)) ≠ l(x)]
Problem: we only have unlabeled data!
But we have a similarity function!
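A small sketch of the error measure Err(h), minimizing over relabelings σ by brute force (fine only for small t; assumes cluster labels are 0, …, t-1; names illustrative).

  from itertools import permutations

  def clustering_error(h_labels, true_labels, t):
      # Err(h) = min over permutations sigma of Pr_{x in S}[ sigma(h(x)) != l(x) ]
      n = len(true_labels)
      best = n
      for sigma in permutations(range(t)):
          mistakes = sum(sigma[h] != l for h, l in zip(h_labels, true_labels))
          best = min(best, mistakes)
      return best / n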
18
What conditions on a similarity function would be
enough to allow one to cluster well?
19
Contrast with Standard Approach
Traditional approach: the input is a graph or an embedding of points into R^d.
- analyze algorithms that optimize various criteria
- ask which criterion produces better-looking results
We flip this perspective around.
More natural, since the input graph/similarity is merely based on some heuristic.
- closer to learning mixtures of Gaussians
- but discriminative, not generative
20
A condition that trivially works
What conditions on a similarity function would be enough to allow one to cluster well?
K(x,y) > 0 for all x,y with l(x) = l(y); K(x,y) < 0 for all x,y with l(x) ≠ l(y).
21
What conditions on a similarity function would be
enough to allow one to cluster well?
Still strong:
Strict Ordering Property
K is s.t. all x are more similar to points y in their own cluster than to any y′ in other clusters.
Problem: the same K can satisfy this for two very different clusterings of the same data!
Unlike learning, you can't even test your hypotheses!
22
Relax Our Goals
1. Produce a hierarchical clustering s.t.
correct answer is approximately some pruning of
it.
23
Relax Our Goals
1. Produce a hierarchical clustering s.t.
correct answer is approximately some pruning of
it.
(Example hierarchy: All topics → {sports → {soccer, tennis}, fashion → {Coco Chanel, Lacoste}}.)
28
Relax Our Goals
1. Produce a hierarchical clustering s.t.
correct answer is approximately some pruning of
it.
2. Produce a list of clusterings s.t. at least one has low error.
Tradeoff: strength of the assumption vs. size of the list.
29
Start Getting Nice Algorithms/Properties
Sufficient for hierarchical clustering:
Strict Ordering Property
K is s.t. all x are more similar to points y in their own cluster than to any y′ in other clusters.
Sufficient for hierarchical clustering:
Weak Stability Property
For all clusters C, C′ and all A ⊂ C, A′ ⊂ C′, at least one of A, A′ is more attracted to its own cluster than to the other.
30
Example Analysis for the Strong Stability Property
K is s.t. for all clusters C, C′ and all A ⊂ C, A′ ⊂ C′:
K(A, C−A) > K(A, A′)
(where K(A,A′) denotes the average attraction between A and A′).
Algorithm: average single-linkage
  • merge the two parts whose average similarity is highest.

Claim: all parts formed are laminar w.r.t. the target clustering.
Analysis
  • Failure iff we merge some P1, P2 s.t. P1 ⊂ C and P2 ∩ C = ∅.
  • But then there must exist P3 ⊂ C with K(P1,P3) ≥ K(P1, C−P1), and by the property K(P1, C−P1) > K(P1, P2); so the algorithm would have preferred to merge P1 with P3.

Contradiction.
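A minimal sketch of the average single-linkage procedure analyzed above (my illustrative implementation over a precomputed similarity matrix, not the authors' code).

  import numpy as np

  def average_single_linkage(S):
      # S: n x n similarity matrix K(x,y). Start with singleton parts and repeatedly
      # merge the pair of parts with the highest average similarity K(A,B); the recorded
      # merges form the hierarchy, whose prunings are the candidate clusterings.
      parts = [[i] for i in range(len(S))]
      merges = []
      while len(parts) > 1:
          best = None
          for i in range(len(parts)):
              for j in range(i + 1, len(parts)):
                  avg = np.mean([S[a][b] for a in parts[i] for b in parts[j]])
                  if best is None or avg > best[0]:
                      best = (avg, i, j)
          _, i, j = best
          merges.append((parts[i], parts[j]))
          parts = [p for k, p in enumerate(parts) if k not in (i, j)] + [parts[i] + parts[j]]
      return merges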
31
Strong Stability Property, Inductive Setting
Inductive Setting
Draw a sample S, hierarchically partition S.
Insert new points as they arrive.
Assume for all clusters C, C′ and all A ⊂ C, A′ ⊆ C′:
K(A, C−A) ≥ K(A, A′) + γ
  • Need to argue that sampling preserves stability.
  • A sample-complexity-type argument using regularity-type results of [AFKK].

32
Weaker Conditions
Not sufficient for a hierarchy:
Average Attraction Property
E_{x′∈C(x)}[K(x,x′)] ≥ E_{x′∈C′}[K(x,x′)] + γ   (for all C′ ≠ C(x))
Can produce a small list of clusterings.
Upper bound: t^{O(t/γ²)}  (doesn't depend on n).
Lower bound: t^{Ω(1/γ)}.
Sufficient for a hierarchy:
Stability of Large Subsets Property
Might cause bottom-up algorithms to fail.
Find the hierarchy using a learning-based algorithm (running time t^{O(t/γ²)}).
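Returning to the average attraction property above: one natural way to realize list clustering is a sampling scheme (a rough, hedged sketch of the idea, not the exact algorithm or parameters from the paper): draw a small sample, enumerate every way of labeling it with t cluster names, and extend each labeling by assigning every point to the sample-cluster it is most attracted to on average.

  from itertools import product
  import numpy as np

  def list_clusterings(S, sample_idx, t):
      # S: n x n similarity matrix; sample_idx: indices of a small drawn sample.
      # Each labeling of the sample yields one candidate clustering of all n points;
      # the returned list has at most t^|sample| entries, independent of n.
      n = len(S)
      candidates = []
      for labeling in product(range(t), repeat=len(sample_idx)):
          groups = [[s for s, lab in zip(sample_idx, labeling) if lab == c] for c in range(t)]
          if any(not g for g in groups):      # skip labelings that leave a cluster empty
              continue
          h = [max(range(t), key=lambda c: np.mean([S[x][m] for m in groups[c]]))
               for x in range(n)]
          candidates.append(h)
      return candidates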
33
Similarity Functions for Clustering, Summary
A discriminative (SLT-style) model for clustering with non-interactive feedback.
  • Minimal conditions on K for it to be useful for clustering.
  • List clustering.
  • Hierarchical clustering.
  • Our notion of a property is the analogue of a data-dependent concept class in classification.

34
Thank you!