Title: Theoretical Foundations of Clustering (MLSS Tutorial)
1 Theoretical Foundations of Clustering: MLSS Tutorial
- Shai Ben-David
- University of Waterloo, Waterloo, Canada
2 The Theory-Practice Gap
Clustering is one of the most widely used tools
for exploratory data analysis. Social sciences,
biology, astronomy, computer science, and many
other fields all apply clustering to gain a first
understanding of the structure of large data sets.
Yet there exists distressingly little theoretical
understanding of clustering.
3 Overview of this tutorial
- What is clustering? Can we formally define it?
- Model selection issues: How would you choose the
best clustering paradigm for your data? How
should you choose the number of clusters?
- Computational complexity issues: Can good
clusterings be efficiently computed?
4 Questions that research on the fundamentals of
clustering should address
- Can clustering be given a formal and general
definition?
- What is a good clustering?
- Can we distinguish clusterable from
structureless data?
- Can we distinguish meaningful clustering from
random structure?
5 Inherent Obstacles
Clustering is not well defined. There is a wide
variety of different clustering tasks, with
different (often implicit) measures of quality.
6 There are Many Clustering Tasks
- Clustering is an ill-defined problem:
there are many different clustering tasks,
leading to different clustering paradigms.
8 Some more examples
9 Some real examples of clustering ambiguity
- Cluster paintings
- by painter vs. topic
- Cluster speech recordings
- by speaker vs. content
- Cluster text documents
- by sentiment vs. topic
10 Other Inherent Obstacles
In most practical clustering tasks there is no
clear ground truth against which to evaluate your
solution (in contrast with classification tasks,
where you can evaluate a classifier against a
held-out labeled set).
11 Examples of some popular clustering paradigms:
Linkage Clustering
- Given a set of points and distances between them,
we extend the distance function to apply to any
pair of domain subsets. The clustering algorithm
then proceeds in stages.
- In each stage, the two clusters with the minimal
distance between them are merged.
- The user has to set the stopping criterion:
when should the merging stop? (A sketch of the
procedure follows below.)
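To make the merging process concrete, here is a minimal Python sketch (my illustration, not code from the tutorial). Cluster-to-cluster distance is the minimum pairwise point distance (single linkage), and the stop-at-k-clusters rule is just one of the possible criteria:

```python
import numpy as np

def single_linkage(dist: np.ndarray, num_clusters: int) -> list[set[int]]:
    n = dist.shape[0]
    clusters = [{i} for i in range(n)]          # start from singletons
    while len(clusters) > num_clusters:
        best, best_d = (0, 1), np.inf
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Extend d to subsets: distance between two clusters.
                d = min(dist[a, b] for a in clusters[i] for b in clusters[j])
                if d < best_d:
                    best_d, best = d, (i, j)
        i, j = best
        clusters[i] |= clusters[j]              # merge the closest pair
        del clusters[j]
    return clusters

# Example: three points on a line; the two nearest merge first.
D = np.array([[0.0, 1.0, 5.0],
              [1.0, 0.0, 4.0],
              [5.0, 4.0, 0.0]])
print(single_linkage(D, 2))   # -> [{0, 1}, {2}]
```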
12 Single Linkage Clustering: early stopping
13 Single Linkage Clustering: correct stopping
14 Single Linkage Clustering: late stopping
15 Examples of some popular clustering paradigms:
Center-Based Clustering
- The algorithm picks k center points, and the
clusters are defined by assigning each domain
point to the center closest to it.
- The algorithm aims to minimize some cost function
that reflects how compact the resulting clusters
are (an example objective is given below).
- Center-based algorithms differ in their choice of
the cost function (k-means, sum of distances,
k-median, and more).
- The number of clusters, k, is picked by the user.
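As a concrete instance of such a cost function (standard background, stated here in my notation rather than the slide's), the k-means objective scores a partition C1, ..., Ck of points in Euclidean space by

$$\mathrm{cost}(C_1,\dots,C_k)\;=\;\sum_{i=1}^{k}\,\sum_{x\in C_i}\lVert x-c_i\rVert^{2},\qquad c_i=\frac{1}{|C_i|}\sum_{x\in C_i}x,$$

i.e., the sum of squared distances from each point to the center of mass of its cluster.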
16 4-Means clustering example
17 Examples of some popular clustering paradigms
- Single Linkage
- The K-means algorithm
- The K-means objective optimization.
- Spectral clustering
- The actual algorithms
- The objective function justification.
- EM algorithms over parameterized families of
(mixtures of simple) distributions.
18 Common Solutions
Objective utility functions: sum of in-cluster
distances, average distances to center points,
cut weight, etc. (Shmoys, Charikar, Meyerson).
Consider a restricted set of distributions
(generative models), e.g., mixtures of Gaussians
(Dasgupta 99; Vempala 03; Kannan et al. 04;
Achlioptas, McSherry 05).
Add structure: relevant information, the
Information Bottleneck approach (Tishby,
Pereira, Bialek 99).
19 Common Solutions (2)
Focus on specific algorithmic paradigms:
projection-based clustering (random/spectral;
all the above papers), spectral-based
representations (Meila and Shi, Belkin, ...),
the k-means algorithm.
Axiomatic approach: postulate clustering axioms
that, ideally, every clustering approach should
satisfy; such work usually reaches negative
results (e.g., Hartigan 1975; Puzicha, Hofmann,
Buhmann 00; Kleinberg 03).
Many more.
20 Quest for a general clustering theory
- What can we say independently of any particular
algorithm,
- any particular objective function,
- or any specific generative data model?
21 Many different clustering setups
- Different inputs
- Points in Euclidean space
- An arbitrary domain with a point-similarity measure
- A graph (e.g., social networks, web-page links)
- ...
- Different outputs
- Hierarchical (dendrograms)
- Partitionings of the domain
- Soft/probabilistic clusterings
- ...
22 Our Basic Setting for Formal Discussion
- For a finite domain set S, a dissimilarity
function (DF) is a mapping d: S × S → R such that
- d is symmetric, and
- d(x,y) = 0 iff x = y.
- Our input: a dissimilarity function on S (or a
matrix of pairwise distances between domain
points).
- Our output: a partition of S.
- We wish to define the properties that distinguish
clustering functions from other functions that
output domain partitions.
23 Output: a partition of the domain, e.g.,
{x1, x7}, {x2, x5, x9}, ...
(A minimal code rendering of this setting follows
below.)
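A small Python rendering of the setting (illustrative names, not from the tutorial): the DF is a symmetric matrix with zero diagonal, and a clustering function maps it to a partition of the index set {0, ..., n-1}.

```python
from typing import Callable
import numpy as np

Partition = list[set[int]]
ClusteringFunction = Callable[[np.ndarray], Partition]

def is_dissimilarity(d: np.ndarray) -> bool:
    """Check the DF requirements: symmetry, and d(x,y) = 0 iff x = y."""
    off_diag = ~np.eye(d.shape[0], dtype=bool)
    return bool(np.array_equal(d, d.T)
                and np.all(np.diag(d) == 0)
                and np.all(d[off_diag] > 0))
```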
24 Kleinberg's Axioms
- Scale Invariance:
- F(λd) = F(d) for every DF d and every strictly
positive λ.
- Richness:
- For any finite domain S,
{F(d) : d is a DF over S} = {P : P is a partition of S}.
- Consistency:
- If d' equals d, except for shrinking distances
within clusters of F(d) or stretching
between-cluster distances, then F(d') = F(d).
25 Note that any pair of the axioms is realizable
- Consider Single Linkage with different stopping
criteria (sketched below):
- k connected components: satisfies Scale
Invariance and Consistency.
- Distance-r stopping: satisfies Richness and
Consistency.
- Scale-α stopping: add edges as long as their
length is at most α(max distance); satisfies
Scale Invariance and Richness.
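An illustrative sketch of the latter two stopping rules (function names and the axiom attributions in the docstrings follow Kleinberg's paper; the code structure is mine). Single linkage here means: connect point pairs whose distance is below a threshold, then output the connected components.

```python
import numpy as np

def _components(n: int, edges: list[tuple[int, int]]) -> list[set[int]]:
    parent = list(range(n))                    # union-find forest
    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]      # path halving
            x = parent[x]
        return x
    for a, b in edges:
        parent[find(a)] = find(b)
    comps: dict[int, set[int]] = {}
    for x in range(n):
        comps.setdefault(find(x), set()).add(x)
    return list(comps.values())

def linkage_r(d: np.ndarray, r: float) -> list[set[int]]:
    """Distance-r stopping: Rich and Consistent, but not Scale Invariant."""
    n = d.shape[0]
    edges = [(i, j) for i in range(n) for j in range(i + 1, n) if d[i, j] < r]
    return _components(n, edges)

def linkage_alpha(d: np.ndarray, alpha: float) -> list[set[int]]:
    """Scale-alpha stopping: Rich and Scale Invariant, but not Consistent."""
    return linkage_r(d, alpha * d.max())
```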
26 The Surprising Result
- Theorem: There exists no clustering function
that satisfies all three of Kleinberg's axioms
simultaneously.
27 Kleinberg's Impossibility Result
- There exists no clustering function satisfying
all three axioms.
- Proof idea: Scale Invariance and Consistency force
the range of F to be an antichain of partitions
(see slide 33), while Richness requires the range
to contain all partitions, including pairs in
which one refines the other.
28 What is the Take-Home Message?
- A popular interpretation of Kleinberg's result
is (roughly):
- "It is impossible to axiomatize clustering."
- But what that paper actually shows is (only):
- "These specific three axioms cannot work together."
29 Ideal Theory
- We would like the axioms to be such that
- 1. any clustering method satisfies all the
axioms, and
- 2. any function that is clearly not a clustering
fails to satisfy at least one of the axioms.
- (This is probably too much to hope for.)
- We would also like a list of simple properties
by which the major clustering methods can be
distinguished from each other.
30 Axioms to guide a taxonomy of clustering paradigms
- The goal is to generate a variety of axioms (or
properties) over a fixed framework, so that
different clustering approaches can be classified
by the different subsets of axioms they satisfy.
31 Types of Axioms/Properties
- Richness requirements
- E.g., relaxations of Kleinberg's richness, such as
{F(d) : d is a DF over S} = {P : P is a partition
of S into k sets}.
- Invariance/Robustness/Stability requirements
- E.g., Scale Invariance, Consistency, robustness
to perturbations of d (smoothness of F), or
stability w.r.t. sampling of S.
32 Relaxations of Consistency
- Local Consistency:
- Let C1, ..., Ck be the clusters of F(d).
- For every λ0 ≥ 1 and positive λ1, ..., λk ≤ 1, if
d' is defined by
- d'(a,b) = λi d(a,b) if a and b are both in Ci,
- d'(a,b) = λ0 d(a,b) if a, b are not in the same
F(d)-cluster,
- then F(d') = F(d).
- Is there any known clustering method for which
it fails?
33 Some more structure
- For partitions P1, P2 of {1, ..., m}, say that P1
refines P2 if every cluster of P1 is contained in
some cluster of P2.
- A collection C = {Pi} is a chain if, for any P, Q
in C, one of them refines the other.
- A collection of partitions is an antichain if no
partition in it refines another.
- Kleinberg's impossibility result can be rephrased
as:
- If F is Scale Invariant and Consistent, then
its range is an antichain.
- (These definitions are rendered as code below.)
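Direct Python renderings of these definitions (illustrative helpers, mine):

```python
Partition = list[set[int]]

def refines(p1: Partition, p2: Partition) -> bool:
    """P1 refines P2 if every cluster of P1 lies inside some cluster of P2."""
    return all(any(c1 <= c2 for c2 in p2) for c1 in p1)

def is_antichain(ps: list[Partition]) -> bool:
    """No partition in the collection refines a different one."""
    return not any(i != j and refines(ps[i], ps[j])
                   for i in range(len(ps)) for j in range(len(ps)))

# Example: {{0},{1,2}} refines {{0,1,2}}, so the two form a chain.
print(refines([{0}, {1, 2}], [{0, 1, 2}]))          # True
print(is_antichain([[{0}, {1, 2}], [{0, 1, 2}]]))   # False
```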
34 Relaxations of Consistency
- Refinement Consistency:
- Same as Consistency (shrink in-cluster,
stretch between-cluster distances), but we relax
the Consistency requirement F(d') = F(d) to:
one of F(d), F(d') is a refinement of the other.
- Note: a natural version of Single Linkage (join
x, y iff d(x,y) < λ·max{d(s,t) : s, t ∈ X}; this
is the scale-α rule sketched after slide 25)
satisfies this, plus Scale Invariance and Richness.
- So Kleinberg's impossibility result breaks down.
- Should this be an axiom?
- Is there any common clustering function that
fails it?
35 More on Refinement Consistency
- "Minimize sum of in-cluster distances" satisfies
it (as well as Richness and Scale Invariance).
- Center-based clustering fails to satisfy
Refinement Consistency.
- This is quite surprising, since the two objectives
look very much alike (where d is the Euclidean
distance and ci is the center of mass of Ci; see
the reconstructed formulas below).
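The slide's formulas did not survive extraction; the following is a hedged reconstruction from standard definitions (the original may have used a slightly different normalization):

$$\sum_{i=1}^{k}\sum_{x,y\in C_i} d(x,y)^{2}\ \text{(sum of in-cluster distances)} \qquad\text{vs.}\qquad \sum_{i=1}^{k}\sum_{x\in C_i} d(x,c_i)^{2}\ \text{(center-based, k-means)}.$$

Indeed, for each cluster $\sum_{x,y\in C_i} d(x,y)^{2} = 2\,|C_i|\sum_{x\in C_i} d(x,c_i)^{2}$, so the two objectives differ only by per-cluster normalization, which is why they "look very much alike".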
36 Hierarchical Clustering
- Hierarchical clustering takes, on top of d, a
coarseness parameter t.
- For any fixed t, F(t,d) is a clustering function.
- We require, for every d:
- Cd = {F(t,d) : 0 ≤ t ≤ Max} is a chain.
- F(0,d) = {{x} : x ∈ S} and F(Max,d) = {S}.
- (A runnable illustration follows below.)
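One concrete realization of this scheme (my choice of library, not the tutorial's): SciPy's single-linkage dendrogram, where the coarseness parameter t is the height at which the dendrogram is cut.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
Z = linkage(pdist(points), method="single")   # single-linkage dendrogram

# Small t yields the all-singletons partition; large t yields {S}.
for t in (0.05, 1.0, 20.0):
    print(t, fcluster(Z, t=t, criterion="distance"))
```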
37 Hierarchical versions of the axioms
- Scale Invariance: For any d and λ > 0,
{F(t,d) : t} = {F(t,λd) : t} (as sets of partitions).
- Richness: For any finite domain S,
{{F(t,d) : t} : d is a DF over S} = {C : C is a chain
of partitions of S (with the needed Min and Max
partitions)}.
- Consistency: If, for some t, d' is an
F(t,d)-consistent transformation of d, then, for
some t', F(t',d') = F(t,d).
38 Characterizing Single Linkage
- Ordinal Clustering axiom:
- If, for all w, x, y, z,
d(w,x) < d(y,z) iff d'(w,x) < d'(y,z),
then {F(t,d) : t} = {F(t,d') : t} (as sets of
partitions).
- (Note that this implies Scale Invariance.)
- Hierarchical Richness + Consistency + Ordinal
Clustering characterize Single Linkage clustering.
39 Other types of clustering
- Edge detection (advantage to smooth contours)
- Texture clustering
- The professor's example
40 A different setup for axiomatization: measuring
the quality of a clustering
- You get a data set.
- You run, say, your 5-means clustering algorithm
and get a clustering C.
- You compute its 5-means cost: it is 0.7.
- Can you conclude that C is a good clustering?
- How can we verify that the structure described
by C is not just noise?
41 Clustering Quality Measures
- A clustering-quality measure is a function
m(dataset, clustering)
whose values reflect how good or cogent that
clustering is.
42 Axiomatizing Quality Measures
- Consistency:
- Whenever d' is a C-consistent variant of d,
then m(C, X, d') ≥ m(C, X, d).
- Scale Invariance:
- For every positive λ, m(C, X, d) = m(C, X, λd).
- Richness:
- For each non-trivial clustering C of X,
there exists a distance function d over X
such that C = Argmax_C' m(C', X, d).
43 An Additional Axiom: Isomorphism Invariance
- Clusterings C and C' over (X, d) are isomorphic
if there exists a distance-preserving automorphism
φ: X → X such that x, y share the same C-cluster
iff φ(x) and φ(y) share the same C'-cluster.
- Isomorphism Invariance:
- If C and C' are isomorphic, then
m(C, X, d) = m(C', X, d).
44 Major gain (over Kleinberg's framework)
- Every reasonable clustering-quality measure
satisfies our axioms.
- Clustering functions can then be defined as
functions that optimize clustering quality
(a brute-force rendering is sketched below).
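"Clustering function = argmax of a quality measure", rendered by brute force for tiny domains (purely conceptual, my illustration; the number of partitions grows as the Bell numbers):

```python
from collections.abc import Iterator

Partition = list[list[int]]

def partitions(items: list[int]) -> Iterator[Partition]:
    """Enumerate all partitions of a finite set."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for p in partitions(rest):
        for i in range(len(p)):
            yield p[:i] + [[first] + p[i]] + p[i + 1:]
        yield [[first]] + p

def cluster_by_quality(n, d, m):
    """The clustering function induced by quality measure m:
    argmax over all partitions of {0, ..., n-1}."""
    return max(partitions(list(range(n))), key=lambda C: m(C, d))

# Toy demo: a made-up measure favoring tight clusters, penalizing their count.
X = [0.0, 0.1, 5.0]
d = lambda a, b: abs(X[a] - X[b])
m = lambda C, d: -sum(max((d(a, b) for a in c for b in c), default=0.0)
                      for c in C) - 0.5 * len(C)
print(cluster_by_quality(len(X), d, m))   # -> [[0, 1], [2]]
```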
45 Some examples of quality measures
- Normalized clustering cost functions
(e.g., k-means, ratio cut, k-median, etc.)
- Variance ratio: VR(C, X, d)
- Relative margins:
the average ratio between the distance from a
point to its cluster center and its distance to
its second-closest cluster center (a sketch
follows below).
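A sketch of the relative-margins measure as just described (the function name and the sign convention are my choices). Values near 0 mean well-separated, cogent clusters; note that the measure is scale invariant, since multiplying all distances by a positive constant leaves the ratios unchanged.

```python
import numpy as np

def relative_margins(points: np.ndarray, centers: np.ndarray) -> float:
    # Distance from every point to every cluster center, shape (n, k).
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    dists.sort(axis=1)            # per point: nearest, second nearest, ...
    return float(np.mean(dists[:, 0] / dists[:, 1]))

X = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
centers = np.array([[0.1, 0.0], [5.1, 5.0]])
print(relative_margins(X, centers))   # small value: cogent clustering
```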
47 Basic Open Questions
- What do we want from a set of clustering axioms?
(Meta-axiomatization.)
- How can the completeness of a set of axioms be
defined or argued?
- Is there a general property distinguishing,
say, linkage-based from center-based clusterings?
- Are there candidate general clustering properties
that the axioms should prove?
48 Single Linkage Clustering