Title: Theoretical Foundations of Clustering (MLSS Tutorial)
1 Theoretical Foundations of Clustering: MLSS Tutorial
- Shai Ben-David
- University of Waterloo, Waterloo, Canada
2 The Theory-Practice Gap
Clustering is one of the most widely used tools
for exploratory data analysis. Social sciences,
biology, astronomy, computer science, and many
other fields all apply clustering to gain a first
understanding of the structure of large data sets.
Yet there exists distressingly little theoretical
understanding of clustering.
3 Overview of this tutorial
- What is clustering? Can we formally define it?
- Model selection issues: How would you choose the
best clustering paradigm for your data? How
should you choose the number of clusters?
- Computational complexity issues: Can good
clusterings be efficiently computed?
4 Questions that research on the fundamentals of
clustering should address
- Can clustering be given a formal and general
definition?
- What is a good clustering?
- Can we distinguish clusterable from
structureless data?
- Can we distinguish meaningful clustering from
random structure?
5 Inherent Obstacles
Clustering is not well defined. There is a wide
variety of different clustering tasks, with
different (often implicit) measures of quality.
6 There are Many Clustering Tasks
- Clustering is an ill-defined problem:
there are many different clustering tasks,
leading to different clustering paradigms.
8 Some more examples
9 Some real examples of clustering ambiguity
- Cluster paintings
- by painter vs. topic
- Cluster speech recordings
- by speaker vs. content
- Cluster text documents
- by sentiment vs. topic
10 Other Inherent Obstacles
In most practical clustering tasks there is no
clear ground truth against which to evaluate your
solution (in contrast with classification tasks,
where you can evaluate a classifier against a
held-out labeled set).
11 Examples of some popular clustering paradigms:
Linkage Clustering
- Given a set of points and distances between them,
we extend the distance function to apply to any
pair of domain subsets. The clustering algorithm
then proceeds in stages.
- In each stage, the two clusters with the minimal
distance between them are merged.
- The user has to set the stopping criterion:
when should the merging stop? (A sketch of the
procedure follows below.)
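To make the merging process concrete, here is a minimal Python sketch (my illustration, not code from the tutorial). Cluster-to-cluster distance is the minimum pairwise point distance (single linkage), and the stop-at-k-clusters rule is just one of the possible criteria:

```python
import numpy as np

def single_linkage(dist: np.ndarray, num_clusters: int) -> list[set[int]]:
    n = dist.shape[0]
    clusters = [{i} for i in range(n)]          # start from singletons
    while len(clusters) > num_clusters:
        best, best_d = (0, 1), np.inf
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Extend d to subsets: distance between two clusters.
                d = min(dist[a, b] for a in clusters[i] for b in clusters[j])
                if d < best_d:
                    best_d, best = d, (i, j)
        i, j = best
        clusters[i] |= clusters[j]              # merge the closest pair
        del clusters[j]
    return clusters

# Example: three points on a line; the two nearest merge first.
D = np.array([[0.0, 1.0, 5.0],
              [1.0, 0.0, 4.0],
              [5.0, 4.0, 0.0]])
print(single_linkage(D, 2))   # -> [{0, 1}, {2}]
```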
12 Single Linkage Clustering: early stopping
13 Single Linkage Clustering: correct stopping
14 Single Linkage Clustering: late stopping
15 Examples of some popular clustering paradigms:
Center-Based Clustering
- The algorithm picks k center points, and the
clusters are defined by assigning each domain
point to the center closest to it.
- The algorithm aims to minimize some cost function
that reflects how compact the resulting clusters
are (an example objective is given below).
- Center-based algorithms differ in their choice of
the cost function (k-means, sum of distances,
k-median, and more).
- The number of clusters, k, is picked by the user.
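As a concrete instance of such a cost function (standard background, stated here in my notation rather than the slide's), the k-means objective scores a partition C1, ..., Ck of points in Euclidean space by

$$\mathrm{cost}(C_1,\dots,C_k)\;=\;\sum_{i=1}^{k}\,\sum_{x\in C_i}\lVert x-c_i\rVert^{2},\qquad c_i=\frac{1}{|C_i|}\sum_{x\in C_i}x,$$

i.e., the sum of squared distances from each point to the center of mass of its cluster.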
16 4-Means clustering example
17 Examples of some popular clustering paradigms
- Single Linkage
- The K-means algorithm
- The K-means objective optimization.
- Spectral clustering
- The actual algorithms
- The objective function justification.
- EM algorithms over parameterized families of
(mixtures of simple) distributions.
18 Common Solutions
Objective utility functions: sum of in-cluster
distances, average distances to center points,
cut weight, etc. (Shmoys, Charikar, Meyerson).
Consider a restricted set of distributions
(generative models), e.g., mixtures of Gaussians
(Dasgupta 99; Vempala 03; Kannan et al. 04;
Achlioptas, McSherry 05).
Add structure: relevant information, the
Information Bottleneck approach (Tishby,
Pereira, Bialek 99).
19 Common Solutions (2)
Focus on specific algorithmic paradigms:
projection-based clustering (random/spectral;
all the above papers), spectral-based
representations (Meila and Shi, Belkin, ...),
the k-means algorithm.
Axiomatic approach: postulate clustering axioms
that, ideally, every clustering approach should
satisfy; such work usually reaches negative
results (e.g., Hartigan 1975; Puzicha, Hofmann,
Buhmann 00; Kleinberg 03).
Many more.
20 Quest for a general clustering theory
- What can we say independently of any particular
algorithm,
- any particular objective function,
- or any specific generative data model?
21 Many different clustering setups
- Different inputs
- Points in Euclidean space
- An arbitrary domain with a point-similarity measure
- A graph (e.g., social networks, web-page links)
- ...
- Different outputs
- Hierarchical (dendrograms)
- Partitionings of the domain
- Soft/probabilistic clusterings
- ...
22 Our Basic Setting for Formal Discussion
- For a finite domain set S, a dissimilarity
function (DF) is a mapping d: S × S → R such that
- d is symmetric, and
- d(x,y) = 0 iff x = y.
- Our input: a dissimilarity function on S (or a
matrix of pairwise distances between domain
points).
- Our output: a partition of S.
- We wish to define the properties that distinguish
clustering functions from other functions that
output domain partitions.
23 Output: a partition of the domain, e.g.,
{x1, x7}, {x2, x5, x9}, ...
(A minimal code rendering of this setting follows
below.)
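A small Python rendering of the setting (illustrative names, not from the tutorial): the DF is a symmetric matrix with zero diagonal, and a clustering function maps it to a partition of the index set {0, ..., n-1}.

```python
from typing import Callable
import numpy as np

Partition = list[set[int]]
ClusteringFunction = Callable[[np.ndarray], Partition]

def is_dissimilarity(d: np.ndarray) -> bool:
    """Check the DF requirements: symmetry, and d(x,y) = 0 iff x = y."""
    off_diag = ~np.eye(d.shape[0], dtype=bool)
    return bool(np.array_equal(d, d.T)
                and np.all(np.diag(d) == 0)
                and np.all(d[off_diag] > 0))
```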
24 Kleinberg's Axioms
- Scale Invariance:
- F(λd) = F(d) for every DF d and every strictly
positive λ.
- Richness:
- For any finite domain S,
{F(d) : d is a DF over S} = {P : P is a partition of S}.
- Consistency:
- If d' equals d, except for shrinking distances
within clusters of F(d) or stretching
between-cluster distances, then F(d') = F(d).
25 Note that any pair of the axioms is realizable
- Consider Single Linkage with different stopping
criteria (sketched below):
- k connected components: satisfies Scale
Invariance and Consistency.
- Distance-r stopping: satisfies Richness and
Consistency.
- Scale-α stopping: add edges as long as their
length is at most α(max distance); satisfies
Scale Invariance and Richness.
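An illustrative sketch of the latter two stopping rules (function names and the axiom attributions in the docstrings follow Kleinberg's paper; the code structure is mine). Single linkage here means: connect point pairs whose distance is below a threshold, then output the connected components.

```python
import numpy as np

def _components(n: int, edges: list[tuple[int, int]]) -> list[set[int]]:
    parent = list(range(n))                    # union-find forest
    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]      # path halving
            x = parent[x]
        return x
    for a, b in edges:
        parent[find(a)] = find(b)
    comps: dict[int, set[int]] = {}
    for x in range(n):
        comps.setdefault(find(x), set()).add(x)
    return list(comps.values())

def linkage_r(d: np.ndarray, r: float) -> list[set[int]]:
    """Distance-r stopping: Rich and Consistent, but not Scale Invariant."""
    n = d.shape[0]
    edges = [(i, j) for i in range(n) for j in range(i + 1, n) if d[i, j] < r]
    return _components(n, edges)

def linkage_alpha(d: np.ndarray, alpha: float) -> list[set[int]]:
    """Scale-alpha stopping: Rich and Scale Invariant, but not Consistent."""
    return linkage_r(d, alpha * d.max())
```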
26 The Surprising Result
- Theorem: There exists no clustering function
that satisfies all three of Kleinberg's axioms
simultaneously.
27 Kleinberg's Impossibility Result
- There exists no clustering function satisfying
all three axioms.
- Proof idea: Scale Invariance and Consistency force
the range of F to be an antichain of partitions
(see slide 33), while Richness requires the range
to contain all partitions, including pairs in
which one refines the other.
28 What is the Take-Home Message?
- A popular interpretation of Kleinberg's result
is (roughly):
- "It is impossible to axiomatize clustering."
- But what that paper actually shows is (only):
- "These specific three axioms cannot work together."
29 Ideal Theory
- We would like the axioms to be such that
- 1. any clustering method satisfies all the
axioms, and
- 2. any function that is clearly not a clustering
fails to satisfy at least one of the axioms.
- (This is probably too much to hope for.)
- We would also like a list of simple properties
by which the major clustering methods can be
distinguished from each other.
30 Axioms to guide a taxonomy of clustering paradigms
- The goal is to generate a variety of axioms (or
properties) over a fixed framework, so that
different clustering approaches can be classified
by the different subsets of axioms they satisfy.
31 Types of Axioms/Properties
- Richness requirements
- E.g., relaxations of Kleinberg's richness, such as
{F(d) : d is a DF over S} = {P : P is a partition
of S into k sets}.
- Invariance/Robustness/Stability requirements
- E.g., Scale Invariance, Consistency, robustness
to perturbations of d (smoothness of F), or
stability w.r.t. sampling of S.
32 Relaxations of Consistency
- Local Consistency:
- Let C1, ..., Ck be the clusters of F(d).
- For every λ0 ≥ 1 and positive λ1, ..., λk ≤ 1, if
d' is defined by
- d'(a,b) = λi d(a,b) if a and b are both in Ci,
- d'(a,b) = λ0 d(a,b) if a, b are not in the same
F(d)-cluster,
- then F(d') = F(d).
- Is there any known clustering method for which
it fails?
33 Some more structure
- For partitions P1, P2 of {1, ..., m}, say that P1
refines P2 if every cluster of P1 is contained in
some cluster of P2.
- A collection C = {Pi} is a chain if, for any P, Q
in C, one of them refines the other.
- A collection of partitions is an antichain if no
partition in it refines another.
- Kleinberg's impossibility result can be rephrased
as:
- If F is Scale Invariant and Consistent, then
its range is an antichain.
- (These definitions are rendered as code below.)
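Direct Python renderings of these definitions (illustrative helpers, mine):

```python
Partition = list[set[int]]

def refines(p1: Partition, p2: Partition) -> bool:
    """P1 refines P2 if every cluster of P1 lies inside some cluster of P2."""
    return all(any(c1 <= c2 for c2 in p2) for c1 in p1)

def is_antichain(ps: list[Partition]) -> bool:
    """No partition in the collection refines a different one."""
    return not any(i != j and refines(ps[i], ps[j])
                   for i in range(len(ps)) for j in range(len(ps)))

# Example: {{0},{1,2}} refines {{0,1,2}}, so the two form a chain.
print(refines([{0}, {1, 2}], [{0, 1, 2}]))          # True
print(is_antichain([[{0}, {1, 2}], [{0, 1, 2}]]))   # False
```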
34 Relaxations of Consistency
- Refinement Consistency:
- Same as Consistency (shrink in-cluster,
stretch between-cluster distances), but we relax
the Consistency requirement F(d') = F(d) to:
one of F(d), F(d') is a refinement of the other.
- Note: a natural version of Single Linkage (join
x, y iff d(x,y) < λ·max{d(s,t) : s, t ∈ X}; this
is the scale-α rule sketched after slide 25)
satisfies this, plus Scale Invariance and Richness.
- So Kleinberg's impossibility result breaks down.
- Should this be an axiom?
- Is there any common clustering function that
fails it?
35 More on Refinement Consistency
- "Minimize sum of in-cluster distances" satisfies
it (as well as Richness and Scale Invariance).
- Center-based clustering fails to satisfy
Refinement Consistency.
- This is quite surprising, since the two objectives
look very much alike (where d is the Euclidean
distance and ci is the center of mass of Ci; see
the reconstructed formulas below).
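The slide's formulas did not survive extraction; the following is a hedged reconstruction from standard definitions (the original may have used a slightly different normalization):

$$\sum_{i=1}^{k}\sum_{x,y\in C_i} d(x,y)^{2}\ \text{(sum of in-cluster distances)} \qquad\text{vs.}\qquad \sum_{i=1}^{k}\sum_{x\in C_i} d(x,c_i)^{2}\ \text{(center-based, k-means)}.$$

Indeed, for each cluster $\sum_{x,y\in C_i} d(x,y)^{2} = 2\,|C_i|\sum_{x\in C_i} d(x,c_i)^{2}$, so the two objectives differ only by per-cluster normalization, which is why they "look very much alike".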
36 Hierarchical Clustering
- Hierarchical clustering takes, on top of d, a
coarseness parameter t.
- For any fixed t, F(t,d) is a clustering function.
- We require, for every d:
- Cd = {F(t,d) : 0 ≤ t ≤ Max} is a chain.
- F(0,d) = {{x} : x ∈ S} and F(Max,d) = {S}.
- (A runnable illustration follows below.)
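One concrete realization of this scheme (my choice of library, not the tutorial's): SciPy's single-linkage dendrogram, where the coarseness parameter t is the height at which the dendrogram is cut.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

points = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
Z = linkage(pdist(points), method="single")   # single-linkage dendrogram

# Small t yields the all-singletons partition; large t yields {S}.
for t in (0.05, 1.0, 20.0):
    print(t, fcluster(Z, t=t, criterion="distance"))
```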
37 Hierarchical versions of the axioms
- Scale Invariance: For any d and λ > 0,
{F(t,d) : t} = {F(t,λd) : t} (as sets of partitions).
- Richness: For any finite domain S,
{{F(t,d) : t} : d is a DF over S} = {C : C is a chain
of partitions of S (with the needed Min and Max
partitions)}.
- Consistency: If, for some t, d' is an
F(t,d)-consistent transformation of d, then, for
some t', F(t',d') = F(t,d).
38 Characterizing Single Linkage
- Ordinal Clustering axiom:
- If, for all w, x, y, z,
d(w,x) < d(y,z) iff d'(w,x) < d'(y,z),
then {F(t,d) : t} = {F(t,d') : t} (as sets of
partitions).
- (Note that this implies Scale Invariance.)
- Hierarchical Richness + Consistency + Ordinal
Clustering characterize Single Linkage clustering.
39 Other types of clustering
- Edge detection (advantage to smooth contours)
- Texture clustering
- The professor's example
40 A different setup for axiomatization: measuring
the quality of a clustering
- You get a data set.
- You run, say, your 5-means clustering algorithm
and get a clustering C.
- You compute its 5-means cost: it is 0.7.
- Can you conclude that C is a good clustering?
- How can we verify that the structure described
by C is not just noise?
41 Clustering Quality Measures
- A clustering-quality measure is a function
m(dataset, clustering)
whose values reflect how good or cogent that
clustering is.
42 Axiomatizing Quality Measures
- Consistency:
- Whenever d' is a C-consistent variant of d,
then m(C, X, d') ≥ m(C, X, d).
- Scale Invariance:
- For every positive λ, m(C, X, d) = m(C, X, λd).
- Richness:
- For each non-trivial clustering C of X,
there exists a distance function d over X
such that C = Argmax_C' m(C', X, d).
43 An Additional Axiom: Isomorphism Invariance
- Clusterings C and C' over (X, d) are isomorphic
if there exists a distance-preserving automorphism
φ: X → X such that x, y share the same C-cluster
iff φ(x) and φ(y) share the same C'-cluster.
- Isomorphism Invariance:
- If C and C' are isomorphic, then
m(C, X, d) = m(C', X, d).
44 Major gain (over Kleinberg's framework)
- Every reasonable clustering-quality measure
satisfies our axioms.
- Clustering functions can then be defined as
functions that optimize clustering quality
(a brute-force rendering is sketched below).
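"Clustering function = argmax of a quality measure", rendered by brute force for tiny domains (purely conceptual, my illustration; the number of partitions grows as the Bell numbers):

```python
from collections.abc import Iterator

Partition = list[list[int]]

def partitions(items: list[int]) -> Iterator[Partition]:
    """Enumerate all partitions of a finite set."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for p in partitions(rest):
        for i in range(len(p)):
            yield p[:i] + [[first] + p[i]] + p[i + 1:]
        yield [[first]] + p

def cluster_by_quality(n, d, m):
    """The clustering function induced by quality measure m:
    argmax over all partitions of {0, ..., n-1}."""
    return max(partitions(list(range(n))), key=lambda C: m(C, d))

# Toy demo: a made-up measure favoring tight clusters, penalizing their count.
X = [0.0, 0.1, 5.0]
d = lambda a, b: abs(X[a] - X[b])
m = lambda C, d: -sum(max((d(a, b) for a in c for b in c), default=0.0)
                      for c in C) - 0.5 * len(C)
print(cluster_by_quality(len(X), d, m))   # -> [[0, 1], [2]]
```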
45 Some examples of quality measures
- Normalized clustering cost functions
(e.g., k-means, ratio cut, k-median, etc.)
- Variance ratio: VR(C, X, d)
- Relative margins:
the average ratio between the distance from a
point to its cluster center and its distance to
its second-closest cluster center (a sketch
follows below).
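A sketch of the relative-margins measure as just described (the function name and the sign convention are my choices). Values near 0 mean well-separated, cogent clusters; note that the measure is scale invariant, since multiplying all distances by a positive constant leaves the ratios unchanged.

```python
import numpy as np

def relative_margins(points: np.ndarray, centers: np.ndarray) -> float:
    # Distance from every point to every cluster center, shape (n, k).
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    dists.sort(axis=1)            # per point: nearest, second nearest, ...
    return float(np.mean(dists[:, 0] / dists[:, 1]))

X = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
centers = np.array([[0.1, 0.0], [5.1, 5.0]])
print(relative_margins(X, centers))   # small value: cogent clustering
```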
47 Basic Open Questions
- What do we want from a set of clustering axioms?
(Meta-axiomatization.)
- How can the completeness of a set of axioms be
defined or argued?
- Is there a general property distinguishing,
say, linkage-based from center-based clusterings?
- Are there candidate general clustering properties
that the axioms should prove?
48 Single Linkage Clustering