Title: ON STATISTICAL MODEL OF CLUSTER STABILITY
1 ON STATISTICAL MODEL OF CLUSTER STABILITY
- Z. Volkovich
  Software Engineering Department, ORT Braude College of Engineering, Karmiel 21982, Israel
  Department of Mathematics and Statistics, the University of Maryland, Baltimore County, USA
- Z. Barzily
  Software Engineering Department, ORT Braude College of Engineering, Karmiel 21982, Israel
2 Concept
- We propose a general statistical approach to the study of the cluster validation problem.
- Our concept suggests that the partitions obtained by rerunning a clustering algorithm can be considered as realizations of a random variable, such that the most stable random variable indicates the true number of clusters.
- We propose to measure stability by means of probability distances within the observed realizations.
- Our method suggests the application of probability metrics between random variables.
3 Motivating works
- T. Lange, V. Roth, M. L. Braun, and J. M. Buhmann. Stability-based validation of clustering solutions. Neural Computation, 16(6):1299-1323, 2004.
- V. Roth, T. Lange, M. Braun, and J. Buhmann. A resampling approach to cluster validation. In COMPSTAT, 2002.
4 Clustering
Goal: partition a set S ⊂ Rd by means of a clustering algorithm CL such that
CL(x) = CL(y) if x and y are similar,
CL(x) ≠ CL(y) if x and y are dissimilar.
5 Clustering (cont.)
CL(x) = CL(y)
CL(x) ≠ CL(y)
An important question: how many clusters are there?
6 Example: a three-cluster set partitioned into 2 and 4 clusters
7 Implication
- It is observed that, when the number of clusters is not correctly chosen, non-consistent clusters can be formed. Such a cluster is a union of several heterogeneous parts having different distributions.
8 Concept
DATA
SAMPLE: S
CLUSTERING MACHINE: CL
INITIAL PARTITION
CLUSTERED SAMPLE: Ŝ = CL(S)
9 Concept (cont. 1)
Ŝ1 = CL(S1)
These clustered samples can be considered as estimates of the sought true partition and, accordingly, our concept views these partitions as instances of a random variable such that the steadiest random variable is associated with the true number of clusters.
DATA
10 Concept (cont. 2)
- Outliers in the samples and limitations of clustering algorithms significantly add to the noise level and increase the model's volatility. Thus, the clustering procedure has to be iterated many times to obtain meaningful results.
- The stability of random variables is measured via probability distances. According to our principle, it is natural to assume that the true number of clusters corresponds to the distance distribution having the minimal inner variation.
11 Some probability metrics
- Probability metrics are introduced on a space X of real-valued random variables defined on the same probability space.
- A functional dis: X × X → R is called a probability metric if it satisfies the following properties:
  - Identity
  - Symmetry
  - Triangle inequality
- The last two properties are not always required. If a probability metric identifies the distribution, i.e. dis(X, Y) = 0 if and only if X and Y have the same distribution, then the metric is called simple; otherwise the metric is called compound.
12 Ky Fan metrics
- K1(X, Y) = inf{ε > 0 : P(|X − Y| ≥ ε) ≤ ε}
Birnbaum-Orlicz metrics
13 Examples
- Lp-metric
- Ky Fan metrics
- Hellinger distance
- Relative entropy (or Kullback-Leibler divergence)
- Kolmogorov (or uniform) metric
- Levy metric
- Prokhorov metric
- Total variation distance
- Wasserstein (or Kantorovich) metric
- Chi-square distance
- And so on.
14 Concentration measure index
- The Lp and the Ky Fan metrics are compound, and the other metrics shown above are simple. A difference between the simple and compound metrics is that, unlike the compound ones, simple metrics equal zero for two independent identically distributed random variables. Moreover, if dis is a compound metric, dis(X′, X″) = 0, and X′, X″ are independent realizations of the same random variable X, then X is constant almost surely.
- A compound distance can therefore be used as a measure of uncertainty. In particular, d(X) = dis(X′, X″) is called the concentration measure index derived from the compound distance dis. The stability of a random variable can be estimated by the average value of this index (a Monte Carlo sketch follows below).
15 Simple and Compound Metrics
Simple metrics → Geometrical Algorithms
Compound metrics → Membership Algorithms
16 Geometrical Algorithm
- Several algorithms can be offered here. The first type considers Geometrical Stability. The basic idea is that if one properly clusters two independent samples then, under the assumption of a consistent clustering algorithm, the clustered samples can be classified as samples drawn from the same population.
17 Example: Cluster stability via samples
Cluster 2
Cluster 1
The samples form the same clusters. The partition is stable.
18 General algorithm. Given a probability metric dis(·, ·):
- For each tested number of clusters k, execute steps 2 to 6 as follows:
- Draw pairs of disjoint samples (St, St+1), t = 1, 2, ...
- Cluster the united sample (Ŝ = CL(St ∪ St+1)) and separate Ŝ into the two clustered samples (Ŝt, Ŝt+1)
- Calculate the distance values dt = dis(Ŝt, Ŝt+1)
- Average each consecutive T distances
- Normalize the set of averages to obtain the set DISk
- The number of clusters k which yields the most concentrated normalized distribution DISk is chosen as the estimate of the true number of clusters (a sketch of this loop follows below).
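A minimal sketch of this resampling loop in Python, assuming a user-supplied clustering routine `cluster_fn` and a distance `dis` between the two clustered samples (both placeholder names); the data is assumed to be an (n, d) numpy array, and the normalization step is likewise an assumption, since the slides do not spell it out:

```python
import numpy as np

def geometrical_stability(data, cluster_fn, dis, k_values,
                          n_pairs=200, sample_size=70, T=20, rng=None):
    """Sketch of the geometrical stability algorithm (steps 2-6 above).

    cluster_fn(points, k) -> integer labels and
    dis(s1, s2, labels1, labels2) -> float are user-supplied placeholders
    (e.g. k-means and an empirical N-distance between the clustered samples).
    Returns, for each k, the set DIS_k of averaged, normalized distance values.
    """
    rng = np.random.default_rng(rng)
    averaged = {}
    for k in k_values:
        d = []
        for _ in range(n_pairs):
            # Step 2: draw a pair of disjoint samples.
            idx = rng.choice(len(data), 2 * sample_size, replace=False)
            s1, s2 = data[idx[:sample_size]], data[idx[sample_size:]]
            # Step 3: cluster the united sample, then split it back into two clustered samples.
            labels = cluster_fn(np.vstack([s1, s2]), k)
            # Step 4: distance between the two clustered samples.
            d.append(dis(s1, s2, labels[:sample_size], labels[sample_size:]))
        # Step 5: average each consecutive block of T distances.
        d = np.asarray(d[: len(d) // T * T]).reshape(-1, T)
        averaged[k] = d.mean(axis=1)
    # Step 6: normalize the averages so the sets DIS_k are comparable across k
    # (the exact normalization is an assumption; here we divide by the grand mean).
    grand_mean = np.mean(np.concatenate(list(averaged.values())))
    return {k: v / grand_mean for k, v in averaged.items()}
```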
19 How is the concentration measured?
- The distances are inherently positive, and we are interested in distributions concentrated near zero. Thus, we can use as concentration measures the sample mean, the sample standard deviation and the lower quartile.
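For instance, a small numpy helper (the function name is illustrative) can report these three concentration measures for a set of distance values:

```python
import numpy as np

def concentration_summaries(dis_values):
    """Concentration measures of a set of (positive) distance values."""
    d = np.asarray(dis_values)
    return {"mean": d.mean(),
            "std": d.std(ddof=1),
            "lower_quartile": np.percentile(d, 25)}
```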
20 Simple distances
- Many distances applicable in two-sample tests are simple metrics. For instance, the well-known Kolmogorov-Smirnov test is based on the max-distance between the empirical distribution functions in the one-dimensional case. In the multivariate case, the following tests can be mentioned in this context:
- the Friedman-Rafsky test, 1979 (minimal spanning tree based)
- the Nearest Neighbors test of Henze, 1988
- the Energy test, Zech and Aslan, 2005
- the N-distances test of Klebanov, 2005
- Such metrics can be used in the Geometrical Algorithm.
21 Simple distances (cont.)
- Recall that the classical two-sample problem is intended for testing the hypothesis
  H0: F(x) = G(x)
  against the general alternative
  H1: F(x) ≠ G(x),
  when the distributions F and G are unknown.
- Here we consider applications of the N-distances test of Klebanov and the Friedman-Rafsky test.
22 Klebanov's N-distances
- N-distances are defined as follows:
  dN(X, Y) = 2E N(X1, Y1) − E N(X1, X2) − E N(Y1, Y2),
  where X1, X2 and Y1, Y2 are independent realizations of X and Y respectively, and N is a negative definite kernel. We use
  N(x, y) = ||x − y||^a, 0 < a ≤ 2.
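A short empirical sketch of this N-distance for two finite samples stored as (n, d) numpy arrays, assuming the plain all-pairs (V-statistic) estimator of the three expectations; `n_distance` is an illustrative name:

```python
import numpy as np
from scipy.spatial.distance import cdist

def n_distance(X, Y, a=1.0):
    """Empirical N-distance with the kernel N(x, y) = ||x - y||**a, 0 < a <= 2.

    Estimates 2*E N(X, Y) - E N(X, X') - E N(Y, Y') from two samples.
    """
    dxy = cdist(X, Y) ** a
    dxx = cdist(X, X) ** a
    dyy = cdist(Y, Y) ** a
    return 2.0 * dxy.mean() - dxx.mean() - dyy.mean()
```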
23 Graphical illustration
- Let us consider two samples S1 and S2 partitioned
into clusters
S1
S2
24 Graphical illustration (cont. 1): Distances between points belonging to different samples
Cluster C1
Cluster C2
25 Graphical illustration (cont. 2)
Cluster C2
Cluster C1
26 Remark
- Note, the metric is actually the average distance between the samples in the clusters. If this value is close to zero, then it can be concluded that the partition is stable. Namely, the elements of the samples are closely located inside the clusters and can be deemed a consistent set within each of the clusters.
27 Distance calculations
28 Histograms of the distance values
Two cases are presented: the case when the quantity T of the averaged distance values is big, and the case when T is small, respectively. In the second case, measuring the concentration via the lower quartile appears to be more appropriate.
29 Example: The Iris Flower Dataset
- The kernel: N(x, y) = ||x − y||
- Sample size: 70
- Number of samples: 2080
- Number of averaged samples: 80
30 Graph of the normalized mean value
31 Graph of the normalized quartile value
32 Euclidean Minimal Spanning Tree
- The Euclidean minimum spanning tree, or EMST, is a minimum spanning tree of a set S of points in a Euclidean space Rd, where the weight of the edge between each pair of points is the distance between those two points. In simpler terms, an EMST connects a set of dots using lines such that the total length of all the lines is minimized and any dot can be reached from any other by following the lines (see Wikipedia).
33 Euclidean minimal spanning tree (cont.)
- A tree on S is a graph which has no loops and whose vertices are elements of S.
- If all distances between the items of S are distinct, then the set S is called nice and it has a unique EMST.
- An EMST can be built in O(n²) time (including distance calculations) using Prim's, Kruskal's, Borůvka's or Dijkstra's algorithm.
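A small SciPy-based sketch of this O(n²) construction over the full pairwise distance matrix (the function name is illustrative):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def emst_edges(points):
    """Return the edges (i, j, weight) of a Euclidean minimum spanning tree.

    Builds the full O(n^2) distance matrix and runs SciPy's MST routine;
    for 'nice' point sets (all pairwise distances distinct) the tree is unique.
    """
    dist = squareform(pdist(points))           # dense pairwise Euclidean distances
    mst = minimum_spanning_tree(dist).tocoo()  # sparse matrix holding the n-1 tree edges
    return list(zip(mst.row, mst.col, mst.data))
```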
34 An EMST of 60 random points
35 How can an EMST be used in the cluster validation problem?
We draw two samples S1 and S2 and determine
S = S1 ∪ S2 - the pooled sample,
Ŝ = CL(S) - the clustered pooled sample.
We expect that, in the case of a stable clustering, the two samples' items are closely located inside the clusters. This fact is characterized by the number of edges connecting points from different samples. These edges are marked in red in the diagrams in the examples.
36 Graphical illustration. Stable clustering
37 Graphical illustration. Non-stable clustering
38 The two-sample MST-test
- The two-sample test tests the hypothesis
  H0: F(x) = G(x)
  against the general alternative
  H1: F(x) ≠ G(x),
  when the distributions F and G are unknown.
- The number of edges connecting points from different samples has been considered in Friedman and Rafsky's MST test.
- In particular, let X = {xi, i = 1, ..., n} and Y = {yj, j = 1, ..., m} be two samples of independent random elements, distributed according to F and G, respectively.
39 The two-sample MST-test (cont. 1)
- Suppose that the set Z = X ∪ Y is nice.
- Friedman and Rafsky's test statistic Rmn can be defined as the number of edges of the EMST of Z which connect a point of X to a point of Y.
- Friedman and Rafsky actually introduced the statistic 1 + Rmn, which expresses the number of disjoint sub-trees resulting from removing all edges uniting vertices of different samples.
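A sketch of Rmn built on top of the `emst_edges` helper from the previous sketch; sample membership is encoded by 0/1 labels on the pooled points:

```python
import numpy as np

def friedman_rafsky_statistic(X, Y):
    """R_mn: number of EMST edges of the pooled sample joining a point of X to a point of Y.

    Uses the emst_edges() helper sketched earlier; labels mark sample membership.
    """
    pooled = np.vstack([X, Y])
    labels = np.array([0] * len(X) + [1] * len(Y))
    edges = emst_edges(pooled)
    return sum(1 for i, j, _ in edges if labels[i] != labels[j])
```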
40 The two-sample MST-test (cont. 2)
- Henze and Penrose (1999) considered the asymptotic behavior of Rmn. Suppose that m → ∞ and n → ∞ such that m/(m+n) → p ∈ (0, 1). Introducing q = 1 − p and r = 2pq, they obtained
  (Rmn − 2mn/(m+n)) / √(m+n) → N(0, σ²),
  where the convergence is in distribution and N(0, σ²) denotes the normal distribution with a zero expectation and a variance
  σ² = r(r + Cd(1 − 2r))
  for some constant Cd depending only on the space's dimension d.
41 The two-sample MST-test (cont. 3)
- This result coincides with our intuition, since if two sets are close then there are many edges which unite points from different samples. In the spirit of the Central Limit Theorem, this quantity is expected to be asymptotically normally distributed.
42 Theorem's application
- To use this theorem, for each possible number of clusters k = 2, ..., k*, we draw many pairs of disjoint samples S1 and S2 having the same sufficiently big size n and calculate S = S1 ∪ S2, Ŝ = CL(S).
- We consider sub-samples
  S1,j = S1 ∩ πj(Ŝ),
  S2,j = S2 ∩ πj(Ŝ),
  where πj(Ŝ), j = 1, ..., k is the j-th cluster in the partition of Ŝ obtained by means of CL.
43 Theorem's application (cont. 1)
- For each j = 1, ..., k we compute a value of the two-sample MST-test statistic Rnn(S1,j, S2,j).
- In the case where all clusters are homogeneous sets, this statistic is asymptotically normally distributed. We collect many such values and look at the distance between their standardized distribution and the standard normal distribution (a sketch of this step follows below).
- The number of clusters k which yields the minimal distance is chosen as the estimate of the true number of clusters.
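A hedged sketch of this per-cluster procedure, reusing `friedman_rafsky_statistic` from the earlier sketch; `draw_pair` and `cluster_fn` are placeholders, and the empirical standardization and Kolmogorov-Smirnov distance used here are assumptions about details the slides leave open:

```python
import numpy as np
from scipy.stats import kstest

def normality_distance_for_k(draw_pair, cluster_fn, k, n_repeats=100):
    """Distance from normality of standardized per-cluster MST statistics for a given k.

    draw_pair() -> (S1, S2) returns two disjoint samples of equal size (placeholder),
    cluster_fn(points, k) -> integer labels in {0, ..., k-1}.
    """
    values = []
    for _ in range(n_repeats):
        s1, s2 = draw_pair()
        pooled = np.vstack([s1, s2])
        labels = cluster_fn(pooled, k)
        lab1, lab2 = labels[:len(s1)], labels[len(s1):]
        for j in range(k):
            c1, c2 = s1[lab1 == j], s2[lab2 == j]
            if len(c1) and len(c2):
                values.append(friedman_rafsky_statistic(c1, c2))
    # Standardize empirically and measure the distance to the standard normal.
    z = (np.array(values) - np.mean(values)) / (np.std(values) + 1e-12)
    return kstest(z, "norm").statistic
```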
44 Example: Calculation of Rnn(S1, S2)
45 Distances from normality
- We can consider the following distances from normality:
- The Kolmogorov-Smirnov test statistic
- Friedman's Pursuit Index
- Entropy-based Pursuit Indexes
- The BCM function
- Generally, any simple metric can be used here.
46 The Kolmogorov-Smirnov Distance
- The Kolmogorov-Smirnov distance is based on the empirical distribution function.
- Given N ordered data items R1 ≤ R2 ≤ ... ≤ RN, a function FN(i) is defined as the fraction of items less than or equal to Ri.
- The distance to the standard normal distribution is given by
  D = max_i (FN(i) − G(i), G(i) − FN(i)),
  where G(i) is the value of the standard normal cumulative distribution function evaluated at the point Ri.
47 The Kolmogorov-Smirnov Distance
48 Example: synthetic data
Mixture of three Gaussian clusters
49 Example: synthetic data (cont. 1)
50 Another application
- Another approach, by Jain, Xu, Ho and Xiao, and by Smith and Jain, proposes to carry out uniformity testing for cluster detection.
- The main idea here is to locate an inconsistent edge whose length is significantly larger than the average length of nearby edges. The distribution of these lengths must be normal under the null hypothesis.
- This method is unable to assess the true number of clusters in the data.
52 Membership Stability Algorithm
- Such an algorithm uses a simulation of random variables obtained by repeated clusterings of the same sample.
- The obtained results are compared by means of a compound (non-simple) metric, which is a function of variables defined on the set Ck = {1, ..., k}, where k is the number of clusters considered.
53 Lp-metric
- The Lp-metric is the best-known example of a compound metric. For every p ≥ 1 the Lp-metric is defined by
  Lp(X, Y) = (E|X − Y|^p)^(1/p).
- In particular,
  L1(X, Y) = E|X − Y|
  and, for label-valued variables, the misclassification distance
  L(X, Y) = P(X ≠ Y).
- The last metric has been, de facto, applied in:
- T. Lange, V. Roth, M. L. Braun, and J. M. Buhmann. Stability-based validation of clustering solutions. Neural Computation, 16(6):1299-1323, 2004.
- V. Roth, T. Lange, M. Braun, and J. Buhmann. A resampling approach to cluster validation. In COMPSTAT, 2002.
54 A family of clustering algorithms
In contrast with the geometrical algorithm, we use here a family of clustering algorithms. It can be the same algorithm starting from different initial points.
55 Clusters correspondence problem
Suppose we cluster the same set twice.
Cluster 1
S
Cluster 2
The same cluster can be labeled differently
Cluster 2
Cluster 1
56 Correspondence between labels α and β obtained for a sample S
- This task is solved by finding a permutation ψ of the set Ck which achieves the smallest misclassification between α and the permuted β, i.e.
  ψ* = arg min over permutations ψ of (1/|S|) Σx∈S 1{α(x) ≠ ψ(β(x))}.
- The computational complexity of solving this problem by the well-known Hungarian method is O(k³); a sketch follows below. This technique has been used in Lange T. et al., 2004, and is also known in the cluster combining area (see, for example, Topchy A. et al., 2003).