Title: ON STATISTICAL MODEL OF CLUSTER STABILITY
1 ON STATISTICAL MODEL OF CLUSTER STABILITY
- Z. Volkovich
  Software Engineering Department, ORT Braude College of Engineering, Karmiel 21982, Israel
  Department of Mathematics and Statistics, the University of Maryland, Baltimore County, USA
- Z. Barzily
  Software Engineering Department, ORT Braude College of Engineering, Karmiel 21982, Israel
2 Concept
- We propose a general statistical approach to the study of the cluster validation problem.
- Our concept suggests that the partitions obtained by rerunning a clustering algorithm can be considered as realizations of a random variable, such that the most stable random variable indicates the true number of clusters.
- We propose to measure stability by means of probability distances within the observed realizations.
- Our method suggests the application of probability metrics between random variables.
3 Motivating works
- T. Lange, V. Roth, M. L. Braun, and J. M. Buhmann. Stability-based validation of clustering solutions. Neural Computation, 16(6):1299-1323, 2004.
- V. Roth, T. Lange, M. Braun, and J. Buhmann. A resampling approach to cluster validation. In COMPSTAT, 2002.
4 Clustering
Goal: partition a set S ⊂ Rd by means of a clustering algorithm CL such that
CL(x) = CL(y) if x and y are similar,
CL(x) ≠ CL(y) if x and y are dissimilar.
5 Clustering (cont.)
CL(x) = CL(y)
CL(x) ≠ CL(y)
An important question: how many clusters are there?
6 Example: a three-cluster set partitioned into 2 and 4 clusters
7 Implication
- It is observed that, when the number of clusters is not correctly chosen, non-consistent clusters can be formed. Such a cluster is a union of several heterogeneous parts having different distributions.
8 Concept
DATA
SAMPLE: S
CLUSTERING MACHINE: CL
INITIAL PARTITION
CLUSTERED SAMPLE: Ŝ = CL(S)
9 Concept (cont. 1)
Ŝ1 = CL(S1)
These clustered samples can be considered as estimates of the sought true partition and, accordingly, our concept views these partitions as instances of a random variable such that the steadiest random variable is associated with the true number of clusters.
DATA
10 Concept (cont. 2)
- Outliers in the samples and limitations of clustering algorithms significantly add to the noise level and increase the model's volatility. Thus, the clustering procedure has to be iterated many times to obtain meaningful results.
- The stability of random variables is measured via probability distances. According to our principle, it is natural to assume that the true number of clusters corresponds to the distance distribution having the minimal inner variation.
11 Some probability metrics
- Probability metrics are introduced on a space X of real-valued random variables defined on the same probability space.
- A functional dis: X × X → R is called a probability metric if it satisfies the following properties:
  - Identity
  - Symmetry
  - Triangle inequality
- The last two properties are not always required. If a probability metric identifies the distribution, i.e. dis(X, Y) = 0 if and only if X and Y have the same distribution, then the metric is called simple; otherwise the metric is called compound.
12 Ky Fan metrics
- K1(X, Y) = inf{ε > 0 : P(|X − Y| ≥ ε) ≤ ε}
Birnbaum-Orlicz metrics
13 Examples
- Lp-metric
- Ky Fan metrics
- Hellinger distance
- Relative entropy (or Kullback-Leibler divergence)
- Kolmogorov (or uniform) metric
- Levy metric
- Prokhorov metric
- Total variation distance
- Wasserstein (or Kantorovich) metric
- Chi-square distance
- And so on.
14 Concentration measure index
- The Lp and the Ky Fan metrics are compound, and the other metrics shown above are simple. A difference between the simple and compound metrics is that, unlike the compound ones, simple metrics equal zero for two independent identically distributed random variables. Moreover, if dis is a compound metric, dis(X′, X″) = 0, and X′, X″ are independent realizations of the same random variable X, then X is constant almost surely.
- A compound distance can therefore be used as a measure of uncertainty. In particular, d(X) = dis(X′, X″) is called the concentration measure index derived from the compound distance dis. The stability of a random variable can be estimated by the average value of this index (a Monte Carlo sketch follows below).
15 Simple and Compound Metrics
Simple metrics → Geometrical Algorithms
Compound metrics → Membership Algorithms
16 Geometrical Algorithm
- Several algorithms can be offered here. The first type considers Geometrical Stability. The basic idea is that if one properly clusters two independent samples then, under the assumption of a consistent clustering algorithm, the clustered samples can be classified as samples drawn from the same population.
17 Example: Cluster stability via samples
Cluster 2
Cluster 1
The samples form the same clusters. The partition is stable.
18 General algorithm. Given a probability metric dis(·, ·):
- For each tested number of clusters k, execute steps 2 to 6 as follows:
- Draw pairs of disjoint samples (St, St+1), t = 1, 2, ...
- Cluster the united sample (Ŝ = CL(St ∪ St+1)) and separate Ŝ into the two clustered samples (Ŝt, Ŝt+1)
- Calculate the distance values dt = dis(Ŝt, Ŝt+1)
- Average each consecutive T distances
- Normalize the set of averages to obtain the set DISk
- The number of clusters k which yields the most concentrated normalized distribution DISk is chosen as the estimate of the true number of clusters (a sketch of this loop follows below).
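A minimal sketch of this resampling loop in Python, assuming a user-supplied clustering routine `cluster_fn` and a distance `dis` between the two clustered samples (both placeholder names); the data is assumed to be an (n, d) numpy array, and the normalization step is likewise an assumption, since the slides do not spell it out:

```python
import numpy as np

def geometrical_stability(data, cluster_fn, dis, k_values,
                          n_pairs=200, sample_size=70, T=20, rng=None):
    """Sketch of the geometrical stability algorithm (steps 2-6 above).

    cluster_fn(points, k) -> integer labels and
    dis(s1, s2, labels1, labels2) -> float are user-supplied placeholders
    (e.g. k-means and an empirical N-distance between the clustered samples).
    Returns, for each k, the set DIS_k of averaged, normalized distance values.
    """
    rng = np.random.default_rng(rng)
    averaged = {}
    for k in k_values:
        d = []
        for _ in range(n_pairs):
            # Step 2: draw a pair of disjoint samples.
            idx = rng.choice(len(data), 2 * sample_size, replace=False)
            s1, s2 = data[idx[:sample_size]], data[idx[sample_size:]]
            # Step 3: cluster the united sample, then split it back into two clustered samples.
            labels = cluster_fn(np.vstack([s1, s2]), k)
            # Step 4: distance between the two clustered samples.
            d.append(dis(s1, s2, labels[:sample_size], labels[sample_size:]))
        # Step 5: average each consecutive block of T distances.
        d = np.asarray(d[: len(d) // T * T]).reshape(-1, T)
        averaged[k] = d.mean(axis=1)
    # Step 6: normalize the averages so the sets DIS_k are comparable across k
    # (the exact normalization is an assumption; here we divide by the grand mean).
    grand_mean = np.mean(np.concatenate(list(averaged.values())))
    return {k: v / grand_mean for k, v in averaged.items()}
```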
19 How is the concentration measured?
- The distances are inherently positive, and we are interested in distributions concentrated near zero. Thus, we can use as concentration measures the sample mean, the sample standard deviation and the lower quartile.
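For instance, a small numpy helper (the function name is illustrative) can report these three concentration measures for a set of distance values:

```python
import numpy as np

def concentration_summaries(dis_values):
    """Concentration measures of a set of (positive) distance values."""
    d = np.asarray(dis_values)
    return {"mean": d.mean(),
            "std": d.std(ddof=1),
            "lower_quartile": np.percentile(d, 25)}
```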
20 Simple distances
- Many distances applicable in two-sample tests are simple metrics. For instance, the well-known Kolmogorov-Smirnov test is based on the max-distance between the empirical distribution functions in the one-dimensional case. In the multivariate case, the following tests can be mentioned in this context:
- the Friedman-Rafsky test, 1979 (minimal spanning tree based)
- the Nearest Neighbors test of Henze, 1988
- the Energy test, Zech and Aslan, 2005
- the N-distances test of Klebanov, 2005
- Such metrics can be used in the Geometrical Algorithm.
21 Simple distances (cont.)
- Recall that the classical two-sample problem is intended for testing the hypothesis
  H0: F(x) = G(x)
  against the general alternative
  H1: F(x) ≠ G(x),
  when the distributions F and G are unknown.
- Here we consider applications of the N-distances test of Klebanov and the Friedman-Rafsky test.
22 Klebanov's N-distances
- N-distances are defined as follows:
  dN(X, Y) = 2E N(X1, Y1) − E N(X1, X2) − E N(Y1, Y2),
  where X1, X2 and Y1, Y2 are independent realizations of X and Y respectively, and N is a negative definite kernel. We use
  N(x, y) = ||x − y||^a, 0 < a ≤ 2.
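A short empirical sketch of this N-distance for two finite samples stored as (n, d) numpy arrays, assuming the plain all-pairs (V-statistic) estimator of the three expectations; `n_distance` is an illustrative name:

```python
import numpy as np
from scipy.spatial.distance import cdist

def n_distance(X, Y, a=1.0):
    """Empirical N-distance with the kernel N(x, y) = ||x - y||**a, 0 < a <= 2.

    Estimates 2*E N(X, Y) - E N(X, X') - E N(Y, Y') from two samples.
    """
    dxy = cdist(X, Y) ** a
    dxx = cdist(X, X) ** a
    dyy = cdist(Y, Y) ** a
    return 2.0 * dxy.mean() - dxx.mean() - dyy.mean()
```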
23 Graphical illustration
- Let us consider two samples S1 and S2 partitioned
into clusters
S1
S2
24 Graphical illustration (cont. 1): Distances between points belonging to different samples
Cluster C1
Cluster C2
25 Graphical illustration (cont. 2)
Cluster C2
Cluster C1
26 Remark
- Note, the metric is actually the average distance between the samples in the clusters. If this value is close to zero, then it can be concluded that the partition is stable. Namely, the elements of the samples are closely located inside the clusters and can be deemed a consistent set within each of the clusters.
27 Distance calculations
28 Histograms of the distance values
Two cases are presented: the case when the quantity T of the averaged distance values is big, and the case when T is small, respectively. In the second case, measuring the concentration via the lower quartile appears to be more appropriate.
29 Example: The Iris Flower Dataset
- The kernel: N(x, y) = ||x − y||
- Sample size: 70
- Number of samples: 2080
- Number of averaged samples: 80
30 Graph of the normalized mean value
31 Graph of the normalized quartile value
32 Euclidean Minimal Spanning Tree
- The Euclidean minimum spanning tree, or EMST, is a minimum spanning tree of a set S of points in a Euclidean space Rd, where the weight of the edge between each pair of points is the distance between those two points. In simpler terms, an EMST connects a set of dots using lines such that the total length of all the lines is minimized and any dot can be reached from any other by following the lines (see Wikipedia).
33 Euclidean minimal spanning tree (cont.)
- A tree on S is a graph which has no loops and whose vertices are elements of S.
- If all distances between the items of S are distinct, then the set S is called nice and it has a unique EMST.
- An EMST can be built in O(n²) time (including distance calculations) using Prim's, Kruskal's, Borůvka's or Dijkstra's algorithm.
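A small SciPy-based sketch of this O(n²) construction over the full pairwise distance matrix (the function name is illustrative):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def emst_edges(points):
    """Return the edges (i, j, weight) of a Euclidean minimum spanning tree.

    Builds the full O(n^2) distance matrix and runs SciPy's MST routine;
    for 'nice' point sets (all pairwise distances distinct) the tree is unique.
    """
    dist = squareform(pdist(points))           # dense pairwise Euclidean distances
    mst = minimum_spanning_tree(dist).tocoo()  # sparse matrix holding the n-1 tree edges
    return list(zip(mst.row, mst.col, mst.data))
```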
34 An EMST of 60 random points
35 How can an EMST be used in the cluster validation problem?
We draw two samples S1 and S2 and determine
S = S1 ∪ S2 - the pooled sample,
Ŝ = CL(S) - the clustered pooled sample.
We expect that, in the case of a stable clustering, the two samples' items are closely located inside the clusters. This fact is characterized by the number of edges connecting points from different samples. These edges are marked in red in the diagrams in the examples.
36 Graphical illustration. Stable clustering
37 Graphical illustration. Non-stable clustering
38 The two-sample MST-test
- The two-sample test tests the hypothesis
  H0: F(x) = G(x)
  against the general alternative
  H1: F(x) ≠ G(x),
  when the distributions F and G are unknown.
- The number of edges connecting points from different samples has been considered in Friedman and Rafsky's MST test.
- In particular, let X = {xi, i = 1, ..., n} and Y = {yj, j = 1, ..., m} be two samples of independent random elements, distributed according to F and G, respectively.
39 The two-sample MST-test (cont. 1)
- Suppose that the set Z = X ∪ Y is nice.
- Friedman and Rafsky's test statistic Rmn can be defined as the number of edges of the EMST of Z which connect a point of X to a point of Y.
- Friedman and Rafsky actually introduced the statistic 1 + Rmn, which expresses the number of disjoint sub-trees resulting from removing all edges uniting vertices of different samples.
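A sketch of Rmn built on top of the `emst_edges` helper from the previous sketch; sample membership is encoded by 0/1 labels on the pooled points:

```python
import numpy as np

def friedman_rafsky_statistic(X, Y):
    """R_mn: number of EMST edges of the pooled sample joining a point of X to a point of Y.

    Uses the emst_edges() helper sketched earlier; labels mark sample membership.
    """
    pooled = np.vstack([X, Y])
    labels = np.array([0] * len(X) + [1] * len(Y))
    edges = emst_edges(pooled)
    return sum(1 for i, j, _ in edges if labels[i] != labels[j])
```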
40 The two-sample MST-test (cont. 2)
- Henze and Penrose (1999) considered the asymptotic behavior of Rmn. Suppose that m → ∞ and n → ∞ such that m/(m+n) → p ∈ (0, 1). Introducing q = 1 − p and r = 2pq, they obtained
  (Rmn − 2mn/(m+n)) / √(m+n) → N(0, σ²),
  where the convergence is in distribution and N(0, σ²) denotes the normal distribution with a zero expectation and a variance
  σ² = r(r + Cd(1 − 2r))
  for some constant Cd depending only on the space's dimension d.
41 The two-sample MST-test (cont. 3)
- This result coincides with our intuition, since if two sets are close then there are many edges which unite points from different samples. In the spirit of the Central Limit Theorem, this quantity is expected to be asymptotically normally distributed.
42 Theorem's application
- To use this theorem, for each possible number of clusters k = 2, ..., k*, we draw many pairs of disjoint samples S1 and S2 having the same sufficiently big size n and calculate S = S1 ∪ S2, Ŝ = CL(S).
- We consider sub-samples
  S1,j = S1 ∩ πj(Ŝ),
  S2,j = S2 ∩ πj(Ŝ),
  where πj(Ŝ), j = 1, ..., k is the j-th cluster in the partition of Ŝ obtained by means of CL.
43 Theorem's application (cont. 1)
- For each j = 1, ..., k we compute a value of the two-sample MST-test statistic Rnn(S1,j, S2,j).
- In the case where all clusters are homogeneous sets, this statistic is asymptotically normally distributed. We collect many such values and look at the distance between their standardized distribution and the standard normal distribution (a sketch of this step follows below).
- The number of clusters k which yields the minimal distance is chosen as the estimate of the true number of clusters.
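A hedged sketch of this per-cluster procedure, reusing `friedman_rafsky_statistic` from the earlier sketch; `draw_pair` and `cluster_fn` are placeholders, and the empirical standardization and Kolmogorov-Smirnov distance used here are assumptions about details the slides leave open:

```python
import numpy as np
from scipy.stats import kstest

def normality_distance_for_k(draw_pair, cluster_fn, k, n_repeats=100):
    """Distance from normality of standardized per-cluster MST statistics for a given k.

    draw_pair() -> (S1, S2) returns two disjoint samples of equal size (placeholder),
    cluster_fn(points, k) -> integer labels in {0, ..., k-1}.
    """
    values = []
    for _ in range(n_repeats):
        s1, s2 = draw_pair()
        pooled = np.vstack([s1, s2])
        labels = cluster_fn(pooled, k)
        lab1, lab2 = labels[:len(s1)], labels[len(s1):]
        for j in range(k):
            c1, c2 = s1[lab1 == j], s2[lab2 == j]
            if len(c1) and len(c2):
                values.append(friedman_rafsky_statistic(c1, c2))
    # Standardize empirically and measure the distance to the standard normal.
    z = (np.array(values) - np.mean(values)) / (np.std(values) + 1e-12)
    return kstest(z, "norm").statistic
```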
44 Example: Calculation of Rnn(S1, S2)
45 Distances from normality
- We can consider the following distances from normality:
- The Kolmogorov-Smirnov test statistic
- Friedman's Pursuit Index
- Entropy-based Pursuit Indexes
- The BCM function
- Generally, any simple metric can be used here.
46 The Kolmogorov-Smirnov Distance
- The Kolmogorov-Smirnov distance is based on the empirical distribution function.
- Given N ordered data items R1 ≤ R2 ≤ ... ≤ RN, a function FN(i) is defined as the fraction of items less than or equal to Ri.
- The distance to the standard normal distribution is given by
  D = max_i (FN(i) − G(i), G(i) − FN(i)),
  where G(i) is the value of the standard normal cumulative distribution function evaluated at the point Ri.
47 The Kolmogorov-Smirnov Distance
48 Example: synthetic data
Mixture of three Gaussian clusters
49 Example: synthetic data (cont. 1)
50 Another application
- Another approach, by Jain, Xu, Ho and Xiao, and by Smith and Jain, proposes to carry out uniformity testing for cluster detection.
- The main idea here is to locate an inconsistent edge whose length is significantly larger than the average length of nearby edges. The distribution of these lengths must be normal under the null hypothesis.
- This method is unable to assess the true number of clusters in the data.
52 Membership Stability Algorithm
- Such an algorithm uses a simulation of random variables obtained by repeated clusterings of the same sample.
- The obtained results are compared by means of a compound (non-simple) metric, which is a function of variables defined on the set Ck = {1, ..., k}, where k is the number of clusters considered.
53 Lp-metric
- The Lp-metric is the best-known example of a compound metric. For every p ≥ 1 the Lp-metric is defined by
  Lp(X, Y) = (E|X − Y|^p)^(1/p).
- In particular,
  L1(X, Y) = E|X − Y|
  and, for label-valued variables, the misclassification distance
  L(X, Y) = P(X ≠ Y).
- The last metric has been, de facto, applied in:
- T. Lange, V. Roth, M. L. Braun, and J. M. Buhmann. Stability-based validation of clustering solutions. Neural Computation, 16(6):1299-1323, 2004.
- V. Roth, T. Lange, M. Braun, and J. Buhmann. A resampling approach to cluster validation. In COMPSTAT, 2002.
54 A family of clustering algorithms
In contrast with the geometrical algorithm, we use here a family of clustering algorithms. It can be the same algorithm starting from different initial points.
55 Clusters correspondence problem
Suppose we cluster the same set twice.
Cluster 1
S
Cluster 2
The same cluster can be labeled differently
Cluster 2
Cluster 1
56 Correspondence between labels α and β obtained for a sample S
- This task is solved by finding a permutation ψ of the set Ck which achieves the smallest misclassification between α and the permuted β, i.e.
  ψ* = arg min over permutations ψ of (1/|S|) Σx∈S 1{α(x) ≠ ψ(β(x))}.
- The computational complexity of solving this problem by the well-known Hungarian method is O(k³); a sketch follows below. This technique has been used in Lange T. et al., 2004, and is also known in the cluster combining area (see, for example, Topchy A. et al., 2003).