Title: About Pairwise Data Clustering (Hans-Joachim Mucha)
1. About Pairwise Data Clustering (Hans-Joachim Mucha)
2. Plan
- Cluster analysis based on proximity matrices
- A generalization of the sum-of-squares criterion
- Special weighted distances and weights of observations
- Pairwise data clustering of contingency tables
3. Introduction: About pairwise data clustering
- Finding clusters arises as a data analysis problem in various research areas such as ecology, psychology, sociology, and linguistics.
- In these application fields, proximity matrices and contingency tables, rather than data matrices, often have to be analyzed by clustering techniques in order to detect subsets (groups, clusters). Proximity matrices contain pairwise proximities (distances, similarities); that is, a proximity value is given for each pair of data points.
- Partitioning and hierarchical methods of pairwise data clustering each have advantages and disadvantages. Two well-known clustering methods for analyzing symmetric proximity matrices (or data matrices), K-means and Ward, are considered here in a little more detail.
- Some generalizations of simple model-based Gaussian clustering (K-means and Ward) are based on weights of observations and weighted (Euclidean) distances. This leads to methods that are of special interest, e.g., for clustering the rows and columns of contingency tables.
4. Cluster analysis based on proximity matrices
Note: often, only a distance matrix D is needed for (model-based) hierarchical and partitioning methods!
- From a data matrix X to a distance matrix D, and then to hierarchies and partitions.
5. Cluster analysis based on proximity matrices
Suppose for a while (in order to derive pairwise data clustering from the sum-of-squares criterion alone) that a sample of I independent observations (objects) is given in R^J, and denote by X = (x_ij) the corresponding data matrix consisting of I rows and J columns (variables). Further, let C = {x_1, ..., x_i, ..., x_I} denote the finite set of these I entities (observations); alternatively, let us write shortly C = {1, ..., i, ..., I}. The main task of cluster analysis is to find a partition of the set of I objects (observations) into K non-empty clusters (subsets, groups) C_k, k = 1, 2, ..., K. Hereby, the intersection of each pair of clusters C_k and C_l is empty, and the union of all clusters gives the whole set C.
6. Cluster analysis based on proximity matrices
The sum-of-squares criterion
\[ V_K = \sum_{k=1}^{K} \operatorname{tr}(W_k) \]
has to be minimized in the Gaussian case with uniform diagonal covariance matrix. Herein, \( W_k = \sum_{i \in C_k} (x_i - \bar{x}_k)(x_i - \bar{x}_k)^{\top} \) is the sample cross-product matrix for the kth cluster C_k, and \( \bar{x}_k \) is the usual maximum likelihood estimate of the expectation values in cluster C_k. The sum-of-squares criterion can be written in the following equivalent form without an explicit specification of cluster centers:
\[ V_K = \sum_{k=1}^{K} \frac{1}{n_k} \sum_{\substack{i,l \in C_k \\ i<l}} d_{il} . \]
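The equivalence of the two forms rests on a standard identity; as a brief sketch (not part of the original slides):

```latex
% Sketch (not from the slides): why the criterion needs no cluster centers.
% Expanding the squared norms around the cluster mean gives, for each cluster C_k,
\[
\operatorname{tr}(W_k)
  = \sum_{i \in C_k} \lVert x_i - \bar{x}_k \rVert^2
  = \frac{1}{2 n_k} \sum_{i \in C_k} \sum_{l \in C_k} \lVert x_i - x_l \rVert^2
  = \frac{1}{n_k} \sum_{\substack{i,l \in C_k \\ i<l}} d_{il},
\]
% so summing over k yields the pairwise form of the sum-of-squares criterion.
```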
7. Cluster analysis based on proximity matrices
Herein, \( d_{il} = \lVert x_i - x_l \rVert^2 \) is the squared Euclidean distance between two observations i and l. Furthermore, n_k is the cardinality of cluster C_k. There are two well-known clustering techniques for minimizing the sum-of-squares criterion based on pairwise distances: the partitioning K-means method (Späth (1985), Cluster Dissection ...) minimizes this criterion for a single partition into K clusters by exchanging observations between clusters, and the hierarchical Ward method (Ward (1963)) minimizes it in a stepwise manner by agglomerations of subsets, starting with clusters containing only a single observation each.
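As a small illustration (not from the talk; function and variable names are mine), the pairwise form of the criterion can be evaluated directly from a matrix of squared Euclidean distances and a cluster labelling:

```python
import numpy as np

def pairwise_sum_of_squares(D, labels):
    """Evaluate sum_k (1/n_k) * sum over pairs i<l in C_k of d_il, where D holds
    squared Euclidean distances and labels assigns a cluster to each observation."""
    total = 0.0
    for k in np.unique(labels):
        idx = np.flatnonzero(labels == k)
        n_k = idx.size
        if n_k < 2:
            continue  # a singleton cluster contributes zero
        block = D[np.ix_(idx, idx)]
        # each unordered pair appears twice in the symmetric block, hence /2
        total += block.sum() / (2.0 * n_k)
    return total

# toy usage: six points in R^2 forming two obvious clusters
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], float)
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)   # squared distances
print(pairwise_sum_of_squares(D, np.array([0, 0, 0, 1, 1, 1])))
```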
8. Example: Hierarchical clustering
- Visualization of Ward's clustering results into 15 and 3 clusters.
- As the three-cluster solution shows, the Ward method never creates hyperplanes as borderlines between clusters.
- Proximity: Euclidean distances of 4000 randomly generated points in R^2 coming from one standard normally distributed population.
9. Cluster analysis based on proximity matrices: partitioning methods
- The partitioning K-means method based on a data matrix X goes back to Steinhaus (1956) and MacQueen (1967). Here we deal with a K-means technique based on pairwise (squared Euclidean) distances.
- This, however, presents practical problems of storage and computation time with an increasing number of objects, because both grow quadratically, as Späth already pointed out in 1985. Meanwhile, new generations of computers can deal easily with both problems of pairwise data clustering of about 10000 objects, so everyone can do this on their personal computer today (see the sketch of pairwise clustering from a distance matrix below).
- In order to cluster a practically unlimited number of observations, the criterion is generalized later on by introducing positive weights of observations. Instead of dealing with millions of objects directly, their appropriately weighted representatives are clustered subsequent to a preprocessing step of data aggregation.
- Note: pairwise data clustering is the first choice in the case I << J!
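For illustration, a minimal sketch of pairwise clustering of 10,000 points using SciPy (my choice of library, not mentioned in the talk); the condensed distance vector alone needs roughly 400 MB, which is unproblematic on current hardware:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 2))        # 10,000 observations in R^2

# condensed vector of pairwise Euclidean distances: n*(n-1)/2 ~ 5e7 entries (~400 MB)
d = pdist(X, metric="euclidean")

# Ward's hierarchical method works directly on the pairwise distances
Z = linkage(d, method="ward")
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram into 3 clusters
print(np.bincount(labels)[1:])                    # cluster sizes
```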
10. Example: K-means partitioning method based on a proximity matrix
- Both Ward and K-means minimize the same criterion, but the results are usually different.
- The K-means method leads to the well-known Voronoi tessellation, where the objects have minimum distance to their centroid and thus the borderlines between clusters are hyperplanes.
- Data: 4000 points in R^2 (same data as before).
11. Example: pairwise clustering
- Bivariate density surface of two-dimensional three-class data. Here both Ward and K-means clustering should be successful in dividing (decomposing) the data into smaller subsets.
- Proximity: Euclidean distances of 4000 randomly generated points in R^2 coming from three different normally distributed populations.
12. Example: pairwise clustering
- A comparison of the performance of Ward and K-means clustering of the three-class data shows that Ward performs slightly better than K-means.
- The three sub-populations have the following different mean values (-3, 3), (0, 0), (3, 3) and standard deviations (1, 1), (0.7, 0.7), (1.2, 1.2).
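A minimal sketch reproducing this simulated setting (the comparison by the adjusted Rand index and the use of scikit-learn are my choices, not the talk's; roughly 4000 points in total):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(1)
means = [(-3, 3), (0, 0), (3, 3)]
sds = [(1, 1), (0.7, 0.7), (1.2, 1.2)]
X = np.vstack([rng.normal(m, s, size=(1333, 2)) for m, s in zip(means, sds)])
truth = np.repeat([0, 1, 2], 1333)   # simulated class memberships

ward = fcluster(linkage(pdist(X), method="ward"), 3, criterion="maxclust")
km = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

print("Ward    ARI:", adjusted_rand_score(truth, ward))
print("K-means ARI:", adjusted_rand_score(truth, km))
```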
13. Cluster analysis based on proximity matrices: some applications
- Archaeology
- Data: 613 Roman bricks and tiles from the Rhine area that are described by nine oxides and ten chemical trace elements.
- Aim of clustering: finding brickyards that are not yet identified and confirming supposed sites of brickyards.
- Are there clusters?
- Fingerprint of the proximity matrix (Euclidean distances).
14. Cluster analysis based on proximity matrices: some applications
- Fingerprint of the same proximity matrix as before, but rearranged with respect to the result of pairwise data clustering. Obviously, there are clusters with low pairwise distances inside and high distances between them. (Mucha et al. (2002), Proc. of the 24th Annual Conference of the GfKl, Springer, Berlin.)
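A sketch of how such a rearranged fingerprint can be produced (toy data and plotting choices are mine): reorder the rows and columns of the distance matrix by the cluster labels and display both versions side by side:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
centers = [(0, 0), (4, 0), (2, 4)]
X = np.vstack([rng.normal(c, 0.5, size=(100, 2)) for c in centers])
D = squareform(pdist(X))                        # full symmetric distance matrix

labels = fcluster(linkage(pdist(X), method="ward"), 3, criterion="maxclust")
order = np.argsort(labels, kind="stable")       # group rows/columns by cluster

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
axes[0].imshow(D)                               # original order: little visible structure
axes[0].set_title("original order")
axes[1].imshow(D[np.ix_(order, order)])         # dark blocks = low within-cluster distances
axes[1].set_title("rearranged by clusters")
plt.show()
```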
15. Cluster analysis based on proximity matrices: some applications
- Linguistics: Pairwise clustering of 217 regions (Mucha & Haimerl (2005), Proc. of the 28th Annual Conference of the GfKl, Springer, Berlin).
- Data: Similarity matrix of dialect regions (coming from 3386 phonetic maps at 217 locations in Northern Italy).
- Task: Segmentation of the set of locations.
16. A generalization of the sum-of-squares criterion
Now let the diagonal covariance matrix vary between groups. Then the logarithmic sum-of-squares criterion
\[ V_K^{\log} = \sum_{k=1}^{K} n_k \log\!\left( \frac{\operatorname{tr}(W_k)}{n_k} \right) \]
has to be minimized. An equivalent formulation based on pairwise distances can be derived as
\[ V_K^{\log} = \sum_{k=1}^{K} n_k \log\!\left( \frac{1}{n_k^2} \sum_{\substack{i,l \in C_k \\ i<l}} d_{il} \right) . \]
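A small sketch (names are mine) evaluating this logarithmic criterion from a matrix of squared distances, reusing the pairwise identity from the earlier slides and assuming every non-singleton cluster has positive scatter:

```python
import numpy as np

def log_sum_of_squares(D, labels):
    """sum_k n_k * log(tr(W_k) / n_k), with tr(W_k) obtained from the pairwise
    squared distances as (1/n_k) * sum_{i<l in C_k} d_il (see the slides above)."""
    total = 0.0
    for k in np.unique(labels):
        idx = np.flatnonzero(labels == k)
        n_k = idx.size
        if n_k < 2:
            continue  # contribution of singleton clusters is taken as zero here
        tr_Wk = D[np.ix_(idx, idx)].sum() / (2.0 * n_k)
        total += n_k * np.log(tr_Wk / n_k)
    return total
```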
17. Special weighted distances and weights of observations
- The sum-of-squares criterion can be generalized by using positive weights of observations to
\[ V_K = \sum_{k=1}^{K} \frac{1}{M_k} \sum_{\substack{i,l \in C_k \\ i<l}} m_i\, m_l\, d_{il} , \]
where M_k = \sum_{i \in C_k} m_i and m_i denote the weight (mass) of cluster C_k and the mass of observation i, respectively.
- Concerning the K-means algorithm, the observations have to be exchanged between clusters in order to minimize the criterion above. Here the following condition for the exchange of an observation i coming from cluster C_k and shifting into cluster C_g has to be fulfilled:
\[ \frac{M_g\, m_i}{M_g + m_i}\, \lVert x_i - \bar{x}_g \rVert^2 \;<\; \frac{M_k\, m_i}{M_k - m_i}\, \lVert x_i - \bar{x}_k \rVert^2 , \]
where
\[ \lVert x_i - \bar{x}_k \rVert^2 = \frac{1}{M_k} \sum_{l \in C_k} m_l\, d_{il} \;-\; \frac{1}{M_k^2} \sum_{\substack{l,l' \in C_k \\ l<l'}} m_l\, m_{l'}\, d_{ll'}
\]
and the analogous expression holds for cluster C_g, so that both sides can be evaluated from pairwise distances alone.
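A brute-force sketch of the weighted criterion and of the exchange test (function names are mine; weights are expected as a NumPy array, and a real implementation would use the incremental condition above instead of re-evaluating the whole criterion):

```python
import numpy as np

def weighted_pairwise_criterion(D, labels, m):
    """V = sum_k (1/M_k) * sum_{i<l in C_k} m_i m_l d_il, for squared distances D,
    cluster labels, and positive observation weights m (cluster mass M_k = sum of m)."""
    total = 0.0
    for k in np.unique(labels):
        idx = np.flatnonzero(labels == k)
        if idx.size < 2:
            continue
        w = m[idx]
        block = D[np.ix_(idx, idx)] * np.outer(w, w)   # m_i * m_l * d_il
        total += block.sum() / (2.0 * w.sum())         # each pair counted twice
    return total

def exchange_improves(D, labels, m, i, g):
    """Does moving observation i into cluster g lower the weighted criterion?
    This sketch simply evaluates the criterion before and after the move."""
    trial = labels.copy()
    trial[i] = g
    return weighted_pairwise_criterion(D, trial, m) < weighted_pairwise_criterion(D, labels, m)
```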
18. Special weighted distances and weights of observations
- Similarly, a generalized logarithmic sum-of-squares criterion can be derived as
\[ V_K^{\log} = \sum_{k=1}^{K} u(C_k), \qquad u(C_k) = M_k \log\!\left( \frac{1}{M_k^2} \sum_{\substack{i,l \in C_k \\ i<l}} m_i\, m_l\, d_{il} \right) , \]
where u(C_k) denotes the within-cluster weighted logarithmic sum-of-squares for the cluster C_k.
- Taking into account weights of the variables, the squared weighted Euclidean distance
\[ d_{il} = (x_i - x_l)^{\top} Q\, (x_i - x_l) \]
is used, where Q is restricted to be diagonal. For example, Q = (\operatorname{diag} S)^{-1} means weighting the variables by their inverse variances; here S denotes the usual estimate of the covariance matrix.
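A small sketch of this inverse-variance weighting (assuming SciPy is available): scaling each variable by the square root of its weight turns the weighted squared distance into an ordinary squared Euclidean distance:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 5)) * np.array([1.0, 2.0, 5.0, 0.5, 3.0])  # unequal scales

# Q = (diag S)^{-1}: diagonal matrix of inverse sample variances
q = 1.0 / X.var(axis=0, ddof=1)

# d_Q(x_i, x_l) = (x_i - x_l)^T Q (x_i - x_l) equals the ordinary squared
# Euclidean distance after scaling each variable by sqrt(q_j)
D_q = squareform(pdist(X * np.sqrt(q), metric="sqeuclidean"))
print(D_q.shape)   # (200, 200)
```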
19. Pairwise data clustering of contingency tables
- Now let us consider a contingency table H = (h_ij) consisting of I rows and J columns. Then,
\[ r_i = \left( \frac{h_{i1}}{h_{i.}}, \ldots, \frac{h_{iJ}}{h_{i.}} \right) \qquad \text{and} \qquad m_i = \frac{h_{i.}}{h_{..}} \]
denote the corresponding row profiles and weights of rows, respectively.
- Without loss of generality, let us consider the cluster analysis of rows. Usually, the total inertia of H,
\[ T = \sum_{i=1}^{I} m_i \sum_{j=1}^{J} q_j \left( r_{ij} - \frac{h_{.j}}{h_{..}} \right)^{2} , \]
has to be decomposed by cluster analysis in a way that the sum of within-cluster inertia
\[ W_K = \sum_{k=1}^{K} \frac{1}{M_k} \sum_{\substack{i,l \in C_k \\ i<l}} m_i\, m_l\, d_{il} , \qquad d_{il} = \sum_{j=1}^{J} q_j \left( r_{ij} - r_{lj} \right)^{2} , \]
becomes a minimum, where
\[ q_j = \frac{h_{..}}{h_{.j}} \]
defines the weights of variables (columns).
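A minimal sketch (names are mine) computing row profiles, row masses, column weights, pairwise chi-square distances, and the total inertia from a contingency table:

```python
import numpy as np

def chi_square_setup(H):
    """Row profiles, row masses, column weights, pairwise chi-square distances and
    total inertia for a contingency table H (I x J); variable names are mine."""
    H = np.asarray(H, dtype=float)
    total = H.sum()
    row_mass = H.sum(axis=1) / total               # m_i = h_i. / h..
    col_mass = H.sum(axis=0) / total               # centroid profile c_j = h_.j / h..
    profiles = H / H.sum(axis=1, keepdims=True)    # r_i = (h_i1/h_i., ..., h_iJ/h_i.)
    q = 1.0 / col_mass                             # column (variable) weights q_j = h.. / h_.j
    diff = profiles[:, None, :] - profiles[None, :, :]
    D = (diff ** 2 * q).sum(axis=2)                # pairwise squared chi-square distances
    inertia = (row_mass * ((profiles - col_mass) ** 2 * q).sum(axis=1)).sum()
    return profiles, row_mass, q, D, inertia

# toy usage with a small 3 x 3 table
profiles, m, q, D, inertia = chi_square_setup([[10, 20, 30], [20, 10, 5], [5, 5, 40]])
print(round(inertia, 4))
```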
20. Pairwise data clustering of contingency tables
- Example 1
- Data: World's largest merchant fleets by country of owner. Self-propelled oceangoing vessels of 1,000 gross tons and greater (as of July 1, 2003). "Other" covers roll-on/roll-off, passenger, breakbulk ships, partial containerships, refrigerated cargo, barge carriers, and specialized cargo ships.
- Source: CIA World Factbook
21. Pairwise data clustering of contingency tables
- World's largest merchant fleets: correspondence analysis plot.
22. Pairwise data clustering of contingency tables
- Hierarchical cluster analysis of the world's largest merchant fleets.
- From contingency tables to chi-square distances, and further to hierarchies and partitions.
- The column points are clustered in the same way as the row points.
23. Pairwise data clustering of contingency tables
- Example 2
- Reference: Greenacre, M. J. (1988): Clustering the Rows and Columns of a Contingency Table. Journal of Classification, 5, 39-51.
- Data: Guttman (1971); 1554 Israeli adults and their principal worries.
- Ward clustering (Greenacre).
- K-means leads to a better result in four clusters: Oth, POL, ECO, MIL, ENR, SAB, MTO, and PER.
24. Other pairwise data clustering methods
- Formulation of the optimization problem of a pairwise clustering cost function in the maximum entropy framework, using a variational principle to derive corresponding data partitionings in a d-dimensional Euclidean space. See Buhmann & Hoffman (1994): A Maximum Entropy Approach to Pairwise Data Clustering. In: Proceedings of the International Conference on Pattern Recognition, Hebrew University, Jerusalem, Vol. II, IEEE Computer Society Press, pp. 207-212.
- PAM (partitioning around medoids), e.g., for clustering gene-expression data. See Kaufman & Rousseeuw (1990): Finding Groups in Data: An Introduction to Cluster Analysis. New York: Wiley.
25. Conclusions: About pairwise data clustering
- The well-known sum-of-squares criterion of model-based cluster analysis can be formulated on the basis of pairwise distances.
- It can be generalized in three ways: first by allowing different volumes of clusters, second by using weighted observations, and third by weighting the variables.
- As a special case of weighting the variables and observations, the decomposition of the inertia of contingency tables into clusters is possible by the hierarchical Ward method or the partitioning K-means method.
- Thank you for your attention.