Title: About Pairwise Data Clustering (Hans-Joachim Mucha)
1. About Pairwise Data Clustering (Hans-Joachim Mucha)
2. Plan
- Cluster analysis based on proximity matrices
- A generalization of the sum-of-squares criterion
- Special weighted distances and weights of observations
- Pairwise data clustering of contingency tables
3. Introduction: About pairwise data clustering
- Finding clusters arises as a data analysis problem in various research areas such as ecology, psychology, sociology, and linguistics.
- In these application fields, proximity matrices and contingency tables, rather than data matrices, often have to be analyzed by clustering techniques in order to detect subsets (groups, clusters). Proximity matrices contain pairwise proximities (distances, similarities); that is, a proximity value is given for each pair of data points.
- Partitioning and hierarchical methods of pairwise data clustering each have advantages and disadvantages. Two well-known clustering methods for analyzing symmetric proximity matrices (or data matrices), K-means and Ward, are considered here in a little more detail.
- Some generalizations of simple model-based Gaussian clustering (K-means and Ward) are based on weights of observations and weighted (Euclidean) distances. This leads to methods that are of special interest, e.g., for clustering the rows and columns of contingency tables.
4. Cluster analysis based on proximity matrices
Note: often, only a distance matrix D is needed for (model-based) hierarchical and partitioning methods!
- From a data matrix X to a distance matrix D, and then to hierarchies and partitions.
5. Cluster analysis based on proximity matrices
Suppose for a while (in order to derive pairwise data clustering from the sum-of-squares criterion alone) that a sample of I independent observations (objects) is given in R^J, and denote by X = (x_ij) the corresponding data matrix consisting of I rows and J columns (variables). Further, let C = {x_1, ..., x_i, ..., x_I} denote the finite set of these I entities (observations); alternatively, let us write shortly C = {1, ..., i, ..., I}. The main task of cluster analysis is to find a partition of the set of I objects (observations) into K non-empty clusters (subsets, groups) C_k, k = 1, 2, ..., K. Hereby, the intersection of each pair of clusters C_k and C_l is empty, and the union of all clusters gives the whole set C.
6. Cluster analysis based on proximity matrices
The sum-of-squares criterion
\[ V_K = \sum_{k=1}^{K} \operatorname{tr}(W_k) \]
has to be minimized in the Gaussian case with uniform diagonal covariance matrix. Herein, \( W_k = \sum_{i \in C_k} (x_i - \bar{x}_k)(x_i - \bar{x}_k)^{\top} \) is the sample cross-product matrix for the kth cluster C_k, and \( \bar{x}_k \) is the usual maximum likelihood estimate of the expectation values in cluster C_k. The sum-of-squares criterion can be written in the following equivalent form without an explicit specification of cluster centers:
\[ V_K = \sum_{k=1}^{K} \frac{1}{n_k} \sum_{\substack{i,l \in C_k \\ i<l}} d_{il} . \]
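The equivalence of the two forms rests on a standard identity; as a brief sketch (not part of the original slides):

```latex
% Sketch (not from the slides): why the criterion needs no cluster centers.
% Expanding the squared norms around the cluster mean gives, for each cluster C_k,
\[
\operatorname{tr}(W_k)
  = \sum_{i \in C_k} \lVert x_i - \bar{x}_k \rVert^2
  = \frac{1}{2 n_k} \sum_{i \in C_k} \sum_{l \in C_k} \lVert x_i - x_l \rVert^2
  = \frac{1}{n_k} \sum_{\substack{i,l \in C_k \\ i<l}} d_{il},
\]
% so summing over k yields the pairwise form of the sum-of-squares criterion.
```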
7. Cluster analysis based on proximity matrices
Herein, \( d_{il} = \lVert x_i - x_l \rVert^2 \) is the squared Euclidean distance between two observations i and l. Furthermore, n_k is the cardinality of cluster C_k. There are two well-known clustering techniques for minimizing the sum-of-squares criterion based on pairwise distances: the partitioning K-means method (Späth (1985), Cluster Dissection ...) minimizes this criterion for a single partition into K clusters by exchanging observations between clusters, and the hierarchical Ward method (Ward (1963)) minimizes it in a stepwise manner by agglomerations of subsets, starting with clusters containing only a single observation each.
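As a small illustration (not from the talk; function and variable names are mine), the pairwise form of the criterion can be evaluated directly from a matrix of squared Euclidean distances and a cluster labelling:

```python
import numpy as np

def pairwise_sum_of_squares(D, labels):
    """Evaluate sum_k (1/n_k) * sum over pairs i<l in C_k of d_il, where D holds
    squared Euclidean distances and labels assigns a cluster to each observation."""
    total = 0.0
    for k in np.unique(labels):
        idx = np.flatnonzero(labels == k)
        n_k = idx.size
        if n_k < 2:
            continue  # a singleton cluster contributes zero
        block = D[np.ix_(idx, idx)]
        # each unordered pair appears twice in the symmetric block, hence /2
        total += block.sum() / (2.0 * n_k)
    return total

# toy usage: six points in R^2 forming two obvious clusters
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], float)
D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)   # squared distances
print(pairwise_sum_of_squares(D, np.array([0, 0, 0, 1, 1, 1])))
```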
8. Example: Hierarchical clustering
- Visualization of Ward's clustering results into 15 and 3 clusters.
- As the three-cluster solution shows, the Ward method never creates hyperplanes as borderlines between clusters.
- Proximity: Euclidean distances of 4000 randomly generated points in R^2 coming from one standard normally distributed population.
9. Cluster analysis based on proximity matrices: partitioning methods
- The partitioning K-means method based on a data matrix X goes back to Steinhaus (1956) and MacQueen (1967). Here we deal with a K-means technique based on pairwise (squared Euclidean) distances.
- This, however, presents practical problems of storage and computation time with an increasing number of objects, because both grow quadratically, as Späth already pointed out in 1985. Meanwhile, new generations of computers can deal easily with both problems of pairwise data clustering of about 10000 objects, so everyone can do this on their personal computer today (see the sketch of pairwise clustering from a distance matrix below).
- In order to cluster a practically unlimited number of observations, the criterion is generalized later on by introducing positive weights of observations. Instead of dealing with millions of objects directly, their appropriately weighted representatives are clustered subsequent to a preprocessing step of data aggregation.
- Note: pairwise data clustering is the first choice in the case I << J!
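For illustration, a minimal sketch of pairwise clustering of 10,000 points using SciPy (my choice of library, not mentioned in the talk); the condensed distance vector alone needs roughly 400 MB, which is unproblematic on current hardware:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 2))        # 10,000 observations in R^2

# condensed vector of pairwise Euclidean distances: n*(n-1)/2 ~ 5e7 entries (~400 MB)
d = pdist(X, metric="euclidean")

# Ward's hierarchical method works directly on the pairwise distances
Z = linkage(d, method="ward")
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram into 3 clusters
print(np.bincount(labels)[1:])                    # cluster sizes
```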
10. Example: K-means partitioning method based on a proximity matrix
- Both Ward and K-means minimize the same criterion, but the results are usually different.
- The K-means method leads to the well-known Voronoi tessellation, where the objects have minimum distance to their centroid and thus the borderlines between clusters are hyperplanes.
- Data: 4000 points in R^2 (same data as before).
11. Example: pairwise clustering
- Bivariate density surface of two-dimensional three-class data. Here both Ward and K-means clustering should be successful in dividing (decomposing) the data into smaller subsets.
- Proximity: Euclidean distances of 4000 randomly generated points in R^2 coming from three different normally distributed populations.
12. Example: pairwise clustering
- A comparison of the performance of Ward and K-means clustering of the three-class data shows that Ward performs slightly better than K-means.
- The three sub-populations have the following different mean values (-3, 3), (0, 0), (3, 3) and standard deviations (1, 1), (0.7, 0.7), (1.2, 1.2).
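A minimal sketch reproducing this simulated setting (the comparison by the adjusted Rand index and the use of scikit-learn are my choices, not the talk's; roughly 4000 points in total):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(1)
means = [(-3, 3), (0, 0), (3, 3)]
sds = [(1, 1), (0.7, 0.7), (1.2, 1.2)]
X = np.vstack([rng.normal(m, s, size=(1333, 2)) for m, s in zip(means, sds)])
truth = np.repeat([0, 1, 2], 1333)   # simulated class memberships

ward = fcluster(linkage(pdist(X), method="ward"), 3, criterion="maxclust")
km = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

print("Ward    ARI:", adjusted_rand_score(truth, ward))
print("K-means ARI:", adjusted_rand_score(truth, km))
```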
13. Cluster analysis based on proximity matrices: some applications
- Archaeology
- Data: 613 Roman bricks and tiles from the Rhine area that are described by nine oxides and ten chemical trace elements.
- Aim of clustering: finding brickyards that are not yet identified and confirming supposed sites of brickyards.
- Are there clusters?
- Fingerprint of the proximity matrix (Euclidean distances).
14. Cluster analysis based on proximity matrices: some applications
- Fingerprint of the same proximity matrix as before, but rearranged with respect to the result of pairwise data clustering. Obviously, there are clusters with low pairwise distances inside and high distances between them. (Mucha et al. (2002), Proc. of the 24th Annual Conference of the GfKl, Springer, Berlin.)
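A sketch of how such a rearranged fingerprint can be produced (toy data and plotting choices are mine): reorder the rows and columns of the distance matrix by the cluster labels and display both versions side by side:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
centers = [(0, 0), (4, 0), (2, 4)]
X = np.vstack([rng.normal(c, 0.5, size=(100, 2)) for c in centers])
D = squareform(pdist(X))                        # full symmetric distance matrix

labels = fcluster(linkage(pdist(X), method="ward"), 3, criterion="maxclust")
order = np.argsort(labels, kind="stable")       # group rows/columns by cluster

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
axes[0].imshow(D)                               # original order: little visible structure
axes[0].set_title("original order")
axes[1].imshow(D[np.ix_(order, order)])         # dark blocks = low within-cluster distances
axes[1].set_title("rearranged by clusters")
plt.show()
```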
15. Cluster analysis based on proximity matrices: some applications
- Linguistics: Pairwise clustering of 217 regions (Mucha & Haimerl (2005), Proc. of the 28th Annual Conference of the GfKl, Springer, Berlin).
- Data: Similarity matrix of dialect regions (coming from 3386 phonetic maps at 217 locations in Northern Italy).
- Task: Segmentation of the set of locations.
16. A generalization of the sum-of-squares criterion
Now let the diagonal covariance matrix vary between groups. Then the logarithmic sum-of-squares criterion
\[ V_K^{\log} = \sum_{k=1}^{K} n_k \log\!\left( \frac{\operatorname{tr}(W_k)}{n_k} \right) \]
has to be minimized. An equivalent formulation based on pairwise distances can be derived as
\[ V_K^{\log} = \sum_{k=1}^{K} n_k \log\!\left( \frac{1}{n_k^2} \sum_{\substack{i,l \in C_k \\ i<l}} d_{il} \right) . \]
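A small sketch (names are mine) evaluating this logarithmic criterion from a matrix of squared distances, reusing the pairwise identity from the earlier slides and assuming every non-singleton cluster has positive scatter:

```python
import numpy as np

def log_sum_of_squares(D, labels):
    """sum_k n_k * log(tr(W_k) / n_k), with tr(W_k) obtained from the pairwise
    squared distances as (1/n_k) * sum_{i<l in C_k} d_il (see the slides above)."""
    total = 0.0
    for k in np.unique(labels):
        idx = np.flatnonzero(labels == k)
        n_k = idx.size
        if n_k < 2:
            continue  # contribution of singleton clusters is taken as zero here
        tr_Wk = D[np.ix_(idx, idx)].sum() / (2.0 * n_k)
        total += n_k * np.log(tr_Wk / n_k)
    return total
```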
17. Special weighted distances and weights of observations
- The sum-of-squares criterion can be generalized by using positive weights of observations to
\[ V_K = \sum_{k=1}^{K} \frac{1}{M_k} \sum_{\substack{i,l \in C_k \\ i<l}} m_i\, m_l\, d_{il} , \]
where M_k = \sum_{i \in C_k} m_i and m_i denote the weight (mass) of cluster C_k and the mass of observation i, respectively.
- Concerning the K-means algorithm, the observations have to be exchanged between clusters in order to minimize the criterion above. Here the following condition for the exchange of an observation i coming from cluster C_k and shifting into cluster C_g has to be fulfilled:
\[ \frac{M_g\, m_i}{M_g + m_i}\, \lVert x_i - \bar{x}_g \rVert^2 \;<\; \frac{M_k\, m_i}{M_k - m_i}\, \lVert x_i - \bar{x}_k \rVert^2 , \]
where
\[ \lVert x_i - \bar{x}_k \rVert^2 = \frac{1}{M_k} \sum_{l \in C_k} m_l\, d_{il} \;-\; \frac{1}{M_k^2} \sum_{\substack{l,l' \in C_k \\ l<l'}} m_l\, m_{l'}\, d_{ll'}
\]
and the analogous expression holds for cluster C_g, so that both sides can be evaluated from pairwise distances alone.
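A brute-force sketch of the weighted criterion and of the exchange test (function names are mine; weights are expected as a NumPy array, and a real implementation would use the incremental condition above instead of re-evaluating the whole criterion):

```python
import numpy as np

def weighted_pairwise_criterion(D, labels, m):
    """V = sum_k (1/M_k) * sum_{i<l in C_k} m_i m_l d_il, for squared distances D,
    cluster labels, and positive observation weights m (cluster mass M_k = sum of m)."""
    total = 0.0
    for k in np.unique(labels):
        idx = np.flatnonzero(labels == k)
        if idx.size < 2:
            continue
        w = m[idx]
        block = D[np.ix_(idx, idx)] * np.outer(w, w)   # m_i * m_l * d_il
        total += block.sum() / (2.0 * w.sum())         # each pair counted twice
    return total

def exchange_improves(D, labels, m, i, g):
    """Does moving observation i into cluster g lower the weighted criterion?
    This sketch simply evaluates the criterion before and after the move."""
    trial = labels.copy()
    trial[i] = g
    return weighted_pairwise_criterion(D, trial, m) < weighted_pairwise_criterion(D, labels, m)
```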
18. Special weighted distances and weights of observations
- Similarly, a generalized logarithmic sum-of-squares criterion can be derived as
\[ V_K^{\log} = \sum_{k=1}^{K} u(C_k), \qquad u(C_k) = M_k \log\!\left( \frac{1}{M_k^2} \sum_{\substack{i,l \in C_k \\ i<l}} m_i\, m_l\, d_{il} \right) , \]
where u(C_k) denotes the within-cluster weighted logarithmic sum-of-squares for the cluster C_k.
- Taking into account weights of the variables, the squared weighted Euclidean distance
\[ d_{il} = (x_i - x_l)^{\top} Q\, (x_i - x_l) \]
is used, where Q is restricted to be diagonal. For example, Q = (\operatorname{diag} S)^{-1} means weighting the variables by their inverse variances; here S denotes the usual estimate of the covariance matrix.
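A small sketch of this inverse-variance weighting (assuming SciPy is available): scaling each variable by the square root of its weight turns the weighted squared distance into an ordinary squared Euclidean distance:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 5)) * np.array([1.0, 2.0, 5.0, 0.5, 3.0])  # unequal scales

# Q = (diag S)^{-1}: diagonal matrix of inverse sample variances
q = 1.0 / X.var(axis=0, ddof=1)

# d_Q(x_i, x_l) = (x_i - x_l)^T Q (x_i - x_l) equals the ordinary squared
# Euclidean distance after scaling each variable by sqrt(q_j)
D_q = squareform(pdist(X * np.sqrt(q), metric="sqeuclidean"))
print(D_q.shape)   # (200, 200)
```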
19. Pairwise data clustering of contingency tables
- Now let us consider a contingency table H = (h_ij) consisting of I rows and J columns. Then,
\[ r_i = \left( \frac{h_{i1}}{h_{i.}}, \ldots, \frac{h_{iJ}}{h_{i.}} \right) \qquad \text{and} \qquad m_i = \frac{h_{i.}}{h_{..}} \]
denote the corresponding row profiles and weights of rows, respectively.
- Without loss of generality, let us consider the cluster analysis of rows. Usually, the total inertia of H,
\[ T = \sum_{i=1}^{I} m_i \sum_{j=1}^{J} q_j \left( r_{ij} - \frac{h_{.j}}{h_{..}} \right)^{2} , \]
has to be decomposed by cluster analysis in a way that the sum of within-cluster inertia
\[ W_K = \sum_{k=1}^{K} \frac{1}{M_k} \sum_{\substack{i,l \in C_k \\ i<l}} m_i\, m_l\, d_{il} , \qquad d_{il} = \sum_{j=1}^{J} q_j \left( r_{ij} - r_{lj} \right)^{2} , \]
becomes a minimum, where
\[ q_j = \frac{h_{..}}{h_{.j}} \]
defines the weights of variables (columns).
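A minimal sketch (names are mine) computing row profiles, row masses, column weights, pairwise chi-square distances, and the total inertia from a contingency table:

```python
import numpy as np

def chi_square_setup(H):
    """Row profiles, row masses, column weights, pairwise chi-square distances and
    total inertia for a contingency table H (I x J); variable names are mine."""
    H = np.asarray(H, dtype=float)
    total = H.sum()
    row_mass = H.sum(axis=1) / total               # m_i = h_i. / h..
    col_mass = H.sum(axis=0) / total               # centroid profile c_j = h_.j / h..
    profiles = H / H.sum(axis=1, keepdims=True)    # r_i = (h_i1/h_i., ..., h_iJ/h_i.)
    q = 1.0 / col_mass                             # column (variable) weights q_j = h.. / h_.j
    diff = profiles[:, None, :] - profiles[None, :, :]
    D = (diff ** 2 * q).sum(axis=2)                # pairwise squared chi-square distances
    inertia = (row_mass * ((profiles - col_mass) ** 2 * q).sum(axis=1)).sum()
    return profiles, row_mass, q, D, inertia

# toy usage with a small 3 x 3 table
profiles, m, q, D, inertia = chi_square_setup([[10, 20, 30], [20, 10, 5], [5, 5, 40]])
print(round(inertia, 4))
```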
20. Pairwise data clustering of contingency tables
- Example 1
- Data: World's largest merchant fleets by country of owner. Self-propelled oceangoing vessels of 1,000 gross tons and greater (as of July 1, 2003). "Other" covers roll-on/roll-off, passenger, breakbulk ships, partial containerships, refrigerated cargo, barge carriers, and specialized cargo ships.
- Source: CIA World Factbook
21. Pairwise data clustering of contingency tables
- World's largest merchant fleets: correspondence analysis plot.
22. Pairwise data clustering of contingency tables
- Hierarchical cluster analysis of the world's largest merchant fleets.
- From contingency tables to chi-square distances, and further to hierarchies and partitions.
- The column points are clustered in the same way as the row points.
23. Pairwise data clustering of contingency tables
- Example 2
- Reference: Greenacre, M. J. (1988): Clustering the Rows and Columns of a Contingency Table. Journal of Classification, 5, 39-51.
- Data: Guttman (1971); 1554 Israeli adults and their principal worries.
- Ward clustering (Greenacre).
- K-means leads to a better result in four clusters: Oth, POL, ECO, MIL, ENR, SAB, MTO, and PER.
24. Other pairwise data clustering methods
- Formulation of the optimization problem of a pairwise clustering cost function in the maximum entropy framework, using a variational principle to derive corresponding data partitionings in a d-dimensional Euclidean space. See Buhmann & Hoffman (1994): A Maximum Entropy Approach to Pairwise Data Clustering. In: Proceedings of the International Conference on Pattern Recognition, Hebrew University, Jerusalem, Vol. II, IEEE Computer Society Press, pp. 207-212.
- PAM (partitioning around medoids), e.g., for clustering gene-expression data. See Kaufman & Rousseeuw (1990): Finding Groups in Data: An Introduction to Cluster Analysis. New York: Wiley.
25. Conclusions: About pairwise data clustering
- The well-known sum-of-squares criterion of model-based cluster analysis can be formulated on the basis of pairwise distances.
- It can be generalized in three ways: first by allowing different volumes of clusters, second by using weighted observations, and third by weighting the variables.
- As a special case of weighting the variables and observations, the decomposition of the inertia of contingency tables into clusters is possible by the hierarchical Ward method or the partitioning K-means method.
- Thank you for your attention.