Title: L10.1
1. Lecture 10: Cluster analysis
- Uses of cluster analysis
- Clustering methods
- Hierarchical
- Partitioned
- Additive trees
- Cluster distance metrics
2. Cluster analysis I: grouping objects
- Given a set of p variables X1, X2, ..., Xp, and a set of N objects, the task is to group the objects into classes so that objects within classes are more similar to one another than to members of other classes.
- Questions of interest: does the set of objects fall into a smaller set of natural groups? What are the relationships among different objects?
- Note: in most cases, clusters are not defined a priori.
3. Cluster analysis II: grouping variables
- Given a set of p variables X1, X2, ..., Xp, and a set of N objects, the task is to group the variables into classes so that variables within classes are more highly correlated with one another than with members of other classes.
- Questions of interest: does the set of variables fall into a smaller set of natural groups? What are the relationships among different variables?
4. Cluster analysis III: grouping objects and variables
- Given a set of p variables X1, X2, ..., Xp, and a set of N objects, the task is to group the objects and variables into classes so that variables and objects within classes are more highly correlated with one another than with members of other classes.
- Questions of interest: does the set of variable/object combinations fall into a smaller set of natural groups? What are the relationships among the different combinations?
5. The basic principle
- Objects that are similar to/highly correlated with one another should be in the same group, whereas objects that are dissimilar/uncorrelated should be in different groups.
- Thus, all cluster analyses begin with measures of similarity/dissimilarity among objects (distance matrices) or with correlation matrices.
6. Clustering objects
- Objects that are closer together based on
pairwise multivariate distances or pairwise
correlations are assigned to the same cluster,
whereas those farther apart or having low
pairwise correlations are assigned to different
clusters.
7. Clustering variables
- Variables that have high pairwise correlations
are assigned to the same cluster, whereas those
having low pairwise correlations are assigned to
different clusters.
8. Clustering objects and variables
- Object/variable combinations are classified into discrete categories determined by the magnitude of the corresponding entries in the original data matrix.
- This allows for easier visualization of object/variable clusters.
9. Types of clusters
- Exclusive: each object/variable belongs to one and only one cluster.
- Overlapping: an object or variable may belong to more than one cluster.
[Figure: exclusive clusters vs. overlapping clusters]
10. Scale considerations
- In general, correlation measures are not influenced by differences in scale, but distance measures (e.g. Euclidean distance) are affected.
- So, use distance measures when variables are measured on a common scale, or compute distance measures from standardized values when variables are not on the same scale.
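The effect of scale can be sketched in a few lines of NumPy (the data and the variable choices here are hypothetical, purely for illustration):

```python
import numpy as np

# Hypothetical measurements on two scales: body mass (g) and limb length (m).
X = np.array([[5000.0, 0.30],
              [5200.0, 0.35],
              [4800.0, 0.90]])

def euclid(a, b):
    return np.sqrt(((a - b) ** 2).sum())

# Raw distances are dominated by the large-scale variable (mass):
raw_12 = euclid(X[0], X[1])
raw_13 = euclid(X[0], X[2])   # nearly identical to raw_12 despite the
                              # threefold difference in limb length

# Standardizing each column (mean 0, sd 1) puts the variables on a common
# scale before distances are computed:
Z = (X - X.mean(axis=0)) / X.std(axis=0)
std_12 = euclid(Z[0], Z[1])
std_13 = euclid(Z[0], Z[2])   # now clearly the larger of the two
```

On the raw data the two distances are almost equal because mass swamps limb length; after standardization the third object stands apart, as it should.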
11. Exclusive clustering methods I: hierarchical clustering of objects
- Begins with calculation of distances/correlations among all pairs of objects,
- with groups being formed by agglomeration (lumping of objects).
- The end result is a dendrogram (tree) which shows the distances between pairs of objects.
12. Exclusive clustering methods I: hierarchical clustering of variables
- Begins with calculation of correlations/distances between all pairs of variables,
- with groups being formed by lumping of highly correlated variables.
- The end result is a dendrogram (tree) which shows the distances between pairs of variables.
[Figure: dendrogram of six variables (MOLARBR, MANDBRTH, MOLARL, MANDHT, MOLARS, MOLARS2) against a distance axis from 0 to 15]
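A dendrogram like the one above can be produced with SciPy (an assumption; the lecture does not name software). The variable names echo the figure, but the data here are random placeholders:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

rng = np.random.default_rng(1)

# Placeholder data: 30 objects by 6 variables; names echo the figure above,
# but the values are random stand-ins.
names = ['MOLARBR', 'MANDBRTH', 'MOLARL', 'MANDHT', 'MOLARS', 'MOLARS2']
X = rng.normal(size=(30, len(names)))

# Distance between variables = 1 - r; we cluster the variables, not objects.
D = 1.0 - np.corrcoef(X, rowvar=False)
np.fill_diagonal(D, 0.0)
Z = linkage(squareform(D, checks=False), method='average')

# dendrogram(Z, labels=names)  # draws the tree when matplotlib is present
```

With p variables the linkage matrix Z records the p - 1 successive merges and the distance at which each occurs.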
13. Hierarchical clustering of objects and variables
- A standardized data matrix is used to produce a two-dimensional colour/shading graph, with colour codes/shading intensities determined by the magnitude of the values in the original data matrix,
- which allows one to pick out similar objects and variables at a glance.
14. Hierarchical joining algorithms
- Single (nearest neighbour): distance between two clusters = distance between the two closest members of the two clusters.
- Complete (furthest neighbour): distance between two clusters = distance between the two most distant cluster members.
- Centroid: distance between two clusters = distance between the multivariate means (centroids) of each cluster.
[Figure: three clusters illustrating single, complete and centroid distances]
15. Hierarchical joining algorithms (cont'd)
- Average: distance between two clusters = average distance between all members of the two clusters.
- Median: distance between two clusters = median distance between all members of the two clusters.
- Ward: distance between two clusters = average distance between all members of the two clusters, with adjustment for covariances.
[Figure: three clusters illustrating the mean/median/adjusted mean of all pairwise distances]
16. Single joining (nearest neighbour)

Distance matrix:

Object    1    2    3    4    5
1         -
2         2    -
3         6    5    -
4        10    9    4    -
5         9    8    5    3    -

Joining sequence:

Distance  Clusters
0         1, 2, 3, 4, 5
2         (1, 2), 3, 4, 5
3         (1, 2), 3, (4, 5)
4         (1, 2), (3, 4, 5)
5         (1, 2, 3, 4, 5)
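The joining sequence can be checked with SciPy (an assumption; the lecture does not name software). The same distance matrix is reused for complete and average joining on the following slides, so those sequences are computed here as well:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Condensed form of the distance matrix above, in pdist order:
# (1,2),(1,3),(1,4),(1,5),(2,3),(2,4),(2,5),(3,4),(3,5),(4,5)
d = np.array([2, 6, 10, 9, 5, 9, 8, 4, 5, 3], dtype=float)

# Column 2 of the linkage matrix holds the distance at which each merge occurs.
single = linkage(d, method='single')[:, 2]      # nearest neighbour
complete = linkage(d, method='complete')[:, 2]  # furthest neighbour
average = linkage(d, method='average')[:, 2]

print(single)    # merges at distances 2, 3, 4, 5
print(complete)  # merges at distances 2, 3, 5, 10
```

The computed merge distances reproduce the joining tables for single (2, 3, 4, 5), complete (2, 3, 5, 10) and average (2, 3, 4.5, 7.8) joining.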
17. Complete joining (furthest neighbour)

Distance matrix:

Object    1    2    3    4    5
1         -
2         2    -
3         6    5    -
4        10    9    4    -
5         9    8    5    3    -

Joining sequence:

Distance  Clusters
0         1, 2, 3, 4, 5
2         (1, 2), 3, 4, 5
3         (1, 2), 3, (4, 5)
5         (1, 2), (3, 4, 5)
10        (1, 2, 3, 4, 5)
18. Average joining

Distance matrix:

Object    1    2    3    4    5
1         -
2         2    -
3         6    5    -
4        10    9    4    -
5         9    8    5    3    -

Joining sequence:

Distance  Clusters
0         1, 2, 3, 4, 5
2         (1, 2), 3, 4, 5
3         (1, 2), 3, (4, 5)
4.5       (1, 2), (3, 4, 5)
7.8       (1, 2, 3, 4, 5)
19. Median joining

Distance matrix:

Object    1    2    3    4    5
1         -
2         2    -
3         6    5    -
4        10    9    4    -
5         9    8    5    3    -

Joining sequence:

Distance  Clusters
0         1, 2, 3, 4, 5
2         (1, 2), 3, 4, 5
3         (1, 2), 3, (4, 5)
3.75      (1, 2), (3, 4, 5)
5.44      (1, 2, 3, 4, 5)
20. Centroid joining

Distance matrix:

Object    1    2    3    4    5
1         -
2         2    -
3         6    5    -
4        10    9    4    -
5         9    8    5    3    -

Joining sequence:

Distance  Clusters
0         1, 2, 3, 4, 5
2         (1, 2), 3, 4, 5
3         (1, 2), 3, (4, 5)
3.75      (1, 2), (3, 4, 5)
6.00      (1, 2, 3, 4, 5)
21. Ward joining

Distance matrix:

Object    1    2    3    4    5
1         -
2         2    -
3         6    5    -
4        10    9    4    -
5         9    8    5    3    -

Joining sequence:

Distance  Clusters
0         1, 2, 3, 4, 5
2         (1, 2), 3, 4, 5
3         (1, 2), 3, (4, 5)
5         (1, 2), (3, 4, 5)
14.4      (1, 2, 3, 4, 5)
22. Important note!
- Centroid, average, median and Ward joining need not produce a strictly hierarchical tree with increasing joining distances, which can result in unattached branches in the cluster tree.
- If you encounter this problem, try another method!
[Figure: cluster tree with unattached branches]
23. Exclusive clustering methods II: partitioned clustering
- In partitioned clustering, the objective is to partition a set of N objects into a predetermined number k of clusters by maximizing the distance between cluster centres while minimizing the within-cluster variation.
24. Partitioned clustering: the procedure
- Choose k seed cases which are spread as far apart from the centre of all objects as possible.
- Assign all remaining objects to the nearest seed.
- Reassign objects so that the within-group sum of squares is reduced,
- and continue to do so until SSwithin is minimized.
[Figure: scatter of objects on X1 vs. X2 with three seeds]
25. K-means clustering
- A method of partitioned clustering whereby a set of k clusters is produced by minimizing SSwithin based on Euclidean distances.
- This is very much like a single-classification MANOVA with k groups, except that groups are not known a priori.
- Because k-means clustering does not search through every possible partitioning, it is always possible that there are other solutions yielding smaller SSwithin.
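The seed-assign-reassign procedure, with multiple random starts to guard against the local optima mentioned above, can be sketched in NumPy (Lloyd's algorithm; the blob data are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: three well-separated 2-D blobs of 20 objects each.
X = np.vstack([rng.normal(loc, 0.3, size=(20, 2))
               for loc in ([0, 0], [5, 0], [0, 5])])

def kmeans(X, k, seed, n_iter=100):
    """One run of Lloyd's algorithm: assign each object to its nearest
    centre, then move each centre to the mean of its members."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new = centres.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):              # guard against empty clusters
                new[j] = members.mean(axis=0)
        if np.allclose(new, centres):
            break
        centres = new
    ss_within = ((X - centres[labels]) ** 2).sum()
    return labels, centres, ss_within

# A single run can stop in a local optimum, so try several random seeds
# and keep the partition with the smallest SS_within.
labels, centres, ss_within = min((kmeans(X, 3, seed=s) for s in range(25)),
                                 key=lambda r: r[2])
```

Keeping the best of many restarts is the standard practical answer to the caveat on this slide; it still does not guarantee the global minimum of SSwithin.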
26. K-means partitioning: example
k = 2 clustering of 6 dog species
- Cluster means plots give z-scores for each variable used in clustering objects, with variables ordered by univariate F ratios.
- Zero indicates the mean of all objects.
- The more similar the profiles for objects within a cluster, the smaller the within-cluster heterogeneity.
27. K-means partitioning: example
k = 2 clustering of 6 dog species
- Cluster means plots give means for each variable used in clustering objects, with variables ordered by univariate F ratios.
- The dashed line indicates the mean of all objects.
- The greater the difference in group means, the greater the discriminating ability of the variable in question.
28. Some clustering distances

Distance metric  Description                                                            Data type
Gamma            computed as 1 - γ for each pair of objects                             ordinal, rank order
Pearson          1 - r for each pair of objects                                         quantitative
R2               1 - r^2 for each pair of objects                                       quantitative
Euclidean        normalized Euclidean distance                                          quantitative
Minkowski        pth root of mean pth-powered distance                                  quantitative
Chi-square       χ² measure of independence of rows and columns on 2 x N freq. tables   counts
MW               increment in SSwithin if object moved into a particular cluster        quantitative
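Several of these metrics are available through SciPy's `pdist` (an assumption; the lecture does not name software). The toy matrix is hypothetical and chosen so that correlation- and distance-based metrics disagree:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy data matrix (hypothetical): 3 objects measured on 4 variables.
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 4.0, 6.0, 8.0],
              [4.0, 3.0, 2.0, 1.0]])

eucl = squareform(pdist(X, metric='euclidean'))
mink = squareform(pdist(X, metric='minkowski', p=3))
corr = squareform(pdist(X, metric='correlation'))  # 1 - Pearson r

# Object 2 is a rescaled copy of object 1: perfectly correlated (1 - r = 0)
# yet distant in Euclidean terms; object 3 reverses the trend (1 - r = 2).
```

This illustrates the scale point from earlier: correlation-based metrics ignore the rescaling that Euclidean distance reacts to strongly.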
29. Exclusive non-hierarchical clustering: additive trees
- In additive tree clustering, the objective is to partition a set of N objects into a set of clusters represented by additive rather than hierarchical trees.
- For hierarchical trees, we assume that (1) all within-cluster distances are smaller than between-cluster distances, and (2) all within-cluster distances are the same. For additive trees, neither assumption need hold.
30. Additive trees
- In additive tree clustering, branch length can vary within clusters,
- and objects within clusters are compared by considering the sum of the branch lengths connecting them.
[Figure: hierarchical tree vs. additive tree for objects 1-5]
31. Additive trees: an example
- In additive tree clustering, branch length can vary within clusters,
- and objects within clusters are compared by considering the sum of the branch lengths connecting them.
[Figure: hierarchical tree vs. additive tree for objects 1-5]
32. Additive trees: joining

Distance matrix:

Object    1    2    3    4    5
1         -
2         2    -
3         6    5    -
4        10    9    4    -
5         9    8    5    3    -

Node table:

Node  Length  Child
1     1.5     Object 1
2     0.5     Object 2
6     4.0     (1, 2)
7     2.25    (4, 5)
8     0.25    (6, 3)

[Figure: additive tree with leaves 1-5 and internal nodes 6-9]

D(1,3) = 1.5 + 4.0 + 0.5 = 6.0
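The branch-length sum can be computed from the node table directly. A minimal sketch, assuming the topology above; object 3's own branch length (0.5) is not in the table fragment and is inferred from the worked sum D(1,3) = 1.5 + 4.0 + 0.5:

```python
# Path distance in an additive tree: sum of branch lengths along the path
# connecting two leaves. parent maps each node to its parent; length gives
# each node's branch length up to that parent. Object 3's length (0.5) is
# an assumption inferred from the worked sum on this slide.
parent = {'1': 'n6', '2': 'n6', '3': 'n8', 'n6': 'n8'}
length = {'1': 1.5, '2': 0.5, '3': 0.5, 'n6': 4.0}

def ancestors(leaf):
    """Map each ancestor of `leaf` to the summed branch length up to it."""
    chain, total, node = {}, 0.0, leaf
    while node in parent:
        total += length[node]
        node = parent[node]
        chain[node] = total
    return chain

def path_distance(a, b):
    """Branch-length sum from a to b via their lowest common ancestor."""
    ca, cb = ancestors(a), ancestors(b)
    return min(ca[n] + cb[n] for n in ca if n in cb)

print(path_distance('1', '3'))  # 6.0, matching D(1,3) on this slide
print(path_distance('1', '2'))  # 2.0, matching the original distance matrix
```

Note that D(1,2) recovers the observed distance exactly, which is the defining property of an additive tree fit.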
33. Deciding what to cluster and how to cluster them

Question: Am I interested in clustering objects, variables or both?
Decision: Choose object (row), variable (column) or both (matrix) clustering.

Question: Do I want strictly hierarchical clusters?
Decision: Yes: hierarchical trees. No: partitioned clusters (e.g. k-means) or additive trees.

Question: Are my variables quantitative?
Decision: Yes: quantitative metrics (e.g. Euclidean, Minkowski). No: non-quantitative metrics (e.g. γ, χ²).