Title: L10.1
1. Lecture 10: Cluster analysis
- Uses of cluster analysis
- Clustering methods
- Hierarchical
- Partitioned
- Additive trees
- Cluster distance metrics
2. Cluster analysis I: grouping objects
- Given a set of p variables X1, X2, ..., Xp, and a set of N objects, the task is to group the objects into classes so that objects within classes are more similar to one another than to members of other classes.
- Questions of interest: does the set of objects fall into a smaller set of natural groups? What are the relationships among different objects?
- Note: in most cases, clusters are not defined a priori.
3. Cluster analysis II: grouping variables
- Given a set of p variables X1, X2, ..., Xp, and a set of N objects, the task is to group the variables into classes so that variables within classes are more highly correlated with one another than with members of other classes.
- Questions of interest: does the set of variables fall into a smaller set of natural groups? What are the relationships among different variables?
4. Cluster analysis III: grouping objects and variables
- Given a set of p variables X1, X2, ..., Xp, and a set of N objects, the task is to group the objects and variables into classes so that variables and objects within classes are more highly correlated with one another than with members of other classes.
- Questions of interest: does the set of variable/object combinations fall into a smaller set of natural groups? What are the relationships among the different combinations?
5. The basic principle
- Objects that are similar to/highly correlated with one another should be in the same group, whereas objects that are dissimilar/uncorrelated should be in different groups.
- Thus, all cluster analyses begin with measures of similarity/dissimilarity among objects (distance matrices) or with correlation matrices.
6. Clustering objects
- Objects that are closer together based on
pairwise multivariate distances or pairwise
correlations are assigned to the same cluster,
whereas those farther apart or having low
pairwise correlations are assigned to different
clusters.
7. Clustering variables
- Variables that have high pairwise correlations
are assigned to the same cluster, whereas those
having low pairwise correlations are assigned to
different clusters.
8. Clustering objects and variables
- Object/variable combinations are classified into discrete categories determined by the magnitude of the corresponding entries in the original data matrix.
- This allows for easier visualization of object/variable clusters.
9. Types of clusters
- Exclusive: each object/variable belongs to one and only one cluster.
- Overlapping: an object or variable may belong to more than one cluster.
[Figure: exclusive clusters vs. overlapping clusters]
10. Scale considerations
- In general, correlation measures are not influenced by differences in scale, but distance measures (e.g. Euclidean distance) are affected.
- So, use distance measures when variables are measured on a common scale, or compute distance measures from standardized values when variables are not on the same scale.
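The effect of scale can be sketched in a few lines of NumPy (the data and the variable choices here are hypothetical, purely for illustration):

```python
import numpy as np

# Hypothetical measurements on two scales: body mass (g) and limb length (m).
X = np.array([[5000.0, 0.30],
              [5200.0, 0.35],
              [4800.0, 0.90]])

def euclid(a, b):
    return np.sqrt(((a - b) ** 2).sum())

# Raw distances are dominated by the large-scale variable (mass):
raw_12 = euclid(X[0], X[1])
raw_13 = euclid(X[0], X[2])   # nearly identical to raw_12 despite the
                              # threefold difference in limb length

# Standardizing each column (mean 0, sd 1) puts the variables on a common
# scale before distances are computed:
Z = (X - X.mean(axis=0)) / X.std(axis=0)
std_12 = euclid(Z[0], Z[1])
std_13 = euclid(Z[0], Z[2])   # now clearly the larger of the two
```

On the raw data the two distances are almost equal because mass swamps limb length; after standardization the third object stands apart, as it should.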
11. Exclusive clustering methods I: hierarchical clustering of objects
- Begins with calculation of distances/correlations among all pairs of objects,
- with groups being formed by agglomeration (lumping of objects).
- The end result is a dendrogram (tree) which shows the distances between pairs of objects.
12. Exclusive clustering methods I: hierarchical clustering of variables
- Begins with calculation of correlations/distances between all pairs of variables,
- with groups being formed by lumping of highly correlated variables.
- The end result is a dendrogram (tree) which shows the distances between pairs of variables.
[Figure: dendrogram of six variables (MOLARBR, MANDBRTH, MOLARL, MANDHT, MOLARS, MOLARS2) against a distance axis from 0 to 15]
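A dendrogram like the one above can be produced with SciPy (an assumption; the lecture does not name software). The variable names echo the figure, but the data here are random placeholders:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

rng = np.random.default_rng(1)

# Placeholder data: 30 objects by 6 variables; names echo the figure above,
# but the values are random stand-ins.
names = ['MOLARBR', 'MANDBRTH', 'MOLARL', 'MANDHT', 'MOLARS', 'MOLARS2']
X = rng.normal(size=(30, len(names)))

# Distance between variables = 1 - r; we cluster the variables, not objects.
D = 1.0 - np.corrcoef(X, rowvar=False)
np.fill_diagonal(D, 0.0)
Z = linkage(squareform(D, checks=False), method='average')

# dendrogram(Z, labels=names)  # draws the tree when matplotlib is present
```

With p variables the linkage matrix Z records the p - 1 successive merges and the distance at which each occurs.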
13. Hierarchical clustering of objects and variables
- A standardized data matrix is used to produce a two-dimensional colour/shading graph, with colour codes/shading intensities determined by the magnitude of the values in the original data matrix,
- which allows one to pick out similar objects and variables at a glance.
14. Hierarchical joining algorithms
- Single (nearest neighbour): distance between two clusters = distance between the two closest members of the two clusters.
- Complete (furthest neighbour): distance between two clusters = distance between the two most distant cluster members.
- Centroid: distance between two clusters = distance between the multivariate means (centroids) of each cluster.
[Figure: three clusters illustrating single, complete and centroid distances]
15. Hierarchical joining algorithms (cont'd)
- Average: distance between two clusters = average distance between all members of the two clusters.
- Median: distance between two clusters = median distance between all members of the two clusters.
- Ward: distance between two clusters = average distance between all members of the two clusters, with adjustment for covariances.
[Figure: three clusters illustrating the mean/median/adjusted mean of all pairwise distances]
16. Single joining (nearest neighbour)

Distance matrix:

Object    1    2    3    4    5
1         -
2         2    -
3         6    5    -
4        10    9    4    -
5         9    8    5    3    -

Joining sequence:

Distance  Clusters
0         1, 2, 3, 4, 5
2         (1, 2), 3, 4, 5
3         (1, 2), 3, (4, 5)
4         (1, 2), (3, 4, 5)
5         (1, 2, 3, 4, 5)
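The joining sequence can be checked with SciPy (an assumption; the lecture does not name software). The same distance matrix is reused for complete and average joining on the following slides, so those sequences are computed here as well:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Condensed form of the distance matrix above, in pdist order:
# (1,2),(1,3),(1,4),(1,5),(2,3),(2,4),(2,5),(3,4),(3,5),(4,5)
d = np.array([2, 6, 10, 9, 5, 9, 8, 4, 5, 3], dtype=float)

# Column 2 of the linkage matrix holds the distance at which each merge occurs.
single = linkage(d, method='single')[:, 2]      # nearest neighbour
complete = linkage(d, method='complete')[:, 2]  # furthest neighbour
average = linkage(d, method='average')[:, 2]

print(single)    # merges at distances 2, 3, 4, 5
print(complete)  # merges at distances 2, 3, 5, 10
```

The computed merge distances reproduce the joining tables for single (2, 3, 4, 5), complete (2, 3, 5, 10) and average (2, 3, 4.5, 7.8) joining.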
17. Complete joining (furthest neighbour)

Distance matrix:

Object    1    2    3    4    5
1         -
2         2    -
3         6    5    -
4        10    9    4    -
5         9    8    5    3    -

Joining sequence:

Distance  Clusters
0         1, 2, 3, 4, 5
2         (1, 2), 3, 4, 5
3         (1, 2), 3, (4, 5)
5         (1, 2), (3, 4, 5)
10        (1, 2, 3, 4, 5)
18. Average joining

Distance matrix:

Object    1    2    3    4    5
1         -
2         2    -
3         6    5    -
4        10    9    4    -
5         9    8    5    3    -

Joining sequence:

Distance  Clusters
0         1, 2, 3, 4, 5
2         (1, 2), 3, 4, 5
3         (1, 2), 3, (4, 5)
4.5       (1, 2), (3, 4, 5)
7.8       (1, 2, 3, 4, 5)
19. Median joining

Distance matrix:

Object    1    2    3    4    5
1         -
2         2    -
3         6    5    -
4        10    9    4    -
5         9    8    5    3    -

Joining sequence:

Distance  Clusters
0         1, 2, 3, 4, 5
2         (1, 2), 3, 4, 5
3         (1, 2), 3, (4, 5)
3.75      (1, 2), (3, 4, 5)
5.44      (1, 2, 3, 4, 5)
20. Centroid joining

Distance matrix:

Object    1    2    3    4    5
1         -
2         2    -
3         6    5    -
4        10    9    4    -
5         9    8    5    3    -

Joining sequence:

Distance  Clusters
0         1, 2, 3, 4, 5
2         (1, 2), 3, 4, 5
3         (1, 2), 3, (4, 5)
3.75      (1, 2), (3, 4, 5)
6.00      (1, 2, 3, 4, 5)
21. Ward joining

Distance matrix:

Object    1    2    3    4    5
1         -
2         2    -
3         6    5    -
4        10    9    4    -
5         9    8    5    3    -

Joining sequence:

Distance  Clusters
0         1, 2, 3, 4, 5
2         (1, 2), 3, 4, 5
3         (1, 2), 3, (4, 5)
5         (1, 2), (3, 4, 5)
14.4      (1, 2, 3, 4, 5)
22. Important note!
- Centroid, average, median and Ward joining need not produce a strictly hierarchical tree with increasing joining distances, which can result in unattached branches in the cluster tree.
- If you encounter this problem, try another method!
[Figure: cluster tree with unattached branches]
23. Exclusive clustering methods II: partitioned clustering
- In partitioned clustering, the objective is to partition a set of N objects into a predetermined number k of clusters by maximizing the distance between cluster centres while minimizing the within-cluster variation.
24. Partitioned clustering: the procedure
- Choose k seed cases which are spread as far apart from the centre of all objects as possible.
- Assign all remaining objects to the nearest seed.
- Reassign objects so that the within-group sum of squares is reduced,
- and continue to do so until SSwithin is minimized.
[Figure: scatter of objects on X1 vs. X2 with three seeds]
25. K-means clustering
- A method of partitioned clustering whereby a set of k clusters is produced by minimizing SSwithin based on Euclidean distances.
- This is very much like a single-classification MANOVA with k groups, except that groups are not known a priori.
- Because k-means clustering does not search through every possible partitioning, it is always possible that there are other solutions yielding smaller SSwithin.
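The seed-assign-reassign procedure, with multiple random starts to guard against the local optima mentioned above, can be sketched in NumPy (Lloyd's algorithm; the blob data are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: three well-separated 2-D blobs of 20 objects each.
X = np.vstack([rng.normal(loc, 0.3, size=(20, 2))
               for loc in ([0, 0], [5, 0], [0, 5])])

def kmeans(X, k, seed, n_iter=100):
    """One run of Lloyd's algorithm: assign each object to its nearest
    centre, then move each centre to the mean of its members."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        d = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new = centres.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):              # guard against empty clusters
                new[j] = members.mean(axis=0)
        if np.allclose(new, centres):
            break
        centres = new
    ss_within = ((X - centres[labels]) ** 2).sum()
    return labels, centres, ss_within

# A single run can stop in a local optimum, so try several random seeds
# and keep the partition with the smallest SS_within.
labels, centres, ss_within = min((kmeans(X, 3, seed=s) for s in range(25)),
                                 key=lambda r: r[2])
```

Keeping the best of many restarts is the standard practical answer to the caveat on this slide; it still does not guarantee the global minimum of SSwithin.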
26. K-means partitioning: example
k = 2 clustering of 6 dog species
- Cluster means plots give z-scores for each variable used in clustering objects, with variables ordered by univariate F ratios.
- Zero indicates the mean of all objects.
- The more similar the profiles for objects within a cluster, the smaller the within-cluster heterogeneity.
27. K-means partitioning: example
k = 2 clustering of 6 dog species
- Cluster means plots give means for each variable used in clustering objects, with variables ordered by univariate F ratios.
- The dashed line indicates the mean of all objects.
- The greater the difference in group means, the greater the discriminating ability of the variable in question.
28. Some clustering distances

Distance metric  Description                                                            Data type
Gamma            computed as 1 - γ for each pair of objects                             ordinal, rank order
Pearson          1 - r for each pair of objects                                         quantitative
R2               1 - r^2 for each pair of objects                                       quantitative
Euclidean        normalized Euclidean distance                                          quantitative
Minkowski        pth root of mean pth-powered distance                                  quantitative
Chi-square       χ² measure of independence of rows and columns on 2 x N freq. tables   counts
MW               increment in SSwithin if object moved into a particular cluster        quantitative
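Several of these metrics are available through SciPy's `pdist` (an assumption; the lecture does not name software). The toy matrix is hypothetical and chosen so that correlation- and distance-based metrics disagree:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Toy data matrix (hypothetical): 3 objects measured on 4 variables.
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 4.0, 6.0, 8.0],
              [4.0, 3.0, 2.0, 1.0]])

eucl = squareform(pdist(X, metric='euclidean'))
mink = squareform(pdist(X, metric='minkowski', p=3))
corr = squareform(pdist(X, metric='correlation'))  # 1 - Pearson r

# Object 2 is a rescaled copy of object 1: perfectly correlated (1 - r = 0)
# yet distant in Euclidean terms; object 3 reverses the trend (1 - r = 2).
```

This illustrates the scale point from earlier: correlation-based metrics ignore the rescaling that Euclidean distance reacts to strongly.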
29. Exclusive non-hierarchical clustering: additive trees
- In additive tree clustering, the objective is to partition a set of N objects into a set of clusters represented by additive rather than hierarchical trees.
- For hierarchical trees, we assume that (1) all within-cluster distances are smaller than between-cluster distances, and (2) all within-cluster distances are the same. For additive trees, neither assumption need hold.
30. Additive trees
- In additive tree clustering, branch length can vary within clusters,
- and objects within clusters are compared by considering the sum of the branch lengths connecting them.
[Figure: hierarchical tree vs. additive tree for objects 1-5]
31. Additive trees: an example
- In additive tree clustering, branch length can vary within clusters,
- and objects within clusters are compared by considering the sum of the branch lengths connecting them.
[Figure: hierarchical tree vs. additive tree for objects 1-5]
32. Additive trees: joining

Distance matrix:

Object    1    2    3    4    5
1         -
2         2    -
3         6    5    -
4        10    9    4    -
5         9    8    5    3    -

Node table:

Node  Length  Child
1     1.5     Object 1
2     0.5     Object 2
6     4.0     (1, 2)
7     2.25    (4, 5)
8     0.25    (6, 3)

[Figure: additive tree with leaves 1-5 and internal nodes 6-9]

D(1,3) = 1.5 + 4.0 + 0.5 = 6.0
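The branch-length sum can be computed from the node table directly. A minimal sketch, assuming the topology above; object 3's own branch length (0.5) is not in the table fragment and is inferred from the worked sum D(1,3) = 1.5 + 4.0 + 0.5:

```python
# Path distance in an additive tree: sum of branch lengths along the path
# connecting two leaves. parent maps each node to its parent; length gives
# each node's branch length up to that parent. Object 3's length (0.5) is
# an assumption inferred from the worked sum on this slide.
parent = {'1': 'n6', '2': 'n6', '3': 'n8', 'n6': 'n8'}
length = {'1': 1.5, '2': 0.5, '3': 0.5, 'n6': 4.0}

def ancestors(leaf):
    """Map each ancestor of `leaf` to the summed branch length up to it."""
    chain, total, node = {}, 0.0, leaf
    while node in parent:
        total += length[node]
        node = parent[node]
        chain[node] = total
    return chain

def path_distance(a, b):
    """Branch-length sum from a to b via their lowest common ancestor."""
    ca, cb = ancestors(a), ancestors(b)
    return min(ca[n] + cb[n] for n in ca if n in cb)

print(path_distance('1', '3'))  # 6.0, matching D(1,3) on this slide
print(path_distance('1', '2'))  # 2.0, matching the original distance matrix
```

Note that D(1,2) recovers the observed distance exactly, which is the defining property of an additive tree fit.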
33. Deciding what to cluster and how to cluster them

Question: Am I interested in clustering objects, variables or both?
Decision: Choose object (row), variable (column) or both (matrix) clustering.

Question: Do I want strictly hierarchical clusters?
Decision: Yes: hierarchical trees. No: partitioned clusters (e.g. k-means) or additive trees.

Question: Are my variables quantitative?
Decision: Yes: quantitative metrics (e.g. Euclidean, Minkowski). No: non-quantitative metrics (e.g. γ, χ²).