Lecture 10: Cluster analysis (slide transcript)
1
Lecture 10: Cluster analysis
  • Uses of cluster analysis
  • Clustering methods
    - Hierarchical
    - Partitioned
    - Additive trees
  • Cluster distance metrics

2
Cluster analysis I: grouping objects
  • Given a set of p variables X1, X2, ..., Xp, and a
    set of N objects, the task is to group the
    objects into classes so that objects within
    classes are more similar to one another than to
    members of other classes.
  • Questions of interest: does the set of objects
    fall into a smaller set of natural groups?
    What are the relationships among different
    objects?
  • Note: in most cases, clusters are not defined a
    priori.

3
Cluster analysis II: grouping variables
  • Given a set of p variables X1, X2, ..., Xp, and a
    set of N objects, the task is to group the
    variables into classes so that variables within
    classes are more highly correlated with one
    another than with members of other classes.
  • Questions of interest: does the set of variables
    fall into a smaller set of natural groups?
    What are the relationships among different
    variables? (A sketch of this setup follows.)
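
A minimal sketch of how grouping variables can be set up in Python. The data are hypothetical, and the choice of 1 - r as the variable-to-variable distance is one common convention, not something prescribed by the lecture:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# hypothetical data: 100 objects measured on 6 variables
X = np.random.default_rng(3).normal(size=(100, 6))

R = np.corrcoef(X, rowvar=False)   # 6 x 6 correlation matrix among variables
D = 1.0 - R                        # highly correlated variables -> small distance
np.fill_diagonal(D, 0.0)

# agglomerate the variables using the condensed form of the distance matrix
Z = linkage(squareform(D, checks=False), method='average')
```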

4
Cluster analysis III: grouping objects and
variables
  • Given a set of p variables X1, X2, ..., Xp, and a
    set of N objects, the task is to group the
    objects and variables into classes so that
    variables and objects within classes are more
    highly correlated with one another than with
    members of other classes.
  • Questions of interest: does the set of
    variable/object combinations fall into a
    smaller set of natural groups? What are the
    relationships among the different combinations?

5
The basic principle
  • Objects that are similar to/highly correlated
    with one another should be in the same group,
    whereas objects that are dissimilar/uncorrelated
    should be in different groups.
  • Thus, all cluster analyses begin with measures of
    similarity/dissimilarity among objects (distance
    matrices) or correlation matrices.

6
Clustering objects
  • Objects that are closer together based on
    pairwise multivariate distances or pairwise
    correlations are assigned to the same cluster,
    whereas those farther apart or having low
    pairwise correlations are assigned to different
    clusters.

7
Clustering variables
  • Variables that have high pairwise correlations
    are assigned to the same cluster, whereas those
    having low pairwise correlations are assigned to
    different clusters.

8
Clustering objects and variables
  • Object/variable combinations are classified into
    discrete categories determined by the magnitude
    of the corresponding entries in the original data
    matrix,
  • which allows for easier visualization of
    object/variable clusters.

9
Types of clusters
  • Exclusive: each object/variable belongs to one
    and only one cluster.
  • Overlapping: an object or variable may belong to
    more than one cluster.

[Figure: schematic examples of exclusive and overlapping clusters.]
10
Scale considerations
  • In general, correlation measures are not
    influenced by differences in scale, but distance
    measures (e.g. Euclidean distance) are affected.
  • So, use distance measures when variables are
    measured on common scales, or compute distance
    measures based on standardized values when
    variables are not on the same scale (see the
    sketch below).
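
As a concrete illustration of the second bullet, a minimal Python sketch (the data are hypothetical) showing how standardizing to z-scores puts variables measured in different units on a common footing before Euclidean distances are computed:

```python
import numpy as np
from scipy.spatial.distance import pdist

# hypothetical objects measured on two very differently scaled variables,
# e.g. height in cm and mass in kg
X = np.array([[170.0, 65.0],
              [180.0, 90.0],
              [165.0, 70.0]])

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # column z-scores

print(pdist(X))   # raw distances, dominated by the larger-scale variable
print(pdist(Z))   # distances after standardization
```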

11
Exclusive clustering methods I. Hierarchical
clustering of objects
  • Begins with calculation of distances/correlations
    among all pairs of objects,
  • with groups being formed by agglomeration
    (lumping of objects).
  • The end result is a dendrogram (tree) which shows
    the distances between pairs of objects (see the
    sketch below).
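
A minimal sketch of this workflow in Python using scipy. The data are hypothetical, and 'single' joining is just one of the algorithms discussed on the following slides:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# hypothetical data: 10 objects measured on 3 variables
X = np.random.default_rng(0).normal(size=(10, 3))

# agglomerative clustering on pairwise Euclidean distances
Z = linkage(X, method='single', metric='euclidean')

dendrogram(Z, labels=[f'obj{i + 1}' for i in range(10)])
plt.ylabel('Distance')
plt.show()
```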

12
Exclusive clustering methods I. Hierarchical
clustering of variables
  • Begins with calculation of correlations/distances
    between all pairs of variables,
  • with groups being formed by lumping of highly
    correlated variables.
  • The end result is a dendrogram (tree) which shows
    the distances between pairs of variables.

[Dendrogram of the variables MOLARBR, MANDBRTH, MOLARL, MANDHT, MOLARS, and MOLARS2; horizontal axis: distance, 0 to 15.]
13
Hierarchical clustering of objects and variables
  • The standardized data matrix is used to produce a
    two-dimensional colour/shading graph, with colour
    codes/shading intensities determined by the
    magnitude of the values in the original data
    matrix,
  • which allows one to pick out similar objects
    and variables at a glance (see the sketch below).
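
A brief sketch of such a colour/shading display with matplotlib (hypothetical data; the colour map choice is arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt

# hypothetical data: 12 objects by 5 variables, standardized by column
X = np.random.default_rng(4).normal(size=(12, 5))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

plt.imshow(Z, cmap='RdBu_r', aspect='auto')   # shade cells by magnitude
plt.xlabel('Variable')
plt.ylabel('Object')
plt.colorbar(label='z-score')
plt.show()
```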

14
Hierarchical joining algorithms
  • Single (nearest neighbour): distance between two
    clusters = distance between the two closest
    members of the two clusters.
  • Complete (furthest neighbour): distance between
    two clusters = distance between the two most
    distant cluster members.
  • Centroid: distance between two clusters =
    distance between the multivariate means
    (centroids) of the two clusters.

[Figure: three clusters, with arrows marking the single, complete, and centroid distances between them.]
15
Hierarchical joining algorithms (cont'd)
  • Average: distance between two clusters = average
    distance between all members of the two clusters.
  • Median: distance between two clusters = median
    distance between all members of the two clusters.
  • Ward: distance between two clusters = average
    distance between all members of the two clusters,
    with adjustment for covariances.

[Figure: two clusters with all pairwise distances drawn; the joining distance is the mean, median, or adjusted mean of these pairwise distances.]
16
Simple joining (nearest neighbour)

Distance matrix:
Object    1    2    3    4    5
  1       -
  2       2    -
  3       6    5    -
  4      10    9    4    -
  5       9    8    5    3    -

Joining sequence:
Distance   Clusters
  0        1, 2, 3, 4, 5
  2        (1, 2), 3, 4, 5
  3        (1, 2), 3, (4, 5)
  4        (1, 2), (3, 4, 5)
  5        (1, 2, 3, 4, 5)
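
The joining distances in this table can be reproduced with scipy; a small sketch (the condensed vector lists the upper triangle of the matrix row by row: d12, d13, d14, d15, d23, ...):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

d = np.array([2, 6, 10, 9,    # d(1,2) d(1,3) d(1,4) d(1,5)
              5, 9, 8,        # d(2,3) d(2,4) d(2,5)
              4, 5,           # d(3,4) d(3,5)
              3.0])           # d(4,5)

Z = linkage(d, method='single')
print(Z[:, 2])   # joining distances: [2. 3. 4. 5.], matching the table
```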
17
Complete joining (furthest neighbour)

Distance matrix:
Object    1    2    3    4    5
  1       -
  2       2    -
  3       6    5    -
  4      10    9    4    -
  5       9    8    5    3    -

Joining sequence:
Distance   Clusters
  0        1, 2, 3, 4, 5
  2        (1, 2), 3, 4, 5
  3        (1, 2), 3, (4, 5)
  5        (1, 2), (3, 4, 5)
  10       (1, 2, 3, 4, 5)
18
Average joining

Distance matrix:
Object    1    2    3    4    5
  1       -
  2       2    -
  3       6    5    -
  4      10    9    4    -
  5       9    8    5    3    -

Joining sequence:
Distance   Clusters
  0        1, 2, 3, 4, 5
  2        (1, 2), 3, 4, 5
  3        (1, 2), 3, (4, 5)
  4.5      (1, 2), (3, 4, 5)
  7.8      (1, 2, 3, 4, 5)
19
Median joining

Distance matrix:
Object    1    2    3    4    5
  1       -
  2       2    -
  3       6    5    -
  4      10    9    4    -
  5       9    8    5    3    -

Joining sequence:
Distance   Clusters
  0        1, 2, 3, 4, 5
  2        (1, 2), 3, 4, 5
  3        (1, 2), 3, (4, 5)
  3.75     (1, 2), (3, 4, 5)
  5.44     (1, 2, 3, 4, 5)
20
Centroid joining

Distance matrix:
Object    1    2    3    4    5
  1       -
  2       2    -
  3       6    5    -
  4      10    9    4    -
  5       9    8    5    3    -

Joining sequence:
Distance   Clusters
  0        1, 2, 3, 4, 5
  2        (1, 2), 3, 4, 5
  3        (1, 2), 3, (4, 5)
  3.75     (1, 2), (3, 4, 5)
  6.00     (1, 2, 3, 4, 5)
21
Ward joining

Distance matrix:
Object    1    2    3    4    5
  1       -
  2       2    -
  3       6    5    -
  4      10    9    4    -
  5       9    8    5    3    -

Joining sequence:
Distance   Clusters
  0        1, 2, 3, 4, 5
  2        (1, 2), 3, 4, 5
  3        (1, 2), 3, (4, 5)
  5        (1, 2), (3, 4, 5)
  14.4     (1, 2, 3, 4, 5)
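
For comparison across the worked examples, a sketch looping over scipy's joining methods on the same distance matrix. Note that 'single', 'complete', and 'average' reproduce the tables above exactly, whereas scipy's 'centroid', 'median', and 'ward' use squared-Euclidean update formulas, so those heights need not match the conventions used in these slides:

```python
from scipy.cluster.hierarchy import linkage

# condensed distance matrix: d12 d13 d14 d15 d23 d24 d25 d34 d35 d45
d = [2.0, 6.0, 10.0, 9.0, 5.0, 9.0, 8.0, 4.0, 5.0, 3.0]

for method in ('single', 'complete', 'average', 'centroid', 'median', 'ward'):
    Z = linkage(d, method=method)
    print(f'{method:8s}', Z[:, 2].round(2))
# single   -> [2. 3. 4. 5.]
# complete -> [2. 3. 5. 10.]
# average  -> [2. 3. 4.5 7.83]
```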
22
Important note!
  • Centroid, average, median and Ward joining need
    not produce a strictly hierarchical tree with
    monotonically increasing joining distances; such
    inversions can appear as unattached branches in
    the plotted cluster tree.
  • If you encounter this problem, try another
    joining method! (A small demonstration follows.)
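
A small demonstration of such a non-monotone (inverted) join with centroid linkage, on three hypothetical points forming a near-equilateral triangle:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# two points 2.0 apart, a third roughly 2.06 from each
pts = np.array([[0.0, 0.0],
                [2.0, 0.0],
                [1.0, 1.8]])

Z = linkage(pts, method='centroid')
print(Z[:, 2])   # approx [2.0, 1.8]: the second join is LOWER than the first
```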

23
Exclusive clustering methods II. Partitioned
clustering
  • In partitioned clustering, the objective is to
    partition a set of N objects into a predetermined
    number k of clusters by maximizing the distance
    between cluster centres while minimizing the
    within-cluster variation.

24
Partitioned clustering: the procedure
  • Choose k seed cases that are spread as far apart
    from the centre of all objects as possible.
  • Assign all remaining objects to the nearest seed.
  • Reassign objects so that the within-group sum of
    squares is reduced,
  • and continue to do so until SSwithin is
    minimized (see the sketch below).

[Figure: scatter of objects on variables X1 and X2, with three widely separated seed cases (Seed 1, Seed 2, Seed 3).]
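
A minimal numpy sketch of this procedure (Lloyd's algorithm with crude random seeding rather than maximally spread seeds; empty clusters are not handled, and the function name and data are illustrative):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]  # seed cases
    for _ in range(n_iter):
        # assign every object to its nearest centre
        labels = ((X[:, None, :] - centres) ** 2).sum(axis=2).argmin(axis=1)
        # recompute centres; stop once they no longer move
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centres):
            break
        centres = new
    ss_within = ((X - centres[labels]) ** 2).sum()
    return labels, centres, ss_within

X = np.random.default_rng(1).normal(size=(30, 2))   # hypothetical objects
labels, centres, ssw = kmeans(X, k=3)
print(ssw)
```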
25
K-means clustering
  • A method of partitioned clustering whereby a set
    of k clusters is produced by minimizing the
    SSwithin based on Euclidean distances.
  • Because k-means clustering does not search through
    every possible partitioning, it is always
    possible that there are other solutions yielding
    smaller SSwithin; running several random starts
    helps (see below).
  • This is very much like a single-classification
    MANOVA with k groups, except that groups are not
    known a priori.
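
In practice the standard remedy is multiple random starts, keeping the solution with the smallest SSwithin; e.g. scikit-learn's KMeans does this via its n_init parameter (a sketch with hypothetical data):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(2).normal(size=(50, 2))   # hypothetical objects

# n_init=10: run 10 random starts and keep the solution with the
# smallest within-cluster sum of squares (reported as inertia_)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.inertia_, km.labels_[:10])
```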

26
K-means partitioning example
k = 2 clustering of 6 dog species
  • Cluster means plots give z-scores for each
    variable used in clustering objects, with
    variables ordered by univariate F ratios.
  • Zero indicates the mean of all objects.
  • The more similar the profiles for objects within
    a cluster, the smaller the within-cluster
    heterogeneity.

27
K-means partitioning example
k = 2 clustering of 6 dog species
  • Cluster means plots give means for each variable
    used in clustering objects, with variables
    ordered by univariate F ratios.
  • The dashed line indicates the mean of all
    objects.
  • The greater the difference in group means, the
    greater the discriminating ability of the
    variable in question.

28
Some clustering distances

Metric      Description                                           Data type
Gamma       1 - γ, where γ is the gamma (rank-order) correlation  ordinal, rank order
Pearson     1 - r for each pair of objects                        quantitative
R²          1 - r² for each pair of objects                       quantitative
Euclidean   normalized Euclidean distance                         quantitative
Minkowski   pth root of the mean pth-powered distance             quantitative
χ²          χ² measure of independence of rows and columns        counts
            of 2 x N frequency tables
MW          increment in SSwithin if the object is moved          quantitative
            into a particular cluster
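
Several of these metrics are available in scipy.spatial.distance; a brief sketch with hypothetical data (scipy's 'correlation' metric is the 1 - r distance; the gamma and MW measures have no direct scipy equivalent):

```python
import numpy as np
from scipy.spatial.distance import pdist

X = np.random.default_rng(5).normal(size=(4, 6))   # 4 objects, 6 variables

print(pdist(X, metric='euclidean'))          # Euclidean distance
print(pdist(X, metric='minkowski', p=3))     # pth-root Minkowski, here p = 3
print(pdist(X, metric='correlation'))        # 1 - r for each pair of objects
```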
29
Exclusive non-hierarchical clustering: Additive
trees
  • In additive tree clustering, the objective is to
    partition a set of N objects into a set of
    clusters represented by additive rather than
    hierarchical trees.
  • For hierarchical trees, we assume that (1) all
    within-cluster distances are smaller than
    between-cluster distances, and (2) all
    within-cluster distances are the same. For
    additive trees, neither assumption need hold.

30
Additive trees
  • In additive tree clustering, branch length can
    vary within clusters,
  • and objects within clusters are compared by
    considering the sum of the branch lengths
    connecting them.

[Figure: five objects (1-5) drawn both as a hierarchical tree and as an additive tree.]
31
Additive trees: an example

[Figure: a worked example showing the same five objects as a hierarchical tree and as an additive tree.]
32
Additive trees: joining

Distance matrix:
Object    1    2    3    4    5
  1       -
  2       2    -
  3       6    5    -
  4      10    9    4    -
  5       9    8    5    3    -

Branch lengths:
Node   Length   Child
  1    1.5      Object 1
  2    0.5      Object 2
  6    4.0      (1, 2)
  7    2.25     (4, 5)
  8    0.25     (6, 3)

[Figure: additive tree joining objects 1-5 through internal nodes 6-9.]

The distance between two objects is the sum of the branch lengths along
the path connecting them, e.g.
D(1,3) = 1.5 + 4.0 + 0.5 = 6.0,
matching d(1,3) = 6 in the distance matrix.
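
A small sketch of this path-length computation, encoding each branch as child -> (parent, length). The node labels follow the table above; Object 3's branch length of 0.5 is inferred from the worked sum, since it does not appear in the surviving table:

```python
def ancestors(node, parent):
    # map each ancestor of `node` to the total branch length up to it
    out, total = {}, 0.0
    while node in parent:
        node, step = parent[node]
        total += step
        out[node] = total
    return out

def tree_distance(a, b, parent):
    anc_a, anc_b = ancestors(a, parent), ancestors(b, parent)
    # path length through the lowest common ancestor
    return min(anc_a[n] + anc_b[n] for n in anc_a.keys() & anc_b.keys())

# branches recoverable from the slide: child -> (parent node, branch length)
parent = {'Object1': ('node6', 1.5),
          'Object2': ('node6', 0.5),
          'Object3': ('node8', 0.5),   # inferred from D(1,3) = 6.0
          'node6':   ('node8', 4.0)}

print(tree_distance('Object1', 'Object3', parent))   # 1.5 + 4.0 + 0.5 = 6.0
```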
33
Deciding what to cluster and how to cluster them

Question: Am I interested in clustering objects, variables, or both?
Decision: Choose object (row), variable (column), or both (matrix) clustering.

Question: Do I want strictly hierarchical clusters?
Decision: If yes, hierarchical trees; if no, partitioned clusters (e.g. k-means) or additive trees.

Question: Are my variables quantitative?
Decision: If yes, quantitative metrics (e.g. Euclidean, Minkowski); if no, non-quantitative metrics (e.g. gamma, χ²).