Title: Cluster Analysis
1. Cluster Analysis
- Hal Whitehead
- BIOL4062/5062
2. Outline
- What is cluster analysis?
- Non-hierarchical cluster analysis
- K-means
- Hierarchical divisive cluster analysis
- Hierarchical agglomerative cluster analysis
- Linkage: single, complete, average, ...
- Cophenetic correlation coefficient
- Additive trees
- Problems with cluster analyses
3. Cluster Analysis
- Classification
- Maximize within cluster homogeneity
- (similar individuals within cluster)
- The Search for Discontinuities
- Discontinuities: places to put divisions between clusters
4. Discontinuities
- Discontinuities generally present
- taxonomy
- social organization
- community ecology??
5. Types of cluster analysis
- Uses a data, dissimilarity, or similarity matrix
- Non-hierarchical
- K-means
- Hierarchical
- Hierarchical divisive (repeated K-means)
- Hierarchical agglomerative
- single linkage, average linkage, ...
- Additive trees
6. Non-hierarchical Clustering Techniques: K-Means
- Uses a data matrix with Euclidean distances
- Maximizes between-cluster variance for a given number of clusters
- i.e., chooses clusters to maximize the F-ratio in a 1-way MANOVA
7. K-Means
- Works iteratively:
- 1. Choose the number of clusters
- 2. Assign points to clusters
- (randomly, or using some other clustering technique)
- 3. Move each point to the other clusters in turn: does between-cluster variance increase?
- 4. Repeat step 3 until no improvement is possible (see the sketch below)
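A minimal Python sketch of this loop (illustrative only; it uses the common "reassign every point to its nearest centroid" variant of step 3 rather than moving one point at a time, and the function name and random data are assumptions, not course material):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means on an n x p data matrix X, with Euclidean distances."""
    rng = np.random.default_rng(seed)
    # Steps 1-2: pick k and make an initial assignment (here: random centroids)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute centroids; stop when nothing moves
        new = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):          # keep the old centroid if a cluster empties
                new[j] = members.mean(axis=0)
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels, centroids

labels, centroids = kmeans(np.random.rand(50, 2), k=3)   # stand-in data
```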
8. K-means with three clusters
9. K-means with three clusters
Variable   Between SS   df   Within SS   df   F-ratio
X             0.536       2     0.007      7   256.163
Y             0.541       2     0.050      7    37.566
TOTAL         1.078       4     0.058     14
10. K-means with three clusters
Cluster 1 of 3 contains 4 cases

Members            Statistics
Case    Distance   Variable   Minimum   Mean   Maximum   St.Dev.
Case 1    0.02     X           0.41     0.45    0.49      0.04
Case 2    0.11     Y           0.03     0.19    0.27      0.11
Case 3    0.06
Case 4    0.05

Cluster 2 of 3 contains 4 cases

Members            Statistics
Case    Distance   Variable   Minimum   Mean   Maximum   St.Dev.
Case 7    0.06     X           0.11     0.15    0.19      0.03
Case 8    0.03     Y           0.61     0.70    0.77      0.07
Case 9    0.02
Case 10   0.06

Cluster 3 of 3 contains 2 cases

Members            Statistics
Case    Distance   Variable   Minimum   Mean   Maximum   St.Dev.
Case 5    0.01     X           0.77     0.77    0.78      0.01
Case 6    0.01     Y           0.33     0.35    0.36      0.02
11. Disadvantages of K-means
- Reaches an optimum, but not necessarily the global optimum
- Must choose the number of clusters before the analysis
- How many clusters? (see the sketch below)
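One hedged way to handle both issues (scikit-learn assumed available; data and parameter choices are illustrative): use many random restarts to reduce the risk of a poor local optimum, and scan the within-cluster sum of squares across candidate numbers of clusters, looking for an "elbow":

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(100, 2)   # stand-in data

for k in range(2, 8):
    # n_init random restarts guard (imperfectly) against local optima
    km = KMeans(n_clusters=k, n_init=50, random_state=0).fit(X)
    print(k, km.inertia_)    # within-cluster SS; look for an elbow as k grows
```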
12. Example: Sperm whale codas
- Patterned series of clicks
- [Figure: a 5-click coda, with inter-click intervals ic1, ic2, ic3, ic4]
- For 5-click codas: a 681 x 4 data set

13. 5-click codas
- [Figure: codas plotted on the inter-click intervals ic1, ic2, ic3, ic4]
- 93% of variance in 2 PCs

14. 5-click codas: K-means with 10 clusters
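A sketch of the pipeline this example implies, with random stand-in numbers in place of the real 681 x 4 matrix of inter-click intervals (the 2-PC reduction and the 10-cluster K-means follow the slides; everything else is an assumption):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

icis = np.random.rand(681, 4)   # stand-in for the measured ic1-ic4 values

# Reduce to 2 principal components (the coda data kept ~93% of the variance)
scores = PCA(n_components=2).fit_transform(icis)

# K-means with 10 clusters on the PC scores
labels = KMeans(n_clusters=10, n_init=50, random_state=0).fit_predict(scores)
```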
15. Hierarchical Cluster Analysis
- Usually represented by a dendrogram or tree-diagram
16. Hierarchical Cluster Analysis
- Hierarchical Divisive Cluster Analysis
- Hierarchical Agglomerative Cluster Analysis
17. Hierarchical Divisive Cluster Analysis
- Starts with all units in one cluster, and successively splits them
- Successive use of K-Means, or some other divisive technique, with n = 2
- Either: each time, split the cluster with the greatest sum of squared distances
- Or: split every cluster each time
- Hierarchical divisive methods are good techniques, but rarely used (see the sketch below)
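A minimal sketch of the "split every cluster each time" variant, implemented as repeated 2-means with scikit-learn (function name, stopping rule, and data are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive(X, min_size=3):
    # Stop splitting once a cluster is too small to divide sensibly
    if len(X) < min_size:
        return [X]
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    parts = [X[labels == 0], X[labels == 1]]
    if min(len(p) for p in parts) == 0:   # 2-means failed to split
        return [X]
    # Recurse: split each sub-cluster in turn
    return [c for p in parts for c in divisive(p, min_size)]

clusters = divisive(np.random.rand(40, 3))   # stand-in data
```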
18. Hierarchical Agglomerative Cluster Analysis
- Starts with each individual unit occupying its own cluster
- The clusters are then gradually merged until just one is left
- The most common cluster analyses
19. Hierarchical Agglomerative Cluster Analysis
- Works on a dissimilarity matrix
- (or a negative similarity matrix)
- may be Euclidean, Penrose, ... distances
- At each step:
- 1. There is a symmetric matrix of dissimilarities between clusters
- 2. The two clusters with the least dissimilarity are merged
- 3. The dissimilarity between the new (merged) cluster and all others is calculated
- Different techniques do step 3 in different ways (see the sketch below)
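In practice these steps are rarely hand-coded; a sketch with SciPy (an assumed dependency, with stand-in data), where the `method` argument is exactly what varies in step 3:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

X = np.random.rand(20, 3)   # stand-in data matrix
d = pdist(X)                # condensed Euclidean dissimilarity matrix

# Step 3 is set by `method`: 'single', 'complete', 'average',
# 'centroid', 'ward', ...
Z = linkage(d, method='average')
dendrogram(Z)               # draws the tree (requires matplotlib)
```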
20. Hierarchical Agglomerative Cluster Analysis

     A     B     C     D     E
A    0     .     .     .     .
B    0.35  0     .     .     .
C    0.45  0.67  0     .     .
D    0.11  0.45  0.57  0     .
E    0.22  0.56  0.78  0.19  0

A and D have the least dissimilarity (0.11), so they are merged:

     AD    B     C     E
AD   0     .     .     .
B    ?     0     .     .
C    ?     0.67  0     .
E    ?     0.56  0.78  0

How to calculate the new dissimilarities?
21. Hierarchical Agglomerative Cluster Analysis: Single Linkage

     A     B     C     D     E
A    0     .     .     .     .
B    0.35  0     .     .     .
C    0.45  0.67  0     .     .
D    0.11  0.45  0.57  0     .
E    0.22  0.56  0.78  0.19  0

     AD    B     C     E
AD   0     .     .     .
B    0.35  0     .     .
C    ?     0.67  0     .
E    ?     0.56  0.78  0

d(AD,B) = min{d(A,B), d(D,B)} = min{0.35, 0.45} = 0.35
22. Hierarchical Agglomerative Cluster Analysis: Complete Linkage

     A     B     C     D     E
A    0     .     .     .     .
B    0.35  0     .     .     .
C    0.45  0.67  0     .     .
D    0.11  0.45  0.57  0     .
E    0.22  0.56  0.78  0.19  0

     AD    B     C     E
AD   0     .     .     .
B    0.45  0     .     .
C    ?     0.67  0     .
E    ?     0.56  0.78  0

d(AD,B) = max{d(A,B), d(D,B)} = max{0.35, 0.45} = 0.45
23. Hierarchical Agglomerative Cluster Analysis: Average Linkage

     A     B     C     D     E
A    0     .     .     .     .
B    0.35  0     .     .     .
C    0.45  0.67  0     .     .
D    0.11  0.45  0.57  0     .
E    0.22  0.56  0.78  0.19  0

     AD    B     C     E
AD   0     .     .     .
B    0.40  0     .     .
C    ?     0.67  0     .
E    ?     0.56  0.78  0

d(AD,B) = mean{d(A,B), d(D,B)} = (0.35 + 0.45)/2 = 0.40
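The three update rules applied to the worked example above, as plain arithmetic (all values from the slides):

```python
# After merging A and D: d(A,B) = 0.35, d(D,B) = 0.45
d_AB, d_DB = 0.35, 0.45

print(min(d_AB, d_DB))       # single linkage   -> 0.35
print(max(d_AB, d_DB))       # complete linkage -> 0.45
print((d_AB + d_DB) / 2)     # average linkage  -> 0.40
```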
24. Hierarchical Agglomerative Cluster Analysis: Centroid Clustering (uses a data matrix, or a true distance matrix)

     V1    V2    V3
A    0.11  0.75  0.33
B    0.35  0.99  0.41
C    0.45  0.67  0.22
D    0.11  0.71  0.37
E    0.22  0.56  0.78
F    0.13  0.14  0.55
G    0.55  0.90  0.21

After merging A and D into a cluster at their centroid:

     V1    V2    V3
AD   0.11  0.73  0.35
B    0.35  0.99  0.41
C    0.45  0.67  0.22
E    0.22  0.56  0.78
F    0.13  0.14  0.55
G    0.55  0.90  0.21

V1(AD) = mean{V1(A), V1(D)}, and likewise for V2 and V3
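The centroid step as arithmetic, using the V1-V3 values from the slide (NumPy assumed; B is included to show how a new inter-cluster distance would be computed):

```python
import numpy as np

A = np.array([0.11, 0.75, 0.33])
D = np.array([0.11, 0.71, 0.37])
B = np.array([0.35, 0.99, 0.41])

AD = (A + D) / 2                # centroid of the merged cluster
print(AD)                       # [0.11 0.73 0.35], matching the slide
print(np.linalg.norm(AD - B))   # new Euclidean distance d(AD, B)
```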
25. Hierarchical Agglomerative Cluster Analysis: Ward's Method
- Minimizes the within-cluster sum of squares
- Similar to centroid clustering
26. Example: a similarity matrix

      1     2     4     5     9     11    12    14    15    19    20
1    1.00
2    0.00  1.00
4    0.53  0.00  1.00
5    0.18  0.05  0.00  1.00
9    0.22  0.09  0.13  0.25  1.00
11   0.36  0.00  0.17  0.40  0.33  1.00
12   0.00  0.37  0.18  0.00  0.13  0.00  1.00
14   0.74  0.00  0.30  0.20  0.23  0.17  0.00  1.00
15   0.53  0.00  0.30  0.00  0.36  0.00  0.26  0.56  1.00
19   0.00  0.00  0.17  0.21  0.43  0.32  0.29  0.09  0.09  1.00
20   0.04  0.00  0.17  0.00  0.14  0.10  0.35  0.00  0.18  0.25  1.00
28. Hierarchical Agglomerative Clustering Techniques
- Single Linkage
- Produces straggly clusters
- Not recommended if there is much experimental error
- Used in taxonomy
- Invariant to monotonic transformations of the dissimilarity measure
- Complete Linkage
- Produces tight clusters
- Not recommended if there is much experimental error
- Invariant to monotonic transformations of the dissimilarity measure
- Average Linkage, Centroid, Ward's
- Most likely to mimic input clusters
- Not invariant to transformations of the dissimilarity measure
29. Cophenetic Correlation Coefficient (CCC)
- Correlation between the original dissimilarity matrix and the dissimilarities inferred from the cluster analysis
- CCC > 0.8 indicates a good match
- CCC < 0.8: the dendrogram is not a good representation
- (and probably should not be displayed)
- Use the CCC to choose the best linkage method (highest coefficient; see the sketch below)
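A sketch of that comparison with SciPy's `cophenet` (stand-in data; the CCC values on the next slide come from the course data, not from this code):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

d = pdist(np.random.rand(20, 3))   # stand-in dissimilarities

for method in ['single', 'complete', 'average', 'centroid']:
    ccc, _ = cophenet(linkage(d, method=method), d)
    print(method, round(ccc, 2))   # keep the method with the highest CCC
```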
30. [Figure: four dendrograms of the same data, with CCC = 0.83, 0.77, 0.75, and 0.80]
31. Additive trees
- Dendrogram in which path lengths represent dissimilarities
- Computation is quite complex (a cross between agglomerative techniques and multidimensional scaling)
- Good when data are measured as similarities or dissimilarities
- Often used in taxonomy and genetics (a sketch follows the matrix below)

     A    B    C    D    E
A    .    .    .    .    .
B   14    .    .    .    .
C    6   12    .    .    .
D   81    7   13    .    .
E   17    1    6   16    .
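SciPy has no additive-tree routine; one hedged illustration uses neighbor-joining from scikit-bio (an assumed dependency, and a related additive-tree method rather than necessarily the one meant here), on the dissimilarity matrix above:

```python
from skbio import DistanceMatrix
from skbio.tree import nj

dm = DistanceMatrix([
    [ 0, 14,  6, 81, 17],
    [14,  0, 12,  7,  1],
    [ 6, 12,  0, 13,  6],
    [81,  7, 13,  0, 16],
    [17,  1,  6, 16,  0],
], list('ABCDE'))

tree = nj(dm)            # branch lengths approximate the input dissimilarities
print(tree.ascii_art())
```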
32. Problems with Cluster Analysis
- Are there really biologically meaningful clusters in the data?
- Does the dendrogram represent biological reality (web-of-life versus tree-of-life)?
- How many clusters to use?
- (stopping rules are arbitrary)
- Which method to use?
- (the best technique is data-dependent)
- Dendrograms become messy with many units
33. Social Structure of 160 northern bottlenose whales
34. Clustering Techniques

Type                         Technique           Use
Non-hierarchical             K-Means             Dividing data sets
Hierarchical divisive        Repeated K-means    Good technique on small data sets
Hierarchical agglomerative   Single linkage      Taxonomy
                             Complete linkage    Tighter clusters
                             Average linkage,
                             Centroid, Ward's    Usually preferred
Hierarchical                 Additive trees      Excellent for displaying similarity/
                                                 dissimilarity; taxonomy, genetics