Cluster Analysis - PowerPoint PPT Presentation

Slides: 28
Provided by: COMPUTA5

1
Cluster Analysis
  • Lecture Notes for Chapter 8

2
Review
  • Quiz: all clustering lectures, including today's
  • Problems with K-means
      • Initial centroid selection
          • Solutions?
      • Different shapes, densities, and sizes
          • Why?
          • Solution?
      • Empty clusters
          • Solution?
  • Unsupervised cluster validation
      • Reasons

3
Hierarchical Clustering
  • Produces a set of nested clusters organized as a
    hierarchical tree
  • Can be visualized as a dendrogram
      • A tree-like diagram that records the sequence of
        merges or splits

4
Strengths of Hierarchical Clustering
  • Do not have to assume any particular number of
    clusters
      • Any desired number of clusters can be obtained by
        cutting the dendrogram at the proper level
  • They may correspond to meaningful taxonomies
      • Examples in the biological sciences (e.g., animal
        kingdom, phylogeny reconstruction, ...)
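
As a rough illustration of cutting a dendrogram, the sketch below replays a recorded merge history until k clusters remain. The merge-list format (pairs of cluster ids, with each merged cluster receiving the next free id), the function name `cut_merges`, and the 5-point history are assumptions made for this example, not something taken from the slides.

```python
def cut_merges(n_points, merges, k):
    """Return k clusters by replaying the first n_points - k merges.

    merges[i] = (a, b) means clusters a and b were merged at step i and
    the new cluster received id n_points + i (an assumed convention).
    """
    if not 1 <= k <= n_points:
        raise ValueError("k must be between 1 and n_points")
    clusters = {i: [i] for i in range(n_points)}  # every point starts alone
    for step, (a, b) in enumerate(merges[: n_points - k]):
        clusters[n_points + step] = clusters.pop(a) + clusters.pop(b)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical merge history for 5 points (ids 0..4).
merges = [(0, 1), (3, 4), (2, 5), (6, 7)]
print(cut_merges(5, merges, 3))  # [[0, 1], [2], [3, 4]]
print(cut_merges(5, merges, 2))  # [[0, 1, 2], [3, 4]]
```

The same history yields any number of clusters from 1 to n, which is the point of not fixing k in advance.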

5
The Hyaluronidases: A Chemical, Biological and
Clinical Overview
  • Hyaluronidases are a group of neglected enzymes
    that have recently taken on greater significance
  • Alignments of selected hyaluronidases from
    various species were performed using amino acid
    sequences.

6
Hierarchical Clustering
  • Two main types of hierarchical clustering
      • Agglomerative (bottom-up)
          • Start with the points as individual clusters
          • At each step, merge the closest pair of clusters
            until only one cluster is (or k clusters are)
            left
      • Divisive (top-down)
          • Start with one, all-inclusive cluster
          • At each step, split a cluster until each cluster
            contains a single point (or there are k clusters)
  • Traditional hierarchical algorithms use a
    similarity or distance matrix (proximity matrix)
      • Merge or split one cluster at a time

7
Proximity Matrix
9
Agglomerative Clustering Algorithm
  • More popular hierarchical clustering technique
  • Basic algorithm is straightforward
      • Compute the proximity matrix
      • Let each data point be a cluster
      • Repeat
          • Merge the two closest clusters
          • Update the proximity matrix (?)
      • Until only a single cluster remains
  • Key operation is the computation of the proximity
    of two clusters
      • Different approaches to defining the distance
        between clusters distinguish the different
        algorithms
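
The loop above can be sketched in a few lines. This is a naive version under stated assumptions: the input is a symmetric point-to-point distance matrix, the `linkage` argument is Python's built-in `min` (single link) or `max` (complete link), and cluster distances are recomputed from the point distances at every step instead of updating a stored proximity matrix. The function name and the four 1-D points are made up for illustration.

```python
def agglomerative(dist, linkage=min, k=1):
    """Naive agglomerative clustering on a symmetric distance matrix.

    linkage reduces the point-to-point distances between two clusters to
    one number (min = single link, max = complete link). Returns the
    remaining clusters and the merge history.
    """
    clusters = [[i] for i in range(len(dist))]  # each point is a cluster
    history = []
    while len(clusters) > k:
        best = None
        # Find the closest pair of clusters under the chosen linkage.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(dist[p][q] for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        history.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        # j > i, so delete j first to keep index i valid.
        del clusters[j]
        del clusters[i]
        clusters.append(merged)
    return clusters, history


# Hypothetical 1-D points, used only to build a small distance matrix.
points = [0.0, 1.0, 5.0, 7.0]
D = [[abs(a - b) for b in points] for a in points]

clusters, history = agglomerative(D, linkage=min, k=2)
print(sorted(clusters))  # [[0, 1], [2, 3]]
for left, right, d in history:
    print(left, right, d)
```

Each pass scans every pair of clusters, so this version is only suitable for small inputs; its purpose is to make the repeat/merge/until structure visible.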

10
Starting Situation
  • Start with clusters of individual points and a
    proximity matrix

11
Intermediate Situation
  • After some merging steps, we have some clusters

12
Intermediate Situation
  • We want to merge the two closest clusters (C2 and
    C5) and update the proximity matrix.

13
After Merging
  • The question is: how do we update the proximity
    matrix?
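
For the MIN and MAX definitions, one possible update is to combine the two merged rows entry by entry: the distance from any other cluster to the merged cluster is the smaller (MIN) or larger (MAX) of its distances to the two parts. The sketch below assumes that rule; the function name and the distance values for C1..C5 are invented for illustration. Group average and centroid-based definitions need extra information (cluster sizes or centroids) and are not covered by this shortcut.

```python
def merge_clusters(labels, prox, a, b, combine=min):
    """Merge clusters a and b and return the updated labels and matrix.

    prox is a symmetric matrix of cluster distances. combine=min gives the
    single-link (MIN) update, combine=max the complete-link (MAX) update.
    """
    i, j = labels.index(a), labels.index(b)
    keep = [t for t in range(len(labels)) if t not in (i, j)]
    new_labels = [labels[t] for t in keep] + [a + " U " + b]
    # Distance from every surviving cluster to the merged cluster.
    to_merged = [combine(prox[t][i], prox[t][j]) for t in keep]
    new_prox = [[prox[r][c] for c in keep] + [to_merged[n]]
                for n, r in enumerate(keep)]
    new_prox.append(to_merged + [0.0])
    return new_labels, new_prox


labels = ["C1", "C2", "C3", "C4", "C5"]
# Hypothetical cluster distances; C2 and C5 are the closest pair.
prox = [
    [0.0, 0.4, 0.7, 0.9, 0.5],
    [0.4, 0.0, 0.6, 0.8, 0.1],
    [0.7, 0.6, 0.0, 0.3, 0.2],
    [0.9, 0.8, 0.3, 0.0, 0.75],
    [0.5, 0.1, 0.2, 0.75, 0.0],
]

new_labels, new_prox = merge_clusters(labels, prox, "C2", "C5", combine=min)
print(new_labels)    # ['C1', 'C3', 'C4', 'C2 U C5']
print(new_prox[-1])  # [0.4, 0.2, 0.75, 0.0]
```

Only the row and column of the merged cluster change; the distances among C1, C3, and C4 are copied over untouched.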

14
How to Define Inter-Cluster Similarity
  • MIN
  • MAX
  • Group Average
  • Distance Between Centroids
  • Other methods driven by an objective function
      • Ward's method uses squared error
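
To make the options concrete, here is a minimal sketch of each definition for clusters given as lists of coordinate tuples, assuming Euclidean distance. The function names and the two sample clusters are illustrative assumptions. For Ward's method the sketch uses a commonly quoted form, the increase in total squared error caused by a merge, written as |A||B| / (|A| + |B|) times the squared distance between centroids.

```python
from math import dist  # Euclidean distance, Python 3.8+


def pairwise(a, b):
    """All point-to-point distances between clusters a and b."""
    return [dist(p, q) for p in a for q in b]


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return tuple(sum(xs) / len(cluster) for xs in zip(*cluster))


def single_link(a, b):  # MIN
    return min(pairwise(a, b))


def complete_link(a, b):  # MAX
    return max(pairwise(a, b))


def group_average(a, b):
    d = pairwise(a, b)
    return sum(d) / len(d)


def centroid_distance(a, b):
    return dist(centroid(a), centroid(b))


def ward_increase(a, b):
    """Increase in total squared error caused by merging a and b."""
    return len(a) * len(b) / (len(a) + len(b)) * centroid_distance(a, b) ** 2


# Two hypothetical 2-D clusters.
A = [(0.0, 0.0), (0.0, 2.0)]
B = [(3.0, 0.0), (3.0, 2.0)]
for f in (single_link, complete_link, group_average,
          centroid_distance, ward_increase):
    print(f.__name__, f(A, B))
```

On the same pair of clusters the five definitions generally give different numbers, which is why the choice of definition changes which clusters get merged.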

20
Cluster Similarity: MIN or Single Link
  • Distance between two clusters is based on the two
    most similar (closest) points in the different
    clusters

21
Hierarchical Clustering: MIN
[Figure: nested clusters and dendrogram]
Dist({3,6}, {2,5}) = min(dist(3,2), dist(6,2), dist(3,5), dist(6,5))
                   = min(0.15, 0.25, 0.28, 0.39)
                   = 0.15
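
The same calculation can be checked in a couple of lines. The four distances are the ones quoted on the slide; the dictionary layout and the function name are assumptions made for this sketch.

```python
# Point-to-point distances quoted on the slide, keyed by (point, point).
d = {(3, 2): 0.15, (6, 2): 0.25, (3, 5): 0.28, (6, 5): 0.39}


def single_link(a, b, d):
    """MIN: distance between the two closest points of clusters a and b."""
    return min(d[(p, q)] for p in a for q in b)


print(single_link([3, 6], [2, 5], d))  # 0.15
```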
22
Cluster Similarity: MAX or Complete Linkage
  • Distance between two clusters is based on the two
    least similar (most distant) points in the
    different clusters
  • Note: the figure on this slide is wrong

23
Hierarchical Clustering: MAX
[Figure: nested clusters]
Dist({3,6}, {2,5}) = max(dist(3,2), dist(6,2), dist(3,5), dist(6,5))
                   = max(0.15, 0.25, 0.28, 0.39)
                   = 0.39
Dist({3,6}, {4})   = max(dist(3,4), dist(6,4))
                   = max(0.15, 0.22)
                   = 0.22
Dist({3,6}, {1})   = max(dist(3,1), dist(6,1))
                   = max(0.22, 0.23)
                   = 0.23
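
A short sketch of the comparison above: it computes the complete-link distance from cluster {3,6} to each of the three candidates and reports the closest one. The distances are the ones quoted on the slide; the data layout and names are assumptions, and only these three candidates are compared, not every pair of clusters in the data set.

```python
# Distances quoted on the slide, keyed by unordered point pair.
d = {
    frozenset({3, 2}): 0.15, frozenset({6, 2}): 0.25,
    frozenset({3, 5}): 0.28, frozenset({6, 5}): 0.39,
    frozenset({3, 4}): 0.15, frozenset({6, 4}): 0.22,
    frozenset({3, 1}): 0.22, frozenset({6, 1}): 0.23,
}


def complete_link(a, b):
    """MAX: distance between the two most distant points of a and b."""
    return max(d[frozenset({p, q})] for p in a for q in b)


base = (3, 6)
candidates = [(2, 5), (4,), (1,)]
scores = {c: complete_link(base, c) for c in candidates}
print(scores)                       # {(2, 5): 0.39, (4,): 0.22, (1,): 0.23}
print(min(scores, key=scores.get))  # (4,)
```

Of the three candidates, {4} has the smallest complete-link distance to {3,6}, even though point 2 is as close to point 3 as point 4 is; MAX judges a candidate by its worst pair.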
24
Strength of MIN
  • Can handle non-elliptical shapes

25
Limitations of MIN
  • Sensitive to noise and outliers

26
Strength of MAX
  • Less susceptible to noise and outliers

27
Limitations of MAX
  • Tends to break large clusters
  • Biased towards globular clusters