Title: Cluster Analysis
1. Cluster Analysis
2. Cluster Analysis
- What is Cluster Analysis?
- Types of Data in Cluster Analysis
- A Categorization of Major Clustering Methods
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Model-Based Clustering Methods
- Outlier Analysis
- Summary
3. Hierarchical Clustering
- Produces a set of nested clusters organized as a hierarchical tree
- Can be visualized as a dendrogram
  - A tree-like diagram that records the sequence of merges or splits
4. Strengths of Hierarchical Clustering
- Do not have to assume any particular number of clusters
  - Any desired number of clusters can be obtained by cutting the dendrogram at the proper level (see the sketch below)
- They may correspond to meaningful taxonomies
  - Examples in the biological sciences (e.g., the animal kingdom, phylogeny reconstruction)
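A minimal sketch of this idea using SciPy's hierarchical-clustering utilities; the toy data, the choice of average linkage, and the cluster counts are illustrative assumptions, not part of the slides. The same merge history Z is cut at different levels to obtain different numbers of clusters without re-running the clustering.

```python
# Build an agglomerative clustering once, then cut the dendrogram at
# different levels to obtain different numbers of clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster  # dendrogram() would plot the tree

rng = np.random.default_rng(0)
points = rng.random((20, 2))               # 20 random 2-D points (toy data)

Z = linkage(points, method="average")      # full merge history (the dendrogram)

labels_3 = fcluster(Z, t=3, criterion="maxclust")  # cut to get 3 clusters
labels_5 = fcluster(Z, t=5, criterion="maxclust")  # cut again to get 5 clusters
print(labels_3)
print(labels_5)
```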
5. Hierarchical Clustering
- Two main types of hierarchical clustering
  - Agglomerative
    - Start with the points as individual clusters
    - At each step, merge the closest pair of clusters until only one cluster (or k clusters) is left
  - Divisive
    - Start with one, all-inclusive cluster
    - At each step, split a cluster until each cluster contains a single point (or there are k clusters)
- Traditional hierarchical algorithms use a similarity or distance matrix
  - Merge or split one cluster at a time
6. Agglomerative Clustering Algorithm
- The more popular hierarchical clustering technique
- The basic algorithm is straightforward (see the sketch below)
  - Compute the proximity matrix
  - Let each data point be a cluster
  - Repeat
    - Merge the two closest clusters
    - Update the proximity matrix
  - Until only a single cluster remains
- The key operation is the computation of the proximity of two clusters
  - Different approaches to defining the distance between clusters distinguish the different algorithms
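A deliberately naive sketch of the basic algorithm above, assuming Euclidean distance and single-link (MIN) proximity between clusters; it is O(N^3) and meant only to mirror the listed steps, not to be an efficient implementation.

```python
import numpy as np

def agglomerative(points, k=1):
    # start with each data point as its own cluster
    clusters = [[i] for i in range(len(points))]
    # proximity matrix: pairwise Euclidean distances between points
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

    def cluster_dist(a, b):
        # single-link (MIN) proximity between two clusters
        return min(d[i, j] for i in a for j in b)

    while len(clusters) > k:
        # find the two closest clusters ...
        _, i, j = min((cluster_dist(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        # ... and merge them; the "proximity matrix update" is implicit here,
        # since cluster_dist recomputes distances from the merged member lists
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

pts = np.array([[0.0, 0], [0, 1], [5, 5], [5, 6], [10, 0]])
print(agglomerative(pts, k=3))   # -> [[0, 1], [2, 3], [4]]
```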
7. Starting Situation
- Start with clusters of individual points and a proximity matrix
[Figure: individual points as singleton clusters and the initial proximity matrix]
8. Intermediate Situation
- After some merging steps, we have some clusters
[Figure: clusters C1-C5 and their proximity matrix]
9. Intermediate Situation
- We want to merge the two closest clusters (C2 and C5) and update the proximity matrix
[Figure: clusters C1-C5 and their proximity matrix, with C2 and C5 about to be merged]
10. After Merging
- The question is: how do we update the proximity matrix?
[Figure: proximity matrix after merging C2 and C5, with the entries for C2 U C5 versus C1, C3, and C4 marked "?"]
11. How to Define Inter-Cluster Similarity
- MIN
- MAX
- Group Average
- Distance Between Centroids
- Other methods driven by an objective function
  - Ward's Method uses squared error
(The first four definitions are sketched in code below.)
[Figure: two clusters to be compared ("Similarity?") and the proximity matrix]
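A small sketch of the first four proximity definitions listed above, assuming clusters are given as NumPy arrays of points and Euclidean distance is used; the helper names are illustrative, not from the slides.

```python
import numpy as np

def pairwise(a, b):
    # matrix of Euclidean distances between every point of a and every point of b
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

def d_min(a, b):        # MIN / single link: closest pair of points
    return pairwise(a, b).min()

def d_max(a, b):        # MAX / complete link: most distant pair of points
    return pairwise(a, b).max()

def d_group_avg(a, b):  # group average: mean of all pairwise distances
    return pairwise(a, b).mean()

def d_centroid(a, b):   # distance between the cluster centroids
    return np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))

c1 = np.array([[0.0, 0.0], [1.0, 0.0]])
c2 = np.array([[4.0, 0.0], [6.0, 0.0]])
print(d_min(c1, c2), d_max(c1, c2), d_group_avg(c1, c2), d_centroid(c1, c2))
# -> 3.0 6.0 4.5 4.5
```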
16. Cluster Similarity: MIN or Single Link
- Similarity of two clusters is based on the two most similar (closest) points in the different clusters
- Determined by one pair of points, i.e., by one link in the proximity graph
17. Hierarchical Clustering: MIN
[Figure: nested clusters and the corresponding dendrogram produced by single-link clustering]
18. Strength of MIN
[Figure: original points and the resulting single-link clusters]
- Can handle non-elliptical shapes
19. Limitations of MIN
[Figure: original points and the resulting single-link clusters]
- Sensitive to noise and outliers
20. Cluster Similarity: MAX or Complete Linkage
- Similarity of two clusters is based on the two least similar (most distant) points in the different clusters
- Determined by all pairs of points in the two clusters
21. Hierarchical Clustering: MAX
[Figure: nested clusters and the corresponding dendrogram produced by complete-link clustering]
22. Strength of MAX
[Figure: original points and the resulting complete-link clusters]
- Less susceptible to noise and outliers
23. Limitations of MAX
[Figure: original points and the resulting complete-link clusters]
- Tends to break large clusters
- Biased towards globular clusters
24. Cluster Similarity: Group Average
- Proximity of two clusters is the average of the pairwise proximities between points in the two clusters
- Need to use the average (rather than total) connectivity for scalability, since total proximity favors large clusters
25. Hierarchical Clustering: Group Average
[Figure: nested clusters and the corresponding dendrogram produced by group-average clustering]
26. Hierarchical Clustering: Group Average
- Compromise between Single and Complete Link
- Strengths
  - Less susceptible to noise and outliers
- Limitations
  - Biased towards globular clusters
27. Cluster Similarity: Ward's Method
- Similarity of two clusters is based on the increase in squared error when the two clusters are merged (a standard form of this criterion is shown below)
  - Similar to group average if the distance between points is the squared distance
- Less susceptible to noise and outliers
- Biased towards globular clusters
- Hierarchical analogue of K-means
  - Can be used to initialize K-means
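For reference, the increase in squared error that Ward's method minimizes at each merge can be written in the following standard form, where |A| and |B| are the cluster sizes and mu denotes a cluster centroid; this is a standard statement of the criterion, not taken verbatim from the slides.

```latex
\Delta SSE(A, B) = \frac{|A|\,|B|}{|A| + |B|}\,\lVert \mu_A - \mu_B \rVert^2
```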
28. Hierarchical Clustering Comparison
[Figure: the same data set clustered with MIN, MAX, Group Average, and Ward's Method, side by side]
29. Hierarchical Clustering: Time and Space Requirements
- O(N^2) space, since it uses the proximity matrix
  - N is the number of points
- O(N^3) time in many cases
  - There are N steps, and at each step the proximity matrix, of size O(N^2), must be updated and searched
  - Complexity can be reduced to O(N^2 log N) time for some approaches
30. CURE (Clustering Using REpresentatives)
[Figure: data to be clustered vs. clusters generated by conventional methods (e.g., k-means, BIRCH)]
- CURE was proposed by Guha, Rastogi, and Shim, 1998
- Stops the creation of the cluster hierarchy when a level consists of k clusters
- Uses multiple representative points to evaluate the distance between clusters; adjusts well to arbitrarily shaped clusters and avoids the single-link effect
31. CURE: The Algorithm
- Draw a random sample s
- Partition the sample into p partitions, each of size s/p
- Partially cluster each partition into s/(pq) clusters
- Eliminate outliers
  - By random sampling
  - If a cluster grows too slowly, eliminate it
- Cluster the partial clusters
- Label the data on disk
32. CURE: Cluster Representation
- Uses a number of points to represent a cluster
- Representative points are found by selecting a constant number of points from a cluster and then shrinking them toward the center of the cluster (see the sketch below)
- Cluster similarity is the similarity of the closest pair of representative points from different clusters
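A minimal sketch of this representation, assuming Euclidean data in NumPy arrays; the "well-scattered" point selection is simplified here (points farthest from the centroid) rather than CURE's iterative farthest-point procedure, and the shrink factor alpha is an illustrative choice.

```python
import numpy as np

def representatives(cluster, n_rep=4, alpha=0.3):
    centroid = cluster.mean(axis=0)
    # simplified "well-scattered" choice: the points farthest from the centroid
    order = np.argsort(-np.linalg.norm(cluster - centroid, axis=1))
    reps = cluster[order[:n_rep]]
    # shrink the representatives toward the centroid by a factor alpha
    return reps + alpha * (centroid - reps)

def cure_distance(c1, c2, **kw):
    # inter-cluster distance: closest pair of representative points
    r1, r2 = representatives(c1, **kw), representatives(c2, **kw)
    return np.linalg.norm(r1[:, None, :] - r2[None, :, :], axis=-1).min()

a = np.random.default_rng(1).normal(0, 1, (50, 2))   # toy cluster around (0, 0)
b = np.random.default_rng(2).normal(6, 1, (50, 2))   # toy cluster around (6, 6)
print(cure_distance(a, b))
```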
33. CURE
- Shrinking representative points toward the center helps avoid problems with noise and outliers
- CURE is better able to handle clusters of arbitrary shapes and sizes
34. Experimental Results: CURE
[Figure: picture from CURE, Guha, Rastogi, and Shim]
35. Experimental Results: CURE
[Figure: CURE compared with centroid-based and single-link clustering; picture from CURE, Guha, Rastogi, and Shim]
36. CURE Cannot Handle Differing Densities
[Figure: original points vs. the clusters produced by CURE]
37. ROCK (RObust Clustering using linKs)
- Clustering algorithm for data with categorical and Boolean attributes
- A pair of points is defined to be neighbors if their similarity is greater than some threshold
- Uses a hierarchical clustering scheme to cluster the data:
  - Obtain a sample of points from the data set
  - Compute the link value for each set of points, i.e., transform the original similarities (computed by the Jaccard coefficient) into similarities that reflect the number of shared neighbors between points
  - Perform an agglomerative hierarchical clustering on the data, using the number of shared neighbors as the similarity measure and maximizing the shared-neighbors objective function
  - Assign the remaining points to the clusters that have been found
38. Clustering Categorical Data: The ROCK Algorithm
- ROCK: RObust Clustering using linKs
  - S. Guha, R. Rastogi, and K. Shim, ICDE'99
- Major ideas
  - Use links to measure similarity/proximity
  - Not distance-based
- Computational complexity
39. Similarity Measure in ROCK
- Traditional measures for categorical data may not work well, e.g., the Jaccard coefficient
- Example: two groups (clusters) of transactions
  - C1 <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}
  - C2 <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
- The Jaccard coefficient may lead to a wrong clustering result
  - Within C1 it ranges from 0.2 ({a, b, c} vs. {b, d, e}) to 0.5 ({a, b, c} vs. {a, b, d})
  - Between C1 and C2 it could be as high as 0.5 ({a, b, c} vs. {a, b, f})
- Jaccard-coefficient-based similarity function: sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2|
  - Ex.: let T1 = {a, b, c}, T2 = {c, d, e}; then sim(T1, T2) = |{c}| / |{a, b, c, d, e}| = 1/5 = 0.2
40. Link Measure in ROCK
- Links: the number of common neighbors
  - C1 <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e}, {a, c, d}, {a, c, e}, {a, d, e}, {b, c, d}, {b, c, e}, {b, d, e}, {c, d, e}
  - C2 <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g}, {b, f, g}
- Let T1 = {a, b, c}, T2 = {c, d, e}, T3 = {a, b, f}
  - link(T1, T2) = 4, since they have 4 common neighbors
    - {a, c, d}, {a, c, e}, {b, c, d}, {b, c, e}
  - link(T1, T3) = 3, since they have 3 common neighbors
    - {a, b, d}, {a, b, e}, {a, b, g}
- Thus link is a better measure than the Jaccard coefficient (see the sketch below)
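A small sketch that reproduces the link counts above, assuming a neighbor threshold of 0.5 on the Jaccard similarity; the threshold value and the helper names are assumptions made for this example.

```python
def jaccard(a, b):
    return len(a & b) / len(a | b)

C1 = [{'a','b','c'}, {'a','b','d'}, {'a','b','e'}, {'a','c','d'}, {'a','c','e'},
      {'a','d','e'}, {'b','c','d'}, {'b','c','e'}, {'b','d','e'}, {'c','d','e'}]
C2 = [{'a','b','f'}, {'a','b','g'}, {'a','f','g'}, {'b','f','g'}]
data = C1 + C2

def neighbors(t, data, theta=0.5):
    # neighbors of t: all other transactions with Jaccard similarity >= theta
    return [u for u in data if u != t and jaccard(t, u) >= theta]

def link(t1, t2, data, theta=0.5):
    n1 = neighbors(t1, data, theta)
    n2 = neighbors(t2, data, theta)
    return sum(1 for u in n1 if u in n2)   # number of common neighbors

T1, T2, T3 = {'a','b','c'}, {'c','d','e'}, {'a','b','f'}
print(link(T1, T2, data), link(T1, T3, data))   # -> 4 3
```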
41. CHAMELEON: Hierarchical Clustering Using Dynamic Modeling (1999)
- CHAMELEON, by G. Karypis, E. H. Han, and V. Kumar, 1999
- Measures the similarity based on a dynamic model
  - Two clusters are merged only if the interconnectivity and closeness (proximity) between the two clusters are high relative to the internal interconnectivity of the clusters and the closeness of items within the clusters
  - CURE ignores information about the interconnectivity of the objects; ROCK ignores information about the closeness of two clusters
- A two-phase algorithm
  - Use a graph partitioning algorithm to cluster objects into a large number of relatively small sub-clusters
  - Use an agglomerative hierarchical clustering algorithm to find the genuine clusters by repeatedly combining these sub-clusters
42. Overall Framework of CHAMELEON
[Figure: Data Set -> Construct Sparse Graph -> Partition the Graph -> Merge Partitions -> Final Clusters]
43. CHAMELEON (Clustering Complex Objects)
44. Cluster Analysis
- What is Cluster Analysis?
- Types of Data in Cluster Analysis
- A Categorization of Major Clustering Methods
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Model-Based Clustering Methods
- Outlier Analysis
- Summary
45. Density-Based Clustering Methods
- Clustering based on density (a local cluster criterion), such as density-connected points
- Major features
  - Discover clusters of arbitrary shape
  - Handle noise
  - One scan
  - Need density parameters as a termination condition
- Several interesting studies
  - DBSCAN: Ester et al. (KDD'96)
  - OPTICS: Ankerst et al. (SIGMOD'99)
  - DENCLUE: Hinneburg and Keim (KDD'98)
  - CLIQUE: Agrawal et al. (SIGMOD'98)
46. Density-Based Clustering: Background
- Eps-neighborhood of a point p: all points within distance Eps of p
  - N_Eps(p) = {q | dist(p, q) <= Eps}
- Two parameters
  - Eps: maximum radius of the neighborhood
  - MinPts: minimum number of points in an Eps-neighborhood of that point
- If the number of points in the Eps-neighborhood of p is at least MinPts, then p is called a core object (see the sketch below)
- Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
  - 1) p belongs to N_Eps(q)
  - 2) core point condition: |N_Eps(q)| >= MinPts
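A tiny sketch of the Eps-neighborhood and the core-point test just defined, for points stored in a NumPy array; the function names and the toy data are illustrative assumptions.

```python
import numpy as np

def eps_neighborhood(points, p_idx, eps):
    # indices of all points within distance eps of points[p_idx] (p itself included)
    dist = np.linalg.norm(points - points[p_idx], axis=1)
    return np.where(dist <= eps)[0]

def is_core(points, p_idx, eps, min_pts):
    # core-point condition: |N_Eps(p)| >= MinPts
    return len(eps_neighborhood(points, p_idx, eps)) >= min_pts

pts = np.array([[0.0, 0.0], [0.0, 0.5], [0.4, 0.2], [5.0, 5.0]])
print(eps_neighborhood(pts, 0, eps=1.0))    # -> [0 1 2]
print(is_core(pts, 0, eps=1.0, min_pts=3))  # -> True
```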
47. Density-Based Clustering: Background (II)
- Density-reachable
  - A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, ..., pn with p1 = q and pn = p such that p_{i+1} is directly density-reachable from p_i
- Density-connected
  - A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
[Figure: chains of points (q, p1, p) illustrating density-reachability and density-connectivity]
48. DBSCAN: Density-Based Spatial Clustering of Applications with Noise
- Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
- Discovers clusters of arbitrary shape in spatial databases with noise
49. DBSCAN: The Algorithm
- Arbitrarily select a point p
- Retrieve all points density-reachable from p w.r.t. Eps and MinPts
- If p is a core point, a cluster is formed
- If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database
- Continue the process until all of the points have been processed (see the sketch below)
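A compact sketch that follows these steps, assuming Euclidean distance and a brute-force neighborhood search (no spatial index); a label of -1 marks noise, and border points are picked up when a neighboring core point's cluster is expanded. The toy data and parameter values are illustrative assumptions.

```python
import numpy as np

def dbscan(points, eps, min_pts):
    n = len(points)
    labels = np.full(n, -1)                      # -1 = noise / not yet assigned
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    cluster_id = 0
    for p in range(n):
        if labels[p] != -1:                      # already assigned to a cluster
            continue
        neigh = np.where(d[p] <= eps)[0]
        if len(neigh) < min_pts:                 # not a core point: leave as noise
            continue                             # (may become a border point later)
        cluster_id += 1
        labels[p] = cluster_id
        seeds = list(neigh)
        while seeds:                             # expand the cluster from p
            q = seeds.pop()
            if labels[q] != -1:
                continue
            labels[q] = cluster_id               # q is density-reachable from p
            q_neigh = np.where(d[q] <= eps)[0]
            if len(q_neigh) >= min_pts:          # q is itself a core point
                seeds.extend(q_neigh)
    return labels

rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 0.3, (30, 2)),    # dense blob 1
                 rng.normal(3, 0.3, (30, 2)),    # dense blob 2
                 [[10.0, 10.0]]])                # an isolated noise point
print(dbscan(pts, eps=0.5, min_pts=4))
```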