Cluster Analysis - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Cluster Analysis
2
Cluster Analysis
  • What is Cluster Analysis?
  • Types of Data in Cluster Analysis
  • A Categorization of Major Clustering Methods
  • Partitioning Methods
  • Hierarchical Methods
  • Density-Based Methods
  • Grid-Based Methods
  • Model-Based Clustering Methods
  • Outlier Analysis
  • Summary

3
Hierarchical Clustering
  • Produces a set of nested clusters organized as a
    hierarchical tree
  • Can be visualized as a dendrogram
  • A tree-like diagram that records the sequences of
    merges or splits

4
Strengths of Hierarchical Clustering
  • Do not have to assume any particular number of
    clusters
  • Any desired number of clusters can be obtained by
    cutting the dendrogram at the proper level
  • They may correspond to meaningful taxonomies
  • Examples in the biological sciences (e.g., animal
    kingdom, phylogeny reconstruction, ...)

5
Hierarchical Clustering
  • Two main types of hierarchical clustering
  • Agglomerative
  • Start with the points as individual clusters
  • At each step, merge the closest pair of clusters
    until only one cluster (or k clusters) left
  • Divisive
  • Start with one, all-inclusive cluster
  • At each step, split a cluster until each cluster
    contains a single point (or there are k clusters)
  • Traditional hierarchical algorithms use a
    similarity or distance matrix
  • Merge or split one cluster at a time

6
Agglomerative Clustering Algorithm
  • More popular hierarchical clustering technique
  • Basic algorithm is straightforward
  • Compute the proximity matrix
  • Let each data point be a cluster
  • Repeat
  • Merge the two closest clusters
  • Update the proximity matrix
  • Until only a single cluster remains
  • Key operation is the computation of the proximity
    of two clusters
  • Different approaches to defining the distance
    between clusters distinguish the different
    algorithms (a short sketch of the basic procedure
    follows)
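
As a concrete illustration of the basic algorithm above, here is a minimal Python sketch using SciPy, which performs the proximity-matrix updates internally; the sample data, the choice of single link, and the cut into two clusters are made up for illustration, not taken from the slides.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D data set (illustrative only).
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])

# Agglomerative clustering: each point starts as its own cluster,
# and the two closest clusters are merged repeatedly until one remains.
# 'method' controls how inter-cluster proximity is defined.
Z = linkage(X, method='single')   # MIN / single link

# Cut the dendrogram to obtain any desired number of clusters, e.g. 2.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)   # e.g. [1 1 1 2 2 2]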

7
Starting Situation
  • Start with clusters of individual points and a
    proximity matrix

(Figure: individual points and their proximity matrix)
8
Intermediate Situation
  • After some merging steps, we have some clusters

(Figure: clusters C1-C5 and their proximity matrix)
9
Intermediate Situation
  • We want to merge the two closest clusters (C2 and
    C5) and update the proximity matrix.

(Figure: clusters C1-C5; C2 and C5 are the closest pair)
10
After Merging
  • After merging C2 and C5, the question is: how do we
    update the proximity matrix?

(Figure: updated proximity matrix; the rows and columns for
the merged cluster C2 ∪ C5 are marked with question marks)
11
How to Define Inter-Cluster Similarity
  • MIN
  • MAX
  • Group Average
  • Distance Between Centroids
  • Other methods driven by an objective function
  • Ward's Method uses squared error

(Figure: two clusters and their proximity matrix)
12-15
How to Define Inter-Cluster Similarity
  • (Slides 12-15 repeat the list above; each slide's
    figure highlights one of the definitions on the same
    proximity matrix.)
16
Cluster Similarity: MIN or Single Link
  • Similarity of two clusters is based on the two
    most similar (closest) points in the different
    clusters
  • Determined by one pair of points, i.e., by one
    link in the proximity graph.
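
To make the distance-based definitions concrete, here is a small NumPy/SciPy sketch (illustrative code, not from the slides; the two example clusters are made up) that computes the proximity between two clusters under MIN; the same function also covers the MAX (complete-link) and group-average definitions discussed on the following slides.

import numpy as np
from scipy.spatial.distance import cdist

def cluster_distance(A, B, how='min'):
    """Distance between clusters A and B (arrays of points).
    how='min' -> single link (closest pair)
    how='max' -> complete link (farthest pair)
    how='avg' -> group average (mean of all pairwise distances)"""
    D = cdist(A, B)              # all pairwise point-to-point distances
    if how == 'min':
        return D.min()
    if how == 'max':
        return D.max()
    return D.mean()

A = np.array([[0.0, 0.0], [0.0, 1.0]])
B = np.array([[3.0, 0.0], [4.0, 0.0]])
print(cluster_distance(A, B, 'min'))   # 3.0
print(cluster_distance(A, B, 'max'))   # ~4.123
print(cluster_distance(A, B, 'avg'))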

17
Hierarchical Clustering: MIN
(Figures: nested clusters and the corresponding dendrogram)
18
Strength of MIN
(Figure: original points and the resulting clusters)
  • Can handle non-elliptical shapes

19
Limitations of MIN
(Figure: original points and the resulting clusters)
  • Sensitive to noise and outliers

20
Cluster Similarity: MAX or Complete Linkage
  • Similarity of two clusters is based on the two
    least similar (most distant) points in the
    different clusters
  • Determined by all pairs of points in the two
    clusters

21
Hierarchical Clustering: MAX
(Figures: nested clusters and the corresponding dendrogram)
22
Strength of MAX
(Figure: original points and the resulting clusters)
  • Less susceptible to noise and outliers

23
Limitations of MAX
(Figure: original points and the resulting clusters)
  • Tends to break large clusters
  • Biased towards globular clusters

24
Cluster Similarity: Group Average
  • Proximity of two clusters is the average of
    pairwise proximity between points in the two
    clusters.
  • Need to use average connectivity for scalability
    since total proximity favors large clusters

25
Hierarchical Clustering: Group Average
(Figures: nested clusters and the corresponding dendrogram)
26
Hierarchical Clustering: Group Average
  • Compromise between Single and Complete Link
  • Strengths
  • Less susceptible to noise and outliers
  • Limitations
  • Biased towards globular clusters

27
Cluster Similarity: Ward's Method
  • Similarity of two clusters is based on the
    increase in squared error when two clusters are
    merged
  • Similar to group average if distance between
    points is distance squared
  • Less susceptible to noise and outliers
  • Biased towards globular clusters
  • Hierarchical analogue of K-means
  • Can be used to initialize K-means
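
Since the slide notes that Ward's method is the hierarchical analogue of K-means and can be used to initialize it, here is a hedged sketch of that idea; the data, k = 3, and the use of SciPy plus scikit-learn are illustrative assumptions, not part of the slides.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in (0.0, 3.0, 6.0)])

k = 3
# Ward linkage merges the pair of clusters whose merge gives the
# smallest increase in total within-cluster squared error.
Z = linkage(X, method='ward')
labels = fcluster(Z, t=k, criterion='maxclust')

# Use the Ward cluster centroids as initial centers for K-means.
centers = np.array([X[labels == c].mean(axis=0) for c in range(1, k + 1)])
km = KMeans(n_clusters=k, init=centers, n_init=1).fit(X)
print(km.cluster_centers_)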

28
Hierarchical Clustering: Comparison
(Figures: the same data set clustered with MIN, MAX, Group
Average, and Ward's Method)
29
Hierarchical Clustering: Time and Space Requirements
  • O(N²) space, since it uses the proximity matrix.
  • N is the number of points.
  • O(N³) time in many cases
  • There are N steps, and at each step the proximity
    matrix, of size N², must be updated and searched
  • Complexity can be reduced to O(N² log N) time
    for some approaches
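
To put these bounds in perspective (a back-of-the-envelope estimate, not from the slides): for N = 100,000 points the proximity matrix alone has N² = 10¹⁰ entries, roughly 80 GB at 8 bytes per entry, and a naive O(N³) run would be on the order of 10¹⁵ basic operations, which is why the basic algorithm is limited to fairly small data sets.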

30
CURE (Clustering Using REpresentatives)
(Figure: data to be clustered vs. clusters generated by
conventional methods, e.g., k-means, BIRCH)
  • CURE: proposed by Guha, Rastogi & Shim, 1998
  • Stops the creation of a cluster hierarchy if a
    level consists of k clusters
  • Uses multiple representative points to evaluate
    the distance between clusters, adjusts well to
    arbitrary shaped clusters and avoids single-link
    effect

31
CURE: The Algorithm
  • Draw a random sample s.
  • Partition the sample into p partitions, each of size s/p
  • Partially cluster each partition into s/(pq) clusters
  • Eliminate outliers
  • By random sampling
  • If a cluster grows too slowly, eliminate it.
  • Cluster the partial clusters.
  • Label the data on disk

32
CURE cluster representation
  • Uses a number of points to represent a cluster
  • Representative points are found by selecting a
    constant number of points from a cluster and then
    shrinking them toward the center of the cluster
  • Cluster similarity is the similarity of the
    closest pair of representative points from
    different clusters
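
The representative-point idea can be sketched in a few lines of Python. This is an illustrative sketch, not the reference CURE implementation: the parameter names c (number of representatives) and alpha (shrink factor), and the farthest-point selection rule, are assumptions consistent with the description above.

import numpy as np
from scipy.spatial.distance import cdist

def representatives(cluster, c=4, alpha=0.3):
    """Pick c well-scattered points and shrink them toward the centroid."""
    center = cluster.mean(axis=0)
    reps = [cluster[np.argmax(np.linalg.norm(cluster - center, axis=1))]]
    while len(reps) < min(c, len(cluster)):
        # next representative: the point farthest from those chosen so far
        d = cdist(cluster, np.array(reps)).min(axis=1)
        reps.append(cluster[np.argmax(d)])
    reps = np.array(reps)
    return reps + alpha * (center - reps)   # shrink toward the center

def cure_distance(A, B, **kw):
    """Cluster similarity: closest pair of (shrunken) representative points."""
    return cdist(representatives(A, **kw), representatives(B, **kw)).min()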

33
CURE
  • Shrinking representative points toward the center
    helps avoid problems with noise and outliers
  • CURE is better able to handle clusters of
    arbitrary shapes and sizes

34
Experimental Results: CURE
Picture from CURE, Guha, Rastogi, Shim.
35
Experimental Results: CURE
(Figures: clusters found by the centroid method and by
single link, compared with CURE)
Picture from CURE, Guha, Rastogi, Shim.
36
CURE Cannot Handle Differing Densities
(Figure: original points and the CURE clustering)
37
ROCK (RObust Clustering using linKs)
  • Clustering algorithm for data with categorical
    and Boolean attributes
  • A pair of points is defined to be neighbors if
    their similarity is greater than some threshold
  • Use a hierarchical clustering scheme to cluster
    the data.
  • Obtain a sample of points from the data set
  • Compute the link value for each set of points,
    i.e., transform the original similarities
    (computed by Jaccard coefficient) into
    similarities that reflect the number of shared
    neighbors between points
  • Perform an agglomerative hierarchical clustering
    on the data using the number of shared
    neighbors as similarity measure and maximizing
    the shared neighbors objective function
  • Assign the remaining points to the clusters that
    have been found

38
Clustering Categorical Data: The ROCK Algorithm
  • ROCK: RObust Clustering using linKs
  • S. Guha, R. Rastogi & K. Shim, ICDE'99
  • Major ideas
  • Use links to measure similarity/proximity
  • Not distance-based
  • Computational complexity

39
Similarity Measure in ROCK
  • Traditional measures for categorical data may not
    work well, e.g., Jaccard coefficient
  • Example: two groups (clusters) of transactions
  • C1. <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e},
    {a, c, d}, {a, c, e}, {a, d, e}, {b, c, d},
    {b, c, e}, {b, d, e}, {c, d, e}
  • C2. <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g},
    {b, f, g}
  • Jaccard coefficient may lead to a wrong clustering
    result
  • Within C1: ranges from 0.2 ({a, b, c}, {b, d, e})
    to 0.5 ({a, b, c}, {a, b, d})
  • Across C1 and C2: could be as high as 0.5
    ({a, b, c}, {a, b, f})
  • Jaccard-coefficient-based similarity function:
    sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2|
  • Ex. Let T1 = {a, b, c}, T2 = {c, d, e}:
    sim(T1, T2) = |{c}| / |{a, b, c, d, e}| = 1/5 = 0.2

40
Link Measure in ROCK
  • Links: number of common neighbors
  • C1 <a, b, c, d, e>: {a, b, c}, {a, b, d}, {a, b, e},
    {a, c, d}, {a, c, e}, {a, d, e}, {b, c, d},
    {b, c, e}, {b, d, e}, {c, d, e}
  • C2 <a, b, f, g>: {a, b, f}, {a, b, g}, {a, f, g},
    {b, f, g}
  • Let T1 = {a, b, c}, T2 = {c, d, e}, T3 = {a, b, f}
  • link(T1, T2) = 4, since they have 4 common
    neighbors
  • {a, c, d}, {a, c, e}, {b, c, d}, {b, c, e}
  • link(T1, T3) = 3, since they have 3 common
    neighbors
  • {a, b, d}, {a, b, e}, {a, b, g}
  • Thus link is a better measure than the Jaccard
    coefficient
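
The neighbor and link computations above can be reproduced with a short Python sketch. It is illustrative only: the threshold of 0.5 and the convention that two transactions are neighbors when their Jaccard similarity is at least the threshold (excluding a transaction from its own neighbor list) are assumptions chosen so that the counts match the example.

from itertools import combinations

# Transactions from the two groups in the example (sets of items).
C1 = [set(s) for s in ("abc", "abd", "abe", "acd", "ace",
                       "ade", "bcd", "bce", "bde", "cde")]
C2 = [set(s) for s in ("abf", "abg", "afg", "bfg")]
data = C1 + C2

def jaccard(s, t):
    return len(s & t) / len(s | t)

def neighbors(t, theta=0.5):
    """All other transactions whose Jaccard similarity with t is >= theta."""
    return [s for s in data if s != t and jaccard(s, t) >= theta]

def link(t1, t2, theta=0.5):
    """Number of common neighbors of t1 and t2."""
    n1 = neighbors(t1, theta)
    return sum(1 for s in neighbors(t2, theta) if s in n1)

T1, T2, T3 = set("abc"), set("cde"), set("abf")
print(jaccard(T1, T2))             # 0.2
print(link(T1, T2), link(T1, T3))  # 4 3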

41
CHAMELEON: Hierarchical Clustering Using Dynamic
Modeling (1999)
  • CHAMELEON: by G. Karypis, E. H. Han, and V. Kumar,
    1999
  • Measures the similarity based on a dynamic model
  • Two clusters are merged only if the
    interconnectivity and closeness (proximity)
    between two clusters are high relative to the
    internal interconnectivity of the clusters and
    closeness of items within the clusters
  • CURE ignores information about the interconnectivity
    of the objects, while ROCK ignores information about
    the closeness of two clusters
  • A two-phase algorithm
  • Phase 1: use a graph-partitioning algorithm to
    cluster objects into a large number of relatively
    small sub-clusters
  • Phase 2: use an agglomerative hierarchical clustering
    algorithm to find the genuine clusters by repeatedly
    combining these sub-clusters (a rough sketch of
    phase 1 follows)
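
As a hedged sketch of the first phase only, the snippet below builds the sparse k-nearest-neighbor graph with scikit-learn; the data, k = 10, and the 1/(1 + distance) similarity transform are assumptions for illustration, and CHAMELEON's actual partitioning and merge criteria are not reproduced here.

import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(1)
X = rng.random((200, 2))        # illustrative data

# Phase 1 input: a sparse graph connecting each object to its k nearest
# neighbors, with edge weights reflecting similarity.
k = 10
G = kneighbors_graph(X, n_neighbors=k, mode='distance')
G.data = 1.0 / (1.0 + G.data)   # turn distances into similarities

# A graph-partitioning algorithm (hMETIS in the original paper) would
# then split G into many small sub-clusters before the merge phase.
print(G.shape, G.nnz)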

42
Overall Framework of CHAMELEON
(Diagram: Data Set → Construct Sparse Graph → Partition the
Graph → Merge Partitions → Final Clusters)
43
CHAMELEON (Clustering Complex Objects)
44
Cluster Analysis
  • What is Cluster Analysis?
  • Types of Data in Cluster Analysis
  • A Categorization of Major Clustering Methods
  • Partitioning Methods
  • Hierarchical Methods
  • Density-Based Methods
  • Grid-Based Methods
  • Model-Based Clustering Methods
  • Outlier Analysis
  • Summary

45
Density-Based Clustering Methods
  • Clustering based on density (local cluster
    criterion), such as density-connected points
  • Major features
  • Discover clusters of arbitrary shape
  • Handle noise
  • One scan
  • Need density parameters as termination condition
  • Several interesting studies:
  • DBSCAN: Ester et al. (KDD'96)
  • OPTICS: Ankerst et al. (SIGMOD'99)
  • DENCLUE: Hinneburg & Keim (KDD'98)
  • CLIQUE: Agrawal et al. (SIGMOD'98)

46
Density-Based Clustering Background
  • Eps-neighborhood of a point p: all points within
    distance Eps of p
  • N_Eps(p) = {q | dist(p, q) ≤ Eps}
  • Two parameters
  • Eps: maximum radius of the neighbourhood
  • MinPts: minimum number of points in an
    Eps-neighbourhood of that point
  • If the number of points in the Eps-neighborhood of
    p is at least MinPts, then p is called a core
    object (core point).
  • Directly density-reachable: a point p is directly
    density-reachable from a point q wrt. Eps, MinPts
    if
  • 1) p belongs to N_Eps(q)
  • 2) core point condition: |N_Eps(q)| ≥ MinPts
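
The Eps-neighborhood and the core-point condition translate directly into code. This is a minimal NumPy sketch under the assumption of Euclidean distance; the function names region_query and is_core, and the values Eps = 0.2 and MinPts = 4, are illustrative choices, not from the slides.

import numpy as np

def region_query(X, i, eps):
    """Indices of all points within distance eps of point i (its Eps-neighborhood)."""
    return np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0]

def is_core(X, i, eps, min_pts):
    """p is a core point if its Eps-neighborhood contains at least MinPts points."""
    return len(region_query(X, i, eps)) >= min_pts

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [3.0, 3.0]])
print([is_core(X, i, eps=0.2, min_pts=4) for i in range(len(X))])
# -> [True, True, True, True, False]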

47
Density-Based Clustering Background (II)
  • Density-reachable:
  • A point p is density-reachable from a point q
    wrt. Eps, MinPts if there is a chain of points
    p1, ..., pn with p1 = q and pn = p such that
    p_{i+1} is directly density-reachable from p_i
  • Density-connected:
  • A point p is density-connected to a point q wrt.
    Eps, MinPts if there is a point o such that both
    p and q are density-reachable from o wrt. Eps and
    MinPts.

(Figures: chains of points p, p1, q illustrating
density-reachability and density-connectivity)
48
DBSCAN: Density-Based Spatial Clustering of
Applications with Noise
  • Relies on a density-based notion of cluster: a
    cluster is defined as a maximal set of
    density-connected points
  • Discovers clusters of arbitrary shape in spatial
    databases with noise

49
DBSCAN: The Algorithm
  • Arbitrarily select a point p
  • Retrieve all points density-reachable from p wrt.
    Eps and MinPts.
  • If p is a core point, a cluster is formed.
  • If p is a border point, no points are
    density-reachable from p, and DBSCAN visits the
    next point of the database.
  • Continue the process until all of the points have
    been processed (a minimal sketch of this procedure
    follows).
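
A compact Python sketch of this procedure is shown below. It is illustrative rather than an optimized implementation: Euclidean distance is assumed, noise points are labelled -1, and the example data and parameter values are made up.

import numpy as np

def dbscan(X, eps, min_pts):
    n = len(X)
    labels = np.full(n, -1)           # -1 = noise / unassigned
    visited = np.zeros(n, dtype=bool)
    cluster = 0

    def neighbors(i):
        return np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0]

    for p in range(n):
        if visited[p]:
            continue
        visited[p] = True
        seeds = neighbors(p)
        if len(seeds) < min_pts:      # p is not a core point: leave as noise for now
            continue
        cluster += 1
        labels[p] = cluster
        seeds = list(seeds)
        while seeds:                  # expand the cluster via density-reachable points
            q = seeds.pop()
            if labels[q] == -1:
                labels[q] = cluster   # border or core point joins the cluster
            if not visited[q]:
                visited[q] = True
                q_neighbors = neighbors(q)
                if len(q_neighbors) >= min_pts:   # q is also a core point
                    seeds.extend(q_neighbors)
    return labels

X = np.vstack([np.random.default_rng(0).normal(c, 0.2, (30, 2)) for c in (0, 3)])
print(np.unique(dbscan(X, eps=0.5, min_pts=5)))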