1
CHAMELEON: A Hierarchical Clustering Algorithm
Using Dynamic Modeling
  • Paper presentation in data mining class
  • Presenter: ??? ???
  • Date: 2001/12/18

2
About this paper
  • Department of Computer Science and Engineering,
    University of Minnesota
  • George Karypis
  • Eui-Hong (Sam) Han
  • Vipin Kumar
  • IEEE Computer, August 1999

3
Outline
  • Problem definition
  • Main algorithm
  • Key features of CHAMELEON
  • Experiments and related work
  • Conclusion and discussion

4
Problem definition
  • Clustering
  • Intra-cluster similarity is maximized
  • Inter-cluster similarity is minimized
  • Problems of existing clustering algorithms
  • Static model constraints
  • Break down when clusters are of diverse shapes,
    densities, and sizes
  • Susceptible to noise, outliers, and artifacts

5
Static model constraints
  • Data space constraint
  • K-means, PAM, etc.
  • Suitable only for data in metric spaces
  • Cluster shape constraint
  • K-means, PAM, CLARANS
  • Assume clusters are ellipsoidal or globular and
    of similar sizes
  • Cluster density constraint
  • DBSCAN
  • Points within a genuine cluster are
    density-reachable, and points across different
    clusters are not
  • Similarity-determination constraint
  • CURE, ROCK
  • Use a static model to determine the most similar
    clusters to merge

6
Partitioning technique problem
7
Hierarchical technique problem (1/2)
  • Clusters (c) and (d) will be chosen to merge when
    we consider only closeness

8
Hierarchical technique problem (2/2)
  • Clusters (a) and (c) will be chosen to merge when
    we consider only inter-connectivity

9
Main algorithm
  • Two-phase algorithm
  • PHASE I
  • Use a graph-partitioning algorithm to cluster the
    data items into a large number of relatively
    small sub-clusters.
  • PHASE II
  • Use an agglomerative hierarchical clustering
    algorithm to find the genuine clusters by
    repeatedly combining these sub-clusters, as
    sketched below.
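To make the two-phase structure concrete, here is a minimal Python sketch; build_knn_graph, partition, and similarity are hypothetical placeholders for the components described on the following slides, not the paper's own code.

```python
# High-level sketch of CHAMELEON's two phases. Every helper passed in
# is a hypothetical placeholder for a component detailed on later slides.

def chameleon(points, k, min_size, n_clusters,
              build_knn_graph, partition, similarity):
    # PHASE I: build the sparse k-NN graph, then split it into
    # many small sub-clusters.
    graph = build_knn_graph(points, k)
    clusters = partition(graph, min_size)     # list of sets of point ids

    # PHASE II: agglomerative merging driven by the dynamic model.
    while len(clusters) > n_clusters:
        # Pick the pair with the highest combined similarity (RI, RC).
        pairs = [(i, j) for i in range(len(clusters))
                        for j in range(i + 1, len(clusters))]
        i, j = max(pairs, key=lambda p: similarity(clusters[p[0]],
                                                   clusters[p[1]]))
        clusters[i] = clusters[i] | clusters[j]   # merge the two sets
        del clusters[j]
    return clusters
```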

10
Framework

11
Key features of CHAMELEON
  • Modeling the data
  • Modeling the cluster similarity
  • Partitioning algorithm
  • Merge schemes

12
Terms
  • Parameters needed
  • k
  • For the k-nearest-neighbor graph
  • MINSIZE
  • The minimum size of an initial sub-cluster
  • TRI
  • Threshold of relative inter-connectivity
  • TRC
  • Threshold of relative closeness
  • α
  • Coefficient weighting RI against RC

13
Modeling the data
  • k-nearest-neighbor graph approach
  • Advantages
  • Data points that are far apart are completely
    disconnected in Gk
  • Gk captures the concept of neighborhood
    dynamically
  • The edge weights of dense regions in Gk tend to
    be large, and the edge weights of sparse regions
    tend to be small
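A minimal sketch of building such a graph with scikit-learn; the similarity weighting used here is an illustrative choice, not the paper's exact edge-weight definition.

```python
# Minimal sketch: a symmetric k-nearest-neighbor graph Gk built with
# scikit-learn. Edges inside dense regions get large weights, edges
# in sparse regions small ones.
import numpy as np
from sklearn.neighbors import kneighbors_graph

def knn_graph(X, k):
    # Sparse matrix: entry (i, j) is the distance from point i to
    # point j whenever j is among i's k nearest neighbors.
    D = kneighbors_graph(X, n_neighbors=k, mode='distance')
    # Convert distances to similarities (illustrative choice).
    D.data = 1.0 / (1.0 + D.data)
    # Symmetrize: keep an edge if either endpoint selected the other.
    return D.maximum(D.T)

X = np.random.rand(100, 2)   # 100 random 2-D points
Gk = knn_graph(X, k=5)
```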

14
Example of k-nearest neighbor graph
15
Modeling the cluster similarity (1/2)
  • Relative inter-connectivity
  • Relative closeness (formulas below)
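The definitions behind these two bullets were formula images in the original slides. Reconstructed from the paper, where EC_{Ci,Cj} is the set of edges connecting the two sub-clusters, EC_{Ci} is the min-cut bisector of Ci, and S-bar denotes the average weight of the corresponding edges:

```latex
RI(C_i, C_j) = \frac{\left|EC_{\{C_i, C_j\}}\right|}
                    {\frac{1}{2}\left(\left|EC_{C_i}\right| + \left|EC_{C_j}\right|\right)}

RC(C_i, C_j) = \frac{\bar{S}_{EC_{\{C_i, C_j\}}}}
                    {\frac{|C_i|}{|C_i| + |C_j|}\,\bar{S}_{EC_{C_i}}
                     + \frac{|C_j|}{|C_i| + |C_j|}\,\bar{S}_{EC_{C_j}}}
```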

16
Modeling the cluster similarity (2/2)
  • If both relative measures are considered, (c) and
    (d) will be merged

17
Partitioning algorithm (PHASE I)
  • What
  • Finding the initial sub-clusters
  • Why
  • RI and RC cannot be accurately calculated for
    clusters containing only a few data points
  • How
  • Utilize a multilevel graph-partitioning algorithm
    (hMETIS)
  • Coarsening phase
  • Partitioning phase
  • Uncoarsening phase

18
Partitioning algorithm (cont.)
  • Initially
  • All points belong to the same cluster
  • Repeat until the size of every cluster is less
    than MINSIZE
  • Select the largest cluster and use hMETIS to
    bisect it (see the sketch below)
  • Post scriptum
  • Balance constraint
  • Split Ci into CiA and CiB such that each
    sub-cluster contains at least 25% of the nodes
    of Ci
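A minimal sketch of this loop, assuming a hypothetical hmetis_bisect(graph, nodes) wrapper around the hMETIS tool that returns a balanced bisection (each side keeping at least 25% of the nodes); hMETIS itself is a standalone program, not a Python library.

```python
# Sketch of Phase I: repeatedly bisect the largest sub-cluster until
# every sub-cluster has fewer than min_size points.

def phase_one(graph, all_nodes, min_size, hmetis_bisect):
    clusters = [list(all_nodes)]              # initially one cluster
    while max(len(c) for c in clusters) >= min_size:
        clusters.sort(key=len)
        largest = clusters.pop()              # the largest cluster
        part_a, part_b = hmetis_bisect(graph, largest)
        clusters.extend([part_a, part_b])
    return clusters
```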

20
Merge schemes (Phase II)
  • What
  • Merging sub-clusters using a dynamic framework
  • How
  • Finding and merging the pair of sub-clusters that
    are the most similar
  • Scheme 1
  • Scheme 2 (both criteria are reconstructed below)
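The criteria for the two schemes were formula images in the original deck. Reconstructed from the paper: Scheme 1 merges any pair of sub-clusters clearing both thresholds, while Scheme 2 merges the pair maximizing a combined function in which α (the coefficient from slide 12) trades closeness off against inter-connectivity.

```latex
% Scheme 1: merge C_i and C_j if
RI(C_i, C_j) \ge T_{RI} \quad\text{and}\quad RC(C_i, C_j) \ge T_{RC}

% Scheme 2: merge the pair maximizing
RI(C_i, C_j) \cdot RC(C_i, C_j)^{\alpha}
```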

21
Experiments and related work
  • Introduction of CURE
  • Introduction of DBSCAN
  • Results of experiments
  • Performance analysis

22
Introduction of CURE (1/n)
  • Clustering Using REpresentatives
  • 1. Properties
  • Fits non-spherical shapes.
  • Shrinking helps dampen the effects of outliers.
  • Multiple representative points are chosen to
    capture non-spherical cluster shapes
  • In each merge iteration, the chosen
    well-scattered points are shrunk toward the
    cluster centroid by a fixed ratio (see the
    sketch below)
  • Random sampling of the data set makes it suitable
    for large databases
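A minimal sketch of the shrinking step, assuming numpy; the function name is illustrative.

```python
# Sketch of CURE's representative-point shrinking: each well-scattered
# point moves toward the cluster centroid by the shrink factor s
# (s = 0 keeps the points in place, s = 1 collapses them to the centroid).
import numpy as np

def shrink_representatives(scattered_points, s):
    centroid = scattered_points.mean(axis=0)
    return scattered_points + s * (centroid - scattered_points)

reps = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 3.0]])
print(shrink_representatives(reps, s=0.5))
```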

23
Introduction of CURE (2/n)
  • 2. Drawbacks
  • The partitioning method cannot guarantee that the
    chosen data points are good representatives.
  • Clustering accuracy with respect to the
    parameters below
  • (1) Shrink factor s: CURE always finds the right
    clusters for s values ranging from 0.2 to 0.7.
  • (2) Number of representative points c: CURE
    always finds the right clusters for values of c
    greater than 10.
  • (3) Number of partitions p: with as many as 50
    partitions, CURE always discovers the desired
    clusters.
  • (4) Random sample size r:
  • (a) for sample sizes up to 2000, the clusters
    found are of poor quality
  • (b) from 2500 sample points and above (about 2.5%
    of the data set size), CURE always finds the
    clusters correctly.

24
  • 3. Clustering algorithm: representative points

25
  • Merge procedure

26
Introduction of DBSCAN (1/n)
  • Density-Based Spatial Clustering of Applications
    with Noise
  • 1. Properties
  • Can discover clusters of arbitrary shape.
  • Each cluster has a typical density of points that
    is higher than outside the cluster.
  • The density within the areas of noise is lower
    than the density in any of the clusters.
  • Only the MinPts parameter needs to be input
  • Easy to implement in the C language using an
    R-tree
  • Runtime grows near-linearly with the number of
    points; the time complexity is O(n log n)

27
Introduction of DBSCAN (2/n)
  • 2. Drawbacks
  • Cannot be applied to polygons.
  • Cannot be applied to high-dimensional feature
    spaces.
  • Cannot process the shape of the k-dist graph with
    multiple features.
  • Does not fit large databases, because no method
    is applied to reduce the spatial database.
  • 3. Definitions
  • Eps-neighborhood of a point p:
    NEps(p) = { q ∈ D | dist(p, q) ≤ Eps }
  • Each cluster contains at least MinPts points

28
Introduction of DBSCAN (3/n)
  • 4. p is directly density-reachable from q iff
  • (1) p ∈ NEps(q), and
  • (2) |NEps(q)| ≥ MinPts (core point condition)
  • Direct density-reachability is symmetric when p
    and q are both core points, but asymmetric when
    one is a core point and the other a border point.
  • 5. p is density-reachable from q if there is a
    chain of points p1, ..., pn with p1 = q and
    pn = p such that each pi+1 is directly
    density-reachable from pi
  • Density-reachability is transitive, but not
    symmetric in general
  • Density-reachability is symmetric for core points.

29
Introduction of DBSCAN (4/n)
  • 6. A point p is density-connected to a point q if
    there is a point s such that both p and q are
    density-reachable from s.
  • Density-connectedness is a symmetric and
    reflexive relation
  • A cluster is defined as a maximal (with respect
    to density-reachability) set of density-connected
    points.
  • Noise is the set of points that do not belong to
    any cluster.
  • 7. How to find a cluster C?
  • Maximality
  • ∀ p, q: if p ∈ C and q is density-reachable from
    p, then q ∈ C
  • Connectivity
  • ∀ p, q ∈ C: p is density-connected to q
  • 8. How to find the noise?
  • ∀ p: if p does not belong to any cluster, then p
    is a noise point (illustrated below)
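As an illustration of these definitions in practice (scikit-learn rather than the original DBSCAN code), the cluster/noise extraction above maps directly onto sklearn.cluster.DBSCAN, with Eps passed as eps and MinPts as min_samples; the label -1 marks noise.

```python
# Minimal sketch: density-connected clusters and noise via scikit-learn.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.concatenate([
    np.random.randn(50, 2),             # dense blob -> one cluster
    np.random.randn(50, 2) + [10, 10],  # second dense blob
    np.random.uniform(-5, 15, (10, 2)), # scattered points -> mostly noise
])
labels = DBSCAN(eps=0.9, min_samples=5).fit_predict(X)
noise = X[labels == -1]                 # points belonging to no cluster
print(sorted(set(labels)))
```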

30
Results of experiments
31
Performance analysis (1/2)
  • Time to construct the k-nearest-neighbor graph
  • Low-dimensional data sets, based on k-d trees:
    overall complexity O(n log n)
  • High-dimensional data sets: k-d trees are not
    applicable, overall complexity O(n²)
  • Finding initial sub-clusters
  • Obtaining m clusters by repeatedly partitioning
    successively smaller graphs has overall
    computational complexity O(n log(n/m)), which is
    bounded by O(n log n)
  • A faster partitioning algorithm can obtain the
    initial m clusters in time O(n + m log m) using
    a multilevel m-way partitioning algorithm

32
Performance analysis (2/2)
  • Merging sub-clusters using a dynamic framework
  • The time to compute the internal
    inter-connectivity and internal closeness of
    every initial cluster is O(nm)
  • The time to select the most similar pair of
    clusters to merge is O(m² log m), using a
    heap-based priority queue
  • So the overall complexity of CHAMELEON is
    O(n log n + nm + m² log m)

33
Conclusion and discussion
  • A dynamic model based on relative
    inter-connectivity and relative closeness
  • This paper ignores the issue of scaling to large
    data sets
  • Other graph representation methodologies?
  • Other partitioning algorithms?