Title: CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling
1. CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling
- Paper presentation in data mining class
- Presenter ??? ???
- Date 2001/12/18
2. About this paper
- Authors: George Karypis, Eui-Hong (Sam) Han, and Vipin Kumar
- Department of Computer Science and Engineering, University of Minnesota
- IEEE Computer, Aug. 1999
3. Outline
- Problem definition
- Main algorithm
- Key features of CHAMELEON
- Experiments and related work
- Conclusion and discussion
4. Problem definition
- Clustering
- Intracluster similarity is maximized
- Intercluster similarity is minimized
- Problems of existing clustering algorithms
- Static model constraints
- Break down when clusters are of diverse shapes, densities, and sizes
- Susceptible to noise, outliers, and artifacts
5. Static model constraints
- Data space constraint
- K-means, PAM, etc.
- Suitable only for data in metric spaces
- Cluster shape constraint
- K-means, PAM, CLARANS
- Assume clusters are ellipsoidal or globular and of similar sizes
- Cluster density constraint
- DBSCAN
- Points within a genuine cluster are density-reachable, and points across different clusters are not
- Similarity-determination constraint
- CURE, ROCK
- Use a static model to determine the most similar clusters to merge
6. Problems of partitioning techniques
7. Problems of hierarchical techniques (1/2)
- Clusters (c) and (d) will be chosen to merge when we consider only closeness
8. Problems of hierarchical techniques (2/2)
- Clusters (a) and (c) will be chosen to merge when we consider only interconnectivity
9. Main algorithm
- A two-phase algorithm
- PHASE I
- Uses a graph-partitioning algorithm to cluster the data items into a large number of relatively small sub-clusters
- PHASE II
- Uses an agglomerative hierarchical clustering algorithm to find the genuine clusters by repeatedly combining these sub-clusters
10. Framework
11. Key features of CHAMELEON
- Modeling the data
- Modeling the cluster similarity
- Partitioning algorithm
- Merge schemes
12. Terms
- Arguments needed
- K
- For the K-nearest-neighbor graph
- MINSIZE
- The minimum size of an initial sub-cluster
- TRI
- Threshold on relative interconnectivity
- TRC
- Threshold on relative closeness
- a
- Coefficient weighting RI against RC
13. Modeling the data
- K-nearest-neighbor graph approach
- Advantages
- Data points that are far apart are completely disconnected in Gk
- Gk captures the concept of neighborhood dynamically
- The edge weights of dense regions in Gk tend to be large, and the edge weights of sparse regions tend to be small
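To make the k-nearest-neighbor graph concrete, here is a minimal Python sketch of building Gk; the inverse-distance edge weighting is an assumption for illustration, since the slides do not specify the similarity function.

```python
import numpy as np

def knn_graph(points, k):
    """Build the k-nearest-neighbor graph G_k as an adjacency dict.

    An edge (u, v) exists if v is among the k nearest neighbors of u.
    Edge weights are similarities (inverse distances), so edges in
    dense regions get large weights and edges in sparse regions get
    small ones, as the slide notes.
    """
    n = len(points)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    graph = {u: {} for u in range(n)}
    for u in range(n):
        # argsort puts u itself first (distance 0); skip it.
        for v in np.argsort(dists[u])[1:k + 1]:
            graph[u][int(v)] = 1.0 / (1.0 + dists[u, v])  # similarity weight
    return graph
```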
14. Example of a k-nearest-neighbor graph
15. Modeling the cluster similarity (1/2)
- Relative interconnectivity (RI)
- Relative closeness (RC)
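The slide lists the two measures without their formulas; reconstructed from the original paper, they are:

```latex
\mathrm{RI}(C_i, C_j) =
  \frac{\left|EC_{\{C_i,C_j\}}\right|}
       {\tfrac{1}{2}\left(\left|EC_{C_i}\right| + \left|EC_{C_j}\right|\right)}
\qquad
\mathrm{RC}(C_i, C_j) =
  \frac{\bar{S}_{EC_{\{C_i,C_j\}}}}
       {\frac{|C_i|}{|C_i|+|C_j|}\,\bar{S}_{EC_{C_i}}
        + \frac{|C_j|}{|C_i|+|C_j|}\,\bar{S}_{EC_{C_j}}}
```

Here |EC_{Ci,Cj}| is the total weight of the edges connecting Ci and Cj, |EC_{Ci}| is the weight of the min-cut bisector of Ci, and S-bar denotes the corresponding average edge weight.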
16. Modeling the cluster similarity (2/2)
- If the relative measures are considered, (c) and (d) will be merged
17. Partitioning algorithm (PHASE I)
- What
- Finding the initial sub-clusters
- Why
- RI and RC cannot be accurately calculated for clusters containing only a few data points
- How
- Utilize a multilevel graph-partitioning algorithm (hMETIS)
- Coarsening phase
- Partitioning phase
- Uncoarsening phase
18. Partitioning algorithm (cont.)
- Initially
- All points belong to the same cluster
- Repeat until the size of every cluster < MINSIZE (see the sketch below)
- Select the largest cluster and use hMETIS to bisect it
- Note
- Balance constraint
- Split Ci into CiA and CiB such that each sub-cluster contains at least 25% of the nodes of Ci
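A minimal sketch of this Phase I loop; `hmetis_bisect` is a hypothetical stand-in for the real hMETIS library call.

```python
def phase_one(graph, all_points, min_size, hmetis_bisect):
    """Repeatedly bisect the largest sub-cluster until every
    sub-cluster has fewer than min_size points.

    hmetis_bisect(graph, nodes) stands in for hMETIS: it must return
    (part_a, part_b) minimizing the edge cut, subject to the balance
    constraint that each part keeps at least 25% of the nodes.
    """
    clusters = [list(all_points)]          # initially one cluster
    while True:
        largest = max(clusters, key=len)
        if len(largest) < min_size:        # all clusters small enough
            break
        clusters.remove(largest)
        part_a, part_b = hmetis_bisect(graph, largest)
        clusters.extend([part_a, part_b])
    return clusters
```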
20. Merge schemes (PHASE II)
- What
- Merging sub-clusters using a dynamic framework
- How
- Finding and merging the pair of sub-clusters that are the most similar
- Scheme 1: merge every pair whose RI and RC both exceed the thresholds TRI and TRC
- Scheme 2: merge the pair that maximizes RI * RC^a (sketched below)
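A sketch of a Phase II merge loop in the spirit of scheme 2, which maximizes RI * RC^a at each step; the `rel_interconnectivity`/`rel_closeness` helpers and the `num_clusters` stopping criterion are assumptions for illustration.

```python
def phase_two(clusters, a, rel_interconnectivity, rel_closeness,
              num_clusters):
    """Agglomerative Phase II using merge scheme 2: at each step,
    merge the pair of sub-clusters maximizing RI * RC**a, where the
    exponent a trades off closeness against interconnectivity.
    Stopping at a target cluster count is an assumed criterion.
    """
    clusters = [set(c) for c in clusters]
    while len(clusters) > num_clusters:
        best, best_score = None, -1.0
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ri = rel_interconnectivity(clusters[i], clusters[j])
                rc = rel_closeness(clusters[i], clusters[j])
                score = ri * rc ** a
                if score > best_score:
                    best, best_score = (i, j), score
        i, j = best
        clusters[i] |= clusters[j]   # merge j into i (j > i)
        del clusters[j]
    return clusters
```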
21. Experiments and related work
- Introduction to CURE
- Introduction to DBSCAN
- Experimental results
- Performance analysis
22. Introduction to CURE (1/n)
- Clustering Using Representative points
- 1. Properties
- Fits non-spherical shapes
- Shrinking can help to dampen the effects of outliers (sketched below)
- Multiple representative points are chosen for non-spherical clusters
- In each iteration, the chosen scattered points are shrunk toward the centroid by a fixed ratio before the merge procedure
- Random sampling of the data set makes it fit for large databases
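The shrinking step can be made concrete with a small sketch: each scattered point is moved a fraction s of the way toward the cluster centroid, which is the standard CURE formulation.

```python
import numpy as np

def shrink_representatives(scattered_points, s):
    """Shrink CURE's scattered representative points toward the
    cluster centroid by the shrink factor s (0 < s < 1).  Outliers,
    being far from the centroid, move the most, which dampens their
    influence on inter-cluster distances.
    """
    pts = np.asarray(scattered_points, dtype=float)
    centroid = pts.mean(axis=0)
    return pts + s * (centroid - pts)
```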
23. Introduction to CURE (2/n)
- 2. Drawbacks
- The partitioning method cannot guarantee that the chosen data points are good
- Clustering accuracy with respect to the parameters below:
- (1) Shrink factor s: CURE always finds the right clusters for s values ranging from 0.2 to 0.7
- (2) Number of representative points c: CURE always finds the right clusters for values of c greater than 10
- (3) Number of partitions p: with as many as 50 partitions, CURE always discovers the desired clusters
- (4) Random sample size r:
- (a) for sample sizes up to 2000, the clusters found were of poor quality
- (b) from 2500 sample points and above (about 2.5% of the data set size), CURE always finds the correct clusters
24. 3. Clustering algorithm: representative points
26. Introduction to DBSCAN (1/n)
- Density-Based Spatial Clustering of Applications with Noise
- 1. Properties
- Can discover clusters of arbitrary shape
- Each cluster has a typical density of points that is higher than outside the cluster
- The density within the areas of noise is lower than the density in any of the clusters
- Needs only the MinPts parameter as input
- Easy to implement in C using an R-tree
- Runtime grows nearly linearly with the number of points; time complexity is O(n log n)
27. Introduction to DBSCAN (2/n)
- 2. Drawbacks
- Cannot be applied to polygons
- Cannot be applied to high-dimensional feature spaces
- Cannot process the shape of the k-dist graph with multiple features
- Not suited to large databases, because no method is applied to reduce the spatial database
- 3. Definitions
- Eps-neighborhood of a point p:
- NEps(p) = { q ∈ D | dist(p, q) ≤ Eps }
- Each cluster contains at least MinPts points
28. Introduction to DBSCAN (3/n)
- 4. p is directly density-reachable from q iff
- (1) p ∈ NEps(q), and
- (2) |NEps(q)| ≥ MinPts (core point condition; see the sketch below)
- Directly density-reachable is symmetric when p and q are both core points, but asymmetric when one is a core point and the other a border point
- 5. p is density-reachable from q if there is a chain of points between p and q, each directly density-reachable from the previous one
- Density-reachable is transitive, but not symmetric in general
- Density-reachable is symmetric for core points
29. Introduction to DBSCAN (4/n)
- 6. A point p is density-connected to a point q if there is a point s such that both p and q are density-reachable from s
- Density-connected is a symmetric and reflexive relation
- A cluster is defined to be a set of density-connected points that is maximal with respect to density-reachability
- Noise is the set of points not belonging to any cluster
- 7. How to find a cluster C?
- Maximality
- ∀ p, q: if p ∈ C and q is density-reachable from p, then q ∈ C
- Connectivity
- ∀ p, q ∈ C: p is density-connected to q
- 8. How to find noise?
- ∀ p: if p does not belong to any cluster, then p is a noise point
30. Experimental results
31. Performance analysis (1/2)
- Time to construct the k-nearest-neighbor graph
- Low-dimensional data sets: based on k-d trees, overall complexity O(n log n)
- High-dimensional data sets: k-d trees are not applicable, overall complexity O(n²)
- Finding initial sub-clusters
- Obtaining m clusters by repeatedly partitioning successively smaller graphs has overall computational complexity O(n log(n/m))
- This is bounded by O(n log n)
- A faster partitioning algorithm can obtain the initial m clusters in time O(nm log m), using a multilevel m-way partitioning algorithm
32. Performance analysis (2/2)
- Merging sub-clusters using a dynamic framework
- The time to compute the internal interconnectivity and internal closeness of each initial cluster is O(nm)
- The time to find the most similar pair of clusters to merge is O(m² log m), using a heap-based priority queue
- So the overall complexity of CHAMELEON is O(n log n + nm + m² log m)
33. Conclusion and discussion
- A dynamic model based on relative interconnectivity and relative closeness
- The paper ignores the issue of scaling to large data sets
- Other graph representation methodologies?
- Other partitioning algorithms?