Title: CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling
1. CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling
- Paper presentation in data mining class
- Presenter ??? ???
- Date 2001/12/18
2. About this paper
- Authors: George Karypis, Eui-Hong (Sam) Han, and Vipin Kumar
- Department of Computer Science and Engineering, University of Minnesota
- IEEE Computer, Aug. 1999
3. Outline
- Problem definition
- Main algorithm
- Key features of CHAMELEON
- Experiments and related work
- Conclusion and discussion
4. Problem definition
- Clustering
- Intracluster similarity is maximized
- Intercluster similarity is minimized
- Problems of existing clustering algorithms
- Static model constraints
- Break down when clusters are of diverse shapes, densities, and sizes
- Susceptible to noise, outliers, and artifacts
5. Static model constraints
- Data space constraint
- K-means, PAM, etc.
- Suitable only for data in metric spaces
- Cluster shape constraint
- K-means, PAM, CLARANS
- Assume clusters are ellipsoidal or globular and of similar sizes
- Cluster density constraint
- DBSCAN
- Points within a genuine cluster are density-reachable, and points across different clusters are not
- Similarity-determination constraint
- CURE, ROCK
- Use a static model to determine the most similar clusters to merge
6. Problems of partitioning techniques
7. Problems of hierarchical techniques (1/2)
- Clusters (c) and (d) will be chosen to merge when we consider only closeness
8. Problems of hierarchical techniques (2/2)
- Clusters (a) and (c) will be chosen to merge when we consider only interconnectivity
9. Main algorithm
- A two-phase algorithm
- PHASE I
- Uses a graph-partitioning algorithm to cluster the data items into a large number of relatively small sub-clusters
- PHASE II
- Uses an agglomerative hierarchical clustering algorithm to find the genuine clusters by repeatedly combining these sub-clusters
10. Framework
11. Key features of CHAMELEON
- Modeling the data
- Modeling the cluster similarity
- Partitioning algorithm
- Merge schemes
12. Terms
- Arguments needed
- K
- For the K-nearest-neighbor graph
- MINSIZE
- The minimum size of an initial sub-cluster
- TRI
- Threshold on relative interconnectivity
- TRC
- Threshold on relative closeness
- a
- Coefficient weighting RI against RC
13. Modeling the data
- K-nearest-neighbor graph approach
- Advantages
- Data points that are far apart are completely disconnected in Gk
- Gk captures the concept of neighborhood dynamically
- The edge weights of dense regions in Gk tend to be large, and the edge weights of sparse regions tend to be small
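To make the k-nearest-neighbor graph concrete, here is a minimal Python sketch of building Gk; the inverse-distance edge weighting is an assumption for illustration, since the slides do not specify the similarity function.

```python
import numpy as np

def knn_graph(points, k):
    """Build the k-nearest-neighbor graph G_k as an adjacency dict.

    An edge (u, v) exists if v is among the k nearest neighbors of u.
    Edge weights are similarities (inverse distances), so edges in
    dense regions get large weights and edges in sparse regions get
    small ones, as the slide notes.
    """
    n = len(points)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    graph = {u: {} for u in range(n)}
    for u in range(n):
        # argsort puts u itself first (distance 0); skip it.
        for v in np.argsort(dists[u])[1:k + 1]:
            graph[u][int(v)] = 1.0 / (1.0 + dists[u, v])  # similarity weight
    return graph
```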
14. Example of a k-nearest-neighbor graph
15. Modeling the cluster similarity (1/2)
- Relative interconnectivity (RI)
- Relative closeness (RC)
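The slide lists the two measures without their formulas; reconstructed from the original paper, they are:

```latex
\mathrm{RI}(C_i, C_j) =
  \frac{\left|EC_{\{C_i,C_j\}}\right|}
       {\tfrac{1}{2}\left(\left|EC_{C_i}\right| + \left|EC_{C_j}\right|\right)}
\qquad
\mathrm{RC}(C_i, C_j) =
  \frac{\bar{S}_{EC_{\{C_i,C_j\}}}}
       {\frac{|C_i|}{|C_i|+|C_j|}\,\bar{S}_{EC_{C_i}}
        + \frac{|C_j|}{|C_i|+|C_j|}\,\bar{S}_{EC_{C_j}}}
```

Here |EC_{Ci,Cj}| is the total weight of the edges connecting Ci and Cj, |EC_{Ci}| is the weight of the min-cut bisector of Ci, and S-bar denotes the corresponding average edge weight.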
16. Modeling the cluster similarity (2/2)
- If the relative measures are considered, (c) and (d) will be merged
17. Partitioning algorithm (PHASE I)
- What
- Finding the initial sub-clusters
- Why
- RI and RC cannot be accurately calculated for clusters containing only a few data points
- How
- Utilize a multilevel graph-partitioning algorithm (hMETIS)
- Coarsening phase
- Partitioning phase
- Uncoarsening phase
18. Partitioning algorithm (cont.)
- Initially
- All points belong to the same cluster
- Repeat until the size of every cluster < MINSIZE (see the sketch below)
- Select the largest cluster and use hMETIS to bisect it
- Note
- Balance constraint
- Split Ci into CiA and CiB such that each sub-cluster contains at least 25% of the nodes of Ci
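A minimal sketch of this Phase I loop; `hmetis_bisect` is a hypothetical stand-in for the real hMETIS library call.

```python
def phase_one(graph, all_points, min_size, hmetis_bisect):
    """Repeatedly bisect the largest sub-cluster until every
    sub-cluster has fewer than min_size points.

    hmetis_bisect(graph, nodes) stands in for hMETIS: it must return
    (part_a, part_b) minimizing the edge cut, subject to the balance
    constraint that each part keeps at least 25% of the nodes.
    """
    clusters = [list(all_points)]          # initially one cluster
    while True:
        largest = max(clusters, key=len)
        if len(largest) < min_size:        # all clusters small enough
            break
        clusters.remove(largest)
        part_a, part_b = hmetis_bisect(graph, largest)
        clusters.extend([part_a, part_b])
    return clusters
```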
20. Merge schemes (PHASE II)
- What
- Merging sub-clusters using a dynamic framework
- How
- Finding and merging the pair of sub-clusters that are the most similar
- Scheme 1: merge every pair whose RI and RC both exceed the thresholds TRI and TRC
- Scheme 2: merge the pair that maximizes RI * RC^a (sketched below)
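A sketch of a Phase II merge loop in the spirit of scheme 2, which maximizes RI * RC^a at each step; the `rel_interconnectivity`/`rel_closeness` helpers and the `num_clusters` stopping criterion are assumptions for illustration.

```python
def phase_two(clusters, a, rel_interconnectivity, rel_closeness,
              num_clusters):
    """Agglomerative Phase II using merge scheme 2: at each step,
    merge the pair of sub-clusters maximizing RI * RC**a, where the
    exponent a trades off closeness against interconnectivity.
    Stopping at a target cluster count is an assumed criterion.
    """
    clusters = [set(c) for c in clusters]
    while len(clusters) > num_clusters:
        best, best_score = None, -1.0
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ri = rel_interconnectivity(clusters[i], clusters[j])
                rc = rel_closeness(clusters[i], clusters[j])
                score = ri * rc ** a
                if score > best_score:
                    best, best_score = (i, j), score
        i, j = best
        clusters[i] |= clusters[j]   # merge j into i (j > i)
        del clusters[j]
    return clusters
```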
21. Experiments and related work
- Introduction to CURE
- Introduction to DBSCAN
- Experimental results
- Performance analysis
22. Introduction to CURE (1/n)
- Clustering Using Representative points
- 1. Properties
- Fits non-spherical shapes
- Shrinking can help to dampen the effects of outliers (sketched below)
- Multiple representative points are chosen for non-spherical clusters
- In each iteration, the chosen scattered points are shrunk toward the centroid by a fixed ratio before the merge procedure
- Random sampling of the data set makes it fit for large databases
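The shrinking step can be made concrete with a small sketch: each scattered point is moved a fraction s of the way toward the cluster centroid, which is the standard CURE formulation.

```python
import numpy as np

def shrink_representatives(scattered_points, s):
    """Shrink CURE's scattered representative points toward the
    cluster centroid by the shrink factor s (0 < s < 1).  Outliers,
    being far from the centroid, move the most, which dampens their
    influence on inter-cluster distances.
    """
    pts = np.asarray(scattered_points, dtype=float)
    centroid = pts.mean(axis=0)
    return pts + s * (centroid - pts)
```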
23. Introduction to CURE (2/n)
- 2. Drawbacks
- The partitioning method cannot guarantee that the chosen data points are good
- Clustering accuracy with respect to the parameters below:
- (1) Shrink factor s: CURE always finds the right clusters for s values ranging from 0.2 to 0.7
- (2) Number of representative points c: CURE always finds the right clusters for values of c greater than 10
- (3) Number of partitions p: with as many as 50 partitions, CURE always discovers the desired clusters
- (4) Random sample size r:
- (a) for sample sizes up to 2000, the clusters found were of poor quality
- (b) from 2500 sample points and above (about 2.5% of the data set size), CURE always finds the correct clusters
24. 3. Clustering algorithm: representative points
26. Introduction to DBSCAN (1/n)
- Density-Based Spatial Clustering of Applications with Noise
- 1. Properties
- Can discover clusters of arbitrary shape
- Each cluster has a typical density of points that is higher than outside the cluster
- The density within the areas of noise is lower than the density in any of the clusters
- Needs only the MinPts parameter as input
- Easy to implement in C using an R-tree
- Runtime grows nearly linearly with the number of points; time complexity is O(n log n)
27. Introduction to DBSCAN (2/n)
- 2. Drawbacks
- Cannot be applied to polygons
- Cannot be applied to high-dimensional feature spaces
- Cannot process the shape of the k-dist graph with multiple features
- Not suited to large databases, because no method is applied to reduce the spatial database
- 3. Definitions
- Eps-neighborhood of a point p:
- NEps(p) = { q ∈ D | dist(p, q) ≤ Eps }
- Each cluster contains at least MinPts points
28. Introduction to DBSCAN (3/n)
- 4. p is directly density-reachable from q iff
- (1) p ∈ NEps(q), and
- (2) |NEps(q)| ≥ MinPts (core point condition; see the sketch below)
- Directly density-reachable is symmetric when p and q are both core points, but asymmetric when one is a core point and the other a border point
- 5. p is density-reachable from q if there is a chain of points between p and q, each directly density-reachable from the previous one
- Density-reachable is transitive, but not symmetric in general
- Density-reachable is symmetric for core points
29. Introduction to DBSCAN (4/n)
- 6. A point p is density-connected to a point q if there is a point s such that both p and q are density-reachable from s
- Density-connected is a symmetric and reflexive relation
- A cluster is defined to be a set of density-connected points that is maximal with respect to density-reachability
- Noise is the set of points not belonging to any cluster
- 7. How to find a cluster C?
- Maximality
- ∀ p, q: if p ∈ C and q is density-reachable from p, then q ∈ C
- Connectivity
- ∀ p, q ∈ C: p is density-connected to q
- 8. How to find noise?
- ∀ p: if p does not belong to any cluster, then p is a noise point
30. Experimental results
31. Performance analysis (1/2)
- Time to construct the k-nearest-neighbor graph
- Low-dimensional data sets: based on k-d trees, overall complexity O(n log n)
- High-dimensional data sets: k-d trees are not applicable, overall complexity O(n²)
- Finding initial sub-clusters
- Obtaining m clusters by repeatedly partitioning successively smaller graphs has overall computational complexity O(n log(n/m))
- This is bounded by O(n log n)
- A faster partitioning algorithm can obtain the initial m clusters in time O(nm log m), using a multilevel m-way partitioning algorithm
32. Performance analysis (2/2)
- Merging sub-clusters using a dynamic framework
- The time to compute the internal interconnectivity and internal closeness of each initial cluster is O(nm)
- The time to find the most similar pair of clusters to merge is O(m² log m), using a heap-based priority queue
- So the overall complexity of CHAMELEON is O(n log n + nm + m² log m)
33. Conclusion and discussion
- A dynamic model based on relative interconnectivity and relative closeness
- The paper ignores the issue of scaling to large data sets
- Other graph representation methodologies?
- Other partitioning algorithms?