1
BIRCH
  • An Efficient Data Clustering Method for Very
    Large Databases
  • SIGMOD '96

2
Introduction
  • Balanced Iterative Reducing and Clustering using
    Hierarchies
  • For multi-dimensional datasets
  • Minimizes I/O cost (linear: one or two scans of
    the dataset)
  • Full utilization of available memory
  • Hierarchies → indexing method

3
Terminology
  • Properties of a cluster
  • Given N d-dimensional data points
  • Centroid
  • Radius
  • Diameter (all three defined below)
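The formulas themselves did not survive the transcript; for N data points X1, ..., XN, the usual definitions (as in the BIRCH paper) are:

```latex
\vec{X}_0 = \frac{\sum_{i=1}^{N} \vec{X}_i}{N}
\qquad
R = \left( \frac{\sum_{i=1}^{N} (\vec{X}_i - \vec{X}_0)^2}{N} \right)^{1/2}
\qquad
D = \left( \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (\vec{X}_i - \vec{X}_j)^2}{N(N-1)} \right)^{1/2}
```

R is the average distance from the points to the centroid, and D is the average pairwise distance within the cluster.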

4
Terminology
  • Distance between two clusters (formulas below)
  • D0: Euclidean distance between centroids
  • D1: Manhattan distance between centroids
  • D2: average inter-cluster distance
  • D3: average intra-cluster distance
  • D4: variance increase distance
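Written out (following the BIRCH paper's definitions), for two clusters with centroids X01 and X02, points {Xi} (i = 1..N1) and {Yj} (j = 1..N2), and the merged cluster's points written {Zk} with centroid Z0:

```latex
D_0 = \bigl\| \vec{X}_{01} - \vec{X}_{02} \bigr\|_2
\qquad
D_1 = \sum_{k=1}^{d} \bigl| X_{01}^{(k)} - X_{02}^{(k)} \bigr|

D_2 = \left( \frac{\sum_{i=1}^{N_1} \sum_{j=1}^{N_2} (\vec{X}_i - \vec{Y}_j)^2}{N_1 N_2} \right)^{1/2}
\qquad
D_3 = \left( \frac{\sum_{i=1}^{N_1+N_2} \sum_{j=1}^{N_1+N_2} (\vec{Z}_i - \vec{Z}_j)^2}{(N_1+N_2)(N_1+N_2-1)} \right)^{1/2}

D_4 = \sum_{k=1}^{N_1+N_2} (\vec{Z}_k - \vec{Z}_0)^2
    - \sum_{i=1}^{N_1} (\vec{X}_i - \vec{X}_{01})^2
    - \sum_{j=1}^{N_2} (\vec{Y}_j - \vec{X}_{02})^2
```

So D2 averages over all inter-cluster point pairs, D3 is simply the diameter of the merged cluster, and D4 measures how much the total squared deviation grows if the two clusters are merged.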

5
Clustering Feature
  • To calculate the centroid, radius, diameter, and
    D0 through D4, not all points are needed
  • Three values are stored to represent a cluster
    (its Clustering Feature, CF)
  • N: number of points in the cluster
  • LS: linear sum of the points in the cluster
  • SS: square sum of the points in the cluster
  • CFs are additive (see the sketch below)
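A minimal Python sketch (illustrative, not from the slides) of a CF triple; note that every derived statistic comes straight from (N, LS, SS):

```python
import math

class CF:
    """Clustering Feature: the triple (N, LS, SS) summarizing a set of points."""

    def __init__(self, n, ls, ss):
        self.n = n      # N: number of points in the cluster
        self.ls = ls    # LS: per-dimension linear sum of the points (a vector)
        self.ss = ss    # SS: sum of squared norms of the points (a scalar)

    @classmethod
    def of_point(cls, x):
        """CF of a single point; for (3, 4) this is (1, (3, 4), 25)."""
        return cls(1, list(x), sum(v * v for v in x))

    def add(self, other):
        """Additivity: merging two clusters adds their triples component-wise."""
        return CF(self.n + other.n,
                  [a + b for a, b in zip(self.ls, other.ls)],
                  self.ss + other.ss)

    def centroid(self):
        return [v / self.n for v in self.ls]

    def radius(self):
        """R^2 = SS/N - ||LS/N||^2: mean squared distance to the centroid."""
        c = self.centroid()
        return math.sqrt(max(self.ss / self.n - sum(v * v for v in c), 0.0))
```

For example, `CF.of_point((3, 4)).add(CF.of_point((1, 2)))` yields (2, (4, 6), 30) without revisiting either point.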

6
CF Tree
  • Similar to a B-tree or R-tree
  • Parameters
  • B: branching factor
  • L: maximum number of CF entries in a leaf
  • T: threshold
  • A leaf node contains at most L CF entries, each of
    which must satisfy D < T or R < T
  • A non-leaf node contains at most B CF entries, one
    summarizing each child
  • Each node should fit into one page (node layout
    sketched below)
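As a data-structure sketch (hypothetical field names, continuing the CF class above):

```python
class CFNode:
    """One CF-tree node; B, L and T are chosen so that a node fits in one page."""

    def __init__(self, is_leaf):
        self.is_leaf = is_leaf
        self.entries = []      # CF triples: at most L in a leaf, at most B otherwise
        self.children = []     # child CFNodes, parallel to entries (non-leaf only)
        self.next_leaf = None  # leaves are chained to allow a fast leaf scan
```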

7
BIRCH
  • Phase 1: Scan the dataset once and build a CF tree
    in memory
  • Phase 2 (optional): Condense the CF tree into a
    smaller CF tree
  • Phase 3: Global clustering
  • Phase 4 (optional): Cluster refinement (requires one
    more scan of the dataset)

8
BIRCH
9
Building CF Tree (Phase 1)
  • The CF of a single data point (3,4) is (1, (3,4), 25)
  • To insert a point into the tree:
  • Find the path (based on the D0–D4 distance between
    the CF entries of the children in each non-leaf node)
  • Modify the leaf:
  • Find the closest leaf-node entry (based on the D0–D4
    distance between the CF entries in the leaf)
  • Check whether it can absorb the new data point
  • Modify the path down to the leaf
  • Splitting: if the leaf node is full, split it into
    two leaf nodes and add one more entry to the parent
    (see the sketch below)
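A sketch of the leaf step under these rules (illustrative only; it reuses the CF class above, uses D0 between centroids as the distance, and checks absorption via the merged radius against T):

```python
import math

def d0(p, q):
    """Euclidean distance between two centroids (the D0 metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def insert_into_leaf(entries, x, T, L):
    """Insert point x into a leaf (a list of CF entries).

    Returns None if x was absorbed or added in place; otherwise returns the
    entry list of a new leaf created by splitting. Sketch only: real BIRCH
    also refreshes the CFs on the root-to-leaf path and updates the parent.
    """
    cf_x = CF.of_point(x)
    if entries:
        # Closest existing entry, by D0 between centroids.
        i = min(range(len(entries)),
                key=lambda k: d0(entries[k].centroid(), cf_x.centroid()))
        merged = entries[i].add(cf_x)
        if merged.radius() < T:        # the entry can absorb the point
            entries[i] = merged
            return None
    entries.append(cf_x)               # otherwise the point starts a new entry
    if len(entries) <= L:
        return None
    # Split: seed the two new leaves with the farthest-apart pair of entries,
    # then attach every other entry to the nearer seed.
    a, b = max(((i, j) for i in range(len(entries))
                       for j in range(i + 1, len(entries))),
               key=lambda p: d0(entries[p[0]].centroid(),
                                entries[p[1]].centroid()))
    left, right = [entries[a]], [entries[b]]
    for k, e in enumerate(entries):
        if k not in (a, b):
            (left if d0(e.centroid(), left[0].centroid())
                  <= d0(e.centroid(), right[0].centroid()) else right).append(e)
    entries[:] = left
    return right
```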

10
Building CF Tree (Phase 1)
(Diagram: each non-leaf node entry holds the sum of the
CF = (N, LS, SS) triples of all its children; each leaf
node holds CF = (N, LS, SS) entries, each satisfying
D < T or R < T.)
11
Condensing CF Tree (Phase 2)
  • Choose a larger threshold T
  • Consider only the entries in the leaf nodes
  • Reinsert the CF entries into the new tree
    (simplified sketch below)
  • If an entry's new path comes before its original
    path, move the entry to the new path
  • If the new path is the same as the original path,
    leave the entry unchanged
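A deliberately simplified sketch of condensing (it reuses CF and d0 from above and flattens the tree to a plain entry list; the paper's in-place rebuild keeps the tree shape and applies the "new path before old path" rule to avoid extra memory):

```python
def condense(old_leaf_entries, new_T):
    """Phase 2 sketch: reinsert old leaf entries under a larger threshold."""
    new_entries = []
    for cf in old_leaf_entries:
        if new_entries:
            # Closest entry built so far, by D0 between centroids.
            i = min(range(len(new_entries)),
                    key=lambda k: d0(new_entries[k].centroid(), cf.centroid()))
            merged = new_entries[i].add(cf)
            if merged.radius() < new_T:   # the larger T lets entries coalesce
                new_entries[i] = merged
                continue
        new_entries.append(cf)
    return new_entries
```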

12
Global Clustering (Phase 3)
  • Consider the CF entries in the leaf nodes only
  • Use the centroid as the representative of a cluster
  • Perform traditional clustering (e.g. agglomerative
    hierarchical clustering with complete link over D2,
    K-means, or CL); a weighted K-means sketch follows
  • Cluster the CFs instead of the raw data points
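One possibility (a sketch, not the paper's exact Phase 3 algorithm): weighted K-means over the leaf-entry centroids, with each centroid weighted by its entry's N so that large CF entries pull harder than small ones:

```python
import random

def global_clustering(leaf_entries, k, iters=20):
    """Weighted K-means over CF centroids (uses CF and d0 from the sketches above)."""
    pts = [e.centroid() for e in leaf_entries]
    wts = [e.n for e in leaf_entries]    # each centroid stands in for N points
    centers = random.sample(pts, k)
    assign = []
    for _ in range(iters):
        # Assign every CF centroid to its nearest center ...
        assign = [min(range(k), key=lambda c: d0(p, centers[c])) for p in pts]
        # ... then recompute each center as the weighted mean of its members.
        for c in range(k):
            members = [(p, w) for p, w, a in zip(pts, wts, assign) if a == c]
            if members:
                total = sum(w for _, w in members)
                centers[c] = [sum(p[i] * w for p, w in members) / total
                              for i in range(len(members[0][0]))]
    return centers, assign
```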

13
Cluster Refining (Phase 4)
  • Requires one more scan of the dataset
  • Uses the clusters found in Phase 3 as seeds
  • Redistributes the data points to their closest seeds
    to form the new clusters (sketched below)
  • Allows removal of outliers
  • Yields cluster-membership information for each point
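A minimal sketch of this redistribution pass (illustrative names; `seeds` are the Phase 3 centers, and d0 is the helper defined earlier):

```python
def refine(dataset, seeds):
    """Phase 4 sketch: one extra scan reassigning raw points to the seeds.

    Outlier removal could be added here by dropping points whose distance
    to their closest seed exceeds some threshold.
    """
    clusters = [[] for _ in seeds]
    for x in dataset:                        # the one additional scan
        i = min(range(len(seeds)), key=lambda c: d0(x, seeds[c]))
        clusters[i].append(x)                # membership info as a by-product
    return clusters
```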

14
Performance
15
Visualization
16
Conclusion
  • A clustering algorithm that takes I/O cost and
    memory limitations into account
  • Utilizes local information (each clustering decision
    is made without scanning all data points)
  • Not every data point is equally important for the
    clustering result