1
BIRCH
  • An Efficient Data Clustering Method for Very
    Large Databases
  • SIGMOD '96

2
Introduction
  • Balanced Iterative Reducing and Clustering using
    Hierarchies
  • For multi-dimensional datasets
  • Minimizes I/O cost (linear: one or two scans of
    the dataset)
  • Full utilization of available memory
  • Hierarchies → indexing method

3
Terminology
  • Properties of a cluster
  • Given N d-dimensional data points
  • Centroid
  • Radius
  • Diameter (all three defined below)
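The formulas themselves did not survive the transcript; for N data points X1, ..., XN, the usual definitions (as in the BIRCH paper) are:

```latex
\vec{X}_0 = \frac{\sum_{i=1}^{N} \vec{X}_i}{N}
\qquad
R = \left( \frac{\sum_{i=1}^{N} (\vec{X}_i - \vec{X}_0)^2}{N} \right)^{1/2}
\qquad
D = \left( \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (\vec{X}_i - \vec{X}_j)^2}{N(N-1)} \right)^{1/2}
```

R is the average distance from the points to the centroid, and D is the average pairwise distance within the cluster.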

4
Terminology
  • Distance between two clusters (formulas below)
  • D0: Euclidean distance between centroids
  • D1: Manhattan distance between centroids
  • D2: average inter-cluster distance
  • D3: average intra-cluster distance
  • D4: variance increase distance
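Written out (following the BIRCH paper's definitions), for two clusters with centroids X01 and X02, points {Xi} (i = 1..N1) and {Yj} (j = 1..N2), and the merged cluster's points written {Zk} with centroid Z0:

```latex
D_0 = \bigl\| \vec{X}_{01} - \vec{X}_{02} \bigr\|_2
\qquad
D_1 = \sum_{k=1}^{d} \bigl| X_{01}^{(k)} - X_{02}^{(k)} \bigr|

D_2 = \left( \frac{\sum_{i=1}^{N_1} \sum_{j=1}^{N_2} (\vec{X}_i - \vec{Y}_j)^2}{N_1 N_2} \right)^{1/2}
\qquad
D_3 = \left( \frac{\sum_{i=1}^{N_1+N_2} \sum_{j=1}^{N_1+N_2} (\vec{Z}_i - \vec{Z}_j)^2}{(N_1+N_2)(N_1+N_2-1)} \right)^{1/2}

D_4 = \sum_{k=1}^{N_1+N_2} (\vec{Z}_k - \vec{Z}_0)^2
    - \sum_{i=1}^{N_1} (\vec{X}_i - \vec{X}_{01})^2
    - \sum_{j=1}^{N_2} (\vec{Y}_j - \vec{X}_{02})^2
```

So D2 averages over all inter-cluster point pairs, D3 is simply the diameter of the merged cluster, and D4 measures how much the total squared deviation grows if the two clusters are merged.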

5
Clustering Feature
  • To calculate the centroid, radius, diameter, and
    D0 through D4, not all points are needed
  • Three values are stored to represent a cluster
    (its Clustering Feature, CF)
  • N: number of points in the cluster
  • LS: linear sum of the points in the cluster
  • SS: square sum of the points in the cluster
  • CFs are additive (see the sketch below)
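A minimal Python sketch (illustrative, not from the slides) of a CF triple; note that every derived statistic comes straight from (N, LS, SS):

```python
import math

class CF:
    """Clustering Feature: the triple (N, LS, SS) summarizing a set of points."""

    def __init__(self, n, ls, ss):
        self.n = n      # N: number of points in the cluster
        self.ls = ls    # LS: per-dimension linear sum of the points (a vector)
        self.ss = ss    # SS: sum of squared norms of the points (a scalar)

    @classmethod
    def of_point(cls, x):
        """CF of a single point; for (3, 4) this is (1, (3, 4), 25)."""
        return cls(1, list(x), sum(v * v for v in x))

    def add(self, other):
        """Additivity: merging two clusters adds their triples component-wise."""
        return CF(self.n + other.n,
                  [a + b for a, b in zip(self.ls, other.ls)],
                  self.ss + other.ss)

    def centroid(self):
        return [v / self.n for v in self.ls]

    def radius(self):
        """R^2 = SS/N - ||LS/N||^2: mean squared distance to the centroid."""
        c = self.centroid()
        return math.sqrt(max(self.ss / self.n - sum(v * v for v in c), 0.0))
```

For example, `CF.of_point((3, 4)).add(CF.of_point((1, 2)))` yields (2, (4, 6), 30) without revisiting either point.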

6
CF Tree
  • Similar to a B-tree or R-tree
  • Parameters
  • B: branching factor
  • L: maximum number of CF entries in a leaf
  • T: threshold
  • A leaf node contains at most L CF entries, each of
    which must satisfy D < T or R < T
  • A non-leaf node contains at most B CF entries, one
    summarizing each child
  • Each node should fit into one page (node layout
    sketched below)
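As a data-structure sketch (hypothetical field names, continuing the CF class above):

```python
class CFNode:
    """One CF-tree node; B, L and T are chosen so that a node fits in one page."""

    def __init__(self, is_leaf):
        self.is_leaf = is_leaf
        self.entries = []      # CF triples: at most L in a leaf, at most B otherwise
        self.children = []     # child CFNodes, parallel to entries (non-leaf only)
        self.next_leaf = None  # leaves are chained to allow a fast leaf scan
```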

7
BIRCH
  • Phase 1: Scan the dataset once and build a CF tree
    in memory
  • Phase 2 (optional): Condense the CF tree into a
    smaller CF tree
  • Phase 3: Global clustering
  • Phase 4 (optional): Cluster refinement (requires one
    more scan of the dataset)

8
BIRCH
9
Building CF Tree (Phase 1)
  • The CF of a single data point (3,4) is (1, (3,4), 25)
  • To insert a point into the tree:
  • Find the path (based on the D0–D4 distance between
    the CF entries of the children in each non-leaf node)
  • Modify the leaf:
  • Find the closest leaf-node entry (based on the D0–D4
    distance between the CF entries in the leaf)
  • Check whether it can absorb the new data point
  • Modify the path down to the leaf
  • Splitting: if the leaf node is full, split it into
    two leaf nodes and add one more entry to the parent
    (see the sketch below)
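A sketch of the leaf step under these rules (illustrative only; it reuses the CF class above, uses D0 between centroids as the distance, and checks absorption via the merged radius against T):

```python
import math

def d0(p, q):
    """Euclidean distance between two centroids (the D0 metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def insert_into_leaf(entries, x, T, L):
    """Insert point x into a leaf (a list of CF entries).

    Returns None if x was absorbed or added in place; otherwise returns the
    entry list of a new leaf created by splitting. Sketch only: real BIRCH
    also refreshes the CFs on the root-to-leaf path and updates the parent.
    """
    cf_x = CF.of_point(x)
    if entries:
        # Closest existing entry, by D0 between centroids.
        i = min(range(len(entries)),
                key=lambda k: d0(entries[k].centroid(), cf_x.centroid()))
        merged = entries[i].add(cf_x)
        if merged.radius() < T:        # the entry can absorb the point
            entries[i] = merged
            return None
    entries.append(cf_x)               # otherwise the point starts a new entry
    if len(entries) <= L:
        return None
    # Split: seed the two new leaves with the farthest-apart pair of entries,
    # then attach every other entry to the nearer seed.
    a, b = max(((i, j) for i in range(len(entries))
                       for j in range(i + 1, len(entries))),
               key=lambda p: d0(entries[p[0]].centroid(),
                                entries[p[1]].centroid()))
    left, right = [entries[a]], [entries[b]]
    for k, e in enumerate(entries):
        if k not in (a, b):
            (left if d0(e.centroid(), left[0].centroid())
                  <= d0(e.centroid(), right[0].centroid()) else right).append(e)
    entries[:] = left
    return right
```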

10
Building CF Tree (Phase 1)
(Diagram: each non-leaf node entry holds the sum of the
CF = (N, LS, SS) triples of all its children; each leaf
node holds CF = (N, LS, SS) entries, each satisfying
D < T or R < T.)
11
Condensing CF Tree (Phase 2)
  • Choose a larger threshold T
  • Consider only the entries in the leaf nodes
  • Reinsert the CF entries into the new tree
    (simplified sketch below)
  • If an entry's new path comes before its original
    path, move the entry to the new path
  • If the new path is the same as the original path,
    leave the entry unchanged
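A deliberately simplified sketch of condensing (it reuses CF and d0 from above and flattens the tree to a plain entry list; the paper's in-place rebuild keeps the tree shape and applies the "new path before old path" rule to avoid extra memory):

```python
def condense(old_leaf_entries, new_T):
    """Phase 2 sketch: reinsert old leaf entries under a larger threshold."""
    new_entries = []
    for cf in old_leaf_entries:
        if new_entries:
            # Closest entry built so far, by D0 between centroids.
            i = min(range(len(new_entries)),
                    key=lambda k: d0(new_entries[k].centroid(), cf.centroid()))
            merged = new_entries[i].add(cf)
            if merged.radius() < new_T:   # the larger T lets entries coalesce
                new_entries[i] = merged
                continue
        new_entries.append(cf)
    return new_entries
```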

12
Global Clustering (Phase 3)
  • Consider the CF entries in the leaf nodes only
  • Use the centroid as the representative of a cluster
  • Perform traditional clustering (e.g. agglomerative
    hierarchical clustering with complete link over D2,
    K-means, or CL); a weighted K-means sketch follows
  • Cluster the CFs instead of the raw data points
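One possibility (a sketch, not the paper's exact Phase 3 algorithm): weighted K-means over the leaf-entry centroids, with each centroid weighted by its entry's N so that large CF entries pull harder than small ones:

```python
import random

def global_clustering(leaf_entries, k, iters=20):
    """Weighted K-means over CF centroids (uses CF and d0 from the sketches above)."""
    pts = [e.centroid() for e in leaf_entries]
    wts = [e.n for e in leaf_entries]    # each centroid stands in for N points
    centers = random.sample(pts, k)
    assign = []
    for _ in range(iters):
        # Assign every CF centroid to its nearest center ...
        assign = [min(range(k), key=lambda c: d0(p, centers[c])) for p in pts]
        # ... then recompute each center as the weighted mean of its members.
        for c in range(k):
            members = [(p, w) for p, w, a in zip(pts, wts, assign) if a == c]
            if members:
                total = sum(w for _, w in members)
                centers[c] = [sum(p[i] * w for p, w in members) / total
                              for i in range(len(members[0][0]))]
    return centers, assign
```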

13
Cluster Refining (Phase 4)
  • Requires one more scan of the dataset
  • Uses the clusters found in Phase 3 as seeds
  • Redistributes the data points to their closest seeds
    to form the new clusters (sketched below)
  • Allows removal of outliers
  • Yields cluster-membership information for each point
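A minimal sketch of this redistribution pass (illustrative names; `seeds` are the Phase 3 centers, and d0 is the helper defined earlier):

```python
def refine(dataset, seeds):
    """Phase 4 sketch: one extra scan reassigning raw points to the seeds.

    Outlier removal could be added here by dropping points whose distance
    to their closest seed exceeds some threshold.
    """
    clusters = [[] for _ in seeds]
    for x in dataset:                        # the one additional scan
        i = min(range(len(seeds)), key=lambda c: d0(x, seeds[c]))
        clusters[i].append(x)                # membership info as a by-product
    return clusters
```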

14
Performance
15
Visualization
16
Conclusion
  • A clustering algorithm that takes I/O cost and
    memory limitations into account
  • Utilizes local information (each clustering decision
    is made without scanning all data points)
  • Not every data point is equally important for the
    clustering result