Title: BIRCH: An Efficient Data Clustering Method for Very Large Databases
1 Advanced Topics in Database Management (CS6203)
BIRCH: An Efficient Data Clustering Method for Very Large Databases
Chen Yabing (HT00-6078H), Cong Gao (HD99-2943W)
2 Outline
- Introduction
- Relevant Research
- Background Knowledge
- Clustering Feature and CF Tree
- BIRCH Clustering Algorithm
- Performance Analysis
- Summary and Future Research
3 Introduction
- Definition of data clustering
  - Given the desired number of clusters K and a distance-based measurement function, we are asked to find a partition of the dataset that minimizes the value of the measurement function.
- Database-oriented constraint
  - The amount of memory available is limited, and we want to minimize the time required for I/O.
4 Introduction (Cont.)
- A clustering method: BIRCH
- Balanced Iterative Reducing and Clustering using Hierarchies
- I/O cost is linear in the size of the dataset: a single scan of the dataset yields a good clustering.
- Offers opportunities for parallelism, and for interactive or dynamic performance tuning based on knowledge about the dataset.
- First algorithm proposed in the database area that addresses outliers (data points that should be regarded as noise) and proposes a plausible solution.
5 Relevant Research
- Probability-based approaches
  - They assume that probability distributions on separate attributes are statistically independent of each other.
  - The probability-based tree that is built to identify clusters is not height-balanced.
6 Relevant Research (Cont.)
- Distance-based approaches
  - Global or semi-global methods at the granularity of data points.
  - Assume that all data points are given in advance and can be scanned frequently.
  - Ignore the fact that not all data points in the dataset are equally important.
  - None of them has linear time scalability with stable quality.
7 Contributions of BIRCH
- BIRCH is local: each clustering decision is made without scanning all data points.
- BIRCH exploits the observation that the data space is usually not uniformly occupied, and hence not every data point is equally important for clustering purposes.
- BIRCH makes full use of available memory to derive the finest possible subclusters (to ensure accuracy) while minimizing I/O costs (to ensure efficiency).
8 Background Knowledge
- Given N d-dimensional data points {X_i} in a cluster, where i = 1, 2, ..., N, the centroid X0, radius R and diameter D are defined as
    X0 = ( sum_{i=1..N} X_i ) / N
    R  = ( sum_{i=1..N} (X_i - X0)^2 / N )^{1/2}
    D  = ( sum_{i=1..N} sum_{j=1..N} (X_i - X_j)^2 / (N(N-1)) )^{1/2}
- We treat X0, R and D as properties of a single cluster.
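The three definitions above can be checked with a small sketch (not from the slides; plain Python over tuples of floats, and the function names are mine):

```python
import math

def centroid(points):
    # X0 = (sum of the N data points) / N, computed per dimension
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]

def radius(points):
    # R = sqrt( sum_i (Xi - X0)^2 / N ): average distance to the centroid
    x0 = centroid(points)
    n = len(points)
    return math.sqrt(sum(sum((p[k] - x0[k]) ** 2 for k in range(len(p)))
                         for p in points) / n)

def diameter(points):
    # D = sqrt( sum_i sum_j (Xi - Xj)^2 / (N(N-1)) ): average pairwise distance
    n = len(points)
    total = sum(sum((pi[k] - pj[k]) ** 2 for k in range(len(pi)))
                for pi in points for pj in points)
    return math.sqrt(total / (n * (n - 1)))
```

For the two points (0,0) and (2,0) this gives centroid (1,0), R = 1 and D = 2.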
9 Background Knowledge (Cont.)
- Next, between two clusters, we define five alternative distances for measuring their closeness.
- Given the centroids X01 and X02 of two clusters, the two alternative distances D0 and D1 are defined as
    D0 = ( (X01 - X02)^2 )^{1/2}   (centroid Euclidean distance)
    D1 = sum_{k=1..d} | X01^(k) - X02^(k) |   (centroid Manhattan distance)
10 Background Knowledge (Cont.)
- Given N1 d-dimensional data points {X_i} in a cluster, where i = 1, 2, ..., N1, and N2 data points {X_j} in another cluster, where j = N1+1, ..., N1+N2, the average inter-cluster distance D2, average intra-cluster distance D3 and variance increase distance D4 of the two clusters are defined as
    D2 = ( sum_{i=1..N1} sum_{j=N1+1..N1+N2} (X_i - X_j)^2 / (N1 N2) )^{1/2}
    D3 = ( sum_{i=1..N1+N2} sum_{j=1..N1+N2} (X_i - X_j)^2 / ((N1+N2)(N1+N2-1)) )^{1/2}
    D4 = sum_{k=1..N1+N2} (X_k - X0_merged)^2 - sum_{i=1..N1} (X_i - X01)^2 - sum_{j=N1+1..N1+N2} (X_j - X02)^2
- Actually, D3 is the diameter D of the merged cluster.
- D0, D1, D2, D3 and D4 are all measurements of the closeness of two clusters; we can use any of them to decide whether two clusters are close.
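As a sanity check, D0, D1 and D2 can be sketched directly from these definitions (illustrative Python, not part of the slides):

```python
import math

def d0(c1, c2):
    # D0: Euclidean distance between the two centroids
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

def d1(c1, c2):
    # D1: Manhattan distance between the two centroids
    return sum(abs(a - b) for a, b in zip(c1, c2))

def d2(cluster1, cluster2):
    # D2: average inter-cluster distance over all cross pairs of points
    total = sum(sum((a - b) ** 2 for a, b in zip(p, q))
                for p in cluster1 for q in cluster2)
    return math.sqrt(total / (len(cluster1) * len(cluster2)))
```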
11 Clustering Feature
- A Clustering Feature is a triple summarizing the information that we maintain about a cluster.
- Definition: Given N d-dimensional data points {X_i} in a cluster, where i = 1, 2, ..., N, the Clustering Feature (CF) vector of the cluster is defined as a triple CF = (N, LS, SS), where N is the number of data points in the cluster, LS is the linear sum of the N data points, i.e., LS = sum_{i=1..N} X_i, and SS is the square sum of the N data points, i.e., SS = sum_{i=1..N} X_i^2.
12 Clustering Feature (Cont.)
- CF Additive Theorem
- Assume that CF1 = (N1, LS1, SS1) and CF2 = (N2, LS2, SS2) are the CF vectors of two disjoint clusters. Then the CF vector of the cluster formed by merging the two disjoint clusters is
    CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
- From the CF definition and the additive theorem, we know that the corresponding X0, R, D, D0, D1, D2, D3 and D4 can all be calculated easily.
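The additive theorem, and the fact that X0 and R fall out of the (N, LS, SS) triple alone, can be sketched as follows (illustrative Python; the class name and methods are mine, not from the paper):

```python
import math

class CF:
    """Clustering Feature: the (N, LS, SS) summary of a cluster."""
    def __init__(self, n, ls, ss):
        self.n, self.ls, self.ss = n, ls, ss

    @classmethod
    def from_point(cls, p):
        # a single point is a cluster with N = 1
        return cls(1, list(p), sum(x * x for x in p))

    def __add__(self, other):
        # Additive theorem: CF1 + CF2 = (N1+N2, LS1+LS2, SS1+SS2)
        return CF(self.n + other.n,
                  [a + b for a, b in zip(self.ls, other.ls)],
                  self.ss + other.ss)

    def centroid(self):
        # X0 = LS / N
        return [x / self.n for x in self.ls]

    def radius(self):
        # R^2 = SS/N - ||LS/N||^2, so R needs only the CF triple
        r2 = self.ss / self.n - sum(x * x for x in self.centroid())
        return math.sqrt(max(r2, 0.0))
```

Merging clusters therefore never requires revisiting their points: only the three summary fields are combined.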
13 CF Tree
- A CF tree is a height-balanced tree with two parameters:
- Branching factor B: each nonleaf node contains at most B entries of the form (CFi, child_i), where i = 1, 2, ..., B, child_i is a pointer to its i-th child node, and CFi is the CF of the subcluster represented by this child. A leaf node contains at most L entries, each of the form (CFi), where i = 1, 2, ..., L.
- Initial threshold T: an initial threshold for the radius (or diameter) of any cluster. The value of T is an upper bound on the radius of any leaf cluster and controls the number of clusters that the algorithm discovers.
14 CF Tree (Cont.)
- The CF tree is built dynamically as new data objects are inserted. It is used to guide a new insertion into the correct subcluster for clustering purposes. In effect, all clusters are formed as each data point is inserted into the CF tree.
- The CF tree is a very compact representation of the dataset, because each entry in a leaf node is not a single data point but a subcluster.
15 Insertion into a CF Tree
- We now present the algorithm for inserting an entry into a CF tree. Given an entry Ent, it proceeds as below.
- 1. Identifying the appropriate leaf: starting from the root, according to a chosen distance metric (D0 to D4 as defined before), it recursively descends the CF tree by choosing the closest child node.
- 2. Modifying the leaf: when it reaches a leaf node, it finds the closest leaf entry and tests whether that entry can absorb Ent without violating the threshold condition.
- If so, the CF vector of the entry is updated to reflect this.
- If not, a new entry for Ent is added to the leaf.
- If there is space on the leaf for this new entry, we are done.
- Otherwise we must split the leaf node. Node splitting is done by choosing the farthest pair of entries as seeds and redistributing the remaining entries based on the closest criteria.
16 Insertion into a CF Tree (Cont.)
- 3. Modifying the path to the leaf: after inserting Ent into a leaf, we must update the CF information for each nonleaf entry on the path to the leaf.
- In the absence of a split, this simply involves adding CF vectors to reflect the addition of Ent.
- A leaf split requires us to insert a new nonleaf entry into the parent node, to describe the newly created leaf.
- If the parent has space for this entry, at all higher levels we only need to update the CF vectors to reflect the addition of Ent.
- Otherwise, we may have to split the parent as well, and so on up to the root.
17 Insertion into a CF Tree (Cont.)
- 4. Merging refinement: in the presence of skewed data input order, splits can degrade the clustering quality and reduce space utilization. A simple additional merging step often helps ameliorate these problems. Suppose the propagation of one split stops at some nonleaf node Nj, i.e., Nj can accommodate the additional entry resulting from the split.
- Scan Nj to find the two closest entries.
- If they are not the pair corresponding to the split, merge them.
- If the merged entry holds more entries than one page can hold, split it again.
- During the resplitting, one of the seeds attracts enough merged entries to fill a page; the other receives the rest of the entries.
- In summary, this improves the distribution of entries in the two closest children, and may even free a node's space for later use.
18 Insertion into a CF Tree (Cont.)
- Notes: some problems are resolved only in later phases of the entire algorithm.
- Depending upon the order of data input and the degree of skew, two subclusters that should not be in one cluster may be kept in the same node. This will be remedied by a global algorithm that rearranges leaf entries across nodes (Phase 3).
- If the same data point is inserted twice, but at two different times, the two copies might be entered into distinct leaf entries. This will be addressed by further refinement passes over the data (Phase 4).
19 Figure: Effect of Split, Merge and Resplit
[Figure omitted: a sequence of CF-tree node diagrams (entries CF1-CF9) showing a leaf split, a merge of the two closest entries, and a resplit.]
20 BIRCH Clustering Algorithm: Overview
Data -> Phase 1: Load into memory by building a CF tree -> Initial CF tree
21 BIRCH Clustering Algorithm: Phases 2, 3 and 4
- Phase 2: Condense into a desirable range by building a smaller CF tree
  - To meet the input size range required by Phase 3.
- Phase 3: Global clustering
  - To solve problem 1 above.
  - Approach: re-cluster all subclusters using existing global or semi-global clustering algorithms.
- Phase 4: Cluster refining
  - To solve problem 2 above.
  - Approach: use the centroids of the clusters produced by Phase 3 as seeds, and redistribute the data points to their closest seed to obtain a set of new clusters. This can use existing algorithms.
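The Phase 4 redistribution step amounts to a single nearest-seed assignment pass (a minimal sketch in Python; the function name is mine):

```python
import math

def redistribute(points, seeds):
    """Assign every data point to its closest seed (a Phase 3 centroid)."""
    clusters = [[] for _ in seeds]
    for p in points:
        best = min(range(len(seeds)), key=lambda i: math.dist(seeds[i], p))
        clusters[best].append(p)
    return clusters
```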
22 BIRCH Clustering Algorithm: Control Flow of Phase 1
- Start with a CF tree t1 of initial threshold T.
- Continue scanning the data and inserting into t1.
- If memory runs out before the scan finishes:
  - Increase T.
  - Rebuild a CF tree t2 with the new T from CF tree t1: if a leaf entry of t1 is a potential outlier and disk space is available, write it to disk; otherwise use it to rebuild t2.
  - t1 <- t2, then continue scanning.
- If disk space runs out, re-absorb potential outliers into t1.
- When all data have been scanned, re-absorb potential outliers into t1.
23 BIRCH Clustering Algorithm: Keys of Phase 1
- Keys of Phase 1
- Rebuilding the CF tree
- Threshold values
- Outlier-handling option
- Delay-split option
24 BIRCH Clustering Phase 1: Rebuilding
- What to do?
  - Use all the leaf entries of the old CF tree to rebuild a new CF tree with a larger threshold.
- What is a path?
  - A leaf node corresponds to a unique path from the root.
- What is the rebuilding algorithm?
  - The algorithm scans and frees the old tree path by path, and creates the new tree path by path.
- Theorem
  - The size of the rebuilt tree must be no larger than that of the old tree.
  - The transformation from the old tree to the new tree needs at most h extra pages of memory, where h is the height of the old tree.
25 BIRCH Clustering Phase 1: Rebuilding Algorithm
- Start rebuilding from the leftmost path in the old tree.
- For the current OldCurrentPath (OCP), create the corresponding NewCurrentPath (NCUP) in the new tree (the new tree has no chance to become larger than the old one).
- Find the NewClosestPath (NCLP) for each leaf entry in the OCP: if the NCLP can absorb the leaf entry under the new threshold, insert the leaf entry into the NCLP; otherwise insert it into the NCUP.
- Free the processed space in both the OCP and the NCUP.
- Move on to the next path; when no path remains, end.
26 BIRCH Clustering Phase 1: Rebuilding the CF Tree
[Figure omitted: old and new CF trees processed path by path; the leaf paths are labeled (1,1)(1,2)(1,3) (2,1)(2,2)(2,3) (3,1)(3,2)(3,4).]
27 BIRCH Clustering Phase 1: Threshold Value
- Heuristic approach to increasing the threshold: try to choose the new threshold value so that the number of data points that can be scanned under it roughly doubles.
- Approach 1: find the most crowded leaf node, and choose the new threshold so that the closest two entries on that leaf can be merged under it.
- Approach 2: assume that the volume occupied by the leaf clusters grows linearly with the number of data points. From a series of (number of data points, volume) pairs, predict the volume at the next rebuild point by least-squares linear regression, and derive the new threshold from that predicted volume.
- Use some heuristic method to adjust the two candidate thresholds above and choose one.
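The regression step of Approach 2 can be sketched as ordinary least squares on the recorded (points, volume) pairs (illustrative Python; converting the predicted volume back into a radius threshold is left out):

```python
def predict_volume(history, next_points):
    """history: list of (num_data_points, leaf_cluster_volume) pairs.
    Fit volume = a * points + b by least squares, then extrapolate."""
    n = len(history)
    sx = sum(x for x, _ in history)
    sy = sum(y for _, y in history)
    sxx = sum(x * x for x, _ in history)
    sxy = sum(x * y for x, y in history)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope
    b = (sy - a * sx) / n                          # intercept
    return a * next_points + b
```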
28 BIRCH Clustering Phase 1: Outlier-Handling Option
- What is an outlier? A leaf entry of low density that is judged to be unimportant to the overall clustering pattern. Some disk space is used for handling outliers.
- When is an entry written out as an outlier? When rebuilding the CF tree, an old leaf entry is written to disk if it is considered to be a potential outlier. This reduces the size of the CF tree.
29 BIRCH Clustering Phase 1: Outlier-Handling Option (Cont.)
- When is a potential outlier no longer qualified as an outlier?
  - When the threshold value increases.
  - When the distribution changes because more data have been read.
- When do we scan the potential outliers to re-absorb those that no longer qualify, without causing the tree to grow in size?
  - When the disk space runs out.
  - When all data points have been scanned.
30 BIRCH Clustering Phase 1: Delay-Split Option
- When we run out of memory, there may still be data points that could fit in the current CF tree. We can continue to read data points, writing to disk those data points that would require splitting a node, until the disk space runs out. The advantage of this approach is that more data points can fit in the tree before we have to rebuild it.
31 Performance Studies
- Complexity analysis
- Experiments with synthetic datasets
- Performance comparison of BIRCH and CLARANS on synthetic datasets
- Experiments with real datasets
32 Performance Studies: Complexity Analysis
- Time complexity of Phase 1
  - Inserting all data: O(dimension x number of data points x number of entries per node x number of nodes on a path from the root to a leaf).
  - Reinserting leaf entries: O(number of rebuilds x dimension x number of leaf entries to reinsert x number of entries per node x number of nodes on a path from the root to a leaf).
- Time complexity of the other phases
  - Phase 2 is similar to Phase 1.
  - Phases 3 and 4 depend on the adopted methods.
33 Performance Studies: Experiments with Synthetic Datasets
- Method to generate synthetic datasets
  - Given some parameters: the number of clusters, the number of data points in each cluster, the radius and center of each cluster, the noise rate and the input order.
- Parameter settings
  - Memory, disk, distance definition, threshold definition, initial threshold, delay-split, page size, outlier-handling, outlier definition.
- Sensitivity to parameters
  - Initial threshold.
  - Page size: affects the running time, the ending threshold, the number of leaf entries and the quality.
34 Performance Studies: Performance Comparison of BIRCH and CLARANS on Synthetic Datasets
35 Performance Studies: Experiments with Real Datasets
- BIRCH has been used for filtering real images. The input dataset is a set of brightness-value pairs from two similar images taken at different wavelengths. The output clusters correspond to different parts of the images; for example, a tree image can be filtered into sunlit leaves, shadow and branches. The result proves to be satisfactory.
36 Summary and Future Research
- More reasonable ways of increasing the threshold dynamically.
- Dynamic adjustment of the outlier criteria.
- More accurate quality measurements.
- Data parameters that are good indicators of how well BIRCH is likely to perform.
37 THANK YOU!
www.comp.nus.edu.sg/conggao/cs6203