Title: BIRCH: An Efficient Data Clustering Method for Very Large Databases
1 Advanced Topics in Database Management (CS6203)
BIRCH: An Efficient Data Clustering Method for Very Large Databases
Chen Yabing (HT00-6078H), Cong Gao (HD99-2943W)
2 Outline
- Introduction
- Relevant Research
- Background Knowledge
- Clustering Feature and CF Tree
- BIRCH Clustering Algorithm
- Performance Analysis
- Summary and Future Research
3 Introduction
- Definition of data clustering
  - Given the desired number of clusters K and a distance-based measurement function, we are asked to find a partition of the dataset that minimizes the value of the measurement function.
- Database-oriented constraint
  - The amount of memory available is limited, and we want to minimize the time required for I/O.
4 Introduction (Cont.)
- A clustering method: BIRCH
- Balanced Iterative Reducing and Clustering using Hierarchies
- I/O cost is linear in the size of the dataset: a single scan of the dataset yields a good clustering.
- Offers opportunities for parallelism, and for interactive or dynamic performance tuning based on knowledge about the dataset.
- First algorithm proposed in the database area that addresses outliers (data points that should be regarded as noise) and proposes a plausible solution.
5 Relevant Research
- Probability-based approaches
  - They assume that probability distributions on separate attributes are statistically independent of each other.
  - The probability-based tree that is built to identify clusters is not height-balanced.
6 Relevant Research (Cont.)
- Distance-based approaches
  - Global or semi-global methods at the granularity of data points.
  - Assume that all data points are given in advance and can be scanned frequently.
  - Ignore the fact that not all data points in the dataset are equally important.
  - None of them has linear time scalability with stable quality.
7 Contributions of BIRCH
- BIRCH is local: each clustering decision is made without scanning all data points.
- BIRCH exploits the observation that the data space is usually not uniformly occupied, and hence not every data point is equally important for clustering purposes.
- BIRCH makes full use of available memory to derive the finest possible subclusters (to ensure accuracy) while minimizing I/O costs (to ensure efficiency).
8 Background Knowledge
- Given N d-dimensional data points {X_i} in a cluster, where i = 1, 2, ..., N, the centroid X0, radius R and diameter D are defined as
    X0 = ( sum_{i=1..N} X_i ) / N
    R  = ( sum_{i=1..N} (X_i - X0)^2 / N )^{1/2}
    D  = ( sum_{i=1..N} sum_{j=1..N} (X_i - X_j)^2 / (N(N-1)) )^{1/2}
- We treat X0, R and D as properties of a single cluster.
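The three definitions above can be checked with a small sketch (not from the slides; plain Python over tuples of floats, and the function names are mine):

```python
import math

def centroid(points):
    # X0 = (sum of the N data points) / N, computed per dimension
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]

def radius(points):
    # R = sqrt( sum_i (Xi - X0)^2 / N ): average distance to the centroid
    x0 = centroid(points)
    n = len(points)
    return math.sqrt(sum(sum((p[k] - x0[k]) ** 2 for k in range(len(p)))
                         for p in points) / n)

def diameter(points):
    # D = sqrt( sum_i sum_j (Xi - Xj)^2 / (N(N-1)) ): average pairwise distance
    n = len(points)
    total = sum(sum((pi[k] - pj[k]) ** 2 for k in range(len(pi)))
                for pi in points for pj in points)
    return math.sqrt(total / (n * (n - 1)))
```

For the two points (0,0) and (2,0) this gives centroid (1,0), R = 1 and D = 2.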
9 Background Knowledge (Cont.)
- Next, between two clusters, we define five alternative distances for measuring their closeness.
- Given the centroids X01 and X02 of two clusters, the two alternative distances D0 and D1 are defined as
    D0 = ( (X01 - X02)^2 )^{1/2}   (centroid Euclidean distance)
    D1 = sum_{k=1..d} | X01^(k) - X02^(k) |   (centroid Manhattan distance)
10 Background Knowledge (Cont.)
- Given N1 d-dimensional data points {X_i} in a cluster, where i = 1, 2, ..., N1, and N2 data points {X_j} in another cluster, where j = N1+1, ..., N1+N2, the average inter-cluster distance D2, average intra-cluster distance D3 and variance increase distance D4 of the two clusters are defined as
    D2 = ( sum_{i=1..N1} sum_{j=N1+1..N1+N2} (X_i - X_j)^2 / (N1 N2) )^{1/2}
    D3 = ( sum_{i=1..N1+N2} sum_{j=1..N1+N2} (X_i - X_j)^2 / ((N1+N2)(N1+N2-1)) )^{1/2}
    D4 = sum_{k=1..N1+N2} (X_k - X0_merged)^2 - sum_{i=1..N1} (X_i - X01)^2 - sum_{j=N1+1..N1+N2} (X_j - X02)^2
- Actually, D3 is the diameter D of the merged cluster.
- D0, D1, D2, D3 and D4 are all measurements of the closeness of two clusters; we can use any of them to decide whether two clusters are close.
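As a sanity check, D0, D1 and D2 can be sketched directly from these definitions (illustrative Python, not part of the slides):

```python
import math

def d0(c1, c2):
    # D0: Euclidean distance between the two centroids
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

def d1(c1, c2):
    # D1: Manhattan distance between the two centroids
    return sum(abs(a - b) for a, b in zip(c1, c2))

def d2(cluster1, cluster2):
    # D2: average inter-cluster distance over all cross pairs of points
    total = sum(sum((a - b) ** 2 for a, b in zip(p, q))
                for p in cluster1 for q in cluster2)
    return math.sqrt(total / (len(cluster1) * len(cluster2)))
```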
11 Clustering Feature
- A Clustering Feature is a triple summarizing the information that we maintain about a cluster.
- Definition: Given N d-dimensional data points {X_i} in a cluster, where i = 1, 2, ..., N, the Clustering Feature (CF) vector of the cluster is defined as a triple CF = (N, LS, SS), where N is the number of data points in the cluster, LS is the linear sum of the N data points, i.e., LS = sum_{i=1..N} X_i, and SS is the square sum of the N data points, i.e., SS = sum_{i=1..N} X_i^2.
12 Clustering Feature (Cont.)
- CF Additive Theorem
- Assume that CF1 = (N1, LS1, SS1) and CF2 = (N2, LS2, SS2) are the CF vectors of two disjoint clusters. Then the CF vector of the cluster formed by merging the two disjoint clusters is
    CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
- From the CF definition and the additive theorem, we know that the corresponding X0, R, D, D0, D1, D2, D3 and D4 can all be calculated easily.
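The additive theorem, and the fact that X0 and R fall out of the (N, LS, SS) triple alone, can be sketched as follows (illustrative Python; the class name and methods are mine, not from the paper):

```python
import math

class CF:
    """Clustering Feature: the (N, LS, SS) summary of a cluster."""
    def __init__(self, n, ls, ss):
        self.n, self.ls, self.ss = n, ls, ss

    @classmethod
    def from_point(cls, p):
        # a single point is a cluster with N = 1
        return cls(1, list(p), sum(x * x for x in p))

    def __add__(self, other):
        # Additive theorem: CF1 + CF2 = (N1+N2, LS1+LS2, SS1+SS2)
        return CF(self.n + other.n,
                  [a + b for a, b in zip(self.ls, other.ls)],
                  self.ss + other.ss)

    def centroid(self):
        # X0 = LS / N
        return [x / self.n for x in self.ls]

    def radius(self):
        # R^2 = SS/N - ||LS/N||^2, so R needs only the CF triple
        r2 = self.ss / self.n - sum(x * x for x in self.centroid())
        return math.sqrt(max(r2, 0.0))
```

Merging clusters therefore never requires revisiting their points: only the three summary fields are combined.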
13 CF Tree
- A CF tree is a height-balanced tree with two parameters:
- Branching factor B: each nonleaf node contains at most B entries of the form (CFi, child_i), where i = 1, 2, ..., B, child_i is a pointer to its i-th child node, and CFi is the CF of the subcluster represented by this child. A leaf node contains at most L entries, each of the form (CFi), where i = 1, 2, ..., L.
- Initial threshold T: an initial threshold for the radius (or diameter) of any cluster. The value of T is an upper bound on the radius of any leaf cluster and controls the number of clusters that the algorithm discovers.
14 CF Tree (Cont.)
- The CF tree is built dynamically as new data objects are inserted. It is used to guide a new insertion into the correct subcluster for clustering purposes. In effect, all clusters are formed as each data point is inserted into the CF tree.
- The CF tree is a very compact representation of the dataset, because each entry in a leaf node is not a single data point but a subcluster.
15 Insertion into a CF Tree
- We now present the algorithm for inserting an entry into a CF tree. Given an entry Ent, it proceeds as below.
- 1. Identifying the appropriate leaf: starting from the root, according to a chosen distance metric (D0 to D4 as defined before), it recursively descends the CF tree by choosing the closest child node.
- 2. Modifying the leaf: when it reaches a leaf node, it finds the closest leaf entry and tests whether that entry can absorb Ent without violating the threshold condition.
- If so, the CF vector of the entry is updated to reflect this.
- If not, a new entry for Ent is added to the leaf.
- If there is space on the leaf for this new entry, we are done.
- Otherwise we must split the leaf node. Node splitting is done by choosing the farthest pair of entries as seeds and redistributing the remaining entries based on the closest criteria.
16 Insertion into a CF Tree (Cont.)
- 3. Modifying the path to the leaf: after inserting Ent into a leaf, we must update the CF information for each nonleaf entry on the path to the leaf.
- In the absence of a split, this simply involves adding CF vectors to reflect the addition of Ent.
- A leaf split requires us to insert a new nonleaf entry into the parent node, to describe the newly created leaf.
- If the parent has space for this entry, at all higher levels we only need to update the CF vectors to reflect the addition of Ent.
- Otherwise, we may have to split the parent as well, and so on up to the root.
17 Insertion into a CF Tree (Cont.)
- 4. Merging refinement: in the presence of skewed data input order, splits can degrade the clustering quality and reduce space utilization. A simple additional merging step often helps ameliorate these problems. Suppose the propagation of one split stops at some nonleaf node Nj, i.e., Nj can accommodate the additional entry resulting from the split.
- Scan Nj to find the two closest entries.
- If they are not the pair corresponding to the split, merge them.
- If the merged entry holds more entries than one page can hold, split it again.
- During the resplitting, one of the seeds attracts enough merged entries to fill a page; the other receives the rest of the entries.
- In summary, this improves the distribution of entries in the two closest children, and may even free a node's space for later use.
18 Insertion into a CF Tree (Cont.)
- Notes: some problems are resolved only in later phases of the entire algorithm.
- Depending upon the order of data input and the degree of skew, two subclusters that should not be in one cluster may be kept in the same node. This will be remedied by a global algorithm that rearranges leaf entries across nodes (Phase 3).
- If the same data point is inserted twice, but at two different times, the two copies might be entered into distinct leaf entries. This will be addressed by further refinement passes over the data (Phase 4).
19 Figure: Effect of Split, Merge and Resplit
[Figure omitted: a sequence of CF-tree node diagrams (entries CF1-CF9) showing a leaf split, a merge of the two closest entries, and a resplit.]
20 BIRCH Clustering Algorithm: Overview
Data -> Phase 1: Load into memory by building a CF tree -> Initial CF tree
21 BIRCH Clustering Algorithm: Phases 2, 3 and 4
- Phase 2: Condense into a desirable range by building a smaller CF tree
  - To meet the input size range required by Phase 3.
- Phase 3: Global clustering
  - To solve problem 1 above.
  - Approach: re-cluster all subclusters using existing global or semi-global clustering algorithms.
- Phase 4: Cluster refining
  - To solve problem 2 above.
  - Approach: use the centroids of the clusters produced by Phase 3 as seeds, and redistribute the data points to their closest seed to obtain a set of new clusters. This can use existing algorithms.
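The Phase 4 redistribution step amounts to a single nearest-seed assignment pass (a minimal sketch in Python; the function name is mine):

```python
import math

def redistribute(points, seeds):
    """Assign every data point to its closest seed (a Phase 3 centroid)."""
    clusters = [[] for _ in seeds]
    for p in points:
        best = min(range(len(seeds)), key=lambda i: math.dist(seeds[i], p))
        clusters[best].append(p)
    return clusters
```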
22 BIRCH Clustering Algorithm: Control Flow of Phase 1
- Start with a CF tree t1 of initial threshold T.
- Continue scanning the data and inserting into t1.
- If memory runs out before the scan finishes:
  - Increase T.
  - Rebuild a CF tree t2 with the new T from CF tree t1: if a leaf entry of t1 is a potential outlier and disk space is available, write it to disk; otherwise use it to rebuild t2.
  - t1 <- t2, then continue scanning.
- If disk space runs out, re-absorb potential outliers into t1.
- When all data have been scanned, re-absorb potential outliers into t1.
23 BIRCH Clustering Algorithm: Keys of Phase 1
- Keys of Phase 1
- Rebuilding the CF tree
- Threshold values
- Outlier-handling option
- Delay-split option
24 BIRCH Clustering Phase 1: Rebuilding
- What to do?
  - Use all the leaf entries of the old CF tree to rebuild a new CF tree with a larger threshold.
- What is a path?
  - A leaf node corresponds to a unique path from the root.
- What is the rebuilding algorithm?
  - The algorithm scans and frees the old tree path by path, and creates the new tree path by path.
- Theorem
  - The size of the rebuilt tree must be no larger than that of the old tree.
  - The transformation from the old tree to the new tree needs at most h extra pages of memory, where h is the height of the old tree.
25 BIRCH Clustering Phase 1: Rebuilding Algorithm
- Start rebuilding from the leftmost path in the old tree.
- For the current OldCurrentPath (OCP), create the corresponding NewCurrentPath (NCUP) in the new tree (the new tree has no chance to become larger than the old one).
- Find the NewClosestPath (NCLP) for each leaf entry in the OCP: if the NCLP can absorb the leaf entry under the new threshold, insert the leaf entry into the NCLP; otherwise insert it into the NCUP.
- Free the processed space in both the OCP and the NCUP.
- Move on to the next path; when no path remains, end.
26 BIRCH Clustering Phase 1: Rebuilding the CF Tree
[Figure omitted: old and new CF trees processed path by path; the leaf paths are labeled (1,1)(1,2)(1,3) (2,1)(2,2)(2,3) (3,1)(3,2)(3,4).]
27 BIRCH Clustering Phase 1: Threshold Value
- Heuristic approach to increasing the threshold: try to choose the new threshold value so that the number of data points that can be scanned under it roughly doubles.
- Approach 1: find the most crowded leaf node, and choose the new threshold so that the closest two entries on that leaf can be merged under it.
- Approach 2: assume that the volume occupied by the leaf clusters grows linearly with the number of data points. From a series of (number of data points, volume) pairs, predict the volume at the next rebuild point by least-squares linear regression, and derive the new threshold from that predicted volume.
- Use some heuristic method to adjust the two candidate thresholds above and choose one.
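The regression step of Approach 2 can be sketched as ordinary least squares on the recorded (points, volume) pairs (illustrative Python; converting the predicted volume back into a radius threshold is left out):

```python
def predict_volume(history, next_points):
    """history: list of (num_data_points, leaf_cluster_volume) pairs.
    Fit volume = a * points + b by least squares, then extrapolate."""
    n = len(history)
    sx = sum(x for x, _ in history)
    sy = sum(y for _, y in history)
    sxx = sum(x * x for x, _ in history)
    sxy = sum(x * y for x, y in history)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope
    b = (sy - a * sx) / n                          # intercept
    return a * next_points + b
```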
28 BIRCH Clustering Phase 1: Outlier-Handling Option
- What is an outlier? A leaf entry of low density that is judged to be unimportant to the overall clustering pattern. Some disk space is used for handling outliers.
- When is an entry written out as an outlier? When rebuilding the CF tree, an old leaf entry is written to disk if it is considered to be a potential outlier. This reduces the size of the CF tree.
29 BIRCH Clustering Phase 1: Outlier-Handling Option (Cont.)
- When is a potential outlier no longer qualified as an outlier?
  - When the threshold value increases.
  - When the distribution changes because more data have been read.
- When do we scan the potential outliers to re-absorb those that no longer qualify, without causing the tree to grow in size?
  - When the disk space runs out.
  - When all data points have been scanned.
30 BIRCH Clustering Phase 1: Delay-Split Option
- When we run out of memory, there may still be data points that could fit in the current CF tree. We can continue to read data points, writing to disk those data points that would require splitting a node, until the disk space runs out. The advantage of this approach is that more data points can fit in the tree before we have to rebuild it.
31 Performance Studies
- Complexity analysis
- Experiments with synthetic datasets
- Performance comparison of BIRCH and CLARANS on synthetic datasets
- Experiments with real datasets
32 Performance Studies: Complexity Analysis
- Time complexity of Phase 1
  - Inserting all data: O(dimension x number of data points x number of entries per node x number of nodes on a path from the root to a leaf).
  - Reinserting leaf entries: O(number of rebuilds x dimension x number of leaf entries to reinsert x number of entries per node x number of nodes on a path from the root to a leaf).
- Time complexity of the other phases
  - Phase 2 is similar to Phase 1.
  - Phases 3 and 4 depend on the adopted methods.
33 Performance Studies: Experiments with Synthetic Datasets
- Method to generate synthetic datasets
  - Given some parameters: the number of clusters, the number of data points in each cluster, the radius and center of each cluster, the noise rate and the input order.
- Parameter settings
  - Memory, disk, distance definition, threshold definition, initial threshold, delay-split, page size, outlier-handling, outlier definition.
- Sensitivity to parameters
  - Initial threshold.
  - Page size: affects the running time, the ending threshold, the number of leaf entries and the quality.
34 Performance Studies: Performance Comparison of BIRCH and CLARANS on Synthetic Datasets
35 Performance Studies: Experiments with Real Datasets
- BIRCH has been used for filtering real images. The input dataset is a set of brightness-value pairs from two similar images taken at different wavelengths. The output clusters correspond to different parts of the images; for example, a tree image can be filtered into sunlit leaves, shadow and branches. The result proves to be satisfactory.
36 Summary and Future Research
- More reasonable ways of increasing the threshold dynamically.
- Dynamic adjustment of the outlier criteria.
- More accurate quality measurements.
- Data parameters that are good indicators of how well BIRCH is likely to perform.
37 THANK YOU!
www.comp.nus.edu.sg/conggao/cs6203