BIRCH: An Efficient Data Clustering Method for Very Large Databases
1
Advanced Topics in Database Management --- CS6203
  • BIRCH: An Efficient Data Clustering Method for
    Very Large Databases

Chen Yabing (HT00-6078H)    Cong Gao (HD99-2943W)
2
Outline
  • Introduction
  • Relevant Research
  • Background Knowledge
  • Clustering Feature and CF Tree
  • BIRCH Clustering Algorithm
  • Performance Analysis
  • Summary and Future Research

3
Introduction
  • Definition of data clustering
  • Given the desired number of clusters K and a
    distance-based measurement function, we are asked
    to find a partition of the dataset that minimizes
    the value of the measurement function.
  • Database-oriented constraint
  • The amount of memory available is limited and
    we want to minimize the time required for I/O.


4
Introduction (Cont.)
  • A clustering method ------ BIRCH
  • Balanced Iterative Reducing and Clustering
    using Hierarchies
  • I/O cost is linear in the size of the dataset:
    a single scan of the dataset yields a good
    clustering.
  • Offers opportunities for parallelism, and for
    interactive or dynamic performance tuning based
    on knowledge about the dataset.
  • First algorithm proposed in the database area
    that addresses outliers (data points that should
    be regarded as noise) and proposes a plausible
    solution.

5
Relevant Research
  • Probability-based approaches.
  • They make the assumption that probability
    distributions on separate attributes are
    statistically independent of each other
  • The probability-based tree that is built to
    identify clusters is not height-balanced.

6
Relevant Research (Cont.)
  • Distance-based approaches.
  • Global or semi-global methods at the granularity
    of data points.
  • Assume that all data points are given in advance
    and can be scanned frequently.
  • Ignore the fact that not all data points in the
    dataset are equally important.
  • None of them have linear time scalability with
    stable quality.

7
Contributions of BIRCH
  • BIRCH is local in that each clustering decision
    is made without scanning all data points.
  • BIRCH exploits the observation that the data
    space is usually not uniformly occupied, and
    hence not every data point is equally important
    for clustering purposes.
  • BIRCH makes full use of available memory to
    derive the finest possible subclusters (to
    ensure accuracy) while minimizing I/O costs (to
    ensure efficiency).

8
Background knowledge
  • Given N d-dimensional data points \(\{\vec{X_i}\}\) in a cluster,
    where i = 1, 2, ..., N, the centroid, radius and
    diameter are defined as
    \[ \vec{X_0} = \frac{\sum_{i=1}^{N} \vec{X_i}}{N} \]
    \[ R = \left( \frac{\sum_{i=1}^{N} (\vec{X_i} - \vec{X_0})^2}{N} \right)^{1/2} \]
    \[ D = \left( \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (\vec{X_i} - \vec{X_j})^2}{N(N-1)} \right)^{1/2} \]
  • We treat \(\vec{X_0}\), R and D as properties of a single
    cluster.
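
A minimal sketch (Python/NumPy, illustrative data of our own) of these
three quantities computed directly from the raw points:

```python
import numpy as np

# A small illustrative cluster: N = 5 points in d = 2 dimensions.
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2], [0.8, 1.9], [1.1, 2.1]])
N = len(X)

X0 = X.sum(axis=0) / N                  # centroid
R = np.sqrt(((X - X0) ** 2).sum() / N)  # radius

# Diameter: RMS of all pairwise distances (the i = j terms are zero).
diff = X[:, None, :] - X[None, :, :]
D = np.sqrt((diff ** 2).sum() / (N * (N - 1)))

print(X0, R, D)
```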

9
Background knowledge (Cont.)
  • Next, between two clusters, we define five
    alternative distances for measuring their
    closeness.
  • Given the centroids \(\vec{X_{01}}\) and \(\vec{X_{02}}\) of two
    clusters, the two alternative distances D0 and D1 are
    defined as
    \[ D_0 = \left( (\vec{X_{01}} - \vec{X_{02}})^2 \right)^{1/2} \quad \text{(Euclidean distance)} \]
    \[ D_1 = \sum_{i=1}^{d} \left| \vec{X_{01}}^{(i)} - \vec{X_{02}}^{(i)} \right| \quad \text{(Manhattan distance)} \]

10
Background knowledge (Cont.)
  • Given N1 d-dimensional data points \(\{\vec{X_i}\}\) in a
    cluster, where i = 1, 2, ..., N1, and N2 data
    points \(\{\vec{X_j}\}\) in another cluster, where
    j = N1+1, N1+2, ..., N1+N2, the average inter-cluster
    distance D2, average intra-cluster distance D3
    and variance increase distance D4 of the two
    clusters are defined as
    \[ D_2 = \left( \frac{\sum_{i=1}^{N_1} \sum_{j=N_1+1}^{N_1+N_2} (\vec{X_i} - \vec{X_j})^2}{N_1 N_2} \right)^{1/2} \]
    \[ D_3 = \left( \frac{\sum_{i=1}^{N_1+N_2} \sum_{j=1}^{N_1+N_2} (\vec{X_i} - \vec{X_j})^2}{(N_1+N_2)(N_1+N_2-1)} \right)^{1/2} \]
    \[ D_4 = \sum_{k=1}^{N_1+N_2} \left( \vec{X_k} - \frac{\sum_{l=1}^{N_1+N_2} \vec{X_l}}{N_1+N_2} \right)^2 - \sum_{i=1}^{N_1} \left( \vec{X_i} - \frac{\sum_{l=1}^{N_1} \vec{X_l}}{N_1} \right)^2 - \sum_{j=N_1+1}^{N_1+N_2} \left( \vec{X_j} - \frac{\sum_{l=N_1+1}^{N_1+N_2} \vec{X_l}}{N_2} \right)^2 \]
  • Actually, D3 is the diameter D of the merged cluster.
  • D0, D1, D2, D3 and D4 are all measurements of the
    closeness of two clusters. We can use them to
    determine if two clusters are close.
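
A minimal sketch (Python/NumPy, illustrative data of our own) computing
D0, D1 and D2 directly from the raw points of two small clusters:

```python
import numpy as np

# Two small illustrative clusters.
C1 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
C2 = np.array([[5.0, 5.0], [6.0, 5.0], [5.0, 6.0]])
X01, X02 = C1.mean(axis=0), C2.mean(axis=0)  # the two centroids

D0 = np.sqrt(((X01 - X02) ** 2).sum())  # Euclidean centroid distance
D1 = np.abs(X01 - X02).sum()            # Manhattan centroid distance

# D2: average inter-cluster distance over all N1*N2 cross pairs.
diff = C1[:, None, :] - C2[None, :, :]
D2 = np.sqrt((diff ** 2).sum() / (len(C1) * len(C2)))

print(D0, D1, D2)
```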

11
Clustering Feature
  • A Clustering Feature is a triple summarizing the
    information that we maintain about a cluster.
  • Definition. Given N d-dimensional data points
    \(\{\vec{X_i}\}\) in a cluster, where i = 1, 2, ..., N, the
    Clustering Feature (CF) vector of the cluster is
    defined as a triple \(CF = (N, \vec{LS}, SS)\), where N is the
    number of data points in the cluster, \(\vec{LS}\) is
    the linear sum of the N data points, i.e., \(\sum_{i=1}^{N} \vec{X_i}\),
    and SS is the square sum of the N data points,
    i.e., \(\sum_{i=1}^{N} \vec{X_i}^2\).

12
Clustering Feature (Cont.)
  • CF Additivity Theorem
  • Assume that \(CF_1 = (N_1, \vec{LS_1}, SS_1)\) and
    \(CF_2 = (N_2, \vec{LS_2}, SS_2)\) are the CF vectors of two
    disjoint clusters. Then the CF vector of the cluster that
    is formed by merging the two disjoint clusters is
    \[ CF_1 + CF_2 = (N_1 + N_2,\; \vec{LS_1} + \vec{LS_2},\; SS_1 + SS_2) \]
  • From the CF definition and additivity theorem, we
    know that the corresponding \(\vec{X_0}\), R, D, D0, D1,
    D2, D3 and D4 can all be calculated easily.
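
A minimal sketch (Python/NumPy; the function names are our own) of a CF
triple, the additivity theorem, and how the centroid and radius fall out
of (N, LS, SS) without revisiting the raw points; here SS is kept as the
scalar sum of squared norms:

```python
import numpy as np

def make_cf(points):
    """Build a CF triple (N, LS, SS) from raw data points."""
    points = np.asarray(points, dtype=float)
    return len(points), points.sum(axis=0), float((points ** 2).sum())

def merge_cf(cf1, cf2):
    """Additivity theorem: CF1 + CF2 = (N1+N2, LS1+LS2, SS1+SS2)."""
    return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]

def centroid(cf):
    n, ls, _ = cf
    return ls / n

def radius(cf):
    """Expanding the radius definition gives R^2 = SS/N - ||LS/N||^2."""
    n, ls, ss = cf
    return np.sqrt(max(ss / n - ((ls / n) ** 2).sum(), 0.0))

cf_a = make_cf([[0, 0], [1, 1]])
cf_b = make_cf([[2, 2], [3, 3]])
cf_ab = merge_cf(cf_a, cf_b)           # CF of the merged cluster
print(centroid(cf_ab), radius(cf_ab))  # no raw points needed
```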

13
CF Tree
  • A CF tree is a height-balanced tree with two
    parameters:
  • Branching factor B. Each nonleaf node contains at
    most B entries of the form [CFi, childi], where
    i = 1, 2, ..., B, childi is a pointer to its i-th child
    node, and CFi is the CF of the sub-cluster
    represented by this child. A leaf node contains
    at most L entries, each of the form [CFi], where
    i = 1, 2, ..., L.
  • Initial threshold T, which is an initial
    threshold for the radius (or diameter) of any
    cluster. The value of T is an upper bound on the
    radius of any cluster and controls the number of
    clusters that the algorithm discovers.
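
A minimal sketch (Python dataclasses; the names and the values of B and
L are our own illustrative choices) of the two node types these
parameters imply; the paper also chains leaf nodes with prev/next
pointers for efficient scans:

```python
from dataclasses import dataclass, field
from typing import Optional

B = 4  # branching factor: max (CF, child) entries per nonleaf node (assumed)
L = 4  # max CF entries per leaf node (assumed)

@dataclass
class LeafNode:
    entries: list = field(default_factory=list)  # up to L CF triples
    prev: Optional["LeafNode"] = None            # leaf-chain pointers,
    next: Optional["LeafNode"] = None            # for efficient scans

@dataclass
class NonLeafNode:
    entries: list = field(default_factory=list)  # up to B (CF, child) pairs
```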

14
CF Tree (Cont.)
  • The CF tree is built dynamically as new data
    objects are inserted. It is used to guide each new
    insertion into the correct subcluster for
    clustering purposes. In effect, all clusters are
    formed as each data point is inserted into the CF
    tree.
  • The CF tree is a very compact representation of the
    dataset because each entry in a leaf node is not
    a single data point but a subcluster.

15
Insertion into a CF Tree
  • We now present the algorithm for inserting an
    entry into a CF tree. Given entry Ent, it
    proceeds as below:
  • 1. Identifying the appropriate leaf:
    Starting from the root, according to a chosen
    distance metric (D0 to D4 as defined before), it
    recursively descends the CF tree by choosing the
    closest child node.
  • 2. Modifying the leaf: When it reaches a
    leaf node, it finds the closest leaf entry, and
    tests whether that entry can absorb Ent without
    violating the threshold condition.
  • If so, the CF vector of the entry is updated to
    reflect this.
  • If not, a new entry for Ent is added to the leaf.
  • If there is space on the leaf for this new entry,
    we are done.
  • Otherwise we must split the leaf node. Node
    splitting is done by choosing the farthest pair
    of entries as seeds, and redistributing the
    remaining entries based on the closest criteria.
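
A simplified sketch of step 2 (Python/NumPy; the names and the threshold
value are our own, and tree descent and node splitting are omitted):

```python
import numpy as np

T = 0.5  # threshold: upper bound on the radius of a leaf entry (assumed)

def radius(cf):
    """Radius of a CF triple (N, LS, SS): R^2 = SS/N - ||LS/N||^2."""
    n, ls, ss = cf
    return np.sqrt(max(ss / n - ((ls / n) ** 2).sum(), 0.0))

def insert_into_leaf(entries, point):
    """entries: list of CF triples; point: 1-D array-like."""
    point = np.asarray(point, dtype=float)
    pt_ss = float((point ** 2).sum())
    if entries:
        # Find the closest leaf entry (here by centroid distance, D0).
        i = min(range(len(entries)),
                key=lambda k: np.linalg.norm(entries[k][1] / entries[k][0]
                                             - point))
        n, ls, ss = entries[i]
        merged = (n + 1, ls + point, ss + pt_ss)  # CF additivity
        if radius(merged) <= T:                   # threshold condition holds
            entries[i] = merged                   # absorb the point
            return
    # Otherwise start a new entry (splitting a full leaf is not shown).
    entries.append((1, point, pt_ss))

leaf = []
for p in [[0.0, 0.0], [0.1, 0.1], [5.0, 5.0]]:
    insert_into_leaf(leaf, p)
print(len(leaf))  # the far-away point starts a second entry -> 2
```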

16
Insertion into a CF Tree (Cont.)
  • 3. Modifying the path to the leaf: After inserting
    Ent into a leaf, we must update the CF
    information for each nonleaf entry on the path to
    the leaf.
  • In the absence of a split, this simply involves
    adding CF vectors to reflect the addition of
    Ent.
  • A leaf split requires us to insert a new nonleaf
    entry into the parent node, to describe the newly
    created leaf.
  • If the parent has space for this entry, then at
    all higher levels we only need to update the CF
    vectors to reflect the addition of Ent.
  • Otherwise, we may have to split the parent as
    well, and so on up to the root.

17
Insertion into a CF Tree (Cont.)
  • 4. Merging refinement: In the presence of skewed
    data input order, splits can affect the
    clustering quality, and also reduce space
    utilization. A simple additional merge step often
    helps ameliorate these problems: suppose the
    propagation of one split stops at some nonleaf
    node Nj, i.e., Nj can accommodate the additional
    entry resulting from the split.
  • Scan Nj to find the two closest entries.
  • If they are not the pair corresponding to the
    split, merge them.
  • If the merged node has more entries than one page
    can hold, split it again.
  • During the resplitting, if one of the seeds
    attracts enough merged entries to fill a page,
    the other receives the remaining entries.
  • In summary, this improves the distribution of
    entries in the two closest children, and may even
    free a node for later use.

18
Insertion into a CF Tree (Cont.)
  • Notes: some problems here will be resolved in
    later phases of the entire algorithm.
  • Problem 1: Depending upon the order of data input
    and the degree of skew, two subclusters that should
    not be in one cluster may be kept in the same node.
    This will be remedied by a global algorithm that
    rearranges leaf entries across nodes (Phase 3).
  • Problem 2: If the same data point is inserted twice,
    but at two different times, the two copies might be
    entered into distinct leaf entries. This will be
    addressed with further refinement passes over the
    data (Phase 4).

19
(Figure: effect of split, merge and resplit on a node of leaf entries
CF1-CF9 — an overfull node is split, the two closest entries are then
merged if they are not the pair produced by the split, and the merged
node is resplit if it overflows a page.)
20
BIRCH Clustering Algorithm - Overview
Data → Phase 1 (load into memory by building a CF tree) → initial CF tree
21
BIRCH Clustering Algorithm Phases 2, 3 and 4
  • Phase 2: Condense into a desirable range by
    building a smaller CF tree
  • To meet the input size range required by Phase 3
  • Phase 3: Global clustering
  • To solve Problem 1 above
  • Approach: re-cluster all subclusters by using an
    existing global or semi-global clustering
    algorithm
  • Phase 4: Cluster refining
  • To solve Problem 2 above
  • Approach: use the centroids of the clusters
    produced by Phase 3 as seeds, and redistribute
    the data points to their closest seeds to obtain
    a set of new clusters (a sketch follows). This
    can use an existing algorithm.
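
A minimal sketch of the Phase 4 redistribution step (Python/NumPy; the
function name and the data are our own illustration, not the paper's
code):

```python
import numpy as np

def refine(points, seeds):
    """Phase 4 sketch: redistribute each point to its closest seed."""
    points = np.asarray(points, dtype=float)
    seeds = np.asarray(seeds, dtype=float)
    # Distance from every point to every seed; pick the nearest seed.
    dists = np.linalg.norm(points[:, None, :] - seeds[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Duplicate copies of the same point now land in the same cluster,
    # which addresses Problem 2.
    return [points[labels == k] for k in range(len(seeds))]

clusters = refine([[0, 0], [0.2, 0], [4, 4], [4.1, 4.2]], [[0, 0], [4, 4]])
print([len(c) for c in clusters])  # [2, 2]
```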

22
BIRCH Clustering Algorithm Control Flow of
Phase 1
  • Start with a CF tree t1 of initial threshold T.
  • Continue scanning the data and inserting into t1.
  • If memory runs out before the scan finishes:
  • Increase T.
  • Rebuild a CF tree t2 with the new T from t1: if a
    leaf entry of t1 is a potential outlier and disk
    space is available, write it to disk; otherwise
    use it to rebuild t2.
  • t1 ← t2, then resume scanning.
  • If disk space runs out, re-absorb the potential
    outliers into t1 and continue scanning.
  • After all data points have been scanned, re-absorb
    the potential outliers into t1.
23
BIRCH Clustering Algorithm Keys of Phase 1
  • Keys of Phase 1
  • Rebuilding CF tree
  • Threshold values
  • Outlier-handling Option
  • Delay-Split Option

24
BIRCH Clustering Phase 1: Rebuilding
  • What do we do?
  • Use all the leaf entries of the old CF tree to
    rebuild a new CF tree with a larger threshold.
  • What is a path?
  • Each leaf node corresponds uniquely to a path
    from the root (see the rebuilding figure on a
    later slide).
  • What is the rebuilding algorithm?
  • The algorithm scans and frees the old tree path
    by path, and creates the new tree path by path
    (see the next slide).
  • Theorem
  • The rebuilt tree can be no larger than the old
    tree.
  • The transformation from the old tree to the new
    tree needs at most h extra pages of memory, where
    h is the height of the old tree.

25
BIRCH Clustering Phase 1: Rebuilding Algorithm
  • Start rebuilding from the leftmost path in the
    old tree.
  • For an OldCurrentPath (OCP), create the
    corresponding NewCurrentPath (NCUP) in the new
    tree (which has no chance of becoming larger than
    the old tree).
  • Find the NewClosestPath (NCLP) for each leaf
    entry in OCP.
  • If the leaf entry fits in an NCLP that precedes
    NCUP under the new threshold, insert it into NCLP;
    otherwise insert it into NCUP.
  • Free space in both OCP and NCUP, then move to the
    next path until all paths of the old tree have
    been processed.
26
BIRCH Clustering Phase 1: Rebuilding the CF Tree
(Figure: the old tree is scanned and freed path by path — leaf paths
labeled (1,1)(1,2)(1,3) (2,1)(2,2)(2,3) (3,1)(3,2)(3,4) — while the
corresponding paths are created in the new tree.)
27
BIRCH Clustering Phase 1: Threshold Value
Heuristic approach to increasing the threshold: try
to choose the new threshold value so that the number
of data points that can be scanned under it roughly
doubles.
  • Approach 1: find the most crowded leaf node, and
    choose the smallest new threshold under which the
    closest two entries on that leaf can be merged.
  • Approach 2: assume that the volume occupied by
    the leaf clusters grows linearly with the number
    of data points. From a series of (number of data
    points, volume) value pairs, predict the new
    volume by least-squares linear regression, and
    derive the new threshold from it (a sketch
    follows).
Some heuristic methods are then used to adjust the
two candidate thresholds and choose one.
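
A minimal sketch of Approach 2 (Python/NumPy; the function name, the
data, and the volume ∝ T^d conversion are our own simplifying
assumptions):

```python
import numpy as np

def next_threshold(counts, volumes, d):
    """Fit volume as a linear function of the number of points scanned,
    extrapolate to twice the current count, and convert the predicted
    volume back to a radius-style threshold via volume ~ T**d."""
    a, b = np.polyfit(counts, volumes, deg=1)    # least-squares regression
    predicted_volume = a * (2 * counts[-1]) + b  # volume at doubled count
    return predicted_volume ** (1.0 / d)

# Illustrative history of (points scanned, total leaf-cluster volume).
print(next_threshold([100, 200, 300], [1.0, 1.9, 3.1], d=2))
```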
28
BIRCH Clustering Phase 1: Outlier-Handling Option
  • What is an outlier? A leaf entry of low density
    that is judged to be unimportant to the overall
    clustering pattern. Some disk space is used for
    handling outliers.
  • When is an entry written out as an outlier? When
    rebuilding the CF tree, an old leaf entry is
    written to disk if it is considered to be a
    potential outlier. This reduces the size of the
    CF tree.

29
BIRCH Clustering Phase 1: Outlier-Handling Option
  • When does a potential outlier no longer qualify
    as an outlier?
  • When the threshold value increases, or
  • when the distribution changes because more data
    has been read.
  • When do we scan the potential outliers to
    re-absorb those that no longer qualify, without
    causing the tree to grow in size?
  • When disk space runs out, or
  • when all data points have been scanned.

30
BIRCH Clustering Phase 1: Delay-Split Option
  • When we run out of memory, there may still be
    data points that can fit in the current CF tree.
    We can continue to read data points and write
    those that would require splitting a node to
    disk, until disk space also runs out. The
    advantage of this approach is that more data
    points can fit in the tree before we have to
    rebuild.

31
Performance Studies
  • Complexity Analysis
  • Experiment with Synthetic Datasets
  • Performance Comparisons of BIRCH and CLARANS with
    Synthetic Datasets
  • Experiment with Real Datasets

32
Performance Studies-Complexity Analysis
  • Time complexity analysis for Phase 1
  • Inserting all data points: O(dimension × number
    of data points × number of entries per node ×
    number of nodes on a root-to-leaf path)
  • Reinserting leaf entries: O(number of rebuilds ×
    dimension × number of leaf entries to reinsert ×
    number of entries per node × number of nodes on a
    root-to-leaf path)
  • Time complexity analysis for the other phases
  • Phase 2 is similar to Phase 1
  • Phases 3 and 4 depend on the methods adopted

33
Performance Studies-Experiment with Synthetic
Datasets
  • Method used to generate synthetic datasets
  • Controlled by parameters: the number of clusters,
    the number of data points in each cluster, the
    radius and center of each cluster, the noise rate
    and the input order
  • Parameter settings
  • Memory, disk, distance definition, threshold
    definition, initial threshold, delay-split
    option, page size, outlier-handling option,
    outlier definition
  • Sensitivity to parameters
  • Initial threshold
  • Page size: affects running time, the ending
    threshold, the number of leaf entries, and
    clustering quality

34
Performance Studies-Performance Comparisons of
BIRCH and CLARANS with Synthetic Datasets
35
Performance Studies-Experiment with Real Datasets
  • BIRCH has been used for filtering real images.
    The input dataset is a set of pairs of brightness
    values from two similar images taken at different
    wavelengths. The output clusters are different
    parts of the images; for example, a tree image
    can be filtered into sunlit leaves, shadow, and
    branches. The results prove to be satisfactory.

36
Summary and Future Research
  • More reasonable ways of increasing the threshold
    dynamically
  • Dynamic adjustment of the outlier criteria
  • More accurate quality measurements
  • Data parameters that are good indicators of how
    well BIRCH is likely to perform

37
THANK YOU!
  • COMMENTS ARE WELCOME!

www.comp.nus.edu.sg/conggao/cs6203