Scalable Clustering of Categorical Data - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Scalable Clustering of Categorical Data
  • HKU CS Database Research Seminar
  • August 11th, 2004
  • Panagiotis Karras

2
The Problem
  • Clustering is a problem of great importance.
  • Partitioning data into groups so that similar
    objects are grouped together.
  • Clustering of numerical data is well studied.
  • Clustering of categorical data is more
    challenging: there is no inherent distance measure.

3
An Example
  • A Movie Relation
  • Distance or similarity between values is not
    immediately obvious.

4
Some Information Theory
  • A Mutual Information measure is employed.
  • Clusters should be informative about the data
    they contain.
  • Given a cluster, we should be able to predict the
    attribute values of its objects accurately.
  • Information loss should be minimized.

5
An Example
  • In the Movie Relation, clustering C is better
    than clustering D according to this measure
    (why?).

6
The Information Bottleneck Method
  • Formalized by Tishby et al. [1999].
  • Clustering is viewed as the compression of one
    random variable that preserves as much information
    as possible about another.
  • Conditional entropy of A given T, H(A|T) (see
    below).
  • Captures the uncertainty of predicting the values
    of A given the values of T.
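
The formula itself does not survive in this transcript; the standard definition of the conditional entropy referred to above is:

    H(A \mid T) = -\sum_{t} p(t) \sum_{a} p(a \mid t) \log p(a \mid t)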

7
The Information Bottleneck Method
  • Mutual Information quantifies the amount of
    information that two variables convey about each
    other [Shannon, 1948] (see below).
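
The slide's formula is likewise missing from the transcript; the standard expression for the mutual information between A and T is:

    I(A; T) = H(A) - H(A \mid T) = \sum_{t}\sum_{a} p(t, a) \log \frac{p(t, a)}{p(t)\, p(a)}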

8
The Information Bottleneck Method
  • A set of n tuples on m attributes.
  • Let d be the size of the set of all possible
    attribute values.
  • Then the data can be conceptualized as an n × d
    matrix M.
  • M[t, a] = 1 iff tuple t contains value a.
  • Rows of the normalized M contain the conditional
    probability distributions p(A | t).
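
As an illustration (not from the slides), the sketch below builds such a matrix for a tiny made-up relation and row-normalizes it; the specific tuples and value names are hypothetical, and only the 0/1 construction and the normalization follow the description above.

    import numpy as np

    # A small, made-up categorical relation (tuples x attributes).
    tuples = [
        ("Scorsese",  "De Niro",  "Crime"),
        ("Coppola",   "De Niro",  "Crime"),
        ("Hitchcock", "Stewart",  "Thriller"),
    ]

    # d = number of distinct attribute values over the whole relation.
    values = sorted({v for t in tuples for v in t})
    col = {v: j for j, v in enumerate(values)}

    n, d = len(tuples), len(values)
    M = np.zeros((n, d))
    for i, t in enumerate(tuples):
        for v in t:
            M[i, col[v]] = 1          # M[t, a] = 1 iff tuple t contains value a

    # Row-normalize: each row becomes the conditional distribution p(A | t).
    P = M / M.sum(axis=1, keepdims=True)
    print(P)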

9
The Information Bottleneck Method
  • In the Movie Relation Example

10
The Information Bottleneck Method
  • Clustering is the problem of maximizing the
    Mutual Information I(A;C) between attribute values
    and cluster identities, for a given number k of
    clusters [Tishby et al., 1999].
  • Finding the optimal clustering is NP-complete.
  • Agglomerative Information Bottleneck was proposed
    by Slonim and Tishby [1999].
  • It starts with n clusters and merges two clusters
    at each step, so that the loss in I(A;C) is
    minimized (see the sketch below).
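
A minimal Python sketch of the agglomerative merging idea (not from the slides), assuming the rows of P are the per-tuple distributions p(A|t) from the previous sketch and that the merge cost is the weighted Jensen-Shannon information loss used by AIB; the function and variable names are my own, and the quadratic greedy loop only illustrates the idea, not an efficient implementation.

    import numpy as np

    def merge_loss(p1, q1, p2, q2):
        """Information loss in I(A;C) incurred by merging two clusters:
        p1, p2 are cluster probabilities, q1, q2 the distributions p(A | c)."""
        p = p1 + p2
        pi1, pi2 = p1 / p, p2 / p
        q = pi1 * q1 + pi2 * q2                      # distribution of the merged cluster

        def kl(a, b):                                # Kullback-Leibler divergence (base 2)
            mask = a > 0
            return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

        js = pi1 * kl(q1, q) + pi2 * kl(q2, q)       # weighted Jensen-Shannon divergence
        return p * js

    def agglomerative_ib(P, k):
        """Greedy AIB: start with one cluster per tuple and repeatedly merge the
        pair of clusters whose merge loses the least mutual information."""
        n = P.shape[0]
        # Each cluster: (p(c), p(A | c), member tuple ids); tuples are equiprobable.
        clusters = [(1.0 / n, P[i].copy(), [i]) for i in range(n)]
        while len(clusters) > k:
            best = None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    loss = merge_loss(clusters[i][0], clusters[i][1],
                                      clusters[j][0], clusters[j][1])
                    if best is None or loss < best[0]:
                        best = (loss, i, j)
            _, i, j = best
            p = clusters[i][0] + clusters[j][0]
            q = (clusters[i][0] * clusters[i][1] + clusters[j][0] * clusters[j][1]) / p
            clusters[i] = (p, q, clusters[i][2] + clusters[j][2])
            del clusters[j]
        return clusters

For example, agglomerative_ib(P, k=2) would reduce the three toy tuples above to two clusters.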

11
LIMBO Clustering
  • scaLable InforMation BOttleneck.
  • Keeps only sufficient statistics in memory.
  • Builds a compact summary model of the data.
  • Clustering is performed on the model.

12
What is a DCF?
  • A cluster c is summarized in a Distributional
    Cluster Feature (DCF).
  • A DCF is the pair of the probability p(c) of
    cluster c and the conditional probability
    distribution p(A|c) of attribute values given c.
  • The distance between DCFs is defined as the
    Information Loss incurred by merging the
    corresponding clusters, computed via the
    Jensen-Shannon divergence (see below).
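
Written out (a reconstruction in the standard AIB form, since the slide's own formula is not in the transcript), the distance between two DCFs is:

    d(c_1, c_2) = \big[ p(c_1) + p(c_2) \big] \cdot D_{JS}\!\big[ p(A \mid c_1),\, p(A \mid c_2) \big],
    \quad \text{with JS weights } \pi_i = \frac{p(c_i)}{p(c_1) + p(c_2)}.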

13
The DCF Tree
  • Height-balanced tree of branching factor B.
  • DCFs at the leaves define the clustering of tuples.
  • Non-leaf nodes hold the merged DCFs of their
    children.
  • A compact hierarchical summarization of the data.

14
The LIMBO algorithm
  • Three phases.
  • Phase 1: insertion into the DCF tree.
  • Each tuple t is converted to DCF(t).
  • It follows a path downward in the tree, along the
    closest non-leaf DCFs.
  • At the leaf level, let DCF(c) be the entry closest
    to DCF(t).
  • If there is an empty entry in the leaf of DCF(c),
    DCF(t) is placed there.
  • If there is no empty entry but sufficient free
    space, the leaf is split into two halves, with the
    two farthest DCFs as seeds for the new leaves. The
    split propagates upward as necessary.
  • Else, if there is no space, the two closest DCFs
    among the leaf's entries and DCF(t) are merged.

15
The LIMBO algorithm
  • Phase 2: clustering.
  • For a given value of k, the DCF tree is used to
    produce k DCFs that serve as representatives of k
    clusters, employing the Agglomerative Information
    Bottleneck algorithm.
  • Phase 3: associating tuples with clusters.
  • A scan over the data set is performed and each
    tuple is assigned to the closest cluster (see the
    sketch below).
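
A minimal sketch of Phase 3 (not from the slides), reusing merge_loss and agglomerative_ib from the sketch on slide 10; it assigns each tuple to the representative whose merge with the tuple's own distribution loses the least information. The names are my own, and the choice of distance is an assumption consistent with the DCF distance above.

    import numpy as np

    def assign_tuples(P, representatives):
        """Phase 3: scan the tuples and assign each one to the closest
        cluster representative (least information loss on merging)."""
        n = P.shape[0]
        labels = []
        for i in range(n):
            costs = [merge_loss(1.0 / n, P[i], pc, qc)   # merge_loss from the AIB sketch
                     for pc, qc, _ in representatives]
            labels.append(int(np.argmin(costs)))
        return labels

    # Example: reps = agglomerative_ib(P, k=2); labels = assign_tuples(P, reps)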

16
Intra-Attribute Value Distance
  • How to define the distance between categorical
    attribute values of the same attribute?
  • Values should be placed within a context.
  • Similar values appear in similar contexts.
  • What is a suitable context?
  • The distribution that an attribute value induces
    on the remaining attributes.

17
Intra-Attribute Value Distance
  • The distance between two values is then defined
    as the Information Loss incurred about the other
    attributes if we merge these values (formula
    below).
  • In the Movie example, Scorsese and Coppola are
    the most similar directors.
  • Distance between tuples: the sum of the distances
    between their attribute values.
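
In the same form as the DCF distance above (a reconstruction; the slide's own formula is not in the transcript), for two values v and v' of an attribute, with à denoting the remaining attributes:

    d(v, v') = \big[ p(v) + p(v') \big] \cdot D_{JS}\!\big[ p(\tilde{A} \mid v),\, p(\tilde{A} \mid v') \big]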

18
Experiments - Algorithms
  • Four algorithms are compared:
  • ROCK. An agglomerative algorithm by Guha et al.
    [1999].
  • COOLCAT. A scalable non-hierarchical algorithm,
    the one most similar to LIMBO, by Barbará et al.
    [2002].
  • STIRR. A dynamical-systems approach using a
    hypergraph of weighted attribute values, by
    Gibson et al. [1998].
  • LIMBO. In addition to the space-bound version,
    LIMBO was implemented in an accuracy-control
    version, where a distance threshold is imposed on
    the decision of merging two DCFs, expressed as a
    multiple f of the average mutual information of
    all tuples. The two versions differ only in
    Phase 1.

19
Experiments Data Sets
  • The following data sets are used:
  • Congressional Votes (435 boolean tuples on 16
    issues, from 1984, classified as Democrat or
    Republican).
  • Mushroom (8,124 tuples with 22 attributes,
    classified as poisonous or edible).
  • Database and Theory bibliography (8,000 tuples on
    research papers with 4 attributes).
  • Synthetic data sets DS5 and DS10 (5,000 tuples,
    10 attributes, with 5 and 10 classes
    respectively).
  • Web data (web pages: a set of authority tuples
    whose attributes are the hubs that link to them).

20
Experiments - Quality Measures
  • Several measures are used to capture the
    subjectivity of clustering quality:
  • Information Loss. The lower, the better.
  • Category Utility. The difference between the
    expected number of correct guesses of attribute
    values with and without a clustering.
  • Min Classification Error. For tuples that are
    already classified.
  • Precision (P), Recall (R). P measures the
    accuracy with which a cluster reproduces a class,
    and R the completeness with which this is done
    (see the sketch below).
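
One way to compute P and R for labelled data, sketched under the assumption that each cluster is matched to its majority class (the paper's exact cluster-to-class matching may differ):

    from collections import Counter

    def precision_recall(cluster_labels, class_labels):
        """Per-cluster precision/recall against a reference classification.
        Each cluster is matched to its majority class (an assumption)."""
        class_sizes = Counter(class_labels)
        results = {}
        for c in set(cluster_labels):
            members = [cls for cl, cls in zip(cluster_labels, class_labels) if cl == c]
            majority_class, hits = Counter(members).most_common(1)[0]
            precision = hits / len(members)                 # how pure the cluster is
            recall = hits / class_sizes[majority_class]     # how much of the class it covers
            results[c] = (majority_class, precision, recall)
        return results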

21
Quality-Efficiency trade-offs for LIMBO
  • Whether we control the size (S) or the accuracy
    (f) of the model, there is a trade-off between
    expressiveness (large S, small f) and compactness
    (small S, large f).
  • For branching factor B = 4 we obtain:
  • For large S and small f, the bottleneck is Phase
    2.

22
Quality-Efficiency trade-offs for LIMBO
  • Still, in Phase 1 we can obtain significant
    compression of the data sets at no expense to the
    final quality.
  • This consistency can be attributed in part to the
    effect of Phase 3, which assigns tuples to
    cluster representatives.
  • Even for large values of f and small values of S,
    LIMBO obtains essentially the same clustering
    quality as AIB, but in linear time.

23
Comparative Evaluations
  • The table shows the results for all algorithms
    and all quality measures on the Votes and
    Mushroom data sets.
  • LIMBO's quality is superior to that of ROCK and
    COOLCAT.
  • COOLCAT comes closest to LIMBO.

24
Web Data
  • Authorities are clustered into three clusters,
    with an information loss of 61%.
  • LIMBO accurately characterizes the structure of
    the web graph.
  • The three clusters correspond to different
    viewpoints (pro, against, irrelevant).

25
Scalability Evaluation
  • Four data sets of 500K, 1M, 5M, and 10M tuples
    (10 clusters, 10 attributes each).
  • Phase 1 in detail for the accuracy-controlled
    version LIMBO_f:
  • For 1.0 < f < 1.5: manageable size, fast
    execution time.

26
Scalability Evaluation
  • We set f = 1.2, 1.3 and S = 1MB, 5MB.
  • Execution time scales linearly with the data set
    size.
  • Varying the number of attributes also yields
    linear behavior.

27
Scalability - Quality Results
  • The quality measures remain essentially the same
    across the different data set sizes.

28
Conclusions
  • LIMBO has advantages over other
    information-theoretic clustering algorithms in
    terms of scalability and quality.
  • LIMBO is the only hierarchical, scalable
    categorical clustering algorithm that is based on
    a compact summary model.

29
Main Reference
  • P. Andritsos, P. Tsaparas, R. J. Miller, and
    K. C. Sevcik. LIMBO: Scalable Clustering of
    Categorical Data. In 9th International Conference
    on Extending Database Technology (EDBT),
    Heraklion, Greece, 2004.

30
References
  • D. Barbará, J. Couto, and Y. Li. COOLCAT: An
    entropy-based algorithm for categorical
    clustering. In CIKM, McLean, VA, 2002.
  • D. Gibson, J. M. Kleinberg, and P. Raghavan.
    Clustering Categorical Data: An Approach Based on
    Dynamical Systems. In VLDB, New York, NY, 1998.
  • S. Guha, R. Rastogi, and K. Shim. ROCK: A Robust
    Clustering Algorithm for Categorical Attributes.
    In ICDE, Sydney, Australia, 1999.
  • C. Shannon. A Mathematical Theory of
    Communication, 1948.
  • N. Slonim and N. Tishby. Agglomerative
    Information Bottleneck. In NIPS, Breckenridge,
    1999.
  • N. Tishby, F. C. Pereira, and W. Bialek. The
    Information Bottleneck Method. In 37th Annual
    Allerton Conference on Communication, Control and
    Computing, Urbana-Champaign, IL, 1999.