1
Clustering Categorical Data Using Summaries
CACTUS
  • Venkatesh Ganti
  • joint work with
  • Johannes Gehrke and Raghu Ramakrishnan
  • (University of Wisconsin-Madison)

2
Introduction
  • Most research on clustering has focused on
    n-dimensional numeric data
  • e.g., BIRCH [ZRL96], CURE [GRS98], the clustering
    framework of [BFR98], WaveCluster [SCZ98], etc.
  • Data also consists of categorical attributes
  • e.g., the UC-Irvine collection of datasets
  • Problem: similarity functions are not defined for
    categorical data

3
CACTUS
  • Goal: a fast, scalable algorithm for discovering
    well-defined clusters
  • Similarity: use attribute value co-occurrence
    (as in STIRR [GKR98])
  • Speed and scalability: exploit the small domain
    sizes of categorical attributes

4
Preliminaries and Notation
  • A set of n categorical attributes with domains
    D1, …, Dn
  • A tuple consists of a value from each domain,
    e.g., (a1, b2, c1)
  • Dataset: a set of tuples

Note: the sizes of D1, …, Dn are typically very small
5
Similarity between attributes
  • Similarity between a1 and b1:
    support(a1, b1) = number of tuples containing (a1, b1)
  • a1 and b1 are strongly connected if
    support(a1, b1) is higher than expected
  • {a1, a2, a3, a4} and {b1, b2} are strongly
    connected if all pairs are

(Figure: an example of value sets that are not strongly connected)
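As an illustrative sketch of these definitions (the toy data, helper names, and the simple alpha-times-expected threshold are assumptions; the paper develops the threshold more carefully):

```python
from collections import Counter
from itertools import product

# Toy (A, B) projection of a dataset; values and counts are made up.
tuples = [(a, b) for a in ("a1", "a2") for b in ("b1", "b2")] * 2
tuples += [("a3", "b3"), ("a4", "b3")]

support = Counter(tuples)        # support(ai, bk) = co-occurrence count
n = len(tuples)
dom_a = {a for a, _ in tuples}   # attribute domains
dom_b = {b for _, b in tuples}

def strongly_connected(a, b, alpha=2.0):
    """(a, b) is strongly connected if its support exceeds alpha times
    the support expected when A and B are independent and uniform."""
    expected = n / (len(dom_a) * len(dom_b))
    return support[(a, b)] > alpha * expected

def sets_strongly_connected(sa, sb, alpha=2.0):
    """Two value sets are strongly connected if every cross pair is."""
    return all(strongly_connected(a, b, alpha) for a, b in product(sa, sb))

print(sets_strongly_connected({"a1", "a2"}, {"b1", "b2"}))  # True
```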
6
Similarity within an attribute
  • simA(b1, b2): the number of values of A which are
    strongly connected with both b1 and b2

sim(B)     thru A   thru C
(b1,b2)      4        2
(b1,b3)      0        2
(b1,b4)      0        0
(b2,b3)      0        2
(b2,b4)      0        0

(Figure: attributes A, B, C with strong-connection edges)
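The similarity values above can be recovered from the strongly connected pairs alone; a minimal sketch (toy data modeled on the example, names assumed):

```python
# ij_ab plays the role of the inter-attribute summary IJ(A, B):
# the set of strongly connected (A, B) value pairs (assumed toy data).
ij_ab = {("a1", "b1"), ("a2", "b1"), ("a3", "b1"), ("a4", "b1"),
         ("a1", "b2"), ("a2", "b2"), ("a3", "b2"), ("a4", "b2")}

def sim(ij, b1, b2):
    """simA(b1, b2): count the A-values strongly connected to both."""
    return len({a for a, b in ij if b == b1} &
               {a for a, b in ij if b == b2})

print(sim(ij_ab, "b1", "b2"))  # 4, as in the first row of the table
```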
7
Definitions
  • support(ai, bk) is the number of tuples that
    contain both ai and bk
  • ai and bk are strongly connected if
    support(ai, bk) >> expected value
  • Si and Sk are strongly connected if every pair of
    values in Si × Sk is strongly connected.

8
An Example
  • Intuitively, a cluster is a high-density region
  • Region: {a1, a2} × {b1, b2} × {c1, c2}

Note: dense regions lead to strongly connected sets
9
Cluster Definition
  • Region: a cross product of sets of attribute
    values, C1 × … × Cn
  • C = C1 × … × Cn is a cluster iff
  • Ci and Cj are strongly connected, for all i, j
  • Ci is maximal, for all i
  • support(C) >> expected
  • Ci is the cluster projection of C on Ai
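As a sketch of these conditions in code (the predicate and function arguments are assumptions, and the maximality condition is only noted, not implemented):

```python
from itertools import combinations, product

def is_cluster(regions, strongly_connected, support, expected_support):
    """regions: list of value sets [C1, ..., Cn], one per attribute.

    Checks conditions 1 and 3 of the cluster definition; maximality
    (condition 2) would additionally require that no Ci can be grown,
    which is omitted in this sketch."""
    # Condition 1: Ci and Cj strongly connected for every attribute pair
    for (i, ci), (j, cj) in combinations(enumerate(regions), 2):
        if not all(strongly_connected(i, a, j, b)
                   for a, b in product(ci, cj)):
            return False
    # Condition 3: support of the region significantly exceeds expectation
    return support(regions) > expected_support(regions)
```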

10
CACTUS Outline
  • Idea: compute and use data summaries for
    clustering
  • Three phases:
  • Summarization: compute summaries of the data
  • Clustering: use the summaries to compute candidate
    clusters
  • Validation: validate the set of candidate clusters
    from the clustering phase

11
Summaries
  • Two types of summaries
  • Inter-attribute summaries
  • Intra-attribute summaries

12
Inter-Attribute Summaries
  • Supports of all strongly connected attribute
    value pairs from different attributes
  • Similar in nature to frequent 2-itemsets
  • So is their computation

IJ(A,B)   IJ(A,C)   IJ(B,C)
(a1,b1)   (a1,c1)   (b1,c1)
(a1,b2)   (a1,c2)   (b1,c2)
(a2,b1)   (a2,c1)   (b2,c1)
(a2,b2)   (a2,c2)   (b2,c2)
(a3,b1)             (b3,c1)

13
Intra-attribute summaries
  • simA(B): similarities thru A of attribute value
    pairs of B

sim(B)     thru A   thru C
(b1,b2)      4        2
(b1,b3)      0        2
(b1,b4)      0        0
(b2,b3)      0        2
(b2,b4)      0        0

(Figure: attributes A, B, C with strong-connection edges)
14
Computing Intra-attribute Summaries
  • SQL query to compute simA(B):

    Select T1.B, T2.B, count(*)
    From IJ(A,B) as T1(A,B), IJ(A,B) as T2(A,B)
    Where T1.B < T2.B and T1.A = T2.A
    Group By T1.B, T2.B
    Having count(*) > 0

  • Note: the inter-attribute summaries are sufficient
  • The dataset is not accessed!
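The query can be run directly against a materialized summary table; a sketch with sqlite3 (the table name and toy contents are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IJ_AB (A TEXT, B TEXT)")  # stands in for IJ(A,B)
# Strongly connected (A, B) value pairs (toy data)
conn.executemany("INSERT INTO IJ_AB VALUES (?, ?)",
                 [("a1", "b1"), ("a1", "b2"), ("a2", "b1"),
                  ("a2", "b2"), ("a3", "b1")])

# simA(b, b'): self-join the summary on A, pairing distinct B values
rows = conn.execute("""
    SELECT T1.B, T2.B, COUNT(*)
    FROM IJ_AB AS T1, IJ_AB AS T2
    WHERE T1.B < T2.B AND T1.A = T2.A
    GROUP BY T1.B, T2.B
    HAVING COUNT(*) > 0
""").fetchall()
print(rows)  # [('b1', 'b2', 2)]: a1 and a2 connect to both b1 and b2
```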

15
Memory Requirements for Summaries
  • Attribute domains are small
  • Typically less than 100 values
  • E.g., the largest attribute value domain in the
    UC-Irvine collection is 100 (Pendigits dataset)
  • With 50 attributes and domain sizes of 100, the
    summaries fit in 100 MB of main memory
  • Only one scan of the dataset is needed to compute
    the inter-attribute summaries

16
CACTUS
  • Summarization
  • Clustering Phase
  • Validation

17
Clustering Phase
  1. Compute cluster projections on each attribute
  2. Join cluster projections across attributes:
     candidate cluster generation

Identify the cluster projections {a1,a2},
{b1,b2}, {c1,c2}; then identify the cluster
{a1,a2} × {b1,b2} × {c1,c2}
18
Computing Cluster Projections
  • Lemma: computing all projections of clusters on
    attribute pairs is NP-complete

19
Distinguishing Set Assumption
  • Each cluster projection Ci on Ai is distinguished
    by a small set of attribute values
  • The size of a distinguishing set is bounded by k
    (the distinguishing number)
  • Values of k are typically small

20
Distinguishing Set Assumption
  • Cluster: {a1,a2} × {b1,b2} × {c1,c2}
  • a1 (or a2) distinguishes {a1,a2}
  • Approach: compute distinguishing sets and extend
    them to cluster projections

21
Candidate Cluster Generation
  • Cluster projections S1, …, Sn on A1, …, An
  • Cross product S1 × … × Sn
  • Level-wise synthesis: S1 × S2, prune, then add S3,
    and so on
  • May contain some dubious clusters!

(Figure: S1 = {C, C1}, S2 = {C2}, S3 = {C3};
C × C2 × C3 is not a cluster)
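A minimal sketch of the level-wise join (the function name and the pruning callback are assumptions; in CACTUS the pruning is a support-based test):

```python
from itertools import product

def level_wise_candidates(projections, keep):
    """projections: [S1, ..., Sn], where Si is a list of cluster
    projections (value sets) on attribute Ai. Extends candidates one
    attribute at a time, pruning partial tuples that keep() rejects."""
    candidates = [(p,) for p in projections[0]]
    for si in projections[1:]:
        candidates = [c + (p,) for c, p in product(candidates, si)
                      if keep(c + (p,))]
    return candidates
```

Pruning partial candidates early is what keeps the synthesis cheaper than materializing the full cross product S1 × … × Sn.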
22
The CACTUS Algorithm
  • Summarize
  • inter-attribute summaries (scans the dataset)
  • intra-attribute summaries
  • Clustering phase
  • Compute cluster projections
  • Level-wise synthesis of cluster projections to
    form candidate clusters
  • Validation
  • Requires a scan of the dataset

23
STIRR [GKR98]
  • An iterative dynamical system
  • Weighted nodes in the graph
  • In each iteration, weights are propagated between
    connected nodes (determined by tuples in the
    dataset)
  • Each iteration requires a dataset scan
  • Iteration stops when the fixed point is reached
  • Similar nodes have similar weights

24
Experimental Evaluation
  • Compare CACTUS with STIRR
  • Synthetic datasets
  • Quasi-random data, following the STIRR
    evaluation [GKR98]
  • Fix domain of each attribute
  • Randomly generate tuples from these domains
  • Identify clusters and plant additional (5) data
    within the clusters

25
Synthetic Datasets: CACTUS and STIRR
Clusters: {0,…,9} × {0,…,9} and {10,…,19} × {10,…,19}
Both CACTUS and STIRR identified the two clusters
exactly
26
Synthetic Dataset (contd.)
Clusters: {0,…,9} × {0,…,9} × {0,…,9},
{10,…,19} × {10,…,19} × {10,…,19}, and
{0,…,9} × {10,…,19} × {10,…,19}
CACTUS identifies the 3 clusters
STIRR returns {0,…,9} × {0,…,19} × {0,…,9} and
{10,…,19} × {0,…,19} × {10,…,19}
27
Scalability with Tuples
Attributes: 10; domain size: 100
CACTUS is 10 times faster
28
Scalability with Attributes
1 million tuples; domain size: 100
29
Scalability with Domain Size
1 million tuples; attributes: 4
30
Bibliographic Data
  • Database and theory bibliographic entries
    [Wie] (38,500 entries)
  • Attributes: first author, second author,
    conference/journal, and year
  • Example: cluster projections on the conference
    attribute

(1) ACM Sigmod, VLDB, ACM TODS, ICDE, ACM Sigmod Record
(2) ACMTG, CompGeom, FOCS, Geometry, ICALP, IPL, JCSS, …
(3) PODS, Algorithmica, FOCS, ICALP, INFCTRL, IPL, JCSS, …
31
Conclusions
  • Formal definition of a cluster
  • A scalable fast summarization-based clustering
    algorithm for categorical data
  • Outperforms an earlier algorithm (STIRR) by
    almost an order of magnitude
  • Subspace clustering

32
33
Extensions
  • Dealing with large attribute value domains
  • In some rare cases, the inter-attribute or
    intra-attribute summaries may not fit in main
    memory
  • Clusters in subspaces when the number of
    attributes is large

34
Related Work
  • Conceptual clustering (e.g., [Fisher87]), EM
    [DLR77]
  • These assume that datasets fit in main memory
  • Recent scalable algorithms for clustering
    categorical data:
  • STIRR [GKR98]
  • ROCK [GRS99]
  • The definition of clusters is not clear

35
Limitations
  • The cluster definition may be too strong for
    certain applications
  • We require every pair of attribute values
    across attributes to be strongly connected
  • Consequence: a large number of clusters

36
Outline of the talk
  • Notion of similarity
  • Cluster Definition
  • The CACTUS Algorithm
  • Experimental Evaluation
  • Extensions to CACTUS
  • Conclusions

37
Validation
  • Scan the dataset once more
  • Compute supports of candidate clusters
  • Retain only those with significantly high support
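A sketch of this phase (the representation is an assumption: each candidate is a tuple of value sets, one per attribute, and the support threshold is a parameter):

```python
from collections import Counter

def validate(tuples, candidates, threshold):
    """Single scan of the dataset: count the tuples falling inside each
    candidate region, then retain candidates whose support exceeds the
    threshold."""
    counts = Counter()
    for t in tuples:
        for idx, regions in enumerate(candidates):
            # t lies in the region iff each component is in the
            # corresponding attribute-value set
            if all(v in ci for v, ci in zip(t, regions)):
                counts[idx] += 1
    return [candidates[i] for i in range(len(candidates))
            if counts[i] > threshold]
```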

38
Computing Cluster Projections: Algorithm
  • For attribute A1, compute cluster projections
    from clusters on (A1,A2), (A1,A3), …, (A1,An)
  • Intersection join on

39
Computing Cluster Projections
  • Lemma: Let C = C1 × … × Cn be a cluster. Then Ci
    is the intersection of the sets Ci′ such that
    (Ci′, Ck) is a cluster on (Ai, Ak)
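The lemma suggests computing each Ci by intersecting the pairwise cluster projections; a sketch (the dictionary layout is an assumption):

```python
def projection_on_attribute(pairwise, i, n):
    """pairwise[(i, k)]: the projection onto Ai of the cluster on the
    attribute pair (Ai, Ak). Per the lemma, Ci is the intersection of
    these projections over all other attributes k."""
    sets = [pairwise[(i, k)] for k in range(n)
            if k != i and (i, k) in pairwise]
    out = set(sets[0])
    for s in sets[1:]:
        out &= s
    return out
```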