Title: Clustering Categorical Data Using Summaries
1. Clustering Categorical Data Using Summaries
CACTUS
- Venkatesh Ganti
- joint work with
- Johannes Gehrke and Raghu Ramakrishnan
- (University of Wisconsin-Madison)
2. Introduction
- Most clustering research has focused on n-dimensional numeric data
  - e.g., BIRCH [ZRL96], CURE [GRS98], the clustering framework of [BFR98], WaveCluster [SCZ98]
- Data also consists of categorical attributes
  - e.g., the UC-Irvine collection of datasets
- Problem: similarity functions are not defined for categorical data
3. CACTUS
- Goal: a fast, scalable algorithm for discovering well-defined clusters
- Similarity: use attribute value co-occurrence (as in STIRR [GKR98])
- Speed and scalability: exploit the small domain sizes of categorical attributes
4. Preliminaries and Notation
- A set of n categorical attributes with domains D1, …, Dn
- A tuple consists of one value from each domain, e.g., (a1, b2, c1)
- Dataset: a set of tuples
- Note: the sizes of D1, …, Dn are typically very small
5. Similarity between attributes
- Similarity between a1 and b1: support(a1, b1) = the number of tuples containing (a1, b1)
- a1 and b1 are strongly connected if support(a1, b1) is higher than expected
- {a1, a2, a3, a4} and {b1, b2} are strongly connected if all pairs are (a sketch follows)
(Figure: example values of A and B; some value pairs are not strongly connected.)
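A minimal sketch of these two notions, assuming the dataset is a list of Python tuples indexed by attribute position; the uniform-independence expectation and the threshold alpha are illustrative assumptions, not the talk's exact formula:

    from collections import Counter

    def supports(tuples, i, j):
        """Co-occurrence counts of value pairs for attributes i and j."""
        return Counter((t[i], t[j]) for t in tuples)

    def strongly_connected_pairs(tuples, i, j, dom_i, dom_j, alpha=3.0):
        """Keep pairs whose support exceeds alpha times the support
        expected if attributes i and j were independent and uniform
        (alpha > 1 is an assumed user threshold)."""
        expected = len(tuples) / (len(dom_i) * len(dom_j))
        return {pair for pair, s in supports(tuples, i, j).items()
                if s > alpha * expected}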
6. Similarity within an attribute
- simA(b1, b2): the number of values of A that are strongly connected with both b1 and b2 (a sketch follows the table)

    sim(B)     thru A   thru C
    (b1, b2)      4        2
    (b1, b3)      0        2
    (b1, b4)      0        0
    (b2, b3)      0        2
    (b2, b4)      0        0

(Figure: values of attributes A, B, C with strongly connected pairs drawn as edges.)
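A sketch of how the table's entries could be computed from the strongly connected (a, b) pairs produced by the previous snippet; the function name and data layout are illustrative:

    from collections import defaultdict

    def sim_within(strong_pairs):
        """sim_A(b1, b2): how many A-values are strongly connected
        with both b1 and b2, given the strongly connected (a, b) pairs."""
        connected_to = defaultdict(set)          # b -> set of A-values
        for a, b in strong_pairs:
            connected_to[b].add(a)
        bs = sorted(connected_to)
        return {(b1, b2): len(connected_to[b1] & connected_to[b2])
                for i, b1 in enumerate(bs) for b2 in bs[i + 1:]}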
7. Definitions
- support(ai, bk) is the number of tuples that contain both ai and bk
- ai and bk are strongly connected if support(ai, bk) >> its expected value
- Si and Sk are strongly connected if every pair of values in Si x Sk is strongly connected
8. An Example
- Intuitively, a cluster is a high-density region
- Region: {a1, a2} x {b1, b2} x {c1, c2}
- Note: dense regions lead to strongly connected sets
9. Cluster Definition
- Region: a cross-product of sets of attribute values, C1 x … x Cn
- C = C1 x … x Cn is a cluster iff
  - Ci and Cj are strongly connected, for all i, j
  - Ci is maximal, for all i
  - support(C) >> its expected value
- Ci is the cluster projection of C on Ai
(A sketch of the pairwise connectivity check follows.)
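A sketch checking only the first condition of the definition; maximality and the support condition are handled separately (support during the validation phase). The layout of `strong` is an assumption carried over from the earlier snippets:

    from itertools import product

    def pairwise_strongly_connected(C, strong):
        """True iff every pair of projections Ci, Cj is strongly
        connected, i.e. every value pair in Ci x Cj is a strongly
        connected pair for attributes (i, j).
        C: list of value sets, one per attribute.
        strong: dict keyed by (i, j) with i < j."""
        n = len(C)
        return all(pair in strong[(i, j)]
                   for i in range(n) for j in range(i + 1, n)
                   for pair in product(C[i], C[j]))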
10. CACTUS: Outline
- Idea: compute and use data summaries for clustering
- 3 phases:
  - Summarization: compute summaries of the data
  - Clustering: use the summaries to compute candidate clusters
  - Validation: validate the set of candidate clusters from the clustering phase
11. Summaries
- Two types of summaries
- Inter-attribute summaries
- Intra-attribute summaries
12. Inter-Attribute Summaries
- Supports of all strongly connected attribute value pairs from different attributes
- Similar in nature to frequent 2-itemsets; so is the computation (a one-scan sketch follows)

    IJ(A,B)    IJ(A,C)    IJ(B,C)
    (a1,b1)    (a1,c1)    (b1,c1)
    (a1,b2)    (a1,c2)    (b1,c2)
    (a2,b1)    (a2,c1)    (b2,c1)
    (a2,b2)    (a2,c2)    (b2,c2)
    (a3,b1)               (b3,c1)

(Figure: the example attribute graph over A, B, C.)
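A sketch of the single-scan computation; pairs that fail the strong-connectivity test can be pruned from the resulting counters afterwards:

    from collections import Counter
    from itertools import combinations

    def inter_attribute_summaries(tuples, n):
        """One scan over the dataset: for each attribute pair (i, j),
        count the co-occurring value pairs (the IJ tables)."""
        IJ = {p: Counter() for p in combinations(range(n), 2)}
        for t in tuples:
            for i, j in combinations(range(n), 2):
                IJ[(i, j)][(t[i], t[j])] += 1
        return IJ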
13. Intra-Attribute Summaries
- simA(B): the similarities, thru A, of the attribute value pairs of B

    sim(B)     thru A   thru C
    (b1, b2)      4        2
    (b1, b3)      0        2
    (b1, b4)      0        0
    (b2, b3)      0        2
    (b2, b4)      0        0

(Figure: the example attribute graph over A, B, C.)
14. Computing Intra-attribute Summaries
- SQL query to compute simA(B):

    Select T1.B, T2.B, count(*)
    From IJ(A,B) as T1(A,B), IJ(A,B) as T2(A,B)
    Where T1.B < T2.B and T1.A = T2.A
    Group By T1.B, T2.B
    Having count(*) > 0

- Note: the inter-attribute summaries are sufficient
  - The dataset is not accessed!
15. Memory Requirements for Summaries
- Attribute domains are small
  - Typically less than 100
  - E.g., the largest attribute value domain in the UC-Irvine collection is 100 (Pendigits dataset)
- 50 attributes with domain sizes of 100 fit within 100 MB of main memory (a quick estimate follows)
- Only one scan of the dataset is needed to compute the inter-attribute summaries
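A back-of-the-envelope check of this claim (the byte count per counter is an assumption, not from the talk): 50 attributes give 50·49/2 = 1,225 attribute pairs; with domain size 100, each pair needs at most 100 x 100 = 10,000 co-occurrence counters, about 12.25 million counters overall. At 4 bytes per counter that is roughly 49 MB, well within 100 MB of main memory.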
16. CACTUS
- Summarization
- Clustering Phase
- Validation
17. Clustering Phase
- Compute cluster projections on each attribute
- Join cluster projections across attributes: candidate cluster generation
- Example: identify the cluster projections {a1, a2}, {b1, b2}, {c1, c2}; then identify the cluster {a1, a2} x {b1, b2} x {c1, c2}
18. Computing Cluster Projections
- Lemma: computing all projections of clusters on attribute pairs is NP-complete
19. Distinguishing Set Assumption
- Each cluster projection Ci on Ai is distinguished by a small set of attribute values
- The size of a distinguishing set is bounded by k (the distinguishing number)
- Values of k are typically small
20. Distinguishing Set Assumption (contd.)
- Cluster: {a1, a2} x {b1, b2} x {c1, c2}
- {a1} (or {a2}) distinguishes {a1, a2}
- Approach: compute distinguishing sets and extend them to cluster projections
21. Candidate Cluster Generation
- Cluster projections S1, …, Sn on A1, …, An
- Cross product S1 x … x Sn
- Level-wise synthesis: S1 x S2, prune, then add S3, and so on (a sketch follows)
- May contain some spurious clusters!
(Figure: S1 = {C, C1}, S2 = {C2}, S3 = {C3}; the synthesized candidate C x C2 x C3 is not a cluster.)
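A sketch of the level-wise synthesis under the same assumed data layout as before (`projections[j]` lists the cluster projections, as value sets, on attribute j):

    from itertools import product

    def synthesize(projections, strong):
        """Start from the cluster projections on A1 and extend one
        attribute at a time, pruning any extension whose new
        projection is not strongly connected with every projection
        already in the candidate."""
        candidates = [(c,) for c in projections[0]]
        for j in range(1, len(projections)):
            candidates = [cand + (cj,)
                          for cand in candidates
                          for cj in projections[j]
                          if all(pair in strong[(i, j)]
                                 for i in range(j)
                                 for pair in product(cand[i], cj))]
        return candidates

As the figure illustrates, these pairwise checks do not guarantee a true cluster (C x C2 x C3 survives them), which is why the validation scan follows.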
22. The CACTUS Algorithm
- Summarization
  - Inter-attribute summaries: one scan of the dataset
  - Intra-attribute summaries: computed from the inter-attribute summaries
- Clustering phase
  - Compute cluster projections
  - Level-wise synthesis of cluster projections to form candidate clusters
- Validation
  - Requires one scan of the dataset
23. STIRR [GKR98]
- An iterative dynamical system
- Weighted nodes in a graph, one node per attribute value
- In each iteration, weights are propagated between connected nodes (connections are determined by the tuples in the dataset); a simplified sketch follows
- Each iteration requires a dataset scan
- Iteration stops when a fixed point is reached
- Similar nodes end up with similar weights
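A simplified sketch of one such iteration, with weights keyed by (attribute index, value); summation as the combining operator and per-attribute L2 normalization are assumptions for illustration, not the exact combiner of [GKR98]:

    def stirr_iteration(tuples, weights):
        """Each node receives, from every tuple it occurs in, the
        combined weight of the other values in that tuple; weights
        are then normalized per attribute."""
        new = dict.fromkeys(weights, 0.0)
        for t in tuples:
            nodes = list(enumerate(t))            # (attr index, value)
            total = sum(weights[nd] for nd in nodes)
            for nd in nodes:
                new[nd] += total - weights[nd]    # weight of the others
        for attr in {a for a, _ in new}:          # per-attribute L2 norm
            norm = sum(w * w for (a, _), w in new.items() if a == attr) ** 0.5
            for nd in [k for k in new if k[0] == attr]:
                new[nd] /= norm or 1.0
        return new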
24. Experimental Evaluation
- Compare CACTUS with STIRR
- Synthetic datasets
  - Quasi-random data, as in the STIRR paper [GKR98]
  - Fix the domain of each attribute
  - Randomly generate tuples from these domains
  - Identify clusters and plant additional (5%) data within the clusters
25. Synthetic Datasets: CACTUS and STIRR
- Clusters: {0, …, 9} x {0, …, 9} and {10, …, 19} x {10, …, 19}
- Both CACTUS and STIRR identified the two clusters exactly
26. Synthetic Datasets (contd.)
- Clusters: {0, …, 9} x {0, …, 9} x {0, …, 9}, {10, …, 19} x {10, …, 19} x {10, …, 19}, and {0, …, 9} x {10, …, 19} x {10, …, 19}
- CACTUS identifies the 3 clusters
- STIRR returns {0, …, 9} x {0, …, 19} x {0, …, 9} and {10, …, 19} x {0, …, 19} x {10, …, 19}
27. Scalability with the Number of Tuples
(Plot: attributes = 10, domain size = 100.)
- CACTUS is 10 times faster
28. Scalability with the Number of Attributes
(Plot: 1 million tuples, domain size = 100.)
29. Scalability with Domain Size
(Plot: 1 million tuples, 4 attributes.)
30. Bibliographic Data
- Database and theory bibliographic entries [Wie]: 38,500 entries
- Attributes: first author, second author, conference/journal, and year
- Example cluster projections on the conference attribute:
  (1) ACM SIGMOD, VLDB, ACM TODS, ICDE, ACM SIGMOD Record
  (2) ACMTG, CompGeom, FOCS, Geometry, ICALP, IPL, JCSS, …
  (3) PODS, Algorithmica, FOCS, ICALP, INFCTRL, IPL, JCSS, …
31. Conclusions
- A formal definition of a cluster
- A scalable, fast, summarization-based clustering algorithm for categorical data
- Outperforms an earlier algorithm (STIRR) by almost an order of magnitude
- Future work: subspace clustering
33. Extensions
- Dealing with large attribute value domains
  - In some rare cases, the inter-attribute or intra-attribute summaries may not fit in main memory
- Clusters in subspaces when the number of attributes is large
34. Related Work
- Conceptual clustering (e.g., [Fisher87]) and EM [DLR77]
  - Assume that datasets fit in main memory
- Recent scalable algorithms for clustering categorical data
  - STIRR [GKR98]
  - ROCK [GRS99]
  - The definition of clusters is not clear
35. Limitations
- The cluster definition may be too strong for certain applications
  - Namely, that we require every pair of attribute values across attributes to be strongly connected
- Consequence: a large number of clusters
36. Outline of the talk
- Notion of similarity
- Cluster Definition
- The CACTUS Algorithm
- Experimental Evaluation
- Extensions to CACTUS
- Conclusions
37. Validation
- Scan the dataset once more
- Compute the supports of the candidate clusters
- Retain only those with significantly high support (a sketch follows)
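A sketch of the validation scan, taking the candidates in the tuple-of-value-sets form produced by the synthesis sketch; `min_support` is an assumed threshold parameter:

    def validate(tuples, candidates, min_support):
        """One more pass over the data to count each candidate's
        support; only candidates with high enough support survive."""
        counts = [0] * len(candidates)
        for t in tuples:
            for k, cand in enumerate(candidates):
                if all(t[i] in Ci for i, Ci in enumerate(cand)):
                    counts[k] += 1
        return [c for c, s in zip(candidates, counts) if s >= min_support]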
38. Computing Cluster Projections: Algorithm
- For attribute A1, compute cluster projections from clusters on (A1, A2), (A1, A3), …, (A1, An)
- Intersection join of these pairwise projections (see the lemma on the next slide)
39. Computing Cluster Projections
- Lemma: let C = C1 x … x Cn be a cluster. Then Ci is the intersection of the sets {Ci' : (Ci', Ck') is a cluster on (Ai, Ak)}, taken over k ≠ i
(A sketch follows.)
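A sketch of the lemma used operationally; `pair_projections[(i, k)]` is assumed to hold the Ai-side projection (a value set) of the relevant 2-cluster on (Ai, Ak):

    def projection_by_intersection(pair_projections, i, n):
        """Cluster projection on Ai as the intersection, over all
        other attributes Ak, of the Ai-side 2-cluster projections."""
        Ci = None
        for k in range(n):
            if k != i:
                side = pair_projections[(i, k)]
                Ci = set(side) if Ci is None else Ci & side
        return Ci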