Title: Clustering Categorical Data Using Summaries
1. Clustering Categorical Data Using Summaries
CACTUS
- Venkatesh Ganti
- joint work with
- Johannes Gehrke and Raghu Ramakrishnan
- (University of Wisconsin-Madison)
2. Introduction
- Most clustering research has focused on n-dimensional numeric data
  - e.g., BIRCH [ZRL96], CURE [GRS98], the clustering framework of [BFR98], WaveCluster [SCZ98]
- Data also consists of categorical attributes
  - e.g., the UC-Irvine collection of datasets
- Problem: similarity functions are not defined for categorical data
3. CACTUS
- Goal: a fast, scalable algorithm for discovering well-defined clusters
- Similarity: use attribute value co-occurrence (as in STIRR [GKR98])
- Speed and scalability: exploit the small domain sizes of categorical attributes
4. Preliminaries and Notation
- A set of n categorical attributes with domains D1, …, Dn
- A tuple consists of one value from each domain, e.g., (a1, b2, c1)
- Dataset: a set of tuples
- Note: the sizes of D1, …, Dn are typically very small
5. Similarity between attributes
- Similarity between a1 and b1: support(a1, b1) = the number of tuples containing (a1, b1)
- a1 and b1 are strongly connected if support(a1, b1) is higher than expected
- {a1, a2, a3, a4} and {b1, b2} are strongly connected if all pairs are (a sketch follows)
(Figure: example values of A and B; some value pairs are not strongly connected.)
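A minimal sketch of these two notions, assuming the dataset is a list of Python tuples indexed by attribute position; the uniform-independence expectation and the threshold alpha are illustrative assumptions, not the talk's exact formula:

    from collections import Counter

    def supports(tuples, i, j):
        """Co-occurrence counts of value pairs for attributes i and j."""
        return Counter((t[i], t[j]) for t in tuples)

    def strongly_connected_pairs(tuples, i, j, dom_i, dom_j, alpha=3.0):
        """Keep pairs whose support exceeds alpha times the support
        expected if attributes i and j were independent and uniform
        (alpha > 1 is an assumed user threshold)."""
        expected = len(tuples) / (len(dom_i) * len(dom_j))
        return {pair for pair, s in supports(tuples, i, j).items()
                if s > alpha * expected}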
6. Similarity within an attribute
- simA(b1, b2): the number of values of A that are strongly connected with both b1 and b2 (a sketch follows the table)

    sim(B)     thru A   thru C
    (b1, b2)      4        2
    (b1, b3)      0        2
    (b1, b4)      0        0
    (b2, b3)      0        2
    (b2, b4)      0        0

(Figure: values of attributes A, B, C with strongly connected pairs drawn as edges.)
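A sketch of how the table's entries could be computed from the strongly connected (a, b) pairs produced by the previous snippet; the function name and data layout are illustrative:

    from collections import defaultdict

    def sim_within(strong_pairs):
        """sim_A(b1, b2): how many A-values are strongly connected
        with both b1 and b2, given the strongly connected (a, b) pairs."""
        connected_to = defaultdict(set)          # b -> set of A-values
        for a, b in strong_pairs:
            connected_to[b].add(a)
        bs = sorted(connected_to)
        return {(b1, b2): len(connected_to[b1] & connected_to[b2])
                for i, b1 in enumerate(bs) for b2 in bs[i + 1:]}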
7. Definitions
- support(ai, bk) is the number of tuples that contain both ai and bk
- ai and bk are strongly connected if support(ai, bk) >> its expected value
- Si and Sk are strongly connected if every pair of values in Si x Sk is strongly connected
8. An Example
- Intuitively, a cluster is a high-density region
- Region: {a1, a2} x {b1, b2} x {c1, c2}
- Note: dense regions lead to strongly connected sets
9. Cluster Definition
- Region: a cross-product of sets of attribute values, C1 x … x Cn
- C = C1 x … x Cn is a cluster iff
  - Ci and Cj are strongly connected, for all i, j
  - Ci is maximal, for all i
  - support(C) >> its expected value
- Ci is the cluster projection of C on Ai
(A sketch of the pairwise connectivity check follows.)
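A sketch checking only the first condition of the definition; maximality and the support condition are handled separately (support during the validation phase). The layout of `strong` is an assumption carried over from the earlier snippets:

    from itertools import product

    def pairwise_strongly_connected(C, strong):
        """True iff every pair of projections Ci, Cj is strongly
        connected, i.e. every value pair in Ci x Cj is a strongly
        connected pair for attributes (i, j).
        C: list of value sets, one per attribute.
        strong: dict keyed by (i, j) with i < j."""
        n = len(C)
        return all(pair in strong[(i, j)]
                   for i in range(n) for j in range(i + 1, n)
                   for pair in product(C[i], C[j]))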
10. CACTUS: Outline
- Idea: compute and use data summaries for clustering
- 3 phases:
  - Summarization: compute summaries of the data
  - Clustering: use the summaries to compute candidate clusters
  - Validation: validate the set of candidate clusters from the clustering phase
11. Summaries
- Two types of summaries
- Inter-attribute summaries
- Intra-attribute summaries
12. Inter-Attribute Summaries
- Supports of all strongly connected attribute value pairs from different attributes
- Similar in nature to frequent 2-itemsets; so is the computation (a one-scan sketch follows)

    IJ(A,B)    IJ(A,C)    IJ(B,C)
    (a1,b1)    (a1,c1)    (b1,c1)
    (a1,b2)    (a1,c2)    (b1,c2)
    (a2,b1)    (a2,c1)    (b2,c1)
    (a2,b2)    (a2,c2)    (b2,c2)
    (a3,b1)               (b3,c1)

(Figure: the example attribute graph over A, B, C.)
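A sketch of the single-scan computation; pairs that fail the strong-connectivity test can be pruned from the resulting counters afterwards:

    from collections import Counter
    from itertools import combinations

    def inter_attribute_summaries(tuples, n):
        """One scan over the dataset: for each attribute pair (i, j),
        count the co-occurring value pairs (the IJ tables)."""
        IJ = {p: Counter() for p in combinations(range(n), 2)}
        for t in tuples:
            for i, j in combinations(range(n), 2):
                IJ[(i, j)][(t[i], t[j])] += 1
        return IJ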
13. Intra-Attribute Summaries
- simA(B): the similarities, thru A, of the attribute value pairs of B

    sim(B)     thru A   thru C
    (b1, b2)      4        2
    (b1, b3)      0        2
    (b1, b4)      0        0
    (b2, b3)      0        2
    (b2, b4)      0        0

(Figure: the example attribute graph over A, B, C.)
14. Computing Intra-attribute Summaries
- SQL query to compute simA(B):

    Select T1.B, T2.B, count(*)
    From IJ(A,B) as T1(A,B), IJ(A,B) as T2(A,B)
    Where T1.B < T2.B and T1.A = T2.A
    Group By T1.B, T2.B
    Having count(*) > 0

- Note: the inter-attribute summaries are sufficient
  - The dataset is not accessed!
15. Memory Requirements for Summaries
- Attribute domains are small
  - Typically less than 100
  - E.g., the largest attribute value domain in the UC-Irvine collection is 100 (Pendigits dataset)
- 50 attributes with domain sizes of 100 fit within 100 MB of main memory (a quick estimate follows)
- Only one scan of the dataset is needed to compute the inter-attribute summaries
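A back-of-the-envelope check of this claim (the byte count per counter is an assumption, not from the talk): 50 attributes give 50·49/2 = 1,225 attribute pairs; with domain size 100, each pair needs at most 100 x 100 = 10,000 co-occurrence counters, about 12.25 million counters overall. At 4 bytes per counter that is roughly 49 MB, well within 100 MB of main memory.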
16. CACTUS
- Summarization
- Clustering Phase
- Validation
17. Clustering Phase
- Compute cluster projections on each attribute
- Join cluster projections across attributes: candidate cluster generation
- Example: identify the cluster projections {a1, a2}, {b1, b2}, {c1, c2}; then identify the cluster {a1, a2} x {b1, b2} x {c1, c2}
18. Computing Cluster Projections
- Lemma: computing all projections of clusters on attribute pairs is NP-complete
19. Distinguishing Set Assumption
- Each cluster projection Ci on Ai is distinguished by a small set of attribute values
- The size of a distinguishing set is bounded by k (the distinguishing number)
- Values of k are typically small
20. Distinguishing Set Assumption (contd.)
- Cluster: {a1, a2} x {b1, b2} x {c1, c2}
- {a1} (or {a2}) distinguishes {a1, a2}
- Approach: compute distinguishing sets and extend them to cluster projections
21. Candidate Cluster Generation
- Cluster projections S1, …, Sn on A1, …, An
- Cross product S1 x … x Sn
- Level-wise synthesis: S1 x S2, prune, then add S3, and so on (a sketch follows)
- May contain some spurious clusters!
(Figure: S1 = {C, C1}, S2 = {C2}, S3 = {C3}; the synthesized candidate C x C2 x C3 is not a cluster.)
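A sketch of the level-wise synthesis under the same assumed data layout as before (`projections[j]` lists the cluster projections, as value sets, on attribute j):

    from itertools import product

    def synthesize(projections, strong):
        """Start from the cluster projections on A1 and extend one
        attribute at a time, pruning any extension whose new
        projection is not strongly connected with every projection
        already in the candidate."""
        candidates = [(c,) for c in projections[0]]
        for j in range(1, len(projections)):
            candidates = [cand + (cj,)
                          for cand in candidates
                          for cj in projections[j]
                          if all(pair in strong[(i, j)]
                                 for i in range(j)
                                 for pair in product(cand[i], cj))]
        return candidates

As the figure illustrates, these pairwise checks do not guarantee a true cluster (C x C2 x C3 survives them), which is why the validation scan follows.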
22. The CACTUS Algorithm
- Summarization
  - Inter-attribute summaries: one scan of the dataset
  - Intra-attribute summaries: computed from the inter-attribute summaries
- Clustering phase
  - Compute cluster projections
  - Level-wise synthesis of cluster projections to form candidate clusters
- Validation
  - Requires one scan of the dataset
23. STIRR [GKR98]
- An iterative dynamical system
- Weighted nodes in a graph, one node per attribute value
- In each iteration, weights are propagated between connected nodes (connections are determined by the tuples in the dataset); a simplified sketch follows
- Each iteration requires a dataset scan
- Iteration stops when a fixed point is reached
- Similar nodes end up with similar weights
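A simplified sketch of one such iteration, with weights keyed by (attribute index, value); summation as the combining operator and per-attribute L2 normalization are assumptions for illustration, not the exact combiner of [GKR98]:

    def stirr_iteration(tuples, weights):
        """Each node receives, from every tuple it occurs in, the
        combined weight of the other values in that tuple; weights
        are then normalized per attribute."""
        new = dict.fromkeys(weights, 0.0)
        for t in tuples:
            nodes = list(enumerate(t))            # (attr index, value)
            total = sum(weights[nd] for nd in nodes)
            for nd in nodes:
                new[nd] += total - weights[nd]    # weight of the others
        for attr in {a for a, _ in new}:          # per-attribute L2 norm
            norm = sum(w * w for (a, _), w in new.items() if a == attr) ** 0.5
            for nd in [k for k in new if k[0] == attr]:
                new[nd] /= norm or 1.0
        return new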
24. Experimental Evaluation
- Compare CACTUS with STIRR
- Synthetic datasets
  - Quasi-random data, as in the STIRR paper [GKR98]
  - Fix the domain of each attribute
  - Randomly generate tuples from these domains
  - Identify clusters and plant additional (5%) data within the clusters
25. Synthetic Datasets: CACTUS and STIRR
- Clusters: {0, …, 9} x {0, …, 9} and {10, …, 19} x {10, …, 19}
- Both CACTUS and STIRR identified the two clusters exactly
26. Synthetic Datasets (contd.)
- Clusters: {0, …, 9} x {0, …, 9} x {0, …, 9}, {10, …, 19} x {10, …, 19} x {10, …, 19}, and {0, …, 9} x {10, …, 19} x {10, …, 19}
- CACTUS identifies the 3 clusters
- STIRR returns {0, …, 9} x {0, …, 19} x {0, …, 9} and {10, …, 19} x {0, …, 19} x {10, …, 19}
27. Scalability with the Number of Tuples
(Plot: attributes = 10, domain size = 100.)
- CACTUS is 10 times faster
28. Scalability with the Number of Attributes
(Plot: 1 million tuples, domain size = 100.)
29. Scalability with Domain Size
(Plot: 1 million tuples, 4 attributes.)
30. Bibliographic Data
- Database and theory bibliographic entries [Wie]: 38,500 entries
- Attributes: first author, second author, conference/journal, and year
- Example cluster projections on the conference attribute:
  (1) ACM SIGMOD, VLDB, ACM TODS, ICDE, ACM SIGMOD Record
  (2) ACMTG, CompGeom, FOCS, Geometry, ICALP, IPL, JCSS, …
  (3) PODS, Algorithmica, FOCS, ICALP, INFCTRL, IPL, JCSS, …
31. Conclusions
- A formal definition of a cluster
- A scalable, fast, summarization-based clustering algorithm for categorical data
- Outperforms an earlier algorithm (STIRR) by almost an order of magnitude
- Future work: subspace clustering
33. Extensions
- Dealing with large attribute value domains
  - In some rare cases, the inter-attribute or intra-attribute summaries may not fit in main memory
- Clusters in subspaces when the number of attributes is large
34. Related Work
- Conceptual clustering (e.g., [Fisher87]) and EM [DLR77]
  - Assume that datasets fit in main memory
- Recent scalable algorithms for clustering categorical data
  - STIRR [GKR98]
  - ROCK [GRS99]
  - The definition of clusters is not clear
35. Limitations
- The cluster definition may be too strong for certain applications
  - Namely, that we require every pair of attribute values across attributes to be strongly connected
- Consequence: a large number of clusters
36. Outline of the talk
- Notion of similarity
- Cluster Definition
- The CACTUS Algorithm
- Experimental Evaluation
- Extensions to CACTUS
- Conclusions
37. Validation
- Scan the dataset once more
- Compute the supports of the candidate clusters
- Retain only those with significantly high support (a sketch follows)
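A sketch of the validation scan, taking the candidates in the tuple-of-value-sets form produced by the synthesis sketch; `min_support` is an assumed threshold parameter:

    def validate(tuples, candidates, min_support):
        """One more pass over the data to count each candidate's
        support; only candidates with high enough support survive."""
        counts = [0] * len(candidates)
        for t in tuples:
            for k, cand in enumerate(candidates):
                if all(t[i] in Ci for i, Ci in enumerate(cand)):
                    counts[k] += 1
        return [c for c, s in zip(candidates, counts) if s >= min_support]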
38. Computing Cluster Projections: Algorithm
- For attribute A1, compute cluster projections from clusters on (A1, A2), (A1, A3), …, (A1, An)
- Intersection join of these pairwise projections (see the lemma on the next slide)
39. Computing Cluster Projections
- Lemma: let C = C1 x … x Cn be a cluster. Then Ci is the intersection of the sets {Ci' : (Ci', Ck') is a cluster on (Ai, Ak)}, taken over k ≠ i
(A sketch follows.)
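A sketch of the lemma used operationally; `pair_projections[(i, k)]` is assumed to hold the Ai-side projection (a value set) of the relevant 2-cluster on (Ai, Ak):

    def projection_by_intersection(pair_projections, i, n):
        """Cluster projection on Ai as the intersection, over all
        other attributes Ak, of the Ai-side 2-cluster projections."""
        Ci = None
        for k in range(n):
            if k != i:
                side = pair_projections[(i, k)]
                Ci = set(side) if Ci is None else Ci & side
        return Ci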