Title: CACTUS
1. CACTUS: Clustering Categorical Data Using Summaries
- By Venkatesh Ganti, Johannes Gehrke and Raghu Ramakrishnan
- Presented by RongEn Li
- School of Informatics, Edinburgh
2. Overview
- Introduction and motivation
- Existing tools for clustering categorical data: STIRR and ROCK
- Definition of a cluster over categorical data
- The algorithm CACTUS
- Experiments and results
- Summary
3. Introduction and motivation
- Numeric data: 1, 2, 3, 4, 5, …
- Categorical data: LFD, PMR, DME, …
- Categorical attributes usually have a small number of values in their domains; large domains are typically hard to infer useful information from.
- Use relations! Relations contain different attributes, but the cross product of the attribute domains can be large.
- CACTUS is a fast summarisation-based algorithm which uses summary information to find well-defined clusters.
4. Existing tools for clustering categorical data
- STIRR
  - Each attribute value is represented as a weighted vertex in a graph.
  - Multiple copies b1, …, bm (basins) of the weighted vertices are maintained; the same vertex can have different weights in different basins.
  - Starting step: choose a set of initial weights on all vertices in all basins.
  - Iterative step: for each tuple t = <t1, …, tn>, increment the weight in basin bi on vertex tj using a function that combines the weights of the vertices other than tj in bi.
  - At a fixed point, the large positive weights and small negative weights across the basins isolate two groups of attribute values on each attribute.
- ROCK
  - Starts with each tuple in its own cluster.
  - Merges close clusters until a required (user-specified) number of clusters remains; closeness is defined by a similarity function.
- STIRR is used for the comparison with CACTUS.
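The STIRR iteration described above can be sketched roughly as follows. This is a heavily simplified, single-basin version with an additive combiner and per-attribute normalisation; all names and the choice of combiner are illustrative, not taken from the paper.

```python
def stirr_iterate(tuples, weights, steps=10):
    """One-basin sketch of STIRR's iterative weight propagation.

    tuples:  list of tuples of attribute values (one value per attribute)
    weights: dict mapping (attribute_index, value) -> float
    """
    for _ in range(steps):
        new = {k: 0.0 for k in weights}
        for t in tuples:
            for j, vj in enumerate(t):
                # Combine (here: sum) the weights of the *other* values
                # in the tuple and add the result to vertex (j, vj).
                new[(j, vj)] += sum(weights[(i, vi)]
                                    for i, vi in enumerate(t) if i != j)
        # Normalise each attribute's weight vector to unit length so the
        # weights converge instead of growing without bound.
        for j in {a for a, _ in new}:
            norm = sum(new[(a, v)] ** 2 for a, v in new if a == j) ** 0.5
            if norm > 0:
                for a, v in list(new):
                    if a == j:
                        new[(a, v)] /= norm
        weights = new
    return weights
```

On a toy dataset, values that co-occur frequently end up with the largest weights on their attribute, which is the grouping effect the fixed point exploits.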
5. Definitions: interval region, support and belonging
- A1, …, An is a set of categorical attributes with domains D1, …, Dn respectively. D is a set of tuples where each tuple t ∈ D1 × … × Dn.
- Interval region: S = S1 × … × Sn where Si ⊆ Di for all i ∈ {1, …, n}. Equivalent to intervals in numeric data.
- The support of a value pair: σD(ai, aj) = |{t ∈ D : t.Ai = ai and t.Aj = aj}|. The support σD(S) of a region S is the number of tuples in D contained in S.
- Belonging: a tuple t = <t.A1, …, t.An> ∈ D belongs to a region S if for all i ∈ {1, …, n}, t.Ai ∈ Si.
6. Definitions: expected support, strongly connected
- The expected support under the attribute-independence assumption:
  - Of a region: E[σD(S)] = |D| · (|S1| × … × |Sn|) / (|D1| × … × |Dn|)
  - Of a pair ai and aj: E[σD(ai, aj)] = |D| / (|Di| × |Dj|)
  - α is normally set to 2 or 3.
- Strongly connected:
  - ai and aj are strongly connected if σD(ai, aj) > α · E[σD(ai, aj)]; in that case σ*D(ai, aj) = σD(ai, aj), otherwise σ*D(ai, aj) = 0.
  - ai ∈ Di is strongly connected with Sj ⊆ Dj if for all x ∈ Sj, ai and x are strongly connected.
  - Si and Sj are strongly connected if each ai ∈ Si is strongly connected with each aj ∈ Sj, and each aj ∈ Sj is strongly connected with each ai ∈ Si.
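As a minimal sketch of the support and strong-connection definitions on this slide, assuming tuples are plain Python tuples indexed by attribute position and support is a tuple count:

```python
def support(D, i, j, ai, aj):
    """sigma_D(ai, aj): number of tuples t in D with t[i]==ai and t[j]==aj."""
    return sum(1 for t in D if t[i] == ai and t[j] == aj)

def expected_support(D, Di, Dj):
    """Expected pair support under the attribute-independence assumption."""
    return len(D) / (len(Di) * len(Dj))

def sigma_star(D, i, j, ai, aj, Di, Dj, alpha=2):
    """The support if ai and aj are strongly connected, else 0."""
    s = support(D, i, j, ai, aj)
    return s if s > alpha * expected_support(D, Di, Dj) else 0
```

For example, with four tuples over two binary domains, a pair seen three times exceeds the threshold α · |D|/(|Di|·|Dj|) = 2, while a pair seen once does not.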
7. Definitions: cluster, cluster-projection, sub-cluster and subspace cluster
- C = <C1, …, Cn> is a cluster over A1, …, An if:
  - 1. For all i ≠ j, Ci and Cj are strongly connected.
  - 2. There exists no C'i such that C'i is a proper superset of Ci and, for all j ≠ i, C'i and Cj are strongly connected.
  - 3. The support σD(C) of C is greater than α times the expected support of C under the attribute-independence assumption.
- Ci is a cluster-projection of C on Ai.
- C is a sub-cluster if it satisfies only conditions 1 and 3.
- A cluster C over a proper subset S of the attributes {A1, …, An} is a subspace cluster on S.
8. Definitions: similarity, inter-attribute summaries, intra-attribute summaries
- Similarity: γj(a1, a2) = |{x ∈ Dj : σD(a1, x) > 0 and σD(a2, x) > 0}|
- Inter-attribute summary:
  - Σij = {(ai, aj, σ*D(ai, aj)) : ai ∈ Di, aj ∈ Dj, and σ*D(ai, aj) > 0}
  - The strongly connected attribute value pairs, where each pair has attribute values from different attributes.
- Intra-attribute summary:
  - Σjii = {(a1, a2, γj(a1, a2)) : a1, a2 ∈ Di, and γj(a1, a2) > 0}
  - Similarities between attribute values of the same attribute.
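The similarity γ between two values of the same attribute, as defined on this slide, can be sketched directly; the sketch assumes tuples indexed by attribute position, and measures co-occurrence through a second attribute Aj.

```python
def gamma(D, i, j, a1, a2):
    """gamma_j(a1, a2): number of Aj-values that co-occur (support > 0)
    with both a1 and a2 on attribute Ai."""
    with_a1 = {t[j] for t in D if t[i] == a1}   # Aj-values seen with a1
    with_a2 = {t[j] for t in D if t[i] == a2}   # Aj-values seen with a2
    return len(with_a1 & with_a2)
```

Two Ai-values are similar when they share many Aj-neighbours, even if they never appear in the same tuple themselves.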
9. CACTUS vs STIRR: clusters found by CACTUS
10. CACTUS vs STIRR: clusters found by STIRR
11. CACTUS: CAtegorical ClusTering Using Summaries
- Central idea: the data summary (inter- and intra-attribute summaries) is sufficient to find candidate clusters, which can then be validated.
- A three-phase clustering algorithm:
  - Summarisation
  - Clustering
  - Validation
12. Summarisation phase
- Assumption: the inter- and intra-attribute summaries of any pair of attributes fit easily into main memory.
- Inter-attribute summaries:
  - Use a counter, initially set to 0, for each pair (ai, aj) ∈ Di × Dj.
  - Scan the dataset, incrementing the counter for each pair.
  - After the scan, compute σD(ai, aj) and reset the counters of those pairs whose support is below the expected support E[σD(ai, aj)]. Store the remaining value pairs.
- Intra-attribute summaries:
  - Computed by joining two copies (T1, T2) of the inter-attribute summary: a pair (T1.a, T2.a) of values of one attribute is recorded when T1.a and T2.a are each strongly connected with a common value T1.b = T2.b of another attribute.
  - A very fast operation, hence computed only when needed.
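The counter-based inter-attribute summarisation step above can be sketched in a few lines; this assumes, as the slide does, that the whole counter table fits in memory, and folds the threshold α into the expected support.

```python
from collections import Counter

def inter_attribute_summary(D, i, j, Di, Dj, alpha=2):
    """One scan over D; keep only strongly connected pairs with their support."""
    counts = Counter((t[i], t[j]) for t in D)         # one counter per value pair
    threshold = alpha * len(D) / (len(Di) * len(Dj))  # alpha * expected support
    # Pairs at or below the threshold are dropped (their counters are "reset").
    return {pair: s for pair, s in counts.items() if s > threshold}
```

Only the surviving (ai, aj, σ) triples are stored, which is what keeps the summary small enough for main memory.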
13. Clustering phase
- A two-step operation:
  - Step 1: analyse each attribute to compute all cluster-projections on it.
  - Step 2: synthesise candidate clusters on sets of attributes from the cluster-projections on individual attributes.
14. Clustering phase (continued)
- Step 1: compute cluster-projections on attributes.
  - Step A: find all cluster-projections on Ai of clusters over (Ai, Aj).
  - Step B: compute all cluster-projections on Ai of clusters over A1, …, An by intersecting the sets of cluster-projections from Step A.
  - Step A is NP-hard! Solution: use distinguishing sets.
  - Distinguishing sets identify different cluster-projections.
  - Construct distinguishing sets on Ai and extend some of the candidate distinguishing sets on Ai w.r.t. Aj.
  - The detailed steps are too long for this presentation, sorry!
- Step B: intersection of cluster-projections.
  - Intersection join: S1 ∩* S2 = {s : there exist s1 ∈ S1 and s2 ∈ S2 such that s = s1 ∩ s2 and |s| > 1}
  - Apply the intersection join to all sets of attribute values on Ai.
- Step 2: try to augment a candidate <c1, …, ck> with a cluster-projection ck+1 on attribute Ak+1. If each <ci, ck+1> is a sub-cluster on (Ai, Ak+1), i ∈ {1, …, k}, then add <c1, …, ck+1> to the set of candidate clusters.
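The intersection join used in Step B is simple enough to sketch directly, assuming each cluster-projection is represented as a frozenset of attribute values:

```python
def intersection_join(S1, S2):
    """S1 ∩* S2: all pairwise intersections of size greater than 1.

    S1, S2: sets of frozensets (candidate cluster-projections on Ai).
    """
    return {s1 & s2 for s1 in S1 for s2 in S2 if len(s1 & s2) > 1}
```

Intersections of size 1 are discarded because a single attribute value cannot form a cluster-projection on its own.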
15. Validation phase
- Use a required support threshold to recognise false candidates: a candidate cluster may lack enough support because some of the 2-clusters combined to form it are due to different sets of tuples.
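The validation check can be sketched as one more scan over the dataset; `threshold` here is a plain illustrative parameter standing in for the required support derived from α and the expected support.

```python
def validate(D, candidates, threshold):
    """Keep only candidate clusters with enough actual support in D.

    candidates: list of clusters, each a tuple of value-sets (one per attribute).
    """
    kept = []
    for C in candidates:
        # A tuple supports C if each of its values falls in the matching Ci.
        sup = sum(1 for t in D if all(t[i] in Ci for i, Ci in enumerate(C)))
        if sup >= threshold:
            kept.append(C)
    return kept
```

A candidate assembled from 2-clusters that were supported by disjoint sets of tuples scores low here and is rejected.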
16. Experiments and results
- Comparison with STIRR.
- The dataset has 1 million tuples, 10 attributes and 100 attribute values per attribute.
- CACTUS discovers a broader class of clusters than STIRR.
17. Experiments and results (continued)
18. Conclusion
- The authors formalised the definition of a cluster in categorical data.
- CACTUS is a fast and efficient algorithm for clustering categorical data.
- I am sorry that I could not show some parts of the algorithm due to time constraints.
19. Question Time