Title: Clustering Algorithms for Categorical Data Sets
- As mentioned earlier, one essential issue in clustering a categorical data set is to define a similarity (dissimilarity) function between two objects.
- One of the most fundamental and important data models for categorical data sets is the market-basket data model.
The Market-Basket Data Model
- In this data model, there is a set of objects O1, O2, …, On and a set of transactions T1, T2, …, Tm. Each transaction is a subset of the object set.
- A market-basket data set is typically represented by a 2-dimensional table, in which each entry is either 0 or 1.
The Tabular Representation of the Market-Basket Data Model
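A minimal sketch of this representation in Python (the objects and transactions below are hypothetical examples, not data from the slides):

    # Build the 0/1 market-basket table from transaction sets.
    objects = ["O1", "O2", "O3", "O4"]
    transactions = {
        "T1": {"O1", "O3"},
        "T2": {"O2", "O3", "O4"},
        "T3": {"O1"},
    }

    # Each entry is 1 if the transaction contains the object, else 0.
    table = {
        t: [1 if obj in items else 0 for obj in objects]
        for t, items in transactions.items()
    }

    for t, row in table.items():
        print(t, row)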
Data Sets with the Market-Basket Data Model
- A record of purchasing transactions.
- A record of web site accesses.
- A record of course enrollment.
Clustering Objects in a Market-Basket Data Set
- In this problem, it is assumed that each transaction is an independent event.
- The commonly used measures of similarity include
  - Jaccard coefficient.
  - Mutual information.
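A small sketch of both measures for two objects represented as 0/1 vectors over the same transactions (plain Python; the example vectors are made up):

    import math

    def jaccard(a, b):
        """Jaccard coefficient of two 0/1 vectors over the same transactions."""
        inter = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
        union = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
        return inter / union if union else 0.0

    def mutual_information(a, b):
        """Mutual information (in bits) between two 0/1 vectors."""
        n = len(a)
        mi = 0.0
        for u in (0, 1):
            for v in (0, 1):
                p_uv = sum(1 for x, y in zip(a, b) if x == u and y == v) / n
                p_u = sum(1 for x in a if x == u) / n
                p_v = sum(1 for y in b if y == v) / n
                if p_uv > 0:
                    mi += p_uv * math.log2(p_uv / (p_u * p_v))
        return mi

    a = [1, 1, 0, 1, 1]   # made-up example columns
    b = [0, 1, 0, 1, 1]
    print(jaccard(a, b), mutual_information(a, b))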
- Once the similarity between each pair of objects has been determined, we may apply algorithms such as single-link and complete-link to cluster the objects.
- Experimental results show that the complete-link algorithm generally yields better clustering quality than the single-link algorithm.
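A hedged sketch of how single-link and complete-link clustering could be run from a precomputed similarity matrix, here using SciPy's linkage on distances defined as 1 − similarity (the similarity values below are illustrative only):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    # Illustrative pairwise similarity matrix for four objects (made-up values).
    sim = np.array([
        [1.00, 0.20, 0.75, 0.40],
        [0.20, 1.00, 0.00, 0.60],
        [0.75, 0.00, 1.00, 0.20],
        [0.40, 0.60, 0.20, 1.00],
    ])

    dist = 1.0 - sim              # turn similarity into distance
    condensed = squareform(dist)  # condensed form expected by linkage

    for method in ("single", "complete"):
        Z = linkage(condensed, method=method)
        labels = fcluster(Z, t=2, criterion="maxclust")
        print(method, labels)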
An Example
- Given the following web access record, we may cluster the web sites accordingly.
            Site 1  Site 2  Site 3  Site 4  Site 5
    User1     1       1       0       1       1
    User2     1       0       1       0       0
    User3     0       1       0       1       1
    User4     1       0       1       0       1
    User5     1       0       1       1       1
- Based on the Jaccard coefficient, we have the following similarity measurements:
  - sim(s1, s2) = 1/5
  - sim(s1, s3) = 3/4
  - sim(s1, s4) = 2/5
  - sim(s1, s5) = 3/5
  - sim(s2, s3) = 0
  - sim(s2, s4) = 2/3
  - sim(s2, s5) = 1/2
  - sim(s3, s4) = 1/5
  - sim(s3, s5) = 2/5
  - sim(s4, s5) = 3/5
- If we employ the complete-link algorithm, then we have the following clustering result:
  [Dendrogram: s1 and s3 merge at similarity 3/4; s2 and s4 merge at 2/3; s5 joins {s2, s4} at 1/2.]
- We may use the chi-square statistic as the similarity measure. However, we need to consider whether the accesses to two web sites are positively correlated or negatively correlated.
- For example, the contingency table of s1 and s3 is

                s1 = 1   s1 = 0
    s3 = 1        3        0      3/5
    s3 = 0        1        1      2/5
                 4/5      1/5
- Similarly, the contingency table of s2 and s3 is

                s2 = 1   s2 = 0
    s3 = 1        0        3      3/5
    s3 = 0        2        0      2/5
                 2/5      3/5
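A sketch of the chi-square computation for the two contingency tables above, using scipy.stats.chi2_contingency; the sign of the correlation is judged by comparing the observed count of joint accesses with the count expected under independence:

    from scipy.stats import chi2_contingency

    # Rows: s3 accessed / not accessed; columns: other site accessed / not accessed.
    table_s1_s3 = [[3, 0],
                   [1, 1]]
    table_s2_s3 = [[0, 3],
                   [2, 0]]

    for name, table in [("s1 vs s3", table_s1_s3), ("s2 vs s3", table_s2_s3)]:
        chi2, p, dof, expected = chi2_contingency(table, correction=False)
        # Positive correlation: observed joint accesses exceed the expected count.
        direction = "positively" if table[0][0] > expected[0][0] else "negatively"
        print(f"{name}: chi2 = {chi2:.3f}, correlated {direction}")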
The Object-Attribute Data Model
- In this data model, there is a set of objects O1, O2, …, On and a set of attributes A1, A2, …, Am. Each attribute has a number of possible values.
- For example, we may characterize a person by education background, profession, marital status, etc.
- If each attribute has exactly two possible values, then the object-attribute data model degenerates to the market-basket data model.
- An object-attribute data set can be transformed into a market-basket data set, as the following example shows.
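A minimal sketch of this transformation in Python (the attribute names and values are hypothetical): every (attribute, value) pair becomes one binary market-basket item.

    # Hypothetical object-attribute records.
    records = [
        {"education": "BSc", "profession": "engineer", "married": "yes"},
        {"education": "PhD", "profession": "teacher",  "married": "no"},
    ]

    # Every (attribute, value) pair becomes one binary "item" column.
    columns = sorted({(a, v) for r in records for a, v in r.items()})

    basket = [[1 if r.get(a) == v else 0 for (a, v) in columns] for r in records]

    print(columns)
    for row in basket:
        print(row)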
The ROCK Algorithm
- ROCK is a categorical data clustering algorithm that takes node connectivity into account.
- In ROCK, each object is represented by a node.
- Two nodes are connected by an edge if the similarity between the corresponding objects exceeds a threshold θ.
- Let link(ni, nj) denote the number of common neighbors of two nodes ni and nj.
- Given a data set and an integer k, the ROCK algorithm partitions the objects into k clusters so that the following criterion function is maximized.
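In the standard ROCK formulation, the criterion function to be maximized is

    $$ E_l = \sum_{i=1}^{k} n_i \sum_{p_q,\, p_r \in C_i} \frac{\mathit{link}(p_q, p_r)}{n_i^{\,1+2f(\theta)}}, $$

where n_i is the number of objects in cluster C_i, θ is the similarity threshold, and f(θ) controls the expected number of neighbors of an object.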
- The ROCK algorithm works bottom-up, repeatedly merging the pair of clusters that has the maximum goodness measure.
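In the standard ROCK formulation, the goodness measure of merging clusters C_i and C_j is

    $$ g(C_i, C_j) = \frac{\mathit{link}[C_i, C_j]}{(n_i + n_j)^{1+2f(\theta)} - n_i^{\,1+2f(\theta)} - n_j^{\,1+2f(\theta)}}, $$

where link[C_i, C_j] is the total number of cross links between the two clusters.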
Fundamentals of the Criterion Function
- Assume that the expected number of neighbors of a node in cluster Ci is ni^f(θ).
- Then, the expected number of links contributed by a node in Ci is ni^(2f(θ)).
- Therefore, the expected number of links in Ci is ni^(1+2f(θ)).
The Pseudo-code of the ROCK Algorithm

    procedure cluster(S, k)
    begin
        link := compute_links(S)
        for each s ∈ S do
            q[s] := build_local_heap(link, s)
        Q := build_global_heap(S, q)
        while size(Q) > k do {
            u := extract_max(Q)
            v := max(q[u]); delete(Q, v)
            w := merge(u, v)
            for each x ∈ q[u] ∪ q[v] do {
                link[x, w] := link[x, u] + link[x, v]
                delete(q[x], u); delete(q[x], v)
                insert(q[x], w, g(w, x)); insert(q[w], x, g(w, x))
                update(Q, x, q[x])
            }
            insert(Q, w, q[w])
            deallocate(q[u]); deallocate(q[v])
        }
    end
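A sketch of the compute_links step only (not the full algorithm), assuming the neighbor relation is given as a 0/1 adjacency matrix:

    import numpy as np

    def compute_links(adjacency):
        """link[i, j] = number of common neighbors of nodes i and j.

        `adjacency` is a 0/1 matrix with adjacency[i, j] = 1 when the
        similarity between objects i and j exceeds the threshold theta.
        """
        A = np.asarray(adjacency)
        link = A @ A                # (A @ A)[i, j] counts common neighbors
        np.fill_diagonal(link, 0)   # a node is not linked with itself
        return link

    # Tiny made-up neighbor graph over 4 objects.
    A = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 1],
                  [1, 1, 0, 0],
                  [0, 1, 0, 0]])
    print(compute_links(A))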
The COBWEB Conceptual Clustering Algorithm
- The COBWEB algorithm was developed by machine learning researchers in the 1980s for clustering objects in an object-attribute data set.
- The COBWEB algorithm yields a clustering dendrogram, called a classification tree, that characterizes each cluster with a probabilistic description.
The Classification Tree Generated by the COBWEB Algorithm
The Category Utility Function
- The COBWEB algorithm operates based on the so-called category utility (CU) function, which measures clustering quality.
- If we partition a set of objects into m clusters, then the CU of this particular partition is as follows.
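A standard statement of the category utility for a partition {C_1, …, C_m}, with attributes A_i taking values V_ij, is

    $$ CU = \frac{1}{m} \sum_{k=1}^{m} P(C_k)\left[\sum_{i}\sum_{j} P(A_i = V_{ij} \mid C_k)^2 - \sum_{i}\sum_{j} P(A_i = V_{ij})^2\right]. $$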
Insights into the CU Function
- For a given object in cluster Ck, if we guess its attribute values according to their probabilities of occurring, then the expected number of attribute values that we can correctly guess is
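In the standard CU derivation, this expected number of correct guesses is

    $$ \sum_{i}\sum_{j} P(A_i = V_{ij} \mid C_k)^2. $$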
- Given an object, without knowing which cluster the object is in, if we guess its attribute values according to their probabilities of occurring, then the expected number of attribute values that we can correctly guess is
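Correspondingly, without the cluster information the expected number of correct guesses is

    $$ \sum_{i}\sum_{j} P(A_i = V_{ij})^2, $$

so the bracketed term in the CU function measures the increase in correct guesses gained by knowing the cluster.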
- P(Ck) is incorporated in the CU function to give proper weighting to each cluster.
- Finally, m is placed in the denominator to prevent over-fitting.
Operation of the COBWEB Algorithm
- The COBWEB algorithm constructs a classification tree incrementally by inserting the objects into the tree one by one.
- When inserting an object into the classification tree, the COBWEB algorithm traverses the tree top-down, starting from the root node.
- At each node, the COBWEB algorithm considers 4 possible operations and selects the one that yields the highest CU function value:
  - insert.
  - create.
  - merge.
  - split.
- Insertion means that the new object is inserted into one of the existing child nodes. The COBWEB algorithm evaluates the CU function value of inserting the new object into each of the existing child nodes and selects the one with the highest score.
- The COBWEB algorithm also considers creating a new child node specifically for the new object.
- The COBWEB algorithm considers merging the two existing child nodes with the highest and second highest scores.
- The COBWEB algorithm considers splitting the existing child node with the highest score.
The COBWEB Algorithm

    Input:     The current node N in the concept hierarchy.
               An unclassified (attribute-value) instance I.
    Results:   A concept hierarchy that classifies the instance.
    Top-level call: Cobweb(Top-node, I).
    Variables: C, P, Q, and R are nodes in the hierarchy.
               W, X, Y, and Z are clustering (partition) scores.

    Cobweb(N, I)
        If N is a terminal node,
            Then Create-new-terminals(N, I)
                 Incorporate(N, I).
            Else Incorporate(N, I).
                 For each child C of node N,
                     Compute the score for placing I in C.
                 Let P be the node with the highest score W.
                 Let Q be the node with the second highest score.
                 Let X be the score for placing I in a new node R.
                 Let Y be the score for merging P and Q into one node.
                 Let Z be the score for splitting P into its children.
                 If W is the best score,
                     Then Cobweb(P, I) (place I in category P).
                 Else if X is the best score,
                     Then initialize R's probabilities using I's values
                          (place I by itself in the new category R).
                 Else if Y is the best score,
                     Then let O be Merge(P, R, N).
                          Cobweb(O, I).
                 Else if Z is the best score,
                     Then Split(P, N).
                          Cobweb(N, I).
Auxiliary COBWEB Operations

    Variables: N, O, P, and R are nodes in the hierarchy.
               I is an unclassified instance.
               A is a nominal attribute.
               V is a value of an attribute.

    Incorporate(N, I)
        Update the probability of category N.
        For each attribute A in instance I,
            For each value V of A,
                Update the probability of V given category N.

    Create-new-terminals(N, I)
        Create a new child M of node N.
        Initialize M's probabilities to those for N.
        Create a new child O of node N.
        Initialize O's probabilities using I's values.

    Merge(P, R, N)
        Make O a new child of N.
        Set O's probabilities to be P and R's average.
        Remove P and R as children of node N.
        Add P and R as children of node O.
        Return O.

    Split(P, N)
        Remove the child P of node N.
        Promote the children of P to be children of N.
Probability-Based Clustering
- The probability-based clustering approach is founded on a so-called finite mixture model.
- A mixture is a set of k probability distributions, each of which governs the attribute value distribution of one cluster.
A 2-Cluster Example of the Finite Mixture Model
- In this example, it is assumed that there are two clusters and that the attribute value distributions in both clusters are normal: N(μ1, σ1²) and N(μ2, σ2²).
The Data Set
- A 51 B 62 B 64 A 48 A 39 A 51
- A 43 A 47 A 51 B 64 B 62 A 48
- B 62 A 52 A 52 A 51 B 64 B 64
- B 64 B 64 B 62 B 63 A 52 A 42
- A 45 A 51 A 49 A 43 B 63 A 48
- A 42 B 65 A 48 B 65 B 64 A 41
- A 46 A 48 B 62 B 66 A 48
- A 45 A 49 A 43 B 65 B 64
- A 45 A 46 A 40 A 46 A 48
Operation of the EM Algorithm
- The EM algorithm is used to estimate the parameters of the finite mixture model.
- Let s1, s2, …, sn denote the set of samples.
- In this example, we need to estimate the following 5 parameters: μ1, σ1, μ2, σ2, and P(C1).
- For a general 1-dimensional case with k clusters, we need to estimate 2k + (k − 1) parameters in total (k means, k standard deviations, and k − 1 independent cluster priors).
- The EM algorithm begins with an initial guess of the parameter values.
- Then, the probabilities that sample si belongs to these two clusters are computed as follows:
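In the standard E-step for this two-cluster normal model, with density f(x; μ, σ) = (1/(√(2π) σ)) exp(−(x−μ)²/(2σ²)),

    $$ P(C_1 \mid s_i) = \frac{P(C_1)\, f(s_i; \mu_1, \sigma_1)}{P(C_1)\, f(s_i; \mu_1, \sigma_1) + P(C_2)\, f(s_i; \mu_2, \sigma_2)}, \qquad P(C_2 \mid s_i) = 1 - P(C_1 \mid s_i). $$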
- The new estimated values of the parameters are computed as follows:
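In the standard M-step, writing w_i = P(C1 | s_i),

    $$ \mu_1 = \frac{\sum_i w_i s_i}{\sum_i w_i}, \qquad \sigma_1^2 = \frac{\sum_i w_i (s_i - \mu_1)^2}{\sum_i w_i}, \qquad P(C_1) = \frac{1}{n}\sum_i w_i, $$

and symmetrically for cluster C2 with weights 1 − w_i.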
- The process is repeated until the clustering results converge.
- Generally, we attempt to maximize the following likelihood function:
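In the standard formulation, the likelihood being maximized is

    $$ L = \prod_{i=1}^{n} \sum_{j=1}^{k} P(C_j)\, f(s_i; \mu_j, \sigma_j), $$

and in practice the log-likelihood is monitored to detect convergence.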
- Once we have figured out the approximate parameter values, we assign sample si to C1 if P(C1 | si) ≥ P(C2 | si); otherwise, si is assigned to C2.
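A compact sketch of these updates for the two-cluster, one-attribute case (NumPy; the initial guesses and the sample subset below are arbitrary):

    import numpy as np

    def em_step(s, mu, sigma, p1):
        """One EM iteration for a 2-component 1-D Gaussian mixture.

        mu, sigma are length-2 arrays; p1 is the current estimate of P(C1).
        Returns updated (mu, sigma, p1) and the responsibilities w = P(C1 | s_i).
        """
        def normal(x, m, sd):
            return np.exp(-(x - m) ** 2 / (2 * sd ** 2)) / (np.sqrt(2 * np.pi) * sd)

        # E-step: posterior probability that each sample belongs to cluster 1.
        a = p1 * normal(s, mu[0], sigma[0])
        b = (1 - p1) * normal(s, mu[1], sigma[1])
        w = a / (a + b)

        # M-step: re-estimate the parameters from the weighted samples.
        mu_new = np.array([np.sum(w * s) / np.sum(w),
                           np.sum((1 - w) * s) / np.sum(1 - w)])
        var_new = np.array([np.sum(w * (s - mu_new[0]) ** 2) / np.sum(w),
                            np.sum((1 - w) * (s - mu_new[1]) ** 2) / np.sum(1 - w)])
        return mu_new, np.sqrt(var_new), np.mean(w), w

    # Arbitrary initial guesses on a small subset of the sample values.
    s = np.array([51, 62, 64, 48, 39, 51, 43, 47, 62, 65], dtype=float)
    mu, sigma, p1 = np.array([45.0, 60.0]), np.array([5.0, 5.0]), 0.5
    for _ in range(20):
        mu, sigma, p1, w = em_step(s, mu, sigma, p1)
    print(mu, sigma, p1)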
The Finite Mixture Model for Multiple Attributes
- The finite mixture model described above can easily be generalized to handle multiple independent attributes.
- For example, in a case with two independent attributes, the distribution function of cluster j is of the form
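Under the independence assumption, a standard form is

    $$ f_j(x, y) = f(x; \mu_{xj}, \sigma_{xj}) \cdot f(y; \mu_{yj}, \sigma_{yj}), $$

where f is the one-dimensional normal density used above.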
- Assume that there are 3 clusters in a 2-dimensional data set. Then, we have 14 parameters to be determined: μx1, μy1, σx1, σy1, μx2, μy2, σx2, σy2, μx3, μy3, σx3, σy3, P(C1), and P(C2).
- The probability that sample si belongs to Cj is
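A standard form of this posterior, for si = (x_i, y_i), is

    $$ P(C_j \mid s_i) = \frac{P(C_j)\, f_j(x_i, y_i)}{\sum_{l=1}^{3} P(C_l)\, f_l(x_i, y_i)}. $$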
- The new estimated values of the parameters are computed with the same weighted-average updates as in the 1-dimensional case, applied to each attribute separately.
Limitations of the Finite Mixture Model and the EM Algorithm
- The finite mixture model and the EM algorithm generally assume that the attributes are independent.
- Approaches have been proposed for handling correlated attributes. However, these approaches are subject to further limitations.
Generalization of the Finite Mixture Model and the EM Algorithm
- The finite mixture model and the EM algorithm can be generalized to handle other types of probability distributions.
- For example, if we want to partition the objects into k clusters based on m independent nominal attributes, then we can apply the EM algorithm to estimate the parameters required to describe the distribution.
- In this case, the total number of parameters can be counted as shown below this list.
- If two attributes Ai and Aj are correlated, then we can merge them into a composite attribute with |Ai| × |Aj| possible values.
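One count consistent with the example that follows (a sketch, assuming attribute Ai has |Ai| possible values) is $k \sum_{i=1}^{m} |A_i|$ cluster-conditional probabilities, plus the k cluster priors P(C1), …, P(Ck), of which k − 1 are free.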
An Example
- Assume that we want to partition 100 samples of a particular species of insect into 3 clusters according to 4 attributes:
  - Color (Ac): milk, light brown, or dark brown.
  - Head shape (Ah): spherical or triangular.
  - Body length (Al): long or short.
  - Weight (Aw): heavy or light.
- If we determine that body length and weight are correlated, then we create a composite attribute As = (body length, weight) with 4 possible values: (L, H), (L, L), (S, H), and (S, L).
- With the EM algorithm, we can estimate the parameter values in the following table, in addition to P(C1), P(C2), and P(C3):

          Color                        Head shape          (Body length, Weight)
    C1    P(M|C1), P(L|C1), P(D|C1)    P(S|C1), P(T|C1)    P((L,H)|C1), P((S,H)|C1), P((L,L)|C1), P((S,L)|C1)
    C2    P(M|C2), P(L|C2), P(D|C2)    P(S|C2), P(T|C2)    P((L,H)|C2), P((S,H)|C2), P((L,L)|C2), P((S,L)|C2)
    C3    P(M|C3), P(L|C3), P(D|C3)    P(S|C3), P(T|C3)    P((L,H)|C3), P((S,H)|C3), P((L,L)|C3), P((S,L)|C3)
- We invoke the EM algorithm with an initial guess of these parameter values.
- For each sample si = (v1, v2, v3), we compute the following probabilities:
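Under the independence assumption, a standard form of this E-step is

    $$ P(C_j \mid s_i) = \frac{P(C_j)\, P(v_1 \mid C_j)\, P(v_2 \mid C_j)\, P(v_3 \mid C_j)}{\sum_{l=1}^{3} P(C_l)\, P(v_1 \mid C_l)\, P(v_2 \mid C_l)\, P(v_3 \mid C_l)}. $$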
- The new estimated values of the parameters are computed as follows:
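A standard form of these updates is

    $$ P(C_j) = \frac{1}{n} \sum_{i=1}^{n} P(C_j \mid s_i), \qquad P(v \mid C_j) = \frac{\sum_{i:\, s_i \text{ takes value } v} P(C_j \mid s_i)}{\sum_{i=1}^{n} P(C_j \mid s_i)}. $$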