Clustering Algorithms for Categorical Data Sets - PowerPoint PPT Presentation

1
Clustering Algorithms for Categorical Data Sets
  • As mentioned earlier, one essential issue for
    clustering a categorical data set is to define a
    similarity (dissimilarity) function between two
    objects.
  • One of the most fundamental and important data
    models for categorical data sets is the
    market-basket data model.

2
The Market-Basket Data Model
  • In this data model, there is a set of objects O1,
    O2, …, On and a set of transactions T1, T2, …,
    Tm. Each transaction is a subset of the object
    set.
  • A market-basket data set is typically represented
    by a 2-dimensional table, in which each entry is
    either 0 or 1.

3
The Tabular Representation of the Market Basket
Data Model
4
Data Sets with the Market-Basket Data Model
  • A record of purchasing transactions.
  • A record of web site accesses.
  • A record of course enrollment.

5
Clustering Objects in a Market-Basket Data Set
  • In this problem, it is assumed that each
    transaction is an independent event.
  • The commonly used measures of similarity include
    the following (see the sketch after this list):
  • Jaccard coefficient.
  • Mutual information.
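For concreteness, here is a minimal sketch of both measures for a pair of objects represented as 0/1 columns over the same set of transactions. The function names and the toy vectors are illustrative assumptions, not taken from the slides.

```python
import math

def jaccard(a, b):
    """Jaccard coefficient of two 0/1 columns:
    |intersection| / |union| of the transactions containing each object."""
    inter = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    union = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return inter / union if union else 0.0

def mutual_information(a, b):
    """Mutual information of two 0/1 columns, treating each column
    as a binary random variable over the transactions."""
    n = len(a)
    mi = 0.0
    for u in (0, 1):
        for v in (0, 1):
            p_uv = sum(1 for x, y in zip(a, b) if x == u and y == v) / n
            p_u = sum(1 for x in a if x == u) / n
            p_v = sum(1 for y in b if y == v) / n
            if p_uv > 0:
                mi += p_uv * math.log2(p_uv / (p_u * p_v))
    return mi

# Toy example: two objects observed over five transactions.
obj1 = [1, 1, 0, 1, 1]
obj2 = [1, 0, 1, 0, 0]
print(jaccard(obj1, obj2), mutual_information(obj1, obj2))
```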

6
  • Once the similarity between each pair of objects
    has been determined, then we may apply algorithms
    such as single-link and complete-link to cluster
    the objects.
  • Experimental results show that the complete-link
    algorithm generally yields better clustering
    quality than the single-link algorithm.

7
An Example
  • Given the following web access record, we may
    cluster the web sites accordingly.

        Site 1  Site 2  Site 3  Site 4  Site 5
User1      1       1       0       1       1
User2      1       0       1       0       0
User3      0       1       0       1       1
User4      1       0       1       0       1
User5      1       0       1       1       1
8
  • Based on the Jaccard coefficient, we have the
    following similarity measurements
  • sim(s1, s2) = 1/5
  • sim(s1, s3) = 3/4
  • sim(s1, s4) = 2/5
  • sim(s1, s5) = 3/5
  • sim(s2, s3) = 0
  • sim(s2, s4) = 2/3
  • sim(s2, s5) = 1/2
  • sim(s3, s4) = 1/5
  • sim(s3, s5) = 2/5
  • sim(s4, s5) = 3/5

9
  • If we employ the complete-link algorithm, then we
    obtain the following clustering result (a sketch
    that reproduces this dendrogram follows):

[Dendrogram: s1 and s3 merge at similarity 3/4; s2 and
s4 merge at 2/3; s5 joins {s2, s4} at 1/2.]
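The following sketch reproduces this dendrogram from the similarity values on the previous slide, using SciPy's hierarchical clustering on distances defined as 1 minus similarity. The use of SciPy and the variable names are assumptions made for illustration, not part of the slides.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Pairwise Jaccard similarities from the previous slide, for sites s1..s5.
sim = np.array([
    [1.0, 1/5, 3/4, 2/5, 3/5],
    [1/5, 1.0, 0.0, 2/3, 1/2],
    [3/4, 0.0, 1.0, 1/5, 2/5],
    [2/5, 2/3, 1/5, 1.0, 3/5],
    [3/5, 1/2, 2/5, 3/5, 1.0],
])

# Complete-link agglomerative clustering on distance = 1 - similarity.
dist = 1.0 - sim
Z = linkage(squareform(dist, checks=False), method="complete")

# Each merge row of Z: (cluster a, cluster b, merge distance, size);
# 1 - merge distance gives the similarity level of the merge.
for a, b, d, size in Z:
    print(int(a), int(b), "merged at similarity", round(1 - d, 3))
```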
10
  • We may use the chi-square statistic as the
    similarity measure. However, we need to consider
    whether the accesses to two web sites are
    positively correlated or negatively correlated
    (a computational sketch follows the two tables).
  • For example, for sites s1 and s3 we have the
    following contingency table:

         s1     ¬s1
  s3      3      0      3/5
 ¬s3      1      1      2/5
         4/5    1/5
11
  • On the other hand, for sites s2 and s3 we have:

         s2     ¬s2
  s3      0      3      3/5
 ¬s3      2      0      2/5
         2/5    3/5
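A minimal sketch of the chi-square computation for such a 2x2 contingency table. SciPy's chi2_contingency is used here purely for illustration (with Yates' correction disabled); this tooling choice is an assumption, not something prescribed by the slides.

```python
import numpy as np
from scipy.stats import chi2_contingency

# 2x2 contingency table for s1 vs s3 from the slide:
# rows: s3 accessed / not accessed, columns: s1 accessed / not accessed.
observed = np.array([[3, 0],
                     [1, 1]])

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print("chi-square:", round(chi2, 3))
print("expected counts:", expected)

# The chi-square value itself does not carry a sign; compare the observed
# count of joint accesses with the expected count to tell positive from
# negative correlation.
positively_correlated = observed[0, 0] > expected[0, 0]
print("positively correlated:", positively_correlated)
```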
12
The Object-Attribute Data Model
  • In this data model, there is a set of objects O1,
    O2, …, On and a set of attributes A1, A2, …, Am.
    Each attribute has a number of possible values.
  • For example, we may characterize a person by
    education background, profession, marital
    status, etc.

13
  • If each attribute has exactly two possible
    values, then the object-attribute data model
    degenerates to the market-basket data model.
  • An object-attribute data set can be transformed
    into a market-basket data set, as the following
    example shows.
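The example table from the original slide is not reproduced in this transcript; the sketch below illustrates the transformation on a small, hypothetical object-attribute table, turning each (attribute, value) pair into a market-basket item. The attribute names and values are purely illustrative.

```python
# Each object is described by nominal attributes (object-attribute model).
objects = {
    "O1": {"education": "BS", "profession": "engineer"},
    "O2": {"education": "MS", "profession": "teacher"},
    "O3": {"education": "BS", "profession": "teacher"},
}

# Transform to the market-basket model: every (attribute, value) pair
# becomes an item, and each object becomes a 0/1 row over those items.
items = sorted({(a, v) for attrs in objects.values() for a, v in attrs.items()})
table = {
    name: [1 if attrs.get(a) == v else 0 for (a, v) in items]
    for name, attrs in objects.items()
}

print(items)
for name, row in table.items():
    print(name, row)
```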

14
The ROCK algorithm
  • A categorical data clustering algorithm that
    takes into account node connectivity.
  • In ROCK, each object is represented by a node.
  • Two nodes are connected by an edge if the
    similarity between the corresponding objects
    exceeds a threshold.

15
  • Let link(ni, nj) of two nodes ni and nj denote
    the number of common neighbors of these two
    nodes.
  • Given a data set and an integer number k, the
    ROCK algorithm partitions the objects into k
    clusters so that the following function is
    maximized.
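The criterion function appears only as an image in the original slides. Assuming the slide follows the ROCK paper (Guha, Rastogi and Shim), the function to be maximized is:

```latex
E_l = \sum_{i=1}^{k} n_i \cdot
      \sum_{p_q, p_r \in C_i} \frac{\mathrm{link}(p_q, p_r)}{n_i^{\,1 + 2 f(\theta)}}
```

where $n_i = |C_i|$, $\theta$ is the similarity threshold used to define neighbors, and $f(\theta)$ is a user-chosen function (the paper suggests $f(\theta) = (1-\theta)/(1+\theta)$ for market-basket data).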

16
  • The ROCK algorithm works bottom-up by merging, at
    each step, the pair of clusters that has the
    maximum goodness measure:
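The goodness measure is likewise shown only as an image; assuming again the definition from the ROCK paper:

```latex
g(C_i, C_j) = \frac{\mathrm{link}[C_i, C_j]}
                   {(n_i + n_j)^{1 + 2 f(\theta)} - n_i^{1 + 2 f(\theta)} - n_j^{1 + 2 f(\theta)}}
```

where $\mathrm{link}[C_i, C_j]$ is the total number of cross links between the two clusters, i.e. the sum of $\mathrm{link}(p, q)$ over $p \in C_i$ and $q \in C_j$.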

17
Fundamentals of the Criterion Function
  • Assume that the expected number of edges at a
    node in cluster Ci is |Ci|^f(θ).
  • Then, the expected number of links contributed by
    a node in Ci is |Ci|^(2f(θ)).
  • Therefore, the expected number of links in Ci is
    |Ci|^(1+2f(θ)), the normalizing term in the
    criterion function.

18
The Pseudo-code of the ROCK Algorithm
  procedure cluster(S, k)
  begin
      link := compute_links(S)
      for each s ∈ S do
          q[s] := build_local_heap(link, s)
      Q := build_global_heap(S, q)
      while size(Q) > k do {
          u := extract_max(Q)
          v := max(q[u])
          delete(Q, v)
          w := merge(u, v)
          for each x ∈ q[u] ∪ q[v] do {
              link[x, w] := link[x, u] + link[x, v]
              delete(q[x], u); delete(q[x], v)
              insert(q[x], w, g(w, x)); insert(q[w], x, g(w, x))
              update(Q, x, q[x])
          }
          insert(Q, w, q[w])
          deallocate(q[u]); deallocate(q[v])
      }
  end
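Below is a compact sketch of the two core ingredients used by this procedure: the link counts (numbers of common neighbors) and the goodness measure evaluated when merging two clusters. The threshold theta, the choice f(theta) = (1 - theta)/(1 + theta), and the helper names come from the ROCK paper and are assumptions as far as these slides are concerned.

```python
import numpy as np

def links(sim, theta):
    """link[i, j] = number of common neighbors of points i and j,
    where i and j are neighbors iff sim[i, j] >= theta."""
    adj = (sim >= theta).astype(int)
    np.fill_diagonal(adj, 0)
    return adj @ adj          # (A @ A)[i, j] counts common neighbors

def goodness(link, ci, cj, theta):
    """ROCK goodness measure for merging clusters ci and cj (index lists)."""
    f = (1 - theta) / (1 + theta)
    cross_links = link[np.ix_(ci, cj)].sum()
    ni, nj = len(ci), len(cj)
    denom = (ni + nj) ** (1 + 2 * f) - ni ** (1 + 2 * f) - nj ** (1 + 2 * f)
    return cross_links / denom

# Toy usage on the 5-site similarity matrix from the earlier example.
sim = np.array([
    [1.0, 0.2, 0.75, 0.4, 0.6],
    [0.2, 1.0, 0.0, 2/3, 0.5],
    [0.75, 0.0, 1.0, 0.2, 0.4],
    [0.4, 2/3, 0.2, 1.0, 0.6],
    [0.6, 0.5, 0.4, 0.6, 1.0],
])
link = links(sim, theta=0.5)
print(goodness(link, [1], [3], theta=0.5))
```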

19
The COBWEB Conceptual Clustering Algorithm
  • The COBWEB algorithm was developed by machine
    learning researchers in the 1980s for clustering
    objects in an object-attribute data set.
  • The COBWEB algorithm yields a clustering
    dendrogram, called a classification tree, that
    characterizes each cluster with a probabilistic
    description.

20
The Classification Tree Generated by the COBWEB
Algorithm
21
The Category Utility Function
  • The COBWEB algorithm operates based on the
    so-called category utility function (CU) that
    measures clustering quality.
  • If we partition a set of objects into m clusters,
    then the CU of this particular partition is
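The CU formula itself is shown as an image in the original slides. Assuming the standard category utility used by COBWEB (Fisher, 1987), with attributes $A_i$, attribute values $V_{ij}$, and clusters $C_k$:

```latex
CU = \frac{1}{m} \sum_{k=1}^{m} P(C_k)
     \left[ \sum_i \sum_j P(A_i = V_{ij} \mid C_k)^2
          - \sum_i \sum_j P(A_i = V_{ij})^2 \right]
```

The two sums inside the brackets are the expected numbers of correctly guessed attribute values discussed on the next two slides, with and without knowledge of the cluster.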

22
Insights of the CU Function
  • For a given object in cluster Ck, if we guess its
    attribute values according to their probabilities
    of occurrence, then the expected number of
    attribute values that we can correctly guess is
    Σ_i Σ_j P(A_i = V_ij | Ck)^2.

23
  • Given an object without knowing the cluster that
    the object is in, if we guess its attribute
    values according to their probabilities of
    occurrence, then the expected number of attribute
    values that we can correctly guess is
    Σ_i Σ_j P(A_i = V_ij)^2.

24
  • P(Ck) is incorporated in the CU function to give
    proper weighting to each cluster.
  • Finally, m is placed in the denominator to
    prevent over-fitting.

25
Operation of the COBWEB algorithm
  • The COBWEB algorithm constructs a classification
    tree incrementally by inserting the objects into
    the classification tree one by one.
  • When inserting an object into the classification
    tree, the COBWEB algorithm traverses the tree
    top-down starting from the root node.

26
  • At each node, the COBWEB algorithm considers 4
    possible operations and selects the one that
    yields the highest CU function value:
  • insert.
  • create.
  • merge.
  • split.

27
  • Insertion means that the new object is inserted
    into one of the existing child nodes. The COBWEB
    algorithm evaluates the respective CU function
    value of inserting the new object into each of
    the existing child nodes and selects the one with
    the highest score.
  • The COBWEB algorithm also considers creating a
    new child node specifically for the new object.

28
  • The COBWEB algorithm considers merging the two
    existing child nodes with the highest and second
    highest scores.

29
  • The COBWEB algorithm considers splitting the
    existing child node with the highest score.

30
The COBWEB Algorithm
  • Cobweb(N, I)
  • If N is a terminal node,
  • Then Create-new-terminals(N, I)
  • Incorporate(N,I).
  • Else Incorporate(N, I).
  • For each child C of node N,
  • Compute the score for placing I in C.
  • Let P be the node with the highest score W.
  • Let Q be the node with the second highest
    score.
  • Let X be the score for placing I in a new node
    R.
  • Let Y be the score for merging P and Q into one
    node.
  • Let Z be the score for splitting P into its
    children.
  • If W is the best score,
  • Then Cobweb(P, I) (place I in category P).
  • Else if X is the best score,
  • Then initialize R's probabilities using I's
    values
  • (place I by itself in the new category R).
  • Else if Y is the best score,
  • Then let O be Merge(P, R, N) and call
    Cobweb(O, I).
  • Else if Z is the best score,
  • Then Split(P, N) and call Cobweb(N, I).

Input: The current node N in the concept hierarchy,
and an unclassified (attribute-value) instance I.
Results: A concept hierarchy that classifies the
instance.
Top-level call: Cobweb(Top-node, I).
Variables: C, P, Q, and R are nodes in the hierarchy;
W, X, Y, and Z are clustering (partition) scores.
31
Auxiliary COBWEB Operations
Variables: N, O, P, and R are nodes in the hierarchy;
I is an unclassified instance; A is a nominal
attribute; V is a value of an attribute.

Incorporate(N, I)
  Update the probability of category N.
  For each attribute A in instance I,
    For each value V of A,
      Update the probability of V given category N.

Create-new-terminals(N, I)
  Create a new child M of node N.
  Initialize M's probabilities to those for N.
  Create a new child O of node N.
  Initialize O's probabilities using I's values.

Merge(P, R, N)
  Make O a new child of N.
  Set O's probabilities to be P and R's average.
  Remove P and R as children of node N.
  Add P and R as children of node O.
  Return O.

Split(P, N)
  Remove the child P of node N.
  Promote the children of P to be children of N.
32
Probability-Based Clustering
  • The probability-based clustering approach is
    based on a so-called finite mixture model.
  • A mixture is a set of k probability
    distributions, each of which governs the
    attribute value distribution of a cluster.

33
A 2-Cluster Example of the Finite Mixture Model
  • In this example, it is assumed that there are two
    clusters and the attribute value distributions in
    both clusters are normal distributions.

[Figure: two overlapping normal distributions,
N(μ1, σ1²) and N(μ2, σ2²).]
34
The Data Set
  • A 51 B 62 B 64 A 48 A 39 A 51
  • A 43 A 47 A 51 B 64 B 62 A 48
  • B 62 A 52 A 52 A 51 B 64 B 64
  • B 64 B 64 B 62 B 63 A 52 A 42
  • A 45 A 51 A 49 A 43 B 63 A 48
  • A 42 B 65 A 48 B 65 B 64 A 41
  • A 46 A 48 B 62 B 66 A 48
  • A 45 A 49 A 43 B 65 B 64
  • A 45 A 46 A 40 A 46 A 48

35
Operation of the EM Algorithm
  • The EM algorithm is used to figure out the
    parameters of the finite mixture model.
  • Let s1, s2, …, sn denote the set of samples.
  • In this example, we need to figure out the
    following 5 parameters:
  • μ1, σ1, μ2, σ2, P(C1).

36
  • For a general 1-dimensional case that has k
    clusters, we need to figure out 2k + (k − 1) =
    3k − 1 parameters in total.
  • The EM algorithm begins with an initial guess of
    the parameter values.

37
  • Then, the probabilities that sample si belongs to
    these two clusters are computed as follows:
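The formulas are shown as images in the original slides; under the stated normal-mixture assumptions, the standard expectation step is (with $f(s; \mu, \sigma^2)$ denoting the normal density):

```latex
P(C_1 \mid s_i) = \frac{f(s_i; \mu_1, \sigma_1^2)\, P(C_1)}
                       {f(s_i; \mu_1, \sigma_1^2)\, P(C_1) + f(s_i; \mu_2, \sigma_2^2)\, P(C_2)},
\qquad
P(C_2 \mid s_i) = 1 - P(C_1 \mid s_i)
```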

38
  • The new estimated values of parameters are
    computed as follows.
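Again the slide's formulas are not in the transcript; the usual maximization-step updates, written with the responsibilities $w_i = P(C_1 \mid s_i)$ from the previous step, are:

```latex
\mu_1 = \frac{\sum_i w_i s_i}{\sum_i w_i}, \qquad
\sigma_1^2 = \frac{\sum_i w_i (s_i - \mu_1)^2}{\sum_i w_i}, \qquad
P(C_1) = \frac{1}{n} \sum_i w_i
```

with the symmetric updates for $\mu_2$ and $\sigma_2^2$ using $1 - w_i$ in place of $w_i$.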

39
  • The process is repeated until the clustering
    results converge.
  • Generally, we attempt to maximize the following
    likelihood function
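The likelihood being maximized, reconstructed from the standard finite-mixture setting since the slide shows it as an image, is:

```latex
L(\mu_1, \sigma_1, \mu_2, \sigma_2, P(C_1))
  = \prod_{i=1}^{n} \big[ P(C_1)\, f(s_i; \mu_1, \sigma_1^2)
                        + P(C_2)\, f(s_i; \mu_2, \sigma_2^2) \big]
```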

40
  • Once we have figured out the approximate
    parameter values, we assign sample si to C1 if
    P(C1 | si) ≥ P(C2 | si).
  • Otherwise, si is assigned to C2 (a complete
    sketch of the whole EM procedure is given below).
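A minimal, self-contained sketch of this two-cluster, one-dimensional EM procedure. The initial guesses, the fixed iteration count, and the variable names are choices made for illustration, not values from the slides; the toy data reuses a few values from the data-set slide.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_clusters(samples, iters=100):
    # Initial guesses for mu1, var1, mu2, var2, P(C1).
    mu1, mu2 = min(samples), max(samples)
    var1 = var2 = (max(samples) - min(samples)) ** 2 / 4 or 1.0
    p1 = 0.5
    for _ in range(iters):
        # E-step: responsibility of cluster 1 for each sample.
        w = []
        for s in samples:
            a = p1 * normal_pdf(s, mu1, var1)
            b = (1 - p1) * normal_pdf(s, mu2, var2)
            w.append(a / (a + b))
        # M-step: re-estimate the five parameters.
        n1 = sum(w)
        n2 = len(samples) - n1
        mu1 = sum(wi * s for wi, s in zip(w, samples)) / n1
        mu2 = sum((1 - wi) * s for wi, s in zip(w, samples)) / n2
        var1 = sum(wi * (s - mu1) ** 2 for wi, s in zip(w, samples)) / n1
        var2 = sum((1 - wi) * (s - mu2) ** 2 for wi, s in zip(w, samples)) / n2
        p1 = n1 / len(samples)
    # Assign each sample to the cluster with the larger posterior.
    labels = ["C1" if wi >= 0.5 else "C2" for wi in w]
    return (mu1, var1, mu2, var2, p1), labels

data = [51, 62, 64, 48, 39, 51, 43, 47, 51, 64, 62, 48]
params, labels = em_two_clusters(data)
print(params)
print(labels)
```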

41
The Finite Mixture Model for Multiple Attributes
  • The finite mixture model described above can be
    easily generalized to handle multiple independent
    attributes.
  • For example, in a case that has two independent
    attributes x and y, the distribution function of
    cluster j is of the form:
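The slide's formula is an image; for two independent attributes $x$ and $y$ it is the product of two normal densities:

```latex
f_j(x, y) = f(x; \mu_{xj}, \sigma_{xj}^2) \cdot f(y; \mu_{yj}, \sigma_{yj}^2)
```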

42
  • Assume that there are 3 clusters in a
    2-dimensional data set. Then, we have 14
    parameters to be determined: μx1, μy1, σx1, σy1,
    μx2, μy2, σx2, σy2, μx3, μy3, σx3, σy3, P(C1),
    and P(C2).
  • The probability that sample si belongs to Cj is
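Reconstructed along the same lines as the one-dimensional case (the slide shows it as an image): with $s_i = (x_i, y_i)$,

```latex
P(C_j \mid s_i) = \frac{P(C_j)\, f_j(x_i, y_i)}
                       {\sum_{l=1}^{3} P(C_l)\, f_l(x_i, y_i)}
```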

43
  • The new estimated values of the parameters are
    computed as follows

44
Limitation of the Finite Mixture Model and the EM
Algorithm
  • The finite mixture model and the EM algorithm
    generally assume that the attributes are
    independent.
  • Approaches have been proposed for handling
    correlated attributes. However, these approaches
    are subject to further limitations.

45
Generalization of the Finite Mixture Model and
the EM Algorithm
  • The finite mixture model and the EM algorithm can
    be generalized to handle other types of
    probability distributions.
  • For example, if we want to partition the objects
    into k clusters based on m independent nominal
    attributes, then we can apply the EM algorithm to
    figure out the parameters required to describe
    the distribution.

46
  • In this case, the total number of parameters is
    equal to
  • If two attributes Ai and Aj are correlated, then
    we can merge these two attributes to form a
    composite attribute with |Ai| · |Aj| possible
    values.

47
An Example
  • Assume that we want to partition 100 samples of a
    particular species of insects into 3 clusters
    according to 4 attributes:
  • Color (Ac): milk, light brown, or dark brown.
  • Head shape (Ah): spherical or triangular.
  • Body length (Al): long or short.
  • Weight (Aw): heavy or light.

48
  • If we determine that body length and weight are
    correlated, then we create a composite attribute
    As = (body length, weight) with 4 possible
    values: (L, H), (L, L), (S, H), and (S, L).
  • We can figure out the values of the parameters in
    the following table with the EM algorithm, in
    addition to P(C1), P(C2), and P(C3)

     Color                      Head shape        (Body length, Weight)
C1   P(M|C1) P(L|C1) P(D|C1)    P(S|C1) P(T|C1)   P((L,H)|C1), P((S,H)|C1), P((L,L)|C1), P((S,L)|C1)
C2   P(M|C2) P(L|C2) P(D|C2)    P(S|C2) P(T|C2)   P((L,H)|C2), P((S,H)|C2), P((L,L)|C2), P((S,L)|C2)
C3   P(M|C3) P(L|C3) P(D|C3)    P(S|C3) P(T|C3)   P((L,H)|C3), P((S,H)|C3), P((L,L)|C3), P((S,L)|C3)
49
  • We invoke the EM algorithm with an initial guess
    of these parameter values.
  • For each sample si = (v1, v2, v3), we compute the
    following probabilities:
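Reconstructed from the independence assumption (the slide's formula is an image): with $s_i = (v_1, v_2, v_3)$,

```latex
P(C_j \mid s_i) = \frac{P(C_j)\, P(v_1 \mid C_j)\, P(v_2 \mid C_j)\, P(v_3 \mid C_j)}
                       {\sum_{l=1}^{3} P(C_l)\, P(v_1 \mid C_l)\, P(v_2 \mid C_l)\, P(v_3 \mid C_l)}
```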

50
  • The new estimated values of the parameters are
    computed as follows
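The update formulas are likewise shown as images in the original; the standard re-estimation for a nominal attribute value $V$ and cluster $C_j$ is:

```latex
P(V \mid C_j) = \frac{\sum_{i:\; s_i \text{ has value } V} P(C_j \mid s_i)}
                     {\sum_{i=1}^{n} P(C_j \mid s_i)},
\qquad
P(C_j) = \frac{1}{n} \sum_{i=1}^{n} P(C_j \mid s_i)
```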