Transcript and Presenter's Notes

Title: Techniques of Classification and Clustering


1
Techniques of Classification and Clustering
2
Problem Description
  • Assume
  • A = {A1, A2, ..., Ad}: (ordered or unordered) domains
  • S = A1 × A2 × ... × Ad: a d-dimensional (numerical or
    non-numerical) space
  • Input
  • V = {v1, v2, ..., vm}: d-dimensional points, where
    vi = <vi1, vi2, ..., vid>.
  • The jth component of vi is drawn from domain Aj.
  • Output
  • G = {g1, g2, ..., gk}: a set of groups of V with
    labels vL, where gi ⊆ V.

3
Classification
  • Supervised classification
  • Discriminant analysis, or simply Classification
  • A collection of labeled (pre-classified) patterns
    is provided
  • Aims to label a newly encountered, yet unlabeled,
    pattern
  • Unsupervised classification
  • Clustering
  • Aims to group a given collection of unlabeled
    patterns into meaningful clusters
  • Category labels are data driven

4
Methods for Classification
  • Neural Nets
  • Classification functions are obtained by making
    multiple passes over the training set
  • Poor efficiency in generating the classifier
  • Inefficient handling of non-numerical data
  • Decision trees
  • If E contains only objects of one group, the
    decision tree is just a leaf labeled with that
    group.
  • Construct a DT that correctly classifies objects
    in the training data set.
  • Test to classify the unseen objects in the test
    data set.

5
Decision Trees (Ex: Credit Analysis)

salary < 20000?
  no  → accept
  yes → education in graduate?
          yes → accept
          no  → reject
6
Decision Trees
  • Pros
  • Fast execution time
  • Generated rules are easy to interpret by humans
  • Scale well for large data sets
  • Can handle high dimensional data
  • Cons
  • Cannot capture correlations among attributes
  • Consider only axis-parallel cuts

7
Decision Tree Algorithms
  • Classifiers from the machine learning community
  • ID3: J. R. Quinlan, Induction of Decision Trees,
    Machine Learning, 1, 1986.
  • C4.5: J. Ross Quinlan, C4.5: Programs for Machine
    Learning, Morgan Kaufmann, 1993.
  • CART: L. Breiman, J. H. Friedman, R. A. Olshen,
    and C. J. Stone, Classification and Regression
    Trees, Wadsworth, Belmont, 1984.
  • Classifiers for large databases
  • SLIQ [MAR96]; SPRINT: John Shafer, Rakesh Agrawal,
    and Manish Mehta, SPRINT: A Scalable Parallel
    Classifier for Data Mining, Proc. of VLDB Conf.,
    Bombay, India, September 1996.
  • SONAR: Takeshi Fukuda, Yasuhiko Morimoto, and
    Shinichi Morishita, Constructing Efficient
    Decision Trees by Using Optimized Numeric
    Association Rules, Proc. of VLDB Conf., Bombay,
    India, 1996.
  • RainForest: J. Gehrke, R. Ramakrishnan, V. Ganti,
    RainForest: A Framework for Fast Decision Tree
    Construction of Large Datasets, Proc. of VLDB
    Conf., 1998.
  • Building phase followed by a pruning phase

8
Decision Tree Algorithms
  • Building phase
  • Recursively split nodes using best splitting
    attribute for node
  • Pruning phase
  • Smaller imperfect decision tree generally
    achieves better accuracy
  • Prune leaf nodes recursively to prevent
    over-fitting

9
Preliminaries
  • Theoretic Background
  • Entropy
  • Similarity measures
  • Advanced terms

10
Information Theory Concepts
  • Entropy of a random variable X with probability
    distribution p(x): H(X) = - Σx p(x) log p(x)
  • The Kullback-Leibler (KL) Divergence or Relative
    Entropy between two probability distributions p
    and q: KL(p || q) = Σx p(x) log (p(x) / q(x))
  • Mutual Information between random variables X and
    Y: I(X; Y) = Σx Σy p(x,y) log (p(x,y) / (p(x) p(y)))

11
What is Entropy?
  • S is a sample of the training data set
  • Entropy measures the impurity of S
  • H(X): the entropy of X
  • If H(X) = 0, X takes a single value; as H(X)
    increases, the values of X become more heterogeneous.
  • For the same number of X values,
  • Low entropy means X is from a varied (peaks and
    valleys) distribution: a histogram of the frequency
    distribution of values of X would have many lows and
    one or two highs, and so the values sampled from it
    would be fairly predictable.
  • High entropy means X is from a uniform (boring)
    distribution: a histogram of the frequency
    distribution of values of X would be flat, and so
    the values sampled from it would be all over the
    place.

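A small Python sketch (added here, not part of the original slides) of these
three quantities, assuming base-2 logarithms and distributions given as
arrays; the example values are made up.

import numpy as np

def entropy(p):
    # H(X) = -sum p(x) log2 p(x); zero-probability terms contribute 0
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def kl_divergence(p, q):
    # KL(p || q) = sum p(x) log2(p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0
    p, q = np.asarray(p, float), np.asarray(q, float)
    nz = p > 0
    return np.sum(p[nz] * np.log2(p[nz] / q[nz]))

def mutual_information(pxy):
    # I(X;Y) = KL( p(x,y) || p(x)p(y) ) for a joint distribution as a 2-D array
    pxy = np.asarray(pxy, float)
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    return kl_divergence(pxy.ravel(), (px * py).ravel())

# a peaked distribution has low entropy; a flat one attains the maximum log2(n)
print(entropy([0.97, 0.01, 0.01, 0.01]))   # close to 0
print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0
print(mutual_information([[0.4, 0.1], [0.1, 0.4]]))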
12
Entropy-Based Data Segmentation
T. Fukuda, Y. Morimoto, S. Morishita, T.
Tokuyama, Constructing Efficient Decision Trees
by Using Optimized Numeric Association Rules,
Proc. of VLDB Conf., 1996.
  • The attribute has three categories: 40 C1, 30 C2,
    and 30 C3.

        Total  C1  C2  C3
        100    40  30  30

  • Splitting

    S1   60    40  10  10
    S2   40     0  20  20

    S3   60    20  20  20
    S4   40    20  10  10
13
Information Theoretic Measure
R. Agrawal, S. Ghosh, T. Imielinski, B. Iyer, A.
Swami, An Interval Classifier for Database Mining
Applications, Proc. of VLDB Conf., 1992.
  • Information gain by branching on Ai:
    gain(Ai) = E - Ei
  • The entropy E of an object set containing ek
    objects of group Gk:
    E = - Σk (ek / e) log (ek / e), where e = Σk ek
  • The expected entropy for the tree with Ai as the
    root: Ei = Σj (ej / e) Eij, where Eij is the
    expected entropy for the subtree of the jth object
    subset.
  • Information content of the value of Ai:
    I(Ai) = - Σj (ej / e) log (ej / e)

14
Ex
        Total  C1  C2  C3
        100    40  30  30

  • Splitting

    Splitting 1:
    S1   60    20  20  20
    S2   40    20  10  10

    Splitting 2:
    S3   40    40   0   0
    S4   30     0  30   0
    S5   30     0   0  30

  • Gain
  • gain(Ai) = E - Ei

gain1 = E - E1 ≈ 0.015    gain2 = E - E2 ≈ 1.09
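A small Python sketch (added, not from the slides) that recomputes the two
gains above; it assumes natural logarithms, which is what makes gain2 come
out near 1.09.

import math

def entropy(counts):
    # entropy of a class-count vector, natural log
    n = sum(counts)
    return -sum(c / n * math.log(c / n) for c in counts if c > 0)

def expected_entropy(splits):
    # weighted average entropy of the subsets produced by a split
    n = sum(sum(s) for s in splits)
    return sum(sum(s) / n * entropy(s) for s in splits)

root = [40, 30, 30]                                # C1, C2, C3
split1 = [[20, 20, 20], [20, 10, 10]]              # S1 (60), S2 (40)
split2 = [[40, 0, 0], [0, 30, 0], [0, 0, 30]]      # S3, S4, S5: pure subsets

E = entropy(root)
print(round(E - expected_entropy(split1), 3))      # gain1 (slide reports ~0.015)
print(round(E - expected_entropy(split2), 3))      # gain2 (slide reports ~1.09)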
15
Distributional Similarity Measures
  • Cosine
  • Jaccard coefficient
  • Dice coefficient
  • Overlap coefficient
  • L1 distance (City block distance)
  • Euclidean distance (L2 distance)
  • Hellinger distance
  • Information Radius (Jensen-Shannon divergence)
  • Skew divergence
  • Confusion Probability
  • Lin's Similarity Measure

16
Similarity Measures
  • Minkowski distance
  • Euclidean distance (p = 2)
  • Manhattan distance (p = 1)
  • Mahalanobis distance
  • Normalization due to weighting schemes
  • Σ is the sample covariance matrix of the patterns
    or the known covariance matrix of the pattern
    generation process

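A short Python sketch (added) of the Minkowski and Mahalanobis distances; the
vectors and covariance matrix below are made-up examples.

import numpy as np

def minkowski(x, y, p):
    # Minkowski distance; p=2 gives Euclidean, p=1 gives Manhattan
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def mahalanobis(x, y, cov):
    # Mahalanobis distance using a (sample or known) covariance matrix
    d = x - y
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.0])
print(minkowski(x, y, 2))            # Euclidean
print(minkowski(x, y, 1))            # Manhattan
cov = np.array([[2.0, 0.3, 0.0],
                [0.3, 1.0, 0.0],
                [0.0, 0.0, 0.5]])
print(mahalanobis(x, y, cov))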
17
General form
  • I(common(A,B)): information content associated
    with the statement describing what A and B have
    in common
  • I(description(A,B)): information content
    associated with the statement describing A and B
  • π(s): probability of the statement within the
    world of the objects in question, i.e., the
    fraction of objects exhibiting feature s.

IT-Sim(A,B) = I(common(A,B)) / I(description(A,B))
18
Similarity Measures
  • The Set/Bag Model Let X and Y be two collections
    of XML documents
  • Jaccard's Coefficient
  • Dice's Coefficient

19
Similarity Measures
  • Cosine-Similarity Measure (CSM)
  • The Vector-Space Model Cosine-Similarity Measure
    (CSM)

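A short Python sketch (added) of the set-based and vector-based measures on
these slides; the sets and term-frequency vectors are made-up examples.

import math

def jaccard(x, y):
    # |X ∩ Y| / |X ∪ Y| for two sets (the set/bag model)
    x, y = set(x), set(y)
    return len(x & y) / len(x | y)

def dice(x, y):
    # 2 |X ∩ Y| / (|X| + |Y|)
    x, y = set(x), set(y)
    return 2 * len(x & y) / (len(x) + len(y))

def cosine(u, v):
    # cosine similarity of two term-frequency vectors
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

print(jaccard({1, 2, 3}, {1, 2, 4}))     # 0.5
print(dice({1, 2, 3}, {1, 2, 4}))        # 0.666...
print(cosine([1, 2, 0, 3], [2, 1, 1, 0]))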
20
Query Processing a single cosine
  • For every term i, with each doc j, store term
    frequency tfij.
  • Some tradeoffs on whether to store term count,
    term weight, or weighted by idfi.
  • At query time, accumulate component-wise sum
  • If you're indexing 5 billion documents (web
    search), an array of accumulators is infeasible
Ideas?
21
Similarity Measures (2)
  • The Generalized Cosine-Similarity Measure (GCSM)
    Let X and Y be vectors and
  • where
  • Hierarchical Model
  • Why only for depth?

22
2 Dim Similarities
  • Cosine Measure
  • Hellinger Measure
  • Tanimoto Measure
  • Clarity Measure

23
Advanced Terms
  • Conditional Entropy
  • Information Gain

24
Specific Conditional Entropy
  • H(Y | X = v)
  • Suppose I'm trying to predict output Y and I have
    input X
  • X = College Major, Y = Likes "Gladiator"
  • Let's assume this reflects the true probabilities
X Y
Math Yes
History No
CS Yes
Math No
Math No
CS Yes
History No
Math Yes
  • From this data we estimate
  • P(LikeG = Yes) = 0.5
  • P(Major = Math and LikeG = No) = 0.25
  • P(Major = Math) = 0.5
  • P(LikeG = Yes | Major = History) = 0
  • Note
  • H(X) = 1.5, H(Y) = 1
  • H(Y | X = Math) = 1, H(Y | X = History) = 0,
    H(Y | X = CS) = 0

25
Conditional Entropy
  • Definition of Conditional Entropy
  • H(Y | X) = the average specific conditional
    entropy of Y
  • If you choose a record at random, what will be the
    conditional entropy of Y, conditioned on that
    row's value of X?
  • Expected number of bits to transmit Y if both
    sides know the value of X

X Y
Math Yes
History No
CS Yes
Math No
Math No
CS Yes
History No
Math Yes
vj        P(X = vj)   H(Y | X = vj)
Math      0.5         1
History   0.25        0
CS        0.25        0
H(Y | X) = 0.5·1 + 0.25·0 + 0.25·0 = 0.5
26
Information Gain
  • Definition of Information Gain
  • IG(Y | X) = I must transmit Y. How many bits on
    average would it save me if both ends of the line
    knew X?
  • IG(Y | X) = H(Y) - H(Y | X)

X Y
Math Yes
History No
CS Yes
Math No
Math No
CS Yes
History No
Math Yes
H(Y) = 1, H(Y | X) = 0.5·1 + 0.25·0 + 0.25·0 = 0.5. Thus,
IG(Y | X) = 1 - 0.5 = 0.5
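A short Python sketch (added, not part of the slides) that recomputes H(Y),
H(Y|X) and IG(Y|X) from the Major/Gladiator table above.

from collections import Counter
from math import log2

data = [("Math", "Yes"), ("History", "No"), ("CS", "Yes"), ("Math", "No"),
        ("Math", "No"), ("CS", "Yes"), ("History", "No"), ("Math", "Yes")]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def conditional_entropy(pairs):
    # H(Y|X) = sum_v P(X=v) * H(Y | X=v)
    n = len(pairs)
    by_x = {}
    for x, y in pairs:
        by_x.setdefault(x, []).append(y)
    return sum(len(ys) / n * entropy(ys) for ys in by_x.values())

ys = [y for _, y in data]
print(entropy(ys))                               # H(Y) = 1.0
print(conditional_entropy(data))                 # H(Y|X) = 0.5
print(entropy(ys) - conditional_entropy(data))   # IG(Y|X) = 0.5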
27
Relative Information Gain
  • Definition of Relative Information Gain
  • RIG(Y | X) = I must transmit Y. What fraction of
    the bits on average would it save me if both ends
    of the line knew X?
  • RIG(Y | X) = (H(Y) - H(Y | X)) / H(Y)

X Y
Math Yes
History No
CS Yes
Math No
Math No
CS Yes
History No
Math Yes
H(Y) = 1, H(Y | X) = 0.5·1 + 0.25·0 + 0.25·0 = 0.5. Thus,
RIG(Y | X) = (1 - 0.5) / 1 = 0.5
28
What is Information Gain used for?
  • Suppose you are trying to predict whether someone
    is going to live past 80 years. From historical
    data you might find
  • IG(LongLife | HairColor) = 0.01
  • IG(LongLife | Smoker) = 0.2
  • IG(LongLife | Gender) = 0.25
  • IG(LongLife | LastDigitOfSSN) = 0.00001
  • IG tells you how interesting a 2-d contingency
    table is going to be.

29
Clustering
  • Given
  • Data points and number of desired clusters K
  • Group the data points into K clusters
  • Data points within clusters are more similar than
    across clusters
  • Sample applications
  • Customer segmentation
  • Market basket customer analysis
  • Attached mailing in direct marketing
  • Clustering companies with similar growth

30
A Clustering Example
(Figure: four example clusters, characterized by
Income = High, Children = 1, Car = Luxury;
Income = Medium, Children = 2, Car = Truck;
Car = Sedan and Children = 3, Income = Medium;
Income = Low, Children = 0, Car = Compact.)
31
Different ways of representing clusters
(Figure: alternative ways of representing the same clusters over items
b, c, e, f, g, i.)
32
Clustering Methods
  • Partitioning
  • Given a set of objects and a clustering
    criterion, partitional clustering obtains a
    partition of the objects into clusters such that
    the objects in a cluster are more similar to each
    other than to objects in different clusters.
  • K-means and K-medoid methods determine K cluster
    representatives and assign each object to the
    cluster whose representative is closest to the
    object, such that the sum of the squared distances
    between the objects and their representatives is
    minimized.
  • Hierarchical
  • Nested sequence of partitions.
  • Agglomerative starts by placing each object in
    its own cluster and then merge these atomic
    clusters into larger and larger clusters until
    all objects are in a single cluster.
  • Divisive starts with all objects in one cluster
    and subdivides it into smaller and smaller pieces.

33
Algorithms
  • k-Means
  • Fuzzy C-Means Clustering
  • Hierarchical Clustering
  • Probabilistic Clustering

34
Similarity Measures (2)
  • Mutual Neighbor Distance (MND)
  • MND(xi, xj) = NN(xi, xj) + NN(xj, xi), where NN(xi,
    xj) is the neighbor number of xj with respect to xi.
  • Distance under context
  • s(xi, xj) = f(xi, xj, e), where e is the context

35
K-Means Clustering Algorithm
  • Choose k cluster centers to coincide with k
    randomly-chosen patterns
  • Assign each pattern to its closest cluster
    center.
  • Recompute the cluster centers using the current
    cluster memberships.
  • If a convergence criterion is not met, go to step
    2.
  • Typical convergence criteria
  • No (or minimal) reassignment of patterns to new
    cluster centers, or minimal decrease in squared
    error.

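A minimal NumPy sketch (added, not from the slides) of the k-means loop
above; random-sample initialization and the no-movement convergence test are
the assumptions, and the data points are synthetic.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    # choose k cluster centers to coincide with k randomly chosen patterns
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # assign each pattern to its closest cluster center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute the cluster centers using the current memberships
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):   # convergence: no center movement
            break
        centers = new_centers
    return labels, centers

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centers = kmeans(X, k=2)
print(centers)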
36
Objective Function
  • The k-means algorithm aims at minimizing the
    following objective (square error) function:
    J = Σj=1..k Σxi∈gj || xi - cj ||²,
    where cj is the centroid of cluster gj.

37
K-Means Algorithm (Ex)
(Figure: a worked k-means example on points A-J.)
38
Distortion
  • Given a clustering φ, we denote by φ(x) the
    centroid this clustering associates with an
    arbitrary point x. A measure of quality for φ:
  • Distortion_φ = Σx d²(x, φ(x)) / R
  • where R is the total number of points and x
    ranges over all input points.
  • Improvement: penalize the number of model parameters
  • Distortion + (# parameters) · log R
  • Distortion + m·k · log R

39
Remarks
  • How to initialize the means is a problem. One
    popular way to start is to randomly choose k of
    the samples.
  • The results produced depend on the initial values
    for the means.
  • It can happen that the set of samples closest to
    mi is empty, so that mi cannot be updated.
  • The results also depend on the metric used to
    measure distance.

40
Related Work Clustering
  • Graph-based clustering
  • For an XML document collection C, the s-Graph
    sg(C) = (N, E) is a directed graph such that N is
    the set of all the elements and attributes in the
    documents in C and (a, b) ∈ E if and only if a is
    a parent element of b in some document in C (b can
    be an element or an attribute).
  • For two sets C1 and C2 of XML documents, the
    distance between them is
    dist(C1, C2) = 1 - |sg(C1) ∩ sg(C2)| / max{|sg(C1)|, |sg(C2)|},
    where |sg(Ci)| is the number of edges in sg(Ci).

41
Fuzzy C-Means Clustering
  • FCM is a method of clustering which allows one
    piece of data to belong to two or more clusters,
    by minimizing the objective function
    Jm = Σi Σj uij^m || xi - cj ||², with 1 ≤ m < ∞.
  • Fuzzy partitioning is carried out through an
    iterative optimization of this objective function,
    with the membership uij and the cluster center cj
    updated as
    uij = 1 / Σk ( ||xi - cj|| / ||xi - ck|| )^(2/(m-1)),
    cj = Σi uij^m xi / Σi uij^m

42
Membership
  • The iteration stops when
    max_ij | uij^(k+1) - uij^(k) | < ε, where ε is a
    termination criterion between 0 and 1 and k is the
    iteration step. This procedure converges to a local
    minimum or a saddle point of Jm.

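A NumPy sketch (added) of the FCM update loop with fuzzifier m = 2; the
random initialization, the small constant added to the distances, and the
synthetic data are implementation assumptions.

import numpy as np

def fuzzy_c_means(X, c, m=2.0, eps=1e-4, max_iter=100, seed=0):
    # alternate membership and center updates until the largest
    # membership change falls below eps
    rng = np.random.default_rng(seed)
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)                         # memberships sum to 1 per point
    for _ in range(max_iter):
        Um = U ** m
        centers = (Um @ X) / Um.sum(axis=1, keepdims=True)
        d = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2) + 1e-12
        # u_ij = 1 / sum_k (d_ij / d_kj)^(2/(m-1))
        U_new = 1.0 / np.sum((d[:, None, :] / d[None, :, :]) ** (2 / (m - 1)), axis=1)
        if np.max(np.abs(U_new - U)) < eps:
            U = U_new
            break
        U = U_new
    return U, centers

X = np.vstack([np.random.randn(30, 2), np.random.randn(30, 2) + 4])
U, centers = fuzzy_c_means(X, c=2)
print(centers)
print(U[:, :3])    # soft memberships of the first three points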
43
Fuzzy Clustering
  • Properties
  • uij ∈ [0, 1] for all i, j
  • Σj uij = 1 for each point i
  • 0 < Σi uij < N for each cluster j

44
Speculations
  • Correlation between m and ε
  • More iterations k are needed for smaller ε.

45
Hierarchical Clustering
  • Basic Process
  • Start by assigning each item to its own cluster:
    N clusters for N items. (Let the distances between
    the clusters be the same as the distances between
    the items they contain.)
  • Find the closest (most similar) pair of clusters
    and merge them into a single cluster.
  • Compute distances between the new cluster and
    each of the old clusters.
  • Repeat steps 2 and 3 until all items are
    clustered into a single cluster of size N.

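A naive Python sketch (added) of the process above using single-link
distances and stopping at k clusters instead of one; the sample points are
made up.

import numpy as np

def single_linkage(X, k):
    # start with singletons, repeatedly merge the two clusters whose
    # closest members are nearest, stop when k clusters remain
    clusters = [[i] for i in range(len(X))]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    while len(clusters) > k:
        best, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = D[np.ix_(clusters[a], clusters[b])].min()   # single link
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters

X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], float)
print(single_linkage(X, k=2))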
46
Hierarchical Clustering (Ex)
dendrogram
47
Hierarchical Clustering Algorithms
  • Single-linkage clustering
  • The distance between two clusters is the minimum
    of the distances between all pairs of patterns
    drawn from the two clusters (one pattern from the
    first cluster, the other from the second).
  • Complete-linkage clustering
  • The distance between two clusters is the maximum
    of the distances between all pairs of patterns
    drawn from the two clusters
  • Average-linkage clustering
  • Minimum-variance algorithm

48
Single-/Complete-Link Clustering
(Figure: the same two-class point set, with points labeled 1 or 2, clustered
by single-link and by complete-link.)
49
Single Linkage Hierarchical Cluster
  • Steps
  • Begin with the disjoint clustering having level
    L(0) = 0 and sequence number m = 0.
  • Find the least dissimilar pair of clusters in the
    current clustering, d[(r),(s)] = min d[(i),(j)],
    where the minimum is over all pairs of clusters
    in the current clustering.
  • Increment the sequence number: m = m + 1. Merge
    clusters (r) and (s) into a single cluster to
    form clustering m. Set L(m) = d[(r),(s)].
  • Update the proximity matrix D by deleting the
    rows and columns corresponding to clusters (r)
    and (s) and adding a row and column corresponding
    to the newly formed cluster. The proximity
    between the new cluster, denoted (r,s), and an
    old cluster (k) is defined as d[(k),(r,s)] =
    min(d[(k),(r)], d[(k),(s)]).
  • If all objects are in one cluster, stop. Else go
    to step 2.

50
Ex Single-Linkage
  • Cities → States

(Figure: a single-linkage hierarchy built from pairwise city distances.)
51
Agglomerative Hierarchical Clustering
ALGORITHM Agglomerative Hierarchical Clustering
INPUT: bit-vectors B in bitmap index BI
OUTPUT: a tree T
METHOD:
(1) Place each bit-vector Bi in its own cluster
    (singleton), creating the list of clusters L
    (initially, the leaves of T): L = {B1, B2, ..., Bn}.
(2) Compute a merging cost function between every
    pair of elements in L to find the two closest
    clusters Bi, Bj, which will be the cheapest
    couple to merge.
(3) Remove Bi and Bj from L.
(4) Merge Bi and Bj to create a new internal node
    Bij in T, which will be the parent of Bi and Bj
    in the result tree.
(5) Repeat from (2) until there is only one set
    remaining.
52
Graph-Theoretic Clustering
  • Construct the minimal spanning tree (MST)
  • Delete the MST edges with the largest lengths

(Figure: points A-G in the (x1, x2) plane joined by a minimal spanning tree
with edge lengths 0.5, 1.5, 1.5, 1.7, 3.5, and 6.5; the longest edges are the
ones deleted to separate clusters.)
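A small Python sketch (added, not from the slides) of graph-theoretic
clustering: build an MST with Prim's algorithm, then delete the longest
edges; the sample points are made up.

import numpy as np

def mst_clustering(X, k):
    # construct the MST, delete the k-1 longest edges, return component labels
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    in_tree, edges = {0}, []
    while len(in_tree) < n:                       # Prim's algorithm
        best = min(((D[i, j], i, j) for i in in_tree for j in range(n)
                    if j not in in_tree), key=lambda t: t[0])
        edges.append(best)
        in_tree.add(best[2])
    edges.sort()
    keep = edges[:len(edges) - (k - 1)]           # drop the longest k-1 edges
    parent = list(range(n))                       # union-find for components
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for _, i, j in keep:
        parent[find(i)] = find(j)
    return [find(i) for i in range(n)]

X = np.array([[0, 0], [0.5, 0.5], [1, 0], [6, 6], [6.5, 6.2]], float)
print(mst_clustering(X, k=2))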
53
Improving k-Means
D. Pelleg and A. Moore, Accelerating Exact
k-means Algorithms with Geometric Reasoning, ACM
Proceedings of Conf. on Knowledge and Data
Mining, 1999.
  • Definitions
  • Center of clusters → (Th. 2) center of rectangle
    owner(h)
  • c1 dominates c2 w.r.t. h if h lies on the same
    side as c1 w.r.t. c2 (pp. 7, 9)
  • Update Centroid
  • If, for all other centers c', c dominates c' w.r.t.
    h (so c = owner(h), p. 10) → insert into owner(h)
    or split h
  • (blacklisting version) c1 dominates c2 w.r.t. h'
    for any h' contained in h (p. 11)

54
Clustering Categorical Data ROCK
  • S. Guha, R. Rastogi, K. Shim, ROCK Robust
    Clustering using linKs, IEEE Conf Data
    Engineering, 1999
  • Uses links to measure similarity/proximity
  • Not distance based
  • Computational complexity
  • Basic ideas
  • Similarity function and neighbors:
    sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2|
  • Let T1 = {1,2,3}, T2 = {3,4,5}; then
    sim(T1, T2) = 1/5 = 0.2

55
Using Jaccard Coefficient
  • According to the Jaccard coefficient, the distance
    between {1,2,3} and {1,2,6} is the same as the
    one between {1,2,3} and {1,2,4}, although the
    former pair comes from two different clusters.

<1,2,3,4,5> CLUSTER 1: {1,2,3} {1,4,5} {1,2,4}
{2,3,4} {1,2,5} {2,3,5} {1,3,4} {2,4,5} {1,3,5}
{3,4,5}
<1,2,6,7> CLUSTER 2: {1,2,6} {1,2,7} {1,6,7} {2,6,7}
56
ROCK
  • Inducing LINK: the main problem is that only local
    properties involving the two points themselves are
    considered
  • Neighbor: if two points are similar enough to each
    other, they are neighbors
  • Link: the link for a pair of points is the number
    of their common neighbors.

57
Rock Algorithm
S. Guha, R. Rastogi, K. Shim, ROCK Robust
Clustering using linKs, IEEE Conf Data
Engineering, 1999
  • Links The number of common neighbors for the
    two points.
  • Algorithm
  • Draw random sample
  • Cluster with links
  • Label data in disk

{1,2,3} {1,2,4} {1,2,5} {1,3,4} {1,3,5}
{1,4,5} {2,3,4} {2,3,5} {2,4,5} {3,4,5}

link({1,2,3}, {1,2,4}) = 3
58
Rock Algorithm
S. Guha, R. Rastogi, K. Shim, ROCK Robust
Clustering using linKs, IEEE Conf Data
Engineering, 1999
  • Criterion function to maximize links within the k
    clusters:
    El = Σi=1..k ni · Σp,q∈Ci link(p,q) / ni^(1+2f(θ))
  • Ci denotes cluster i of size ni.

For the similarity threshold 0.5:
link({1,2,6}, {1,2,7}) = 4    link({1,2,6}, {1,2,3}) = 3
link({1,6,7}, {1,2,3}) = 2    link({1,2,3}, {1,4,5}) = 3

{1,2,3} {1,4,5} {1,2,4} {2,3,4} {1,2,5}
{2,3,5} {1,3,4} {2,4,5} {1,3,5} {3,4,5}
{1,2,6} {1,2,7} {1,6,7} {2,6,7}
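A Python sketch (added, not taken from the ROCK paper) of the neighbor and
link computation with a Jaccard threshold theta; applied to cluster 1 alone
it reproduces link({1,2,3},{1,2,4}) = 3.

from itertools import combinations

def jaccard(a, b):
    return len(a & b) / len(a | b)

def links(transactions, theta):
    # two points are neighbors if their Jaccard similarity is at least theta;
    # link(A, B) is the number of their common neighbors
    nbrs = {t: set() for t in transactions}
    for a, b in combinations(transactions, 2):
        if jaccard(set(a), set(b)) >= theta:
            nbrs[a].add(b)
            nbrs[b].add(a)
    return {(a, b): len(nbrs[a] & nbrs[b]) for a, b in combinations(transactions, 2)}

cluster1 = [(1, 2, 3), (1, 4, 5), (1, 2, 4), (2, 3, 4), (1, 2, 5),
            (2, 3, 5), (1, 3, 4), (2, 4, 5), (1, 3, 5), (3, 4, 5)]
link = links(cluster1, theta=0.5)
print(link[((1, 2, 3), (1, 2, 4))])   # 3 common neighbors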
59
More on Hierarchical Clustering Methods
  • Major weakness of agglomerative clustering
    methods
  • do not scale well: time complexity of at least
    O(n²), where n is the number of total objects
  • can never undo what was done previously
  • Integration of hierarchical with distance-based
    clustering
  • BIRCH (1996) uses CF-tree and incrementally
    adjusts the quality of sub-clusters
  • CURE (1998) selects well-scattered points from
    the cluster and then shrinks them towards the
    center of the cluster by a specified fraction

60
BIRCH
Zhang, Ramakrishnan, Livny, Birch Balanced
Iterative Reducing and Clustering using
Hierarchies, ACM SIGMOD 1996.
  • Pre-cluster data points using CF-tree
  • For each point
  • CF-tree is traversed to find the closest cluster
  • If the threshold criterion is satisfied, the
    point is absorbed into the cluster
  • Otherwise, it forms a new cluster
  • Requires only single scan of data
  • Cluster summaries stored in CF-tree are given to
    main memory hierarchical clustering algorithm

61
Initialization of BIRCH
  • CF of a cluster of n d-dimensional vectors,
    V1,,Vn, is defined as (n,LS, SS)
  • n is the number of vectors
  • LS is the sum of vectors
  • SS is the sum of squares of vectors
  • CF1 + CF2 = (n1 + n2, LS1 + LS2, SS1 + SS2)
  • This property is used for incrementally
    maintaining cluster features
  • Distance between two clusters CF1 and CF2 is
    defined to be the distance between their
    centroids.

62
Zhang, Ramakrishnan, Livny, Birch Balanced
Iterative Reducing and Clustering using
Hierarchies, ACM SIGMOD 1996.
Clustering Feature Vector
Clustering Feature: CF = (N, LS, SS)
N: number of data points
LS: linear sum of the N data points = Σi=1..N Xi
SS: square sum of the N data points = Σi=1..N Xi²

Points: (3,4) (2,6) (4,5) (4,7) (3,8)
CF = (5, (16,30), (54,190))
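A small Python sketch (added, not part of BIRCH itself) of the CF triple and
its additivity; it reproduces the CF on this slide.

def cf_of(points):
    # Clustering Feature CF = (N, LS, SS) of a set of d-dimensional points
    d = len(points[0])
    n = len(points)
    ls = tuple(sum(p[i] for p in points) for i in range(d))
    ss = tuple(sum(p[i] ** 2 for p in points) for i in range(d))
    return n, ls, ss

def cf_merge(cf1, cf2):
    # CF additivity: CF1 + CF2 = (N1+N2, LS1+LS2, SS1+SS2)
    (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
    return (n1 + n2,
            tuple(a + b for a, b in zip(ls1, ls2)),
            tuple(a + b for a, b in zip(ss1, ss2)))

def centroid(cf):
    n, ls, _ = cf
    return tuple(x / n for x in ls)

cf = cf_of([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)])
print(cf)            # (5, (16, 30), (54, 190)), matching the slide
print(centroid(cf))  # (3.2, 6.0)
print(cf_merge(cf_of([(3, 4), (2, 6)]), cf_of([(4, 5), (4, 7), (3, 8)])))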
63
Notations
Zhang, Ramakrishnan, Livny, Birch Balanced
Iterative Reducing and Clustering using
Hierarchies, ACM SIGMOD 1996.
  • Given N d-dimensional data points in a cluster
    Xi
  • Centroid X0, radius R, diameter D, centroid
    Euclidean distance D0, centroid Manhattan
    distance D1

64
Notations (2)
Zhang, Ramakrishnan, Livny, Birch Balanced
Iterative Reducing and Clustering using
Hierarchies, ACM SIGMOD 1996.
  • Given N d-dimensional data points in a cluster
    Xi
  • Average inter-cluster distance D2, average
    intra-cluster distance D3, variance increase
    distance D4

65
CF Tree
Zhang, Ramakrishnan, Livny, Birch Balanced
Iterative Reducing and Clustering using
Hierarchies, ACM SIGMOD 1996.
(Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6; the
root and non-leaf nodes hold entries CF1, CF2, ... with child pointers
child1, child2, ..., and the leaf nodes hold CF entries chained by prev/next
pointers.)
66
Example
  • Given (T = 2?), B = 3, for points 3, 6, 8, and 1:
    (2,(9,45)) → (2,(4,10)), (2,(14,100))
  • For 2 inserted → (1,(2,4)):
    (3,(6,14)), (2,(14,100))
    (2,(3,5)), (1,(3,9)) / (2,(14,100))
  • For 5 inserted → (1,(5,25)):
    (3,(6,14)), (3,(19,125))
    (2,(3,5)), (1,(3,9)) / (2,(11,61)), (1,(8,64))
  • For 7 inserted → (1,(7,49)):
    (3,(6,14)), (4,(26,174))
    (2,(3,5)), (1,(3,9)) / (2,(11,61)), (2,(15,113))

67
Evaluation of BIRCH
  • Scales linearly: finds a good clustering with a
    single scan and improves the quality with a few
    additional scans
  • Weakness: handles only numeric data and is
    sensitive to the order of the data records.

68
Data Summarization
  • To compress the data into suitable representative
    objects
  • OPTICS, Data Bubbles

Finding clusters from hierarchical clustering
depending on the resolution
69
OPTICS
M. Ankerst, M. Breunig, H. Kriegel, J. Sander,
OPTICS Ordering Points to Identify the
Clustering Structure, ACM SIGMOD, 1999.
  • Pre: N_ε(q) is the subset of D contained in the
    ε-neighborhood of q (ε is a radius)
  • Definition 1 (directly density-reachable): object
    p is directly density-reachable from object q
    wrt. ε and MinPts in a set of objects D if 1) p ∈
    N_ε(q) and 2) Card(N_ε(q)) ≥ MinPts
    (Card(N) denotes the cardinality of the set N)
  • Definitions
  • Directly density-reachable (p. 51, Figure 2) →
    density-reachable: the transitive closure of
    direct density-reachability
  • Density-connected (p → o ← q)
  • Core-distance_{ε,MinPts}(p) = MinPts-distance(p)
  • Reachability-distance_{ε,MinPts}(p, o) wrt o =
    max(core-distance(o), dist(o, p)) (Figure 4)
  • Ex) cluster ordering → reachability values (Fig. 12)

70
Data Bubbles
M. Breunig, H. Kriegel, P. Kroger, J. Sander,
Data Bubbles Quality Preserving Performance
Boosting for Hierarchical Clustering, ACM SIGMOD,
2001.
  • ε-neighborhood of P
  • k-distance of P: the distance d(P,O) such that for
    at least k objects O' ∈ D it holds that d(P,O') ≤
    d(P,O), and for at most k-1 objects O' ∈ D it
    holds that d(P,O') < d(P,O).
  • k-nearest neighbors of P
  • MinPts-dist(P): a distance within which there are
    at least MinPts objects in the neighborhood of P.

71
Data Bubbles
M. Breunig, H. Kriegel, P. Kroger, J. Sander,
Data Bubbles Quality Preserving Performance
Boosting for Hierarchical Clustering, ACM SIGMOD,
2001.
  • Structural distortion
  • Figure 11
  • Data Bubble: B = (n, rep, extent, nnDist)
  • n: number of objects in X; rep: a representative
    object for X; extent: an estimate of the radius of
    X; nnDist: a partial function estimating k-nearest
    neighbor distances in X.
  • Distance(B, C):
    dist(B.rep, C.rep) - (B.extent + C.extent)
      + B.nnDist(1) + C.nnDist(1),
      if dist(B.rep, C.rep) - (B.extent + C.extent) ≥ 0
    max(B.nnDist(1), C.nnDist(1)), otherwise
72
K-Means in SQL
C. Ordonez, Integrating K-Means Clustering with a
Relational DBMS Using SQL, IEEE TKDE 2006.
  • Dataset Y = {y1, y2, ..., yn}: a d×n matrix, where
    yi is a d×1 column vector
  • K-Means: find k clusters by minimizing the squared
    error from the centers.
  • Squared distance, Eq. (1), and objective function,
    Eq. (2)
  • Matrices
  • W: k weights (fractions of n), d×k matrix
  • C: k means (centroids), d×k matrix
  • R: k variances (squared distances), k×1 matrix
  • Matrices
  • Mj contains the d sums of point dimension values
    in cluster j: d×k matrix
  • Qj contains the d sums of squared dimension
    values in cluster j: d×k matrix
  • Nj contains the number of points in cluster j:
    k×1 matrix
  • Intermediate tables YH, YV, YD, YNN, NMQ, WCR

73
Y
YH
C
YV
CH
Y1 Y2 Y3
1 2 3
1 2 3
9 8 7
9 8 7
9 8 7
i Y1 Y2 Y3
1 1 2 3
2 1 2 3
3 9 8 7
4 9 8 7
5 9 8 7
l k C1/C2
1 1 1
2 1 2
3 1 3
1 2 9
2 2 8
3 2 7
i l val
1 1 1
1 2 2
1 3 3
2 1 1
2 2 2
2 3 3
3 1 9
3 2 8
3 3 7
4 1 9
4 2 8
4 3 7
5 1 9
5 2 8
5 3 7
j Y1 Y2 Y3
1 1 2 3
2 9 8 7
YNN
YD
INSERT INTO C SELECT 1, 1, Y1 FROM CH WHERE j = 1
...
INSERT INTO C SELECT d, k, Yd FROM CH WHERE j = k
i j
1 1
2 1
3 2
4 2
5 2
i d1 d2
1 0 116
2 0 116
3 116 0
4 116 0
5 116 0
INSERT INTO YD SELECT i,
  sum((YV.val - C.C1)**2) AS d1, ...,
  sum((YV.val - C.Ck)**2) AS dk
FROM YV, C WHERE YV.l = C.l GROUP BY i
NMQ
WCR
INSERT INTO YNN ...
  CASE WHEN d1 < d2 AND ... AND d1 < dk THEN 1
       WHEN d2 < d3 AND ... THEN 2
       ...
       ELSE k END
l j N M Q
1 1 2 2 3
2 1 2 4 3
3 1 2 6 7
1 2 3 27 7
2 2 3 24 7
3 2 3 21 7
l j W C R
1 1 0.4 1 0
2 1 0.4 2 0
3 1 0.4 3 0
1 2 0.6 9 0
2 2 0.6 8 0
3 2 0.6 7 0
INSERT INTO NMQ SELECT l, j, sum(1.0) AS N,
  sum(YV.val) AS M, sum(YV.val * YV.val) AS Q
FROM YV, YNN WHERE YV.i = YNN.i GROUP BY l, j
74
Incremental Data Summarization
S. Nassar, J. Sander, C. Cheng, Incremental and
Effective Data Summarization for Dynamic
Hierarchical Clustering, ACM SIGMOD, 2004.
  • For D = {Xi}, 1 ≤ i ≤ N, and a data bubble holding
    n objects, the data index is ρ = n/N.
  • For D = {Xi} with mean μ and standard deviation σ
    of the data index, ρ is
  • good iff ρ ∈ [μ - σ, μ + σ],
  • under-filled iff ρ < μ - σ, and
  • over-filled iff ρ > μ + σ.

75
Research Issues
  • Dimensionality Reduction
  • Approximation

76
(No Transcript)
77
Cure The Algorithm
Guha, Rastogi Shim, CURE An Efficient
Clustering Algorithm for Large Databases, ACM
SIGMOD, 1998
  • Draw random sample s.
  • Partition sample to p partitions with size s/p
  • Partially cluster partitions into s/(pq) clusters
  • Eliminate outliers
  • By random sampling
  • If a cluster grows too slow, eliminate it.
  • Cluster partial clusters.
  • Label data in disk

78
Data Partitioning and Clustering
Guha, Rastogi Shim, CURE An Efficient
Clustering Algorithm for Large Databases, ACM
SIGMOD, 1998
  • s = 50
  • p = 2
  • s/p = 25
  • s/(pq) = 5

(Figure: the sample split into two partitions, each partially clustered.)
79
Cure Shrinking Representative Points
Guha, Rastogi Shim, CURE An Efficient
Clustering Algorithm for Large Databases, ACM
SIGMOD, 1998
  • Shrink the multiple representative points towards
    the gravity center by a fraction α.
  • Multiple representatives capture the shape of the
    cluster

80
Density-Based Clustering Methods
  • Clustering based on density (local cluster
    criterion), such as density-connected points
  • Major features
  • Discover clusters of arbitrary shape
  • Handle noise
  • One scan
  • Need density parameters as termination condition
  • Several interesting studies
  • DBSCAN: Ester et al. (KDD'96)
  • OPTICS: Ankerst et al. (SIGMOD'99)
  • DENCLUE: Hinneburg & Keim (KDD'98)
  • CLIQUE: Agrawal et al. (SIGMOD'98)

81
CLIQUE (Clustering In QUEst)
  • Agrawal, Gehrke, Gunopulos, Raghavan, Automatic
    Subspace Clustering of High Dimensional Data for
    Data Mining Applications, ACM SIGMOD 1998.
  • Automatically identifying subspaces of a high
    dimensional data space that allow better
    clustering than original space
  • CLIQUE can be considered as both density-based
    and grid-based
  • It partitions each dimension into the same number
    of equal length interval
  • It partitions a d-dimensional data space into
    non-overlapping rectangular units
  • A unit is dense if the fraction of total data
    points contained in the unit exceeds the input
    model parameter
  • A cluster is a maximal set of connected dense
    units within a subspace

82
(Figure: a grid over age (20-60) and salary (×10,000, 1-7), with density
threshold τ = 3, illustrating dense units.)
83
CLIQUE The Major Steps
Agrawal, Gehrke, Gunopulos, Raghavan, Automatic
Subspace Clustering of High Dimensional Data for
Data Mining Applications, ACM SIGMOD 1998.
  • Partition the data space and find the number of
    points that lie inside each cell of the
    partition.
  • Identify the subspaces that contain clusters
    using the Apriori principle
  • Identify clusters
  • Determine dense units in all subspaces of
    interest
  • Determine connected dense units in all subspaces
    of interest.
  • Generate minimal description for the clusters
  • Determine maximal regions that cover a cluster of
    connected dense units for each cluster
  • Determination of minimal cover for each cluster

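A rough Python sketch (added) of the bottom-up dense-unit search; the data,
the bin functions and the threshold tau are hypothetical, and only 1-d and
2-d units are shown to illustrate the Apriori-style pruning.

from collections import Counter
from itertools import combinations

def dense_units(points, bins, tau):
    # CLIQUE-style sketch: find dense 1-d units, then form candidate 2-d
    # units only from dense 1-d projections and keep the dense ones;
    # bins[d] maps a raw value in dimension d to an interval index
    d = len(points[0])
    counts1 = Counter((i, bins[i](p[i])) for p in points for i in range(d))
    dense1 = {u for u, c in counts1.items() if c >= tau}
    counts2 = Counter()
    for p in points:
        for i, j in combinations(range(d), 2):
            ui, uj = (i, bins[i](p[i])), (j, bins[j](p[j]))
            if ui in dense1 and uj in dense1:
                counts2[(ui, uj)] += 1
    dense2 = {u for u, c in counts2.items() if c >= tau}
    return dense1, dense2

# hypothetical (age, salary) data; age binned in widths of 10, salary of 1
points = [(25, 5), (27, 5), (28, 5), (33, 5), (34, 4), (35, 4), (52, 2)]
bins = [lambda v: v // 10, lambda v: int(v)]
print(dense_units(points, bins, tau=3))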
84
Strength and Weakness of CLIQUE
  • Strength
  • It automatically finds subspaces of the highest
    dimensionality such that high density clusters
    exist in those subspaces
  • It is insensitive to the order of records in
    input and does not presume some canonical data
    distribution
  • It scales linearly with the size of input and has
    good scalability as the number of dimensions in
    the data increases
  • Weakness
  • The accuracy of the clustering result may be
    degraded at the expense of simplicity of the
    method

85
Model based clustering
  • Assume data generated from K probability
    distributions
  • Typically Gaussian distribution Soft or
    probabilistic version of K-means clustering
  • Need to find distribution parameters.
  • EM Algorithm

86
EM Algorithm
  • Initialize K cluster centers
  • Iterate between two steps
  • Expectation step: assign points to clusters
  • Maximization step: estimate model parameters

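A minimal 1-D Gaussian-mixture EM sketch in Python (added for illustration;
the synthetic data and the initialization scheme are assumptions).

import numpy as np

def em_gmm_1d(x, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False)
    var = np.full(k, x.var())
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        dens = w * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var

x = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(6, 1, 200)])
print(em_gmm_1d(x, k=2))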
87
CURE (Clustering Using REpresentatives)
  • Guha, Rastogi Shim, CURE An Efficient
    Clustering Algorithm for Large Databases, ACM
    SIGMOD, 1998
  • Stops the creation of a cluster hierarchy if a
    level consists of k clusters
  • Uses multiple representative points to evaluate
    the distance between clusters, adjusts well to
    arbitrary shaped clusters and avoids single-link
    effect

88
Drawbacks of Distance-Based Method
Guha, Rastogi Shim, CURE An Efficient
Clustering Algorithm for Large Databases, ACM
SIGMOD, 1998
  • Drawbacks of square-error based clustering method
  • Consider only one point as representative of a
    cluster
  • Good only for convex shaped, similar size and
    density, and if k can be reasonably estimated

89
BIRCH
Zhang, Ramakrishnan, Livny, Birch Balanced
Iterative Reducing and Clustering using
Hierarchies, ACM SIGMOD 1996.
  • Dependent on order of insertions
  • Works for convex, isotropic clusters of uniform
    size
  • Labeling Problem
  • Centroid approach
  • Labeling Problem even with correct centers, we
    cannot label correctly

90
Jensen-Shannon Divergence
  • Jensen-Shannon (JS) divergence between two
    probability distributions:
    JS_π(p1, p2) = π1 KL(p1 || m) + π2 KL(p2 || m),
    where m = π1 p1 + π2 p2 and π1 + π2 = 1
  • Jensen-Shannon (JS) divergence between a finite
    number of probability distributions:
    JS_π(p1, ..., pn) = Σi πi KL(pi || m),
    where m = Σi πi pi

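A short Python sketch (added) of the weighted JS divergence; it assumes
base-2 logarithms and equal weights by default.

import numpy as np

def kl(p, q):
    nz = p > 0
    return np.sum(p[nz] * np.log2(p[nz] / q[nz]))

def js(p, q, pi1=0.5, pi2=0.5):
    # weighted average KL divergence of p and q to their mixture m
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = pi1 * p + pi2 * q
    return pi1 * kl(p, m) + pi2 * kl(q, m)

print(js([0.5, 0.5, 0.0], [0.0, 0.5, 0.5]))   # 0.5
print(js([0.4, 0.6], [0.4, 0.6]))             # 0.0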
91
Information-Theoretic Clustering (preserving
mutual information)
  • (Lemma) The loss in mutual information equals
  • Interpretation Quality of each cluster is
    measured by the Jensen-Shannon Divergence between
    the individual distributions in the cluster.
  • Can rewrite the above as
  • Goal Find a clustering that minimizes the above
    loss

92
Information Theoretic Co-clustering (preserving
mutual information)
  • (Lemma) Loss in mutual information equals
  • where
  • Can be shown that q(x,y) is a maximum entropy
    approximation to p(x,y).
  • q(x,y) preserves the marginals: q(x) = p(x),
    q(y) = p(y)
93
The number of parameters that determine q is
(m - k) + (kl - 1) + (n - l)
94
Preserving Mutual Information
  • Lemma
  • Note that may be thought of as
    the prototype of row cluster (the usual
    centroid of the cluster is
    )
  • Similarly,

95
Example Continued
96
Co-Clustering Algorithm
97
Properties of Co-clustering Algorithm
  • Theorem The co-clustering algorithm
    monotonically decreases loss in mutual
    information (objective function value)
  • Marginals p(x) and p(y) are preserved at every
    step (q(x) = p(x) and q(y) = p(y))
  • Can be generalized to higher dimensions

98
(No Transcript)
99
Applications -- Text Classification
  • Assigning class labels to text documents
  • Training and Testing Phases

(Figure: a document collection grouped into classes Class-1, ..., Class-m
serves as training data; the classifier learns from it and outputs each new
document with an assigned class.)
100
Dimensionality Reduction
  • Feature Selection
  • Select the best words; throw away the rest
  • Frequency-based pruning
  • Information-criterion-based pruning
  • (Figure: document bag-of-words mapped to a vector
    of word features Word1, ..., Wordk)
  • Feature Clustering
  • Do not throw away words; cluster words instead
  • Use clusters as features
  • (Figure: document bag-of-words mapped to a vector
    of word-cluster features Cluster1, ..., Clusterk)
101
Experiments
  • Data sets
  • 20 Newsgroups data
  • 20 classes, 20000 documents
  • Classic3 data set
  • 3 classes (cisi, med and cran), 3893 documents
  • Dmoz Science HTML data
  • 49 leaves in the hierarchy
  • 5000 documents with 14538 words
  • Available at
    http://www.cs.utexas.edu/users/manyam/dmoz.txt
  • Implementation Details
  • Bow for indexing, co-clustering, clustering and
    classifying

102
Naïve Bayes with word clusters
  • Naïve Bayes classifier
  • Assign document d to the class with the highest
    score
  • Relation to KL Divergence
  • Using word clusters instead of words
  • where parameters for clusters are estimated
    according to joint statistics

103
Selecting Correlated Attributes
T. Fukuda, Y. Morimoto, S. Morishita, T.
Tokuyama, Constructing Efficient Decision Trees
by Using Optimized Numeric Association Rules,
Proc. of VLDB Conf., 1996.
  • To decide that A1 and A2 are strongly correlated iff
  • the criterion exceeds a threshold θ, θ ≥ 1

104
MDL-based Decision Tree Pruning
  • M. Mehta, J. Rissanen, R. Agrawal, MDL-based
    Decision Tree Pruning, Proc. on KDD Conf., 1995.
  • Two steps for induction of decision trees
  • Construct a DT using training data
  • Reduce the DT by pruning to prevent overfitting
  • Possible approaches
  • Cost-complexity pruning using a separate set of
    samples for pruning
  • DT pruning using the same training data sets for
    testing
  • MDL-based pruning using Minimum Description
    Length (MDL) principle.

105
Pruning Using MDL Principle
M. Mehta, J. Rissanen, R. Agrawal, MDL-based
Decision Tree Pruning, Proc. on KDD Conf., 1995.
  • View decision tree as a means for efficiently
    encoding classes of records in training set
  • MDL Principle best tree is the one that can
    encode records using the fewest bits
  • Cost of encoding the tree includes
  • 1 bit for encoding the type of each node (e.g.,
    leaf or internal)
  • Csplit: cost of encoding the attribute and value
    for each split
  • n·E: cost of encoding the n records in each leaf
    (E is the entropy)

106
Pruning Using MDL Principle
M. Mehta, J. Rissanen, R. Agrawal, MDL-based
Decision Tree Pruning, Proc. on KDD Conf., 1995.
  • Problem: compute the minimum-cost subtree at the
    root of the built tree
  • Suppose minC_N is the cost of encoding the
    minimum-cost subtree rooted at N
  • Prune the children of a node N if minC_N = n·E + 1
  • Compute minC_N as follows
  • N is a leaf: n·E + 1
  • N has children N1 and N2: min{n·E + 1,
    Csplit + 1 + minC_N1 + minC_N2}
  • Prune the tree in a bottom-up fashion

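A Python sketch (added) of the bottom-up minC computation; the dict-based
tree representation and the field names "ne", "c_split" and "children" are
hypothetical, not from the paper.

def min_cost(node):
    # node holds n*E as "ne" and, for internal nodes, "c_split" and "children";
    # children are pruned whenever encoding the node as a leaf is no more expensive
    leaf_cost = node["ne"] + 1                      # n*E + 1 bit for the node type
    if "children" not in node:
        return leaf_cost
    split_cost = node["c_split"] + 1 + sum(min_cost(c) for c in node["children"])
    if leaf_cost <= split_cost:
        node.pop("children")                        # prune: keep N as a leaf
        return leaf_cost
    return split_cost

# hypothetical tree matching the slide's numbers (n*E + 1 = 3.8, Csplit = 2.6)
tree = {"ne": 2.8, "c_split": 2.6,
        "children": [{"ne": 0.0}, {"ne": 0.0}]}
print(min_cost(tree))       # min{3.8, 2.6 + 1 + 1 + 1} = 3.8
print("children" in tree)   # False: the children were pruned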
107
MDL Pruning - Example
R. Rastogi, K. Shim, PUBLIC A Decision Tree
Classifier that Integrates Building and Pruning,
Proc. of VLDB Conf., 1998.
  • Cost of encoding records in N: (n·E + 1) = 3.8
  • Csplit = 2.6
  • minC_N = min{3.8, 2.6 + 1 + 1 + 1} = 3.8
  • Since minC_N = n·E + 1, N1 and N2 are pruned

108
PUBLIC
  • R. Rastogi, K. Shim, PUBLIC A Decision Tree
    Classifier that Integrates Building and Pruning,
    Proc. of VLDB Conf., 1998.
  • Prune tree during (not after) building phase
  • Execute pruning algorithm (periodically) on
    partial tree
  • Problem: how to compute minC_N for a
    yet-to-be-expanded leaf N in a partial tree
  • Solution: compute a lower bound on the subtree
    cost at N and use this as minC_N when pruning
  • minC_N is thus a lower bound on the cost of the
    subtree rooted at N
  • Prune the children of a node N if minC_N = n·E + 1
  • Guaranteed to generate a tree identical to that
    generated by SPRINT

109
PUBLIC(1)
R. Rastogi, K. Shim, PUBLIC A Decision Tree
Classifier that Integrates Building and Pruning,
Proc. of VLDB Conf., 1998.
sal education Label
10K High-school Reject
40K Under Accept
15K Under Reject
75K grad Accept
18K grad Accept
  • Simple lower bound for a subtree: 1
  • Cost of encoding records in N: n·E + 1 = 5.8
  • Csplit = 4
  • minC_N = min{5.8, 4 + 1 + 1 + 1} = 5.8
  • Since minC_N = n·E + 1, N1 and N2 are pruned

110
PUBLIC(S)
  • Theorem: the cost of any subtree with s splits
    and rooted at node N is at least
    2s + 1 + s·log a + Σ_{i=s+2}^{k} ni
  • a is the number of attributes
  • k is the number of classes
  • ni (ni ≥ ni+1) is the number of records belonging
    to class i
  • The lower bound on the subtree cost at N is thus
    the minimum of
  • n·E + 1 (cost with zero splits)
  • 2s + 1 + s·log a + Σ_{i=s+2}^{k} ni
111
Whats Clustering
  • Clustering is a kind of unsupervised learning.
  • Clustering is a method of grouping data that
    share similar trend and patterns.
  • Clustering of data is a method by which large
    sets of data are grouped into clusters of smaller
    sets of similar data.
  • Example

After clustering
Thus, we see clustering means grouping of data or
dividing a large data set into smaller data sets
of some similarity.
112
Partitional Algorithms
  • Enumerate K partitions optimizing some criterion
  • Example: the square-error criterion
    e² = Σ_{j=1}^{K} Σ_{i=1}^{nj} || xi(j) - cj ||²
  • where xi(j) is the ith pattern belonging to the
    jth cluster and cj is the centroid of the jth
    cluster.

113
Squared Error Clustering Method
  1. Select an initial partition of the patterns with
    a fixed number of clusters and cluster centers
  2. Assign each pattern to its closest cluster center
    and compute the new cluster centers as the
    centroids of the clusters. Repeat this step until
    convergence is achieved, i.e., until the cluster
    membership is stable.
  3. Merge and split clusters based on some heuristic
    information, optionally repeating step 2.

114
Agglomerative Clustering Algorithm
  1. Place each pattern in its own cluster. Construct
    a list of interpattern distances for all distinct
    unordered pairs of patterns, and sort this list
    in ascending order
  2. Step through the sorted list of distances,
    forming for each distinct dissimilarity value dk
    a graph on the patterns where pairs of patterns
    closer than dk are connected by a graph edge. If
    all the patterns are members of a connected
    graph, stop. Otherwise, repeat this step.
  3. The output of the algorithm is a nested hierarchy
    of graphs which can be cut at a desired
    dissimilarity level, forming a partition
    identified by simply connected components in the
    corresponding graph.

115
Agglomerative Hierarchical Clustering
  • The most widely used hierarchical clustering
    approach
  • Initially each point is a distinct cluster
  • Repeatedly merge the closest clusters until the
    number of clusters becomes K
  • Closest: d_mean(Ci, Cj) = || mi - mj ||
  • d_min(Ci, Cj) = min || p - q || over p ∈ Ci, q ∈ Cj
  • Likewise d_ave(Ci, Cj) and d_max(Ci, Cj)

116
Clustering
  • Summary of Drawbacks of Traditional Methods
  • Partitional algorithms split large clusters
  • Centroid-based method splits large and
    non-hyperspherical clusters
  • Centers of subclusters can be far apart
  • Minimum spanning tree algorithm is sensitive to
    outliers and slight change in position
  • Exhibits chaining effect on string of outliers
  • Cannot scale up for large databases

117
Model-based Clustering
  • Mixture of Gaussians
  • Gaussian pdf with prior P(ωi)
  • Data point: N(μi, σ²I)
  • Consider
  • Data points x1, x2, ..., xN
  • P(ω1), ..., P(ωk), σ
  • Likelihood function
  • Maximize the likelihood function by calculating

118
Overview of EM Clustering
  • Extensions and generalizations. The EM
    (expectation maximization) algorithm extends the
    k-means clustering technique in two important
    ways
  • Instead of assigning cases or observations to
    clusters to maximize the differences in means for
    continuous variables, the EM clustering algorithm
    computes probabilities of cluster memberships
    based on one or more probability distributions.
    The goal of the clustering algorithm then is to
    maximize the overall probability or likelihood of
    the data, given the (final) clusters.
  • Unlike the classic implementation of k-means
    clustering, the general EM algorithm can be
    applied to both continuous and categorical
    variables (note that the classic k-means
    algorithm can also be modified to accommodate
    categorical variables).

119
EM Algorithm
  • The EM algorithm for clustering is described in
    detail in Witten and Frank (2001).
  • The basic approach and logic of this clustering
    method is as follows.
  • Suppose you measure a single continuous variable
    in a large sample of observations.
  • Further, suppose that the sample consists of two
    clusters of observations with different means
    (and perhaps different standard deviations)
    within each sample, the distribution of values
    for the continuous variable follows the normal
    distribution.
  • The resulting distribution of values (in the
    population) may look like this

120
EM v.s. k-Means
  • Classification probabilities instead of
    classifications. The results of EM clustering are
    different from those computed by k-means
    clustering. The latter will assign observations
    to clusters to maximize the distances between
    clusters. The EM algorithm does not compute
    actual assignments of observations to clusters,
    but classification probabilities. In other words,
    each observation belongs to each cluster with a
    certain probability. Of course, as a final result
    you can usually review an actual assignment of
    observations to clusters, based on the (largest)
    classification probability.

121
Finding k
  • V-fold cross-validation. This type of
    cross-validation is useful when no test sample is
    available and the learning sample is too small to
    have the test sample taken from it. A specified V
    value for V-fold cross-validation determines the
    number of random subsamples, as equal in size as
    possible, that are formed from the learning
    sample. The classification tree of the specified
    size is computed V times, each time leaving out
    one of the subsamples from the computations, and
    using that subsample as a test sample for
    cross-validation, so that each subsample is used
    V - 1 times in the learning sample and just once
    as the test sample. The CV costs computed for
    each of the V test samples are then averaged to
    give the V-fold estimate of the CV costs.

122
Expectation Maximization
  • A mixture of Gaussians
  • Ex: x1 = 30 with P(x1) = 1/2, x2 = 18 with
    P(x2) = u, x3 = 0 with P(x3) = 2u, x4 = 23 with
    P(x4) = 1/2 - 3u
  • Likelihood for x1: a students, x2: b students,
    x3: c students, x4: d students
  • To maximize L, calculate the log-likelihood log L

Supposing a = 14, b = 6, c = 9, d = 10, then u = 1/10.
If x1 and x2 are only observed jointly as h students →
a + b = h → a = h/(2u + 1), b = 2uh/(2u + 1)
123
Gaussian (Normal) pdf
  • The Gaussian function with mean (μ) and standard
    deviation (σ). The properties of the function:
  • symmetric about the mean
  • Gains its maximum value at the mean, the minimum
    value at plus and minus infinity
  • The distribution is often referred to as bell
    shaped
  • At one standard deviation from the mean the
    function has dropped to about 2/3 of its maximum
    value; at two standard deviations it has fallen
    to about 1/7.
  • The area under the function within one standard
    deviation of the mean is about 0.682, within two
    standard deviations it is 0.9545, and within
    three standard deviations it is 0.9973. The total
    area under the curve is 1.

124
Gaussian
Think of the cumulative distribution, F_{μ,σ²}(x)
125
Multi-variate Density Estimation
Mixture of Gaussians
  • θ contains all the parameters of the mixture
    model; the πi are known as mixing proportions or
    coefficients.
  • A mixture of Gaussians model
  • Generic mixture

(Figure: a two-component mixture: P(y) selects component y1 or y2, which
then generates x according to P(x|y1) or P(x|y2).)
126
Mixture Density
  • If we are given just x we do not know which
    mixture component this example came from
  • We can evaluate the posterior probability that an
    observed x was generated from the first mixture
    component