Clustering
1
Clustering
  • Clustering is the unsupervised classification of
    patterns (observations, data items, or feature
    vectors) into groups (clusters) [ACM CS '99]
  • Instances within a cluster are very similar
  • Instances in different clusters are very different

2
Example
3
Applications
  • Faster retrieval
  • Faster and better browsing
  • Structuring of search results
  • Revealing classes and other data regularities
  • Directory construction
  • Better data organization in general

4
Cluster Searching
  • Similar instances tend to be relevant to the same
    requests
  • The query is mapped to the closest cluster by
    comparison with the cluster-centroids

5
Notation
  • N: number of elements
  • Class: real-world grouping (the ground truth)
  • Cluster: grouping produced by the algorithm
  • The ideal clustering algorithm will produce
    clusters equivalent to the real-world classes,
    with exactly the same members

6
Problems
  • How many clusters?
  • Complexity: N is usually large
  • Quality of clustering
  • When is one method better than another?
  • Overlapping clusters
  • Sensitivity to outliers

7
Example
8
Clustering Approaches
  • Divisive: build clusters top-down, starting from
    the entire data set
  • K-means, Bisecting K-means
  • Hierarchical or flat clustering
  • Agglomerative: build clusters bottom-up, starting
    with individual instances and iteratively
    combining them to form larger clusters at higher
    levels
  • Hierarchical clustering
  • Combinations of the above
  • Buckshot algorithm

9
Hierarchical vs. Flat Clustering
  • Flat: all clusters at the same level
  • K-means, Buckshot
  • Hierarchical: a nested sequence of clusters
  • A single cluster with all the data at the top,
    singleton clusters at the bottom
  • Intermediate levels are more useful
  • Every intermediate level combines two clusters
    from the next lower level
  • Agglomerative, Bisecting K-means

10
Flat Clustering
11
Hierarchical Clustering
12
Text Clustering
  • Finds overall similarities among documents or
    groups of documents
  • Enables faster searching, browsing, etc.
  • Requires a way to compute the similarity (or,
    equivalently, the distance) between documents

13
Query Document Similarity
  • Similarity is defined as the cosine of the angle
    between the document and query vectors
  • sim(d, q) = (d · q) / (|d| |q|)
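
A minimal sketch of this computation in Python (vectors as
plain lists of term weights; the function name is
illustrative, not from the slides):

    import math

    def cosine_similarity(d, q):
        # Cosine of the angle between document vector d and query vector q.
        dot = sum(di * qi for di, qi in zip(d, q))
        norm_d = math.sqrt(sum(di * di for di in d))
        norm_q = math.sqrt(sum(qi * qi for qi in q))
        return dot / (norm_d * norm_q)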

14
Document Distance
  • Consider documents d1, d2 with vectors u1, u2
  • Their distance is defined as the length of the
    segment AB between the endpoints of the two
    vectors, i.e. the Euclidean distance |u1 - u2|

15
Normalization by Document Length
  • The longer the document is, the more likely it is
    for a given term to appear in it
  • Normalize the term weights by document length
    (so terms in long documents are not given more
    weight)

16
Evaluation of Cluster Quality
  • Clusters can be evaluated using internal or
    external knowledge
  • Internal measures: intra-cluster cohesion and
    cluster separability
  • intra-cluster similarity
  • inter-cluster similarity
  • External measures: quality of the clusters
    compared to the real classes
  • Entropy (E), Harmonic Mean (F)

17
Intra Cluster Similarity
  • A measure of cluster cohesion
  • Defined as the average pairwise similarity of the
    documents in a cluster
  • For unit-length documents this equals ||c||²,
    where c is the cluster centroid (the mean of the
    document vectors)
  • Documents (not centroids) have unit length
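
The identity above makes this cheap to compute. A sketch,
assuming unit-length vectors stored as Python lists:

    def intra_cluster_similarity(docs):
        # Average pairwise cosine similarity (self-pairs included)
        # of unit-length document vectors; equals ||centroid||^2.
        n = len(docs)
        dim = len(docs[0])
        centroid = [sum(d[i] for d in docs) / n for i in range(dim)]
        return sum(x * x for x in centroid)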

18
Inter Cluster Similarity
  • Single link: similarity of the two most similar
    members
  • Complete link: similarity of the two least
    similar members
  • Group average: average similarity between members
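
A sketch of the three definitions, assuming sim is a
pairwise similarity such as cosine_similarity above and
clusters are lists of vectors:

    def single_link(sim, A, B):
        # Similarity of the two most similar members of A and B.
        return max(sim(a, b) for a in A for b in B)

    def complete_link(sim, A, B):
        # Similarity of the two least similar members of A and B.
        return min(sim(a, b) for a in A for b in B)

    def group_average(sim, A, B):
        # Average pairwise similarity between members of A and B.
        return sum(sim(a, b) for a in A for b in B) / (len(A) * len(B))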

19
Example
20
Entropy
  • Measures the quality of flat clusters using
    external knowledge
  • Pre-existing classification
  • Assessment by experts
  • P_ij: the probability that a member of cluster j
    belongs to class i
  • The entropy of cluster j is defined as
    E_j = -Σ_i P_ij log P_ij

21
Entropy (cont)
  • Total entropy over all clusters:
    E = Σ_j (n_j / N) E_j, j = 1, ..., m
  • Where n_j is the size of cluster j
  • m is the number of clusters
  • N is the number of instances
  • The smaller the value of E, the better the
    quality of the algorithm
  • Entropy is trivially minimized (E = 0) when each
    cluster contains exactly one instance
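
A sketch of the computation, with clusters given as lists
of ground-truth class labels (the function name is
illustrative):

    import math
    from collections import Counter

    def clustering_entropy(clusters):
        # E = sum_j (n_j / N) * E_j, with E_j = -sum_i P_ij log P_ij,
        # where P_ij is the fraction of cluster j's members in class i.
        N = sum(len(c) for c in clusters)
        total = 0.0
        for cluster in clusters:
            n_j = len(cluster)
            e_j = -sum((k / n_j) * math.log(k / n_j)
                       for k in Counter(cluster).values())
            total += (n_j / N) * e_j
        return total

For example, clustering_entropy([['a', 'a', 'b'], ['b', 'b']])
is about 0.38 (natural log), while perfectly pure clusters
give 0.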

22
Harmonic Mean (F)
  • Treats each cluster as a query result
  • F combines precision (P) and recall (R)
  • F_ij for class i and cluster j is defined as
    F_ij = 2 P R / (P + R), with P = n_ij / n_j and
    R = n_ij / n_i
  • n_ij: number of instances of class i in cluster j
  • n_i: number of instances of class i
  • n_j: number of instances in cluster j

23
Harmonic Mean (cont)
  • The F value of a class i is the maximum value it
    achieves over all clusters j
  • F_i = max_j F_ij
  • The F value of a clustering solution is computed
    as the weighted average over all classes:
    F = Σ_i (n_i / N) F_i
  • Where N is the number of data instances
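
A sketch along the same lines as the entropy code, again
with clusters as lists of ground-truth class labels:

    from collections import Counter

    def clustering_f(clusters):
        # F = sum_i (n_i / N) * max_j F_ij, with F_ij = 2PR / (P + R),
        # P = n_ij / n_j (precision) and R = n_ij / n_i (recall).
        N = sum(len(c) for c in clusters)
        class_sizes = Counter(label for c in clusters for label in c)
        total = 0.0
        for i, n_i in class_sizes.items():
            best = 0.0
            for cluster in clusters:
                n_ij = cluster.count(i)
                if n_ij:
                    p, r = n_ij / len(cluster), n_ij / n_i
                    best = max(best, 2 * p * r / (p + r))
            total += (n_i / N) * best
        return total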

24
Quality of Clustering
  • A good clustering method
  • Maximizes intra-cluster similarity
  • Minimizes inter-cluster similarity
  • Minimizes Entropy
  • Maximizes the Harmonic Mean
  • It is difficult to achieve all of these
    simultaneously
  • In practice, maximize some objective function of
    the above
  • An algorithm is better than another if it has
    better values on most of these measures

25
K-means Algorithm
  • Select K centroids
  • Repeat I times, or until the centroids no longer
    change
  • Assign each instance to the cluster represented
    by its nearest centroid
  • Compute new centroids
  • Reassign instances
  • Compute new centroids
  • ...
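
A minimal sketch of the loop, assuming instances are
numeric tuples and Euclidean distance; the names are
illustrative:

    import math, random

    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def kmeans(points, k, max_iter=10):
        # Start from K randomly chosen instances as centroids.
        centroids = random.sample(points, k)
        for _ in range(max_iter):
            # Assign each instance to its nearest centroid.
            clusters = [[] for _ in range(k)]
            for p in points:
                j = min(range(k), key=lambda c: euclidean(p, centroids[c]))
                clusters[j].append(p)
            # Recompute each centroid as the mean of its cluster.
            new = [tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl
                   else centroids[j] for j, cl in enumerate(clusters)]
            if new == centroids:    # stop early if nothing changed
                break
            centroids = new
        return clusters, centroids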

26
K-Means demo (1/7) http://www.delft-cluster.nl/textminer/theory/kmeans/kmeans.html
27
K-Means demo (2/7)
28
K-Means demo (3/7)
29
K-Means demo (4/7)
30
K-Means demo (5/7)
31
K-Means demo (6/7)
32
K-Means demo (7/7)
33
Comments on K-Means (1)
  • Generates a flat partition of K clusters
  • K is the desired number of clusters and must be
    known in advance
  • Starts with K random cluster centroids
  • A centroid is the mean or the median of a group
    of instances
  • The mean rarely corresponds to a real instance

34
Comments on K-Means (2)
  • Up to I = 10 iterations
  • Keep the clustering that achieved the best
    inter/intra-cluster similarity, or the final
    clusters after I iterations
  • Complexity: O(IKN)
  • Repeated application of K-Means for K = 2, 4, ...
    can produce a hierarchical clustering

35
Choosing Centroids for K-means
  • Quality of clustering depends on the selection of
    initial centroids
  • Random selection may result in poor convergence
    rate, or convergence to sub-optimal clusterings.
  • Select good initial centroids using a heuristic
    or the results of another method
  • Buckshot algorithm

36
Incremental K-Means
  • Update each centroid immediately after each point
    is assigned to a cluster, rather than at the end
    of each iteration
  • Reassign instances to clusters at the end of each
    iteration
  • Converges faster than simple K-means
  • Usually 2-5 iterations

37
Bisecting K-Means
  • Starts with a single cluster containing all
    instances
  • Select a cluster to split: the largest one, or
    the one with the lowest intra-cluster similarity
  • The selected cluster is split into 2 partitions
    using K-means (K = 2), as sketched below
  • Repeat up to the desired depth h
  • Produces a hierarchical clustering
  • Complexity: O(2hN)
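
A sketch reusing the kmeans function above; it splits the
largest cluster each time, though the slide also allows
choosing the cluster with the lowest intra-cluster
similarity:

    def bisecting_kmeans(points, depth):
        clusters = [list(points)]
        for _ in range(depth):
            # Pick the cluster to split (here: the largest one).
            largest = max(clusters, key=len)
            if len(largest) < 2:
                break
            clusters.remove(largest)
            halves, _ = kmeans(largest, 2)   # split it with K = 2
            clusters.extend(h for h in halves if h)
        return clusters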

38
Agglomerative Clustering
  • Compute the similarity matrix between all pairs
    of instances
  • Start from singleton clusters
  • Repeat until a single cluster remains
  • Merge the two most similar clusters
  • Replace them with a single cluster
  • Replace the merged clusters in the matrix and
    update the similarity matrix
  • Complexity: O(N²)
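
A naive sketch; for clarity it recomputes cluster
similarities with a linkage function (e.g., group_average
above) instead of maintaining the matrix, so it runs
slower than the O(N²) version described:

    def agglomerative(items, sim, linkage):
        clusters = [[x] for x in items]     # start from singletons
        history = [list(clusters)]
        while len(clusters) > 1:
            # Find the two most similar clusters.
            best = None
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    s = linkage(sim, clusters[i], clusters[j])
                    if best is None or s > best[0]:
                        best = (s, i, j)
            _, i, j = best
            # Merge them into a single cluster.
            merged = clusters[i] + clusters[j]
            clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
            clusters.append(merged)
            history.append(list(clusters))
        return history                      # one clustering per level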

39
Similarity Matrix
            C1={d1}  C2={d2}  ...  CN={dN}
C1={d1}       1        0.8   ...     0.3
C2={d2}       0.8      1     ...     0.6
...                        1
CN={dN}       0.3      0.6   ...     1
40
Update Similarity Matrix
The most similar pair, C1 and C2 (similarity 0.8), is merged:

            C1={d1}  C2={d2}  ...  CN={dN}
C1={d1}       1        0.8   ...     0.3    <- merged
C2={d2}       0.8      1     ...     0.6    <- merged
...                        1
CN={dN}       0.3      0.6   ...     1
41
New Similarity Matrix
               C12={d1 ∪ d2}  ...  CN={dN}
C12={d1 ∪ d2}       1         ...    0.4
...                         1
CN={dN}            0.4        ...    1
42
Single Link
  • Select the most similar clusters for merging
    using single-link similarity
  • Can result in long, thin clusters due to the
    chaining effect
  • Appropriate in some domains, such as clustering
    islands

43
Complete Link
  • Select the most similar clusters for merging
    using complete-link similarity
  • Results in compact, spherical clusters, which are
    usually preferable

44
Group Average
  • Select the most similar clusters for merging
    using group-average similarity
  • A fast compromise between single and complete
    link

45
Example
46
Inter Cluster Similarity
  • A new cluster is represented by its centroid
  • The document-to-cluster similarity is computed as
    the similarity between the document vector and
    the cluster centroid
  • The cluster-to-cluster similarity can be computed
    as single-link, complete-link, or group-average
    similarity

47
Buckshot K-Means
  • Combines Agglomerative and K-Means
  • Agglomerative clustering produces a good solution
    but has O(N²) complexity
  • Randomly select a sample of √N instances
  • Applying Agglomerative clustering to the sample
    takes O((√N)²) = O(N) time
  • Take the centroids of the resulting clusters as
    input to K-Means
  • Overall complexity is O(N)
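
A sketch stitched together from the earlier pieces
(cosine_similarity, group_average, agglomerative); the
sample size and the seeding step follow the slide, the
rest is assumed for illustration:

    import math, random

    def centroid(cluster):
        return tuple(sum(xs) / len(cluster) for xs in zip(*cluster))

    def buckshot(points, k, max_iter=10):
        # Agglomerative clustering on a ~sqrt(N) sample picks the seeds.
        sample = random.sample(points, max(k, int(math.sqrt(len(points)))))
        history = agglomerative(sample, cosine_similarity, group_average)
        seeds = next(cs for cs in history if len(cs) == k)
        centroids = [centroid(c) for c in seeds]
        # Standard K-means on the full data, started from those seeds.
        for _ in range(max_iter):
            clusters = [[] for _ in range(k)]
            for p in points:
                j = max(range(k),
                        key=lambda c: cosine_similarity(p, centroids[c]))
                clusters[j].append(p)
            centroids = [centroid(c) if c else centroids[j]
                         for j, c in enumerate(clusters)]
        return clusters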

48
Example
initial centroids for K-Means
49
More on Clustering
  • Sound methods based on the document-to-document
    similarity matrix
  • graph-theoretic methods
  • O(N²) time
  • Iterative methods operating directly on the
    document vectors
  • O(N log N), O(N²/log N), O(mN) time

50
Soft Clustering
  • Hard clustering: each instance belongs to exactly
    one cluster
  • Does not allow for uncertainty
  • An instance may belong to two or more clusters
  • Soft clustering is based on the probabilities
    that an instance belongs to each of a set of
    clusters
  • The probabilities over all clusters must sum to 1
  • Expectation Maximization (EM) is the most popular
    approach

51
More Methods
  • Two documents with similarity > T (a threshold)
    are connected by an edge [Duda & Hart '73]
  • The clusters are the connected components (or the
    maximal cliques) of the resulting graph
  • Problem: selecting an appropriate threshold T
  • Zahn's method [Zahn '71]

52
Zahn's method [Zahn '71]
(figure: the dashed edge is inconsistent and is deleted)
  • Find the minimum spanning tree of the documents
  • For each document, delete incident edges with
    length l > l_avg
  • l_avg: the average length of the edges incident
    to that document
  • The clusters are the connected components of the
    resulting graph
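
A sketch under one reading of the inconsistency rule
(delete an MST edge when it is longer than the average
length of the edges incident to either endpoint); Prim's
algorithm and the union-find are illustrative choices,
not from the slide:

    def zahn_clusters(points, dist):
        n = len(points)
        # Build a minimum spanning tree with Prim's algorithm.
        in_tree, mst = {0}, []
        while len(in_tree) < n:
            u, v = min(((a, b) for a in in_tree for b in range(n)
                        if b not in in_tree),
                       key=lambda e: dist(points[e[0]], points[e[1]]))
            mst.append((u, v, dist(points[u], points[v])))
            in_tree.add(v)
        # Delete edges longer than the average incident-edge length.
        def avg_len(node):
            lens = [l for a, b, l in mst if node in (a, b)]
            return sum(lens) / len(lens)
        kept = [(u, v) for u, v, l in mst
                if l <= avg_len(u) and l <= avg_len(v)]
        # The clusters are the connected components of what remains.
        parent = list(range(n))
        def find(x):
            while parent[x] != x:
                x = parent[x]
            return x
        for u, v in kept:
            parent[find(u)] = find(v)
        groups = {}
        for i in range(n):
            groups.setdefault(find(i), []).append(i)
        return list(groups.values())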

53
References
  • "Searching Multimedia Databases by Content",
    Christos Faloutsos, Kluwer Academic Publishers,
    1996
  • A Comparison of Document Clustering Techniques,
    M. Steinbach, G. Karypis, V. Kumar, In KDD
    Workshop on Text Mining,2000
  • Data Clustering A Review, A.K. Jain, M.N.
    Murphy, P.J. Flynn, ACM Comp. Surveys, Vol. 31,
    No. 3, Sept. 99.
  • Algorithms for Clustering Data A.K. Jain, R.C.
    Dubes Prentice-Hall , 1988, ISBN 0-13-022278-X
  • Automatic Text Processing The Transformation,
    Analysis, and Retrieval of Information by
    Computer, G. Salton, Addison-Wesley, 1989