Clustering in General
1
Clustering in General
  • In a vector space, clusters are the vectors found
    within ε of a cluster vector; techniques differ
    in how they determine the cluster vector and ε.
  • Clustering is unsupervised pattern
    classification.
  • Unsupervised means there is no correct answer or
    feedback.
  • Patterns typically are samples of feature vectors
    or matrices.
  • Classification means collecting the samples into
    groups of similar members.

2
Clustering Decisions
  • Pattern representation
    - feature selection (e.g., stop word removal,
      stemming)
    - number of categories
  • Pattern proximity
    - distance measure on pairs of patterns
  • Grouping
    - characteristics of clusters (e.g., fuzzy,
      hierarchical)
  • Clustering algorithms embody different
    assumptions about these decisions and the form of
    clusters.

3
Formal Definitions
  • Feature vector x is a single datum of d
    measurements.
  • Hard clustering techniques assign a class label
    to each cluster; members of clusters are mutually
    exclusive.
  • Fuzzy clustering techniques assign a fractional
    degree of membership to each label for each x.

4
Proximity Measures
  • Generally, use Euclidean distance or mean squared
    distance.
  • In IR, use a similarity measure from retrieval
    (e.g., the cosine measure on TF-IDF vectors), as
    sketched below.
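A minimal Python sketch of both measures (the
plain-list vector representation and the function
names are illustrative, not from the slides):

  import math

  def euclidean_distance(x, y):
      # Distance between two feature vectors;
      # smaller means more similar.
      return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

  def cosine_similarity(x, y):
      # Cosine of the angle between two TF-IDF vectors;
      # larger means more similar.
      dot = sum(a * b for a, b in zip(x, y))
      norm_x = math.sqrt(sum(a * a for a in x))
      norm_y = math.sqrt(sum(b * b for b in y))
      return dot / (norm_x * norm_y)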

5
Jain, Murty & Flynn Taxonomy of Clustering
  • Clustering
    - Hierarchical
      • Single Link
      • Complete Link
      • HAC
    - Partitional
      • Square Error (k-means)
      • Graph Theoretic
      • Mixture Resolving (Expectation Maximization)
      • Mode Seeking
6
Clustering Issues
7
Hierarchical Algorithms
  • Produce a hierarchy of classes (a taxonomy), from
    singleton clusters up to a single all-inclusive
    cluster.
  • Select a level of the hierarchy for extracting
    the cluster set.
  • The representation is a dendrogram.

8
Complete-Link Revisited
  • Used to create a statistical thesaurus
  • Agglomerative, hard, deterministic, batch
  1. Start with 1 cluster per sample.
  2. Find the two clusters with the lowest distance.
  3. Merge the two clusters and add the merge to the
     hierarchy.
  4. Repeat from 2 until a termination criterion is
     met or all clusters have merged (sketch below).
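A minimal Python sketch of these steps, reusing
euclidean_distance from slide 4 (the function names
and the threshold-based termination criterion are
illustrative assumptions):

  def complete_link_distance(c1, c2, dist):
      # Complete link: cluster distance is the MAXIMUM
      # distance over all cross-cluster pairs.
      return max(dist(a, b) for a in c1 for b in c2)

  def agglomerate(samples, dist, cluster_dist, threshold=None):
      # Step 1: one singleton cluster per sample.
      clusters = [[s] for s in samples]
      hierarchy = []
      while len(clusters) > 1:
          # Step 2: find the pair with the lowest distance.
          (i, j), d = min(
              (((i, j), cluster_dist(clusters[i], clusters[j], dist))
               for i in range(len(clusters))
               for j in range(i + 1, len(clusters))),
              key=lambda pair: pair[1])
          if threshold is not None and d > threshold:
              break  # termination criterion met
          # Step 3: merge the pair and record it.
          merged = clusters[i] + clusters[j]
          hierarchy.append((clusters[i], clusters[j], d))
          clusters = [c for k, c in enumerate(clusters)
                      if k not in (i, j)] + [merged]
      return clusters, hierarchy

  # e.g., agglomerate(points, euclidean_distance,
  #                   complete_link_distance)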

9
Single-Link
  • Like complete-link, except it
  • uses the minimum of the distances between all
    pairs of samples in the two clusters
    (complete-link uses the maximum), as sketched
    below.
  • Single-link has a chaining effect that produces
    elongated clusters, but it can construct more
    complex shapes.
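The only change to the agglomerate sketch above is
the cluster-distance function:

  def single_link_distance(c1, c2, dist):
      # Single link: cluster distance is the MINIMUM
      # distance over all cross-cluster pairs.
      return min(dist(a, b) for a in c1 for b in c2)

  # e.g., agglomerate(points, euclidean_distance,
  #                   single_link_distance)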

10
Example Plot
11
Example Proximity Matrix
12
Complete-Link Solution
(Dendrogram: complete-link merges C1-C15 over the 16
sample points (29,26), (1,28), (9,16), (21,15),
(29,22), (45,42), (46,30), (23,32), (4,9), (13,18),
(31,15), (33,21), (35,35), (42,45), (21,27), (26,25).)
13
Single-Link Solution
(Dendrogram: single-link merges C1-C15 over the same
16 sample points.)
14
Hierarchical Agglomerative Clustering (HAC)
  • Agglomerative, hard, deterministic, batch
  1. Start with 1 cluster per sample and compute a
     proximity matrix between pairs of clusters.
  2. Merge the most similar pair of clusters and
     update the proximity matrix.
  3. Repeat 2 until all clusters have merged.
  • Variants differ in how the proximity matrix is
    updated (see the sketch below).
  • Can combine the benefits of both the single-link
    and complete-link algorithms.
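The slides do not name a specific update rule; one
standard formulation (an addition here, not from the
slides) is the Lance-Williams recurrence, of which
single link and complete link are special cases:

  def lance_williams_update(d_ki, d_kj, d_ij,
                            alpha_i, alpha_j, beta, gamma):
      # Distance from cluster k to the newly merged
      # cluster (i U j), given the old distances.
      return (alpha_i * d_ki + alpha_j * d_kj
              + beta * d_ij + gamma * abs(d_ki - d_kj))

  # Single link:   alpha_i = alpha_j = 0.5, beta = 0,
  #                gamma = -0.5  (yields min(d_ki, d_kj))
  # Complete link: alpha_i = alpha_j = 0.5, beta = 0,
  #                gamma = +0.5  (yields max(d_ki, d_kj))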

15
HAC for IR
  • Intra-cluster similarity
  • Sim(X) = Σ_{d ∈ X} cos(d, c)
  • where S is the set of TF-IDF vectors for the
    documents, c is the centroid of cluster X, and d
    is a document.
  • Proximity is the similarity of all documents to
    the cluster centroid.
  • Select the pair of clusters that produces the
    smallest decrease in similarity, i.e., if
    merge(X, Y) → Z, then choose the pair that
    maximizes Sim(Z) - (Sim(X) + Sim(Y)).
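A sketch of this criterion, reusing cosine_similarity
from slide 4 and representing documents as
equal-length TF-IDF lists (the helper names are
illustrative):

  def centroid(cluster):
      # Mean of the TF-IDF vectors in the cluster.
      n = len(cluster)
      return [sum(doc[k] for doc in cluster) / n
              for k in range(len(cluster[0]))]

  def intra_cluster_similarity(cluster):
      # Sim(X): total similarity of every document
      # to the cluster centroid.
      c = centroid(cluster)
      return sum(cosine_similarity(doc, c) for doc in cluster)

  def merge_gain(x, y):
      # Change in similarity if X and Y merge into Z;
      # pick the pair maximizing Sim(Z) - (Sim(X) + Sim(Y)).
      return intra_cluster_similarity(x + y) - (
          intra_cluster_similarity(x) + intra_cluster_similarity(y))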

16
HAC for IR: Alternatives
  • UPGMA
  • average cosine similarity over all pairs of
    documents drawn from the two clusters
  • Centroid similarity
  • cosine similarity between the centroids of the
    two clusters
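Sketches of the two alternatives, reusing centroid
and cosine_similarity from the earlier sketches:

  def upgma_similarity(x, y):
      # UPGMA: average cosine similarity over all
      # cross-cluster document pairs.
      return sum(cosine_similarity(a, b)
                 for a in x for b in y) / (len(x) * len(y))

  def centroid_similarity(x, y):
      # Cosine similarity between the two cluster centroids.
      return cosine_similarity(centroid(x), centroid(y))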

17
Partitional Algorithms
  • Results in a set of unrelated (flat) clusters.
  • Issues
  • how many clusters are enough?
  • how to search the space of possible partitions?
  • what is an appropriate clustering criterion?

18
k-Means
  • Number of clusters is set by the user to be k.
  • Non-deterministic
  • Clustering criterion is squared error:
  • e²(S, L) = Σ_{j=1}^{K} Σ_{i=1}^{n_j} ||x_i^(j) - c_j||²
  • where S is the document set, L is a clustering, K
    is the number of clusters, x_i^(j) is the ith
    document in the jth cluster, and c_j is the
    centroid of the jth cluster.

19
k-Means Clustering Algorithm
  1. Randomly select k samples as cluster centroids.
  2. Assign each pattern to the closest cluster
     centroid.
  3. Recompute centroids.
  4. If the convergence criterion (e.g., minimal
     decrease in error or no change in cluster
     composition) is not met, return to 2.
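A runnable sketch of these steps, reusing
euclidean_distance from slide 4 and the squared-error
criterion from slide 18 (max_iters and the
empty-cluster handling are illustrative assumptions):

  import random

  def squared_error(clusters, centroids):
      # e²(S, L): sum of squared distances of each
      # document to its cluster centroid.
      return sum(euclidean_distance(x, c) ** 2
                 for cluster, c in zip(clusters, centroids)
                 for x in cluster)

  def k_means(samples, k, max_iters=100):
      # Step 1: randomly select k samples as centroids.
      centroids = random.sample(samples, k)
      assignment = None
      for _ in range(max_iters):
          # Step 2: assign each pattern to the closest centroid.
          new_assignment = [
              min(range(k),
                  key=lambda j: euclidean_distance(x, centroids[j]))
              for x in samples]
          if new_assignment == assignment:
              break  # convergence: no change in composition
          assignment = new_assignment
          # Step 3: recompute centroids as cluster means.
          for j in range(k):
              members = [x for x, a in zip(samples, assignment)
                         if a == j]
              if members:
                  centroids[j] = [sum(col) / len(members)
                                  for col in zip(*members)]
      return assignment, centroids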

20
Example: k-Means Solutions
21
k-Means Sensitivity to Initialization
(Figure: seven points A-G clustered with K = 3; the
red solution was seeded with centroids A, D, F; the
yellow solution with A, B, C.)
22
k-Means for IR
  • Update centroids incrementally.
  • Calculate centroids as with the hierarchical
    methods.
  • Can be refined into a divisive hierarchical
    method (bisecting k-means) by starting with a
    single cluster and splitting with k-means until
    it forms the k clusters with the highest summed
    similarities (sketch below).
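A sketch of bisecting k-means built on the k_means
and squared_error functions above; splitting the
largest cluster and the number of trial splits are
illustrative choices, and minimizing squared error
stands in here for maximizing summed similarity:

  def bisecting_k_means(samples, k, trials=5):
      # Start with a single cluster holding everything.
      clusters = [list(samples)]
      while len(clusters) < k:
          # Split the largest remaining cluster with 2-means.
          idx = max(range(len(clusters)),
                    key=lambda i: len(clusters[i]))
          cluster = clusters.pop(idx)
          best = None
          for _ in range(trials):
              assignment, cents = k_means(cluster, 2)
              halves = [[x for x, a in zip(cluster, assignment)
                         if a == j] for j in (0, 1)]
              err = squared_error(halves, cents)
              if best is None or err < best[0]:
                  best = (err, halves)  # keep the best split
          clusters.extend(best[1])
      return clusters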

23
Other Types of Clustering Algorithms
  • Graph Theoretic: construct a minimal spanning
    tree and delete the edges with the largest
    lengths (sketch below).
  • Expectation Maximization (EM): assume clusters
    are drawn from distributions; use maximum
    likelihood to estimate the parameters of the
    distributions.
  • Nearest Neighbors: iteratively assign each sample
    to the cluster of its nearest labelled neighbor,
    so long as the distance is below a set threshold.
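A sketch of the graph-theoretic approach, using
Prim's algorithm for the minimal spanning tree and
deleting the k-1 longest edges to leave k connected
components (fixing k in advance is an assumption; the
slide leaves the deletion threshold open):

  def mst_clusters(samples, dist, k):
      # Build a minimal spanning tree with Prim's algorithm.
      n = len(samples)
      in_tree, edges = {0}, []
      while len(in_tree) < n:
          i, j = min(((i, j) for i in in_tree
                      for j in range(n) if j not in in_tree),
                     key=lambda e: dist(samples[e[0]], samples[e[1]]))
          edges.append((i, j, dist(samples[i], samples[j])))
          in_tree.add(j)
      # Delete the k-1 edges with the largest lengths;
      # the remaining components are the clusters.
      edges.sort(key=lambda e: e[2])
      kept = edges[:len(edges) - (k - 1)]
      # Union-find over the kept edges to extract components.
      parent = list(range(n))
      def find(a):
          while parent[a] != a:
              a = parent[a]
          return a
      for i, j, _ in kept:
          parent[find(i)] = find(j)
      groups = {}
      for i in range(n):
          groups.setdefault(find(i), []).append(samples[i])
      return list(groups.values())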

24
Comparison of Clustering Algorithms (Steinbach et
al.)
  • Implement 3 versions of HAC and 2 versions of
    k-means.
  • Compare performance on documents hand-labelled as
    relevant to one of a set of classes.
  • Well-known data sets (TREC)
  • Found that UPGMA is the best of the hierarchical
    methods, but bisecting k-means seems to do better
    when considered over many runs.

M. Steinbach, G. Karypis, and V. Kumar. A Comparison
of Document Clustering Techniques. KDD Workshop on
Text Mining, 2000.
25
Evaluation Metrics 1
  • Evaluation: how to measure cluster quality?
  • Entropy
  • E_j = - Σ_i p_ij · log(p_ij)
  • E_CS = Σ_{j=1}^{m} (n_j / n) · E_j
  • where p_ij is the probability that a member of
    cluster j belongs to class i, n_j is the size of
    cluster j, m is the number of clusters, n is the
    number of docs, and CS is a clustering solution.
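A sketch of the entropy computation; labels_of, a
function returning a document's hand-assigned class,
is an illustrative stand-in:

  import math

  def clustering_entropy(clusters, labels_of):
      # Weighted average of per-cluster entropies
      # E_j = -sum_i p_ij * log(p_ij); lower is better.
      n = sum(len(c) for c in clusters)
      total = 0.0
      for cluster in clusters:
          counts = {}
          for doc in cluster:
              label = labels_of(doc)
              counts[label] = counts.get(label, 0) + 1
          e_j = -sum((c / len(cluster)) * math.log(c / len(cluster))
                     for c in counts.values())
          total += (len(cluster) / n) * e_j
      return total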

26
Evaluation Metrics 2
  • The F measure combines precision and recall:
  • treat each cluster as the result of a query and
    each class as the relevant set of docs.
  • Recall(i, j) = n_ij / n_i
  • Precision(i, j) = n_ij / n_j
  • F(i, j) = 2 · Precision(i, j) · Recall(i, j) /
    (Precision(i, j) + Recall(i, j))
  • F = Σ_i (n_i / n) · max_j F(i, j)

where n_ij is the number of members of class i in
cluster j, n_j is the number in cluster j, n_i is the
number in class i, and n is the number of docs.
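A sketch of the overall F measure under the same
assumptions as the entropy sketch:

  def clustering_f_measure(clusters, labels_of):
      # For each class i take the best F(i, j) over all
      # clusters j, then weight by class size n_i / n.
      n = sum(len(c) for c in clusters)
      class_sizes = {}
      for cluster in clusters:
          for doc in cluster:
              label = labels_of(doc)
              class_sizes[label] = class_sizes.get(label, 0) + 1
      total = 0.0
      for i, n_i in class_sizes.items():
          best = 0.0
          for cluster in clusters:
              n_ij = sum(1 for doc in cluster if labels_of(doc) == i)
              if n_ij == 0:
                  continue
              recall = n_ij / n_i            # Recall(i, j)
              precision = n_ij / len(cluster)  # Precision(i, j)
              f = 2 * precision * recall / (precision + recall)
              best = max(best, f)
          total += (n_i / n) * best
      return total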