Title: Document Clustering
1CS 6633 Information Retrieval and Web Search
- Lecture 9
- Document Clustering
Based on ppt files by Hinrich Schütze, Ray Mooney, and Soumen Chakrabarti
2Today's Topic: Clustering
- Document clustering
- Motivations
- Document representations
- Success criteria
- Clustering algorithms
- Partitional
- Hierarchical
3What is clustering?
- Clustering: the process of grouping a set of objects into classes of similar objects
- The commonest form of unsupervised learning
- Unsupervised learning: learning from raw data, as opposed to supervised learning, where a classification of examples is given
- A common and important task with many applications in IR and beyond
4Why cluster documents?
- Whole corpus analysis/navigation
- Better user interface
- For improving recall in search applications
- Better search results
- For better navigation of search results
- Effective user recall will be higher
- For speeding up vector space retrieval
- Retrieve clusters and then documents
- Faster search
5Google News presents news clusters
6Yahoo! Hierarchy
[Figure: a portion of the Yahoo! directory tree rooted at www.yahoo.com/Science, with top-level categories such as agriculture, biology, physics, CS, and space, and subcategories including dairy, crops, agronomy, forestry, botany, cell, evolution, AI, HCI, courses, craft, missions, magnetism, and relativity]
7Scatter/Gather Clustering
- Developed at PARC in the late 80s/early 90s
- Based on two novel clustering algorithms
- Buckshot: fast, for online clustering
- Fractionation: accurate, for offline initial clustering of the entire set
- Top-down approach
- Start with k seeds (documents) to represent k clusters
- Each document is assigned to the cluster with the most similar seed
Cutting, Karger, Pedersen, Tukey, "Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections", SIGIR 1992
8Fractionation
Invented by Cutting, Karger, Pedersen, and Tukey for nonparametric clustering of large datasets.
Source: Jeremy Tantrum, www.stat.washington.edu/wxs/Stat593-s03/Slides/jeremyKDDtalk.ppt
9Scatter/Gather
Cutting, Karger, Pedersen, Tukey, "Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections", SIGIR 1992
10The Scatter/Gather Interface
11Two Queries, Two Clusterings
Query 1: AUTO, CAR, ELECTRIC
Query 2: AUTO, CAR, SAFETY
[Figure: Scatter/Gather cluster summaries for the two result sets, each shown as cluster sizes with top terms (battery/california/technology, honda/toyota/japan, import/export, investigation/washington, fuel/death/air bag, sale/domestic/truck, ...)]
The main differences are the clusters that are central to the query.
12Scatter/Gather Evaluations
- Good points
- The clusters do group relevant documents together
- Participants found it useful for eliminating irrelevant groups
- Bad points
- The clusters can be difficult to understand
- Results are not consistent across runs
- It can be slower to find answers
13-14 (No transcript: image-only slides)
15Visualizing Clustering Results
- Use clustering to map the entire huge, multidimensional document space into a large number of small clusters.
- Use dimension reduction to project these clusters onto a 2D/3D graphical representation
16For visualizing a document collection and its themes
- Wise et al., "Visualizing the non-visual" (PNNL)
- ThemeScapes, Cartia
- Mountain height = cluster size
17For improving search recall
- Cluster hypothesis: documents with similar text are related
- Therefore, to improve search recall:
- Cluster docs in the corpus in advance
- When a query matches a doc D, also return other docs in the cluster containing D
- Hope if we do this: the query "car" will also return docs containing "automobile"
- Because clustering grouped together docs containing "car" with those containing "automobile"
- Why might this happen?
18For better navigation of search results
- For grouping search results thematically
- clusty.com / Vivisimo
19Issues for clustering
- Representation for clustering
- Document representation
- Vector space? Normalization?
- Need a notion of similarity/distance
- How many clusters?
- Fixed a priori?
- Completely data driven?
- Avoid trivial clusters - too large or small
- In an application, if a cluster's too large, then
for navigation purposes you've wasted an extra
user click without whittling down the set of
documents much.
20What makes docs related?
- Ideal: semantic similarity
- Practical: statistical similarity
- Cosine similarity is a good choice
- Docs as vectors
- For many algorithms, it is easier to think in terms of a distance (rather than a similarity) between docs, as sketched below
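A minimal sketch of this similarity/distance view on term-weight vectors (plain NumPy; the function names are illustrative assumptions, not from the slides):

import numpy as np

def cosine_sim(d1, d2):
    # cosine similarity between two term-weight vectors
    denom = np.linalg.norm(d1) * np.linalg.norm(d2)
    return float(d1 @ d2) / denom if denom else 0.0

def cosine_dist(d1, d2):
    # a distance for algorithms that expect distances rather than similarities
    return 1.0 - cosine_sim(d1, d2)

# e.g., cosine_dist(np.array([1.0, 2.0, 0.0]), np.array([2.0, 4.0, 0.0])) ≈ 0.0 (parallel vectors)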
21Clustering Algorithms
- Partitional algorithms
- Usually start with a random (partial) partitioning
- Refine it iteratively
- K-means clustering
- Model-based clustering
- Hierarchical algorithms
- Bottom-up, agglomerative
- Top-down, divisive
22Model-based clustering
- In model-based clustering it is assumed that the data are generated by a mixture of Gaussian models
- Observations are sampled from a mixture density p(x) = Σ_g π_g p_g(x)
- Use training data to fit a mixture of Gaussians
- Use the EM algorithm to maximize the log likelihood (a fitting sketch follows below)
Source: Jeremy Tantrum, www.stat.washington.edu/wxs/Stat593-s03/Slides/jeremyKDDtalk.ppt
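A hedged sketch of fitting such a mixture with EM, here using scikit-learn's GaussianMixture (the library choice and the synthetic input are assumptions, not from the slides):

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.rand(200, 10)   # stand-in for document vectors (typically after dimension reduction)

gm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
labels = gm.fit_predict(X)    # EM fit, then assign each point to its most likely component
print(gm.weights_)            # estimated mixing proportions pi_g
print(gm.lower_bound_)        # lower bound on the log likelihood at convergence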
23Model Based Clustering
Fitting: estimate the mixing proportions π_g and the parameters of each component density p_g(x)
Source: Jeremy Tantrum, www.stat.washington.edu/wxs/Stat593-s03/Slides/jeremyKDDtalk.ppt
24Model Based Clustering
- Hierarchical clustering provides a good starting point for the EM algorithm
- Start with every point being its own cluster
- Merge the two closest clusters
- "Closest" is measured by the decrease in likelihood when those two clusters are merged
- Uses the classification likelihood, not the mixture likelihood
- The algorithm is quadratic in the number of observations
Source: Jeremy Tantrum, www.stat.washington.edu/wxs/Stat593-s03/Slides/jeremyKDDtalk.ppt
25Partitioning Algorithms
- Partitioning method: construct a partition of n documents into a set of K clusters
- Given a set of documents and the number K
- Find a partition of K clusters that optimizes the chosen partitioning criterion
- Globally optimal: exhaustively enumerate all partitions
- Effective heuristic methods: K-means and K-medoids algorithms
26K-Means
- Assumes documents are real-valued vectors.
- Clusters based on centroids (aka the center of gravity or mean) of the points in a cluster c: μ(c) = (1/|c|) Σ_{x∈c} x
- Reassignment of instances to clusters is based on distance to the current cluster centroids.
- (Or one can equivalently phrase it in terms of similarities)
27K-Means Algorithm
Select K random docs s1, s2, ..., sK as seeds.
Until clustering converges (or another stopping criterion is met):
  For each doc di:
    Assign di to the cluster cj such that dist(di, sj) is minimal.
  (Update the seeds to the centroid of each cluster)
  For each cluster cj:
    Recompute its seed sj = μ(cj)
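A minimal NumPy sketch of the algorithm above, assuming dense document vectors and Euclidean distance (the function name and the tie/empty-cluster handling are my own):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    # X: n_docs x n_terms array of document vectors
    rng = np.random.default_rng(seed)
    seeds = X[rng.choice(len(X), size=k, replace=False)]      # K random docs as seeds
    for _ in range(max_iter):
        # assignment step: each doc goes to the cluster with the nearest current seed
        dists = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: recompute each seed as the centroid (mean) of its cluster
        new_seeds = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else seeds[j] for j in range(k)])
        if np.allclose(new_seeds, seeds):                      # centroids unchanged -> converged
            break
        seeds = new_seeds
    return labels, seeds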
28K-Means Example (K = 2)
[Figure: K-means iterations with K = 2 - assign points to the nearest seed, recompute centroids, reassign clusters, repeat until converged]
29Termination conditions
- Several possibilities, e.g.,
- A fixed number of iterations.
- Doc partition unchanged.
- Centroid positions don't change.
- Does this mean that the docs in a cluster are
unchanged?
30Convergence
- Why should the K-means algorithm ever reach a fixed point?
- A state in which the clusters don't change
- K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm
- EM is known to converge
- The number of iterations could be large
31Convergence of K-Means
- Define the goodness measure of cluster k as the sum of squared distances from the cluster centroid
- G_k = Σ_i (d_i − c_k)²  (sum over all d_i in cluster k)
- G = Σ_k G_k
- Reassignment monotonically decreases G since each vector is assigned to the closest centroid.
32Convergence of K-Means
- Recomputation monotonically decreases G_k = Σ_i (d_i − c_k)² because:
- Σ_i (d_i − a)² reaches its minimum where
- Σ_i −2(d_i − a) = 0
- Σ_i d_i = Σ_i a = m_k a
- a = (1/m_k) Σ_i d_i = c_k
- m_k is the number of members in cluster k
- K-means typically converges quickly
33Time Complexity
- Computing the distance between two docs is O(m), where m is the dimensionality of the vectors.
- Reassigning clusters: O(Kn) distance computations, i.e., O(Knm).
- Computing centroids: each doc gets added once to some centroid, O(nm).
- Assume these two steps are each done once, for I iterations: O(IKnm).
34Seed Choice
- Results can vary based on random seed selection.
- Some seeds can result in a poor convergence rate, or convergence to sub-optimal clusterings.
- Select good seeds using a heuristic (e.g., the doc least similar to any existing mean)
- Try out multiple starting points (see the sketch below)
- Initialize with the results of another method.
Example showing sensitivity to seeds (points A-F in the figure): if you start with B and E as centroids, you converge to {A,B,C} and {D,E,F}; if you start with D and F, you converge to {A,B,D,E} and {C,F}.
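A hedged sketch of the "multiple starting points" idea with scikit-learn's K-means (k-means++-style seeding plus restarts; the library choice and synthetic data are assumptions):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(100, 20)     # stand-in for an n_docs x n_terms matrix

# smarter seeding plus 20 random restarts; the best run (lowest inertia) is kept
km = KMeans(n_clusters=5, init="k-means++", n_init=20, random_state=0).fit(X)
print(km.inertia_)              # within-cluster sum of squared distances of the best run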
35How Many Clusters?
- Number of clusters K is given
- Partition n docs into a predetermined number of clusters
- Finding the right number of clusters is part of the problem
- Given docs, partition into an "appropriate" number of subsets.
- E.g., for query results - the ideal value of K is not known up front, though the UI may impose limits.
- Can usually take an algorithm for one flavor and convert to the other.
36K not specified in advance
- Say, the results of a query.
- Solve an optimization problem: penalize having lots of clusters
- Application dependent, e.g., a compressed summary of the search results list.
- Tradeoff between having more clusters (better focus within each cluster) and having too many clusters
37K not specified in advance
- Given a clustering, define the Benefit of a doc to be the cosine similarity to its centroid
- Define the Total Benefit to be the sum of the individual doc Benefits.
- What is the Total Benefit if each document is its own cluster?
38Penalize lots of clusters
- For each cluster, we have a Cost C.
- Thus for a clustering with K clusters, the Total Cost is KC.
- Define the Value of a clustering to be Total Benefit - Total Cost.
- Find the clustering of highest Value, over all choices of K (a sketch follows below).
- Total Benefit increases with increasing K, but we can stop when it doesn't increase by much; the Cost term enforces this.
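A sketch of picking K by this Value criterion; the helper name, the cost constant, and the use of scikit-learn K-means on a dense matrix are illustrative assumptions:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def clustering_value(X, k, cost_per_cluster=2.0):
    # Total Benefit (sum of doc-to-centroid cosines) minus Total Cost (K * C)
    Xn = normalize(X)                                   # unit-length rows: dot product = cosine
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(Xn)
    centroids = normalize(km.cluster_centers_)
    benefit = np.sum(Xn * centroids[km.labels_])
    return benefit - k * cost_per_cluster

# X = ...  (dense n_docs x n_terms tf-idf matrix)
# best_k = max(range(2, 11), key=lambda k: clustering_value(X, k))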
39K-means issues, variations, etc.
- Variations
- Recomputing the centroid after every assignment (improves speed of convergence)
- Recomputing the centroid after all re-assignments
- Assumes clusters are spherical in vector space
- Sensitive to coordinate changes, weighting, etc.
- Soft clusters vs. hard clusters
- Allowing outliers?
40Hierarchical Clustering
- Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents.
- One approach: recursive application of a partitional clustering algorithm.
41Dendrogram: Hierarchical Clustering
- Cutting the dendrogram yields connected components that serve as flat clusters
42Hierarchical Agglomerative Clustering (HAC)
- Starts with each doc in a separate cluster
- Then repeatedly joins the closest pair of clusters, until there is only one cluster.
- The history of merging forms a binary tree or hierarchy.
43Closest pair of clusters
- Many variants to defining the closest pair of clusters
- Single-link
- Similarity of the most cosine-similar pair (single-link)
- Complete-link
- Similarity of the furthest points, i.e., the least cosine-similar pair
- Centroid
- Clusters whose centroids (centers of gravity) are the most cosine-similar
- Average-link
- Average cosine between pairs of elements
(A sketch of the single/complete/average variants follows below.)
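A hedged sketch of the single/complete/average variants using SciPy's hierarchical clustering routines (the library choice and synthetic data are assumptions; the centroid variant needs Euclidean distances, so it is omitted here):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.random.rand(20, 50)                          # stand-in for tf-idf document vectors
dists = pdist(X, metric="cosine")                   # condensed pairwise cosine distances

for method in ("single", "complete", "average"):
    Z = linkage(dists, method=method)               # merge history (dendrogram)
    flat = fcluster(Z, t=4, criterion="maxclust")   # cut into 4 flat clusters
    print(method, np.bincount(flat)[1:])            # cluster sizes (labels start at 1)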
44Single Link Agglomerative Clustering
- Use maximum similarity of pairs: sim(c_i, c_j) = max over x in c_i, y in c_j of sim(x, y)
- Can result in long and thin clusters due to the chaining effect.
- After merging c_i and c_j, the similarity of the resulting cluster to another cluster c_k is sim((c_i ∪ c_j), c_k) = max(sim(c_i, c_k), sim(c_j, c_k))
45Single Link Example
[Figure: single-link clustering producing long and thin clusters due to the chaining effect]
46Complete Link Agglomerative Clustering
- Use minimum similarity of pairs: sim(c_i, c_j) = min over x in c_i, y in c_j of sim(x, y)
- Produces tighter, spherical clusters that are typically preferable
- After merging c_i and c_j, the similarity of the resulting cluster to another cluster c_k is sim((c_i ∪ c_j), c_k) = min(sim(c_i, c_k), sim(c_j, c_k))
47Complete Link Example
[Figure: complete-link clustering producing tighter, spherical clusters that are typically preferable]
48Computational Complexity
- In the first iteration, all HAC methods need to compute the similarity of all pairs of n individual instances, which is O(n²).
- In each of the subsequent n−2 merging iterations, compute the distance between the most recently created cluster and all other existing clusters.
- In order to maintain overall O(n²) performance, computing the similarity to each other cluster must be done in constant time.
- Otherwise it is O(n² log n), or O(n³) if done naively
49Group Average Agglomerative Clustering
- Similarity of two clusters = average similarity of all pairs within the merged cluster.
- Compromise between single-link (max) and complete-link (min)
- Averaging options
- Averaged across all ordered pairs in the merged cluster, or only across pairs between the two original clusters
- No clear difference in performance
50Computing Group Average Similarity
- Always maintain the sum of vectors in each cluster.
- Compute the similarity of clusters in constant time (see the reconstruction below)
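The formula image from this slide did not survive the transcript; the following is a reconstruction of the standard constant-time computation (following IIR Ch. 17), assuming unit-length document vectors, writing s(ω) = Σ_{d∈ω} d for the maintained vector sum and N_i for the size of cluster ω_i:

sim-GA(ω_i, ω_j) = [ (s(ω_i) + s(ω_j)) · (s(ω_i) + s(ω_j)) − (N_i + N_j) ] / [ (N_i + N_j)(N_i + N_j − 1) ]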
51What Is A Good Clustering?
- Internal criterion: a good clustering will produce high-quality clusters in which
- the intra-cluster similarity is high
- the inter-cluster similarity is low
- The measured quality of a clustering depends on both the document representation and the similarity measure used
52External criteria for clustering quality
- Quality is measured by the ability to discover some or all of the hidden patterns or latent classes in gold-standard data
- Assesses a clustering with respect to ground truth
- Assume documents with C gold-standard classes, while our clustering algorithm produces K clusters ω_1, ω_2, ..., ω_K, with n_i members each.
53External Evaluation of Cluster Quality
- Simple measure: purity, the ratio between the size of the dominant (gold) class in cluster ω_i and the size of cluster ω_i
- Others are the entropy of classes in clusters (or the mutual information between classes and clusters)
54Purity example
[Figure: three clusters of 6, 6, and 5 documents drawn from three gold classes]
Cluster I: Purity = 1/6 · max(5, 1, 0) = 5/6
Cluster II: Purity = 1/6 · max(1, 4, 1) = 4/6
Cluster III: Purity = 1/5 · max(2, 0, 3) = 3/5
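A minimal sketch of the purity computation on a cluster-by-class contingency table; the table below re-encodes the three clusters above, and the function name is my own:

import numpy as np

def purity(counts):
    # counts[i][j] = number of docs of gold class j that land in cluster i
    counts = np.asarray(counts)
    return counts.max(axis=1).sum() / counts.sum()

table = [[5, 1, 0],    # Cluster I
         [1, 4, 1],    # Cluster II
         [2, 0, 3]]    # Cluster III
print(purity(table))   # (5 + 4 + 3) / 17 ≈ 0.71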
55Rand Index
- Consider all pairs of documents: a pair is a true positive (TP) if it is in the same cluster and the same class, and a true negative (TN) if it is in different clusters and different classes; RI = (TP + TN) / (TP + FP + FN + TN)
56Rand index: symmetric version
Similar to precision and recall
57Rand Index example: 0.68
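A minimal sketch of the Rand index as pairwise agreement between the clustering and the gold classes (the function name and the example labels are invented for illustration, not the data behind the 0.68 above):

from itertools import combinations

def rand_index(cluster_labels, class_labels):
    # fraction of document pairs on which clustering and gold classes agree
    # (same cluster & same class, or different cluster & different class)
    agree = total = 0
    for (c1, g1), (c2, g2) in combinations(zip(cluster_labels, class_labels), 2):
        total += 1
        agree += (c1 == c2) == (g1 == g2)
    return agree / total

# e.g., rand_index([1, 1, 2, 2], ["x", "x", "x", "o"])  -> 3/6 = 0.5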
58Using collocation clusters in teaching academic writing
Source: [Chinese-language reference, garbled in transcript], 2007
59Example: collocation clustering using translation-based similarity
- Input: V-N collocations with "method"
- Output (Chinese verbs garbled in transcript):
- [Chinese verb cluster] present, advocate, introduce, propose, recommend
- [Chinese verb cluster] study, analyze, consider, investigate + method
- [Chinese verb cluster] employ, utilize, use, adopt, implement, impose, apply, execute, perform
- [Chinese verb cluster] limit, restrict, dominate, manage, manipulate, control
- [Chinese verb cluster] develop, expand, build, augment, emphasize, improve, extend, adjust
- Vs. manual approach:
- [verb group] adopt, apply, employ, utilize + method
- [verb group] come up with, develop, formulate, work out + method
- [verb group] design, devise + method
- [verb group] propose, recommend + method
- [verb group] abandon, scrap, give up + method
- [verb group] introduce + method; [verb group] test + method; [verb group] try + method
60Integrating an ESP corpus with a parallel corpus to cluster collocations
- ESP Corpus: ACL Anthology
- ACL: Association for Computational Linguistics
- NLP handling: tagging, chunking
- Collocation extraction: VN, AN, NV, RV, RA, VPN, VNP
- Translation-based clustering (e.g., VN)
- Find corpus translations TV1 and TV2 of V1 and V2
- Sim(V1, V2) = overlap of TV1 and TV2, i.e., Dice(V1, V2) (sketched below)
- Collocation-based clustering (e.g., VN)
- Find collocates CV1 and CV2 for V1 and V2
- Sim(V1, V2) = overlap of CV1 and CV2, i.e., Dice(V1, V2)
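A minimal sketch of the Dice overlap used as the similarity here (set-based; the example translation sets are invented for illustration):

def dice(a, b):
    # Dice coefficient: 2 * |a ∩ b| / (|a| + |b|)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

# similarity of two verbs via their sets of corpus translations
tv1 = {"use", "adopt", "apply"}
tv2 = {"adopt", "apply", "employ"}
print(dice(tv1, tv2))   # 2*2 / (3+3) ≈ 0.67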
61Translations from Hong Kong Parallel Texts
- Input: V-N collocations with "method"
- Output (Chinese verbs garbled in transcript):
- [Chinese verb] develop, recommend, survey, learn, study, analyze, consider, find, investigate
- [Chinese verb] create, cause, add, result, impose, arise, mean, represent, generate
- [Chinese verb] conduct, develop, expand, activate, start, execute, supervise
- [Chinese verb] process, start, execute, coordinate, test, perform, investigate, implement
- [Chinese verb] create, cause, result, impose, arise, emerge, undermine, generate
- [Chinese verb] recommend, suggest, apply, present, express, justify, name, advocate
- [Chinese verb] accept, consider, find, view, feel, think, hope
- [Chinese verb] solve, eliminate, help, attempt, find, remove, address
- [Chinese verb] suggest, discover, consider, find, view, feel, think
- [Chinese verb] process, manage, approach, resolve, solve, address, cover
62- [Chinese verb] develop, highlight, exploit, build, utilize, emphasize, perform
- [Chinese verb] limit, contain, restrict, dominate, manage, manipulate, control
- [Chinese verb] employ, arrange, utilize, do, use, adopt, implement
- [Chinese verb] develop, expand, build, augment, emphasize, improve, devote
- [Chinese verb] inherit, pursue, evolve, keep, go, proceed
- [Chinese verb] develop, expand, help, advocate, implement, generate
- [Chinese verb] employ, arrange, utilize, use, adopt, implement
- [Chinese verb] handle, manage, adjust, match, address, cover
- [Chinese verb] introduce, impose, apply, justify, adopt, implement
- [Chinese verb] extend, adjust, augment, provide, improve, generate
http://tera.zibox.cc/tpb/
63Resources
- IIR (Manning, Raghavan & Schütze, Introduction to Information Retrieval) Chapter 16, except 16.5
- IIR Chapter 17, except 17.3