Title: Document Clustering
1CS 6633 Information Retrieval and Web Search
- Lecture 9
- Document Clustering
Based on ppt files by Hinrich Schütze, Ray Mooney, and Soumen Chakrabarti
2Today's Topic: Clustering
- Document clustering
- Motivations
- Document representations
- Success criteria
- Clustering algorithms
- Partitional
- Hierarchical
3What is clustering?
- Clustering: the process of grouping a set of objects into classes of similar objects
- The commonest form of unsupervised learning
- Unsupervised learning: learning from raw data, as opposed to supervised learning, where a classification of examples is given
- A common and important task with many applications in IR and beyond
4Why cluster documents?
- Whole corpus analysis/navigation
- Better user interface
- For improving recall in search applications
- Better search results
- For better navigation of search results
- Effective user recall will be higher
- For speeding up vector space retrieval
- Retrieve clusters and then documents
- Faster search
5Google News presents news clusters
6Yahoo! Hierarchy
[Figure: a portion of the Yahoo! directory tree rooted at www.yahoo.com/Science, with top-level categories such as agriculture, biology, physics, CS, and space, and subcategories including dairy, crops, agronomy, forestry, botany, cell, evolution, AI, HCI, courses, craft, missions, magnetism, and relativity]
7Scatter/Gather Clustering
- Developed at PARC in the late 80s/early 90s
- Based on two novel clustering algorithms
- Buckshot: fast, for online clustering
- Fractionation: accurate, for offline initial clustering of the entire set
- Top-down approach
- Start with k seeds (documents) to represent k clusters
- Each document is assigned to the cluster with the most similar seed
Cutting, Karger, Pedersen, Tukey, "Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections", SIGIR 1992
8Fractionation
Invented by Cutting, Karger, Pedersen, and Tukey for nonparametric clustering of large datasets.
Source: Jeremy Tantrum, www.stat.washington.edu/wxs/Stat593-s03/Slides/jeremyKDDtalk.ppt
9Scatter/Gather
Cutting, Karger, Pedersen, Tukey, "Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections", SIGIR 1992
10The Scatter/Gather Interface
11Two Queries, Two Clusterings
Query 1: AUTO, CAR, ELECTRIC
Query 2: AUTO, CAR, SAFETY
[Figure: Scatter/Gather cluster summaries for the two result sets, each shown as cluster sizes with top terms (battery/california/technology, honda/toyota/japan, import/export, investigation/washington, fuel/death/air bag, sale/domestic/truck, ...)]
The main differences are the clusters that are central to the query.
12Scatter/Gather Evaluations
- Good points
- The clusters do group relevant documents together
- Participants found it useful for eliminating irrelevant groups
- Bad points
- The clusters can be difficult to understand
- Results are not consistent across runs
- It can be slower to find answers
13-14 (No transcript: image-only slides)
15Visualizing Clustering Results
- Use clustering to map the entire huge, multidimensional document space into a large number of small clusters.
- Use dimension reduction to project these clusters onto a 2D/3D graphical representation
16For visualizing a document collection and its themes
- Wise et al., "Visualizing the non-visual" (PNNL)
- ThemeScapes, Cartia
- Mountain height = cluster size
17For improving search recall
- Cluster hypothesis: documents with similar text are related
- Therefore, to improve search recall:
- Cluster docs in the corpus in advance
- When a query matches a doc D, also return other docs in the cluster containing D
- Hope if we do this: the query "car" will also return docs containing "automobile"
- Because clustering grouped together docs containing "car" with those containing "automobile"
- Why might this happen?
18For better navigation of search results
- For grouping search results thematically
- clusty.com / Vivisimo
19Issues for clustering
- Representation for clustering
- Document representation
- Vector space? Normalization?
- Need a notion of similarity/distance
- How many clusters?
- Fixed a priori?
- Completely data driven?
- Avoid trivial clusters - too large or small
- In an application, if a cluster's too large, then
for navigation purposes you've wasted an extra
user click without whittling down the set of
documents much.
20What makes docs related?
- Ideal: semantic similarity
- Practical: statistical similarity
- Cosine similarity is a good choice
- Docs as vectors
- For many algorithms, it is easier to think in terms of a distance (rather than a similarity) between docs, as sketched below
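A minimal sketch of this similarity/distance view on term-weight vectors (plain NumPy; the function names are illustrative assumptions, not from the slides):

import numpy as np

def cosine_sim(d1, d2):
    # cosine similarity between two term-weight vectors
    denom = np.linalg.norm(d1) * np.linalg.norm(d2)
    return float(d1 @ d2) / denom if denom else 0.0

def cosine_dist(d1, d2):
    # a distance for algorithms that expect distances rather than similarities
    return 1.0 - cosine_sim(d1, d2)

# e.g., cosine_dist(np.array([1.0, 2.0, 0.0]), np.array([2.0, 4.0, 0.0])) ≈ 0.0 (parallel vectors)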
21Clustering Algorithms
- Partitional algorithms
- Usually start with a random (partial) partitioning
- Refine it iteratively
- K-means clustering
- Model-based clustering
- Hierarchical algorithms
- Bottom-up, agglomerative
- Top-down, divisive
22Model-based clustering
- In model-based clustering it is assumed that the data are generated by a mixture of Gaussian models
- Observations are sampled from a mixture density p(x) = Σ_g π_g p_g(x)
- Use training data to fit a mixture of Gaussians
- Use the EM algorithm to maximize the log likelihood (a fitting sketch follows below)
Source: Jeremy Tantrum, www.stat.washington.edu/wxs/Stat593-s03/Slides/jeremyKDDtalk.ppt
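A hedged sketch of fitting such a mixture with EM, here using scikit-learn's GaussianMixture (the library choice and the synthetic input are assumptions, not from the slides):

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.rand(200, 10)   # stand-in for document vectors (typically after dimension reduction)

gm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
labels = gm.fit_predict(X)    # EM fit, then assign each point to its most likely component
print(gm.weights_)            # estimated mixing proportions pi_g
print(gm.lower_bound_)        # lower bound on the log likelihood at convergence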
23Model Based Clustering
Fitting: estimate the mixing proportions π_g and the parameters of each component density p_g(x)
Source: Jeremy Tantrum, www.stat.washington.edu/wxs/Stat593-s03/Slides/jeremyKDDtalk.ppt
24Model Based Clustering
- Hierarchical clustering provides a good starting point for the EM algorithm
- Start with every point being its own cluster
- Merge the two closest clusters
- "Closest" is measured by the decrease in likelihood when those two clusters are merged
- Uses the classification likelihood, not the mixture likelihood
- The algorithm is quadratic in the number of observations
Source: Jeremy Tantrum, www.stat.washington.edu/wxs/Stat593-s03/Slides/jeremyKDDtalk.ppt
25Partitioning Algorithms
- Partitioning method: construct a partition of n documents into a set of K clusters
- Given a set of documents and the number K
- Find a partition of K clusters that optimizes the chosen partitioning criterion
- Globally optimal: exhaustively enumerate all partitions
- Effective heuristic methods: K-means and K-medoids algorithms
26K-Means
- Assumes documents are real-valued vectors.
- Clusters based on centroids (aka the center of gravity or mean) of the points in a cluster c: μ(c) = (1/|c|) Σ_{x∈c} x
- Reassignment of instances to clusters is based on distance to the current cluster centroids.
- (Or one can equivalently phrase it in terms of similarities)
27K-Means Algorithm
Select K random docs s1, s2, ..., sK as seeds.
Until clustering converges (or another stopping criterion is met):
  For each doc di:
    Assign di to the cluster cj such that dist(di, sj) is minimal.
  (Update the seeds to the centroid of each cluster)
  For each cluster cj:
    Recompute its seed sj = μ(cj)
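A minimal NumPy sketch of the algorithm above, assuming dense document vectors and Euclidean distance (the function name and the tie/empty-cluster handling are my own):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    # X: n_docs x n_terms array of document vectors
    rng = np.random.default_rng(seed)
    seeds = X[rng.choice(len(X), size=k, replace=False)]      # K random docs as seeds
    for _ in range(max_iter):
        # assignment step: each doc goes to the cluster with the nearest current seed
        dists = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: recompute each seed as the centroid (mean) of its cluster
        new_seeds = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else seeds[j] for j in range(k)])
        if np.allclose(new_seeds, seeds):                      # centroids unchanged -> converged
            break
        seeds = new_seeds
    return labels, seeds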
28K-Means Example (K = 2)
[Figure: K-means iterations with K = 2 - assign points to the nearest seed, recompute centroids, reassign clusters, repeat until converged]
29Termination conditions
- Several possibilities, e.g.,
- A fixed number of iterations.
- Doc partition unchanged.
- Centroid positions don't change.
- Does this mean that the docs in a cluster are
unchanged?
30Convergence
- Why should the K-means algorithm ever reach a fixed point?
- A state in which the clusters don't change
- K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm
- EM is known to converge
- The number of iterations could be large
31Convergence of K-Means
- Define the goodness measure of cluster k as the sum of squared distances from the cluster centroid
- G_k = Σ_i (d_i − c_k)²  (sum over all d_i in cluster k)
- G = Σ_k G_k
- Reassignment monotonically decreases G since each vector is assigned to the closest centroid.
32Convergence of K-Means
- Recomputation monotonically decreases G_k = Σ_i (d_i − c_k)² because:
- Σ_i (d_i − a)² reaches its minimum where
- Σ_i −2(d_i − a) = 0
- Σ_i d_i = Σ_i a = m_k a
- a = (1/m_k) Σ_i d_i = c_k
- m_k is the number of members in cluster k
- K-means typically converges quickly
33Time Complexity
- Computing the distance between two docs is O(m), where m is the dimensionality of the vectors.
- Reassigning clusters: O(Kn) distance computations, i.e., O(Knm).
- Computing centroids: each doc gets added once to some centroid, O(nm).
- Assume these two steps are each done once, for I iterations: O(IKnm).
34Seed Choice
- Results can vary based on random seed selection.
- Some seeds can result in a poor convergence rate, or convergence to sub-optimal clusterings.
- Select good seeds using a heuristic (e.g., the doc least similar to any existing mean)
- Try out multiple starting points (see the sketch below)
- Initialize with the results of another method.
Example showing sensitivity to seeds (points A-F in the figure): if you start with B and E as centroids, you converge to {A,B,C} and {D,E,F}; if you start with D and F, you converge to {A,B,D,E} and {C,F}.
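A hedged sketch of the "multiple starting points" idea with scikit-learn's K-means (k-means++-style seeding plus restarts; the library choice and synthetic data are assumptions):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(100, 20)     # stand-in for an n_docs x n_terms matrix

# smarter seeding plus 20 random restarts; the best run (lowest inertia) is kept
km = KMeans(n_clusters=5, init="k-means++", n_init=20, random_state=0).fit(X)
print(km.inertia_)              # within-cluster sum of squared distances of the best run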
35How Many Clusters?
- Number of clusters K is given
- Partition n docs into a predetermined number of clusters
- Finding the right number of clusters is part of the problem
- Given docs, partition into an "appropriate" number of subsets.
- E.g., for query results - the ideal value of K is not known up front, though the UI may impose limits.
- Can usually take an algorithm for one flavor and convert to the other.
36K not specified in advance
- Say, the results of a query.
- Solve an optimization problem: penalize having lots of clusters
- Application dependent, e.g., a compressed summary of the search results list.
- Tradeoff between having more clusters (better focus within each cluster) and having too many clusters
37K not specified in advance
- Given a clustering, define the Benefit of a doc to be the cosine similarity to its centroid
- Define the Total Benefit to be the sum of the individual doc Benefits.
- What is the Total Benefit if each document is its own cluster?
38Penalize lots of clusters
- For each cluster, we have a Cost C.
- Thus for a clustering with K clusters, the Total Cost is KC.
- Define the Value of a clustering to be Total Benefit - Total Cost.
- Find the clustering of highest Value, over all choices of K (a sketch follows below).
- Total Benefit increases with increasing K, but we can stop when it doesn't increase by much; the Cost term enforces this.
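A sketch of picking K by this Value criterion; the helper name, the cost constant, and the use of scikit-learn K-means on a dense matrix are illustrative assumptions:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize

def clustering_value(X, k, cost_per_cluster=2.0):
    # Total Benefit (sum of doc-to-centroid cosines) minus Total Cost (K * C)
    Xn = normalize(X)                                   # unit-length rows: dot product = cosine
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(Xn)
    centroids = normalize(km.cluster_centers_)
    benefit = np.sum(Xn * centroids[km.labels_])
    return benefit - k * cost_per_cluster

# X = ...  (dense n_docs x n_terms tf-idf matrix)
# best_k = max(range(2, 11), key=lambda k: clustering_value(X, k))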
39K-means issues, variations, etc.
- Variations
- Recomputing the centroid after every assignment (improves speed of convergence)
- Recomputing the centroid after all re-assignments
- Assumes clusters are spherical in vector space
- Sensitive to coordinate changes, weighting, etc.
- Soft clusters vs. hard clusters
- Allowing outliers?
40Hierarchical Clustering
- Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents.
- One approach: recursive application of a partitional clustering algorithm.
41Dendrogram: Hierarchical Clustering
- Cutting the dendrogram yields connected components that serve as flat clusters
42Hierarchical Agglomerative Clustering (HAC)
- Starts with each doc in a separate cluster
- Then repeatedly joins the closest pair of clusters, until there is only one cluster.
- The history of merging forms a binary tree or hierarchy.
43Closest pair of clusters
- Many variants to defining the closest pair of clusters
- Single-link
- Similarity of the most cosine-similar pair (single-link)
- Complete-link
- Similarity of the furthest points, i.e., the least cosine-similar pair
- Centroid
- Clusters whose centroids (centers of gravity) are the most cosine-similar
- Average-link
- Average cosine between pairs of elements
(A sketch of the single/complete/average variants follows below.)
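A hedged sketch of the single/complete/average variants using SciPy's hierarchical clustering routines (the library choice and synthetic data are assumptions; the centroid variant needs Euclidean distances, so it is omitted here):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.random.rand(20, 50)                          # stand-in for tf-idf document vectors
dists = pdist(X, metric="cosine")                   # condensed pairwise cosine distances

for method in ("single", "complete", "average"):
    Z = linkage(dists, method=method)               # merge history (dendrogram)
    flat = fcluster(Z, t=4, criterion="maxclust")   # cut into 4 flat clusters
    print(method, np.bincount(flat)[1:])            # cluster sizes (labels start at 1)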
44Single Link Agglomerative Clustering
- Use maximum similarity of pairs: sim(c_i, c_j) = max over x in c_i, y in c_j of sim(x, y)
- Can result in long and thin clusters due to the chaining effect.
- After merging c_i and c_j, the similarity of the resulting cluster to another cluster c_k is sim((c_i ∪ c_j), c_k) = max(sim(c_i, c_k), sim(c_j, c_k))
45Single Link Example
[Figure: single-link clustering producing long and thin clusters due to the chaining effect]
46Complete Link Agglomerative Clustering
- Use minimum similarity of pairs: sim(c_i, c_j) = min over x in c_i, y in c_j of sim(x, y)
- Produces tighter, spherical clusters that are typically preferable
- After merging c_i and c_j, the similarity of the resulting cluster to another cluster c_k is sim((c_i ∪ c_j), c_k) = min(sim(c_i, c_k), sim(c_j, c_k))
47Complete Link Example
[Figure: complete-link clustering producing tighter, spherical clusters that are typically preferable]
48Computational Complexity
- In the first iteration, all HAC methods need to compute the similarity of all pairs of n individual instances, which is O(n²).
- In each of the subsequent n−2 merging iterations, compute the distance between the most recently created cluster and all other existing clusters.
- In order to maintain overall O(n²) performance, computing the similarity to each other cluster must be done in constant time.
- Otherwise it is O(n² log n), or O(n³) if done naively
49Group Average Agglomerative Clustering
- Similarity of two clusters = average similarity of all pairs within the merged cluster.
- Compromise between single-link (max) and complete-link (min)
- Averaging options
- Averaged across all ordered pairs in the merged cluster, or only across pairs between the two original clusters
- No clear difference in performance
50Computing Group Average Similarity
- Always maintain the sum of vectors in each cluster.
- Compute the similarity of clusters in constant time (see the reconstruction below)
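The formula image from this slide did not survive the transcript; the following is a reconstruction of the standard constant-time computation (following IIR Ch. 17), assuming unit-length document vectors, writing s(ω) = Σ_{d∈ω} d for the maintained vector sum and N_i for the size of cluster ω_i:

sim-GA(ω_i, ω_j) = [ (s(ω_i) + s(ω_j)) · (s(ω_i) + s(ω_j)) − (N_i + N_j) ] / [ (N_i + N_j)(N_i + N_j − 1) ]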
51What Is A Good Clustering?
- Internal criterion: a good clustering will produce high-quality clusters in which
- the intra-cluster similarity is high
- the inter-cluster similarity is low
- The measured quality of a clustering depends on both the document representation and the similarity measure used
52External criteria for clustering quality
- Quality is measured by the ability to discover some or all of the hidden patterns or latent classes in gold-standard data
- Assesses a clustering with respect to ground truth
- Assume documents with C gold-standard classes, while our clustering algorithm produces K clusters ω_1, ω_2, ..., ω_K, with n_i members each.
53External Evaluation of Cluster Quality
- Simple measure: purity, the ratio between the size of the dominant (gold) class in cluster ω_i and the size of cluster ω_i
- Others are the entropy of classes in clusters (or the mutual information between classes and clusters)
54Purity example
[Figure: three clusters of 6, 6, and 5 documents drawn from three gold classes]
Cluster I: Purity = 1/6 · max(5, 1, 0) = 5/6
Cluster II: Purity = 1/6 · max(1, 4, 1) = 4/6
Cluster III: Purity = 1/5 · max(2, 0, 3) = 3/5
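A minimal sketch of the purity computation on a cluster-by-class contingency table; the table below re-encodes the three clusters above, and the function name is my own:

import numpy as np

def purity(counts):
    # counts[i][j] = number of docs of gold class j that land in cluster i
    counts = np.asarray(counts)
    return counts.max(axis=1).sum() / counts.sum()

table = [[5, 1, 0],    # Cluster I
         [1, 4, 1],    # Cluster II
         [2, 0, 3]]    # Cluster III
print(purity(table))   # (5 + 4 + 3) / 17 ≈ 0.71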
55Rand Index
- Consider all pairs of documents: a pair is a true positive (TP) if it is in the same cluster and the same class, and a true negative (TN) if it is in different clusters and different classes; RI = (TP + TN) / (TP + FP + FN + TN)
56Rand index: symmetric version
Similar to precision and recall
57Rand Index example: 0.68
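A minimal sketch of the Rand index as pairwise agreement between the clustering and the gold classes (the function name and the example labels are invented for illustration, not the data behind the 0.68 above):

from itertools import combinations

def rand_index(cluster_labels, class_labels):
    # fraction of document pairs on which clustering and gold classes agree
    # (same cluster & same class, or different cluster & different class)
    agree = total = 0
    for (c1, g1), (c2, g2) in combinations(zip(cluster_labels, class_labels), 2):
        total += 1
        agree += (c1 == c2) == (g1 == g2)
    return agree / total

# e.g., rand_index([1, 1, 2, 2], ["x", "x", "x", "o"])  -> 3/6 = 0.5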
58Using collocation clusters in teaching academic writing
Source: [Chinese-language reference, garbled in transcript], 2007
59Example: collocation clustering using translation-based similarity
- Input: V-N collocations with "method"
- Output (Chinese verbs garbled in transcript):
- [Chinese verb cluster] present, advocate, introduce, propose, recommend
- [Chinese verb cluster] study, analyze, consider, investigate + method
- [Chinese verb cluster] employ, utilize, use, adopt, implement, impose, apply, execute, perform
- [Chinese verb cluster] limit, restrict, dominate, manage, manipulate, control
- [Chinese verb cluster] develop, expand, build, augment, emphasize, improve, extend, adjust
- Vs. manual approach:
- [verb group] adopt, apply, employ, utilize + method
- [verb group] come up with, develop, formulate, work out + method
- [verb group] design, devise + method
- [verb group] propose, recommend + method
- [verb group] abandon, scrap, give up + method
- [verb group] introduce + method; [verb group] test + method; [verb group] try + method
60Integrating an ESP corpus with a parallel corpus to cluster collocations
- ESP Corpus: ACL Anthology
- ACL: Association for Computational Linguistics
- NLP handling: tagging, chunking
- Collocation extraction: VN, AN, NV, RV, RA, VPN, VNP
- Translation-based clustering (e.g., VN)
- Find corpus translations TV1 and TV2 of V1 and V2
- Sim(V1, V2) = overlap of TV1 and TV2, i.e., Dice(V1, V2) (sketched below)
- Collocation-based clustering (e.g., VN)
- Find collocates CV1 and CV2 for V1 and V2
- Sim(V1, V2) = overlap of CV1 and CV2, i.e., Dice(V1, V2)
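A minimal sketch of the Dice overlap used as the similarity here (set-based; the example translation sets are invented for illustration):

def dice(a, b):
    # Dice coefficient: 2 * |a ∩ b| / (|a| + |b|)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

# similarity of two verbs via their sets of corpus translations
tv1 = {"use", "adopt", "apply"}
tv2 = {"adopt", "apply", "employ"}
print(dice(tv1, tv2))   # 2*2 / (3+3) ≈ 0.67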
61Translations from Hong Kong Parallel Texts
- Input: V-N collocations with "method"
- Output (Chinese verbs garbled in transcript):
- [Chinese verb] develop, recommend, survey, learn, study, analyze, consider, find, investigate
- [Chinese verb] create, cause, add, result, impose, arise, mean, represent, generate
- [Chinese verb] conduct, develop, expand, activate, start, execute, supervise
- [Chinese verb] process, start, execute, coordinate, test, perform, investigate, implement
- [Chinese verb] create, cause, result, impose, arise, emerge, undermine, generate
- [Chinese verb] recommend, suggest, apply, present, express, justify, name, advocate
- [Chinese verb] accept, consider, find, view, feel, think, hope
- [Chinese verb] solve, eliminate, help, attempt, find, remove, address
- [Chinese verb] suggest, discover, consider, find, view, feel, think
- [Chinese verb] process, manage, approach, resolve, solve, address, cover
62- [Chinese verb] develop, highlight, exploit, build, utilize, emphasize, perform
- [Chinese verb] limit, contain, restrict, dominate, manage, manipulate, control
- [Chinese verb] employ, arrange, utilize, do, use, adopt, implement
- [Chinese verb] develop, expand, build, augment, emphasize, improve, devote
- [Chinese verb] inherit, pursue, evolve, keep, go, proceed
- [Chinese verb] develop, expand, help, advocate, implement, generate
- [Chinese verb] employ, arrange, utilize, use, adopt, implement
- [Chinese verb] handle, manage, adjust, match, address, cover
- [Chinese verb] introduce, impose, apply, justify, adopt, implement
- [Chinese verb] extend, adjust, augment, provide, improve, generate
http://tera.zibox.cc/tpb/
63Resources
- IIR (Manning, Raghavan & Schütze, Introduction to Information Retrieval) Chapter 16, except 16.5
- IIR Chapter 17, except 17.3