Document Clustering

Transcript and Presenter's Notes
1
CS 6633 Information Retrieval and Web Search
  • Lecture 9
  • Document Clustering

Based on ppt files by Hinrich Schütze, Ray
Mooney, and Soumen Chakrabarti
2
Today's Topic: Clustering
  • Document clustering
  • Motivations
  • Document representations
  • Success criteria
  • Clustering algorithms
  • Partitional
  • Hierarchical

3
What is clustering?
  • Clustering: the process of grouping a set of
    objects into classes of similar objects
  • The most common form of unsupervised learning
  • Unsupervised learning: learning from raw data,
    as opposed to supervised learning, where a
    classification of examples is given
  • A common and important task with many
    applications in IR and beyond

4
Why cluster documents?
  • Whole corpus analysis/navigation
  • Better user interface
  • For improving recall in search applications
  • Better search results
  • For better navigation of search results
  • Effective user recall will be higher
  • For speeding up vector space retrieval
  • Retrieve clusters and then documents
  • Faster search

5
Google News presents news clusters
6
Yahoo! Hierarchy
www.yahoo.com/Science
[Figure: a manually built topic tree rooted at Science (30 subtopics),
with children such as agriculture, biology, physics, CS, and space,
and grandchildren such as dairy, crops, agronomy, forestry; botany,
cell, evolution; magnetism, relativity; AI, HCI, courses; craft, missions]
7
Scatter/Gather Clustering
  • Developed at PARC in the late 80s/early 90s
  • Based on two novel clustering algorithms
  • Buckshot: fast, for online clustering
  • Fractionation: accurate, for offline initial
    clustering of the entire collection
  • Top-down approach
  • Start with k seeds (documents) to represent k
    clusters
  • Each document is assigned to the cluster with the
    most similar seed

Pedersen, Cutting, Karger, Tukey, "Scatter/Gather:
A Cluster-based Approach to Browsing Large
Document Collections", SIGIR 1992
8
Fractionation
Invented by Cutting, Karger, Pedersen, and Tukey
for nonparametric clustering of large datasets.
Source: Jeremy Tantrum, www.stat.washington.edu/wxs/Stat593-s03/Slides/jeremyKDDtalk.ppt
9
Scatter/Gather
Pedersen, Cutting, Karger, Tukey, "Scatter/Gather:
A Cluster-based Approach to Browsing Large
Document Collections", SIGIR 1992
10
The Scatter/Gather Interface
11
Two Queries, Two Clusterings
Query 1: AUTO, CAR, ELECTRIC        Query 2: AUTO, CAR, SAFETY
[Figure: cluster sizes and top keywords for the two result clusterings,
e.g. battery/california/technology; import/honda/toyota/japan;
control/drive/accident; investigation/washington; study/fuel/death/air bag;
sale/domestic/truck/import; japan/export/defect]
The main differences are the clusters that are
central to the query
12
Scatter/Gather Evaluations
  • Good part
  • The clusters do group relevant documents together
  • Participants found it useful for eliminating
    irrelevant groups
  • Bad part
  • Difficult to understand the clusters
  • Results are not consistent
  • Can be slower to find answers

13
(No Transcript)
14
(No Transcript)
15
Visualizing Clustering Results
  • Use clustering to map the entire huge
    multidimensional document space into a large
    number of small clusters.
  • Use dimension reduction to project the clusters
    onto a 2D/3D graphical representation (a small
    sketch follows below)
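
A minimal sketch of this pipeline, assuming scikit-learn is available (TfidfVectorizer, KMeans, and TruncatedSVD are standard scikit-learn classes; the toy documents are made up, and this is not the PNNL/ThemeScapes system itself):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD

docs = ["solar panels and batteries", "car safety airbags",
        "electric car battery technology", "airbag recall investigation"]

X = TfidfVectorizer().fit_transform(docs)                  # docs -> tf-idf vectors
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
coords = TruncatedSVD(n_components=2).fit_transform(X)     # reduce to 2D for plotting

for doc, lab, (x, y) in zip(docs, labels, coords):
    print(f"cluster {lab}  ({x:+.2f}, {y:+.2f})  {doc}")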

16
For visualizing a document collection and its
themes
  • Wise et al., "Visualizing the Non-Visual" (PNNL)
  • ThemeScapes, Cartia
  • Mountain height represents cluster size

17
For improving search recall
  • Cluster hypothesis: documents with similar text
    are related
  • Therefore, to improve search recall:
  • Cluster docs in corpus in advance
  • When a query matches a doc D, also return other
    docs in the cluster containing D (a toy sketch
    follows below)
  • Hope: if we do this, the query "car" will also
    return docs containing "automobile"
  • Because clustering grouped together docs
    containing "car" with those containing "automobile".
  • Why might this happen?
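
A toy sketch of the expansion step (the doc IDs and cluster assignments below are hypothetical; the offline clustering is assumed to have been done already):

# Expand a result list with the other docs from each hit's cluster.
doc_cluster = {"d1": 0, "d2": 0, "d3": 1, "d4": 1, "d5": 1}   # precomputed offline

def expand_with_clusters(hits, doc_cluster):
    """Return the hits plus every other doc in a hit's cluster."""
    hit_clusters = {doc_cluster[d] for d in hits}
    return [d for d, c in doc_cluster.items() if c in hit_clusters]

print(expand_with_clusters(["d3"], doc_cluster))   # ['d3', 'd4', 'd5']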

18
For better navigation of search results
  • For grouping search results thematically
  • clusty.com / Vivisimo

19
Issues for clustering
  • Representation for clustering
  • Document representation
  • Vector space? Normalization?
  • Need a notion of similarity/distance
  • How many clusters?
  • Fixed a priori?
  • Completely data driven?
  • Avoid trivial clusters - too large or small
  • In an application, if a cluster's too large, then
    for navigation purposes you've wasted an extra
    user click without whittling down the set of
    documents much.

20
What makes docs related?
  • Ideal: semantic similarity
  • Practical: statistical similarity
  • cosine similarity is a good choice
  • Docs as vectors
  • For many algorithms, easier to think in terms of
    a distance (rather than similarity) between docs
    (see the small sketch after this list)
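
A small numpy sketch of cosine similarity between two toy term-weight vectors, and one common way of turning it into a distance (the vectors here are made up):

import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two document vectors."""
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

d1 = np.array([1.0, 2.0, 0.0])    # toy term-weight vectors
d2 = np.array([2.0, 1.0, 1.0])
sim = cosine_sim(d1, d2)
dist = 1.0 - sim                  # similarity turned into a distance
print(round(sim, 3), round(dist, 3))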

21
Clustering Algorithms
  • Partitional algorithms
  • Usually start with a random (partial)
    partitioning
  • Refine it iteratively
  • K means clustering
  • Model based clustering
  • Hierarchical algorithms
  • Bottom-up, agglomerative
  • Top-down, divisive

22
Model based clustering
  • In model-based clustering it is assumed that the
    data is generated by a mixture of Gaussian
    models
  • Observations are sampled from a mixture density
    p(x) = Σg πg pg(x)
  • Use training data to fit a mixture of
    Gaussians
  • Use the EM algorithm to maximize the log
    likelihood (see the sketch below)

Source: Jeremy Tantrum, www.stat.washington.edu/wxs/Stat593-s03/Slides/jeremyKDDtalk.ppt
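
A minimal sketch of fitting such a mixture with EM, assuming scikit-learn (GaussianMixture is scikit-learn's implementation, not the tool used in the original slides; the data below is synthetic):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy data drawn from two Gaussians (stand-ins for document feature vectors).
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(5.0, 1.0, size=(100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)   # EM under the hood
print(gmm.weights_)        # mixing proportions pi_g
print(gmm.means_)          # component means
labels = gmm.predict(X)    # hard cluster assignments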
23
Model Based Clustering
Fitting: estimate πg and the parameters of pg(x)
Source: Jeremy Tantrum, www.stat.washington.edu/wxs/Stat593-s03/Slides/jeremyKDDtalk.ppt
24
Model Based Clustering
  • Hierarchical Clustering
  • Provides a good starting point for the EM algorithm
  • Start with every point being its own cluster
  • Merge the two closest clusters
  • Measured by the decrease in likelihood when those
    two clusters are merged
  • Uses the classification likelihood, not the
    mixture likelihood
  • Algorithm is quadratic in the number of
    observations

Source: Jeremy Tantrum, www.stat.washington.edu/wxs/Stat593-s03/Slides/jeremyKDDtalk.ppt
25
Partitioning Algorithms
  • Partitioning method: construct a partition of n
    documents into a set of K clusters
  • Given a set of documents and the number K
  • Find a partition into K clusters that optimizes
    the chosen partitioning criterion
  • Globally optimal: exhaustively enumerate all
    partitions
  • Effective heuristic methods: K-means and
    K-medoids algorithms

26
K-Means
  • Assumes documents are real-valued vectors.
  • Clusters based on centroids (aka the center of
    gravity or mean) of points in a cluster, c
  • Reassignment of instances to clusters is based on
    distance to the current cluster centroids.
  • (Or one can equivalently phrase it in terms of
    similarities)

27
K-Means Algorithm
Select K random docs s1, s2, ..., sK as seeds.
Until clustering converges (or other stopping criterion):
  For each doc di:
    Assign di to the cluster cj such that dist(di, sj) is minimal.
  (Update the seeds to the centroid of each cluster)
  For each cluster cj, recompute seed sj = μ(cj)
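
A compact numpy sketch of this loop (illustrative only; seeding and empty-cluster handling are simplified, and the toy data is made up):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means: X is an (n_docs, n_dims) array of document vectors."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # K random docs as seeds
    for _ in range(n_iter):
        # Assign each doc to the nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each seed as the centroid (mean) of its cluster.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):   # converged: centroids unchanged
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5]])
labels, centroids = kmeans(X, k=2)
print(labels, centroids)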
28
K-Means Example (K = 2)
Reassign clusters
Converged!
29
Termination conditions
  • Several possibilities, e.g.,
  • A fixed number of iterations.
  • Doc partition unchanged.
  • Centroid positions don't change.
  • Does this mean that the docs in a cluster are
    unchanged?

30
Convergence
  • Why should the K-means algorithm ever reach a
    fixed point?
  • A state in which clusters don't change
  • K-means is a special case of a general procedure
    known as the Expectation Maximization (EM)
    algorithm
  • EM is known to converge
  • Number of iterations could be large

31
Convergence of K-Means
  • Define goodness measure of cluster k as sum of
    squared distances from cluster centroid
  • Gk = Σi (di − ck)²  (sum over all di in cluster
    k)
  • G = Σk Gk
  • Reassignment monotonically decreases G since each
    vector is assigned to the closest centroid.

32
Convergence of K-Means
  • Recomputation monotonically decreases Gk = Σi (di
    − ck)² because
  • Σ (di − a)² reaches its minimum when
  • Σ −2(di − a) = 0, i.e.
  • Σ di = Σ a = mk a
  • a = (1/mk) Σ di = ck
  • where mk is the number of members in cluster k
  • K-means typically converges quickly

33
Time Complexity
  • Computing distance between two docs is O(m) where
    m is the dimensionality of the vectors.
  • Reassigning clusters: O(Kn) distance
    computations, i.e. O(Knm).
  • Computing centroids: each doc gets added once to
    some centroid, O(nm).
  • Assume these two steps are each done once per
    iteration, for I iterations: O(IKnm).

34
Seed Choice
  • Results can vary based on random seed selection.
  • Some seeds can result in poor convergence rate,
    or convergence to sub-optimal clusterings.
  • Select good seeds using a heuristic (e.g., the doc
    least similar to any existing mean; see the sketch
    after this list)
  • Try out multiple starting points
  • Initialize with the results of another method.
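
A small sketch of the "pick the doc least similar to any existing seed" idea, in a farthest-first style (the function and the Euclidean-distance choice are my own illustration, not code from the slides):

import numpy as np

def farthest_first_seeds(X, k, seed=0):
    """Pick K seed docs: start randomly, then repeatedly take the doc
    farthest from (least similar to) every seed chosen so far."""
    rng = np.random.default_rng(seed)
    seeds = [int(rng.integers(len(X)))]
    for _ in range(k - 1):
        # Distance of every doc to its nearest already-chosen seed.
        d = np.min([np.linalg.norm(X - X[s], axis=1) for s in seeds], axis=0)
        seeds.append(int(d.argmax()))
    return X[seeds]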

Example showing sensitivity to seeds
In the above, if you start with B and E as
centroids you converge to {A,B,C} and {D,E,F}. If
you start with D and F you converge to {A,B,D,E}
and {C,F}.
35
How Many Clusters?
  • Number of clusters K is given
  • Partition n docs into predetermined number of
    clusters
  • Finding the right number of clusters is part of
    the problem
  • Given docs, partition into an appropriate
    number of subsets.
  • E.g., for query results - ideal value of K not
    known up front - though UI may impose limits.
  • Can usually take an algorithm for one flavor and
    convert to the other.

36
K not specified in advance
  • Say, the results of a query.
  • Solve an optimization problem: penalize having
    lots of clusters
  • application dependent, e.g., compressed summary
    of search results list.
  • Tradeoff between having more clusters (better
    focus within each cluster) and having too many
    clusters

37
K not specified in advance
  • Given a clustering, define the Benefit for a doc
    to be the cosine similarity to its centroid
  • Define the Total Benefit to be the sum of the
    individual doc Benefits.
  • What is the Total Benefit, if each document is a
    cluster?

38
Penalize lots of clusters
  • For each cluster, we have a Cost C.
  • Thus for a clustering with K clusters, the Total
    Cost is KC.
  • Define the Value of a clustering to be
  • Total Benefit - Total Cost.
  • Find the clustering of highest value, over all
    choices of K.
  • Total benefit increases with increasing K. But we
    can stop when it doesn't increase by much. The
    Cost term enforces this. (A small sketch follows.)
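
A rough sketch of this criterion, reusing the kmeans sketch from slide 27 and a hand-chosen cost per cluster (the helper names and the cost value are assumptions):

import numpy as np

def total_benefit(X, labels, centroids):
    """Sum over docs of the cosine similarity to their cluster centroid."""
    total = 0.0
    for x, c in zip(X, centroids[labels]):
        total += float(x @ c) / (np.linalg.norm(x) * np.linalg.norm(c))
    return total

def best_k(X, k_max, cost_per_cluster):
    """Pick the K whose clustering has the highest Value = Benefit - K * C."""
    best = None
    for k in range(1, k_max + 1):
        labels, centroids = kmeans(X, k)      # kmeans sketch from slide 27
        value = total_benefit(X, labels, centroids) - k * cost_per_cluster
        if best is None or value > best[1]:
            best = (k, value)
    return best   # (best K, its value)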

39
K-means issues, variations, etc.
  • Variations
  • Recomputing the centroid after every assignment
    (improves speed of convergence)
  • Recomputing the centroid after all re-assignments
  • Assumes clusters are spherical in vector space
  • Sensitive to coordinate changes, weighting etc.
  • Soft clusters vs. hard clusters
  • Allowing outliers?

40
Hierarchical Clustering
  • Build a tree-based hierarchical taxonomy
    (dendrogram) from a set of documents.
  • One approach: recursive application of a
    partitional clustering algorithm.

41
Dendrogram: Hierarchical Clustering
  • Cutting the dendrogram yields connected
    components that serve as flat clusters

42
Hierarchical Agglomerative Clustering (HAC)
  • Starts with each doc in a separate cluster
  • then repeatedly joins the closest pair of
    clusters, until there is only one cluster.
  • The history of merging forms a binary tree or
    hierarchy.

43
Closest pair of clusters
  • Many variants of defining the closest pair of
    clusters (a naive sketch of these linkages follows
    below)
  • Single-link
  • Similarity of the most cosine-similar pair of
    docs
  • Complete-link
  • Similarity of the furthest points, the least
    cosine-similar
  • Centroid
  • Clusters whose centroids (centers of gravity) are
    the most cosine-similar
  • Average-link
  • Average cosine between pairs of elements
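
A naive agglomerative clustering sketch supporting the single-, complete-, and average-link definitions above (O(n³), illustrative only; a real implementation would cache similarities):

import numpy as np

def hac(X, linkage="single"):
    """Naive HAC on the row vectors of X.
    Returns the merge history as pairs of member-index lists."""
    clusters = [[i] for i in range(len(X))]
    merges = []

    def sim(a, b):
        # Pairwise cosine similarities between the two clusters' members.
        vals = [float(X[i] @ X[j]) / (np.linalg.norm(X[i]) * np.linalg.norm(X[j]))
                for i in a for j in b]
        if linkage == "single":
            return max(vals)              # most cosine-similar pair
        if linkage == "complete":
            return min(vals)              # least cosine-similar pair
        return sum(vals) / len(vals)      # average link

    while len(clusters) > 1:
        # Find the closest pair of clusters under the chosen linkage.
        a, b = max(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: sim(clusters[ij[0]], clusters[ij[1]]))
        merges.append((clusters[a], clusters[b]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges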

44
Single Link Agglomerative Clustering
  • Use the maximum similarity over pairs:
    sim(ci, cj) = max over x in ci, y in cj of sim(x, y)
  • Can result in long and thin clusters due to a
    chaining effect.
  • After merging ci and cj, the similarity of the
    resulting cluster to another cluster ck is
    sim((ci ∪ cj), ck) = max(sim(ci, ck), sim(cj, ck))
45
Single Link Example
long and thin clusters due to chaining effect.
46
Complete Link Agglomerative Clustering
  • Use the minimum similarity over pairs:
    sim(ci, cj) = min over x in ci, y in cj of sim(x, y)
  • Produces tighter, spherical clusters that are
    typically preferable
  • After merging ci and cj, the similarity of the
    resulting cluster to another cluster ck is
    sim((ci ∪ cj), ck) = min(sim(ci, ck), sim(cj, ck))

47
Complete Link Example
tighter, spherical clusters that are typically
preferable
48
Computational Complexity
  • In the first iteration, all HAC methods need to
    compute the similarity of all pairs of n individual
    instances, which is O(n²).
  • In each of the subsequent n−2 merging iterations,
    compute the distance between the most recently
    created cluster and all other existing clusters.
  • In order to maintain an overall O(n²)
    performance, computing similarity to each other
    cluster must be done in constant time.
  • Else O(n² log n) or O(n³) if done naively

49
Group Average Agglomerative Clustering
  • Similarity of two clusters = average similarity
    of all pairs within the merged cluster.
  • Compromise between single link (max) and complete
    link (min)
  • Averaging options
  • Averaged across all ordered pairs in the merged
    cluster, or only across pairs between the two
    original clusters
  • No clear difference in performance

50
Computing Group Average Similarity
  • Always maintain the sum of vectors in each cluster.
  • Compute the similarity of clusters in constant time
    (see the sketch below)
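
A sketch of the constant-time computation, assuming unit-length document vectors and per-cluster sum vectors (the helper name and the toy vectors are my own; this follows the standard sum-vector trick):

import numpy as np

def group_average_sim(sum_i, n_i, sum_j, n_j):
    """Group-average similarity of the cluster formed by merging clusters i and j.
    sum_i, sum_j: sums of the (unit-length) doc vectors; n_i, n_j: cluster sizes.
    For unit vectors, the sum of sims over distinct ordered pairs is s.s - n."""
    s = sum_i + sum_j
    n = n_i + n_j
    return (float(s @ s) - n) / (n * (n - 1))

# Toy usage with unit-length vectors (hypothetical docs).
docs = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
sum_a, n_a = docs[:2].sum(axis=0), 2     # cluster {d0, d1}
sum_b, n_b = docs[2:].sum(axis=0), 1     # cluster {d2}
print(group_average_sim(sum_a, n_a, sum_b, n_b))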

51
What Is A Good Clustering?
  • Internal criterion: a good clustering will
    produce high-quality clusters in which
  • the intra-cluster similarity is high
  • the inter-cluster similarity is low
  • The measured quality of a clustering depends on
    both the document representation and the
    similarity measure used

52
External criteria for clustering quality
  • Quality measured by its ability to discover some
    or all of the hidden patterns or latent classes
    in gold standard data
  • Assesses a clustering with respect to ground
    truth
  • Assume documents with C gold standard classes,
    while our clustering algorithms produce K
    clusters ω1, ω2, ..., ωK, each ωi with ni members.

53
External Evaluation of Cluster Quality
  • Simple measure: purity, the ratio between the
    size of the dominant (gold) class in cluster ωi
    and the size of cluster ωi
  • Others are entropy of classes in clusters (or
    mutual information between classes and clusters)

54
Purity example
[Figure: three clusters of labeled points —
  Cluster I: 5 ×, 1 ○     Cluster II: 1 ×, 4 ○, 1 ◇     Cluster III: 2 ×, 3 ◇]
Cluster I:   purity = (1/6) · max(5, 1, 0) = 5/6
Cluster II:  purity = (1/6) · max(1, 4, 1) = 4/6
Cluster III: purity = (1/5) · max(2, 0, 3) = 3/5
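
The same computation as a small sketch (the class labels mirror the example above: clusters of sizes 6, 6, and 5 whose dominant classes have 5, 4, and 3 members):

from collections import Counter

def purity(cluster_to_gold_labels):
    """Per-cluster purity: fraction of the cluster taken by its dominant gold class."""
    return {c: Counter(labels).most_common(1)[0][1] / len(labels)
            for c, labels in cluster_to_gold_labels.items()}

example = {
    "I":   ["x"] * 5 + ["o"],              # 5 x, 1 o
    "II":  ["x"] + ["o"] * 4 + ["d"],      # 1 x, 4 o, 1 d
    "III": ["x"] * 2 + ["d"] * 3,          # 2 x, 3 d
}
print(purity(example))   # approximately {'I': 0.833, 'II': 0.667, 'III': 0.6}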
55
Rand Index
56
Rand index (symmetric version)
Similar to precision and recall
57
Rand Index example: 0.68
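
A sketch of the standard Rand index over all document pairs: the fraction of pairs on which the gold classes and the clustering agree (the toy labels below are made up, so the value differs from the 0.68 example on the slide):

from itertools import combinations

def rand_index(gold, predicted):
    """Fraction of doc pairs on which gold classes and clusters agree
    (both put the pair in the same group, or both in different groups)."""
    agree = total = 0
    for i, j in combinations(range(len(gold)), 2):
        same_gold = gold[i] == gold[j]
        same_pred = predicted[i] == predicted[j]
        agree += int(same_gold == same_pred)
        total += 1
    return agree / total

gold      = ["x", "x", "x", "o", "o", "o"]
predicted = [ 0,   0,   1,   1,   1,   1 ]
print(round(rand_index(gold, predicted), 2))   # 0.67 for this toy example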
58
Using collocation clusters in teaching academic
writing
Source: (Chinese-language reference, 2007)
59
Example: collocation clustering using
translation-based similarity
  • Input: V–N collocations with the noun "method"
  • Output
  • [Chinese]: present, advocate, introduce, propose,
    recommend
  • [Chinese]: study, analyze, consider, investigate + method
  • [Chinese]: employ, utilize, use, adopt, implement, impose,
    apply, execute, perform
  • [Chinese]: limit, restrict, dominate, manage, manipulate,
    control
  • [Chinese]: develop, expand, build, augment, emphasize,
    improve, extend, adjust
  • Vs. manual approach
  • [Chinese]: adopt, apply, employ, utilize + method
  • [Chinese]: come up with, develop, formulate, work out
    + method
  • [Chinese]: design, devise + method
  • [Chinese]: propose, recommend + method
  • [Chinese]: abandon, scrap, give up + method
  • [Chinese]: introduce + method; test + method; try + method

60
Integrating ESP corpus with parallel corpus to
cluster collocations
  • ESP corpus: ACL Anthology
  • ACL = Association for Computational Linguistics
  • NLP processing: tagging, chunking
  • Collocation extraction: VN, AN, NV, RV, RA, VPN,
    VNP
  • Translation-based clustering (e.g., VN)
  • Find corpus translations TV1 and TV2 of V1 and V2
  • Sim(V1, V2) = overlap of TV1 and TV2, measured
    with the Dice coefficient
  • Collocation-based clustering (e.g., VN)
  • Find collocates CV1 and CV2 for V1 and V2
  • Sim(V1, V2) = overlap of CV1 and CV2, measured
    with the Dice coefficient
    (a small Dice sketch follows below)
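
A minimal sketch of the Dice coefficient over two such sets (the set contents below are hypothetical placeholders for translation or collocate sets):

def dice(set_a, set_b):
    """Dice coefficient: 2 * |A ∩ B| / (|A| + |B|)."""
    if not set_a and not set_b:
        return 0.0
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))

# Hypothetical translation sets for two verbs that occur with "method".
T_v1 = {"t1", "t2", "t3"}
T_v2 = {"t1", "t4"}
print(dice(T_v1, T_v2))   # 0.4 -> verbs with overlapping translations cluster together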

61
Translations from Hong Kong Parallel Texts
  • Input: V–N collocations with the noun "method"
  • Output (each line: a Chinese verb and its English
    translations)
  • [Chinese]: develop, recommend, survey, learn, study,
    analyze, consider, find, investigate
  • [Chinese]: create, cause, add, result, impose, arise,
    mean, represent, generate
  • [Chinese]: conduct, develop, expand, activate, start,
    execute, supervise
  • [Chinese]: process, start, execute, coordinate, test,
    perform, investigate, implement
  • [Chinese]: create, cause, result, impose, arise, emerge,
    undermine, generate
  • [Chinese]: recommend, suggest, apply, present, express,
    justify, name, advocate
  • [Chinese]: accept, consider, find, view, feel, think, hope
  • [Chinese]: solve, eliminate, help, attempt, find, remove,
    address
  • [Chinese]: suggest, discover, consider, find, view, feel,
    think
  • [Chinese]: process, manage, approach, resolve, solve,
    address, cover

62
  • [Chinese]: develop, highlight, exploit, build, utilize,
    emphasize, perform
  • [Chinese]: limit, contain, restrict, dominate, manage,
    manipulate, control
  • [Chinese]: employ, arrange, utilize, do, using, adopt,
    implement
  • [Chinese]: develop, expand, build, augment, emphasize,
    improve, devote
  • [Chinese]: inherit, pursue, evolve, keep, go, proceed
  • [Chinese]: develop, expand, help, advocate, implement,
    generate
  • [Chinese]: employ, arrange, utilize, using, adopt, implement
  • [Chinese]: handle, manage, adjust, match, address, cover
  • [Chinese]: introduce, impose, apply, justify, adopt,
    implement
  • [Chinese]: extend, adjust, augment, provide, improve,
    generate

http://tera.zibox.cc/tpb/
63
Resources
  • IIR 16 except 16.5
  • IIR 17 except 17.3