Title: Text Clustering
1 Text Clustering
2 Review
- Definition
  - The category of x: c(x) ∈ C
- K-Nearest Neighbor
- Naïve Bayes
  - Bayesian methods
  - Bernoulli NB classifier
  - Multinomial NB classifier
- Categorization evaluation
  - Training data / test data
  - Over-fitting vs. generalization
3 How to Evaluate?
- Test data set
  - Must be held out, disjoint from the training set
- Measures (definitions below)
  - Precision
  - Recall
  - Accuracy (correct rate)
  - F1
  - Micro/macro average
- All are computed from a contingency table of the predicted class vs. the actual class.
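As a reminder, the standard definitions in terms of the contingency-table counts:

\[ P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad \text{Accuracy} = \frac{TP + TN}{N}, \qquad F_1 = \frac{2PR}{P + R} \]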
4 Exercise
- The Federalist Papers
  - In 1787-1788, Hamilton, Jay, and Madison anonymously published 77 essays in New York newspapers urging ratification of the US Constitution.
  - The authorship of 12 of the papers is disputed.
  - Who wrote them?
5 Author Identification
- In 1964 Mosteller and Wallace solved the problem
  - Mosteller, Frederick and Wallace, David L. 1964. Inference and Disputed Authorship: The Federalist.
- It is a text categorization problem
- They identified 70 function words as good candidates for authorship analysis
- Using statistical inference they concluded the author was Madison
6 Function Words for Author Identification
7 Function Words for Author Identification
8 Today's Topic
- Document clustering
  - Motivations
  - Clustering algorithms
    - Partitional
    - Hierarchical
  - Evaluation
9 What's Clustering?
10 What Is Clustering?
- Clustering: the process of grouping a set of objects into classes of similar objects
- The commonest form of unsupervised learning
- Unsupervised learning: learning from raw data, as opposed to supervised learning, where a classification of examples is given
- A common and important task that finds many applications in IR and other places
11 Clustering: Internal Criterion
- High intra-cluster similarity
- Low inter-cluster similarity
(Figure: a scatter of points; how many clusters?)
12 Issues for Clustering
- Representation for clustering
  - Document representation: vector space or language model?
  - Similarity/distance measure: cosine similarity or KL distance? (formulas below)
- How many clusters?
  - Fixed a priori?
  - Completely data driven?
  - Avoid trivial clusters: too large or too small
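For reference, the two options named above (standard definitions):

\[ \cos(\vec d_i, \vec d_j) = \frac{\vec d_i \cdot \vec d_j}{|\vec d_i|\,|\vec d_j|}, \qquad D_{\mathrm{KL}}(p \,\|\, q) = \sum_{w} p(w) \log \frac{p(w)}{q(w)} \]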
13 Clustering Algorithms
- Hard clustering algorithms
  - compute a hard assignment: each document is a member of exactly one cluster
- Soft clustering algorithms
  - the assignment is soft: a document's assignment is a distribution over all clusters
14 Clustering Algorithms
- Flat algorithms
  - Create a cluster set without explicit structure
  - Usually start with a random (partial) partitioning and refine it iteratively
  - K-means clustering
  - Model-based clustering
- Hierarchical algorithms
  - Bottom-up, agglomerative
  - Top-down, divisive
16 Evaluation
17 Think About It
- Can we evaluate simply by high internal criterion scores?
  - i.e., an objective function rewarding high intra-cluster similarity and low inter-cluster similarity
- Internal judgment alone is not enough: what ultimately matters is performance in the application, as judged by the user.
18 External Criteria for Clustering Quality
- Evaluate against ground truth.
- Assume documents come with C gold-standard classes, while our clustering algorithm produces K clusters ω1, ω2, ..., ωK with n_i members each.
- A simple measure: purity. For each cluster, take the count of the dominant class c_i in the cluster, relative to the size of cluster ω_k (formula below).
- Ω = {ω1, ω2, ..., ωK} is the set of clusters and C = {c1, c2, ..., cJ} is the set of classes.
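As a formula (IIR Ch. 16), where N is the total number of documents:

\[ \text{purity}(\Omega, C) = \frac{1}{N} \sum_{k=1}^{K} \max_{j} \left| \omega_k \cap c_j \right| \]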
19 Purity Example
Three clusters over 17 documents of classes x, o, and ◆:
- Cluster I: 5 x, 1 o
- Cluster II: 1 x, 4 o, 1 ◆
- Cluster III: 2 x, 3 ◆

Cluster I purity: 1/6 · max(5, 1, 0) = 5/6
Cluster II purity: 1/6 · max(1, 4, 1) = 4/6
Cluster III purity: 1/5 · max(2, 0, 3) = 3/5
Total purity: 1/17 · (5 + 4 + 3) = 12/17
20 Rand Index
- View clustering as a series of decisions, one for each of the N(N - 1)/2 pairs of documents in the collection:
  - A true positive (TP) decision assigns two similar documents to the same cluster.
  - A true negative (TN) decision assigns two dissimilar documents to different clusters.
  - A false positive (FP) decision assigns two dissimilar documents to the same cluster.
  - A false negative (FN) decision assigns two similar documents to different clusters.
21 Rand Index

                      Same cluster    Different clusters
  Same class          TP              FN
  Different classes   FP              TN

\[ RI = \frac{TP + TN}{TP + FP + FN + TN} \]
22 Rand Index Example
Same three clusters as in the purity example:
- Cluster I: 5 x, 1 o
- Cluster II: 1 x, 4 o, 1 ◆
- Cluster III: 2 x, 3 ◆
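Working through the decisions (the counts follow from the cluster compositions above):

\[ TP + FP = \binom{6}{2} + \binom{6}{2} + \binom{5}{2} = 40, \qquad TP = \binom{5}{2} + \binom{4}{2} + \binom{3}{2} + \binom{2}{2} = 20 \]

so FP = 20. Counting same-class pairs split across clusters gives FN = 24, and since there are \(\binom{17}{2} = 136\) pairs in all, TN = 136 - 20 - 20 - 24 = 72. Hence

\[ RI = \frac{TP + TN}{TP + FP + FN + TN} = \frac{20 + 72}{136} \approx 0.68 \]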
23 K-Means Algorithm
24 Partitioning Algorithms
- Given
  - a set of documents D and the number K
- Find
  - a partition of D into K clusters that optimizes the chosen partitioning criterion
- Globally optimal: exhaustively enumerate all partitions (intractable)
- Effective heuristic method: the K-means algorithm
- Partitioning criterion: residual sum of squares (RSS), defined below
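In symbols, with μ(ω) the centroid of cluster ω:

\[ \text{RSS} = \sum_{k=1}^{K} \sum_{\vec x \in \omega_k} \left| \vec x - \vec\mu(\omega_k) \right|^2, \qquad \vec\mu(\omega) = \frac{1}{|\omega|} \sum_{\vec x \in \omega} \vec x \]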
25 K-Means
- Assume documents are real-valued vectors.
- Each cluster is represented by the centroid (aka the center of gravity or mean) of its members.
- Iterate: reassign each instance to the cluster with the nearest centroid, then recompute each centroid.
26 K-Means Example (K = 2)
(Figure: points are repeatedly reassigned to the nearer of the two centroids and the centroids recomputed, until converged.)
27 K-Means Algorithm
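The algorithm box itself did not survive the transcript. Below is a minimal runnable sketch in Python/NumPy of the two alternating steps discussed on the surrounding slides, reassignment and recomputation; the function and variable names are mine, not from the slides:

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means sketch: X is an (N, M) array of document vectors."""
    rng = np.random.default_rng(seed)
    # Seed choice: pick K distinct documents as the initial centroids.
    centroids = X[rng.choice(len(X), size=K, replace=False)].copy()
    assign = None
    for _ in range(n_iters):
        # Reassignment: send each doc to its nearest centroid (O(KNM)).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break  # clusters did not change: converged
        assign = new_assign
        # Recomputation: each centroid becomes the mean of its members (O(NM)).
        for k in range(K):
            members = X[assign == k]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[k] = members.mean(axis=0)
    return assign, centroids

# Toy usage: two obvious groups in the plane.
X = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]])
print(kmeans(X, K=2))
```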
28 Convergence
- Why does K-means converge?
  - Convergence: a state in which the clusters don't change.
- Reassignment monotonically decreases RSS, since each vector is moved to its closest centroid.
- Recomputation monotonically decreases each RSS_k (m_k is the number of members in cluster k): taking the derivative of RSS_k with respect to a candidate center \(\vec a\) and setting it to zero,

\[ \sum_{\vec x \in \omega_k} 2(\vec a - \vec x) = 0 \;\Rightarrow\; m_k \vec a = \sum_{\vec x \in \omega_k} \vec x \;\Rightarrow\; \vec a = \frac{1}{m_k} \sum_{\vec x \in \omega_k} \vec x \]

so the minimizer of RSS_k is exactly the centroid.
29 Convergence: Global Minimum?
- There is unfortunately no guarantee that a global minimum of the objective function will be reached.
(Figure: a configuration with an outlier, where K-means settles into a sub-optimal local optimum.)
30 Seed Choice
- Results can vary based on the random selection of seeds.
  - Some seeds can result in poor convergence, or convergence to sub-optimal clusterings.
- Remedies:
  - Choose seeds with a heuristic (e.g., the doc least similar to any existing mean)
  - Try out multiple starting points
  - Initialize with the results of another clustering method (e.g., on a sample)
(Figure: six points A-F. If you start with B and E as centroids, you converge to {A,B,C} and {D,E,F}; if you start with D and F, you converge to {A,B,D,E} and {C,F}.)
31 How Many Clusters?
- How do we choose the number K?
- Trade-off: more clusters give each cluster a tighter focus, but too many clusters (e.g., higher navigation cost) are not useful either.
- One approach:
  - Define the Benefit of a doc as the cosine similarity to its cluster centroid; the Total Benefit is the sum of the benefits over all docs.
  - Define a Cost for each cluster.
  - Define the Value of a clustering as Total Benefit - Total Cost.
  - Over all choices of K, pick the clustering with the highest Value.
32 Is K-Means Efficient?
- Time complexity
  - Computing the distance between two docs is O(M), where M is the dimensionality of the vectors.
  - Reassigning clusters: O(KN) distance computations, i.e. O(KNM).
  - Computing centroids: each doc is added once to some centroid: O(NM).
  - Assuming each step is done once per iteration, for I iterations: O(IKNM).
- M is large
  - A document is a sparse vector, but a centroid is not.
  - K-medoids algorithms: use the element closest to the center as "the medoid".
33 Efficiency: Medoid as Cluster Representative
- Medoid: use a document within the cluster as the cluster representative (sketch below)
  - e.g., the document closest to the centroid
- One reason this is useful
  - Consider the representative of a large cluster (>1000 documents)
  - The centroid of this cluster will be a dense vector
  - The medoid of this cluster will be a sparse vector
- Analogy:
  - mean vs. median
  - centroid vs. medoid
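A minimal sketch of the medoid idea, assuming a cluster is given as an array of document vectors (names are illustrative):

```python
import numpy as np

def medoid(members):
    """Return the member vector closest to the cluster centroid.

    members: (n, M) array of the cluster's document vectors.
    Unlike the dense mean vector, the medoid stays as sparse as
    any real document.
    """
    centroid = members.mean(axis=0)
    dists = ((members - centroid) ** 2).sum(axis=1)
    return members[dists.argmin()]
```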
34 Hierarchical Clustering Algorithms
35 Hierarchical Agglomerative Clustering (HAC)
- Assume a similarity function that determines the similarity of two instances.
- Algorithm:
  - Start with each instance in its own cluster
  - Repeatedly merge the two most similar clusters
  - Until there is only one cluster
- The history of merging forms a binary tree, i.e. a hierarchy: a dendrogram.
36 Dendrogram: Document Example
- As clusters agglomerate, docs are likely to fall into a hierarchy of topics or concepts.
(Figure: a dendrogram over documents d1-d5, with d1 and d2 merged first.)
37 HAC Algorithm, Pseudo-code
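The pseudo-code itself did not survive the transcript. Here is a minimal naive sketch in Python/NumPy, using single link as the cluster-pair similarity (names are mine); it is deliberately O(n³) for clarity, not efficiency:

```python
import numpy as np

def hac_single_link(X, num_clusters):
    """Naive single-link HAC sketch: merge the two closest clusters
    until only `num_clusters` remain."""
    clusters = [[i] for i in range(len(X))]   # start: one doc per cluster
    while len(clusters) > num_clusters:
        best, best_d = (0, 1), np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single link: distance between the closest members.
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best_d:
                    best_d, best = d, (a, b)
        a, b = best
        clusters[a].extend(clusters[b])       # merge cluster b into a
        del clusters[b]
    return clusters

# Toy usage: three natural groups.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [9.0, 0.0]])
print(hac_single_link(X, num_clusters=3))
```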
38 Hierarchical Clustering Algorithms
- Agglomerative (bottom-up)
  - Start with each document as a singleton cluster.
  - Eventually all documents belong to the same cluster.
- Divisive (top-down)
  - Start with all documents in one cluster.
  - Eventually each document forms a cluster of its own.
- Does not require the number of clusters k to be fixed in advance.
39 Key Notion: Cluster Representative
- How do we compute the similarity between clusters?
- We need a single point that stands for each cluster (the cluster representation).
- The representative should be some sort of typical or central point in the cluster, e.g.:
  - the point inducing the smallest radius over docs in the cluster
  - the point with the smallest squared distances, etc.
  - the point that is the average of all docs in the cluster
    - i.e., the centroid or center of gravity
40 Closest Pair of Clusters
- Center of gravity
  - Clusters whose centroids (centers of gravity) are the most cosine-similar
- Average-link
  - Average cosine similarity over all pairs of elements
- Single-link
  - Similarity of the closest pair (the most cosine-similar)
- Complete-link
  - Similarity of the furthest pair (the least cosine-similar)
41 Single-Link Example
(Figure: single link suffers from chaining.)
42 Complete-Link Example
(Figure: complete link is affected by outliers.)
43 Computational Complexity
- In the first iteration, HAC computes the similarity of all pairs: O(n²).
- In each of the subsequent n - 2 merging iterations, it must compute the similarity between the most recently created cluster and all other existing clusters.
- Maintaining the similarities
  - To achieve overall O(n²) performance, computing the similarity to each other cluster must be done in constant time.
  - Otherwise: O(n² log n) or O(n³).
44 Centroid Agglomerative Clustering
(Figure: example with n = 6 documents d1-d6 and k = 3; at each step the closest pair of centroids is merged.)
45 Group-Average Agglomerative Clustering
- Use the average similarity of all pairs within the merged cluster.
- A compromise between single-link and complete-link.
- Assume the vectors are length-normalized.
- For efficiency, maintain the sum of the vectors in each cluster (see the identity below).
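Why maintain the sum of vectors? With length-normalized vectors, writing \(\vec s(\omega) = \sum_{\vec d \in \omega} \vec d\) and \(N_i = |\omega_i|\), the group-average similarity of a merge can be computed in constant time; this is the identity used in IIR Ch. 17:

\[ \text{SIM-GA}(\omega_i, \omega_j) = \frac{1}{(N_i + N_j)(N_i + N_j - 1)} \left[ \left| \vec s(\omega_i) + \vec s(\omega_j) \right|^2 - (N_i + N_j) \right] \]

The subtracted term removes the self-pairs, which each contribute similarity 1.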
46 Exercise
- Consider running agglomerative clustering on n points on a line. Explain how you could avoid n³ distance computations. How many does your scheme use?
47 Efficiency: Using Approximations
- In the standard algorithm, the closest pair of centroids must be found at each step.
- Approximation: instead, find a nearly closest pair.
  - Simplistic example: maintain the closest pair based on distances in a projection onto a random line.
(Figure: points projected onto a random line.)
48 Applications in IR
49 Navigating Document Collections
- Information Retrieval: like a book index
- Document clusters: like a table of contents

Table of Contents
1. Science of Cognition
  1.a. Motivations
    1.a.i. Intellectual Curiosity
    1.a.ii. Practical Applications
  1.b. History of Cognitive Psychology
2. The Neural Basis of Cognition
  2.a. The Nervous System
  2.b. Organization of the Brain
  2.c. The Visual System
3. Perception and Attention
  3.a. Sensory Memory
  3.b. Attention and Sensory Information Processing

Index
Aardvark, 15
Blueberry, 200
Capricorn, 1, 45-55
Dog, 79-99
Egypt, 65
Falafel, 78-90
Giraffes, 45-59
50 Scatter/Gather (Cutting, Karger, and Pedersen)
51 For Better Navigation of Search Results
52 Vivisimo Search Engine
53 Navigating Search Results (2)
- Group documents by the sense of a word.
- For an ambiguous query (say Jaguar, or NLP), cluster the results into sense groups.
- This is related to word sense disambiguation.
55 For Speeding Up Vector Space Retrieval
- In VSM retrieval, we must find the doc vectors closest to the query vector.
- Computing the similarity of the query to every doc is slow (for some applications).
- An existing optimization: with an inverted index, only docs sharing at least one term with the query are considered.
- Alternative: cluster the docs in the corpus a priori.
  - At query time, compute similarities only against docs in the cluster(s) closest to the query (see the sketch below).
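A minimal sketch of this idea; all names are illustrative, and the offline step could equally reuse the K-means sketch from earlier:

```python
import numpy as np

def build(X, K, seed=0):
    """Offline: assign every doc to its nearest 'leader' (here, K random docs)."""
    rng = np.random.default_rng(seed)
    leaders = X[rng.choice(len(X), size=K, replace=False)]
    follows = ((X[:, None] - leaders[None]) ** 2).sum(-1).argmin(1)
    return leaders, follows

def search(q, X, leaders, follows, topk=3):
    """Online: only score docs in the cluster whose leader is nearest to q."""
    k = ((leaders - q) ** 2).sum(-1).argmin()
    cand = np.flatnonzero(follows == k)
    scores = X[cand] @ q            # dot product = cosine for unit vectors
    return cand[np.argsort(-scores)[:topk]]
```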
56 Resources
- Weka 3: Data Mining with Open Source Machine Learning Software in Java
57 Summary
- Text clustering
- Evaluation
  - Purity, NMI, Rand Index
- Partitional algorithms
  - K-means
    - Reassignment
    - Recomputation
- Hierarchical algorithms
  - Cluster representation
  - Closeness measures for cluster pairs
    - Single link
    - Complete link
    - Average link
    - Centroid
58 Readings
- 1. IIR, Ch. 16.1-16.4 and Ch. 17.1-17.4
- 2. F. Beil, M. Ester, and X. Xu, "Frequent term-based text clustering," in Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Edmonton, Alberta, Canada: ACM, 2002.
59 Thank You!
60 Cluster Labeling
61 Major Issue: Labeling
- After a clustering algorithm finds clusters, how can they be made useful to the end user?
- We need a pithy label for each cluster.
  - In search results, say "Animal" or "Car" in the jaguar example.
  - In topic trees (Yahoo-style), we need navigational cues.
- Often done by hand, a posteriori.
62 How to Label Clusters
- Show titles of typical documents
  - Titles are easy to scan.
  - Authors create them for quick scanning!
  - But you can only show a few titles, which may not fully represent the cluster.
- Show words/phrases prominent in the cluster
  - More likely to fully represent the cluster.
  - Use distinguishing words/phrases
    - Differential labeling (think feature selection)
  - But harder to scan.
63 Labeling
- Common heuristic: list the 5-10 most frequent terms in the centroid vector (see the sketch below).
  - Drop stop-words; stem.
- Differential labeling by frequent terms
  - Within a collection "Computers", all clusters have the word "computer" as a frequent term.
  - Discriminant analysis of centroids.
- Perhaps better: a distinctive noun phrase.
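A minimal sketch of the frequent-terms heuristic, assuming tf-idf document vectors and an aligned vocabulary list (the names and the tiny stop-word set are illustrative):

```python
import numpy as np

def label_cluster(members, vocab, topn=5, stopwords=frozenset({"the", "of"})):
    """Label a cluster with the heaviest terms of its centroid.

    members: (n, M) array of tf-idf document vectors
    vocab:   list of M terms aligned with the vector dimensions
    """
    centroid = members.mean(axis=0)
    order = np.argsort(-centroid)            # heaviest dimensions first
    terms = [vocab[i] for i in order if vocab[i] not in stopwords]
    return terms[:topn]
```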