Title: SCATTER/GATHER : A CLUSTER BASED APPROACH FOR BROWSING LARGE DOCUMENT COLLECTIONS
1SCATTER/GATHER A CLUSTER BASED APPROACH FOR
BROWSING LARGE DOCUMENT COLLECTIONS
GROUPER A DYNAMIC CLUSTERING INTERFACE TO WEB
SEARCH RESULTS
MINAL PATANKAR MADHURI WUDALI
2DOCUMENT CLUSTERING
- Process of grouping documents with similar contents into a common cluster
3ADVANTAGES OF DOCUMENT CLUSTERING
- If a collection is well clustered, we can search only the cluster that will contain relevant documents
- Clustering also improves browsing through the document collection
4USER INTERFACES
- SCATTER/GATHER: a tool for browsing a document collection
- GROUPER: a tool for searching, on top of a meta search engine
- Clustering: traditional, text-based
- Clustering algorithm: Buckshot and Fractionation (Scatter/Gather), STC (Grouper)
- Similarity: word based (Scatter/Gather), phrase based (Grouper)
5SCATTER /GATHER INTERFACE
6SCATTER /GATHER SESSION
- User is presented with short summaries of a small number of document groups
- User selects one or more groups for further study
- This process continues until the individual document level is reached
7[Diagram labels: Fractionation, Cluster Digest, Buckshot]
8HOW IS SCATTER/GATHER DONE?
- Static offline partitioning phase
  - Fractionation Algorithm
- Online reclustering phase
  - Buckshot Algorithm
    - Step 1: Group average agglomerative clustering
    - Step 2: K-Means
9 Clustering
- Partitional
  - K-Means
- Hierarchical
  - Agglomerative
    - Single link
    - Group Average Link
    - Complete Link
  - Divisive
- Hybrid
  - Buckshot
  - Fractionation
10HIERARCHICAL AGGLOMERATIVE CLUSTERING
- Create NxN doc-doc similarity matrix
- Each document starts as a cluster of size one
- Do until there is only one cluster:
  - combine the two clusters with the greatest similarity
  - update the doc-doc matrix
11Example
      A     B     C     D     E
A     _     2     7     6     4
B     2     _     9    11    14
C     7     9     _     4     8
D     6    11     4     _     2
E     4    14     8     2     _
Clusters before the merge: A, B, C, D, E
Clusters after merging B and E: A, BE, C, D
SC(A,BE) = 4 if we are using single link (take max)
SC(A,BE) = 2 if we are using complete linkage (take min)
SC(A,BE) = 3 if we are using group average (take average)
Note: C - BE is now the highest link
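The three linkage rules for SC(A,BE) can be checked with a few lines of Python. This is a minimal sketch: the function name `cluster_similarity` and the `sim` dictionary are illustrative choices, and only the two entries of the example matrix that involve A, B and E are used.

```python
# Document-document similarities taken from the example matrix
# (higher value = more similar).
sim = {("A", "B"): 2, ("A", "E"): 4}

def cluster_similarity(doc, cluster, linkage):
    """Similarity between one document and a cluster of documents."""
    values = [sim[(doc, member)] for member in cluster]
    if linkage == "single":      # most similar pair
        return max(values)
    if linkage == "complete":    # least similar pair
        return min(values)
    if linkage == "average":     # mean over all pairs
        return sum(values) / len(values)
    raise ValueError(f"unknown linkage: {linkage}")

for linkage in ("single", "complete", "average"):
    print(linkage, cluster_similarity("A", ["B", "E"], linkage))
```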
12Example
       A     BE    C     D
A      _     3     7     6
BE     3     _     8.5   6.5
C      7     8.5   _     4
D      6     6.5   4     _
Clusters before the merge: A, BE, C, D
Clusters after merging BE and C: A, BEC, D
SC(C,B) = 9, SC(C,E) = 8, SC(C,BE) = 8.5
13Example
       A     BEC    D
A      _     5      6
BEC    5     _      5.25
D      6     5.25   _
Clusters before the merge: A, BEC, D
A and D are merged next (similarity 6): BEC, AD
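The whole example can be replayed in code. The sketch below assumes that, on each merge, the row for the new cluster is the average of the two merged rows, which is how the 3, 8.5 and 6.5 entries in the tables appear to be produced. The function name `hac` and the naming of merged clusters are illustrative.

```python
from itertools import combinations

def hac(names, sim):
    """Merge the most similar pair of clusters until one cluster remains.

    sim maps frozenset({x, y}) -> similarity. Assumption: the row of a
    merged cluster is the average of the two rows it replaces.
    """
    clusters = list(names)
    sim = dict(sim)
    merges = []
    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda p: sim[frozenset(p)])
        # Cosmetic only: put the longer name first so that BE + C reads "BEC".
        first, second = sorted((a, b), key=len, reverse=True)
        merged = first + second
        merges.append((first, second, sim[frozenset((a, b))]))
        rest = [c for c in clusters if c not in (a, b)]
        for c in rest:
            sim[frozenset((merged, c))] = (
                sim[frozenset((a, c))] + sim[frozenset((b, c))]
            ) / 2
        clusters = rest + [merged]
    return merges

names = ["A", "B", "C", "D", "E"]
rows = [
    [0, 2, 7, 6, 4],
    [2, 0, 9, 11, 14],
    [7, 9, 0, 4, 8],
    [6, 11, 4, 0, 2],
    [4, 14, 8, 2, 0],
]
sim = {
    frozenset((names[i], names[j])): rows[i][j]
    for i, j in combinations(range(len(names)), 2)
}
merges = hac(names, sim)
for first, second, value in merges:
    print(f"merge {first} + {second} at similarity {value}")
```

The merge order is B with E, then BE with C, then A with D, as in the example.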
14SCATTER/GATHER SESSION STAGE 1
FRACTIONATION
- Corpus C is broken into N/m buckets of fixed size m > k
- Apply group average agglomerative clustering on each bucket
- Generate document groups, which are given as input to the next iteration
- Repeat until k centers remain
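The steps above can be sketched as follows. This is an illustration under stated assumptions, not the original implementation: documents are plain strings turned into word-count vectors, similarity is cosine, and `rho` is a hypothetical reduction factor (how far each bucket is shrunk per pass) that the slides do not specify.

```python
import math
from collections import Counter
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two word-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def group_average(g1, g2, vectors):
    """Mean pairwise similarity between the documents of two groups."""
    total = sum(cosine(vectors[i], vectors[j]) for i in g1 for j in g2)
    return total / (len(g1) * len(g2))

def agglomerate(groups, target, vectors):
    """Group average agglomerative clustering until `target` groups remain."""
    groups = [list(g) for g in groups]
    while len(groups) > target:
        i, j = max(
            combinations(range(len(groups)), 2),
            key=lambda p: group_average(groups[p[0]], groups[p[1]], vectors),
        )
        groups[i] = groups[i] + groups[j]
        del groups[j]          # safe: combinations guarantees i < j
    return groups

def fractionation(texts, k, m, rho=0.5):
    """Return groups of document indices; m is the bucket size (m > k)."""
    if not (k < m and k <= int(rho * m) < m):
        raise ValueError("need k < m and k <= int(rho * m) < m")
    vectors = [Counter(t.lower().split()) for t in texts]
    groups = [[i] for i in range(len(texts))]
    while len(groups) > m:
        reduced = []
        for start in range(0, len(groups), m):     # fixed-size buckets
            bucket = groups[start:start + m]
            target = max(1, int(rho * len(bucket)))
            reduced.extend(agglomerate(bucket, target, vectors))
        groups = reduced                           # input to the next iteration
    return agglomerate(groups, k, vectors)

texts = [
    "cat dog pet", "stock market trade", "dog pet food", "market trade price",
    "cat pet food", "stock price market", "pet dog cat", "trade stock price",
]
print(fractionation(texts, k=2, m=4))
```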
15SCATTER/GATHER SESSION STAGE 2
BUCKSHOT
STEP1 HAC
- First, randomly take a sample of size sqrt(kn)
- Apply group average agglomerative clustering until we obtain k clusters
- Return the obtained clusters
16SCATTER /GATHER STAGE 2
BUCKSHOT
STEP2 K-Means
1. Arbitrarily select K documents as seeds; they are the initial centroids of the clusters
2. Assign all other documents to the closest centroid
3. Compute the centroid of each cluster again to get the new centroids
4. Repeat steps 2 and 3 until the centroids no longer change
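Both Buckshot steps can be combined in one short sketch. It makes several assumptions: documents are plain strings turned into word-count vectors, similarity is cosine, and the k clusters found in Step 1 supply the seed centroids for Step 2 instead of arbitrarily chosen documents. The `seed` and `max_iter` parameters are illustrative.

```python
import math
import random
from collections import Counter
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two word-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def group_average(g1, g2, vectors):
    """Mean pairwise similarity between the documents of two groups."""
    total = sum(cosine(vectors[i], vectors[j]) for i in g1 for j in g2)
    return total / (len(g1) * len(g2))

def centroid(members, vectors):
    """Summed word counts; cosine ignores scale, so no division is needed."""
    total = Counter()
    for i in members:
        total.update(vectors[i])
    return total

def buckshot(texts, k, seed=0, max_iter=20):
    """Return a cluster index (0..k-1) for every document."""
    vectors = [Counter(t.lower().split()) for t in texts]
    n = len(vectors)
    # Step 1: group average clustering on a random sample of size sqrt(kn).
    size = min(n, math.ceil(math.sqrt(k * n)))
    clusters = [[i] for i in random.Random(seed).sample(range(n), size)]
    while len(clusters) > k:
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: group_average(clusters[p[0]], clusters[p[1]], vectors),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    centers = [centroid(c, vectors) for c in clusters]
    # Step 2: assign to the closest centroid, recompute, repeat until stable.
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            max(range(len(centers)), key=lambda c: cosine(vectors[i], centers[c]))
            for i in range(n)
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for c in range(len(centers)):
            members = [i for i in range(n) if assignment[i] == c]
            if members:
                centers[c] = centroid(members, vectors)
    return assignment

texts = [
    "cat dog pet", "stock market trade", "cat dog pet",
    "stock market trade", "dog pet food", "market trade price",
]
print(buckshot(texts, k=2))
```

Because the sample is random, the result can depend on `seed`; identical documents always end up in the same cluster.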
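17Fractionation
- Documents: A, B, C, D, E, F, G, H
- The documents are split into two buckets: {C, D, E, F} and {A, B, G, H}
- Group average agglomerative clustering within each bucket gives C, F, DE and A, BG, H
- Next iteration: BG, AH, DE, CF
- Final clusters: BGCFDE and AH
Contd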
18Buckshot
- Documents in sample: A, D, G, E
- Group average agglomerative clustering gives G, A, DE and then AG, DE
- Assign remaining documents to these clusters using K-means
19GENESIS OF GROUPER
20GROUPER
- A dynamic web interface to the Husky Search meta-search engine
- Clusters the top retrieved results of the Husky Search meta-search engine
- Dynamically groups search results into clusters
- Uses the STC algorithm for clustering
21Grouper's query interface
22Grouper Interface
23 STC (Suffix Tree Clustering)
- A fast, incremental algorithm
- Operates on web document snippets
- Relies on a suffix tree to identify common phrases
- Uses the common phrases to create clusters
24WHAT IS A SUFFIX TREE?
- A suffix tree of a string S is a rooted, directed tree containing all the suffixes of S
- Each internal node has at least 2 children
- Each edge is labeled with a non-empty sub-string of S
- The label of a node is the concatenation of the edge-labels on the path from the root to that node
- No two edges out of the same node can have edge-labels that begin with the same word
25 STEPS OF STC
- Step-1 Document Cleaning
- Step-2 Identifying Base Clusters
- Step-3 Combining Base Clusters
- Step-4 Score clusters
26DOCUMENT CLEANING
- Stemming
- Stripping of HTML, punctuation and numbers
Before: <html>2 Cats ate <b>cheese</b>.</html>
After: Cat ate cheese
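A minimal sketch of the cleaning step, using only the standard `re` module. The `naive_stem` helper is a deliberately crude stand-in for a real stemmer (it only strips a plural "s"), and lowercasing the text is an added assumption.

```python
import re

def naive_stem(word):
    """Very rough plural stripping; a placeholder for a real stemmer."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def clean(snippet):
    """Strip HTML tags, punctuation and numbers, then stem each word."""
    text = re.sub(r"<[^>]+>", " ", snippet)       # drop HTML tags
    text = re.sub(r"[^A-Za-z\s]", " ", text)      # drop punctuation and digits
    return " ".join(naive_stem(w) for w in text.lower().split())

print(clean("<html>2 Cats ate <b>cheese</b>.</html>"))  # cat ate cheese
```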
27Identifying Base Clusters
- Create an inverted index of strings from the web document collection using a suffix tree
- Each node of the suffix tree represents a group of documents and a string that is common to all of them
- The label of the node represents the common string
- Each node represents a base cluster
28[Figure: suffix tree built from the three example documents]
1. cat ate cheese
2. mouse ate cheese too
3. cat ate mouse too
29 BASE CLUSTERS IDENTIFIED!!
Node  Phrase      Documents
a     cat ate     1,3
b     ate         1,2,3
c     cheese      1,2
d     mouse       2,3
e     too         2,3
f     ate cheese  1,2
Table 1: Six nodes and their corresponding base clusters
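Table 1 can be reproduced without building an actual suffix tree. The sketch below is a naive quadratic enumeration, not the STC data structure. It keeps a phrase when it occurs in at least two documents and is followed by at least two different continuations, which is the condition for a branching node. The function name `base_clusters` is illustrative.

```python
from collections import defaultdict

def base_clusters(docs):
    """docs: {doc_id: cleaned text}. Returns {phrase: set of doc ids}."""
    containing = defaultdict(set)   # phrase -> documents that contain it
    followers = defaultdict(set)    # phrase -> distinct tokens seen right after it
    for doc_id, text in docs.items():
        words = text.split()
        for start in range(len(words)):
            for end in range(start + 1, len(words) + 1):
                phrase = " ".join(words[start:end])
                containing[phrase].add(doc_id)
                # The end of a document acts as a terminator unique to it.
                nxt = words[end] if end < len(words) else ("$", doc_id)
                followers[phrase].add(nxt)
    return {
        phrase: ids
        for phrase, ids in containing.items()
        if len(ids) >= 2 and len(followers[phrase]) >= 2
    }

docs = {1: "cat ate cheese", 2: "mouse ate cheese too", 3: "cat ate mouse too"}
for phrase, ids in base_clusters(docs).items():
    print(phrase, sorted(ids))
```

"cat" alone is not reported because it is always followed by "ate", so it does not form a node of its own.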
30SCORING BASE CLUSTERS
$|B|$ is the number of documents in base cluster B
$|P|$ is the number of words in phrase P
$S(B) = |B| \cdot f(|P|)$
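The slide does not define f. The sketch below assumes a weighting that penalizes single-word phrases, grows linearly from two to six words, and stays constant after that. The exact values in `phrase_weight` are hypothetical.

```python
def phrase_weight(num_words):
    """Hypothetical f: 0.5 for one word, linear up to six words, then constant."""
    if num_words <= 1:
        return 0.5
    return float(min(num_words, 6))

def score(documents, phrase):
    """S(B) = |B| * f(|P|) for a base cluster with the given documents and phrase."""
    return len(documents) * phrase_weight(len(phrase.split()))

print(score({1, 3}, "cat ate"))   # 2 documents, 2 words
print(score({1, 2, 3}, "ate"))    # 3 documents, 1 word
```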
31Combining Base Clusters
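- $|B_m \cap B_n| / |B_m| > 0.5$
- $|B_m \cap B_n| / |B_n| > 0.5$
Binary similarity measure: similarity is 1 if both conditions are satisfied, otherwise 0
- $|B_m \cap B_n|$: number of documents which are in both clusters
- $|B_n|$: number of documents in cluster n
- $|B_m|$: number of documents in cluster m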
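The binary similarity test, and the merging it drives, can be sketched as below. The sketch assumes that base clusters linked by similarity 1 end up in the same final cluster, that is, final clusters are the connected components of the similarity graph. The function names are illustrative, and the data is Table 1.

```python
def similar(bm, bn):
    """Binary similarity: 1 if the overlap exceeds half of each cluster, else 0."""
    overlap = len(bm & bn)
    return int(overlap / len(bm) > 0.5 and overlap / len(bn) > 0.5)

def combine(base):
    """Merge base clusters connected by similarity 1 (connected components)."""
    names = list(base)
    seen, merged = set(), []
    for name in names:
        if name in seen:
            continue
        component, stack = [], [name]
        seen.add(name)
        while stack:
            current = stack.pop()
            component.append(current)
            for other in names:
                if other not in seen and similar(base[current], base[other]):
                    seen.add(other)
                    stack.append(other)
        documents = set().union(*(base[p] for p in component))
        merged.append((sorted(component), documents))
    return merged

base = {
    "cat ate": {1, 3}, "ate": {1, 2, 3}, "cheese": {1, 2},
    "mouse": {2, 3}, "too": {2, 3}, "ate cheese": {1, 2},
}
print(similar(base["cat ate"], base["cheese"]))  # overlap is exactly half, so 0
print(combine(base))
```

In this small example every base cluster is similar to "ate", so all six collapse into a single final cluster containing documents 1, 2 and 3.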
32COMBINING THE BASE CLUSTERS
[Figure: base cluster graph. Nodes: cat ate (1,3), ate (1,2,3), cheese (1,2), mouse (2,3), too (2,3), ate cheese (1,2)]
33STC is Incremental
- As each document arrives from the web, we:
  - Clean it
  - Add it to the suffix tree; each node that is updated or created as a result is tagged
  - Update the relevant base clusters and recalculate the similarity of these base clusters to the rest of the k highest scoring base clusters
  - Check for any changes to the final clusters
  - Score and sort the final clusters, and choose the top 10
34STC allows cluster overlap
- Why is overlap reasonable?
- A document often has more than one topic
- STC allows a document to appear in more than one cluster, since documents may share more than one phrase with other documents