SCATTER/GATHER : A CLUSTER BASED APPROACH FOR BROWSING LARGE DOCUMENT COLLECTIONS


Slides: 36

Provided by: minal

Learn more at: https://crystal.uta.edu


Transcript and Presenter's Notes


1
SCATTER/GATHER: A CLUSTER BASED APPROACH FOR BROWSING LARGE DOCUMENT COLLECTIONS
GROUPER: A DYNAMIC CLUSTERING INTERFACE TO WEB SEARCH RESULTS
MINAL PATANKAR, MADHURI WUDALI
2
DOCUMENT CLUSTERING

The process of grouping documents with similar content into a common cluster

3
ADVANTAGES OF DOCUMENT CLUSTERING

If a collection is well clustered, we can restrict a search to the clusters that are likely to contain relevant documents
Clustering also improves browsing through the document collection

4
Comparison diagram: SCATTER/GATHER and GROUPER
SCATTER/GATHER: a tool for browsing; document collection; clustering algorithms Buckshot and Fractionation; word based similarity
GROUPER: a tool for searching; meta search engine; clustering algorithm STC; phrase based similarity
Other diagram labels: USER INTERFACES, CLUSTERING, TRADITIONAL, TEXT-BASED
5
SCATTER/GATHER INTERFACE
6
SCATTER/GATHER SESSION

The user is presented with short summaries of a small number of document groups
The user selects one or more groups for further study
The selected groups are reclustered, and the process continues until the individual document level is reached

7
Diagram labels: Fractionation, Cluster Digest, Buckshot
8
HOW IS SCATTER/GATHER DONE?

Static offline partitioning phase
  Fractionation algorithm
Online reclustering phase
  Buckshot algorithm
    Step 1: Group average agglomerative clustering
    Step 2: K-Means

9

Clustering
  Partitional: K-Means
  Hierarchical
    Agglomerative: Single Link, Group Average Link, Complete Link
    Divisive
  Hybrid: Buckshot, Fractionation
10
HIERARCHICAL AGGLOMERATIVE CLUSTERING

Create an N x N doc-doc similarity matrix
Each document starts as a cluster of size one
Repeat until only one cluster remains:
  combine the two clusters with the greatest similarity
  update the doc-doc matrix
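
The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' implementation: it uses group average as the cluster similarity and, for brevity, recomputes that average from the original matrix instead of updating the matrix in place. The function names and the 4-document matrix are made up for the example.

```python
from itertools import combinations

def group_average(sim, a, b):
    """Mean pairwise similarity between the documents of clusters a and b."""
    return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

def hac(sim, target=1):
    """Merge the most similar pair of clusters until `target` clusters remain.

    `sim` is a symmetric N x N list of lists; the diagonal is ignored.
    Returns the final clusters and the merges in the order they happened.
    """
    clusters = [(i,) for i in range(len(sim))]   # each document starts alone
    merges = []
    while len(clusters) > target:
        a, b = max(combinations(clusters, 2),
                   key=lambda pair: group_average(sim, *pair))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b))
    return clusters, merges

# Hypothetical similarity matrix (higher means more similar).
sim = [
    [0, 9, 1, 2],
    [9, 0, 3, 1],
    [1, 3, 0, 8],
    [2, 1, 8, 0],
]
clusters, merges = hac(sim, target=2)
print(clusters)
```

Stopping at `target=1` reproduces the "until there is only one cluster" condition; a larger target cuts the hierarchy early.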

11
Example

    A   B   C   D   E
A   _   2   7   6   4
B   2   _   9  11  14
C   7   9   _   4   8
D   6  11   4   _   2
E   4  14   8   2   _

B and E have the highest similarity (14), so they are combined first:
A, B, C, D, E becomes A, BE, C, D

SC(A,BE) = 4 if we are using single link (take max)
SC(A,BE) = 2 if we are using complete linkage (take min)
SC(A,BE) = 3 if we are using group average (take average)
Note: C - BE is now the highest link
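
The three linkage rules differ only in how they summarize the similarities between A and the members of BE. A small sketch, using the values SC(A,B) = 2 and SC(A,E) = 4 from the first matrix (the function name `linkage` is chosen for this example only):

```python
# Similarities of A to B and to E, read from the first matrix.
sim_A = {"B": 2, "E": 4}

def linkage(values, method):
    """Similarity of a document to a merged cluster under each linkage rule."""
    values = list(values)
    if method == "single":      # most similar member: take the maximum
        return max(values)
    if method == "complete":    # least similar member: take the minimum
        return min(values)
    if method == "average":     # group average: take the mean
        return sum(values) / len(values)
    raise ValueError(f"unknown method: {method}")

for method in ("single", "complete", "average"):
    print(method, linkage(sim_A.values(), method))
```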
12
Example

     A    BE   C    D
A    _    3    7    6
BE   3    _    8.5  6.5
C    7    8.5  _    4
D    6    6.5  4    _

COMBINING

C is combined with BE: A, BE, C, D becomes A, BEC, D
SC(C,B) = 9, SC(C,E) = 8, SC(C,BE) = 8.5
13
Example

      A    BEC   D
A     _    5     6
BEC   5    _     5.25
D     6    5.25  _

COMBINING

A - D is now the highest link (6), so A and D are combined: A, BEC, D becomes BEC, AD
14
SCATTER/GATHER SESSION STAGE 1
FRACTIONATION

Corpus C of N documents is broken into N/m buckets of fixed size m > k
Group average agglomerative clustering is applied to each bucket
The resulting document groups are given as input to the next iteration
Repeat until k centers remain
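
A rough Python sketch of the Fractionation loop follows. It rests on several assumptions that the slide does not state: each bucket is reduced by a fixed factor `rho` per pass, documents are bucketed in their given order, and similarity is supplied as a function `sim(x, y)`. All names and the toy one-dimensional "documents" are hypothetical.

```python
def group_average(sim, a, b):
    """Mean pairwise similarity between two groups of documents."""
    return sum(sim(x, y) for x in a for y in b) / (len(a) * len(b))

def agglomerate(groups, sim, target):
    """Group average agglomerative clustering of `groups` down to `target` groups."""
    groups = [list(g) for g in groups]
    while len(groups) > target:
        pairs = ((i, j) for i in range(len(groups))
                 for j in range(i + 1, len(groups)))
        i, j = max(pairs,
                   key=lambda p: group_average(sim, groups[p[0]], groups[p[1]]))
        merged = groups.pop(j)          # j > i, so index i stays valid
        groups[i].extend(merged)
    return groups

def fractionation(docs, sim, k, m, rho=0.5):
    """Reduce `docs` to k groups by repeatedly clustering buckets of size m."""
    if not (m > k and 0 < rho < 1 and int(rho * m) >= k):
        raise ValueError("need m > k, 0 < rho < 1 and int(rho * m) >= k")
    groups = [[d] for d in docs]
    while len(groups) > m:
        reduced = []
        for start in range(0, len(groups), m):
            bucket = groups[start:start + m]
            # Assumed reduction: each bucket shrinks to rho * its size.
            reduced += agglomerate(bucket, sim, max(1, int(rho * len(bucket))))
        groups = reduced                # output groups feed the next iteration
    return agglomerate(groups, sim, k)  # final pass down to k centers

# Toy data: numbers on a line, closer numbers are more similar.
docs = [1, 2, 10, 11, 20, 21, 30, 31]
groups = fractionation(docs, lambda x, y: -abs(x - y), k=2, m=4)
print(groups)
```

Because each bucket is clustered independently, the cost per pass is quadratic in m rather than in N, which is the reason for bucketing in the first place.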

15
SCATTER/GATHER SESSION STAGE 2
BUCKSHOT
STEP 1: HAC

First, take a random sample of size sqrt(kn)
Apply group average agglomerative clustering to the sample until k clusters are obtained
Return the obtained clusters
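
Step 1 might look like the following in Python. This is a sketch under assumptions: the sample size is rounded up to an integer, similarity is passed in as a function `sim(x, y)`, and the names (`buckshot_step1`, `seed_clusters`) and the toy data are invented for illustration.

```python
import math
import random

def group_average(sim, a, b):
    """Mean pairwise similarity between two groups of documents."""
    return sum(sim(x, y) for x in a for y in b) / (len(a) * len(b))

def buckshot_step1(docs, sim, k, rng=random):
    """Cluster a random sample of about sqrt(k * n) documents into k groups."""
    n = len(docs)
    sample_size = min(n, max(k, math.ceil(math.sqrt(k * n))))
    groups = [[d] for d in rng.sample(docs, sample_size)]
    while len(groups) > k:
        pairs = ((i, j) for i in range(len(groups))
                 for j in range(i + 1, len(groups)))
        i, j = max(pairs,
                   key=lambda p: group_average(sim, groups[p[0]], groups[p[1]]))
        merged = groups.pop(j)          # j > i, so index i stays valid
        groups[i].extend(merged)
    return groups

# Toy data: 100 numbers on a line, closer numbers are more similar.
docs = list(range(100))
seed_clusters = buckshot_step1(docs, lambda x, y: -abs(x - y), k=4,
                               rng=random.Random(0))
print([len(g) for g in seed_clusters])
```

Sampling sqrt(kn) documents keeps the quadratic clustering step proportional to kn overall.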

16
SCATTER/GATHER SESSION STAGE 2
BUCKSHOT
STEP 2: K-Means

1. Arbitrarily select K documents as seeds; they are the initial centroids of the clusters
2. Assign all other documents to the closest centroid
3. Recompute the centroid of each cluster
4. Repeat steps 2 and 3 until the centroid of each cluster no longer changes
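
The K-Means steps above can be sketched as follows. The sketch assumes documents are points given as tuples of numbers and uses squared Euclidean distance; a real system would work on term vectors. The optional `seeds` argument is an addition for this example so that the run is reproducible; without it the seeds are picked at random, as on the slide.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(xs) / len(points) for xs in zip(*points))

def kmeans(points, k, seeds=None, max_iter=100):
    # Step 1: arbitrary seeds become the initial centroids.
    centroids = list(seeds) if seeds else random.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Step 2: assign every point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist2(p, centroids[c]))
            clusters[nearest].append(p)
        # Step 3: recompute centroids (an empty cluster keeps its old one).
        new_centroids = [mean(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        # Step 4: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2, seeds=[(0, 0), (10, 10)])
print(clusters)
```

In Buckshot the seeds would come from the centroids of the Step 1 clusters rather than from a random draw.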

17
Fractionation example (diagram)
Documents A, B, C, D, E, F, G, H are divided into Bucket 1 and Bucket 2
Group average agglomerative clustering of the buckets gives: A, BG, H, C, F, DE
Next iteration: BG, AH, DE, CF
Final groups: BGCFDE, AH
18
Buckshot example (diagram)
Documents in sample: A, D, G, E
Group average agglomerative clustering: G, A, DE, then AG, DE
Assign remaining documents to these clusters using K-means
19
GENESIS OF GROUPER
20
GROUPER

A dynamic web interface to the Husky Search meta-search engine
Clusters the top retrieved results of the Husky Search meta-search engine
Dynamically groups search results into clusters
Uses the STC algorithm for clustering

21
Grouper's query interface

22
Grouper Interface
23
STC (Suffix Tree Clustering)

A fast, incremental algorithm
Operates on web document snippets
Relies on a suffix tree to identify common phrases
Uses these common phrases to create clusters

24
WHAT IS A SUFFIX TREE?

A suffix tree is a rooted, directed tree
Each internal node has at least 2 children
Each edge is labeled with a non-empty sub-string
of S.
The label of a node is the concatenation of the
edge-labels on the path from the root to that
node.
No two edges out of the same node can have
edge-labels that begin with the same word.

25
STEPS OF STC

Step-1 Document Cleaning
Step-2 Identifying Base Clusters
Step-3 Combining Base Clusters
Step-4 Score clusters

26
DOCUMENT CLEANING

Stemming
Stripping of HTML, punctuation and numbers

<html>2 Cats ate <b> cheese</b>.</html>
becomes
Cat ate cheese
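
A cleaning step along these lines could be sketched in Python as below. The stemmer here is a deliberately crude stand-in (it only strips a plural "s"), and lowercasing is an extra assumption; a real system would use a proper stemming algorithm.

```python
import re

def naive_stem(word):
    """Very rough stand-in for a real stemmer: strip a trailing plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def clean(snippet):
    """Strip HTML tags, punctuation and numbers, then stem each word."""
    text = re.sub(r"<[^>]+>", " ", snippet)     # remove HTML tags
    text = re.sub(r"[^A-Za-z\s]", " ", text)    # remove punctuation and digits
    return " ".join(naive_stem(w) for w in text.lower().split())

print(clean("<html>2 Cats ate <b> cheese</b>.</html>"))  # cat ate cheese
```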
27
Identifying Base Clusters

Create an inverted index of strings from the web document collection using a suffix tree
Each node of the suffix tree represents a group of documents and a string that is common to all of them
The label of the node represents the common string
Each node represents a base cluster

28
Example strings:
1. cat ate cheese
2. mouse ate cheese too
3. cat ate mouse too

(Diagram: suffix tree of the three strings, with each node annotated by the documents that contain its phrase)
29
BASE CLUSTERS IDENTIFIED!!
Node   Phrase       Documents
a      cat ate      1,3
b      ate          1,2,3
c      cheese       1,2
d      mouse        2,3
e      too          2,3
f      ate cheese   1,2
Table 1: Six nodes and their corresponding base clusters
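
The six base clusters in Table 1 can be reproduced without building a suffix tree, which is easier to follow for a three-document example. The sketch below swaps in a plain phrase dictionary: it indexes every contiguous word sequence, keeps phrases shared by at least two documents, and drops a phrase when a one-word extension of it covers exactly the same documents (a simplified stand-in for the rule that only branching points become tree nodes). It is an illustration only and does not have the suffix tree's efficiency.

```python
from collections import defaultdict

docs = {
    1: "cat ate cheese",
    2: "mouse ate cheese too",
    3: "cat ate mouse too",
}

def base_clusters(docs, min_docs=2):
    """Map each shared phrase to the set of documents that contain it."""
    phrase_docs = defaultdict(set)
    for doc_id, text in docs.items():
        words = text.split()
        for i in range(len(words)):
            for j in range(i + 1, len(words) + 1):
                phrase_docs[tuple(words[i:j])].add(doc_id)
    shared = {p: ids for p, ids in phrase_docs.items() if len(ids) >= min_docs}
    result = {}
    for phrase, ids in shared.items():
        # Drop "cat" because "cat ate" occurs in exactly the same documents.
        absorbed = any(len(q) == len(phrase) + 1
                       and q[:len(phrase)] == phrase
                       and other == ids
                       for q, other in shared.items())
        if not absorbed:
            result[" ".join(phrase)] = ids
    return result

for phrase, ids in base_clusters(docs).items():
    print(phrase, sorted(ids))
```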
30
SCORING BASE CLUSTERS

Scoring clusters

|B| is the number of documents in base cluster B
|P| is the number of words in phrase P
S(B) = |B| · f(|P|)
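
The slide does not define f. The sketch below assumes a shape in which single-word phrases are penalized, the value grows linearly for phrases of two to six words, and it stays constant for longer phrases. The specific constants (0.5 and the cap of 6) are assumptions made for this example, not values taken from the slides.

```python
def f(phrase_len):
    """Assumed phrase-length weight: penalize one word, cap at six words."""
    if phrase_len <= 1:
        return 0.5                 # assumed penalty for single-word phrases
    return min(phrase_len, 6)      # linear from 2 to 6 words, then constant

def score(doc_ids, phrase):
    """S(B) = |B| * f(|P|) for a base cluster with the given phrase."""
    return len(doc_ids) * f(len(phrase.split()))

print(score({1, 3}, "cat ate"))    # 2 documents, 2-word phrase
print(score({1, 2, 3}, "ate"))     # 3 documents, 1-word phrase
```

Under these assumed constants, the two-word cluster outscores the single-word cluster even though it covers fewer documents.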
31
Combining Base Clusters

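Base clusters Bm and Bn are similar if
|Bm ∩ Bn| / |Bm| > 0.5 and |Bm ∩ Bn| / |Bn| > 0.5

where
|Bm ∩ Bn| is the number of documents which are in both clusters
|Bn| is the number of documents in cluster n
|Bm| is the number of documents in cluster m

Binary similarity measure
SIMILARITY IS 1 IF THE CONDITION IS SATISFIED, OTHERWISE 0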
32
COMBINING THE BASE CLUSTERS

Base cluster graph (diagram)
Nodes: cat ate (1,3), ate (1,2,3), cheese (1,2), mouse (2,3), too (2,3), ate cheese (1,2)
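
A sketch of the combining step on the six base clusters from Table 1. It assumes that two base clusters are linked when the binary similarity is 1 and that each final cluster is a connected component of the resulting graph, holding the union of its members' documents. Function and variable names are made up for this example.

```python
base = {
    "cat ate": {1, 3},
    "ate": {1, 2, 3},
    "cheese": {1, 2},
    "mouse": {2, 3},
    "too": {2, 3},
    "ate cheese": {1, 2},
}

def similar(bm, bn, threshold=0.5):
    """Binary similarity: both overlap ratios must exceed the threshold."""
    overlap = len(bm & bn)
    return overlap / len(bm) > threshold and overlap / len(bn) > threshold

def merge_base_clusters(base):
    """Return (phrases, documents) for each connected component of the graph."""
    seen, clusters = set(), []
    for start in base:
        if start in seen:
            continue
        seen.add(start)
        component, stack = [], [start]
        while stack:                       # depth-first walk over similar clusters
            current = stack.pop()
            component.append(current)
            for other in base:
                if other not in seen and similar(base[current], base[other]):
                    seen.add(other)
                    stack.append(other)
        documents = set().union(*(base[name] for name in component))
        clusters.append((sorted(component), documents))
    return clusters

final = merge_base_clusters(base)
print(final)
```

In this example "ate" overlaps every other base cluster by more than half, so all six nodes end up in one component covering documents 1, 2 and 3.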
33
STC is Incremental

As each document arrives from the web, we clean it
and add it to the suffix tree. Each node that is updated or created as a result is tagged
Update the relevant base clusters and recalculate their similarity to the rest of the k highest-scoring base clusters
Check for any changes to the final clusters
Score and sort the final clusters, and choose the top 10

34
STC allows cluster overlap

Why is overlap reasonable?
A document often has more than one topic
STC allows a document to appear in more than one cluster, since documents may share more than one phrase with other documents
