SCATTER/GATHER: A CLUSTER-BASED APPROACH FOR BROWSING LARGE DOCUMENT COLLECTIONS

1
SCATTER/GATHER: A CLUSTER-BASED APPROACH FOR
BROWSING LARGE DOCUMENT COLLECTIONS
GROUPER: A DYNAMIC CLUSTERING INTERFACE TO WEB
SEARCH RESULTS
MINAL PATANKAR, MADHURI WUDALI
2
DOCUMENT CLUSTERING
  • Process of grouping documents with similar
    contents into a common cluster

3
ADVANTAGES OF DOCUMENT CLUSTERING
  • If a collection is well clustered, we can
    search only the clusters likely to contain
    relevant documents
  • Clustering also improves browsing through the
    document collection

4
USER INTERFACES

                       SCATTER/GATHER               GROUPER
Role                   A tool for browsing          A tool for searching
Input                  Document collection          Meta-search engine results
Clustering algorithm   Traditional text-based       STC
                       (Buckshot, Fractionation)
Similarity measure     Word-based                   Phrase-based
5
SCATTER/GATHER INTERFACE
(Screenshot of the Scatter/Gather browsing interface.)
6
SCATTER/GATHER SESSION
  • The user is presented with short summaries of a small
    number of document groups.
  • The user selects (gathers) one or more groups for further study.
  • The selected groups are merged and re-clustered (scattered);
    this process continues until the individual-document level is
    reached, as in the sketch below.
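The session loop can be summarized in a few lines. Below is a minimal Python sketch, not the original system's code: cluster_into_k, summarize, and user_gathers are hypothetical helpers standing in for the clustering backend, the cluster-digest builder, and the user's selection.

def scatter_gather_session(documents, k, cluster_into_k, summarize, user_gathers):
    working_set = documents
    while len(working_set) > k:
        groups = cluster_into_k(working_set, k)      # scatter into k groups
        digests = [summarize(group) for group in groups]
        chosen = user_gathers(groups, digests)       # user picks groups to pursue
        # Union the gathered groups; they become the collection to re-scatter.
        working_set = [doc for group in chosen for doc in group]
    return working_set                               # down to individual documents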

7
(Diagram: a Scatter/Gather session. Fractionation builds the initial
partition offline; a cluster digest summarizes each group; Buckshot
re-clusters the gathered documents at each interactive step.)
8
HOW IS SCATTER/GATHER DONE?
  • Static offline partitioning phase: the Fractionation algorithm
  • Online reclustering phase: the Buckshot algorithm
    • Step 1: Group-average agglomerative clustering
    • Step 2: K-Means

9

Clustering
  Partitional: K-Means
  Hierarchical
    Agglomerative: Single Link, Complete Link, Group Average Link
    Divisive
  Hybrid: Buckshot, Fractionation
10
HIERARCHICAL AGGLOMERATIVE CLUSTERING
  • Create an N x N document-document similarity matrix.
  • Each document starts as a cluster of size one.
  • Until only one cluster remains:
    • combine the two clusters with the greatest similarity
    • update the doc-doc matrix
  (A sketch follows below.)
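As a concrete illustration, here is a compact Python sketch of group-average HAC over a precomputed similarity matrix — a naive O(n^3) version, not the papers' optimized implementation. The example matrix is the one used on the next slides.

def hac_group_average(sim, k=1):
    n = len(sim)
    clusters = [[i] for i in range(n)]          # every document starts alone
    while len(clusters) > k:
        best, pair = float("-inf"), None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Group-average linkage: mean pairwise similarity.
                s = sum(sim[i][j] for i in clusters[a] for j in clusters[b])
                s /= len(clusters[a]) * len(clusters[b])
                if s > best:
                    best, pair = s, (a, b)
        a, b = pair
        clusters[a] += clusters.pop(b)          # combine the most similar pair
    return clusters

SIM = [[0, 2, 7, 6, 4],       # A
       [2, 0, 9, 11, 14],     # B
       [7, 9, 0, 4, 8],       # C
       [6, 11, 4, 0, 2],      # D
       [4, 14, 8, 2, 0]]      # E
print(hac_group_average(SIM, k=2))  # [[0, 3], [1, 4, 2]] -> {A,D} and {B,E,C}

Stopping at k=2 instead of one cluster, this reproduces the final grouping {A,D} and {B,E,C} from the worked example that follows.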

11
Example
        A    B    C    D    E
    A   _    2    7    6    4
    B   2    _    9   11   14
    C   7    9    _    4    8
    D   6   11    4    _    2
    E   4   14    8    2    _

B and E are the most similar pair (14), so they merge first,
leaving the clusters A, BE, C, D.

SC(A,BE) = 4 if we are using single link (take the max)
SC(A,BE) = 2 if we are using complete linkage (take the min)
SC(A,BE) = 3 if we are using group average (take the average)
Note: C-BE (8.5) is now the highest link.
12
Example
        A    BE   C    D
    A   _    3    7    6
    BE  3    _    8.5  6.5
    C   7    8.5  _    4
    D   6    6.5  4    _

Combining BE and C (8.5 is now the highest similarity):
SC(C,B) = 9, SC(C,E) = 8, so SC(C,BE) = (9 + 8) / 2 = 8.5
This leaves the clusters A, BEC, D.
13
Example
         A    BEC   D
    A    _    5     6
    BEC  5    _     5.75
    D    6    5.75  _

Combining A and D (6 is now the highest similarity),
leaving the two clusters AD and BEC.
14
SCATTER/GATHER SESSION STAGE 1
FRACTIONATION
  • The corpus C of N documents is broken into N/m buckets
    of fixed size m > k
  • Group-average agglomerative clustering is applied within
    each bucket
  • The resulting document groups are treated as documents and
    given as input to the next iteration
  • Repeat until k centers remain (a sketch follows below)
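A rough Python sketch of this loop, under two stated simplifications: cluster_bucket is a caller-supplied routine (e.g. the group-average HAC sketch above, adapted to cluster groups), and each bucket is simply reduced by a factor of two per pass, where the paper uses a tunable reduction factor.

def fractionation(docs, k, m, cluster_bucket):
    # Start with one singleton group per document.
    groups = [[d] for d in docs]
    while len(groups) > k:
        # Break the current groups into buckets of fixed size m (m > k).
        buckets = [groups[i:i + m] for i in range(0, len(groups), m)]
        next_round = []
        for bucket in buckets:
            target = max(1, len(bucket) // 2)    # halve each bucket per pass
            for cluster in cluster_bucket(bucket, target):
                # Each cluster of groups becomes one pseudo-document.
                next_round.append([d for group in cluster for d in group])
        groups = next_round
    return groups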

15
SCATTER/GATHER SESSION STAGE 2
BUCKSHOT
STEP 1: HAC
  • First, draw a random sample of size sqrt(kn)
  • Apply group-average agglomerative clustering to the sample
    until k clusters are obtained
  • Return the obtained clusters

16
SCATTER/GATHER SESSION STAGE 2
BUCKSHOT
STEP 2: K-Means
  • Arbitrarily select k documents as seeds; they are the
    initial centroids of the clusters.
  • Assign every other document to the closest centroid.
  • Recompute the centroid of each cluster, obtaining a new
    centroid for each.
  • Repeat steps 2 and 3 until the centroid of each cluster
    no longer changes. (A combined sketch of both steps follows.)
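Putting the two steps together, here is a condensed Python sketch of Buckshot. It assumes documents are unit-length vectors (so the dot product acts as cosine similarity) and reuses a group-average HAC routine like the earlier sketch; the iteration cap is a safety choice, not part of the algorithm.

import math, random

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def mean(vectors):
    return [sum(xs) / len(vectors) for xs in zip(*vectors)]

def buckshot(vectors, k, hac):
    n = len(vectors)
    # Step 1: cluster a random sample of size sqrt(kn) with HAC.
    sample = random.sample(range(n), int(math.sqrt(k * n)))
    sim = [[dot(vectors[i], vectors[j]) for j in sample] for i in sample]
    seeds = hac(sim, k)                          # k clusters of sample indices
    centroids = [mean([vectors[sample[i]] for i in c]) for c in seeds]
    # Step 2: K-Means over the full collection, seeded by those centroids.
    assign = [0] * n
    for _ in range(100):                         # iteration cap for safety
        assign = [max(range(k), key=lambda c: dot(v, centroids[c]))
                  for v in vectors]
        members = [[v for v, a in zip(vectors, assign) if a == c]
                   for c in range(k)]
        new = [mean(m) if m else centroids[c] for c, m in enumerate(members)]
        if new == centroids:                     # centroids stopped moving
            break
        centroids = new
    return assign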

17
Fractionation example
(Diagram: eight documents A-H are split into two buckets of four;
group-average agglomerative clustering within each bucket produces
the groups BG, AH, DE, and CF; treating these groups as documents,
the next pass merges them into the two clusters BGCFDE and AH.
Continued on the next slide.)
18
Buckshot example
(Diagram: a random sample A, D, G, E is drawn from the collection;
group-average agglomerative clustering on the sample yields the
clusters AG and DE; the remaining documents are then assigned to
these clusters using K-Means.)
19
GENESIS OF GROUPER
20
GROUPER
  • A dynamic web interface to the HuskySearch
    meta-search engine
  • Clusters the top results retrieved by HuskySearch
  • Dynamically groups search results into clusters
  • Uses the STC algorithm for clustering

21
Grouper's query interface.

22
Grouper Interface
23
STC (Suffix Tree Clustering)
  • A fast, incremental algorithm
  • Operates on web document snippets
  • Relies on a suffix tree to identify common phrases
  • Uses this shared-phrase information to create clusters

24
WHAT IS A SUFFIX TREE?
  • A suffix tree is a rooted, directed tree.
  • Each internal node has at least 2 children.
  • Each edge is labeled with a non-empty substring
    of the indexed string S.
  • The label of a node is the concatenation of the
    edge labels on the path from the root to that
    node.
  • No two edges out of the same node can have
    edge labels that begin with the same word.

25
STEPS OF STC
  • Step 1: Document cleaning
  • Step 2: Identifying base clusters
  • Step 3: Combining base clusters
  • Step 4: Scoring clusters

26
DOCUMENT CLEANING
  • Stemming
  • Stripping of HTML tags, punctuation, and numbers

Example: <html>2 Cats ate <b>cheese</b>.</html>  ->  cat ate cheese
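A toy version of this cleaning step, assuming regular-expression tag stripping and a crude suffix-stripping rule in place of a real stemmer such as Porter's:

import re

def clean(snippet):
    text = re.sub(r"<[^>]+>", " ", snippet)           # strip HTML tags
    text = re.sub(r"[^a-zA-Z\s]", " ", text).lower()  # drop punctuation/numbers
    # Crude stand-in for a stemmer: strip a few common suffixes.
    words = [re.sub(r"(s|es|ing|ed)$", "", w) for w in text.split()]
    return " ".join(w for w in words if w)

print(clean("<html>2 Cats ate <b>cheese</b>.</html>"))  # -> "cat ate cheese"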
27
Identifying Base Clusters
  • Create an inverted index of phrases from the web
    document collection using a suffix tree
  • Each node of the suffix tree represents a group
    of documents and a phrase that is common to all
    of them
  • The label of the node is that common phrase
  • Each node therefore represents a base cluster
    (see the sketch below).
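To make the idea concrete, here is a simplified Python stand-in that indexes all word suffixes (and their prefixes, i.e. all phrases) instead of building a compressed suffix tree; the base clusters recovered are the same. On the three example sentences it reproduces Table 1 exactly.

from collections import defaultdict

def base_clusters(docs, min_docs=2):
    node_docs = defaultdict(set)             # phrase -> documents containing it
    for doc_id, text in docs.items():
        words = text.split()
        for start in range(len(words)):      # every word suffix of the document
            for end in range(start + 1, len(words) + 1):
                node_docs[" ".join(words[start:end])].add(doc_id)
    # A node is a base cluster if its phrase is shared by enough documents.
    return {p: d for p, d in node_docs.items() if len(d) >= min_docs}

docs = {1: "cat ate cheese", 2: "mouse ate cheese too", 3: "cat ate mouse too"}
for phrase, ids in sorted(base_clusters(docs).items()):
    print(phrase, sorted(ids))   # ate, ate cheese, cat ate, cheese, mouse, too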

28
(Diagram: word-level suffix tree for the three sentences
1. "cat ate cheese", 2. "mouse ate cheese too", 3. "cat ate mouse too".
Internal nodes correspond to shared phrases such as "cat ate" (docs 1,3),
"ate cheese" (docs 1,2), and "too" (docs 2,3); these become the base
clusters listed in Table 1.)
29
BASE CLUSTERS IDENTIFIED!
Node   Phrase       Documents
a      cat ate      1,3
b      ate          1,2,3
c      cheese       1,2
d      mouse        2,3
e      too          2,3
f      ate cheese   1,2
Table 1: Six nodes and their corresponding base clusters
30
SCORING BASE CLUSTERS
  • Each base cluster B, with phrase P, is assigned the score

    s(B) = |B| * f(|P|)

    where |B| is the number of documents in base cluster B and
    |P| is the number of words in phrase P. The function f
    penalizes single-word phrases, grows linearly for phrases of
    two to six words, and is constant for longer phrases.
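A sketch of the scoring function, where the exact penalty for single-word phrases is an illustrative choice (the paper specifies only the shape of f):

def phrase_credit(num_words):
    if num_words == 1:
        return 0.5              # penalize single-word phrases (value is a choice)
    return min(num_words, 6)    # linear up to six words, then constant

def score(cluster_docs, phrase):
    return len(cluster_docs) * phrase_credit(len(phrase.split()))

print(score({1, 3}, "cat ate"))   # |B| = 2, f(2) = 2  ->  4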
31
Combining Base Clusters
  • Binary similarity measure between base clusters Bm and Bn:

    similarity(Bm, Bn) = 1  iff  |Bm ∩ Bn| / |Bm| > 0.5
                            and  |Bm ∩ Bn| / |Bn| > 0.5
    similarity(Bm, Bn) = 0  otherwise

    where |Bm ∩ Bn| is the number of documents in both clusters,
    and |Bm|, |Bn| are the numbers of documents in clusters m and n.
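A small Python sketch of the merge step: base clusters become nodes, an edge connects two clusters whose overlap passes the 0.5 threshold in both directions, and each connected component becomes one final cluster. The union-find helper is an implementation choice.

def similar(bm, bn):
    overlap = len(bm & bn)
    return overlap / len(bm) > 0.5 and overlap / len(bn) > 0.5

def merge_base_clusters(base):               # base: {phrase: set(doc_ids)}
    phrases = list(base)
    parent = {p: p for p in phrases}
    def find(p):                             # follow parent links to the root
        while parent[p] != p:
            p = parent[p]
        return p
    for i, p in enumerate(phrases):          # build the base cluster graph
        for q in phrases[i + 1:]:
            if similar(base[p], base[q]):
                parent[find(p)] = find(q)    # union the two components
    groups = {}
    for p in phrases:
        groups.setdefault(find(p), []).append(p)
    return list(groups.values())

base = {"cat ate": {1, 3}, "ate": {1, 2, 3}, "cheese": {1, 2},
        "mouse": {2, 3}, "too": {2, 3}, "ate cheese": {1, 2}}
print(merge_base_clusters(base))   # one connected component -> one final cluster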
32
COMBINING THE BASE CLUSTERS
  • Base cluster graph: nodes are the six base clusters, with an
    edge between any two whose similarity is 1.

(Diagram: base cluster graph for the example. The base clusters
"cat ate" (1,3), "ate" (1,2,3), "cheese" (1,2), "mouse" (2,3),
"too" (2,3), and "ate cheese" (1,2) form a single connected
component, so they merge into one final cluster.)
33
STC is Incremental
  • As each document arrives from the web, we clean it.
  • Add it to the suffix tree; each node that is
    updated or created as a result is tagged.
  • Update the relevant base clusters and recalculate
    the similarity of these base clusters to the rest
    of the k highest-scoring base clusters.
  • Check for any changes to the final clusters.
  • Score and sort the final clusters, and choose the top 10.

34
STC allows cluster overlap
  • Why is overlap reasonable?
  • A document often has more than one topic.
  • STC allows a document to appear in more than one
    cluster, since documents may share more than one
    phrase with other documents.
