Semantic, Hierarchical, Online Clustering of Web Search Results

Title: Clustering Web Search Results. Author: Iwona Bialynicka-Birula. Last modified by: AILAB. Created: 4/5/2004 9:34:13 AM. Format: PowerPoint presentation.

1
Semantic, Hierarchical, Online Clustering of Web
Search Results
  • Yisheng Dong

2
Overview
  • Introduction
  • Previous Related Work
  • SHOC Approach
  • Prototype System
  • Conclusion

3
Introduction
  • Motivation
      • The Web is the largest data source.
      • Search engines are the most commonly used tools for
        Web information retrieval.
      • Their current state is far from satisfactory.
  • Solution
      • Clustering Web search results would help considerably.
      • SHOC can generate clusters that are both reasonable and
        readable.

4
Basic requirements (for clustering Web search results)
  • Semantic
      • Each cluster should correspond to a concept.
      • Avoid confining each Web page to only one cluster.
      • Each label should describe the topic of its cluster well.
  • Hierarchical
      • A tree structure that is easy to browse by eye.
      • Take advantage of the relationships between clusters.
  • Online
      • Provide fresh clustering results just in time.

5
Previous Related Work
  • Scatter/Gather system
      • Uses a traditional heuristic clustering algorithm.
      • Has some limitations.
  • Hyperlink-based approaches
      • Need to download and parse the original Web pages.
      • Cannot cluster immediately.
  • STC (Suffix Tree Clustering)
      • Not appropriate for Oriental languages.
      • Extracts many meaningless partial phrases.
      • Does not consider synonymy and polysemy.

6
SHOC steps
  1. Data acquisition
  2. Data cleaning
  3. Feature extraction
  4. Identifying base clusters
  5. Combining base clusters

7
Data acquisition
  • The data acquisition task here is actually
    meta-search.
  • Uses a two-level parallelization mechanism:
      • Call several search engines simultaneously.
      • Fetch all of each engine's search results simultaneously.
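
The two-level parallelization described above could look roughly like the following sketch. It is only an illustration: the engine names, the `FAKE_ENGINES` data, and the functions `fetch_page`, `search_engine`, and `meta_search` are hypothetical, and the network call is replaced by a stub so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for real search engines: each maps a page number
# to a list of result snippets. A real system would issue HTTP requests.
FAKE_ENGINES = {
    "engine_a": {1: ["a1", "a2"], 2: ["a3"]},
    "engine_b": {1: ["b1"], 2: ["b2", "b3"]},
}


def fetch_page(engine, query, page):
    """Fetch one result page from one engine (stubbed, no network access)."""
    return FAKE_ENGINES[engine].get(page, [])


def search_engine(engine, query, pages):
    """Level 2: fetch all result pages of a single engine in parallel."""
    with ThreadPoolExecutor(max_workers=len(pages)) as pool:
        page_results = pool.map(lambda p: fetch_page(engine, query, p), pages)
        return [hit for page in page_results for hit in page]


def meta_search(query, engines, pages=(1, 2)):
    """Level 1: call several engines in parallel and merge their results."""
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        per_engine = pool.map(lambda e: search_engine(e, query, pages), engines)
        return [hit for hits in per_engine for hit in hits]


results = meta_search("object oriented", ["engine_a", "engine_b"])
```

`pool.map` preserves input order, so the merged list keeps results grouped by engine and page even though the fetches run concurrently.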

8
Data cleaning
  • Sentence boundaries are identified via:
      • punctuation marks (e.g. ".", ",", "?", etc.)
      • HTML tags (e.g. <p>, <br>, <li>, <td>, etc.)
  • Non-word tokens (e.g. punctuation marks and HTML tags)
    are stripped.
  • Redundant spaces are compressed.
  • A stemming algorithm may be applied (for English
    text).
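
A minimal sketch of these cleaning steps is shown below. The exact punctuation set, the regular expressions, and the function name `clean` are assumptions made for illustration; stemming is left out.

```python
import re

# Sentence boundaries: any HTML tag or one of a few punctuation marks.
# The exact punctuation set is an assumption of this sketch.
BOUNDARY = re.compile(r"<[^>]+>|[.,;?!]")


def clean(snippet):
    """Split a snippet into sentences made of plain word tokens."""
    sentences = []
    for part in BOUNDARY.split(snippet):
        # Keep only word tokens; joining with single spaces also
        # compresses any redundant whitespace.
        words = re.findall(r"\w+", part)
        if words:
            sentences.append(" ".join(words))
    return sentences


cleaned = clean("<p>To be,  or not   to be?</p><br>That is the question.")
```

Here `cleaned` holds three sentences: `"To be"`, `"or not to be"`, and `"That is the question"`.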

9
Feature extraction (Overview)
  • Words
      • Most clustering algorithms treat a document as a
        bag of words, ignoring word order and proximity.
  • Key phrases
      • Advantages
          • Improve the quality of the clusters.
          • Useful in constructing labels.
  • Data structures (for key phrase discovery)
      • Suffix tree
          • Performance depends on the alphabet size of the language.
      • Suffix array
          • Scales well with alphabet size.
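
To illustrate why ignoring word order loses information, the short sketch below compares a bag-of-words view with a view that keeps adjacent word pairs. Word pairs are used here only as a simple stand-in for key phrases; the function names are hypothetical.

```python
from collections import Counter


def bag_of_words(text):
    """Count words, discarding order and proximity."""
    return Counter(text.split())


def bigrams(text):
    """Count adjacent word pairs, a crude stand-in for key phrases."""
    words = text.split()
    return Counter(zip(words, words[1:]))


a = "dog bites man"
b = "man bites dog"
same_bag = bag_of_words(a) == bag_of_words(b)
same_phrases = bigrams(a) == bigrams(b)
```

The two sentences are indistinguishable as bags of words (`same_bag` is `True`) but differ once word order is kept (`same_phrases` is `False`).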

10
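Feature extraction (key phrase discovery)
  • Completeness
      • Left-completeness
      • Right-completeness
  • Stability (Mutual Information)
      • $S = c_1 c_2 \cdots c_p$, $S_L = c_1 \cdots c_{p-1}$, $S_R = c_2 \cdots c_p$
  • Significance
      • $se(S) = freq(S) \cdot g(|S|)$, where $|S|$ is the length of $S$
      • $g(x) = 0$ if $x \le 1$; $g(x) = \log_2 x$ if $2 \le x \le 8$; $g(x) = 3$ if $x > 8$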
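
The significance score can be sketched as below, assuming that the argument of g is the length of the phrase and that the lost comparison operators are "at most 1", "between 2 and 8", and "greater than 8". The function names are hypothetical.

```python
import math


def g(x):
    """Length weight: 0 for x <= 1, log2(x) for 2 <= x <= 8, capped at 3."""
    if x <= 1:
        return 0.0
    if x <= 8:
        return math.log2(x)
    return 3.0


def significance(freq, length):
    """se(S) = freq(S) * g(|S|) for a phrase with the given frequency and length."""
    return freq * g(length)
```

Under this reading, single-token phrases score zero regardless of frequency, and the weight stops growing once a phrase is longer than 8 tokens, since log2(8) = 3.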

11
Feature extraction (Suffix array)
  • Suffix array
      • An array of all N suffixes, sorted alphabetically.
  • LCP (Longest Common Prefix)
      • Used to accelerate searching in the text.

<Suffix array and LCP of "to_be_or_not_to_be">
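
As an illustration, the suffix array and LCP array of the example string can be computed with the naive sketch below. It sorts the suffixes directly, which is simple but not an efficient construction, and it assumes 0-based positions with a zero sentinel at both ends of the LCP array. The function names are hypothetical.

```python
def suffix_array(text):
    """Naive construction: sort suffix start positions by suffix text."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """lcp[i] is the common-prefix length of sorted suffixes i-1 and i.

    lcp[0] and lcp[len(text)] are zero sentinels (an assumption of this sketch).
    """
    def common(a, b):
        n = 0
        while max(a, b) + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        return n

    return [0] + [common(sa[i - 1], sa[i]) for i in range(1, len(sa))] + [0]


text = "to_be_or_not_to_be"
sa = suffix_array(text)
lcp = lcp_array(text, sa)
```

With plain ASCII ordering, `_` sorts before the lowercase letters, so the five suffixes starting with `_` come first.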
12
Feature extraction (Discover rcs)
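    // lcp[i] is the LCP of the (i-1)-th and i-th sorted suffixes;
    // lcp[0] and lcp[N] are assumed to be 0.
    void discover_rcs() {
        typedef struct {
            int ID;
            int frequency;
        } RCSTYPE;
        RCSTYPE rcs_stack[N];   // N is the document's length
        Initialize rcs_stack;
        int sp = -1;            // the stack pointer
        int i = 1;
        while (i < N + 1) {
            if (sp < 0) {       // the stack is empty
                if (lcp[i] > 0) {
                    sp++;
                    rcs_stack[sp].ID = i;
                    rcs_stack[sp].frequency = 2;
                }
                i++;
            } else {
                int r = rcs_stack[sp].ID;
                if (lcp[r] < lcp[i]) {          // a longer phrase starts here
                    sp++;
                    rcs_stack[sp].ID = i;
                    rcs_stack[sp].frequency = 2;
                    i++;
                } else if (lcp[r] == lcp[i]) {  // same phrase, one more suffix
                    rcs_stack[sp].frequency++;
                    i++;
                } else {                        // the phrase on top ends here
                    Output rcs_stack[sp];       // ID and frequency
                    int f = rcs_stack[sp].frequency;
                    sp--;
                    if (sp >= 0 && lcp[rcs_stack[sp].ID] >= lcp[i]) {
                        rcs_stack[sp].frequency = rcs_stack[sp].frequency + f - 1;
                    } else if (lcp[i] > 0) {
                        // a shorter phrase also covers the popped suffixes,
                        // so it starts with their count plus the current one
                        sp++;
                        rcs_stack[sp].ID = i;
                        rcs_stack[sp].frequency = f + 1;
                        i++;
                    }
                }
            }
        }
    }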
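
The stack-based scan above can be sketched in runnable form as follows. This is an illustrative version, not a definitive implementation: it builds the suffix and LCP arrays naively, uses 0-based sorted positions as IDs, and assumes zero sentinels at both ends of the LCP array. One detail is worth calling out: when an entry is popped and no remaining stack entry has an LCP at least as large as the current one, the new entry starts with the popped frequency plus one, which keeps shorter phrases such as `_` and `o` from being undercounted.

```python
def sorted_suffixes_and_lcp(text):
    """Naive suffix array plus LCP array with zero sentinels at both ends."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])

    def common(a, b):
        n = 0
        while max(a, b) + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        return n

    lcp = [0] + [common(sa[k - 1], sa[k]) for k in range(1, len(sa))] + [0]
    return sa, lcp


def discover_rcs(text):
    """Return (ID, frequency, phrase) for each right-complete substring."""
    sa, lcp = sorted_suffixes_and_lcp(text)
    n = len(text)
    stack = []   # entries are [ID, frequency]
    found = []
    i = 1
    while i < n + 1:
        if not stack:
            if lcp[i] > 0:
                stack.append([i, 2])
            i += 1
            continue
        top_id = stack[-1][0]
        if lcp[top_id] < lcp[i]:        # a longer phrase starts here
            stack.append([i, 2])
            i += 1
        elif lcp[top_id] == lcp[i]:     # same phrase, one more suffix
            stack[-1][1] += 1
            i += 1
        else:                           # the phrase on top ends here
            ident, freq = stack.pop()
            phrase = text[sa[ident]:sa[ident] + lcp[ident]]
            found.append((ident, freq, phrase))
            if stack and lcp[stack[-1][0]] >= lcp[i]:
                stack[-1][1] += freq - 1
            elif lcp[i] > 0:
                # a shorter phrase also covers the popped suffixes
                stack.append([i, freq + 1])
                i += 1
    return sorted(found)


rcs = discover_rcs("to_be_or_not_to_be")
```

For `to_be_or_not_to_be` this yields eight entries, among them `(2, 5, "_")`, `(12, 4, "o")`, `(16, 3, "t")`, and `(17, 2, "to_be")`.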

13
Feature extraction (Intersect lcs_rcs)
    void intersect_lcs_rcs(sorted lcs array, sorted rcs array) {
        int i = 0, j = 0;
        while (i < L && j < R) {
            string str_l = lcs[i].ID denoted LCS;
            string str_r = rcs[j].ID denoted RCS;
            if (str_l == str_r) {
                Output lcs[i];
                i++;
                j++;
            }
            if (str_l < str_r)
                i++;
            if (str_l > str_r)
                j++;
        }
    }

rcs array
ID   frequency   RCS
1    2           _be
2    5           _
6    2           be
8    2           e
11   2           o_be
12   4           o
16   3           t
17   2           to_be

cs array
ID   frequency   CS
2    5           _
12   4           o
16   3           t
17   2           to_be
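
The intersection of the two sorted arrays can be sketched as a two-pointer merge. In this illustration each entry is a hypothetical `(phrase, frequency)` pair compared by phrase text rather than by ID. The `rcs` values follow the example for `to_be_or_not_to_be`; the `lcs` values were worked out by hand for the same string and are not taken from the slides.

```python
def intersect_sorted(lcs, rcs):
    """Two-pointer intersection of two lists sorted by phrase text."""
    i = j = 0
    complete = []
    while i < len(lcs) and j < len(rcs):
        if lcs[i][0] == rcs[j][0]:
            complete.append(lcs[i])   # both left- and right-complete
            i += 1
            j += 1
        elif lcs[i][0] < rcs[j][0]:
            i += 1
        else:
            j += 1
    return complete


# Left-complete substrings (hand-derived for this sketch), sorted by phrase.
lcs = [("_", 5), ("o", 4), ("t", 3), ("to", 2), ("to_", 2), ("to_b", 2), ("to_be", 2)]
# Right-complete substrings, sorted by phrase.
rcs = [("_", 5), ("_be", 2), ("be", 2), ("e", 2), ("o", 4), ("o_be", 2), ("t", 3), ("to_be", 2)]

cs = intersect_sorted(lcs, rcs)
```

Because both inputs are sorted, each list is scanned only once. The result contains the four complete substrings `_`, `o`, `t`, and `to_be`.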
14
Identifying base clusters
15
Combining base clusters
  • Combining base clusters X and Y
      if ( |X ∩ Y| / |X ∪ Y| > t1 )
          X and Y are merged into one cluster
      else
          if ( |X| > |Y| )
              if ( |X ∩ Y| / |Y| > t2 )
                  let Y become X's child
          else
              if ( |X ∩ Y| / |X| > t2 )
                  let X become Y's child
  • Merging labels
      if ( label_x is a substring of label_y )
          label_xy = label_y
      else if ( label_y is a substring of label_x )
          label_xy = label_x
      else
          label_xy = label_x + label_y
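
The combination rules above can be sketched as follows, treating each base cluster as a non-empty set of document IDs. The threshold values, the returned relation names, and the space used to join two unrelated labels are assumptions made for illustration.

```python
def relate(x, y, t1=0.5, t2=0.5):
    """Decide how two non-empty base clusters relate.

    Returns 'merge', 'y_child_of_x', 'x_child_of_y', or 'separate'.
    The default thresholds are illustrative only.
    """
    x, y = set(x), set(y)
    common = len(x & y)
    if common / len(x | y) > t1:
        return "merge"
    if len(x) > len(y):
        return "y_child_of_x" if common / len(y) > t2 else "separate"
    return "x_child_of_y" if common / len(x) > t2 else "separate"


def merge_label(label_x, label_y):
    """Keep the longer label if it contains the other, else join both."""
    if label_x in label_y:
        return label_y
    if label_y in label_x:
        return label_x
    return label_x + " " + label_y
```

For example, `relate({1, 2, 3, 4}, {1, 2, 3, 5})` gives `"merge"` because the overlap ratio is 3/5, while `relate({1, 2, 3, 4, 5, 6}, {1, 2})` gives `"y_child_of_x"` because the smaller cluster is fully contained in the larger one.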

16
Prototype system
  • Created a prototype system named WICE (Web
    Information Clustering Engine).
  • Handles the special problems related to Chinese
    well.
  • Output for the query "object oriented":
      • object oriented programming
      • object oriented analysis, etc.

17
Conclusion
  • Main contributions
      • The benefits of using key phrases.
      • A suffix-array-based method for key phrase discovery.
      • The concept of orthogonal clustering.
      • The design and implementation of the WICE system.
  • Future work
      • Detailed analysis.
      • Further experiments.
      • Interpretation of the experimental results.
      • Comparison with other clustering algorithms.