Block-based Web Search - PowerPoint PPT Presentation

Slides: 28
Provided by: JRW8721

Title: Block-based Web Search


1
Block-based Web Search
  • Deng Cai¹, Shipeng Yu², Ji-Rong Wen, and
    Wei-Ying Ma
  • Microsoft Research Asia
  • ¹Tsinghua University
  • ²University of Munich

2
Problems in Traditional IR
  • Term-Document Irrelevance Problem
  • Noisy terms
  • Multiple topics
  • Variant Document Length Problem
  • Length normalization is important
  • Passage Retrieval in traditional IR
  • Partition the document to several passages
  • Solve the problem in some sense
  • Three types of passages: discourse, semantic,
    and window
  • Fixed-window passage is shown to be robust

3
Problems in Web IR
  • Noisy information
  • Navigation
  • Decoration
  • Interaction
  • Multiple topics
  • May contain text as well as images or links

[Figure labels: Noisy Information; Multiple Topics]

4
Problems in Web IR (Cont.)
  • Variant Document Length Problem
  • Conclusion: in web IR, all the problems of
    traditional IR remain and are more severe

                     TREC-24   TREC-45   WT10g       .GOV
Number of docs       524,929   556,077   1,692,096   1,247,753
Text size (MB)       2,059     2,134     10,190      18,100
Median length (KB)   2.5       2.5       3.3         7.5
Average length (KB)  4.0       3.9       6.3         15.2

5
Challenges in Web IR
  • New characteristics of web pages
  • Two-Dimensional
  • Logical Structure
  • Visual Layout
  • Presentation
  • Page segmentation can be achieved
  • Obtain blocks from web pages
  • Block-based web search is possible

[Figure labels: Font Size, Color, Space, Font Style, Separator]

6
Outline
  • Motivation
  • Page segmentation approaches
  • Web search using page segmentation
  • Block Retrieval
  • Block-level Query Expansion
  • Experiments and Discussions
  • Conclusion

7
Web Page Segmentation Approaches
Web Page Segmentation  FixedPS  DomPS      VIPS      CombPS
Passage Retrieval      Window   Discourse  Semantic  Semantic + Window
  • Fixed-length approach (FixedPS)
  • Traditional window-based passage retrieval
  • DOM-based approach (DomPS)
  • Like the natural paragraph in traditional passage
    retrieval
  • Vision-based Web Page Segmentation (VIPS)
  • Achieve a semantic partition to some extent
  • Combined Approach (CombPS)
  • Combines VIPS and fixed-length segmentation

8
Fixed-length Page Segmentation (FixedPS)
  • A block contains a fixed number of words
  • Traditional window-based methods can be applied
  • Approaches
  • Overlapped windows (e.g. Callan, SIGIR94)
  • Arbitrary passages of varying length (e.g.
    Kaszkiel et al, SIGIR97)
  • Results
  • A simple but robust approach
  • Does not consider semantic information
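
A minimal sketch of overlapped fixed-length windows, for illustration only. The function name and the half-window overlap are assumptions; the slides do not state how much the windows overlap.

```python
def fixed_windows(words, size=200, overlap=0.5):
    """Split a list of words into fixed-length, overlapping windows.

    `size` and `overlap` are illustrative parameters (assumptions):
    consecutive windows share roughly `overlap * size` words.
    """
    if size <= 0 or not 0 <= overlap < 1:
        raise ValueError("size must be positive and overlap in [0, 1)")
    step = max(1, int(size * (1 - overlap)))
    windows = []
    for start in range(0, len(words), step):
        windows.append(words[start:start + size])
        if start + size >= len(words):
            break  # this window already reaches the end of the text
    return windows
```

Every window except possibly the last has exactly `size` words, which is what makes scores comparable across blocks regardless of the original page length.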

9
DOM-based Page Segmentation (DomPS)
  • Rely on the DOM structure to partition the page
  • DOM: Document Object Model
  • Current approaches
  • Based only on tags (e.g. Crivellari et al, TREC 9)
  • Combine tags with contents and links (e.g.
    Chakrabarti et al, SIGIR01)
  • Results
  • Similar to discourse in passage retrieval
  • DOM represents only part of the semantic
    structure
  • Imprecise content structure
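
A rough sketch of tag-only DOM segmentation using Python's standard `html.parser`. The set of structural tags below is hypothetical (the slides do not list the tags used), and this is not the implementation from any of the cited systems.

```python
from html.parser import HTMLParser

# Hypothetical set of structural tags; an assumption for illustration.
BLOCK_TAGS = {"p", "div", "td", "li", "h1", "h2", "h3", "table", "tr", "ul"}


class DomSegmenter(HTMLParser):
    """Close the current block at every structural tag boundary."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._current = []

    def flush(self):
        text = " ".join(self._current).strip()
        if text:
            self.blocks.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if data.strip():
            self._current.append(data.strip())


def dom_blocks(html):
    """Return the text blocks delimited by structural tags."""
    parser = DomSegmenter()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any trailing free text
    return parser.blocks
```

Inline tags such as `<b>` do not split a block, while free text between two structural tags becomes a block of its own.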

10
VIPS Algorithm
  • Motivation
  • Topics can be distinguished with visual cues in
    many cases
  • Utilize the two-dimensional structure of web
    pages
  • Goal
  • Extract the semantic structure of a web page to
    some extent, based on its visual presentation
  • Procedure
  • Partition the web page top-down based on the
    separators
  • Result
  • A tree structure in which each node corresponds
    to a block in the page
  • Each node is assigned a value (Degree of
    Coherence) indicating how coherent the content
    of the block is, based on visual perception
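
The resulting tree can be sketched as a small data structure. This covers only the output side: the visual separator detection that VIPS performs is not implemented here. The class name, the field names, and the assumption that the degree of coherence lies in [0, 1] are illustrative; the 0.6 threshold is the value quoted in the experiments.

```python
from dataclasses import dataclass, field


@dataclass
class VisualBlock:
    text: str
    doc: float  # degree of coherence, assumed here to lie in [0, 1]
    children: list = field(default_factory=list)


def extract_blocks(node, pdoc=0.6):
    """Descend until a node is coherent enough (doc >= pdoc) or is a leaf."""
    if node.doc >= pdoc or not node.children:
        return [node]
    blocks = []
    for child in node.children:
        blocks.extend(extract_blocks(child, pdoc))
    return blocks
```

Raising `pdoc` yields finer blocks, lowering it yields coarser ones.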

11
VIPS: An Example
Microsoft Technical Report MSR-TR-2003-79
12
Combined Approach (CombPS)
  • VIPS solves the problems of noisy information and
    multi-topics
  • FixedPS can deal with the variant document length
    problem
  • Combine these two
  • Partition the web page using VIPS
  • Divide the blocks containing more words than the
    pre-defined window length
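
A minimal sketch of the combination step, assuming the visual blocks are already available as lists of words. The function name is hypothetical, and splitting long blocks into non-overlapping windows is a simplification; the slides do not say whether the windows overlap.

```python
def comb_ps(visual_blocks, window=200):
    """Keep short visual blocks whole; cut long ones into fixed windows.

    visual_blocks: list of word lists produced by a visual segmenter
    (hypothetical input format).
    """
    result = []
    for words in visual_blocks:
        if len(words) <= window:
            result.append(words)
        else:
            for start in range(0, len(words), window):
                result.append(words[start:start + window])
    return result
```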

[Figure: block length after segmenting 50,000 pages,
chosen from the WT10g data set, using VIPS]
13
Web Page Segmentation Summary
Web Page Segmentation  FixedPS  DomPS      VIPS      CombPS
Passage Retrieval      Window   Discourse  Semantic  Semantic + Window
  • Fixed-length approach (FixedPS)
  • Traditional passage retrieval
  • DOM-based approach (DomPS)
  • Like the natural paragraph in traditional passage
    retrieval
  • Vision-based Web Page Segmentation (VIPS)
  • Achieve a semantic partition to some extent
  • Combined Approach (CombPS)
  • Combines VIPS and fixed-length segmentation

14
Outline
  • Motivation
  • Page segmentation approaches
  • Web search using page segmentation
  • Block Retrieval
  • Block-level Query Expansion
  • Experiments and Discussions
  • Conclusion

15
Block Retrieval
  • Similar to traditional passage retrieval
  • Retrieve blocks instead of full documents
  • Combine the relevance of blocks with relevance of
    documents
  • Goal
  • Verify whether page segmentation can deal with
    both the length normalization and multiple-topic
    problems

16
Block-level Query Expansion
  • Similar to passage-level pseudo-relevance
    feedback
  • Expansion terms are selected from top blocks
    instead of top documents
  • Goal
  • Test whether page segmentation can benefit the
    selection of query terms by increasing term
    correlations within a block, and thus improve
    the final performance

17
Outline
  • Motivation
  • Page segmentation approaches
  • Web search using page segmentation
  • Block Retrieval
  • Block-level Query Expansion
  • Experiments and Discussions
  • Conclusion

18
Experiments
  • Methodology
  • Fixed-length window approach (FixedPS)
  • Overlapping windows of 200 words
  • DOM-based approach (DomPS)
  • Traverse the DOM tree for certain structural tags
  • A block is constructed for, and identified by,
    each such leaf tag
  • Free text between two tags is treated as a
    special block
  • Vision-based approach (VIPS)
  • The permitted degree of coherence is set to 0.6
  • All the leaf nodes are extracted as visual blocks
  • The combined approach (CombPS)
  • VIPS then FixedPS
  • Full document approach (FullDoc)
  • No segmentation is performed

19
Experiments (Cont.)
  • Dataset
  • TREC 2001 Web Track
  • WT10g corpus (1.69 million pages), crawled in
    1997
  • 50 queries (topics 501-550)
  • TREC 2002 Web Track
  • .GOV corpus (1.25 million pages), crawled in 2002
  • 49 queries (topics 551-600)
  • Retrieval System
  • Okapi, with weighting function BM2500
  • Preprocessing
  • Standard stop-word list
  • No stemming or phrase information is used
  • Tune parameters in BM2500 to achieve best
    baselines
  • Evaluation criterion: P_at_10 (precision at 10)
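
For reference, precision at 10 is the fraction of the top 10 results that are relevant, averaged over queries. A small sketch with hypothetical function names:

```python
def precision_at_k(ranked_doc_ids, relevant_ids, k=10):
    """Fraction of the top-k results that are relevant.

    Divides by k even if fewer than k documents were retrieved.
    """
    top = ranked_doc_ids[:k]
    hits = sum(1 for doc_id in top if doc_id in relevant_ids)
    return hits / k


def mean_precision_at_k(runs, k=10):
    """runs: list of (ranked_doc_ids, relevant_ids) pairs, one per query."""
    return sum(precision_at_k(r, rel, k) for r, rel in runs) / len(runs)
```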

20
Experiments on Block Retrieval
  • Steps
  • Do original document retrieval
  • Obtain a document rank DR
  • Analyze the top N (1000 here) documents to get a
    block set
  • Do block retrieval on the block set (same as Step
    1, but with blocks in place of documents)
  • Obtain a block rank BR
  • Documents are re-ranked by the single best block
    in each document
  • Combine BR and DR to get a new document rank
  • A tuning parameter controls the combination
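
The last steps can be sketched as follows. The combination formula is not legible in this transcript, so the linear interpolation of rank positions below, and the parameter name `alpha`, are assumptions made for illustration and not necessarily the authors' formula.

```python
def rerank_by_best_block(block_scores):
    """block_scores: {doc_id: [score of each block]}.

    Returns doc ids ordered by their single best block (the block rank BR).
    """
    best = {doc: max(scores) for doc, scores in block_scores.items() if scores}
    return sorted(best, key=lambda d: best[d], reverse=True)


def combine_ranks(dr, br, alpha=0.5):
    """Interpolate rank positions from DR and BR (assumed formula).

    alpha=1 reproduces DR; alpha=0 reproduces BR.
    """
    dr_pos = {doc: i + 1 for i, doc in enumerate(dr)}
    br_pos = {doc: i + 1 for i, doc in enumerate(br)}
    worst = len(dr) + 1  # position given to documents missing from BR
    combined = {
        doc: alpha * dr_pos[doc] + (1 - alpha) * br_pos.get(doc, worst)
        for doc in dr
    }
    return sorted(dr, key=lambda d: combined[d])
```

For example, with DR = [a, b, c] and best-block scores a: 0.2, b: 0.9, c: 0.5, BR is [b, c, a], and `alpha=0.5` gives the combined order [b, a, c].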

21
Block Retrieval on TREC 2001 and TREC 2002 (P_at_10)
Result on TREC 2002 (P_at_10)
Page Segmentation  Baseline  BR only  BR + DR (best)
DomPS              0.2286    0.1571   0.2286
FixedPS            0.2286    0.1776   0.2317
VIPS               0.2286    0.2163   0.2408
CombPS             0.2286    0.1939   0.2379

Result on TREC 2001 (P_at_10)
Page Segmentation  Baseline  BR only  BR + DR (best)
DomPS              0.312     0.252    0.322
FixedPS            0.312     0.304    0.326
VIPS               0.312     0.316    0.328
CombPS             0.312     0.326    0.338

22
Experiments on Block-level Query Expansion
  • Steps
  • Same steps as block retrieval
  • Do original document retrieval to get DR
  • Analyze top N (1000 here) documents to get a
    block set
  • Do block retrieval on the block set to get BR
  • Select some expansion terms based on top blocks
  • 10 expansion terms in our experiments
  • Number of top blocks is a tuning parameter
  • Document retrieval with the expanded query
  • Modify the term weights before final retrieval
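
A simplified sketch of picking expansion terms from the top-ranked blocks. The slides do not give the term selection function, so ranking candidates by the number of top blocks that contain them is an assumption, as are the function and parameter names. The term re-weighting step is not shown.

```python
from collections import Counter


def expansion_terms(top_blocks, query_terms, n_terms=10, stop_words=frozenset()):
    """Pick expansion terms from the highest-ranked blocks.

    top_blocks: list of token lists, one per block (hypothetical input).
    Candidates are ranked by how many top blocks contain them, which is
    an assumed selection criterion.
    """
    exclude = set(query_terms) | set(stop_words)
    block_freq = Counter()
    for tokens in top_blocks:
        block_freq.update(set(tokens) - exclude)
    # Sort by block frequency (descending), then alphabetically for determinism.
    ranked = sorted(block_freq, key=lambda t: (-block_freq[t], t))
    return ranked[:n_terms]
```

Because each block ideally covers a single topic, terms that co-occur with the query inside a top block are more likely to be on topic than terms drawn from anywhere in a long, multi-topic page.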

23
Query Expansion on TREC 2001 and TREC 2002 (P_at_10)
Result on TREC 2002 (P_at_10)
                             Query Expansion (best)
Page Segmentation  Baseline  P_at_10  Improvement (%)
FullDoc            0.2286    0.2082   -8.9
DomPS              0.2286    0.2224   -2.7
FixedPS            0.2286    0.2327   1.8
VIPS               0.2286    0.2327   1.8
CombPS             0.2286    0.2388   4.5

Result on TREC 2001 (P_at_10)
                             Query Expansion (best)
Page Segmentation  Baseline  P_at_10  Improvement (%)
FullDoc            0.312     0.326    4.5
DomPS              0.312     0.324    3.8
FixedPS            0.312     0.36     15.4
VIPS               0.312     0.362    16.0
CombPS             0.312     0.366    17.3
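
The Improvement values are consistent with the relative change over the baseline, in percent. A quick check, with a hypothetical helper name:

```python
def relative_improvement(baseline, score):
    """Relative change over the baseline, in percent, rounded to one decimal."""
    return round((score - baseline) / baseline * 100, 1)


# CombPS rows from the two tables
print(relative_improvement(0.2286, 0.2388))  # baseline 0.2286
print(relative_improvement(0.312, 0.366))    # baseline 0.312
```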
24
Discussions
  • FullDoc can only obtain a low and insignificant
    result
  • The baseline is low, so many top ranked documents
    are actually irrelevant
  • DomPS is not good and very unstable
  • The segmentation is too detailed
  • Semantic blocks can hardly be detected, so the
    expansion terms are poor
  • FixedPS is stable and good
  • Similar to the result in traditional IR
  • A window may miss the real semantic blocks
  • VIPS is very good
  • Top blocks usually have very good quality
  • Length normalization is still a problem
  • CombPS is almost the best method in all
    experiments
  • More than just a tradeoff

25
Outline
  • Motivation
  • Page segmentation approaches
  • Web search using page segmentation
  • Block Retrieval
  • Block-level Query Expansion
  • Experiments and Discussions
  • Conclusion

26
Conclusion
  • Page segmentation is effective for improving web
    search
  • Block Retrieval
  • Block-level Query Expansion
  • Plain-text retrieval → fixed-window partition
  • Web information retrieval → semantic partition
    (VIPS)
  • Integrating both semantic and fixed-length
    properties (CombPS) could deal with all problems
    and achieve the best performance
  • We believe that block-based web search can be
    very useful in real search engines, and can also
    be very easily combined with block-level link
    analysis

27
Thanks!