Title: Relevance Propagation for Web Search
1Relevance Propagation for Web Search
- Dr. Tie-Yan Liu
- Web Search and Mining Group
- Microsoft Research Asia
- Joint Work with Tao Qin, Tsinghua University.
2Outline
- Introduction
- Generic framework for relevance propagation
- Evaluations
- Effectiveness analysis
- Complexity analysis
- Conclusions
3Introduction
- Web Search ≠ Information Retrieval
- Besides content relevance, various kinds of
structural information also play an important role
in Web search:
- Hyperlink graph
- Local sitemap
- Webpage layout
4Introduction
- Three ways of utilizing the structure information
for Web search
- Linear combination of content relevance and
importance scores computed from the hyperlink graph
- β · Relevance + (1 − β) · PageRank
- Enhance link analysis with the help of content
relevance
- Query-dependent link graph in HITS
- Topic-sensitive PageRank
- Propagate content relevance along the Web
structure
- The use of anchor text in search engines
- Hyperlink-based relevance score propagation (TREC
2003)
- Sitemap-based feature propagation (TREC 2004)
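The linear combination in the first approach can be sketched as follows. This is a minimal illustration, not the authors' implementation: the slide only gives the form β · Relevance + (1 − β) · PageRank, so the function names `min_max` and `combine_scores` and the min-max normalization step are assumptions added here.

```python
def min_max(scores):
    """Rescale a non-empty dict of scores to [0, 1].

    Assumed normalization (not stated on the slide) so that the two
    score types are on a comparable scale before mixing.
    """
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {page: 0.0 for page in scores}
    return {page: (value - lo) / (hi - lo) for page, value in scores.items()}


def combine_scores(relevance, pagerank, beta):
    """Return beta * relevance + (1 - beta) * pagerank for every page."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    rel = min_max(relevance)
    pr = min_max(pagerank)
    # Pages without a PageRank entry are treated as having importance 0.
    return {page: beta * rel[page] + (1.0 - beta) * pr.get(page, 0.0)
            for page in rel}
```

With beta = 1 the ranking depends on content relevance only; with beta = 0 it depends on link-based importance only.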
5Hyperlink-based Relevance Score Propagation (Zhai
et al, TREC2003)
- Assumption
- Hyperlinked pages have correlated content
[Figure: a page and its hyperlinked neighbors; labels: outlinks, links]
6Hyperlink-based Relevance Score Propagation (Zhai
et al, TREC2003)
- Assumption
- Hyperlinked pages have correlated content
- Propagation model
- Weighted inlink model
- Weighted outlink model
- Uniform outlink model
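The formulas for the three models are not preserved in this transcript. The sketch below assumes the commonly used interpolation form, in which a page's new score mixes its own content score with the scores of linked pages: h_next(p) = α · s(p) + (1 − α) · Σ w · h(neighbor). In the inlink model the neighbors are the pages linking to p; in the outlink models they are the pages p links to. The function name, the `mode` strings, the default α, and the default link weight of 1.0 are all hypothetical.

```python
def propagate_scores(scores, links, alpha=0.8, iterations=3,
                     mode="weighted_inlink", weights=None):
    """Iteratively propagate relevance scores along hyperlinks.

    scores:  dict page -> initial content relevance score s(p)
    links:   iterable of (src, dst) hyperlinks
    weights: dict (src, dst) -> link weight, ignored in uniform mode
    """
    if mode not in ("weighted_inlink", "weighted_outlink", "uniform_outlink"):
        raise ValueError(f"unknown mode: {mode}")
    weights = weights or {}

    # sources[p] lists the (page, weight) pairs that p receives score from.
    sources = {page: [] for page in scores}
    for src, dst in links:
        if src not in scores or dst not in scores:
            continue  # only links inside the working set are used
        w = 1.0 if mode == "uniform_outlink" else weights.get((src, dst), 1.0)
        if mode == "weighted_inlink":
            sources[dst].append((src, w))  # dst receives from pages linking to it
        else:
            sources[src].append((dst, w))  # src receives from pages it links to

    h = dict(scores)
    for _ in range(iterations):
        h = {page: alpha * scores[page]
                   + (1.0 - alpha) * sum(w * h[q] for q, w in sources[page])
             for page in scores}
    return h
```

Each iteration touches every link in the working set once, which is where a per-iteration cost proportional to the number of pages times the average link count comes from.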
7Sitemap-based Feature Propagation (Liu and Qin,
TREC2004)
- Assumption
- Child pages are extensions of their parent page
- One should consider the contribution of the child
pages while computing the relevance of the parent
page to a query.
- Propagation model
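The propagation formula itself is not preserved in this transcript. As a rough sketch of the idea, assume the parent's new feature value is a mix of its own value and the average over its child pages: f_new(p) = α · f(p) + (1 − α) · mean of f(child). The mixing form, the function name `propagate_features`, the default α, and the choice to leave childless pages unchanged are assumptions made for illustration only.

```python
def propagate_features(features, children, alpha=0.8):
    """Propagate per-term features from child pages to their parent.

    features: dict page -> dict term -> feature value (e.g. term frequency)
    children: dict page -> list of child pages in the sitemap
    Returns a new dict; the input is not modified.
    """
    propagated = {}
    for page, own in features.items():
        kids = [c for c in children.get(page, []) if c in features]
        if not kids:
            # Assumption: pages without children keep their features as-is.
            propagated[page] = dict(own)
            continue
        terms = set(own)
        for child in kids:
            terms.update(features[child])
        propagated[page] = {
            term: alpha * own.get(term, 0.0)
                  + (1.0 - alpha)
                  * sum(features[c].get(term, 0.0) for c in kids) / len(kids)
            for term in terms
        }
    return propagated
```

Because the result depends only on the pages and the sitemap, not on the query, the propagated features can be computed once and then fed to an ordinary ranking function.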
8Generic Relevance Propagation Framework
- Modification of the sitemap-based feature
propagation model
- Reminder of the hyperlink-based propagation model
- A generic framework to cover both hyperlink-based
and sitemap-based propagations
9More Derived Propagation Models
10Summary: All Models Covered by the Generic
Framework
11Benchmark Datasets
- Corpora
- .GOV
- 1M pages
- Queries: TREC topic distillation (TD) 2003, 2004
- MSN
- 2M pages
- Queries: the 100 most popular queries from the MSN
query log
- Base Ranking function
- BM2500
12Experimental Results (1)
TREC 2003
13Experimental Results (2)
TREC 2004
14Experimental Results (3)
MSN
15Conclusions on Effectiveness
- In general, relevance propagation can boost
search performance with proper parameter
settings
- The sitemap-based models are more effective than
the hyperlink-based models
- Hyperlinks do not necessarily imply content
correlation, while pages in the same sub-site
usually discuss correlated topics.
- Detailed comparisons
- The two sitemap-based models have similar
performance.
- Among the hyperlink-based models, the HF-WI model
performs best.
16Online Complexity
- w is the size of the working set, q is the number
of query terms, l is the average number of
inlinks / outlinks, and t is the number of
iterations.
- For the SS (sitemap-based, score-level) model, the
complexity is O(w).
- The SS model needs to propagate the relevance
score of a page to its parent only once if we
conduct the propagation from the leaf nodes in a
bottom-up manner.
- For the SF (sitemap-based, feature-level) model,
the complexity is O(qw).
- For the HS (hyperlink-based, score-level) models,
the complexity is O(twl).
- In each of the t iterations of the HS models, we
need to propagate the relevance score of a page
along its inlinks or outlinks in the subgraph of
the working set.
- For the HF (hyperlink-based, feature-level) models,
the complexity is O(tqwl).
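The O(w) argument for the SS model can be illustrated with a single bottom-up pass over the sitemap. This is a sketch under assumptions: the mixing rule (α · own score + (1 − α) · mean of the children's propagated scores), the function name `propagate_bottom_up`, and the returned transfer counter are hypothetical, and the sitemap is assumed to be a forest (no cycles).

```python
def propagate_bottom_up(scores, parent, alpha=0.8):
    """One bottom-up pass over the sitemap of the working set.

    scores: dict page -> relevance score
    parent: dict page -> parent page (missing or outside the set = root)
    Returns (propagated scores, number of child-to-parent transfers).
    """
    children = {page: [] for page in scores}
    roots = []
    for page in scores:
        par = parent.get(page)
        if par in scores:
            children[par].append(page)
        else:
            roots.append(page)

    # Pre-order walk from the roots; reversing it visits children first.
    order = []
    stack = list(roots)
    while stack:
        page = stack.pop()
        order.append(page)
        stack.extend(children[page])

    result = {}
    transfers = 0
    for page in reversed(order):
        kids = children[page]
        if kids:
            transfers += len(kids)  # each child is read exactly once
            mean = sum(result[c] for c in kids) / len(kids)
            result[page] = alpha * scores[page] + (1.0 - alpha) * mean
        else:
            result[page] = scores[page]  # assumption: leaves keep their score
    return result, transfers
```

Every page other than a root passes its score upward exactly once, so the number of transfers is at most w − 1 and no iteration to convergence is needed.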
17Online Complexity
- The sitemap-based models are more efficient than
the hyperlink-based models.
- The score-level propagation models are faster than
the feature-level models.
18Offline Complexity
- Score-level propagation is very difficult to
implement offline
- The score can only be computed online w.r.t. the
query.
- For feature-level propagation:
- The time complexity of the SF model for offline
implementation is acceptable
- 62.2 hours, or 2.6 days, to re-index 8 billion
pages
- The time complexity of the HF model is
unacceptable.
- 1083 hours, or 45 days, to re-index 8 billion
pages
- The SF model is easy to parallelize, while a
parallel implementation of the HF model is
non-trivial
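The hour and day figures above can be checked with a few lines of arithmetic. The per-page cost below is a derived quantity, not a number from the slides, and it assumes the quoted hours are total elapsed time for the whole 8-billion-page index; the helper name `offline_cost` is hypothetical.

```python
HOURS_PER_DAY = 24
PAGES = 8_000_000_000  # index size quoted on the slide


def offline_cost(hours, pages=PAGES):
    """Convert a total re-indexing time into days and microseconds per page."""
    days = hours / HOURS_PER_DAY
    microseconds_per_page = hours * 3600 * 1_000_000 / pages
    return days, microseconds_per_page


sf_days, sf_us = offline_cost(62.2)   # sitemap-based feature propagation
hf_days, hf_us = offline_cost(1083)   # hyperlink-based feature propagation
slowdown = 1083 / 62.2                # how much longer HF takes than SF

print(f"SF: {sf_days:.1f} days, {sf_us:.1f} microseconds per page")
print(f"HF: {hf_days:.1f} days, {hf_us:.1f} microseconds per page")
print(f"HF / SF time ratio: {slowdown:.1f}")
```

The ratio shows the gap between the two feature-level models more directly than the raw hour counts.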
19Conclusions of this Study
- Generally speaking, relevance propagation can
boost the performance of web information
retrieval.
- Sitemap-based propagation models outperform
hyperlink-based propagation models in terms of
both effectiveness and efficiency. Notably,
sitemap-based propagation can be implemented in
parallel.
- Score-level propagation and feature-level
propagation have similar effectiveness.
Although the former is more efficient in online
implementations, it is not practical for
real-world search engines because it cannot be
implemented offline.
- Overall, the sitemap-based feature propagation
model is the best choice for real search engines.
20Thanks!
- tyliu_at_microsoft.com
- http://research.microsoft.com/users/tyliu/