Topical Link Analysis for Web Search - PowerPoint PPT Presentation

Title: Topical Link Analysis for Web Search

1
Topical Link Analysis for Web Search

Lan Nie, Brian D. Davison and Xiaoguang Qi
Computer Science and Engineering
Lehigh University, USA

Introduction
Topical link analysis model
Experiments
Conclusion

3
Question
Is http://www.rottentomatoes.com/ a good answer
to the query "tomatoes"?
4
Traditional Link Analysis
Yes, it is a famous web site!

Simple summation of incoming authority flows
without considering from which communities that
authority is derived.
A resource that is highly popular for one topic
may dominate the results of another topic in
which it is less authoritative.

5
Topical Link Analysis
No, it is famous for entertainment, not for food!

Split the generic authority score into a vector
to record a page's reputation with respect to
different topics
Use a topic distribution to embody the context of
a link and thus affect the authority propagation
Topical random surfer model

Example: a generic authority score of 0.7 becomes a
topical authority vector such as (0.4, 0.1, 0.2)
6

Introduction
Topical link analysis model
Experiments
Conclusion

7
Topical Random Surfer Model
8

Content Vector

query vector

Static, generated by textual classifier
Purely based on content

Authority Vector

Dynamic, computed by topical link analysis
approach
Combination of text and linkage information

Query-specific importance score
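
The scoring formula itself did not survive in this transcript. A minimal sketch, assuming the query-specific importance score is the dot product of a page's authority vector with the query's topic vector. The topic order, the gardening page, and the query vector are hypothetical; only the vector (0.4, 0.1, 0.2) appears in the slides.

```python
def query_specific_score(authority, query_topics):
    """Dot product of a page's authority vector and a query's topic vector.

    Assumption: this is how the two vectors are combined; the slide only
    names the vectors and the resulting score.
    """
    if len(authority) != len(query_topics):
        raise ValueError("vectors must cover the same topics")
    return sum(a * c for a, c in zip(authority, query_topics))


# Hypothetical topic order: (entertainment, food, science).
movie_site = [0.4, 0.1, 0.2]       # high total authority, mostly entertainment
garden_site = [0.05, 0.30, 0.05]   # lower total authority, mostly food
tomatoes_query = [0.1, 0.8, 0.1]   # query classified mainly as food

movie_score = query_specific_score(movie_site, tomatoes_query)
garden_score = query_specific_score(garden_site, tomatoes_query)
```

Under these made-up numbers the movie site has the larger total authority, yet the gardening page gets the larger score for the food-oriented query, which is the behavior the "tomatoes" question asks for.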

9
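Topical HITS
Topical PageRank
[Equations not preserved in this transcript. The slide pairs each
topical model with its page-level counterpart: Topical HITS with
Normalized HITS, and Topical PageRank with PageRank.]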
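
Since the slide's equations are missing, the following is only one plausible reading of a topical random surfer, not the authors' exact formulation. It assumes that when following a link the surfer keeps the current topic with probability `alpha` and otherwise picks a topic from the source page's content vector, and that a random jump lands on a page in a topic drawn from that page's content vector. The names `d`, `alpha`, and the toy graph are assumptions, and every page is assumed to have at least one out-link.

```python
def topical_pagerank(out_links, content, d=0.15, alpha=0.5, iters=100):
    """Return {page: authority vector}. content[p] must sum to 1."""
    pages = list(out_links)
    n = len(pages)
    t = len(content[pages[0]])
    # Start with total authority 1/n per page, split by the page's content.
    auth = {p: [content[p][k] / n for k in range(t)] for p in pages}
    for _ in range(iters):
        # Random jump: probability d, landing topic follows the target's content.
        new = {p: [d / n * content[p][k] for k in range(t)] for p in pages}
        for u in pages:
            total_u = sum(auth[u])
            share = (1.0 - d) / len(out_links[u])
            for v in out_links[u]:
                for k in range(t):
                    # Stay in topic k, or re-pick topic k from u's content.
                    flow = alpha * auth[u][k] + (1.0 - alpha) * content[u][k] * total_u
                    new[v][k] += share * flow
        auth = new
    return auth


def pagerank(out_links, d=0.15, iters=100):
    """Plain PageRank with the same jump probability, for comparison."""
    pages = list(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: d / n for p in pages}
        for u in pages:
            for v in out_links[u]:
                new[v] += (1.0 - d) * rank[u] / len(out_links[u])
        rank = new
    return rank


# Toy graph and two-topic content vectors (hypothetical).
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
topics = {"a": [0.8, 0.2], "b": [0.5, 0.5], "c": [0.1, 0.9]}
topical = topical_pagerank(graph, topics)
plain = pagerank(graph)
```

In this sketch, summing a page's topical vector recovers its plain PageRank score, so the topic-level bookkeeping leaves page-level authority unchanged.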
10

Introduction
Topical link analysis model
Experiments
Conclusion

11
Textual Classification

Topics
12 top-level categories from the dmoz ODP hierarchy
Classifier
McCallum's Rainbow (Method: Naïve Bayes)
http://www.cs.umass.edu/~mccallum/bow/rainbow/
Training Set
19,000 randomly selected docs per category from
ODP
Classification
Generate a normalized content vector across the 12
topics for each web document and query in the
experimental data sets.
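
To make the classification step concrete, here is a small multinomial Naïve Bayes sketch that turns a text into a normalized topic vector. It is not Rainbow and does not reproduce its options; the two toy topics, the training sentences, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """labeled_docs: iterable of (topic, text). Returns a simple count model."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for topic, text in labeled_docs:
        words = text.lower().split()
        word_counts[topic].update(words)
        doc_counts[topic] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab


def content_vector(text, model):
    """Return {topic: probability}, normalized to sum to 1."""
    word_counts, doc_counts, vocab = model
    topics = sorted(doc_counts)
    total_docs = sum(doc_counts.values())
    log_post = []
    for topic in topics:
        total_words = sum(word_counts[topic].values())
        lp = math.log(doc_counts[topic] / total_docs)  # topic prior
        for w in text.lower().split():
            if w in vocab:  # words never seen in training are ignored
                # Laplace (add-one) smoothing
                lp += math.log((word_counts[topic][w] + 1) / (total_words + len(vocab)))
        log_post.append(lp)
    # Normalize in log space to avoid underflow.
    m = max(log_post)
    exps = [math.exp(lp - m) for lp in log_post]
    z = sum(exps)
    return {topic: e / z for topic, e in zip(topics, exps)}


model = train_naive_bayes([
    ("food", "tomato soup recipe salad"),
    ("entertainment", "movie review film rating"),
])
vec = content_vector("tomato salad recipe", model)
```

The same function can be applied to a query string, which yields the query vector used later in scoring.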

12
Data Sets

(1) Global Dataset
Data Collection
TREC .GOV collection (2002)
1,053,372 text/html files
Fifty queries from the 2003 topic distillation task
Ranking
Ranking algorithms in the PageRank model

(2) Query-specific Datasets
20 selected hot queries
Data Collection
Root Set: Top 200 URLs returned by Yahoo
Expansion: First 50 incoming pages (found by querying
Yahoo) and all the outgoing pages
About 5,000 docs per query on average
Ranking
Ranking algorithms using the Hub and Authority model
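
The query-specific collection step can be sketched as a root-set expansion. This assumes the limit of 50 incoming pages applies per root URL; `in_links` and `out_links` are hypothetical lookup tables standing in for search-engine link queries and page parsing.

```python
def build_query_graph(root_set, in_links, out_links, max_in=50):
    """Return the page set for one query.

    root_set:  list of root URLs (e.g. the top search results)
    in_links:  {url: ordered list of pages linking to url}
    out_links: {url: list of pages url links to}
    max_in:    how many incoming pages to keep per root URL (assumed per URL)
    """
    pages = set(root_set)
    for url in root_set:
        pages.update(in_links.get(url, [])[:max_in])  # first max_in in-linkers
        pages.update(out_links.get(url, []))          # all outgoing pages
    return pages


# Tiny hypothetical example with max_in lowered to 2.
example_in = {"r1": ["i1", "i2", "i3"]}
example_out = {"r1": ["o1"], "r2": ["o2", "o3"]}
example_pages = build_query_graph(["r1", "r2"], example_in, example_out, max_in=2)
```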

14
Experiments on Global Dataset

Competitors
Topical PageRank (T-PR)
Traditional PageRank (PR)
Topic Sensitive PageRank (TSPR)
Intelligent Surfer (IS)
BM2500 (BM)
Evaluation
Relevance judgments provided by TREC
P@10, mean average precision (MAP), and R-precision (Rprec)
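
For reference, the three measures can be computed as below, using their standard definitions with binary relevance judgments. The function names and the sample ranking are illustrative only.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant doc appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant)


def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    return precision_at_k(ranked, relevant, r) if r else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked, relevant) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


sample_ranked = ["d1", "d2", "d3", "d4"]
sample_relevant = {"d1", "d3"}
```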

15
Combination of IR and Importance Score

Typical approach [Cai et al., 2004]
OKAPI BM2500 [Robertson, 1997] weighting
Linear combination
Output top results of the combined list
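
A minimal sketch of the linear combination, assuming both score lists are min-max normalized before mixing and that `gamma` is the weight of the IR score. The slides do not say whether scores or rank positions are combined, so the score-based form here is an assumption, as are the function names and sample values.

```python
def min_max(scores):
    """Rescale a non-empty {doc: score} dict to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {doc: (s - lo) / span if span else 0.0 for doc, s in scores.items()}


def combine(ir_scores, importance, gamma=0.5, top_n=10):
    """Rank docs by gamma * IR + (1 - gamma) * importance; return the top_n."""
    ir = min_max(ir_scores)
    imp = min_max(importance)
    docs = set(ir) | set(imp)
    combined = {
        doc: gamma * ir.get(doc, 0.0) + (1.0 - gamma) * imp.get(doc, 0.0)
        for doc in docs
    }
    # Sort by descending combined score, breaking ties by document id.
    return sorted(combined, key=lambda doc: (-combined[doc], doc))[:top_n]


# Hypothetical scores for three documents.
bm_scores = {"a": 3.0, "b": 1.0, "c": 2.0}
link_scores = {"a": 0.1, "b": 0.9, "c": 0.5}
ir_only = combine(bm_scores, link_scores, gamma=1.0)
link_only = combine(bm_scores, link_scores, gamma=0.0)
```

Setting `gamma` to 1 reproduces the pure IR ranking and setting it to 0 reproduces the pure importance ranking; values in between trade the two off.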

16
Performance Comparison
17
Experiments on Query-specific Datasets

Competitors
Topical HITS (T-HITS)
Traditional HITS w/ normalization (HITS / N-HITS)
BH imp w/ normalization (IMP / N-IMP)
CLEVER's ARC weighting w/ normalization (ARC / N-ARC)
Evaluation
Human evaluation system, 43 participants in total
Metrics: P@10 and S@10

19
Performance Comparison
Precision
Human assessment
20

Introduction
Topical link analysis model
Experiments
Conclusion

21
Summary of Contributions
In this work, we have

Introduced a topical random walk model that
probabilistically combines page content (via a
topic distribution) and link structure.
Re-implemented a number of well-known ranking
algorithms and conducted performance comparisons.
Demonstrated topic-level transitions within
global authority propagation without affecting
the global transition probabilities.

22
References

K. Bharat and M. R. Henzinger. Improved algorithms for topic distillation in hyperlinked environments. In Proc. of the 21st Int'l ACM SIGIR Conference on Research and Development in Information Retrieval, pages 104-111, Aug. 1998.
D. Cai, X. He, J.-R. Wen, and W.-Y. Ma. Block-level link analysis. In Proc. of the 27th Annual Int'l ACM SIGIR Conference on Research and Development in Information Retrieval, July 2004.
S. Chakrabarti, B. E. Dom, P. Raghavan, S. Rajagopalan, D. Gibson, and J. M. Kleinberg. Automatic resource compilation by analyzing hyperlink structure and associated text. In Proc. of the 7th Int'l World Wide Web Conference, pages 65-74, Brisbane, Australia, Apr. 1998.
T. H. Haveliwala. Topic-sensitive PageRank. In Proc. of the Eleventh Int'l World Wide Web Conference, Honolulu, Hawaii, May 2002.
IBM Almaden Research Center. The CLEVER Project. Home page: http://www.almaden.ibm.com/cs/k53/clever.html, 2000.
J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604-632, 1999.
L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the Web. Unpublished draft, 1998.
A. McCallum. Rainbow text classification tool. http://www.cs.umass.edu/~mccallum/bow/rainbow/.
S. E. Robertson. Overview of the OKAPI projects. Journal of Documentation, 53:3-7, 1997.

23
Thank You!
Lan Nie, lan2@lehigh.edu
http://wume.cse.lehigh.edu
24
Combination of IR and Importance Score
P@10: precision of the top 10 results
Gamma: weight of the IR score in the combined score
25
Statistical t-tests
Compared to T-PR
Compared to BM2500
