Title: Topical Link Analysis for Web Search
1Topical Link Analysis for Web Search
- Lan Nie, Brian D. Davison and Xiaoguang Qi
- Computer Science and Engineering
- Lehigh University, USA
2- Introduction
- Topical link analysis model
- Experiments
- Conclusion
3Question
Is http://www.rottentomatoes.com/ a good answer
to the query "tomatoes"?
4Traditional Link Analysis
Yes, it is a famous web site!
- Simple summation of incoming authority flows
without considering from which communities that
authority is derived.
- A resource that is highly popular for one topic
may dominate the results of another topic in
which it is less authoritative.
5 Topical Link Analysis
No, it is famous for entertainment, not for food!
- Split the generic authority score into a vector
to record a page's reputation with respect to
different topics
- Use a topic distribution to embody the context of
a link and thus affect the authority propagation
- Topical random surfer model
[Figure: a generic authority score of 0.7 split into the topical vector (0.4, 0.1, 0.2)]
6- Introduction
- Topical link analysis model
- Experiments
- Conclusion
7Topical Random Surfer Model
8query vector
- Static, generated by textual classifier
- Purely based on content
- Dynamic, computed by topical link analysis
approach
- Combination of text and linkage information
- Query-specific importance score
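The slide does not spell out how the query vector and a page's topical authority vector are combined. One natural reading, assumed here, is a dot product of the two. In the sketch below the function name, the three topic labels and the query distribution are hypothetical; the authority vector reuses the (0.4, 0.1, 0.2) example from the earlier slide.

```python
def query_specific_score(query_topics, page_authority):
    """Dot product of a query's topic distribution and a page's
    topical authority vector (assumed combination rule)."""
    if len(query_topics) != len(page_authority):
        raise ValueError("vectors must cover the same topics")
    return sum(q * a for q, a in zip(query_topics, page_authority))


# Hypothetical topics: (entertainment, food, other).
page_authority = [0.4, 0.1, 0.2]  # topical authority; sums to the generic 0.7
query_topics = [0.1, 0.8, 0.1]    # made-up distribution for a food-heavy query
score = query_specific_score(query_topics, page_authority)
```

Under this reading a page that is famous mainly for entertainment earns little credit for a food query, even though its generic authority is high.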
9Topical HITS and Topical PageRank
[Slide equations not preserved. Remaining labels: topical model (Topical HITS, Topical PageRank) and page model (Normalized HITS, PageRank).]
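As a rough illustration of how a topical PageRank iteration could look, the sketch below splits each page's authority into one value per topic. The update rule is an assumption, not taken from the slides: with probability 1 - d the surfer jumps to a random page and picks a topic from that page's content distribution; with probability d the surfer follows a link, keeping the current topic with probability alpha and otherwise picking a topic from the current page's content distribution. The names `topical_pagerank`, `pagerank`, `d`, `alpha` and the toy graph are all hypothetical.

```python
def pagerank(links, pages, d=0.85, iters=50):
    """Plain PageRank; dangling pages are treated as linking to every page."""
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}
        for src in pages:
            targets = links.get(src) or pages
            for dst in targets:
                new[dst] += d * rank[src] / len(targets)
        rank = new
    return rank


def topical_pagerank(links, content, d=0.85, alpha=0.5, iters=50):
    """Return {page: [authority per topic]} under the assumed surfer model.

    links:   {page: [pages it links to]}
    content: {page: [topic distribution summing to 1]}
    """
    pages = list(content)
    n = len(pages)
    topics = range(len(content[pages[0]]))
    # Start with total mass 1, spread by each page's content distribution.
    auth = {p: [content[p][t] / n for t in topics] for p in pages}
    for _ in range(iters):
        # Random jump: land on a page, choose a topic from its content.
        new = {p: [(1 - d) * content[p][t] / n for t in topics] for p in pages}
        for src in pages:
            targets = links.get(src) or pages
            total = sum(auth[src])  # page-level authority of src
            for t in topics:
                # Keep the topic (alpha) or re-pick it from src's content.
                flow = alpha * auth[src][t] + (1 - alpha) * content[src][t] * total
                for dst in targets:
                    new[dst][t] += d * flow / len(targets)
        auth = new
    return auth


# Toy graph with two topics (all values hypothetical).
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
content = {"a": [0.7, 0.3], "b": [0.2, 0.8], "c": [0.5, 0.5]}
topical = topical_pagerank(links, content)
plain = pagerank(links, list(content))
```

In this sketch the per-topic values of each page sum to its plain PageRank score, so topic-level transitions refine the authority flow without changing the page-level transition probabilities.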
10- Introduction
- Topical link analysis model
- Experiments
- Conclusion
11Textual Classification
- Topics
- 12 top level categories from dmoz ODP hierarchy
- Classifier
- McCallum's Rainbow (method: Naïve Bayes)
- http://www.cs.umass.edu/~mccallum/bow/rainbow/
- Training Set
- 19,000 randomly selected docs per category from
ODP
- Classification
- Generate a normalized content vector across 12
topics for each web document and query in
experimental data sets.
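The normalization step can be sketched as follows, assuming the classifier exposes one log-score per topic for each document or query. That output format is an assumption (the slides do not describe Rainbow's output), and `normalize_topic_scores` and the sample scores are hypothetical.

```python
import math


def normalize_topic_scores(log_scores):
    """Turn per-topic log-scores into a content vector that sums to 1."""
    top = max(log_scores)
    # Subtract the maximum before exponentiating to avoid underflow.
    weights = [math.exp(s - top) for s in log_scores]
    total = sum(weights)
    return [w / total for w in weights]


# Made-up log-scores for the 12 top-level topics.
raw = [-12.0, -3.5, -9.1, -4.0, -15.2, -8.8, -7.3, -11.0, -6.6, -10.4, -5.9, -13.7]
content_vector = normalize_topic_scores(raw)
```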
12Data Sets
- (1) Global Dataset
- Data Collection
- TREC .GOV collection (2002)
- 1,053,372 text/html files
- Fifty queries in 2003 topic distillation task
- Ranking
- Ranking algorithms in PageRank model
13- (2) Query-specific Datasets
- 20 selected hot queries
- Data Collection
- Root Set: Top 200 URLs returned by Yahoo
- Expansion: First 50 incoming pages (by querying
Yahoo) and all the outgoing pages.
- 5000 docs on average per query
- Ranking
- Ranking algorithms using Hub and Authority Model
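For orientation, a minimal sketch of the hub and authority iteration that these query-specific rankers build on. It shows the traditional HITS update with scores rescaled to sum to 1 after each step, not the topical or weighted variants compared in the experiments. The function name and the toy graph are hypothetical.

```python
def hits(links, iters=50):
    """Traditional HITS on {page: [pages it links to]}.

    Returns (hub, authority) dictionaries, each rescaled to sum to 1.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iters):
        # Authority: sum of hub scores of pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        norm = sum(auth.values()) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub: sum of authority scores of pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        norm = sum(hub.values()) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth


# Toy graph: h1 links to x and y, h2 links to x only.
hub, auth = hits({"h1": ["x", "y"], "h2": ["x"], "x": [], "y": []})
```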
14Experiments on Global Dataset
- Competitors
- Topical PageRank (T-PR)
- Traditional PageRank (PR)
- Topic Sensitive PageRank (TSPR)
- Intelligent Surfer (IS)
- BM2500 (BM)
- Evaluation
- Relevance judgments provided by TREC
- P@10, Mean average precision (MAP) and Rprec
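The three measures can be computed from a ranked result list and a set of relevant documents. The sketch below uses the standard textbook definitions; the function names and the sample ranking are hypothetical, and the official TREC evaluation tooling may differ in edge cases.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant doc appears."""
    if not relevant:
        return 0.0
    found, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            found += 1
            total += found / rank
    return total / len(relevant)


def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    return precision_at_k(ranked, relevant, r) if r else 0.0


def mean_average_precision(runs):
    """MAP over a list of (ranked, relevant) pairs, one pair per query."""
    return sum(average_precision(rk, rel) for rk, rel in runs) / len(runs)


# Made-up example: two of three relevant documents are retrieved.
ranked = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d9"}
p5 = precision_at_k(ranked, relevant, k=5)
ap = average_precision(ranked, relevant)
rp = r_precision(ranked, relevant)
```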
15Combination of IR and Importance Score
- Typical approach (Cai, He, et al., 2004)
- OKAPI BM2500 (Robertson, 1997) weighting
- Linear combination
- Output top results of the combined list
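A minimal sketch of such a linear combination, with gamma as the weight of the IR score. The slides do not say whether raw scores or rank positions are combined; this sketch assumes min-max normalized scores. The function names and sample values are hypothetical.

```python
def min_max(scores):
    """Rescale a {doc: score} mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {doc: (s - lo) / span if span else 0.0 for doc, s in scores.items()}


def combine(ir_scores, importance, gamma, top_n=10):
    """Rank docs by gamma * IR + (1 - gamma) * importance (assumed form)."""
    ir, imp = min_max(ir_scores), min_max(importance)
    combined = {doc: gamma * ir[doc] + (1 - gamma) * imp.get(doc, 0.0)
                for doc in ir}
    return sorted(combined, key=combined.get, reverse=True)[:top_n]


# Made-up scores for three documents.
ir_scores = {"a": 3.0, "b": 2.0, "c": 1.0}   # e.g. BM2500 scores
importance = {"a": 0.1, "b": 0.9, "c": 0.5}  # e.g. link-based scores
ranking = combine(ir_scores, importance, gamma=0.5)
```

With gamma = 1 the list reduces to the pure IR ranking, and with gamma = 0 to the pure importance ranking.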
16Performance Comparison
17Experiments on Query-specific Datasets
- Competitors
- Topical HITS (T-HITS)
- Traditional HITS w/ normalization (HITS / N-HITS)
- BH imp w/ normalization (IMP / N-IMP)
- CLEVER's ARC weighting w/ normalization (ARC /
N-ARC)
- Evaluation
- Human evaluation system, 43 participants in total
- Metrics: P@10 and S@10
18Topical HITS and Topical PageRank
[Slide equations not preserved. Remaining labels: topical model (Topical HITS, Topical PageRank) and page model (Normalized HITS, PageRank).]
19Performance Comparison
Precision
Human assessment
20- Introduction
- Topical link analysis model
- Experiments
- Conclusion
21Summary of Contributions
In this work, we have
- Introduced a topical random walk model that
probabilistically combines page content (via a
topic distribution) and link structure.
- Re-implemented a number of well-known ranking
algorithms and conducted performance comparisons.
- Demonstrated topic-level transitions within
global authority propagation without affecting
the global transition probabilities.
22Reference
- K. Bharat and M. R. Henzinger. Improved
algorithms for topic distillation in hyperlinked
environments. In Proc. of the 21st Int'l ACM
SIGIR Conference on Research and Development in
Information Retrieval, pages 104-111, Aug. 1998.
- D. Cai, X. He, J.-R. Wen, and W.-Y. Ma.
Block-level link analysis. In Proc. of the 27th
Annual Int'l ACM SIGIR Conference on Research and
Development in Information Retrieval, July 2004.
- S. Chakrabarti, B. E. Dom, P. Raghavan, S.
Rajagopalan, D. Gibson, and J. M. Kleinberg.
Automatic resource compilation by analyzing
hyperlink structure and associated text. In Proc.
of the 7th Int'l World Wide Web Conference, pages
65-74, Brisbane, Australia, Apr. 1998.
- T. H. Haveliwala. Topic-sensitive PageRank. In
Proc. of the Eleventh Int'l World Wide Web
Conference, Honolulu, Hawaii, May 2002.
- IBM Almaden Research Center. The CLEVER Project.
Home page: http://www.almaden.ibm.com/cs/k53/clever.html,
2000.
- J. M. Kleinberg. Authoritative sources in a
hyperlinked environment. Journal of the ACM,
46(5):604-632, 1999.
- L. Page, S. Brin, R. Motwani, and T. Winograd.
The PageRank citation ranking: Bringing order to
the Web. Unpublished draft, 1998.
- A. McCallum. Rainbow text classification tool.
http://www.cs.umass.edu/~mccallum/bow/rainbow/.
- S. E. Robertson. Overview of the OKAPI projects.
Journal of Documentation, 53:3-7, 1997.
23Thank You!
Lan Nie, lan2@lehigh.edu
http://wume.cse.lehigh.edu
24Combination of IR and Importance Score
- P@10: precision of the top 10 results
- Gamma: weight of the IR score in the combined score
25Statistical t-tests
Compared to T-PR
Compared to BM2500
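The slide does not state which t-test variant was used. Since every system is evaluated on the same set of queries, a paired test is a plausible choice, and the sketch below computes its t statistic from per-query scores. The paired design, the function name and the sample values are all assumptions; no values from the experiments are reproduced here.

```python
import math


def paired_t_statistic(xs, ys):
    """t statistic for paired samples, with len(xs) - 1 degrees of freedom."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Sample variance of the per-query differences.
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)


# Made-up per-query P@10 values for two systems on four queries.
system_a = [0.5, 0.6, 0.7, 0.8]
system_b = [0.4, 0.4, 0.6, 0.5]
t_value = paired_t_statistic(system_a, system_b)
```

The statistic would then be compared against a t distribution with n - 1 degrees of freedom to obtain a p-value.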