Topical Link Analysis for Web Search


Title: Topical Link Analysis for Web Search


1
Topical Link Analysis for Web Search
  • Lan Nie, Brian D. Davison and Xiaoguang Qi
  • Computer Science and Engineering
  • Lehigh University, USA

2
  • Introduction
  • Topical link analysis model
  • Experiments
  • Conclusion

3
Question
Is http://www.rottentomatoes.com/ a good answer
to the query "tomatoes"?
4
Traditional Link Analysis
Yes, it is a famous web site!
  • Simply sums incoming authority flows, without
    considering which communities that authority
    comes from.
  • A resource that is highly popular for one topic
    may dominate the results for another topic in
    which it is less authoritative.

5
Topical Link Analysis
No, it is famous for entertainment, not for food!
  • Split the generic authority score into a vector
    to record a page's reputation with respect to
    different topics
  • Use a topic distribution to embody the context of
    a link and thus affect the authority propagation
  • Topical random surfer model

[Figure: a generic authority score of 0.7 split into the topical vector (0.4, 0.1, 0.2)]
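
The idea of propagating a vector of topical authority instead of a single score can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' formulation: the transcript does not capture their update rule, so the function name `topical_pagerank`, the link-following probability `d`, and the stay-on-topic probability `alpha` are all hypothetical. The sketch assumes each page's content vector sums to 1.

```python
def topical_pagerank(links, content, d=0.85, alpha=0.5, iters=50):
    """Propagate per-topic authority over a link graph (illustrative only).

    links:   {page: [pages it links to]}
    content: {page: [topic probabilities summing to 1]}
    d:       assumed probability of following a link
    alpha:   assumed probability of staying on the same topic when following
    """
    pages = list(content)
    n = len(pages)
    k = len(next(iter(content.values())))
    # Start from the content vectors, scaled so all scores sum to 1.
    auth = {p: [c / n for c in content[p]] for p in pages}
    for _ in range(iters):
        # Random jump: land on any page, pick a topic by that page's content.
        new = {p: [(1 - d) * c / n for c in content[p]] for p in pages}
        for src in pages:
            targets = links.get(src, [])
            total = sum(auth[src])
            if not targets:
                # Dangling page: redistribute its authority like a random jump.
                for p in pages:
                    for t in range(k):
                        new[p][t] += d * total * content[p][t] / n
                continue
            for t in range(k):
                # Either keep the current topic, or redraw the topic
                # from the source page's content distribution.
                flow = alpha * auth[src][t] + (1 - alpha) * total * content[src][t]
                for dst in targets:
                    new[dst][t] += d * flow / len(targets)
        auth = new
    return auth
```

Summing a page's vector recovers a single generic score, which is how a value such as 0.7 can correspond to a vector such as (0.4, 0.1, 0.2).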
6
  • Introduction
  • Topical link analysis model
  • Experiments
  • Conclusion

7
Topical Random Surfer Model
8
  • Content vector
    • Static, generated by a textual classifier
    • Based purely on content
  • Authority vector
    • Dynamic, computed by the topical link analysis
      approach
    • Combines text and linkage information
  • Query-specific importance score
    • Derived from the query vector and the
      authority vector
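
One plausible way to compute a query-specific importance score is to weight each topical authority value by the query's topic distribution and sum the results. The sketch below assumes that dot-product form; the function name `query_importance` and the topic order are hypothetical.

```python
def query_importance(query_vector, authority_vector):
    """Weight each topic's authority by the query's relevance to that topic."""
    return sum(q * a for q, a in zip(query_vector, authority_vector))


# Authority vector from the earlier slide; the topic order is an assumption,
# for example (entertainment, food, other).
authority = [0.4, 0.1, 0.2]

# A query classified entirely into the second topic only sees that
# topic's authority, rather than the page's overall popularity.
food_query = [0.0, 1.0, 0.0]
score = query_importance(food_query, authority)
```

Under this reading, a page that is famous for one topic contributes little to a query about another.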

9
Topical model: Topical PageRank, Topical HITS
Page model: PageRank, Normalized HITS
[Equations not captured in this transcript]
10
  • Introduction
  • Topical link analysis model
  • Experiments
  • Conclusion

11
Textual Classification
  • Topics
    • 12 top-level categories from the dmoz ODP
      hierarchy
  • Classifier
    • McCallum's Rainbow (method: Naïve Bayes)
    • http://www.cs.umass.edu/~mccallum/bow/rainbow/
  • Training set
    • 19,000 randomly selected docs per category from
      ODP
  • Classification
    • Generate a normalized content vector across the
      12 topics for each web document and query in
      the experimental data sets.
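
The classification step can be illustrated with a small multinomial Naïve Bayes classifier that returns a normalized topic distribution. This is a sketch, not Rainbow itself: the function names `train` and `content_vector`, the add-one smoothing, and the toy topics are all assumptions made for illustration.

```python
import math
from collections import Counter


def train(docs_by_topic):
    """docs_by_topic: {topic: [token lists]} -> (model, vocabulary)."""
    vocab = {w for docs in docs_by_topic.values() for doc in docs for w in doc}
    total_docs = sum(len(docs) for docs in docs_by_topic.values())
    model = {}
    for topic, docs in docs_by_topic.items():
        counts = Counter(w for doc in docs for w in doc)
        log_prior = math.log(len(docs) / total_docs)
        model[topic] = (log_prior, counts, sum(counts.values()))
    return model, vocab


def content_vector(tokens, model, vocab):
    """Return {topic: probability}, normalized to sum to 1."""
    log_scores = {}
    for topic, (log_prior, counts, n_words) in model.items():
        score = log_prior
        for w in tokens:
            if w in vocab:  # ignore words never seen in training
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log((counts[w] + 1) / (n_words + len(vocab)))
        log_scores[topic] = score
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_scores.values())
    exp = {t: math.exp(s - top) for t, s in log_scores.items()}
    z = sum(exp.values())
    return {t: v / z for t, v in exp.items()}
```

The same function can be applied to a query's tokens, which yields the query vector in the same topic space as the documents.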

12
Data Sets
  • (1) Global dataset
  • Data collection
    • TREC .GOV collection (2002)
    • 1,053,372 text/html files
    • Fifty queries from the 2003 topic distillation
      task
  • Ranking: ranking algorithms in the PageRank model

13
  • (2) Query-specific datasets
  • 20 selected hot queries
  • Data collection
    • Root set: top 200 URLs returned by Yahoo
    • Expansion: first 50 incoming pages (found by
      querying Yahoo) and all outgoing pages
    • 5,000 docs on average per query
  • Ranking: ranking algorithms using the hub and
    authority model

14
Experiments on Global Dataset
  • Competitors
    • Topical PageRank (T-PR)
    • Traditional PageRank (PR)
    • Topic-Sensitive PageRank (TSPR)
    • Intelligent Surfer (IS)
    • BM2500 (BM)
  • Evaluation
    • Relevance judgments provided by TREC
    • P@10, mean average precision (MAP), and
      R-precision (Rprec)
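
The three evaluation measures have standard definitions, sketched below for a single query. The function names are illustrative; MAP is the mean of `average_precision` over all queries.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant doc appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    return precision_at_k(ranked, relevant, r) if r else 0.0
```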

15
Combination of IR and Importance Score
  • Typical approach [Cai et al., 2004]
    • OKAPI BM2500 [Robertson, 1997] weighting for
      the IR score
    • Linear combination of the IR score and the
      importance score
    • Output the top results of the combined list
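
A linear combination of this kind can be sketched as below. The function name `combine` is hypothetical, and the sketch assumes both inputs are already scaled to a comparable range such as [0, 1]; the slides do not say whether raw scores or rank positions are combined. `gamma` is the weight given to the IR score.

```python
def combine(ir_scores, importance_scores, gamma=0.5, top_n=10):
    """Rank documents by gamma * IR score + (1 - gamma) * importance score."""
    combined = {
        doc: gamma * ir + (1 - gamma) * importance_scores.get(doc, 0.0)
        for doc, ir in ir_scores.items()
    }
    # Highest combined score first; keep only the top results.
    return sorted(combined, key=combined.get, reverse=True)[:top_n]
```

Setting `gamma` to 1 reproduces the pure IR ranking, and setting it to 0 ranks by importance alone.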

16
Performance Comparison
17
Experiments on Query-specific Datasets
  • Competitors
    • Topical HITS (T-HITS)
    • Traditional HITS with normalization (HITS /
      N-HITS)
    • Bharat and Henzinger's imp with normalization
      (IMP / N-IMP)
    • CLEVER's ARC weighting with normalization (ARC /
      N-ARC)
  • Evaluation
    • Human evaluation system, 43 participants in
      total
    • Metrics: P@10 and S@10

18
Topical model: Topical PageRank, Topical HITS
Page model: PageRank, Normalized HITS
[Equations not captured in this transcript]
19
Performance Comparison
[Figure not captured in this transcript; labels: Precision, Human assessment]
20
  • Introduction
  • Topical link analysis model
  • Experiments
  • Conclusion

21
Summary of Contributions
In this work, we have
  • Introduced a topical random walk model that
    probabilistically combines page content (via a
    topic distribution) and link structure.
  • Re-implemented a number of well-known ranking
    algorithms and conducted performance comparisons.
  • Demonstrated topic-level transitions within
    global authority propagation without affecting
    the global transition probabilities.

22
References
  • K. Bharat and M. R. Henzinger. Improved
    algorithms for topic distillation in hyperlinked
    environments. In Proc. of the 21st Int'l ACM
    SIGIR Conference on Research and Development in
    Information Retrieval, pages 104-111, Aug. 1998.
  • D. Cai, X. He, J.-R. Wen, and W.-Y. Ma.
    Block-level link analysis. In Proc. of the 27th
    Annual Int'l ACM SIGIR Conference on Research and
    Development in Information Retrieval, July 2004.
  • S. Chakrabarti, B. E. Dom, P. Raghavan, S.
    Rajagopalan, D. Gibson, and J. M. Kleinberg.
    Automatic resource compilation by analyzing
    hyperlink structure and associated text. In Proc.
    of the 7th Int'l World Wide Web Conference, pages
    65-74, Brisbane, Australia, Apr. 1998.
  • T. H. Haveliwala. Topic-sensitive PageRank. In
    Proc. of the 11th Int'l World Wide Web
    Conference, Honolulu, Hawaii, May 2002.
  • IBM Almaden Research Center. The CLEVER Project.
    Home page:
    http://www.almaden.ibm.com/cs/k53/clever.html,
    2000.
  • J. M. Kleinberg. Authoritative sources in a
    hyperlinked environment. Journal of the ACM,
    46(5):604-632, 1999.
  • L. Page, S. Brin, R. Motwani, and T. Winograd.
    The PageRank citation ranking: Bringing order to
    the Web. Unpublished draft, 1998.
  • A. McCallum. Rainbow text classification tool.
    http://www.cs.umass.edu/~mccallum/bow/rainbow/
  • S. E. Robertson. Overview of the OKAPI projects.
    Journal of Documentation, 53:3-7, 1997.

23
Thank You!
Lan Nie, lan2@lehigh.edu
http://wume.cse.lehigh.edu
24
Combination of IR and Importance Score
  • P@10: precision of the top 10 results
  • Gamma: the IR score's weight in the combined
    score
25
Statistical t-tests
Compared to T-PR
Compared to BM2500