Topic-Sensitive PageRank - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Topic-Sensitive PageRank


1
Topic-Sensitive PageRank
  • Taher H. Haveliwala
  • 2002

2
Abstract
  • Target: improving the ranking of search-query
    results
  • Before: using the link structure of the Web to
    capture the relative importance of Web pages,
    independent of any particular search query
  • Now: a set of PageRank vectors, biased using a set
    of topics, to capture more accurately the notion
    of importance with respect to a particular topic

3
Abstract: contribution
  • Produces more accurate rankings than generic PageRank
  • Compute topic-sensitive PageRank scores for pages
    satisfying the query using the topic of the query
    keywords
  • For searches done in context:
  • Compute the topic-sensitive PageRank scores using
    the topic of the context in which the query
    appeared

4
1. Introduction
  • HITS [14]
  • A link analysis algorithm
  • Hubs
  • Authorities
  • Extensions include content analysis [4]
  • Automatically compiling resource lists for
    general topics [8]

5
1. Introduction - PageRank algorithm [7, 16]
  • Rank vector: an a priori importance estimate for
    pages on the Web
  • Computed once
  • Offline
  • Independent of the search query (a drawback)
  • Importance scores are used in conjunction with
    query-specific IR scores to rank the query
    results

6
1. Introduction - Advantage of PageRank
  • query-time cost of incorporating the precomputed
    PageRank importance score for a page is low
  • PageRank is generated using the entire Web graph
    rather than a small subset

7
1. Introduction - Method in this paper
  • Allows the query to influence the link-based
    score (as in HITS)
  • Requires minimal query-time processing (as in PageRank)
  • Each PageRank vector is biased toward a different topic

8
1. Introduction - making PageRank topic-sensitive
  • Avoids the problem of heavily linked pages being
    ranked highly for queries on which they have no
    particular authority
  • Hilltop [5] is designed to improve results
    for popular queries
  • It generates a query-specific authority score by
    detecting and indexing pages that appear to be
    good experts for certain keywords
  • Queries for which no experts are found are not
    handled by the Hilltop algorithm.

9
1. Introduction - making PageRank topic-sensitive
  • [17] proposes using the set of Web pages that
    contain a given term to influence the computation.
  • An approach for enhancing search rankings by
    generating a PageRank vector for each possible
    query term was recently proposed in [18] with
    favorable results
  • It requires considerable processing time and storage
  • It is not easily extended to make use of user and query
    context

10
1. Introduction - two query scenarios
  • Scenario 1: assume a user with a specific
    information need issues a query
  • Determine the topics most closely associated with
    the query, and use the appropriate
    topic-sensitive PageRank vectors for ranking the
    documents satisfying the query.

11
1. Introduction - two query scenarios
  • Scenario 2: the user is viewing a document (for
    instance, browsing the Web or reading email), and
    selects a term from the document for which he
    would like more information.

12
Summary of approach
  • Generate 16 topic-sensitive PageRank vectors,
    each biased using URLs from one top-level
    category of the Open Directory Project (ODP)
  • At query time, calculate the similarity of the
    query to each topic
  • Take the linear combination of the
    topic-sensitive vectors, weighted using the
    similarities of the query to the topics
  • Because link-based computations are performed
    offline, query-time costs are low
13
2. Review of PageRank
  • Page u links to page v
  • Example
  • Yahoo! -> an important page (many pages point to it)
  • Pages pointed to from Yahoo! are probably important
  • $N_u$ -> out-degree of page u
  • Rank(p): importance of page p
  • The link (u, v) confers $Rank(u)/N_u$ units of rank to v

14
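  • N is the number of pages; assign all pages the
    initial value 1/N
  • $B_v$ represents the set of pages pointing to v
  • Each iteration propagates rank along links:
    $Rank_{i+1}(v) = \sum_{u \in B_v} Rank_i(u) / N_u$
  • The final vector, $Rank^*$, contains the PageRank
    vector over the Web
  • It is computed only once after each crawl of the Web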

15
  • Expressed as the following eigenvector
    calculation: $Rank = M \times Rank$
  • M -> square stochastic matrix corresponding to
    the directed graph G of the Web
  • If there is a link from page j to page i, then
    $m_{ij} = 1/N_j$; all other entries are 0
  • Repeatedly multiplying Rank by M yields the
    dominant eigenvector Rank of the matrix M
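
The repeated multiplication described on this slide can be sketched in a few lines of Python. This is only an illustrative sketch: the function name, the dictionary-based graph representation, and the three-page example graph are assumptions made for the example, and the code assumes every page has at least one out-link and that the graph is irreducible and aperiodic.

```python
def pagerank_power_iteration(out_links, iterations=100):
    """Repeatedly apply Rank <- M x Rank, where m_ij = 1/N_j for each link j -> i."""
    pages = sorted(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # every page starts with the value 1/N
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for u in pages:
            # The link (u, v) confers Rank(u)/N_u units of rank to v.
            share = rank[u] / len(out_links[u])
            for v in out_links[u]:
                new_rank[v] += share
        rank = new_rank
    return rank


# Hypothetical 3-page graph (strongly connected, with cycles of length 2 and 3).
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank_power_iteration(graph)
print(ranks)
```

For this small graph the iteration settles at roughly 0.4 for pages a and c and 0.2 for page b, which is the fixed point of Rank = M × Rank.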

16
An example
[Figure: example directed graph with nodes v1, v2, v3, v4, v5]
17
  • PageRank can be viewed as the stationary
    probability distribution over pages induced by a
    random walk on the Web
  • Convergence is guaranteed only if M is
    irreducible and aperiodic
  • Damping factor 1 - α: used to limit the effect
    of rank sinks
  • Add transition edges of probability α/N between
    every pair of nodes in G
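
A minimal sketch of the damped iteration follows, assuming the bias parameter α plays the role of the uniform jump probability and that every page has at least one out-link. The function name and the example graph are hypothetical. The example graph contains a two-page loop (a and b) that receives rank from c but never gives any back, so it behaves like a rank sink.

```python
def damped_pagerank(out_links, alpha=0.25, iterations=100):
    """Iterate Rank <- (1 - alpha) * M x Rank + alpha * (1/N) for every page."""
    pages = sorted(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Added transition edges: each page receives alpha/N from the random jump.
        new_rank = {p: alpha / n for p in pages}
        for u in pages:
            # With probability 1 - alpha the surfer follows an out-link of u.
            share = (1 - alpha) * rank[u] / len(out_links[u])
            for v in out_links[u]:
                new_rank[v] += share
        rank = new_rank
    return rank


# Hypothetical graph: a and b link only to each other, c links into the loop.
sink_graph = {"a": ["b"], "b": ["a"], "c": ["a"]}
damped = damped_pagerank(sink_graph, alpha=0.25)
print(damped)
```

Without the added edges, page c would end up with no rank at all because nothing links to it. With them, c keeps exactly α/N of the total, and the total rank stays at 1.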

18
Rank Sink
  • Page A points to Page B and Page B points to Page
    A, with no links leaving the pair, so rank flows
    in but never out and the PageRank value for these
    pages keeps increasing

Ref: Analysis of Rank Sink Problem in PageRank
Algorithm
19
The key to creating topic-sensitive PageRank is
to bias the computation to increase the effect of
certain categories of pages by using a nonuniform
N×1 personalization vector for p
20
3. Topic-Sensitive PageRank - Method in this
article
  • Precompute multiple importance scores for
    each page
  • Each page gets a set of scores, one for its
    importance with respect to each topic
  • At query time, these scores are combined to form
    a composite PageRank score
  • The composite score produces the final ranking
    of the query results

21
3. Topic-Sensitive PageRank - 3.2 ODP biasing
  • First step: generate a set of biased PageRank
    vectors using a set of basis topics.
  • Performed once
  • Offline
  • Personalization vector
  • 16 different biased PageRank vectors (using the 16
    top-level categories of the ODP)
  • $T_j$: the set of URLs in the ODP category $c_j$

DMOZ Open Directory Project: 16 top-level topics
22
3.2 Personalization vector
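
As a rough sketch of how a personalization vector biases the computation, the code below assumes the vector for topic c_j assigns 1/|T_j| to every page in T_j and 0 to every other page, and that all pages in T_j appear in the graph and have at least one out-link. The function name, the ring-shaped example graph, and the single-page topic sets are hypothetical.

```python
def topic_sensitive_pagerank(out_links, topic_urls, alpha=0.25, iterations=100):
    """Iterate Rank <- (1 - alpha) * M x Rank + alpha * v_j for one topic."""
    pages = sorted(out_links)
    # Personalization vector v_j: 1/|T_j| for pages in T_j, 0 elsewhere.
    bias = {p: (1.0 / len(topic_urls) if p in topic_urls else 0.0) for p in pages}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # Random jumps land only on pages of the topic's bias set.
        new_rank = {p: alpha * bias[p] for p in pages}
        for u in pages:
            share = (1 - alpha) * rank[u] / len(out_links[u])
            for v in out_links[u]:
                new_rank[v] += share
        rank = new_rank
    return rank


# Hypothetical ring of four pages; each "topic" contains a single page.
ring = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["a"]}
biased_to_a = topic_sensitive_pagerank(ring, {"a"})
biased_to_c = topic_sensitive_pagerank(ring, {"c"})
print(biased_to_a["a"], biased_to_c["a"])
```

Page a scores higher in the vector biased toward its own topic than in the vector biased toward page c, even though the link structure is identical in both runs.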
23
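3.3 Query Time Importance Score
  • The second step is performed at query time
  • Given a query q, let q' be the context of q.
  • Let $q'_i$ be the ith term in the query context q'
  • Then given the query q, compute for each $c_j$ the
    following:
    $P(c_j \mid q') = P(c_j) \cdot P(q' \mid c_j) / P(q') \propto P(c_j) \cdot \prod_i P(q'_i \mid c_j)$
  • The prior $P(c_j)$ can be replaced by a
    user-specific prior $P_k(c_j)$ that reflects the
    interests of user k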
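
The class probabilities can be sketched as a small unigram classifier. This is an illustration under stated assumptions, not the presenter's implementation: the function name, the term counts, and the two example topics are hypothetical, and the add-one smoothing is an assumption introduced here so that an unseen term does not zero out a topic.

```python
def topic_probabilities(query_terms, topic_term_counts, priors=None):
    """Return P(c_j | q') proportional to P(c_j) * product over i of P(q'_i | c_j)."""
    topics = sorted(topic_term_counts)
    if priors is None:
        # Uniform prior; a user-specific prior could be passed in instead.
        priors = {c: 1.0 / len(topics) for c in topics}
    vocabulary = set()
    for counts in topic_term_counts.values():
        vocabulary.update(counts)
    scores = {}
    for c in topics:
        counts = topic_term_counts[c]
        total = sum(counts.values())
        score = priors[c]
        for term in query_terms:
            # Add-one smoothing (an assumption of this sketch).
            score *= (counts.get(term, 0) + 1) / (total + len(vocabulary))
        scores[c] = score
    norm = sum(scores.values())
    return {c: s / norm for c, s in scores.items()}


# Hypothetical term counts for two topics.
term_counts = {
    "Sports": {"basketball": 8, "jordan": 2},
    "Business": {"market": 9, "jordan": 1},
}
probs = topic_probabilities(["jordan", "basketball"], term_counts)
print(probs)
```

With these made-up counts, the context term "basketball" pushes almost all of the probability mass for "jordan" toward the Sports topic.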

24
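  • Compute the query-sensitive importance score of
    each URL retrieved for the query, weighting each
    topic by its query-time probability:
    $s_{qd} = \sum_j P(c_j \mid q') \cdot rank_{jd}$
  • $rank_{jd}$ is the rank of document d given by the
    rank vector for topic $c_j$
  • The results are ranked according to this
    composite score $s_{qd}$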
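
The composite score is a weighted sum, which the following sketch makes concrete. The function name, the topic weights, the rank values, and the example.com-style document names are all made up for illustration; a document missing from a topic's rank vector is treated as having rank 0, which is an assumption of this sketch.

```python
def composite_scores(topic_probs, topic_ranks, documents):
    """Return s_qd = sum over j of P(c_j | q') * rank_jd for each document d."""
    return {
        d: sum(topic_probs[c] * topic_ranks[c].get(d, 0.0) for c in topic_probs)
        for d in documents
    }


# Hypothetical query-time topic probabilities and precomputed topic ranks.
topic_probs = {"Sports": 0.75, "Business": 0.25}
topic_ranks = {
    "Sports": {"nba.example": 0.4, "bank.example": 0.1},
    "Business": {"nba.example": 0.1, "bank.example": 0.5},
}
scores = composite_scores(topic_probs, topic_ranks, ["nba.example", "bank.example"])
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Only the weighted sum happens at query time; the per-topic rank vectors are looked up, not recomputed.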

25
  • Random surfer model: the surfer visits a Web page
    with a certain probability, which derives from
    the page's PageRank
  • $w_j$ is the coefficient used to weight the jth rank
    vector
  • With probability 1 - α, a random surfer on page u
    follows an outlink of u

26
4. Experimental Results
  • A series of experiments
  • 4.1 describes the similarity measures used to
    compare two rankings
  • 4.2 investigates how the induced rankings vary,
    based on both the topic used to bias the rank
    vectors and the choice of the bias factor α
  • 4.3 compares the retrieval performance of ordinary
    PageRank versus topic-sensitive PageRank.
  • 4.4 examines how query context can be used in
    conjunction with topic-sensitive PageRank

27
4. Experiment Data
  1. The crawl contained roughly 280,000 of the 3 million
    URLs in the ODP.
  2. 35 queries from paper [9], shown in Table 1

28
4.1 Similarity Measure - First measure
  • First measure
  • The degree of overlap
    between the top n URLs of two rankings
  • n = 20 is used for the comparison
  • It does not indicate the degree to which the
    relative orderings of the top n URLs of two
    rankings agree
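
A minimal sketch of the overlap measure, assuming it is the size of the intersection of the two top-n sets divided by n. The function name and the example rankings are hypothetical.

```python
def top_n_overlap(ranking1, ranking2, n=20):
    """Fraction of the top-n URLs that two rankings have in common."""
    top1 = set(ranking1[:n])
    top2 = set(ranking2[:n])
    return len(top1 & top2) / n


first = ["a", "b", "c", "d"]
second = ["b", "a", "e", "f"]
print(top_n_overlap(first, second, n=4))          # two of four URLs are shared
print(top_n_overlap(first, first[::-1], n=4))     # same URLs, opposite order
```

The second call returns full overlap even though the order is reversed, which is the limitation noted on the slide.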

29
4.1 Similarity Measure - Second measure
Kendall's τ distance measure [9]
  • KSim(T1, T2) is the probability that T1 and T2
    agree on the relative ordering of a randomly
    selected pair of distinct nodes
  • U: the union of the URLs in T1 and T2
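
One way to compute such a pairwise-agreement score is sketched below. Several details are assumptions of this sketch rather than something stated on the slide: a URL missing from one list is placed after all URLs of that list, tied with any other missing URLs, and a pair that is tied in one list counts as a disagreement. The function name is hypothetical, and each ranking is assumed to contain no duplicates.

```python
from itertools import combinations


def ksim(ranking1, ranking2):
    """Fraction of distinct pairs from U on whose relative order both rankings agree."""
    universe = set(ranking1) | set(ranking2)
    pos1 = {url: i for i, url in enumerate(ranking1)}
    pos2 = {url: i for i, url in enumerate(ranking2)}

    def order(pos, length, u, v):
        # -1 if u is ranked before v, 1 if after, 0 if both are missing (tied).
        diff = pos.get(u, length) - pos.get(v, length)
        return (diff > 0) - (diff < 0)

    agree = 0
    pairs = 0
    for u, v in combinations(sorted(universe), 2):
        pairs += 1
        o1 = order(pos1, len(ranking1), u, v)
        o2 = order(pos2, len(ranking2), u, v)
        if o1 == o2 and o1 != 0:
            agree += 1
    return agree / pairs if pairs else 1.0


print(ksim(["a", "b", "c"], ["a", "c", "b"]))
```

Unlike the overlap measure, this score drops when the same URLs appear in a different order: identical lists score 1.0 and a fully reversed list scores 0.0.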

30
  • Ref: https://en.wikipedia.org/wiki/Kendall_tau_distance

31
4.2 Effect of ODP Biasing
  • Bias factor α
  • Affects the degree to which the resultant vector
    is biased towards the topic vector used for p
  • For α = 1, the URLs in the bias set $T_j$ will be
    assigned the score $1/|T_j|$
  • As α tends to 0, the content of $T_j$ becomes
    irrelevant to the final score
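
The two limiting cases can be read off the biased PageRank equation. The block below is a sketch under the assumption that the topic vector $v_j$ assigns $1/|T_j|$ to each page in $T_j$ and 0 to all other pages, and that pages without out-links are ignored.

```latex
\begin{align*}
Rank &= (1-\alpha)\, M \times Rank + \alpha\, p, \qquad p = v_j \\
\alpha = 1 &: \quad Rank = v_j, \text{ so } Rank(i) = 1/|T_j| \text{ for } i \in T_j \text{ and } 0 \text{ otherwise} \\
\alpha \to 0 &: \quad Rank \to M \times Rank, \text{ which no longer involves } v_j
\end{align*}
```

Intermediate values of α trade off the link structure against the topic's bias set.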

32
  • α = 0.25 is used (chosen heuristically)
  • The induced rankings of query results are not
    very sensitive to the choice of α
  • The average overlap between the top 20 results
    for the two values of α is very high

33
  • The differences across different topically-biased
    PageRank vectors are much larger

34
  • investigate which of these rankings is best for
    specific queries
  • Table 5 shows the top 5 ranked URLs

35
4.3 query-sensitive scoring
  • Examines how effectively the ranking precision
    of the topic-sensitive vectors can be utilized
  • The highest-valued categories are intuitively
    the most relevant categories for the query
  • Only the three categories with the highest values
    are used to compute the $s_{qd}$ score

36
4.3 query-sensitive scoring
37
4.3 query-sensitive scoring - experiment
  • Goal: compare the query-sensitive approach to
    ordinary PageRank
  • 10 queries
  • 5 volunteers
  • For each query, the volunteer
    was shown 2 result rankings:
  • Top 10 results with the unbiased PageRank vector
  • Top 10 results with the composite $s_{qd}$ score
  • The volunteer selected all URLs relevant to the query
  • The volunteer then chose the better of the two rankings

38
4.3 query-sensitive scoring -result
39
4.4 context-sensitive scoring
  • Using the context can help disambiguate the query
    term and yield results that more closely reflect

41
5. Sources of search context
  1. The history of queries issued leading up to the
    current query is another form of query context
  2. Example: the queries "Jordan" and "basketball"
  3. Some sort of hierarchical directory
  4. User context
  5. Browsing patterns
  6. Bookmarks
  7. Email archive

42
6. Ongoing Work
  • Discovering sources of search context
  • Developing the best set of basis
    topics (second or third level of the Open Directory
    hierarchy) -> efficiency problem
  • Creating the damping vector used to create the
    topic-sensitive rank vectors -> being more resistant