Topic-Sensitive PageRank - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Topic-Sensitive PageRank

1
Topic-Sensitive PageRank

Taher H. Haveliwala
2002

2
Abstract

Target: improving the ranking of search-query
results
Before: using the link structure of the Web to
capture the relative importance of Web pages,
independent of any particular search query
Now: a set of PageRank vectors, biased using a set
of topics, to capture more accurately the notion
of importance with respect to a particular topic

3
Abstract contribution

More accurate rankings than generic PageRank
For ordinary keyword searches: compute
topic-sensitive PageRank scores for pages
satisfying the query using the topic of the query
keywords
For searches done in context: compute the
topic-sensitive PageRank scores using
the topic of the context in which the query
appeared

4
1. Introduction

HITS [14]
a link analysis algorithm
Hubs
Authorities
Extended to include content analysis [4]
Automatically compiling resource lists for
general topics [8]

5
1. Introduction - PageRank algorithm [7, 16]

Rank vector: a priori importance estimates for
pages on the Web
Computed once
Offline
independent of the search query (con)
importance scores are used in conjunction with
query-specific IR scores to rank the query
results

6
1. Introduction - Advantage of PageRank

query-time cost of incorporating the precomputed
PageRank importance score for a page is low
PageRank is generated using the entire Web graph
rather than a small subset

7
1. Introduction - Method in this paper

Allows the query to influence the link-based
score (as in HITS)
Requires minimal query-time processing (as in PageRank)
Uses a set of PageRank vectors, each biased with
a different topic

8
1. Introduction - making PageRank topic-sensitive

Avoids the problem of heavily linked pages getting
highly ranked for queries for which they have no
particular authority
Hilltop [5] is designed to improve results
for popular queries
It generates a query-specific authority score by
detecting and indexing pages that appear to be
good experts for certain keywords
Queries for which experts were not found will not
be handled by the Hilltop algorithm.

9
1. Introduction - making PageRank topic-sensitive

[17] proposes using the set of Web pages that
contain a given term as a bias set for
influencing the computation.
An approach for enhancing search rankings by
generating a PageRank vector for each possible
query term was recently proposed in [18] with
favorable results
That approach requires considerable processing
time and storage
It is not easily extended to make use of user and
query context

10
1. Introduction - two query scenarios

Scenario 1: assume a user with a specific
information need issues a query
Determine the topics most closely associated with
the query, and use the appropriate
topic-sensitive PageRank vectors for ranking the
documents satisfying the query.

11
1. Introduction - two query scenarios

Scenario 2: the user is viewing a document (for
instance, browsing the Web or reading email), and
selects a term from the document for which he
would like more information.

12
Summary of approach

Generate 16 topic-sensitive PageRank vectors,
each biased using URLs from a top-level category
of the Open Directory Project (ODP)
At query time, calculate the similarity of the
query to each topic
Take the linear combination of the
topic-sensitive vectors, weighted using the
similarities of the query to the topics
Because the link-based computations are performed
offline, the query-time cost is low

13
2. Review of PageRank

Page u links to page v
Example:
Yahoo! -> important page (many pages point to it)
Pages pointed to from Yahoo! are probably important
$N_u$ -> out-degree of page u
$Rank(p)$: importance of page p
Link (u, v) confers $Rank(u)/N_u$ units of rank to v

N is the number of pages; assign all pages the
initial value 1/N
$B_v$ represents the set of pages pointing to v
In each iteration: $Rank_{i+1}(v) = \sum_{u \in B_v} Rank_i(u) / N_u$
The final vector $Rank^*$
contains the PageRank vector over the Web
It is computed only once after each crawl of the Web

Expressed as the following eigenvector
calculation: $\vec{Rank} = M \times \vec{Rank}$
M -> square stochastic matrix corresponding to
the directed graph G of the Web
If there is a link from page j to page i, $M_{ij} = 1/N_j$ (otherwise $M_{ij} = 0$)
Repeatedly multiplying Rank by M yields the
dominant eigenvector $Rank^*$ of the matrix M
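
The iteration above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the paper: the function name pagerank_simple, the toy graph, and the fixed iteration count are all assumptions, and the sketch assumes every page has at least one outlink.

```python
def pagerank_simple(out_links, iterations=50):
    """Iterate Rank_{i+1}(v) = sum of Rank_i(u) / N_u over all u in B_v."""
    pages = sorted(out_links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # every page starts at 1/N
    for _ in range(iterations):
        new_rank = {page: 0.0 for page in pages}
        for u, targets in out_links.items():
            share = rank[u] / len(targets)  # N_u is the out-degree of u
            for v in targets:
                new_rank[v] += share  # link (u, v) confers Rank(u)/N_u to v
        rank = new_rank
    return rank


# Hypothetical three-page graph; every page needs at least one outlink here.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank_simple(toy_graph)
print({page: round(score, 3) for page, score in ranks.items()})
```

On this toy graph the scores settle near 0.4 for a and c and 0.2 for b, which is the stationary distribution of the corresponding matrix M.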

16
An example
[Figure: example directed graph with nodes v1 to v5]
17

PageRank can be viewed as the stationary
probability distribution over pages induced by a
random walk on the Web
Convergence is guaranteed if M is irreducible and aperiodic
A damping factor $1 - \alpha$ limits the effect of rank sinks
Add transition edges of probability $\alpha / N$ between
every pair of nodes in G
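
A minimal sketch of the damped iteration, assuming the uniform case where every page receives alpha / N from the added transition edges. The name pagerank_damped, the default alpha, and the small graph are illustrative assumptions.

```python
def pagerank_damped(out_links, alpha=0.25, iterations=100):
    """PageRank with uniform teleport edges of probability alpha / N."""
    pages = sorted(out_links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives alpha / N from the added transition edges.
        new_rank = {page: alpha / n for page in pages}
        for u, targets in out_links.items():
            # With probability 1 - alpha the surfer follows an outlink of u.
            share = (1.0 - alpha) * rank[u] / len(targets)
            for v in targets:
                new_rank[v] += share
        rank = new_rank
    return rank


# Hypothetical graph: "a" and "b" link only to each other and form a rank sink.
sink_graph = {"a": ["b"], "b": ["a"], "c": ["a"]}
damped = pagerank_damped(sink_graph)
print(damped)
```

Without the added edges, page c (which has no inlinks) would lose all of its rank to the a/b pair; with them it keeps a score of alpha / N.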

18
Rank Sink

If Page A points only to Page B and Page B points
only to Page A, the pair accumulates rank without
passing any on, so the PageRank value for these
pages keeps increasing

Ref: Analysis of Rank Sink Problem in PageRank
Algorithm
19
The key to creating topic-sensitive PageRank is
to bias the computation to increase the effect of
certain categories of pages by using a nonuniform
N x 1 personalization vector for $\vec{p}$ in
$\vec{Rank} = (1 - \alpha) M \times \vec{Rank} + \alpha \vec{p}$
20
3. Topic-Sensitive PageRank - Method in this
article

Precompute multiple importance scores for
each page
That is, a set of scores of the importance of a
page with respect to various topics
At query time, these scores are combined to form a
composite PageRank score
The composite score produces the final ranking of
the query results

21
3. Topic-Sensitive PageRank - 3.2 ODP biasing

The first step generates a set of biased PageRank
vectors using a set of basis topics.
Performed once
Offline
Personalization vector
16 different biased PageRank vectors (using the 16
top-level categories of the ODP)
$T_j$: set of URLs in the ODP category $c_j$

DMOZ Open Directory Project 16 top-level topics
22
3.2 Personalization vector
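
The formula on this slide did not survive in the transcript. The sketch below assumes the definition from Haveliwala's paper: the personalization vector $v_j$ gives each URL in $T_j$ the weight $1/|T_j|$ and every other page 0. The function name topic_pagerank, the four-page graph, and the topic sets are hypothetical.

```python
def topic_pagerank(out_links, topic_urls, alpha=0.25, iterations=100):
    """Biased PageRank: teleport only to pages in the topic set T_j."""
    pages = sorted(out_links)
    n = len(pages)
    # Only topic URLs that were actually crawled can receive weight.
    members = [page for page in pages if page in topic_urls]
    # Assumed personalization vector v_j: 1/|T_j| on topic pages, 0 elsewhere.
    bias = {page: (1.0 / len(members) if page in topic_urls else 0.0)
            for page in pages}
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: alpha * bias[page] for page in pages}
        for u, targets in out_links.items():
            share = (1.0 - alpha) * rank[u] / len(targets)
            for v in targets:
                new_rank[v] += share
        rank = new_rank
    return rank


# Hypothetical four-page Web with two made-up topic sets.
web = {
    "sports1": ["sports2", "news"],
    "sports2": ["sports1"],
    "news": ["sports1", "health"],
    "health": ["news"],
}
sports_rank = topic_pagerank(web, {"sports1", "sports2"})
health_rank = topic_pagerank(web, {"health"})
```

The same page receives a different score under each topic vector, which is what the query-time step later combines.
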
23
3.3 Query Time Importance Score

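The second step is performed at query time
Given a query q, let q' be the context of q.
Let $q'_i$ be the ith term in the query context q'
Then given the query q, compute for each $c_j$ the
following:
$P(c_j \mid q') = \frac{P(c_j) \cdot P(q' \mid c_j)}{P(q')} \propto P(c_j) \cdot \prod_i P(q'_i \mid c_j)$
The prior $P(c_j)$ can be chosen so that it
reflects the interests of user k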
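
A minimal sketch of the class-probability step, assuming a unigram model in which each category has a table of term probabilities. The function name, the probability tables, and the small floor used for unseen terms are assumptions made for illustration; the slides do not say how $P(q'_i \mid c_j)$ is estimated.

```python
def topic_probabilities(context_terms, term_probs, priors=None, floor=1e-6):
    """Return P(c_j | q') proportional to P(c_j) * prod_i P(q'_i | c_j)."""
    categories = list(term_probs)
    if priors is None:
        # Uniform prior; a per-user prior could be passed in instead.
        priors = {c: 1.0 / len(categories) for c in categories}
    scores = {}
    for c in categories:
        score = priors[c]
        for term in context_terms:
            # Unseen terms get a small floor probability (an assumption).
            score *= term_probs[c].get(term, floor)
        scores[c] = score
    total = sum(scores.values())
    return {c: score / total for c, score in scores.items()}


# Hypothetical per-category term probabilities.
term_probs = {
    "Sports": {"jordan": 0.02, "basketball": 0.05},
    "Business": {"jordan": 0.001},
}
probs = topic_probabilities(["jordan", "basketball"], term_probs)
print(probs)
```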

Compute the query-sensitive importance score of
each of the retrieved URLs:
$s_{qd} = \sum_j P(c_j \mid q') \cdot rank_{jd}$
$rank_{jd}$ is the rank of document d given by the
rank vector for topic $c_j$
The results are ranked according to this
composite score $s_{qd}$
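
The composite score can be sketched as a weighted sum over topics. The data below is made up: the topic probabilities, rank values, and document names are hypothetical and only show the shape of the computation.

```python
def composite_scores(topic_probs, topic_ranks, documents):
    """s_qd = sum over topics j of P(c_j | q') * rank_jd."""
    return {
        d: sum(topic_probs[c] * topic_ranks[c].get(d, 0.0) for c in topic_probs)
        for d in documents
    }


# Hypothetical inputs: P(c_j | q') and two precomputed topic rank vectors.
topic_probs = {"Sports": 0.8, "Business": 0.2}
topic_ranks = {
    "Sports": {"nba.example": 0.03, "bank.example": 0.001},
    "Business": {"nba.example": 0.002, "bank.example": 0.04},
}
scores = composite_scores(topic_probs, topic_ranks, ["nba.example", "bank.example"])
# Sort the retrieved documents by descending composite score.
ordering = sorted(scores, key=scores.get, reverse=True)
print(ordering)
```

Only the weighted sum runs at query time; the rank vectors themselves are assumed to have been computed offline.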

Random surfer model: the surfer visits a web page
with a certain probability, which derives from the
page's PageRank
$w_j$ is the coefficient used to weight the jth rank
vector
With probability $1 - \alpha$, a random surfer on page u
follows an outlink of u

26
4. Experimental Results

A series of experiments
4-1 describes the similarity measures used to
compare two rankings
4-2 investigates how the induced rankings vary,
based on both the topic used to bias the rank
vectors and the choice of the bias factor
4-3 compares the retrieval performance of ordinary
PageRank versus topic-sensitive PageRank.
4-4 shows how query context can be used in
conjunction with topic-sensitive PageRank

27
4. Experiment Data

crawl contained roughly 280,000 of the 3 million
URLs in the ODP.
35 queries from paper [9], shown in Table 1

28
4-1 Similarity Measure - First measure

First measure
Degree of overlap
between the top n URLs of two rankings
n = 20 is used for the comparisons
It does not indicate the degree to which the
relative orderings of the top n URLs of the two
rankings agree
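
A minimal sketch of the overlap measure, assuming it is computed as the size of the intersection of the two top-n sets divided by n. The function name and the example rankings are illustrative.

```python
def overlap_similarity(ranking1, ranking2, n=20):
    """Fraction of URLs shared by the top n of two rankings."""
    top1 = set(ranking1[:n])
    top2 = set(ranking2[:n])
    return len(top1 & top2) / n


# Hypothetical rankings, compared on their top 4 entries.
first = ["a", "b", "c", "d"]
second = ["d", "c", "b", "e"]
print(overlap_similarity(first, second, n=4))
```

Reversing a ranking leaves this value unchanged, which is the limitation the last bullet points out.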

29
4-1 Similarity Measure - second measure
Kendall's $\tau$ distance measure [9]

$KSim(\tau_1, \tau_2)$ is the probability that $\tau_1$ and $\tau_2$
(each extended to cover U) agree on the relative
ordering of a randomly selected pair of distinct nodes
U: union of the URLs in $\tau_1$ and $\tau_2$

Ref: https://en.wikipedia.org/wiki/Kendall_tau_distance
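
A sketch of the second measure under stated assumptions: URLs missing from a ranking are appended to its end with the same (tied) position, and a pair counts as an agreement only when both rankings order it strictly the same way. The slides do not spell out how ties are treated, so that rule is a guess; rankings are assumed to contain no duplicates.

```python
from itertools import combinations


def ksim(ranking1, ranking2):
    """Fraction of distinct URL pairs that both rankings order the same way."""
    universe = set(ranking1) | set(ranking2)

    def positions(ranking):
        index = {url: i for i, url in enumerate(ranking)}
        tail = len(ranking)  # missing URLs share one position after the end
        return {url: index.get(url, tail) for url in universe}

    pos1 = positions(ranking1)
    pos2 = positions(ranking2)
    pairs = list(combinations(sorted(universe), 2))
    if not pairs:
        return 1.0
    agreements = 0
    for u, v in pairs:
        # The product is positive only when both rankings order u and v alike.
        if (pos1[u] - pos1[v]) * (pos2[u] - pos2[v]) > 0:
            agreements += 1
    return agreements / len(pairs)


print(ksim(["a", "b", "c"], ["a", "c", "b"]))
```

Unlike the overlap measure, this value drops when the same URLs appear in a different order.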

31
4.2 Effect of ODP Biasing

Bias factor $\alpha$
Affects the degree to which the resultant vector
is biased towards the topic vector used for $\vec{p}$
For $\alpha = 1$, the URLs in the bias set $T_j$ will be
assigned the score $1/|T_j|$ (all other URLs receive 0)
As $\alpha \to 0$, the content of $T_j$ becomes irrelevant to
the final score
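
The two extreme cases can be read off the biased PageRank equation. The worked lines below are a sketch that assumes the personalization vector $\vec{v}_j$ puts weight $1/|T_j|$ on each URL in $T_j$ and 0 elsewhere, and that M is irreducible and aperiodic so the unbiased eigenvector is unique.

```latex
\begin{align*}
\vec{Rank} &= (1 - \alpha)\, M \times \vec{Rank} + \alpha\, \vec{p},
  \qquad \vec{p} = \vec{v}_j \\
\alpha = 1 &: \quad \vec{Rank} = \vec{v}_j
  \quad\Rightarrow\quad Rank(i) = 1/|T_j| \text{ for } i \in T_j,\ 0 \text{ otherwise} \\
\alpha \to 0 &: \quad \vec{Rank} \to M \times \vec{Rank}
  \quad \text{(the unbiased eigenvector; } T_j \text{ no longer matters)}
\end{align*}
```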

Use $\alpha = 0.25$ (chosen heuristically)
The induced rankings of query results are not
very sensitive to the choice of $\alpha$
The average overlap
between the top 20
results for the two
values of $\alpha$ is very high

The differences across different topically-biased
PageRank vectors are much higher

investigate which of these rankings is best for
specific queries
Table 5 shows the top 5 ranked URLs

35
4.3 query-sensitive scoring

How effectively can we utilize the ranking
precision?
The highest-valued categories are intuitively the
most relevant categories for the query
Use only the three highest-valued categories
to compute the $s_{qd}$ score

36
4.3 query-sensitive scoring
37
4.3 query-sensitive scoring - experiment

To compare the query-sensitive approach to
PageRank
10 queries
5 volunteers
For each query, the volunteer
was shown 2 result rankings:
Top 10 results with the unbiased PageRank vector
Top 10 results with the composite sqd score
Select all URLs relevant to the query
Choose the better ranking results

38
4.3 query-sensitive scoring - result
39
4.4 context-sensitive scoring

Using the context can help disambiguate the query
term and yield results that more closely reflect
the intent of the user

41
5. Sources of search context

the history of queries issued leading up to the
current query is another form of query context
Example: a query for "basketball" followed by "Jordan"
Some sort of hierarchical directory the user is browsing
User context
Browsing patterns
Bookmarks
Email archive

42
6. Ongoing Work

Discovering sources of search context
Developing the best set of basis
topics (second or third level of the Open Directory
hierarchy) -> efficiency problem
Creating the damping vector used to create the
topic-sensitive rank vectors -> being more resistant
