Algorithms for Large Data Sets - PowerPoint PPT Presentation

1
Algorithms for Large Data Sets
  • Ziv Bar-Yossef

Lecture 3, March 23, 2005
http://www.ee.technion.ac.il/courses/049011
2
Ranking Algorithms
3
Outline
  • The ranking problem
  • PageRank
  • HITS (Hubs Authorities)
  • Markov Chains and Random Walks
  • PageRank and HITS computation

4
The Ranking Problem
  • Input:
  • D: document collection
  • Q: query space
  • Goal: find a ranking function rank: D × Q → ℝ s.t. rank and q induce a ranking (partial order) ⪰q on D
  • Same as the relevance scoring function from the previous lecture

5
Text-based Ranking
  • Classical ranking functions:
  • Keyword-based boolean ranking
  • Cosine similarity of TF-IDF score vectors
  • Limitations in the context of web search:
  • The abundance problem
  • Recall is not important
  • Short queries
  • Web pages are poor in text
  • Synonymy (cars vs. autos)
  • Polysemy (java, "Michael Jordan")
  • Spam

6
Link-based Ranking
Hypertext IR Principle 1:
If p → q, then q is relevant to p.
Hypertext IR Principle 2:
If p → q, then p confers authority to q.
  • Hyperlinks carry important semantics
  • Recommendation
  • Critique
  • Navigation

7
Static Ranking
  • Static ranking: rank: D → ℝ, where rank(d) > rank(d′) implies d is more authoritative than d′
  • Use links to come up with a static ranking of all web pages.
  • Given a query q, use text-based ranking to identify a set S of candidate relevant pages.
  • Order S by static rank.
  • Advantage: the static ranking can be computed in a pre-processing step.
  • Disadvantage: makes no use of Hypertext IR Principle 1.

8
Query-Dependent Ranking
  • Given a query q, use text-based ranking to identify a set S of candidate relevant pages.
  • Use links within S to come up with a ranking rank: S → ℝ, where rank(d) > rank(d′) implies d is more authoritative than d′ with respect to q.
  • Advantage: both Hypertext IR principles are exploited.
  • Disadvantage: less efficient.

9
The Web as a Graph
  • V: a set of pages
  • In static ranking, V = the whole web
  • In query-dependent ranking, V = S
  • The web graph: G = (V, E), where (p, q) is an edge iff p has a hyperlink to q
  • A: the adjacency matrix of G
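As a small illustration of these definitions, the adjacency matrix A can be built directly from an edge list (a minimal sketch; the four-page graph is hypothetical and numpy is assumed available):

```python
import numpy as np

# Hypothetical toy web graph: an edge (p, q) means p hyperlinks to q.
pages = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]

idx = {p: i for i, p in enumerate(pages)}
n = len(pages)

# A[p, q] = 1 iff p has a hyperlink to q.
A = np.zeros((n, n))
for p, q in edges:
    A[idx[p], idx[q]] = 1

print(A)
```

Row sums of A are out-degrees and column sums are in-degrees, which is all the later slides need.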

10
Popularity Ranking
  • rank(p) = in-degree(p)
  • Advantages:
  • The most important pages are extracted from millions of matches
  • No need for text-rich documents
  • Efficiently computable
  • Disadvantages:
  • Bias towards popular pages, irrespective of the query
  • Easily spammable

11
PageRank [Page, Brin, Motwani, Winograd 1998]
  • Motivating principles:
  • The rank of p should be proportional to the rank of the pages that point to p
  • Recommendations from Bill Gates and Steve Jobs vs. from Moishale and Ahuva
  • The rank of p should depend on the number of pages co-cited with p
  • Compare "Bill Gates recommends only me" vs. "Bill Gates recommends everyone on earth"

12
PageRank, Attempt 1
  • r: rank vector
  • B: normalized adjacency matrix, Bp,q = 1/out-degree(p) if p → q, else 0
  • Then: r^T = r^T B, i.e., rank(q) = Σp→q rank(p)/out-degree(p)
  • r is a left eigenvector of B with eigenvalue 1
  • B must have 1 as an eigenvalue
  • Since some rows of B are all 0 (sinks), 1 is not necessarily an eigenvalue
  • Rank is lost in sinks

13
PageRank, Attempt 2
where r is rescaled: r^T = c · r^T B for some constant c
  • Then:
  • r is a left eigenvector of B with eigenvalue 1/c
  • Any left eigenvector will do.
  • Usually the normalized principal eigenvector is used.
  • Rank accumulates at sinks and sink communities.

14
PageRank, Attempt 2: Example
[Figure: a three-component example graph (I, II, III) showing rank accumulating at the sink community; only fragments of the values survive the transcript: 0.25/0.8 ≈ 0.31, 0.65/0.8 ≈ 0.69, 0.3, 0.2, 0.5, and the sink component's (0, 1, 0).]
15
PageRank, Final Definition
  • E(p): rank source function
  • Standard setting: E(p) = ε/|V| for some ε < 1
  • pagerank is normalized to L1 unit norm
  • e: rank source vector; r: pagerank vector; 1: the all-1s vector
  • Then: r^T = (1 − ε) · r^T (B + 1e^T)
  • r is a left eigenvector of (B + 1e^T) with eigenvalue 1/(1 − ε)
  • Use the normalized principal eigenvector.

16
The Random Surfer Model
  • When visiting a page p, a random surfer:
  • With probability 1 − ε, selects a random outlink p → q and goes to visit q (focused browsing)
  • With probability ε, jumps to a random web page q (loss of interest)
  • If p has no outlinks, assume it has a self loop.
  • P: the resulting probability transition matrix
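A minimal sketch of this construction (the 3-page graph and helper name are illustrative; numpy assumed): add self loops at sinks, row-normalize to get B, then mix in the random jump:

```python
import numpy as np

def surfer_matrix(A, eps=0.15):
    """Random-surfer transition matrix: with probability 1 - eps follow a
    random outlink; with probability eps jump to a uniformly random page."""
    A = A.astype(float).copy()
    n = A.shape[0]
    for i in np.where(A.sum(axis=1) == 0)[0]:
        A[i, i] = 1.0                         # self loop at sinks, as above
    B = A / A.sum(axis=1, keepdims=True)      # normalized adjacency matrix
    return (1 - eps) * B + eps / n            # mix in the uniform random jump

# Toy 3-page graph: 0 -> 1, 1 -> 2, and 2 is a sink.
A = np.array([[0, 1, 0],
              [0, 0, 1],
              [0, 0, 0]])
P = surfer_matrix(A)
print(P.sum(axis=1))  # every row sums to 1, so P is a valid transition matrix
```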

17
PageRank and the Random Surfer Model
Suppose P is the transition matrix of the random surfer.
Then the pagerank vector is a stationary distribution of P:
  • Therefore, r is a left eigenvector of (B + 1e^T) with eigenvalue 1/(1 − ε), iff it is a left eigenvector of P with eigenvalue 1.

18
Markov Chain Primer
  • V: state space
  • P: probability transition matrix
  • Non-negative
  • Each row sums to 1
  • q0: initial distribution on V
  • qt = q0 P^t: distribution on V after t steps
  • P is ergodic if it is:
  • Irreducible (the underlying graph is strongly connected)
  • Aperiodic (for all states u, v, the gcd of the lengths of paths from u to v is 1)
  • Theorem: If P is ergodic, then it has a stationary distribution π. Furthermore, for all q0, qt → π as t tends to infinity.
  • πP = π: π is a left eigenvector of P with eigenvalue 1.
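To make the theorem concrete, here is a small sketch (a hypothetical two-state ergodic chain; numpy assumed) checking that qt = q0 P^t converges to the stationary π with πP = π:

```python
import numpy as np

# Ergodic two-state chain: irreducible and aperiodic.
a, b = 0.3, 0.2
P = np.array([[1 - a, a],
              [b, 1 - b]])

# Closed-form stationary distribution of this 2x2 chain.
pi = np.array([b, a]) / (a + b)
assert np.allclose(pi @ P, pi)   # pi is a left eigenvector with eigenvalue 1

# q_t = q_0 P^t converges to pi from any starting distribution q_0.
q = np.array([1.0, 0.0])
for _ in range(100):
    q = q @ P
print(q)  # approximately pi = [0.4, 0.6]
```

The convergence speed is governed by the second eigenvalue of P (here 1 − a − b = 0.5), which is also why PageRank's power iteration below needs only a few dozen steps.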

19
PageRank Markov Chains
  • Conclusion: the pagerank vector r is the stationary distribution of the random surfer Markov chain.
  • pagerank(p) = rp = the probability that the random surfer visits p in the limit.
  • Note: the random jump guarantees that the Markov chain is irreducible and aperiodic.

20
PageRank Computation
  • Power iteration: start from an arbitrary distribution q0 and repeatedly apply qt+1 = qt P.
  • In practice, about 50 iterations suffice.
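A minimal sketch of the computation (the graph and ε = 0.15 are illustrative; numpy assumed): build the random-surfer matrix and repeatedly apply q ← qP.

```python
import numpy as np

def pagerank(A, eps=0.15, iters=50):
    """Power iteration on the random-surfer chain P = (1-eps)B + eps/n."""
    A = A.astype(float).copy()
    n = A.shape[0]
    for i in np.where(A.sum(axis=1) == 0)[0]:
        A[i, i] = 1.0                        # self loop at sinks
    B = A / A.sum(axis=1, keepdims=True)     # normalized adjacency matrix
    P = (1 - eps) * B + eps / n              # random-surfer transition matrix
    q = np.full(n, 1.0 / n)                  # start from the uniform distribution
    for _ in range(iters):                   # ~50 iterations suffice in practice
        q = q @ P
    return q

# Toy graph: pages 0 and 1 both link to page 2; page 2 links back to 0.
A = np.array([[0, 0, 1],
              [0, 0, 1],
              [1, 0, 0]])
r = pagerank(A)
print(r)  # page 2 gets the highest rank; page 1 (no in-links) only eps/n
```

Since P is row-stochastic and the start vector sums to 1, every iterate remains a probability distribution, so no renormalization is needed.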
21
HITS: Hubs and Authorities [Kleinberg, 1997]
  • HITS = Hyperlink Induced Topic Search
  • Main principle: every page p is associated with two scores:
  • Authority score: how authoritative a page is about the query's topic
  • Ex: query "IR"; authorities: scientific IR papers
  • Ex: query "automobile manufacturers"; authorities: the Mazda, Toyota, and GM web sites
  • Hub score: how good the page is as a resource list about the query's topic
  • Ex: query "IR"; hubs: surveys and books about IR
  • Ex: query "automobile manufacturers"; hubs: KBB, car link lists

22
Mutual Reinforcement
  • HITS principles
  • p is a good authority if it is linked to by many good hubs.
  • p is a good hub if it points to many good authorities.

23
HITS Algebraic Form
  • a: authority vector
  • h: hub vector
  • A: adjacency matrix
  • Then: a = A^T h and h = A a
  • Therefore: a = A^T A a and h = A A^T h
  • a is the principal eigenvector of A^T A
  • h is the principal eigenvector of A A^T

24
Co-Citation and Bibliographic Coupling
  • A^T A: co-citation matrix
  • (A^T A)p,q = # of pages that link to both p and q.
  • Thus authority scores propagate through co-citation.
  • A A^T: bibliographic coupling matrix
  • (A A^T)p,q = # of pages that both p and q link to.
  • Thus hub scores propagate through bibliographic coupling.

25
HITS Computation
  • Iterate a ← A^T h and h ← A a, normalizing after each step, until convergence.
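A sketch of the mutual-reinforcement iteration (the toy graph is hypothetical; numpy assumed), alternating a ← A^T h and h ← A a with normalization:

```python
import numpy as np

def hits(A, iters=50):
    """HITS: a converges to the principal eigenvector of A^T A, h to that of A A^T."""
    n = A.shape[0]
    a = np.ones(n)
    h = np.ones(n)
    for _ in range(iters):
        a = A.T @ h                # good authorities are linked to by good hubs
        h = A @ a                  # good hubs point to good authorities
        a /= np.linalg.norm(a)     # normalize to unit length
        h /= np.linalg.norm(h)
    return a, h

# Toy graph: pages 0 and 1 are hubs, both linking to authorities 2 and 3.
A = np.array([[0, 0, 1, 1],
              [0, 0, 1, 1],
              [0, 0, 0, 0],
              [0, 0, 0, 0]], dtype=float)
a, h = hits(A)
print(a, h)  # authority mass on pages 2 and 3, hub mass on pages 0 and 1
```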
26
Principal Eigenvector Computation
  • E: an n × n matrix
  • λ1 > λ2 ≥ … ≥ λn: the eigenvalues of E
  • v1, …, vn: the corresponding eigenvectors
  • The eigenvectors are linearly independent
  • Input:
  • The matrix E
  • The principal eigenvalue λ1
  • A unit vector u that is not orthogonal to v1
  • Goal: compute v1

27
The Power Method
  • Iterate ut+1 = E ut / λ1, starting from u0 = u; ut converges to the component of u in the direction of v1.
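A minimal sketch under the slide's assumptions (the matrix E, λ1 = 3, and the start vector u are illustrative; numpy assumed): repeatedly multiply by E and rescale by λ1.

```python
import numpy as np

def power_method(E, lam1, u, iters=100):
    """Approximate the principal eigenvector v1 via u_t = E^t u / lam1^t."""
    for _ in range(iters):
        u = (E @ u) / lam1         # rescaling by lam1 keeps u bounded
    return u / np.linalg.norm(u)   # return a unit vector in the direction of v1

# Hypothetical symmetric matrix with eigenvalues 3 and 1;
# its principal eigenvector is (1, 1)/sqrt(2).
E = np.array([[2.0, 1.0],
              [1.0, 2.0]])
u = np.array([1.0, 0.0])           # unit vector, not orthogonal to v1
v1 = power_method(E, 3.0, u)
print(v1)  # approximately [0.7071, 0.7071]
```

In practice λ1 is unknown, so implementations normalize u after each multiplication instead of dividing by λ1; the limit direction is the same.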
28
Why Does It Work?
  • Write u = c1 v1 + … + cn vn (possible since the vi are linearly independent); c1 ≠ 0 since u is not orthogonal to v1.
  • Then E^t u / λ1^t = c1 v1 + Σi≥2 ci (λi/λ1)^t vi → c1 v1, since |λi/λ1| < 1 for all i ≥ 2.
29
End of Lecture 3