Title: Algorithms for Large Data Sets
1. Algorithms for Large Data Sets
Lecture 3, March 23, 2005
http://www.ee.technion.ac.il/courses/049011
2. Ranking Algorithms
3. Outline
- The ranking problem
- PageRank
- HITS (Hubs and Authorities)
- Markov Chains and Random Walks
- PageRank and HITS computation
4. The Ranking Problem
- Input:
  - D: document collection
  - Q: query space
- Goal: find a ranking function rank: D × Q → R, s.t. rank and q induce a ranking (partial order) ≤_q on D
- Same as the relevance scoring function from the previous lecture
5. Text-based Ranking
- Classical ranking functions:
  - Keyword-based boolean ranking
  - Cosine similarity of TF-IDF score vectors (see the sketch at the end of this slide)
- Limitations in the context of web search:
  - The abundance problem
  - Recall is not important
  - Short queries
  - Web pages are poor in text
  - Synonymy (cars vs. autos)
  - Polysemy (java, Michael Jordan)
  - Spam
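To make the classical approach concrete, here is a minimal sketch of cosine-similarity ranking over TF-IDF vectors. The toy corpus, the query, and all function names are illustrative, not from the lecture:

    import math
    from collections import Counter

    def build_tfidf(docs):
        """TF-IDF weight vectors for a corpus of tokenized documents."""
        n = len(docs)
        df = Counter(t for d in docs for t in set(d))       # document frequency
        idf = {t: math.log(n / df[t]) + 1.0 for t in df}    # +1 keeps common terms nonzero
        vecs = [{t: c * idf[t] for t, c in Counter(d).items()} for d in docs]
        return vecs, idf

    def cosine(u, v):
        """Cosine similarity of two sparse vectors stored as dicts."""
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    docs = [["cars", "for", "sale"], ["used", "autos", "for", "sale"], ["java", "coffee"]]
    vecs, idf = build_tfidf(docs)
    q = {t: idf.get(t, 0.0) for t in ["cars", "autos"]}     # query in the same TF-IDF space
    ranked = sorted(range(len(docs)), key=lambda i: cosine(q, vecs[i]), reverse=True)
    print(ranked)   # documents ordered by similarity to the query

Note how the synonymy limitation shows up directly: "cars" and "autos" share no term, so each matches only one document.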
6. Link-based Ranking
Hypertext IR Principle 1:
If p → q, then q is relevant to p.
Hypertext IR Principle 2:
If p → q, then p confers authority on q.
- Hyperlinks carry important semantics:
  - Recommendation
  - Critique
  - Navigation
7. Static Ranking
- Static ranking: rank: D → R, where rank(d) > rank(d') implies d is more authoritative than d'.
- Use links to come up with a static ranking of all web pages.
- Given a query q, use text-based ranking to identify a set S of candidate relevant pages.
- Order the pages of S by their static rank.
- Advantage: the static ranking can be computed in a pre-processing step.
- Disadvantage: no use of Hypertext IR Principle 1.
8. Query-Dependent Ranking
- Given a query q, use text-based ranking to identify a set S of candidate relevant pages.
- Use links within S to come up with a ranking rank: S → R, where rank(d) > rank(d') implies d is more authoritative than d' with respect to q.
- Advantage: both Hypertext IR principles are exploited.
- Disadvantage: less efficient.
9. The Web as a Graph
- V: a set of pages
  - In static ranking, V = the entire web
  - In query-dependent ranking, V = S
- The web graph G = (V, E), where (p, q) is an edge iff p has a hyperlink to q
- A: the adjacency matrix of G
10. Popularity Ranking
- rank(p) = in-degree(p) (see the sketch below)
- Advantages:
  - Most important pages extracted from millions of matches
  - No need for text-rich documents
  - Efficiently computable
- Disadvantages:
  - Bias towards popular pages, irrespective of the query
  - Easily spammable
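Popularity ranking is a one-pass count over the edge list. A minimal sketch, with a made-up edge list:

    from collections import Counter

    # Hypothetical web graph: (p, q) means page p links to page q.
    edges = [("a", "b"), ("c", "b"), ("d", "b"), ("a", "c"), ("d", "c")]

    in_degree = Counter(q for _, q in edges)            # rank(p) = in-degree(p)
    ranking = sorted(in_degree, key=in_degree.get, reverse=True)
    print(ranking)   # ['b', 'c'] -- 'b' has 3 in-links, 'c' has 2

The spammability is visible here too: adding fake source pages that link to a target directly inflates its rank.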
11. PageRank [Page, Brin, Motwani, Winograd 1998]
- Motivating principles:
  - The rank of p should be proportional to the ranks of the pages that point to p
    - Recommendations from Bill Gates & Steve Jobs vs. ones from Moishale and Ahuva
  - The rank of p should depend on the number of pages co-cited with p
    - Compare "Bill Gates recommends only me" vs. "Bill Gates recommends everyone on earth"
12. PageRank, Attempt 1
- r: rank vector
- B: normalized adjacency matrix, B_{p,q} = 1/out-degree(p) if p links to q, and 0 otherwise
- Then: r·B = r
- r is a left eigenvector of B (with eigenvalue 1)
- B must have 1 as an eigenvalue
- Since some rows of B are 0, 1 is not necessarily an eigenvalue
- Rank is lost in sinks
13. PageRank, Attempt 2
- r = λ·rB, where λ > 0 is a scaling constant
- Then:
  - r is a left eigenvector of B with eigenvalue 1/λ
  - Any left eigenvector will do.
  - Usually we will use the normalized principal eigenvector.
  - Rank accumulates at sinks and sink communities.
14. PageRank, Attempt 2: Example
(Figure: a small example graph with communities I, II, and III and their computed ranks, illustrating how rank accumulates at sinks and sink communities.)
15. PageRank, Final Definition
- E(p): rank source function
  - Standard setting: E(p) = ε/|V| for some ε < 1
- pagerank is normalized to unit L1 norm
- e: rank source vector; r: pagerank vector; 1: the all-1s vector
- Then: r = λ·(rB + e) = λ·r(B + 1·e^T)  (since ||r||_1 = 1 implies r·1 = 1)
- r is a left eigenvector of (B + 1·e^T) with eigenvalue 1/λ
- Use the normalized principal eigenvector.
16. The Random Surfer Model
- When visiting a page p, a random surfer:
  - With probability 1 - ε, selects a random outlink p → q and goes to visit q. (focused browsing)
  - With probability ε, jumps to a random web page q. (loss of interest)
  - If p has no outlinks, assume it has a self-loop.
- P: probability transition matrix
17. PageRank & the Random Surfer Model
Suppose e^T = (ε / ((1 - ε)|V|))·1^T.
Then P = (1 - ε)·(B + 1·e^T).
- Therefore, r is a left eigenvector of (B + 1·e^T) with eigenvalue 1/(1 - ε) iff it is a left eigenvector of P with eigenvalue 1.
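To see the equivalence (the exact scaling of e above is reconstructed so that the two stated eigenvalue claims agree; the derivation uses ||r||_1 = 1, i.e., r·1 = 1):

    \[
    P = (1-\varepsilon)B + \tfrac{\varepsilon}{|V|}\,\mathbf{1}\mathbf{1}^T
      = (1-\varepsilon)\bigl(B + \mathbf{1}e^T\bigr),
    \qquad
    rP = r \;\iff\; r\bigl(B + \mathbf{1}e^T\bigr) = \tfrac{1}{1-\varepsilon}\,r .
    \]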
18. Markov Chain Primer
- V: state space
- P: probability transition matrix
  - Non-negative.
  - The sum of each row is 1.
- q_0: initial distribution on V
- q_t = q_0·P^t: the distribution on V after t steps
- P is ergodic if it is:
  - Irreducible (the underlying graph is strongly connected)
  - Aperiodic (for all states u, v, the gcd of the lengths of the paths from u to v is 1)
- Theorem: If P is ergodic, then it has a stationary distribution π. Furthermore, for all q_0, q_t → π as t tends to infinity.
- π·P = π, i.e., π is a left eigenvector of P with eigenvalue 1.
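A tiny numerical illustration of the theorem (the two-state chain below is a toy example, not from the slides): iterating q_{t+1} = q_t·P converges to π from any starting distribution.

    # Two-state ergodic chain; rows are non-negative and sum to 1.
    P = [[0.9, 0.1],
         [0.5, 0.5]]

    def step(q, P):
        """One step of the chain: q_{t+1} = q_t P."""
        n = len(P)
        return [sum(q[i] * P[i][j] for i in range(n)) for j in range(n)]

    q = [1.0, 0.0]                # an arbitrary initial distribution q_0
    for _ in range(50):
        q = step(q, P)
    print(q)   # -> approx [0.8333, 0.1667], the stationary pi with pi P = pi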
19. PageRank & Markov Chains
- Conclusion: the pagerank vector r is the stationary distribution of the random surfer Markov Chain.
- pagerank(p) = r_p = the probability that the random surfer visits p in the limit.
- Note: the random jump guarantees that the Markov Chain is irreducible and aperiodic.
20. PageRank Computation
- Compute r as the stationary distribution of P by iterating r ← r·P (the power method; see the sketch below).
- In practice, about 50 iterations suffice.
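A minimal sketch of the iteration on the random surfer chain of slides 16-17. The graph, ε, the iteration count, and the function name are illustrative:

    def pagerank(links, eps=0.15, iters=50):
        """Power iteration: r <- (1 - eps) * rB + eps/|V|, with a self-loop at sinks."""
        pages = sorted(set(links) | {q for qs in links.values() for q in qs})
        n = len(pages)
        r = {p: 1.0 / n for p in pages}           # uniform initial distribution q_0
        for _ in range(iters):
            nxt = {p: eps / n for p in pages}     # random-jump mass, eps/|V| per page
            for p in pages:
                out = links.get(p) or [p]         # sink => self-loop (slide 16)
                share = (1 - eps) * r[p] / len(out)
                for q in out:
                    nxt[q] += share
            r = nxt
        return r

    # Hypothetical toy graph:
    links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    r = pagerank(links)
    print(sorted(r.items(), key=lambda kv: -kv[1]))   # 'c' and 'a' get most of the rank

Each pass keeps r a probability distribution (the jump mass ε plus the distributed mass 1 - ε sum to 1), so the loop is exactly q_t = q_0·P^t from the Markov chain primer.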
21. HITS: Hubs and Authorities [Kleinberg, 1997]
- HITS = Hyperlink Induced Topic Search
- Main principle: every page p is associated with two scores:
  - Authority score: how authoritative a page is about the query's topic
    - Ex.: query "IR"; authorities: scientific IR papers
    - Ex.: query "automobile manufacturers"; authorities: the Mazda, Toyota, and GM web sites
  - Hub score: how good the page is as a resource list about the query's topic
    - Ex.: query "IR"; hubs: surveys and books about IR
    - Ex.: query "automobile manufacturers"; hubs: KBB, car link lists
22. Mutual Reinforcement
- HITS principles (formalized in the equations below):
  - p is a good authority if it is linked to by many good hubs.
  - p is a good hub if it points to many good authorities.
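Written as update equations (a standard rendering of the two principles, with the hub vector h and authority vector a defined on the next slide):

    \[
    a(p) \;\propto\; \sum_{q \,:\, q \to p} h(q),
    \qquad
    h(p) \;\propto\; \sum_{q \,:\, p \to q} a(q).
    \]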
23. HITS: Algebraic Form
- a: authority vector
- h: hub vector
- A: adjacency matrix
- Then: a ∝ A^T·h and h ∝ A·a, so a ∝ (A^T A)·a and h ∝ (A A^T)·h
- a is the principal eigenvector of A^T·A
- h is the principal eigenvector of A·A^T
24. Co-Citation and Bibliographic Coupling
- A^T·A: the co-citation matrix
  - (A^T A)_{p,q} = the number of pages that link to both p and q.
  - Thus, authority scores propagate through co-citation.
- A·A^T: the bibliographic coupling matrix
  - (A A^T)_{p,q} = the number of pages that both p and q link to.
  - Thus, hub scores propagate through bibliographic coupling.
25. HITS Computation
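A minimal sketch of the mutually reinforcing iteration, alternating a ← A^T·h and h ← A·a with normalization as in the algebraic form above. The toy graph and iteration count are illustrative:

    import math

    def hits(links, iters=50):
        """Alternate a <- A^T h and h <- A a, L2-normalizing each round."""
        pages = sorted(set(links) | {q for qs in links.values() for q in qs})
        h = {p: 1.0 for p in pages}
        a = {p: 1.0 for p in pages}
        for _ in range(iters):
            a = {p: sum(h[q] for q in pages if p in links.get(q, [])) for p in pages}
            h = {p: sum(a[q] for q in links.get(p, [])) for p in pages}
            for v in (a, h):                      # L2-normalize both score vectors
                norm = math.sqrt(sum(x * x for x in v.values())) or 1.0
                for p in v:
                    v[p] /= norm
        return a, h

    # Hypothetical toy graph: two hubs pointing at the same authorities.
    links = {"h1": ["x", "y"], "h2": ["x", "y", "z"], "x": [], "y": [], "z": []}
    a, h = hits(links)
    print(max(a, key=a.get), max(h, key=h.get))   # a top authority and the top hub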
26. Principal Eigenvector Computation
- E: an n×n matrix
- λ_1 > λ_2 > λ_3 > … > λ_n: the eigenvalues of E
- v_1, …, v_n: the corresponding eigenvectors
- The eigenvectors are linearly independent
- Input:
  - The matrix E
  - The principal eigenvalue λ_1
  - A unit vector u which is not orthogonal to v_1
- Goal: compute v_1
27. The Power Method
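A minimal sketch under the assumptions of the previous slide (the matrix E and start vector u below are illustrative; this variant renormalizes by the vector norm rather than dividing by the given λ_1, and both variants converge to the same direction):

    import math

    def power_method(E, u, iters=100):
        """Repeatedly apply E and renormalize: u <- Eu / ||Eu||.
        Converges to the principal eigenvector v1 when u is not orthogonal to v1."""
        n = len(E)
        for _ in range(iters):
            w = [sum(E[i][j] * u[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            u = [x / norm for x in w]
        return u

    E = [[2.0, 1.0],
         [1.0, 2.0]]                       # eigenvalues 3 and 1; v1 is along (1, 1)
    print(power_method(E, [1.0, 0.0]))     # -> approximately [0.707, 0.707]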
28. Why Does It Work?
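The standard convergence argument, reconstructed from the setup of slide 26: expand u in the eigenbasis and apply E^t; the λ_1 term dominates because λ_1 > λ_2.

    \[
    u = \sum_{i=1}^{n} c_i v_i \quad (c_1 \neq 0 \text{ since } u \not\perp v_1),
    \qquad
    E^t u = \sum_{i=1}^{n} c_i \lambda_i^t v_i
          = \lambda_1^t \Bigl( c_1 v_1
            + \sum_{i \ge 2} c_i \bigl(\tfrac{\lambda_i}{\lambda_1}\bigr)^t v_i \Bigr).
    \]

Since |λ_i/λ_1| < 1 for i ≥ 2, the bracketed sum tends to c_1·v_1, so after normalization the iterate converges to ±v_1, at a rate governed by |λ_2/λ_1|.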
29. End of Lecture 3