Algorithms for Large Data Sets - PowerPoint PPT Presentation

1
Algorithms for Large Data Sets
  • Ziv Bar-Yossef

Lecture 3, March 23, 2005
http://www.ee.technion.ac.il/courses/049011
2
Ranking Algorithms
3
Outline
  • The ranking problem
  • PageRank
  • HITS (Hubs Authorities)
  • Markov Chains and Random Walks
  • PageRank and HITS computation

4
The Ranking Problem
  • Input:
  • D: document collection
  • Q: query space
  • Goal: find a ranking function rank: D × Q → ℝ s.t. rank and q induce a ranking (partial order) ⪰q on D
  • Same as the relevance scoring function from the previous lecture

5
Text-based Ranking
  • Classical ranking functions:
  • Keyword-based boolean ranking
  • Cosine similarity of TF-IDF score vectors
  • Limitations in the context of web search:
  • The abundance problem
  • Recall is not important
  • Short queries
  • Web pages are poor in text
  • Synonymy (cars vs. autos)
  • Polysemy (java, "Michael Jordan")
  • Spam

6
Link-based Ranking
Hypertext IR Principle 1:
If p → q, then q is relevant to p.
Hypertext IR Principle 2:
If p → q, then p confers authority to q.
  • Hyperlinks carry important semantics
  • Recommendation
  • Critique
  • Navigation

7
Static Ranking
  • Static ranking: rank: D → ℝ, where rank(d) > rank(d′) implies d is more authoritative than d′
  • Use links to come up with a static ranking of all web pages.
  • Given a query q, use text-based ranking to identify a set S of candidate relevant pages.
  • Order S by static rank.
  • Advantage: the static ranking can be computed in a pre-processing step.
  • Disadvantage: makes no use of Hypertext IR Principle 1.

8
Query-Dependent Ranking
  • Given a query q, use text-based ranking to identify a set S of candidate relevant pages.
  • Use links within S to come up with a ranking rank: S → ℝ, where rank(d) > rank(d′) implies d is more authoritative than d′ with respect to q.
  • Advantage: both Hypertext IR principles are exploited.
  • Disadvantage: less efficient.

9
The Web as a Graph
  • V: a set of pages
  • In static ranking, V = the whole web
  • In query-dependent ranking, V = S
  • The web graph: G = (V, E), where (p, q) is an edge iff p has a hyperlink to q
  • A: the adjacency matrix of G
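As a small illustration of these definitions, the adjacency matrix A can be built directly from an edge list (a minimal sketch; the four-page graph is hypothetical and numpy is assumed available):

```python
import numpy as np

# Hypothetical toy web graph: an edge (p, q) means p hyperlinks to q.
pages = ["a", "b", "c", "d"]
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]

idx = {p: i for i, p in enumerate(pages)}
n = len(pages)

# A[p, q] = 1 iff p has a hyperlink to q.
A = np.zeros((n, n))
for p, q in edges:
    A[idx[p], idx[q]] = 1

print(A)
```

Row sums of A are out-degrees and column sums are in-degrees, which is all the later slides need.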

10
Popularity Ranking
  • rank(p) = in-degree(p)
  • Advantages:
  • The most important pages are extracted from millions of matches
  • No need for text-rich documents
  • Efficiently computable
  • Disadvantages:
  • Bias towards popular pages, irrespective of the query
  • Easily spammable

11
PageRank [Page, Brin, Motwani, Winograd 1998]
  • Motivating principles:
  • The rank of p should be proportional to the rank of the pages that point to p
  • Recommendations from Bill Gates and Steve Jobs vs. from Moishale and Ahuva
  • The rank of p should depend on the number of pages co-cited with p
  • Compare "Bill Gates recommends only me" vs. "Bill Gates recommends everyone on earth"

12
PageRank, Attempt 1
  • r: rank vector
  • B: normalized adjacency matrix, Bp,q = 1/out-degree(p) if p → q, else 0
  • Then: r^T = r^T B, i.e., rank(q) = Σp→q rank(p)/out-degree(p)
  • r is a left eigenvector of B with eigenvalue 1
  • B must have 1 as an eigenvalue
  • Since some rows of B are all 0 (sinks), 1 is not necessarily an eigenvalue
  • Rank is lost in sinks

13
PageRank, Attempt 2
where r is rescaled: r^T = c · r^T B for some constant c
  • Then:
  • r is a left eigenvector of B with eigenvalue 1/c
  • Any left eigenvector will do.
  • Usually the normalized principal eigenvector is used.
  • Rank accumulates at sinks and sink communities.

14
PageRank, Attempt 2: Example
[Figure: a three-component example graph (I, II, III) showing rank accumulating at the sink community; only fragments of the values survive the transcript: 0.25/0.8 ≈ 0.31, 0.65/0.8 ≈ 0.69, 0.3, 0.2, 0.5, and the sink component's (0, 1, 0).]
15
PageRank, Final Definition
  • E(p): rank source function
  • Standard setting: E(p) = ε/|V| for some ε < 1
  • pagerank is normalized to L1 unit norm
  • e: rank source vector; r: pagerank vector; 1: the all-1s vector
  • Then: r^T = (1 − ε) · r^T (B + 1e^T)
  • r is a left eigenvector of (B + 1e^T) with eigenvalue 1/(1 − ε)
  • Use the normalized principal eigenvector.

16
The Random Surfer Model
  • When visiting a page p, a random surfer:
  • With probability 1 − ε, selects a random outlink p → q and goes to visit q (focused browsing)
  • With probability ε, jumps to a random web page q (loss of interest)
  • If p has no outlinks, assume it has a self loop.
  • P: the resulting probability transition matrix
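A minimal sketch of this construction (the 3-page graph and helper name are illustrative; numpy assumed): add self loops at sinks, row-normalize to get B, then mix in the random jump:

```python
import numpy as np

def surfer_matrix(A, eps=0.15):
    """Random-surfer transition matrix: with probability 1 - eps follow a
    random outlink; with probability eps jump to a uniformly random page."""
    A = A.astype(float).copy()
    n = A.shape[0]
    for i in np.where(A.sum(axis=1) == 0)[0]:
        A[i, i] = 1.0                         # self loop at sinks, as above
    B = A / A.sum(axis=1, keepdims=True)      # normalized adjacency matrix
    return (1 - eps) * B + eps / n            # mix in the uniform random jump

# Toy 3-page graph: 0 -> 1, 1 -> 2, and 2 is a sink.
A = np.array([[0, 1, 0],
              [0, 0, 1],
              [0, 0, 0]])
P = surfer_matrix(A)
print(P.sum(axis=1))  # every row sums to 1, so P is a valid transition matrix
```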

17
PageRank and the Random Surfer Model
Suppose P is the transition matrix of the random surfer.
Then the pagerank vector is a stationary distribution of P:
  • Therefore, r is a left eigenvector of (B + 1e^T) with eigenvalue 1/(1 − ε), iff it is a left eigenvector of P with eigenvalue 1.

18
Markov Chain Primer
  • V: state space
  • P: probability transition matrix
  • Non-negative
  • Each row sums to 1
  • q0: initial distribution on V
  • qt = q0 P^t: distribution on V after t steps
  • P is ergodic if it is:
  • Irreducible (the underlying graph is strongly connected)
  • Aperiodic (for all states u, v, the gcd of the lengths of paths from u to v is 1)
  • Theorem: If P is ergodic, then it has a stationary distribution π. Furthermore, for all q0, qt → π as t tends to infinity.
  • πP = π: π is a left eigenvector of P with eigenvalue 1.
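To make the theorem concrete, here is a small sketch (a hypothetical two-state ergodic chain; numpy assumed) checking that qt = q0 P^t converges to the stationary π with πP = π:

```python
import numpy as np

# Ergodic two-state chain: irreducible and aperiodic.
a, b = 0.3, 0.2
P = np.array([[1 - a, a],
              [b, 1 - b]])

# Closed-form stationary distribution of this 2x2 chain.
pi = np.array([b, a]) / (a + b)
assert np.allclose(pi @ P, pi)   # pi is a left eigenvector with eigenvalue 1

# q_t = q_0 P^t converges to pi from any starting distribution q_0.
q = np.array([1.0, 0.0])
for _ in range(100):
    q = q @ P
print(q)  # approximately pi = [0.4, 0.6]
```

The convergence speed is governed by the second eigenvalue of P (here 1 − a − b = 0.5), which is also why PageRank's power iteration below needs only a few dozen steps.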

19
PageRank Markov Chains
  • Conclusion: the pagerank vector r is the stationary distribution of the random surfer Markov chain.
  • pagerank(p) = rp = the probability that the random surfer visits p in the limit.
  • Note: the random jump guarantees that the Markov chain is irreducible and aperiodic.

20
PageRank Computation
  • Power iteration: start from an arbitrary distribution q0 and repeatedly apply qt+1 = qt P.
  • In practice, about 50 iterations suffice.
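A minimal sketch of the computation (the graph and ε = 0.15 are illustrative; numpy assumed): build the random-surfer matrix and repeatedly apply q ← qP.

```python
import numpy as np

def pagerank(A, eps=0.15, iters=50):
    """Power iteration on the random-surfer chain P = (1-eps)B + eps/n."""
    A = A.astype(float).copy()
    n = A.shape[0]
    for i in np.where(A.sum(axis=1) == 0)[0]:
        A[i, i] = 1.0                        # self loop at sinks
    B = A / A.sum(axis=1, keepdims=True)     # normalized adjacency matrix
    P = (1 - eps) * B + eps / n              # random-surfer transition matrix
    q = np.full(n, 1.0 / n)                  # start from the uniform distribution
    for _ in range(iters):                   # ~50 iterations suffice in practice
        q = q @ P
    return q

# Toy graph: pages 0 and 1 both link to page 2; page 2 links back to 0.
A = np.array([[0, 0, 1],
              [0, 0, 1],
              [1, 0, 0]])
r = pagerank(A)
print(r)  # page 2 gets the highest rank; page 1 (no in-links) only eps/n
```

Since P is row-stochastic and the start vector sums to 1, every iterate remains a probability distribution, so no renormalization is needed.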
21
HITS: Hubs and Authorities [Kleinberg, 1997]
  • HITS = Hyperlink Induced Topic Search
  • Main principle: every page p is associated with two scores:
  • Authority score: how authoritative a page is about the query's topic
  • Ex: query "IR"; authorities: scientific IR papers
  • Ex: query "automobile manufacturers"; authorities: the Mazda, Toyota, and GM web sites
  • Hub score: how good the page is as a resource list about the query's topic
  • Ex: query "IR"; hubs: surveys and books about IR
  • Ex: query "automobile manufacturers"; hubs: KBB, car link lists

22
Mutual Reinforcement
  • HITS principles
  • p is a good authority if it is linked to by many good hubs.
  • p is a good hub if it points to many good authorities.

23
HITS Algebraic Form
  • a: authority vector
  • h: hub vector
  • A: adjacency matrix
  • Then: a = A^T h and h = A a
  • Therefore: a = A^T A a and h = A A^T h
  • a is the principal eigenvector of A^T A
  • h is the principal eigenvector of A A^T

24
Co-Citation and Bibliographic Coupling
  • A^T A: co-citation matrix
  • (A^T A)p,q = # of pages that link to both p and q.
  • Thus authority scores propagate through co-citation.
  • A A^T: bibliographic coupling matrix
  • (A A^T)p,q = # of pages that both p and q link to.
  • Thus hub scores propagate through bibliographic coupling.

25
HITS Computation
  • Iterate a ← A^T h and h ← A a, normalizing after each step, until convergence.
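A sketch of the mutual-reinforcement iteration (the toy graph is hypothetical; numpy assumed), alternating a ← A^T h and h ← A a with normalization:

```python
import numpy as np

def hits(A, iters=50):
    """HITS: a converges to the principal eigenvector of A^T A, h to that of A A^T."""
    n = A.shape[0]
    a = np.ones(n)
    h = np.ones(n)
    for _ in range(iters):
        a = A.T @ h                # good authorities are linked to by good hubs
        h = A @ a                  # good hubs point to good authorities
        a /= np.linalg.norm(a)     # normalize to unit length
        h /= np.linalg.norm(h)
    return a, h

# Toy graph: pages 0 and 1 are hubs, both linking to authorities 2 and 3.
A = np.array([[0, 0, 1, 1],
              [0, 0, 1, 1],
              [0, 0, 0, 0],
              [0, 0, 0, 0]], dtype=float)
a, h = hits(A)
print(a, h)  # authority mass on pages 2 and 3, hub mass on pages 0 and 1
```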
26
Principal Eigenvector Computation
  • E: an n × n matrix
  • λ1 > λ2 ≥ … ≥ λn: the eigenvalues of E
  • v1, …, vn: the corresponding eigenvectors
  • The eigenvectors are linearly independent
  • Input:
  • The matrix E
  • The principal eigenvalue λ1
  • A unit vector u that is not orthogonal to v1
  • Goal: compute v1

27
The Power Method
  • Iterate ut+1 = E ut / λ1, starting from u0 = u; ut converges to the component of u in the direction of v1.
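A minimal sketch under the slide's assumptions (the matrix E, λ1 = 3, and the start vector u are illustrative; numpy assumed): repeatedly multiply by E and rescale by λ1.

```python
import numpy as np

def power_method(E, lam1, u, iters=100):
    """Approximate the principal eigenvector v1 via u_t = E^t u / lam1^t."""
    for _ in range(iters):
        u = (E @ u) / lam1         # rescaling by lam1 keeps u bounded
    return u / np.linalg.norm(u)   # return a unit vector in the direction of v1

# Hypothetical symmetric matrix with eigenvalues 3 and 1;
# its principal eigenvector is (1, 1)/sqrt(2).
E = np.array([[2.0, 1.0],
              [1.0, 2.0]])
u = np.array([1.0, 0.0])           # unit vector, not orthogonal to v1
v1 = power_method(E, 3.0, u)
print(v1)  # approximately [0.7071, 0.7071]
```

In practice λ1 is unknown, so implementations normalize u after each multiplication instead of dividing by λ1; the limit direction is the same.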
28
Why Does It Work?
  • Write u = c1 v1 + … + cn vn (possible since the vi are linearly independent); c1 ≠ 0 since u is not orthogonal to v1.
  • Then E^t u / λ1^t = c1 v1 + Σi≥2 ci (λi/λ1)^t vi → c1 v1, since |λi/λ1| < 1 for all i ≥ 2.
29
End of Lecture 3