Algorithms for Large Data Sets

1
Algorithms for Large Data Sets
  • Ziv Bar-Yossef

Lecture 4, April 9, 2006
http://www.ee.technion.ac.il/courses/049011
2
Crash Course in Algebra and Markov Chains
3
Ranking Algorithms
4
PageRank, Attempt 1
  • Additional conditions:
  • r is non-negative: $r \ge 0$
  • r is normalized: $\|r\|_1 = 1$
  • B = normalized adjacency matrix
  • Then:
  • r is a non-negative normalized left eigenvector
    of B with eigenvalue 1
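
The defining equation of this slide did not survive in the transcript. A reading consistent with the conditions above, assuming out(q) denotes the number of outlinks of page q (a symbol that does not appear in the transcript) and that r is treated as a column vector, would be:

```latex
\[
r(p) = \sum_{q \,:\, q \to p} \frac{r(q)}{\mathrm{out}(q)},
\qquad
B_{q,p} =
\begin{cases}
1/\mathrm{out}(q) & \text{if } q \to p,\\
0 & \text{otherwise,}
\end{cases}
\qquad\Longrightarrow\qquad
r^T B = r^T .
\]
```

Under this reading, the last identity is exactly the statement that r is a left eigenvector of B with eigenvalue 1.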

5
PageRank, Attempt 1
  • A solution exists only if B has eigenvalue 1
  • Problem: B may not have 1 as an eigenvalue
  • Because some of its rows are 0 (pages with no outlinks).
  • Example:
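
The slide's own example is not preserved in the transcript. As a stand-in, the sketch below uses a hypothetical two-page graph (page 0 links to page 1, page 1 has no outlinks) to illustrate the problem: the row of B for the sink is all zeros, so multiplying by B loses rank mass and no normalized r can satisfy r B = r. The helper name `left_multiply` is an assumption made for this illustration.

```python
def left_multiply(r, B):
    """Return the row vector r * B (r is a list, B is a list of rows)."""
    n = len(B)
    return [sum(r[q] * B[q][p] for q in range(n)) for p in range(n)]

# Hypothetical graph: page 0 links to page 1; page 1 has no outlinks.
B = [[0.0, 1.0],
     [0.0, 0.0]]   # row 1 is all zeros because page 1 is a sink

r = [0.5, 0.5]     # non-negative and sums to 1
step1 = left_multiply(r, B)
step2 = left_multiply(step1, B)

print(step1, sum(step1))   # half of the rank mass is already gone
print(step2, sum(step2))   # all of it is gone after two steps
```

For this B, r B = (0, r(0)) for every r, so r B = r forces r = 0, which cannot be normalized.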

6
PageRank, Attempt 2
  • $\alpha$ = normalization constant
  • r is a non-negative normalized left eigenvector
    of B with eigenvalue $1/\alpha$

7
PageRank, Attempt 2
  • Any nonzero eigenvalue $\lambda$ of B may give a solution
  • $\alpha = 1/\lambda$
  • r = any non-negative normalized left eigenvector
    of B with eigenvalue $\lambda$
  • Which solution to pick?
  • Pick a principal eigenvector (i.e., one
    corresponding to the maximal $\lambda$)
  • How to find a solution?
  • Power iterations

8
PageRank, Attempt 2
  • Problem 1: The maximal eigenvalue may have
    multiplicity > 1
  • Several possible solutions
  • Happens, for example, when the graph is disconnected
  • Problem 2: Rank accumulates at sinks.
  • Only sinks, or nodes from which no sink can be
    reached, can have nonzero rank mass.
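
Problem 1 can be seen on a small case. The sketch below assumes a hypothetical disconnected graph made of two 2-cycles (0 ↔ 1 and 2 ↔ 3); the helper name `left_multiply` is also an assumption. Three different non-negative normalized vectors all satisfy r B = r, so the eigenvalue 1 does not pin down a single ranking.

```python
def left_multiply(r, B):
    """Return the row vector r * B (r is a list, B is a list of rows)."""
    n = len(B)
    return [sum(r[q] * B[q][p] for q in range(n)) for p in range(n)]

# Hypothetical disconnected graph: pages 0 <-> 1 and pages 2 <-> 3.
B = [[0.0, 1.0, 0.0, 0.0],
     [1.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 1.0],
     [0.0, 0.0, 1.0, 0.0]]

r_a = [0.5, 0.5, 0.0, 0.0]        # all rank on the first component
r_b = [0.0, 0.0, 0.5, 0.5]        # all rank on the second component
r_mix = [0.25, 0.25, 0.25, 0.25]  # an even mixture of the two

for r in (r_a, r_b, r_mix):
    # Each vector is non-negative, sums to 1, and satisfies r * B == r.
    print(r, left_multiply(r, B) == r)
```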

9
PageRank, Final Definition
  • e = rank source vector
  • Standard setting: $e(p) = \epsilon/n$ for all p ($\epsilon < 1$)
  • $\mathbf{1}$ = the all-1s vector
  • Then:
  • r is a non-negative normalized left eigenvector
    of $(B + \mathbf{1}e^T)$ with eigenvalue $1/\alpha$

10
PageRank, Final Definition
  • Any nonzero eigenvalue of $(B + \mathbf{1}e^T)$ may give a
    solution
  • Pick r to be a principal left eigenvector of
    $(B + \mathbf{1}e^T)$
  • Will show:
  • The principal eigenvalue has multiplicity 1, for any
    graph
  • There exists a non-negative left eigenvector
  • Hence, PageRank always exists and is uniquely
    defined
  • Due to the rank source vector, rank no longer
    accumulates at sinks

11
An Alternative View of PageRank: The Random
Surfer Model
  • When visiting a page p, a random surfer:
  • With probability 1 - d, selects a random outlink
    $p \to q$ and goes to visit q. (focused browsing)
  • With probability d, jumps to a random web page q.
    (loss of interest)
  • If p has no outlinks, assume it has a self loop.
  • P = probability transition matrix

12
PageRank and the Random Surfer Model
Suppose: $\epsilon = d/(1-d)$
Then: $P = (1-d)(B + \mathbf{1}e^T)$
  • Therefore, r is a principal left eigenvector of
    $(B + \mathbf{1}e^T)$ if and only if it is a principal left
    eigenvector of P.
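
The relation between P and (B + 1e^T) can be checked numerically. The sketch below assumes the rank source is e(p) = ε/n with ε = d/(1 - d), and uses a hypothetical three-page graph in which the sink gets a self loop; under those assumptions P should equal (1 - d)(B + 1e^T) entry by entry. The value d = 0.15 is only an illustrative choice.

```python
d = 0.15                      # probability of a random jump (illustrative)
eps = d / (1.0 - d)           # assumed rank source parameter
n = 3

# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2; page 2 has no outlinks,
# so it gets a self loop, as in the random surfer model.
B = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0],
     [0.0, 0.0, 1.0]]

# Random surfer transition matrix: follow a link w.p. 1 - d, jump w.p. d.
P = [[(1.0 - d) * B[p][q] + d / n for q in range(n)] for p in range(n)]

# B + 1 e^T, where e(p) = eps / n, so every entry of 1 e^T equals eps / n.
M = [[B[p][q] + eps / n for q in range(n)] for p in range(n)]

max_diff = max(abs(P[p][q] - (1.0 - d) * M[p][q])
               for p in range(n) for q in range(n))
row_sums = [sum(row) for row in P]
print(max_diff, row_sums)
```

Because the two matrices differ only by the positive scalar 1 - d, they have the same left eigenvectors, and the ordering of their eigenvalues is preserved.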

13
PageRank and Markov Chains
  • The PageRank vector is the normalized principal left
    eigenvector of $(B + \mathbf{1}e^T)$.
  • Hence, the PageRank vector is also a principal left
    eigenvector of P.
  • Conclusion: PageRank is the unique stationary
    distribution of the random surfer Markov chain.
  • PageRank(p) = r(p) = probability that the random surfer
    is visiting page p in the limit.
  • Note: The random jump guarantees that the Markov chain is
    ergodic.

14
PageRank Computation
In practice, about 50 iterations suffice.
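
The algorithm shown on this slide is not preserved in the transcript. A minimal sketch of the computation, assuming the random surfer formulation (repeatedly multiplying a distribution by P, with a self loop added to pages that have no outlinks), might look as follows. The function name `pagerank`, the adjacency-list input, and the default d = 0.15 are assumptions made for this illustration.

```python
def pagerank(outlinks, d=0.15, iterations=50):
    """Approximate the stationary distribution of the random surfer chain.

    outlinks[p] is the list of pages that page p links to.
    """
    n = len(outlinks)
    # A page with no outlinks is given a self loop.
    links = [targets if targets else [p] for p, targets in enumerate(outlinks)]
    r = [1.0 / n] * n                      # start from the uniform distribution
    for _ in range(iterations):
        # Random jump: total mass d is spread uniformly (r sums to 1).
        nxt = [d / n] * n
        # Focused browsing: mass (1 - d) * r[p] is split among p's outlinks.
        for p, targets in enumerate(links):
            share = (1.0 - d) * r[p] / len(targets)
            for q in targets:
                nxt[q] += share
        r = nxt
    return r

# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
ranks = pagerank([[1, 2], [2], [0]])
print(ranks)
```

In this small graph, page 1 has a single inlink carrying half of page 0's rank, so it ends up with the lowest score.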
15
HITS: Hubs and Authorities [Kleinberg, 1997]
  • HITS: Hyperlink-Induced Topic Search
  • Main principle: every page p is associated with
    two scores
  • Authority score: how authoritative a page is
    about the query's topic
  • Ex: query: 'IR'; authorities: scientific IR
    papers
  • Ex: query: 'automobile manufacturers';
    authorities: Mazda, Toyota, and GM web sites
  • Hub score: how good the page is as a resource
    list about the query's topic
  • Ex: query: 'IR'; hubs: surveys and books about IR
  • Ex: query: 'automobile manufacturers'; hubs: KBB,
    car link lists

16
Mutual Reinforcement
  • HITS principles:
  • p is a good authority if it is linked to by many
    good hubs.
  • p is a good hub if it points to many good
    authorities.

17
HITS Algebraic Form
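  • a = authority vector
  • h = hub vector
  • A = adjacency matrix
  • Then: $a \propto A^T h$ and $h \propto A a$
  • Therefore: $a \propto A^T A\, a$ and $h \propto A A^T h$
  • a is a principal eigenvector of $A^T A$
  • h is a principal eigenvector of $A A^T$
  • Need to deal with the same issues as in PageRank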

18
Co-Citation and Bibliographic Coupling
  • $A^T A$: co-citation matrix
  • $(A^T A)_{p,q}$ = number of pages that link both to p and to q.
  • Thus: authority scores propagate through
    co-citation.
  • $A A^T$: bibliographic coupling matrix
  • $(A A^T)_{p,q}$ = number of pages that both p and q link to.
  • Thus: hub scores propagate through bibliographic
    coupling.
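
The two matrices can be computed directly on a small case. The sketch below assumes a hypothetical four-page graph and two helper functions (`transpose`, `matmul`) written for this illustration; it forms both products and reads off a few entries.

```python
def transpose(M):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*M)]

def matmul(X, Y):
    """Return the matrix product X * Y."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

# Hypothetical graph: 0 -> 2, 0 -> 3, 1 -> 0, 1 -> 2, 1 -> 3.
A = [[0, 0, 1, 1],
     [1, 0, 1, 1],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]

At = transpose(A)
cocitation = matmul(At, A)   # entry [p][q]: pages linking to both p and q
coupling = matmul(A, At)     # entry [p][q]: pages that both p and q link to

print(cocitation[2][3])  # 2: pages 0 and 1 both link to 2 and to 3
print(cocitation[0][2])  # 1: only page 1 links to both 0 and 2
print(coupling[0][1])    # 2: pages 0 and 1 both link to 2 and to 3
```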

19
HITS Computation
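
The algorithm shown on this slide is not preserved in the transcript. A minimal sketch of one common way to compute the scores, assuming alternating authority and hub updates with normalization after each step, is given below. The function names `hits` and `normalize`, the choice of the 1-norm, and the iteration count are assumptions made for this illustration.

```python
def normalize(v):
    """Scale v so its entries sum to 1 (left unchanged if all zero)."""
    total = sum(v)
    return [x / total for x in v] if total > 0 else v

def hits(A, iterations=50):
    """Return (authority, hub) score vectors for adjacency matrix A."""
    n = len(A)
    a = [1.0] * n
    h = [1.0] * n
    for _ in range(iterations):
        # Authority update: a(p) = sum of h(q) over pages q that link to p.
        a = normalize([sum(A[q][p] * h[q] for q in range(n)) for p in range(n)])
        # Hub update: h(p) = sum of a(q) over pages q that p links to.
        h = normalize([sum(A[p][q] * a[q] for q in range(n)) for p in range(n)])
    return a, h

# Hypothetical graph: 0 -> 2, 0 -> 3, 1 -> 2.
A = [[0, 0, 1, 1],
     [0, 0, 1, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]
a, h = hits(A)
print(a)
print(h)
```

In this graph, page 2 is linked to by both hubs and gets the highest authority score, while page 0 points to both authorities and gets the highest hub score.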
20
Principal Eigenvector Computation
  • E: $n \times n$ matrix
  • $\lambda_1 > \lambda_2 > \lambda_3 > \dots > \lambda_n$: eigenvalues of E
  • $v_1, \dots, v_n$: corresponding eigenvectors
  • Eigenvectors are linearly independent
  • Input:
  • The matrix E
  • The principal eigenvalue $\lambda_1$
  • A unit vector u, which is not orthogonal to $v_1$
  • Goal: compute $v_1$

21
The Power Method
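
The algorithm shown on this slide is not preserved in the transcript. A minimal sketch, assuming the method repeatedly multiplies the start vector u by E and rescales by the known principal eigenvalue, is given below. The function name `power_method`, the fixed iteration count, and the 2 × 2 example matrix are assumptions made for this illustration.

```python
def power_method(E, lam1, u, iterations=40):
    """Approximate c * v1 by computing E^t u / lam1^t."""
    n = len(E)
    w = list(u)
    for _ in range(iterations):
        # Multiply by E and divide by lam1 in the same step, so that w
        # holds E^t u / lam1^t without growing out of range.
        w = [sum(E[i][j] * w[j] for j in range(n)) / lam1 for i in range(n)]
    return w

# Hypothetical symmetric matrix with eigenvalues 3 and 1;
# its principal eigenvector is proportional to (1, 1).
E = [[2.0, 1.0],
     [1.0, 2.0]]
w = power_method(E, lam1=3.0, u=[1.0, 0.0])
print(w)  # both entries approach 0.5
```

Here u = (1, 0) is a unit vector that is not orthogonal to (1, 1), and the error in each entry shrinks by a factor of 1/3 per iteration, which is the ratio of the two eigenvalues.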
22
Why Does It Work?
  • Theorem: As $t \to \infty$, $w/\lambda_1^t \to c \cdot v_1$
    (c is a constant)
  • Convergence rate: proportional to $(\lambda_2/\lambda_1)^t$
  • The larger the spectral gap $\lambda_1 - \lambda_2$, the faster
    the convergence.
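
A short derivation sketch of the theorem, assuming that w denotes E^t u, that the coefficients c_i (a symbol introduced here) expand u in the eigenvector basis with c_1 ≠ 0, and that |λ_i| < |λ_1| for all i ≥ 2:

```latex
\[
u = \sum_{i=1}^{n} c_i v_i
\quad\Longrightarrow\quad
w = E^t u = \sum_{i=1}^{n} c_i \lambda_i^t v_i
\quad\Longrightarrow\quad
\frac{w}{\lambda_1^t}
  = c_1 v_1 + \sum_{i=2}^{n} c_i \left(\frac{\lambda_i}{\lambda_1}\right)^{t} v_i
  \;\xrightarrow[t \to \infty]{}\; c_1 v_1 .
\]
```

The slowest-decaying term in the sum is the one with ratio λ_2/λ_1, which is why the convergence rate is governed by that ratio.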

23
End of Lecture 4