Title: Towards Scaling Fully Personalized PageRank
1 Towards Scaling Fully Personalized PageRank
- Dániel Fogaras, Balázs Rácz
Computer and Automation Research Institute of the Hungarian Academy of Sciences
Budapest University of Technology and Economics
2 Problem formulation
- PageRank (Brin, Page, 98)
- PV: PageRank vector; r: uniform distribution vector
- Overall quality measure of Web pages
- Pre-computation: evaluate PV by power iteration
- Query: order results by PV
- Personalized PageRank (Brin, Page, 98)
- r: preference vector of a user, query dependent
- PPV(r) = PV: personalized quality measure of Web pages
- Pre-computation: r is not known. What to compute?
- Query: power iteration. 5 hours/query!
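A minimal sketch of the power iteration mentioned above, assuming the usual fixed-point form PPV(r) = c·r + (1-c)·PPV(r)·M, where c is the stopping (teleport) probability and M the normalized adjacency matrix. The function name, the dictionary-based graph and the value c = 0.15 are illustrative assumptions, not taken from the slides.

```python
def personalized_pagerank(out_links, r, c=0.15, iterations=50):
    """Power iteration for PPV(r) = c*r + (1-c)*PPV(r)*M.

    out_links: dict mapping every page to a non-empty list of out-neighbors
               (toy representation; pages without out-links are not handled).
    r: dict page -> preference weight, with weights summing to 1.
    """
    pages = list(out_links)
    ppv = {p: r.get(p, 0.0) for p in pages}
    for _ in range(iterations):
        # Teleport part: c * r
        nxt = {p: c * r.get(p, 0.0) for p in pages}
        # Link-following part: (1 - c) * ppv * M
        for p in pages:
            share = (1.0 - c) * ppv[p] / len(out_links[p])
            for q in out_links[p]:
                nxt[q] += share
        ppv = nxt
    return ppv


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
uniform = {p: 1.0 / len(graph) for p in graph}
pv = personalized_pagerank(graph, uniform)        # ordinary PageRank
ppv_a = personalized_pagerank(graph, {"a": 1.0})  # personalized on page "a"
```

With a uniform r this yields ordinary PageRank; with a unit vector it yields the personalized vector of a single page. Running such an iteration over the whole web graph at query time is what makes the per-query cost prohibitive.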
3 Preliminaries
- Linearity
- Full personalization
- Pre-compute $\mathrm{PPV}(r_i)$ for all pages
- $V^2$ disk, $V(V+E)$ time, where $V \approx 10^9$, $E \approx 10^{10}$ ???
- Topic-Sensitive PageRank (Haveliwala 01)
- Linearity
- Pre-compute $\mathrm{PPV}(r_i)$ for a topical basis $r_1, \dots, r_k$, $k \approx 20$
- Query: user submits a topic as a combination of $r_1, \dots, r_k$
- Query engine combines the $\mathrm{PPV}(r_i)$ vectors
- Scaling Personalized Web Search (Jeh, Widom, 03)
- Decomposition, linearity
- Pre-compute $\mathrm{PPV}(r_i)$ for unit vectors $r_1, \dots, r_k$, corresponding to $k \approx 10{,}000$ pages
- Query: personalization over the 10,000 pages
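The "Linearity" bullets refer to a property that is usually stated as below. The weights $\alpha_1, \alpha_2$ are notation assumed here for illustration; they do not appear in the slides.

```latex
\mathrm{PPV}(\alpha_1 r_1 + \alpha_2 r_2)
  = \alpha_1\,\mathrm{PPV}(r_1) + \alpha_2\,\mathrm{PPV}(r_2),
\qquad \alpha_1, \alpha_2 \ge 0,\quad \alpha_1 + \alpha_2 = 1 .
```

Under this property, any preference vector that is a convex combination of pre-computed basis vectors can be answered by combining the stored PPV vectors with the same weights. This is why the size of the basis (all pages, about 20 topics, or about 10,000 pages) determines how much personalization each approach supports.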
4 Towards full personalization
- Our algorithm
- Monte Carlo simulation, not power iteration
- Pre-compute approximate $\mathrm{PPV}(r_i)$ for all unit vectors $r_1, \dots, r_k$, $k$ = number of pages
- Scalability: quasi-linear pre-computation, sub-linear query
- Main points of this presentation
- Outline of the algorithm
- Pre-computation: external-memory, distributed
- Query: used to increase precision
- Error of approximation: tends to zero exponentially
- Exact vs. approximated PPV: space lower bounds
5 Outline of the Algorithm
- Theorem (Jeh, Widom 03, F 03)
- Random walk starts from page u
- Uniform step with probability $1-c$, stops with probability $c$
- $\mathrm{PPV}(u,v) = \Pr\{\text{the walk stops at page } v\}$
- Monte Carlo algorithm
- Pre-computation
- From u simulate N independent random walks
- Database of fingerprints: ending vertices of the walks from all vertices
- Query
- $\mathrm{PPV}(u,v) \approx \#(\text{walks } u \to v) / N$
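The Monte Carlo outline above can be sketched in a few lines, assuming an in-memory toy graph in which every page has at least one out-link. The function names, the graph and c = 0.15 are illustrative assumptions; the real pre-computation is external-memory and distributed.

```python
import random
from collections import Counter


def fingerprint(out_links, u, c, rng):
    """Simulate one walk from u and return the page where it stops."""
    page = u
    while rng.random() >= c:              # continue with probability 1 - c
        page = rng.choice(out_links[page])
    return page


def build_database(out_links, n_walks, c=0.15, seed=0):
    """Pre-computation: N fingerprints (walk endpoints) for every page."""
    rng = random.Random(seed)
    return {u: [fingerprint(out_links, u, c, rng) for _ in range(n_walks)]
            for u in out_links}


def estimate_ppv(database, u):
    """Query: empirical distribution of the fingerprints stored for u."""
    counts = Counter(database[u])
    n = len(database[u])
    return {v: k / n for v, k in counts.items()}


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
db = build_database(graph, n_walks=2000)
approx = estimate_ppv(db, "a")
```

Each stored fingerprint is one sample from the distribution PPV(u, ·), so the query only counts endpoints and never touches the graph.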
6 External memory pre-computation
- Goal: N independent random walks from each vertex
- Input: web graph, $V \approx 10^9$, $E \approx 10^{10}$
- $V + E >$ memory
- Accessing the edges
- Edge scan: stream access
- Edges sorted by source vertices
7 External memory pre-computation (2)
- Goal: N independent random walks from each vertex
- Simulate all walks together
- Iteration: 1 blink, 1 edge scan
- Sort path ends
- Merge with the sorted graph
- Each walk stops with prob. $c$
- $E(\#\text{walks}) = (1-c)^k N V$ after $k$ iterations
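The sort-and-merge iteration can be illustrated with an in-memory stand-in, where Python lists play the role of the on-disk edge file and walk file. This is only a sketch: it assumes every page has at least one out-link, and the function name and the pair layout (current page, start page) are choices made here, not the authors' implementation.

```python
import random


def simulate_all_walks(sorted_edges, pages, n_walks, c=0.15, seed=0):
    """sorted_edges: list of (source, target) pairs sorted by source.
    Returns a list of (start_page, end_page) fingerprints."""
    rng = random.Random(seed)
    # Each live walk is a pair (current_page, start_page).
    live = [(u, u) for u in pages for _ in range(n_walks)]
    fingerprints = []
    while live:
        # Each walk stops independently with probability c.
        survivors = []
        for cur, start in live:
            if rng.random() < c:
                fingerprints.append((start, cur))
            else:
                survivors.append((cur, start))
        # Sort path ends so they line up with the edge order.
        survivors.sort()
        # Merge: one sequential pass over the sorted edges.
        live, i, e = [], 0, 0
        while i < len(survivors):
            cur = survivors[i][0]
            while sorted_edges[e][0] != cur:   # skip sources with no live walk
                e += 1
            first = e
            while e < len(sorted_edges) and sorted_edges[e][0] == cur:
                e += 1
            targets = [t for _, t in sorted_edges[first:e]]
            while i < len(survivors) and survivors[i][0] == cur:
                live.append((rng.choice(targets), survivors[i][1]))
                i += 1
    return fingerprints


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
edges = sorted((u, v) for u, targets in graph.items() for v in targets)
fps = simulate_all_walks(edges, list(graph), n_walks=1000)
```

Because the surviving walks and the edges are both ordered by page, each round reads the edge list front to back exactly once, which is the access pattern a stream-only edge scan allows.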
8 Distributed indexing
- M machines with fast local network connections
- memory $< V + E \le M \cdot$ (memory)
- Parallelize for $N V$ walks
- Parts of the graph in RAM
- Remote transfers batched
- Figure: example with $M = 3$ machines
- Heuristic partition: one site to one machine
- Machine 1: www.cnn.com/, Machine 2: www.yahoo.com/
- Uniform load balance $\Leftrightarrow$ ordinary PR distributed equally
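One simple way to realize a "one site to one machine" partition is to hash the host name of each URL. The slides only state the heuristic; hashing with SHA-1 and the helper name `machine_for` are assumptions made for this sketch.

```python
import hashlib
from urllib.parse import urlsplit


def machine_for(url, n_machines):
    """Map a page to a machine so that all pages of one site share a machine."""
    host = urlsplit(url).hostname or ""
    # A stable digest keeps the assignment identical across processes and runs.
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_machines


same_site = (machine_for("http://www.cnn.com/world", 3)
             == machine_for("http://www.cnn.com/sports/index.html", 3))
```

Links within a site then stay on one machine, so only walks that cross to another site need a (batched) remote transfer. A hash does not by itself guarantee the equal PageRank share per machine that uniform load balance calls for.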
9 Query, increasing precision
- Database of $N V$ fingerprints (path endings)
- Query: $\mathrm{PPV}(u) \approx$ empirical distribution from $N$ samples
- Theorem (Jeh, Widom, 03)
- $O(u)$ denotes the out-neighbors of u
- Query: $\mathrm{PPV}(u) \approx$ empirical distribution from $N \cdot |O(u)|$ samples
- Number of fingerprints for a query
- $F = N \cdot$ (db accesses/query)
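The cited theorem of Jeh and Widom is commonly written as PPV(u) = c·1_u + (1-c)/|O(u)| · Σ PPV(w), summed over the out-neighbors w of u. Assuming that form, a query can pool the fingerprints of all out-neighbors instead of using only those of u. The function name and the tiny hand-written database below are hypothetical.

```python
from collections import Counter


def query_with_neighbors(database, out_links, u, c=0.15):
    """Estimate PPV(u) from the fingerprints of u's out-neighbors.

    Uses PPV(u) = c * 1_u + (1 - c) / |O(u)| * sum of PPV(w) over w in O(u).
    """
    neighbors = out_links[u]
    estimate = Counter({u: c})            # the walk stops at u right away
    for w in neighbors:
        fps = database[w]
        weight = (1.0 - c) / (len(neighbors) * len(fps))
        for end in fps:
            estimate[end] += weight
    return dict(estimate)


graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
toy_db = {"a": ["a", "c"], "b": ["c", "c"], "c": ["a", "c"]}  # 2 fingerprints per page
estimate_a = query_with_neighbors(toy_db, graph, "a")
```

The query reads |O(u)| database records instead of one, so the number of fingerprints behind the answer grows with the number of database accesses without enlarging the database.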
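10 Error of approximation
- Exact: $\mathrm{PPV}(u,v)$
- Approximate by F fingerprints: $\widehat{\mathrm{PPV}}(u,v)$
- Theorem
- If $\mathrm{PPV}(u,v) > \mathrm{PPV}(u,w) + \delta$ holds, then
- $\Pr\{\widehat{\mathrm{PPV}}(u,v) < \widehat{\mathrm{PPV}}(u,w)\} < \exp(-0.3 N \delta^2)$
- Idea of the proof
- $N(\widehat{\mathrm{PPV}}(u,v) - \widehat{\mathrm{PPV}}(u,w)) = \#(u \to v) - \#(u \to w)$
- Sum of F i.i.d. random variables with values $-1, 0, 1$
- Bernstein's inequality
- Error of approximation $\to 0$ exponentially with
- $F = (\text{db size/vertex}) \cdot (\text{db accesses/query}) \to \infty$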
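One way to arrive at the constant 0.3 from Bernstein's inequality is sketched below, writing $N$ for the number of fingerprints used. The variables $X_i$, $Y_i$ and $\mu$ are notation introduced here, and the steps are a plausible reconstruction rather than the authors' own proof.

```latex
\begin{aligned}
X_i &= \begin{cases}
  +1 & \text{walk } i \text{ ends at } v\\
  -1 & \text{walk } i \text{ ends at } w\\
  \phantom{+}0 & \text{otherwise}
\end{cases}
\qquad
\mu = \mathbb{E}X_i = \mathrm{PPV}(u,v) - \mathrm{PPV}(u,w) > \delta, \\
Y_i &= \mu - X_i, \qquad \mathbb{E}Y_i = 0, \quad |Y_i| \le 2, \quad
\mathbb{E}Y_i^2 \le \mathbb{E}X_i^2 \le 1, \\
\Pr\Big\{\sum_{i=1}^{N} X_i < 0\Big\}
 &\le \Pr\Big\{\sum_{i=1}^{N} Y_i \ge N\mu\Big\}
 \le \exp\!\Big(-\frac{(N\mu)^2/2}{N + 2N\mu/3}\Big)
 \le \exp\!\Big(-\frac{N\mu^2}{2\,(1 + 2/3)}\Big) \\
 &= \exp(-0.3\,N\mu^2) < \exp(-0.3\,N\delta^2).
\end{aligned}
```

The second inequality is Bernstein's bound with $|Y_i| \le 2$ and total variance at most $N$; the third uses $\mu \le 1$.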
11 Exact versus approximate
- Model of computation
- Input: graph G with V vertices
- Pre-compute a database of size D
- Query: respond by accessing only the db
- Exact
- Query: u, v, w
- Decide if $\mathrm{PPV}(u,v) > \mathrm{PPV}(u,w)$ holds
- Approximate, for fixed $\varepsilon$ and $\delta$
- Query: u, v, w
- Decide if $\mathrm{PPV}(u,v) > \mathrm{PPV}(u,w)$ holds, with error probability $\varepsilon$, when $\mathrm{PPV}(u,v) - \mathrm{PPV}(u,w) > \delta$
12 Lower bounds for the db size
- For the web graph $V \approx 10^9$
- Theorem 1
- For the Exact problem a $D = \Omega(V^2)$ sized db is required in the worst case
- Theorem 2
- For the Approximate problem $D = \Omega(V)$
- Is it possible to improve the 2nd lower bound?
- Our algorithm uses a $D = O(V \log V)$ sized db
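To get a feel for the gap between the two regimes, the bounds can be evaluated at $V = 10^9$. This is rough arithmetic only: the constant factors hidden by $\Omega$ and $O$ are ignored, D is taken in bits, and the logarithm is assumed to be base 2.

```python
import math

V = 10**9                                # pages, as on the slide

exact_bits = V**2                        # Theorem 1, constant factor ignored
approx_bits = V * math.log2(V)           # O(V log V), constant factor ignored

exact_petabytes = exact_bits / 8 / 1e15
approx_gigabytes = approx_bits / 8 / 1e9
print(f"exact: ~{exact_petabytes:.0f} PB, approximate: ~{approx_gigabytes:.1f} GB")
# prints "exact: ~125 PB, approximate: ~3.7 GB"
```

Even with generous constants, the quadratic bound is far out of reach for a web-scale graph, while the quasi-linear database is of the same order as the graph itself.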
13 Idea of the lower bound proofs
- One-way communication complexity
- Bit-vector probing (BVP)
- Alice has a bit vector. Input: $x = (x_1, x_2, \dots, x_m)$
- Bob has a number. Input: $1 \le k \le m$. Output: $x_k = ?$
- Communication: B bits
- Theorem: $B \ge m$ for any protocol
- Reduction from Exact-PPV to BVP
- Alice has $x = (x_1, x_2, \dots, x_m)$ and builds a graph G with V vertices, where $V^2 = m$; she pre-computes an Exact PPV database of size D
- Bob has $1 \le k \le m$ and picks vertices u, v, w; comparing $\mathrm{PPV}(u,v)$ with $\mathrm{PPV}(u,w)$ reveals $x_k$
- Communication: Exact PPV db, D bits
- Thus $D \ge B \ge m = V^2$
14 Summary
- Fully personalized PR
- Monte-Carlo method, not power iteration
- Pre-computation
- External-memory, distributed
- Query
- Increase precision by (db accesses/query)
- Error of approximation
- Tends to zero exponentially
- Space lower bounds
- Quadratic for Exact PPR
- Linear for Approximate PPR
15 Thank you!
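16 Misc
- $N \cdot \widehat{\mathrm{PPV}}(u,v) = \#(u \to v) \sim \mathrm{Binom}(N, \mathrm{PPV}(u,v))$
- Claim (by Chernoff's bound)
- $\Pr\{\widehat{\mathrm{PPV}}(u,v) > (1+\delta)\,\mathrm{PPV}(u,v)\} < \exp(-N \cdot \mathrm{PPV}(u,v) \cdot \delta^2 / 4)$
- If for a protocol $\Pr\{\text{right answer}\} \ge (1+\gamma)/2$, then $B \ge \gamma m$
- PV: PageRank vector, c: constant, M: normalized adjacency matrix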