Exploiting Web Matrix Permutations to Speedup PageRank Computation

1
Exploiting Web Matrix Permutations to Speedup
PageRank Computation
  • Presented by Aries Chan, Cody Lawson, and
    Michael Dwyer

2
Introduction
  • Internet Statistics
  • 151 million active Internet users as of January
    2004
  • 76% used a search engine at least once during the
    month
  • Average time spent searching was about 40 minutes

3
Introduction
  • Search Engines
  • Most common means of accessing Web
  • Easiest method of organizing and accessing
    information
  • Therefore, high quality / usable search engines
    are very important

Rank  Provider                      Searches (000)  Share of Searches (%)
-     All Search                    10,812,734      100.0
1     Google Search                 6,986,580       64.6
2     Yahoo! Search                 1,726,060       16.0
3     MSN/Windows Live/Bing Search  1,156,415       10.7

4
Introduction
  • Basic Idea of a Search Engine
  • Scanning Web Graph
  • Using crawlers to create an index of the
    information
  • Ranking
  • Organizing and ordering information in a usable
    way

5
Introduction
  • Webpage Ranking Problems
  • Personalization
  • http://www.google.com/history/?hl=en
  • Updating: keeping the order up to date
  • 25% of links are changed in one week
  • 5% new content in one week
  • 35% of the entire web changed in eleven weeks
  • Reducing Computation Time

6
Introduction
  • Accelerating the PageRank
  • Use compression to fit the Web Graph into main
    memory
  • Use sequence of scans of the Web Graph to
    efficiently compute in external memory
  • Reduce computation time through numerical methods

7
Introduction
  • Reduce computation time through numerical methods
  • Use iterative methods such as the Gauss-Seidel and
    Jacobi methods (Arasu et al. and Bianchini et al.)
  • Subtract off estimates of non-principal
    eigenvectors (Kamvar et al.)
  • Sort the graph lexicographically by URL, resulting
    in an approximate block structure (Kamvar et al.)
  • Split into two problems: dangling nodes and
    non-dangling nodes (Lee et al.)
  • Others have been working on ways to only update
    the ranks of nodes influenced by Web Graph changes

8
Introduction
  • Del Corso, Gulli, and Romani (authors of the
    paper)
  • Numerical optimization of PageRank
  • View the PageRank computation as a linear system
  • Transform a dense problem into one which uses a
    sparse matrix
  • Treatment of dangling nodes which naturally
    adapts to the random surfer model
  • Exploiting web matrix permutations
  • Increase data locality and reduce number of
    iterations necessary to solve the problem

9
Google's PageRank Overview
  • Web as an oriented graph
  • The random surfer
  • $v_i = \sum_j p_{ji} v_j$
  • (sum of PageRanks of nodes linking to i weighted
    by the transition probability)
  • Equilibrium distribution
  • $v^T = v^T H$ (left eigenvector of H with eigenvalue 1)

10
Problems with the ideal model
  • Dangling nodes (trap the user)
  • Impose a random jump to every other page
  • $B = H + a u^T$ ($a_i = 1$ if page i is dangling, 0
    otherwise; u is the jump distribution)
  • Cyclic paths in the Web Graph (reducibility)
  • Brin and Page suggested adding artificial
    transitions (low probability jump to all nodes)
  • $G = \alpha B + (1 - \alpha) e u^T$
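
The two fixes on this slide can be checked on a toy graph. The sketch below is only an illustration: the four-page graph `links`, the uniform jump vector u, the value alpha = 0.85, and the helper name `google_matrix` are all assumptions, not taken from the slides. It builds the dense Google matrix row by row and confirms that every row is a probability distribution with no zero entries.

```python
def google_matrix(links, n, alpha=0.85):
    """Dense G = alpha * (H + a u^T) + (1 - alpha) * e u^T for a small graph.

    links maps each page to the list of pages it links to (hypothetical data).
    """
    u = [1.0 / n] * n                      # uniform jump distribution (assumption)
    G = []
    for i in range(n):
        out = links.get(i, [])
        if out:                            # ordinary row of H: spread 1 over out-links
            row_b = [0.0] * n
            for j in out:
                row_b[j] += 1.0 / len(out)
        else:                              # dangling page: a_i = 1, so the row of B is u^T
            row_b = list(u)
        # damping: follow B with probability alpha, jump via u otherwise
        G.append([alpha * row_b[j] + (1 - alpha) * u[j] for j in range(n)])
    return G


links = {0: [1, 2], 1: [2], 2: [0, 3], 3: []}   # page 3 is dangling
G = google_matrix(links, 4)
row_sums = [sum(row) for row in G]              # each should be 1 up to rounding
```

Because every entry of G is positive, the chain is irreducible and aperiodic, which is what the artificial transitions are meant to guarantee.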

11
Current PageRank Solution
  • Since G is just a rank one modification of $\alpha H$,
    the power method takes advantage of the sparsity
    of matrix H.
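
A minimal sketch of how the power method can avoid forming G, assuming a uniform jump vector u and alpha = 0.85 (both assumptions; the function name `pagerank_power` and the toy graph are hypothetical). Each step only walks the out-links stored in H; the rank one part is folded into a single scalar.

```python
def pagerank_power(links, n, alpha=0.85, tol=1e-12, max_iter=1000):
    """Power iteration v^T <- v^T G that touches only the nonzeros of H."""
    u = [1.0 / n] * n                      # uniform jump distribution (assumption)
    v = list(u)
    for _ in range(max_iter):
        new = [0.0] * n
        dangling_mass = 0.0
        for i in range(n):
            out = links.get(i, [])
            if out:
                share = alpha * v[i] / len(out)   # alpha * v_i * H_ij
                for j in out:
                    new[j] += share
            else:
                dangling_mass += v[i]             # contributes to v^T a
        # rank one correction: (alpha * v^T a + (1 - alpha) * v^T e) u^T, with v^T e = 1
        jump = alpha * dangling_mass + (1 - alpha)
        new = [new[j] + jump * u[j] for j in range(n)]
        change = sum(abs(p - q) for p, q in zip(new, v))
        v = new
        if change < tol:
            break
    return v


links = {0: [1, 2], 1: [2], 2: [0, 3], 3: []}   # page 3 is dangling
v = pagerank_power(links, 4)
```

On this toy graph page 2 should come out with the largest rank, since it collects links from both page 0 and page 1.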

12
Google's PageRank eigenproblem as a linear system
  • Expand
  • $v^T G = v^T$ (eigenproblem)
  • $G = \alpha (H + a u^T) + (1 - \alpha) e u^T$
  • (expansion of Google matrix)
  • $\Rightarrow v^T (\alpha H + \alpha a u^T) + (1 - \alpha) v^T e u^T = v^T$
  • Restructure
  • Taking $S = I - \alpha H^T - \alpha u a^T$
  • And with $v^T e = 1$
  • $\Rightarrow S v = (1 - \alpha) u$
  • (after taking transpose and rearranging)

13
Dangling Nodes Problems
  • Dangling Nodes
  • Pages with no links to other pages
  • Pages whose existence is inferred but crawler has
    not reached
  • According to a 2001 sample, approximately 76% of
    pages are dangling nodes.

14
Dangling Nodes Problems
  • Natural Model
  • $B = H + a u^T$
  • Jump with probability 1 to any other node
  • Drastic Model
  • Completely removes dangling nodes
  • Problems
  • Dangling nodes themselves are not ranked
  • Removing nodes creates new dangling nodes
  • Self-loop Model, e.g.
  • $B = H + F$
  • $F_{ij} = 1$ if $i = j$ and $i$ is dangling, 0 otherwise
  • Still row stochastic and is similar to the natural
    model
  • Problem
  • Gives unreasonably high rank to the dangling nodes
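
To see the self-loop problem concretely, the sketch below computes PageRank on a toy graph under the natural and the self-loop models. Everything here is illustrative: the graph, the uniform jump vector, alpha = 0.85, and the helper names `transition_rows` and `stationary` are assumptions rather than anything from the paper.

```python
def transition_rows(links, n, model):
    """Row-stochastic B for the 'natural' (H + a u^T) or 'self-loop' (H + F) model."""
    if model not in ("natural", "self-loop"):
        raise ValueError(model)
    rows = []
    for i in range(n):
        out = links.get(i, [])
        row = [0.0] * n
        if out:
            for j in out:
                row[j] += 1.0 / len(out)
        elif model == "natural":
            row = [1.0 / n] * n            # dangling page jumps according to uniform u
        else:
            row[i] = 1.0                   # dangling page links only to itself
        rows.append(row)
    return rows


def stationary(B, alpha=0.85, iters=500):
    """Fixed number of power steps on G = alpha * B + (1 - alpha) * e u^T, u uniform."""
    n = len(B)
    v = [1.0 / n] * n
    for _ in range(iters):
        v = [alpha * sum(v[i] * B[i][j] for i in range(n)) + (1 - alpha) / n
             for j in range(n)]
    return v


links = {0: [1, 2], 1: [2], 2: [0, 3], 3: []}   # page 3 is dangling
v_natural = stationary(transition_rows(links, 4, "natural"))
v_selfloop = stationary(transition_rows(links, 4, "self-loop"))
```

In this example the dangling page 3 holds well over half of the total rank under the self-loop model, compared with roughly a quarter under the natural model, which matches the objection raised on the slide.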

15
Which model is the best?
  • Natural model
  • the most accurate.
  • Problem
  • Gives a much denser matrix B
  • It is at least partially for this reason that the
    problem is approached as an eigenproblem to
    exploit the sparsity of H
  • Can we have an equally lightweight iterative
    approach?

16
Iterative Approach with Sparsity
  • Sparse matrix R
  • $R = I - \alpha H^T$
  • PageRank v obtained by solving
  • $R y = u$, $v = \gamma y$ such that $\|v\|_1 = 1$
  • Why does this work?
  • Since $S = R - \alpha u a^T$
  • and $S v = (1 - \alpha) u$
  • $(R - \alpha u a^T) v = (1 - \alpha) u$
  • Use Sherman-Morrison formula to calculate the
    inverse
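
A small numerical sketch of this approach, under stated assumptions: a toy four-page graph, uniform u, alpha = 0.85, and a simple fixed-point sweep standing in for whatever solver is actually used (`solve_R` is a hypothetical name). It solves the sparse system R y = u, normalizes y, and compares the normalizing constant with the value the Sherman-Morrison formula gives for the scaling factor, (1 - alpha)(1 + a^T y / (1/alpha - a^T y)).

```python
def solve_R(links, n, alpha=0.85, tol=1e-12, max_iter=1000):
    """Solve R y = u, R = I - alpha * H^T, by iterating y <- u + alpha * H^T y."""
    u = [1.0 / n] * n                      # uniform jump distribution (assumption)
    y = list(u)
    for _ in range(max_iter):
        new = list(u)
        for i, outs in links.items():      # dangling pages have no out-links, so they add nothing
            for j in outs:
                new[j] += alpha * y[i] / len(outs)
        change = sum(abs(p - q) for p, q in zip(new, y))
        y = new
        if change < tol:
            break
    return y


links = {0: [1, 2], 1: [2], 2: [0, 3], 3: []}   # page 3 is dangling
n, alpha = 4, 0.85
y = solve_R(links, n, alpha)
v = [yi / sum(y) for yi in y]                    # PageRank: y scaled to unit 1-norm

# Scaling factor predicted by the Sherman-Morrison argument; a^T y sums y over dangling pages.
a_dot_y = sum(y[i] for i in range(n) if not links[i])
gamma = (1 - alpha) * (1 + a_dot_y / (1 / alpha - a_dot_y))
```

Up to rounding, `gamma * sum(y)` should equal 1, so in practice the constant never needs to be computed from the formula: normalizing y is enough.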

17
Iterative Approach with Sparsity
  • We get
  • $(R - \alpha u a^T)^{-1} = R^{-1} + \dfrac{R^{-1} u a^T R^{-1}}{1/\alpha - a^T R^{-1} u}$
  • $v = (1 - \alpha) \left( 1 + \dfrac{a^T y}{1/\alpha - a^T y} \right) y$
  • $v = \gamma y$, with $\gamma = (1 - \alpha) \left( 1 + \dfrac{a^T y}{1/\alpha - a^T y} \right)$
  • The vector v is our PageRank vector and was
    solved using sparse matrix R

18
Exploiting Web Matrix Permutations
  • Use a variety of cheap operators to permute the
    web graph in an organized way, in the hope of
  • increasing data locality
  • reducing the number of iterations needed to solve
    the problem
  • Explore different iterative methods in order to
    solve the problem more quickly.
  • Compare the performance of different iterative
    methods based on specifically permuted web
    graphs.

19
Permutation Strategies
  • The following operations were chosen based on
    their limited impact on the computational cost
    (Del Corso et al.)
  • O - orders nodes by increasing out-degree
  • Q - orders nodes by decreasing out-degree
  • X - orders nodes by increasing in-degree
  • Y - orders nodes by decreasing in-degree
  • B - ordering the nodes according to their BFS
    (Breadth First Search) order of visit
  • T - transposes the matrix
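
The operators above can be sketched on an adjacency-list graph as follows. This is a rough illustration, not the authors' implementation: the helper names, the tie-breaking by original index, and the BFS details (start at page 0, follow out-links, restart at the lowest unvisited page) are all assumptions.

```python
from collections import deque


def out_degree_order(links, n, decreasing=False):
    """O (increasing) or Q (decreasing): order[k] is the old id placed at position k."""
    return sorted(range(n), key=lambda i: len(links.get(i, [])), reverse=decreasing)


def in_degree_order(links, n, decreasing=False):
    """X (increasing) or Y (decreasing) in-degree ordering."""
    indeg = [0] * n
    for outs in links.values():
        for j in outs:
            indeg[j] += 1
    return sorted(range(n), key=lambda i: indeg[i], reverse=decreasing)


def bfs_order(links, n):
    """B: breadth-first order of visit along out-links, restarting when needed."""
    seen, order = [False] * n, []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        queue = deque([start])
        while queue:
            i = queue.popleft()
            order.append(i)
            for j in links.get(i, []):
                if not seen[j]:
                    seen[j] = True
                    queue.append(j)
    return order


def relabel(links, order):
    """Apply a permutation: rename every page to its new position."""
    new_id = {old: new for new, old in enumerate(order)}
    return {new_id[i]: sorted(new_id[j] for j in links.get(i, [])) for i in order}


def transpose(links, n):
    """T: reverse every link, i.e. transpose the adjacency matrix."""
    rev = {i: [] for i in range(n)}
    for i, outs in links.items():
        for j in outs:
            rev[j].append(i)
    return rev


links = {0: [1, 2], 1: [2], 2: [0, 3], 3: []}
order_O = out_degree_order(links, 4)                     # dangling page first
order_Y = in_degree_order(links, 4, decreasing=True)
permuted = relabel(links, order_O)
```

Each operator costs at most one sort or one graph traversal, which is what makes them cheap relative to the PageRank computation itself.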

20
Permutation Strategies (cont.)
  • The O, Q, X, Y, and T operators produce a full matrix
  • The B operator conveniently arranges R into a
    lower block triangular structure due to BFS order
    of visit.
  • Combining these operations on R, the following
    structures of the permuted web matrix are
    produced.

21
Permutation Strategies (cont.)
  • Visual representations of the permuted web graph.

22
Iterative Strategies
  • Power Method
  • Computes dominant eigenvector
  • Jacobi Method
  • Using an initial guess, approximates the solution
    to the linear system of equations. Each
    successive iteration uses the previous
    approximated solution as its next guess till a
    degree of convergence is reached.
  • (further explained)
  • Gauss-Seidel Method (Reverse Gauss-Seidel)
  • Modification of the Jacobi Method, which
    approximates the solution to each successive
    equation in the linear system based on the values
    derived from the previous equations, all within
    each iterative loop. (further explained)
  • (http://college.cengage.com/mathematics/larson/elementary_linear/5e/students/ch08-10/chap_10_2.pdf)
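
The difference between the sweeps can be sketched on the system R y = u with R = I - alpha * H^T. The code assumes a toy graph with no self-loops (so the diagonal of R is 1), uniform u, and alpha = 0.85; the function names are hypothetical. Jacobi reads only the previous iterate, while Gauss-Seidel reuses values already updated in the current sweep, and the reverse variant sweeps the equations from last to first.

```python
def build_in_links(links, n):
    """For each page j, list (i, 1 / outdeg(i)) over the pages i that link to j."""
    incoming = [[] for _ in range(n)]
    for i, outs in links.items():
        for j in outs:
            incoming[j].append((i, 1.0 / len(outs)))
    return incoming


def solve(links, n, method="jacobi", alpha=0.85, tol=1e-10, max_iter=10000):
    """Return (y, iterations) for R y = u using the chosen sweep."""
    if method not in ("jacobi", "gauss-seidel", "reverse-gauss-seidel"):
        raise ValueError(method)
    incoming = build_in_links(links, n)
    u = [1.0 / n] * n                       # uniform jump distribution (assumption)
    y = list(u)
    sweep = range(n - 1, -1, -1) if method == "reverse-gauss-seidel" else range(n)
    for it in range(1, max_iter + 1):
        src = list(y) if method == "jacobi" else y   # Jacobi: frozen copy of the old iterate
        change = 0.0
        for j in sweep:
            # equation j of R y = u, solved for y_j (diagonal entry is 1)
            new_j = u[j] + alpha * sum(w * src[i] for i, w in incoming[j])
            change += abs(new_j - y[j])
            y[j] = new_j
        if change < tol:
            return y, it
    return y, max_iter


links = {0: [1, 2], 1: [2], 2: [0, 3], 3: []}
y_j, it_j = solve(links, 4, "jacobi")
y_gs, it_gs = solve(links, 4, "gauss-seidel")
y_rgs, it_rgs = solve(links, 4, "reverse-gauss-seidel")
```

All three return the same y up to the tolerance. On this graph both Gauss-Seidel variants need fewer sweeps than Jacobi, and the forward and reverse counts differ from each other, a small-scale hint of why the node ordering matters.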

23
Iterative Strategies (cont.)
  • Further exploration led to iterative methods
    based on the distinct block structure of certain
    web graph permutations.
  • DN (or DNR) Method
  • The permuted matrix $R_{OT}$ has the property that the
    lower diagonal block coincides with the identity
    matrix. The matrix can be easily partitioned into
    non-dangling and dangling portions. Then the
    non-dangling part is solved by Gauss-Seidel (or
    Reverse Gauss-Seidel respectively).
  • LB/UB/LBR/UBR Methods
  • Uses the Gauss-Seidel or Reverse Gauss-Seidel
    methods to solve the individual blocks of the
    triangular block matrices produced by the B
    operator.
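
A rough sketch of the dangling/non-dangling split, written directly on R = I - alpha * H^T rather than on the exact permuted matrix from the paper (the partition below, the uniform u, alpha = 0.85, and the name `dn_solve` are assumptions). Because dangling pages have no out-links, the unknowns of the non-dangling pages never depend on them: Gauss-Seidel is run on the non-dangling block only, and the dangling entries then follow from a single substitution pass through the identity block.

```python
def dn_solve(links, n, alpha=0.85, tol=1e-12, max_iter=10000):
    """Solve R y = u by blocks and return the normalized PageRank vector."""
    u = [1.0 / n] * n                              # uniform jump distribution (assumption)
    non_dangling = [i for i in range(n) if links.get(i)]
    dangling = [i for i in range(n) if not links.get(i)]

    # incoming[j] holds (i, 1 / outdeg(i)); every source i is non-dangling by construction
    incoming = [[] for _ in range(n)]
    for i in non_dangling:
        for j in links[i]:
            incoming[j].append((i, 1.0 / len(links[i])))

    y = list(u)
    for _ in range(max_iter):                      # Gauss-Seidel on the non-dangling block
        change = 0.0
        for j in non_dangling:
            new_j = u[j] + alpha * sum(w * y[i] for i, w in incoming[j])
            change += abs(new_j - y[j])
            y[j] = new_j
        if change < tol:
            break

    for j in dangling:                             # identity block: one substitution pass
        y[j] = u[j] + alpha * sum(w * y[i] for i, w in incoming[j])

    total = sum(y)
    return [yi / total for yi in y]


links = {0: [1, 2], 1: [2], 2: [0, 3], 3: []}      # page 3 is dangling
v = dn_solve(links, 4)
```

With a large fraction of pages dangling, the iterative part runs on a much smaller system, which is where the savings of the DN approach would come from.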

24
Results

25
Results (cont.)
  • For both the Power Method and Jacobi Method, the
    number of Mflops is not dependent on permutations
    of the web matrix. (The small differences in the
    numerical data are due to finite precision;
    Del Corso et al.)
  • The Jacobi Method (applied to the matrix R) is
    only a slight improvement (about 3%) compared to
    the Power Method (applied to the matrix G).

26
Results (cont.)
  • The Gauss-Seidel and Reverse Gauss-Seidel Methods
    reduced Mflops by around 40% and running time by
    around 45% on the full matrix compared to the
    Power Method on the full matrix.
  • In particular, the Reverse Gauss-Seidel Method
    performed on the permuted matrix $R_{YTB}$ reduced the
    number of Mflops by 51% and running time by 82%
    when compared to the Power Method on the full
    matrix.

27
Results (cont.)
  • The block methods achieved even better results
  • In particular, the best overall reduction in
    computation time and number of Mflops was
    achieved by the LBR method on the permuted matrix
    $R_{QTB}$.
  • 58% reduction in terms of Mflops
  • 89% reduction in terms of running time

28
Conclusion
  • Objective
  • Accelerating Google PageRank by numerical methods
  • Contribution
  • Viewed web matrix as a sparse linear system
  • Formalized new method for treating dangling nodes
  • Explored new iterative methods and applied them
    to web matrix permutations
  • Achievement
  • About 1/10 of the computation time
  • Mflops reduced by over 50%

29
References
  • G.M. Del Corso, A. Gulli, F. Romani, Exploiting
    Web Matrix Permutations to Speedup PageRank
    Computation
  • Nielsen MegaView Search (http://en-us.nielsen.com/rankings/insights/rankings/internet)