Title: How Google™ works: Eigenvalue methods for PageRank™
2. Introduction
Phase I: Select only 2000 web pages out of the
1,740,000 web pages.
Phase II: Select 1000 out of the 2000 web pages.
3. Introduction
Structure of the World Wide Web
- Directed graph
- Very large (4.5 billion web pages)
4. Introduction
The PageRank™ Thesis
5. Introduction
The PageRank™ Thesis
$$ r(P) = \sum_{Q \in B(P)} \frac{r(Q)}{|Q|} $$
where
$B(P)$ = all pages pointing to $P$,
$|Q|$ = number of outlinks from $Q$.
This is a recursive definition, so computation
necessarily requires iteration.
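Suppose
- There are $n$ web pages, $P_1, P_2, \dots, P_n$
- An initial ranking, $r_0(P_i)$, $i = 1, 2, \dots, n$, is
assigned to each of them
Then
$$ r_{k+1}(P_i) = \sum_{Q \in B(P_i)} \frac{r_k(Q)}{|Q|}, \qquad k = 0, 1, 2, \dots $$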
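6. Introduction
The equation
$$ r_{k+1}(P_i) = \sum_{Q \in B(P_i)} \frac{r_k(Q)}{|Q|} $$
in matrix form is
$$ \pi_{k+1}^T = \pi_k^T P $$
where
$$ \pi_k^T = \bigl( r_k(P_1), r_k(P_2), \dots, r_k(P_n) \bigr) $$
and
$$ p_{ij} = \begin{cases} 1/|P_i|, & \text{if } P_i \text{ links to } P_j, \\ 0, & \text{otherwise.} \end{cases} $$
7. Introduction
Raw Idea
If the limit
$$ \lim_{k \to \infty} \pi_k^T $$
exists, the PageRank vector is defined to be
$$ \pi^T = \lim_{k \to \infty} \pi_k^T $$
and thus we would get
$$ \pi^T = \pi^T P. $$
Note
This is equivalent to using the Power Method to
find the left-eigenvector of the matrix $P$
corresponding to the eigenvalue 1.
Therefore the convergence mainly depends on the
matrix $P$.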
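The iteration described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-page web `toy_links` is hypothetical, every page is assumed to have at least one outlink, and the uniform initial ranking 1/n is just one possible choice.

```python
def pagerank_iteration(links, iterations=100):
    """Iterate r_{k+1}(P_i) = sum over Q in B(P_i) of r_k(Q) / |Q|.

    links[q] lists the pages that page q points to.
    Assumption: every page has at least one outlink (no dangling pages).
    """
    n = len(links)
    rank = [1.0 / n] * n                      # initial ranking r_0(P_i) = 1/n
    for _ in range(iterations):
        new_rank = [0.0] * n
        for q, outlinks in enumerate(links):
            share = rank[q] / len(outlinks)   # r_k(Q) / |Q|
            for p in outlinks:
                new_rank[p] += share          # Q is in B(P)
        rank = new_rank
    return rank


# Hypothetical toy web: 0 -> 1, 2 ; 1 -> 2 ; 2 -> 0
toy_links = [[1, 2], [2], [0]]
ranks = pagerank_iteration(toy_links)
```

For this toy graph the iterates settle near (0.4, 0.2, 0.4), which is the left-eigenvector of the corresponding matrix for the eigenvalue 1, normalised to sum to 1.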
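8. Markov Model of the Web
From the definition
$$ p_{ij} = \begin{cases} 1/|P_i|, & \text{if } P_i \text{ links to } P_j, \\ 0, & \text{otherwise,} \end{cases} $$
follows
$$ p_{ij} \ge 0 $$
and
$$ \sum_{j=1}^{n} p_{ij} = \begin{cases} 1, & \text{if } P_i \text{ has outlinks}, \\ 0, & \text{if } P_i \text{ is a dangling page (a page with no outlinks).} \end{cases} $$
The .pdf files on the Web are examples of
dangling pages.
This means that, without dangling pages, $P$ would be a
stochastic matrix and the PageRank iteration would then
represent the evolution of a Markov chain.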
9. Markov Model of the Web
Example
[Figure: example web graph in which two nodes are labelled as dangling pages]
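10. Markov Model of the Web
More generally, we change the matrix $P$ in the
following way
$$ \bar{P} = P + a v^T $$
where the column vector $a$ is defined as
$$ a_i = \begin{cases} 1, & \text{if } P_i \text{ is a dangling page}, \\ 0, & \text{otherwise,} \end{cases} $$
the column vector $v$ is a distribution vector,
which is
$$ v > 0, \qquad v^T e = 1, $$
where
$$ e = (1, 1, \dots, 1)^T. $$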
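A minimal sketch of this dangling-page fix on a dense toy matrix follows. The matrix `P`, the uniform choice of `v` and the helper name `fix_dangling` are assumptions made for illustration only; a real web matrix would never be stored densely.

```python
def fix_dangling(P, v):
    """Return (P + a v^T, a): every all-zero row of P is replaced by v^T."""
    # a_i = 1 exactly when row i has no outlinks (dangling page)
    a = [1 if sum(row) == 0 else 0 for row in P]
    P_bar = [[p_ij + a_i * v_j for p_ij, v_j in zip(row, v)]
             for row, a_i in zip(P, a)]
    return P_bar, a


# Hypothetical 3-page web; page 1 is dangling (all-zero row).
P = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 0.0],
     [1.0, 0.0, 0.0]]
v = [1 / 3, 1 / 3, 1 / 3]        # assumed uniform distribution vector
P_bar, a = fix_dangling(P, v)
```

After the fix every row of `P_bar` sums to 1, so the modified matrix is stochastic.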
11. Markov Model of the Web
Therefore the modified left-eigenvector problem is
$$ \pi^T = \pi^T \bar{P}, \qquad \bar{P} = P + a v^T. $$
But it turns out that we can't apply this model because of
the Rank Sink problem!
Example
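12. Markov Model of the Web
To avoid the Rank Sink problem, the matrix is modified once more:
$$ P_1 = \alpha \bar{P} + (1 - \alpha) e v^T $$
where
$$ \bar{P} = P + a v^T, \qquad 0 < \alpha < 1, $$
and
$v^T$ is (the same) distribution vector.
Obviously
$$ P_1 \ge 0, \qquad P_1 e = e. $$
The last property of $P_1$ (it is a stochastic matrix) allows
the following interpretation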
13. Markov Model of the Web
Interpretation of the PageRank iteration w.r.t.
the matrix $P_1$.
In case of convergence (next section!), the iteration
$$ \pi_{k+1}^T = \pi_k^T P_1 $$
with the stopping criterion
$$ \| \pi_{k+1}^T - \pi_k^T \| < \tau $$
can be thought of as modelling the behaviour of a
random surfer.
The random surfer either keeps clicking on
successive links at random, or types a new URL
into the browser. Thus he can never get stuck in a
small loop of web pages (the Rank Sink problem).
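The random-surfer model can be illustrated with a small dense example. This is a sketch, not Google's implementation: the three-page web below (page 0 links into a two-page loop) is hypothetical, `v` is assumed uniform, and the 1-norm is assumed for the stopping criterion.

```python
def google_matrix(P_bar, v, alpha=0.85):
    """Dense P1 = alpha * P_bar + (1 - alpha) * e v^T (toy sizes only)."""
    n = len(P_bar)
    return [[alpha * P_bar[i][j] + (1 - alpha) * v[j] for j in range(n)]
            for i in range(n)]


def power_method(M, tol=1e-10, max_iter=1000):
    """Iterate pi^T <- pi^T M until the 1-norm of the change drops below tol."""
    n = len(M)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        new = [sum(pi[i] * M[i][j] for i in range(n)) for j in range(n)]
        if sum(abs(x - y) for x, y in zip(new, pi)) < tol:
            return new
        pi = new
    return pi


# Hypothetical rank sink: 0 -> 1, 1 -> 2, 2 -> 1 (pages 1 and 2 form a loop).
P_bar = [[0.0, 1.0, 0.0],
         [0.0, 0.0, 1.0],
         [0.0, 1.0, 0.0]]
v = [1 / 3, 1 / 3, 1 / 3]
pi = power_method(google_matrix(P_bar, v))
```

Without the teleportation term the rank of page 0 would drain into the loop; with it, page 0 keeps the rank (1 - alpha) / 3 = 0.05 that comes from surfers entering a new URL.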
14. The Power Method
(Plan of the section)
In this section we shall discuss
- $P_1$ is a huge, dense matrix.
- Is $\lambda = 1$ a possible eigenvalue for $P_1$?
- Is $\lambda = 1$ unique for $P_1$?
- Rate of Convergence and Convergence Criteria
of the PageRank Power Method.
15. The Power Method
Storage Issues | Existence | Uniqueness | Convergence
The matrix $P$ is a very sparse matrix, because of
the nature of the Web: dangling pages contribute
all-zero rows, and even more web pages contain just
a few outlinks.
When we modify the matrix $P$ by adding links from
all dangling web pages to all web pages, we make
the resulting matrix denser.
The matrix $P_1$
In order to avoid the Rank Sink problem we added
links from every web page to every other web
page. Therefore there should be no zero elements
in the matrix $P_1$.
16. The Power Method
Although the arguments on the previous slide are
true, the following holds (using $\pi_k^T e = 1$)
$$ \pi_k^T P_1 = \alpha \pi_k^T P + \bigl( \alpha \pi_k^T a + (1 - \alpha) \bigr) v^T. $$
Therefore we only form and store the matrix $P$ and
the vectors $a$, $e$ and $v$.
Since $P$ is sparse, each matrix-vector
multiplication required by the Power Method can
be computed in nnz($P$) flops, where nnz($P$) is the
number of nonzeros in $P$.
The average number of nonzeros per row in $P$ is
3-10, thus $O(\mathrm{nnz}(P)) \approx O(n)$.
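One way to see why the dense matrix never has to be formed is to write a single Power Method step directly on adjacency lists. The sketch below assumes the update pi^T P1 = alpha pi^T P + (alpha pi^T a + (1 - alpha)) v^T with sum(pi) = 1; the function name `sparse_step` and the toy data are hypothetical.

```python
def sparse_step(pi, out_links, v, alpha=0.85):
    """One step pi^T -> pi^T P1 using only the link lists of P.

    out_links[i] lists the pages that page i links to; an empty list marks
    a dangling page. Assumes sum(pi) == 1. Work is proportional to the
    number of links plus n.
    """
    n = len(pi)
    new = [0.0] * n
    dangling_mass = 0.0                       # accumulates pi^T a
    for i, targets in enumerate(out_links):
        if targets:
            share = alpha * pi[i] / len(targets)
            for j in targets:
                new[j] += share               # contribution of alpha * pi^T P
        else:
            dangling_mass += pi[i]
    coeff = alpha * dangling_mass + (1 - alpha)
    return [x + coeff * v_j for x, v_j in zip(new, v)]


# Hypothetical toy web with one dangling page (page 1).
out_links = [[1, 2], [], [0]]
v = [1 / 3, 1 / 3, 1 / 3]
pi0 = [1 / 3, 1 / 3, 1 / 3]
pi1 = sparse_step(pi0, out_links, v)
```

Only the link lists and the vectors are stored; the dangling-page and teleportation corrections enter through one scalar multiple of `v`.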
17. The Power Method
The existence is ensured by the following theorem
Therefore $\lambda = 1$ is an eigenvalue of $P_1$
(indeed, $P_1 e = e$, because every row of $P_1$ sums to 1).
Problem
18. The Power Method
The following theorem is true
19. The Power Method
20. The Power Method
Kamvar and Haveliwala have proven that the second
largest eigenvalue of $P_1$ satisfies
$$ |\lambda_2| \le \alpha, $$
so the error of the Power Method decays roughly like $\alpha^k$.
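21. The Power Method
But on the other hand, since
$$ P_1 = \alpha \bar{P} + (1 - \alpha) e v^T \to e v^T \quad \text{as } \alpha \to 0, $$
which no longer reflects the link structure of the Web,
we cannot choose $\alpha$ arbitrarily close to zero.
Brin and Page, the founders of Google, use
$\alpha = 0.85$.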
22. The Power Method
Suppose $\tau$ is the tolerance level, measured by the
residual
$$ \| \pi_{k+1}^T - \pi_k^T \|, $$
then a rough estimate of the number of iterations
needed to achieve that level is
$$ \frac{\log_{10} \tau}{\log_{10} \alpha}, $$
which for $\alpha = 0.85$ produces roughly 85 iterations
when $\tau = 10^{-6}$, and 114 iterations when $\tau = 10^{-8}$.
Brin and Page report success with 50 to 100
iterations.
This corresponds to $\tau \in [10^{-7}, 10^{-3}]$.
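The estimate is easy to check numerically. The helper below simply evaluates the ratio of base-10 logarithms of the tolerance and of alpha; the function name is hypothetical, and the result is a rough guide rather than a guaranteed iteration count.

```python
import math


def estimated_iterations(tol, alpha=0.85):
    """Rough iteration estimate log10(tol) / log10(alpha) for the Power Method."""
    return math.log10(tol) / math.log10(alpha)


est_6 = estimated_iterations(1e-6)   # about 85
est_8 = estimated_iterations(1e-8)   # a little over 113, i.e. 114 iterations

# Conversely, the tolerance reached after k iterations is roughly alpha**k.
tol_50 = 0.85 ** 50
tol_100 = 0.85 ** 100
```

With alpha = 0.85, 50 iterations correspond to a tolerance between 1e-4 and 1e-3, and 100 iterations to a tolerance just below 1e-7, in line with the range quoted above.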
23. The Power Method
(Convergence Criteria)
Being an iterative method, the PageRank Power
Method continues until some termination
criterion is met.
The traditional termination criterion is
$$ \| \pi_{k+1}^T - \pi_k^T \| < \tau. $$
In our case
We don't need the exact values of the PageRank
vector, only their order.
Therefore
Iterate until the ordering of the approximate
PageRank vector converges.
24. The Power Method
(Convergence Criteria)
Haveliwala has reported significant savings for
some datasets.
But convergence w.r.t. ordering raises some
interesting issues
- How does one measure the difference between two
orderings?
Several papers have provided possible answers to
that question, using such measures as Kendall's
tau, rank aggregation and set overlap.
- How does one determine a satisfactory ordering
convergence?
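As one possible answer to the first question, Kendall's tau between the orderings induced by two successive iterates can be computed as sketched below. The function name and the sample score vectors are hypothetical, ties are assumed not to occur, and the quadratic pair loop is only suitable for small examples.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau between the orderings induced by two score vectors.

    Returns 1.0 for identical orderings and -1.0 for reversed ones.
    Assumes no ties within either vector.
    """
    n = len(scores_a)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1       # the pair is ordered the same way in both
        elif sign < 0:
            discordant += 1       # the pair is swapped
    return (concordant - discordant) / (n * (n - 1) // 2)


same = kendall_tau([0.5, 0.3, 0.2], [0.6, 0.3, 0.1])
reversed_order = kendall_tau([0.5, 0.3, 0.2], [0.1, 0.3, 0.6])
one_swap = kendall_tau([0.5, 0.3, 0.2], [0.5, 0.2, 0.3])
```

A stopping rule based on ordering could, for instance, stop once tau between consecutive iterates stays at 1.0 for several iterations; how many iterations count as satisfactory is exactly the second open question.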
25. Interesting Topics
Acceleration Techniques
- Reduction in work per iteration.
- Reduction in the number of iterations.
The Linear System Formulation
Forcing Irreducibility (alternative ways)
Sensitivity, Stability and Condition Numbers
Updating the PageRank vector, etc.
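26. The Linear System Formulation
The problem
$$ \pi^T P_1 = \pi^T, \qquad \pi^T e = 1, $$
can be rewritten, with some algebra, as
$$ \pi^T (I - \alpha \bar{P}) = (1 - \alpha) v^T $$
and with further algebra (and some tricks), as
$$ x^T (I - \alpha P) = v^T, \qquad \pi^T = \frac{x^T}{x^T e}. $$
27. The Linear System Formulation
The nice properties of the problem
$$ x^T (I - \alpha P) = v^T $$
are, for example,
- $I - \alpha P$ is nonsingular,
- $(I - \alpha P)^{-1} \ge 0$.
28. The Linear System Formulation
Idea of the algorithm for solving
$$ x^T (I - \alpha P) = v^T $$
Permute the rows and the columns of the matrix $P$
(non-dangling pages first, dangling pages last) to get
$$ P = \begin{pmatrix} P_{11} & P_{12} \\ 0 & 0 \end{pmatrix} $$
Then
$$ I - \alpha P = \begin{pmatrix} I - \alpha P_{11} & -\alpha P_{12} \\ 0 & I \end{pmatrix} $$
therefore
$$ x^T = \bigl( v_1^T (I - \alpha P_{11})^{-1} \;\big|\; \alpha v_1^T (I - \alpha P_{11})^{-1} P_{12} + v_2^T \bigr), $$
where $v^T = (v_1^T \mid v_2^T)$ is partitioned in the same way.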
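The linear system formulation can be tried out on a toy example. The sketch below does not implement the permutation algorithm; as a simpler stand-in it solves x^T (I - alpha P) = v^T by the fixed-point iteration x^T <- alpha x^T P + v^T (which converges because alpha < 1 and the row sums of P are at most 1) and then normalises. The function name, the dense storage and the toy matrix are assumptions for illustration.

```python
def solve_linear_pagerank(P, v, alpha=0.85, tol=1e-12, max_iter=10_000):
    """Solve x^T (I - alpha P) = v^T by fixed-point iteration, then normalise.

    P is the raw hyperlink matrix (dangling rows are all zero) and v the
    distribution vector. Returns pi^T = x^T / (x^T e).
    """
    n = len(P)
    x = list(v)
    for _ in range(max_iter):
        new = [alpha * sum(x[i] * P[i][j] for i in range(n)) + v[j]
               for j in range(n)]
        converged = sum(abs(s - t) for s, t in zip(new, x)) < tol
        x = new
        if converged:
            break
    total = sum(x)                            # x^T e
    return [x_i / total for x_i in x]


# Hypothetical 3-page web; page 1 is dangling.
P = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 0.0],
     [1.0, 0.0, 0.0]]
v = [1 / 3, 1 / 3, 1 / 3]
pi = solve_linear_pagerank(P, v)
```

The normalised solution is a stationary vector of alpha (P + a v^T) + (1 - alpha) e v^T, so it agrees with what the Power Method would return for the same toy web.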
29. Forcing Irreducibility
Recall: We modified the matrix $P$ in the following
way
$$ P_1 = \alpha (P + a v^T) + (1 - \alpha) e v^T $$
in order to get the irreducible primitive
stochastic matrix $P_1$.
Another way of getting an irreducible primitive
stochastic matrix from $P$ is by adding a dummy web
page to the Web, which links to all other
web pages and to which every web page links.
This method implies the following changes
$$ \hat{P} = \begin{pmatrix} \alpha \bar{P} & (1 - \alpha) e \\ v^T & 0 \end{pmatrix}, \qquad \bar{P} = P + a v^T, $$
and it can be proved that solving the problem
w.r.t. the new matrix is equivalent to solving
Google's problem.