1
How Google™ works: Eigenvalue methods for PageRank™
2
Introduction
Phase I: Select only 2000 web pages out of the
1,740,000 web pages
Phase II: Select 1000 out of the 2000 web pages
3
Introduction
Structure of the World Wide Web
Directed Graph

Very Large (4.5 billion web pages)

Sparse

Many broken links

4
Introduction
The PageRank™ Thesis
5
Introduction
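The PageRank™ Thesis
$r(P) = \sum_{Q \in B(P)} \frac{r(Q)}{|Q|}$
where
$B(P)$ = all pages pointing to $P$
$|Q|$ = number of outlinks from $Q$
This is a recursive definition, so computation
necessarily requires iteration.
Suppose

There are n web pages, $P_1, P_2, \ldots, P_n$

An initial ranking, $r_0(P_i)$, $i = 1, 2, \ldots, n$, is
assigned to each of them

Then
$r_{j+1}(P_i) = \sum_{Q \in B(P_i)} \frac{r_j(Q)}{|Q|}$, $j = 0, 1, 2, \ldots$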
6
Introduction
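The equation
$r_{j+1}(P_i) = \sum_{Q \in B(P_i)} \frac{r_j(Q)}{|Q|}$
in matrix form is
$\pi_{j+1}^T = \pi_j^T P$
where
$\pi_j^T = (r_j(P_1), r_j(P_2), \ldots, r_j(P_n))$
and
$P = (p_{ik})$ with $p_{ik} = 1/|P_i|$ if $P_i$ links to $P_k$, and $p_{ik} = 0$ otherwise.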
7
Introduction
Raw Idea
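If the limit
$\lim_{j \to \infty} \pi_j^T$
of the ranking (row) vectors $\pi_j^T$
exists, the PageRank vector is defined to be
$\pi^T = \lim_{j \to \infty} \pi_j^T$
and thus we would get
$\pi^T = \pi^T P$
Note: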
This is equivalent to using the Power Method to
find the left-eigenvector of the matrix P
corresponding to the eigenvalue 1.
Therefore the convergence mainly depends on the
matrix P.
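A minimal sketch of this raw iteration in Python, assuming a tiny three-page web in which every page has at least one outlink. The example graph, the uniform starting vector, the 1-norm stopping test and the names `left_multiply` and `power_method` are illustrative assumptions, not taken from the slides.

```python
def left_multiply(pi, P):
    """Return the row vector pi^T P for a dense matrix stored as a list of rows."""
    n = len(P)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]


def power_method(P, tol=1e-10, max_iter=1000):
    """Iterate pi_{j+1}^T = pi_j^T P from a uniform start until successive
    iterates differ by less than tol in the 1-norm."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        new_pi = left_multiply(pi, P)
        if sum(abs(a - b) for a, b in zip(new_pi, pi)) < tol:
            return new_pi
        pi = new_pi
    return pi


# Hypothetical web: page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0.
# Row i holds 1/|P_i| in the columns of the pages that page i links to.
P = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]

pi = power_method(P)
print(pi)
```

For this particular graph the limit exists and the iterates settle at (0.4, 0.2, 0.4), which satisfies pi^T = pi^T P. For other matrices P the limit need not exist, which is why the convergence depends on P.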
8
Markov Model of the Web
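From the definition of the matrix $P = (p_{ik})$ follows
$p_{ik} \ge 0$
and
$\sum_{k} p_{ik} = 1$ for every page $P_i$ that has outlinks, while the row of a dangling page (a page without outlinks) is zero.
The .pdf files on the Web are examples of
dangling pages.
This means that, once the dangling pages are given outlinks, the PageRank iteration would then
represent the evolution of a Markov chain.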
9
Markov Model of the Web
Example (figure: a small web graph with two dangling pages)
10
Markov Model of the Web
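More generally, we change the matrix P in the
following way
$P \to P + a v^T$
where the column vector a is defined as
$a_i = 1$ if page $P_i$ is dangling, $a_i = 0$ otherwise,
the column vector v is a distribution vector,
which is
$v = (v_1, v_2, \ldots, v_n)^T$
where
$v_i > 0$ and $\sum_{i} v_i = 1$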
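A minimal sketch of this modification, assuming a four-page example in which page 3 is dangling and v is the uniform distribution. The example matrix, the choice of v and the name `fix_dangling` are illustrative assumptions.

```python
def fix_dangling(P, v):
    """Return (a, P + a v^T), where a_i = 1 marks the all-zero (dangling) rows of P."""
    a = [1.0 if all(x == 0.0 for x in row) else 0.0 for row in P]
    fixed = [[P[i][j] + a[i] * v[j] for j in range(len(v))]
             for i in range(len(P))]
    return a, fixed


P = [[0.0, 0.5, 0.5, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.5, 0.0, 0.0, 0.5],
     [0.0, 0.0, 0.0, 0.0]]    # page 3 has no outlinks (dangling)
v = [0.25, 0.25, 0.25, 0.25]  # uniform distribution vector (an assumption)

a, P_fixed = fix_dangling(P, v)
row_sums = [sum(row) for row in P_fixed]
print(a, row_sums)
```

Only the dangling row changes: it is replaced by v^T, so every row of the modified matrix sums to 1.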
11
Markov Model of the Web
Therefore the modified left-eigenvector problem is
$\pi^T = \pi^T (P + a v^T)$
But it turns out that we can't apply this model!
Example: Rank Sink
12
Markov Model of the Web
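Instead we use the matrix
$P1 = \alpha (P + a v^T) + (1 - \alpha) e v^T$
where
$0 < \alpha < 1$, $e = (1, 1, \ldots, 1)^T$
and
$v$ is (the same) distribution vector.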
Obviously
The last property of P1 allows the following
interpretation
13
Markov Model of the Web
Interpretation of the PageRank iteration w.r.t.
the matrix P1.
In case of convergence (next section!), the iteration
$\pi_{k+1}^T = \pi_k^T P1$
with the stopping criterion
$\|\pi_{k+1}^T - \pi_k^T\| < \tau$ for a given tolerance $\tau$
can be thought of as modelling the behaviour of a
random surfer.
The random surfer either keeps clicking on
successive links at random, or enters a new URL
on the command line. Thus he can never get into a
small loop of web pages (the Rank Sink problem).
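A minimal sketch of this iteration on a graph that contains a rank sink, assuming P1 = alpha (P + a v^T) + (1 - alpha) e v^T with a uniform v, alpha = 0.85, and a 1-norm stopping test. The example graph and the names `google_matrix` and `pagerank` are illustrative assumptions.

```python
def google_matrix(P, v, alpha):
    """Dense P1 = alpha * (P + a v^T) + (1 - alpha) * e v^T."""
    n = len(P)
    G = []
    for i in range(n):
        dangling = 1.0 if sum(P[i]) == 0.0 else 0.0   # a_i
        G.append([alpha * (P[i][j] + dangling * v[j]) + (1.0 - alpha) * v[j]
                  for j in range(n)])
    return G


def pagerank(G, tau=1e-10, max_iter=10000):
    """Iterate pi^T <- pi^T G until the 1-norm of the change drops below tau."""
    n = len(G)
    pi = [1.0 / n] * n
    iterations = 0
    for _ in range(max_iter):
        new_pi = [sum(pi[i] * G[i][j] for i in range(n)) for j in range(n)]
        residual = sum(abs(x - y) for x, y in zip(new_pi, pi))
        pi = new_pi
        iterations += 1
        if residual < tau:
            break
    return pi, iterations


# Pages 2 and 3 link only to each other, so a surfer who follows links
# alone would be trapped there (a rank sink).
P = [[0.0, 0.5, 0.5, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0],
     [0.0, 0.0, 1.0, 0.0]]
v = [0.25] * 4

G = google_matrix(P, v, alpha=0.85)
pi, iterations = pagerank(G)
print(pi, iterations)
```

Every entry of G is positive, so the surfer can leave the loop {2, 3} by jumping to a page drawn from v. In this example page 0 has no inlinks and is reached only by such jumps, so its rank is (1 - alpha) / 4 = 0.0375.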
14
The Power Method
(Plan of the section)
In this section we shall discuss

storage issues: P1 is a huge, dense matrix.

existence: Is $\lambda = 1$ a possible eigenvalue for P1?

uniqueness: Is $\lambda = 1$ unique for P1?

convergence: Rate of Convergence and Convergence Criteria.

of the PageRank Power Method.
15
The Power Method
Storage Issues Existence Uniqueness Convergence
The matrix P is a very sparse matrix, because of
the nature of the Web:

many dangling web pages

even more web pages containing "just a few"
outlinks.

When we modify the matrix P by adding links from
all dangling web pages to all web pages, we make
the resulting matrix denser.
The matrix P1
In order to avoid the Rank Sink problem we added
links from every web page to every other web
page. Therefore there should be no zero elements
in the matrix P1.
16
The Power Method
Although the arguments on the previous slide are
true, the following holds
$P1 = \alpha P + (\alpha a + (1 - \alpha) e) v^T$
Therefore we only form and store the matrix P and
the vectors a, e and v.
Since P is sparse, each matrix-vector
multiplication required by the Power Method can
be computed in nnz(P) flops, where nnz(P) is the
number of nonzeros in P.
The average number of nonzeros per row in P is
3-10, thus $O(\mathrm{nnz}(P)) \approx O(n)$.
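A minimal sketch of one such multiplication that never forms the dense matrix P1, assuming the identity pi^T P1 = alpha pi^T P + (alpha pi^T a + (1 - alpha) pi^T e) v^T and storing P as adjacency lists. The data layout and the name `sparse_step` are illustrative assumptions; the dense matrix is built here only to check the result on a tiny example.

```python
def sparse_step(pi, outlinks, v, alpha):
    """One step pi^T P1 using only the nonzeros of P.

    outlinks[i] lists the pages that page i links to (no duplicates);
    an empty list marks a dangling page, i.e. a_i = 1.
    """
    n = len(pi)
    new_pi = [0.0] * n
    dangling_mass = 0.0                      # pi^T a
    for i, targets in enumerate(outlinks):
        if targets:
            share = alpha * pi[i] / len(targets)
            for j in targets:                # one update per nonzero of P
                new_pi[j] += share
        else:
            dangling_mass += pi[i]
    coeff = alpha * dangling_mass + (1.0 - alpha) * sum(pi)
    return [new_pi[j] + coeff * v[j] for j in range(n)]


outlinks = [[1, 2], [2], [0, 3], []]         # page 3 is dangling
v = [0.25] * 4
alpha = 0.85
pi = [0.1, 0.2, 0.3, 0.4]
step = sparse_step(pi, outlinks, v, alpha)

# Dense reference, feasible only because n is tiny.
n = len(outlinks)
dense = [[alpha * ((1.0 / len(outlinks[i]) if j in outlinks[i] else 0.0)
                   + (v[j] if not outlinks[i] else 0.0))
          + (1.0 - alpha) * v[j]
          for j in range(n)] for i in range(n)]
reference = [sum(pi[i] * dense[i][j] for i in range(n)) for j in range(n)]
print(step, reference)
```

The loop touches each link once and then adds a multiple of v, so the work per iteration is proportional to nnz(P) + n rather than n squared.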
17
The Power Method
The existence is ensured by the following theorem
Therefore $\lambda = 1$ is an eigenvalue of P1.
Problem
18
The Power Method
The following theorem is true
19
The Power Method
20
The Power Method
Kamvar and Haveliwala have proven that
$|\lambda_2| \le \alpha$, where $\lambda_2$ is the subdominant eigenvalue of P1.
21
The Power Method
But on the other hand, since
we cannot choose $\alpha$ arbitrarily close to zero.
Brin and Page, the founders of Google, use $\alpha = 0.85$.
22
The Power Method
Suppose $\tau$ is the tolerance level, measured by the
residual;
then a rough estimate of the number of iterations
needed to achieve that level is
$\frac{\log_{10} \tau}{\log_{10} \alpha}$
which for $\alpha = 0.85$ produces roughly 85 iterations
when $\tau = 10^{-6}$, and 114 iterations when $\tau = 10^{-8}$.
Brin and Page report success with 50 to 100
iterations.
This corresponds to $\tau \in [10^{-7}, 10^{-3}]$.
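The arithmetic behind these figures can be checked with a few lines of Python, assuming the estimate is the ratio log10(tau) / log10(alpha). The function name `estimated_iterations` is an illustrative assumption.

```python
import math


def estimated_iterations(tau, alpha):
    """Rough iteration count at which alpha**k has fallen to the tolerance tau."""
    return math.log10(tau) / math.log10(alpha)


est_6 = estimated_iterations(1e-6, 0.85)   # tolerance 10^-6
est_8 = estimated_iterations(1e-8, 0.85)   # tolerance 10^-8

# Tolerance levels implied by 50 and 100 iterations at alpha = 0.85.
tau_50 = 0.85 ** 50
tau_100 = 0.85 ** 100

print(est_6, est_8, tau_50, tau_100)
```

The first estimate is very close to 85 and the second lies between 113 and 114. Fifty iterations correspond to a tolerance between 10^-4 and 10^-3, and one hundred iterations to a tolerance between 10^-8 and 10^-7.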
23
The Power Method
(Convergence Criteria)
Being an iterative method, the PageRank Power
Method continues until some termination
criterion is met.
The traditional termination criterion is: stop when the residual falls below the tolerance level.
In our case:
We don't need the exact values of the PageRank
vector, only their order.
Therefore:
Iterate until the ordering of the approximate
PageRank vector converges.
24
The Power Method
(Convergence Criteria)
Haveliwala has reported significant savings for
some datasets.
But that convergence w.r.t. ordering raises some
interesting issues

How does one measure the difference between two
orderings?

Several papers have provided possible answers to
that question, using such measures as Kendall's
tau, rank aggregation and set overlap.

How does one determine a satisfactory ordering
convergence?
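As one possible answer to the first question, the sketch below counts the page pairs that two score vectors order differently (the unnormalized Kendall tau distance). The score vectors are made-up iterates, ties are ignored, and the names `ranking` and `kendall_tau_distance` are illustrative assumptions.

```python
from itertools import combinations


def ranking(scores):
    """Page indices sorted from highest to lowest score."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])


def kendall_tau_distance(scores_a, scores_b):
    """Number of page pairs that the two score vectors order differently."""
    discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        if (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j]) < 0:
            discordant += 1
    return discordant


# Hypothetical approximate PageRank vectors from three iterations.
prev = [0.30, 0.25, 0.35, 0.10]
curr = [0.32, 0.22, 0.34, 0.12]
later = [0.36, 0.21, 0.31, 0.12]

d_prev_curr = kendall_tau_distance(prev, curr)
d_prev_later = kendall_tau_distance(prev, later)
print(ranking(prev), ranking(curr), ranking(later), d_prev_curr, d_prev_later)
```

Here `prev` and `curr` differ in value but give the same ordering (distance 0), while `later` swaps pages 0 and 2 (distance 1). A stopping rule based on ordering could, for instance, require the distance to stay at zero for several consecutive iterations; how many is exactly the second question above.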

25
Interesting Topics
Acceleration Techniques

Reduction in work per iteration.

Reduction in the number of iterations.

The Linear System Formulation.
Forcing Irreducibility (alternative ways)
Sensitivity, Stability and Condition Numbers
Updating the PageRank vector, etc
26
The Linear System Formulation
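The problem
$\pi^T = \pi^T P1$, $\pi^T e = 1$
can be rewritten, with some algebra, as
$\pi^T (I - \alpha (P + a v^T)) = (1 - \alpha) v^T$
and with further algebra (and some tricks), as
$x^T (I - \alpha P) = v^T$, $\pi^T = x^T / (x^T e)$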
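A small numerical check of this reformulation, assuming the final form is the sparse system x^T (I - alpha P) = v^T followed by the normalization pi^T = x^T / (x^T e). To stay dependency-free the system is solved with the fixed-point sweep x <- v + alpha x P rather than a direct solver, and the result is compared with the Power Method on P1. The example matrix and all function names are illustrative assumptions.

```python
def times_P(x, P):
    """Row vector x^T P for a dense matrix stored as a list of rows."""
    n = len(P)
    return [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]


def solve_linear_system(P, v, alpha, tol=1e-13, max_iter=10000):
    """Solve x^T (I - alpha P) = v^T by the sweep x <- v + alpha x P,
    then normalize so that the entries sum to 1."""
    x = list(v)
    for _ in range(max_iter):
        xP = times_P(x, P)
        new_x = [v[j] + alpha * xP[j] for j in range(len(v))]
        done = sum(abs(p - q) for p, q in zip(new_x, x)) < tol
        x = new_x
        if done:
            break
    total = sum(x)
    return [xj / total for xj in x]


def power_method(P, v, alpha, tol=1e-13, max_iter=10000):
    """Power Method on P1 = alpha (P + a v^T) + (1 - alpha) e v^T."""
    n = len(P)
    a = [1.0 if sum(row) == 0.0 else 0.0 for row in P]
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        piP = times_P(pi, P)
        coeff = (alpha * sum(p * d for p, d in zip(pi, a))
                 + (1.0 - alpha) * sum(pi))
        new_pi = [alpha * piP[j] + coeff * v[j] for j in range(n)]
        done = sum(abs(p - q) for p, q in zip(new_pi, pi)) < tol
        pi = new_pi
        if done:
            break
    return pi


P = [[0.0, 0.5, 0.5, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.5, 0.0, 0.0, 0.5],
     [0.0, 0.0, 0.0, 0.0]]   # page 3 is dangling
v = [0.25] * 4

pi_linear = solve_linear_system(P, v, alpha=0.85)
pi_power = power_method(P, v, alpha=0.85)
max_diff = max(abs(p - q) for p, q in zip(pi_linear, pi_power))
print(pi_linear, pi_power, max_diff)
```

Both routes give the same vector up to rounding. The linear system involves only the sparse matrix P, with the dangling rows left as zeros.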
27
The Linear System Formulation
The nice properties of the problem
are
28
The Linear System Formulation
Idea of the algorithm for solving
Permute the rows and the columns of the matrix P
to get
Then
therefore
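A minimal sketch of this idea, assuming the permutation places the non-dangling pages first, so that the dangling rows of P form a zero block and P has the blocks P11 (non-dangling to non-dangling) and P12 (non-dangling to dangling). Under that assumption only the smaller system x1^T (I - alpha P11) = v1^T has to be solved, after which x2^T = alpha x1^T P12 + v2^T. The block names, the fixed-point solver and the name `solve_reordered` are illustrative assumptions.

```python
def solve_reordered(P, v, alpha, tol=1e-13, max_iter=10000):
    n = len(P)
    nd = [i for i in range(n) if sum(P[i]) > 0.0]    # non-dangling pages
    dg = [i for i in range(n) if sum(P[i]) == 0.0]   # dangling pages

    def x1_times(x1, columns):
        # x1^T times the chosen columns of P, restricted to non-dangling rows
        return [sum(x1[r] * P[i][j] for r, i in enumerate(nd)) for j in columns]

    # Solve x1^T (I - alpha P11) = v1^T by the sweep x1 <- v1 + alpha x1 P11.
    x1 = [v[j] for j in nd]
    for _ in range(max_iter):
        prod = x1_times(x1, nd)
        new_x1 = [v[j] + alpha * prod[c] for c, j in enumerate(nd)]
        done = sum(abs(p - q) for p, q in zip(new_x1, x1)) < tol
        x1 = new_x1
        if done:
            break

    # Dangling part needs no solve: x2^T = alpha x1^T P12 + v2^T.
    prod = x1_times(x1, dg)
    x2 = [v[j] + alpha * prod[c] for c, j in enumerate(dg)]

    # Undo the permutation and normalize.
    x = [0.0] * n
    for c, j in enumerate(nd):
        x[j] = x1[c]
    for c, j in enumerate(dg):
        x[j] = x2[c]
    total = sum(x)
    return [xj / total for xj in x], dg


alpha = 0.85
P = [[0.0, 0.5, 0.5, 0.0],
     [0.0, 0.0, 0.0, 0.0],   # page 1 is dangling
     [0.5, 0.0, 0.0, 0.5],
     [1.0, 0.0, 0.0, 0.0]]
v = [0.25] * 4

pi, dg = solve_reordered(P, v, alpha)

# Check pi^T = alpha pi^T P + (alpha pi^T a + 1 - alpha) v^T.
n = len(P)
piP = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
coeff = alpha * sum(pi[i] for i in dg) + (1.0 - alpha)
residual = max(abs(pi[j] - (alpha * piP[j] + coeff * v[j])) for j in range(n))
print(pi, residual)
```

The iterative work is confined to the non-dangling block, which is attractive when a large share of the pages is dangling.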
29
Forcing Irreducibility
Recall: We modified the matrix P in the following
way
$P1 = \alpha (P + a v^T) + (1 - \alpha) e v^T$
in order to get the irreducible primitive
stochastic matrix P1.
Another way of getting an irreducible primitive
stochastic matrix from P is by adding a dummy web
page to the Web, which connects to all other
web pages and to which every web page is
connected.
This method implies the following changes
and it can be proved that solving the problem
w.r.t. the new matrix is equivalent to solving
Google's problem.
