Title: Estimating the Global PageRank of Web Communities
Slide 1: Estimating the Global PageRank of Web Communities
- Paper by Jason V. Davis and Inderjit S. Dhillon
- Dept. of Computer Sciences
- University of Texas at Austin
- Presentation given by Scott J. McCallen
- Dept. of Computer Science
- Kent State University
- December 4th 2006
Slide 2: Localized Search Engines
- What are they?
- Focus on a particular community
- Examples: www.cs.kent.edu (site specific) or all computer science related websites (topic specific)
- Advantages
- Searching for particular terms with several meanings
- Relatively inexpensive to build and use
- Use less bandwidth, space, and time
- Local domains are orders of magnitude smaller than the global domain
Slide 3: Localized Search Engines (cont.)
- Disadvantages
- Lack of global information
- i.e., only local PageRanks are available
- Why is this a problem?
- Only pages that are highly regarded within that community will have high PageRanks
- There is a need for a global PageRank for pages that exist only within a local domain
- Traditionally, this can only be obtained by crawling the entire domain
Slide 4: Some Global Facts
- 2003 study by Lyman on the global domain
- 8.9 billion pages on the internet (static pages)
- Approximately 18.7 kilobytes each
- 167 terabytes needed to download and crawl the entire web
- These resources are only available to major corporations
- Local domains
- May only contain a few hundred thousand pages
- May already be contained on a local web server (www.cs.kent.edu)
- There is much less restriction on access to the entire dataset
- The advantages of localized search engines become clear
Slide 5: Global (N) vs. Local (n)
Each local domain isn't aware of the rest of the global domain. Some parts overlap, but others don't; the overlap represents links to other domains. How is it possible to extract global information when only the local domain is available? Excluding the overlap from other domains gives a very poor estimate of global rank.
Slide 6: Proposed Solution
- Find a good approximation to the global PageRank value without crawling the entire global domain
- Find a superdomain of the local domain that approximates the PageRank well
- Find this superdomain by crawling as few as n or 2n additional pages, given a local domain of n pages
- Essentially, add as few pages to the local domain as possible until we find a very good approximation of the PageRanks in the local domain
Slide 7: PageRank - Description
- Defines the importance of pages based on the hyperlinks from one page to another (the web graph)
- Computes the stationary distribution of a Markov chain created from the web graph
- Uses the "random surfer" model to create a random walk over the chain
Slide 8: PageRank Matrix
- Given an m x m adjacency matrix U for the web graph, define the PageRank matrix (sketched in code below) as
  P_U = α U D_U^{-1} + (1 − α) v e^T
- D_U is a diagonal matrix such that U D_U^{-1} is column stochastic
- 0 ≤ α ≤ 1
- e is the vector of all 1s
- v is the random surfer vector
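A minimal NumPy sketch of this construction, assuming the column convention U[i, j] = 1 when page j links to page i; alpha = 0.85, the uniform surfer vector, and the dangling-page guard are illustrative choices, not values fixed by the slides:

```python
import numpy as np

def pagerank_matrix(U, alpha=0.85):
    """Build P_U = alpha * U * D_U^{-1} + (1 - alpha) * v * e^T."""
    m = U.shape[0]
    out_deg = U.sum(axis=0).astype(float)   # column sums = out-degrees
    out_deg[out_deg == 0] = 1.0             # guard for dangling pages (assumption)
    D_inv = np.diag(1.0 / out_deg)          # D_U^{-1} makes U D_U^{-1} column stochastic
    v = np.full((m, 1), 1.0 / m)            # uniform random surfer vector
    e = np.ones((1, m))
    return alpha * U @ D_inv + (1.0 - alpha) * v @ e
```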
Slide 9: PageRank Vector
- The PageRank vector r represents the PageRank of every node in the web graph
- It is defined as the dominant eigenvector of the PageRank matrix
- Computed using the power method with a random starting vector
- Computation can take as much as O(m^2) time for a dense graph, but in practice it is normally O(km), with k being the average number of links per page
Slide 10: Algorithm 1
- Computing the PageRank vector based on the adjacency matrix U of the given web graph
Slide 11: Algorithm 1 (Explanation)
- Input: adjacency matrix U
- Output: PageRank vector r
- Method (see the sketch below):
- Choose a random initial value for r(0)
- Continue to iterate using the random surfer probability and vector until reaching the convergence threshold
- Return the last iteration as the dominant eigenvector of the PageRank matrix built from U
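A hedged Python sketch of this iteration; the tolerance, the iteration cap, and the L1 convergence test are assumptions rather than details fixed by the slide:

```python
import numpy as np

def power_method(P, tol=1e-8, max_iter=1000):
    """Iterate r <- P r from a random start until successive
    iterates differ by less than tol in L1 norm."""
    rng = np.random.default_rng(0)
    r = rng.random(P.shape[0])
    r /= r.sum()                        # random initial probability vector r(0)
    for _ in range(max_iter):
        r_next = P @ r
        r_next /= r_next.sum()          # keep r a probability distribution
        done = np.abs(r_next - r).sum() < tol
        r = r_next
        if done:
            break
    return r
```

Feeding it the matrix from pagerank_matrix above reproduces the Algorithm 1 pipeline on a toy graph.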
Slide 12: Defining the Problem (G vs. L)
- For a local domain L, we have G as the entire global domain, with an N x N adjacency matrix
- Partition G into blocks so that L is contained as a submatrix, along with the outlinks L_out and the rest of the global graph:
  G = [ L      G_out
        L_out  G_within ]
- Assume that L has already been crawled and that L_out is known
Slide 13: Defining the Problem (p* in g)
- If we partition G this way, we can denote the actual PageRank vector of L, with respect to g (the global PageRank vector), as
  p* = E_L g / ||E_L g||_1
- Note: E_L selects from g only the nodes that correspond to L (a one-line sketch follows)
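In code, the selection and renormalization is a one-liner; local_idx (the positions of L's pages within the global ordering) is an assumed bookkeeping detail:

```python
import numpy as np

def true_local_pagerank(g, local_idx):
    """Restrict the global PageRank vector g to the pages of L and
    renormalize, giving p* = E_L g / ||E_L g||_1."""
    p_star = np.asarray(g)[local_idx]
    return p_star / p_star.sum()
```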
Slide 14: Defining the Problem (n << N)
- We define p as the PageRank vector computed by crawling only the local domain L
- Note that p will be much different from p*
- Crawling more nodes of the global domain would shrink the difference, but crawling the entire domain is not possible
- Instead, find the supergraph F of L that minimizes the difference between p and p*
Slide 15: Defining the Problem (finding F)
- We need to find the F that gives us the best approximation of p*
- i.e., minimize GlobalDiff(f): the difference between the actual global PageRank of the pages in L and the estimate computed from F
- F is found with a greedy strategy, using Algorithm 2
- Essentially, start with L and add the nodes in F_out that minimize our objective, continuing for a total of T iterations
Slide 16: Algorithm 2
Slide 17: Algorithm 2 (Explanation)
- Input: L (local domain), L_out (outlinks from L), T (number of iterations), k (pages to crawl per iteration)
- Output: p (an improved estimated PageRank vector)
- Method (a schematic sketch follows this list):
- First set F (supergraph) and F_out equal to L and L_out
- Compute the PageRank vector f of F
- While T has not been exceeded:
- Select k new nodes to crawl based on F, F_out, and f
- Expand F to include those new nodes and modify F_out
- Compute the new PageRank vector for F
- Select the elements of f that correspond to L and return p
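A self-contained sketch of this greedy loop, reusing pagerank_matrix and power_method from the sketches above. Two simplifications are assumptions: crawling is simulated by indexing into a fully known global adjacency matrix G, and the selection rule is a crude stand-in for Algorithm 3 (it ranks frontier pages by the PageRank mass of the pages in F linking to them):

```python
import numpy as np

def algorithm2(G, local_idx, T=3, k=5, alpha=0.85):
    """Greedy frame of Algorithm 2 over a known global matrix G
    (G[i, j] = 1 when page j links to page i)."""
    F = list(local_idx)                  # F starts as the local domain L
    for _ in range(T):
        f = power_method(pagerank_matrix(G[np.ix_(F, F)], alpha))
        in_F = set(F)
        # frontier F_out: pages outside F that some page in F links to
        frontier = [j for j in range(G.shape[0])
                    if j not in in_F and G[j, F].sum() > 0]
        # stand-in for Algorithm 3: total PageRank of F-pages linking to j
        frontier.sort(key=lambda j: -(G[j, F] @ f))
        F.extend(frontier[:k])           # "crawl" k new pages
    f = power_method(pagerank_matrix(G[np.ix_(F, F)], alpha))
    p_hat = f[:len(local_idx)]           # L occupies the first entries of F
    return p_hat / p_hat.sum()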
Slide 18: Global (N) vs. Local (n) (Again)
We know how to create the PageRank vector using the power method. Using it on only the local domain gives very inaccurate estimates of the PageRank. How can we select nodes from other domains (i.e., expand the current domain) to improve accuracy? And how far can that selection proceed without crawling the entire global domain?
Slide 19: Selecting Nodes
- Select nodes to expand L to F
- Selected nodes must bring us closer to the actual PageRank vector
- Some nodes will greatly influence the current PageRank
- We only want to select at most O(n) more pages than those already in L
Slide 20: Finding the Best Nodes
- For a page j in the global domain on the frontier of F (F_out), the addition of page j to F forms
  F_j = [ F      s
          u_j^T  0 ]
- u_j contains the outlinks from F to j
- s is the estimated vector of inlinks from j into F (j has not yet been crawled)
- s is estimated from the expected inlink counts of the pages already crawled (see the sketch below)
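A sketch of this bordered matrix, with one assumed detail: since j is uncrawled, s is estimated here as the average column of F, i.e. the empirical chance that a crawled page links to each page of F:

```python
import numpy as np

def augment_with_candidate(F, u_j):
    """Build F_j = [[F, s], [u_j^T, 0]] for a frontier page j.

    F is the l x l adjacency of the crawled subgraph (F[i, k] = 1 when
    k links to i); u_j[i] = 1 when page i of F links to j."""
    l = F.shape[0]
    s_hat = F.sum(axis=1, keepdims=True) / l    # expected in-links from j (assumption)
    top = np.hstack([F, s_hat])
    bottom = np.hstack([np.asarray(u_j).reshape(1, -1), np.zeros((1, 1))])
    return np.vstack([top, bottom])
```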
Slide 21: Finding the Best Nodes (cont.)
- We defined the PageRank of F to be f
- The PageRank of F_j is f_j
- x_j is the PageRank of node j (appended to the current PageRank vector)
- Directly optimizing requires us to know the global PageRank p*
- How can we minimize the objective without knowing p*?
Slide 22: Node Influence
- Find the nodes in F_out that will have the greatest influence on the local domain L
- Done by attaching an influence score to each node j: the summation of the differences that adding page j makes to the PageRank values of the pages in L (see the sketch below)
- The influence score correlates strongly with the minimization of the GlobalDiff(f_j) objective, especially compared to a baseline such as the total outlink count from F to node j
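A sketch of the influence score; aggregating with absolute differences is an assumption about how "the summation of the differences" is meant:

```python
import numpy as np

def influence(f, f_j, local_idx):
    """Total change that adding page j induces on the PageRank
    values of the pages in L."""
    base = np.asarray(f)[local_idx]
    with_j = np.asarray(f_j)[local_idx]
    return np.abs(with_j - base).sum()
```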
Slide 23: Node Influence Results
- Node influence vs. outlink count on a crawl of conservative web sites
Slide 24: Finding the Influence
- Influence must be calculated for each candidate node j in the frontier of F
- Since we are considering O(n) pages and each calculation is O(n), we are left with an O(n^2) computation
- To reduce this complexity, approximating the influence of j may be acceptable, but how?
- Using the power method for computing the PageRank may lead to a good approximation
- However, using Algorithm 1 requires a good starting vector
Slide 25: PageRank Vector (again)
- The PageRank algorithm converges at a rate equal to the random surfer probability α
- With a starting vector x(0), the cost of the algorithm is proportional to the number of iterations needed to close the gap between x(0) and the solution
- That is, the more accuracy demanded of the result, the more iterations are required
- Saving grace: find a very good starting vector x(0), in which case we only need to perform one iteration of Algorithm 1
Slide 26: Finding the Best x(0)
- Partition the PageRank matrix for F_j into blocks separating the l pages of F from the new node j:
  P_{F_j} = [ P̃    w̃
              ũ^T  ṽ ]
Slide 27: Finding the Best x(0)
- Simple approach:
- Use the current PageRank vector f, extended with an entry for the added node, as the starting vector
- Perform one PageRank iteration
- Remove the element that corresponds to the added node
- Issues:
- The estimate of f_j will have an error of at least 2αx_j
- So if the PageRank of j is very high, the estimate is very bad
Slide 28: Stochastic Complement
- In expanded form, writing f_j = [f_l ; x_j] with f_l the length-l part, the PageRank equations are
  f_l = P̃ f_l + w̃ x_j  and  x_j = ũ^T f_l + ṽ x_j
- Solving the second equation for x_j and substituting gives
  f_l = (P̃ + w̃ ũ^T / (1 − ṽ)) f_l
- Observation: the matrix S = P̃ + w̃ ũ^T / (1 − ṽ) is the stochastic complement of the PageRank matrix of F_j (sketched below)
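A sketch of this closed form; block names follow the partition on slide 26:

```python
import numpy as np

def stochastic_complement(P, w, u, v):
    """S = P + w u^T / (1 - v), the stochastic complement of the
    partitioned matrix [[P, w], [u^T, v]] with a scalar corner v."""
    w = np.asarray(w).reshape(-1, 1)
    u = np.asarray(u).reshape(-1, 1)
    return P + (w @ u.T) / (1.0 - v)
```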
Slide 29: Stochastic Complement (Observations)
- The stochastic complement of an irreducible matrix is unique
- The stochastic complement is also irreducible and therefore has a unique stationary distribution
- With regard to the matrix S: its subdominant eigenvalue is at most ((l+1)/l)·α, which means that for large l it is very close to α
Slide 30: The New PageRank Approximation
- Estimate the vector f_j of length l by performing one PageRank iteration over S, starting at f (see the sketch below)
- Advantages:
- Starts and ends with a vector of length l
- Has a lower bound of zero on its error
- Example: consider adding a node k to F that has no influence over the PageRank of F; using the stochastic complement yields the exact solution
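The approximation itself is then one matrix-vector product (a sketch, reusing stochastic_complement from above):

```python
def approximate_f_j(P, w, u, v, f):
    """One PageRank iteration over the stochastic complement S,
    started at the current vector f; never leaves dimension l."""
    S = stochastic_complement(P, w, u, v)
    f_j = S @ f
    return f_j / f_j.sum()    # renormalize to a probability vector
```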
Slide 31: The Details
- Begin by expanding the difference between the two PageRank vectors f_j and f
Slide 32: The Details (cont.)
- Substitute P_F into the equation
- Summarize the result into vectors (the x, y, and z used by Algorithm 3)
Slide 33: Algorithm 3
Slide 34: Algorithm 3 (Explanation)
- Input: F (the current local subgraph), F_out (outlinks of F), f (current PageRank of F), k (number of pages to return)
- Output: k new pages to crawl
- Method (the final selection step is sketched below):
- Compute the outlink sums for each page in F
- Compute a scalar for every known global page j (how many pages link to j)
- Compute y and z as formulated
- For each of the pages in F_out:
- Compute x as formulated
- Compute the score of the page using x, y, and z
- Return the k pages with the highest scores
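The closing selection step might look like the following; the per-page score computation (the x, y, z formulas) lives in the paper, so it is abstracted here as an assumed scores dict mapping each frontier page to its influence estimate:

```python
import heapq

def top_k_pages(scores, k):
    """Return the k frontier pages with the highest influence scores.

    scores: dict mapping page id -> estimated influence (assumed to be
    precomputed from the x, y, z quantities of Algorithm 3)."""
    return heapq.nlargest(k, scores, key=scores.get)
```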
Slide 35: PageRank Leaks and Flows
- The change in PageRank caused by adding a node j to F can be described in terms of leaks and flows
- A flow is an increase in local PageRanks
- Represented by a scalar (the total amount j has to distribute) times a vector (determining how it will be distributed)
- A leak is a decrease in local PageRanks
- Leaks come from the non-positive vectors x and y
- x is proportional to the weighted sum of sibling PageRanks
- y is an artifact of the random surfer vector
Slide 36: Leaks and Flows
- Diagram: leaks go out to the random surfer and sibling pages; flows come into the local pages
Slide 37: Experiments
- Methodology
- Resources are limited, so the global graph is approximated
- Baseline algorithms
- Random: nodes chosen uniformly at random from the known global nodes
- Outlink Count: nodes chosen with the highest outlink counts from the current local domain
Slide 38: Results (Data Sets)
- Data sets
- Restricted to http pages that do not contain the characters '?' or '@' (among others)
- EDU data set
- Crawl of the top 100 computer science universities
- Yielded 4.7 million pages and 22.9 million links
- Politics data set
- Crawl of the pages under politics in the dmoz directory
- Yielded 4.4 million pages and 17.2 million links
Slide 39: Results (EDU Data Set)
- The norm-difference measures show error (lower is better); Kendall's tau shows rank similarity (higher is better)
Slide 40: Results (Politics Data Set)
Slide 41: Result Summary
- Stochastic complement outperformed the other methods in nearly every trial
- The results are significantly better than the random approach, with minimal extra computation
Slide 42: Conclusion
- Accurate estimates of the global PageRank can be obtained using only local results
- Expand the local graph based on influence
- Crawl at most O(n) more pages
- Use the stochastic complement to accurately estimate the new PageRank vector
- Neither computationally nor storage intensive
Slide 43: Estimating the Global PageRank of Web Communities