Title: Estimating the Global PageRank of Web Communities
Slide 1: Estimating the Global PageRank of Web Communities
- Paper by Jason V. Davis and Inderjit S. Dhillon
- Dept. of Computer Sciences
- University of Texas at Austin
- Presentation given by Scott J. McCallen
- Dept. of Computer Science
- Kent State University
- December 4th 2006
Slide 2: Localized Search Engines
- What are they?
- Focus on a particular community
- Examples: www.cs.kent.edu (site specific) or all computer science related websites (topic specific)
- Advantages
- Searching for particular terms with several meanings
- Relatively inexpensive to build and use
- Use less bandwidth, space, and time
- Local domains are orders of magnitude smaller than the global domain
Slide 3: Localized Search Engines (cont.)
- Disadvantages
- Lack of global information
- i.e., only local PageRanks are available
- Why is this a problem?
- Only pages that are highly regarded within that community will have high PageRanks
- There is a need for a global PageRank for pages that exist only within a local domain
- Traditionally, this can only be obtained by crawling the entire domain
Slide 4: Some Global Facts
- 2003 study by Lyman on the global domain
- 8.9 billion pages on the internet (static pages)
- Approximately 18.7 kilobytes each
- 167 terabytes needed to download and crawl the entire web
- These resources are only available to major corporations
- Local domains
- May only contain a few hundred thousand pages
- May already be contained on a local web server (www.cs.kent.edu)
- There is much less restriction on access to the entire dataset
- The advantages of localized search engines become clear
Slide 5: Global (N) vs. Local (n)
Each local domain isn't aware of the rest of the global domain. Some parts overlap, but others don't; the overlap represents links to other domains. How is it possible to extract global information when only the local domain is available? Excluding the overlap from other domains gives a very poor estimate of global rank.
Slide 6: Proposed Solution
- Find a good approximation to the global PageRank value without crawling the entire global domain
- Find a superdomain of the local domain that approximates the PageRank well
- Find this superdomain by crawling as few as n or 2n additional pages, given a local domain of n pages
- Essentially, add as few pages to the local domain as possible until we find a very good approximation of the PageRanks in the local domain
Slide 7: PageRank - Description
- Defines the importance of pages based on the hyperlinks from one page to another (the web graph)
- Computes the stationary distribution of a Markov chain created from the web graph
- Uses the "random surfer" model to create a random walk over the chain
Slide 8: PageRank Matrix
- Given an m x m adjacency matrix U for the web graph, define the PageRank matrix (sketched in code below) as
  P_U = α U D_U^{-1} + (1 − α) v e^T
- D_U is a diagonal matrix such that U D_U^{-1} is column stochastic
- 0 ≤ α ≤ 1
- e is the vector of all 1s
- v is the random surfer vector
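A minimal NumPy sketch of this construction, assuming the column convention U[i, j] = 1 when page j links to page i; alpha = 0.85, the uniform surfer vector, and the dangling-page guard are illustrative choices, not values fixed by the slides:

```python
import numpy as np

def pagerank_matrix(U, alpha=0.85):
    """Build P_U = alpha * U * D_U^{-1} + (1 - alpha) * v * e^T."""
    m = U.shape[0]
    out_deg = U.sum(axis=0).astype(float)   # column sums = out-degrees
    out_deg[out_deg == 0] = 1.0             # guard for dangling pages (assumption)
    D_inv = np.diag(1.0 / out_deg)          # D_U^{-1} makes U D_U^{-1} column stochastic
    v = np.full((m, 1), 1.0 / m)            # uniform random surfer vector
    e = np.ones((1, m))
    return alpha * U @ D_inv + (1.0 - alpha) * v @ e
```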
Slide 9: PageRank Vector
- The PageRank vector r represents the PageRank of every node in the web graph
- It is defined as the dominant eigenvector of the PageRank matrix
- Computed using the power method with a random starting vector
- Computation can take as much as O(m^2) time for a dense graph, but in practice it is normally O(km), with k being the average number of links per page
Slide 10: Algorithm 1
- Computing the PageRank vector based on the adjacency matrix U of the given web graph
Slide 11: Algorithm 1 (Explanation)
- Input: adjacency matrix U
- Output: PageRank vector r
- Method (see the sketch below):
- Choose a random initial value for r(0)
- Continue to iterate using the random surfer probability and vector until reaching the convergence threshold
- Return the last iteration as the dominant eigenvector of the PageRank matrix built from U
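A hedged Python sketch of this iteration; the tolerance, the iteration cap, and the L1 convergence test are assumptions rather than details fixed by the slide:

```python
import numpy as np

def power_method(P, tol=1e-8, max_iter=1000):
    """Iterate r <- P r from a random start until successive
    iterates differ by less than tol in L1 norm."""
    rng = np.random.default_rng(0)
    r = rng.random(P.shape[0])
    r /= r.sum()                        # random initial probability vector r(0)
    for _ in range(max_iter):
        r_next = P @ r
        r_next /= r_next.sum()          # keep r a probability distribution
        done = np.abs(r_next - r).sum() < tol
        r = r_next
        if done:
            break
    return r
```

Feeding it the matrix from pagerank_matrix above reproduces the Algorithm 1 pipeline on a toy graph.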
Slide 12: Defining the Problem (G vs. L)
- For a local domain L, we have G as the entire global domain, with an N x N adjacency matrix
- Partition G into blocks so that L is contained as a submatrix, along with the outlinks L_out and the rest of the global graph:
  G = [ L      G_out
        L_out  G_within ]
- Assume that L has already been crawled and that L_out is known
Slide 13: Defining the Problem (p* in g)
- If we partition G this way, we can denote the actual PageRank vector of L, with respect to g (the global PageRank vector), as
  p* = E_L g / ||E_L g||_1
- Note: E_L selects from g only the nodes that correspond to L (a one-line sketch follows)
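In code, the selection and renormalization is a one-liner; local_idx (the positions of L's pages within the global ordering) is an assumed bookkeeping detail:

```python
import numpy as np

def true_local_pagerank(g, local_idx):
    """Restrict the global PageRank vector g to the pages of L and
    renormalize, giving p* = E_L g / ||E_L g||_1."""
    p_star = np.asarray(g)[local_idx]
    return p_star / p_star.sum()
```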
Slide 14: Defining the Problem (n << N)
- We define p as the PageRank vector computed by crawling only the local domain L
- Note that p will be much different from p*
- Crawling more nodes of the global domain would shrink the difference, but crawling the entire domain is not possible
- Instead, find the supergraph F of L that minimizes the difference between p and p*
Slide 15: Defining the Problem (finding F)
- We need to find the F that gives us the best approximation of p*
- i.e., minimize GlobalDiff(f): the difference between the actual global PageRank of the pages in L and the estimate computed from F
- F is found with a greedy strategy, using Algorithm 2
- Essentially, start with L and add the nodes in F_out that minimize our objective, continuing for a total of T iterations
Slide 16: Algorithm 2
Slide 17: Algorithm 2 (Explanation)
- Input: L (local domain), L_out (outlinks from L), T (number of iterations), k (pages to crawl per iteration)
- Output: p (an improved estimated PageRank vector)
- Method (a schematic sketch follows this list):
- First set F (supergraph) and F_out equal to L and L_out
- Compute the PageRank vector f of F
- While T has not been exceeded:
- Select k new nodes to crawl based on F, F_out, and f
- Expand F to include those new nodes and modify F_out
- Compute the new PageRank vector for F
- Select the elements of f that correspond to L and return p
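A self-contained sketch of this greedy loop, reusing pagerank_matrix and power_method from the sketches above. Two simplifications are assumptions: crawling is simulated by indexing into a fully known global adjacency matrix G, and the selection rule is a crude stand-in for Algorithm 3 (it ranks frontier pages by the PageRank mass of the pages in F linking to them):

```python
import numpy as np

def algorithm2(G, local_idx, T=3, k=5, alpha=0.85):
    """Greedy frame of Algorithm 2 over a known global matrix G
    (G[i, j] = 1 when page j links to page i)."""
    F = list(local_idx)                  # F starts as the local domain L
    for _ in range(T):
        f = power_method(pagerank_matrix(G[np.ix_(F, F)], alpha))
        in_F = set(F)
        # frontier F_out: pages outside F that some page in F links to
        frontier = [j for j in range(G.shape[0])
                    if j not in in_F and G[j, F].sum() > 0]
        # stand-in for Algorithm 3: total PageRank of F-pages linking to j
        frontier.sort(key=lambda j: -(G[j, F] @ f))
        F.extend(frontier[:k])           # "crawl" k new pages
    f = power_method(pagerank_matrix(G[np.ix_(F, F)], alpha))
    p_hat = f[:len(local_idx)]           # L occupies the first entries of F
    return p_hat / p_hat.sum()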
Slide 18: Global (N) vs. Local (n) (Again)
We know how to create the PageRank vector using the power method. Using it on only the local domain gives very inaccurate estimates of the PageRank. How can we select nodes from other domains (i.e., expand the current domain) to improve accuracy? And how far can that selection proceed without crawling the entire global domain?
Slide 19: Selecting Nodes
- Select nodes to expand L to F
- Selected nodes must bring us closer to the actual PageRank vector
- Some nodes will greatly influence the current PageRank
- We only want to select at most O(n) more pages than those already in L
Slide 20: Finding the Best Nodes
- For a page j in the global domain on the frontier of F (F_out), the addition of page j to F forms
  F_j = [ F      s
          u_j^T  0 ]
- u_j contains the outlinks from F to j
- s is the estimated vector of inlinks from j into F (j has not yet been crawled)
- s is estimated from the expected inlink counts of the pages already crawled (see the sketch below)
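A sketch of this bordered matrix, with one assumed detail: since j is uncrawled, s is estimated here as the average column of F, i.e. the empirical chance that a crawled page links to each page of F:

```python
import numpy as np

def augment_with_candidate(F, u_j):
    """Build F_j = [[F, s], [u_j^T, 0]] for a frontier page j.

    F is the l x l adjacency of the crawled subgraph (F[i, k] = 1 when
    k links to i); u_j[i] = 1 when page i of F links to j."""
    l = F.shape[0]
    s_hat = F.sum(axis=1, keepdims=True) / l    # expected in-links from j (assumption)
    top = np.hstack([F, s_hat])
    bottom = np.hstack([np.asarray(u_j).reshape(1, -1), np.zeros((1, 1))])
    return np.vstack([top, bottom])
```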
Slide 21: Finding the Best Nodes (cont.)
- We defined the PageRank of F to be f
- The PageRank of F_j is f_j
- x_j is the PageRank of node j (appended to the current PageRank vector)
- Directly optimizing requires us to know the global PageRank p*
- How can we minimize the objective without knowing p*?
Slide 22: Node Influence
- Find the nodes in F_out that will have the greatest influence on the local domain L
- Done by attaching an influence score to each node j: the summation of the differences that adding page j makes to the PageRank values of the pages in L (see the sketch below)
- The influence score correlates strongly with the minimization of the GlobalDiff(f_j) objective, especially compared to a baseline such as the total outlink count from F to node j
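A sketch of the influence score; aggregating with absolute differences is an assumption about how "the summation of the differences" is meant:

```python
import numpy as np

def influence(f, f_j, local_idx):
    """Total change that adding page j induces on the PageRank
    values of the pages in L."""
    base = np.asarray(f)[local_idx]
    with_j = np.asarray(f_j)[local_idx]
    return np.abs(with_j - base).sum()
```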
Slide 23: Node Influence Results
- Node influence vs. outlink count on a crawl of conservative web sites
Slide 24: Finding the Influence
- Influence must be calculated for each candidate node j in the frontier of F
- Since we are considering O(n) pages and each calculation is O(n), we are left with an O(n^2) computation
- To reduce this complexity, approximating the influence of j may be acceptable, but how?
- Using the power method for computing the PageRank may lead to a good approximation
- However, using Algorithm 1 requires a good starting vector
Slide 25: PageRank Vector (again)
- The PageRank algorithm converges at a rate equal to the random surfer probability α
- With a starting vector x(0), the cost of the algorithm is proportional to the number of iterations needed to close the gap between x(0) and the solution
- That is, the more accuracy demanded of the result, the more iterations are required
- Saving grace: find a very good starting vector x(0), in which case we only need to perform one iteration of Algorithm 1
Slide 26: Finding the Best x(0)
- Partition the PageRank matrix for F_j into blocks separating the l pages of F from the new node j:
  P_{F_j} = [ P̃    w̃
              ũ^T  ṽ ]
Slide 27: Finding the Best x(0)
- Simple approach:
- Use the current PageRank vector f, extended with an entry for the added node, as the starting vector
- Perform one PageRank iteration
- Remove the element that corresponds to the added node
- Issues:
- The estimate of f_j will have an error of at least 2αx_j
- So if the PageRank of j is very high, the estimate is very bad
Slide 28: Stochastic Complement
- In expanded form, writing f_j = [f_l ; x_j] with f_l the length-l part, the PageRank equations are
  f_l = P̃ f_l + w̃ x_j  and  x_j = ũ^T f_l + ṽ x_j
- Solving the second equation for x_j and substituting gives
  f_l = (P̃ + w̃ ũ^T / (1 − ṽ)) f_l
- Observation: the matrix S = P̃ + w̃ ũ^T / (1 − ṽ) is the stochastic complement of the PageRank matrix of F_j (sketched below)
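A sketch of this closed form; block names follow the partition on slide 26:

```python
import numpy as np

def stochastic_complement(P, w, u, v):
    """S = P + w u^T / (1 - v), the stochastic complement of the
    partitioned matrix [[P, w], [u^T, v]] with a scalar corner v."""
    w = np.asarray(w).reshape(-1, 1)
    u = np.asarray(u).reshape(-1, 1)
    return P + (w @ u.T) / (1.0 - v)
```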
Slide 29: Stochastic Complement (Observations)
- The stochastic complement of an irreducible matrix is unique
- The stochastic complement is also irreducible and therefore has a unique stationary distribution
- With regard to the matrix S: its subdominant eigenvalue is at most ((l+1)/l)·α, which means that for large l it is very close to α
Slide 30: The New PageRank Approximation
- Estimate the vector f_j of length l by performing one PageRank iteration over S, starting at f (see the sketch below)
- Advantages:
- Starts and ends with a vector of length l
- Has a lower bound of zero on its error
- Example: consider adding a node k to F that has no influence over the PageRank of F; using the stochastic complement yields the exact solution
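The approximation itself is then one matrix-vector product (a sketch, reusing stochastic_complement from above):

```python
def approximate_f_j(P, w, u, v, f):
    """One PageRank iteration over the stochastic complement S,
    started at the current vector f; never leaves dimension l."""
    S = stochastic_complement(P, w, u, v)
    f_j = S @ f
    return f_j / f_j.sum()    # renormalize to a probability vector
```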
Slide 31: The Details
- Begin by expanding the difference between the two PageRank vectors f_j and f
Slide 32: The Details (cont.)
- Substitute P_F into the equation
- Summarize the result into vectors (the x, y, and z used by Algorithm 3)
Slide 33: Algorithm 3
Slide 34: Algorithm 3 (Explanation)
- Input: F (the current local subgraph), F_out (outlinks of F), f (current PageRank of F), k (number of pages to return)
- Output: k new pages to crawl
- Method (the final selection step is sketched below):
- Compute the outlink sums for each page in F
- Compute a scalar for every known global page j (how many pages link to j)
- Compute y and z as formulated
- For each of the pages in F_out:
- Compute x as formulated
- Compute the score of the page using x, y, and z
- Return the k pages with the highest scores
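The closing selection step might look like the following; the per-page score computation (the x, y, z formulas) lives in the paper, so it is abstracted here as an assumed scores dict mapping each frontier page to its influence estimate:

```python
import heapq

def top_k_pages(scores, k):
    """Return the k frontier pages with the highest influence scores.

    scores: dict mapping page id -> estimated influence (assumed to be
    precomputed from the x, y, z quantities of Algorithm 3)."""
    return heapq.nlargest(k, scores, key=scores.get)
```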
Slide 35: PageRank Leaks and Flows
- The change in PageRank caused by adding a node j to F can be described in terms of leaks and flows
- A flow is an increase in local PageRanks
- Represented by a scalar (the total amount j has to distribute) times a vector (determining how it will be distributed)
- A leak is a decrease in local PageRanks
- Leaks come from the non-positive vectors x and y
- x is proportional to the weighted sum of sibling PageRanks
- y is an artifact of the random surfer vector
Slide 36: Leaks and Flows
- Diagram: leaks go out to the random surfer and sibling pages; flows come into the local pages
Slide 37: Experiments
- Methodology
- Resources are limited, so the global graph is approximated
- Baseline algorithms
- Random: nodes chosen uniformly at random from the known global nodes
- Outlink Count: nodes chosen with the highest outlink counts from the current local domain
Slide 38: Results (Data Sets)
- Data sets
- Restricted to http pages that do not contain the characters '?' or '@' (among others)
- EDU data set
- Crawl of the top 100 computer science universities
- Yielded 4.7 million pages and 22.9 million links
- Politics data set
- Crawl of the pages under politics in the dmoz directory
- Yielded 4.4 million pages and 17.2 million links
Slide 39: Results (EDU Data Set)
- The norm-difference measures show error (lower is better); Kendall's tau shows rank similarity (higher is better)
Slide 40: Results (Politics Data Set)
Slide 41: Result Summary
- Stochastic complement outperformed the other methods in nearly every trial
- The results are significantly better than the random approach, with minimal extra computation
Slide 42: Conclusion
- Accurate estimates of the global PageRank can be obtained using only local results
- Expand the local graph based on influence
- Crawl at most O(n) more pages
- Use the stochastic complement to accurately estimate the new PageRank vector
- Neither computationally nor storage intensive
Slide 43: Estimating the Global PageRank of Web Communities