Title: On Near-Uniform URL Sampling
On Near-Uniform URL Sampling
Introduction
- First, the authors consider several sampling approaches, including natural approaches based on random walks. Intuitively, the problem with using a random walk to sample URLs from the web is that more highly connected pages tend to be chosen more often.
- The authors suggest an improvement to the standard random-walk technique that mitigates this effect, leading to a more uniform sample.
- Second, the authors describe a test bed for validating their technique.
- In particular, the authors apply the sampling approach to a synthetic random graph whose connectivity was designed to resemble that of the web, and then analyze the distribution of these samples.
- Finally, the authors apply their sampling technique to three sizable random walks of the actual web. They then use these samples to estimate the distribution of pages over internet domains, and to estimate the coverage of various search engine indexes.
Prior Work
- Definition: The size of a search engine is the number of pages indexed by the search engine.
- Similarly, the size of the web corresponds to the number of publicly accessible static web pages.
- The question of understanding the size of the web and the relative sizes of search engines has been studied previously.
- The question of whether size is an appropriate gauge of search engine utility, however, remains a subject of debate.
- Another reason to study size is to learn about the growth of the web, so that appropriate predictions can be made and future trends can be spotted early.
- Initial work: The approach of sampling from NEC query logs leaves questions as to the statistical appropriateness of the sample, as well as the repeatability of the test by other researchers.
- Further work: An approach based on random testing of IP addresses to determine characteristics of hosts and pages found on the web, as well as to estimate the size of the web.
- This technique appears to be a useful approach for determining characteristics of web hosts.
- Given the high variance in the number of pages per host, and the difficulties in accessing pages from hosts by this approach, it is not clear that this technique provides a general methodology to accurately determine the size of the web. In particular, the scalability of this approach is uncertain for future 128-bit IPv6 addresses.
- The weight of an index is a generalization of the notion of its size.
- Each page can be assigned a weight, which corresponds to its importance.
- The weight of a search engine index is then defined to be the sum of the weights of the pages it contains.
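In symbols (the names S for an index and w(p) for a page's weight are our notation, not the slides'):

    W(S) = \sum_{p \in S} w(p)

Setting w(p) = 1 for every page recovers the plain size measure.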
- Another natural weight measure is, for example, the PageRank measure.
- Random walks on the web graph and search-engine probing techniques can determine the weight of an index when the weight measure is given by PageRank.
Prior Work: PageRank
- PageRank is a measure of the importance of a page that is fundamental to this sampling approach.
- Let pages p_1, p_2, ..., p_n link to page p, let C(p) be the number of links out of p, and let T be the total number of pages on the web. Then

    R(p) = \frac{d}{T} + (1 - d) \sum_{i=1}^{n} \frac{R(p_i)}{C(p_i)}

  where d is a dampening factor, here d = 0.15.
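A minimal sketch of computing this fixed point by power iteration, assuming a toy adjacency-list graph; the function name, iteration count, and handling of pages without out-links are illustrative assumptions, not details from the paper (Python is used for all sketches in these notes):

    # Power iteration for R(p) = d/T + (1 - d) * sum_i R(p_i)/C(p_i),
    # with d = 0.15 as in the slides. Dangling pages (no out-links)
    # simply leak rank in this sketch; a full computation redistributes it.
    def pagerank(out_links, d=0.15, iterations=50):
        """out_links maps each page to the list of pages it links to."""
        pages = list(out_links)
        T = len(pages)
        rank = {p: 1.0 / T for p in pages}        # start uniform
        for _ in range(iterations):
            new_rank = {p: d / T for p in pages}  # random-jump term d/T
            for p, targets in out_links.items():
                if targets:                       # C(p) = len(targets)
                    share = (1 - d) * rank[p] / len(targets)
                    for q in targets:
                        new_rank[q] += share
            rank = new_rank
        return rank

    # Example: a 3-page web in which every page is reachable.
    print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))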
Mercator
- An extensible, multi-threaded web crawler written in Java.
- Mercator was configured to use 100 crawling threads, thus performing 100 random walks in parallel.
- The crawl was seeded with 10K starting points chosen from a previous crawl.
Mathematical Underpinnings
- First, approximate Pr(X is crawled): the probability that a page X is visited at least once during a random walk of length L.
- Consider a page well-connected if it can be reached from almost every other page through several short paths.
- A short walk means about log n steps, where n is the number of pages in the web graph.
- Short walks: for a well-connected page X, treating the L steps of the walk as (nearly) independent visits to X, each occurring with probability R(X),

    \Pr(X \text{ is crawled}) \approx 1 - (1 - R(X))^L \quad (3)
                              \approx L \cdot R(X)     \quad (4)

  where the second approximation holds when L \cdot R(X) is small.
- The main point of the approach is to obtain more nearly uniform samples from the history of the random walk by sampling visited pages with a skewed probability distribution, namely by sampling inversely to each page's PageRank.
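Spelling out why this helps, using approximation (4) and a normalizing constant c (this one-line derivation is our gloss on the slide's claim, not a quote from the paper):

    \Pr(X \text{ sampled}) = \Pr(X \text{ crawled}) \cdot \frac{c}{R(X)}
                           \approx L \, R(X) \cdot \frac{c}{R(X)} = cL

which is independent of X, so every crawled page is sampled with approximately the same probability.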
- The random walk provides us with two possible ways of estimating the PageRank R(X):
- 1. Estimate R(X) by the visit ratio VR(X): the number of times the walk visited X, divided by the total number of steps in the walk.
- 2. Estimate the PageRank R(X) of a page by the sample PageRank R'(X), computed on the sample graph consisting of the pages and links actually traversed by the walk.
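A sketch of both estimators over a recorded walk, reusing the pagerank helper above. The walk is assumed to be a plain list of visited pages, which glosses over the fact that consecutive entries can be random jumps rather than genuine link traversals:

    from collections import Counter, defaultdict

    def visit_ratio(walk):
        """VR(X): number of times the walk visited X, divided by the
        total number of steps in the walk."""
        counts = Counter(walk)
        L = len(walk)
        return {page: c / L for page, c in counts.items()}

    def sample_pagerank(walk):
        """R'(X): PageRank computed on the sample graph, i.e. the
        subgraph of pages and transitions traversed by the walk."""
        sample_graph = defaultdict(set)
        for src, dst in zip(walk, walk[1:]):
            sample_graph[src].add(dst)     # treats every step as a link,
            sample_graph.setdefault(dst, set())  # including random jumps
        return pagerank({p: sorted(t) for p, t in sample_graph.items()})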
Limitations
- The crawl-based approach finds only pages that are accessible through some sequence of links from the initial seed set.
- Furthermore, the walk avoids crawling dynamic content by stripping the query component from discovered URLs, and logs only content of type text/html.
- The technique is biased against pages that are not well-connected.
Limitations That Stem from the Mathematical Framework
- Initial bias: a bias based on the starting point, mitigated by choosing a large, diverse set of initial starting points.
- Dependence: a dependence between pages in the random walk.
- Short cycles: a specific problem raised by the dependence problem; in particular, it implies that approximation (3) is inaccurate for these pages.
- Large PageRanks: approximation (3) is inappropriate for long walks and pages with very high PageRank. Approximation (4) will therefore overestimate the probability that a high-PageRank page is crawled, since its right-hand side can be larger than 1.
- Random jumps: the random walk approximates the behavior of a random web surfer by jumping to a random page visited previously, rather than to a completely random page.
- Approximation (4) doesn't guarantee uniform samples, since there are other possible sources of error in using the visit ratio to estimate the PageRank of a page.
A Random Test Bed
- The graph represented by the web has a distinguishing structure. For example, the in-degrees and out-degrees of the nodes appear to have a power-law (or Zipf-like) distribution.
- A random variable X is said to have a power-law distribution if

    \Pr(X = k) \propto k^{-a}

  for some real number a and some range of k.
- The total in-degree and out-degree must match.
- Random connections are then made from out-links to in-links via a random permutation.
- The probability of having out-degree k was set to be proportional to 1/k^2.38, for k in the range five to twenty.
- The probability of having in-degree k was set to be proportional to 1/k^2.1; the range of the in-degrees was therefore set to lie between five and eighteen.
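A sketch of this construction; the exponents and degree ranges come from the slides, while the stub-matching details (including the crude trim that forces the stub totals to match exactly, and the tolerance of occasional self-loops and multi-edges) are illustrative assumptions:

    import random

    def sample_degrees(n, exponent, lo, hi):
        """Draw n degrees with Pr(degree = k) proportional to k**(-exponent),
        for lo <= k <= hi."""
        ks = list(range(lo, hi + 1))
        weights = [k ** -exponent for k in ks]
        return random.choices(ks, weights=weights, k=n)

    def build_graph(n):
        out_deg = sample_degrees(n, 2.38, 5, 20)  # out-degrees ~ 1/k^2.38
        in_deg = sample_degrees(n, 2.10, 5, 18)   # in-degrees  ~ 1/k^2.1
        # Every out-link stub must be matched to an in-link stub, so the
        # totals must agree; trimming the longer side is one crude fix.
        out_stubs = [v for v, d in enumerate(out_deg) for _ in range(d)]
        in_stubs = [v for v, d in enumerate(in_deg) for _ in range(d)]
        m = min(len(out_stubs), len(in_stubs))
        out_stubs, in_stubs = out_stubs[:m], in_stubs[:m]
        random.shuffle(in_stubs)      # the random permutation of the slide
        return list(zip(out_stubs, in_stubs))

    print(len(build_graph(10_000)))   # number of edges in a small instance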
- The final graph has 10,000,000 nodes and 82,086,395 edges.
- To crawl this graph, they wrote a program that reads a description of the graph and acts as a web server, returning synthetic pages whose links correspond to those of the graph.
- Mercator was then used to perform a random walk on this server.
- Three sets of two thousand samples each were chosen from the visited nodes, using three different sampling techniques.
- A PR sample was obtained by sampling a crawled page X with probability inversely proportional to its apparent PageRank R'(X).
- A VR sample was obtained by sampling a crawled page X with probability inversely proportional to its visit ratio VR(X).
- Finally, a random sample was obtained by simply choosing 2000 of the crawled pages independently and uniformly at random.
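A sketch of the three sampling techniques, built on the estimators above; drawing with replacement via random.choices is an implementation assumption (the paper's samples may well have been drawn without replacement):

    import random

    def weighted_sample(pages, scores, k=2000):
        """Choose k pages, each with probability inversely proportional
        to its score (apparent PageRank R'(X) or visit ratio VR(X))."""
        weights = [1.0 / scores[p] for p in pages]
        return random.choices(pages, weights=weights, k=k)

    def three_samples(walk):
        pages = sorted(set(walk))
        return {
            "PR": weighted_sample(pages, sample_pagerank(walk)),  # 1/R'(X)
            "VR": weighted_sample(pages, visit_ratio(walk)),      # 1/VR(X)
            "random": random.choices(pages, k=2000),              # uniform
        }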
- Compare the proportions of the in-degrees, out-degrees, and PageRanks of the samples with their proportions in the original graph.
- Consider first the out-degrees, as shown in Figure 1.
- The graph on the right normalizes these distributions against the percentages from the original graph.
- Figure 1: Out-degree distributions for the original graph and for nodes obtained by three different sampling techniques.
- In both graphs, sample curves that are closer to the original-graph curve are better. Although the distributions for the samples differ somewhat from that of the original graph, the differences are minor, and are due to the variation inherent in any probabilistic experiment.
- In contrast, when the samples are compared to the original graph in terms of in-degree and PageRank, there does appear to be a systematic bias against pages with low in-degree and low PageRank.
- Figures 2 and 3 show that the most biased results for in-degree and PageRank appear in the random samples.
- Figure 2: In-degree distributions for the original graph and for nodes obtained by three different sampling techniques.
- Figure 3: PageRank distributions for the original graph and for nodes obtained by three different sampling techniques.
- Similar experiments were conducted with random graphs having broader ranges of in- and out-degrees, more similar to those found on the web.
- A potential problem is that random graphs constructed with small in- and out-degrees might contain disjoint pieces that are never sampled, or long trails that are not well-connected.
- In such graphs, they again find that using the values VR(X) or R'(X) to re-scale the sampling probabilities makes the resulting sample appear more uniform.
Sampling Random Walks of the Web
- Various attributes of the three random walks (summary table not reproduced here).
- Note that Walk 3 downloaded pages at roughly twice the rate of the other two walks; they attribute this to the variability inherent in network bandwidth and DNS resolution.
- Figure 4: Overlap of the URLs (in thousands) visited during the three walks.
- Figure 4 shows a Venn diagram. The regions enclosed by the blue, red, and green lines represent the sets of URLs encountered by Walks 1, 2, and 3, respectively.
- The main conclusion of Figure 4 is that 83.2% of all visited URLs were visited by only one walk. Hence, their walks seem to disperse well.
Applications
- The samples are used to measure properties of the web, in two groups:
- 1. Determine characteristics of the URLs themselves. Examples include measuring distributions of the following URL properties: length, number of arcs, port numbers, filename extensions, and top-level internet domains.
- 2. Determine characteristics of the documents referred to by the URLs. Examples include measuring distributions of the following document properties: length, character set, language, number of out-links, and number of embedded images.
Estimating the Top-Level Domain Distribution
- Table 2 shows, for each walk and each sampling method (using 10K URLs), the percentage of pages in the most popular internet domains.
- The results are consistent over the three walks that are sampled in the same way.
- As the domain becomes smaller, the variance in percentages increases.
- Table 2: Percentage of sampled URLs in each top-level domain.
Search Engine Coverage
- To test whether a URL is indexed by a search engine:
- Use a list of words and an approximate measure of their frequency.
- Find the r rarest words on the page.
- Then query the search engine using a conjunction of these r rarest words, and check for the appropriate URL in the results (sketched below).
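A sketch of this strong-query construction; the tokenizer, the word-frequency table, the choice r = 5, and the "+word" conjunction syntax are all illustrative assumptions:

    import re

    def strong_query(page_text, word_frequency, r=5):
        """Build a conjunctive query from the r rarest words on a page.
        word_frequency maps a word to its approximate corpus frequency."""
        words = set(re.findall(r"[a-z]+", page_text.lower()))
        known = [w for w in words if w in word_frequency]
        rarest = sorted(known, key=lambda w: word_frequency[w])[:r]
        return " ".join("+" + w for w in rarest)   # e.g. "+foo +bar +baz"

The search engine is then queried with this string, and the result list is scanned for the sampled URL.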
- In practice, strong queries don't always uniquely identify a page:
- Sampled pages may contain few rare words.
- Mirror sites, duplicates or near-duplicates of the page, or other spurious matches may be returned.
- Northern Light can return pages that don't contain all of the words in the query.
- All URLs were normalized by converting them to lowercase, removing optional extensions such as index.htm and home.htm, inserting defaulted port numbers if necessary, and removing relative references of the form "..".
- The results are shown in Figures 5, 6, and 7.
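A sketch of this normalization; the exact suffix list and the scheme-to-port mapping are assumptions for illustration:

    from urllib.parse import urlsplit, urlunsplit
    import posixpath

    DEFAULT_PORTS = {"http": 80, "https": 443}
    OPTIONAL_SUFFIXES = ("index.htm", "index.html", "home.htm", "home.html")

    def normalize_url(url):
        parts = urlsplit(url.lower())                 # lowercase everything
        host = parts.hostname or ""
        port = parts.port or DEFAULT_PORTS.get(parts.scheme, 80)
        path = posixpath.normpath(parts.path or "/")  # resolve ".." references
        for suffix in OPTIONAL_SUFFIXES:              # drop optional filenames
            if path.endswith("/" + suffix):
                path = path[: -len(suffix)]
                break
        netloc = f"{host}:{port}"                     # insert defaulted port
        return urlunsplit((parts.scheme, netloc, path, "", ""))

    # Example: prints "http://example.com:80/a/"
    print(normalize_url("HTTP://Example.COM/a/b/../Index.htm"))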
- Figure 5: Exact matches for the three walks.
- Figure 6: Host matches for the three walks.
- Figure 7: Non-zero matches for the three walks.
- Google appears to perform better than one might expect from reported results on search engine size.
- First, Google sometimes returns pages that it has not indexed, based on key words in the anchor text pointing to the page.
- Second, Google's index may contain pages with higher PageRank than those of other search engines, and the biases of this approach in favor of such pages may be significant.
Conclusions
- A method for generating a near-uniform sample of URLs by sampling URLs discovered during a random walk of the web.
- It is known that random walks tend to over-sample URLs with higher connectivity, or PageRank.
- To ameliorate that effect, they described how additional information obtained by the walk can be used to skew the sampling probability against pages with high PageRank.
- They don't take advantage of additional knowledge about the web. However, such an approach might incur other problems; for example, the changing nature of the web makes it unclear whether additional information used for sampling can be trusted to remain accurate over time.