Title: On Near-Uniform URL Sampling
On Near-Uniform URL Sampling
Introduction
- First, the authors consider several sampling approaches, including natural approaches based on random walks. Intuitively, the problem with using a random walk to sample URLs from the web is that more highly connected pages tend to be chosen more often.
- The authors suggest an improvement to the standard random-walk technique that mitigates this effect, leading to a more uniform sample.
- Second, the authors describe a test bed for validating their technique.
- In particular, the authors apply the sampling approach to a synthetic random graph whose connectivity was designed to resemble that of the web, and then analyze the distribution of these samples.
- Finally, the authors apply their sampling technique to three sizable random walks of the actual web. They then use these samples to estimate the distribution of pages over internet domains, and to estimate the coverage of various search engine indexes.
Prior Work
- Definition: The size of a search engine is the number of pages indexed by the search engine.
- Similarly, the size of the web corresponds to the number of publicly accessible static web pages.
- The question of understanding the size of the web and the relative sizes of search engines has been studied previously.
- The question of whether size is an appropriate gauge of search engine utility, however, remains a subject of debate.
- Another reason to study size is to learn about the growth of the web, so that appropriate predictions can be made and future trends can be spotted early.
- Initial work: The approach of sampling from NEC query logs leaves questions as to the statistical appropriateness of the sample, as well as the repeatability of the test by other researchers.
- Further work: An approach based on random testing of IP addresses to determine characteristics of hosts and pages found on the web, as well as to estimate the size of the web.
- This technique appears to be a useful approach for determining characteristics of web hosts.
- Given the high variance in the number of pages per host, and the difficulties in accessing pages from hosts by this approach, it is not clear that this technique provides a general methodology to accurately determine the size of the web. In particular, the scalability of this approach is uncertain for future 128-bit IPv6 addresses.
- The weight of an index is a generalization of the notion of its size.
- Each page can be assigned a weight, which corresponds to its importance.
- The weight of a search engine index is then defined to be the sum of the weights of the pages it contains.
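In symbols (the names S for an index and w(p) for a page's weight are our notation, not the slides'):

    W(S) = \sum_{p \in S} w(p)

Setting w(p) = 1 for every page recovers the plain size measure.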
- Another natural weight measure is, for example, the PageRank measure.
- Random walks on the web graph and search-engine probing techniques can determine the weight of an index when the weight measure is given by PageRank.
Prior Work: PageRank
- PageRank is a measure of the importance of a page that is fundamental to this sampling approach.
- Let pages p_1, p_2, ..., p_n link to page p, let C(p) be the number of links out of p, and let T be the total number of pages on the web. Then

    R(p) = \frac{d}{T} + (1 - d) \sum_{i=1}^{n} \frac{R(p_i)}{C(p_i)}

  where d is a dampening factor, here d = 0.15.
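A minimal sketch of computing this fixed point by power iteration, assuming a toy adjacency-list graph; the function name, iteration count, and handling of pages without out-links are illustrative assumptions, not details from the paper (Python is used for all sketches in these notes):

    # Power iteration for R(p) = d/T + (1 - d) * sum_i R(p_i)/C(p_i),
    # with d = 0.15 as in the slides. Dangling pages (no out-links)
    # simply leak rank in this sketch; a full computation redistributes it.
    def pagerank(out_links, d=0.15, iterations=50):
        """out_links maps each page to the list of pages it links to."""
        pages = list(out_links)
        T = len(pages)
        rank = {p: 1.0 / T for p in pages}        # start uniform
        for _ in range(iterations):
            new_rank = {p: d / T for p in pages}  # random-jump term d/T
            for p, targets in out_links.items():
                if targets:                       # C(p) = len(targets)
                    share = (1 - d) * rank[p] / len(targets)
                    for q in targets:
                        new_rank[q] += share
            rank = new_rank
        return rank

    # Example: a 3-page web in which every page is reachable.
    print(pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]}))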
Mercator
- An extensible, multi-threaded web crawler written in Java.
- Mercator was configured to use 100 crawling threads, thus performing 100 random walks in parallel.
- The crawl was seeded with 10K starting points chosen from a previous crawl.
Mathematical Underpinnings
- First, approximate Pr(X is crawled): the probability that a page X is visited at least once during a random walk of length L.
- Consider a page well-connected if it can be reached from almost every other page through several short paths.
- A short walk means about log n steps, where n is the number of pages in the web graph.
- Short walks: for a well-connected page X, treating the L steps of the walk as (nearly) independent visits to X, each occurring with probability R(X),

    \Pr(X \text{ is crawled}) \approx 1 - (1 - R(X))^L \quad (3)
                              \approx L \cdot R(X)     \quad (4)

  where the second approximation holds when L \cdot R(X) is small.
- The main point of the approach is to obtain more nearly uniform samples from the history of the random walk by sampling visited pages with a skewed probability distribution, namely by sampling inversely to each page's PageRank.
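Spelling out why this helps, using approximation (4) and a normalizing constant c (this one-line derivation is our gloss on the slide's claim, not a quote from the paper):

    \Pr(X \text{ sampled}) = \Pr(X \text{ crawled}) \cdot \frac{c}{R(X)}
                           \approx L \, R(X) \cdot \frac{c}{R(X)} = cL

which is independent of X, so every crawled page is sampled with approximately the same probability.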
- The random walk provides us with two possible ways of estimating the PageRank R(X):
- 1. Estimate R(X) by the visit ratio VR(X): the number of times the walk visited X, divided by the total number of steps in the walk.
- 2. Estimate the PageRank R(X) of a page by the sample PageRank R'(X), computed on the sample graph consisting of the pages and links actually traversed by the walk.
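A sketch of both estimators over a recorded walk, reusing the pagerank helper above. The walk is assumed to be a plain list of visited pages, which glosses over the fact that consecutive entries can be random jumps rather than genuine link traversals:

    from collections import Counter, defaultdict

    def visit_ratio(walk):
        """VR(X): number of times the walk visited X, divided by the
        total number of steps in the walk."""
        counts = Counter(walk)
        L = len(walk)
        return {page: c / L for page, c in counts.items()}

    def sample_pagerank(walk):
        """R'(X): PageRank computed on the sample graph, i.e. the
        subgraph of pages and transitions traversed by the walk."""
        sample_graph = defaultdict(set)
        for src, dst in zip(walk, walk[1:]):
            sample_graph[src].add(dst)     # treats every step as a link,
            sample_graph.setdefault(dst, set())  # including random jumps
        return pagerank({p: sorted(t) for p, t in sample_graph.items()})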
Limitations
- The crawl-based approach finds only pages that are accessible through some sequence of links from the initial seed set.
- Furthermore, the walk avoids crawling dynamic content by stripping the query component from discovered URLs, and logs only content of type text/html.
- The technique is biased against pages that are not well-connected.
Limitations That Stem from the Mathematical Framework
- Initial bias: a bias based on the starting point, mitigated by choosing a large, diverse set of initial starting points.
- Dependence: a dependence between pages in the random walk.
- Short cycles: a specific problem raised by the dependence problem; in particular, it implies that approximation (3) is inaccurate for these pages.
- Large PageRanks: approximation (3) is inappropriate for long walks and pages with very high PageRank. Approximation (4) will therefore overestimate the probability that a high-PageRank page is crawled, since its right-hand side can be larger than 1.
- Random jumps: the random walk approximates the behavior of a random web surfer by jumping to a random page visited previously, rather than to a completely random page.
- Approximation (4) doesn't guarantee uniform samples, since there are other possible sources of error in using the visit ratio to estimate the PageRank of a page.
A Random Test Bed
- The graph represented by the web has a distinguishing structure. For example, the in-degrees and out-degrees of the nodes appear to have a power-law (or Zipf-like) distribution.
- A random variable X is said to have a power-law distribution if

    \Pr(X = k) \propto k^{-a}

  for some real number a and some range of k.
- The total in-degree and out-degree must match.
- Random connections are then made from out-links to in-links via a random permutation.
- The probability of having out-degree k was set to be proportional to 1/k^2.38, for k in the range five to twenty.
- The probability of having in-degree k was set to be proportional to 1/k^2.1; the range of the in-degrees was therefore set to lie between five and eighteen.
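A sketch of this construction; the exponents and degree ranges come from the slides, while the stub-matching details (including the crude trim that forces the stub totals to match exactly, and the tolerance of occasional self-loops and multi-edges) are illustrative assumptions:

    import random

    def sample_degrees(n, exponent, lo, hi):
        """Draw n degrees with Pr(degree = k) proportional to k**(-exponent),
        for lo <= k <= hi."""
        ks = list(range(lo, hi + 1))
        weights = [k ** -exponent for k in ks]
        return random.choices(ks, weights=weights, k=n)

    def build_graph(n):
        out_deg = sample_degrees(n, 2.38, 5, 20)  # out-degrees ~ 1/k^2.38
        in_deg = sample_degrees(n, 2.10, 5, 18)   # in-degrees  ~ 1/k^2.1
        # Every out-link stub must be matched to an in-link stub, so the
        # totals must agree; trimming the longer side is one crude fix.
        out_stubs = [v for v, d in enumerate(out_deg) for _ in range(d)]
        in_stubs = [v for v, d in enumerate(in_deg) for _ in range(d)]
        m = min(len(out_stubs), len(in_stubs))
        out_stubs, in_stubs = out_stubs[:m], in_stubs[:m]
        random.shuffle(in_stubs)      # the random permutation of the slide
        return list(zip(out_stubs, in_stubs))

    print(len(build_graph(10_000)))   # number of edges in a small instance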
- The final graph has 10,000,000 nodes and 82,086,395 edges.
- To crawl this graph, they wrote a program that reads a description of the graph and acts as a web server, returning synthetic pages whose links correspond to those of the graph.
- Mercator was then used to perform a random walk on this server.
- Three sets of two thousand samples each were chosen from the visited nodes, using three different sampling techniques.
- A PR sample was obtained by sampling a crawled page X with probability inversely proportional to its apparent PageRank R'(X).
- A VR sample was obtained by sampling a crawled page X with probability inversely proportional to its visit ratio VR(X).
- Finally, a random sample was obtained by simply choosing 2000 of the crawled pages independently and uniformly at random.
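A sketch of the three sampling techniques, built on the estimators above; drawing with replacement via random.choices is an implementation assumption (the paper's samples may well have been drawn without replacement):

    import random

    def weighted_sample(pages, scores, k=2000):
        """Choose k pages, each with probability inversely proportional
        to its score (apparent PageRank R'(X) or visit ratio VR(X))."""
        weights = [1.0 / scores[p] for p in pages]
        return random.choices(pages, weights=weights, k=k)

    def three_samples(walk):
        pages = sorted(set(walk))
        return {
            "PR": weighted_sample(pages, sample_pagerank(walk)),  # 1/R'(X)
            "VR": weighted_sample(pages, visit_ratio(walk)),      # 1/VR(X)
            "random": random.choices(pages, k=2000),              # uniform
        }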
- Compare the proportions of the in-degrees, out-degrees, and PageRanks of the samples with their proportions in the original graph.
- Consider first the out-degrees, as shown in Figure 1.
- The graph on the right normalizes these distributions against the percentages from the original graph.
- Figure 1: Out-degree distributions for the original graph and for nodes obtained by three different sampling techniques.
- In both graphs, sample curves that are closer to the original-graph curve are better. Although the distributions for the samples differ somewhat from that of the original graph, the differences are minor, and are due to the variation inherent in any probabilistic experiment.
- In contrast, when the samples are compared to the original graph in terms of in-degree and PageRank, there does appear to be a systematic bias against pages with low in-degree and low PageRank.
- Figures 2 and 3 show that the most biased results for in-degree and PageRank appear in the random samples.
- Figure 2: In-degree distributions for the original graph and for nodes obtained by three different sampling techniques.
- Figure 3: PageRank distributions for the original graph and for nodes obtained by three different sampling techniques.
- Similar experiments were conducted with random graphs having broader ranges of in- and out-degrees, more similar to those found on the web.
- A potential problem is that random graphs constructed with small in- and out-degrees might contain disjoint pieces that are never sampled, or long trails that are not well-connected.
- In such graphs, they again find that using the values VR(X) or R'(X) to re-scale the sampling probabilities makes the resulting sample appear more uniform.
Sampling Random Walks of the Web
- Various attributes of the three random walks (summary table not reproduced here).
- Note that Walk 3 downloaded pages at roughly twice the rate of the other two walks; they attribute this to the variability inherent in network bandwidth and DNS resolution.
- Figure 4: Overlap of the URLs (in thousands) visited during the three walks.
- Figure 4 shows a Venn diagram. The regions enclosed by the blue, red, and green lines represent the sets of URLs encountered by Walks 1, 2, and 3, respectively.
- The main conclusion of Figure 4 is that 83.2% of all visited URLs were visited by only one walk. Hence, their walks seem to disperse well.
Applications
- The samples are used to measure properties of the web, in two groups:
- 1. Determine characteristics of the URLs themselves. Examples include measuring distributions of the following URL properties: length, number of arcs, port numbers, filename extensions, and top-level internet domains.
- 2. Determine characteristics of the documents referred to by the URLs. Examples include measuring distributions of the following document properties: length, character set, language, number of out-links, and number of embedded images.
Estimating the Top-Level Domain Distribution
- Table 2 shows, for each walk and each sampling method (using 10K URLs), the percentage of pages in the most popular internet domains.
- The results are consistent over the three walks that are sampled in the same way.
- As the domain becomes smaller, the variance in percentages increases.
- Table 2: Percentage of sampled URLs in each top-level domain.
Search Engine Coverage
- To test whether a URL is indexed by a search engine:
- Use a list of words and an approximate measure of their frequency.
- Find the r rarest words on the page.
- Then query the search engine using a conjunction of these r rarest words, and check for the appropriate URL in the results (sketched below).
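A sketch of this strong-query construction; the tokenizer, the word-frequency table, the choice r = 5, and the "+word" conjunction syntax are all illustrative assumptions:

    import re

    def strong_query(page_text, word_frequency, r=5):
        """Build a conjunctive query from the r rarest words on a page.
        word_frequency maps a word to its approximate corpus frequency."""
        words = set(re.findall(r"[a-z]+", page_text.lower()))
        known = [w for w in words if w in word_frequency]
        rarest = sorted(known, key=lambda w: word_frequency[w])[:r]
        return " ".join("+" + w for w in rarest)   # e.g. "+foo +bar +baz"

The search engine is then queried with this string, and the result list is scanned for the sampled URL.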
- In practice, strong queries don't always uniquely identify a page:
- Sampled pages may contain few rare words.
- Mirror sites, duplicates or near-duplicates of the page, or other spurious matches may be returned.
- Northern Light can return pages that don't contain all of the words in the query.
- All URLs were normalized by converting them to lowercase, removing optional extensions such as index.htm and home.htm, inserting defaulted port numbers if necessary, and removing relative references of the form "..".
- The results are shown in Figures 5, 6, and 7.
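A sketch of this normalization; the exact suffix list and the scheme-to-port mapping are assumptions for illustration:

    from urllib.parse import urlsplit, urlunsplit
    import posixpath

    DEFAULT_PORTS = {"http": 80, "https": 443}
    OPTIONAL_SUFFIXES = ("index.htm", "index.html", "home.htm", "home.html")

    def normalize_url(url):
        parts = urlsplit(url.lower())                 # lowercase everything
        host = parts.hostname or ""
        port = parts.port or DEFAULT_PORTS.get(parts.scheme, 80)
        path = posixpath.normpath(parts.path or "/")  # resolve ".." references
        for suffix in OPTIONAL_SUFFIXES:              # drop optional filenames
            if path.endswith("/" + suffix):
                path = path[: -len(suffix)]
                break
        netloc = f"{host}:{port}"                     # insert defaulted port
        return urlunsplit((parts.scheme, netloc, path, "", ""))

    # Example: prints "http://example.com:80/a/"
    print(normalize_url("HTTP://Example.COM/a/b/../Index.htm"))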
- Figure 5: Exact matches for the three walks.
- Figure 6: Host matches for the three walks.
- Figure 7: Non-zero matches for the three walks.
- Google appears to perform better than one might expect from reported results on search engine size.
- First, Google sometimes returns pages that it has not indexed, based on key words in the anchor text pointing to the page.
- Second, Google's index may contain pages with higher PageRank than those of other search engines, and the biases of this approach in favor of such pages may be significant.
Conclusions
- A method for generating a near-uniform sample of URLs by sampling URLs discovered during a random walk of the web.
- It is known that random walks tend to over-sample URLs with higher connectivity, or PageRank.
- To ameliorate that effect, they described how additional information obtained by the walk can be used to skew the sampling probability against pages with high PageRank.
- They don't take advantage of additional knowledge about the web. However, such an approach might incur other problems; for example, the changing nature of the web makes it unclear whether additional information used for sampling can be trusted to remain accurate over time.