On Near-Uniform URL Sampling (Presentation Transcript)
1
On Near-Uniform URL Sampling
  • Glenn M. Bernstein

2
Introduction
  • First, the authors consider several sampling
    approaches, including natural approaches based on
    random walks. Intuitively, the problem with
    using a random walk in order to sample URLs from
    the web is that pages that are more highly
    connected tend to be chosen more often.

3
Introduction
  • The authors suggest an improvement to the
    standard random walk technique that mitigates
    this effect, leading to a more uniform sample.

4
Introduction
  • Second, the authors describe a test bed for
    validating their technique.
  • In particular, the authors apply a sampling
    approach to a synthetic random graph whose
    connectivity was designed to resemble that of the
    web, and then analyze the distribution of these
    samples.

5
Introduction
  • Finally, the authors apply their sampling
    technique to three sizable random walks of the
    actual web. They then use these samples to
    estimate the distribution of pages over internet
    domains, and to estimate the coverage of various
    search engine indexes.

6
Prior Work
  • Definition: The size of a search engine is the
    number of pages indexed by the search engine.
  • Similarly, the size of the web corresponds to the
    number of publicly accessible static web pages.

7
Prior Work
  • The question of understanding the size of the web
    and the relative sizes of search engines has been
    studied previously.
  • The question of whether size is an appropriate
    gauge of search engine utility, however, remains
    a subject of debate.
  • Another reason to study size is to learn about
    the growth of the web, so that appropriate
    predictions can be made and future trends can be
    spotted early.

8
Prior Work
  • Initial work: The approach of sampling from NEC
    query logs leaves questions as to the statistical
    appropriateness of the sample, as well as the
    repeatability of the test by other researchers.
  • Further work: An approach based on random testing
    of IP addresses to determine characteristics of
    hosts and pages found on the web, as well as to
    estimate web size.

9
Prior Work
  • This technique appears to be a useful approach
    for determining characteristics of web hosts.
  • Given the high variance in the number of
    pages per host, and the difficulties in accessing
    pages from hosts by this approach, it is not
    clear that this technique provides a general
    methodology to accurately determine the size of
    the web. In particular, the scalability of this
    approach is uncertain for future 128-bit IPv6
    addresses.

10
Prior Work
  • The weight of an index is a generalization of the
    notion of its size
  • Each page can be assigned a weight, which
    corresponds to its importance
  • The weight of a search engine index is then
    defined to be the sum of the weights of the pages
    it contains.

11
Prior Work
  • Another natural weight measure is, for example,
    the PageRank measure.
  • Random walks on the web graph and search-engine
    probing techniques can determine the weight of an
    index when the weight measure is given by the
    PageRank measure.

12
Prior Work - PageRank
  • The PageRank R(p) of a page p is a measure that
    is fundamental to this sampling approach.
  • PageRank:

    R(p) = ε/T + (1 − ε) · (R(p1)/C(p1) + ... + R(pn)/C(pn))

    where
  • T is the total number of pages on the web,
  • ε = 0.15,
  • pages p1, p2, ..., pn are the pages that link to
    page p, and
  • C(p) is the number of links out of page p.
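  • As an illustration, a minimal Python
    power-iteration sketch of this formula (the toy
    graph, function names, and iteration count are
    illustrative assumptions, not from the paper):

    # Power-iteration sketch of the PageRank formula above.
    EPSILON = 0.15  # jump probability (the epsilon in the formula)

    def pagerank(out_links, iterations=50):
        """out_links: dict mapping each page to the pages it links to."""
        pages = list(out_links)
        T = len(pages)                      # T: total number of pages
        rank = {p: 1.0 / T for p in pages}  # start from the uniform distribution
        for _ in range(iterations):
            new_rank = {p: EPSILON / T for p in pages}  # random-jump term
            for p, targets in out_links.items():
                share = (1 - EPSILON) * rank[p] / len(targets)  # R(p)/C(p)
                for q in targets:
                    new_rank[q] += share
            rank = new_rank
        return rank

    # Three pages in a cycle: each converges to PageRank 1/3.
    print(pagerank({"a": ["b"], "b": ["c"], "c": ["a"]}))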

13
Mercator
  • An extensible, multi-threaded web crawler written
    in Java
  • Mercator was configured to use 100 crawling
    threads, thus running 100 random walks in parallel
  • The crawl is seeded with 10K starting points
    chosen from a previous crawl

14
Mathematical Underpinnings
  • First, approximate Pr(X is crawled), the
    probability that a page X is visited at some
    point during the walk.

15
Mathematical Underpinnings
  • Consider a page well-connected if it can be
    reached from almost every other page through
    several short paths.
  • A short walk means about log n steps, where n is
    the number of pages in the Web graph.
  • Short walks: for a walk of length L and a
    well-connected page X,

    Pr(X is crawled) ≈ 1 − (1 − R(X))^L    (3)

    which, for small L · R(X), simplifies to

    Pr(X is crawled) ≈ L · R(X)    (4)

16
Mathematical Underpinnings
  • The main point of the approach is to obtain more
    nearly uniform samples from the history of the
    random walk by sampling visited pages with a
    skewed probability distribution, namely by
    sampling inversely to each page's PageRank.
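  • To see why this works, combine approximation (4)
    with a sampling probability proportional to
    1/R(X) (c below is a normalizing constant; this
    is a sketch of the argument, not the paper's full
    analysis):

    Pr(X sampled) = Pr(X crawled) · c/R(X) ≈ L · R(X) · c/R(X) = c · L

  • Since c · L does not depend on X, every crawled
    page is sampled with approximately the same
    probability.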

17
Mathematical Underpinnings
  • The random walk provides two possible ways of
    estimating the PageRank R(X) of a page X:
  • 1. Estimate R(X) by the fraction of the walk's
    steps that visited X, and call this the visit
    ratio VR(X).

18
Mathematical Underpinnings
  • 2. Estimate the PageRank R(X) of a page by the
    sample PageRank R'(X) computed on the sample
    graph of pages and links traversed by the walk.
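  • Concretely, R'(X) can be computed by running a
    PageRank iteration (such as the pagerank() sketch
    on the Prior Work - PageRank slide) on the
    crawled subgraph; the name below is illustrative:

    # crawled_links: out-links restricted to pages seen by the walk
    apparent_pr = pagerank(crawled_links)  # R'(X) for each crawled X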

19
Limitations
  • The crawl-based approach finds only pages that
    are accessible through some sequence of links
    from the initial seed set.
  • Furthermore, the walk avoids crawling dynamic
    content by stripping the query component from
    discovered URLs, and logs only content of type
    text/html.
  • The technique is therefore biased against pages
    that are not well-connected.

20
Limitations that stem from Mathematical framework
  • Initial bias: a bias based on the starting point,
    mitigated by choosing a large, diverse set of
    initial starting points.
  • Dependence: a dependence between pages in the
    random walk.
  • Short cycles: a specific problem raised by the
    dependence problem. In particular, short cycles
    imply that approximation (3) is inaccurate for
    these pages.

21
Limitations that stem from Mathematical framework
  • Large PageRanks: Approximation (3) is
    inappropriate for long walks and pages with very
    high PageRank. Approximation (4) will therefore
    overestimate the probability that a high-PageRank
    page is crawled, since the right-hand side can be
    larger than 1.

22
Limitations
  • Random jumps: The random walk approximates the
    behavior of a random web surfer by jumping to a
    random page visited previously, rather than to a
    completely random page.
  • Approximation (4) doesn't guarantee uniform
    samples, since there are other possible sources
    of error in using the visit ratio to estimate the
    PageRank of a page.

23
A Random Test Bed
  • The graph represented by the web has a
    distinguishing structure. For example, the
    in-degrees and out-degrees of the nodes appear to
    have a power-law (or Zipf-like) distribution.
  • A random variable X is said to have a power-law
    distribution if

    Pr(X = k) ∝ 1/k^a

    for some real number a and some range of k.

24
A Random Test Bed
  • The total in-degree and total out-degree must
    match.
  • Random connections are then made from out-links
    to in-links via a random permutation.
  • The probability of having out-degree k was set to
    be proportional to 1/k^2.38, for k in the range
    five to twenty.
  • The probability of having in-degree k was set to
    be proportional to 1/k^2.1; the range of the
    in-degrees was set to lie between five and
    eighteen. (A sketch of this construction follows
    below.)
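  • A Python sketch of this construction; the
    degree-matching step (topping up random nodes,
    which can slightly exceed the stated ranges) is
    an assumed detail the slides leave open:

    import random

    def power_law_degree(a, lo, hi):
        # Sample k in [lo, hi] with Pr(k) proportional to 1/k^a.
        ks = list(range(lo, hi + 1))
        return random.choices(ks, weights=[k ** -a for k in ks])[0]

    def make_graph(n):
        out_deg = [power_law_degree(2.38, 5, 20) for _ in range(n)]
        in_deg = [power_law_degree(2.1, 5, 18) for _ in range(n)]
        # Force total in-degree to equal total out-degree by topping
        # up random nodes on the smaller side (a simplification).
        while sum(out_deg) != sum(in_deg):
            if sum(out_deg) < sum(in_deg):
                out_deg[random.randrange(n)] += 1
            else:
                in_deg[random.randrange(n)] += 1
        # One "stub" per unit of degree; a random permutation then
        # matches out-link stubs to in-link stubs.
        sources = [i for i in range(n) for _ in range(out_deg[i])]
        targets = [i for i in range(n) for _ in range(in_deg[i])]
        random.shuffle(targets)
        return list(zip(sources, targets))  # may contain self-loops

    edges = make_graph(10_000)  # the paper's graph had 10,000,000 nodes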

25
A Random Test Bed
  • The final graph has 10,000,000 nodes and
    82,086,395 edges
  • To crawl this graph, they wrote a program that
    reads a description of the graph and acts as a
    web server, returning synthetic pages whose links
    correspond to those of the graph.
  • Mercator was then used to perform a random walk
    against this server.

26
A Random Test Bed
  • Three sets of two thousand samples each were
    chosen from the visited nodes, using three
    different sampling techniques
  • A PR sample was obtained by sampling a crawled
    page X with probability inversely proportional to
    its apparent PageRank R'(X)

27
A Random Test Bed
  • A VR sample was obtained by sampling a crawled
    page X with probability inversely proportional to
    its visit ratio VR(X).
  • Finally, a random sample was obtained by simply
    choosing 2,000 of the crawled pages independently
    and uniformly at random. (A sketch of all three
    sampling schemes follows below.)
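  • A Python sketch of the three sampling schemes
    (names such as visits and length_of_walk are
    illustrative; sampling here is with replacement
    for simplicity, which the paper may not have
    done):

    import random

    def weighted_sample(pages, weight, k=2000):
        # Draw k pages, each with probability inversely
        # proportional to weight[p].
        inv = [1.0 / weight[p] for p in pages]
        return random.choices(pages, weights=inv, k=k)

    # PR sample:     weight[p] = apparent PageRank R'(p).
    # VR sample:     weight[p] = visit ratio visits[p] / length_of_walk.
    # Random sample: random.sample(pages, 2000), uniform over crawled pages.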

28
A Random Test Bed
  • Compare the proportions of the in-degrees,
    out-degrees, and PageRanks of our samples with
    their proportions in the original graph
  • Consider first the out-degrees as shown in Figure
    1.
  • The graph on the right normalizes these
    distributions against the percentages from the
    original graph

29
A Random Test Bed
  • Figure 1 Out-degree distributions for the
    original graph and for nodes obtained by three
    different sampling techniques.

30
A Random Test Bed
  • In both graphs, sample curves that are closer to
    the graph curve are better. Although the
    distributions for the samples differ somewhat
    from that of the original graph, the differences
    are minor, and are due to the variation inherent
    in any probabilistic experiment.

31
A Random Test Bed
  • In contrast, when they compare samples to the
    original graph in terms of the in-degree and
    PageRank, there does appear to be a systematic
    bias against pages with low in-degree and low
    PageRank.
  • Figures 2 and 3 show that the most biased results
    for in-degree and PageRank appear in the random
    samples.

32
A Random Test Bed
  • Figure 2 In-degree distributions for the
    original graph and for nodes obtained by three
    different sampling techniques.

33
A Random Test Bed
  • Figure 3 PageRank distributions for the original
    graph and for nodes obtained by three different
    sampling techniques.

34
A Random Test Bed
  • Similar experiments were conducted with random
    graphs having broader ranges of in- and
    out-degrees, more similar to those found on the
    web.
  • A potential problem is that random graphs
    constructed with small in- and out-degrees might
    contain disjoint pieces that are never sampled,
    or long trails that are not well-connected.

35
A Random Test Bed
  • In such graphs, they again find that using the
    values VR(X) or R'(X) to re-scale sampling
    probabilities makes the resulting sample appear
    more uniform.

36
Sampling Random Walks of the Web
  • Various attributes of the three random walks

37
Sampling Random Walks of the Web
  • Note that Walk 3 downloaded pages at roughly
    twice the rate of the other two walks; they
    attribute this to the variability inherent in
    network bandwidth and DNS resolution.

38
Sampling Random Walks of the Web
  • Figure 4 Overlap of the URLs (in thousands)
    visited during the three walks.

39
Sampling Random Walks of the Web
  • Figure 4 shows a Venn diagram. The regions
    enclosed by the blue, red, and green lines
    represent the sets of URLs encountered by Walks
    1, 2, and 3, respectively.
  • The main conclusion of Figure 4 is that 83.2% of
    all visited URLs were visited by only one walk.
    Hence, their walks seem to disperse well.

40
Applications
  • Measured properties of the web fall into two
    groups:
  • 1. Determine characteristics of the URLs
    themselves. Examples include measuring
    distributions of the following URL properties:
    length, number of arcs, port numbers, filename
    extensions, and top-level internet domains.

41
Applications
  • 2. Determine characteristics of the documents
    referred to by the URLs. Examples include
    measuring distributions of the following document
    properties: length, character set, language,
    number of out-links, and number of embedded
    images.

42
Estimating the Top-Level Domain Distribution
  • Table 2 shows for each walk and each sampling
    method (using 10K URLs) the percentage of pages
    in the most popular internet domains.
  • The results are consistent across the three walks
    when sampled in the same way.
  • As the domain becomes smaller, the variance in
    percentages increases.

43
Estimating the Top-Level Domain Distribution
  • Table 2 Percentage of sampled URLs in each
    top-level domain.

44
Estimating the Top-Level Domain Distribution
  • Table 2 (continued): Percentage of sampled URLs
    in each top-level domain.

45
Estimating the Top-Level Domain Distribution
  • Table 2 (continued): Percentage of sampled URLs
    in each top-level domain.

46
Search Engine Coverage
  • To test whether a URL is indexed by a search
    engine:
  • Use a list of words and an approximate measure of
    their frequency.
  • Find the r rarest words on the page.
  • Then query the search engine using a conjunction
    of these r rarest words and check for the
    appropriate URL, as sketched below.
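  • A Python sketch of this strong-query test; the
    frequency table, the choice r = 8, and the
    search helper are assumptions for illustration:

    def is_indexed(page_words, url, frequency, search, r=8):
        # Build a "strong query": a conjunction of the r rarest
        # words on the page, intended to identify it uniquely.
        known = [w for w in set(page_words) if w in frequency]
        rarest = sorted(known, key=lambda w: frequency[w])[:r]
        query = " AND ".join(rarest)
        results = search(query)  # hypothetical engine call returning URLs
        return url in results    # check for the appropriate URL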

47
Search Engine Coverage
  • In practice, strong queries don't always uniquely
    identify a page:
  • Sampled pages may contain few rare words.
  • Mirror sites, duplicates or near-duplicates of
    the page, or other spurious matches may appear.
  • Northern Light can return pages that don't
    contain all of the words in the query.

48
Search Engine Coverage
  • All URLs were normalized by converting to
    lowercase, removing optional extensions such as
    index.htm and home.htm, inserting defaulted port
    numbers if necessary, and removing relative
    references of the form ...
  • The results appear in Figures 5, 6 and 7; a
    normalization sketch follows below.
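  • A Python sketch of these normalization steps; the
    extension list and port table are assumptions,
    and the elided relative-reference forms are
    approximated by standard "." and ".." resolution:

    from urllib.parse import urlsplit, urlunsplit

    DEFAULT_PORTS = {"http": 80, "https": 443}  # assumed defaults
    OPTIONAL_LEAVES = {"index.htm", "index.html", "home.htm", "home.html"}

    def normalize(url):
        parts = urlsplit(url.lower())  # convert to lowercase
        netloc = parts.netloc
        if ":" not in netloc:          # insert default port if absent
            netloc = "%s:%d" % (netloc, DEFAULT_PORTS.get(parts.scheme, 80))
        segments = []
        for s in parts.path.split("/"):
            if s in ("", "."):
                continue
            if s == "..":              # resolve relative references
                if segments:
                    segments.pop()
            else:
                segments.append(s)
        if segments and segments[-1] in OPTIONAL_LEAVES:
            segments.pop()             # drop optional index/home leaf
        return urlunsplit(
            (parts.scheme, netloc, "/" + "/".join(segments), "", ""))

    print(normalize("HTTP://Example.COM/a/b/../c/Index.htm"))
    # -> http://example.com:80/a/c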

49
Search Engine Coverage
  • Figure 5 Exact matches for the three walks.

50
Search Engine Coverage
  • Figure 6 Host matches for the three walks.

51
Search Engine Coverage
  • Figure 7 Non-zero matches for the three walks.

52
Search Engine Coverage
  • Google appears to perform better than one might
    expect from reported results on search engine
    size.
  • First, Google sometimes returns pages that it has
    not indexed, based on keywords in the anchor text
    pointing to the page.
  • Second, Google's index may contain pages with
    higher PageRank than other search engines,
53
Search Engine Coverage
  • and the biases of this approach in favor of
    such pages may be significant.

54
Conclusions
  • A method for generating a near-uniform sample of
    URLs by sampling URLs discovered during a random
    walk of the web.
  • It is known that random walks tend to over-sample
    URLs with higher connectivity, or PageRank.
  • To ameliorate that effect, they described

55
Conclusion
  • how additional information obtained by the walk
    can be used to skew the sampling probability
    against pages with high PageRank.
  • They don't take advantage of additional knowledge
    about the web. However, such an approach might
    incur other problems; for example, the changing
    nature of the web makes it unclear whether
    additional information used for sampling can be
    trusted to remain accurate over time.