1
CS246
  • Web Characteristics

2
Web Characteristics
  • What is the Web like?
  • Any questions on some of the characteristics
    and/or properties of the Web?

3
Web Characteristics
  • Size of the Web
  • Search engine coverage
  • Link structure of the Web

4
How Many Web Sites?
  • Polling every IP
  • 2^32 ≈ 4B IPs, 10 sec/IP, 1,000 simultaneous
    connections
  • 2^32 × 10 / (1,000 × 24 × 60 × 60) ≈ 460 days
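As a quick sanity check of this arithmetic, a minimal sketch in Python using the slide's own numbers:

```python
total_ips = 4_000_000_000    # 2^32 rounded to ~4B, as on the slide
sec_per_ip = 10              # time budget per IP
connections = 1_000          # simultaneous connections

days = total_ips * sec_per_ip / connections / (24 * 60 * 60)
print(f"{days:.0f} days")    # ~463, i.e. roughly 460 days
```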

5
How Many Web Sites?
  • Sampling based

6
How Many Web Sites?
  • Select S random IPs
  • Send HTTP requests to port 80 at the selected IPs
  • Count valid replies (HTTP 200 OK): V
  • Estimate the total: T = 2^32 × V / S
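A minimal sketch of this estimator (assumptions: the sample size S = 10,000 is arbitrary, and a real survey would exclude reserved/private address ranges and be far more careful about timeouts and retries):

```python
import random
import socket

def replies_ok(ip: str, timeout: float = 5.0) -> bool:
    """True if port 80 at `ip` answers an HTTP request with 200 OK."""
    try:
        with socket.create_connection((ip, 80), timeout=timeout) as s:
            s.sendall(b"HEAD / HTTP/1.0\r\n\r\n")
            return b" 200 " in s.recv(32)    # e.g. b"HTTP/1.0 200 OK"
    except OSError:
        return False

S = 10_000                                   # number of random IPs to probe
sample = [".".join(str(random.randrange(256)) for _ in range(4))
          for _ in range(S)]
V = sum(replies_ok(ip) for ip in sample)     # valid replies
T = 2**32 * V / S                            # scale up to the full IP space
print(f"V = {V}, estimated sites T = {T:,.0f}")
```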

7
How Many Web Sites?
  • OCLC (Online Computer Library Center) results
  • http://wcp.oclc.org
  • Total number of available IPs: 2^32 ≈ 4.2 billion
  • Growth (in terms of sites) has slowed down

8
Issues
  • Multi-hosted servers
  • cnn.com → 207.25.71.5, 207.25.71.20, ...
  • Select the lowest IP address. For each sampled IP:
  • Look up its domain name
  • Resolve the name back to IPs
  • Is our sampled IP the lowest one?
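One plausible coding of the lowest-IP rule (a sketch; `socket.gethostbyaddr` and `socket.gethostbyname_ex` do the reverse and forward lookups, and in practice many IPs have no usable reverse DNS):

```python
import socket

def counts_here(sampled_ip: str) -> bool:
    """Lowest-IP rule: count a multi-hosted server only at its lowest IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(sampled_ip)   # IP -> name
        _, _, ips = socket.gethostbyname_ex(hostname)       # name -> all IPs
    except OSError:
        return True   # no DNS data, so nothing to deduplicate against
    as_tuple = lambda ip: tuple(int(part) for part in ip.split("."))
    # Keep this sample only if it is the numerically lowest of the host's IPs.
    return as_tuple(sampled_ip) == min(map(as_tuple, ips))
```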

9
Issues
  • Virtual hosting
  • Multiple sites on the same IP
  • Find the average number of hosted sites per IP
  • 7.4M sites on 3.4M IPs, found by polling all
    available site names [Netcraft, 2000]
  • Other ports?
  • Temporarily unavailable sites?

10
Where Are They Located?
11
What Language?
(Based on Web sites)
12
Questions?
13
How Many Web Pages?
  • Sampling-based?
  • Problem: the number of possible URLs is infinite

14
How Many Web Pages?
  • Solution 1
  • Estimate the average number of pages per site:
    total pages = (average no. of pages per site) × (total no. of sites)
  • Algorithm
  • For each site with valid reply, download all
    pages
  • Take average
  • Result [LG99]
  • 289 pages per site × 2.8M sites
  • ≈ 800M pages

15
Issues
  • A small number of sites with TONS of pages
  • Very likely to miss these sites
  • Lots of samples necessary

16
How Many Pages?
  • Solution 2: Sampling-based

17
Related Question
  • How many deer in Yosemite National Park?

18
Random Page?
  • Idea: Random walk
  • Start from the Yahoo home page
  • Follow random links, say, 10,000 times
  • Select the page we end up on
  • Problem
  • Biased toward popular pages, e.g., Microsoft, Google

19
Random Page?
  • Random walks on a regular, undirected graph →
    uniform random sample
  • Regular graph: an equal number of edges at all nodes
  • After (log N)/ε steps, the walk's position is an
    approximately uniform random node
  • ε depends on the graph structure
  • N: the number of nodes
  • Idea
  • Transform the Web graph into a regular, undirected
    graph
  • Perform a random walk on it

20
Ideal Random Walk
  • Generate the regular, undirected graph
  • Make all edges undirected
  • Decide d, the maximum number of edges per page, say
    300,000
  • If edges(n) < 300,000, then add self-loops to n
  • Perform random walks on the graph
  • ε ≈ 10^-5 for the 1996 Web, N ≈ 10^9
  • → 3,000,000 steps, but mostly self-loops
  • Only ~100 steps follow actual links
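The self-loops never need to be stored explicitly; one step of the ideal walk can be sketched as follows (assuming, unrealistically, that each node's full undirected edge list is known):

```python
import random

D = 300_000   # the chosen maximum degree d

def ideal_step(node, real_edges):
    """One step on the regularized graph, with implicit self-loops.

    real_edges[node] holds the node's undirected Web edges; the remaining
    D - len(real_edges[node]) slots are self-loops, so the walk moves with
    probability len(real_edges[node]) / D and otherwise stays put.
    """
    edges = real_edges[node]
    if random.randrange(D) < len(edges):
        return random.choice(edges)   # follow a real edge
    return node                       # self-loop: stay at the node
```

With ~10 real edges per page on average, only about 10/300,000 of the steps actually move, which is where the ~100 real moves out of 3,000,000 steps come from.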

21
Different Interpretation
  • Random walk on the irregular Web graph
  • High chance of being at a popular node at any
    particular time
  • Self-loops increase the chance of being at an unpopular
    node by making the walk stay there longer

[Figure: a popular node with many real edges vs. unpopular nodes padded with self-loops]
22
Issues
  • How do we get the edges to/from node n?
  • Edges discovered so far
  • From search engines, like AltaVista, HotBot
  • Incoming links are still only partially known

23
WebWalker [BBCF00]
  • Our graph does not have to be the same as the
    real Web
  • Construct the regular undirected graph while
    performing the random walk
  • Add node n when the walk first visits n
  • Find the edges for node n at that time
  • Edges discovered so far
  • From search engines
  • Add self-loops as necessary
  • Ignore any more edges to n later
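A sketch of the WebWalker step under these rules. `get_out_links` and `search_engine_in_links` are hypothetical stubs standing in for the page fetch and the search-engine link queries:

```python
import random

D = 300_000    # fixed degree; unused edge slots act as self-loops
frozen = {}    # node -> edge list, fixed at first visit and never updated

def get_out_links(n):             # hypothetical stub: fetch page n, parse links
    return []

def search_engine_in_links(n):    # hypothetical stub: ask a search engine
    return []                     # for known links pointing at n

def edges_of(n):
    """Freeze n's edge list the first time the walk visits n."""
    if n not in frozen:
        known = set(get_out_links(n)) | set(search_engine_in_links(n))
        frozen[n] = list(known)[:D]   # edges to n discovered later are ignored
    return frozen[n]

def webwalker_step(n):
    edges = edges_of(n)
    if random.randrange(D) < len(edges):   # move with probability deg(n) / D
        return random.choice(edges)
    return n                               # otherwise an implicit self-loop
```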

24
WebWalker
  • d = 5

[Figure: example graph for WebWalker with d = 5; nodes with 1, 2, 2, 3, and 1 real edges are padded with self-loops up to degree 5]
25
WebWalker
  • Why ignore new incoming edges?
  • To keep the graph regular: the discovered parts of
    the graph never change
  • The uniformity theorem still holds
  • Can we arrive at all reachable pages?
  • Yes, since we ignore only the edges to already-visited nodes
  • Can we use the same ε?
  • No

26
WebWalker results
  • Size of the Web
  • AltaVista index size: B = 250M
  • Fraction of walk samples indexed: |B ∩ S| / |S| ≈ 35%
  • T = 250M / 0.35 ≈ 720M pages
  • Avg page size: 12KB
  • Avg no. of out-links: 10
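The size estimate spelled out (numbers from the slide):

```python
B = 250e6          # AltaVista's reported index size
fraction = 0.35    # |B ∩ S| / |S|: share of walk samples found in AltaVista
T = B / fraction   # estimated total Web size
print(f"T ≈ {T / 1e6:.0f}M pages")   # ≈ 714M, quoted as ~720M on the slide
```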

27
WebWalker results
  • Pages by domain
  • .com: 49%
  • .edu: 8%
  • .org: 7%
  • .net: 6%
  • .de: 4%
  • .jp: 3%
  • .uk: 3%
  • .gov: 2%

28
What About Other Web Pages?
  • Pages that are
  • Available within corporate Intranet
  • Protected by authentication
  • Not reachable through following links
  • E.g., pages within e-commerce sites
  • Deep Web vs Hidden Web
  • Information reachable through search interface
  • What if a page is reachable both through links
    and search interface?

29
Size of Deep Web?
  • Estimation
  • (Avg no. of records per site) × (Total no. of Deep
    Web sites)
  • How to estimate?
  • By sampling

30
Size of Deep Web?
  • Total no. of Deep Web sites
  • |B ∩ S| / |S| sampling, as before
  • Avg no. of records per site
  • Contact the site directly
  • Use an all-matching query such as "NOT zzxxyyxx", if
    the site reports its number of matches

31
Size of Deep Web
  • BrightPlanet report
  • Avg no. of records per site: 5 million
  • Total no. of Deep Web sites: 200,000
  • Avg size of a record: 14KB
  • Size of the Deep Web: ~10^16 bytes (10 petabytes)
  • 1000× larger than the Surface Web
  • How to access it?
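Multiplying the report's numbers out:

```python
records_per_site = 5e6      # average records per Deep Web site
deep_web_sites = 200_000    # estimated number of Deep Web sites
bytes_per_record = 14e3     # average record size: 14KB

total_bytes = records_per_site * deep_web_sites * bytes_per_record
print(f"{total_bytes:.1e} bytes")   # 1.4e+16, on the order of 10 petabytes
```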

32
Web Characteristics
  • Size of the Web
  • Search engines
  • Link structure of the Web

33
Search Engines
  • Coverage
  • Overlap
  • Dead links
  • Indexing delay

34
Coverage?
  • Q: How do we estimate coverage?
  • A: Create a random sample and measure how many of
    the pages are indexed by a search engine
  • In 1999
  • Estimated Web size: 800M
  • Reported indexed pages: 128M (Northern Light) →
    ~16% coverage
  • No reliable Web size estimate exists at this point
  • Google claims an 8B-page index
  • Yahoo claims a 20B-page index

35
Overlap?
  • How many pages are commonly indexed?
  • Method 1
  • Create a random sample and measure how many pages are
    indexed only by A, only by B, and by both A and B
  • Method 2
  • Send common queries to both engines, compare the
    returned pages, and measure the overlap
  • Result from method 2: little overlap
  • E.g., Infoseek and AltaVista: ~20% overlap
    [Bharat and Broder 1997]
  • Is it still true?
  • Results seem to converge
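A sketch of method 2 (the `top_urls_a` / `top_urls_b` callables are hypothetical stand-ins for querying each engine; real URLs would need normalization before comparison):

```python
def overlap(queries, top_urls_a, top_urls_b, k=100):
    """Fraction of engine A's results that engine B also returns."""
    total, common = 0, 0
    for q in queries:
        a = set(top_urls_a(q, k))   # top-k result URLs from engine A
        b = set(top_urls_b(q, k))   # top-k result URLs from engine B
        total += len(a)
        common += len(a & b)
    return common / total if total else 0.0
```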

36
Dead Links?
  • Q: How can we measure what fraction of pages in
    search engines are dead?
  • A: Issue random queries and check whether the
    returned pages are dead
  • Results in Feb 2000
  • AltaVista: 13.7%
  • Excite: 8.7%
  • Google: 4.3%
  • Search engines have since gotten much better, thanks
    to better recrawling algorithms
  • A topic for later study

37
How Early Do Pages Get Indexed?
  • Method 1
  • Create pages at random locations
  • Check when they become available in search engines
  • Con: difficult to create pages at random locations
  • Method 2
  • Repeatedly issue the same queries over time
  • When a new page appears in the results, record its
    last-modified date
  • Con: the last-modified date gives only a lower bound
    on the delay

38
How Early Are Pages Indexed?
  • Mean indexing time [Lawrence and Giles 2000]
  • Northern Light: 141 days
  • AltaVista: 166 days
  • HotBot: 192 days

39
How Stable Are the Sites?
  • Monitor a set of random sites
  • Percentage of Web servers available over time
    (similar results for other years)

40
Web Characteristics
  • Size of the Web
  • Search engines
  • Link structure of the Web

41
Web As A Graph
  • Page → Node
  • Link → Edge

42
Link Degree
  • How many links?
  • In-degree

Power law: why is the exponent consistently 2.1?
43
Link Degree
  • Out-degree

44
Large-Scale Structure?
  • Study by AltaVista and IBM, 1999
  • Based on 200M pages downloaded by the AltaVista
    crawler
  • Bow-tie result, based on two experiments

45
Experiment 1: Strongly Connected Components
  • Strongly connected component (SCC)
  • C is a strongly connected component if ∀ a, b ∈ C,
    there are paths from a → b and from b → a

[Figure: two small graphs over nodes a, b, c; in one every pair is mutually reachable (SCC? Yes), in the other it is not (SCC? No)]
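At Web scale this requires a linear-time SCC algorithm (e.g., Tarjan's or Kosaraju's). As a toy illustration of the definition, using networkx (the two example graphs are assumptions, chosen to mirror the Yes/No figure):

```python
import networkx as nx

# A cycle a -> b -> c -> a: every pair is mutually reachable, so one SCC.
yes = nx.DiGraph([("a", "b"), ("b", "c"), ("c", "a")])
# A chain a -> b -> c: no paths back, so three singleton SCCs.
no = nx.DiGraph([("a", "b"), ("b", "c")])

print(list(nx.strongly_connected_components(yes)))   # [{'a', 'b', 'c'}]
print(list(nx.strongly_connected_components(no)))    # three singleton sets
```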
46
Result 1: SCC
  • Identified all SCCs from the 200M pages
  • Biggest SCC: 50M pages (25%)
  • Other SCCs are small
  • Second largest 150K
  • Mostly fewer than 1000 nodes

47
Experiment 2: Reachability
  • How many pages can we reach starting from a
    random page?
  • Experiment
  • Pick 500 random pages
  • Follow links breadth-first until there are no
    more links
  • Repeat the same experiment following links in
    the reverse direction
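A sketch of one forward run of this experiment (BFS over out-links; running it on the edge-reversed graph gives the in-link version):

```python
from collections import deque

def reachable(start, out_links):
    """Pages reachable from `start`, following links breadth-first."""
    seen = {start}
    frontier = deque([start])
    while frontier:
        page = frontier.popleft()
        for nxt in out_links.get(page, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

# Running the same function on the edge-reversed graph gives the
# in-link (backward) version of the experiment.
print(len(reachable("a", {"a": ["b"], "b": ["c"]})))   # 3
```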

48
Result 2: Reachability
  • Out-links (forward direction)
  • 50% reach ~100M pages
  • 50% reach fewer than 1000

49
Result 2: Reachability
  • In-links (reverse direction)
  • 50% reach ~100M pages
  • 50% reach fewer than 1000

50
What Can We Conclude?
  • 50M (25%): the giant SCC

[Diagram: SCC (50M, 25%)]
51
What Can We Conclude?
  • How many nodes would we reach from the SCC?
  • Clearly more than 1000 (the SCC alone has 50M), so
    it must be ~100M
  • → 50M more pages reachable from the SCC (no way
    back, though)

[Diagram: SCC (50M, 25%)]
52
What Can We Conclude?
  • A similar result holds for in-links, when we follow
    links backwards
  • 50M more pages are reachable by following in-links

[Diagram: In (50M, 25%) → SCC (50M, 25%) → Out (50M, 25%)]
53
What Can We Conclude?
  • The remaining 25%: miscellaneous pages

[Diagram: bow-tie with In (50M, 25%), SCC (50M, 25%), Out (50M, 25%), and Misc (50M, 25%)]
54
Questions
  • How did they crawl the 50M In and 50M Misc nodes in
    the first place?
  • There may be many more In and Misc nodes that were
    not crawled (25% is a lower bound)
  • Only 25% in the SCC is surprising (will be explained)

55
SCC
  • If there are only two links, A → B and B → A,
    then A and B become one SCC.

[Figure: nodes A and B with links in both directions]
56
Links between In, SCC and Out
  • Not a single link from SCC back to In
  • Not a single link from Out back to SCC
  • At least 50% of the Web is unknown to the core SCC

57
Diameter of SCC
  • On average, 16 links between two nodes in SCC
  • The maximum distance (diameter) is at least 28

58
Questions?
59
More Sources For Web Characteristics
  • OCLC (Online Computer Library Center)
  • http://wcp.oclc.org
  • Netcraft Survey
  • http://www.netcraft.com/survey/
  • NEC Web Analysis
  • http://www.webmetrics.com

60
How To Sample?
  • Method 1: Run a full walk, take the last page, and repeat
  • Many wasted visits

61
How To Sample?
  • Method 2: Take the last k pages of a single walk
  • Are they random samples?

62
How To Sample?
  • Theorem: If k is large enough, they are
    approximately random pages
  • Intuition: if we visit many pages, we visit
    mostly different pages

63
How To Sample?
  • Goal: Estimate A/N by m/k, i.e., make
    |m/k − A/N| ≤ δ · (A/N) with probability at least
    1 − γ, if k is large enough

[Figure: of N total nodes, the target set A is a subset; of the k sampled pages, m fall inside A]
64
How To Sample?
  • Assuming
  • A is 20% of the Web
  • δ = 0.1 (less than 10% error)
  • γ = 0.01 (99% confidence)
  • ε = 10^-5 (the value from the 1996 Web crawl)
  • → k ≈ 350,000,000 steps
  • of which only ~12,000 are non-self-loop
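Where the ~12,000 figure comes from, assuming the average out-degree of ~10 estimated earlier:

```python
k = 350_000_000        # walk length required by the theorem
avg_out_degree = 10    # average number of real links per page
d = 300_000            # degree of the regularized graph

real_steps = k * avg_out_degree / d   # fraction deg/d of steps are real moves
print(f"{real_steps:,.0f} non-self-loop steps")   # ~11,667, about 12,000
```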