1
CS246
  • Web Characteristics

2
Web Characteristics
  • What is the Web like?
  • Any questions on some of the characteristics
    and/or properties of the Web?

3
Web Characteristics
  • Size of the Web
  • Search engine coverage
  • Link structure of the Web

4
How Many Web Sites?
  • Polling every IP
  • 2^32 ≈ 4B IPs, 10 sec/IP, 1,000 simultaneous
    connections
  • 2^32 × 10 / (1,000 × 24 × 60 × 60) ≈ 460 days
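As a quick sanity check of this arithmetic, a minimal sketch in Python using the slide's own numbers:

```python
total_ips = 4_000_000_000    # 2^32 rounded to ~4B, as on the slide
sec_per_ip = 10              # time budget per IP
connections = 1_000          # simultaneous connections

days = total_ips * sec_per_ip / connections / (24 * 60 * 60)
print(f"{days:.0f} days")    # ~463, i.e. roughly 460 days
```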

5
How Many Web Sites?
  • Sampling based

6
How Many Web Sites?
  • Select S random IPs
  • Send HTTP requests to port 80 at the selected IPs
  • Count valid replies (HTTP 200 OK): V
  • Estimate the total: T = 2^32 × V / S
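A minimal sketch of this estimator (assumptions: the sample size S = 10,000 is arbitrary, and a real survey would exclude reserved/private address ranges and be far more careful about timeouts and retries):

```python
import random
import socket

def replies_ok(ip: str, timeout: float = 5.0) -> bool:
    """True if port 80 at `ip` answers an HTTP request with 200 OK."""
    try:
        with socket.create_connection((ip, 80), timeout=timeout) as s:
            s.sendall(b"HEAD / HTTP/1.0\r\n\r\n")
            return b" 200 " in s.recv(32)    # e.g. b"HTTP/1.0 200 OK"
    except OSError:
        return False

S = 10_000                                   # number of random IPs to probe
sample = [".".join(str(random.randrange(256)) for _ in range(4))
          for _ in range(S)]
V = sum(replies_ok(ip) for ip in sample)     # valid replies
T = 2**32 * V / S                            # scale up to the full IP space
print(f"V = {V}, estimated sites T = {T:,.0f}")
```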

7
How Many Web Sites?
  • OCLC (Online Computer Library Center) results
  • http://wcp.oclc.org
  • Total number of available IPs: 2^32 ≈ 4.2 billion
  • Growth (in terms of sites) has slowed down

8
Issues
  • Multi-hosted servers
  • cnn.com → 207.25.71.5, 207.25.71.20, ...
  • Select the lowest IP address. For each sampled IP:
  • Look up its domain name
  • Resolve the name back to IPs
  • Is our sampled IP the lowest one?
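One plausible coding of the lowest-IP rule (a sketch; `socket.gethostbyaddr` and `socket.gethostbyname_ex` do the reverse and forward lookups, and in practice many IPs have no usable reverse DNS):

```python
import socket

def counts_here(sampled_ip: str) -> bool:
    """Lowest-IP rule: count a multi-hosted server only at its lowest IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(sampled_ip)   # IP -> name
        _, _, ips = socket.gethostbyname_ex(hostname)       # name -> all IPs
    except OSError:
        return True   # no DNS data, so nothing to deduplicate against
    as_tuple = lambda ip: tuple(int(part) for part in ip.split("."))
    # Keep this sample only if it is the numerically lowest of the host's IPs.
    return as_tuple(sampled_ip) == min(map(as_tuple, ips))
```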

9
Issues
  • Virtual hosting
  • Multiple sites on the same IP
  • Find the average number of hosted sites per IP
  • 7.4M sites on 3.4M IPs, found by polling all
    available site names [Netcraft, 2000]
  • Other ports?
  • Temporarily unavailable sites?

10
Where Are They Located?
11
What Language?
(Based on Web sites)
12
Questions?
13
How Many Web Pages?
  • Sampling-based?
  • Problem: the number of possible URLs is infinite

14
How Many Web Pages?
  • Solution 1
  • Estimate the average number of pages per site:
    total pages = (average no. of pages per site) × (total no. of sites)
  • Algorithm
  • For each site with valid reply, download all
    pages
  • Take average
  • Result [LG99]
  • 289 pages per site × 2.8M sites
  • ≈ 800M pages

15
Issues
  • A small number of sites with TONS of pages
  • Very likely to miss these sites
  • Lots of samples necessary

16
How Many Pages?
  • Solution 2: Sampling-based

17
Related Question
  • How many deer in Yosemite National Park?

18
Random Page?
  • Idea: Random walk
  • Start from the Yahoo home page
  • Follow random links, say, 10,000 times
  • Select the page we end up on
  • Problem
  • Biased toward popular pages, e.g., Microsoft, Google

19
Random Page?
  • Random walks on a regular, undirected graph →
    uniform random sample
  • Regular graph: an equal number of edges at all nodes
  • After (log N)/ε steps, the walk's position is an
    approximately uniform random node
  • ε depends on the graph structure
  • N: the number of nodes
  • Idea
  • Transform the Web graph into a regular, undirected
    graph
  • Perform a random walk on it

20
Ideal Random Walk
  • Generate the regular, undirected graph
  • Make all edges undirected
  • Decide d, the maximum number of edges per page, say
    300,000
  • If edges(n) < 300,000, then add self-loops to n
  • Perform random walks on the graph
  • ε ≈ 10^-5 for the 1996 Web, N ≈ 10^9
  • → 3,000,000 steps, but mostly self-loops
  • Only ~100 steps follow actual links
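The self-loops never need to be stored explicitly; one step of the ideal walk can be sketched as follows (assuming, unrealistically, that each node's full undirected edge list is known):

```python
import random

D = 300_000   # the chosen maximum degree d

def ideal_step(node, real_edges):
    """One step on the regularized graph, with implicit self-loops.

    real_edges[node] holds the node's undirected Web edges; the remaining
    D - len(real_edges[node]) slots are self-loops, so the walk moves with
    probability len(real_edges[node]) / D and otherwise stays put.
    """
    edges = real_edges[node]
    if random.randrange(D) < len(edges):
        return random.choice(edges)   # follow a real edge
    return node                       # self-loop: stay at the node
```

With ~10 real edges per page on average, only about 10/300,000 of the steps actually move, which is where the ~100 real moves out of 3,000,000 steps come from.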

21
Different Interpretation
  • Random walk on the irregular Web graph
  • High chance of being at a popular node at any
    particular time
  • Self-loops increase the chance of being at an unpopular
    node by making the walk stay there longer

[Figure: a popular node with many real edges vs. unpopular nodes padded with self-loops]
22
Issues
  • How do we get the edges to/from node n?
  • Edges discovered so far
  • From search engines, like AltaVista, HotBot
  • Incoming links are still only partially known

23
WebWalker [BBCF00]
  • Our graph does not have to be the same as the
    real Web
  • Construct the regular undirected graph while
    performing the random walk
  • Add node n when the walk first visits n
  • Find the edges for node n at that time
  • Edges discovered so far
  • From search engines
  • Add self-loops as necessary
  • Ignore any more edges to n later
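A sketch of the WebWalker step under these rules. `get_out_links` and `search_engine_in_links` are hypothetical stubs standing in for the page fetch and the search-engine link queries:

```python
import random

D = 300_000    # fixed degree; unused edge slots act as self-loops
frozen = {}    # node -> edge list, fixed at first visit and never updated

def get_out_links(n):             # hypothetical stub: fetch page n, parse links
    return []

def search_engine_in_links(n):    # hypothetical stub: ask a search engine
    return []                     # for known links pointing at n

def edges_of(n):
    """Freeze n's edge list the first time the walk visits n."""
    if n not in frozen:
        known = set(get_out_links(n)) | set(search_engine_in_links(n))
        frozen[n] = list(known)[:D]   # edges to n discovered later are ignored
    return frozen[n]

def webwalker_step(n):
    edges = edges_of(n)
    if random.randrange(D) < len(edges):   # move with probability deg(n) / D
        return random.choice(edges)
    return n                               # otherwise an implicit self-loop
```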

24
WebWalker
  • d = 5

[Figure: example graph for WebWalker with d = 5; nodes with 1, 2, 2, 3, and 1 real edges are padded with self-loops up to degree 5]
25
WebWalker
  • Why ignore new incoming edges?
  • To keep the graph regular: the discovered parts of
    the graph never change
  • The uniformity theorem still holds
  • Can we arrive at all reachable pages?
  • Yes, since we ignore only the edges to already-visited nodes
  • Can we use the same ε?
  • No

26
WebWalker results
  • Size of the Web
  • AltaVista index size: B = 250M
  • Fraction of walk samples indexed: |B ∩ S| / |S| ≈ 35%
  • T = 250M / 0.35 ≈ 720M pages
  • Avg page size: 12KB
  • Avg no. of out-links: 10
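The size estimate spelled out (numbers from the slide):

```python
B = 250e6          # AltaVista's reported index size
fraction = 0.35    # |B ∩ S| / |S|: share of walk samples found in AltaVista
T = B / fraction   # estimated total Web size
print(f"T ≈ {T / 1e6:.0f}M pages")   # ≈ 714M, quoted as ~720M on the slide
```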

27
WebWalker results
  • Pages by domain
  • .com: 49%
  • .edu: 8%
  • .org: 7%
  • .net: 6%
  • .de: 4%
  • .jp: 3%
  • .uk: 3%
  • .gov: 2%

28
What About Other Web Pages?
  • Pages that are
  • Available within corporate Intranet
  • Protected by authentication
  • Not reachable through following links
  • E.g., pages within e-commerce sites
  • Deep Web vs Hidden Web
  • Information reachable through search interface
  • What if a page is reachable both through links
    and search interface?

29
Size of Deep Web?
  • Estimation
  • (Avg no. of records per site) × (Total no. of Deep
    Web sites)
  • How to estimate?
  • By sampling

30
Size of Deep Web?
  • Total no. of Deep Web sites
  • |B ∩ S| / |S| sampling, as before
  • Avg no. of records per site
  • Contact the site directly
  • Use an all-matching query such as "NOT zzxxyyxx", if
    the site reports its number of matches

31
Size of Deep Web
  • BrightPlanet report
  • Avg no. of records per site: 5 million
  • Total no. of Deep Web sites: 200,000
  • Avg size of a record: 14KB
  • Size of the Deep Web: ~10^16 bytes (10 petabytes)
  • 1000× larger than the Surface Web
  • How to access it?
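Multiplying the report's numbers out:

```python
records_per_site = 5e6      # average records per Deep Web site
deep_web_sites = 200_000    # estimated number of Deep Web sites
bytes_per_record = 14e3     # average record size: 14KB

total_bytes = records_per_site * deep_web_sites * bytes_per_record
print(f"{total_bytes:.1e} bytes")   # 1.4e+16, on the order of 10 petabytes
```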

32
Web Characteristics
  • Size of the Web
  • Search engines
  • Link structure of the Web

33
Search Engines
  • Coverage
  • Overlap
  • Dead links
  • Indexing delay

34
Coverage?
  • Q: How do we estimate coverage?
  • A: Create a random sample and measure how many of
    the pages are indexed by a search engine
  • In 1999
  • Estimated Web size: 800M
  • Reported indexed pages: 128M (Northern Light) →
    ~16% coverage
  • No reliable Web size estimate exists at this point
  • Google claims an 8B-page index
  • Yahoo claims a 20B-page index

35
Overlap?
  • How many pages are commonly indexed?
  • Method 1
  • Create a random sample and measure how many pages are
    indexed only by A, only by B, and by both A and B
  • Method 2
  • Send common queries to both engines, compare the
    returned pages, and measure the overlap
  • Result from method 2: little overlap
  • E.g., Infoseek and AltaVista: ~20% overlap
    [Bharat and Broder 1997]
  • Is it still true?
  • Results seem to converge
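A sketch of method 2 (the `top_urls_a` / `top_urls_b` callables are hypothetical stand-ins for querying each engine; real URLs would need normalization before comparison):

```python
def overlap(queries, top_urls_a, top_urls_b, k=100):
    """Fraction of engine A's results that engine B also returns."""
    total, common = 0, 0
    for q in queries:
        a = set(top_urls_a(q, k))   # top-k result URLs from engine A
        b = set(top_urls_b(q, k))   # top-k result URLs from engine B
        total += len(a)
        common += len(a & b)
    return common / total if total else 0.0
```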

36
Dead Links?
  • Q: How can we measure what fraction of pages in
    search engines are dead?
  • A: Issue random queries and check whether the
    returned pages are dead
  • Results in Feb 2000
  • AltaVista: 13.7%
  • Excite: 8.7%
  • Google: 4.3%
  • Search engines have since gotten much better, thanks
    to better recrawling algorithms
  • A topic for later study

37
How Early Do Pages Get Indexed?
  • Method 1
  • Create pages at random locations
  • Check when they become available in search engines
  • Con: difficult to create pages at random locations
  • Method 2
  • Repeatedly issue the same queries over time
  • When a new page appears in the results, record its
    last-modified date
  • Con: the last-modified date gives only a lower bound
    on the delay

38
How Early Are Pages Indexed?
  • Mean indexing time [Lawrence and Giles 2000]
  • Northern Light: 141 days
  • AltaVista: 166 days
  • HotBot: 192 days

39
How Stable Are the Sites?
  • Monitor a set of random sites
  • Percentage of Web servers available over time
    (similar results for other years)

40
Web Characteristics
  • Size of the Web
  • Search engines
  • Link structure of the Web

41
Web As A Graph
  • Page → Node
  • Link → Edge

42
Link Degree
  • How many links?
  • In-degree

Power law: why is the exponent consistently 2.1?
43
Link Degree
  • Out-degree

44
Large-Scale Structure?
  • Study by AltaVista and IBM, 1999
  • Based on 200M pages downloaded by the AltaVista
    crawler
  • Bow-tie result, based on two experiments

45
Experiment 1: Strongly Connected Components
  • Strongly connected component (SCC)
  • C is a strongly connected component if ∀ a, b ∈ C,
    there are paths from a → b and from b → a

[Figure: two small graphs over nodes a, b, c; in one every pair is mutually reachable (SCC? Yes), in the other it is not (SCC? No)]
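At Web scale this requires a linear-time SCC algorithm (e.g., Tarjan's or Kosaraju's). As a toy illustration of the definition, using networkx (the two example graphs are assumptions, chosen to mirror the Yes/No figure):

```python
import networkx as nx

# A cycle a -> b -> c -> a: every pair is mutually reachable, so one SCC.
yes = nx.DiGraph([("a", "b"), ("b", "c"), ("c", "a")])
# A chain a -> b -> c: no paths back, so three singleton SCCs.
no = nx.DiGraph([("a", "b"), ("b", "c")])

print(list(nx.strongly_connected_components(yes)))   # [{'a', 'b', 'c'}]
print(list(nx.strongly_connected_components(no)))    # three singleton sets
```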
46
Result 1: SCC
  • Identified all SCCs from the 200M pages
  • Biggest SCC: 50M pages (25%)
  • Other SCCs are small
  • Second largest 150K
  • Mostly fewer than 1000 nodes

47
Experiment 2: Reachability
  • How many pages can we reach starting from a
    random page?
  • Experiment
  • Pick 500 random pages
  • Follow links breadth-first until there are no
    more links
  • Repeat the same experiment following links in
    the reverse direction
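A sketch of one forward run of this experiment (BFS over out-links; running it on the edge-reversed graph gives the in-link version):

```python
from collections import deque

def reachable(start, out_links):
    """Pages reachable from `start`, following links breadth-first."""
    seen = {start}
    frontier = deque([start])
    while frontier:
        page = frontier.popleft()
        for nxt in out_links.get(page, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

# Running the same function on the edge-reversed graph gives the
# in-link (backward) version of the experiment.
print(len(reachable("a", {"a": ["b"], "b": ["c"]})))   # 3
```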

48
Result 2: Reachability
  • Out-links (forward direction)
  • 50% reach ~100M pages
  • 50% reach fewer than 1000

49
Result 2: Reachability
  • In-links (reverse direction)
  • 50% reach ~100M pages
  • 50% reach fewer than 1000

50
What Can We Conclude?
  • 50M (25%): the giant SCC

[Diagram: SCC (50M, 25%)]
51
What Can We Conclude?
  • How many nodes would we reach from the SCC?
  • Clearly more than 1000 (the SCC alone has 50M), so
    it must be ~100M
  • → 50M more pages reachable from the SCC (no way
    back, though)

[Diagram: SCC (50M, 25%)]
52
What Can We Conclude?
  • A similar result holds for in-links, when we follow
    links backwards
  • 50M more pages are reachable by following in-links

[Diagram: In (50M, 25%) → SCC (50M, 25%) → Out (50M, 25%)]
53
What Can We Conclude?
  • The remaining 25%: miscellaneous pages

[Diagram: bow-tie with In (50M, 25%), SCC (50M, 25%), Out (50M, 25%), and Misc (50M, 25%)]
54
Questions
  • How did they crawl the 50M In and 50M Misc nodes in
    the first place?
  • There may be many more In and Misc nodes that were
    not crawled (25% is a lower bound)
  • Only 25% in the SCC is surprising (will be explained)

55
SCC
  • If there are only two links, A → B and B → A,
    then A and B become one SCC.

[Figure: nodes A and B with links in both directions]
56
Links between In, SCC and Out
  • Not a single link from SCC back to In
  • Not a single link from Out back to SCC
  • At least 50% of the Web is unknown to the core SCC

57
Diameter of SCC
  • On average, 16 links between two nodes in SCC
  • The maximum distance (diameter) is at least 28

58
Questions?
59
More Sources For Web Characteristics
  • OCLC (Online Computer Library Center)
  • http://wcp.oclc.org
  • Netcraft Survey
  • http://www.netcraft.com/survey/
  • NEC Web Analysis
  • http://www.webmetrics.com

60
How To Sample?
  • Method 1: Run a full walk, take the last page, and repeat
  • Many wasted visits

61
How To Sample?
  • Method 2: Take the last k pages of a single walk
  • Are they random samples?

62
How To Sample?
  • Theorem: If k is large enough, they are
    approximately random pages
  • Intuition: if we visit many pages, we visit
    mostly different pages

63
How To Sample?
  • Goal: Estimate A/N by m/k, i.e., make
    |m/k − A/N| ≤ δ · (A/N) with probability at least
    1 − γ, if k is large enough

[Figure: of N total nodes, the target set A is a subset; of the k sampled pages, m fall inside A]
64
How To Sample?
  • Assuming
  • A is 20% of the Web
  • δ = 0.1 (less than 10% error)
  • γ = 0.01 (99% confidence)
  • ε = 10^-5 (the value from the 1996 Web crawl)
  • → k ≈ 350,000,000 steps
  • of which only ~12,000 are non-self-loop
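Where the ~12,000 figure comes from, assuming the average out-degree of ~10 estimated earlier:

```python
k = 350_000_000        # walk length required by the theorem
avg_out_degree = 10    # average number of real links per page
d = 300_000            # degree of the regularized graph

real_steps = k * avg_out_degree / d   # fraction deg/d of steps are real moves
print(f"{real_steps:,.0f} non-self-loop steps")   # ~11,667, about 12,000
```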