Web Structure and Accessibility of Information

1
Web Structure and Accessibility of Information

CS525: Databases & the Web
Diane Kramer
September 23, 2004

2
Introduction

Papers on Topic
1. Graph Structure in the Web
Andrei Broder et al, WWW, 2000
2. Accessibility of Information on the Web
Steve Lawrence, C. Lee Giles, Nature, 1999
Topic
What do these papers have in common?

3
This Presentation

Accessibility Paper interesting, but less
technical
Aimed at wider audience, more easily understood
Many technical details from Web Structure paper
Comprehensive review of concepts and findings
Not enough time to do both
Focus on Paper 1

4
Paper 1 Discussion Outline

Purpose
Terminology
Related Work
Data
Experiments
Results
Interpretation

5
Purpose

What's the paper about?
Consider the Web as a directed graph
What do the nodes represent? The arcs?
What about connectivity?
Why is this interesting?
How/where is it applicable?

6
Terminology

Directed graph
Nodes, Arcs, Path, Distance
Out-degree, In-degree
Undirected graph: how does it differ?
Components
Strongly vs. Weakly connected
Breadth-first search
Diameter of graph
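Presenter's note: in-degree and out-degree can be made concrete with a few lines of Python. This is only a minimal sketch on a made-up four-arc graph; the function name `degrees` and the toy edge list are illustrative assumptions, not anything from the paper.

```python
from collections import Counter

def degrees(edges):
    """Return (in_degree, out_degree) Counters for directed (u, v) arcs."""
    out_deg = Counter(u for u, _ in edges)  # arcs leaving each node
    in_deg = Counter(v for _, v in edges)   # arcs arriving at each node
    return in_deg, out_deg

# Hypothetical toy graph: each pair (u, v) is a hyperlink from page u to page v.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")]
in_deg, out_deg = degrees(edges)
```

Here page "c" has in-degree 2 (linked from "a" and "b") and page "a" has out-degree 2.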

7
Power Law Distributions

Definitions
From Wikipedia (http://en.wikipedia.org)
Relationship between two scalar quantities x and y
such that the relationship can be written as
$y = ax^k$, where a and k are constants.
From the paper
Defined on positive integers, with the probability of
the value i being proportional to $1/i^k$
What does it mean?
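Presenter's note: one way to get a feel for the paper's definition is to normalize the weights $1/i^k$ into probabilities. The sketch below assumes the distribution is truncated at a hypothetical maximum value `n_max` so the sum is finite; the function name and the choice of `n_max` are illustrative only.

```python
def power_law_pmf(k, n_max):
    """Probabilities proportional to 1/i**k for i = 1..n_max (truncated)."""
    weights = [i ** -k for i in range(1, n_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]

# Exponent 2.1 is the in-degree exponent reported later in the talk.
pmf = power_law_pmf(2.1, 1000)
ratio = pmf[0] / pmf[1]  # value 1 is 2**2.1 times as likely as value 2
```

Whatever the truncation point, doubling the value divides the probability by 2^k, which is the sense in which large values are rare and small ones common.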

8
Power Law Continued

Large is rare, small is common
E.g., individual wealth (80/20 rule)
Similar to Zipf's law
Zipf distribution is a function of ranks, rather
than magnitudes
Some other examples
AOL Users to Sites
http://www.hpl.hp.com/research/idl/papers/ranking/ranking.html
Website Popularity
http://www.useit.com/alertbox/zipf.html

9
Related Work

Observation of power law distributions
Access statistics for web sites, for particular
pages within a web site
Distribution of degrees on the web graph
Applying graph theoretic methods
Search, retrieval and mining problems
Context and content
Meaning found in connected components
Understanding domain structure, taxonomy

10
Data

2 AltaVista crawls (May, Oct. 1999)
200 million (M) pages
1.5 billion (G) links
5 x larger than previous biggest study
Kumar et al. used pruned data set from 1997
containing 40 M pages
Why does size matter?

11
Experimental Setup

Connectivity Server 2 (CS2)
Built at Compaq Systems Research Center
Takes a Web crawl as input
Creates a Web graph as output
Stores data in DB: URLs, in-links, out-links
Extended with URLs referenced ≥ 5 times

12
More Experimental Setup

CS2 System Continued
Data compression reduces required storage for URLs & links
High performance w/ enough RAM to store entire DB
in memory
Database statistics
203 M URLs, 1.466 G links (May, '99)
Fit in 9.5 GB of storage
271 M URLs, 2.13 G links (Oct. '99)

13
AltaVista Data

Crawl based on large set of starting points
accumulated over time with rules
Avoid overloading servers & robot traps
Avoid/detect spam, deal w/ timeouts
Evolve index
Remove duplicates, dead links, etc.,
Add new pages, updated pages, etc.
DB contains superset of all pages in index at one
point in time

14
Algorithms & Experiments

Check in/out degrees
Verify power law distribution
WCC, SCC
Find weak/strong connected components
Random-start BFS
Run from 570 randomly chosen nodes
Forward and backward

15
Experimental Results

The heart of the matter

16
Degree Distribution

Verify earlier observations of power law
Results from both May and Oct. crawls show
agreement
In-degree exponent 2.1
Out-degree exponent 2.72
Deviation on out-degree distribution
Initial segment shows Poisson distribution

17
In/Out Degree Distribution

Figure: in- and out-degree distributions from May and
October, 1999

18
Weak Components

Treat Web graph as undirected
Find weakly connected components
Giant component of 186 M nodes
91% of nodes reachable from one another
Connectivity highly resilient
Remove links to pages w/ in-degree ≥ 5
Giant component still contains 59 M nodes

19
WCC Distribution & Resilience

Figure: WCC size distribution follows a power law (exponent 2.5)
Table: size (M) of largest surviving weak component when
links to pages w/ in-degree ≥ k are removed

20
Strong Components

Single large SCC: about 56 M pages
Only 28% of all pages in crawl
All other SCCs significantly smaller size
SCC distribution also obeys power law
Exponent 2.5
Same as for WCC

21
SCC vs. WCC Size Distribution

Figure: both SCC and WCC show power law distribution, same
exponent 2.5

22
Random-start BFS

To study issues of diameter & average distance
Traversals exhibit bimodal behavior
Die out after reaching small set of nodes
< 90 nodes, 90% of the time
Explode to cover about 100 M nodes
But never entire 186 M
Sometimes both forward & backward
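Presenter's note: the random-start experiment can be sketched as follows: pick start nodes at random, run BFS along out-links (forward) and along in-links (backward), and record how many nodes each traversal reaches. The function names, the seed parameter and the toy graph are assumptions for illustration; the paper used 570 start nodes on the real crawl.

```python
import random
from collections import deque

def reach_size(adj, start):
    """Number of nodes reached by BFS from start, counting start itself."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen)

def random_start_bfs(edges, num_starts, seed=0):
    """Return (start, forward_reach, backward_reach) for random start nodes."""
    fwd, bwd, nodes = {}, {}, set()
    for u, v in edges:
        fwd.setdefault(u, []).append(v)  # follow out-links
        bwd.setdefault(v, []).append(u)  # follow in-links
        nodes.update((u, v))
    rng = random.Random(seed)
    starts = rng.sample(sorted(nodes), min(num_starts, len(nodes)))
    return [(s, reach_size(fwd, s), reach_size(bwd, s)) for s in starts]

# Toy graph: "in" feeds a two-node cycle, which feeds "out".
toy = [("in", "s1"), ("s1", "s2"), ("s2", "s1"), ("s2", "out")]
runs = random_start_bfs(toy, num_starts=4)
```

Even on this tiny graph the asymmetry is visible: a start node upstream of the cycle reaches everything going forward and nothing going backward, and the reverse holds downstream.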

23
Cumulative Dist. of BFS Runs

Figure: explosion in 50% of start nodes for in- and
out-links, 90% of nodes for undirected

24
Zipf Distributions

Definition from Wikipedia
Originally the term Zipf's law meant the
observation of Harvard linguist George Kingsley
Zipf that the frequency of use of the
nth-most-frequently-used word in any natural
language is approximately inversely proportional
to n. The term Zipf's law has consequently come
to be used to refer to frequency distributions of
"rank data" in which the relative frequency of
the nth-ranked item is given by the Zeta
distribution, $1/(n^s \zeta(s))$.
What does it mean?
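Presenter's note: the difference between the power-law view (magnitudes) and the Zipf view (ranks) is easiest to see by building both from the same list of in-degrees. The two helper names and the sample list below are made up for illustration.

```python
from collections import Counter

def magnitude_view(degrees):
    """(degree, number of nodes with that degree), sorted by degree."""
    return sorted(Counter(degrees).items())

def rank_view(degrees):
    """(rank, degree), where rank 1 is the node with the largest degree."""
    return list(enumerate(sorted(degrees, reverse=True), start=1))

# Hypothetical in-degrees of six pages.
in_degrees = [1, 1, 1, 2, 2, 5]
by_magnitude = magnitude_view(in_degrees)
by_rank = rank_view(in_degrees)
```

Plotting `by_magnitude` on log-log axes is the power-law plot; plotting `by_rank` on log-log axes is the Zipf plot.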

25
Power Law vs. Zipf

In-degree distribution has better fit with Zipf
than with power law
Figure: in-degrees plotted against both ranks and
magnitudes

26
Interpretation

Results of connected component experiments with
random-start BFS
186 M nodes in giant weak component
56 M nodes in strong component
Use BFS runs to estimate positions of remaining
nodes
IN, OUT, TENDRILS, DISCONNECTED

27
Connectivity of the Web

28
SCC, IN and OUT

Directed path from each node of IN to all nodes
in SCC
Every BFS start node in SCC reaches 100 M nodes
under in-link expansion
Directed path from any node in SCC to every node
in OUT
Every node in SCC reaches 100 M nodes under
out-link expansion
SCC + IN ≈ SCC + OUT ≈ 100 M
IN and OUT therefore ≈ 44 M nodes each
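Presenter's note: the definitions on this slide translate directly into set operations. Starting from any node known to be in the SCC, the forward-reachable set is SCC plus OUT and the backward-reachable set is SCC plus IN, which is also why roughly 100 M minus 56 M leaves about 44 M for each side. The sketch below assumes a seed node inside the SCC is already known; names and the toy graph are hypothetical.

```python
from collections import deque

def reach(adj, start):
    """Set of nodes reachable from start by BFS over adjacency dict adj."""
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in adj.get(queue.popleft(), ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def bow_tie(edges, scc_seed):
    """Classify nodes into SCC, IN and OUT relative to a seed node in the SCC."""
    fwd, bwd = {}, {}
    for u, v in edges:
        fwd.setdefault(u, []).append(v)
        bwd.setdefault(v, []).append(u)
    forward = reach(fwd, scc_seed)    # SCC plus OUT
    backward = reach(bwd, scc_seed)   # SCC plus IN
    scc = forward & backward
    return {"SCC": scc, "IN": backward - scc, "OUT": forward - scc}

# Toy graph: "t" hangs off IN without reaching the SCC (a tendril).
toy = [("in", "s1"), ("s1", "s2"), ("s2", "s1"), ("s2", "out"), ("in", "t")]
parts = bow_tie(toy, "s1")
```

Nodes that land in none of the three sets, like "t" here, are the tendrils and disconnected pieces.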

29
Disconnected Set & Tendrils

Computing Disconnected set of nodes
Total nodes in crawl ≈ 203.5 M
Giant WCC ≈ 186.7 M nodes
Total − WCC ≈ 16.8 M nodes in disc. set
Tendrils
WCC − SCC − IN − OUT = Tendrils
≈ 44 M nodes
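Presenter's note: the subtraction can be checked with the rounded figures quoted in these slides (all in millions of nodes). The variable names are illustrative. With these rounded inputs the tendril estimate comes out a little below the slide's figure, presumably because SCC, IN and OUT were each rounded.

```python
# Rounded figures from the slides, in millions of nodes.
total_nodes = 203.5
giant_wcc = 186.7
scc, in_set, out_set = 56.0, 44.0, 44.0

disconnected = total_nodes - giant_wcc             # about 16.8
tendrils = giant_wcc - scc - in_set - out_set      # about 42.7 with these inputs
```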

30
Size of Neighborhoods

Sample of 128 nodes (IN), 134 (OUT)
Exploring outwards: follow out-links from OUT,
and in-links from IN
Encounter 3093 nodes (OUT), 171 (IN)
Exploring inwards: follow in-links from OUT, and
out-links from IN
Encounter 3367 nodes (OUT), 173 (IN)
OUT encounters larger neighborhoods

31
Diameter of SCC
Depth at which BFS terminates in each direction

IN and OUT each contain a few long paths
The last path is much longer than any others
Conclusion
Directed diameter of SCC is at least 28

32
Compare SCC w/ IN, OUT

Average depths following in/out links
From SCC (in) From IN 482
From SCC (out) From OUT 434
Probability of path from node u to v
Non-negligible (24% to 28%) if and only if
u in SCC ∪ IN, v in SCC ∪ OUT

33
Average Connected Distance

Avg undirected connected distance: 6.83
Contrast Albert et al. ('99) prediction of
average distance of 19
Over 75% of the time, no directed path from
random start to finish node
When there is a path, avg distance ≈ 16
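Presenter's note: "average connected distance" means the average shortest-path length taken only over ordered pairs (u, v) for which a path exists. The sketch below computes that average and the fraction of connected pairs exhaustively on a toy graph; the paper estimated these quantities from sampled BFS runs, and all names here are hypothetical.

```python
from collections import deque

def bfs_distances(adj, start):
    """Shortest-path length (in arcs) from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def average_connected_distance(nodes, adj):
    """Return (average distance over connected pairs, fraction of pairs connected)."""
    total = pairs = 0
    for u in nodes:
        for v, d in bfs_distances(adj, u).items():
            if v != u:
                total += d
                pairs += 1
    possible = len(nodes) * (len(nodes) - 1)
    if pairs == 0:
        return None, 0.0
    return total / pairs, pairs / possible

# Toy directed path a -> b -> c: only 3 of the 6 ordered pairs are connected.
avg, fraction = average_connected_distance(["a", "b", "c"], {"a": ["b"], "b": ["c"]})
```

Passing an adjacency dict that contains each arc in both directions gives the undirected version of the same measure.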

34
Conclusions

Using larger data set produces more accurate
results than in the past
Undirected graph of web shows more connectedness
than directed
Although SCC ∪ IN ∪ OUT comprises 70% of total, given
2 random web pages, can only get from one to
other 25% of the time

35
Feedback