Title: Web Structure and Accessibility of Information
1Web Structure and Accessibility of Information
- CS525 Databases and the Web
- Diane Kramer
- September 23, 2004
2Introduction
- Papers on Topic
- 1. Graph Structure in the Web
- Andrei Broder et al., WWW, 2000
- 2. Accessibility of Information on the Web
- Steve Lawrence, C. Lee Giles, Nature, 1999
- Topic
- What do these papers have in common?
3This Presentation
- Accessibility paper interesting, but less technical
- Aimed at wider audience, more easily understood
- Many technical details from Web Structure paper
- Comprehensive review of concepts and findings
- Not enough time to do both
- Focus on Paper 1
4Paper 1 Discussion Outline
- Purpose
- Terminology
- Related Work
- Data
- Experiments
- Results
- Interpretation
5Purpose
- What's the paper about?
- Consider the Web as a directed graph
- What do the nodes represent? The arcs?
- What about connectivity?
- Why is this interesting?
- How/where is it applicable?
6Terminology
- Directed graph
- Nodes, Arcs, Path, Distance
- Out-degree, In-degree
- Undirected graph: how does it differ?
- Components
- Strongly vs. Weakly connected
- Breadth-first search
- Diameter of graph
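The terms above can be made concrete on a toy graph. This is a minimal sketch; the four-node graph and the helper names (`out_degree`, `in_degree`, `bfs_distances`) are made up for illustration and are not from the paper.

```python
from collections import deque

# Hypothetical toy Web graph: node -> list of nodes it links to (arcs).
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}

def out_degree(g, u):
    """Number of arcs leaving u."""
    return len(g.get(u, []))

def in_degree(g, v):
    """Number of nodes with an arc pointing at v (no repeated arcs in the toy graph)."""
    return sum(v in targets for targets in g.values())

def bfs_distances(g, start):
    """Directed distance (number of arcs on a shortest path) from start."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in g.get(u, []):
            if v not in dist:          # first visit gives the shortest distance
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist
```

Node "d" has no in-links, so a breadth-first search from "a" never reaches it, while a search from "d" reaches every node. That asymmetry is what separates directed from undirected connectivity.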
7Power Law Distributions
- Definitions
- From Wikipedia (http://en.wikipedia.org)
- Relationship between two scalar quantities x and y such that the relationship can be written as $y = ax^k$, where a and k are constants.
- From the paper
- Defined on positive integers, with the probability of the value i being proportional to $1/i^k$
- What does it mean?
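One way to see what the paper's definition means is to build the distribution numerically. The sketch below truncates the support at a hypothetical `n_max` so the weights can be normalized by a finite sum; `k = 2.1` is the in-degree exponent reported later in the slides.

```python
def power_law_pmf(k, n_max):
    """P(i) proportional to 1/i**k on i = 1..n_max (truncated for illustration)."""
    weights = [i ** -k for i in range(1, n_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]

pmf = power_law_pmf(k=2.1, n_max=1000)

# Doubling i divides the probability by the same factor 2**k wherever you start.
ratio_low = pmf[0] / pmf[1]     # P(1) / P(2)
ratio_high = pmf[9] / pmf[19]   # P(10) / P(20)
```

Both ratios equal 2**2.1, about 4.29. This scale-free behavior is why a power law appears as a straight line of slope -k on a log-log plot.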
8Power Law Continued
- Large is rare, small is common
- E.g., individual wealth (80/20 rule)
- Similar to Zipf's law
- Zipf distribution is a function of ranks, rather than magnitudes
- Some other examples
- AOL Users to Sites
- http://www.hpl.hp.com/research/idl/papers/ranking/ranking.html
- Website Popularity
- http://www.useit.com/alertbox/zipf.html
9Related Work
- Observation of power law distributions
- Access statistics for web sites, for particular pages within a web site
- Distribution of degrees on the web graph
- Applying graph theoretic methods
- Search, retrieval and mining problems
- Context and content
- Meaning found in connected components
- Understanding domain structure, taxonomy
10Data
- 2 AltaVista crawls (May, Oct. 1999)
- 200 million (M) pages
- 1.5 billion (G) links
- 5x larger than previous biggest study
- Kumar et al. used pruned data set from 1997 containing 40 M pages
- Why does size matter?
11Experimental Setup
- Connectivity Server 2 (CS2)
- Built at Compaq Systems Research Center
- Takes a Web crawl as input
- Creates a Web graph as output
- Stores data in DB: URLs, in-links, out-links
- Extended with URLs referenced ≥ 5 times
12More Experimental Setup
- CS2 System Continued
- Data compression, required storage for URLs and links
- High performance w/ enough RAM to store entire DB in memory
- Database statistics
- 203 M URLs, 1.466 G links (May, '99)
- Fit in 9.5 GB of storage
- 271 M URLs, 2.13 G links (Oct. '99)
13AltaVista Data
- Crawl based on large set of starting points accumulated over time with rules
- Avoid overloading servers and robot traps
- Avoid/detect spam, deal w/ timeouts
- Evolve index
- Remove duplicates, dead links, etc.,
- Add new pages, updated pages, etc.
- DB contains superset of all pages in index at one
point in time
14Algorithms and Experiments
- Check in/out degrees
- Verify power law distribution
- WCC, SCC
- Find weak/strong connected components
- Random-start BFS
- Run from 570 randomly chosen nodes
- Forward and backward
15Experimental Results
16Degree Distribution
- Verify earlier observations of power law
- Results from both May and Oct. crawls show agreement
- In-degree exponent: 2.1
- Out-degree exponent: 2.72
- Deviation on out-degree distribution
- Initial segment shows Poisson distribution
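A common way to read an exponent off a degree histogram is to fit a least-squares line on the log-log plot. The sketch below does that in plain Python; this fitting method is an assumption for illustration, not necessarily the procedure used in the paper, and the sample degree list is synthetic.

```python
import math
from collections import Counter

def fit_exponent(degrees):
    """Estimate k in count(d) ~ d**-k via least squares on (log d, log count)."""
    counts = Counter(d for d in degrees if d > 0)   # degree 0 has no logarithm
    xs = [math.log(d) for d in counts]
    ys = [math.log(c) for c in counts.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope   # the line has slope -k

# Synthetic data: exactly 10000 / d**2 pages of each degree d.
degrees = [d for d in (1, 2, 4, 5, 10) for _ in range(10000 // d ** 2)]
estimate = fit_exponent(degrees)
```

On real crawl data the low-degree head and the sparse tail both distort a naive fit. That matches the slide's note that the initial segment of the out-degree distribution deviates from the power law.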
17In/Out Degree Distribution
In- and out-degree distributions from May and
October, 1999
18Weak Components
- Treat Web graph as undirected
- Find weakly connected components
- Giant component of 186 M nodes
- 91% of nodes reachable from one another
- Connectivity highly resilient
- Remove links to pages w/ in-degree ≥ 5
- Giant component still contains 59 M nodes
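The weak-component experiment can be sketched with a union-find structure: ignore arc direction, merge the endpoints of every arc, then count component sizes. The function names and toy data below are hypothetical. The paper ran this on roughly 200 M nodes, which needs a far more compact representation than Python dictionaries.

```python
from collections import Counter

def weak_components(nodes, arcs):
    """Return a Counter mapping each component representative to its size."""
    parent = {u: u for u in nodes}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]   # path halving keeps trees shallow
            u = parent[u]
        return u

    for u, v in arcs:                       # direction is ignored here
        parent[find(u)] = find(v)
    return Counter(find(u) for u in nodes)

def largest_wcc_after_pruning(nodes, arcs, k):
    """Drop every arc that points at a page with in-degree >= k, then re-measure."""
    indegree = Counter(v for _, v in arcs)
    kept = [(u, v) for u, v in arcs if indegree[v] < k]
    return max(weak_components(nodes, kept).values())

nodes = [0, 1, 2, 3, 4, 5]
arcs = [(1, 0), (2, 0), (3, 0), (3, 4)]
sizes = sorted(weak_components(nodes, arcs).values())
```

In the toy graph, node 0 is the only popular page. Removing links to it (k = 3) shrinks the largest component from 5 nodes to 2, whereas the slide reports that the real Web graph keeps a giant component under the same kind of pruning.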
19WCC Distribution and Resilience
Power Law Distribution (Exponent 2.5)
Size (M) of largest surviving weak component when
links to pages w/ in-degree ≥ k are removed
20Strong Components
- Single large SCC about 56 M pages
- Only 28% of all pages in crawl
- All other SCCs significantly smaller size
- SCC distribution also obeys power law
- Exponent 2.5
- Same as for WCC
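Strong components need direction-aware search. The sketch below uses Kosaraju's two-pass algorithm, written iteratively to avoid recursion limits. It is one standard technique, offered as an illustration. The slides do not say which SCC algorithm the authors implemented.

```python
def strong_components(graph):
    """Kosaraju's algorithm; graph maps node -> list of successors."""
    nodes = set(graph) | {v for vs in graph.values() for v in vs}
    reverse = {u: [] for u in nodes}
    for u, vs in graph.items():
        for v in vs:
            reverse[v].append(u)

    # Pass 1: record nodes in order of DFS completion on the forward graph.
    order, seen = [], set()
    for root in nodes:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(graph.get(root, [])))]
        while stack:
            u, successors = stack[-1]
            for v in successors:            # iterator resumes where it left off
                if v not in seen:
                    seen.add(v)
                    stack.append((v, iter(graph.get(v, []))))
                    break
            else:                           # all successors done: u is finished
                order.append(u)
                stack.pop()

    # Pass 2: sweep the reversed graph in decreasing finish time.
    components, assigned = [], set()
    for root in reversed(order):
        if root in assigned:
            continue
        assigned.add(root)
        component, stack = [], [root]
        while stack:
            u = stack.pop()
            component.append(u)
            for v in reverse[u]:
                if v not in assigned:
                    assigned.add(v)
                    stack.append(v)
        components.append(component)
    return components

toy = {1: [2], 2: [3], 3: [1, 4], 4: [5], 5: [4], 6: [1]}
sccs = sorted(sorted(c) for c in strong_components(toy))
```

The toy graph has one three-node cycle, one two-node cycle, and a singleton. The component-size list is the quantity whose distribution the slide describes as a power law.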
21SCC vs. WCC Size Distribution
Both SCC and WCC show power law distribution; same
exponent: 2.5
22Random-start BFS
- To study issues of diameter and average distance
- Traversals exhibit bimodal behavior
- Die out after reaching small set of nodes
- < 90 nodes, 90% of the time
- Explode to cover about 100 M nodes
- But never entire 186 M
- Sometimes both forward and backward
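The experiment can be sketched as follows: pick random start nodes, run one BFS along out-links and one along in-links, and record how many nodes each traversal reaches. Function names, the seed, and the toy graph are illustrative assumptions. The paper used 570 start nodes on the full crawl.

```python
import random
from collections import deque

def reach_size(adjacency, start):
    """Number of nodes a BFS from start visits, including start itself."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adjacency.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen)

def random_start_bfs(graph, samples, seed=0):
    """Map each sampled start node to (forward reach, backward reach)."""
    reverse = {}
    for u, vs in graph.items():
        for v in vs:
            reverse.setdefault(v, []).append(u)   # in-links as adjacency lists
    nodes = sorted(set(graph) | set(reverse))
    starts = random.Random(seed).sample(nodes, samples)
    return {s: (reach_size(graph, s), reach_size(reverse, s)) for s in starts}

toy = {1: [2], 2: [3], 3: [1, 4], 4: [5], 5: [4], 6: [1]}
results = random_start_bfs(toy, samples=6)   # 6 samples of 6 nodes: every node
```

Even in this tiny graph the reach depends strongly on direction. Node 6 reaches everything going forward but only itself going backward, a small-scale version of the "die out or explode" behavior.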
23Cumulative Dist. Of BFS Runs
Explosion in 50% of start nodes for in- and
out-links, 90% of nodes for undirected
24Zipf Distributions
- Definition from Wikipedia
- Originally the term Zipf's law meant the
observation of Harvard linguist George Kingsley
Zipf that the frequency of use of the
nth-most-frequently-used word in any natural
language is approximately inversely proportional
to n. The term Zipf's law has consequently come
to be used to refer to frequency distributions of
"rank data" in which the relative frequency of
the nth-ranked item is given by the Zeta
distribution, $1/(n^s \zeta(s))$.
- What does it mean?
25Power Law vs. Zipf
In-degree distribution has better fit with Zipf
than with power law
In-degrees plotted against both ranks and
magnitudes
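The difference between the two plots is only in how the same degree list is summarized. A minimal sketch, with made-up function names and data:

```python
from collections import Counter

def magnitude_view(degrees):
    """Power-law plot data: (degree, number of pages with that degree)."""
    return sorted(Counter(degrees).items())

def rank_view(degrees):
    """Zipf plot data: (rank, degree of the page at that rank), rank 1 = largest."""
    return list(enumerate(sorted(degrees, reverse=True), start=1))

degrees = [1, 1, 1, 2, 2, 5]
by_magnitude = magnitude_view(degrees)
by_rank = rank_view(degrees)
```

The magnitude view bins pages by degree, so rare high-degree pages become noisy single counts. The rank view keeps one point per page, which is one reason a fit can look cleaner there.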
26Interpretation
- Results of connected component experiments with random-start BFS
- 186 M nodes in giant weak component
- 56 M nodes in strong component
- Use BFS runs to estimate positions of remaining nodes
- IN, OUT, TENDRILS, DISCONNECTED
27Connectivity of the Web
28SCC, IN and OUT
- Directed path from each node of IN to all nodes in SCC
- Every BFS start node in SCC reaches 100 M nodes under in-link expansion
- Directed path from any node in SCC to every node in OUT
- Every node in SCC reaches 100 M nodes under out-link expansion
- SCC + IN ≈ SCC + OUT ≈ 100 M
- IN and OUT therefore 44 M nodes each
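The reasoning on this slide translates directly into set operations. A forward BFS from any node of the giant SCC covers SCC plus OUT, a backward BFS covers SCC plus IN, and their intersection is the SCC itself. The sketch below assumes a known `core_node` inside the SCC, and all names are hypothetical. "OTHER" lumps tendrils and disconnected pages together.

```python
from collections import deque

def reachable(adjacency, start):
    """Set of nodes reachable from start by BFS."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adjacency.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def bow_tie(graph, core_node):
    """Classify nodes relative to the strong component containing core_node."""
    reverse = {}
    for u, vs in graph.items():
        for v in vs:
            reverse.setdefault(v, []).append(u)
    nodes = set(graph) | set(reverse)
    forward = reachable(graph, core_node)      # SCC plus OUT
    backward = reachable(reverse, core_node)   # SCC plus IN
    scc = forward & backward
    return {"SCC": scc, "IN": backward - scc, "OUT": forward - scc,
            "OTHER": nodes - forward - backward}

toy = {1: [2], 2: [3], 3: [1, 4], 4: [5], 5: [4], 6: [1], 7: [8]}
parts = bow_tie(toy, core_node=1)
```

With the slide's figures the same subtraction gives the sizes: about 100 M reached minus 56 M in the SCC leaves roughly 44 M for each of IN and OUT.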
29Disconnected Set and Tendrils
- Computing Disconnected set of nodes
- Total nodes in crawl ≈ 203.5 M
- Giant WCC ≈ 186.7 M nodes
- Total − WCC ≈ 16.8 M nodes in disc. set
- Tendrils
- WCC − SCC − IN − OUT = Tendrils
- ≈ 44 M nodes
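The bookkeeping can be checked with the rounded figures quoted in the slides (all in millions). The 100 M reach of an exploding BFS is the approximate value given earlier, so the results are rough estimates only.

```python
total_nodes = 203.5   # nodes in the May 1999 crawl
giant_wcc = 186.7     # giant weakly connected component
scc = 56.0            # giant strongly connected component
bfs_reach = 100.0     # approximate size of SCC + IN, and of SCC + OUT

in_size = bfs_reach - scc                         # IN
out_size = bfs_reach - scc                        # OUT
disconnected = total_nodes - giant_wcc            # outside the giant WCC
tendrils = giant_wcc - scc - in_size - out_size   # remainder of the WCC
```

With these rounded inputs the tendrils come out near 42.7 M, in the same range as the slide's figure of about 44 M. The gap comes from rounding the inputs.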
30Size of Neighborhoods
- Sample of 128 nodes (IN), 134 (OUT)
- Exploring outwards: follow out-links from OUT, and in-links from IN
- Encounter 3093 nodes (OUT), 171 (IN)
- Exploring inwards: follow in-links from OUT, and out-links from IN
- Encounter 3367 nodes (OUT), 173 (IN)
- OUT encounters larger neighborhoods
31Diameter of SCC
Depth at which BFS terminates in each direction
- IN and OUT each contain a few long paths
- The last path is much longer than any others
- Conclusion
- Directed diameter of SCC is at least 28
32Compare SCC w/ IN, OUT
- Average depths following in/out links
- From SCC (in) From IN 482
- From SCC (out) From OUT 434
- Probability of path from node u to v
- Non-negligible (24% to 28%) if and only if
- u in SCC ∪ IN, v in SCC ∪ OUT
33Average Connected Distance
- Avg undirected connected distance = 6.83
- Contrast Albert et al. ('99) prediction of average distance of 19
- Over 75% of the time, no directed path from random start to finish node
- When there is a path, avg distance ≈ 16
34Conclusions
- Using larger data set produces more accurate results than in the past
- Undirected graph of web shows more connectedness than directed
- Although SCC + IN + OUT comprises 70% of total, given 2 random web pages, can only get from one to other 25% of the time
35Feedback