Web Structure and Accessibility of Information

1
Web Structure and Accessibility of Information
  • CS525 Databases & the Web
  • Diane Kramer
  • September 23, 2004

2
Introduction
  • Papers on Topic
  • 1. Graph Structure in the Web
  • Andrei Broder et al., WWW, 2000
  • 2. Accessibility of Information on the Web
  • Steve Lawrence, C. Lee Giles, Nature, 1999
  • Topic
  • What do these papers have in common?

3
This Presentation
  • Accessibility Paper interesting, but less
    technical
  • Aimed at wider audience, more easily understood
  • Many technical details from Web Structure paper
  • Comprehensive review of concepts and findings
  • Not enough time to do both
  • Focus on Paper 1

4
Paper 1 Discussion Outline
  • Purpose
  • Terminology
  • Related Work
  • Data
  • Experiments
  • Results
  • Interpretation

5
Purpose
  • What's the paper about?
  • Consider the Web as a directed graph
  • What do the nodes represent? The arcs?
  • What about connectivity?
  • Why is this interesting?
  • How/where is it applicable?

6
Terminology
  • Directed graph
  • Nodes, Arcs, Path, Distance
  • Out-degree, In-degree
  • Undirected graph: how does it differ?
  • Components
  • Strongly vs. Weakly connected
  • Breadth-first search
  • Diameter of graph
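The terms on this slide can be made concrete on a toy graph. The sketch below is only an illustration: the four-node graph and the helper names `out_degree`, `in_degree`, and `bfs_distances` are assumptions, not anything from the paper. Distance is counted as the number of arcs on a shortest directed path, found by breadth-first search.

```python
from collections import deque

# Hypothetical toy graph: node -> list of nodes it links to (its out-links).
graph = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

def out_degree(g, node):
    """Number of arcs leaving the node."""
    return len(g.get(node, []))

def in_degree(g, node):
    """Number of arcs pointing at the node."""
    return sum(node in targets for targets in g.values())

def bfs_distances(g, start):
    """Map each reachable node to the arc count of a shortest directed path."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in g.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

from_d = bfs_distances(graph, "d")
from_a = bfs_distances(graph, "a")
```

In this toy graph `d` can reach every other node, but nothing links to `d`, so it is missing from `from_a`. Treated as an undirected graph, all four nodes would be connected.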

7
Power Law Distributions
  • Definitions
  • From Wikipedia (http://en.wikipedia.org)
  • Relationship between two scalar quantities x and
    y such that the relationship can be written as
    y = a x^k, where a and k are constants.
  • From the paper
  • Defined on positive integers, with the
    probability of the value i being proportional
    to 1/i^k
  • What does it mean?
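One way to see what the paper's definition means is to build such a distribution and compare probabilities. This is a minimal sketch: the function name `power_law_pmf`, the cutoff `n_max`, and the choice k = 2.1 are assumptions made for illustration. The cutoff is needed only so the probabilities can be normalised in code.

```python
def power_law_pmf(k, n_max):
    """Probabilities for i = 1..n_max, each proportional to 1 / i**k."""
    weights = [1.0 / i**k for i in range(1, n_max + 1)]
    total = sum(weights)
    return [w / total for w in weights]

pmf = power_law_pmf(k=2.1, n_max=1000)

# Doubling i divides the probability by 2**k, whatever i is.
ratio_small = pmf[0] / pmf[1]      # P(1) / P(2)
ratio_large = pmf[99] / pmf[199]   # P(100) / P(200)
```

Both ratios equal 2**2.1, about 4.29. That scale-free behaviour is why a power law appears as a straight line on a log-log plot.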

8
Power Law Continued
  • Large is rare, small is common
  • E.g., individual wealth (80/20 rule)
  • Similar to Zipf's law
  • Zipf distribution is a function of ranks, rather
    than magnitudes
  • Some other examples
  • AOL Users to Sites
  • http://www.hpl.hp.com/research/idl/papers/ranking/ranking.html
  • Website Popularity
  • http://www.useit.com/alertbox/zipf.html

9
Related Work
  • Observation of power law distributions
  • Access statistics for web sites, for particular
    pages within a web site
  • Distribution of degrees on the web graph
  • Applying graph theoretic methods
  • Search, retrieval and mining problems
  • Context and content
  • Meaning found in connected components
  • Understanding domain structure, taxonomy

10
Data
  • 2 AltaVista crawls (May, Oct. 1999)
  • 200 million (M) pages
  • 1.5 billion (G) links
  • 5x larger than previous biggest study
  • Kumar et al. used pruned data set from 1997
    containing 40 M pages
  • Why does size matter?

11
Experimental Setup
  • Connectivity Server 2 (CS2)
  • Built at Compaq Systems Research Center
  • Takes a Web crawl as input
  • Creates a Web graph as output
  • Stores data in DB: URLs, in-links, out-links
  • Extended with URLs referenced ≥ 5 times

12
More Experimental Setup
  • CS2 System Continued
  • Data compression reduces required storage for
    URLs & links
  • High performance w/ enough RAM to store entire DB
    in memory
  • Database statistics
  • 203 M URLs, 1.466 G links (May, '99)
  • Fit in 9.5 GB of storage
  • 271 M URLs, 2.13 G links (Oct. '99)

13
AltaVista Data
  • Crawl based on large set of starting points
    accumulated over time with rules
  • Avoid overloading servers & robot traps
  • Avoid/detect spam, deal w/ timeouts
  • Evolve index
  • Remove duplicates, dead links, etc.
  • Add new pages, updated pages, etc.
  • DB contains superset of all pages in index at one
    point in time

14
Algorithms & Experiments
  • Check in/out degrees
  • Verify power law distribution
  • WCC, SCC
  • Find weak/strong connected components
  • Random-start BFS
  • Run from 570 randomly chosen nodes
  • Forward and backward

15
Experimental Results
  • The heart of the matter

16
Degree Distribution
  • Verify earlier observations of power law
  • Results from both May and Oct. crawls show
    agreement
  • In-degree exponent 2.1
  • Out-degree exponent 2.72
  • Deviation on out-degree distribution
  • Initial segment shows Poisson distribution
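The slide does not say how the exponents 2.1 and 2.72 were estimated. A common rough check, assumed here purely for illustration, is a least-squares line through (log degree, log count); the negated slope is the exponent estimate. The data below is synthetic, generated from an exact power law, and is not taken from the crawls. `fit_exponent` is a hypothetical helper name.

```python
import math

def fit_exponent(degrees, counts):
    """Negated least-squares slope of log(count) against log(degree)."""
    xs = [math.log(d) for d in degrees]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return -covariance / variance

# Synthetic counts that follow an exact power law with exponent 2.1.
degrees = list(range(1, 101))
counts = [1e6 / d**2.1 for d in degrees]
estimate = fit_exponent(degrees, counts)
```

On real data the low-degree end may not follow the line, as the slide notes for out-degree, so a fit like this would normally be restricted to the range where the log-log plot looks straight.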

17
In/Out Degree Distribution
In- and out-degree distributions from the May and
October 1999 crawls

18
Weak Components
  • Treat Web graph as undirected
  • Find weakly connected components
  • Giant component of 186 M nodes
  • 91% of nodes reachable from one another
  • Connectivity highly resilient
  • Remove links to pages w/ in-degree ≥ 5
  • Giant component still contains 59 M nodes
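Finding weakly connected components means ignoring arc direction and grouping nodes that are joined by any chain of links. A union-find structure is one standard way to do that; the slides do not say which method the authors used, so treat this as a sketch under that assumption. The six-node example and the name `weak_component_sizes` are hypothetical.

```python
from collections import Counter

def weak_component_sizes(nodes, arcs):
    """Sizes of weakly connected components, largest first."""
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for u, v in arcs:
        # Direction is ignored: an arc in either direction merges the two groups.
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v

    sizes = Counter(find(n) for n in nodes)
    return sorted(sizes.values(), reverse=True)

nodes = ["a", "b", "c", "d", "e", "f"]
arcs = [("a", "b"), ("c", "b"), ("d", "e")]
component_sizes = weak_component_sizes(nodes, arcs)
```

The resilience experiment on this slide could be imitated by dropping every arc whose target has in-degree at or above some threshold before calling the function, then looking at the first entry of the result.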

19
WCC Distribution & Resilience
Power law distribution (exponent 2.5)
Size (M) of largest surviving weak component when
links to pages w/ in-degree ≥ k are removed

20
Strong Components
  • Single large SCC about 56 M pages
  • Only 28% of all pages in crawl
  • All other SCCs significantly smaller size
  • SCC distribution also obeys power law
  • Exponent 2.5
  • Same as for WCC
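To get a size distribution for strongly connected components, every SCC has to be found, not just the largest. The sketch below uses Kosaraju's two-pass algorithm, written iteratively. It is a textbook method chosen for illustration and not necessarily what the authors ran. The function name `scc_sizes` and the five-node graph are hypothetical.

```python
def scc_sizes(nodes, arcs):
    """Sizes of strongly connected components, largest first (Kosaraju)."""
    forward = {n: [] for n in nodes}
    backward = {n: [] for n in nodes}
    for u, v in arcs:
        forward[u].append(v)
        backward[v].append(u)

    # Pass 1: record nodes in order of DFS completion on the forward graph.
    order, seen = [], set()
    for root in nodes:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(forward[root]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(forward[nxt])))
                    break
            else:
                # All out-links explored: this node is finished.
                order.append(node)
                stack.pop()

    # Pass 2: sweep the reversed graph in decreasing completion order.
    sizes, assigned = [], set()
    for root in reversed(order):
        if root in assigned:
            continue
        assigned.add(root)
        stack, size = [root], 0
        while stack:
            node = stack.pop()
            size += 1
            for prev in backward[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        sizes.append(size)
    return sorted(sizes, reverse=True)

nodes = ["a", "b", "c", "d", "e"]
arcs = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("e", "a")]
sizes = scc_sizes(nodes, arcs)
```

Here `a`, `b`, and `c` form a cycle and so share one SCC, while `d` and `e` are each alone, even though all five nodes sit in a single weak component.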

21
SCC vs. WCC Size Distribution
Both SCC and WCC show power law distribution with the
same exponent, 2.5

22
Random-start BFS
  • To study issues of diameter & average distance
  • Traversals exhibit bimodal behavior
  • Die out after reaching small set of nodes
  • < 90 nodes, 90% of the time
  • Explode to cover about 100 M nodes
  • But never entire 186 M
  • Sometimes both forward & backward

23
Cumulative Dist. Of BFS Runs
Explosion in 50% of start nodes for in- and
out-links, 90% of nodes for undirected

24
Zipf Distributions
  • Definition from Wikipedia
  • Originally the term Zipf's law meant the
    observation of Harvard linguist George Kingsley
    Zipf that the frequency of use of the
    nth-most-frequently-used word in any natural
    language is approximately inversely proportional
    to n. The term Zipf's law has consequently come
    to be used to refer to frequency distributions of
    "rank data" in which the relative frequency of
    the nth-ranked item is given by the Zeta
    distribution, 1/(n^s ζ(s)).
  • What does it mean?
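The formula in the definition can be evaluated directly. In this sketch ζ(s) is approximated by a truncated sum, which is an assumption made to keep the code dependency-free. The names `zeta` and `zeta_frequency`, the number of terms, and the choice s = 2 are all illustrative.

```python
import math

def zeta(s, terms=100_000):
    """Truncated sum approximating the Riemann zeta function for s > 1."""
    return sum(1.0 / n**s for n in range(1, terms + 1))

def zeta_frequency(n, s):
    """Relative frequency of the nth-ranked item: 1 / (n**s * zeta(s))."""
    return 1.0 / (n**s * zeta(s))

s = 2.0
exact_zeta_2 = math.pi**2 / 6  # known closed form for zeta(2)
top_three = [zeta_frequency(n, s) for n in (1, 2, 3)]
```

With s = 2 the top-ranked item is four times as frequent as the second and nine times as frequent as the third. Frequency here is a function of rank n, not of the item's magnitude, which is the distinction drawn on the earlier power law slide.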

25
Power Law vs. Zipf
In-degree distribution has better fit with Zipf
than with power law
In-degrees plotted against both ranks and
magnitudes
26
Interpretation
  • Results of connected component experiments with
    random-start BFS
  • 186 M nodes in giant weak component
  • 56 M nodes in strong component
  • Use BFS runs to estimate positions of remaining
    nodes
  • IN, OUT, TENDRILS, DISCONNECTED
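On a graph small enough to search exhaustively, the five regions can be computed with three reachability searches from any node known to be in the giant SCC. This is a sketch only: the paper estimated the regions from sampled BFS runs, and here tendrils are simply whatever remains of the weak component, as in the subtraction on a later slide. The helper names and the seven-node example are hypothetical.

```python
from collections import deque

def reachable(adj, start):
    """Set of nodes reachable from start by BFS over the adjacency map."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def classify(nodes, arcs, core_node):
    """Split nodes into SCC, IN, OUT, TENDRILS, DISCONNECTED around core_node."""
    forward, backward, undirected = {}, {}, {}
    for u, v in arcs:
        forward.setdefault(u, []).append(v)
        backward.setdefault(v, []).append(u)
        undirected.setdefault(u, []).append(v)
        undirected.setdefault(v, []).append(u)
    fwd = reachable(forward, core_node)     # SCC plus OUT
    bwd = reachable(backward, core_node)    # SCC plus IN
    wcc = reachable(undirected, core_node)  # giant weak component
    scc = fwd & bwd
    return {
        "SCC": scc,
        "IN": bwd - scc,
        "OUT": fwd - scc,
        "TENDRILS": wcc - fwd - bwd,
        "DISCONNECTED": set(nodes) - wcc,
    }

nodes = ["s1", "s2", "i", "o", "t", "x", "y"]
arcs = [("s1", "s2"), ("s2", "s1"), ("i", "s1"),
        ("s2", "o"), ("i", "t"), ("x", "y")]
parts = classify(nodes, arcs, "s1")
```

Node `t` hangs off IN without leading into the SCC, so it lands in TENDRILS. Nodes `x` and `y` link only to each other and end up in DISCONNECTED.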

27
Connectivity of the Web
28
SCC, IN and OUT
  • Directed path from each node of IN to all nodes
    in SCC
  • Every BFS start node in SCC reaches 100 M nodes
    under in-link expansion
  • Directed path from any node in SCC to every node
    in OUT
  • Every node in SCC reaches 100 M nodes under
    out-link expansion
  • SCC + IN = SCC + OUT ≈ 100 M
  • IN and OUT therefore 44 M nodes each

29
Disconnected Set & Tendrils
  • Computing Disconnected set of nodes
  • Total nodes in crawl ≈ 203.5 M
  • Giant WCC ≈ 186.7 M nodes
  • Total - WCC ≈ 16.8 M nodes in disconnected set
  • Tendrils
  • WCC - SCC - IN - OUT = Tendrils
  • ≈ 44 M nodes
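The subtractions on this slide and the previous one can be replayed with the rounded figures quoted in the deck. The variable names are illustrative. The 100 M figure is the approximate number of nodes reached by BFS from the SCC in either direction.

```python
# Figures in millions of nodes, as quoted on the slides (rounded).
total_nodes = 203.5
giant_wcc = 186.7
scc = 56.0
reach_from_scc = 100.0  # approx. nodes reached from SCC, forward or backward

in_size = reach_from_scc - scc    # IN  = (SCC + IN)  - SCC
out_size = reach_from_scc - scc   # OUT = (SCC + OUT) - SCC
disconnected = round(total_nodes - giant_wcc, 1)
tendrils = round(giant_wcc - scc - in_size - out_size, 1)
```

With these rounded inputs the tendrils figure comes out near 42.7 M, somewhat below the 44 M quoted on the slide. The gap presumably reflects rounding in the 100 M and 56 M inputs.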

30
Size of Neighborhoods
  • Sample of 128 nodes (IN), 134 (OUT)
  • Exploring outwards: follow out-links from OUT,
    and in-links from IN
  • Encounter 3093 nodes (OUT), 171 (IN)
  • Exploring inwards: follow in-links from OUT, and
    out-links from IN
  • Encounter 3367 nodes (OUT), 173 (IN)
  • OUT encounters larger neighborhoods

31
Diameter of SCC
Depth at which BFS terminates in each direction
  • IN and OUT each contain a few long paths
  • The last path is much longer than any others
  • Conclusion
  • Directed diameter of SCC is at least 28

32
Compare SCC w/ IN, OUT
  • Average depths following in/out links
  • From SCC (in) From IN 482
  • From SCC (out) From OUT 434
  • Probability of path from node u to v
  • Non-negligible (24% to 28%) if and only if
  • u in SCC ∪ IN, v in SCC ∪ OUT

33
Average Connected Distance
  • Avg. undirected connected distance: 6.83
  • Contrast: Albert et al. ('99) predicted an
    average distance of 19
  • Over 75% of the time, no directed path from
    random start to finish node
  • When there is a path, avg. distance ≈ 16
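"Average connected distance" averages only over pairs that actually have a path, and reports separately how often a path exists. The sketch below computes both over all ordered pairs of a toy graph. The paper worked from sampled BFS runs, so the exhaustive approach, the helper names, and the four-node example are assumptions for illustration.

```python
from collections import deque
from itertools import permutations

def bfs_distances(adj, start):
    """Map each reachable node to its shortest directed distance from start."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def connected_distance_stats(nodes, arcs):
    """Return (fraction of ordered pairs with a path, mean length of those paths).

    Assumes at least one ordered pair is connected.
    """
    adj = {}
    for u, v in arcs:
        adj.setdefault(u, []).append(v)
    all_dist = {n: bfs_distances(adj, n) for n in nodes}
    lengths = [all_dist[u][v]
               for u, v in permutations(nodes, 2)
               if v in all_dist[u]]
    pairs = len(nodes) * (len(nodes) - 1)
    return len(lengths) / pairs, sum(lengths) / len(lengths)

nodes = ["a", "b", "c", "d"]
arcs = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
fraction, mean_distance = connected_distance_stats(nodes, arcs)
```

In the toy graph nothing leads out of `d`, so 3 of the 12 ordered pairs have no path and are left out of the average rather than counted as infinitely far apart.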

34
Conclusions
  • Using larger data set produces more accurate
    results than in the past
  • Undirected graph of web shows more connectedness
    than directed
  • Although SCC + IN + OUT comprises 70% of total, given
    2 random web pages, can only get from one to
    other 25% of the time

35
Feedback
  • Questions? Comments?