1
On Compressing Web Graphs
  • Michael Mitzenmacher, Harvard
  • Micah Adler, Univ. of Massachusetts

2
The Web as a Graph
[Figure: a small Web graph with pages A, B, C, and D, and hyperlinks drawn as directed edges between them.]
3
Motivation
  • The Web graph itself is interesting and useful.
  • PageRank / Kleinberg's algorithm.
  • Finding cyber-communities.
  • Archival history of Web growth and development.
  • Connectivity server.
  • Storing Web linkage information is expensive.
  • Web growth rate vs. storage growth rate?
  • Can we compress it?

4
Varieties of Compression
  • 1. Compress an isomorphism of the Web graph.
    Good for storage/transmission of graph features.
  • 2. Compress the Web graph with nodes in a given
    order (e.g. sorted by URL).
  • 3. Compress so that the compressed graph can be used directly in a product (e.g. a connectivity server).

5
Baseline Huffman coding
  • Significant work has shown that the in/outdegrees of Web graph vertices have a power-law distribution.
  • Basic scheme: for each vertex, list all of its outedges.
  • Assign each vertex a Huffman codeword based on its indegree; an outedge list is then written as the codewords of its targets (a sketch follows).
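A minimal sketch of such a coder in Python (the dict-based input and vertex names are illustrative assumptions; the slides do not fix a data format):

    import heapq
    from itertools import count

    def huffman_codes(indegree):
        # indegree: dict vertex -> indegree, used as the codeword frequency.
        tiebreak = count()  # keeps tuple comparison away from non-comparable trees
        heap = [(freq, next(tiebreak), v) for v, freq in indegree.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)
            f2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
        codes = {}
        def walk(node, prefix):
            if isinstance(node, tuple):        # internal node: recurse
                walk(node[0], prefix + "0")
                walk(node[1], prefix + "1")
            else:                              # leaf: an actual vertex
                codes[node] = prefix or "0"
        walk(heap[0][2], "")
        return codes

    # High-indegree vertices get short codewords:
    print(huffman_codes({"A": 1, "B": 1, "C": 3, "D": 1, "E": 2, "F": 1, "G": 3}))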

6
Huffman Example
Indegrees   Codewords
1           100
1           0000
3           01
1           101
2           001
1           0001
3           11
7
Web Graph Structure
  • Intuition: Huffman uses the degree distribution, but not Web graph structure.
  • More structure to take advantage of: Web communities.
  • Many pages share links.

[Figure: pages A, B, and C sharing many of the same outlinks to pages D, E, and F.]
8
Reference Algorithm
  • Each vertex is allowed to choose a reference vertex.
  • Compress by representing the edges copied from the reference vertex as a bit vector (sketched after the figure).
  • No cycles are allowed among references.

[Figure: X uses Y as its reference. Y's outedges are b, c, d, e, f; X's outedges are a, b, c, d. X is encoded as the extra edge a, a pointer to Y, and the bit vector 11100 over Y's outedges.]
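A minimal sketch of this encoding step (the function name and list-based representation are my own; only the bit-vector idea comes from the slides):

    def encode_with_reference(out_x, out_ref):
        # Bit i is 1 iff the reference's i-th outedge is copied by X;
        # X's remaining outedges are listed explicitly as "extra" edges.
        copied = set(out_x) & set(out_ref)
        bits = "".join("1" if v in copied else "0" for v in out_ref)
        extra = [v for v in out_x if v not in copied]
        return bits, extra

    # The figure's example: Y points to b,c,d,e,f and X points to a,b,c,d.
    print(encode_with_reference(["a", "b", "c", "d"], ["b", "c", "d", "e", "f"]))
    # -> ('11100', ['a'])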
9
Simple Reference Algorithm
  • Goal: maximize the number of edges compressed.
  • Build a related affinity graph, recording the number of shared pointers between each pair of vertices.
  • Find a maximum spanning tree (or forest) of the affinity graph to pick the best references (see the sketch after the figure).

[Figure: the affinity graph edge between X and Y has weight 3, the number of outedges among a-f that X and Y share.]
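The maximum spanning forest can be found with Kruskal's algorithm on decreasing weights; a minimal sketch (the toy weights are illustrative, not the slides' data):

    def max_spanning_forest(vertices, weight):
        # weight: dict (u, v) -> number of shared pointers.  Taking edges in
        # decreasing weight order and skipping cycles yields a maximum
        # spanning forest; each accepted edge pairs a vertex with a reference.
        parent = {v: v for v in vertices}
        def find(v):                            # union-find with path halving
            while parent[v] != v:
                parent[v] = parent[parent[v]]
                v = parent[v]
            return v
        forest = []
        for (u, v), w in sorted(weight.items(), key=lambda kv: -kv[1]):
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                forest.append((u, v, w))
        return forest

    print(max_spanning_forest(["X", "Y", "Z"],
                              {("X", "Y"): 3, ("Y", "Z"): 1, ("X", "Z"): 2}))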
10
Improved Reference Algorithm
  • Let cost(A,B) be the cost of compressing A using B as a reference.
  • Form an improved affinity graph: a directed graph with these costs as edge weights.
  • Also add a root node R, with cost(A,R) being the cost of A with no reference.
  • Compute the rooted directed maximum spanning tree of the directed affinity graph (one plausible cost function is sketched below).
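A sketch of one plausible cost function, under my own bit-accounting assumptions (the slides do not spell these out): compressing A against B spends one bit per outedge of B for the copy vector, plus roughly log2(n) bits per uncopied edge of A.

    import math

    def cost(out_a, out_b, n):
        # Hypothetical accounting: a copy bit for each of B's outedges, plus
        # ceil(log2(n)) bits for each outedge of A that B does not share.
        extra = len(set(out_a) - set(out_b))
        return len(out_b) + extra * math.ceil(math.log2(n))

    def cost_no_reference(out_a, n):
        # cost(A, R): every outedge of A written explicitly.
        return len(out_a) * math.ceil(math.log2(n))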

11
Example
[Figure: pages A and B in a graph with n = 1024 vertices, sharing outedges among a-f. Part of the directed affinity graph: directed edges between A and B, and edges from the root R, with example costs 25, 34, 40, and 50.]
12
Complexity
  • Finding a directed maximum spanning tree is fast: for x vertices and y edges, the running time is O(x log x + y) or O(y log x).
  • Compressing is fast once the references are given.
  • The slow part is building the affinity graph.
  • This is equivalent to sparse matrix multiplication.
  • If M is the adjacency matrix, the number of shared neighbors is found by computing MM^T (see the sketch below).
  • Sparseness helps, but it is still potentially very slow.
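A minimal sketch of the matrix view (assuming SciPy; the tiny matrix is illustrative):

    import numpy as np
    from scipy.sparse import csr_matrix

    # Row i of M is the outedge indicator vector of vertex i, so
    # (M @ M.T)[i, j] counts the out-neighbors that i and j share.
    M = csr_matrix(np.array([[0, 1, 1, 1, 0],     # vertex 0 points to 1, 2, 3
                             [0, 1, 1, 0, 1]]))   # vertex 1 points to 1, 2, 4
    shared = (M @ M.T).toarray()
    print(shared[0, 1])  # -> 2 shared out-neighbors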

13
Building the Affinity Graph
  • Approach 1: for each pair of vertices a, b, check their edge lists to find common neighbors.
  • Slow, but easy on memory.
  • Approach 2: for each vertex a, increase a count for each pair b, c of vertices with edges to a (sketched below).
  • Quicker, but a potential memory hog.
  • Parallelizable.
  • Complexity: each vertex a contributes one count update per pair of its in-neighbors, i.e. the sum over a of (indegree(a) choose 2) updates.
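A minimal sketch of Approach 2 (the dict-of-lists input is an assumption): for each target, bump a counter for every pair of vertices pointing at it.

    from collections import defaultdict
    from itertools import combinations

    def affinity_counts(out):
        # out: dict vertex -> list of outedge targets.
        in_neighbors = defaultdict(list)          # target -> vertices pointing at it
        for u, targets in out.items():
            for t in targets:
                in_neighbors[t].append(u)
        weight = defaultdict(int)                 # the potential memory hog
        for sources in in_neighbors.values():
            for u, v in combinations(sorted(sources), 2):
                weight[(u, v)] += 1               # u and v share this target
        return dict(weight)

    print(affinity_counts({"X": ["a", "b", "c"], "Y": ["b", "c", "d"]}))
    # -> {('X', 'Y'): 2}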

14
Variations
  • Huffman code the non-referenced edges.
  • Using non-Huffman weights to find references is then no longer optimal.
  • But the Huffman weights are not known until the references have been found.
  • Huffman/run-length/otherwise encode the bit vectors.
  • Bound the depth of the reference tree.
  • Find multiple references.

15
Bounded Tree Depth
  • For computing on the compressed form of the graph, we do not want long paths of references.
  • Potential solution: bound the tree depth from the root.
  • Problem: finding the optimal tree of bounded depth is NP-hard.
  • Depth 2: equivalent to a facility location problem.
  • In practice: use heuristic/approximation algorithms, or split the full optimal tree to keep the depth bound (sketched below).
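One hypothetical splitting heuristic, as a sketch (the slides name the idea but not a procedure): walk the reference tree and re-attach any vertex that would exceed the depth bound directly to the root, i.e. compress it with no reference.

    def bound_depth(parent, root, max_depth):
        # parent: vertex -> its reference; the root represents "no reference".
        children = {}
        for v, p in parent.items():
            children.setdefault(p, []).append(v)
        stack = [(root, 0)]
        while stack:
            v, depth = stack.pop()
            for c in children.get(v, []):
                if depth + 1 > max_depth:
                    parent[c] = root          # split: c now compresses on its own
                    stack.append((c, 1))
                else:
                    stack.append((c, depth + 1))
        return parent

    # A chain A <- B <- C <- D split to depth 2:
    print(bound_depth({"A": "R", "B": "A", "C": "B", "D": "C"}, "R", 2))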

16
Multiple References
  • If one reference is good, finding two could be better.
  • We show that finding the optimal pair of references, even just to maximize the number of compressed edges, is NP-hard.
  • In practice: run the single-reference algorithm multiple times.

17
Prototype
  • Finds references by constructing the directed affinity graph and computing the directed maximum spanning tree.
  • Does not output the compressed form, only the size of the compressed form.
  • Also computes the Huffman and Reference+Huffman sizes.
  • The size of the Huffman table is not counted.
  • Future work: dealing with the bottleneck of computing the affinity graph.

18
Web Graph Models
  • Copy models:
  • New pages are generated dynamically.
  • Some links are random: uniform over all vertices.
  • Some links are copies: choose a page you like at random, and copy some of its links.
  • Richer models include deletions, changing links, and inedges at creation.
  • Results in a power-law degree distribution (a generator is sketched after the figure).

19
Copy Model
[Figure: a new page links both ways: one link is a uniformly random link, the others are copies of page X's links.]
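A minimal sketch of a generator in this spirit (the parameter names and the fixed per-page link count are my assumptions, not the paper's exact model):

    import random

    def copy_model(n_new, out, copy_prob=0.5, links_per_page=3):
        # out: dict page -> list of link targets (the seed graph).
        for _ in range(n_new):
            new = len(out)
            liked = random.choice(list(out))           # "a page you like", at random
            links = set()
            while len(links) < links_per_page:
                if random.random() < copy_prob:
                    links.add(random.choice(out[liked]))   # copy one of its links
                else:
                    links.add(random.randrange(new))       # uniform random link
            out[new] = sorted(links)
        return out

    # Seed loosely mirrors the experiments' 1024-vertex degree-3 seed, scaled down:
    seed = {i: [(i + 1) % 8, (i + 2) % 8, (i + 3) % 8] for i in range(8)}
    graph = copy_model(100, seed)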
20
Data for Testing
  • Graphs G1-G4 were generated with the random copy model.
  • TREC: the TREC8 WT2g data set.

Graph          G1        G2        G3        G4        TREC
Nodes          131,072   131,072   131,072   131,072   247,428
Pages Copied   1         1         1,2       0,4       NA
Copy Prob      0.5       0.7       0.5       0.5       NA
Random Links   1         1         1,2       0,4       NA
21
Testing Details
  • Single pass: at most one reference per vertex.
  • 10 trials for each random graph type; little variance found.
  • Random graphs were seeded with 1024 vertices of degree 3.
  • Small graphs (G1, G2): an edge between two vertices in the affinity graph if they share at least 2 edges in the original. Large graphs (G3, G4, TREC): at least 3 shared edges.

22
Results
Graph                   G1      G2      G3      G4      TREC
Avg. Deg.               2.09    3.25    5.10    10.22   4.72
No comp. (bits, mill.)  4.66    7.25    11.36   22.78   21.00
Huffman                 87.75   83.93   85.15   79.47   83.31
Reference               88.68   67.49   69.96   61.65   49.15
RefHuff                 81.58   63.63   65.35   54.13   46.36

(Huffman, Reference, and RefHuff give compressed size as a percentage of the uncompressed size; lower is better.)
23
Analysis of Results
  • Huffman fails to capture significant structure.
  • More copying leads to more compression.
  • Good compression possible even with only one
    reference.
  • The Reference algorithm performs well on real Web data.
  • TREC database may not be representative.
  • Significant locality.

24
Contributions
  • We introduce the Reference algorithm, an algorithm designed to compress Web graphs based on their structural properties.
  • Initial results: the Reference algorithm appears very promising, better than Huffman.
  • Bounded-depth variations may be suitable for on-line computation (connectivity server).
  • Hardness results for natural extensions of the Reference algorithm.

25
Future Work
  • Beating the bottleneck: determining the affinity graph.
  • Can we approximate the affinity graph and still compress well?
  • More extensive testing.
  • Variations: multiple passes, bounded depth.
  • Graphs: larger artificial and real Web graphs.
  • Determining the value of locality, and combining locality with a reference-based scheme.