Title: Graph Problems in the Streaming Model
1Graph Problems in the Streaming Model
- Sampath Kannan
- University of Pennsylvania
- Work done with Joan Feigenbaum, Andrew McGregor,
Siddharth Suri and Jian Zhang
2Graph Streaming
- G(V,E)
- V known, |V| = n
- E revealed in arbitrary order (e1, e2, ...)
- Space allowed: O(n polylog n) (semi-streaming)
3Motivation?
- Fundamental problems help calibrate model
- Massive graphs such as the web graph can appear as a stream
- Recommendation systems and, more generally, data mining
4Why so much space?
- Even simple problems need it
- Given u, v, and a streamed graph G, is there a path of length 2 between u and v?
- Requires Ω(n) space.
- More generally, for balanced graph properties
5Balanced Properties
A property is balanced if there exists a stream of
edges such that, before the last edge is seen, there
exists a vertex v for which, when the last edge is (v,x),
the property holds for Ω(n) choices of x and does not
hold for Ω(n) choices of x.
6Lower Bound for Balanced Props
Consider all isomorphic versions of the
graph that demonstrates the balanced
property. Before seeing the last edge, a streaming
algorithm has to remember the subset of
vertices x such that adding the edge (v,x)
causes the property to hold. As we range over
isomorphisms, this is an arbitrary subset of
the given cardinality, and there are
exponentially many possibilities.
7Exceptions
- Counting Local Structures
- Counting triangles (Bar-Yossef et al., Buriol et al.)
- Counting E(G^2) (Ganguly et al.)
- Duplicate elimination and aggregation (Cormode, Muthukrishnan)
8One algorithm design technique
- Sparsification (Eppstein, Galil, Italiano, Nissenzweig 97)
- For a graph property P, G' is a strong certificate for G
if ∀H: (G ∪ H) ∈ P ⇔ (G' ∪ H) ∈ P.
- Existence of quickly computable, sparse, strong
certificates leads to good semi-streaming
algorithms
9Sparsification-based algorithms
- Bipartiteness, 1-, 2-, 3-vertex connected components, 2-, 3-edge connected
components: O(α(n)) per edge
- MST, 4-vertex connected comps., 3-edge connected comps.: O(log n)
- Higher connectivities: O(n) (Zelke)
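A spanning forest is the classic sparse certificate for connectivity, and a spanning forest plus a record of whether an odd cycle has been seen serves the same purpose for bipartiteness. The sketch below is a minimal illustration under that assumption, not the algorithm from the cited work. The names `ParityUnionFind` and `stream_bipartite` are hypothetical, vertices are assumed to be integers 0..n-1, and the union-find uses path compression without union by rank, so it does not reach the O(α(n)) per-edge bound.

```python
class ParityUnionFind:
    """Union-find that tracks each vertex's colour parity relative to its root."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.parity = [0] * n  # parity of the vertex relative to its parent

    def find(self, v):
        """Return (root, parity of v relative to root), compressing the path."""
        path = []
        while self.parent[v] != v:
            path.append(v)
            v = self.parent[v]
        root = v
        acc = 0
        # Walk back from the vertex nearest the root, accumulating parities.
        for u in reversed(path):
            acc ^= self.parity[u]
            self.parity[u] = acc
            self.parent[u] = root
        return root, (self.parity[path[0]] if path else 0)


def stream_bipartite(n, edges):
    """One pass over the edge stream, storing only O(n) words.

    Returns (spanning forest edges, whether the streamed graph is bipartite).
    """
    uf = ParityUnionFind(n)
    forest = []
    bipartite = True
    for u, v in edges:
        ru, pu = uf.find(u)
        rv, pv = uf.find(v)
        if ru != rv:
            # Merge the components so that u and v get opposite colours.
            uf.parent[ru] = rv
            uf.parity[ru] = pu ^ pv ^ 1
            forest.append((u, v))
        elif pu == pv:
            # Same component and same colour: this edge closes an odd cycle.
            bipartite = False
    return forest, bipartite
```

For example, a 4-cycle streamed as `[(0, 1), (1, 2), (2, 3), (3, 0)]` keeps three forest edges and stays bipartite, while a triangle is flagged as non-bipartite on its third edge.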
10Bipartite Matching
Matching (maximal) Augmenting path
Approximable with local greed
11Constant-pass 2/3-approx for bip. matching
- Maximal matching is a 1/2-approximation: if M* is maximum
and M is maximal, then M matches at least one
endpoint of each edge in M*, so M has at least |M*|/2 edges.
- If M has only α|M| vertex-disjoint 3-augmenting paths,
then |M| ≥ (1 − α) · 2·OPT/3. With M* maximum, M* ⊕ M contains a bunch
of augmenting paths. Count!
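The 1/2-approximation above needs only a single greedy pass. The following is a minimal sketch of that greedy rule; the function name `greedy_maximal_matching` is an assumption made for illustration.

```python
def greedy_maximal_matching(edges):
    """Keep an edge iff both endpoints are still free when it arrives."""
    matched = set()   # vertices already covered by the matching
    matching = []
    for u, v in edges:
        if u != v and u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
    return matching
```

On the path 0-1-2-3, the arrival order matters: if the middle edge (1, 2) arrives first, the result has one edge, which is exactly half of the maximum matching {(0, 1), (2, 3)}.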
12
- Can find a maximal matching
- To go beyond: need to get most augmenting paths of length 3.
- Randomly project all free vertices into Layer 0 or Layer 3
- Matched edges go from Layer 1 to Layer 2.
- Expect half the augmenting paths of length 3
to respect the layering
- Use maximal matchings between successive
layers to get a constant fraction of these.
- Gives a constant-pass (2/3 − ε)-approximation
13
- To get an approximation scheme: need to find most
augmenting paths of length ??????
- Again project vertices into k+1 layers to find
augmenting paths of length k
- Use carefully chosen maximal matching
algorithms between successive layers
- Repeat a constant number of times
- Gives a streaming linear-time approximation scheme for
unweighted matching in general graphs (McGregor)
14Weighted Matching
15A 1/6 Approximation in 1 Pass
- At all times we store some matching M.
- On seeing edge e = (u,v), we compare w(e) with
the total weight W of the edges e1 and e2 in M incident on
u and v.
- If w(e) > 2W then
- M ← M ∪ {e} \ {e1, e2}
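The replacement rule above can be sketched in a few lines. This is a minimal illustration, assuming edges arrive as `(u, v, weight)` triples with comparable vertex names; the function name `one_pass_weighted_matching` and the `mate` dictionary are hypothetical choices, not taken from the talk.

```python
def one_pass_weighted_matching(edges):
    """One pass: take e if w(e) > 2W, where W is the weight of the edges it evicts."""
    mate = {}  # vertex -> (partner, weight) of its edge in the current matching M
    for u, v, w in edges:
        if u == v:
            continue
        # Collect the (at most two) edges of M that touch u or v.
        conflicts = {}
        for x in (u, v):
            if x in mate:
                y, wx = mate[x]
                conflicts[frozenset((x, y))] = wx
        W = sum(conflicts.values())
        if w > 2 * W:
            # Evict e1 and e2, then insert e.
            for edge in conflicts:
                for x in edge:
                    del mate[x]
            mate[u] = (v, w)
            mate[v] = (u, w)
    return {(min(x, y), max(x, y), w) for x, (y, w) in mate.items()}
```

With the stream `[(0, 1, 1), (1, 2, 3), (2, 3, 1)]`, the second edge evicts the first (3 > 2·1), and the third is rejected (1 is not more than 2·3), leaving only the edge of weight 3.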
16
- To show the 1/6 approximation: account for the weight of
edges lost in terms of the weight of edges that
survive
- Can improve the approximation to 1/2 − ε (McGregor) in
a constant number of passes
- Choose an edge if it is (1 + γ) times the weight
of the edges that it kills.
17Approximating Distances
18The Sketch Approach
- A two-stage approach
- First stage: while going through the stream,
construct a small sketch of the input graph.
- Second stage: compute the distance using the
sketch, without further access to the stream.
- Perform BFS-like computations in the second
stage.
19Graph Spanners as Sketches
- Multiplicative t-spanner: an edge subgraph H of a
graph G such that, for any pair of vertices u and v,
dist_H(u,v) ≤ t · dist_G(u,v).
- There is a t-spanner with O(n^(1+1/t)) edges.
- Reduce streaming graph distance to streaming
spanner construction.
- BFS-like subroutines are used in most existing
spanner constructions.
20Streaming Spanner Construction
- For each incoming edge, decide whether it should
be in the spanner.
- If the edge causes a cycle of length ≤ t, do not
put the edge in the spanner.
- This gives a t-spanner, because there is a path P
of length < t connecting the two endpoints of any
discarded edge.
- This spanner is sparse.
- Thm [Bollobás 78]: A graph whose girth is
larger than k can only have O(n^(1+2/(k-1))) edges.
- Need to know: for an incoming edge, does a short
path exist?
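The rule above can be written down directly if the "does a short path exist?" question is answered by a depth-limited BFS over the edges kept so far. This is only a naive sketch for unweighted graphs: the per-edge BFS is an assumption made for illustration and is not efficient in the streaming setting. The names `within_distance` and `greedy_spanner` are hypothetical.

```python
from collections import deque


def within_distance(adj, source, target, limit):
    """True if target is reachable from source using at most `limit` edges."""
    if source == target:
        return True
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        x, d = frontier.popleft()
        if d == limit:
            continue  # do not expand beyond the depth limit
        for y in adj.get(x, ()):
            if y == target:
                return True
            if y not in seen:
                seen.add(y)
                frontier.append((y, d + 1))
    return False


def greedy_spanner(edges, t):
    """Drop an edge iff it would close a cycle of length <= t in the spanner."""
    adj = {}
    spanner = []
    for u, v in edges:
        # A cycle of length <= t through (u, v) means a u-v path of length <= t - 1.
        if within_distance(adj, u, v, t - 1):
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
        spanner.append((u, v))
    return spanner
```

For a triangle streamed with t = 3, the third edge closes a cycle of length 3 and is discarded; with t = 2 all three edges are kept.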
21
Baswana and Sen show an almost-linear-time
non-streaming algorithm for spanners,
growing BFS trees from appropriate
nodes. This is difficult to do in streaming
fashion. Instead we grow a BFS-like tree, and not
just from its root!
- Clusters: rooted BFS trees
- Preclusters: free-floating pieces of BFS
trees that will attach to clusters
22Summary of the One-Pass Algorithm
- Use a vertex-labeling scheme to construct
clusters.
- Structure of the algorithm
- In the pre-processing phase, generate a
multi-level set of labels for the vertices.
- Go through the stream; for each edge:
- According to the current assignment of labels to
vertices, decide whether to put this edge in the
spanner.
- Depending on the type of edge, possibly assign
more labels to one of its endpoints.
- Next, an example with t = log n
23Labels
(2,2) (2,7)
(1,2) (1,4) (1,7) (1,11)
(0,1) (0,2) (0,3) (0,4) (0,5) (0,6) (0,7) (0,8) (0,9) (0,10) (0,11) (0,12)
- log n / 2 levels
- w.h.p., there are top-level labels.
- Semantics of labels
- The set of vertices assigned the same top-level
label forms a cluster.
- The set of vertices assigned the same lower-level
label forms a pre-cluster.
24Initial Label Assignment
(2,2) (2,7)
(1,2) (1,4) (1,7) (1,11)
(0,1) (0,2) (0,3) (0,4) (0,5) (0,6) (0,7) (0,8) (0,9) (0,10) (0,11) (0,12)
v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11 v12
25On arrival of an edge
- Already know what to do with
- Intra-cluster/pre-cluster edges
- Inter-cluster edges
- Edges connecting pre-clusters: the sticky edges
- They are added to the spanner.
- They may lead to new label assignment and cluster
growth.
26Good Neighbor (1)
(3,2) (2,2) (1,2) (0,2)
(3,2)
(2,2)
Has marked labels
(1,6) (0,6)
v
u
27Good Neighbor (2)
C(3,2)
C(2,2)
C(1,2)
C(1,6)
v
u
28Bad Neighbor
No marked labels
(1,6)
(3,2)
v
u
29Properties of the Clusters
- Small diameter
- Number of clusters bounded by .
- Do not need to cover the whole graph with
clusters, but the uncovered subgraph is sparse.
The uncovered subgraph consists of sticky edges,
and there are not too many of them.
30Sticky Edges are Rare
u1
v
u1, u2, u3, u4
u4
u2
u3
- A neighbor is good with probability at least ½.
- After seeing at most log n / 2 good neighbors, v
will be assigned a top-level label and be
included in a cluster. No more sticky edges for
v.
- The number of sticky edges can be bounded by the
length of the shortest prefix in the above
sequence that contains log n / 2 good neighbors.
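The prefix bound can be made concrete with a small helper. This is a sketch under stated assumptions: the logarithm is taken base 2, the threshold is rounded up, and the input is a list of booleans saying whether each successive neighbor of v is good. The name `sticky_edge_bound` is hypothetical. Since each neighbor is good with probability at least 1/2, the expected length of this prefix is at most 2 · (log n / 2) = log n.

```python
import math


def sticky_edge_bound(neighbor_is_good, n):
    """Length of the shortest prefix containing ceil(log2(n) / 2) good neighbors.

    If the sequence never reaches that many good neighbors, every edge seen
    so far may be sticky, so the full length is returned.
    """
    needed = math.ceil(math.log2(n) / 2)
    good = 0
    for i, is_good in enumerate(neighbor_is_good, start=1):
        good += bool(is_good)
        if good >= needed:
            return i
    return len(neighbor_is_good)
```

With n = 16 the threshold is 2 good neighbors, so the sequence bad, good, bad, good, good gives a bound of 4 sticky edges for v.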
314. Lower Bounds
32One-pass diameter lower bound
- Theorem: For any γ ∈ (0,1), any one-pass algorithm
that returns a k-approximation (k slightly better than 1/γ) to the
diameter in a weighted graph requires Ω(n^(1+γ)) space.
- Proof (Sketch)
- Some properties of a random graph G in G(n,p) with
p = 1/n^(1−γ):
- w.h.p. it contains a set E' of edges, |E'| = n^(1+γ)/64, such that
- no edge in E' is in a cycle of length k or less.
- When all edges in E' are removed, the graph still
has diameter < 2/γ
Fix one such G = (V, E ∪ E')
33
- Sketch (contd): Reduce from INDEX (hard for
communication complexity)
- INDEX: Alice has an m-bit string x and Bob has an index
i. The one-way communication complexity for Bob
to learn x_i is Ω(m).
- Reduction: the m edges in E' are enumerated 1 .. m.
- Alice constructs the prefix of the stream, corresponding
to multiple copies of H = (V, E ∪ E''), where E'' ⊆ E' are the
edges whose indices i have x_i = 1. All of Alice's edges
have weight 1.
- Bob constructs the rest of the stream. If his index
corresponds to edge (a,b) in E':
- He connects vertex b in one copy with vertex a
in the next copy at weight 0
- He also creates a source s and a sink t, and connects s
to a in the first copy and b in the last copy to t at
high weight.
- Properties: if x_i = 1, where i is Bob's index, then
small diameter; else large diameter.
- Small-space streaming violates the communication lower bound.
34Open Problems
- Are there interesting subclasses of graphs for
which distances and diameters are easier in the
streaming model?
- Is there a more generous but reasonable model?
35Network Intrusion Detection Systems
- Current techniques fairly primitive
- Misuse: pattern-match packets against misuse
signatures in a database
- Anomaly: look for statistical anomalies in
individual packet headers and payloads
- Needed
- Look across multiple packets for intrusions
- Deal with interleaved traffic
36An Example Browsing habits
- You read sports and cartoons. You're equally
likely to read either. You do not remember what you
read last.
- You'd expect a random sequence
SCSSCSSCSSCCSCCCSSSSCSC
37Two readers
- I like health, entertainment, and politics
- I always read entertainment first, health next,
and politics last
- The sequence would be
- EHPEHPEHPEHPEHPEHPEHP
38Two readers, one log file
- If there is one log file
- Assume there is no correlation between us
SECHSSPECSHPESCSSHCPCESCHCCPSESHPESSHPE
Is there enough information to tell that there
are two people browsing? What are they browsing?
How are they browsing?
39Clues in stream?
- Yes, under model assumptions.
- H, E, P have a special relationship.
- They cannot belong to different (uncorrelated)
people.
- Not clear about S and C: these could
be two people or one person.
SECHSSPECSHPESCSSHCPCESCHCCPSESHPESSHPE
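One way to see the special relationship among H, E, and P is to project the merged log onto a subset of symbols and count which transitions occur. This is a minimal sketch of that check, not the analysis from the talk; the helper names `project` and `transition_counts` are hypothetical.

```python
from collections import Counter


def project(stream, symbols):
    """Keep only the characters of `stream` that belong to `symbols`."""
    return "".join(ch for ch in stream if ch in symbols)


def transition_counts(seq):
    """Count each ordered pair of consecutive symbols in `seq`."""
    return Counter(zip(seq, seq[1:]))


log = "SECHSSPECSHPESCSSHCPCESCHCCPSESHPESSHPE"
ehp = project(log, set("EHP"))  # the deterministic reader's pages
sc = project(log, set("SC"))    # the random reader's pages
```

Restricted to {E, H, P}, the log is a strictly periodic EHP sequence with only three distinct transitions, whereas the {S, C} projection shows all four possible transitions, which is consistent with a memoryless reader.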
40Markov Chains as Stochastic Sources
[Figure: a Markov chain on states 1 to 7, with transition probabilities labeling the edges]
Output sequence: 1 4 7 7 1 2 5 7 ...
41Markov chains on S,E,C,H,F
Modeled by
[Figure: Markov chain on states E, H, F in which every transition has probability 1]
42
- Need more realistic generalizations of such
analysis to deal with
- Worm detection
- Anomaly detection at high-traffic links in a
network
- TCP compliance
- BGP policy behavior
43Partial Solution Clusters (1)
- A cluster is a subset of vertices and a small-diameter
spanning tree built on these vertices.
- Intra-cluster edge
44Partial Solution Clusters (2)
Bollobás's result no longer applies. Need to
control the number of clusters (i.e., make it
).
45Open Shortest Path First (OSPF)
- Packet routing protocol
- Each link broadcasts its weight (initially could
be 1/bw...)
- To route from A to B, each router sends along the
shortest path to B, dividing traffic evenly if
there are many shortest paths.
- Adjustments
- A human operator observing congestion on a link
could raise its weight
- Local decisions could lead to oscillation and
suboptimality
- Link latency: a convex function of its utilization
- Goal: minimize max link latency, total link
latency, expected path latency, etc.
- Exact optimizations are typically NP-hard
46Streaming problem
- Can we automate the weight adjustments?
- Simple scenario
- Assume weights have been optimized for the current
traffic matrix
- Assume we now have a new (unknown) traffic
matrix observed at routers
- Assume some simple goal ... minimize time to
converge to new solution ... or something ...
- Streaming algorithm should itself be allowed to
generate traffic for communication between
monitors and for diagnostics, but this overhead should be low.
47Early Worm Detection
- EarlyBird System (Singh et al.) identifies the
following characteristics
- Substantial volume of identical traffic
- Rising infection levels (numbers of sources and destinations
increasing)
- Random probing (infected source tries many IP
addresses)
- 1. A top-k type streaming algorithm can identify
a high volume of identical traffic at one location.
- Can we do better in distributed fashion?
- 2. How do we communicate to detect rising infection
levels?
- 3. Sophisticated worms may not use random
probing.
- What other discriminating tests are possible?
- 4. Sophisticated worms are polymorphic: not
identical traffic.
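As one concrete instance of a top-k type streaming algorithm for point 1, the sketch below uses the Misra-Gries frequent-items summary. This is a swapped-in standard technique chosen for illustration, not a description of what EarlyBird does; the function name `misra_gries` and the idea of feeding it payload fingerprints are assumptions.

```python
def misra_gries(stream, k):
    """Keep at most k - 1 counters.

    Any item occurring more than len(stream) / k times is guaranteed to
    survive in the returned dictionary (its count is an underestimate).
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement everything and drop zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

For a stream of six identical "worm" payloads followed by four distinct ones, `misra_gries(stream, 3)` retains only the "worm" entry, using two counters regardless of how many distinct payloads appear.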