Title: Data Stream Algorithms: Intro, Sampling, Entropy
1. Data Stream Algorithms: Intro, Sampling, Entropy
Graham Cormode graham_at_research.att.com
2. Outline
- Introduction to Data Streams
- Motivating examples and applications
- Data Streaming models
- Basic tail bounds
- Sampling from data streams
- Sampling to estimate entropy
3. Data is Massive
- Data is growing faster than our ability to store or index it
- There are 3 billion telephone calls in the US each day, 30 billion emails daily, 1 billion SMS and IMs
- Scientific data: NASA's observation satellites generate billions of readings each per day
- IP network traffic: up to 1 billion packets per hour per router, and each ISP has many (hundreds of) routers!
- Whole-genome sequences for many species are now available, each megabytes to gigabytes in size
4. Massive Data Analysis
- Must analyze this massive data:
- Scientific research (monitor environment, species)
- System management (spot faults, drops, failures)
- Customer research (association rules, new offers)
- Revenue protection (phone fraud, service abuse)
- Else, why even measure this data?
5. Example: Network Data
- Networks are sources of massive data: the metadata per hour per router is gigabytes
- Fundamental problem of data stream analysis: too much information to store or transmit
- So process data as it arrives, in one pass and small space: the data stream approach
- Approximate answers to many questions are OK, if there are guarantees of result quality
6. IP Network Monitoring Application
Example: NetFlow IP session data
- 24x7 IP packet/flow data streams at network elements
- Truly massive streams arriving at rapid rates
- AT&T/Sprint collect 1 Terabyte of NetFlow data each day
- Often shipped off-site to a data warehouse for off-line analysis
7. Packet-Level Data Streams
- Single 2Gb/sec link; say the average packet size is 50 bytes
- Number of packets/sec: 5 million
- Time per packet: 0.2 microseconds
- If we capture only header information per packet (src/dest IP, time, number of bytes, etc.): at least 10 bytes
- Space per second is 50MB
- Space per day is 4.5TB per link
- ISPs typically have hundreds of links!
- Analyzing packet content streams is order(s) of magnitude harder
8. Network Monitoring Queries
Off-line analysis is slow and expensive
[Figure: Network Operations Center (NOC) collecting data from routers R1, R2, R3, which connect peers, enterprise networks, PSTN, and DSL/cable networks]
9. Streaming Data Questions
- Network managers ask questions requiring us to analyze the data:
- How many distinct addresses are seen on the network?
- Which destinations or groups use the most bandwidth?
- Find hosts with similar usage patterns?
- Extra complexity comes from limited space and time
- Will introduce solutions for these and other problems
10. Other Streaming Applications
- Sensor networks
- Monitor habitat and environmental parameters
- Track many objects, intrusions, trend analysis
- Utility Companies
- Monitor power grid, customer usage patterns etc.
- Alerts and rapid response in case of problems
11. Streams Defining Frequency Distributions
- We will consider streams that define frequency distributions
- E.g. frequency of packets from source A to destination B
- This simple setting captures many of the core algorithmic problems in data streaming:
- How many distinct (non-zero) values are seen?
- What is the entropy of the frequency distribution?
- What (and where) are the highest frequencies?
- More generally, can consider streams that define multi-dimensional distributions, graphs, geometric data etc.
- But even for frequency distributions, several models are relevant
12. Data Stream Models
- We model data streams as sequences of simple tuples
- Complexity arises from the massive length of streams
- Arrivals-only streams:
- Example: (x, 3), (y, 2), (x, 2) encodes the arrival of 3 copies of item x, 2 copies of y, then 2 copies of x
- Could represent, e.g., packets on a network, or power usage
- Arrivals and departures:
- Example: (x, 3), (y, 2), (x, -2) encodes a final state of (x, 1), (y, 2)
- Can represent fluctuating quantities, or measure differences between two distributions
13. Approximation and Randomization
- Many things are hard to compute exactly over a stream
- Is the count of all items the same in two different streams?
- Requires linear space to compute exactly
- Approximation: find an answer correct within some factor
- Find an answer that is within 10% of the correct result
- More generally, a (1 ± ε) factor approximation
- Randomization: allow a small probability of failure
- Answer is correct, except with probability 1 in 10,000
- More generally, success probability (1 - δ)
- Approximation and randomization together: (ε, δ)-approximations
14. Basic Tools: Tail Inequalities
- General bounds on the tail probability of a random variable (the probability that a random variable deviates far from its expectation)
- Basic inequalities: let X be a random variable with expectation μ and variance Var[X]. Then, for any k > 0, the Markov and Chebyshev inequalities bound the tails (stated on the next slide)
15. Tail Bounds
- Markov Inequality:
- For a random variable Y which takes only non-negative values:
- Pr[Y ≥ k] ≤ E(Y)/k
- (This is < 1 only for k > E(Y))
- Chebyshev's Inequality:
- For any random variable Y:
- Pr[|Y - E(Y)| ≥ k] ≤ Var(Y)/k²
- Proof: set X = (Y - E(Y))²
- E(X) = E(Y² + E(Y)² - 2Y E(Y)) = E(Y²) - E(Y)² = Var(Y)
- So Pr[|Y - E(Y)| ≥ k] = Pr[(Y - E(Y))² ≥ k²]
- Using Markov: ≤ E[(Y - E(Y))²]/k² = Var(Y)/k²
16. Outline
- Introduction to Data Streams
- Motivating examples and applications
- Data Streaming models
- Basic tail bounds
- Sampling from data streams
- Sampling to estimate entropy
17. Sampling From a Data Stream
- Fundamental problem: sample m items uniformly from a stream
- Useful: approximate a costly computation on a small sample
- Challenge: we don't know how long the stream is
- So when/how often to sample?
- Two solutions, which apply to different situations:
- Reservoir sampling (dates from the 1980s?)
- Min-wise sampling (dates from the 1990s?)
18. Reservoir Sampling
- Sample the first m items
- Choose to sample the i-th item (i > m) with probability m/i
- If sampled, it randomly replaces a previously sampled item
- Optimization: when i gets large, compute which item will be sampled next, and skip over the intervening items [Vitter 85] (code sketch below)
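A minimal Python sketch of the basic algorithm (illustrative code, not from the original slides; it omits Vitter's skipping optimization):

    import random

    def reservoir_sample(stream, m):
        """Maintain a uniform sample of m items from a stream of unknown length."""
        sample = []
        for i, item in enumerate(stream, start=1):
            if i <= m:
                sample.append(item)               # keep the first m items
            elif random.random() < m / i:         # sample item i with probability m/i
                sample[random.randrange(m)] = item  # evict a uniformly chosen entry
        return sample

    print(reservoir_sample(range(10**6), 5))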
19. Reservoir Sampling - Analysis
- Analyze the simple case: sample size m = 1
- Probability the i-th item is the sample from a stream of length n:
- Prob. i is sampled on arrival × prob. i survives to the end
- = 1/i × i/(i+1) × (i+1)/(i+2) × ... × (n-1)/n = 1/n
- The case m > 1 is similar; easy to show uniform probability
- Drawback of reservoir sampling: hard to parallelize
20. Min-wise Sampling
- For each item, pick a random fraction (tag) between 0 and 1
- Store the item(s) with the smallest random tag [Nath et al. 04]
- Example tags: 0.391, 0.908, 0.291, 0.555, 0.619, 0.273
- Each item has the same chance of having the least tag, so the sample is uniform
- Can run on multiple streams separately, then merge (code sketch below)
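A minimal Python sketch of min-wise sampling, including the merge step (illustrative, not from the slides):

    import random

    def minwise_sample(stream):
        """Keep the item with the smallest random tag; uniform over the stream."""
        best_tag, best_item = float('inf'), None
        for item in stream:
            tag = random.random()
            if tag < best_tag:
                best_tag, best_item = tag, item
        return best_tag, best_item

    # Merging: the sample of the union of two streams is the one with the smaller tag.
    a = minwise_sample(range(100))
    b = minwise_sample(range(100, 200))
    print(min(a, b)[1])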
21. Sampling Exercises
- What happens when each item in the stream also has a weight attached, and we want to sample based on these weights?
- Generalize the reservoir sampling algorithm to draw a single sample in the weighted case.
- Generalize reservoir sampling to sample multiple weighted items, and show an example where it fails to give a meaningful answer.
- Research problem: design new streaming algorithms for sampling in the weighted case, and analyze their properties.
22. Outline
- Introduction to Data Streams
- Motivating examples and applications
- Data Streaming models
- Basic tail bounds
- Sampling from data streams
- Sampling to estimate entropy
23. Application of Sampling: Entropy
- Given a long sequence of characters
- S = <a1, a2, a3, ..., am>, each aj ∈ {1 ... n}
- Let fi = frequency of i in the sequence
- Compute the empirical entropy:
- H(S) = -Σi (fi/m) log(fi/m) = -Σi pi log pi
- Example: S = <a, b, a, b, c, a, d, a>
- pa = 1/2, pb = 1/4, pc = 1/8, pd = 1/8
- H(S) = ½·1 + ¼·2 + ⅛·3 + ⅛·3 = 7/4
- Entropy is promoted for anomaly detection in networks
24. Challenge
- Goal: approximate H(S) in space sublinear (poly-log) in m (stream length) and n (alphabet size)
- (ε,δ) approx: the answer is (1±ε)H(S) with probability 1-δ
- Easy if we have O(n) space: compute each fi exactly
- More challenging if n is huge, m is huge, and we have only one pass over the input in order
- (The data stream model)
25. Sampling Based Algorithm
- Simple estimator (code sketch below):
- Randomly sample a position j in the stream
- Count how many times aj appears subsequently: r
- Output X = -(r log(r/m) - (r-1) log((r-1)/m))
- Claim: the estimator is unbiased, E[X] = H(S)
- Proof: prob. of picking j is 1/m, and the sum telescopes correctly
- Variance of the estimate is not too large: Var[X] = O(log² m)
- Observe that |X| ≤ log m
- Var[X] = E[(X - E[X])²] ≤ (max(X) - min(X))² = O(log² m)
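A toy Python sketch of this basic estimator; it stores the stream in memory for clarity, whereas a true one-pass version would pair a position sample (e.g. reservoir sampling) with a running counter:

    import math, random

    def entropy_estimate(s):
        """One unbiased estimate of the empirical entropy H(S) (log base 2)."""
        m = len(s)
        j = random.randrange(m)          # sample a uniform position
        r = s[j:].count(s[j])            # occurrences of s[j] from position j onwards
        x = -r * math.log2(r / m)
        if r > 1:                        # the (r-1) term vanishes when r = 1
            x += (r - 1) * math.log2((r - 1) / m)
        return x

    # Averaging many copies approaches H(S) = 7/4 for this example stream
    print(sum(entropy_estimate("ababcada") for _ in range(10000)) / 10000)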
26. Analysis of Basic Estimator
- A general technique in data streams:
- Repeat an unbiased estimator with bounded variance in parallel, and take the average of the estimates to improve the variance
- Var[1/k (Y1 + Y2 + ... + Yk)] = 1/k Var[Y]
- By Chebyshev, the number of repetitions needed is k = O(Var[X]/(ε² E²[X]))
- For entropy, this means space k = O(log² m / (ε² H²(S)))
- Problem for entropy: what when H(S) is very small?
- The space needed for an accurate approximation grows as 1/H²!
27. Low Entropy
- But... what does a low entropy stream look like?
- aaaaaaaaaaaaaaaaaaaaaaaaaaaaaabaaaaa
- Very boring: most of the time, we are only rarely surprised
- Can there be two frequent items?
- aabababababababaababababbababababababa
- No! That's high entropy (≈ 1 bit / character)
- The only way to get H(S) = o(1) is to have only one character with pi close to 1
28. Removing the Frequent Character
- Write entropy as
- H(S) = -pa log pa + (1-pa) H(S')
- where S' = stream S with all a's removed
- Can show:
- It doesn't matter if H(S') is small: as pa is large, additive error on H(S') ensures relative error on (1-pa)H(S')
- Relative error (1-pa) on pa gives relative error on pa log pa
- Summing both (positive) terms gives relative error overall
29. Finding the Frequent Character
- Ejecting a is easy if we know in advance what it is
- Can then compute pa exactly
- Can find a online, deterministically:
- Assume pa > 2/3 (if not, H(S) > 0.9, and the original algorithm works)
- Run a heavy hitters algorithm on the stream (see later)
- Modify the analysis to find a and pa ± ε(1-pa)
- But... how to also compute H(S') simultaneously if we don't know a from the start... do we need two passes?
30. Always Have a Back-up Plan...
- Idea: keep two samples to build our estimator
- If at the end one of our samples is a, use the other
- How to do this and ensure uniform sampling?
- Pick the first sample with min-wise sampling
- At the end of the stream, if the sampled character = a, we want a sample from the stream ignoring all a's
- This is just the character achieving the smallest label distinct from the one that achieves the smallest label
- Can track the information to do this in a single pass, constant space
31. Sampling Two Tokens
[Figure: a stream of tokens (B, C, D, B, B, B, A, A, A, A, A, C), each occurrence given a random tag; the first sample is the token with the overall minimum tag, and after deleting all its occurrences the minimum remaining tag identifies the second sample, along with its repeat count]
- Assign tags, choose the first token as before
- Delete all occurrences of the first token
- Choose the token with the min remaining tag, and count its repeats
- Implementation: keep track of two triples (code sketch below)
- (min tag, corresponding token, number of repeats)
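A Python sketch of this two-triple tracking; the update rules are a plausible constant-space implementation of the idea, not the paper's exact pseudocode:

    import random

    def sample_two_tokens(stream):
        """One pass, constant space: track (tag, token, repeats) for the min-tag
        token, and for the min-tag token among tokens distinct from the first."""
        t0, a0, r0 = float('inf'), None, 0   # first sample
        t1, a1, r1 = float('inf'), None, 0   # second sample (token != a0)
        for token in stream:
            t = random.random()
            if token == a0:
                if t < t0: t0, r0 = t, 1     # sample position moves; restart count
                else: r0 += 1
            elif t < t0:
                t1, a1, r1 = t0, a0, r0      # old first becomes the second sample
                t0, a0, r0 = t, token, 1
            elif token == a1:
                if t < t1: t1, r1 = t, 1
                else: r1 += 1
            elif t < t1:
                t1, a1, r1 = t, token, 1
        return (a0, r0), (a1, r1)

    print(sample_two_tokens("BCDBBBAAAAAC"))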
32. Putting It All Together
- Can combine all these pieces
- Build an estimator based on tracking this information, deciding whether there is a frequent character or not
- A more involved Chernoff bounds argument improves the number of repetitions of the estimator from O(ε⁻² Var[X]/E²[X]) to O(ε⁻² Range[X]/E[X]) = O(ε⁻² log m)
- In O(ε⁻² log m log 1/δ) space (words) we can compute an (ε,δ) approximation to H(S) in a single pass
33. Entropy Exercises
- As a subroutine, we need to find an element that occurs more than 2/3 of the time and estimate its weight
- How can we find a frequently occurring item?
- How can we estimate its weight p with ε(1-p) error?
- Our algorithm uses O(ε⁻² log m log 1/δ) space; could this be improved, or is it optimal (lower bounds)?
- Our algorithm updates each sampled pair for every update; how quickly can we implement it?
- (Research problem) What if there are multiple distributed streams and we want to compute the entropy of their union?
34. Outline
- Introduction to Data Streams
- Motivating examples and applications
- Data Streaming models
- Basic tail bounds
- Sampling from data streams
- Sampling to estimate entropy
35. Data Stream Algorithms: Frequency Moments
Graham Cormode graham_at_research.att.com
36. Frequency Moments
- Introduction to Frequency Moments and Sketches
- Count-Min sketch for F∞ and frequent items
- AMS Sketch for F2
- Estimating F0
- Extensions
- Higher frequency moments
- Combined frequency moments
37. Last Time
- Introduced data streams and data stream models
- Focus on a stream defining a frequency distribution
- Sampling to draw a uniform sample from the stream
- Entropy estimation based on sampling
38. This Time: Frequency Moments
- Given a stream of updates, let fi be the number of times that item i is seen in the stream
- Define Fk of the stream as Σi (fi)^k, the k-th frequency moment
- "The Space Complexity of Approximating the Frequency Moments" by Alon, Matias, Szegedy in STOC 1996 studied this problem
- Awarded the Gödel prize in 2005
- Set the pattern for much streaming work to follow
- Frequency moments are at the core of many streaming problems
39. Frequency Moments
- F0: count 1 if fi ≠ 0, i.e. the number of distinct items
- F1: length of the stream, easy
- F2: sum of the squares of the frequencies, the self-join size
- Fk: related to statistical moments of the distribution
- F∞ (really lim k→∞ Fk^{1/k}) is dominated by the largest fi, so finds the largest frequency
- Different techniques are needed for each one
- Mostly sketch techniques, which compute a certain kind of random linear projection of the stream
40. Sketches
- Not every problem can be solved with sampling
- Example: counting how many distinct items are in the stream
- If a large fraction of items aren't sampled, we don't know if they are all the same or all different
- Other techniques take advantage of the fact that the algorithm can see all the data, even if it can't remember it all
- (To me) a sketch is a linear transform of the input
- Model the stream as defining a vector; the sketch is the result of multiplying the stream vector by an (implicit) matrix
[Figure: the sketch as a linear projection of the stream vector]
41. Trivial Example of a Sketch
x = 1 0 1 1 1 0 1 0 1
y = 1 0 1 1 0 0 1 0 1
- Test if two (asynchronous) binary streams are equal: d(x,y) = 0 iff x = y, 1 otherwise
- To test in small space: pick a random hash function h
- Test h(x) = h(y): small chance of a false positive, no chance of a false negative
- Compute h(x), h(y) incrementally as new bits arrive (e.g. h(x) = Σ xi t^i mod p for random prime p, and t < p); a code sketch follows
- Exercise: extend to real-valued vectors in the update model
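A minimal Python sketch of this incremental polynomial hash; the modulus P and evaluation point T are illustrative choices:

    import random

    P = (1 << 61) - 1           # a large prime modulus (2^61 - 1)
    T = random.randrange(2, P)  # random evaluation point t < p

    class StreamHash:
        """Incrementally compute h(x) = sum_i x_i * t^i mod p over arriving bits."""
        def __init__(self):
            self.h, self.tpow = 0, 1
        def add_bit(self, bit):
            self.h = (self.h + bit * self.tpow) % P
            self.tpow = (self.tpow * T) % P

    hx, hy = StreamHash(), StreamHash()
    for bx, by in zip([1,0,1,1,1,0,1,0,1], [1,0,1,1,0,0,1,0,1]):
        hx.add_bit(bx); hy.add_bit(by)
    print(hx.h == hy.h)   # equal hashes => streams equal (w.h.p.); here False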
42. Frequency Moments
- Introduction to Frequency Moments and Sketches
- Count-Min sketch for F∞ and frequent items
- AMS Sketch for F2
- Estimating F0
- Extensions
- Higher frequency moments
- Combined frequency moments
43. Count-Min Sketch
- Simple sketch idea that can be used as the basis of many different stream mining tasks
- Model the input stream as a vector x of dimension U
- Creates a small summary as an array of size w × d
- Uses d hash functions to map vector entries to [1..w]
- Works on arrivals-only and arrivals-and-departures streams
[Figure: array CM[i,j] with d rows and w columns]
44. Count-Min Sketch Structure
[Figure: update (j,c) mapped into an array of d = log 1/δ rows and w = 2/ε columns]
- Each entry in vector x is mapped to one bucket per row
- Merge two sketches by entry-wise summation
- Estimate x[j] by taking min_k CM[k, hk(j)]
- Guarantees error less than εF1, in size O(1/ε log 1/δ)
- Probability of larger error is less than δ
[C, Muthukrishnan 04] (code sketch below)
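A compact Python sketch of the Count-Min structure; Python's built-in tuple hash stands in for the pairwise-independent hash functions the analysis assumes:

    import math, random

    class CountMin:
        """Count-Min sketch: d = ceil(log2(1/delta)) rows, w = ceil(2/eps) columns."""
        def __init__(self, eps, delta):
            self.w = math.ceil(2 / eps)
            self.d = math.ceil(math.log2(1 / delta))
            self.table = [[0] * self.w for _ in range(self.d)]
            # seeded built-in hashing as a stand-in for pairwise-independent hk
            self.seeds = [random.randrange(1 << 31) for _ in range(self.d)]
        def update(self, j, c=1):
            for k, s in enumerate(self.seeds):
                self.table[k][hash((s, j)) % self.w] += c
        def estimate(self, j):
            return min(self.table[k][hash((s, j)) % self.w]
                       for k, s in enumerate(self.seeds))

    cm = CountMin(eps=0.01, delta=0.01)
    for item in "abracadabra":
        cm.update(item)
    print(cm.estimate("a"))   # >= true count 5; within eps*F1 w.h.p.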
45. Approximation of Point Queries
- Approximate point query: x'[j] = min_k CM[k, hk(j)]
- Analysis: in the k-th row, CM[k, hk(j)] = x[j] + Xk,j
- Xk,j = Σ x[i] over i ≠ j with hk(i) = hk(j)
- E(Xk,j) = Σ_{i≠j} x[i] Pr[hk(i) = hk(j)] ≤ Pr[hk(i) = hk(j)] × Σ x[i] = εF1/2, by pairwise independence of hk
- Pr[Xk,j ≥ εF1] = Pr[Xk,j ≥ 2E(Xk,j)] ≤ 1/2, by the Markov inequality
- So Pr[x'[j] ≥ x[j] + εF1] = Pr[∀k: Xk,j > εF1] ≤ (1/2)^{log 1/δ} = δ
- Final result: with certainty x[j] ≤ x'[j], and with probability at least 1-δ, x'[j] < x[j] + εF1
46. Applications of Count-Min to F∞
- The Count-Min sketch lets us estimate fi for any i (up to εF1)
- F∞ asks to find max_i fi
- Slow way: test every i after creating the sketch
- Faster way: test every i after it is seen in the stream, and remember the largest estimated value
- Alternate way:
- keep a binary tree over the domain of input items, where each node corresponds to a subset
- keep sketches of all nodes at the same level
- descend the tree to find large frequencies, discarding branches with low frequency
47. Count-Min Exercises
- The median of a distribution is the item such that the sum of the frequencies of lexicographically smaller items is ½F1. Use the CM sketch to find the (approximate) median.
- Assume the input frequencies follow the Zipf distribution, so that the i-th largest frequency is proportional to i^{-z} for z > 1. Show that the CM sketch only needs size O(ε^{-1/z}) to give the same guarantee.
- Suppose we have arrival and departure streams where the frequencies of items are allowed to be negative. Extend the CM sketch analysis to estimate these frequencies (note: the Markov argument no longer works).
- How to find the large absolute frequencies when some are negative? Or in the difference of two streams?
48. Frequency Moments
- Introduction to Frequency Moments and Sketches
- Count-Min sketch for F∞ and frequent items
- AMS Sketch for F2
- Estimating F0
- Extensions
- Higher frequency moments
- Combined frequency moments
49. F2 Estimation
- AMS sketch (for Alon-Matias-Szegedy) proposed in 1996
- Allows estimation of F2 (the second frequency moment)
- Used at the heart of many streaming and non-streaming mining applications: achieves dimensionality reduction
- Here, describe the AMS sketch by generalizing the CM sketch
- Uses extra hash functions g1 ... g_{log 1/δ}: [1...U] → {+1, -1}
- Now, given update (j,c), set CM[k, hk(j)] += c·gk(j)
[Figure: the AMS sketch as a linear projection]
50. F2 Analysis
[Figure: update (j,c) mapped into an array of d = 8 log 1/δ rows and w = 4/ε² columns]
- Estimate F2 = median_k Σ_i CM[k,i]²
- Each row's result is Σ_i g(i)² xi² + Σ_{h(i)=h(j), i≠j} 2 g(i) g(j) xi xj
- But g(i)² = (±1)² = 1, and Σ_i xi² = F2
- g(i)g(j) has ½ chance of being +1 or -1: expectation is 0
(A code sketch follows below.)
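A Python sketch of the AMS estimator built on the same structure; the built-in hash again stands in for the pairwise/four-wise independent functions the analysis requires:

    import math, random, statistics

    class AMS:
        """AMS sketch via the Count-Min structure with {+1,-1} hashes gk."""
        def __init__(self, eps, delta):
            self.w = math.ceil(4 / eps**2)
            self.d = math.ceil(8 * math.log2(1 / delta))
            self.table = [[0] * self.w for _ in range(self.d)]
            self.seeds = [random.randrange(1 << 31) for _ in range(self.d)]
        def update(self, j, c=1):
            for k, s in enumerate(self.seeds):
                b = hash((s, 'h', j)) % self.w          # bucket hk(j)
                g = 1 if hash((s, 'g', j)) % 2 else -1  # sign gk(j)
                self.table[k][b] += c * g
        def f2_estimate(self):
            return statistics.median(sum(v * v for v in row) for row in self.table)

    ams = AMS(eps=0.25, delta=0.05)
    for item in "aabbbcd":
        ams.update(item)
    print(ams.f2_estimate())   # true F2 = 4 + 9 + 1 + 1 = 15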
51. F2 Variance
- Expectation of the row estimate Rk = Σ_i CM[k,i]² is exactly F2
- Variance of row k, Var[Rk], is an expectation:
- Var[Rk] = E[(Σ_buckets b CM[k,b]² - F2)²]
- Good exercise in algebra: expand this sum and simplify
- Many terms are zero in expectation because of terms like g(a)g(b)g(c)g(d) (degree at most 4)
- Requires that hash function g is four-wise independent: it behaves uniformly over subsets of size four or smaller
- Such hash functions are easy to construct
52. F2 Variance
- Terms with odd powers of g(a) are zero in expectation:
- g(a)g(b)g²(c), g(a)g(b)g(c)g(d), g(a)g³(b)
- This leaves Var[Rk] ≤ Σ_i g⁴(i) xi⁴ + 2 Σ_{j≠i} g²(i) g²(j) xi² xj² + 4 Σ_{h(i)=h(j)} g²(i) g²(j) xi² xj² - (Σ_i xi⁴ + 2 Σ_{j≠i} xi² xj²)
- The row variance can finally be bounded by F2²/w
- Chebyshev for w = 4/ε² gives probability ¼ of failure
- How to amplify this to a small δ probability of failure?
53. Tail Inequalities for Sums
- We can derive stronger bounds on tail probabilities for the sum of independent Bernoulli trials via the Chernoff bound:
- Let X1, ..., Xm be independent Bernoulli trials s.t. Pr[Xi = 1] = p (Pr[Xi = 0] = 1-p)
- Let X = Σ_{i=1}^m Xi, and let μ = mp be the expectation of X
- Then, for any δ > 0, Pr[|X - μ| > δμ] < 2 exp(-μδ²/3) (one standard form of the bound)
54. Applying Chernoff Bound
- Each row gives an estimate that is within ε relative error with probability p > ¾
- Take d repetitions and find the median. Why the median? (code sketch below)
- Because bad estimates are either too small or too large
- Good estimates form a contiguous group in the middle
- At least d/2 estimates must be bad for the median to be bad
- Apply the Chernoff bound to the d independent estimates, with p = 3/4:
- Pr[more than d/2 bad estimates] < 2 exp(-d/8)
- So we set d = Θ(ln 1/δ) to give δ probability of failure
- The same outline is used many times in data streams
55. Aside on Independence
- Full independence is expensive in a streaming setting
- If hash functions are fully independent over n items, then we need Ω(n) space to store their description
- Pairwise and four-wise independent hash functions can be described in a constant number of words
- The F2 algorithm uses a careful mix of limited and full independence:
- Each hash function is four-wise independent over all n items
- Each repetition is fully independent of all others, but there are only O(log 1/δ) repetitions
56. AMS Sketch Exercises
- Let x and y be binary streams of length n. The Hamming distance H(x,y) = |{i : xi ≠ yi}|. Show how to use AMS sketches to approximate H(x,y).
- Extend to strings drawn from an arbitrary alphabet.
- The inner product of two strings x, y is x · y = Σ_{i=1}^n xi yi. Use AMS sketches to estimate x · y.
- Hint: try computing the inner product of the sketches. Show the estimator is unbiased (correct in expectation).
- What form does the error in the approximation take?
- Use Count-Min sketches for the same problem and compare the errors.
- Is it possible to build a (1±ε) approximation of x · y?
57. Frequency Moments
- Introduction to Frequency Moments and Sketches
- Count-Min sketch for F∞ and frequent items
- AMS Sketch for F2
- Estimating F0
- Extensions
- Higher frequency moments
- Combined frequency moments
58. F0 Estimation
- F0 is the number of distinct items in the stream
- A fundamental quantity with many applications
- Early algorithms by Flajolet and Martin [1983] gave a nice hashing-based solution
- The analysis assumed fully independent hash functions
- Will describe a generalized version of the FM algorithm due to Bar-Yossef et al. with only pairwise independence
59. F0 Algorithm
- Let [m] be the domain of stream elements
- Each item in the stream is from [1...m]
- Pick a random hash function h: [m] → [m³]
- With probability at least 1 - 1/m, there are no collisions under h
- For each stream item i, compute h(i), and track the t distinct items achieving the smallest values of h(i)
- Note: if the same i is seen many times, h(i) is the same
- Let vt = t-th smallest value of h(i) seen
- If F0 < t, give the exact answer; else estimate F0' = tm³/vt (code sketch below)
- vt/m³ ≈ fraction of the hash domain [0...m³] occupied by the t smallest values
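A Python sketch of this "t smallest hash values" algorithm; the hash range M and seeded tuple hashing are illustrative stand-ins for h: [m] → [m³]:

    import heapq, random

    def f0_estimate(stream, t=400):
        """Track the t smallest hash values; estimate F0 = t*M/vt."""
        M = 1 << 61                       # stands in for the hash range m^3
        seed = random.randrange(M)
        smallest = []                     # negated values: a max-heap of the t smallest
        kept = set()                      # hash values currently retained
        for i in stream:
            hv = hash((seed, i)) % M
            if hv in kept:
                continue                  # duplicates of i hash to the same value
            if len(smallest) < t:
                heapq.heappush(smallest, -hv); kept.add(hv)
            elif hv < -smallest[0]:       # smaller than current t-th smallest
                kept.discard(-heapq.heappushpop(smallest, -hv)); kept.add(hv)
        if len(smallest) < t:
            return len(smallest)          # fewer than t distinct: exact answer
        vt = -smallest[0]                 # t-th smallest hash value
        return t * M // vt

    stream = [random.randrange(10000) for _ in range(100000)]
    print(f0_estimate(stream), len(set(stream)))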
60. Analysis of F0 Algorithm
- Suppose F0' = tm³/vt > (1+ε)F0: the estimate is too high
- So for the stream's set of distinct items S ⊆ [m], we have
- |{s ∈ S : h(s) < tm³/((1+ε)F0)}| ≥ t
- Because ε < 1, we have tm³/((1+ε)F0) ≤ (1-ε/2)tm³/F0
- Pr[h(s) < (1-ε/2)tm³/F0] ≈ (1/m³) × (1-ε/2)tm³/F0 = (1-ε/2)t/F0
- (This analysis outline hides some rounding issues)
61. Chebyshev Analysis
- Let Y be the number of items hashing below tm³/((1+ε)F0)
- E[Y] = F0 × Pr[h(s) < tm³/((1+ε)F0)] = (1-ε/2)t
- For each item i, the variance of the event is p(1-p) < p
- Var[Y] = Σ_{s ∈ S} Var[h(s) < tm³/((1+ε)F0)] < (1-ε/2)t
- We can sum the variances because of pairwise independence
- Now apply Chebyshev:
- Pr[Y ≥ t] ≤ Pr[|Y - E[Y]| ≥ εt/2] ≤ 4Var[Y]/(ε²t²) < 4t/(ε²t²)
- Set t = 20/ε² to make this probability ≤ 1/5
62. Completing the Analysis
- We have shown Pr[F0' > (1+ε)F0] < 1/5
- Can show Pr[F0' < (1-ε)F0] < 1/5 similarly:
- too few items hash below a certain value
- So Pr[(1-ε)F0 ≤ F0' ≤ (1+ε)F0] > 3/5: a good estimate
- Amplify this probability: repeat O(log 1/δ) times in parallel with different choices of hash function h
- Take the median of the estimates; analysis as before
63. F0 Issues
- Space cost:
- Store t hash values, so O(1/ε² log m) bits
- Can improve to O(1/ε² + log m) bits with additional tricks
- Time cost:
- Find if hash value h(i) < vt
- Update vt and the list of t smallest if h(i) is not already present
- Total time O(log 1/ε + log m) worst case
64. Range Efficiency
- Sometimes the input is specified as a stream of ranges [a,b]
- [a,b] means insert all items (a, a+1, a+2, ..., b)
- Trivial solution: just insert each item in the range
- Range-efficient F0 [Pavan, Tirthapura 05]:
- Start with an algorithm for F0 based on pairwise hash functions
- Key problem: track which items hash into a certain range
- Dives into the hash functions to divide and conquer for ranges
- Range-efficient F2 [Calderbank et al. 05, Rusu, Dobra 06]:
- Start with sketches for F2, which sum hash values
- Design new hash functions so that range sums are fast
65. F0 Exercises
- Suppose the stream consists of a sequence of insertions and deletions. Design an algorithm to approximate F0 of the current set.
- What happens when some frequencies are negative?
- Give an algorithm to find F0 of the most recent W arrivals.
- Use F0 algorithms to approximate max-dominance: given a stream of pairs (i, x(i)), approximate Σ_i max{x(i) : (i, x(i)) in stream}
66. Frequency Moments
- Introduction to Frequency Moments and Sketches
- Count-Min sketch for F∞ and frequent items
- AMS Sketch for F2
- Estimating F0
- Extensions
- Higher frequency moments
- Combined frequency moments
67. Higher Frequency Moments
- Fk for k > 2: use the sampling trick as with entropy [Alon et al. 96]
- Uniformly pick a position in the stream of length n
- Set r = how many times that item appears from that position onwards
- Set the estimate Fk' = n(r^k - (r-1)^k)
- E[Fk'] = 1/n × n × [(f1^k - (f1-1)^k) + ((f1-1)^k - (f1-2)^k) + ... + (1^k - 0^k) + ...] = f1^k + f2^k + ... = Fk
- Var[Fk'] ≤ 1/n × n² × (f1^k - (f1-1)^k)² + ...
- Use various bounds to bound the variance by k m^{1-1/k} Fk²
- Repeat k m^{1-1/k} times in parallel to reduce the variance (code sketch below)
- Total space needed is O(k m^{1-1/k}) machine words
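A toy Python sketch of this basic Fk estimator (it stores the stream for clarity; a one-pass version samples the position on the fly, as with the entropy estimator):

    import random

    def fk_estimate(stream, k):
        """AMS-style basic estimator for Fk: sample a position, count repeats."""
        s = list(stream)
        n = len(s)
        j = random.randrange(n)
        r = s[j:].count(s[j])             # occurrences of s[j] from position j onwards
        return n * (r**k - (r - 1)**k)

    stream = "aaaabbbc"                   # F3 = 4^3 + 3^3 + 1^3 = 92
    print(sum(fk_estimate(stream, 3) for _ in range(20000)) / 20000)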
68. Improvements
- [Coppersmith and Kumar 04]: generalize the F2 approach
- E.g. for F3, set p = 1/√m, and hash items onto {1-1/p, -1/p} with probability 1/p, 1-1/p respectively
- Compute the cube of the sum of the hash values of the stream
- Correct in expectation; bound the variance ≤ O(√m F3²)
- [Indyk, Woodruff 05, Bhuvanagiri et al. 06]: optimal solutions by extracting different frequencies
- Use hashing to sample subsets of items and fi's
- Combine these to build the correct estimator
- Cost is O(m^{1-2/k} poly-log(m, n, 1/ε)) space
69. Combined Frequency Moments
Consider network traffic data: it defines a communication graph, e.g. edge = (source, destination) or edge = (source:port, dest:port). This defines a (directed) multigraph. We are interested in the underlying (support) graph on n nodes.
- Want to focus on the number of distinct communication pairs, not the size of the communication
- So want to compute moments of F0 values...
70. Multigraph Problems
- Let G[i,j] = 1 if (i,j) appears in the stream: an edge from i to j. Total of m distinct edges
- Let di = Σ_{j=1}^n G[i,j] = degree of node i
- Find aggregates of the di's:
- Estimate heavy di's (people who talk to many others)
- Estimate frequency moments: number of distinct di values, sum of squares
- Range sums of di's (subnet traffic)
71. F∞(F0) using CM-FM
- Find i's such that di > φ Σ_i di: finds the people that talk to many others
- The Count-Min sketch only uses additions, so we can apply it here, with F0 estimators in place of its counters
72. Accuracy for F∞(F0)
- Focus on point query accuracy: estimate di
- Can prove the estimate has only small bias in expectation
- The analysis is similar to the original CM sketch analysis, but now has to take account of the F0 estimation of the counts
- Gives a bound of O(1/ε³ poly-log(n)) space
- The product of the sizes of the sketches
- It remains to fully understand other combinations of frequency moments, e.g. F2(F0), F2(F2) etc.
73. Exercises / Problems
- (Research problem) What can be computed for other combinations of frequency moments, e.g. F2 of F2 values, etc.?
- The F2 algorithm uses the fact that {+1,-1} values square to preserve F2 but are 0 in expectation. Why won't it work to estimate F4 with h ∈ {-1, 1, -i, i}?
- (Research problem) Read, understand and simplify the analysis for optimal Fk estimation algorithms
- Take the sampling Fk algorithm and combine it with F0 estimators to approximate Fk of node degrees
- Why can't we use the sketch approach for F2 of node degrees? Show where the analysis breaks down
74. Frequency Moments
- Introduction to Frequency Moments and Sketches
- Count-Min sketch for F∞ and frequent items
- AMS Sketch for F2
- Estimating F0
- Extensions
- Higher frequency moments
- Combined frequency moments
75. Data Stream Algorithms: Lower Bounds
Graham Cormode graham_at_research.att.com
76. Streaming Lower Bounds
- Lower bounds for data streams
- Communication complexity bounds
- Simple reductions
- Hardness of Gap-Hamming problem
- Reductions to Gap-Hamming
77. This Time: Lower Bounds
- So far, we have seen many examples of things we can do with a streaming algorithm
- What about things we can't do?
- What's the best we could achieve for things we can do?
- Will show some simple lower bounds for data streams based on communication complexity
78. Streaming As Communication
Stream: 1 0 1 1 1 0 1 0 1 (processed first by Alice, then by Bob)
- Imagine Alice processing a stream
- Then take the whole working memory, and send it to Bob
- Bob continues processing the remainder of the stream
79. Streaming As Communication
- Suppose Alice's part of the stream corresponds to string x, and Bob's part corresponds to string y...
- ...and that computing the function on the stream corresponds to computing f(x,y)...
- ...then if f(x,y) has communication complexity Ω(g(n)), the streaming computation has a space lower bound of Ω(g(n))
- Proof by contradiction: if there was an algorithm with better space usage, we could run it on x, then send the memory contents as a message, and hence solve the communication problem
80. Deterministic Equality Testing
x = 1 0 1 1 1 0 1 0 1
y = 1 0 1 1 0 0 1 0 1
- Alice has string x, Bob has string y; want to test if x = y
- Consider a deterministic (one-round, one-way) protocol that sends a message of length m < n
- There are 2^m possible messages, so some strings must generate the same message: this would cause an error
- So a deterministic message (sketch) must be Ω(n) bits
- In contrast, we saw a randomized sketch of size O(log n)
81. Hard Communication Problems
- INDEX: x is a binary string of length n; y is an index in [n]. Goal: output x[y]. Result: the (one-way) (randomized) communication complexity of INDEX is Ω(n) bits
- DISJOINTNESS: x and y are both length-n binary strings. Goal: output 1 if ∃i: xi = yi = 1, else 0. Result: the (multi-round) (randomized) communication complexity of DISJOINTNESS is Ω(n) bits
82. Simple Reduction to Disjointness
x = 1 0 1 1 0 1 → stream: 1, 3, 4, 6
y = 0 0 0 1 1 0 → stream: 4, 5
- F∞: output the highest frequency in a stream
- Input: the two strings x and y from DISJOINTNESS
- Stream: if xi = 1, then put i in the stream; then the same for y
- Analysis: if F∞ = 2, then the sets intersect; if F∞ ≤ 1, then they are disjoint
- Conclusion: giving an exact answer to F∞ requires Ω(N) bits
- Even approximating up to 50% error is hard
- Even with randomization: the DISJ bound allows randomness
83. Simple Reduction to Index
x = 1 0 1 1 0 1 → stream: 1, 3, 4, 6
y = 5 → stream: 5
- F0: output the number of distinct items in the stream
- Input: the string x and index y from INDEX
- Stream: if xi = 1, put i in the stream; then put y in the stream
- Analysis: if (1-ε)F0(x∘y) > (1+ε)F0(x) then x[y] = 0, else x[y] = 1
- Conclusion: approximating F0 with ε < 1/N requires Ω(N) bits
- Implies that the space to approximate must be Ω(1/ε)
- The bound allows randomization
84. Hardness Reduction Exercises
- Use reductions to DISJ or INDEX to show the hardness of:
- Frequent items: find all items in the stream whose frequency > φN, for some φ
- Sliding window: given a stream of binary (0/1) values, compute the sum of the last N values
- Can this be approximated instead?
- Min-dominance: given a stream of pairs (i, x(i)), approximate Σ_i min{x(i) : (i, x(i)) in stream}
- Rank sum: given a stream of (x,y) pairs and a query (p,q) specified after the stream, approximate |{(x,y) : x < p, y < q}|
85. Streaming Lower Bounds
- Lower bounds for data streams
- Communication complexity bounds
- Simple reductions
- Hardness of Gap-Hamming problem
- Reductions to Gap-Hamming
86. Gap Hamming
- GAP-HAMM communication problem:
- Alice holds x ∈ {0,1}^N, Bob holds y ∈ {0,1}^N
- Promise: H(x,y) is either ≤ N/2 - √N or ≥ N/2 + √N
- Which is the case?
- Model: one message from Alice to Bob
- Requires Ω(N) bits of one-way randomized communication
- [Indyk, Woodruff 03, Woodruff 04, Jayram, Kumar, Sivakumar 07]
87. Hardness of Gap Hamming
- Reduction from an instance of INDEX
- Map string x to u by 1 → +1, 0 → -1 (i.e. ui = 2xi - 1)
- Assume both Alice and Bob have access to public random strings rj, where each bit of rj is iid {-1, +1}
- Assume w.l.o.g. that the string length n is odd (important!)
- Alice computes aj = sign(rj · u)
- Bob computes bj = sign(rj[y])
- Repeat N times with different random strings, and consider the Hamming distance of a1 ... aN with b1 ... bN
88. Probability of a Hamming Error
- Consider the pair aj = sign(rj · u), bj = sign(rj[y])
- Let w = Σ_{i≠y} ui rj[i]
- w is a sum of (n-1) values distributed iid uniform {-1,+1}
- Case 1: w ≠ 0. Then |w| ≥ 2, since (n-1) is even
- so aj = sign(w), independent of x[y]
- Then Pr[aj ≠ bj] = Pr[sign(w) ≠ sign(rj[y])] = ½
- Case 2: w = 0. Then aj = sign(rj · u) = sign(w + uy rj[y]) = sign(uy rj[y])
- Then Pr[aj = bj] = Pr[sign(uy rj[y]) = sign(rj[y])]
- This probability is 1 if uy = +1, and 0 if uy = -1
- Completely biased by the answer to INDEX
89. Finishing the Reduction
- So what is Pr[w = 0]?
- w is a sum of (n-1) iid uniform {-1,+1} values
- Textbook: Pr[w = 0] ≈ c/√n, for some constant c
- Do some probability manipulation:
- Pr[aj = bj] = ½ + c/(2√n) if x[y] = 1
- Pr[aj = bj] = ½ - c/(2√n) if x[y] = 0
- Amplify this bias by making strings of length N = 4n/c²
- Apply a Chernoff bound on the N instances:
- With probability > 2/3, either H(a,b) > N/2 + √N or H(a,b) < N/2 - √N
- If we could solve GAP-HAMMING, we could solve INDEX
- Therefore, we need Ω(N) = Ω(n) bits for GAP-HAMMING
90. Streaming Lower Bounds
- Lower bounds for data streams
- Communication complexity bounds
- Simple reductions
- Hardness of Gap-Hamming problem
- Reductions to Gap-Hamming
91. Lower Bound for Entropy
- Alice: x ∈ {0,1}^N, Bob: y ∈ {0,1}^N
- Entropy estimation algorithm A
- Alice runs A on enc(x) = <(1,x1), (2,x2), ..., (N,xN)>
- Alice sends over her memory contents to Bob
- Bob continues A on enc(y) = <(1,y1), (2,y2), ..., (N,yN)>
[Figure: Alice's bit-string and Bob's bit-string, each encoded as a stream of (index, bit) tokens]
92. Lower Bound for Entropy
- Observe that there are:
- 2H(x,y) tokens with frequency 1 each
- N - H(x,y) tokens with frequency 2 each
- So H(S) = log N + H(x,y)/N (derivation below)
- Thus the size of Alice's memory contents must be Ω(N)
- Set ε = 1/(√N log N) to show a bound of Ω((ε log 1/ε)^{-2})
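Filling in the arithmetic behind H(S) (the combined stream has length 2N; write H for H(x,y); logs are base 2):

    H(S) = 2H \cdot \tfrac{1}{2N}\log(2N) + (N - H)\cdot\tfrac{2}{2N}\log N
         = \tfrac{H}{N}(1 + \log N) + \tfrac{N - H}{N}\log N
         = \log N + \tfrac{H}{N}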
93. Lower Bound for F0
- The same encoding works for F0 (distinct elements):
- 2H(x,y) tokens with frequency 1 each
- N - H(x,y) tokens with frequency 2 each
- F0(S) = N + H(x,y)
- Either H(x,y) > N/2 + √N or H(x,y) < N/2 - √N
- If we could approximate F0 with ε < 1/√N, we could separate the two cases
- But then the space bound Ω(N) = Ω(ε⁻²) bits
- The dependence on ε for F0 is tight
- Similar arguments show Ω(ε⁻²) bounds for Fk
- The proof assumes k (and hence 2^k) are constants
94. Lower Bounds Exercises
- Formally argue the space lower bound for F2 via Gap-Hamming
- Argue space lower bounds for Fk via Gap-Hamming
- (Research problem) Extend lower bounds to the case when the order of the stream is random or near-random
- (Research problem) Kumar conjectures that the multi-round communication complexity of Gap-Hamming is Ω(n); this would give lower bounds for multi-pass streaming
95. Streaming Lower Bounds
- Lower bounds for data streams
- Communication complexity bounds
- Simple reductions
- Hardness of Gap-Hamming problem
- Reductions to Gap-Hamming
96. Data Stream Algorithms: Extensions and Open Problems
Graham Cormode graham_at_research.att.com
97. This Time: Extensions
- Have given the basics of streaming: streams of items, frequency moments, upper and lower bounds
- Many variations, with many open problems:
- Streams representing different combinatorial objects
- Streams that are distributed, correlated, uncertain
- Systems for processing streams
- Different models of streams
- See also "Open Problems in Data Streams" [McGregor 07]
- The result of a workshop held at IIT Kanpur in Dec 2006
98. Deterministic Streaming Algorithms
- The focus so far has been on randomized algorithms
- Many important problems can be solved deterministically!
- Finding frequent items / heavy hitters
- Finding quantiles of a distribution
- For many problems, lower bounds show randomization is necessary for sublinear space:
- Anything involving equality testing as a special case
- Frequency moments
- When they are possible, deterministic algorithms are often faster and use less space: more practical to implement
99. Clustering On Data Streams
- Goal: output k cluster centers at the end; any point can be classified using these centers
- Use a divide and conquer approach [Guha et al. 00]:
- Buffer as many points as possible, then cluster them
- Cluster the clusters
- Cluster the cluster clusters, etc...
- Each level of clustering gives up extra factors in quality
100. Geometric Streaming
- The stream specifies a sequence of d-dimensional points
- Answer various geometric problems, such as:
- Convex hull
- Minimum spanning tree weight
- Facility location
- Minimum enclosing ball
- Gridding approach: reduces to Fk or related problems [Indyk 03]
- Core-set: keep a carefully chosen small subset of points and evaluate on them [Har-Peled 02, Chan 06]
- Simple example: for the minimum enclosing ball, keep extremal points in evenly-spaced directions
101. Sliding Window Computations
- In a sliding window, we only consider the last W items
- W is still very large, so we want poly-log(W) solutions
- Exponential Histograms [Datar et al. 02] and Waves [Gibbons, Tirthapura 02]:
- Deterministic structure tracks counts in a window
- Based on doubling bucket sizes to give relative error
- The same structure plus sketches solves for aggregates
- Asynchronous streams: items not in timestamp order
- Relative error counts are possible [Busch, Tirthapura 07]
- Extend the concept to other aggregates [C. et al. 08]
102. Time Decay
- Assign a weight to each item as a function of its age
- E.g. exponential decay or polynomial decay
- Implies weighted versions of problems
- [Cohen and Strauss 2003]:
- Can reduce sums and counts to multiple instances of sliding window queries
- [C., Korn and Tirthapura 2008]:
- The same observation applies to other computations (quantiles, frequent items)
103. Multi-Pass Algorithms
- Some situations allow multiple passes over the stream
- E.g. scanning over slow storage (tape): random access is not possible, but we can scan multiple times
- The earliest work in streaming [Munro, Paterson 78] studied the pass/space tradeoff for finding medians
- Lower bounds can follow from multi-round communication complexity bounds
104. Other Massive Data Models
- Massive Unordered Data (MUD) model [Feldman et al. 08]
- Abstracts computations in MapReduce/Hadoop settings
- Can provably simulate deterministic streaming algorithms
- What about randomized computations, multiple passes?
105. Skewed Streams
- In practice, not all frequency distributions are worst case
- A few items are frequent, then there is a long tail of infrequent items
- Such skew is prevalent in network data, word frequency, paper citations, city sizes, etc.
- Zipfian distribution with skew z > 0 (z = 1..2 typical)
- Analyze algorithms under the assumption of skewed data
- Improved F2 space cost O(ε^{-2/(1+z)} log 1/δ), provided z > 1
106. Graph Streaming
Example edge stream: (4,5) (2,3) (1,3) (3,5) (1,2) (2,4) (1,5) (3,4)
- The stream specifies a massive graph, edge by edge
- Most natural problems have Ω(|V|) space lower bounds
- Semi-streaming model: allow Õ(|V|) but o(|E|) space, therefore also o(|V|²) space
- Allow one (or a few) passes to approximate:
- Minimum spanning tree weight
- Graph distances (based on spanners)
- Maximum weight matching
- Counting triangles
107. Matrix Streaming
- The stream specifies a massive n × n matrix
- Either by giving entries in some order, or updates to entries
- In one (or a few) passes, find:
- CUR decomposition
- PageRank vector
- Approximate matrix product
- Singular value decomposition
- Current methods take a small constant number of passes, and sample a constant number of rows and columns by weight
- Sketching methods don't seem so useful here
[Figure: CUR decomposition with O(1) columns, O(1) rows, and a carefully chosen U]
108. Permutation Streaming
- The stream presents a permutation of items
- An abstraction of several settings; more of theoretic interest
- Approximate the number of inversions in the stream:
- locations where i > j but i appears before j in the stream
- Can be reduced to a variation of quantiles [Gupta, Zane 03]
- Find the length of the longest increasing subsequence:
- Reduce (up to a factor of 2) to a simpler function [Ergun, Jowhari 08]
- Approximate this using a different variation of quantiles
- Deterministic lower bound Ω(√N); the randomized bound is open
109. Random Order Streaming
- Lower bounds are sometimes based on carefully creating adversarial orders of streams
- Random order streams: the order is uniformly permuted
- Can sometimes give much better upper bounds: a prefix of the stream gives a good sample of the distribution to come
- Lower bounds in random order give stronger evidence of robust hardness, e.g. [Chakrabarti et al. 08]
- Hardness via communication complexity of random partitions
- GAP-HAMMING still has a linear lower bound
- t-party DISJOINTNESS has an Ω(n/t) lower bound
110. Probabilistic Streams
Example: S = <(x, ½), (y, 1/3), (y, ¼)> encodes 6 possible worlds:
G:     ∅    x    y     x,y   y,y   x,y,y
Pr[G]: ¼    ¼    5/24  5/24  1/24  1/24
- Instead of exact values, a stream of discrete distributions
- Specifies exponentially many possible worlds
- Adds complexity to previously studied problems:
- Sum and Count are easy (by linearity of expectation)
- Avg = Sum/Count is hard, because of the ratio! [McGregor et al. 07]
- Linearity of expectation, summation of variance
- Allows estimation of Fk over streams [C, Garofalakis 07]
111. Distributed Streams
- Motivated by sensor networks: large wireless nets
- Communication drains the battery: compute more, send less
- Key problem: design stream summary data structures that can be combined to summarize the union of streams
- Most sketches (AMS, Count-Min, F0) naturally distribute
- Similar results are needed for other problems
- http://www.intel.com/research/exploratory/motes.htm
[Figure: sensor network aggregating to a base station (root, coordinator)]
112. Continuous Distributed Model
- Goal: continuously track a (global) query over the streams at the coordinator, while bounding the communication
- Large-scale network-event monitoring, real-time anomaly/DDoS attack detection, power grid monitoring, ...
- Results known for quantiles, Fk, clustering...
- Cost is not much higher than a one-time computation [C. et al. 08]
113. Extensions for P2P Networks
- Much work has focused on the specifics of sensor and wired nets
- P2P and Grid computing present alternate models:
- Structure of multi-hop overlay networks
- Controlled failure model: nodes explicitly leave and join
- Allows us to think beyond the model of highly resource-constrained sensors
- Implementations such as OpenDHT over PlanetLab [Rhea et al. 05]
114. Authenticated Stream Aggregation
- Wide-area query processing
- Possibly malicious aggregators:
- Can suppress or add spurious information
- Can we authenticate query results at the querier?
- Perhaps, to within some approximation error
- Initial steps in [Garofalakis et al. 06]
- Sliding window: [Hadjieleftheriou et al. 07]
115. Data Stream Algorithms
- Slides are on the web on my website
- A long list of references is also on the web
- http://dimacs.rutgers.edu/~graham