Title: Optimal Approximations of the Frequency Moments of Data Streams
1Optimal Approximations of the Frequency Moments
of Data Streams
- Piotr Indyk
- David Woodruff
2The Streaming Model
- Stream of elements $a_1, \ldots, a_n$, each in $\{1, \ldots, m\}$
- Want to compute statistics on stream
- Elements arranged in adversarial order
- Algorithms given one pass over stream
- Goal: minimum-space algorithm
3Frequency Moments [AMS96]
$n$ = stream size, $m$ = universe size
$f_i$ = number of occurrences of item $i$
$k$-th moment: $F_k = \sum_{i=1}^{m} f_i^k$
- $F_0$ = number of distinct elements
- $F_1 = n$ = stream size
- $F_2$ = self-join size
Why are frequency moments important?
4Applications
- Estimating distinct elements with low space
- Estimate query selectivity to huge DB without sorting
- Routers gather the number of distinct destinations
- $F_2$ estimates size of self-joins
- Example: $f_A^2 + f_B^2 = 4 + 1 = 5$
- $F_k$ measures data skewness
5The Best Deterministic Algorithm
- Trivial algorithm for $F_k$
- Store/update $f_i$ for each item $i$, sum $f_i^k$ at end
- Space $O(m \log n)$: $m$ items $i$, $\log n$ bits to count each $f_i$
- Negative results [AMS96]
- Computing $F_k$ exactly requires $\Omega(m)$ space
- Any deterministic algorithm that outputs $X$ with $|F_k - X| < \epsilon F_k$ must use $\Omega(m)$ space
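The trivial algorithm above can be sketched in a few lines. This is only an illustration of the one-counter-per-item approach; the function name `exact_frequency_moment` and the three-element example stream are assumptions made for this sketch, not part of the slides.

```python
from collections import Counter

def exact_frequency_moment(stream, k):
    """Exact F_k using one counter per distinct item (the trivial algorithm)."""
    counts = Counter()
    for item in stream:              # single pass over the stream
        counts[item] += 1            # store/update f_i
    # Every stored count is >= 1, so k = 0 counts the distinct items.
    return sum(f ** k for f in counts.values())

stream = ["A", "A", "B"]             # hypothetical stream: f_A = 2, f_B = 1
f0 = exact_frequency_moment(stream, 0)
f1 = exact_frequency_moment(stream, 1)
f2 = exact_frequency_moment(stream, 2)
```

The dictionary holds one entry per distinct item, which is where the $O(m \log n)$ space comes from.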
What about randomized algorithms?
6Randomized Approx Algs for Fk
- A randomized algorithm $\epsilon$-approximates $F_k$ if it outputs $X$ s.t.
- $\Pr[|F_k - X| < \epsilon F_k] > 2/3$
- Previous work (table suppresses $\mathrm{polylog}(mn)$ factors)
7Matching Upper Bound
Our Contribution: For every $k$ there is a 1-pass $\tilde{O}(m^{1-2/k})$-space algorithm to $\epsilon$-approximate $F_k$
- Additional Features
- 1. Works even if we allow deletions, that is, a stream of elements $(i, +)$, $(i, -)$
- 2. Constant update time
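To make the deletion model concrete, here is a minimal sketch of how a stream of $(i, +)$ and $(i, -)$ updates defines the frequencies. It keeps exact counters, so it illustrates the input model only and not the small-space algorithm; the function name and the sample updates are assumptions.

```python
from collections import defaultdict

def frequencies_with_deletions(updates):
    """Net frequency of each item after a sequence of (item, '+') / (item, '-') updates."""
    freq = defaultdict(int)
    for item, sign in updates:
        if sign == "+":
            freq[item] += 1          # insertion
        elif sign == "-":
            freq[item] -= 1          # deletion
        else:
            raise ValueError(f"unknown update sign: {sign!r}")
    # Items whose insertions and deletions cancel no longer count.
    return {item: f for item, f in freq.items() if f != 0}

updates = [(3, "+"), (7, "+"), (3, "+"), (7, "-"), (3, "+")]
freq = frequencies_with_deletions(updates)
f2 = sum(f ** 2 for f in freq.values())
```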
8Techniques
- Previous Algorithms [AMS96, CK04, G04]
- 1. Cleverly construct small-space estimator $X$ s.t.
- $E[X] = F_k$
- $\mathrm{Var}[X]$ small
- 2. Apply Chebyshev's inequality
- Our algorithm
- 1. Divide frequencies into buckets
- $0, [1, 2), [2, 4), [4, 8), \ldots, [2^{i-1}, 2^i), \ldots$
- 2. Estimate size $s_i$ of each bucket
- 3. Output $X = \sum_i s_i 2^{ik}$
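The three steps of the bucketing approach can be illustrated as follows. This sketch computes the bucket sizes exactly from stored counts, which a streaming algorithm cannot afford; it only shows how the output $X$ relates to $F_k$. The function names and the sample stream are assumptions.

```python
from collections import Counter

def bucket_sizes(stream):
    """s[i] = number of items whose frequency lies in [2^(i-1), 2^i)."""
    sizes = Counter()
    for f in Counter(stream).values():
        # For f >= 1: f in [2^(i-1), 2^i) exactly when i == f.bit_length().
        sizes[f.bit_length()] += 1
    return dict(sizes)

def bucketed_moment(sizes, k):
    """X = sum_i s_i * 2^(i*k), rounding every frequency up to its bucket's bound."""
    return sum(s * 2 ** (i * k) for i, s in sizes.items())

stream = [7, 1, 1, 3, 7, 3, 4, 1, 1, 1]   # frequencies: 1 -> 5, 3 -> 2, 7 -> 2, 4 -> 1
k = 3
sizes = bucket_sizes(stream)
x = bucketed_moment(sizes, k)
exact = sum(f ** k for f in Counter(stream).values())
```

Because every frequency in bucket $i$ is at least $2^{i-1}$ and less than $2^i$, the output satisfies $F_k \le X \le 2^k F_k$ when the bucket sizes are exact.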
9What's Left?
- Remaining Problem: Estimate the number $s_i$ of elements with frequency in each bucket $[2^{i-1}, 2^i)$
- Is this always easy? No.
- Suppose it were always easy; then we could approximate the maximum frequency
- This is HARD: $\Omega(m)$ space [AMS96]
- However, $\Omega(m)$ only applies to worst-case streams; otherwise we can do better with CountSketch [CCF-C02]
10For the moment, let's assume
- 1. $\exists$ a 1-pass oracle Max returning the maximum frequency using $O(B)$ space (we remove this using CountSketch)
[Figure: frequency of each item, with the maximum marked as Max]
- 2. We have a very long RAM of random bits
- (we remove this using Nisan's generator)
11General Idea: Max Sampling
- Restrict input stream to a random subset of items in $\{1, \ldots, m\}$, where items are included independently with probability $p$.
[Figure: example stream 7, 1, 1, 3, 7, 3, 4 restricted to the random subset {1, 3}]
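A minimal sketch of the restriction step, assuming the subset is drawn with Python's `random` module rather than from the hash functions used later in the talk. The helper names `sample_subset` and `restrict` are hypothetical.

```python
import random

def sample_subset(m, p, rng):
    """Include each item of {1, ..., m} independently with probability p."""
    return {i for i in range(1, m + 1) if rng.random() < p}

def restrict(stream, subset):
    """Keep only the stream elements whose item belongs to the subset."""
    return [a for a in stream if a in subset]

stream = [7, 1, 1, 3, 7, 3, 4]
restricted = restrict(stream, {1, 3})        # the subset from the example

rng = random.Random(0)                        # fixed seed for reproducibility
subset = sample_subset(7, 0.5, rng)
```

Restricting by item rather than by position means every occurrence of a surviving item is kept, so its frequency in the substream equals its frequency in the full stream.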
12General Idea: Max Sampling
- Restrict input to a random subset of items in $\{1, \ldots, m\}$, where items are included independently with probability $p$.
- What are the chances the maximum lies in
- $S_i$ = elements $r$ such that $f_r \in [2^{i-1}, 2^i)$?
$q = (1-p)^{\sum_{j > i} s_j} \left(1 - (1-p)^{s_i}\right)$
Idea:
1. Estimate $q$ as $q'$ by taking independent trials and computing the fraction of trials in which the max lies in $S_i$.
2. If $s_j$ has already been estimated for all $j > i$, solve this expression for $s_i$.
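The step "solve this expression for $s_i$" is a direct algebraic inversion. The sketch below evaluates $q$ and inverts it; the function names and the numeric values are assumptions chosen for illustration, and the exact value of $q$ stands in for an estimate obtained from trials.

```python
import math

def prob_max_in_class(p, s_i, higher):
    """q = (1-p)^higher * (1 - (1-p)^s_i), where higher = sum of s_j over j > i."""
    return (1 - p) ** higher * (1 - (1 - p) ** s_i)

def solve_class_size(q, p, higher):
    """Invert the formula for s_i (assumes 0 < p < 1 and a consistent q)."""
    inner = 1 - q / (1 - p) ** higher        # equals (1-p)^s_i
    return math.log(inner) / math.log(1 - p)

p, s_i, higher = 0.05, 12, 20                # illustrative values only
q = prob_max_in_class(p, s_i, higher)
recovered = solve_class_size(q, p, higher)
```

With the exact $q$ the inversion returns $s_i$ up to floating-point error; with an estimated $q$ the error in $s_i$ depends on how tightly that estimate concentrates.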
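13When is this estimate any good?
Recall $q = (1-p)^{\sum_{j > i} s_j} \left(1 - (1-p)^{s_i}\right)$, so estimate
$s_i \approx \log\left(1 - q' / (1-p)^{\sum_{j > i} s'_j}\right) / \log(1-p)$,
where $q'$ and $s'_j$ are the estimates of $q$ and $s_j$.
Need:
1. $s'_j \approx s_j$ for all $j > i$ (holds inductively)
2. $q' \approx q$ (tight concentration of $q$)
Requires $\exists\, p$ so that $q > 1/R$, where $R$ = number of trials used to estimate $q$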
14When is this estimate any good?
$q = (1-p)^{\sum_{j > i} s_j} \left(1 - (1-p)^{s_i}\right)$
$p$ too large? $\Rightarrow$ $q$ too small
$p$ too small? $\Rightarrow$ $q$ too small
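A quick numeric check of both failure modes, under assumed values $s_i = 10$ and $\sum_{j > i} s_j = 100$ (these numbers are illustrative, not from the talk):

```python
def prob_max_in_class(p, s_i, higher):
    """q = (1-p)^higher * (1 - (1-p)^s_i), where higher = sum of s_j over j > i."""
    return (1 - p) ** higher * (1 - (1 - p) ** s_i)

s_i, higher = 10, 100
q_small_p = prob_max_in_class(0.0001, s_i, higher)   # almost nothing is sampled
q_mid_p = prob_max_in_class(0.01, s_i, higher)       # intermediate sampling rate
q_large_p = prob_max_in_class(0.5, s_i, higher)      # a higher class almost surely survives
```

With a tiny $p$ the class $S_i$ is rarely hit at all, and with a large $p$ some item from a higher class almost always survives and takes the maximum, so only intermediate rates give a usable $q$.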
This motivates the following: say a class $S_i$ contributes if and only if $s_i > \sum_{j > i} s_j / R$.
If $R = \Omega(\log n)$, then $F_k \approx \sum_{\text{contributing } i} s_i 2^{ik}$
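The definition of a contributing class can be checked directly when the bucket sizes are known. The sketch below is an illustration with made-up sizes; `contributing_classes` is a hypothetical helper, not a routine from the paper.

```python
def contributing_classes(sizes, R):
    """Indices i with s_i > (sum of s_j over j > i) / R."""
    result = []
    higher = 0                                # running sum of s_j for j > i
    for i in sorted(sizes, reverse=True):     # scan from the highest class down
        if sizes[i] > higher / R:
            result.append(i)
        higher += sizes[i]
    return sorted(result)

sizes = {1: 2, 2: 500, 3: 40, 6: 1}           # illustrative bucket sizes s_i
R, k = 100, 3
contrib = contributing_classes(sizes, R)
full = sum(s * 2 ** (i * k) for i, s in sizes.items())
kept = sum(sizes[i] * 2 ** (i * k) for i in contrib)
```

Here only class 1 fails the test, and dropping it changes the sum by 16 out of 314640. In general a non-contributing class $i$ adds at most $\sum_{j > i} s_j 2^{jk} / R$, so the total lost is at most the number of classes times $X/R$.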
15The Idealized Algorithm
- Use the random string to generate hash functions $h_j^r : [m] \to [2^j]$ for $j \in [\log m]$ and $r \in [R]$
- Restrict stream Str to Str$_j^r$, those items $i$ with $h_j^r(i) = 1$
- For each Str$_j^r$, compute Max(Str$_j^r$)
- To estimate $s_i$ given $s_t$ for $t > i$, find some $j$ for which enough of the Max(Str$_j^r$) come from $S_i$, and then set $s_i$ by solving the expression for $q$ with $p = 2^{-j}$
- Output $F_k \approx \sum_i s_i 2^{ik}$
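The first three steps of the idealized algorithm can be simulated as below. This is a sketch under stated assumptions: exact counts play the role of the Max oracle, fresh random draws stand in for the hash functions $h_j^r$, and the function name `max_class_fractions` is hypothetical. It reports, for each level $j$, the fraction of the $R$ trials whose sampled maximum falls in each class.

```python
import random
from collections import Counter

def max_class_fractions(stream, m, R, seed=0):
    """For each level j, the fraction of R trials whose sampled max lands in each class."""
    rng = random.Random(seed)
    freq = Counter(stream)                    # exact counts stand in for the Max oracle
    levels = max(1, m.bit_length())
    fractions = {}
    for j in range(levels):
        hits = Counter()
        for _ in range(R):
            # An item survives when its simulated hash value in {1, ..., 2^j} equals 1,
            # which happens with probability 2^-j.
            kept = [f for _, f in sorted(freq.items())
                    if rng.randint(1, 2 ** j) == 1]
            if kept:
                hits[max(kept).bit_length()] += 1   # class index of the sampled maximum
        fractions[j] = {i: c / R for i, c in hits.items()}
    return fractions

stream = [7, 1, 1, 3, 7, 3, 4, 1, 1, 1]
fractions = max_class_fractions(stream, m=7, R=200)
```

At level $j = 0$ every item survives, so the maximum always comes from the top class. The fractions at deeper levels are the empirical counterparts of $q$ from which lower classes would be estimated.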
16Removing the assumptions
1. Assumption: $\exists$ a 1-pass oracle Max returning the maximum frequency using $O(B)$ space
- [CCF-C02] $\exists$ a 1-pass $O(B)$-space algorithm CountSketch
- which, given stream Str, outputs all $x$ for which $f_x^2 \geq F_2/B$
Recall $S_i$ contributes if and only if $s_i > \sum_{j > i} s_j / R$
Lemma: If $S_i = [2^{i-1}, 2^i)$ contributes, then
Proof: Hölder's inequality.
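For readers unfamiliar with CountSketch, here is a minimal sketch of its counter table and point-query estimate. It is an illustration, not the construction from [CCF-C02]: the linear hash functions, the class and parameter names, and the table sizes are assumptions, and the step that reports all heavy items is omitted.

```python
import random
from statistics import median

class CountSketch:
    P = (1 << 61) - 1                         # prime modulus for the linear hashes

    def __init__(self, rows, width, seed=0):
        rng = random.Random(seed)
        self.width = width
        # Four coefficients per row: (a, b) pick the bucket, (c, d) pick the sign.
        self.params = [tuple(rng.randrange(1, self.P) for _ in range(4))
                       for _ in range(rows)]
        self.table = [[0] * width for _ in range(rows)]

    def _bucket_and_sign(self, row, x):
        a, b, c, d = self.params[row]
        bucket = ((a * x + b) % self.P) % self.width
        sign = 1 if ((c * x + d) % self.P) & 1 else -1
        return bucket, sign

    def update(self, x, delta=1):
        """Process one stream update; delta = -1 models a deletion."""
        for row in range(len(self.table)):
            bucket, sign = self._bucket_and_sign(row, x)
            self.table[row][bucket] += sign * delta

    def estimate(self, x):
        """Median over rows of the signed counter that x hashes to."""
        values = []
        for row in range(len(self.table)):
            bucket, sign = self._bucket_and_sign(row, x)
            values.append(sign * self.table[row][bucket])
        return median(values)

sketch = CountSketch(rows=5, width=16)
for _ in range(5):
    sketch.update(42)                         # item 42 appears five times
sketch.update(7, +1)
sketch.update(7, -1)                          # item 7 is inserted, then deleted
heavy = sketch.estimate(42)
```

Because updates are linear, the insertion and deletion of item 7 cancel in every counter, which is why this kind of structure supports deletions.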
17Removing the assumptions
2. We have an infinite string of random bits
- Consider a space-$S$ algorithm $A$ and a function $f$, with random strings $R_1, \ldots, R_n$, that, when processing a stream, maintains a variable $C$ and updates it as follows: $C \leftarrow C + f(i, R_i)$
[Indyk00] Then $R_1, \ldots, R_n$ can be generated using Nisan's PRG, and
- The new algorithm $A'$ has space $O(S)$
- The outputs of $A$ and $A'$ are indistinguishable
Our algorithm follows this framework
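The recursive structure of Nisan's generator can be sketched as follows: a seed block and $t$ hash functions are stretched into $2^t$ blocks. This is only an illustration of the recursion; the choice of linear hashes over a prime field, the function names, and the seed value are assumptions, and no security or indistinguishability claim is made for this toy version.

```python
import random

P = (1 << 61) - 1                             # blocks are elements of Z_P (illustrative choice)

def make_hashes(count, seed=0):
    """Draw `count` linear hash functions x -> (a*x + b) mod P."""
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(count)]
    return [lambda x, a=a, b=b: (a * x + b) % P for a, b in coeffs]

def nisan_blocks(x, hashes):
    """G_0(x) = [x]; G_t(x) = G_{t-1}(x) followed by G_{t-1}(h_t(x))."""
    if not hashes:
        return [x]
    *earlier, last = hashes
    return nisan_blocks(x, earlier) + nisan_blocks(last(x), earlier)

hashes = make_hashes(3)
blocks = nisan_blocks(12345, hashes)          # 2^3 = 8 blocks from one seed block
```

In the framework above, block number $i$ would be used in place of the truly random string $R_i$, so only the seed and the hash functions need to be stored.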
18Conclusions
- Result: Tight $\tilde{O}(m^{1-2/k})$ upper bound
- Handle deletions $(j, -)$
- $O(1)$ update time
- Open Problem: Reduce $\tilde{O}$ factors