Optimal Approximations of the Frequency Moments of Data Streams - PowerPoint PPT Presentation

1 / 18
About This Presentation
Title:

Optimal Approximations of the Frequency Moments of Data Streams

Description:

n = stream size, m = universe size. fi = # occurrences of item i ... Estimate query selectivity to huge DB without sorting. Routers gather # distinct destinations ... – PowerPoint PPT presentation

Number of Views:71
Avg rating:3.0/5.0
Slides: 19
Provided by: davidwo1
Learn more at: http://web.mit.edu
Category:

less

Transcript and Presenter's Notes

Title: Optimal Approximations of the Frequency Moments of Data Streams


1
Optimal Approximations of the Frequency Moments
of Data Streams
  • Piotr Indyk
  • David Woodruff

2
The Streaming Model
  • Stream of elements a1, , an each in 1, , m
  • Want to compute statistics on stream
  • Elements arranged in adversarial order
  • Algorithms given one pass over stream
  • Goal Minimum space algorithm

3
Frequency Moments AMS96
n stream size, m universe size fi
occurrences of item i
k-th moment
  • F0 of distinct elements
  • F1 n stream size
  • F2 self-join size

Why are frequency moments important?
4
Applications
  • Estimating distinct elements with low space
  • Estimate query selectivity to huge DB without
    sorting
  • Routers gather distinct destinations
  • F2 estimates size of self-joins

,
fB2 fA2 4 1 5
  • Fk measures data skewness

5
The Best Deterministic Algorithm
  • Trivial algorithm for Fk
  • Store/update fi for each item i, sum fik at end
  • Space O(mlog n) m items i, log n bits to count
    fi
  • Negative Results AMS96
  • Compute Fk exactly ? ?(m) space
  • Any deterministic alg. outputs X with
  • Fk X lt ?Fk must use ?(m) space

What about randomized algorithms?
6
Randomized Approx Algs for Fk
  • Randomized alg. ?-approximates Fk if outputs X
    s.t.
  • PrFk X lt ? Fk gt
    2/3
  • Previous work (table suppresses polylog mn)

7
Matching Upper Bound
Our Contribution For every k there is a
1-pass O(m1-2/k) space algorithm to
?-approximate Fk
  • Additional Features
  • Works even if we allow deletions, that is, stream
    of elements (i, ), (i,-)
  • 2. Constant update time

8
Techniques
  • Previous Algorithms AMS96, CK04, G04
  • 1. Cleverly construct small-space
    estimator X s.t.
  • EX Fk
  • VarX small
  • 2. Apply Chebyshevs inequality
  • Our algorithm
  • 1. Divide frequencies into buckets
  • 0, 1, 2), 2, 4), 4, 8), , 2i-1, 2i),
  • 2. Estimate size si of each bucket
  • 3. Output X ?i si 2ik

9
Whats Left?
  • Remaining Problem Estimate si of elements
    with frequency in each bucket 2i-1, 2i)
  • Is this always easy? No.
  • Suppose always easy then could approximate the
    maximum frequency
  • This is HARD ?(m) space AMS96
  • However, ?(m) only applies to worst-case
    streams, otherwise can do better Countsketch
    CCF-C

10
For the moment, lets assume
  • 1. 9 a 1-pass oracle Max returning the maximum
    frequency using O(B) space (we remove this
    using CountSketch)

Max
frequency
items
  • 2. We have a very long RAM of random bits
  • (we remove this using Nisans generator)

11
General Idea Max Sampling
  • Restrict input stream to a random subset of items
    in 1, , m, where items are included
    independently with probability p.


7
1
1
3
7
3
4
Random subset 1, 3

12
General Idea Max Sampling
  • Restrict input to a random subset of items in
    1, , m, where items are included
    independently with probability p.
  • What are chances the maximum lies in
  • Si elements r such that fr 2 2i-1, 2i)?

q (1-p)? j gt i sj (1 (1-p)si)
Idea 1. Estimate q as q by taking
independent trials and computing
fraction of max in Si 2. If already
estimated sj for j gt i, solve this
expression for si.
13
When is this estimate any good?
Recall q (1-p)?j gt i sj (1 (1-p)si), so
estimate si
Need 1.
(holds inductively)
(tight concentration of q)
2.
Requires 9 p so that q gt 1/R, where R trials
used to estimate q
14
When is this estimate any good?
q (1-p)?j gt i sj (1 (1-p)si)
p too large? ! q too small
p too small? ! q too small
Motivates the following Say a class Si
contributes if and only if si gt ?j gt i sj /R If
R ?(log n), then Fk ¼ ?contributing i si 2ik
15
The Idealized Algorithm
  • Use the random string to generate hash functions
    hjr m -gt 2j for j 2 log m and r 2 R
  • Restrict stream Str to Strjr, those items i with
    hjr(i) 1
  • For each Strjr, compute Max(Strjr)
  • To estimate si given st for t gt i, find some j
    for which enough of the Max(Strjr) come from
    Si, and then set
  • Output Fk ?i si 2ik

16
Removing the assumptions
1. Assumption 9 a 1-pass oracle Max returning
the maximum frequency using O(B) space
  • CCF-C02 9 a 1-pass O(B)-space algorithm
    CountSketch
  • which, given stream Str, outputs all x for which
    fx2 F2/B

Recall Si contributes if and only if si gt ?j gt
i sj /R
Lemma If Si 2i-1, 2i) contributes, then
Proof Holders inequality.
17
Removing the assumptions
2. We have an infinite string of random bits
  • Consider a space-S algorithm A and a function
  • f, with random strings R1, , Rn that, when
  • processing a stream, maintains a variable
  • C, and updates as follows C C f(i, Ri)

Indyk00 Then R1, , Rn can be generated using
Nisans PRG, and
  • The new algorithm A has space O(S)
  • The outputs of A and A are indistinguishable

Our algorithm follows this framework
18
Conclusions
  • Result Tight O(m1-2/k) upper bound
  • Handle deletions (j, -)
  • O(1) update time
  • Open Problem Reduce O factors
Write a Comment
User Comments (0)
About PowerShow.com