1
Estimating Distinct Elements, Optimally
  • David Woodruff
  • IBM
  • Based on papers with Piotr Indyk, Daniel Kane,
    and Jelani Nelson

2
Problem Description
  • Given a long string of at most n distinct
    characters, count the number F0 of distinct
    characters
  • See characters one at a time
  • One pass over the string
  • Algorithms must use small memory and fast update
    time
  • too expensive to store the set of distinct characters
  • algorithms must be randomized and settle for an
    approximate solution: output F ∈ [(1-ε)F0, (1+ε)F0]
    with, say, good constant probability
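
To make the memory constraint concrete, here is a minimal Python sketch (not part of the talk) of the naive exact counter that the streaming setting rules out: it must remember every distinct character it sees, which can take Ω(n) bits.

    def exact_f0(stream):
        """Naive exact distinct count: remembers every distinct character seen.
        Storing this set can take Omega(n) bits, which is what the small-memory
        streaming setting forbids."""
        seen = set()
        for c in stream:
            seen.add(c)
        return len(seen)

    print(exact_f0("abracadabra"))  # 5 distinct characters: a, b, r, c, d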

3
Algorithm History
  • Flajolet and Martin introduced problem, FOCS 1983
  • O(log n) space for fixed ε in random oracle model
  • Alon, Matias and Szegedy
  • O(log n) space/update time for fixed ε with no
    oracle
  • Gibbons and Tirthapura
  • O(ε⁻² log n) space and O(ε⁻²) update time
  • Bar-Yossef et al
  • O(ε⁻² log n) space and O(log(1/ε)) update time
  • O(ε⁻² log log n + log n) space and O(ε⁻²) update
    time, essentially
  • Similar space bound also obtained by Flajolet et
    al in the random oracle model
  • Kane, Nelson and W
  • O(ε⁻² + log n) space and O(1) update and
    reporting time
  • All time complexities are in unit-cost RAM model

4
Lower Bound History
  • Alon, Matias and Szegedy
  • Any algorithm requires Ω(log n) bits of space
  • Bar-Yossef
  • Any algorithm requires Ω(ε⁻¹) bits of space
  • Indyk and W
  • If ε > 1/n^{1/9}, any algorithm needs Ω(ε⁻²) bits of
    space
  • W
  • If ε > 1/n^{1/2}, any algorithm needs Ω(ε⁻²) bits of
    space
  • Jayram, Kumar and Sivakumar
  • Simpler proof of the Ω(ε⁻²) bound for any ε > 1/m^{1/2}
  • Brody and Chakrabarti
  • Show above lower bounds hold even for multiple
    passes over the string
  • Combining upper and lower bounds, the complexity
    of this problem is
  • Θ(ε⁻² + log n) space and Θ(1) update and
    reporting time

5
Outline for Remainder of Talk
  • Proofs of the Upper Bounds
  • Proofs of the Lower Bounds

6
Hash Functions for Throwing Balls
  • We consider a random mapping f of B balls into C
    containers and count the number of non-empty
    containers
  • The expected number of non-empty containers is
    C − C(1 − 1/C)^B
  • If instead of the mapping f, we use an
    O(log(C/ε) / log log(C/ε))-wise independent mapping g,
    then
  • the expected number of non-empty containers under
    g is the same as that under f, up to a factor of
    (1 ± ε)
  • Proof based on approximate inclusion-exclusion
  • express 1 − (1 − 1/C)^B in terms of a series of
    binomial coefficients
  • truncate the series at an appropriate place
  • use limited independence to handle the remaining
    terms
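
As a quick sanity check of the expectation above, the following small simulation (an illustration assuming the fully random mapping f, not the limited-independence g) compares the empirical count of non-empty containers with C − C(1 − 1/C)^B.

    import random

    def expected_nonempty(B, C):
        # Expected number of non-empty containers: C - C*(1 - 1/C)^B
        return C - C * (1.0 - 1.0 / C) ** B

    def simulated_nonempty(B, C, trials=2000, rng=random.Random(0)):
        # Throw B balls into C containers uniformly at random and count non-empty containers
        total = 0
        for _ in range(trials):
            total += len({rng.randrange(C) for _ in range(B)})
        return total / trials

    print(expected_nonempty(500, 200))   # about 183.7
    print(simulated_nonempty(500, 200))  # close to the expectation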

7
Fast Hash Functions
  • Use hash functions g that can be evaluated in
    O(1) time.
  • If g is O((log C/ε)/(log log C/ε))-wise
    independent, the natural family of polynomial
    hash functions doesn't give O(1) evaluation time
  • We use theorems due to Pagh, Pagh, and Siegel
    that construct k-wise independent families for
    large k, and allow O(1) evaluation time
  • For example, Siegel shows
  • Let |U| = u and |V| = v with u = v^c for a
    constant c > 1, and suppose the machine word size
    is Ω(log v)
  • Let k = v^{o(1)} be arbitrary
  • For any constant d > 0 there is a randomized
    procedure that constructs a k-wise independent
    hash family H from U to V that succeeds with
    probability 1 − 1/v^d and requires v^d space. Each
    h ∈ H can be evaluated in O(1) time
  • Can show we have sufficiently random hash
    functions that can be evaluated in O(1) time and
    represented with O(ε⁻² + log n) bits of space
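
For contrast with the constructions above, the natural polynomial family mentioned on the slide is easy to write down: a random degree-(k−1) polynomial over a prime field is k-wise independent, but each evaluation costs O(k) word operations rather than O(1), which is why Siegel-style families are needed for large k. The prime and output range below are illustrative choices.

    import random

    class PolynomialHash:
        """k-wise independent hashing via a random degree-(k-1) polynomial mod a prime.
        Evaluation is O(k) operations (Horner's rule), not O(1)."""
        def __init__(self, k, prime=(1 << 61) - 1, out_range=1 << 16, rng=random.Random(0)):
            self.coeffs = [rng.randrange(prime) for _ in range(k)]  # k random coefficients
            self.prime = prime
            self.out_range = out_range

        def __call__(self, x):
            acc = 0
            for c in self.coeffs:            # Horner's rule: O(k) multiplications mod p
                acc = (acc * x + c) % self.prime
            return acc % self.out_range      # reducing to the output range adds a small bias

    h = PolynomialHash(k=8)
    print(h(12345), h(67890))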

8
Algorithm Outline
  • Set K = 1/ε²
  • Instantiate a (lg n) × K bit-matrix A, initializing
    entries of A to 0
  • Pick random hash functions f: [n] → [n] and
    g: [n] → [K]
  • Obtain a constant factor approximation R to F0
    somehow
  • Update(i): Set A[1, g(i)] = 1, A[2, g(i)] = 1, …,
    A[lsb(f(i)), g(i)] = 1
  • Estimator: Let T = |{ j ∈ [K] : A[log(16R/K), j] = 1 }|
  • Output (32R/K) ·
    ln(1 − T/K) / ln(1 − 1/K)
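
The outline above as an illustrative Python sketch, assuming fully random f and g (simulated here with dictionaries) and taking the rough estimate R as an argument; the actual algorithm uses the limited-independence hash families and the rough estimator described in this talk, so this is only a reading aid.

    import math
    import random

    def lsb(x):
        """1-based index of the least significant set bit of x."""
        return (x & -x).bit_length()

    class F0Sketch:
        def __init__(self, n, eps, rng=random.Random(0)):
            self.n = n
            self.K = max(2, int(1 / eps ** 2))            # K = 1/eps^2 columns
            self.rows = max(1, n.bit_length())            # about lg n rows
            self.A = [[0] * self.K for _ in range(self.rows + 1)]
            self.f, self.g, self.rng = {}, {}, rng        # simulated random hashes f, g

        def update(self, i):
            col = self.g.setdefault(i, self.rng.randrange(self.K))
            depth = lsb(self.f.setdefault(i, self.rng.randrange(1, self.n + 1)))
            for row in range(1, min(depth, self.rows) + 1):   # set A[1..lsb(f(i)), g(i)] = 1
                self.A[row][col] = 1

        def estimate(self, R):
            # T = number of columns with a 1 in row log(16R/K); constants follow the slide
            row = min(self.rows, max(1, round(math.log2(max(2.0, 16 * R / self.K)))))
            T = min(sum(self.A[row]), self.K - 1)             # avoid ln(0) if every column is hit
            return (32 * R / self.K) * math.log(1 - T / self.K) / math.log(1 - 1 / self.K)

The constants 16 and 32 follow the transcript; they implicitly assume conventions for lsb and for how R approximates F0 that are not spelled out here.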

9
Space Complexity
  • Naively, A is a (lg n) × K bit-matrix, so O(ε⁻² log n)
    space
  • Better: for each column j, store the identity of
    the largest row i(j) for which A[i(j), j] = 1.
    Note if A[i, j] = 1, then A[i', j] = 1 for all i' < i
  • Takes O(ε⁻² log log n) space
  • Better yet: maintain a base level I. For each
    column j, store max(i(j) − I, 0)
  • Given an O(1)-approximation R to F0 at each point
    in the stream, set I = log(R/K)
  • Don't need to remember i(j) if i(j) < I, since j
    won't be used in the estimator
  • For the j for which i(j) ≥ I, about 1/2 of such j
    will have i(j) = I, about one fourth of such j will
    have i(j) = I+1, etc.
  • Total number of bits to store offsets is now only
    O(K) = O(ε⁻²) with good probability at all points
    in the stream
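
A small simulation (illustrative only, assuming fully random hashing and a base level of roughly log(F0/K)) of why the offsets are cheap: the deepest 1 in each column concentrates near the base level, so the summed offsets are O(K).

    import random

    def sum_of_offsets(F0, K, rng=random.Random(0)):
        deepest = [0] * K                        # i(j): deepest row with a 1 in column j
        for _ in range(F0):
            col = rng.randrange(K)
            level = 1
            while rng.random() < 0.5:            # level behaves like lsb of a random value: P(level = i) = 2^-i
                level += 1
            deepest[col] = max(deepest[col], level)
        I = max(1, (F0 // K).bit_length())       # base level, roughly log2(F0 / K)
        return sum(max(d - I, 0) for d in deepest)

    print(sum_of_offsets(10**5, 400))            # typically a small multiple of K = 400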

10
The Constant Factor Approximation
  • Previous algorithms state that at each point in
    the stream, with probability 1 − δ, the output is
    an O(1)-approximation to F0
  • The space of such algorithms is O(log n · log(1/δ)).
  • Union-bounding over a stream of length m gives
    O(log n · log m) total space
  • We achieve O(log n) space, and guarantee the
    O(1)-approximation R of the algorithm is
    non-decreasing
  • Apply the previous scheme on a (log n) × (log n /
    log log n) matrix
  • For each column, maintain the identity of the
    deepest row with value 1
  • Output 2^i, where i is the largest row containing
    a constant fraction of 1s
  • We repeat the procedure O(1) times, and output
    the median of the estimates
  • Can show the output is correct with probability
    1 − O(1/log n)
  • Then we use the non-decreasing property to
    union-bound over O(log n) events
  • We only increase the base level every time R
    increases by a factor of 2
  • Note that the base level never decreases

11
Running Time
  • Blandford and Blelloch
  • Definition: a variable-length array (VLA) is a
    data structure implementing an array C[1], …, C[n]
    supporting the following operations
  • Update(i, x): sets the value of C[i] to x
  • Read(i): returns C[i]
  • The C[i] are allowed to have bit
    representations of varying lengths len(C[i])
  • Theorem: there is a VLA using O(n + Σᵢ len(C[i]))
    bits of space supporting worst-case O(1) updates
    and reads, assuming the machine word size is at
    least log n
  • Store our offsets in a VLA, giving O(1) update
    time for a fixed base level
  • Occasionally we need to update the base level and
    decrement offsets by 1
  • Show the base level only increases after Θ(ε⁻²)
    updates, so this work can be spread across those
    updates, giving O(1) worst-case update time
  • Copy the data structure, use it for performing
    this additional work so it doesn't interfere with
    reporting the correct answer
  • When base level changes, switch to copy
  • For O(1) reporting time, maintain a count of
    non-zero containers in a level

12
Outline for Remainder of Talk
  • Proofs of the Upper Bounds
  • Proofs of the Lower Bounds

13
1-Round Communication Complexity
[Slide diagram: Alice holds input x, Bob holds input y; the goal is to compute f(x, y)]
  • Alice sends a single, randomized message M(x) to
    Bob
  • Bob outputs g(M(x), y) for a randomized function
    g
  • g(M(x), y) should equal f(x,y) with constant
    probability
  • Communication cost CC(f) is |M(x)|, maximized
    over x and the random bits
  • Alice creates a string s(x), runs a randomized
    algorithm A on s(x), and
  • transmits the state of A(s(x)) to Bob
  • Bob creates a string s(y), continues A on s(y),
    thus computing A(s(x) ∘ s(y))
  • If A(s(x) ∘ s(y)) can be used to solve f(x,y),
    then space(A) ≥ CC(f)
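
The reduction in the last three bullets, written out schematically. The method names process, resume and output are illustrative, not from the talk; the point is that Alice's one-way message is just the memory state of the streaming algorithm A, so space(A) upper-bounds the communication.

    def one_way_protocol(A, s_x, s_y, decide):
        """Schematic reduction: a one-pass streaming algorithm A yields a one-round
        protocol, so CC(f) <= space(A); communication lower bounds become space lower bounds."""
        state = A.process(s_x)                 # Alice runs A on s(x) ...
        message = state                        # ... and sends its memory state (at most space(A) bits)
        final_state = A.resume(message, s_y)   # Bob continues A on s(y): computes A(s(x) ∘ s(y))
        return decide(A.output(final_state))   # Bob extracts f(x, y) from A's output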

14
The Ω(log n) Bound
  • Consider the equality function: f(x,y) = 1 if and
    only if x = y, for x, y ∈ {0,1}^{n/3}
  • Well known that CC(f) = Ω(log n) for (n/3)-bit
    strings x and y
  • Let C: {0,1}^{n/3} → {0,1}^n be an error-correcting
    code with all codewords of Hamming weight n/10
  • If x = y, then C(x) = C(y)
  • If x ≠ y, then Δ(C(x), C(y)) = Ω(n)
  • Let s(x) be any string on an alphabet of size n with
    the i-th character appearing in s(x) if and only if
    C(x)_i = 1. Similarly define s(y)
  • If x = y, then F0(s(x) ∘ s(y)) = n/10. Else,
    F0(s(x) ∘ s(y)) = n/10 + Ω(n)
  • A constant factor approximation to F0 solves
    f(x,y)
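
The encoding step of this reduction as a small sketch (the error-correcting code itself is abstracted away; codeword stands for C(x) or C(y), and the decision threshold is illustrative):

    def stream_from_codeword(codeword):
        """s(x): character i appears exactly when C(x)_i = 1."""
        return [i for i, bit in enumerate(codeword) if bit == 1]

    def looks_equal(f0_value, n):
        """Equal inputs give exactly n/10 distinct characters; unequal inputs give
        n/10 plus a constant fraction of n. A sufficiently good constant-factor
        approximation to F0 therefore separates the two cases."""
        return f0_value < 0.15 * n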

15
The Ω(ε⁻²) Bound
  • Let r = 1/ε². Gap-Hamming promise problem for x,
    y in {0,1}^r
  • f(x,y) = 1 if Δ(x,y) > ε⁻²/2
  • f(x,y) = 0 if Δ(x,y) < ε⁻²/2 − 1/ε
  • Theorem: CC(f) = Ω(ε⁻²)
  • Can prove this from the Indexing function
  • Alice has w ∈ {0,1}^r, Bob has i in {1, 2, …, r},
    output g(w, i) = w_i
  • Well-known that CC(g) = Ω(r)
  • Proof that CC(f) = Ω(r):
  • Alice sends the seed of a pseudorandom
    generator to Bob, so the parties have common
    random strings z_1, …, z_r ∈ {0,1}^r
  • Alice sets x = coordinate-wise-majority{z_j : w_j = 1}
  • Bob sets y = z_i
  • Since the z_j are random, if w_i = 1, then by
    properties of majority, with good probability
    Δ(x,y) < ε⁻²/2 − 1/ε; otherwise it is likely that
    Δ(x,y) > ε⁻²/2
  • Repeat a few times to get concentration

16
The Ω(ε⁻²) Bound Continued
  • Need to create strings s(x) and s(y) to have
    F0(s(x) ∘ s(y)) decide whether Δ(x,y) > ε⁻²/2 or
    Δ(x,y) < ε⁻²/2 − 1/ε
  • Let s(x) be a string on n characters where
    character i appears if and only if x_i = 1.
    Similarly define s(y)
  • F0(s(x) ∘ s(y)) = (wt(x) + wt(y) + Δ(x,y))/2
  • Alice sends wt(x) to Bob
  • A calculation shows a (1 ± ε)-approximation to
    F0(s(x) ∘ s(y)), together with wt(x) and wt(y),
    solves the Gap-Hamming problem
  • Total communication is space(A) + O(log(1/ε)), and it
    must be Ω(ε⁻²)
  • It follows that space(A) = Ω(ε⁻²)
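
A quick self-contained check of the counting identity above (not from the talk):

    def f0_of_concatenation(x, y):
        # Distinct characters in s(x) ∘ s(y): the union of the supports of x and y
        return len({i for i, b in enumerate(x) if b} | {i for i, b in enumerate(y) if b})

    def rhs(x, y):
        wt_x, wt_y = sum(x), sum(y)
        hamming = sum(a != b for a, b in zip(x, y))   # Δ(x, y)
        return (wt_x + wt_y + hamming) // 2           # always an integer

    x, y = [1, 0, 1, 1, 0], [1, 1, 0, 1, 0]
    assert f0_of_concatenation(x, y) == rhs(x, y) == 4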

17
Conclusion
Combining upper and lower bounds, the streaming
complexity of estimating F0 up to a (1 ± ε) factor
is Θ(ε⁻² + log n) bits of space and Θ(1) update
and reporting time
  • Upper bounds based on a careful combination of
    efficient hashing, sampling, and various data
    structures
  • Lower bounds come from 1-way communication
    complexity