Title: Range-Efficient Counting of Distinct Elements
1Range-Efficient Counting of Distinct Elements
- Srikanta Tirthapura
- Iowa State University
- (joint work with Phillip Gibbons, Aduri Pavan)
2Range-Efficient F0
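- Stream: [100,200], [0,10], [60,120], [5,25]
- F0 = |[0,25] ∪ [60,200]| = 26 + 141 = 167
[Figure: the four ranges drawn on a number line with endpoints 0, 5, 10, 25, 60, 100, 120, 200]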
3Range-Efficient F0
- Input stream: sequence of ranges [l1,r1], [l2,r2], ..., [lm,rm]
  - for each i, 0 < li < ri < n, and li, ri are integers
- Output
  - Return |[l1,r1] ∪ [l2,r2] ∪ ... ∪ [lm,rm]|
  - i.e. number of distinct elements in the union (F0)
- Constraints
  - Single pass through the data
  - Small workspace
  - Fast processing time
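As a point of reference for the problem definition, the exact answer can be computed offline by sorting and merging the ranges. The sketch below is not a streaming algorithm (it stores every range, so it violates the small-workspace constraint); the function name `union_size` is a hypothetical helper introduced only for illustration.

```python
def union_size(ranges):
    """Exact number of distinct integers covered by closed ranges [l, r].

    Offline baseline only: it keeps and sorts all ranges, which a
    single-pass, small-workspace algorithm is not allowed to do.
    """
    total = 0
    cur_l = cur_r = None
    for l, r in sorted(ranges):
        if cur_r is None or l > cur_r + 1:
            # Disjoint from the current merged block: close it out.
            if cur_r is not None:
                total += cur_r - cur_l + 1
            cur_l, cur_r = l, r
        else:
            # Overlapping or adjacent: extend the current block.
            cur_r = max(cur_r, r)
    if cur_r is not None:
        total += cur_r - cur_l + 1
    return total


print(union_size([(100, 200), (0, 10), (60, 120), (5, 25)]))  # 167
```

On the stream from the earlier example this yields 167, matching |[0,25] ∪ [60,200]|.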
4Reductions to Range-Efficient F0
[Diagram: Duplicate-Insensitive Sum, Max-Dominance Norm, and Counting Triangles in Graphs each reduce to Range-Efficient F0]
5Duplicate-Insensitive Sum
- Problem: sum of all distinct elements in a stream of integers
- Input stream: sequence of integers S = a1, a2, ..., an
- Output
  - Σ ai, taken over the distinct ai in S
- Example
  - S = 4, 5, 15, 4, 100, 4, 16, 15
  - Distinct elements = {4, 5, 15, 100, 16}
  - Sum = 140
6Reduction from Dup-Insensitive Sum to F0
Stream S from U = [0, m-1]; alternate stream S' from U' = [0, m^2-1]

S -> S'
4 -> 4m, 4m+1, ..., 4m+3
5 -> 5m, ..., 5m+4
15 -> 15m, ..., 15m+14
4 -> 4m, ..., 4m+3
100 -> 100m, ..., 100m+99
4 -> 4m, ..., 4m+3
15 -> 15m, ..., 15m+14

Duplicate-insensitive sum of S = number of distinct elements of S'
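The reduction can be sketched in a few lines: each element a is replaced by the range [a·m, a·m + a - 1], and the number of distinct integers covered equals the duplicate-insensitive sum. In this sketch an exact set stands in for a range-efficient F0 estimator, and the names `to_range` and `dup_insensitive_sum_via_f0` are hypothetical.

```python
def to_range(a, m):
    """Map element a (0 <= a < m) to the closed range [a*m, a*m + a - 1]."""
    return (a * m, a * m + a - 1)


def dup_insensitive_sum_via_f0(stream, m):
    """Duplicate-insensitive sum computed as F0 of the stream of ranges.

    Assumption: an exact set replaces the range-efficient F0 estimator,
    so this illustrates the reduction, not the small-space algorithm.
    """
    covered = set()
    for a in stream:
        lo, hi = to_range(a, m)
        covered.update(range(lo, hi + 1))  # a distinct integers per element
    return len(covered)


print(dup_insensitive_sum_via_f0([4, 5, 15, 4, 100, 4, 16, 15], m=101))  # 140
```

Ranges for different values of a never overlap because a < m, and repeated values produce identical ranges, so duplicates add nothing to the count.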
7Max Dominance Norm
- Given k streams of m integers each (the elements of the streams arrive in an arbitrary order), where 1 ≤ ai,j ≤ n:
  a1,1 a1,2 ... a1,m
  a2,1 a2,2 ... a2,m
  ...
  ak,1 ak,2 ... ak,m
- Return
  - Σ (j = 1 to m) max (1 ≤ i ≤ k) ai,j
[Figure: example streams a and b]
8Reduction From Max Dominance Norm
- Input stream I, output stream O: F0 of output stream = dominance norm of input stream
- Assign ranges to the m positions: [1,n], [n+1,2n], ..., [(m-1)n+1, mn]
- When element ai,j is received, generate the range [(j-1)n+1, (j-1)n+ai,j]
- Observation: F0 of the resulting stream of ranges is the dominance norm of the input stream
[Figure: example streams a and b]
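A minimal sketch of this reduction follows. It assumes that position j owns a block of n consecutive integers starting at (j-1)n + 1, so that ranges generated for different positions cannot overlap; an exact set again stands in for the range-efficient F0 estimator, and the function name and the (i, j, value) tuple layout are hypothetical.

```python
def dominance_norm_via_f0(items, n):
    """Max-dominance norm computed as F0 of a stream of ranges.

    items: iterable of (i, j, value), with stream index i, 1-based
    position j, and 1 <= value <= n, in arbitrary arrival order.
    Assumption: position j owns the block [(j-1)*n + 1, j*n].
    """
    covered = set()  # exact stand-in for a range-efficient F0 estimator
    for _i, j, value in items:
        lo = (j - 1) * n + 1
        covered.update(range(lo, lo + value))  # range [lo, lo + value - 1]
    return len(covered)


a = [3, 1, 4]
b = [2, 5, 1]
items = [(1, j + 1, x) for j, x in enumerate(a)]
items += [(2, j + 1, x) for j, x in enumerate(b)]
items.reverse()  # arrival order does not matter
print(dominance_norm_via_f0(items, n=5))  # 12 = max(3,2) + max(1,5) + max(4,1)
```

Within one block, the ranges all start at the same point, so their union is simply the longest one, which has length max over i of ai,j.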
9Talk Outline
- Range Efficient F0
- Reductions Among Data Stream Problems
- Algorithm for Range Efficient F0 (building on distinct sampling)
- Update Streams
- Open Questions
10Counting Distinct Elements (F0)
- Example
  - How many different users accessed my website today?
  - Stream 1,1,2,3,4,1,2: F0 = 4
- Numerous applications in databases and networking
- Prior Work
  - Flajolet-Martin (1985)
  - Alon, Matias and Szegedy (1996)
  - Gibbons and Tirthapura (2001)
  - Bar-Yossef et al. (2002) (currently most space-efficient)
  - Indyk-Woodruff (2003) (Lower Bounds)
11Range-Efficient F0 (Pavan and Tirthapura)
[Diagram: Range Sampling for 2-way Independent Hash Functions, combined with the Distinct Sampling Algorithm for F0]
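12Sampling Based Algorithm for F0 (Gibbons and Tirthapura 2001)
- D = distinct elements in stream
- S0 = U = {1,2,3,...,n}
- S1 = {2,4,7,...}: each element of S0 kept with p = 1/2; sample is D ∩ S1
- S2 = {4,7,11,...}: each element of S1 kept with p = 1/2; sample is D ∩ S2
- S0, S1, S2, ... stored implicitly using hash functions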
13Distinct Sampling
Sample = {}, p = 1
Target workspace: 4 numbers
14Distinct Sampling
Stream: 5
Sample = {5}, p = 1
Target workspace: 4 numbers
15Distinct Sampling
Stream: 5 3
Sample = {5,3}, p = 1
Target workspace: 4 numbers
16Distinct Sampling
Stream: 5 3 7
Sample = {5,3,7}, p = 1
Target workspace: 4 numbers
17Distinct Sampling
Stream: 5 3 7 5
Sample = {5,3,7}, p = 1
Target workspace: 4 numbers
18Distinct Sampling
Stream: 5 3 7 5 6
Sample = {5,3,7,6}, p = 1
Target workspace: 4 numbers
19Distinct Sampling
Stream: 5 3 7 5 6 8
Sample = {5,3,7,6,8}, p = 1
Overflow: Sample = Sample ∩ S1
Sample = {3,6,8}, p = ½
Target workspace: 4 numbers
20Distinct Sampling
Stream: 5 3 7 5 6 8 9
Sample = {3,6,8,9}, p = ½
Target workspace: 4 numbers
21Distinct Sampling
Stream: 5 3 7 5 6 8 9 7
Same decision for both occurrences of 7
Sample = {3,6,8,9}, p = ½
Target workspace: 4 numbers
22Distinct Sampling
Stream: 5 3 7 5 6 8 9 7 2
Sample = {3,6,8,9,2}, p = ½
Overflow: Sample = Sample ∩ S2
Sample = {6,9}, p = ¼
Target workspace: 4 numbers
23Distinct Sampling
Stream: 5 3 7 5 6 8 9 7 2 2 7 8 8 3 5
Finally, Sample = {6,9}, p = ¼
Estimate of F0 = (sample size) × 4 = 8
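The walkthrough above can be sketched as a small class. This is an illustrative sketch under stated assumptions, not the authors' implementation: the class name `DistinctSampler`, the choice of prime 2^31 - 1, the `seed` parameter, and the threshold rule "x is in S_i when h(x) < p / 2^i" are all assumptions made for the example. Because membership depends only on the hash of x, repeated occurrences of an element always get the same decision.

```python
import random


class DistinctSampler:
    """Adaptive distinct sampling with h(x) = (a*x + b) mod p.

    Assumption: level i keeps x when h(x) < p >> i, so S0 contains
    everything and each level keeps roughly half of the previous one.
    """

    def __init__(self, capacity, prime=2_147_483_647, seed=None):
        rng = random.Random(seed)
        self.capacity = capacity
        self.p = prime
        self.a = rng.randrange(1, prime)
        self.b = rng.randrange(prime)
        self.level = 0
        self.sample = set()

    def _in_level(self, x, level):
        return (self.a * x + self.b) % self.p < (self.p >> level)

    def add(self, x):
        if not self._in_level(x, self.level):
            return
        self.sample.add(x)  # a set, so duplicates change nothing
        while len(self.sample) > self.capacity:
            # Overflow: halve the sampling probability and re-filter.
            self.level += 1
            self.sample = {y for y in self.sample
                           if self._in_level(y, self.level)}

    def estimate(self):
        # Sample size divided by the sampling probability 2^-level.
        return len(self.sample) << self.level


sampler = DistinctSampler(capacity=4, seed=1)
for x in [5, 3, 7, 5, 6, 8, 9, 7, 2, 2, 7, 8, 8, 3, 5]:
    sampler.add(x)
print(sampler.level, sorted(sampler.sample), sampler.estimate())
```

With a workspace of only 4 numbers the estimate is coarse, as in the slides; the specific sample retained depends on the random hash coefficients.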
24Counting Distinct Elements
- Finally, return a sample of distinct elements of the stream of a large enough size
- If target workspace is O((1/ε^2) log(1/δ)) integers, then estimate of F0 is an (ε, δ)-approximation
- Hash functions need only be pairwise independent and can be stored in small space
25Sampling Using Independent Coin Tosses
Distinct Sampling Using Hash Functions
[Figure: a hash function assigns each stream element a 0/1 sampling decision, e.g. 0, 1, 0, 0, 0, 1]
26Adaptive Sampling for Range-Efficient F0
- Naïve approach: given range [x,y], successively insert x, x+1, ..., y into the F0 sampling algorithm
  - Problem: time per range very large
- Range-sampling: given stream element [p,q], how to sample all elements in [p,q] quickly?
  - At sampling level i, quickly compute [p,q] ∩ Si
27Hash Functions, and S0,S1,S2
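- h(x) = (ax + b) mod p, where p is prime and a, b are random in [0, p-1]
- If h(x) ∈ [0, vi], then x ∈ Si
[Figure: elements 1..n mapped by h into 0..p-1, with thresholds v1, v2, v3 marked]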
28Range Sampling
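- f(x) = (ax + b) mod p
- Compute |{x ∈ [x1,x2] : f(x) ∈ [0,v]}|
[Figure: elements x1..x2 within 1..n mapped by f into 0..p-1, with the target interval [0,v] marked]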
29Arithmetic Progression
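- f(x) = (ax + b) mod p
- f(x1), f(x1+1), ..., f(x2) form an arithmetic progression modulo p with common difference a
[Figure: elements x1..x2 within 1..n mapped onto the modulo p circle]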
30Low and High Revolutions
- Each revolution, number of hits on [0,v] is
  - floor(v/a) (low rev)
  - floor(v/a) + 1 (high rev)
- Task: count number of low, high revolutions
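The low/high claim can be checked by brute force on a small instance. The sketch below walks the progression f(x) = (a·x + b) mod p, starts a new revolution whenever the value wraps around, and counts hits on [0, v] per revolution. It assumes 0 < a < p; the function name and parameter values are hypothetical, and this is a verification aid rather than the fast algorithm.

```python
def hits_per_revolution(a, b, p, v, count):
    """Hits on [0, v] in each revolution of (a*x + b) mod p, x = 0..count-1.

    Assumes 0 < a < p, so consecutive values either increase by a or
    wrap around exactly once (which marks the start of a new revolution).
    """
    hits = []
    prev = None
    for x in range(count):
        y = (a * x + b) % p
        if prev is None or y < prev:  # wrapped: a new revolution begins
            hits.append(0)
        if y <= v:
            hits[-1] += 1
        prev = y
    return hits


a, b, p, v = 7, 0, 101, 30
hits = hits_per_revolution(a, b, p, v, count=200)
# Every complete revolution is low (v // a hits) or high (v // a + 1 hits);
# the final revolution may be cut short, so it is excluded from the check.
print(set(hits[:-1]) <= {v // a, v // a + 1})  # True
```

Here v // a = 4, so each complete revolution contributes either 4 or 5 hits.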
31Starting Points of Revolutions
- Can find r = v mod a such that
  - If starting point in [0,r], then high revolution
  - Else low revolution
- Task: count the number of revolutions with starting point in [0,r]
32Recursive Algorithm
[Figure: the modulo p circle and the modulo a circle]
- Observation: starting points form an arithmetic progression on the modulo a circle, with difference (-p mod a)
33Recursive Algorithm
- Focus on common difference
- Two reductions possible
  - Common difference a -> common difference a - (p mod a)
  - Common difference a -> common difference (p mod a)
- At least one of the two common differences is smaller than a/2
34Range Sampling
- Theorem: there is an algorithm for sampling range [x,y] using 2-way independent hash functions with
  - Time complexity O(log (y-x))
  - Space complexity O(log (y-x) + log m)
- Plug back into distinct sampling to get range-efficient F0 algorithm
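For readers who want a runnable way to count hash hits over a range in logarithmic time, the sketch below uses the standard "floor sum" technique, a Euclid-style recursion on the modulus. This is a swapped-in technique and not the revolution-counting recursion described in the slides, although both shrink the problem in a similar way. It relies on the identity that (y mod p) ≤ v exactly when floor(y/p) - floor((y + p - v - 1)/p) + 1 equals 1. The function names are hypothetical.

```python
def floor_sum(n, m, a, b):
    """Sum of floor((a*i + b) / m) for i = 0..n-1; needs n >= 0, m >= 1, a, b >= 0."""
    total = 0
    while True:
        if a >= m:
            total += (n - 1) * n // 2 * (a // m)
            a %= m
        if b >= m:
            total += n * (b // m)
            b %= m
        y_max = a * n + b
        if y_max < m:
            return total
        # Swap the roles of slope and modulus, as in Euclid's algorithm.
        n, b = divmod(y_max, m)
        m, a = a, m


def count_hits(a, b, p, v, x1, x2):
    """Number of x in [x1, x2] with (a*x + b) mod p in [0, v]; needs 0 <= v < p."""
    n = x2 - x1 + 1
    b0 = (a * x1 + b) % p  # shift so the progression starts at index 0
    return n + floor_sum(n, p, a, b0) - floor_sum(n, p, a, b0 + p - v - 1)


def count_hits_naive(a, b, p, v, x1, x2):
    return sum(1 for x in range(x1, x2 + 1) if (a * x + b) % p <= v)


args = (37, 11, 101, 25, 3, 500)
print(count_hits(*args) == count_hits_naive(*args))  # True
```

The fast version performs a number of loop iterations logarithmic in the modulus, whereas the naive version touches every element of the range.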
35Results
Input stream: sequence of ranges [l1,r1], [l2,r2], ..., [lm,rm]; for each i, 0 < li < ri < n, and li, ri are integers
Output: |[l1,r1] ∪ [l2,r2] ∪ ... ∪ [lm,rm]|
- Randomized (ε,δ)-approximation algorithm for range-efficient F0 of a data stream
- Processing time (n is the size of the universe)
  - Amortized processing time per interval: O(log(1/δ) log(n/ε))
  - Time to answer a query for F0 is a constant
- Workspace: O((1/ε^2)(log(1/δ))(log n))
Pavan, Tirthapura: SICOMP (to appear)
36Prior Work
- Bar-Yossef, Kumar, Sivakumar 2002
  - First studied range-efficient F0
  - Algorithms with higher space complexity
- Cormode, Muthukrishnan 2003
  - Max-dominance norm
- Nath, Gibbons, Seshan, Anderson 2004
  - Duplicate-insensitive sum assuming ideal hash functions
37Comparison
Range-Efficient F0 | Bar-Yossef et al. | Pavan and Tirthapura
Time | O((log^5 n)(1/ε^5)(log 1/δ)) | O((log n + log 1/ε)(log 1/δ))
Space | O((1/ε^3)(log n)(log 1/δ)) | O((1/ε^2)(log n)(log 1/δ))

Max-Dominance Norm | Cormode, Muthukrishnan | Pavan and Tirthapura
Time | O((1/ε^4)(log n)(log m)(log 1/δ)) | O((log n + log 1/ε)(log 1/δ))
Space | O((1/ε^2)(log n + (1/ε)(log m)(log log m))(log 1/δ)) | O((1/ε^2)(log m + log n)(log 1/δ))
38Other Applications of Distinct Sampling
- Sample of distinct elements of the stream of any desired target size
- Approximate median of all distinct elements in stream (duplicate-insensitive median)
- Distinct frequent elements (heavy hitters in network monitoring)
39Update Streams
- Insertions and deletions of elements into the stream: (11, 1), (7, 3), (4, 2), (7, -2), (11, -1)
- Distinct elements problem: how many elements have a positive cumulative weight?
- Assume a sanity constraint: no element has weight less than 0
- Sampling algorithm described so far fails, since it can only decrease sampling probability as stream becomes larger
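To make the update-stream version of the problem concrete, the exact answer can be computed by tracking every element's cumulative weight. This is a reference definition only, since it uses space proportional to the number of distinct elements; the function name `distinct_positive` is hypothetical.

```python
from collections import defaultdict


def distinct_positive(updates):
    """Count elements whose cumulative weight is positive.

    updates: iterable of (element, delta) pairs; delta may be negative.
    Exact baseline that stores one counter per element seen.
    """
    weight = defaultdict(int)
    for x, delta in updates:
        weight[x] += delta
    return sum(1 for w in weight.values() if w > 0)


updates = [(11, 1), (7, 3), (4, 2), (7, -2), (11, -1)]
print(distinct_positive(updates))  # 2: elements 7 and 4 remain
```

In the example, element 11 is inserted and then fully deleted, so it must not be counted, which is exactly the case a sample that can only shrink cannot handle.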
40Distinct Sampling on Update Streams (three independent approaches)
- Sumit Ganguly, Minos N. Garofalakis, Rajeev Rastogi: Processing Set Expressions over Continuous Update Streams. SIGMOD 2003; followed up by Ganguly, 2005 and Ganguly, Majumder 2006
- Graham Cormode, S. Muthukrishnan, Irina Rozenbaum: Summarizing and Mining Inverse Distributions on Data Streams via Dynamic Inverse Sampling. VLDB 2005
- Gereon Frahling, Piotr Indyk, Christian Sohler: Sampling in dynamic data streams and applications. SoCG 2005
41Distinct Elements on Update Streams
- Use of K-Set Structure in storing samples (Ganguly, Garofalakis, Rastogi 2003; Ganguly 2005; Ganguly, Majumder 2006)
42K-Set Structure
- Small space data structure for multi-set S (size Θ(K))
- Operations
  - Insert (x,v) into S
  - Delete (x,v) from S
  - Membership query (is x in S?); what is the number of distinct elements in S?
- If |S| ≤ K, then queries answered correctly
[Figure: active and silent phases of the structure relative to the threshold K]
43Counting Distinct Elements on Update Streams
- Sample stream at different probabilities: 1, ½, ¼, ...
- Store each of (D ∩ S0, D ∩ S1, D ∩ S2, ...) in a k-set structure for an appropriate value of k
- When queried, use the highest probability sample that hasn't overflowed yet
44Distributed Streams
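- Alice: observes Stream A = 11 54 21 11 2 45 21 1 in her workspace and sends Sketch(A) to the referee
- Bob: observes Stream B = 1 5 21 2 54 21 35 in his workspace and sends Sketch(B) to the referee
- Referee: computes Dup-Ins-Sum(A,B) from Sketch(A) and Sketch(B)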
45Summary
Distinct Sampling extends to:
- Range-efficiency (range-sampling)
- Update streams (k-set structure)
- Sliding windows (multiple samples)
46Open Questions
- Can we efficiently handle higher-dimensional ranges?
- Klee's measure problem in streaming model
47Open Questions
- Range-efficient F0 under update streams
- Duplicate-insensitive Fk (k ≥ 2), range-efficient Fk