Range-Efficient Counting of Distinct Elements

IIT Kanpur Streams Workshop, 8/26/09

1
Range-Efficient Counting of Distinct Elements
  • Srikanta Tirthapura
  • Iowa State University
  • (joint work with Phillip Gibbons, Aduri Pavan)

2
Range-Efficient F0
  • Stream: [100,200], [0,10], [60,120], [5,25]
  • F0 = |[0,25] ∪ [60,200]| = 167

[Figure: the four ranges drawn on a number line with endpoints 0, 5, 10, 25, 60, 100, 120, 200]
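For small inputs like the example above, the exact value can be checked offline by sorting and merging the ranges. The sketch below is only a non-streaming baseline for verification, not the algorithm of this talk; the function name union_size is made up for illustration.

```python
def union_size(ranges):
    """Exact number of distinct integers covered by closed ranges [l, r]."""
    total = 0
    cur_l = cur_r = None
    for l, r in sorted(ranges):
        if cur_r is None or l > cur_r + 1:
            # Gap before this range: close the current block, start a new one.
            if cur_r is not None:
                total += cur_r - cur_l + 1
            cur_l, cur_r = l, r
        else:
            # Overlapping or adjacent range: extend the current merged block.
            cur_r = max(cur_r, r)
    if cur_r is not None:
        total += cur_r - cur_l + 1
    return total


# The stream from the slide: [0,25] has 26 integers and [60,200] has 141.
print(union_size([(100, 200), (0, 10), (60, 120), (5, 25)]))  # 167
```

This needs all ranges in memory and a sort, which is exactly what the single-pass, small-workspace constraints of the streaming problem rule out.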
3
Range-Efficient F0
  • Input stream: sequence of ranges [l1,r1], [l2,r2], ..., [lm,rm]
  • For each i, 0 < li < ri < n, and li, ri are integers
  • Output: |[l1,r1] ∪ [l2,r2] ∪ ... ∪ [lm,rm]|, i.e. the number of distinct elements in the union (F0)
  • Constraints:
  • Single pass through the data
  • Small workspace
  • Fast processing time

4
Reductions to Range-Efficient F0
[Diagram: Duplicate-Insensitive Sum, Max-Dominance Norm, and Counting Triangles in Graphs each reduce to Range-Efficient F0]
5
Duplicate-Insensitive Sum
  • Problem: sum of all distinct elements in a stream of integers
  • Input stream: sequence of integers S = a1, a2, ..., an
  • Output: Σ ai, taken over the distinct ai in S
  • Example: S = 4, 5, 15, 4, 100, 4, 16, 15
  • Distinct elements: 4, 5, 15, 100, 16
  • Sum = 140

6
Reduction from Dup-Insensitive Sum to F0
S (stream from U = [0, m-1])      S' (alternate stream from U = [0, m²-1])
4                                 4m, 4m+1, ..., 4m+3
5                                 5m, ..., 5m+4
15                                15m, ..., 15m+14
4                                 4m, ..., 4m+3
100                               100m, ..., 100m+99
4                                 4m, ..., 4m+3
15                                15m, ..., 15m+14
Duplicate-Insensitive Sum of S  = Number of Distinct Elements of S'
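The reduction can be tried on a small stream. In this minimal sketch each value a is mapped to the range [a·m, a·m + a - 1], and an exact set stands in for the range-efficient F0 estimator; the function name and the choice m = 101 are assumptions made for the example.

```python
def dup_insensitive_sum_via_ranges(stream, m):
    """Duplicate-insensitive sum of values in [0, m-1], computed as an F0 count."""
    covered = set()
    for a in stream:
        # Value a contributes exactly a points inside its own block
        # [a*m, (a+1)*m - 1]; a repeated a regenerates the same points.
        covered.update(range(a * m, a * m + a))
    # Exact stand-in for a range-efficient F0 estimator over the ranges.
    return len(covered)


S = [4, 5, 15, 4, 100, 4, 16, 15]
print(dup_insensitive_sum_via_ranges(S, m=101))  # 140
```

Because every value owns a disjoint block of width m, ranges from different values never overlap, so the count of distinct points equals the sum of the distinct values.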
7
Max Dominance Norm
  • Given k streams of m integers each (the elements of the streams arrive in an arbitrary order), where 1 ≤ ai,j ≤ n:
    a1,1 a1,2 ... a1,m
    a2,1 a2,2 ... a2,m
    ...
    ak,1 ak,2 ... ak,m
  • Return Σ_{j=1..m} max_{1≤i≤k} ai,j
8
Reduction From Max Dominance Norm
  • Input stream I, output stream O: F0 of the output stream = dominance norm of the input stream
  • Assign ranges to the m positions: [1,n], [n+1,2n], ..., [(m-1)n+1, mn]
  • When element ai,j is received, generate the range [(j-1)n+1, (j-1)n+ai,j]
  • Observation: F0 of the resulting stream of ranges is the dominance norm of the input stream
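A small check of this reduction, as a sketch under stated assumptions: position j is taken to own the block [(j-1)n+1, jn], element ai,j generates the first ai,j points of that block, and an exact set again stands in for the range-efficient F0 estimator. Both function names are hypothetical.

```python
def dominance_norm(streams):
    """Direct definition: sum over positions j of the max over streams i."""
    return sum(max(column) for column in zip(*streams))


def dominance_norm_via_ranges(items, n):
    """items: (i, j, a_ij) triples in any order, with 1 <= a_ij <= n and j 1-based."""
    covered = set()
    for _i, j, a in items:
        base = (j - 1) * n
        # Range [base + 1, base + a]: the first a points of position j's block.
        covered.update(range(base + 1, base + a + 1))
    # Per block, the union of prefixes has size max_i a_ij.
    return len(covered)


streams = [[3, 1, 4], [2, 5, 1]]  # k = 2 streams, m = 3 positions
items = [(i + 1, j + 1, a) for i, row in enumerate(streams) for j, a in enumerate(row)]
print(dominance_norm(streams), dominance_norm_via_ranges(reversed(items), n=5))  # 12 12
```

All ranges for one position share the same left endpoint, so their union is the longest of them, which is why the distinct count picks out the maximum.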
9
Talk Outline
  • Range Efficient F0
  • Reductions Among Data Stream Problems
  • Algorithm for Range Efficient F0 (building on
    distinct sampling)
  • Update Streams
  • Open Questions

10
Counting Distinct Elements (F0)
  • Example: how many different users accessed my website today?
  • Stream: 1, 1, 2, 3, 4, 1, 2; F0 = 4
  • Numerous Applications in databases and networking
  • Prior Work
  • Flajolet-Martin (1985)
  • Alon, Matias and Szegedy (1996)
  • Gibbons and Tirthapura (2001)
  • Bar-Yossef et al. (2002) (currently most
    space-efficient)
  • Indyk-Woodruff (2003) (Lower Bounds)

11
Range-Efficient F0 (Pavan and Tirthapura)
  • Range Sampling for 2-way Independent Hash Functions
  • Distinct Sampling Algorithm for F0

12
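Sampling-Based Algorithm for F0 (Gibbons and Tirthapura 2001)
  • D = distinct elements in stream
  • S0 = U = {1, 2, 3, ..., n}
  • S1: each element of S0 kept with p = 1/2, e.g. {2, 4, 7, ...}; sample is D ∩ S1
  • S2: each element of S1 kept with p = 1/2, e.g. {4, 7, 11, ...}; sample is D ∩ S2
  • S0, S1, S2, ... are stored implicitly using hash functions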
13
Distinct Sampling
Stream: (empty)
Sample: {}, p = 1
Target workspace: 4 numbers
14
Distinct Sampling
Stream: 5
Sample: {5}, p = 1
Target workspace: 4 numbers
15
Distinct Sampling
Stream: 5 3
Sample: {5, 3}, p = 1
Target workspace: 4 numbers
16
Distinct Sampling
Stream: 5 3 7
Sample: {5, 3, 7}, p = 1
Target workspace: 4 numbers
17
Distinct Sampling
Stream: 5 3 7 5
Sample: {5, 3, 7}, p = 1
Target workspace: 4 numbers
18
Distinct Sampling
Stream: 5 3 7 5 6
Sample: {5, 3, 7, 6}, p = 1
Target workspace: 4 numbers
19
Distinct Sampling
Stream: 5 3 7 5 6 8
Sample: {5, 3, 7, 6, 8}, p = 1
Overflow: Sample = Sample ∩ S1
Sample: {3, 6, 8}, p = ½
Target workspace: 4 numbers
20
Distinct Sampling
Stream: 5 3 7 5 6 8 9
Sample: {3, 6, 8, 9}, p = ½
Target workspace: 4 numbers
21
Distinct Sampling
Stream: 5 3 7 5 6 8 9 7
Same decision for both occurrences of 7
Sample: {3, 6, 8, 9}, p = ½
Target workspace: 4 numbers
22
Distinct Sampling
Stream: 5 3 7 5 6 8 9 7 2
Sample: {3, 6, 8, 9, 2}, p = ½
Overflow: Sample = Sample ∩ S2
Sample: {6, 9}, p = ¼
Target workspace: 4 numbers
23
Distinct Sampling
Stream: 5 3 7 5 6 8 9 7 2 2 7 8 8 3 5
Finally, Sample: {6, 9}, p = ¼
Estimate of F0 = (sample size)(4) = 8
24
Counting Distinct Elements
  • Finally, return a sample of distinct elements of the stream of a large enough size
  • If the target workspace is O((1/ε²) log(1/δ)) integers, then the estimate of F0 is an (ε, δ)-approximation
  • Hash functions need only be pairwise independent and can be stored in small space
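The walkthrough on the preceding slides can be sketched in a few lines. This is an illustrative sketch, not the authors' code: the class name, the fixed prime 2^31 - 1, the seeded generator, and the rule "x is in level i when h(x) < p / 2^i" are all assumptions chosen so that the levels are nested and each halves the sampling probability.

```python
import random


class DistinctSampler:
    """Distinct sampling with a pairwise-independent hash h(x) = (a*x + b) mod p."""

    def __init__(self, capacity, prime=2_147_483_647, seed=0):
        rng = random.Random(seed)
        self.p = prime
        self.a = rng.randrange(1, prime)
        self.b = rng.randrange(prime)
        self.capacity = capacity  # target workspace, in numbers
        self.level = 0            # current sampling probability is 2**-level
        self.sample = set()

    def _in_level(self, x, level):
        # Assumed rule: x is in S_level iff h(x) is below p / 2**level.
        return (self.a * x + self.b) % self.p < (self.p >> level)

    def add(self, x):
        if self._in_level(x, self.level):
            self.sample.add(x)
            while len(self.sample) > self.capacity:
                # Overflow: move to the next level and keep only its members.
                self.level += 1
                self.sample = {y for y in self.sample if self._in_level(y, self.level)}

    def estimate(self):
        return len(self.sample) * (1 << self.level)


sampler = DistinctSampler(capacity=4)
for x in [5, 3, 7, 5, 6, 8, 9, 7, 2, 2, 7, 8, 8, 3, 5]:
    sampler.add(x)
print(sampler.sample, sampler.level, sampler.estimate())
```

Because membership in a level depends only on h(x), every occurrence of the same element gets the same decision, so duplicates never change the sample. Which elements survive depends on the random a and b, so the printed sample will generally differ from the one on the slides.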

25
Sampling Using Independent Coin Tosses
Distinct Sampling Using Hash Functions
[Figure: a hash function assigns each stream element a bit (0, 1, 0, 0, 0, 1)]
26
Adaptive Sampling for Range-Efficient F0
  • Naïve approach: given range [x,y], successively insert x, x+1, ..., y into the F0 sampling algorithm
  • Problem: time per range is very large
  • Range sampling: given stream element [p,q], how to sample all elements in [p,q] quickly?
  • At sampling level i, quickly compute [p,q] ∩ Si
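The naïve approach is easy to write down, which makes its cost visible. The sketch below assumes a hash of the form (a·x + b) mod p and a level threshold v, so that x belongs to the level when its hash value lies in [0, v]; the function name and the small parameters are made up for the example.

```python
def naive_range_sample(a, b, p, v, lo, hi):
    """All x in [lo, hi] with (a*x + b) mod p in [0, v]; one hash evaluation per x."""
    return [x for x in range(lo, hi + 1) if (a * x + b) % p <= v]


# Hash values for x = 0..6 are 1, 4, 0, 3, 6, 2, 5, so three of them are <= 2.
print(naive_range_sample(a=3, b=1, p=7, v=2, lo=0, hi=6))  # [0, 2, 5]
```

The loop runs hi - lo + 1 times, so a single long range can cost as much as the whole universe; range sampling aims to replace this loop with a computation that is logarithmic in the range length.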

27
Hash Functions, and S0,S1,S2
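  • h(x) = (ax + b) mod p, where p is prime and a, b are chosen at random from [0, p-1]
  • If h(x) ∈ [0, vi], then x ∈ Si
[Figure: the universe 1..n mapped into the hash range 0..p-1, with thresholds v1, v2, v3 marked]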
28
Range Sampling
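  • f(x) = (ax + b) mod p
  • Compute |{x ∈ [x1, x2] : f(x) ∈ [0, v]}|
[Figure: the range [x1, x2] inside the universe 1..n, mapped into the hash range 0..p-1 with target interval [0, v]]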
29
Arithmetic Progression
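  • f(x) = (ax + b) mod p
  • f(x1), f(x1+1), ..., f(x2) form an arithmetic progression modulo p with common difference a
[Figure: the range [x1, x2] inside the universe 1..n]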
30
Low and High Revolutions
  • In each revolution, the number of hits on [0,v] is either
  • floor(v/a) (low revolution), or
  • floor(v/a) + 1 (high revolution)
  • Task: count the number of low and high revolutions

31
Starting Points of Revolutions
  • Can find r = v mod a such that
  • If the starting point is in [0,r], then high revolution
  • Else low revolution
  • Task: count the number of revolutions with starting point in [0,r]
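The low/high classification can be checked by brute force. This sketch assumes a revolution starts at a point s with 0 ≤ s < a, steps by a, and that v < p so the hits on [0,v] are not cut off by the wrap-around; the threshold used is v mod a, which is what the brute-force comparison confirms. Both function names are hypothetical.

```python
def hits_in_revolution(s, a, v):
    """Hits on [0, v] by the points s, s+a, s+2a, ... for a start 0 <= s < a."""
    low = v // a
    # High revolution (one extra hit) exactly when the start is at most v mod a.
    return low + 1 if s <= v % a else low


def hits_brute_force(s, a, v):
    """Count the same hits by enumerating the points directly."""
    return len(range(s, v + 1, a))


# With a = 5 and v = 12: starts 0, 1, 2 give 3 hits, starts 3, 4 give 2 hits.
print([hits_in_revolution(s, 5, 12) for s in range(5)])  # [3, 3, 3, 2, 2]
```

So the per-revolution count takes only two values, and the whole problem reduces to counting how many starting points fall in a short interval.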
32
Recursive Algorithm
[Figure: the modulo-p circle and the modulo-a circle]
  • Observation: starting points form an arithmetic progression with difference (-p mod a)
33
Recursive Algorithm
  • Focus on common difference
  • Two reductions possible: from common difference a to either a - (p mod a) or (p mod a)
  • At least one of the two common differences is at most a/2
34
Range Sampling
  • Theorem: there is an algorithm for sampling range [x,y] using 2-way independent hash functions with
  • Time complexity O(log(y-x))
  • Space complexity O(log(y-x) + log m)
  • Plug back into distinct sampling to get range-efficient F0 algorithm
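To make the logarithmic bound concrete, here is a sketch that counts the hits of (a·x + b) mod p on [0, v] over a range without visiting each x. It does not reproduce the recursive revolution-counting algorithm of the talk; it swaps in a different but related Euclid-style technique, the "floor sum" recursion, together with the identity [y mod p ≤ v] = floor(y/p) - floor((y + p - v - 1)/p) + 1. Function names and parameters are assumptions made for the example.

```python
def floor_sum(n, m, a, b):
    """Sum of (a*i + b) // m for i in 0..n-1, assuming n >= 0, m >= 1, a >= 0, b >= 0."""
    total = 0
    while True:
        if a >= m:
            total += (n - 1) * n // 2 * (a // m)
            a %= m
        if b >= m:
            total += n * (b // m)
            b %= m
        y_max = a * n + b
        if y_max < m:
            return total
        # Swap the roles of slope and modulus, as in Euclid's algorithm.
        n, b = divmod(y_max, m)
        m, a = a, m


def count_range_hits(a, b, p, v, lo, hi):
    """Number of x in [lo, hi] with (a*x + b) mod p <= v, assuming 0 <= v < p."""
    n = hi - lo + 1
    if n <= 0:
        return 0
    a %= p
    b0 = (a * lo + b) % p  # shift the range so that it starts at index 0
    return floor_sum(n, p, a, b0) - floor_sum(n, p, a, b0 + p - v - 1) + n


print(count_range_hits(a=3, b=1, p=7, v=2, lo=0, hi=6))  # 3
```

Each pass through the loop reduces the modulus the way Euclid's algorithm does, so the number of iterations is logarithmic in the parameters. This only counts the hits; recovering the sampled elements themselves is a separate step not shown here.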

35
Results
Input stream: sequence of ranges [l1,r1], [l2,r2], ..., [lm,rm]; for each i, 0 < li < ri < n, and li, ri are integers
Output: |[l1,r1] ∪ [l2,r2] ∪ ... ∪ [lm,rm]|
  • Randomized (ε,δ)-approximation algorithm for range-efficient F0 of a data stream
  • Processing time (n is the size of the universe):
  • Amortized processing time per interval: O(log(1/δ) log(n/ε))
  • Time to answer a query for F0 is a constant
  • Workspace: O((1/ε²)(log(1/δ))(log n))

Pavan, Tirthapura, SICOMP (to appear)
36
Prior Work
  • Bar-Yossef, Kumar, Sivakumar 2002: first studied range-efficient F0; algorithms with higher space complexity
  • Cormode, Muthukrishnan 2003: max-dominance norm
  • Nath, Gibbons, Seshan, Anderson 2004: duplicate-insensitive sum, assuming ideal hash functions

37
Comparison
Range-Efficient F0 | Bar-Yossef et al. | Pavan and Tirthapura
Time  | O((log⁵ n)(1/ε⁵)(log 1/δ)) | O((log n + log 1/ε)(log 1/δ))
Space | O((1/ε³)(log n)(log 1/δ)) | O((1/ε²)(log n)(log 1/δ))

Max-Dominance Norm | Cormode, Muthukrishnan | Pavan and Tirthapura
Time  | O((1/ε⁴)(log n)(log m)(log 1/δ)) | O((log n + log 1/ε)(log 1/δ))
Space | O((1/ε²)(log n + (1/ε)(log m)(log log m))(log 1/δ)) | O((1/ε²)(log m + log n)(log 1/δ))
38
Other Applications of Distinct Sampling
  1. Sample of distinct elements of the stream of any
    desired target size
  2. Approximate median of all distinct elements in
    stream (duplicate insensitive median)
  3. Distinct Frequent elements (heavy hitters in
    network monitoring)

39
Update Streams
  • Insertions and deletions of elements in the stream, e.g. (11, 1), (7, 3), (4, 2), (7, -2), (11, -1)
  • Distinct elements problem: how many elements have a positive cumulative weight?
  • Assume a sanity constraint: no element has weight less than 0
  • The sampling algorithm described so far fails, since it can only decrease the sampling probability as the stream becomes larger
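The problem statement can be made concrete with an exact, non-streaming computation on the example updates. This is only a reference definition using a dictionary of weights, not a small-space algorithm; the function name is made up for illustration.

```python
from collections import defaultdict


def distinct_with_positive_weight(updates):
    """Number of elements whose cumulative weight is positive after all updates."""
    weight = defaultdict(int)
    for x, delta in updates:
        weight[x] += delta
    return sum(1 for w in weight.values() if w > 0)


# Final weights: 11 -> 0, 7 -> 1, 4 -> 2, so two elements remain.
updates = [(11, 1), (7, 3), (4, 2), (7, -2), (11, -1)]
print(distinct_with_positive_weight(updates))  # 2
```

Element 11 is inserted and later fully deleted, so the answer shrinks; a sampler that only ever lowers its sampling probability has no way to recover accuracy after such deletions.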

40
Distinct Sampling on Update Streams (three independent approaches)
  • Sumit Ganguly, Minos N. Garofalakis, Rajeev Rastogi: Processing Set Expressions over Continuous Update Streams. SIGMOD 2003; followed up by Ganguly 2005 and Ganguly, Majumder 2006
  • Graham Cormode, S. Muthukrishnan, Irina Rozenbaum: Summarizing and Mining Inverse Distributions on Data Streams via Dynamic Inverse Sampling. VLDB 2005
  • Gereon Frahling, Piotr Indyk, Christian Sohler: Sampling in dynamic data streams and applications. SoCG 2005

41
Distinct Elements on Update Streams
  • Use of K-Set Structure in storing samples

Ganguly, Garofalakis, Rastogi 2003; Ganguly 2005; Ganguly, Majumder 2006
42
K-Set Structure
  • Small space data structure for multi-set S
    (size ?(K))
  • Operations:
  • Insert (x,v) into S
  • Delete (x,v) from S
  • Membership query (is x in S?)
  • Count query (what is the number of distinct elements in S?)
  • If |S| ≤ K, then queries are answered correctly

[Figure: regions labeled Active, Silent, Active relative to the threshold K]
43
Counting Distinct Elements on Update Streams
  1. Sample the stream at different probabilities: 1, ½, ¼, ...
  2. Store each of (D ∩ S0, D ∩ S1, D ∩ S2, ...) in a k-set structure for an appropriate value of k
  3. When queried, use the highest-probability sample that hasn't overflowed yet
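The three steps can be sketched as follows. This is a minimal sketch with loud assumptions: an exact dictionary per level stands in for the k-set structure (a real k-set structure uses bounded space and is not reproduced here), levels are defined by h(x) < p / 2^i for a hash h(x) = (a·x + b) mod p, and the class name, prime, and seed are made up for illustration.

```python
import random


class UpdateStreamDistinctCounter:
    """Multi-level sampling under insertions and deletions (dict stand-in for k-sets)."""

    def __init__(self, k, levels=32, prime=2_147_483_647, seed=0):
        rng = random.Random(seed)
        self.p, self.k = prime, k
        self.a, self.b = rng.randrange(1, prime), rng.randrange(prime)
        # One exact dict per level; hypothetical stand-in for a k-set structure.
        self.levels = [dict() for _ in range(levels)]

    def update(self, x, delta):
        h = (self.a * x + self.b) % self.p
        for i, store in enumerate(self.levels):
            if h >= (self.p >> i):
                break  # levels are nested, so x is in no deeper level either
            w = store.get(x, 0) + delta
            if w:
                store[x] = w
            else:
                store.pop(x, None)  # weight back to zero: x is no longer present

    def estimate(self):
        # Use the highest-probability level that holds at most k elements.
        for i, store in enumerate(self.levels):
            if len(store) <= self.k:
                return len(store) * (1 << i)
        raise ValueError("every level overflowed; use more levels")


counter = UpdateStreamDistinctCounter(k=100)
for x, delta in [(11, 1), (7, 3), (4, 2), (7, -2), (11, -1)]:
    counter.update(x, delta)
print(counter.estimate())  # 2
```

Unlike the insert-only sampler, every level is maintained all the time, so after many deletions the query can move back to a higher-probability level.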

44
Distributed Streams
  • Alice observes Stream A (11 54 21 11 2 45 21 1) in her workspace and sends Sketch(A) to the Referee
  • Bob observes Stream B (1 5 21 2 54 21 35) in his workspace and sends Sketch(B) to the Referee
  • The Referee computes Dup-Ins-Sum(A,B)
45
Summary
Distinct Sampling extends to:
  • Range-Efficiency (range-sampling)
  • Update Streams (k-set structure)
  • Sliding Windows (multiple samples)
46
Open Questions
  • Can we efficiently handle higher-dimensional ranges?
  • Klee's measure problem in the streaming model

47
Open Questions
  • Range-Efficient F0 under update streams
  • Duplicate-insensitive Fk (k ≥ 2), range-efficient Fk