Title: Counting Distinct Objects over Sliding Windows
1Counting Distinct Objects over Sliding Windows
- Presented by Muhammad Aamir Cheema
- Joint work with Wenjie Zhang, Ying Zhang and Xuemin Lin
- University of New South Wales, Australia
2Introduction
- Counting distinct objects
  - Given a dataset D, return the number of distinct objects in D.
- Counting distinct objects over sliding windows
  - Given a data stream, return the number of distinct objects that arrive at or after timestamp t.
- Applications
  - Traffic management, call centers, wireless communication, stock markets, etc.
3Introduction
- Approximate counting
  - Let $n$ be the actual number of distinct objects and $\hat{n}$ be the reported answer. Build a sketch such that every query is answered with the following guarantee:
  - $|n - \hat{n}|/n \le \epsilon$ with confidence $(1 - \delta)$
- Contribution
  - FM-based algorithms
    - SE-FM (accuracy guarantee, space usage guarantee)
    - PCSA-based algorithm (no accuracy guarantee, but practical; more efficient)
  - k-Skyband
    - (accuracy guarantee, efficient, no space usage guarantee)
4FM Algorithm
- FM sketch
  - Let h(x) be a uniform hash function
  - Let the pivot p(x) be the position of the leftmost 1-bit of h(x)
  - Let FM be an array of size k initialized to zero
  - For each record x in the dataset
    - FM[pivot] = 1
  - Let $B = FM_{min}$ be the position of the leftmost 0-bit of FM
  - Number of distinct elements $\approx \alpha \cdot 2^{B}$
    - where $\alpha = 1.2897385$
- Each bit i of h(x) has probability 1/2 of being one

Example (k = 4, stream: r1 r2 r1 r3 r1)
  h(r1) = 1 0 1 0
  h(r2) = 0 0 1 0
  h(r3) = 1 1 0 1
  FM: 0 0 0 0 -> 1 0 0 0 -> 1 0 1 0
  FM_min = 1

P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. JCSS 1985
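The construction on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the helper names (`pivot`, `fm_sketch`, `fm_min`, `estimate`) are hypothetical, and the hash values are the 4-bit toy values from the slide's example, not the output of a real uniform hash function.

```python
ALPHA = 1.2897385  # constant quoted on the slide

# Toy 4-bit hash values copied from the slide's example (not a real hash).
TOY_HASH = {"r1": (1, 0, 1, 0), "r2": (0, 0, 1, 0), "r3": (1, 1, 0, 1)}


def pivot(bits):
    """Position of the leftmost 1-bit, or None if every bit is zero."""
    for i, bit in enumerate(bits):
        if bit == 1:
            return i
    return None


def fm_sketch(stream, h, k):
    """Build a k-entry FM sketch: set FM[pivot] = 1 for every record."""
    fm = [0] * k
    for x in stream:
        p = pivot(h(x))
        if p is not None:
            fm[p] = 1
    return fm


def fm_min(fm):
    """Position of the leftmost 0-bit of the sketch (len(fm) if none)."""
    for i, bit in enumerate(fm):
        if bit == 0:
            return i
    return len(fm)


def estimate(fm):
    """Estimated number of distinct elements: alpha * 2^B."""
    return ALPHA * 2 ** fm_min(fm)


fm = fm_sketch(["r1", "r2", "r1", "r3", "r1"], TOY_HASH.get, k=4)
print(fm, fm_min(fm), estimate(fm))
```

On the slide's stream this reproduces FM = 1 0 1 0 and FM_min = 1. A single 4-bit sketch is far too small for a meaningful estimate; the example only traces the mechanics.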
5FM Algorithm
- Each bit i of h(x) has probability 1/2 of being one
- An h(x) whose first i bits are zero and whose (i+1)-th bit is one has probability $1/2^{i+1}$
- Let n be the number of distinct elements
  - FM[0] is accessed approx. $n/2$ times
  - FM[1] is accessed approx. $n/4$ times
  - ...
  - FM[i] is accessed approx. $n/2^{i+1}$ times
- If $i \gg \log_2 n$
  - FM[i] will almost certainly be zero
- If $i \ll \log_2 n$
  - FM[i] will almost certainly be one
- If $i \approx \log_2 n$
  - FM[i] may be zero or one
- Hence, the first i for which FM[i] is zero may be used to approximate the number of distinct elements n.
Example (stream: r1 r2 r1 r3 r1): with h(r1) = 1 0 1 0, h(r2) = 0 0 1 0 and h(r3) = 1 1 0 1, the sketch is FM = 1 0 1 0 and FM_min = 1.
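The three cases can be made precise under an idealized assumption that is not stated on the slide: the n distinct elements hash independently and uniformly. Entry FM[i] stays zero only if none of the n elements has pivot i, which gives:

```latex
\Pr\bigl[\mathrm{FM}[i] = 0\bigr]
  = \left(1 - \frac{1}{2^{i+1}}\right)^{n}
  \approx \exp\!\left(-\frac{n}{2^{i+1}}\right)
```

For $i \ll \log_2 n$ the exponent is large and negative, so the probability is close to 0 and the entry is almost certainly one. For $i \gg \log_2 n$ the exponent is close to 0, so the probability is close to 1 and the entry is almost certainly zero.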
6FM Algorithm
- Use r hash functions to create r FM sketches
- Initialize each FM to zero
- For each record x in the dataset
  - For each hash function $h_i(x)$
    - $FM_i$[pivot] = 1
- Let $B_i$ be the position of the leftmost 0-bit of $FM_i$
- $B = (B_1 + B_2 + \dots + B_r)/r$
- Number of distinct elements $\approx \alpha \cdot 2^{B}$
  - where $\alpha = 1.2897385$

Example (r = 3)
  FM1 = 1 0 1 0, B1 = 1
  FM2 = 1 1 0 0, B2 = 2
  FM3 = 1 1 0 1, B3 = 2
  B = (1 + 2 + 2)/3 = 1.67

Performance guarantee: Let $n$ be the actual number of distinct objects, $\hat{n}$ be the reported answer and $m$ be the size of the domain of elements. Then $P(|n - \hat{n}|/n \le \epsilon) \ge 1 - \delta$ if $n > 1/\epsilon$, $k = O(\log m + \log(1/\epsilon) + \log(1/\delta))$ and $r = O((1/\epsilon^2) \log(1/\delta))$.
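The averaging step can be checked against the slide's three example sketches. The following is a small sketch under the assumption that each sketch is a plain list of bits; the function names are hypothetical.

```python
ALPHA = 1.2897385  # constant quoted on the slide


def leftmost_zero(fm):
    """Position of the leftmost 0-bit; len(fm) if every bit is set."""
    for i, bit in enumerate(fm):
        if bit == 0:
            return i
    return len(fm)


def estimate_from_sketches(sketches):
    """Average the per-sketch positions B_i, then return (B, alpha * 2^B)."""
    b = sum(leftmost_zero(fm) for fm in sketches) / len(sketches)
    return b, ALPHA * 2 ** b


# FM1, FM2 and FM3 from the slide's example.
sketches = [[1, 0, 1, 0], [1, 1, 0, 0], [1, 1, 0, 1]]
b, n_hat = estimate_from_sketches(sketches)
print(round(b, 2), round(n_hat, 2))
```

The averaged position matches the slide's B = 1.67; the resulting estimate is about 4.09.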
7FM-based Algorithm
- Maintaining one FM sketch
  - For each record (x, t) in the dataset
    - FM[pivot] = t
- Answering a query
  - For any t, let $B = FM_{min}(t)$ be the position of the leftmost entry of FM with value less than t
  - Number of distinct elements that arrived at or after t $\approx \alpha \cdot 2^{B}$, where $\alpha = 1.2897385$

Example (k = 4)
  t:       1  2  3  4  5
  record:  r1 r2 r3 r2 r2
  h(r1) = 1 0 1 0
  h(r2) = 0 0 1 0
  h(r3) = 1 1 0 1
  FM after each arrival: 0 0 0 0 (initial) -> 1 0 0 0 -> 1 0 2 0 -> 3 0 2 0 -> 3 0 4 0 -> 3 0 5 0
  FM_min(4) = 0
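The timestamped variant can be traced in code on the same example. This is a minimal sketch with hypothetical helper names; it assumes timestamps are positive and non-decreasing, so that a stored 0 means "never set" and overwriting an entry always keeps the latest arrival.

```python
ALPHA = 1.2897385  # constant quoted on the slide

# Toy 4-bit hash values copied from the slide's example (not a real hash).
TOY_HASH = {"r1": (1, 0, 1, 0), "r2": (0, 0, 1, 0), "r3": (1, 1, 0, 1)}


def pivot(bits):
    """Position of the leftmost 1-bit, or None if every bit is zero."""
    for i, bit in enumerate(bits):
        if bit == 1:
            return i
    return None


def update(fm, x, t, h):
    """Record arrival (x, t): store the timestamp at the pivot position."""
    p = pivot(h(x))
    if p is not None:
        fm[p] = t


def fm_min(fm, t):
    """Leftmost position whose stored timestamp is less than t."""
    for i, last_seen in enumerate(fm):
        if last_seen < t:
            return i
    return len(fm)


def estimate(fm, t):
    """Estimated number of distinct elements arriving at or after t."""
    return ALPHA * 2 ** fm_min(fm, t)


fm = [0, 0, 0, 0]
for t, x in enumerate(["r1", "r2", "r3", "r2", "r2"], start=1):
    update(fm, x, t, TOY_HASH.get)
print(fm, fm_min(fm, 4))
```

The final sketch is 3 0 5 0 and FM_min(4) = 0, as in the example: only r2 arrives at or after t = 4, and the estimate is $\alpha \cdot 2^0 \approx 1.29$.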
8FM-based Algorithm
- Maintain r FM sketches
  - Initialize each FM to zero
  - For each record (x, t) in the dataset
    - For each hash function $h_i(x)$
      - $FM_i$[pivot] = t
- Answering a query
  - For any t, let $B_i(t)$ be the position of the leftmost entry smaller than t in the i-th FM
  - Let $B = (B_1(t) + B_2(t) + \dots + B_r(t))/r$
  - Number of distinct elements that arrived at or after t $\approx \alpha \cdot 2^{B}$, where $\alpha = 1.2897385$
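A possible shape for the r-sketch version is shown below. Several choices here are assumptions made for illustration and do not come from the slides: the class name `SlidingFM`, the use of salted BLAKE2b from Python's `hashlib` as the family of hash functions, k of at most 64, and positive integer timestamps.

```python
import hashlib

ALPHA = 1.2897385  # constant quoted on the slide


class SlidingFM:
    """r timestamped FM sketches of k entries each (illustrative only)."""

    def __init__(self, r, k):
        self.r, self.k = r, k
        self.sketches = [[0] * k for _ in range(r)]

    def _pivot(self, i, x):
        # Assumption: the i-th hash function is BLAKE2b salted with i.
        digest = hashlib.blake2b(
            str(x).encode(), digest_size=8, salt=i.to_bytes(8, "big")
        ).digest()
        v = int.from_bytes(digest, "big") >> (64 - self.k)  # keep k bits
        return self.k - v.bit_length() if v else None

    def update(self, x, t):
        """FM_i[pivot] = t for every sketch i."""
        for i, fm in enumerate(self.sketches):
            p = self._pivot(i, x)
            if p is not None:
                fm[p] = t

    def query(self, t):
        """alpha * 2^B, with B the average of the per-sketch B_i(t)."""
        total = 0
        for fm in self.sketches:
            total += next((pos for pos, ts in enumerate(fm) if ts < t), self.k)
        return ALPHA * 2 ** (total / self.r)


s = SlidingFM(r=16, k=32)
for t in range(1, 1001):
    s.update(f"obj{t % 100}", t)
print(s.query(1), s.query(500), s.query(1001))
```

Because more entries fall below t as t grows, each $B_i(t)$ can only shrink, so the reported estimate never increases as the window start moves forward.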
9Performance Analysis
- Let $n$ be the actual number of distinct objects arriving not before time t, $\hat{n}$ be the reported answer and $m$ be the size of the domain of elements; then
  - $P(|n - \hat{n}|/n \le \epsilon) \ge 1 - \delta$
  - if $n > 1/\epsilon$
  - and $k = O(\log m + \log(1/\epsilon) + \log(1/\delta))$
  - and $r = O((1/\epsilon^2) \log(1/\delta))$
- Total space: $O((1/\epsilon^2) \log(1/\delta) \log m)$
- Total maintenance cost for one record: $O((1/\epsilon^2) \log(1/\delta) \log\log m)$
- Total query cost: $O((1/\epsilon^2) \log(1/\delta) \log\log m)$
10PCSA-based Algorithm
- Maintain r FM sketches but update only j < r sketches per record
  - Generate j hash functions H(x) that map x to [1, r]
  - Initialize each FM to zero
  - For each record (x, t) in the dataset
    - For each of the j hash functions H()
      - i = H(x)
      - Update the i-th FM sketch
- Answering a query
  - For any t, let $B_i(t)$ be the position of the leftmost entry smaller than t in the i-th FM
  - Let $B = (B_1(t) + B_2(t) + \dots + B_r(t))/r$
  - Number of distinct elements that arrived at or after t $\approx (\alpha \cdot 2^{B})/j$, where $\alpha = 1.2897385$
- Inspired by the PCSA technique in P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. JCSS 1985
- NOTE: No accuracy guarantee, but performs well in practice
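The update step alone can be sketched as follows; the query step is left out. Everything beyond the routing idea is an assumption made for illustration: the function names, salted BLAKE2b as the hash family, a single shared pivot hash for all sketches, and zero-based sketch indices in [0, r) instead of the slide's [1, r].

```python
import hashlib


def _hash(x, salt, modulus):
    """Hypothetical helper: salted BLAKE2b reduced modulo `modulus`."""
    digest = hashlib.blake2b(
        str(x).encode(), digest_size=8, salt=salt.encode()
    ).digest()
    return int.from_bytes(digest, "big") % modulus


def pivot(x, k):
    """Leftmost 1-bit position of a k-bit hash of x, or None if it is zero."""
    v = _hash(x, "pivot", 2 ** k)
    return k - v.bit_length() if v else None


def pcsa_update(sketches, x, t, j):
    """Route (x, t) to at most j of the r sketches and return their indices."""
    r, k = len(sketches), len(sketches[0])
    p = pivot(x, k)
    touched = set()
    for a in range(j):
        i = _hash(x, f"route{a}", r)  # i = H_a(x)
        touched.add(i)
        if p is not None:
            sketches[i][p] = t
    return touched


sketches = [[0] * 16 for _ in range(8)]
touched = pcsa_update(sketches, "obj", 7, j=3)
print(sorted(touched))
```

Each record touches at most j sketches instead of all r, which is where the lower maintenance cost comes from.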
11BJKST Algorithm
- Main idea
  - Let h() be a hash function that hashes D to $[1, m^3]$, where $m = |D|$
  - For each record x, generate its hash value h(x)
  - Maintain the k-th smallest distinct hash value $k_{min}$
  - Number of distinct elements $n \approx k \cdot m^3 / k_{min}$
- Improved algorithm
  - Use r hash functions
  - Compute $n_i$ for each hash function $h_i()$ as above
  - Report the final answer as the median of the $n_i$ values
- Performance guarantee
  - $P(|n - \hat{n}|/n \le \epsilon) \ge 1 - \delta$
  - if $m > 1/\delta$
  - and $n > k$
  - and $k = O(1/\epsilon^2)$
  - and $r = O(\log(1/\delta))$

Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In RANDOM'02.
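The estimator and its median-of-r refinement can be sketched as below. The function names, the toy hash tables and the hash range of 1000 are illustrative assumptions, not values from the slides. Returning the exact count when fewer than k distinct hash values have been seen is a fallback added here and is not described on the slide.

```python
import heapq
from statistics import median


def bjkst_estimate(stream, h, k, hash_range):
    """Estimate the distinct count as k * hash_range / k_min."""
    distinct = {h(x) for x in stream}
    if len(distinct) < k:
        # Assumption: with fewer than k hash values, report the exact count.
        return float(len(distinct))
    k_min = heapq.nsmallest(k, distinct)[-1]  # k-th smallest distinct hash
    return k * hash_range / k_min


def bjkst_median(stream, hash_functions, k, hash_range):
    """Median of the per-hash-function estimates n_i."""
    items = list(stream)
    return median(bjkst_estimate(items, h, k, hash_range) for h in hash_functions)


# Three toy hash functions into [1, 1000] (hypothetical values).
h1 = {"a": 100, "b": 200, "c": 300}.get
h2 = {"a": 50, "b": 400, "c": 900}.get
h3 = {"a": 500, "b": 250, "c": 750}.get
stream = list("abcab")
print(bjkst_estimate(stream, h1, k=2, hash_range=1000))
print(bjkst_median(stream, [h1, h2, h3], k=2, hash_range=1000))
```

With such a tiny stream the estimates are poor; the guarantee on the slide only applies when $n > k$ and $k = O(1/\epsilon^2)$.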
12K-Skyband Technique
- Main idea
  - Let h() be a hash function that hashes D to $[1, m^3]$, where $m = |D|$
  - For each record (x, t), generate h(x) and store the record (x, h(x), t)
- Answering a query q(t)
  - Retrieve all records (x, h(x), t') whose timestamp satisfies $t' \ge t$
  - Get the k-th smallest distinct hash value and apply the BJKST algorithm
- Limitation: requires storing all records
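The store-everything baseline is short enough to write out. This is a sketch with hypothetical function names and made-up records of the form (x, h(x), t); it returns `None` when the window holds fewer than k distinct hash values, a convention chosen here rather than taken from the slide.

```python
import heapq


def window_kmin(records, t, k):
    """k-th smallest distinct hash among records with timestamp >= t."""
    hashes = {hx for _x, hx, ts in records if ts >= t}
    if len(hashes) < k:
        return None
    return heapq.nsmallest(k, hashes)[-1]


def window_estimate(records, t, k, hash_range):
    """BJKST estimate k * hash_range / k_min over the window starting at t."""
    k_min = window_kmin(records, t, k)
    if k_min is None:
        return None
    return k * hash_range / k_min


# Hypothetical records (x, h(x), t), in arrival order.
records = [("a", 5, 1), ("b", 3, 2), ("c", 4, 3), ("d", 1, 4)]
print(window_kmin(records, t=3, k=2), window_estimate(records, 2, 2, 12))
```

Every query scans the full history, which is the limitation the k-skyband is meant to remove.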
13K-Skyband Technique
- For any time t, we need to find the k-th smallest hash value among elements arriving no earlier than t
- A record x dominates another record y if x arrives after y and has a smaller hash value
- The k-skyband keeps only the objects that are dominated by at most (k-1) records
- Maintaining the k-skyband
  - Keep a counter for each record
  - When a new element (x, t) arrives, increment the counter of every record dominated by it
  - Remove the records whose counter is at least k
  - Counters are incremented for groups of records at once to improve efficiency (domination aggregation search tree)
[Figure: elements a, b, c, d and e plotted by arrival time t against hash value h(x), with k = 2.]
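The counter-based maintenance rule can be sketched with a plain linear scan. This is not the domination aggregation search tree mentioned on the slide, which updates counters for groups of records. The function name, the record layout and the sample arrivals are hypothetical, and the sketch assumes each object arrives only once, since the slide does not say how repeated arrivals of the same object are handled.

```python
def skyband_insert(skyband, x, hx, t, k):
    """Insert (x, h(x), t) and evict records dominated by k or more records.

    Every stored record arrived earlier than the new one, so the new record
    dominates exactly those stored records with a larger hash value.
    """
    survivors = []
    for rec in skyband:
        if rec["h"] > hx:
            rec["dominated_by"] += 1
        if rec["dominated_by"] < k:
            survivors.append(rec)
    survivors.append({"x": x, "h": hx, "t": t, "dominated_by": 0})
    return survivors


# Hypothetical arrivals (x, h(x), t) with k = 2.
skyband = []
for x, hx, t in [("a", 5, 1), ("b", 3, 2), ("c", 4, 3), ("d", 1, 4)]:
    skyband = skyband_insert(skyband, x, hx, t, k=2)
print([(rec["x"], rec["dominated_by"]) for rec in skyband])
```

Here record a is evicted when c arrives, because both b and c arrive later with smaller hash values; b, c and d remain in the 2-skyband.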
14K-Skyband Technique
- Answering a query
  - Find $k_{min}$ (the k-th smallest hash value among elements arriving no earlier than t)
  - Let z be the number of elements that arrived before t
  - $k_{min}$ is the (z+k)-th smallest hash value overall
- Algorithm
  - Maintain a binary search tree eT that stores elements according to t
  - Maintain a binary search tree eH that stores elements according to h(x)
  - When a query q(t) arrives
    - Compute z by using eT
    - Find the (z+k)-th smallest hash value overall from eH
[Figure: elements a, b, c, d, e and f plotted by arrival time t against hash value h(x); with k = 2 and z = 3, k_min is the 5th smallest h(x).]
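The (z+k)-th lookup can be illustrated with sorted lists standing in for the two binary search trees eT and eH. This is a simplification for readability: a real implementation would keep both orderings incrementally instead of re-sorting per query. The function name and the sample skyband are hypothetical, and the input is assumed to be a valid k-skyband already.

```python
from bisect import bisect_left


def skyband_kmin(skyband, t, k):
    """k-th smallest hash among skyband elements arriving at or after t.

    skyband: list of (x, h(x), timestamp) tuples forming a k-skyband.
    Returns None if fewer than k elements fall inside the window.
    """
    times = sorted(ts for _x, _hx, ts in skyband)    # stands in for eT
    hashes = sorted(hx for _x, hx, _ts in skyband)   # stands in for eH
    z = bisect_left(times, t)  # elements that arrived strictly before t
    if z + k > len(hashes):
        return None
    return hashes[z + k - 1]   # (z+k)-th smallest hash value overall


# A 2-skyband of hypothetical records (x, h(x), t).
skyband = [("b", 3, 2), ("c", 4, 3), ("d", 1, 4)]
print(skyband_kmin(skyband, t=3, k=2))
```

The shortcut works because a skyband element that arrived before t is dominated by at most k-1 records, so fewer than k window elements have a smaller hash; its own hash therefore lies below $k_{min}$, and skipping z positions is enough.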
15Performance Analysis
- Let $n$ be the actual number of distinct objects arriving not before time t, $\hat{n}$ be the reported answer and $m$ be the size of the domain of elements; then
  - $P(|n - \hat{n}|/n \le \epsilon) \ge 1 - \delta$
  - if $m > 1/\delta$
  - and $n > k$
  - and $k = O(1/\epsilon^2)$
  - and $r = O(\log(1/\delta))$
- Expected total space: $O((1/\epsilon^2) \log(1/\delta) \log n)$
- Expected time complexity: $O(\log(1/\delta) (\log(1/\epsilon) + \log n))$
16Experiments
- Synthetic datasets following Uniform and Zipf distributions
- Real dataset: WorldCup 98 HTTP requests (20 M records)
17Space Efficiency
18Space Efficiency
19Time Efficiency
Maintenance cost
20Time Efficiency
Query response time
21Accuracy
22Thanks
23
- P. B. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In VLDB, 2001.
  - Space usage: $(1/\epsilon^2) \log(1/\delta) \, m^{1/2}$
- Y. Tao, G. Kollios, J. Considine, F. Li, and D. Papadias. Spatio-temporal aggregation using sketches. In ICDE 2004.
  - Space usage: $O((N/\epsilon^2) \log(1/\delta) \log m)$
24Space Requirement (SE-FM)
- To guarantee the performance we require the following
  - $k = O(\log m + \log(1/\epsilon) + \log(1/\delta))$
  - $r = O((1/\epsilon^2) \log(1/\delta))$
- Let $m > 1/\epsilon$ and $m > 1/\delta$; then $k = O(\log m)$
- Size of one sketch is $k = O(\log m)$
- Size of r sketches is $O(r \log m) = O((1/\epsilon^2) \log(1/\delta) \log m)$
- Total space: $O((1/\epsilon^2) \log(1/\delta) \log m)$
25Time Complexity (SE-FM)
- To guarantee the performance we require the following
  - $k = O(\log m + \log(1/\epsilon) + \log(1/\delta))$
  - $r = O((1/\epsilon^2) \log(1/\delta))$
- The elements in a sketch are stored in a min-heap to support logarithmic search/update
  - Hence, the cost of one search/update operation is $O(\log k) = O(\log\log m)$
- To maintain the sketches, we update r sketches for each record x
  - Total maintenance cost for one record: $O(r \log\log m) = O((1/\epsilon^2) \log(1/\delta) \log\log m)$
- To answer a query, we search in r sketches
  - Total cost: $O(r \log\log m) = O((1/\epsilon^2) \log(1/\delta) \log\log m)$
26Space Usage (K-Skyband)
- Performance guarantee
  - $P(|n - \hat{n}|/n \le \epsilon) \ge 1 - \delta$
  - if $m > 1/\delta$
  - and $n > k$
  - and $k = O(1/\epsilon^2)$
  - and $r = O(\log(1/\delta))$
- Expected size of one k-skyband: $O(k \ln(n/k))$
- Expected size of r k-skybands: $O(r k \log(n/k)) = O((1/\epsilon^2) \log(1/\delta) \log n)$
27Time Complexity (K-Skyband)
- Performance guarantee
  - $P(|n - \hat{n}|/n \le \epsilon) \ge 1 - \delta$
  - if $m > 1/\delta$
  - and $n > k$
  - and $k = O(1/\epsilon^2)$
  - and $r = O(\log(1/\delta))$
- Answering a query q(t)
  - Search eT to compute z: $O(\log(k \log n)) = O(\log k + \log n)$
  - Search eH to find the (z+k)-th element: $O(\log k + \log n)$
  - We require this for all r sketches: $O(r (\log k + \log n)) = O(\log(1/\delta) (\log(1/\epsilon) + \log n))$