Counting Distinct Objects over Sliding Windows

1
Counting Distinct Objects over Sliding Windows
  • Presented by
  • Muhammad Aamir Cheema

Joint work with Wenjie Zhang, Ying Zhang and
Xuemin Lin
University of New South Wales, Australia
2
Introduction
  • Counting distinct objects
  • Given a dataset D, return the number of distinct
    objects in D.
  • Counting distinct objects over sliding
    windows
  • Given a data stream, return the number of
    distinct objects that arrive at or after
    timestamp t.
  • Applications
  • Traffic management, call centers, wireless
    communication, stock markets, etc.

3
Introduction
  • Approximate counting
  • Let n be the actual number of distinct
    objects and $\hat{n}$ be the reported answer. Build a
    sketch such that every query is answered with the
    following guarantee:
  • $|\hat{n} - n| / n \le \epsilon$ with confidence
    $(1 - \delta)$
  • Contribution
  • FM-based algorithms
  • SE-FM (accuracy guarantee, space usage
    guarantee)
  • PCSA-based algorithm (no accuracy guarantee,
    although practical; more efficient)
  • k-Skyband (accuracy guarantee, efficient, no
    space usage guarantee)

4
FM Algorithm
  • FM sketch
  • Let h(x) be a uniform hash function
  • Let the pivot p(x) be the position of the leftmost
    1-bit of h(x)
  • Let FM be an array of size k initialized to zero
  • For each record x in the dataset
  • FM[p(x)] = 1
  • Let B = FMmin be the position of the leftmost 0-bit of
    FM
  • Number of distinct elements $\approx \alpha \cdot 2^B$
  • where $\alpha = 1.2897385$
  • Each bit i of h(x) is one with probability 1/2

Example (k = 4)
Stream: r1 r2 r1 r3 r1
h(r1) = 1 0 1 0
h(r2) = 0 0 1 0
h(r3) = 1 1 0 1
FM: 0 0 0 0 → 1 0 0 0 → 1 0 1 0
FMmin = 1
Reference: P. Flajolet and G. N. Martin. Probabilistic
counting algorithms for data base applications.
JCSS, 1985.
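
The sketch construction on this slide can be tried out directly. The following is a minimal Python sketch, assuming 0-based positions counted from the left (which matches FMmin = 1 in the example) and using the example's hash values as a lookup table instead of a real hash function. The helper names `pivot`, `fm_sketch` and `fm_min` are illustrative, not from the slides.

```python
ALPHA = 1.2897385  # correction constant quoted on the slide

def pivot(bits):
    """0-based position of the leftmost 1-bit, or None if all bits are 0."""
    for pos, bit in enumerate(bits):
        if bit == 1:
            return pos
    return None

def fm_sketch(stream, hashes, k):
    """Build a k-entry FM sketch; `hashes` maps each record to its k-bit hash."""
    fm = [0] * k
    for record in stream:
        p = pivot(hashes[record])
        if p is not None:
            fm[p] = 1
    return fm

def fm_min(fm):
    """Position of the leftmost 0 entry (len(fm) if every entry is 1)."""
    for pos, bit in enumerate(fm):
        if bit == 0:
            return pos
    return len(fm)

# Hash values copied from the slide's example (k = 4).
hashes = {"r1": [1, 0, 1, 0], "r2": [0, 0, 1, 0], "r3": [1, 1, 0, 1]}
fm = fm_sketch(["r1", "r2", "r1", "r3", "r1"], hashes, k=4)
b = fm_min(fm)
estimate = ALPHA * 2 ** b
print(fm, b, estimate)
```

With a single sketch and only four bits the estimate is coarse; the point is only to show how the pivot, the bit array and FMmin fit together.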
5
FM Algorithm
  • Each bit i of h(x) is one with probability 1/2
  • An h(x) whose first i bits are zero and whose (i+1)-th bit
    is one occurs with probability $1/2^{i+1}$
  • Let n be the number of distinct elements
  • FM[0] is accessed approx. n/2 times
  • FM[1] is accessed approx. n/4 times
  • ...
  • FM[i] is accessed approx. $n/2^{i+1}$ times
  • If $i \gg \log_2 n$
  • FM[i] will almost certainly be zero
  • If $i \ll \log_2 n$
  • FM[i] will almost certainly be one
  • If $i \approx \log_2 n$
  • FM[i] may be zero or one
  • Hence, the first i for which FM[i] is zero may be
    used to approximate the number of distinct elements
    n.

Example: stream r1 r2 r1 r3 r1 with h(r1) = 1 0 1 0,
h(r2) = 0 0 1 0 and h(r3) = 1 1 0 1 gives FM = 1 0 1 0
and FMmin = 1.
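
The probability claim behind this argument can be checked by brute force for a small k. The snippet below is an illustrative check, assuming an ideal hash that produces every k-bit string equally often. It enumerates all strings and counts how many have their pivot at each position. The function name `pivot_distribution` is made up for this example.

```python
from fractions import Fraction
from itertools import product

def pivot_distribution(k):
    """Exact probability that the leftmost 1-bit of a uniform k-bit string is at position i."""
    counts = [0] * k
    for bits in product((0, 1), repeat=k):
        for pos, bit in enumerate(bits):
            if bit == 1:
                counts[pos] += 1
                break
    total = 2 ** k
    return [Fraction(c, total) for c in counts]

dist = pivot_distribution(4)
print(dist)  # position i is hit with probability 1/2^(i+1)
```

So with n distinct elements, entry i is expected to be touched about n/2^(i+1) times, which is why entries far below log2 n are almost surely set and entries far above it are almost surely not.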
6
FM Algorithm
  • Use r hash functions to create r FM sketches
  • Initialize each FM sketch to zero
  • For each record x in the dataset
  • For each hash function $h_i(x)$
  • $FM_i[p_i(x)] = 1$
  • Let $B_i$ be the position of the leftmost 0-bit of $FM_i$
  • $B = (B_1 + B_2 + \dots + B_r)/r$
  • Number of distinct elements $\approx \alpha \cdot 2^B$
  • where $\alpha = 1.2897385$

Example (r = 3)
FM1 = 1 0 1 0, B1 = 1
FM2 = 1 1 0 0, B2 = 2
FM3 = 1 1 0 1, B3 = 2
B = (1 + 2 + 2)/3 ≈ 1.67
Performance guarantee: Let n be the actual number
of distinct objects, $\hat{n}$ be the reported answer
and m be the size of the domain of elements; then
$P(|\hat{n} - n|/n \le \epsilon) \ge 1 - \delta$
if $n > 1/\epsilon$,
$k = O(\log m + \log(1/\epsilon) + \log(1/\delta))$
and $r = O((1/\epsilon^2) \log(1/\delta))$
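
The averaging step can be reproduced from the three example sketches. This is a small illustration only; the function names are assumptions and the sketches are typed in by hand rather than built from hash functions.

```python
ALPHA = 1.2897385  # constant quoted on the slide

def leftmost_zero(fm):
    """0-based position of the leftmost 0 entry (len(fm) if there is none)."""
    for pos, bit in enumerate(fm):
        if bit == 0:
            return pos
    return len(fm)

def estimate_distinct(sketches):
    """Average the B_i values over all sketches and return (B_i list, B, alpha * 2^B)."""
    b_values = [leftmost_zero(fm) for fm in sketches]
    b_avg = sum(b_values) / len(b_values)
    return b_values, b_avg, ALPHA * 2 ** b_avg

# FM1, FM2, FM3 from the slide's example.
sketches = [[1, 0, 1, 0], [1, 1, 0, 0], [1, 1, 0, 1]]
b_values, b_avg, estimate = estimate_distinct(sketches)
print(b_values, round(b_avg, 2), round(estimate, 2))
```

Averaging the exponents before exponentiating is what smooths out the power-of-two granularity of a single sketch.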
7
FM-based Algorithm
  • Maintaining one FM sketch
  • For each record (x, t) in the dataset
  • FM[p(x)] = t
  • Answering a query
  • For any t, let B = FMmin(t) be the position of the
    leftmost entry of FM with value less than t
  • Number of distinct elements that arrived at or after
    t $\approx \alpha \cdot 2^B$, where $\alpha = 1.2897385$

Example
Timestamps: 1 2 3 4 5
Stream: r1 r2 r3 r2 r2
h(r1) = 1 0 1 0
h(r2) = 0 0 1 0
h(r3) = 1 1 0 1
FM: 0 0 0 0 → 1 0 0 0 → 1 0 2 0 → 3 0 2 0 → 3 0 4 0 → 3 0 5 0
FMmin(4) = 0
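
The timestamped variant differs from the plain FM sketch only in what each entry stores. Below is a minimal Python sketch of it, assuming timestamps start at 1 and arrive in increasing order, so that 0 can stand for "never set". The class name `WindowedFM` is illustrative, and the hash values are again taken from the example as a lookup table.

```python
ALPHA = 1.2897385  # constant quoted on the slide

def pivot(bits):
    """0-based position of the leftmost 1-bit, or None if all bits are 0."""
    return bits.index(1) if 1 in bits else None

class WindowedFM:
    """One FM sketch whose entries hold the latest timestamp instead of a bit."""

    def __init__(self, k):
        self.fm = [0] * k  # 0 means "never set" (assumes timestamps start at 1)

    def add(self, bits, t):
        p = pivot(bits)
        if p is not None:
            self.fm[p] = t  # assumes records arrive in timestamp order

    def fm_min(self, t):
        """Leftmost position whose stored timestamp is smaller than t."""
        for pos, ts in enumerate(self.fm):
            if ts < t:
                return pos
        return len(self.fm)

    def estimate(self, t):
        return ALPHA * 2 ** self.fm_min(t)

hashes = {"r1": [1, 0, 1, 0], "r2": [0, 0, 1, 0], "r3": [1, 1, 0, 1]}
sketch = WindowedFM(k=4)
states = [list(sketch.fm)]
for t, record in enumerate(["r1", "r2", "r3", "r2", "r2"], start=1):
    sketch.add(hashes[record], t)
    states.append(list(sketch.fm))
print(states[-1], sketch.fm_min(4))
```

Because each entry keeps only the most recent timestamp, one sketch can answer queries for any window start t without being rebuilt.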
8
FM-based Algorithm
  • Maintain r FM sketches
  • Initialize each FM sketch to zero
  • For each record (x, t) in the dataset
  • For each hash function $h_i(x)$
  • $FM_i[p_i(x)] = t$
  • Answering a query
  • For any t, let $B_i(t)$ be the position of the leftmost
    entry smaller than t in the i-th FM sketch
  • Let $B = (B_1(t) + B_2(t) + \dots + B_r(t))/r$
  • Number of distinct elements that arrived at or after
    t $\approx \alpha \cdot 2^B$, where $\alpha = 1.2897385$
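
Putting the r sketches together might look like the following. This is a sketch under stated assumptions rather than the authors' implementation: the r hash functions are simulated with salted SHA-256 (an arbitrary choice), k must not exceed 256, timestamps start at 1 and arrive in order, and the names `hash_bits` and `WindowedFMArray` are invented for the example.

```python
import hashlib

ALPHA = 1.2897385  # constant quoted on the slides

def hash_bits(x, i, k):
    """First k bits of SHA-256 over (i, x); stands in for the i-th hash function."""
    digest = hashlib.sha256(f"{i}:{x}".encode()).digest()
    value = int.from_bytes(digest, "big")
    return [(value >> (255 - pos)) & 1 for pos in range(k)]

class WindowedFMArray:
    """r timestamped FM sketches, each with k entries."""

    def __init__(self, r, k):
        self.r, self.k = r, k
        self.sketches = [[0] * k for _ in range(r)]  # 0 means "never set"

    def add(self, x, t):
        for i, fm in enumerate(self.sketches):
            bits = hash_bits(x, i, self.k)
            if 1 in bits:
                fm[bits.index(1)] = t  # store the latest timestamp at the pivot

    def estimate(self, t):
        """Estimate the number of distinct items that arrived at or after t."""
        total = 0
        for fm in self.sketches:
            total += next((pos for pos, ts in enumerate(fm) if ts < t), self.k)
        return ALPHA * 2 ** (total / self.r)

w = WindowedFMArray(r=8, k=16)
for t in range(1, 101):
    w.add(f"item{t % 20}", t)  # 20 distinct items, repeated over 100 time steps
print(w.estimate(1), w.estimate(50), w.estimate(101))
```

Each update touches all r sketches, which is the per-record cost that the PCSA-based variant tries to reduce.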

9
Performance Analysis
  • Let n be the actual number of distinct objects
    arriving not before time t, $\hat{n}$ be the reported
    answer and m be the size of the domain of elements; then
  • $P(|\hat{n} - n|/n \le \epsilon) \ge 1 - \delta$
  • if $n > 1/\epsilon$
  • and $k = O(\log m + \log(1/\epsilon) + \log(1/\delta))$
  • and $r = O((1/\epsilon^2) \log(1/\delta))$
  • Total space: $O((1/\epsilon^2) \log(1/\delta) \log m)$
  • Total maintenance cost for one record:
    $O((1/\epsilon^2) \log(1/\delta) \log\log m)$
  • Total query cost: $O((1/\epsilon^2) \log(1/\delta) \log\log m)$

10
PCSA-based Algorithm
  • Maintain r FM sketches but update only j < r sketches per record
  • Generate j hash functions H(x) that map x to
    [1, r]
  • Initialize each FM sketch to zero
  • For each record (x, t) in the dataset
  • For each of the j hash functions H()
  • i = H(x)
  • Update the i-th FM sketch
  • Answering a query
  • For any t, let $B_i(t)$ be the position of the leftmost
    entry smaller than t in the i-th FM sketch
  • Let $B = (B_1(t) + B_2(t) + \dots + B_r(t))/r$
  • Number of distinct elements that arrived at or after
    t $\approx (\alpha \cdot 2^B)/j$, where $\alpha = 1.2897385$
  • Inspired by the PCSA technique in P. Flajolet and
    G. N. Martin. Probabilistic counting algorithms
    for data base applications. JCSS, 1985.
  • Note: no accuracy guarantee, but performs well in
    practice

11
BJKST Algorithm
  • Main idea
  • Let h() be a hash function that maps D to $[1, m^3]$,
    where $m = |D|$
  • For each record x, we generate its hash value
    h(x)
  • Maintain the k-th smallest distinct hash value k_min
  • Number of distinct elements $\approx k \cdot m^3 / k_{\min}$
  • Improved algorithm
  • Use r hash functions
  • Compute $n_i$ for each hash function $h_i()$ as above
  • Report the final answer as the median of the $n_i$ values
  • Performance guarantee
  • $P(|\hat{n} - n|/n \le \epsilon) \ge 1 - \delta$
  • if $m > 1/\delta$
  • and $n > k$
  • and $k = O(1/\epsilon^2)$
  • and $r = O(\log(1/\delta))$

Reference: Z. Bar-Yossef, T. S. Jayram, R. Kumar, D.
Sivakumar, and L. Trevisan. Counting distinct
elements in a data stream. In RANDOM'02.
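
The estimator on this slide can be sketched in a few lines of Python. This is an illustration under assumptions, not the BJKST authors' code: the hash functions are simulated with salted SHA-256 reduced modulo m^3, the function names are invented, and the fallback to an exact count when fewer than k distinct hash values exist is a choice made here (the slide simply requires n > k).

```python
import hashlib
import heapq
from statistics import median

def h(x, i, m):
    """Map x into [1, m**3] under the i-th (simulated) hash function."""
    digest = hashlib.sha256(f"{i}:{x}".encode()).digest()
    return int.from_bytes(digest, "big") % (m ** 3) + 1

def k_smallest_hashes(stream, i, m, k):
    """Return the k smallest distinct hash values seen, in ascending order."""
    heap = []     # max-heap via negation: -heap[0] is the largest value kept
    kept = set()  # same values, used to skip duplicates
    for x in stream:
        v = h(x, i, m)
        if v in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -v)
            kept.add(v)
        elif v < -heap[0]:
            evicted = -heapq.heappushpop(heap, -v)
            kept.remove(evicted)
            kept.add(v)
    return sorted(kept)

def estimate_one(stream, i, m, k):
    smallest = k_smallest_hashes(stream, i, m, k)
    if len(smallest) < k:
        return float(len(smallest))  # assumption: count exactly when n < k
    k_min = smallest[-1]
    return k * m ** 3 / k_min

def bjkst_estimate(stream, m, k, r):
    """Median of r independent estimates."""
    return median(estimate_one(stream, i, m, k) for i in range(r))

items = [f"user{i % 1000}" for i in range(5000)]  # 1000 distinct items
est = bjkst_estimate(items, m=1000, k=64, r=5)
print(round(est))
```

Only the k smallest hash values are kept per hash function, so the memory used is independent of the stream length.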
12
K-Skyband Technique
  • Main idea
  • Let h() be a hash function that maps D to $[1, m^3]$,
    where $m = |D|$
  • For each record (x, t) we generate h(x) and store
    the record (x, h(x), t)
  • Answering a query q(t)
  • Retrieve all records (x, h(x), t') whose
    timestamp satisfies $t' \ge t$
  • Get the k-th smallest distinct hash value and
    apply the BJKST algorithm
  • Limitation: requires storing all records

13
K-Skyband Technique
  • For any time t, we need to find the k-th smallest
    hash value arriving no earlier than t
  • A record x dominates another record y if x
    arrives after y and has a smaller hash value
  • The k-skyband keeps only the records that are
    dominated by at most (k-1) records
  • Maintaining the k-skyband
  • Keep a counter for each record
  • When a new element (x, t) arrives, increment the
    counter of all records dominated by it
  • Remove the records whose counter is at least
    k
  • Counters are incremented for groups of records to improve
    efficiency (domination aggregation search tree)

[Figure: k-skyband example with k = 2; records a, b, c, d, e plotted by arrival time t and hash value h(x)]
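
The counter-based maintenance can be sketched as follows. This is a simplified illustration: it scans all kept records on every arrival instead of using the aggregated search tree mentioned on the slide, it assumes each object arrives once and in timestamp order, and the record names and hash values are made up for the example.

```python
def skyband_insert(skyband, x, hx, t, k):
    """Insert (x, hx, t) and return the updated k-skyband.

    Each record is a list [x, hash, timestamp, counter], where the counter
    is the number of later records with a smaller hash value.
    """
    kept = []
    for rec in skyband:
        if rec[1] > hx:   # older record with a larger hash: dominated by the new one
            rec[3] += 1
        if rec[3] < k:    # keep records dominated by at most k - 1 others
            kept.append(rec)
    kept.append([x, hx, t, 0])  # nothing newer exists yet, so the counter starts at 0
    return kept

# Hypothetical stream of (object, hash value, timestamp) triples.
stream = [("a", 50, 1), ("b", 20, 2), ("c", 70, 3),
          ("d", 10, 4), ("e", 40, 5), ("f", 30, 6)]
skyband = []
for x, hx, t in stream:
    skyband = skyband_insert(skyband, x, hx, t, k=2)
print([(rec[0], rec[3]) for rec in skyband])
```

Records a and c are dropped because at least two later records have smaller hash values, so neither can be the first or second smallest in any window that contains it.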
14
K-Skyband Technique
  • Answering a query
  • Find k_min (the k-th smallest hash value among
    elements arriving no earlier than t)
  • Let z be the number of elements that arrived before t
  • k_min is the (z+k)-th smallest hash value overall
  • Algorithm
  • Maintain a binary search tree eT that stores
    elements ordered by t
  • Maintain a binary search tree eH that stores
    elements ordered by h(x)
  • When a query q(t) arrives
  • Compute z by using eT
  • Find the (z+k)-th smallest hash value overall from eH

[Figure: query example with k = 2 and z = 3 over records a to f, plotted by arrival time t and hash value h(x); k_min is the 5th smallest h(x)]
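
The query step can be illustrated with sorted lists standing in for the two search trees eT and eH. This is a sketch under assumptions: the lists are rebuilt on every query for brevity (a real implementation would keep balanced trees to get logarithmic cost), the stored (hash, timestamp) pairs and the hash range of 100 are hypothetical, and the function names are invented.

```python
from bisect import bisect_left

def k_min_for_query(skyband, t, k):
    """Return the (z + k)-th smallest hash among the kept records, or None.

    `skyband` is a list of (hash, timestamp) pairs for the records currently kept.
    """
    times = sorted(ts for _, ts in skyband)   # stand-in for eT
    hashes = sorted(hx for hx, _ in skyband)  # stand-in for eH
    z = bisect_left(times, t)                 # kept records that arrived before t
    rank = z + k
    if rank > len(hashes):
        return None                           # fewer than k records in the window
    return hashes[rank - 1]

def estimate(skyband, t, k, hash_range):
    """Estimate k * hash_range / k_min, with hash_range playing the role of m^3."""
    k_min = k_min_for_query(skyband, t, k)
    return None if k_min is None else k * hash_range / k_min

# Hypothetical 2-skyband contents: (hash value, timestamp).
kept = [(20, 2), (10, 4), (40, 5), (30, 6)]
print(k_min_for_query(kept, 5, 2), estimate(kept, 5, 2, hash_range=100))
```

The shortcut works because every kept record that arrived before t must have a hash smaller than k_min; otherwise the k window records with hashes up to k_min would all dominate it and it would have been removed.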
15
Performance Analysis
  • Let n be the actual number of distinct objects
    arriving not before time t, $\hat{n}$ be the reported
    answer and m be the size of the domain of elements; then
  • $P(|\hat{n} - n|/n \le \epsilon) \ge 1 - \delta$
  • if $m > 1/\delta$
  • and $n > k$
  • and $k = O(1/\epsilon^2)$
  • and $r = O(\log(1/\delta))$
  • Expected total space: $O((1/\epsilon^2) \log(1/\delta) \log n)$
  • Expected time complexity:
    $O(\log(1/\delta) (\log(1/\epsilon) + \log n))$

16
Experiments
  • Synthetic datasets following Uniform and Zipf
    distributions
  • Real dataset: WorldCup 98 HTTP requests (20M
    records)

17
Space Efficiency
18
Space Efficiency
19
Time Efficiency
Maintenance cost
20
Time Efficiency
Query response time
21
Accuracy
22
Thanks
23
  • P. B. Gibbons. Distinct sampling for
    highly-accurate answers to distinct values
    queries and event reports. In VLDB, 2001.
  • Space usage: $(1/\epsilon^2) \log(1/\delta)\, m^{1/2}$
  • Y. Tao, G. Kollios, J. Considine, F. Li, and D.
    Papadias. Spatio-temporal aggregation using
    sketches. In ICDE, 2004.
  • Space usage: $O((N/\epsilon^2) \log(1/\delta) \log m)$

24
Space Requirement (SE-FM)
  • To guarantee the performance we require the
    following
  • $k = O(\log m + \log(1/\epsilon) + \log(1/\delta))$
  • $r = O((1/\epsilon^2) \log(1/\delta))$
  • Let $m > 1/\epsilon$ and $m > 1/\delta$; then $k = O(\log m)$
  • Size of one sketch is $k = O(\log m)$
  • Size of r sketches is
    $O(r \log m) = O((1/\epsilon^2) \log(1/\delta) \log m)$
  • Total space: $O((1/\epsilon^2) \log(1/\delta) \log m)$

25
Time Complexity (SE-FM)
  • To guarantee the performance we require the
    following
  • $k = O(\log m + \log(1/\epsilon) + \log(1/\delta))$
  • $r = O((1/\epsilon^2) \log(1/\delta))$
  • The elements in a sketch are stored in a min-heap
    to support logarithmic search/update
  • Hence, the cost of one search/update operation is
    $O(\log k) = O(\log\log m)$
  • To maintain the sketches, we update r sketches
    for each record x
  • Total maintenance cost for one record:
    $O(r \log\log m) = O((1/\epsilon^2) \log(1/\delta) \log\log m)$
  • To answer a query, we search in r sketches
  • Total cost:
    $O(r \log\log m) = O((1/\epsilon^2) \log(1/\delta) \log\log m)$

26
Space Usage (K-Skyband)
  • Performance guarantee
  • $P(|\hat{n} - n|/n \le \epsilon) \ge 1 - \delta$
  • if $m > 1/\delta$
  • and $n > k$
  • and $k = O(1/\epsilon^2)$
  • and $r = O(\log(1/\delta))$
  • Expected size of a k-skyband: $O(k \ln(n/k))$
  • Expected size of r k-skybands:
    $O(r k \log(n/k)) = O((1/\epsilon^2) \log(1/\delta) \log n)$

27
Time Complexity (K-Skyband)
  • Performance guarantee
  • $P(|\hat{n} - n|/n \le \epsilon) \ge 1 - \delta$
  • if $m > 1/\delta$
  • and $n > k$
  • and $k = O(1/\epsilon^2)$
  • and $r = O(\log(1/\delta))$
  • Answering a query q(t)
  • Search eT to compute z:
    $O(\log(k \log n)) = O(\log k + \log n)$
  • Search eH to find the (z+k)-th element:
    $O(\log k + \log n)$
  • We require this for all r sketches:
    $O(r (\log k + \log n)) = O(\log(1/\delta) (\log(1/\epsilon) + \log n))$