Efficient Computation of Frequent and Top-k Elements in Data Streams



1
Efficient Computation of Frequent and Top-k
Elements in Data Streams
  • Ahmed Metwally
  • Divyakant Agrawal
  • Amr El Abbadi
  • Department of Computer Science
  • University of California, Santa Barbara

2
Outline
  • Problem Definition
  • Space-Saving: Summarizing the Data Stream
  • Answering Frequent Elements Queries
  • Answering Top-k Queries
  • Experimental Results
  • Conclusion

3
Motivation
  • Motivated by Internet advertising commissioners
  • Before rendering an advertisement for a user, query the click
    stream for advertisements to display.
  • If the user's profile is not a frequent clicker, then s/he will
    probably not click any displayed advertisement.
  • Show Pay-Per-Impression advertisements.
  • If the user's profile is a frequent clicker, then s/he may click a
    displayed advertisement.
  • Show Pay-Per-Click advertisements.
  • Retrieve the top advertisements to choose what to display.

4
Problem Definition
  • Given an alphabet A and a stream S of size N, a frequent element,
    E, is an element whose frequency, F, exceeds a user-specified
    support, fN
  • The top-k elements are the k elements with the highest frequencies
  • Both problems
  • are very related, though no integrated solution has been proposed
  • An exact solution requires O(min(N, |A|)) space

→ approximate variations
5
Practical Frequent Elements
  • ε-Deficient Frequent Elements Manku 02
  • All frequent elements output should have
  • F > (f - ε)N, where ε is the user-defined error.

6
Practical Top-k
  • FindApproxTop(S, k, ε) Charikar 02
  • Retrieve a list of k elements such that every element, Ei, in the
    list has Fi > (1 - ε) Fk, where Ek is the kth ranked element.

7
Related Work
  • Algorithm classification
  • Counter-based techniques
  • Keep an individual counter for each monitored element
  • If the observed ID is monitored, its counter is updated
  • If the observed ID is not monitored, an algorithm-dependent action
    is taken
  • Sketch-based techniques
  • Estimate the frequencies of all elements using bitmaps of counters
  • Each element is hashed into the counter space using a family of
    hash functions.
  • The hashed-to counters are queried for the frequencies

8
Recent Work (Comparison)
  Algorithm                           Nature   Space Bound                                      Handles
  CountSketch Charikar 02             Sketch   O(k/ε² log(N/δ)), δ is the failure probability   FindApproxTop(S, k, ε)
  GroupTest Cormode 03                Sketch   O(f⁻¹ log(f⁻¹) log(|A|))                         Hot Items
  Frequent Demaine 02                 Counter  O(1/ε), proved by Bose 03                        FE
  Probabilistic-Inplace Demaine 02    Counter  O(m), m is the available memory                  FindCandidateTop(S, k, m/2)
  Lossy Counting Manku 02             Counter  O((1/ε) log(εN))                                 ε-Deficient FE
  Sticky Sampling Manku 02            Counter  O((2/ε) log(f⁻¹δ⁻¹))                             ε-Deficient FE
9
Outline
  • Problem Definition
  • Space-Saving: Summarizing the Data Stream
  • Answering Frequent Elements Queries
  • Answering Top-k Queries
  • Experimental Results
  • Conclusion

10
The Space-Saving Algorithm
  • Space-Saving is counter-based
  • Monitor only m elements
  • Only over-estimation errors
  • Frequency estimation is more accurate for
    significant elements
  • Keep track of max. possible errors

11
Space-Saving By Example
S = A B B A C A B B D D B E C, m = 3 counters
Counter snapshots as the stream is consumed
(Element: Count, max possible error in parentheses):

  After A B B A C    A: 2 (0)   B: 2 (0)   C: 1 (0)
  After … A          A: 3 (0)   B: 2 (0)   C: 1 (0)
  After … B B        B: 4 (0)   A: 3 (0)   C: 1 (0)
  After … D          B: 4 (0)   A: 3 (0)   D: 2 (1)   D replaces C
  After … D B        B: 5 (0)   A: 3 (0)   D: 3 (1)
  After … E          B: 5 (0)   E: 4 (3)   A: 3 (0)   E replaces D
  After … C          B: 5 (0)   E: 4 (3)   C: 4 (3)   C replaces A
  • Space-Saving Algorithm
  • For every element in the stream S
  • If a monitored element is observed
  • Increment its Count
  • If a non-monitored element is observed
  • Replace the element with the minimum hits, min
  • Increment the minimum Count to min + 1
  • The maximum possible over-estimation, min, is recorded as error
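The update rule above can be sketched in a few lines of Python (a dictionary-based sketch of my own, not the paper's constant-time Stream-Summary implementation; ties on the minimum are broken by first occurrence):

```python
def space_saving(stream, m):
    """Sketch of the Space-Saving update loop with m monitored counters."""
    count, error = {}, {}            # element -> Count, element -> max error
    for e in stream:
        if e in count:               # monitored element: just increment
            count[e] += 1
        elif len(count) < m:         # a free counter is still available
            count[e], error[e] = 1, 0
        else:                        # replace the element with minimum hits
            victim = min(count, key=count.get)
            mn = count.pop(victim)
            error.pop(victim)
            count[e] = mn + 1        # new element starts at min + 1
            error[e] = mn            # its over-estimation is at most min
    return count, error

# The slide's example stream with m = 3:
counts, errors = space_saving("ABBACABBDDBEC", 3)
# counts == {'B': 5, 'E': 4, 'C': 4}, errors == {'B': 0, 'E': 3, 'C': 3}
```

Note that the sum of the Counts always equals N, the observation the next slide relies on.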

12
Space-Saving Observations
S = A B B A C A B B D D B E C, N = 13

Element               B  E  C
Count                 5  4  4
error (max possible)  0  3  3

  • Observations
  • The summation of the Counts is N
  • The minimum number of hits, min ≤ N/m
  • In this example, min = 4
  • The minimum number of hits, min, is an upper bound on the error of
    any element
13
Space-Saving Proved Properties
S = A B B A C A B B D D B E C, N = 13

Element               B  E  C
Count                 5  4  4
error (max possible)  0  3  3

  1. If an element E has frequency F > min, then E must be in the
     Stream-Summary. Here F(B) = F1 = 5 > min = 4.
  2. The Count at position i in the Stream-Summary is no less than Fi,
     the frequency of the ith ranked element. Here F(A) = F2 = 3 ≤
     Count2 = 4.
14
Space-Saving Intuition
  • Make use of the skewed property of the data
  • A minority of the elements, the more frequent ones, gets the
    majority of the hits.
  • Frequent elements reside in the counters with larger values.
  • They are not distorted by the ineffective hits of the infrequent
    elements.
  • The numerous infrequent elements reside in the smaller counters.

15
Space-Saving Intuition (Contd)
  • If the skew remains but the popular elements change over time
  • The elements that are growing more popular are gradually pushed to
    the top of the list.
  • If a previously popular element loses its popularity, its relative
    position declines as other counters get incremented.

16
Space-Saving Data Structure
  • We need a data structure that
  • Increments counters in constant time
  • Keeps elements sorted by their counters
  • We propose the Stream-Summary structure, similar
    to the data structure in Demaine 02
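A hash-based approximation of that idea can be sketched as follows (expected rather than worst-case constant-time updates, and without the replacement step; the class and method names are my own):

```python
class StreamSummary:
    """Counters grouped into buckets by value, so elements stay ordered
    by Count: an increment is a move between two adjacent buckets."""

    def __init__(self):
        self.count = {}    # element -> current Count
        self.buckets = {}  # Count value -> set of elements with that Count

    def increment(self, e):
        c = self.count.get(e, 0)
        if c:                                   # detach from the old bucket
            self.buckets[c].discard(e)
            if not self.buckets[c]:
                del self.buckets[c]
        self.count[e] = c + 1                   # attach to the next bucket
        self.buckets.setdefault(c + 1, set()).add(e)

    def sorted_elements(self):
        """Elements by decreasing Count, as frequent/top-k queries need."""
        return sorted(self.count, key=self.count.get, reverse=True)

ss = StreamSummary()
for e in "ABBACABBDDBEC":
    ss.increment(e)
# ss.sorted_elements()[0] == 'B' (5 hits)
```

The paper's pointer-based structure achieves the same two operations in worst-case constant time.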

17
Outline
  • Problem Definition
  • Space-Saving: Summarizing the Data Stream
  • Answering Frequent Elements Queries
  • Answering Top-k Queries
  • Experimental Results
  • Conclusion

18
Frequent Elements Queries
  • Traverse the Stream-Summary, and report all elements that satisfy
    the user support
  • Any element whose
  • guaranteed hits (Count - error) > fN
  • is guaranteed to be a frequent element

19
Frequent Elements Example
Element                          B   D   G   A   Q   F   C   E
Count                           20  14  12   9   7   5   3   3
error                            1   0   4   1   3   0   1   2
Guaranteed hits (Count - error) 19  14   8   8   4   5   2   1

  • For N = 73, m = 8, f = 0.15
  • Frequent elements should have a support of fN ≈ 11 hits.
  • The candidate frequent elements are B, D, and G.
  • The guaranteed frequent elements are B and D, since their
    guaranteed hits exceed 11.
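The query on this slide can be written down directly (a sketch with assumed names; `counters` is the summary as (element, Count, error) triples sorted by decreasing Count):

```python
def frequent_elements(counters, f, N):
    """Report (element, guaranteed?) for every candidate with Count > fN."""
    thresh = f * N
    out = []
    for elem, cnt, err in counters:     # counters sorted by decreasing Count
        if cnt <= thresh:
            break                       # sorted, so no later element qualifies
        out.append((elem, cnt - err > thresh))
    return out

# The slide's example: N = 73, m = 8, f = 0.15 (support ~ 11 hits)
counters = [("B", 20, 1), ("D", 14, 0), ("G", 12, 4), ("A", 9, 1),
            ("Q", 7, 3), ("F", 5, 0), ("C", 3, 1), ("E", 3, 2)]
result = frequent_elements(counters, 0.15, 73)
# result == [('B', True), ('D', True), ('G', False)]
```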

20
Frequent Elements Space Bounds
  Space Bounds     General Distribution          Zipf(α)
  Space-Saving     O(1/ε)                        O((1/ε)^(1/α))
  GroupTest        O(f⁻¹ log(f⁻¹) log(|A|))      —
  Frequent         O(1/ε), proved by Bose 03     —
  Lossy Counting   O((1/ε) log(εN))              —
  Sticky Sampling  O((2/ε) log(f⁻¹δ⁻¹))          —
21
FE Quantitative Comparison
  • Example N = 10⁶, |A| = 10⁴, f = 10⁻¹, ε = 10⁻², δ (the failure
    probability) = 10⁻¹, and uniform data
  • Space-Saving and Frequent: 100 counters
  • Sticky Sampling: 700 counters
  • Lossy Counting: 1000 counters
  • GroupTest: C·930 counters, C > 1
  • Zipfian with α = 2: Space-Saving needs only 10 counters

22
FE Qualitative Comparison
  • Frequent
  • It has a bound similar to Space-Saving in the
    general distribution case.
  • It is built and queried in a way that does not
    allow the user to specify an error threshold.
  • There is no feasible extension to track
    under-estimation errors.
  • Every observation of a non-monitored element
    increases the errors for all the monitored
    elements, since their counters get decremented.

23
FE Qualitative Comparison (Contd)
  • GroupTest
  • It does not output frequencies at all.
  • It reveals nothing about the relative order of
    the elements.
  • It assumes that IDs are 1 … |A|. This can only be enforced by
    building an indexed lookup table.
  • Thus, it practically needs O(|A|) space.

24
FE Qualitative Comparison (Contd)
  • Lossy Counting and Sticky Sampling
  • The theoretical space bound of Space-Saving is
    much tighter than those of Lossy Counting and
    Sticky Sampling.

25
Outline
  • Problem Definition
  • Space-Saving: Summarizing the Data Stream
  • Answering Frequent Elements Queries
  • Answering Top-k Queries
  • Experimental Results
  • Conclusion

26
Top-k Elements Queries
  • Traverse the Stream-Summary, and report the top-k elements.
  • From Property 2, we assert
  • Guaranteed top-k elements
  • Any element whose guaranteed hits (Count - error) ≥ Count_(k+1) is
    guaranteed to be in the top-k.
  • Guaranteed top-k′ (where k′ may differ from k)
  • The reported top-k′ elements are guaranteed to be the correct
    top-k′ iff, for every element in the top-k′, the guaranteed hits
    (Count - error) ≥ Count_(k′+1).

27
Top-k Elements Example
Element                          B   D   G   A   Q   F   C   E
Count                           20  14  12   9   7   5   3   3
error                            1   0   4   1   3   0   1   2
Guaranteed hits (Count - error) 19  14   8   8   4   5   2   1

  • For k = 3, m = 8
  • B, D, and G are the top-3 candidates.
  • B and D are guaranteed to be in the top-3.
  • B, D, G, and A are guaranteed to be the top-4. Here k′ = 4.
  • B and D are guaranteed to be the top-2. Another k′ = 2.
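Both guarantees from the previous slide can be checked mechanically (a sketch with assumed names, over the same (element, Count, error) list sorted by decreasing Count):

```python
def guaranteed_in_topk(counters, k):
    """Elements whose guaranteed hits reach the Count of the (k+1)-th one."""
    next_count = counters[k][1]                  # Count_(k+1)
    return [e for e, cnt, err in counters[:k] if cnt - err >= next_count]

def is_guaranteed_topk(counters, k):
    """True iff the reported top-k is guaranteed to be the correct top-k."""
    return len(guaranteed_in_topk(counters, k)) == k

counters = [("B", 20, 1), ("D", 14, 0), ("G", 12, 4), ("A", 9, 1),
            ("Q", 7, 3), ("F", 5, 0), ("C", 3, 1), ("E", 3, 2)]
# guaranteed_in_topk(counters, 3) == ['B', 'D']   (as on the slide)
# is_guaranteed_topk holds for k = 4 and k = 2, but not for k = 3.
```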

28
Top-k Elements Space Bounds
  Space Bounds   General Distribution                         Zipf(α)
  Space-Saving   FindApproxTop(S, k, ε): O((k/ε) log(N))      Exact top-k: α = 1: O(k² log(|A|)), α > 1: O((k/α)^(1/α) k)
  CountSketch    FindApproxTop(S, k, ε): O((k/ε²) log(N/δ))   FindApproxTop(S, k, ε): α = 1: O(k log(N/δ))
29
Top-k Quantitative Comparison
  • For N = 10⁶, |A| = 10⁴, k = 100, ε = 10⁻¹, δ = 10⁻¹, and uniform
    data
  • Space-Saving: 1000 counters
  • CountSketch: C·2.3×10⁷ counters, C ≫ 1
  • If the data is Zipfian with α = 2
  • Space-Saving: 66 counters
  • CountSketch: C·230 counters, C ≫ 1

30
Top-k Qualitative Comparison
  • CountSketch
  • General distribution
  • Space-Saving has a tighter theoretical space
    bound.
  • Zipf(α) distribution
  • Space-Saving solves the exact problem, while CountSketch solves the
    approximate problem.
  • Space-Saving has a tighter bound in cases of
  • Skewed data
  • Long streams
  • Space-Saving has zero probability of failure.

31
Top-k Qualitative Comparison (Contd)
  • Probabilistic-Inplace
  • Outputs m/2 elements, which is too many.
  • Zipf(α) distribution
  • Probabilistic-Inplace offers no space analysis for Zipfian data.

32
Outline
  • Problem Definition
  • Space-Saving: Summarizing the Data Stream
  • Answering Frequent Elements Queries
  • Answering Top-k Queries
  • Experimental Results
  • Conclusion

33
Experimental Results - Setup
  • Synthetic data
  • Zipf(α), with α varied over 0.0, 0.5, 1.0, …, 2.5, 3.0
  • N = 10⁷ hits.
  • Real data (ValueClick, Inc.): similar results
  • Precision
  • number of correct elements found / size of the entire output
  • Recall
  • number of correct elements found / number of actually correct
    elements
  • Run time
  • stream processing + query time
  • Space used
  • including the hash table
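The two quality metrics can each be stated as a one-liner (a trivial sketch; the function name is mine):

```python
def precision_recall(reported, actual):
    """precision = correct found / entire output;
    recall = correct found / number actually correct."""
    correct = len(set(reported) & set(actual))
    return correct / len(reported), correct / len(actual)

# e.g. reporting {B, D, G} when the truly frequent elements are {B, D}:
p, r = precision_recall(["B", "D", "G"], ["B", "D"])
# p == 2/3, r == 1.0
```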

34
Frequent Elements Results
  • Query f = 10⁻², ε = 10⁻⁴, and δ = 10⁻²
  • We compared with
  • GroupTest and Frequent
  • All algorithms had a recall of 1.
  • That is, every correct element appeared in each algorithm's output.
  • Space-Saving was able to guarantee all of its output to be correct

35
Frequent Elements Precision
36
Frequent Elements Run Time
37
Frequent Elements Space Used
38
Top-k Elements Results
  • Query k = 100, ε = 10⁻⁴, and δ = 10⁻²
  • We compared with
  • CountSketch: it was re-run several times, and its hidden constant
    was estimated to be 16 in order to produce output of competitive
    quality.
  • Probabilistic-Inplace: it was allowed the same number of counters
    as Space-Saving
  • Space-Saving was able to guarantee all of its output to be correct

39
Top-k Elements Precision
40
Top-k Elements Recall
41
Top-k Elements Run Time
42
Top-k Elements Space Used
43
Outline
  • Problem Definition
  • Space-Saving: Summarizing the Data Stream
  • Answering Frequent Elements Queries
  • Answering Top-k Queries
  • Experimental Results
  • Conclusion

44
Conclusion
  • Contributions
  • An integrated approach to solve an interesting
    family of problems
  • Strict error bounds using little space
  • Guarantees on results
  • Special attention was given to Zipfian data
  • Experimental validation
  • Future Work
  • Incremental frequent and top-k elements reporting