Title: Efficient Computation of Frequent and Top-k Elements in Data Streams
1Efficient Computation of Frequent and Top-k
Elements in Data Streams
- Ahmed Metwally
- Divyakant Agrawal
- Amr El Abbadi
- Department of Computer Science
- University of California, Santa Barbara
2Outline
- Problem Definition
- Space-Saving: Summarizing the Data Stream
- Answering Frequent Elements Queries
- Answering Top-k Queries
- Experimental Results
- Conclusion
3Motivation
- Motivated by Internet advertising commissioners
- Before rendering an advertisement for a user, query the clicks stream for advertisements to display.
- If the user's profile is not a frequent clicker, then s/he will probably not click any displayed advertisement.
- Show Pay-Per-Impression advertisements.
- If the user's profile is a frequent clicker, then s/he may click a displayed advertisement.
- Show Pay-Per-Click advertisements.
- Retrieve top advertisements to choose what to display.
4Problem Definition
- Given alphabet A and stream S of size N, a frequent element, E, is an element whose frequency, F, exceeds a user-specified support, φN
- Top-k elements are the k elements with highest frequency
- Both problems
- Very related, though no integrated solution has been proposed
- Exact solution is O(min(N,A)) space → approximate variations
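As a point of reference for the approximate variations, the exact versions of both queries can be sketched with one counter per distinct element, which is where the O(min(N,A)) space comes from. The function names below are illustrative and do not appear in the slides.

```python
from collections import Counter


def exact_frequent(stream, phi):
    """Return the elements whose frequency exceeds phi * N (one counter per distinct element)."""
    counts = Counter(stream)
    n = sum(counts.values())
    return {e for e, f in counts.items() if f > phi * n}


def exact_top_k(stream, k):
    """Return the k elements with the highest frequency, most frequent first."""
    return [e for e, _ in Counter(stream).most_common(k)]


frequent = exact_frequent("ABBACABBDDBEC", phi=0.2)
top2 = exact_top_k("ABBACABBDDBEC", k=2)
```

Both functions keep every distinct element in memory, which is what the counter-based and sketch-based techniques try to avoid.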
5Practical Frequent Elements
- ε-Deficient Frequent Elements [Manku 02]
- All frequent elements output should have F > (φ - ε)N, where ε is the user-defined error.
6Practical Top-k
- FindApproxTop(S, k, ε) [Charikar 02]
- Retrieve a list of k elements such that every element, E_i, in the list has F_i > (1 - ε) F_k, where E_k is the kth ranked element.
7Related Work
- Algorithms Classification
- Counter-Based techniques
- Keep an individual counter for each element
- If the observed ID is monitored, its counter is updated
- If the observed ID is not monitored, algorithm dependent action
- Sketch-Based techniques
- Estimate frequency for all elements using bit-maps of counters
- Each element is hashed into the counters space using a family of hash functions.
- Hashed-to counters are queried for the frequencies
8Recent Work (Comparison)
Algorithm | Nature | Space Bound | Handles
CountSketch [Charikar 02] | Sketch | O((k/ε^2) log(N/δ)), δ is the failure probability | FindApproxTop(S, k, ε)
GroupTest [Cormode 03] | Sketch | O(φ^-1 log(φ^-1) log(A)) | Hot Items
Frequent [Demaine 02] | Counter | O(1/ε), proved by [Bose 03] | FE
Probabilistic-Inplace [Demaine 02] | Counter | O(m), m is the available memory | FindCandidateTop(S, k, m/2)
Lossy Counting [Manku 02] | Counter | (1/ε) log(εN) | ε-Deficient FE
Sticky Sampling [Manku 02] | Counter | (2/ε) log(φ^-1 δ^-1) | ε-Deficient FE
10The Space-Saving Algorithm
- Space-Saving is counter-based
- Monitor only m elements
- Only over-estimation errors
- Frequency estimation is more accurate for significant elements
- Keep track of max. possible errors
11Space-Saving By Example
- Stream S = A B B A C A B B D D B E C, monitored with m = 3 counters
- After A B B A C
Element A B C
Count 2 2 1
error (max possible) 0 0 0
- After A
Element A B C
Count 3 2 1
error (max possible) 0 0 0
- After B B
Element B A C
Count 4 3 1
error (max possible) 0 0 0
- After D (replaces C, min = 1)
Element B A D
Count 4 3 2
error (max possible) 0 0 1
- After D B
Element B A D
Count 5 3 3
error (max possible) 0 0 1
- After E (replaces D, min = 3)
Element B E A
Count 5 4 3
error (max possible) 0 3 0
- After C (replaces A, min = 3)
Element B E C
Count 5 4 4
error (max possible) 0 3 3
- Space-Saving Algorithm
- For every element in the stream S
- If a monitored element is observed
- Increment its Count
- If a non-monitored element is observed,
- Replace the element with minimum hits, min
- Increment the minimum Count to min + 1
- The maximum possible over-estimation, error, is set to min
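The steps of the algorithm can be sketched in a few lines of Python. This is a minimal illustration, assuming plain dictionaries and a linear scan for the minimum counter instead of a constant-time structure; the name `space_saving` and the `(count, error)` return layout are choices made here, not part of the slides.

```python
def space_saving(stream, m):
    """Summarize `stream` with at most m counters.

    Returns {element: (count, error)}, where error is the maximum
    possible over-estimation of that element's count.
    """
    counts = {}  # monitored element -> estimated count
    errors = {}  # monitored element -> maximum possible over-estimation
    for e in stream:
        if e in counts:
            # Monitored element: increment its Count.
            counts[e] += 1
        elif len(counts) < m:
            # A free counter is still available: start monitoring e exactly.
            counts[e] = 1
            errors[e] = 0
        else:
            # Replace the element with minimum hits (linear scan, for brevity).
            victim = min(counts, key=counts.get)
            min_count = counts.pop(victim)
            del errors[victim]
            counts[e] = min_count + 1
            errors[e] = min_count
    return {e: (counts[e], errors[e]) for e in counts}


summary = space_saving("ABBACABBDDBEC", m=3)
```

On the example stream with three counters this ends with B at Count 5 and error 0, and E and C at Count 4 and error 3. Ties for the minimum may be broken differently than on the slides, but the final summary is the same.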
12Space-Saving Observations
S = ABBACABBDDBEC, N = 13
- Observations
- The summation of the Counts is N
- Minimum number of hits, min ≤ N/m
- In this example, min = 4
- The minimum number of hits, min, is an upper bound on the error of any element
Element B E C
Count 5 4 4
error (max possible) 0 3 3
13Space-Saving Proved Properties
S = ABBACABBDDBEC, N = 13
- If Element E has frequency F > min, then E must be in Stream-Summary. F(B) = F_1 = 5, min = 4.
- The Count at position i in Stream-Summary is no less than F_i, the frequency of the ith ranked element. F(A) = F_2 = 3, Count_2 = 4.
Element B E C
Count 5 4 4
error (max possible) 0 3 3
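The observations and properties can be checked numerically on the example. The snippet below is only a sanity check under the assumption that the final summary is the one shown in the table above; the variable names are illustrative.

```python
from collections import Counter

stream = "ABBACABBDDBEC"
# Final Stream-Summary from the example: element -> (Count, error).
summary = {"B": (5, 0), "E": (4, 3), "C": (4, 3)}

true_freq = Counter(stream)
n, m = len(stream), len(summary)
min_count = min(c for c, _ in summary.values())

# Observations: the Counts sum to N, and min <= N/m.
sum_ok = sum(c for c, _ in summary.values()) == n
min_ok = min_count <= n / m

# min bounds every error, and Count - error <= true frequency <= Count.
bounds_ok = all(
    err <= min_count and c - err <= true_freq[e] <= c
    for e, (c, err) in summary.items()
)

# Any element with true frequency F > min is monitored.
monitored_ok = all(e in summary for e, f in true_freq.items() if f > min_count)

# The Count at position i is no less than F_i, the ith largest true frequency.
ranked_counts = sorted((c for c, _ in summary.values()), reverse=True)
ranked_freqs = sorted(true_freq.values(), reverse=True)
rank_ok = all(c >= f for c, f in zip(ranked_counts, ranked_freqs))
```

A check on one stream is an illustration only; it does not replace the proofs.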
14Space-Saving Intuition
- Make use of the skewed property of the data
- A minority of the elements, the more frequent ones, gets the majority of the hits.
- Frequent elements will reside in the counters of bigger values.
- They will not be distorted by the ineffective hits of the infrequent elements.
- Numerous infrequent elements reside on the smaller counters.
15Space-Saving Intuition (Contd)
- If the skew remains, but the popular elements change over time
- The elements that are growing more popular will gradually be pushed to the top of the list.
- If one of the previously popular elements lost its popularity, its relative position will decline, as other counters get incremented.
16Space-Saving Data Structure
- We need a data structure that
- Increments counters in constant time
- Keeps elements sorted by their counters
- We propose the Stream-Summary structure, similar to the data structure in [Demaine 02]
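One way to meet both requirements is a linked list of buckets sorted by value, where each bucket holds all elements that share the same count, plus a hash table from element to bucket. The sketch below is a simplified illustration of that idea, not the authors' implementation; the class and method names are assumptions, and error tracking is omitted to keep it short.

```python
class Bucket:
    """Holds every monitored element that currently has the same count."""

    def __init__(self, value):
        self.value = value
        self.elements = set()
        self.prev = None  # neighbouring bucket with a smaller value
        self.next = None  # neighbouring bucket with a larger value


class StreamSummary:
    def __init__(self):
        self.bucket_of = {}  # element -> bucket, for constant-time lookup
        self.head = None     # bucket with the minimum value

    def _insert_after(self, bucket, new):
        # Link `new` directly after `bucket`; bucket=None means "at the head".
        new.prev = bucket
        new.next = bucket.next if bucket else self.head
        if new.next:
            new.next.prev = new
        if bucket:
            bucket.next = new
        else:
            self.head = new

    def _unlink_if_empty(self, bucket):
        if bucket.elements:
            return
        if bucket.prev:
            bucket.prev.next = bucket.next
        else:
            self.head = bucket.next
        if bucket.next:
            bucket.next.prev = bucket.prev

    def increment(self, e):
        """Raise e's count by one by moving it to the neighbouring bucket."""
        old = self.bucket_of.get(e)
        value = old.value + 1 if old else 1
        target = old.next if old else self.head
        if target is None or target.value != value:
            target = Bucket(value)
            self._insert_after(old, target)
        if old:
            old.elements.remove(e)
            self._unlink_if_empty(old)
        target.elements.add(e)
        self.bucket_of[e] = target

    def replace_min(self, e):
        """Evict one element with the minimum count and let e inherit that count."""
        victim = self.head.elements.pop()
        del self.bucket_of[victim]
        self.head.elements.add(e)
        self.bucket_of[e] = self.head

    def counts(self):
        return {e: b.value for e, b in self.bucket_of.items()}


# Usage: the example stream with m = 3 counters.
ss = StreamSummary()
for e in "ABBACABBDDBEC":
    if e not in ss.bucket_of and len(ss.bucket_of) >= 3:
        ss.replace_min(e)
    ss.increment(e)
```

Because an increment only moves an element to the adjacent bucket, no sorting or searching is needed, and walking the list from `head` visits the counts in ascending order.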
18Frequent Elements Queries
- Traverse Stream-Summary, and report all elements that satisfy the user support
- Any element whose guaranteed hits = (Count - error) > φN is guaranteed to be a frequent element
19Frequent Elements Example
Element B D G A Q F C E
Count 20 14 12 9 7 5 3 3
error 1 0 4 1 3 0 1 2
Guaranteed Hits Count - error 19 14 8 8 4 5 2 1
- For N = 73, m = 8, φ = 0.15
- Frequent Elements should have support of 11 hits.
- Candidate Frequent Elements are B, D, and G.
- Guaranteed Frequent Elements are B and D, since their guaranteed hits > 11.
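The query in this example can be reproduced with a single pass over the summary. This is a minimal sketch, assuming the summary is available as a list of (element, Count, error) tuples; the function name is illustrative.

```python
def frequent_elements(summary, n, phi):
    """Return (candidates, guaranteed) for support phi over a stream of size n.

    summary: list of (element, count, error) tuples.
    """
    threshold = phi * n
    # Candidates: the Count alone exceeds the support.
    candidates = [e for e, c, _ in summary if c > threshold]
    # Guaranteed: even after subtracting the maximum error, the support is exceeded.
    guaranteed = [e for e, c, err in summary if c - err > threshold]
    return candidates, guaranteed


summary = [
    ("B", 20, 1), ("D", 14, 0), ("G", 12, 4), ("A", 9, 1),
    ("Q", 7, 3), ("F", 5, 0), ("C", 3, 1), ("E", 3, 2),
]
candidates, guaranteed = frequent_elements(summary, n=73, phi=0.15)
```

Here φN = 10.95, so G is a candidate with Count 12 but is not guaranteed, because its guaranteed hits are only 8.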
20Frequent Elements Space Bounds
Space Bounds | General Distribution | Zipf(α)
Space-Saving | O(1/ε) | (1/ε)^(1/α)
GroupTest | O(φ^-1 log(φ^-1) log(A)) | -
Frequent | O(1/ε), proved by [Bose 03] | -
Lossy Counting | (1/ε) log(εN) | -
Sticky Sampling | (2/ε) log(φ^-1 δ^-1) | -
21FE Quantitative Comparison
- Example: N = 10^6, A = 10^4, φ = 10^-1, ε = 10^-2, δ (the failure probability) = 10^-1, and Uniform data
- Space-Saving and Frequent: 100 counters
- Sticky Sampling: 700 counters
- Lossy Counting: 1000 counters
- GroupTest: C*930 counters, C ≥ 1
- Zipfian with α = 2: Space-Saving: 10 counters
22FE Qualitative Comparison
- Frequent
- It has a bound similar to Space-Saving in the general distribution case.
- It is built and queried in a way that does not allow the user to specify an error threshold.
- There is no feasible extension to track under-estimation errors.
- Every observation of a non-monitored element increases the errors for all the monitored elements, since their counters get decremented.
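The last point is easier to see in code. The following is a rough sketch of Frequent as we understand it from [Demaine 02]: when all counters are taken and a non-monitored element arrives, every counter is decremented and the new element is dropped. The function name and the dictionary representation are assumptions made for illustration.

```python
def frequent(stream, m):
    """Counter-based summary with at most m counters that under-estimates counts."""
    counts = {}
    for e in stream:
        if e in counts:
            counts[e] += 1
        elif len(counts) < m:
            counts[e] = 1
        else:
            # Non-monitored element and no free counter:
            # decrement every counter, freeing those that reach zero.
            for x in list(counts):
                counts[x] -= 1
                if counts[x] == 0:
                    del counts[x]
    return counts


result = frequent("ABBACABBDDBEC", m=3)
```

On the example stream with three counters, B ends with a counter of 3 although its true frequency is 5, and the summary carries no per-element record of how large that under-estimation is.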
23FE Qualitative Comparison (Contd)
- GroupTest
- It does not output frequencies at all.
- It reveals nothing about the relative order of the elements.
- It assumes that IDs are 1 ... A. This can only be enforced by building an indexed lookup table.
- Thus, practically it needs O(A) space.
24FE Qualitative Comparison (Contd)
- Lossy Counting and Sticky Sampling
- The theoretical space bound of Space-Saving is
much tighter than those of Lossy Counting and
Sticky Sampling.
26Top-k Elements Queries
- Traverse the Stream-Summary, and report top-k elements.
- From Property 2, we assert
- Guaranteed top-k elements
- Any element whose guaranteed hits = (Count - error) ≥ Count_{k+1} is guaranteed to be in the top-k.
- Guaranteed top-k' (where k' ≈ k)
- The top-k' elements reported are guaranteed to be the correct top-k' iff for every element in the top-k', guaranteed hits = (Count - error) ≥ Count_{k'+1}.
27Top-k Elements Example
Element B D G A Q F C E
Count 20 14 12 9 7 5 3 3
error 1 0 4 1 3 0 1 2
Guaranteed Hits Count - error 19 14 8 8 4 5 2 1
- For k = 3, m = 8
- B, D, and G are the top-3 candidates.
- B and D are guaranteed to be in the top-3.
- B, D, G, and A are guaranteed to be the top-4. Here k' = 4.
- B and D are guaranteed to be the top-2. Another k' = 2.
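The checks in this example follow directly from comparing guaranteed hits with the next Count. The sketch below assumes the summary is a list of (element, Count, error) tuples sorted by Count in descending order, and only inspects the first k entries; the function names are illustrative.

```python
def guaranteed_in_top_k(summary, k):
    """Among the first k entries, return those whose guaranteed hits reach Count_{k+1}."""
    if k >= len(summary):
        raise ValueError("need at least k + 1 monitored elements")
    next_count = summary[k][1]  # Count_{k+1} (the list is 0-indexed)
    return [e for e, c, err in summary[:k] if c - err >= next_count]


def is_guaranteed_top(summary, k):
    """True iff the first k entries are guaranteed to be the correct top-k."""
    return len(guaranteed_in_top_k(summary, k)) == k


summary = [
    ("B", 20, 1), ("D", 14, 0), ("G", 12, 4), ("A", 9, 1),
    ("Q", 7, 3), ("F", 5, 0), ("C", 3, 1), ("E", 3, 2),
]
in_top_3 = guaranteed_in_top_k(summary, 3)
```

For k = 3 the next Count is 9, so G (guaranteed hits 8) cannot be confirmed; for k' = 4 the next Count is 7 and all four entries pass.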
28Top-k Elements Space Bounds
Space Bounds | General Distribution | Zipf(α)
Space-Saving | FindApproxTop(S, k, ε): O((k/ε) log(N)) | Exact Top-k Problem: α = 1: O(k^2 log(A)); α > 1: O((k/α)^(1/α) k)
CountSketch | FindApproxTop(S, k, ε): O((k/ε^2) log(N/δ)) | FindApproxTop(S, k, ε): α = 1: O(k log(N/δ))
29Top-k Quantitative Comparison
- For N = 10^6, A = 10^4, k = 100, ε = 10^-1, δ = 10^-1, and Uniform data
- Space-Saving: 1000 counters
- CountSketch: C*2.3*10^7 counters, C >> 1
- If the data is Zipfian with α = 2
- Space-Saving: 66 counters
- CountSketch: C*230 counters, C >> 1
30Top-k Qualitative Comparison
- CountSketch
- General distribution
- Space-Saving has a tighter theoretical space bound.
- Zipf(α) distribution
- Space-Saving solves the exact problem, while CountSketch solves the approximate problem.
- Space-Saving has a tighter bound in cases of
- Skewed data
- Long streams
- Space-Saving has zero probability of failure.
31Top-k Qualitative Comparison (Contd)
- Probabilistic-Inplace
- Outputs m/2 elements, which is too many.
- Zipf(α) distribution
- Probabilistic-Inplace does not offer space
analysis in case of Zipfian data.
33Experimental Results - Setup
- Synthetic data
- Zipf(α), α varied: 0.0, 0.5, 1.0, ..., 2.5, 3.0
- N = 10^7 hits.
- Real Data (ValueClick, Inc.): Similar results
- Precision
- number of correct elements found / entire output
- Recall
- number of correct elements found / number of actual correct
- Run time
- Processing Stream + Query Time
- Space used
- Including hash table
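The two quality metrics can be written down directly from their definitions. This is a small sketch; returning 1.0 for an empty output or an empty correct set is a convention assumed here, and the example sets are made up for illustration.

```python
def precision(output, correct):
    """Number of correct elements found / size of the entire output."""
    output, correct = set(output), set(correct)
    return len(output & correct) / len(output) if output else 1.0


def recall(output, correct):
    """Number of correct elements found / number of actual correct elements."""
    output, correct = set(output), set(correct)
    return len(output & correct) / len(correct) if correct else 1.0


# Hypothetical query result: three elements reported, two of them actually correct.
p = precision(["B", "D", "G"], {"B", "D"})
r = recall(["B", "D", "G"], {"B", "D"})
```

In this made-up case recall is 1 because every correct element was reported, while precision is 2/3 because one reported element is wrong.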
34Frequent Elements Results
- Query: φ = 10^-2, ε = 10^-4, and δ = 10^-2
- We compared with
- GroupTest and Frequent
- All algorithms had a recall of 1.
- That is, they all output the correct elements among their output.
- Space-Saving was able to guarantee all its output to be correct
35Frequent Elements Precision
36Frequent Elements Run Time
37Frequent Elements Space Used
38Top-k Elements Results
- Query: k = 100, ε = 10^-4, and δ = 10^-2
- We compared with
- CountSketch: CountSketch was re-run several times. The hidden constant was estimated to be 16, in order to have output of competitive quality.
- Probabilistic-InPlace was allowed the same number of counters as Space-Saving
- Space-Saving was able to guarantee all its output to be correct
39Top-k Elements Precision
40Top-k Elements Recall
41Top-k Elements Run Time
42Top-k Elements Space Used
44Conclusion
- Contributions
- An integrated approach to solve an interesting family of problems
- Strict error bounds using little space
- Guarantees on results
- Special attention was given to Zipfian data
- Experimental validation
- Future Work
- Incremental frequent and top-k elements reporting