Title: How to find frequent items continuously in data streams
1. How to find frequent items continuously in data streams
2. A naïve approach to find frequent items
- Method
  - Maintain an array of counters
  - Increment the corresponding counter by one whenever a new item arrives
- Problem
  - Available array size M << n (the number of distinct items)
  - Not suited to continuous queries
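The method above can be sketched in a few lines. This is only an illustration: the function name `naive_frequent_items` and the `threshold` parameter are assumptions, not part of the slides. It also makes the problem visible, since the counter table grows with the number of distinct items.

```python
from collections import Counter


def naive_frequent_items(stream, threshold):
    """Keep one counter per distinct item, then report items above a frequency threshold."""
    counts = Counter()
    total = 0
    for item in stream:
        counts[item] += 1  # increment the corresponding counter
        total += 1
    # Space used is proportional to the number of distinct items (n), not to M.
    return {item: c for item, c in counts.items() if c > threshold * total}


result = naive_frequent_items([1, 2, 2, 2, 3, 2, 1], threshold=0.5)
```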
3. Applications
- Statistical properties of sensor monitoring data
- Statistical properties of Internet packets passing through a router
- Statistical properties of the keywords searched on a search engine
4. Basic idea: MJRTY (majority voting)
- Use one counter to find the majority of a group
- Number of comparisons: n - 1
- Example: 1 2 2 2 3 2 1
Counter states recovered from the slide (? marks an empty counter):
element_name    ?  1  ?  2  2  2
Counter value   0  1  0  1  1  1
Reference: Robert S. Boyer and J Strother Moore, Tech. Report ICSCA-CMP-32, 1982
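A minimal sketch of the one-counter idea, assuming the usual reading of MJRTY: take a new candidate when the counter is 0, increment on a match, decrement on a mismatch. The names `mjrty` and `is_majority` are illustrative. The counter only yields a candidate, so a second pass is used here to confirm it.

```python
def mjrty(items):
    """One-counter majority vote; returns the final (element_name, value)."""
    element_name, value = None, 0
    for item in items:
        if value == 0:
            element_name, value = item, 1  # empty counter: adopt the new item
        elif item == element_name:
            value += 1
        else:
            value -= 1
            if value == 0:
                element_name = None  # counter is empty again
    return element_name, value


def is_majority(items, candidate):
    """Second pass: check that the candidate really occurs more than half the time."""
    return items.count(candidate) * 2 > len(items)


sequence = [1, 2, 2, 2, 3, 2, 1]
candidate, value = mjrty(sequence)
confirmed = is_majority(sequence, candidate)
```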
5. Why does MJRTY work?
- Assume a majority item a exists in group G, and delete any two different items from G
  - If neither of the two items is a, a is still the majority after the deletion
  - If one of the two items is a, a is still the majority, since a and its adversary are both decremented by one
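The argument can be written as a one-line inequality. The symbols are assumptions introduced here: N is the size of G, m is the number of occurrences of a before the deletion, and m' the number after it. Since the two deleted items are different, at most one of them is a.

```latex
m > \frac{N}{2}
\quad\Longrightarrow\quad
m' \;\ge\; m - 1 \;>\; \frac{N}{2} - 1 \;=\; \frac{N - 2}{2}
```

So a still occupies more than half of the remaining N - 2 items.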
6. Apply MJRTY to distributed environments
- Merge two nodes with the same element_name
  - Add the values directly
  - Ex: (d, 9) and (d, 3) merge into (d, 12)
- Merge two nodes with different element_names
  - Set value to the absolute value of the difference between the two values
  - Set element_name to the one with the larger value
  - Ex: (d, 9) and (c, 3) merge into (d, 6)
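The two merge rules can be sketched as a single function on (element_name, value) pairs. The name `merge_nodes` is illustrative, and returning an empty node when two different names carry equal values is an assumption, since the slide does not cover that case.

```python
def merge_nodes(a, b):
    """Merge two (element_name, value) nodes."""
    name_a, value_a = a
    name_b, value_b = b
    if name_a == name_b:
        return name_a, value_a + value_b  # same name: add the values
    if value_a == value_b:
        return None, 0  # assumption: equal values cancel out completely
    winner = name_a if value_a > value_b else name_b
    return winner, abs(value_a - value_b)  # different names: keep the difference


same = merge_nodes(("d", 9), ("d", 3))
different = merge_nodes(("d", 9), ("c", 3))
```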
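7. Apply MJRTY to data stream (basic)
- Required space: proportional to the window size
- Ex: number of available counters = 9
Offset at time t     0(now)  -1  -2  -3  -4  -5  -6  -7  -8
time t     element_name   d   a   b   d   b   c   d   d   c
           value         15   8  22   2  10   3   7   9   3
time t+1   element_name   d   a   b   d   b   c   d   d   b
           value         15   8  22   2  10   3   7   9   5
- At time t, the last counter element (c, 3) is going to be recycled
- At time t+1, the recycled counter element is used for the new time unit (b, 5)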
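A rough sketch of the basic scheme, assuming each counter element summarizes one time unit as an (element_name, value) node and that a query merges all elements in the window. The class `WindowedMjrty` and its method names are hypothetical, and the result is a majority candidate, not a confirmed majority.

```python
def mjrty(items):
    """Reduce the items of one time unit to a single (element_name, value) node."""
    name, value = None, 0
    for item in items:
        if value == 0:
            name, value = item, 1
        elif item == name:
            value += 1
        else:
            value -= 1
    return (name, value) if value else (None, 0)


def merge_nodes(a, b):
    """Add values for equal names; otherwise keep the larger one minus the smaller."""
    if a[0] == b[0]:
        return a[0], a[1] + b[1]
    if a[1] == b[1]:
        return None, 0  # assumption: equal values cancel out
    winner = a if a[1] > b[1] else b
    return winner[0], abs(a[1] - b[1])


class WindowedMjrty:
    """One counter element per time unit, reused in a ring (space = window size)."""

    def __init__(self, window_size):
        self.nodes = [(None, 0)] * window_size
        self.time = -1

    def add_time_unit(self, items):
        self.time += 1
        slot = self.time % len(self.nodes)  # the oldest element is the one recycled
        self.nodes[slot] = mjrty(items)

    def query(self):
        result = (None, 0)
        for node in self.nodes:
            result = merge_nodes(result, node)
        return result


window = WindowedMjrty(window_size=3)
for unit in (["d", "d", "a"], ["b", "b"], ["d", "d", "d"]):
    window.add_time_unit(unit)
first = window.query()
window.add_time_unit(["b", "b", "b", "b"])  # recycles the element of the first unit
second = window.query()
```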
8. Apply MJRTY to data stream (improved)
- Required space: proportional to log(window size)
[Figure: counter boundaries at offsets 0, -1, -2, -4, -6, -10, -18, -34, -66 from now; labels: "Use the recycled counter element", "Going to be recycled"]
- Three counters are responsible for time units with length 1 → merge
- Three counters are responsible for time units with length 2 → merge
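One possible reading of the improved scheme, sketched under assumptions: counters are kept newest first, and whenever three counters cover time units of the same length, the two oldest are merged into one counter of double length. The class `LogWindowMjrty`, the choice of which two counters to merge, and the rule for recycling counters that fall outside the window are all assumptions, not taken from the slides. Because the oldest counter may straddle the window boundary, a query is only approximate.

```python
def merge_nodes(a, b):
    """Add values for equal names; otherwise keep the larger one minus the smaller."""
    if a[0] == b[0]:
        return a[0], a[1] + b[1]
    if a[1] == b[1]:
        return None, 0  # assumption: equal values cancel out
    winner = a if a[1] > b[1] else b
    return winner[0], abs(a[1] - b[1])


class LogWindowMjrty:
    def __init__(self, window_size):
        self.window_size = window_size
        self.buckets = []  # newest first; each entry is [length, (element_name, value)]

    def add_time_unit(self, node):
        self.buckets.insert(0, [1, node])
        i = 0
        # Cascade: while three counters cover the same length, merge the two oldest.
        while i + 2 < len(self.buckets):
            newest, middle, oldest = self.buckets[i:i + 3]
            if not (newest[0] == middle[0] == oldest[0]):
                break
            merged = [middle[0] + oldest[0], merge_nodes(middle[1], oldest[1])]
            self.buckets[i + 1:i + 3] = [merged]
            i += 1
        # Recycle counters that lie entirely outside the window.
        total = sum(length for length, _ in self.buckets)
        while total - self.buckets[-1][0] >= self.window_size:
            total -= self.buckets.pop()[0]

    def query(self):
        result = (None, 0)
        for _, node in self.buckets:
            result = merge_nodes(result, node)
        return result


w = LogWindowMjrty(window_size=100)
for _ in range(15):
    w.add_time_unit(("d", 1))
lengths = [length for length, _ in w.buckets]
```

After 15 time units the sketch holds four counters instead of fifteen, which is where the logarithmic space comes from.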
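9. Extend MJRTY to HI-FRQCY (high-frequency)
- Frequent item: frequency > 1/(n+1)
- Use n counters to get frequent items
- Ex: when n = 2
  - 1 1 2 3 3
Counter states recovered from the slide (F marks an empty counter):
Counter 1   element_name   F  1  1  1
            value          0  1  2  1
Counter 2   element_name   F  2  F  3
            value          0  1  0  1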
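A minimal sketch of the n-counter extension, assuming the natural generalization of MJRTY: increment a matching counter, fill an empty counter if one exists, otherwise decrement every counter and drop the new item. The function names are illustrative. As with MJRTY, the counters only give candidates, so a second pass confirms them.

```python
def hi_frqcy(items, n):
    """Track candidates for items with frequency > 1/(n+1) using at most n counters."""
    counters = {}  # element_name -> value
    for item in items:
        if item in counters:
            counters[item] += 1
        elif len(counters) < n:
            counters[item] = 1  # an empty counter is available
        else:
            # All n counters hold other items: decrement each and drop the new item.
            for name in list(counters):
                counters[name] -= 1
                if counters[name] == 0:
                    del counters[name]  # the counter becomes empty
    return counters


def confirm(items, candidates, n):
    """Second pass: keep only candidates whose frequency really exceeds 1/(n+1)."""
    return sorted(c for c in candidates if items.count(c) * (n + 1) > len(items))


sequence = [1, 1, 2, 3, 3]
candidates = hi_frqcy(sequence, n=2)
frequent = confirm(sequence, candidates, n=2)
```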
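10. Why does HI_FRQCY work?
- At most n items can have a frequency larger than 1/(n+1), since n+1 such items would have frequencies summing to more than 1; so n counters are enough to record all frequent items
- If frequent items exist in group G, deleting n+1 different items from G will not affect the status of the frequent items (for n = 1 this is the two-item deletion used by MJRTY)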
11. Apply HI_FRQCY to distributed and continuous environments
- Merge two nodes
  - If any counters in the two nodes record the same item, merge them
  - Sort the counters
  - Choose the n largest counters as the result
- Can be applied to distributed systems
- Can be applied to continuous query environments
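The merge steps above can be sketched as follows. Each node is assumed to be a mapping from element_name to value, and "merge them" is assumed to mean adding the values, as in the MJRTY merge; `merge_hi_frqcy` is an illustrative name.

```python
def merge_hi_frqcy(node_a, node_b, n):
    """Merge two nodes of at most n counters each into one node of at most n counters."""
    merged = dict(node_a)
    for name, value in node_b.items():
        # Counters that record the same item are merged by adding their values.
        merged[name] = merged.get(name, 0) + value
    # Sort the counters by value and keep the n largest.
    largest = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)[:n]
    return dict(largest)


merged = merge_hi_frqcy({"a": 5, "b": 2}, {"a": 1, "c": 4}, n=2)
```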
12. Continuous query
- Characteristics
  - Data updates continuously
  - Queries tend to target recent data
  - May query some statistic during any period of time within the window size
- A diagram of continuous query
[Diagram: elements ordered by arrival time up to now, with window size 7]
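To make the query model concrete, here is a small sketch that stores every element of the window exactly, so it illustrates the kind of query only and not the counter-based methods. The class `SlidingWindow` and its methods are hypothetical, and arrival times are assumed to be increasing integers.

```python
from collections import Counter, deque


class SlidingWindow:
    def __init__(self, window_size):
        self.window_size = window_size
        self.items = deque()  # (arrival_time, element), oldest first

    def add(self, arrival_time, element):
        self.items.append((arrival_time, element))
        # Drop elements that have fallen out of the window ending at "now".
        while self.items[0][0] <= arrival_time - self.window_size:
            self.items.popleft()

    def count(self, start, end):
        """Item counts for any period [start, end] that lies inside the window."""
        return Counter(e for t, e in self.items if start <= t <= end)


window = SlidingWindow(window_size=7)
for t, element in enumerate("abcdabcdab", start=1):
    window.add(t, element)
recent = window.count(8, 10)
```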