How to find frequent items continuously in data streams

Slides: 12
Provided by: ccrcNt
1
How to find frequent items continuously in data
streams
  • Speaker ???
  • Adviser ???

2
A naïve approach to find frequent items
  • Method
  • Maintain an array of counters
  • Increment the corresponding counter by one
    whenever a new item arrives
  • Problem
  • Available array size M << n (the number of
    distinct items)
  • Not suited to continuous queries
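
For reference, the naive method can be sketched in Python with one counter per distinct item. The function name `naive_counts` is illustrative and does not come from the slides. Its memory use grows with the number of distinct items n, which is the problem listed above.

```python
from collections import Counter


def naive_counts(stream):
    """Keep one counter per distinct item (memory grows with n distinct items)."""
    counts = Counter()
    for item in stream:
        counts[item] += 1  # increment the matching counter on every arrival
    return counts


counts = naive_counts("1222321")
```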

3
Applications
  • The statistical property of sensor monitoring
    data
  • The statistical property of Internet packets
    through a router
  • The statistical property of searching keywords of
    a search engine

4
Basic idea: MJRTY (majority voting)
  • Use one counter to find the majority of a group
  • Number of comparisons: n-1
  • Example: 1 2 2 2 3 2 1

Item read   element_name   Counter value
(start)     ?              0
1           1              1
2           ?              0
2           2              1
2           2              2
3           2              1
2           2              2
1           2              1
Tech. Report ICSCA-CMP-32, Robert S. Boyer and
J Strother Moore, 1982
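
The one-counter procedure can be sketched in Python as follows. The function name `mjrty` and the use of `None` for an empty counter are choices made for this sketch, not taken from the slides. The surviving candidate is the majority only if a majority item exists; otherwise a second pass over the data would be needed to check it.

```python
def mjrty(stream):
    """One-counter majority vote: return the surviving (element_name, value)."""
    candidate, value = None, 0
    for item in stream:
        if value == 0:
            candidate, value = item, 1   # empty counter adopts the new item
        elif item == candidate:
            value += 1                   # same item: increment
        else:
            value -= 1                   # different item: decrement
            if value == 0:
                candidate = None         # counter is empty again
    return candidate, value


result = mjrty("1222321")  # the example sequence from the slide
```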
5
Why does MJRTY work?
  • Assume a majority item a exists in group G, and we
    delete any 2 different items from G
  • If neither of the two items is a, a is naturally
    still the majority after deleting them
  • If one of the two items is a, a is still the
    majority, since both a and its adversary are
    decremented by one
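
The same argument can be written as a single inequality. The symbols c (the number of occurrences of a) and N (the size of G) are introduced only for this sketch.

```latex
c > \frac{N}{2}
\quad\Longrightarrow\quad
c - 1 > \frac{N}{2} - 1 = \frac{N - 2}{2}
```

After two different items are deleted, G holds N - 2 items and a occurs at least c - 1 times, so a still accounts for more than half of the group.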

6
Apply MJRTY to a distributed environment
  • Merge two nodes with the same element_name
  • Add the values directly
  • Merge two nodes with different element_names
  • Set value to the absolute value of the difference
    between the two values
  • Set element_name to the one with the larger value

Examples (element_name, value):
(d, 9) merged with (d, 3) gives (d, 12)
(d, 9) merged with (c, 3) gives (d, 6)
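
A minimal Python sketch of the two merge rules is given below. The name `merge_mjrty` is hypothetical, and returning an empty counter `(None, 0)` on a tie is an assumption, since the slides do not say which name survives when the two values are equal.

```python
def merge_mjrty(node_a, node_b):
    """Merge two (element_name, value) counters using the two rules above."""
    name_a, value_a = node_a
    name_b, value_b = node_b
    if name_a == name_b:
        return (name_a, value_a + value_b)      # same name: add the values
    if value_a == value_b:
        return (None, 0)                        # assumption: a tie leaves an empty counter
    # different names: keep the name with the larger value and the absolute difference
    name = name_a if value_a > value_b else name_b
    return (name, abs(value_a - value_b))


same = merge_mjrty(("d", 9), ("d", 3))
different = merge_mjrty(("d", 9), ("c", 3))
```

The two calls reuse the values from the slide's diagram.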
7
Apply MJRTY to a data stream (basic)
  • Required space: proportional to the window size

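Ex: number of available counters = 9

time t (the last counter, (c, 3), is going to be recycled)
offset from now   0    -1   -2   -3   -4   -5   -6   -7   -8
element_name      d    a    b    d    b    c    d    d    c
value             15   8    22   2    10   3    7    9    3

time t+1 (use the recycled counter element, which now holds (b, 5))
element_name      d    a    b    d    b    c    d    d    b
value             15   8    22   2    10   3    7    9    5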
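
One way to read the basic scheme is as a ring of MJRTY counters, one per time unit, where advancing the clock recycles the oldest counter. The Python sketch below follows that reading. The class `WindowedMjrty`, its method names, and the tie rule in `merge` are assumptions made for illustration, not details from the slides.

```python
def merge(a, b):
    """Merge two (element_name, value) counters; a tie yields an empty counter."""
    (name_a, value_a), (name_b, value_b) = a, b
    if name_a == name_b:
        return (name_a, value_a + value_b)
    if value_a == value_b:
        return (None, 0)  # assumption: the slides do not define the tie case
    if value_a > value_b:
        return (name_a, value_a - value_b)
    return (name_b, value_b - value_a)


class WindowedMjrty:
    """Ring of counters, one per time unit (space grows with the window size)."""

    def __init__(self, window_size):
        self.slots = [(None, 0)] * window_size
        self.now = 0  # index of the counter for the current time unit

    def add(self, item):
        # A single MJRTY step is the same as merging with the counter (item, 1).
        self.slots[self.now] = merge(self.slots[self.now], (item, 1))

    def tick(self):
        """Advance one time unit: the oldest counter is recycled for the new unit."""
        self.now = (self.now + 1) % len(self.slots)
        self.slots[self.now] = (None, 0)

    def query(self, units):
        """Merge the counters of the most recent `units` time units."""
        units = min(units, len(self.slots))
        result = (None, 0)
        for back in range(units):
            result = merge(result, self.slots[(self.now - back) % len(self.slots)])
        return result


w = WindowedMjrty(window_size=3)
for i, unit in enumerate(["dda", "bb", "ddd", "c"]):
    if i > 0:
        w.tick()          # the fourth unit reuses the counter of the first
    for item in unit:
        w.add(item)
```

Here `w.query(2)` merges the counters of the two most recent time units, which is how a query over part of the window could be answered.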
8
Apply MJRTY to a data stream (improved)
  • Required space: proportional to log(window size)

Time offsets from now: 0, -1, -2, -4, -6, -10, -18, -34, -66
Use the recycled counter element
Going to be recycled
Three counters are responsible for time units of length 1 → merge
Three counters are responsible for time units of length 2 → merge
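
The merge labels suggest that older time units are summarized by counters covering longer and longer spans. The Python sketch below is one possible reading, not the presenter's implementation: whenever three counters cover spans of the same length, the two oldest are merged into one counter of double length. The class `LogWindowMjrty`, the bucket layout, the expiry rule, and the tie rule in `merge` are all assumptions.

```python
def merge(a, b):
    """Merge two (element_name, value) counters; a tie yields an empty counter."""
    (name_a, value_a), (name_b, value_b) = a, b
    if name_a == name_b:
        return (name_a, value_a + value_b)
    if value_a == value_b:
        return (None, 0)  # assumption: the slides do not define the tie case
    if value_a > value_b:
        return (name_a, value_a - value_b)
    return (name_b, value_b - value_a)


class LogWindowMjrty:
    """Counters over time spans of growing length, newest first."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.buckets = [[1, (None, 0)]]  # each entry: [span length, counter]

    def add(self, item):
        self.buckets[0][1] = merge(self.buckets[0][1], (item, 1))

    def tick(self):
        b = self.buckets
        b.insert(0, [1, (None, 0)])  # new counter for the new time unit
        i = 0
        # Three counters with the same span length: merge the two oldest.
        while i + 2 < len(b) and b[i][0] == b[i + 1][0] == b[i + 2][0]:
            oldest = b.pop(i + 2)
            b[i + 1] = [b[i + 1][0] + oldest[0], merge(b[i + 1][1], oldest[1])]
            i += 1
        # Recycle the oldest counter once the newer ones already cover the window.
        while sum(length for length, _ in b[:-1]) >= self.window_size:
            b.pop()

    def query(self):
        """Merge all counters into one (element_name, value) for the window."""
        result = (None, 0)
        for _, counter in self.buckets:
            result = merge(result, counter)
        return result


w = LogWindowMjrty(window_size=8)
for i, item in enumerate("ddcdbdda"):  # one item per time unit
    if i > 0:
        w.tick()
    w.add(item)
lengths = [length for length, _ in w.buckets]
```

Because at most two counters of each span length survive a tick, the number of counters in this sketch grows roughly with the logarithm of the window size.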
9
Extend MJRTY to HI-FRQCY (high-frequency)
  • Frequent item: frequency > 1/(n+1)
  • Use n counters to get frequent items
  • Ex: when n = 2
  • 1 1 2 3 3

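Counter states as (element_name, Counter value); F marks an empty counter
Counter 1: (F, 0) → (1, 1) → (1, 2) → (1, 1)
Counter 2: (F, 0) → (2, 1) → (F, 0) → (3, 1)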
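
The n-counter extension can be sketched in Python as shown below. The function name `hi_frqcy` and the dictionary representation (an empty counter is simply absent) are choices made for this sketch. As with MJRTY, the surviving counters are candidates only; items that are not actually frequent can remain in them.

```python
def hi_frqcy(stream, n):
    """Track candidates for items with frequency > 1/(n+1) using n counters."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1             # item already has a counter
        elif len(counters) < n:
            counters[item] = 1              # an empty counter adopts the item
        else:
            # No matching counter and none empty: decrement every counter.
            for name in list(counters):
                counters[name] -= 1
                if counters[name] == 0:
                    del counters[name]      # this counter becomes empty
    return counters


result = hi_frqcy("11233", n=2)  # the example sequence from the slide
```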
10
Why does HI_FRQCY work?
  • At most n items can have a frequency larger than
    1/(n+1), so n counters are enough to record all
    frequent items
  • If frequent items exist in group G, deleting n+1
    different items from G will not affect the
    status of the frequent items
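
A decrement round in the counter scheme removes one occurrence each of n+1 different items: the arriving item and the n items held in the counters. The inequality below sketches why a frequent item survives such a round. The symbols c (occurrences of the frequent item) and N (size of G) are introduced only for this sketch.

```latex
c > \frac{N}{n+1}
\quad\Longrightarrow\quad
c - 1 > \frac{N}{n+1} - 1 = \frac{N - (n+1)}{n+1}
```

After the round, G holds N - (n+1) items and the frequent item occurs at least c - 1 times, so its frequency is still above 1/(n+1).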

11
Apply HI_FRQCY to distributed and continuous
environments
  • Merge two nodes
  • If any counters in the two nodes record the same
    item, merge them
  • Sort the counters
  • Keep the n largest counters as the result
  • Can be applied to distributed systems
  • Can be applied to continuous query environment
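
The merge steps above can be sketched in Python with `collections.Counter`. The function name `merge_hi_frqcy` and the dictionary representation of a node are assumptions. The sketch follows the listed steps literally and makes no claim about the accuracy of the merged counts.

```python
from collections import Counter


def merge_hi_frqcy(node_a, node_b, n):
    """Merge two nodes, each a dict mapping element_name to counter value."""
    merged = Counter(node_a)
    merged.update(node_b)                # counters for the same item are added
    return dict(merged.most_common(n))   # sort and keep the n largest counters


merged = merge_hi_frqcy({"a": 5, "b": 2}, {"b": 4, "c": 1}, n=2)
```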

12
Continuous query
  • Characteristics
  • Data is updated continuously
  • Queries tend to target recent data
  • A query may ask for a statistic over any period of
    time within the window
  • A diagram of a continuous query

[Diagram: elements along an arrival-time axis ending at now; window size (7)]