Title: Continuous Data Stream Processing
1Continuous Data Stream Processing
Post-Excellence Project Subproject 6
Date: 2006/03/07
2Music Virtual Channel
[Figure: architecture diagram with the components clustering engine, filtering engine, interface, profile monitor, channel monitor, favorite channel, music metadata, and music collections; V.C. players 1, 2, ..., N connect over the Internet.]
3Research Directions
Sequence Query Matching
Temporal Query Processing
Episode Query Matching
Range Search
Filtering
Spatial Query Processing
KNN Search
Aggregate Query Processing
Streaming Data Management
Top-K Search
Frequent Tree Pattern Mining
Closed Tree Pattern Mining
Mining
Frequent Itemset Mining (sliding window)
Frequent Itemset Mining (landmark model)
4Hash-based synopsis with memory consideration for
mining frequent itemsets over data streams
5Landmark model
6Lossy Counting
Step 1: Divide the stream into buckets
bucket size = 1/e, with e = 10% of support s
7Lossy Counting in Action
Empty
8Lossy Counting continued ...
Output: elements with counter values exceeding sN - eN
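The bucket and output rules above can be sketched for single items as follows. This is a minimal sketch, not the presenters' code: the function name `lossy_counting` and the per-item pair `[count, max_undercount]` are illustrative assumptions, and the output test uses "at least (s - e)N" as the threshold.

```python
import math


def lossy_counting(stream, s, e):
    """Approximate frequent items with support s and error e.

    Returns (items whose counter is at least (s - e) * N, N).
    """
    width = math.ceil(1 / e)            # bucket size = 1/e
    counts = {}                         # item -> [count, max_undercount]
    n = 0
    for item in stream:
        n += 1
        bucket = math.ceil(n / width)   # id of the current bucket
        if item in counts:
            counts[item][0] += 1
        else:
            # A new item may have been missed in up to bucket - 1 earlier buckets.
            counts[item] = [1, bucket - 1]
        if n % width == 0:
            # Bucket boundary: drop items whose count plus slack is too small.
            stale = [k for k, (f, d) in counts.items() if f + d <= bucket]
            for k in stale:
                del counts[k]
    frequent = {k: f for k, (f, d) in counts.items() if f >= (s - e) * n}
    return frequent, n
```

With e = 0.1 the buckets hold 10 elements each, so items that appear only once are discarded at the next bucket boundary while heavy items keep their counters.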
9Drawbacks of Lossy Counting
Applied to mining frequent itemsets, the space may increase exponentially.
[Figure: axis from 0 to 1 with e and s marked, labeled Lossy-Counting.]
10hCount
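[Figure: hash table with h rows (one per hash function) and m columns; item 9 is hashed to one counter in each row.]
h1(9) mod m: 1 0 1 1 1 2
h2(9) mod m: 1 0 0 1 2 2
h3(9) mod m: 1 1 0 1 1 2
h4(9) mod m: 1 0 1 1 1 2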
For each item, hash the item into buckets, choose the minimum count, and return the item if its minimum count is at least sN.
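The update and query rule above can be sketched as a small counter table. This is a hedged illustration only: the class name `HCountSketch`, the linear hash family `(a * x + b) mod p mod m`, the fixed seed, and the restriction to integer items are all assumptions made for the example, not details taken from the slides.

```python
import random


class HCountSketch:
    """h rows of m counters; each row uses its own hash function."""

    def __init__(self, h, m, seed=0):
        rng = random.Random(seed)       # fixed seed keeps the example repeatable
        self.m = m
        self.p = 2_147_483_647          # prime 2^31 - 1 for the assumed hash family
        self.params = [(rng.randrange(1, self.p), rng.randrange(self.p))
                       for _ in range(h)]
        self.table = [[0] * m for _ in range(h)]
        self.n = 0                      # number of items seen so far (N)

    def _columns(self, item):
        # One column index per row: h_i(item) mod m.
        return [((a * item + b) % self.p) % self.m for a, b in self.params]

    def add(self, item):
        self.n += 1
        for row, col in enumerate(self._columns(item)):
            self.table[row][col] += 1

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the best estimate.
        return min(self.table[row][col]
                   for row, col in enumerate(self._columns(item)))

    def frequent(self, candidates, s):
        # Return the candidates whose minimum count is at least s * N.
        return [x for x in candidates if self.estimate(x) >= s * self.n]
```

Because every occurrence of an item increments all of its counters, the estimate never falls below the true count; it can only overestimate when other items collide in every row.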
11hash-based
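- Transaction {1, 2, 3}
- Subsets of {1, 2, 3}: {1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}, {1, 2, 3}
[Figure: the subsets are hashed into buckets; the labeled fields are Itemset, True_Count, Surplus_Estimate, Total_Access, and Nlast_access, with some values shown as N, 1, or ?.]
How to compute the Surplus_Estimate?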
12Compute the Surplus_Estimate for an Itemset
- Two variables
- n: number of different itemsets in the bucket but not in the list
- c: sensible counts to be divided between itemsets which are not in the list
- If c = 3..5 and n = 3 => Surplus_Estimate = 3, from (3, 1, 1)
- Surplus_Estimate--, until Surplus_Estimate / Nlast_access < minSup
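One reading of the two rules above, offered as an assumption rather than the presenters' definition: the estimate is the largest share a single itemset can receive when c counts are divided among n itemsets that each get at least one, and it is then decremented until its ratio to Nlast_access drops below minSup. The function names `surplus_estimate` and `cap_surplus` are hypothetical.

```python
def surplus_estimate(c, n):
    """Largest share for one itemset when c counts are split among n
    itemsets and each itemset receives at least one count (assumed rule)."""
    if n <= 0 or c < n:
        raise ValueError("need n >= 1 and c >= n")
    return c - (n - 1)


def cap_surplus(estimate, n_last_access, min_sup):
    """Decrement the estimate until estimate / n_last_access < min_sup."""
    while estimate > 0 and estimate / n_last_access >= min_sup:
        estimate -= 1
    return estimate
```

Under this reading, c = 5 and n = 3 give the split (3, 1, 1) and an estimate of 3, and c = 2 and n = 2 give (1, 1) and an estimate of 1.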
13Determine c and n
- {2, 3, 5}, N = 4, minSup = 0.4; {2} is hashed into the bucket
- Boundary of c: 4 - (2 + SE); c = 4 - 2
- Boundary of n
- c = 2, n = 2 => (1, 1) => Surplus_Estimate = 1
[Figure: bucket state with the fields Itemset, Total_Access, Surplus_Estimate, Nlast_access, and True_Count.]
14Monitoring Constrained k-Nearest Neighbor over
Moving Objects with Different Values
15Motivation (Cont.)
- Example
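- Consider a user who wants to find the k places to buy new shoes where the costs are the lowest.
- Cost = Price() + Traffic Cost()
2-NN Query, with Traffic Cost() = 100 × distance:
400 + 100 × 1 = 500
200 + 100 × 2 = 400
100 + 100 × 3 = 400
90 + 100 × 5 = 590
[Figure: query point and four shops priced 400, 200, 100, and 90 at distances 1, 2, 3, and 5.]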
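The 2-NN example can be checked with a brute-force ranking by cost. This is a minimal sketch: the function name `k_lowest_cost`, the tuple layout `(name, price, distance)`, the shop names, and the linear traffic cost `rate * distance` are assumptions chosen to match the arithmetic on the slide.

```python
import heapq


def k_lowest_cost(places, k, rate):
    """Return the k places with the lowest price + rate * distance.

    places is a list of (name, price, distance) tuples.
    """
    return heapq.nsmallest(k, places, key=lambda p: p[1] + rate * p[2])


shops = [("shop1", 400, 1), ("shop2", 200, 2), ("shop3", 100, 3), ("shop4", 90, 5)]
answer = k_lowest_cost(shops, 2, 100)
```

Here the nearest shop (cost 500) and the cheapest shop (cost 590) both lose to the two shops whose combined cost is 400, which is why plain distance-based k-NN does not answer this query.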
16Motivation
- Objects with different values in a spatial database.
- Find the k places to buy something where the costs are the lowest.
- Cost = Price() + Traffic Cost()
- A taxi driver wants to find the k places to gain the most profit.
- Profit = Gain() - Traffic Cost()
- A taxi driver wants to find the k places to gain the most profit.
- Profit = Gain() / Time
- Virtual Channel
- age profile distance
- listen hours / profile distance
- Market Survey
- consumption (or income, age) / profile distance
17Challenges
- Efficiency
- Search space reduction
- Query processing enhancement
- Effectiveness
- Previous result reuse
18Framework
- Step 1
- Find k candidates to restrict the search region.
- Step 2
- Run Pruning Ring on the remaining candidates to determine the actual answer.
- Handling updates
- Incrementally update positions or values for objects and queries
- Computation is necessary only for affected queries
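The slides do not define Pruning Ring, so the sketch below is not that method. It only illustrates the two-step idea with a generic bound, assuming cost = price + rate * distance and a known minimum price: the k nearest objects serve as candidates, their worst cost bounds the search radius, and only objects inside that radius are ranked. All names here are hypothetical.

```python
import heapq


def cost(obj, rate):
    price, dist = obj
    return price + rate * dist


def knn_with_bound(objects, k, rate):
    """objects is a list of (price, distance) pairs; returns (answer, radius)."""
    by_dist = sorted(objects, key=lambda o: o[1])
    # Step 1: the k nearest objects are the initial candidates.
    candidates = by_dist[:k]
    bound = max(cost(o, rate) for o in candidates)
    # Any object farther than this radius costs more than every candidate,
    # because its cost is at least min_price + rate * distance.
    min_price = min(o[0] for o in objects)
    radius = (bound - min_price) / rate
    # Step 2: rank only the objects inside the restricted region.
    survivors = [o for o in by_dist if o[1] <= radius]
    return heapq.nsmallest(k, survivors, key=lambda o: cost(o, rate)), radius
```

In a monitoring setting the minimum price would be maintained as a running statistic rather than recomputed per query; the scan here keeps the example short.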
19Querying Episodes over Event Stream
20Motivation
- Knowledge Discovery from Telecommunication Network Alarm Databases (ICDE96)
- If an alarm of type A occurs, then an alarm of type B occurs within 30 seconds with probability 0.8
- If alarms of types A and B occur within 5 seconds, then an alarm of type C occurs within 60 seconds with probability 0.7
- If an alarm of type A precedes an alarm of type B, and C precedes D, all within 15 seconds, then E will follow within 4 minutes with probability 0.6
[Figure: example event stream containing alarms of types A, B, C, and D.]
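The first rule above ("A, then B within 30 seconds") can be checked against a timestamped stream as follows. This is a minimal offline sketch under stated assumptions: events are `(timestamp_in_seconds, type)` pairs, the function name `rule_confidence` is hypothetical, and the returned fraction is only an empirical stand-in for the probability quoted on the slide.

```python
from bisect import bisect_right


def rule_confidence(events, antecedent, consequent, window):
    """Fraction of antecedent events followed by a consequent event
    strictly later and within `window` seconds."""
    a_times = [t for t, kind in events if kind == antecedent]
    c_times = sorted(t for t, kind in events if kind == consequent)
    if not a_times:
        return 0.0
    hits = 0
    for t in a_times:
        i = bisect_right(c_times, t)    # first consequent strictly after t
        if i < len(c_times) and c_times[i] - t <= window:
            hits += 1
    return hits / len(a_times)


alarms = [(0, "A"), (10, "B"), (50, "A"), (100, "A"), (120, "B"), (125, "C")]
confidence = rule_confidence(alarms, "A", "B", 30)
```

In this sample stream, the alarms of type A at 0 and 100 seconds are followed by B in time, while the one at 50 seconds is not.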
21Challenges
- Efficiency
- Index impaction
- Partial result sharing
- Load shedding
22Framework
Q1
Q2
Q3
- Q3 is composed of p5 and p4
23
A: E. I. 6, EQueue, TLink
B: E. I. 5, EQueue, TLink
C: E. I. 2, EQueue, TLink
D: E. I. 4, EQueue, TLink