Title: Evaluating Window Joins over Punctuated Streams
1Evaluating Window Joins over Punctuated Streams
- Many slides taken from talk by
- Luping Ding and Elke A. Rundensteiner, CIKM04
- Database Systems Research Group
- Worcester Polytechnic Institute
2Stream Data Processing
- Online Transaction Management
- Sensor Network Monitoring
[Figure: continuous queries are registered with a stream query engine, which takes streaming data as input and produces a streaming result.]
3New Challenges in Stream Context
- Potentially infinite data streams vs. stateful
operators, e.g., join, distinct
- Problem: potentially unbounded state
- Reason: no hint on which data is no longer useful
4Example: Symmetric Hash Join WA93
- Memory overflow resolution: state relocation
- Examples: XJoin UF00, Hash-Merge Join MLA04
- Problem: join state still grows with no bound
- Problem: delivery of some join results may be
highly deferred
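[Figure: symmetric hash join. Tuples from streams A and B are inserted into in-memory states SA and SB and probe the opposite state; the growing states lead to memory overflow.]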
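To make the unbounded-state problem concrete, here is a minimal sketch of a symmetric hash join on a single key. The class name, the `process` and `size` methods, and the result layout are assumptions made for illustration; they are not taken from WA93.

```python
from collections import defaultdict


class SymmetricHashJoin:
    """Illustrative symmetric hash join on one key; state is never purged."""

    def __init__(self):
        # One hash table per input stream: join key -> list of stored payloads.
        self.state = {"A": defaultdict(list), "B": defaultdict(list)}

    def process(self, side, key, payload):
        """Insert the tuple into its own state, then probe the opposite state."""
        other = "B" if side == "A" else "A"
        self.state[side][key].append(payload)
        matches = self.state[other].get(key, [])
        # Results are always laid out as (key, A payload, B payload).
        if side == "A":
            return [(key, payload, m) for m in matches]
        return [(key, m, payload) for m in matches]

    def size(self):
        """Number of tuples currently held in both states."""
        return sum(len(v) for table in self.state.values() for v in table.values())


join = SymmetricHashJoin()
print(join.process("A", 175, "a1"))  # nothing stored for B yet
print(join.process("B", 175, "b1"))  # joins with the stored A tuple
print(join.size())                   # every tuple seen so far is still in state
```

Because nothing is ever removed, `size()` grows with every arriving tuple, which is the problem the constraints on the following slides address.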
5Avoiding Unbounded State
- Solution: exploit constraints to detect
no-longer-useful data
- Sliding window MWA03: identifies a bounded set of
input data based on time
- K-constraint BW03: models clustered or ordered
data arrival patterns
- Punctuation TMSF03: dynamically announces the
termination of certain values
6Sliding Window KNV03
[Figure: sliding windows Wa and Wb over Stream A and Stream B along a timeline.]
7Punctuation
- Meta-knowledge embedded inside data streams
- An ordered set of patterns corresponding to
attributes of tuples
- Patterns: wildcard (*), constant (9), list
(1,2,3), range (1, 20), empty (∅)
- Semantics: tuples arriving after a punctuation p
will NOT match p
[Figure: Bid stream carrying a punctuation on Item_id]
180  Marlie     820.00   Nov-13-03 11:02:00
182  Ultrasale  1000.00  Nov-13-03 11:05:00
180  Jocelyn    850.00   Nov-13-03 11:14:00
180  (punctuation: no more tuples will contain Item_id 180)
181  pcfan      50.00    Nov-13-03 11:36:00
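The pattern kinds listed on this slide can be sketched as small matcher classes. The class names, the `matches` helper and the inclusive range bounds are assumptions made for illustration; the slide only names the pattern kinds and the matching semantics.

```python
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass(frozen=True)
class Wildcard:
    def matches(self, value: Any) -> bool:
        return True


@dataclass(frozen=True)
class Constant:
    value: Any

    def matches(self, value: Any) -> bool:
        return value == self.value


@dataclass(frozen=True)
class ListPattern:
    values: Tuple[Any, ...]

    def matches(self, value: Any) -> bool:
        return value in self.values


@dataclass(frozen=True)
class Range:
    low: Any
    high: Any

    def matches(self, value: Any) -> bool:
        # Assumption: both bounds are inclusive.
        return self.low <= value <= self.high


@dataclass(frozen=True)
class Empty:
    def matches(self, value: Any) -> bool:
        return False


def matches(punctuation, tup) -> bool:
    """A tuple matches a punctuation if every attribute matches its pattern."""
    return len(punctuation) == len(tup) and all(
        p.matches(v) for p, v in zip(punctuation, tup)
    )


# Punctuation "no more tuples will contain Item_id 180" over a 4-attribute Bid tuple.
p = (Constant(180), Wildcard(), Wildcard(), Wildcard())
print(matches(p, (180, "Jocelyn", 850.00, "Nov-13-03 11:14:00")))
print(matches(p, (181, "pcfan", 50.00, "Nov-13-03 11:36:00")))
```

Under the punctuation semantics, any tuple for which `matches` returns true must have arrived before the punctuation.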
8Punctuation-Aware Join DMR04
[Figure: punctuation-aware join on item_id. Stream A and Stream B feed join states SA and SB, which hold several tuples with value 175 alongside tuples with other values; a punctuation announces "No more tuple will have A = 175".]
9Features of Punctuation
- Purge rule. For any tuple ta from stream A, if
there exists a punctuation Pb that has already
been received from stream B such that
match(ta, Pb), ta will not join with any future
tuples arriving from stream B. ta doesn't need to
be maintained in the A state after being
processed.
- Propagation rule. The join operator can also
propagate punctuations to the output stream in
order to help downstream operators.
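A minimal sketch of the purge rule, under the simplifying assumption that every punctuation is a constant on the join attribute. `JoinSide`, `on_tuple` and `on_punctuation` are hypothetical names introduced here for illustration.

```python
class JoinSide:
    """State kept for one input stream of the join."""

    def __init__(self):
        self.tuples = {}         # join value -> payloads kept in this state
        self.punctuated = set()  # join values this stream has punctuated


def on_tuple(own, other, value, payload):
    """Probe the opposite state; keep the tuple only if it can still join later."""
    results = [(payload, stored) for stored in other.tuples.get(value, [])]
    # Purge rule: if the opposite stream already punctuated this value,
    # the new tuple cannot join with future tuples and is not stored.
    if value not in other.punctuated:
        own.tuples.setdefault(value, []).append(payload)
    return results


def on_punctuation(own, other, value):
    """Record the punctuation and purge matching tuples from the opposite state."""
    own.punctuated.add(value)
    return other.tuples.pop(value, [])


a, b = JoinSide(), JoinSide()
on_tuple(a, b, 175, "a1")           # stored in the A state
purged = on_punctuation(b, a, 175)  # B announces: no more tuples with 175
late = on_tuple(a, b, 175, "a2")    # processed, but not stored
print(purged, late, a.tuples)
```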
10
- Based on punctuation semantics, we derive the
following theorem as the foundation of our
punctuation propagation algorithm.
- Theorem 3.1. Let pa and pb be punctuations
retrieved from streams A and B at times TSa and
TSb respectively, specifying the same punctuated
value val of join attribute att. Then no output
tuples with val as the value of attribute att
will be generated after time max(TSa, TSb).
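Theorem 3.1 suggests a simple bookkeeping scheme: remember when each stream punctuated a value, and emit an output punctuation once both have. The sketch below assumes constant punctuations on the join attribute; `PropagationTracker` and its method are hypothetical names, not the authors' algorithm.

```python
class PropagationTracker:
    """Tracks per-value punctuation times for streams A and B."""

    def __init__(self):
        self.seen = {"A": {}, "B": {}}  # stream -> {value: first punctuation time}
        self.done = set()               # values already propagated

    def on_punctuation(self, stream, value, ts):
        """Return (value, time) when the value can be propagated, else None."""
        self.seen[stream].setdefault(value, ts)
        if value in self.done:
            return None
        if value in self.seen["A"] and value in self.seen["B"]:
            self.done.add(value)
            # No output with this value is generated after max(TSa, TSb).
            return (value, max(self.seen["A"][value], self.seen["B"][value]))
        return None


tracker = PropagationTracker()
print(tracker.on_punctuation("A", 180, 5))  # only one side punctuated so far
print(tracker.on_punctuation("B", 180, 9))  # both sides punctuated
```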
11Sliding Window Join
- Suppose Ta and Tb are time windows for streams A
and B respectively. We define the invalidation
rule for the join state based on the sliding
window.
- Let tuple ta be the latest tuple with timestamp
TSa from stream A that has been processed. A
tuple in the B state with timestamp TSb such that
TSb + Tb < TSa is called a time-expired tuple and
can be invalidated. The same invalidation rule
applies to tuples in the A state.
12Basic Window join
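[Figure: basic window join on a timeline. A tuple arriving on Stream A at TSa looks back over window Tb to TSa-Tb; a tuple arriving on Stream B at TSb looks back over window Ta to TSb-Ta.]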
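The invalidation rule (a B-state tuple expires once TSb + Tb < TSa) can be sketched with a time-ordered queue. The `invalidate` function and the `(timestamp, payload)` layout are assumptions made for illustration.

```python
from collections import deque


def invalidate(state, ts_new, window):
    """Remove tuples from the head of a time-ordered state while
    timestamp + window < ts_new, and return the removed tuples."""
    expired = []
    while state and state[0][0] + window < ts_new:
        expired.append(state.popleft())
    return expired


# B state as (timestamp, payload) pairs, oldest first; Tb = 5, new A tuple at TSa = 12.
b_state = deque([(1, "b1"), (4, "b2"), (9, "b3")])
expired = invalidate(b_state, ts_new=12, window=5)
print(expired)        # tuples with TSb + 5 < 12
print(list(b_state))  # tuples still inside the window
```

Because arrivals are assumed to be in timestamp order, only the head of the queue needs to be inspected.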
13Optimization Opportunities
- Maintain smaller state than either pure window
join or pure punctuation-exploiting join
- Bid tuples that have been joined don't need to be
maintained in state (punctuation)
- Drop tuples without affecting precision of result
- Bid tuples outside the 24-hour window of the
corresponding Auction tuple don't need to be
processed
- Aggregate result for some Auction tuples can be
produced in less than 24 hours
14Features of PWJoin algorithm
- Punctuation-exploiting Window Join (PWJoin) is
composed of three operations:
- Probing state to find matching tuples for
producing join results.
- Purging no-longer-joining tuples by punctuations.
- Invalidating expired tuples by windows.
15Window and Punctuation Occur Simultaneously
SELECT A.item_id, Count(*)
FROM Auction [Range 24 Hours] A, Bid B
WHERE A.item_id = B.item_id
GROUP BY A.item_id
[Figure: query plan. Auction Stream and Bid Stream -> Join (item_id) -> Out1 (item_id) -> Group-by item_id (count(*)) -> Out2 (item_id, count). Annotations: "Contains punctuations on item_id"; "Applies a 24-hour window on Auction stream".]
16PWJoin Basics and Issue
- On receiving a new tuple ta from stream A:
invalidate tuples from B state, probe B state,
insert ta into A state
- On receiving a new punctuation pa from stream A:
purge tuples from B state, insert pa into A state
- Issue: how to design PWJoin state to facilitate
all search-based operations?
- Invalidate conducts time-based search
- Probe and Purge need value-based search
17PWJoin State with Two-dimensional Index
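[Figure: PWJoin state with a two-dimensional index.
- Time List: T-Nodes in time order, marked with Window Begin and Window End.
- I-Node Index (Hash Table): I-Nodes with fields Key, Head, Tail, PunctFlag (none or punctuated).
- T-Node: fields tuple, NextValueListTNode, NextTimeListTNode.
- Punctuation Time List: punctuations with their timestamps (p1 T1, p2 T2).]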
18PWJoin Algorithm
- Invalidate: once a new tuple t is retrieved from
stream A, its timestamp is used to invalidate
expired tuples from the head of the time list of
stream B.
- Probe: probe the I-Node index and join with tuples
in the value list of the matching I-Node.
- After invalidation is done, the join value of t
is used to probe the I-Node index of the B state.
If the matching I-Node iNode is found, the
corresponding value list is located by following
the Head pointer of iNode. Tuple t then joins
with all tuples in this value list by following
the NextValueListTNode pointer of each T-Node.
- Finally, the PunctFlag of iNode is checked. If it
is punctuated, t is discarded. If it is none,
t is inserted into the A state.
19PWJoin Algorithm
- Purge: probe the I-Node index and delete tuples in
the value list of the matching I-Node.
- When a new punctuation p is retrieved from stream
A, p is used to probe the I-Node index of the B
state. If the matching I-Node iNode is found, all
tuples in the corresponding value list are
deleted. iNode is removed from the I-Node index
as well. If the PunctFlag of iNode is
punctuated, p is discarded. If iNode is not
found or iNode's PunctFlag is none, p is used
to probe the I-Node index of the A state and set
the PunctFlag of the matching I-Node iNodea as
punctuated.
- If iNodea does not exist, a new I-Node is created
with its PunctFlag marked as punctuated and
inserted into the I-Node index of the A state.
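The invalidate, probe and purge steps described on these two slides can be sketched as follows. This is a simplified stand-in, not the authors' implementation: a `deque` replaces the linked time list, a `dict` replaces the I-Node hash table, purged entries are lazily marked dead instead of being unlinked, and a tuple whose I-Node is not found is assumed to be inserted. All names are hypothetical.

```python
from collections import deque


class PWJoinState:
    """Time list plus value index for one input stream."""

    def __init__(self, window):
        self.window = window
        self.time_list = deque()  # [ts, key, payload, alive] entries, oldest first
        self.index = {}           # key -> {"tuples": deque of entries, "flag": str}

    def inode(self, key):
        return self.index.setdefault(key, {"tuples": deque(), "flag": "none"})

    def insert(self, ts, key, payload):
        entry = [ts, key, payload, True]
        self.time_list.append(entry)
        self.inode(key)["tuples"].append(entry)

    def invalidate(self, ts_new):
        """Time-based search: expire entries from the head of the time list."""
        while self.time_list and self.time_list[0][0] + self.window < ts_new:
            entry = self.time_list.popleft()
            if entry[3]:  # still alive, so it heads its own value list
                self.index[entry[1]]["tuples"].popleft()

    def purge(self, key):
        """Value-based search: delete the value list and remove the I-Node."""
        node = self.index.pop(key, None)
        if node is None:
            return None
        for entry in node["tuples"]:
            entry[3] = False  # dead entries are skipped when they expire
        return node["flag"]


def process_tuple(own, other, ts, key, payload):
    other.invalidate(ts)
    node = other.index.get(key)
    results = [(payload, e[2]) for e in node["tuples"]] if node else []
    if node is None or node["flag"] == "none":
        own.insert(ts, key, payload)
    return results  # if the flag is "punctuated", the tuple is discarded


def process_punctuation(own, other, key):
    if other.purge(key) == "punctuated":
        return  # the other stream already punctuated this value: discard p
    own.inode(key)["flag"] = "punctuated"


a, b = PWJoinState(window=5), PWJoinState(window=5)
process_tuple(b, a, 1, 175, "b1")
r1 = process_tuple(a, b, 2, 175, "a1")   # joins with b1
process_punctuation(a, b, 175)           # purges b1 from the B state
r2 = process_tuple(b, a, 3, 175, "b2")   # joins with a1, then is discarded
print(r1, r2, 175 in b.index)
```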
20Punctuation Propagation CIKM04
- An operator may propagate punctuations to benefit
downstream operators
[Figure: query plan. Auction Stream and Bid Stream (Item_id, Bidder_id, Bid_price) -> Join (item_id) -> Group-by item_id (count(*)). Annotations: "propagate punctuations on item_id" (example value 180); "be unblocked by punctuations propagated by join operator".]
21Optimizations Enabled by Combined Constraints
[Figure: two panels, "Early Punctuation Propagation" and "Tuple Dropping", each showing tuples a1 to a10 arriving on Stream S1 and Stream S2, with propagation point 1 and propagation point 2 marked.]
22Achieving Optimizations by Combined Constraints
- Early propagation: invalidate punctuations in the
punctuation time list in the same way as tuples;
expired punctuations can be propagated
- Tuple dropping: when early propagation happens,
set the PunctFlag of the matching I-Node to
propagated; drop new tuples that match an I-Node
whose PunctFlag is propagated
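One possible reading of these two optimizations, as a sketch: punctuations sit in a time-ordered list, expire under the same window test as tuples, and are propagated when they expire; the flag then lets later matching tuples from the opposite stream be dropped. `PunctuationList` and its methods are hypothetical names, and the expiry test is assumed to mirror the tuple invalidation rule.

```python
from collections import deque


class PunctuationList:
    """Punctuation time list and per-value flags for one input stream."""

    def __init__(self, window):
        self.window = window
        self.time_list = deque()  # (timestamp, join value), oldest first
        self.flags = {}           # join value -> "punctuated" | "propagated"

    def add(self, ts, value):
        self.time_list.append((ts, value))
        self.flags[value] = "punctuated"

    def invalidate(self, ts_new):
        """Expire punctuations like tuples; return the values propagated early."""
        propagated = []
        while self.time_list and self.time_list[0][0] + self.window < ts_new:
            _, value = self.time_list.popleft()
            self.flags[value] = "propagated"
            propagated.append(value)
        return propagated

    def should_drop(self, value):
        """A new tuple from the opposite stream can be dropped once the
        punctuation on its join value has been propagated."""
        return self.flags.get(value) == "propagated"


puncts = PunctuationList(window=5)
puncts.add(3, "a3")
print(puncts.invalidate(6), puncts.should_drop("a3"))   # punctuation still in window
print(puncts.invalidate(10), puncts.should_drop("a3"))  # expired, so propagated
```

The reasoning behind the drop: every tuple carrying the punctuated value arrived no later than the punctuation, so once the punctuation has expired those tuples have expired too and nothing remains to join with.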
23Memory Cost Analysis
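- $S_b^T = S_{b,insert}^T - S_{b,purge}^T = S_{b,arrive}^T - S_{b,purge}^T = \lambda_b T_b - \lambda_b T_b \, (\lambda_{pa} T / NK_{b,T})$
- $\lambda_b T_b$: state of the window join; $\lambda_b T_b \, (\lambda_{pa} T / NK_{b,T})$: saving by punctuation
- $\lambda_b$: tuple input rate of stream B
- $\lambda_{pa}$: punctuation input rate of stream A
- $NK_{b,T}$: number of distinct join values that have occurred in stream B up to the Tth time unit
- $T_b$: time window on stream B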
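As a worked example, the cost expression can be read as the window-join state (tuple rate of B times window Tb) minus the share of that state purged by punctuations from A (punctuation rate of A times T, divided by the number of distinct join values in B). The function below evaluates it; the function and parameter names, the sample numbers, and the cap of the purged share at 1 are assumptions made for illustration.

```python
def pwjoin_state_size(rate_b, window_b, punct_rate_a, t, distinct_b):
    """Estimated size of the B state at time unit t."""
    window_only = rate_b * window_b  # state of a pure window join
    # Assumption: the purged share cannot exceed the whole state.
    purged_share = min(1.0, punct_rate_a * t / distinct_b)
    return window_only - window_only * purged_share


# 100 tuples per time unit, window of 10 units, 2 punctuations per time unit,
# 20 time units elapsed, 100 distinct join values seen in stream B.
size = pwjoin_state_size(rate_b=100, window_b=10, punct_rate_a=2, t=20, distinct_b=100)
print(size)  # window join alone would hold 100 * 10 = 1000 tuples
```

With these numbers 40 of the 100 distinct values have been punctuated, so 40 percent of the window state is saved.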
24PWJoin vs. WJoin Memory and Tuple Output Rate
Stream A, B: punct-asc-100-40
25PWJoin vs. PJoin Punctuation Output Rate
Stream A: punct-asc-100-40, Stream B:
punct-random-30-40, Window: 1 second
26Conclusion
- PWJoin algorithm
- Designed storage structure for PWJoin state
- Memory cost analysis of PWJoin
27Thanks
- WPI Database Research Group
- Many slides are from davis.wpi.edu/dsrg/CAPE/slides
28References
- CIKM04: L. Ding and E. A. Rundensteiner. Evaluating Window Joins over Punctuated Streams. CIKM04.
- KNV03: J. Kang, J. F. Naughton and S. D. Viglas. Evaluating Window Joins over Unbounded Streams. ICDE03.
- UF00: T. Urhan and M. Franklin. XJoin: A Reactively Scheduled Pipelined Join Operator. IEEE Data Engineering Bulletin, 23(2), 2000.
- HH99: P. Haas and J. Hellerstein. Ripple Joins for Online Aggregation. SIGMOD99.
- GO03: L. Golab and M. T. Ozsu. Processing Sliding Window Multi-Joins in Continuous Queries over Data Streams. VLDB03.
- GGO04: L. Golab, S. Garg and M. T. Ozsu. On Indexing Sliding Windows over On-line Data Streams. EDBT04.
- RDS04: E. A. Rundensteiner, L. Ding, T. Sutherland, Y. Zhu, B. Pielech and N. Mehta. CAPE: Continuous Query Engine with Heterogeneous-Grained Adaptivity. VLDB Demo, 2004.
- BW04: S. Babu and J. Widom. Exploiting k-Constraints to Reduce Memory Overhead in Continuous Queries over Data Streams.
- TMS03: P. A. Tucker, D. Maier, T. Sheard and L. Fegaras. Exploiting Punctuation Semantics in Continuous Data Streams. TKDE, 15(3), 2003.
- DMR04: L. Ding, N. Mehta, E. A. Rundensteiner and G. T. Heineman. Joining Punctuated Streams. EDBT04.
- MWA03: R. Motwani, J. Widom, A. Arasu et al. Query Processing, Resource Management, and Approximation in a Data Stream Management System. CIDR03.
29 Thanks!