Title: Evaluating Window Joins over Unbounded Streams
1Evaluating Window Joins over Unbounded Streams
- Jaewoo Kang, Jeffrey F. Naughton, Stratis D. Viglas
- {jaewoo, naughton, viglas}@cs.wisc.edu
- Univ. of Wisconsin-Madison
ICDE 2003, Bangalore, India
2Outline of the talk
- Introduction: Continuous Queries over Unbounded Streams
- Measuring the Cost of Sliding Window Joins
- On Maximizing the Efficiency of Processing Joins
- Summary
3Sliding Windows
- Handling internal state is a big challenge.
- Approximate answers
- Sliding windows toss out expired tuples
- Synopses resort to reduced answer precision
4A Simple Sliding Window Query
- On arrival of a new tuple to window A
- Scan window B and propagate matching tuples
- Insert new tuple into window A
- Invalidate all expired tuples in window A
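The three steps above can be sketched as a small nested-loops operator. This is only an illustration under stated assumptions, not the authors' implementation: the class name `SlidingWindowJoin`, the time-based windows, the equality join on a key, and the expiry check during the scan (so that tuples not yet purged from the opposite window are never matched) are all hypothetical choices.

```python
from collections import deque


class SlidingWindowJoin:
    """Hypothetical symmetric nested-loops join over two time-based windows."""

    def __init__(self, span_a, span_b):
        # Each window holds (timestamp, key, value) tuples in arrival order.
        self.window = {"A": deque(), "B": deque()}
        self.span = {"A": span_a, "B": span_b}

    def on_arrival(self, side, ts, key, value):
        other = "B" if side == "A" else "A"

        # Step 1: scan the opposite window and propagate matching tuples.
        # Assumption: tuples that expired but were not purged yet are skipped.
        results = []
        for o_ts, o_key, o_value in self.window[other]:
            if o_key == key and ts - o_ts < self.span[other]:
                pair = (value, o_value) if side == "A" else (o_value, value)
                results.append(pair)

        # Step 2: insert the new tuple into its own window.
        own = self.window[side]
        own.append((ts, key, value))

        # Step 3: invalidate all expired tuples in its own window.
        while own and ts - own[0][0] >= self.span[side]:
            own.popleft()

        return results


join = SlidingWindowJoin(span_a=10, span_b=10)
join.on_arrival("A", 0, "k", "a0")
print(join.on_arrival("B", 1, "k", "b1"))
```

Timestamps are assumed to be non-decreasing across both streams, which is what lets expiry be handled by popping from the front of each window.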
5Some interesting questions
- How should we measure the efficiency of a sliding window join evaluation strategy?
- Can a sliding window join algorithm take advantage of asymmetries in the two input stream speeds?
6Interesting questions (Contd)
- How should we allocate computing resources between the two windows to maximize join efficiency?
- If memory is the bottleneck, how should we allocate memory between the two windows for the two inputs?
7Interesting questions
- How should we measure the efficiency of a sliding window join evaluation strategy?
- Can a sliding window join algorithm take advantage of asymmetries in the two input stream speeds?
8Outline of the talk
- Introduction: Continuous Queries over Unbounded Streams
- Measuring the Cost of Sliding Window Joins
- On Maximizing the Efficiency of Processing Joins
- Summary
9Cost Model
- Unit-time basis cost model
- Aggregate cost of processing tuples arriving in
each window in a time unit
10Cost Model (Contd)
- The cost formula can be divided into two independent groups, one for each input stream
- Thus, we can evaluate join algorithms for each join direction independently
11Cost of One-way NLJ
- P(D): cost of accessing one tuple in data structure D during a search operation
- I(D): cost of accessing one tuple in data structure D during an update operation
- Cost = total number of tuples processed in a time unit, multiplied by the tuple access cost
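As a rough illustration of "tuples processed per time unit times access cost", the sketch below prices one join direction under NLJ. The formula is an assumed reading of the slide rather than a quotation of it: each probing tuple scans the whole window at cost P per tuple, and each tuple entering the window costs one insert plus, in steady state, one invalidation at cost I each. This reading is consistent with the TN-TH crossover values reported later in the talk. The function name and the choice of example rates are hypothetical; the parameter values are borrowed from other slides of this talk.

```python
def one_way_nlj_cost(probe_rate, update_rate, window_size, p, i):
    """Assumed unit-time cost of one join direction evaluated with NLJ."""
    # Every probing tuple scans the entire window.
    probe_cost = probe_rate * window_size * p
    # Every tuple entering the window is inserted once and invalidated once.
    update_cost = 2 * update_rate * i
    return probe_cost + update_cost


# P(N) and I(N) as fitted later in the talk; rates and window size as on the
# join performance slide (probing stream at 800, window of 3000 fed at 200).
cost = one_way_nlj_cost(probe_rate=800, update_rate=200,
                        window_size=3000, p=3e-4, i=1e-4)
print(cost)
```

With these numbers the probing term dominates by several orders of magnitude, which is why the faster stream's probing direction is the one worth indexing.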
12Cost of One-way HJ
- # of hash buckets in window B
- B / (# of hash buckets): # of tuples in a hash bucket
- Implement each hash bucket so that it preserves tuple arrival order, which avoids invalidation overhead.
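One way to read the last bullet: if every bucket keeps its tuples in arrival order, expired tuples always sit at the front of the bucket and can be dropped cheaply whenever that bucket is touched, with no separate pass over the window. The sketch below illustrates this idea only; the class `HashedWindow`, the lazy purge on insert and probe, and the time-based expiry are assumptions, not details given in the talk.

```python
from collections import deque


class HashedWindow:
    """Hypothetical hashed window whose buckets preserve arrival order."""

    def __init__(self, num_buckets, span):
        self.buckets = [deque() for _ in range(num_buckets)]
        self.span = span

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def _purge(self, bucket, now):
        # Arrival order means the oldest tuple is always at the front.
        while bucket and now - bucket[0][0] >= self.span:
            bucket.popleft()

    def insert(self, ts, key, value):
        bucket = self._bucket(key)
        self._purge(bucket, ts)
        bucket.append((ts, key, value))

    def probe(self, now, key):
        bucket = self._bucket(key)
        self._purge(bucket, now)
        return [value for _, k, value in bucket if k == key]


window = HashedWindow(num_buckets=4, span=10)
window.insert(0, 1, "a")
window.insert(4, 1, "c")
print(window.probe(11, 1))
```

The purge relies on timestamps arriving in non-decreasing order; a bucket that is never touched again keeps its expired tuples until the next access.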
13Cost of One-way T-tree INLJ
- N: size of a T-tree node (# of tuples)
- B/N: total # of nodes in a T-tree
14Implementation
- Implemented
- Four join algorithms: NLJ, HJ, BJ, and TJ.
- Asymmetric join operator
- Stream emulator
- System
- Java HotSpot VM 1.4
- AMD Athlon XP 1533 MHz, 1 GB memory
- Windows XP Professional
15Fitting Parameters in the Model
- Process 60 seconds' worth of tuples without intermittent delays, at 20 different points with increasing workload rates.
- Then equate the measured values with the cost formula and solve the equation.
- Hash bucket size 10 and T-tree node size 100 were used.
- P(N) = 3 x 10^-4, P(H) = 5.5 x 10^-4
- P(BT) = 2.6 x 10^-4, P(TT) = 2.6 x 10^-4
- I(N) = 1 x 10^-4, I(H) = 7.8 x 10^-4
- I(BT) = 2.6 x 10^-4, I(TT) = 2.7 x 10^-4
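The talk does not say how the equations were solved. As one possible approach, the sketch below fits P and I for NLJ by least squares, assuming a cost of the form rate_a * B * P + 2 * rate_b * I per time unit. The 20 workload points and the "measurements" are synthetic, generated from the P(N) and I(N) values above purely to show that the fit recovers them; the function name `fit_p_and_i` is hypothetical.

```python
def fit_p_and_i(xs, ys, costs):
    """Least-squares fit of cost = x * P + y * I via the 2x2 normal equations."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    syy = sum(y * y for y in ys)
    sxc = sum(x * c for x, c in zip(xs, costs))
    syc = sum(y * c for y, c in zip(ys, costs))
    det = sxx * syy - sxy * sxy
    p = (sxc * syy - sxy * syc) / det
    i = (sxx * syc - sxy * sxc) / det
    return p, i


WINDOW_B = 5000
# 20 synthetic workload points (rate_a, rate_b) with increasing rates.
rates = [(100 * n, 50 * n * n) for n in range(1, 21)]
xs = [rate_a * WINDOW_B for rate_a, _ in rates]   # tuples scanned while probing
ys = [2 * rate_b for _, rate_b in rates]          # inserts plus invalidations

# Synthetic, noise-free "measurements" built from the slide's P(N) and I(N).
TRUE_P, TRUE_I = 3e-4, 1e-4
costs = [x * TRUE_P + y * TRUE_I for x, y in zip(xs, ys)]

p_hat, i_hat = fit_p_and_i(xs, ys, costs)
print(p_hat, i_hat)
```

With real timings the two rates would need to vary independently enough for the system to be well conditioned, since the update term is tiny next to the probing term.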
16Outline of the talk
- Introduction: Continuous Queries over Unbounded Streams
- Measuring the Cost of Sliding Window Joins
- On Maximizing the Efficiency of Processing Joins
- Summary
17Interesting questions
- How should we measure the efficiency of a sliding window join evaluation strategy?
- Can a sliding window join algorithm take advantage of asymmetries in the two input stream speeds?
18Taking Advantage of Asymmetry
- There are cases where an asymmetric combination of join algorithms outperforms its symmetric counterparts!
- E.g. for some A, B
19Join Cost Estimation using Cost Model
- Size of window A: 5000
- Size of window B: 5000
- Five winning combinations: TN, TH, HH, HT, NT
25Measured Join Cost (CPU Time)
- A = 5000, B = 5000
- Memory utilization: HJ (h = 10) consumed 5% more than TJ (n = 100).
- Same five winners: TN, TH, HH, HT, NT
- Cost model prediction was accurate for both overall shape and crossover points.
- What if we increase window A and decrease window B?
- (e.g. A = 7000, B = 3000, as opposed to the current 5000/5000)
26Cross-over Point TN-TH
- The TN-TH crossover point depends only on the size of window B.
- TN-TH = 0.0094 (B = 500) as a ratio of stream A's rate to stream B's rate, meaning TNJ will outperform THJ when stream B is more than 106 times faster than stream A.
- TN-TH = 0.0555 (B = 100): 18 times.
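The crossover values above can be reproduced from the fitted parameters under an assumed form of the one-way costs: rate_a * B * P(N) + 2 * rate_b * I(N) for NLJ and rate_a * h * P(H) + 2 * rate_b * I(H) for HJ, where h = 10 is the hash bucket size. Setting the two equal and solving for rate_a / rate_b gives 2 * (I(H) - I(N)) / (B * P(N) - h * P(H)). The cost forms are a reconstruction, not quoted from the slides, and the function name is hypothetical.

```python
# Fitted parameters from the "Fitting Parameters in the Model" slide.
P_N, I_N = 3e-4, 1e-4
P_H, I_H = 5.5e-4, 7.8e-4
BUCKET_SIZE = 10  # tuples per hash bucket


def tn_th_crossover(window_b):
    """Rate ratio (stream A / stream B) at which NLJ and HJ cost the same."""
    return 2 * (I_H - I_N) / (window_b * P_N - BUCKET_SIZE * P_H)


for b in (500, 100):
    ratio = tn_th_crossover(b)
    # Below this ratio NLJ wins: probes are rare and its updates are cheaper.
    print(b, round(ratio, 4), round(1 / ratio))
# prints:
# 500 0.0094 106
# 100 0.0555 18
```

The reciprocal of the ratio is the "times faster" figure on the slide, and window A never appears in the expression, which matches the claim that the crossover depends only on B.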
27Cross-over Point TH-HH
- The TH-HH crossover point depends only on the size of window A.
- If the size of window A increases, the TH-HH crossover point moves to the left, and vice versa.
28Join Performance
- A = 7000, B = 3000, λa = 800, λb = 200
29Interesting questions (Contd)
- How should we allocate computing resources between the two windows to maximize join efficiency?
- If memory is the bottleneck, how should we allocate memory between the two windows for the two inputs?
30Resource Allocation Join Performance
- Focus on cases where system resources are insufficient to fully support queries and workloads.
- Input streams are simply too fast to keep up with.
- The join operator is expensive to evaluate, and its service rate is lower than the input rates.
- System memory cannot hold both windows.
31Resource Allocation Join Performance (Contd)
- Approximate answers may be acceptable
- E.g. a query involving an aggregate (e.g. average) over a join
- The question is how to maximize the accuracy of the approximate answers, given the limited resources.
- We use the insight that larger samples produce better answers
- Goal is to maximize the # of join result tuples
- Care must be taken to ensure that the result produced is statistically comparable to a random sample of the full join result.
32Limited Computing Resources
- λa = 800, λb = 200
- A = 100, B = 200
- ?0.01, µ100
- Figure: Window Join Output Rate w/ Effective Rates
33Limited Memory Resources
- Figure: Window Join Output Rate
34Limited Memory and Computing Resources
- Best performers are groups that allocate maximum computing resources to one stream and maximum memory to the other.
35Summary
- Introduced a unit-time basis cost model and experimentally validated it.
- Extended the traditional join framework to include asymmetric combinations of join algorithms.
- Investigated resource allocation strategies for improving the accuracy of approximate answers.
- Developed a powerful optimization framework for sliding window join queries by addressing these issues in a unified manner.