1
Evaluating Window Joins over Unbounded Streams

Jaewoo Kang, Jeffrey F. Naughton, Stratis D. Viglas
{jaewoo, naughton, viglas}@cs.wisc.edu
Univ. of Wisconsin-Madison

ICDE 2003, Bangalore, India
2
Outline of the talk

Introduction: Continuous Queries over Unbounded
Streams
Measuring the Cost of Sliding Window Joins
On Maximizing the Efficiency of Processing Joins
Summary

3
Sliding Windows

Handling internal state is a big challenge.
Approximate answers
Sliding windows toss out expired tuples
Synopses resort to reduced answer precision

4
A Simple Sliding Window Query

On arrival of a new tuple to window A
Scan window B and propagate matching tuples
Insert new tuple into window A
Invalidate all expired tuples in window A
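
The three steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes count-based windows (the slide does not say whether windows are count-based or time-based), an equality join on a key, and a nested-loop scan. The class and method names are hypothetical.

```python
from collections import deque


class SlidingWindowJoin:
    """Symmetric nested-loop join over two count-based sliding windows (assumed)."""

    def __init__(self, size_a, size_b, key=lambda t: t):
        self.windows = {"A": deque(), "B": deque()}
        self.sizes = {"A": size_a, "B": size_b}
        self.key = key  # extracts the join attribute from a tuple

    def on_arrival(self, stream, tup):
        own, other = ("A", "B") if stream == "A" else ("B", "A")
        # 1. Scan the opposite window and propagate matching tuples.
        #    Results are always emitted as (A tuple, B tuple) pairs.
        matches = [
            (tup, o) if own == "A" else (o, tup)
            for o in self.windows[other]
            if self.key(o) == self.key(tup)
        ]
        # 2. Insert the new tuple into its own window.
        self.windows[own].append(tup)
        # 3. Invalidate expired tuples (oldest first) in that same window.
        while len(self.windows[own]) > self.sizes[own]:
            self.windows[own].popleft()
        return matches
```

The same handler serves both streams: an arrival on B scans window A, is inserted into window B, and expires old tuples from window B.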

5
Some interesting questions

How should we measure the efficiency of a sliding
window join evaluation strategy?
Can a sliding window join algorithm take
advantage of asymmetries in the two input stream
speeds?

6
Interesting questions (Cont'd)

How should we allocate computing resources
between the two windows to maximize join
efficiency?
If memory is the bottleneck, how should we
allocate memory between the two windows for the
two inputs?

7
Interesting questions

How should we measure the efficiency of a sliding
window join evaluation strategy?
Can a sliding window join algorithm take
advantage of asymmetries in the two input stream
speeds?

8
Outline of the talk

Introduction: Continuous Queries over Unbounded
Streams
Measuring the Cost of Sliding Window Joins
On Maximizing the Efficiency of Processing Joins
Summary

9
Cost Model

Unit-time basis cost model
Aggregate cost of processing tuples arriving in
each window in a time unit

10
Cost Model (Cont'd)

Cost formula can be divided into two independent
groups, one for each input stream
Thus, can evaluate join algorithms for each join
direction independently

11
Cost of One-way NLJ

P(D) - cost of accessing one tuple in data
structure D during search operation
I(D) - cost of accessing one tuple in data
structure D during update operation
Total number of tuples processed in a time unit
multiplied by the tuple access cost
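
The slide states the cost only in words here, so the Python sketch below is one plausible reading of that description and should not be taken as the authors' formula. It assumes that every tuple arriving on stream A scans all of window B at P(N) per tuple and then causes one insertion and one invalidation in window A at I(N) each. The function and parameter names are hypothetical, and the default costs are the fitted P(N) and I(N) values reported later in the talk (their unit is not given).

```python
def one_way_nlj_cost(rate_a, window_b, probe_cost=3e-4, update_cost=1e-4):
    """Estimated cost per time unit of joining arrivals on A against window B.

    rate_a      -- tuples arriving on stream A per time unit
    window_b    -- number of tuples held in window B
    probe_cost  -- P(N): cost of touching one tuple during a search
    update_cost -- I(N): cost of touching one tuple during an update
    """
    tuples_scanned = rate_a * window_b  # each arrival scans all of window B
    tuples_updated = rate_a * 2         # assumed: one insert + one invalidation
    return tuples_scanned * probe_cost + tuples_updated * update_cost


def symmetric_nlj_cost(rate_a, rate_b, window_a, window_b):
    """Sum of the two independent join directions (A against B, B against A)."""
    return one_way_nlj_cost(rate_a, window_b) + one_way_nlj_cost(rate_b, window_a)
```

Because the two directions are simply added, each one can be costed, and its algorithm chosen, without looking at the other.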

12
Cost of One-way HJ

|B| -- # of hash buckets in window B
B/|B| -- # of tuples in a hash bucket
Implement each hash bucket so that it preserves tuple
arrival order, to avoid invalidation overhead.
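
One way to keep buckets in arrival order is sketched below in Python. It is an assumption-laden illustration, not the authors' data structure: each bucket is a queue with the newest tuple at the tail, and a separate queue records which bucket each live tuple went to, so the expired tuple is always at the head of its bucket and no bucket has to be searched. The names `HashWindow`, `insert`, and `probe` are hypothetical, and the window is assumed to be count-based.

```python
from collections import deque


class HashWindow:
    """Hash-partitioned sliding window whose buckets keep arrival order."""

    def __init__(self, num_buckets, window_size):
        self.buckets = [deque() for _ in range(num_buckets)]
        self.arrivals = deque()  # bucket index of each live tuple, oldest first
        self.window_size = window_size

    def insert(self, key, value):
        idx = hash(key) % len(self.buckets)
        self.buckets[idx].append((key, value))  # newest tuple goes to the tail
        self.arrivals.append(idx)
        if len(self.arrivals) > self.window_size:
            # The globally oldest tuple is the head of its own bucket,
            # so invalidation is a constant-time pop with no scan.
            self.buckets[self.arrivals.popleft()].popleft()

    def probe(self, key):
        idx = hash(key) % len(self.buckets)
        return [v for k, v in self.buckets[idx] if k == key]
```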

13
Cost of One-way T-tree INLJ

N -- size of a T-tree node (# of tuples)
B/N -- total # of nodes in a T-tree

14
Implementation

Implemented:
Four join algorithms: NLJ, HJ, BJ, and TJ
Asymmetric join operator
Stream emulator
System:
Java HotSpot VM 1.4
AMD Athlon XP 1533 MHz, 1 GB memory
Windows XP Professional

15
Fitting Parameters in the Model

Process 60 seconds' worth of tuples without
intermittent delays, at 20 different points with
increasing workload rates.
Then, equate the measured values with the cost
formula, and solve the equation.
A hash bucket size of 10 and a T-tree node size of
100 were used.
P(N) = 3×10^-4, P(H) = 5.5×10^-4
P(BT) = 2.6×10^-4, P(TT) = 2.6×10^-4
I(N) = 1×10^-4, I(H) = 7.8×10^-4
I(BT) = 2.6×10^-4, I(TT) = 2.7×10^-4

16
Outline of the talk

Introduction: Continuous Queries over Unbounded
Streams
Measuring the Cost of Sliding Window Joins
On Maximizing the Efficiency of Processing Joins
Summary

17
Interesting questions

How should we measure the efficiency of a sliding
window join evaluation strategy?
Can a sliding window join algorithm take
advantage of asymmetries in the two input stream
speeds?

18
Taking Advantage of Asymmetry

There are cases where an asymmetric combination
of join algorithms outperforms symmetric
counterparts!
E.g. for some A, B

19
Join Cost Estimation using Cost Model

Size of window A: 5000
Size of window B: 5000
Five winning combinations: TN, TH, HH, HT, NT

25
Measured Join Cost (CPU Time)

A = 5000, B = 5000
Memory utilization: HJ (h = 10) consumed 5% more
than TJ (n = 100).
Same five winners: TN, TH, HH, HT, NT
Cost model prediction was accurate for both
overall shape and crossover points.

What if we increase window A and decrease window
B?
(e.g. A = 7000, B = 3000 as opposed to the current
5000:5000)

26
Cross-over Point TN-TH

TN-TH depends only on window size B
TN-TH = 0.0094 (B = 500), meaning TNJ will
outperform THJ when stream B is more than 106
times faster than stream A.
TN-TH = 0.0555 (B = 100): 18 times.
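
The "106 times" and "18 times" figures follow from the crossover values if each value is read as the ratio of stream A's rate to stream B's rate. That reading is an assumption, but it reproduces both numbers on the slide. A short Python check (the function name is hypothetical):

```python
def speed_ratio(crossover):
    """How many times faster stream B must be than stream A at the crossover,
    assuming the crossover value is rate_a / rate_b."""
    return 1.0 / crossover


for window_b, crossover in [(500, 0.0094), (100, 0.0555)]:
    # Truncate to a whole number of "times faster", as the slide reports it.
    print(window_b, int(speed_ratio(crossover)))
# prints:
# 500 106
# 100 18
```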

27
Cross-over Point TH-HH

TH-HH depends only on the size of window A
If the size of window A increases, the crossover
point TH-HH moves to the left, and vice versa.

28
Join Performance

A = 7000, B = 3000, λa = 800, λb = 200

A = 9500, B = 500, λa = 2, λb = 998

29
Interesting questions (Cont'd)

How should we allocate computing resources
between the two windows to maximize join
efficiency?
If memory is the bottleneck, how should we
allocate memory between the two windows for the
two inputs?

30
Resource Allocation Join Performance

Focus on cases where system resources are
insufficient to fully support queries and
workloads.
Input streams are simply too fast to keep up
with.
The join operator is expensive to evaluate, and its
service rate is lower than the input rates.
System memory cannot hold both windows.

31
Resource Allocation Join Performance (Cont'd)

Approximate answers may be acceptable
E.g. a query involving an aggregate (e.g. an average)
over a join
The question is how to maximize the accuracy of the
approximate answers, given the limited resources.
We use the insight that larger samples produce better
answers
The goal is to maximize the # of join result tuples
Care must be taken to ensure that the result
produced is statistically comparable to a random
sample of the full join result.

32
Limited Computing Resources

λa = 800, λb = 200
A = 100, B = 200
? = 0.01, µ = 100

Figure: Window join output rate with effective rates
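
As a rough illustration of the allocation question, the Python sketch below enumerates ways to split a limited service rate between the two streams and keeps the split with the highest estimated output rate. The output-rate model (each processed tuple probes the opposite window and matches a fixed fraction of it), the reading of 0.01 as a join selectivity, the reading of A and B as window sizes in tuples, and all function and parameter names are assumptions made for this example, not the model used in the talk.

```python
def output_rate(rate_a, rate_b, window_a, window_b, selectivity):
    """Assumed model: join results produced per time unit."""
    return selectivity * (rate_a * window_b + rate_b * window_a)


def best_cpu_split(arrival_a, arrival_b, window_a, window_b,
                   selectivity, service_rate):
    """Try every integer split of the service rate; return the best one as
    (output rate, effective rate of A, effective rate of B)."""
    best = None
    for share_a in range(service_rate + 1):
        share_b = service_rate - share_a
        # A stream cannot be processed faster than its tuples arrive.
        eff_a = min(arrival_a, share_a)
        eff_b = min(arrival_b, share_b)
        rate = output_rate(eff_a, eff_b, window_a, window_b, selectivity)
        if best is None or rate > best[0]:
            best = (rate, eff_a, eff_b)
    return best


# Parameter values taken from the slide above.
rate, eff_a, eff_b = best_cpu_split(800, 200, 100, 200, 0.01, 100)
```

Under this simplified linear model the best split gives the entire service rate to the stream that probes the larger window. The sketch ignores the requirement that the output stay statistically comparable to a random sample of the full join result.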
33
Limited Memory Resources

λa = 10, λb = 50
M = 1000, ? = 0.005

Figure: Window join output rate
34
Limited Memory and Computing Resources

µ = 10, M = 100
? = 0.01

Best performers are groups that allocate maximum
computing resources to one stream and maximum
memory to the other.

35
Summary

Introduced a unit-time-basis cost model and
experimentally validated it.
Extended the traditional join framework to include
asymmetric combinations of join algorithms.
Investigated resource allocation strategies for
improving the accuracy of approximate answers.
Developed a powerful optimization framework for
sliding window join queries by addressing these
issues in a unified manner.
