Evaluating Window Joins over Unbounded Streams

1
Evaluating Window Joins over Unbounded Streams
  • Jaewoo Kang, Jeffrey F. Naughton, Stratis D. Viglas
  • {jaewoo, naughton, viglas}@cs.wisc.edu
  • Univ. of Wisconsin-Madison

ICDE 2003, Bangalore, India
2
Outline of the talk
  • Introduction: Continuous Queries over Unbounded
    Streams
  • Measuring the Cost of Sliding Window Joins
  • On Maximizing the Efficiency of Processing Joins
  • Summary

3
Sliding Windows
  • Handling internal state is a big challenge.
  • Approximate answers
  • Sliding windows toss out expired tuples
  • Synopses resort to reduced answer precision

4
A Simple Sliding Window Query
  • On arrival of a new tuple for window A:
  • Scan window B and propagate matching tuples
  • Insert new tuple into window A
  • Invalidate all expired tuples in window A
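
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the windows are count-based (each keeps its most recent `size` tuples), the join is an equality join on a `key`, and the class and method names are hypothetical rather than taken from the talk.

```python
from collections import deque


class WindowJoin:
    """Illustrative count-based sliding window equi-join (assumed design)."""

    def __init__(self, size_a, size_b):
        self.size = {"A": size_a, "B": size_b}
        # Each window holds (key, payload) pairs, oldest first.
        self.window = {"A": deque(), "B": deque()}

    def on_arrival(self, side, key, payload):
        other = "B" if side == "A" else "A"
        # Step 1: scan the opposite window and propagate matching tuples.
        matches = [
            (payload, p) if side == "A" else (p, payload)
            for k, p in self.window[other]
            if k == key
        ]
        # Step 2: insert the new tuple into its own window.
        self.window[side].append((key, payload))
        # Step 3: invalidate expired tuples (those beyond the window size).
        while len(self.window[side]) > self.size[side]:
            self.window[side].popleft()
        return matches
```

Each result pair is ordered as (tuple from A, tuple from B) regardless of which side triggered the probe.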

5
Some interesting questions
  • How should we measure the efficiency of a sliding
    window join evaluation strategy?
  • Can a sliding window join algorithm take
    advantage of asymmetries in the two input stream
    speeds?

6
Interesting questions (cont'd)
  • How should we allocate computing resources
    between the two windows to maximize join
    efficiency?
  • If memory is the bottleneck, how should we
    allocate memory between the two windows for the
    two inputs?

7
Interesting questions
  • How should we measure the efficiency of a sliding
    window join evaluation strategy?
  • Can a sliding window join algorithm take
    advantage of asymmetries in the two input stream
    speeds?

8
Outline of the talk
  • Introduction: Continuous Queries over Unbounded
    Streams
  • Measuring the Cost of Sliding Window Joins
  • On Maximizing the Efficiency of Processing Joins
  • Summary

9
Cost Model
  • Unit-time basis cost model
  • Aggregate cost of processing tuples arriving in
    each window in a time unit

10
Cost Model (cont'd)
  • The cost formula can be divided into two independent
    groups, one for each input stream
  • Thus, we can evaluate join algorithms for each join
    direction independently

11
Cost of One-way NLJ
  • P(D) - cost of accessing one tuple in data
    structure D during search operation
  • I(D) - cost of accessing one tuple in data
    structure D during update operation
  • Total number of tuples processed in a time unit
    multiplied by the tuple access cost
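
The slide's formula did not survive in this transcript, so the following is only a hedged sketch of what "tuples processed per time unit multiplied by the tuple access cost" could look like. The exact terms are assumptions: every tuple arriving on the probing stream scans the whole opposite window at cost P(D) per tuple, and in steady state every tuple arriving on the other stream causes one insert and one invalidation at cost I(D) each. Function and parameter names are hypothetical.

```python
def one_way_nlj_cost(rate_probe, rate_update, window_size, p_cost, i_cost):
    """Assumed unit-time cost of one join direction under nested loops.

    rate_probe  -- arrival rate of the stream that probes the window
    rate_update -- arrival rate of the stream that fills the window
    window_size -- number of tuples held in the probed window
    p_cost      -- P(D): cost of accessing one tuple during a search
    i_cost      -- I(D): cost of accessing one tuple during an update
    """
    probe = rate_probe * window_size * p_cost  # full scan per probing tuple
    update = 2 * rate_update * i_cost          # assumed: one insert + one expiry
    return probe + update


def nlj_join_cost(rate_a, rate_b, size_a, size_b, p_cost, i_cost):
    """Total cost as the sum of the two independent join directions."""
    a_probes_b = one_way_nlj_cost(rate_a, rate_b, size_b, p_cost, i_cost)
    b_probes_a = one_way_nlj_cost(rate_b, rate_a, size_a, p_cost, i_cost)
    return a_probes_b + b_probes_a
```

Because the two directions are summed independently, a different algorithm (and hence a different cost function) can be substituted for either direction without touching the other.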

12
Cost of One-way HJ
  • |B| -- # of hash buckets in window B
  • B/|B| -- # of tuples in a hash bucket
  • Implement each hash bucket to preserve tuple arrival
    order, which avoids invalidation overhead.
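
One way to read the last bullet: if every bucket keeps its tuples in arrival order, expired tuples always sit at the front of the bucket and can be dropped without searching. The sketch below illustrates that idea under assumptions not stated in the slides: a time-based window of length `span`, non-decreasing timestamps, and lazy expiry performed at probe time. All names are hypothetical.

```python
from collections import deque


class HashWindow:
    """Hash-partitioned window whose buckets preserve arrival order (sketch)."""

    def __init__(self, num_buckets, span):
        self.buckets = [deque() for _ in range(num_buckets)]
        self.span = span  # a tuple expires `span` time units after arrival

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, timestamp, key, payload):
        # Appending keeps each bucket sorted by arrival time.
        self._bucket(key).append((timestamp, key, payload))

    def probe(self, now, key):
        bucket = self._bucket(key)
        # Expired tuples can only be at the front, so expiry is a cheap pop.
        while bucket and bucket[0][0] <= now - self.span:
            bucket.popleft()
        return [p for _, k, p in bucket if k == key]
```

With this layout, invalidation never scans a bucket; it stops at the first tuple that is still live.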

13
Cost of One-way T-tree INLJ
  • N -- size of a T-tree node (# of tuples)
  • B/N -- total # of nodes in a T-tree

14
Implementation
  • Implemented:
  • Four join algorithms: NLJ, HJ, BJ, and TJ
  • Asymmetric join operator
  • Stream emulator
  • System:
  • Java HotSpot VM 1.4
  • AMD Athlon XP 1533 MHz, 1 GB memory
  • Windows XP Professional

15
Fitting Parameters in the Model
  • Process 60 seconds worth of tuples without
    intermittent delays, at 20 different points with
    increasing workload rates.
  • Then, equate the measured values with the cost
    formula, and solve the equation.
  • Hash bucket size of 10 and T-tree node size of 100
    were used
  • P(N) = 3×10^-4, P(H) = 5.5×10^-4
  • P(BT) = 2.6×10^-4, P(TT) = 2.6×10^-4
  • I(N) = 1×10^-4, I(H) = 7.8×10^-4
  • I(BT) = 2.6×10^-4, I(TT) = 2.7×10^-4
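
The slides say the measured values were equated with the cost formula and solved, without saying how. A common way to do this with more measurements than unknowns is ordinary least squares, which is what the sketch below uses; that choice, the two-parameter linear model (cost = probe accesses × P + update accesses × I), and all names are assumptions. The data is synthetic, generated from known parameters only to check that the fit recovers them. It is not measured data from the talk.

```python
def fit_two_parameters(rows):
    """Least-squares fit of cost = x1 * P + x2 * I.

    rows -- iterable of (probe_accesses, update_accesses, measured_cost)
    Returns the fitted (P, I) by solving the 2x2 normal equations.
    """
    rows = list(rows)
    s11 = sum(x1 * x1 for x1, _, _ in rows)
    s12 = sum(x1 * x2 for x1, x2, _ in rows)
    s22 = sum(x2 * x2 for _, x2, _ in rows)
    t1 = sum(x1 * y for x1, _, y in rows)
    t2 = sum(x2 * y for _, x2, y in rows)
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("workload points are collinear; cannot separate P and I")
    p = (t1 * s22 - t2 * s12) / det
    i = (s11 * t2 - s12 * t1) / det
    return p, i


# Synthetic check: 20 workload points generated from known parameters.
P_TRUE, I_TRUE = 3e-4, 1e-4
rows = [
    (1000 * k, 50 + 7 * k, 1000 * k * P_TRUE + (50 + 7 * k) * I_TRUE)
    for k in range(1, 21)
]
p_fit, i_fit = fit_two_parameters(rows)
```

The workload points must vary the two access counts independently; otherwise the determinant is zero and the two parameters cannot be told apart.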

16
Outline of the talk
  • Introduction: Continuous Queries over Unbounded
    Streams
  • Measuring the Cost of Sliding Window Joins
  • On Maximizing the Efficiency of Processing Joins
  • Summary

17
Interesting questions
  • How should we measure the efficiency of a sliding
    window join evaluation strategy?
  • Can a sliding window join algorithm take
    advantage of asymmetries in the two input stream
    speeds?

18
Taking Advantage of Asymmetry
  • There are cases where an asymmetric combination
    of join algorithms outperforms symmetric
    counterparts!
  • E.g. for some A, B

19
Join Cost Estimation using Cost Model
  • Size of window A: 5000
  • Size of window B: 5000
  • Five winning combinations: TN, TH, HH, HT, NT

25
Measured Join Cost (CPU Time)
  • A = 5000, B = 5000
  • Memory utilization: HJ (h = 10) consumed 5% more
    than TJ (n = 100).
  • Same five winners: TN, TH, HH, HT, NT
  • Cost model prediction was accurate for both
    overall shape and crossover points.
  • What if we increase window A and decrease window
    B?
  • (e.g. A = 7000, B = 3000 as opposed to the current
    A = 5000, B = 5000)

26
Cross-over Point TN-TH
  • TN-TH depends only on the size of window B
  • TN-TH = 0.0094 (B = 500), meaning TNJ will
    outperform THJ when stream B is more than 106
    times faster than stream A.
  • TN-TH = 0.0555 (B = 100), i.e. 18 times faster.
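
The "106 times" and "18 times" figures follow from the crossover values if each value is read as the ratio of stream A's rate to stream B's rate at which the two strategies cost the same. That reading is an assumption inferred from the wording of the slide, and the function name is hypothetical. The check is a one-line reciprocal:

```python
def speed_ratio_at_crossover(crossover):
    """How many times faster stream B must be than stream A at the crossover."""
    return 1.0 / crossover


for window_b, crossover in [(500, 0.0094), (100, 0.0555)]:
    print(window_b, round(speed_ratio_at_crossover(crossover)))
# prints:
# 500 106
# 100 18
```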

27
Cross-over Point TH-HH
  • TH-HH depends only on the size of window A
  • If the size of window A increases, the crossover
    point TH-HH moves to the left, and vice versa.

28
Join Performance
  • A = 7000, B = 3000, λa = 800, λb = 200
  • A = 9500, B = 500, λa = 2, λb = 998

29
Interesting questions (cont'd)
  • How should we allocate computing resources
    between the two windows to maximize join
    efficiency?
  • If memory is the bottleneck, how should we
    allocate memory between the two windows for the
    two inputs?

30
Resource Allocation Join Performance
  • Focus on cases where system resources are
    insufficient to fully support queries and
    workloads.
  • Input streams are simply too fast to keep up
    with.
  • The join operator is expensive to evaluate, and its
    service rate is lower than the input rates.
  • System memory cannot hold both windows.

31
Resource Allocation Join Performance (cont'd)
  • Approximate answers may be acceptable
  • E.g. a query involving an aggregate (such as an
    average) over a join
  • The question is how to maximize the accuracy of the
    approximate answers, given the limited resources.
  • We use the insight that larger samples produce better
    answers
  • The goal is to maximize the # of join result tuples
  • Care must be taken to ensure that the result
    produced is statistically comparable to a random
    sample of the full join result.

32
Limited Computing Resources
  • λa = 800, λb = 200
  • A = 100, B = 200
  • σ = 0.01, µ = 100

[Figure: window join output rate with effective rates]
33
Limited Memory Resources
  • λa = 10, λb = 50
  • M = 1000, σ = 0.005

[Figure: window join output rate]
34
Limited Memory and Computing Resources
  • µ = 10, M = 100
  • σ = 0.01
  • The best performers are groups that allocate maximum
    computing resources to one stream and maximum
    memory to the other.

35
Summary
  • Introduced a unit-time-basis cost model and
    experimentally validated it.
  • Extended the traditional join framework to include
    asymmetric combinations of join algorithms.
  • Investigated resource allocation strategies for
    improving the accuracy of approximate answers.
  • Developed a powerful optimization framework for
    sliding window join queries by addressing these
    issues in a unified manner.