Scheduling for Shared Window Joins over Data Streams

1
Scheduling for Shared Window Joins over Data Streams
Moustafa Hammad, Purdue University
Michael Franklin, UC Berkeley
Walid Aref, Purdue University
Ahmed Elmagarmid, Purdue University
2
Stream Applications and Continuous Queries
  • Streaming applications
      • Sensor networks (smart buildings, biosensors, toll roads)
      • Location-based services for mobile objects:
        Enhanced 911 (USA) and Enhanced 112 (Europe)
        [IEEE Spectrum, July 2003]
      • Retail transactions
  • Continuous queries (CQ)
      • React to new data as it arrives
      • Data is continuously arriving → the query is
        continuously running

3
Challenges: Query Processing for Data Streams
  • Data streams break a number of basic assumptions
    of traditional query processing technology
      • Infinite streams → infinite state (e.g., most
        join operations)
      • Ordered execution
      • Shared execution
      • Multiple continuous queries (overlapping
        interests)

4
Shared Execution among Multiple CQs
  • Sharing is a key technique in stream processing
      • Resources are bound for a long time (the duration
        of the CQ)
      • Large number of CQs, high stream rates, and tight
        responsiveness requirements
  • Why the join operation?
      • A core component: combines data from multiple
        streams for further processing and analysis
      • A costly operation in stream processing
      • Selection pull-up emphasizes sharing [Chen et al.,
        ICDE02]

[Figure: selection pull-up over streams S1 and S2, with selections σA and σB and a Split operator]
5
Window Join Operator
  • Data streams are unbounded
  • Traditional approach
      • Deals with stored relations
  • Stream processing approach
      • Maintains a window (scope of interest),
        e.g., joins tuples from the last hour
      • Q is repeatedly executed (CQ) → sliding window
        join (SWJ)
  • Multiple SWJs (same streams, different windows)
      • Naïve approach: no sharing
      • Shared SWJs?
      • [CACQ02, PSoup02]: process the largest window, then
        filter for the smaller windows (discriminates
        against queries with smaller window sizes)
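
A single sliding window join of the kind described above can be sketched as follows. This is a minimal illustration rather than the operator used in the presentation: the event layout `(stream, key, timestamp)`, the function name `sliding_window_join`, and the equality join on `key` are all assumptions made for the example.

```python
from collections import deque

def sliding_window_join(events, window):
    """Symmetric time-based window join on equal keys (illustrative sketch).

    events: iterable of (stream, key, timestamp) in timestamp order,
            where stream is "A" or "B" (hypothetical layout).
    window: window size in the same units as the timestamps.
    Yields ((a_key, a_ts), (b_key, b_ts)) pairs.
    """
    buffers = {"A": deque(), "B": deque()}
    for stream, key, ts in events:
        # Expire tuples that have fallen out of the window.
        for buf in buffers.values():
            while buf and ts - buf[0][1] > window:
                buf.popleft()
        # Probe the opposite stream's window for matching keys.
        other = "B" if stream == "A" else "A"
        for okey, ots in buffers[other]:
            if okey == key:
                pair = ((key, ts), (okey, ots))
                yield pair if stream == "A" else pair[::-1]
        buffers[stream].append((key, ts))

events = [("A", 1, 0), ("B", 1, 5), ("B", 1, 20), ("A", 2, 21), ("B", 2, 22)]
results = list(sliding_window_join(events, window=10))
```

With a window of 10 time units, the tuple arriving on B at time 20 finds no partner because the A tuple from time 0 has already expired.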

[Figure: centralized stream processing system]
6
Presentation Outline
  • Problem Specification
  • Scheduling Algorithms for SWJs
      • LWO: Largest Window Only
      • SWF: Smallest Window First
      • MQT: Maximum Query Throughput
  • Performance Study
  • Conclusion

7
Example: Sliding Window Join Operation
  • Window join
      • Example: a data center with hundreds of sensors
        that monitor temperature and humidity values
  • Schema
      • Temperature stream (LocationID, Value, TimeStamp)
      • Humidity stream (LocationID, Value, TimeStamp)
  • CQ: continuously monitor the count of sensors
    reporting temperature and humidity values above
    specific thresholds within the last minute.

[Figure: data center]

8
Problem Specification: Sliding Window Join Operation
  • Q1:
      SELECT COUNT(DISTINCT A.LocationId)
      FROM Temperature A, Humidity B
      WHERE A.LocationId = B.LocationId AND
            A.Value > Threshold_t AND
            B.Value > Threshold_h
      WINDOW 1 min
  • Q2:
      SELECT A.LocationId, MAX(A.Value), MAX(B.Value)
      FROM Temperature A, Humidity B
      WHERE A.LocationId = B.LocationId
      GROUP BY A.LocationId
      WINDOW 1 hour

[Figure: example join outputs (a9,b12) and (a11,b8)]
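
To make the semantics of Q1 concrete, the following sketch evaluates it once over buffered tuples. It is an illustration under stated assumptions: tuples are `(location_id, value, timestamp)` triples, the window covers tuples whose timestamp is within `window` time units of `now`, and the function name `q1_count` is hypothetical. Because the join is an equality on the location and the result is a distinct count, it reduces to intersecting the two sets of qualifying locations.

```python
def q1_count(temperature, humidity, now, window, threshold_t, threshold_h):
    """Count distinct locations with an in-window temperature reading above
    threshold_t and an in-window humidity reading above threshold_h."""
    hot = {loc for loc, value, ts in temperature
           if now - ts <= window and value > threshold_t}
    humid = {loc for loc, value, ts in humidity
             if now - ts <= window and value > threshold_h}
    # The equality join on location followed by COUNT(DISTINCT ...) is the
    # size of the intersection of the two qualifying location sets.
    return len(hot & humid)

temperature = [(1, 30, 50), (2, 20, 55), (3, 35, 5)]
humidity = [(1, 80, 58), (3, 90, 59)]
count = q1_count(temperature, humidity, now=60, window=30,
                 threshold_t=25, threshold_h=70)
```

Location 3 is excluded here because its temperature reading (timestamp 5) lies outside the 30-unit window, and location 2 is excluded because its temperature is below the threshold.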
9
Problem Specification: Shared Window Join
  • Sharing must be transparent to the queries.
  • Transparency requirement
      • Property 1: output order must be the same as in
        single (unshared) execution; otherwise the query
        produces incorrect output (e.g., online MAX,
        online COUNT)
      • Property 2: the response time penalty due to
        sharing should be minimized
  • In bursty workloads
      • On average, the system must accommodate the
        aggregate input rates.

[Figure: shared window join of streams A and B. A joining part feeds a routing part that delivers results to Q1 (COUNT, 1 min window) and Q2 (MAX, GROUP BY, 1 hour window)]

10
Largest Window Only (LWO) [CACQ02]
Completely scans the largest window before
serving a new tuple. Example: 3 queries with 3
different windows (w1 < w2 < w3).
[Figure: LWO example. Streams A (a1 to a11) and B (b0 to b12) with windows w1, w2, w3. The join part produces outputs such as (a11,b8), (a11,b4), (a11,b0), (a9,b12), (a5,b12), and (a1,b12); the routing part delivers them to the output data streams of Q1(w1), Q2(w2), and Q3(w3).]
Ordered output (satisfies property 1 of transparent
execution)
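
The LWO idea can be sketched as a single scan of the largest window followed by routing each match to every query whose window covers it. This is an illustrative sketch only: the function name `lwo_route`, the use of plain timestamps, and the assumption that the join predicate has already been applied are all hypothetical simplifications, and the window sizes in the usage lines are made up.

```python
def lwo_route(new_ts, partner_timestamps, windows):
    """Scan the largest window once and route each match to the queries it fits.

    new_ts: timestamp of the newly arrived tuple.
    partner_timestamps: timestamps of matching tuples in the opposite
        stream, newest first (join predicate assumed already applied).
    windows: dict mapping a query name to its window size.
    Returns (query, partner_ts) pairs in the order they are produced.
    """
    largest = max(windows.values())
    routed = []
    for ts in partner_timestamps:
        age = new_ts - ts
        if age > largest:
            break  # everything older is outside even the largest window
        for query, size in windows.items():
            if age <= size:
                routed.append((query, ts))
    return routed

# Hypothetical numbers: tuple a11 matches b8, b4, and b0; w1 < w2 < w3.
out = lwo_route(11, [8, 4, 0], {"Q1": 4, "Q2": 8, "Q3": 12})
```

Because the whole scan for one tuple finishes before the next tuple is served, each query receives its results in arrival order, but a query with a small window waits for the full largest-window scan of every preceding tuple.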
11
Largest Window Only (LWO)
  • Analysis
      • Average response time for query Qi
  • Example
      • 7 queries with window sizes between 1 second and
        10 minutes
      • High response time for queries with small window
        sizes (fails property 2)

12
Proposed Algorithm 1: Smallest Window First (SWF)
For each arriving tuple, scan the smallest window first,
then the next larger window.
[Figure: SWF example. Streams A (a1 to a11) and B (b0 to b12) with windows w1, w2, w3. The join part produces outputs such as (a9,b12), (a11,b8), (a5,b12), (a11,b4), (a1,b12), and (a11,b0); the routing part delivers them to the output data streams of Q1(w1), Q2(w2), and Q3(w3).]
Ordered output is not straightforward: the routing
part must buffer output before release.
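
The scan order that SWF produces can be sketched with a priority queue keyed on the window index. This is a simplified illustration, not the presented implementation: it assumes all listed tuples are already waiting when scheduling starts, and the name `swf_schedule` and the `(tuple, window_index)` task representation are hypothetical.

```python
import heapq

def swf_schedule(arrivals, num_windows):
    """Return the order of partial scans (tuple, window_index) under SWF.

    arrivals: tuple identifiers in arrival order (all assumed waiting).
    num_windows: number of distinct windows, indexed 0 (smallest) upward.
    """
    # Priority: smaller window index first, then earlier arrival.
    heap = [(0, seq, t) for seq, t in enumerate(arrivals)]
    heapq.heapify(heap)
    order = []
    while heap:
        widx, seq, t = heapq.heappop(heap)
        order.append((t, widx))
        if widx + 1 < num_windows:
            # The tuple continues with the next larger window later.
            heapq.heappush(heap, (widx + 1, seq, t))
    return order

order = swf_schedule(["a9", "a11"], num_windows=3)
```

In this run the smallest-window scan of a11 happens before a9 has covered the larger windows, so results of a11 can be produced ahead of later results of a9 for the same large-window query. That is why the routing part has to buffer output before releasing it in order.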
13
Smallest Window First (SWF)
  • Analysis
      • Average response time for query Qi
  • Example
      • 7 queries with window sizes between 1 second and
        10 minutes
      • High response time for queries with large window
        sizes (fails property 2)

14
Proposed Algorithm 2: Maximum Query Throughput (MQT)
  • Observations
      • LWO and SWF make wrong scheduling decisions.
      • LWO and SWF ignore the number of queries per window.
  • Greedy approach
      • Schedule the tuple that serves the maximum number
        of queries in the shortest time
        (e.g., MAX{N1/w1, (N2-N1)/(w2-w1)})
      • Problem: ignores future scans (local optimum)
  • Maximum Query Throughput (MQT)
      • Considers all future scans at a given tuple
        position
      • MQT(a) = MAX{(N2-N1)/(w2-w1), (N3-N1)/(w3-w1)}
      • MQT(b) = MAX{N1/w1}
      • Schedule (a or b) according to
        MAX{MQT(a), MQT(b)}
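
The difference between the greedy choice and MQT can be sketched numerically. This is an illustration under assumptions: each candidate scan is represented as a pair (queries served, scan length), the function names `mqt` and `pick_tuple` are hypothetical, and the values N = (3, 4, 12) and w = (2, 3, 6) are made up so that the two strategies disagree.

```python
from fractions import Fraction

def mqt(candidate_scans):
    """Best queries-served-per-unit-scan ratio over a tuple's candidate scans.

    candidate_scans: list of (queries_served, scan_length) pairs.
    """
    return max(Fraction(n, w) for n, w in candidate_scans)

def pick_tuple(candidates_by_tuple):
    """Pick the tuple whose best candidate scan has the highest ratio."""
    return max(candidates_by_tuple, key=lambda t: mqt(candidates_by_tuple[t]))

# Hypothetical configuration: N1, N2, N3 = 3, 4, 12 and w1, w2, w3 = 2, 3, 6.
N1, N2, N3 = 3, 4, 12
w1, w2, w3 = 2, 3, 6

# Greedy looks only at the next scan of each tuple.
greedy = pick_tuple({"a": [(N2 - N1, w2 - w1)], "b": [(N1, w1)]})

# MQT considers all future scans of tuple a, as in MQT(a) and MQT(b).
best = pick_tuple({
    "a": [(N2 - N1, w2 - w1), (N3 - N1, w3 - w1)],
    "b": [(N1, w1)],
})
```

With these numbers the greedy rule schedules b (3/2 beats 1/1), while MQT schedules a because the longer scan to w3 serves 9 queries over 4 units (9/4).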

[Figure: tuples a and b with candidate scans N1/w1 and (N2-N1)/(w2-w1)]
15
Maximum Query Throughput (MQT)
  • Given a window-query configuration, build a matrix
    (MaxQT)
      • Each MaxQT entry: MAX( ) for all partial windows
      • MaxQT is updated as new queries are added
      • MaxQT is indexed by relative tuple positions
  • Example
      • 3 queries with window sizes
        w1 = 2w, w2 = 3w, w3 = 6w

[Figure: example join outputs (a9,b12), (a11,b12), (a11,b6)]
16
Performance Study
  • Performance metrics
      • Average and maximum response time
      • Memory requirements
  • Implementation
      • Uses a prototype database management system,
        PREDATOR
      • Both hash-based and nested-loop versions of the
        W-joins
      • A stream is an abstract data type, stream-type,
        with specific interface functions
      • A StreamScan operator and a stream manager
        communicate the query execution plan to the
        stream type
  • Settings
      • Synthetic data streams, join selectivity 0.002
      • Sun Enterprise 450, Solaris 2.6, with 4 GB of
        memory
      • The window is based on time units

17
1) Varying Window Distributions
  • Different window distributions
  • Single query per window
  • Hash-based implementation
  • Window sizes from 1 sec to 10 minutes, λ = 100
    tuples/sec

18
2) Varying Level of Burstiness
  • Pareto distribution
  • Average burst size: 15 tuples
  • Window distribution: small-large

19
3) Varying Query Distribution
  • Window distribution: small-large
  • 80% of the total queries share a single window wi
  • 20% are uniformly distributed over the other windows
  • Total of 30 queries

[Figure: small-large window distribution; labels w(i+1) - wi and windows w2 to w7]
20
4) Memory Requirement
  • Memory buffers
  • For λ = 100 tuples/sec and w = 600 sec:
      • Maximum size of joinBuffer: Smax = λ · wmax =
        60K tuples
  • For SWF and MQT:
      • Maximum size of inputBuffer is always less than
        10% of Smax
      • Maximum size of outputBuffer is always less than
        3% of Smax
  • Under the basic assumption that the system can
    eventually keep up with the input arrival rate, the
    extra memory requirement is not significant for SWF
    and MQT.
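
The buffer bounds above follow from simple arithmetic, worked through here as a short sketch. The variable names are hypothetical; the rate, window size, and percentage bounds are the figures quoted on this slide.

```python
# Figures quoted on the slide.
rate = 100            # tuples per second
w_max = 600           # largest window, in seconds
input_pct = 10        # inputBuffer bound, percent of Smax
output_pct = 3        # outputBuffer bound, percent of Smax

# Maximum joinBuffer size: every tuple that arrives within the largest window.
s_max = rate * w_max

# Upper bounds on the extra buffers needed by SWF and MQT (integer math).
input_bound = s_max * input_pct // 100
output_bound = s_max * output_pct // 100
extra_pct = input_pct + output_pct

print(s_max, input_bound, output_bound, extra_pct)
```

The two bounds together stay below 13% of the join buffer, which matches the figure quoted in the conclusion.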

21
Conclusion
  • Sharing window joins is a key technique for
    achieving scalability and optimizing system
    resources in CQ processing.
  • We presented three algorithms:
      • LWO, Largest Window Only [CACQ02]
      • SWF, Smallest Window First (proposed)
      • MQT, Maximum Query Throughput (proposed)
  • MQT provides the best average response time among
    the three.
  • MQT and SWF require additional processing to
    match isolated execution; experimentally, the
    performance gain outweighs the additional
    overhead for large window differences.
  • The additional memory requirement for SWF and MQT
    is not significant (less than 13% of the join
    buffers).

22
Thank You
23
Future Work
  • Applying shared window execution to other
    window operations such as online aggregation,
    online GROUP BY, duplicate elimination, union,
    intersection, and difference.
  • Starvation avoidance, e.g., for long-lasting bursty
    workloads.
  • Load shedding: the filtered workload will
    naturally fit our proposed approaches.
  • Partial window join: explore partial areas in the
    overlapping windows to maximize output
    throughput.
  • Window clustering and hierarchical filtering for
    a large number of queries with different windows:
    an interesting option for limiting the number of
    scheduled windows.

24
2) Per-Window Response Time
  • Window distribution: uniform