Title: Static Optimization of Conjunctive Queries with Sliding Windows over Infinite Streams
1Static Optimization of Conjunctive Queries with
Sliding Windows over Infinite Streams
Ahmed M. Ayad and Jeffrey F. Naughton
Database Group, University of Wisconsin
- Presented by Andy Mason and Sheng Zhong
Material is partially referenced from SIGMOD 2004
2Overview
- Introduction
- Semantics of Sliding Window Continuous Queries
- Cost Model
- Load Shedding
- Optimization Framework
- Experiments
3Introduction
- The intent of the paper
- Find an execution plan that minimizes resource usage when resources are sufficient
- Find an execution plan that sheds tuples when resources are insufficient
- Given a continuous query in a steady state, each execution plan is similar to a queuing network system
- Arriving tuples are clients
- Query operators are servers
- An execution plan is feasible if the system is stable
- If the plan is infeasible, load shedding is needed
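The feasibility test can be sketched in a few lines. This is a minimal illustration, assuming a single processor shared by all operators, so that a plan is stable when the summed utilization (arrival rate times per-tuple service time) stays below 1. The function names and the numbers are hypothetical and are not taken from the paper.

```python
def plan_utilization(operators):
    """Sum of (arrival rate * per-tuple service time) over all operators.

    `operators` is a list of (tuples_per_second, seconds_per_tuple) pairs.
    Assumption: one processor serves every operator in the plan.
    """
    return sum(rate * service_time for rate, service_time in operators)


def is_feasible(operators):
    """A plan is treated as feasible when total utilization is below 1."""
    return plan_utilization(operators) < 1.0


# Hypothetical plans: two operators each, 10 ms of work per tuple.
light_plan = [(50.0, 0.01), (25.0, 0.01)]   # utilization about 0.75
heavy_plan = [(100.0, 0.01), (25.0, 0.01)]  # utilization about 1.25

print(is_feasible(light_plan), is_feasible(heavy_plan))
```

A plan such as `heavy_plan` cannot keep up with its inputs, so tuples must be dropped somewhere before it becomes stable.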
4Feasible and Infeasible Query Plan
[Figure: a feasible plan (0.5 and 0.25, result < 1) and an infeasible plan (1 and 0.25, result > 1) that requires load shedding]
5Assumptions
- The timestamps are unique (no ties)
- Tuples arrive in the stream in monotonically increasing order of their timestamps (no out-of-order arrival)
- There are no relational tables involved in the query
Discussion: Why make these assumptions?
- Static optimization => rates of input streams are slow-changing
- Enough memory to hold the buffering requirements for any query plan
6Semantics
- Definitions
- Data Stream
- Time-based Window
- Tuple-based Window
- Selection
- A filter takes a stream as input and outputs a stream
- Join
- A symmetric operator that takes two input streams
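The window and join definitions above can be illustrated with a small sketch. It assumes that a tuple-based window keeps the last `size` tuples, that a time-based window expires a tuple once its timestamp is at or below `timestamp - span`, and that the symmetric join probes the opposite window before inserting the new tuple into its own. All names are hypothetical, and the exact boundary conventions in the paper may differ.

```python
from collections import deque


def tuple_window(size):
    """Tuple-based window: a deque that keeps only the last `size` tuples."""
    return deque(maxlen=size)


def insert_time_window(window, timestamp, value, span):
    """Time-based window: append the tuple, then expire old ones.

    Assumption: a tuple expires once its timestamp <= timestamp - span.
    """
    window.append((timestamp, value))
    while window and window[0][0] <= timestamp - span:
        window.popleft()


def symmetric_window_join(events, size, predicate):
    """Join two streams over tuple-based windows of `size` tuples each.

    `events` is a list of (side, value) pairs in arrival order, where side
    is "L" or "R". Each arrival probes the opposite window, then is stored.
    """
    windows = {"L": tuple_window(size), "R": tuple_window(size)}
    output = []
    for side, value in events:
        other = "R" if side == "L" else "L"
        for stored in windows[other]:
            pair = (value, stored) if side == "L" else (stored, value)
            if predicate(*pair):
                output.append(pair)
        windows[side].append(value)
    return output


events = [("L", 1), ("R", 1), ("R", 2), ("L", 2), ("L", 1)]
matches = symmetric_window_join(events, size=1, predicate=lambda a, b: a == b)
print(matches)
```

Because each window holds a single tuple here, the last arrival on the left finds only the value 2 on the right and produces no match.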
The cost model
7Variables
8Rate and Window Calculations
- (1) Select output rate
- (2) Active window size
- (3) Output rate of window join
- (4) Active size of window join
- (5) Output rate of n-ary join of n streams
- (6) Active window size of n-ary join
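As a rough illustration of the kind of quantities listed above, the sketch below uses a commonly used steady-state model. These formulas are assumptions for illustration and may differ in detail from the paper's equations (1) to (6): a filter passes a fraction `selectivity` of its input, a time-based window of `span` seconds holds about rate times span tuples, and a binary window join emits one candidate pair for every arriving tuple paired with every tuple in the opposite window, scaled by the join selectivity. All function and parameter names are hypothetical.

```python
def select_output_rate(input_rate, selectivity):
    """Output rate of a filter: the fraction of input tuples that pass."""
    return selectivity * input_rate


def time_window_size(input_rate, span_seconds):
    """Approximate number of tuples held by a time-based window."""
    return input_rate * span_seconds


def join_output_rate(rate_left, rate_right, window_left, window_right,
                     selectivity):
    """Output rate of a binary window join (assumed model).

    Each left arrival probes `window_right` tuples and each right arrival
    probes `window_left` tuples; `selectivity` of the pairs match.
    """
    probes_per_second = rate_left * window_right + rate_right * window_left
    return selectivity * probes_per_second


print(select_output_rate(100.0, 0.5))
print(time_window_size(20.0, 3.0))
print(join_output_rate(10.0, 20.0, 10, 10, 0.1))
```

Under this model the join output can then be fed into the next join as an input rate, which is how a cost for a multi-way plan could be built up operator by operator.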
9Cost Model
- A concrete example of applying the cost model
- SELECT A.a, B.b, C.c
- FROM A ROWS 10,
- B ROWS 10,
- C ROWS 10
- WHERE A.a = B.a
- AND B.b = C.b
10Cost Model Plans
11Outcome after Load Shedding
12Load Shedding
- A form of approximation that reduces load by dropping tuples from the incoming streams
- Methods of load shedding
- Random dropping of tuples (presented in this paper)
- Achieved by inserting random drop boxes at several points in the query plan
- Semantic dropping of tuples
- Goal: maximize the output rate of the approximated query
- Problems addressed
- Optimal placement of drop boxes in an execution plan and the optimal setting of their sampling rates
- Choice of plan to shed load from
13Selection Only Queries
- Initial condition
- A query consisting of n consecutive filters
- An execution plan for it that orders the filters in ascending order by a designated number
- n! possible combinations
- Observation: it is only necessary to drop tuples directly from the streaming source, before they are processed by any of the filters
- Conclusion: the plan with the lowest cost yields the highest rate
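The conclusion can be checked on a small example. The sketch below is an illustration under stated assumptions, not the paper's algorithm: each filter is a hypothetical (seconds per tuple, selectivity) pair, the cost of an ordering is the utilization of the chain, and when that cost exceeds 1 a single drop box at the source keeps a fraction 1/cost of the tuples. Since every ordering has the same combined selectivity, the cheapest ordering keeps the most tuples and so has the highest output rate.

```python
from itertools import permutations
from math import prod


def chain_cost(source_rate, filters):
    """Utilization of a filter chain; each filter sees what earlier ones pass."""
    cost, passing_rate = 0.0, source_rate
    for seconds_per_tuple, selectivity in filters:
        cost += passing_rate * seconds_per_tuple
        passing_rate *= selectivity
    return cost


def shed_output_rate(source_rate, filters):
    """Output rate after dropping at the source just enough to reach cost 1."""
    cost = chain_cost(source_rate, filters)
    keep = 1.0 if cost <= 1.0 else 1.0 / cost
    return keep * source_rate * prod(sel for _, sel in filters)


source_rate = 100.0                         # tuples per second (hypothetical)
filters = [(0.02, 0.5), (0.01, 0.1)]        # (seconds per tuple, selectivity)

orderings = list(permutations(filters))     # n! candidate plans
cheapest = min(orderings, key=lambda o: chain_cost(source_rate, o))
fastest = max(orderings, key=lambda o: shed_output_rate(source_rate, o))
print(cheapest == fastest)
```

With these numbers the original ordering costs 2.5 and the swapped ordering costs 1.2, so the swapped ordering needs to drop far fewer tuples.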
14Join Queries
- Only consider tuple-based windows
- Shedding Load From a Specific Plan
- Choice of Plan for Load Shedding
15Shedding Load from a Specific Plan
- Where do we put the drop boxes?
- Query plan joining n streams
- Binary joins
- Drop box can be put before each of the two inputs to the n - 1 join operators
- Plus a box right after the last join is performed
- 2n - 1 possible locations
Observation: it is sufficient to drop tuples from the input sources before they are processed by any join operator
16Choice of Load Shedding Plan
- Intuition for Selection queries
- Pick plan with lowest resource utilization
- Join queries
- Plan with lowest resource utilization?
- This intuition does not always work
- Why?
17Load Shedding Plan Example
- Plans shed load in the order of their average utilization
- Switch-over occurs at 4.5 milliseconds (plan bbest)
18Observations from Example
- The plan with the lowest utilization is not always the best choice for shedding load
- When the join cost is 14 milliseconds, the throughput of the best plan is more than twice the throughput of the lowest utilization plan
- Lowest utilization plan could be the worst choice
- Conclusion: load shedding must be integrated in the optimization process
19Optimization Framework
- Two areas
- Throughput of the plan
- Utilization cost of the plan
- Feasible queries
- Goal: minimize cost of the plan
- Where throughput is fixed at its maximum value for all feasible queries
- Infeasible queries
- Goal: maximize throughput of the plan
- Where cost is fixed at its maximum value for all p
- Assumption
- Search space of alternative plans always equipped with drop boxes
- All plans in the search space will be feasible
- Problem can be treated as unconstrained
20Optimization Goal
- Maximize
- R(p) = plan throughput / plan cost
- Simplest optimization algorithm
- Generate the set of all plans of the query
- For each plan in the set
- Compute cost of the plan
- If cost > 1, insert drop boxes
- Compute R
- Return the plan that maximizes R(p)
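The exhaustive loop above can be sketched as follows. The plan representation, the `cost_of`, `throughput_of`, and `add_drop_boxes` callables, and the toy numbers are all hypothetical stand-ins. In particular, the toy drop-box model simply scales throughput and cost by the same factor until cost reaches 1, which is an assumption made to keep the example short and is not the paper's load shedding algorithm.

```python
def exhaustive_optimize(plans, cost_of, throughput_of, add_drop_boxes):
    """Return the plan maximizing R(p) = throughput / cost, and its R value."""
    best_plan, best_ratio = None, float("-inf")
    for plan in plans:
        if cost_of(plan) > 1:
            # Infeasible plan: shed load so that it becomes feasible.
            plan = add_drop_boxes(plan)
        ratio = throughput_of(plan) / cost_of(plan)
        if ratio > best_ratio:
            best_plan, best_ratio = plan, ratio
    return best_plan, best_ratio


# Toy plans as (name, throughput, cost) triples with made-up numbers.
toy_plans = [("a", 40.0, 2.0), ("b", 30.0, 1.2), ("c", 10.0, 0.8)]


def toy_cost(plan):
    return plan[2]


def toy_throughput(plan):
    return plan[1]


def toy_drop_boxes(plan):
    """Assumed shedding model: scale throughput and cost down to cost 1."""
    name, throughput, cost = plan
    return (name, throughput / cost, 1.0)


best, ratio = exhaustive_optimize(toy_plans, toy_cost, toy_throughput,
                                  toy_drop_boxes)
print(best[0], ratio)
```

In this toy run plan "c" has the lowest utilization, yet plan "b" wins once drop boxes are added, which mirrors the earlier observation that the lowest utilization plan is not always the best one to shed load from.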
21Heuristic Optimizer
- Based on the original System R optimizer
- Builds the plan from the bottom up by storing the best plans for successively larger subsets of the input streams
- Computing the best plan for any subset
- Test whether this subplan is feasible
- If infeasible, tune the values of the drop boxes placed at its input streams using the load shedding algorithm
22Computing the best subset plan
- Test whether this subplan is feasible
- If infeasible, tune the values of the drop boxes placed at its input streams using the load shedding algorithm
- Store subplan
- At any stage
- If a drop box is placed in front of a stream which had another one from a previous round, the two are combined into one drop box whose selectivity is the product of the original two
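The combination rule is a one-line product. A minimal sketch, assuming each drop box is described by the fraction of tuples it lets through (the function name is hypothetical):

```python
def combine_drop_boxes(*keep_fractions):
    """Merge stacked drop boxes into one whose selectivity is the product."""
    combined = 1.0
    for fraction in keep_fractions:
        if not 0.0 <= fraction <= 1.0:
            raise ValueError("a drop box selectivity must lie in [0, 1]")
        combined *= fraction
    return combined


# A box keeping 50% followed by a box keeping 80% keeps about 40% overall.
print(combine_drop_boxes(0.5, 0.8))
```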
23Experiment Setup
- 1000 random continuous queries
- Each query represents a join of five input streaming sources A, B, C, D, E
- Window sizes and join selectivities fixed
- Rates were randomly picked from 10 to 1000 tuples/sec
24Need for Reoptimization
25Average Gain in Throughput over using the Lowest
Utilization Plan
At very low resources, the gain is very significant (almost eightfold at the 1 mark)
26Average and Maximum Gain
27Heuristic Optimizer
Except at very low resources, the performance of
the heuristic optimizer is quite impressive
28Summary
- Presented framework for static optimization of sliding window conjunctive queries over infinite streams
- Cost Model
- Load Shedding
- Load shedding must be integrated in the optimization process!
- Optimization Framework
- Experimental Results
29References
- [1] http://web.cs.wpi.edu/cs525/f06s-EAR/cs525-homepage_files/LITERATURE/SIGMOD04-opt-shed-wisconsin.pdf
- [2] http://se.uwaterloo.ca/tozsu/courses/cs856/F05/Presentations/Week8/Stream_Maryam.pdf