Title: Proactive Re-Optimization
1. Proactive Re-Optimization
- by Shivnath Babu, Pedro Bizarro, and David DeWitt
2. General Idea
- Problem
  - Traditional query optimizers (TRADs; plan-first, execute-next) often choose suboptimal plans due to errors in estimates, caused by
    - skewed and correlated data distributions
    - wrong or outdated statistics
3. General Idea
- Solutions
- better statistics
- new algorithms for optimization
- adaptive architectures for execution
- Re-optimization
- Proactive Re-Optimization
4. General Idea
- Current re-optimizers (reactive)
  - use traditional optimizers to pick a plan, and
  - if deviations from the estimates are detected at run-time, they stop execution in order to re-optimize using the new knowledge
  - limiting factor: the traditional optimizer itself
5. General Idea
- Shortcomings of using TRADs in reactive re-optimizers
  - Pr1: they may pick plans that depend heavily on uncertain statistics, making re-optimization very likely
  - Pr2: if the TRAD chooses a new plan, the work done until the estimation error was found is lost
  - Pr3: limited statistics-collection ability, leading to new mistakes
6. General Idea
- Construction of a proactive re-optimizer (Rio) to confront the above problems, with
  - bounding boxes (areas of uncertainty around estimates: intervals instead of single points)
  - use of bounding boxes to generate robust and switchable plans, minimizing re-optimization
  - buffering of results, so that they can be reused in case of a plan change
  - quicker and more accurate statistics collection at run-time using random-sample processing
7. Limitations of Single-point Estimates (Pr1)
- Example (1)
  - select * from R, S
  - where R.a = S.a and R.b > K1 and R.c > K2
- Assume
  - DB-cache = 200MB, R = 500MB, S = 160MB, and σ(R) = 300MB
  - But the optimizer estimates σ(R) = 150MB
8. Limitations of Single-point Estimates (Pr1)
- Two possible plans
  - (left input of the hash join = build relation,
  - right input of the hash join = probe relation)
9. Limitations of Single-point Estimates (Pr1)
10. Limitations of Single-point Estimates (Pr1)
- TRAD will choose P1a, resulting in two passes over R and S (P1b needs only one pass over R and S)
- A reactive re-optimizer (VRO) will choose P1a, computing a validity range for P1a: 100KB ≤ σ(R) ≤ 160MB
  - If σ(R) < 100KB, it is preferable to use an index nested-loops join
  - If σ(R) > 160MB, then P1b is optimal
- In this example, the check σ(R) ≤ 160MB will fail during execution, invoking re-optimization
11. Limitations of Single-point Estimates (Pr1)
- What if we chose P1b from the beginning, although at the estimated point σ(R) = 150MB it is not the optimal plan?
  - P1b's cost is close to optimal at all times
12. Limited Information (Pr3)
- Example (3)
  - select * from R, S, T
  - where R.a = S.a and S.b = T.b and R.c > K1 and R.d = K2
- Assume the sizes of the tables are known accurately to be R = 200MB, S = 50MB, and T = 60MB
- Further assume σ(R) = 80MB
  - But the optimizer underestimates it as 40KB
13-14. Limited Information (Pr3)
- Based on these statistics, the optimizer chooses plan P3a
- Due to the limited information about the actual size of σ(R), a reactive re-optimizer will trigger re-optimization three times, moving step by step from P3a to P3d (the optimal plan) and wasting much execution time (called thrashing)
15. Losing Work in a Pipeline (Pr2)
- In addition, work done in pipelines (e.g. in P3c) may be lost
  - possibly, if the partial work was completed (e.g. pipeline PPL1 in P3c)
  - for certain, if the partial work was not completed (e.g. pipeline PPL2 in P3c)
16. Proactive Re-opt Architecture
17. Components added in Rio
- A validity-ranges optimizer (VRO)
- Rio, the proactive re-optimizer
  - uncertainty buckets and rules on them
  - histograms for statistical estimation
  - operators, such as
    - a hybrid hash-join operator (two possible input subtrees: one base relation + one deep subtree, or two base relations)
18. Components added in Rio
- More operators
  - a switch operator, implementing switchable plans
  - buffer operators
  - read-random-samples operators
- Execution-engine additions
  - the ability to stop execution, re-optimize, and continue
  - an in-memory catalog for run-time statistics
  - an inter-operator communication mechanism based on punctuations
19. Computing Bounding Boxes
- Instead of single-point estimates, use intervals around those estimates
- The size of the box varies with the uncertainty of the estimate (how did we compute the estimate?)
20. Computing Bounding Boxes
- Compute a single-point estimate E and assign to it a corresponding uncertainty bucket U
- U is an integer from 0 (no uncertainty) to 6 (very high uncertainty), depending on the way E was derived
- The bounding box B is an interval [lo, hi] depending on (E, U), computed as follows
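A minimal sketch of this mapping, assuming the box widens multiplicatively around E; the per-bucket widening fractions below are illustrative assumptions, not the paper's actual values:

```python
# Map an uncertainty bucket U (0 = none, ..., 6 = very high) to a
# widening fraction; the concrete fractions here are assumed for
# illustration only.
WIDEN = {0: 0.0, 1: 0.1, 2: 0.25, 3: 0.5, 4: 1.0, 5: 2.0, 6: 4.0}

def bounding_box(E, U):
    """Return the interval [lo, hi] around estimate E for bucket U."""
    d = WIDEN[U]
    lo = E / (1.0 + d)   # shrink multiplicatively on the low side
    hi = E * (1.0 + d)   # grow multiplicatively on the high side
    return lo, hi
```

With U = 0 the box collapses to the single-point estimate; higher buckets produce progressively wider intervals.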
21. Computing Bounding Boxes
22. Plans for Bounding Boxes
- But
  - a single-point estimate yields one optimal plan
  - what about bounding boxes?
- Four cases with a bounding box B
  - Single optimal plan at all points in B
  - Single robust plan (a plan with cost very close to optimal at all points in B)
  - Switchable plan: a set S of plans p, such that
    - at each point pt in B there is a plan p in S whose cost at pt is close to the optimal cost at pt
    - we can decide which p to use once an accurate estimate exists
    - we can pick any p, if the estimate falls in B, without much execution work done so far
  - None of the above
23. Optimizing with Bounding Boxes
- Rio computes bounding boxes for all input sizes used to cost plans
- It then tries to compute a switchable plan, which may also be a single robust or even a single optimal plan (OP > RP > SP)
- If that fails, it picks the optimal plan based on the single-point estimates and computes validity ranges, like VRO
24. Bounding Boxes
- Bounding box of Example 1
25. Bounding Boxes
- No optimal plan exists in B, but there is a robust plan (P1b)
26. Computation of a switchable plan
- For each bounding box we find three plans, considering three costs: CLow, CEst, and CHigh
  - CEst is computed traditionally; attached to it is the plan with the minimum cost in case the real statistics match our single-point estimate
  - CLow (CHigh) is the cost of the plan with the minimum cost at the lower-left (upper-right) corner of the bounding box
27. Computation of a switchable plan
- So we end up with three plans (called seed plans)
  - BestPlanLow, the plan with minimum cost CLow
  - BestPlanEst, the plan with minimum cost CEst
  - BestPlanHigh, the plan with minimum cost CHigh
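The seed-plan selection can be sketched as three calls to a conventional single-point optimizer, one per point of the box; `optimize` is a hypothetical stand-in, not an API from the paper:

```python
# Pick the three seed plans by optimizing at the low corner, the
# single-point estimate, and the high corner of the bounding box.
# `optimize(point)` is assumed to return the minimum-cost plan for
# the given input-size point.
def seed_plans(optimize, low, est, high):
    return {
        "BestPlanLow": optimize(low),    # minimum cost CLow
        "BestPlanEst": optimize(est),    # minimum cost CEst
        "BestPlanHigh": optimize(high),  # minimum cost CHigh
    }
```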
28. Computation of a switchable plan
- We now have the following cases
  - (i) the seeds are all the same plan
  - (ii) the seeds are not all the same, but one of them is a robust plan
  - (iii) the seeds are not all the same and none of them is robust, but a switchable plan can be created from the seeds
  - (iv) no single optimal, robust, or switchable plan can be found
29. Computation of a switchable plan
- In case (i) we use the single optimal plan as the switchable plan (a singleton set)
- In case (ii), to check whether e.g. BestPlanLow is a robust plan, we do the following
  - if CEst of BestPlanLow is close to (e.g. within 20% of) CEst of BestPlanEst, and
  - CHigh of BestPlanLow is close to CHigh of BestPlanHigh,
  - then BestPlanLow is a robust plan
30. Example 1 again
31. Computation of a switchable plan
- In case (iii), a switchable plan is a set of plans S where
  - all plans in S have a different join operator as root (reversed build and probe inputs count as different)
  - all plans in S have the same subplan for the deep-subtree input to the root operator
  - all plans in S have the same base table, but not necessarily the same access path
32. Computation of a switchable plan
33. Computation of a switchable plan
- If, according to the previous rules, the three seeds constitute a switchable set, then we are done
- Otherwise the optimizer builds a set SW_Low of switchable plans with BestPlanLow among them, prunes it so that the three minimum-cost plans (at the Low, Est, and High points) are left, and checks how close these are to CEst and CHigh of BestPlanEst and BestPlanHigh respectively. If they are close enough, SW_Low is a switchable plan
- Otherwise we do the same for the other two seeds
34. Computation of a switchable plan
- If this also fails (case (iv)), then Rio picks BestPlanEst and adds validity ranges (like VRO)
35. Implementing switchable plans
- If we have a switchable set, we can defer choosing among its plans until we have an accurate estimate of the input size, as follows
  - the buffer operator buffers a random f% sample of the input tuples (the output of the deep subplan), followed by an end-of-sample punctuation eos(f)
  - if n tuples are buffered, the total input size is estimated fairly accurately as EstN = 100·n/f
36. Implementing switchable plans
- If EstN falls inside the bounding box, the suitable plan from the switchable set is chosen and executed
- Otherwise, if EstN falls outside the bounding box, re-optimization is triggered
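The two slides above can be sketched together: scale the buffered sample up to an input-size estimate, then either pick a plan from the switchable set or fall back to re-optimization; `pick_plan` is a hypothetical stand-in for the switch operator's choice:

```python
def estimate_input_size(n_buffered, f_percent):
    """EstN = 100 * n / f, where f% of the input was sampled."""
    return 100.0 * n_buffered / f_percent

def choose_or_reoptimize(n_buffered, f_percent, lo, hi, pick_plan):
    est_n = estimate_input_size(n_buffered, f_percent)
    if lo <= est_n <= hi:         # estimate inside the bounding box
        return pick_plan(est_n)   # switch to the suitable plan in the set
    return "RE-OPTIMIZE"          # outside the box: trigger re-optimization
```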
37. Randomization in Table-Scan
- Randomization in table-scan operators is fairly easy
  - if the tuples in T are kept in random order, we scan sequentially until f% of T has been read and then produce eos(f)
  - otherwise, we add an extra bit to each tuple in T, recording whether it belongs to the sample or not
- After computing EstN, we continue with the remaining tuples (those not in the sample)
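An illustrative sketch of the sampling scan described above, using a per-tuple flag as the extra sample bit; names and structure are assumptions for this sketch:

```python
import random

def sampling_scan(tuples, f_percent, rng=None):
    """Emit a random f% sample first, then eos(f), then the rest."""
    rng = rng or random.Random(0)
    # Mark each tuple with the extra 'in sample' bit.
    marked = [rng.random() < f_percent / 100.0 for _ in tuples]
    yield from (t for t, m in zip(tuples, marked) if m)      # the sample
    yield ("eos", f_percent)                                 # end-of-sample punctuation
    yield from (t for t, m in zip(tuples, marked) if not m)  # the rest
```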
38. Randomization in Join
- Read tuples from the probe input into memory until eos(f) is reached
- Join the in-memory tuples with the complete build relation, then send an eos(f) punctuation downstream and discard the sample
- To resume (after computing EstN), join the rest of the probe input with the already-built build relation
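A rough sketch of this sample-first probing, with the punctuation modeled as a sentinel object forwarded downstream; the buffering and discarding of the sample is elided, so this is a simplification of the operator described above:

```python
EOS = object()  # stand-in for the eos(f) punctuation

def hash_join_with_sample(build, probe_stream, key=lambda t: t[0]):
    """Hash join whose probe input arrives as sample, EOS, then the rest."""
    table = {}
    for b in build:                        # build phase: hash the build relation
        table.setdefault(key(b), []).append(b)
    for item in probe_stream:              # probe phase
        if item is EOS:
            yield EOS                      # forward the punctuation downstream
            continue
        for b in table.get(key(item), []):
            yield (b, item)                # matching (build, probe) pairs
```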
39. Experiments
40. Experiments
41. Experiments
42. Experiments
- Increasing Query Complexity
43. Comments on paper
- Many parameters are left to be tuned (?1, ?2, f, U, E, ...); or isn't this possible?
- What if a slight change of the bounding box leads to a single robust plan, or even a single optimal plan, when the whole box doesn't even have a switchable plan?
- Switchable plans are evaluated based on three costs (three seeds). Is this optimal? What about points in between?
44. Comments on paper
- How is randomization (collecting a random sample of tuples) really done?
- Can we reorder the tuples after sampling, to get them back in order?
  - Is this efficient?
  - Do we need them to be in order, or might only intermediate operators need them so?
45. Comments on paper
- The test for choosing switchable sets, and consequently switchable plans, is limited (slide 31)
- More switchable sets possibly exist, but we cannot detect them
- But until we can
  - does it make any sense to use proactive re-optimization in real DBs? (complexity >> profit)
46. Comments on paper
- How efficient is proactive re-optimization in general?
- What is its time complexity when we have various queries?
- It may be applied effectively in distributed (integrated) DBs, where we possibly have limited access to statistical data (but is randomization possible there?)