Title: Robust Query Processing through Progressive Optimization
1Robust Query Processing through Progressive
Optimization
Volker Markl, Vijayshankar Raman, David
Simmen, Guy Lohman, Hamid Pirahesh, Miso
Cilimdzic
Modified by S. Sudarshan from talk by Raja
Agrawal
2Motivation
- Current optimizers depend heavily upon the
cardinality estimations - What if there errors in those estimations?
- Errors can occur due to
- Inaccurate statistics
- Invalid assumptions (e.g. attribute independence)
3Progressive Query Optimization
- Idea lazily trigger reoptimization during
execution if cardinality counts indicate current
plan is suboptimal - introduces checkpoint (CHECK) operator to compare
actual vs estimated cardinality - key idea precompute cardinality ranges for which
plan is optimal
4Evaluating a re-optimization scheme
- Risk Vs Opportunity
- Risk
- Extent to which re-optimization is not worthwhile
- reoptimization chooses another bad plan
- work redone
- cardinality errors may even cancel, and fixing
one may give an even worse plan! - Opportunity
- Refers to the aggressiveness
- more CHECK operators..
5Background
Risk
Opportunity
- Redbrick
- Star schema with fact table and multiple
dimension tables - First apply selections on dimension tables
- Then decide what plan to use
- Kabra DeWitt 98 (KD98)
- Introduced idea of mid-query reoptimization
- Allow partial results to be use like materialized
views - But ad-hoc cardinality threshold, and only
reoptimize fully materialized plans
6Background
Risk
Opportunity
- Tukwila data integration system
- optimizer may have no idea of statistics
- interleave optimization and query execution
- partial query plans
- Fragment fully pipelined tree with doubly
pipelined hash join - Query Scrambling
- reorder query to deal with delayed sources
7Background
Risk
Opportunity
- Eddies (Telegraph)
- Ingres/DEC Rdb run multiple access methods
competitively then choose - Parametric Query Optimization (PQO)
- e.g. Cole and Graefe 94, Hulgeri and Sudarshan 02
- Choose from a set of plans, each optimal for
selectivity range - POP converse find optimal cardinality range for
a give plan
8Example of Progressive Optimization in Action
9Progressive Query Optimization(POP)
10Architecture of POP
11Architecture of POP
- CHECK operator to find if a plan is suboptimal
- At optimization time, find out cardinality range
(at CHECK location) for which plan is optimal - At run time, ensure cardinality within l,u
- If violated, stop plan execution and reoptimize
- Location of CHECKs
- Re-optimize
- taking observed cardinality into account, and
- exploiting intermediate results where beneficial
- Heuristic limit number of reoptimizations
(default 3)
12Validity Ranges
- Consider a plan edge e that flows rows into
operator o, - let P be the subplan rooted at o.
- The validity range for e is an upper and lower
bound on the number of rows flowing through e,
such that if the range is violated at runtime, we
can guarantee P is suboptimal - Ad-hoc thresholds (proposed earlier) are a bad
idea - E.g. even a 100x error on very small relation may
not make a difference in optimal plan
13Finding Optimality Ranges
- Plan Popt with root operator oopt is being
compared with another plan Palt different only in
the root operator oalt.
14Finding Optimality Ranges
- Need to solve
- cost(Palt , c) cost(Popt , c) 0
- where c is the cardinality on edge e
- Cost functions can be complex/non-linear/non-conti
nuous
15Newton-Raphson Iteration
16What does this achieve?
- Detects suboptimality of the root operator where
Popt and Palt share the same input edges. - Validity range might miss a cross-over point with
a plan that uses a different join order (and
hence has different input edges). - Two plans are structurally equivalent if they
share the same set of edges - where an edge is defined by the set of rows
flowing through it during query execution. - Allows different algorithms, and flipping
inner/outer
17Optimality wrt structurally equivalent plans
- Theorem . Suppose edges edges ei1 , ei2 , ,
eik are seen to be erroneous wrt cardinality.
Then the following statements are equivalent - P is suboptimal with respect to another plan P'
that has the same set of edges e1 , e2 , , em - At least one of Pi1 , Pi2 , , Pik is suboptimal
given the cardinality errors in those edges in
e1 , e2 , , em that lie under them. - At least one of oi1 , oi2 , , oik is a
suboptimal operator given the cardinality errors
in e1 , e2 , , em that are in its input
edges.
18Conservative detection of suboptimality
- Suppose we detect suboptimality of (R Join S)
Join T wrt estimated costs of (R Join T) Join S - During run time, we can never observe the
cardinality of R Join T - We would be making an arbitrary guess as to the
correlation of the predicates on the R and T
tables - Best not to infer suboptimality wrt such
estimates - However, reoptimization may result in a different
join order
19Exploiting Intermediate Results
- All the intermediate results are stored as
temporary MVs - with cardinalities available to the optimizer
- can be reused if it leads to a better plan
- but not necessarily used, e.g. if join result is
very large, and a different join order is
preferred - must be reused if it has performed side-effects
- Reoptimization done as part of same transaction
20Optional use of MV
21Variants of CHECK
- Variants applicable in different cases, trade off
risk for opportunity - Variants
- Lazy checking
- Lazy checking with eager materialization
- Eager checking without compensation
- Eager checking with buffering
- Eager checking with deferred compensation
22Lazy Checking
- Adding CHECKs above a materialization point
(SORT, TEMP etc) - No results have been output yet
- And materialized results can be re-used
- very low overhead
23Lazy checking with eager materialization
- Can insert materialization point if it does not
exists already - Risk overhead of materialization
- Typically done only for outer input of indexed
nested-loop join - low cost if outer is small (as estimated by
optimizer) - and INL is in trouble anyway if outer is large
24Eager Checking
- Lazy checking may be too late
- e.g. if very bad join order chosen, with huge
intermediate results - Idea check even before entire result is
materialized, and stop early - Problem what if some results have already been
output? - Compensation
25Eager Checking
- EC without Compensation
- CHECK is pushed down the materialization point,
into pipeline
26Eager Checking
- EC with buffering
- CHECK and buffer
- output from buffer once sure about bound
- e.g. 0,b), or b,infinity
- else reoptimize
- delayed pipelining
27EC with Deferred Compensation
- Only SPJ queries
- Identifier of all rows returned to the user are
stored in a table S, which is used later in the
new plan for anti-join with the new-result stream
28CHECK Placement
29CHECK Placement
- LCEM and ECB outer side of nested-loop join
- LC above materialization points
- ECWC and ECDC anywhere
- Do not place CHECKs if
- no alternative plan above CHECK
- simple queries with low estimated cost
30Performance Analysis Robustness
- TPC-H Q10 Replace constant in selection on
lineitem by parameter marker, so optimizer
doesnt know actual selectivity - 5 different optimal plans
31Risk Analysis
- Analyze LC, LCEM, ECB
- Can be reoptimized more than once
- Conclusion low overhead/risk
32Opportunity Analysis
- Goal how often does opportunity to reoptimize
arise? - Introduce LC/LCEM/ECB checkpoints
- But turn off reoptimization, and run same plan
- Opportunity region for ECB dotted line
33POP in (in)action
- Real world workload (DMV data and queries)
- Complex predicates leading to cardinality
estimation errors - substring comparison, like, IN, ..
34POP in (in)action (contd.)
- Re-optimization may result in the choice of worse
plan due to - Two estimation errors canceling out each other
- Re-using intermediate results
35Conclusions
- POP gives us a robust mechanism for
re-optimization through inserting of CHECK (in
its various flavors) - Higher opportunity at low risk