Title: Adaptive Query Processing in Looking Glass
1 Adaptive Query Processing in Looking Glass
2 Inside the paper
- Detailed qualitative comparison of adaptive query processing systems.
- Proposal for two new approaches to adaptive query processing.
3 Plan-first execute-next
- The optimizer picks an efficient plan for a query.
- This plan is then used to execute the query.
- Effect: database systems become critically dependent on the optimizer.
- Problem: optimizers can make mistakes because of
  - Outdated statistics
  - Invalid assumptions
  - Lack of statistics
  - Unpredictable and changeable environment
4 Adaptive Query Processing
- Approach to avoid the performance penalty caused by optimizer mistakes
- The optimization and execution stages are interleaved
- Query processing becomes more robust by
  - Correcting optimizer mistakes
  - Coping with unknown statistics
  - Reacting to changes in input characteristics and system conditions
5 Plan-first execute-next approach
6 Adaptive Query Processing Systems
- Plan-based
- Routing-based
- CQ-based
7 Plan-based Systems
8 Routing-based Systems
9 CQ-based Systems
10 Pros and Cons of using an Optimizer
- Effect on plan quality
  - Using an optimizer
    - Pro: can consider large, complex plan spaces
    - Con: susceptible to estimation errors (optimizer mistakes)
  - No optimizer
    - Pro: routing is based on observed, accurate statistics
    - Con: routing algorithms are usually greedy and designed for smaller, simpler plan spaces
- Complexity
  - Con: optimizers are complex modules
  - Pro: routing-based systems are simple
- Run-time overhead
  - Using an optimizer
    - Con: context switching between optimizer and executor
    - Con: plan enumeration and costing may be applied several times
  - No optimizer
    - Con: overhead of routing
    - Con: must continuously ensure that routes correspond to valid plans
11 Tracking Statistics
- For verifying that the values observed during execution match the optimizer's estimates.
- For obtaining run-time values for use in future optimization steps.
- Statistics can be observed
  - Directly from the execution of a plan (in plan-based and CQ-based systems)
  - From the flow of tuples along the current routing path (in routing-based systems)
12 Techniques used for gathering statistics
- Observation
  - Used in plan-based systems
  - Statistics are collected on tuples that pass through selected points in the query
- Exploration
  - Used in routing-based systems
  - A fraction of the input tuples is routed along routes different from the current best route
- Competition
  - Similar to exploration
  - Routes the same set of tuples along multiple competing routes
- Profiling
  - A fraction of tuples is routed through all operators
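The exploration technique above can be sketched as follows. This is a minimal illustration, not the routing policy of any particular system: the names `choose_route`, `route_tuples`, and `explore_fraction` are hypothetical, and the per-tuple cost of each route is assumed to be a fixed number that the router can observe.

```python
import random


def choose_route(avg_cost, explore_fraction, rng):
    """Pick a route for one tuple.

    avg_cost maps route name -> average per-tuple cost observed so far.
    With probability explore_fraction the tuple is sent along a route other
    than the current best one, so statistics on that route stay fresh.
    """
    best = min(avg_cost, key=avg_cost.get)
    others = [r for r in avg_cost if r != best]
    if others and rng.random() < explore_fraction:
        return rng.choice(others)
    return best


def route_tuples(n_tuples, true_cost, explore_fraction=0.1, seed=0):
    """Route n_tuples; return (tuples sent per route, observed average costs)."""
    rng = random.Random(seed)
    counts = {r: 0 for r in true_cost}
    totals = {r: 0.0 for r in true_cost}
    # Start every route at cost 0.0 so that each one is tried at least once.
    avg_cost = {r: 0.0 for r in true_cost}
    for _ in range(n_tuples):
        route = choose_route(avg_cost, explore_fraction, rng)
        counts[route] += 1
        totals[route] += true_cost[route]  # assumed observable per-tuple cost
        avg_cost[route] = totals[route] / counts[route]
    return counts, avg_cost


counts, avg = route_tuples(
    1000, {"select_then_join": 5.0, "join_then_select": 2.0}
)
```

Most tuples end up on the cheaper route, while the explored fraction keeps the estimate for the other route up to date. A larger `explore_fraction` reacts faster to change but sends more tuples along less efficient paths.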
13 Analysis of tracking statistics techniques
- Computational overhead
  - Observation: depends on what statistics are collected
  - Exploration: depends on the fraction of tuples sent along less efficient paths
  - Competition: redundant processing of data
  - Profiling: extra work on a random sample of data
- Accuracy of estimation
  - Observation: high, statistics are observations
  - Exploration: susceptible to bias in routing decisions and correlations
  - Competition: high, statistics are observations
  - Profiling: depends on the sampling fraction
- Coverage of statistics
  - Observation: restricted to what can be observed from the plan
  - Exploration: limited by the large number of alternative routes
  - Competition: low, only a limited number of competing plans can be run
  - Profiling: broad coverage of statistics
14 Re-optimization: When and how to re-optimize
- In plan-based and CQ-based systems re-optimization is invoked explicitly
  - During execution, the statistics that the optimizer estimated are tracked
  - The optimizer is invoked whenever an observed value is found to be significantly different from, or out of range of, the value estimated by the optimizer
- In a routing-based system re-optimization is invoked implicitly
  - The scheme used to route tuples to operators is based on the current statistics
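The explicit trigger described above can be sketched as a range check on tracked statistics. The function names and the multiplicative `tolerance` factor are assumptions made for illustration; the slides do not say how "significantly different" is defined.

```python
def needs_reoptimization(estimated, observed, tolerance=2.0):
    """True if observed lies outside [estimated / tolerance, estimated * tolerance].

    The multiplicative tolerance band is a hypothetical choice of what
    "significantly different" means.
    """
    low, high = estimated / tolerance, estimated * tolerance
    return not (low <= observed <= high)


def violated_statistics(estimates, observations, tolerance=2.0):
    """Return the names of tracked statistics whose observed value is out of range."""
    return [
        name
        for name, est in estimates.items()
        if name in observations
        and needs_reoptimization(est, observations[name], tolerance)
    ]


estimates = {"card(R)": 1000, "sel(p1)": 0.10}
observations = {"card(R)": 5000, "sel(p1)": 0.12}
violations = violated_statistics(estimates, observations)
# A non-empty list would be the signal to invoke the optimizer again.
```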
15 Re-optimization: Plan switching
- Re-optimization in plan-based or CQ-based systems may decide that a new plan is better than the current plan
- Important issues when switching between plans
  - Correctness
    - The new plan must not output result tuples that have already been output by previous plans, nor miss any result tuples
  - Reuse of work
    - The new plan should consider reusing the parts of the query that were processed by previous plans
    - Attention must be paid to the efficiency of reusing work
  - Plan state
    - The techniques used for changing plans should account for the state of the plan
16 Techniques used by AQP for plan switching
- Avoiding duplicate results
  - Plan-based
    - No result is output until processing is complete
    - Keep track of tuples output so far to eliminate duplicates in the future
  - CQ-based
    - Access methods over streams are one-pass scans, so duplicates are never generated
  - Routing-based
    - Routing constraints are enforced in order to eliminate duplicate results
- Reusing work done so far
  - Plan-based: the materialized subexpressions are used on a cost basis
  - CQ-based: migrate state in temporary structures
  - Routing-based: migrate state in temporary structures
- Reducing switching time
  - Plan-based
    - The new plan is started on new input
    - Data partitions processed by different plans are combined only after all sources are exhausted
  - CQ-based
    - Caches that can be dropped fast
    - Techniques to migrate state in parallel with processing
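One of the plan-based options above, keeping track of tuples output so far, can be sketched as follows. The function `run_with_switch` is hypothetical, plans are modeled as plain iterables of result tuples, and result tuples are assumed to be distinct and hashable (for example, because they carry the row ids of the base tuples they were built from).

```python
def run_with_switch(old_plan, new_plan, switch_after):
    """Yield results from old_plan until the switch point, then from new_plan.

    Tuples already output by old_plan are remembered, so new_plan (which
    recomputes the full result) never outputs them a second time.
    """
    emitted = set()
    for i, row in enumerate(old_plan):
        if i >= switch_after:
            break  # re-optimization decided to abandon old_plan here
        emitted.add(row)
        yield row
    for row in new_plan:
        if row not in emitted:
            emitted.add(row)
            yield row


# Both plans compute the same result, in different orders.
old_plan = [(1, "a"), (2, "b"), (3, "c")]
new_plan = [(3, "c"), (1, "a"), (2, "b")]
result = list(run_with_switch(old_plan, new_plan, switch_after=2))
```

The price of this approach is the memory held by `emitted`, which is one reason the alternative of outputting nothing until processing is complete may be preferred.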
17 Run-time overhead
- In order to ensure adaptability, an AQP system incurs run-time overhead
- Low rate of change
  - Plan-based
    - Pro: very low overhead
  - CQ-based
    - Con: overhead is dominated by tracking statistics
  - Routing-based
    - Pro: low overhead of exploring alternative paths
- High rate of change
  - Plan-based
    - Con: may use inefficient plans because of limited re-optimization opportunities
  - CQ-based
    - Pro: more resilient because profiling enables faster, more complete statistics tracking
  - Routing-based
    - Con: inefficient routes may always be in use because exploration takes time to converge
18 Thrashing
- Thrashing happens when an AQP system spends most of its resources on adaptivity-related overhead.
- Safeguards for minimizing thrashing
  - Limiting re-optimization points, re-optimizing only at blocking operators in plans
  - Limiting the number of times re-optimization can be invoked
  - Setting a minimum number of tuples processed, or a minimum time interval, between any two invocations of re-optimization
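The three safeguards above can be combined in one small gatekeeper. This is a sketch under assumptions: the class `ReoptimizationGuard` and its parameters are hypothetical, and the tuple-count variant of the minimum-spacing rule is used rather than a time interval.

```python
class ReoptimizationGuard:
    """Decides whether a re-optimization request is allowed to go through."""

    def __init__(self, max_invocations, min_tuples_between):
        self.max_invocations = max_invocations
        self.min_tuples_between = min_tuples_between
        self.invocations = 0
        self.tuples_since_last = 0

    def on_tuple(self):
        """Call once per processed tuple."""
        self.tuples_since_last += 1

    def may_reoptimize(self, at_blocking_operator):
        """Return True (and record the invocation) only if all safeguards pass."""
        if not at_blocking_operator:
            return False  # only re-optimize at blocking operators
        if self.invocations >= self.max_invocations:
            return False  # invocation budget is used up
        if self.tuples_since_last < self.min_tuples_between:
            return False  # too soon after the previous invocation
        self.invocations += 1
        self.tuples_since_last = 0
        return True


guard = ReoptimizationGuard(max_invocations=2, min_tuples_between=3)
for _ in range(3):
    guard.on_tuple()
first_allowed = guard.may_reoptimize(at_blocking_operator=True)
second_allowed = guard.may_reoptimize(at_blocking_operator=True)  # too soon
```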
19 New Approach to AQP: Proactive optimization
- Current plan-based systems use an optimizer to generate the plan, then detect and respond to suboptimalities (reactive optimization)
- Drawback: when the plan is chosen, a conventional optimizer does not consider issues affecting re-optimization
- Proactive optimization: query plans are chosen with re-optimization in mind.
20 Issues considered during proactive re-optimization
- The potential overhead of tracking statistics during execution, the possibility of change, and plan switching
- The potential for reuse of work in case a plan change is required
- The ability to identify quickly whether the current execution plan is suboptimal
- The ability to decrease uncertainty in statistics
21 Potential run-time overhead for adaptability
- Consider a join of relations R and S
- Consider
  - R small
  - S large, with a clustered index on the join attribute
- An indexed nested-loop join (INLJ) with R as the outer will outperform a hybrid hash join (HHJ)
- If the size of R increases, the HHJ starts to outperform the INLJ
- If the size of R is unknown
  - Use the safe HHJ instead of the risky INLJ, because the INLJ might require a change of plan
- If the size of R is known with a certain confidence to be small, we prefer the INLJ instead of the HHJ
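The INLJ versus HHJ trade-off above can be made concrete with a toy cost model. Everything numeric here is an assumption for illustration: the INLJ is charged a fixed `probe_cost` per outer row, the HHJ is charged one unit per row of R and S, and uncertainty about the size of R is expressed as an interval `[r_low, r_high]`.

```python
def inlj_cost(r_rows, probe_cost=4.0):
    """Hypothetical INLJ cost: one index probe into S per row of R."""
    return r_rows * probe_cost


def hhj_cost(r_rows, s_rows):
    """Hypothetical HHJ cost: read both inputs once."""
    return float(r_rows + s_rows)


def choose_join(r_low, r_high, s_rows, probe_cost=4.0):
    """Pick a join when |R| is only known to lie in [r_low, r_high].

    The INLJ is chosen only if it still wins at the upper bound r_high;
    otherwise the HHJ is the safe choice that will not force a plan change.
    """
    if inlj_cost(r_high, probe_cost) <= hhj_cost(r_high, s_rows):
        return "INLJ"
    return "HHJ"


confident_small = choose_join(r_low=10, r_high=100, s_rows=100_000)
size_unknown = choose_join(r_low=10, r_high=1_000_000, s_rows=100_000)
```

Under these assumed costs the crossover lies where `4 * |R| = |R| + |S|`, that is, at about one third of the size of S.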
22 Potential reuse of work
- A plan P can also be considered risky if the system may not be able to reuse any of the work done by P if re-optimization is required.
- One approach is to generate plans with shorter pipelined segments and more materialization points
23 Identifying Plan Suboptimality Faster
- Some plans allow more assertions to be checked concurrently.
- The system can discover suboptimalities of downstream operators in the plan long before the upstream operators have finished execution
24 Detecting Uncertainty in Statistics
- Consider the join of relations R and S (as in the previous example)
- Suppose that uncertainty in the size of R comes from the presence of selective predicates p1 and p2 on R
- The optimizer can choose to estimate the combined selectivity of p1 and p2 from a sample of R before choosing the join algorithm
- The optimizer can choose pipelined plans for some queries and use profiling to estimate the required statistics
- The optimizer can explore multiple selected subplans to collect statistics
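Estimating the combined selectivity of p1 and p2 from a sample of R can be sketched as below. The function `estimate_selectivity` is hypothetical, R is modeled as an in-memory list of dictionaries, and uniform random sampling is assumed. The example data is deliberately correlated to show why sampling both predicates together matters.

```python
import random


def estimate_selectivity(rows, predicates, sample_size, seed=0):
    """Estimate the fraction of rows that satisfy all predicates, from a sample."""
    rng = random.Random(seed)
    sample = rng.sample(rows, min(sample_size, len(rows)))
    if not sample:
        return 0.0
    hits = sum(1 for row in sample if all(p(row) for p in predicates))
    return hits / len(sample)


# Columns a and b are perfectly correlated in this made-up relation R.
R = [{"a": i % 10, "b": i % 10} for i in range(10_000)]
p1 = lambda row: row["a"] == 0
p2 = lambda row: row["b"] == 0

combined = estimate_selectivity(R, [p1, p2], sample_size=1_000)
# Multiplying the individual selectivities, as an independence assumption
# would, gives 0.1 * 0.1 = 0.01, while the true combined selectivity is 0.1.
```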
25 New Approach for AQP: Plan Logging
- With plan logging for a continuous query Q, the statistics tracker logs the statistics relevant to Q
- The optimizer logs the plan it picked based on those statistics.
- The information in the log for Q can be used as follows
  - Grouping together log entries that contain the same plan P for query Q
  - The AQP system can identify those statistics whose changes contribute most to significant changes in plan performance
  - The history captured can be used for online what-if analysis.
  - It can identify statistics that are prone to transient changes
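The first use of the log above, grouping entries by plan, can be sketched as follows. The log layout (a list of dictionaries with `plan` and `stats` keys) and the function names are assumptions; the slides do not specify a log format.

```python
from collections import defaultdict


def group_by_plan(log):
    """Group logged statistics by the plan the optimizer picked for them."""
    groups = defaultdict(list)
    for entry in log:
        groups[entry["plan"]].append(entry["stats"])
    return dict(groups)


def stat_ranges(groups):
    """For each plan, the (min, max) of every statistic under which it was picked."""
    ranges = {}
    for plan, stats_list in groups.items():
        names = set().union(*(s.keys() for s in stats_list))
        ranges[plan] = {
            n: (
                min(s[n] for s in stats_list if n in s),
                max(s[n] for s in stats_list if n in s),
            )
            for n in names
        }
    return ranges


# Hypothetical log for one continuous query Q.
log = [
    {"plan": "P1", "stats": {"rate_R": 10, "sel_p": 0.1}},
    {"plan": "P2", "stats": {"rate_R": 500, "sel_p": 0.1}},
    {"plan": "P1", "stats": {"rate_R": 20, "sel_p": 0.2}},
]
ranges = stat_ranges(group_by_plan(log))
```

In this made-up log the plan changes when `rate_R` changes while `sel_p` stays the same, which hints that `rate_R` is the statistic the plan choice is most sensitive to.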
26 Conclusions
- The main contribution of the paper is the identification of the three families of AQP systems and the detailed comparison of these systems.
- Gives an idea of the optimization aspects that one needs to consider carefully in the case of AQP.
- Underlines the boundary between performance gain and loss when the plan is changed.