Title: Adaptive Query Processing
1Adaptive Query Processing
- Amol Deshpande, University of Maryland
- Vijayshankar Raman, IBM Almaden Research Center
Title slide
2Goal
- Research in query processing and optimization has
taken a new track into adaptive query processing
(QP) - Most new research projects that need QP use some
adaptive approach - Whether regular QP, XML QP, text, continuous QP,
web QP, - Selinger-style query processing fallen out of
flavor - Also true in industrial projects, especially new
ones. - Educate audience on what this is all about
- Where adaptive QP works, where it has known
problems, what is unknown - What is Out of Scope of this tutorial
- Parallelism, distributed eddies, continuous query
processing
3Agenda
- Background on non-adaptive Query Processing
- History, where it works, and where it breaks
- Adaptive Query Processing
- Dimensions of Adaptivity
- Evolutionary Adaptive query processing
- Late binding and parametric optimization
- Competition
- Mid-query reoptimization
- Tuple-wise adaptive Query Processing
- Eddies, SteMs, STAIRs
- Commentary
- What works and what fails
- Open problems
4Background on Non-Adaptive Query Processing
- Central value proposition of relational model
declarative queries - You decide what data access (query) you want to
do the DBMS will figure out the right
algorithm (plan) - right usually translated to good enough
- Standard method of realizing this cost-based
query optimization
5Cost-based Query Optimization
Results
Query Optimizer
Query Executor
Compiled Query Plan
Declarative Query
Disk(s)
6Non-Adaptive Query Processing
- Several Improvements over the years
- Wider plan space
- Started out with left deep tree of binary join
operators - Now
- lots more operators
- Many more query transformations
- multi-block SQL
- bushy plans
-
- Plan space has grown to be huge and messy
- Use of statistics to get better costs
- Better dynamic programming techniques
(transformational, memoization) - Good surveys exist Graefe 1993, Chaudhuri
1998
7Success of Non-Adaptive QP
- Used pretty much everywhere
- every commercial DBMS uses some form of
cost-based optimization - Why?
- Plan space is too wide and combinatorial
- Other ideas like randomized optimization simply
dont search widely enough(e.g. all possible
ways of using indexes) - Benefits from 25 years of tuning
- Widely viewed as success story for relational
databases, and as vindication of relational
model - Some qualifications
- DBAs love and demand control (hints,
syntax-direction, plan-fixing across releases, ) - Applications often help the query processor by
splitting a query into multiple phases (precursor
of adaptivity)
8Where does traditional optimization break (1)
- In recent years, optimizers have started to break
at the seams as they are extended to new
environments
SELECT p.profile_id, p.account_id,
p.profile_title, p.profile_default,
r.profile_detail_id AS registrant_detail_id,
r.profile_first_name AS registrant_first_name,
r.profile_last_name AS registrant_last_name,
r.profile_job_title AS registrant_job_title,
r.profile_organization_name AS registrant_organiza
tion_name, r.profile_address1 AS
registrant_address1, r.profile_address2 AS
registrant_address2, r.profile_city AS
registrant_city, r.state_id FROM
TABLE_ACCOUNT_PROFILES p LEFT JOIN
TABLE_ACCOUNT_PROFILE_DETAILS r ON p.profile_id
r.profile_id AND r.profile_type_id 1 LEFT JOIN
TABLE_STATES rs ON r.state_id rs.state_id LEFT
JOIN TABLE_COUNTRIES rc ON
Results
very complex user queries
Query Optimizer
Query Executor
Declarative Query
Disk(s)
9Where does traditional optimization break (1)
- In recent years, optimizers have started to break
at the seams as they are extended to new
environments
wide area federations
SELECT p.profile_id, p.account_id,
p.profile_title, p.profile_default,
r.profile_detail_id AS registrant_detail_id,
r.profile_first_name AS registrant_first_name,
r.profile_last_name AS registrant_last_name,
r.profile_job_title AS registrant_job_title,
r.profile_organization_name AS registrant_organiza
tion_name, r.profile_address1 AS
registrant_address1, r.profile_address2 AS
registrant_address2, r.profile_city AS
registrant_city, r.state_id FROM
TABLE_ACCOUNT_PROFILES p LEFT JOIN
TABLE_ACCOUNT_PROFILE_DETAILS r ON p.profile_id
r.profile_id AND r.profile_type_id 1 LEFT JOIN
TABLE_STATES rs ON r.state_id rs.state_id LEFT
JOIN TABLE_COUNTRIES rc ON
Results
very complex user queries
Query Optimizer
Query Executor
Declarative Query
data streams
X M L
10Where does traditional optimization break (2)
- Bad Statistics
- DBA-challenged environment is the norm
- New data types like XML and text
- Deep Web and federated systems
- Correlations (many optimizers implement only
marginal histograms) - Very dynamic environments
- Continuous queries query runs for ever, so
things change - Wide area federations
11Where does traditional optimization break (3)
- Complex queries
- Large numbers of tables (usually bail to greedy
beyond 10-20 tables in a join) - Queries with Parameter markers
- Select average(salary) from Employee where age gt
1 and age lt 2 - Need for sharing work between queries esp. in
continuous queries - Asynchronous (push) data sources (e.g. data
streams) - New Metrics for QP
- E.g., Interactive metric
- User preferences change too quickly for
optimization to pay off - Pipelined plans focus on early results
12Agenda
- Background on non-adaptive Query Processing
- History, where it works, and where it breaks
- Adaptive Query Processing
- Dimensions of Adaptivity
- Evolutionary Adaptive query processing
- Late binding and parametric optimization
- Competition
- Mid-query reoptimization
- Tuple-wise adaptive Query Processing
- Eddies, SteMs, STAIRs
- Commentary
- What works and what fails
- Open problems
13Idea of Adaptive Query Processing
- Interleave the optimization and execution stages
- We learn better statistics as the query executes
- Two main styles
Evolutionary
Revolutionary
Static Late Inter
Intra
Per Plans Binding
Operator Operator
Tuple
Traditional Dynamic QEP Query Scrambling
Xjoin, DPHJ, Eddies DBMS
Parametric Mid-query Reopt.
Convergent QP
Competitive Progressive Opt.
14Idea of Adaptive Query Processing
- Interleave the optimization and execution stages
- We learn better statistics as the query executes
- Two main styles
Evolutionary
Revolutionary
Static Late Inter
Intra
Per Plans Binding
Operator Operator
Tuple
Traditional Dynamic QEP Query Scrambling
Xjoin, DPHJ, Eddies DBMS
Parametric Mid-query Reopt.
Convergent QP
Competitive Progressive Opt.
15Dimensions of Adaptation
Measure
Analyze
Actuate
Plan
- Measure monitor cardinalities, resource
utilization, etc - Analyze verify compliance with performance goals
(very tricky problem during intermediate stages
of a query) - Plan can involve optimizer, or be a routing
policy - Actuate switch to a new query plan, hopefully
without wasting too much work
16Dimensions of Adaptation
Measure
Analyze
Actuate
Plan
- Traditional database system use a simple form of
inter-query adaptation - Measurements done on the tables directly, and
analyzed during query optimization time - Also, by measuring the cardinalities during
execution - Adaptive selectivity estimation
Chen/Roussopoulos 1994 - LEO Volker et al 2001
- SITS Bruno/Chaudhuri 2002
- But always a single plan is used to execute the
query fully
17Agenda
- Background on non-adaptive Query Processing
- History, where it works, and where it breaks
- Adaptive Query Processing
- Dimensions of Adaptivity
- Evolutionary Adaptive query processing
- Late binding and parametric optimization
- Competition
- Mid-query reoptimization
- Tuple-wise adaptive Query Processing
- Eddies, SteMs, STAIRs
- Commentary
- What works and what fails
- Open problems
18Adaptive Query Processing
Evolutionary
Revolutionary
Static Late Inter
Intra
Per Plans Binding
Operator Operator
Tuple
Traditional Dynamic QEP Query Scrambling
Xjoin, DPHJ, Eddies DBMS
Parametric Mid-query Reopt.
Convergent QP
Competitive Progressive Opt.
19Late Binding and Parametric Optimization
- Idea let optimizer pick multiple plans for each
query - Choose the best one just before execution
Pick one and Actuate
Plan
Q
Plan
Plans
20Late Binding and Parametric Optimization
- Parametric optimization Graefe/Cole 1994,
Ioannidis 1992, Ganguly 1998, etc. - Pick optimal plans for various regions of the
parameter spaces - Do this during dynamic programming itself
- Using plausible assumptions about cost functions
- Challenges
- makes optimization too expensive
- Many plausible assumptions dont hold in practice
(see, e.g., Haritsa 2005) - Cannot address problems that 1st crop up during
execution
selectivity 2
selectivity 1
21Late Binding and Parametric Optimization
- Variation Validity Ranges (Markl/Raman/Lohman/Pi
raheh/Simmens 2005) - Attach to each edge of a query plan, a range of
cardinalities within which plan is optimal - Easier to compute during dynamic programming
optimizer - Use numerical root finding techniques works well
with real cost functions - Extends nicely to more general notions of robust
query optimization - e.g., range within which plan is within 50 of
optimal (Raman/Haas/Reder/Markl 2005). - But, Limited effectiveness
- only tells you if a plan is broken, not how to
fix it - Meshes well with mid-query reoptimization
22Competition
Measure
Run for a while
Actuate a few
Analyze
Pick faster one
Q
Plan
Plan
Plans
- E.g. DEC RDB Antonshenkov 1993 run access
methods competitively - Measure monitor cardinalities, costs of access
methods - Analyze decide which access method has higher
rate - Advantage simple method
- Challenge avoid redundant work, avoid duplicates
- Use a shared hash table global timestamp
Raman/Deshpande/Hellerstein 2003 - Needs more evaluation of overheads
23Adaptive Query Processing
Evolutionary
Revolutionary
Static Late Inter
Intra
Per Plans Binding
Operator Operator
Tuple
Traditional Dynamic QEP Query Scrambling
Xjoin, DPHJ, Eddies DBMS
Parametric Mid-query Reopt.
Convergent QP
Competitive Progressive Opt.
24Mid-Query Reoptimization
- Kabra et al 1998, Markl et al 2005, Babu et al
2005 - Switch query execution plans at well-defined
points - Typically, materialization points
- Advantage
- seems to work well in practice
- Independent implementations
- Challenge
- coarse grained adaptation
- Cannot always reuse work done in prior executions
If plan is sufficiently sub-optimal, switch to
new plan
25Measure monitoring cardinalities
- During execution, monitor cardinalities on each
edge of query plan - E.g., each call to getNext() increments count
- must refine for operators that are executed
multiple times(e.g., inner of nested loops
joins) e.g. Stillger/Markl/Lohman 2001 - As we monitor, we narrow down range of possible
values for true cardinality (primarily the lower
bound) - Can extrapolate to expected value of true
cardinality - E.g.
- Expected value of c2 expected value of c1
(current c2) / (current c1)
c1
c2
op
26Analyze Switching plans Mid-Query
- Re-optimize the original query to form a new plan
- Getting a better plan
- Plug in all cardinality information learned
during this query - Reusing work
- Plug in fully materialized relations from current
executionas materialized views - Materialization points
- Sorts
- Inner of hash joins
- Explicit materializations (uncorrelated inners,
common subexpressions, )
27Analyze when to switch plans
- When no tuples have been output, and
- When current plan is sufficiently sub-optimal,
and - Range of true cardinalities on an edge has no
overlap with validity range for that edge - Or, expected true cardinality falls far outside
of validity range - Cost of reoptimization gt cost of new plan
remaining cost of current plan - Hard to gauge this accurately many variables
- How many intermediate results we will succeed in
reusing - how much better is the new plan
- Relatively easy to gauge this if we switch at a
materialization point - Because cost of reoptimization is negligible
28Adaptive Query Processing
Evolutionary
Revolutionary
Static Late Inter
Intra
Per Plans Binding
Operator Operator
Tuple
Traditional Dynamic QEP Query Scrambling
Xjoin, DPHJ, Eddies DBMS
Parametric Mid-query Reopt.
Convergent QP
Competitive Progressive Opt.
29Query Scrambling
- Amsaleg/Franklin/Tomasic/Urhan 1996,
Urhan/Franklin/Amsaleg 1998 - Designed for coping with delays in a wide area
setting
Query Processor
Measure delays
Network
Response time optimization
Wide Area Data Sources
30Adaptive Query Processing
Evolutionary
Revolutionary
Static Late Inter
Intra
Per Plans Binding
Operator Operator
Tuple
Traditional Dynamic QEP Query Scrambling
Xjoin, DPHJ, Eddies DBMS
Parametric Mid-query Reopt.
Convergent QP
Competitive Progressive Opt.
31Agenda
- Background on non-adaptive Query Processing
- History, where it works, and where it breaks
- Adaptive Query Processing
- Dimensions of Adaptivity
- Evolutionary Adaptive query processing
- Tuple-wise adaptive Query Processing
- Eddies, SteMs, STAIRs
- Commentary
- What works and what fails
- Open problems
32Eddies Avnur/Hellerstein 2000
Eddies combine all the dimensions of adaptivity
into a single operator
Measure
Eddy
Analyze
Actuate
Plan
33Eddies Avnur/Hellerstein 2000
A traditional query plan queries executed using
iterator (getNext) model
Query execution using an eddy
Output
Eddy
S
Output
E
C
C
- An eddy operator
- Intercepts tuples from sources and output
tuples from operators - Uses feedback from the operators to route
S
E
34Example Database
select from students, enrolled, courses where
students.name enrolled.name and
enrolled.course courses.course
Name Level Course Instructor
Jen Senior CS2 Smith
Name Level Course
Joe Junior CS1
Jen Senior CS2
Course Instructor
CS2 Smith
Courses
Name Course
Joe CS1
Jen CS2
Name Level
Joe Junior
Jen Senior
Students
Enrolled
35Query Execution using Eddies
Probe to find matches
Insert with key hash(joe)
HashTable S.Name
HashTable E.Name
Joe Junior
Joe Junior
Joe Junior
Eddy
S
E
Output
C
HashTable E.Course
HashTable C.Course
No matches Eddy processes the next tuple
E C
36Query Execution using Eddies
Probe
Insert
HashTable E.Name
HashTable S.Name
Joe CS1
Joe Jr
Joe CS1
Jen Sr
Joe Jr CS1
Eddy
S
Joe CS1
E
Output
C
HashTable E.Course
HashTable C.Course
Joe Jr CS1
Joe Jr CS1
CS2 Smith
37Query Execution using Eddies
Probe
HashTable S.Name
HashTable E.Name
Joe CS1
Joe Jr
Jen CS2 Smith
Jen Sr
Jen CS2 Smith
Jen Sr. CS2 Smith
Eddy
S
Jen Sr. CS2 Smith
Jen CS2
E
Output
C
HashTable E.Course
HashTable C.Course
Probe
Jen CS2
CS2 Smith
Joe Jr CS1
Jen CS2 Smith
Jen CS2
Note this is a symmetric hash join
38Eddies Postmortem
Output
Output
E C
E S
S E
Courses
C E
Students
Course Instructor
CS2 Smith
Name Level
Joe Junior
Jen Senior
Students
Enrolled
Courses
Enrolled
Name Level
Joe Junior
Jen Senior
Name Course
Joe CS1
Name Course
Jen CS2
Course Instructor
CS2 Smith
- Eddy executes different query execution plans for
different parts of data - Eddy adapts the join order during query execution
- Access methods and join algorithms still chosen
up front
39Eddies and Pipelined Joins
- Simplest way to describe eddies is in terms of
symmetric hash join - This is the approach we used in our example
- Approach extends directly to index nested loops
joins - Can also extend this to non-pipelined joins (e.g.
sort-merge, or hybrid hash) - BUT, opportunities for adaptation are much less
- Only at end of building whole hash table (just
like with mid-query re-optimizations) - Generalized using SteMs and STAIRs
40Eddies Routing Policy Non-adaptive
- Choosing which operator to route a given tuple to
- The brain of the eddy
Send S and E tuples here Returns S JOIN E (SE)
tuples
Non-Adaptive Optimization 1. Run a regular
optimizer to figure out the join order 2. Map
join order into order of tuple routing
Eddy
S
E
Output
C
Send SE, and C tuples here Returns SEC tuples
41Eddies Routing Policy Lottery Scheduling
- Choosing which operator to route a given tuple to
- The brain of the eddy
Send here 99 of the time Send to the other
operator 1 of the time
sent 100 received 2
Lottery Scheduling Avnur 00 Simplified
Description 1. Maintain for each operator
tuples sent tuples returned
cost per tuple 2. Choose (roughly) based on the
above 3. Explore by randomly sending tuples in
the wrong orders
Eddy
S
E
Output
C
sent 10 received 20
42Eddies Routing Policy Statistics Based
- Choosing which operator to route a given tuple to
- The brain of the eddy
Deshpande, Hellerstein 04 1. Learn
statistics on each input table as data streams
in 2. Choose (roughly) based on the statistics
(analogous to traditional QO)
Eddy
S
Output
E
C
43Eddies Routing Policy Interactivity Metric
- Choosing which operator to route a given tuple to
- The brain of the eddy
Raman, Hellerstein 02 Use user
preferences for different kinds of tuples and
different kinds of result columns to determine
the routing order
Eddy
S
Output
E
C
44Agenda
- Background on non-adaptive Query Processing
- History, where it works, and where it breaks
- Adaptive Query Processing
- Dimensions of Adaptivity
- Evolutionary Adaptive query processing
- Tuple-wise adaptive Query Processing
- Eddies, SteMs, STAIRs
- Commentary
- What works and what fails
- Open problems
45State Modules (SteMs) More aggressive adaptation
- Eddies allow dynamic adaptation of join order
- Within a particular spanning tree of the join
graph - Other adaptations
- Access methods especially index vs scan access
methods - Join algorithms e.g., index join (IJ) vs hash
join (HJ) - To guard against wrong cardinality estimates
- To simultaneously optimize for both pipelining
completion time (IJ?HJ) - To deal with memory overflows (HJ?IJ)
46Query Processing with SteMs
- Eddy execs queries by routing tuples to SteMs and
AMs - SteM
- Dictionary of homogeneous tuples
- (insert) build and (search) probe operations
- Access modules (AMs)
- probe with tuples and get matches
- One Example Symmetric Hash Join
- Build SteMR
- Probe SteMS with R tuple
- Build SteMS
- Probe SteMR with R tuple
- Generalizes to n-ary join
- Can simulate other join algorithms, by using
other routing policy - Thus, changing the routing allows adaptation of
join algorithm - Subtle routing constraints needed to ensure
correct execution Raman et al 2003
S bld
R bld
S probe
R probe
Eddy
repeat
S
R
47Another application of SteMs
- Sharing State across queries
- E.g. 1000 queries of the formselect from R
where R.a parameter - Equivalent toselect from R, Q where R.a
Q.parameter - Eddy routes tuples from R SteM and queries from Q
SteM - Useful for continuous-query proc. and multi-query
proc - Need to extend SteMs with eviction (deletion)
operation for window joins - See Madden et. al paper in SIGMOD 2002
48Agenda
- Background on non-adaptive Query Processing
- History, where it works, and where it breaks
- Adaptive Query Processing
- Dimensions of Adaptivity
- Evolutionary Adaptive query processing
- Tuple-wise adaptive Query Processing
- Eddies, SteMs, STAIRs
- Commentary
- What works and what fails
- Open problems
49Query execution using STAIRS Deshpande/Hellerstei
n 2004(Storage, Transformation and Access for
Intermediate Results)
- Extend SteMs to handle reuse of intermediate
results
Probe into E.Name STAIR
s1
s1
s1
s1
50Agenda
- Background on non-adaptive Query Processing
- History, where it works, and where it breaks
- Adaptive Query Processing
- Dimensions of Adaptivity
- Evolutionary Adaptive query processing
- Tuple-wise adaptive Query Processing
- Eddies, SteMs, STAIRs
- Commentary
- What works and what fails
- Open problems
51So where are we
- Evolutionary techniques
- Relatively well understood
- Commercial database support in recent years
- But
- Dont offer fine-grained adaptivity required for
the new environments - Going back to the list of reasons why adaptive
query processing is required
52Evolutionary Techniques
- Can handle to some extent
- Queries with parameter markers
- Complex queries with large tables
- Bad Statistics/correlations (if used in
conjunction with sophisticated synopses) - Can not easily handle
- Deep web and federated systems (maybe)
- Interactive query metrics
- Can not handle
- Very dynamic environments
- Continuous queries query runs for ever, so
things change - Extensive sharing of work between queries esp.
in continuous queries - Asynchronous (push) data sources, triggers, etc.
- Open
- XML (depends on the complexity of the query
language) - Text and other types of complex data
53So where are we
- Revolutionary techniques
- Show great promise
- Have the potential to handle most of the issues
- But
- Much harder to implement
- Seem to have non-negligible overheads in the
common case - Very immature research area
- More work is needed
54Some Open Problems
- Parametric query optimization
- Can potentially avoid the need for adaptive query
processing in many scenarios - Especially parameter markers and correlations
- Few practical techniques so far
- Optimization with expanded plan space
- Eddies can explore a plan space much larger than
traditional plan space - They allow relations to be broken into pieces,
with each piece executed separately - Can we explore this plan space in a non-adaptive
setting ? - Recent work on
- Conditional Planning Deshpande et al, ICDE 2005
- Content-based Routing Babu et al, VLDB 2005
55Some Open Problems
- Eddies
- Routing policies
- Whether eddies will succeed depends on the
routing policies - Little work so far...
- SteMs, STAIRs
- Theoretical analysis of optimization space
- Practical viability analysis
- Especially in the context of continuous query
processing - Parallel, distributed environments, P2P, Grid
- Disk
- Flexibility demanded by adaptive techniques at
odds against the careful scheduling typically
done by DBMSs - Very little work on understanding this
56Conclusions
- Declarative queries are the fundamental
justification for the relational model - Otherwise we could all go back to CODASYL
- Traditional query optimization does not work all
that well outside of a very narrow domain - Evolutionary techniques
- Well understood Offer limited adaptivity
- Revolutionary techniques
- Hold much more promise Immature area
- Very exciting research lies ahead
57Questions ?