Title: Query Optimization Strategies
1Query Optimization Strategies
- Zachary G. Ives
- University of Pennsylvania
- CIS 650 Implementing Data Management Systems
- January 31, 2005
2Administrivia
- Wednesday
- Some initial suggestions for the project proposal
- Scheduling of the deadline for your midterm
report - The next assignment
- Read Gray et al. (granularity of locks) and Kung
and Robinson (optimistic concurrency control)
papers - Review Kung and Robinson
- Consider how optimistic CC is or isnt useful in
Web-scale data management
3Todays Trivia Question
4Query Optimization
- Challenge pick the query execution plan that
has minimum cost - Sources of cost
- Interactions with other work
- Size of intermediate results
- Choices of algorithms, access methods
- Mismatch between I/O, CPU rates
- Data properties skew, order, placement
5The General Model of Optimization
- Given an AST of a query
- Build a logical query plan
- (Tree of query algebraic operations)
- Transform into better logical plan
- Convert into a physical query plan
- (Includes strategies for executing operations)
6Strategies
- Basically, search, over the space of possible
plans - At least of exponential complexity in the number
of operators - Hence, exhaustive search is generally not
feasible - What can you do?
- Heuristics only INGRES, Oracle until the
mid-90s - Randomized, simulated annealing, many efforts
in the mid-90s - Heuristics plus cost-based join enumeration
System-R - Stratified search heuristics plus cost-based
enumeration of joins and a few other operators
Starburst optimizers - Unified search full cost-based search EXODUS,
Volcano, Cascades optimizers
7What Are the Cost Estimating Factors?
- Some notion of CPU, disk speeds, page sizes,
buffer sizes, - Cost model for every operator
- Some information about tables and data
- Sizes
- Cardinalities
- Number of unique values (from index)
- Histograms, sketches,
8The System-R Approach Heuristics with
Cost-Based Join Enumeration
- Make the following assumptions
- All predicates are evaluated as early as possible
- All data is projected away as early as possible
- Separately consider operations that produce
intermediate state or are blocking - Joins
- Aggregation
- Sorting
- Correlation with a subquery (join, exists, )
- By choosing a join ordering, were automatically
choosing where selections and projections are
pushed why is this so?
9System-R Architecture
- Breaks a query into its blocks, separately
optimizes them - Nested loops join between them, if necessary
- Within a block focuses on joins (only a few
kinds) in dynamic programming enumeration - Principle of optimality best k-way join includes
best (k-1)-way join - Use simple table statistics when available, based
on indices magic numbers where unavailable - Heuristics
- Push sargable selects, projects as low as
possible - Cartesian products only after joins
- Left-linear trees only n2n-1 cost-estimation
operations - Grouping last
- Extra interesting orders dimension
- Grouping, ordering, join attributes
10Example
- Schema R(a,b), S(b,c), T(a,c), U(c,d)
- SELECT d, AVG(c)FROM R,S,T,UWHERE R.aS.b AND
S.cT.c AND R.aT.a AND T.cU.cGROUP BY dORDER
BY d - In relational algebra ?d/AVG(c)(?d(sa lt 2 (R ?
S ? T ? U)))
11Why We Need More than System-R
- Cross-query-block optimizations
- e.g., push a selection predicate from one block
to another - Better statistics
- More general kinds of optimizations
- Optimization of aggregation operations with joins
- Different cost and data models, e.g., OO, XML
- Additional joins, e.g., containment joins
- Can we build an extensible architecture for this?
- Logical, physical, and logical-to-physical
transformations - Enforcers
- Alternative search strategies
- Left-deep plans arent always optimal
- Perhaps we can prune more efficiently
12Optimizer Generators
- Idea behind EXODUS and StarburstBuild a
programming language for writing custom
optimizers! - Rule-based or transformational language
- Describes, using declarative template and
conditions, an equivalent expression - EXODUS compile an optimizer based on a set of
rules - Starburst run a set of rules in the
interpreter a client could customize the rules
13Starburst Query Optimizer
- Corona query processor vs. Core engine follows
RDS and RSS separation - Part of a very ambitious project
- Hydrogen language highly orthogonal (unlike SQL),
and supported many fancy OO concepts (e.g.,
inheritance, methods), recursion special
constraints, etc. much more powerful than SQL
at the time - Some portions of Hydrogen and Corona made their
way into DB2, later SQL standards - Two stages (stratified search)
- Query rewrite/query graph model a
SQL-block-level, relational calculus-like
representation of queries - Plan optimization a System-R-style dynamic
programming phase once query rewrite has completed
14Starburst QGM
- Tries to encode relational calculus-like
concepts - Predicates between variables within each SQL
block body - Variables can be
- Distinguished (set-builder / for-each / F)
- Existential (?)
- Universal (?)
- Returned values from each block (head)
- Predicates across blocks
15Starburst QGM Example
- SELECT partno, price, order_qty
- FROM quotations Q1
- WHERE Q1.partno IN
- (SELECT partnoFROM inventory Q3WHERE
Q3.onhand_qty lt Q1.order_qtyAND Q3.type cpu
16Starburst Query Rewrite
- Focus inter-block optimizations
- Pushing predicates across views, pushing
projections down - Magic sets rewritings
- Simplification, transitivity
- Implemented through production rules
- Condition action rules selected by
- Sequence
- Priority
- Probability distribution
- Search may stop after a budget
- End product logical plan(s), chosen via above
constraints - Normally set is singleton
- Can also CHOOSE among set by invoking the
cost-based plan optimizer
17Query Rewrite Example
- Convert subquery to join
- IF OP1.type Select Æ Q2.type 9 Æ (at each
evaluation of the existential predicate at most
one tuple of T2 satisfies the predicate) - THEN Q2.type F
- Merge operations
- IF OP1.type Select Æ OP2.type Select Æ
Q2.type F Æ (NOT (T1.distinct false Æ
OP2.eliminate_duplicate true)) - THEN
- merge OP2 into OP1
- IF OP2.eliminate_duplicateTHEN
- OP1.eliminate_duplicate true
18Query Rewrite Example
- Convert subquery to join
- IF OP1.type Select Æ Q2.type 9 Æ (at each
evaluation of the existential predicate at most
one tuple of T2 satisfies the predicate) - THEN Q2.type F
- Merge operations
- IF OP1.type Select Æ OP2.type Select Æ
Q2.type F Æ (NOT (T1.distinct false Æ
OP2.eliminate_duplicate true)) - THEN
- merge OP2 into OP1
- IF OP2.eliminate_duplicateTHEN
- OP1.eliminate_duplicate true
19Starburst Plan Optimization
- Separately optimizes each QGM operation (box)
- Grammar of STrategy AlteRnatives (STARs)
- Take high-level operations and turn them into
LOw-LEvel Plan OPerations (LOLEPOPs) - JOIN, UNION, SCAN, SORT, SHIP,
- Tables and plans have properties
- Relational (tables joined, columns accessed,
predicates applied) - Operational (ordering, site)
- Estimated (cost, cardinality)
- GLUE operators SORT, SHIP
- Join enumerator tries alternative join sequences
a la System-R - Can produce bushy trees
- Can have rank/priority with each STAR
20Starburst Pros and Cons
- Pro
- Stratified search generally works well in
practice DB2 UDB has perhaps the best query
optimizer out there - Interesting model of separating calculus-level
and algebra-level optimizations - Generally provides fast performance
- Con
- Interpreted rules were too slow and no database
user ever customized the engine! - Difficult to assign priorities to transformations
- Some QGM transformations that were tried were
difficult to assess without running many
cost-based optimizations - Rules got out of control
21The EXODUS and Volcano Optimizer Generators
- Part of a database toolkit approach to building
systems - A set of libraries and languages for building
databases with custom data models and
functionalities
(rules in E)
(EXODUS)
(gcc)
(MyDB plan)
(MyQL)
22EXODUS/Volcano Model
- Try to unify the notion of logical-logical
transformations and logical-physical
transformations - No stratification as in Starburst everything is
transformations - Challenge efficient search need a lot of
pruning - EXODUS used many heuristics, something called a
MESH - Volcano branch-and-bound pruning, recursion
memoization
23Example Rules
- Physical operators
- operator 2 join
- method 2 hash_join loops_join cartesian_product
- Logical-logical transformations
- join (1,2) -gt join(2,1)
- Logical-physical transformations
- join (1,2) by hash_join (1,2)
- Can get quite hairy
- join 7 (join 8 (1,2), 3) lt-gt join 8(1, join 7
(2,3))ifdef FORWARDif (NOT cover_predicate
(OPERATOR_7 oper_argument, INPUT_2 oper_property,
INPUT_3 oper_property)) REJECT
24So How Does the Optimizer Work?(EXODUS version)
- Needs to enumerate all possible transformations
without repeating - Every expression is stored in a MESH
- Basically, an associative lookup for each
expression, which can link to other entries in
the same MESH
25Search in EXODUS
- Apply a transformation, see if it produces a new
node - If so
- Find cheapest implementation rule
- Also apply all relevant transformation rules, add
results to OPEN set - Propagate revised cost to parents (reanalyze)
- Check parents for new transformation
possibilities (rematch) - Heuristics to guide the search in the OPEN set
- Promise an expected cost factor for each
transformation rule, based on analysis of
averages of the optimizers cost model results - Favor items with high expected payoff over the
current cost - Problem often need to apply 2 rules to get a
benefit use heuristics - Once a full plan is found, optimizer does hill
climbing, only applying a limited set of rules
26Pros and Cons of EXODUS
- Pros
- Unified model of optimization is powerful,
elegant - Very extensible architecture
- Cons
- Combined logical and physical expressions in the
same MESH - equivalent logical plans with different physical
operators (e.g., merge vs. hash joins) were kept
twice - Physical properties werent handled well
- sort enforcers were seldom applied since they
didnt pay off immediately had to hack them
into sort-merge join - Hard-coded transformation, then algorithm
selection, cost analysis - always applied even if not part of the most
promising expression - applied based on perceived benefit biased
towards larger expressions, which meant repeated
re-optimization - Cost wasnt very generic a concept
27Volcano, Successor to EXODUS(Shouldnt it be
LEVITICUS?)
- Re-architected into a top-down, memoized engine
- Depth-first search allows branch-and-bound
pruning - FindBestPlan takes logical expression, physical
properties, cost bound - If already computed, return
- Else compute set of possible moves
- Logical-logical rule
- Compliant logical-physical rule
- Enforcer
- Insert logical expression into lookup table
- Insert physical op, plan into separate lookup
table - Return best plan and cost
- More generic notions of properties and enforcers
(e.g., location, SHIP), cost (an ADT)
28EXODUS, Revision 3 Cascades
- Basically, a cleaner, more object-oriented
version of the Volcano engine - Rumor has it that MS SQL Server is currently
based on a (simplified and streamlined) version
of the Volcano/Cascades optimizer generator
29Optimization Evaluation
- So, which is best?
- Heuristics plus join-based enumeration (System-R)
- Stratified, calculus-then-algebraic (Starburst)
- Con QGM transformations are almost always
heuristics-based - Pro very succinct transformations at QGM level
- Unified algebraic (Volcano/Cascades)
- Con many more rules need to be applied to get
effect of QGM rewrites - Pro unified, fully cost-based model