Title: Lecture 8: Query Optimization
1Lecture 8 Query Optimization
Sept. 19, 2007 ChengXiang Zhai
Most slides are adapted from Kevin Changs
lecture slides
2DBMS Architecture
User/Web Forms/Applications/DBA
query
transaction
Query Parser
Transaction Manager
Query Rewriter
Logging Recovery
Query Optimizer
Lock Manager
Query Executor
Files Access Methods
Lock Tables
Buffers
Buffer Manager
Main Memory
Storage Manager
Storage
3SQL Constructs
- SECLECT (DISTINCT) ltlist of columnsgt
- FROM ltlist of tablesgt
- WHERE ltlist of Boolean Factorsgt
- GROUP BY ltlist of columnsgt
- HAVING ltlist of Boolean Factorsgt
- ORDER BY ltlist of columnsgt
4SQL Semantics
- Take Cartesian product of FROM tables
- Project only those referenced columns
- WHERE apply all filters in WHERE
- GROUP BY form groups on results
- HAVING apply filter to groups
- ORDER BY make sure results in right order
- DISTINCT remove duplicates
- Q Is this operational semantics efficient?
- Different plans mainly different in the first
three
5Optimization Different Strategies
- Optimal approach
- enumerate each possible plan
- measure its performance by running it
- pick the fastest one
- Heuristics approach
- fixed heuristics all the way through plan
construction - e.g. always nested loop joins, indexed relation
as inner - e.g. order relations from smallest to biggest
- How does Selinger/System R differ from them?
- What are the main components in their approach?
6New Paradigm Cost-based Optimization
- Plan space
- what is the space of query plans?
- Cost estimation
- how to estimate the cost, without executing each?
- Search algorithm
- how to search the space, as guided by cost
estimates
7Space of Query Plans
- What can you do differently? Parameters to tune
- Selections
- Joins
8Space of Query Plans
- Selections
- algorithms sequential, index scan
- ordering why this will matter?
- Joins
- algorithms nested-loop, sort merge, hash
- ordering
- Ordering/Grouping
- can an interesting order be produced by
join/selections? - algorithm sorting, hash-based
- They interleave with each other!
9Assumptions to Reduce the Space?
- Projections
- pushed down to reduce of columns
- Selections
- Joins
10Huge Space!
- Typical assumptions to help reduce the space
- Projections
- pushed down to reduce of columns
- Selections
- pushed down to reduce of rows
- Joins
- left-deep joins (what else can you do?)
- avoid Cartesian products delay it in the plan
- how to avoid Cartesian products?
- May miss an optimal plan!
11Cost/Size Estimation
- Accurate relatively
- goal is to compare plans, not to predict exact
cost - more of an art than an exact science
- Each operator input size, cost, output size
- estimate cost based on input size
- estimate output size (for next operator) or
selectivity - selectivity ratio of output to input
12Cost Estimation Selinger Style
- Input simple DB statistics
- of tuples disk pages
- of distinct values per column
- statistics updated periodically
- Assumption of attribute/predicate independence
- When no estimate available, use magic number
- New/Alternative approaches
- sampling, histogram of DB
13Selectivity Factors Point Selection
- column value
- dept CS
- input size NCARD(Student) 200, ICARD(dept)
10 - output size ? selectivity ?
14Selectivity Factors Range Selection
- What is assumed in these formulas?
- column gt value1
- F (maxValue - value1) / rangeOfValue
- column in value1value2
- F (value2 - value1) / rangeOfValue
- column in (set of values)
- F as union of point selections
15 Selectivity Factors Join Predicates
- column1 column2
- when both columns are keys
- student.sid employee.sid
- one key, the other foreign key
- student.sid enrollment.sid
- student NCARD 200, SID distinct values 200
- enrollment NCARD 800, SID distinct values
100 - 800 (200/200) 800
- none are keys
- user.sid enrollement.sid
- user NCARD 1000, SID distinct values 200
- enrollment NCARD 800, SID distinct values
100 - 800 (1000/200) 4000
- given formula covers all?
16Goal Cost Estimate
- System R cost model
- Sum of I/O and CPU
- ( PAGE) W (RSI CALLS)
- why RSI calls model workload of RSS?
17Search the Plan Space
- Baseline exhaustive search
- enumerate all combinations, and compare their
cost - Search method parameters
- plan tree development
- construction bottom-up, top-down
- modification improve a somehow-constructed tree
- algorithms
- heuristic selections make choices based on
heuristics - branch and bound search bounded by the current
best tree - hill climbing find nearby plans with lowest
cost - dynamic programming construction by greedy
selections - where does System-R approach fit in?
18Plan Search System-R Style
- AKA Selinger style optimization
- Bottom-up
- start from ground relations (in FROM)
- work up the tree to form a plan
- Dynamic programming
- greedily prune subtrees that are obviously
useless - but not necessarily local best at every step
why? - Many other approaches
19DP Local Optimal -gt Global Optimal?
- Think of optimizing (T1 Q T2 Q T3) Q T4 Q T5
- Sometimes it does hold
- does plan (T1 Q T2 Q T3) affect the rest of the
plan? - Not always true
- example?
- but hopefully at least give good plans
20System R Search Start from Relations
- Base relations access
- find all plans for access to each base relation
- push down sargable arguments
- example of non-sargable predicates?
- choose good plans, discard bad ones
- keep cheapest for unordered each interesting
order - Join ordering
- find all ways of joining a pair of base relations
- choose good plans, discard bad ones
21WHERE Clause and Sargable Predicates
- WHERE conditions
- dept CS OR (dept EE AND GPA gt 3.5)
- Normal forms
- DNF (C) OR (E AND G)
- CNF (C OR E) AND (C OR G)
- Difference in using the CNF/DNF factors
22System R Search Left-Deep Join Plans
- Join ordering
- consider only left-deep trees n! ordering for n
tables - basis
- find all ways of joining a pair of base relations
- choose good plans, discard bad ones
- induction until you have a full plan
- k to k1 given plans of k-relation join, add one
more - prefer one with predicate postpone Cartesian
product - can throw away k-plans
- Finally grouping/ordering
- using interesting order
- additional sorting
23Nested Subqueries
- Subqueries optimized separately
- Correlation order of evaluations
- uncorrelated queries are like constants
- correlated queries are like function calls
24Beyond System-R
- Parallel/Distributed DB
- User-defined predicates
- Materialized views
- Adaptive query optimization
- Fuzzy querie
- Machine learning for query optimization?
25What You Should Know
- Query optimization is critical for any commercial
DBMS - good/bad can be orders of magnitude
- The 3 components of the System-R style
optimization - Plan space definition
- Cost estimation
- Search algorithm
- huge number of alternative, semantically
equivalent plans - Ideal goal
- map a declarative query to the most efficient
plan - Conventional wisdom avoid bad plans
- State of the art
- industry most optimizers are System-R style
- academic always a core database research topic
26Carry Away Messages
- Appropriate formulation of the problem leads to
good solutions - Query optimization as a search problem
- Messy problems always require a clean model
- Make assumptions to make the problem tractable
- An approximate solution could be good enough
- Opens up opportunities for further improvement