Title: Query Optimization
1Query Optimization
- RG, Chapter 15
- Lecture 16
2Administrivia
- Homework 3 available today
- Written exercise will be posted on class website
- Due date Tuesday, March 20 by end of class
period - Homework 4 available later this week
- Implement nested loops and hash join operators
for minibase - Due date April 5 (after Spring Break)
- Midterm 2 is 3/22, 2 weeks from today
- In class, covers lectures 10-17
- Review will be held Tuesday 3/20 7-9 pm 306 Soda
Hall
3Review
Now you are here
- Query plans are a tree of operators that compute
the result of a query - Optimization is the process of picking the best
plan - Execution is the process of executing the plan
You were here
- Operators are the building blocks for computing
results of queries - Sort
- Project
- Join
- Filter
- Access methods for files
- ...
4Query Plans turning text into tuples
Query Result
Query
Shift
Name
SELECT A.aname, max(F.feedingshift) FROM Animals
A, Feeding F WHERE A.aid F.aid AND (A.species
'Big Cat' or A.species 'Bear') GROUP BY
A.aname HAVING COUNT() gt 1
Aslan 3
Bageera 3
Elsa 3
Shere Khan 3
Tigger 2
Operators
Query Plan
10 2 1 100 3
10 3 2 100 3
20 3 2 100 3
20 2 3 100 3
30 3 2 100 3
40 100 Aslan Big Cat
50 300 Baloo Bear
60 100 Bageera Big Cat
70 100 Shere Khan Big Cat
90 100 Dumbo Elephant
5Operator Review
- Access Path pulls tuples from tables
- File scans
- Index scans (clustered or unclustered)
- Index-only scans
- Select (or Filter) conditionally excludes tuples
- Can be pushed/combined with Access Path
operator - Use indexes where possible and apply other
predicates on the result - Can also be applied at intermediate point in
query plan - Projection removes columns and duplicates
- Column projection often done by operators
- Duplicate elimination via Sort or Hash
6Operator Review
- Sort sorts tuples in a particular order
- Simple merge sort
- General external merge sort (with various
optimizations) - B tree traversal
- Join combine tuples from 2 other operators
- Page nested loops
- Block nested loops
- Index nested loops
- Sort-merge join
- Hash-join
- Other operators for
- Group By, Temping,
7Query Optimization steps
SELECT A.aname, max(F.feedingshift) FROM Animals
A, Feeding F WHERE A.aid F.aid AND (A.species
'Big Cat' or A.species 'Bear') GROUP BY
A.aname HAVING COUNT() gt 1
Query parser
Block 3
- Parse query from text to intermediate model
- Traverse intermediate model and produce
alternate query plans - Query plan tree of relational operators
- Optimizer keeps track of cost and properties of
plans - Pick the cheapest plan
- Pass cheapest plan on to query execution engine
to execute and produce results of query
Block 2
Block 1
Query optimizer
Cost 200
Cost 500
Cost 150
To execution engine
8Query Blocks Units of Optimization
- Intermediate model is a set of query blocks
- 1 per SELECT/FROM/WHERE/GROUP BY/HAVING clause
SELECT A.aname, max(F.feedingshift) FROM Animals
A, Feeding F WHERE A.aid F.aid AND (A.species
'Big Cat' or A.species 'Bear') GROUP BY
A.aname HAVING COUNT() gt 1
Query Block
- Subqueries produce nested query blocks
- treated as calls to a subroutine, made once per
tuple produced by outer query block - sometimes subqueries can be rewritten to produce
cheaper plan
SELECT S.sname FROM Sailors S WHERE S.age IN
(SELECT MAX (S2.age) FROM Sailors
S2 GROUP BY S2.rating)
X
Outer Query Block
Rewritten Query Block
Nested Query Block
9Query blocks are optimized 1 at a time
- Convert block to relational algebra tree
- Traverse tree and build plan bottom up
- Pick best access method for each relation in FROM
clause - Applying predicates if possible
- Consider all join trees
- All ways to join relations in FROM clause 1-at-a
time - Consider multiple permutations and join methods
- But not all! too many choices
- Restrict to left-deep plans
- Prune bad plans along the way
Query Block
10Converting Query Blocks to Relational Algebra
Trees
- SQL is relationally complete can express
everything in relational algebra
SELECT S.sname FROM Reserves R, Sailors S WHERE
R.sidS.sid AND R.bid100 AND S.ratinggt5
?(sname)?(bid100 ? rating gt 5) (Reserves ??
Sailors)
11SQL extends Relational Algebra
- SQL is more powerful than relational algebra
- extend relational algebra to include aggregate
ops GROUP BY, HAVING - How is this query block expressed?
- SELECT S.sname
- FROM Sailors S
- WHERE S.age IN (constant set from subquery)
- And this query block?
- SELECT MAX (S2.age)
- FROM Sailors S2
- GROUP BY S2.rating
?sname(s(age in set from subquery) Sailors)
s(age in set from subquery) Sailors
?Max(age)(GroupByRating(Sailors) )
GroupByRating(Sailors)
12Why optimize?
- Operators have implementation choices
- Index scan? File scan? Nested loop join? Sort
merge? - Operators can also be applied in different order!
13Motivating Example -- Schema used
Sailors (sid integer, sname string, rating
integer, age real) Reserves (sid integer, bid
integer, day dates, rname string)
- As seen in previous lectures
- Reserves
- Each tuple is 40 bytes long, 100 tuples per
page, 1000 pages. - Assume there are 100 boats
- Sailors
- Each tuple is 50 bytes long, 80 tuples per page,
500 pages. - Assume there are 10 different ratings
- Assume we have 5 pages in our buffer pool!
14Motivating Example
SELECT S.sname FROM Reserves R, Sailors S WHERE
R.sidS.sid AND R.bid100 AND S.ratinggt5
- Cost 5005001000 I/Os
- Not the worst plan, but
- Misses several opportunities
- selections could have been pushed earlier,
- indexes might have been helpful.
- Goal of optimization To find more efficient
plans that compute the same answer.
Plan
1000
500
15Selectivity calculation
- Sailors 500 pages, 80 tuples per page, 10
ratings - Selectivity of S.rating gt 5?
- ½ -gt 50080/2 20,000 tuples
- 20,000/80 250 pages
- Reserves 1000 pages, 100 tuples per page, 100
boats - Selectivity of R.bid 100?
- 1/100 -gt 1000100/100 1000 tuples
- 1000/100 10 pages
16Alternative Plans Push Selects (No Indexes)
250
1000
500
1000
500
500,500 IOs
250,500 IOs
500 250 1000
17Alternative Plans Push Selects (No Indexes)
250
10
250
1000
1000
500
250,500 IOs
500 250 1000
250,500 IOs
18Alternative Plans Try different join order
10
swap
500
1000
6000 IOs
1000 10 500
250,500 IOs
19Alternative Plans Push Selects and precompute
result (No Indexes)
10
250
10
250
500
1000
500
1000
6000 IOs
1000 500 250 (10 250)
4250 IOs
20Alternative Plans Try different join order
10
250
10
250
swap
10
250
1000
500
500
1000
4250 IOs
500 1000 10 (250 10)
4010 IOs
21Optimized query is 124x cheaper than the original!
4010 IOs
500,500 IOs
22More Alternative Plans (No Indexes)
- Main difference Sort Merge Join
- With 5 buffers, cost of plan
- Scan Reserves (1000) write temp T1 (10 pages,
if we have 100 boats,
uniform distribution) 1010. - Scan Sailors (500) write temp T2 (250 pages, if
have 10 ratings) 750. - Sort T1 (2210) sort T2 (24250) merge
(10250) 2300 - Total 4060 page I/Os.
- If use BNL join, join 104250, total cost
2770. - Can also push projections, but must be careful!
- T1 has only sid, T2 only sid, sname
- T1 fits in 3 pgs, cost of BNL under 250 pgs,
total lt 2000.
log4 ceil(10/5) 1 log4 ceil(250/5))3
Ceil(10/3) 4
23More Alt Plans Indexes
(On-the-fly)
sname
(On-the-fly)
- With clustered index on bid of Reserves, we get
100,000/100 1000 tuples on 1000/100 10
pages. - INL with outer not materialized.
rating gt 5
(Index Nested Loops,
with pipelining )
sidsid
(Use hash Index, do not write to temp)
bid100
Sailors
- Projecting out unnecessary fields from outer
doesnt help.
Reserves
- Join column sid is a key for Sailors.
- At most one matching tuple, unclustered index on
sid OK.
- Decision not to push ratinggt5 before the join
is based on - availability of sid index on Sailors.
- Cost Selection of Reserves tuples (10 I/Os)
then, for each, - must get matching Sailors tuple (10001.2)
total 1210 I/Os.
10 I/Os for 1000 tuples on 10 pages For each
tuple assume 1.2 pages to find match
24What is needed for optimization?
- A closed set of operators
- Relational ops (table in, table out)
- Encapsulation based on iterators
- Plan space, based on
- Based on relational equivalences, different
implementations - Cost Estimation, based on
- Cost formulas
- Size estimation, based on
- Catalog information on base tables
- Selectivity (Reduction Factor) estimation
- A search algorithm
- To sift through the plan space based on cost!