1
Query Execution (end); Query Optimization
  • Lecture 23

2
Query Execution
3
Outline
  • Logical/physical operators
  • Cost parameters and sorting
  • One-pass algorithms
  • Nested-loop joins
  • Two-pass algorithms

4
Query Execution
[Architecture diagram: the User/Application sends a query or update to the Query compiler, which produces a query execution plan for the Execution engine; the engine issues record/index requests to the Index/record manager, which sends page commands to the Buffer manager, which reads/writes pages through the Storage manager to storage.]
5
Logical vs. Physical Operators
  • Logical operators
  • what they do
  • e.g., union, selection, project, join, grouping
  • Physical operators
  • how they do it
  • e.g., nested loop join, sort-merge join, hash
    join, index join

6
Query Execution Plans
SELECT P.buyer
FROM Purchase P, Person Q
WHERE P.buyer = Q.name AND
      Q.city = 'urbana' AND Q.phone > '5430000'
  • Query plan:
  • logical tree
  • implementation choice at every node
  • scheduling of operations
[Plan diagram: π_buyer on top of σ_{city='urbana' AND phone>'5430000'}, on top of a join ⋈_{buyer=name} (simple nested loops) over Purchase and Person as leaves, one accessed by table scan and the other by index scan.]
Some operators are from relational algebra, and
others (e.g., scan, group) are not.
7
How do We Combine Operations?
  • The iterator model: each operator is implemented
    by 3 functions (see the sketch below)
  • Open: sets up the data structures and performs
    initializations
  • GetNext: returns the next tuple of the result
  • Close: ends the operation and cleans up the data
    structures
  • Enables pipelining!
  • Contrast with the data-driven, materialized model.

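A minimal sketch of this interface in Python (illustrative only; the class names TableScan and Select and the open/get_next/close method names are assumptions chosen to mirror the slide, not an actual engine API):

    # Minimal sketch of the Open/GetNext/Close iterator model (illustrative only).
    class TableScan:
        """Leaf operator: iterates over an in-memory list standing in for a table."""
        def __init__(self, tuples):
            self.tuples = tuples
        def open(self):
            self.pos = 0                       # set up state
        def get_next(self):
            if self.pos >= len(self.tuples):   # exhausted
                return None
            t = self.tuples[self.pos]
            self.pos += 1
            return t
        def close(self):
            self.pos = None                    # clean up

    class Select:
        """Unary operator: passes through tuples that satisfy a predicate."""
        def __init__(self, child, predicate):
            self.child, self.predicate = child, predicate
        def open(self):
            self.child.open()
        def get_next(self):
            while True:
                t = self.child.get_next()
                if t is None or self.predicate(t):
                    return t                   # one tuple at a time: pipelining
        def close(self):
            self.child.close()

    # Usage: selection x > 2 over a scan of [1..5]
    plan = Select(TableScan([1, 2, 3, 4, 5]), lambda x: x > 2)
    plan.open()
    while (t := plan.get_next()) is not None:
        print(t)                               # prints 3, 4, 5
    plan.close()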
8
Cost Parameters
  • Cost parameters:
  • M = number of blocks that fit in main memory
  • B(R) = number of blocks holding R
  • T(R) = number of tuples in R
  • V(R,a) = number of distinct values of the
    attribute a
  • Estimating the cost:
  • Important in optimization (next lecture)
  • Compute I/O cost only
  • We compute the cost to read the tables
  • We don't compute the cost to write the result
    (because of pipelining)

9
Reminder Sorting
  • Two-pass multi-way merge sort (sketch below)
  • Step 1:
  • Read M blocks at a time, sort, write
  • Result: runs of length M on disk
  • Step 2:
  • Merge M-1 runs at a time, write to disk
  • Result: runs of length M(M-1) ≈ M²
  • Cost: 3B(R). Assumption: B(R) ≤ M²

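The two passes can be illustrated with a toy Python sketch, assuming the relation is a list whose "blocks" are simulated by chunks of block_size elements and M is the memory budget in blocks; it shows the algorithm's shape, not real disk I/O:

    import heapq

    # Toy two-pass multi-way merge sort: "blocks" are chunks of block_size items,
    # M is the memory budget in blocks (so pass 1 sorts block_size*M items at a time).
    def two_pass_sort(values, block_size, M):
        chunk = block_size * M
        # Pass 1: read M blocks at a time, sort, "write" a run.
        runs = [sorted(values[i:i + chunk]) for i in range(0, len(values), chunk)]
        # Pass 2: merge up to M-1 runs at once (needs roughly B(R) <= M^2).
        assert len(runs) <= M - 1, "more than two passes would be needed"
        return list(heapq.merge(*runs))

    print(two_pass_sort([5, 3, 8, 1, 9, 2, 7], block_size=1, M=4))
    # [1, 2, 3, 5, 7, 8, 9]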
10
Scanning Tables
  • The table is clustered (i.e., blocks consist only
    of records from this table):
  • Table scan: if we know where the blocks are
  • Index scan: if we have a sparse index to find the
    blocks
  • The table is unclustered (e.g., its records are
    placed on blocks together with other tables):
  • May need one read for each record

11
Cost of the Scan Operator
  • Clustered relation:
  • Table scan: B(R); to sort: 3B(R)
  • Index scan: B(R); to sort: B(R) or 3B(R)
  • Unclustered relation:
  • T(R); to sort: T(R) + 2B(R)

12
One-Pass Algorithms
13
One-pass Algorithms
  • Selection σ(R), projection π(R)
  • Both are tuple-at-a-time algorithms
  • Cost: B(R)
[Diagram: input buffer → unary operator → output buffer]
14
One-pass Algorithms
  • Duplicate elimination δ(R) (sketch below)
  • Need to keep a dictionary in memory:
  • balanced search tree
  • hash table
  • etc.
  • Cost: B(R)
  • Assumption: B(δ(R)) ≤ M

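As a rough illustration, assuming the relation is an iterable of hashable tuples, the in-memory dictionary can be a plain hash set:

    # One-pass duplicate elimination: keep a hash set of tuples seen so far.
    # Assumes the set of distinct tuples fits in memory (B(δ(R)) ≤ M).
    def distinct(tuples):
        seen = set()
        for t in tuples:
            if t not in seen:
                seen.add(t)
                yield t                  # emit each distinct tuple once

    print(list(distinct([(1, "a"), (1, "a"), (2, "b")])))
    # [(1, 'a'), (2, 'b')]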
15
One-pass Algorithms
  • Grouping γ_{city, sum(price)}(R) (sketch below)
  • Need to keep a dictionary in memory
  • Also store the sum(price) for each city
  • Cost: B(R)
  • Assumption: number of cities fits in memory

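A minimal sketch of one-pass grouping, assuming the input is an iterable of (city, price) pairs; the dictionary keyed by city stands in for the in-memory structure on the slide:

    from collections import defaultdict

    # One-pass grouping γ_{city, sum(price)}: one running sum per city.
    # Assumes the number of distinct cities fits in memory.
    def group_sum(rows):
        sums = defaultdict(int)
        for city, price in rows:
            sums[city] += price
        return dict(sums)

    print(group_sum([("Urbana", 10), ("Seattle", 5), ("Urbana", 7)]))
    # {'Urbana': 17, 'Seattle': 5}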
16
One-pass Algorithms
  • Binary operations: R ∩ S, R ∪ S, R - S
  • Assumption: min(B(R), B(S)) ≤ M
  • Scan one table first, then the next, eliminate
    duplicates
  • Cost: B(R) + B(S)

17
Nested loop join
18
Nested Loop Joins
  • Tuple-based nested loop join R ⋈ S:
      for each tuple r in R do
        for each tuple s in S do
          if r and s join then output (r,s)
  • Cost: T(R) · T(S), sometimes only T(R) · B(S)

19
Nested Loop Joins
  • Block-based nested loop join (see the Python
    sketch below):
      for each (M-1) blocks bs of S do
        for each block br of R do
          for each tuple s in bs do
            for each tuple r in br do
              if r and s join then output (r,s)

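A toy Python rendering of this algorithm, assuming relations are lists of tuples, blocks are simulated by chunking, the join is an equality on given key positions, and a hash table is used for the in-memory chunk of S; it is a sketch for intuition, not an I/O-level implementation:

    # Toy block-based nested loop join on r[r_key] == s[s_key].
    # Relations are lists of tuples; "blocks" are chunks of block_size tuples;
    # M is the memory budget in blocks.
    def blocks(rel, block_size):
        return [rel[i:i + block_size] for i in range(0, len(rel), block_size)]

    def block_nested_loop_join(R, S, r_key, s_key, block_size, M):
        r_blocks, s_blocks = blocks(R, block_size), blocks(S, block_size)
        out = []
        # Outer loop: bring M-1 blocks of S into memory at a time.
        for i in range(0, len(s_blocks), M - 1):
            table = {}                               # hash the in-memory S chunk
            for b in s_blocks[i:i + M - 1]:
                for s in b:
                    table.setdefault(s[s_key], []).append(s)
            # Inner loop: scan all of R once per chunk of S.
            for br in r_blocks:
                for r in br:
                    for s in table.get(r[r_key], []):
                        out.append(r + s)
        return out

    R = [(1, "a"), (2, "b"), (3, "c")]
    S = [(1, "x"), (3, "y")]
    print(block_nested_loop_join(R, S, r_key=0, s_key=0, block_size=1, M=3))
    # [(1, 'a', 1, 'x'), (3, 'c', 3, 'y')]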
20
Nested Loop Joins
R ⋈ S
[Diagram: a hash table for a chunk of S (k < B-1 pages) is kept in memory; blocks of R stream through an input buffer one at a time, and joined tuples go to an output buffer.]
21
Nested Loop Joins
  • Block-based nested loop join
  • Cost (worked numbers below):
  • Read S once: cost B(S)
  • Outer loop runs B(S)/(M-1) times, and each time we
    need to read R: cost B(S)B(R)/(M-1)
  • Total cost: B(S) + B(S)B(R)/(M-1)
  • Notice: it is better to iterate over the smaller
    relation first
  • R ⋈ S: R = outer relation, S = inner relation

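For concreteness, with hypothetical sizes B(R) = 1,000, B(S) = 500 and M = 101: with the smaller relation (500 blocks) in the role of S in the formula above, the cost is 500 + 500·1,000/100 = 5,500 I/Os, whereas swapping the roles gives 1,000 + 1,000·500/100 = 6,000 I/Os, which is why the smaller relation should be the one loaded into memory in chunks.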
22
Two-Pass Algorithms
23
Two-Pass Algorithms Based on Sorting
  • Duplicate elimination δ(R)
  • Simple idea: sort first, then eliminate duplicates
  • Step 1: sort runs of size M, write
  • Cost: 2B(R)
  • Step 2: merge M-1 runs, but include each tuple
    only once
  • Cost: B(R)
  • Some complications...
  • Total cost: 3B(R). Assumption: B(R) ≤ M²

24
Two-Pass Algorithms Based on Sorting
  • Grouping γ_{city, sum(price)}(R)
  • Same as before: sort, then compute the sum(price)
    for each group
  • As before: compute sum(price) during the merge
    phase
  • Total cost: 3B(R)
  • Assumption: B(R) ≤ M²

25
Two-Pass Algorithms Based on Sorting
  • Binary operations: R ∩ S, R ∪ S, R - S
  • Idea: sort R, sort S, then do the right thing
  • A closer look:
  • Step 1: split R into runs of size M, then split S
    into runs of size M. Cost: 2B(R) + 2B(S)
  • Step 2: merge M/2 runs from R and M/2 runs from S;
    output a tuple on a case-by-case basis
  • Total cost: 3B(R) + 3B(S)
  • Assumption: B(R) + B(S) ≤ M²

26
Two-Pass Algorithms Based on Sorting
  • Join R ⋈ S
  • Start by sorting both R and S on the join
    attribute:
  • Cost: 4B(R) + 4B(S) (because we need to write to
    disk)
  • Read both relations in sorted order, match tuples
    (sketch below):
  • Cost: B(R) + B(S)
  • Difficulty: many tuples in R may match many in S
  • If at least one set of matching tuples fits in M,
    we are OK
  • Otherwise need a nested loop, higher cost
  • Total cost: 5B(R) + 5B(S)
  • Assumption: B(R) ≤ M², B(S) ≤ M²

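A rough Python sketch of the matching phase, assuming both inputs are already sorted lists of tuples, an equality join on the given key positions, and that each group of tuples sharing a key value fits in memory (the easy case above):

    # Sort-merge join, matching phase only (inputs already sorted on the key).
    # Assumes every group of tuples sharing a key value fits in memory.
    def merge_join(R, S, r_key, s_key):
        out, i, j = [], 0, 0
        while i < len(R) and j < len(S):
            rk, sk = R[i][r_key], S[j][s_key]
            if rk < sk:
                i += 1
            elif rk > sk:
                j += 1
            else:
                # Collect the group of equal keys on each side ...
                i2 = i
                while i2 < len(R) and R[i2][r_key] == rk:
                    i2 += 1
                j2 = j
                while j2 < len(S) and S[j2][s_key] == rk:
                    j2 += 1
                # ... and output their cross product.
                for r in R[i:i2]:
                    for s in S[j:j2]:
                        out.append(r + s)
                i, j = i2, j2
        return out

    R = sorted([(1, "a"), (3, "c"), (3, "d")])
    S = sorted([(3, "y"), (1, "x"), (4, "z")])
    print(merge_join(R, S, 0, 0))
    # [(1, 'a', 1, 'x'), (3, 'c', 3, 'y'), (3, 'd', 3, 'y')]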
27
Two-Pass Algorithms Based on Sorting
  • Join R ⋈ S
  • If the number of tuples in R matching those in S
    is small (or vice versa), we can compute the join
    during the merge phase
  • Total cost: 3B(R) + 3B(S)
  • Assumption: B(R) + B(S) ≤ M²

28
Two-Pass Algorithms Based on Hashing
  • Idea: partition a relation R into buckets on disk
  • Each bucket has size approx. B(R)/M
  • Does each bucket fit in main memory?
  • Yes, if B(R)/M ≤ M, i.e., B(R) ≤ M²
[Diagram: relation R (B(R) blocks) streams through memory and is hashed into buckets 1, 2, ..., M on disk.]
29
Hash-Based Algorithms for δ
  • Recall: δ(R) = duplicate elimination
  • Step 1. Partition R into buckets
  • Step 2. Apply δ to each bucket (it may be read
    into main memory)
  • Cost: 3B(R)
  • Assumption: B(R) ≤ M²

30
Hash-Based Algorithms for γ
  • Recall: γ(R) = grouping and aggregation
  • Step 1. Partition R into buckets
  • Step 2. Apply γ to each bucket (it may be read
    into main memory)
  • Cost: 3B(R)
  • Assumption: B(R) ≤ M²

31
Hash-based Join
  • R ⋈ S
  • Recall the main-memory hash-based join:
  • Scan S, build buckets in main memory
  • Then scan R and join

32
Partitioned Hash Join
  • R ⋈ S
  • Step 1:
  • Hash S into M buckets
  • Send all buckets to disk
  • Step 2:
  • Hash R into M buckets
  • Send all buckets to disk
  • Step 3:
  • Join every corresponding pair of buckets

33
Partitioned Hash Join
  • Partition both relations using a hash function h:
    R tuples in partition i will only match S tuples
    in partition i.
  • Read in a partition of R and hash it using h2
    (≠ h!). Scan the matching partition of S and
    search for matches. (See the sketch below.)

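A toy Python sketch of the partitioned hash join just described, assuming lists of tuples, an equality join on the given key positions, and in-memory lists standing in for the on-disk buckets; h partitions the data and a different function h2 is used for the in-memory tables, as the slide requires:

    # Toy partitioned (Grace) hash join: partition with h, then join each pair
    # of buckets using an in-memory table built with a different function h2.
    def partitioned_hash_join(R, S, r_key, s_key, num_buckets):
        h = lambda v: hash(v) % num_buckets          # partitioning hash
        h2 = lambda v: hash((v, "salt"))             # different hash for phase 2

        r_parts = [[] for _ in range(num_buckets)]
        s_parts = [[] for _ in range(num_buckets)]
        for r in R:
            r_parts[h(r[r_key])].append(r)           # "send bucket to disk"
        for s in S:
            s_parts[h(s[s_key])].append(s)

        out = []
        for i in range(num_buckets):                 # join corresponding buckets
            table = {}                               # in-memory table keyed by h2
            for s in s_parts[i]:
                table.setdefault(h2(s[s_key]), []).append(s)
            for r in r_parts[i]:
                for s in table.get(h2(r[r_key]), []):
                    if r[r_key] == s[s_key]:         # guard against h2 collisions
                        out.append(r + s)
        return out

    R = [(1, "a"), (2, "b"), (3, "c")]
    S = [(3, "y"), (1, "x")]
    print(partitioned_hash_join(R, S, 0, 0, num_buckets=2))
    # [(1, 'a', 1, 'x'), (3, 'c', 3, 'y')]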
34
Partitioned Hash Join
  • Cost: 3B(R) + 3B(S)
  • Assumption: min(B(R), B(S)) ≤ M²

35
Hybrid Hash Join Algorithm
  • When we have more memory: B(S) << M²
  • Partition S into k buckets
  • But keep the first bucket S1 in memory; the other
    k-1 buckets go to disk
  • Partition R into k buckets
  • The first bucket R1 is joined immediately with S1
  • The other k-1 buckets go to disk
  • Finally, join the k-1 pairs of buckets:
  • (R2,S2), (R3,S3), ..., (Rk,Sk)

36
Hybrid Join Algorithm
  • How big should we choose k?
  • Average bucket size for S is B(S)/k
  • Need to fit B(S)/k + (k-1) blocks in memory:
  • B(S)/k + (k-1) ≤ M
  • k ≈ B(S)/M (in fact slightly larger, so the
    in-memory bucket leaves room for the other k-1
    output blocks)

37
Hybrid Join Algorithm
  • How many I/Os? (Worked numbers below.)
  • Recall the cost of the partitioned hash join:
  • 3B(R) + 3B(S)
  • Now we save 2 disk operations (one write and one
    read) for one bucket out of the k buckets
  • Hence we save (2/k)(B(R) + B(S))
  • Cost: (3 - 2/k)(B(R) + B(S))
        ≈ (3 - 2M/B(S))(B(R) + B(S))

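As a hypothetical example: with B(R) = B(S) = 10,000 blocks and M = 1,000, k ≈ B(S)/M = 10, so the saving is 2/k = 20% of the I/Os, and the cost drops from 3·(B(R) + B(S)) = 60,000 to about (3 - 2/10)·20,000 = 56,000 I/Os.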
38
Index-Based Algorithms
  • In a clustered index all tuples with the same
    value of the key are clustered on as few blocks
    as possible

39
Index Based Selection
  • Selection on equality: σ_{a=v}(R)
  • Clustered index on a: cost B(R)/V(R,a)
  • Unclustered index on a: cost T(R)/V(R,a)

40
Index Based Selection
  • Example: B(R) = 2,000, T(R) = 100,000, V(R,a) =
    20; compute the cost of σ_{a=v}(R)
  • Cost of table scan:
  • If R is clustered: B(R) = 2,000 I/Os
  • If R is unclustered: T(R) = 100,000 I/Os
  • Cost of index-based selection:
  • If the index is clustered: B(R)/V(R,a) = 100
  • If the index is unclustered: T(R)/V(R,a) = 5,000
  • Notice: when V(R,a) is small, an unclustered
    index is useless

41
Index Based Join
  • R ⋈ S
  • Assume S has an index on the join attribute a
  • Iterate over R; for each tuple, fetch the
    corresponding tuple(s) from S
  • Assume R is clustered. Cost (worked numbers
    below):
  • If the index is clustered: B(R) + T(R)B(S)/V(S,a)
  • If the index is unclustered: B(R) + T(R)T(S)/V(S,a)

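For instance, with hypothetical values B(R) = 2,000, T(R) = 100,000, B(S) = 5,000, T(S) = 500,000 and V(S,a) = 10,000: a clustered index on S gives 2,000 + 100,000·5,000/10,000 = 52,000 I/Os, while an unclustered index gives 2,000 + 100,000·500,000/10,000 = 5,002,000 I/Os.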
42
Index Based Join
  • Assume both R and S have a sorted index (B+ tree)
    on the join attribute
  • Then perform a merge join (called a zig-zag join)
  • Cost: B(R) + B(S)

43
Optimization
  • Chapter 16
  • At the heart of the database engine
  • Step 1: convert the SQL query to some logical plan
  • Step 2: find a better logical plan, find an
    associated physical plan

44
Converting from SQL to Logical Plans
Select a1, ..., an From R1, ..., Rk Where C

    becomes  Π_{a1,...,an}(σ_C(R1 × R2 × ... × Rk))

Select a1, ..., an From R1, ..., Rk Where C
Group by b1, ..., bl

    becomes  Π_{a1,...,an}(γ_{b1,...,bl, aggs}(σ_C(R1 × R2 × ... × Rk)))
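For instance, applying the first rule to the Purchase/Person query from slide 6 gives (using that slide's table and attribute names):

    Π_{P.buyer}(σ_{P.buyer = Q.name AND Q.city = 'urbana' AND Q.phone > '5430000'}(Purchase × Person))

and the optimizer can then fold the equality selection and the cross product into the join used in that plan.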
45
Converting Nested Queries
  • Select distinct product.name
  • From product
  • Where product.maker in (Select company.name
  •                        From company
  •                        Where company.city = 'Urbana')

Select distinct product.name
From product, company
Where product.maker = company.name AND
      company.city = 'Urbana'
46
Converting Nested Queries
  • Select distinct x.name, x.maker
  • From product x
  • Where x.color = 'blue'
  •   AND x.price > ALL (Select y.price
  •                      From product y
  •                      Where x.maker = y.maker
  •                        AND y.color = 'blue')

How do we convert this one to a logical plan?
47
Converting Nested Queries
Let's compute the complement first:
  • Select distinct x.name, x.maker
  • From product x
  • Where x.color = 'blue'
  •   AND x.price ≤ SOME (Select y.price
  •                       From product y
  •                       Where x.maker = y.maker
  •                         AND y.color = 'blue')

48
Converting Nested Queries
This one becomes an SFW query:
  • Select distinct x.name, x.maker
  • From product x, product y
  • Where x.color = 'blue' AND x.maker = y.maker
  •   AND y.color = 'blue' AND x.price ≤ y.price

This returns exactly the products we DON'T want, so:
49
Converting Nested Queries
  • (Select x.name, x.maker
  •  From product x
  •  Where x.color = 'blue')
  • EXCEPT
  • (Select x.name, x.maker
  •  From product x, product y
  •  Where x.color = 'blue' AND x.maker = y.maker
  •    AND y.color = 'blue' AND x.price ≤ y.price)

50
Optimization the Logical Query Plan
  • Now we have one logical plan
  • Algebraic laws:
  • the foundation for every optimization
  • Two approaches to optimization:
  • Heuristics: apply laws that seem to result in
    cheaper plans
  • Cost-based: estimate the size and cost of
    intermediate results, search systematically for
    the best plan

51
The three components of an optimizer
  • We need three things in an optimizer
  • Algebraic laws
  • An optimization algorithm
  • A cost estimator

52
Algebraic Laws
  • Commutative and associative laws:
  • R ∪ S = S ∪ R,   R ∪ (S ∪ T) = (R ∪ S) ∪ T
  • R ∩ S = S ∩ R,   R ∩ (S ∩ T) = (R ∩ S) ∩ T
  • R ⋈ S = S ⋈ R,   R ⋈ (S ⋈ T) = (R ⋈ S) ⋈ T
  • Distributive laws:
  • R ⋈ (S ∪ T) = (R ⋈ S) ∪ (R ⋈ T)

53
Algebraic Laws
  • Laws involving selection:
  • σ_{C AND C'}(R) = σ_C(σ_{C'}(R)) = σ_C(R) ∩ σ_{C'}(R)
  • σ_{C OR C'}(R) = σ_C(R) ∪ σ_{C'}(R)
  • σ_C(R ⋈ S) = σ_C(R) ⋈ S
  • when C involves only attributes of R
  • σ_C(R - S) = σ_C(R) - S
  • σ_C(R ∪ S) = σ_C(R) ∪ σ_C(S)
  • σ_C(R ∩ S) = σ_C(R) ∩ S

54
Algebraic Laws
  • Example: R(A, B, C, D), S(E, F, G)
  • σ_{F=3}(R ⋈_{D=E} S) = ?
  • σ_{A=5 AND G=9}(R ⋈_{D=E} S) = ?

55
Algebraic Laws
  • Laws involving projections:
  • Π_M(R ⋈ S) = Π_M(Π_P(R) ⋈ Π_Q(S))
  • where P, Q are appropriate subsets of attributes
    (each keeps the attributes of M from its side,
    plus the join attributes)
  • Π_M(Π_N(R)) = Π_{M ∩ N}(R)
  • Example: R(A,B,C,D), S(E, F, G)
  • Π_{A,B,G}(R ⋈ S) = Π_?(Π_?(R) ⋈ Π_?(S))

56
Algebraic Laws
  • Laws involving grouping and aggregation:
  • δ(γ_{A, agg(B)}(R)) = γ_{A, agg(B)}(R)
  • γ_{A, agg(B)}(δ(R)) = γ_{A, agg(B)}(R) if agg is
    duplicate insensitive
  • Which of the following are duplicate
    insensitive? sum, count, avg, min, max
  • γ_{A, agg(D)}(R(A,B) ⋈_{B=C} S(C,D)) =
    γ_{A, agg(D)}(R(A,B) ⋈_{B=C} (γ_{C, agg(D)} S(C,D)))
  • Why is this true?
  • Why would we do it?

57
Heuristic Based Optimizations
  • Query rewriting based on algebraic laws
  • Results in better queries most of the time
  • Heuristic number 1:
  • Push selections down
  • Heuristic number 2:
  • Sometimes push selections up, then down

58
Predicate Pushdown
[Diagram: two plans for π_pname over Product ⋈_{maker=name} Company. Before: a single selection σ_{price>100 AND city='Urbana'} sits above the join. After pushdown: σ_{price>100} is applied to Product and σ_{city='Urbana'} to Company below the join.]
The earlier we process selections, the fewer tuples
we need to manipulate higher up in the tree (but
this may cause us to lose an important ordering of
the tuples, if we use indexes).
59
Predicate Pushdown
Select y.name, Max(x.price)
From product x, company y
Where x.maker = y.name
GroupBy y.name
Having Max(x.price) > 100

Select y.name, Max(x.price)
From product x, company y
Where x.maker = y.name and x.price > 100
GroupBy y.name
Having Max(x.price) > 100
  • For each company, find the maximal price of its
    products.
  • Advantage: the size of the join will be smaller.
  • Requires transformation rules specific to the
    grouping/aggregation operators.
  • Won't work if we replace Max by Min.

60
Pushing predicates up
Bargain view V1: categories with some price < 20,
and the cheapest price

Select V2.name, V2.price
From V1, V2
Where V1.category = V2.category and V1.p = V2.price

Create View V1 AS
  Select x.category, Min(x.price) AS p
  From product x
  Where x.price < 20
  GroupBy x.category

Create View V2 AS
  Select y.name, x.category, x.price
  From product x, company y
  Where x.maker = y.name
61
Query Rewrite: Pushing Predicates Up
Bargain view V1: categories with some price < 20,
and the cheapest price

Select V2.name, V2.price
From V1, V2
Where V1.category = V2.category and V1.p = V2.price
  AND V1.p < 20

Create View V1 AS
  Select x.category, Min(x.price) AS p
  From product x
  Where x.price < 20
  GroupBy x.category

Create View V2 AS
  Select y.name, x.category, x.price
  From product x, company y
  Where x.maker = y.name
62
Query Rewrite: Pushing Predicates Up
Bargain view V1: categories with some price < 20,
and the cheapest price

Select V2.name, V2.price
From V1, V2
Where V1.category = V2.category and V1.p = V2.price
  AND V1.p < 20

Create View V1 AS
  Select x.category, Min(x.price) AS p
  From product x
  Where x.price < 20
  GroupBy x.category

Create View V2 AS
  Select y.name, x.category, x.price
  From product x, company y
  Where x.maker = y.name AND x.price < 20
63
Cost-based Optimizations
  • Main idea: apply algebraic laws until the
    estimated cost is minimal
  • Practically: start from partial plans, introduce
    operators one by one
  • We will see this in a few slides
  • Problem: there are too many ways to apply the
    laws, hence too many (partial) plans