QUERY PROCESSING and OPTIMIZATION

About This Presentation

Title:

QUERY PROCESSING and OPTIMIZATION

Description:

FROM Flight f , Passenger p, Crew c. WHERE f.flightNo = p.flightNo AND ... If needed, indicate records with special marker and give record lengths or offsets ... – PowerPoint PPT presentation

Number of Views:144

Avg rating:3.0/5.0

Slides: 69

Provided by: johnkr8

Category:

more less

Transcript and Presenter's Notes

Title: QUERY PROCESSING and OPTIMIZATION

1
QUERY PROCESSING and OPTIMIZATION

PRESENTED DECEMBER 5, 2006 BY
JOHN KRETTEK
ED BANTI
YUYANG CHEN

2
Agenda of Discussion

I. Query Processing and Optimization Why?
II. Steps of Processing
III. Methods of Optimization
Heuristic (Logical Transformations)
Transformation Rules
Heuristic Optimization Guidelines
Cost Based (Physical Execution Costs)
Data Storage/Access Refresher
Catalog Costs
IV. What All This Means To YOU?

3
Query Processing Optimization

What is Query Processing?
Steps required to transform high level SQL query
into a correct and efficient strategy for
execution and retrieval.
What is Query Optimization?
The activity of choosing a single efficient
execution strategy (from hundreds) as determined
by database catalog statistics.

4
Questions for Query Optimization

Which relational algebra expression, equivalent
to the given query, will lead to the most
efficient solution plan?
For each algebraic operator, what algorithm (of
several available) do we use to compute that
operator?
How do operations pass data (main memory buffer,
disk buffer,)?
Will this plan minimize resource usage?
(CPU/Response Time/Disk)

5
Query Processing Who needs it?
A motivating example
Identify all managers who work in a London
branch SELECT FROM Staff s, Branch b WHERE
s.branchNo b.branchNo AND s.position
Manager AND b.city london

Results in these equivalent relational algebra
statements
(1)?s(positionManager)(cityLondon)(Sta
ff.branchNoBranch.branchNo) (Staff X Branch)
(2) s(positionManager)(cityLondon)
(Staff wvStaff.branchNo Branch.branchNo Branch)
(3) s(positionManager) (Staff)
wvStaff.branchNo Branch.branchNo
s(cityLondon) (Branch)

6
A Motivating Example (cont)

Assume
1000 tuples in Staff.
50 Managers
50 tuples in Branch.
5 London branches
No indexes or sort keys
All temporary results are written back to disk
(memory is small)
Tuples are accessed one at a time (not in blocks)

7
Motivating Example Query 1 (Bad)

s(positionManager)(cityLondon)(Staff.branc
hNoBranch.branchNo) (Staff X Branch)
Requires (100050) disk accesses to read from
Staff and Branch relations
Creates temporary relation of Cartesian Product
(100050) tuples
Requires (100050) disk access to read in
temporary relation and test predicate
Total Work (100050) 2(100050)
101,050 I/O operations

8
Motivating Example Query 2 (Better)
s(positionManager)(cityLondon) (Staff
wvStaff.branchNo Branch.branchNo Branch)

Again requires (100050) disk accesses to read
from Staff and Branch
Joins Staff and Branch on branchNo with 1000
tuples
(1 employee 1 branch )
Requires (1000) disk access to read in joined
relation and check predicate
Total Work (100050) 2(1000)
3050 I/O operations
3300 Improvement over Query 1

9
Motivating Example Query 3 (Best)
s(positionManager) (Staff)
wvStaff.branchNo Branch.branchNo
s(cityLondon) (Branch)

Read Staff relation to determine Managers (1000
reads)
Create 50 tuple relation(50 writes)
Read Branch relation to determine London
branches (50 reads)
Create 5 tuple relation(5 writes)
Join reduced relations and check predicate (50
5 reads)
Total Work 1000 2(50) 5 (50 5)
1160 I/O operations
8700 Improvement over Query 1
Consider if Staff and Branch relations were 10x
size? 100x? Yikes!

10
Three Major Steps of Processing

(1) Query Decomposition
Analysis
Derive Relational Algebra Tree
Normalization
(2) Query Optimization
Heuristic Improve and Refine relational algebra
tree to create equivalent Logical Query Plans
Cost Based Use database statistics to estimate
physical costs of logical operators in LQP to
create Physical Execution Plans
(3) Query Execution - Go Go Query Cheapest!

11
Processing Steps
12
Query Decomposition

ANALYSIS
Lexical Is it even valid SQL?
Syntactic Do the relations/attributes exist and
are the operations valid?
Result is internal tree representation of SQL
query (Parse Tree)

13
Query Decomposition (cont)

RELATIONAL ALGEBRA TREE
Root The desired result of query
Leaf Base relations of query
Non-Leaf Intermediate relation created from
relational algebra operation
NORMALIZATION
Convert WHERE clause into more easily manipulated
form
Conjunctive Normal Form(CNF) (a v b) Ù (c v d)
Ù e Ù f (more efficient)
Disjunctive Normal Form(DNF) Ú
Ú

14
Heuristic Optimization

GOAL
Use relational algebra equivalence rules to
improve the expected performance of a given query
tree.
Consider the example given earlier
Join followed by Selection ( 3050 disk reads)
Selection followed by Join ( 1160 disk reads)

15
Relational Algebra Transformations

Cascade of Selection
(1) sp Ù q Ù r (R) sp(sq(sr(R)))
Commutativity of Selection Operations
(2) sp(sq(R)) sq(sp(R))
In a sequence of projections only the last is
required
(3) PLPMPN(R) PL(R)
Selections can be combined with Cartesian
Products and Joins
(4) sp( R x S ) R wvp S
(5) sp( R wvq S ) R wvq p S

Note The above is an incomplete List! For a
complete list see the text.
16
More Relational Algebra Transformations

Join and Cartesian Product Operations are
Commutative and Associative
(6) R x S S x R
(7) R x (S x T) (R x S) x T
(8) R wvp S S wvp R
(9) (R wvp S) wvq T R wvp (S wvq T)
Selection Distributes over Joins
If predicate p involves attributes of R only
(10) sp( R wvq S ) sp(R) wvq S
If predicate p involves only attributes of R and
q involves only attributes of S
(11) spq(R wvr S) sp(R) wvr sq(S)

17
Optimization Uses The Following Heuristics

Break apart conjunctive selections into a
sequence of simpler selections (preparatory step
for next heuristic).
Move s down the query tree for the earliest
possible execution (reduce number of tuples
processed).
Replace s-x pairs by wv (avoid large intermediate
results).
Break apart and move as far down the tree as
possible lists of projection attributes, create
new projections where possible (reduce tuple
widths early).
Perform the joins with the smallest expected
result first

18
Heuristic Optimization Example

What are the ticket numbers of the pilots flying
to France on 01-01-06?
SELECT p.ticketno
FROM Flight f , Passenger p, Crew c
WHERE f.flightNo p.flightNo AND
f .flightNo c.flightNo AND
f.date 01-01-06 AND
f.to FRA AND
p.name c.name AND
c.job Pilot

Canonical Relational Algebra Expression
19
Heuristic Optimization (Step 1)
20
Heuristic Optimization (Step 2)
21
Heuristic Optimization (Step 3)
22
Heuristic Optimization (Step 4)
23
Heuristic Optimization (Step 5)
24
Heuristic Optimization (Step 6)
25
Physical Execution Plan

Identified optimal Logical Query Plans
Every heuristic not always best transform
Heuristic Analysis reduces search space for cost
evaluation but does not necessarily reduce costs
Annotate Logical Query Plan operators with
physical operations (1 )
Binary vs. Linear search for Selection?
Nested-Loop Join vs. Sort-Merge Join?
Pipelining vs. Materialization?
How does optimizer determine cheapest plan?

26
Physical Searching
27
Physical Storage

Record Placement
Types of Records
Variable Length
Fixed Length
Record Separation
Not needed when record size lt block size
Fixed records dont need it
If needed, indicate records with special marker
and give record lengths or offsets

28
Record Separation

Unspanned
Records must stay within a block
Simpler, but wastes space
Spanned
Records are across multiple blocks
Require pointer at the end of the block to the
next block with that record
Essential if record size gt block size

29
Record Separation

Mixed Record Types Clustering
Different record types within the same block
Why cluster? Frequently accessed records are in
the same block
Has performance downsides if there are many
frequently accessed queries with different
ordering
Split Records
Put fixed records in one place and variable in
another block

30
Record Separation

Sequencing
Order records in sequential blocks based on a key
Indirection
Record address is a combination of various
physical identifiers or an arbitrary bit string
Very flexible but can be costly

31
Accessing Data

What is an index?
Data structure that allows the DBMS to quickly
locate particular records or tuples that meet
specific conditions
Types of indicies
Primary Index
Secondary Index
Dense Index
Sparse Index/Clustering Index
Multilevel Indicies

32
Accessing Data

Primary Index
Index on the attribute that determines the
sequencing of the table
Guarantees that the index is unique
Secondary Index
An index on any other attribute
Does not guarantee unique index

33
Accessing Data

Dense Index
Every value of the indexed attribute appears in
the index
Can tell if record exists without accessing files
Better access to overflow records
Clustering Index
Each index can correspond to many records

34
Dense Index
35
Accessing Data

Sparse Index
Many values of the indexed attribute dont appear
Less index space per record
Can keep more of index in memory
Better for insertions
Multilevel Indices
Build an index on an index
Level 2 Index -gt Level 2 Index -gt Data File

36
Sparse Index
37
B Tree

Use a tree model to hold data or indices
Maintain balanced tree and aim for a bushy
shallow tree

100
120 150 180
30
3 5 11
120 130
180 200
100 101 110
150 156 179
30 35
38
B Tree

Rules
If root is not a leaf, it must have at least two
children
For a tree of order n, each node must have
between n/2 and n pointers and children
For a tree of order n, the number of key values
in a leaf node must be between (n-1)/2 and (n-1)
pointers and children

39
B Tree (cont)

Rules
The number of key values contained in a non-leaf
node is 1 less than the number of pointers
The tree must always be balanced that is, every
path from the root node to a leaf must have the
same length
Leaf nodes are linked in order of key values

40
Hashing

Calculates the address of the page in which the
record is to be stored based on more or more
fields
Each hash points to a bucket
Hash function should evenly distribute the
records throughout the file
A good hash will generate an equal number of keys
to buckets
Keep keys sorted within buckets

41
Hashing
. . .
records
key ? h(key)
. . .
42
Hashing

Types of hashing
Extensible Hashing
Pro
Handle growing files
Less wasted space
No full reorganizations
Con
Uses indirection
Directory doubles in size

43
Hashing

Types of hashing
Linear Hashing
Pro
Handle growing files
Less wasted space
No full reorganizations
No indirection like extensible hashing
Con
Still have overflow chains

44
Indexing vs. Hashing

Hashing is good for
Probes given specific key
SELECT FROM R WHERE R.A 5
Indexing is good for
Range searches
SELECT FROM R WHERE R.A gt 5

45
Cost Model
46
Cost Model

To compare different alternatives we must
evaluate and estimate how expensive (resource
intensive) a specific execution is
Inputs to evaluate
Query
Database statistics
Resource availability
Disk Bandwidth/CPU costs/Network Bandwidth

47
Cost Model

Statistics
Held in the system catalog
nTuples(R), number of tuples in a relation
bFactor(R), number of tuples that fit into a
block
nBlocks(R), number of blocks required to store R
nDistinctA(R), number of distinct values that
appear for attribute in relation
minA(R), maxA(R), min/max possible values for
attribute in relation
SCA(R), average number of tuples that satisfy an
equality condition

48
Selection Operation

Example
Branch (branchNo, street, city, postcode)
Staff (staffNo, fName, lName, branchNo)
Staff
100,000 tuples (denoted as CARD(S))
100 tuples per page
1000 pages (StaffPages)
Branch
40,000 tuples (denoted as CARD(B))
80 tuples per page
500 pages (BranchPages)

49
Selection Operation

SELECT FROM Branch WHERE city B
General Form
Without Index
Search on attributes BranchPages 500
Search on primary key attribute BranchPages
500/2 250
Using Clustered B Tree
Costs
Path from root to the leftmost leaf with
qualifying data entry
Retrieve all leaf pages fulfilling search
criteria
For each leaf page get corresponding data pages
Each data page only retrieved once

50
Selection Operation

Using Index for Selections (Clustered B Tree)
Example 1
SELECT FROM Branch WHERE branchNo 6
1 tuple match (1 data page)
Cost 1 index leaf page 1 data page
Example 2
SELECT FROM Branch WHERE street LIKE c
100 tuple matches (2 data pages)
Cost 1 index leaf page two data pages
Example 3
SELECT FROM Branch WHERE street lt c
10,000 tuple matches (125 data pages)
Cost 2 index leaf pages 125 data pages

51
Selection Operation

Using Index for Selections (Non-Clustered B
Tree)
Example 1
SELECT FROM Branch WHERE branchNo 6
1 tuple match (1 data page)
Cost 1 index leaf page 1 data page
Example 2
SELECT FROM Branch WHERE street LIKE c
100 tuple matches (80 data pages)
Cost 1 index leaf page 100 data pages (some
pages are retrieved twice)
Example 3
SELECT FROM Branch WHERE street lt c
10,000 tuple matches (490 data pages)
Cost 2 index leaf pages 10,000 data pages
(pages will be retrieved several
times)

52
Projection Operation
Extracts vertical subset of relation R, and
produces single relation S

Steps to implementing the Projection Operation
1. Remove the attributes that are not required
Eliminate any duplicate tuples from the required
attribute
Only required if the attributes do not include
the key of relation)
Done by sorting or hashing.

53
Projection Operation

Estimation of Cardinality of result set
Projection contains key
ntuples(S) nTuples(R)
Projection contains non-key attribute A
ntuples(S) SCA(R)

54
Projection Operation
- Duplicate elimination using sorting

Sort tuples of the reduced relation using all
remaining attributes as a sort key.
Duplicates are adjacent and can be easily
removed.
To remove unwanted tuples, need to read all the
tuples of R and copy the required attributes to a
temporary relation.
Estimated Cost
nBlocks(R) nBlocks(R) log2(nBlocks(R))

55
Projection Operation
- Duplicate elimination using hashing
Useful if there is a large number of buffer
blocks for R.

Partitioning
Allocate one buffer block for reading R allocate
(nBuffer 1) blocks for the output.
For each tuple in R,
remove the unwanted attributes
apply hash function h to the combination of the
remaining attributes
write the reduced tuple to the hashed value.
h chosen so that tuples are uniformly distributed
to on of the (nBuffer 1) partitions. Then, two
tuples belonging to different partitions are not
duplicates

56
Projection Operation

Read each of the (nBuffer 1) partitions in
turn.
Apply a different hash function h2() to each
tuple as is read.
Insert the computed hash value into an in-memory
hash table.
If the tuple hashes to the same value as some
other tuple, check whether the two are the same,
and eliminate the new one if it is a duplicate.
Once a partion has been removed, write the tuples
in the hash table to the result file.

57
Projection Operation

Estimated Cost nBlocks(R) nb
nb number of blocks required for a tempory
table that results from Projection on R before
duplicate elimination.
Assume hashing has no overflow
Exclude cost of writing result relation

58
Join Operation

Join is very expensive and must be carefully
optimized
Ways the processor can join
Simple nested loop join
Block nested loop join
Indexed nested loop join
Sort-merge join
Hash join

59
Simple Nested Loop Join

For each tuple in the outer relation, we scan
the entire inner relation (scan each tuple in
inner).
Simplest algorithm
Can easily improve on this using Block Nested
Loop
Basic unit of read/write is disk block
Add two additional loops that process blocks

60
Block Nested Loop Join

Outer loop iterates over one table, inner loop
iterates over the other (two additional loops on
the outside for the disk blocks)
for ib 1nBlocks(R)
for jb 1nBlocks(S)
for i 1nTuples
for j 1nTuples
Cost depends on buffer for outer block loops
nBlocks(R) (nBlocks(R) nBlocks(S))
If buffer has only one block for R and S
nBlocks(R) nBlocks(S) (nBlocks(R) /
(nBuffer-2))
If (nBuffer-2) blocks for R (same number of R
blocks, less for S)
nBlocks(R) nBlocks(S),
If all blocks of R can be read into database
buffer (no outside loops)

61
Block Nested Loop Join
62
Indexed Nested Loop Join

For each tuple in R, use index to lookup matching
tuples in S
avoids enumeration of the Cartesian product of R
and S)
Cost depends on indexing method for example
nBlocks(R) nTuples(R) (nLevelsA(I) 1),
If join attribute A in S is the primary key
nBlocks(R) nTuples(R) (nLevelsA(I) SCA(R)
/ bFactor(R)),
For clustering index I on attribute A

63
Sort- Merge Join
For Equijoins, the most efficient join occurs
when both relations are sorted on the join
attributes. We can look for quality tuples of R
and S by merging the two relations.

If the tables are not sorted on key values,then
sort first and merge the table (log(N))
If tables are sorted on key values,then just
merge (linear time)
Cost
nBlocks(R) log2(nBlocks(R) nBlocks(S)
log2(nBlocks(S)), for sorts
nBlocks(R) nBlocks(S), for merge

64
Hash Join

Use hash map for indexing into other table
Cost
3(nBlocks(R) nBlocks(S))
If hash index is held in memory
Read R S to partition
Write R S to disk
Read R S again to find matching tuples
2(nBlocks(R) nBlocks(S)) lognBuffer-1(nBlocks
(S))-1 Blocks(R) nBlocks(S)
Otherwise (if hash index cannot be held in memory)

65
Hash Join
66
Example Cost Estimation for Join Operation

Assumptions
There are separate hash indexes with no overflow
on the primary key attributes staffNo of Staff
and branchNo of Branch
There are 100 database buffer blocks
The system catalog holds the following
statistics
nTuples(Staff) 6000
bFactor(Staff) 30 ? nBlocks(Staff) 200
nTuples(Branch) 500
bFactor(Branch) 50 ? nBlocks(Branch) 10
nTuples(PropertyForRent) 100,000
bFactor(PropertyForRent) 50 ?
nBlocks(PropertyForRent) 2000
J1 Staff XstaffNo PropertyForRent
J2 Branch XbranchNo PropertyForRent