QUERY PROCESSING and OPTIMIZATION - PowerPoint PPT Presentation

1 / 68
About This Presentation
Title:

QUERY PROCESSING and OPTIMIZATION

Description:

FROM Flight f , Passenger p, Crew c. WHERE f.flightNo = p.flightNo AND ... If needed, indicate records with special marker and give record lengths or offsets ... – PowerPoint PPT presentation

Number of Views:144
Avg rating:3.0/5.0
Slides: 69
Provided by: johnkr8
Category:

less

Transcript and Presenter's Notes

Title: QUERY PROCESSING and OPTIMIZATION


1
QUERY PROCESSING and OPTIMIZATION
  • PRESENTED DECEMBER 5, 2006 BY
  • JOHN KRETTEK
  • ED BANTI
  • YUYANG CHEN

2
Agenda of Discussion
  • I. Query Processing and Optimization Why?
  • II. Steps of Processing
  • III. Methods of Optimization
  • Heuristic (Logical Transformations)
  • Transformation Rules
  • Heuristic Optimization Guidelines
  • Cost Based (Physical Execution Costs)
  • Data Storage/Access Refresher
  • Catalog Costs
  • IV. What All This Means To YOU?

3
Query Processing Optimization
  • What is Query Processing?
  • Steps required to transform high level SQL query
    into a correct and efficient strategy for
    execution and retrieval.
  • What is Query Optimization?
  • The activity of choosing a single efficient
    execution strategy (from hundreds) as determined
    by database catalog statistics.

4
Questions for Query Optimization
  • Which relational algebra expression, equivalent
    to the given query, will lead to the most
    efficient solution plan?
  • For each algebraic operator, what algorithm (of
    several available) do we use to compute that
    operator?
  • How do operations pass data (main memory buffer,
    disk buffer,)?
  • Will this plan minimize resource usage?
    (CPU/Response Time/Disk)

5
Query Processing Who needs it?
A motivating example
Identify all managers who work in a London
branch SELECT FROM Staff s, Branch b WHERE
s.branchNo b.branchNo AND s.position
Manager AND b.city london
  • Results in these equivalent relational algebra
    statements
  • (1)?s(positionManager)(cityLondon)(Sta
    ff.branchNoBranch.branchNo) (Staff X Branch)
  • (2) s(positionManager)(cityLondon)
    (Staff wvStaff.branchNo Branch.branchNo Branch)
  • (3) s(positionManager) (Staff)
    wvStaff.branchNo Branch.branchNo
    s(cityLondon) (Branch)

6
A Motivating Example (cont)
  • Assume
  • 1000 tuples in Staff.
  • 50 Managers
  • 50 tuples in Branch.
  • 5 London branches
  • No indexes or sort keys
  • All temporary results are written back to disk
    (memory is small)
  • Tuples are accessed one at a time (not in blocks)

7
Motivating Example Query 1 (Bad)
  • s(positionManager)(cityLondon)(Staff.branc
    hNoBranch.branchNo) (Staff X Branch)
  • Requires (100050) disk accesses to read from
    Staff and Branch relations
  • Creates temporary relation of Cartesian Product
    (100050) tuples
  • Requires (100050) disk access to read in
    temporary relation and test predicate
  • Total Work (100050) 2(100050)
  • 101,050 I/O operations

8
Motivating Example Query 2 (Better)
s(positionManager)(cityLondon) (Staff
wvStaff.branchNo Branch.branchNo Branch)
  • Again requires (100050) disk accesses to read
    from Staff and Branch
  • Joins Staff and Branch on branchNo with 1000
    tuples
  • (1 employee 1 branch )
  • Requires (1000) disk access to read in joined
    relation and check predicate
  • Total Work (100050) 2(1000)
  • 3050 I/O operations
  • 3300 Improvement over Query 1

9
Motivating Example Query 3 (Best)
s(positionManager) (Staff)
wvStaff.branchNo Branch.branchNo
s(cityLondon) (Branch)
  • Read Staff relation to determine Managers (1000
    reads)
  • Create 50 tuple relation(50 writes)
  • Read Branch relation to determine London
    branches (50 reads)
  • Create 5 tuple relation(5 writes)
  • Join reduced relations and check predicate (50
    5 reads)
  • Total Work 1000 2(50) 5 (50 5)
  • 1160 I/O operations
  • 8700 Improvement over Query 1
  • Consider if Staff and Branch relations were 10x
    size? 100x? Yikes!

10
Three Major Steps of Processing
  • (1) Query Decomposition
  • Analysis
  • Derive Relational Algebra Tree
  • Normalization
  • (2) Query Optimization
  • Heuristic Improve and Refine relational algebra
    tree to create equivalent Logical Query Plans
  • Cost Based Use database statistics to estimate
    physical costs of logical operators in LQP to
    create Physical Execution Plans
  • (3) Query Execution - Go Go Query Cheapest!

11
Processing Steps
12
Query Decomposition
  • ANALYSIS
  • Lexical Is it even valid SQL?
  • Syntactic Do the relations/attributes exist and
    are the operations valid?
  • Result is internal tree representation of SQL
    query (Parse Tree)

13
Query Decomposition (cont)
  • RELATIONAL ALGEBRA TREE
  • Root The desired result of query
  • Leaf Base relations of query
  • Non-Leaf Intermediate relation created from
    relational algebra operation
  • NORMALIZATION
  • Convert WHERE clause into more easily manipulated
    form
  • Conjunctive Normal Form(CNF) (a v b) Ù (c v d)
    Ù e Ù f (more efficient)
  • Disjunctive Normal Form(DNF) Ú
    Ú

14
Heuristic Optimization
  • GOAL
  • Use relational algebra equivalence rules to
    improve the expected performance of a given query
    tree.
  • Consider the example given earlier
  • Join followed by Selection ( 3050 disk reads)
  • Selection followed by Join ( 1160 disk reads)

15
Relational Algebra Transformations
  • Cascade of Selection
  • (1) sp Ù q Ù r (R) sp(sq(sr(R)))
  • Commutativity of Selection Operations
  • (2) sp(sq(R)) sq(sp(R))
  • In a sequence of projections only the last is
    required
  • (3) PLPMPN(R) PL(R)
  • Selections can be combined with Cartesian
    Products and Joins
  • (4) sp( R x S ) R wvp S
  • (5) sp( R wvq S ) R wvq p S

Note The above is an incomplete List! For a
complete list see the text.
16
More Relational Algebra Transformations
  • Join and Cartesian Product Operations are
    Commutative and Associative
  • (6) R x S S x R
  • (7) R x (S x T) (R x S) x T
  • (8) R wvp S S wvp R
  • (9) (R wvp S) wvq T R wvp (S wvq T)
  • Selection Distributes over Joins
  • If predicate p involves attributes of R only
  • (10) sp( R wvq S ) sp(R) wvq S
  • If predicate p involves only attributes of R and
    q involves only attributes of S
  • (11) spq(R wvr S) sp(R) wvr sq(S)

17
Optimization Uses The Following Heuristics
  • Break apart conjunctive selections into a
    sequence of simpler selections (preparatory step
    for next heuristic).
  • Move s down the query tree for the earliest
    possible execution (reduce number of tuples
    processed).
  • Replace s-x pairs by wv (avoid large intermediate
    results).
  • Break apart and move as far down the tree as
    possible lists of projection attributes, create
    new projections where possible (reduce tuple
    widths early).
  • Perform the joins with the smallest expected
    result first

18
Heuristic Optimization Example
  • What are the ticket numbers of the pilots flying
    to France on 01-01-06?
  • SELECT p.ticketno
  • FROM Flight f , Passenger p, Crew c
  • WHERE f.flightNo p.flightNo AND
  • f .flightNo c.flightNo AND
  • f.date 01-01-06 AND
  • f.to FRA AND
  • p.name c.name AND
  • c.job Pilot

Canonical Relational Algebra Expression
19
Heuristic Optimization (Step 1)
20
Heuristic Optimization (Step 2)
21
Heuristic Optimization (Step 3)
22
Heuristic Optimization (Step 4)
23
Heuristic Optimization (Step 5)
24
Heuristic Optimization (Step 6)
25
Physical Execution Plan
  • Identified optimal Logical Query Plans
  • Every heuristic not always best transform
  • Heuristic Analysis reduces search space for cost
    evaluation but does not necessarily reduce costs
  • Annotate Logical Query Plan operators with
    physical operations (1 )
  • Binary vs. Linear search for Selection?
  • Nested-Loop Join vs. Sort-Merge Join?
  • Pipelining vs. Materialization?
  • How does optimizer determine cheapest plan?

26
Physical Searching
27
Physical Storage
  • Record Placement
  • Types of Records
  • Variable Length
  • Fixed Length
  • Record Separation
  • Not needed when record size lt block size
  • Fixed records dont need it
  • If needed, indicate records with special marker
    and give record lengths or offsets

28
Record Separation
  • Unspanned
  • Records must stay within a block
  • Simpler, but wastes space
  • Spanned
  • Records are across multiple blocks
  • Require pointer at the end of the block to the
    next block with that record
  • Essential if record size gt block size

29
Record Separation
  • Mixed Record Types Clustering
  • Different record types within the same block
  • Why cluster? Frequently accessed records are in
    the same block
  • Has performance downsides if there are many
    frequently accessed queries with different
    ordering
  • Split Records
  • Put fixed records in one place and variable in
    another block

30
Record Separation
  • Sequencing
  • Order records in sequential blocks based on a key
  • Indirection
  • Record address is a combination of various
    physical identifiers or an arbitrary bit string
  • Very flexible but can be costly

31
Accessing Data
  • What is an index?
  • Data structure that allows the DBMS to quickly
    locate particular records or tuples that meet
    specific conditions
  • Types of indicies
  • Primary Index
  • Secondary Index
  • Dense Index
  • Sparse Index/Clustering Index
  • Multilevel Indicies

32
Accessing Data
  • Primary Index
  • Index on the attribute that determines the
    sequencing of the table
  • Guarantees that the index is unique
  • Secondary Index
  • An index on any other attribute
  • Does not guarantee unique index

33
Accessing Data
  • Dense Index
  • Every value of the indexed attribute appears in
    the index
  • Can tell if record exists without accessing files
  • Better access to overflow records
  • Clustering Index
  • Each index can correspond to many records

34
Dense Index
35
Accessing Data
  • Sparse Index
  • Many values of the indexed attribute dont appear
  • Less index space per record
  • Can keep more of index in memory
  • Better for insertions
  • Multilevel Indices
  • Build an index on an index
  • Level 2 Index -gt Level 2 Index -gt Data File

36
Sparse Index
37
B Tree
  • Use a tree model to hold data or indices
  • Maintain balanced tree and aim for a bushy
    shallow tree

100
120 150 180
30
3 5 11
120 130
180 200
100 101 110
150 156 179
30 35
38
B Tree
  • Rules
  • If root is not a leaf, it must have at least two
    children
  • For a tree of order n, each node must have
    between n/2 and n pointers and children
  • For a tree of order n, the number of key values
    in a leaf node must be between (n-1)/2 and (n-1)
    pointers and children

39
B Tree (cont)
  • Rules
  • The number of key values contained in a non-leaf
    node is 1 less than the number of pointers
  • The tree must always be balanced that is, every
    path from the root node to a leaf must have the
    same length
  • Leaf nodes are linked in order of key values

40
Hashing
  • Calculates the address of the page in which the
    record is to be stored based on more or more
    fields
  • Each hash points to a bucket
  • Hash function should evenly distribute the
    records throughout the file
  • A good hash will generate an equal number of keys
    to buckets
  • Keep keys sorted within buckets

41
Hashing
. . .
records
key ? h(key)
. . .
42
Hashing
  • Types of hashing
  • Extensible Hashing
  • Pro
  • Handle growing files
  • Less wasted space
  • No full reorganizations
  • Con
  • Uses indirection
  • Directory doubles in size

43
Hashing
  • Types of hashing
  • Linear Hashing
  • Pro
  • Handle growing files
  • Less wasted space
  • No full reorganizations
  • No indirection like extensible hashing
  • Con
  • Still have overflow chains

44
Indexing vs. Hashing
  • Hashing is good for
  • Probes given specific key
  • SELECT FROM R WHERE R.A 5
  • Indexing is good for
  • Range searches
  • SELECT FROM R WHERE R.A gt 5

45
Cost Model
46
Cost Model
  • To compare different alternatives we must
    evaluate and estimate how expensive (resource
    intensive) a specific execution is
  • Inputs to evaluate
  • Query
  • Database statistics
  • Resource availability
  • Disk Bandwidth/CPU costs/Network Bandwidth

47
Cost Model
  • Statistics
  • Held in the system catalog
  • nTuples(R), number of tuples in a relation
  • bFactor(R), number of tuples that fit into a
    block
  • nBlocks(R), number of blocks required to store R
  • nDistinctA(R), number of distinct values that
    appear for attribute in relation
  • minA(R), maxA(R), min/max possible values for
    attribute in relation
  • SCA(R), average number of tuples that satisfy an
    equality condition

48
Selection Operation
  • Example
  • Branch (branchNo, street, city, postcode)
  • Staff (staffNo, fName, lName, branchNo)
  • Staff
  • 100,000 tuples (denoted as CARD(S))
  • 100 tuples per page
  • 1000 pages (StaffPages)
  • Branch
  • 40,000 tuples (denoted as CARD(B))
  • 80 tuples per page
  • 500 pages (BranchPages)

49
Selection Operation
  • SELECT FROM Branch WHERE city B
  • General Form
  • Without Index
  • Search on attributes BranchPages 500
  • Search on primary key attribute BranchPages
    500/2 250
  • Using Clustered B Tree
  • Costs
  • Path from root to the leftmost leaf with
    qualifying data entry
  • Retrieve all leaf pages fulfilling search
    criteria
  • For each leaf page get corresponding data pages
  • Each data page only retrieved once

50
Selection Operation
  • Using Index for Selections (Clustered B Tree)
  • Example 1
  • SELECT FROM Branch WHERE branchNo 6
  • 1 tuple match (1 data page)
  • Cost 1 index leaf page 1 data page
  • Example 2
  • SELECT FROM Branch WHERE street LIKE c
  • 100 tuple matches (2 data pages)
  • Cost 1 index leaf page two data pages
  • Example 3
  • SELECT FROM Branch WHERE street lt c
  • 10,000 tuple matches (125 data pages)
  • Cost 2 index leaf pages 125 data pages

51
Selection Operation
  • Using Index for Selections (Non-Clustered B
    Tree)
  • Example 1
  • SELECT FROM Branch WHERE branchNo 6
  • 1 tuple match (1 data page)
  • Cost 1 index leaf page 1 data page
  • Example 2
  • SELECT FROM Branch WHERE street LIKE c
  • 100 tuple matches (80 data pages)
  • Cost 1 index leaf page 100 data pages (some
    pages are retrieved twice)
  • Example 3
  • SELECT FROM Branch WHERE street lt c
  • 10,000 tuple matches (490 data pages)
  • Cost 2 index leaf pages 10,000 data pages
    (pages will be retrieved several
    times)

52
Projection Operation
Extracts vertical subset of relation R, and
produces single relation S
  • Steps to implementing the Projection Operation
  • 1. Remove the attributes that are not required
  • Eliminate any duplicate tuples from the required
    attribute
  • Only required if the attributes do not include
    the key of relation)
  • Done by sorting or hashing.

53
Projection Operation
  • Estimation of Cardinality of result set
  • Projection contains key
  • ntuples(S) nTuples(R)
  • Projection contains non-key attribute A
  • ntuples(S) SCA(R)

54
Projection Operation
- Duplicate elimination using sorting
  • Sort tuples of the reduced relation using all
    remaining attributes as a sort key.
  • Duplicates are adjacent and can be easily
    removed.
  • To remove unwanted tuples, need to read all the
    tuples of R and copy the required attributes to a
    temporary relation.
  • Estimated Cost
  • nBlocks(R) nBlocks(R) log2(nBlocks(R))

55
Projection Operation
- Duplicate elimination using hashing
Useful if there is a large number of buffer
blocks for R.
  • Partitioning
  • Allocate one buffer block for reading R allocate
    (nBuffer 1) blocks for the output.
  • For each tuple in R,
  • remove the unwanted attributes
  • apply hash function h to the combination of the
    remaining attributes
  • write the reduced tuple to the hashed value.
  • h chosen so that tuples are uniformly distributed
    to on of the (nBuffer 1) partitions. Then, two
    tuples belonging to different partitions are not
    duplicates

56
Projection Operation
  • Read each of the (nBuffer 1) partitions in
    turn.
  • Apply a different hash function h2() to each
    tuple as is read.
  • Insert the computed hash value into an in-memory
    hash table.
  • If the tuple hashes to the same value as some
    other tuple, check whether the two are the same,
    and eliminate the new one if it is a duplicate.
  • Once a partion has been removed, write the tuples
    in the hash table to the result file.

57
Projection Operation
  • Estimated Cost nBlocks(R) nb
  • nb number of blocks required for a tempory
    table that results from Projection on R before
    duplicate elimination.
  • Assume hashing has no overflow
  • Exclude cost of writing result relation

58
Join Operation
  • Join is very expensive and must be carefully
    optimized
  • Ways the processor can join
  • Simple nested loop join
  • Block nested loop join
  • Indexed nested loop join
  • Sort-merge join
  • Hash join

59
Simple Nested Loop Join
  • For each tuple in the outer relation, we scan
    the entire inner relation (scan each tuple in
    inner).
  • Simplest algorithm
  • Can easily improve on this using Block Nested
    Loop
  • Basic unit of read/write is disk block
  • Add two additional loops that process blocks

60
Block Nested Loop Join
  • Outer loop iterates over one table, inner loop
    iterates over the other (two additional loops on
    the outside for the disk blocks)
  • for ib 1nBlocks(R)
  • for jb 1nBlocks(S)
  • for i 1nTuples
  • for j 1nTuples
  • Cost depends on buffer for outer block loops
  • nBlocks(R) (nBlocks(R) nBlocks(S))
  • If buffer has only one block for R and S
  • nBlocks(R) nBlocks(S) (nBlocks(R) /
    (nBuffer-2))
  • If (nBuffer-2) blocks for R (same number of R
    blocks, less for S)
  • nBlocks(R) nBlocks(S),
  • If all blocks of R can be read into database
    buffer (no outside loops)

61
Block Nested Loop Join
62
Indexed Nested Loop Join
  • For each tuple in R, use index to lookup matching
    tuples in S
  • avoids enumeration of the Cartesian product of R
    and S)
  • Cost depends on indexing method for example
  • nBlocks(R) nTuples(R) (nLevelsA(I) 1),
  • If join attribute A in S is the primary key
  • nBlocks(R) nTuples(R) (nLevelsA(I) SCA(R)
    / bFactor(R)),
  • For clustering index I on attribute A

63
Sort- Merge Join
For Equijoins, the most efficient join occurs
when both relations are sorted on the join
attributes. We can look for quality tuples of R
and S by merging the two relations.
  • If the tables are not sorted on key values,then
    sort first and merge the table (log(N))
  • If tables are sorted on key values,then just
    merge (linear time)
  • Cost
  • nBlocks(R) log2(nBlocks(R) nBlocks(S)
    log2(nBlocks(S)), for sorts
  • nBlocks(R) nBlocks(S), for merge

64
Hash Join
  • Use hash map for indexing into other table
  • Cost
  • 3(nBlocks(R) nBlocks(S))
  • If hash index is held in memory
  • Read R S to partition
  • Write R S to disk
  • Read R S again to find matching tuples
  • 2(nBlocks(R) nBlocks(S)) lognBuffer-1(nBlocks
    (S))-1 Blocks(R) nBlocks(S)
  • Otherwise (if hash index cannot be held in memory)

65
Hash Join
66
Example Cost Estimation for Join Operation
  • Assumptions
  • There are separate hash indexes with no overflow
    on the primary key attributes staffNo of Staff
    and branchNo of Branch
  • There are 100 database buffer blocks
  • The system catalog holds the following
    statistics
  • nTuples(Staff) 6000
  • bFactor(Staff) 30 ? nBlocks(Staff) 200
  • nTuples(Branch) 500
  • bFactor(Branch) 50 ? nBlocks(Branch) 10
  • nTuples(PropertyForRent) 100,000
  • bFactor(PropertyForRent) 50 ?
    nBlocks(PropertyForRent) 2000
  • J1 Staff XstaffNo PropertyForRent
  • J2 Branch XbranchNo PropertyForRent

67
Estimated I/O Costs of Join Operations
68
Questions?
  • Thank you for your time.
  • Questions? Comments?
Write a Comment
User Comments (0)
About PowerShow.com