Title: Query Processing and Optimization
1. Query Processing and Optimization
- Sao Mai Nguyen
- Peter Greiner
- Niarcas Jeffrey
2. Presentation Overview
- Overview of the main phases of Query Processing
- Parsing
- Validating
- Optimization
- Execution
- Query Decomposition
- Heuristic Approach to Query Optimization
3. Overview Cont.
- Cost Estimation Approach
- Pipelining
- If time permits we will also cover
- Query Processing for distributed Databases
4. Query Processing
- The activities involved in parsing, validating, optimizing, and executing a query.
- Transform a query written in a high-level language into a correct and efficient execution strategy expressed in a low-level language.
- Execute the strategy to retrieve the required data.
- Also called Query Decomposition.
5. Parsing Overview
- The process of analyzing a sequence of tokens to determine its grammatical structure with respect to a given formal grammar.
- Parsing transforms the input text into a data structure, usually a tree, which is suitable for later processing and which captures the implied hierarchy of the input.
- Parsers are typically top-down (LL) or bottom-up (LR).
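- To make this concrete (not from the original slides), here is a minimal Python sketch of lexing a tiny subset of SELECT statements and building a crude parse structure; the token pattern and the dictionary layout are illustrative assumptions, not any real DBMS's parser:

    import re

    # Toy lexer + parser for a tiny subset of SELECT statements (illustrative only;
    # a production parser is driven by the full SQL grammar).
    TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_.]*|\d+|[<>=,*]")

    def tokenize(sql):
        return TOKEN.findall(sql)

    def parse_select(sql):
        toks = tokenize(sql)
        upper = [t.upper() for t in toks]
        assert upper[0] == "SELECT", "only SELECT statements are handled"
        f = upper.index("FROM")
        tree = {"select": [t for t in toks[1:f] if t != ","], "from": toks[f + 1]}
        if "WHERE" in upper:
            w = upper.index("WHERE")
            tree["where"] = {"attr": toks[w + 1], "op": toks[w + 2], "value": toks[w + 3]}
        return tree

    print(parse_select("SELECT staffNo, salary FROM Staff WHERE salary > 10000"))
    # {'select': ['staffNo', 'salary'], 'from': 'Staff',
    #  'where': {'attr': 'salary', 'op': '>', 'value': '10000'}}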
6. Validation Overview
- The process of checking that data inserted into an application satisfies predetermined formats, or complies with stated length and character requirements and other defined input criteria.
- Also called Semantic Analysis.
7. Query Optimization
- The activity of choosing an efficient strategy for processing a query.
- Generally the optimized query is the one that minimizes resource usage.
- The resources that are costed are CPU path length, disk storage service time, and interconnect usage between units of parallelism.
- The set of query plans examined is formed by examining possible access paths (e.g., primary index access, secondary index access, full file scan) and various relational table join techniques (e.g., merge join, hash join, product join).
8. Execution Overview
- This is the last step of query processing.
- This step is given the execution plan found by the optimization step and carries it out against the database.
9. Questions in Query Processing
- Which relational algebra expression, equivalent to the given query, will lead to the most efficient solution plan?
- For each algebraic operator, which algorithm (of the several available) do we use to compute that operator?
- How do operations pass data (main memory buffer, disk buffer, etc.)?
- Will this plan minimize resource usage (CPU, response time, disk, etc.)?
10. Why Optimize?
- Let's start with an example:
- SELECT *
  FROM Staff s, Branch b
  WHERE s.branchNo = b.branchNo AND
        (s.position = 'Manager' AND b.city = 'London')
11. Why Optimize Cont.
- Results in these equivalent relational algebra statements:
- (1) σ_{position='Manager' ∧ city='London' ∧ Staff.branchNo=Branch.branchNo}(Staff × Branch)
- (2) σ_{position='Manager' ∧ city='London'}(Staff ⋈_{Staff.branchNo=Branch.branchNo} Branch)
- (3) σ_{position='Manager'}(Staff) ⋈_{Staff.branchNo=Branch.branchNo} σ_{city='London'}(Branch)
- Assume:
- 1000 tuples in Staff
- 50 Managers
- 50 tuples in Branch
- 5 London branches
- No indexes or sort keys
- All temporary results are written back to disk (memory is small)
- Tuples are accessed one at a time (not in blocks)
12. Why Optimize Cont.
- (1) σ_{position='Manager' ∧ city='London' ∧ Staff.branchNo=Branch.branchNo}(Staff × Branch)
- Requires (1000 + 50) disk accesses to read the Staff and Branch relations
- Creates a temporary Cartesian-product relation of (1000 × 50) = 50,000 tuples
- Requires (1000 × 50) disk accesses to read the temporary relation back and test the predicate
- Total work: (1000 + 50) + 2(1000 × 50) = 101,050 I/O operations
13. Why Optimize Cont.
- (2) σ_{position='Manager' ∧ city='London'}(Staff ⋈_{Staff.branchNo=Branch.branchNo} Branch)
- Again requires (1000 + 50) disk accesses to read Staff and Branch
- Joins Staff and Branch on branchNo, producing 1000 tuples (1 employee : 1 branch)
- Requires 1000 disk accesses to read the joined relation back and check the predicate
- Total work: (1000 + 50) + 2(1000) = 3,050 I/O operations
- Roughly a 33× (3,300%) improvement over Query 1
14. Why Optimize Cont.
- (3) σ_{position='Manager'}(Staff) ⋈_{Staff.branchNo=Branch.branchNo} σ_{city='London'}(Branch)
- Read the Staff relation to determine the Managers (1000 reads)
- Create a 50-tuple relation (50 writes)
- Read the Branch relation to determine the London branches (50 reads)
- Create a 5-tuple relation (5 writes)
- Join the reduced relations and check the predicate (50 + 5 reads)
- Total work: 1000 + 2(50) + 5 + (50 + 5) = 1,160 I/O operations
- Roughly an 87× (8,700%) improvement over Query 1
- Consider if the Staff and Branch relations were 10× the size. 100×? Yikes!
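- As a quick check (not part of the original slides), the arithmetic can be reproduced with a few lines of Python; the variable names are purely illustrative:

    # Back-of-envelope I/O estimates for the three equivalent plans, using the
    # slide's assumptions: 1000 Staff tuples, 50 Branch tuples, 50 Managers,
    # 5 London branches, one disk access per tuple, intermediates written to disk.
    n_staff, n_branch = 1000, 50
    n_managers, n_london = 50, 5

    cart = n_staff * n_branch                    # 50,000-tuple Cartesian product
    plan1 = (n_staff + n_branch) + 2 * cart      # read inputs + write/read product
    plan2 = (n_staff + n_branch) + 2 * n_staff   # join yields ~1000 tuples
    plan3 = (n_staff + n_managers) + (n_branch + n_london) + (n_managers + n_london)

    print(plan1, plan2, plan3)   # 101050 3050 1160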
15. The Three Steps of Query Processing
- (1) Query Decomposition
- Analysis
- Derive relational algebra tree
- Normalization
- (2) Query Optimization
- Heuristic: improve and refine the relational algebra tree to create equivalent Logical Query Plans
- Cost-based: use database statistics to estimate the physical costs of the logical operators in the LQP and create Physical Execution Plans
- (3) Query Execution
16. A Picture is Worth a Thousand Words
17. Query Decomposition
- First phase of query processing.
- Aims to translate a high-level SQL statement into a relational algebra query.
- Then checks that the resulting relational algebra query is syntactically and semantically correct.
- Stages:
- Analysis
- Normalization
- Semantic analysis
- Simplification
- Query restructuring
18. Analysis
- Lexical
- Is it even valid SQL?
- Syntactic
- Do the relations/attributes exist and are the operations valid?
- Result is an internal tree representation of the SQL query (parse tree)
19. Example
- SELECT staffNumber FROM Staff WHERE position > 10
- This query would be rejected at this stage if:
- The attribute staffNumber is not defined for the Staff relation
- The comparison '> 10' is incompatible with the data type of position
20. Query Optimization
- RELATIONAL ALGEBRA TREE
- Root: the desired result of the query
- Leaf: the base relations of the query
- Non-leaf: an intermediate relation created by a relational algebra operation
21. Normalization
- Converts the query into a normalized form that can be more easily manipulated.
- Converted into one of two forms:
- Conjunctive Normal Form
- A sequence of conjuncts that are connected with the ∧ (AND) operator. Each conjunct contains one or more terms connected by the ∨ (OR) operator.
- Disjunctive Normal Form
- A sequence of disjuncts that are connected with the ∨ (OR) operator. Each disjunct contains one or more terms connected by the ∧ (AND) operator.
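- As an illustration (not from the original slides), sympy can perform these conversions; p, q, and r stand in for simple predicates such as position = 'Manager':

    from sympy import symbols
    from sympy.logic.boolalg import to_cnf, to_dnf

    p, q, r = symbols("p q r")
    expr = p | (q & r)

    print(to_cnf(expr))   # (p | q) & (p | r)  -- conjunction of disjunctions
    print(to_dnf(expr))   # p | (q & r)        -- disjunction of conjunctions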
22. Semantic Analysis
- The objective is to reject normalized queries that are incorrectly formulated or contradictory.
- This happens when:
- The query does not contribute to finding the result (e.g., missing joins)
- Its predicate cannot be satisfied by any tuple (e.g., position = 'Manager' ∧ position = 'Assistant')
- Algorithms to determine correctness:
- Exist only for the subset of queries that do not contain disjunction or negation.
- Relation connection graph
- Normalized attribute connection graph
23. Correctness Algorithms
- Relation Connection Graph
- To construct a relation connection graph, we create a node for each relation and a node for the result. We then create an edge between two nodes that represents a join, and an edge between nodes that represents the source of a Projection operation.
- If the graph is not connected, the query is incorrectly formulated.
24. Example
- SELECT p.propertyNo, p.street FROM Client c, Viewing v, PropertyForRent p WHERE c.clientNo = v.clientNo AND c.maxRent > 500 AND c.prefType = 'Flat' AND p.ownerNo = 'CO93'
- Because the graph is not fully connected, the query is not correctly formulated. It is missing the join condition (v.propertyNo = p.propertyNo) from the predicate.
- [Figure: relation connection graph with nodes for Result, P (PropertyForRent), V (Viewing), and C (Client)]
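- To illustrate the check (not part of the original slides), a small Python sketch that builds the graph for this query and tests connectivity; the node and edge lists mirror the example above:

    from collections import defaultdict, deque

    # Edges of the relation connection graph: the Projection links Result to
    # PropertyForRent, the only join links Client to Viewing.  The missing
    # v.propertyNo = p.propertyNo join leaves the graph disconnected.
    nodes = ["Result", "Client", "Viewing", "PropertyForRent"]
    edges = [("Result", "PropertyForRent"), ("Client", "Viewing")]

    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)

    def connected(nodes, adj):
        seen, todo = {nodes[0]}, deque([nodes[0]])
        while todo:
            for nxt in adj[todo.popleft()]:
                if nxt not in seen:
                    seen.add(nxt)
                    todo.append(nxt)
        return len(seen) == len(nodes)

    print(connected(nodes, adj))   # False -> query is incorrectly formulated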
25. Correctness Algorithms Cont.
- Normalized Attribute Connection Graph
- To construct this graph we create a node for each reference to an attribute, plus a node for the constant 0. We then create a directed edge between nodes that represents a join, and a directed edge between an attribute node and the constant 0 node that represents a Selection operation. Next, we weight an edge a → b with the value c if it represents the inequality condition (a ≤ b + c), and weight an edge 0 → a with the value −c if it represents the inequality condition (a ≥ c).
- If the graph has a cycle for which the valuation sum is negative, the query is contradictory.
26. Example
- SELECT p.propertyNo, p.street FROM Client c, Viewing v, PropertyForRent p WHERE c.maxRent > 500 AND c.clientNo = v.clientNo AND c.prefType = 'Flat' AND c.maxRent < 200
27. Example Cont.
28. Example Cont.
- This query is contradictory because there is a cycle between c.maxRent and 0 with a negative valuation sum.
- We cannot have a client with a maximum rent that is both greater than 500 and less than 200.
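- To make the cycle test concrete (not from the original slides), here is a standard Bellman-Ford negative-cycle check in Python over the two edges this example produces; the edge weights follow the construction rule on slide 25:

    # c.maxRent > 500  ->  edge 0 -> maxRent with weight -500   (a >= c rule)
    # c.maxRent < 200  ->  edge maxRent -> 0 with weight +200   (a <= b + c rule)
    nodes = {"0", "maxRent"}
    edges = [("0", "maxRent", -500), ("maxRent", "0", 200)]

    def has_negative_cycle(nodes, edges):
        dist = {n: 0 for n in nodes}
        for _ in range(len(nodes) - 1):          # standard Bellman-Ford relaxation
            for u, v, w in edges:
                if dist[u] + w < dist[v]:
                    dist[v] = dist[u] + w
        # Any further improvement on an extra pass means a negative cycle.
        return any(dist[u] + w < dist[v] for u, v, w in edges)

    print(has_negative_cycle(nodes, edges))   # True -> the query is contradictory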
29. Simplification
- Objective: to detect redundant qualifications, eliminate common sub-expressions, and transform the query into a semantically equivalent but more easily and efficiently computed form.
- Typically, access restrictions, view definitions, and integrity constraints are considered at this stage.
- If all of the above checks out, the simplification applies the well-known idempotency rules of Boolean algebra.
30. Rules of Boolean Algebra
- Associativity: a ∨ (b ∨ c) ≡ (a ∨ b) ∨ c;  a ∧ (b ∧ c) ≡ (a ∧ b) ∧ c
- Commutativity: a ∨ b ≡ b ∨ a;  a ∧ b ≡ b ∧ a
- Absorption: a ∨ (a ∧ b) ≡ a;  a ∧ (a ∨ b) ≡ a
- Distributivity: a ∨ (b ∧ c) ≡ (a ∨ b) ∧ (a ∨ c);  a ∧ (b ∨ c) ≡ (a ∧ b) ∨ (a ∧ c)
- Complements: a ∨ ¬a ≡ 1;  a ∧ ¬a ≡ 0
31. Rules of Boolean Algebra Cont.
- Idempotency: a ∨ a ≡ a;  a ∧ a ≡ a
- Boundedness: a ∨ 0 ≡ a;  a ∧ 1 ≡ a
- 0 and 1 are complements: a ∨ 1 ≡ 1;  a ∧ 0 ≡ 0;  ¬0 ≡ 1;  ¬1 ≡ 0
- De Morgan's laws: ¬(a ∨ b) ≡ ¬a ∧ ¬b;  ¬(a ∧ b) ≡ ¬a ∨ ¬b
- Involution: ¬(¬a) ≡ a
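- As a small illustration (not from the original slides), applying these rules mechanically with sympy shows how a contradictory conjunct collapses to False, letting the optimizer reject a query without touching the data; the symbol name is illustrative:

    from sympy import symbols
    from sympy.logic.boolalg import simplify_logic

    is_manager = symbols("is_manager")                # stands for position = 'Manager'
    print(simplify_logic(is_manager & ~is_manager))   # False (complement rule)
    print(simplify_logic(is_manager | is_manager))    # is_manager (idempotency)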
32. Ways to Optimize
- Heuristic Approach
- Cost Estimation Approach
- Semantic Query Optimization
- May be used in conjunction with the above
techniques
33. Heuristic Approach
- Uses twelve transformation rules to convert relational algebra expressions into more efficient forms
- Also uses five processing strategies applied during query processing
- A given heuristic is not always the best transformation for a particular query
- Reduces the search space for cost evaluation but does not necessarily reduce costs
34. Heuristic Approach Cont.
- Use relational algebra equivalence rules to improve the expected performance of a given query tree
- Consider the example given earlier:
- Join followed by Selection (~3,050 disk reads)
- Selection followed by Join (~1,160 disk reads)
35. 12 Rules
- Cascade of Selection
- (1) σ_{p ∧ q ∧ r}(R) = σ_p(σ_q(σ_r(R)))
- Commutativity of Selection operations
- (2) σ_p(σ_q(R)) = σ_q(σ_p(R))
- In a sequence of Projections, only the last is required
- (3) Π_L Π_M … Π_N(R) = Π_L(R)
- Selections can be combined with Cartesian products and Joins
- (4) σ_p(R × S) = R ⋈_p S
- (5) σ_p(R ⋈_q S) = R ⋈_{q ∧ p} S
36. 12 Rules Cont.
- Join and Cartesian product operations are commutative and associative
- (6) R × S = S × R
- (7) R × (S × T) = (R × S) × T
- (8) R ⋈_p S = S ⋈_p R
- (9) (R ⋈_p S) ⋈_{q ∧ r} T = R ⋈_{p ∧ r} (S ⋈_q T)
- Selection distributes over Joins
- If predicate p involves attributes of R only:
- (10) σ_p(R ⋈_q S) = σ_p(R) ⋈_q S
- If predicate p involves only attributes of R and q involves only attributes of S:
- (11) σ_{p ∧ q}(R ⋈_r S) = σ_p(R) ⋈_r σ_q(S)
37. 12 Rules Cont.
- Associativity of Union and Intersection (but not Set Difference)
- (12) (R ∪ S) ∪ T = S ∪ (R ∪ T)
- (R ∩ S) ∩ T = S ∩ (R ∩ T)
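- As a sanity check (not from the original slides), rule (4) can be verified on toy relations in Python; the sample data and predicate are illustrative:

    from itertools import product

    # sigma_p(R x S) = R join_p S, with R ~ Staff, S ~ Branch, and p the
    # branchNo equality condition.
    R = [{"staffNo": 1, "branchNo": "B1"}, {"staffNo": 2, "branchNo": "B2"}]
    S = [{"branchNo": "B1", "city": "London"}, {"branchNo": "B3", "city": "Leeds"}]
    p = lambda r, s: r["branchNo"] == s["branchNo"]

    cartesian_then_select = [(r, s) for r, s in product(R, S) if p(r, s)]
    join = [(r, s) for r in R for s in S if p(r, s)]

    print(cartesian_then_select == join)   # True: both yield only the B1 pair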
38. 5 Rules to Optimize By
- Perform Selection operations as early as possible.
- Combine a Cartesian product with a subsequent Selection operation whose predicate represents a join condition into a Join operation.
- Use the associativity of binary operations to rearrange leaf nodes so that the leaf nodes with the most restrictive Selection operations are executed first.
- Perform Projection operations as early as possible.
- Compute common expressions once.
39. Cost Estimation Approach
- The goal of cost estimation is to choose the most efficient way to implement relational algebra operations by selecting the option with the lowest cost
- We want to select the option that yields the fewest disk reads
40. Cost Estimation Assumptions
- Assume that the DBMS holds the following information in its system catalog
- For each base relation R:
- nTuples(R) - the number of records (cardinality) of R
- bFactor(R) - the number of tuples that fit into one block
- nBlocks(R) - the number of blocks required to store R
- nBlocks(R) = ⌈nTuples(R) / bFactor(R)⌉ when tuples are stored together physically
41. Assumptions Cont.
- For each attribute A of base relation R:
- nDistinctA(R) - the number of distinct values that appear for attribute A in relation R
- minA(R), maxA(R) - the minimum and maximum possible values for attribute A in relation R
- SCA(R) - the selection cardinality of attribute A in relation R: the average number of tuples that satisfy an equality condition on attribute A
- For each multilevel index I on attribute set A:
- nLevelsA(I) - the number of levels in I
- nLfBlocksA(I) - the number of leaf blocks in I
42. Example: Cost Estimation Strategies for Selections
- To decide which implementation to select, the main strategies considered for Selection operations are:
- Linear search (unordered file, no index)
- Binary search (ordered file, no index)
- Equality on hash key
- Equality condition on primary key
- Inequality condition on primary key
- Equality condition on clustering (secondary) index
- Equality condition on a non-clustering (secondary) index
- Inequality condition on a secondary B-tree index
43. Example: Estimated I/O Costs of Strategies
- Linear search (unordered file, no index)
- nBlocks(R)/2, for an equality condition on a key attribute
- nBlocks(R), otherwise
- Binary search (ordered file, no index)
- ⌈log2(nBlocks(R))⌉, for an equality condition on the ordering attribute
- ⌈log2(nBlocks(R))⌉ + ⌈SCA(R)/bFactor(R)⌉ − 1, otherwise
- Equality condition on clustering (secondary) index
- nLevelsA(I) + ⌈SCA(R)/bFactor(R)⌉
44. Example
- There is a hash index with no overflow on the primary key attribute staffNo
- There is a clustering index on the foreign key attribute branchNo
- There is a B-tree index on the salary attribute
- The estimated cost of a linear search on the key attribute staffNo is 50 blocks
- The estimated cost of a linear search on a non-key attribute is 100 blocks
45. Example Cont.
- The following statistics about the Staff relation are stored:
- nTuples(Staff) = 3000
- bFactor(Staff) = 30
- nBlocks(Staff) = 100 (3000/30)
- nDistinctbranchNo(Staff) = 500
- nDistinctposition(Staff) = 10
- nDistinctsalary(Staff) = 500
- minsalary(Staff) = 10,000
- maxsalary(Staff) = 50,000
- SCbranchNo(Staff) = 6
- SCposition(Staff) = 300
- SCsalary(Staff) = 6
- nLevelsbranchNo(I) = 2
- nLevelssalary(I) = 2
- nLfBlockssalary(I) = 50
46. Example Cont.
- Selection operation: σ_{branchNo='B003'}(Staff)
- branchNo is a foreign key with a clustering index and there is an equality condition, so we use the strategy mentioned before to calculate the cost
- Cost = nLevelsA(I) + ⌈SCA(R)/bFactor(R)⌉
- = nLevelsbranchNo(I) + ⌈SCbranchNo(Staff)/bFactor(Staff)⌉
- = 2 + ⌈6/30⌉ = 3 blocks
- The estimated cardinality of the result relation is SCbranchNo(Staff) = 6
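- For completeness (not from the original slides), the same estimate in Python; the statistics dictionary simply restates the catalog values above:

    import math

    # sigma_{branchNo='B003'}(Staff) via the clustering-index formula:
    # nLevels_A(I) + ceil(SC_A(R) / bFactor(R))
    stats = {"nLevels_branchNo": 2, "SC_branchNo": 6, "bFactor": 30}

    def clustering_index_cost(n_levels, selection_cardinality, b_factor):
        return n_levels + math.ceil(selection_cardinality / b_factor)

    cost = clustering_index_cost(stats["nLevels_branchNo"],
                                 stats["SC_branchNo"], stats["bFactor"])
    print(cost)   # 3 blocks; estimated result cardinality is SC_branchNo = 6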
47. Semantic Query Optimization
- Based on constraints specified on the database schema to reduce the search space
- Example:
- Ensure that no one staff member manages more than 100 properties:

  CREATE ASSERTION StaffNotHandlingTooMuch
  CHECK (NOT EXISTS (SELECT staffNo
                     FROM PropertyForRent
                     GROUP BY staffNo
                     HAVING COUNT(*) > 100))
48. Example Cont.
- If the optimizer is aware of the constraint created to prevent staff from managing more than 100 properties, then it can ignore the query below because no groups will satisfy the HAVING clause.

  SELECT s.staffNo, COUNT(*)
  FROM Staff s, PropertyForRent p
  WHERE s.staffNo = p.staffNo
  GROUP BY s.staffNo
  HAVING COUNT(*) > 100
49. Pipelining
- The result of one operation is sent to another operation without creating a temporary relation to hold the intermediate result.
- Pipelining can save on the cost of reading from and writing to disk.
50. Pipelining
- This approach dispenses with the temporary relation and instead applies the second Selection to each tuple in the result of the first Selection as it is produced, adding any qualifying tuples from the second operation to the result.
51. Pipelining
- Each operation runs as a separate process or thread within the DBMS
- Takes a stream of tuples as input and creates a stream of tuples as output
- A buffer is created for each pair of adjacent operations
- Drawback: the inputs to operations are not necessarily available all at once for processing
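- To illustrate the idea (not from the original slides), a minimal Python sketch of iterator-style pipelining, where each operator is a generator pulling tuples from its child so no intermediate relation is materialised; the sample Staff data and predicates are made up:

    def scan(relation):
        for tup in relation:              # a real DBMS would read disk blocks here
            yield tup

    def select(predicate, child):
        for tup in child:                 # consume the child's tuples one at a time
            if predicate(tup):
                yield tup

    staff = [{"name": "Ann", "position": "Manager",   "salary": 30000},
             {"name": "Bob", "position": "Assistant", "salary": 12000},
             {"name": "Eve", "position": "Manager",   "salary": 18000}]

    # The second Selection is applied to each tuple as the first one produces it.
    pipeline = select(lambda t: t["salary"] > 20000,
                      select(lambda t: t["position"] == "Manager", scan(staff)))
    print(list(pipeline))   # [{'name': 'Ann', 'position': 'Manager', 'salary': 30000}]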
52. Distributed Database Query Optimization
- Query Decomposition
- Takes the query and performs a partial optimization using the techniques already discussed. Sends a relational algebra tree to the next layer.
- Data Localization
- Takes into account how the data has been distributed and replaces the leaves of the tree with their reconstruction algorithms.
- Global Optimization
- Takes statistical information into account to find a near-optimal execution strategy, with parts of the query sent to the local DBMSs.
53. Distributed Database Query
- Local Optimization
- Each local DBMS performs its own local optimization using the techniques we have discussed.