Title: Query Processing and Optimization
1. Query Processing and Optimization
- Sao Mai Nguyen
- Peter Greiner
- Niarcas Jeffrey
2. Presentation Overview
- Overview of the main phases of Query Processing
- Parsing
- Validating
- Optimization
- Execution
- Query Decomposition
- Heuristic Approach to Query Optimization
3. Overview Cont.
- Cost Estimation Approach
- Pipelining
- If time permits we will also cover
- Query Processing for distributed Databases
4. Query Processing
- The activities involved in parsing, validating, optimizing, and executing a query.
- Transform a query written in a high-level language into a correct and efficient execution strategy expressed in a low-level language.
- Execute the strategy to retrieve the required data.
- Also called Query Decomposition.
5. Parsing Overview
- The process of analyzing a sequence of tokens to determine its grammatical structure with respect to a given formal grammar.
- Parsing transforms the input text into a data structure, usually a tree, which is suitable for later processing and which captures the implied hierarchy of the input.
- Parsers are typically top-down (LL) or bottom-up (LR).
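- To make this concrete (not from the original slides), here is a minimal Python sketch of lexing a tiny subset of SELECT statements and building a crude parse structure; the token pattern and the dictionary layout are illustrative assumptions, not any real DBMS's parser:

    import re

    # Toy lexer + parser for a tiny subset of SELECT statements (illustrative only;
    # a production parser is driven by the full SQL grammar).
    TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_.]*|\d+|[<>=,*]")

    def tokenize(sql):
        return TOKEN.findall(sql)

    def parse_select(sql):
        toks = tokenize(sql)
        upper = [t.upper() for t in toks]
        assert upper[0] == "SELECT", "only SELECT statements are handled"
        f = upper.index("FROM")
        tree = {"select": [t for t in toks[1:f] if t != ","], "from": toks[f + 1]}
        if "WHERE" in upper:
            w = upper.index("WHERE")
            tree["where"] = {"attr": toks[w + 1], "op": toks[w + 2], "value": toks[w + 3]}
        return tree

    print(parse_select("SELECT staffNo, salary FROM Staff WHERE salary > 10000"))
    # {'select': ['staffNo', 'salary'], 'from': 'Staff',
    #  'where': {'attr': 'salary', 'op': '>', 'value': '10000'}}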
6. Validation Overview
- The process of checking that data inserted into an application satisfies predetermined formats, or complies with stated length and character requirements and other defined input criteria.
- Also called Semantic Analysis.
7. Query Optimization
- The activity of choosing an efficient strategy for processing a query.
- Generally the optimized query is the one that minimizes resource usage.
- The resources that are costed are CPU path length, disk storage service time, and interconnect usage between units of parallelism.
- The set of query plans examined is formed by examining possible access paths (e.g., primary index access, secondary index access, full file scan) and various relational table join techniques (e.g., merge join, hash join, product join).
8. Execution Overview
- This is the last step of query processing.
- This step is given the execution plan found by the optimization step and carries it out against the database.
9. Questions in Query Processing
- Which relational algebra expression, equivalent to the given query, will lead to the most efficient solution plan?
- For each algebraic operator, which algorithm (of the several available) do we use to compute that operator?
- How do operations pass data (main memory buffer, disk buffer, etc.)?
- Will this plan minimize resource usage (CPU, response time, disk, etc.)?
10. Why Optimize?
- Let's start with an example:
- SELECT *
  FROM Staff s, Branch b
  WHERE s.branchNo = b.branchNo AND
        (s.position = 'Manager' AND b.city = 'London')
11. Why Optimize Cont.
- Results in these equivalent relational algebra statements:
- (1) σ_{position='Manager' ∧ city='London' ∧ Staff.branchNo=Branch.branchNo}(Staff × Branch)
- (2) σ_{position='Manager' ∧ city='London'}(Staff ⋈_{Staff.branchNo=Branch.branchNo} Branch)
- (3) σ_{position='Manager'}(Staff) ⋈_{Staff.branchNo=Branch.branchNo} σ_{city='London'}(Branch)
- Assume:
- 1000 tuples in Staff
- 50 Managers
- 50 tuples in Branch
- 5 London branches
- No indexes or sort keys
- All temporary results are written back to disk (memory is small)
- Tuples are accessed one at a time (not in blocks)
12. Why Optimize Cont.
- (1) σ_{position='Manager' ∧ city='London' ∧ Staff.branchNo=Branch.branchNo}(Staff × Branch)
- Requires (1000 + 50) disk accesses to read the Staff and Branch relations
- Creates a temporary Cartesian-product relation of (1000 × 50) = 50,000 tuples
- Requires (1000 × 50) disk accesses to read the temporary relation back and test the predicate
- Total work: (1000 + 50) + 2(1000 × 50) = 101,050 I/O operations
13. Why Optimize Cont.
- (2) σ_{position='Manager' ∧ city='London'}(Staff ⋈_{Staff.branchNo=Branch.branchNo} Branch)
- Again requires (1000 + 50) disk accesses to read Staff and Branch
- Joins Staff and Branch on branchNo, producing 1000 tuples (1 employee : 1 branch)
- Requires 1000 disk accesses to read the joined relation back and check the predicate
- Total work: (1000 + 50) + 2(1000) = 3,050 I/O operations
- Roughly a 33× (3,300%) improvement over Query 1
14. Why Optimize Cont.
- (3) σ_{position='Manager'}(Staff) ⋈_{Staff.branchNo=Branch.branchNo} σ_{city='London'}(Branch)
- Read the Staff relation to determine the Managers (1000 reads)
- Create a 50-tuple relation (50 writes)
- Read the Branch relation to determine the London branches (50 reads)
- Create a 5-tuple relation (5 writes)
- Join the reduced relations and check the predicate (50 + 5 reads)
- Total work: 1000 + 2(50) + 5 + (50 + 5) = 1,160 I/O operations
- Roughly an 87× (8,700%) improvement over Query 1
- Consider if the Staff and Branch relations were 10× the size. 100×? Yikes!
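- As a quick check (not part of the original slides), the arithmetic can be reproduced with a few lines of Python; the variable names are purely illustrative:

    # Back-of-envelope I/O estimates for the three equivalent plans, using the
    # slide's assumptions: 1000 Staff tuples, 50 Branch tuples, 50 Managers,
    # 5 London branches, one disk access per tuple, intermediates written to disk.
    n_staff, n_branch = 1000, 50
    n_managers, n_london = 50, 5

    cart = n_staff * n_branch                    # 50,000-tuple Cartesian product
    plan1 = (n_staff + n_branch) + 2 * cart      # read inputs + write/read product
    plan2 = (n_staff + n_branch) + 2 * n_staff   # join yields ~1000 tuples
    plan3 = (n_staff + n_managers) + (n_branch + n_london) + (n_managers + n_london)

    print(plan1, plan2, plan3)   # 101050 3050 1160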
15. The Three Steps of Query Processing
- (1) Query Decomposition
- Analysis
- Derive relational algebra tree
- Normalization
- (2) Query Optimization
- Heuristic: improve and refine the relational algebra tree to create equivalent Logical Query Plans
- Cost-based: use database statistics to estimate the physical costs of the logical operators in the LQP and create Physical Execution Plans
- (3) Query Execution
16. A Picture is Worth a Thousand Words
17. Query Decomposition
- First phase of query processing.
- Aims to translate a high-level SQL statement into a relational algebra query.
- Then checks that the resulting relational algebra query is syntactically and semantically correct.
- Stages:
- Analysis
- Normalization
- Semantic analysis
- Simplification
- Query restructuring
18. Analysis
- Lexical
- Is it even valid SQL?
- Syntactic
- Do the relations/attributes exist and are the operations valid?
- Result is an internal tree representation of the SQL query (parse tree)
19. Example
- SELECT staffNumber FROM Staff WHERE position > 10
- This query would be rejected at this stage if:
- The attribute staffNumber is not defined for the Staff relation
- The comparison '> 10' is incompatible with the data type of position
20. Query Optimization
- RELATIONAL ALGEBRA TREE
- Root: the desired result of the query
- Leaf: the base relations of the query
- Non-leaf: an intermediate relation created by a relational algebra operation
21. Normalization
- Converts the query into a normalized form that can be more easily manipulated.
- Converted into one of two forms:
- Conjunctive Normal Form
- A sequence of conjuncts that are connected with the ∧ (AND) operator. Each conjunct contains one or more terms connected by the ∨ (OR) operator.
- Disjunctive Normal Form
- A sequence of disjuncts that are connected with the ∨ (OR) operator. Each disjunct contains one or more terms connected by the ∧ (AND) operator.
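- As an illustration (not from the original slides), sympy can perform these conversions; p, q, and r stand in for simple predicates such as position = 'Manager':

    from sympy import symbols
    from sympy.logic.boolalg import to_cnf, to_dnf

    p, q, r = symbols("p q r")
    expr = p | (q & r)

    print(to_cnf(expr))   # (p | q) & (p | r)  -- conjunction of disjunctions
    print(to_dnf(expr))   # p | (q & r)        -- disjunction of conjunctions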
22. Semantic Analysis
- The objective is to reject normalized queries that are incorrectly formulated or contradictory.
- This happens when:
- The query does not contribute to finding the result (e.g., missing joins)
- Its predicate cannot be satisfied by any tuple (e.g., position = 'Manager' ∧ position = 'Assistant')
- Algorithms to determine correctness:
- Exist only for the subset of queries that do not contain disjunction or negation.
- Relation connection graph
- Normalized attribute connection graph
23. Correctness Algorithms
- Relation Connection Graph
- To construct a relation connection graph, we create a node for each relation and a node for the result. We then create an edge between two nodes that represents a join, and an edge between nodes that represents the source of a Projection operation.
- If the graph is not connected, the query is incorrectly formulated.
24. Example
- SELECT p.propertyNo, p.street FROM Client c, Viewing v, PropertyForRent p WHERE c.clientNo = v.clientNo AND c.maxRent > 500 AND c.prefType = 'Flat' AND p.ownerNo = 'CO93'
- Because the graph is not fully connected, the query is not correctly formulated. It is missing the join condition (v.propertyNo = p.propertyNo) from the predicate.
- [Figure: relation connection graph with nodes for Result, P (PropertyForRent), V (Viewing), and C (Client)]
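- To illustrate the check (not part of the original slides), a small Python sketch that builds the graph for this query and tests connectivity; the node and edge lists mirror the example above:

    from collections import defaultdict, deque

    # Edges of the relation connection graph: the Projection links Result to
    # PropertyForRent, the only join links Client to Viewing.  The missing
    # v.propertyNo = p.propertyNo join leaves the graph disconnected.
    nodes = ["Result", "Client", "Viewing", "PropertyForRent"]
    edges = [("Result", "PropertyForRent"), ("Client", "Viewing")]

    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)

    def connected(nodes, adj):
        seen, todo = {nodes[0]}, deque([nodes[0]])
        while todo:
            for nxt in adj[todo.popleft()]:
                if nxt not in seen:
                    seen.add(nxt)
                    todo.append(nxt)
        return len(seen) == len(nodes)

    print(connected(nodes, adj))   # False -> query is incorrectly formulated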
25. Correctness Algorithms Cont.
- Normalized Attribute Connection Graph
- To construct this graph we create a node for each reference to an attribute, plus a node for the constant 0. We then create a directed edge between nodes that represents a join, and a directed edge between an attribute node and the constant 0 node that represents a Selection operation. Next, we weight an edge a → b with the value c if it represents the inequality condition (a ≤ b + c), and weight an edge 0 → a with the value −c if it represents the inequality condition (a ≥ c).
- If the graph has a cycle for which the valuation sum is negative, the query is contradictory.
26. Example
- SELECT p.propertyNo, p.street FROM Client c, Viewing v, PropertyForRent p WHERE c.maxRent > 500 AND c.clientNo = v.clientNo AND c.prefType = 'Flat' AND c.maxRent < 200
27. Example Cont.
28. Example Cont.
- This query is contradictory because there is a cycle between c.maxRent and 0 with a negative valuation sum.
- We cannot have a client with a maximum rent that is both greater than 500 and less than 200.
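- To make the cycle test concrete (not from the original slides), here is a standard Bellman-Ford negative-cycle check in Python over the two edges this example produces; the edge weights follow the construction rule on slide 25:

    # c.maxRent > 500  ->  edge 0 -> maxRent with weight -500   (a >= c rule)
    # c.maxRent < 200  ->  edge maxRent -> 0 with weight +200   (a <= b + c rule)
    nodes = {"0", "maxRent"}
    edges = [("0", "maxRent", -500), ("maxRent", "0", 200)]

    def has_negative_cycle(nodes, edges):
        dist = {n: 0 for n in nodes}
        for _ in range(len(nodes) - 1):          # standard Bellman-Ford relaxation
            for u, v, w in edges:
                if dist[u] + w < dist[v]:
                    dist[v] = dist[u] + w
        # Any further improvement on an extra pass means a negative cycle.
        return any(dist[u] + w < dist[v] for u, v, w in edges)

    print(has_negative_cycle(nodes, edges))   # True -> the query is contradictory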
29. Simplification
- Objective: to detect redundant qualifications, eliminate common sub-expressions, and transform the query into a semantically equivalent but more easily and efficiently computed form.
- Typically, access restrictions, view definitions, and integrity constraints are considered at this stage.
- If all of the above checks out, the simplification applies the well-known idempotency rules of Boolean algebra.
30. Rules of Boolean Algebra
- Associativity: a ∨ (b ∨ c) ≡ (a ∨ b) ∨ c;  a ∧ (b ∧ c) ≡ (a ∧ b) ∧ c
- Commutativity: a ∨ b ≡ b ∨ a;  a ∧ b ≡ b ∧ a
- Absorption: a ∨ (a ∧ b) ≡ a;  a ∧ (a ∨ b) ≡ a
- Distributivity: a ∨ (b ∧ c) ≡ (a ∨ b) ∧ (a ∨ c);  a ∧ (b ∨ c) ≡ (a ∧ b) ∨ (a ∧ c)
- Complements: a ∨ ¬a ≡ 1;  a ∧ ¬a ≡ 0
31. Rules of Boolean Algebra Cont.
- Idempotency: a ∨ a ≡ a;  a ∧ a ≡ a
- Boundedness: a ∨ 0 ≡ a;  a ∧ 1 ≡ a
- 0 and 1 are complements: a ∨ 1 ≡ 1;  a ∧ 0 ≡ 0;  ¬0 ≡ 1;  ¬1 ≡ 0
- De Morgan's laws: ¬(a ∨ b) ≡ ¬a ∧ ¬b;  ¬(a ∧ b) ≡ ¬a ∨ ¬b
- Involution: ¬(¬a) ≡ a
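- As a small illustration (not from the original slides), applying these rules mechanically with sympy shows how a contradictory conjunct collapses to False, letting the optimizer reject a query without touching the data; the symbol name is illustrative:

    from sympy import symbols
    from sympy.logic.boolalg import simplify_logic

    is_manager = symbols("is_manager")                # stands for position = 'Manager'
    print(simplify_logic(is_manager & ~is_manager))   # False (complement rule)
    print(simplify_logic(is_manager | is_manager))    # is_manager (idempotency)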
32. Ways to Optimize
- Heuristic Approach
- Cost Estimation Approach
- Semantic Query Optimization
- May be used in conjunction with the above
techniques
33. Heuristic Approach
- Uses twelve transformation rules to convert relational algebra expressions into more efficient forms
- Also uses five processing strategies applied during query processing
- A given heuristic is not always the best transformation for a particular query
- Reduces the search space for cost evaluation but does not necessarily reduce costs
34. Heuristic Approach Cont.
- Use relational algebra equivalence rules to improve the expected performance of a given query tree
- Consider the example given earlier:
- Join followed by Selection (~3,050 disk reads)
- Selection followed by Join (~1,160 disk reads)
35. 12 Rules
- Cascade of Selection
- (1) σ_{p ∧ q ∧ r}(R) = σ_p(σ_q(σ_r(R)))
- Commutativity of Selection operations
- (2) σ_p(σ_q(R)) = σ_q(σ_p(R))
- In a sequence of Projections, only the last is required
- (3) Π_L Π_M … Π_N(R) = Π_L(R)
- Selections can be combined with Cartesian products and Joins
- (4) σ_p(R × S) = R ⋈_p S
- (5) σ_p(R ⋈_q S) = R ⋈_{q ∧ p} S
36. 12 Rules Cont.
- Join and Cartesian product operations are commutative and associative
- (6) R × S = S × R
- (7) R × (S × T) = (R × S) × T
- (8) R ⋈_p S = S ⋈_p R
- (9) (R ⋈_p S) ⋈_{q ∧ r} T = R ⋈_{p ∧ r} (S ⋈_q T)
- Selection distributes over Joins
- If predicate p involves attributes of R only:
- (10) σ_p(R ⋈_q S) = σ_p(R) ⋈_q S
- If predicate p involves only attributes of R and q involves only attributes of S:
- (11) σ_{p ∧ q}(R ⋈_r S) = σ_p(R) ⋈_r σ_q(S)
37. 12 Rules Cont.
- Associativity of Union and Intersection (but not Set Difference)
- (12) (R ∪ S) ∪ T = S ∪ (R ∪ T)
- (R ∩ S) ∩ T = S ∩ (R ∩ T)
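- As a sanity check (not from the original slides), rule (4) can be verified on toy relations in Python; the sample data and predicate are illustrative:

    from itertools import product

    # sigma_p(R x S) = R join_p S, with R ~ Staff, S ~ Branch, and p the
    # branchNo equality condition.
    R = [{"staffNo": 1, "branchNo": "B1"}, {"staffNo": 2, "branchNo": "B2"}]
    S = [{"branchNo": "B1", "city": "London"}, {"branchNo": "B3", "city": "Leeds"}]
    p = lambda r, s: r["branchNo"] == s["branchNo"]

    cartesian_then_select = [(r, s) for r, s in product(R, S) if p(r, s)]
    join = [(r, s) for r in R for s in S if p(r, s)]

    print(cartesian_then_select == join)   # True: both yield only the B1 pair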
38. 5 Rules to Optimize By
- Perform Selection operations as early as possible.
- Combine a Cartesian product with a subsequent Selection operation whose predicate represents a join condition into a Join operation.
- Use the associativity of binary operations to rearrange leaf nodes so that the leaf nodes with the most restrictive Selection operations are executed first.
- Perform Projection operations as early as possible.
- Compute common expressions once.
39. Cost Estimation Approach
- The goal of cost estimation is to choose the most efficient way to implement relational algebra operations by selecting the option with the lowest cost
- We want to select the option that yields the fewest disk reads
40. Cost Estimation Assumptions
- Assume that the DBMS holds the following information in its system catalog
- For each base relation R:
- nTuples(R) - the number of records (cardinality) of R
- bFactor(R) - the number of tuples that fit into one block
- nBlocks(R) - the number of blocks required to store R
- nBlocks(R) = ⌈nTuples(R) / bFactor(R)⌉ when tuples are stored together physically
41. Assumptions Cont.
- For each attribute A of base relation R:
- nDistinctA(R) - the number of distinct values that appear for attribute A in relation R
- minA(R), maxA(R) - the minimum and maximum possible values for attribute A in relation R
- SCA(R) - the selection cardinality of attribute A in relation R: the average number of tuples that satisfy an equality condition on attribute A
- For each multilevel index I on attribute set A:
- nLevelsA(I) - the number of levels in I
- nLfBlocksA(I) - the number of leaf blocks in I
42. Example: Cost Estimation Strategies for Selections
- To decide which implementation to select, the main strategies considered for Selection operations are:
- Linear search (unordered file, no index)
- Binary search (ordered file, no index)
- Equality on hash key
- Equality condition on primary key
- Inequality condition on primary key
- Equality condition on clustering (secondary) index
- Equality condition on a non-clustering (secondary) index
- Inequality condition on a secondary B-tree index
43. Example: Estimated I/O Costs of Strategies
- Linear search (unordered file, no index)
- nBlocks(R)/2, for an equality condition on a key attribute
- nBlocks(R), otherwise
- Binary search (ordered file, no index)
- ⌈log2(nBlocks(R))⌉, for an equality condition on the ordering attribute
- ⌈log2(nBlocks(R))⌉ + ⌈SCA(R)/bFactor(R)⌉ − 1, otherwise
- Equality condition on clustering (secondary) index
- nLevelsA(I) + ⌈SCA(R)/bFactor(R)⌉
44. Example
- There is a hash index with no overflow on the primary key attribute staffNo
- There is a clustering index on the foreign key attribute branchNo
- There is a B-tree index on the salary attribute
- The estimated cost of a linear search on the key attribute staffNo is 50 blocks
- The estimated cost of a linear search on a non-key attribute is 100 blocks
45. Example Cont.
- The following statistics about the Staff relation are stored:
- nTuples(Staff) = 3000
- bFactor(Staff) = 30
- nBlocks(Staff) = 100 (3000/30)
- nDistinctbranchNo(Staff) = 500
- nDistinctposition(Staff) = 10
- nDistinctsalary(Staff) = 500
- minsalary(Staff) = 10,000
- maxsalary(Staff) = 50,000
- SCbranchNo(Staff) = 6
- SCposition(Staff) = 300
- SCsalary(Staff) = 6
- nLevelsbranchNo(I) = 2
- nLevelssalary(I) = 2
- nLfBlockssalary(I) = 50
46. Example Cont.
- Selection operation: σ_{branchNo='B003'}(Staff)
- branchNo is a foreign key with a clustering index and there is an equality condition, so we use the strategy mentioned before to calculate the cost
- Cost = nLevelsA(I) + ⌈SCA(R)/bFactor(R)⌉
- = nLevelsbranchNo(I) + ⌈SCbranchNo(Staff)/bFactor(Staff)⌉
- = 2 + ⌈6/30⌉ = 3 blocks
- The estimated cardinality of the result relation is SCbranchNo(Staff) = 6
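- For completeness (not from the original slides), the same estimate in Python; the statistics dictionary simply restates the catalog values above:

    import math

    # sigma_{branchNo='B003'}(Staff) via the clustering-index formula:
    # nLevels_A(I) + ceil(SC_A(R) / bFactor(R))
    stats = {"nLevels_branchNo": 2, "SC_branchNo": 6, "bFactor": 30}

    def clustering_index_cost(n_levels, selection_cardinality, b_factor):
        return n_levels + math.ceil(selection_cardinality / b_factor)

    cost = clustering_index_cost(stats["nLevels_branchNo"],
                                 stats["SC_branchNo"], stats["bFactor"])
    print(cost)   # 3 blocks; estimated result cardinality is SC_branchNo = 6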
47. Semantic Query Optimization
- Based on constraints specified on the database schema to reduce the search space
- Example:
- Ensure that no one staff member manages more than 100 properties:

  CREATE ASSERTION StaffNotHandlingTooMuch
  CHECK (NOT EXISTS (SELECT staffNo
                     FROM PropertyForRent
                     GROUP BY staffNo
                     HAVING COUNT(*) > 100))
48. Example Cont.
- If the optimizer is aware of the constraint created to prevent staff from managing more than 100 properties, then it can ignore the query below because no groups will satisfy the HAVING clause.

  SELECT s.staffNo, COUNT(*)
  FROM Staff s, PropertyForRent p
  WHERE s.staffNo = p.staffNo
  GROUP BY s.staffNo
  HAVING COUNT(*) > 100
49. Pipelining
- The result of one operation is sent to another operation without creating a temporary relation to hold the intermediate result.
- Pipelining can save on the cost of reading from and writing to disk.
50. Pipelining
- This approach dispenses with the temporary relation and instead applies the second Selection to each tuple in the result of the first Selection as it is produced, adding any qualifying tuples from the second operation to the result.
51. Pipelining
- Each operation runs as a separate process or thread within the DBMS
- Takes a stream of tuples as input and creates a stream of tuples as output
- A buffer is created for each pair of adjacent operations
- Drawback: the inputs to operations are not necessarily available all at once for processing
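- To illustrate the idea (not from the original slides), a minimal Python sketch of iterator-style pipelining, where each operator is a generator pulling tuples from its child so no intermediate relation is materialised; the sample Staff data and predicates are made up:

    def scan(relation):
        for tup in relation:              # a real DBMS would read disk blocks here
            yield tup

    def select(predicate, child):
        for tup in child:                 # consume the child's tuples one at a time
            if predicate(tup):
                yield tup

    staff = [{"name": "Ann", "position": "Manager",   "salary": 30000},
             {"name": "Bob", "position": "Assistant", "salary": 12000},
             {"name": "Eve", "position": "Manager",   "salary": 18000}]

    # The second Selection is applied to each tuple as the first one produces it.
    pipeline = select(lambda t: t["salary"] > 20000,
                      select(lambda t: t["position"] == "Manager", scan(staff)))
    print(list(pipeline))   # [{'name': 'Ann', 'position': 'Manager', 'salary': 30000}]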
52. Distributed Database Query Optimization
- Query Decomposition
- Takes the query and performs a partial optimization using the techniques already discussed. Sends a relational algebra tree to the next layer.
- Data Localization
- Takes into account how the data has been distributed and replaces the leaves of the tree with their reconstruction algorithms.
- Global Optimization
- Takes statistical information into account to find a near-optimal execution strategy, with parts of the query sent to the local DBMSs.
53. Distributed Database Query
- Local Optimization
- Each local DBMS performs its own local optimization using the techniques we have discussed.