Query Processing and Optimization - PowerPoint PPT Presentation

1 / 53
About This Presentation
Title:

Query Processing and Optimization

Description:

Any Thing else you guys want to add should go here. Question in Query Processing ... Also uses five processing strategies applied during query processing ... – PowerPoint PPT presentation

Number of Views:389
Avg rating:3.0/5.0
Slides: 54
Provided by: peterg50
Category:

less

Transcript and Presenter's Notes

Title: Query Processing and Optimization


1
Query Processing and Optimization
  • Sao Mai Nguyen
  • Peter Greiner
  • Niarcas Jeffrey

2
Presentation Overview
  • Overview of the main phases of Query Processing
  • Parsing
  • Validating
  • Optimization
  • Execution
  • Query Decomposition
  • Heuristic Approach to Query Optimization

3
Overview Cont.
  • Cost Estimation Approach
  • Pipelining
  • If time permits we will also cover
  • Query Processing for distributed Databases

4
Query Processing
  • The activities involved in parsing, validating,
    optimizing, and executing a query.
  • Transform a query written in a high-level
    language into a correct and efficient execution
    strategy expressed in a low-level language
  • Execute the strategy to retrieve the required
    data
  • Also Called Query Decomposition

5
Parsing Overview
  • The process of analyzing a sequence of tokens in
    order to determine its grammatical structure with
    respect to a given formal grammar.
  • Parsing transforms input text into a data
    structure, usually a tree, which is suitable for
    later processing and which captures the implied
    hierarchy of the input
  • Please Find out what kind of parser is used
    (Top-down/LL parser, Bottom-up/LR parser)

6
Validation Overview
  • The process of controlling that data inserted
    into an application satisfies pre determined
    formats or complies with stated length and
    character requirements and other defined input
    criteria.
  • Also called Semantic Analysis

7
Query Optimization
  • The activity of choosing an efficient strategy
    for processing a query
  • Generally the optimized query is the one that
    minimizes resource usage.
  • The resources which are costed are CPU path
    length, disk storage service time, and
    interconnect usage between units of parallelism.
  • The set of query plans examined is formed by
    examining possible access paths (e.g., primary
    index access, secondary index access, full file
    scan) and various relational table join
    techniques (e.g, merge join, hash join, product
    join).

8
Execution Overview
  • This is the last step of query processing
  • This step is given an execution plan found after
    the optimization step.
  • Any Thing else you guys want to add should go
    here.

9
Question in Query Processing
  • Which relational algebra expression, equivalent
    to the given query, will lead to the most
    efficient solution plan?
  • For each algebraic operator, what algorithm (of
    several available) do we use to compute that
    operator?
  • How do operations pass data (main memory buffer,
    disk buffer,)?
  • Will this plan minimize resource usage?
    (CPU,Response,Disk,)

10
Why Optimize?
  • Lets Start with an Example
  • SELECT
  • FROM Staff s, Branch b
  • WHERE s.branchNo b.branchNo AND
  • (s.position Manager AND b.city London)

11
Why Optimize Cont.
  • Results in these equivalent relational algebra
    statements
  • (1) s(positionManager)(cityLondon)(Sta
    ff.branchNoBranch.branchNo) (Staff X Branch)
  • (2) s(positionManager)(cityLondon)
    (Staff wvStaff.branchNo Branch.branchNo Branch)
  • (3) s(positionManager) (Staff)
    wvStaff.branchNo Branch.branchNo
    s(cityLondon) (Branch)
  • Assume
  • 1000 tuples in Staff.
  • 50 Managers
  • 50 tuples in Branch.
  • 5 London branches
  • No indexes or sort keys
  • All temporary results are written back to disk
    (memory is small)
  • Tuples are accessed one at a time (not in blocks)

12
Why Optimize Cont.
  • s(positionManager) (cityLondon)
    (Staff.branchNoBranch.branchNo) (Staff X Branch)
  • Requires (100050) disk accesses to read from
    Staff and Branch relations
  • Creates temporary relation of Cartesian Product
    (100050) tuples
  • Requires (100050) disk access to read in
    temporary relation and test predicate
  • Total Work (100050) 2(100050) 101,050
    I/O operations

13
Why Optimize Cont.
  • s(positionManager)(cityLondon) (Staff
    wvStaff.branchNo Branch.branchNo Branch)
  • Again requires (100050) disk accesses to read
    from Staff and Branch
  • Joins Staff and Branch on branchNo with 1000
    tuples (1 employee 1 branch )
  • Requires (1000) disk access to read in joined
    relation and check predicate
  • Total Work (100050) 2(1000) 3050 I/O
    operations
  • 3300 Improvement over Query 1

14
Why Optimize Cont.
  • s(positionManager) (Staff)
    wvStaff.branchNo Branch.branchNo
    s(cityLondon) (Branch)
  • Read Staff relation to determine Managers (1000
    reads)
  • Create 50 tuple relation(50 writes)
  • Read Branch relation to determine London
    branches (50 reads)
  • Create 5 tuple relation(5 writes)
  • Join reduced relations and check predicate (50
    5 reads)
  • Total Work 1000 2(50) 5 (50 5) 1160
    I/O operations
  • 8700 Improvement over Query 1
  • Consider if Staff and Branch relations were 10x
    size? 100x? Yikes!

15
The Three Steps of Query Processing
  • (1) Query Decomposition
  • Analysis
  • Derive Relational Algebra Tree
  • Normalization
  • (2) Query Optimization
  • Heuristic Improve and Refine relational algebra
    tree to create equivalent Logical
    Query Plans
  • Cost Based Use database statistics to estimate
    physical costs of logical operators in LQP to
    create Physical Execution Plans
  • (3) Query Execution

16
A Picture is Worth a Thousand Words
17
Query Decomposition
  • First phase of query processing.
  • Aims to translate a high-level SQL statements to
    relational algebra query
  • Then take these created relational algebra
    queries are check that they are syntactically and
    semantically correct.
  • Stages
  • Analysis
  • Normalization
  • Semantic analysis
  • Simplification
  • Query restructuring

18
Analysis
  • Lexical
  • Is it even valid SQL?
  • Syntactic
  • Do the relations/attributes exist and are the
    operations valid?
  • Result is internal tree representation of SQL
    query (Parse Tree)

19
Example
  • SELECT staffNumber FROM staff WHERE position gt
    10
  • This query would rejected in this stage if
  • The attribute Staff Number is not defined for the
    Staff relation
  • The comparison gt10 is incompatible with the
    data type position

20
Query Optimization
  • RELATIONAL ALGEBRA TREE
  • Root The desired result of query
  • Leaf Base relations of query
  • Non-Leaf Intermediate relation created from
    relational algebra operation

21
Normalization
  • Converts the query into a normalized form that be
    more easily manipulated.
  • Converted into one of two forms
  • Conjunctive Normal Form
  • A sequence of conjuncts that are connected with
    the (AND) operator. Each conjunct contains one
    or more terms connected by the v (OR) operator.
  • Disjunctive Normal Form
  • A sequence of Disjuncts that are connected with
    v (OR) operator. Each disjunct contains one or
    more terms connected by the (and) operator.

22
Semantic Analysis
  • Objective is to reject normalized queries that
    are incorrectly formulated or contradictory.
  • Happens when
  • Query does not contribute to finding the result
    (i.e. missing joins)
  • If its predicate cannot be satisfied by any tuple
    (position manager position Assistant)
  • Algorithms to determine correctness
  • Exist only for the subset of queries that do not
    contain disjunction or negation.
  • Relation connection graph
  • Normalized attribute connection graph

23
Correctness Algorithms
  • Relation Connection Graph
  • To construct a relation connection graph, we
    create a node for each relation and node for the
    result. We then crate edges between two node that
    represent a join, and edges between nodes that
    represent the source of Projection operations
  • If the graph is not connected, the query is
    incorrectly formulated.

24
Example
  • SELECT p.propertyNo, p.street FROM Client c,
    Viewing v, PropertyForRent p WHERE c.clientNo
    v.clientNo AND c.maxRent gt 500 AND c.prefType
    Flat AND p.ownerNo CO93

Because the graph is not fully connected the
query in not correctly formulated. It is missing
the (v.propertyNo p.propertyNo) from the
predicate.
Result
P
V
25
Correctness Algorithms Cont.
  • Normalized Attribute Connection Graph
  • To construct this graph we create a node for each
    reference to an attribute, or constant 0. We then
    crate a directed edge between nodes that
    represent a join, and a directed edge between an
    attribute node and a constant 0 node that
    represents a selection operation. Next, we weight
    the edges a -gt b with the value c, if it
    represents the inequality condition (a lt b c)
    and weight the edges 0 -gt a with the value -c, if
    it represents the inequality condition (a gt c)
  • If the graph has a cycle for which the valuation
    sum is negative, the query is contradictory

26
Example
  • SELECT p.propertyNo, p.street FROM Client c,
    Viewing v, PropertyForRent p WHERE c.maxRent gt
    500 AND c.clientNo v.clientNo AND c.prefType
    Flat AND c.maxRent lt 200

27
Example Cont.
28
Example Cont.
  • This query is contradictory because there is a
    cycle between c.maxRent and 0 with a negative
    valuation sum.
  • We cannot have a client with a maximum rent that
    is both grater than 500 and less than 200.

29
Simplification
  • Objective To detect redundant qualifications,
    eliminate common sub-expressions, and transform
    the query to a semantically equivalent but more
    easily efficiently computer form.
  • Typically access restriction, view definitions,
    and integrity constraints are considered at this
    stage
  • If all the above checks out then the
    simplification applies well-known idempotency
    rules of Boolean algebra.

30
Rules of Boolean Algebra
Associativity a V (b V c) (a V b) V c a (b
c) (a b) c Commutativity a V b b V a
a b b a Absorption a V (a b) a a
(a V b) a
Distributivity a V (b c) (a V b) (a V c)
a (b V c) (a b) V (a c) Complements a
V a 1 a a 0
31
Rules of Boolean Algebra Cont.
Idempotency a V a a a a a
Boundedness a V 0 a a 1 a 0 and 1 are
complements a V 1 1 a 0 0 0 1 1 0
de Morgan's laws (a V b) a b (a
b) a V b Involution a a
32
Ways to Optimize
  • Heuristical Approach
  • Cost Estimation Approach
  • Semantic Query Optimization
  • May be used in conjunction with the above
    techniques

33
Heuristical Approach
  • Uses twelve transformation rules to convert
    relational algebra expressions into more
    efficient forms
  • Also uses five processing strategies applied
    during query processing
  • Every heuristic not always best transform
  • Reduces search space for cost evaluation but does
    not necessarily reduce costs

34
Heuristical Approach Cont.
  • Use relational algebra equivalence rules to
    improve the expected performance of a given query
    tree
  • Consider the example given earlier
  • Join followed by Selection ( 3050 disk reads)
  • Selection followed by Join ( 1160 disk reads)

35
12 Rules
  • Cascade of Selection
  • (1) sp Ù q Ù r (R) sp(sq(sr(R)))
  • Commutativity of Selection Operations
  • (2) sp(sq(R)) sq(sp(R))
  • In a sequence of projections only the last is
    required
  • (3) PLPM PN(R) PL(R)
  • Selections can be combined with Cartesian
    Products and Joins
  • (4) sp( R x S ) R wvp S
  • (5) sp( R wvq S ) R wvq Ù p S

36
12 Rules Cont.
  • Join and Cartesian Product Operations are
    Commutative and Associative
  • (6) R x S S x R
  • (7) R x (S x T) (R x S) x T
  • (8) R wvp S S wvp R
  • (9) (R wvp S) wvq Ù r T R wvp Ù r (S wvq T)
  • Selection Distributes over Joins
  • If predicate p involves attributes of R only
  • (10) sp( R wvq S ) sp(R) wvq S
  • If predicate p involves only attributes of R and
    q involves only attributes of S
  • (11) sp Ù q(R wvr S) sp(R) wvr sq(S)

37
12 Rules Cont.
  • Associativity of Union and Intersection (but not
    set difference)
  • (R ? S) ? T S ? ( R ? T)
  • (R ? S) ? T S ? (R ? T)

38
5 rules to optimize by
  • Perform Selection operation as early as possible.
  • Combine the Cartesian product with a subsequent
    Selection operation whose predicate represents a
    join condition into a Join operation.
  • Use associativity of binary operations to
    rearrange lead nodes so that the leaf nodes with
    the most restrictive Selection operations are
    executed first.
  • Perform Projection operations as early as
    possible.
  • Compute common expressions once.

39
Cost Estimation Approach
  • The goal of cost estimation is to choose the most
    efficient way to implement relational algebra
    operations by selecting the option with the
    lowest cost
  • Want to select the option that yields the fewest
    disk reads

40
Cost Estimation Assumptions
  • Assume that the DBMS will hold the following
    information in its system catalog
  • For each base relation R
  • nTuples(R) number of records, or cardinality
  • bFactor(R) - number of tuples that fit into one
    block
  • nBlocks(R) - the number of blocks required to
    store R
  • nBlocks(R) nTuples(R) / bFactor(R) when
    tuples are stored together physically

41
Assumptions Cont.
  • For each attribute A of base relation R
  • nDistinctA(R) - Number of distinct values that
    appear for attribute A in relation R
  • minA(R), maxA(R), - minimum and maximum possible
    values for the attribute A in relation R
  • SCA(R) - the selection cardinality of attribute A
    in relation R average number of tuples that
    satisfy an equality condition on attribute A
  • For each multilevel index I on attribute set A
  • nLevelsA(I) - the number of levels in I
  • nLfBlocksA(I) - the number of leaf blocks in I

42
Example Cost Estimation Strategies for
Selections
  • To decide which implementation to select, the
    main strategies considered for Selection
    operations are
  • Linear search (unordered file, no index)
  • Binary search (ordered file, no index)
  • Equality on hash key
  • Equality condition on primary key
  • Inequality condition on primary key
  • Equality condition on clustering (secondary)
    index
  • Equality condition on a non-clustering
    (secondary) index
  • Inequality condition on a secondary B-tree index

43
Example Estimated I/O Costs of Strategies
  • Linear search (unordered file, no index)
  • nBlocks(R)/2, for equality condition on key
    attribute nBlocks(R)
  • nBlocks(R), otherwise
  • Binary search (ordered file, no index)
  • log2(nBlocks(R)) , for equality condition on
    ordered attribute
  • log2(nBlocks(R)) SCA(R)/bFactor(R) - 1,
    otherwise
  • Equality condition on clustering (secondary)
    index
  • nLevelsA(I) SCA(R)/bFactor(R)

44
Example
  • There is a hash index with no overflow on the
    primary key attribute staffNo
  • There is a clustering index on the foreign key
    attribute branchNo
  • There is a B-tree index on the salary attribute
  • Estimated cost of a linear search on the key
    attribute staffNo is 50 blocks
  • Estimated cost of a linear search on a non-key
    attribute is 100 blocks

45
Example Cont.
  • The following statistics about the Staff relation
    are stored
  • nDistinctsalary(Staff) 500
  • SCbranchno(Staff) 6
  • maxsalary (Staff) 50,000
  • SCposition(Staff) 300
  • minsalary (Staff) 10,000
  • SCsalary(Staff) 6
  • nLevelsbranchNo(I) 2
  • nLfBlockssalary(I) 50
  • nLevelssalary(I) 2
  • nTuples(Staff) 3000
  • nDistinctbranchNo(Staff) 500
  • bFactor(Staff) 30
  • nDistinctposition(Staff) 10
  • nBlocks(Staff) 100 (3000/30)

46
Example Cont.
  • Selection operation sbranchNoB003(Staff)
  • branchNo is a foreign key with a clustering index
    and there is an equality condition, so we use the
    strategy mentioned before to calculate cost
  • Cost is nLevelsA(I) SCA(R)/bFactor(R)
  • nLevelsbranchNo(I) SCbranchNo(Staff)/bFactor(St
    aff)
  • 2 6/30 3 blocks
  • Estimated cardinality of the result relation is
    SCbranchNo(Staff) 6

47
Semantic Query Optimization
  • Based on constraints specified on the database
    schema to reduce search space
  • Example
  • Ensure that no one staff member manages more than
    100 properties

CREATE ASSERTION StaffNotHandlingTooMuch CHECK
(NOT EXISTS (SELECT staffNo FROM
propertyForRent GROUP BY staffNo HAVING
COUNT() gt 100))
48
Example Cont.
  • If the optimizer is aware of the constraint
    created to prevent staff from managing more than
    100 properties, then it can ignore the query
    because no groups will satisfy the HAVING clause.

SELECT s.staffNo, COUNT() FROM Staff s,
PropertyForRent p WHERE s.staffNO
p.staffNo GROUP BY s.staffNo HAVING COUNT() gt
100
49
Pipelining
  • The result of one operation is sent to another
    operation without creating a temporary relation
    to hold the intermediate result.
  • Pipelining can save on the cost of reading and
    writing to disk.

50
Pipelining
  • This approach dispenses with the temporary
    relation and instead applies the second Selection
    to each tuple in the result of first Selection as
    it is produced, and adds any qualifying tuples
    from the second operation to the result.

51
Pipelining
  • Separate Process or thread within the DBMS
  • Takes a steam of tuples as input / creates a
    stream of tuples as output
  • Buffer created for each pair of adjacent
    operations.
  • Drawback inputs to operations are not
    necessarily available all at once for processing.

52
Distributed Database Query Optimization
  • Query Decomposition
  • Takes the query and performs a partial
    optimization using techniques already discussed.
    Sends a relational algebra tree to the next
    layer.
  • Data Localization
  • Takes into account how the data has been
    distributed and replaces the leaves of the tree
    with their reconstruction algorithms
  • Global Optimization
  • Takes into account statistical information to
    find a near optimal execution strategy with parts
    of the query sent to the local DBMS

53
Distributed Database Query
  • Local Optimization
  • Each local DBMS performs its own local
    optimization using the techniques we discussed
Write a Comment
User Comments (0)
About PowerShow.com