Access Path Selection in a Relational Database Management System - PowerPoint PPT Presentation

1 / 19
About This Presentation
Title:

Access Path Selection in a Relational Database Management System

Description:

Estimation of input and intermediate cardinalities based on simple ... When no estimate available, use magic number. These estimations are done periodically ... – PowerPoint PPT presentation

Number of Views:365
Avg rating:3.0/5.0
Slides: 20
Provided by: Joha184
Category:

less

Transcript and Presenter's Notes

Title: Access Path Selection in a Relational Database Management System


1
Access Path Selection in a Relational Database
Management System
  • Selinger et al.

2
Query Optimization
  • Declarative query language relieves programmer of
    the burden to choose an access plan.
  • Difference between good and bad access plans can
    be several orders of magnitude.
  • Problem How do we find a good plan?

3
How to choose a good plan?
  • Divide into three problems
  • What is the set of plans we consider?
  • How do we compare (cost) plans?
  • How do we choose a good plan from this set?

4
Background SQL
  • SELECT DISTINCT ltlist of columnsgt
  • FROM ltlist of tablesgt
  • WHERE ltlist of "Boolean Factors(predicates in
    CNF)gt
  • GROUP BY ltlist of columnsgt
  • HAVING ltlist of Boolean Factorsgt
  • ORDER BY ltlist of columnsgt

5
Processing SQL Statements
  • Four steps
  • Parsing
  • Optimization
  • Code generation
  • Execution

6
SQL Query Types
  • Single block queries only
  • optimize nested sub-queries separately
  • Correlated sub-queries are much harder and much
    more expensive than un-correlated sub-queries.
  • Rewrite to remove correlations where possible.
    (Tricky, more about this later.)
  • SPJ queries only in this paper

7
Problem 1 Plan Space
  • Fixed set of individual access methods
  • Sequential index (clustered/unclustered) scans
  • NL-join, (sort)-merge join, hash join
  • Sorting hash-based grouping
  • Plan flows in a non-blocking fashion with
    get-next iterators

8
Plan Space (Contd.)
  • Assumptions in System R
  • Selections of sargable predicates are "pushed
    down"
  • Projections are "pushed down"
  • Single query blocks
  • Only left-deep plan trees (There are n! plans
    (not factoring in choice of join method))
  • Avoid Cartesian products

9
Problem 2 How to Cost a Plan
  • Estimation of input and intermediate
    cardinalities based on simple statistical models
    (e.g., uniform distribution assumption, attribute
    independence)
  • Estimation of costs for each operator based on
    statistics about input relations
  • Cost is weighted function between I/O and CPU(no
    distinction between random and sequential IO)

10
Cost Estimation (Selinger)
  • Maintenance of simple statistics
  • of tuples pages
  • of values per column (only for indexed columns)
  • Assumption of attribute independence
  • When no estimate available, use magic number
  • These estimations are done periodically

11
Cost Estimation (Today)
  • Sampling so far only concrete results for base
    relations
  • Histograms getting better. Common in industry,
    some interesting new research.
  • Controlling "error propagation"

12
Problem 3 Choosing a Plan
  • Exhaustive search
  • Dynamic Programming (prunes suboptimal parts of
    the search space) System R
  • Top-down, transformative version of DP Volcano,
    Cascades (used in MS SQL Server?)
  • Randomized search algorithms (e.g. Ioannidis
    Kang)
  • Techniques from Operations Research
  • Etc.

13
System R Approach
  • Recall Only left-deep plan trees (n! different
    plans)
  • Observation Many of these plans share common
    prefixes, so do not recompute all of them
    Dynamic Programming

14
Dynamic Programming Approach
  • Find all 1-table plans for each base relation
  • Try all ways of joining i-table plans saved so
    far with 1-table plans. Save cheapest unordered
    (i1)-table plans, and cheapest (i1)-table plans
    for each interesting order
  • Note secondary join predicates are just like
    selections that cant be pushed down
  • At the end, GROUP BY and ORDER BY
  • Use plan in interesting order, or add sort to
    cheapest unordered plan.

15
Evaluation
  • Complexity of dynamic programming about n2n-1,
    intermediate storage  plans
  • No-cartesian-products rule can make a big
    difference for some queries.
  • DP only works up to 10-15 joins
  • Adding parameters to the search space makes
    things worse (e.g. expensive predicates,
    distribution, parallelism, etc.)

16
Nested Queries
  • Subqueries optimized separately
  • Uncorrelated vs. correlated subqueries
  • Uncorrelated subqueries are basically constants
    to be computed once
  • Correlated subqueries are like function calls

17
Query Rewrite in IBM DB2
  • Leung et al.

18
Why Query Rewrite?
  • Problem
  • Very complex queries automatically generated
    through tools with many levels of subqueries
  • Correlated subqueries
  • Selinger approach only works one block at a time
  • Main idea Transform the query into a simpler,
    equivalent query

19
Query Rewrite is Tricky
  • SELECT P.pno
  • FROM Parts P
  • WHERE P.quantity
  • (SELECT COUNT()
  • FROM Supply S
  • WHERE S.pno P.pno AND
  • S.shipdate lt 1-1-2000)
  • INSERT INTO Temp (pnum, cnt)
  • (SELECT P.pno, COUNT()
  • FROM Supply S
  • WHERE S.shipdate lt 1-1-2000 GROUP BY P.pno)
  • SELECT P.pno
  • FROM Parts P, Temp T
  • WHERE P.quantity T.cnt AND
  • P.pno T.pno
Write a Comment
User Comments (0)
About PowerShow.com