Access Path Selection in a Relational Database Management System - PowerPoint PPT Presentation

1 / 19

About This Presentation

Title:

Access Path Selection in a Relational Database Management System

Description:

Estimation of input and intermediate cardinalities based on simple ... When no estimate available, use magic number. These estimations are done periodically ... – PowerPoint PPT presentation

Number of Views:370

Avg rating:3.0/5.0

Slides: 20

Provided by: Joha184

Learn more at: http://www.cs.cornell.edu

Category:

more less

Transcript and Presenter's Notes

Title: Access Path Selection in a Relational Database Management System

1
Access Path Selection in a Relational Database
Management System

Selinger et al.

2
Query Optimization

Declarative query language relieves programmer of
the burden to choose an access plan.
Difference between good and bad access plans can
be several orders of magnitude.
Problem How do we find a good plan?

3
How to choose a good plan?

Divide into three problems
What is the set of plans we consider?
How do we compare (cost) plans?
How do we choose a good plan from this set?

4
Background SQL

SELECT DISTINCT ltlist of columnsgt
FROM ltlist of tablesgt
WHERE ltlist of "Boolean Factors(predicates in
CNF)gt
GROUP BY ltlist of columnsgt
HAVING ltlist of Boolean Factorsgt
ORDER BY ltlist of columnsgt

5
Processing SQL Statements

Four steps
Parsing
Optimization
Code generation
Execution

6
SQL Query Types

Single block queries only
optimize nested sub-queries separately
Correlated sub-queries are much harder and much
more expensive than un-correlated sub-queries.
Rewrite to remove correlations where possible.
(Tricky, more about this later.)
SPJ queries only in this paper

7
Problem 1 Plan Space

Fixed set of individual access methods
Sequential index (clustered/unclustered) scans
NL-join, (sort)-merge join, hash join
Sorting hash-based grouping
Plan flows in a non-blocking fashion with
get-next iterators

8
Plan Space (Contd.)

Assumptions in System R
Selections of sargable predicates are "pushed
down"
Projections are "pushed down"
Single query blocks
Only left-deep plan trees (There are n! plans
(not factoring in choice of join method))
Avoid Cartesian products

9
Problem 2 How to Cost a Plan

Estimation of input and intermediate
cardinalities based on simple statistical models
(e.g., uniform distribution assumption, attribute
independence)
Estimation of costs for each operator based on
statistics about input relations
Cost is weighted function between I/O and CPU(no
distinction between random and sequential IO)

10
Cost Estimation (Selinger)

Maintenance of simple statistics
of tuples pages
of values per column (only for indexed columns)
Assumption of attribute independence
When no estimate available, use magic number
These estimations are done periodically

11
Cost Estimation (Today)

Sampling so far only concrete results for base
relations
Histograms getting better. Common in industry,
some interesting new research.
Controlling "error propagation"

12
Problem 3 Choosing a Plan

Exhaustive search
Dynamic Programming (prunes suboptimal parts of
the search space) System R
Top-down, transformative version of DP Volcano,
Cascades (used in MS SQL Server?)
Randomized search algorithms (e.g. Ioannidis
Kang)
Techniques from Operations Research
Etc.

13
System R Approach

Recall Only left-deep plan trees (n! different
plans)
Observation Many of these plans share common
prefixes, so do not recompute all of them
Dynamic Programming

14
Dynamic Programming Approach

Find all 1-table plans for each base relation
Try all ways of joining i-table plans saved so
far with 1-table plans. Save cheapest unordered
(i1)-table plans, and cheapest (i1)-table plans
for each interesting order
Note secondary join predicates are just like
selections that cant be pushed down
At the end, GROUP BY and ORDER BY
Use plan in interesting order, or add sort to
cheapest unordered plan.

15
Evaluation

Complexity of dynamic programming about n2n-1,
intermediate storage plans
No-cartesian-products rule can make a big
difference for some queries.
DP only works up to 10-15 joins
Adding parameters to the search space makes
things worse (e.g. expensive predicates,
distribution, parallelism, etc.)

16
Nested Queries

Subqueries optimized separately
Uncorrelated vs. correlated subqueries
Uncorrelated subqueries are basically constants
to be computed once
Correlated subqueries are like function calls

17
Query Rewrite in IBM DB2

Leung et al.

18
Why Query Rewrite?

Problem
Very complex queries automatically generated
through tools with many levels of subqueries
Correlated subqueries
Selinger approach only works one block at a time
Main idea Transform the query into a simpler,
equivalent query

19
Query Rewrite is Tricky