Lecture 8: Query Optimization - PowerPoint PPT Presentation

1 / 26
About This Presentation
Title:

Lecture 8: Query Optimization

Description:

output size = ? selectivity = ? 14. CS511 Advanced Database Management Systems ... Selectivity Factors: Join Predicates. column1 = column2: when both columns are keys: ... – PowerPoint PPT presentation

Number of Views:106
Avg rating:3.0/5.0
Slides: 27
Provided by: ChengXi4
Category:

less

Transcript and Presenter's Notes

Title: Lecture 8: Query Optimization


1
Lecture 8 Query Optimization
Sept. 19, 2007 ChengXiang Zhai
Most slides are adapted from Kevin Changs
lecture slides
2
DBMS Architecture
User/Web Forms/Applications/DBA
query
transaction
Query Parser
Transaction Manager
Query Rewriter
Logging Recovery
Query Optimizer
Lock Manager
Query Executor
Files Access Methods
Lock Tables
Buffers
Buffer Manager
Main Memory
Storage Manager
Storage
3
SQL Constructs
  • SECLECT (DISTINCT) ltlist of columnsgt
  • FROM ltlist of tablesgt
  • WHERE ltlist of Boolean Factorsgt
  • GROUP BY ltlist of columnsgt
  • HAVING ltlist of Boolean Factorsgt
  • ORDER BY ltlist of columnsgt

4
SQL Semantics
  • Take Cartesian product of FROM tables
  • Project only those referenced columns
  • WHERE apply all filters in WHERE
  • GROUP BY form groups on results
  • HAVING apply filter to groups
  • ORDER BY make sure results in right order
  • DISTINCT remove duplicates
  • Q Is this operational semantics efficient?
  • Different plans mainly different in the first
    three

5
Optimization Different Strategies
  • Optimal approach
  • enumerate each possible plan
  • measure its performance by running it
  • pick the fastest one
  • Heuristics approach
  • fixed heuristics all the way through plan
    construction
  • e.g. always nested loop joins, indexed relation
    as inner
  • e.g. order relations from smallest to biggest
  • How does Selinger/System R differ from them?
  • What are the main components in their approach?

6
New Paradigm Cost-based Optimization
  • Plan space
  • what is the space of query plans?
  • Cost estimation
  • how to estimate the cost, without executing each?
  • Search algorithm
  • how to search the space, as guided by cost
    estimates

7
Space of Query Plans
  • What can you do differently? Parameters to tune
  • Selections
  • Joins

8
Space of Query Plans
  • Selections
  • algorithms sequential, index scan
  • ordering why this will matter?
  • Joins
  • algorithms nested-loop, sort merge, hash
  • ordering
  • Ordering/Grouping
  • can an interesting order be produced by
    join/selections?
  • algorithm sorting, hash-based
  • They interleave with each other!

9
Assumptions to Reduce the Space?
  • Projections
  • pushed down to reduce of columns
  • Selections
  • Joins

10
Huge Space!
  • Typical assumptions to help reduce the space
  • Projections
  • pushed down to reduce of columns
  • Selections
  • pushed down to reduce of rows
  • Joins
  • left-deep joins (what else can you do?)
  • avoid Cartesian products delay it in the plan
  • how to avoid Cartesian products?
  • May miss an optimal plan!

11
Cost/Size Estimation
  • Accurate relatively
  • goal is to compare plans, not to predict exact
    cost
  • more of an art than an exact science
  • Each operator input size, cost, output size
  • estimate cost based on input size
  • estimate output size (for next operator) or
    selectivity
  • selectivity ratio of output to input

12
Cost Estimation Selinger Style
  • Input simple DB statistics
  • of tuples disk pages
  • of distinct values per column
  • statistics updated periodically
  • Assumption of attribute/predicate independence
  • When no estimate available, use magic number
  • New/Alternative approaches
  • sampling, histogram of DB

13
Selectivity Factors Point Selection
  • column value
  • dept CS
  • input size NCARD(Student) 200, ICARD(dept)
    10
  • output size ? selectivity ?

14
Selectivity Factors Range Selection
  • What is assumed in these formulas?
  • column gt value1
  • F (maxValue - value1) / rangeOfValue
  • column in value1value2
  • F (value2 - value1) / rangeOfValue
  • column in (set of values)
  • F as union of point selections

15
Selectivity Factors Join Predicates
  • column1 column2
  • when both columns are keys
  • student.sid employee.sid
  • one key, the other foreign key
  • student.sid enrollment.sid
  • student NCARD 200, SID distinct values 200
  • enrollment NCARD 800, SID distinct values
    100
  • 800 (200/200) 800
  • none are keys
  • user.sid enrollement.sid
  • user NCARD 1000, SID distinct values 200
  • enrollment NCARD 800, SID distinct values
    100
  • 800 (1000/200) 4000
  • given formula covers all?

16
Goal Cost Estimate
  • System R cost model
  • Sum of I/O and CPU
  • ( PAGE) W (RSI CALLS)
  • why RSI calls model workload of RSS?

17
Search the Plan Space
  • Baseline exhaustive search
  • enumerate all combinations, and compare their
    cost
  • Search method parameters
  • plan tree development
  • construction bottom-up, top-down
  • modification improve a somehow-constructed tree
  • algorithms
  • heuristic selections make choices based on
    heuristics
  • branch and bound search bounded by the current
    best tree
  • hill climbing find nearby plans with lowest
    cost
  • dynamic programming construction by greedy
    selections
  • where does System-R approach fit in?

18
Plan Search System-R Style
  • AKA Selinger style optimization
  • Bottom-up
  • start from ground relations (in FROM)
  • work up the tree to form a plan
  • Dynamic programming
  • greedily prune subtrees that are obviously
    useless
  • but not necessarily local best at every step
    why?
  • Many other approaches

19
DP Local Optimal -gt Global Optimal?
  • Think of optimizing (T1 Q T2 Q T3) Q T4 Q T5
  • Sometimes it does hold
  • does plan (T1 Q T2 Q T3) affect the rest of the
    plan?
  • Not always true
  • example?
  • but hopefully at least give good plans

20
System R Search Start from Relations
  • Base relations access
  • find all plans for access to each base relation
  • push down sargable arguments
  • example of non-sargable predicates?
  • choose good plans, discard bad ones
  • keep cheapest for unordered each interesting
    order
  • Join ordering
  • find all ways of joining a pair of base relations
  • choose good plans, discard bad ones

21
WHERE Clause and Sargable Predicates
  • WHERE conditions
  • dept CS OR (dept EE AND GPA gt 3.5)
  • Normal forms
  • DNF (C) OR (E AND G)
  • CNF (C OR E) AND (C OR G)
  • Difference in using the CNF/DNF factors

22
System R Search Left-Deep Join Plans
  • Join ordering
  • consider only left-deep trees n! ordering for n
    tables
  • basis
  • find all ways of joining a pair of base relations
  • choose good plans, discard bad ones
  • induction until you have a full plan
  • k to k1 given plans of k-relation join, add one
    more
  • prefer one with predicate postpone Cartesian
    product
  • can throw away k-plans
  • Finally grouping/ordering
  • using interesting order
  • additional sorting

23
Nested Subqueries
  • Subqueries optimized separately
  • Correlation order of evaluations
  • uncorrelated queries are like constants
  • correlated queries are like function calls

24
Beyond System-R
  • Parallel/Distributed DB
  • User-defined predicates
  • Materialized views
  • Adaptive query optimization
  • Fuzzy querie
  • Machine learning for query optimization?

25
What You Should Know
  • Query optimization is critical for any commercial
    DBMS
  • good/bad can be orders of magnitude
  • The 3 components of the System-R style
    optimization
  • Plan space definition
  • Cost estimation
  • Search algorithm
  • huge number of alternative, semantically
    equivalent plans
  • Ideal goal
  • map a declarative query to the most efficient
    plan
  • Conventional wisdom avoid bad plans
  • State of the art
  • industry most optimizers are System-R style
  • academic always a core database research topic

26
Carry Away Messages
  • Appropriate formulation of the problem leads to
    good solutions
  • Query optimization as a search problem
  • Messy problems always require a clean model
  • Make assumptions to make the problem tractable
  • An approximate solution could be good enough
  • Opens up opportunities for further improvement
Write a Comment
User Comments (0)
About PowerShow.com