ICS 214A: Database Management Systems Winter 2004 - PowerPoint PPT Presentation

1 / 22
About This Presentation
Title:

ICS 214A: Database Management Systems Winter 2004

Description:

Those estimation formulas covered so far might not be accurate enough ... for each update on R, increment/decrement the corresponding bucket frequencies ... – PowerPoint PPT presentation

Number of Views:20
Avg rating:3.0/5.0
Slides: 23
Provided by: che7
Category:

less

Transcript and Presenter's Notes

Title: ICS 214A: Database Management Systems Winter 2004


1
ICS 214A Database Management Systems Winter 2004
  • Lecture 14 Histograms and Query Optimization
  • Professor Chen Li

2
Histograms Motivation
  • Accurate estimation is critical in query
    optimization
  • Those estimation formulas covered so far might
    not be accurate enough
  • Instead, we keep distribution of R.A (v1,f1), ,
    (vn, fn).
  • Called histogram.
  • Idea
  • Partition the domain of R.A into buckets
  • Store a small summary of the distribution within
    each bucket
  • Number of the buckets is the knob that controls
    the resolution

3
Equi-width Histogram
  • Divide the domain into B buckets of equal width
  • E.g., partition GPAs into buckets
  • Store the bucket boundaries and the sum of
    frequencies of the values with each bucket

4
Construction and maintenance
  • Construction
  • Use one pass over R to construct an accurate
    equi-width histogram
  • Keep a running count for each bucket
  • If scanning is not acceptable, use sampling
  • Construct a histogram on Rsample, and scale the
    frequencies by R/Rsample
  • Maintenance
  • Incremental maintenance for each update on R,
    increment/decrement the corresponding bucket
    frequencies
  • Periodical re-computation because distribution
    changes slowly

5
Using an equi-width histogram
  • Q sA5(R)
  • 5 is in bucket 5,8 (with 19 tuples)
  • Assume uniform distribution within the bucket
  • Thus Q ? 19/4 ? 5.
  • Actual value is 1
  • Q sAgt7 A lt 16(R)
  • 7,16 covers 9,12 (27 tuples) and 13,16 (13
    tuples)
  • 7,16 partially covers 5,8 (19 tuples)
  • Thus Q ? 19/2 27 13 ? 50
  • Actual value 52.

6
Equi-height Histogram
  • Divide the domain into B buckets with roughly the
    same number of tuples in each bucket
  • Store this number and the bucket boundaries
  • Intuition high frequencies are more important
    than low frequencies

7
Construction and maintenance
  • Construction
  • Sort all R.A values, and then take equally spaced
    slights
  • Example 1 2 2 3 4 7 8 9 10 10 10 10 11 11 12 12
    14 16
  • Sampling also works
  • Maintenance
  • Incremental maintenance
  • Merge adjacent buckets with small counts
  • Split any bucket with a large count
  • Select the median value to split
  • Need a sample of the values within this bucket to
    work well
  • Periodic re-computation also works.

8
Using an equi-height histogram
  • Q sA5(R)
  • 5 is in bucket 1,7 (with 16 tuples)
  • Assume uniform distribution within the bucket
  • Thus Q ? 16/7 ? 2. (actual value 1)
  • Q sAgt7 A lt 16(R)
  • 7,16 covers 8,9, 10,11,12,16 (all with
    tuples)
  • 7,16 partially covers 1,7 (16 tuples)
  • Thus Q ? 16/7 16 16 16 ? 50
  • Actually Q 52.

9
Histogram tricks
  • Store the number of distinct values in each
    bucket
  • To get rid of the effects of the values with 0
    frequency
  • These values tend to cause underestimation
  • Compressed histogram
  • Store (vi, fi) pairs explicitly if fi is high
  • For other values, use an equi-width or
    equi-height histogram

10
More Histograms
  • V-optimal histogram
  • Avoid putting very different frequencies into the
    same bucket
  • Partition in a way to minimize ?iVARi, where VARi
    is the frequency variance within bucket i
  • MaxDiff Histogram
  • Define area to be the product of the frequency of
    a value and its spread (the difference between
    this value and the next value with non-zero
    frequency)
  • Insert bucket boundaries where two adjacent areas
    differ by large amounts
  • More in Poosala et al., SIGMOD 1996

11
Did not cover
  • Multidimensional histograms
  • Using histograms to estimate other operators
    (e.g., joins)
  • Other mechanisms for selectivity estimation
  • Wavelets
  • Sampling
  • Probabilistic models of data
  • Much recent research on selectivity estimation in
    the context of
  • Query processing
  • OLAP

12
Next query optimization
13
Query Optimization
SQL query
parse
parse tree
convert
answer
logical query plan
execute
apply laws
statistics
Pi
improved l.q.p
pick best
estimate result sizes
(P1,C1),(P2,C2)...
l.q.p. sizes
estimate costs
consider physical plans
P1,P2,..
14
Query Optimization
  • Convert a logical query plan into an optimal
    physical plan.
  • Step 1 use rewrite laws to improve logical plan.
  • Step 2 use cost-based enumeration to choose a
    good physical plan corresponding to logical plan.
  • Determine order of join
  • Specify algorithm for each operator
  • Materialization vs. pipelining
  • Issues
  • 1. Estimation cost of operators
  • cost model
  • input size
  • 2. Enumeration algorithm

15
Example SQL query
  • SELECT title
  • FROM StarsIn
  • WHERE starName IN (
  • SELECT name
  • FROM MovieStar
  • WHERE birthdate LIKE 1960
  • )
  • (Find the movies with stars born in 1960)

16
Parse Tree
ltQuerygt
ltSFWgt
SELECT ltSelListgt FROM ltFromListgt
WHERE ltConditiongt
ltAttributegt ltRelNamegt
ltTuplegt IN ltQuerygt
title StarsIn
ltAttributegt ( ltQuerygt )
starName ltSFWgt
SELECT ltSelListgt FROM ltFromListgt
WHERE ltConditiongt
ltAttributegt ltRelNamegt
ltAttributegt LIKE ltPatterngt
name MovieStar
birthDate 1960
17
Logical Query Plan techniques to covert a parse
tree to a logical plan
?title
?starNamename
?
StarsIn ?name
?birthdate LIKE 1960
MovieStar
18
Improved Logical Query Plan
?title
Push selection down.
starNamename
In general how?
StarsIn ?name
?birthdate LIKE 1960
MovieStar
19
Relational algebra optimization
  • Transformation rules (preserve equivalence)
  • What are good transformations?

20
Rules Natural joins cross products union
  • Carry attribute names in results, so order is not
    important
  • Can also write as trees, e.g.
  • T R

R S S T
21
Rules Natural joins cross products union
R S S R (R S) T R (S T)
  • R ? S S ? R
  • (R ? S) ? T R ? (S ? T)
  • R ? S S ? R
  • R ? (S ? T) (R ? S) ? T

Commutative law
Associative Law
22
Rules Selects
  • sp1?p2(R)
  • sp1vp2(R)

sp1 sp2 (R) sp2 sp1 (R) sp1 (R) ?
sp2 (R)
  • Example R(dept, id, salary), S(id, city, tel)
  • P1 salary gt 50K
  • P2 dept toy
Write a Comment
User Comments (0)
About PowerShow.com