ICS 214A: Database Management Systems Winter 2004 - PowerPoint PPT Presentation

1 / 22

About This Presentation

Title:

ICS 214A: Database Management Systems Winter 2004

Description:

Those estimation formulas covered so far might not be accurate enough ... for each update on R, increment/decrement the corresponding bucket frequencies ... – PowerPoint PPT presentation

Number of Views:20

Avg rating:3.0/5.0

Slides: 23

Provided by: che7

Category:

more less

Transcript and Presenter's Notes

Title: ICS 214A: Database Management Systems Winter 2004

1
ICS 214A Database Management Systems Winter 2004

Lecture 14 Histograms and Query Optimization
Professor Chen Li

2
Histograms Motivation

Accurate estimation is critical in query
optimization
Those estimation formulas covered so far might
not be accurate enough
Instead, we keep distribution of R.A (v1,f1), ,
(vn, fn).
Called histogram.
Idea
Partition the domain of R.A into buckets
Store a small summary of the distribution within
each bucket
Number of the buckets is the knob that controls
the resolution

3
Equi-width Histogram

Divide the domain into B buckets of equal width
E.g., partition GPAs into buckets
Store the bucket boundaries and the sum of
frequencies of the values with each bucket

4
Construction and maintenance

Construction
Use one pass over R to construct an accurate
equi-width histogram
Keep a running count for each bucket
If scanning is not acceptable, use sampling
Construct a histogram on Rsample, and scale the
frequencies by R/Rsample
Maintenance
Incremental maintenance for each update on R,
increment/decrement the corresponding bucket
frequencies
Periodical re-computation because distribution
changes slowly

5
Using an equi-width histogram

Q sA5(R)
5 is in bucket 5,8 (with 19 tuples)
Assume uniform distribution within the bucket
Thus Q ? 19/4 ? 5.
Actual value is 1
Q sAgt7 A lt 16(R)
7,16 covers 9,12 (27 tuples) and 13,16 (13
tuples)
7,16 partially covers 5,8 (19 tuples)
Thus Q ? 19/2 27 13 ? 50
Actual value 52.

6
Equi-height Histogram

Divide the domain into B buckets with roughly the
same number of tuples in each bucket
Store this number and the bucket boundaries
Intuition high frequencies are more important
than low frequencies

7
Construction and maintenance

Construction
Sort all R.A values, and then take equally spaced
slights
Example 1 2 2 3 4 7 8 9 10 10 10 10 11 11 12 12
14 16
Sampling also works
Maintenance
Incremental maintenance
Merge adjacent buckets with small counts
Split any bucket with a large count
Select the median value to split
Need a sample of the values within this bucket to
work well
Periodic re-computation also works.

8
Using an equi-height histogram

Q sA5(R)
5 is in bucket 1,7 (with 16 tuples)
Assume uniform distribution within the bucket
Thus Q ? 16/7 ? 2. (actual value 1)
Q sAgt7 A lt 16(R)
7,16 covers 8,9, 10,11,12,16 (all with
tuples)
7,16 partially covers 1,7 (16 tuples)
Thus Q ? 16/7 16 16 16 ? 50
Actually Q 52.

9
Histogram tricks

Store the number of distinct values in each
bucket
To get rid of the effects of the values with 0
frequency
These values tend to cause underestimation
Compressed histogram
Store (vi, fi) pairs explicitly if fi is high
For other values, use an equi-width or
equi-height histogram

10
More Histograms

V-optimal histogram
Avoid putting very different frequencies into the
same bucket
Partition in a way to minimize ?iVARi, where VARi
is the frequency variance within bucket i
MaxDiff Histogram
Define area to be the product of the frequency of
a value and its spread (the difference between
this value and the next value with non-zero
frequency)
Insert bucket boundaries where two adjacent areas
differ by large amounts
More in Poosala et al., SIGMOD 1996

11
Did not cover

Multidimensional histograms
Using histograms to estimate other operators
(e.g., joins)
Other mechanisms for selectivity estimation
Wavelets
Sampling
Probabilistic models of data
Much recent research on selectivity estimation in
the context of
Query processing
OLAP

12
Next query optimization
13
Query Optimization
SQL query
parse
parse tree
convert
answer
logical query plan
execute
apply laws
statistics
Pi
improved l.q.p
pick best
estimate result sizes
(P1,C1),(P2,C2)...
l.q.p. sizes
estimate costs
consider physical plans
P1,P2,..
14
Query Optimization

Convert a logical query plan into an optimal
physical plan.
Step 1 use rewrite laws to improve logical plan.
Step 2 use cost-based enumeration to choose a
good physical plan corresponding to logical plan.
Determine order of join
Specify algorithm for each operator
Materialization vs. pipelining
Issues
1. Estimation cost of operators
cost model
input size
2. Enumeration algorithm

15
Example SQL query

SELECT title
FROM StarsIn
WHERE starName IN (
SELECT name
FROM MovieStar
WHERE birthdate LIKE 1960
)
(Find the movies with stars born in 1960)

16
Parse Tree
ltQuerygt
ltSFWgt
SELECT ltSelListgt FROM ltFromListgt
WHERE ltConditiongt
ltAttributegt ltRelNamegt
ltTuplegt IN ltQuerygt
title StarsIn
ltAttributegt ( ltQuerygt )
starName ltSFWgt
SELECT ltSelListgt FROM ltFromListgt
WHERE ltConditiongt
ltAttributegt ltRelNamegt
ltAttributegt LIKE ltPatterngt
name MovieStar
birthDate 1960
17
Logical Query Plan techniques to covert a parse
tree to a logical plan
?title
?starNamename
?
StarsIn ?name
?birthdate LIKE 1960
MovieStar
18
Improved Logical Query Plan
?title
Push selection down.
starNamename
In general how?
StarsIn ?name
?birthdate LIKE 1960
MovieStar
19
Relational algebra optimization