Estimating the Cost of Operations - PowerPoint PPT Presentation

About This Presentation

Title:

Estimating the Cost of Operations

Description:

Estimating the Cost of Operations We don t want to execute the query in order to learn the costs. So, we need to estimate the costs. How can we estimate the number ... – PowerPoint PPT presentation

Number of Views:46

Avg rating:3.0/5.0

Slides: 20

Provided by: Computer111

Category:

more less

Transcript and Presenter's Notes

Title: Estimating the Cost of Operations

1
Estimating the Cost of Operations

We dont want to execute the query in order to
learn the costs. So, we need to estimate the
costs.
How can we estimate the number of tuples in an
intermediate relation?
Rules about estimation formulas
Give (somehow) accurate estimates
Easy to compute

2
Projection

Projection ? retains duplicates, so the number of
tuples in the result is the same as in the input.
Result tuples are usually shorter than the input
tuples.
The size of a projection is the only one we can
compute exactly.

3
Selection

Let S ?Ac (R)
We can estimate
T(S) T(R) / V(R,A)
Let S ?Altc (R)
On average, T(S) would be T(R)/2, but more
properly T(R)/3
Let S ?A?c (R),
Then, an estimate is
T(S) T(R)(V(R,A)-1)/V(R,A), or simply
T(S) T(R)

4
Selection ...

Let
S ?C AND D(R) ?C(?D(R)) and
U ?D(R).
First estimate T(U) and then use this to estimate
T(S).
Example
S ?a10 AND blt20(R)
T(R) 10,000, V(R,a) 50
T(S) (1/50) (1/3) T(R) 67
Note Watch for selections like
?a10 AND agt20(R)

5
Selection ...

Let S ?C OR D(R).
Simple estimate T(S) T(?C(R)) T(?D(R)).
Problem It is possible that T(S)?T(R)!
A more accurate estimate
Let
T(R)n,
m1 size of selection on C, and
m2 size of selection on D.
Then T(S) n(1-(1-m1/n)(1-m2/n)) Why?
Example S ?a10 OR blt20(R).
T(R) 10,000, V(R,a) 50
Simple estimation T(S) 3533
More accurate T(S) 3466

6
Natural Join R(X,Y) ?? S(Y,Z)

Anything could happen!
No tuples join T(R ?? S) 0
Y is key in S and a foreign key in R (i.e., R.Y
refers to S.Y) Then,
T(R ?? S) T(R)
All tuples join i.e. R.YS.Y a.
Then,
T(R ?? S) T(R)T(S)

7
Two Assumptions

Containment of value sets
If V(R,Y) V(S,Y), then every Y-value in R is
assumed to occur as a Y-value in S
When such thing can happen?
For example when Y is foreign key in R, and key
in S
Preservation of set values
If A is an attribute of R but not S, then it is
assumed that
V(R ?? S, A)V(R, A)
This may be violated when there are dangling
tuples in R
There is no violation when Y is foreign key in
R, and key in S

8
Natural Join size estimation

Let, R(X,Y) and S(Y,Z), where Y is a single
attribute.
Whats the size of T(R ?? S)?
Let r be a tuple in R and s be a tuple in S.
Whats the probability that r and s join?
Suppose V(R,Y) ? V(S,Y)
By the containment of set values we infer that
Every Ys value in R appears in S.
So, the tuple r of R surely is going match with
some tuples of S, but whats the probability it
matches with s?
Its 1/V(S,Y).
Hence, T(R ?? S) T(R)T(S)/V(S,Y)
When V(R,Y) ? V(S,Y)
By a similar reasoning, for the case when V(S,Y)
? V(R,Y), we get T(R ?? S) T(R)T(S)/V(R,Y).
So, sumarizing we have as an estimate
T(R ?? S) T(R)T(S)/maxV(R,Y),V(S,Y)

T(R ?? S) T(R)T(S)/maxV(R,Y),V(S,Y)
Example
R(a,b), T(R)1000, V(R,b)20
S(b,c), T(S)2000, V(S,b)50, V(S,c)100
U(c,d), T(U)5000, V(U,c)500
Estimate the size of R ?? S ?? U
T(R ?? S)
10002000 / 50 40,000
T((R ?? S) ?? U)
40000 5000 / 500 400,000
T(S ?? U)
20,000
T(R ?? (S ?? U))
100020000 / 50 400,000

The equality of results is not a coincidence.
Note 1 estimate of final result should not
depend on the evaluation order
Note 2 intermediate results could be of
different sizes

10
Natural join with multiple join attrib.

R(x,y1,y2) ?? S(y1,y2,z)
T(R ?? S) T(R)T(S)/m1m2, where
m1 maxV(R,y1),V(S,y1)
m2 maxV(R,y2),V(S,y2)
Why?
Let r be a tuple in R and s be a tuple in S.
Whats the probability that r and s agree on y1?
From the previous reasoning, its
1/maxV(R,y1),V(S,y1)
Similarly, whats the probability that r and s
agree on y2?
Its 1/maxV(R,y2),V(S,y2)
Assuming that aggrements on y1 and y2 are
independent we estimate
T(R ?? S)
T(R)T(S)/maxV(R,y1),V(S,y1)
maxV(R,y2),V(S,y2)
Example
T(R)1000, V(R,b)20, V(R,c)100
T(S)2000, V(S,d)50, V(S,e)50
R(a,b,c) ?? R.bS.d AND R.cS.e S(d,e,f)
T(R ?? S) (10002000)/(50100)400

Another example (one of the previous)
R(a,b), T(R)1000, V(R,b)20
S(b,c), T(S)2000, V(S,b)50, V(S,c)100
U(c,d), T(U)5000, V(U,c)500
Estimate the size of R ?? S ?? U
Observe that R ?? S ?? U (R ? U) ?? S
T(R ? U)
10005000 5,000,000
Note that the number of bs in the product is 20
(V(R,b)), and the number of cs is 500 (V(U,c)).
T((R ? U) ?? S)
5,000,000 2000 / (50 500) 400,000

12
Size estimates for other operations

Cartesian product
T(R ? S) T(R) T(S)
Bag Union sum of sizes
Set union
larger half the smaller. Why?
Because a set union can be as large as the sum
of sizes or as small as the larger of the two
arguments. Something in the middle is suggested.
Intersection half the smaller. Why?
Because intersection can be as small as 0 or as
large as the sizes of the smaller. Something in
the middle is suggested.
Difference T(R-S) T(R) - 1/2T(S)
Because the result can have between T(R) and
T(R)-T(S). Something in the middle is suggested.

13
Size estimates for other operations

Duplicate elimination ? in (R(a1,...,an))
The size ranges from 1 to T(R).
T(?(R)) V(R,a1...an), if available (but
usually not available).
Otherwise T(?(R)) minV(R,a1)...V(R,an),
1/2T(R) is suggested. Why?
V(R,a1)...V(R,an) is the upper limit on the
number of distinct tuples that could exist
1/2T(R) is because the size can be as small as
1 or as big as T(R)
Grouping and Aggregation
similar to ?, but only with respect to grouping
attributes.

14
Computing the statistics

It is the DBA who orders the statistics to be
computed by the system.
T(R)s, and V(R,A)s are just aggregation queries
(COUNT queries).
However, they are expensive to be computed.

15
Incremental computation of statistics

Maintaining T(R)
Add 1 for every insertion and subtract 1 for
every deletion.
Whats the problem?
If there is a B-Tree on any attribute of R, then
Just keep track of the B-Tree blocks and infer
the approximate size of the relation.
Requires effort only when?
On B-Tree changes, which is relative rare
compared with the rate of insertions and
deletions.

16
Incremental computation of statistics

Maintaining V(R,A)
If there is an index on attribute A of a relation
R, then
On insert into R, we must find the A-value for
the new tuple in the index anyway, and so we can
determine whether there is already such a value
for A. If not increment V(R,A).
On deletion
If there isnt an index on A, the system could in
effect create a rudimentary index by keeping a
data structure (e.g. B-Tree) that holds every
value of A.
Final option Sampling the relation.

17
Histograms
1-2 3-4 4-5 6-7 8-9
Equal width
Advantage more accurate estimate of the size of
a join.
18
Example (freq. Values histogram)

Estimate U R(a,b) ?? S(b,c)
V(R,b) 14. Histogram for R.b
0150, 1200, 5100, rest 550
V(S,b) 13. Histogram for S.b
0100, 180, 270, rest 250
Tuples in U
on 0
100150 15,000
on 1
20080 16,000
on 2
70 (550/(14-3)) 3500
on 5
100 (250/(13-3)) 2500
on the 9 other values
9(550/11)(250/10) 91250

We have 9 values, because V(S,b)ltV(R,b), and by
the preservation of value sets assumption, all
the 9 values we didnt consider yet in S, will be
in R as well.
Total T(U) 15000 16000 3500 2500
91250 48,250 Simple estimate (equal
occurrence assumption) T(U) 1000500/14 35,714
19
Example (equal width histogram)