Title: Estimating the Cost of Operations
1Estimating the Cost of Operations
- We dont want to execute the query in order to
learn the costs. So, we need to estimate the
costs. - How can we estimate the number of tuples in an
intermediate relation? - Rules about estimation formulas
- Give (somehow) accurate estimates
- Easy to compute
2Projection
- Projection ? retains duplicates, so the number of
tuples in the result is the same as in the input. - Result tuples are usually shorter than the input
tuples. - The size of a projection is the only one we can
compute exactly.
3Selection
- Let S ?Ac (R)
- We can estimate
- T(S) T(R) / V(R,A)
- Let S ?Altc (R)
- On average, T(S) would be T(R)/2, but more
properly T(R)/3 - Let S ?A?c (R),
- Then, an estimate is
- T(S) T(R)(V(R,A)-1)/V(R,A), or simply
- T(S) T(R)
4Selection ...
- Let
- S ?C AND D(R) ?C(?D(R)) and
- U ?D(R).
- First estimate T(U) and then use this to estimate
T(S). - Example
- S ?a10 AND blt20(R)
- T(R) 10,000, V(R,a) 50
- T(S) (1/50) (1/3) T(R) 67
- Note Watch for selections like
- ?a10 AND agt20(R)
5Selection ...
- Let S ?C OR D(R).
- Simple estimate T(S) T(?C(R)) T(?D(R)).
- Problem It is possible that T(S)?T(R)!
- A more accurate estimate
- Let
- T(R)n,
- m1 size of selection on C, and
- m2 size of selection on D.
- Then T(S) n(1-(1-m1/n)(1-m2/n)) Why?
- Example S ?a10 OR blt20(R).
- T(R) 10,000, V(R,a) 50
- Simple estimation T(S) 3533
- More accurate T(S) 3466
6Natural Join R(X,Y) ?? S(Y,Z)
- Anything could happen!
- No tuples join T(R ?? S) 0
- Y is key in S and a foreign key in R (i.e., R.Y
refers to S.Y) Then, - T(R ?? S) T(R)
- All tuples join i.e. R.YS.Y a.
- Then,
- T(R ?? S) T(R)T(S)
7Two Assumptions
- Containment of value sets
- If V(R,Y) V(S,Y), then every Y-value in R is
assumed to occur as a Y-value in S - When such thing can happen?
- For example when Y is foreign key in R, and key
in S - Preservation of set values
- If A is an attribute of R but not S, then it is
assumed that - V(R ?? S, A)V(R, A)
- This may be violated when there are dangling
tuples in R - There is no violation when Y is foreign key in
R, and key in S
8Natural Join size estimation
- Let, R(X,Y) and S(Y,Z), where Y is a single
attribute. - Whats the size of T(R ?? S)?
- Let r be a tuple in R and s be a tuple in S.
Whats the probability that r and s join? - Suppose V(R,Y) ? V(S,Y)
- By the containment of set values we infer that
- Every Ys value in R appears in S.
- So, the tuple r of R surely is going match with
some tuples of S, but whats the probability it
matches with s? - Its 1/V(S,Y).
- Hence, T(R ?? S) T(R)T(S)/V(S,Y)
- When V(R,Y) ? V(S,Y)
- By a similar reasoning, for the case when V(S,Y)
? V(R,Y), we get T(R ?? S) T(R)T(S)/V(R,Y). - So, sumarizing we have as an estimate
- T(R ?? S) T(R)T(S)/maxV(R,Y),V(S,Y)
9- T(R ?? S) T(R)T(S)/maxV(R,Y),V(S,Y)
- Example
- R(a,b), T(R)1000, V(R,b)20
- S(b,c), T(S)2000, V(S,b)50, V(S,c)100
- U(c,d), T(U)5000, V(U,c)500
- Estimate the size of R ?? S ?? U
- T(R ?? S)
- 10002000 / 50 40,000
- T((R ?? S) ?? U)
- 40000 5000 / 500 400,000
- T(S ?? U)
- 20,000
- T(R ?? (S ?? U))
- 100020000 / 50 400,000
- The equality of results is not a coincidence.
- Note 1 estimate of final result should not
depend on the evaluation order - Note 2 intermediate results could be of
different sizes
10Natural join with multiple join attrib.
- R(x,y1,y2) ?? S(y1,y2,z)
- T(R ?? S) T(R)T(S)/m1m2, where
- m1 maxV(R,y1),V(S,y1)
- m2 maxV(R,y2),V(S,y2)
- Why?
- Let r be a tuple in R and s be a tuple in S.
Whats the probability that r and s agree on y1? - From the previous reasoning, its
1/maxV(R,y1),V(S,y1) - Similarly, whats the probability that r and s
agree on y2? - Its 1/maxV(R,y2),V(S,y2)
- Assuming that aggrements on y1 and y2 are
independent we estimate - T(R ?? S)
- T(R)T(S)/maxV(R,y1),V(S,y1)
maxV(R,y2),V(S,y2) - Example
- T(R)1000, V(R,b)20, V(R,c)100
- T(S)2000, V(S,d)50, V(S,e)50
- R(a,b,c) ?? R.bS.d AND R.cS.e S(d,e,f)
- T(R ?? S) (10002000)/(50100)400
11- Another example (one of the previous)
- R(a,b), T(R)1000, V(R,b)20
- S(b,c), T(S)2000, V(S,b)50, V(S,c)100
- U(c,d), T(U)5000, V(U,c)500
- Estimate the size of R ?? S ?? U
- Observe that R ?? S ?? U (R ? U) ?? S
- T(R ? U)
- 10005000 5,000,000
- Note that the number of bs in the product is 20
(V(R,b)), and the number of cs is 500 (V(U,c)). - T((R ? U) ?? S)
- 5,000,000 2000 / (50 500) 400,000
12Size estimates for other operations
- Cartesian product
- T(R ? S) T(R) T(S)
- Bag Union sum of sizes
- Set union
- larger half the smaller. Why?
- Because a set union can be as large as the sum
of sizes or as small as the larger of the two
arguments. Something in the middle is suggested. - Intersection half the smaller. Why?
- Because intersection can be as small as 0 or as
large as the sizes of the smaller. Something in
the middle is suggested. - Difference T(R-S) T(R) - 1/2T(S)
- Because the result can have between T(R) and
T(R)-T(S). Something in the middle is suggested.
13Size estimates for other operations
- Duplicate elimination ? in (R(a1,...,an))
- The size ranges from 1 to T(R).
- T(?(R)) V(R,a1...an), if available (but
usually not available). - Otherwise T(?(R)) minV(R,a1)...V(R,an),
1/2T(R) is suggested. Why? - V(R,a1)...V(R,an) is the upper limit on the
number of distinct tuples that could exist - 1/2T(R) is because the size can be as small as
1 or as big as T(R) - Grouping and Aggregation
- similar to ?, but only with respect to grouping
attributes.
14Computing the statistics
- It is the DBA who orders the statistics to be
computed by the system. - T(R)s, and V(R,A)s are just aggregation queries
(COUNT queries). - However, they are expensive to be computed.
15Incremental computation of statistics
- Maintaining T(R)
- Add 1 for every insertion and subtract 1 for
every deletion. - Whats the problem?
- If there is a B-Tree on any attribute of R, then
- Just keep track of the B-Tree blocks and infer
the approximate size of the relation. - Requires effort only when?
- On B-Tree changes, which is relative rare
compared with the rate of insertions and
deletions.
16Incremental computation of statistics
- Maintaining V(R,A)
- If there is an index on attribute A of a relation
R, then - On insert into R, we must find the A-value for
the new tuple in the index anyway, and so we can
determine whether there is already such a value
for A. If not increment V(R,A). - On deletion
- If there isnt an index on A, the system could in
effect create a rudimentary index by keeping a
data structure (e.g. B-Tree) that holds every
value of A. - Final option Sampling the relation.
17Histograms
1-2 3-4 4-5 6-7 8-9
Equal width
Advantage more accurate estimate of the size of
a join.
18Example (freq. Values histogram)
- Estimate U R(a,b) ?? S(b,c)
- V(R,b) 14. Histogram for R.b
- 0150, 1200, 5100, rest 550
- V(S,b) 13. Histogram for S.b
- 0100, 180, 270, rest 250
- Tuples in U
- on 0
- 100150 15,000
- on 1
- 20080 16,000
- on 2
- 70 (550/(14-3)) 3500
- on 5
- 100 (250/(13-3)) 2500
- on the 9 other values
- 9(550/11)(250/10) 91250
We have 9 values, because V(S,b)ltV(R,b), and by
the preservation of value sets assumption, all
the 9 values we didnt consider yet in S, will be
in R as well.
Total T(U) 15000 16000 3500 2500
91250 48,250 Simple estimate (equal
occurrence assumption) T(U) 1000500/14 35,714
19Example (equal width histogram)
- Schemas
- Jan(day,temp)
- July(day,temp)
- Query
- SELECT Jan.day, July.day
- FROM Jan, July
- WHERE Jan.tempJuly.temp
- Size of join of each band T1T2/Width
- On band 40-49 105/10 5
- On band 50-59 520/10 10
- ? size of the result is thus 510 15
- Without using the histogram we would
- estimate the size as
- 245245/100 600 !!