Estimating the Cost of Operations - PowerPoint PPT Presentation

About This Presentation
Title:

Estimating the Cost of Operations

Description:

Estimating the Cost of Operations We don t want to execute the query in order to learn the costs. So, we need to estimate the costs. How can we estimate the number ... – PowerPoint PPT presentation

Number of Views:46
Avg rating:3.0/5.0
Slides: 20
Provided by: Computer111
Category:

less

Transcript and Presenter's Notes

Title: Estimating the Cost of Operations


1
Estimating the Cost of Operations
  • We dont want to execute the query in order to
    learn the costs. So, we need to estimate the
    costs.
  • How can we estimate the number of tuples in an
    intermediate relation?
  • Rules about estimation formulas
  • Give (somehow) accurate estimates
  • Easy to compute

2
Projection
  • Projection ? retains duplicates, so the number of
    tuples in the result is the same as in the input.
  • Result tuples are usually shorter than the input
    tuples.
  • The size of a projection is the only one we can
    compute exactly.

3
Selection
  • Let S ?Ac (R)
  • We can estimate
  • T(S) T(R) / V(R,A)
  • Let S ?Altc (R)
  • On average, T(S) would be T(R)/2, but more
    properly T(R)/3
  • Let S ?A?c (R),
  • Then, an estimate is
  • T(S) T(R)(V(R,A)-1)/V(R,A), or simply
  • T(S) T(R)

4
Selection ...
  • Let
  • S ?C AND D(R) ?C(?D(R)) and
  • U ?D(R).
  • First estimate T(U) and then use this to estimate
    T(S).
  • Example
  • S ?a10 AND blt20(R)
  • T(R) 10,000, V(R,a) 50
  • T(S) (1/50) (1/3) T(R) 67
  • Note Watch for selections like
  • ?a10 AND agt20(R)

5
Selection ...
  • Let S ?C OR D(R).
  • Simple estimate T(S) T(?C(R)) T(?D(R)).
  • Problem It is possible that T(S)?T(R)!
  • A more accurate estimate
  • Let
  • T(R)n,
  • m1 size of selection on C, and
  • m2 size of selection on D.
  • Then T(S) n(1-(1-m1/n)(1-m2/n)) Why?
  • Example S ?a10 OR blt20(R).
  • T(R) 10,000, V(R,a) 50
  • Simple estimation T(S) 3533
  • More accurate T(S) 3466

6
Natural Join R(X,Y) ?? S(Y,Z)
  • Anything could happen!
  • No tuples join T(R ?? S) 0
  • Y is key in S and a foreign key in R (i.e., R.Y
    refers to S.Y) Then,
  • T(R ?? S) T(R)
  • All tuples join i.e. R.YS.Y a.
  • Then,
  • T(R ?? S) T(R)T(S)

7
Two Assumptions
  • Containment of value sets
  • If V(R,Y) V(S,Y), then every Y-value in R is
    assumed to occur as a Y-value in S
  • When such thing can happen?
  • For example when Y is foreign key in R, and key
    in S
  • Preservation of set values
  • If A is an attribute of R but not S, then it is
    assumed that
  • V(R ?? S, A)V(R, A)
  • This may be violated when there are dangling
    tuples in R
  • There is no violation when Y is foreign key in
    R, and key in S

8
Natural Join size estimation
  • Let, R(X,Y) and S(Y,Z), where Y is a single
    attribute.
  • Whats the size of T(R ?? S)?
  • Let r be a tuple in R and s be a tuple in S.
    Whats the probability that r and s join?
  • Suppose V(R,Y) ? V(S,Y)
  • By the containment of set values we infer that
  • Every Ys value in R appears in S.
  • So, the tuple r of R surely is going match with
    some tuples of S, but whats the probability it
    matches with s?
  • Its 1/V(S,Y).
  • Hence, T(R ?? S) T(R)T(S)/V(S,Y)
  • When V(R,Y) ? V(S,Y)
  • By a similar reasoning, for the case when V(S,Y)
    ? V(R,Y), we get T(R ?? S) T(R)T(S)/V(R,Y).
  • So, sumarizing we have as an estimate
  • T(R ?? S) T(R)T(S)/maxV(R,Y),V(S,Y)

9
  • T(R ?? S) T(R)T(S)/maxV(R,Y),V(S,Y)
  • Example
  • R(a,b), T(R)1000, V(R,b)20
  • S(b,c), T(S)2000, V(S,b)50, V(S,c)100
  • U(c,d), T(U)5000, V(U,c)500
  • Estimate the size of R ?? S ?? U
  • T(R ?? S)
  • 10002000 / 50 40,000
  • T((R ?? S) ?? U)
  • 40000 5000 / 500 400,000
  • T(S ?? U)
  • 20,000
  • T(R ?? (S ?? U))
  • 100020000 / 50 400,000
  • The equality of results is not a coincidence.
  • Note 1 estimate of final result should not
    depend on the evaluation order
  • Note 2 intermediate results could be of
    different sizes

10
Natural join with multiple join attrib.
  • R(x,y1,y2) ?? S(y1,y2,z)
  • T(R ?? S) T(R)T(S)/m1m2, where
  • m1 maxV(R,y1),V(S,y1)
  • m2 maxV(R,y2),V(S,y2)
  • Why?
  • Let r be a tuple in R and s be a tuple in S.
    Whats the probability that r and s agree on y1?
  • From the previous reasoning, its
    1/maxV(R,y1),V(S,y1)
  • Similarly, whats the probability that r and s
    agree on y2?
  • Its 1/maxV(R,y2),V(S,y2)
  • Assuming that aggrements on y1 and y2 are
    independent we estimate
  • T(R ?? S)
  • T(R)T(S)/maxV(R,y1),V(S,y1)
    maxV(R,y2),V(S,y2)
  • Example
  • T(R)1000, V(R,b)20, V(R,c)100
  • T(S)2000, V(S,d)50, V(S,e)50
  • R(a,b,c) ?? R.bS.d AND R.cS.e S(d,e,f)
  • T(R ?? S) (10002000)/(50100)400

11
  • Another example (one of the previous)
  • R(a,b), T(R)1000, V(R,b)20
  • S(b,c), T(S)2000, V(S,b)50, V(S,c)100
  • U(c,d), T(U)5000, V(U,c)500
  • Estimate the size of R ?? S ?? U
  • Observe that R ?? S ?? U (R ? U) ?? S
  • T(R ? U)
  • 10005000 5,000,000
  • Note that the number of bs in the product is 20
    (V(R,b)), and the number of cs is 500 (V(U,c)).
  • T((R ? U) ?? S)
  • 5,000,000 2000 / (50 500) 400,000

12
Size estimates for other operations
  • Cartesian product
  • T(R ? S) T(R) T(S)
  • Bag Union sum of sizes
  • Set union
  • larger half the smaller. Why?
  • Because a set union can be as large as the sum
    of sizes or as small as the larger of the two
    arguments. Something in the middle is suggested.
  • Intersection half the smaller. Why?
  • Because intersection can be as small as 0 or as
    large as the sizes of the smaller. Something in
    the middle is suggested.
  • Difference T(R-S) T(R) - 1/2T(S)
  • Because the result can have between T(R) and
    T(R)-T(S). Something in the middle is suggested.

13
Size estimates for other operations
  • Duplicate elimination ? in (R(a1,...,an))
  • The size ranges from 1 to T(R).
  • T(?(R)) V(R,a1...an), if available (but
    usually not available).
  • Otherwise T(?(R)) minV(R,a1)...V(R,an),
    1/2T(R) is suggested. Why?
  • V(R,a1)...V(R,an) is the upper limit on the
    number of distinct tuples that could exist
  • 1/2T(R) is because the size can be as small as
    1 or as big as T(R)
  • Grouping and Aggregation
  • similar to ?, but only with respect to grouping
    attributes.

14
Computing the statistics
  • It is the DBA who orders the statistics to be
    computed by the system.
  • T(R)s, and V(R,A)s are just aggregation queries
    (COUNT queries).
  • However, they are expensive to be computed.

15
Incremental computation of statistics
  • Maintaining T(R)
  • Add 1 for every insertion and subtract 1 for
    every deletion.
  • Whats the problem?
  • If there is a B-Tree on any attribute of R, then
  • Just keep track of the B-Tree blocks and infer
    the approximate size of the relation.
  • Requires effort only when?
  • On B-Tree changes, which is relative rare
    compared with the rate of insertions and
    deletions.

16
Incremental computation of statistics
  • Maintaining V(R,A)
  • If there is an index on attribute A of a relation
    R, then
  • On insert into R, we must find the A-value for
    the new tuple in the index anyway, and so we can
    determine whether there is already such a value
    for A. If not increment V(R,A).
  • On deletion
  • If there isnt an index on A, the system could in
    effect create a rudimentary index by keeping a
    data structure (e.g. B-Tree) that holds every
    value of A.
  • Final option Sampling the relation.

17
Histograms
1-2 3-4 4-5 6-7 8-9
Equal width
Advantage more accurate estimate of the size of
a join.
18
Example (freq. Values histogram)
  • Estimate U R(a,b) ?? S(b,c)
  • V(R,b) 14. Histogram for R.b
  • 0150, 1200, 5100, rest 550
  • V(S,b) 13. Histogram for S.b
  • 0100, 180, 270, rest 250
  • Tuples in U
  • on 0
  • 100150 15,000
  • on 1
  • 20080 16,000
  • on 2
  • 70 (550/(14-3)) 3500
  • on 5
  • 100 (250/(13-3)) 2500
  • on the 9 other values
  • 9(550/11)(250/10) 91250

We have 9 values, because V(S,b)ltV(R,b), and by
the preservation of value sets assumption, all
the 9 values we didnt consider yet in S, will be
in R as well.
Total T(U) 15000 16000 3500 2500
91250 48,250 Simple estimate (equal
occurrence assumption) T(U) 1000500/14 35,714
19
Example (equal width histogram)
  • Schemas
  • Jan(day,temp)
  • July(day,temp)
  • Query
  • SELECT Jan.day, July.day
  • FROM Jan, July
  • WHERE Jan.tempJuly.temp
  • Size of join of each band T1T2/Width
  • On band 40-49 105/10 5
  • On band 50-59 520/10 10
  • ? size of the result is thus 510 15
  • Without using the histogram we would
  • estimate the size as
  • 245245/100 600 !!
Write a Comment
User Comments (0)
About PowerShow.com