Title: Self-tuning Histograms Building Histograms Without Looking at Data
1Self-tuning HistogramsBuilding Histograms
Without Looking at Data
By Ashraf Aboulnage Surajit Chaudhuri
Represents By Miller Ofer
2Traditional histograms
- Histograms impose little cost at queries ,
especially in a large data base. - The cost of building histograms from database is
significant high and prevent us from building
useful histograms . - Data modification causes to lost of histogram
accuracy .
3Self-Tuning histogram
- Similar in structure to traditional histogram.
- Builds without looking or sampling the data.
- Uses free information from the queries results.
- Can be refined in an on-line mode ( The overall
cost of building self tuning histogram is very
low). - Their accuracy depends on how often they are
used, the more it is use , the more it is refined
, the more accurate it becomes
4The main steps in building the ST-histogram
- Initial the histogram.
- Refining the bucket values (frequencies).
- Restructuring moving the buckets boundaries.
5Initial The Histogram
- Data requirement
- B Number of histogram buckets
- T Number of tuples
- Min/max min and max values of attribute .
- Assuming uniformity of the data distribution and
initial each of the buckets as T/b tuples.
6ST-histogram after initialization
tuples
frequency
buckets
7The Algorithm for refining the buckets
frequencies. (second step )
- begin
- Finds set of k buckets that overlapping the
selection range . - Let est be the estimated results size of the
selection range using histogram h . - Let act be the actual result size.
- Compute the estimation error by act-est , denote
by esterr . - ?
8?
last overlap bucket
- 5. for i 1 to k do
-
-
-
- 8. endfor
-
- end UpdateFreq
proportion
average assumption
9Refinement example
act 60 rangehigh 25 Rangelow 12.5 high(b1)
15 low(b1) 5
0.5
5/25
35
10Restructuring Algorithm
- Motivation
- Frequencies are approximated by their average.
Thus , high frequent value will be contain in
high frequent buckets , but they may be grouped
with low frequency values . - When the range of query adapts to the range of
histogram bucket , no average assumption is
needed.
11Restructuring Algorithm
Merging step
- For every consecutive runs of buckets , find the
maximum differences in frequency between a bucket
in the first run and the buckets in the second
run. - Find the minimum of all these maximum difference
, denote by mindiff. - if mindiff lt mT then
- Merge the two runs of buckets corresponding to
mindiff into one run. - Look for other runs to merge .goto line 1.
- endif
12Restructuring Algorithm
Splitting step
- ksb b the rest of the buckets that
havent been chosen . - Find the set with k highest frequency .
- Compute the splitting extra bucket of each one by
- split(bi)
- where totalfreq is the sum of all the bucket to
be split - and B is the number of extra bucks
- 8. Each buckets freq gets the old freq divided to
split(bi)1.
13Restructuring example
mT ? 3 SB ? 2
merge?1
merge?2
split
split
10
13
17
14
13
11
25
70
10
30
10
1
6
7
8
9
5
4
3
2
15
15
10
24
23
23
25
38
17
23
10
1
7
8
9
5
4
2
3
6
14Multi-dimensional ST-histogram
- Initialization
- Assuming a complete uniformity and independence.
- Using existing one-dimensional ST-histograms
assuming independence of the attribute .
15Refining the buckets frequencies
Multi-dimensional
- The refining algorithm for multi-dimensional is
identical to the algorithm for one-dimensional
except the two following changes - Finding the overlap selected range, now require
examining a multi-dimensional structure. - The fraction of a bucket overlapping the
selection range is now equal to the volume of the
region divided by the volume of the region
represented by the whole bucket.
16Restructuring
Multi-dimensional
- Merge find the max difference in freq between
any two corresponding buckets of the same line,
merge if the difference within mt . ( mlt1 ). - Split the frequency of partition ji in
dimension i is - compute by
-
1,5 6,10 11,15 16,20 21,25
Max S 50
10 5 8 20 7
14 9 7 19 11
1,10
11,20
Max diff 4
Max S 60
17Accuracy of ST-histogram
Due to the same memory limit and the complex of
MHIST-2, st-histogram have more buckets.
18Adapting to Database Updates
R1 relation before update with skew 1 . R2 -
Update the relation by deleting 25 of its
tuple and inserting an equal number of tuples.
19Accuracy dependence of the frequent queries.
20Conclusions
- Better than assuming uniformity and independence
for all values of data skew (z). - For low data skew the st-histograms found to be
sufficient accurate comparing to the traditional
histograms. - Attractive for multi-dimensional histogram since
the high cost of building them . - For high data skew, st-hist much less accurate
than Mhist-2 . - Combination between traditional hist and st-hist
for all the range of data skew can yield the best
of each concept .