Self-tuning Histograms Building Histograms Without Looking at Data

About This Presentation

Title:

Self-tuning Histograms Building Histograms Without Looking at Data

Description:

For high data skew, st-hist much less accurate than 'Mhist-2' . Combination between traditional hist and st-hist for all the range of data skew ... – PowerPoint PPT presentation

Number of Views:39

Avg rating:3.0/5.0

Slides: 21

Provided by: ofer5

Category:

more less

Transcript and Presenter's Notes

Title: Self-tuning Histograms Building Histograms Without Looking at Data

1
Self-tuning HistogramsBuilding Histograms
Without Looking at Data
By Ashraf Aboulnage Surajit Chaudhuri
Represents By Miller Ofer
2
Traditional histograms

Histograms impose little cost at queries ,
especially in a large data base.
The cost of building histograms from database is
significant high and prevent us from building
useful histograms .
Data modification causes to lost of histogram
accuracy .

3
Self-Tuning histogram

Similar in structure to traditional histogram.
Builds without looking or sampling the data.
Uses free information from the queries results.
Can be refined in an on-line mode ( The overall
cost of building self tuning histogram is very
low).
Their accuracy depends on how often they are
used, the more it is use , the more it is refined
, the more accurate it becomes

4
The main steps in building the ST-histogram

Initial the histogram.
Refining the bucket values (frequencies).
Restructuring moving the buckets boundaries.

5
Initial The Histogram

Data requirement
B Number of histogram buckets
T Number of tuples
Min/max min and max values of attribute .
Assuming uniformity of the data distribution and
initial each of the buckets as T/b tuples.

6
ST-histogram after initialization
tuples
frequency
buckets
7
The Algorithm for refining the buckets
frequencies. (second step )

begin
Finds set of k buckets that overlapping the
selection range .
Let est be the estimated results size of the
selection range using histogram h .
Let act be the actual result size.
Compute the estimation error by act-est , denote
by esterr .
?

8
?
last overlap bucket

5. for i 1 to k do
8. endfor
end UpdateFreq

proportion
average assumption
9
Refinement example
act 60 rangehigh 25 Rangelow 12.5 high(b1)
15 low(b1) 5
0.5
5/25
35
10
Restructuring Algorithm

Motivation
Frequencies are approximated by their average.
Thus , high frequent value will be contain in
high frequent buckets , but they may be grouped
with low frequency values .
When the range of query adapts to the range of
histogram bucket , no average assumption is
needed.

11
Restructuring Algorithm
Merging step

For every consecutive runs of buckets , find the
maximum differences in frequency between a bucket
in the first run and the buckets in the second
run.
Find the minimum of all these maximum difference
, denote by mindiff.
if mindiff lt mT then
Merge the two runs of buckets corresponding to
mindiff into one run.
Look for other runs to merge .goto line 1.
endif

12
Restructuring Algorithm
Splitting step

ksb b the rest of the buckets that
havent been chosen .
Find the set with k highest frequency .
Compute the splitting extra bucket of each one by
split(bi)
where totalfreq is the sum of all the bucket to
be split
and B is the number of extra bucks
8. Each buckets freq gets the old freq divided to
split(bi)1.

13
Restructuring example
mT ? 3 SB ? 2
merge?1
merge?2
split
split
10
13
17
14
13
11
25
70
10
30
10
1
6
7
8
9
5
4
3
2
15
15
10
24
23
23
25
38
17
23
10
1
7
8
9
5
4
2
3
6
14
Multi-dimensional ST-histogram

Initialization
Assuming a complete uniformity and independence.
Using existing one-dimensional ST-histograms
assuming independence of the attribute .

15
Refining the buckets frequencies
Multi-dimensional

The refining algorithm for multi-dimensional is
identical to the algorithm for one-dimensional
except the two following changes
Finding the overlap selected range, now require
examining a multi-dimensional structure.
The fraction of a bucket overlapping the
selection range is now equal to the volume of the
region divided by the volume of the region
represented by the whole bucket.

16
Restructuring
Multi-dimensional

Merge find the max difference in freq between
any two corresponding buckets of the same line,
merge if the difference within mt . ( mlt1 ).
Split the frequency of partition ji in
dimension i is
compute by

1,5 6,10 11,15 16,20 21,25
Max S 50
10 5 8 20 7
14 9 7 19 11
1,10
11,20
Max diff 4
Max S 60
17
Accuracy of ST-histogram
Due to the same memory limit and the complex of
MHIST-2, st-histogram have more buckets.
18
Adapting to Database Updates
R1 relation before update with skew 1 . R2 -
Update the relation by deleting 25 of its
tuple and inserting an equal number of tuples.
19
Accuracy dependence of the frequent queries.
20
Conclusions

Better than assuming uniformity and independence
for all values of data skew (z).
For low data skew the st-histograms found to be
sufficient accurate comparing to the traditional
histograms.
Attractive for multi-dimensional histogram since
the high cost of building them .
For high data skew, st-hist much less accurate
than Mhist-2 .
Combination between traditional hist and st-hist
for all the range of data skew can yield the best
of each concept .