Self-tuning Histograms: Building Histograms Without Looking at Data - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Self-tuning Histograms: Building Histograms
Without Looking at Data
By Ashraf Aboulnaga and Surajit Chaudhuri
Presented by Ofer Miller
2
Traditional histograms
  • Histograms impose little cost at query optimization time,
    especially in a large database.
  • The cost of building histograms by scanning the database is
    significant and prevents us from building all the histograms
    that would be useful.
  • Data modifications cause histograms to lose accuracy.

3
Self-tuning histograms
  • Similar in structure to traditional histograms.
  • Built without scanning or sampling the data.
  • Use free feedback information from query results.
  • Can be refined online, so the overall cost of building a
    self-tuning histogram is very low.
  • Their accuracy depends on how often they are used: the more a
    histogram is used, the more it is refined and the more
    accurate it becomes.

4
The main steps in building an ST-histogram
  1. Initialize the histogram.
  2. Refine the bucket frequencies.
  3. Restructure: move the bucket boundaries.

5
Initializing the Histogram
  • Data requirements:
  • B - the number of histogram buckets.
  • T - the number of tuples in the relation.
  • min/max - the minimum and maximum values of the attribute.
  • Assuming a uniform data distribution, initialize each bucket
    with T/B tuples (a code sketch follows).
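A minimal sketch of this initialization step in Python; the Bucket class and
function names are illustrative, not from the paper. Given only B, T, and the
min/max values, every bucket spans an equal share of the value range and is
assigned T/B tuples.

```python
# Minimal sketch of ST-histogram initialization (illustrative names).

class Bucket:
    def __init__(self, low, high, freq):
        self.low = low      # lower boundary of the bucket's value range
        self.high = high    # upper boundary of the bucket's value range
        self.freq = freq    # estimated number of tuples in this range

def init_st_histogram(num_buckets, num_tuples, min_val, max_val):
    """Assume uniformity: equal-width buckets, each holding T/B tuples."""
    width = (max_val - min_val) / num_buckets
    return [Bucket(min_val + i * width,
                   min_val + (i + 1) * width,
                   num_tuples / num_buckets)
            for i in range(num_buckets)]

# Example: 10 buckets over an attribute ranging from 0 to 100 in a 10,000-tuple table.
hist = init_st_histogram(num_buckets=10, num_tuples=10_000, min_val=0, max_val=100)
```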

6
ST-histogram after initialization
[Figure: all buckets have the same frequency, T/B tuples per
bucket (x-axis: buckets, y-axis: frequency in tuples)]
7
The algorithm for refining the bucket
frequencies (second step)
  • begin UpdateFreq
  • 1. Find the set of k buckets b1, ..., bk that overlap the
    selection range.
  • 2. Let est be the estimated result size of the selection
    range computed from histogram h.
  • 3. Let act be the actual result size.
  • 4. Compute the estimation error act - est, denoted esterr
    (continued on the next slide).

8
  • 5. for i = 1 to k do
  • 6. Compute frac, the fraction of bucket bi that overlaps the
    selection range, using the average (uniformity) assumption
    within the bucket; only the first and last overlapping
    buckets can be partially covered.
  • 7. freq(bi) = max(freq(bi) + esterr * frac * freq(bi) / est, 0),
    i.e. the error is distributed among the overlapping buckets
    in proportion to the frequency each one contributes to the
    estimate.
  • 8. endfor
  • end UpdateFreq (a code sketch of this step follows)
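A sketch of this refinement step, using the bucket representation from the
initialization sketch above. It treats bucket boundaries as continuous
intervals and applies a damping factor alpha to the error, an assumption not
visible in the transcript, so the exact arithmetic may differ from the paper's.

```python
# Sketch of the frequency-refinement step (UpdateFreq), assuming buckets with
# low/high boundaries and a freq field, and a damping factor alpha.

def update_freq(hist, range_low, range_high, act, alpha=0.5):
    # 1. Find the buckets that overlap the selection range.
    overlapping = [b for b in hist if b.high > range_low and b.low < range_high]
    if not overlapping:
        return
    # 2. Estimate the result size with the uniformity (average) assumption:
    #    each bucket contributes the fraction of it covered by the range.
    contrib = []
    for b in overlapping:
        frac = (min(b.high, range_high) - max(b.low, range_low)) / (b.high - b.low)
        contrib.append(frac * b.freq)
    est = sum(contrib)
    # 3. Damped estimation error between actual and estimated result sizes.
    esterr = alpha * (act - est)
    # 4. Distribute the error among the overlapping buckets in proportion to the
    #    frequency each one contributed to the estimate (never below zero).
    for b, c in zip(overlapping, contrib):
        share = c / est if est > 0 else 1.0 / len(overlapping)
        b.freq = max(b.freq + esterr * share, 0)
```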
9
Refinement example
act = 60, rangehigh = 25, rangelow = 12.5, high(b1) = 15, low(b1) = 5
[Figure: the frequencies of the overlapping buckets are updated
according to the UpdateFreq algorithm above]
10
Restructuring Algorithm
  • Motivation
  • Within a bucket, frequencies are approximated by their
    average. Thus, high-frequency values should end up in
    high-frequency buckets, but they may be grouped with
    low-frequency values.
  • When a query range coincides with the bucket boundaries of
    the histogram, no averaging assumption is needed.

11
Restructuring Algorithm
Merging step (a code sketch follows this list)
  • 1. For every pair of consecutive runs of buckets, find the
    maximum difference in frequency between a bucket in the first
    run and a bucket in the second run.
  • 2. Find the minimum of all these maximum differences, denoted
    mindiff.
  • 3. if mindiff < m·T then
  • 4. Merge the two runs of buckets corresponding to mindiff into
    one run.
  • 5. Look for other runs to merge: go to step 1.
  • 6. endif
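A sketch of the merging step, assuming a run is simply a list of adjacent
bucket frequencies and m is the merge threshold expressed as a fraction of T;
the representation and names are illustrative.

```python
# Sketch of the merging step: group adjacent buckets with similar frequencies
# into runs (illustrative representation: a bucket is just its frequency).

def merge_runs(freqs, m, T):
    """freqs: list of bucket frequencies; m: merge threshold as a fraction of T."""
    runs = [[f] for f in freqs]               # initially every bucket is its own run
    while len(runs) > 1:
        # Max frequency difference between any bucket of each pair of consecutive runs.
        diffs = [max(abs(a - b) for a in runs[i] for b in runs[i + 1])
                 for i in range(len(runs) - 1)]
        i = min(range(len(diffs)), key=diffs.__getitem__)
        if diffs[i] >= m * T:                 # no pair of runs is similar enough
            break
        runs[i:i + 2] = [runs[i] + runs[i + 1]]   # merge the two runs and retry
    return runs

# Example: buckets with a clear low/high split, T = 100 and m = 0.05.
print(merge_runs([10, 12, 11, 60, 63], m=0.05, T=100))
# -> [[10, 12, 11], [60, 63]]
```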

12
Restructuring Algorithm
Splitting step (a code sketch follows this list)
  • k = s·B, where the candidate buckets are those that have not
    been chosen for merging.
  • Find the set of k buckets with the highest frequencies.
  • Compute the number of extra buckets assigned to each one by
  • split(bi) = (freq(bi) / totalfreq) · B'
  • where totalfreq is the sum of the frequencies of all the
    buckets to be split
  • and B' is the number of extra buckets freed by merging.
  • Each resulting bucket's frequency is the old frequency divided
    by split(bi) + 1.
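A sketch of the splitting step under the same assumptions. Buckets are
represented here as (low, high, freq) tuples, s is the fraction of buckets
chosen for splitting, and extra is the number of buckets freed by merging; the
slide's restriction to buckets not already chosen for merging is omitted for
brevity, and all names are illustrative.

```python
# Sketch of the splitting step: distribute the buckets freed by merging among
# the highest-frequency buckets, in proportion to their frequencies.

def split_buckets(buckets, s, extra):
    k = max(1, int(s * len(buckets)))
    # Pick the k buckets with the highest frequencies as split candidates.
    candidates = sorted(buckets, key=lambda b: b[2], reverse=True)[:k]
    totalfreq = sum(b[2] for b in candidates)
    result = []
    for low, high, freq in buckets:
        if (low, high, freq) not in candidates or totalfreq == 0:
            result.append((low, high, freq))
            continue
        # split(bi): extra buckets assigned in proportion to this bucket's frequency.
        n_extra = round(freq / totalfreq * extra)
        pieces = n_extra + 1
        width = (high - low) / pieces
        # Each resulting bucket gets the old frequency divided by split(bi) + 1.
        result.extend((low + j * width, low + (j + 1) * width, freq / pieces)
                      for j in range(pieces))
    return result

# Example: split the 2 most frequent of 5 buckets, using 3 buckets freed by merging.
hist = [(0, 10, 5), (10, 20, 8), (20, 30, 70), (30, 40, 30), (40, 50, 7)]
print(split_buckets(hist, s=0.4, extra=3))
```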

13
Restructuring example
m·T = 3, s·B = 2
[Figure: bucket frequencies before and after restructuring; two
pairs of adjacent runs with similar frequencies are merged, and
the freed buckets are used to split the two highest-frequency
buckets]
14
Multi-dimensional ST-histogram
  • Initialization (two options):
  • Assume complete uniformity and independence, or
  • Use existing one-dimensional ST-histograms, assuming
    independence of the attributes (a sketch of this option
    follows).
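A sketch of the second option: under the independence assumption, the
selectivity of a joint bucket is the product of the marginal selectivities, so
the frequency of cell (i, j) is fx(i) * fy(j) / T. The function name and the
2-D restriction are illustrative.

```python
# Sketch: initialize a 2-D ST-histogram grid from two existing 1-D
# ST-histograms, assuming attribute independence.

def init_2d_from_1d(freqs_x, freqs_y, T):
    """Joint frequency of cell (i, j) = T * (fx/T) * (fy/T) = fx * fy / T."""
    return [[fx * fy / T for fy in freqs_y] for fx in freqs_x]

# Example: two 1-D histograms over a 1000-tuple table; the grid sums to 1000.
grid = init_2d_from_1d([400, 600], [250, 250, 500], T=1000)
```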

15
Refining the bucket frequencies
Multi-dimensional
  • The refining algorithm for multi-dimensional histograms is
    identical to the one-dimensional algorithm except for the
    following two changes (a sketch of the second follows):
  • Finding the buckets that overlap the selection range now
    requires examining a multi-dimensional structure.
  • The fraction of a bucket overlapping the selection range is
    now the volume of the overlapping region divided by the
    volume of the region represented by the whole bucket.
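A sketch of the second change: the fraction of a multi-dimensional bucket
covered by the selection range is the product, over all dimensions, of the
one-dimensional overlap fractions, i.e. the overlap volume divided by the
bucket volume. The representation (lists of per-dimension (low, high) pairs)
is illustrative.

```python
# Sketch of the multi-dimensional overlap fraction used during refinement.

def overlap_fraction(bucket, sel_range):
    """bucket, sel_range: lists of (low, high) intervals, one per dimension."""
    frac = 1.0
    for (b_low, b_high), (r_low, r_high) in zip(bucket, sel_range):
        overlap = max(0.0, min(b_high, r_high) - max(b_low, r_low))
        if overlap == 0.0:
            return 0.0                    # no overlap in this dimension
        frac *= overlap / (b_high - b_low)
    return frac

# Example: a 2-D bucket [0,10] x [0,10] and a selection range [5,20] x [0,5].
print(overlap_fraction([(0, 10), (0, 10)], [(5, 20), (0, 5)]))   # 0.5 * 0.5 = 0.25
```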

16
Restructuring
Multi-dimensional
  • Merge: find the maximum difference in frequency between any
    two corresponding buckets of the two partitions being
    considered; merge the partitions if this difference is within
    m·T (m < 1).
  • Split: the frequency of partition ji of dimension i is
    computed from the frequencies of the buckets in that
    partition.

Example (restructuring one dimension of a 2-D ST-histogram):

            [1,5]  [6,10]  [11,15]  [16,20]  [21,25]
  [1,10]      10      5       8       20        7     Max S = 50
  [11,20]     14      9       7       19       11     Max S = 60

  Max diff between corresponding buckets = 4
17
Accuracy of ST-histograms
For the same memory limit, ST-histograms have more buckets than
MHIST-2, because MHIST-2 buckets are more complex and consume
more memory each.
18
Adapting to Database Updates
R1 - the relation before the update, with skew z = 1. R2 - the
relation after updating it by deleting 25% of its tuples and
inserting an equal number of new tuples.
19
Dependence of accuracy on the frequency of queries.
20
Conclusions
  • ST-histograms are better than assuming uniformity and
    independence for all values of data skew (z).
  • For low data skew, ST-histograms were found to be sufficiently
    accurate compared with traditional histograms.
  • They are especially attractive as multi-dimensional
    histograms, since traditional multi-dimensional histograms are
    costly to build.
  • For high data skew, ST-histograms are much less accurate than
    MHIST-2.
  • Combining traditional histograms and ST-histograms across the
    range of data skew can yield the best of both approaches.