Histograms for Selectivity Estimation, Part II
1
Histograms for Selectivity Estimation, Part II
Global Optimization of Histograms
  • Speaker Ho Wai Shing

2
Contents
  • Supplements to the previous talk
  • Introduction to histograms for multi-dimensional
    data
  • Global optimization of histograms
  • Experiment results
  • Conclusion

3
Summary
  • Histograms approximate the frequency distribution
    of an attribute (or a set of attributes)
  • group attribute values into "buckets"
  • approximate the actual frequencies by the
    statistical information stored in each bucket.
  • A taxonomy was discussed
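The bucketing idea summarized above can be sketched in a few lines of Python. This is a hypothetical illustration of one taxonomy member (equi-width), not code from the talk:

```python
def equi_width_histogram(values, num_buckets):
    """Group attribute values into equal-width buckets; each bucket
    keeps only its boundaries and total frequency."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets
    buckets = [0] * num_buckets
    for v in values:
        # Clamp the maximum value into the last bucket.
        idx = min(int((v - lo) / width), num_buckets - 1)
        buckets[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, freq)
            for i, freq in enumerate(buckets)]

# Ten tuples of a toy attribute; two buckets cover [1, 5) and [5, 9].
hist = equi_width_histogram([1, 1, 2, 3, 3, 3, 7, 8, 9, 9], 2)
```

A selectivity estimator then consults only the bucket summaries, never the raw data.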

4
Summary of the 1-D Histogram Taxonomy
  [taxonomy diagram]
5
Summary of the 1-D Histogram Taxonomy
  • Histograms in the taxonomy, over the data
    distribution: equi-width, equi-depth,
    Compressed(V, F), Max-Diff(V, F),
    V-optimal(V, F), V-optimal(F, F)
6
Estimation Procedures
  • Find the histogram buckets that overlap the query
    range
  • Estimate the counts from the overlapping portion of
    the query range and each bucket, under one of:
  • continuous values assumption
  • point value assumption
  • uniform spread assumption
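As an illustration of the uniform spread assumption (a hypothetical sketch, not the paper's code): a bucket stores its lowest and highest value, its number of distinct values, and its total frequency; the distinct values are assumed equally spaced, each carrying the average frequency.

```python
def estimate_range(bucket, a, b):
    """Estimate how many tuples satisfy a < X <= b within one bucket,
    under the uniform spread assumption: the bucket's distinct values
    are assumed equally spaced between its lowest and highest value,
    each carrying the average frequency."""
    lo, hi, num_distinct, total_freq = bucket
    avg_freq = total_freq / num_distinct
    if num_distinct == 1:
        return total_freq if a < lo <= b else 0.0
    spread = (hi - lo) / (num_distinct - 1)
    # Assumed value positions: lo, lo + spread, ..., hi
    hits = sum(1 for i in range(num_distinct)
               if a < lo + i * spread <= b)
    return hits * avg_freq

# Bucket covering values 0..9 with 4 distinct values, total frequency 20;
# the assumed positions are 0, 3, 6, 9, each with frequency 5.
est = estimate_range((0, 9, 4, 20), -1, 5)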

7
Uniform Frequency Assumptions
  • continuous values assumption
  • point value assumption
  • uniform spread assumption

  [figure: a bucket holding values 4, 5, 2, 9 with total
   frequency 16 (4 per value on average), illustrated
   under each of the three assumptions]
8
Uniform Frequency Assumptions
  • Experimental results show that the uniform spread
    assumption gives the best estimates
  • surprisingly, the uniform spread assumption also
    beats the continuous values assumption on data with
    non-uniform spreads
  • e.g., if the spread is large, many queries should
    return 0

9
Experiments
  • Datasets
  • in the form of <dataValue, frequency> pairs
  • distribution of dataValues
  • uniform, zipf_incr, zipf_dec, etc.
  • frequencies: Zipf, with different skew factors
  • different value-frequency correlations
  • positive, negative, random

10
Experiments
  • The data value distribution

11
Experiments
  • Queries
  • of the form a < X ≤ b
  • 5 types
  • a = -1, b in the range of values
  • a = -1, b one of the values that appear
  • random a and b, s.t. selectivity in [0, 0.2]
  • random a and b, s.t. selectivity in [0.8, 1]
  • random a and b, s.t. selectivity in
    [0, 0.2] ∪ [0.8, 1]

12
Experiments
  • Histograms
  • all histograms described in the taxonomy
  • the histograms are of the same size (not the same
    number of buckets)
  • built from 2000 samples (except trivial,
    equi-depth precise, and P2)
  • built through scanning the data once

13
Experiments
  • cusp_max value distribution
  • random value freq. correlation
  • z 1
  • sort param V usually means better accuracy

14
Experiments
  • errors of the v-optimal(V,A) histogram
  • increasing the sample size gives better results

15
End of Part I
16
Part II Global Optimization of Histograms (GOH)
17
Histograms for n-D data
  • Histograms discussed previously are on a single
    attribute (1-D data)
  • Two main approaches for n-D data
  • use n 1-D histograms
  • use one n-D histogram

18
Histograms for n-D data
  • n 1-D histograms
  • need the "attribute value independence assumption"
  • can use all 1-D histogram techniques
  • already give quite good accuracy
  • representative: GOH [Jagadish et al. SIGMOD'01]

19
Histograms for n-D data
  • n-D histograms
  • don't need the attribute value independence
    assumption (AVIA), which is usually not true
  • difficult to compute, store, and maintain a "good"
    partition of the n-D space into buckets
  • representatives: MHIST [Poosala & Ioannidis
    VLDB'97], H-Tree [Muralikrishna & DeWitt SIGMOD'88]

20
Global Optimization of Histograms (GOH)
  • given a space budget of B buckets, find an optimal
    assignment of buckets among the dimensions that
    minimizes the error
  • i.e., give more buckets to attributes that are
    queried frequently or have skewed distributions

21
GOH -- example
  • e.g., if A1 lt3, 5gt, lt4, 5gt, lt5, 5.5gt, A2lt1, 9gt,
    lt3, 2gt, lt5, 5gt and we have 4 buckets,
  • an A11, A23 assignment is better than a A12,
    A22 assignment.

22
Computing GOH
  • Exhaustive Algorithm (GOHEA)
  • Based on Dynamic Programming (GOHDP)
  • Greedy Approach (GOHGA)
  • Greedy Approach with Remedy (GOHGR)

23
GOHEA
  • For every possible bucket assignment, calculate
    the Error metric and find the minimum
  • clearly too inefficient
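GOHEA can be sketched as brute-force enumeration of every way to split B buckets among M dimensions, at least one per dimension. A hypothetical Python sketch, where `error[k][x]` is an assumed precomputed table giving the error of histogram k built with x buckets:

```python
from itertools import combinations

def goh_exhaustive(error, M, B):
    """GOHEA sketch: try every composition of B buckets into M
    positive parts and keep the allocation with the smallest
    total error."""
    best, best_alloc = float('inf'), None
    # Each choice of M-1 divider positions in 1..B-1 yields one
    # composition of B into M positive parts.
    for dividers in combinations(range(1, B), M - 1):
        cuts = (0,) + dividers + (B,)
        alloc = [cuts[i + 1] - cuts[i] for i in range(M)]
        total = sum(error[k + 1][alloc[k]] for k in range(M))
        if total < best:
            best, best_alloc = total, alloc
    return best, best_alloc

# Hypothetical error table: error[k][x] = error of histogram k
# built with x buckets.
error = {1: {1: 10, 2: 4, 3: 1}, 2: {1: 8, 2: 7, 3: 6}}
best, alloc = goh_exhaustive(error, 2, 4)
```

The number of compositions grows combinatorially with M and B, which is why the slide calls this approach too inefficient.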

24
GOHDP
  • Define
  • E(b, k): the minimum error of using b buckets to
    store the first k histograms
  • error(b, k): the error of using b buckets to
    store the k-th histogram
  • observation
  • E(b, k) = min_x ( E(b-x, k-1) + error(x, k) )

25
GOHDP
  • calculating all error(b, k): O(N²BM)
  • another DP-based algorithm can calculate all
    error(b, k) for a given k in O(N²B) [Jagadish et
    al. VLDB'98]
  • filling the E(b, k) table: O(B²M)
  • O(BM) entries to be filled, and O(B) computations
    for each entry

26
GOHDP
  • note that if we know the allocation beforehand, we
    need O(N²B) to construct the histograms
  • GOHDP is still inefficient if M is large

27
GOHGA
  • The greedy approach is O(N²B)
  • i.e., nearly no penalty over direct construction
    (using the same number of buckets on every
    attribute)
  • define
  • marginal gain mk(i, j) = error(i, k) - error(j, k),
    i.e., the reduction in error if we use j buckets
    instead of i buckets

28
GOHGA
  • 1. assign 1 bucket to each dimension
  • 2. allocate buckets one by one, each to the
    dimension with the greatest marginal gain from the
    new bucket
  • 3. repeat until all buckets are assigned
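The three steps above can be sketched as follows (hypothetical Python; `error[k][x]` is again an assumed precomputed error table, non-increasing in x):

```python
def goh_greedy(error, M, B):
    """GOHGA sketch: start with 1 bucket per dimension, then give
    each remaining bucket to the dimension with the largest marginal
    gain mk(i, i+1) = error(i, k) - error(i+1, k)."""
    alloc = {k: 1 for k in range(1, M + 1)}
    for _ in range(B - M):              # remaining buckets
        gains = {k: error[k][alloc[k]] - error[k][alloc[k] + 1]
                 for k in range(1, M + 1)}
        best = max(gains, key=gains.get)   # greatest error reduction
        alloc[best] += 1
    return alloc

# Hypothetical error table: error[k][x] = error of histogram k
# built with x buckets.
error = {1: {1: 10, 2: 4, 3: 1, 4: 1}, 2: {1: 8, 2: 7, 3: 6, 4: 6}}
alloc = goh_greedy(error, 2, 4)
```

Here the greedy choice happens to match the optimum, but as the next slides note, a purely 1-step look-ahead can miss it.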

29
GOHGA
  • O(B) steps
  • O(N²) per marginal gain calculation
  • since we can get error(b, k) from b − 1 in O(N²b)
  • overall O(N²B)
  • but sometimes GOHGA does not return the optimum

30
GOHGR
  • greedy: look ahead 1 step to see which allocation
    has the greatest marginal gain
  • remedy: look ahead 2 steps to see if it can find a
    better allocation
  • e.g., with m1(3,4) = 30, m1(4,5) = 130,
    m2(3,4) = 40, m2(4,5) = 40, greedy gives the next
    two buckets to dimension 2 (gain 40 + 40 = 80), but
    giving both to dimension 1 gains 30 + 130 = 160

31
Experiments
  • aim
  • to show that GOH really has a smaller error, by
    allocating more buckets to more skewed data
  • to show that GOH can be computed efficiently

32
Experiments
  • types
  • abs/rel errors of different attributes
  • abs/rel errors for different bucket budgets
  • abs/rel errors for different distribution skews
  • running time

33
Experiments
  • dataset (synthetic data)
  • 5 attributes
  • 500,000 tuples, 500 values per attribute
  • frequencies follow a Zipf distribution, with random
    association between frequency and value
  • the 5 attributes have z = 0, 0.01, 0.1, 1, 2
  • 10,000 queries of the form X ≤ a

34
Experiments
35
Experiments
36
Experiments
  • dataset (TPC-D dataset with skewed data)
  • a more realistic dataset
  • gives similar results to the synthetic data
  • dataset (2 attributes)
  • to evaluate the gain of GOH due to the skew
    difference between attributes

37
Experiments
  • TG3: z = 0, 2
  • TG4: z = 0.02, 1.8
  • TG5: z = 1.8, 2

38
Experiments
39
Conclusions
  • GOH has smaller errors because it assigns more
    buckets to skewed or frequently used distributions
  • Nearly no time penalty for building GOH using GOHGR

40
Future Work
  • The methods presented can't solve the n-D histogram
    problem completely
  • Try to apply the SF-Tree to store and retrieve the
    buckets of multi-dimensional histograms efficiently

41
References
  • [Jagadish et al. VLDB'98] H.V. Jagadish, Nick
    Koudas, S. Muthukrishnan, Viswanath Poosala, Ken
    Sevcik, Torsten Suel, Optimal Histograms with
    Quality Guarantees, VLDB '98
  • [Poosala et al. SIGMOD'96] Viswanath Poosala,
    Yannis Ioannidis, Peter Haas, Eugene Shekita,
    Improved Histograms for Selectivity Estimation of
    Range Predicates, SIGMOD '96
  • [Poosala & Ioannidis VLDB'97] Viswanath Poosala
    and Yannis Ioannidis, Selectivity Estimation
    Without the Attribute Value Independence
    Assumption, VLDB '97

42
References
  • [Muralikrishna & DeWitt SIGMOD'88] M. Muralikrishna
    and D. DeWitt, Equi-Depth Histograms for Estimating
    Selectivity Factors for Multi-Dimensional Queries,
    SIGMOD '88