Histograms for Selectivity Estimation, Part II
1
Histograms for Selectivity Estimation, Part II
Global Optimization of Histograms
  • Speaker Ho Wai Shing

2
Contents
  • Supplements to the previous talk
  • Introduction to histograms for multi-dimensional
    data
  • Global optimization of histograms
  • Experiment results
  • Conclusion

3
Summary
  • Histograms approximate the frequency distribution
    of an attribute (or a set of attributes)
  • group attribute values into "buckets"
  • approximate the actual frequencies by the
    statistical information stored in each bucket.
  • A taxonomy was discussed
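The bucketing idea summarized above can be sketched in a few lines of Python. This is a hypothetical illustration of one taxonomy member (equi-width), not code from the talk:

```python
def equi_width_histogram(values, num_buckets):
    """Group attribute values into equal-width buckets; each bucket
    keeps only its boundaries and total frequency."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets
    buckets = [0] * num_buckets
    for v in values:
        # Clamp the maximum value into the last bucket.
        idx = min(int((v - lo) / width), num_buckets - 1)
        buckets[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, freq)
            for i, freq in enumerate(buckets)]

# Ten tuples of a toy attribute; two buckets cover [1, 5) and [5, 9].
hist = equi_width_histogram([1, 1, 2, 3, 3, 3, 7, 8, 9, 9], 2)
```

A selectivity estimator then consults only the bucket summaries, never the raw data.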

4
Summary of the 1-D Histogram Taxonomy
  [taxonomy diagram]
5
Summary of the 1-D Histogram Taxonomy
  • Histograms in the taxonomy, over the data
    distribution: equi-width, equi-depth,
    Compressed(V, F), Max-Diff(V, F),
    V-optimal(V, F), V-optimal(F, F)
6
Estimation Procedures
  • Find the histogram buckets that overlap the query
    range
  • Estimate the counts from the overlapping portion of
    the query range and each bucket, under one of:
  • continuous values assumption
  • point value assumption
  • uniform spread assumption
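As an illustration of the uniform spread assumption (a hypothetical sketch, not the paper's code): a bucket stores its lowest and highest value, its number of distinct values, and its total frequency; the distinct values are assumed equally spaced, each carrying the average frequency.

```python
def estimate_range(bucket, a, b):
    """Estimate how many tuples satisfy a < X <= b within one bucket,
    under the uniform spread assumption: the bucket's distinct values
    are assumed equally spaced between its lowest and highest value,
    each carrying the average frequency."""
    lo, hi, num_distinct, total_freq = bucket
    avg_freq = total_freq / num_distinct
    if num_distinct == 1:
        return total_freq if a < lo <= b else 0.0
    spread = (hi - lo) / (num_distinct - 1)
    # Assumed value positions: lo, lo + spread, ..., hi
    hits = sum(1 for i in range(num_distinct)
               if a < lo + i * spread <= b)
    return hits * avg_freq

# Bucket covering values 0..9 with 4 distinct values, total frequency 20;
# the assumed positions are 0, 3, 6, 9, each with frequency 5.
est = estimate_range((0, 9, 4, 20), -1, 5)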

7
Uniform Frequency Assumptions
  • continuous values assumption
  • point value assumption
  • uniform spread assumption

  [figure: a bucket holding values 4, 5, 2, 9 with total
   frequency 16 (4 per value on average), illustrated
   under each of the three assumptions]
8
Uniform Frequency Assumptions
  • Experimental results show that the uniform spread
    assumption gives the best estimates
  • surprisingly, the uniform spread assumption also
    beats the continuous values assumption on data with
    non-uniform spreads
  • e.g., if the spread is large, many queries should
    return 0

9
Experiments
  • Datasets
  • in the form of <dataValue, frequency> pairs
  • distribution of dataValues
  • uniform, zipf_incr, zipf_dec, etc.
  • frequencies: Zipf, with different skew factors
  • different value-frequency correlations
  • positive, negative, random

10
Experiments
  • The data value distribution

11
Experiments
  • Queries
  • of the form a < X ≤ b
  • 5 types
  • a = -1, b in the range of values
  • a = -1, b one of the values that appear
  • random a and b, s.t. selectivity in [0, 0.2]
  • random a and b, s.t. selectivity in [0.8, 1]
  • random a and b, s.t. selectivity in
    [0, 0.2] ∪ [0.8, 1]

12
Experiments
  • Histograms
  • all histograms described in the taxonomy
  • the histograms are of the same size (not the same
    number of buckets)
  • built from 2000 samples (except trivial,
    equi-depth precise, and P2)
  • built through scanning the data once

13
Experiments
  • cusp_max value distribution
  • random value freq. correlation
  • z 1
  • sort param V usually means better accuracy

14
Experiments
  • errors of the v-optimal(V,A) histogram
  • increasing the sample size gives better results

15
End of Part I
16
Part II Global Optimization of Histograms (GOH)
17
Histograms for n-D data
  • Histograms discussed previously are on a single
    attribute (1-D data)
  • Two main approaches for n-D data
  • use n 1-D histograms
  • use one n-D histogram

18
Histograms for n-D data
  • n 1-D histograms
  • need the "attribute value independence assumption"
  • can use all 1-D histogram techniques
  • already give quite good accuracy
  • representative: GOH [Jagadish et al. SIGMOD'01]

19
Histograms for n-D data
  • n-D histograms
  • don't need the attribute value independence
    assumption (AVIA), which is usually not true
  • difficult to compute, store, and maintain a "good"
    partition of the n-D space into buckets
  • representatives: MHIST [Poosala & Ioannidis
    VLDB'97], H-Tree [Muralikrishna & DeWitt SIGMOD'88]

20
Global Optimization of Histograms (GOH)
  • given a space budget of B buckets, find an optimal
    assignment of buckets among the dimensions that
    minimizes the error
  • i.e., give more buckets to attributes that are
    queried frequently or have skewed distributions

21
GOH -- example
  • e.g., if A1 lt3, 5gt, lt4, 5gt, lt5, 5.5gt, A2lt1, 9gt,
    lt3, 2gt, lt5, 5gt and we have 4 buckets,
  • an A11, A23 assignment is better than a A12,
    A22 assignment.

22
Computing GOH
  • Exhaustive Algorithm (GOHEA)
  • Based on Dynamic Programming (GOHDP)
  • Greedy Approach (GOHGA)
  • Greedy Approach with Remedy (GOHGR)

23
GOHEA
  • For every possible bucket assignment, calculate
    the Error metric and find the minimum
  • clearly too inefficient
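GOHEA can be sketched as brute-force enumeration of every way to split B buckets among M dimensions, at least one per dimension. A hypothetical Python sketch, where `error[k][x]` is an assumed precomputed table giving the error of histogram k built with x buckets:

```python
from itertools import combinations

def goh_exhaustive(error, M, B):
    """GOHEA sketch: try every composition of B buckets into M
    positive parts and keep the allocation with the smallest
    total error."""
    best, best_alloc = float('inf'), None
    # Each choice of M-1 divider positions in 1..B-1 yields one
    # composition of B into M positive parts.
    for dividers in combinations(range(1, B), M - 1):
        cuts = (0,) + dividers + (B,)
        alloc = [cuts[i + 1] - cuts[i] for i in range(M)]
        total = sum(error[k + 1][alloc[k]] for k in range(M))
        if total < best:
            best, best_alloc = total, alloc
    return best, best_alloc

# Hypothetical error table: error[k][x] = error of histogram k
# built with x buckets.
error = {1: {1: 10, 2: 4, 3: 1}, 2: {1: 8, 2: 7, 3: 6}}
best, alloc = goh_exhaustive(error, 2, 4)
```

The number of compositions grows combinatorially with M and B, which is why the slide calls this approach too inefficient.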

24
GOHDP
  • Define
  • E(b, k): the minimum error of using b buckets to
    store the first k histograms
  • error(b, k): the error of using b buckets to
    store the k-th histogram
  • observation
  • E(b, k) = min_x ( E(b-x, k-1) + error(x, k) )

25
GOHDP
  • calculating all error(b, k): O(N²BM)
  • another DP-based algorithm can calculate all
    error(b, k) for a given k in O(N²B) [Jagadish et
    al. VLDB'98]
  • filling the E(b, k) table: O(B²M)
  • O(BM) entries to be filled, and O(B) computations
    for each entry

26
GOHDP
  • note that if we know the allocation beforehand, we
    need O(N²B) to construct the histograms
  • GOHDP is still inefficient if M is large

27
GOHGA
  • The greedy approach is O(N²B)
  • i.e., nearly no penalty over direct construction
    (using the same number of buckets on every
    attribute)
  • define
  • marginal gain mk(i, j) = error(i, k) - error(j, k),
    i.e., the reduction in error if we use j buckets
    instead of i buckets

28
GOHGA
  • 1. assign 1 bucket to each dimension
  • 2. allocate buckets one by one, each to the
    dimension with the greatest marginal gain from the
    new bucket
  • 3. repeat until all buckets are assigned
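The three steps above can be sketched as follows (hypothetical Python; `error[k][x]` is again an assumed precomputed error table, non-increasing in x):

```python
def goh_greedy(error, M, B):
    """GOHGA sketch: start with 1 bucket per dimension, then give
    each remaining bucket to the dimension with the largest marginal
    gain mk(i, i+1) = error(i, k) - error(i+1, k)."""
    alloc = {k: 1 for k in range(1, M + 1)}
    for _ in range(B - M):              # remaining buckets
        gains = {k: error[k][alloc[k]] - error[k][alloc[k] + 1]
                 for k in range(1, M + 1)}
        best = max(gains, key=gains.get)   # greatest error reduction
        alloc[best] += 1
    return alloc

# Hypothetical error table: error[k][x] = error of histogram k
# built with x buckets.
error = {1: {1: 10, 2: 4, 3: 1, 4: 1}, 2: {1: 8, 2: 7, 3: 6, 4: 6}}
alloc = goh_greedy(error, 2, 4)
```

Here the greedy choice happens to match the optimum, but as the next slides note, a purely 1-step look-ahead can miss it.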

29
GOHGA
  • O(B) steps
  • O(N²) per marginal gain calculation
  • since we can get error(b, k) from b − 1 in O(N²b)
  • overall O(N²B)
  • but sometimes GOHGA does not return the optimum

30
GOHGR
  • greedy: look ahead 1 step to see which allocation
    has the greatest marginal gain
  • remedy: look ahead 2 steps to see if it can find a
    better allocation
  • e.g., with m1(3,4) = 30, m1(4,5) = 130,
    m2(3,4) = 40, m2(4,5) = 40, greedy gives the next
    two buckets to dimension 2 (gain 40 + 40 = 80), but
    giving both to dimension 1 gains 30 + 130 = 160

31
Experiments
  • aim
  • to show that GOH really has a smaller error, by
    allocating more buckets to more skewed data
  • to show that GOH can be computed efficiently

32
Experiments
  • types
  • abs/rel errors of different attributes
  • abs/rel errors for different bucket budgets
  • abs/rel errors for different distribution skews
  • running time

33
Experiments
  • dataset (synthetic data)
  • 5 attributes
  • 500,000 tuples, 500 values per attribute
  • frequencies follow a Zipf distribution, with random
    association between frequency and value
  • the 5 attributes have z = 0, 0.01, 0.1, 1, 2
  • 10,000 queries of the form X ≤ a

34
Experiments
35
Experiments
36
Experiments
  • dataset (TPC-D dataset with skewed data)
  • a more realistic dataset
  • gives similar results to the synthetic data
  • dataset (2 attributes)
  • to evaluate the gain of GOH due to the skew
    difference between attributes

37
Experiments
  • TG3: z = 0, 2
  • TG4: z = 0.02, 1.8
  • TG5: z = 1.8, 2

38
Experiments
39
Conclusions
  • GOH has smaller errors because it assigns more
    buckets to skewed or frequently used distributions
  • Nearly no time penalty for building GOH using GOHGR

40
Future Work
  • The methods presented can't solve the n-D histogram
    problem completely
  • Try to apply the SF-Tree to store and retrieve the
    buckets of multi-dimensional histograms efficiently

41
References
  • [Jagadish et al. VLDB'98] H.V. Jagadish, Nick
    Koudas, S. Muthukrishnan, Viswanath Poosala, Ken
    Sevcik, Torsten Suel, Optimal Histograms with
    Quality Guarantees, VLDB '98
  • [Poosala et al. SIGMOD'96] Viswanath Poosala,
    Yannis Ioannidis, Peter Haas, Eugene Shekita,
    Improved Histograms for Selectivity Estimation of
    Range Predicates, SIGMOD '96
  • [Poosala & Ioannidis VLDB'97] Viswanath Poosala
    and Yannis Ioannidis, Selectivity Estimation
    Without the Attribute Value Independence
    Assumption, VLDB '97

42
References
  • [Muralikrishna & DeWitt SIGMOD'88] M. Muralikrishna
    and D. DeWitt, Equi-Depth Histograms for Estimating
    Selectivity Factors for Multi-Dimensional Queries,
    SIGMOD '88