1
A Robust, Optimization-Based Approach for
Approximate Answering of Aggregate Queries
  • Surajit Chaudhuri
  • Gautam Das
  • Vivek Narasayya
  • Presented By
  • Vivek Tanneeru
  • Venkata Dinesh Jammula

2
Outline
  • Introduction
  • Objective
  • Drawbacks of Previous work
  • Related Work
  • Architecture for Approximate Query Processing
  • Classical Sampling Techniques
  • Special Case of a Fixed Workload
  • Lifting Workload to Query Distributions
  • Rationale for Stratified Sampling
  • Solution for Single-Table Selection Queries with
    Aggregation
  • Extensions for General Workloads
  • Comparisons
  • Experimental Results
  • Summary
  • References

3
1. Introduction
  • Decision support applications such as OLAP and data
    mining analyze large databases
  • Approximate answers, delivered accurately and
    efficiently, benefit the scalability of such
    applications
  • Workload information can guide the selection of
    samples of the data

4
2. Objective
  • Pre-compute a sample as an optimization problem
  • Minimize error in estimation of aggregates
  • Implemented on Microsoft SQL Server 2000, for an
    effective solution to be deployed in Commercial
    DBMS

5
3. Drawbacks of Previous Work
  • Lack of rigorous problem formulations leads to
    solutions that are difficult to evaluate theoretically
  • Does not deal with uncertainty in the expected
    workload
  • Ignores the variance in the data distribution of
    aggregated columns

6
4. Related Work
  • Weighted Sampling
  • Outlier Index
  • Congressional Sampling
  • On-the-fly Sampling
  • Histograms

7
5. Architecture for Approximate Query
Processing
  • Preliminaries
  • Consider queries with selections, foreign-key joins
    and GROUP BY, containing aggregation functions such
    as COUNT, SUM and AVG
  • Assume a pre-designated amount of storage space is
    available for selecting samples from the database
  • Sample selection can be randomized or deterministic

8
Architecture
9
Error Metrics
  • If the correct answer for query Q is y while the
    approximate answer is y'
  • Relative error: E(Q) = |y - y'| / y
  • Squared error: SE(Q) = ((y - y') / y)²
  • If the correct answer for the i-th group is yi while
    the approximate answer is yi'
  • Squared error in answering a GROUP BY query Q:
  • SE(Q) = (1/g) Σi ((yi - yi') / yi)²
  • Given a probability distribution of queries pW
  • Mean squared error for the distribution:
  • MSE(pW) = ΣQ pW(Q)·SE(Q), where pW(Q) is the
    probability of query Q
  • Root mean squared error (L2 metric):
  • RMSE(pW) = √MSE(pW)
  • Other error metrics
  • L1 metric: the expected relative error over all
    queries in the workload
  • L∞ metric: the maximum error over all queries
    (these metrics are computed in the sketch below)
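The bullet points above translate directly into a few lines of code. The following sketch (my own illustration, not code from the paper) computes the relative error, squared error, and the workload-level MSE, RMSE, L1 and L∞ metrics from pairs of exact and approximate answers; the query answers used here are hypothetical.

    import math

    def relative_error(y, y_approx):
        # E(Q) = |y - y'| / y
        return abs(y - y_approx) / abs(y)

    def squared_error(y, y_approx):
        # SE(Q) = ((y - y') / y)^2
        return ((y - y_approx) / y) ** 2

    def groupby_squared_error(exact, approx):
        # SE(Q) = (1/g) * sum_i ((yi - yi') / yi)^2 over the g groups
        g = len(exact)
        return sum(((yi - ai) / yi) ** 2 for yi, ai in zip(exact, approx)) / g

    # Each entry: (pW(Q), exact answer y, approximate answer y').
    workload = [(0.5, 100.0, 90.0), (0.3, 40.0, 44.0), (0.2, 10.0, 10.0)]

    mse = sum(p * squared_error(y, ya) for p, y, ya in workload)
    print("MSE  =", mse)
    print("RMSE =", math.sqrt(mse))   # L2 metric
    print("L1   =", sum(p * relative_error(y, ya) for p, y, ya in workload))
    print("Linf =", max(relative_error(y, ya) for _, y, ya in workload))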

10
6. Classical Sampling Techniques
  • Uniform Sampling
  • LEMMA 1
  • (a) µ is an unbiased estimator for y, namely,
    E[µ] = y
  • (b) µ·n is an unbiased estimator for Y, namely,
    E[µ·n] = Y
  • (c) the variance (or standard error) in estimating y
    is E[(µ - y)²] = S²/k
  • (d) the variance in estimating Y is
    E[(µ·n - Y)²] = n²S²/k, and
  • (e) the relative squared error in estimating Y is
    E[((µ·n - Y)/Y)²] = n²S²/(Y²k)
    (see the sketch below)
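A minimal simulation of Lemma 1 (an illustration on my own toy data, not code from the paper): draw a uniform random sample of k values, use its mean µ as the estimator, and compare the observed squared error against the predicted S²/k.

    import random, statistics

    def uniform_estimate(R, k):
        # Uniform random sample (with replacement); mu estimates y, mu*n estimates Y.
        sample = random.choices(R, k=k)
        mu = statistics.mean(sample)
        return mu, mu * len(R)

    R = [random.gauss(50, 10) for _ in range(10_000)]   # toy population
    y, Y = statistics.mean(R), sum(R)
    S2 = statistics.pvariance(R)
    k = 100

    errors = [(uniform_estimate(R, k)[0] - y) ** 2 for _ in range(2_000)]
    print("observed  E[(mu - y)^2]:", statistics.mean(errors))
    print("predicted S^2 / k      :", S2 / k)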

11
Classical Sampling Techniques
  • Stratified Sampling
  • LEMMA 2
  • (a) µ is an unbiased estimator for y, namely,
    E[µ] = y
  • (b) µ·n is an unbiased estimator for Y, namely,
    E[µ·n] = Y
  • (c) the variance in estimating y is
    E[(µ - y)²] = (1/n²) Σj nj²Sj²/kj
  • (d) the variance in estimating Y is
    E[(µ·n - Y)²] = Σj nj²Sj²/kj, and
  • (e) the relative squared error in estimating Y is
    E[((µ·n - Y)/Y)²] = (1/Y²) Σj nj²Sj²/kj
    (see the sketch below)
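The corresponding stratified estimator, again only as a sketch on made-up data: sample each stratum uniformly, combine the stratum means with weights nj/n, and compare against the variance predicted by Lemma 2(c).

    import random, statistics

    def stratified_estimate(strata, allocation):
        # strata: list of value lists; allocation: kj samples drawn from each stratum.
        n = sum(len(s) for s in strata)
        mu = sum((len(s) / n) * statistics.mean(random.choices(s, k=kj))
                 for s, kj in zip(strata, allocation))
        return mu, mu * n            # estimates of y and Y

    strata = [[random.gauss(10, 1) for _ in range(8_000)],     # low-variance stratum
              [random.gauss(500, 80) for _ in range(2_000)]]   # high-variance stratum
    allocation = [50, 50]
    n = sum(len(s) for s in strata)

    predicted = sum(len(s) ** 2 * statistics.pvariance(s) / kj
                    for s, kj in zip(strata, allocation)) / n ** 2   # Lemma 2(c)
    print("predicted E[(mu - y)^2]:", predicted)
    print("one estimate of y      :", stratified_estimate(strata, allocation)[0])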

12
Classical Sampling Techniques
  • Neyman Allocation
  • LEMMA 3
  • Given a population R = {y1, . . . , yn}, k and r, the
    optimal way to form r strata and allocate k samples
    among all strata is to first sort R and select strata
    boundaries so that Σj nj·Sj is minimized, and then,
    for the j-th stratum, to set the number of samples kj
    as
  • kj = k·(nj·Sj / Σj nj·Sj)
    (see the sketch below)
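Lemma 3's allocation rule in code (a sketch that assumes the strata are already formed; the sort-and-choose-boundaries step is omitted, and the example strata are made up):

    import statistics

    def neyman_allocation(strata, k):
        # kj = k * nj * Sj / sum_j nj * Sj
        weights = [len(s) * statistics.pstdev(s) for s in strata]
        total = sum(weights)
        return [k * w / total for w in weights]    # fractional; round as needed

    strata = [[1, 2, 2, 3],                 # small, low spread
              [100, 150, 90, 200, 120],     # small, high spread
              [10] * 50]                    # large, zero spread
    print(neyman_allocation(strata, 30))
    # The zero-variance stratum needs no samples; the high-spread one gets most.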

13
Classical Sampling Techniques
  • Multivariate Stratified Sampling
  • Weighted Sampling
  • Error Estimation and Confidence Intervals

14
7. Special Case: Fixed Workload
  • Problem FIXEDSAMP
  • Input: R, W, k
  • Output: a sample of k records (with appropriate
    additional columns) such that MSE(W) is minimized

15
Fundamental Regions
  • Fundamental Regions: For a given relation R and
    workload W, consider partitioning the records in R
    into a minimum number of regions R1, R2, ..., Rr such
    that for any region Rj, each query in W selects
    either all records in Rj or none
    (see the sketch below)
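One straightforward way to compute fundamental regions (an illustrative sketch, not the paper's implementation): label every record with the subset of workload queries that select it and group records by that label. The record fields and predicates below are hypothetical.

    from collections import defaultdict

    def fundamental_regions(records, workload):
        # workload: list of predicates; predicate(record) is True iff the query selects it.
        regions = defaultdict(list)
        for rec in records:
            signature = tuple(pred(rec) for pred in workload)   # which queries select rec
            regions[signature].append(rec)
        return list(regions.values())    # minimal partition R1, ..., Rr

    records = [{"productid": p, "revenue": r}
               for p, r in [(1, 10), (2, 5), (3, 7), (3, 9), (4, 2)]]
    workload = [lambda rec: rec["productid"] in (3, 4),   # Q1
                lambda rec: rec["productid"] <= 3]        # Q2
    for region in fundamental_regions(records, workload):
        print(region)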

16
Solution for FIXEDSAMP
  • Step 1. Identify fundamental regions
  • Case A. r < k
  • Case B. r > k
  • Step 2. Pick sample records
  • Step 3. Assign values to additional columns

17
8. Lifting Workload to Query Distributions
  • Resilient to the situation where the incoming query
    is similar, but not identical, to queries in the
    workload
  • pW: the lifted workload, a probability distribution
    over queries
  • pW(Q): related to the amount of similarity of Q to
    the workload
  • Not concerned with syntactic similarity of query
    expressions

18
Lifted workload (Cont.)
  • Two parameters δ (½ ≤ δ ≤ 1) and γ (0 ≤ γ ≤ ½)
    define the degree to which the workload influences
    the query distribution. For any given record inside
    (resp. outside) RQ, the parameter δ (resp. γ)
    represents the probability that an incoming query
    will select this record.
  • pQ(R') is the probability of occurrence of any query
    that selects exactly the set of records R'.
19
Lifted workload (Cont.)
  • n1, n2, n3, and n4 are the counts of records in the
    four regions induced by RQ and the selected set R'.
  • If n2 or n4 is large (large overlap), pQ(R') is high
  • If n1 or n3 is large (small overlap), pQ(R') is low
  • We elaborate on this issue by analyzing the effects
    of four boundary settings of these parameters
    (see the sketch after this list):
  • 1. δ → 1 and γ → 0 implies that incoming queries are
    identical to workload queries.
  • 2. δ → 1 and γ → ½ implies that incoming queries are
    supersets of workload queries.
  • 3. δ → ½ and γ → 0 implies that incoming queries are
    subsets of workload queries.
  • 4. δ → ½ and γ → ½ implies that incoming queries are
    unrestricted.
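The sketch below assumes that records are selected independently with the per-record probabilities stated above, which gives the product form pQ(R') = δ^n2 · (1-δ)^n1 · γ^n3 · (1-γ)^n4; the record sets and parameter values are made up, and the point is only to illustrate the boundary behavior listed above.

    def lifted_probability(RQ, Rprime, universe, delta=0.9, gamma=0.1):
        # Assumed form: each record inside RQ is selected with probability delta,
        # each record outside RQ with probability gamma, independently.
        RQ, Rprime, universe = set(RQ), set(Rprime), set(universe)
        n1 = len(RQ - Rprime)              # inside RQ, not selected by the incoming query
        n2 = len(RQ & Rprime)              # inside RQ, selected (overlap)
        n3 = len(Rprime - RQ)              # outside RQ, selected
        n4 = len(universe - RQ - Rprime)   # outside RQ, not selected
        return delta ** n2 * (1 - delta) ** n1 * gamma ** n3 * (1 - gamma) ** n4

    universe = range(10)
    RQ = {0, 1, 2, 3}
    print(lifted_probability(RQ, {0, 1, 2, 3}, universe))   # identical query: highest
    print(lifted_probability(RQ, {0, 1}, universe))         # subset of RQ
    print(lifted_probability(RQ, {7, 8, 9}, universe))      # disjoint query: lowest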

20
9. Rationale for Stratified Sampling
Consider a population, i.e. a set of numbers R =
{y1, ..., yn}. Let the average be y, the sum be Y, and
the variance be S². Suppose we uniformly sample k
numbers, and let the mean of the sample be µ. The
quantity µ is an unbiased estimator for y, i.e.
E[µ] = y; the variance (i.e., squared error) in
estimating y is E[(µ - y)²] = S²/k.
21
Stratified Sampling (Cont.)
Query Q1: SELECT COUNT(*) FROM R WHERE PRODUCTID
IN (3,4). Population POP_Q1 = {0, 0, 1, 1} (see the
sketch below).
Thus, a stratified sampling scheme partitions R into r
strata containing n1, ..., nr records (where Σ nj = n),
with k1, ..., kr records uniformly sampled from each
stratum (where Σ kj = k).
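A tiny illustration of how Q1 gives rise to the 0/1 population POP_Q1 (toy data, my own sketch): the exact COUNT is just the sum of this derived population, so the sampling results above apply directly to COUNT queries.

    # Toy records of R, keyed by PRODUCTID.
    R = [{"productid": 1}, {"productid": 2}, {"productid": 3}, {"productid": 4}]

    # POP_Q1: 1 for every record selected by Q1, 0 otherwise.
    pop_q1 = [1 if rec["productid"] in (3, 4) else 0 for rec in R]
    print(pop_q1)        # [0, 0, 1, 1]
    print(sum(pop_q1))   # exact answer of SELECT COUNT(*) ... PRODUCTID IN (3,4)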
22
10. Solution for Single-Table Selection Queries with
Aggregation
  • Stratification
  • (a) how many strata r to partition relation R into
  • (b) which records of R belong to each stratum
  • Allocation
  • how to divide k (the number of records available for
    the sample) into integers k1, ..., kr across the r
    strata such that Σ kj = k
  • Sampling
  • uniformly sample kj records from each stratum Rj to
    form the final sample of k records

23
Solution for COUNT aggregate
  • Stratification: follows from Lemma 1.
  • Lemma 1: For a workload W consisting of COUNT
    queries, the fundamental regions represent an optimal
    stratification.
  • Allocation: we want to minimize the error over
    queries drawn from pW.
  • k1, ..., kr are unknown variables such that Σ kj = k.
  • From Equation (2) on an earlier slide, MSE(pW) can be
    expressed as a weighted sum of the MSE of each query
    in the workload:
  • Lemma 2: MSE(pW) = Σi wi·MSE(pQi)

24
Allocation (cont.)
  • For any Q ∈ W, we express MSE(pQ) as a function of
    the kj's
  • Lemma 3: For a COUNT query Q in W, MSE(pQ) is closely
    approximated by a closed-form expression
    ApproxMSE(pQ) in the kj's (formula shown on the
    slide).

25
Outline of Proof
  • Since we have an (approximate) formula for MSE(pQ),
    we can express MSE(pW) as a function of the kj
    variables.
  • Corollary 1: MSE(pW) = Σj (αj / kj), where each αj is
    a function of n1, ..., nr, δ, and γ.
  • αj captures the importance of a region: it is
    positively correlated with nj as well as with the
    frequency of queries in the workload that access Rj.
  • Now we can minimize MSE(pW).
  • Lemma 4: Σj (αj / kj) is minimized subject to
    Σj kj = k if kj = k·(√αj / Σi √αi)
  • This provides a closed-form and computationally
    inexpensive solution to the allocation problem, since
    αj depends only on δ, γ, and the number of tuples in
    each fundamental region (see the sketch below).
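Given the αj values, the closed form of Lemma 4 is cheap to apply. The sketch below assumes the αj have already been computed and performs only the allocation step, with naive rounding to integers (the paper treats integrality more carefully); the αj values shown are made up.

    import math

    def allocate(alphas, k):
        # kj = k * sqrt(alpha_j) / sum_i sqrt(alpha_i), minimizing sum_j alpha_j / kj
        roots = [math.sqrt(a) for a in alphas]
        total = sum(roots)
        return [max(1, round(k * r / total)) for r in roots]   # naive integer rounding

    alphas = [400.0, 100.0, 25.0, 1.0]    # hypothetical per-region weights
    print(allocate(alphas, 100))          # larger alpha_j -> more samples for region j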

26
Solution for SUM aggregate
  • Stratification
  • Bucketing technique
  • We further divide fundamental regions with large
    variance into a set of finer regions (buckets), each
    of which has significantly lower internal variance.
  • Each finer region is treated as a stratum.
  • Following the optimal Neyman allocation technique, we
    obtain h·r finer strata.
  • A large h is desirable, but in practice h is set to 6
    (see the sketch below).
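A simple sketch of the bucketing step, using equal-size splits of the sorted values purely for illustration (the paper chooses bucket boundaries more carefully); the data are made up.

    import statistics

    def bucket_region(values, h=6):
        # Split one high-variance fundamental region into h finer strata
        # of (nearly) equal size over the sorted values.
        vals = sorted(values)
        size = -(-len(vals) // h)          # ceiling division
        return [vals[i:i + size] for i in range(0, len(vals), size)]

    region = [1, 2, 2, 3, 5, 8, 13, 100, 250, 900, 1200, 5000]
    print("variance of the whole region:", statistics.pvariance(region))
    for bucket in bucket_region(region):
        print(bucket, "-> variance:", statistics.pvariance(bucket))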

27
Cont.
  • Allocation
  • As with COUNT, we express an optimization problem
    with h·r unknowns k1, ..., khr.
  • Unlike COUNT, the specific values of the aggregate
    column in each region (as well as the variance of
    values within each region) influence MSE(pQ).
  • Let yj (Yj) be the average (sum) of the
    aggregate-column values of all records in region Rj.
    Since the variance within each region is small, each
    value within the region can be approximated as simply
    yj. This lets us express MSE(pQ) as a function of the
    kj's for a SUM query Q in W.

28
Pragmatic Issues
  • Identifying Fundamental Regions
  • Handling Large Number of Fundamental Regions
  • Obtaining Integer Solutions
  • Obtaining an Unbiased Estimator

29
Putting It All Together
30
11. Extensions
  • GROUP BY
  • JOIN
  • Other Extensions

31
12. Comparisons
  • Weighted Sampling
  • Records that are accessed more frequently have a
    greater chance of being included in the sample
  • Assumes a fixed workload
  • Outlier Indexing
  • Outliers form their own stratum, which is sampled in
    its entirety
  • Assumes a fixed workload

32
Comparisons (cont.)
  • Congressional Sampling
  • Allocation of samples between two strata
  • To minimize MSE, samples are allocated as shown on
    the slide

33
13. Experimental Results
  • Previous work compared against:
  • USAMP: uniform random sampling
  • WSAMP: weighted sampling
  • OTLIDX: outlier indexing combined with weighted
    sampling
  • CONG: congressional sampling

34
Experimental Setup
  • Databases: used the popular TPC-R benchmark for the
    experiments
  • Workloads: generated several workloads over the TPC-R
    schema using an automatic query generation program
  • Parameters varied:
  • skew of the data
  • sampling fraction, between 0.1% and 10%
  • workload size, between 25 and 800 queries
  • Error metric: report the average error over all
    queries in the workload

35
Training Set vs Test Set
  • The basic idea is to split the available workload
    into two sets:
  • the training workload and
  • the test workload
  • Training Set: the workload used to determine the
    sample
  • Test Set: the workload used to estimate the error
    (see the sketch below)
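A minimal sketch of this protocol (my own helper, not from the paper): shuffle the available workload and hold out part of it as the test set.

    import random

    def split_workload(workload, train_fraction=0.5, seed=0):
        # Returns (training workload used to pick the sample,
        #          test workload used to estimate the error).
        shuffled = workload[:]
        random.Random(seed).shuffle(shuffled)
        cut = int(len(shuffled) * train_fraction)
        return shuffled[:cut], shuffled[cut:]

    queries = [f"Q{i}" for i in range(100)]    # placeholder query identifiers
    train, test = split_workload(queries)
    print(len(train), "training queries,", len(test), "test queries")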

36
Results: Quality vs Sampling Fraction
(charts, continued through slide 39)
40
Quality vs Overlap between Training Set and Test
Set
41
Quality vs Data Skew
(charts, continued through slide 43)
44
14. Summary
  • A comprehensive solution to the problem of
    identifying samples for approximately answering
    aggregation queries
  • Its implementation on a database system
  • With a novel technique for lifting a workload, we
    make our solution robust enough to work well even
    for workloads that are similar but not identical
    to the given workload.
  • Handles the problems of data variance,
    heterogeneous mixes of queries, GROUP BY and
    foreign-key joins.

45
15. References
  • Surajit Chaudhuri, Gautam Das, Vivek Narasayya. A
    Robust, Optimization-Based Approach for Approximate
    Answering of Aggregate Queries. SIGMOD Conference
    2001.
  • Surajit Chaudhuri, Gautam Das, Vivek Narasayya.
    Optimized Stratified Sampling for Approximate Query
    Processing. ACM Transactions on Database Systems
    (TODS), 32(2): Article 9, 2007.

46
  • Thank You
  • Questions ?
  • Presented By
  • Vivek Tanneeru
  • Venkata Jammula