A Robust, OptimizationBased Approach for Approximate Answering of Aggregate Queries - PowerPoint PPT Presentation

1 / 46

About This Presentation

Title:

A Robust, OptimizationBased Approach for Approximate Answering of Aggregate Queries

Description:

... 1) and ? (0 ? ) define the degree to which the workload 'influences' the query distribution. ... Consider a population, i.e. a set of numbers R = {y1,.,yn} ... – PowerPoint PPT presentation

Number of Views:51

Avg rating:3.0/5.0

Slides: 47

Provided by: Din22

Learn more at: http://crystal.uta.edu

Category:

more less

Transcript and Presenter's Notes

Title: A Robust, OptimizationBased Approach for Approximate Answering of Aggregate Queries

1
A Robust, Optimization-Based Approach for
Approximate Answering of Aggregate Queries

Surajit Chaudhuri
Gautam Das
Vivek Narasayya
Presented By
Vivek Tanneeru
Venkata Dinesh Jammula

2
Outline

Introduction
Objective
Drawbacks of Previous work
Related Work
Architecture for Approximate Query Processing
Classical Sampling Techniques
Special Case of a Fixed Load
Lifting Workload to Query Distributions
Relational for Stratified Sampling
Solution for Single-Table Selection Queries with
Aggregation
Extensions for General Work Load
Comparisons
Experimental Results
Summary
References

3
1. Introduction

Decision Support applications - OLAP and data
mining for analyzing large databases
Approximate answers to queries given accurately
and efficiently benefit the scalability of these
applications
Workload information in picking samples of the
data

4
2. Objective

Pre-compute a sample as an optimization problem
Minimize error in estimation of aggregates
Implemented on Microsoft SQL Server 2000, for an
effective solution to be deployed in Commercial
DBMS

5
3. Drawbacks of Previous work

Lack of rigorous problem formulations lead to
solutions that are difficult to evaluate
theoretically
Does not deal with uncertainty in expected
workload
Ignores the variance in data distribution of
aggregated columns

6
4. Related Work

Weighted Sampling
Outlier Index
Congressional Sampling
On the fly Sampling
Histograms

7
5. Architecture for Approximate Query
Processing

Preliminaries
Consider Queries with selections, foreign-key
joins and GROUP BY, containing aggregation
functions such as COUNT, SUM and AVG.
Assume a pre-designated amount of storage space
is available for selecting samples from the
database
Selecting samples can be randomized or
deterministic

8
Architecture
9
Error Metrics

If correct answer for query Q is y while
approximate answer is y
Relative error E(Q) y - y / y
Squared error SE(Q) (y - y / y)²
If correct answer for the ith group is yi while
approximate answer is yi
Squared error in answering a GROUP BY query Q
SE(Q) (1/g) Si ((yi yi)/ yi)²
Given a probability distribution of queries pw
Mean squared error for the distribution
MSE(pw) SQ pw(Q)SE(Q), (where pw(Q) is
probability of query Q)
Root mean squared error (L2)
RMSE(pw) vMSE(pw)
Other error metrics
L1 metric the expected relative error over all
queries in workload
L8 metric the max error over all queries

10
6. Classical Sampling Techniques

Uniform Sampling
LEMMA 1
(a) µ is an unbiased estimator for y, namely,
Eµ y
(b) µ n is an unbiased estimator for Y namely
Eµ n Y
(c) the variance (or standard error) in
estimating y is
E(µ- y) 2 S2/k
(d) the variance in estimating Y is
E(µn-Y ) 2 n2S2/k and
(e) the relative squared error in estimating Y is
E((µn - Y )/Y ) 2 n2S2/Y2k.

11
Classical Sampling Techniques

Stratified Sampling
LEMMA 2
(a) µ is an unbiased estimator for y, namely,
Eµ y
(b) µ n is an unbiased estimator for Y,
namely,
Eµ n Y
(c) the variance in estimating y is
E(µ - y) 2 1/ n2 ?j nj2 Sj2/ kj
(d) the variance in estimating Y is
E(µ n-Y ) 2 ? j nj2 Sj2 / kj and
(e) the relative squared errorin estimating Y is
E((µ n - Y )/Y ) 2 1/ Y2 ? j nj2 Sj2
/kj .

12
Classical Sampling Techniques

Neyman Allocation
LEMMA 3
Given a population R y1, . . . , yn, k and r,
the optimal way to form r strata and allocate k
samples among all strata is to first sort R and
select strata boundaries so that ? j n j S j is
minimized, and then, for the j th strata, to set
the number of samples k j as
k j k(n j S j / ? j n j S j )

13
Classical Sampling Techniques

Multivariate Stratified Sampling
Weighted Sampling
Error Estimation and Confidence Intervals

14
7. Special Case Fixed Workload

Problem FIXEDSAMP
Input R, W, k
Output A sample of k records (with appropriate
additional columns) such that MSE(W) is
minimized.

15
Fundamental Regions

Fundamental Regions For a given relation R and
workload W, consider partitioning the records in
R into a minimum number of regions R1, R2, , Rr
such that for any region Rj, each query in W
selects either all records in Rj or none.

16
Solution for FIXEDSAMP

Step 1. Identify Fundamental Regions
Case A. r lt k
Case B. r gt k
Step 2 Pick Sample Records
Step 3 Assign values to additional columns

17
8. Lifting Workload to Query Distributions

Resilient to the situation when incoming query is
similar but not identical to queries in the
workload
Pw lifted workload, probability distribution
Pw (Q) Related to the amount of similarity of
Q to the workload
Not concerned with syntactic similarity of query
expressions

18
Lifted workload (Cont.)

Two parameters d (½ d 1) and ? (0 ? ½)
define the degree to which the workload
influences the query distribution. For any
given record inside (resp. outside) RQ, the
parameter d (resp. ?) represents the probability
that an incoming query will select this record.
PQ(R) is the probability of occurrence of any
query that selects exactly the set of records R.

19
Lifted workload (Cont.)

n1 , n2, n3, and n4 are the counts of records in
the regions.
n2 or n4 large (large overlap), PQ(R) is
high
n1 or n3 large (small overlap), PQ(R) is
low
We elaborate on this issue by analyzing the
effects of (four) different
boundary settings of these parameters.
1. d ? 1 and ? ? 0 implies that incoming
queries are identical
to workload queries.
2. d ? 1 and ? ? ½ implies that incoming
queries are
supersets of workload queries.
3. d ? ½ and ? ? 0 implies that incoming
queries are subsets
of workload queries.
4. d ? ½ and ? ? ½ implies that incoming
queries are
unrestricted.

20
9. Rationale for Stratified Sampling
Consider a population, i.e. a set of numbers R
y1,.,yn. Let the average be y, the sum be Y
and the variance be S2. Suppose we uniformly
sample k numbers. Let the mean of the sample be
µ. The quantity µ is an unbiased estimator for
y, i.e. Eµ y the variance (i.e., squared
error) in estimating y is E(µ-y) 2 S2/k.
21
Stratified Sampling (Cont )
Query Q1 SELECT COUNT() FROM R WHERE PRODUCTID
IN (3,4) Population POPQ1 0,0,1,1
Thus, a stratified sampling scheme partitions R
into r strata containing n1, ., nr records (where
Snj n), with k1, , kr records uniformly
sampled from each stratum (where Skj k).
22
10. Solution for single-table selection queries
with Aggregation

Stratification
a.) How many strata r to partition relation R
into,
b.) Records from R that belong to each strata
Allocation
how to divide k( the number of records available
for the sample) into integers k1, , kr across r
strata such that Skj k
Sampling
uniformly samples kj records from stratum Rj to
form the final sample of k records

23
Solution for COUNT aggregate

Stratification From Lemma 1.
Lemma 1 For a workload W consisting of COUNT
queries, the fundamental regions represent an
optimal stratification.
Allocation We want to minimize the error over
queries in pw .
k1, kr are unknown variables such that Skj
k.
From Equation (2) on earlier slide,
MSE(pW) can be expressed as a weighted sum of the
MSE of each query in the workload
Lemma 2 MSE(pW) Si wi MSE(pQ)

24
Allocation (cont)

For any Q e W, we express MSE(pQ) as a function
of the kjs
Lemma 3 For a COUNT query Q in W,
Let ApproxMSE(pQ)
Then,

25
Outline of Proof

Since we have an (approximate) formula for
MSE(pQ), we can express MSE(pw) as a function
of the kjs variables.
Corollary 1 MSE(pw) Sj(aj / kj), where each
aj is a function of n1,,nr, d, and ?.
aj captures the importance of a region it is
positively correlated with nj as well as the
frequency of queries in the workload that access
Rj.
Now we can minimize MSE(pw).
Lemma 4 Sj (aj / kj) is minimized subject to Sj
kj k
if kj k ( sqrt(aj) / Si sqrt(ai) )
This provides a closed-form and computationally
inexpensive solution to the allocation problem
since aj depends only on d, ? and the number of
tuples in each fundamental region.

26
Solution for SUM aggregate

Stratification
Bucketing Technique
We further divide fundamental regions with large
variance into a set of finer regions, each of
which has significantly lower internal variance.
Treat each region as strata
From optimal Neyman Allocation Technique,
We have hr finer strata
Good to have a large h, but h is set to value 6.

27
Cont

Allocation
Like COUNT, we express an optimization problem
with hr unknowns k1,, khr.
Unlike COUNT, the specific values of the
aggregate column in each region (as well as the
variance of values in each region) influence
MSE(pQ).
Let yj(Yj) be the average (sum) of the aggregate
column values of all records in region Rj. Since
the variance within each region is small, each
value within the region can be approximated as
simply yj. Thus to express MSE(pQ) as a
function of the kjs for a SUM query Q in W

28
Pragmatic Issues

Identifying Fundamental Regions
Handling Large Number of Fundamental Regions
Obtaining Integer Solutions
Obtaining an Unbiased Estimator

29
Putting all together
30
11. Extensions

GROUP BY
JOIN
Other Extensions

31
12. Comparisons

Weighted Sampling
Records that are accessed more frequently have a
greater chance of being included into the sample
Assumes fixed workload
Outlier Indexing
Form their own stratum that is sampled in its
entirety
Assumes fixed workload

32
Comparisons (cont)

Congressional Sampling
Allocation of samples between two strata
To minimize MSE,

33
13. Experimental Results

PREVIOUS WORKS
USAMP uniform random sampling
WSAMP weighted sampling
OTLIDX outlier indexing combined with weighted
sampling
CONG Congressional sampling

34
Experimental Setup

Databases Used the popular TPC-R benchmark for
experiments
Workloads Generated several workloads over TCP-R
schema using an automatic query generation
program
Parameters Varied the parameters like,
Skew of the data
Sampling fraction between 0.1 - 10
Workload size was varied between 25 - 800 queries
Error Metric Report the average error over all
queries in the workload

35
Training Set vs Test Set

The basic idea is to split the available
workload into two sets
the training workload and
the test workload
Training Set The workload used to determine the
sample
Test Set The workload used to estimate the error

36
Results Quality vs Sampling Fraction
37
Cont
38
Cont
39
Cont
40
Quality vs Overlap between Training Set and Test
Set
41
Quality vs Data Skew
42
Cont
43
Cont
44
14. Summary

A comprehensive solution to the problem of
identifying samples for approximately answering
aggregation queries
Its implementation on a database system
With a novel technique for lifting a workload, we
make our solution robust enough to work well even
for workloads that are similar but not identical
to the given workload.
Handles the problems of data variance,
heterogeneous mixes of queries, GROUP BY and
foreign-key joins.

45
15. References

Surajit Chaudhuri, Gautam Das, Vivek Narasayya A
Robust, Optimization-Based Approach for
Approximate Answering of Aggregate Queries.
SIGMOD Conference 2001.
Surajit Chaudhuri, Gautam Das, Vivek Narasayya.
Optimized Stratified Sampling for Approximate
Query Processing. ACM Transactions on Database
Systems (TODS), 32(2) 9 (2007)