Reading Report 6, Yin Chen, 5 Mar 2004

1
Reading Report 6
Yin Chen, 5 Mar 2004
  • Reference
  • A Robust, Optimization-Based Approach for
    Approximate Answering of Aggregate Queries,
    Surajit Chaudhuri, Gautam Das, Vivek Narasayya,
    Microsoft Research, One Microsoft Way.
  • http://portal.acm.org/citation.cfm?id=375694&dl=ACM&coll=portal

2
Problem Overview
  • Decision support applications such as On-Line
    Analytical Processing (OLAP) and data mining for
    analyzing LARGE databases have become popular.
  • But they are expensive and resource intensive.
  • This work uses precomputed samples of the data
    instead of the complete data to answer the
    queries, giving approximate answers efficiently.
  • 3 drawbacks of previous works
  • Lack of formal problem formulation, thus difficult
    to evaluate theoretically
  • Do NOT deal with uncertainty (incoming queries
    may differ from the given workload)
  • Ignore the variance in the data distribution of
    the aggregated column(s)

3
Related work
  • Some works are based on randomized techniques;
    they assume a fixed workload and do NOT cope
    with uncertainty.
  • Each record is tagged with a frequency.
  • An expected number of k records are selected for
    the sample, where the probability of selecting a
    record t with frequency ft is k·(ft / Σu fu).
  • Records that are accessed more frequently have a
    greater chance of being included inside the
    sample.
  • BUT this has poor quality. E.g., consider a set of
    queries where a few queries reference large
    partitions and most queries reference very small
    partitions. Under the weighted sampling scheme,
    most records will come from the large partitions.
    Thus, with high probability, there will be no
    records selected from many of the small
    partitions, causing large error.
  • An IMPROVEMENT collects the outliers of the data
    (i.e., the records that contribute to high
    variance) into a separate index, while the
    remaining data is sampled using a weighted
    sampling technique. Queries are answered by
    running them against both the outlier index and
    the weighted sample, and an estimated answer is
    composed out of both results.
  • Some other works use on-the-fly sampling, which
    can be expensive.
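
The weighted sampling scheme described above can be sketched in a few lines. This is a minimal illustration (function and variable names are my own, not the paper's): each record is included independently with probability k·ft/Σu fu, so heavily queried partitions dominate the sample.

```python
import random

def weighted_sample(records, freqs, k, rng=None):
    """Pick an expected k records: record t with frequency f_t is
    included independently with probability k * f_t / sum(f_u),
    capped at 1."""
    rng = rng or random.Random(0)
    total = sum(freqs)
    return [r for r, f in zip(records, freqs)
            if rng.random() < min(1.0, k * f / total)]

# 90 frequently accessed ("hot") records and 10 rarely accessed ones:
# with high probability the cold partition contributes nothing,
# which is exactly the quality problem noted above.
records = list(range(100))
freqs = [10] * 90 + [1] * 10
sample = weighted_sample(records, freqs, 20)
```

Running this repeatedly shows the cold records (90..99) are each included with probability only about 0.02, so most runs miss the small partition entirely.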

4
Architecture
  • Workload
  • A workload W is specified as a set of pairs of
    queries and their corresponding weights, i.e.,
    W = {<Q1, w1>, ..., <Qq, wq>}
  • Weight wi indicates the importance of query Qi in
    the workload.
  • Without loss of generality, assume the weights
    are normalized, i.e., Σi wi = 1
  • Architecture
  • Inputs: a database and a workload W
  • 2 components
  • An offline component for selecting a sample
  • An online component that
  • Rewrites an incoming query to use the sample to
    answer the query approximately.
  • Reports the answer with an estimate of the error
    in the answer.
  • ScaleFactor
  • Each record in the sample contains an additional
    column, ScaleFactor.
  • The value of the aggregated column of each record
    in the sample is first scaled up by multiplying
    by the ScaleFactor, and then aggregated.
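
A minimal sketch of ScaleFactor-based answering (names are illustrative, not from the paper): for a uniform sample of k out of n records, each sample record carries ScaleFactor = n/k, and the rewritten query aggregates the scaled values.

```python
def approx_sum(sample, pred):
    """SUM(C): scale each selected sample value up by its
    ScaleFactor, then aggregate."""
    return sum(rec["C"] * rec["ScaleFactor"] for rec in sample if pred(rec))

def approx_count(sample, pred):
    """COUNT(*): sum the ScaleFactor entries of the selected records."""
    return sum(rec["ScaleFactor"] for rec in sample if pred(rec))

# A 3-record uniform sample of a 9-record relation: ScaleFactor = 9/3.
n, k = 9, 3
sample = [{"C": c, "ScaleFactor": n / k} for c in (20, 50, 80)]
est_sum = approx_sum(sample, lambda r: True)      # (20+50+80) * 3 = 450
est_count = approx_count(sample, lambda r: True)  # 3 * 3 = 9
```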

5
Architecture (Cont.)
  • Error metrics are used to determine the quality of
    an approximate answer to an aggregation query.
  • Suppose the correct answer for a query Q is y
    while the approximate answer is y'
  • Relative error: E(Q) = |y - y'| / |y|
  • Squared error: SE(Q) = ((y - y') / y)²
  • Suppose the correct answer for the ith group is
    yi while the approximate answer is yi'
  • Squared error in answering a GROUP BY query Q:
    SE(Q) = (1/g) Σi ((yi - yi') / yi)²
  • Given a probability distribution of queries pW
  • Mean squared error for the distribution:
    MSE(pW) = ΣQ pW(Q)·SE(Q), where pW(Q) is the
    probability of query Q
  • Root mean squared error: RMSE(pW) = sqrt(MSE(pW))
  • Other error metrics
  • L1 metric: the mean error over all queries in
    the workload
  • L∞ metric: the maximum error over all queries
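
The error metrics above translate directly into code. A small sketch (my own helper names), following the definitions term by term:

```python
def relative_error(y, y_hat):
    """E(Q) = |y - y'| / |y|"""
    return abs(y - y_hat) / abs(y)

def squared_error(y, y_hat):
    """SE(Q) = ((y - y') / y)**2"""
    return ((y - y_hat) / y) ** 2

def groupby_squared_error(ys, y_hats):
    """GROUP BY with g groups: (1/g) * sum_i ((y_i - y_i') / y_i)**2"""
    g = len(ys)
    return sum(((y, yh)[0] and ((y - yh) / y) ** 2) for y, yh in zip(ys, y_hats)) / g

def mse(probs, sq_errors):
    """MSE(p_W) = sum_Q p_W(Q) * SE(Q); RMSE is its square root."""
    return sum(p * se for p, se in zip(probs, sq_errors))

e = relative_error(100, 90)                        # 0.1
se = squared_error(100, 90)                        # 0.01
gse = groupby_squared_error([100, 200], [90, 220]) # (0.01 + 0.01) / 2
m = mse([0.5, 0.5], [0.01, 0.04])                  # 0.025
rmse = m ** 0.5
```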

6
The Special Case of a Fixed Workload
  • Overview
  • Here we provide a solution for the special case of
    a fixed workload, i.e., when the incoming queries
    are identical to the given workload.
  • Use an effective deterministic scheme rather than
    the conventional randomization scheme.
  • Problem Formulation
  • Problem FIXEDSAMP
  • Input: R, W, k
  • Output: A sample of k records (with
    appropriate additional columns) such that MSE(W)
    is minimized
  • MSE(W) = MSE(pW) (mean squared error), where a
    query Q has a probability of occurrence of 1 if
    Q ∈ W and 0 otherwise.
  • Fundamental Regions
  • For a given relation R and workload W, consider
    partitioning the records in R into a minimum
    number of regions F = {R1, R2, ..., Rr} such that
    for any region Rj, each query in W selects either
    all records in Rj or none. These regions are the
    fundamental regions of R induced by W.
  • E.g., consider a relation R (with aggregate column
    C) containing nine records (with C values 10, 20,
    ..., 90). Let W consist of two queries, Q1 (which
    selects records with C values between 10 and 50)
    and Q2 (which selects records with C values
    between 40 and 70). These two queries induce a
    partition of R into four fundamental regions, R1,
    R2, R3, R4.
  • In general, the total number of fundamental
    regions r depends on R and W and is upper-bounded
    by min(2^|W|, n), where n is the number of
    records in R.
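
Fundamental regions can be computed by grouping records on the exact subset of workload queries that select them. A sketch (illustrative names), using the nine-record example from this slide:

```python
def fundamental_regions(records, queries):
    """Group records by the exact subset of workload queries that
    select them; each distinct signature is one fundamental region."""
    regions = {}
    for rec in records:
        signature = tuple(q(rec) for q in queries)
        regions.setdefault(signature, []).append(rec)
    return list(regions.values())

# C values 10..90; Q1 selects C in [10, 50], Q2 selects C in [40, 70].
records = [10, 20, 30, 40, 50, 60, 70, 80, 90]
Q1 = lambda c: 10 <= c <= 50
Q2 = lambda c: 40 <= c <= 70
regions = fundamental_regions(records, [Q1, Q2])
# Four regions: {10,20,30}, {40,50}, {60,70}, {80,90}
```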

7
The Special Case of a Fixed Workload (Cont.)
  • Solutions for FIXEDSAMP
  • A deterministic algorithm called FIXED
  • Step1 (Identify Fundamental Regions) Let r be
    the number of fundamental regions.
  • Case A: r ≤ k (the selected sample can answer
    queries WITHOUT any errors)
  • Step 2A (Pick Sample Records) Pick exactly one
    record from each fundamental region.
  • Step 3A (Assign Values to Additional Columns)
    The idea is that each sample record can be used
    to summarize all records from the corresponding
    fundamental region, WITHOUT incurring any error.
  • For a workload consisting of ONLY COUNT queries,
    store the count of the number of records in that
    fundamental region in a SINGLE additional column
    in the sample records (called RegionCount).
  • For a workload consisting of ONLY SUM queries,
    store the sum of the values in the aggregate
    column for records in that fundamental region in
    the SINGLE additional column in the sample
    (called AggSum).
  • For a workload containing a MIX of COUNT and SUM
    queries, we need BOTH the RegionCount and AggSum
    columns, which can also answer AVG queries.
  • Case B: r > k (select a sample that tries to
    minimize the errors in queries)
  • Step 2B (Pick Sample Records)
  • Sort all r regions by their importance, select
    the TOP k, and then pick up one record from each
    of the selected regions.
  • The importance of region Rj is defined as fj ·
    nj², where fj is the sum of the weights of all
    queries in W that select the region and nj is the
    number of records in the region. fj measures the
    weights of queries that are affected by Rj, while
    nj² measures the effect on the (squared) error of
    not including Rj.
  • Step 3B (Assign Values to Additional Columns)
  • Note that the extra column values of a sample are
    NOT required to characterize the corresponding
    fundamental region; all we care about is that they
    contain appropriate values so that the error for
    the workload is minimized.
  • To assign values to the RegionCount and AggSum
    columns of the k selected sample records, express
    MSE(W) as a quadratic function of 2k unknowns
    RC1, ..., RCk and AS1, ..., ASk, partially
    differentiate with respect to each variable, and
    set each result to zero. This gives rise to 2k
    simultaneous (sparse) linear equations, which can
    be solved by using an iterative technique.
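
The sampling steps of FIXED can be sketched as follows (names are my own; Step 3B's least-squares assignment of the extra columns in Case B is omitted). Regions are represented as lists of aggregate-column values, using the four fundamental regions from the earlier nine-record example:

```python
def fixed_sample(regions, region_weight, k):
    """Sketch of FIXED. `regions` is a list of lists of
    aggregate-column values; `region_weight(j)` returns f_j, the
    summed weight of workload queries that select region j."""
    if len(regions) <= k:
        # Case A: one record per region answers every query exactly.
        chosen = range(len(regions))
    else:
        # Case B: keep the top k regions by importance f_j * n_j**2.
        chosen = sorted(range(len(regions)),
                        key=lambda j: region_weight(j) * len(regions[j]) ** 2,
                        reverse=True)[:k]
    return [{"value": regions[j][0],
             "RegionCount": len(regions[j]),   # answers COUNT queries
             "AggSum": sum(regions[j])}        # answers SUM queries
            for j in chosen]

regions = [[10, 20, 30], [40, 50], [60, 70], [80, 90]]
sample = fixed_sample(regions, lambda j: 0.5, k=4)   # Case A: r = 4 <= k
# Q1 (C between 10 and 50) covers regions 0 and 1, so its answers are exact:
count_q1 = sample[0]["RegionCount"] + sample[1]["RegionCount"]   # 3 + 2 = 5
sum_q1 = sample[0]["AggSum"] + sample[1]["AggSum"]               # 60 + 90 = 150
```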

8
The Non-Special Case with a Lifted Workload
  • Overview
  • Here we provide a solution for the non-special
    case, in which an incoming query is similar but
    NOT identical to the workload queries.
  • The problem focuses on SINGLE-TABLE selection
    queries with aggregation containing either the
    SUM or COUNT aggregate. A workload W consists of
    exactly ONE query Q on relation R.
  • Problem Formulation
  • Problem SAMP
  • Input: R, pW (a probability distribution
    function specified by W), and k
  • Output: A sample of k records (with the
    appropriate additional column(s)) such that
    MSE(pW) is minimized.
  • lifted workload
  • For a given W, define a lifted workload pW, i.e.,
    a probability distribution of incoming queries.
    Intuitively, for any query Q' (not necessarily in
    W), pW(Q') should be related to the amount of
    similarity (dissimilarity) of Q' to the workload:
    high if Q' is similar to queries in the workload,
    and low otherwise.
  • We say that two queries Q and Q' are similar if
    the records selected by Q and Q' have significant
    overlap.
  • The objective is to define the distribution pQ.
  • For the purposes of lifting, we are only concerned
    with the set of records selected by a query and
    NOT the query itself. Thus, instead of mapping
    queries to probabilities, pQ maps subsets of R to
    probabilities.

9
The Non-Special Case with a Lifted Workload
  • lifted workload (Cont.)
  • Assume two parameters δ (½ ≤ δ ≤ 1) and γ (0 ≤ γ
    ≤ ½). These parameters define the degree to which
    the workload influences the query distribution.
    For any given record inside (resp. outside) RQ,
    the parameter δ (resp. γ) represents the
    probability that an incoming query will select
    this record.
  • For all R' ⊆ R, pQ(R') is the probability of
    occurrence of any query that selects exactly the
    set of records R'.
  • RQ is the set of records selected by Q.
  • n1, n2, n3, and n4 are the counts of records in
    the regions.
  • When n2 or n4 is large (i.e., the overlap is
    large), pQ(R') is high (i.e., queries that
    select R' are likely to occur).
  • When n1 or n3 is large (i.e., the overlap is
    small), pQ(R') is low (i.e., queries that select
    R' are unlikely to occur).
  • Setting the parameters δ and γ
  • δ → 1 and γ → 0 implies that incoming
    queries are identical to workload queries
  • δ → 1 and γ → ½ implies that incoming queries
    are supersets of workload queries
  • δ → ½ and γ → 0 implies that incoming queries
    are subsets of workload queries
  • δ → ½ and γ → ½ implies that incoming queries
    are unrestricted
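
A sketch of the lifting distribution consistent with the behavior described above. The mapping of n1..n4 to the four record regions is my assumption (the slide leaves them undefined): records inside RQ are selected by an incoming query independently with probability δ, records outside RQ with probability γ.

```python
def lifted_probability(R, RQ, Rp, delta, gamma):
    """Sketch of p_Q(R') under the assumed independent-selection
    model: delta governs records inside R_Q, gamma those outside."""
    n2 = len(RQ & Rp)         # inside R_Q, selected by the query
    n1 = len(RQ - Rp)         # inside R_Q, not selected
    n3 = len(Rp - RQ)         # outside R_Q, selected
    n4 = len(R - RQ - Rp)     # outside R_Q, not selected
    return delta ** n2 * (1 - delta) ** n1 * gamma ** n3 * (1 - gamma) ** n4

R = set(range(10))
RQ = set(range(5))            # records selected by the workload query Q
p_similar = lifted_probability(R, RQ, {0, 1, 2, 3, 4}, delta=0.9, gamma=0.1)
p_disjoint = lifted_probability(R, RQ, {5, 6, 7, 8, 9}, delta=0.9, gamma=0.1)
# Queries overlapping R_Q heavily are far more probable than disjoint ones.
```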

10
The Non-Special Case with a Lifted Workload
  • Stratified sampling
  • Stratified sampling is a well-known
    generalization of uniform sampling where a
    population is partitioned into multiple strata
    and samples are selected uniformly from each
    stratum, with important strata contributing
    relatively more samples.
  • Define the population of a query Q (denoted POPQ)
    on a relation R as a set of size |R| that
    contains, for each record, the value of the
    aggregated column if the record is selected by Q,
    or 0 if the record is not selected.
  • A stratified sampling scheme partitions R into r
    strata containing n1, ..., nr records (where Σ nj
    = n), with k1, ..., kr records uniformly sampled
    from each stratum (where Σ kj = k).
  • The scheme also associates a ScaleFactor with
    each record in the sample. Queries are answered
    by executing them on the sample instead of R. For
    a COUNT query, the ScaleFactor entries of the
    selected records are summed, while for a SUM(y)
    query the expression y·ScaleFactor is summed. If
    we also wish to return an error guarantee with
    each query, then instead of ScaleFactor, we have
    to keep track of each nj and kj individually for
    each stratum.
  • Solution for STRAT
  • 3 steps
  • Step one (stratification step): determine
  • (a) How many strata r to partition relation R
    into, and
  • (b) The records from R that belong to each
    stratum.
  • At the end of step one, we have r strata
    R1, ..., Rr containing n1, ..., nr records such
    that Σ nj = n.
  • Step two (allocation step): determine how to
    divide k (the number of records available for the
    sample) into integers k1, ..., kr across the r
    strata such that Σ kj = k.
  • Step three (sampling step): uniformly sample kj
    records from stratum Rj to form the final sample
    of k records.
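
The sampling step and the ScaleFactor-based answering can be sketched together (illustrative names; the strata and allocation here are hand-picked, not produced by steps one and two). In a stratum with nj records from which kj are drawn, each sample record gets ScaleFactor = nj / kj, which makes COUNT estimates exact:

```python
import random

def strat_sample(strata, allocation, rng=None):
    """Step three of STRAT: draw k_j records uniformly from stratum
    R_j and tag each with ScaleFactor = n_j / k_j."""
    rng = rng or random.Random(0)
    out = []
    for stratum, kj in zip(strata, allocation):
        for y in rng.sample(stratum, kj):
            out.append({"y": y, "ScaleFactor": len(stratum) / kj})
    return out

def count_estimate(sample):
    """COUNT: sum the ScaleFactor entries of the selected records."""
    return sum(r["ScaleFactor"] for r in sample)

def sum_estimate(sample):
    """SUM(y): sum the expression y * ScaleFactor."""
    return sum(r["y"] * r["ScaleFactor"] for r in sample)

strata = [[10, 20, 30], [40, 50], [60, 70, 80, 90]]
sample = strat_sample(strata, allocation=[1, 1, 2])
est = count_estimate(sample)      # 3/1 + 2/1 + 2 * (4/2) = 9, always exact
```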

11
The Non-Special Case with a Lifted Workload
  • Solution for STRAT (Cont.)
  • Solution for COUNT Aggregate
  • Stratification Step
  • Lemma 1: Consider a relation R with n
    records and a workload W of COUNT queries. In the
    limit as n tends to infinity, the fundamental
    regions F = {R1, R2, ..., Rr} represent an optimal
    stratification.
  • Allocation Step
  • (1) Express MSE(pW) as a weighted sum of the MSE
    of each query in the workload
  • Lemma 2: MSE(pW) = Σi wi · MSE(pQi)
  • (2) For any Q ∈ W, express MSE(pQ) as a function
    of the kj's
  • Lemma 3: For a COUNT query Q in W, MSE(pQ) can
    be approximated by a closed-form expression
    ApproxMSE(pQ) (see the paper for the expression).
  • Since we have an (approximate) formula for
    MSE(pQ), we can express MSE(pW) as a function
    of the variables kj.
  • Corollary 1: MSE(pW) ≈ Σj (αj / kj), where
    each αj is a function of n1, ..., nr, δ, and γ.
  • αj captures the importance of a region: it
    is positively correlated with nj as well as the
    frequency of queries in the workload that access
    Rj.
  • (3) Minimize MSE(pw)
  • Lemma 4: Σj (αj / kj) is minimized subject to
    Σj kj = k when kj = k · (sqrt(αj) / Σi sqrt(αi)).
  • Lemma 4 provides a closed-form and
    computationally inexpensive solution to the
    allocation problem, since αj depends only on δ, γ,
    and the number of tuples in each fundamental
    region.
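
Lemma 4's allocation rule is a one-liner. A small numeric check (the αj values here are made up for illustration) shows the square-root allocation beats a uniform split of the same budget:

```python
def allocate(alphas, k):
    """Lemma 4: k_j = k * sqrt(a_j) / sum_i sqrt(a_i) minimizes
    sum_j a_j / k_j subject to sum_j k_j = k (before rounding to
    integers)."""
    roots = [a ** 0.5 for a in alphas]
    total = sum(roots)
    return [k * r / total for r in roots]

alphas = [4.0, 1.0, 1.0]          # illustrative region importances a_j
ks = allocate(alphas, k=8)        # sqrt weights 2:1:1 -> [4.0, 2.0, 2.0]
optimal = sum(a / kj for a, kj in zip(alphas, ks))     # 1 + 0.5 + 0.5 = 2.0
uniform = sum(a / (8 / 3) for a in alphas)             # 2.25, strictly worse
```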

12
The Non-Special Case with a Lifted Workload
  • Solution for STRAT (Cont.)
  • Solution for SUM Aggregate
  • Stratification Step
  • Since each stratum may have LARGE internal
    variance in the values of the aggregate column,
    we CANNOT use the same stratification as in the
    COUNT case (i.e., strata = fundamental regions).
  • Divide fundamental regions with large variance
    into a set of finer regions, each of which has
    significantly lower internal variance, and treat
    these finer regions as the strata. Within a new
    stratum the aggregate column values of records
    are CLOSE to one another.
  • Borrow from the statistics literature an
    approximation of the optimal Neyman allocation
    technique for minimizing variance, and use it to
    divide each fundamental region into h finer
    regions, thus generating a total of h·r regions,
    which become the strata. (h was set to 6.)
  • Allocation Step
  • (1) As with COUNT, it is expressed as an
    optimization problem, now with h·r unknowns
    k1, ..., khr.
  • (2) Unlike COUNT, the specific values of
    the aggregate column in each region influence
    MSE(pQ). Let yj (Yj) be the AVERAGE (sum) of the
    aggregate column values of all records in region
    Rj. Since the variance within each region is
    SMALL (due to stratification), we can assume that
    each value within the region can be approximated
    as yj, and thus express MSE(pQ) as a function of
    the kj's for a SUM query Q in W.
  • As with COUNT, MSE(pW) for SUM is
    functionally of the form Σj (αj / kj), and αj
    depends on the same parameters n1, ..., nhr, δ,
    and γ (see Corollary 1).
  • (3) The minimization step is the same as in
    Lemma 4.
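
The SUM-case stratification step can be sketched as follows. This is a simplification: the paper places the cut points with an approximation of optimal Neyman allocation, while the sketch below just sorts a region by aggregate value and cuts it into h equal-sized contiguous buckets, which already shows the drop in within-stratum variance.

```python
def split_region(values, h):
    """Cut one fundamental region into (up to) h finer regions of
    nearby aggregate values, so each has low internal variance."""
    values = sorted(values)
    size = -(-len(values) // h)        # ceil(len / h)
    return [values[i:i + size] for i in range(0, len(values), size)]

def variance(vals):
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals) / len(vals)

region = [1, 2, 3, 100, 110, 120]      # one high-variance fundamental region
strata = split_region(region, h=2)     # [[1, 2, 3], [100, 110, 120]]
# Each finer stratum has far lower internal variance than the region.
```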