Title: New SamplingBased Summary Statistics for Improving Approximate Query Answers P' B' Gibbons and Y' Ma
1New Sampling-Based Summary Statistics for
Improving Approximate Query AnswersP. B. Gibbons
and Y. Matias (ACM SIGMOD 1998)
2Outline
- Introduction
- Concise samples
- Counting samples
- Application to hot list queries and Experimental
evaluation - Conclusion
3Introduction
- In large data recording and warehousing
environments, it is often advantageous to provide
fast, approximate answers to queries. - Approximate answer engine.
- The goal is to develop effective data synopses
that capture important information in a concise
representation.
New Data
Data Warehouse
Queries
Response
Figure 1A traditional data warehouse
New Data
Approx. Answer Engine
Data Warehouse
Queries
Response
Figure 2 Data warehouse set-up for
providing approximate query answers.
4Effectiveness of a Synopsis
- Accuracy of the answers it provides
- Fast response time
- Update time
- Large sample size is desirable
- Effectiveness of a synopsis is evaluated as a
function of its footprint, i.e., the number of
memory words to store the synopsis.
5Two new sampling-based synopses
- Concise samples
- ltvalue, countgt represents those values that
appear more than once in the sample - Counting samples
- keep track of all occurrences of a value inserted
into the relation since the value was selected
for the sample. - Fast Incremental Maintenance
6Outline
- Introduction
- Concise samples
- Counting samples
- Application to hot list queries and Experimental
evaluation - Conclusion
7Concise samples
- Observation any value occurring frequently in a
sample is a wasteful use of the available space - Definition 1 A concise sample is a uniform
random sample of the data set such that values
appearing more than once in the sample are
represented as a value and a count, ex ltvalue,
countgt. - More sample points than traditional samples for
the same footprint gt more accurate
8Concise samples properties
- Consider a relation R with n tuples and an
attribute A. The goal is to obtain a uniform
random sample of R.A, i.e., the values of A for a
random subset of the tuples in R. - Definition Let S ltv1, c1gt,, ltvj, cjgt,
vj1,..., vl be a concise sample. Then
sample-size(S) l-j?ji 1ci, and footprint(S)
lj - at most m/2 distinct values, footprint at
most m, -
- Lemma 1 For any footprint m 2, there exists
data sets for which the sample-size of a concise
sample is n/m times lager than its footprint,
where n is the size of the data set. - n/m times as many sample points as a
traditional sample
9Concise samples offline/static
- Offline/static computation
- Footprint is m
- Repeat m times select a random tuple from the
relation and extract its value for attribute A. - Semi-sort the set of values, and replace every
value occurring multiple times with a ltvalue,
countgt pair. - Continue to sample until either adding the sample
point would increase the concise sample footprint
to m1 or n samples have been taken. - For each new value sampled, look-up to see if it
is already in the concise sample and then either
add a new singleton value, convert a singleton to
a ltvalue, countgt pair, or increment the count for
a pair.
10Concise samples Incremental maintenance
- More difficult than maintain a traditional sample
- Maintenance algorithm
- Let S be the current concise sample and
consider a new tuple t. Set up an entry threshold
t(initially 1) for new tuples to be selected for
the sample. - Add t.A to S with probability 1/t. (flip a coin
with heads probability 1/t) - Do a look-up on t.A in S.
- if it is represented by a pair, its count is
incremented. - if t.A is a singleton in S, a pair is created,
- if it is not in S, a singleton is created.
- Increase footprint by 1 in cases b) and c)
- Evict existing sample points to create room.
Raise the threshold to some t. Subject each
sample point in S to this higher threshold---T/T
probability evicted. Subsequent inserts are
selected for the sample with probability 1/t - For any sequence of insertions, the above
algorithm maintains a concise sample.
11Experimental evaluation
- Evaluate the gain in the sample-size of concise
sample over traditional sample - Insert 500k new values, value domain1,D, where
D is varied from 500 to 50k. - Footprint m 1000
- Compare three samples
- traditional
- concise online
- concise offline
-
12Concise Samples experimental evaluation
D potential number of distinct values m
footprint size
Figure 3 Comparing sample-sizes of concise and
traditional samples as a function of skew, for
varying footprints and D/m ratios. In (a) and
(b), authors compare footprint 100 and footprint
1000, respectively, for the same data sets. In
(c) and (d), authors compare D/m 50 and D/m 5
, respectively, for the same footprint 1000.
13Concise samples
- Update time overheads
- The coin flips that must be performed to decide
which inserts are added to the concise sample and
to evict values from the concise sample when the
threshold is raised - The lookups into the current concise sample to
see if a value is already present in the sample
14Outline
- Introduction
- Concise samples
- Counting samples
- Application to hot list queries and Experimental
evaluation - Conclusion
15Counting samples
- Counting samples a variation on concise samples
in which the counts are used to keep track of all
occurrences of a value inserted into the relation
since the value was selected for the sample. - Definition A counting sample for R.A with
threshold T is any subset of R.A obtained as
follows - 1. For each value v occurring c times in R, we
flip a coin with probability 1/T of heads until
the first heads, up to at most c coin tosses in
all if the ith coin toss is heads, then v occurs
c-i1 times in the subset, else v is not in the
subset. - 2. Each value v occurring cgt1 times in the subset
is represented as a pair ltv, cgt, and each value v
occurring exactly once is represented as a
singleton v.
16Counting samples (cont.)
- Incremental maintenance algorithm.
- Let S be the current counting sample and t be a
new tuple - Set up an entry threshold T for new tuples to be
selected. - Look up on t.A in S
- t.A is not in S and we add it to S with
probability 1/T. - Deletion
- Theorem 4 Let R be an arbitrary relation, and let
T be the current threshold for a counting sample
S. (i) Any value v that occurs at least T times
in R is expected to be in S. (ii) Any value v
that occurs fv times in R will be in S with
probability 1-(1-1/T) fv. (iii) For all agt1, if
fv aT, then with probability 1 - e-a, the
value will be in S and its count will be at least
fv - aT
17Outline
- Introduction
- Concise samples
- Counting samples
- Application to hot list queries and experiment
evaluation - Conclusion
18Hot list queries
- Hot list queries
- request an ordered set of ltvalue, countgt pairs
for the k most frequently occurring data values,
for some k. - ex the top selling items in a database of sales
transactions.
19Hot list queries
- Algorithms
- Using traditional samples
- Using concise samples
- Using counting samples
- Using histogram on disk maintains a full
histogram on disk, i.e., ltvalue, countgt pairs for
all distinct values in R, with a copy of the top
m/2 pairs stored as a synopsis within the
approximate answer engine. - -- is considered only as a baseline for accuracy
comparisons -
20Application to hot list queries (cont.)
x-axis rank of a value y-axis count for the
values
21Application to hot list queries (cont.)
22Application to hot list queries (cont.)
23Application to hot list queries overheads
24Conclusion
- Using concise samples may offer the best choice
when considering both accuracy and overheads. - In this paper, a batch-like processing of data
warehouse inserts, in which inserts and queries
do not intermix, is assumed. To address the more
general case , issues of concurrency bottlenecks
need to be addressed. - Future work is to explore the effectiveness of
using concise samples and counting samples for
other concrete approximate answer scenarios.