1
New Sampling-Based Summary Statistics for
Improving Approximate Query Answers
P. B. Gibbons and Y. Matias (ACM SIGMOD 1998)

Rongfang Li
Feb 2007

2
Outline

Introduction
Concise samples
Counting samples
Application to hot list queries and Experimental
evaluation
Conclusion

3
Introduction

In large data recording and warehousing
environments, it is often advantageous to provide
fast, approximate answers to queries.
Approximate answer engine.
The goal is to develop effective data synopses
that capture important information in a concise
representation.

[Figure 1: A traditional data warehouse. Labels: New Data, Data Warehouse, Queries, Response.]
[Figure 2: Data warehouse set-up for providing approximate query answers. Labels: New Data, Approx. Answer Engine, Data Warehouse, Queries, Response.]

4
Effectiveness of a Synopsis

Accuracy of the answers it provides
Fast response time
Update time
Large sample size is desirable
Effectiveness of a synopsis is evaluated as a
function of its footprint, i.e., the number of
memory words to store the synopsis.

5
Two new sampling-based synopses

Concise samples
<value, count> pairs represent those values that
appear more than once in the sample
Counting samples
keep track of all occurrences of a value inserted
into the relation since the value was selected
for the sample.
Fast Incremental Maintenance

6
Outline

Introduction
Concise samples
Counting samples
Application to hot list queries and Experimental
evaluation
Conclusion

7
Concise samples

Observation: any value occurring frequently in a
sample is a wasteful use of the available space.
Definition 1: A concise sample is a uniform
random sample of the data set such that values
appearing more than once in the sample are
represented as a value and a count, e.g., <value,
count>.
More sample points than traditional samples for
the same footprint => more accurate

8
Concise samples properties

Consider a relation R with n tuples and an
attribute A. The goal is to obtain a uniform
random sample of R.A, i.e., the values of A for a
random subset of the tuples in R.
Definition: Let $S = \{\langle v_1, c_1 \rangle, \ldots, \langle v_j, c_j \rangle, v_{j+1}, \ldots, v_l\}$
be a concise sample. Then
sample-size(S) $= l - j + \sum_{i=1}^{j} c_i$, and footprint(S) $= l + j$.
With at most m/2 distinct values, the footprint is at
most m.
Lemma 1: For any footprint $m \ge 2$, there exist
data sets for which the sample-size of a concise
sample is n/m times larger than its footprint,
where n is the size of the data set.
That is, n/m times as many sample points as a
traditional sample.
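
To make the definition concrete, here is a minimal Python sketch that collapses a plain list of sampled values into pairs and singletons and evaluates sample-size(S) and footprint(S). The function names and the dict/set layout are illustrative assumptions, not something the paper prescribes.

```python
from collections import Counter

def to_concise(sample_values):
    """Collapse sampled values into (pairs, singletons).

    pairs: dict value -> count for values seen more than once.
    singletons: set of values seen exactly once.
    (Hypothetical layout chosen for illustration.)
    """
    counts = Counter(sample_values)
    pairs = {v: c for v, c in counts.items() if c > 1}
    singletons = {v for v, c in counts.items() if c == 1}
    return pairs, singletons

def sample_size(pairs, singletons):
    # Each singleton is one sample point; each pair stands for c points.
    return len(singletons) + sum(pairs.values())

def footprint(pairs, singletons):
    # A singleton costs one memory word, a <value, count> pair costs two.
    return len(singletons) + 2 * len(pairs)

pairs, singletons = to_concise(["a", "b", "a", "c", "a", "b", "d"])
print(sample_size(pairs, singletons), footprint(pairs, singletons))
```

In this example l = 4 entries and j = 2 pairs, so the two functions agree with l - j + (sum of counts) and l + j.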

9
Concise samples offline/static

Offline/static computation
Footprint is m
Repeat m times select a random tuple from the
relation and extract its value for attribute A.
Semi-sort the set of values, and replace every
value occurring multiple times with a <value,
count> pair.
Continue to sample until either adding the sample
point would increase the concise sample footprint
to m+1 or n samples have been taken.
For each new value sampled, look up whether it
is already in the concise sample and then either
add a new singleton value, convert a singleton to
a <value, count> pair, or increment the count for
a pair.
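
The offline procedure can be sketched as follows. This is a simplified illustration under stated assumptions: tuples are drawn without replacement by walking a random permutation, the initial batch of m draws and the semi-sort are folded into the same per-value loop, and the function name and dict representation are hypothetical.

```python
import random

def offline_concise_sample(values, m, seed=None):
    """Build a concise sample of `values` with footprint at most m.

    Returns (counts, footprint); counts maps value -> count, where
    count 1 means a singleton and count > 1 a <value, count> pair.
    """
    rng = random.Random(seed)
    order = list(values)
    rng.shuffle(order)  # assumption: sampling without replacement
    counts = {}
    fp = 0
    for v in order:
        c = counts.get(v, 0)
        # A new singleton or a singleton-to-pair conversion costs one
        # more word; incrementing an existing pair costs nothing.
        growth = 0 if c >= 2 else 1
        if fp + growth > m:
            break  # this point would push the footprint to m + 1
        counts[v] = c + 1
        fp += growth
    return counts, fp

# A single repeated value: 50 sample points fit in a footprint of 2.
print(offline_concise_sample([7] * 50, m=2))
```

The usage line mirrors the spirit of Lemma 1: on heavily repeated data the sample-size can far exceed the footprint.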

10
Concise samples Incremental maintenance

More difficult than maintaining a traditional sample
Maintenance algorithm
Let S be the current concise sample and
consider a new tuple t. Keep an entry threshold
T (initially 1) for new tuples to be selected for
the sample.
Add t.A to S with probability 1/T (flip a coin
with heads probability 1/T).
If selected, do a look-up on t.A in S:
a) if it is represented by a pair, its count is
incremented;
b) if t.A is a singleton in S, a pair is created;
c) if it is not in S, a singleton is created.
The footprint increases by 1 in cases b) and c).
If the footprint would exceed m, evict existing
sample points to create room:
raise the threshold to some T' > T and subject each
sample point in S to this higher threshold, i.e.,
keep it with probability T/T' (evict it with
probability 1 - T/T'). Subsequent inserts are
selected for the sample with probability 1/T'.
For any sequence of insertions, the above
algorithm maintains a concise sample.
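
A minimal sketch of the maintenance algorithm, with the entry threshold written as T. The class name, the dict representation, and the choice to raise the threshold by a factor of 1.1 each time are assumptions made for illustration; the slide only says the threshold is raised "to some" higher value. The footprint is recomputed on every insert for clarity rather than speed.

```python
import random

class ConciseSample:
    """Illustrative incremental concise sample with footprint bound m."""

    def __init__(self, m, seed=None):
        self.m = m
        self.T = 1.0          # entry threshold, initially 1
        self.counts = {}      # value -> count (1 = singleton, >1 = pair)
        self.rng = random.Random(seed)

    def footprint(self):
        return sum(1 if c == 1 else 2 for c in self.counts.values())

    def sample_size(self):
        return sum(self.counts.values())

    def insert(self, v):
        # Select the new value with probability 1/T.
        if self.rng.random() >= 1.0 / self.T:
            return
        self.counts[v] = self.counts.get(v, 0) + 1
        # Evict until the footprint bound holds again.
        while self.footprint() > self.m:
            self._raise_threshold(self.T * 1.1)  # assumed growth factor

    def _raise_threshold(self, new_T):
        keep = self.T / new_T  # each sample point survives with prob. T/T'
        for v in list(self.counts):
            kept = sum(self.rng.random() < keep
                       for _ in range(self.counts[v]))
            if kept:
                self.counts[v] = kept
            else:
                del self.counts[v]
        self.T = new_T

cs = ConciseSample(m=10, seed=1)
for i in range(5000):
    cs.insert(i % 50)
print(cs.footprint(), cs.sample_size(), round(cs.T, 2))
```

Keeping each point with probability T/T' is what makes a point that entered with probability 1/T end up in the sample with overall probability 1/T'.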

11
Experimental evaluation

Evaluate the gain in the sample-size of concise
sample over traditional sample
Insert 500k new values, value domain [1, D], where
D is varied from 500 to 50k.
Footprint m = 1000
Compare three samples
traditional
concise online
concise offline

12
Concise Samples experimental evaluation
D: potential number of distinct values; m: footprint size
Figure 3: Comparing sample-sizes of concise and
traditional samples as a function of skew, for
varying footprints and D/m ratios. In (a) and
(b), the authors compare footprint 100 and footprint
1000, respectively, for the same data sets. In
(c) and (d), the authors compare D/m = 50 and D/m = 5,
respectively, for the same footprint 1000.

13
Concise samples

Update time overheads
The coin flips that must be performed to decide
which inserts are added to the concise sample and
to evict values from the concise sample when the
threshold is raised
The lookups into the current concise sample to
see if a value is already present in the sample

14
Outline

Introduction
Concise samples
Counting samples
Application to hot list queries and Experimental
evaluation
Conclusion

15
Counting samples

Counting samples: a variation on concise samples
in which the counts are used to keep track of all
occurrences of a value inserted into the relation
since the value was selected for the sample.
Definition: A counting sample for R.A with
threshold T is any subset of R.A obtained as
follows:
1. For each value v occurring c times in R, we
flip a coin with probability 1/T of heads until
the first heads, up to at most c coin tosses in
all; if the i-th coin toss is heads, then v occurs
c - i + 1 times in the subset, else v is not in the
subset.
2. Each value v occurring c > 1 times in the subset
is represented as a pair <v, c>, and each value v
occurring exactly once is represented as a
singleton v.
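
The definition can be simulated directly. The sketch below assumes the whole column R.A is available as a Python list and returns a dict from value to its count in the subset; the function name and the seed parameter are hypothetical conveniences.

```python
import random
from collections import Counter

def counting_sample(values, T, seed=None):
    """Draw a counting sample with threshold T, following the definition.

    For each value occurring c times, flip a coin with heads
    probability 1/T up to c times; if the first heads is toss i,
    the value gets count c - i + 1, otherwise it is left out.
    """
    rng = random.Random(seed)
    sample = {}
    for v, c in Counter(values).items():
        for i in range(1, c + 1):
            if rng.random() < 1.0 / T:
                sample[v] = c - i + 1
                break
    return sample

data = ["x"] * 40 + ["y"] * 5 + ["z"]
print(counting_sample(data, T=5, seed=0))
```

With T = 1 every first toss is heads, so the result is the exact frequency table; larger thresholds drop rare values and undercount the rest by the number of tosses that preceded the first heads.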

16
Counting samples (cont.)

Incremental maintenance algorithm.
Let S be the current counting sample and t be a
new tuple.
Keep an entry threshold T for new tuples to be
selected.
Look up t.A in S:
if t.A is in S, its count is incremented;
if t.A is not in S, it is added to S with
probability 1/T.
Deletion
Theorem 4: Let R be an arbitrary relation, and let
T be the current threshold for a counting sample
S. (i) Any value v that occurs at least T times
in R is expected to be in S. (ii) Any value v
that occurs $f_v$ times in R will be in S with
probability $1 - (1 - 1/T)^{f_v}$. (iii) For all $\alpha > 1$, if
$f_v \ge \alpha T$, then with probability at least $1 - e^{-\alpha}$, the
value will be in S and its count will be at least
$f_v - \alpha T$.
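
Parts (ii) and (iii) of Theorem 4 are easy to evaluate numerically. The helper names below are hypothetical. The second function gives the bound from part (iii), which can be compared with the exact chance that the first heads falls within the first alpha * T tosses.

```python
import math

def inclusion_probability(f_v, T):
    """Theorem 4 (ii): chance that a value with frequency f_v is in S."""
    return 1.0 - (1.0 - 1.0 / T) ** f_v

def count_bound_probability(alpha):
    """Theorem 4 (iii): 1 - e^(-alpha), the stated probability bound."""
    return 1.0 - math.exp(-alpha)

T = 100
for alpha in (1.5, 2, 3, 5):
    f_v = int(alpha * T)
    # Probability of a heads within the first alpha * T tosses,
    # next to the closed-form bound from part (iii).
    print(alpha, inclusion_probability(f_v, T), count_bound_probability(alpha))
```

Because (1 - 1/T)^(alpha * T) never exceeds e^(-alpha), the first printed probability is always at least the second.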

17
Outline

Introduction
Concise samples
Counting samples
Application to hot list queries and experiment
evaluation
Conclusion

18
Hot list queries

Hot list queries
request an ordered set of <value, count> pairs
for the k most frequently occurring data values,
for some k.
e.g., the top-selling items in a database of sales
transactions.
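
As an illustration of what a hot list answer looks like, the sketch below takes a value-to-count table (exact counts, or counts from a sample) and returns the top k pairs. Scaling sampled counts by n divided by the sample size is a simple assumption made here for illustration; the slides do not spell out the reporting rule used in the paper, and the function name is hypothetical.

```python
def hot_list(counts, k, n=None):
    """Return the k most frequent (value, estimated_count) pairs.

    counts: dict value -> count, exact or taken from a sample.
    n: size of the full relation; if given, counts are scaled by
       n / (total count in the table) to estimate true frequencies.
    """
    total = sum(counts.values())
    scale = (n / total) if (n and total) else 1.0
    # Sort by descending count; break ties by value text for stability.
    top = sorted(counts.items(), key=lambda vc: (-vc[1], str(vc[0])))[:k]
    return [(v, c * scale) for v, c in top]

sample_counts = {"a": 3, "b": 2, "c": 1}
print(hot_list(sample_counts, k=2, n=60))
```

Here the sample holds 6 points from a relation of 60 tuples, so each sampled count is multiplied by 10.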

19
Hot list queries

Algorithms
Using traditional samples
Using concise samples
Using counting samples
Using a histogram on disk: maintains a full
histogram on disk, i.e., <value, count> pairs for
all distinct values in R, with a copy of the top
m/2 pairs stored as a synopsis within the
approximate answer engine
(considered only as a baseline for accuracy
comparisons)

20
Application to hot list queries (cont.)
[Figure: x-axis: rank of a value; y-axis: count for the values.]

21
Application to hot list queries (cont.)
22
Application to hot list queries (cont.)
23
Application to hot list queries overheads
24
Conclusion

Using concise samples may offer the best choice
when considering both accuracy and overheads.
The paper assumes batch-like processing of data
warehouse inserts, in which inserts and queries
do not intermix. Handling the more general case
requires addressing concurrency bottlenecks.
Future work is to explore the effectiveness of
using concise samples and counting samples for
other concrete approximate answer scenarios.