A Dip in the Reservoir: Maintaining Sample Synopses of Evolving Datasets. Rainer Gemulla (University of Technology Dresden), Wolfgang Lehner (University of Technology Dresden), Peter J. Haas (IBM Almaden Research Center)

1
A Dip in the Reservoir: Maintaining Sample Synopses of Evolving Datasets
Rainer Gemulla (University of Technology Dresden)
Wolfgang Lehner (University of Technology Dresden)
Peter J. Haas (IBM Almaden Research Center)
Faculty of Computer Science, Institute of Systems Architecture, Database Technology Group
2
Outline
  1. Introduction
  2. Deletions
  3. Resizing
  4. Experiments
  5. Summary

3
Random Sampling
  • Database applications
  • huge data sets
  • complex algorithms (in space and time)
  • Requirements
  • performance, performance, performance
  • Random sampling
  • approximate query answering
  • data mining
  • data stream processing
  • query optimization
  • data integration

Sampling fraction   Turnover in Europe (TPC-H)   Execution time
1%                  8.46 Mil. ± 0.15 Mil.        4s
10%                 8.51 Mil. ± 0.05 Mil.        52s
100%                8.54 Mil.                    200s
4
The Problem Space
  • Setting
  • arbitrary data sets
  • samples of the data
  • evolving data
  • Scope of this talk
  • maintenance of random samples
  • Can we minimize or even avoid access to base
    data?

5
Types of Data Sets
  • Data sets
  • variation of data set size
  • influence on sampling

  • stable data set → goal: stable sample
  • growing data set → goal: controlled growing sample
  • shrinking data set → uninteresting
6
Uniform Sampling
  • Uniform sampling
  • all samples of the same size are equally likely
  • many statistical procedures assume uniformity
  • flexibility
  • Example
  • a data set (also called population)
  • possible samples of size 2
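To make the uniformity requirement concrete, here is a toy illustration (the 4-element population is hypothetical, not from the slides): a uniform scheme must pick each size-2 subset with the same probability.

```python
from itertools import combinations

population = ['a', 'b', 'c', 'd']             # hypothetical 4-element data set
samples = list(combinations(population, 2))   # all possible samples of size 2
# C(4, 2) = 6 subsets; a uniform scheme selects each with probability 1/6
print(len(samples))  # → 6
```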

7
Reservoir Sampling
  • Reservoir sampling
  • computes a uniform sample of M elements
  • building block for many sophisticated sampling
    schemes
  • single-scan algorithm
  • add the first M elements
  • afterwards, flip a coin
  • ignore the element (reject)
  • replace a random element in the sample (accept)
  • accept probability of the i-th element is M/i
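The single-scan procedure above can be sketched as follows (a minimal version of the classic reservoir algorithm; variable names are my own):

```python
import random

def reservoir_sample(stream, M, rng=random):
    """Single-scan reservoir sampling: maintains a uniform sample of M elements."""
    sample = []
    for i, item in enumerate(stream, start=1):
        if i <= M:
            sample.append(item)              # add the first M elements
        elif rng.random() < M / i:           # accept the i-th element w.p. M/i
            sample[rng.randrange(M)] = item  # replace a random sample element
        # otherwise reject: the arriving element is ignored
    return sample
```

After any prefix of the stream, every size-M subset of the elements seen so far is equally likely.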

8
Reservoir Sampling (Example)
  • Example
  • sample size M = 2

9
Problems with Reservoir Sampling
  • Problems with reservoir sampling
  • lacks support for deletions (stable data sets)
  • cannot efficiently enlarge sample (growing data
    sets)

10
Outline
  1. Introduction
  2. Deletions
  3. Resizing
  4. Experiments
  5. Summary

11
Naïve/Prior Approaches
Algorithm                   Technique                                                Comments
RS with deletions           conduct deletions, continue with smaller sample          unstable
Naïve                       use insertions to immediately refill the sample          not uniform
Backing sample              let sample size decrease, but occasionally recompute     expensive, unstable
CAR(WOR)                    immediately sample from base data to refill the sample   stable but expensive
Bernoulli s. with purging   coin flip sampling with deletions, purge if too large    inexpensive but unstable
Passive sampling            developed for data streams (sliding windows only)       special case of our RP algorithm
Distinct-value sampling     tailored for multiset populations                        expensive, low space efficiency in our setting
12
Random Pairing
  • Random pairing
  • compensates deletions with arriving insertions
  • corrects inclusion probabilities
  • General idea (insertion)
  • no uncompensated deletions → reservoir sampling
  • otherwise,
  • randomly select an uncompensated deletion
    (partner)
  • compensate it: was it in the sample?
  • yes → add arriving element to sample
  • no → ignore arriving element

13
Random Pairing
  • Example

14
Random Pairing
  • Details of the algorithm
  • keeping history of deleted items is expensive,
    but
  • maintenance of two counters suffices
  • correctness proof is in the paper
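The two-counter bookkeeping can be sketched like this (a simplified illustration of the scheme described in the talk; names such as `c_in`/`c_out` are my own, and the paper's algorithm has further details):

```python
import random

class RandomPairing:
    """Sketch of Random Pairing: arriving insertions compensate earlier
    deletions; only two counters are kept, no deletion history."""

    def __init__(self, M, rng=random):
        self.M = M            # sample-size bound
        self.sample = []
        self.size = 0         # current data set size
        self.c_in = 0         # uncompensated deletions that were in the sample
        self.c_out = 0        # uncompensated deletions that were not
        self.rng = rng

    def insert(self, item):
        self.size += 1
        d = self.c_in + self.c_out
        if d == 0:
            # no uncompensated deletions: plain reservoir step
            if len(self.sample) < self.M:
                self.sample.append(item)
            elif self.rng.random() < self.M / self.size:
                self.sample[self.rng.randrange(self.M)] = item
        elif self.rng.random() < self.c_in / d:
            self.c_in -= 1
            self.sample.append(item)   # partner was in the sample: take its place
        else:
            self.c_out -= 1            # partner was outside: ignore the arrival

    def delete(self, item):
        self.size -= 1
        if item in self.sample:
            self.sample.remove(item)
            self.c_in += 1
        else:
            self.c_out += 1
```

Note that `delete` and `insert` never touch the base data, which is the point of the scheme.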

15
Outline
  1. Introduction
  2. Deletions
  3. Resizing
  4. Experiments
  5. Summary

16
Growing Data Sets
  • The problem
  • growing data set

Data set: growing
Random pairing: stable sample, sampling fraction decreases
17
A Negative Result
  • Negative result
  • There is no resizing algorithm which can enlarge
    a bounded-size sample without ever accessing base
    data.
  • Example
  • data set
  • samples of size 2
  • new data set
  • samples of size 3

Not uniform!
18
Resizing
  • Goal
  • efficiently increase sample size
  • stay within an upper bound at all times
  • General idea
  • convert sample to Bernoulli sample
  • continue Bernoulli sampling until new sample size
    is reached
  • convert back to reservoir sample
  • Optimally balance cost
  • cost of base data accesses (in step 1)
  • time to reach new sample size (in step 2)

19
Resizing
  • Bernoulli sampling
  • uniform sampling scheme
  • each tuple is added to the sample with
    probability q
  • sample size follows a binomial distribution → no
    effective upper bound
  • Phase 1: conversion to a Bernoulli sample
  • given q, randomly determine sample size
  • reuse reservoir sample to create Bernoulli sample
  • subsample
  • sample additional tuples (base data access)
  • choice of q
  • small → fewer base data accesses
  • large → more base data accesses
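The Phase 1 conversion might look like the following sketch (based only on the talk's description; `fetch_from_base` is a hypothetical callback standing in for the expensive base-data access):

```python
import random

def to_bernoulli(reservoir, dataset_size, q, fetch_from_base, rng=random):
    """Phase 1 sketch: convert a reservoir sample into a Bernoulli(q) sample.
    The Bernoulli sample size is Binomial(|D|, q); if it fits in the
    reservoir we subsample, otherwise extra tuples come from the base data."""
    # draw the target size from Binomial(dataset_size, q)
    target = sum(rng.random() < q for _ in range(dataset_size))
    if target <= len(reservoir):
        return rng.sample(reservoir, target)          # reuse: subsample reservoir
    extra = fetch_from_base(target - len(reservoir))  # base data access
    return list(reservoir) + list(extra)
```

A small q makes the Binomial draw likely to fit inside the reservoir (no base access); a large q makes a fetch more likely, matching the cost trade-off above.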

20
Resizing
  • Phase 2: run Bernoulli sampling
  • accept new tuples with probability q
  • conduct deletions
  • stop as soon as new sample size is reached
  • Phase 3: revert to reservoir sampling
  • switchover is trivial
  • Choosing q
  • determines cost of Phase 1 and Phase 2
  • goal: minimize total cost
  • base data access expensive → small q
  • base data access cheap → large q
  • details in paper
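Phase 2 can be sketched as a simple loop over arriving operations (the `('+', x)` / `('-', x)` encoding of insertions and deletions is my own, not from the talk):

```python
import random

def grow_bernoulli(sample, q, new_size, ops, rng=random):
    """Phase 2 sketch: keep Bernoulli(q) sampling insertions (and applying
    deletions) until the sample reaches the new target size."""
    sample = list(sample)
    for op, item in ops:
        if op == '+' and rng.random() < q:
            sample.append(item)            # accept new tuple w.p. q
        elif op == '-' and item in sample:
            sample.remove(item)            # conduct deletions
        if len(sample) >= new_size:
            break                          # Phase 3: revert to reservoir sampling
    return sample
```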

21
Resizing
  • Example
  • resize by 30% if sampling fraction drops below 9%
  • dependent on costs of accessing base data

Low costs → immediate resizing
Moderate costs → combined solution
High costs → degenerates to Bernoulli sampling
22
Outline
  1. Introduction
  2. Deletions
  3. Resizing
  4. Experiments
  5. Summary

23
Total Cost
  • Total cost
  • stable dataset, 10M operations
  • sample size 100k, data access 10 times more
    expensive than sample access

(Chart: total cost, grouped into techniques with base data access vs. without)
24
Sample size
  • Sample size
  • stable dataset, size 1M
  • sample size 100k

(Chart: sample size over time, techniques with base data access vs. without)
25
Outline
  • Introduction
  • Deletions
  • Resizing
  • Experiments
  • Summary

26
Summary
  • Reservoir Sampling
  • lacks support for deletions
  • complete recomputation to enlarge the sample
  • Random Pairing
  • uses arriving insertions to compensate for
    deletions
  • Resizing
  • base data access cannot be avoided
  • minimizes total cost
  • Future work
  • better q for resizing
  • combine with existing techniques [4, 8, 17] to
    enhance flexibility, scalability

27
Thank you!
Questions?
28
Backup Bounded-Size Sampling
  • Why sampling?
  • performance, performance, performance
  • How much to sample?
  • influencing factors
  • storage consumption
  • response time
  • accuracy
  • choosing the sample size / sampling fraction
  • largest sample that meets storage requirements
  • largest sample that meets response time
    requirements
  • smallest sample that meets accuracy requirements

29
Backup Bounded-Size Sampling
  • Example
  • random pairing vs. Bernoulli sampling
  • average estimation

(Figures: data set size, sample size, and standard error over time; in the sample-size plot BS violates requirements 1 and 2, in the standard-error plot requirement 3)
30
Backup Distinct-Value Sampling
  • Distinct-value sampling (optimistic setting for
    DV)
  • DV-scheme knows avg. dataset size in advance
  • assume no storage for counters and hash functions
(Charts: sample size and execution time; axis scales omitted)
RP has better memory utilization
RP is significantly faster
31
Backup RS With Deletions
  • Reservoir sampling with deletions
  • conduct deletions, continue with smaller sample
    size

32
Backup Backing Sample
  • Evaluation
  • data set consists of 1 million elements (on
    average)
  • 100k sample, clustered insertions/deletions

Data set: stable
Reservoir sampling: sample is empty eventually
Backing sample: expensive, unstable
33
Backup An Incorrect Approach
  • Idea
  • use arriving insertions to refill the sample

Not uniform!
34
Backup Random Pairing
  • Evaluation
  • data set consists of 1 million elements (on
    average)
  • 100k sample, clustered insertions/deletions

Data set: stable
Reservoir sampling: sample gets empty eventually
Random pairing: no base data access!
35
Backup Average Sample Size
  • Average sample size
  • stable dataset, 10M operations
  • sample size 100k

36
Backup Average Sample Size With Clustered
Insertions/Deletions
  • Average sample size with clustered
    insertions/deletions
  • stable dataset, size 10M, 8M operations
  • sample size 100k

37
Backup Cost
  • Cost
  • stable dataset, 10M operations
  • sample size 100k

38
Backup Cost With Clustered Insertions/Deletions
  • Cost with clustered insertions/deletions
  • stable dataset, size 10M, 8M operations
  • sample size 100k

39
Backup Resizing (Value of q)
  • Resizing
  • enlarge sample from 100k to 200k
  • base data access 10ms, arrival rate 1ms