A Dip in the Reservoir: Maintaining Sample Synopses of Evolving Datasets. Rainer Gemulla (University of Technology Dresden), Wolfgang Lehner (University of Technology Dresden), Peter J. Haas (IBM Almaden Research Center)

1
A Dip in the Reservoir: Maintaining Sample Synopses of Evolving Datasets
Rainer Gemulla (University of Technology Dresden)
Wolfgang Lehner (University of Technology Dresden)
Peter J. Haas (IBM Almaden Research Center)
Faculty of Computer Science, Institute of Systems Architecture, Database Technology Group
2
Outline
  1. Introduction
  2. Deletions
  3. Resizing
  4. Experiments
  5. Summary

3
Random Sampling
  • Database applications
  • huge data sets
  • complex algorithms (in space and time)
  • Requirements
  • performance, performance, performance
  • Random sampling
  • approximate query answering
  • data mining
  • data stream processing
  • query optimization
  • data integration

Sampling fraction   Turnover in Europe (TPC-H)   Execution time
1%                  8.46 Mil. ± 0.15 Mil.        4s
10%                 8.51 Mil. ± 0.05 Mil.        52s
100%                8.54 Mil.                    200s
4
The Problem Space
  • Setting
  • arbitrary data sets
  • samples of the data
  • evolving data
  • Scope of this talk
  • maintenance of random samples
  • Can we minimize or even avoid access to base
    data?

5
Types of Data Sets
  • Data sets
  • variation of data set size
  • influence on sampling

  • stable data set → goal: stable sample
  • growing data set → goal: controlled growing sample
  • shrinking data set → uninteresting
6
Uniform Sampling
  • Uniform sampling
  • all samples of the same size are equally likely
  • many statistical procedures assume uniformity
  • flexibility
  • Example
  • a data set (also called population)
  • possible samples of size 2
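To make the uniformity requirement concrete, here is a toy illustration (the 4-element population is hypothetical, not from the slides): a uniform scheme must pick each size-2 subset with the same probability.

```python
from itertools import combinations

population = ['a', 'b', 'c', 'd']             # hypothetical 4-element data set
samples = list(combinations(population, 2))   # all possible samples of size 2
# C(4, 2) = 6 subsets; a uniform scheme selects each with probability 1/6
print(len(samples))  # → 6
```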

7
Reservoir Sampling
  • Reservoir sampling
  • computes a uniform sample of M elements
  • building block for many sophisticated sampling
    schemes
  • single-scan algorithm
  • add the first M elements
  • afterwards, flip a coin
  • ignore the element (reject)
  • replace a random element in the sample (accept)
  • accept probability of the i-th element is M/i
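The single-scan procedure above can be sketched as follows (a minimal version of the classic reservoir algorithm; variable names are my own):

```python
import random

def reservoir_sample(stream, M, rng=random):
    """Single-scan reservoir sampling: maintains a uniform sample of M elements."""
    sample = []
    for i, item in enumerate(stream, start=1):
        if i <= M:
            sample.append(item)              # add the first M elements
        elif rng.random() < M / i:           # accept the i-th element w.p. M/i
            sample[rng.randrange(M)] = item  # replace a random sample element
        # otherwise reject: the arriving element is ignored
    return sample
```

After any prefix of the stream, every size-M subset of the elements seen so far is equally likely.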

8
Reservoir Sampling (Example)
  • Example
  • sample size M = 2

9
Problems with Reservoir Sampling
  • Problems with reservoir sampling
  • lacks support for deletions (stable data sets)
  • cannot efficiently enlarge sample (growing data
    sets)

10
Outline
  1. Introduction
  2. Deletions
  3. Resizing
  4. Experiments
  5. Summary

11
Naïve/Prior Approaches
Algorithm                   Technique                                                Comments
RS with deletions           conduct deletions, continue with smaller sample          unstable
Naïve                       use insertions to immediately refill the sample          not uniform
Backing sample              let sample size decrease, but occasionally recompute     expensive, unstable
CAR(WOR)                    immediately sample from base data to refill the sample   stable but expensive
Bernoulli s. with purging   coin flip sampling with deletions, purge if too large    inexpensive but unstable
Passive sampling            developed for data streams (sliding windows only)       special case of our RP algorithm
Distinct-value sampling     tailored for multiset populations                        expensive, low space efficiency in our setting
12
Random Pairing
  • Random pairing
  • compensates deletions with arriving insertions
  • corrects inclusion probabilities
  • General idea (insertion)
  • no uncompensated deletions → reservoir sampling
  • otherwise,
  • randomly select an uncompensated deletion
    (partner)
  • compensate it: was it in the sample?
  • yes → add arriving element to sample
  • no → ignore arriving element

13
Random Pairing
  • Example

14
Random Pairing
  • Details of the algorithm
  • keeping history of deleted items is expensive,
    but
  • maintenance of two counters suffices
  • correctness proof is in the paper
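The two-counter bookkeeping can be sketched like this (a simplified illustration of the scheme described in the talk; names such as `c_in`/`c_out` are my own, and the paper's algorithm has further details):

```python
import random

class RandomPairing:
    """Sketch of Random Pairing: arriving insertions compensate earlier
    deletions; only two counters are kept, no deletion history."""

    def __init__(self, M, rng=random):
        self.M = M            # sample-size bound
        self.sample = []
        self.size = 0         # current data set size
        self.c_in = 0         # uncompensated deletions that were in the sample
        self.c_out = 0        # uncompensated deletions that were not
        self.rng = rng

    def insert(self, item):
        self.size += 1
        d = self.c_in + self.c_out
        if d == 0:
            # no uncompensated deletions: plain reservoir step
            if len(self.sample) < self.M:
                self.sample.append(item)
            elif self.rng.random() < self.M / self.size:
                self.sample[self.rng.randrange(self.M)] = item
        elif self.rng.random() < self.c_in / d:
            self.c_in -= 1
            self.sample.append(item)   # partner was in the sample: take its place
        else:
            self.c_out -= 1            # partner was outside: ignore the arrival

    def delete(self, item):
        self.size -= 1
        if item in self.sample:
            self.sample.remove(item)
            self.c_in += 1
        else:
            self.c_out += 1
```

Note that `delete` and `insert` never touch the base data, which is the point of the scheme.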

15
Outline
  1. Introduction
  2. Deletions
  3. Resizing
  4. Experiments
  5. Summary

16
Growing Data Sets
  • The problem
  • growing data set

Data set: growing
Random pairing: stable sample, sampling fraction decreases
17
A Negative Result
  • Negative result
  • There is no resizing algorithm which can enlarge
    a bounded-size sample without ever accessing base
    data.
  • Example
  • data set
  • samples of size 2
  • new data set
  • samples of size 3

Not uniform!
18
Resizing
  • Goal
  • efficiently increase sample size
  • stay within an upper bound at all times
  • General idea
  • convert sample to Bernoulli sample
  • continue Bernoulli sampling until new sample size
    is reached
  • convert back to reservoir sample
  • Optimally balance cost
  • cost of base data accesses (in step 1)
  • time to reach new sample size (in step 2)

19
Resizing
  • Bernoulli sampling
  • uniform sampling scheme
  • each tuple is added to the sample with
    probability q
  • sample size follows a binomial distribution → no
    effective upper bound
  • Phase 1: conversion to a Bernoulli sample
  • given q, randomly determine sample size
  • reuse reservoir sample to create Bernoulli sample
  • subsample
  • sample additional tuples (base data access)
  • choice of q
  • small → fewer base data accesses
  • large → more base data accesses
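The Phase 1 conversion might look like the following sketch (based only on the talk's description; `fetch_from_base` is a hypothetical callback standing in for the expensive base-data access):

```python
import random

def to_bernoulli(reservoir, dataset_size, q, fetch_from_base, rng=random):
    """Phase 1 sketch: convert a reservoir sample into a Bernoulli(q) sample.
    The Bernoulli sample size is Binomial(|D|, q); if it fits in the
    reservoir we subsample, otherwise extra tuples come from the base data."""
    # draw the target size from Binomial(dataset_size, q)
    target = sum(rng.random() < q for _ in range(dataset_size))
    if target <= len(reservoir):
        return rng.sample(reservoir, target)          # reuse: subsample reservoir
    extra = fetch_from_base(target - len(reservoir))  # base data access
    return list(reservoir) + list(extra)
```

A small q makes the Binomial draw likely to fit inside the reservoir (no base access); a large q makes a fetch more likely, matching the cost trade-off above.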

20
Resizing
  • Phase 2: run Bernoulli sampling
  • accept new tuples with probability q
  • conduct deletions
  • stop as soon as new sample size is reached
  • Phase 3: revert to reservoir sampling
  • switchover is trivial
  • Choosing q
  • determines cost of Phase 1 and Phase 2
  • goal: minimize total cost
  • base data access expensive → small q
  • base data access cheap → large q
  • details in paper
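Phase 2 can be sketched as a simple loop over arriving operations (the `('+', x)` / `('-', x)` encoding of insertions and deletions is my own, not from the talk):

```python
import random

def grow_bernoulli(sample, q, new_size, ops, rng=random):
    """Phase 2 sketch: keep Bernoulli(q) sampling insertions (and applying
    deletions) until the sample reaches the new target size."""
    sample = list(sample)
    for op, item in ops:
        if op == '+' and rng.random() < q:
            sample.append(item)            # accept new tuple w.p. q
        elif op == '-' and item in sample:
            sample.remove(item)            # conduct deletions
        if len(sample) >= new_size:
            break                          # Phase 3: revert to reservoir sampling
    return sample
```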

21
Resizing
  • Example
  • resize by 30% if sampling fraction drops below 9%
  • dependent on costs of accessing base data

Low costs → immediate resizing
Moderate costs → combined solution
High costs → degenerates to Bernoulli sampling
22
Outline
  1. Introduction
  2. Deletions
  3. Resizing
  4. Experiments
  5. Summary

23
Total Cost
  • Total cost
  • stable dataset, 10M operations
  • sample size 100k, data access 10 times more
    expensive than sample access

(Chart: total cost, grouped into techniques with base data access vs. without)
24
Sample size
  • Sample size
  • stable dataset, size 1M
  • sample size 100k

(Chart: sample size over time, techniques with base data access vs. without)
25
Outline
  • Introduction
  • Deletions
  • Resizing
  • Experiments
  • Summary

26
Summary
  • Reservoir Sampling
  • lacks support for deletions
  • complete recomputation to enlarge the sample
  • Random Pairing
  • uses arriving insertions to compensate for
    deletions
  • Resizing
  • base data access cannot be avoided
  • minimizes total cost
  • Future work
  • better q for resizing
  • combine with existing techniques [4, 8, 17] to
    enhance flexibility, scalability

27
Thank you!
Questions?
28
Backup Bounded-Size Sampling
  • Why sampling?
  • performance, performance, performance
  • How much to sample?
  • influencing factors
  • storage consumption
  • response time
  • accuracy
  • choosing the sample size / sampling fraction
  • largest sample that meets storage requirements
  • largest sample that meets response time
    requirements
  • smallest sample that meets accuracy requirements

29
Backup Bounded-Size Sampling
  • Example
  • random pairing vs. Bernoulli sampling
  • average estimation

(Figures: data set size, sample size, and standard error over time; in the sample-size plot BS violates requirements 1 and 2, in the standard-error plot requirement 3)
30
Backup Distinct-Value Sampling
  • Distinct-value sampling (optimistic setting for
    DV)
  • DV-scheme knows avg. dataset size in advance
  • assume no storage for counters and hash functions
(Charts: sample size and execution time; axis scales omitted)
RP has better memory utilization
RP is significantly faster
31
Backup RS With Deletions
  • Reservoir sampling with deletions
  • conduct deletions, continue with smaller sample
    size

32
Backup Backing Sample
  • Evaluation
  • data set consists of 1 million elements (on
    average)
  • 100k sample, clustered insertions/deletions

Data set: stable
Reservoir sampling: sample is empty eventually
Backing sample: expensive, unstable
33
Backup An Incorrect Approach
  • Idea
  • use arriving insertions to refill the sample

Not uniform!
34
Backup Random Pairing
  • Evaluation
  • data set consists of 1 million elements (on
    average)
  • 100k sample, clustered insertions/deletions

Data set: stable
Reservoir sampling: sample gets empty eventually
Random pairing: no base data access!
35
Backup Average Sample Size
  • Average sample size
  • stable dataset, 10M operations
  • sample size 100k

36
Backup Average Sample Size With Clustered
Insertions/Deletions
  • Average sample size with clustered
    insertions/deletions
  • stable dataset, size 10M, 8M operations
  • sample size 100k

37
Backup Cost
  • Cost
  • stable dataset, 10M operations
  • sample size 100k

38
Backup Cost With Clustered Insertions/Deletions
  • Cost with clustered insertions/deletions
  • stable dataset, size 10M, 8M operations
  • sample size 100k

39
Backup Resizing (Value of q)
  • Resizing
  • enlarge sample from 100k to 200k
  • base data access 10ms, arrival rate 1ms