New Sampling-Based Summary Statistics for Improving Approximate Query Answers, P. B. Gibbons and Y. Matias - PowerPoint PPT Presentation

Slides: 25
Provided by: DC294
Learn more at: http://crystal.uta.edu

1
New Sampling-Based Summary Statistics for
Improving Approximate Query Answers
P. B. Gibbons and Y. Matias (ACM SIGMOD 1998)
  • Rongfang Li
  • Feb 2007

2
Outline
  • Introduction
  • Concise samples
  • Counting samples
  • Application to hot list queries and Experimental
    evaluation
  • Conclusion

3
Introduction
  • In large data recording and warehousing
    environments, it is often advantageous to provide
    fast, approximate answers to queries.
  • Approximate answer engine.
  • The goal is to develop effective data synopses
    that capture important information in a concise
    representation.

Figure 1: A traditional data warehouse.
(Components shown: New Data, Data Warehouse, Queries, Response.)
Figure 2: Data warehouse set-up for providing approximate query answers.
(Components shown: New Data, Approx. Answer Engine, Data Warehouse, Queries, Response.)

4
Effectiveness of a Synopsis
  • Accuracy of the answers it provides
  • Fast response time
  • Update time
  • Large sample size is desirable
  • Effectiveness of a synopsis is evaluated as a
    function of its footprint, i.e., the number of
    memory words to store the synopsis.

5
Two new sampling-based synopses
  • Concise samples
  • <value, count> pairs represent those values that
    appear more than once in the sample
  • Counting samples
  • keep track of all occurrences of a value inserted
    into the relation since the value was selected
    for the sample.
  • Fast Incremental Maintenance

6
Outline
  • Introduction
  • Concise samples
  • Counting samples
  • Application to hot list queries and Experimental
    evaluation
  • Conclusion

7
Concise samples
  • Observation: storing every occurrence of a value
    that occurs frequently in a sample is a wasteful
    use of the available space.
  • Definition 1: A concise sample is a uniform
    random sample of the data set such that values
    appearing more than once in the sample are
    represented as a value and a count, e.g.,
    <value, count>.
  • More sample points than a traditional sample for
    the same footprint, hence more accurate answers

8
Concise samples: properties
  • Consider a relation R with n tuples and an
    attribute A. The goal is to obtain a uniform
    random sample of R.A, i.e., the values of A for a
    random subset of the tuples in R.
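  • Definition: Let $S = \{\langle v_1, c_1\rangle, \ldots,
    \langle v_j, c_j\rangle, v_{j+1}, \ldots, v_\ell\}$ be a
    concise sample. Then
    sample-size$(S) = \ell - j + \sum_{i=1}^{j} c_i$, and
    footprint$(S) = \ell + j$.
  • Lemma 1: For any footprint $m \ge 2$, there exist
    data sets for which the sample-size of a concise
    sample is n/m times larger than its footprint,
    where n is the size of the data set.
  • Proof idea: if the data set has at most m/2
    distinct values, the footprint is at most m even
    when all n tuples are in the sample.
  • That is, n/m times as many sample points as a
    traditional sample with the same footprint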
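
As a small illustration of the two definitions above, the following sketch stores a concise sample as a dictionary from value to count (a count of 1 stands for a singleton). The dictionary representation and the example values are assumptions made for illustration; they do not come from the slides.

```python
from typing import Dict


def sample_size(sample: Dict[str, int]) -> int:
    """Number of sample points: every occurrence of every value counts."""
    return sum(sample.values())


def footprint(sample: Dict[str, int]) -> int:
    """Memory words used: 1 per singleton, 2 per <value, count> pair."""
    return sum(1 if count == 1 else 2 for count in sample.values())


# Hypothetical concise sample with l = 4 distinct values, j = 2 of them pairs.
example = {"a": 5, "b": 3, "c": 1, "d": 1}

# sample-size = l - j + (c_1 + c_2) = 4 - 2 + 8, footprint = l + j = 4 + 2
print(sample_size(example), footprint(example))
```

Here the concise sample holds 10 sample points in 6 words, whereas a traditional sample of footprint 6 would hold only 6 points.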

9
Concise samples: offline/static
  • Offline/static computation
  • Footprint is m
  • Repeat m times: select a random tuple from the
    relation and extract its value for attribute A.
  • Semi-sort the set of values, and replace every
    value occurring multiple times with a
    <value, count> pair.
  • Continue to sample until either adding the sample
    point would increase the concise sample footprint
    to m+1 or n samples have been taken.
  • For each new value sampled, look up whether it
    is already in the concise sample and then either
    add a new singleton value, convert a singleton to
    a <value, count> pair, or increment the count for
    a pair.
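
The offline procedure above can be sketched roughly as follows. This is a minimal sketch, not the authors' implementation: it assumes sampling with replacement, keeps the sample in a dictionary from value to count instead of semi-sorting, and the names `offline_concise_sample` and `footprint` are hypothetical.

```python
import random
from typing import Dict, List, Optional


def footprint(sample: Dict[int, int]) -> int:
    """1 word per singleton, 2 words per <value, count> pair."""
    return sum(1 if c == 1 else 2 for c in sample.values())


def offline_concise_sample(values: List[int], m: int,
                           rng: Optional[random.Random] = None) -> Dict[int, int]:
    """Sample attribute values until the footprint would reach m + 1
    or n = len(values) sample points have been taken."""
    rng = rng or random.Random()
    sample: Dict[int, int] = {}
    used = 0  # current footprint in words
    for _ in range(len(values)):
        v = rng.choice(values)  # assumption: uniform sampling with replacement
        count = sample.get(v, 0)
        # A new singleton, or a singleton turning into a pair, costs one word;
        # incrementing an existing pair costs nothing.
        growth = 0 if count >= 2 else 1
        if used + growth > m:
            break  # adding this point would raise the footprint to m + 1
        sample[v] = count + 1
        used += growth
    return sample
```

For example, `offline_concise_sample([7] * 50, m=2)` ends with the single pair `{7: 50}`: a footprint of 2 words represents 50 sample points.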

10
Concise samples: incremental maintenance
  • More difficult than maintaining a traditional
    sample
  • Maintenance algorithm
  • Let S be the current concise sample and
    consider a new tuple t. Maintain an entry
    threshold $\tau$ (initially 1) for new tuples to be
    selected for the sample.
  • Add t.A to S with probability $1/\tau$ (flip a coin
    with heads probability $1/\tau$).
  • On heads, do a look-up on t.A in S:
  • (a) if it is represented by a pair, its count is
    incremented;
  • (b) if t.A is a singleton in S, a pair is created;
  • (c) if it is not in S, a singleton is created.
  • The footprint increases by 1 in cases (b) and (c).
  • If the footprint now exceeds m, evict existing
    sample points to create room: raise the threshold
    to some $\tau' > \tau$ and subject each sample point
    in S to this higher threshold, i.e., evict it with
    probability $1 - \tau/\tau'$. Subsequent inserts are
    selected for the sample with probability $1/\tau'$.
  • For any sequence of insertions, the above
    algorithm maintains a concise sample.
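
The maintenance steps above might look like this in code. It is a sketch under stated assumptions: the slides do not say how far to raise the threshold, so the multiplicative factor `growth = 1.1` is a hypothetical choice, as are the class and method names. The footprint is recomputed on every insert for clarity, not speed.

```python
import random
from typing import Dict


class ConciseSample:
    def __init__(self, m: int, seed: int = 0, growth: float = 1.1):
        self.m = m                # footprint bound
        self.tau = 1.0            # entry threshold, initially 1
        self.growth = growth      # assumption: raise tau by this factor
        self.counts: Dict[int, int] = {}  # value -> count (1 = singleton)
        self.rng = random.Random(seed)

    def footprint(self) -> int:
        return sum(1 if c == 1 else 2 for c in self.counts.values())

    def insert(self, value: int) -> None:
        # Select the new tuple with probability 1/tau.
        if self.rng.random() >= 1.0 / self.tau:
            return
        # Cases (a), (b), (c): increment a pair, turn a singleton into a
        # pair, or create a singleton.
        self.counts[value] = self.counts.get(value, 0) + 1
        # Evict until the sample fits again; repeat if nothing was evicted.
        while self.footprint() > self.m:
            self._raise_threshold()

    def _raise_threshold(self) -> None:
        new_tau = self.tau * self.growth
        keep = self.tau / new_tau  # each point is evicted w.p. 1 - tau/tau'
        survivors: Dict[int, int] = {}
        for value, count in self.counts.items():
            kept = sum(1 for _ in range(count) if self.rng.random() < keep)
            if kept:
                survivors[value] = kept
        self.counts = survivors
        self.tau = new_tau
```

A typical use is to call `insert` once per new tuple in the load stream; `counts` then holds the current concise sample and `tau` the current threshold.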

11
Experimental evaluation
  • Evaluate the gain in the sample-size of concise
    sample over traditional sample
  • Insert 500k new values from the value domain
    [1, D], where D is varied from 500 to 50k.
  • Footprint m = 1000
  • Compare three samples
  • traditional
  • concise online
  • concise offline

12
Concise samples: experimental evaluation
D: potential number of distinct values; m: footprint size
Figure 3: Comparing sample-sizes of concise and
traditional samples as a function of skew, for
varying footprints and D/m ratios. In (a) and
(b), the authors compare footprint 100 and footprint
1000, respectively, for the same data sets. In
(c) and (d), the authors compare D/m = 50 and
D/m = 5, respectively, for the same footprint 1000.

13
Concise samples
  • Update time overheads
  • The coin flips that must be performed to decide
    which inserts are added to the concise sample and
    to evict values from the concise sample when the
    threshold is raised
  • The lookups into the current concise sample to
    see if a value is already present in the sample

14
Outline
  • Introduction
  • Concise samples
  • Counting samples
  • Application to hot list queries and Experimental
    evaluation
  • Conclusion

15
Counting samples
  • Counting samples: a variation on concise samples
    in which the counts are used to keep track of all
    occurrences of a value inserted into the relation
    since the value was selected for the sample.
  • Definition: A counting sample for R.A with
    threshold $\tau$ is any subset of R.A obtained as
    follows:
  • 1. For each value v occurring c times in R, we
    flip a coin with probability $1/\tau$ of heads until
    the first heads, up to at most c coin tosses in
    all; if the i-th coin toss is heads, then v occurs
    $c - i + 1$ times in the subset, else v is not in
    the subset.
  • 2. Each value v occurring $c > 1$ times in the
    subset is represented as a pair <v, c>, and each
    value v occurring exactly once is represented as a
    singleton v.
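
The definition above can be simulated directly. The sketch below is only an illustration of the definition on a static list of values; the function name `counting_sample` and the dictionary representation (value to count) are assumptions.

```python
import random
from collections import Counter
from typing import Dict, Iterable, Optional


def counting_sample(values: Iterable[int], tau: float,
                    rng: Optional[random.Random] = None) -> Dict[int, int]:
    """Build a counting sample with threshold tau by following the definition."""
    rng = rng or random.Random()
    sample: Dict[int, int] = {}
    for v, c in Counter(values).items():
        # Flip a coin with heads probability 1/tau, at most c times.
        for i in range(1, c + 1):
            if rng.random() < 1.0 / tau:
                sample[v] = c - i + 1  # first heads on the i-th toss
                break
        # No heads within c tosses: v is not in the sample.
    return sample
```

With `tau = 1` every first toss is heads, so the sample reproduces the exact frequency of each value; larger thresholds drop rare values and undercount the rest.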

16
Counting samples (cont.)
  • Incremental maintenance algorithm
  • Let S be the current counting sample and t be a
    new tuple.
  • Maintain an entry threshold $\tau$ for new tuples
    to be selected.
  • Look up t.A in S.
  • If t.A is in S, its count is incremented (a
    singleton becomes a pair).
  • If t.A is not in S, it is added to S with
    probability $1/\tau$.
  • Deletion: if the value of a deleted tuple is in S,
    its count is decremented.
  • Theorem 4: Let R be an arbitrary relation, and let
    $\tau$ be the current threshold for a counting
    sample S. (i) Any value v that occurs at least
    $\tau$ times in R is expected to be in S. (ii) Any
    value v that occurs $f_v$ times in R will be in S
    with probability $1 - (1 - 1/\tau)^{f_v}$. (iii) For
    all $\alpha > 1$, if $f_v \ge \alpha\tau$, then with
    probability at least $1 - e^{-\alpha}$, the value
    will be in S and its count will be at least
    $f_v - \alpha\tau$.
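
To make the insertion rule and Theorem 4 more concrete, here is a small sketch. It assumes a fixed threshold (raising the threshold when the footprint overflows is left out), and the function names are hypothetical. The last two functions simply evaluate the expressions from parts (ii) and (iii).

```python
import math
import random
from typing import Dict


def insert(sample: Dict[int, int], value: int, tau: float,
           rng: random.Random) -> None:
    """Insert one tuple's value into a counting sample at a fixed threshold."""
    if value in sample:
        sample[value] += 1            # already selected: count every occurrence
    elif rng.random() < 1.0 / tau:
        sample[value] = 1             # newly selected with probability 1/tau


def inclusion_probability(f_v: int, tau: float) -> float:
    """Theorem 4 (ii): probability that a value with frequency f_v is in S."""
    return 1.0 - (1.0 - 1.0 / tau) ** f_v


def tail_bound(alpha: float) -> float:
    """Theorem 4 (iii): lower bound 1 - e^(-alpha) for f_v >= alpha * tau."""
    return 1.0 - math.exp(-alpha)


# Example: with tau = 100, a value occurring 200 times (alpha = 2) is in the
# sample with probability at least the bound from part (iii).
p = inclusion_probability(200, 100.0)
bound = tail_bound(2.0)
```

Because a value's count only starts once the value has been selected, frequent values are almost certainly present and their counts miss only the occurrences seen before selection.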

17
Outline
  • Introduction
  • Concise samples
  • Counting samples
  • Application to hot list queries and Experimental
    evaluation
  • Conclusion

18
Hot list queries
  • Hot list queries
  • request an ordered set of <value, count> pairs
    for the k most frequently occurring data values,
    for some k.
  • e.g., the top-selling items in a database of
    sales transactions.
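
A hot list query can be illustrated with a few lines of code. The first function computes the exact answer on the full data; the second is a rough sketch of answering from a sample by scaling sample counts by n / sample-size, which is a common estimator and an assumption here, not necessarily the exact reporting rule used in the paper. The sales data and function names are made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def hot_list(values: Iterable[str], k: int) -> List[Tuple[str, int]]:
    """Exact hot list: the k most frequent values, most frequent first."""
    return Counter(values).most_common(k)


def approximate_hot_list(sample: Dict[str, int], n: int,
                         k: int) -> List[Tuple[str, float]]:
    """Estimate the hot list from a sample (value -> count in the sample)
    of a relation with n tuples, scaling counts by n / sample-size."""
    size = sum(sample.values())
    if size == 0:
        return []
    scale = n / size
    ranked = sorted(sample.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return [(value, count * scale) for value, count in ranked]


# Hypothetical sales transactions.
sales = ["milk", "bread", "milk", "eggs", "milk", "bread"]
top2 = hot_list(sales, 2)  # [("milk", 3), ("bread", 2)]
```

The exact version needs a count for every distinct value, which is what the synopsis-based approaches try to avoid.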

19
Hot list queries
  • Algorithms
  • Using traditional samples
  • Using concise samples
  • Using counting samples
  • Using a full histogram on disk: maintains a full
    histogram on disk, i.e., <value, count> pairs for
    all distinct values in R, with a copy of the top
    m/2 pairs stored as a synopsis within the
    approximate answer engine.
  • This approach is considered only as a baseline
    for accuracy comparisons.

20
Application to hot list queries (cont.)
Plot axes: x-axis is the rank of a value; y-axis is
the count for that value.
21
Application to hot list queries (cont.)
22
Application to hot list queries (cont.)
23
Application to hot list queries: overheads
24
Conclusion
  • Using concise samples may offer the best choice
    when considering both accuracy and overheads.
  • The paper assumes batch-like processing of data
    warehouse inserts, in which inserts and queries
    do not intermix. To handle the more general
    case, issues of concurrency bottlenecks need to
    be addressed.
  • Future work is to explore the effectiveness of
    using concise samples and counting samples for
    other concrete approximate answer scenarios.