1
Approximate Query Processing
(slides based on the following tutorials)
  • In SIGMOD 2001 by Haas and Hellerstein
  • In VLDB 2001 by Garofalakis and Gibbons

2
Outline
  • Intro & Motivation
  • Online query processing
  • Usage examples
  • Basic sampling techniques and estimators
  • Preferential data delivery
  • Online join algorithms
  • Synopses-based query processing
  • One-Dimensional Synopses
  • Multi-Dimensional Synopses and Joins
  • Set-Valued Queries
  • Conclusions

3
Introduction & Motivation
[Diagram: SQL Query → Decision Support System (DSS) → Exact Answer — Long Response Times!]
  • Exact answers NOT always required
  • DSS applications usually exploratory
  • Aggregate queries: precision to the last decimal
    not needed
  • Preview answers while waiting; run trial queries

4
Wait
5
Fast Approximate Answers
  • Primarily for Aggregate queries
  • Goal: quickly report the leading digits of
    answers
  • In seconds instead of minutes or hours
  • Most useful if we can provide error guarantees
  • E.g., average salary:
  • $59,000 +/- $500 (with 95%
    confidence) in 10 seconds
  • vs. $59,152.25
    in 10 minutes
  • How?
  • Use a sample instead of the whole data
  • Use some synopses of the data
  • Speed-up

6
Approximate Query Answering
  • Basic Approach 1: Online Query Processing
  • Sampling at query time
  • Answers continually improve, under user control

7
Approximate Query Answering
  • Basic Approach 2: Precomputed Synopses
  • Construct synopses prior to query time
  • Use them to answer queries
  • Need to keep synopses up-to-date

8
Online Processing
  • Goal
  • Maximize the gain
  • Gain defined on-the-fly
  • May need FEEDBACK

[Graph: feedback quality (up to 100%) over time]
9
Online Aggregation
  • SELECT AVG(temp) FROM t GROUP BY site
  • 330K rows in table
  • The exact answer:

10
Online Aggregation, cont'd
  • A simple online aggregation interface (after 74
    rows)

11
Online Aggregation, cont'd
  • After 834 rows

12
Example: Online Aggregation
Additional features: Speed up, Slow down, Terminate
(a sketch of a running per-group estimator follows below)
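
To make the running-estimate idea concrete, here is a minimal sketch (mine, not the tutorial's implementation) of online aggregation for SELECT AVG(temp) FROM t GROUP BY site: per-group running averages with large-sample (CLT) 95% confidence intervals, assuming rows arrive in random order. The RunningGroup class, the refresh interval, and the (site, temp) stream format are illustrative assumptions.

```python
import math
from collections import defaultdict

Z_95 = 1.96  # normal quantile for a 95% confidence interval

class RunningGroup:
    """Running average and variance (Welford) for one GROUP BY group."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # sum of squared deviations from the running mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def half_width(self):
        # large-sample (CLT) interval: mean +/- z * s / sqrt(n)
        if self.n < 2:
            return float("inf")
        s = math.sqrt(self.m2 / (self.n - 1))
        return Z_95 * s / math.sqrt(self.n)

def online_avg(rows, report_every=1000):
    """rows yields (site, temp) pairs in random order."""
    groups = defaultdict(RunningGroup)
    for i, (site, temp) in enumerate(rows, start=1):
        groups[site].add(temp)
        if i % report_every == 0:   # refresh the running display
            for s, g in sorted(groups.items()):
                print(f"{s}: {g.mean:.2f} +/- {g.half_width():.2f} ({g.n} rows)")
    return groups
```
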
13
Online Data Visualization
  • In ALVIN (Alberta System for Visualizing Large
    Networks)

14
Online Data Visualization (cont)
  • Web sites in ALVIN

15
Online Data Visualization (cont)
  • In Tioga DataSplash

16
Road Map
  • Background and motivation
  • Examples of online techniques
  • Underlying technology
  • Building blocks: sampling, estimation
  • Preferential data delivery
  • Pipelined adaptive processing algorithms
  • Related work
  • Looking forward

17
Sampling Design Issues
  • Granularity of sample
  • Instance-level (row-level): high I/O cost
  • Block-level (page-level): high variability due to
    clustering
  • Type of sample
  • Often simple random sample (SRS)
  • Especially for on-the-fly sampling
  • With/without replacement: usually not critical
  • Data structure from which to sample
  • Files or relational tables
  • Indexes (B-trees, etc.)

18
Row-level Sampling Techniques
  • Maintain file in random order
  • Sampling scan
  • Is file initially in random order?
  • Statistical tests needed, e.g., runs test,
    Smirnov test
  • In DB systems: cluster via a RAND function
  • Must freshen ordering (online reorg)
  • On-the-fly sampling
  • Via index on a random column
  • Else get a random page, then a row within the page
  • Ex: extent-map sampling
  • Problem: variable number of records per page

19
Acceptance/Rejection Sampling
  • Accept a row on page i with probability ni/nMAX,
    where ni = number of rows on page i and nMAX = maximum
    rows per page (sketch below)
  • Commonly used in other settings
  • E.g. sampling from joins
  • E.g. sampling from indexes

[Diagram: original pages vs. pages conceptually padded to nMAX rows each]
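
A minimal sketch of the acceptance/rejection idea, assuming simple page/row accessors (rows_on_page and read_row are hypothetical helpers): a page is picked uniformly, and a row on it is accepted with probability ni/nMAX, which makes every row in the file equally likely.

```python
import random

def sample_one_row(num_pages, rows_on_page, read_row, n_max):
    """rows_on_page(p) -> n_i, the number of rows on page p;
       read_row(p, j)  -> the j-th row on page p (both assumed helpers)."""
    while True:
        p = random.randrange(num_pages)        # uniform random page
        n_i = rows_on_page(p)
        if random.random() < n_i / n_max:      # accept the page w.p. n_i / n_max
            return read_row(p, random.randrange(n_i))
        # reject and retry: each row is returned with equal probability
        # 1 / (num_pages * n_max) per attempt, i.e. the sample is uniform
```
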
20
Cost of Row-Level Sampling
  • 100,000 pages
  • 200 rows/page

21
Estimation for Aggregates
  • Point estimates
  • Easy: SUM, COUNT, AVERAGE
  • Hard: MAX, MIN, quantiles, distinct values
  • Confidence intervals: a measure of precision

22
Confidence Intervals
23
The Good and Bad News
  • Good news: 1/√n magic (n chosen on-the-fly)
  • Bad news: the needle-in-a-haystack problem

24
Sampling Deployed in Industry
  • Simulated Bernoulli sampling
  • SQL: SELECT … WHERE RAND() < 0.01
  • Similar capability in SAS
  • Bernoulli Sampling with pre-specified rate
  • Informix, Oracle 8i, (DB2)
  • Ex: SELECT … FROM T1 SAMPLE ROW(10%), T2
  • Ex: SELECT … FROM T1 SAMPLE BLOCK(10%), T2
  • Not for novices
  • Need to pre-specify precision
  • no feedback/control
  • recall the multiresolution patterns from
    example
  • No estimators provided in current systems

25
Road Map
  • Background and motivation
  • Examples of online techniques
  • Underlying technology
  • Building blocks: sampling, estimation
  • Preferential data delivery
  • Pipelined adaptive processing algorithms
  • Related technology: precomputation
  • Looking forward

26
Preferential Data Delivery
  • Why needed?
  • Speedup/slowdown arrows
  • Idea: online reordering
  • Deliver interesting items first
  • Interesting determined on the fly
  • Exploit rate gap between produce and
    process/consume

27
Mechanism
  • Two threads: (1) prefetch from input,
    (2) spool/enrich from an auxiliary side disk
  • Juggle data between buffer and side disk
  • Keep buffer full of interesting items
  • getNext chooses the best item currently in the buffer
  • getNext and enrich/spool decisions are based on
    the reordering policy

28
Policies
[Graph: quality of feedback (QOF) vs. time]
  • A good permutation of items t1…tn to t′1…t′n
  • Quality of feedback for a prefix t′1, t′2, …, t′k:
  • QOF(UP(t′1), UP(t′2), …, UP(t′k)), UP = user
    preference
  • E.g., QOF ∝ Σi UPi / (ni)½,
  • ni = number of tuples processed from group i
  • preference acts as a weight on the confidence
    interval
  • goodness of reordering: dQOF/dt
  • implication for the juggle mechanism (sketch below):
  • process gets item from buffer that increases QOF
    the most
  • juggle tries to maintain buffer with such items
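
A minimal sketch (an illustration under assumptions, not the tutorial's juggle implementation) of a greedy getNext for the example QOF above: processing one more tuple of group i shrinks its preference-weighted confidence-interval term UPi/√ni, so we hand out an item from the buffered group with the largest shrinkage. The dictionary-based buffer and counters are assumed in-memory structures.

```python
import math

def marginal_benefit(up_i, n_i):
    """Reduction of UP_i / sqrt(n_i) if group i gets one more processed tuple."""
    if n_i == 0:
        return float("inf")          # an unseen group is always most urgent
    return up_i * (1.0 / math.sqrt(n_i) - 1.0 / math.sqrt(n_i + 1))

def get_next(buffer_by_group, processed, user_pref):
    """buffer_by_group: {group: [items]}; processed: {group: n_i};
       user_pref: {group: UP_i} -- all assumed in-memory structures."""
    candidates = [g for g, items in buffer_by_group.items() if items]
    if not candidates:
        return None                  # buffer empty; wait for the prefetch thread
    best = max(candidates,
               key=lambda g: marginal_benefit(user_pref.get(g, 1.0),
                                              processed.get(g, 0)))
    processed[best] = processed.get(best, 0) + 1
    return buffer_by_group[best].pop()   # the juggle keeps refilling the buffer
```
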

29
Road Map
  • Background and motivation
  • Examples of online techniques
  • Underlying technology
  • Building blocks: sampling, estimation,
    pre-computation
  • Preferential data delivery
  • Pipelined adaptive processing algorithms
  • Related work
  • Looking forward

30
Pipelined Data Processing
  • Never, ever wait for anything to finish
  • Selection: no problem
  • Grouping: hash, don't sort
  • Sorting: juggle if possible
  • Joins?
  • Sample of joins vs. join of samples

SELECT AVG(R.a * S.b) FROM R, S WHERE R.c = S.c
31
Traditional Nested Loops
[Diagram: nested-loops join sweeps the R × S grid one complete row at a time — all of S is scanned for each tuple of R]
32
Ripple Joins
  • designed for online performance goals
  • simplest version (see the sketch below):
  • read new tuples r from R and s from S
  • join r and s
  • join r with old S tuples
  • join s with old R tuples
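
A minimal sketch of the simplest (square) ripple join for the query above, assuming both inputs fit in memory and are delivered in random order; predicate and output are placeholders for the join condition and the running-aggregate bookkeeping.

```python
def ripple_join(r_iter, s_iter, predicate, output):
    """r_iter/s_iter yield tuples in random order; predicate(r, s) -> bool."""
    seen_r, seen_s = [], []
    for r, s in zip(r_iter, s_iter):   # one new tuple from each side per step
        for old_s in seen_s:           # new r against all old S tuples
            if predicate(r, old_s):
                output(r, old_s)
        for old_r in seen_r:           # new s against all old R tuples
            if predicate(old_r, s):
                output(old_r, s)
        if predicate(r, s):            # new r against new s
            output(r, s)
        seen_r.append(r)
        seen_s.append(s)
```

For SELECT AVG(R.a * S.b) …, a running average of r.a * s.b over the emitted pairs gives an online estimate that is refined as the sampled rectangle grows.
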

33
Basic Ripple Join
[Diagram: the basic (square) ripple join covers the R × S grid in growing square layers]
34
Block Ripple Joins (Size 2)
[Diagram: block ripple join over the R × S grid, block size 2]

35
Rectangular Ripple Join
[Diagram: rectangular ripple join — R and S sampled at different rates]

36
Ripple Joins, cont'd
  • Variants
  • Block: minimizes I/O in alternating nested loops
  • Index: coincides with index nested-loops join
  • Hash: symmetric hash tables
  • Scalability issues (ongoing work!)
  • Memory compaction
  • Parallelism
  • Optimization issues
  • E.g., not clear how to choose between variants

37
Road Map
  • Background and motivation
  • Examples of online techniques
  • Underlying technology
  • Related work
  • Online query processing
  • Precomputation
  • Looking forward

38
Related Work on Online QP
  • Morgenstein's PhD, Berkeley '80
  • Online Association Rules
  • Ng et al.'s CAP, SIGMOD '98
  • Hidber's CARMA, SIGMOD '99
  • Implications for deductive DB semantics
  • Monotone aggregation in LDL, Zaniolo and Wang
  • Online agg with subqueries
  • Tan et al., VLDB '99
  • Dynamic Pipeline Scheduling
  • Urhan/Franklin, VLDB '01
  • Pipelining Hash Joins
  • Raschid, Wilschut/Apers, Tukwila, Xjoin
  • Relation to semi-naive evaluation
  • Anytime Algorithms
  • Zilberstein, Russell, et al.

39
Precomputation and Synopses-based query processing
  • Intro
  • One-dimensional synopses
  • Using synopses for query processing
  • Maintaining synopses
  • Some other related issues
  • Conclusions

40
Overview
  • Past work
  • Materialized views
  • Histograms
  • Recent work
  • OLAP Data Cubes (drill-down hierarchies)
  • Extended use of histograms
  • More novel synopses
  • E.g., random samples, wavelets

41
Online vs. Precomputed
  • Online
  • Seeing the approximate answer
  • Continuous refinement of answers (online
    aggregation)
  • User control: what to refine, when to stop
  • No maintenance overheads
  • Precomputed
  • Often seeing entire data
  • Often faster: small synopses can reside in
    memory or cache
  • Can use with any DBMS (no need for sampling or
    index striding)
  • Also effective for remote or streaming data

42
Commercial DBMS
  • Oracle, IBM Informix: sampling operator
    (online)
  • IBM DB2: IBM Almaden is working on a prototype
    version of DB2 that supports sampling. The user
    specifies a priori the amount of sampling to be
    done.
  • Microsoft SQL Server: new "auto statistics"
    extract statistics (e.g., histograms) using fast
    sampling, enabling the Query Optimizer to use the
    latest information. The index
    tuning wizard uses sampling to build statistics.
  • see CN97, CMN98, CN98
  • In summary, not much announced yet

43
Histograms
  • Partition attribute value(s) domain into a set of
    buckets
  • Issues
  • How to partition
  • What to store for each bucket
  • How to estimate an answer using the histogram
  • Long history of use for selectivity estimation
    within a query optimizer Koo80, PSC84, etc.

44
1-D Histograms Equi-Depth
[Chart: equi-depth histogram — count per bucket over domain values 1–20]
  • Goal: equal number of rows per bucket (B
    buckets in all)
  • Can construct by first sorting, then taking B-1
    equally-spaced splits
  • Faster construction: sample, then take equally-spaced
    splits in the sample (see the sketch below)
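
A minimal sketch of the construction above: sort the values (or a uniform sample of them) and cut at B-1 equally spaced positions; when built from a sample, each bucket's count is taken as roughly |R|/B rather than the raw sample count.

```python
def equi_depth_buckets(values, B):
    """Return B buckets as (low, high, count) triples; assumes len(values) >= B."""
    v = sorted(values)
    n = len(v)
    buckets = []
    for i in range(B):
        lo = i * n // B              # equally spaced split positions
        hi = (i + 1) * n // B - 1
        buckets.append((v[lo], v[hi], hi - lo + 1))
    return buckets
```
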

45
Answering Queries Equi-Depth
  • Answering queries
  • select count(*) from R where 4 <= R.A <= 15
  • approximate answer: F × |R|/B, where
  • F = number of buckets that overlap the range
  • error guarantee: at most 2 × |R|/B (sketch below)

[Chart: histogram buckets over domain values 1–20, with the range 4 ≤ R.A ≤ 15 highlighted]
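
A minimal sketch of the estimate above, pairing with the construction sketch: each of the F buckets overlapping the range contributes its count (roughly |R|/B rows when the histogram is exact).

```python
def estimate_range_count(buckets, lo, hi):
    """buckets: (b_lo, b_hi, count) triples as built above; range is [lo, hi]."""
    return sum(count for b_lo, b_hi, count in buckets
               if b_hi >= lo and b_lo <= hi)   # bucket overlaps [lo, hi]
```
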
46
Answering Queries Histograms
  • Answering queries from 1-D histograms (in
    general)
  • (Implicitly) map the histogram back to an
    approximate relation, apply the
    query to the approximate relation
  • Continuous value mapping SAC79

Count spread evenly among bucket values
- Uniform spread mapping PIH96
47
Maintaining Equi-Depth Histograms
[Chart: equi-depth histogram — count per bucket over domain values 1–20]
  • Can maintain using one-pass algorithms, or
  • Use a backing sample GMP97b Maintain a larger
    sample on disk in support of histogram
    maintenance
  • Keep histogram bucket counts up-to-date by
    incrementing on row insertion, decrementing on
    row deletion
  • Merge adjacent buckets with small counts
  • Split any bucket with a large count, using the
    sample to select a split value, i.e., take the median
    of the sample points in the bucket's range
  • Keeps counts within a factor of 2; for more equal
    buckets, can recompute from the sample

48
Self-Tuning
  • 1. Tune Bucket Frequencies
  • Compare actual selectivity to histogram estimate
  • Use to adjust bucket frequencies

AC99
[Figure: histogram buckets overlapping the query range]
Actual = 60, Estimate = 40, Error = +20
  • Distribute d × Error proportionately across the buckets
    in the query range, where d is a dampening factor
  • With d = ½, distribute ½ of Error = 10; so divide it as 4, 3, 3
    (see the sketch below)
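
A minimal sketch following the worked numbers above (the proportional split is an illustrative choice, not necessarily the exact AC99 rule): the dampened estimation error on a range is distributed across the buckets overlapping that range.

```python
def tune_buckets(freqs, in_range, actual, estimate, d=0.5):
    """freqs: bucket frequencies; in_range: indices of buckets overlapping the
       query range; actual/estimate: observed vs. histogram answer; d: dampening."""
    error = d * (actual - estimate)              # dampened error to distribute
    total = sum(freqs[i] for i in in_range) or 1
    for i in in_range:
        freqs[i] += error * freqs[i] / total     # proportional share of the error
    return freqs

# Slide example: actual = 60, estimate = 40, d = 1/2 -> distribute 10 across
# the buckets in the range (e.g., 4, 3, 3 when split roughly proportionally).
```
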
49
Self-Tuning 1-D Histograms
  • 2. Restructure after some updates
  • Merge buckets of near-equal frequencies
  • Split large frequency buckets

50
Histogram Variants
  • Equi-depth
  • Sum of the frequencies in each bucket is the same
  • Equi-depth compressed
  • Equi-depth with singleton buckets for a few
    highest frequencies
  • Equi-width
  • For each bucket i, (Maxi - Mini) = (1/B) × (Max - Min)
  • V-Optimal
  • Minimizes the total variance within buckets
  • V-Optimal End-biased
  • V-optimal with some of the highest and some of
    the lowest frequencies placed in individual
    buckets

51
Sampling Basics
  • Idea: a small random sample S of the data often
    well-represents all the data
  • For a fast approx answer, apply the query to S and
    scale the result
  • E.g., R.a is 0 or 1, S is a 20% sample
  • select count(*) from R where R.a = 0
  • select 5 * count(*) from S where S.a = 0

R.a
1 1 0 1 1 1 1 1 0 0 0 0 1 1 1 1 1 0 1 1 1 0 1 0
1 1 0 1 1 0
Values belonging to S shown in red
Est. count = 5 × 2 = 10, Exact count = 10
  • Unbiased: for expressions involving count, sum,
    avg, the estimator
  • is unbiased, i.e., the expected value of the
    answer is the actual answer,
  • even for (most) queries with predicates!
  • Error is bound using some confidence interval

52
Sampling Confidence Intervals
Confidence intervals for AVERAGE (select avg(R.A) from R):
  • Large-sample (CLT) interval: avg(S) ± z_p × σ(S) / √|S|,
    holding with probability p
  • σ(R) = standard deviation of the values of R.A;
    σ(S) = s.d. of the sampled values S.A
  • In the presence of predicates, S above is the
    subset of the sample that satisfies the predicate
53
Sampling from Databases
  • Sampling disk-resident data is slow
  • Row-level sampling has high I/O cost
  • must bring in entire disk block to get the row
  • Block-level sampling: rows may be highly
    correlated
  • Random access pattern, possibly via an index
  • Need acceptance/rejection sampling to account for
    the variable number of rows in a page, children
    in an index node, etc.
  • Alternatives
  • Random physical clustering: destroys natural
    clustering
  • Precomputed samples: maintenance issues

54
One-Pass Uniform Sampling
  • Best choice for incremental maintenance
  • Low overheads, no random data access
  • Reservoir Sampling Vit85: maintains a sample S
    of a fixed size M (see the sketch below)
  • Add each new item to S with probability M/N,
    where N is the current number of data items
  • If add an item, evict a random item from S
  • Instead of flipping a coin for each item,
    determine the number of items to skip before the
    next to be added to S
  • To handle deletions,
  • remove from S if deleted item is in S, else
    ignore
  • If |S| < M/2, get a new S using another pass
    (happens only if we delete roughly half the items,
    so the cost is fully amortized) GMP97b
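
A minimal sketch of reservoir sampling Vit85 as described above (the skip-counting optimization and deletion handling are omitted for clarity).

```python
import random

def reservoir_sample(stream, M):
    """One pass, no random data access: returns a uniform sample of size M."""
    S = []
    for n, item in enumerate(stream, start=1):   # n = items seen so far
        if n <= M:
            S.append(item)                       # fill the reservoir first
        elif random.random() < M / n:            # keep the new item w.p. M/n
            S[random.randrange(M)] = item        # evict a random resident
    return S
```
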

55
One-Dimensional Haar Wavelets
  • Wavelets: mathematical tool for hierarchical
    decomposition of functions/signals
  • Haar wavelets: simplest wavelet basis, easy to
    understand and implement
  • Recursive pairwise averaging and differencing at
    different resolutions (see the table and sketch below)

Resolution    Averages                    Detail Coefficients
3             [2, 2, 0, 2, 3, 5, 4, 4]    ----
2             [2, 1, 4, 4]                [0, -1, -1, 0]
1             [1.5, 4]                    [0.5, 0]
0             [2.75]                      [-1.25]
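
A minimal sketch of the recursive pairwise averaging and differencing shown in the table (unnormalized Haar transform; the input length is assumed to be a power of two).

```python
def haar_transform(data):
    """Return [overall average] + detail coefficients, coarsest first."""
    avgs = list(data)
    details = []
    while len(avgs) > 1:
        pairs = [(avgs[i], avgs[i + 1]) for i in range(0, len(avgs), 2)]
        level_details = [(a - b) / 2 for a, b in pairs]   # differencing
        avgs = [(a + b) / 2 for a, b in pairs]            # averaging
        details = level_details + details                 # coarser levels first
    return avgs + details

# haar_transform([2, 2, 0, 2, 3, 5, 4, 4]) -> [2.75, -1.25, 0.5, 0, 0, -1, -1, 0]
```
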
56
Haar Wavelet Coefficients
  • Hierarchical decomposition structure (a.k.a.
    error tree)

[Diagram: error tree — each coefficient's support over the original data]
57
Wavelet-based Histograms MVW98
  • Problem: range-query selectivity estimation
  • Key idea: use a compact subset of Haar/linear
    wavelet coefficients for approximating the data
    distribution
  • Steps
  • compute cumulative data distribution C
  • compute Haar (or linear) wavelet transform of C
  • coefficient thresholding: only b << |C|
    coefficients can be kept
  • take largest coefficients in absolute normalized
    value (see the sketch below)
  • Haar basis: divide coefficients at resolution j
    by √(2^j)
  • Optimal in terms of the overall Mean Squared
    (L2) Error
  • Greedy heuristic methods
  • Retain coefficients leading to large error
    reduction
  • Throw away coefficients that give small increase
    in error
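
A minimal sketch of coefficient thresholding, reusing haar_transform from the earlier sketch: keep the b coefficients with the largest absolute normalized value, dividing a coefficient at resolution j by √(2^j) as above. For the wavelet-based histogram the input would be the cumulative distribution C; always keeping the overall average is an illustrative choice.

```python
import math

def top_b_coefficients(data, b):
    """Return a sparse {index: value} synopsis of the b most significant
       Haar coefficients of `data` (length assumed a power of two)."""
    coeffs = haar_transform(data)            # [overall avg, details coarse->fine]
    keep = {0: coeffs[0]}                    # always keep the overall average
    scored = []
    for i in range(1, len(coeffs)):
        j = int(math.log2(i))                # resolution level of coefficient i
        scored.append((abs(coeffs[i]) / math.sqrt(2 ** j), i))
    scored.sort(reverse=True)
    for _, i in scored[:max(b - 1, 0)]:
        keep[i] = coeffs[i]
    return keep
```
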

58
Using Wavelet-based Histograms
  • Selectivity estimation: sel(a <= X <= b) = C'(b) - C'(a-1)
  • C' is the (approximate) reconstructed
    cumulative distribution
  • Time: O(min{b, logN}), where b = size of the wavelet
    synopsis (no. of coefficients), N = size of the
    domain
  • Empirical results over synthetic data
  • Improvements over random sampling and histograms
    (MaxDiff)

59
Some related issues
  • Dynamic maintenance of wavelet-based histograms
    MVW00
  • Multi-Dimensional Synopses and joins
  • Earlier work is often extended to
    multi-dimensions
  • In general, difficult!
  • curse of dimensionality
  • Value/frequency uniformity assumptions are lost

60
Distinct Values Queries
  • select count(distinct target-attr)
  • from rel
  • where P
  • select count(distinct o_custkey)
  • from orders
  • where o_orderdate > '2001-01-01'
  • How many distinct customers have placed orders
    this year?
  • Includes column cardinalities, number of
    species, number of distinct values in a data set
    / data stream

(The first query is the general template; the second is a TPC-H example.)
61
Distinct Values Queries
  • Uniform Sampling-based approaches
  • Collect and store uniform sample. At query time,
    apply predicate to sample. Estimate based on a
    function of the distribution.
  • Discouraging results: CCM00 proved one must examine
    almost the entire table to guarantee the estimate
    is within a factor of 10 with probability > 1/2,
    regardless of the function used!
  • One-pass approaches
  • A hash function maps values to bit positions
    according to an exponential distribution FM85
    (cf. Coh97, AMS96)
  • 00001011111: estimate based on the rightmost 0-bit
    (see the sketch below)
  • Produces a single count; does not handle
    subsequent predicates
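
A minimal sketch of the FM85 idea (a single sketch for illustration; the paper averages many, and the hash choice here is an assumption): the hash sets bit i with probability 2^-(i+1), and the position R of the rightmost 0-bit of the OR-ed bitmap gives the estimate ≈ 2^R / 0.77351.

```python
import hashlib

PHI = 0.77351   # Flajolet-Martin bias-correction constant

def _bit_position(value, num_bits=32):
    """Position of the lowest set bit of a hash of `value` (geometric dist.)."""
    h = int(hashlib.md5(str(value).encode()).hexdigest(), 16)
    pos = (h & -h).bit_length() - 1 if h else num_bits - 1
    return min(pos, num_bits - 1)

def fm_estimate(values, num_bits=32):
    bitmap = 0
    for v in values:
        bitmap |= 1 << _bit_position(v, num_bits)   # mark the hashed position
    r = 0
    while bitmap & (1 << r):                        # rightmost 0-bit of the bitmap
        r += 1
    return (2 ** r) / PHI                           # distinct-count estimate
```
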

62
Conclusions
  • Demand for fast (and approximate) data analysis
    is rising.
  • Commercial data warehouses are approaching several
    hundreds of TB and continuously growing
  • Ideally
  • Want to avoid examining the whole data, or
  • If not possible, want to take only one look
  • Still an on-going research issue!