Title: Approximate Query Processing

Slide 1: Approximate Query Processing
Slides based on the following tutorials:
- In SIGMOD 2001 by Haas and Hellerstein
- In VLDB 2001 by Garofalakis and Gibbons
Slide 2: Outline
- Introduction & motivation
- Online query processing
  - Usage examples
  - Basic sampling techniques and estimators
  - Preferential data delivery
  - Online join algorithms
- Synopses-based query processing
  - One-dimensional synopses
  - Multi-dimensional synopses and joins
  - Set-valued queries
- Conclusions
Slide 3: Introduction & Motivation

[Diagram: Decision Support Systems (DSS) issue an SQL query and wait for the exact answer — long response times!]

- Exact answers are NOT always required
- DSS applications are usually exploratory
- Aggregate queries: precision to the last decimal is not needed
- Preview answers while waiting
- Trial queries
Slide 4: Wait
Slide 5: Fast Approximate Answers
- Primarily for aggregate queries
- Goal: quickly report the leading digits of answers
  - In seconds instead of minutes or hours
- Most useful if error guarantees can be provided
  - E.g., average salary: 59,000 ± 500 (with 95% confidence) in 10 seconds, vs. 59,152.25 in 10 minutes
- How?
  - Use a sample instead of the whole data
  - Use some synopses of the data
  - Speed-up
Slide 6: Approximate Query Answering
- Basic Approach 1: Online Query Processing
  - Sampling at query time
  - Answers continually improve, under user control
Slide 7: Approximate Query Answering
- Basic Approach 2: Precomputed Synopses
  - Construct synopses prior to query time
  - Use them to answer queries
  - Need to keep the synopses up-to-date
Slide 8: Online Processing
- Goal: maximize the gain
  - The gain is defined on-the-fly
  - May need FEEDBACK

[Plot: gain (up to 100%) as a function of time]
Slide 9: Online Aggregation
- SELECT AVG(temp) FROM t GROUP BY site
- 330K rows in the table
- [Screenshot: the exact answer]
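The running estimates behind such an interface need only constant state per group: a (count, sum) pair updated as rows stream in. A minimal sketch, using hypothetical site/temp data rather than the tutorial's actual 330K-row table:

```python
import random

# Hypothetical table of (site, temp) rows, matching the slide's query.
# True mean for site i is 20 + i; noise simulates measurement spread.
random.seed(7)
rows = [(site, 20.0 + site + random.gauss(0, 5))
        for site in range(3) for _ in range(1000)]
random.shuffle(rows)  # online aggregation assumes rows arrive in random order

state = {}  # per-group running state: site -> (count, sum)
for i, (site, temp) in enumerate(rows, 1):
    n, s = state.get(site, (0, 0.0))
    state[site] = (n + 1, s + temp)
    if i % 1000 == 0:
        # periodically refresh the on-screen estimates
        est = {g: s / n for g, (n, s) in state.items()}
```

The estimates are usable after the first few rows and converge to the exact group averages as the scan completes.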
Slide 10: Online Aggregation, cont'd
- [Screenshot: a simple online aggregation interface (after 74 rows)]

Slide 11: Online Aggregation, cont'd
Slide 12: Example: Online Aggregation
- Additional features: Speed up, Slow down, Terminate
Slide 13: Online Data Visualization
- In ALVIN (Alberta System for Visualizing Large Networks)

Slide 14: Online Data Visualization (cont'd)

Slide 15: Online Data Visualization (cont'd)
Slide 16: Road Map
- Background and motivation
- Examples of online techniques
- Underlying technology
  - Building blocks: sampling, estimation
  - Preferential data delivery
  - Pipelined adaptive processing algorithms
- Related work
- Looking forward
Slide 17: Sampling Design Issues
- Granularity of the sample
  - Instance-level (row-level): high I/O cost
  - Block-level (page-level): high variability from clustering
- Type of sample
  - Often a simple random sample (SRS), especially for on-the-fly sampling
  - With/without replacement: usually not critical
- Data structure from which to sample
  - Files or relational tables
  - Indexes (B+ trees, etc.)
Slide 18: Row-Level Sampling Techniques
- Maintain the file in random order
  - Sampling = scan
  - Is the file initially in random order?
    - Statistical tests needed, e.g., runs test, Smirnov test
  - In DB systems: cluster via a RAND function
    - Must freshen the ordering (online reorg)
- On-the-fly sampling
  - Via an index on a random column
  - Else get a random page, then a row within the page
    - Ex: extent-map sampling
    - Problem: variable number of records per page
Slide 19: Acceptance/Rejection Sampling
- Accept a row on page i with probability n_i / n_max (n_i = number of rows on page i, n_max = maximum rows per page)
- Commonly used in other settings
  - E.g., sampling from joins
  - E.g., sampling from indexes

[Figure: original pages vs. modified pages]
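A sketch of the acceptance/rejection step for pages holding variable numbers of rows (the page contents below are illustrative): picking a page uniformly and then accepting one of its rows with probability n_i / n_max yields a sample that is uniform over rows, not over pages.

```python
import random

def sample_row(pages, n_max, rng=random):
    """Uniform row sample via acceptance/rejection: pick a page uniformly,
    then accept a random row of it with probability n_i / n_max;
    on rejection, retry with a fresh page."""
    while True:
        page = rng.choice(pages)              # uniform over pages
        if rng.random() < len(page) / n_max:  # accept w.p. n_i / n_max
            return rng.choice(page)           # uniform within the page

# Hypothetical pages with 1, 4, 2, and 4 rows respectively.
pages = [[f"p{i}r{j}" for j in range(n)]
         for i, n in enumerate([1, 4, 2, 4])]
n_max = max(len(p) for p in pages)
```

Each attempt lands on any given row with probability (1/#pages) · (n_i/n_max) · (1/n_i) = 1/(#pages · n_max), which is the same for every row, so retrying until acceptance gives a uniform row sample.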
Slide 20: Cost of Row-Level Sampling
- 100,000 pages
- 200 rows/page
Slide 21: Estimation for Aggregates
- Point estimates
  - Easy: SUM, COUNT, AVERAGE
  - Hard: MAX, MIN, quantiles, distinct values
- Confidence intervals: a measure of precision
Slide 22: Confidence Intervals

Slide 23: The Good and Bad News
- Good news: 1/√n magic (n chosen on-the-fly)
- Bad news: the needle-in-a-haystack problem
Slide 24: Sampling Deployed in Industry
- Simulated Bernoulli sampling
  - SQL: SELECT * FROM T WHERE RAND() < 0.01
  - Similar capability in SAS
- Bernoulli sampling with a pre-specified rate
  - Informix, Oracle 8i, (DB2)
  - Ex: SELECT * FROM T1 SAMPLE ROW(10%), T2
  - Ex: SELECT * FROM T1 SAMPLE BLOCK(10%), T2
- Not for novices
  - Need to pre-specify the precision: no feedback/control
  - Recall the multiresolution patterns from the example
  - No estimators provided in current systems
Slide 25: Road Map
- Background and motivation
- Examples of online techniques
- Underlying technology
  - Building blocks: sampling, estimation
  - Preferential data delivery
  - Pipelined adaptive processing algorithms
- Related technology: precomputation
- Looking forward
Slide 26: Preferential Data Delivery
- Why needed?
  - Speedup/slowdown arrows
- Idea: online reordering
  - Deliver interesting items first
  - "Interesting" is determined on the fly
  - Exploit the rate gap between producing and processing/consuming items
Slide 27: Mechanism
- Two threads:
  - Prefetch from the input
  - Spool/enrich from the auxiliary side disk
- Juggle data between the buffer and the side disk
  - Keep the buffer full of "interesting" items
- getNext chooses the best item currently in the buffer
- getNext and enrich/spool decisions are based on the reordering policy
Slide 28: Policies

[Plot: quality of feedback (QOF) over time]

- A "good" permutation of the items t_1 … t_n to t_π(1) … t_π(n)
- Quality of feedback for a prefix t_π(1), t_π(2), …, t_π(k):
  QOF(UP(t_π(1)), UP(t_π(2)), …, UP(t_π(k))), where UP = user preference
- E.g., QOF = Σ_i UP_i / √n_i, where n_i = number of tuples processed from group i
  - The preference acts as a weight on the confidence interval
- Goodness of a reordering: dQOF/dt
- Implications for the juggle mechanism
  - The process gets the item from the buffer that improves QOF the most
  - The juggle tries to keep such items in the buffer
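One plausible reading of this policy, sketched below: each group's term UP_i / √n_i is its preference-weighted confidence-interval width, so the next tuple should come from the group whose extra tuple shrinks its term the most. The function and data names here are illustrative, not the tutorial's actual implementation.

```python
import math

def best_group(user_pref, n_processed):
    """Pick the group whose next tuple most improves feedback quality:
    group i's confidence-interval term shrinks like UP_i / sqrt(n_i),
    so choose the group with the largest reduction from one more tuple.
    Assumes every group already has at least one processed tuple."""
    def gain(g):
        n = n_processed[g]
        return user_pref[g] * (1 / math.sqrt(n) - 1 / math.sqrt(n + 1))
    return max(user_pref, key=gain)

# Hypothetical state: user weights and tuples processed so far per group.
pref = {"A": 3.0, "B": 1.0, "C": 1.0}
done = {"A": 100, "B": 4, "C": 100}
```

With this state, group B wins even though A has triple the preference weight: B's estimate rests on only 4 tuples, so one more tuple buys far more confidence-interval reduction there.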
Slide 29: Road Map
- Background and motivation
- Examples of online techniques
- Underlying technology
  - Building blocks: sampling, estimation, pre-computation
  - Preferential data delivery
  - Pipelined adaptive processing algorithms
- Related work
- Looking forward
Slide 30: Pipelined Data Processing
- Never, ever wait for anything to finish
  - Selection: no problem
  - Grouping: hash, don't sort
  - Sorting: juggle if possible
- Joins?
  - Sample of joins vs. join of samples
  - SELECT AVG(R.a * S.b) FROM R, S WHERE R.c = S.c
Slide 31: Traditional Nested Loops

[Animation: for each tuple of R, all of S is scanned before the next R tuple is read]
Slide 32: Ripple Joins
- Designed for online performance goals
- Simplest version:
  - Read new tuples r from R and s from S
  - Join r and s
  - Join r with old S tuples
  - Join s with old R tuples
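The four steps above can be sketched as a generator; `pred` is the join predicate, and the inputs stand in for random-order scans of R and S:

```python
def ripple_join(R, S, pred):
    """Simplest (square) ripple join: at step k, fetch the k-th tuple from
    each input, join the new pair, then join each new tuple against all
    previously seen tuples of the other input."""
    seen_R, seen_S = [], []
    for r, s in zip(R, S):
        if pred(r, s):                    # new r vs. new s
            yield (r, s)
        for old_s in seen_S:              # new r vs. old S tuples
            if pred(r, old_s):
                yield (r, old_s)
        for old_r in seen_R:              # new s vs. old R tuples
            if pred(old_r, s):
                yield (old_r, s)
        seen_R.append(r)
        seen_S.append(s)
```

Every pair (r_i, s_j) is examined exactly once, by step max(i, j), so the explored region of the R × S plane grows in square "ripples"; the block and rectangular variants on the next slides change only the shape of each ripple.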
Slide 33: Basic Ripple Join

[Diagram: the R × S plane is swept in growing square "ripples"]

Slide 34: Block Ripple Joins (Size 2)

Slide 35: Rectangular Ripple Join
Slide 36: Ripple Joins, cont'd
- Variants
  - Block: minimizes I/O in alternating nested loops
  - Index: coincides with index-nested-loops join
  - Hash: symmetric hash tables
- Scalability issues (ongoing work!)
  - Memory compaction
  - Parallelism
- Optimization issues
  - E.g., it is not clear how to choose between the variants
Slide 37: Road Map
- Background and motivation
- Examples of online techniques
- Underlying technology
- Related work
  - Online query processing
  - Precomputation
- Looking forward
Slide 38: Related Work on Online QP
- Morgenstein's PhD, Berkeley '80
- Online association rules
  - Ng et al.'s CAP, SIGMOD '98
  - Hidber's CARMA, SIGMOD '99
- Implications for deductive DB semantics
  - Monotone aggregation in LDL++, Zaniolo and Wang
- Online aggregation with subqueries
  - Tan et al., VLDB '99
- Dynamic pipeline scheduling
  - Urhan/Franklin, VLDB '01
- Pipelining hash joins
  - Raschid, Wilschut/Apers, Tukwila, XJoin
  - Relation to semi-naive evaluation
- Anytime algorithms
  - Zilberstein, Russell, et al.
Slide 39: Precomputation and Synopses-Based Query Processing
- Intro
- One-dimensional synopses
- Using synopses for query processing
- Maintaining synopses
- Some other related issues
- Conclusions
Slide 40: Overview
- Past work
  - Materialized views
  - Histograms
- Recent work
  - OLAP data cubes (drill-down hierarchies)
  - Extended use of histograms
  - More novel synopses, e.g., random samples, wavelets
Slide 41: Online vs. Precomputed
- Online
  - Seeing the approximate answer
  - Continuous refinement of answers (online aggregation)
  - User controls what to refine and when to stop
  - No maintenance overheads
- Precomputed
  - Often sees the entire data
  - Often faster: small synopses can reside in memory or cache
  - Can be used with any DBMS (no need for sampling or index striding)
  - Also effective for remote or streaming data
Slide 42: Commercial DBMSs
- Oracle, IBM Informix: sampling operator (online)
- IBM DB2: IBM Almaden is working on a prototype version of DB2 that supports sampling. The user specifies a priori the amount of sampling to be done.
- Microsoft SQL Server: new "auto statistics" extract statistics (e.g., histograms) using fast sampling, enabling the query optimizer to use the latest information. The index tuning wizard uses sampling to build statistics.
  - See [CN97, CMN98, CN98]
- In summary, not much announced yet
Slide 43: Histograms
- Partition the attribute value domain into a set of buckets
- Issues:
  - How to partition
  - What to store for each bucket
  - How to estimate an answer using the histogram
- Long history of use for selectivity estimation within a query optimizer [Koo80, PSC84, etc.]
Slide 44: 1-D Histograms: Equi-Depth

[Figure: equi-depth histogram over domain values 1–20, equal count in each bucket]

- Goal: an equal number of rows per bucket (B buckets in all)
- Can construct by first sorting, then taking B−1 equally-spaced splits
- Faster construction: sample, then take equally-spaced splits in the sample
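The sort-and-split construction can be sketched in a few lines (the bucket representation as (low, high, count) triples is a choice made for this sketch):

```python
def equi_depth(values, B):
    """Build an equi-depth histogram: sort the values, then take B-1
    equally spaced splits so each bucket holds about |R|/B rows.
    Returns a list of (low, high, count) buckets."""
    data = sorted(values)
    n = len(data)
    buckets = []
    for b in range(B):
        lo, hi = b * n // B, (b + 1) * n // B
        buckets.append((data[lo], data[hi - 1], hi - lo))
    return buckets
```

For the slide's domain of values 1–20 with B = 4, this yields buckets [1, 5], [6, 10], [11, 15], [16, 20], each with count 5. The faster sample-based variant runs the same routine on a random sample and scales the counts.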
Slide 45: Answering Queries: Equi-Depth
- Answering queries:
  - select count(*) from R where 4 < R.A < 15
  - Approximate answer: F · |R|/B, where F = number of buckets that overlap the range
  - Error guarantee: ±2 · |R|/B

[Figure: the equi-depth buckets over domain values 1–20, with the range 4 ≤ R.A ≤ 15 highlighted]
Slide 46: Answering Queries: Histograms
- Answering queries from 1-D histograms (in general):
  - (Implicitly) map the histogram back to an approximate relation, then apply the query to the approximate relation
  - Continuous value mapping [SAC79]: the count is spread evenly among the bucket values
  - Uniform spread mapping [PIH96]
Slide 47: Maintaining Equi-Depth Histograms

[Figure: bucket counts over domain values 1–20]

- Can maintain using one-pass algorithms, or
- Use a backing sample [GMP97b]: maintain a larger sample on disk in support of histogram maintenance
  - Keep bucket counts up-to-date by incrementing on row insertion and decrementing on row deletion
  - Merge adjacent buckets with small counts
  - Split any bucket with a large count, using the sample to select a split value, i.e., take the median of the sample points in the bucket's range
  - Keeps counts within a factor of 2; for more equal buckets, recompute from the sample
Slide 48: Self-Tuning Histograms
- 1. Tune bucket frequencies [AC99]
  - Compare the actual selectivity to the histogram estimate
  - Use the difference to adjust the bucket frequencies
  - Example (query range spanning three buckets): actual = 60, estimate = 40, error = 20
  - Divide d · Error proportionately among the buckets (d = dampening factor)
  - With d = ½: d · Error = 10, so divide it as 4, 3, 3
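The tuning step can be sketched as below. The slide gives estimate 40, actual 60, and d = ½; the individual bucket frequencies 16, 12, 12 are an assumption made here so that the proportional shares of the damped error come out as the slide's 4, 3, 3:

```python
def tune_buckets(freqs, actual, damping=0.5):
    """Distribute the damped estimation error across the buckets that
    overlap the query range, proportionally to their current frequencies."""
    est = sum(freqs)                      # histogram's estimate for the range
    err = damping * (actual - est)        # damped error to redistribute
    return [f + err * f / est for f in freqs]

new_freqs = tune_buckets([16, 12, 12], actual=60)
```

Here est = 40, the damped error is ½ · 20 = 10, and the shares 10·16/40, 10·12/40, 10·12/40 = 4, 3, 3 raise the frequencies to 20, 15, 15.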
Slide 49: Self-Tuning 1-D Histograms
- 2. Restructure after some updates
  - Merge buckets of near-equal frequencies
  - Split buckets with large frequencies
Slide 50: Histogram Variants
- Equi-depth: the sum of the frequencies in each bucket is the same
- Equi-depth compressed: equi-depth, with singleton buckets for a few of the highest frequencies
- Equi-width: for each bucket i, (Max_i − Min_i) = (1/B) · (Max − Min)
- V-Optimal: minimizes the variance within each bucket
- V-Optimal end-biased: V-Optimal, with some of the highest and some of the lowest frequencies placed in individual buckets
Slide 51: Sampling Basics
- Idea: a small random sample S of the data often represents all the data well
- For a fast approximate answer, apply the query to S and scale the result
- E.g., R.a ∈ {0, 1} and S is a 20% sample:
  - select count(*) from R where R.a = 0
  - select 5 * count(*) from S where S.a = 0

[Figure: the R.a column of 0/1 values with the sampled rows highlighted; est. count = 5 · 2 = 10, exact count = 10]

- Unbiased: for expressions involving count, sum, and avg, the estimator is unbiased, i.e., the expected value of the answer is the actual answer, even for (most) queries with predicates!
- The error is bounded using a confidence interval
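The scaled-count estimator can be sketched directly: take a Bernoulli 20% sample and multiply the sample count by the inverse sampling rate, 1/0.2 = 5. The data here is synthetic, standing in for the slide's 0/1 column R.a:

```python
import random

# Hypothetical 0/1 column R.a with P(R.a = 0) = 1/3.
random.seed(1)
R = [random.choice([0, 1, 1]) for _ in range(30000)]

# Bernoulli 20% sample: each row is kept independently w.p. 0.2.
S = [x for x in R if random.random() < 0.2]

exact = sum(1 for x in R if x == 0)
# Scale by 1/0.2 = 5: an unbiased estimator of the exact count.
est = 5 * sum(1 for x in S if x == 0)
```

Unbiasedness holds because each qualifying row contributes 5 · 0.2 = 1 to the estimate in expectation; the typical relative error shrinks like 1/√(sample size).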
Slide 52: Sampling + Confidence Intervals
- Confidence intervals for AVERAGE: σ(R) = the standard deviation of the values of R.A; σ(S) = the standard deviation of the sample values S.A
- In the presence of predicates, S above is the subset of the sample that satisfies the predicate
Slide 53: Sampling from Databases
- Sampling disk-resident data is slow
  - Row-level sampling has a high I/O cost: an entire disk block must be read to get one row
  - Block-level sampling: rows may be highly correlated
  - Random access pattern, possibly via an index
  - Acceptance/rejection sampling is needed to account for the variable number of rows in a page, children in an index node, etc.
- Alternatives
  - Random physical clustering: destroys the "natural" clustering
  - Precomputed samples: maintenance issues
Slide 54: One-Pass Uniform Sampling
- Best choice for incremental maintenance
  - Low overheads, no random data access
- Reservoir sampling [Vit85]: maintains a sample S of a fixed size M
  - Add each new item to S with probability M/N, where N is the current number of data items
  - When an item is added, evict a random item from S
  - Instead of flipping a coin for each item, determine the number of items to skip before the next one to add to S
- To handle deletions:
  - Remove the deleted item from S if it is there, else ignore it
  - If |S| < M/2, build a new S with another pass (this happens only after roughly half the items are deleted, so the cost is fully amortized) [GMP97b]
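The insertion side of reservoir sampling can be sketched as follows; this is the basic per-item-coin version, whereas Vitter's skip-counting optimization mentioned above avoids a random draw per item:

```python
import random

def reservoir_sample(stream, M, rng=random):
    """One-pass reservoir sampling with O(M) memory: after N items,
    each item is in the sample with probability M/N."""
    S = []
    for N, item in enumerate(stream, 1):
        if N <= M:
            S.append(item)            # fill the reservoir first
        else:
            j = rng.randrange(N)      # item N is kept w.p. M/N ...
            if j < M:
                S[j] = item           # ... evicting a uniform resident
    return S
```

A one-line induction shows uniformity: item N enters with probability M/N, and each earlier resident survives the step with probability 1 − (M/N)(1/M) = (N−1)/N.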
Slide 55: One-Dimensional Haar Wavelets
- Wavelets: a mathematical tool for the hierarchical decomposition of functions/signals
- Haar wavelets: the simplest wavelet basis; easy to understand and implement
- Recursive pairwise averaging and differencing at different resolutions:

  Resolution   Averages                   Detail coefficients
  3            [2, 2, 0, 2, 3, 5, 4, 4]   ----
  2            [2, 1, 4, 4]               [0, -1, -1, 0]
  1            [1.5, 4]                   [0.5, 0]
  0            [2.75]                     [-1.25]
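The averaging/differencing recursion in the table can be reproduced in a few lines (input length is assumed to be a power of two):

```python
def haar_decompose(data):
    """Recursive pairwise averaging and differencing: returns the overall
    average plus the detail coefficients, coarsest level first."""
    details = []
    level = list(data)
    while len(level) > 1:
        avgs = [(level[i] + level[i + 1]) / 2 for i in range(0, len(level), 2)]
        dets = [(level[i] - level[i + 1]) / 2 for i in range(0, len(level), 2)]
        details = dets + details      # coarser details go in front
        level = avgs
    return level[0], details
```

Applied to the slide's data [2, 2, 0, 2, 3, 5, 4, 4], this returns the overall average 2.75 and the detail coefficients [-1.25, 0.5, 0, 0, -1, -1, 0]; together these 8 numbers reconstruct the original 8 values exactly, and dropping small coefficients gives the lossy synopsis.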
Slide 56: Haar Wavelet Coefficients
- Hierarchical decomposition structure (a.k.a. the error tree)

[Figure: error tree showing each coefficient's support over the original data]
Slide 57: Wavelet-Based Histograms [MVW98]
- Problem: range-query selectivity estimation
- Key idea: use a compact subset of Haar/linear wavelet coefficients to approximate the data distribution
- Steps:
  - Compute the cumulative data distribution C
  - Compute the Haar (or linear) wavelet transform of C
  - Coefficient thresholding: only b << |C| coefficients can be kept
    - Take the largest coefficients in absolute normalized value
    - Haar basis: divide coefficients at resolution j by √(2^j)
    - Optimal in terms of the overall mean squared (L2) error
  - Greedy heuristic methods
    - Retain coefficients leading to large error reduction
    - Throw away coefficients that give only a small increase in error
Slide 58: Using Wavelet-Based Histograms
- Selectivity estimation: sel(a ≤ X ≤ b) = C'[b] − C'[a−1]
  - C' is the (approximate) reconstructed cumulative distribution
- Time: O(min{b, log N}), where b = size of the wavelet synopsis (number of coefficients) and N = size of the domain
- Empirical results over synthetic data: improvements over random sampling and histograms (MaxDiff)
Slide 59: Some Related Issues
- Dynamic maintenance of wavelet-based histograms [MVW00]
- Multi-dimensional synopses and joins
  - Earlier work is often extended to multiple dimensions
  - In general, difficult!
    - Curse of dimensionality
    - Value/frequency uniformity assumptions are lost
Slide 60: Distinct Values Queries
- Template:
    select count(distinct target-attr)
    from rel
    where P
- TPC-H example:
    select count(distinct o_custkey)
    from orders
    where o_orderdate > '2001-01-01'
- How many distinct customers have placed orders this year?
- Includes column cardinalities, number of species, number of distinct values in a data set / data stream
Slide 61: Distinct Values Queries
- Uniform-sampling-based approaches
  - Collect and store a uniform sample. At query time, apply the predicate to the sample and estimate based on a function of the distribution.
  - Discouraging results: [CCM00] proved that one must examine almost the entire table to guarantee that the estimate is within a factor of 10 with probability > 1/2, regardless of the function used!
- One-pass approaches
  - A hash function maps values to bit positions according to an exponential distribution [FM85] (cf. [Coh97, AMS96])
    - 00001011111: estimate based on the rightmost 0-bit
  - Produces a single count: does not handle subsequent predicates
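A single-sketch illustration of the FM85 idea: hash each value, set the bitmap bit given by the position of the hash's lowest set bit (geometrically distributed), and estimate from the rightmost unset bitmap bit. md5 here is just a convenient deterministic stand-in hash, and real deployments average many independent hash functions to tame the variance; 0.77351 is the FM correction constant.

```python
import hashlib

def fm_estimate(values, phi=0.77351):
    """Flajolet-Martin distinct-count sketch: one pass, one small bitmap.
    Estimate is 2^R / phi, where R is the position of the lowest unset
    bitmap bit. Duplicates never change the bitmap."""
    bitmap = 0
    for v in values:
        h = int.from_bytes(hashlib.md5(str(v).encode()).digest()[:8], "big")
        # position of the lowest set bit of h: ~Geometric(1/2)
        r = (h & -h).bit_length() - 1 if h else 63
        bitmap |= 1 << r
    R = 0
    while bitmap & (1 << R):
        R += 1
    return 2 ** R / phi
```

Because updates only OR bits into the bitmap, repeated values leave the estimate untouched, which is exactly why one pass suffices; the flip side, as the slide notes, is that the sketch yields a single count and cannot be filtered by predicates afterwards.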
Slide 62: Conclusions
- The demand for fast (and approximate) data analysis is rising
  - Commercial data warehouses are approaching several hundreds of terabytes and are continuously growing
- Ideally:
  - Avoid examining the whole data, or,
  - if that is not possible, take only "one look"
- Still an ongoing research issue!