Title: Approximate Query Processing

Slide 1: Approximate Query Processing
Slides based on the following tutorials:
- In SIGMOD 2001 by Haas and Hellerstein
- In VLDB 2001 by Garofalakis and Gibbons
Slide 2: Outline
- Introduction & motivation
- Online query processing
  - Usage examples
  - Basic sampling techniques and estimators
  - Preferential data delivery
  - Online join algorithms
- Synopses-based query processing
  - One-dimensional synopses
  - Multi-dimensional synopses and joins
  - Set-valued queries
- Conclusions
Slide 3: Introduction & Motivation

[Diagram: Decision Support Systems (DSS) issue an SQL query and wait for the exact answer — long response times!]

- Exact answers are NOT always required
- DSS applications are usually exploratory
- Aggregate queries: precision to the last decimal is not needed
- Preview answers while waiting
- Trial queries
Slide 4: Wait
Slide 5: Fast Approximate Answers
- Primarily for aggregate queries
- Goal: quickly report the leading digits of answers
  - In seconds instead of minutes or hours
- Most useful if error guarantees can be provided
  - E.g., average salary: 59,000 ± 500 (with 95% confidence) in 10 seconds, vs. 59,152.25 in 10 minutes
- How?
  - Use a sample instead of the whole data
  - Use some synopses of the data
  - Speed-up
Slide 6: Approximate Query Answering
- Basic Approach 1: Online Query Processing
  - Sampling at query time
  - Answers continually improve, under user control
Slide 7: Approximate Query Answering
- Basic Approach 2: Precomputed Synopses
  - Construct synopses prior to query time
  - Use them to answer queries
  - Need to keep the synopses up-to-date
Slide 8: Online Processing
- Goal: maximize the gain
  - The gain is defined on-the-fly
  - May need FEEDBACK

[Plot: gain (up to 100%) as a function of time]
Slide 9: Online Aggregation
- SELECT AVG(temp) FROM t GROUP BY site
- 330K rows in the table
- [Screenshot: the exact answer]
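The running estimates behind such an interface need only constant state per group: a (count, sum) pair updated as rows stream in. A minimal sketch, using hypothetical site/temp data rather than the tutorial's actual 330K-row table:

```python
import random

# Hypothetical table of (site, temp) rows, matching the slide's query.
# True mean for site i is 20 + i; noise simulates measurement spread.
random.seed(7)
rows = [(site, 20.0 + site + random.gauss(0, 5))
        for site in range(3) for _ in range(1000)]
random.shuffle(rows)  # online aggregation assumes rows arrive in random order

state = {}  # per-group running state: site -> (count, sum)
for i, (site, temp) in enumerate(rows, 1):
    n, s = state.get(site, (0, 0.0))
    state[site] = (n + 1, s + temp)
    if i % 1000 == 0:
        # periodically refresh the on-screen estimates
        est = {g: s / n for g, (n, s) in state.items()}
```

The estimates are usable after the first few rows and converge to the exact group averages as the scan completes.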
Slide 10: Online Aggregation, cont'd
- [Screenshot: a simple online aggregation interface (after 74 rows)]

Slide 11: Online Aggregation, cont'd
Slide 12: Example: Online Aggregation
- Additional features: Speed up, Slow down, Terminate
Slide 13: Online Data Visualization
- In ALVIN (Alberta System for Visualizing Large Networks)

Slide 14: Online Data Visualization (cont'd)

Slide 15: Online Data Visualization (cont'd)
Slide 16: Road Map
- Background and motivation
- Examples of online techniques
- Underlying technology
  - Building blocks: sampling, estimation
  - Preferential data delivery
  - Pipelined adaptive processing algorithms
- Related work
- Looking forward
Slide 17: Sampling Design Issues
- Granularity of the sample
  - Instance-level (row-level): high I/O cost
  - Block-level (page-level): high variability from clustering
- Type of sample
  - Often a simple random sample (SRS), especially for on-the-fly sampling
  - With/without replacement: usually not critical
- Data structure from which to sample
  - Files or relational tables
  - Indexes (B+ trees, etc.)
Slide 18: Row-Level Sampling Techniques
- Maintain the file in random order
  - Sampling = scan
  - Is the file initially in random order?
    - Statistical tests needed, e.g., runs test, Smirnov test
  - In DB systems: cluster via a RAND function
    - Must freshen the ordering (online reorg)
- On-the-fly sampling
  - Via an index on a random column
  - Else get a random page, then a row within the page
    - Ex: extent-map sampling
    - Problem: variable number of records per page
Slide 19: Acceptance/Rejection Sampling
- Accept a row on page i with probability n_i / n_max (n_i = number of rows on page i, n_max = maximum rows per page)
- Commonly used in other settings
  - E.g., sampling from joins
  - E.g., sampling from indexes

[Figure: original pages vs. modified pages]
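A sketch of the acceptance/rejection step for pages holding variable numbers of rows (the page contents below are illustrative): picking a page uniformly and then accepting one of its rows with probability n_i / n_max yields a sample that is uniform over rows, not over pages.

```python
import random

def sample_row(pages, n_max, rng=random):
    """Uniform row sample via acceptance/rejection: pick a page uniformly,
    then accept a random row of it with probability n_i / n_max;
    on rejection, retry with a fresh page."""
    while True:
        page = rng.choice(pages)              # uniform over pages
        if rng.random() < len(page) / n_max:  # accept w.p. n_i / n_max
            return rng.choice(page)           # uniform within the page

# Hypothetical pages with 1, 4, 2, and 4 rows respectively.
pages = [[f"p{i}r{j}" for j in range(n)]
         for i, n in enumerate([1, 4, 2, 4])]
n_max = max(len(p) for p in pages)
```

Each attempt lands on any given row with probability (1/#pages) · (n_i/n_max) · (1/n_i) = 1/(#pages · n_max), which is the same for every row, so retrying until acceptance gives a uniform row sample.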
Slide 20: Cost of Row-Level Sampling
- 100,000 pages
- 200 rows/page
Slide 21: Estimation for Aggregates
- Point estimates
  - Easy: SUM, COUNT, AVERAGE
  - Hard: MAX, MIN, quantiles, distinct values
- Confidence intervals: a measure of precision
Slide 22: Confidence Intervals

Slide 23: The Good and Bad News
- Good news: 1/√n magic (n chosen on-the-fly)
- Bad news: the needle-in-a-haystack problem
Slide 24: Sampling Deployed in Industry
- Simulated Bernoulli sampling
  - SQL: SELECT * FROM T WHERE RAND() < 0.01
  - Similar capability in SAS
- Bernoulli sampling with a pre-specified rate
  - Informix, Oracle 8i, (DB2)
  - Ex: SELECT * FROM T1 SAMPLE ROW(10%), T2
  - Ex: SELECT * FROM T1 SAMPLE BLOCK(10%), T2
- Not for novices
  - Need to pre-specify the precision: no feedback/control
  - Recall the multiresolution patterns from the example
  - No estimators provided in current systems
Slide 25: Road Map
- Background and motivation
- Examples of online techniques
- Underlying technology
  - Building blocks: sampling, estimation
  - Preferential data delivery
  - Pipelined adaptive processing algorithms
- Related technology: precomputation
- Looking forward
Slide 26: Preferential Data Delivery
- Why needed?
  - Speedup/slowdown arrows
- Idea: online reordering
  - Deliver interesting items first
  - "Interesting" is determined on the fly
  - Exploit the rate gap between producing and processing/consuming items
Slide 27: Mechanism
- Two threads:
  - Prefetch from the input
  - Spool/enrich from the auxiliary side disk
- Juggle data between the buffer and the side disk
  - Keep the buffer full of "interesting" items
- getNext chooses the best item currently in the buffer
- getNext and enrich/spool decisions are based on the reordering policy
Slide 28: Policies

[Plot: quality of feedback (QOF) over time]

- A "good" permutation of the items t_1 … t_n to t_π(1) … t_π(n)
- Quality of feedback for a prefix t_π(1), t_π(2), …, t_π(k):
  QOF(UP(t_π(1)), UP(t_π(2)), …, UP(t_π(k))), where UP = user preference
- E.g., QOF = Σ_i UP_i / √n_i, where n_i = number of tuples processed from group i
  - The preference acts as a weight on the confidence interval
- Goodness of a reordering: dQOF/dt
- Implications for the juggle mechanism
  - The process gets the item from the buffer that improves QOF the most
  - The juggle tries to keep such items in the buffer
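One plausible reading of this policy, sketched below: each group's term UP_i / √n_i is its preference-weighted confidence-interval width, so the next tuple should come from the group whose extra tuple shrinks its term the most. The function and data names here are illustrative, not the tutorial's actual implementation.

```python
import math

def best_group(user_pref, n_processed):
    """Pick the group whose next tuple most improves feedback quality:
    group i's confidence-interval term shrinks like UP_i / sqrt(n_i),
    so choose the group with the largest reduction from one more tuple.
    Assumes every group already has at least one processed tuple."""
    def gain(g):
        n = n_processed[g]
        return user_pref[g] * (1 / math.sqrt(n) - 1 / math.sqrt(n + 1))
    return max(user_pref, key=gain)

# Hypothetical state: user weights and tuples processed so far per group.
pref = {"A": 3.0, "B": 1.0, "C": 1.0}
done = {"A": 100, "B": 4, "C": 100}
```

With this state, group B wins even though A has triple the preference weight: B's estimate rests on only 4 tuples, so one more tuple buys far more confidence-interval reduction there.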
Slide 29: Road Map
- Background and motivation
- Examples of online techniques
- Underlying technology
  - Building blocks: sampling, estimation, pre-computation
  - Preferential data delivery
  - Pipelined adaptive processing algorithms
- Related work
- Looking forward
Slide 30: Pipelined Data Processing
- Never, ever wait for anything to finish
  - Selection: no problem
  - Grouping: hash, don't sort
  - Sorting: juggle if possible
- Joins?
  - Sample of joins vs. join of samples
  - SELECT AVG(R.a * S.b) FROM R, S WHERE R.c = S.c
Slide 31: Traditional Nested Loops

[Animation: for each tuple of R, all of S is scanned before the next R tuple is read]
Slide 32: Ripple Joins
- Designed for online performance goals
- Simplest version:
  - Read new tuples r from R and s from S
  - Join r and s
  - Join r with old S tuples
  - Join s with old R tuples
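The four steps above can be sketched as a generator; `pred` is the join predicate, and the inputs stand in for random-order scans of R and S:

```python
def ripple_join(R, S, pred):
    """Simplest (square) ripple join: at step k, fetch the k-th tuple from
    each input, join the new pair, then join each new tuple against all
    previously seen tuples of the other input."""
    seen_R, seen_S = [], []
    for r, s in zip(R, S):
        if pred(r, s):                    # new r vs. new s
            yield (r, s)
        for old_s in seen_S:              # new r vs. old S tuples
            if pred(r, old_s):
                yield (r, old_s)
        for old_r in seen_R:              # new s vs. old R tuples
            if pred(old_r, s):
                yield (old_r, s)
        seen_R.append(r)
        seen_S.append(s)
```

Every pair (r_i, s_j) is examined exactly once, by step max(i, j), so the explored region of the R × S plane grows in square "ripples"; the block and rectangular variants on the next slides change only the shape of each ripple.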
Slide 33: Basic Ripple Join

[Diagram: the R × S plane is swept in growing square "ripples"]

Slide 34: Block Ripple Joins (Size 2)

Slide 35: Rectangular Ripple Join
Slide 36: Ripple Joins, cont'd
- Variants
  - Block: minimizes I/O in alternating nested loops
  - Index: coincides with index-nested-loops join
  - Hash: symmetric hash tables
- Scalability issues (ongoing work!)
  - Memory compaction
  - Parallelism
- Optimization issues
  - E.g., it is not clear how to choose between the variants
Slide 37: Road Map
- Background and motivation
- Examples of online techniques
- Underlying technology
- Related work
  - Online query processing
  - Precomputation
- Looking forward
Slide 38: Related Work on Online QP
- Morgenstein's PhD, Berkeley '80
- Online association rules
  - Ng et al.'s CAP, SIGMOD '98
  - Hidber's CARMA, SIGMOD '99
- Implications for deductive DB semantics
  - Monotone aggregation in LDL++, Zaniolo and Wang
- Online aggregation with subqueries
  - Tan et al., VLDB '99
- Dynamic pipeline scheduling
  - Urhan/Franklin, VLDB '01
- Pipelining hash joins
  - Raschid, Wilschut/Apers, Tukwila, XJoin
  - Relation to semi-naive evaluation
- Anytime algorithms
  - Zilberstein, Russell, et al.
Slide 39: Precomputation and Synopses-Based Query Processing
- Intro
- One-dimensional synopses
- Using synopses for query processing
- Maintaining synopses
- Some other related issues
- Conclusions
Slide 40: Overview
- Past work
  - Materialized views
  - Histograms
- Recent work
  - OLAP data cubes (drill-down hierarchies)
  - Extended use of histograms
  - More novel synopses, e.g., random samples, wavelets
Slide 41: Online vs. Precomputed
- Online
  - Seeing the approximate answer
  - Continuous refinement of answers (online aggregation)
  - User controls what to refine and when to stop
  - No maintenance overheads
- Precomputed
  - Often sees the entire data
  - Often faster: small synopses can reside in memory or cache
  - Can be used with any DBMS (no need for sampling or index striding)
  - Also effective for remote or streaming data
Slide 42: Commercial DBMSs
- Oracle, IBM Informix: sampling operator (online)
- IBM DB2: IBM Almaden is working on a prototype version of DB2 that supports sampling. The user specifies a priori the amount of sampling to be done.
- Microsoft SQL Server: new "auto statistics" extract statistics (e.g., histograms) using fast sampling, enabling the query optimizer to use the latest information. The index tuning wizard uses sampling to build statistics.
  - See [CN97, CMN98, CN98]
- In summary, not much announced yet
Slide 43: Histograms
- Partition the attribute value domain into a set of buckets
- Issues:
  - How to partition
  - What to store for each bucket
  - How to estimate an answer using the histogram
- Long history of use for selectivity estimation within a query optimizer [Koo80, PSC84, etc.]
Slide 44: 1-D Histograms: Equi-Depth

[Figure: equi-depth histogram over domain values 1–20, equal count in each bucket]

- Goal: an equal number of rows per bucket (B buckets in all)
- Can construct by first sorting, then taking B−1 equally-spaced splits
- Faster construction: sample, then take equally-spaced splits in the sample
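The sort-and-split construction can be sketched in a few lines (the bucket representation as (low, high, count) triples is a choice made for this sketch):

```python
def equi_depth(values, B):
    """Build an equi-depth histogram: sort the values, then take B-1
    equally spaced splits so each bucket holds about |R|/B rows.
    Returns a list of (low, high, count) buckets."""
    data = sorted(values)
    n = len(data)
    buckets = []
    for b in range(B):
        lo, hi = b * n // B, (b + 1) * n // B
        buckets.append((data[lo], data[hi - 1], hi - lo))
    return buckets
```

For the slide's domain of values 1–20 with B = 4, this yields buckets [1, 5], [6, 10], [11, 15], [16, 20], each with count 5. The faster sample-based variant runs the same routine on a random sample and scales the counts.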
Slide 45: Answering Queries: Equi-Depth
- Answering queries:
  - select count(*) from R where 4 < R.A < 15
  - Approximate answer: F · |R|/B, where F = number of buckets that overlap the range
  - Error guarantee: ±2 · |R|/B

[Figure: the equi-depth buckets over domain values 1–20, with the range 4 ≤ R.A ≤ 15 highlighted]
Slide 46: Answering Queries: Histograms
- Answering queries from 1-D histograms (in general):
  - (Implicitly) map the histogram back to an approximate relation, then apply the query to the approximate relation
  - Continuous value mapping [SAC79]: the count is spread evenly among the bucket values
  - Uniform spread mapping [PIH96]
Slide 47: Maintaining Equi-Depth Histograms

[Figure: bucket counts over domain values 1–20]

- Can maintain using one-pass algorithms, or
- Use a backing sample [GMP97b]: maintain a larger sample on disk in support of histogram maintenance
  - Keep bucket counts up-to-date by incrementing on row insertion and decrementing on row deletion
  - Merge adjacent buckets with small counts
  - Split any bucket with a large count, using the sample to select a split value, i.e., take the median of the sample points in the bucket's range
  - Keeps counts within a factor of 2; for more equal buckets, recompute from the sample
Slide 48: Self-Tuning Histograms
- 1. Tune bucket frequencies [AC99]
  - Compare the actual selectivity to the histogram estimate
  - Use the difference to adjust the bucket frequencies
  - Example (query range spanning three buckets): actual = 60, estimate = 40, error = 20
  - Divide d · Error proportionately among the buckets (d = dampening factor)
  - With d = ½: d · Error = 10, so divide it as 4, 3, 3
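The tuning step can be sketched as below. The slide gives estimate 40, actual 60, and d = ½; the individual bucket frequencies 16, 12, 12 are an assumption made here so that the proportional shares of the damped error come out as the slide's 4, 3, 3:

```python
def tune_buckets(freqs, actual, damping=0.5):
    """Distribute the damped estimation error across the buckets that
    overlap the query range, proportionally to their current frequencies."""
    est = sum(freqs)                      # histogram's estimate for the range
    err = damping * (actual - est)        # damped error to redistribute
    return [f + err * f / est for f in freqs]

new_freqs = tune_buckets([16, 12, 12], actual=60)
```

Here est = 40, the damped error is ½ · 20 = 10, and the shares 10·16/40, 10·12/40, 10·12/40 = 4, 3, 3 raise the frequencies to 20, 15, 15.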
Slide 49: Self-Tuning 1-D Histograms
- 2. Restructure after some updates
  - Merge buckets of near-equal frequencies
  - Split buckets with large frequencies
Slide 50: Histogram Variants
- Equi-depth: the sum of the frequencies in each bucket is the same
- Equi-depth compressed: equi-depth, with singleton buckets for a few of the highest frequencies
- Equi-width: for each bucket i, (Max_i − Min_i) = (1/B) · (Max − Min)
- V-Optimal: minimizes the variance within each bucket
- V-Optimal end-biased: V-Optimal, with some of the highest and some of the lowest frequencies placed in individual buckets
Slide 51: Sampling Basics
- Idea: a small random sample S of the data often represents all the data well
- For a fast approximate answer, apply the query to S and scale the result
- E.g., R.a ∈ {0, 1} and S is a 20% sample:
  - select count(*) from R where R.a = 0
  - select 5 * count(*) from S where S.a = 0

[Figure: the R.a column of 0/1 values with the sampled rows highlighted; est. count = 5 · 2 = 10, exact count = 10]

- Unbiased: for expressions involving count, sum, and avg, the estimator is unbiased, i.e., the expected value of the answer is the actual answer, even for (most) queries with predicates!
- The error is bounded using a confidence interval
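The scaled-count estimator can be sketched directly: take a Bernoulli 20% sample and multiply the sample count by the inverse sampling rate, 1/0.2 = 5. The data here is synthetic, standing in for the slide's 0/1 column R.a:

```python
import random

# Hypothetical 0/1 column R.a with P(R.a = 0) = 1/3.
random.seed(1)
R = [random.choice([0, 1, 1]) for _ in range(30000)]

# Bernoulli 20% sample: each row is kept independently w.p. 0.2.
S = [x for x in R if random.random() < 0.2]

exact = sum(1 for x in R if x == 0)
# Scale by 1/0.2 = 5: an unbiased estimator of the exact count.
est = 5 * sum(1 for x in S if x == 0)
```

Unbiasedness holds because each qualifying row contributes 5 · 0.2 = 1 to the estimate in expectation; the typical relative error shrinks like 1/√(sample size).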
Slide 52: Sampling + Confidence Intervals
- Confidence intervals for AVERAGE: σ(R) = the standard deviation of the values of R.A; σ(S) = the standard deviation of the sample values S.A
- In the presence of predicates, S above is the subset of the sample that satisfies the predicate
Slide 53: Sampling from Databases
- Sampling disk-resident data is slow
  - Row-level sampling has a high I/O cost: an entire disk block must be read to get one row
  - Block-level sampling: rows may be highly correlated
  - Random access pattern, possibly via an index
  - Acceptance/rejection sampling is needed to account for the variable number of rows in a page, children in an index node, etc.
- Alternatives
  - Random physical clustering: destroys the "natural" clustering
  - Precomputed samples: maintenance issues
Slide 54: One-Pass Uniform Sampling
- Best choice for incremental maintenance
  - Low overheads, no random data access
- Reservoir sampling [Vit85]: maintains a sample S of a fixed size M
  - Add each new item to S with probability M/N, where N is the current number of data items
  - When an item is added, evict a random item from S
  - Instead of flipping a coin for each item, determine the number of items to skip before the next one to add to S
- To handle deletions:
  - Remove the deleted item from S if it is there, else ignore it
  - If |S| < M/2, build a new S with another pass (this happens only after roughly half the items are deleted, so the cost is fully amortized) [GMP97b]
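The insertion side of reservoir sampling can be sketched as follows; this is the basic per-item-coin version, whereas Vitter's skip-counting optimization mentioned above avoids a random draw per item:

```python
import random

def reservoir_sample(stream, M, rng=random):
    """One-pass reservoir sampling with O(M) memory: after N items,
    each item is in the sample with probability M/N."""
    S = []
    for N, item in enumerate(stream, 1):
        if N <= M:
            S.append(item)            # fill the reservoir first
        else:
            j = rng.randrange(N)      # item N is kept w.p. M/N ...
            if j < M:
                S[j] = item           # ... evicting a uniform resident
    return S
```

A one-line induction shows uniformity: item N enters with probability M/N, and each earlier resident survives the step with probability 1 − (M/N)(1/M) = (N−1)/N.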
Slide 55: One-Dimensional Haar Wavelets
- Wavelets: a mathematical tool for the hierarchical decomposition of functions/signals
- Haar wavelets: the simplest wavelet basis; easy to understand and implement
- Recursive pairwise averaging and differencing at different resolutions:

  Resolution   Averages                   Detail coefficients
  3            [2, 2, 0, 2, 3, 5, 4, 4]   ----
  2            [2, 1, 4, 4]               [0, -1, -1, 0]
  1            [1.5, 4]                   [0.5, 0]
  0            [2.75]                     [-1.25]
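The averaging/differencing recursion in the table can be reproduced in a few lines (input length is assumed to be a power of two):

```python
def haar_decompose(data):
    """Recursive pairwise averaging and differencing: returns the overall
    average plus the detail coefficients, coarsest level first."""
    details = []
    level = list(data)
    while len(level) > 1:
        avgs = [(level[i] + level[i + 1]) / 2 for i in range(0, len(level), 2)]
        dets = [(level[i] - level[i + 1]) / 2 for i in range(0, len(level), 2)]
        details = dets + details      # coarser details go in front
        level = avgs
    return level[0], details
```

Applied to the slide's data [2, 2, 0, 2, 3, 5, 4, 4], this returns the overall average 2.75 and the detail coefficients [-1.25, 0.5, 0, 0, -1, -1, 0]; together these 8 numbers reconstruct the original 8 values exactly, and dropping small coefficients gives the lossy synopsis.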
Slide 56: Haar Wavelet Coefficients
- Hierarchical decomposition structure (a.k.a. the error tree)

[Figure: error tree showing each coefficient's support over the original data]
Slide 57: Wavelet-Based Histograms [MVW98]
- Problem: range-query selectivity estimation
- Key idea: use a compact subset of Haar/linear wavelet coefficients to approximate the data distribution
- Steps:
  - Compute the cumulative data distribution C
  - Compute the Haar (or linear) wavelet transform of C
  - Coefficient thresholding: only b << |C| coefficients can be kept
    - Take the largest coefficients in absolute normalized value
    - Haar basis: divide coefficients at resolution j by √(2^j)
    - Optimal in terms of the overall mean squared (L2) error
  - Greedy heuristic methods
    - Retain coefficients leading to large error reduction
    - Throw away coefficients that give only a small increase in error
Slide 58: Using Wavelet-Based Histograms
- Selectivity estimation: sel(a ≤ X ≤ b) = C'[b] − C'[a−1]
  - C' is the (approximate) reconstructed cumulative distribution
- Time: O(min{b, log N}), where b = size of the wavelet synopsis (number of coefficients) and N = size of the domain
- Empirical results over synthetic data: improvements over random sampling and histograms (MaxDiff)
Slide 59: Some Related Issues
- Dynamic maintenance of wavelet-based histograms [MVW00]
- Multi-dimensional synopses and joins
  - Earlier work is often extended to multiple dimensions
  - In general, difficult!
    - Curse of dimensionality
    - Value/frequency uniformity assumptions are lost
Slide 60: Distinct Values Queries
- Template:
    select count(distinct target-attr)
    from rel
    where P
- TPC-H example:
    select count(distinct o_custkey)
    from orders
    where o_orderdate > '2001-01-01'
- How many distinct customers have placed orders this year?
- Includes column cardinalities, number of species, number of distinct values in a data set / data stream
Slide 61: Distinct Values Queries
- Uniform-sampling-based approaches
  - Collect and store a uniform sample. At query time, apply the predicate to the sample and estimate based on a function of the distribution.
  - Discouraging results: [CCM00] proved that one must examine almost the entire table to guarantee that the estimate is within a factor of 10 with probability > 1/2, regardless of the function used!
- One-pass approaches
  - A hash function maps values to bit positions according to an exponential distribution [FM85] (cf. [Coh97, AMS96])
    - 00001011111: estimate based on the rightmost 0-bit
  - Produces a single count: does not handle subsequent predicates
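A single-sketch illustration of the FM85 idea: hash each value, set the bitmap bit given by the position of the hash's lowest set bit (geometrically distributed), and estimate from the rightmost unset bitmap bit. md5 here is just a convenient deterministic stand-in hash, and real deployments average many independent hash functions to tame the variance; 0.77351 is the FM correction constant.

```python
import hashlib

def fm_estimate(values, phi=0.77351):
    """Flajolet-Martin distinct-count sketch: one pass, one small bitmap.
    Estimate is 2^R / phi, where R is the position of the lowest unset
    bitmap bit. Duplicates never change the bitmap."""
    bitmap = 0
    for v in values:
        h = int.from_bytes(hashlib.md5(str(v).encode()).digest()[:8], "big")
        # position of the lowest set bit of h: ~Geometric(1/2)
        r = (h & -h).bit_length() - 1 if h else 63
        bitmap |= 1 << r
    R = 0
    while bitmap & (1 << R):
        R += 1
    return 2 ** R / phi
```

Because updates only OR bits into the bitmap, repeated values leave the estimate untouched, which is exactly why one pass suffices; the flip side, as the slide notes, is that the sketch yields a single count and cannot be filtered by predicates afterwards.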
Slide 62: Conclusions
- The demand for fast (and approximate) data analysis is rising
  - Commercial data warehouses are approaching several hundreds of terabytes and are continuously growing
- Ideally:
  - Avoid examining the whole data, or,
  - if that is not possible, take only "one look"
- Still an ongoing research issue!