Title: Approximate Query Processing using Wavelets
1Approximate Query Processing using Wavelets
- Kaushik Chakrabarti (Univ. of Illinois)
- Minos Garofalakis (Bell Labs)
- Rajeev Rastogi (Bell Labs)
- Kyuseok Shim (KAIST and AITrc)
- Presented at the 26th VLDB Conference, Cairo, Egypt
- Presented by Supriya Sudheendra
2Outline
3Introduction
- Approximate Query Processing is a viable solution for
  - Huge amounts of data
  - High query complexities
  - Stringent response-time requirements
- Decision Support Systems (DSS)
  - Support business and organizational decision-making activities
  - Help decision makers compile useful information from raw data, solve problems and make decisions
4Introduction
- DSS users pose very complex queries to the DBMS
  - Queries require complex operations over gigabytes or terabytes of disk-resident data
  - Queries take a very long time to execute and produce exact answers
- In a number of scenarios, users prefer a fast, approximate answer
5Prior Work
- Previous approximate query processing techniques
  - Focused on specific forms of aggregate queries
  - Data reduction mechanism: how to obtain the synopses of data
- Sampling-based techniques
  - A join operator on two uniform random samples results in a non-uniform sample having very few tuples
  - For non-aggregate queries, sampling produces a small subset of the exact answer, which might be empty when joins are involved
6Prior Work
- Histogram-based techniques
  - Problematic for high-dimensional data
  - Storage overhead
  - High construction cost
- Wavelet-based techniques
  - Mathematical tool for hierarchical decomposition of functions
  - Apply wavelet decomposition to the input data collection -> data synopsis
  - Avoids high construction costs and storage overhead
7Contribution of the Paper
- Viability and effectiveness of wavelets as a generic tool for high-dimensional DSS
- New, I/O-efficient wavelet decomposition algorithm for relational tables
- Novel query processing algebra for wavelet-coefficient data synopses
- Extensive experiments
8Background
- Wavelets: a mathematical tool to hierarchically decompose functions
- Coarse overall approximation together with detail coefficients that influence the function at various scales
- Haar wavelets are conceptually simple and fast to compute
- Variety of applications, such as image editing and querying
9One-Dimensional Haar Wavelets
- How to compute, given a data array
  - Average the values together pairwise to get a lower-resolution representation of the data
  - Detail coefficients -> differences of the averaged values from the computed pairwise average
  - Reconstruction of the original data array is then possible
- Why detail coefficients?
10One-dimensional Haar Wavelets
- Wavelet transform: the overall average followed by the detail coefficients in increasing order of resolution
  - Each entry -> a wavelet coefficient
  - W_A = [4, -2, 0, -1]
- For vectors containing similar values
  - Most detail coefficients have small values that can be eliminated
  - Eliminating them introduces only small errors
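The averaging-and-differencing steps above can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the function names are hypothetical, the input length is assumed to be a power of two, and the input array [2, 2, 5, 7] is an assumption chosen because it yields the coefficients 4, -2, 0, -1 quoted on this slide.

```python
def haar_decompose(data):
    """Return [overall average] + detail coefficients in increasing resolution."""
    if len(data) == 0 or len(data) & (len(data) - 1):
        raise ValueError("length must be a power of two")
    averages = list(data)
    details = []  # coarser levels are prepended, so they end up at the front
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        # Detail = first value of the pair minus the pairwise average.
        level_details = [(a - b) / 2 for a, b in pairs]
        averages = [(a + b) / 2 for a, b in pairs]
        details = level_details + details
    return averages + details


def haar_reconstruct(coeffs):
    """Invert haar_decompose by expanding one resolution level at a time."""
    values = list(coeffs[:1])
    pos = 1
    while pos < len(coeffs):
        level_details = coeffs[pos:pos + len(values)]
        pos += len(values)
        expanded = []
        for avg, det in zip(values, level_details):
            expanded += [avg + det, avg - det]
        values = expanded
    return values


def threshold(coeffs, eps):
    """Zero out detail coefficients smaller than eps; keep the overall average."""
    return list(coeffs[:1]) + [c if abs(c) >= eps else 0 for c in coeffs[1:]]


w = haar_decompose([2, 2, 5, 7])           # assumed example input
print(w)                                    # [4.0, -2.0, 0.0, -1.0]
print(haar_reconstruct(w))                  # [2.0, 2.0, 5.0, 7.0]
print(haar_reconstruct(threshold(w, 1.5)))  # [2.0, 2.0, 6.0, 6.0]
```

Dropping the two smallest detail coefficients changes only the last two cells (5 and 7 both become 6), which is the "small error" trade-off the slide describes.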
11One-dimensional Haar Wavelets
- The overall average is more important than any detail coefficient
- To normalize the final entries of W_A, each wavelet coefficient is divided by √(2^l)
  - l: level of resolution
  - W_A = [4, -2, 0, -1/√2]
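A short sketch of the normalization step, under the assumption that the overall average and the coarsest detail coefficient sit at level 0, the next two coefficients at level 1, and so on. That level numbering is inferred from the slide's result (only the last entry is divided by √2); the function name is hypothetical.

```python
import math


def normalize(coeffs):
    """Divide each coefficient by sqrt(2**l), where l is its resolution level."""
    out = [coeffs[0]]              # overall average: level 0, divisor 1
    level, pos = 0, 1
    while pos < len(coeffs):
        count = 2 ** level         # number of detail coefficients at this level
        scale = math.sqrt(2 ** level)
        out += [c / scale for c in coeffs[pos:pos + count]]
        pos += count
        level += 1
    return out


normalized = normalize([4, -2, 0, -1])
print(normalized)  # last entry is -1/sqrt(2), roughly -0.707
```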
12Multi-dimensional Haar Wavelets
- Haar wavelets can be extended to multi-dimensional arrays
- Standard decomposition
  - Fix an ordering for the data dimensions (1, 2, ..., d)
  - Apply the complete 1-D wavelet transform to each 1-D row of array cells along dimension k, for each k = 1, ..., d
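For the two-dimensional case, the standard decomposition can be sketched as follows. This is an illustrative sketch only: the helper names are hypothetical, the array is a plain list of lists, and both side lengths are assumed to be powers of two.

```python
def haar_1d(row):
    """Complete 1-D Haar transform: overall average, then details by resolution."""
    avgs, details = list(row), []
    while len(avgs) > 1:
        pairs = list(zip(avgs[0::2], avgs[1::2]))
        details = [(a - b) / 2 for a, b in pairs] + details
        avgs = [(a + b) / 2 for a, b in pairs]
    return avgs + details


def standard_decompose_2d(array):
    """Full 1-D transform along one dimension, then along the other."""
    # First dimension in the ordering: transform every row completely.
    rows = [haar_1d(r) for r in array]
    # Second dimension: transform every column of the intermediate result.
    cols = [haar_1d(c) for c in zip(*rows)]
    # Transpose back so the output has the same orientation as the input.
    return [list(r) for r in zip(*cols)]


result = standard_decompose_2d([[2, 2], [5, 7]])
print(result)  # [[4.0, -0.5], [-2.0, 0.5]]; entry [0][0] is the overall average
```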
13
- Nonstandard decomposition
  - Alternates between dimensions during successive steps of pairwise averaging and differencing
  - Performs one step of pairwise averaging and differencing for each 1-D row of array cells along dimension k, for each k = 1, ..., d
  - Repeated recursively on the quadrant containing the averages across all dimensions
14Non-standard Decomposition
- Pairwise averaging and differencing for one positioning of the 2x2 box with root (2i1, 2i2)
- Distribution of the results in the wavelet transform array W_A
- The process is recursed on the lower-left quadrant of W_A
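A minimal 2-D sketch of the nonstandard decomposition, assuming a square array whose side is a power of two. The function name is hypothetical, and the exact placement of the three detail coefficients inside the output array is an assumption of this sketch; what it does follow from the slide is that every 2x2 box is averaged and differenced once per pass, and that the next pass recurses only on the quadrant of averages (indices [0..half) in both dimensions, which is the lower-left quadrant when the origin is drawn at the bottom left).

```python
def nonstandard_decompose_2d(array):
    """Nonstandard 2-D Haar decomposition of a square power-of-two array."""
    out = [[float(x) for x in row] for row in array]
    size = len(out)
    while size > 1:
        half = size // 2
        # Snapshot of the current averages; `out` is overwritten below.
        block = [row[:size] for row in out[:size]]
        for i1 in range(half):
            for i2 in range(half):
                # One positioning of the 2x2 box rooted at (2*i1, 2*i2).
                a, b = block[2 * i1][2 * i2], block[2 * i1][2 * i2 + 1]
                c, d = block[2 * i1 + 1][2 * i2], block[2 * i1 + 1][2 * i2 + 1]
                out[i1][i2] = (a + b + c + d) / 4                # average
                out[i1][half + i2] = (a - b + c - d) / 4         # detail along dim 2
                out[half + i1][i2] = (a + b - c - d) / 4         # detail along dim 1
                out[half + i1][half + i2] = (a - b - c + d) / 4  # diagonal detail
        size = half  # recurse on the quadrant holding the averages
    return out


grid = [[4 * i + j for j in range(4)] for i in range(4)]
print(nonstandard_decompose_2d(grid)[0][0])  # 7.5, the overall average
```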
15Example Decomposition of a 4 X 4 Array
16Multi-dimensional Haar Coefficients: Semantics and Representation
- The d-dimensional Haar basis function corresponding to wavelet coefficient W is defined by
  - A d-dimensional rectangular support region
  - Quadrant sign information
17Support Regions for 16 Nonstandard 2-D Haar Basis Functions
- Blank areas: regions of A whose reconstruction is independent of the coefficient
- W_A[0,0]: overall average
- W_A[3,3]: contributes only to the upper-right quadrant
18Haar CoEfficients Semantics and Representation
- W = <R, S, v>
- W.R: d-dimensional support hyper-rectangle of W; encloses all cells in A to which W contributes
- Hyper-rectangle represented by low and high boundaries across each dimension j, 1 ≤ j ≤ d
  - W.R.boundary[j].lo and W.R.boundary[j].hi
- W contributes to each data cell A[i1, ..., id] where
  - W.R.boundary[j].lo ≤ ij ≤ W.R.boundary[j].hi for all j
19
- W.S: sign information for all d-dimensional quadrants of W.R
  - Denoted by W.S.sign[j].lo and W.S.sign[j].hi, corresponding to the lower and upper half of W.R's extent along dimension j
  - The sign of a quadrant is computed as the product of the d sign-vector entries that map to that quadrant
- W.v: scalar magnitude of W
  - Quantity that W contributes to all data array cells enclosed in W.R
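The <R, S, v> representation can be sketched as a small Python class. The class and field names are hypothetical, and the sketch assumes the sign changes at the midpoint of the support range along each dimension. The demo re-expresses the 1-D coefficients 4, -2, 0, -1 from the earlier slide in this form; the support ranges and signs used there follow the convention "first value = average + detail", which is also an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Coefficient:
    lo: List[int]                # W.R.boundary[j].lo for each dimension j
    hi: List[int]                # W.R.boundary[j].hi for each dimension j
    sign: List[Tuple[int, int]]  # (W.S.sign[j].lo, W.S.sign[j].hi)
    v: float                     # W.v, the scalar magnitude

    def contribution(self, cell: Sequence[int]) -> float:
        """Signed amount this coefficient adds to the given data cell."""
        total_sign = 1
        for j, i_j in enumerate(cell):
            if not (self.lo[j] <= i_j <= self.hi[j]):
                return 0.0       # cell lies outside the support region W.R
            # Assumed: upper half starts at the midpoint of the range.
            mid = (self.lo[j] + self.hi[j] + 1) // 2
            total_sign *= self.sign[j][0] if i_j < mid else self.sign[j][1]
        return total_sign * self.v


def render_cell(coeffs, cell):
    """Reconstruct one cell as the sum of all coefficient contributions."""
    return sum(w.contribution(cell) for w in coeffs)


coeffs = [
    Coefficient([0], [3], [(1, 1)], 4.0),    # overall average
    Coefficient([0], [3], [(1, -1)], -2.0),  # coarsest detail
    Coefficient([0], [1], [(1, -1)], 0.0),   # detail for cells 0..1
    Coefficient([2], [3], [(1, -1)], -1.0),  # detail for cells 2..3
]
print([render_cell(coeffs, (i,)) for i in range(4)])  # [2.0, 2.0, 5.0, 7.0]
```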
20Building Wavelet Coefficient Synopses
- Relation R with d attributes X1, X2, ..., Xd
- Can represent R as a d-dimensional array A_R
  - The j-th dimension is indexed by the values of attribute Xj
  - Cells contain the count of tuples in R having the corresponding combination of attribute values
- A_R: joint frequency distribution of all attributes of R
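As a small illustration of the joint frequency array, the sketch below counts tuples per attribute-value combination. It assumes each attribute takes integer values 0..size-1; the function name, the `domain_sizes` parameter and the sample relation are hypothetical.

```python
from collections import Counter


def joint_frequency_array(tuples, domain_sizes):
    """Build A_R as nested lists: cell [x1][x2]...[xd] holds the tuple count."""
    counts = Counter(tuples)

    def build(prefix, sizes):
        if not sizes:
            return counts.get(prefix, 0)  # innermost level: a single count
        return [build(prefix + (value,), sizes[1:]) for value in range(sizes[0])]

    return build((), tuple(domain_sizes))


# Hypothetical relation with two attributes: X1 in {0, 1, 2}, X2 in {0, 1}.
relation = [(0, 1), (0, 1), (2, 0)]
print(joint_frequency_array(relation, (3, 2)))  # [[0, 2], [0, 0], [1, 0]]
```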
21
- Chunk-based organization of relational tables
  - Joint frequency array A_R is split into d-dimensional chunks
  - Tuples of R belonging to the same chunk are stored contiguously on disk
  - If R is not chunked, one extra pre-processing step is needed to reorganize R on disk
22ComputeWavelet Algorithm
- When a chunk is loaded for the first time, ComputeWavelet can perform the entire computation needed for decomposing it
- Pairwise averaging and differencing is performed as soon as 2^d averages are accumulated
- Memory-efficient: no more than one active sub-array at a time for each level of resolution
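The "combine as soon as enough averages are accumulated" idea can be illustrated with a one-dimensional analogue (d = 1, so 2^d = 2 averages per step). This is not the ComputeWavelet algorithm itself, only a sketch of the same accumulation pattern: the function name is hypothetical, the input length is assumed to be a power of two, and levels are counted from the finest resolution (0) upward in this sketch.

```python
def streaming_haar(values):
    """Yield (level, detail) as soon as a pair completes, then the overall average."""
    pending = []  # pending[l]: at most one average waiting for its partner at level l
    for x in values:
        carry, level = float(x), 0
        while True:
            if level == len(pending):
                pending.append(None)
            if pending[level] is None:
                pending[level] = carry          # wait for the partner value
                break
            first = pending[level]
            pending[level] = None               # slot is free again
            yield (level, (first - carry) / 2)  # detail coefficient
            carry = (first + carry) / 2         # average moves up one level
            level += 1
    remaining = [p for p in pending if p is not None]
    if remaining:
        yield ("average", remaining[-1])        # top-level average


for item in streaming_haar([2, 2, 5, 7]):
    print(item)
```

Only one pending value per level is kept in memory at any time, which mirrors the slide's point about having no more than one active sub-array per level of resolution.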
23Processing Relational Queries in Wavelet Coefficient Domain
[Figure: two equivalent routes from the wavelet-coefficient synopses to the approximate result]
- Wavelet domain: synopses W_T1, W_T2, ..., W_Tk -> op(W_T1, ..., W_Tk) -> result set of wavelet coefficients W_S -> render(W_S) -> approximate result relation S
- Relational domain: render(W_T1, ..., W_Tk) -> approximate relations T1, T2, ..., Tk -> op(T1, T2, ..., Tk) -> approximate result relation S
24Selection Operator
- Our selection operator has the general form select_pred(W_T), where pred represents a generic conjunctive predicate on a subset of the d attributes in T; that is,
- pred = (l_i1 ≤ X_i1 ≤ h_i1) ∧ ... ∧ (l_ik ≤ X_ik ≤ h_ik), where l_ij and h_ij denote the low and high boundaries of the selected range along each selection dimension D_ij, j = 1, 2, ..., k, k ≤ d.
25Selection - Relational Domain
[Figure: a relation and its joint data distribution array over dimensions D1 and D2, with the query range marked]
- In the relational domain, we are interested in only those cells inside the query range
- In the wavelet domain, we are interested in only the coefficients that contribute to those cells
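The wavelet-domain view of selection can be sketched as a filter that keeps only coefficients whose support region overlaps the query range. This is a simplified illustration, not the paper's full operator: it only filters and does not adjust the retained coefficients in any way. The dictionary keys `lo`, `hi` and `v`, the function names, and the sample coefficients (the 1-D values 4, -2, 0, -1 with assumed support ranges) are all hypothetical.

```python
def overlaps(lo, hi, q_lo, q_hi):
    """True if [lo, hi] and [q_lo, q_hi] intersect along every dimension."""
    return all(l <= qh and ql <= h for l, h, ql, qh in zip(lo, hi, q_lo, q_hi))


def select_coefficients(coeffs, q_lo, q_hi):
    """Keep coefficients that contribute to at least one cell in the query range."""
    return [w for w in coeffs if overlaps(w["lo"], w["hi"], q_lo, q_hi)]


coeffs = [
    {"lo": [0], "hi": [3], "v": 4.0},   # overall average
    {"lo": [0], "hi": [3], "v": -2.0},  # coarsest detail
    {"lo": [0], "hi": [1], "v": 0.0},   # detail covering cells 0..1
    {"lo": [2], "hi": [3], "v": -1.0},  # detail covering cells 2..3
]
# Query range: cells 2..3 along the single dimension.
kept = select_coefficients(coeffs, [2], [3])
print([w["v"] for w in kept])  # [4.0, -2.0, -1.0]
```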
26Projection Operator
27Projection- Wavelet Domain
28Join Operator
29Join Operator- Wavelet Domain
30Experimental Study
- Improved answer quality
- Low synopsis construction costs
- Fast query execution
31Query Execution Times
32SELECT-JOIN-SUM
33SELECT Query errors on real-life data
34Conclusion
- Multidimensional wavelets are an effective tool for general-purpose approximate query processing in modern, high-dimensional applications
- The query processing algorithms operate directly on the wavelet-coefficient synopses of relational data, thus allowing for very fast processing of arbitrarily complex queries entirely in the wavelet-coefficient domain
- An extensive experimental study with synthetic as well as real-life data sets verifies the effectiveness of the wavelet-based approach compared to both sampling and histograms
35Thank you