Data Streams Part 3: Approximate Query Evaluation

1
Data Streams Part 3: Approximate Query Evaluation

Reynold Cheng
23rd July, 2002

2
Approximate Query Evaluation

Why?
Handling load: streams arrive too fast
The data stream is archived in an off-site data warehouse, so access to archived data is expensive
Avoid unbounded storage and computation
Ad hoc queries need approximate history
Try to look at each data item only once, in a fixed order

3
Synopses

Queries may access or aggregate past data
Need bounded-memory history-approximation
Synopsis?
Succinct summary of old stream tuples
Like indexes/materialized-views, but base data is
unavailable
Examples
Sliding Windows
Samples
Sketches
Histograms
Wavelet representation

4
Sketching Techniques

Self-Join Size Estimation
Stream of values from dom(A) = {1, 2, ..., n}
Let f(i) = frequency of value i
Consider SJ(A) = Σ_{i ∈ dom(A)} f(i)^2, or Gini's index of homogeneity
Useful in parallel DB applications, error estimation in query result size estimation, and access plan costs
Equivalent query: Q = COUNT(R ⋈_A R)
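
As a baseline for the definition above, the exact self-join size can be computed by keeping one counter per distinct value. This is a minimal sketch; the function name `self_join_size` and the sample stream are illustrative assumptions, not taken from the slides.

```python
from collections import Counter

def self_join_size(stream):
    """Exact SJ(A): sum of squared frequencies, one counter per distinct value."""
    freq = Counter(stream)                      # f(i) for every value i seen
    return sum(c * c for c in freq.values())

# Hypothetical stream over dom(A) = {1, 2, 3}
stream = [1, 2, 2, 3, 3, 3]
print(self_join_size(stream))                   # 1^2 + 2^2 + 3^2 = 14

# The same number counts matching pairs, i.e. COUNT(R join_A R)
pairs = sum(1 for a in stream for b in stream if a == b)
print(pairs == self_join_size(stream))          # True
```

The memory used grows with the number of distinct values, which is the cost the sketching technique is meant to avoid.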

5
Evaluating SJ(A) = Σ f(i)^2

To maintain SJ(A) exactly, keep a counter f(i) for each value i in the domain D, which takes Ω(|dom(A)|) bits of storage
This is true for any deterministic algorithm
Question: can SJ(A) be estimated in sub-linear space, O(log |dom(A)|)?
Answer: a randomized technique that offers strong probabilistic guarantees on the quality of the SJ(A) estimate

6
Self-Join Size Estimation

AMS Technique (randomized sketches)
Given (f(1), f(2), ..., f(n))
Generate a family of 4-wise independent binary random variables {Zi : i = 1, ..., |dom(A)|}
Each Zi is random in {-1, +1}
P[Zi = +1] = P[Zi = -1] = 1/2 (so E[Zi] = 0)
By using tools like orthogonal arrays, such families can be constructed online using O(log |dom(A)|) space.
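
One common way to get such a family from a small seed is a random cubic polynomial over a prime field. Note that this swaps in polynomial hashing for the orthogonal-array construction named on the slide. The class name `FourWiseSigns`, the modulus `P`, and the use of the low bit as the sign are assumptions of this sketch; because `P` is odd, the sign has a tiny bias (below 2^-60).

```python
import random

P = (1 << 61) - 1  # Mersenne prime, assumed to be larger than |dom(A)|

class FourWiseSigns:
    """Z_i in {-1, +1} derived from a random cubic polynomial modulo P.

    Only the four coefficients are stored, so the whole family
    needs O(log P) bits regardless of the domain size.
    """

    def __init__(self, seed):
        rng = random.Random(seed)
        self.coeffs = [rng.randrange(P) for _ in range(4)]

    def z(self, i):
        h = 0
        for c in self.coeffs:          # Horner evaluation of the cubic at i
            h = (h * i + c) % P
        return 1 if h & 1 else -1      # low bit of the hash picks the sign

family = FourWiseSigns(seed=42)
print([family.z(i) for i in range(1, 11)])
```

Any Zi can be recomputed on demand from the seed, so nothing has to be stored per domain value.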

7
Self-Join Size Estimation

Define X = Σ_{i ∈ dom(A)} f(i) Zi (X is incrementally computable)
Theorem: E[X^2] = Σ f(i)^2
Cross terms f(i) Zi · f(j) Zj (i ≠ j) have 0 expectation
Square terms f(i) Zi · f(i) Zi = f(i)^2
Hence, X^2 is an unbiased estimator for SJ(A)
Space: O(log N + log Σ f(i))
Independent samples Xk reduce variance
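
The unbiasedness claim can be checked by expanding the square. The derivation below is a worked restatement of the cross-term and square-term bullets. It uses only Z_i^2 = 1 and pairwise independence of the Z_i, which 4-wise independence implies.

```latex
\begin{align*}
E[X^2] &= E\Big[\Big(\sum_{i} f(i)\, Z_i\Big)^2\Big] \\
       &= \sum_{i} f(i)^2 \, E[Z_i^2] + \sum_{i \neq j} f(i)\, f(j)\, E[Z_i Z_j] \\
       &= \sum_{i} f(i)^2 \cdot 1 + \sum_{i \neq j} f(i)\, f(j)\, E[Z_i]\, E[Z_j] \\
       &= \sum_{i} f(i)^2 = SJ(A).
\end{align*}
```

The full 4-wise independence is what bounds the variance of X^2 (by 2 SJ(A)^2 in the AMS analysis), which is why averaging a modest number of copies is enough.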

8
Estimation Quality

How can independent samples Xk improve the quality of estimation?
Keep s1 × s2 i.i.d. instantiations of Xk
Use independent random seeds to get a separate family of Zi's for each instance
Averaging over s1 instances reduces variance; taking the median of s2 such averages boosts confidence
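
The s1 × s2 scheme can be sketched as a one-pass estimator: maintain s1 · s2 independent counters X, square them, average within each group of s1, and take the median of the s2 group averages. The names `make_sign` and `ams_estimate`, the modulus `P`, and the cubic-polynomial sign family are assumptions of this sketch, not the presenter's implementation.

```python
import random
from statistics import mean, median

P = (1 << 61) - 1  # prime modulus for the sign family (assumed > |dom(A)|)

def make_sign(rng):
    """Return z(i) in {-1, +1}, generated from a random cubic modulo P."""
    a = [rng.randrange(P) for _ in range(4)]
    def z(i):
        h = ((a[0] * i + a[1]) * i + a[2]) * i + a[3]
        return 1 if (h % P) & 1 else -1
    return z

def ams_estimate(stream, s1, s2, seed=0):
    """Estimate SJ(A) in one pass: median of s2 means of s1 copies of X^2."""
    rng = random.Random(seed)
    signs = [make_sign(rng) for _ in range(s1 * s2)]
    xs = [0] * (s1 * s2)
    for v in stream:                        # single pass over the stream
        for k, z in enumerate(signs):
            xs[k] += z(v)                   # X_k += Z_v for this instance
    squares = [x * x for x in xs]
    groups = [squares[g * s1:(g + 1) * s1] for g in range(s2)]
    return median(mean(g) for g in groups)

# Hypothetical stream with frequencies 50, 100, 150, so SJ(A) = 35000
stream = [1, 2, 2, 3, 3, 3] * 50
print(ams_estimate(stream, s1=64, s2=5, seed=1))
```

Larger s1 tightens each group average around SJ(A); larger s2 makes it less likely that the median lands on a bad group.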

9
Sample Run of AMS

10
Comments on AMS

The self-join size can be computed on-line
Sufficiently small variance (controlled by s1 and
s2)
Can estimate SJ(A) in one pass
Probabilistic guarantee: a relative error of at most ε with probability at least 1 - δ
Can this method be extended for other queries?

11
Complex Aggregate Queries

A. Dobra et al. extend the idea of AMS to provide approximate answers to complex aggregate queries.
SELECT AGG FROM R1, R2, ..., Rr WHERE ℰ
AGG = COUNT / SUM / AVERAGE
ℰ = conjunction of equality joins (Ri.Aj = Rk.Al)
The relative error of these estimates is at most ε with probability at least 1 - δ.

12
Attribute Constraint of ℰ

Each attribute of a relation appears at most once in the join condition ℰ.
For any attribute Ri.Aj occurring m > 1 times in ℰ, add m - 1 new attributes to Ri, each a copy of Ri.Aj
Replace m - 1 of the occurrences in ℰ with the new attributes
ℰ = ((R1.A1 = R2.A1) ∧ (R1.A1 = R3.A1)) becomes ((R1.A1 = R2.A1) ∧ (R1.A2 = R3.A1))

13
The COUNT Query

QCOUNT = number of tuples in the cross-product of R1, ..., Rr that satisfy the equality constraints in ℰ.
SELECT COUNT(*) FROM R1, ..., Rr WHERE ℰ
Approach: express the above query using mathematical operators.
Rename the 2n join attributes in ℰ to A1, A2, ..., A2n such that each constraint in ℰ is Aj = A_{n+j} for 1 ≤ j ≤ n.

14
Projection of Domains

D = dom(A1) × dom(A2) × ... × dom(A2n)
Sk is the set of join attributes appearing in Rk
Dk = dom(A_{k1}) × dom(A_{k2}) × ... × dom(A_{k|Sk|})
A_{k1}, ..., A_{k|Sk|} are the attributes in Sk
Dk is the projection of D on the attributes in Sk

15
Value Assignment I

An assignment I assigns values to (a subset of) the join attributes
If I ∈ D, each join attribute j is assigned a value I[j] by I.
If I ∈ Dk, then I assigns a value I[j] only to attributes j ∈ Sk.
The letter j refers to attribute Aj

16
Value Assignment I

I[Sk] is the projection of I on the attributes in Sk.
fk(I) is the number of tuples in Rk that match I.

17
Example

SELECT COUNT(*) FROM R1, R2, R3 WHERE R1.A1 = R2.A1 AND R2.A2 = R3.A1
Renaming: SELECT COUNT(*) FROM R1, R2, R3 WHERE A1 = A3 AND A2 = A4
A1 = R1.A1, A2 = R2.A2
A3 = R2.A1, A4 = R3.A1
Note: joins are in the form Aj = A_{n+j}
S1 = {1}, S2 = {2, 3}, S3 = {4}

18
Example (cont.)

D = dom(A1) × dom(A2) × dom(A3) × dom(A4)
I1 = <1, 3, 2, 6> ∈ D, I1[3] = 2
D1 = dom(A1), D2 = dom(A2) × dom(A3), D3 = dom(A4)
I1[S1] = <1>, I1[S2] = <3, 2>, I1[S3] = <6>
f2(<3, 2>) = 5 means 5 tuples from R2 have A2 = 3 and A3 = 2

19
Exact QCOUNT

QCOUNT = Σ_{I ∈ D, ∀j: I[j] = I[n+j]} Π_{k=1}^{r} fk(I[Sk])
Product of the number of tuples in each relation that match a value assignment I
Summed over all I ∈ D that satisfy the equi-join constraints ℰ
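
The sum-of-products formula can be evaluated literally on a toy instance of the three-relation chain join from the example. The relation contents below are made up for illustration, and the variable names (`f1`, `f2`, `f3`, `doms`) are assumptions of this sketch.

```python
from collections import Counter
from itertools import product

# Hypothetical relations for R1.A1 = R2.A1 AND R2.A2 = R3.A1
R1 = [(1,), (1,), (2,)]           # tuples (R1.A1)
R2 = [(1, 7), (2, 7), (2, 8)]     # tuples (R2.A1, R2.A2)
R3 = [(7,), (7,), (9,)]           # tuples (R3.A1)

# Renamed attributes: A1 = R1.A1, A2 = R2.A2, A3 = R2.A1, A4 = R3.A1
f1 = Counter(a1 for (a1,) in R1)              # f1(<A1>)
f2 = Counter((a2, a3) for (a3, a2) in R2)     # f2(<A2, A3>)
f3 = Counter(a4 for (a4,) in R3)              # f3(<A4>)

# dom(A1..A4), restricted to values that occur (all others have frequency 0)
doms = [set(f1), {a2 for a2, _ in f2}, {a3 for _, a3 in f2}, set(f3)]

n = 2
qcount = 0
for I in product(*doms):                          # every assignment I in D
    if all(I[j] == I[n + j] for j in range(n)):   # Aj = A(n+j), 0-based here
        qcount += f1[I[0]] * f2[(I[1], I[2])] * f3[I[3]]

# Cross-check against a brute-force evaluation of the join
brute = sum(1 for (a,) in R1 for (b, c) in R2 for (d,) in R3
            if a == b and c == d)
print(qcount, brute)  # 6 6
```

Enumerating D is only feasible for tiny domains; the point is that the formula and the join count agree.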

20
Approximate QCOUNT

Similar to self-join size!
Construct a random variable X that is an unbiased
estimator for QCOUNT
Variance can be bounded from above
Use averaging and median-selection tricks to
boost the accuracy and confidence of X
Achieve small relative error with high probability

21
Constructing X

For each pair of join attributes (j, n+j) in ℰ, build a family of random variables {Z_{j,l} : l = 1, ..., |dom(Aj)|}, where Z_{j,l} ∈ {-1, +1}
Random variables belonging to families for different attribute pairs are independent
Space: Σ_{j=1}^{n} O(log |dom(Aj)|)

22
Constructing X (2)

For each relation Rk, define the atomic sketch Xk = Σ_{I ∈ Dk} ( fk(I) · Π_{j ∈ Sk} Z_{j,I[j]} ), where attributes j and n+j share the same family
X = Π_{k=1}^{r} Xk is an unbiased estimator of QCOUNT
E(X) = QCOUNT
Each Xk can be efficiently computed as tuples of Rk are streaming in.
Initialize Xk to 0
For each tuple t in the Rk stream, add Π_{j ∈ Sk} Z_{j,t[j]} to Xk
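
The unbiasedness of X can be checked exactly on a toy chain join by averaging X over every possible sign assignment. Fully independent signs are used here, which is stronger than what the technique requires, and exhaustive averaging is only possible because the domains are tiny. The relations and the helper name `atomic_sketches` are hypothetical.

```python
from itertools import product

# Hypothetical relations for R1.A1 = R2.A1 AND R2.A2 = R3.A1
R1 = [(1,), (1,), (2,)]           # tuples (R1.A1)
R2 = [(1, 7), (2, 7), (2, 8)]     # tuples (R2.A1, R2.A2)
R3 = [(7,), (7,), (9,)]           # tuples (R3.A1)

def atomic_sketches(z1, z2):
    """Build X1, X2, X3; z1 serves pair (A1, A3), z2 serves pair (A2, A4)."""
    x1 = x2 = x3 = 0                       # initialize each Xk to 0
    for (a,) in R1:
        x1 += z1[a]                        # one join attribute in R1
    for (a, b) in R2:
        x2 += z1[a] * z2[b]                # product over both join attributes
    for (b,) in R3:
        x3 += z2[b]
    return x1, x2, x3

dom1, dom2 = [1, 2], [7, 8, 9]             # join-value domains of the two pairs
total, count = 0, 0
for s1 in product((-1, 1), repeat=len(dom1)):
    for s2 in product((-1, 1), repeat=len(dom2)):
        z1, z2 = dict(zip(dom1, s1)), dict(zip(dom2, s2))
        x1, x2, x3 = atomic_sketches(z1, z2)
        total += x1 * x2 * x3              # X = X1 * X2 * X3
        count += 1

print(total / count)  # 6.0, the exact join size for these relations
```

In a streaming setting only a few randomly seeded copies of (X1, X2, X3) would be kept, and their products combined by averaging and median selection.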

23
Join Graph and Acyclicity

A join graph is an undirected graph with
a node for each relation, and
an edge for each join attribute pair (j, n+j)
If the join graph is acyclic, the variance of X
is bounded from above.
Many join queries are acyclic, like chain joins
and star joins.
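
Whether a join graph is acyclic can be tested with a union-find pass over its edges. This is a minimal sketch: the function name `is_acyclic` and the edge-list encoding are assumptions, and it treats self-loops and parallel edges as cycles, which may differ from the exact definition in the cited paper.

```python
def is_acyclic(num_relations, join_edges):
    """join_edges: one (k1, k2) pair of relation indices per join attribute pair."""
    parent = list(range(num_relations))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in join_edges:
        ra, rb = find(a), find(b)
        if ra == rb:
            return False                    # this edge closes a cycle
        parent[ra] = rb                     # merge the two components
    return True

print(is_acyclic(3, [(0, 1), (1, 2)]))           # chain join: True
print(is_acyclic(4, [(0, 1), (0, 2), (0, 3)]))   # star join: True
print(is_acyclic(3, [(0, 1), (1, 2), (2, 0)]))   # triangle: False
```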

24
Probabilistic Guarantees

If QCOUNT is an acyclic, multi-join COUNT query over relations R1, ..., Rr, then a sketch of size O(log |dom(Aj)|) can approximate QCOUNT so that the relative error of the estimate is at most ε with probability at least 1 - δ.

25
Basic Notions of Approximation

For aggregate queries (e.g., SUM, COUNT), approximation quality can be measured by absolute relative error
|Estimated value - Actual value| / Actual value
Open question: for queries involving more than simple aggregation, how should we define approximation?
Consider S ⋈_B T, with S(A, B) and T(B, C)

[Figure: approximate result vs. actual result]

26
Basic Notions of Approximation (2)

Can we accept this kind of approximation?

[Figure: actual result vs. approximate result]

27
Basic Notions of Approximation (3)

Can we provide useful (semantically correct) but
stale results?

[Figure: approximate result (correct result at an earlier time t - Δ) vs. actual result (at time t)]

28
Related Issues

Metric for set-valued queries
Composition of approximate operators
How is it understood/controlled by user?
Integrate into query language
Query planning and interaction with resource
allocation
Accuracy-efficiency-storage tradeoff and global
metric

29
Conclusions

To evaluate aggregate queries exactly, Ω(|dom(A)|) space is needed.
This is true for any deterministic algorithm.
We introduce randomized algorithms that use only O(Σ_{j=1}^{n} log |dom(Aj)|) space.
Only 1 pass is needed for estimation.

30
Conclusions (2)

Express the target query by mathematical
operators
Use an unbiased estimator for the expression
Independent samples reduce variance
A probabilistic guarantee: the relative error of the estimate is at most ε with probability at least 1 - δ.
Question: can we use this technique to answer queries involving more general selection predicates (e.g., inequality joins)?

31
References

N. Alon, Y. Matias, M. Szegedy. The Space Complexity of Approximating the Frequency Moments. STOC 1996.
A. Dobra, M. Garofalakis, J. Gehrke, R. Rastogi. Processing Complex Aggregate Queries over Data Streams. SIGMOD 2002.

32
Thank You!
