Data Streams Part 3: Approximate Query Evaluation

1
Data Streams Part 3: Approximate Query Evaluation
  • Reynold Cheng
  • 23rd July, 2002

2
Approximate Query Evaluation
  • Why?
  • Handling load: streams arrive too fast
  • The data stream is archived in an off-site data
    warehouse, so accessing archived data is expensive
  • Avoid unbounded storage and computation
  • Ad hoc queries need an approximate history
  • Try to look at each data item only once and in a
    fixed order

3
Synopses
  • Queries may access or aggregate past data
  • Need bounded-memory history-approximation
  • Synopsis?
  • Succinct summary of old stream tuples
  • Like indexes/materialized-views, but base data is
    unavailable
  • Examples
  • Sliding Windows
  • Samples
  • Sketches
  • Histograms
  • Wavelet representation

4
Sketching Techniques
  • Self-Join Size Estimation
  • Stream of values from $dom(A) = \{1, 2, \ldots, n\}$
  • Let $f(i)$ = frequency of value $i$
  • Consider $SJ(A) = \sum_{i \in dom(A)} f(i)^2$, or Gini's index
    of homogeneity.
  • Useful in parallel DB applications, and for error
    estimation in query result size estimation and
    access plan costs.
  • Equivalent query: $Q = \mathrm{COUNT}(R \bowtie_A R)$

5
Evaluating $SJ(A) = \sum_i f(i)^2$
  • To maintain the sum exactly, keep a counter f(i) for each
    value i in the domain: $\Omega(|dom(A)|)$ bits of storage
  • This lower bound holds for any deterministic algorithm
  • Question: can SJ(A) be estimated in sub-linear space,
    $O(\log |dom(A)|)$?
  • A randomized technique offers strong
    probabilistic guarantees on the quality of the SJ(A) estimate
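
As a baseline for the sketching slides that follow, the exact approach (one counter per distinct value) might look like the sketch below. The class name `ExactSelfJoinSize` and the sample stream are hypothetical; the update rule uses the identity (f+1)^2 - f^2 = 2f + 1.

```python
from collections import defaultdict


class ExactSelfJoinSize:
    """Maintains SJ(A) = sum_i f(i)^2 exactly, one counter per distinct value."""

    def __init__(self):
        self.freq = defaultdict(int)  # f(i) for every value seen so far
        self.sj = 0                   # running value of sum_i f(i)^2

    def update(self, value):
        f = self.freq[value]
        # (f + 1)^2 - f^2 = 2f + 1, so the sum can be updated in O(1) time.
        self.sj += 2 * f + 1
        self.freq[value] = f + 1


counter = ExactSelfJoinSize()
for v in [1, 2, 1, 3, 1, 2]:   # hypothetical stream: f(1)=3, f(2)=2, f(3)=1
    counter.update(v)
print(counter.sj)  # 14
```

The memory used grows with the number of distinct values, which is the cost the randomized technique avoids.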

6
Self-Join Size Estimation
  • AMS Technique (randomized sketches)
  • Given $(f(1), f(2), \ldots, f(n))$
  • Generate a family of 4-wise independent binary
    random variables $\{Z_i : i = 1, \ldots, |dom(A)|\}$
  • Each $Z_i$ is random in $\{-1, +1\}$
  • $P[Z_i = +1] = P[Z_i = -1] = 1/2$, so $E[Z_i] = 0$
  • By using tools like orthogonal arrays, such
    families can be constructed online using
    $O(\log |dom(A)|)$ space.
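
One common way to get such a family from a small seed is sketched below. Note the substitution: it uses a random cubic polynomial over a prime field rather than the orthogonal-array construction named on the slide. The class name `FourWiseSigns` and the prime `P` are assumptions for illustration. The polynomial values are 4-wise independent over the field; mapping them to a sign by parity leaves a bias of about 1/P, so the result is only approximately uniform.

```python
import random

P = 2_147_483_647  # Mersenne prime 2^31 - 1; assumes every value i is < P


class FourWiseSigns:
    """Sign family Z_i in {-1, +1} generated from four random coefficients."""

    def __init__(self, rng):
        # The seed is just four field elements: O(log P) bits in total.
        self.coeffs = [rng.randrange(P) for _ in range(4)]

    def sign(self, i):
        h = 0
        for c in self.coeffs:      # Horner evaluation of a cubic mod P
            h = (h * i + c) % P
        return 1 if h & 1 else -1  # parity bit -> sign (bias about 1/P)


family = FourWiseSigns(random.Random(1))
signs = [family.sign(i) for i in range(1, 11)]
```

Each `sign(i)` is recomputed on demand from the seed, so no per-value table has to be stored.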

7
Self-Join Size Estimation
  • Define $X = \sum_{i \in dom(A)} f(i) Z_i$ (X is incrementally
    computable)
  • Theorem: $E[X^2] = \sum_i f(i)^2$
  • Cross terms $f(i) Z_i \cdot f(j) Z_j$ with $i \ne j$ have 0 expectation
  • Square terms: $f(i) Z_i \cdot f(i) Z_i = f(i)^2$
  • Hence, $X^2$ is an unbiased estimator for SJ(A)
  • Space: $\log N$, where $N = \sum_i f(i)$
  • Independent samples $X_k$ reduce variance
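
The theorem follows from a short expansion, written out here as a sketch. Pairwise independence of the $Z_i$ is enough for the expectation; the variance bound at the end is the standard AMS result, which is where 4-wise independence is needed.

```latex
\begin{align*}
E[X^2] &= E\Big[\Big(\sum_{i} f(i) Z_i\Big)^2\Big] \\
       &= \sum_{i} f(i)^2 \, E[Z_i^2] + \sum_{i \ne j} f(i) f(j) \, E[Z_i Z_j] \\
       &= \sum_{i} f(i)^2 = SJ(A),
\end{align*}
% because Z_i^2 = 1 always, and E[Z_i Z_j] = E[Z_i] E[Z_j] = 0 for i \ne j.
\begin{equation*}
\mathrm{Var}[X^2] \le 2 \, SJ(A)^2
\end{equation*}
```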

8
Estimation Quality
  • How can independent samples $X_k$ improve the
    quality of estimation?
  • Keep $s_1 \times s_2$ i.i.d. instantiations $X_k$ of X
  • Use independent random seeds to get a family of
    $Z_i$'s for each instance
  • $s_1$ reduces variance (average $s_1$ values of $X_k^2$);
    $s_2$ boosts confidence (take the median of the $s_2$ averages)
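
A minimal end-to-end sketch of the $s_1 \times s_2$ scheme follows. For brevity it assumes fully independent signs stored in explicit tables, which costs O(|dom(A)|) memory per copy and is not the small-seed 4-wise family; the function name `ams_estimate`, the seeds, and the synthetic stream are all hypothetical.

```python
import random
import statistics
from collections import Counter


def ams_estimate(stream, domain, s1, s2, seed=0):
    """Median of s2 averages, each over s1 independent copies of X^2."""
    rng = random.Random(seed)
    copies = s1 * s2
    # Assumption: explicit sign tables stand in for a 4-wise independent family.
    signs = [{v: rng.choice((-1, 1)) for v in domain} for _ in range(copies)]
    x = [0] * copies
    for value in stream:                 # single pass over the stream
        for k in range(copies):
            x[k] += signs[k][value]      # X_k += Z_value for copy k
    squares = [xk * xk for xk in x]
    means = [sum(squares[g * s1:(g + 1) * s1]) / s1 for g in range(s2)]
    return statistics.median(means)


domain = range(1, 11)
data_rng = random.Random(42)
stream = [data_rng.choice(domain) for _ in range(1000)]
exact = sum(c * c for c in Counter(stream).values())
approx = ams_estimate(stream, domain, s1=64, s2=5)
relative_error = abs(approx - exact) / exact
```

Raising `s1` tightens each average around SJ(A); raising `s2` makes it less likely that the median lands on a bad average.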

9
Sample Run of AMS
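
A small run can be reproduced with the sketch below, which uses a hypothetical six-item stream over a three-value domain. It shows one run with a fixed sign assignment, then enumerates all 8 sign assignments (which is the same as drawing fully independent uniform signs) to confirm that the average of $X^2$ equals SJ(A).

```python
from itertools import product

stream = [1, 2, 1, 3, 1, 2]   # hypothetical stream: f(1)=3, f(2)=2, f(3)=1
domain = [1, 2, 3]            # SJ(A) = 9 + 4 + 1 = 14


def run_sketch(stream, z):
    """One pass over the stream: X accumulates Z_value for each arriving item."""
    x = 0
    for value in stream:
        x += z[value]
    return x


# One concrete run: Z_1 = +1, Z_2 = -1, Z_3 = +1 gives X = 3 - 2 + 1 = 2.
single_x = run_sketch(stream, {1: 1, 2: -1, 3: 1})

# Average X^2 over every possible sign assignment.
squares = []
for signs in product((-1, 1), repeat=len(domain)):
    z = dict(zip(domain, signs))
    squares.append(run_sketch(stream, z) ** 2)

mean_square = sum(squares) / len(squares)
print(single_x ** 2, mean_square)  # 4 14.0
```

A single $X^2$ (here 4) can be far from 14; only the average is unbiased, which is why several independent copies are kept.
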
10
Comments on AMS
  • The self-join size can be computed on-line
  • Sufficiently small variance (controlled by s1 and
    s2)
  • Can estimate SJ(A) in one pass
  • Probabilistic guarantee: a relative error of at
    most $\epsilon$ with probability at least $1 - \delta$
  • Can this method be extended for other queries?

11
Complex Aggregate Queries
  • A. Dobra et al. extend the idea of AMS to provide
    approximate answers to complex aggregate queries.
  • SELECT AGG FROM R1, R2, ..., Rr WHERE $\mathcal{E}$
  • AGG is COUNT, SUM, or AVERAGE
  • $\mathcal{E}$ is a conjunction of equality joins
    ($R_i.A_j = R_k.A_l$)
  • The error of these estimates is at most $\epsilon$ with
    probability at least $1 - \delta$.

12
Attribute Constraint on $\mathcal{E}$
  • Each attribute of a relation appears at most once
    in the join condition $\mathcal{E}$.
  • For any attribute $R_i.A_j$ occurring $m > 1$ times in
    $\mathcal{E}$, add $m - 1$ new attributes to $R_i$, each a copy of $A_j$
  • Replace $m - 1$ occurrences in $\mathcal{E}$ with the new
    attributes
  • $\mathcal{E} = (R_1.A_1 = R_2.A_1) \wedge (R_1.A_1 = R_3.A_1)$ becomes
  • $(R_1.A_1 = R_2.A_1) \wedge (R_1.A_2 = R_3.A_1)$
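
The rewriting step can be sketched as follows. The function name `make_attributes_unique` and the `_copyN` naming of the new attributes are hypothetical; the slide's example simply calls the new attribute A2.

```python
from collections import defaultdict


def make_attributes_unique(constraints):
    """Rewrite equality constraints so each (relation, attribute) occurs once.

    `constraints` is a list of pairs ((rel, attr), (rel, attr)). A repeated
    occurrence is replaced by a fresh attribute that stands for a copy of it.
    """
    seen = defaultdict(int)
    rewritten = []
    for left, right in constraints:
        new_pair = []
        for rel, attr in (left, right):
            count = seen[(rel, attr)]
            seen[(rel, attr)] += 1
            # First occurrence keeps its name; later ones get a copy name.
            name = attr if count == 0 else f"{attr}_copy{count}"
            new_pair.append((rel, name))
        rewritten.append(tuple(new_pair))
    return rewritten


condition = [(("R1", "A1"), ("R2", "A1")), (("R1", "A1"), ("R3", "A1"))]
unique_condition = make_attributes_unique(condition)
# Second constraint now reads R1.A1_copy1 = R3.A1
```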

13
The COUNT Query
  • $Q_{COUNT}$ = number of tuples in the cross-product of
    R1, ..., Rr that satisfy the equality constraints in
    $\mathcal{E}$.
  • SELECT COUNT(*) FROM R1, ..., Rr WHERE $\mathcal{E}$
  • Approach: express the above query using
    mathematical operators.
  • Rename the 2n join attributes in $\mathcal{E}$ to
    $A_1, A_2, \ldots, A_{2n}$ such that each constraint in
    $\mathcal{E}$ is $A_j = A_{n+j}$ for $1 \le j \le n$.

14
Projection of Domains
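  • $D = dom(A_1) \times dom(A_2) \times \cdots \times dom(A_{2n})$
  • $S_k$ is the set of join attributes appearing in $R_k$
  • $D_k = dom(A_{k_1}) \times dom(A_{k_2}) \times \cdots \times dom(A_{k_{|S_k|}})$,
    where $A_{k_1}, \ldots, A_{k_{|S_k|}}$ are the attributes in $S_k$
  • $D_k$ is a projection of D on the attributes in $S_k$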

15
Value Assignment I
  • An assignment I assigns values to (a subset of)
    join attributes
  • If $I \in D$, each join attribute j is assigned a value
    $I[j]$ by I.
  • If $I \in D_k$, then I only assigns a value $I[j]$ to
    attributes $j \in S_k$.
  • Use the letter j to refer to attribute $A_j$

16
Value Assignment I
  • $I[S_k]$ is the projection of I on the attributes in $S_k$.
  • $f_k(I)$ is the number of tuples in $R_k$ that match I.

17
Example
  • SELECT COUNT(*) FROM R1, R2, R3 WHERE R1.A1 =
    R2.A1 AND R2.A2 = R3.A1
  • Renaming: SELECT COUNT(*) FROM R1, R2, R3 WHERE
    A1 = A3 AND A2 = A4
  • A1 = R1.A1, A2 = R2.A2
  • A3 = R2.A1, A4 = R3.A1
  • Note: joins are in the form $A_j = A_{n+j}$ (here n = 2)
  • $S_1 = \{1\}$, $S_2 = \{2, 3\}$, $S_3 = \{4\}$

18
Example (cont.)
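  • $D = dom(A_1) \times dom(A_2) \times dom(A_3) \times dom(A_4)$
  • $I_1 = \langle 1, 3, 2, 6 \rangle \in D$; $I_1[3] = 2$
  • $D_1 = dom(A_1)$; $D_2 = dom(A_2) \times dom(A_3)$;
    $D_3 = dom(A_4)$
  • $I_1[S_1] = \langle 1 \rangle$, $I_1[S_2] = \langle 3, 2 \rangle$, $I_1[S_3] = \langle 6 \rangle$
  • $f_2(\langle 3, 2 \rangle) = 5$ means 5 tuples from R2 have
    A2 = 3, A3 = 2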
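
The notation in this example can be checked with a few lines of code. This is a minimal sketch: the contents of `R2` are made up so that exactly five tuples have A2 = 3 and A3 = 2, and `project` is a hypothetical helper name.

```python
from collections import Counter

# S_k: the (1-based) join attributes carried by each relation, as in the example.
S = {1: (1,), 2: (2, 3), 3: (4,)}
I1 = (1, 3, 2, 6)   # the assignment <1, 3, 2, 6> over (A1, A2, A3, A4)


def project(assignment, attrs):
    """I[S_k]: keep only the values of the listed attributes."""
    return tuple(assignment[j - 1] for j in attrs)


projections = {k: project(I1, attrs) for k, attrs in S.items()}
# {1: (1,), 2: (3, 2), 3: (6,)}

# Hypothetical R2 contents; each tuple lists (A2, A3).
R2 = [(3, 2)] * 5 + [(3, 1), (4, 2)]
f2 = Counter(R2)            # f_2(I) = number of R2 tuples matching I
matches = f2[project(I1, S[2])]
```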

19
Exact QCOUNT
  • $Q_{COUNT} = \sum_{I \in D,\ \forall j:\ I[j] = I[n+j]} \ \prod_{k=1}^{r} f_k(I[S_k])$
  • Product of the number of tuples in each relation that
    match a value assignment I
  • Summed over all $I \in D$ that satisfy the equi-join
    constraints $\mathcal{E}$
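
To make the formula concrete, the sketch below evaluates it literally on hypothetical toy relations for the three-relation example (A1 = A3 and A2 = A4) and compares the result with a nested-loop join. The relation contents and the three-value domain are assumptions.

```python
from collections import Counter
from itertools import product

N = 2                      # number of join-attribute pairs (A_j, A_{n+j})
DOMAIN = range(3)          # assumed common domain for all join attributes
R1 = [(0,), (0,), (1,), (2,)]           # tuples over (A1)
R2 = [(1, 0), (1, 0), (0, 1), (2, 2)]   # tuples over (A2, A3)
R3 = [(1,), (1,), (0,), (2,)]           # tuples over (A4)
S = {1: (1,), 2: (2, 3), 3: (4,)}
f = {1: Counter(R1), 2: Counter(R2), 3: Counter(R3)}   # f_k as frequency maps


def project(assignment, attrs):
    return tuple(assignment[j - 1] for j in attrs)


def q_count_formula():
    total = 0
    for I in product(DOMAIN, repeat=2 * N):
        # Keep only assignments with I[j] = I[n + j] for every pair j.
        if all(I[j - 1] == I[N + j - 1] for j in range(1, N + 1)):
            term = 1
            for k, attrs in S.items():
                term *= f[k][project(I, attrs)]
            total += term
    return total


def q_count_nested_loops():
    return sum(1 for (a1,) in R1 for (a2, a3) in R2 for (a4,) in R3
               if a1 == a3 and a2 == a4)


print(q_count_formula(), q_count_nested_loops())  # 10 10
```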

20
Approximate QCOUNT
  • Similar to self-join size!
  • Construct a random variable X that is an unbiased
    estimator for QCOUNT
  • Variance can be bounded from above
  • Use averaging and median-selection tricks to
    boost the accuracy and confidence of X
  • Achieve small relative error with high probability

21
Constructing X
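  • For each pair of join attributes (j, n+j) in $\mathcal{E}$,
    build a family of random variables
    $\{Z_{j,l} : l = 1, \ldots, |dom(A_j)|\}$, where $Z_{j,l} \in \{-1, +1\}$
  • Both attributes of a pair use the same family, i.e.
    $Z_{n+j,l} = Z_{j,l}$
  • Random variables belonging to families for
    different attribute pairs are independent
  • Space: $\sum_{j=1}^{n} O(\log |dom(A_j)|)$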

22
Constructing X (2)
  • For each relation $R_k$, define the atomic sketch
    $X_k = \sum_{I \in D_k} \left( f_k(I) \prod_{j \in S_k} Z_{j, I[j]} \right)$
  • $X = \prod_{k=1}^{r} X_k$ is an unbiased estimator of $Q_{COUNT}$
  • $E[X] = Q_{COUNT}$
  • Each $X_k$ can be efficiently computed as tuples of
    $R_k$ are streaming in.
  • Initialize $X_k$ to 0
  • For each tuple t in the $R_k$ stream, add
    $\prod_{j \in S_k} Z_{j, t[j]}$ to $X_k$
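
The streaming update can be sketched as below for the three-relation example with constraints A1 = A3 and A2 = A4. The toy streams are hypothetical (their exact COUNT is 10), and attributes j and n+j are assumed to share sign family j. Rather than sampling, the sketch enumerates all sign assignments, which is equivalent to fully independent uniform signs, to confirm that the average of X equals the exact count.

```python
from itertools import product

N = 2                      # join pairs: (A1, A3) use family 1, (A2, A4) family 2
DOMAIN_SIZE = 3            # assumed: every join attribute takes values 0..2
# Hypothetical streams; each tuple maps renamed join attribute -> value.
STREAMS = {
    1: [{1: 0}, {1: 0}, {1: 1}, {1: 2}],                            # R1: A1
    2: [{2: 1, 3: 0}, {2: 1, 3: 0}, {2: 0, 3: 1}, {2: 2, 3: 2}],    # R2: A2, A3
    3: [{4: 1}, {4: 1}, {4: 0}, {4: 2}],                            # R3: A4
}


def family_of(attr):
    """Attributes j and n + j share the sign family j."""
    return attr if attr <= N else attr - N


def atomic_sketch(stream, signs):
    x_k = 0                                   # initialize X_k to 0
    for t in stream:                          # one pass over the R_k stream
        term = 1
        for attr, value in t.items():
            term *= signs[family_of(attr)][value]
        x_k += term                           # add prod_j Z_{j, t[j]}
    return x_k


def estimate(signs):
    x = 1
    for stream in STREAMS.values():
        x *= atomic_sketch(stream, signs)     # X = prod_k X_k
    return x


def all_sign_assignments():
    for bits in product((-1, 1), repeat=N * DOMAIN_SIZE):
        yield {j: bits[(j - 1) * DOMAIN_SIZE: j * DOMAIN_SIZE]
               for j in range(1, N + 1)}


estimates = [estimate(s) for s in all_sign_assignments()]
mean_estimate = sum(estimates) / len(estimates)
print(mean_estimate)  # 10.0
```

Individual values of X vary widely around 10, so in practice many independent copies are combined by averaging and median selection.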

23
Join Graph and Acyclicity
  • A join graph is an undirected graph
  • a node for each relation, and,
  • an edge for each join attribute pair (j, n+j)
  • If the join graph is acyclic, the variance of X
    is bounded from above.
  • Many join queries are acyclic, like chain joins
    and star joins.
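
Acyclicity of a join graph can be checked with a union-find pass over its edges, as in the sketch below. The function name `is_acyclic` is hypothetical, and the graph is treated as a multigraph, so two join pairs between the same two relations count as a cycle; that reading is an assumption.

```python
def is_acyclic(num_relations, join_edges):
    """Return True if the undirected join graph has no cycle.

    Relations are numbered 1..num_relations; each edge (a, b) stands for one
    join attribute pair between relations a and b.
    """
    parent = list(range(num_relations + 1))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in join_edges:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:                # a and b already connected: cycle
            return False
        parent[root_a] = root_b
    return True


chain = [(1, 2), (2, 3)]            # R1 - R2 - R3
star = [(1, 2), (1, 3), (1, 4)]     # R1 joined with each of R2, R3, R4
triangle = [(1, 2), (2, 3), (3, 1)]
print(is_acyclic(3, chain), is_acyclic(4, star), is_acyclic(3, triangle))
# True True False
```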

24
Probabilistic Guarantees
  • If $Q_{COUNT}$ is an acyclic multi-join COUNT query over
    relations R1, ..., Rr, then a sketch of size
    $O(\sum_{j=1}^{n} \log |dom(A_j)|)$ can approximate $Q_{COUNT}$
    so that
  • the relative error of the estimate is at most $\epsilon$
    with probability at least $1 - \delta$.

25
Basic Notions of Approximation
  • For aggregate queries (e.g., SUM, COUNT),
    approximation quality can be measured by absolute
    relative error
  • |Estimated value - Actual value| / Actual value
  • Open question: for queries involving more than
    simple aggregation, how should we define
    approximation?
  • Consider $S \bowtie_B T$, with schemas S(A, B) and T(B, C)

[Figure: two tables labeled "Approximate Result" and "Actual Result"; contents not preserved]

26
Basic Notions of Approximation (2)
  • Can we accept this kind of approximation?

[Figure: two tables labeled "Actual Result" and "Approximate Result"; contents not preserved]

27
Basic Notions of Approximation (3)
  • Can we provide useful (semantically correct) but
    stale results?

[Figure: two tables labeled "Approximate Result (correct result at an earlier time)" and "Actual Result (at time t)"; contents not preserved]

28
Related Issues
  • Metric for set-valued queries
  • Composition of approximate operators
  • How is it understood/controlled by user?
  • Integrate into query language
  • Query planning and interaction with resource
    allocation
  • Accuracy-efficiency-storage tradeoff and global
    metric

29
Conclusions
  • To evaluate aggregate queries exactly, $\Omega(|dom(A)|)$
    space is needed.
  • True for any deterministic algorithm
  • We introduce randomized algorithms that use only
    $O(\sum_{j=1}^{n} \log |dom(A_j)|)$ space.
  • Only 1 pass is needed for estimation.

30
Conclusions (2)
  • Express the target query by mathematical
    operators
  • Use an unbiased estimator for the expression
  • Independent samples reduce variance
  • A probabilistic guarantee: the relative error of
    the estimate is at most $\epsilon$ with probability
    at least $1 - \delta$.
  • Question: can we use this technique to answer
    queries involving more general selection
    predicates (e.g., inequality joins)?

31
References
  • N. Alon, Y. Matias, M. Szegedy. The Space
    Complexity of Approximating the Frequency
    Moments. STOC 1996.
  • A. Dobra, M. Garofalakis, J. Gehrke, R. Rastogi.
    Processing Complex Aggregate Queries over Data
    Streams. SIGMOD 2002.

32
Thank You!