1
  • Query Approximation in DSMS
  • An Introduction by Michael Mohler

2
Limited Resources in DSMS
  • For a system with continuous queries, data may
    not arrive at a consistent rate. There may be
    periods of bursty data. However, it would be
    foolish to design a system with enough resources
    to handle (rare) bursts of data.
  • Query Approximations are used to gracefully
    handle data bursts while minimizing the overall
    system error.

3
Types of Approximation
  • Load-shedding techniques
  • Sampling
  • Value-dependent drops
  • Histogram-based techniques
  • Wavelet-based techniques

4
Load-shedding
  • The simplest way to keep the system within its capacity is to
    drop tuples. This can be done either at random or by selecting
    which tuples to drop (or keep) based on their content.
  • For random dropping, tuples should be dropped as early as
    possible in the query execution pipeline, since dropping early
    frees the most downstream resources (a minimal sketch of such a
    drop operator follows).
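
A minimal sketch of a random drop (load-shedding) operator placed early in the pipeline; the names random_drop and expensive_operator are illustrative, not from the slides.

import random

def random_drop(stream, keep_probability):
    """Random load shedder: pass each tuple through with the given
    probability and silently drop the rest."""
    for tup in stream:
        if random.random() < keep_probability:
            yield tup

# Hypothetical usage: shed 20% of the input before a costly operator.
# shedded = random_drop(input_stream, keep_probability=0.8)
# results = expensive_operator(shedded)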

5
Sample Diagram
  • In the diagram (not reproduced in this transcript), the base
    load on A is 3 (r1·1 + r2·2) and the base load on B is 4/3
    (r1·(1/3)·3 + r2·(1/3)·1), so BOTH operators are overloaded.
    The total throughput is 1/2. (A worked sketch of the load
    computation follows.)
  • By introducing a drop rate of 1/5 on stream 1 and 2/5 on
    stream 2, the load on both A and B is brought down to 1, and
    the overall throughput rises to 3/5.
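
A small sketch of how per-operator load is computed from input rates, the fraction of tuples surviving upstream operators, and per-tuple costs. The concrete costs and selectivities below are assumptions chosen to reproduce the base loads quoted above (with r1 = r2 = 1), since the slide's diagram itself is not reproduced here.

def operator_load(input_rates, survive_fractions, costs):
    """Load on one operator: for each input stream, multiply its rate by
    the fraction of tuples that survive upstream operators and by this
    operator's per-tuple processing cost, then sum over the streams."""
    return sum(r * s * c
               for r, s, c in zip(input_rates, survive_fractions, costs))

# Assumed values only: costs (1, 2) at A, selectivity 1/3 into B, costs (3, 1) at B.
print(operator_load([1, 1], [1, 1], [1, 2]))         # load on A = 3
print(operator_load([1, 1], [1/3, 1/3], [3, 1]))     # load on B = 4/3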

6
N-Dimensional Space
  • Given N input streams, the current input rates can be viewed
    together as a point (vector) in N-dimensional space, one
    coordinate per stream.
  • If operator costs and selectivities are known in advance, it is
    possible to define an N-dimensional region describing the set of
    feasible input-rate combinations. Given this feasibility
    constraint, the system seeks to maximize total throughput by
    selecting drop rates at the inputs.
  • This lends itself to a linear programming solution (one possible
    formulation is sketched below).
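
One common way to write this optimization down; the notation is mine, not from the slides (x_i is the post-drop rate admitted on stream i, r_i the offered input rate, ℓ_{j,i} the load operator j incurs per unit rate of stream i, and C_j the capacity of operator j):

\max_{x_1,\dots,x_N} \sum_{i=1}^{N} x_i
\quad \text{subject to} \quad
\sum_{i=1}^{N} \ell_{j,i}\, x_i \le C_j \ \text{ for every operator } j,
\qquad 0 \le x_i \le r_i \ \text{ for every stream } i.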

7
Linear Programming (LP) Formulation
  • LP solvers (classically the simplex algorithm) maximize a linear
    objective by walking greedily along the vertices and edges of the
    N-dimensional feasible region.
  • However, solving the full LP whenever the load changes is too
    time consuming for a real-time application (a small solver sketch
    follows).
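
A minimal sketch of the formulation above using an off-the-shelf solver (scipy.optimize.linprog); the load matrix, capacities, and offered rates are illustrative values, not taken from the earlier diagram.

import numpy as np
from scipy.optimize import linprog

# Illustrative problem: 3 input streams, 2 operators.
# load_matrix[j][i] = load placed on operator j per unit rate of stream i.
load_matrix = np.array([[1.0, 2.0, 0.5],
                        [0.5, 1.0, 2.0]])
capacity = np.array([1.0, 1.0])        # work each operator can do per time unit
offered = np.array([0.6, 0.4, 0.8])    # offered input rates r_i

# Maximize total admitted rate sum(x_i), i.e. minimize -sum(x_i),
# subject to load_matrix @ x <= capacity and 0 <= x_i <= r_i.
res = linprog(c=-np.ones(3),
              A_ub=load_matrix, b_ub=capacity,
              bounds=[(0.0, r) for r in offered])

admitted = res.x                         # post-drop input rates
drop_rates = 1.0 - admitted / offered    # fraction of each stream to shed
print(admitted, drop_rates)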

8
Staying FIT (Tatbul, Cetintemel, Zdonik, 2007)
  • The FIT (Feasible Input Table) method makes two improvements over
    solving the LP at runtime.
  • First, drop plans for a set of feasible input-rate points are
    precomputed. At runtime, when overload is detected, the system
    maps (scales) the current point in N-dimensional space onto one
    of the precomputed feasible points and applies that point's drop
    schedule (see the sketch below).
  • Second, FIT seeks to distribute the drop scheduling by having
    leaf nodes communicate their needs to their parent nodes.
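
A rough sketch of the runtime step described above: a table of precomputed feasible rate points, each paired with a drop plan, and a lookup that scales the observed rate point onto the table entry requiring the least shedding. The function and table layout are illustrative, not the data structures of the FIT paper.

def pick_plan(current_rates, feasible_table):
    """feasible_table: list of (feasible_point, drop_plan) pairs computed
    offline. Returns the drop plan of the entry the observed rates can be
    scaled onto with the least shedding. Assumes at least one positive rate."""
    best_scale, best_plan = -1.0, None
    for point, plan in feasible_table:
        # Factor by which current_rates must shrink to fit under this point.
        scale = min(p / c for p, c in zip(point, current_rates) if c > 0)
        if scale >= 1.0:
            return plan          # already feasible under this entry: no shedding needed
        if scale > best_scale:
            best_scale, best_plan = scale, plan
    return best_plan

# e.g. feasible_table = [((0.8, 0.4), {"stream1": 0.0, "stream2": 0.2}), ...]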

9
Value-Based Tuple Dropping
  • In some cases it may not be wise to drop tuples at random: some
    data may be more important than the rest (e.g., in a medical
    triage application).
  • Similarly, in stream joins, randomly dropping tuples can
    significantly degrade the quality of the join results.
    Selectively dropping tuples based on their values may produce
    less error overall (a small sketch follows).
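
One simple form of value-dependent dropping: shed from a buffered window by utility score rather than at random. The utility function and keep fraction are illustrative, application-supplied choices.

def semantic_drop(window, utility, keep_fraction):
    """Keep the highest-utility tuples in a window and drop the rest."""
    ranked = sorted(window, key=utility, reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]

# e.g. in a triage setting, keep the most severe readings:
# kept = semantic_drop(readings, utility=lambda t: t["severity"], keep_fraction=0.5)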

10
Bloom Filters
  • Bloom filters can be used to perform certain operations (such as
    duplicate elimination and intersection) without storing the
    previously seen data itself.
  • A Bloom filter hashes each item it has seen into a bit array of
    M bits; membership tests check those bit positions, allowing a
    tunable rate of false positives (and no false negatives). A
    minimal sketch follows.
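
A minimal Bloom filter sketch (M bits, k hash functions derived from a salted SHA-256), used here for duplicate elimination over a stream; the class and parameter names are illustrative.

import hashlib

class BloomFilter:
    """Bit array of m bits probed by k hash functions. Lookups may return
    false positives but never false negatives."""
    def __init__(self, m_bits, k_hashes):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# Duplicate elimination without storing the tuples themselves:
# bf = BloomFilter(m_bits=8192, k_hashes=4)
# for tup in stream:
#     if tup not in bf:
#         bf.add(tup)
#         emit(tup)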

11
Multidimensional Histograms
  • Input data can be summarized as an approximate N-dimensional
    histogram and queried later without storing the individual
    tuples (a small sketch follows).
  • However, histograms scale poorly in higher dimensions, as the
    storage needed for the buckets becomes prohibitive.
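
A small sketch: summarize a two-dimensional stream as a fixed-size histogram instead of storing every tuple, then answer an approximate count query from the bucket counts. The bin counts, value ranges, and query box are illustrative.

import numpy as np

points = np.random.rand(10_000, 2)           # stand-in for arriving (x, y) tuples
hist, edges = np.histogramdd(points, bins=(16, 16), range=[(0, 1), (0, 1)])

# Approximate count of tuples with x in [0.25, 0.5) and y in [0.5, 0.75):
x_sel = (edges[0][:-1] >= 0.25) & (edges[0][:-1] < 0.5)
y_sel = (edges[1][:-1] >= 0.5) & (edges[1][:-1] < 0.75)
approx_count = hist[np.ix_(x_sel, y_sel)].sum()
print(approx_count)                          # should be close to 10_000 * 0.25 * 0.25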

12
Wavelets
  • Wavelets are a mathematical tool for decomposing a complex
    function into a set of simpler basis functions.
  • The input data can be compacted by keeping only the most
    significant wavelet coefficients, and queries can later be
    answered approximately from this compact synopsis (a small
    sketch follows).
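
A small hand-rolled Haar wavelet sketch (not the algorithm of Chakrabarti et al.): decompose a signal, zero out small detail coefficients to get a compact synopsis, and reconstruct an approximation to query. The signal values and threshold are illustrative.

import numpy as np

def haar_decompose(data):
    """Full (unnormalized) Haar decomposition of a length-2^k array:
    returns per-level detail coefficients plus the final overall average."""
    coeffs, current = [], np.asarray(data, dtype=float)
    while len(current) > 1:
        coeffs.append((current[0::2] - current[1::2]) / 2.0)   # details
        current = (current[0::2] + current[1::2]) / 2.0        # averages
    coeffs.append(current)
    return coeffs

def haar_reconstruct(coeffs):
    current = coeffs[-1]
    for detail in reversed(coeffs[:-1]):
        upsampled = np.empty(2 * len(current))
        upsampled[0::2] = current + detail
        upsampled[1::2] = current - detail
        current = upsampled
    return current

signal = np.array([2.0, 2.0, 0.0, 2.0, 3.0, 5.0, 4.0, 4.0])
coeffs = haar_decompose(signal)
for level in coeffs[:-1]:
    level[np.abs(level) < 1.0] = 0.0      # keep only the larger coefficients
print(haar_reconstruct(coeffs))           # approximate version of the original signal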

13
References
  • Tatbul, N., Cetintemel, U., and Zdonik, S. Staying FIT: Efficient
    load shedding techniques for distributed stream processing. In
    Proceedings of the 33rd International Conference on Very Large
    Data Bases (VLDB), 2007.
  • Babcock, B., Datar, M., and Motwani, R. Load shedding for
    aggregation queries over data streams. In Proceedings of the 20th
    International Conference on Data Engineering (ICDE), 2004.
  • Bloom, B. H. Space/time trade-offs in hash coding with allowable
    errors. Communications of the ACM, 1970.
  • Chakrabarti, K., Garofalakis, M., Rastogi, R., and Shim, K.
    Approximate query processing using wavelets. The VLDB Journal,
    Vol. 10, No. 2, 2001.
  • Motwani, R., Widom, J., Arasu, A., Babcock, B., Babu, S., Datar,
    M., Manku, G., Olston, C., Rosenstein, J., and Varma, R. Query
    processing, resource management, and approximation in a data
    stream management system. In Proceedings of the First Biennial
    Conference on Innovative Data Systems Research (CIDR), 2003.
  • Das, A., Gehrke, J., and Riedewald, M. Semantic approximation of
    data stream joins. IEEE Transactions on Knowledge and Data
    Engineering, 2005.
  • Jain, N., Yalagandula, P., Dahlin, M., and Zhang, Y. Self-tuning,
    bandwidth-aware monitoring for dynamic data streams. In
    Proceedings of the 25th IEEE International Conference on Data
    Engineering (ICDE), 2009.
  • Thaper, N., Guha, S., Indyk, P., and Koudas, N. Dynamic
    multidimensional histograms. In Proceedings of the 2002 ACM
    SIGMOD International Conference on Management of Data, 2002.