Models and Issues in Data Stream Systems - PowerPoint PPT Presentation

1
Models and Issues in Data Stream Systems
Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev
Motwani, Jennifer Widom. ACM SIGMOD/PODS, 2002
  • Adesola Omotayo
  • September 17, 2004

2
Goals
  • Motivate a new model of data processing and the
    research issues arising from it
  • Review past work relevant to data stream systems
    and current projects in that area.
  • Explore topics in stream query languages, new
    requirements and challenges in query processing,
    and algorithmic issues.

3
Presentation Outline
  • The Data Stream Model
  • Review of Data Stream Projects
  • Queries over Data Streams
  • Proposal for a DSMS
  • Algorithmic Issues

4
The Data Stream Model
  • DS vs. Stored Relational Model
  • data elements arrive online
  • system has no control over arrival order
  • data streams are unbounded
  • processed data stream elements are discarded or
    archived.
  • The model does not preclude keeping some data in
    conventional stored relations

5
Queries
  • One-time and Continuous queries
  • One-time queries
  • evaluated once over a snapshot of data set
  • Continuous queries
  • evaluated continuously
  • answers may be stored and updated or may be
    produced as data streams
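
The difference between the two query types can be sketched in a few lines of Python. This is only an illustration: the function names, the threshold predicate, and the sample readings are assumptions, not part of the presentation.

```python
from typing import Iterable, Iterator


def one_time_count(snapshot: list[int], threshold: int) -> int:
    """One-time query: evaluated once over a fixed snapshot."""
    return sum(1 for value in snapshot if value > threshold)


def continuous_count(stream: Iterable[int], threshold: int) -> Iterator[int]:
    """Continuous query: emits an updated answer after every arrival."""
    count = 0
    for value in stream:
        if value > threshold:
            count += 1
        yield count  # the answers themselves form a stream


readings = [3, 12, 7, 15]  # hypothetical data
snapshot_answer = one_time_count(readings, 10)
streamed_answers = list(continuous_count(iter(readings), 10))
```

Here the continuous query produces its answers as a stream; storing and updating a single answer in place would be the other option named on the slide.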

6
Queries (contd)
  • Predefined and Ad hoc queries
  • Predefined
  • supplied before any relevant data arrives
  • generally continuous queries
  • scheduled one-time queries possible
  • Ad hoc
  • either one-time or continuous queries
  • complicates design of data stream management
    systems

7
Motivating Examples
  • Web-based financial search engine (e.g.
    Traderbot)
  • Modern security applications (e.g. iPolicy
    Networks)
  • Web log monitoring (e.g. Yahoo)
  • Sensor monitoring (e.g. HP Data Center)
  • Network traffic management (e.g. ISPs)

8
Concrete Example
  • Fraction of backbone traffic attributable to a
    customer network (C: packets on the customer link,
    B: packets on the backbone link)
      (SELECT count(*)
       FROM C, B
       WHERE C.src = B.src AND C.dest = B.dest
         AND C.id = B.id) /
      (SELECT count(*) FROM B)
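
As a rough check on what the query computes, the same ratio can be evaluated over two finite snapshots in Python. The `Packet` type, the function name, and the sample packets are hypothetical, and the join keys are assumed to be src, dest, and id.

```python
from collections import Counter
from typing import NamedTuple


class Packet(NamedTuple):
    src: str
    dest: str
    id: int


def customer_fraction(c: list[Packet], b: list[Packet]) -> float:
    """count(C join B on src, dest, id) / count(B), over finite snapshots."""
    if not b:
        return 0.0
    c_counts = Counter(c)
    # SQL join semantics: each B tuple pairs with every equal C tuple.
    joined = sum(c_counts[p] for p in b)
    return joined / len(b)


backbone = [Packet("a", "b", 1), Packet("a", "c", 2),
            Packet("d", "e", 3), Packet("f", "g", 4)]
customer = [Packet("a", "b", 1), Packet("x", "y", 9)]
fraction = customer_fraction(customer, backbone)
```

This version keeps every customer packet in memory, which is only possible because the inputs are finite. Over unbounded streams that state cannot be held in full, which is where the approximation techniques on the later slides come in.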

9
Review of Data Stream Projects
  • Tapestry System
  • Continuous queries
  • Restricted subset of SQL
  • Alert System
  • Event-condition-action style triggers
  • Continuous queries

10
Review of Data Stream Projects (contd)
  • XFilter System
  • Efficient content-based filtering of XML
    documents
  • Continuous queries in XPath language
  • Xyleme System
  • Content-based filtering system
  • High throughput with a restricted query language

11
Review of Data Stream Projects (contd)
  • Tribeca SDB Manager
  • Restricted querying capability over network
    packet streams
  • Tangram System
  • Uses stream processing techniques to analyze
    large quantities of stored data

12
Review of Data Stream Projects (contd)
  • OpenCQ
  • Continuous queries
  • Query processing algorithm based on incremental
    view maintenance.
  • NiagaraCQ
  • Continuous queries
  • Groups continuous queries for efficient
    evaluation
  • Support of blocking operators in query plans over
    data streams

13
Review of Data Stream Projects (contd)
  • Viglas and Naughton proposed rate-based
    optimization for queries over data streams
  • Chronicle Data Model
  • Append-only ordered sequences of tuples
    (chronicles)
  • Restricted view definition language and algebra
    (chronicle algebra)
  • Views defined in chronicle algebra could be
    maintained incrementally without storing any of
    the chronicles.
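
The idea of maintaining a view incrementally without storing the underlying sequence can be sketched as follows. This is not the chronicle algebra itself; the class, its per-account sum and count, and the sample values are assumptions chosen for illustration.

```python
from collections import defaultdict


class RunningTotals:
    """Hypothetical view: per-account sum and count, updated per append."""

    def __init__(self) -> None:
        self.total = defaultdict(float)
        self.count = defaultdict(int)

    def append(self, account: str, amount: float) -> None:
        # Only the view is updated; the appended tuple is not retained.
        self.total[account] += amount
        self.count[account] += 1

    def average(self, account: str) -> float:
        return self.total[account] / self.count[account]


view = RunningTotals()
view.append("acct-1", 10.0)
view.append("acct-1", 5.0)
view.append("acct-2", 2.0)
```

Each append touches a constant amount of state, so the cost of maintaining the view does not grow with the length of the sequence.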

14
Review of Data Stream Projects (contd)
  • Seshadri, Livny, and Ramakrishnan proposed an
    algebra and a declarative query language for
    querying ordered relations (sequences)
  • Related work includes work on temporal and
    time-series databases

15
Review of Data Stream Projects (contd)
  • Materialized Views
  • Queries that need to be reevaluated or
    incrementally updated
  • Important work in this area
  • self-maintenance
  • data expiration
  • Continuous queries differ from materialized views in
    that they may
  • stream rather than store results
  • deal with append-only input data
  • provide approximate rather than exact answers
  • adapt their processing strategy as characteristics
    of data streams change

16
Review of Data Stream Projects (contd)
  • Telegraph Project
  • Adaptive query engine for volatile and
    unpredictable environments
  • Query execution strategies over data streams
    generated by sensors
  • Adaptive processing techniques for multiple
    continuous queries
  • Tukwila system
  • Supports adaptive query processing, in order to
    perform dynamic data integration over autonomous
    data sources

17
Review of Data Stream Projects (contd)
  • Aurora Project
  • Targeted towards stream monitoring applications
  • Consists of large network of triggers (data-flow
    graph)
  • Application administrators create and add
    triggers
  • Compile-time and run-time optimization of trigger
    network
  • Detects resource overload and performs load
    shedding based on application-specific measures
    of QoS

18
Queries over Data Streams
  • Unbounded Memory Requirements
  • Approximate Query Answering
  • Data reduction techniques
  • Sketches
  • Random sampling
  • Histograms
  • Wavelets
  • Approaches to approximation
  • Sliding Windows
  • Batch Processing, Sampling, and Synopses
  • Blocking Operators
  • Queries Referencing Past Data

19
Sliding Windows
  • Evaluate query over sliding window of recent data
    from streams
  • Attractive Properties
  • Well-defined and understood
  • Deterministic
  • Emphasizes recent data
  • Research Issues
  • How to define timestamps over streams
  • How to implement sliding window queries
  • What is their impact on query optimization?
  • How to give approximate answers if the window is too
    big to fit in main memory
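
A minimal sketch of one way to implement a sliding-window query, assuming a time-based window, a SUM aggregate, and elements that arrive in nondecreasing timestamp order. The class and method names are hypothetical.

```python
from collections import deque


class SlidingWindowSum:
    """Sum of values whose timestamp lies in (now - width, now]."""

    def __init__(self, width: float) -> None:
        self.width = width
        self.items = deque()  # (timestamp, value), oldest first
        self.total = 0.0

    def insert(self, timestamp: float, value: float) -> float:
        self.items.append((timestamp, value))
        self.total += value
        # Expire elements that have fallen out of the window.
        while self.items[0][0] <= timestamp - self.width:
            _, old = self.items.popleft()
            self.total -= old
        return self.total


window = SlidingWindowSum(width=10)
answers = [window.insert(0, 1.0), window.insert(5, 2.0),
           window.insert(10, 4.0), window.insert(21, 8.0)]
```

This sketch stores the whole window, so it does not address the last research issue above: when the window is too large for main memory, the exact contents would have to be replaced by an approximation.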

[Figure: stream timeline showing past data, a window over recent data, and future data]

20
Sliding Windows (contd)
  • Sequence and Temporal DB
  • Temporal DB
  • Concerned with full history of each data value
    over time
  • Sequence DB
  • Attempts to produce query plans that allow for
    stream access
  • Assumes DB system has control over which sequence
    to process tuples from next

21
Batch Processing, Sampling, and Synopses
  • Don't process every data element as it arrives
  • Two possible bottlenecks
  • Batch processing
  • Sampling
  • Synopsis data structure
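
Batch processing can be sketched as buffering arrivals cheaply and recomputing the answer only once per batch. The class below is an assumption made for illustration (a running MAX with hypothetical names `update` and `_compute_answer`), not a design taken from the slides.

```python
class BatchedMax:
    """Buffers arrivals and recomputes the answer once per batch."""

    def __init__(self, batch_size: int) -> None:
        self.batch_size = batch_size
        self.buffer = []
        self.answer = None  # answer as of the last completed batch

    def update(self, value: float) -> None:
        self.buffer.append(value)  # cheap per-element work
        if len(self.buffer) >= self.batch_size:
            self._compute_answer()

    def _compute_answer(self) -> None:
        batch_max = max(self.buffer)
        if self.answer is None or batch_max > self.answer:
            self.answer = batch_max
        self.buffer.clear()


query = BatchedMax(batch_size=3)
for v in [4, 9, 2]:
    query.update(v)
after_first_batch = query.answer
query.update(20)
stale_answer = query.answer  # 20 is buffered but not yet reflected
for v in [1, 3]:
    query.update(v)
after_second_batch = query.answer
```

Between batches the reported answer lags behind the data, which is the sense in which batching trades timeliness for throughput.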

22
Blocking Operators
  • A blocking operator cannot produce the first tuple
    of its output until it has seen its entire input
    (e.g., sorting, aggregation operators like SUM)
  • Blocking operators at the root of the query operator
    tree are more tractable than those at interior
    nodes
  • juggle operator (a non-blocking version of sort)

23
Blocking Operators (contd)
  • Tucker et al. suggested augmenting data streams
    with assertions about what can and cannot appear
    in remainder of data stream
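
A minimal sketch of how such an assertion can unblock a grouped aggregate. The encoding is an assumption: stream items are either `("tuple", daynumber, value)` or `("assert_ge", bound)`, the latter meaning that every later tuple has daynumber >= bound. The function name is hypothetical.

```python
from collections import defaultdict


def sum_by_day(stream):
    """Yield (daynumber, total) as soon as a group is known to be complete."""
    open_groups = defaultdict(float)
    for item in stream:
        if item[0] == "tuple":
            _, day, value = item
            open_groups[day] += value
        elif item[0] == "assert_ge":
            bound = item[1]
            # No later tuple can fall in a group below the bound,
            # so those groups can be emitted and dropped now.
            for day in sorted(d for d in open_groups if d < bound):
                yield day, open_groups.pop(day)
    # Groups still open here were never closed by an assertion.


events = [("tuple", 8, 1.0), ("tuple", 9, 2.0), ("tuple", 8, 3.0),
          ("assert_ge", 10), ("tuple", 10, 5.0)]
closed = list(sum_by_day(events))
```

Without the assertion the operator would have to wait for the end of the stream before emitting any group; with it, output for days 8 and 9 is released while the stream is still running.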

[Figure: stream tuples with daynumber < 10, followed by the assertion daynumber ≥ 10]

24
Queries Referencing Past Data
  • Ad hoc queries that are issued after some data
    has already been discarded may be impossible to
    answer accurately
  • Option: allow ad hoc queries to reference future
    data only
  • Option: maintain summaries of data streams (synopses
    or aggregates) that can approximate answers to
    future ad hoc queries

25
Thank You!
  • ?

to be continued