Data Stream Management Systems

Learn more at: http://web.cs.ucla.edu

1
Data Stream Management Systems
  • CS240B Notes
  • by
  • Carlo Zaniolo

2
Data Streams
  • Continuous, unbounded, rapid, time-varying
    streams of data elements
  • Occur in a variety of modern applications
  • Network monitoring and traffic engineering
  • Sensor networks, RFID tags
  • Telecom call records
  • Financial applications
  • Web logs and click-streams
  • Manufacturing processes
  • DSMS: Data Stream Management System

3
Many Research Projects
  • Amazon/Cougar (Cornell): sensors
  • Aurora (Brown/MIT): sensor monitoring, dataflow
  • Hancock (AT&T): telecom streams
  • Niagara (OGI/Wisconsin): Internet DBs & XML
  • OpenCQ (Georgia): triggers, view maintenance
  • Stream (Stanford): general-purpose DSMS
  • Tapestry (Xerox): publish/subscribe filtering
  • Telegraph (Berkeley): adaptive engine for
    sensors
  • Tribeca (Bellcore): network monitoring
  • Stream Mill (UCLA): power & extensibility
  • Gigascope (AT&T Labs): network monitoring

4
The (Simplified) Big Picture
[Figure: architecture diagram with components labeled Clients, Streamed Result, Server, DSMS, Scratch Store, Stored Relations]

5
Databases vs Data Streams
  • Database Systems
  • Model: persistent data
  • Table: set/bag of tuples
  • Updates: all
  • Query: transient
  • Query answer: exact
  • Query evaluation: multi-pass
  • Operators: blocking OK
  • Query plan: fixed
  • Data Stream Systems
  • Model: transient data
  • Infinite sequence of tuples
  • Updates: append only
  • Query: persistent
  • Query answer: often approximate
  • Query evaluation: one-pass
  • Operators: nonblocking only
  • Query plan: adaptive

6
Research Challenges
  • Data Models
  • Relational Streams first, XML streams important
    too
  • Tuple-Time Stamping
  • Order is important
  • Windows or other synopses
  • Query Languages: SQL or XQuery extensions
  • Blocking operators and Expressive Power
  • Query Plans
  • Optimized scheduling for response time or memory
  • Quality of Service (QoS): Approximation
  • Load shedding, sampling
  • Support for Advanced Applications
  • Data Stream Mining

7
Data Models
  • Relational Data Streams
  • Each data stream consists of relational tuples
  • The stream can be modelled as an append-only
    relation
  • But repetitions are allowed and order is very
    important!
  • Order based on timestamps or arrival order
  • Streaming XML Data.
  • A stream of structured SAX elements

8
Timestamps
  • Data streams are (basically) ordered according to
    their timestamps
  • The meaning of windows, unions and joins is based
    on timestamps
  • External
  • Injected by data source
  • Model real-world event represented by tuple
  • Tuples may be out-of-order, but if near-ordered
    can reorder with small buffers
  • Internal
  • Introduced as special field by the DSMS
  • Approx. based on the time they arrived
  • Missing (called latent in Stream Mill)
  • The system assigns no timestamp to arriving
    tuples,
  • But tuples are still processed as ordered
    sequences
  • By operators whose semantics expects timestamps
  • Thus operators might instantiate timestamps
    as/when needed
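
The remark that near-ordered tuples can be reordered with small buffers can be illustrated with a minimal sketch. It assumes that no tuple arrives more than `slack` time units late; the function name, the `slack` parameter, and the (timestamp, payload) tuple layout are illustrative choices, not taken from any particular DSMS.

```python
import heapq
from typing import Iterable, Iterator, Tuple


def reorder(stream: Iterable[Tuple[int, str]], slack: int) -> Iterator[Tuple[int, str]]:
    """Emit (timestamp, payload) tuples in timestamp order.

    Assumption: no tuple arrives more than `slack` time units after a
    tuple with a larger timestamp has been seen.
    """
    heap = []          # min-heap ordered by timestamp
    max_seen = None    # largest external timestamp seen so far
    for ts, payload in stream:
        heapq.heappush(heap, (ts, payload))
        max_seen = ts if max_seen is None else max(max_seen, ts)
        # A buffered tuple at or below max_seen - slack can no longer be overtaken.
        while heap and heap[0][0] <= max_seen - slack:
            yield heapq.heappop(heap)
    # End of a finite test input: flush whatever is still buffered.
    while heap:
        yield heapq.heappop(heap)
```

The buffer never holds more tuples than can arrive within `slack` time units, which is why a small buffer suffices when the stream is nearly ordered.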

9
Data Stream Query Languages
  • Continuous queries and
  • Blocking Operators

10
Query Operators: Sample Stream
Traffic (sourceIP,    -- source IP address
         sourcePort,  -- port number on source
         destIP,      -- destination IP address
         destPort,    -- port number on destination
         length,      -- length in bytes
         time)        -- time stamp
11
Blocking Query Operators
  • No output until the entire input has been
    seen, i.e., the last tuple in the input, often
    detected after we hit the EOF.
  • Stream input never ends: thus blocking
    operators cannot be used as such
  • Traditional SQL aggregates are blocking
  • Many SQL operators have DBMS implementations
    that are blocking but are not intrinsically
    blocking
  • group by, sort join can be implemented in
    blocking and nonblocking ways
  • Other operators are intrinsically blocking
  • Can we formally characterize which is which?
  • We will see that nonblocking operators are the
    monotonic ones

12
Problematic Operators for Data Streams
  • Blocking query operators, i.e., those that must
    see everything in the input before they can
    return anything in the output
  • NonBlocking query operators are those that can
    return results now, without seeing the rest of
    the stream
  • Selection and projection are nonblocking
  • Set Difference, and Traditional aggregates are
    blocking
  • Continuous aggregates are not.
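
The difference between a traditional (blocking) aggregate and a continuous (nonblocking) one can be sketched as follows. Both function names are illustrative; the point is only that the first returns after the input ends, while the second emits a result after every tuple and so also works on an unbounded stream.

```python
from typing import Iterable, Iterator


def blocking_avg(values: Iterable[float]) -> float:
    """Traditional aggregate: nothing is returned until the input is exhausted."""
    total, count = 0.0, 0
    for v in values:
        total += v
        count += 1
    return total / count


def continuous_avg(values: Iterable[float]) -> Iterator[float]:
    """Continuous aggregate: emits the running average after each tuple."""
    total, count = 0.0, 0
    for v in values:
        total += v
        count += 1
        yield total / count
```

Calling `blocking_avg` on an infinite generator would never return, whereas the first results of `continuous_avg` can be consumed immediately.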

13
Aggregate Invocation: Two Forms
G: grouping attributes; F1, F2: aggregate
expressions
  • Traditional
  • select G, F1 from S where P group by G
    having F2 op J
  • With windows (SQL:2003 OLAP functions)
  • traffic (sourceIP, sourcePort, destIP, destPort,
    length, Time)
    select sourceIP, Time, avg(length) over
    (partition by sourceIP order by Time
    rows 50 preceding)
  • Cumulative (running) window
  • ... over (partition by sourceIP order by Time
    rows unbounded preceding)
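
As a rough illustration of what the windowed invocation computes, the sketch below keeps, for each sourceIP, the current row plus the preceding rows and emits their average length for every arriving tuple. It assumes tuples arrive already ordered by Time and are reduced to (sourceIP, Time, length); the function name and the `preceding` parameter are illustrative.

```python
from collections import defaultdict, deque
from typing import Iterable, Iterator, Tuple


def windowed_avg(traffic: Iterable[Tuple[str, int, float]],
                 preceding: int = 50) -> Iterator[Tuple[str, int, float]]:
    """For each tuple, average `length` over the current row and the
    `preceding` rows before it, partitioned by sourceIP."""
    windows = defaultdict(lambda: deque(maxlen=preceding + 1))
    sums = defaultdict(float)
    for source_ip, time, length in traffic:
        win = windows[source_ip]
        if len(win) == win.maxlen:
            sums[source_ip] -= win[0]  # oldest row is evicted by the append below
        win.append(length)
        sums[source_ip] += length
        yield source_ip, time, sums[source_ip] / len(win)
```

Each arriving tuple produces one output tuple, so the operator is nonblocking, and the state per partition is bounded by the window size.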

14
Aggregate Function Properties
  • 1. Distributive: sum, count, min, max
  • 2. Algebraic: avg
  • 3. Holistic: count-distinct, median
  • 4. On-line aggregates, such as exponentially
    decaying avg
  • 5. User-Defined Aggregates (UDAs)
  • Sliding window invocation on 1 & 2: efficient
    computation for memory and CPU
  • Sliding window invocation on 3?
  • Continuous window on these? Yes, also for 5.
  • UDAs can be similar to any of those
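
The distinction between these classes can be made concrete with a small sketch. A distributive aggregate is obtained by combining sub-aggregates directly, an algebraic one from a fixed-size partial state such as (sum, count), and an on-line aggregate such as the exponentially decaying average from constant state per stream. The function names and the `alpha` decay parameter are illustrative assumptions.

```python
from typing import Iterable, Iterator, Tuple


def merge_sum(partials: Iterable[float]) -> float:
    """Distributive: the sum of partial sums is the overall sum."""
    return sum(partials)


def merge_avg(partials: Iterable[Tuple[float, int]]) -> float:
    """Algebraic: avg is derived from fixed-size (sum, count) partial states."""
    total, count = 0.0, 0
    for s, c in partials:
        total += s
        count += c
    return total / count


def decaying_avg(values: Iterable[float], alpha: float = 0.5) -> Iterator[float]:
    """On-line aggregate: exponentially decaying average with O(1) state."""
    avg = None
    for v in values:
        avg = v if avg is None else alpha * v + (1 - alpha) * avg
        yield avg
```

A holistic aggregate such as the median has no comparable fixed-size partial state, which is why its sliding-window computation is the open question on this slide.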

15
Avoiding Blocking Behavior
  • Window aggregates on a limited-size window are
    approximate and nonblocking
  • DSMS do windows of all kinds
  • Sliding windows (same as OLAP functions)
  • Tumbles: restart every new window (traditional
    definition)
  • Panes: the window is broken up into panes
  • Punctuation [Tucker, Maier, Sheard, Fegaras]
  • Assertion about future stream contents
  • Unblocks operators, reduces state
  • Constructs used for avoiding blocking are also
    useful for avoiding infinite memory
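
A minimal sketch of the pane idea, assuming count-based windows: the window is split into panes of `pane_size` tuples, each pane is reduced to one sub-aggregate, and a window result is the combination of the last `panes_per_window` sub-aggregates. The function and parameter names are illustrative, and sum stands in for any distributive aggregate.

```python
from collections import deque
from typing import Iterable, Iterator


def pane_window_sums(values: Iterable[float], pane_size: int,
                     panes_per_window: int) -> Iterator[float]:
    """Sliding-window sum that advances one pane at a time."""
    panes = deque(maxlen=panes_per_window)  # sub-aggregates of closed panes
    current, filled = 0, 0
    for v in values:
        current += v
        filled += 1
        if filled == pane_size:          # the pane is complete
            panes.append(current)        # the oldest pane drops out automatically
            current, filled = 0, 0
            if len(panes) == panes_per_window:
                yield sum(panes)
```

With `panes_per_window=1` the same code behaves as a tumbling window, since every window restarts from scratch.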

16
Joins
  • General case problematic on streams: may need to
    join arbitrarily far-apart stream tuples
  • Equijoin on timestamps is easy to compute, but
    not very useful
  • Majority of work focuses on joins between one
    stream and a window specified on the other:
    select A.sourceIP, B.sourceIP
    from Traffic1 as A window TA, Traffic2 as B
    where A.destIP = B.destIP
  • The symmetric case is also common:
    Traffic2 as B window TB
  • Multi-joins less common but possible.

17
Join of Stream S with a Table T (where T is a
DB relation or a Window on a Stream)
  • When a new tuple z with timestamp ts(z) arrives
    in S, join it with all the tuples in T.
  • The tuples so produced carry the timestamp ts(z)
  • If T is a window on a stream S
  • T must contain all the tuples up to and including
    ts(z): a cumulative window on S
  • But we do not have infinite memory so we must
    approximate T with a synopsis. E.g., 30 minutes
    preceding
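
The join described above can be sketched as follows, assuming that both inputs arrive as one merged sequence in timestamp order and that T is approximated by a time-based window over the other stream. The tuple layout (stream tag, timestamp, destIP, sourceIP), the function name, and the `window` parameter are illustrative assumptions.

```python
from collections import deque
from typing import Iterable, Iterator, Tuple

Event = Tuple[str, int, str, str]  # (stream tag "S" or "T", ts, destIP, sourceIP)


def stream_window_join(events: Iterable[Event], window: int) -> Iterator[Tuple[int, str, str]]:
    """Join every arriving S tuple with the T tuples of the last `window`
    time units on destIP; results carry the timestamp of the S tuple."""
    t_window = deque()  # T tuples as (ts, destIP, sourceIP), oldest first
    for stream, ts, dest_ip, source_ip in events:
        # Expire T tuples that fell out of the window: this is the synopsis.
        while t_window and t_window[0][0] < ts - window:
            t_window.popleft()
        if stream == "T":
            t_window.append((ts, dest_ip, source_ip))
        else:
            for _, t_dest, t_source in t_window:
                if t_dest == dest_ip:
                    yield ts, source_ip, t_source
```

With timestamps in minutes and `window=30`, this corresponds to the "30 minutes preceding" synopsis; matches with older T tuples are lost, which is the approximation the slide refers to.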

18
Multi-way Sliding Window Joins
  • Evaluation of n-way sliding window join queries
  • n streams with associated sliding windows
  • continuously evaluate the joins of all n windows
  • Two natural join strategies
  • eager: join is evaluated each time a new tuple
    arrives in any of the input streams
  • lazy: join is evaluated with some pre-specified
    frequency, e.g., every t time units
  • Computation: incremental, as in differential
    fixpoint of recursive rules.

19
Query Optimization and Scheduling
  • Scheduling to minimize response time or minimize
    memory: no real change in CPU time
  • Optimization based on sharing: query plans,
    operators, buffers, ...

20
A Query Plan
[Figure: query plan in which operators over Stream1, Stream2, and Stream3 feed the queries Q1 and Q2]
Scheduler: given a query plan and selectivity
estimates, schedule tuples through operator
chains
21
Schedulers and QoS Metrics
  • Round Robin (RR) is perhaps the most basic
  • operators in a circular queue are given a fixed
    time slice.
  • Starvation is avoided, but little adaptivity
  • FIFO: takes the first tuple in input and moves it
    through the chain
  • Minimal latency, poor memory
  • Greedy Algorithms
  • Buffers with most tuples first
  • Tuples that waited longest first
  • Operators that release more memory first

22
Memory Optimization on a Chain [Babcock, Babu,
Datar, Motwani]
[Figure: progress chart plotting net selectivity against time, from Input to Output, for operators s1, s2, s3; labels in the chart: selectivity 0.0, selectivity 0.6, selectivity 0.2, best slope, starvation point]
23
Main ideas
  • Operators are thought of as filters which
  • Operate on a set of tuples
  • Produce s tuples in return
  • s = selectivity of an operator
  • If s = 0.2 we can interpret the value in two ways
  • Out of every 10 tuples, the operator outputs 2
    tuples
  • If the input requires 1 unit of memory, the
    output will require 0.2 units of memory

24
The lower envelope
  • Imagine there is a line from this point to every
    operator point (ti, si) to its right
  • The operator that corresponds to the line with
    the steepest slope is called the steepest
    descent operator point

25
The Lower Envelope
  • By starting at the first point (t0, s0) and
    repeatedly calculating the steepest descent
    operator point we find the lower envelope of
    a progress chart P
  • Notice that the steepness (magnitude of the
    slope) of the segments is non-increasing
  • The operators in each segment form a chain.
  • FIFO within chain
  • Greedy across chains
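
The construction of the lower envelope can be sketched as below. The progress chart is assumed to be a list of points (t_i, s_i), where t_i is the cumulative processing time after operator i and s_i the fraction of the input that remains, starting from (0, 1.0). The function names and the tie-breaking (the earliest point wins among equal slopes) are assumptions of this sketch, not details taken from the Chain paper.

```python
from typing import List, Sequence, Tuple


def lower_envelope(points: Sequence[Tuple[float, float]]) -> List[int]:
    """Return the indices of the progress-chart points on the lower envelope.

    `points` must have strictly increasing t values.
    """
    envelope = [0]
    i = 0
    while i < len(points) - 1:
        t_i, s_i = points[i]
        # Steepest descent: the point to the right reached by the most negative slope.
        j = min(range(i + 1, len(points)),
                key=lambda k: (points[k][1] - s_i) / (points[k][0] - t_i))
        envelope.append(j)
        i = j
    return envelope


def chains(points: Sequence[Tuple[float, float]]) -> List[List[int]]:
    """Group operators (numbered from 1) by envelope segment.

    Operator k moves the chart from point k-1 to point k.
    """
    env = lower_envelope(points)
    return [list(range(a + 1, b + 1)) for a, b in zip(env, env[1:])]
```

For the chart `[(0, 1.0), (1, 0.9), (2, 0.2), (3, 0.1)]` the first operator alone frees little memory, but operators 1 and 2 together give the steepest descent, so they are scheduled as one chain and operator 3 forms a second, less steep one.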

26
Scheduling
  • Chain minimizes memory; this may be required in
    special overload situations
  • But increases response time (latency)
  • Typically though we want to optimize for response
    time
  • Different scheduling protocols optimize different
    objectives: latency, inaccuracy, memory use,
    computation, starvation, ...
  • Computational complexity is independent of the
    scheduler
  • Different policies give significantly different
    results only for bursty loads
  • Research Issues
  • Complex query plans (beyond simple paths)
  • Minimization of response time
  • Adaptive strategies: how do we switch between the
    two to adapt to load changes?

27
Optimization by Sharing
  • In traditional multi-query optimization
  • sharing (of expressions, results, etc.) among
    queries can lead to improved performance
  • Similar issues arise when processing
    queries on streams. Examples:
  • sharing of query operators and expressions
  • sharing of sliding windows

28
Multi-query Processing on Streams
  • Opportunities for optimization when windows are
    shared, e.g.:
    select sum(A.length)
    from Traffic1 A window 1 hour, Traffic2 B
    window 1 hour
    where A.destIP = B.destIP

    select count(distinct A.sourceIP)
    from Traffic1 A window 1 min, Traffic2 B
    window 1 min
    where A.destIP = B.destIP
  • Strategies for scheduling the evaluation of
    shared joins
  • Largest window only
  • Smallest window first
  • Process at any instant the tuple that is likely
    to benefit the largest number of joins (maximize
    throughput)

29
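Shared Predicates [Niagara, Telegraph]
Predicates for R.A: R.A > 1, R.A > 7, R.A > 11, R.A < 3, R.A < 5,
R.A = 6, R.A = 8, R.A ≠ 9
[Figure: the predicates grouped by comparison operator, with constants 1, 7, 11 under >, constants 3, 5 under <, constants 6, 8 under =, and constant 9 under ≠; example tuple A = 8]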
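
One way to evaluate such a group of predicates on a shared attribute in a single pass is to keep the constants of each comparison operator in a sorted structure and locate the tuple value by binary search. The sketch below is an illustrative assumption about how this could be done, not the actual Niagara or Telegraph implementation; the class and method names are hypothetical.

```python
from bisect import bisect_left, bisect_right
from typing import Iterable, List


class PredicateIndex:
    """Groups predicates on one attribute A by comparison operator."""

    def __init__(self, gt: Iterable[int], lt: Iterable[int],
                 eq: Iterable[int], ne: Iterable[int]):
        self.gt = sorted(gt)   # constants c of predicates A > c
        self.lt = sorted(lt)   # constants c of predicates A < c
        self.eq = set(eq)      # constants c of predicates A = c
        self.ne = sorted(ne)   # constants c of predicates A != c

    def matches(self, a: int) -> List[str]:
        """Return the predicates satisfied by a tuple with A = a."""
        # A > c holds for all constants strictly below a.
        out = [f"A>{c}" for c in self.gt[:bisect_left(self.gt, a)]]
        # A < c holds for all constants strictly above a.
        out += [f"A<{c}" for c in self.lt[bisect_right(self.lt, a):]]
        if a in self.eq:
            out.append(f"A={a}")
        out += [f"A!={c}" for c in self.ne if c != a]
        return out
```

For the predicates on this slide, `PredicateIndex([1, 7, 11], [3, 5], [6, 8], [9]).matches(8)` reports A>1, A>7, A=8 and A!=9 without testing each predicate individually.
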
30
QoS and Load Shedding
  • When input stream rate exceeds system capacity
  • a stream manager can shed load (tuples)
  • Load shedding affects queries and their answers
  • Introducing load shedding in a data stream
    manager is a challenging problem
  • Random and semantic load shedding

31
DSMS: Quality of Service (QoS)
  • Approximation and
  • Load Shedding

32
QOS via Synopses and Approximation
  • Synopsis: bounded-memory history approximation
  • Succinct summary of old stream tuples
  • Like indexes/materialized-views, but base data is
    unavailable
  • Examples
  • Sliding Windows
  • Samples
  • Sketching techniques
  • Histograms
  • Wavelet representation
  • Approximate algorithms, e.g., median, quantiles, ...
  • Fast and light Data Mining algorithms
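
As one concrete example of a sample-based synopsis, the sketch below uses reservoir sampling (Algorithm R), a standard technique that is not named on the slide, to keep a uniform random sample of fixed size `k` over a stream of unknown length. The function name and the optional `rng` parameter are illustrative.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Keep a uniform random sample of k items using O(k) memory."""
    rng = rng or random.Random()
    sample: List[T] = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            sample.append(item)        # fill the reservoir first
        else:
            j = rng.randrange(n)       # uniform in [0, n)
            if j < k:                  # happens with probability k/n
                sample[j] = item
    return sample
```

The sample is a bounded-memory summary of old stream tuples: approximate aggregates can be computed from it after the base data is gone.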

33
QoS and Load Shedding
  • When input stream rate exceeds system capacity
  • a stream manager can shed load (tuples)
  • Load shedding affects queries and their answers:
    drop the tasks and the tuples that will cause
    least loss
  • Introducing load shedding in a data stream
    manager is a challenging problem
  • Random load shedding or semantic load shedding
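
Random load shedding can be sketched as a filter that keeps each tuple with a probability chosen so that the expected output rate matches the system capacity. The function name and the `arrival_rate` and `capacity` parameters (both in tuples per time unit) are illustrative; how a real stream manager measures them is outside this sketch.

```python
import random
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def shed_randomly(tuples: Iterable[T], arrival_rate: float, capacity: float,
                  rng: Optional[random.Random] = None) -> Iterator[T]:
    """Drop tuples at random when the arrival rate exceeds the capacity."""
    rng = rng or random.Random()
    keep_prob = min(1.0, capacity / arrival_rate)
    for t in tuples:
        if rng.random() < keep_prob:   # random() is in [0, 1)
            yield t
```

Semantic load shedding would replace the coin flip with a test on the tuple contents, dropping the tuples whose loss hurts the query answers least.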

34
XML Data Streams
35
XML Data Streams Applications
  • An XML data stream is a sequence of tokens
  • Data and application integration
  • Distributed monitoring of computing systems
  • Message-based web services
  • Purchase orders, retail transactions
  • Personalized content delivery

36
XML Streams: Data Model
  • XML data: tree structure
    <Purchase_Doc>
      <PR_Number val="50"/>
      <Supp_Name>ABC</Supp_Name>
      <Address>
        <City>Florham Park</City>
        <State>New Jersey</State>
      </Address>
      <Line_Items>
        <Item>
          <Part_Number val="1050"/>
          <Quantity val="20"/>
        </Item>
        ...
  • Data stream: SAX events
    element Purchase_Doc anyType
    element PR_Number anyType
    attribute val anySimpleType
    chardata 50
    end-attribute
    end-element
    element Supp_Name anyType
    text ABC
    end-element
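
The translation from an XML document to a stream of SAX events can be reproduced with Python's standard `xml.sax` module. This is a minimal sketch: the `EventLogger` class and `sax_events` function are illustrative names, and because Python's SAX API delivers attributes together with the start-element callback, the sketch flattens them into separate events to resemble the listing above. Type annotations such as anyType are not available without a schema and are omitted.

```python
import xml.sax
from typing import List, Tuple


class EventLogger(xml.sax.ContentHandler):
    """Records the SAX callbacks as a flat list of events."""

    def __init__(self) -> None:
        super().__init__()
        self.events: List[Tuple[str, ...]] = []

    def startElement(self, name, attrs):
        self.events.append(("element", name))
        for attr_name in attrs.getNames():
            self.events.append(("attribute", attr_name, attrs.getValue(attr_name)))

    def characters(self, content):
        if content.strip():            # skip whitespace-only text nodes
            self.events.append(("text", content))

    def endElement(self, name):
        self.events.append(("end-element", name))


def sax_events(document: str) -> List[Tuple[str, ...]]:
    """Parse an XML string and return its events in document order."""
    handler = EventLogger()
    xml.sax.parseString(document.encode("utf-8"), handler)
    return handler.events
```

A streaming query processor would consume these events one at a time instead of collecting them in a list.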

37
XML Query Languages
  • XML query languages
  • XQuery, XSLT, XPath
  • Declarative matching of structured data and text
  • Easy restructuring to meet needs of data
    consumers

38
XML Streams: Research Issues
  • Efficient processing of single/multiple queries
    (e.g., XFilter/YFilter)
  • Blocking operators/constructs in XQuery, e.g.,
    XQuery's new function definition mechanisms are
    blocking
  • Integration of relational and XML DSMS, just like
    relational and XML DBMS are now being integrated.

39
Prototype Systems
  • Aurora (Brandeis, Brown, MIT) [CCC02]
  • Gigascope (AT&T) [CJSS03]
  • Hancock (AT&T) [CFP00]
  • STREAM (Stanford) [MWA03]
  • Telegraph (Berkeley) [CCD03]
  • Stream Mill (UCLA)

40
Aurora (Brandeis, Brown, MIT)
  • Geared towards monitoring applications (streams,
    triggers, imprecise data, real time requirements)
  • Specified set of operators, connected in a data
    flow graph
  • Optimization of the data flow graph
  • Three query modes (continuous, ad-hoc, view)
  • Aurora accepts QoS specifications and attempts
    to optimize QoS for the outputs produced
  • Real time scheduling, introspection and load
    shedding

41
AT&T: Hancock and Gigascope
  • Hancock: a C-based domain-specific language
    which facilitates signature extraction from
    transactional data streams.
  • A signature characterizes the behavior of
    customers or services
  • Support for efficient and tunable representation
    of signature collections
  • Support for custom scalable persistent data
    structures
  • Elaborate statistics collection from streams
  • Gigascope: SQL-based DSMS for monitoring of
    network data

42
STREAM (Stanford University)
  • General purpose stream data manager
  • CQL (continuous query language) for declarative
    query specification
  • Consider query plan generation
  • Resource management
  • Operator scheduling
  • Static and dynamic approximations

43
Telegraph (UCB)
  • Continuous query processing system
  • Support for stream oriented operators
  • Support for adaptivity in query processing
  • Various aspects of optimized multi-query stream
    processing

44
Commercial Systems
  • Sybase: publish-subscribe using MQ (Memory
    Queues)
  • MQs are in-memory tables processed using active
    rules and stored procedures
  • Similar solutions in Oracle and Teradata. But
    IBM's MQSeries, Microsoft's MSMQ are web-service
    oriented Java Message Service (JMS), WebSphere,
    CORBA.
  • Two DSMS startups
  • CORAL8: http://coral8.com/
  • Streambase: http://www.streambase.com/

45
More Tutorial Talks
  • Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev
    Motwani, Jennifer Widom:
    http://theory.stanford.edu/~rajeev/pods-full-talk.ppt
  • Nick Koudas and Divesh Srivastava. Data stream
    query processing. Tutorial presented at
    International Conference on Very Large Databases
    (VLDB), 1149, 2003.
  • Nick Koudas et al. Matching XML Documents
    Approximately (with S. Yahia and D. Srivastava)
    Tutorial delivered at ICDE 2003
  • Nick Koudas et al. Stream Data Management
    Research Directions and Opportunities. Invited
    Talk at IDEAS 2002.
  • Nick Koudas et al. Mining Data Streams (with S.
    Guha). Invited Tutorial delivered at PAKDD 2003

46
Implementation Approaches for Continuous Queries
on Streaming XML
  • Automata-based techniques
  • XFilter [AF00]: finite state machine per path
    expression
  • XTrie [CFGR02]: shares common sub-paths of PC
    paths
  • YFilter [DF03]: single NFA for all path
    expressions
  • [GMOS03]: single DFA, limitations on flexibility
  • XPush [GS03]: pushdown automaton for tree
    patterns
  • Index-based techniques
  • MatchMaker [LP02]: shared tree patterns
  • IndexFilter [BGKS03]: shared path expressions,
    comparison
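
The automata-based idea can be illustrated with a deliberately simplified sketch: one finite state machine for one absolute path expression that uses only child steps (for example /a/b), driven by start and end element events. This is not the XFilter algorithm itself, which also handles descendant steps, wildcards, and many expressions at once; the function name and the event encoding are illustrative assumptions.

```python
from typing import Iterable, Tuple


def match_path(events: Iterable[Tuple[str, str]], path: str) -> int:
    """Count the elements matched by a child-axis-only path such as "/a/b".

    `events` are ("start", name) and ("end", name) pairs in document order.
    The state is the number of path steps matched by the open ancestors;
    -1 marks a subtree in which the path can no longer match.
    """
    steps = path.strip("/").split("/")
    stack = [0]       # one state per open element, plus the document root
    matches = 0
    for kind, name in events:
        if kind == "start":
            state = stack[-1]
            if state != -1 and state < len(steps) and steps[state] == name:
                state += 1
                if state == len(steps):   # accepting state reached
                    matches += 1
            else:
                state = -1
            stack.append(state)
        else:
            stack.pop()                   # return to the parent's state
    return matches
```

The stack is what lets the machine resume in the parent's state when an element closes, so the memory used is proportional to the document depth rather than to its size.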

47
XML Stream Processing: Key Ideas
  • Obtain bindings of for clause path expression
    variables
  • Ordered sequence, no duplicates
  • Filter bindings using where clause path
    expression predicates
  • Existential check suffices
  • Compute bindings of return clause path
    expressions
  • Ordered (possibly null) sequence
  • Goal: efficient matching/binding of XML path
    expressions
  • Very large number of path expressions