Query Processing, Resource Management, and Approximation in a Data Stream Management System


1
Query Processing, Resource Management, and
Approximation in a Data Stream Management System
  • Selected subset of slides taken from a talk by
    Jennifer Widom at NEDS.
  • Stanford Stream Data Manager

2
Data Streams
  • Stream: continuous, unbounded, rapid,
    time-varying streams of data elements
  • DSMS: Data Stream Management System

3
The STREAM System
  • Declarative language for registering continuous
    queries considering data streams and stored
    relations
  • Formal semantics: a more theoretical team

4
Contributions to Date
  • Semantics for continuous queries
  • Query plans
  • Exploiting stream constraints
  • Operator scheduling
  • Approximation techniques

5
The (Simplified) Big Picture
  • [Figure: diagram with components DSMS, Scratch
    Store, and Stored Relations]

6
(Simplified) Network Monitoring
  • [Figure: DSMS with Scratch Store and Lookup
    Tables. Labels: Network measurements, Packet
    traces, Register Monitoring Queries, Intrusion
    Warnings, Online Performance Metrics]

7
Declarative Language for Continuous Queries
  • A distinction between STREAM and Aurora
  • Aurora users directly manipulate one large
    execution plan
  • STREAM compiles declarative queries into
    individual plans; the system may merge plans
  • Syntax based on SQL, additional constructs for
    sliding windows and sampling

9
Example Query 1
  • Two streams, contrived for ease of examples
  • Orders (orderID, customer, cost)
  • Fulfillments (orderID, clerk)
  • Total cost of orders fulfilled over the last day
    by clerk Sue for customer Joe
  • Select Sum(O.cost)
  • From Orders O, Fulfillments F [Range 1 Day]
  • Where O.orderID = F.orderID And F.clerk = 'Sue'
  • And O.customer = 'Joe'
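
The query above can be sketched in Python as a one-shot evaluation at a given instant. This is only an illustration: it assumes timestamps in seconds, a time-based window on Fulfillments only, and no window on Orders. The function name `total_cost`, the tuple layouts, and the sample data are hypothetical and not part of STREAM.

```python
DAY = 24 * 60 * 60  # window length in seconds (assumed time unit)

def total_cost(orders, fulfillments, now, clerk="Sue", customer="Joe"):
    """Sum(O.cost) over Orders joined with the last day of Fulfillments.

    orders:       list of (timestamp, orderID, customer, cost)
    fulfillments: list of (timestamp, orderID, clerk)
    """
    # Time-based window: fulfillments with now - DAY < timestamp <= now.
    recent = [(oid, c) for (ts, oid, c) in fulfillments if now - DAY < ts <= now]
    # Orders carries no window, so every order that arrived by `now` is visible.
    seen = [(oid, cust, cost) for (ts, oid, cust, cost) in orders if ts <= now]
    return sum(
        cost
        for (oid, cust, cost) in seen
        for (f_oid, f_clerk) in recent
        if oid == f_oid and f_clerk == clerk and cust == customer
    )

orders = [(0, 1, "Joe", 10.0), (5, 2, "Joe", 20.0), (6, 3, "Ann", 7.0)]
fulfillments = [(10, 1, "Sue"), (DAY + 5, 2, "Sue"), (DAY + 6, 3, "Sue")]
print(total_cost(orders, fulfillments, now=DAY + 8))
```

A continuous query would re-evaluate (or incrementally maintain) this sum as time advances; as the window slides past the first fulfillment, its order drops out of the total.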

14
Example Query 2
  • Using a 10% sample of the Fulfillments stream,
    take the 5 most recent fulfillments for each
    clerk and return the maximum cost
  • Select F.clerk, Max(O.cost)
  • From Orders O,
  • Fulfillments F [Partition By clerk Rows 5]
    10% Sample
  • Where O.orderID = F.orderID
  • Group By F.clerk
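
A rough Python sketch of this query follows. It assumes the sample is taken first and the partitioned window is applied to the sampled tuples, as the description on the slide suggests. The function `max_cost_per_clerk`, its parameters, and the dictionary representation of Orders are hypothetical choices made for illustration.

```python
import random
from collections import defaultdict, deque

def max_cost_per_clerk(orders, fulfillments, rows=5, rate=0.10, rng=None):
    """orders: dict orderID -> cost (assumes unique order IDs).
    fulfillments: iterable of (orderID, clerk) in arrival order.
    """
    rng = rng or random.Random()
    # Partitioned tuple-based window: last `rows` sampled fulfillments per clerk.
    window = defaultdict(lambda: deque(maxlen=rows))
    for order_id, clerk in fulfillments:
        if rng.random() < rate:  # keep roughly `rate` of the stream
            window[clerk].append(order_id)
    result = {}
    for clerk, order_ids in window.items():
        # Join on orderID, then aggregate per clerk.
        costs = [orders[oid] for oid in order_ids if oid in orders]
        if costs:
            result[clerk] = max(costs)
    return result
```

With `rate=1.0` the sampling step keeps every tuple, which makes the windowing behavior easy to check by hand.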

18
Semantics of Database Languages
  • An often neglected topic
  • Traditional relational databases are in
    reasonable shape
  • Relational algebra ↔ SQL
  • But triggers were a mess
  • The semantics of an innocent-looking continuous
    query over data streams may not be obvious

19
A Nonobvious Continuous Query
  • Stream of stock quotes: Stocks(ticker, price)
  • Monitor last 10 minutes of quotes
  • Select * From Stocks [Range 10 minutes]
  • Is result a relation, a stream, or something
    else?
  • If a relation, what exactly does it contain?
  • If a stream, how does query differ from
  • Select * From Stocks [Range 1 minute]
  • or Select * From Stocks?

20
Our Semantics and Language
for Continuous Queries
  • Abstract interpretation for CQs based on certain
    black boxes
  • Concrete SQL-based instantiation for our system
    includes syntactic shortcuts, defaults,
    equivalences
  • Goals:
  • CQs over multiple streams and relations
  • Exploit relational semantics to the extent
    possible
  • Easy queries should be easy to write, simple
    queries should do what you expect

21
Relations and Streams
  • Assume global, discrete, ordered time domain
    (more on this later)
  • Relation: maps time T to set-of-tuples R
  • Stream: set of (tuple, timestamp) elements

22
Conversions
  • [Figure: conversions between Streams and
    Relations]

23
Conversion Definitions
  • Stream-to-relation
  • S [W] is a relation: at time T it contains all
    tuples in window W applied to stream S up to T
  • When W = ∞, contains all tuples in stream S up to
    T
  • Relation-to-stream
  • Istream(R) contains all (r, T) where r ∈ R at time
    T but r ∉ R at time T−1
  • Dstream(R) contains all (r, T) where r ∈ R at time
    T−1 but r ∉ R at time T
  • Rstream(R) contains all (r, T) where r ∈ R at time
    T
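
These conversion definitions translate almost directly into code. The Python sketch below models a stream as a list of (tuple, timestamp) pairs and a relation as a function from time to a set of tuples. The function names, the integer time domain, and treating the relation as empty before the first instant are assumptions for illustration, not STREAM's implementation.

```python
def window_rows(stream, t, n=None):
    """S [Rows n] at time t; n=None stands for the unbounded window (n must be positive)."""
    ordered = sorted(stream, key=lambda e: e[1])
    upto = [tup for (tup, ts) in ordered if ts <= t]
    return set(upto) if n is None else set(upto[-n:])

def window_range(stream, t, length):
    """S [Range length] at time t."""
    return {tup for (tup, ts) in stream if t - length < ts <= t}

def istream(rel, times):
    """(r, T) for every r in R at T that was not in R at the previous instant."""
    out, prev = [], set()
    for t in times:
        cur = rel(t)
        out.extend((r, t) for r in sorted(cur - prev))
        prev = cur
    return out

def dstream(rel, times):
    """(r, T) for every r in R at the previous instant that is not in R at T."""
    out, prev = [], set()
    for t in times:
        cur = rel(t)
        out.extend((r, t) for r in sorted(prev - cur))
        prev = cur
    return out

def rstream(rel, times):
    """(r, T) for every r in R at T."""
    return [(r, t) for t in times for r in sorted(rel(t))]

stream = [("a", 1), ("b", 2), ("c", 3)]
last_two = lambda t: window_rows(stream, t, 2)
print(istream(last_two, [1, 2, 3]))
print(dstream(last_two, [1, 2, 3]))
```

Applying Istream to an unbounded window over a stream gives back the original stream, which is why the simplest query over a stream can return the stream itself.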

24
Abstract Semantics
  • Take any relational query language
  • Can reference streams in place of relations
  • But must convert to relations using any window
    specification language
    (default window: ∞)
  • Can convert relations to streams
  • For streamed results
  • For windows over relations
    (note: converts back to relation)

25
Query Result at Time T
  • Use all relations at time T
  • Use all streams up to T, converted to relations
  • Compute relational result
  • Convert result to streams if desired

26
Time
  • Easiest: global system clock
  • Stream elements and relation updates timestamped
    on entry to system
  • Application-defined time
  • Streams and relation updates contain application
    timestamps, may be out of order
  • Application generates heartbeat
  • Or deduce heartbeat from parameters: stream skew,
    scrambling, latency, and clock progress
  • Query results in application time

27
Abstract Semantics Example 1
  • Select F.clerk, Max(O.cost)
  • From O [∞], F [Rows 1000]
  • Where O.orderID = F.orderID
  • Group By F.clerk
  • Maximum-cost order fulfilled by each clerk in
    last 1000 fulfillments

28
Abstract Semantics Example 1
  • Select F.clerk, Max(O.cost)
  • From O [∞], F [Rows 1000]
  • Where O.orderID = F.orderID
  • Group By F.clerk
  • At time T: entire stream O and last 1000 tuples
    of F as relations
  • Evaluate query, update result relation at T

29
Abstract Semantics Example 1
  • Select Istream(F.clerk, Max(O.cost))
  • From O [∞], F [Rows 1000]
  • Where O.orderID = F.orderID
  • Group By F.clerk
  • At time T: entire stream O and last 1000 tuples
    of F as relations
  • Evaluate query, update result relation at T
  • Streamed result: new element (<clerk, max>, T)
    whenever <clerk, max> changes from T−1
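
The evaluation steps for this example can be sketched in Python. The sketch recomputes the result relation from scratch at each instant, which a real plan would avoid; it assumes unique order IDs, inputs listed in arrival order, and an empty result before the first instant. The names `result_at` and `istream_result` are hypothetical.

```python
def result_at(orders, fulfillments, t, rows=1000):
    """Result relation {(clerk, max cost)} at time t.

    orders:       list of (orderID, cost, timestamp)
    fulfillments: list of (orderID, clerk, timestamp), in arrival order
    """
    o = {oid: cost for (oid, cost, ts) in orders if ts <= t}                    # O [∞]
    f = [(oid, clerk) for (oid, clerk, ts) in fulfillments if ts <= t][-rows:]  # F [Rows n]
    best = {}
    for oid, clerk in f:
        if oid in o:  # join on orderID
            best[clerk] = max(best.get(clerk, o[oid]), o[oid])
    return set(best.items())

def istream_result(orders, fulfillments, times, rows=1000):
    """Emit ((clerk, max), T) whenever that pair is new relative to the previous instant."""
    out, prev = [], set()
    for t in times:
        cur = result_at(orders, fulfillments, t, rows)
        out.extend((row, t) for row in sorted(cur - prev))
        prev = cur
    return out

orders = [(1, 10, 1), (2, 30, 2), (3, 5, 3)]
fulfillments = [(1, "Sue", 1), (2, "Sue", 2), (3, "Bob", 3)]
print(istream_result(orders, fulfillments, [1, 2, 3], rows=2))
```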

30
Abstract Semantics Example 2
  • Relation: CurPrice(stock, price)
  • Select stock, Avg(price)
  • From Istream(CurPrice) [Range 1 Day]
  • Group By stock
  • Average price over last day for each stock

31
Abstract Semantics Example 2
  • Relation: CurPrice(stock, price)
  • Select stock, Avg(price)
  • From Istream(CurPrice) [Range 1 Day]
  • Group By stock
  • Istream provides history of CurPrice
  • Window on history, back to relation, group and
    aggregate
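
The three steps named above (Istream for history, window back to a relation, then group and aggregate) might look like this in Python. The snapshot-list representation of CurPrice, the function name `avg_price_last_day`, and the seconds-based time unit are assumptions made for this sketch.

```python
from collections import defaultdict

def avg_price_last_day(cur_price_history, now, day=86400):
    """cur_price_history: list of (time, {stock: price}) snapshots of CurPrice,
    in time order. Returns {stock: average price} over the window ending at `now`.
    """
    # Istream(CurPrice): each (stock, price) pair is emitted when it first appears.
    inserted, prev = [], set()
    for t, snapshot in cur_price_history:
        cur = set(snapshot.items())
        inserted.extend((stock, price, t) for (stock, price) in cur - prev)
        prev = cur
    # [Range 1 Day], then Group By stock with Avg(price).
    sums = defaultdict(lambda: [0.0, 0])
    for stock, price, t in inserted:
        if now - day < t <= now:
            sums[stock][0] += price
            sums[stock][1] += 1
    return {stock: total / n for stock, (total, n) in sums.items()}
```

Note that a price that stays unchanged contributes only once to the history, because Istream emits a tuple only when it newly enters the relation.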

32
Concrete Language CQL
  • Relational query language: SQL
  • Window spec. language: derived from SQL-99
  • Tuple-based, time-based, partitioned
  • Syntactic shortcuts and defaults
  • So easy queries are easy to write and simple
    queries do what you expect
  • Equivalences
  • Basis for query-rewrite optimizations
  • Includes all relational equivalences, plus new
    stream-based ones

33
Two Extremely Simple CQL Examples
  • Select * From Strm
  • Had better return Strm (It does)
  • Default ∞ window for Strm
  • Default Istream for result
  • Select * From Strm, Rel Where Strm.A = Rel.B
  • Often want NOW window for Strm
  • But may not want as default

34
Query Execution
  • When a continuous query is registered, generate
    a query plan
  • Users can also register plans directly
  • Plans composed of three main components
  • Operators (as in most conventional DBMSs)
  • Inter-operator Queues (as in many conventional
    DBMSs)
  • State (synopses)
  • Global scheduler for plan execution

35
Operators and State
  • State (synopses)
  • Summarize tuples seen so far (exact or
    approximate) for operators requiring history
  • To implement windows
  • Example: synopsis join
  • Sliding-window join
  • Approximation of full join
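
As an illustration of operator state, here is a minimal Python sketch of a join whose synopses hold only the most recent tuples from each input. It assumes tuple-based windows of equal size on both inputs; the class name `SlidingWindowJoin` and its callback-style interface are hypothetical and do not describe STREAM's operators.

```python
from collections import deque

class SlidingWindowJoin:
    """Symmetric join keeping the last `rows` tuples of each input as its synopsis."""

    def __init__(self, rows, key_left, key_right):
        self.left = deque(maxlen=rows)    # synopsis for the left input
        self.right = deque(maxlen=rows)   # synopsis for the right input
        self.key_left, self.key_right = key_left, key_right

    def on_left(self, tup):
        """Store a new left tuple and probe the right synopsis."""
        self.left.append(tup)
        return [(tup, r) for r in self.right if self.key_left(tup) == self.key_right(r)]

    def on_right(self, tup):
        """Store a new right tuple and probe the left synopsis."""
        self.right.append(tup)
        return [(l, tup) for l in self.left if self.key_left(l) == self.key_right(tup)]
```

Because old tuples are evicted from the synopses, matches with evicted tuples are missed. Read as a sliding-window join the output is exact; read as a full join over the whole stream it is an approximation.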

36
Simple Query Plan
  • [Figure: query plan over Stream1, Stream2, and
    Stream3, with operators, synopses State1 to
    State4, queues Q1 and Q2, and a Scheduler]

37
Some Issues in Query Plan Generation
  • Compatibility and conversions for streams and
    relations (+/− streams)
  • State sharing, incremental computation
  • Windowed joins: multiway versus 2-way
  • Windows in general: push down, pull up, split,
    merge, ...
  • Time coordination, operator-level heartbeats

39
Memory Overhead in Query Processing
  • Queues + State
  • Continuous queries keep state indefinitely
  • Online requirements suggest using memory rather
    than disk
  • But we realize this assumption is shaky
  • Goal: minimize memory use while providing timely,
    accurate answers

40
Reducing Memory Overhead
  • Two main techniques to date
  • Exploit constraints on streams to reduce state
  • Clever operator scheduling to reduce queue sizes

43
Exploiting Stream Constraints
  • For most queries, unbounded memory is required
    for arbitrary streams [PODS '01]
  • But streams may exhibit constraints that reduce,
    bound, or even eliminate state
  • Conventional database constraints
  • Redefined for streams
  • Relaxed for stream environment

44
Stream Constraints
  • Each constraint type defines an adherence
    parameter k
  • Clustered(k) for attribute S.A
  • Ordered(k) for attribute S.A
  • Referential-Integrity(k) for join S1 ⋈ S2

45
Algorithm for Exploiting Constraints
  • Input:
  • Any Select-Project-Join query over streams
  • Any set of k-constraints
  • Output:
  • Query execution plan that reduces or eliminates
    state based on k-constraints
  • If constraints violated, get approximate result

48
Constraint Examples
  • Orders (orderID, cost)
  • Fulfillments (orderID, portion, clerk)
  • Query: many-one join F ⋈ O
  • Clustered(k) on F.orderID
  • Matched O tuples discarded after k arrivals of
    non-matching Fs
  • Referential-Integrity(k)
  • F tuples retained for at most k arrivals of O
    tuples
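
The Clustered(k) case can be sketched as join state that forgets an order once its fulfillments appear to be finished. The Python below follows the wording on the slide: a matched O tuple is dropped after k non-matching F arrivals since its last match. The exact boundary convention for k, the assumption that each order arrives before its fulfillments, and the class name `ClusteredJoinState` are all assumptions for illustration. If the constraint is violated, late fulfillments find no partner and the result becomes approximate.

```python
class ClusteredJoinState:
    """State for the many-one join F ⋈ O under Clustered(k) on F.orderID."""

    def __init__(self, k):
        self.k = k
        self.orders = {}  # orderID -> cost
        self.misses = {}  # matched orderID -> non-matching F arrivals since last match

    def on_order(self, order_id, cost):
        self.orders[order_id] = cost

    def on_fulfillment(self, order_id, portion, clerk):
        out = []
        if order_id in self.orders:
            out.append((order_id, self.orders[order_id], portion, clerk))
            self.misses[order_id] = 0  # cluster for this order is still active
        # Every other matched order has just seen one more non-matching F tuple.
        for oid in list(self.misses):
            if oid != order_id:
                self.misses[oid] += 1
                if self.misses[oid] >= self.k:
                    del self.misses[oid]
                    del self.orders[oid]  # discard state for the finished cluster
        return out
```

Orders that have never been matched are kept, since the clustering constraint says nothing about when their first fulfillment will arrive.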

49
Operator Scheduling
  • Global scheduler invokes run method of query plan
    operators with timeslice parameter
  • Many possible scheduling objectives: minimize
    latency, inaccuracy, memory use, computation,
    starvation, ...
  • First scheduler: round-robin
  • Second scheduler: minimize queue sizes
  • Third scheduler: minimize combination of queue
    sizes and latency
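
A toy version of this arrangement in Python: operators expose a run method that takes a timeslice, and a global scheduler decides which operator runs next. The timeslice is modeled as a tuple count, which is an assumption. `longest_queue_first` is a simple greedy heuristic standing in for a queue-minimizing policy, not the scheduler used in STREAM. All names are hypothetical.

```python
from collections import deque

class Operator:
    def __init__(self, name, in_queue, out_queue=None, fn=lambda x: x):
        self.name, self.in_queue, self.out_queue, self.fn = name, in_queue, out_queue, fn

    def run(self, timeslice):
        """Process up to `timeslice` tuples from the input queue."""
        done = 0
        while self.in_queue and done < timeslice:
            item = self.fn(self.in_queue.popleft())
            if item is not None and self.out_queue is not None:
                self.out_queue.append(item)  # None means the tuple was filtered out
            done += 1
        return done

def round_robin(operators, timeslice, rounds):
    """Give every operator one timeslice per round, in a fixed order."""
    for _ in range(rounds):
        for op in operators:
            op.run(timeslice)

def longest_queue_first(operators, timeslice, steps):
    """Greedy heuristic: always run the operator with the longest input queue."""
    for _ in range(steps):
        op = max(operators, key=lambda o: len(o.in_queue))
        if not op.in_queue:
            break  # nothing left to process anywhere
        op.run(timeslice)

q1, q2, out = deque(range(6)), deque(), deque()
select = Operator("select", q1, q2, fn=lambda x: x if x % 2 == 0 else None)
project = Operator("project", q2, out, fn=lambda x: x * 10)
round_robin([select, project], timeslice=2, rounds=3)
print(list(out))
```

Different policies produce the same answers here but hold different numbers of tuples in the queues along the way, which is the quantity the later schedulers try to control.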

50
Approximation
  • Why approximate?
  • Memory requirement too high, even with
    constraints and clever scheduling
  • Can't process streams fast enough for query load

51
Approximation (cont'd)
  • Static: rewrite queries to add (or shrink)
    sampling or windows
  • User can participate, predictable behavior
  • Doesn't consider dynamic conditions
  • Dynamic: modify query plan (insert sampling
    operators, shrink windows, load shedding)
  • Adapts to current resource availability
  • How to convey to user? (major open issue)
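
A dynamic approach could, for instance, place a sampling operator in front of an overloaded queue and adjust its keep probability to the current backlog. The Python sketch below is one simple policy under that assumption; the class `LoadShedder`, the capacity threshold, and the proportional rule are hypothetical and are not taken from STREAM.

```python
import random

class LoadShedder:
    """Sampling operator whose keep probability falls as the backlog grows."""

    def __init__(self, capacity, rng=None):
        self.capacity = capacity          # backlog the system can sustain
        self.rng = rng or random.Random()

    def keep_probability(self, queue_length):
        if queue_length <= self.capacity:
            return 1.0                    # no overload: keep everything
        return self.capacity / queue_length

    def admit(self, queue_length):
        """Decide whether to pass the next tuple downstream."""
        return self.rng.random() < self.keep_probability(queue_length)
```

The current keep probability is also the kind of information that would need to be conveyed to the user so that an approximate answer can be interpreted.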

52
The Holy Grail
  • Given:
  • Declarative query
  • Resources
  • Constraints on streams
  • Generate plan and resource allocation that takes
    advantage of constraints and maximizes precision
  • Do it for multiple (weighted) queries,
    dynamically and adaptively, and convey what's
    happening to the user

53
  • http://www-db.stanford.edu/stream
  • Contributors: Arvind Arasu, Brian Babcock,
    Shivnath Babu, Mayur Datar, Rajeev Motwani,
    Justin Rosenstein, Rohit Varma