Title: Query Processing, Resource Management, and Approximation in a Data Stream Management System
1Query Processing, Resource Management, and
Approximation in a Data Stream Management System
- Selected subset of slides taken from a talk by Jennifer Widom at NEDS.
- Stanford Stream Data Manager
2Data Streams
- Stream: continuous, unbounded, rapid, time-varying streams of data elements
- DSMS: Data Stream Management System
3The STREAM System
- Declarative language for registering continuous queries over data streams and stored relations
- Formal semantics: more theoretical team
4Contributions to Date
- Semantics for continuous queries
- Query plans
- Exploiting stream constraints
- Operator scheduling
- Approximation techniques
5The (Simplified) Big Picture
[Diagram: DSMS with a Scratch Store and Stored Relations]
6(Simplified) Network Monitoring
[Diagram: network measurements and packet traces feed the DSMS, which has a Scratch Store and Lookup Tables; monitoring queries are registered with the DSMS, which produces intrusion warnings and online performance metrics]
7Declarative Language for Continuous Queries
- A distinction between STREAM and Aurora
- Aurora users directly manipulate one large execution plan
- STREAM compiles declarative queries into individual plans; the system may merge plans
- Syntax based on SQL, with additional constructs for sliding windows and sampling
8Example Query 1
- Two streams, contrived for ease of examples
- Orders (orderID, customer, cost)
- Fulfillments (orderID, clerk)
- Total cost of orders fulfilled over the last day by clerk "Sue" for customer "Joe"
- Select Sum(O.cost)
- From Orders O, Fulfillments F [Range 1 Day]
- Where O.orderID = F.orderID And F.clerk = "Sue"
- And O.customer = "Joe"
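As a rough illustration of what this query computes at a single instant, the sketch below evaluates it over in-memory lists. It assumes integer timestamps in seconds and an inclusive window boundary; the function name and tuple layouts are hypothetical and not part of STREAM.

```python
DAY = 24 * 60 * 60  # window length in seconds (the time unit is an assumption)

def total_cost(orders, fulfillments, now, clerk="Sue", customer="Joe"):
    """Sum of O.cost over the join of Orders with the last day of Fulfillments.

    orders:       iterable of (timestamp, order_id, customer, cost)
    fulfillments: iterable of (timestamp, order_id, clerk)
    """
    # Fulfillments F [Range 1 Day]: elements stamped within one day of `now`.
    window = [f for f in fulfillments
              if now - DAY <= f[0] <= now and f[2] == clerk]
    # Orders O carries no window, so every order seen up to `now` participates.
    return sum(cost
               for (ts, oid, cust, cost) in orders
               if ts <= now and cust == customer
               for (fts, foid, fclerk) in window
               if foid == oid)
```

A continuous query would recompute (or incrementally maintain) this value as time advances; the sketch only shows the answer at one value of `now`.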
14Example Query 2
- Using a 10% sample of the Fulfillments stream, take the 5 most recent fulfillments for each clerk and return the maximum cost
- Select F.clerk, Max(O.cost)
- From Orders O,
- Fulfillments F [Partition By clerk Rows 5] 10% Sample
- Where O.orderID = F.orderID
- Group By F.clerk
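The sampling and partitioned window in this query can be sketched in a few lines of Python. This is only an illustration under assumptions: the sample is applied before the window (as the English description suggests), order IDs are unique, and the function and parameter names are hypothetical.

```python
import random
from collections import defaultdict, deque

def max_cost_per_clerk(orders, fulfillments, sample_rate=0.10, rows=5, rng=None):
    """orders: iterable of (order_id, customer, cost)
    fulfillments: iterable of (order_id, clerk), in arrival order
    """
    rng = rng or random.Random()
    cost_of = {oid: cost for (oid, _customer, cost) in orders}
    # Partition By clerk Rows 5: one bounded buffer per clerk.
    recent = defaultdict(lambda: deque(maxlen=rows))
    for oid, clerk in fulfillments:
        if rng.random() < sample_rate:      # keep roughly sample_rate of the stream
            recent[clerk].append(oid)
    result = {}
    for clerk, oids in recent.items():
        costs = [cost_of[oid] for oid in oids if oid in cost_of]  # join on orderID
        if costs:
            result[clerk] = max(costs)                            # Group By clerk, Max
    return result
```

Passing `sample_rate=1.0` disables sampling, which is convenient for checking the windowing logic on its own.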
18Semantics of Database Languages
- An often neglected topic
- Traditional relational databases are in reasonable shape
- Relational algebra and SQL
- But triggers were a mess
- The semantics of an innocent-looking continuous
query over data streams may not be obvious
19A Nonobvious Continuous Query
- Stream of stock quotes: Stocks(ticker, price)
- Monitor last 10 minutes of quotes
- Select * From Stocks [Range 10 minutes]
- Is the result a relation, a stream, or something else?
- If a relation, what exactly does it contain?
- If a stream, how does the query differ from
- Select * From Stocks [Range 1 minute]
- or Select * From Stocks?
20Our Semantics and Language
for Continuous Queries
- Abstract interpretation for CQs based on certain "black boxes"
- Concrete SQL-based instantiation for our system includes syntactic shortcuts, defaults, equivalences
- Goals
- CQs over multiple streams and relations
- Exploit relational semantics to the extent possible
- Easy queries should be easy to write, simple queries should do what you expect
21Relations and Streams
- Assume global, discrete, ordered time domain (more on this later)
- Relation
- Maps time T to set-of-tuples R
- Stream
- Set of (tuple,timestamp) elements
22Conversions
[Diagram: conversions between Streams and Relations]
23Conversion Definitions
- Stream-to-relation
- S [W] is a relation: at time T it contains all tuples in window W applied to stream S up to T
- When W = ∞, contains all tuples in stream S up to T
- Relation-to-stream
- Istream(R) contains all (r,T) where r ∈ R at time T but r ∉ R at time T−1
- Dstream(R) contains all (r,T) where r ∈ R at time T−1 but r ∉ R at time T
- Rstream(R) contains all (r,T) where r ∈ R at time T
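The three relation-to-stream operators can be written down almost directly from these definitions. The sketch below assumes a relation is given as a dict from consecutive integer times to sets of tuples, and that the relation is empty before its first recorded time; both are assumptions made for illustration.

```python
def istream(R):
    """(r, T) for every r in R at time T that was not in R at time T-1."""
    out = []
    for T in sorted(R):
        previous = R.get(T - 1, set())
        out += [(r, T) for r in sorted(R[T] - previous)]
    return out

def dstream(R):
    """(r, T) for every r in R at time T-1 that is no longer in R at time T."""
    out = []
    for T in sorted(R):
        previous = R.get(T - 1, set())
        out += [(r, T) for r in sorted(previous - R[T])]
    return out

def rstream(R):
    """(r, T) for every r in R at time T."""
    return [(r, T) for T in sorted(R) for r in sorted(R[T])]
```

For a relation that holds {a} at time 0, {a, b} at time 1, and {b} at time 2, Istream emits a at 0 and b at 1, Dstream emits a at 2, and Rstream emits the full contents at every instant.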
24Abstract Semantics
- Take any relational query language
- Can reference streams in place of relations
- But must convert to relations using any window specification language (default window = [∞])
- Can convert relations to streams
- For streamed results
- For windows over relations (note: converts back to relation)
25Query Result at Time T
- Use all relations at time T
- Use all streams up to T, converted to relations
- Compute relational result
- Convert result to streams if desired
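The four steps above can be sketched as a small evaluation function. Everything here is an illustrative assumption rather than STREAM's implementation: relations are modeled as functions of time, streams as timestamp-ordered lists, and windows and the relational query as plain callables.

```python
def query_result_at(T, relations, streams, windows, relational_query):
    """Evaluate one continuous query at time T.

    relations: name -> function mapping a time to a set of tuples
    streams:   name -> list of (tuple, timestamp) in timestamp order
    windows:   name -> function (elements_up_to_T, T) -> set of tuples
    """
    # Step 1: all relations at time T.
    inputs = {name: rel(T) for name, rel in relations.items()}
    # Step 2: all streams up to T, converted to relations by their windows.
    for name, elements in streams.items():
        up_to_T = [(tup, ts) for (tup, ts) in elements if ts <= T]
        inputs[name] = windows[name](up_to_T, T)
    # Step 3: compute the relational result.
    return relational_query(inputs)

def streamed_result(times, result_at):
    """Step 4 (optional): emit (r, T) when r is in the result at T
    but was not in the result at the previous instant."""
    previous, out = set(), []
    for T in times:
        current = result_at(T)
        out.extend((r, T) for r in sorted(current - previous))
        previous = current
    return out
```

A real system would maintain the result incrementally instead of recomputing it at every instant; the sketch only mirrors the definition.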
26Time
- Easiest: global system clock
- Stream elements and relation updates timestamped on entry to system
- Application-defined time
- Streams and relation updates contain application timestamps, may be out of order
- Application generates "heartbeat"
- Or deduce heartbeat from parameters: stream skew, scrambling, latency, and clock progress
- Query results in application time
27Abstract Semantics Example 1
- Select F.clerk, Max(O.cost)
- From O [∞], F [Rows 1000]
- Where O.orderID = F.orderID
- Group By F.clerk
- Maximum-cost order fulfilled by each clerk in
last 1000 fulfillments
- At time T: entire stream O and last 1000 tuples of F as relations
- Evaluate query, update result relation at T
29Abstract Semantics Example 1
- Select Istream(F.clerk, Max(O.cost))
- From O [∞], F [Rows 1000]
- Where O.orderID = F.orderID
- Group By F.clerk
- At time T: entire stream O and last 1000 tuples of F as relations
- Evaluate query, update result relation at T
- Streamed result: new element (<clerk,max>,T) whenever <clerk,max> changes from T−1
30Abstract Semantics Example 2
- Relation CurPrice(stock, price)
- Select stock, Avg(price)
- From Istream(CurPrice) [Range 1 Day]
- Group By stock
- Average price over last day for each stock
- Istream provides history of CurPrice
- Window on history, back to relation, group and
aggregate
32Concrete Language CQL
- Relational query language: SQL
- Window spec. language: derived from SQL-99
- Tuple-based, time-based, partitioned
- Syntactic shortcuts and defaults
- So easy queries are easy to write and simple queries do what you expect
- Equivalences
- Basis for query-rewrite optimizations
- Includes all relational equivalences, plus new stream-based ones
33Two Extremely Simple CQL Examples
- Select * From Strm
- Had better return Strm (It does)
- Default [∞] window for Strm
- Default Istream for result
- Select * From Strm, Rel Where Strm.A = Rel.B
- Often want "NOW" window for Strm
- But may not want as default
34Query Execution
- When a continuous query is registered, generate a query plan
- Users can also register plans directly
- Plans composed of three main components
- Operators (as in most conventional DBMSs)
- Inter-operator queues (as in many conventional DBMSs)
- State (synopses)
- Global scheduler for plan execution
35Operators and State
- State (synopses)
- Summarize tuples seen so far (exact or approximate) for operators requiring history
- To implement windows
- Example: synopsis join
- Sliding-window join
- Approximation of full join
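To make the synopsis idea concrete, here is a minimal sketch of a symmetric join whose state is a bounded buffer per input. The class and method names are hypothetical, and the row-based bound is one assumed choice of window; the point is only that bounded state yields a sliding-window join that approximates the full join.

```python
from collections import deque

class SlidingWindowJoin:
    """Symmetric join that keeps the last `rows` tuples of each input as its synopsis."""

    def __init__(self, rows, key_left, key_right):
        self.left = deque(maxlen=rows)    # synopsis for the left input
        self.right = deque(maxlen=rows)   # synopsis for the right input
        self.key_left, self.key_right = key_left, key_right

    def insert_left(self, tup):
        # Probe the opposite synopsis, then remember the new tuple.
        matches = [(tup, r) for r in self.right
                   if self.key_left(tup) == self.key_right(r)]
        self.left.append(tup)             # oldest tuple is evicted when full
        return matches

    def insert_right(self, tup):
        matches = [(l, tup) for l in self.left
                   if self.key_left(l) == self.key_right(tup)]
        self.right.append(tup)
        return matches
```

A tuple that has been evicted from a synopsis can no longer produce matches, which is exactly where the result departs from the full join.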
36Simple Query Plan
[Diagram: query plan over Stream1, Stream2, and Stream3, with operators holding State1 through State4, producing results for Q1 and Q2, under a global Scheduler]
37Some Issues in Query Plan Generation
- Compatibility and conversions for streams and relations (+/− streams)
- State sharing, incremental computation
- Windowed joins: multiway versus 2-way
- Windows in general: push down, pull up, split, merge, ...
- Time coordination, operator-level heartbeats
38Memory Overhead in Query Processing
- Queues + State
- Continuous queries keep state indefinitely
- Online requirements suggest using memory rather than disk
- But we realize this assumption is shaky
- Goal: minimize memory use while providing timely, accurate answers
40Reducing Memory Overhead
- Two main techniques to date
- Exploit constraints on streams to reduce state
- Clever operator scheduling to reduce queue sizes
43Exploiting Stream Constraints
- For most queries, unbounded memory is required for arbitrary streams [PODS 01]
- But streams may exhibit constraints that reduce, bound, or even eliminate state
- Conventional database constraints
- Redefined for streams
- Relaxed for stream environment
44Stream Constraints
- Each constraint type defines adherence parameter k
- Clustered(k) for attribute S.A
- Ordered(k) for attribute S.A
- Referential-Integrity(k) for join S1 ⋈ S2
45Algorithm for Exploiting Constraints
- Input
- Any Select-Project-Join query over streams
- Any set of k-constraints
- Output
- Query execution plan that reduces or eliminates state based on k-constraints
- If constraints violated, get approximate result
48Constraint Examples
- Orders (orderID, cost)
- Fulfillments (orderID, portion, clerk)
- Query: many-one join F ⋈ O
- Clustered(k) on F.orderID
- Matched O tuples discarded after k arrivals of non-matching Fs
- Referential-Integrity(k)
- F tuples retained for at most k arrivals of O tuples
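The Clustered(k) case can be sketched as a join that purges order state once enough non-matching fulfillments have gone by. This is an illustration, not the algorithm from the STREAM papers: the class name is hypothetical, and the exact boundary (discarding at exactly k non-matching arrivals since the last match) is an assumption taken from the slide wording.

```python
class ClusteredJoin:
    """Many-one join F ⋈ O that exploits Clustered(k) on F.orderID to purge O tuples."""

    def __init__(self, k):
        self.k = k
        self.orders = {}   # orderID -> cost (join state for O)
        self.misses = {}   # orderID -> non-matching F arrivals since its last match

    def on_order(self, order_id, cost):
        self.orders[order_id] = cost

    def on_fulfillment(self, order_id, portion, clerk):
        out = []
        if order_id in self.orders:
            out.append((order_id, portion, clerk, self.orders[order_id]))
            self.misses[order_id] = 0          # (re)start the countdown for this order
        # Every other already-matched order has just seen one more non-matching F.
        for oid in list(self.misses):
            if oid != order_id:
                self.misses[oid] += 1
                if self.misses[oid] >= self.k:  # constraint says no more matches will come
                    del self.misses[oid]
                    del self.orders[oid]
        return out
```

If the stream violates the constraint, a late fulfillment for a purged order produces no output, so the join result becomes approximate rather than failing.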
49Operator Scheduling
- Global scheduler invokes run method of query plan operators with "timeslice" parameter
- Many possible scheduling objectives: minimize latency, inaccuracy, memory use, computation, starvation, ...
- First scheduler: round-robin
- Second scheduler: minimize queue sizes
- Third scheduler: minimize combination of queue sizes and latency
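A toy version of this arrangement is sketched below: operators expose a `run` method that takes a timeslice, and a global scheduler decides which operator to invoke. The timeslice is measured in tuples here, which is an assumption. The second policy, longest-queue-first, is a simple stand-in heuristic for illustration only; it is not the queue-minimizing scheduler the slides refer to.

```python
from collections import deque

class Operator:
    def __init__(self, name, fn, out_queue):
        self.name, self.fn = name, fn
        self.in_queue = deque()
        self.out_queue = out_queue   # next operator's in_queue, or a list of final results

    def run(self, timeslice):
        """Process up to `timeslice` tuples from the input queue."""
        done = 0
        while self.in_queue and done < timeslice:
            for out in self.fn(self.in_queue.popleft()):
                self.out_queue.append(out)
            done += 1
        return done

def round_robin(operators, timeslice, rounds):
    """Give every operator one timeslice per round, in a fixed order."""
    for _ in range(rounds):
        for op in operators:
            op.run(timeslice)

def longest_queue_first(operators, timeslice, steps):
    """Stand-in heuristic: always run the operator with the most queued input."""
    for _ in range(steps):
        op = max(operators, key=lambda o: len(o.in_queue))
        if not op.in_queue:
            break                    # nothing left to process anywhere
        op.run(timeslice)
```

Both policies produce the same answers on a fixed input; they differ in how large the intermediate queues grow and how soon results appear.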
50Approximation
- Why approximate?
- Memory requirement too high, even with constraints and clever scheduling
- Can't process streams fast enough for query load
51Approximation (cont'd)
- Static: rewrite queries to add (or shrink) sampling or windows
- User can participate, predictable behavior
- Doesn't consider dynamic conditions
- Dynamic: modify query plan (insert sampling operators, shrink windows, load shedding)
- Adapts to current resource availability
- How to convey to user? (major open issue)
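One way to picture the dynamic approach is a sampling operator whose keep rate depends on the current queue length. The sketch below is a hypothetical illustration: the class name, the capacity threshold, and the `capacity / queue_len` rule are all assumptions, not STREAM's load-shedding policy.

```python
import random

class LoadShedder:
    """Randomly drops incoming tuples once the downstream queue exceeds `capacity`."""

    def __init__(self, capacity, rng=None):
        self.capacity = capacity
        self.rng = rng or random.Random()
        self.dropped = 0             # running count of shed tuples

    def keep_probability(self, queue_len):
        # Keep everything while under capacity; shed more aggressively as the queue grows.
        if queue_len <= self.capacity:
            return 1.0
        return self.capacity / queue_len

    def admit(self, tup, queue):
        if self.rng.random() < self.keep_probability(len(queue)):
            queue.append(tup)
            return True
        self.dropped += 1
        return False
```

Tracking `dropped` is one small step toward telling the user how approximate the answer is, though the slides note that conveying this well remains an open issue.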
52The Holy Grail
- Given
- Declarative query
- Resources
- Constraints on streams
- Generate plan and resource allocation that takes advantage of constraints and maximizes precision
- Do it for multiple (weighted) queries, dynamically and adaptively, and convey what's happening to the user
53
- http://www-db.stanford.edu/stream
- Contributors: Arvind Arasu, Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, Justin Rosenstein, Rohit Varma