Title: Stream Data Management System Prototypes
1Stream Data Management System Prototypes
- Ying Sheng, Richard Sia
- June 1, 2004
- Professor Carlo Zaniolo
- CS 240B
- Spring 2004
2Outline
- Motivation of DSMS
- Aurora (Brown, Brandeis, MIT)
- Model
- Operator Scheduling
- Storage/Memory Management
- QoS issue
- STREAM (Stanford)
- System Architecture
- Query Language
- Query Plans and Execution
- Performance Issues
- Approximation Techniques
- STREAM Interface
- Conclusion
3Motivation
- HADP → DAHP (human-active, DBMS-passive → DBMS-active, human-passive)
- Continuous data and static queries
- Monitoring using sensors
- Military
- Traffic
- Environment
- Financial analysis
- Object tracking
4Aurora
5Aurora Model
- General Purpose DSMS
- Continuous stream data arrives
- Flows through a set of operators
- Output goes to an application or is materialized
6Aurora Model
- Components
- Storage manager
- Scheduler
- Load Shedder
- Router
- QoS Monitor
- GUI
7Aurora Model
- 3 kinds of queries supported
- Continuous
- View
- Ad-Hoc Query
8Aurora Model
- 8 primitive operators (Box)
- Windowed
- Slide
- Tumble
- Latch
- Resample
- Non-windowed
- Filter
- Map
- GroupBy
- Join
9Aurora Operator Optimization
- Each operator associated with
- Selectivity s(b), sel(b)
- Computation time c(b), cost(b)
- General Optimization Techniques
- Pushing projection upstream
- Combining boxes
- Reordering boxes
10Aurora Operator Optimization
- Case 1: cost of a → b
- $c(a) + s(a)\,c(b)$
- Case 2: cost of b → a
- $c(b) + s(b)\,c(a)$
- Criterion for switching box positions
- $c(a) + s(a)\,c(b) > c(b) + s(b)\,c(a)$
[Figure: the two box orders, a → b and b → a]
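The switching criterion compares the expected per-tuple cost of the two orders, c(a) + s(a)c(b) against c(b) + s(b)c(a). A minimal Python sketch of that comparison follows; the function names and the example cost and selectivity values are hypothetical and not taken from Aurora.

```python
def chain_cost(boxes):
    """Expected per-tuple cost of running boxes in the given order.

    Each box is a (cost, selectivity) pair. A downstream box only sees the
    fraction of tuples that every upstream box lets through.
    """
    total = 0.0
    pass_fraction = 1.0
    for cost, selectivity in boxes:
        total += pass_fraction * cost
        pass_fraction *= selectivity
    return total


def should_swap(a, b):
    """True when running b before a is strictly cheaper than a before b."""
    return chain_cost([a, b]) > chain_cost([b, a])


# Hypothetical boxes: a is cheap but unselective, b is costlier but selective.
a = (1.0, 0.9)  # c(a) = 1.0, s(a) = 0.9
b = (2.0, 0.1)  # c(b) = 2.0, s(b) = 0.1
print(chain_cost([a, b]))  # c(a) + s(a) * c(b)
print(chain_cost([b, a]))  # c(b) + s(b) * c(a)
print(should_swap(a, b))
```

With these values the order b → a is cheaper, so the boxes would be switched.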
11Aurora Operator Scheduling
- Scheduling by OS
- One thread per box, shift the job to OS
- Easier to program
- Aurora Scheduler
- Single thread for the scheduler
- The scheduler picks the box with the highest priority and calls it to consume tuples from its queue
- Allows finer control of resources
- Scalable
12Aurora Operator Scheduling
13Aurora Operator Scheduling
- Problem: which box to execute next?
- Min-Cost (MC)
- Reduce computation cost
- Min-Latency (ML)
- Return result as soon as possible
- Min-Memory (MM)
- Reduce memory usage of queue
14Aurora Operator Scheduling
[Figure: example network of boxes b1 to b6 between the input streams and the application; "Downstream" is labelled in the diagram]
15Aurora Operator Scheduling
- Min-Cost
- Objective: avoid the overhead of calling boxes
- Min-Latency
- Prefer the box that can produce output tuples in the shortest time
- Min-Memory
- Prefer the box that consumes more tuples with less computation time
- Similar to Chain operator scheduling
- More at: Operator Scheduling in a Data Stream Manager, VLDB 2003
16Aurora Storage/Memory Management
- Manage the queue in front of each box
- 2 boxes sharing the same queue
- windowed operator
- The initial queue size is 128 KB
- Queues are managed as a circular queue
- On overflow the queue size is doubled; when the queue is underused it is shrunk again
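The growth policy above can be sketched as a circular buffer that doubles when full and halves when mostly empty. This is only an illustration: capacity is counted in elements rather than bytes, and the class name, the minimum capacity and the one-quarter shrink threshold are assumptions, not Aurora's actual parameters.

```python
class GrowableCircularQueue:
    """Circular buffer that doubles when full and halves when mostly empty."""

    def __init__(self, capacity=4, min_capacity=4):
        self._buf = [None] * capacity
        self._head = 0  # index of the oldest element
        self._size = 0
        self._min_capacity = min_capacity

    def __len__(self):
        return self._size

    @property
    def capacity(self):
        return len(self._buf)

    def _resize(self, new_capacity):
        # Copy live elements in FIFO order into a fresh buffer starting at 0.
        old = len(self._buf)
        items = [self._buf[(self._head + i) % old] for i in range(self._size)]
        self._buf = items + [None] * (new_capacity - self._size)
        self._head = 0

    def enqueue(self, item):
        if self._size == len(self._buf):
            self._resize(2 * len(self._buf))  # overflow: double the queue
        tail = (self._head + self._size) % len(self._buf)
        self._buf[tail] = item
        self._size += 1

    def dequeue(self):
        if self._size == 0:
            raise IndexError("dequeue from empty queue")
        item = self._buf[self._head]
        self._buf[self._head] = None
        self._head = (self._head + 1) % len(self._buf)
        self._size -= 1
        # Assumed threshold: halve at one-quarter occupancy to avoid thrashing.
        if len(self._buf) > self._min_capacity and self._size <= len(self._buf) // 4:
            self._resize(max(self._min_capacity, len(self._buf) // 2))
        return item
```

Shrinking at one quarter rather than one half occupancy keeps a queue that hovers around a boundary from resizing on every operation.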
17Aurora Storage/Memory Management
- Swap queues in and out between memory and disk based on the priority of the boxes using them
- Works with the operator scheduler to exchange box priority and buffer-state information
- Connection Point Management
- A B-tree indexed on timestamp is built to support random access to tuples by ad-hoc queries
18Aurora Storage/Memory Management
19Aurora QoS Issue
- Different queries/applications have different QoS requirements
- Stock market monitoring
- Average temperature of a set of sensors
- QoS Graph
20Latency-based QoS Graph
[Figure: latency-based QoS graph plotting QoS against time, with a critical point marked; labels: latency(b), est(b), cost(D(b)), eol(b), box b and D(b)]
21Aurora QoS-driven Scheduling
- Assign a priority to each box:
- $\mathrm{priority}(b) = (\mathrm{utility}(b),\ \mathrm{est}(b))$
- $\mathrm{utility}(b) = \mathrm{gradient}(\mathrm{eol}(b))$
- How fast QoS is degrading at the time the tuple would leave the system if we process it now
- est(b)
- How soon the box will exhibit another performance degradation if we don't process it now
- Performance
- 200 queries/applications, each with 5 boxes
- Round robin: 0.43
- QoS-driven scheduling: 0.85
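To make the priority pair concrete, here is a rough Python sketch under several loudly stated assumptions: the QoS graph is piecewise linear with a single critical point, eol(b) is taken to be latency(b) + cost(D(b)), est(b) is modelled as the slack remaining before the critical point, and higher utility is compared before smaller slack. None of these details, nor the class and function names, are confirmed by the slides.

```python
from dataclasses import dataclass


@dataclass
class QoSGraph:
    critical: float  # latency at which QoS starts to drop (assumed shape)
    zero_at: float   # latency at which QoS reaches 0

    def gradient(self, latency):
        """Magnitude of the QoS slope at the given latency."""
        if self.critical <= latency < self.zero_at:
            return 1.0 / (self.zero_at - self.critical)
        return 0.0


@dataclass
class Box:
    name: str
    latency: float          # latency(b): time its tuples have already waited
    downstream_cost: float  # cost(D(b)): processing time still ahead
    qos: QoSGraph

    @property
    def eol(self):
        # Assumption: expected output latency if the box is processed now.
        return self.latency + self.downstream_cost


def priority(box):
    """(utility, -slack): a larger tuple means the box is more urgent."""
    utility = box.qos.gradient(box.eol)
    slack = max(0.0, box.qos.critical - box.eol)  # stand-in for est(b)
    return (utility, -slack)


def pick_next(boxes):
    return max(boxes, key=priority)


graph = QoSGraph(critical=10.0, zero_at=20.0)
boxes = [
    Box("b1", latency=2.0, downstream_cost=1.0, qos=graph),
    Box("b2", latency=11.0, downstream_cost=1.0, qos=graph),
    Box("b3", latency=8.0, downstream_cost=1.0, qos=graph),
]
print(pick_next(boxes).name)
```

Here b2 is already past the critical point and losing QoS, so it is picked first; among the remaining boxes, b3 has less slack than b1.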
22Aurora Current Status
- Main components of a DSMS are introduced
- Operator scheduler
- Memory/storage management
- QoS concept in stress environment
- Load shedding
- Implemented in C++, with a Java-based GUI
- Depends on a few external software packages/libraries
- More?
- Distributed architecture: Aurora*
- Fault tolerance or disaster recovery?
23STREAM
24STREAM Introduction
- General-purpose prototype DSMS
- Supports data streams and stored relations
- Declarative language for registering continuous queries
- Flexible query plans and execution strategies
- Aggressive sharing of state and computation among queries
25STREAM Introduction
- Designed to cope with
- Stream rates that may be high, variable, bursty
- Continuous query loads that may be high, volatile
- Primary coping techniques
- Graceful approximation as necessary
- Careful resource allocation and use
- Continuous self-monitoring and reoptimization
26STREAM System Architecture
[Figure: STREAM system architecture; labels: DSMS, Scratch Store, Stored Relations]
27STREAM Query Language
- Continuous Query Language (CQL)
- Extends SQL with
- Streams as a new data type
- Stream: unbounded bag of pairs <tuple, timestamp>
- Relation: time-varying bag of tuples
- Continuous instead of one-time semantics
- Three classes of operators
- Relation-to-relation
- Stream-to-relation
- Relation-to-stream
28STREAM CQL Operators
- Relation-to-relation
- SQL constructs
- Stream-to-relation
- Tuple-based sliding window: [Rows N], [Rows Unbounded]
- Time-based sliding window: [Range T], [Now]
- Partitioned sliding window: [Partition By A1, ..., Ak Rows N]
- Relation-to-stream
- Istream: insert stream
- Dstream: delete stream
- Rstream: relation stream
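The three operator classes can be illustrated with bags modelled as Python Counter objects. This is a simplified sketch, not the STREAM implementation: the stream is assumed to be a list of (tuple, timestamp) pairs sorted by timestamp, the Range window is assumed to include both endpoints, and the function names are chosen for illustration only.

```python
from collections import Counter


def rows_window(stream, n, now):
    """[Rows N]: the n most recent tuples with timestamp <= now, as a bag."""
    seen = [tup for tup, ts in stream if ts <= now]
    return Counter(seen[-n:]) if n > 0 else Counter()


def range_window(stream, width, now):
    """[Range T]: tuples with now - width <= timestamp <= now, as a bag."""
    return Counter(tup for tup, ts in stream if now - width <= ts <= now)


def istream(prev, curr):
    """Tuples in the relation now that were not there at the previous instant."""
    return curr - prev


def dstream(prev, curr):
    """Tuples in the relation at the previous instant that are gone now."""
    return prev - curr


def rstream(curr):
    """The whole current relation, emitted at every instant."""
    return Counter(curr)


stream = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
before = rows_window(stream, 2, now=3)
after = rows_window(stream, 2, now=4)
print(istream(before, after))
print(dstream(before, after))
```

Counter subtraction keeps only positive counts, which matches bag difference, so Istream and Dstream fall out as the two directions of the difference between consecutive relation states.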
29STREAM Example Query 1
- Two example streams
- Orders (orderID, customer, cost)
- Fulfillments (orderID, clerk)
- Total cost of orders fulfilled over the last day by clerk "Sue" for customer "Joe"
- Select Sum(O.cost)
- From Orders O, Fulfillments F [Range 1 Day]
- Where O.orderID = F.orderID And F.clerk = "Sue" And O.customer = "Joe"
30STREAM Example Query 2
- Using a 10% sample of the Fulfillments stream, take the 5 most recent fulfillments for each clerk and return the maximum cost
- Select F.clerk, Max(O.cost)
- From Orders O, Fulfillments F [Partition By clerk Rows 5] 10% Sample
- Where O.orderID = F.orderID
- Group By F.clerk
31STREAM Simplified Query 2
- Result is a relation, updated as stream elements arrive
- Select F.clerk, Max(O.cost)
- From O, F [Rows 100]
- Where O.orderID = F.orderID
- Group By F.clerk
32STREAM Simplified Query 2
- Result is streamed: emits a <clerk, max> stream element whenever the max changes for a clerk (or a new clerk appears)
- Select Istream(F.clerk, Max(O.cost))
- From O, F [Rows 100]
- Where O.orderID = F.orderID
- Group By F.clerk
33STREAM Example Query 3
- Relation CurPrice(stock, price)
- Average price over last day for each stock
- Select stock, Avg(price)
- From Istream(CurPrice) [Range 1 Day]
- Group By stock
- Istream provides history of CurPrice
- Window on history (back to relation), group and aggregate
34STREAM Query plans and Execution
- When a continuous query is registered, a query plan is generated
- The new plan is merged with existing plans
- Users can also create and manipulate plans directly
- Plans are composed of three main components
- Operators
- Flag: insertion (+), deletion (-)
- Elements: tuple-timestamp-flag triples
- Streams: only + elements
- Relations: both + and - elements
- Queues
- Enforce nondecreasing timestamps (heartbeats)
- Mechanisms for buffering tuples
- States (Synopses)
- Global scheduler for plan execution
35STREAM States
- States (Synopses)
- Summarize elements seen so far (exact or approximate) for operators requiring history
- To implement windows
- Example: synopsis join
- Sliding-window join
- Approximation of full join
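A sliding-window join keeps a bounded synopsis per input and probes the opposite synopsis on every arrival. The Python sketch below uses tuple-based (Rows N) windows for both inputs; the class and method names are hypothetical, and a real system would index the synopses rather than scan them.

```python
from collections import deque


class SlidingWindowJoin:
    """Symmetric equi-join over two tuple-based sliding windows."""

    def __init__(self, left_rows, right_rows):
        # Each synopsis holds only the most recent N (key, tuple) pairs;
        # deque(maxlen=N) evicts the oldest entry automatically.
        self.left = deque(maxlen=left_rows)
        self.right = deque(maxlen=right_rows)

    def insert_left(self, tup, key):
        """Add a left tuple and return its matches in the right window."""
        self.left.append((key, tup))
        return [(tup, other) for k, other in self.right if k == key]

    def insert_right(self, tup, key):
        """Add a right tuple and return its matches in the left window."""
        self.right.append((key, tup))
        return [(other, tup) for k, other in self.left if k == key]


join = SlidingWindowJoin(left_rows=2, right_rows=2)
print(join.insert_left("l1", key=1))
print(join.insert_right("r1", key=1))
join.insert_left("l2", key=2)
join.insert_left("l3", key=3)          # evicts l1 from the left window
print(join.insert_right("r2", key=1))  # a full join would still match l1
```

The last call shows why this is only an approximation of the full join: once l1 has slid out of the window, later matching tuples no longer find it.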
36STREAM Simple Query Plan
Select * From S1 [Rows 1000], S2 [Range 2 Minutes] Where S1.A = S2.A And S1.A > 10
37STREAM Performance Issues
- Synopsis Sharing
- Eliminate data redundancy
- Exploiting Constraints
- Selectively discard data to reduce state
- Operator Scheduling
- Reduce queue sizes
38STREAM Synopsis Sharing
- Eliminate redundancy by
- replacing nearly identical synopses with lightweight stubs
- using a single store to hold the actual tuples
- The store tracks the progress of each stub and presents the appropriate view to each stub
- The store contains the union of the tuples of its corresponding stubs
39STREAM Synopsis Sharing
Select * From S1 [Rows 1000], S2 [Range 2 Minutes] Where S1.A = S2.A And S1.A > 10
Select A, Max(B) From S1 Rows [200] Group By A
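Both queries window the same stream S1, one with 1000 rows and one with 200, so their synopses overlap almost entirely. A minimal Python sketch of the store-and-stub idea follows; the class names, the restriction to tuple-based windows and the small window sizes in the example are assumptions made for brevity.

```python
from collections import deque


class Stub:
    """Lightweight handle that exposes one window over the shared store."""

    def __init__(self, store, rows):
        self._store = store
        self.rows = rows

    def view(self):
        return self._store.last(self.rows)


class SharedStore:
    """Holds the actual tuples once; stubs see different suffixes of them."""

    def __init__(self):
        self._tuples = deque()
        self._stubs = []

    def stub(self, rows):
        s = Stub(self, rows)
        self._stubs.append(s)
        return s

    def insert(self, tup):
        self._tuples.append(tup)
        # Keep the union of all stub windows, which is the widest window.
        widest = max((s.rows for s in self._stubs), default=0)
        while len(self._tuples) > widest:
            self._tuples.popleft()

    def last(self, rows):
        n = len(self._tuples)
        return [self._tuples[i] for i in range(max(0, n - rows), n)]

    def __len__(self):
        return len(self._tuples)


store = SharedStore()
join_window = store.stub(rows=5)  # stands in for S1 [Rows 1000]
agg_window = store.stub(rows=2)   # stands in for S1 [Rows 200]
for value in range(10):
    store.insert(value)
print(len(store), join_window.view(), agg_window.view())
```

Only five tuples are stored even though two windows are served, which is the saving the stub design is after.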
40STREAM Exploiting Constraints
- Specify an adherence parameter k to capture how closely a given stream or set of streams adheres to a constraint of a given type
- Referential integrity k-constraint
- Ordered-arrival k-constraint
- Clustered-arrival k-constraint
- Query execution plans reduce or eliminate state based on k-constraints
- If a constraint is violated, the result is approximate
41STREAM Operator Scheduling
- Goal: minimize total queue size for unpredictable, bursty stream arrival patterns
- Chain Scheduling Algorithm
- Step 1: Mark the first operator in the plan as the current operator
- Step 2: Find the block of consecutive operators starting at the current operator that maximizes the reduction in total queue size per unit time
- Step 3: Mark the first operator following this block as the current operator and repeat Step 2 until all operators have been assigned to chains
- Chains are scheduled according to the greedy algorithm, but within a chain, execution proceeds in FIFO order
- Proven to be within a constant factor of any clairvoyant strategy (i.e., the optimal strategy based on knowledge of future input) for some queries
- Empirical results: large savings over naive strategies for many queries
- But minimizing queue sizes is at odds with minimizing latency
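The chain-building steps can be sketched in Python for a single linear plan. Each operator is described by an assumed (time per tuple, selectivity) pair, a tuple is taken to have unit size on entry, and ties between blocks are broken in favour of the longer block; these modelling choices and the function name are assumptions for illustration.

```python
def chain_partition(operators):
    """Split a linear plan into chains.

    operators: list of (time_per_tuple, selectivity) in plan order, with
    time_per_tuple > 0. Returns one (first_index, last_index, slope) triple
    per chain, where slope is the reduction in queue size per unit time.
    """
    # Cumulative processing time and remaining tuple size after each operator.
    times, sizes = [0.0], [1.0]
    for t, s in operators:
        times.append(times[-1] + t)
        sizes.append(sizes[-1] * s)

    chains = []
    current = 0  # Step 1: start at the first operator
    while current < len(operators):
        best_end, best_slope = None, None
        # Step 2: try every block of consecutive operators from `current`.
        for end in range(current + 1, len(operators) + 1):
            slope = (sizes[current] - sizes[end]) / (times[end] - times[current])
            if best_slope is None or slope >= best_slope:
                best_end, best_slope = end, slope
        chains.append((current, best_end - 1, best_slope))
        current = best_end  # Step 3: continue after the chosen block
    return chains


# Hypothetical plan: a weak filter, a strong filter, then a slow operator.
plan = [(1.0, 0.9), (1.0, 0.1), (2.0, 0.5)]
for first, last, slope in chain_partition(plan):
    print(first, last, round(slope, 4))
```

In this example the first two operators form one chain because running them together shrinks the queue faster than the weak filter alone; the scheduler would then favour tuples in the chain with the steepest slope.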
42STREAM Approximation
- CPU-Limited Approximation
- Insufficient CPU time to process each stream element due to the high data arrival rate
- Load shedding
- Sampling operators
- Approximate by probabilistically dropping elements before they are processed
- Memory-Limited Approximation
- The total state required for all registered queries exceeds available memory
- The system selectively shrinks or discards synopses
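Probabilistic dropping can be sketched as a sampling operator placed in front of the expensive part of a plan. The function name, the fixed keep probability and the seed parameter below are illustrative assumptions; how STREAM actually chooses sampling rates is not described on the slide.

```python
import random


def shed_load(stream, keep_probability, seed=None):
    """Yield each element with the given probability and drop the rest.

    Dropping happens before any downstream operator sees the element, so the
    work saved is roughly proportional to 1 - keep_probability.
    """
    rng = random.Random(seed)  # seeded only to make the example repeatable
    for element in stream:
        if rng.random() < keep_probability:
            yield element


kept = list(shed_load(range(10_000), keep_probability=0.1, seed=0))
print(len(kept))  # roughly 10% of the input
```

Downstream aggregates computed on the sample are approximate; a count, for example, would need to be scaled by 1 / keep_probability to estimate the true value.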
43STREAM Query Interface
- View the structure of query plans and their component entities
- View the detailed properties of each entity
- Dynamically adjust entity properties
- View monitoring graphs that display time-varying entity properties plotted dynamically against time
- Queue sizes, throughput, overall memory usage, and join selectivity
44STREAM Query Plan Monitoring
45STREAM Current Status
- Version 1.0 up and running
- Includes a new monitoring and adaptive query processing infrastructure: StreaMon
- Executor runs query plans to produce results
- Profiler collects and maintains statistics about stream and plan characteristics
- Reoptimizer ensures that the plans and memory structures are the most efficient for current characteristics
- Web demo available at http://shark.stanford.edu:8080/
- Future Directions
- Distributed Stream Processing
- Crash Recovery
- Improved Approximation
- Classification of Applications
46Conclusion
- Ideal DSMS
- Well defined and flexible query language
- User-friendly interface
- Scalable
- Operator scheduling
- Storage management
- Synopsis sharing
- Approximation
- Quality assurance
- Fault tolerant
47References
- R. Motwani et al., Query Processing, Approximation, and Resource Management in a Data Stream Management System, in Proceedings of the 1st CIDR Conference, 2003.
- S. Madden et al., Continuously Adaptive Continuous Queries over Streams, in Proceedings of the SIGMOD Conference, 2002.
- D. Carney et al., Monitoring Streams - A New Class of Data Management Applications, in Proceedings of the VLDB Conference, 2002.
- D. Carney et al., Operator Scheduling in a Data Stream Manager, in Proceedings of the VLDB Conference, 2003.
- Stanford STREAM Project Website: http://www-db.stanford.edu/stream/index.html
- Aurora Project Website: http://www.cs.brown.edu/research/aurora
48End