Title: Analysis of "Operator Scheduling in a Data Stream Manager"
1. Analysis of "Operator Scheduling in a Data Stream Manager"
- CS561 Advanced Database Systems
- By Eric Bloom
2. Agenda
- Overview of Stream Processing
- Aurora Project Goals
- Aurora Processing Example
- Aurora Architecture
- Multi-Thread Vs. Single-Thread processing
- Important Definitions
- Superbox Scheduling and Processing
- Tuple Batching
- Experimental Evaluation
- Quality of Service (QoS) Scheduling
- QoS Scheduling Scalability
- Related Work
3. Overview of Stream Processing
- Stream processing is the processing of potentially unbounded, continuous streams of data
- Data streams are created by micro-sensors, GPS devices, and monitoring devices
- Examples include soldier location tracking, traffic sensors, stock market exchanges, and heart monitors
- Data may arrive at a steady rate or in bursts
4. Aurora Project Goals
- To build a data stream manager that addresses the performance and processing requirements of stream-based applications
- To support multiple concurrent continuous queries on one or more application data streams
- To use Quality-of-Service (QoS) based criteria to make resource allocation decisions
5. Aurora Processing Example
[Diagram: input data streams flow through a network of operator boxes, supported by historical storage and driven by continuous and ad hoc queries, producing output to applications]
6. Aurora Architecture
[Diagram: inputs enter through a router and are queued by the buffer manager; the scheduler dispatches box processors (B1-B4); supporting components include the catalogs, persistent store, load shedder, and QoS monitor; outputs are routed to applications]
7. Multi-Thread vs. Single-Thread Processing
- Multi-thread processing
- Each query is processed in its own thread
- The operating system manages resource allocation
- Advantages
- Processing can take advantage of efficient operating system scheduling algorithms
- Easier to program
- Disadvantages
- The software has limited control over resource management
- Additional overhead due to cache misses, lock contention, and context switching
8. Multi-Thread vs. Single-Thread Processing
- Single-thread processing
- All operators are executed within a single thread
- All resource allocation decisions are made by the scheduler
- Advantages
- Allows processing to be scheduled based on latency and other Quality of Service factors according to query needs
- Avoids the limitations of multi-thread processing
- Disadvantages
- More complex to program
- Aurora implements a single-threaded scheduling model
9. Important Definitions
- Quality of Service (QoS): specific requirements that represent the needs of a given query. In Aurora, the primary QoS factor is latency.
- Query tree: the set of operators (boxes) and data streams that make up a query.
- Superbox: a sequence of operators that is scheduled and executed as an atomic group. Aurora treats each query as a separate superbox.
- Two-level scheduling: scheduling is done at two levels. First, at the superbox level (deciding which superbox to process), and second, deciding the order in which to execute the operators within the selected superbox.
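The two-level loop defined above can be sketched in a few lines. This is an illustrative toy, not Aurora's actual code: the `Box`/`Superbox` classes and the `pick_superbox`/`traversal_order` hooks are invented names standing in for the level-1 and level-2 decisions.

```python
# Hypothetical sketch of Aurora-style two-level scheduling (names invented).

class Box:
    def __init__(self, name, pending=1):
        self.name = name
        self.pending = pending          # tuples waiting in the input queue

    def run(self):
        self.pending = 0                # drain the queue in one call

class Superbox:
    def __init__(self, name, boxes):
        self.name = name
        self.boxes = boxes

    def has_pending(self):
        return any(b.pending for b in self.boxes)

def two_level_schedule(superboxes, pick_superbox, traversal_order):
    """Level 1: pick a superbox; level 2: order operators inside it."""
    plan = []
    while any(sb.has_pending() for sb in superboxes):
        sb = pick_superbox(superboxes)          # level 1 decision
        for box in traversal_order(sb):         # level 2 decision (MC/ML/MM)
            box.run()
            plan.append((sb.name, box.name))
    return plan

# Example: two one-box superboxes, picked in name order,
# operators executed in the order they appear.
q1 = Superbox("Q1", [Box("B1")])
q2 = Superbox("Q2", [Box("B2")])
plan = two_level_schedule(
    [q1, q2],
    pick_superbox=lambda sbs: next(sb for sb in sbs if sb.has_pending()),
    traversal_order=lambda sb: sb.boxes,
)
print(plan)   # [('Q1', 'B1'), ('Q2', 'B2')]
```

In Aurora the level-1 choice would be driven by QoS urgency and the level-2 choice by one of the traversal strategies discussed later.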
10. Important Definitions (Cont.)
- Scheduling plan: the combination of dynamic superbox scheduling and algorithm-based operator execution order within the superbox.
- Application-at-a-time (AAAT): a term used in Aurora that statically defines each query (application) as a superbox.
- Box-at-a-time (BAAT): scheduling at the box level rather than at the superbox level.
- Static and dynamic scheduling: static approaches are defined prior to runtime; dynamic approaches use runtime information and statistics to adjust and prioritize the scheduling order during execution.
- Traversing a superbox: deciding how the operators within a superbox should be scheduled and executed.
11. Non-Superbox Processing
[Diagram: three query trees whose boxes are numbered 1-16, showing how the execution of boxes from different queries is interleaved when each box is scheduled individually]
12. Superbox Processing
[Diagram: three query trees with boxes labeled A1-A5, B1-B6, and C1-C5; each tree is scheduled and executed as its own superbox]
13. Superbox Traversal
Superbox traversal refers to the order in which the operators within a superbox are executed:
- Min-Cost (MC): attempts to optimize per-output-tuple processing cost by minimizing the number of operator calls per output tuple
- Min-Latency (ML): attempts to produce initial output tuples as soon as possible
- Min-Memory (MM): attempts to minimize memory usage
14. Superbox Traversal Processing
[Diagram: a query tree of operators B1-B6, rooted at output box B1]
- Min-Cost (MC)
- B4 > B5 > B3 > B2 > B6 > B1
- Min-Latency (ML)
- B1 > B2 > B1 > B6 > B1 > B4 > B2 > B1 > B3 > B2 > B1 > B5 > B3 > B2 > B1
- Min-Memory (MM)
- B3 > B6 > B2 > B5 > B3 > B2 > B1 > B4 > B2 > B1
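The min-cost sequence above runs each box exactly once, with every producer executed before its consumer, which amounts to a post-order walk of the query tree. A minimal sketch, with an assumed children-map encoding and an assumed tree shape (not the exact tree from the slides):

```python
# Hypothetical sketch of a min-cost (MC) traversal: each box runs exactly
# once, with every upstream box executed before its downstream consumer
# (a post-order walk of the query tree). The tree encoding is invented.

def min_cost_order(tree, root):
    """tree maps each box to its upstream (child) boxes; returns MC order."""
    order = []
    def visit(box):
        for child in tree.get(box, []):   # process all producers first
            visit(child)
        order.append(box)                  # then the consumer itself
    visit(root)
    return order

# Example tree (an assumed shape): B1 consumes from B2 and B6;
# B2 from B3 and B4; B3 from B5.
tree = {"B1": ["B2", "B6"], "B2": ["B3", "B4"], "B3": ["B5"]}
print(min_cost_order(tree, "B1"))  # ['B5', 'B3', 'B4', 'B2', 'B6', 'B1']
```

The ML and MM orders differ in that they may revisit downstream boxes many times (ML, to push early tuples to the output) or pick the next box by queue-size reduction (MM).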
15. Tuple Batching (Train Processing)
- A tuple train is a batch of tuples executed within a single operator call
- The goal of tuple train processing is to reduce the overall processing cost per tuple
- Advantages of tuple train processing:
- Fewer total operator invocations
- Reduced low-level overhead such as context switching, scheduling, memory management, and execution-queue maintenance
- Some windowing and merge-join operators work more efficiently when tuples are batched
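The saving in operator invocations can be made concrete with a toy operator. Everything here is invented for illustration; the point is only that a train of tuples reaches the operator in one call instead of many:

```python
# Hypothetical sketch of tuple-train processing: invoking an operator once
# per batch instead of once per tuple reduces the number of operator calls.

calls = {"per_tuple": 0, "train": 0}

def op(batch):
    """A toy operator that doubles each value."""
    return [x * 2 for x in batch]

def run_per_tuple(tuples):
    out = []
    for t in tuples:
        calls["per_tuple"] += 1
        out.extend(op([t]))        # one operator call per tuple
    return out

def run_train(tuples, train_size=4):
    out = []
    for i in range(0, len(tuples), train_size):
        calls["train"] += 1
        out.extend(op(tuples[i:i + train_size]))  # one call per train
    return out

data = list(range(8))
assert run_per_tuple(data) == run_train(data)     # identical results
print(calls)   # {'per_tuple': 8, 'train': 2} -- far fewer operator calls
```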
16. Experimental Evaluation Definitions
- Stream-based applications do not currently have a standardized benchmark
- Aurora modeled queries as a rooted tree structure from a stream input box to an application output box
- Trees are categorized by depth and fan-out
- Depth is the number of box levels from input to output
- Fan-out is the average number of children per box
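The two tree metrics can be stated precisely with a short sketch. The children-map encoding is invented, and "fan-out" is computed over internal boxes only, which is one reasonable reading of "average number of children per box":

```python
# Hypothetical sketch of the depth and fan-out metrics described above,
# computed over a rooted query tree (children-map encoding is invented).

def depth(tree, root):
    """Number of box levels from the root down to the deepest leaf."""
    kids = tree.get(root, [])
    return 1 + (max(depth(tree, k) for k in kids) if kids else 0)

def fan_out(tree):
    """Average number of children per box (internal boxes only)."""
    counts = [len(v) for v in tree.values() if v]
    return sum(counts) / len(counts) if counts else 0.0

# A depth-3 tree where the root and one child each have two children:
tree = {"out": ["a", "b"], "a": ["c", "d"]}
print(depth(tree, "out"), fan_out(tree))  # 3 2.0
```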
17. Experimental Evaluation Results
- At low volumes, Round-Robin Box-At-A-Time (RR-BAAT) scheduling was almost as efficient as Min-Cost Application-At-A-Time (MC-AAAT), but it was much less efficient at higher volumes
- At low volumes, the efficiency gains of MC-AAAT were offset by its more complex scheduling overhead
- As volumes increased, the efficiencies of MC-AAAT became more apparent as scheduling overhead became a smaller percentage of total processing
- Experiments also compared the ML, MC, and MM traversal techniques
- As expected, each technique minimized its target metric (latency, cost, and memory, respectively)
- However, at very low processing levels the simplest algorithms tended to do best (but who cares?)
18. Quality of Service (QoS) Scheduling
- Definitions
- Utility: how useful a tuple will be when it exits the query
- Urgency: the steepness of the downward slope of the latency-utility QoS graph; in other words, how fast utility deteriorates
- Approach
- Track the latency of tuples residing in the queues, and pick tuples for processing whose execution will provide the highest aggregate QoS delivered to the applications
19. Latency-Utility Relationship
[Graph: quality of service (utility) on the y-axis from 0 to 1, latency on the x-axis; utility stays high until critical points are reached, then slopes downward. The older the data gets, the less it is worth, and the lower the quality of service.]
Aurora combines the QoS graphs of each query being executed with the average latency of the tuples in each box to decide which superbox to execute next. The idea is to maintain, on average, the highest quality of service.
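A latency-utility graph of this shape can be modeled as a piecewise-linear function over critical points. The points below are invented for illustration (in Aurora, each application supplies its own graph); urgency is then just how fast utility is falling at the tuple's current latency:

```python
# Hypothetical sketch of a piecewise-linear latency-utility QoS graph.
# The critical points below are invented; Aurora's actual graphs are
# supplied per application.

# (latency, utility) critical points: full utility up to latency 2,
# decaying to zero by latency 10.
POINTS = [(0.0, 1.0), (2.0, 1.0), (10.0, 0.0)]

def utility(latency):
    """Linearly interpolate utility between the critical points."""
    for (x0, y0), (x1, y1) in zip(POINTS, POINTS[1:]):
        if latency <= x1:
            return y0 + (y1 - y0) * (latency - x0) / (x1 - x0)
    return POINTS[-1][1]                 # past the last point: flat

def urgency(latency, eps=1e-6):
    """Urgency = how fast utility is deteriorating (negative slope)."""
    return (utility(latency) - utility(latency + eps)) / eps

print(utility(6.0))   # 0.5 -- halfway down the sloped segment
```

A tuple at latency 1 has zero urgency (the graph is still flat there), while a tuple at latency 6 is losing utility at the slope of the falling segment, so it would be favored by the scheduler.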
20. QoS Scheduling Scalability
- Problem
- A per-tuple approach to QoS-based scheduling will not scale because of the amount of processing needed to maintain it
- Solution
- Latency is not calculated at the tuple level; rather, it is calculated as the average latency of the tuples in the box input queue
- Priority is assigned based on a combination of utility and urgency
- Once a box's priority (priority tuple, or p-tuple) is calculated, the boxes are placed in logical buckets based on their priority value
- Scheduling is then done based on the priority of the bucket
- All boxes in a given bucket are considered equal
21. Related Work
- Eddies has a tuple-at-a-time scheduler providing adaptability, but it does not scale well
- Urhan's work addresses rate-based pipeline scheduling of data between operators
- NiagaraCQ provides query optimization for streaming data from wide-area information sources
- STREAM provides comprehensive data stream management using chain scheduling algorithms
- Note that none of the above projects has a notion of QoS