Title: Analysis of "Operator Scheduling in a Data Stream Manager"
1. Analysis of "Operator Scheduling in a Data Stream Manager"
- CS561 Advanced Database Systems
- By Eric Bloom
2. Agenda
- Overview of Stream Processing
- Aurora Project Goals
- Aurora Processing Example
- Aurora Architecture
- Multi-Thread Vs. Single-Thread processing
- Important Definitions
- Superbox Scheduling and Processing
- Tuple Batching
- Experimental Evaluation
- Quality of Service (QoS) Scheduling
- QoS Scheduling Scalability
- Related Work
3. Overview of Stream Processing
- Stream processing is the processing of potentially unbounded, continuous streams of data
- Data streams are created by micro-sensors, GPS devices, and monitoring devices
- Examples include soldier location tracking, traffic sensors, stock market exchanges, and heart monitors
- Data may arrive at a steady rate or in bursts
4. Aurora Project Goals
- To build a data stream manager that addresses the performance and processing requirements of stream-based applications
- To support multiple concurrent continuous queries on one or more application data streams
- To use Quality-of-Service (QoS) based criteria to make resource allocation decisions
5. Aurora Processing Example
[Diagram: input data streams flow through a network of operator boxes, supported by historical storage and driven by continuous and ad hoc queries, producing output to applications]
6. Aurora Architecture
[Diagram: inputs enter through a router and are queued by the buffer manager; the scheduler dispatches box processors (B1-B4); supporting components include the catalogs, persistent store, load shedder, and QoS monitor; outputs are routed to applications]
7. Multi-Thread vs. Single-Thread Processing
- Multi-thread processing
- Each query is processed in its own thread
- The operating system manages resource allocation
- Advantages
- Processing can take advantage of efficient operating system scheduling algorithms
- Easier to program
- Disadvantages
- The software has limited control over resource management
- Additional overhead due to cache misses, lock contention, and context switching
8. Multi-Thread vs. Single-Thread Processing
- Single-thread processing
- All operators are executed within a single thread
- All resource allocation decisions are made by the scheduler
- Advantages
- Allows processing to be scheduled based on latency and other Quality of Service factors according to query needs
- Avoids the limitations of multi-thread processing
- Disadvantages
- More complex to program
- Aurora implements a single-threaded scheduling model
9. Important Definitions
- Quality of Service (QoS): specific requirements that represent the needs of a given query. In Aurora, the primary QoS factor is latency.
- Query tree: the set of operators (boxes) and data streams that make up a query.
- Superbox: a sequence of operators that is scheduled and executed as an atomic group. Aurora treats each query as a separate superbox.
- Two-level scheduling: scheduling is done at two levels. First, at the superbox level (deciding which superbox to process), and second, deciding the order in which to execute the operators within the selected superbox.
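The two-level loop defined above can be sketched in a few lines. This is an illustrative toy, not Aurora's actual code: the `Box`/`Superbox` classes and the `pick_superbox`/`traversal_order` hooks are invented names standing in for the level-1 and level-2 decisions.

```python
# Hypothetical sketch of Aurora-style two-level scheduling (names invented).

class Box:
    def __init__(self, name, pending=1):
        self.name = name
        self.pending = pending          # tuples waiting in the input queue

    def run(self):
        self.pending = 0                # drain the queue in one call

class Superbox:
    def __init__(self, name, boxes):
        self.name = name
        self.boxes = boxes

    def has_pending(self):
        return any(b.pending for b in self.boxes)

def two_level_schedule(superboxes, pick_superbox, traversal_order):
    """Level 1: pick a superbox; level 2: order operators inside it."""
    plan = []
    while any(sb.has_pending() for sb in superboxes):
        sb = pick_superbox(superboxes)          # level 1 decision
        for box in traversal_order(sb):         # level 2 decision (MC/ML/MM)
            box.run()
            plan.append((sb.name, box.name))
    return plan

# Example: two one-box superboxes, picked in name order,
# operators executed in the order they appear.
q1 = Superbox("Q1", [Box("B1")])
q2 = Superbox("Q2", [Box("B2")])
plan = two_level_schedule(
    [q1, q2],
    pick_superbox=lambda sbs: next(sb for sb in sbs if sb.has_pending()),
    traversal_order=lambda sb: sb.boxes,
)
print(plan)   # [('Q1', 'B1'), ('Q2', 'B2')]
```

In Aurora the level-1 choice would be driven by QoS urgency and the level-2 choice by one of the traversal strategies discussed later.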
10. Important Definitions (Cont.)
- Scheduling plan: the combination of dynamic superbox scheduling and algorithm-based operator execution order within the superbox.
- Application-at-a-time (AAAT): a term used in Aurora that statically defines each query (application) as a superbox.
- Box-at-a-time (BAAT): scheduling at the box level rather than at the superbox level.
- Static and dynamic scheduling: static approaches are defined prior to runtime; dynamic approaches use runtime information and statistics to adjust and prioritize the scheduling order during execution.
- Traversing a superbox: deciding how the operators within a superbox should be scheduled and executed.
11. Non-Superbox Processing
[Diagram: three query trees whose boxes are numbered 1-16, showing how the execution of boxes from different queries is interleaved when each box is scheduled individually]
12. Superbox Processing
[Diagram: three query trees with boxes labeled A1-A5, B1-B6, and C1-C5; each tree is scheduled and executed as its own superbox]
13. Superbox Traversal
Superbox traversal refers to the order in which the operators within a superbox are executed:
- Min-Cost (MC): attempts to optimize per-output-tuple processing cost by minimizing the number of operator calls per output tuple
- Min-Latency (ML): attempts to produce initial output tuples as soon as possible
- Min-Memory (MM): attempts to minimize memory usage
14. Superbox Traversal Processing
[Diagram: a query tree of operators B1-B6, rooted at output box B1]
- Min-Cost (MC)
- B4 > B5 > B3 > B2 > B6 > B1
- Min-Latency (ML)
- B1 > B2 > B1 > B6 > B1 > B4 > B2 > B1 > B3 > B2 > B1 > B5 > B3 > B2 > B1
- Min-Memory (MM)
- B3 > B6 > B2 > B5 > B3 > B2 > B1 > B4 > B2 > B1
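The min-cost sequence above runs each box exactly once, with every producer executed before its consumer, which amounts to a post-order walk of the query tree. A minimal sketch, with an assumed children-map encoding and an assumed tree shape (not the exact tree from the slides):

```python
# Hypothetical sketch of a min-cost (MC) traversal: each box runs exactly
# once, with every upstream box executed before its downstream consumer
# (a post-order walk of the query tree). The tree encoding is invented.

def min_cost_order(tree, root):
    """tree maps each box to its upstream (child) boxes; returns MC order."""
    order = []
    def visit(box):
        for child in tree.get(box, []):   # process all producers first
            visit(child)
        order.append(box)                  # then the consumer itself
    visit(root)
    return order

# Example tree (an assumed shape): B1 consumes from B2 and B6;
# B2 from B3 and B4; B3 from B5.
tree = {"B1": ["B2", "B6"], "B2": ["B3", "B4"], "B3": ["B5"]}
print(min_cost_order(tree, "B1"))  # ['B5', 'B3', 'B4', 'B2', 'B6', 'B1']
```

The ML and MM orders differ in that they may revisit downstream boxes many times (ML, to push early tuples to the output) or pick the next box by queue-size reduction (MM).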
15. Tuple Batching (Train Processing)
- A tuple train is a batch of tuples executed within a single operator call
- The goal of tuple train processing is to reduce the overall processing cost per tuple
- Advantages of tuple train processing:
- Fewer total operator invocations
- Reduced low-level overhead such as context switching, scheduling, memory management, and execution-queue maintenance
- Some windowing and merge-join operators work more efficiently when tuples are batched
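The saving in operator invocations can be made concrete with a toy operator. Everything here is invented for illustration; the point is only that a train of tuples reaches the operator in one call instead of many:

```python
# Hypothetical sketch of tuple-train processing: invoking an operator once
# per batch instead of once per tuple reduces the number of operator calls.

calls = {"per_tuple": 0, "train": 0}

def op(batch):
    """A toy operator that doubles each value."""
    return [x * 2 for x in batch]

def run_per_tuple(tuples):
    out = []
    for t in tuples:
        calls["per_tuple"] += 1
        out.extend(op([t]))        # one operator call per tuple
    return out

def run_train(tuples, train_size=4):
    out = []
    for i in range(0, len(tuples), train_size):
        calls["train"] += 1
        out.extend(op(tuples[i:i + train_size]))  # one call per train
    return out

data = list(range(8))
assert run_per_tuple(data) == run_train(data)     # identical results
print(calls)   # {'per_tuple': 8, 'train': 2} -- far fewer operator calls
```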
16. Experimental Evaluation Definitions
- Stream-based applications do not currently have a standardized benchmark
- Aurora modeled queries as a rooted tree structure from a stream input box to an application output box
- Trees are categorized by depth and fan-out
- Depth is the number of box levels from input to output
- Fan-out is the average number of children per box
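The two tree metrics can be stated precisely with a short sketch. The children-map encoding is invented, and "fan-out" is computed over internal boxes only, which is one reasonable reading of "average number of children per box":

```python
# Hypothetical sketch of the depth and fan-out metrics described above,
# computed over a rooted query tree (children-map encoding is invented).

def depth(tree, root):
    """Number of box levels from the root down to the deepest leaf."""
    kids = tree.get(root, [])
    return 1 + (max(depth(tree, k) for k in kids) if kids else 0)

def fan_out(tree):
    """Average number of children per box (internal boxes only)."""
    counts = [len(v) for v in tree.values() if v]
    return sum(counts) / len(counts) if counts else 0.0

# A depth-3 tree where the root and one child each have two children:
tree = {"out": ["a", "b"], "a": ["c", "d"]}
print(depth(tree, "out"), fan_out(tree))  # 3 2.0
```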
17. Experimental Evaluation Results
- At low volumes, Round-Robin Box-At-A-Time (RR-BAAT) scheduling was almost as efficient as Min-Cost Application-At-A-Time (MC-AAAT), but it was much less efficient at higher volumes
- At low volumes, the efficiency gains of MC-AAAT were offset by its more complex scheduling overhead
- As volumes increased, the efficiencies of MC-AAAT became more apparent as scheduling overhead became a smaller percentage of total processing
- Experiments also compared the ML, MC, and MM traversal techniques
- As expected, each technique minimized its target metric (latency, cost, and memory, respectively)
- However, at very low processing levels the simplest algorithms tended to do best (but who cares?)
18. Quality of Service (QoS) Scheduling
- Definitions
- Utility: how useful a tuple will be when it exits the query
- Urgency: the steepness of the downward slope of the latency-utility QoS graph; in other words, how fast utility deteriorates
- Approach
- Track the latency of tuples residing in the queues, and pick tuples for processing whose execution will provide the highest aggregate QoS delivered to the applications
19. Latency-Utility Relationship
[Graph: quality of service (utility) on the y-axis from 0 to 1, latency on the x-axis; utility stays high until critical points are reached, then slopes downward. The older the data gets, the less it is worth, and the lower the quality of service.]
Aurora combines the QoS graphs of each query being executed with the average latency of the tuples in each box to decide which superbox to execute next. The idea is to maintain, on average, the highest quality of service.
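A latency-utility graph of this shape can be modeled as a piecewise-linear function over critical points. The points below are invented for illustration (in Aurora, each application supplies its own graph); urgency is then just how fast utility is falling at the tuple's current latency:

```python
# Hypothetical sketch of a piecewise-linear latency-utility QoS graph.
# The critical points below are invented; Aurora's actual graphs are
# supplied per application.

# (latency, utility) critical points: full utility up to latency 2,
# decaying to zero by latency 10.
POINTS = [(0.0, 1.0), (2.0, 1.0), (10.0, 0.0)]

def utility(latency):
    """Linearly interpolate utility between the critical points."""
    for (x0, y0), (x1, y1) in zip(POINTS, POINTS[1:]):
        if latency <= x1:
            return y0 + (y1 - y0) * (latency - x0) / (x1 - x0)
    return POINTS[-1][1]                 # past the last point: flat

def urgency(latency, eps=1e-6):
    """Urgency = how fast utility is deteriorating (negative slope)."""
    return (utility(latency) - utility(latency + eps)) / eps

print(utility(6.0))   # 0.5 -- halfway down the sloped segment
```

A tuple at latency 1 has zero urgency (the graph is still flat there), while a tuple at latency 6 is losing utility at the slope of the falling segment, so it would be favored by the scheduler.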
20. QoS Scheduling Scalability
- Problem
- A per-tuple approach to QoS-based scheduling will not scale because of the amount of processing needed to maintain it
- Solution
- Latency is not calculated at the tuple level; rather, it is calculated as the average latency of the tuples in the box input queue
- Priority is assigned based on a combination of utility and urgency
- Once a box's priority (priority tuple, or p-tuple) is calculated, the boxes are placed in logical buckets based on their priority value
- Scheduling is then done based on the priority of the bucket
- All boxes in a given bucket are considered equal
21. Related Work
- Eddies has a tuple-at-a-time scheduler providing adaptability, but it does not scale well
- Urhan's work addresses rate-based pipeline scheduling of data between operators
- NiagaraCQ provides query optimization for streaming data from wide-area information sources
- STREAM provides comprehensive data stream management using chain scheduling algorithms
- Note that none of the above projects has a notion of QoS