Analysis of: Operator Scheduling in a Data Stream Manager
1
Analysis of Operator Scheduling in a Data
Stream Manager
  • CS561 Advanced Database Systems
  • By Eric Bloom

2
Agenda
  • Overview of Stream Processing
  • Aurora Project Goals
  • Aurora Processing Example
  • Aurora Architecture
  • Multi-Thread vs. Single-Thread Processing
  • Important Definitions
  • Superbox Scheduling and Processing
  • Tuple Batching
  • Experimental Evaluation
  • Quality of Service (QoS) Scheduling
  • QoS Scheduling Scalability
  • Related Work

3
Overview of Stream Processing
  • Stream Processing is the processing of
    potentially unbounded, continuous streams of data
  • Data streams are created via micro-sensors, GPS
    devices, monitoring devices
  • Examples include soldier location tracking,
    traffic sensors, stock market exchanges, heart
    monitors
  • Data may be received evenly or in bursts

4
Aurora Project Goals
  • To build a data stream manager that addresses the
    performance and processing requirements of
    stream-based applications
  • To support multiple concurrent continuous queries
    on one or more application data streams
  • To use Quality-of-Service (QoS) based criteria to
    make resource allocation decisions

5
Aurora Processing Example
[Figure: Aurora processing example. Input data streams flow through operator boxes to application outputs; historical storage supports continuous and ad hoc queries.]
6
Aurora Architecture
[Figure: Aurora architecture. Inputs and outputs pass through a router; components include the buffer manager, scheduler, box processors (B1-B4), catalogs, persistent store, load shedder, and QoS monitor.]
7
Multi-Thread vs. Single-Thread Processing
  • Multi-Thread Processing
  • Each query is processed in its own thread
  • The operating system manages resource allocation
  • Advantages
  • Processing can take advantage of efficient
    operating system algorithms
  • Easier to program
  • Disadvantages
  • Software has limited control of resource
    management
  • Additional overhead due to cache misses, lock
    contention, and context switching

8
Multi-Thread vs. Single-Thread Processing
  • Single-Thread Processing
  • All operations are processed within a single
    thread
  • All resource allocation decisions are made by the
    scheduler
  • Advantages
  • Allows processing to be scheduled based on
    latency and other Quality of Service factors
    based on query needs
  • Avoids the limitations of multi-thread processing
  • Disadvantages
  • More complex to program
  • Aurora has chosen to implement a single-threaded
    scheduling model
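The single-threaded model above can be sketched as one scheduler loop that owns every operator and decides itself which one runs next, instead of leaving the interleaving to the OS. The `Box` class, the round-robin pick, and the example operators below are illustrative assumptions, not Aurora's actual implementation:

```python
from collections import deque

class Box:
    """A query operator with an input queue; process() consumes one tuple."""
    def __init__(self, name, fn, downstream=None):
        self.name, self.fn, self.downstream = name, fn, downstream
        self.queue = deque()

    def process(self):
        out = self.fn(self.queue.popleft())
        if self.downstream is not None:
            self.downstream.queue.append(out)
        return out

def run_single_threaded(boxes):
    """One thread makes all scheduling decisions: keep picking a
    runnable box (trivial round-robin here) until no box has input."""
    outputs = []
    progress = True
    while progress:
        progress = False
        for box in boxes:
            if box.queue:
                out = box.process()
                if box.downstream is None:   # reached the output box
                    outputs.append(out)
                progress = True
    return outputs

# Two-box chain: double each value, then add one.
b2 = Box("b2", lambda x: x + 1)
b1 = Box("b1", lambda x: 2 * x, downstream=b2)
b1.queue.extend([1, 2, 3])
print(run_single_threaded([b1, b2]))  # [3, 5, 7]
```

A real scheduler would replace the round-robin pick with the QoS-driven policies described later in the slides.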

9
Important Definitions
  • Quality of Service (QoS): Specific requirements
    that represent the needs of a given query. In
    Aurora, the primary QoS factor is latency.
  • Query Tree: The set of operators (boxes) and
    data streams that represent a query.
  • Superbox: A sequence of operators that are
    scheduled and executed as an atomic group. Aurora
    treats each query as a separate superbox.
  • Two-Level Scheduling: Scheduling is done at two
    levels: first, which superbox to process, and
    second, in what order to execute the operators
    within the chosen superbox.
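Two-level scheduling can be illustrated with a small sketch in which one policy picks the superbox and a second fixes the box order inside it. The dictionaries, policy functions, and latency numbers below are hypothetical:

```python
# Hypothetical two-level scheduler: level 1 picks a superbox,
# level 2 fixes the execution order of its boxes.
def schedule(superboxes, pick_superbox, order_boxes):
    sb = pick_superbox(superboxes)                 # level 1: which superbox runs
    return [(sb["name"], box) for box in order_boxes(sb["boxes"])]  # level 2

queries = [
    {"name": "Q1", "boxes": ["filter", "map", "join"], "avg_latency": 40},
    {"name": "Q2", "boxes": ["filter", "agg"], "avg_latency": 90},
]

# Example policies: run the superbox with the oldest queued tuples,
# and execute its boxes in the listed (already topological) order.
plan = schedule(queries,
                pick_superbox=lambda qs: max(qs, key=lambda q: q["avg_latency"]),
                order_boxes=list)
print(plan)  # [('Q2', 'filter'), ('Q2', 'agg')]
```

Swapping in a different `order_boxes` policy is how the MC/ML/MM traversals described later would plug in.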

10
Important Definitions (Cont.)
  • Scheduling Plan: The combination of dynamic
    superbox scheduling and algorithm-based operator
    execution order within the superbox.
  • Application-at-a-time (AAAT): A term used in
    Aurora for statically defining each query
    (application) as a superbox.
  • Box-at-a-time (BAAT): Scheduling at the box
    level rather than the superbox level.
  • Static and dynamic scheduling approaches: Static
    approaches to scheduling are defined prior to
    runtime. Dynamic approaches use runtime
    information and statistics to adjust and
    prioritize the scheduling order during execution.
  • Traversing a superbox: How the operators within
    a superbox should be scheduled and executed.

11
Non-Superbox Processing
[Figure: box-at-a-time execution across three query trees; the boxes are numbered 1-16 in the order the scheduler runs them.]
12
Superbox Processing
[Figure: three superboxes containing boxes A1-A5, B1-B6, and C1-C5; each query's operators are scheduled and executed as an atomic group.]
13
Superbox Traversal
Superbox traversal refers to how the operators
within a superbox should be executed
  • Min-Cost (MC): Attempts to optimize
    per-output-tuple processing cost by minimizing
    the number of operator calls per output tuple
  • Min-Latency (ML): Attempts to produce initial
    output tuples as soon as possible
  • Min-Memory (MM): Attempts to minimize memory
    usage

14
Superbox Traversal Processing
[Figure: example query tree of boxes B1-B6, with B1 as the output box.]
  • Min-Cost (MC)
  • B4 → B5 → B3 → B2 → B6 → B1
  • Min-Latency (ML)
  • B1 → B2 → B1 → B6 → B1 → B4 → B2 → B1 → B3 → B2 →
    B1 → B5 → B3 → B2 → B1
  • Min-Memory (MM)
  • B3 → B6 → B2 → B5 → B3 → B2 → B1 → B4 → B2 → B1
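A Min-Cost style traversal can be sketched as a post-order walk: processing children before parents lets every box run exactly once per batch, minimizing operator calls. The tree below is a hypothetical stand-in, not the exact tree from the slide, so the resulting order differs from the sequence above (the real order also depends on per-box costs):

```python
# Hypothetical query tree: parent -> children, B1 is the output box.
tree = {"B1": ["B2", "B6"], "B2": ["B3", "B4"], "B3": ["B5"],
        "B4": [], "B5": [], "B6": []}

def min_cost_order(root, children):
    """Post-order traversal: a box runs only after all of its inputs,
    so each box is invoked exactly once per batch."""
    order = []
    def visit(box):
        for c in children[box]:
            visit(c)
        order.append(box)
    visit(root)
    return order

print(min_cost_order("B1", tree))  # ['B5', 'B3', 'B4', 'B2', 'B6', 'B1']
```

Min-Latency would instead push each leaf's output all the way to B1 immediately, which is why the ML sequence above revisits B1 repeatedly.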

15
Tuple Batching (Train Processing)
  • A Tuple Train is a batch of tuples executed
    within a single operator call.
  • The goal of tuple-train processing is to reduce
    the overall processing cost per tuple.
  • Advantages of tuple-train processing:
  • Fewer total operator executions
  • Less low-level overhead such as context
    switching, scheduling, memory management, and
    execution-queue maintenance
  • Some windowing and merge-join operators work
    more efficiently when batching tuples
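Tuple-train processing can be sketched as one operator invocation over a whole batch, so the per-call overhead is paid once instead of once per tuple. `filter_box` and the predicate below are hypothetical examples, not Aurora operators:

```python
# One operator call handles a whole train (batch) of tuples,
# versus one call per tuple in non-batched execution.
def filter_box(train, predicate):
    """A single operator invocation over a train of tuples."""
    return [t for t in train if predicate(t)]

train = [3, 11, 7, 25, 2]
print(filter_box(train, lambda t: t > 5))  # [11, 7, 25]

# Per-tuple execution would cost len(train) = 5 operator calls;
# train processing costs exactly 1.
```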

16
Experimental Evaluation Definitions
  • Stream-based applications do not currently have a
    standardized benchmark
  • Aurora modeled queries as a rooted tree structure
    from a stream input box to an application output
    box
  • Trees are categorized based on depth and fan-out
  • Depth is the number of box levels from input to
    output
  • Fan-out is the average number of children of each
    box
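The two tree metrics can be computed directly. The small parent-to-children tree below is hypothetical, and fan-out is averaged here over internal boxes only, which is one plausible reading of the definition:

```python
# Hypothetical rooted query tree: parent -> children, "out" is the
# application output box.
tree = {"out": ["a", "b"], "a": ["c", "d"], "b": [], "c": [], "d": []}

def depth(node):
    """Number of box levels from this node down to the leaves."""
    kids = tree[node]
    return 1 if not kids else 1 + max(depth(c) for c in kids)

def fan_out():
    """Average number of children, taken over boxes that have children."""
    internal = [n for n in tree if tree[n]]
    return sum(len(tree[n]) for n in internal) / len(internal)

print(depth("out"))  # 3
print(fan_out())     # 2.0
```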

17
Experimental Evaluation Results
  • At low volumes, Round-Robin Box-At-A-Time
    (RR-BAAT) scheduling was almost as efficient as
    Minimum-Cost Application-At-A-Time (MC-AAAT),
    but it was much less efficient at higher volumes
  • At low volumes, the efficiencies of MC-AAAT were
    offset by its more complex scheduling overhead
  • As volumes increased, the efficiencies of MC-AAAT
    became more apparent, as scheduling overhead
    became a smaller percentage of total processing
  • Experimentation was also done to compare the ML,
    MC, and MM scheduling techniques
  • As expected, each technique minimized its target
    attribute (latency, cost, and memory,
    respectively)
  • However, at very low processing levels the
    simplest algorithms tended to do best (though
    that regime matters little in practice)

18
Quality of Service (QoS) Scheduling
  • Definitions
  • Utility: how useful a tuple will be when it
    exits the query
  • Urgency: the angle of the downward slope of the
    utility QoS curve; in other words, how fast the
    utility deteriorates
  • Approach
  • Track the latency of tuples residing in the
    queues, and pick for processing the tuples whose
    execution will deliver the highest aggregate QoS
    to the applications

19
Latency-Utility Relationship
[Figure: QoS curve, quality of service (0 to 1) on the y-axis versus latency on the x-axis, declining past marked critical points.]
The older the data gets, the less it is worth, and
the lower the quality of service. Aurora combines
the QoS charts of each query being executed with
the average latency of the tuples in each box to
decide which superbox to execute next. The idea is
to maintain, on average, the highest quality of
service.
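The latency-utility curve can be sketched as a piecewise-linear function. The 50 ms knee and 200 ms cutoff below are made-up critical points, and urgency appears as the steepness of the slope past the knee:

```python
# Hypothetical piecewise-linear latency-utility curve:
# full utility up to the knee, linear decay to zero at the deadline.
def utility(latency_ms, knee=50.0, deadline=200.0):
    if latency_ms <= knee:
        return 1.0
    if latency_ms >= deadline:
        return 0.0
    return (deadline - latency_ms) / (deadline - knee)

# Urgency: how fast utility deteriorates past the knee (the slope).
urgency = 1.0 / (200.0 - 50.0)

print(utility(30))    # 1.0  (before the critical point)
print(utility(125))   # 0.5  (halfway down the slope)
print(utility(250))   # 0.0  (past the deadline)
```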
20
QoS Scheduling Scalability
  • Problem
  • A per-tuple approach to QoS based scheduling will
    not scale because of the amount of processing
    needed to maintain it
  • Solution
  • Latency is not calculated at the tuple level,
    rather, it is calculated as the average latency
    of tuples in the box input queue
  • Priority is given based on the combination of
    utility and urgency
  • Once a box's priority (priority tuple, or
    p-tuple) is calculated, the boxes are placed in
    logical buckets based on their priority value
  • Scheduling is then done based on the priority of
    the bucket
  • All boxes in a given bucket are considered equal
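The bucketing scheme can be sketched as follows. The `priority` combination (utility times urgency) and the bucket width are hypothetical choices for illustration, not Aurora's actual formula:

```python
# Sketch of bucketed priority scheduling: each box's p-tuple
# (utility, urgency) is folded into one priority value, boxes are
# dropped into coarse buckets, and the scheduler services the
# highest bucket first. All boxes in a bucket count as equals.
def priority(utility, urgency):
    return utility * urgency              # hypothetical combination

def bucket_index(p, width=0.25):
    return int(p / width)                 # coarse-grained, cheap to maintain

boxes = {"b1": (0.9, 0.8), "b2": (0.4, 0.5), "b3": (0.95, 0.9)}
buckets = {}
for name, (u, g) in boxes.items():
    buckets.setdefault(bucket_index(priority(u, g)), []).append(name)

# Service buckets from highest priority down.
for idx in sorted(buckets, reverse=True):
    print(idx, sorted(buckets[idx]))
```

Working with a handful of buckets instead of exact per-tuple priorities is what keeps the bookkeeping cheap enough to scale.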

21
Related Work
  • Eddies has a tuple-at-a-time scheduler
    providing adaptability, but it does not scale well
  • Urhan's work addresses rate-based pipeline
    scheduling of data between operators
  • NiagaraCQ provides query optimization for
    streaming data from wide-area information sources
  • STREAM provides comprehensive data stream
    management using chain scheduling algorithms
  • Note that none of the above projects has a
    notion of QoS