Title: Soft Real-Time Scheduling on Simultaneous Multithreaded Processors
1. Soft Real-Time Scheduling on Simultaneous Multithreaded Processors
- Rohit Jain
- Christopher J. Hughes
- Sarita V. Adve
- IEEE Real-Time Systems Symposium (RTSS '02)
- Presented by Yen-Sheng Chang
- Monday, November 24, 2003
2. Outline
- Abstract
- Introduction to SMT (hyperthreading)
- Related Work
- Resource Sharing Algorithms
- Co-Scheduling Algorithms
- Experimental Methodology
- Results
- Conclusions
3. Abstract
- Simultaneous multithreading (SMT) improves processor throughput by processing instructions from multiple threads each cycle.
- Two decisions with SMT
- Co-schedule selection
- Which threads to run simultaneously (the co-schedule).
- Resource sharing
- How to share processor resources among co-scheduled threads.
- The choice of co-scheduling and resource sharing algorithm may be tightly coupled.
4. Abstract (concluded)
- We find (using simulation) that the best algorithm uses global scheduling, exploits symbiosis, prioritizes high-utilization tasks, and uses dynamic resource sharing.
- Significant profiling overhead
- No admission control
- The proposed alternatives address these limitations by trading off schedulability.
5. Introduction to SMT (hyperthreading)
[Figure: traditional pipelining (single-issue) and superscalar (wide-issue, 2-issue)]
6. SMT (cont.)
[Figure: multithreaded processor]
7. SMT (cont.)
[Figure: simultaneous multithreaded compared with traditional (single-issue), superscalar, and multithreaded processors]
8. SMT (cont.)
- Review of two questions
- co-schedule
- resource sharing
9. Related Work
- Co-schedule selection
- The fine-grained resource sharing problem is unique to SMT.
- ICOUNT (seeks to maximize throughput)
- Does not consider any real-time deadline.
- Symbiotic Job Scheduling for SMT
- Exploits symbiosis.
10. Related Work
- Resource sharing
- Transparent Threads
- No real-time tasks.
- Applications of Thread Prioritization in SMT
- One interactive task with other non-real-time tasks.
- Real-Time Scheduling on Multithreaded Processors
- No SMT, no resource sharing problems.
11. Resource Sharing Algorithms
- The threads share most processor resources
- Instruction fetch mechanism
- Instruction window
- Execution units (functional units)
- Caches
- Previous work has focused mostly on resource sharing algorithms that maximize total throughput in terms of completed instructions per cycle (IPC).
- This may have a negative or positive impact on schedulability.
12. Resource Sharing Algorithms (cont.)
- Throughput-driven resource sharing
- ICOUNT gives priority to the thread that has the fewest instructions in the instruction window.
- Referred to as the dynamic algorithm.
- Performance prediction is difficult.
- Resource sharing with performance guarantees
- Static resource sharing algorithm, where a fixed set of resources is reserved for a given job.
- May be suboptimal!
- Performance prediction is easy (identical to a uniprocessor).
- Resources controlled by thread-specific resource sharing algorithms
- SMT is particularly sensitive to how instruction fetch bandwidth is shared.
- Fetch bandwidth, instruction window.
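As a rough illustration of the ICOUNT priority rule described above, the following Python sketch picks the threads to fetch from in one cycle. The function name `icount_pick`, the `in_flight` dictionary, and the `fetch_threads` parameter are assumptions made for this example; they do not come from the paper or the slides.

```python
def icount_pick(in_flight, fetch_threads=1):
    """Return the thread ids to fetch from this cycle, ICOUNT style.

    in_flight maps a thread id to the number of its instructions currently
    in the instruction window (hypothetical bookkeeping for this sketch).
    Threads with the fewest in-flight instructions get priority; ties are
    broken by thread id so the result is deterministic.
    """
    ranked = sorted(in_flight, key=lambda tid: (in_flight[tid], tid))
    return ranked[:fetch_threads]


# Example: three hardware threads with different window occupancy.
window_occupancy = {"T0": 12, "T1": 3, "T2": 7}
print(icount_pick(window_occupancy, fetch_threads=2))
```

Because the ranking changes every cycle with window occupancy, a thread's share of the fetch bandwidth depends on its co-runners, which is one way to see why performance prediction is difficult under this policy.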
13. Co-Scheduling Algorithms
- Design space explored
- Partitioning vs. global scheduling
- Symbiosis-aware vs. symbiosis-oblivious
- Prediction of Execution Time, Utilization, and Symbiosis
- Partitioning Algorithms
- Global Scheduling Algorithms
14. Co-Scheduling Algorithms (cont.)
- Partitioning vs. global scheduling
- Admission control
- Task migration on an SMT processor is free.
- Symbiosis-aware vs. symbiosis-oblivious
- With dynamic resource sharing on an SMT processor, the execution time of a job depends on which other jobs are co-scheduled with it.
15. Co-Scheduling Algorithms (cont.)
- Design space explored
- Earliest deadline first (EDF) is the underlying algorithm.
- Partitioning → EDF schedules the tasks within a context.
- Global scheduling → EDF chooses the next task.
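The difference between the two uses of EDF above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation; the `Job` class and the function names `edf_partitioned` and `edf_global` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    deadline: float  # absolute deadline, in any consistent time unit


def edf_partitioned(partitions):
    """Partitioning: each context runs the earliest-deadline job of its own partition."""
    return [min(p, key=lambda j: j.deadline) if p else None for p in partitions]


def edf_global(ready, n_contexts):
    """Global scheduling: the n_contexts earliest-deadline jobs run, whatever their origin."""
    return sorted(ready, key=lambda j: j.deadline)[:n_contexts]


a, b, c, d = Job("a", 10), Job("b", 20), Job("c", 5), Job("d", 30)

# With partitions {a, c} and {b, d}, context 0 runs c and context 1 runs b.
print([j.name for j in edf_partitioned([[a, c], [b, d]])])
# A global scheduler with two contexts runs c and a instead.
print([j.name for j in edf_global([a, b, c, d], n_contexts=2)])
```

In the example, job a has an earlier deadline than job b but cannot run under partitioning because it shares a partition with c.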
16. Predicting Execution Time, Utilization, and Symbiosis
- IPC and instruction count
17. Predicting Execution Time, Utilization, and Symbiosis (cont.)
- IPC
- Job IPC in single-thread mode
- → Profile one frame of each frame type.
- Job IPC with static resource sharing
- → Profile each allocation in single-threaded mode.
- Job IPC with dynamic resource sharing
- → Profile all possible co-schedules to obtain the task IPCs.
- Average job IPC with dynamic resource sharing
- → When the IPCs depend on the co-schedule, an approximation must be made.
- → We use the job IPC averaged across all possible co-schedules, as measured above.
- Instruction count
- Use the average instruction count of a large number of frames as the prediction.
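To make the prediction steps above concrete, here is a small Python sketch. It assumes the standard relations execution time = instruction count / (IPC × clock frequency) and utilization = execution time / period, which the slides do not spell out; all function names, the clock frequency, and the sample numbers are hypothetical.

```python
from statistics import mean


def average_ipc(ipc_by_coschedule):
    """Average a job's profiled IPC over all co-schedules it was measured in."""
    return mean(ipc_by_coschedule.values())


def predicted_execution_time(instruction_count, ipc, clock_hz):
    """Predicted execution time in seconds (assumed relation, see lead-in)."""
    return instruction_count / (ipc * clock_hz)


def predicted_utilization(execution_time, period):
    """Fraction of each period the job is predicted to occupy."""
    return execution_time / period


# Hypothetical profile of one job against three possible co-schedules.
profile = {("B",): 1.0, ("C",): 2.0, ("B", "C"): 3.0}
ipc = average_ipc(profile)
exec_time = predicted_execution_time(4_000_000, ipc, clock_hz=1e9)
print(ipc, exec_time, predicted_utilization(exec_time, period=0.01))
```

The averaged IPC is only an approximation: the job's actual IPC in any given period depends on which co-schedule it lands in.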
18. Partitioning Algorithm
- Differences between multiprocessor and SMT
- A partition in SMT is a set of tasks such that no two will execute simultaneously. (Thus, on an SMT supporting N contexts, up to N partitions may be created.)
- Migrating tasks on an SMT is free!
- SMT can allocate resources among the contexts.
19. Partitioning Algorithm (cont.)
- PART-NOSYM-DYN-b
- Bin-packing-based algorithm that uses the first-fit-decreasing-utilization (FFDU) heuristic.
- Utilization is only an approximation (hence the need for the enhanced version).
- PART-NOSYM-DYN-e
- Modifies the admission test so that no task set is ever rejected in this phase.
- Simulates the schedule for a hyperperiod to determine whether it would meet the deadlines.
- Complexity is high, but it gives partitioning the fairest showing against global scheduling.
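The FFDU heuristic mentioned above can be sketched as a plain bin-packing loop. This is a minimal illustration under stated assumptions: each partition is admitted with the classic uniprocessor EDF bound (total utilization at most 1.0), and the function name `ffdu_partition` and the sample task set are hypothetical rather than taken from the paper.

```python
def ffdu_partition(utilization, n_contexts, capacity=1.0):
    """First-fit-decreasing-utilization partitioning.

    utilization maps a task name to its (predicted) utilization.
    Returns one task list per context, or None if some task fits nowhere,
    which corresponds to rejecting the task set.
    """
    partitions = [[] for _ in range(n_contexts)]
    loads = [0.0] * n_contexts
    # Consider tasks from highest to lowest utilization (ties by name).
    for task in sorted(utilization, key=lambda t: (-utilization[t], t)):
        for i in range(n_contexts):
            # Assumed admission test: EDF utilization bound per partition.
            if loads[i] + utilization[task] <= capacity:
                partitions[i].append(task)
                loads[i] += utilization[task]
                break
        else:
            return None  # no partition can accommodate this task
    return partitions


tasks = {"a": 0.5, "b": 0.5, "c": 0.25, "d": 0.75}
print(ffdu_partition(tasks, n_contexts=2))
print(ffdu_partition(tasks, n_contexts=1))
```

With dynamic resource sharing the utilizations fed to such a test are approximations, so a task set that passes may still miss deadlines, and one that fails may in fact be schedulable.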
20. Partitioning Algorithm (cont.)
- PART-NOSYM-STAT
- Independent of the co-schedule
- Depends only on the resource allocation
- → No need for an enhanced version!
- FFDU heuristic with an EDF admission test.
- C1, C2, ..., CN denote the N hardware contexts.
- Initially, all resources are allocated to C1.
- Resources are re-allocated from C1 to Ck such that Ck can accommodate the task.
- If C1 does not have enough resources → fail.
- Move the smallest-utilization task from C1 to another context that can accommodate it.
- If no such context is found → fail.
21. Partitioning Algorithm (cont.)
- PART-SYM-DYN-b
- Maximizes average symbiosis among tasks in different partitions, while keeping the total utilization of tasks in each partition reasonably balanced.
- Weighted hypergraph with nodes representing the tasks
- A hypergraph is a graph in which generalized edges (called hyperedges) may connect more than two nodes.
- The weight on a hyperedge (u1, u2, ..., uN) is the inverse of the symbiosis factor of the co-schedule formed by tasks u1, u2, ..., uN.
- Each node is weighted with its task's utilization.
- A hypergraph-partitioning algorithm is used.
- The sum of node weights (utilization) is balanced.
- The weight of the hyperedges is minimized (maximizing symbiosis).
- PART-SYM-DYN-e
Reference: B. L. Chamberlain. Graph Partitioning Algorithms for Distributing Workloads of Parallel Computations.
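The hypergraph formulation above can be illustrated with a brute-force Python sketch for very small task sets. It is not the partitioning algorithm from the cited reference: it simply enumerates every assignment, keeps those whose partition loads differ by at most `max_imbalance`, and minimizes the total weight of hyperedges that span more than one partition. Counting only cut hyperedges, the function names, and the sample symbiosis factors are all assumptions made for illustration.

```python
from itertools import product


def cut_weight(assignment, edge_weights):
    """Total weight of hyperedges whose tasks fall into more than one partition."""
    return sum(
        w for edge, w in edge_weights.items()
        if len({assignment[t] for t in edge}) > 1
    )


def best_partition(util, symbiosis, n_parts=2, max_imbalance=0.25):
    """Exhaustive search; only practical for a handful of tasks."""
    # Hyperedge weight = inverse of the co-schedule's symbiosis factor.
    edge_weights = {edge: 1.0 / s for edge, s in symbiosis.items()}
    tasks = sorted(util)
    best = None
    for labels in product(range(n_parts), repeat=len(tasks)):
        assignment = dict(zip(tasks, labels))
        loads = [sum(util[t] for t in tasks if assignment[t] == p)
                 for p in range(n_parts)]
        if max(loads) - min(loads) > max_imbalance:
            continue  # node weights (utilization) not balanced enough
        cost = cut_weight(assignment, edge_weights)
        if best is None or cost < best[0]:
            best = (cost, assignment)
    return best


util = {"A": 0.5, "B": 0.5, "C": 0.5, "D": 0.5}
pair = lambda x, y: frozenset((x, y))
symbiosis = {pair("A", "B"): 0.5, pair("C", "D"): 0.5,
             pair("A", "C"): 2.0, pair("A", "D"): 2.0,
             pair("B", "C"): 2.0, pair("B", "D"): 2.0}
print(best_partition(util, symbiosis))
```

In this example A and B run poorly together, as do C and D, so the search places A with B and C with D in the same partitions. Tasks in one partition never run simultaneously, which leaves only the high-symbiosis pairs as possible co-schedules.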
22. Global Scheduling Algorithms
23. Global Scheduling Algorithms (cont.)
- Symbiosis-Oblivious Global Scheduling
- GLOB-NOSYM-PLAIN
- EDF
- GLOB-NOSYM-US
- EDF-US[m/(2m-1)] algorithm
- Gives the highest priority to high-utilization tasks in the task set.
Reference: A. Srinivasan and S. Baruah. Deadline-based Scheduling of Periodic Task Systems on Multiprocessors.
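A minimal Python sketch of the EDF-US[m/(2m-1)] priority rule follows: tasks whose utilization exceeds m/(2m-1) receive the highest priority, and the remaining tasks are ordered by deadline. The tuple layout, the tie-breaking by name among heavy tasks, and the function name `edf_us_order` are assumptions for this example.

```python
def edf_us_order(tasks, m):
    """Return task names in priority order under EDF-US[m/(2m-1)].

    tasks is a list of (name, utilization, deadline) tuples and m is the
    number of processors (hardware contexts on an SMT).
    """
    threshold = m / (2 * m - 1)
    # Heavy tasks always outrank light ones; ties broken by name here.
    heavy = sorted((t for t in tasks if t[1] > threshold), key=lambda t: t[0])
    # Light tasks are prioritized by plain EDF.
    light = sorted((t for t in tasks if t[1] <= threshold), key=lambda t: t[2])
    return [t[0] for t in heavy + light]


# With m = 2 the threshold is 2/3, so only task "a" counts as heavy.
tasks = [("a", 0.9, 50), ("b", 0.3, 10), ("c", 0.5, 20)]
print(edf_us_order(tasks, m=2))
```

Plain EDF would run b and c first in this example; the utilization-based rule promotes the heavy task a ahead of them even though its deadline is later.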
24. Global Scheduling Algorithms (cont.)
- Symbiosis-Aware Global Scheduling
- GLOB-SYM-PLAIN
- Extends EDF to exploit symbiosis in a straightforward way.
- It first selects the task with the earliest deadline.
- For the other (N-1) tasks, it chooses the set that maximizes symbiosis when running with the first task.
- Positive → improves schedulability (improves overall throughput).
- Negative → potentially reduces schedulability (the other tasks are chosen without regard to real-time characteristics).
- GLOB-SYM-US
- Addresses the drawback of GLOB-SYM-PLAIN.
- Defaults to GLOB-NOSYM-US if a task Ti has utilization Ui > N/(2N-1).
- Otherwise, it defaults to GLOB-SYM-PLAIN.
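The GLOB-SYM-PLAIN selection step described above might look like the following Python sketch. It assumes a lookup table of profiled symbiosis factors keyed by the set of co-scheduled tasks; the table, the default of 0.0 for unprofiled co-schedules, and the function name `glob_sym_plain` are hypothetical.

```python
from itertools import combinations


def glob_sym_plain(deadlines, symbiosis, n_contexts):
    """Pick a co-schedule: earliest-deadline task plus its most symbiotic partners.

    deadlines maps a ready task to its absolute deadline.
    symbiosis maps a frozenset of task names to that co-schedule's
    symbiosis factor (assumed to come from offline profiling).
    """
    # Step 1: plain EDF picks the first task.
    first = min(deadlines, key=lambda t: (deadlines[t], t))
    others = sorted(t for t in deadlines if t != first)
    k = min(n_contexts - 1, len(others))
    # Step 2: the remaining k tasks are chosen purely for symbiosis.
    best = max(
        combinations(others, k),
        key=lambda rest: symbiosis.get(frozenset((first,) + rest), 0.0),
    )
    return [first, *best]


deadlines = {"a": 10, "b": 20, "c": 30}
symbiosis = {frozenset("ab"): 1.1, frozenset("ac"): 1.6, frozenset("bc"): 1.9}
print(glob_sym_plain(deadlines, symbiosis, n_contexts=2))
```

In the example, task c is co-scheduled with a even though b has the earlier deadline, which shows how ignoring deadlines in the second step can reduce schedulability.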
25. Co-Scheduling Algorithms (concluded)
26. Experimental Methodology
- Metrics
- Critical serial utilization (CSU)
- The total utilization obtained by uniformly increasing the utilization of all tasks until a further increase causes the task set to become unschedulable. (5% deadline misses → soft real-time)
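The CSU definition above can be sketched as a search over a uniform scale factor. This Python example assumes that schedulability is monotonic in the scale factor and that a schedulability test is supplied by the caller; in the presentation that test is a simulation, whereas the `is_schedulable` predicate and the bound used below are purely hypothetical stand-ins.

```python
def critical_serial_utilization(utils, is_schedulable, hi=64.0, tol=1e-6):
    """Binary-search the largest uniform scaling that stays schedulable.

    utils is the list of per-task utilizations before scaling.
    is_schedulable takes a list of scaled utilizations and returns a bool.
    hi must be a scale factor at which the task set is unschedulable.
    """
    if is_schedulable([u * hi for u in utils]):
        raise ValueError("hi must be large enough to be unschedulable")
    lo = 0.0  # scale factor known to be schedulable (no load at all)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if is_schedulable([u * mid for u in utils]):
            lo = mid
        else:
            hi = mid
    return lo * sum(utils)


# Hypothetical test: the task set is schedulable up to a total utilization of 1.5.
csu = critical_serial_utilization([0.1, 0.3], lambda us: sum(us) <= 1.5)
print(round(csu, 3))
```

A higher CSU means the algorithm under test can sustain a heavier version of the same task set before it becomes unschedulable.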
27. Experiment Setup
- RSIM simulator
- Real workload
28. Results
- Best algorithm → GLOB-SYM-US
- Static vs. dynamic resource sharing
- Static resource sharing generally implies lower throughput than dynamic resource sharing.
- Partitioning vs. global algorithm
- The enhanced version is more competitive with GLOB-SYM-US.
- Symbiosis-awareness
- Partitioning → often helps
- Global scheduling → it depends
29. Conclusions
- Best algorithm
- Global scheduling, exploits symbiosis, prioritizes high-utilization tasks, uses dynamic resource sharing.
- Requires a lot of profiling.
- Two alternatives
- Partitioning algorithm that uses static resource sharing (PART-NOSYM-STAT)
- Worse schedulability and somewhat more complex.
- Provides strict admission control and requires less profiling.
- Earliest deadline first global algorithm (GLOB-NOSYM-PLAIN)
- Does not provide strict admission control, but requires no profiling.
30. Conclusions (concluded)
- Dynamic resource sharing is better than static resource sharing for schedulability.
- Partitioning algorithms can be made competitive with global scheduling algorithms, but with more complexity.
- Symbiosis-awareness
- Beneficial for partitioning algorithms because they do not entirely ignore real-time constraints.
- Can hurt or help global scheduling algorithms, depending on the relative magnitude of the symbiosis factors and the total utilization of the applications.