Title: Thread Scheduling for Multiprogrammed Multiprocessors
1. Thread Scheduling for Multiprogrammed Multiprocessors
Robert D. Blumofe, The University of Texas at Austin
Work done in collaboration with Nimar Arora, Dionisios Papadopoulos, and Greg Plaxton
2. Static Partitioning
A program partitions the work T1 evenly among P
(light-weight) processes, also known as kernel
threads, so each process performs T1/P work.
[Figure: the work T1 divided into four equal pieces, one each for processes 1 through 4.]
At runtime, P processors execute the P processes
in parallel, taking time T1/P and realizing
linear speedup.
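As a concrete illustration, here is a minimal sketch of static partitioning in C++ (a hypothetical example, not from the talk; static_sum and its parameters are invented for illustration): the work of summing an array is divided evenly among P kernel threads.

    #include <cstddef>
    #include <thread>
    #include <vector>

    // Statically partition the work of summing `data` among P kernel
    // threads: each thread gets a contiguous chunk of about n/P elements,
    // so each performs T1/P work.
    long long static_sum(const std::vector<int>& data, int P) {
        std::vector<long long> partial(P, 0);
        std::vector<std::thread> workers;
        std::size_t n = data.size();
        for (int p = 0; p < P; ++p) {
            std::size_t lo = n * p / P, hi = n * (p + 1) / P;
            workers.emplace_back([&, p, lo, hi] {
                long long s = 0;
                for (std::size_t i = lo; i < hi; ++i) s += data[i];
                partial[p] = s;                   // one chunk per process
            });
        }
        for (auto& w : workers) w.join();         // wait for every chunk
        long long total = 0;
        for (long long s : partial) total += s;
        return total;
    }

If the kernel preempts even one of these threads, the whole computation waits for its chunk, which is exactly the failure mode described next.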
3. Multiprogrammed Environments
If another program is running concurrently, then
the P processes may execute on PA < P processors.
In general, we hope to achieve execution time
T1/PA, thereby realizing linear speedup.
The statically partitioned program may fall far
short of this goal.
In this example, with the four processes of the previous slide running on PA = 3 processors, the execution time is T1/2 rather than T1/3.
4. Performance of Static Partitioning
Speedup measured on 8-processor Sun Ultra
Enterprise 5000.
[Figure: speedup (1 to 8) versus number of processes (1 to 32).]
5. Our Performance
Speedup measured on 8-processor Sun Ultra
Enterprise 5000.
[Figure: speedup (1 to 8) versus number of processes (1 to 32).]
6. Dynamic Scheduling
A program partitions the work into (user-level)
threads to expose all of the parallelism. A
computation may create millions of threads.
Threads are dynamically scheduled through two
levels.
Each computation has a (user-level) thread
scheduler that maps its threads to its processes.
The kernel maps all processes to all processors.
We define the processor average PA of a
computation as the time-average number of
processors on which the computation executes, as
determined by the kernel.
Goal: execution time T ≈ T1/PA, irrespective of kernel scheduling.
7. Our Results: Theory and Practice
We present a (user-level) thread scheduler, the
non-blocking work stealer, and we show that its
execution time T satisfies the following bounds.
T∞ is the critical-path length, the theoretical minimum execution time with infinitely many processors.

Theory: E[T] = O(T1/PA + T∞P/PA).
- The kernel is assumed to be an adversary.
- This bound is optimal to within a constant factor.
- For any ε > 0, we have T = O(T1/PA + (T∞ + lg(1/ε))P/PA) with probability at least 1 − ε.

Practice: T ≈ T1/PA + T∞P/PA.
- We have T ≈ T1/PA whenever P is small relative to the average parallelism, T1/T∞.
8. Outline
- The non-blocking work stealer
- Work stealing
- The non-blocking implementation
- Algorithmic analysis
- Empirical analysis
- Related work
- Conclusion
9. Work Stealing
Each process maintains a pool of ready threads
organized as a deque (double-ended queue) with a
top and a bottom.
A process obtains work by popping the bottom-most
thread from its deque and executing that thread.
- If the thread blocks or terminates, then the
process pops another thread.
- If the thread creates or enables another thread,
then the process pushes one thread on the bottom
of its deque and continues executing the other.
If a process finds that its deque is empty, then
it becomes a thief and steals the top-most thread
from the deque of a randomly chosen victim
process.
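A minimal sketch of this per-process scheduling loop in C++ (hypothetical type and helper names; the deque operations themselves are the subject of the next slide):

    #include <cstdlib>     // std::rand
    #include <sched.h>     // sched_yield (POSIX)

    struct Thread;                      // user-level thread descriptor
    struct Deque {                      // per-process pool of ready threads
        void pushBottom(Thread* t);
        Thread* popBottom();            // returns nullptr if empty
        Thread* popTop();               // used by thieves; nullptr on failure
    };

    extern Deque deques[];              // one deque per process
    extern int P;                       // number of processes

    // Runs t until it blocks, terminates, or creates/enables another
    // thread; in the latter case one thread is pushed on the bottom of d
    // and the other is returned to keep executing. Returns nullptr when
    // there is nothing to continue with.
    Thread* execute(Thread* t, Deque& d);

    void scheduler(int self) {
        Thread* t = nullptr;
        for (;;) {
            if (t == nullptr)
                t = deques[self].popBottom();   // obtain local work
            if (t != nullptr) {
                t = execute(t, deques[self]);   // work until t yields control
                continue;
            }
            // Empty deque: become a thief and try a random victim.
            int victim = std::rand() % P;
            t = deques[victim].popTop();        // steal the top-most thread
            if (t == nullptr)
                sched_yield();  // yield between consecutive steal attempts
        }
    }

The yield call anticipates the second feature of the non-blocking work stealer, described on the next slide.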
10. Non-Blocking Implementation
- The non-blocking work stealer is an implementation of work stealing with the following two features.
  - The deques are implemented with non-blocking synchronization (a sketch follows this list).
    - Instead of locks, atomic load-test-store machine instructions are used. Examples: load-linked/store-conditional and compare-and-swap.
    - There exists a constant c (roughly 10) such that if a process performs a deque operation, then after executing c instructions, some process has succeeded in performing a deque operation.
  - Each process, between consecutive steal attempts, performs a yield system call.
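A sketch of such a deque using C++11 atomics and compare-and-swap (a simplified reconstruction in the style of the algorithm, not the Hood source; overflow checks and memory-ordering refinements are omitted). The top index is packed with a tag into a single "age" word so that one compare-and-swap resolves races between thieves and the owner, and no operation ever waits on a lock:

    #include <atomic>
    #include <cstdint>

    struct Thread;                             // user-level thread descriptor

    class NonBlockingDeque {
        static const uint32_t CAP = 1 << 16;   // fixed capacity, no overflow check
        Thread* buf[CAP];
        std::atomic<uint64_t> age{0};          // packs {tag, top} in one word
        std::atomic<uint32_t> bot{0};

        static uint64_t pack(uint32_t tag, uint32_t top) {
            return (uint64_t(tag) << 32) | top;
        }
        static uint32_t tagOf(uint64_t a) { return uint32_t(a >> 32); }
        static uint32_t topOf(uint64_t a) { return uint32_t(a); }

    public:
        void pushBottom(Thread* t) {           // owner only
            uint32_t b = bot.load();
            buf[b] = t;
            bot.store(b + 1);
        }

        Thread* popTop() {                     // called by thieves
            uint64_t a = age.load();
            uint32_t b = bot.load();
            if (topOf(a) >= b) return nullptr; // deque is empty
            Thread* t = buf[topOf(a)];
            uint64_t na = pack(tagOf(a), topOf(a) + 1);
            if (age.compare_exchange_strong(a, na))
                return t;                      // won the race for the top thread
            return nullptr;                    // lost to a concurrent operation
        }

        Thread* popBottom() {                  // owner only
            uint32_t b = bot.load();
            if (b == 0) return nullptr;
            bot.store(--b);
            Thread* t = buf[b];
            uint64_t a = age.load();
            if (b > topOf(a)) return t;        // more than one thread: no race
            bot.store(0);                      // deque will be empty either way
            uint64_t na = pack(tagOf(a) + 1, 0);  // bump tag to avoid ABA
            if (b == topOf(a) && age.compare_exchange_strong(a, na))
                return t;                      // won the race for the last thread
            age.store(na);                     // a thief already took it
            return nullptr;
        }
    };

Each operation completes in a bounded number of instructions regardless of what other processes are doing, which is what the constant c above captures.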
11. Implementation Status
- The non-blocking work stealer has been implemented in the Hood C++ threads library.
- Hood is instrumented to measure work and critical-path length.
- Runs on Sun machines with SPARC v9 processors running Solaris.
- Ports underway for SGI machines with MIPS processors running IRIX and x86-based machines running Windows NT.

Implementations for Cilk and Java are in the planning stages.
12. Outline
- The non-blocking work stealer
- Algorithmic analysis
- Lower bounds
- Analysis of non-blocking work stealer
- Empirical analysis
- Related work
- Conclusion
13. Dag Model
A multithreaded computation is modeled as a dag
(directed acyclic graph).
- Each node represents one executed instruction and
takes one time unit to execute.
- We assume a single source node and out-degree at
most 2.
- The work T1 is the number of nodes. The critical-path length T∞ is the length of a longest (directed) path.
- A node is ready if all of its ancestors have been
executed. Only ready nodes can be executed.
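As a toy example (not from the talk): a computation that spawns two children per node down to depth d, one instruction per node, is a complete binary tree with

    T_1 = 2^{d+1} - 1, \qquad T_\infty = d + 1,

so its average parallelism T1/T∞ grows exponentially in the critical-path length.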
14. Lower Bounds
At each time step i = 1, 2, …, T, the kernel chooses to schedule any subset of the P processes, and those scheduled processes execute one instruction. Let pi denote the number of processes scheduled at step i.

The processor average is defined by PA = (1/T) Σi pi, where the sum runs over the T steps of the execution and T is the execution time.

- T ≥ T∞P/PA, because the kernel can force Σi pi ≥ T∞P.
  - There must be at least T∞ steps i with pi ≠ 0, and for each such step, the kernel can schedule pi = P processes.
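In symbols, the argument is one step from the definitions above:

    \sum_{i=1}^{T} p_i \ge T_\infty P
    \quad\Longrightarrow\quad
    T = \frac{\sum_{i=1}^{T} p_i}{P_A} \ge \frac{T_\infty P}{P_A}.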
15. Greedy Schedules
A schedule is greedy if at each step i, the
number of nodes executed is equal to the minimum
of pi and the number of ready nodes.
Theorem: Any greedy schedule has length at most T1/PA + T∞P/PA.

Proof: We prove that Σi pi ≤ T1 + T∞P. At each step, each scheduled process pays one token.

- If the process executes a node, then it places a token in the work bucket. Execution ends with T1 tokens in the work bucket.
- Otherwise, the process places a token in the idle bucket. There are at most T∞ steps at which a process places a token in the idle bucket, and at each such step at most P tokens are placed in the idle bucket.
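The stated bound on the schedule length follows by dividing the token count by PA, since T = (Σi pi)/PA:

    T = \frac{\sum_{i=1}^{T} p_i}{P_A}
      \le \frac{T_1 + T_\infty P}{P_A}
      = \frac{T_1}{P_A} + \frac{T_\infty P}{P_A}.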
16. Analysis of Non-Blocking Work Stealer
Theorem: The non-blocking work stealer runs in expected time O(T1/PA + T∞P/PA).

Proof sketch: Let S denote the number of steal attempts. We prove that Σi pi = O(T1 + S) and E[S] = O(T∞P). At each step, each scheduled process pays one token.

- If the process is working, then it places a token in the work bucket. Execution ends with O(T1) tokens in the work bucket.
- Otherwise, the process places a token in the steal bucket. Execution ends with O(S) tokens in the steal bucket.
17. Enabling Tree
- An edge (u,v) is an enabling edge if the
execution of u made v ready. Node u is the
designated parent of v.
- The enabling edges form an enabling tree.
18. Structural Lemma
For any deque, at all times during the execution
of the work-stealing algorithm, the designated
parents of the nodes in the deque lie on a
root-to-leaf path in the enabling tree.
- Consider any process at any time during the execution.
- v0 is the ready node of the thread that is being executed.
- v1, v2, …, vk are the ready nodes of the threads in the process's deque, ordered from bottom to top.
- For i = 0, 1, …, k, node ui is the designated parent of vi.
- Then for i = 1, 2, …, k, node ui is an ancestor of ui−1 in the enabling tree.
19. Analysis of Steal Attempts
We use a potential function to bound the number
of steal attempts.
At each step i, each ready node u has potential φi(u) = 3^(T∞ − d(u)), where d(u) is the depth of u in the enabling tree.

The potential Φi at step i is the sum of all ready-node potentials.

- The deques are top-heavy: the top-most node contributes a constant fraction of its deque's potential.
- With constant probability, P steal attempts cause the potential to decrease by a constant fraction.
- The initial potential is Φ0 = 3^T∞, and it never increases.
- The expected number of steal attempts until the potential decreases to 0 is O(T∞P).
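Chaining these facts (a sketch; the constants are illustrative): the potential must fall by a constant factor O(T∞) times before reaching 0, and each constant-factor drop costs an expected O(P) steal attempts, so

    \mathrm{E}[S] = O(P \cdot \log_3 \Phi_0)
                  = O(P \cdot \log_3 3^{T_\infty})
                  = O(T_\infty P).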
20. Outline
- The non-blocking work stealer
- Algorithmic analysis
- Empirical analysis
- Yields in practice
- Performance modeling
- Related work
- Conclusion
21. Performance Without Yield
Speedup measured on 8-processor Sun Ultra
Enterprise 5000.
[Figure: speedup (1 to 8) versus number of processes (1 to 32).]
22. Yield in Practice
[Figure: normalized execution time (0 to 3) for lu, barnes, and heat with 8 and 16 processes; a dag inset marks the executed nodes and a node whose process was swapped out just before executing it.]
Processes spin making steal attempts, but all
deques are empty.
23. Performance Model
Execution time: T ≈ c1 T1/PA + c2 T∞P/PA.

Utilization: T1/(PA T), the fraction of processor time spent doing work.

The ratio P/(T1/T∞) is the normalized number of processes.
For all multithreaded applications and all input
problems, the utilization can be lower bounded as
a function of one number, the normalized number
of processes.
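The lower bound comes from rearranging the model (a sketch; equality holds when the model is tight):

    \frac{T_1}{P_A T}
    = \frac{1}{c_1 + c_2\, T_\infty P / T_1}
    = \frac{1}{c_1 + c_2 \cdot P/(T_1/T_\infty)},

which depends only on the normalized number of processes.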
We test this claim with a synthetic application,
knary, that produces a wide range of work and
critical-path lengths for different inputs.
24. Knary Utilization
Utilization measured on 8-processor Sun Ultra
Enterprise 5000.
No other program is running, so PA = min{8, P}.
[Figure: utilization (0.1 to 1, log scale) versus normalized processes (1e-5 to 100, log scale).]
25. Application Utilization
Utilization measured on 8-processor Sun Ultra
Enterprise 5000.
No other program is running, so PA = min{8, P}.
[Figure: utilization (0.1 to 1, log scale) versus normalized processes (1e-5 to 100, log scale).]
26. Varying Number of Processors
To test the model when the number of processors
varies over time, we run the test applications
concurrently with a synthetic application, cycler.
- Repeatedly, cycler creates a random number of processes, each of which runs for a random amount of time.
- Each process repeatedly increments a shared counter.
- At regular intervals, the counter value and a timestamp are written to a buffer.
- For any time interval, we can look at the counter values at the start and end to determine the processor average PA(cycler) for cycler over that interval.
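A sketch of cycler's measurement bookkeeping in C++ (hypothetical code, not the actual cycler source; R, the calibrated increment rate of a single process running alone, in increments per second, is an assumption of this sketch):

    #include <atomic>

    std::atomic<long long> counter{0};     // shared among all cycler processes

    void cycler_worker() {                 // body of each cycler process
        for (;;)
            counter.fetch_add(1, std::memory_order_relaxed);
    }

    struct Sample { long long count; double seconds; };  // buffer entry

    // Estimate cycler's processor average over [s0, s1]: the counter
    // growth divided by R gives processor-seconds consumed, and dividing
    // by elapsed wall-clock time gives the time-average processor count.
    double pa_cycler(const Sample& s0, const Sample& s1, double R) {
        double procSeconds = double(s1.count - s0.count) / R;
        return procSeconds / (s1.seconds - s0.seconds);
    }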
27. Knary Utilization
Utilization measured on 8-processor Sun Ultra
Enterprise 5000.
Cycler is also running, so PA = min{8 − PA(cycler), P}.
[Figure: utilization (0.1 to 1, log scale) versus normalized processes (1e-5 to 100, log scale).]
28. Application Utilization
Utilization measured on 8-processor Sun Ultra
Enterprise 5000.
Cycler is also running, so PA = min{8 − PA(cycler), P}.
[Figure: utilization (0.1 to 1, log scale) versus normalized processes (1e-5 to 100, log scale).]
29. Outline
- The non-blocking work stealer
- Algorithmic analysis
- Empirical analysis
- Related work
- Coscheduling
- Process control
- Conclusion
30. Coscheduling
With coscheduling (gang scheduling), all of a computation's processes are scheduled to run in parallel.

- For some computation mixes, coscheduling is not effective. Example: a computation with 4 processes and a computation with 1 process on a 4-processor machine.
- Resource-intensive computations that do not scale down may require coscheduling for high performance. Example: data-parallel programs with large working sets.
31. Process Control
With process control, each computation creates
and kills processes dynamically so that it always
runs with a number of processes equal to the
number of processors assigned to it.
Process control and the non-blocking work stealer
complement each other.
- With work stealing, a new process can be created
at any time, and a process can be killed when its
deque is empty.
- With the non-blocking work stealer, there is
little penalty for operating with more processes
than processors.
- Process control can help keep P close to PA.
32. Outline
- The non-blocking work stealer
- Algorithmic analysis
- Empirical analysis
- Related work
- Conclusion
33. Summary of Results
The non-blocking work stealer is a user-level
thread scheduler that is efficient in theory and
in practice.
Theory: E[T] = O(T1/PA + T∞P/PA).
Practice: T ≈ T1/PA + T∞P/PA.
- These results hold even when the number of
processes exceeds the number of processors or
when the number of processors grows and shrinks
over time.
- These results hold with no need for operating-system support, such as coscheduling or process control, that is not found in commercial systems.
34. Unsubstantiated Claims
- The non-blocking work stealer works well in multiprogrammed environments.
  - Our experiments did not use true multiprogrammed workloads.
  - Our analysis and experiments focus exclusively on processor resources.
- The non-blocking work stealer can be used in multimedia and real-time applications.
  - If the kernel delivers quality-of-service guarantees in terms of PA, then the non-blocking work stealer can translate this guarantee into a guarantee on execution time.
35. Turnkey Applications
Turnkey parallel applications must select the
number of processes automatically. Users may not
know how many processors are in their computer.
With the non-blocking work stealer, applications can automatically create a number of processes equal to the number of processors.
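For example, on Solaris and other POSIX systems the processor count can be queried at startup (a minimal sketch; error handling omitted):

    #include <unistd.h>   // sysconf
    #include <cstdio>

    int main() {
        // Number of processors currently online; create one process
        // (kernel thread) per processor, with no input from the user.
        long nprocs = sysconf(_SC_NPROCESSORS_ONLN);
        std::printf("creating %ld processes\n", nprocs);
        // ... start nprocs copies of the scheduler loop here ...
        return 0;
    }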