Thread Scheduling for Multiprogrammed Multiprocessors
Transcript and Presenter's Notes
1
Thread Scheduling for Multiprogrammed Multiprocessors
Robert D. Blumofe, The University of Texas at Austin
Work done in collaboration with Nimar Arora, Dionisios Papadopoulos, and Greg Plaxton
2
Static Partitioning
A program partitions the work T1 evenly among P (light-weight) processes, also known as kernel threads, so each process performs T1/P work.
[Diagram: the work T1 divided evenly among process 1 through process 4.]
At runtime, P processors execute the P processes in parallel, taking time T1/P and realizing linear speedup.
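As a concrete illustration (the deck itself shows no code), here is a minimal C++ sketch of static partitioning; the item array and the names process_item and static_partition are hypothetical, not from the talk:

#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

std::atomic<long> checksum{0};

void process_item(int item) { checksum += item; }  // stand-in for real work

void static_partition(const std::vector<int>& items, int P) {
    std::vector<std::thread> procs;
    const std::size_t chunk = (items.size() + P - 1) / P;  // ceil(T1 / P)
    for (int p = 0; p < P; ++p) {
        // Each process gets a fixed T1/P slice, assigned before the run starts.
        std::size_t lo = std::min(items.size(), p * chunk);
        std::size_t hi = std::min(items.size(), lo + chunk);
        procs.emplace_back([&items, lo, hi] {
            for (std::size_t i = lo; i < hi; ++i) process_item(items[i]);
        });
    }
    for (auto& t : procs) t.join();  // time T1/P if all P run in parallel
}

int main() {
    std::vector<int> items(1000, 1);
    static_partition(items, 4);  // P = 4 processes over T1 = 1000 items
}

Note that the assignment of work to processes is fixed up front; nothing rebalances if a process loses its processor, which is exactly the weakness the next slides expose.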
3
Multiprogrammed Environments
If another program is running concurrently, then the P processes may execute on PA < P processors.
In general, we hope to achieve execution time T1/PA, thereby realizing linear speedup.
The statically partitioned program may fall far short of this goal.
In this example, the execution time is T1/2, even though PA = 3.
4
Performance of Static Partitioning
Speedup measured on an 8-processor Sun Ultra Enterprise 5000.
[Plot: speedup (1 to 8) versus number of processes (1 to 32).]
5
Our Performance
Speedup measured on an 8-processor Sun Ultra Enterprise 5000.
[Plot: speedup (1 to 8) versus number of processes (1 to 32).]
6
Dynamic Scheduling
A program partitions the work into (user-level)
threads to expose all of the parallelism. A
computation may create millions of threads.
Threads are dynamically scheduled through two
levels.
Each computation has a (user-level) thread
scheduler that maps its threads to its processes.
The kernel maps all processes to all processors.
We define the processor average PA of a
computation as the time-average number of
processors on which the computation executes, as
determined by the kernel.
Goal: execution time T ≈ T1/PA, irrespective of kernel scheduling.
7
Our Results Theory and Practice
We present a (user-level) thread scheduler, the non-blocking work stealer, and we show that its execution time T satisfies the following bounds.
T∞ is the critical-path length, the theoretical minimum execution time with infinitely many processors.
Theory: E[T] = O(T1/PA + T∞P/PA).
  • The kernel is assumed to be an adversary.
  • This bound is optimal to within a constant factor.
  • For any ε > 0, we have T = O(T1/PA + (T∞ + lg(1/ε))P/PA) with probability at least 1 − ε.

Practice: T ≈ T1/PA + T∞P/PA.
  • We have T ≈ T1/PA whenever P is small relative to the average parallelism, T1/T∞.

8
Outline
  • The non-blocking work stealer
  • Work stealing
  • The non-blocking implementation
  • Algorithmic analysis
  • Empirical analysis
  • Related work
  • Conclusion

9
Work Stealing
Each process maintains a pool of ready threads
organized as a deque (double-ended queue) with a
top and a bottom.
A process obtains work by popping the bottom-most
thread from its deque and executing that thread.
  • If the thread blocks or terminates, then the
    process pops another thread.
  • If the thread creates or enables another thread,
    then the process pushes one thread on the bottom
    of its deque and continues executing the other.

If a process finds that its deque is empty, then
it becomes a thief and steals the top-most thread
from the deque of a randomly chosen victim
process.
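To make the discipline concrete, here is a minimal C++ sketch of the scheduling loop just described. It guards each deque with a mutex purely for brevity (the next slide replaces the locks with non-blocking synchronization), and all names are illustrative rather than Hood's actual API:

#include <atomic>
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <random>
#include <thread>
#include <vector>

using Thread = std::function<void()>;

struct Deque {
    std::deque<Thread> q;
    std::mutex m;
    void push_bottom(Thread t) {
        std::lock_guard<std::mutex> g(m);
        q.push_back(std::move(t));
    }
    std::optional<Thread> pop_bottom() {       // owner works at the bottom
        std::lock_guard<std::mutex> g(m);
        if (q.empty()) return std::nullopt;
        Thread t = std::move(q.back());
        q.pop_back();
        return t;
    }
    std::optional<Thread> pop_top() {          // thieves steal from the top
        std::lock_guard<std::mutex> g(m);
        if (q.empty()) return std::nullopt;
        Thread t = std::move(q.front());
        q.pop_front();
        return t;
    }
};

void worker(int self, std::vector<Deque>& deques, std::atomic<bool>& done) {
    std::mt19937 rng(self + 1);
    std::uniform_int_distribution<int> pick(0, (int)deques.size() - 1);
    while (!done) {
        if (auto t = deques[self].pop_bottom()) {
            (*t)();                     // run it; it may push newly enabled threads
        } else {
            int victim = pick(rng);     // empty deque: become a thief
            if (victim == self) continue;
            if (auto t = deques[victim].pop_top()) (*t)();
        }
    }
}

int main() {
    std::vector<Deque> deques(4);
    std::atomic<bool> done{false};
    deques[0].push_bottom([&done] { done = true; });  // a single root thread
    std::vector<std::thread> procs;
    for (int p = 0; p < 4; ++p)
        procs.emplace_back(worker, p, std::ref(deques), std::ref(done));
    for (auto& t : procs) t.join();
}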
10
Non-Blocking Implementation
  • The non-blocking work stealer is an implementation of work stealing with the following two features.
  • The deques are implemented with non-blocking synchronization.
  • Instead of locks, atomic load-test-store machine instructions are used. Examples: load-linked/store-conditional and compare-and-swap.
  • There exists a constant c (roughly 10) such that if a process performs a deque operation, then after executing c instructions, some process has succeeded in performing a deque operation.
  • Each process, between consecutive steal attempts, performs a yield system call.
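For illustration, here is a minimal sketch of the thief's compare-and-swap path over a simplified array-based deque. It is an assumption-laden toy, not the deck's implementation: the real Arora-Blumofe-Plaxton deque also packs a tag into an "age" word to avoid ABA problems and synchronizes the owner's pop of the last remaining element, both omitted here.

#include <atomic>

struct NBDeque {
    static constexpr int CAP = 1024;
    int tasks[CAP] = {};        // stand-in for thread descriptors
    std::atomic<int> top{0};    // thieves advance this with CAS
    std::atomic<int> bot{0};    // the owner pushes and pops at the bottom

    // Thief side: read top, check for emptiness, then claim the top-most
    // entry with one compare-and-swap. If the CAS fails, some other deque
    // operation succeeded in the meantime -- the non-blocking guarantee.
    bool steal(int& out) {
        int t = top.load();
        if (t >= bot.load()) return false;  // deque appears empty
        out = tasks[t];
        return top.compare_exchange_strong(t, t + 1);
    }
};

int main() {
    NBDeque d;
    d.tasks[0] = 42;
    d.bot = 1;
    int out;
    bool ok = d.steal(out);   // single-threaded exercise of the CAS path
    (void)ok;
}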

11
Implementation Status
  • The non-blocking work stealer has been implemented in the Hood C++ threads library.
  • Hood is instrumented to measure work and
    critical-path length.
  • Runs on Sun machines with SPARC v9 processors
    running Solaris.
  • Ports underway for SGI machines with MIPS
    processors running IRIX and x86-based machines
    running Windows NT.

Implementations for Cilk and Java are in the
planning stages.
12
Outline
  • The non-blocking work stealer
  • Algorithmic analysis
  • Lower bounds
  • Analysis of non-blocking work stealer
  • Empirical analysis
  • Related work
  • Conclusion

13
Dag Model
A multithreaded computation is modeled as a dag
(directed acyclic graph).
  • Each node represents one executed instruction and
    takes one time unit to execute.
  • We assume a single source node and out-degree at
    most 2.
  • The work T1 is the number of nodes. The critical-path length T∞ is the length of a longest (directed) path.
  • A node is ready if all of its ancestors have been
    executed. Only ready nodes can be executed.

14
Lower Bounds
At each time step i = 1, 2, …, T, the kernel chooses to schedule any subset of the P processes, and those scheduled processes execute one instruction. Let pi denote the number of processes scheduled at step i.
The processor average is defined by PA = (1/T) Σi pi, summing over i = 1, …, T.
The execution time is given by T = (Σi pi)/PA.
  • T ≥ T1/PA, because Σi pi ≥ T1.
  • T ≥ T∞P/PA, because the kernel can force Σi pi ≥ T∞P.
  • There must be at least T∞ steps i with pi ≠ 0, and for each such step, the kernel can schedule pi = P processes.
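In equation form, both lower bounds follow from the identity PA·T = Σi pi:

% Both lower bounds come from bounding the total work the kernel must schedule.
\begin{align*}
  P_A T \;=\; \sum_{i=1}^{T} p_i \;\ge\; T_1
    \;\Longrightarrow\; T \ge T_1 / P_A, \\
  P_A T \;=\; \sum_{i=1}^{T} p_i \;\ge\; T_\infty P
    \;\Longrightarrow\; T \ge T_\infty P / P_A,
\end{align*}
% where the second sum bound holds because an adversarial kernel can schedule
% all P processes at each of the at least T_\infty steps with p_i \ne 0.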

15
Greedy Schedules
A schedule is greedy if at each step i, the number of nodes executed is equal to the minimum of pi and the number of ready nodes.
Theorem: Any greedy schedule has length at most T1/PA + T∞P/PA.
Proof: We prove that Σi pi ≤ T1 + T∞P. At each step, each scheduled process pays one token.
  • If the process executes a node, then it places a token in the work bucket. Execution ends with T1 tokens in the work bucket.
  • Otherwise, the process places a token in the idle bucket. There are at most T∞ steps at which a process places a token in the idle bucket, and at each such step at most P tokens are placed in the idle bucket.
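The theorem then follows by dividing the token count by PA:

\begin{align*}
  P_A T \;=\; \sum_{i=1}^{T} p_i \;\le\; T_1 + T_\infty P
  \quad\Longrightarrow\quad
  T \;\le\; \frac{T_1}{P_A} + \frac{T_\infty P}{P_A}.
\end{align*}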
16
Analysis of Non-Blocking Work Stealer
Theorem: The non-blocking work stealer runs in expected time O(T1/PA + T∞P/PA).
Proof sketch: Let S denote the number of steal attempts. We prove that Σi pi = O(T1 + S) and E[S] = O(T∞P). At each step, each scheduled process pays one token.
  • If the process is working, then it places a token in the work bucket. Execution ends with O(T1) tokens in the work bucket.
  • Otherwise, the process places a token in the steal bucket. Execution ends with O(S) tokens in the steal bucket.
17
Enabling Tree
  • An edge (u,v) is an enabling edge if the
    execution of u made v ready. Node u is the
    designated parent of v.
  • The enabling edges form an enabling tree.

18
Structural Lemma
For any deque, at all times during the execution
of the work-stealing algorithm, the designated
parents of the nodes in the deque lie on a
root-to-leaf path in the enabling tree.
  • Consider any process at any time during the execution.
  • v0 is the ready node of the thread that is being executed.
  • v1, v2, …, vk are the ready nodes of the threads in the process's deque, ordered from bottom to top.
  • For i = 0, 1, …, k, node ui is the designated parent of vi.
  • Then for i = 1, 2, …, k, node ui is an ancestor of ui-1 in the enabling tree.

19
Analysis of Steal Attempts
We use a potential function to bound the number
of steal attempts.
At each step i, each ready node u has potential φi(u) = 3^(T∞ − d(u)), where d(u) is the depth of u in the enabling tree.
The potential Φi at step i is the sum of all ready-node potentials.
  • The deques are top-heavy: the top-most node contributes a constant fraction of its deque's potential.
  • With constant probability, P steal attempts cause the potential to decrease by a constant fraction.
  • The initial potential is Φ0 = 3^T∞, and it never increases.
  • The expected number of steal attempts until the potential decreases to 0 is O(T∞P).
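A sketch of the arithmetic behind the last bullet (the constants here are placeholders for those in the formal proof):

% Suppose each round of P steal attempts reduces the potential to at most a
% (1 - r) fraction, with at least constant probability, for some constant
% 0 < r < 1. Starting from \Phi_0 = 3^{T_\infty}, and since a nonzero
% potential is at least 1, the expected number of rounds is
\begin{align*}
  O\!\left( \log_{1/(1-r)} 3^{T_\infty} \right) \;=\; O(T_\infty),
\end{align*}
% and at P steal attempts per round this gives E[S] = O(T_\infty P).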

20
Outline
  • The non-blocking work stealer
  • Algorithmic analysis
  • Empirical analysis
  • Yields in practice
  • Performance modeling
  • Related work
  • Conclusion

21
Performance Without Yield
Speedup measured on an 8-processor Sun Ultra Enterprise 5000.
[Plot: speedup (1 to 8) versus number of processes (1 to 32).]
22
Yield in Practice
[Bar chart: normalized execution time (0 to 3) for lu, barnes, and heat with 8 and 16 processes; the bars distinguish time spent executing nodes from time when the process about to execute a node is swapped out.]
Processes spin making steal attempts, but all
deques are empty.
23
Performance Model
Execution time: T = c1T1/PA + c2T∞P/PA.
Utilization: the fraction T1/(PA·T) of the computation's scheduled processor time spent doing work.
The ratio P/(T1/T∞) is the normalized number of processes.
For all multithreaded applications and all input problems, the utilization can be lower bounded as a function of one number, the normalized number of processes.
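Dividing through makes the claim explicit; from the model T = c1T1/PA + c2T∞P/PA:

\begin{align*}
  \text{utilization} \;=\; \frac{T_1}{P_A T}
    \;=\; \frac{T_1}{c_1 T_1 + c_2 T_\infty P}
    \;=\; \frac{1}{\,c_1 + c_2 \cdot P/(T_1/T_\infty)\,},
\end{align*}
% a function only of the normalized number of processes P/(T_1/T_\infty).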
We test this claim with a synthetic application,
knary, that produces a wide range of work and
critical-path lengths for different inputs.
24
Knary Utilization
Utilization measured on an 8-processor Sun Ultra Enterprise 5000.
No other program is running, so PA = min{8, P}.
[Plot: utilization (0.1 to 1) versus normalized number of processes (1e-5 to 100, log scale).]
25
Application Utilization
Utilization measured on an 8-processor Sun Ultra Enterprise 5000.
No other program is running, so PA = min{8, P}.
[Plot: utilization (0.1 to 1) versus normalized number of processes (1e-5 to 100, log scale).]
26
Varying Number of Processors
To test the model when the number of processors
varies over time, we run the test applications
concurrently with a synthetic application, cycler.
  • Repeatedly, cycler creates a random number of
    processes, each of which runs for a random amount
    of time.
  • Each process repeatedly increments a shared
    counter.
  • At regular intervals, the counter value and a
    timestamp are written to a buffer.
  • For any time interval, we can look at the counter
    values at the start and end to determine the
    processor average PA(cycler) for cycler over that
    interval.
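A minimal sketch of the counter-based measurement idea (the names and the calibration constant below are illustrative assumptions; the real cycler also randomizes process counts and lifetimes):

// Runnable processes continuously bump a shared counter, so the counter's
// growth rate over an interval is proportional to the average number of
// processors cycler actually held during that interval.
#include <atomic>
#include <chrono>
#include <thread>
#include <vector>

std::atomic<long long> counter{0};
std::atomic<bool> stop{false};

void cycler_process() {
    while (!stop) counter.fetch_add(1, std::memory_order_relaxed);
}

int main() {
    std::vector<std::thread> procs;
    for (int i = 0; i < 3; ++i) procs.emplace_back(cycler_process);

    using clock = std::chrono::steady_clock;
    auto t0 = clock::now();
    long long c0 = counter;                       // counter at interval start
    std::this_thread::sleep_for(std::chrono::seconds(1));
    long long c1 = counter;                       // counter at interval end
    double secs = std::chrono::duration<double>(clock::now() - t0).count();

    // increments_per_proc_sec would be calibrated on an otherwise idle machine.
    const double increments_per_proc_sec = 1e8;   // illustrative constant
    double PA_cycler = (c1 - c0) / (secs * increments_per_proc_sec);
    (void)PA_cycler;

    stop = true;
    for (auto& t : procs) t.join();
}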

27
Knary Utilization
Utilization measured on an 8-processor Sun Ultra Enterprise 5000.
Cycler is also running, so PA = min{8 - PA(cycler), P}.
[Plot: utilization (0.1 to 1) versus normalized number of processes (1e-5 to 100, log scale).]
28
Application Utilization
Utilization measured on an 8-processor Sun Ultra Enterprise 5000.
Cycler is also running, so PA = min{8 - PA(cycler), P}.
[Plot: utilization (0.1 to 1) versus normalized number of processes (1e-5 to 100, log scale).]
29
Outline
  • The non-blocking work stealer
  • Algorithmic analysis
  • Empirical analysis
  • Related work
  • Coscheduling
  • Process control
  • Conclusion

30
Coscheduling
With coscheduling (gang scheduling), all of a computation's processes are scheduled to run in parallel.
  • For some computation mixes, coscheduling is not effective. Example: a computation with 4 processes and a computation with 1 process on a 4-processor machine.
  • Resource-intensive computations that do not scale down may require coscheduling for high performance. Example: data-parallel programs with large working sets.

31
Process Control
With process control, each computation creates
and kills processes dynamically so that it always
runs with a number of processes equal to the
number of processors assigned to it.
Process control and the non-blocking work stealer
complement each other.
  • With work stealing, a new process can be created
    at any time, and a process can be killed when its
    deque is empty.
  • With the non-blocking work stealer, there is
    little penalty for operating with more processes
    than processors.
  • Process control can help keep P close to PA.

32
Outline
  • The non-blocking work stealer
  • Algorithmic analysis
  • Empirical analysis
  • Related work
  • Conclusion

33
Summary of Results
The non-blocking work stealer is a user-level
thread scheduler that is efficient in theory and
in practice.
Theory: E[T] = O(T1/PA + T∞P/PA).
Practice: T ≈ T1/PA + T∞P/PA.
  • These results hold even when the number of
    processes exceeds the number of processors or
    when the number of processors grows and shrinks
    over time.
  • These results hold with no need for
    non-commercial operating-system support, such
    as coscheduling or process control.

34
Unsubstantiated Claims
  • The non-blocking work stealer works well in
    multiprogrammed environments.
  • Our experiments did not use true
    multiprogrammed workloads.
  • Our analysis and experiments focus exclusively on
    processor resources.
  • The non-blocking work stealer can be used in
    multimedia and real-time applications.
  • If the kernel delivers quality-of-service
    guarantees in terms of PA, then the non-blocking
    work stealer can translate this guarantee into a
    guarantee on execution time.

35
Turnkey Applications
Turnkey parallel applications must select the
number of processes automatically. Users may not
know how many processors are in their computer.
With the non-blocking work stealer, applications can automatically create a number of processes equal to the number of processors.