Title: Automatic Thread Extraction with Decoupled Software Pipelining
1. Automatic Thread Extraction with Decoupled Software Pipelining
- Presented by
- Jeremy Cutler
- with thanks to
- Guilherme Ottoni, Ram Rangan, Adam Stoler, David I. August
- Liberty Research Group
- Department of Computer Science
- Princeton University
2. A Fundamental Change
- Transistor trend continues
- Clock rate limited by
- Power delivery
- Heat dissipation
- Design complexity
Source: Intel, Wikipedia, Sutter/Dr. Dobb's Journal
3. The Response: CMP
- For:
  - legacy apps (C/C++)
  - single-threaded
  - sequential codes
- Speedup over single core: 0.0
- Worse:
  - Shared resources (e.g. caches)
  - Simple cores trend
- Must Extract Thread Parallelism!
IBM Power 5 (1.9 GHz) Die Photo. Source: IBM
4. Existing Parallelization Approaches (Non-speculative)
Scientific Codes (FORTRAN-like): DOALL
for (i = 1; i <= N; i++)       // C
    a[i] = a[i] + 1;           // X
General-purpose Codes (legacy C/C++): DOACROSS [Cytron, ICPP 86]
while (ptr = ptr->next)        // LD
    ptr->val = ptr->val + 1;   // X
5. Pipelined Parallelism for General-Purpose Codes
while (ptr = ptr->next)        // LD
    ptr->val = ptr->val + 1;   // X
[Figure: execution schedules of the loop under DOACROSS and under Decoupled Software Pipelining (DSWP)]
DSWP: generalization of DOPIPE [Davies, UIUC 81]
6. Comparison: DOALL, DOACROSS, DSWP
lat(comm)   DOALL          DOACROSS         DSWP
1           1 iter/cycle   1 iter/cycle     1 iter/cycle
2           1 iter/cycle   0.5 iter/cycle   1 iter/cycle
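A rough cycle-count model of this comparison for the linked-list loop, as a sketch only. It assumes two cores, one cycle each for LD and X, and lat(comm) >= 1; the function names and the model are illustrative assumptions and do not come from the slides. Under DOACROSS the LD-to-LD dependence crosses cores every iteration, so successive iterations start lat(comm) cycles apart. Under DSWP the dependence stays on one core and the latency is paid once, while the pipeline fills.

```c
#include <assert.h>

/* Cycles to finish n iterations under DOACROSS (assumed model):
 * iteration k's LD starts at cycle k * lat, and the last iteration
 * needs two more cycles for its own LD and X. */
long doacross_cycles(long n, long lat)
{
    return (n - 1) * lat + 2;
}

/* Cycles to finish n iterations under DSWP (assumed model):
 * thread 1 issues one LD per cycle; thread 2 runs X for iteration k
 * at cycle k + lat, so the latency only delays the pipeline fill. */
long dswp_cycles(long n, long lat)
{
    return n + lat;
}

/* Average throughput in iterations per cycle. */
double iters_per_cycle(long n, long cycles)
{
    return (double)n / (double)cycles;
}
```

For n = 1000 and lat(comm) = 2 this model gives 2000 cycles for DOACROSS (0.5 iter/cycle) and 1002 cycles for DSWP (close to 1 iter/cycle), matching the rates on the slide.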
7. Implementing Decoupled Software Pipelining (DSWP)
while (ptr = ptr->next)
    ptr->val = ptr->val + 1;
[Figure: the loop, its dependence graph, the resulting DAG_SCC, and its partition into Thread 1 and Thread 2]
Inter-thread communication latency is a one-time cost
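A minimal software sketch of the two-thread split for this loop, assuming POSIX threads. The names produce, consume, queue, QCAP, and dswp_increment are hypothetical, and the mutex-protected queue only stands in for whatever inter-thread channel a real implementation would use. Thread 1 keeps the pointer-chasing recurrence and the loop exit test; thread 2 performs the update and learns about loop exit from a NULL sentinel.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Linked-list node matching the loop on the slide. */
typedef struct node { int val; struct node *next; } node;

#define QCAP 32  /* queue capacity: an arbitrary choice for this sketch */

typedef struct {
    node *slot[QCAP];
    int head, tail, count;
    pthread_mutex_t mu;
    pthread_cond_t not_full, not_empty;
} queue;

/* produce: block while the queue is full, then enqueue one pointer. */
static void produce(queue *q, node *p)
{
    pthread_mutex_lock(&q->mu);
    while (q->count == QCAP)
        pthread_cond_wait(&q->not_full, &q->mu);
    q->slot[q->tail] = p;
    q->tail = (q->tail + 1) % QCAP;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->mu);
}

/* consume: block while the queue is empty, then dequeue one pointer. */
static node *consume(queue *q)
{
    pthread_mutex_lock(&q->mu);
    while (q->count == 0)
        pthread_cond_wait(&q->not_empty, &q->mu);
    node *p = q->slot[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->mu);
    return p;
}

typedef struct { queue *q; node *start; } stage_arg;

/* Thread 1: the traversal (LD) and the loop exit test. */
static void *traverse_stage(void *arg)
{
    stage_arg *a = arg;
    node *ptr = a->start;
    while ((ptr = ptr->next))
        produce(a->q, ptr);
    produce(a->q, NULL);  /* NULL forwards the loop exit to thread 2 */
    return NULL;
}

/* Thread 2: the update (X), driven only by pointers received from thread 1. */
static void *update_stage(void *arg)
{
    stage_arg *a = arg;
    node *ptr;
    while ((ptr = consume(a->q)))
        ptr->val = ptr->val + 1;
    return NULL;
}

/* Run the loop as a two-stage pipeline; start must be non-NULL.
 * Returns 0 on success, -1 if the first stage could not be started. */
int dswp_increment(node *start)
{
    queue q = { .head = 0, .tail = 0, .count = 0 };
    stage_arg a = { &q, start };
    pthread_t t1;

    pthread_mutex_init(&q.mu, NULL);
    pthread_cond_init(&q.not_full, NULL);
    pthread_cond_init(&q.not_empty, NULL);
    if (pthread_create(&t1, NULL, traverse_stage, &a) != 0)
        return -1;
    update_stage(&a);  /* the calling thread acts as thread 2 */
    pthread_join(t1, NULL);
    pthread_cond_destroy(&q.not_empty);
    pthread_cond_destroy(&q.not_full);
    pthread_mutex_destroy(&q.mu);
    return 0;
}
```

Data flows only from thread 1 to thread 2, so thread 1 never waits on thread 2 unless the queue fills. That one-directional flow is why the communication latency shows up once rather than on every iteration.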
8. Implementing Inter-Thread Control Dependences: Node Splitting
[Figure: node-splitting example, with labels L1 and L2]
9. Handling Arbitrary Control Flow: Control Extensions to Dependence Graph
[Figure: CFG]
- Loop-iteration control dependences
  - Traditional definition of control dependence [Ferrante et al., TOPLAS 87] not appropriate for loops
- Conditional control dependences
  - To implement inter-thread data dependences that may or may not occur
- Multi-threaded code generation from the extended dependence graph
10. Evaluation
- DSWP implemented in the back-end of IMPACT compiler
- Accurate dual-core Itanium 2 model
- Synchronization Array support for comm./sync.
- ISA extended with produce/consume instructions
- Important application loops selected (16-98% of total execution)
11. Evaluation
12. Partitioning and Parallelism: 181.mcf
[Figure: DAG_SCC partitions for 181.mcf, each with a plot of queue occupancy (elements, 0 to 32) over time (cycles). Speedup labels shown: 45, 48, 43, -2 (units not preserved)]
Currently use a simple load-balancing heuristic
13. Evaluation: Varying Processor Width
- Modified, half-width Itanium 2 models
[Figure: performance of 1-core and 2-core (used by DSWP) configurations on full-width and half-width models]
- On half-width model, speedup from DSWP is larger
- Better performance compatibility
- More effective on simpler cores
14. What about more threads?
while (ptr = ptr->next)
    ptr->val = ptr->val + 1;
[Figure: dependence graph of the loop]
1. Multiple SCCs
2. DOALL Consumer (one Producer feeding Consumer 1 and Consumer 2)
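Since each dependence cycle has to stay within one thread, the number of SCCs in the loop's dependence graph limits how many pipeline stages can be formed. The sketch below counts SCCs for a three-node abstraction of the loop (LD, the exit branch BR, and the update X). That graph is an assumption made for illustration, not the graph drawn on the slide, and the labeling uses Warshall's transitive closure for brevity rather than a linear-time SCC algorithm. MAXN, scc_label, and example_scc_count are hypothetical names.

```c
#include <assert.h>
#include <string.h>

#define MAXN 16  /* illustrative upper bound on graph size */

/* Label every node with an SCC id and return the number of SCCs.
 * adj[u][v] != 0 means there is a dependence edge from u to v.
 * Two nodes share an SCC when each can reach the other. */
int scc_label(int n, unsigned char adj[MAXN][MAXN], int scc_of[MAXN])
{
    unsigned char reach[MAXN][MAXN];
    int count = 0;

    memcpy(reach, adj, sizeof reach);
    for (int u = 0; u < n; u++)
        reach[u][u] = 1;  /* every node reaches itself */

    /* Warshall's transitive closure, O(n^3). */
    for (int k = 0; k < n; k++)
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (reach[i][k] && reach[k][j])
                    reach[i][j] = 1;

    for (int u = 0; u < n; u++)
        scc_of[u] = -1;
    for (int u = 0; u < n; u++) {
        if (scc_of[u] != -1)
            continue;  /* already placed in an earlier SCC */
        for (int v = u; v < n; v++)
            if (reach[u][v] && reach[v][u])
                scc_of[v] = count;
        count++;
    }
    return count;
}

/* Assumed three-node abstraction of the linked-list loop. */
enum { LD, BR, X, NUM_NODES };

int example_scc_count(int scc_of[MAXN])
{
    unsigned char adj[MAXN][MAXN] = {{0}};

    adj[LD][LD] = 1;  /* loop-carried: the next pointer depends on the current one */
    adj[LD][BR] = 1;  /* the exit test reads ptr */
    adj[BR][LD] = 1;  /* control: the next LD runs only if the loop continues */
    adj[LD][X]  = 1;  /* X uses ptr */
    adj[BR][X]  = 1;  /* control: X runs only if the loop continues */
    return scc_label(NUM_NODES, adj, scc_of);
}
```

On this graph LD and BR fall into one SCC and X into another, so the loop offers two stages. A loop whose whole body forms a single SCC leaves nothing to pipeline unless a dependence is broken.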
15. Breaking SCCs: Speculative DSWP
164.gzip: 38% speedup with 3 threads
Only one SCC!
181.mcf: 2.9x speedup with 4 threads
[Figure: speculative pipeline schedule, with a point marked "Mis-speculation detected"]
16. Conclusion
- DSWP extracts pipelined thread-level parallelism from general-purpose, sequential programs
- More applicable than traditional parallelization techniques
- Handles arbitrary control flow
- Future research directions
  - Additional analyses and optimizations
  - Break dependence cycles: code transformations, speculation
  - Reduce communication
  - Explore DOALL-consumer opportunities