Automatic Thread Extraction with Decoupled Software Pipelining

About This Presentation

Slides: 17

Provided by: eecgTo

Learn more at: https://www.eecg.toronto.edu

Transcript and Presenter's Notes

1
Automatic Thread Extraction with Decoupled Software Pipelining

Presented by
Jeremy Cutler
with thanks to
Guilherme Ottoni, Ram Rangan, Adam Stoler, David I. August
Liberty Research Group
Department of Computer Science
Princeton University

2
A Fundamental Change

Transistor trend continues
Clock rate limited by
Power delivery
Heat dissipation
Design complexity

Source: Intel, Wikipedia, Sutter/Dr. Dobb's Journal
3
The Response: CMP

For legacy apps (C/C++): single-threaded sequential codes
[Chart: speedup over single core, with axis label 0.0 and a region labeled "Worse"]
Shared resources (e.g. caches)
Simple cores trend
Must Extract Thread Parallelism!

IBM Power 5 (1.9 GHz) die photo. Source: IBM
4
Existing Parallelization Approaches
(Non-speculative)
Scientific Codes (FORTRAN-like)
  for(i=1; i<=N; i++)          // C
    a[i] = a[i] + 1;           // X
General-purpose Codes (legacy C/C++)
  while(ptr = ptr->next)       // LD
    ptr->val = ptr->val + 1;   // X
DOACROSS [Cytron, ICPP 86]
DOALL
5
Pipelined Parallelism for General-Purpose Codes
  while(ptr = ptr->next)       // LD
    ptr->val = ptr->val + 1;   // X
DOACROSS
Decoupled Software Pipelining (DSWP)
Generalization of DOPIPE [Davies, UIUC 81]
6
Comparison: DOALL, DOACROSS, DSWP

                 DOALL          DOACROSS         DSWP
lat(comm) = 1    1 iter/cycle   1 iter/cycle     1 iter/cycle
lat(comm) = 2    1 iter/cycle   0.5 iter/cycle   1 iter/cycle
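
The throughput figures in the comparison can be reproduced with a small timing model. The sketch below is an illustration under stated assumptions, not the model used in the talk: a value issued at cycle t is usable on the same core at cycle t + 1 and on the other core at cycle t + lat(comm), each statement takes one cycle, and resource conflicts are ignored. The function names are hypothetical. Under DOACROSS the pointer-chasing load (LD) hops between cores on every iteration, so the communication latency sits on the critical path. Under DSWP the LD recurrence stays on one thread and only the fill of the pipeline pays the latency.

```c
#include <assert.h>

/* Steady-state iterations per cycle for the linked-list loop under DOACROSS.
 * Assumption: the LD of iteration i needs the pointer produced by the LD of
 * iteration i-1 on the other core, lat_comm cycles after that LD issued. */
double doacross_throughput(long n, long lat_comm) {
    assert(n >= 2 && lat_comm >= 1);
    long ld = 0, first_x = 0, last_x = 0;
    for (long i = 0; i < n; i++) {
        if (i > 0) ld += lat_comm;   /* recurrence crosses cores every iteration */
        last_x = ld + 1;             /* X follows its LD on the same core */
        if (i == 0) first_x = last_x;
    }
    return (double)(n - 1) / (double)(last_x - first_x);
}

/* Same loop under DSWP: all LDs run on thread 1, all Xs on thread 2. */
double dswp_throughput(long n, long lat_comm) {
    assert(n >= 2 && lat_comm >= 1);
    long ld = 0, first_x = 0, last_x = 0;
    for (long i = 0; i < n; i++) {
        if (i > 0) ld += 1;          /* recurrence stays local: one LD per cycle */
        long x = ld + lat_comm;      /* the pointer crosses to thread 2 */
        if (i > 0 && x < last_x + 1) x = last_x + 1;  /* thread 2 issues one X per cycle */
        last_x = x;
        if (i == 0) first_x = x;
    }
    return (double)(n - 1) / (double)(last_x - first_x);
}
```

With this model, doacross_throughput(1000, 2) evaluates to 0.5 while dswp_throughput(1000, 2) stays at 1.0, matching the lat(comm) = 2 case in the comparison.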
7
Implementing Decoupled Software Pipelining (DSWP)
  while(ptr = ptr->next) ptr->val = ptr->val + 1;
[Figure: the loop, its dependence graph, and the DAG_SCC, partitioned into Thread 1 and Thread 2]
Inter-thread communication latency is a one-time cost
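
The steps on this slide (dependence graph, then DAG_SCC, then threads) can be sketched as follows. This is a minimal illustration, not the IMPACT implementation: Tarjan's algorithm is used as one standard way to find strongly connected components, the fixed-size DepGraph struct is hypothetical, and thread_of applies a naive contiguous split rather than the load-balancing heuristic mentioned later in the talk. Because every dependence cycle is kept inside a single SCC, and the SCCs are cut in topological order, all inter-thread dependences flow in one direction.

```c
#include <assert.h>

#define MAX_NODES 16

typedef struct {
    int n;                               /* number of statements (nodes) */
    int edge[MAX_NODES][MAX_NODES];      /* edge[u][v] != 0: v depends on u */
    int index[MAX_NODES], low[MAX_NODES], on_stack[MAX_NODES];
    int stack[MAX_NODES], sp, next_index;
    int scc[MAX_NODES], scc_count;       /* scc[v]: SCC id of node v */
} DepGraph;

/* Tarjan's depth-first visit; SCCs are numbered in reverse topological order. */
static void visit(DepGraph *g, int u) {
    g->index[u] = g->low[u] = g->next_index++;
    g->stack[g->sp++] = u;
    g->on_stack[u] = 1;
    for (int v = 0; v < g->n; v++) {
        if (!g->edge[u][v]) continue;
        if (g->index[v] < 0) {
            visit(g, v);
            if (g->low[v] < g->low[u]) g->low[u] = g->low[v];
        } else if (g->on_stack[v] && g->index[v] < g->low[u]) {
            g->low[u] = g->index[v];
        }
    }
    if (g->low[u] == g->index[u]) {      /* u is the root of an SCC */
        int v;
        do {
            v = g->stack[--g->sp];
            g->on_stack[v] = 0;
            g->scc[v] = g->scc_count;
        } while (v != u);
        g->scc_count++;
    }
}

void find_sccs(DepGraph *g) {
    assert(g->n >= 0 && g->n <= MAX_NODES);
    g->sp = g->next_index = g->scc_count = 0;
    for (int u = 0; u < g->n; u++) { g->index[u] = -1; g->on_stack[u] = 0; }
    for (int u = 0; u < g->n; u++)
        if (g->index[u] < 0) visit(g, u);
}

/* Pipeline stage of node v when the topologically ordered SCCs are cut into
 * num_threads contiguous groups (a naive split, assumed for illustration). */
int thread_of(const DepGraph *g, int v, int num_threads) {
    int topo_pos = g->scc_count - 1 - g->scc[v];   /* sources first */
    return topo_pos * num_threads / g->scc_count;
}
```

For the loop on this slide, take node 0 as LD (ptr = ptr->next) with a loop-carried edge to itself and a data edge to node 1, X (the update of ptr->val). After find_sccs, the graph has two SCCs, and thread_of places LD on thread 0 and X on thread 1.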
8
Implementing Inter-Thread Control Dependences: Node Splitting
[Figure: node splitting, with nodes labeled L1 and L2]
9
Handling Arbitrary Control Flow: Control Extensions to Dependence Graph
[Figure: CFG]

Loop-iteration control dependences
Traditional definition of control dependence [Ferrante et al., TOPLAS 87] is not appropriate for loops
Conditional control dependences
To implement inter-thread data dependences that may or may not occur
Multi-threaded code generation from the extended dependence graph

10
Evaluation

DSWP implemented in the back-end of the IMPACT compiler
Accurate dual-core Itanium 2 model
Synchronization Array support for comm./sync.
ISA extended with produce/consume instructions
Important application loops selected (16-98% of total execution)
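
To make the produce/consume idea concrete, here is a hedged sketch of how the linked-list loop might look once split into two stages that talk through a queue. It is a single-threaded simulation that interleaves the two stages deterministically. In the evaluated system the stages run on separate cores and the queue is the hardware Synchronization Array. The produce and consume functions below are software stand-ins for the ISA instructions, and QUEUE_SLOTS, Queue and dswp_increment are hypothetical names. A NULL pointer is forwarded through the queue so that the second stage learns when the loop exits, which is how this sketch carries the loop's control dependence across threads.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Node { int val; struct Node *next; } Node;

#define QUEUE_SLOTS 8   /* illustrative capacity, not taken from the slides */

typedef struct {
    Node *slot[QUEUE_SLOTS];
    size_t head, tail, count;
} Queue;

/* Stand-in for a "produce" instruction; returns 0 if the queue is full. */
static int produce(Queue *q, Node *n) {
    if (q->count == QUEUE_SLOTS) return 0;
    q->slot[q->tail] = n;
    q->tail = (q->tail + 1) % QUEUE_SLOTS;
    q->count++;
    return 1;
}

/* Stand-in for a "consume" instruction; returns 0 if the queue is empty. */
static int consume(Queue *q, Node **out) {
    if (q->count == 0) return 0;
    *out = q->slot[q->head];
    q->head = (q->head + 1) % QUEUE_SLOTS;
    q->count--;
    return 1;
}

/* Same effect as: while(ptr = ptr->next) ptr->val = ptr->val + 1; */
void dswp_increment(Node *head) {
    Queue q = {0};
    Node *ptr = head;
    int stage1_done = 0, stage2_done = 0;
    assert(head != NULL);
    while (!stage2_done) {
        /* One step of "thread 1": the LD recurrence and the loop exit test. */
        if (!stage1_done && q.count < QUEUE_SLOTS) {
            ptr = ptr->next;
            (void)produce(&q, ptr);          /* NULL doubles as the exit signal */
            if (ptr == NULL) stage1_done = 1;
        }
        /* One step of "thread 2": the X statement. */
        Node *item;
        if (consume(&q, &item)) {
            if (item == NULL) stage2_done = 1;
            else item->val = item->val + 1;
        }
    }
}
```

As in the original loop, the node passed in as head is not updated; only the nodes after it are.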

11
Evaluation
12
Partitioning and Parallelism: 181.mcf
[Figure: DAG_SCC partitionings of the 181.mcf loop, each with a plot of queue occupancy (elements) over time (cycles) and a speedup label; numeric labels that survive: 32, 45, 0, 48, 43, -2]
Currently use a simple load-balancing heuristic
13
Evaluation: Varying Processor Width

Modified, half-width Itanium 2 models

[Chart: 1-Core vs. 2-Core (used by DSWP), for Full-width and Half-width models]

On half-width model, speedup from DSWP is larger
Better performance compatibility
More effective on simpler cores

14
What about more threads?
  while(ptr = ptr->next) ptr->val = ptr->val + 1;
1. Multiple SCCs
2. DOALL Consumer
[Figure: dependence graph with a Producer feeding Consumer 1 and Consumer 2]
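
The DOALL-consumer option can be sketched for the same loop. The assumption, which holds when every list node is distinct, is that the X statements of different iterations are independent, so the consumer stage can be replicated. The sketch below deals nodes to the consumers round-robin. The round-robin policy, the WorkList type and the function names are illustrative choices, not details from the slides, and the consumers are plain functions here rather than real threads.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Node { int val; struct Node *next; } Node;

#define NUM_CONSUMERS 2
#define MAX_ITEMS 64    /* illustrative per-consumer capacity */

typedef struct { Node *item[MAX_ITEMS]; size_t count; } WorkList;

/* Producer stage: walk the list (the sequential LD recurrence) and deal the
 * nodes to the consumers round-robin. Returns the number of iterations. */
size_t distribute(Node *head, WorkList lists[NUM_CONSUMERS]) {
    size_t iter = 0;
    assert(head != NULL);
    for (Node *ptr = head->next; ptr != NULL; ptr = ptr->next) {
        WorkList *w = &lists[iter % NUM_CONSUMERS];
        assert(w->count < MAX_ITEMS);
        w->item[w->count++] = ptr;
        iter++;
    }
    return iter;
}

/* Consumer stage: the X statement, applied to this consumer's share only. */
void run_consumer(WorkList *w) {
    for (size_t k = 0; k < w->count; k++)
        w->item[k]->val = w->item[k]->val + 1;
}
```

Each consumer touches a disjoint set of nodes, which is what would allow the two run_consumer calls to execute concurrently on a real multi-core system.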
15
Breaking SCCs: Speculative DSWP
164.gzip: 38% speedup with 3 threads
Only one SCC!
181.mcf: 2.9x speedup with 4 threads
[Figure: mis-speculation detected]
16
Conclusion

DSWP extracts pipelined thread-level parallelism from general-purpose, sequential programs
More applicable than traditional parallelization techniques
Handles arbitrary control flow
Future research directions
Additional analyses and optimizations
Break dependence cycles: code transformations, speculation
Reduce communication
Explore DOALL-consumer opportunities
