Automatic Thread Extraction with Decoupled Software Pipelining

Slides: 17
Provided by: eecgTo

Transcript and Presenter's Notes


1
Automatic Thread Extraction with Decoupled
Software Pipelining
  • Presented by
  • Jeremy Cutler
  • with thanks to
  • Guilherme Ottoni, Ram Rangan, Adam Stoler, David
    I. August
  • Liberty Research Group
  • Department of Computer Science
  • Princeton University

2
A Fundamental Change
  • Transistor trend continues
  • Clock rate limited by:
    • Power delivery
    • Heat dissipation
    • Design complexity

Source: Intel, Wikipedia, Sutter/Dr. Dobb's Journal
3
The Response: CMP
  • For legacy apps (C/C++), single-threaded, sequential codes:
    • Speedup over single core: 0.0
  • Worse:
    • Shared resources (e.g. caches)
    • Simple cores trend
  • Must Extract Thread Parallelism!

IBM Power 5 (1.9GHz) die photo. Source: IBM
4
Existing Parallelization Approaches (Non-speculative)
Scientific Codes (FORTRAN-like): DOALL
for (i = 1; i <= N; i++)     // C
  a[i] = a[i] + 1;           // X
General-purpose Codes (legacy C/C++): DOACROSS [Cytron, ICPP 86]
while (ptr = ptr->next)      // LD
  ptr->val = ptr->val + 1;   // X
5
Pipelined Parallelism for General-Purpose Codes
while (ptr = ptr->next)      // LD
  ptr->val = ptr->val + 1;   // X
DOACROSS
Decoupled Software Pipelining (DSWP)
Generalization of DOPIPE [Davies, UIUC 81]
6
Comparison: DOALL, DOACROSS, DSWP
                 DOALL          DOACROSS         DSWP
lat(comm) = 1:   1 iter/cycle   1 iter/cycle     1 iter/cycle
lat(comm) = 2:   1 iter/cycle   0.5 iter/cycle   1 iter/cycle
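The DOACROSS and DSWP throughput figures on this slide can be reproduced with a small timing model. The sketch below is an illustration under stated assumptions, not the presenters' methodology: each iteration is two one-cycle statements (LD, then X), lat(comm) = 1 costs the same as forwarding a value within a core, and DOACROSS alternates iterations between two cores. The function names doacross_rate and dswp_rate are hypothetical.

```c
#include <assert.h>

/* Both functions return steady-state iterations per cycle, measured
 * between the first and last completion of statement X. */

/* DOACROSS (assumed model): the pointer produced by LD crosses cores on
 * every iteration, so lat_comm sits on the loop-carried critical path. */
static double doacross_rate(int n, int lat_comm)
{
    assert(n >= 2 && lat_comm >= 1);
    long ld_issue = 0, first_done = 0, last_done = 0;
    for (int i = 0; i < n; i++) {
        if (i > 0)
            ld_issue += lat_comm;   /* wait for the pointer from the other core */
        last_done = ld_issue + 2;   /* LD, then X, on the same core */
        if (i == 0)
            first_done = last_done;
    }
    return (double)(n - 1) / (double)(last_done - first_done);
}

/* DSWP (assumed model): thread 1 keeps the LD recurrence local and issues
 * one LD per cycle; thread 2 runs X once the pointer has arrived. */
static double dswp_rate(int n, int lat_comm)
{
    assert(n >= 2 && lat_comm >= 1);
    long ld_issue = 0, x_issue = 0, first_done = 0, last_done = 0;
    for (int i = 0; i < n; i++) {
        if (i > 0)
            ld_issue += 1;                    /* recurrence stays on thread 1 */
        long ready = ld_issue + lat_comm;     /* pointer reaches thread 2 */
        x_issue = (i > 0 && x_issue + 1 > ready) ? x_issue + 1 : ready;
        last_done = x_issue + 1;
        if (i == 0)
            first_done = last_done;
    }
    return (double)(n - 1) / (double)(last_done - first_done);
}
```

Under these assumptions, doacross_rate(1000, 2) evaluates to 0.5 while dswp_rate(1000, 2) stays at 1.0: in the DSWP case the communication latency only delays the first X and does not lengthen the per-iteration period.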
7
Implementing Decoupled Software Pipelining (DSWP)
while (ptr = ptr->next) ptr->val = ptr->val + 1;
[Figure labels: Loop, Dependence Graph, DAG_SCC, Thread 1, Thread 2]
Inter-thread communication latency is a one-time cost
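The two-thread partition of this loop can be sketched in portable C. This is a minimal illustration, not the compiler's generated code: it assumes POSIX threads and emulates the inter-thread queue in software with a mutex-protected ring buffer. The names QCAP, produce, consume, traverse_stage, update_stage and dswp_increment are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define QCAP 32   /* hypothetical queue depth */

struct node { int val; struct node *next; };

/* Bounded FIFO standing in for the inter-thread queue. */
struct queue {
    struct node *slot[QCAP];
    int head, count;
    pthread_mutex_t mu;
    pthread_cond_t not_empty, not_full;
};

static void produce(struct queue *q, struct node *p)
{
    pthread_mutex_lock(&q->mu);
    while (q->count == QCAP)                 /* block while the queue is full */
        pthread_cond_wait(&q->not_full, &q->mu);
    q->slot[(q->head + q->count) % QCAP] = p;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->mu);
}

static struct node *consume(struct queue *q)
{
    pthread_mutex_lock(&q->mu);
    while (q->count == 0)                    /* block while the queue is empty */
        pthread_cond_wait(&q->not_empty, &q->mu);
    struct node *p = q->slot[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->mu);
    return p;
}

struct stage_args { struct node *head; struct queue *q; };

/* Thread 1: the pointer-chasing recurrence (LD) stays on one thread. */
static void *traverse_stage(void *arg)
{
    struct stage_args *a = arg;
    struct node *ptr = a->head;
    while ((ptr = ptr->next))
        produce(a->q, ptr);
    produce(a->q, NULL);                     /* NULL forwards the loop exit */
    return NULL;
}

/* Thread 2: the update (X) runs off the critical path. */
static void *update_stage(void *arg)
{
    struct stage_args *a = arg;
    struct node *ptr;
    while ((ptr = consume(a->q)))
        ptr->val = ptr->val + 1;
    return NULL;
}

/* Runs the loop as a two-stage pipeline. Like the original loop, it
 * advances before updating, so the head node itself is not modified. */
static void dswp_increment(struct node *head)
{
    struct queue q = { .head = 0, .count = 0 };
    struct stage_args a = { head, &q };
    pthread_t t1, t2;
    assert(head != NULL);
    pthread_mutex_init(&q.mu, NULL);
    pthread_cond_init(&q.not_empty, NULL);
    pthread_cond_init(&q.not_full, NULL);
    pthread_create(&t1, NULL, traverse_stage, &a);
    pthread_create(&t2, NULL, update_stage, &a);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    pthread_cond_destroy(&q.not_full);
    pthread_cond_destroy(&q.not_empty);
    pthread_mutex_destroy(&q.mu);
}
```

Values flow in one direction only, from the traversal thread to the update thread, so the traversal never waits on the consumer unless the queue fills. That one-way flow is why the communication latency is paid once, at pipeline fill, rather than on every iteration.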
8
Implementing Inter-Thread Control Dependences: Node Splitting
[Figure labels: L1, L2]
9
Handling Arbitrary Control Flow: Control Extensions to Dependence Graph
[Figure label: CFG]
  • Loop-iteration control dependences
    • Traditional definition of control dependence
      [Ferrante et al., TOPLAS 87] not appropriate for
      loops
  • Conditional control dependences
    • To implement inter-thread data dependences that
      may or may not occur
  • Multi-threaded code generation from the extended
    dependence graph

10
Evaluation
  • DSWP implemented in the back-end of IMPACT
    compiler
  • Accurate dual-core Itanium 2 model
  • Synchronization Array support for comm./sync.
  • ISA extended with produce/consume instructions
  • Important application loops selected (16%-98% of
    total execution)

11
Evaluation
12
Partitioning and Parallelism: 181.mcf
[Figure: DAG_SCC partitionings with speedup labels (45, 0, 48, 43, -2) and plots of queue occupancy (elements, axis up to 32) versus time (cycles)]
Currently use a simple load-balancing heuristic
13
Evaluation: Varying Processor Width
  • Modified, half-width Itanium 2 models

[Figure labels: 1-Core, 2-Core (used by DSWP); Full-width, Half-width]
  • On half-width model, speedup from DSWP is larger
  • Better performance compatibility
  • More effective on simpler cores

14
What about more threads?
while (ptr = ptr->next) ptr->val = ptr->val + 1;
1. Multiple SCCs
2. DOALL Consumer
[Figure labels: Dep. Graph, Producer, Consumer 1, Consumer 2]
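The DOALL-consumer idea can be sketched as one producer thread that keeps the pointer-chasing recurrence and deals nodes out round-robin to several consumer threads. This is a rough illustration assuming POSIX threads and a list in which every node is distinct, so the updates are independent of one another. NCONS, QCAP, put, get and doall_consumer_increment are hypothetical names, and the round-robin policy is an assumption rather than something the slides specify.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NCONS 2    /* hypothetical number of consumer threads */
#define QCAP  32   /* hypothetical queue depth */

struct node { int val; struct node *next; };

/* One bounded FIFO per consumer; exactly one writer and one reader each. */
struct queue {
    struct node *slot[QCAP];
    int head, count;
    pthread_mutex_t mu;
    pthread_cond_t changed;
};

static void put(struct queue *q, struct node *p)
{
    pthread_mutex_lock(&q->mu);
    while (q->count == QCAP)
        pthread_cond_wait(&q->changed, &q->mu);
    q->slot[(q->head + q->count) % QCAP] = p;
    q->count++;
    pthread_cond_broadcast(&q->changed);
    pthread_mutex_unlock(&q->mu);
}

static struct node *get(struct queue *q)
{
    pthread_mutex_lock(&q->mu);
    while (q->count == 0)
        pthread_cond_wait(&q->changed, &q->mu);
    struct node *p = q->slot[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    pthread_cond_broadcast(&q->changed);
    pthread_mutex_unlock(&q->mu);
    return p;
}

struct shared { struct node *head; struct queue q[NCONS]; };

/* Producer: runs the sequential traversal and distributes the nodes. */
static void *producer(void *arg)
{
    struct shared *s = arg;
    struct node *ptr = s->head;
    int next = 0;
    while ((ptr = ptr->next)) {
        put(&s->q[next], ptr);
        next = (next + 1) % NCONS;           /* round-robin (assumed policy) */
    }
    for (int i = 0; i < NCONS; i++)
        put(&s->q[i], NULL);                 /* NULL tells each consumer to exit */
    return NULL;
}

/* Consumer: updates whichever nodes it receives. */
static void *consumer(void *arg)
{
    struct queue *q = arg;
    struct node *ptr;
    while ((ptr = get(q)))
        ptr->val = ptr->val + 1;
    return NULL;
}

/* Like the original loop, this leaves the head node itself unmodified. */
static void doall_consumer_increment(struct node *head)
{
    struct shared s;
    pthread_t prod, cons[NCONS];
    assert(head != NULL);
    s.head = head;
    for (int i = 0; i < NCONS; i++) {
        s.q[i].head = s.q[i].count = 0;
        pthread_mutex_init(&s.q[i].mu, NULL);
        pthread_cond_init(&s.q[i].changed, NULL);
        pthread_create(&cons[i], NULL, consumer, &s.q[i]);
    }
    pthread_create(&prod, NULL, producer, &s);
    pthread_join(prod, NULL);
    for (int i = 0; i < NCONS; i++) {
        pthread_join(cons[i], NULL);
        pthread_cond_destroy(&s.q[i].changed);
        pthread_mutex_destroy(&s.q[i].mu);
    }
}
```

The consumer stage can be replicated here only because its iterations do not depend on each other. The traversal stage cannot be replicated the same way, since each pointer load depends on the previous one.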
15
Breaking SCCs: Speculative DSWP
164.gzip: 38% speedup with 3 threads
Only one SCC!
181.mcf: 2.9x speedup with 4 threads
[Figure label: Mis-speculation detected]
16
Conclusion
  • DSWP extracts pipelined thread-level parallelism
    from general-purpose, sequential programs
    • More applicable than traditional parallelization
      techniques
    • Handles arbitrary control flow
  • Future research directions
    • Additional analyses and optimizations
    • Break dependence cycles: code transformations,
      speculation
    • Reduce communication
    • Explore DOALL-consumer opportunities