Title: Processor Architectures and Program Mapping
1. Processor Architectures and Program Mapping
Exploiting ILP, part 2: code generation
- TU/e 5kk10
- Henk Corporaal
- Jef van Meerbergen
- Bart Mesman
2. Overview
- Enhancing performance: architecture methods
- Instruction-level parallelism
- VLIW
- Examples
  - C6
  - TM
  - TTA
- Clustering
- Code generation
- Hands-on
3. Compiler basics
- Overview
- Compiler trajectory / structure / passes
- Control flow graph (CFG)
- Mapping and scheduling
- Basic block list scheduling
- Extended scheduling scope
- Loop scheduling
4. Compiler basics: trajectory
Source program -> Preprocessor -> Compiler -> Assembler -> Loader/Linker -> Object program
(the compiler emits error messages; the loader/linker pulls in library code)
5. Compiler basics: structure / passes
Source code
  Parsing: lexical analysis (token generation), syntax check, semantic check, parse tree generation
Intermediate code
  Code optimization: data-flow analysis, local optimizations, global optimizations
  Code generation: code selection, peephole optimizations
  Register allocation: making the interference graph, graph coloring, spill code insertion, caller/callee save and restore code
Sequential code
  Scheduling and allocation: exploiting ILP
Object code
6. Compiler basics: simple compilation example
Source: position := initial + rate * 60
Lexical analyzer: id1 := id2 + id3 * 60
Syntax analyzer: parse tree for the assignment
Intermediate code generator:
  temp1 := inttoreal(60)
  temp2 := id3 * temp1
  temp3 := id2 + temp2
  id1 := temp3
Code optimizer:
  temp1 := id3 * 60.0
  id1 := id2 + temp1
Code generator:
  movf id3, r2
  mulf 60.0, r2, r2
  movf id2, r1
  addf r2, r1
  movf r1, id1
7. Compiler basics: control flow graph (CFG)
C input code:
  if (a > b) r = a % b; else r = b % a;
CFG:
  BB1: sub t1, a, b
       bgz t1, 2, 3
  BB2: rem r, a, b
       goto 4
  BB3: rem r, b, a
       goto 4
  BB4: ...
A Program is a collection of Functions, each Function is a collection of Basic Blocks, each Basic Block contains a set of Instructions, and each Instruction consists of several Transports, ...
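This hierarchy can be sketched with a few data classes; the names (BasicBlock, Function) are illustrative, not from any particular compiler:

```python
# A minimal sketch (illustrative names, not from a real compiler) of the
# hierarchy above: a Function is a collection of Basic Blocks, and each
# Basic Block holds a straight-line list of instructions plus its CFG edges.
from dataclasses import dataclass, field

@dataclass
class BasicBlock:
    label: str
    instructions: list = field(default_factory=list)
    successors: list = field(default_factory=list)   # labels of successor BBs

@dataclass
class Function:
    name: str
    blocks: dict = field(default_factory=dict)       # label -> BasicBlock

# The CFG of the example: if (a > b) r = a % b; else r = b % a;
f = Function("example")
for label, insns, succs in [
    ("1", ["sub t1, a, b", "bgz t1, 2, 3"], ["2", "3"]),
    ("2", ["rem r, a, b", "goto 4"],        ["4"]),
    ("3", ["rem r, b, a", "goto 4"],        ["4"]),
    ("4", [],                               []),
]:
    f.blocks[label] = BasicBlock(label, insns, succs)
```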
8. Mapping / scheduling: placing operations in space and time
Example operations:
  d = a - b
  e = a + d
  f = 2 * b + d
  r = f - e
  x = z + y
[Figure: the corresponding Data Dependence Graph (DDG), with inputs a, b, z, y and outputs r, x]
9. How to map these operations?
- Architecture constraints:
  - one function unit
  - all operations single-cycle latency
[Figure: the DDG operations placed one per cycle on the single function unit, over cycles 1-6]
10. How to map these operations?
- Architecture constraints:
  - one add-sub unit and one mul unit
  - all operations single-cycle latency
[Figure: the same DDG scheduled on the Mul and Add-sub units, over cycles 1-6]
11. There are many mapping solutions
12. Basic block scheduling
- Make a dependence graph
- Determine the minimal schedule length
- Determine ASAP, ALAP, and slack of each operation
- Place each operation in the first cycle with sufficient resources
- Note:
  - scheduling order is sequential
  - priority is determined by the heuristic used, e.g. slack
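The ASAP/ALAP/slack step can be sketched as follows (an assumed Python helper, not from the slides, using the unit latencies of the examples):

```python
# A sketch of the ASAP/ALAP/slack computation used by basic-block list
# scheduling. All operations are assumed single-cycle, as in the examples.
def asap_alap_slack(nodes, edges):
    """nodes: list of op names; edges: list of (u, v) dependence pairs."""
    preds = {v: [u for (u, w) in edges if w == v] for v in nodes}
    succs = {u: [w for (x, w) in edges if x == u] for u in nodes}

    asap = {}                        # earliest cycle: 1 + latest pred ASAP
    def compute_asap(v):
        if v not in asap:
            asap[v] = 1 + max((compute_asap(u) for u in preds[v]), default=0)
        return asap[v]
    for v in nodes:
        compute_asap(v)

    length = max(asap.values())      # minimal schedule length
    alap = {}                        # latest cycle: earliest succ ALAP - 1
    def compute_alap(v):
        if v not in alap:
            alap[v] = min((compute_alap(w) for w in succs[v]),
                          default=length + 1) - 1
        return alap[v]
    for v in nodes:
        compute_alap(v)

    slack = {v: alap[v] - asap[v] for v in nodes}
    return asap, alap, slack
```

On the DDG of slide 8 (d, e, f, r on the critical path, x independent), only x has nonzero slack.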
13. Basic block scheduling
[Figure: an example DDG with operations LD, MUL, ADD, SUB, NEG and inputs A, B, C, z, y. Each node is annotated with its <ASAP, ALAP> pair, e.g. <1,1>, <2,2>, <1,3>, <2,4>; slack = ALAP - ASAP.]
14. Cycle-based list scheduling
proc Schedule(DDG = (V,E))
beginproc
  ready  = { v | ¬∃(u,v) ∈ E }
  ready' = ready
  sched  = ∅
  current_cycle = 0
  while sched ≠ V do
    for each v ∈ ready' do
      if ¬ResourceConfl(v, current_cycle, sched) then
        cycle(v) = current_cycle
        sched = sched ∪ {v}
      endif
    endfor
    current_cycle = current_cycle + 1
    ready  = { v | v ∉ sched ∧ ∀(u,v) ∈ E, u ∈ sched }
    ready' = { v | v ∈ ready ∧ ∀(u,v) ∈ E, cycle(u) + delay(u,v) ≤ current_cycle }
  endwhile
endproc
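An executable rendering of this loop (assumed Python, specialized to identical function units, a uniform delay, and an acyclic DDG) might look like:

```python
# A sketch of the cycle-based list scheduler above, assuming n identical
# function units and a uniform operation delay. ready' is recomputed at the
# top of each cycle: all predecessors scheduled and their results available.
def list_schedule(nodes, edges, n_units, delay=1):
    preds = {v: [u for (u, w) in edges if w == v] for v in nodes}
    cycle = {}                        # v -> cycle in which v is issued
    sched = set()
    current = 0
    while len(sched) < len(nodes):
        ready = [v for v in nodes if v not in sched
                 and all(u in sched and cycle[u] + delay <= current
                         for u in preds[v])]
        issued = 0
        for v in ready:
            if issued < n_units:      # the ResourceConfl check
                cycle[v] = current
                sched.add(v)
                issued += 1
        current += 1
    return cycle
```

On the DDG of slide 8 this reproduces the earlier results: 5 cycles on a single unit, 3 cycles on two units.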
15. Extended basic block scheduling: code motion
- Downward code motions?
  - a → B, a → C, a → D, c → D, d → D
- Upward code motions?
  - c → A, d → A, e → B, e → C, e → A
16. Extended scheduling scope
Code:
  A; if cond then B else C; D; if cond then E else F; G
[Figure: the corresponding CFG: A branches to B and C, which join in D; D branches to E and F, which join in G]
17. Scheduling scopes
[Figure: scheduling scopes compared: trace, superblock, decision tree, hyperblock/region]
18. Code movement (upwards) within regions
[Figure: an add instruction moved from a source block up to a destination block. Legend: intermediate blocks lie on the path between the two; a copy of the moved instruction is needed in blocks reached by other paths; intermediate blocks must be checked for off-liveness.]
19. Extended basic block scheduling: code motion
- A dominates B ⇔ A is always executed before B
  - Consequently: if A does not dominate B, code motion from B to A requires code duplication
- B post-dominates A ⇔ B is always executed after A
  - Consequently: if B does not post-dominate A, code motion from B to A is speculative
Q1: does C dominate E? Q2: does C dominate D? Q3: does F post-dominate D? Q4: does D post-dominate B?
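Dominance can be computed with the standard iterative data-flow algorithm; this sketch (an assumed helper, not from the slides) uses the CFG of slide 16 and assumes every node is reachable from the entry:

```python
# A sketch of the classic iterative dominator computation: A dominates B iff
# every path from the entry to B passes through A. Assumes all nodes are
# reachable from the entry (so every non-entry node has a predecessor).
def dominators(cfg, entry):
    """cfg: dict node -> list of successors; returns node -> set of dominators."""
    nodes = set(cfg)
    preds = {n: [p for p in cfg if n in cfg[p]] for n in nodes}
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:                    # iterate to a fixed point
        changed = False
        for n in nodes - {entry}:
            new = {n} | set.intersection(*(dom[p] for p in preds[n]))
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom

# The CFG of slide 16: A branches to B and C, which join in D;
# D branches to E and F, which join in G.
cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"],
       "D": ["E", "F"], "E": ["G"], "F": ["G"], "G": []}
```

Post-dominance is the same computation on the reversed CFG. On this graph C does not dominate D, since the path A→B→D avoids C.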
20. Scheduling loops
Loop optimizations:
[Figure: a loop over blocks A, B, C, D transformed by loop unrolling (the loop body replicated several times) and by loop peeling (the first iteration moved out in front of the loop)]
21. Scheduling loops
- Problems with unrolling:
  - exploits only parallelism within sets of n iterations
  - iteration start-up latency
  - code expansion
[Figure: resource utilization over time for basic block scheduling, basic block scheduling with unrolling, and software pipelining]
22. Software pipelining
- Software pipelining a loop is:
  - scheduling the loop such that iterations start before preceding iterations have finished
- Or:
  - moving operations across the backedge
[Figure: an LD-ML-ST loop body. Plain scheduling: 3 cycles/iteration; unrolling: 5/3 cycles/iteration; software pipelining: 1 cycle/iteration]
23. Software pipelining (cont'd)
- Basic techniques:
  - Modulo scheduling (Rau, Lam)
    - list scheduling with modulo resource constraints
  - Kernel recognition techniques
    - unroll the loop
    - schedule the iterations
    - identify a repeating pattern
    - examples:
      - Perfect pipelining (Aiken and Nicolau)
      - URPR (Su, Ding and Xia)
      - Petri net pipelining (Allan)
  - Enhanced pipeline scheduling (Ebcioglu)
    - fill the first cycle of the iteration
    - copy this instruction over the backedge
24. Software pipelining: modulo scheduling
Example: modulo scheduling a loop with body
  ld  r1,(r2)
  mul r3,r1,3
  sub r4,r3,1
  st  r4,(r5)
[Figure (c), software pipeline: overlapped copies of the body form the prologue, the repeating kernel, and the epilogue]
- The prologue fills the SW pipeline with iterations
- The epilogue drains the SW pipeline
25. Software pipelining: determine II, the initiation interval
Cyclic data dependences, e.g.
  for (i = 0; ...) a[i+6] = 3 * a[i-1];
impose the constraint
  cycle(v) ≥ cycle(u) + delay(u,v) - II · distance(u,v)
26. Modulo scheduling constraints
MII, the minimum initiation interval, is bounded by cyclic dependences and by resources:
  MII = max(ResMII, RecMII)
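The two bounds can be sketched as follows (hypothetical helpers; the op_counts/unit_counts dictionaries and the dependence-cycle list are illustrative inputs, not from the slides):

```python
# A sketch of the MII bound: ResMII from resource usage per iteration,
# RecMII from dependence cycles (ceiling of total delay over total iteration
# distance around each cycle), and MII = max of the two.
from math import ceil

def res_mii(op_counts, unit_counts):
    """op_counts[u]: operations per iteration needing unit type u."""
    return max(ceil(op_counts[u] / unit_counts[u]) for u in op_counts)

def rec_mii(dep_cycles):
    """dep_cycles: list of (total_delay, total_distance) per dependence cycle."""
    return max(ceil(d / dist) for (d, dist) in dep_cycles)

def mii(op_counts, unit_counts, dep_cycles):
    return max(res_mii(op_counts, unit_counts), rec_mii(dep_cycles))
```

For example, 3 ALU operations per iteration on 1 ALU give ResMII = 3, and a dependence cycle with total delay 4 and distance 2 gives RecMII = 2, so MII = 3.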
27. The role of the compiler
9 steps are required to translate an HLL program:
1. Front-end compilation
2. Determine dependencies
3. Graph partitioning: make multiple threads (or tasks)
4. Bind partitions to compute nodes
5. Bind operands to locations
6. Bind operations to time slots: scheduling
7. Bind operations to functional units
8. Bind transports to buses
9. Execute operations and perform transports
28. Division of responsibilities between hardware and compiler
The steps from frontend to execution are split between compiler and hardware; each architecture class draws the line differently (steps above the line are the compiler's responsibility, steps below are the hardware's):
- Superscalar: hardware determines dependencies, binds operands, schedules, binds operations and transports, and executes
- Dataflow: the compiler determines dependencies; hardware does the rest
- Multi-threaded: the compiler also binds operands
- Independence architectures: the compiler also schedules
- VLIW: the compiler also binds operations (to functional units)
- TTA: the compiler also binds transports; hardware only executes
29. Overview
- Enhancing performance: architecture methods
- Instruction-level parallelism
- VLIW
- Examples
  - C6
  - TM
  - TTA
- Clustering
- Code generation
- Hands-on
30. Hands-on (not this year)
- Map JPEG to a TTA processor
  - see web page: http://www.ics.ele.tue.nl/heco/courses/pam
- Install the TTA tools (compiler and simulator)
- Go through all listed steps
- Perform DSE: design space exploration
- Add an SFU
- 1- or 2-page report in 2 weeks
31. Hands-on
- Let's look at DSE: design space exploration
- We will use the Imagine processor
  - http://cva.stanford.edu/projects/imagine/
32. Mapping applications to processors: the MOVE framework
[Figure: the MOVE framework. A parametric compiler produces parallel object code and a hardware generator produces the chip for a TTA-based system; an optimizer adjusts the architecture parameters using feedback from both, with user interaction.]
33. Code generation trajectory for TTAs
- Frontend: GCC or SUIF (adapted)
[Figure: the application (C) passes through the compiler frontend to sequential code, which is validated by sequential simulation (input/output) and profiled; the compiler backend combines the sequential code, profiling data, and an architecture description into parallel code, validated by parallel simulation (input/output).]
34. Exploration: TTA resource reduction
35. Exploration: TTA connectivity reduction
[Figure: execution time versus number of connections removed. Removing connections reduces bus delay with little cost at first; eventually critical connections disappear and execution time rises, while the FU stage constrains the cycle time.]
36. Can we do better?
Yes!!
- How?
  - Transformations
  - SFUs: special function units
  - Multiple processors
37. Transforming the specification
Based on associativity of operations: a + (b + c) = (a + b) + c
38. Transforming the specification
Original:
  d = a - b
  e = a + d
  f = 2 * b + d
  r = f - e
  x = z + y
Transformed:
  r = 2b - a
  x = z + y
[Figure: the transformed DDG: 2b computed from b with a left shift (<<), then subtracted from a to give r; z + y gives x]
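A quick check (illustrative, assuming integer operands) that the transformation is valid: the original chain algebraically reduces to r = 2b - a, cutting the critical path from three dependent operations to two.

```python
# Check that the transformed specification matches the original:
# d = a-b; e = a+d; f = 2b+d; r = f-e  reduces to  r = 2b - a.
def original(a, b):
    d = a - b
    e = a + d
    f = 2 * b + d
    r = f - e
    return r

def transformed(a, b):
    return (b << 1) - a       # 2b computed with a left shift, as in the figure

# Exhaustive check over a small integer range
for a in range(-5, 6):
    for b in range(-5, 6):
        assert original(a, b) == transformed(a, b)
```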
39. Changing the architecture: adding SFUs (special function units)
[Figure: a 4-input adder] Why is this faster?
40. Changing the architecture: adding SFUs (special function units)
- In the extreme case, put everything into one unit!
- Spatial mapping: no control flow
- However: no flexibility / programmability!!
41. SFUs: fine-grain patterns
- Why use fine-grain SFUs?
  - Code size reduction
  - Register file ports reduction
  - Could be cheaper and/or faster
  - Transport reduction
  - Power reduction (avoid charging non-local wires)
  - Supports a whole application domain!
- Which patterns need support?
  - Detection of recurring operation patterns is needed
42. SFUs: covering results
43. Exploration: resulting architecture
- Architecture for image processing
- Note the reduced connectivity
44. Conclusions
- Billions of embedded processing systems
  - how to design these systems quickly, cheaply, correctly, at low power, ...?
  - what will their processing platform look like?
- VLIWs are very powerful and flexible
  - can easily be tuned to an application domain
- TTAs are even more flexible, scalable, and lower power
45. Conclusions
- Compilation for ILP architectures is getting mature, and
- enters the commercial arena.
- However:
  - great discrepancy between available and exploitable parallelism
  - advanced code scheduling techniques are needed to exploit ILP
46. Bottom line
Do not pay for hardware if you can do it in software!!