Title: Processor Architectures and Program Mapping
1. Processor Architectures and Program Mapping
Exploiting ILP, part 2: code generation
- TU/e 5kk10
- Henk Corporaal
- Jef van Meerbergen
- Bart Mesman
2. Overview
- Enhancing performance: architecture methods
- Instruction-level parallelism
- VLIW
- Examples
  - C6
  - TM
  - TTA
- Clustering
- Code generation
- Hands-on
3. Compiler basics
- Overview
- Compiler trajectory / structure / passes
- Control flow graph (CFG)
- Mapping and scheduling
- Basic block list scheduling
- Extended scheduling scope
- Loop scheduling
4. Compiler basics: trajectory
Source program -> Preprocessor -> Compiler -> Assembler -> Loader/Linker -> Object program
(the compiler emits error messages; the loader/linker pulls in library code)
5. Compiler basics: structure / passes
Source code
  Parsing: lexical analysis (token generation), syntax check, semantic check, parse tree generation
Intermediate code
  Code optimization: data-flow analysis, local optimizations, global optimizations
  Code generation: code selection, peephole optimizations
  Register allocation: making the interference graph, graph coloring, spill code insertion, caller/callee save and restore code
Sequential code
  Scheduling and allocation: exploiting ILP
Object code
6. Compiler basics: simple compilation example
Source: position := initial + rate * 60
Lexical analyzer: id1 := id2 + id3 * 60
Syntax analyzer: parse tree for the assignment
Intermediate code generator:
  temp1 := inttoreal(60)
  temp2 := id3 * temp1
  temp3 := id2 + temp2
  id1 := temp3
Code optimizer:
  temp1 := id3 * 60.0
  id1 := id2 + temp1
Code generator:
  movf id3, r2
  mulf 60.0, r2, r2
  movf id2, r1
  addf r2, r1
  movf r1, id1
7. Compiler basics: control flow graph (CFG)
C input code:
  if (a > b) r = a % b; else r = b % a;
CFG:
  BB1: sub t1, a, b
       bgz t1, 2, 3
  BB2: rem r, a, b
       goto 4
  BB3: rem r, b, a
       goto 4
  BB4: ...
A Program is a collection of Functions, each Function is a collection of Basic Blocks, each Basic Block contains a set of Instructions, and each Instruction consists of several Transports, ...
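This hierarchy can be sketched with a few data classes; the names (BasicBlock, Function) are illustrative, not from any particular compiler:

```python
# A minimal sketch (illustrative names, not from a real compiler) of the
# hierarchy above: a Function is a collection of Basic Blocks, and each
# Basic Block holds a straight-line list of instructions plus its CFG edges.
from dataclasses import dataclass, field

@dataclass
class BasicBlock:
    label: str
    instructions: list = field(default_factory=list)
    successors: list = field(default_factory=list)   # labels of successor BBs

@dataclass
class Function:
    name: str
    blocks: dict = field(default_factory=dict)       # label -> BasicBlock

# The CFG of the example: if (a > b) r = a % b; else r = b % a;
f = Function("example")
for label, insns, succs in [
    ("1", ["sub t1, a, b", "bgz t1, 2, 3"], ["2", "3"]),
    ("2", ["rem r, a, b", "goto 4"],        ["4"]),
    ("3", ["rem r, b, a", "goto 4"],        ["4"]),
    ("4", [],                               []),
]:
    f.blocks[label] = BasicBlock(label, insns, succs)
```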
8. Mapping / scheduling: placing operations in space and time
Example operations:
  d = a - b
  e = a + d
  f = 2 * b + d
  r = f - e
  x = z + y
[Figure: the corresponding Data Dependence Graph (DDG), with inputs a, b, z, y and outputs r, x]
9. How to map these operations?
- Architecture constraints:
  - one function unit
  - all operations single-cycle latency
[Figure: the DDG operations placed one per cycle on the single function unit, over cycles 1-6]
10. How to map these operations?
- Architecture constraints:
  - one add-sub unit and one mul unit
  - all operations single-cycle latency
[Figure: the same DDG scheduled on the Mul and Add-sub units, over cycles 1-6]
11. There are many mapping solutions
12. Basic block scheduling
- Make a dependence graph
- Determine the minimal schedule length
- Determine ASAP, ALAP, and slack of each operation
- Place each operation in the first cycle with sufficient resources
- Note:
  - scheduling order is sequential
  - priority is determined by the heuristic used, e.g. slack
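The ASAP/ALAP/slack step can be sketched as follows (an assumed Python helper, not from the slides, using the unit latencies of the examples):

```python
# A sketch of the ASAP/ALAP/slack computation used by basic-block list
# scheduling. All operations are assumed single-cycle, as in the examples.
def asap_alap_slack(nodes, edges):
    """nodes: list of op names; edges: list of (u, v) dependence pairs."""
    preds = {v: [u for (u, w) in edges if w == v] for v in nodes}
    succs = {u: [w for (x, w) in edges if x == u] for u in nodes}

    asap = {}                        # earliest cycle: 1 + latest pred ASAP
    def compute_asap(v):
        if v not in asap:
            asap[v] = 1 + max((compute_asap(u) for u in preds[v]), default=0)
        return asap[v]
    for v in nodes:
        compute_asap(v)

    length = max(asap.values())      # minimal schedule length
    alap = {}                        # latest cycle: earliest succ ALAP - 1
    def compute_alap(v):
        if v not in alap:
            alap[v] = min((compute_alap(w) for w in succs[v]),
                          default=length + 1) - 1
        return alap[v]
    for v in nodes:
        compute_alap(v)

    slack = {v: alap[v] - asap[v] for v in nodes}
    return asap, alap, slack
```

On the DDG of slide 8 (d, e, f, r on the critical path, x independent), only x has nonzero slack.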
13. Basic block scheduling
[Figure: an example DDG with operations LD, MUL, ADD, SUB, NEG and inputs A, B, C, z, y. Each node is annotated with its <ASAP, ALAP> pair, e.g. <1,1>, <2,2>, <1,3>, <2,4>; slack = ALAP - ASAP.]
14. Cycle-based list scheduling
proc Schedule(DDG = (V,E))
beginproc
  ready  = { v | ¬∃(u,v) ∈ E }
  ready' = ready
  sched  = ∅
  current_cycle = 0
  while sched ≠ V do
    for each v ∈ ready' do
      if ¬ResourceConfl(v, current_cycle, sched) then
        cycle(v) = current_cycle
        sched = sched ∪ {v}
      endif
    endfor
    current_cycle = current_cycle + 1
    ready  = { v | v ∉ sched ∧ ∀(u,v) ∈ E, u ∈ sched }
    ready' = { v | v ∈ ready ∧ ∀(u,v) ∈ E, cycle(u) + delay(u,v) ≤ current_cycle }
  endwhile
endproc
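An executable rendering of this loop (assumed Python, specialized to identical function units, a uniform delay, and an acyclic DDG) might look like:

```python
# A sketch of the cycle-based list scheduler above, assuming n identical
# function units and a uniform operation delay. ready' is recomputed at the
# top of each cycle: all predecessors scheduled and their results available.
def list_schedule(nodes, edges, n_units, delay=1):
    preds = {v: [u for (u, w) in edges if w == v] for v in nodes}
    cycle = {}                        # v -> cycle in which v is issued
    sched = set()
    current = 0
    while len(sched) < len(nodes):
        ready = [v for v in nodes if v not in sched
                 and all(u in sched and cycle[u] + delay <= current
                         for u in preds[v])]
        issued = 0
        for v in ready:
            if issued < n_units:      # the ResourceConfl check
                cycle[v] = current
                sched.add(v)
                issued += 1
        current += 1
    return cycle
```

On the DDG of slide 8 this reproduces the earlier results: 5 cycles on a single unit, 3 cycles on two units.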
15. Extended basic block scheduling: code motion
- Downward code motions?
  - a → B, a → C, a → D, c → D, d → D
- Upward code motions?
  - c → A, d → A, e → B, e → C, e → A
16. Extended scheduling scope
Code:
  A; if cond then B else C; D; if cond then E else F; G
[Figure: the corresponding CFG: A branches to B and C, which join in D; D branches to E and F, which join in G]
17. Scheduling scopes
[Figure: scheduling scopes compared: trace, superblock, decision tree, hyperblock/region]
18. Code movement (upwards) within regions
[Figure: an add instruction moved from a source block up to a destination block. Legend: intermediate blocks lie on the path between the two; a copy of the moved instruction is needed in blocks reached by other paths; intermediate blocks must be checked for off-liveness.]
19. Extended basic block scheduling: code motion
- A dominates B ⇔ A is always executed before B
  - Consequently: if A does not dominate B, code motion from B to A requires code duplication
- B post-dominates A ⇔ B is always executed after A
  - Consequently: if B does not post-dominate A, code motion from B to A is speculative
Q1: does C dominate E? Q2: does C dominate D? Q3: does F post-dominate D? Q4: does D post-dominate B?
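Dominance can be computed with the standard iterative data-flow algorithm; this sketch (an assumed helper, not from the slides) uses the CFG of slide 16 and assumes every node is reachable from the entry:

```python
# A sketch of the classic iterative dominator computation: A dominates B iff
# every path from the entry to B passes through A. Assumes all nodes are
# reachable from the entry (so every non-entry node has a predecessor).
def dominators(cfg, entry):
    """cfg: dict node -> list of successors; returns node -> set of dominators."""
    nodes = set(cfg)
    preds = {n: [p for p in cfg if n in cfg[p]] for n in nodes}
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:                    # iterate to a fixed point
        changed = False
        for n in nodes - {entry}:
            new = {n} | set.intersection(*(dom[p] for p in preds[n]))
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom

# The CFG of slide 16: A branches to B and C, which join in D;
# D branches to E and F, which join in G.
cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"],
       "D": ["E", "F"], "E": ["G"], "F": ["G"], "G": []}
```

Post-dominance is the same computation on the reversed CFG. On this graph C does not dominate D, since the path A→B→D avoids C.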
20. Scheduling loops
Loop optimizations:
[Figure: a loop over blocks A, B, C, D transformed by loop unrolling (the loop body replicated several times) and by loop peeling (the first iteration moved out in front of the loop)]
21. Scheduling loops
- Problems with unrolling:
  - exploits only parallelism within sets of n iterations
  - iteration start-up latency
  - code expansion
[Figure: resource utilization over time for basic block scheduling, basic block scheduling with unrolling, and software pipelining]
22. Software pipelining
- Software pipelining a loop is:
  - scheduling the loop such that iterations start before preceding iterations have finished
- Or:
  - moving operations across the backedge
[Figure: an LD-ML-ST loop body. Plain scheduling: 3 cycles/iteration; unrolling: 5/3 cycles/iteration; software pipelining: 1 cycle/iteration]
23. Software pipelining (cont'd)
- Basic techniques:
  - Modulo scheduling (Rau, Lam)
    - list scheduling with modulo resource constraints
  - Kernel recognition techniques
    - unroll the loop
    - schedule the iterations
    - identify a repeating pattern
    - examples:
      - Perfect pipelining (Aiken and Nicolau)
      - URPR (Su, Ding and Xia)
      - Petri net pipelining (Allan)
  - Enhanced pipeline scheduling (Ebcioglu)
    - fill the first cycle of the iteration
    - copy this instruction over the backedge
24. Software pipelining: modulo scheduling
Example: modulo scheduling a loop with body
  ld  r1,(r2)
  mul r3,r1,3
  sub r4,r3,1
  st  r4,(r5)
[Figure (c), software pipeline: overlapped copies of the body form the prologue, the repeating kernel, and the epilogue]
- The prologue fills the SW pipeline with iterations
- The epilogue drains the SW pipeline
25. Software pipelining: determine II, the initiation interval
Cyclic data dependences, e.g.
  for (i = 0; ...) a[i+6] = 3 * a[i-1];
impose the constraint
  cycle(v) ≥ cycle(u) + delay(u,v) - II · distance(u,v)
26. Modulo scheduling constraints
MII, the minimum initiation interval, is bounded by cyclic dependences and by resources:
  MII = max(ResMII, RecMII)
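The two bounds can be sketched as follows (hypothetical helpers; the op_counts/unit_counts dictionaries and the dependence-cycle list are illustrative inputs, not from the slides):

```python
# A sketch of the MII bound: ResMII from resource usage per iteration,
# RecMII from dependence cycles (ceiling of total delay over total iteration
# distance around each cycle), and MII = max of the two.
from math import ceil

def res_mii(op_counts, unit_counts):
    """op_counts[u]: operations per iteration needing unit type u."""
    return max(ceil(op_counts[u] / unit_counts[u]) for u in op_counts)

def rec_mii(dep_cycles):
    """dep_cycles: list of (total_delay, total_distance) per dependence cycle."""
    return max(ceil(d / dist) for (d, dist) in dep_cycles)

def mii(op_counts, unit_counts, dep_cycles):
    return max(res_mii(op_counts, unit_counts), rec_mii(dep_cycles))
```

For example, 3 ALU operations per iteration on 1 ALU give ResMII = 3, and a dependence cycle with total delay 4 and distance 2 gives RecMII = 2, so MII = 3.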
27. The role of the compiler
9 steps are required to translate an HLL program:
1. Front-end compilation
2. Determine dependencies
3. Graph partitioning: make multiple threads (or tasks)
4. Bind partitions to compute nodes
5. Bind operands to locations
6. Bind operations to time slots: scheduling
7. Bind operations to functional units
8. Bind transports to buses
9. Execute operations and perform transports
28. Division of responsibilities between hardware and compiler
The steps from frontend to execution are split between compiler and hardware; each architecture class draws the line differently (steps above the line are the compiler's responsibility, steps below are the hardware's):
- Superscalar: hardware determines dependencies, binds operands, schedules, binds operations and transports, and executes
- Dataflow: the compiler determines dependencies; hardware does the rest
- Multi-threaded: the compiler also binds operands
- Independence architectures: the compiler also schedules
- VLIW: the compiler also binds operations (to functional units)
- TTA: the compiler also binds transports; hardware only executes
29. Overview
- Enhancing performance: architecture methods
- Instruction-level parallelism
- VLIW
- Examples
  - C6
  - TM
  - TTA
- Clustering
- Code generation
- Hands-on
30. Hands-on (not this year)
- Map JPEG to a TTA processor
  - see web page: http://www.ics.ele.tue.nl/heco/courses/pam
- Install the TTA tools (compiler and simulator)
- Go through all listed steps
- Perform DSE: design space exploration
- Add an SFU
- 1- or 2-page report in 2 weeks
31. Hands-on
- Let's look at DSE: design space exploration
- We will use the Imagine processor
  - http://cva.stanford.edu/projects/imagine/
32. Mapping applications to processors: the MOVE framework
[Figure: the MOVE framework. A parametric compiler produces parallel object code and a hardware generator produces the chip for a TTA-based system; an optimizer adjusts the architecture parameters using feedback from both, with user interaction.]
33. Code generation trajectory for TTAs
- Frontend: GCC or SUIF (adapted)
[Figure: the application (C) passes through the compiler frontend to sequential code, which is validated by sequential simulation (input/output) and profiled; the compiler backend combines the sequential code, profiling data, and an architecture description into parallel code, validated by parallel simulation (input/output).]
34. Exploration: TTA resource reduction
35. Exploration: TTA connectivity reduction
[Figure: execution time versus number of connections removed. Removing connections reduces bus delay with little cost at first; eventually critical connections disappear and execution time rises, while the FU stage constrains the cycle time.]
36. Can we do better?
Yes!!
- How?
  - Transformations
  - SFUs: special function units
  - Multiple processors
37. Transforming the specification
Based on associativity of operations: a + (b + c) = (a + b) + c
38. Transforming the specification
Original:
  d = a - b
  e = a + d
  f = 2 * b + d
  r = f - e
  x = z + y
Transformed:
  r = 2b - a
  x = z + y
[Figure: the transformed DDG: 2b computed from b with a left shift (<<), then subtracted from a to give r; z + y gives x]
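A quick check (illustrative, assuming integer operands) that the transformation is valid: the original chain algebraically reduces to r = 2b - a, cutting the critical path from three dependent operations to two.

```python
# Check that the transformed specification matches the original:
# d = a-b; e = a+d; f = 2b+d; r = f-e  reduces to  r = 2b - a.
def original(a, b):
    d = a - b
    e = a + d
    f = 2 * b + d
    r = f - e
    return r

def transformed(a, b):
    return (b << 1) - a       # 2b computed with a left shift, as in the figure

# Exhaustive check over a small integer range
for a in range(-5, 6):
    for b in range(-5, 6):
        assert original(a, b) == transformed(a, b)
```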
39. Changing the architecture: adding SFUs (special function units)
[Figure: a 4-input adder] Why is this faster?
40. Changing the architecture: adding SFUs (special function units)
- In the extreme case, put everything into one unit!
- Spatial mapping: no control flow
- However: no flexibility / programmability!!
41. SFUs: fine-grain patterns
- Why use fine-grain SFUs?
  - Code size reduction
  - Register file ports reduction
  - Could be cheaper and/or faster
  - Transport reduction
  - Power reduction (avoid charging non-local wires)
  - Supports a whole application domain!
- Which patterns need support?
  - Detection of recurring operation patterns is needed
42. SFUs: covering results
43. Exploration: resulting architecture
- Architecture for image processing
- Note the reduced connectivity
44. Conclusions
- Billions of embedded processing systems
  - how to design these systems quickly, cheaply, correctly, at low power, ...?
  - what will their processing platform look like?
- VLIWs are very powerful and flexible
  - can easily be tuned to an application domain
- TTAs are even more flexible, scalable, and lower power
45. Conclusions
- Compilation for ILP architectures is getting mature, and
- enters the commercial arena.
- However:
  - great discrepancy between available and exploitable parallelism
  - advanced code scheduling techniques are needed to exploit ILP
46. Bottom line
Do not pay for hardware if you can do it in software!!