Title: CSL718 : VLIW - Software Driven ILP
1. CSL718 VLIW - Software Driven ILP
- Compiler Support for Exposing and Exploiting ILP
- 1st Apr, 2006
2. Code Scheduling for VLIW
- Objective is to move code around and form packets of concurrently executable instructions
- Two possibilities
  - Local: work on a straight-line piece of code (basic block), i.e., do not go across conditional branches
  - Global: code can move across conditional branches
- Loops need to be tackled in both cases
3. Pipeline scheduling example
- for (i = 1000; i > 0; i--)
-   x[i] = x[i] + s;
- Loop: L.D    F0, 0(R1)
-       ADD.D  F4, F0, F2
-       S.D    F4, 0(R1)
-       DADDUI R1, R1, -8
-       BNE    R1, R2, Loop
4. Latency due to data hazards

  Producer instruction | Consumer instruction | Latency (cycles)
  FP ALU op            | FP ALU op            | 3
  FP ALU op            | Store double         | 2
  Load double          | FP ALU op            | 1
  Load double          | Store double         | 0

- Assume no structural hazards
5. Straightforward scheduling
- Loop L.D F0, 0(R1) 1
- stall 2
- ADD.D F4, F0, F2 3
- stall 4
- stall 5
- S.D F4, 0(R1) 6
- DADDUI R1, R1, -8 7
- stall 8
- BNE R1, R2, Loop 9
- stall 10
6. A better schedule
- Loop L.D F0, 0(R1) 1
- DADDUI R1, R1, -8 2
- ADD.D F4, F0, F2 3
- stall 4
- BNE R1, R2, Loop 5
- S.D F4, 8(R1) 6
7. Loop unrolling
- Loop: L.D    F0, 0(R1)
-       ADD.D  F4, F0, F2
-       S.D    F4, 0(R1)     ; cycle 6
-       L.D    F0, -8(R1)
-       ADD.D  F4, F0, F2
-       S.D    F4, -8(R1)    ; cycle 12
-       L.D    F0, -16(R1)
-       ADD.D  F4, F0, F2
-       S.D    F4, -16(R1)   ; cycle 18
-       L.D    F0, -24(R1)
-       ADD.D  F4, F0, F2
-       S.D    F4, -24(R1)   ; cycle 24
-       DADDUI R1, R1, -32
-       BNE    R1, R2, Loop  ; cycle 28
- 28 cycles / 4 iterations = 7 cycles per iteration
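The unrolled schedule above is the assembly-level view of a source transformation that can be sketched in C (function names are illustrative; as on the slide, the element count is assumed to be a multiple of 4):

```c
#include <stddef.h>

/* Original loop of slide 3: for (i = 1000; i > 0; i--) x[i] = x[i] + s; */
void add_scalar(double *x, size_t n, double s) {
    for (size_t i = n; i > 0; i--)
        x[i] = x[i] + s;
}

/* Unrolled 4 times: one copy of the loop overhead (decrement + branch)
   is amortized over four element updates. Assumes n is a multiple of 4. */
void add_scalar_unrolled4(double *x, size_t n, double s) {
    for (size_t i = n; i > 0; i -= 4) {
        x[i]     = x[i]     + s;
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
    }
}
```

Note that this C-level unrolling still reuses the same names in every copy; the register renaming of slide 8 (F0/F4 becoming F6/F8, F10/F12, F14/F16) is what removes the false dependences between the copies.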
8. Re-scheduling
- Loop: L.D    F0, 0(R1)
-       L.D    F6, -8(R1)
-       L.D    F10, -16(R1)
-       L.D    F14, -24(R1)  ; cycle 4
-       ADD.D  F4, F0, F2
-       ADD.D  F8, F6, F2
-       ADD.D  F12, F10, F2
-       ADD.D  F16, F14, F2  ; cycle 8
-       S.D    F4, 0(R1)
-       S.D    F8, -8(R1)    ; cycle 10
-       DADDUI R1, R1, -32
-       S.D    F12, 16(R1)   ; 16 - 32 = -16, cycle 12
-       BNE    R1, R2, Loop
-       S.D    F16, 8(R1)    ; 8 - 32 = -24, cycle 14
- 14 cycles / 4 iterations = 3.5 cycles per iteration
9. Limits to unrolling
- Decrease in amount of loop overhead amortized with each additional unroll
- Growth in code size
- Register renaming leads to register pressure
10. Scheduling example with a 2-issue processor

  Memory/integer op         FP operation         Cycle
- Loop: L.D F0, 0(R1)                              1
-       L.D F6, -8(R1)                             2
-       L.D F10, -16(R1)     ADD.D F4, F0, F2      3
-       L.D F14, -24(R1)     ADD.D F8, F6, F2      4
-       L.D F18, -32(R1)     ADD.D F12, F10, F2    5
-       S.D F4, 0(R1)        ADD.D F16, F14, F2    6
-       S.D F8, -8(R1)       ADD.D F20, F18, F2    7
-       S.D F12, -16(R1)                           8
-       DADDUI R1, R1, -40                         9
-       S.D F16, 16(R1)                            10
-       BNE R1, R2, Loop                           11
-       S.D F20, 8(R1)                             12
11. Scheduling example with a 5-issue processor

  Mem ref 1        Mem ref 2        FP op 1           FP op 2           Int/branch        Cycle
  L.D F0,0(R1)     L.D F6,-8(R1)                                                           1
  L.D F10,-16(R1)  L.D F14,-24(R1)                                                         2
  L.D F18,-32(R1)  L.D F22,-40(R1)  ADD.D F4,F0,F2    ADD.D F8,F6,F2                       3
  L.D F26,-48(R1)                   ADD.D F12,F10,F2  ADD.D F16,F14,F2                     4
                                    ADD.D F20,F18,F2  ADD.D F24,F22,F2                     5
  S.D F4,0(R1)     S.D F8,-8(R1)    ADD.D F28,F26,F2                                       6
  S.D F12,-16(R1)  S.D F16,-24(R1)                                      DADDUI R1,R1,-56   7
  S.D F20,24(R1)   S.D F24,16(R1)                                                          8
  S.D F28,8(R1)                                                         BNE R1,R2,Loop     9
12. Scheduling Results

  Schedule                         Cycles/iteration
  Straightforward scheduling       10
  With instruction re-ordering     6
  With loop unrolling (4 times)    7
  Unrolling + re-ordering          3.5
  Scheduling on 2-issue VLIW       2.4
  Scheduling on 5-issue VLIW       1.3
13. Loop Level Parallelism
- Dependences in context of a loop
  - Dependence within an iteration
  - Dependence across iterations, or loop-carried dependence
- Example with no loop-carried dependence:
- for (i = 1000; i > 0; i--)
-   x[i] = x[i] + s;
- There is dependence only within an iteration.
14. Example with loop carried dependence
- for (i = 1; i <= 100; i++) {
-   A[i+1] = A[i] + C[i];    /* S1 */
-   B[i+1] = B[i] + A[i+1];  /* S2 */
- }
- Assume that arrays are distinct and non-overlapping
- S1 uses a value of A from the previous iteration
  - restricts overlapping of different iterations
- S2 uses a value of A from the same iteration
  - restricts movement of instructions within an iteration
15. Another example
- for (i = 1; i <= 100; i++) {
-   A[i] = A[i] + B[i];    /* S1 */
-   B[i+1] = C[i] + D[i];  /* S2 */
- }
- S1 uses a value computed by S2 in the previous iteration.
- Still, iterations can be parallelized: there is no cycle among the dependences.
- A transformation can remove the loop-carried dependence.
16. Transformed loop of previous example
- A[1] = A[1] + B[1];
- for (i = 1; i <= 99; i++) {
-   B[i+1] = C[i] + D[i];
-   A[i+1] = A[i+1] + B[i+1];
- }
- Now there is no loop-carried dependence. The iterations can be parallelized, preserving the dependence within each iteration.
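The equivalence of the two loops can be checked directly in C (a sketch; function names are illustrative). One detail: the original loop also writes B[101], so a single tail store after the transformed loop is needed for the B array to match exactly:

```c
#define N 100

/* Loop of slide 15: loop-carried dependence from S2 to S1. */
void original_loop(double A[], double B[], const double C[], const double D[]) {
    for (int i = 1; i <= N; i++) {
        A[i] = A[i] + B[i];        /* S1 */
        B[i + 1] = C[i] + D[i];    /* S2 */
    }
}

/* Transformed loop of slide 16: the dependence is now inside each iteration. */
void transformed_loop(double A[], double B[], const double C[], const double D[]) {
    A[1] = A[1] + B[1];
    for (int i = 1; i <= N - 1; i++) {
        B[i + 1] = C[i] + D[i];
        A[i + 1] = A[i + 1] + B[i + 1];
    }
    B[N + 1] = C[N] + D[N];  /* tail store so B matches the original exactly */
}
```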
17. Dependence distance
- for (i = 6; i <= 100; i++)
-   B[i] = B[i-5] + B[i];
- Dependence distance is 5.
- This gives an opportunity for parallelization in spite of the loop-carried dependence.
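Because the dependence distance is 5, any five consecutive iterations are mutually independent and may execute in any order (or in parallel). A C sketch that demonstrates this by running each block of five iterations in reverse order (function names are illustrative):

```c
#define M 100

/* Sequential version of the slide's loop. */
void dist_seq(double B[]) {
    for (int i = 6; i <= M; i++)
        B[i] = B[i - 5] + B[i];
}

/* Same loop, each block of 5 iterations executed in reverse order.
   Legal because the dependence distance is 5: no iteration in a block
   reads a value written by another iteration of the same block. */
void dist_blocked(double B[]) {
    for (int i = 6; i <= M; i += 5)
        for (int j = i + 4; j >= i; j--)
            if (j <= M)
                B[j] = B[j - 5] + B[j];
}
```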
18. Finding dependence
- Important for
  - code scheduling
  - loop parallelization
  - removal of false dependences
- Analysis is complicated by
  - pointers and passing of parameters by reference, and the consequent potential for aliasing
  - use of complex expressions as indices of arrays
19. Analysis with affine indices
- for (i = m; i < n; i++)
-   A[a*i + b] = A[c*i + d] + B[i];
- Can a*i + b become equal to c*i + d for some values of i within the range m to n?
- Difficult to determine in general: a, b, c, d may not be known at compile time, and could depend on other loop indices.
- When a, b, c, d are constants, we can use the GCD test. Dependence implies that GCD(c, a) must divide d - b.
20. Example with affine indices
- for (i = 1; i <= 100; i++)
-   A[4*i + 1] = A[6*i + 4] + B[i];
- GCD(4, 6) is 2 and 4 - 1 is 3.
- 2 does not divide 3; therefore, the two indices can never take the same value.
- values of 4*i + 1: 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, ...
- values of 6*i + 4: 10, 16, 22, 28, 34, 40, 46, 52, 58, ...
- Sometimes GCD(c, a) may divide d - b, yet the dependence may not exist: the test is conservative.
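The GCD test is straightforward to implement (a sketch; function names are illustrative):

```c
/* Greatest common divisor via Euclid's algorithm. */
int gcd(int a, int b) {
    while (b != 0) { int t = a % b; a = b; b = t; }
    return a < 0 ? -a : a;
}

/* GCD test for A[a*i + b] (write) vs A[c*i + d] (read):
   if GCD(c, a) does not divide d - b, dependence is impossible.
   Returns 1 when dependence is possible, 0 when it is ruled out. */
int gcd_test(int a, int b, int c, int d) {
    int g = gcd(c, a);
    if (g == 0) return b == d;   /* degenerate case: a == c == 0 */
    return (d - b) % g == 0;
}
```

As the slide notes, the test is conservative: a result of 1 means dependence is possible, not certain.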
21. Reducing impact of dependent computations
- Copy propagation (used in loop unrolling):
-   DADDUI R1, R2, 4          becomes    DADDUI R1, R2, 8
-   DADDUI R1, R1, 4
- Tree height reduction:
-   ADD R1, R2, R3            becomes    ADD R1, R2, R3
-   ADD R4, R1, R6                       ADD R4, R6, R7
-   ADD R8, R4, R7                       ADD R8, R1, R4
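Both transformations preserve the computed values; a C sketch (register names reused as variable names for illustration):

```c
/* Copy propagation: two dependent constant adds collapse into one. */
long copy_prop_before(long r2) { long r1 = r2 + 4; r1 = r1 + 4; return r1; }
long copy_prop_after(long r2)  { return r2 + 8; }

/* Tree height reduction: ((r2+r3)+r6)+r7 is a dependence chain of depth 3;
   regrouped as (r2+r3)+(r6+r7) the depth is 2, so the two inner adds can
   issue in the same cycle. */
long tree_before(long r2, long r3, long r6, long r7) {
    long r1 = r2 + r3;
    long r4 = r1 + r6;
    return r4 + r7;
}
long tree_after(long r2, long r3, long r6, long r7) {
    long r1 = r2 + r3;
    long r4 = r6 + r7;
    return r1 + r4;
}
```

Tree height reduction relies on associativity, which is exact for integers but can change rounding for floating point.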
22. Software pipelining - symbolic loop unrolling
- [Figure: iterations 0-6 drawn as overlapped, staggered boxes; the pipelined loop body is assembled from one stage of each of several iterations in flight]
23. Software pipelining example
- Original loop:
- Loop: L.D    F0, 0(R1)
-       ADD.D  F4, F0, F2
-       S.D    F4, 0(R1)
-       DADDUI R1, R1, -8
-       BNE    R1, R2, Loop
- Software-pipelined loop:
- Loop: S.D    F4, 16(R1)   ; store for iteration i
-       ADD.D  F4, F0, F2   ; add for iteration i+1
-       L.D    F0, 0(R1)    ; load for iteration i+2
-       DADDUI R1, R1, -8
-       BNE    R1, R2, Loop
- iteration i:   L.D F0,0(R1)  ADD.D F4,F0,F2  S.D F4,0(R1)
- iteration i+1: L.D F0,0(R1)  ADD.D F4,F0,F2  S.D F4,0(R1)
- iteration i+2: L.D F0,0(R1)  ADD.D F4,F0,F2  S.D F4,0(R1)
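The same pipelining can be expressed in C for the loop x[i] = x[i] + s (a sketch; function names are illustrative, and n >= 2 is assumed). The steady-state body stores for iteration i-2, adds for iteration i-1, and loads for iteration i, with a prologue to fill the pipeline and an epilogue to drain it:

```c
#include <stddef.h>

/* Plain version: for (i = 0; i < n; i++) x[i] = x[i] + s; */
void plain(double *x, size_t n, double s) {
    for (size_t i = 0; i < n; i++)
        x[i] = x[i] + s;
}

/* Software-pipelined version (assumes n >= 2). */
void pipelined(double *x, size_t n, double s) {
    double loaded, added;
    loaded = x[0];              /* prologue: load for iteration 0  */
    added  = loaded + s;        /*           add  for iteration 0  */
    loaded = x[1];              /*           load for iteration 1  */
    for (size_t i = 2; i < n; i++) {
        x[i - 2] = added;       /* store for iteration i-2         */
        added    = loaded + s;  /* add   for iteration i-1         */
        loaded   = x[i];        /* load  for iteration i           */
    }
    x[n - 2] = added;           /* epilogue: store for iteration n-2    */
    x[n - 1] = loaded + s;      /*           add+store for iteration n-1 */
}
```

The three statements in the loop body are independent of one another, which is exactly what gives a VLIW scheduler its freedom; the cost, as the next slide notes, is the extra registers (here `loaded` and `added`) that carry values between iterations.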
24. Difficulties in software pipelining
- Overheads: increased register requirement
- Register management required
- Loop body may be complex
- It may require several transformations before pipelining
25. Global code scheduling
- Consider regions of code which are larger than basic blocks
- Include multiple basic blocks and conditionals
- How to move code across branch and join points?
26. Static scheduling and branch prediction
- Static branch prediction is helpful with
  - delayed branches
  - static scheduling
- Methods
  - Fixed prediction
  - Opcode based prediction
  - Address based prediction
  - Profile driven prediction (misprediction rate roughly 10%, roughly 100 instructions between mispredictions)
27. Branch prediction and scheduling
-      LD    R1, 0(R2)
-      DSUBU R1, R1, R3
-      BEQZ  R1, L
-      OR    R4, R5, R6
-      DADDU R10, R4, R3
- L:   DADDU R7, R8, R9
- Move A: when the branch is predicted as not taken and R4 is not needed in the taken path
- Move B: when the branch is predicted as taken and R7 is not needed in the fall-through path
28. Global code scheduling
- When can the assignment to B be moved before the comparison?
- When can the assignment to C be moved above the join point? Above the comparison?
- [Figure: flow graph -- A[i] = A[i] + B[i]; test A[i] == 0?; B[i] = ... on the predicted path; X on the other path; C[i] = ... after the join]
29. Trace scheduling
30. Region for global scheduling
- Trace
  - linear path through code (with high probability)
  - multiple entries and exits
- Superblock
  - linear path with single entry, multiple exits
- Hyperblock
  - superblock plus internal control flow
- Treegion
  - tree with single entry, multiple exits
- Trace-2
  - loop free region
31. Trace
- [Figure: control flow graph of basic blocks B1-B6 with branch frequencies 90/10, 70/30, 80/20 and 90/10; the high-probability path through the graph is selected as the trace]
32. Superblock
- [Figure: the same control flow graph B1-B6 with the same frequencies, with the superblock marked along the most likely path]
33. Superblock with tail duplication
- [Figure: graph B1-B6 in which B4 and B6 are tail-duplicated so the superblock has a single entry; edge frequencies are split accordingly (56/24, 14/6, 50.4/39.6, 5.6/4.4)]
34. Hyperblock
- [Figure: the same graph formed into a hyperblock, which admits internal control flow across B2/B3; B6 is tail-duplicated, with edge frequencies split as 72/18 and 8/2]
35. Treegion
- [Figure: the graph expanded into a treegion, a tree with single entry and multiple exits; B4 and B5 are duplicated and B6 appears four times, with the frequencies split along the branches (56/24, 14/6, 50.4, 21.6, 12.6, 5.6, 5.4, 2.4, 1.4, 0.6)]