Title: CSL718 : VLIW - Software Driven ILP
1. CSL718 VLIW - Software Driven ILP
- Compiler Support for Exposing and Exploiting ILP
- 1st Apr, 2006
2. Code Scheduling for VLIW
- Objective is to move code around and form packets of concurrently executable instructions
- Two possibilities
  - Local: work on a straight-line piece of code (basic block), i.e., do not go across conditional branches
  - Global: code can move across conditional branches
- Loops need to be tackled in both cases
3. Pipeline scheduling example
- for (i = 1000; i > 0; i--)
-   x[i] = x[i] + s;
- Loop: L.D    F0, 0(R1)
-       ADD.D  F4, F0, F2
-       S.D    F4, 0(R1)
-       DADDUI R1, R1, -8
-       BNE    R1, R2, Loop
4. Latency due to data hazards

  Producer instruction | Consumer instruction | Latency (cycles)
  FP ALU op            | FP ALU op            | 3
  FP ALU op            | Store double         | 2
  Load double          | FP ALU op            | 1
  Load double          | Store double         | 0

- Assume no structural hazards
5. Straightforward scheduling
- Loop L.D F0, 0(R1) 1
- stall 2
- ADD.D F4, F0, F2 3
- stall 4
- stall 5
- S.D F4, 0(R1) 6
- DADDUI R1, R1, -8 7
- stall 8
- BNE R1, R2, Loop 9
- stall 10
6. A better schedule
- Loop L.D F0, 0(R1) 1
- DADDUI R1, R1, -8 2
- ADD.D F4, F0, F2 3
- stall 4
- BNE R1, R2, Loop 5
- S.D F4, 8(R1) 6
7. Loop unrolling
- Loop: L.D    F0, 0(R1)
-       ADD.D  F4, F0, F2
-       S.D    F4, 0(R1)     ; cycle 6
-       L.D    F0, -8(R1)
-       ADD.D  F4, F0, F2
-       S.D    F4, -8(R1)    ; cycle 12
-       L.D    F0, -16(R1)
-       ADD.D  F4, F0, F2
-       S.D    F4, -16(R1)   ; cycle 18
-       L.D    F0, -24(R1)
-       ADD.D  F4, F0, F2
-       S.D    F4, -24(R1)   ; cycle 24
-       DADDUI R1, R1, -32
-       BNE    R1, R2, Loop  ; cycle 28
- 28 cycles / 4 iterations = 7 cycles per iteration
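The unrolled schedule above is the assembly-level view of a source transformation that can be sketched in C (function names are illustrative; as on the slide, the element count is assumed to be a multiple of 4):

```c
#include <stddef.h>

/* Original loop of slide 3: for (i = 1000; i > 0; i--) x[i] = x[i] + s; */
void add_scalar(double *x, size_t n, double s) {
    for (size_t i = n; i > 0; i--)
        x[i] = x[i] + s;
}

/* Unrolled 4 times: one copy of the loop overhead (decrement + branch)
   is amortized over four element updates. Assumes n is a multiple of 4. */
void add_scalar_unrolled4(double *x, size_t n, double s) {
    for (size_t i = n; i > 0; i -= 4) {
        x[i]     = x[i]     + s;
        x[i - 1] = x[i - 1] + s;
        x[i - 2] = x[i - 2] + s;
        x[i - 3] = x[i - 3] + s;
    }
}
```

Note that this C-level unrolling still reuses the same names in every copy; the register renaming of slide 8 (F0/F4 becoming F6/F8, F10/F12, F14/F16) is what removes the false dependences between the copies.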
8. Re-scheduling
- Loop: L.D    F0, 0(R1)
-       L.D    F6, -8(R1)
-       L.D    F10, -16(R1)
-       L.D    F14, -24(R1)  ; cycle 4
-       ADD.D  F4, F0, F2
-       ADD.D  F8, F6, F2
-       ADD.D  F12, F10, F2
-       ADD.D  F16, F14, F2  ; cycle 8
-       S.D    F4, 0(R1)
-       S.D    F8, -8(R1)    ; cycle 10
-       DADDUI R1, R1, -32
-       S.D    F12, 16(R1)   ; 16 - 32 = -16, cycle 12
-       BNE    R1, R2, Loop
-       S.D    F16, 8(R1)    ; 8 - 32 = -24, cycle 14
- 14 cycles / 4 iterations = 3.5 cycles per iteration
9. Limits to unrolling
- Decrease in amount of loop overhead amortized with each additional unroll
- Growth in code size
- Register renaming leads to register pressure
10. Scheduling example with a 2-issue processor

  Memory/integer op         FP operation         Cycle
- Loop: L.D F0, 0(R1)                              1
-       L.D F6, -8(R1)                             2
-       L.D F10, -16(R1)     ADD.D F4, F0, F2      3
-       L.D F14, -24(R1)     ADD.D F8, F6, F2      4
-       L.D F18, -32(R1)     ADD.D F12, F10, F2    5
-       S.D F4, 0(R1)        ADD.D F16, F14, F2    6
-       S.D F8, -8(R1)       ADD.D F20, F18, F2    7
-       S.D F12, -16(R1)                           8
-       DADDUI R1, R1, -40                         9
-       S.D F16, 16(R1)                            10
-       BNE R1, R2, Loop                           11
-       S.D F20, 8(R1)                             12
11. Scheduling example with a 5-issue processor

  Mem ref 1        Mem ref 2        FP op 1           FP op 2           Int/branch        Cycle
  L.D F0,0(R1)     L.D F6,-8(R1)                                                           1
  L.D F10,-16(R1)  L.D F14,-24(R1)                                                         2
  L.D F18,-32(R1)  L.D F22,-40(R1)  ADD.D F4,F0,F2    ADD.D F8,F6,F2                       3
  L.D F26,-48(R1)                   ADD.D F12,F10,F2  ADD.D F16,F14,F2                     4
                                    ADD.D F20,F18,F2  ADD.D F24,F22,F2                     5
  S.D F4,0(R1)     S.D F8,-8(R1)    ADD.D F28,F26,F2                                       6
  S.D F12,-16(R1)  S.D F16,-24(R1)                                      DADDUI R1,R1,-56   7
  S.D F20,24(R1)   S.D F24,16(R1)                                                          8
  S.D F28,8(R1)                                                         BNE R1,R2,Loop     9
12. Scheduling Results

  Schedule                         Cycles/iteration
  Straightforward scheduling       10
  With instruction re-ordering     6
  With loop unrolling (4 times)    7
  Unrolling + re-ordering          3.5
  Scheduling on 2-issue VLIW       2.4
  Scheduling on 5-issue VLIW       1.3
13. Loop Level Parallelism
- Dependences in context of a loop
  - Dependence within an iteration
  - Dependence across iterations, or loop-carried dependence
- Example with no loop-carried dependence:
- for (i = 1000; i > 0; i--)
-   x[i] = x[i] + s;
- There is dependence only within an iteration.
14. Example with loop carried dependence
- for (i = 1; i <= 100; i++) {
-   A[i+1] = A[i] + C[i];    /* S1 */
-   B[i+1] = B[i] + A[i+1];  /* S2 */
- }
- Assume that arrays are distinct and non-overlapping
- S1 uses a value of A from the previous iteration
  - restricts overlapping of different iterations
- S2 uses a value of A from the same iteration
  - restricts movement of instructions within an iteration
15. Another example
- for (i = 1; i <= 100; i++) {
-   A[i] = A[i] + B[i];    /* S1 */
-   B[i+1] = C[i] + D[i];  /* S2 */
- }
- S1 uses a value computed by S2 in the previous iteration.
- Still, iterations can be parallelized: there is no cycle among the dependences.
- A transformation can remove the loop-carried dependence.
16. Transformed loop of previous example
- A[1] = A[1] + B[1];
- for (i = 1; i <= 99; i++) {
-   B[i+1] = C[i] + D[i];
-   A[i+1] = A[i+1] + B[i+1];
- }
- Now there is no loop-carried dependence. The iterations can be parallelized, preserving the dependence within each iteration.
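The equivalence of the two loops can be checked directly in C (a sketch; function names are illustrative). One detail: the original loop also writes B[101], so a single tail store after the transformed loop is needed for the B array to match exactly:

```c
#define N 100

/* Loop of slide 15: loop-carried dependence from S2 to S1. */
void original_loop(double A[], double B[], const double C[], const double D[]) {
    for (int i = 1; i <= N; i++) {
        A[i] = A[i] + B[i];        /* S1 */
        B[i + 1] = C[i] + D[i];    /* S2 */
    }
}

/* Transformed loop of slide 16: the dependence is now inside each iteration. */
void transformed_loop(double A[], double B[], const double C[], const double D[]) {
    A[1] = A[1] + B[1];
    for (int i = 1; i <= N - 1; i++) {
        B[i + 1] = C[i] + D[i];
        A[i + 1] = A[i + 1] + B[i + 1];
    }
    B[N + 1] = C[N] + D[N];  /* tail store so B matches the original exactly */
}
```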
17. Dependence distance
- for (i = 6; i <= 100; i++)
-   B[i] = B[i-5] + B[i];
- Dependence distance is 5.
- This gives an opportunity for parallelization in spite of the loop-carried dependence.
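Because the dependence distance is 5, any five consecutive iterations are mutually independent and may execute in any order (or in parallel). A C sketch that demonstrates this by running each block of five iterations in reverse order (function names are illustrative):

```c
#define M 100

/* Sequential version of the slide's loop. */
void dist_seq(double B[]) {
    for (int i = 6; i <= M; i++)
        B[i] = B[i - 5] + B[i];
}

/* Same loop, each block of 5 iterations executed in reverse order.
   Legal because the dependence distance is 5: no iteration in a block
   reads a value written by another iteration of the same block. */
void dist_blocked(double B[]) {
    for (int i = 6; i <= M; i += 5)
        for (int j = i + 4; j >= i; j--)
            if (j <= M)
                B[j] = B[j - 5] + B[j];
}
```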
18. Finding dependence
- Important for
  - code scheduling
  - loop parallelization
  - removal of false dependences
- Analysis is complicated by
  - pointers and passing of parameters by reference, and the consequent potential for aliasing
  - use of complex expressions as indices of arrays
19. Analysis with affine indices
- for (i = m; i < n; i++)
-   A[a*i + b] = A[c*i + d] + B[i];
- Can a*i + b become equal to c*i + d for some values of i within the range m to n?
- Difficult to determine in general: a, b, c, d may not be known at compile time, and could depend on other loop indices.
- When a, b, c, d are constants, we can use the GCD test. Dependence implies that GCD(c, a) must divide d - b.
20. Example with affine indices
- for (i = 1; i <= 100; i++)
-   A[4*i + 1] = A[6*i + 4] + B[i];
- GCD(4, 6) is 2 and 4 - 1 is 3.
- 2 does not divide 3; therefore, the two indices can never take the same value.
- values of 4*i + 1: 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, ...
- values of 6*i + 4: 10, 16, 22, 28, 34, 40, 46, 52, 58, ...
- Sometimes GCD(c, a) may divide d - b, yet the dependence may not exist: the test is conservative.
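The GCD test is straightforward to implement (a sketch; function names are illustrative):

```c
/* Greatest common divisor via Euclid's algorithm. */
int gcd(int a, int b) {
    while (b != 0) { int t = a % b; a = b; b = t; }
    return a < 0 ? -a : a;
}

/* GCD test for A[a*i + b] (write) vs A[c*i + d] (read):
   if GCD(c, a) does not divide d - b, dependence is impossible.
   Returns 1 when dependence is possible, 0 when it is ruled out. */
int gcd_test(int a, int b, int c, int d) {
    int g = gcd(c, a);
    if (g == 0) return b == d;   /* degenerate case: a == c == 0 */
    return (d - b) % g == 0;
}
```

As the slide notes, the test is conservative: a result of 1 means dependence is possible, not certain.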
21. Reducing impact of dependent computations
- Copy propagation (used in loop unrolling):
-   DADDUI R1, R2, 4          becomes    DADDUI R1, R2, 8
-   DADDUI R1, R1, 4
- Tree height reduction:
-   ADD R1, R2, R3            becomes    ADD R1, R2, R3
-   ADD R4, R1, R6                       ADD R4, R6, R7
-   ADD R8, R4, R7                       ADD R8, R1, R4
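Both transformations preserve the computed values; a C sketch (register names reused as variable names for illustration):

```c
/* Copy propagation: two dependent constant adds collapse into one. */
long copy_prop_before(long r2) { long r1 = r2 + 4; r1 = r1 + 4; return r1; }
long copy_prop_after(long r2)  { return r2 + 8; }

/* Tree height reduction: ((r2+r3)+r6)+r7 is a dependence chain of depth 3;
   regrouped as (r2+r3)+(r6+r7) the depth is 2, so the two inner adds can
   issue in the same cycle. */
long tree_before(long r2, long r3, long r6, long r7) {
    long r1 = r2 + r3;
    long r4 = r1 + r6;
    return r4 + r7;
}
long tree_after(long r2, long r3, long r6, long r7) {
    long r1 = r2 + r3;
    long r4 = r6 + r7;
    return r1 + r4;
}
```

Tree height reduction relies on associativity, which is exact for integers but can change rounding for floating point.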
22. Software pipelining - symbolic loop unrolling
- [Figure: iterations 0-6 drawn as overlapped, staggered boxes; the pipelined loop body is assembled from one stage of each of several iterations in flight]
23. Software pipelining example
- Original loop:
- Loop: L.D    F0, 0(R1)
-       ADD.D  F4, F0, F2
-       S.D    F4, 0(R1)
-       DADDUI R1, R1, -8
-       BNE    R1, R2, Loop
- Software-pipelined loop:
- Loop: S.D    F4, 16(R1)   ; store for iteration i
-       ADD.D  F4, F0, F2   ; add for iteration i+1
-       L.D    F0, 0(R1)    ; load for iteration i+2
-       DADDUI R1, R1, -8
-       BNE    R1, R2, Loop
- iteration i:   L.D F0,0(R1)  ADD.D F4,F0,F2  S.D F4,0(R1)
- iteration i+1: L.D F0,0(R1)  ADD.D F4,F0,F2  S.D F4,0(R1)
- iteration i+2: L.D F0,0(R1)  ADD.D F4,F0,F2  S.D F4,0(R1)
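The same pipelining can be expressed in C for the loop x[i] = x[i] + s (a sketch; function names are illustrative, and n >= 2 is assumed). The steady-state body stores for iteration i-2, adds for iteration i-1, and loads for iteration i, with a prologue to fill the pipeline and an epilogue to drain it:

```c
#include <stddef.h>

/* Plain version: for (i = 0; i < n; i++) x[i] = x[i] + s; */
void plain(double *x, size_t n, double s) {
    for (size_t i = 0; i < n; i++)
        x[i] = x[i] + s;
}

/* Software-pipelined version (assumes n >= 2). */
void pipelined(double *x, size_t n, double s) {
    double loaded, added;
    loaded = x[0];              /* prologue: load for iteration 0  */
    added  = loaded + s;        /*           add  for iteration 0  */
    loaded = x[1];              /*           load for iteration 1  */
    for (size_t i = 2; i < n; i++) {
        x[i - 2] = added;       /* store for iteration i-2         */
        added    = loaded + s;  /* add   for iteration i-1         */
        loaded   = x[i];        /* load  for iteration i           */
    }
    x[n - 2] = added;           /* epilogue: store for iteration n-2    */
    x[n - 1] = loaded + s;      /*           add+store for iteration n-1 */
}
```

The three statements in the loop body are independent of one another, which is exactly what gives a VLIW scheduler its freedom; the cost, as the next slide notes, is the extra registers (here `loaded` and `added`) that carry values between iterations.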
24. Difficulties in software pipelining
- Overheads: increased register requirement
- Register management required
- Loop body may be complex
- It may require several transformations before pipelining
25. Global code scheduling
- Consider regions of code which are larger than basic blocks
- Include multiple basic blocks and conditionals
- How to move code across branch and join points?
26. Static scheduling and branch prediction
- Static branch prediction is helpful with
  - delayed branches
  - static scheduling
- Methods
  - Fixed prediction
  - Opcode based prediction
  - Address based prediction
  - Profile driven prediction (misprediction rate roughly 10%, roughly 100 instructions between mispredictions)
27. Branch prediction and scheduling
-      LD    R1, 0(R2)
-      DSUBU R1, R1, R3
-      BEQZ  R1, L
-      OR    R4, R5, R6
-      DADDU R10, R4, R3
- L:   DADDU R7, R8, R9
- Move A: when the branch is predicted as not taken and R4 is not needed in the taken path
- Move B: when the branch is predicted as taken and R7 is not needed in the fall-through path
28. Global code scheduling
- When can the assignment to B be moved before the comparison?
- When can the assignment to C be moved above the join point? Above the comparison?
- [Figure: flow graph -- A[i] = A[i] + B[i]; test A[i] == 0?; B[i] = ... on the predicted path; X on the other path; C[i] = ... after the join]
29. Trace scheduling
30. Region for global scheduling
- Trace
  - linear path through code (with high probability)
  - multiple entries and exits
- Superblock
  - linear path with single entry, multiple exits
- Hyperblock
  - superblock plus internal control flow
- Treegion
  - tree with single entry, multiple exits
- Trace-2
  - loop free region
31. Trace
- [Figure: control flow graph of basic blocks B1-B6 with branch frequencies 90/10, 70/30, 80/20 and 90/10; the high-probability path through the graph is selected as the trace]
32. Superblock
- [Figure: the same control flow graph B1-B6 with the same frequencies, with the superblock marked along the most likely path]
33. Superblock with tail duplication
- [Figure: graph B1-B6 in which B4 and B6 are tail-duplicated so the superblock has a single entry; edge frequencies are split accordingly (56/24, 14/6, 50.4/39.6, 5.6/4.4)]
34. Hyperblock
- [Figure: the same graph formed into a hyperblock, which admits internal control flow across B2/B3; B6 is tail-duplicated, with edge frequencies split as 72/18 and 8/2]
35. Treegion
- [Figure: the graph expanded into a treegion, a tree with single entry and multiple exits; B4 and B5 are duplicated and B6 appears four times, with the frequencies split along the branches (56/24, 14/6, 50.4, 21.6, 12.6, 5.6, 5.4, 2.4, 1.4, 0.6)]