1
CSL718: VLIW - Software Driven ILP
  • Compiler Support for Exposing and Exploiting ILP
  • 1st Apr, 2006

2
Code Scheduling for VLIW
  • Objective: move code around and form packets of concurrently
    executable instructions
  • Two possibilities
  • Local
  • work on a straight-line piece of code (a basic block), i.e., do
    not move code across conditional branches
  • Global
  • code can move across conditional branches
  • Loops need to be tackled in both cases

3
Pipeline scheduling example
  • for (i = 1000; i > 0; i--)
  •   x[i] = x[i] + s;
  • Loop L.D F0, 0(R1)
  • ADD.D F4, F0, F2
  • S.D F4, 0(R1)
  • DADDUI R1, R1, -8
  • BNE R1, R2, Loop

4
Latency due to data hazards
Producer instruction   Consumer instruction   Latency (stall cycles)
FP ALU op              FP ALU op              3
FP ALU op              Store double           2
Load double            FP ALU op              1
Load double            Store double           0
Assume no structural hazards
5
Straightforward scheduling
  • Loop L.D F0, 0(R1) 1
  • stall 2
  • ADD.D F4, F0, F2 3
  • stall 4
  • stall 5
  • S.D F4, 0(R1) 6
  • DADDUI R1, R1, -8 7
  • stall 8
  • BNE R1, R2, Loop 9
  • stall 10

6
A better schedule
  • Loop L.D F0, 0(R1) 1
  • DADDUI R1, R1, -8 2
  • ADD.D F4, F0, F2 3
  • stall 4
  • BNE R1, R2, Loop 5
  • S.D F4, 8(R1) 6
  • (S.D fills the branch delay slot; its offset becomes 8 because
    DADDUI has already decremented R1 by 8)

7
Loop unrolling
  • Loop L.D F0, 0(R1)
  • ADD.D F4, F0, F2
  • S.D F4, 0(R1) 6
  • L.D F0, -8(R1)
  • ADD.D F4, F0, F2
  • S.D F4, -8(R1) 12
  • L.D F0, -16(R1)
  • ADD.D F4, F0, F2
  • S.D F4, -16(R1) 18
  • L.D F0, -24(R1)
  • ADD.D F4, F0, F2
  • S.D F4, -24(R1) 24
  • DADDUI R1, R1, -32
  • BNE R1, R2, Loop 28
    28 cycles / 4 iterations = 7 cycles per iteration (a source-level
    sketch of the unrolling follows below)
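For reference, a C-level sketch of the same unroll-by-four transformation;
the function name is illustrative, and, as on the slide, the trip count is
assumed to be a multiple of 4:

    /* Unroll-by-4 version of:  for (i = 1000; i > 0; i--) x[i] = x[i] + s;
       No clean-up loop is needed because the trip count divides evenly. */
    void scale_add_unrolled(double *x, int n, double s) {
        for (int i = n; i > 0; i -= 4) {   /* one pass handles 4 elements */
            x[i]     = x[i]     + s;
            x[i - 1] = x[i - 1] + s;
            x[i - 2] = x[i - 2] + s;
            x[i - 3] = x[i - 3] + s;
        }
    }

Unrolling amortizes the loop-counter update and the branch over four
elements, which is exactly what the single DADDUI/BNE pair at the bottom of
the unrolled assembly achieves.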

8
Re-scheduling
  • Loop L.D F0, 0(R1)
  • L.D F6, -8(R1)
  • L.D F10, -16(R1)
  • L.D F14, -24(R1) 4
  • ADD.D F4, F0, F2
  • ADD.D F8, F6, F2
  • ADD.D F12, F10, F2
  • ADD.D F16, F14, F2 8
  • S.D F4, 0(R1)
  • S.D F8, -8(R1) 10
  • DADDUI R1, R1, -32
  • S.D F12, -16(R1) 12
  • BNE R1, R2, Loop
  • S.D F16, -24(R1) 14
    14 cycles / 4 iterations = 3.5 cycles per iteration

9
Limits to unrolling
  • Decrease in amount of loop overhead amortized
    with each unroll
  • Growth in code size
  • Register renaming leads to register pressure

10
Scheduling example with 2 issue proc
  Cycle  Integer/memory instruction   FP instruction
    1    Loop L.D F0, 0(R1)
    2    L.D F6, -8(R1)
    3    L.D F10, -16(R1)             ADD.D F4, F0, F2
    4    L.D F14, -24(R1)             ADD.D F8, F6, F2
    5    L.D F18, -32(R1)             ADD.D F12, F10, F2
    6    S.D F4, 0(R1)                ADD.D F16, F14, F2
    7    S.D F8, -8(R1)               ADD.D F20, F18, F2
    8    S.D F12, -16(R1)
    9    DADDUI R1, R1, -40
   10    S.D F16, 16(R1)
   11    BNE R1, R2, Loop
   12    S.D F20, 8(R1)
  12 cycles for 5 unrolled iterations = 2.4 cycles per iteration

11
Scheduling example with 5 issue proc
  One packet issues per cycle, with up to 2 memory references, 2 FP
  operations and 1 integer/branch operation per packet:
  Cycle 1: L.D F0, 0(R1)    | L.D F6, -8(R1)
  Cycle 2: L.D F10, -16(R1) | L.D F14, -24(R1)
  Cycle 3: L.D F18, -32(R1) | L.D F22, -40(R1) | ADD.D F4, F0, F2 | ADD.D F8, F6, F2
  Cycle 4: L.D F26, -48(R1) | ADD.D F12, F10, F2 | ADD.D F16, F14, F2
  Cycle 5: ADD.D F20, F18, F2 | ADD.D F24, F22, F2
  Cycle 6: S.D F4, 0(R1)    | S.D F8, -8(R1)   | ADD.D F28, F26, F2
  Cycle 7: S.D F12, -16(R1) | S.D F16, -24(R1) | DADDUI R1, R1, -56
  Cycle 8: S.D F20, 24(R1)  | S.D F24, 16(R1)
  Cycle 9: S.D F28, 8(R1)   | BNE R1, R2, Loop
  9 cycles for 7 unrolled iterations ≈ 1.3 cycles per iteration

12
Scheduling Results
  • cycles/iteration
  • Straightforward scheduling: 10
  • With instruction re-ordering: 6
  • With loop unrolling (4 times): 7
  • Unrolling + re-ordering: 3.5
  • Scheduling on 2-issue VLIW: 2.4 (12 cycles / 5 iterations)
  • Scheduling on 5-issue VLIW: 1.3 (9 cycles / 7 iterations)

13
Loop Level Parallelism
  • Dependences in the context of a loop
  • Dependence within an iteration
  • Dependence across iterations, i.e., loop-carried dependence
  • Example with no loop-carried dependence:
  • for (i = 1000; i > 0; i--)
  •   x[i] = x[i] + s;
  • There is a dependence within each iteration (load, add, store), but
    none across iterations (a parallel C sketch follows below).
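Because no value flows between iterations, the iterations may execute in
any order or concurrently. A minimal sketch, assuming an OpenMP-capable
compiler; the pragma and function name are illustrative, not part of the
slides:

    /* Each iteration touches only x[i], so there is no loop-carried
       dependence and the iterations can safely run in parallel. */
    void scale_add_parallel(double *x, long n, double s) {
        #pragma omp parallel for
        for (long i = 1; i <= n; i++)
            x[i] = x[i] + s;
    }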

14
Example with loop carried dependence
  • for (i = 1; i <= 100; i++) {
  •   A[i+1] = A[i] + C[i];    /* S1 */
  •   B[i+1] = B[i] + A[i+1];  /* S2 */
  • }
  • Assume that the arrays are distinct and non-overlapping
  • S1 uses a value of A computed by S1 in the previous iteration
  • this loop-carried dependence restricts overlapping of different
    iterations
  • S2 uses a value of A computed by S1 in the same iteration
  • this restricts movement of instructions within an iteration

15
Another example
  • for (i = 1; i <= 100; i++) {
  •   A[i] = A[i] + B[i];    /* S1 */
  •   B[i+1] = C[i] + D[i];  /* S2 */
  • }
  • S1 uses the value of B[i] computed by S2 in the previous iteration.
  • Still, the iterations can be parallelized: there is no cycle among
    the dependences.
  • A transformation can remove the loop-carried dependence.

16
Transformed loop of previous example
  • A[1] = A[1] + B[1];
  • for (i = 1; i <= 99; i++) {
  •   B[i+1] = C[i] + D[i];
  •   A[i+1] = A[i+1] + B[i+1];
  • }
  • B[101] = C[100] + D[100];   /* final store, now outside the loop */
  • Now there is no loop-carried dependence, so the iterations can be
    parallelized while preserving the dependence within each iteration.

17
Dependence distance
  • for (i = 6; i <= 100; i++)
  •   B[i] = B[i-5] + B[i];
  • The dependence distance is 5.
  • This gives an opportunity for parallelization in spite of the
    loop-carried dependence (see the sketch below).
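A C-level sketch of how a distance of 5 can be exploited: any 5 consecutive
iterations are mutually independent, so the loop can be blocked into groups
of 5 whose members run in parallel. The function name and the OpenMP pragma
are illustrative; the bounds match the slide (95 iterations, a multiple of 5):

    void update_blocked(double *B) {
        /* The outer loop walks blocks of 5 iterations. Within a block,
           every value read (B[i+k-5]) was written by an earlier block,
           so the 5 iterations of the inner loop are independent. */
        for (int i = 6; i <= 100; i += 5) {
            #pragma omp parallel for
            for (int k = 0; k < 5; k++)
                B[i + k] = B[i + k - 5] + B[i + k];
        }
    }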

18
Finding dependence
  • Important for
  • code scheduling
  • loop parallelization
  • removal of false dependences
  • Analysis is complicated by
  • pointers and passing of parameters by reference
    and consequent potential for aliasing
  • use of complex expressions as indices of arrays

19
Analysis with affine indices
  • for (i = m; i <= n; i++)
  •   A[a*i + b] = A[c*i + d] + B[i];
  • Can a*i + b become equal to c*i + d for some values of i within the
    range m to n?
  • Difficult to determine in general: a, b, c, d may not be known at
    compile time, and could depend on other loop indices.
  • When a, b, c, d are constants, we can use the GCD test: a dependence
    implies that GCD(c, a) must divide (d - b). A sketch of the test
    follows below.
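A minimal C sketch of the GCD test as stated above; the function names and
interface are illustrative, not from the slides:

    #include <stdbool.h>
    #include <stdlib.h>

    /* Greatest common divisor of two non-negative integers. */
    static long gcd(long x, long y) {
        while (y != 0) { long t = x % y; x = y; y = t; }
        return x;
    }

    /* GCD test for the references A[a*i + b] (write) and A[c*i + d] (read).
       Returns false only when a dependence is impossible; true means a
       dependence MAY exist -- the test is conservative. */
    static bool gcd_test_may_depend(long a, long b, long c, long d) {
        long g = gcd(labs(a), labs(c));
        if (g == 0)                /* a == c == 0: indices are constants b, d */
            return b == d;
        return (d - b) % g == 0;   /* dependence requires g to divide (d - b) */
    }

For the example on the next slide (a = 4, b = 1, c = 6, d = 4),
gcd_test_may_depend(4, 1, 6, 4) returns false, matching the conclusion that
the two references can never touch the same element.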

20
Example with affine indices
  • for (i = 1; i <= 100; i++)
  •   A[4*i + 1] = A[6*i + 4] + B[i];
  • GCD(4, 6) is 2 and (4 - 1) is 3.
  • 2 does not divide 3, therefore the two indices can never take the
    same value.
  • values of 4*i + 1: 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, ...
  • values of 6*i + 4: 10, 16, 22, 28, 34, 40, 46, 52, 58, ...
  • Sometimes GCD(c, a) may divide (d - b) even though no dependence
    exists; the test is conservative.

21
Reducing impact of dependent computations
  • Copy propagation (used in loop unrolling)
  •   before:  DADDUI R1, R2, 4
  •            DADDUI R1, R1, 4
  •   after:   DADDUI R1, R2, 8
  • Tree height reduction (a source-level sketch follows below)
  •   before:  ADD R1, R2, R3
  •            ADD R4, R1, R6
  •            ADD R8, R4, R7
  •   after:   ADD R1, R2, R3
  •            ADD R4, R6, R7
  •            ADD R8, R1, R4
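A source-level view of tree height reduction; the function and variable
names are illustrative:

    /* The serial chain  r1 = a + b;  r4 = r1 + c;  r8 = r4 + d;  needs
       three dependent additions. Re-associating the sum lets two of the
       additions issue in the same cycle. */
    long sum4(long a, long b, long c, long d) {
        long left  = a + b;      /* independent of 'right'             */
        long right = c + d;      /* can execute in parallel with it    */
        return left + right;     /* critical path: 2 adds instead of 3 */
    }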
22
Software pipelining: symbolic loop unrolling
[Figure: seven overlapping iterations (iteration 0 to iteration 6); the
software-pipelined loop body is assembled from instructions belonging to
different, consecutive iterations]
23
Software pipelining example
  • Original loop:
  •   Loop L.D F0, 0(R1)
  •        ADD.D F4, F0, F2
  •        S.D F4, 0(R1)
  •        DADDUI R1, R1, -8
  •        BNE R1, R2, Loop
  • Software-pipelined loop body:
  •   Loop S.D F4, 16(R1)      ; store for iteration i
  •        ADD.D F4, F0, F2    ; add for iteration i+1
  •        L.D F0, 0(R1)       ; load for iteration i+2
  •        DADDUI R1, R1, -8
  •        BNE R1, R2, Loop
  • Each pass through the body mixes the S.D of iteration i, the ADD.D
    of iteration i+1 and the L.D of iteration i+2; prologue and epilogue
    code is needed to start up and drain the pipeline. A C-level sketch
    follows below.
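A C-level sketch of the same software-pipelined structure, with an explicit
prologue (start-up), kernel and epilogue (wind-down); the function name is
illustrative and the code assumes n >= 3:

    /* Software-pipelined form of:  for (i = n; i > 0; i--) x[i] = x[i] + s;
       The kernel mirrors the S.D / ADD.D / L.D body above. */
    void scale_add_swp(double *x, long n, double s) {
        double loaded = x[n];           /* prologue: load for iteration n   */
        double summed = loaded + s;     /* prologue: add  for iteration n   */
        loaded = x[n - 1];              /* prologue: load for iteration n-1 */
        for (long i = n; i > 2; i--) {  /* kernel */
            x[i]   = summed;            /* store: finishes the current iteration  */
            summed = loaded + s;        /* add:   for the next iteration (x[i-1]) */
            loaded = x[i - 2];          /* load:  for the one after that (x[i-2]) */
        }
        x[2] = summed;                  /* epilogue: finish element x[2] */
        x[1] = loaded + s;              /* epilogue: finish element x[1] */
    }

The two extra live values (one loaded element and one pending sum) are the
increased register requirement mentioned on the next slide.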
24
Difficulties in software pipelining
  • Overheads - increased register requirement
  • Register management required
  • Loop body may be complex
  • It may require several transformations before
    pipelining

25
Global code scheduling
  • Consider regions of code which are larger than
    basic blocks
  • Include multiple basic blocks and conditionals
  • How to move code across branch and join points?

26
Static scheduling and branch prediction
  • Static branch prediction is helpful with
  • delayed branches
  • static scheduling
  • Methods
  • Fixed prediction
  • Opcode based prediction
  • Address based prediction
  • Profile driven prediction (misprediction rate about 10%, about 100
    instructions between mispredictions)

27
Branch prediction and scheduling
  • LD R1, 0(R2)
  • DSUBU R1, R1, R3
  • BEQZ R1, L
  • OR R4, R5, R6
  • DADDU R10, R4, R3
  • L DADDU R7, R8, R9
  • A: OR R4, R5, R6 can be moved up across the branch when the branch
    is predicted as not taken and R4 is not needed on the taken path
  • B: DADDU R7, R8, R9 can be moved up across the branch when the
    branch is predicted as taken and R7 is not needed on the
    fall-through (not-taken) path
28
Global code scheduling
  • When can the assignment to B be moved before the comparison?
  • When can the assignment to C be moved above the join point? Above
    the comparison?

[Figure: flow graph for A[i] = A[i] + B[i]; test A[i] == 0?; B[i] = ... on
the predicted path, X on the other path; C[i] = ... after the join point]
29
Trace scheduling
30
Region for global scheduling
  • Trace
  • linear path through code (with high probability)
  • multiple entries and exits
  • Superblock
  • linear path with single entry, multiple exits
  • Hyperblock
  • superblock plus internal control flow
  • Treegion
  • tree with single entry, multiple exits
  • Trace-2
  • loop free region

31
Trace
[Figure: control-flow graph with basic blocks B1-B6 annotated with branch
and edge frequencies; a trace is the most frequently executed linear path
through this region]
32
Superblock
[Figure: the same control-flow graph (B1-B6 with edge frequencies); the
superblock is the single-entry, multiple-exit portion of the trace]
33
Superblock with tail duplication
[Figure: the same control-flow graph after tail duplication: B4 and B6 are
duplicated so that side entrances into the superblock are removed, and the
edge frequencies are redistributed between the original and duplicated
blocks; a source-level sketch of tail duplication follows below]
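A source-level picture of what tail duplication does; the condition and
statements are illustrative, not taken from the figure:

    /* Before: 'y = x + 1;' is a join block. The frequent path (the
       then-branch followed by the tail) has a side entrance from the
       else-branch, so it cannot form a single-entry superblock. */
    long before(long x, int likely_taken) {
        long y;
        if (likely_taken) x = x * 2;   /* frequent path */
        else              x = x - 1;   /* rare path     */
        y = x + 1;                     /* shared tail (join block) */
        return y;
    }

    /* After tail duplication: each path has its own copy of the tail, so
       the frequent path is a single-entry, multiple-exit region that can
       be scheduled as one superblock. */
    long after(long x, int likely_taken) {
        long y;
        if (likely_taken) { x = x * 2; y = x + 1; }   /* superblock      */
        else              { x = x - 1; y = x + 1; }   /* duplicated tail */
        return y;
    }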
34
Hyperblock
[Figure: a hyperblock formed on the same control-flow graph; unlike a
superblock it retains internal control flow, and B6 is duplicated with the
edge frequencies redistributed]
35
Treegion
[Figure: a treegion formed on the same control-flow graph: a tree of basic
blocks rooted at B1, with B4, B5 and B6 duplicated along the different
paths and the edge frequencies redistributed accordingly]