Title: Lecture 5: Introduction to Advanced Pipelining
1Lecture 5 Introduction to Advanced Pipelining
2Arithmetic Pipeline
- Floating-point operations cannot complete in one cycle during the EX stage. Giving EX more time would either lengthen the pipeline cycle time or force subsequent instructions to stall.
- The solution is to break the FP EX stage into several stages whose delay matches the cycle time of the instruction pipeline.
- Such an FP (arithmetic) pipeline does not reduce latency, but it can be decoupled from the integer unit and increases throughput for a sequence of FP instructions.
- What is a vector instruction, or a vector computer?
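The latency-versus-throughput point can be made concrete with a little arithmetic. The sketch below is only an illustration under simple assumptions: all operations are independent, the pipelined unit accepts one new operation per cycle, and the function names are invented for this example.

```python
def cycles_unpipelined(n_ops, stages):
    """Each operation occupies the whole unit for `stages` cycles."""
    return n_ops * stages


def cycles_pipelined(n_ops, stages):
    """The first result appears after `stages` cycles, then one result per cycle."""
    if n_ops == 0:
        return 0
    return stages + (n_ops - 1)


# A single operation takes just as long either way (latency is unchanged),
# but a long sequence finishes far sooner on the pipelined unit.
print(cycles_unpipelined(1, 4), cycles_pipelined(1, 4))      # 4 4
print(cycles_unpipelined(100, 4), cycles_pipelined(100, 4))  # 400 103
```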
3MIPS R4000 Floating Point
- FP adder, FP multiplier, FP divider
- Last step of the FP multiplier/divider uses the FP adder hardware
- 8 kinds of stages in the FP units:

  Stage  Functional unit  Description
  A      FP adder         Mantissa ADD stage
  D      FP divider       Divide pipeline stage
  E      FP multiplier    Exception test stage
  M      FP multiplier    First stage of multiplier
  N      FP multiplier    Second stage of multiplier
  R      FP adder         Rounding stage
  S      FP adder         Operand shift stage
  U                       Unpack FP numbers
4MIPS FP Pipe Stages
  FP instruction   Pipe stages (one entry per clock cycle, starting at cycle 1)
  Add, subtract    U, S+A, A+R, R+S
  Multiply         U, E+M, M, M, M, N, N+A, R
  Divide           U, A, R, D^28, D+A, D+R, D+R, D+A, D+R, A, R
  Square root      U, E, (A+R)^108, A, R
  Negate           U, S
  Absolute value   U, S
  FP compare       U, A, R

- X+Y means both stages are used in the same cycle; ^n means the stage is repeated n times.
- Stages:
  - A  Mantissa ADD stage
  - D  Divide pipeline stage
  - E  Exception test stage
  - M  First stage of multiplier
  - N  Second stage of multiplier
  - R  Rounding stage
  - S  Operand shift stage
  - U  Unpack FP numbers
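The stage sequences can be used to check whether two FP operations issued a few cycles apart would need the same stage in the same cycle, which is a structural hazard. The following is a minimal sketch, assuming each stage letter stands for a single, non-replicated hardware resource; the names `PIPE_STAGES`, `stage_usage`, and `conflicts` are invented for illustration, and only the short sequences are encoded.

```python
# Stage sequences for a few FP operations; "X+Y" means both stages are
# busy in the same cycle.
PIPE_STAGES = {
    "add":      ["U", "S+A", "A+R", "R+S"],
    "multiply": ["U", "E+M", "M", "M", "M", "N", "N+A", "R"],
    "negate":   ["U", "S"],
    "compare":  ["U", "A", "R"],
}


def stage_usage(op):
    """Return one set of busy stages per cycle after issue."""
    return [set(cycle.split("+")) for cycle in PIPE_STAGES[op]]


def conflicts(first, second, delay):
    """List (cycle, stages) pairs where `second`, issued `delay` cycles
    after `first`, needs a stage that `first` is still using.
    Cycles are counted from the issue of `first`, starting at 0."""
    a = stage_usage(first)
    b = stage_usage(second)
    clashes = []
    for i, stages_b in enumerate(b):
        cycle = i + delay
        if cycle < len(a):
            shared = a[cycle] & stages_b
            if shared:
                clashes.append((cycle, sorted(shared)))
    return clashes


print(conflicts("multiply", "add", 1))  # []
print(conflicts("multiply", "add", 4))  # [(6, ['A']), (7, ['R'])]
print(conflicts("multiply", "add", 6))  # []
```

Under these assumptions, an add issued 4 cycles after a multiply collides with the multiply's final A and R stages, while an add issued 1 or 6 cycles later does not.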
5Pipeline with Floating point operations
- Example of an FP pipeline integrated with the instruction pipeline: Figs. A.31, A.32, and A.33, distributed in class.
- The pipeline consists of one integer unit with 1 stage, one FP/integer multiply unit with 7 stages, one FP adder with 4 stages, and an FP/integer divider with 24 stages.
- Fig. A.31 shows the pipeline, A.32 shows the execution of independent instructions, and A.33 shows the effect of data dependency.
- The impact of data dependency is severe. The possibility of out-of-order execution => different hazards are created, to be considered later.
6R4000 Performance
- CPI is not the ideal value of 1, because of:
  - Load stalls (1 or 2 clock cycles)
  - Branch stalls (2 cycles plus unfilled slots)
  - FP result stalls: RAW data hazard (latency)
  - FP structural stalls: not enough FP hardware (parallelism)
7FP Loop Where are the Hazards?
    Loop: LD    F0,0(R1)    ; F0 = vector element
          ADDD  F4,F0,F2    ; add scalar from F2
          SD    0(R1),F4    ; store result
          SUBI  R1,R1,8     ; decrement pointer by 8 bytes (DW)
          BNEZ  R1,Loop     ; branch if R1 != zero
          NOP               ; delayed branch slot

  Instruction producing result   Instruction using result   Latency in clock cycles
  FP ALU op                      Another FP ALU op          3
  FP ALU op                      Store double               2
  Load double                    FP ALU op                  1
  Load double                    Store double               0
  Integer op                     Integer op                 0
8FP Loop Hazards
    Loop: LD    F0,0(R1)    ; F0 = vector element
          ADDD  F4,F0,F2    ; add scalar in F2
          SD    0(R1),F4    ; store result
          SUBI  R1,R1,8     ; decrement pointer by 8 bytes (DW)
          BNEZ  R1,Loop     ; branch if R1 != zero
          NOP               ; delayed branch slot
9FP Loop Showing Stalls
    1 Loop: LD    F0,0(R1)    ; F0 = vector element
    2       stall
    3       ADDD  F4,F0,F2    ; add scalar in F2
    4       stall
    5       stall
    6       SD    0(R1),F4    ; store result
    7       SUBI  R1,R1,8     ; decrement pointer by 8 bytes (DW)
    8       BNEZ  R1,Loop     ; branch if R1 != zero
    9       stall             ; delayed branch slot
- 9 clocks per iteration. Can the code be rewritten to minimize stalls?
10Minimizing Stalls Technique 1 Compiler
Optimization
    1 Loop: LD    F0,0(R1)
    2       stall
    3       ADDD  F4,F0,F2
    4       SUBI  R1,R1,8
    5       BNEZ  R1,Loop     ; delayed branch
    6       SD    8(R1),F4    ; offset altered when moved past SUBI

- Swap BNEZ and SD by changing the address of SD.
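The cycle counts for the original and the rescheduled loop can be reproduced by a small in-order issue model driven by the latency table from the FP loop slides. This is only a sketch under stated assumptions: one instruction issues per cycle, producer/consumer pairs missing from the table (for example integer op to branch) are assumed to need no stall, and the tuple encoding of instructions and the name `issue_cycles` are invented for this example.

```python
# Stall cycles required between a producing instruction and a dependent
# consuming instruction, as given in the latency table.
LATENCY = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
    ("int", "int"): 0,
}


def issue_cycles(instrs):
    """Return the 1-based issue cycle of each instruction.
    Each instruction is (kind, destination register or None, source registers)."""
    last_write = {}  # register -> (kind of producer, its issue cycle)
    cycles = []
    cycle = 0
    for kind, dest, srcs in instrs:
        earliest = cycle + 1  # in-order, at most one issue per cycle
        for reg in srcs:
            if reg in last_write:
                producer_kind, producer_cycle = last_write[reg]
                # Pairs not in the table are assumed to need no stall.
                stall = LATENCY.get((producer_kind, kind), 0)
                earliest = max(earliest, producer_cycle + stall + 1)
        cycle = earliest
        cycles.append(cycle)
        if dest is not None:
            last_write[dest] = (kind, cycle)
    return cycles


ORIGINAL = [
    ("load",   "F0", ("R1",)),       # LD   F0,0(R1)
    ("fp_alu", "F4", ("F0", "F2")),  # ADDD F4,F0,F2
    ("store",  None, ("R1", "F4")),  # SD   0(R1),F4
    ("int",    "R1", ("R1",)),       # SUBI R1,R1,8
    ("branch", None, ("R1",)),       # BNEZ R1,Loop
    ("nop",    None, ()),            # delayed branch slot
]

SCHEDULED = [
    ("load",   "F0", ("R1",)),       # LD   F0,0(R1)
    ("fp_alu", "F4", ("F0", "F2")),  # ADDD F4,F0,F2
    ("int",    "R1", ("R1",)),       # SUBI R1,R1,8
    ("branch", None, ("R1",)),       # BNEZ R1,Loop
    ("store",  None, ("R1", "F4")),  # SD   8(R1),F4 fills the delay slot
]

print(issue_cycles(ORIGINAL))   # [1, 3, 6, 7, 8, 9]
print(issue_cycles(SCHEDULED))  # [1, 3, 4, 5, 6]
```

In this model the last instruction of the original loop issues in cycle 9 and that of the rescheduled loop in cycle 6, so moving SUBI and BNEZ into the shadow of the ADDD latency and filling the delay slot with SD removes three of the four idle cycles.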
11Technique 2 Loop Unrolling (Software Pipelining)
     1 Loop: LD    F0,0(R1)
     2       ADDD  F4,F0,F2      ; 1 cycle delay
     3       SD    0(R1),F4      ; drop SUBI & BNEZ; 2 cycles delay
     4       LD    F6,-8(R1)
     5       ADDD  F8,F6,F2      ; 1 cycle delay
     6       SD    -8(R1),F8     ; drop SUBI & BNEZ; 2 cycles delay
     7       LD    F10,-16(R1)
     8       ADDD  F12,F10,F2    ; 1 cycle delay
     9       SD    -16(R1),F12   ; drop SUBI & BNEZ; 2 cycles delay
    10       LD    F14,-24(R1)
    11       ADDD  F16,F14,F2    ; 1 cycle delay
    12       SD    -24(R1),F16   ; 2 cycles delay
    13       SUBI  R1,R1,32      ; alter to 4*8
    14       BNEZ  R1,LOOP
    15       NOP

- 1 cycle delay for an FP operation after a load; 2 cycles delay for a store after an FP operation.
- 15 + 4 x (1 + 2) = 27 clock cycles, or 6.8 per iteration
- Loop unrolling is essential for ILP processors. Why?
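At the source level, the transformation amounts to replicating the loop body and adjusting the index step. The sketch below illustrates the idea in Python rather than assembly; it assumes the element count is a multiple of 4 (the assembly loop implicitly assumes the same, since it decrements R1 by 32 and tests for zero), and the function names are invented for this example.

```python
def add_scalar(x, s):
    """Rolled loop: one element per iteration, walking down from the end."""
    i = len(x) - 1
    while i >= 0:
        x[i] = x[i] + s
        i -= 1


def add_scalar_unrolled4(x, s):
    """Same computation unrolled 4 times.
    Assumes len(x) is a multiple of 4, so no cleanup loop is needed."""
    assert len(x) % 4 == 0
    i = len(x) - 1
    while i >= 0:
        # Four copies of the body; the loop overhead (index update and
        # branch test) is now paid once per four elements.
        x[i] = x[i] + s
        x[i - 1] = x[i - 1] + s
        x[i - 2] = x[i - 2] + s
        x[i - 3] = x[i - 3] + s
        i -= 4


a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
b = list(a)
add_scalar(a, 0.5)
add_scalar_unrolled4(b, 0.5)
print(a == b)  # True
```

The four body copies are independent of one another, which is what gives a scheduler room to interleave them and hide the load and FP latencies.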
12Minimize Stall Loop Unrolling
     1 Loop: LD    F0,0(R1)
     2       LD    F6,-8(R1)
     3       LD    F10,-16(R1)
     4       LD    F14,-24(R1)
     5       ADDD  F4,F0,F2
     6       ADDD  F8,F6,F2
     7       ADDD  F12,F10,F2
     8       ADDD  F16,F14,F2
     9       SD    0(R1),F4
    10       SD    -8(R1),F8
    11       SD    -16(R1),F12
    12       SUBI  R1,R1,32
    13       BNEZ  R1,LOOP       ; delayed branch
    14       SD    8(R1),F16     ; 8 - 32 = -24

- 14 clock cycles, or 3.5 per iteration
- When is it safe to move instructions?
- What assumptions were made when the code was moved?
  - OK to move the store past SUBI even though SUBI changes the register
  - OK to move loads before stores: do they still get the right data?
- When is it safe for the compiler to make such changes?
13Compiler Perspectives on Code Movement
- Definitions: the compiler is concerned about dependencies in the program; whether a dependence becomes a HW hazard depends on the given pipeline.
- Try to schedule to avoid hazards
- (True) data dependence (RAW if a hazard for HW):
  - Instruction i produces a result used by instruction j, or
  - Instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i.
- If dependent, can't execute in parallel
- Easy to determine for registers (fixed names)
- Hard for memory:
  - Does 100(R4) = 20(R6)?
  - From different loop iterations, does 20(R6) = 20(R6)?
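Why memory dependences are hard can be seen by computing effective addresses: whether two references touch the same location depends on register values that are generally known only at run time. The following is a small sketch, assuming 8-byte (double-word) accesses; the function names and the register values are invented for illustration.

```python
def effective_address(offset, base_value):
    """Address referenced by offset(Rn), given the current value of Rn."""
    return offset + base_value


def overlaps(off_a, base_a, off_b, base_b, size=8):
    """True if two `size`-byte accesses touch at least one common byte."""
    a = effective_address(off_a, base_a)
    b = effective_address(off_b, base_b)
    return a < b + size and b < a + size


# Does 100(R4) = 20(R6)? It depends entirely on the register contents.
print(overlaps(100, 1000, 20, 1080))  # True:  1100 and 1100
print(overlaps(100, 1000, 20, 2000))  # False: 1100 and 2020

# Same operand text 20(R6) in two iterations: the addresses differ if R6
# was decremented by 8 in between, and coincide if it was not.
print(overlaps(20, 512, 20, 504))     # False
print(overlaps(20, 512, 20, 512))     # True
```

A compiler can only reorder such references when it can prove, for all possible register values, that the result of this check is false.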
14Compiler Perspectives on Code Movement
- Another kind of dependence is called name dependence: two instructions use the same name (register or memory location) but don't exchange data.
- Antidependence (WAR if a hazard for HW):
  - Instruction j writes a register or memory location that instruction i reads from, and instruction i is executed first.
- Output dependence (WAW if a hazard for HW):
  - Instruction i and instruction j write the same register or memory location; the ordering between the instructions must be preserved.
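For register names, the three kinds of dependence can be found by comparing the read and write sets of two instructions. The sketch below is one way to express the definitions, assuming instruction i precedes instruction j in program order and that each instruction is described by a hypothetical (writes, reads) pair of register-name sets.

```python
def classify(i, j):
    """Return the dependences of instruction j on the earlier instruction i.
    Each instruction is a (writes, reads) pair of sets of register names."""
    i_writes, i_reads = i
    j_writes, j_reads = j
    kinds = []
    if i_writes & j_reads:
        kinds.append("RAW")  # true data dependence: j reads what i wrote
    if i_reads & j_writes:
        kinds.append("WAR")  # antidependence: j overwrites what i read
    if i_writes & j_writes:
        kinds.append("WAW")  # output dependence: both write the same name
    return kinds


ld_first  = ({"F0"}, {"R1"})         # LD   F0,0(R1)
addd      = ({"F4"}, {"F0", "F2"})   # ADDD F4,F0,F2
ld_second = ({"F0"}, {"R1"})         # LD   F0,-8(R1)

print(classify(ld_first, addd))       # ['RAW']
print(classify(addd, ld_second))      # ['WAR']
print(classify(ld_first, ld_second))  # ['WAW']
```

Only the RAW case carries a value from one instruction to the other; the WAR and WAW cases arise purely because the name F0 is reused.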
15Example: Dependence Graph
Program order:
  I1. Load R1, A     / R1 <- Memory(A) /
  I2. Add  R2, R1    / R2 <- (R2) + (R1) /
  I3. Add  R3, R4    / R3 <- (R3) + (R4) /
  I4. Mul  R4, R5    / R4 <- (R4) * (R5) /
  I5. Comp R6        / R6 <- Not(R6) /
  I6. Mul  R6, R7    / R6 <- (R6) * (R7) /
Dependence graph (figure):
  I1 -> I2: flow dependence (RAW)
  I3 -> I4: antidependence (WAR)
  I5 -> I6: output dependence, also flow dependence (WAW and RAW)
- Due to superscalar processing, it is possible that I4 completes before I3 starts. Similarly, the value of R6 depends on the beginning and end of I5 and I6. Unpredictable result!
- A sample program and its dependence graph, where I2 and I3 share the adder, and I4 and I6 share the same multiplier. These two resource dependences can be removed by duplicating the resources, or by using pipelined adders and multipliers.
16Register Renaming
- Rewrite the previous program as:
  I1. R1b <- Memory(A)
  I2. R2b <- (R2a) + (R1b)
  I3. R3b <- (R3a) + (R4a)
  I4. R4b <- (R4a) * (R5a)
  I5. R6b <- -(R6a)
  I6. R6c <- (R6b) * (R7a)
- Allocate more registers and rename the registers that do not really have a flow dependency. The WAR hazard between I3 and I4 and the WAW hazard between I5 and I6 have been removed.
- These two hazards are also called name dependencies.
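The renaming can be mechanized by giving every write a fresh version of its destination register and making every read refer to the most recent version. The following is a minimal sketch that reproduces the a/b/c suffixes used above; the program encoding (destination plus source register names, with the memory operand A left out) and the function name `rename` are assumptions made for this example.

```python
from string import ascii_lowercase


def rename(program):
    """Rename registers so that each write gets a fresh name.
    `program` is a list of (dest, sources) tuples of register names.
    Version 'a' denotes the value a register held before the program ran."""
    version = {}  # register -> index of its current version (0 = 'a')
    renamed = []
    for dest, srcs in program:
        # Sources are looked up first: they read the current versions.
        new_srcs = tuple(r + ascii_lowercase[version.get(r, 0)] for r in srcs)
        # The destination then moves on to a fresh version.
        version[dest] = version.get(dest, 0) + 1
        renamed.append((dest + ascii_lowercase[version[dest]], new_srcs))
    return renamed


PROGRAM = [
    ("R1", ()),            # I1. Load R1, A
    ("R2", ("R2", "R1")),  # I2. Add  R2, R1
    ("R3", ("R3", "R4")),  # I3. Add  R3, R4
    ("R4", ("R4", "R5")),  # I4. Mul  R4, R5
    ("R6", ("R6",)),       # I5. Comp R6
    ("R6", ("R6", "R7")),  # I6. Mul  R6, R7
]

for dest, srcs in rename(PROGRAM):
    print(dest, "<-", ", ".join(srcs))
```

After renaming, I4 writes R4b while I3 reads R4a, and I5 and I6 write R6b and R6c, so only the true flow dependences (I1 to I2 through R1b, I5 to I6 through R6b) remain.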
17Where are the name dependencies?
     1 Loop: LD    F0,0(R1)
     2       ADDD  F4,F0,F2
     3       SD    0(R1),F4      ; drop SUBI & BNEZ
     4       LD    F0,-8(R1)
     5       ADDD  F4,F0,F2
     6       SD    -8(R1),F4     ; drop SUBI & BNEZ
     7       LD    F0,-16(R1)
     8       ADDD  F4,F0,F2
     9       SD    -16(R1),F4    ; drop SUBI & BNEZ
    10       LD    F0,-24(R1)
    11       ADDD  F4,F0,F2
    12       SD    -24(R1),F4
    13       SUBI  R1,R1,32      ; alter to 4*8
    14       BNEZ  R1,LOOP
    15       NOP

- How can we remove them?
18Where are the name dependencies?
     1 Loop: LD    F0,0(R1)
     2       ADDD  F4,F0,F2
     3       SD    0(R1),F4      ; drop SUBI & BNEZ
     4       LD    F6,-8(R1)
     5       ADDD  F8,F6,F2
     6       SD    -8(R1),F8     ; drop SUBI & BNEZ
     7       LD    F10,-16(R1)
     8       ADDD  F12,F10,F2
     9       SD    -16(R1),F12   ; drop SUBI & BNEZ
    10       LD    F14,-24(R1)
    11       ADDD  F16,F14,F2
    12       SD    -24(R1),F16
    13       SUBI  R1,R1,32      ; alter to 4*8
    14       BNEZ  R1,LOOP
    15       NOP

- This is called register renaming.
19Detailed Scoreboard Pipeline Control