Lecture 5: Introduction to Advanced Pipelining - PowerPoint PPT Presentation

Description:

Title: Lecture 3: R4000 + Intro to ILP Author: David A. Patterson Last modified by: Dr. Laxmi N. Bhuyan Created Date: 9/4/1996 7:14:34 AM Document presentation format – PowerPoint PPT presentation

Slides: 18
Provided by: DavidAPa1
Learn more at: http://www.cs.ucr.edu


1
Lecture 5 Introduction to Advanced Pipelining
  • L.N. Bhuyan
  • CS 162

2
Arithmetic Pipeline
  • Floating-point operations cannot be completed in
    one cycle during the EX stage. Allowing them more
    time would either lengthen the pipeline cycle time
    or force subsequent instructions to stall
  • The solution is to break the FP EX stage into
    several stages whose delay matches the cycle time
    of the instruction pipeline
  • Such an FP (arithmetic) pipeline does not reduce
    latency, but it can be decoupled from the integer
    unit and increases throughput for a sequence of
    FP instructions
  • What is a vector instruction and/or a vector
    computer?

3
MIPS R4000 Floating Point
  • FP Adder, FP Multiplier, FP Divider
  • Last step of FP Multiplier/Divider uses FP Adder
    HW
  • 8 kinds of stages in FP units:

    Stage  Functional unit  Description
    A      FP adder         Mantissa ADD stage
    D      FP divider       Divide pipeline stage
    E      FP multiplier    Exception test stage
    M      FP multiplier    First stage of multiplier
    N      FP multiplier    Second stage of multiplier
    R      FP adder         Rounding stage
    S      FP adder         Operand shift stage
    U                       Unpack FP numbers

4
MIPS FP Pipe Stages
    FP Instr        1  2    3    4     5    6    7    8
    Add, Subtract   U  S+A  A+R  R+S
    Multiply        U  E+M  M    M     M    N    N+A  R
    Divide          U  A    R    D^28  D+A  D+R, D+R, D+A, D+R, A, R
    Square root     U  E    (A+R)^108  A    R
    Negate          U  S
    Absolute value  U  S
    FP compare      U  A    R

  • Notation: X+Y means both stages are used in the
    same clock cycle; ^n means the stage is repeated
    n times
  • Stages
  • A Mantissa ADD stage
  • D Divide pipeline stage
  • E Exception test stage
  • M First stage of multiplier
  • N Second stage of multiplier
  • R Rounding stage
  • S Operand shift stage
  • U Unpack FP numbers
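
The stage table above is enough to check where two FP instructions would compete for the same hardware. The Python sketch below is illustrative only. It assumes that a combined entry (S and A in cycle 2 of an add, for example) keeps both stages busy in that cycle, and the names `MULTIPLY`, `ADD`, and `conflicting_offsets` are hypothetical, not part of the lecture.

```python
# Stage usage per clock cycle, transcribed from the stage table.
# Assumption: a combined entry such as S+A occupies both stages in that cycle.
MULTIPLY = [{"U"}, {"E", "M"}, {"M"}, {"M"}, {"M"}, {"N"}, {"N", "A"}, {"R"}]
ADD = [{"U"}, {"S", "A"}, {"A", "R"}, {"R", "S"}]


def conflicting_offsets(first, second, max_offset):
    """Return the issue offsets k (in cycles) at which `second`, issued k
    cycles after `first`, needs a stage that `first` uses in the same cycle."""
    conflicts = []
    for k in range(1, max_offset + 1):
        for cycle, stages in enumerate(second):
            absolute = k + cycle  # cycle counted from the issue of `first`
            if absolute < len(first) and first[absolute] & stages:
                conflicts.append(k)
                break
    return conflicts


if __name__ == "__main__":
    # An add issued k cycles after a multiply: which k collide on a shared stage?
    print(conflicting_offsets(MULTIPLY, ADD, 7))
```

Under these assumptions the call returns `[4, 5]`: an add issued 4 or 5 cycles after a multiply collides with the multiply's final use of the adder stages (A, then R), which is the kind of FP structural stall that shared hardware causes.
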
5
Pipeline with Floating point operations
  • Example of an FP pipeline integrated with the
    instruction pipeline: Fig. A.31, A.32 and A.33,
    distributed in class
  • The FP pipeline consists of one integer unit with
    1 stage, one FP/integer multiply unit with 7
    stages, one FP adder with 4 stages, and an
    FP/integer divider with 24 stages
  • A.31 shows the pipeline, A.32 shows execution of
    independent instructions, and A.33 shows the
    effect of data dependency
  • Impact of data dependency is severe. Possibility
    of out-of-order execution => creates different
    hazards, to be considered later

6
R4000 Performance
  • Not ideal CPI of 1:
  • Load stalls (1 or 2 clock cycles)
  • Branch stalls (2 cycles + unfilled slots)
  • FP result stalls: RAW data hazard (latency)
  • FP structural stalls: not enough FP hardware
    (parallelism)

7
FP Loop Where are the Hazards?
    Loop: LD   F0,0(R1)   ;F0=vector element
          ADDD F4,F0,F2   ;add scalar from F2
          SD   0(R1),F4   ;store result
          SUBI R1,R1,8    ;decrement pointer 8B (DW)
          BNEZ R1,Loop    ;branch R1!=zero
          NOP             ;delayed branch slot

    Instruction producing result   Instruction using result   Latency in clock cycles
    FP ALU op                      Another FP ALU op          3
    FP ALU op                      Store double               2
    Load double                    FP ALU op                  1
    Load double                    Store double               0
    Integer op                     Integer op                 0

  • Where are the stalls?

8
FP Loop Hazards
    Loop: LD   F0,0(R1)   ;F0=vector element
          ADDD F4,F0,F2   ;add scalar in F2
          SD   0(R1),F4   ;store result
          SUBI R1,R1,8    ;decrement pointer 8B (DW)
          BNEZ R1,Loop    ;branch R1!=zero
          NOP             ;delayed branch slot

    Instruction producing result   Instruction using result   Latency in clock cycles
    FP ALU op                      Another FP ALU op          3
    FP ALU op                      Store double               2
    Load double                    FP ALU op                  1
    Load double                    Store double               0
    Integer op                     Integer op                 0

9
FP Loop Showing Stalls
    1 Loop: LD   F0,0(R1)   ;F0=vector element
    2       stall
    3       ADDD F4,F0,F2   ;add scalar in F2
    4       stall
    5       stall
    6       SD   0(R1),F4   ;store result
    7       SUBI R1,R1,8    ;decrement pointer 8B (DW)
    8       BNEZ R1,Loop    ;branch R1!=zero
    9       stall           ;delayed branch slot

    Instruction producing result   Instruction using result   Latency in clock cycles
    FP ALU op                      Another FP ALU op          3
    FP ALU op                      Store double               2
    Load double                    FP ALU op                  1

  • 9 clocks: Rewrite code to minimize stalls?
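
The 9-clock count can be reproduced mechanically from the latency table. The following Python sketch is a simplified model, not the R4000's actual interlock logic: it assumes in-order, single-issue execution, treats any producer/consumer pair missing from the table as latency 0, and uses a hypothetical tuple encoding for instructions.

```python
# Latencies from the table: (producer class, consumer class) -> cycles that
# must separate the two instructions. Unlisted pairs default to 0 (assumption).
LATENCY = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
    ("int", "int"): 0,
}

# (opcode, class, destination register, source registers); hypothetical encoding.
LOOP = [
    ("LD", "load", "F0", ["R1"]),
    ("ADDD", "fp_alu", "F4", ["F0", "F2"]),
    ("SD", "store", None, ["R1", "F4"]),
    ("SUBI", "int", "R1", ["R1"]),
    ("BNEZ", "int", None, ["R1"]),
    ("NOP", "int", None, []),  # delayed branch slot
]


def issue_cycles(program):
    """Each instruction issues one cycle after the previous one, or later if
    the latency of one of its producers has not yet elapsed."""
    last_writer = {}  # register -> (issue cycle, class) of its latest producer
    schedule = []
    cycle = 0
    for opcode, kind, dest, sources in program:
        cycle += 1
        for reg in sources:
            if reg in last_writer:
                produced_at, producer_kind = last_writer[reg]
                gap = LATENCY.get((producer_kind, kind), 0)
                cycle = max(cycle, produced_at + gap + 1)
        if dest is not None:
            last_writer[dest] = (cycle, kind)
        schedule.append((cycle, opcode))
    return schedule


if __name__ == "__main__":
    for cycle, opcode in issue_cycles(LOOP):
        print(cycle, opcode)
```

With this model LD, ADDD, SD, SUBI, BNEZ and the delay slot issue in cycles 1, 3, 6, 7, 8 and 9, matching the stall positions listed for the loop.
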

10
Minimizing Stalls Technique 1: Compiler Optimization

    1 Loop: LD   F0,0(R1)
    2       stall
    3       ADDD F4,F0,F2
    4       SUBI R1,R1,8
    5       BNEZ R1,Loop    ;delayed branch
    6       SD   8(R1),F4   ;altered when moved past SUBI

Swap BNEZ and SD by changing the address of SD

    Instruction producing result   Instruction using result   Latency in clock cycles
    FP ALU op                      Another FP ALU op          3
    FP ALU op                      Store double               2
    Load double                    FP ALU op                  1

  • 6 clocks
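
Moving the store below SUBI only works because its offset changes from 0 to 8. A minimal Python check of that address arithmetic follows; the two function names and the sample pointer values are hypothetical, chosen only for illustration.

```python
def store_address_before_subi(r1):
    """Effective address of 'SD 0(R1),F4' executed before 'SUBI R1,R1,8'."""
    return r1 + 0


def store_address_after_subi(r1):
    """Effective address of 'SD 8(R1),F4' executed after 'SUBI R1,R1,8'."""
    r1 = r1 - 8    # SUBI has already decremented the pointer
    return r1 + 8  # the new offset compensates for the decrement


if __name__ == "__main__":
    for r1 in (8, 16, 800):  # arbitrary example pointer values
        assert store_address_before_subi(r1) == store_address_after_subi(r1)
    print("both forms of the store write the same address")
```
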

11
Technique 2: Loop Unrolling (Software Pipelining)

    1  Loop: LD   F0,0(R1)
    2        ADDD F4,F0,F2     ;1 cycle delay
    3        SD   0(R1),F4     ;drop SUBI & BNEZ, 2 cycles delay
    4        LD   F6,-8(R1)
    5        ADDD F8,F6,F2     ;1 cycle delay
    6        SD   -8(R1),F8    ;drop SUBI & BNEZ, 2 cycles delay
    7        LD   F10,-16(R1)
    8        ADDD F12,F10,F2   ;1 cycle delay
    9        SD   -16(R1),F12  ;drop SUBI & BNEZ, 2 cycles delay
    10       LD   F14,-24(R1)
    11       ADDD F16,F14,F2   ;1 cycle delay
    12       SD   -24(R1),F16  ;2 cycles delay
    13       SUBI R1,R1,32     ;alter to 4*8
    14       BNEZ R1,LOOP
    15       NOP

1 cycle delay for an FP operation after a load. 2 cycles delay for a
store after an FP operation.
15 + 4 x (1+2) = 27 clock cycles, or 6.8 per iteration
Loop unrolling is essential for ILP processors. Why?
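
The cycle count quoted for the unrolled loop generalizes to other unroll factors. The sketch below is a hedged back-of-the-envelope model in Python: it assumes the body is left unscheduled, that every copy pays the 1-cycle and 2-cycle delays, and that SUBI, BNEZ and the delay slot are paid once per pass. The function name `unscheduled_cycles` is hypothetical.

```python
def unscheduled_cycles(unroll_factor):
    """Clock cycles for one pass of the loop body copied `unroll_factor`
    times, with no reordering of instructions."""
    instructions = 3 * unroll_factor + 3  # LD/ADDD/SD per copy + SUBI, BNEZ, NOP
    stalls = unroll_factor * (1 + 2)      # 1 after each LD, 2 after each ADDD
    return instructions + stalls


if __name__ == "__main__":
    for k in (1, 4):
        total = unscheduled_cycles(k)
        print(f"unroll x{k}: {total} cycles, {total / k} per iteration")
```

For a factor of 1 this gives the original 9 cycles; for a factor of 4 it gives 27 cycles, or 6.75 per iteration, which the slide rounds to 6.8.
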
12
Minimize Stall Loop Unrolling
    1  Loop: LD   F0,0(R1)
    2        LD   F6,-8(R1)
    3        LD   F10,-16(R1)
    4        LD   F14,-24(R1)
    5        ADDD F4,F0,F2
    6        ADDD F8,F6,F2
    7        ADDD F12,F10,F2
    8        ADDD F16,F14,F2
    9        SD   0(R1),F4
    10       SD   -8(R1),F8
    11       SD   -16(R1),F12
    12       SUBI R1,R1,32
    13       BNEZ R1,LOOP      ;delayed branch
    14       SD   8(R1),F16    ;8-32 = -24

14 clock cycles, or 3.5 per iteration
When is it safe to move instructions?
  • What assumptions were made when the code was
    moved?
  • OK to move the store past SUBI even though SUBI
    changes the register
  • OK to move loads before stores: do they still get
    the right data?
  • When is it safe for the compiler to make such
    changes?
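
One way to probe these questions is to execute both instruction orders on the same data and compare the resulting memory. The Python sketch below does that for a single pass of the unrolled loop. It is a toy interpreter under stated assumptions: the branch is omitted, the four elements occupy distinct 8-byte slots, and the tuple encoding, `run`, `UNROLLED`, `SCHEDULED` and `MEMORY` are hypothetical names.

```python
def run(program, memory, r1, scalar):
    """Execute (op, a, b) tuples on a copy of `memory` and return the copy.
    LD:   (dest, offset)  dest <- memory[R1 + offset]
    ADDD: (dest, src)     dest <- src + F2
    SD:   (offset, src)   memory[R1 + offset] <- src
    SUBI: (amount, None)  R1 <- R1 - amount
    """
    mem = dict(memory)
    regs = {"R1": r1, "F2": scalar}
    for op, a, b in program:
        if op == "LD":
            regs[a] = mem[regs["R1"] + b]
        elif op == "ADDD":
            regs[a] = regs[b] + regs["F2"]
        elif op == "SD":
            mem[regs["R1"] + a] = regs[b]
        elif op == "SUBI":
            regs["R1"] -= a
    return mem


# Straightforward unrolled order: load, add, store for each element in turn.
UNROLLED = [
    ("LD", "F0", 0), ("ADDD", "F4", "F0"), ("SD", 0, "F4"),
    ("LD", "F6", -8), ("ADDD", "F8", "F6"), ("SD", -8, "F8"),
    ("LD", "F10", -16), ("ADDD", "F12", "F10"), ("SD", -16, "F12"),
    ("LD", "F14", -24), ("ADDD", "F16", "F14"), ("SD", -24, "F16"),
    ("SUBI", 32, None),
]

# Scheduled order: all loads first, last store moved past SUBI with offset 8.
SCHEDULED = [
    ("LD", "F0", 0), ("LD", "F6", -8), ("LD", "F10", -16), ("LD", "F14", -24),
    ("ADDD", "F4", "F0"), ("ADDD", "F8", "F6"),
    ("ADDD", "F12", "F10"), ("ADDD", "F16", "F14"),
    ("SD", 0, "F4"), ("SD", -8, "F8"), ("SD", -16, "F12"),
    ("SUBI", 32, None), ("SD", 8, "F16"),
]

# Four 8-byte elements at addresses 8, 16, 24, 32 (example data).
MEMORY = {addr: float(addr) for addr in range(8, 40, 8)}

if __name__ == "__main__":
    same = run(UNROLLED, MEMORY, 32, 1.5) == run(SCHEDULED, MEMORY, 32, 1.5)
    print("orders agree:", same)
```

The two orders agree here because no load reads a location that an earlier store in the same pass writes. If two offsets could name the same address, hoisting the loads above the stores would no longer be safe.
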

13
Compiler Perspectives on Code Movement
  • Definitions: the compiler is concerned about
    dependencies in the program; whether or not a
    dependence causes a HW hazard depends on the
    given pipeline
  • Try to schedule to avoid hazards
  • (True) Data dependencies (RAW if a hazard for HW)
  • Instruction i produces a result used by
    instruction j, or
  • Instruction j is data dependent on instruction k,
    and instruction k is data dependent on
    instruction i.
  • If dependent, can't execute in parallel
  • Easy to determine for registers (fixed names)
  • Hard for memory:
  • Does 100(R4) = 20(R6)?
  • From different loop iterations, does 20(R6) =
    20(R6)?
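
The memory question is hard because the answer depends on register values that are usually known only at run time. A small Python sketch makes this concrete; the function name `same_address` and the register values are hypothetical examples, not data from the lecture.

```python
def same_address(offset_a, base_a, offset_b, base_b, registers):
    """True if offset_a(base_a) and offset_b(base_b) name the same address
    for the given register contents."""
    return registers[base_a] + offset_a == registers[base_b] + offset_b


if __name__ == "__main__":
    # Same two operands, different register contents, different answers.
    print(same_address(100, "R4", 20, "R6", {"R4": 1000, "R6": 1080}))
    print(same_address(100, "R4", 20, "R6", {"R4": 1000, "R6": 2000}))

    # Across iterations the base register changes, so 20(R6) moves too.
    first_iteration = {"R6": 1080}
    next_iteration = {"R6": 1080 - 8}
    print(first_iteration["R6"] + 20 == next_iteration["R6"] + 20)
```

The first call prints `True` and the second `False`, so a compiler that cannot bound the register values must conservatively assume a dependence.
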

14
Compiler Perspectives on Code Movement
  • Another kind of dependence is called name
    dependence: two instructions use the same name
    (register or memory location) but don't exchange
    data
  • Antidependence (WAR if a hazard for HW)
  • Instruction j writes a register or memory
    location that instruction i reads from, and
    instruction i is executed first
  • Output dependence (WAW if a hazard for HW)
  • Instruction i and instruction j write the same
    register or memory location; the ordering between
    the instructions must be preserved.

15
EXAMPLE

    I1. Load R1, A    /R1 ← Memory(A)/
    I2. Add R2, R1    /R2 ← (R2)+(R1)/
    I3. Add R3, R4    /R3 ← (R3)+(R4)/
    I4. Mul R4, R5    /R4 ← (R4)*(R5)/
    I5. Comp R6       /R6 ← Not(R6)/
    I6. Mul R6, R7    /R6 ← (R6)*(R7)/

Dependence graph (program order I1 to I6):

    I1 -> I2   RAW           Flow dependence
    I3 -> I4   WAR           Antidependence
    I5 -> I6   WAW and RAW   Output dependence, also flow dependence

Due to superscalar processing, it is possible that I4 completes
before I3 starts. Similarly, the value of R6 depends on when I5 and
I6 begin and end. Unpredictable result!
A sample program and its dependence graph, where I2 and I3 share the
adder and I4 and I6 share the same multiplier. These two dependences
on shared hardware can be removed by duplicating the resources, or by
using pipelined adders and multipliers.
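
The three kinds of dependence in this example can be found mechanically from each instruction's read and write sets. The Python sketch below is a naive pairwise check under stated assumptions: it looks at registers only, ignores the memory operand A, and does not account for intervening writes (there are none here). The encoding and the function name `dependences` are hypothetical.

```python
from itertools import combinations

# (name, registers read, registers written), taken from the /.../ annotations.
PROGRAM = [
    ("I1", set(), {"R1"}),         # Load R1, A
    ("I2", {"R2", "R1"}, {"R2"}),  # Add R2, R1
    ("I3", {"R3", "R4"}, {"R3"}),  # Add R3, R4
    ("I4", {"R4", "R5"}, {"R4"}),  # Mul R4, R5
    ("I5", {"R6"}, {"R6"}),        # Comp R6
    ("I6", {"R6", "R7"}, {"R6"}),  # Mul R6, R7
]


def dependences(program):
    """Return (earlier, later, kinds) for every ordered pair of instructions
    whose register use constrains reordering."""
    found = []
    for earlier, later in combinations(program, 2):
        name_i, reads_i, writes_i = earlier
        name_j, reads_j, writes_j = later
        kinds = []
        if writes_i & reads_j:
            kinds.append("RAW")  # later reads what earlier wrote
        if reads_i & writes_j:
            kinds.append("WAR")  # later overwrites what earlier read
        if writes_i & writes_j:
            kinds.append("WAW")  # both write the same register
        if kinds:
            found.append((name_i, name_j, kinds))
    return found


if __name__ == "__main__":
    for name_i, name_j, kinds in dependences(PROGRAM):
        print(name_i, "->", name_j, kinds)
```

The sketch reports I1 -> I2 as RAW and I3 -> I4 as WAR. For I5 -> I6 it reports RAW, WAR and WAW: the extra WAR arises because I5 also reads R6 before I6 overwrites it, and it is already enforced by the RAW ordering.
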
16
Register Renaming
  • Rewrite the previous program as
  • I1. R1b ← Memory (A)
  • I2. R2b ← (R2a) + (R1b)
  • I3. R3b ← (R3a) + (R4a)
  • I4. R4b ← (R4a) * (R5a)
  • I5. R6b ← -(R6a)
  • I6. R6c ← (R6b) * (R7a)
  • Allocate more registers and rename the registers
    that do not really have a flow dependency. The WAR
    hazard between I3 and I4 and the WAW hazard between
    I5 and I6 have been removed.
  • These two hazards are also called name
    dependencies
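
The renaming rule used here (read the current version of each source, give every write a fresh name) is easy to automate. The Python sketch below is a minimal illustration. The suffix scheme a, b, c follows the slide, while the function name `rename`, the tuple encoding, and the limit of 26 versions per register are assumptions of this sketch.

```python
import string


def rename(program):
    """program: list of (destination, sources). Returns the renamed list,
    where each write to a register gets the next letter suffix."""
    version = {}  # register -> index of its current suffix in 'abc...'

    def current(reg):
        return reg + string.ascii_lowercase[version.setdefault(reg, 0)]

    renamed = []
    for dest, sources in program:
        new_sources = [current(r) for r in sources]      # read old versions first
        version[dest] = version.setdefault(dest, 0) + 1  # fresh name per write
        renamed.append((current(dest), new_sources))
    return renamed


PROGRAM = [
    ("R1", []),            # I1: R1 <- Memory(A)
    ("R2", ["R2", "R1"]),  # I2
    ("R3", ["R3", "R4"]),  # I3
    ("R4", ["R4", "R5"]),  # I4
    ("R6", ["R6"]),        # I5
    ("R6", ["R6", "R7"]),  # I6
]

if __name__ == "__main__":
    for number, (dest, sources) in enumerate(rename(PROGRAM), start=1):
        print(f"I{number}. {dest} <- {sources}")
```

Run on the six instructions, the sketch yields R1b, R2b, R3b, R4b, R6b and R6c as destinations, with I6 reading R6b: only the true flow dependences (I1 to I2, I5 to I6) remain.
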

17
Where are the name dependencies?
    1  Loop: LD   F0,0(R1)
    2        ADDD F4,F0,F2
    3        SD   0(R1),F4     ;drop SUBI & BNEZ
    4        LD   F0,-8(R1)
    5        ADDD F4,F0,F2
    6        SD   -8(R1),F4    ;drop SUBI & BNEZ
    7        LD   F0,-16(R1)
    8        ADDD F4,F0,F2
    9        SD   -16(R1),F4   ;drop SUBI & BNEZ
    10       LD   F0,-24(R1)
    11       ADDD F4,F0,F2
    12       SD   -24(R1),F4
    13       SUBI R1,R1,32     ;alter to 4*8
    14       BNEZ R1,LOOP
    15       NOP

How can we remove them?

18
Where are the name dependencies?
    1  Loop: LD   F0,0(R1)
    2        ADDD F4,F0,F2
    3        SD   0(R1),F4     ;drop SUBI & BNEZ
    4        LD   F6,-8(R1)
    5        ADDD F8,F6,F2
    6        SD   -8(R1),F8    ;drop SUBI & BNEZ
    7        LD   F10,-16(R1)
    8        ADDD F12,F10,F2
    9        SD   -16(R1),F12  ;drop SUBI & BNEZ
    10       LD   F14,-24(R1)
    11       ADDD F16,F14,F2
    12       SD   -24(R1),F16
    13       SUBI R1,R1,32     ;alter to 4*8
    14       BNEZ R1,LOOP
    15       NOP

Called register renaming

19
Detailed Scoreboard Pipeline Control