Pipelining Issues Lecture
1
Pipelining Issues, Lecture 5
  • David Andrews

2
Today's Agenda
  • Pipelines Provide Significant Performance
    Enhancements
  • Pipelines Also Introduce Special Problems:
    Hazards
  • Structural
  • Data
  • Control
  • Hazards Challenge Ideal Speedups
  • They increase CPI above the ideal of 1
  • Reduction in theoretical speedup
  • Common Techniques To Deal With Hazards
  • Structural hazards can be eliminated by more
    hardware
  • Data hazards can be minimized by forwarding
  • Control hazards can be minimized by delayed
    branching

3
For next time…
  • Homework: Ch. 3 problems 3.1, 3.3, 3.4
  • Read Chapter 4
  • This will complete the desktop CPU design
  • After the Chapter 4 material, we will delve into
    memory

4
Review: CPU Operates as a State Machine
  • CPUs perform standard operations:
  • IF: Instruction Fetch
  • All instructions do the same
  • ID: Instruction Decode
  • All instructions decoded
  • Some access registers
  • EX: Execute Instruction
  • Arithmetic operations
  • Address calculation
  • MEM: Memory Access
  • Ld/St fetch data to/from memory
  • WB: Write Back
  • Update register file with result

5
5 Steps of DLX Datapath (Figure 3.1, Page 130)
[Figure: the five pipeline stages (Instruction Fetch; Instr. Decode /
Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back), with the
IR and LMD pipeline registers]
6
Ideal Speedup for Pipelining
  • Assume each instruction takes time t_k on the
    non-pipelined machine, split evenly across k
    pipeline stages (pipelined clock cycle = t_k / k)
  • Assume we are executing N >> k instructions
  • We get the first result out k clocks from the start
  • We continue to output a result every clock, so N - 1
    clocks later we complete
  • Therefore:

    Speedup = (N x t_k) / ((k + N - 1) x t_k / k)
            = (N x k) / (k + N - 1)
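The derivation above can be checked numerically; a minimal sketch in Python (the function name is mine):

```python
def ideal_speedup(n, k):
    """Speedup of a k-stage pipeline over an unpipelined machine
    for n instructions, assuming the unpipelined instruction time
    splits evenly into k stage cycles."""
    unpipelined_cycles = n * k          # each instruction takes k stage-cycles
    pipelined_cycles = k + (n - 1)      # first result after k, then one per clock
    return unpipelined_cycles / pipelined_cycles

# As n grows, the speedup approaches the pipeline depth k:
print(ideal_speedup(10, 5))          # short run: well below 5
print(ideal_speedup(1_000_000, 5))   # long run: very close to 5
```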

7
Ideal Speedup for Pipeline
  • As N goes to infinity, speedup -> k
  • Interesting, eh…

8
It's Not That Easy for Computers
  • Limits to pipelining: hazards prevent the next
    instruction from executing during its designated
    clock cycle
  • Structural hazards: HW cannot support this
    combination of instructions (single person to
    fold and put clothes away)
  • Data hazards: instruction depends on the result of
    a prior instruction still in the pipeline (missing
    sock)
  • Control hazards: pipelining of branches and other
    instructions that change the PC
  • Common solution is to stall the pipeline until
    the hazard is resolved, inserting one or more
    bubbles in the pipeline

9
One Memory Port / Structural Hazards (Figure 3.6,
Page 142)
10
Example: Dual-port vs. Single-port
  • Machine A: dual-ported memory
  • Machine B: single-ported memory, but its
    pipelined implementation has a 1.05 times faster
    clock rate
  • Ideal CPI = 1 for both
  • Loads are 40% of instructions executed
  • SpeedUpA = Pipeline Depth / (1 + 0) x
    (clock_unpipe / clock_pipe)
  • = Pipeline Depth
  • SpeedUpB = Pipeline Depth / (1 + 0.4 x 1)
    x (clock_unpipe / (clock_unpipe / 1.05))
  • = (Pipeline Depth / 1.4) x 1.05
  • = 0.75 x Pipeline Depth
  • SpeedUpA / SpeedUpB = Pipeline
    Depth / (0.75 x Pipeline Depth) = 1.33
  • Machine A is 1.33 times faster
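The comparison above can be checked with a short script (a sketch; the helper name and the assumed depth of 5 are mine, and the final ratio is independent of depth):

```python
def pipeline_speedup(depth, stall_cpi, clock_ratio=1.0):
    """Speedup over the unpipelined machine:
    depth / (1 + stall CPI) x (unpipelined clock / pipelined clock)."""
    return depth / (1 + stall_cpi) * clock_ratio

depth = 5  # assumed; cancels out of the A/B ratio
speedup_a = pipeline_speedup(depth, stall_cpi=0.0)        # dual-ported: no stalls
speedup_b = pipeline_speedup(depth, stall_cpi=0.4 * 1,    # 40% loads, 1-cycle stall
                             clock_ratio=1.05)            # 5% faster clock
print(speedup_a / speedup_b)  # ~1.33: A wins despite B's faster clock
```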

11
Data Hazards
  • Data Hazards are associated with accessing data
    from registers

12
Data Hazard
13
Software Fix: Stall
14
Stalling
  • Depends on a good compiler
  • Must do register dependency analysis
  • Code rearranging can help…
  • However, it is limited by the nature of programs
  • A better solution is forwarding

15
Data Forwarding
  • Result is somewhere in pipeline
  • Can add additional data paths (and muxes) to make
    the result available
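The forwarding decision can be sketched as a mux-select check against the two pipeline latches ahead of EX (the function name and record shapes below are mine, not the actual DLX control logic):

```python
def forward_source(ex_src_reg, ex_mem, mem_wb):
    """Pick where an EX-stage source operand comes from.
    ex_mem / mem_wb are (dest_reg, value) pairs for the one- and
    two-ahead instructions in the pipeline, or None."""
    if ex_mem and ex_mem[0] == ex_src_reg:
        return "EX/MEM"   # result computed in the previous cycle
    if mem_wb and mem_wb[0] == ex_src_reg:
        return "MEM/WB"   # result from two cycles ago
    return "REGFILE"      # no hazard: read the register file

# ADD r1,r2,r3 followed by SUB r4,r1,r5: r1 is forwarded from EX/MEM
print(forward_source("r1", ("r1", 42), None))  # -> EX/MEM
```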

16
Loads
  • Unfortunately, forwarding does not completely
    solve all problems…

Can't go backwards in time!
17
Loads require 1 stall
18
Software Scheduling to Avoid Load Hazards
Try producing fast code for a = b + c; d = e - f,
assuming a, b, c, d, e, and f are in memory.
  • Slow code
  • LW Rb,b
  • LW Rc,c
  • ADD Ra,Rb,Rc
  • SW a,Ra
  • LW Re,e
  • LW Rf,f
  • SUB Rd,Re,Rf
  • SW d,Rd
  • Fast code
  • LW Rb,b
  • LW Rc,c
  • LW Re,e
  • ADD Ra,Rb,Rc
  • LW Rf,f
  • SW a,Ra
  • SUB Rd,Re,Rf
  • SW d,Rd
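The stall counts for the two schedules can be verified with a small script (a sketch; the tuple encoding of the instructions is mine):

```python
def count_load_stalls(code):
    """Count one-cycle load-use stalls: a load followed immediately
    by an instruction that reads the loaded register (forwarding
    covers everything else in the 5-stage pipeline)."""
    stalls = 0
    for prev, cur in zip(code, code[1:]):
        op, dest, *_ = prev
        if op == "LW" and dest in cur[2:]:  # next instr reads the loaded reg
            stalls += 1
    return stalls

# (op, written-or-stored field, read registers...) for each instruction
slow = [("LW", "Rb", "b"), ("LW", "Rc", "c"), ("ADD", "Ra", "Rb", "Rc"),
        ("SW", "a", "Ra"), ("LW", "Re", "e"), ("LW", "Rf", "f"),
        ("SUB", "Rd", "Re", "Rf"), ("SW", "d", "Rd")]
fast = [("LW", "Rb", "b"), ("LW", "Rc", "c"), ("LW", "Re", "e"),
        ("ADD", "Ra", "Rb", "Rc"), ("LW", "Rf", "f"), ("SW", "a", "Ra"),
        ("SUB", "Rd", "Re", "Rf"), ("SW", "d", "Rd")]

print(count_load_stalls(slow))  # -> 2 (before ADD and before SUB)
print(count_load_stalls(fast))  # -> 0 (loads hoisted above their uses)
```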

19
Three Generic Data Hazards
  • Instr_I followed by Instr_J
  • Read After Write (RAW): Instr_J tries to read an
    operand before Instr_I writes it
  • Reading from register file
  • Writing to register file
  • This is what we have been looking at in our
    forwarding
  • Op r1, r2, r3   ; r1 is written
  • Op r4, r1, r5   ; r1 is read

20
Three Generic Data Hazards
  • Instr_I followed by Instr_J
  • Write After Read (WAR): Instr_J tries to write an
    operand before Instr_I reads it
  • Gets wrong operand
  • Can't happen in the DLX 5-stage pipeline because:
  • All instructions take 5 stages, and
  • Reads are always in stage 2, and
  • Writes are always in stage 5

21
Three Generic Data Hazards
  • Instr_I followed by Instr_J
  • Write After Write (WAW): Instr_J tries to write an
    operand before Instr_I writes it
  • Leaves wrong result (Instr_I's, not Instr_J's)
  • Can't happen in the DLX 5-stage pipeline because:
  • All instructions take 5 stages, and
  • Writes are always in stage 5
  • Will see WAR and WAW in later, more complicated
    pipes

22
Data Hazards Summary
  • Of three types (WAW, WAR, RAW):
  • WAW: pipeline prevents this, as all writes occur in
    the same stage
  • WAR: pipeline prevents this, as reads occur in
    stage 2, writes in stage 5
  • RAW: can occur
  • Forwarding solves all but load-word hazards
  • Need to place a NOP in between
  • A good compiler can eliminate it by code
    rearrangement
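The three hazard classes can be expressed as a simple register-overlap check (a sketch; the function and the (dest, sources) encoding are mine):

```python
def classify_hazards(instr_i, instr_j):
    """Classify data hazards between instr_i and a later instr_j.
    Each instruction is (dest_register, [source_registers])."""
    dest_i, srcs_i = instr_i
    dest_j, srcs_j = instr_j
    hazards = []
    if dest_i in srcs_j:
        hazards.append("RAW")   # J reads what I writes
    if dest_j in srcs_i:
        hazards.append("WAR")   # J writes what I reads
    if dest_i == dest_j and dest_i is not None:
        hazards.append("WAW")   # both write the same register
    return hazards

# ADD r1,r2,r3 followed by SUB r4,r1,r5: RAW on r1
print(classify_hazards(("r1", ["r2", "r3"]), ("r4", ["r1", "r5"])))
```

In the DLX 5-stage pipeline only the RAW case can actually bite; WAR and WAW are ruled out by the fixed read/write stages listed above.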

23
Control Hazards
  • Control hazards occur when sequential instruction
    execution is violated:
  • Unconditional jump/branch
  • Conditional jump/branch
  • Let's see…

24
Original Design
  • Conditions for branches not resolved until the EX
    stage
  • New address not muxed in until the Memory Access
    stage
  • 2 cases:
  • Take branch (unconditional, or condition met)
  • Don't take conditional branch

25
Control Hazard on Branches: Three-Stage Stall
26
Branch Stall Impact
  • If CPI = 1, 30% branches, and a 3-cycle stall =>
    new CPI = 1.9!
  • Two-part solution:
  • Determine branch taken or not sooner, AND
  • Compute the taken-branch address earlier
  • DLX branch tests if register = 0 or != 0
  • DLX solution:
  • Move the zero test to the ID/RF stage
  • Add an adder to calculate the new PC in the ID/RF
    stage
  • 1 clock cycle penalty for branch versus 3
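The CPI arithmetic above, as a quick check (assuming the stated 30% branch frequency; the function name is mine):

```python
def branch_cpi(base_cpi, branch_freq, penalty):
    """Effective CPI when every branch pays the given stall penalty."""
    return base_cpi + branch_freq * penalty

print(branch_cpi(1.0, 0.30, 3))  # original design: CPI = 1.9
print(branch_cpi(1.0, 0.30, 1))  # zero test + adder in ID/RF: CPI = 1.3
```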

27
Pipelined DLX Datapath (Figure 3.22, Page 163)
[Figure: datapath with the zero test and branch-target adder moved into
the Instr. Decode / Reg. Fetch stage, ahead of Execute / Addr. Calc,
Memory Access, and Write Back]
This is the correct 1-cycle-latency
implementation!
28
Four Branch Hazard Alternatives
  • 1: Stall until branch direction is clear
  • 2: Predict Branch Not Taken
  • Execute successor instructions in sequence
  • Squash instructions in the pipeline if the branch
    is actually taken
  • Advantage of late pipeline state update
  • 47% of DLX branches not taken on average
  • PC+4 already calculated, so use it to get the next
    instruction
  • 3: Predict Branch Taken
  • 53% of DLX branches taken on average
  • But haven't calculated the branch target address in
    DLX
  • DLX still incurs a 1-cycle branch penalty
  • Other machines: branch target known before outcome

29
Four Branch Hazard Alternatives
  • 4: Delayed Branch
  • Define the branch to take place AFTER a following
    instruction:

    branch instruction
    sequential successor_1
    sequential successor_2
    ........
    sequential successor_n   <- branch delay of length n
    branch target if taken

  • A 1-slot delay allows a proper decision and branch
    target address in the 5-stage pipeline
  • DLX uses this
30
Delayed Branch
  • Where to get instructions to fill the branch delay
    slot?
  • Before the branch instruction
  • From the target address: only valuable when the
    branch is taken
  • From fall-through: only valuable when the branch is
    not taken
  • Cancelling branches allow more slots to be
    filled
  • Compiler effectiveness for a single branch delay
    slot:
  • Fills about 60% of branch delay slots
  • About 80% of instructions executed in branch
    delay slots are useful in computation
  • About 50% (60% x 80%) of slots usefully filled
  • Delayed branch downside: 7-8 stage pipelines,
    multiple instructions issued per clock
    (superscalar)
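A quick check of the slot-filling arithmetic (assuming, as a simplification of mine, that an unfilled or useless slot costs the full one-cycle penalty):

```python
fill_rate = 0.60       # fraction of delay slots the compiler fills
useful_rate = 0.80     # fraction of filled slots doing useful work
usefully_filled = fill_rate * useful_rate   # ~0.48, the "about 50%"

# A usefully filled slot costs nothing; otherwise the slot is a
# wasted cycle, so the expected penalty per branch is roughly:
effective_penalty = 1 - usefully_filled     # ~0.52, i.e. about 0.5
print(usefully_filled, effective_penalty)
```

This is where the 0.5-cycle delayed-branch penalty in the comparison on the next slide comes from.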

31
Evaluating Branch Alternatives

  Scheduling scheme   Branch penalty   CPI    Speedup v. unpipelined   Speedup v. stall
  Stall pipeline      3                1.42   3.5                      1.0
  Predict taken       1                1.14   4.4                      1.26
  Predict not taken   1                1.09   4.5                      1.29
  Delayed branch      0.5              1.07   4.6                      1.31

  • Conditional and unconditional branches = 14% of
    instructions; 65% of them change the PC
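The CPI column reproduces from CPI = 1 + branch frequency x effective penalty, using the 14% branch frequency and, for predict-not-taken, charging the penalty only on the 65% of branches that change the PC (my reading of the slide's numbers; the function name is mine):

```python
def scheme_stats(effective_penalty, branch_freq=0.14, depth=5):
    """CPI and speedup vs. unpipelined for a branch-handling scheme,
    assuming a 14% branch frequency and a 5-stage pipeline."""
    cpi = 1 + branch_freq * effective_penalty
    return cpi, depth / cpi

schemes = [("Stall pipeline", 3), ("Predict taken", 1),
           ("Predict not taken", 1 * 0.65), ("Delayed branch", 0.5)]
for name, penalty in schemes:
    cpi, speedup = scheme_stats(penalty)
    print(f"{name:18s} CPI={cpi:.2f} speedup={speedup:.1f}")
```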