Title: Appendix A Pipelining: Basic and Intermediate Concepts
1Appendix A Pipelining: Basic and Intermediate Concepts
2Outline
- What is pipelining?
- The basic pipeline for a RISC instruction set
- The major hurdle of pipelining: pipeline hazards
- Data hazards
- Control hazards
3What Is Pipelining?
4Pipelining: It's Natural!
- Laundry example
- Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
- Washer takes 30 minutes
- Dryer takes 40 minutes
- Folder takes 20 minutes
5Sequential Laundry
[Figure: sequential laundry timeline from 6 PM to midnight; each load runs washer (30 min), dryer (40 min), and folder (20 min), and loads run strictly back-to-back]
- Sequential laundry takes 6 hours for 4 loads
- If they learned pipelining, how long would
laundry take?
6Pipelined LaundryStart work ASAP
[Figure: pipelined laundry timeline from 6 PM to 9:30 PM; each load starts as soon as the washer is free, so the washer, dryer, and folder operate on different loads simultaneously]
- Pipelined laundry takes 3.5 hours for 4 loads
7Pipelining Lessons
- Pipelining does not help the latency of a single task; it helps the throughput of the entire workload
- Pipeline rate is limited by the slowest pipeline stage
- Multiple tasks operate simultaneously
- Potential speedup = number of pipe stages
- Unbalanced lengths of pipe stages reduce speedup
- Time to fill the pipeline and time to drain it reduce speedup
8What is Pipelining?
- Pipelining is an implementation technique whereby multiple instructions are overlapped in execution
- Not visible to the programmer
- Each step in the pipeline completes a part of an instruction
- Each step completes different parts of different instructions in parallel
- Each of these steps is called a pipe stage or a pipe segment
9What is Pipelining? (Cont.)
- The time required to move an instruction one step down the pipeline is a machine cycle
- All the stages must be ready to proceed at the same time
- The slowest pipe stage dominates
- A machine cycle is usually one clock cycle (sometimes two, rarely more)
- The pipeline designer's goal is to balance the length of each pipeline stage
10What is Pipelining? (Cont.)
- If the stages are perfectly balanced, then the time per instruction on the pipelined machine, assuming ideal conditions, is equal to:
time per instruction = (time per instruction on the unpipelined machine) / (number of pipe stages)
- Simple model: common latch clock
11Major Pipeline Benefit: Performance
- Ideal performance
- time per instruction = unpipelined instruction time / number of stages
- An asymptote of course; however, 10 is commonly achieved
- The difference is due to the difficulty of achieving laminar stage design
- 2 ways to view the performance mechanism
- Reduced CPI
- Assume a processor takes multiple clock cycles per instruction
- Reduced cycle time
- Assume a processor takes 1 long clock cycle per instruction
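The two views give the same ideal number, which can be sanity-checked in a couple of lines (the 50 ns / 5-stage figures below are illustrative, not from the slides):

```python
# Illustrative numbers (not from the slides): a 50 ns unpipelined
# instruction split into 5 perfectly balanced stages.
unpipelined_ns = 50.0
stages = 5

# Reduced cycle-time view: the one long cycle becomes `stages` short ones;
# in steady state one instruction completes per short cycle.
time_per_instruction_ns = unpipelined_ns / stages
assert time_per_instruction_ns == 10.0  # ideal: unpipelined time / stages
```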
12Other Pipeline Benefits
- Completely a HW mechanism
- No programming-model shift is required to exploit this form of concurrency
- BUT the compiler will need to change, and compile time will go up
- All modern machines are pipelined
- Key technique in advancing performance in the 80s
- In the 90s we just moved to multiple pipelines
- Beware: no benefit is totally free/good
13Start with an Unpipelined RISC
Use DLX as the example, which is similar to MIPS
- Every instruction can be executed in 5 steps
- Every instruction takes at most 5 clock cycles
- Each step's outputs are just passed to the next step (no latches)
14Steps 1 and 2
- IF: instruction fetch step
- IR ← Mem[PC] (fetch the next instruction from memory)
- NPC ← PC + 4 (compute the new PC)
- ID: instruction decode and register fetch step
- A ← Regs[IR6..10]
- B ← Regs[IR11..15]
- Imm ← ((IR16)^16 ## IR16..31)
- Done in parallel with instruction (opcode) decoding
- Possible since register specifiers are encoded in fixed fields
- May fetch register contents that we don't use
- Also calculates the sign-extended immediate
15Step 3
- EX: execution / effective address step (4 options depending on opcode)
- Memory reference (LW R1, 1000(R2); SW R3, 500(R4))
- ALUOutput ← A + Imm (effective address)
- Register-register ALU instruction (ADD R1, R2, R3)
- ALUOutput ← A func B
- Register-immediate ALU instruction (ADD R1, R2, immediate)
- ALUOutput ← A op Imm
- Branch (BEQZ R1, 2000)
- ALUOutput ← NPC + Imm
- Cond ← (A op 0)
- In a load-store machine, no instruction needs to simultaneously calculate a data address and perform an ALU operation on the data
- Hence EX/EFA can be combined into a single cycle
16Steps 4 and 5
- MEM: memory access / branch completion
- PC ← NPC
- Memory reference
- LMD ← Mem[ALUOutput] (load), or
- Mem[ALUOutput] ← B (store)
- Branch
- if (cond) then PC ← ALUOutput
- WB: write back
- Register-register ALU
- Regs[IR16..20] ← ALUOutput
- Register-immediate ALU
- Regs[IR11..15] ← ALUOutput
- Load
- Regs[IR11..15] ← LMD
17Datapath
[Figure: unpipelined DLX datapath annotated with the register transfers above: IR ← Mem[PC]; NPC ← PC + 4; ALUOutput ← A + Imm / A func B / A op Imm / NPC + Imm; Cond ← (A op 0); PC ← NPC, or ALUOutput on a taken branch; LMD ← Mem[ALUOutput] on a load; write-back comes from ALUOutput (ALU op) or LMD (load)]
18Discussion
- Assume separate instruction and data memories
- Implemented with separate instruction and data caches (Chapter 5)
- Data memory references only occur at stage 4
- Load and store
- Register updates only occur at stage 5
- All ALU operations and load
- All register reads are early (in ID) and all writes are late (in WB)
19Discussion (Cont.)
- Branch and store require 4 cycles; the others require 5
- Branch 12%, store 5% → overall CPI = 4.83 (= 5 × 0.83 + 4 × 0.17)
- The model is correct but not optimized
- ALUs: 1 would have sufficed, since in any given cycle only 1 is active
- Instruction and data memories do not have to be separate
- Branches can be completed at the end of the ID stage (see later)
20The Basic Pipeline for DLX/MIPS
21Simple DLX/MIPS Pipeline
- Stages now execute 1 per cycle
- Ideal result: the CPI is reduced from 5 to 1
- Is it really this simple? Of course not, but it's a start
- Different operations use the same resource on the same cycle? Structural hazard!!
- Separate instruction and data memories (IM, DM)
- Register file is read in ID and written in WB (distinct use)
- The PC is written in IF, with either the incremented PC or the branch target of an earlier branch (branch-handling problem)
- Registers are needed between two adjacent stages to store intermediate results
- Otherwise they would be overwritten by the next instruction
22Best Case Pipeline Scenario
[Figure: pipeline fill phase, stable phase (5 times the throughput), and drain phase]
23Perform register write/read in the first/second half of a clock cycle
A pipeline can be thought of as a series of datapaths (resources) shifted in time
24IF/ID, ID/EX, EX/MEM, MEM/WB are pipeline registers/latches
25Events on Every Pipe Stage (Figure A.19)
Extra pipeline registers between stages are used
to store intermediate results
26Events on Every Pipe Stage (Cont.) (Figure A.19)
27Important Pipeline Characteristics
- Latency
- Time it takes for an instruction to go through the pipe
- Latency = stages × stage delay
- Dominant feature if there are lots of exceptions
- Throughput
- Determined by the rate at which instructions can start/finish
- Dominant feature if there are no exceptions
28Basic Performance Issues
- Pipelining improves CPU instruction throughput
- It does not reduce the execution time of an individual instruction
- It slightly increases the execution time of an individual instruction
- Overhead in the control of the pipeline
- Pipeline register delay and clock skew (Appendix A-10)
- These limit the practical depth of a pipeline
- A program runs faster and has lower total execution time, even though no single instruction runs faster
29Benefit Example
From the viewpoint of reduced clock cycle (i.e., CPI = 1)
- Unpipelined DLX
- 5 steps, taking 50, 50, 60, 50, 50 ns respectively
- Hence total instruction time = 260 ns (one clock cycle)
- Looks like a 5-stage pipeline
- But there are parasites everywhere
- Assume 5 ns is added to the slowest stage for the extra latches
- Primarily due to set-up and hold times
- Hence (assuming no stage/step improvement)
- Must run at slowest stage + parasites = 60 + 5 = 65 ns/stage
- In steady state (no exceptions) an instruction completes every 65 ns
- Speedup = 260/65 = 4x improvement
30Benefit Example (Cont.)
From the viewpoint of reduced CPI
- Unpipelined DLX
- 10-ns clock cycle
- Clock cycles per instruction: ALU 4 (40%), branches 4 (20%), memory 5 (40%)
- Average instruction execution time = clock cycle × average CPI = 10 ns × ((40% + 20%) × 4 + 40% × 5) = 10 ns × 4.4 = 44 ns
- Pipelined DLX
- 1 ns of overhead added to the clock → 11-ns clock cycle
- 11 ns is also the average instruction execution time
- Speedup from pipelining = 44 ns / 11 ns = 4 times
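The same check for the reduced-CPI view, again using only the slide's numbers:

```python
unpipelined_clock_ns = 10

# Instruction mix: ALU 40% at 4 cycles, branches 20% at 4, memory 40% at 5.
avg_cpi = (40 * 4 + 20 * 4 + 40 * 5) / 100     # 4.4 cycles per instruction
avg_instr_ns = unpipelined_clock_ns * avg_cpi  # 44 ns

pipelined_clock_ns = unpipelined_clock_ns + 1  # 1 ns pipeline overhead
speedup = avg_instr_ns / pipelined_clock_ns
print(round(speedup, 3))  # 4.0
```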
31Pipeline Hazards
- Pipeline hazards prevent the next instruction in the instruction stream from executing during its designated clock cycle
- Hazards reduce pipeline performance from the ideal speedup
32Pipeline Hazards
- Structural hazards
- Caused by resource conflicts
- Possible to avoid by adding resources, but that may be too costly
- Data hazards
- An instruction depends on the result of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline
- Can be mitigated somewhat by a smart compiler
- Control hazards
- When the PC does not simply get incremented
- Branches and jumps; not too bad
33Hazards Cause Stalls: Two Policy Choices
- How about just stalling all stages?
- OK, but the problem is usually an adjacent-stage conflict
- Hence nothing moves and the stall condition never clears
- A cheap option, but it does not work
- Stall later instructions, let earlier ones progress
- Instructions issued later than the stalled instruction are also stalled
- Instructions issued earlier than the stalled instruction must continue
- We will see in Chapters 3 and 4 that we can reorder the instructions, or let the instructions after the stalled instruction go on, to reduce the impact of hazards
34Structural Hazards
- If some combination of instructions cannot be accommodated because of resource conflicts, the machine is said to have a structural hazard
- Some functional unit is not fully pipelined
- Some resource has not been duplicated enough to allow all combinations of instructions in the pipeline to execute
- Single-ported register file: conflicts with the needs of multiple stages
- Memory fetch: may be needed in both the IF and MEM stages
- The pipeline stalls instructions until the required unit is available
- A stall is commonly called a pipeline bubble, or just a bubble
35Structural Hazard Example
36Removing the Structural Hazard
No real hazard if inst1 is not a load or store (only load/store/branch use stage 4)
37Pipeline Stalled for a Structural Hazard (Another
View)
38Calculating Stall Effects
Ignore pipeline overhead and assume a balanced pipeline, with all instructions taking the same number of cycles.
From the viewpoint of decreasing CPI, CPI unpipelined = pipeline depth, therefore:
Speedup = CPI unpipelined / CPI pipelined = pipeline depth / (1 + pipeline stall cycles per instruction)
39Calculating Stall Effects (Cont.)
From the viewpoint of decreasing clock cycle time, CPI = 1 and clock cycle pipelined = clock cycle unpipelined / pipeline depth, therefore:
Speedup = 1 / (1 + pipeline stall cycles per instruction) × pipeline depth
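Both viewpoints reduce to the same function of pipeline depth and stall cycles; a tiny helper (the function name is mine) makes that dependence visible:

```python
def pipeline_speedup(depth, stall_cycles_per_instr):
    """Speedup over the unpipelined machine, assuming balanced stages
    and no pipelining overhead (the slides' idealization)."""
    return depth / (1 + stall_cycles_per_instr)

print(pipeline_speedup(5, 0.0))            # 5.0: ideal 5-stage pipeline
print(round(pipeline_speedup(5, 0.5), 2))  # 3.33: one stall every other instruction
```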
40Example: Dual-port vs. Single-port Memory
- Machine A: dual-ported memory
- Machine B: single-ported memory, but its pipelined implementation has a 1.05 times faster clock rate
- Ideal CPI = 1 for both
- Loads are 40% of the instructions executed
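A sketch of the comparison this example sets up, assuming the single-ported Machine B takes one stall cycle per load (the structural hazard) while Machine A never stalls:

```python
load_frac = 0.40

clock_a = 1.0         # Machine A's clock cycle, normalized
clock_b = 1.0 / 1.05  # Machine B's clock is 1.05x faster

avg_time_a = 1.0 * clock_a                    # ideal CPI of 1
avg_time_b = (1.0 + load_frac * 1) * clock_b  # CPI = 1.4 from load stalls

print(round(avg_time_b / avg_time_a, 2))  # 1.33: A is ~1.3x faster anyway
```

The faster clock cannot make up for stalling on 40% of the instructions.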
41Why Would a Designer Allow a Structural Hazard?
- A machine without structural hazards will always have a lower CPI (if all other factors are equal)
- So why would a designer allow a structural hazard?
- To reduce cost
- Pipelining or duplicating all the functional units may be too costly
42Why Would a Designer Allow a Structural Hazard? (Cont.)
- DLX implementation with an FP multiply unit but no pipelining of it
- Accepts a new multiply every five clock cycles (the initiation interval)
- How does the structural hazard impact mdljdp2?
- mdljdp2 has 14% FP multiplications
- The DLX implementation can handle up to 20% FP multiplications
- If the FP multiplications in mdljdp2 are not clustered but distributed uniformly → the performance impact is very limited
- If the FP multiplications in mdljdp2 are all clustered without intervening instructions, and 14% of instructions take 5 cycles each → CPI increases from 1 to 1.7
- In practice, the impact of this structural hazard is < 0.03 (data hazards have a more severe impact!!!)
43Data Hazards
44Introduction
- Data hazards occur when the pipeline changes the order of read/write accesses to operands so that the order differs from the order seen when executing instructions sequentially on an unpipelined machine
- Example: a later instruction uses a result that has not yet been produced by an earlier instruction
- Example
- ADD R1, R2, R3
- SUB R4, R1, R5
- AND R6, R1, R7
- OR R8, R1, R9
- XOR R10, R1, R11
R1 ← R2 + R3: R1 is produced by the first instruction and used in every subsequent instruction
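The hazard pattern in this example can be found mechanically. The sketch below flags a RAW hazard whenever one of the next two instructions reads the producer's destination; with the split-cycle register file of slide 23, the instruction three slots later already reads the freshly written value (the encoding and names are mine):

```python
# Each instruction: (name, dest, (src1, src2)).
program = [
    ("ADD", "R1",  ("R2", "R3")),
    ("SUB", "R4",  ("R1", "R5")),   # distance 1: hazard
    ("AND", "R6",  ("R1", "R7")),   # distance 2: hazard
    ("OR",  "R8",  ("R1", "R9")),   # distance 3: safe (write then read in the same cycle)
    ("XOR", "R10", ("R1", "R11")),  # distance 4: safe
]

hazards = []
for i, (_, dest, _) in enumerate(program):
    # Only the next 2 instructions read R1 before WB has written it.
    for j in range(i + 1, min(i + 3, len(program))):
        if dest in program[j][2]:
            hazards.append((program[i][0], program[j][0]))

print(hazards)  # [('ADD', 'SUB'), ('ADD', 'AND')]
```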
45The use of the result of ADD in the next three
instructions causes a hazard, since the register
is not written until after those instructions
read it
46Forwarding (also called bypassing, shorting, or short-circuiting)
- Key: keep the ALU result around
- Example
- ADD R1, R2, R3
- SUB R4, R1, R5
- How do we handle this in general?
- The forwarded value can be at the ALU output or at the MEM-stage output
- ADD produces the R1 value at the ALU output
- SUB needs it again at the ALU input
47Forwarding (Cont.)
- Use the code on slide 44 as an example
- Forward the result from where ADD produces it (the EX/MEM register) to where SUB needs it (the ALU input latch)
- Forwarding works as follows
- The ALU result from the EX/MEM register is fed back to the ALU input latch
- If the forwarding hardware detects that the previous ALU operation has written the register corresponding to a source for the current ALU operation, control logic selects the forwarded result as the ALU input rather than the value read from the register file
- Generalization of forwarding
- Pass a result directly to the functional unit that requires it: a result is forwarded from the pipeline register corresponding to the output of one unit to the input of another
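The control logic described above amounts to a priority selection at each ALU input. A simplified sketch (the names are invented; real hardware would also check write-enable and the R0 special case):

```python
def select_alu_input(src_reg, regfile_value, ex_mem, mem_wb):
    """Pick the value for one ALU input.

    ex_mem and mem_wb are (dest_reg, value) pairs for in-flight results,
    or None. The newest producer (EX/MEM) wins over the older one (MEM/WB).
    """
    if ex_mem is not None and ex_mem[0] == src_reg:
        return ex_mem[1]   # forward from EX/MEM (most recent result)
    if mem_wb is not None and mem_wb[0] == src_reg:
        return mem_wb[1]   # forward from MEM/WB
    return regfile_value   # no hazard: use the register file

# ADD R1,R2,R3 just left EX; SUB R4,R1,R5 is entering EX.
assert select_alu_input("R1", 0, ("R1", 42), None) == 42
# Two instructions later, the same result arrives from MEM/WB instead.
assert select_alu_input("R1", 0, None, ("R1", 42)) == 42
# Unrelated source register: no forwarding.
assert select_alu_input("R5", 7, ("R1", 42), None) == 7
```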
48Result With Forwarding
ADD R1, R2, R3
SUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11
49Multiplexing Issues in Forwarding
50Another Forwarding Example
- Example
- ADD R1, R2, R3
- LW R4, 0(R1)
- SW 12(R1), R4
- Forwarding result: next page
51[Figure: cycle-by-cycle register transfers with forwarding:
ADD: A ← R2, B ← R3; ALUO ← A + B (produces R1); R1 ← ALUO
LW: A ← R1, B ← R4, Imm ← 0; ALUO ← A + Imm (uses forwarded R1); LMD ← Mem[ALUO] (produces R4); R4 ← LMD
SW: A ← R1, B ← R4, Imm ← 12; ALUO ← A + Imm (uses forwarded R1); Mem[ALUO] ← B (uses forwarded R4)]
52When Forwarding Fails
[Figure: a load of R1 (LMD ← Mem[ALUO], R1 ← LMD) immediately followed by SUB (ALUO ← A - B), AND (ALUO ← A AND B), and OR (ALUO ← A OR B) instructions that all read R1; the loaded value is not yet available when the next instruction reaches the ALU]
53Stalls
- Some latencies can't be absorbed: the case in the previous slide
- Stalls are the result
- Pipeline interlock circuits are needed
- They detect a hazard and introduce bubbles until the hazard clears
- The CPI for stalled instructions bloats by the number of bubbles
- Bubbles cause the forwarding paths to change
- In MIPS/DLX, if the instruction after a load uses the load result, a one-clock-cycle stall will occur!
54Bubbles and new Forwarding Paths
55Handling Stalls
Hardware vs. Software
- Hardware: pipeline interlocks
- Must detect when required data cannot be provided
- Stall stages to create a bubble
- Software: pipelining, or instruction scheduling
- Performed by a smart compiler
Unscheduled: LW RB, B; LW RC, C; ADD RA, RB, RC; SW A, RA; LW RE, E; LW RF, F; SUB RD, RE, RF; SW D, RD
Scheduled: LW RB, B; LW RC, C; LW RE, E; ADD RA, RB, RC; LW RF, F; SW A, RA; SUB RD, RE, RF; SW D, RD
(pipeline scheduling of A = B + C; D = E - F)
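The benefit of this scheduling can be counted with a toy stall model: one bubble whenever an instruction uses a register loaded by the immediately preceding LW (the encoding and helper are mine):

```python
def load_use_stalls(program):
    """program: list of (op, dest, srcs). Count one stall whenever an
    instruction reads a register loaded by the instruction just before it."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev[0] == "LW" and prev[1] in cur[2]:
            stalls += 1
    return stalls

unscheduled = [
    ("LW", "RB", ()), ("LW", "RC", ()),
    ("ADD", "RA", ("RB", "RC")), ("SW", None, ("RA",)),
    ("LW", "RE", ()), ("LW", "RF", ()),
    ("SUB", "RD", ("RE", "RF")), ("SW", None, ("RD",)),
]
scheduled = [
    ("LW", "RB", ()), ("LW", "RC", ()), ("LW", "RE", ()),
    ("ADD", "RA", ("RB", "RC")), ("LW", "RF", ()),
    ("SW", None, ("RA",)), ("SUB", "RD", ("RE", "RF")),
    ("SW", None, ("RD",)),
]

print(load_use_stalls(unscheduled))  # 2 (ADD after LW RC, SUB after LW RF)
print(load_use_stalls(scheduled))    # 0
```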
56Memory References May Also Cause Data Hazards
- The previous examples are with registers, but it is also possible for a pair of instructions to create a dependence by reading/writing the same memory location
- But the latter is impossible for MIPS
- Why??? Because data memory references only occur at stage 4, so they always stay in program order
57Data Hazard Forms
(i occurs before j in program execution order)
- RAW (read after write)
- j reads before i writes, hence j gets the wrong (old) value
- The most common form of data hazard
- As we have seen, forwarding can overcome this one
- WAW (write after write)
- Instructions i then j
- j writes before i writes, leaving the incorrect value
- Can this happen in MIPS? Why?
- WAW can happen only in pipelines that write in more than one pipe stage (or allow an instruction to proceed even when a previous instruction is stalled)
58Data Hazard Forms (Cont.)
- WAR (write after read)
- i then j is the intended order
- j writes before i reads; i ends up with the incorrect new value
- Is this a problem in MIPS? Why?
- It can happen only when some instructions write results early in the pipe and others read a source late in the pipe
- RAR (read after read)
- Not a hazard
59MIPS Ordering
- Some things are not a problem
- MIPS has only a single memory-write stage and a single register-write stage
- Hence the WAW ordering requirement is preserved
- However, things can get a lot worse
- And will, when we look at varying operational latencies
- For example, floating-point instructions in MIPS
- WAR and MIPS ordering
- Writing happens late in the pipe
- Reading happens early
- Hence no WAR problems
- However, other machines might exhibit this problem
60Control Hazards
61Introduction
- Control hazards: how does a branch influence the pipeline?
- The problem is more complex; we need 2 things
- The branch target (taken means not PC + 4; not taken means the condition fails) (MEM)
- A valid condition; in the DLX case, the result of the zero-detect unit (EX)
- Both happen late in the pipe
- How to deal with branches?
- Stall the pipeline as soon as we detect the branch (ID), and keep it stalled until we reach the MEM stage
- A three-cycle stall
- The first IF is essentially a stall (when the branch is taken)
- Consider a 30% branch frequency and an ideal CPI of 1
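Plugging in those last numbers shows why a three-cycle branch stall hurts so much:

```python
branch_freq = 0.30
branch_penalty = 3  # cycles stalled until the branch resolves in MEM

cpi = 1 + branch_freq * branch_penalty
print(round(cpi, 2))  # 1.9: nearly half the ideal throughput is lost
```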
62DLX Pipeline Re-visited
63A branch causes a 3-cycle stall in the DLX
pipeline
64Branch Delay Reduction
Branch delay is the length of the control hazard
- Hardware mechanisms
- Find out whether the branch is taken or not taken earlier in the pipeline
- Compute the taken PC earlier
- With another adder
- The BTA (branch target address) can be computed during the ID stage
- Then the BTA and the normal PC mechanism cover both possibilities
- The choice depends on the instruction and the condition, also known after the ID stage
- Software mechanisms
- Design your ISA properly
- e.g., BNEZ and BEQZ on DLX permit conditions to be known during ID
- Do instruction scheduling into the branch delay slots
- Know the likelihood of taken vs. not-taken branches (statistics)
- Improves the chances of guessing correctly
- Varies with application and instruction placement
65New Improved DLX Pipeline
[Figure: revised DLX pipeline with the zero test and the branch-target adder (NPC + Imm) moved into the ID stage]
66Hardware Mechanism
- Move the zero test to the ID/RF stage
- Add an adder to calculate the new PC in the ID/RF stage
- 1 clock cycle penalty for a branch, versus 3
- Note: an ALU instruction followed by a branch on the result of that instruction will incur a data hazard stall
- Example
- SUB R1, R2, R3: IF-ID-EX-MEM-WB
- BEQZ R1, 100: IF-stall-ID-EX-MEM-WB
67Branch Behavior in Programs
- Integer benchmarks
- Conditional branch frequencies of 14% to 16%
- Much lower unconditional branch frequencies
- FP benchmarks
- Conditional branch frequencies of 3% to 12%
- Forward branches dominate backward branches (3.7:1)
- 67% of conditional branches are taken on average
- 60% of forward branches are taken on average
- 85% of backward branches are taken on average (usually loops)
68Control Hazard Avoidance
- Simplest scheme
- Freeze the pipe until you know the condition and the branch target
- Cheap but slow
- Too slow, since we'd negate half of the pipeline speedup with 2 or 3 bubbles (old vs. new DLX pipeline designs)
- Predict not taken (47% of DLX branches are not taken on average)
- Make sure any state change (destructive phase) is deferred until you know whether you guessed right
- If not, then back out or flush
- Predict taken (53% of DLX branches are taken on average)
- Of no use in DLX/MIPS (target address and branch outcome are known at the same stage)
- Or let the compiler decide: same options
69Predict-Not-Taken
A Stall indeed
70Delayed Branch
- Delayed branch → make the stall cycle useful
- Add delay slots: branch penalty = length of the branch delay = 1 for DLX/MIPS
- Instructions in the delay slots are executed whether or not the branch is taken
- See if the compiler can schedule something useful into these slots
- Hope that filled slots actually help advance the computation
- A branch delay of length n:
branch instruction
sequential successor 1
sequential successor 2
........
sequential successor n (always executed!)
branch target if taken
71Delayed-Branch Behavior
72Delayed Branch (Cont.)
- Where do we get instructions to fill the branch delay slot?
- From before the branch instruction
- From the target address: only valuable when the branch is taken
- From the fall-through: only valuable when the branch is not taken
- When the slots cannot be scheduled, they are filled with no-op instructions (indeed, stalls!!)
- Canceling branches allow more slots to be filled
73Scheduling the branch-delay slot
[Figure: e.g., moving SUB R4, R5, R6 into the delay slot]
74Delay-Branch Scheduling Schemes and Their
Requirements
75Delayed Branch (Cont.)
- Limitations on delayed-branch scheduling
- Restrictions on the instructions that can be scheduled into delay slots
- The ability to predict at compile time whether a branch is likely to be taken
- Delayed branches are an architecturally visible feature
- Advantage: use compiler scheduling to reduce branch penalties
- Disadvantage: exposes an aspect of the implementation that is likely to change
- Delayed branches are less useful for longer branch delays
- The longer delay cannot easily be hidden → change to a hardware scheme
76Canceling (Nullifying) Branches
- Eliminate the requirements on the instructions placed in the delay slot, enabling the compiler to fill from the target or the fall-through without meeting those requirements
- Idea
- Associate with each branch instruction a predicted direction
- If the branch goes as predicted, nothing changes
- If it goes in the unpredicted direction, nullify all or some of the delay-slot instructions
- The result is more freedom for the compiler's delay-slot scheduling
- A common approach in HP's PA processors
77Delayed and canceling delayed branches allow control hazards to be hidden 70% of the time
78Performance of Delayed and Canceling Branches (Another View)
On average, 30% of branch delay slots are wasted
79Evaluating Branch Alternatives
See Appendix A-24 to A-26 for an example of branch evaluation
- With ideal CPI = 1 and stalls = frequency × penalty
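With that formula, comparing schemes is one line per scheme. The branch frequency and the per-scheme effective penalties below are illustrative assumptions, not the appendix's data:

```python
def pipeline_speedup(depth, branch_freq, effective_penalty):
    # Ideal CPI = 1; stalls = branch frequency x effective penalty per branch.
    return depth / (1 + branch_freq * effective_penalty)

# Assumed: 14% branch frequency, 5-deep pipeline, made-up effective penalties.
for scheme, penalty in [("stall always", 3.0),
                        ("predict not taken", 1.0),
                        ("delayed branch", 0.5)]:
    print(scheme, round(pipeline_speedup(5, 0.14, penalty), 2))
```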
80Static Branch Prediction Using Compiler Technology
- How to statically predict branches?
- By examination of program behavior
- Always predict taken (on average, 67% are taken)
- Mis-prediction rate varies widely (9%-59%)
- Predict backward branches taken, forward branches untaken (mis-prediction rate 60%-70%)
- Profile-based predictor: use profile information collected from earlier runs
- Simplest is the basic one-bit idea
- Easily extends to use more bits
- A definite win for some regular applications
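A one-bit profile-based predictor can be sketched in a few lines: record each branch's majority direction in a training run, then predict that direction thereafter (the function name and the toy branch traces are invented for illustration):

```python
from collections import defaultdict

def profile_predict(training_run, test_run):
    """One-bit profile-based prediction: learn each branch's majority
    direction from (pc, taken) pairs, then score it on a test run."""
    counts = defaultdict(int)
    for pc, taken in training_run:
        counts[pc] += 1 if taken else -1
    predict = {pc: c >= 0 for pc, c in counts.items()}  # one bit per branch

    misses = sum(1 for pc, taken in test_run if predict.get(pc, True) != taken)
    return misses / len(test_run)

# Toy trace: branch at pc=8 is mostly taken, branch at pc=16 mostly not.
train = [(8, True)] * 9 + [(8, False)] + [(16, False)] * 8 + [(16, True)] * 2
rate = profile_predict(train, train)
print(rate)  # 0.15: 3 mispredictions out of 20 when replaying the profile
```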
81Mis-prediction Rate for a Profile-based Predictor
FP is better than integer
82Predict-taken vs. Profile-based Predictor
Instructions executed between mispredictions: about 20 for predict-taken, about 110 for profile-based
Standard deviations are large
83Performance of the DLX Integer Pipeline
% of all cycles due to control and data hazard stalls:
Stall instructions 9%-23%; CPI 1.09-1.23; average CPI 1.11; improvement = 5/1.11 ≈ 4.5x