Title: Appendix A Pipelining: Basic and Intermediate Concepts
1Appendix A Pipelining: Basic and Intermediate Concepts
2Outline
- What is pipelining?
- The basic pipeline for a RISC instruction set
- The major hurdle of pipelining: pipeline hazards
- Data hazards
- Control hazards
3What Is Pipelining?
4Pipelining: It's Natural!
- Laundry example
- Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
- Washer takes 30 minutes
- Dryer takes 40 minutes
- Folder takes 20 minutes
5Sequential Laundry
[Figure: sequential laundry timeline from 6 PM to midnight; each load runs washer (30 min), dryer (40 min), and folder (20 min), and loads run strictly back-to-back]
- Sequential laundry takes 6 hours for 4 loads
- If they learned pipelining, how long would
laundry take?
6Pipelined LaundryStart work ASAP
[Figure: pipelined laundry timeline from 6 PM to 9:30 PM; each load starts as soon as the washer is free, so the washer, dryer, and folder operate on different loads simultaneously]
- Pipelined laundry takes 3.5 hours for 4 loads
7Pipelining Lessons
- Pipelining does not help the latency of a single task; it helps the throughput of the entire workload
- Pipeline rate is limited by the slowest pipeline stage
- Multiple tasks operate simultaneously
- Potential speedup = number of pipe stages
- Unbalanced lengths of pipe stages reduce speedup
- Time to fill the pipeline and time to drain it reduce speedup
8What is Pipelining?
- Pipelining is an implementation technique whereby multiple instructions are overlapped in execution
- Not visible to the programmer
- Each step in the pipeline completes a part of an instruction
- Each step completes different parts of different instructions in parallel
- Each of these steps is called a pipe stage or a pipe segment
9What is Pipelining? (Cont.)
- The time required to move an instruction one step down the pipeline is a machine cycle
- All the stages must be ready to proceed at the same time
- The slowest pipe stage dominates
- A machine cycle is usually one clock cycle (sometimes two, rarely more)
- The pipeline designer's goal is to balance the length of each pipeline stage
10What is Pipelining? (Cont.)
- If the stages are perfectly balanced, then the time per instruction on the pipelined machine, assuming ideal conditions, is equal to:
time per instruction = (time per instruction on the unpipelined machine) / (number of pipe stages)
- Simple model: common latch clock
11Major Pipeline Benefit: Performance
- Ideal performance
- time per instruction = unpipelined instruction time / number of stages
- An asymptote of course; however, 10 is commonly achieved
- The difference is due to the difficulty of achieving laminar stage design
- 2 ways to view the performance mechanism
- Reduced CPI
- Assume a processor takes multiple clock cycles per instruction
- Reduced cycle time
- Assume a processor takes 1 long clock cycle per instruction
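The two views give the same ideal number, which can be sanity-checked in a couple of lines (the 50 ns / 5-stage figures below are illustrative, not from the slides):

```python
# Illustrative numbers (not from the slides): a 50 ns unpipelined
# instruction split into 5 perfectly balanced stages.
unpipelined_ns = 50.0
stages = 5

# Reduced cycle-time view: the one long cycle becomes `stages` short ones;
# in steady state one instruction completes per short cycle.
time_per_instruction_ns = unpipelined_ns / stages
assert time_per_instruction_ns == 10.0  # ideal: unpipelined time / stages
```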
12Other Pipeline Benefits
- Completely a HW mechanism
- No programming-model shift is required to exploit this form of concurrency
- BUT the compiler will need to change, and compile time will go up
- All modern machines are pipelined
- Key technique in advancing performance in the 80s
- In the 90s we just moved to multiple pipelines
- Beware: no benefit is totally free/good
13Start with an Unpipelined RISC
Use DLX as the example, which is similar to MIPS
- Every instruction can be executed in 5 steps
- Every instruction takes at most 5 clock cycles
- Each step's outputs are just passed to the next step (no latches)
14Steps 1 and 2
- IF: instruction fetch step
- IR ← Mem[PC] (fetch the next instruction from memory)
- NPC ← PC + 4 (compute the new PC)
- ID: instruction decode and register fetch step
- A ← Regs[IR6..10]
- B ← Regs[IR11..15]
- Imm ← ((IR16)^16 ## IR16..31)
- Done in parallel with instruction (opcode) decoding
- Possible since register specifiers are encoded in fixed fields
- May fetch register contents that we don't use
- Also calculates the sign-extended immediate
15Step 3
- EX: execution / effective address step (4 options depending on opcode)
- Memory reference (LW R1, 1000(R2); SW R3, 500(R4))
- ALUOutput ← A + Imm (effective address)
- Register-register ALU instruction (ADD R1, R2, R3)
- ALUOutput ← A func B
- Register-immediate ALU instruction (ADD R1, R2, immediate)
- ALUOutput ← A op Imm
- Branch (BEQZ R1, 2000)
- ALUOutput ← NPC + Imm
- Cond ← (A op 0)
- In a load-store machine, no instruction needs to simultaneously calculate a data address and perform an ALU operation on the data
- Hence EX/EFA can be combined into a single cycle
16Steps 4 and 5
- MEM: memory access / branch completion
- PC ← NPC
- Memory reference
- LMD ← Mem[ALUOutput] (load), or
- Mem[ALUOutput] ← B (store)
- Branch
- if (cond) then PC ← ALUOutput
- WB: write back
- Register-register ALU
- Regs[IR16..20] ← ALUOutput
- Register-immediate ALU
- Regs[IR11..15] ← ALUOutput
- Load
- Regs[IR11..15] ← LMD
17Datapath
[Figure: unpipelined DLX datapath annotated with the register transfers above: IR ← Mem[PC]; NPC ← PC + 4; ALUOutput ← A + Imm / A func B / A op Imm / NPC + Imm; Cond ← (A op 0); PC ← NPC, or ALUOutput on a taken branch; LMD ← Mem[ALUOutput] on a load; write-back comes from ALUOutput (ALU op) or LMD (load)]
18Discussion
- Assume separate instruction and data memories
- Implemented with separate instruction and data caches (Chapter 5)
- Data memory references only occur at stage 4
- Load and store
- Register updates only occur at stage 5
- All ALU operations and load
- All register reads are early (in ID) and all writes are late (in WB)
19Discussion (Cont.)
- Branch and store require 4 cycles; the others require 5
- Branch 12%, store 5% → overall CPI = 4.83 (= 5 × 0.83 + 4 × 0.17)
- The model is correct but not optimized
- ALUs: 1 would have sufficed, since in any given cycle only 1 is active
- Instruction and data memories do not have to be separate
- Branches can be completed at the end of the ID stage (see later)
20The Basic Pipeline for DLX/MIPS
21Simple DLX/MIPS Pipeline
- Stages now execute 1 per cycle
- Ideal result: the CPI is reduced from 5 to 1
- Is it really this simple? Of course not, but it's a start
- Different operations use the same resource on the same cycle? Structural hazard!!
- Separate instruction and data memories (IM, DM)
- Register file is read in ID and written in WB (distinct use)
- The PC is written in IF, with either the incremented PC or the branch target of an earlier branch (branch-handling problem)
- Registers are needed between two adjacent stages to store intermediate results
- Otherwise they would be overwritten by the next instruction
22Best Case Pipeline Scenario
[Figure: pipeline fill phase, stable phase (5 times the throughput), and drain phase]
23Perform register write/read in the first/second half of a clock cycle
A pipeline can be thought of as a series of datapaths (resources) shifted in time
24IF/ID, ID/EX, EX/MEM, MEM/WB are pipeline registers/latches
25Events on Every Pipe Stage (Figure A.19)
Extra pipeline registers between stages are used
to store intermediate results
26Events on Every Pipe Stage (Cont.) (Figure A.19)
27Important Pipeline Characteristics
- Latency
- Time it takes for an instruction to go through the pipe
- Latency = stages × stage delay
- Dominant feature if there are lots of exceptions
- Throughput
- Determined by the rate at which instructions can start/finish
- Dominant feature if there are no exceptions
28Basic Performance Issues
- Pipelining improves CPU instruction throughput
- It does not reduce the execution time of an individual instruction
- It slightly increases the execution time of an individual instruction
- Overhead in the control of the pipeline
- Pipeline register delay and clock skew (Appendix A-10)
- These limit the practical depth of a pipeline
- A program runs faster and has lower total execution time, even though no single instruction runs faster
29Benefit Example
From the viewpoint of reduced clock cycle (i.e., CPI = 1)
- Unpipelined DLX
- 5 steps, taking 50, 50, 60, 50, 50 ns respectively
- Hence total instruction time = 260 ns (one clock cycle)
- Looks like a 5-stage pipeline
- But there are parasites everywhere
- Assume 5 ns is added to the slowest stage for the extra latches
- Primarily due to set-up and hold times
- Hence (assuming no stage/step improvement)
- Must run at slowest stage + parasites = 60 + 5 = 65 ns/stage
- In steady state (no exceptions) an instruction completes every 65 ns
- Speedup = 260/65 = 4x improvement
30Benefit Example (Cont.)
From the viewpoint of reduced CPI
- Unpipelined DLX
- 10-ns clock cycle
- Clock cycles per instruction: ALU 4 (40%), branches 4 (20%), memory 5 (40%)
- Average instruction execution time = clock cycle × average CPI = 10 ns × ((40% + 20%) × 4 + 40% × 5) = 10 ns × 4.4 = 44 ns
- Pipelined DLX
- 1 ns of overhead added to the clock → 11-ns clock cycle
- 11 ns is also the average instruction execution time
- Speedup from pipelining = 44 ns / 11 ns = 4 times
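The same check for the reduced-CPI view, again using only the slide's numbers:

```python
unpipelined_clock_ns = 10

# Instruction mix: ALU 40% at 4 cycles, branches 20% at 4, memory 40% at 5.
avg_cpi = (40 * 4 + 20 * 4 + 40 * 5) / 100     # 4.4 cycles per instruction
avg_instr_ns = unpipelined_clock_ns * avg_cpi  # 44 ns

pipelined_clock_ns = unpipelined_clock_ns + 1  # 1 ns pipeline overhead
speedup = avg_instr_ns / pipelined_clock_ns
print(round(speedup, 3))  # 4.0
```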
31Pipeline Hazards
- Pipeline hazards prevent the next instruction in the instruction stream from executing during its designated clock cycle
- Hazards reduce pipeline performance from the ideal speedup
32Pipeline Hazards
- Structural hazards
- Caused by resource conflicts
- Possible to avoid by adding resources, but that may be too costly
- Data hazards
- An instruction depends on the result of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline
- Can be mitigated somewhat by a smart compiler
- Control hazards
- When the PC does not simply get incremented
- Branches and jumps; not too bad
33Hazards Cause Stalls: Two Policy Choices
- How about just stalling all stages?
- OK, but the problem is usually an adjacent-stage conflict
- Hence nothing moves and the stall condition never clears
- A cheap option, but it does not work
- Stall later instructions, let earlier ones progress
- Instructions issued later than the stalled instruction are also stalled
- Instructions issued earlier than the stalled instruction must continue
- We will see in Chapters 3 and 4 that we can reorder the instructions, or let the instructions after the stalled instruction go on, to reduce the impact of hazards
34Structural Hazards
- If some combination of instructions cannot be accommodated because of resource conflicts, the machine is said to have a structural hazard
- Some functional unit is not fully pipelined
- Some resource has not been duplicated enough to allow all combinations of instructions in the pipeline to execute
- Single-ported register file: conflicts with the needs of multiple stages
- Memory fetch: may be needed in both the IF and MEM stages
- The pipeline stalls instructions until the required unit is available
- A stall is commonly called a pipeline bubble, or just a bubble
35Structural Hazard Example
36Removing the Structural Hazard
No real hazard if inst1 is not a load or store (only load/store/branch use stage 4)
37Pipeline Stalled for a Structural Hazard (Another
View)
38Calculating Stall Effects
Ignore pipeline overhead and assume a balanced pipeline, with all instructions taking the same number of cycles.
From the viewpoint of decreasing CPI, CPI unpipelined = pipeline depth, therefore:
Speedup = CPI unpipelined / CPI pipelined = pipeline depth / (1 + pipeline stall cycles per instruction)
39Calculating Stall Effects (Cont.)
From the viewpoint of decreasing clock cycle time, CPI = 1 and clock cycle pipelined = clock cycle unpipelined / pipeline depth, therefore:
Speedup = 1 / (1 + pipeline stall cycles per instruction) × pipeline depth
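Both viewpoints reduce to the same function of pipeline depth and stall cycles; a tiny helper (the function name is mine) makes that dependence visible:

```python
def pipeline_speedup(depth, stall_cycles_per_instr):
    """Speedup over the unpipelined machine, assuming balanced stages
    and no pipelining overhead (the slides' idealization)."""
    return depth / (1 + stall_cycles_per_instr)

print(pipeline_speedup(5, 0.0))            # 5.0: ideal 5-stage pipeline
print(round(pipeline_speedup(5, 0.5), 2))  # 3.33: one stall every other instruction
```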
40Example: Dual-port vs. Single-port Memory
- Machine A: dual-ported memory
- Machine B: single-ported memory, but its pipelined implementation has a 1.05 times faster clock rate
- Ideal CPI = 1 for both
- Loads are 40% of the instructions executed
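A sketch of the comparison this example sets up, assuming the single-ported Machine B takes one stall cycle per load (the structural hazard) while Machine A never stalls:

```python
load_frac = 0.40

clock_a = 1.0         # Machine A's clock cycle, normalized
clock_b = 1.0 / 1.05  # Machine B's clock is 1.05x faster

avg_time_a = 1.0 * clock_a                    # ideal CPI of 1
avg_time_b = (1.0 + load_frac * 1) * clock_b  # CPI = 1.4 from load stalls

print(round(avg_time_b / avg_time_a, 2))  # 1.33: A is ~1.3x faster anyway
```

The faster clock cannot make up for stalling on 40% of the instructions.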
41Why Would a Designer Allow a Structural Hazard?
- A machine without structural hazards will always have a lower CPI (if all other factors are equal)
- So why would a designer allow a structural hazard?
- To reduce cost
- Pipelining or duplicating all the functional units may be too costly
42Why Would a Designer Allow a Structural Hazard? (Cont.)
- DLX implementation with an FP multiply unit but no pipelining of it
- Accepts a new multiply every five clock cycles (the initiation interval)
- How does the structural hazard impact mdljdp2?
- mdljdp2 has 14% FP multiplications
- The DLX implementation can handle up to 20% FP multiplications
- If the FP multiplications in mdljdp2 are not clustered but distributed uniformly → the performance impact is very limited
- If the FP multiplications in mdljdp2 are all clustered without intervening instructions, and 14% of instructions take 5 cycles each → CPI increases from 1 to 1.7
- In practice, the impact of this structural hazard is < 0.03 (data hazards have a more severe impact!!!)
43Data Hazards
44Introduction
- Data hazards occur when the pipeline changes the order of read/write accesses to operands so that the order differs from the order seen when executing instructions sequentially on an unpipelined machine
- Example: a later instruction uses a result that has not yet been produced by an earlier instruction
- Example
- ADD R1, R2, R3
- SUB R4, R1, R5
- AND R6, R1, R7
- OR R8, R1, R9
- XOR R10, R1, R11
R1 ← R2 + R3: R1 is produced by the first instruction and used in every subsequent instruction
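The hazard pattern in this example can be found mechanically. The sketch below flags a RAW hazard whenever one of the next two instructions reads the producer's destination; with the split-cycle register file of slide 23, the instruction three slots later already reads the freshly written value (the encoding and names are mine):

```python
# Each instruction: (name, dest, (src1, src2)).
program = [
    ("ADD", "R1",  ("R2", "R3")),
    ("SUB", "R4",  ("R1", "R5")),   # distance 1: hazard
    ("AND", "R6",  ("R1", "R7")),   # distance 2: hazard
    ("OR",  "R8",  ("R1", "R9")),   # distance 3: safe (write then read in the same cycle)
    ("XOR", "R10", ("R1", "R11")),  # distance 4: safe
]

hazards = []
for i, (_, dest, _) in enumerate(program):
    # Only the next 2 instructions read R1 before WB has written it.
    for j in range(i + 1, min(i + 3, len(program))):
        if dest in program[j][2]:
            hazards.append((program[i][0], program[j][0]))

print(hazards)  # [('ADD', 'SUB'), ('ADD', 'AND')]
```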
45The use of the result of ADD in the next three
instructions causes a hazard, since the register
is not written until after those instructions
read it
46Forwarding (also called bypassing, shorting, or short-circuiting)
- Key: keep the ALU result around
- Example
- ADD R1, R2, R3
- SUB R4, R1, R5
- How do we handle this in general?
- The forwarded value can be at the ALU output or at the MEM-stage output
- ADD produces the R1 value at the ALU output
- SUB needs it again at the ALU input
47Forwarding (Cont.)
- Use the code on slide 44 as an example
- Forward the result from where ADD produces it (the EX/MEM register) to where SUB needs it (the ALU input latch)
- Forwarding works as follows
- The ALU result from the EX/MEM register is fed back to the ALU input latch
- If the forwarding hardware detects that the previous ALU operation has written the register corresponding to a source for the current ALU operation, control logic selects the forwarded result as the ALU input rather than the value read from the register file
- Generalization of forwarding
- Pass a result directly to the functional unit that requires it: a result is forwarded from the pipeline register corresponding to the output of one unit to the input of another
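The control logic described above amounts to a priority selection at each ALU input. A simplified sketch (the names are invented; real hardware would also check write-enable and the R0 special case):

```python
def select_alu_input(src_reg, regfile_value, ex_mem, mem_wb):
    """Pick the value for one ALU input.

    ex_mem and mem_wb are (dest_reg, value) pairs for in-flight results,
    or None. The newest producer (EX/MEM) wins over the older one (MEM/WB).
    """
    if ex_mem is not None and ex_mem[0] == src_reg:
        return ex_mem[1]   # forward from EX/MEM (most recent result)
    if mem_wb is not None and mem_wb[0] == src_reg:
        return mem_wb[1]   # forward from MEM/WB
    return regfile_value   # no hazard: use the register file

# ADD R1,R2,R3 just left EX; SUB R4,R1,R5 is entering EX.
assert select_alu_input("R1", 0, ("R1", 42), None) == 42
# Two instructions later, the same result arrives from MEM/WB instead.
assert select_alu_input("R1", 0, None, ("R1", 42)) == 42
# Unrelated source register: no forwarding.
assert select_alu_input("R5", 7, ("R1", 42), None) == 7
```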
48Result With Forwarding
ADD R1, R2, R3
SUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11
49Multiplexing Issues in Forwarding
50Another Forwarding Example
- Example
- ADD R1, R2, R3
- LW R4, 0(R1)
- SW 12(R1), R4
- Forwarding result: next page
51[Figure: cycle-by-cycle register transfers with forwarding:
ADD: A ← R2, B ← R3; ALUO ← A + B (produces R1); R1 ← ALUO
LW: A ← R1, B ← R4, Imm ← 0; ALUO ← A + Imm (uses forwarded R1); LMD ← Mem[ALUO] (produces R4); R4 ← LMD
SW: A ← R1, B ← R4, Imm ← 12; ALUO ← A + Imm (uses forwarded R1); Mem[ALUO] ← B (uses forwarded R4)]
52When Forwarding Fails
[Figure: a load of R1 (LMD ← Mem[ALUO], R1 ← LMD) immediately followed by SUB (ALUO ← A - B), AND (ALUO ← A AND B), and OR (ALUO ← A OR B) instructions that all read R1; the loaded value is not yet available when the next instruction reaches the ALU]
53Stalls
- Some latencies can't be absorbed: the case in the previous slide
- Stalls are the result
- Pipeline interlock circuits are needed
- They detect a hazard and introduce bubbles until the hazard clears
- The CPI for stalled instructions bloats by the number of bubbles
- Bubbles cause the forwarding paths to change
- In MIPS/DLX, if the instruction after a load uses the load result, a one-clock-cycle stall will occur!
54Bubbles and new Forwarding Paths
55Handling Stalls
Hardware vs. Software
- Hardware: pipeline interlocks
- Must detect when required data cannot be provided
- Stall stages to create a bubble
- Software: pipelining, or instruction scheduling
- Performed by a smart compiler
Unscheduled: LW RB, B; LW RC, C; ADD RA, RB, RC; SW A, RA; LW RE, E; LW RF, F; SUB RD, RE, RF; SW D, RD
Scheduled: LW RB, B; LW RC, C; LW RE, E; ADD RA, RB, RC; LW RF, F; SW A, RA; SUB RD, RE, RF; SW D, RD
(pipeline scheduling of A = B + C; D = E - F)
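The benefit of this scheduling can be counted with a toy stall model: one bubble whenever an instruction uses a register loaded by the immediately preceding LW (the encoding and helper are mine):

```python
def load_use_stalls(program):
    """program: list of (op, dest, srcs). Count one stall whenever an
    instruction reads a register loaded by the instruction just before it."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev[0] == "LW" and prev[1] in cur[2]:
            stalls += 1
    return stalls

unscheduled = [
    ("LW", "RB", ()), ("LW", "RC", ()),
    ("ADD", "RA", ("RB", "RC")), ("SW", None, ("RA",)),
    ("LW", "RE", ()), ("LW", "RF", ()),
    ("SUB", "RD", ("RE", "RF")), ("SW", None, ("RD",)),
]
scheduled = [
    ("LW", "RB", ()), ("LW", "RC", ()), ("LW", "RE", ()),
    ("ADD", "RA", ("RB", "RC")), ("LW", "RF", ()),
    ("SW", None, ("RA",)), ("SUB", "RD", ("RE", "RF")),
    ("SW", None, ("RD",)),
]

print(load_use_stalls(unscheduled))  # 2 (ADD after LW RC, SUB after LW RF)
print(load_use_stalls(scheduled))    # 0
```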
56Memory References May Also Cause Data Hazards
- The previous examples are with registers, but it is also possible for a pair of instructions to create a dependence by reading/writing the same memory location
- But the latter is impossible for MIPS
- Why??? Because data memory references only occur at stage 4, so they always stay in program order
57Data Hazard Forms
(i occurs before j in program execution order)
- RAW (read after write)
- j reads before i writes, hence j gets the wrong (old) value
- The most common form of data hazard
- As we have seen, forwarding can overcome this one
- WAW (write after write)
- Instructions i then j
- j writes before i writes, leaving the incorrect value
- Can this happen in MIPS? Why?
- WAW can happen only in pipelines that write in more than one pipe stage (or allow an instruction to proceed even when a previous instruction is stalled)
58Data Hazard Forms (Cont.)
- WAR (write after read)
- i then j is the intended order
- j writes before i reads; i ends up with the incorrect new value
- Is this a problem in MIPS? Why?
- It can happen only when some instructions write results early in the pipe and others read a source late in the pipe
- RAR (read after read)
- Not a hazard
59MIPS Ordering
- Some things are not a problem
- MIPS has only a single memory-write stage and a single register-write stage
- Hence the WAW ordering requirement is preserved
- However, things can get a lot worse
- And will, when we look at varying operational latencies
- For example, floating-point instructions in MIPS
- WAR and MIPS ordering
- Writing happens late in the pipe
- Reading happens early
- Hence no WAR problems
- However, other machines might exhibit this problem
60Control Hazards
61Introduction
- Control hazards: how does a branch influence the pipeline?
- The problem is more complex; we need 2 things
- The branch target (taken means not PC + 4; not taken means the condition fails) (MEM)
- A valid condition; in the DLX case, the result of the zero-detect unit (EX)
- Both happen late in the pipe
- How to deal with branches?
- Stall the pipeline as soon as we detect the branch (ID), and keep it stalled until we reach the MEM stage
- A three-cycle stall
- The first IF is essentially a stall (when the branch is taken)
- Consider a 30% branch frequency and an ideal CPI of 1
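Plugging in those last numbers shows why a three-cycle branch stall hurts so much:

```python
branch_freq = 0.30
branch_penalty = 3  # cycles stalled until the branch resolves in MEM

cpi = 1 + branch_freq * branch_penalty
print(round(cpi, 2))  # 1.9: nearly half the ideal throughput is lost
```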
62DLX Pipeline Re-visited
63A branch causes a 3-cycle stall in the DLX
pipeline
64Branch Delay Reduction
Branch delay is the length of the control hazard
- Hardware mechanisms
- Find out whether the branch is taken or not taken earlier in the pipeline
- Compute the taken PC earlier
- With another adder
- The BTA (branch target address) can be computed during the ID stage
- Then the BTA and the normal PC mechanism cover both possibilities
- The choice depends on the instruction and the condition, also known after the ID stage
- Software mechanisms
- Design your ISA properly
- e.g., BNEZ and BEQZ on DLX permit conditions to be known during ID
- Do instruction scheduling into the branch delay slots
- Know the likelihood of taken vs. not-taken branches (statistics)
- Improves the chances of guessing correctly
- Varies with application and instruction placement
65New Improved DLX Pipeline
[Figure: revised DLX pipeline with the zero test and the branch-target adder (NPC + Imm) moved into the ID stage]
66Hardware Mechanism
- Move the zero test to the ID/RF stage
- Add an adder to calculate the new PC in the ID/RF stage
- 1 clock cycle penalty for a branch, versus 3
- Note: an ALU instruction followed by a branch on the result of that instruction will incur a data hazard stall
- Example
- SUB R1, R2, R3: IF-ID-EX-MEM-WB
- BEQZ R1, 100: IF-stall-ID-EX-MEM-WB
67Branch Behavior in Programs
- Integer benchmarks
- Conditional branch frequencies of 14% to 16%
- Much lower unconditional branch frequencies
- FP benchmarks
- Conditional branch frequencies of 3% to 12%
- Forward branches dominate backward branches (3.7:1)
- 67% of conditional branches are taken on average
- 60% of forward branches are taken on average
- 85% of backward branches are taken on average (usually loops)
68Control Hazard Avoidance
- Simplest scheme
- Freeze the pipe until you know the condition and the branch target
- Cheap but slow
- Too slow, since we'd negate half of the pipeline speedup with 2 or 3 bubbles (old vs. new DLX pipeline designs)
- Predict not taken (47% of DLX branches are not taken on average)
- Make sure any state change (destructive phase) is deferred until you know whether you guessed right
- If not, then back out or flush
- Predict taken (53% of DLX branches are taken on average)
- Of no use in DLX/MIPS (target address and branch outcome are known at the same stage)
- Or let the compiler decide: same options
69Predict-Not-Taken
A Stall indeed
70Delayed Branch
- Delayed branch → make the stall cycle useful
- Add delay slots: branch penalty = length of the branch delay = 1 for DLX/MIPS
- Instructions in the delay slots are executed whether or not the branch is taken
- See if the compiler can schedule something useful into these slots
- Hope that filled slots actually help advance the computation
- A branch delay of length n:
branch instruction
sequential successor 1
sequential successor 2
........
sequential successor n (always executed!)
branch target if taken
71Delayed-Branch Behavior
72Delayed Branch (Cont.)
- Where do we get instructions to fill the branch delay slot?
- From before the branch instruction
- From the target address: only valuable when the branch is taken
- From the fall-through: only valuable when the branch is not taken
- When the slots cannot be scheduled, they are filled with no-op instructions (indeed, stalls!!)
- Canceling branches allow more slots to be filled
73Scheduling the branch-delay slot
[Figure: e.g., moving SUB R4, R5, R6 into the delay slot]
74Delay-Branch Scheduling Schemes and Their
Requirements
75Delayed Branch (Cont.)
- Limitations on delayed-branch scheduling
- Restrictions on the instructions that can be scheduled into delay slots
- The ability to predict at compile time whether a branch is likely to be taken
- Delayed branches are an architecturally visible feature
- Advantage: use compiler scheduling to reduce branch penalties
- Disadvantage: exposes an aspect of the implementation that is likely to change
- Delayed branches are less useful for longer branch delays
- The longer delay cannot easily be hidden → change to a hardware scheme
76Canceling (Nullifying) Branches
- Eliminate the requirements on the instructions placed in the delay slot, enabling the compiler to fill from the target or the fall-through without meeting those requirements
- Idea
- Associate with each branch instruction a predicted direction
- If the branch goes as predicted, nothing changes
- If it goes in the unpredicted direction, nullify all or some of the delay-slot instructions
- The result is more freedom for the compiler's delay-slot scheduling
- A common approach in HP's PA processors
77Delayed and canceling delayed branches allow control hazards to be hidden 70% of the time
78Performance of Delayed and Canceling Branches (Another View)
On average, 30% of branch delay slots are wasted
79Evaluating Branch Alternatives
See Appendix A-24 to A-26 for an example of branch evaluation
- With ideal CPI = 1 and stalls = frequency × penalty
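With that formula, comparing schemes is one line per scheme. The branch frequency and the per-scheme effective penalties below are illustrative assumptions, not the appendix's data:

```python
def pipeline_speedup(depth, branch_freq, effective_penalty):
    # Ideal CPI = 1; stalls = branch frequency x effective penalty per branch.
    return depth / (1 + branch_freq * effective_penalty)

# Assumed: 14% branch frequency, 5-deep pipeline, made-up effective penalties.
for scheme, penalty in [("stall always", 3.0),
                        ("predict not taken", 1.0),
                        ("delayed branch", 0.5)]:
    print(scheme, round(pipeline_speedup(5, 0.14, penalty), 2))
```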
80Static Branch Prediction Using Compiler Technology
- How to statically predict branches?
- By examination of program behavior
- Always predict taken (on average, 67% are taken)
- Mis-prediction rate varies widely (9%-59%)
- Predict backward branches taken, forward branches untaken (mis-prediction rate 60%-70%)
- Profile-based predictor: use profile information collected from earlier runs
- Simplest is the basic one-bit idea
- Easily extends to use more bits
- A definite win for some regular applications
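A one-bit profile-based predictor can be sketched in a few lines: record each branch's majority direction in a training run, then predict that direction thereafter (the function name and the toy branch traces are invented for illustration):

```python
from collections import defaultdict

def profile_predict(training_run, test_run):
    """One-bit profile-based prediction: learn each branch's majority
    direction from (pc, taken) pairs, then score it on a test run."""
    counts = defaultdict(int)
    for pc, taken in training_run:
        counts[pc] += 1 if taken else -1
    predict = {pc: c >= 0 for pc, c in counts.items()}  # one bit per branch

    misses = sum(1 for pc, taken in test_run if predict.get(pc, True) != taken)
    return misses / len(test_run)

# Toy trace: branch at pc=8 is mostly taken, branch at pc=16 mostly not.
train = [(8, True)] * 9 + [(8, False)] + [(16, False)] * 8 + [(16, True)] * 2
rate = profile_predict(train, train)
print(rate)  # 0.15: 3 mispredictions out of 20 when replaying the profile
```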
81Mis-prediction Rate for a Profile-based Predictor
FP is better than integer
82Predict-taken vs. Profile-based Predictor
Instructions executed between mispredictions: about 20 for predict-taken, about 110 for profile-based
Standard deviations are large
83Performance of the DLX Integer Pipeline
% of all cycles due to control and data hazard stalls:
Stall instructions 9%-23%; CPI 1.09-1.23; average CPI 1.11; improvement = 5/1.11 ≈ 4.5x