Title: CS 2200 Lecture 09a Pipelining
1 CS 2200 Lecture 09a: Pipelining
- (Lectures based on the work of Jay Brockman,
Sharon Hu, Randy Katz, Peter Kogge, Bill Leahy,
Ken MacKenzie, Richard Murphy, and Michael
Niemier)
2 Class demo
- Can someone come to the front of the class and explain to me how to do 5 loads of laundry?
- I need 1 person to actually do the laundry
- and 5 more to be, umm... the laundry.
3 Short review: Single cycle MIPS machine
[Figure: single-cycle MIPS datapath]
4 Short review: Non-MIPS single cycle machine
[Figure: single-cycle datapath with signals a, b, c, x, y, bx, and cx2]
5 Short review: Multi cycle MIPS machine
[Figure: multi-cycle MIPS datapath]
6 Short review: Multi cycle LC2200 machine
[Figure: single-bus LC2200 datapath with registers A and B (LdA/LdB), a 1024 x 32-bit memory (Addr, Din, Dout, WrMEM), a 16 x 32-bit register file (regno, Din, Dout, WrREG), IR31..0, IR19..0 through sign extend, and an ALU with func codes 00 ADD, 01 NAND, 10 A - B, 11 A + 1]
7 Other Processor Designs (with more than one bus)
- One-bus is simple, recipe-oriented.
- Alternatives
- add parallel busses for data transfers that occur together
- e.g. ALU input/input/output
- add parallel compute units for operations that occur together
- e.g. PC+1 in parallel with everything else
- mux paths together as necessary
- (somewhat ad-hoc)
8 Other Processor Designs (with more than one bus)
- Add busses! One per ALU port
[Figure: datapath with a 3-port register file (regnos) and a separate bus per ALU port]
9 Other Processor Designs (with more than one bus)
- Fetch unit performs PC+1 and instruction lookup
[Figure: separate fetch unit with its own instruction memory]
10 Cycles Per Instruction?
- Well, you have a choice!
- CPI = 1
- one long cycle
- Tclock = 5 ns?
11 Cycles Per Instruction?
- Well, you have a choice!
- CPI = 1
- one long cycle!
- Tclock = 5 ns
- CPI = 5
- five short cycles
- Tclock = 1 ns
- 5 ns/instruction either way
12 Transition
- Can we do better?
- What if we have 5 instructions?
- With single cycle, 25 ns needed
- With multi cycle, 25 ns needed
- But it's also possible to do it in less than 10 ns (see the sketch below)
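A rough sketch of where the "less than 10 ns" figure comes from, assuming the five 1 ns steps from the previous slide (Python, illustrative only):

    N = 5          # instructions
    S = 5          # pipeline stages
    T = 1          # ns per stage (so the single-cycle clock is S*T = 5 ns)
    single_cycle = N * S * T        # 5 instructions x 5 ns each       = 25 ns
    multi_cycle  = N * S * T        # 5 instructions x 5 cycles x 1 ns = 25 ns
    pipelined    = (S + N - 1) * T  # fill the pipe, then 1 result/ns  =  9 ns
    print(single_cycle, multi_cycle, pipelined)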
13 Pipelining
14 Pipelining: It's Natural!
- Laundry Example
- Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
- Washer takes 30 minutes
- Dryer takes 40 minutes
- Folder takes 20 minutes
15 Sequential Laundry
[Figure: timeline from 6 PM to midnight, task order down the side; each load spends 30 min in the washer, 40 min in the dryer, and 20 min folding, with the four loads done back to back]
- Sequential laundry takes 6 hours for 4 loads
- If they learned pipelining, how long would laundry take?
16 Pipelined Laundry: Start work ASAP
[Figure: timeline from 6 PM to midnight, task order down the side, with the four loads overlapped across the washer, dryer, and folding stages]
Note: More time to go out later that night
- Pipelined laundry takes 3.5 hours for 4 loads
17 Pipelining Lessons
- Multiple tasks operating simultaneously
- Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
- Pipeline rate limited by the slowest pipeline stage
- Potential speedup = number of pipe stages
- Unbalanced lengths of pipe stages reduce speedup
- Also, need time to fill and drain the pipeline.
[Figure: pipelined laundry timeline starting at 6 PM, task order down the side]
18 Pipelining: Some terms
- If you're doing laundry or implementing a microprocessor, each stage where something is done is called a pipe stage
- In the laundry example, the washer, dryer, and folding table are pipe stages; clothes enter at one end, exit at the other
- In a microprocessor, instructions enter at one end and have been executed when they leave
- Another example: an auto assembly line
- Throughput is how often stuff comes out of a pipeline
19 More on throughput
- All pipe stages are connected, so everything must move from one to another at the same time
- How fast this happens is a function of the time it takes for the slowest stage to finish
- Example: If laundry takes 30 min. to wash but 40 min. to dry, it'll sit idle in the washer for 10 min.
- In a microprocessor, this is the machine cycle time (usually 1 clock)
- If each pipe stage is perfectly balanced time-wise:
- Time/Instruction = Time/Instruction on the unpipelined machine / # of pipe stages
- Therefore speedup from pipelining = # of pipe stages
- But of course nothing's perfect!
20 So really, how is pipelining faster?
- Pipelining reduces average execution time/instruction
- Could be viewed as decreasing the # of clock cycles per instruction (CPI)
- In a perfect pipeline, you should see 1 instruction result each cycle, even though that instruction actually required multiple pipe stages/multiple cycles
- Pipelining is an implementation technique, not visible to the programmer
- (a good thing b/c it's one less thing a programmer has to worry about!)
21 More technical detail
- General characteristics
- Complete process broken into S independent steps
- Each step done independently at a stage
- Stages arranged in linear order to match the process
- As each stage finishes its piece, it passes it to the next stage
- Time for 1 complete processing sequence = sum of all stage times
- BUT the rate at which we can initiate new work is set by the max of any stage time
22 More technical detail
- If the times for all S stages are equal to T
- Time for one initiation to complete = still S x T
- Time between 2 initiations = T, not S x T
- Initiations per second = 1/T
- Pipelining: Overlap multiple executions of the same sequence
- Improves THROUGHPUT, not the time to perform a single operation
- Other examples
- Automobile assembly plant, chemical factory, garden hose, cooking
23 More technical detail
- Book's approach to drawing pipeline timing diagrams
- Time runs left-to-right, in units of stage time
- Each row below corresponds to a distinct initiation
- Boundary b/t 2 column entries = pipeline register
- (i.e. hamper)
- Must look at column contents to see what stage is doing what
Time for N initiations to complete = N x T + (S-1) x T
Throughput: Time per initiation = T + (S-1) x T/N -> T as N grows!
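A small sketch of the two relations above, assuming perfectly balanced stages of time T and ignoring pipeline-register overhead (Python, illustrative only):

    def pipeline_time(n, s, t):
        # total time for n initiations through an s-stage pipeline, stage time t
        return n * t + (s - 1) * t

    for n in (1, 10, 1000):
        total = pipeline_time(n, s=5, t=1.0)
        print(n, total, total / n)   # per-initiation time approaches t as n grows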
24 Ideal digital system pipeline speedup
Unpipelined:
[Figure: four blocks of combinational logic, each with delay t, between input and output latches]
- delay for 1 piece of data = 4t + latch setup (assume small)
- approximate delay for 1000 pieces of data = 4000t
Pipelined:
[Figure: the same four blocks of combinational logic, delay t each, with a latch after every block]
- delay for 1 piece of data = 4(t + latch setup)
- approximate delay for 1000 pieces of data = 3t + 1000t = 1003t
- speedup for 1000 pieces of data = 4000 / 1003, approximately 4
- Ideal speedup = # of pipeline stages
25 Example
- IF: The instruction fetch sequence (2 ns)
- ID: Decode and fetch register operands (1 ns)
- EX: Perform ALU operation (2 ns)
- MEM: Perform data memory operation (2 ns)
- WB: Write result (if any) back into reg. file (1 ns)
- Hmm... 5 stages -> a 5X performance increase over a single cycle design?
- Electrical design challenge
- Can we make HW do each stage in the same time?
26 Example
[Figure: the five stage times above (2, 1, 2, 2, 1 ns) drawn to scale]
- Total time = 8 ns for one initiation
- Try to overlap successive initiations: doesn't line up!
- Possible solution: insert 1 ns after ID to allow alignment
- Structural Hazard
27 More technical detail
- Delay ID by 1 ns also
[Figure: overlapped initiations; one initiation = 9 ns, 4 initiations = 15 ns, no structural hazard]
- One initiation = 9 ns or 10 ns (depending on how you look at it)
- 4 initiations = 15 ns -> average of 1 initiation every 3.75 ns
- How long for 1000 initiations?
- What is the equivalent time between initiations?
- What is the effective speedup?
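One way to answer these questions, assuming (as the numbers above suggest) that the first initiation takes 9 ns, every later initiation finishes 2 ns after the previous one, and an unpipelined instruction takes 8 ns:

    first_initiation = 9   # ns, from this slide
    per_additional   = 2   # ns; 9 + 3*2 = 15 ns for 4 initiations, matching above
    unpipelined      = 8   # ns per instruction, from the earlier example
    n = 1000
    total   = first_initiation + (n - 1) * per_additional   # 2007 ns
    between = total / n                                      # ~2.0 ns/initiation
    speedup = (n * unpipelined) / total                      # ~3.99x
    print(total, between, speedup)

Under those assumptions the effective time between initiations approaches the 2 ns stage time, and the speedup approaches 4.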
28 Transition
29 The Big Picture: Literally!
30 The new look: dataflow
[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB; PC and instruction memory; an adder for PC + 4; a comparator producing "branch taken"; the register file addressed by instruction fields IR6..10 and IR11..15; sign extend (16 to 32 bits); the ALU; data memory; MEM/WB.IR; and several muxes]
Data must be stored from one stage to the next in pipeline registers/latches. They hold temporary values between clocks and the information needed for execution.
31 Another way to look at it
[Figure: pipeline timing diagram with clock number across the top, time left to right, and program execution order (in instructions) down the side]
32 So, what about the details?
- In each cycle, a new instruction is fetched and begins its 5 cycle execution
- In a perfect world (pipeline), performance improved 5 times over!
- So, that's it, huh? Hardly!!!
- What else do we have to worry about?
- Must know what's going on in every cycle of the machine
- What if 2 instructions try to use the same resource at the same time?
- (LOTS more on this later)
- Separate instruction/data memories, multiple register ports, etc. help avoid this
33 So seriously, what does pipelining do for us?
- For starters, pipelining does not reduce the execution time of a single instruction.
- Actually, b/c of the overhead of controlling the pipeline, execution time usually increases!
- So why do it?
- Pipelining increases CPU instruction throughput.
- The # of instructions executed in some given time frame should increase b/c of pipelining
- Thus, a program runs faster even though all instructions actually execute a little slower. Crazy, huh?
34 Limits, limits, limits
- So, now that the ideal stuff is out of the way, let's look at how a pipeline REALLY works
- Pipelines are slowed b/c of
- Pipeline latency
- Imbalance of pipeline stages
- (Think: A chain is only as strong as its weakest link)
- Well, a pipeline is only as fast as its slowest stage
- Pipeline overhead (from where?)
- Register delay from pipe stage latches
- Clock skew: Once a clock cycle is as small as the sum of the clock skew and latch overhead, you can't get any work done
35 Speed Up Equation for Pipelining
For a simple RISC pipeline, CPI = 1. W/microcode, unpipelined CPI = pipeline depth.
Single-cycle HW would have a slow clock.
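A hedged sketch of the comparison this slide summarizes (average instruction time = CPI x clock cycle time, before and after pipelining; the exact equation on the slide may differ):

    def speedup(cpi_unpiped, t_clk_unpiped, cpi_piped, t_clk_piped):
        return (cpi_unpiped * t_clk_unpiped) / (cpi_piped * t_clk_piped)

    # Microcoded/multicycle baseline: unpipelined CPI = pipeline depth, same clock.
    print(speedup(5, 1.0, 1.0, 1.0))   # -> 5.0 for a 5-deep pipeline, no stalls
    # Single-cycle baseline: CPI = 1 but a clock roughly 5x slower.
    print(speedup(1, 5.0, 1.0, 1.0))   # -> 5.0 again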
36 Transition
37 CS 2200 Lecture 09b: MIPS Pipelining Examples
- (Lectures based on the work of Jay Brockman,
Sharon Hu, Randy Katz, Peter Kogge, Bill Leahy,
Ken MacKenzie, Richard Murphy, and Michael
Niemier)
38 Executing Instructions in Pipelined Datapath
- Following charts describe 3 scenarios
- Processing of load word (lw) instruction
- Bug included in design (make SURE you understand the bug)
- Processing of lw
- Bug corrected (make SURE you understand the fix)
- Processing of lw followed in pipeline by sub
- (Sets the stage for discussion of HAZARDS and inter-instruction dependencies)
39 Load Word: Cycle 1
40 Load Word: Cycle 2
41 Load Word: Cycle 3
42 Load Word: Cycle 4
43 Load Word: Cycle 5
44 Load Word: Fixed Bug
45 A 2 instruction sequence
- Examine multiple-cycle and single-cycle diagrams for a sequence of 2 independent instructions
- (i.e. no common registers b/t them)
- lw $10, 9($1)
- sub $11, $2, $3
46 Single-cycle diagrams: cycle 1
47 Single-cycle diagrams: cycle 2
48 Single-cycle diagrams: cycle 3
49 Single-cycle diagrams: cycle 4
50 Single-cycle diagrams: cycle 5
51 Single-cycle diagrams: cycle 6
52 Pipelined Control
- Potentially very complicated; approach it methodically.
- Example (independent instructions)
- lw $10, 9($1)
- sub $11, $2, $3
- and $12, $4, $5
- or $13, $6, $7
- add $14, $8, $9
53 Pipelined Control
- Example (dependent instructions)
- ($2 used in sequential instructions)
- sub $2, $1, $3 (register $2 written by sub)
- add $12, $2, $5 (1st operand ($2) depends on sub)
- or $13, $6, $2 (2nd operand ($2) depends on sub)
- add $14, $2, $2 (1st and 2nd operands ($2) depend on sub)
- sw $15, 100($2) (index ($2) depends on sub)
- Problem
- write-back for sub won't occur until the 5th cycle
- First assume a sequence of independent instructions
- later, remove this assumption
54 Control signal summary
55 Questions about control signals
- The following discussion is relevant to a single instruction
- Q: Are all control signals active at the same time?
- A: ?
- Q: Can we generate all these signals at the same time?
- A: ?
56 Control lines by pipe stage
- Each data flow component is active in only one pipeline stage
- So, divide control signals into groups according to the active component
- 1. Instruction Fetch
- Always read instruction memory and write PC
- (basically nothing special)
- 2. Instruction Decode / Register Fetch
- Still nothing special to control
- (same action every time)
- 3. Execution (must decode control sigs from inst.)
- RegDst: does the target reg come from bits 20-16 or 15-11?
- ALUOp: how to control the ALU operation
- ALUSrc: does the 2nd ALU input come from the reg. file or sign ext.?
57 Control lines by pipe stage
- 4. Memory: likewise
- Branch: used to generate PCSrc
- PCSrc: does the PC get incremented or replaced by the output of the branch adder?
- MemRead: signals a read from memory
- MemWrite: signals a write to memory
- 5. Write Back: likewise
- MemToReg: does the value going back to the reg file come from the ALU or memory?
- RegWrite: is there in fact a register write back to perform?
58 Passing control w/pipe registers
- Analogy: send instructions with the car on the assembly line
- Install Corinthian leather interior on car #6 @ stage 3
59 Pipelined datapath w/control signals
60 CS 2200 Lecture X: Hazards
- (Lectures based on the work of Jay Brockman,
Sharon Hu, Randy Katz, Peter Kogge, Bill Leahy,
Ken MacKenzie, Richard Murphy, and Michael
Niemier)
61 The hazards of pipelining
- Pipeline hazards prevent the next instruction from executing during its designated clock cycle
- There are 3 classes of hazards
- Structural Hazards
- Arise from resource conflicts
- HW cannot support all possible combinations of instructions
- Data Hazards
- Occur when a given instruction depends on data from an instruction ahead of it in the pipeline
- Control Hazards
- Result from branches and other instructions that change the flow of the program (i.e. change the PC)
62 How do we deal with hazards?
- Often, the pipeline must be stalled
- Stalling the pipeline usually lets some instruction(s) in the pipeline proceed while another/others wait for data, a resource, etc.
- A note on terminology
- If we say an instruction was issued later than instruction x, we mean that it was issued after instruction x and is not as far along in the pipeline
- If we say an instruction was issued earlier than instruction x, we mean that it was issued before instruction x and is further along in the pipeline
63 Stalls and performance
- Stalls impede progress of a pipeline and result in deviation from 1 instruction executing/clock cycle
- Pipelining can be viewed to
- Decrease CPI or clock cycle time for an instruction
- Let's see what effect stalls have on CPI
- CPI pipelined
- = Ideal CPI + Pipeline stall cycles per instruction
- = 1 + Pipeline stall cycles per instruction
- Ignoring overhead and assuming stages are balanced
64 More pipeline performance issues
- Pipelining can appear to improve clock cycle time
- Can assume the CPI of both an unpipelined and a pipelined machine is 1
- This results in:
- If pipe stages are perfectly balanced, we assume no overhead
- the clock cycle on the pipelined machine is smaller than on the unpipelined machine by a factor equal to the pipeline depth.
65 Even more pipeline performance issues!
- This results in...
- Which leads to...
- If there are no stalls, the speedup equals the # of pipeline stages in the ideal case (see the sketch below)
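A minimal sketch of the relations referenced above, assuming balanced stages and no latch overhead, so the pipelined clock is the unpipelined clock divided by the depth (an assumption; the slide's own equations are not reproduced here):

    def pipeline_speedup(depth, stalls_per_instruction):
        # CPI_pipelined = 1 + stalls; the clock shrinks by a factor of `depth`
        return depth / (1 + stalls_per_instruction)

    print(pipeline_speedup(5, 0.0))   # ideal: 5x for 5 stages
    print(pipeline_speedup(5, 0.5))   # ~3.3x with half a stall cycle per instruction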
66 Structural hazards
- 1 way to avoid structural hazards is to duplicate resources
- i.e. an ALU to perform an arithmetic operation and an adder to increment the PC
- If not all possible combinations of instructions can be executed, structural hazards occur
- Most common instances of structural hazards
- When a functional unit is not fully pipelined
- When some resource is not duplicated enough
- Pipeline stalls result from hazards; CPI increases from its usual value of 1
67 An example of a structural hazard
[Pipeline diagram: a Load followed by Instructions 1-4, time running left to right]
What's the problem here?
68 How is it resolved?
[Pipeline diagram: a Load, Instruction 1, Instruction 2, a Stall, then Instruction 3, time running left to right]
The pipeline is generally stalled by inserting a bubble or NOP
69 Or, alternatively
[Table: pipeline stages by clock number]
- The LOAD instruction steals an instruction fetch cycle, which will cause the pipeline to stall.
- Thus, no instruction completes on clock cycle 8
70 A simple example
- The facts
- Data references constitute 40% of an instruction mix
- Ideal CPI of the pipelined machine is 1
- A machine with a structural hazard has a clock rate that's 1.05 times higher than a machine without the hazard.
- How much does this LOAD problem hurt us?
- Recall: Avg. Inst. Time = CPI x Clock Cycle Time
- = (1 + 0.4 x 1) x (Clock cycle time_ideal / 1.05)
- = approximately 1.3 x Clock cycle time_ideal
- Therefore the machine without the hazard is better
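The same arithmetic as a quick check:

    cpi_with_hazard   = 1 + 0.40 * 1   # 40% of instructions pay a 1-cycle stall
    cycle_with_hazard = 1 / 1.05       # its clock is 1.05x faster (shorter cycle)
    print(cpi_with_hazard * cycle_with_hazard)   # ~1.33x the ideal machine's time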
71 Remember the common case!
- All things being equal, a machine without structural hazards will always have a lower CPI.
- But, in some cases it may be better to allow them than to eliminate them.
- These are situations a computer architect might have to consider
- Is pipelining functional units or duplicating them costly in terms of HW?
- Does the structural hazard occur often?
- What's the common case???
72 Data hazards
- These exist because of pipelining
- Why do they exist???
- Pipelining changes the order of read/write accesses to operands
- Order differs from the order seen by sequentially executing instructions on an unpipelined machine
- Consider this example
- ADD R1, R2, R3
- SUB R4, R1, R5
- AND R6, R1, R7
- OR R8, R1, R9
- XOR R10, R1, R11
All instructions after ADD use the result of ADD.
For the DLX microprocessor, ADD writes the register in WB but SUB needs it in ID. This is a data hazard.
73 Illustrating a data hazard
[Pipeline diagram: ADD R1, R2, R3 followed by SUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9; XOR R10, R1, R11, each passing through its Mem, Reg, and DM stages, time running left to right]
The ADD instruction causes a hazard in the next 3 instructions b/c the register is not written until after those 3 read it.
74 Forwarding
- The problem illustrated on the previous slide can actually be solved relatively easily with forwarding
- In this example, the result of the ADD instruction is not really needed until after ADD actually produces it
- Can we move the result from the EX/MEM register to the beginning of the ALU (where SUB needs it)?
- Yes! Hence this slide!
- Generally speaking
- Forwarding occurs when a result is passed directly to the functional unit that requires it.
- The result goes from the output of one unit to the input of another
75 When can we forward?
[Pipeline diagram: ADD R1, R2, R3 followed by SUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9; XOR R10, R1, R11, time running left to right]
SUB gets its info. from the EX/MEM pipe register, AND gets its info. from the MEM/WB pipe register, and OR gets its info. by forwarding from the register file.
Rule of thumb: If a line goes forward you can do forwarding. If it's drawn backward, it's physically impossible.
76 HW Change for Forwarding
77 Data hazard specifics
- There are actually 3 different kinds of data hazards!
- Read After Write (RAW)
- Write After Write (WAW)
- Write After Read (WAR)
- We'll discuss/illustrate each on forthcoming slides. However, 1st a note on convention.
- Discussion of hazards will use generic instructions i and j.
- i is always issued before j.
- Thus, i will always be further along in the pipeline than j.
- With an in-order issue/in-order completion machine, we're not as concerned with WAW, WAR
78 Read after write (RAW) hazards
- With a RAW hazard, instruction j tries to read a source operand before instruction i writes it.
- Thus, j would incorrectly receive an old or incorrect value
- Graphically/Example:
- i: ADD R1, R2, R3
- j: SUB R4, R1, R6
- Instruction i is a write instruction issued before j
- Instruction j is a read instruction issued after i
- Can use stalling or forwarding to resolve this hazard
79 Write after write (WAW) hazards
- With a WAW hazard, instruction j tries to write an operand before instruction i writes it.
- The writes are performed in the wrong order, leaving the value written by the earlier instruction in the destination
- Graphically/Example:
- i: DIV F1, F2, F3
- j: SUB F1, F4, F6
- Instruction i is a write instruction issued before j
- Instruction j is a write instruction issued after i
80 Write after read (WAR) hazards
- With a WAR hazard, instruction j tries to write an operand before instruction i reads it.
- Instruction i would incorrectly receive a newer value of its operand
- Instead of getting the old value, it could receive some newer, undesired value
- Graphically/Example:
- i: DIV F7, F1, F3
- j: SUB F1, F4, F6
- Instruction i is a read instruction issued before j
- Instruction j is a write instruction issued after i
81 Forwarding: It doesn't always work
[Pipeline diagram: LW R1, 0(R2) followed by SUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9, each passing through its IM, Reg, and DM stages, time running left to right]
A load has a latency that forwarding can't solve. The pipeline must stall until the hazard is cleared (starting with the instruction that wants to use the data, until the source produces it).
To get the data to the subtract instruction we would need a time machine!
82 The solution, pictorially
[Pipeline diagram: LW R1, 0(R2) followed by SUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9, with a bubble inserted after the load, time running left to right]
Insertion of the bubble causes the # of cycles to complete this sequence to grow by 1
83 Data hazards and the compiler
- The compiler should be able to help eliminate some stalls caused by data hazards
- i.e. the compiler could avoid generating a LOAD instruction that is immediately followed by an instruction that uses the result in the LOAD's destination register.
- This technique is called pipeline/instruction scheduling
84 What about control logic?
- For the DLX integer pipeline, all data hazards can be checked during the ID phase of the pipeline
- If there is a data hazard, the instruction is stalled before it's issued
- Whether forwarding is needed can also be determined at this stage, and the control signals set
- If a hazard is detected, the control unit of the pipeline must stall the pipeline and prevent the instructions in IF, ID from advancing
- All control information is carried along in the pipeline registers, so only these fields must be changed
85 Some example situations
86 Detecting Data Hazards
87 Hazard Detection Logic
- Insert a bubble into the pipeline if any of these are true:
- ID/EX.RegWrite AND
- ((ID/EX.RegDst = 0 AND ID/EX.WriteRegRt = IF/ID.ReadRegRs) OR
- (ID/EX.RegDst = 1 AND ID/EX.WriteRegRd = IF/ID.ReadRegRs) OR
- (ID/EX.RegDst = 0 AND ID/EX.WriteRegRt = IF/ID.ReadRegRt) OR
- (ID/EX.RegDst = 1 AND ID/EX.WriteRegRd = IF/ID.ReadRegRt))
- OR EX/MEM.RegWrite AND
- ((EX/MEM.WriteReg = IF/ID.ReadRegRs) OR
- (EX/MEM.WriteReg = IF/ID.ReadRegRt))
- OR MEM/WB.RegWrite AND
- ((MEM/WB.WriteReg = IF/ID.ReadRegRs) OR
- (MEM/WB.WriteReg = IF/ID.ReadRegRt))
Notation: ID/EX.RegDst means the RegDst field of the ID/EX pipeline register
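The same predicate, sketched in Python; the field names follow the slide's notation, the EX/MEM clause is assumed to test EX/MEM.RegWrite, and this is illustrative rather than a full pipeline model:

    def must_stall(id_ex, ex_mem, mem_wb, if_id):
        # Return True if a bubble must be inserted this cycle.
        rs, rt = if_id["ReadRegRs"], if_id["ReadRegRt"]

        def id_ex_writes(reg):
            return id_ex["RegWrite"] and (
                (id_ex["RegDst"] == 0 and id_ex["WriteRegRt"] == reg) or
                (id_ex["RegDst"] == 1 and id_ex["WriteRegRd"] == reg))

        def ex_mem_writes(reg):
            return ex_mem["RegWrite"] and ex_mem["WriteReg"] == reg

        def mem_wb_writes(reg):
            return mem_wb["RegWrite"] and mem_wb["WriteReg"] == reg

        return any(check(reg)
                   for check in (id_ex_writes, ex_mem_writes, mem_wb_writes)
                   for reg in (rs, rt))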
88 How to Insert Bubbles
- If a hazard is detected
- Don't write to the PC or the IF/ID reg.; de-assert signals for a NOP
89 Incorporation of Hazard Detection Unit
90 Stall Ex. Cycle 1
91 Stall Ex. Cycle 2
92 Stall Ex. Cycle 3: 1st Bubble Inserted
93 Stall Ex. Cycle 4: 2nd Bubble Inserted
94 Stall Ex. Cycle 5: 3rd Bubble Inserted
95 Stall Ex. Cycle 6: End of Stall
96 Stall Ex. Cycle 7
97 Control Hazards
98 R-Type
[Datapath figure: PC, instruction memory, DPRF (register file), BEQ comparator, ALU (A), data memory, sign extend (SE), and muxes, labeled with the stages IF, ID, EX, MEM, WB]
99 Control Hazard on Branches: 2 Stage Stall?
- 10: beq r1, r3, 36
- 14: and r2, r3, r5
- 18: or r6, r1, r7
- 22: add r8, r1, r9
- 36: xor r10, r1, r11
100 Example
101 Scenario
- We have the following code segment
- lw R6, X(R0)
- beq R1, R2, SKIP
- add R1, R2, R3
- SKIP: add R5, R4, R1
- sw R7, X(R0)
- X: .word 5
102 lw R6,X(R0)
[Pipelined datapath diagram tracing the program above (lw R6,X(R0); beq R1,R2,SKIP; add R1,R2,R3; SKIP: add R5,R4,R1; sw R7,X(R0)); in this cycle lw R6,X(R0) has entered the pipeline]
103 lw R6,X(R0); beq R1,R2,SKIP
[Datapath diagram: beq follows lw into the pipeline]
104 lw R6,X(R0); beq R1,R2,SKIP; BUBBLE
[Datapath diagram] Note: Bubble because there is no branch prediction or delay-slot filling.
105 lw R6,X(R0); beq R1,R2,SKIP; BUBBLE; BUBBLE
[Datapath diagram] Second bubble because we're detecting BEQ in the 3rd stage.
106 lw R6,X(R0); BUBBLE; BUBBLE; add R1,R2,R3; beq R1,R2,SKIP
[Datapath diagram]
107 beq R1,R2,SKIP; add R5,R4,R1; BUBBLE; BUBBLE; add R1,R2,R3
[Datapath diagram]
108 BUBBLE; add R5,R4,R1; sw R7,X(R0); BUBBLE; add R1,R2,R3
[Datapath diagram, with the forwarding unit shown]
109 BUBBLE; add R1,R2,R3; add R5,R4,R1; sw R7,X(R0)
[Datapath diagram]
110 add R1,R2,R3; add R5,R4,R1; sw R7,X(R0)
[Datapath diagram]
111 add R5,R4,R1; sw R7,X(R0)
[Datapath diagram]
112 Dealing with Branch Hazards (more detail)
113 Branch Hazards
- So far, we've limited discussion of hazards to
- Arithmetic/logic operations
- Data transfers
- Also need to consider hazards involving branches
- Example:
- 40: beq $1, $3, 28
- 44: and $12, $2, $5
- 48: or $13, $6, $2
- 52: add $14, $2, $2
- 72: lw $4, 50($7)
- How long will it take before the branch decision takes effect?
- What happens in the meantime?
114 Branch signal determined in MEM stage
115 Pipeline impact on branch
- If the branch condition is true, must skip 44, 48, 52
- But, these have already started down the pipeline
- They will complete unless we do something about it
- How do we deal with this?
- We'll consider 2 possibilities
116 Dealing w/branch hazards: always stall
- Branch taken
- Wait 3 cycles
- No proper instructions in the pipeline
- Same delay as without stalls (no time lost)
117 Dealing w/branch hazards: always stall
- Branch not taken
- Still must wait 3 cycles
- Time lost
- Could have spent those cycles fetching and decoding the next instructions
118 Dealing w/branch hazards: assume branch not taken
- On average, branches are taken ½ the time
- If the branch is not taken
- Continue normal processing
- Else, if the branch is taken
- Need to flush the improper instructions from the pipeline
- Cuts overall time for branch processing in ½
119 Flushing unwanted instructions from the pipeline
- Useful to compare w/stalling the pipeline
- Simple stall: inject a bubble into the pipe at the ID stage only
- Change control to 0 in the ID stage
- Let bubbles percolate to the right
- Flushing the pipe: must change the inst. in IF, ID, and EX
- IF Stage
- Zero the instruction field of the IF/ID pipeline register
- Use new control signal IF.Flush
- ID Stage
- Use the existing bubble-injection mux that zeros control for stalls
- Signal ID.Flush is ORed w/the stall signal from the hazard detection unit
- EX Stage
- Add new muxes to zero the EX pipeline register control lines
- Both muxes controlled by a single EX.Flush signal
- Control determines when to flush
- Depends on the Opcode and the value of the branch condition
120 Inserting bubbles v. flushing pipeline
121 Assume branch not taken... and branch is not taken
- Execution proceeds normally; no penalty
122 Assume branch not taken... and branch is taken
- Bubbles injected into 3 stages during cycle 5
123 Reservation Table Picture
- Another way of looking at it
- Code: 40: beq $1, $3, 72; 44: and $12, $2, $5; 48: or $13, $6, $2; 52: add $14, $2, $2; 72: lw $4, 50($7)
- Assume Branch Not Taken and Correct: no penalty
- Assume Branch Not Taken and NOT Correct: 3 cycle penalty
124 Branch Penalty Impact
- Assume 16% of all instructions are branches
- 4% unconditional branches: 3 cycle penalty
- 12% conditional: 50% taken
- For a sequence of N instructions (assume N is large)
- N cycles to initiate each
- 3 x 0.04 x N delay cycles due to unconditional branches
- 0.5 x 3 x 0.12 x N delay cycles due to conditional taken
- Also, an extra 4 cycles for the pipeline to empty
- Total:
- 1.3N + 4 total cycles (or ~1.3 cycles/instruction) (CPI)
- 30% Performance Hit!!! (Bad thing)
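The same count as a quick check:

    N = 1_000_000
    cycles  = N                    # one initiation per instruction
    cycles += 3 * 0.04 * N         # unconditional branches: 3-cycle penalty each
    cycles += 0.5 * 3 * 0.12 * N   # conditional branches, taken half the time
    cycles += 4                    # extra cycles to drain the pipeline
    print(cycles / N)              # ~1.3 cycles per instruction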
125 Branch Penalty Impact
- Some solutions
- In the ISA: branches always execute the next 1 or 2 instructions
- An instruction so executed is said to be in a delay slot
- See the SPARC ISA
- In the organization: move the comparator to the ID stage and decide in the ID stage
- Reduces branch delay by 2 cycles
- Increases the cycle time
126 Branch Prediction
- Prior solutions are ugly
- Better (and more common): guess in the IF stage
- The technique is called branch prediction; it needs 2 parts
- Predictor to guess whether the instruction will branch (and to where)
- Recovery Mechanism: i.e. a way to fix your mistake
- Prior strategy
- Predictor: always guess the branch is never taken
- Recovery: flush instructions if the branch is taken
- Alternative: accumulate info. in the IF stage as to
- Whether or not for any particular PC value a branch was taken next
- To where it is taken
- How to update with information from later stages
127 A Branch Predictor
128 Branch History Table
129 Branch Prediction Information
- One bit predictor
- Use the result from the last time we saw this instruction
- Problem
- Even if the branch is almost always taken, we will be wrong at least twice
- 1st time we see the instruction
- 1st time the branch is not taken
- Also, the 1st time the branch is taken again after that
- And if the branch alternates b/t taken, not taken
- We get 0% accuracy
- Can we do better? Yep.
130 Branch Prediction Information
- How to do better?
- Keep a counter in each entry of the number of times taken in the last N times executed
- Keep information about the pattern of previous branches
- Book's scheme: a 2-bit saturating counter
- Increment when the branch is taken
- Decrement when the branch is not taken
- Don't increment or decrement above or below the max/min count
- Use the sign of the count as the predictor
131 Book's 2 Bit Branch Counter
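A minimal sketch of such a counter (one common convention: predict taken when the counter is in the upper half of its 0-3 range, which is what using the "sign" of the count amounts to):

    class TwoBitPredictor:
        def __init__(self):
            self.count = 0                      # 0..3 saturating counter

        def predict_taken(self):
            return self.count >= 2              # upper half of the range -> taken

        def update(self, taken):
            if taken:
                self.count = min(3, self.count + 1)
            else:
                self.count = max(0, self.count - 1)

    p = TwoBitPredictor()
    for outcome in (True, True, False, True):
        print(p.predict_taken(), outcome)
        p.update(outcome)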
132 Computing Performance
- Program assumptions
- 23% loads, and in ½ of cases the next instruction uses the load value
- 13% stores
- 19% conditional branches
- 2% unconditional branches
- 43% other
- Machine Assumptions
- 5 stage pipe with all forwarding
- Only penalty is 1 cycle on use of a load value immediately after a load
- Jumps are totally resolved in the ID stage for a 1 cycle branch penalty
- 75% branch prediction accuracy
- 1 cycle delay on misprediction
133 The Answer
- CPI penalty calculation
- Loads
- 50% of the 23% of loads have a 1 cycle penalty: 0.5 x 0.23 = 0.115
- Jumps
- All of the 2% of jumps have a 1 cycle penalty: 0.02 x 1 = 0.02
- Conditional Branches
- 25% of the 19% are mispredicted for a 1 cycle penalty: 0.25 x 0.19 x 1 = 0.0475
- Total Penalty = 0.115 + 0.02 + 0.0475 = 0.1825
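The same penalty arithmetic as a quick check:

    load_penalty   = 0.5  * 0.23 * 1   # half the loads stall one cycle
    jump_penalty   = 1.0  * 0.02 * 1   # every jump costs one cycle
    branch_penalty = 0.25 * 0.19 * 1   # 25% of conditional branches mispredicted
    total = load_penalty + jump_penalty + branch_penalty
    print(total, 1 + total)            # 0.1825 extra cycles -> CPI of ~1.18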
134 Exception Hazards
- 40hex: sub $11, $2, $4
- 44hex: and $12, $2, $5
- 48hex: or $13, $6, $2
- 4Chex: add $1, $2, $1 (overflow in EX stage)
- 50hex: slt $15, $6, $7 (already in ID stage)
- 54hex: lw $16, 50($7) (already in IF stage)
-
- 40000040hex: sw $25, 1000($0) (exception handler)
- 40000044hex: sw $26, 1004($0)
- Need to transfer control to the exception handler ASAP
- Don't want invalid data to contaminate registers or memory
- Need to flush the instructions already in the pipeline
- Start fetching instructions from 40000040hex
- Save the addr. following the offending instruction (50hex) in TrapPC (EPC)
- Don't clobber it; use it for debugging
135 Flushing the pipeline after an exception
- Cycle 6
- Exception detected, flush signals generated, bubbles injected
- Cycle 7
- 3 bubbles appear in the ID, EX, MEM stages
- PC gets 40000040hex, TrapPC gets 50hex
136 Managing exception hazards gets much worse!
- Different exception types may occur in different stages
- The challenge is to associate the exception with the proper instruction; difficult!
- Relax this requirement in non-critical cases: imprecise exceptions
- Most machines use precise exceptions
- Further challenge: exceptions can happen at the same time
137 Discussion
- How does instruction set design impact pipelining?
- Does increasing the depth of pipelining always increase performance?
138 Comparative Performance
- Throughput: instructions per clock cycle = 1/CPI
- A pipeline has high throughput and a fast clock rate
- Latency: inherent execution time, in cycles
- The higher latency from pipelining causes problems
- Increased time to resolve hazards
139 Summary
- Performance
- Execution time or throughput
- Amdahl's law
- Multi-bus/multi-unit circuits
- one long clock cycle or N shorter cycles
- Pipelining
- overlap independent tasks
- Pipelining in processors
- hazards limit opportunities for overlap