Title: CS 2200 Lecture 09a Pipelining
1 CS 2200 Lecture 09a: Pipelining
- (Lectures based on the work of Jay Brockman,
Sharon Hu, Randy Katz, Peter Kogge, Bill Leahy,
Ken MacKenzie, Richard Murphy, and Michael
Niemier)
2 Class demo
- Can someone come to the front of the class and explain to me how to do 5 loads of laundry?
- I need 1 person to actually do the laundry
- and 5 more to be, umm... the laundry.
3 Short review: Single cycle MIPS machine
[Figure: single-cycle MIPS datapath]
4 Short review: Non-MIPS single cycle machine
[Figure: single-cycle datapath with signals a, b, c, x, y, bx, and cx2]
5 Short review: Multi cycle MIPS machine
[Figure: multi-cycle MIPS datapath]
6 Short review: Multi cycle LC2200 machine
[Figure: single-bus LC2200 datapath with registers A and B (LdA/LdB), a 1024 x 32-bit memory (Addr, Din, Dout, WrMEM), a 16 x 32-bit register file (regno, Din, Dout, WrREG), IR31..0, IR19..0 through sign extend, and an ALU with func codes 00 ADD, 01 NAND, 10 A - B, 11 A + 1]
7 Other Processor Designs (with more than one bus)
- One-bus is simple, recipe-oriented.
- Alternatives
- add parallel busses for data transfers that occur together
- e.g. ALU input/input/output
- add parallel compute units for operations that occur together
- e.g. PC+1 in parallel with everything else
- mux paths together as necessary
- (somewhat ad-hoc)
8 Other Processor Designs (with more than one bus)
- Add busses! One per ALU port
[Figure: datapath with a 3-port register file (regnos) and a separate bus per ALU port]
9 Other Processor Designs (with more than one bus)
- Fetch unit performs PC+1 and instruction lookup
[Figure: separate fetch unit with its own instruction memory]
10 Cycles Per Instruction?
- Well, you have a choice!
- CPI = 1
- one long cycle
- Tclock = 5 ns?
11 Cycles Per Instruction?
- Well, you have a choice!
- CPI = 1
- one long cycle!
- Tclock = 5 ns
- CPI = 5
- five short cycles
- Tclock = 1 ns
- 5 ns/instruction either way
12 Transition
- Can we do better?
- What if we have 5 instructions?
- With single cycle, 25 ns needed
- With multi cycle, 25 ns needed
- But it's also possible to do it in less than 10 ns (see the sketch below)
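A rough sketch of where the "less than 10 ns" figure comes from, assuming the five 1 ns steps from the previous slide (Python, illustrative only):

    N = 5          # instructions
    S = 5          # pipeline stages
    T = 1          # ns per stage (so the single-cycle clock is S*T = 5 ns)
    single_cycle = N * S * T        # 5 instructions x 5 ns each       = 25 ns
    multi_cycle  = N * S * T        # 5 instructions x 5 cycles x 1 ns = 25 ns
    pipelined    = (S + N - 1) * T  # fill the pipe, then 1 result/ns  =  9 ns
    print(single_cycle, multi_cycle, pipelined)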
13 Pipelining
14 Pipelining: It's Natural!
- Laundry Example
- Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
- Washer takes 30 minutes
- Dryer takes 40 minutes
- Folder takes 20 minutes
15 Sequential Laundry
[Figure: timeline from 6 PM to midnight, task order down the side; each load spends 30 min in the washer, 40 min in the dryer, and 20 min folding, with the four loads done back to back]
- Sequential laundry takes 6 hours for 4 loads
- If they learned pipelining, how long would laundry take?
16 Pipelined Laundry: Start work ASAP
[Figure: timeline from 6 PM to midnight, task order down the side, with the four loads overlapped across the washer, dryer, and folding stages]
Note: More time to go out later that night
- Pipelined laundry takes 3.5 hours for 4 loads
17 Pipelining Lessons
- Multiple tasks operating simultaneously
- Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
- Pipeline rate limited by the slowest pipeline stage
- Potential speedup = number of pipe stages
- Unbalanced lengths of pipe stages reduce speedup
- Also, need time to fill and drain the pipeline.
[Figure: pipelined laundry timeline starting at 6 PM, task order down the side]
18 Pipelining: Some terms
- If you're doing laundry or implementing a microprocessor, each stage where something is done is called a pipe stage
- In the laundry example, the washer, dryer, and folding table are pipe stages; clothes enter at one end, exit at the other
- In a microprocessor, instructions enter at one end and have been executed when they leave
- Another example: an auto assembly line
- Throughput is how often stuff comes out of a pipeline
19 More on throughput
- All pipe stages are connected, so everything must move from one to another at the same time
- How fast this happens is a function of the time it takes for the slowest stage to finish
- Example: If laundry takes 30 min. to wash but 40 min. to dry, it'll sit idle in the washer for 10 min.
- In a microprocessor, this is the machine cycle time (usually 1 clock)
- If each pipe stage is perfectly balanced time-wise:
- Time/Instruction = Time/Instruction on the unpipelined machine / # of pipe stages
- Therefore speedup from pipelining = # of pipe stages
- But of course nothing's perfect!
20 So really, how is pipelining faster?
- Pipelining reduces average execution time/instruction
- Could be viewed as decreasing the # of clock cycles per instruction (CPI)
- In a perfect pipeline, you should see 1 instruction result each cycle, even though that instruction actually required multiple pipe stages/multiple cycles
- Pipelining is an implementation technique, not visible to the programmer
- (a good thing b/c it's one less thing a programmer has to worry about!)
21 More technical detail
- General characteristics
- Complete process broken into S independent steps
- Each step done independently at a stage
- Stages arranged in linear order to match the process
- As each stage finishes its piece, it passes it to the next stage
- Time for 1 complete processing sequence = sum of all stage times
- BUT the rate at which we can initiate new work is set by the max of any stage time
22 More technical detail
- If the times for all S stages are equal to T
- Time for one initiation to complete = still S x T
- Time between 2 initiations = T, not S x T
- Initiations per second = 1/T
- Pipelining: Overlap multiple executions of the same sequence
- Improves THROUGHPUT, not the time to perform a single operation
- Other examples
- Automobile assembly plant, chemical factory, garden hose, cooking
23 More technical detail
- Book's approach to drawing pipeline timing diagrams
- Time runs left-to-right, in units of stage time
- Each row below corresponds to a distinct initiation
- Boundary b/t 2 column entries = pipeline register
- (i.e. hamper)
- Must look at column contents to see what stage is doing what
Time for N initiations to complete = N x T + (S-1) x T
Throughput: Time per initiation = T + (S-1) x T/N -> T as N grows!
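A small sketch of the two relations above, assuming perfectly balanced stages of time T and ignoring pipeline-register overhead (Python, illustrative only):

    def pipeline_time(n, s, t):
        # total time for n initiations through an s-stage pipeline, stage time t
        return n * t + (s - 1) * t

    for n in (1, 10, 1000):
        total = pipeline_time(n, s=5, t=1.0)
        print(n, total, total / n)   # per-initiation time approaches t as n grows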
24 Ideal digital system pipeline speedup
Unpipelined:
[Figure: four blocks of combinational logic, each with delay t, between input and output latches]
- delay for 1 piece of data = 4t + latch setup (assume small)
- approximate delay for 1000 pieces of data = 4000t
Pipelined:
[Figure: the same four blocks of combinational logic, delay t each, with a latch after every block]
- delay for 1 piece of data = 4(t + latch setup)
- approximate delay for 1000 pieces of data = 3t + 1000t = 1003t
- speedup for 1000 pieces of data = 4000 / 1003, approximately 4
- Ideal speedup = # of pipeline stages
25 Example
- IF: The instruction fetch sequence (2 ns)
- ID: Decode and fetch register operands (1 ns)
- EX: Perform ALU operation (2 ns)
- MEM: Perform data memory operation (2 ns)
- WB: Write result (if any) back into reg. file (1 ns)
- Hmm... 5 stages -> a 5X performance increase over a single cycle design?
- Electrical design challenge
- Can we make HW do each stage in the same time?
26 Example
[Figure: the five stage times above (2, 1, 2, 2, 1 ns) drawn to scale]
- Total time = 8 ns for one initiation
- Try to overlap successive initiations: doesn't line up!
- Possible solution: insert 1 ns after ID to allow alignment
- Structural Hazard
27 More technical detail
- Delay ID by 1 ns also
[Figure: overlapped initiations; one initiation = 9 ns, 4 initiations = 15 ns, no structural hazard]
- One initiation = 9 ns or 10 ns (depending on how you look at it)
- 4 initiations = 15 ns -> average of 1 initiation every 3.75 ns
- How long for 1000 initiations?
- What is the equivalent time between initiations?
- What is the effective speedup?
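One way to answer these questions, assuming (as the numbers above suggest) that the first initiation takes 9 ns, every later initiation finishes 2 ns after the previous one, and an unpipelined instruction takes 8 ns:

    first_initiation = 9   # ns, from this slide
    per_additional   = 2   # ns; 9 + 3*2 = 15 ns for 4 initiations, matching above
    unpipelined      = 8   # ns per instruction, from the earlier example
    n = 1000
    total   = first_initiation + (n - 1) * per_additional   # 2007 ns
    between = total / n                                      # ~2.0 ns/initiation
    speedup = (n * unpipelined) / total                      # ~3.99x
    print(total, between, speedup)

Under those assumptions the effective time between initiations approaches the 2 ns stage time, and the speedup approaches 4.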
28 Transition
29 The Big Picture: Literally!
30 The new look: dataflow
[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB; PC and instruction memory; an adder for PC + 4; a comparator producing "branch taken"; the register file addressed by instruction fields IR6..10 and IR11..15; sign extend (16 to 32 bits); the ALU; data memory; MEM/WB.IR; and several muxes]
Data must be stored from one stage to the next in pipeline registers/latches. They hold temporary values between clocks and the information needed for execution.
31 Another way to look at it
[Figure: pipeline timing diagram with clock number across the top, time left to right, and program execution order (in instructions) down the side]
32 So, what about the details?
- In each cycle, a new instruction is fetched and begins its 5 cycle execution
- In a perfect world (pipeline), performance improved 5 times over!
- So, that's it, huh? Hardly!!!
- What else do we have to worry about?
- Must know what's going on in every cycle of the machine
- What if 2 instructions try to use the same resource at the same time?
- (LOTS more on this later)
- Separate instruction/data memories, multiple register ports, etc. help avoid this
33 So seriously, what does pipelining do for us?
- For starters, pipelining does not reduce the execution time of a single instruction.
- Actually, b/c of the overhead of controlling the pipeline, execution time usually increases!
- So why do it?
- Pipelining increases CPU instruction throughput.
- The # of instructions executed in some given time frame should increase b/c of pipelining
- Thus, a program runs faster even though all instructions actually execute a little slower. Crazy, huh?
34 Limits, limits, limits
- So, now that the ideal stuff is out of the way, let's look at how a pipeline REALLY works
- Pipelines are slowed b/c of
- Pipeline latency
- Imbalance of pipeline stages
- (Think: A chain is only as strong as its weakest link)
- Well, a pipeline is only as fast as its slowest stage
- Pipeline overhead (from where?)
- Register delay from pipe stage latches
- Clock skew: Once a clock cycle is as small as the sum of the clock skew and latch overhead, you can't get any work done
35 Speed Up Equation for Pipelining
For a simple RISC pipeline, CPI = 1. W/microcode, unpipelined CPI = pipeline depth.
Single-cycle HW would have a slow clock.
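A hedged sketch of the comparison this slide summarizes (average instruction time = CPI x clock cycle time, before and after pipelining; the exact equation on the slide may differ):

    def speedup(cpi_unpiped, t_clk_unpiped, cpi_piped, t_clk_piped):
        return (cpi_unpiped * t_clk_unpiped) / (cpi_piped * t_clk_piped)

    # Microcoded/multicycle baseline: unpipelined CPI = pipeline depth, same clock.
    print(speedup(5, 1.0, 1.0, 1.0))   # -> 5.0 for a 5-deep pipeline, no stalls
    # Single-cycle baseline: CPI = 1 but a clock roughly 5x slower.
    print(speedup(1, 5.0, 1.0, 1.0))   # -> 5.0 again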
36 Transition
37 CS 2200 Lecture 09b: MIPS Pipelining Examples
- (Lectures based on the work of Jay Brockman,
Sharon Hu, Randy Katz, Peter Kogge, Bill Leahy,
Ken MacKenzie, Richard Murphy, and Michael
Niemier)
38 Executing Instructions in Pipelined Datapath
- Following charts describe 3 scenarios
- Processing of load word (lw) instruction
- Bug included in design (make SURE you understand the bug)
- Processing of lw
- Bug corrected (make SURE you understand the fix)
- Processing of lw followed in pipeline by sub
- (Sets the stage for discussion of HAZARDS and inter-instruction dependencies)
39 Load Word: Cycle 1
40 Load Word: Cycle 2
41 Load Word: Cycle 3
42 Load Word: Cycle 4
43 Load Word: Cycle 5
44 Load Word: Fixed Bug
45 A 2 instruction sequence
- Examine multiple-cycle and single-cycle diagrams for a sequence of 2 independent instructions
- (i.e. no common registers b/t them)
- lw $10, 9($1)
- sub $11, $2, $3
46 Single-cycle diagrams: cycle 1
47 Single-cycle diagrams: cycle 2
48 Single-cycle diagrams: cycle 3
49 Single-cycle diagrams: cycle 4
50 Single-cycle diagrams: cycle 5
51 Single-cycle diagrams: cycle 6
52 Pipelined Control
- Potentially very complicated; approach it methodically.
- Example (independent instructions)
- lw $10, 9($1)
- sub $11, $2, $3
- and $12, $4, $5
- or $13, $6, $7
- add $14, $8, $9
53 Pipelined Control
- Example (dependent instructions)
- ($2 used in sequential instructions)
- sub $2, $1, $3 (register $2 written by sub)
- add $12, $2, $5 (1st operand ($2) depends on sub)
- or $13, $6, $2 (2nd operand ($2) depends on sub)
- add $14, $2, $2 (1st and 2nd operands ($2) depend on sub)
- sw $15, 100($2) (index ($2) depends on sub)
- Problem
- write-back for sub won't occur until the 5th cycle
- First assume a sequence of independent instructions
- later, remove this assumption
54 Control signal summary
55 Questions about control signals
- The following discussion is relevant to a single instruction
- Q: Are all control signals active at the same time?
- A: ?
- Q: Can we generate all these signals at the same time?
- A: ?
56 Control lines by pipe stage
- Each data flow component is active in only one pipeline stage
- So, divide control signals into groups according to the active component
- 1. Instruction Fetch
- Always read instruction memory and write PC
- (basically nothing special)
- 2. Instruction Decode / Register Fetch
- Still nothing special to control
- (same action every time)
- 3. Execution (must decode control sigs from inst.)
- RegDst: does the target reg come from bits 20-16 or 15-11?
- ALUOp: how to control the ALU operation
- ALUSrc: does the 2nd ALU input come from the reg. file or sign ext.?
57 Control lines by pipe stage
- 4. Memory: likewise
- Branch: used to generate PCSrc
- PCSrc: does the PC get incremented or replaced by the output of the branch adder?
- MemRead: signals a read from memory
- MemWrite: signals a write to memory
- 5. Write Back: likewise
- MemToReg: does the value going back to the reg file come from the ALU or memory?
- RegWrite: is there in fact a register write back to perform?
58 Passing control w/pipe registers
- Analogy: send instructions with the car on the assembly line
- Install Corinthian leather interior on car #6 @ stage 3
59 Pipelined datapath w/control signals
60 CS 2200 Lecture X: Hazards
- (Lectures based on the work of Jay Brockman,
Sharon Hu, Randy Katz, Peter Kogge, Bill Leahy,
Ken MacKenzie, Richard Murphy, and Michael
Niemier)
61 The hazards of pipelining
- Pipeline hazards prevent the next instruction from executing during its designated clock cycle
- There are 3 classes of hazards
- Structural Hazards
- Arise from resource conflicts
- HW cannot support all possible combinations of instructions
- Data Hazards
- Occur when a given instruction depends on data from an instruction ahead of it in the pipeline
- Control Hazards
- Result from branches and other instructions that change the flow of the program (i.e. change the PC)
62 How do we deal with hazards?
- Often, the pipeline must be stalled
- Stalling the pipeline usually lets some instruction(s) in the pipeline proceed while another/others wait for data, a resource, etc.
- A note on terminology
- If we say an instruction was issued later than instruction x, we mean that it was issued after instruction x and is not as far along in the pipeline
- If we say an instruction was issued earlier than instruction x, we mean that it was issued before instruction x and is further along in the pipeline
63 Stalls and performance
- Stalls impede progress of a pipeline and result in deviation from 1 instruction executing/clock cycle
- Pipelining can be viewed to
- Decrease CPI or clock cycle time for an instruction
- Let's see what effect stalls have on CPI
- CPI pipelined
- = Ideal CPI + Pipeline stall cycles per instruction
- = 1 + Pipeline stall cycles per instruction
- Ignoring overhead and assuming stages are balanced
64 More pipeline performance issues
- Pipelining can appear to improve clock cycle time
- Can assume the CPI of both an unpipelined and a pipelined machine is 1
- This results in:
- If pipe stages are perfectly balanced, we assume no overhead
- the clock cycle on the pipelined machine is smaller than on the unpipelined machine by a factor equal to the pipeline depth.
65 Even more pipeline performance issues!
- This results in...
- Which leads to...
- If there are no stalls, the speedup equals the # of pipeline stages in the ideal case (see the sketch below)
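A minimal sketch of the relations referenced above, assuming balanced stages and no latch overhead, so the pipelined clock is the unpipelined clock divided by the depth (an assumption; the slide's own equations are not reproduced here):

    def pipeline_speedup(depth, stalls_per_instruction):
        # CPI_pipelined = 1 + stalls; the clock shrinks by a factor of `depth`
        return depth / (1 + stalls_per_instruction)

    print(pipeline_speedup(5, 0.0))   # ideal: 5x for 5 stages
    print(pipeline_speedup(5, 0.5))   # ~3.3x with half a stall cycle per instruction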
66 Structural hazards
- 1 way to avoid structural hazards is to duplicate resources
- i.e. an ALU to perform an arithmetic operation and an adder to increment the PC
- If not all possible combinations of instructions can be executed, structural hazards occur
- Most common instances of structural hazards
- When a functional unit is not fully pipelined
- When some resource is not duplicated enough
- Pipeline stalls result from hazards; CPI increases from its usual value of 1
67 An example of a structural hazard
[Pipeline diagram: a Load followed by Instructions 1-4, time running left to right]
What's the problem here?
68 How is it resolved?
[Pipeline diagram: a Load, Instruction 1, Instruction 2, a Stall, then Instruction 3, time running left to right]
The pipeline is generally stalled by inserting a bubble or NOP
69 Or, alternatively
[Table: pipeline stages by clock number]
- The LOAD instruction steals an instruction fetch cycle, which will cause the pipeline to stall.
- Thus, no instruction completes on clock cycle 8
70 A simple example
- The facts
- Data references constitute 40% of an instruction mix
- Ideal CPI of the pipelined machine is 1
- A machine with a structural hazard has a clock rate that's 1.05 times higher than a machine without the hazard.
- How much does this LOAD problem hurt us?
- Recall: Avg. Inst. Time = CPI x Clock Cycle Time
- = (1 + 0.4 x 1) x (Clock cycle time_ideal / 1.05)
- = approximately 1.3 x Clock cycle time_ideal
- Therefore the machine without the hazard is better
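The same arithmetic as a quick check:

    cpi_with_hazard   = 1 + 0.40 * 1   # 40% of instructions pay a 1-cycle stall
    cycle_with_hazard = 1 / 1.05       # its clock is 1.05x faster (shorter cycle)
    print(cpi_with_hazard * cycle_with_hazard)   # ~1.33x the ideal machine's time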
71 Remember the common case!
- All things being equal, a machine without structural hazards will always have a lower CPI.
- But, in some cases it may be better to allow them than to eliminate them.
- These are situations a computer architect might have to consider
- Is pipelining functional units or duplicating them costly in terms of HW?
- Does the structural hazard occur often?
- What's the common case???
72 Data hazards
- These exist because of pipelining
- Why do they exist???
- Pipelining changes the order of read/write accesses to operands
- Order differs from the order seen by sequentially executing instructions on an unpipelined machine
- Consider this example
- ADD R1, R2, R3
- SUB R4, R1, R5
- AND R6, R1, R7
- OR R8, R1, R9
- XOR R10, R1, R11
All instructions after ADD use the result of ADD.
For the DLX microprocessor, ADD writes the register in WB but SUB needs it in ID. This is a data hazard.
73 Illustrating a data hazard
[Pipeline diagram: ADD R1, R2, R3 followed by SUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9; XOR R10, R1, R11, each passing through its Mem, Reg, and DM stages, time running left to right]
The ADD instruction causes a hazard in the next 3 instructions b/c the register is not written until after those 3 read it.
74 Forwarding
- The problem illustrated on the previous slide can actually be solved relatively easily with forwarding
- In this example, the result of the ADD instruction is not really needed until after ADD actually produces it
- Can we move the result from the EX/MEM register to the beginning of the ALU (where SUB needs it)?
- Yes! Hence this slide!
- Generally speaking
- Forwarding occurs when a result is passed directly to the functional unit that requires it.
- The result goes from the output of one unit to the input of another
75 When can we forward?
[Pipeline diagram: ADD R1, R2, R3 followed by SUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9; XOR R10, R1, R11, time running left to right]
SUB gets its info. from the EX/MEM pipe register, AND gets its info. from the MEM/WB pipe register, and OR gets its info. by forwarding from the register file.
Rule of thumb: If a line goes forward you can do forwarding. If it's drawn backward, it's physically impossible.
76 HW Change for Forwarding
77 Data hazard specifics
- There are actually 3 different kinds of data hazards!
- Read After Write (RAW)
- Write After Write (WAW)
- Write After Read (WAR)
- We'll discuss/illustrate each on forthcoming slides. However, 1st a note on convention.
- Discussion of hazards will use generic instructions i and j.
- i is always issued before j.
- Thus, i will always be further along in the pipeline than j.
- With an in-order issue/in-order completion machine, we're not as concerned with WAW, WAR
78 Read after write (RAW) hazards
- With a RAW hazard, instruction j tries to read a source operand before instruction i writes it.
- Thus, j would incorrectly receive an old or incorrect value
- Graphically/Example:
- i: ADD R1, R2, R3
- j: SUB R4, R1, R6
- Instruction i is a write instruction issued before j
- Instruction j is a read instruction issued after i
- Can use stalling or forwarding to resolve this hazard
79 Write after write (WAW) hazards
- With a WAW hazard, instruction j tries to write an operand before instruction i writes it.
- The writes are performed in the wrong order, leaving the value written by the earlier instruction in the destination
- Graphically/Example:
- i: DIV F1, F2, F3
- j: SUB F1, F4, F6
- Instruction i is a write instruction issued before j
- Instruction j is a write instruction issued after i
80 Write after read (WAR) hazards
- With a WAR hazard, instruction j tries to write an operand before instruction i reads it.
- Instruction i would incorrectly receive a newer value of its operand
- Instead of getting the old value, it could receive some newer, undesired value
- Graphically/Example:
- i: DIV F7, F1, F3
- j: SUB F1, F4, F6
- Instruction i is a read instruction issued before j
- Instruction j is a write instruction issued after i
81 Forwarding: It doesn't always work
[Pipeline diagram: LW R1, 0(R2) followed by SUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9, each passing through its IM, Reg, and DM stages, time running left to right]
A load has a latency that forwarding can't solve. The pipeline must stall until the hazard is cleared (starting with the instruction that wants to use the data, until the source produces it).
To get the data to the subtract instruction we would need a time machine!
82 The solution, pictorially
[Pipeline diagram: LW R1, 0(R2) followed by SUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9, with a bubble inserted after the load, time running left to right]
Insertion of the bubble causes the # of cycles to complete this sequence to grow by 1
83 Data hazards and the compiler
- The compiler should be able to help eliminate some stalls caused by data hazards
- i.e. the compiler could avoid generating a LOAD instruction that is immediately followed by an instruction that uses the result in the LOAD's destination register.
- This technique is called pipeline/instruction scheduling
84 What about control logic?
- For the DLX integer pipeline, all data hazards can be checked during the ID phase of the pipeline
- If there is a data hazard, the instruction is stalled before it's issued
- Whether forwarding is needed can also be determined at this stage, and the control signals set
- If a hazard is detected, the control unit of the pipeline must stall the pipeline and prevent the instructions in IF, ID from advancing
- All control information is carried along in the pipeline registers, so only these fields must be changed
85 Some example situations
86 Detecting Data Hazards
87 Hazard Detection Logic
- Insert a bubble into the pipeline if any of these are true:
- ID/EX.RegWrite AND
- ((ID/EX.RegDst = 0 AND ID/EX.WriteRegRt = IF/ID.ReadRegRs) OR
- (ID/EX.RegDst = 1 AND ID/EX.WriteRegRd = IF/ID.ReadRegRs) OR
- (ID/EX.RegDst = 0 AND ID/EX.WriteRegRt = IF/ID.ReadRegRt) OR
- (ID/EX.RegDst = 1 AND ID/EX.WriteRegRd = IF/ID.ReadRegRt))
- OR EX/MEM.RegWrite AND
- ((EX/MEM.WriteReg = IF/ID.ReadRegRs) OR
- (EX/MEM.WriteReg = IF/ID.ReadRegRt))
- OR MEM/WB.RegWrite AND
- ((MEM/WB.WriteReg = IF/ID.ReadRegRs) OR
- (MEM/WB.WriteReg = IF/ID.ReadRegRt))
Notation: ID/EX.RegDst means the RegDst field of the ID/EX pipeline register
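The same predicate, sketched in Python; the field names follow the slide's notation, the EX/MEM clause is assumed to test EX/MEM.RegWrite, and this is illustrative rather than a full pipeline model:

    def must_stall(id_ex, ex_mem, mem_wb, if_id):
        # Return True if a bubble must be inserted this cycle.
        rs, rt = if_id["ReadRegRs"], if_id["ReadRegRt"]

        def id_ex_writes(reg):
            return id_ex["RegWrite"] and (
                (id_ex["RegDst"] == 0 and id_ex["WriteRegRt"] == reg) or
                (id_ex["RegDst"] == 1 and id_ex["WriteRegRd"] == reg))

        def ex_mem_writes(reg):
            return ex_mem["RegWrite"] and ex_mem["WriteReg"] == reg

        def mem_wb_writes(reg):
            return mem_wb["RegWrite"] and mem_wb["WriteReg"] == reg

        return any(check(reg)
                   for check in (id_ex_writes, ex_mem_writes, mem_wb_writes)
                   for reg in (rs, rt))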
88 How to Insert Bubbles
- If a hazard is detected
- Don't write to the PC or the IF/ID reg.; de-assert signals for a NOP
89 Incorporation of Hazard Detection Unit
90 Stall Ex. Cycle 1
91 Stall Ex. Cycle 2
92 Stall Ex. Cycle 3: 1st Bubble Inserted
93 Stall Ex. Cycle 4: 2nd Bubble Inserted
94 Stall Ex. Cycle 5: 3rd Bubble Inserted
95 Stall Ex. Cycle 6: End of Stall
96 Stall Ex. Cycle 7
97 Control Hazards
98 R-Type
[Datapath figure: PC, instruction memory, DPRF (register file), BEQ comparator, ALU (A), data memory, sign extend (SE), and muxes, labeled with the stages IF, ID, EX, MEM, WB]
99 Control Hazard on Branches: 2 Stage Stall?
- 10: beq r1, r3, 36
- 14: and r2, r3, r5
- 18: or r6, r1, r7
- 22: add r8, r1, r9
- 36: xor r10, r1, r11
100 Example
101 Scenario
- We have the following code segment
- lw R6, X(R0)
- beq R1, R2, SKIP
- add R1, R2, R3
- SKIP: add R5, R4, R1
- sw R7, X(R0)
- X: .word 5
102 lw R6,X(R0)
[Pipelined datapath diagram tracing the program above (lw R6,X(R0); beq R1,R2,SKIP; add R1,R2,R3; SKIP: add R5,R4,R1; sw R7,X(R0)); in this cycle lw R6,X(R0) has entered the pipeline]
103 lw R6,X(R0); beq R1,R2,SKIP
[Datapath diagram: beq follows lw into the pipeline]
104 lw R6,X(R0); beq R1,R2,SKIP; BUBBLE
[Datapath diagram] Note: Bubble because there is no branch prediction or delay-slot filling.
105 lw R6,X(R0); beq R1,R2,SKIP; BUBBLE; BUBBLE
[Datapath diagram] Second bubble because we're detecting BEQ in the 3rd stage.
106 lw R6,X(R0); BUBBLE; BUBBLE; add R1,R2,R3; beq R1,R2,SKIP
[Datapath diagram]
107 beq R1,R2,SKIP; add R5,R4,R1; BUBBLE; BUBBLE; add R1,R2,R3
[Datapath diagram]
108 BUBBLE; add R5,R4,R1; sw R7,X(R0); BUBBLE; add R1,R2,R3
[Datapath diagram, with the forwarding unit shown]
109 BUBBLE; add R1,R2,R3; add R5,R4,R1; sw R7,X(R0)
[Datapath diagram]
110 add R1,R2,R3; add R5,R4,R1; sw R7,X(R0)
[Datapath diagram]
111 add R5,R4,R1; sw R7,X(R0)
[Datapath diagram]
112 Dealing with Branch Hazards (more detail)
113 Branch Hazards
- So far, we've limited discussion of hazards to
- Arithmetic/logic operations
- Data transfers
- Also need to consider hazards involving branches
- Example:
- 40: beq $1, $3, 28
- 44: and $12, $2, $5
- 48: or $13, $6, $2
- 52: add $14, $2, $2
- 72: lw $4, 50($7)
- How long will it take before the branch decision takes effect?
- What happens in the meantime?
114 Branch signal determined in MEM stage
115 Pipeline impact on branch
- If the branch condition is true, must skip 44, 48, 52
- But, these have already started down the pipeline
- They will complete unless we do something about it
- How do we deal with this?
- We'll consider 2 possibilities
116 Dealing w/branch hazards: always stall
- Branch taken
- Wait 3 cycles
- No proper instructions in the pipeline
- Same delay as without stalls (no time lost)
117 Dealing w/branch hazards: always stall
- Branch not taken
- Still must wait 3 cycles
- Time lost
- Could have spent those cycles fetching and decoding the next instructions
118 Dealing w/branch hazards: assume branch not taken
- On average, branches are taken ½ the time
- If the branch is not taken
- Continue normal processing
- Else, if the branch is taken
- Need to flush the improper instructions from the pipeline
- Cuts overall time for branch processing in ½
119 Flushing unwanted instructions from the pipeline
- Useful to compare w/stalling the pipeline
- Simple stall: inject a bubble into the pipe at the ID stage only
- Change control to 0 in the ID stage
- Let bubbles percolate to the right
- Flushing the pipe: must change the inst. in IF, ID, and EX
- IF Stage
- Zero the instruction field of the IF/ID pipeline register
- Use new control signal IF.Flush
- ID Stage
- Use the existing bubble-injection mux that zeros control for stalls
- Signal ID.Flush is ORed w/the stall signal from the hazard detection unit
- EX Stage
- Add new muxes to zero the EX pipeline register control lines
- Both muxes controlled by a single EX.Flush signal
- Control determines when to flush
- Depends on the Opcode and the value of the branch condition
120 Inserting bubbles v. flushing pipeline
121 Assume branch not taken... and branch is not taken
- Execution proceeds normally; no penalty
122 Assume branch not taken... and branch is taken
- Bubbles injected into 3 stages during cycle 5
123 Reservation Table Picture
- Another way of looking at it
- Code: 40: beq $1, $3, 72; 44: and $12, $2, $5; 48: or $13, $6, $2; 52: add $14, $2, $2; 72: lw $4, 50($7)
- Assume Branch Not Taken and Correct: no penalty
- Assume Branch Not Taken and NOT Correct: 3 cycle penalty
124 Branch Penalty Impact
- Assume 16% of all instructions are branches
- 4% unconditional branches: 3 cycle penalty
- 12% conditional: 50% taken
- For a sequence of N instructions (assume N is large)
- N cycles to initiate each
- 3 x 0.04 x N delay cycles due to unconditional branches
- 0.5 x 3 x 0.12 x N delay cycles due to conditional taken
- Also, an extra 4 cycles for the pipeline to empty
- Total:
- 1.3N + 4 total cycles (or ~1.3 cycles/instruction) (CPI)
- 30% Performance Hit!!! (Bad thing)
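The same count as a quick check:

    N = 1_000_000
    cycles  = N                    # one initiation per instruction
    cycles += 3 * 0.04 * N         # unconditional branches: 3-cycle penalty each
    cycles += 0.5 * 3 * 0.12 * N   # conditional branches, taken half the time
    cycles += 4                    # extra cycles to drain the pipeline
    print(cycles / N)              # ~1.3 cycles per instruction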
125 Branch Penalty Impact
- Some solutions
- In the ISA: branches always execute the next 1 or 2 instructions
- An instruction so executed is said to be in a delay slot
- See the SPARC ISA
- In the organization: move the comparator to the ID stage and decide in the ID stage
- Reduces branch delay by 2 cycles
- Increases the cycle time
126 Branch Prediction
- Prior solutions are ugly
- Better (and more common): guess in the IF stage
- The technique is called branch prediction; it needs 2 parts
- Predictor to guess whether the instruction will branch (and to where)
- Recovery Mechanism: i.e. a way to fix your mistake
- Prior strategy
- Predictor: always guess the branch is never taken
- Recovery: flush instructions if the branch is taken
- Alternative: accumulate info. in the IF stage as to
- Whether or not for any particular PC value a branch was taken next
- To where it is taken
- How to update with information from later stages
127 A Branch Predictor
128 Branch History Table
129 Branch Prediction Information
- One bit predictor
- Use the result from the last time we saw this instruction
- Problem
- Even if the branch is almost always taken, we will be wrong at least twice
- 1st time we see the instruction
- 1st time the branch is not taken
- Also, the 1st time the branch is taken again after that
- And if the branch alternates b/t taken, not taken
- We get 0% accuracy
- Can we do better? Yep.
130 Branch Prediction Information
- How to do better?
- Keep a counter in each entry of the number of times taken in the last N times executed
- Keep information about the pattern of previous branches
- Book's scheme: a 2-bit saturating counter
- Increment when the branch is taken
- Decrement when the branch is not taken
- Don't increment or decrement above or below the max/min count
- Use the sign of the count as the predictor
131 Book's 2 Bit Branch Counter
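A minimal sketch of such a counter (one common convention: predict taken when the counter is in the upper half of its 0-3 range, which is what using the "sign" of the count amounts to):

    class TwoBitPredictor:
        def __init__(self):
            self.count = 0                      # 0..3 saturating counter

        def predict_taken(self):
            return self.count >= 2              # upper half of the range -> taken

        def update(self, taken):
            if taken:
                self.count = min(3, self.count + 1)
            else:
                self.count = max(0, self.count - 1)

    p = TwoBitPredictor()
    for outcome in (True, True, False, True):
        print(p.predict_taken(), outcome)
        p.update(outcome)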
132 Computing Performance
- Program assumptions
- 23% loads, and in ½ of cases the next instruction uses the load value
- 13% stores
- 19% conditional branches
- 2% unconditional branches
- 43% other
- Machine Assumptions
- 5 stage pipe with all forwarding
- Only penalty is 1 cycle on use of a load value immediately after a load
- Jumps are totally resolved in the ID stage for a 1 cycle branch penalty
- 75% branch prediction accuracy
- 1 cycle delay on misprediction
133 The Answer
- CPI penalty calculation
- Loads
- 50% of the 23% of loads have a 1 cycle penalty: 0.5 x 0.23 = 0.115
- Jumps
- All of the 2% of jumps have a 1 cycle penalty: 0.02 x 1 = 0.02
- Conditional Branches
- 25% of the 19% are mispredicted for a 1 cycle penalty: 0.25 x 0.19 x 1 = 0.0475
- Total Penalty = 0.115 + 0.02 + 0.0475 = 0.1825
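The same penalty arithmetic as a quick check:

    load_penalty   = 0.5  * 0.23 * 1   # half the loads stall one cycle
    jump_penalty   = 1.0  * 0.02 * 1   # every jump costs one cycle
    branch_penalty = 0.25 * 0.19 * 1   # 25% of conditional branches mispredicted
    total = load_penalty + jump_penalty + branch_penalty
    print(total, 1 + total)            # 0.1825 extra cycles -> CPI of ~1.18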
134 Exception Hazards
- 40hex: sub $11, $2, $4
- 44hex: and $12, $2, $5
- 48hex: or $13, $6, $2
- 4Chex: add $1, $2, $1 (overflow in EX stage)
- 50hex: slt $15, $6, $7 (already in ID stage)
- 54hex: lw $16, 50($7) (already in IF stage)
-
- 40000040hex: sw $25, 1000($0) (exception handler)
- 40000044hex: sw $26, 1004($0)
- Need to transfer control to the exception handler ASAP
- Don't want invalid data to contaminate registers or memory
- Need to flush the instructions already in the pipeline
- Start fetching instructions from 40000040hex
- Save the addr. following the offending instruction (50hex) in TrapPC (EPC)
- Don't clobber it; use it for debugging
135 Flushing the pipeline after an exception
- Cycle 6
- Exception detected, flush signals generated, bubbles injected
- Cycle 7
- 3 bubbles appear in the ID, EX, MEM stages
- PC gets 40000040hex, TrapPC gets 50hex
136 Managing exception hazards gets much worse!
- Different exception types may occur in different stages
- The challenge is to associate the exception with the proper instruction; difficult!
- Relax this requirement in non-critical cases: imprecise exceptions
- Most machines use precise exceptions
- Further challenge: exceptions can happen at the same time
137 Discussion
- How does instruction set design impact pipelining?
- Does increasing the depth of pipelining always increase performance?
138 Comparative Performance
- Throughput: instructions per clock cycle = 1/CPI
- A pipeline has high throughput and a fast clock rate
- Latency: inherent execution time, in cycles
- The higher latency from pipelining causes problems
- Increased time to resolve hazards
139 Summary
- Performance
- Execution time or throughput
- Amdahl's law
- Multi-bus/multi-unit circuits
- one long clock cycle or N shorter cycles
- Pipelining
- overlap independent tasks
- Pipelining in processors
- hazards limit opportunities for overlap