Title: Chapter Six
1 Chapter Six
Enhancing Performance with Pipelining
2 Sequential Laundry
[Figure: four loads of laundry done one after another, 6 PM to 2 AM, as sixteen 30-minute steps; axes: Time (horizontal), Task Order (vertical)]
- Sequential laundry takes 8 hours for 4 loads
- If they learned pipelining, how long would
laundry take?
3 Pipelined Laundry
- Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
- Multiple tasks operate simultaneously using different resources
- Potential speedup = number of pipe stages
- Pipeline rate is limited by the slowest pipeline stage
- Unbalanced lengths of pipe stages reduce speedup
- Time to fill the pipeline and time to drain it reduce speedup
- Stall for dependencies
[Figure: the same loads overlapped in a pipeline, starting at 6 PM; axes: Time (horizontal), Task Order (vertical)]
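As a rough check on the laundry numbers, the sketch below computes sequential and pipelined completion times. It assumes four 30-minute stages per load (consistent with 8 hours for 4 loads). The function names and the unbalanced stage times at the end are illustrative and do not come from the slides.

```python
def sequential_time(stage_minutes, loads):
    """Each load runs through every stage before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_time(stage_minutes, loads):
    """Simple model: the first load fills the pipeline, then one load
    finishes every max(stage_minutes) minutes (the slowest stage sets the rate)."""
    return sum(stage_minutes) + (loads - 1) * max(stage_minutes)


balanced = [30, 30, 30, 30]            # assumed: four 30-minute stages
seq = sequential_time(balanced, 4)     # 480 minutes = 8 hours
pipe = pipelined_time(balanced, 4)     # 210 minutes = 3.5 hours
print(seq / 60, pipe / 60, round(seq / pipe, 2))

# Hypothetical unbalanced stages with the same total work per load:
# the 40-minute stage now limits the rate, so the workload takes longer.
unbalanced = [30, 40, 20, 30]
print(pipelined_time(unbalanced, 4) / 60)
```

Under these assumptions the four loads finish in 3.5 hours instead of 8. The speedup stays below 4 because of the time spent filling and draining the pipeline.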
4 Single Stage vs. Pipeline Performance
- Ideal speedup is the number of stages in the pipeline.
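One way to see where the ideal figure comes from, assuming $k$ perfectly balanced stages of duration $t$, $n$ tasks, and no stalls (these symbols are introduced here for illustration only):

```latex
\[
\text{Speedup}(n)
  = \frac{T_{\text{unpipelined}}}{T_{\text{pipelined}}}
  = \frac{n\,k\,t}{(k + n - 1)\,t}
  = \frac{n\,k}{k + n - 1}
  \;\xrightarrow[\,n \to \infty\,]{}\; k
\]
```

For $k = 4$ and $n = 4$ this gives $16/7 \approx 2.3$. The ideal speedup of $k$ is approached only when the workload is long enough to hide the fill and drain time.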
5 Pipelining
- What makes it easy in MIPS?
  - all instructions are the same length
  - just a few instruction formats
  - memory operands appear only in loads and stores
- What makes it hard?
  - structural hazards: suppose we had only one memory
  - control hazards: need to worry about branch instructions
  - data hazards: an instruction depends on a previous instruction
- We'll build a simple pipeline and look at these issues
6 The Five Stages of Load
- Ifetch: Instruction Fetch
  - Fetch the instruction from the Instruction Memory
- Reg/Dec: Registers Fetch and Instruction Decode
- Exec: Calculate the memory address
- Mem: Read the data from the Data Memory
- Wr: Write the data back to the register file
7 Single Cycle, Multiple Cycle, vs. Pipeline
[Figure: clock timing diagrams for three implementations. Single Cycle Implementation: Load, Store, Waste (Cycles 1-2). Multiple Cycle Implementation: Load, Store, R-type (Cycles 1-10). Pipeline Implementation: Load, Store, R-type.]
8 Basic Idea
[Figure: datapath divided into pipeline stages; recoverable stage labels: Execute/address calculation, MEM: Memory access, WB: Write back]
9 Pipelined Datapath
- Walk through lw instruction
- Walk through sw instruction
- The design
10 Corrected Datapath
11 Graphically Representing Pipelines
- Can help with answering questions like:
  - how many cycles does it take to execute this code?
  - what is the ALU doing during cycle 4?
- Use this representation to help understand datapaths
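The two questions above can be answered mechanically from a stage-by-cycle table. The sketch below builds one for a five-stage pipeline, assuming one instruction issues per cycle with no stalls. The stage names, helper functions, and the three-instruction program are hypothetical and not taken from the slides.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_table(instructions):
    """Return a list of (instruction, {cycle: stage}) rows, with 1-based cycles."""
    table = []
    for issue_slot, name in enumerate(instructions):
        row = {issue_slot + 1 + i: stage for i, stage in enumerate(STAGES)}
        table.append((name, row))
    return table


def total_cycles(instructions):
    """Cycles until the last instruction leaves WB (no stalls assumed)."""
    return len(instructions) + len(STAGES) - 1


def in_stage(table, stage, cycle):
    """Which instruction, if any, occupies `stage` during `cycle`?"""
    for name, row in table:
        if row.get(cycle) == stage:
            return name
    return None


def render(table):
    """Plain-text diagram: one row per instruction, one column per cycle."""
    last = max(max(row) for _, row in table)
    lines = []
    for name, row in table:
        cells = "".join(row.get(c, "").ljust(4) for c in range(1, last + 1))
        lines.append(name.ljust(18) + cells.rstrip())
    return "\n".join(lines)


# Hypothetical program used only to exercise the helpers
program = ["lw $10, 20($1)", "sub $11, $2, $3", "add $12, $3, $4"]
table = pipeline_table(program)
print(render(table))
print("cycles:", total_cycles(program))
print("ALU in cycle 4:", in_stage(table, "EX", 4))
```

For this program the table spans 7 cycles, and the ALU (the EX stage) is working on the `sub` instruction during cycle 4.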
12 Why Pipeline? Because the resources are there!
[Figure: pipeline diagram for Inst 0 through Inst 4; axes: Time (clock cycles) horizontal, Instr. Order vertical]
13 Pipeline Control
14 Pipeline Control
- Pass control signals along just like the data
15 Datapath with Control
16 Designing a Pipelined Processor
- Go back and examine your datapath and control diagram
- Associate resources with states
- Ensure that flows do not conflict, or figure out how to resolve the conflicts
- Assert control in the appropriate stage
17 Can pipelining get us into trouble?
- Yes: Pipeline Hazards
  - structural hazards: attempt to use the same resource two different ways at the same time
  - data hazards: attempt to use an item before it is ready
    - instruction depends on the result of a prior instruction still in the pipeline
  - control hazards: attempt to make a decision before the condition is evaluated
    - branch instructions
- Can always resolve hazards by waiting
  - pipeline control must detect the hazard
  - take action (or delay action) to resolve hazards
18 Data Hazards
- Problem with starting the next instruction before the first is finished
  - dependencies that "go backward in time" are data hazards
[Figure: multi-instruction pipeline diagram (Reg and DM stages visible) illustrating the dependencies]
19 Software Solution
- Have the compiler guarantee no hazards
- Where should the compiler insert nop instructions?
    sub $2, $1, $3
    and $12, $2, $5
    or  $13, $6, $2
    add $14, $2, $2
    sw  $15, 100($2)
- Problem:
  - It happens too often to rely on the compiler
  - It really slows us down!
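To make the nop question concrete, here is a minimal sketch of a nop-inserting pass run on the five-instruction sequence above. It assumes a five-stage pipeline without forwarding, in which the register file is written in the first half of a cycle and read in the second half, so a consumer must issue at least 3 slots after its producer. The tuple encoding of instructions and the function name are hypothetical.

```python
MIN_DISTANCE = 3  # assumed: consumer must issue >= 3 slots after its producer


def insert_nops(program, min_distance=MIN_DISTANCE):
    """program: list of (text, dest_register_or_None, source_registers)."""
    scheduled = []     # output instruction texts, including inserted nops
    last_write = {}    # register -> index in `scheduled` of its latest writer
    for text, dest, sources in program:
        # Earliest slot at which every source register can be read safely
        earliest = len(scheduled)
        for reg in sources:
            if reg in last_write:
                earliest = max(earliest, last_write[reg] + min_distance)
        scheduled.extend(["nop"] * (earliest - len(scheduled)))
        if dest is not None and dest != "$0":   # $0 is never really written
            last_write[dest] = len(scheduled)
        scheduled.append(text)
    return scheduled


program = [
    ("sub $2, $1, $3",  "$2",  ["$1", "$3"]),
    ("and $12, $2, $5", "$12", ["$2", "$5"]),
    ("or $13, $6, $2",  "$13", ["$6", "$2"]),
    ("add $14, $2, $2", "$14", ["$2", "$2"]),
    ("sw $15, 100($2)", None,  ["$15", "$2"]),
]

for line in insert_nops(program):
    print(line)
```

Under these assumptions two nops go between `sub` and `and`, and the later readers of `$2` are already far enough away. If the register file could not be read in the same cycle it is written, the required distance would be 4 and a third nop would be needed.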
20 Data Hazard Solution: Forwarding
- Use temporary results; don't wait for them to be written
- Also, write the register file during the 1st half of the clock cycle and read during the 2nd half
[Figure: pipeline diagram with forwarding; rows track the value of register $2, the value of EX/MEM, and the value of MEM/WB in each clock cycle]
21 Hazard Conditions
- Steer the result from the previous instruction to the ALU
- EX hazard
  - if (EX/MEM.RegWrite
    and (EX/MEM.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) ForwardA = 10
  - if (EX/MEM.RegWrite
    and (EX/MEM.RegisterRd ≠ 0)
    and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) ForwardB = 10
- MEM hazard
  - if (MEM/WB.RegWrite
    and (MEM/WB.RegisterRd ≠ 0)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01
  - if (MEM/WB.RegWrite
    and (MEM/WB.RegisterRd ≠ 0)
    and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01
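The conditions above translate almost directly into code. The sketch below models the pipeline registers as plain dictionaries, which is a hypothetical encoding chosen for illustration. It also makes one assumption the slide leaves implicit: when both the EX and MEM conditions match, the EX hazard wins, because EX/MEM holds the more recent result.

```python
def forwarding_unit(ex_mem, mem_wb, id_ex):
    """Return (ForwardA, ForwardB) as two-bit strings.

    ex_mem, mem_wb: dicts with 'RegWrite' (bool) and 'RegisterRd' (int)
    id_ex:          dict with 'RegisterRs' and 'RegisterRt' (ints)
    """
    def select(source_reg):
        # EX hazard: result of the immediately preceding instruction
        if (ex_mem["RegWrite"] and ex_mem["RegisterRd"] != 0
                and ex_mem["RegisterRd"] == source_reg):
            return "10"
        # MEM hazard: result from two instructions back
        if (mem_wb["RegWrite"] and mem_wb["RegisterRd"] != 0
                and mem_wb["RegisterRd"] == source_reg):
            return "01"
        return "00"   # no hazard: use the register file value

    return select(id_ex["RegisterRs"]), select(id_ex["RegisterRt"])


# sub $2, $1, $3 sits in EX/MEM while and $12, $2, $5 is in ID/EX
print(forwarding_unit(
    ex_mem={"RegWrite": True, "RegisterRd": 2},
    mem_wb={"RegWrite": False, "RegisterRd": 0},
    id_ex={"RegisterRs": 2, "RegisterRt": 5},
))
```

The check against register 0 matters because `$0` is hardwired to zero, so a "result" destined for it must never be forwarded.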
22 Forwarding
[Figure: datapath with forwarding unit; labeled signals include IF/ID.RegisterRs, IF/ID.RegisterRt, IF/ID.RegisterRd, EX/MEM.RegisterRd, and MEM/WB.RegisterRd]
Forwarding mux control values:
- 00: Register file
- 01: Mem. or earlier ALU
- 10: Prior ALU
23 Can't always forward
- lw can still cause a hazard
  - an instruction tries to read a register following a load instruction that writes to the same register
- Thus, we need a hazard detection unit to "stall" the instruction that follows the load
24 Stalling
- We can stall the pipeline by keeping an instruction in the same stage
- The stalled instructions repeat in clock cycle 4 what they did in clock cycle 3
25 Hazard Detection Unit
- Stall by letting an instruction that won't write anything go forward
- Controls writing of the PC and IF/ID, plus a MUX
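The slides do not spell out the detection condition. A commonly used textbook formulation, taken here as an assumption, stalls when the instruction in ID/EX is a load whose destination (Rt) matches either source register of the instruction in IF/ID. The dictionary encoding and the output names `PCWrite`, `IFIDWrite`, and `ZeroControl` are illustrative.

```python
def hazard_detection_unit(id_ex, if_id):
    """Detect a load-use hazard and return the stall controls.

    id_ex: dict with 'MemRead' (bool) and 'RegisterRt' (int)
    if_id: dict with 'RegisterRs' and 'RegisterRt' (ints)
    """
    stall = bool(
        id_ex["MemRead"]
        and id_ex["RegisterRt"] in (if_id["RegisterRs"], if_id["RegisterRt"])
    )
    return {
        "PCWrite": not stall,     # hold the PC so the same fetch repeats
        "IFIDWrite": not stall,   # hold IF/ID so the same decode repeats
        "ZeroControl": stall,     # MUX selects all-zero control: a bubble
    }


# lw $2, 20($1) is in ID/EX; and $4, $2, $5 is in IF/ID
print(hazard_detection_unit(
    id_ex={"MemRead": True, "RegisterRt": 2},
    if_id={"RegisterRs": 2, "RegisterRt": 5},
))
```

Zeroing the control signals is what lets "an instruction that won't write anything" move forward while the PC and IF/ID are held.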
26 Branch Hazards
- When we decide to branch, other instructions are in the pipeline!
- We are predicting "branch not taken"
  - need to add hardware for flushing instructions if we are wrong
27 Flushing Instructions
28 Improving Performance
- Superpipelining: ideal maximum speedup is related to the number of stages
- Superscalar: start more than one instruction in the same cycle
- Dynamic pipeline scheduling
- Try to avoid stalls! E.g., reorder these instructions:
    lw $t0, 0($t1)
    lw $t2, 4($t1)
    sw $t2, 0($t1)
    sw $t0, 4($t1)
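One possible reordering of the four instructions above swaps the two stores, which is safe because they write different addresses. The sketch below counts load-use stalls before and after the swap, using a deliberately simple model: one stall whenever an instruction reads the register loaded by the immediately preceding lw. The tuple encoding and function name are hypothetical.

```python
def count_load_use_stalls(program):
    """program: list of (op, dest_register_or_None, source_registers)."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, curr_sources = curr
        if prev_op == "lw" and prev_dest in curr_sources:
            stalls += 1
    return stalls


original = [
    ("lw", "$t0", ["$t1"]),          # lw $t0, 0($t1)
    ("lw", "$t2", ["$t1"]),          # lw $t2, 4($t1)
    ("sw", None, ["$t2", "$t1"]),    # sw $t2, 0($t1)
    ("sw", None, ["$t0", "$t1"]),    # sw $t0, 4($t1)
]

# Swap the two stores so the store of $t2 no longer follows its load directly
reordered = [original[0], original[1], original[3], original[2]]

print(count_load_use_stalls(original), count_load_use_stalls(reordered))
```

In this model the original order incurs one stall (the store of `$t2` right after its load) and the reordered version incurs none.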
29 Dynamic Scheduling
- The hardware performs the "scheduling"
  - hardware tries to find instructions to execute
  - out-of-order execution is possible
  - speculative execution and dynamic branch prediction
- All modern processors are very complicated
  - DEC Alpha 21264: 9-stage pipeline, 6-instruction issue
  - PowerPC and Pentium: branch history table
  - Compiler technology is important
30 Dynamic Scheduling in PowerPC 604 and Pentium Pro
- Both: In-order Issue, Out-of-order execution, In-order Commit
- Pentium Pro: central reservation station for any functional units, with one bus shared by a branch and an integer unit
31 Dynamic Scheduling in Pentium Pro
- PPro doesn't pipeline 80x86 instructions
- PPro decode unit translates the Intel instructions into 72-bit micro-operations (similar to MIPS instructions)
- Sends micro-operations to the reorder buffer and reservation stations
- Takes 1 clock cycle to determine the length of 80x86 instructions, plus 2 more to create the micro-operations
- Most instructions translate to 1 to 4 micro-operations
- Complex 80x86 instructions are executed by a conventional microprogram (8K x 72 bits) that issues long sequences of micro-operations
32 FYI: MIPS R3000 clocking discipline
- 2-phase non-overlapping clocks
[Figure: phi1 and phi2 clock waveforms; label: Edge-triggered]