Chap' 6: Pipelining presentation

1
Chap. 6 Pipelining

Joonwon Lee
Lecture slides: http://www-inst.eecs.berkeley.edu/~cs152/

2
The Five Stages of Load
[Figure: timing diagram showing Load occupying Cycle 1 through Cycle 5]

Ifetch: Instruction Fetch
Fetch the instruction from the Instruction Memory
Reg/Dec: Registers Fetch and Instruction Decode
Exec: Calculate the memory address
Mem: Read the data from the Data Memory
Wr: Write the data back to the register file

3
Pipelining

Improve performance by increasing instruction throughput
Ideal speedup is the number of stages in the pipeline. Do we achieve this?

4
Basic Idea

What do we need to add to actually split the
datapath into stages?

5
Graphically Representing Pipelines

Can help with answering questions like:
How many cycles does it take to execute this code?
What is the ALU doing during cycle 4?
Use this representation to help understand datapaths

6
Conventional Pipelined Execution Representation
[Figure: instructions listed along the Program Flow axis, overlapped along the Time axis]
7
Single Cycle, Multiple Cycle, vs. Pipeline
[Figure: clock timing diagrams for three implementations. Single Cycle Implementation: Load, then Store with wasted time (Cycle 1 to Cycle 2). Multiple Cycle Implementation: Load, Store, R-type back to back (Cycle 1 to Cycle 10). Pipeline Implementation: Load, Store, R-type overlapped.]
8
Why Pipeline?

Suppose we execute 100 instructions
Single Cycle Machine
45 ns/cycle x 1 CPI x 100 inst = 4500 ns
Multicycle Machine
10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns
Ideal pipelined machine
10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
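
The three totals can be checked with a few lines of arithmetic. This is only an illustrative sketch: the function names are invented for this example, and the cycle times, CPI values, and the 4-cycle drain are the ones quoted on the slide.

```python
def single_cycle_ns(n_inst, cycle_ns=45):
    # One long cycle per instruction, so CPI is 1.
    return cycle_ns * 1 * n_inst


def multicycle_ns(n_inst, cycle_ns=10, cpi=4.6):
    # Short cycles, but several cycles per instruction on average.
    return cycle_ns * cpi * n_inst


def ideal_pipeline_ns(n_inst, cycle_ns=10, stages=5):
    # One instruction finishes per cycle, plus (stages - 1) cycles to drain.
    return cycle_ns * (1 * n_inst + (stages - 1))


n = 100
print(single_cycle_ns(n), multicycle_ns(n), ideal_pipeline_ns(n))
```

With these numbers the ideal pipeline is roughly 4.3 times faster than the single-cycle machine, slightly under the ideal speedup of 5 because of the drain cycles.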

9
Why Pipeline? Because the resources are there!
[Figure: Inst 0 through Inst 4 in instruction order, overlapped across time (clock cycles)]
10
Can pipelining get us into trouble?

Yes: Pipeline Hazards
Structural hazards: attempt to use the same resource two different ways at the same time
E.g., a combined washer/dryer would be a structural hazard, or the folder is busy doing something else (watching TV)
Data hazards: attempt to use an item before it is ready
E.g., one sock of a pair is in the dryer and one is in the washer; can't fold until the sock from the washer gets through the dryer
An instruction depends on the result of a prior instruction still in the pipeline
Control hazards: attempt to make a decision before the condition is evaluated
E.g., washing football uniforms and needing to get the proper detergent level; need to see the result after the dryer before putting the next load in
Branch instructions
Hazards can always be resolved by waiting
Pipeline control must detect the hazard
and take action (or delay action) to resolve hazards

11
Single Memory is a Structural Hazard
[Figure: pipeline diagram of Load, Instr 1, Instr 2, Instr 3, Instr 4 in instruction order versus time (clock cycles), showing Mem and Reg usage]
Detection is easy in this case! (right half
highlight means read, left half write)
12
Control Hazard Solutions

Stall: wait until decision is clear
It's possible to move up the decision to the 2nd stage by adding hardware to check registers as they are being read
Impact: 2 clock cycles per branch instruction => slow

[Figure: pipeline diagram of Add, Beq, Load in instruction order versus time (clock cycles)]
13
Control Hazard Solutions

Predict: guess one direction, then back up if wrong
Predict not taken
Impact: 1 clock cycle per branch instruction if right, 2 if wrong (right ~50% of time)
More dynamic scheme: history of 1 branch (~90%)
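
The impact of prediction can be expressed as an expected cost per branch. The sketch below assumes the slide's figures are prediction accuracies of about 50% and 90%; the function name and parameter names are invented for this illustration.

```python
def avg_branch_cycles(p_correct, cycles_right=1, cycles_wrong=2):
    # Expected cycles per branch: weight each outcome by its probability.
    return p_correct * cycles_right + (1 - p_correct) * cycles_wrong


print(avg_branch_cycles(0.5))  # static predict-not-taken, right half the time
print(avg_branch_cycles(0.9))  # 1-branch history, right about 90% of the time
```

Under these assumptions the average falls from 1.5 cycles per branch to about 1.1, compared with a flat 2 cycles for always stalling.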

[Figure: pipeline diagram of Add, Beq, Load in instruction order versus time (clock cycles)]
14
Control Hazard Solutions

Redefine branch behavior (takes place after next instruction): "delayed branch"
Impact: 0 clock cycles per branch instruction if an instruction can be found to put in the slot (~50% of time)
As more instructions are launched per clock cycle, this becomes less useful

[Figure: pipeline diagram of Add, Beq, Misc, Load in instruction order versus time (clock cycles)]
15
Data Hazard on r1
add r1, r2, r3
sub r4, r1, r3
and r6, r1, r7
or r8, r1, r9
xor r10, r1, r11
16
Data Hazard on r1

Dependencies backwards in time are hazards

[Figure: pipeline diagram (stages IF, ID/RF, EX, MEM, WB) versus time (clock cycles) for add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11 in instruction order]
17
Data Hazard Solution

Forward the result from one stage to another
The "or" instruction is OK if register read/write are defined properly

[Figure: pipeline diagram (stages IF, ID/RF, EX, MEM, WB) versus time (clock cycles) for add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11 in instruction order]
18
Forwarding (or Bypassing): What about Loads?

Dependencies backwards in time are hazards
Can't solve with forwarding
Must delay/stall the instruction dependent on the load

[Figure: pipeline diagram (stages IF, ID/RF, EX, MEM, WB) versus time (clock cycles) for lw r1,0(r2) followed by sub r4,r1,r3]
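
Why forwarding handles the add case but not the load case can be sketched as a stall count. The timing model here is an assumption, not taken from the slides: a classic five-stage pipe (IF, ID, EX, MEM, WB) in which ALU results exist at the end of EX, load data at the end of MEM, and the register file is written in the first half of a cycle and read in the second half. The function name is invented for this example.

```python
def stalls_needed(distance, producer="alu", forwarding=True):
    """Stall cycles for a consumer issued `distance` instructions after its producer."""
    if distance < 1:
        raise ValueError("consumer must follow the producer")
    if not forwarding:
        # Consumer's ID must not come before the producer's WB (3 stages after ID).
        return max(0, 3 - distance)
    # Stage number in which the value becomes available: EX = 3, MEM = 4.
    ready_stage = {"alu": 3, "load": 4}[producer]
    # Consumer's EX (stage 3, shifted by distance) must come after ready_stage.
    return max(0, ready_stage - 2 - distance)


print(stalls_needed(1, "alu"))    # add followed directly by sub
print(stalls_needed(1, "load"))   # lw followed directly by sub
print(stalls_needed(1, "alu", forwarding=False))
```

In this model an ALU result can always be forwarded in time, while an instruction immediately after a load still needs one stall cycle, which matches the slide's conclusion.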
19
Designing a Pipelined Processor

Go back and examine your datapath and control diagram
Associate resources with states
Ensure that flows do not conflict, or figure out how to resolve the conflicts
Assert control in the appropriate stage

20
Pipelined Processor (almost) for slides

What happens if we start a new instruction every
cycle?

[Figure: pipelined datapath with PC, Next PC, Inst. Mem, Reg. File, Exec, Mem Access, Data Mem, and the Equal signal; instruction registers IR, IRex, IRmem, IRwb with Valid bits feed Dcd Ctrl, Ex Ctrl, Mem Ctrl, WB Ctrl]
21
Control and Datapath
IR <- Mem[PC]; PC <- PC + 4
A <- R[rs]; B <- R[rt]
S <- A + B
S <- A + SX
S <- A or ZX
S <- A + SX
If Cond PC <- PC + SX
M <- Mem[S]
Mem[S] <- B
R[rd] <- S
R[rd] <- M
R[rt] <- S
[Figure: datapath with PC, Next PC, IR, Inst. Mem, Reg. File, Exec, Mem Access, Data Mem, and the Equal signal]
22
Pipelining the Load Instruction
[Figure: clock timing diagram, Cycle 1 to Cycle 7, with successive lw instructions (2nd lw, 3rd lw) overlapping the first]

The five independent functional units in the pipeline datapath are:
Instruction Memory for the Ifetch stage
Register File's Read ports (bus A and bus B) for the Reg/Dec stage
ALU for the Exec stage
Data Memory for the Mem stage
Register File's Write port (bus W) for the Wr stage

23
The Four Stages of R-type
[Figure: timing diagram showing R-type occupying Cycle 1 through Cycle 4]

Ifetch: Instruction Fetch
Fetch the instruction from the Instruction Memory
Reg/Dec: Registers Fetch and Instruction Decode
Exec:
ALU operates on the two register operands
Update PC
Wr: Write the ALU output back to the register file

24
Pipelining the R-type and Load Instruction
[Figure: clock timing diagram, Cycle 1 to Cycle 9, for the sequence R-type, R-type, Load, R-type, R-type]
Oops! We have a problem!

We have a pipeline conflict, or structural hazard:
Two instructions try to write to the register file at the same time!
Only one write port

25
Important Observation

Each functional unit can only be used once per instruction
Each functional unit must be used at the same stage for all instructions
Load uses the Register File's Write Port during its 5th stage
R-type uses the Register File's Write Port during its 4th stage

There are 2 ways to solve this pipeline hazard.
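
The write-port conflict can be found mechanically by computing, for each instruction, the cycle in which it reaches its write stage. The sketch below is illustrative only; the function name and the dictionary layout are invented, and it assumes one instruction is issued per cycle starting at Cycle 1.

```python
from collections import defaultdict


def write_port_conflicts(kinds, wb_stage):
    """Return {cycle: [instruction indices]} for cycles with more than one writer.

    kinds:    instruction types in issue order, one issued per cycle from cycle 1
    wb_stage: maps instruction type to the stage number that uses the write port
    """
    writers = defaultdict(list)
    for index, kind in enumerate(kinds):
        start_cycle = index + 1
        writers[start_cycle + wb_stage[kind] - 1].append(index)
    return {cycle: who for cycle, who in writers.items() if len(who) > 1}


program = ["R-type", "R-type", "Load", "R-type", "R-type"]
print(write_port_conflicts(program, {"R-type": 4, "Load": 5}))
print(write_port_conflicts(program, {"R-type": 5, "Load": 5}))
```

With R-type writing in its 4th stage, the Load and the R-type behind it both want the port in cycle 7. If every instruction writes in stage 5, the conflict disappears.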

26
Solution 1: Insert Bubble into the Pipeline
[Figure: clock timing diagram, Cycle 1 to Cycle 9, for Load followed by R-type instructions, with a Bubble inserted into the pipeline]

Insert a bubble into the pipeline to prevent 2 writes in the same cycle
The control logic can be complex.
Lose an instruction fetch and issue opportunity.
No instruction is started in Cycle 6!

27
Solution 2: Delay R-type's Write by One Cycle

Delay R-type's register write by one cycle
Now R-type instructions also use the Reg File's write port at Stage 5
Mem stage is a NOOP stage: nothing is being done.

[Figure: R-type now has five stages (1 to 5, with Mem as stage 4); clock timing diagram, Cycle 1 to Cycle 9, for the sequence R-type, R-type, Load, R-type, R-type]
28
Modified Control and Datapath
IR <- Mem[PC]; PC <- PC + 4
A <- R[rs]; B <- R[rt]
S <- A + B
S <- A + SX
S <- A or ZX
S <- A + SX
if Cond PC <- PC + SX
M <- Mem[S]
Mem[S] <- B
M <- S
M <- S
R[rd] <- M
R[rd] <- M
R[rt] <- M
[Figure: datapath with PC, Next PC, IR, Inst. Mem, Reg. File, Exec, S, Mem Access, Data Mem, and the Equal signal]
29
The Four Stages of Store
[Figure: timing diagram of Store across Cycle 1 to Cycle 4, with a Wr slot shown]

Ifetch: Instruction Fetch
Fetch the instruction from the Instruction Memory
Reg/Dec: Registers Fetch and Instruction Decode
Exec: Calculate the memory address
Mem: Write the data into the Data Memory

30
The Three Stages of Beq
[Figure: timing diagram of Beq across Cycle 1 to Cycle 4, with a Wr slot shown]

Ifetch: Instruction Fetch
Fetch the instruction from the Instruction Memory
Reg/Dec: Registers Fetch and Instruction Decode
Exec:
Compares the two register operands,
selects the correct branch target address,
and latches it into the PC

31
Control Diagram
IR <- Mem[PC]; PC <- PC + 4
A <- R[rs]; B <- R[rt]
S <- A + B
S <- A + SX
S <- A or ZX
S <- A + SX
If Cond PC <- PC + SX
M <- Mem[S]
Mem[S] <- B
M <- S
M <- S
R[rd] <- S
R[rd] <- M
R[rt] <- S
[Figure: datapath with PC, Next PC, IR, Inst. Mem, Reg. File, Exec, Mem Access, Data Mem, and the Equal signal]
32
Datapath Data Stationary Control
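[Figure: datapath with data-stationary control. Decode of IR produces control fields (op, fun, rs, rt, im, rw, ex, me, wb, valid bit v) that move down the pipe with the instruction through Exec, Mem Ctrl, and WB Ctrl, alongside Inst. Mem, Reg. File, Mem Access, Data Mem, and Next PC]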
33
Lets Try it Out
10  lw r1, r2(35)
14  addI r2, r2, 3
20  sub r3, r4, r5
24  beq r6, r7, 100
30  ori r8, r9, 17
34  add r10, r11, r12
100 and r13, r14, 15
(these addresses are octal)
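
The stage occupancy shown on the following slides can be generated from the fetch order alone. This sketch is illustrative: the names are invented, the fetch order (including the redirect to 100 after 30 and 34 have been fetched) is read off the slide titles rather than computed from the branch, and squashing of the extra instruction is not modeled.

```python
STAGES = ["Fetch", "Dcd", "Ex", "Mem", "WB"]

# Addresses in the order they are fetched, written as octal literals.
fetch_order = [0o10, 0o14, 0o20, 0o24, 0o30, 0o34, 0o100, 0o104, 0o110, 0o114]


def stage_occupancy(cycle):
    """Describe which address is in each stage during a 0-based cycle."""
    parts = []
    for depth, stage in enumerate(STAGES):
        index = cycle - depth  # the instruction fetched `depth` cycles earlier
        if 0 <= index < len(fetch_order):
            parts.append(f"{stage} {fetch_order[index]:o}")
    return ", ".join(parts)


for cycle in range(len(fetch_order)):
    print(stage_occupancy(cycle))
```

Each printed line corresponds to one slide title in the walkthrough, with addresses printed in octal.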
34
Start Fetch 10
[Figure: datapath snapshot; the pipeline is empty and Next PC = 10]
35
Fetch 14, Decode 10
[Figure: datapath snapshot; lw r1, r2(35) is in Decode (reading register 2); Next PC = 14]
36
Fetch 20, Decode 14, Exec 10
[Figure: datapath snapshot; addI r2, r2, 3 is in Decode; lw r1 is in Exec with operands r2 and 35; Next PC = 20]
37
Fetch 24, Decode 20, Exec 14, Mem 10
[Figure: datapath snapshot; sub r3, r4, r5 is in Decode; addI r2, r2, 3 is in Exec with operand r2; lw r1 is in Mem with address r2+35; Next PC = 24]
38
Fetch 30, Dcd 24, Ex 20, Mem 14, WB 10
[Figure: datapath snapshot; beq r6, r7, 100 is in Decode; sub r3 is in Exec with operand r4; addI r2 is in Mem with result r2+3; lw r1 is in WB with M[r2+35]; Next PC = 30]
39
Fetch 34, Dcd 30, Ex 24, Mem 20, WB 14
[Figure: datapath snapshot; ori r8, r9, 17 is in Decode; beq is in Exec with operand r6 and target 100; sub r3 is in Mem with r4-r5; addI r2 is in WB with r2+3; r1 = M[r2+35] has been written; Next PC = 34]
40
Fetch 100, Dcd 34, Ex 30, Mem 24, WB 20
[Figure: datapath snapshot; add r10, r11, r12 is in Decode; ori r8 is in Exec with operands r9 and 17; beq is in Mem; sub r3 is in WB with r4-r5; r2 = r2+3 and r1 = M[r2+35] have been written; Next PC = 100]
Ooops, we should have only one delayed instruction
41
Fetch 104, Dcd 100, Ex 34, Mem 30, WB 24
[Figure: datapath snapshot; and r13, r14, r15 is in Decode; add r10 is in Exec with operand r11; ori r8 is in Mem with r9 | 17; beq is in WB; r3 = r4-r5, r2 = r2+3, and r1 = M[r2+35] have been written; Next PC = 104]
Squash the extra instruction in the branch shadow!
42
Fetch 110, Dcd 104, Ex 100, Mem 34, WB 30
[Figure: datapath snapshot; and r13 is in Exec with operand r14; add r10 is in Mem with r11+r12; ori r8 is in WB with r9 | 17; r3 = r4-r5, r2 = r2+3, and r1 = M[r2+35] have been written; Next PC = 110]
Squash the extra instruction in the branch shadow!
43
Fetch 114, Dcd 110, Ex 104, Mem 100, WB 34
[Figure: datapath snapshot; and r13 is in Mem with r14 & r15; add r10 is in WB with r11+r12, marked NO WB, NO Ovflow; r8 = r9 | 17, r3 = r4-r5, r2 = r2+3, and r1 = M[r2+35] have been written; Next PC = 114]
Squash the extra instruction in the branch shadow!
44
Pipeline Hazards Again
[Figure: overlapped instruction stage sequences (I-Fetch, DCD, OpFetch, Exec, Mem, WB) illustrating each of the following hazards]
Structural Hazard
Control Hazard
RAW (read after write) Data Hazard
WAW (write after write) Data Hazard
WAR (write after read) Data Hazard
45
Data Hazards

Avoid some by design
Eliminate WAR by always fetching operands early (DCD) in the pipe
Eliminate WAW by doing all WBs in order (last stage, static)
Detect and resolve the remaining ones
Stall or forward (if possible)

46
Hazard Detection

Suppose instruction i is about to be issued and a predecessor instruction j is in the instruction pipeline.
A RAW hazard exists on register ρ if ρ ∈ Rregs(i) ∩ Wregs(j)
Keep a record of pending writes (for instructions in the pipe) and compare with the operand regs of the current instruction.
When an instruction issues, reserve its result register.
When an operation completes, remove its write reservation.
A WAW hazard exists on register ρ if ρ ∈ Wregs(i) ∩ Wregs(j)
A WAR hazard exists on register ρ if ρ ∈ Wregs(i) ∩ Rregs(j)

47
Record of Pending Writes
Current operand registers
Pending writes
hazard <=
((rs == rwex) & regWex) OR
((rs == rwmem) & regWme) OR
((rs == rwwb) & regWwb) OR
((rt == rwex) & regWex) OR
((rt == rwmem) & regWme) OR
((rt == rwwb) & regWwb)

[Figure: pipelined datapath (IAU, npc, PC, I mem, Regs, alu, D mem, Regs); each pipeline register carries op, rw, and n along with the data latches im, A, B, S, m]
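
The hazard signal compares the two operand registers of the decoding instruction against the destination register held in each later pipeline stage, qualified by that stage's register-write enable. The sketch below is one assumed reading of that logic; the function name and the tuple layout are invented for this example.

```python
def hazard(rs, rt, pending):
    """True if rs or rt matches an enabled pending write.

    pending: (rw, reg_write_enable) pairs for the EX, MEM, and WB stages.
    """
    return any(enabled and rw in (rs, rt) for rw, enabled in pending)


# sub r4,r1,r3 in decode while add r1,... sits in EX with its write enabled
print(hazard(rs=1, rt=3, pending=[(1, True), (0, False), (5, True)]))
# the same operands, but the stage holding r1 does not write a register
print(hazard(rs=1, rt=3, pending=[(1, False), (2, True), (4, True)]))
```

The `any` over three stages and two operands expands to the same six comparisons that are ORed together on the slide.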
48
Resolve RAW by forwarding
Detect the nearest valid write to the op's operand register and forward it into the op latches, bypassing the remainder of the pipe
Increase muxes to add paths from pipeline registers
Data Forwarding = Data Bypassing

[Figure: pipelined datapath (IAU, npc, PC, I mem, Regs, alu, D mem, Regs) with a Forward mux feeding the operand latches A and B from the later pipeline registers]
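
The forward mux can be modeled as a priority search over the later pipeline stages, nearest first. This is a simplified sketch with invented names: it assumes every stage latch already holds its result value, so it does not model the load case, where the data is not yet available in EX.

```python
def forward_operand(reg, regfile_value, pipeline_writes):
    """Pick the operand value for `reg`.

    pipeline_writes: nearest-first (rw, reg_write_enable, value) tuples for the
    EX, MEM, and WB stage latches. The nearest valid match wins; otherwise the
    value read from the register file is used.
    """
    for rw, enabled, value in pipeline_writes:
        if enabled and rw == reg:
            return value
    return regfile_value


# r1 is being written by both the EX-stage and MEM-stage instructions.
print(forward_operand(1, 10, [(1, True, 99), (1, True, 50), (4, True, 7)]))
# r2 has no pending write, so the register file value is used.
print(forward_operand(2, 10, [(1, True, 99), (1, True, 50), (4, True, 7)]))
```

Searching nearest-first matters when the same register is written twice in a row: the younger write is the one the consumer must see.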
49
What about memory operations?
If instructions are initiated in order and operations always occur in the same stage, there can be no hazards between memory operations!
What does delaying WB on arithmetic operations cost? Cycles? Hardware?
What about data dependence on loads?
R1 <- R4 + R5
R2 <- Mem[R2 + I]
R3 <- R2 + R1
=> "Delayed Loads"
[Figure: two "op Rd Ra Rb" instruction registers with latches A, B, R, T and the Rd path to the reg file]
50
Compiler Avoiding Load Stalls
51
What about Interrupts, Traps, Faults?

External Interrupts
Allow the pipeline to drain
Load PC with the interrupt address
Faults (within instruction, restartable)
Force a trap instruction into IF
Disable writes until the trap hits WB
Must save multiple PCs or PC + state

Refer to the MIPS solution
52
Exception Handling
[Figure: pipelined datapath (IAU, npc, PC, I mem, Regs, alu, D mem, Regs) executing lw 2,20(5), annotated stage by stage: detect bad instruction address; detect bad instruction; detect overflow; detect bad data address; allow exception to take effect]
53
Exception Problem

Exceptions/Interrupts: 5 instructions executing in a 5-stage pipeline
How to stop the pipeline?
Restart?
Who caused the interrupt?
Stage  Problem interrupts occurring
IF     Page fault on instruction fetch; misaligned memory access; memory-protection violation
ID     Undefined or illegal opcode
EX     Arithmetic exception
MEM    Page fault on data fetch; misaligned memory access; memory-protection violation; memory error
Load with data page fault, Add with instruction page fault?
Solution 1: interrupt vector/instruction
Solution 2: interrupt ASAP, restart everything incomplete

54
Resolution: Freeze above, Bubble Below
[Figure: pipelined datapath (IAU, npc, PC, I mem, Regs, alu, D mem, Regs); the stages above the stalled instruction are frozen (freeze) and a bubble is inserted below]
55
Issues in Pipelined design
Technique, with its main limitation:
Pipelining (Limitation: issue rate, FU stalls, FU depth)
Super-pipeline: issue one instruction per (fast) cycle; ALU takes multiple cycles (Limitation: clock skew, FU stalls, FU depth)
Super-scalar: issue multiple scalar instructions per cycle (Limitation: hazard resolution)
VLIW: each instruction specifies multiple scalar operations; compiler determines parallelism (Limitation: packing)
Vector operations: each instruction specifies a series of identical operations (Limitation: applicability)
[Figure: overlapped IF, D, Ex, M, W stage diagrams for each technique]
56
Partitioned Instruction Issue (simple Superscalar)
Independent int and FP issue to separate pipelines
[Figure: block diagram with I-Cache, Inst Issue and Bypass, Int Reg, FP Reg, Operand / Result Busses, Int Unit, Load / Store Unit, FP Add, FP Mul, D-Cache]
Single Issue Total Time = Int Time + FP Time
Max Speedup = Total Time / MAX(Int Time, FP Time)
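
Reading the bound as the single-issue total time divided by the longer of the two pipelines' times, a small calculation shows how the instruction mix limits the gain. The function name and the sample times are invented for this illustration.

```python
def max_speedup(int_time, fp_time):
    # Single-issue time is the sum; with split issue the longer stream dominates.
    total_time = int_time + fp_time
    return total_time / max(int_time, fp_time)


print(max_speedup(50, 50))  # perfectly balanced mix
print(max_speedup(75, 25))  # integer-heavy mix
```

The speedup reaches 2 only for a perfectly balanced mix; a 75/25 split yields about 1.33.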
57
Unrolling
58
Software Pipelining
59
Multiple Pipes/ Harder Superscalar
Issues:
Reg. File ports
Detecting Data Dependences
Bypassing
RAW Hazard
WAR Hazard
Multiple load/store ops?
Branches
[Figure: two instruction registers (IR0, IR1) sharing a Register File, with latches A, B, R, D, T]
60
Branch penalties in superscalar
Example: resolved in op-fetch stage, single exposed delay (a la MIPS, Sparc)
[Figure: two I-fetch cases showing the Branch, its delay slot, and the squashed instructions (Squash 2 in one case, Squash 1 in the other)]
61
Getting CPI < 1: Issuing Multiple Instructions/Cycle

Two main variations: Superscalar and VLIW
Superscalar: varying no. of instructions/cycle (1 to 6)
Parallelism and dependencies determined/resolved by HW
IBM PowerPC 604, Sun UltraSparc, DEC Alpha 21164, HP 7100
Very Long Instruction Words (VLIW): fixed number of instructions (16); parallelism determined by compiler
Pipeline is exposed; compiler must schedule delays to get the right result
Explicit Parallel Instruction Computer (EPIC)/Intel
128-bit packets containing 3 instructions (can execute sequentially)
Can link 128-bit packets together to allow more parallelism
Compiler determines parallelism; HW checks dependencies and forwards/stalls

62
Getting CPI < 1: Issuing Multiple Instructions/Cycle

Superscalar DLX: 2 instructions, 1 FP and 1 anything else
Fetch 64 bits/clock cycle; Int on left, FP on right
Can only issue 2nd instruction if 1st instruction issues
More ports for FP registers to do FP load and FP op in a pair
Type              Pipe Stages
Int. instruction  IF ID EX MEM WB
FP instruction    IF ID EX MEM WB
Int. instruction     IF ID EX MEM WB
FP instruction       IF ID EX MEM WB
Int. instruction        IF ID EX MEM WB
FP instruction          IF ID EX MEM WB
1 cycle load delay expands to 3 instructions in SS
The instruction in the right half can't use it, nor can the instructions in the next slot

63
Unrolled Loop that Minimizes Stalls for Scalar
1  Loop: LD F0,0(R1)
2        LD F6,-8(R1)
3        LD F10,-16(R1)
4        LD F14,-24(R1)
5        ADDD F4,F0,F2
6        ADDD F8,F6,F2
7        ADDD F12,F10,F2
8        ADDD F16,F14,F2
9        SD 0(R1),F4
10       SD -8(R1),F8
11       SD -16(R1),F12
12       SUBI R1,R1,32
13       BNEZ R1,LOOP
14       SD 8(R1),F16    (8 - 32 = -24)
14 clock cycles, or 3.5 per iteration
LD to ADDD: 1 Cycle
ADDD to SD: 2 Cycles
64
Loop Unrolling in Superscalar

      Integer instruction  FP instruction    Clock cycle
Loop: LD F0,0(R1)                            1
      LD F6,-8(R1)                           2
      LD F10,-16(R1)       ADDD F4,F0,F2     3
      LD F14,-24(R1)       ADDD F8,F6,F2     4
      LD F18,-32(R1)       ADDD F12,F10,F2   5
      SD 0(R1),F4          ADDD F16,F14,F2   6
      SD -8(R1),F8         ADDD F20,F18,F2   7
      SD -16(R1),F12                         8
      SD -24(R1),F16                         9
      SUBI R1,R1,40                          10
      BNEZ R1,LOOP                           11
      SD -32(R1),F20                         12
Unrolled 5 times to avoid delays (+1 due to SS)
12 clocks, or 2.4 clocks per iteration

65
Software Pipelining

Observation: if iterations from loops are independent, then can get ILP by taking instructions from different iterations
Software pipelining: reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop (Tomasulo in SW)

66
Software Pipelining Example

Before: Unrolled 3 times
1 LD F0,0(R1)
2 ADDD F4,F0,F2
3 SD 0(R1),F4
4 LD F6,-8(R1)
5 ADDD F8,F6,F2
6 SD -8(R1),F8
7 LD F10,-16(R1)
8 ADDD F12,F10,F2
9 SD -16(R1),F12
10 SUBI R1,R1,24
11 BNEZ R1,LOOP

After: Software Pipelined
1 SD 0(R1),F4      Stores M[i]
2 ADDD F4,F0,F2    Adds to M[i-1]
3 LD F0,-16(R1)    Loads M[i-2]
4 SUBI R1,R1,8
5 BNEZ R1,LOOP

Symbolic Loop Unrolling
Less code space
Fill and drain the pipe only once vs. each iteration in loop unrolling
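
The same restructuring can be tried out in a high-level language to confirm that it computes the same result. The sketch below is an illustration only: the function names are invented, it walks the array upward rather than counting R1 down as the slide's code does, and it adds the explicit fill (prologue) and drain (epilogue) code that the slide leaves out.

```python
def plain_loop(m, s):
    out = list(m)
    for i in range(len(out)):
        f0 = out[i]      # LD
        f4 = f0 + s      # ADDD
        out[i] = f4      # SD
    return out


def software_pipelined(m, s):
    out = list(m)
    n = len(out)
    if n < 3:
        return plain_loop(m, s)
    # Prologue: fill the pipe.
    f4 = out[0] + s      # LD and ADDD for element 0
    f0 = out[1]          # LD for element 1
    # Steady state: each pass mixes three different original iterations.
    for i in range(n - 2):
        out[i] = f4          # SD stores element i
        f4 = f0 + s          # ADDD works on element i+1
        f0 = out[i + 2]      # LD fetches element i+2
    # Epilogue: drain the pipe.
    out[n - 2] = f4
    out[n - 1] = f0 + s
    return out


print(software_pipelined([1, 2, 3, 4, 5], 10))
```

Inside the steady-state loop no instruction depends on a value produced in the same pass, which is what removes the load-to-add and add-to-store stalls.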

67
Limits of Superscalar

While the Integer/FP split is simple for the HW, we get a CPI of 0.5 only for programs with:
Exactly 50% FP operations
No hazards
If more instructions issue at the same time, decode and issue become more difficult
Even 2-scalar => examine 2 opcodes, 6 register specifiers, and decide if 1 or 2 instructions can issue
VLIW: trade off instruction space for simple decoding
The long instruction word has room for many operations
By definition, all the operations the compiler puts in the long instruction word can execute in parallel
E.g., 2 integer operations, 2 FP ops, 2 Memory refs, 1 branch
16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide
Need a compiling technique that schedules across several branches

68
Loop Unrolling in VLIW

Memory          Memory          FP               FP               Int. op/       Clock
reference 1     reference 2     operation 1      op. 2            branch
LD F0,0(R1)     LD F6,-8(R1)                                                     1
LD F10,-16(R1)  LD F14,-24(R1)                                                   2
LD F18,-32(R1)  LD F22,-40(R1)  ADDD F4,F0,F2    ADDD F8,F6,F2                   3
LD F26,-48(R1)                  ADDD F12,F10,F2  ADDD F16,F14,F2                 4
                                ADDD F20,F18,F2  ADDD F24,F22,F2                 5
SD 0(R1),F4     SD -8(R1),F8    ADDD F28,F26,F2                                  6
SD -16(R1),F12  SD -24(R1),F16                                                   7
SD -32(R1),F20  SD -40(R1),F24                                    SUBI R1,R1,48  8
SD -0(R1),F28                                                     BNEZ R1,LOOP   9
Unrolled 7 times to avoid delays
7 results in 9 clocks, or 1.3 clocks per iteration
Need more registers in VLIW (EPIC => 128 int + 128 FP)
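
The per-iteration figures quoted for the three unrolled schedules (scalar, superscalar, VLIW) are simply total clocks divided by the number of original iterations covered. The snippet below repeats that arithmetic with the clock and unroll counts from the slides; the dictionary labels and function name are invented for this example.

```python
# (total clocks, original iterations covered), as given on the slides
schedules = {
    "scalar, unrolled 4 times": (14, 4),
    "superscalar, unrolled 5 times": (12, 5),
    "VLIW, unrolled 7 times": (9, 7),
}


def clocks_per_iteration(clocks, iterations):
    return clocks / iterations


for name, (clocks, iterations) in schedules.items():
    print(f"{name}: {clocks_per_iteration(clocks, iterations):.1f} clocks/iteration")
```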

69
Trace Scheduling

Parallelism across IF branches vs. LOOP branches
Two steps:
Trace Selection
Find a likely sequence of basic blocks (trace) of (statically predicted) long sequence of straight-line code
Trace Compaction
Squeeze the trace into a few VLIW instructions
Need bookkeeping code in case the prediction is wrong

70
HW Schemes: Instruction Parallelism

Why in HW at run time?
Works when can't know real dependence at compile time
Compiler simpler
Code for one machine runs well on another
Key idea: Allow instructions behind a stall to proceed
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F12,F8,F14
Enables out-of-order execution => out-of-order completion
ID stage checked both for structural

71
HW Schemes: Instruction Parallelism

Out-of-order execution divides the ID stage:
1. Issue: decode instructions, check for structural hazards
2. Read operands: wait until no data hazards, then read operands
Scoreboards allow an instruction to execute whenever 1 and 2 hold, not waiting for prior instructions
CDC 6600: in-order issue, out-of-order execution, out-of-order commit (also called completion)

72
Scoreboard Implications

Out-of-order completion => WAR, WAW hazards?
Solutions for WAR:
Queue both the operation and copies of its operands
Read registers only during Read Operands stage
For WAW, must detect hazard: stall until other completes
Need to have multiple instructions in execution phase => multiple execution units or pipelined execution units
Scoreboard keeps track of dependencies, state or operations
Scoreboard replaces ID, EX, WB with 4 stages

73
Performance of Dynamic SS

Iteration  Instructions    Issues  Executes  Writes result
no.                        (clock-cycle number)
1          LD F0,0(R1)     1       2         4
1          ADDD F4,F0,F2   1       5         8
1          SD 0(R1),F4     2       9
1          SUBI R1,R1,8    3       4         5
1          BNEZ R1,LOOP    4       5
2          LD F0,0(R1)     5       6         8
2          ADDD F4,F0,F2   5       9         12
2          SD 0(R1),F4     6       13
2          SUBI R1,R1,8    7       8         9
2          BNEZ R1,LOOP    8       9
4 clocks per iteration
Branches, Decrements still take 1 clock cycle

74
Dynamic Branch Prediction

Solution: 2-bit scheme where the prediction changes only after two mispredictions

[Figure: four-state diagram with two Predict Taken states and two Predict Not Taken states; transitions are labeled T (taken) and NT (not taken)]
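
A common way to realize a 2-bit scheme is a saturating counter. The sketch below assumes that variant, which may differ in its middle-state transitions from the state diagram on the slide; the class and helper names are invented for this illustration.

```python
class TwoBitPredictor:
    """Saturating 2-bit counter: states 0 and 1 predict not taken, 2 and 3 predict taken."""

    def __init__(self, state=0):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(outcomes, predictor):
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# A loop branch: taken 9 times, then falls through once; the loop runs 3 times.
outcomes = ([True] * 9 + [False]) * 3
print(count_mispredictions(outcomes, TwoBitPredictor(state=3)))
```

Starting from the strongly-taken state, the loop-exit branch costs one misprediction per loop execution, because a single not-taken outcome is not enough to flip the prediction.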
75
BHT Accuracy

Mispredict because either:
Wrong guess for that branch
Got branch history of the wrong branch when indexing the table
4096-entry table: programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%
4096 is about as good as an infinite table, but 4096 is a lot of HW

76
Need Address at Same Time as Prediction

Branch Target Buffer (BTB): Address of branch index to get prediction AND branch address (if taken)
Note: must check for branch match now, since can't use wrong branch address
Return instruction addresses predicted with stack

77
HW support for More ILP

Avoid branch prediction by turning branches into conditionally executed instructions:
if (x) then A = B op C else NOP
If false, then neither store result nor cause exception
Expanded ISA of Alpha, MIPS, PowerPC, SPARC have conditional move; PA-RISC can annul any following instr.
EPIC: 64 1-bit condition fields selected so conditional execution
Drawbacks to conditional instructions:
Still takes a clock even if annulled
Stall if condition evaluated late
Complex conditions reduce effectiveness; condition becomes known late in pipeline

78
HW support for More ILP

Speculation: allow an instruction to execute without any consequences (including exceptions) if the branch is not actually taken (HW undo)
Often try to combine with dynamic scheduling
Separate speculative bypassing of results from real bypassing of results
When an instruction is no longer speculative, write results (instruction commit)
Execute out-of-order but commit in order

79
HW support for More ILP

Need HW buffer for results of uncommitted instructions: reorder buffer
Reorder buffer can be operand source
Once operand commits, result is found in register
3 fields: instr. type, destination, value
Use reorder buffer number instead of reservation station
Instructions commit in order, so it is easy to undo speculated instructions on mispredicted branches or on exceptions

[Figure: block diagram with FP Op Queue, Reorder Buffer, FP Regs, Res Stations, and FP Adder units]
80
Limits to Multi-Issue Machines

Inherent limitations of ILP
1 branch in 5: How to keep a 5-way VLIW busy?
Latencies of units: many operations must be scheduled
Need about Pipeline Depth x No. Functional Units of independent operations
Difficulties in building HW
Duplicate FUs to get parallel execution
Increase ports to Register File
VLIW example needs 7 read and 3 write for Int. Reg., 5 read and 3 write for FP reg
Increase ports to memory
Decoding SS and impact on clock rate, pipeline depth

81
Limits to Multi-Issue Machines

Limitations specific to either SS or VLIW implementation
Decode issue in SS
VLIW code size: unroll loops + wasted fields in VLIW
VLIW lock step => 1 hazard and all instructions stall
VLIW and binary compatibility

82
3 Recent Machines

              Alpha 21164   Pentium II      HP PA-8000
Year          1995          1996            1996
Clock         600 MHz (97)  300 MHz (97)    236 MHz (97)
Cache         8K/8K/96K/2M  16K/16K/0.5M    0/0/4M
Issue rate    2 int + 2 FP  3 instr (x86)   4 instr
Pipe stages   7-9           12-14           7-9
Out-of-Order  6 loads       40 instr (µop)  56 instr
Rename regs   none          40              56

83
SPECint95base Performance (Oct. 1997)
84
SPECfp95base Performance (Oct. 1997)
85
Summary Pipelining

What makes it easy?
All instructions are the same length
Just a few instruction formats
Memory operands appear only in loads and stores
What makes it hard?
Structural hazards: suppose we had only one memory
Control hazards: need to worry about branch instructions
Data hazards: an instruction depends on a previous instruction

86
Summary

Pipelines pass control information down the pipe
just as data moves down pipe
Forwarding/Stalls handled by local control
Exceptions stop the pipeline
More performance from deeper pipelines,
parallelism

87
Summary

Superscalar and VLIW
CPI < 1
Dynamic issue vs. Static issue
The more instructions issue at the same time, the larger the penalty of hazards
SW Pipelining
Symbolic Loop Unrolling to get the most from the pipeline with little code expansion and little overhead
