Title: Chap. 6: Pipelining
1Chap. 6 Pipelining
- Joonwon Lee
- lecture slides: http://www-inst.eecs.berkeley.edu/~cs152/
2The Five Stages of Load
Cycle 1
Cycle 2
Cycle 3
Cycle 4
Cycle 5
Load
- Ifetch: Instruction Fetch
  - Fetch the instruction from the Instruction Memory
- Reg/Dec: Registers Fetch and Instruction Decode
- Exec: Calculate the memory address
- Mem: Read the data from the Data Memory
- Wr: Write the data back to the register file
3Pipelining
- Improve performance by increasing instruction throughput
- Ideal speedup is the number of stages in the pipeline. Do we achieve this?
4Basic Idea
- What do we need to add to actually split the datapath into stages?
5Graphically Representing Pipelines
- Can help with answering questions like:
  - how many cycles does it take to execute this code?
  - what is the ALU doing during cycle 4?
- use this representation to help understand datapaths
6Conventional Pipelined Execution Representation
Time
Program Flow
7Single Cycle, Multiple Cycle, vs. Pipeline
Cycle 1
Cycle 2
Clk
Single Cycle Implementation
Load
Store
Waste
Cycle 1
Cycle 2
Cycle 3
Cycle 4
Cycle 5
Cycle 6
Cycle 7
Cycle 8
Cycle 9
Cycle 10
Clk
Multiple Cycle Implementation
Load
Store
R-type
Pipeline Implementation
Load
Store
R-type
8Why Pipeline?
- Suppose we execute 100 instructions
- Single Cycle Machine
  - 45 ns/cycle x 1 CPI x 100 inst = 4500 ns
- Multicycle Machine
  - 10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns
- Ideal pipelined machine
  - 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
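The three estimates above can be checked with a few lines of Python. This is only a sketch of the slide's arithmetic; the function and parameter names are illustrative and not part of the lecture.

```python
# Execution-time estimates for N instructions, using the figures on the slide.
# Function and parameter names are illustrative assumptions.

def single_cycle_ns(n_inst, cycle_ns=45):
    # One long cycle per instruction (CPI = 1).
    return cycle_ns * 1 * n_inst

def multicycle_ns(n_inst, cycle_ns=10, cpi=4.6):
    # Short cycles, but several of them per instruction on average.
    return cycle_ns * cpi * n_inst

def pipelined_ns(n_inst, cycle_ns=10, drain_cycles=4):
    # One instruction completes per cycle once the pipe is full,
    # plus the extra cycles needed to drain the last instruction.
    return cycle_ns * (1 * n_inst + drain_cycles)

if __name__ == "__main__":
    n = 100
    print("single cycle:", single_cycle_ns(n), "ns")
    print("multicycle:  ", round(multicycle_ns(n)), "ns")
    print("pipelined:   ", pipelined_ns(n), "ns")
```

The drain term is why the pipelined machine does not reach the full ideal speedup for a short program; it becomes negligible as the instruction count grows.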
9Why Pipeline? Because the resources are there!
Time (clock cycles)
I n s t r. O r d e r
Inst 0
Inst 1
Inst 2
Inst 3
Inst 4
10Can pipelining get us into trouble?
- Yes: Pipeline Hazards
  - structural hazards: attempt to use the same resource two different ways at the same time
    - E.g., combined washer/dryer would be a structural hazard, or folder busy doing something else (watching TV)
  - data hazards: attempt to use item before it is ready
    - E.g., one sock of pair in dryer and one in washer; can't fold until get sock from washer through dryer
    - instruction depends on result of prior instruction still in the pipeline
  - control hazards: attempt to make a decision before condition is evaluated
    - E.g., washing football uniforms and need to get proper detergent level; need to see after dryer before next load in
    - branch instructions
- Can always resolve hazards by waiting
  - pipeline control must detect the hazard
  - take action (or delay action) to resolve hazards
11Single Memory is a Structural Hazard
Time (clock cycles)
I n s t r. O r d e r
Mem
Reg
Reg
Load
Instr 1
Instr 2
Mem
Mem
Reg
Reg
Instr 3
Instr 4
Detection is easy in this case! (right half
highlight means read, left half write)
12Control Hazard Solutions
- Stall: wait until decision is clear
  - It's possible to move up decision to 2nd stage by adding hardware to check registers as being read
- Impact: 2 clock cycles per branch instruction => slow
I n s t r. O r d e r
Time (clock cycles)
Mem
Reg
Reg
Add
Mem
Reg
Reg
Beq
Load
Mem
Reg
Reg
13Control Hazard Solutions
- Predict: guess one direction, then back up if wrong
  - Predict not taken
- Impact: 1 clock cycle per branch instruction if right, 2 if wrong (right ~50% of time)
- More dynamic scheme: history of 1 branch (~90%)
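As a rough illustration of what these figures imply, the expected cost per branch is a weighted average of the "right" and "wrong" costs. The sketch below reads the slide's accuracy figures as about 50% for static predict-not-taken and about 90% for the one-branch-history scheme; the function name is an assumption.

```python
# Expected cycles per branch under prediction, as a weighted average.
# Reads the slide's figures as ~50% and ~90% accuracy (an assumption).

def avg_branch_cycles(p_correct, cycles_right=1, cycles_wrong=2):
    return p_correct * cycles_right + (1 - p_correct) * cycles_wrong

if __name__ == "__main__":
    for label, p in [("predict not taken", 0.5), ("1-branch history", 0.9)]:
        print(f"{label}: {avg_branch_cycles(p):.2f} cycles per branch")
```

For comparison, always stalling costs 2 cycles per branch regardless of outcome.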
I n s t r. O r d e r
Time (clock cycles)
Mem
Reg
Reg
Add
Mem
Reg
Reg
Beq
Load
Mem
Mem
Reg
Reg
14Control Hazard Solutions
- Redefine branch behavior (takes place after next instruction): "delayed branch"
- Impact: 0 clock cycles per branch instruction if can find instruction to put in "slot" (~50% of time)
- As launch more instructions per clock cycle, less useful
I n s t r. O r d e r
Time (clock cycles)
Mem
Reg
Reg
Add
Mem
Reg
Reg
Beq
Misc
Mem
Mem
Reg
Reg
Load
Mem
Mem
Reg
Reg
15Data Hazard on r1
add r1 ,r2,r3
sub r4, r1 ,r3
and r6, r1 ,r7
or r8, r1 ,r9
xor r10, r1 ,r11
16Data Hazard on r1
- Dependencies backwards in time are hazards
Time (clock cycles)
IF
ID/RF
EX
MEM
WB
add r1,r2,r3
Reg
Reg
ALU
Im
Dm
I n s t r. O r d e r
sub r4,r1,r3
Dm
Reg
Reg
Dm
Reg
and r6,r1,r7
Reg
Im
Dm
Reg
Reg
or r8,r1,r9
ALU
xor r10,r1,r11
17Data Hazard Solution
- "Forward" result from one stage to another
- "or" OK if define read/write properly
Time (clock cycles)
IF
ID/RF
EX
MEM
WB
add r1,r2,r3
Reg
Reg
ALU
Im
Dm
I n s t r. O r d e r
sub r4,r1,r3
Dm
Reg
Reg
Dm
Reg
and r6,r1,r7
Reg
Im
Dm
Reg
Reg
or r8,r1,r9
ALU
xor r10,r1,r11
18Forwarding (or Bypassing) What about Loads
- Dependencies backwards in time are hazards
- Can't solve with forwarding
- Must delay/stall instruction dependent on loads
Time (clock cycles)
IF
ID/RF
EX
MEM
WB
lw r1,0(r2)
Reg
Reg
ALU
Im
Dm
sub r4,r1,r3
Dm
Reg
Reg
19Designing a Pipelined Processor
- Go back and examine your datapath and control diagram
- associate resources with states
- ensure that flows do not conflict, or figure out how to resolve
- assert control in appropriate stage
20Pipelined Processor (almost) for slides
- What happens if we start a new instruction every
cycle?
Valid
IRex
IR
IRwb
Inst. Mem
IRmem
WB Ctrl
Dcd Ctrl
Ex Ctrl
Mem Ctrl
Equal
Reg. File
Reg File
Exec
PC
Next PC
Mem Access
Data Mem
21Control and Datapath
IR <- Mem[PC]; PC <- PC + 4
A <- R[rs]; B <- R[rt]
S <- A + B
S <- A + SX
S <- A or ZX
S <- A + SX
If Cond PC <- PC + SX
M <- Mem[S]
Mem[S] <- B
R[rd] <- S
R[rd] <- M
R[rt] <- S
Equal
Reg. File
Reg File
Exec
PC
IR
Next PC
Inst. Mem
Mem Access
Data Mem
22Pipelining the Load Instruction
Cycle 1
Cycle 2
Cycle 3
Cycle 4
Cycle 5
Cycle 6
Cycle 7
Clock
2nd lw
3rd lw
- The five independent functional units in the pipeline datapath are:
  - Instruction Memory for the Ifetch stage
  - Register File's Read ports (bus A and bus B) for the Reg/Dec stage
  - ALU for the Exec stage
  - Data Memory for the Mem stage
  - Register File's Write port (bus W) for the Wr stage
23The Four Stages of R-type
Cycle 1
Cycle 2
Cycle 3
Cycle 4
R-type
- Ifetch: Instruction Fetch
  - Fetch the instruction from the Instruction Memory
- Reg/Dec: Registers Fetch and Instruction Decode
- Exec
  - ALU operates on the two register operands
  - Update PC
- Wr: Write the ALU output back to the register file
24Pipelining the R-type and Load Instruction
Cycle 1
Cycle 2
Cycle 3
Cycle 4
Cycle 5
Cycle 6
Cycle 7
Cycle 8
Cycle 9
Clock
Oops! We have a problem!
R-type
R-type
Load
R-type
R-type
- We have a pipeline conflict or structural hazard:
  - Two instructions try to write to the register file at the same time!
  - Only one write port
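The conflict can be found mechanically by computing the cycle in which each instruction reaches its write stage. The sketch below assumes one instruction is started per cycle and uses the stage counts from these slides (R-type writes in its 4th stage, Load in its 5th); the names are illustrative.

```python
from collections import defaultdict

# Stage (1-based) in which each instruction type uses the register-file
# write port, per the slides: R-type in stage 4, Load in stage 5.
DEFAULT_WRITE_STAGE = {"R-type": 4, "Load": 5}

def write_port_conflicts(program, write_stage=None):
    """program: list of instruction types, one started per cycle from cycle 1.
    Returns {cycle: [instruction indices]} for cycles with more than one writer."""
    write_stage = write_stage or DEFAULT_WRITE_STAGE
    writers = defaultdict(list)
    for i, kind in enumerate(program):
        start_cycle = i + 1
        write_cycle = start_cycle + write_stage[kind] - 1
        writers[write_cycle].append(i)
    return {c: idx for c, idx in writers.items() if len(idx) > 1}

if __name__ == "__main__":
    program = ["R-type", "R-type", "Load", "R-type", "R-type"]
    print(write_port_conflicts(program))
    # With every write moved to stage 5, no two instructions collide.
    print(write_port_conflicts(program, {"R-type": 5, "Load": 5}))
```

For the sequence on the slide, the Load started in cycle 3 and the R-type started in cycle 4 both reach the write port in cycle 7.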
25Important Observation
- Each functional unit can only be used once per instruction
- Each functional unit must be used at the same stage for all instructions:
  - Load uses Register File's Write Port during its 5th stage
  - R-type uses Register File's Write Port during its 4th stage
- 2 ways to solve this pipeline hazard.
26Solution 1 Insert Bubble into the Pipeline
Cycle 1
Cycle 2
Cycle 3
Cycle 4
Cycle 5
Cycle 6
Cycle 7
Cycle 8
Cycle 9
Clock
Load
R-type
Pipeline
R-type
R-type
Bubble
- Insert a "bubble" into the pipeline to prevent 2 writes at the same cycle
  - The control logic can be complex.
  - Lose instruction fetch and issue opportunity.
- No instruction is started in Cycle 6!
27Solution 2 Delay R-types Write by One Cycle
- Delay R-type's register write by one cycle:
  - Now R-type instructions also use Reg File's write port at Stage 5
  - Mem stage is a NOOP stage: nothing is being done.
4
1
2
3
5
Mem
R-type
Cycle 1
Cycle 2
Cycle 3
Cycle 4
Cycle 5
Cycle 6
Cycle 7
Cycle 8
Cycle 9
Clock
R-type
R-type
Load
R-type
R-type
28Modified Control Datapath
IR <- Mem[PC]; PC <- PC + 4
A <- R[rs]; B <- R[rt]
S <- A + B
S <- A + SX
S <- A or ZX
S <- A + SX
if Cond PC <- PC + SX
M <- Mem[S]
Mem[S] <- B
M <- S
M <- S
R[rd] <- M
R[rd] <- M
R[rt] <- M
Equal
Reg. File
Reg File
S
Exec
PC
IR
Next PC
Inst. Mem
Mem Access
Data Mem
29The Four Stages of Store
Cycle 1
Cycle 2
Cycle 3
Cycle 4
Store
Wr
- Ifetch: Instruction Fetch
  - Fetch the instruction from the Instruction Memory
- Reg/Dec: Registers Fetch and Instruction Decode
- Exec: Calculate the memory address
- Mem: Write the data into the Data Memory
30The Three Stages of Beq
Cycle 1
Cycle 2
Cycle 3
Cycle 4
Beq
Wr
- Ifetch: Instruction Fetch
  - Fetch the instruction from the Instruction Memory
- Reg/Dec:
  - Registers Fetch and Instruction Decode
- Exec:
  - compare the two register operands,
  - select correct branch target address
  - latch into PC
31Control Diagram
IR <- Mem[PC]; PC <- PC + 4
A <- R[rs]; B <- R[rt]
S <- A + B
S <- A + SX
S <- A or ZX
S <- A + SX
If Cond PC <- PC + SX
M <- Mem[S]
Mem[S] <- B
M <- S
M <- S
R[rd] <- S
R[rd] <- M
R[rt] <- S
Equal
Reg. File
Reg File
Exec
PC
IR
Next PC
Inst. Mem
Mem Access
Data Mem
32Datapath Data Stationary Control
IR
v
v
v
fun
rw
rw
rw
wb
wb
wb
Inst. Mem
Decode
WB Ctrl
me
me
rt
Mem Ctrl
rs
ex
op
im
rs
rt
Reg. File
Reg File
Exec
Mem Access
Data Mem
Next PC
33Lets Try it Out
10  lw   r1, r2(35)
14  addI r2, r2, 3
20  sub  r3, r4, r5
24  beq  r6, r7, 100
30  ori  r8, r9, 17
34  add  r10, r11, r12
100 and  r13, r14, 15
(these addresses are octal)
34Start Fetch 10
Inst. Mem
Decode
WB Ctrl
Mem Ctrl
IR
im
rs
rt
Reg. File
Reg File
Exec
Mem Access
Data Mem
Next PC
10
PC
35Fetch 14, Decode 10
lw r1, r2(35)
Inst. Mem
Decode
WB Ctrl
Mem Ctrl
IR
im
2
rt
Reg. File
Reg File
Exec
Mem Access
Data Mem
Next PC
14
PC
36Fetch 20, Decode 14, Exec 10
addI r2, r2, 3
Inst. Mem
Decode
WB Ctrl
lw r1
Mem Ctrl
IR
35
2
rt
Reg. File
Reg File
r2
Exec
Mem Access
Data Mem
Next PC
20
PC
37Fetch 24, Decode 20, Exec 14, Mem 10
sub r3, r4, r5
addI r2, r2, 3
Inst. Mem
Decode
WB Ctrl
lw r1
Mem Ctrl
IR
3
4
5
Reg. File
Reg File
r2
r235
Exec
Mem Access
Data Mem
Next PC
24
PC
38Fetch 30, Dcd 24, Ex 20, Mem 14, WB 10
beq r6, r7 100
Inst. Mem
Decode
WB Ctrl
addI r2
lw r1
sub r3
Mem Ctrl
IR
6
7
Reg. File
Reg File
r4
Mr235
r23
Exec
Mem Access
Data Mem
Next PC
30
PC
39Fetch 34, Dcd 30, Ex 24, Mem 20, WB 14
ori r8, r9 17
Inst. Mem
Decode
WB Ctrl
addI r2
sub r3
Mem Ctrl
beq
IR
9
xx
100
r1Mr235
Reg. File
Reg File
r6
r23
r4-r5
Exec
Mem Access
Data Mem
Next PC
34
PC
40Fetch 100, Dcd 34, Ex 30, Mem 24, WB 20
Inst. Mem
Decode
ori r8
WB Ctrl
sub r3
beq
add r10, r11, r12
Mem Ctrl
11
12
17
Reg. File
r1Mr235
IR
Reg File
r9
r4-r5
r2 r23
xxx
Exec
Mem Access
Data Mem
Next PC
100
PC
ooops, we should have only one delayed instruction
41Fetch 104, Dcd 100, Ex 34, Mem 30, WB 24
n
Inst. Mem
Decode
add r10
WB Ctrl
beq
ori r8
Mem Ctrl
and r13, r14, r15
14
15
xx
Reg. File
r1Mr235
IR
Reg File
r11
xxx
r9 17
r2 r23
Exec
r3 r4-r5
Mem Access
Data Mem
Next PC
104
PC
Squash the extra instruction in the branch shadow!
42Fetch 110, Dcd 104, Ex 100, Mem 34, WB 30
n
Inst. Mem
Decode
ori r8
add r10
WB Ctrl
and r13
Mem Ctrl
xx
Reg. File
r1Mr235
IR
Reg File
r14
r9 17
r2 r23
r11r12
Exec
r3 r4-r5
Mem Access
Data Mem
Next PC
110
PC
Squash the extra instruction in the branch shadow!
43Fetch 114, Dcd 110, Ex 104, Mem 100, WB 34
n
NO WB NO Ovflow
and r13
Inst. Mem
Decode
add r10
WB Ctrl
Mem Ctrl
Reg. File
r1Mr235
IR
Reg File
r11r12
r2 r23
r14 R15
Exec
r3 r4-r5
r8 r9 17
Mem Access
Data Mem
Next PC
114
PC
Squash the extra instruction in the branch shadow!
44Pipeline Hazards Again
I-Fetch DCD MemOpFetch OpFetch
Exec Store
IFetch DCD
Structural Hazard
I-Fetch DCD OpFetch Jump
Control Hazard
IFetch DCD
IF DCD EX Mem WB
RAW (read after write) Data Hazard
IF DCD EX Mem
WB
WAW Data Hazard (write after write)
IF DCD EX Mem WB
IF DCD
OF Ex Mem
IF DCD OF Ex RS
WAR Data Hazard (write after read)
45Data Hazards
- Avoid some "by design"
  - eliminate WAR by always fetching operands early (DCD) in pipe
  - eliminate WAW by doing all WBs in order (last stage, static)
- Detect and resolve remaining ones
  - stall or forward (if possible)
46Hazard Detection
- Suppose instruction i is about to be issued and a predecessor instruction j is in the instruction pipeline.
- A RAW hazard exists on register ρ if ρ ∈ Rregs(i) ∩ Wregs(j)
  - Keep a record of pending writes (for inst's in the pipe) and compare with operand regs of current instruction.
  - When instruction issues, reserve its result register.
  - When an operation completes, remove its write reservation.
- A WAW hazard exists on register ρ if ρ ∈ Wregs(i) ∩ Wregs(j)
- A WAR hazard exists on register ρ if ρ ∈ Wregs(i) ∩ Rregs(j)
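The three conditions are set intersections between the registers read and written by the issuing instruction i and by an earlier instruction j that is still in the pipe. A minimal sketch follows, with register names as plain strings and a hypothetical `hazards` helper:

```python
# Hazard conditions as set intersections. i is the instruction about to
# issue, j is an earlier instruction still in the pipeline.
# The helper name and the string register names are illustrative.

def hazards(i_reads, i_writes, j_reads, j_writes):
    return {
        "RAW": set(i_reads) & set(j_writes),   # i reads what j will write
        "WAW": set(i_writes) & set(j_writes),  # both write the same register
        "WAR": set(i_writes) & set(j_reads),   # i writes what j still reads
    }

if __name__ == "__main__":
    # j: add r1,r2,r3   i: sub r4,r1,r3
    result = hazards(i_reads=["r1", "r3"], i_writes=["r4"],
                     j_reads=["r2", "r3"], j_writes=["r1"])
    for kind, regs in result.items():
        print(kind, sorted(regs))
```

In the example, only the RAW set is non-empty (it contains r1), which matches the data hazard on r1 shown earlier in the deck.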
47Record of Pending Writes
IAU
npc
- Current operand registers
- Pending writes
- hazard <=
  - ((rs == rwex) & regWex) OR
  - ((rs == rwmem) & regWme) OR
  - ((rs == rwwb) & regWwb) OR
  - ((rt == rwex) & regWex) OR
  - ((rt == rwmem) & regWme) OR
  - ((rt == rwwb) & regWwb)
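The hazard signal on this slide compares each operand register of the current instruction against the destination register held in each later pipeline stage. The sketch below reads the extracted expression as equality tests ANDed with a per-stage write-enable flag, which is an interpretation of the flattened text; the function and argument names are assumptions.

```python
# Pending-write check: a hazard exists if rs or rt matches the destination
# register (rw) of any later stage whose write-enable flag (regW) is set.
# The (rw, regW) pairs stand for the EX, MEM and WB pipeline registers.

def raw_hazard(rs, rt, pending):
    """pending: iterable of (rw, regW) pairs for the EX, MEM and WB stages."""
    return any(regW and rw in (rs, rt) for rw, regW in pending)

if __name__ == "__main__":
    # Current instruction reads r1 and r3; the instruction in EX will write r1.
    pending = [(1, True), (7, True), (0, False)]
    print(raw_hazard(rs=1, rt=3, pending=pending))
    # No stage is writing r4 or r5, so there is no hazard.
    print(raw_hazard(rs=4, rt=5, pending=pending))
```

The write-enable flag matters: a stage holding a store or branch has a destination field but will not write it, so a match there must not raise the hazard.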
I mem
Regs
op rw rs rt
PC
im
op
rw
n
B
A
alu
op
rw
n
S
D mem
m
op
rw
n
Regs
48Resolve RAW by forwarding
IAU
- Detect nearest valid write op operand register and forward into op latches, bypassing remainder of the pipe
- Increase muxes to add paths from pipeline registers
- Data Forwarding = Data Bypassing
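One way to picture the forward mux is as a priority search over the later pipeline stages, nearest first, falling back to the register file when no stage has a valid pending write to the operand register. The sketch below is a software model under that assumption; the dictionary keys and function name are hypothetical, and real hardware timing (when each value actually becomes available) is not modeled.

```python
# Forward-mux selection: pick the value from the nearest later stage that
# has a valid pending write to the operand register, else use the register
# file value. Stage records and key names are illustrative.

def forward_operand(reg, regfile_value, stages):
    """stages: list ordered nearest-first (EX, MEM, WB), each a dict with
    keys 'rw' (destination register), 'regW' (write enable), 'value'."""
    for stage in stages:
        if stage["regW"] and stage["rw"] == reg:
            return stage["value"]
    return regfile_value

if __name__ == "__main__":
    stages = [
        {"rw": 1, "regW": True, "value": 10},   # EX: newest write to r1
        {"rw": 1, "regW": True, "value": 20},   # MEM: older write to r1
        {"rw": 2, "regW": False, "value": 30},  # WB: not writing
    ]
    print(forward_operand(1, 99, stages))  # nearest valid write wins
    print(forward_operand(2, 99, stages))  # write disabled, use register file
```

Searching nearest-first is what "nearest valid write" requires: when two in-flight instructions write the same register, the younger result is the one the current instruction must see.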
npc
I mem
Regs
op rw rs rt
PC
Forward mux
im
op
rw
n
B
A
alu
op
rw
n
S
D mem
m
op
rw
n
Regs
49What about memory operations?
If instructions are initiated in order and operations always occur in the same stage, there can be no hazards between memory operations!
What does delaying WB on arithmetic operations cost? cycles? hardware?
What about data dependence on loads?
R1 <- R4 + R5
R2 <- Mem[R2 + I]
R3 <- R2 + R1 =>
op Rd Ra Rb
op Rd Ra Rb
A
B
Rd
R
"Delayed Loads"
T
Rd
to reg file
50Compiler Avoiding Load Stalls
51What about Interrupts, Traps, Faults?
- External Interrupts:
  - Allow pipeline to drain,
  - Load PC with interrupt address
- Faults (within instruction, restartable)
  - Force trap instruction into IF
  - disable writes till trap hits WB
  - must save multiple PCs or PC + state
Refer to MIPS solution
52Exception Handling
IAU
npc
detect bad instruction address
I mem
Regs
lw 2,20(5)
PC
detect bad instruction
im
op
rw
n
B
A
detect overflow
alu
S
detect bad data address
D mem
m
Allow exception to take effect
Regs
53Exception Problem
- Exceptions/Interrupts: 5 instructions executing in 5 stage pipeline
  - How to stop the pipeline?
  - Restart?
  - Who caused the interrupt?
- Stage: Problem interrupts occurring
  - IF: Page fault on instruction fetch; misaligned memory access; memory-protection violation
  - ID: Undefined or illegal opcode
  - EX: Arithmetic exception
  - MEM: Page fault on data fetch; misaligned memory access; memory-protection violation; memory error
- Load with data page fault, Add with instruction page fault?
- Solution 1: interrupt vector/instruction; 2: interrupt ASAP, restart everything incomplete
54Resolution Freeze above Bubble Below
IAU
npc
I mem
freeze
Regs
op rw rs rt
PC
bubble
im
op
rw
n
B
A
alu
op
rw
n
S
D mem
m
op
rw
n
Regs
55Issues in Pipelined design
Limitation
IF
D
Ex
M
W
Pipelining
IF
D
Ex
M
W
Issue rate, FU stalls, FU depth
IF
D
Ex
M
W
IF
D
Ex
M
W
Super-pipeline
- Issue one instruction per (fast) cycle
- ALU takes multiple cycles
IF
D
Ex
M
W
IF
D
Ex
M
W
Clock skew, FU stalls, FU depth
IF
D
Ex
M
W
IF
D
Ex
M
W
Super-scalar
Hazard resolution
IF
D
Ex
M
W
- Issue multiple scalar
IF
D
Ex
M
W
IF
D
Ex
M
W
instructions per cycle
IF
D
Ex
M
W
VLIW
- Each instruction specifies
Packing
IF
D
Ex
M
W
multiple scalar operations - Compiler determines
parallelism
Ex
M
W
Ex
M
W
Ex
M
W
Vector operations
Applicability
IF
D
Ex
M
W
- Each instruction specifies
Ex
M
W
Ex
M
W
series of identical operations
Ex
M
W
56Partitioned Instruction Issue (simple Superscalar)
independent int and FP issue to separate pipelines
I-Cache
Int Reg
Inst Issue and Bypass
FP Reg
Operand / Result Busses
Int Unit
Load / Store Unit
FP Add
FP Mul
D-Cache
Single Issue Total Time = Int Time + FP Time
Max Speedup = Total Time / MAX(Int Time, FP Time)
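Reading the slide as "total single-issue time is integer time plus FP time, and the best case overlaps the two completely", the bound can be sketched as follows. The function name and the sample times are illustrative assumptions.

```python
# Upper bound on speedup from issuing integer and FP work to separate
# pipelines: the two streams overlap, so the longer one sets the time.

def max_speedup(int_time, fp_time):
    total_time = int_time + fp_time          # single-issue time
    return total_time / max(int_time, fp_time)

if __name__ == "__main__":
    print(max_speedup(50, 50))   # perfectly balanced mix
    print(max_speedup(90, 10))   # integer-dominated mix gains little
```

The bound reaches 2 only when the integer and FP times are equal, which is the same point the later "Limits of Superscalar" slide makes about needing an even instruction mix.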
57Unrolling
58Software Pipelining
59Multiple Pipes/ Harder Superscalar
Issues:
- Reg. File ports
- Detecting Data Dependences
- Bypassing
- RAW Hazard
- WAR Hazard
- Multiple load/store ops?
- Branches
IR0
IR1
Register File
A
B
R
D
T
60Branch penalties in superscalar
Example: resolved in op-fetch stage, single exposed delay (a la MIPS, SPARC)
I-fetch
Branch
delay
Squash 2
I-fetch
Branch
Squash 1
delay
61Getting CPI < 1: Issuing Multiple Instructions/Cycle
- Two main variations: Superscalar and VLIW
- Superscalar: varying no. instructions/cycle (1 to 6)
  - Parallelism and dependencies determined/resolved by HW
  - IBM PowerPC 604, Sun UltraSparc, DEC Alpha 21164, HP 7100
- Very Long Instruction Words (VLIW): fixed number of instructions (16); parallelism determined by compiler
  - Pipeline is exposed; compiler must schedule delays to get right result
- Explicit Parallel Instruction Computer (EPIC)/Intel
  - 128 bit packets containing 3 instructions (can execute sequentially)
  - Can link 128 bit packets together to allow more parallelism
  - Compiler determines parallelism, HW checks dependencies and forwards/stalls
62Getting CPI < 1: Issuing Multiple Instructions/Cycle
- Superscalar DLX: 2 instructions, 1 FP & 1 anything else
  - Fetch 64-bits/clock cycle; Int on left, FP on right
  - Can only issue 2nd instruction if 1st instruction issues
  - More ports for FP registers to do FP load & FP op in a pair
Type              Pipe Stages
Int. instruction  IF  ID  EX  MEM WB
FP instruction    IF  ID  EX  MEM WB
Int. instruction      IF  ID  EX  MEM WB
FP instruction        IF  ID  EX  MEM WB
Int. instruction          IF  ID  EX  MEM WB
FP instruction            IF  ID  EX  MEM WB
- 1 cycle load delay expands to 3 instructions in SS
  - instruction in right half can't use it, nor instructions in next slot
63Unrolled Loop that Minimizes Stalls for Scalar
 1 Loop: LD   F0,0(R1)
 2       LD   F6,-8(R1)
 3       LD   F10,-16(R1)
 4       LD   F14,-24(R1)
 5       ADDD F4,F0,F2
 6       ADDD F8,F6,F2
 7       ADDD F12,F10,F2
 8       ADDD F16,F14,F2
 9       SD   0(R1),F4
10       SD   -8(R1),F8
11       SD   -16(R1),F12
12       SUBI R1,R1,32
13       BNEZ R1,LOOP
14       SD   8(R1),F16    ; 8-32 = -24
14 clock cycles, or 3.5 per iteration
LD to ADDD: 1 Cycle; ADDD to SD: 2 Cycles
64Loop Unrolling in Superscalar
Integer instruction       FP instruction      Clock cycle
Loop: LD   F0,0(R1)                            1
      LD   F6,-8(R1)                           2
      LD   F10,-16(R1)    ADDD F4,F0,F2        3
      LD   F14,-24(R1)    ADDD F8,F6,F2        4
      LD   F18,-32(R1)    ADDD F12,F10,F2      5
      SD   0(R1),F4       ADDD F16,F14,F2      6
      SD   -8(R1),F8      ADDD F20,F18,F2      7
      SD   -16(R1),F12                         8
      SD   -24(R1),F16                         9
      SUBI R1,R1,40                           10
      BNEZ R1,LOOP                            11
      SD   -32(R1),F20                        12
- Unrolled 5 times to avoid delays (+1 due to SS)
- 12 clocks, or 2.4 clocks per iteration
65Software Pipelining
- Observation: if iterations from loops are independent, then can get ILP by taking instructions from different iterations
- Software pipelining: reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop (~ Tomasulo in SW)
66Software Pipelining Example
Before: Unrolled 3 times
 1 LD   F0,0(R1)
 2 ADDD F4,F0,F2
 3 SD   0(R1),F4
 4 LD   F6,-8(R1)
 5 ADDD F8,F6,F2
 6 SD   -8(R1),F8
 7 LD   F10,-16(R1)
 8 ADDD F12,F10,F2
 9 SD   -16(R1),F12
10 SUBI R1,R1,24
11 BNEZ R1,LOOP
After: Software Pipelined
 1 SD   0(R1),F4     ; Stores M[i]
 2 ADDD F4,F0,F2     ; Adds to M[i-1]
 3 LD   F0,-16(R1)   ; Loads M[i-2]
 4 SUBI R1,R1,8
 5 BNEZ R1,LOOP
- Symbolic Loop Unrolling
  - Less code space
  - Fill & drain pipe only once vs. each iteration in loop unrolling
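The transformation can be mimicked in Python to see why the reordered loop computes the same result. This is a sketch only: it models the load, add and store as separate statements on a list, walks the array with ascending indices rather than the descending addresses used in the assembly, and the function names are hypothetical.

```python
# Software pipelining on "add a scalar to every element".
# Each steady-state iteration stores element i, adds for element i+1,
# and loads element i+2, so the three steps come from three different
# iterations of the original loop.

def add_scalar_plain(m, c):
    out = list(m)
    for i in range(len(out)):
        f0 = out[i]        # LD
        f4 = f0 + c        # ADDD
        out[i] = f4        # SD
    return out

def add_scalar_sw_pipelined(m, c):
    out = list(m)
    n = len(out)
    if n < 3:
        return add_scalar_plain(m, c)   # too short to pipeline
    # Prologue (fill the pipe): load and add element 0, load element 1.
    f0 = out[0]
    f4 = f0 + c
    f0 = out[1]
    # Steady state: one store, one add, one load per iteration.
    for i in range(n - 2):
        out[i] = f4            # SD: store result for element i
        f4 = f0 + c            # ADDD: add for element i+1
        f0 = out[i + 2]        # LD: load element i+2
    # Epilogue (drain the pipe): finish the last two elements.
    out[n - 2] = f4
    out[n - 1] = f0 + c
    return out

if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(add_scalar_plain(data, 10.0))
    print(add_scalar_sw_pipelined(data, 10.0))
```

The prologue and epilogue are the "fill" and "drain" code mentioned on the slide: they run once, while the loop body stays as small as the original.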
67Limits of Superscalar
- While Integer/FP split is simple for the HW, get CPI of 0.5 only for programs with:
  - Exactly 50% FP operations
  - No hazards
- If more instructions issue at same time, greater difficulty of decode and issue
  - Even 2-scalar => examine 2 opcodes, 6 register specifiers, & decide if 1 or 2 instructions can issue
- VLIW: tradeoff instruction space for simple decoding
  - The long instruction word has room for many operations
  - By definition, all the operations the compiler puts in the long instruction word can execute in parallel
  - E.g., 2 integer operations, 2 FP ops, 2 Memory refs, 1 branch
    - 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide
  - Need compiling technique that schedules across several branches
68Loop Unrolling in VLIW
Memory reference 1 | Memory reference 2 | FP operation 1  | FP op. 2        | Int. op/branch | Clock
LD F0,0(R1)        | LD F6,-8(R1)       |                 |                 |                | 1
LD F10,-16(R1)     | LD F14,-24(R1)     |                 |                 |                | 2
LD F18,-32(R1)     | LD F22,-40(R1)     | ADDD F4,F0,F2   | ADDD F8,F6,F2   |                | 3
LD F26,-48(R1)     |                    | ADDD F12,F10,F2 | ADDD F16,F14,F2 |                | 4
                   |                    | ADDD F20,F18,F2 | ADDD F24,F22,F2 |                | 5
SD 0(R1),F4        | SD -8(R1),F8       | ADDD F28,F26,F2 |                 |                | 6
SD -16(R1),F12     | SD -24(R1),F16     |                 |                 |                | 7
SD -32(R1),F20     | SD -40(R1),F24     |                 |                 | SUBI R1,R1,48  | 8
SD -0(R1),F28      |                    |                 |                 | BNEZ R1,LOOP   | 9
- Unrolled 7 times to avoid delays
- 7 results in 9 clocks, or 1.3 clocks per iteration
- Need more registers in VLIW (EPIC => 128 int + 128 FP)
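The per-iteration figures quoted for the three schedules (scalar, 2-issue superscalar, VLIW) come from dividing the clock count by the number of original loop iterations covered. A small sketch that recomputes them from the counts given on the slides, with an illustrative dictionary layout:

```python
# Clocks per original loop iteration for the three unrolled schedules.
# (clocks, iterations) pairs are taken from the slides; labels are illustrative.

SCHEDULES = {
    "scalar, unrolled 4 times": (14, 4),
    "2-issue superscalar, unrolled 5 times": (12, 5),
    "VLIW, unrolled 7 times": (9, 7),
}

def clocks_per_iteration(clocks, iterations):
    return clocks / iterations

if __name__ == "__main__":
    for name, (clocks, iterations) in SCHEDULES.items():
        cpi = clocks_per_iteration(clocks, iterations)
        print(f"{name}: {cpi:.1f} clocks per iteration")
```

Wider issue pays off here only because the loop is unrolled further each time, which is also why the register demand grows from one schedule to the next.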
69Trace Scheduling
- Parallelism across IF branches vs. LOOP branches
- Two steps:
  - Trace Selection
    - Find likely sequence of basic blocks (trace) of (statically predicted) long sequence of straight-line code
  - Trace Compaction
    - Squeeze trace into few VLIW instructions
    - Need bookkeeping code in case prediction is wrong
70HW Schemes Instruction Parallelism
- Why in HW at run time?
  - Works when can't know real dependence at compile time
  - Compiler simpler
  - Code for one machine runs well on another
- Key idea: Allow instructions behind stall to proceed
  - DIVD F0,F2,F4
  - ADDD F10,F0,F8
  - SUBD F12,F8,F14
  - Enables out-of-order execution => out-of-order completion
  - ID stage checked both for structural
71HW Schemes Instruction Parallelism
- Out-of-order execution divides ID stage:
  - 1. Issue: decode instructions, check for structural hazards
  - 2. Read operands: wait until no data hazards, then read operands
- Scoreboards allow instruction to execute whenever 1 & 2 hold, not waiting for prior instructions
- CDC 6600: In order issue, out of order execution, out of order commit (also called completion)
72Scoreboard Implications
- Out-of-order completion => WAR, WAW hazards?
- Solutions for WAR
  - Queue both the operation and copies of its operands
  - Read registers only during Read Operands stage
- For WAW, must detect hazard: stall until other completes
- Need to have multiple instructions in execution phase => multiple execution units or pipelined execution units
- Scoreboard keeps track of dependencies, state or operations
- Scoreboard replaces ID, EX, WB with 4 stages
73Performance of Dynamic SS
Iteration no.  Instructions      Issues  Executes  Writes result
                                 (clock-cycle number)
1              LD   F0,0(R1)     1       2         4
1              ADDD F4,F0,F2     1       5         8
1              SD   0(R1),F4     2       9
1              SUBI R1,R1,8      3       4         5
1              BNEZ R1,LOOP      4       5
2              LD   F0,0(R1)     5       6         8
2              ADDD F4,F0,F2     5       9         12
2              SD   0(R1),F4     6       13
2              SUBI R1,R1,8      7       8         9
2              BNEZ R1,LOOP      8       9
- 4 clocks per iteration
- Branches, Decrements still take 1 clock cycle
74Dynamic Branch Prediction
- Solution: 2-bit scheme where change prediction only if get misprediction twice
T
NT
Predict Taken
Predict Taken
T
T
NT
NT
Predict Not Taken
Predict Not Taken
T
NT
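A 2-bit scheme is often realized as a saturating counter, sketched below. This is one common variant and an assumption here: the slide's own state diagram may wire its two middle states differently, so treat the transitions as illustrative rather than as the lecture's exact machine. Class and function names are hypothetical.

```python
# 2-bit saturating-counter branch predictor (one common 2-bit scheme).
# States 0 and 1 predict not taken; states 2 and 3 predict taken.
# From a strong state (0 or 3) it takes two mispredictions in a row
# to change the prediction.

class TwoBitPredictor:
    def __init__(self, state=0):
        self.state = state

    def predict(self):
        return self.state >= 2          # True means "predict taken"

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

def count_mispredictions(outcomes, predictor):
    wrong = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            wrong += 1
        predictor.update(taken)
    return wrong

if __name__ == "__main__":
    # A loop branch: taken 9 times, then falls through, executed twice.
    outcomes = ([True] * 9 + [False]) * 2
    print(count_mispredictions(outcomes, TwoBitPredictor(state=3)))
```

On the loop-branch pattern in the example, the single not-taken outcome at each loop exit only weakens the prediction, so the first branch of the next loop execution is still predicted taken.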
75BHT Accuracy
- Mispredict because either:
  - Wrong guess for that branch
  - Got branch history of wrong branch when indexing the table
- 4096 entry table: programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%
- 4096 about as good as infinite table, but 4096 is a lot of HW
76Need Address at Same Time as Prediction
- Branch Target Buffer (BTB): Address of branch index to get prediction AND branch address (if taken)
  - Note: must check for branch match now, since can't use wrong branch address
- Return instruction addresses predicted with stack
77HW support for More ILP
- Avoid branch prediction by turning branches into conditionally executed instructions:
  - if (x) then A = B op C else NOP
  - If false, then neither store result nor cause exception
  - Expanded ISA of Alpha, MIPS, PowerPC, SPARC have conditional move; PA-RISC can annul any following instr.
  - EPIC: 64 1-bit condition fields selected so conditional execution
- Drawbacks to conditional instructions
  - Still takes a clock even if "annulled"
  - Stall if condition evaluated late
  - Complex conditions reduce effectiveness; condition becomes known late in pipeline
78HW support for More ILP
- Speculation: allow an instruction without any consequences (including exceptions) if branch is not actually taken ("HW undo")
- Often try to combine with dynamic scheduling
- Separate speculative bypassing of results from real bypassing of results
  - When instruction no longer speculative, write results (instruction commit)
  - execute out-of-order but commit in order
79HW support for More ILP
- Need HW buffer for results of uncommitted instructions: reorder buffer
  - Reorder buffer can be operand source
  - Once operand commits, result is found in register
  - 3 fields: instr. type, destination, value
- Use reorder buffer number instead of reservation
station - Instructionsd instructions on mispredicted
branches or on exceptions
Reorder Buffer
FP Op Queue
FP Regs
Res Stations
Res Stations
FP Adder
FP Adder
80Limits to Multi-Issue Machines
- Inherent limitations of ILP
  - 1 branch in 5: How to keep a 5-way VLIW busy?
  - Latencies of units: many operations must be scheduled
  - Need about Pipeline Depth x No. Functional Units of independent operations
- Difficulties in building HW
  - Duplicate FUs to get parallel execution
  - Increase ports to Register File
    - VLIW example needs 7 read and 3 write for Int. Reg. & 5 read and 3 write for FP reg
  - Increase ports to memory
  - Decoding SS and impact on clock rate, pipeline depth
81Limits to Multi-Issue Machines
- Limitations specific to either SS or VLIW implementation
  - Decode issue in SS
  - VLIW code size: unroll loops + wasted fields in VLIW
  - VLIW lock step => 1 hazard & all instructions stall
  - VLIW & binary compatibility
823 Recent Machines
                Alpha 21164      Pentium II       HP PA-8000
Year            1995             1996             1996
Clock           600 MHz ('97)    300 MHz ('97)    236 MHz ('97)
Cache           8K/8K/96K/2M     16K/16K/0.5M     0/0/4M
Issue rate      2 int + 2 FP     3 instr (x86)    4 instr
Pipe stages     7-9              12-14            7-9
Out-of-Order    6 loads          40 instr (µop)   56 instr
Rename regs     none             40               56
83SPECint95base Performance (Oct. 1997)
84SPECfp95base Performance (Oct. 1997)
85Summary Pipelining
- What makes it easy
  - all instructions are the same length
  - just a few instruction formats
  - memory operands appear only in loads and stores
- What makes it hard?
  - structural hazards: suppose we had only one memory
  - control hazards: need to worry about branch instructions
  - data hazards: an instruction depends on a previous instruction
86Summary
- Pipelines pass control information down the pipe just as data moves down pipe
- Forwarding/Stalls handled by local control
- Exceptions stop the pipeline
- More performance from deeper pipelines, parallelism
87Summary
- Superscalar and VLIW
  - CPI < 1
  - Dynamic issue vs. Static issue
  - More instructions issue at same time, larger the penalty of hazards
- SW Pipelining
  - Symbolic Loop Unrolling to get most from pipeline with little code expansion, little overhead