Chap' 6: Pipelining - PowerPoint PPT Presentation

1 / 87
About This Presentation
Title:

Chap' 6: Pipelining

Description:

what is the ALU doing during cycle 4? use this representation to help understand datapaths ... Single Cycle Implementation: Load. Store. Waste. Ifetch. R-type ... – PowerPoint PPT presentation

Number of Views:39
Avg rating:3.0/5.0
Slides: 88
Provided by: DavidEC150
Category:
Tags: chap | cycle | pipelining

less

Transcript and Presenter's Notes

Title: Chap' 6: Pipelining


1
Chap. 6 Pipelining
  • Joonwon Lee
  • lecture slides http//www-inst.eecs.berkeley.edu/
    cs152/

2
The Five Stages of Load
Cycle 1
Cycle 2
Cycle 3
Cycle 4
Cycle 5
Load
  • Ifetch Instruction Fetch
  • Fetch the instruction from the Instruction Memory
  • Reg/Dec Registers Fetch and Instruction Decode
  • Exec Calculate the memory address
  • Mem Read the data from the Data Memory
  • Wr Write the data back to the register file

3
Pipelining
  • Improve perfomance by increasing instruction
    throughput
  • Ideal speedup is number of stages in
    the pipeline. Do we achieve this?

4
Basic Idea
  • What do we need to add to actually split the
    datapath into stages?

5
Graphically Representing Pipelines
  • Can help with answering questions like
  • how many cycles does it take to execute this
    code?
  • what is the ALU doing during cycle 4?
  • use this representation to help understand
    datapaths

6
Conventional Pipelined Execution Representation
Time
Program Flow
7
Single Cycle, Multiple Cycle, vs. Pipeline
Cycle 1
Cycle 2
Clk
Single Cycle Implementation
Load
Store
Waste
Cycle 1
Cycle 2
Cycle 3
Cycle 4
Cycle 5
Cycle 6
Cycle 7
Cycle 8
Cycle 9
Cycle 10
Clk
Multiple Cycle Implementation
Load
Store
R-type
Pipeline Implementation
Load
Store
R-type
8
Why Pipeline?
  • Suppose we execute 100 instructions
  • Single Cycle Machine
  • 45 ns/cycle x 1 CPI x 100 inst 4500 ns
  • Multicycle Machine
  • 10 ns/cycle x 4.6 CPI (due to inst mix) x 100
    inst 4600 ns
  • Ideal pipelined machine
  • 10 ns/cycle x (1 CPI x 100 inst 4 cycle drain)
    1040 ns

9
Why Pipeline? Because the resources are there!
Time (clock cycles)
I n s t r. O r d e r
Inst 0
Inst 1
Inst 2
Inst 3
Inst 4
10
Can pipelining get us into trouble?
  • Yes Pipeline Hazards
  • structural hazards attempt to use the same
    resource two different ways at the same time
  • E.g., combined washer/dryer would be a structural
    hazard or folder busy doing something else
    (watching TV)
  • data hazards attempt to use item before it is
    ready
  • E.g., one sock of pair in dryer and one in
    washer cant fold until get sock from washer
    through dryer
  • instruction depends on result of prior
    instruction still in the pipeline
  • control hazards attempt to make a decision
    before condition is evaulated
  • E.g., washing football uniforms and need to get
    proper detergent level need to see after dryer
    before next load in
  • branch instructions
  • Can always resolve hazards by waiting
  • pipeline control must detect the hazard
  • take action (or delay action) to resolve hazards

11
Single Memory is a Structural Hazard
Time (clock cycles)
I n s t r. O r d e r
Mem
Reg
Reg
Load
Instr 1
Instr 2
Mem
Mem
Reg
Reg
Instr 3
Instr 4
Detection is easy in this case! (right half
highlight means read, left half write)
12
Control Hazard Solutions
  • Stall wait until decision is clear
  • Its possible to move up decision to 2nd stage by
    adding hardware to check registers as being read
  • Impact 2 clock cycles per branch instruction gt
    slow

I n s t r. O r d e r
Time (clock cycles)
Mem
Reg
Reg
Add
Mem
Reg
Reg
Beq
Load
Mem
Reg
Reg
13
Control Hazard Solutions
  • Predict guess one direction then back up if
    wrong
  • Predict not taken
  • Impact 1 clock cycles per branch instruction if
    right, 2 if wrong (right 50 of time)
  • More dynamic scheme history of 1 branch ( 90)

I n s t r. O r d e r
Time (clock cycles)
Mem
Reg
Reg
Add
Mem
Reg
Reg
Beq
Load
Mem
Mem
Reg
Reg
14
Control Hazard Solutions
  • Redefine branch behavior (takes place after next
    instruction) delayed branch
  • Impact 0 clock cycles per branch instruction if
    can find instruction to put in slot ( 50 of
    time)
  • As launch more instruction per clock cycle, less
    useful

I n s t r. O r d e r
Time (clock cycles)
Mem
Reg
Reg
Add
Mem
Reg
Reg
Beq
Misc
Mem
Mem
Reg
Reg
Load
Mem
Mem
Reg
Reg
15
Data Hazard on r1
add r1 ,r2,r3
sub r4, r1 ,r3
and r6, r1 ,r7
or r8, r1 ,r9
xor r10, r1 ,r11
16
Data Hazard on r1
  • Dependencies backwards in time are hazards

Time (clock cycles)
IF
ID/RF
EX
MEM
WB
add r1,r2,r3
Reg
Reg
ALU
Im
Dm
I n s t r. O r d e r
sub r4,r1,r3
Dm
Reg
Reg
Dm
Reg
and r6,r1,r7
Reg
Im
Dm
Reg
Reg
or r8,r1,r9
ALU
xor r10,r1,r11
17
Data Hazard Solution
  • Forward result from one stage to another
  • or OK if define read/write properly

Time (clock cycles)
IF
ID/RF
EX
MEM
WB
add r1,r2,r3
Reg
Reg
ALU
Im
Dm
I n s t r. O r d e r
sub r4,r1,r3
Dm
Reg
Reg
Dm
Reg
and r6,r1,r7
Reg
Im
Dm
Reg
Reg
or r8,r1,r9
ALU
xor r10,r1,r11
18
Forwarding (or Bypassing) What about Loads
  • Dependencies backwards in time are
    hazards
  • Cant solve with forwarding
  • Must delay/stall instruction dependent on loads

Time (clock cycles)
IF
ID/RF
EX
MEM
WB
lw r1,0(r2)
Reg
Reg
ALU
Im
Dm
sub r4,r1,r3
Dm
Reg
Reg
19
Designing a Pipelined Processor
  • Go back and examine your datapath and control
    diagram
  • associated resources with states
  • ensure that flows do not conflict, or figure out
    how to resolve
  • assert control in appropriate stage

20
Pipelined Processor (almost) for slides
  • What happens if we start a new instruction every
    cycle?

Valid
IRex
IR
IRwb
Inst. Mem
IRmem
WB Ctrl
Dcd Ctrl
Ex Ctrl
Mem Ctrl
Equal
Reg. File
Reg File
Exec
PC
Next PC
Mem Access
Data Mem
21
Control and Datapath
IR lt- MemPC PC lt PC4
A lt- Rrs Blt Rrt
S lt A B
S lt A SX
S lt A or ZX
S lt A SX
If Cond PC lt PCSX
M lt MemS
MemS lt- B
Rrd lt S
Rrd lt M
Rrt lt S
Equal
Reg. File
Reg File
Exec
PC
IR
Next PC
Inst. Mem
Mem Access
Data Mem
22
Pipelining the Load Instruction
Cycle 1
Cycle 2
Cycle 3
Cycle 4
Cycle 5
Cycle 6
Cycle 7
Clock
2nd lw
3rd lw
  • The five independent functional units in the
    pipeline datapath are
  • Instruction Memory for the Ifetch stage
  • Register Files Read ports (bus A and busB) for
    the Reg/Dec stage
  • ALU for the Exec stage
  • Data Memory for the Mem stage
  • Register Files Write port (bus W) for the Wr
    stage

23
The Four Stages of R-type
Cycle 1
Cycle 2
Cycle 3
Cycle 4
R-type
  • Ifetch Instruction Fetch
  • Fetch the instruction from the Instruction Memory
  • Reg/Dec Registers Fetch and Instruction Decode
  • Exec
  • ALU operates on the two register operands
  • Update PC
  • Wr Write the ALU output back to the register file

24
Pipelining the R-type and Load Instruction
Cycle 1
Cycle 2
Cycle 3
Cycle 4
Cycle 5
Cycle 6
Cycle 7
Cycle 8
Cycle 9
Clock
Ops! We have a problem!
R-type
R-type
Load
R-type
R-type
  • We have pipeline conflict or structural hazard
  • Two instructions try to write to the register
    file at the same time!
  • Only one write port

25
Important Observation
  • Each functional unit can only be used once per
    instruction
  • Each functional unit must be used at the same
    stage for all instructions
  • Load uses Register Files Write Port during its
    5th stage
  • R-type uses Register Files Write Port during its
    4th stage
  • 2 ways to solve this pipeline hazard.

26
Solution 1 Insert Bubble into the Pipeline
Cycle 1
Cycle 2
Cycle 3
Cycle 4
Cycle 5
Cycle 6
Cycle 7
Cycle 8
Cycle 9
Clock
Load
R-type
Pipeline
R-type
R-type
Bubble
  • Insert a bubble into the pipeline to prevent 2
    writes at the same cycle
  • The control logic can be complex.
  • Lose instruction fetch and issue opportunity.
  • No instruction is started in Cycle 6!

27
Solution 2 Delay R-types Write by One Cycle
  • Delay R-types register write by one cycle
  • Now R-type instructions also use Reg Files write
    port at Stage 5
  • Mem stage is a NOOP stage nothing is being done.

4
1
2
3
5
Mem
R-type
Cycle 1
Cycle 2
Cycle 3
Cycle 4
Cycle 5
Cycle 6
Cycle 7
Cycle 8
Cycle 9
Clock
R-type
R-type
Load
R-type
R-type
28
Modified Control Datapath
IR lt- MemPC PC lt PC4
A lt- Rrs Blt Rrt
S lt A B
S lt A SX
S lt A or ZX
S lt A SX
if Cond PC lt PCSX
M lt MemS
MemS lt- B
M lt S
M lt S
Rrd lt M
Rrd lt M
Rrt lt M
Equal
Reg. File
Reg File
S
Exec
PC
IR
Next PC
Inst. Mem
Mem Access
Data Mem
29
The Four Stages of Store
Cycle 1
Cycle 2
Cycle 3
Cycle 4
Store
Wr
  • Ifetch Instruction Fetch
  • Fetch the instruction from the Instruction Memory
  • Reg/Dec Registers Fetch and Instruction Decode
  • Exec Calculate the memory address
  • Mem Write the data into the Data Memory

30
The Three Stages of Beq
Cycle 1
Cycle 2
Cycle 3
Cycle 4
Beq
Wr
  • Ifetch Instruction Fetch
  • Fetch the instruction from the Instruction Memory
  • Reg/Dec
  • Registers Fetch and Instruction Decode
  • Exec
  • compares the two register operand,
  • select correct branch target address
  • latch into PC

31
Control Diagram
IR lt- MemPC PC lt PC4
A lt- Rrs Blt Rrt
S lt A B
S lt A SX
S lt A or ZX
S lt A SX
If Cond PC lt PCSX
M lt MemS
MemS lt- B
M lt S
M lt S
Rrd lt S
Rrd lt M
Rrt lt S
Equal
Reg. File
Reg File
Exec
PC
IR
Next PC
Inst. Mem
Mem Access
Data Mem
32
Datapath Data Stationary Control
IR
v
v
v
fun
rw
rw
rw
wb
wb
wb
Inst. Mem
Decode
WB Ctrl
me
me
rt
Mem Ctrl
rs
ex
op
im
rs
rt
Reg. File
Reg File
Exec
Mem Access
Data Mem
Next PC
33
Lets Try it Out
10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3,
r4, r5 24 beq r6, r7, 100 30 ori r8, r9,
17 34 add r10, r11, r12 100 and r13, r14, 15
these addresses are octal
34
Start Fetch 10
Inst. Mem
Decode
WB Ctrl
Mem Ctrl
IR
im
rs
rt
Reg. File
Reg File
Exec
Mem Access
Data Mem
10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3,
r4, r5 24 beq r6, r7, 100 30 ori r8, r9,
17 34 add r10, r11, r12 100 and r13, r14, 15
Next PC
10
PC
35
Fetch 14, Decode 10
lw r1, r2(35)
Inst. Mem
Decode
WB Ctrl
Mem Ctrl
IR
im
2
rt
Reg. File
Reg File
Exec
Mem Access
Data Mem
10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3,
r4, r5 24 beq r6, r7, 100 30 ori r8, r9,
17 34 add r10, r11, r12 100 and r13, r14, 15
Next PC
14
PC
36
Fetch 20, Decode 14, Exec 10
addI r2, r2, 3
Inst. Mem
Decode
WB Ctrl
lw r1
Mem Ctrl
IR
35
2
rt
Reg. File
Reg File
r2
Exec
Mem Access
Data Mem
10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3,
r4, r5 24 beq r6, r7, 100 30 ori r8, r9,
17 34 add r10, r11, r12 100 and r13, r14, 15
Next PC
20
PC
37
Fetch 24, Decode 20, Exec 14, Mem 10
sub r3, r4, r5
addI r2, r2, 3
Inst. Mem
Decode
WB Ctrl
lw r1
Mem Ctrl
IR
3
4
5
Reg. File
Reg File
r2
r235
Exec
Mem Access
Data Mem
10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3,
r4, r5 24 beq r6, r7, 100 30 ori r8, r9,
17 34 add r10, r11, r12 100 and r13, r14, 15
Next PC
24
PC
38
Fetch 30, Dcd 24, Ex 20, Mem 14, WB 10
beq r6, r7 100
Inst. Mem
Decode
WB Ctrl
addI r2
lw r1
sub r3
Mem Ctrl
IR
6
7
Reg. File
Reg File
r4
Mr235
r23
Exec
Mem Access
Data Mem
10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3,
r4, r5 24 beq r6, r7, 100 30 ori r8, r9,
17 34 add r10, r11, r12 100 and r13, r14, 15
Next PC
30
PC
39
Fetch 34, Dcd 30, Ex 24, Mem 20, WB 14
ori r8, r9 17
Inst. Mem
Decode
WB Ctrl
addI r2
sub r3
Mem Ctrl
beq
IR
9
xx
100
r1Mr235
Reg. File
Reg File
r6
r23
r4-r5
Exec
Mem Access
Data Mem
10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3,
r4, r5 24 beq r6, r7, 100 30 ori r8, r9,
17 34 add r10, r11, r12 100 and r13, r14, 15
Next PC
34
PC
40
Fetch 100, Dcd 34, Ex 30, Mem 24, WB 20
Inst. Mem
Decode
ori r8
WB Ctrl
sub r3
beq
add r10, r11, r12
Mem Ctrl
11
12
17
Reg. File
r1Mr235
IR
Reg File
r9
r4-r5
r2 r23
xxx
Exec
Mem Access
Data Mem
10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3,
r4, r5 24 beq r6, r7, 100 30 ori r8, r9,
17 34 add r10, r11, r12 100 and r13, r14, 15
Next PC
100
PC
ooops, we should have only one delayed instruction
41
Fetch 104, Dcd 100, Ex 34, Mem 30, WB 24
n
Inst. Mem
Decode
add r10
WB Ctrl
beq
ori r8
Mem Ctrl
and r13, r14, r15
14
15
xx
Reg. File
r1Mr235
IR
Reg File
r11
xxx
r9 17
r2 r23
Exec
r3 r4-r5
Mem Access
Data Mem
10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3,
r4, r5 24 beq r6, r7, 100 30 ori r8, r9,
17 34 add r10, r11, r12 100 and r13, r14, 15
Next PC
104
PC
Squash the extra instruction in the branch shadow!
42
Fetch 108, Dcd 104, Ex 100, Mem 34, WB 30
n
Inst. Mem
Decode
ori r8
add r10
WB Ctrl
and r13
Mem Ctrl
xx
Reg. File
r1Mr235
IR
Reg File
r14
r9 17
r2 r23
r11r12
Exec
r3 r4-r5
Mem Access
Data Mem
10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3,
r4, r5 24 beq r6, r7, 100 30 ori r8, r9,
17 34 add r10, r11, r12 100 and r13, r14, 15
Next PC
110
PC
Squash the extra instruction in the branch shadow!
43
Fetch 114, Dcd 110, Ex 104, Mem 100, WB 34
n
NO WB NO Ovflow
and r13
Inst. Mem
Decode
add r10
WB Ctrl
Mem Ctrl
Reg. File
r1Mr235
IR
Reg File
r11r12
r2 r23
r14 R15
Exec
r3 r4-r5
r8 r9 17
Mem Access
Data Mem
10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3,
r4, r5 24 beq r6, r7, 100 30 ori r8, r9,
17 34 add r10, r11, r12 100 and r13, r14, 15
Next PC
114
PC
Squash the extra instruction in the branch shadow!
44
Pipeline Hazards Again
I-Fet ch DCD MemOpFetch OpFetch
Exec Store
IFetch DCD
Structural Hazard
I-Fet ch DCD OpFetch Jump
Control Hazard
IFetch DCD
IF DCD EX Mem WB
RAW (read after write) Data Hazard
IF DCD EX Mem
WB
WAW Data Hazard (write after write)
IF DCD EX Mem WB
IF DCD
OF Ex Mem
IF DCD OF Ex RS
WAR Data Hazard (write after read)
45
Data Hazards
  • Avoid some by design
  • eliminate WAR by always fetching operands early
    (DCD) in pipe
  • eleminate WAW by doing all WBs in order (last
    stage, static)
  • Detect and resolve remaining ones
  • stall or forward (if possible)

46
Hazard Detection
  • Suppose instruction i is about to be issued and
    a predecessor instruction j is in the
    instruction pipeline.
  • A RAW hazard exists on register ??if ????Rregs( i
    ) ??Wregs( j )
  • Keep a record of pending writes (for inst's in
    the pipe) and compare with operand regs of
    current instruction.
  • When instruction issues, reserve its result
    register.
  • When on operation completes, remove its write
    reservation.
  • A WAW hazard exists on register ??if ????Wregs( i
    ) ??Wregs( j )
  • A WAR hazard exists on register ??if ????Wregs( i
    ) ??Rregs( j )

47
Record of Pending Writes
IAU
npc
  • Current operand registers
  • Pending writes
  • hazard lt
  • ((rs rwex) regWex) OR
  • ((rs rwmem) regWme) OR
  • ((rs rwwb) regWwb) OR
  • ((rt rwex) regWex) OR
  • ((rt rwmem) regWme) OR
  • ((rt rwwb) regWwb)

I mem
Regs
op rw rs rt
PC
im
op
rw
n
B
A
alu
op
rw
n
S
D mem
m
op
rw
n
Regs
48
Resolve RAW by forwarding
IAU
  • Detect nearest valid write op operand register
    and forward into op latches, bypassing remainder
    of the pipe
  • Increase muxes to add paths from pipeline
    registers
  • Data Forwarding Data Bypassing

npc
I mem
Regs
op rw rs rt
PC
Forward mux
im
op
rw
n
B
A
alu
op
rw
n
S
D mem
m
op
rw
n
Regs
49
What about memory operations?
If instructions are initiated in order and
operations always occur in the same stage,
there can be no hazards between memory
operations! What does delaying WB on
arithmetic operations cost? cycles ?
hardware ? What about data dependence on
loads? R1 lt- R4 R5 R2 lt- Mem R2 I
R3 lt- R2 R1 gt
op Rd Ra Rb
op Rd Ra Rb
A
B
Rd
R
"Delayed Loads"
T
Rd
to reg file
50
Compiler Avoiding Load Stalls
51
What about Interrupts, Traps, Faults?
  • External Interrupts
  • Allow pipeline to drain,
  • Load PC with interupt address
  • Faults (within instruction, restartable)
  • Force trap instruction into IF
  • disable writes till trap hits WB
  • must save multiple PCs or PC state

Refer to MIPS solution
52
Exception Handling
IAU
npc
detect bad instruction address
I mem
Regs
lw 2,20(5)
PC
detect bad instruction
im
op
rw
n
B
A
detect overflow
alu
S
detect bad data address
D mem
m
Allow exception to take effect
Regs
53
Exception Problem
  • Exceptions/Interrupts 5 instructions executing
    in 5 stage pipeline
  • How to stop the pipeline?
  • Restart?
  • Who caused the interrupt?
  • Stage Problem interrupts occurring
  • IF Page fault on instruction fetch misaligned
    memory access memory-protection violation
  • ID Undefined or illegal opcode
  • EX Arithmetic exception
  • MEM Page fault on data fetch misaligned memory
    access memory-protection violation memory
    error
  • Load with data page fault, Add with instruction
    page fault?
  • Solution 1 interrupt vector/instruction 2
    interrupt ASAP, restart everything incomplete

54
Resolution Freeze above Bubble Below
IAU
npc
I mem
freeze
Regs
op rw rs rt
PC
bubble
im
op
rw
n
B
A
alu
op
rw
n
S
D mem
m
op
rw
n
Regs
55
Issues in Pipelined design
Limitation
IF
D
Ex
M
W
Pipelining
IF
D
Ex
M
W
Issue rate, FU stalls, FU depth
IF
D
Ex
M
W
IF
D
Ex
M
W
Super-pipeline
- Issue one instruction per (fast) cycle
- ALU takes multiple cycles
IF
D
Ex
M
W
IF
D
Ex
M
W
Clock skew, FU stalls, FU depth
IF
D
Ex
M
W
IF
D
Ex
M
W
Super-scalar
Hazard resolution
IF
D
Ex
M
W
- Issue multiple scalar
IF
D
Ex
M
W
IF
D
Ex
M
W
instructions per cycle
IF
D
Ex
M
W
VLIW
- Each instruction specifies
Packing
IF
D
Ex
M
W
multiple scalar operations - Compiler determines
parallelism
Ex
M
W
Ex
M
W
Ex
M
W
Vector operations
Applicability
IF
D
Ex
M
W
- Each instruction specifies
Ex
M
W
Ex
M
W
series of identical operations
Ex
M
W
56
Partitioned Instruction Issue (simple Superscalar)
independent int and FP issue to separate pipelines
I-Cache
Int Reg
Inst Issue and Bypass
FP Reg
Operand / Result Busses
Int Unit
Load / Store Unit
FP Add
FP Mul
D-Cache
Single Issue Total Time Int Time FP Time Max
Speedup Total Time
MAX(Int Time, FP Time)
57
Unrolling
58
Software Pipelining
59
Multiple Pipes/ Harder Superscalar
Issues Reg. File ports Detecting Data
Dependences Bypassing RAW Hazard WAR
Hazard Multiple load/store ops? Branches
IR0
IR1
Register File
A
B
R
D
T
60
Branch penalties in superscalar
Example resolved in op-fetch stage, single
exposed delay (ala MIPS, Sparc)
I-fetch
Branch
delay
Squash 2
I-fetch
Branch
Squash 1
delay
61
Getting CPI lt 1 Issuing Multiple
Instructions/Cycle
  • Two main variations Superscalar and VLIW
  • Superscalar varying no. instructions/cycle (1 to
    6)
  • Parallelism and dependencies determined/resolved
    by HW
  • IBM PowerPC 604, Sun UltraSparc, DEC Alpha 21164,
    HP 7100
  • Very Long Instruction Words (VLIW) fixed number
    of instructions (16) parallelism determined by
    compiler
  • Pipeline is exposed compiler must schedule
    delays to get right result
  • Explicit Parallel Instruction Computer (EPIC)/
    Intel
  • 128 bit packets containing 3 instructions (can
    execute sequentially)
  • Can link 128 bit packets together to allow more
    parallelism
  • Compiler determines parallelism, HW checks
    dependencies and fowards/stalls

62
Getting CPI lt 1 Issuing Multiple
Instructions/Cycle
  • Superscalar DLX 2 instructions, 1 FP 1
    anything else
  • Fetch 64-bits/clock cycle Int on left, FP on
    right
  • Can only issue 2nd instruction if 1st
    instruction issues
  • More ports for FP registers to do FP load FP
    op in a pair
  • Type Pipe Stages
  • Int. instruction IF ID EX MEM WB
  • FP instruction IF ID EX MEM WB
  • Int. instruction IF ID EX MEM WB
  • FP instruction IF ID EX MEM WB
  • Int. instruction IF ID EX MEM WB
  • FP instruction IF ID EX MEM WB
  • 1 cycle load delay expands to 3 instructions in
    SS
  • instruction in right half cant use it, nor
    instructions in next slot

63
Unrolled Loop that Minimizes Stalls for Scalar
1 Loop LD F0,0(R1) 2 LD F6,-8(R1) 3 LD F10,-16(R1
) 4 LD F14,-24(R1) 5 ADDD F4,F0,F2 6 ADDD F8,F6,F2
7 ADDD F12,F10,F2 8 ADDD F16,F14,F2 9 SD 0(R1),F4
10 SD -8(R1),F8 11 SD -16(R1),F12 12 SUBI R1,R1,
32 13 BNEZ R1,LOOP 14 SD 8(R1),F16 8-32
-24 14 clock cycles, or 3.5 per iteration
LD to ADDD 1 Cycle ADDD to SD 2 Cycles
64
Loop Unrolling in Superscalar
  • Integer instruction FP instruction Clock cycle
  • Loop LD F0,0(R1) 1
  • LD F6,-8(R1) 2
  • LD F10,-16(R1) ADDD F4,F0,F2 3
  • LD F14,-24(R1) ADDD F8,F6,F2 4
  • LD F18,-32(R1) ADDD F12,F10,F2 5
  • SD 0(R1),F4 ADDD F16,F14,F2 6
  • SD -8(R1),F8 ADDD F20,F18,F2 7
  • SD -16(R1),F12 8
  • SD -24(R1),F16 9
  • SUBI R1,R1,40 10
  • BNEZ R1,LOOP 11
  • SD -32(R1),F20 12
  • Unrolled 5 times to avoid delays (1 due to SS)
  • 12 clocks, or 2.4 clocks per iteration

65
Software Pipelining
  • Observation if iterations from loops are
    independent, then can get ILP by taking
    instructions from different iterations
  • Software pipelining reorganizes loops so that
    each iteration is made from instructions chosen
    from different iterations of the original loop (
    Tomasulo in SW)

66
Software Pipelining Example
  • Before Unrolled 3 times
  • 1 LD F0,0(R1)
  • 2 ADDD F4,F0,F2
  • 3 SD 0(R1),F4
  • 4 LD F6,-8(R1)
  • 5 ADDD F8,F6,F2
  • 6 SD -8(R1),F8
  • 7 LD F10,-16(R1)
  • 8 ADDD F12,F10,F2
  • 9 SD -16(R1),F12
  • 10 SUBI R1,R1,24
  • 11 BNEZ R1,LOOP

After Software Pipelined 1 SD 0(R1),F4 Stores
Mi 2 ADDD F4,F0,F2 Adds to Mi-1
3 LD F0,-16(R1) Loads Mi-2 4 SUBI R1,R1,8
5 BNEZ R1,LOOP
  • Symbolic Loop Unrolling
  • Less code space
  • Fill drain pipe only once vs. each
    iteration in loop unrolling

67
Limits of Superscalar
  • While Integer/FP split is simple for the HW, get
    CPI of 0.5 only for programs with
  • Exactly 50 FP operations
  • No hazards
  • If more instructions issue at same time, greater
    difficulty of decode and issue
  • Even 2-scalar gt examine 2 opcodes, 6 register
    specifiers, decide if 1 or 2 instructions can
    issue
  • VLIW tradeoff instruction space for simple
    decoding
  • The long instruction word has room for many
    operations
  • By definition, all the operations the compiler
    puts in the long instruction word can execute in
    parallel
  • E.g., 2 integer operations, 2 FP ops, 2 Memory
    refs, 1 branch
  • 16 to 24 bits per field gt 716 or 112 bits to
    724 or 168 bits wide
  • Need compiling technique that schedules across
    several branches

68
Loop Unrolling in VLIW
  • Memory Memory FP FP Int. op/ Clockreference
    1 reference 2 operation 1 op. 2 branch
  • LD F0,0(R1) LD F6,-8(R1) 1
  • LD F10,-16(R1) LD F14,-24(R1) 2
  • LD F18,-32(R1) LD F22,-40(R1) ADDD F4,F0,F2 ADDD
    F8,F6,F2 3
  • LD F26,-48(R1) ADDD F12,F10,F2 ADDD F16,F14,F2 4
  • ADDD F20,F18,F2 ADDD F24,F22,F2 5
  • SD 0(R1),F4 SD -8(R1),F8 ADDD F28,F26,F2 6
  • SD -16(R1),F12 SD -24(R1),F16 7
  • SD -32(R1),F20 SD -40(R1),F24 SUBI R1,R1,48 8
  • SD -0(R1),F28 BNEZ R1,LOOP 9
  • Unrolled 7 times to avoid delays
  • 7 results in 9 clocks, or 1.3 clocks per
    iteration
  • Need more registers in VLIW(EPIC gt 128int
    128FP)

69
Trace Scheduling
  • Parallelism across IF branches vs. LOOP branches
  • Two steps
  • Trace Selection
  • Find likely sequence of basic blocks (trace) of
    (statically predicted) long sequence of
    straight-line code
  • Trace Compaction
  • Squeeze trace into few VLIW instructions
  • Need bookkeeping code in case prediction is wrong

70
HW Schemes Instruction Parallelism
  • Why in HW at run time?
  • Works when cant know real dependence at compile
    time
  • Compiler simpler
  • Code for one machine runs well on another
  • Key idea Allow instructions behind stall to
    proceed
  • DIVD F0,F2,F4
  • ADDD F10,F0,F8
  • SUBD F12,F8,F14
  • Enables out-of-order execution gt out-of-order
    completion
  • ID stage checked both for structural

71
HW Schemes Instruction Parallelism
  • Out-of-order execution divides ID stage
  • 1. Issuedecode instructions, check for
    structural hazards
  • 2. Read operandswait until no data hazards, then
    read operands
  • Scoreboards allow instruction to execute whenever
    1 2 hold, not waiting for prior instructions
  • CDC 6600 In order issue, out of order execution,
    out of order commit ( also called completion)

72
Scoreboard Implications
  • Out-of-order completion gt WAR, WAW hazards?
  • Solutions for WAR
  • Queue both the operation and copies of its
    operands
  • Read registers only during Read Operands stage
  • For WAW, must detect hazard stall until other
    completes
  • Need to have multiple instructions in execution
    phase gt multiple execution units or pipelined
    execution units
  • Scoreboard keeps track of dependencies, state or
    operations
  • Scoreboard replaces ID, EX, WB with 4 stages

73
Performance of Dynamic SS
  • Iteration Instructions Issues Executes Writes
    result
  • no.
    clock-cycle number
  • 1 LD F0,0(R1) 1 2 4
  • 1 ADDD F4,F0,F2 1 5 8
  • 1 SD 0(R1),F4 2 9
  • 1 SUBI R1,R1,8 3 4 5
  • 1 BNEZ R1,LOOP 4 5
  • 2 LD F0,0(R1) 5 6 8
  • 2 ADDD F4,F0,F2 5 9 12
  • 2 SD 0(R1),F4 6 13
  • 2 SUBI R1,R1,8 7 8 9
  • 2 BNEZ R1,LOOP 8 9
  • 4 clocks per iteration
  • Branches, Decrements still take 1 clock cycle

74
Dynamic Branch Prediction
  • Solution 2-bit scheme where change prediction
    only if get misprediction twice





T
NT
Predict Taken
Predict Taken
T
T
NT
NT
Predict Not Taken
Predict Not Taken
T
NT
75
BHT Accuracy
  • Mispredict because either
  • Wrong guess for that branch
  • Got branch history of wrong branch when index the
    table
  • 4096 entry table programs vary from 1
    misprediction (nasa7, tomcatv) to 18 (eqntott),
    with spice at 9 and gcc at 12
  • 4096 about as good as infinite table, but 4096 is
    a lot of HW

76
Need Address _at_ Same Time as Prediction
  • Branch Target Buffer (BTB) Address of branch
    index to get prediction AND branch address (if
    taken)
  • Note must check for branch match now, since
    cant use wrong branch address
  • Return instruction addresses predicted with stack

77
HW support for More ILP
  • Avoid branch prediction by turning branches into
    conditionally executed instructions
  • if (x) then A B op C else NOP
  • If false, then neither store result nor cause
    exception
  • Expanded ISA of Alpha, MIPS, PowerPC, SPARC have
    conditional move PA-RISC can annul any following
    instr.
  • EPIC 64 1-bit condition fields selected so
    conditional execution
  • Drawbacks to conditional instructions
  • Still takes a clock even if annulled
  • Stall if condition evaluated late
  • Complex conditions reduce effectiveness
    condition becomes known late in pipeline

78
HW support for More ILP
  • Speculation allow an instructionwithout any
    consequences (including exceptions) if branch is
    not actually taken (HW undo)
  • Often try to combine with dynamic scheduling
  • Separate speculative bypassing of results from
    real bypassing of results
  • When instruction no longer speculative, write
    results (instruction commit)
  • execute out-of-order but commit in order

79
HW support for More ILP
  • Need HW buffer for results of uncommitted
    instructions reorder buffer
  • Reorder buffer can be operand source
  • Once operand commits, result is found in register
  • 3 fields instr. type, destination, value
  • Use reorder buffer number instead of reservation
    station
  • Instructionsd instructions on mispredicted
    branches or on exceptions

Reorder Buffer
FP Op Queue
FP Regs
Res Stations
Res Stations
FP Adder
FP Adder
80
Limits to Multi-Issue Machines
  • Inherent limitations of ILP
  • 1 branch in 5 How to keep a 5-way VLIW busy?
  • Latencies of units many operations must be
    scheduled
  • Need about Pipeline Depth x No. Functional Units
    of independentDifficulties in building HW
  • Duplicate FUs to get parallel execution
  • Increase ports to Register File
  • VLIW example needs 7 read and 3 write for Int.
    Reg. 5 read and 3 write for FP reg
  • Increase ports to memory
  • Decoding SS and impact on clock rate, pipeline
    depth

81
Limits to Multi-Issue Machines
  • Limitations specific to either SS or VLIW
    implementation
  • Decode issue in SS
  • VLIW code size unroll loops wasted fields in
    VLIW
  • VLIW lock step gt 1 hazard all instructions
    stall
  • VLIW binary compatibility

82
3 Recent Machines
  • Alpha 21164 Pentium II HP PA-8000
  • Year 1995 1996 1996
  • Clock 600 MHz (97) 300 MHz (97) 236 MHz (97)
  • Cache 8K/8K/96K/2M 16K/16K/0.5M 0/0/4M
  • Issue rate 2int2FP 3 instr (x86) 4 instr
  • Pipe stages 7-9 12-14 7-9
  • Out-of-Order 6 loads 40 instr (µop) 56 instr
  • Rename regs none 40 56

83
SPECint95base Performance (Oct. 1997)
84
SPECfp95base Performance (Oct. 1997)
85
Summary Pipelining
  • What makes it easy
  • all instructions are the same length
  • just a few instruction formats
  • memory operands appear only in loads and stores
  • What makes it hard?
  • structural hazards suppose we had only one
    memory
  • control hazards need to worry about branch
    instructions
  • data hazards an instruction depends on a
    previous instruction

86
Summary
  • Pipelines pass control information down the pipe
    just as data moves down pipe
  • Forwarding/Stalls handled by local control
  • Exceptions stop the pipeline
  • More performance from deeper pipelines,
    parallelism

87
Summary
  • Superscalar and VLIW
  • CPI lt 1
  • Dynamic issue vs. Static issue
  • More instructions issue at same time, larger the
    penalty of hazards
  • SW Pipelining
  • Symbolic Loop Unrolling to get most from pipeline
    with little code expansion, little overhead
Write a Comment
User Comments (0)
About PowerShow.com