Title: Lecture 2: Review of Instruction Set Design and Pipelining
1 Lecture 2: Review of Instruction Set Design and Pipelining
- Dr. Ben Juurlink
- Modern Computer Architectures
- Fall 2001
2 Computer Architecture Is
- "... the attributes of a computing system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation."
- Amdahl, Blauw, and Brooks, 1964
3 Instruction Set Architecture (ISA)
[Figure: layered diagram with software on top, the instruction set in the middle, and hardware below]
4 Interface Design
- A good interface
  - Lasts through many implementations (portability, compatibility)
  - Is used in many different ways (generality)
  - Provides convenient functionality to higher levels
  - Permits an efficient implementation at lower levels
[Figure: one interface over time, with several uses above it and successive implementations (imp 1, imp 2, imp 3) below it]
5 Evolution of Instruction Sets
- Single Accumulator (EDSAC, 1950)
- Accumulator + Index Registers (Manchester Mark I, IBM 700 series, 1953)
- Separation of Programming Model from Implementation
  - High-level Language Based (B5000, 1963)
  - Concept of a Family (IBM 360, 1964)
- General Purpose Register Machines
  - Complex Instruction Sets (Vax, Intel 432, 1977-80)
  - Load/Store Architecture (CDC 6600, Cray 1, 1963-76)
- RISC (Mips, Sparc, HP-PA, IBM RS6000, PowerPC, ... 1987)
- VLIW/EPIC? (IA-64, ... 1999)
6 Evolution of Instruction Sets
- Major advances in computer architecture are typically associated with landmark instruction set designs
  - Ex: Stack vs GPR (System 360)
- Design decisions must take into account
  - technology
  - machine organization
  - programming languages
  - compiler technology
  - operating systems
- And they in turn influence these
7 A "Typical" RISC
- 32-bit fixed format instruction (3 formats)
- 32 32-bit GPRs (R0 contains zero, DP FP values take a register pair)
- 3-address, reg-reg arithmetic instruction
- Single address mode for load/store: base + displacement
  - no indirection
- Simple branch conditions
- Delayed branch
- Note: x86/Pentium can be classified as CISC, but complex instructions are translated to several micro-operations which are executed by hardware
  - instruction set is CISC
  - implementation is RISC
see: SPARC, MIPS, HP PA-Risc, DEC Alpha, IBM PowerPC, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3
8 Example: MIPS (≈ DLX)
Register-Register
  Op [31:26] | Rs1 [25:21] | Rs2 [20:16] | Rd [15:11] | Opx [10:0] (the figure also marks bits 10:6 and 5:0)
Register-Immediate
  Op [31:26] | Rs1 [25:21] | Rd [20:16] | immediate [15:0]
Branch
  Op [31:26] | Rs1 [25:21] | Rs2/Opx [20:16] | immediate [15:0]
Jump / Call
  Op [31:26] | target [25:0]
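Because every field sits at a fixed bit position, decoding reduces to shifts and masks. The following is a minimal sketch, assuming the bit ranges listed on this slide; the function name `decode`, the dictionary keys, and the example opcode value are illustrative and not taken from the lecture.

```python
def decode(word: int) -> dict:
    """Split a 32-bit instruction word into the fields of the four formats.

    All fields are extracted unconditionally; which ones are meaningful
    depends on the opcode (format), which this sketch does not interpret.
    """
    def bits(hi: int, lo: int) -> int:
        # Extract bits hi..lo (inclusive) of the word as an unsigned value.
        return (word >> lo) & ((1 << (hi - lo + 1)) - 1)

    def sign_extend(value: int, width: int) -> int:
        # Interpret a width-bit value as a two's-complement signed number.
        sign = 1 << (width - 1)
        return (value ^ sign) - sign

    return {
        "op": bits(31, 26),
        "rs1": bits(25, 21),
        "rs2_or_rd": bits(20, 16),  # Rs2 (reg-reg, branch) or Rd (reg-imm)
        "rd": bits(15, 11),         # reg-reg format only
        "opx": bits(10, 0),         # reg-reg format only
        "imm16": sign_extend(bits(15, 0), 16),
        "target": bits(25, 0),      # jump / call format only
    }


# Hypothetical register-immediate word: op=35, Rs1=2, Rd=1, immediate=-4.
example = (35 << 26) | (2 << 21) | (1 << 16) | 0xFFFC
fields = decode(example)
```

Note that the register fields can be read before the opcode is known, which is the property the pipelining slides rely on.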
9 MIPS Characteristics
- All instructions have the same length (can be wasteful of memory)
- Only load/store instructions access memory; arithmetic instructions operate on registers
- Only one address mode for load/store: base (register) + displacement (immediate)
- Only branch-if-equal (beq) and branch-if-not-equal (bne) instructions, no branch-if-less-than (blt) or branch-if-greater-than-or-equal (bge) etc.
- Next instruction after branch is always executed (delayed branch)
- Why? Pipelining! (next)
10 Pipelining: It's Natural!
- Laundry Example
- Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
- Washer takes 30 minutes
- Dryer takes 40 minutes
- Folder takes 20 minutes
11 Sequential Laundry
[Figure: timeline from 6 PM to midnight; four loads in task order, each taking 30 + 40 + 20 minutes, one after the other]
- Sequential laundry takes 6 hours for 4 loads
- If they learned pipelining, how long would laundry take?
12 Pipelined Laundry: Start work ASAP
[Figure: timeline from 6 PM to midnight; four loads in task order, overlapped]
- Pipelined laundry takes 3.5 hours for 4 loads
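The 6 hour and 3.5 hour figures can be checked with a short calculation. This is a sketch under one stated assumption that the slides do not spell out: a load that has finished one stage may wait until the next machine is free. The function names are illustrative.

```python
def sequential_minutes(stage_minutes, loads):
    # Each load runs through all stages before the next load starts.
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, loads):
    # finish[s] = time at which stage s finishes the most recent load.
    finish = [0] * len(stage_minutes)
    for _ in range(loads):
        ready = 0  # time at which this load leaves the previous stage
        for s, minutes in enumerate(stage_minutes):
            # A stage starts once the load has arrived AND the machine is free.
            start = max(ready, finish[s])
            finish[s] = start + minutes
            ready = finish[s]
    return finish[-1]


STAGES = [30, 40, 20]  # washer, dryer, folder (minutes)
print(sequential_minutes(STAGES, 4) / 60)  # 6.0
print(pipelined_minutes(STAGES, 4) / 60)   # 3.5
```

The pipelined total is 30 + 4 x 40 + 20 minutes: after the first wash, the 40-minute dryer sets the pace.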
13 Pipelining Lessons
- Pipelining doesn't help latency of a single task; it helps throughput of the entire workload
- Pipeline rate is limited by the slowest pipeline stage
- Multiple tasks operate simultaneously
- Potential speedup = number of pipeline stages
- Unbalanced lengths of pipe stages reduce speedup
- Time to fill the pipeline and time to drain it reduce speedup
[Figure: pipelined laundry timeline from 6 PM to 9 PM, loads in task order]
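These lessons can be made quantitative. The notation below is an assumption of this note, not from the slides: n identical tasks, k stages, stage times t_1, ..., t_k, and tasks may wait between stages.

```latex
\begin{align*}
T_{\text{seq}}  &= n \sum_{i=1}^{k} t_i \\
T_{\text{pipe}} &= \sum_{i=1}^{k} t_i + (n-1)\,\max_i t_i \\
\text{Speedup}  &= \frac{T_{\text{seq}}}{T_{\text{pipe}}}
\end{align*}
% Balanced stages (t_i = t): Speedup = nk / (k + n - 1), which approaches k as n grows.
% Laundry: T_seq = 4 * 90 = 360 min, T_pipe = 90 + 3 * 40 = 210 min, Speedup = 360/210 \approx 1.7 < 3.
```

The laundry speedup stays well below the stage count of 3 for both reasons on the slide: the stages are unbalanced, and with only four loads the fill and drain time is not amortized.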
14 Computer Pipelines
- Execute billions of instructions, so throughput is what matters
- MIPS/DLX desirable features
  - all instructions same length -> can start fetching next instruction before current instruction is decoded
  - registers located in same place in instruction format -> can read registers before instruction is decoded
  - memory operands only in loads or stores -> we don't have to fetch data from memory before instruction is executed
15 Five Steps of MIPS/DLX Datapath
[Figure: datapath with register file (read register 1, read register 2, write register, write data), zero test, ALU and ALU result, multiplexers, data memory (address, read data, write data), and a 16-bit sign extend unit]
16 Pipelined MIPS/DLX Datapath
[Figure: pipelined datapath with stages Instruction Fetch | Instr. Decode / Reg. Fetch | Execute / Addr. Calc | Memory Access | Write Back]
17 Visualizing Pipelining (Figure 3.3, Page 133)
[Figure: instructions in program order (vertical axis) against time in clock cycles (horizontal axis)]
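A text version of such a diagram is easy to generate. The sketch below assumes an ideal five-stage pipeline with no stalls, where each instruction enters IF one cycle after its predecessor; the stage abbreviations follow the later slides, and the function name is illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(instructions):
    """Return one text row per instruction; column c holds the stage in cycle c + 1."""
    total_cycles = len(instructions) + len(STAGES) - 1
    rows = []
    for i, name in enumerate(instructions):
        cells = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            cells[i + s] = stage  # instruction i is in stage s during cycle i + s + 1
        row = f"{name:<8}" + "".join(f"{cell:<4}" for cell in cells)
        rows.append(row.rstrip())
    return rows


for line in pipeline_diagram(["instr 1", "instr 2", "instr 3"]):
    print(line)
```

Reading down any column shows up to five different instructions active in the same clock cycle, each in a different stage.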
18 Summary
- Five pipeline stages
  - Instruction fetch
  - Instruction decode and register fetch (speculative!)
  - Execute (arithmetic instr. or calc. branch condition) / address calculation (load/store instr.)
  - Memory access (only for load/store) / calculate branch target
  - Write-back result to register file
- Problem: suppose that memory operands can also appear in arithmetic instructions, as in
  - addm R1,R2,4(R3)    R1 <- R2 + Mem[R3+4]
- How would this affect the pipeline?
Solution: Stages 3 and 4 would expand to an address stage, memory stage, and then execute stage.
19 It's Not That Easy...
- Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
  - Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away)
  - Data hazards: instruction depends on result of prior instruction still in the pipeline (missing sock)
  - Control hazards: branch instructions change the sequential order of instruction execution (for very dirty laundry we have to check the water temperature setting)
20 One Memory Port / Structural Hazards (Figure 3.6, Page 142)
[Figure: pipeline diagram over time (clock cycles) for Load, Instr 1, Instr 2, Instr 3, Instr 4 in program order]
21 One Memory Port / Structural Hazards (Figure 3.7, Page 143)
[Figure: pipeline diagram over time (clock cycles) for Load, Instr 1, Instr 2, a stall, and Instr 3 in program order]
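The stall before Instr 3 can be reproduced with a small model. This is a simplified sketch, not the textbook's hardware: it assumes a single memory port shared by instruction fetch and data access, that MEM comes three cycles after IF, that data accesses have priority, and that no other hazards occur. The function name is illustrative.

```python
def fetch_cycles(uses_memory):
    """uses_memory[i] is True if instruction i is a load or store.

    Returns the 1-based clock cycle in which each instruction is fetched
    when fetches must avoid cycles in which the port serves a data access.
    """
    MEM_OFFSET = 3   # in the 5-stage pipeline, MEM is three cycles after IF
    busy = set()     # cycles in which the port is used by a load/store in MEM
    fetch = []
    cycle = 1
    for is_mem in uses_memory:
        while cycle in busy:  # port taken by an earlier load/store: stall
            cycle += 1
        fetch.append(cycle)
        if is_mem:
            busy.add(cycle + MEM_OFFSET)
        cycle += 1
    return fetch


# Load followed by four instructions that do not access data memory.
print(fetch_cycles([True, False, False, False, False]))  # [1, 2, 3, 5, 6]
```

The load occupies the port in cycle 4, so the third instruction after it is fetched in cycle 5 instead of cycle 4: one bubble.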
22 Data Hazards (Figure 3.9, page 147)
[Figure: pipeline diagram with stages IF, ID/RF, EX, MEM, WB over time (clock cycles) for the following instructions in program order]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
23 Three Generic Data Hazards
- InstrI followed by InstrJ
  - I: add r1,r2,r3
  - J: add r4,r1,r5
- Read After Write (RAW): InstrJ tries to read operand before InstrI writes it
24 Three Generic Data Hazards
- InstrI followed by InstrJ
  - I: add r1,r2,r3
  - J: add r2,r4,r5
- Write After Read (WAR): InstrJ writes operand before InstrI reads it
  - Gets wrong operand
- Can't happen in DLX 5-stage pipeline because
  - All instructions take 5 stages, and
  - Reads are always in stage 2, and
  - Writes are always in stage 5
25 Three Generic Data Hazards
- InstrI followed by InstrJ
  - I: add r1,r2,r3
  - J: add r1,r4,r5
- Write After Write (WAW): InstrJ tries to write operand before InstrI writes it
  - Leaves wrong result (InstrI, not InstrJ)
- Can't happen in DLX 5-stage pipeline because
  - All instructions take 5 stages, and
  - Writes are always in stage 5
- Will see WAR and WAW in superscalar designs
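The three cases differ only in which register names overlap between InstrI and InstrJ. The sketch below classifies a pair under the assumption that every instruction has the three-register form (dest, src1, src2) used in the examples; the function name and tuple encoding are illustrative. It reports potential hazards (dependences); whether they cause trouble depends on the pipeline, as the slides note for WAR and WAW.

```python
def classify_hazards(instr_i, instr_j):
    """Each instruction is a tuple (dest, src1, src2); instr_i precedes instr_j."""
    dest_i, *srcs_i = instr_i
    dest_j, *srcs_j = instr_j
    kinds = set()
    if dest_i in srcs_j:
        kinds.add("RAW")  # J reads what I writes
    if dest_j in srcs_i:
        kinds.add("WAR")  # J writes what I reads
    if dest_i == dest_j:
        kinds.add("WAW")  # both write the same register
    return kinds


print(classify_hazards(("r1", "r2", "r3"), ("r4", "r1", "r5")))  # {'RAW'}
print(classify_hazards(("r1", "r2", "r3"), ("r2", "r4", "r5")))  # {'WAR'}
print(classify_hazards(("r1", "r2", "r3"), ("r1", "r4", "r5")))  # {'WAW'}
```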
26 Forwarding to Avoid Data Hazards: Register file forwarding
[Figure: pipeline diagram over time (clock cycles) for the following instructions in program order]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
27 Forwarding to Avoid Data Hazards
[Figure: pipeline diagram over time (clock cycles) for the following instructions in program order]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
28 Hardware for Forwarding (Figure 3.20, Page 161)
29 Data Hazard Even with Forwarding: Use after load
[Figure: pipeline diagram over time (clock cycles) for the following instructions in program order]
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9
30 Hazard Detection Unit
- Hardware must detect the hazard and stall the pipeline (insert a pipeline bubble)
[Figure: pipeline diagram over time (clock cycles), showing the inserted bubble, for the following instructions in program order]
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9
31 Software Scheduling to Avoid Load Hazards
Try producing fast code for
    a = b + c
    d = e - f
assuming a, b, c, d, e, and f are in memory.
- Slow code
  - LW Rb,b
  - LW Rc,c
  - ADD Ra,Rb,Rc
  - SW a,Ra
  - LW Re,e
  - LW Rf,f
  - SUB Rd,Re,Rf
  - SW d,Rd
- Fast code
  - LW Rb,b
  - LW Rc,c
  - LW Re,e
  - ADD Ra,Rb,Rc
  - LW Rf,f
  - SW a,Ra
  - SUB Rd,Re,Rf
  - SW d,Rd
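The benefit of the reordering can be counted. The sketch below assumes the five-stage pipeline with full forwarding, so the only remaining data stall is one cycle when an instruction uses a register loaded by the instruction immediately before it. The tuple encoding (opcode, destination, sources) and the function name are illustrative.

```python
def load_use_stalls(program):
    """program: list of (opcode, dest, sources). Counts one stall per use-after-load."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_sources = cur
        # Stall only if the previous instruction is a load feeding this one.
        if prev_op == "LW" and prev_dest in cur_sources:
            stalls += 1
    return stalls


SLOW = [
    ("LW", "Rb", ()), ("LW", "Rc", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("SW", None, ("Ra",)),
    ("LW", "Re", ()), ("LW", "Rf", ()),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]
FAST = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("LW", "Rf", ()),
    ("SW", None, ("Ra",)), ("SUB", "Rd", ("Re", "Rf")),
    ("SW", None, ("Rd",)),
]

print(load_use_stalls(SLOW))  # 2
print(load_use_stalls(FAST))  # 0
```

In the slow code, ADD and SUB each directly follow the load of one of their operands; in the fast code an independent instruction sits between every load and its first use.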
32 Control Hazard on Branches: Three Stage Stall
33 Branch Stall Impact
- If CPI = 1, 30% branch, stall 3 cycles => new CPI = 1.9!
- Two part solution
  - Determine branch taken or not sooner, AND
  - Compute taken branch address earlier
- MIPS/DLX branches test if register = 0 or ≠ 0
- MIPS/DLX solution: reduce branch penalty by
  - Moving zero test to ID/RF stage
  - Adding adder to calculate new PC in ID/RF stage
- Now 1 clock cycle penalty for branch versus 3
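The arithmetic behind these numbers, written out. The symbols are notation for this note, not from the slides; the 1.3 value simply applies the same formula to the reduced 1-cycle penalty.

```latex
\begin{align*}
\text{CPI}_{\text{new}} &= \text{CPI}_{\text{ideal}} + f_{\text{branch}} \times \text{penalty} \\
\text{3-cycle stall:}\quad \text{CPI}_{\text{new}} &= 1 + 0.30 \times 3 = 1.9 \\
\text{1-cycle stall:}\quad \text{CPI}_{\text{new}} &= 1 + 0.30 \times 1 = 1.3
\end{align*}
```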
34 Pipelined DLX Datapath (Figure 3.22, page 163)
[Figure: pipelined datapath with stages Instruction Fetch | Instr. Decode / Reg. Fetch | Execute / Addr. Calc. | Memory Access | Write Back]
This is the correct 1 cycle latency implementation!
35 Pipelined DLX/MIPS Datapath
This is the correct 1 cycle branch delay implementation
36 Five Branch Hazard Alternatives
- #1: Stall until branch direction is clear
- #2: Predict Branch Not Taken
  - Execute successor instructions in sequence
  - "Squash" instructions in pipeline if branch actually taken
  - Advantage of late pipeline state update
  - 47% of DLX branches not taken on average
  - PC+4 already calculated, so use it to get next instruction
- #3: Predict Branch Taken
  - 53% of DLX branches taken on average
  - But haven't calculated branch target address in DLX
    - DLX still incurs 1 cycle branch penalty
    - Other machines: branch target known before outcome
37 Five Branch Hazard Alternatives
- #4: Delayed Branch
  - Define branch to take place AFTER a following instruction
      branch instruction
      sequential successor 1
      sequential successor 2
      ...
      sequential successor n
      branch target if taken
    (sequential successors 1 through n: branch delay of n cycles)
  - 1 slot delay allows proper decision and branch target address calculation in 5 stage pipeline
  - MIPS/DLX uses this
- #5: Branch Prediction: predict the outcome of a branch based on its history (next lecture)
38 Delayed Branch
- Where to get instructions to fill branch delay slot?
  - Before branch instruction
  - From the target address: only valuable when branch taken. Furthermore, it must be legal, i.e., the instruction at the target address writes to a register that is not used by the other branch direction.
  - From fall through: only valuable when branch not taken
- Cancelling or nullifying branches allow more slots to be filled.
  - Instr. in delay slot includes direction branch was predicted. If correct, instr. executes normally. If not, it is turned into a nop.
- Compiler effectiveness for single branch delay slot
  - Fills about 60% of branch delay slots
  - About 80% of instr. executed in branch delay slots useful in computation
  - About 50% (60% x 80%) of slots usefully filled
- Delayed Branch downsides
  - Deeper pipelines (P4 has 20 stages!) mean longer branch delays, which makes delay slots harder to fill
  - Architecturally visible. From a MIPS R10000 (superscalar) paper: "Delayed branches are no longer needed, but are maintained for compatibility."
39 Pipelining Summary
- Just overlap tasks, easy if tasks are independent
- Ideal speedup = number of pipeline stages
- Hazards limit performance
  - Structural: need more HW resources
  - Data (RAW, WAR, WAW): need forwarding, compiler scheduling
  - Control: delayed branch, prediction
- MIPS/DLX instruction set designed with pipelining in mind
  - All instructions same length
  - Load/store architecture
  - Registers located in same place in instruction format
40 Outlook
- Next lecture
  - Advanced Pipelining
  - Dynamic Branch Prediction
  - Instruction-Level Parallelism
  - Multiple Issue (superscalar and VLIW) processors
- Textbook chapter 4.