Transcript: Instruction Set Architecture (ISA)


1
Instruction Set Architecture (ISA)
[Figure: the instruction set is the interface between software and hardware]
2
Interface Design
  • A good interface
  • Lasts through many implementations (portability,
    compatibility)
  • Is used in many different ways (generality)
  • Provides convenient functionality to higher
    levels
  • Permits an efficient implementation at lower
    levels

[Figure: a single interface serving successive implementations (imp 1, imp 2, imp 3) as use continues over time]
3
Evolution of Instruction Sets
Single Accumulator (EDSAC, 1950)
  -> Accumulator + Index Registers (Manchester Mark I, IBM 700 series, 1953)
    -> Separation of Programming Model from Implementation
       -> High-level Language Based (B5000, 1963)
       -> Concept of a Family (IBM 360, 1964)
          -> General Purpose Register Machines
             -> Complex Instruction Sets (Vax, Intel 432, 1977-80)
             -> Load/Store Architecture (CDC 6600, Cray-1, 1963-76)
                -> RISC (MIPS, SPARC, HP-PA, IBM RS6000, ... 1987)
4
Evolution of Instruction Sets
  • Major advances in computer architecture are
    typically associated with landmark instruction
    set designs
  • Example: stack vs. GPR (System 360)
  • Design decisions must take into account:
  • technology
  • machine organization
  • programming languages
  • compiler technology
  • operating systems
  • And they in turn influence these

5
What influences ISA Design?
  • The need to refer to values / memory
  • Registers
  • Main memory
  • But possibly
  • Cache?
  • Values on a stack?
  • Why do these choices exist?

6
Addressing Modes
  • Register:          add R4, R3          ; R4 <- R4 + R3
  • Immediate:         add R4, #3          ; R4 <- R4 + 3
  • Displacement:      add R4, 100(R3)     ; R4 <- R4 + Mem[100 + R3]
  • Register indirect: add R4, (R3)        ; R4 <- R4 + Mem[R3]
  • Indexed:           add R4, (R1 + R2)   ; R4 <- R4 + Mem[R1 + R2]
  • Direct:            add R4, (1001)      ; R4 <- R4 + Mem[1001]
  • Memory indirect:   add R4, @(R3)       ; R4 <- R4 + Mem[Mem[R3]]
  • Autoincrement:     add R4, (R3)+       ; R4 <- R4 + Mem[R3]; R3 <- R3 + d
  • Autodecrement:     add R4, -(R3)       ; R3 <- R3 - d; R4 <- R4 + Mem[R3]
  • Scaled:            add R4, 100(R2)[R3] ; R4 <- R4 + Mem[100 + R2 + R3 x d]

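To make the table concrete, here is a minimal sketch in Python of how a few of these modes fetch a source operand; the register file, memory contents, and displacement value are illustrative assumptions, not part of the lecture.

    # Illustrative register file and memory.
    regs = {"R1": 8, "R2": 4, "R3": 16, "R4": 0}
    mem = {16: 7, 20: 9, 116: 3, 1001: 5, 7: 11}

    def operand(mode):
        if mode == "register":            # add R4, R3
            return regs["R3"]
        if mode == "immediate":           # add R4, #3
            return 3
        if mode == "displacement":        # add R4, 100(R3)
            return mem[100 + regs["R3"]]
        if mode == "register indirect":   # add R4, (R3)
            return mem[regs["R3"]]
        if mode == "memory indirect":     # add R4, @(R3)
            return mem[mem[regs["R3"]]]
        raise ValueError(mode)

    regs["R4"] += operand("displacement") # R4 <- R4 + Mem[100 + 16] = 3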
7
A "Typical" RISC
  • 32-bit fixed-format instructions (3 formats)
  • 32 32-bit GPRs (R0 contains zero; double-precision
    values take a register pair)
  • 3-address, reg-reg arithmetic instructions
  • Single address mode for load/store: base +
    displacement
  • no indirection
  • Simple branch conditions
  • Delayed branch

see SPARC, MIPS, HP PA-RISC, DEC Alpha, IBM
PowerPC, CDC 6600, CDC 7600, Cray-1,
Cray-2, Cray-3
8
Example MIPS
Register-Register:
  Op [31:26] | Rs1 [25:21] | Rs2 [20:16] | Rd [15:11] | Opx [10:0]

Register-Immediate:
  Op [31:26] | Rs1 [25:21] | Rd [20:16] | immediate [15:0]

Branch:
  Op [31:26] | Rs1 [25:21] | Rs2/Opx [20:16] | immediate [15:0]

Jump / Call:
  Op [31:26] | target [25:0]
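Because every instruction is one 32-bit word with fields in fixed positions, decode is a handful of shifts and masks. A sketch in Python (field boundaries as in the formats above; the function name is illustrative):

    def decode_rr(word):
        """Split a register-register instruction into its fields."""
        op  = (word >> 26) & 0x3F    # bits 31-26
        rs1 = (word >> 21) & 0x1F    # bits 25-21
        rs2 = (word >> 16) & 0x1F    # bits 20-16
        rd  = (word >> 11) & 0x1F    # bits 15-11
        opx = word & 0x7FF           # bits 10-0
        return op, rs1, rs2, rd, opx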
9
Warts: x86
  • Floating point co-processor design
  • Complex string move instructions
  • Used in practice
  • Self-modifying code
  • Condition registers

10
Pipelining: It's Natural!
  • Laundry Example
  • Ann, Brian, Cathy, Dave each have one load of
    clothes to wash, dry, and fold
  • Washer takes 30 minutes
  • Dryer takes 40 minutes
  • Folder takes 20 minutes

11
Sequential Laundry
[Figure: sequential laundry from 6 PM to midnight; each load uses the washer (30 min), dryer (40 min), and folder (20 min) in turn, and the next load starts only when the previous one is completely done]
  • Sequential laundry takes 6 hours for 4 loads
  • If they learned pipelining, how long would
    laundry take?

12
Pipelined LaundryStart work ASAP
[Figure: pipelined laundry from 6 PM to 9:30 PM; each load moves to the dryer as soon as the washer is free, so washer, dryer, and folder work on different loads at once]
  • Pipelined laundry takes 3.5 hours for 4 loads

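A quick sanity check of both numbers, sketched in Python; the stage times are the slide's, and the pipelined formula assumes the slowest stage (the dryer) sets the rate:

    stages = [30, 40, 20]      # washer, dryer, folder (minutes)
    loads = 4

    sequential = loads * sum(stages)                     # 4 x 90 = 360 min = 6 hours
    # The first load takes the full 90 minutes; after that the 40-minute
    # dryer (the bottleneck stage) releases one finished load per slot.
    pipelined = sum(stages) + (loads - 1) * max(stages)  # 90 + 3 x 40 = 210 min = 3.5 hours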
13
Pipelining Lessons
  • Pipelining doesn't help the latency of a single
    task; it helps the throughput of the entire workload
  • Pipeline rate is limited by the slowest pipeline
    stage
  • Multiple tasks operate simultaneously
  • Potential speedup = number of pipe stages
  • Unbalanced lengths of pipe stages reduce speedup
  • Time to fill the pipeline and time to drain it
    reduce speedup

14
Computer Pipelines
  • Execute billions of instructions, so throughput
    is what matters
  • DLX desirable features: all instructions the same
    length, registers located in the same place in the
    instruction format, memory operands only in loads
    and stores

15
5 Steps of DLX Datapath (Figure 3.1, Page 130)
[Figure: the five stages in order: Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, Write Back; the IR latch sits after fetch and the LMD (load memory data) register after memory access]
16
Fetch Decode
  • Instruction Fetch (IF)
  • IR <- Mem[PC]
  • NPC <- PC + 4
  • Decode / register fetch (ID)
  • A <- Regs[IR6..10]
  • B <- Regs[IR11..15]
  • Imm <- sign-extended IR16..31

17
Execute Step
  • Memory Reference
  • ALUOutput <- A + Imm
  • calculates the effective address of the memory
    operation
  • Reg-Reg ALU
  • ALUOutput <- A func B
  • Reg-Imm ALU
  • ALUOutput <- A op Imm
  • Branch
  • ALUOutput <- NPC + Imm
  • Cond <- (A op 0)

18
Memory Access
  • Memory Reference
  • LMD <- Mem[ALUOutput]
  • or, Mem[ALUOutput] <- B
  • Branch
  • if (cond) PC <- ALUOutput; else PC <- NPC

19
Writeback
  • Reg-Reg
  • Regs[IR16..20] <- ALUOutput
  • Reg-Imm
  • Regs[IR11..15] <- ALUOutput
  • Load
  • Regs[IR11..15] <- LMD

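Pulling slides 16-19 together, here is a minimal walk of one register-register instruction through the five steps, sketched in Python. The register-transfer names follow the slides; the bit-extraction helper and the tiny instruction memory are illustrative assumptions (the slides number bits from the left, so IR6..10 means the five bits starting at position 6 from the MSB).

    def bits(ir, left, right):
        """Extract IR[left..right], numbering bit 0 as the leftmost bit."""
        return (ir >> (31 - right)) & ((1 << (right - left + 1)) - 1)

    def step_reg_reg(pc, regs, mem, func):
        ir  = mem[pc]                        # IF:  IR <- Mem[PC]
        npc = pc + 4                         #      NPC <- PC + 4
        a = regs[bits(ir, 6, 10)]            # ID:  A <- Regs[IR6..10]
        b = regs[bits(ir, 11, 15)]           #      B <- Regs[IR11..15]
        alu_output = func(a, b)              # EX:  ALUOutput <- A func B
                                             # MEM: nothing for reg-reg
        regs[bits(ir, 16, 20)] = alu_output  # WB:  Regs[IR16..20] <- ALUOutput
        return npc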
20
Non-Pipelined Implementation
  • Branch and store instructions require four cycles
  • All others require five cycles
  • Assumes memory access completes in one cycle;
    otherwise it's slower
  • Alternatively, we could have implemented the machine
    with a single long clock cycle
  • No one would do this: it requires duplication of
    shared units / information

21
Pipelined DLX Datapath (Figure 3.4, Page 137)
[Figure: the five stages (Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc., Memory Access, Write Back) separated by pipeline latches]
  • Data stationary control
  • local decode for each instruction phase / pipeline
    stage

22
Pipelined Implementation
      cycle:  1   2   3   4    5    6    7    8
  I:          IF  ID  EX  MEM  WB
  I+1:            IF  ID  EX   MEM  WB
  I+2:                IF  ID   EX   MEM  WB
  I+3:                    IF   ID   EX   MEM  WB
23
Pipeline Latches
  • Each instruction is active in only a single
    pipeline stage at a time
  • The pipeline latches can also be used to simplify
    testing and debugging
  • Latches add overhead, though.
  • But, some latch designs let us overlap
    computation and latch overhead

24
Visualizing Pipelining Resources (Figure 3.3, Page 133)
[Figure: instructions in program order flow across clock cycles, each using the instruction memory, register file, ALU, and data memory in successive stages, so no resource is needed twice in the same cycle]
25
It's Not That Easy for Computers
  • Limits to pipelining: hazards prevent the next
    instruction from executing during its designated
    clock cycle
  • Structural hazards: HW cannot support this
    combination of instructions (a single person to
    fold and put clothes away)
  • Data hazards: an instruction depends on the result
    of a prior instruction still in the pipeline
    (missing sock)
  • Control hazards: pipelining of branches and other
    instructions that change the PC; the pipeline
    stalls until the hazard clears, leaving bubbles
    in the pipeline

26
One Memory Port / Structural Hazards (Figure 3.6, Page 142)
[Figure: Load followed by Instr 1-4 with a single memory; in cycle 4 the Load's data access and Instr 3's instruction fetch need the memory port at the same time]
27
One Memory Port / Structural Hazards (Figure 3.7, Page 143)
[Figure: the same sequence with a stall: Instr 3's fetch is delayed one cycle so it no longer collides with the Load's memory access]
28
Structural Hazards
  • How do you avoid them?
  • duplicate resources
  • pipeline the resources
  • Why would they exist?
  • cost: e.g., duplicating the memory interface is
    expensive
  • latency: it may be better not to pipeline a unit, to
    reduce the latency of a specific operation
  • Example: the CDC 7600 and the MIPS R2010 FPU chose
    shorter latency rather than fully pipelined FP
    operations
  • Typically FMUL is pipelined, but not, e.g., FDIV

29
Speed Up Equation for Pipelining
  • CPI_pipelined = Ideal CPI + pipeline stall clock
    cycles per instruction
  • Speedup = (Ideal CPI x Pipeline depth) / (Ideal CPI +
    Pipeline stall CPI) x (Clock Cycle_unpipelined /
    Clock Cycle_pipelined)
  • With Ideal CPI = 1:
  • Speedup = Pipeline depth / (1 + Pipeline stall CPI)
    x (Clock Cycle_unpipelined / Clock Cycle_pipelined)



30
Example Dual-port vs. Single-port
  • Machine A: dual-ported memory
  • Machine B: single-ported memory, but its pipelined
    implementation has a 1.05 times faster clock rate
  • Ideal CPI = 1 for both
  • Assume loads are 40% of executed instructions
  • Speedup_A = Pipeline Depth / (1 + 0) x
    (clock_unpipe / clock_pipe)
    = Pipeline Depth
  • Speedup_B = Pipeline Depth / (1 + 0.4 x 1) x
    (clock_unpipe / (clock_unpipe / 1.05))
    = (Pipeline Depth / 1.4) x 1.05
    = 0.75 x Pipeline Depth
  • Speedup_A / Speedup_B = Pipeline Depth /
    (0.75 x Pipeline Depth) = 1.33
  • Machine A is 1.33 times faster (checked in the
    sketch below)

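A sketch in Python of the same calculation; the function simply restates the speedup formula from the previous slide, and the depth is set arbitrarily since it cancels in the ratio:

    def speedup(depth, stall_cpi, clock_ratio=1.0):
        """Pipeline speedup with ideal CPI = 1.
        clock_ratio = unpipelined clock cycle / pipelined clock cycle."""
        return depth / (1 + stall_cpi) * clock_ratio

    depth = 5                                             # arbitrary; cancels below
    a = speedup(depth, stall_cpi=0.0)                     # dual-ported: no load stalls
    b = speedup(depth, stall_cpi=0.4, clock_ratio=1.05)   # 40% loads stall 1 cycle
    print(round(a / b, 2))                                # 1.33: machine A is faster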
31
Data Hazard on R1 (Figure 3.9, Page 147)
[Figure: the five instructions below flow through IF, ID/RF, EX, MEM, WB one cycle apart; each of the last four reads r1 before the add's WB writes it]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or  r8,r1,r9
xor r10,r1,r11
32
Data Hazards
  • SUB and AND read the old value of R1
  • And, depending on previous instructions, they may
    read different old values
  • OR may read the proper value if reads occur after
    writes within a register-file access (major / minor
    clocks)
  • Only XOR is certain to read the proper value
  • Not deterministic: interrupts affect the timing
  • But people have tried exposed pipelines:
  • MIPS
  • Intel i860

33
Three Generic Data Hazards
  • Instr_I followed by Instr_J
  • Read After Write (RAW): Instr_J tries to read an
    operand before Instr_I writes it
34
Three Generic Data Hazards
  • Instr_I followed by Instr_J
  • Write After Read (WAR): Instr_J tries to write an
    operand before Instr_I reads it
  • Gets the wrong operand
  • Can't happen in the DLX 5-stage pipeline because:
  • all instructions take 5 stages, and
  • reads are always in stage 2, and
  • writes are always in stage 5
35
Three Generic Data Hazards
  • Instr_I followed by Instr_J
  • Write After Write (WAW): Instr_J tries to write an
    operand before Instr_I writes it
  • Leaves the wrong result (Instr_I's, not Instr_J's)
  • Can't happen in the DLX 5-stage pipeline because:
  • all instructions take 5 stages, and
  • writes are always in stage 5
  • Could happen if WB for an ALU op were in the MEM
    stage, or if a MEM access took two cycles
  • We'll see WAR and WAW in later, more complicated
    pipes

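The three definitions reduce to simple checks on each instruction's destination and source registers. A toy classifier in Python (the tuple format is an illustrative assumption):

    def hazards(instr_i, instr_j):
        """Data hazards between instr_i and a later instr_j.
        Each instruction is (dest, (src, ...))."""
        di, srcs_i = instr_i
        dj, srcs_j = instr_j
        found = []
        if di in srcs_j: found.append("RAW")  # j reads what i writes
        if dj in srcs_i: found.append("WAR")  # j writes what i reads
        if di == dj:     found.append("WAW")  # both write the same register
        return found

    # add r1,r2,r3 followed by sub r4,r1,r3 -> ['RAW']
    print(hazards(("r1", ("r2", "r3")), ("r4", ("r1", "r3"))))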
36
Forwarding to Avoid Data Hazard (Figure 3.10, Page 149)
[Figure: the same five-instruction sequence; forwarding paths carry the add's result from the EX/MEM latch directly to the ALU inputs of the following instructions]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
37
HW Change for Forwarding (Figure 3.20, Page 161)
38
Data Hazard Even with Forwarding (Figure 3.12, Page 153)
[Figure: the load's result leaves memory at the end of MEM, one cycle too late to forward to the EX stage of the dependent sub]
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9
39
Data Hazards Requiring Stalls
  • LW doesn't have the data until the end of cycle 4
    (its MEM cycle)
  • SUB needs the data by the beginning of that cycle
  • Thus, forwarding can't completely eliminate this
    hazard
  • The easiest thing to do is use a pipeline interlock
    to force a stall

40
Data Hazard Even with Forwarding (Figure 3.13, Page 154)
[Figure: the same sequence with a one-cycle bubble so the sub's EX lines up with the load's MEM output]
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9
41
Prior to stall:
      cycle:  1   2   3   4    5    6    7    8
  LW:         IF  ID  EX  MEM  WB
  SUB:            IF  ID  EX   MEM  WB
  AND:                IF  ID   EX   MEM  WB
  OR:                     IF   ID   EX   MEM  WB
42
With stall:
      cycle:  1   2   3   4      5    6    7    8    9
  LW:         IF  ID  EX  MEM    WB
  SUB:            IF  ID  stall  EX   MEM  WB
  AND:                IF  stall  ID   EX   MEM  WB
  OR:                     stall  IF   ID   EX   MEM  WB
43
Example
  • Suppose that 30% of instructions are loads, and that
    half the time the instruction following the load
    depends on the load value
  • If this hazard creates a single-cycle delay, how
    much faster is the ideal pipelined machine?
  • CPI = 0.7 x 1 + 0.3 x 1.5 = 1.15
  • So, the ideal machine is 15% faster (see the sketch
    below)

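The same arithmetic as a Python sketch, with the example's parameters pulled out so they are easy to vary:

    load_frac = 0.30    # fraction of instructions that are loads
    dep_frac = 0.50     # chance the next instruction uses the load result
    stall = 1           # one-cycle load-use delay

    cpi = 1 + load_frac * dep_frac * stall   # 1 + 0.15 = 1.15
    print(cpi)          # the stall-free machine is 1.15x, i.e. 15%, faster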
44
Software Scheduling to Avoid Load Hazards
Try producing fast code for a = b + c; d = e - f,
assuming a, b, c, d, e, and f are in memory.

Slow code:
  LW   Rb,b
  LW   Rc,c
  (stall)
  ADD  Ra,Rb,Rc
  SW   a,Ra
  LW   Re,e
  LW   Rf,f
  (stall)
  SUB  Rd,Re,Rf
  SW   d,Rd

Fast code:
  LW   Rb,b
  LW   Rc,c
  LW   Re,e
  ADD  Ra,Rb,Rc
  LW   Rf,f
  SW   a,Ra
  SUB  Rd,Re,Rf
  SW   d,Rd
45
How common are load stalls?
46
Implementing Load Interlocks
  • Software: insert NOPs
  • Hardware:
  • is the load destination a source for the subsequent
    instruction?
  • two possible source registers in the subsequent
    instruction
  • have to check for all possible formats! (see the
    sketch below)

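A sketch of the hardware check in Python; representing instructions as dicts and the format names are illustrative assumptions, but the logic (compare the load's destination against every possible source field of the next instruction) is the slide's:

    def needs_interlock(load, nxt):
        """Stall if the load's destination feeds a source of nxt."""
        if load["op"] != "LW":
            return False
        if nxt["format"] == "reg-reg":         # two possible source registers
            sources = (nxt["rs1"], nxt["rs2"])
        else:                                  # reg-imm, load/store, branch
            sources = (nxt["rs1"],)
        return load["rd"] in sources

    lw  = {"op": "LW", "rd": "r1"}
    sub = {"format": "reg-reg", "rs1": "r1", "rs2": "r6"}
    print(needs_interlock(lw, sub))            # True -> insert one stall cycle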
47
Control Hazard on Branches: Three-Stage Stall
48
Branch Stall Impact
  • If CPI = 1 and 30% of instructions are branches, a
    3-cycle stall gives a new CPI of 1 + 0.3 x 3 = 1.9!
  • Two-part solution:
  • determine whether the branch is taken or not sooner
    (in ID), AND
  • compute the taken-branch address earlier
  • DLX branches test whether a register is = 0 or != 0
  • DLX solution:
  • move the zero test to the ID/RF stage
  • add an adder to calculate the new PC in the ID/RF
    stage
  • 1 clock-cycle penalty for a branch versus 3
  • Data hazard: stall if the branch depends on the
    result of a prior ALU operation, e.g.
  • ADD R1, R2, R3
  • BEQZ R1, foo

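The two CPI figures as a quick Python sketch (the 30% branch frequency is the slide's assumption):

    branch_frac = 0.30
    cpi_stall3 = 1 + branch_frac * 3   # branch resolved late: CPI = 1.9
    cpi_stall1 = 1 + branch_frac * 1   # zero test + adder in ID: CPI = 1.3
    print(cpi_stall3, cpi_stall1)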
49
Alternatives
  • Figuring out it's a branch:
  • pre-decode the branch
  • Computing the condition:
  • use condition codes
  • but the condition needs to be computed early enough
  • Address:
  • don't use relative branches

50
Pipelined DLX Datapath (Figure 3.22, Page 163)
[Figure: the five stages with the branch zero test and target adder moved into Instr. Decode / Reg. Fetch]
This is the correct 1-cycle branch latency implementation! (It needs a mux.)
51
Four Branch Hazard Alternatives
  • 1: Stall until the branch direction is clear
  • 2: Predict Branch Not Taken
  • execute successor instructions in sequence
  • squash instructions in the pipeline if the branch is
    actually taken
  • advantage of late pipeline state update
  • 47% of DLX branches are not taken on average
  • PC+4 is already calculated, so use it to get the
    next instruction
  • 3: Predict Branch Taken
  • 53% of DLX branches are taken on average
  • but the branch target address hasn't been calculated
    yet in DLX, so DLX still incurs a 1-cycle branch
    penalty
  • on other machines the branch target is known before
    the outcome

52
Four Branch Hazard Alternatives
  • 4: Delayed Branch
  • define the branch to take place AFTER a following
    instruction:
      branch instruction
      sequential successor_1
      sequential successor_2
      ........
      sequential successor_n   <- branch delay of length n
      branch target if taken
  • a 1-slot delay allows a proper decision and branch
    target address in the 5-stage pipeline
  • DLX uses this
53
Delayed Branch
  • Where to get instructions to fill the branch delay
    slot?
  • from before the branch instruction
  • from the target address: only valuable when the
    branch is taken
  • from the fall-through path: only valuable when the
    branch is not taken
  • canceling branches allow more slots to be filled
54
Delayed Branch
  • Compiler effectiveness for a single branch delay
    slot:
  • fills about 60% of branch delay slots
  • about 80% of instructions executed in branch delay
    slots are useful computation
  • about 50% (60% x 80%) of slots are usefully filled
  • Problems:
  • exposes the pipeline design to the user
  • increased pipeline depth -> need more slots
  • increased issue width -> need more slots

55
Evaluating Branch Alternatives
  Scheduling scheme    Branch penalty  CPI   Speedup v. unpipelined  Speedup v. stall
  Stall pipeline       3               1.42  3.5                     1.0
  Predict taken        1               1.14  4.4                     1.26
  Predict not taken    1               1.09  4.5                     1.29
  Delayed branch       0.5             1.07  4.6                     1.31

  • Conditional and unconditional branches are 14% of
    instructions; 65% of them change the PC

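The CPI column can be reproduced from the penalties and the branch statistics on the last line; a sketch in Python (the effective penalties per scheme are assumptions consistent with the table, and the speedup columns then follow from the earlier formula):

    branch_frac = 0.14    # branches are 14% of instructions
    taken_frac = 0.65     # 65% of branches change the PC

    penalties = {
        "stall pipeline":    3.0,                # every branch stalls 3 cycles
        "predict taken":     1.0,                # 1-cycle penalty on every branch (DLX)
        "predict not taken": 1.0 * taken_frac,   # penalty only when actually taken
        "delayed branch":    0.5,                # average over filled/unfilled slots
    }
    for scheme, penalty in penalties.items():
        cpi = 1 + branch_frac * penalty
        print(f"{scheme}: CPI = {cpi:.2f}")      # 1.42, 1.14, 1.09, 1.07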
56
Hardware / Software
  • Compiler-based static branch prediction:
  • use machine learning to guess the branch direction
  • Profile-based prediction:
  • run the program several times
  • record the behavior across runs
  • assume the past predicts the future

57
Complexity - Exceptions
  • Synchronous vs. asynchronous
  • e.g., page faults vs. I/O completion
  • User-requested vs. coerced
  • e.g., an O/S transition vs. a page fault
  • User-maskable vs. unmaskable
  • Within vs. between instructions
  • e.g., one word of a multi-word operation causes a
    fault
  • Resume vs. terminate

58
Complexity - Exceptions
  • Restartable:
  • the machine provides a mechanism to restart program
    execution
  • Precise:
  • all instructions prior to the excepting instruction
    are committed; none following are committed

59
Exceptions - Ordering
  • Consider exceptions arising in the MEM and IF
    stages:
  • MEM, because of an invalid access
  • IF, also because of an access fault
  • The (later instruction's) IF fault may occur before
    the MEM fault, but the MEM fault must be reported
    first
  • How? Pipeline the exception state and raise
    exceptions at WB

60
Exceptions
  • We'll soon read a seminal paper on handling
    precise exceptions
  • Another alternative is to use exception
    barriers or trap barriers
  • Precise exceptions may be more than many
    programs need
  • We can allow the compiler/program to specify trap
    barriers; this may allow better execution

61
NetBurst(TM) Micro-architecture Pipeline vs. P6
  • P6: introduced at 733 MHz, 0.18 µ
  • NetBurst: introduced at >= 1.4 GHz, 0.18 µ
  • Hyper-pipelined technology enables industry-leading
    performance and clock rate
62
Hyper Pipelined Technology
63
Pipelining Summary
  • Just overlap tasks; it's easy if the tasks are
    independent
  • Speedup <= pipeline depth; if ideal CPI is 1, then:

    Speedup = Pipeline Depth / (1 + Pipeline stall CPI)
              x (Clock Cycle Unpipelined / Clock Cycle Pipelined)

  • Hazards limit performance on computers:
  • structural: need more HW resources
  • data (RAW, WAR, WAW): need forwarding, compiler
    scheduling
  • control: delayed branch, prediction