Transcript: Instruction Set Architecture (ISA)


1
Instruction Set Architecture (ISA)
[Figure: the instruction set is the interface between software and hardware]
2
Interface Design
  • A good interface
  • Lasts through many implementations (portability,
    compatibility)
  • Is used in many different ways (generality)
  • Provides convenient functionality to higher
    levels
  • Permits an efficient implementation at lower
    levels

[Figure: a single interface serving successive implementations (imp 1, imp 2, imp 3) as use continues over time]
3
Evolution of Instruction Sets
Single Accumulator (EDSAC, 1950)
  -> Accumulator + Index Registers (Manchester Mark I, IBM 700 series, 1953)
    -> Separation of Programming Model from Implementation
       -> High-level Language Based (B5000, 1963)
       -> Concept of a Family (IBM 360, 1964)
          -> General Purpose Register Machines
             -> Complex Instruction Sets (Vax, Intel 432, 1977-80)
             -> Load/Store Architecture (CDC 6600, Cray-1, 1963-76)
                -> RISC (MIPS, SPARC, HP-PA, IBM RS6000, ... 1987)
4
Evolution of Instruction Sets
  • Major advances in computer architecture are
    typically associated with landmark instruction
    set designs
  • Example: stack vs. GPR (System 360)
  • Design decisions must take into account:
  • technology
  • machine organization
  • programming languages
  • compiler technology
  • operating systems
  • And they in turn influence these

5
What influences ISA Design?
  • The need to refer to values / memory
  • Registers
  • Main memory
  • But possibly
  • Cache?
  • Values on a stack?
  • Why do these choices exist?

6
Addressing Modes
  • Register:          add R4, R3          ; R4 <- R4 + R3
  • Immediate:         add R4, #3          ; R4 <- R4 + 3
  • Displacement:      add R4, 100(R3)     ; R4 <- R4 + Mem[100 + R3]
  • Register indirect: add R4, (R3)        ; R4 <- R4 + Mem[R3]
  • Indexed:           add R4, (R1 + R2)   ; R4 <- R4 + Mem[R1 + R2]
  • Direct:            add R4, (1001)      ; R4 <- R4 + Mem[1001]
  • Memory indirect:   add R4, @(R3)       ; R4 <- R4 + Mem[Mem[R3]]
  • Autoincrement:     add R4, (R3)+       ; R4 <- R4 + Mem[R3]; R3 <- R3 + d
  • Autodecrement:     add R4, -(R3)       ; R3 <- R3 - d; R4 <- R4 + Mem[R3]
  • Scaled:            add R4, 100(R2)[R3] ; R4 <- R4 + Mem[100 + R2 + R3 x d]

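To make the table concrete, here is a minimal sketch in Python of how a few of these modes fetch a source operand; the register file, memory contents, and displacement value are illustrative assumptions, not part of the lecture.

    # Illustrative register file and memory.
    regs = {"R1": 8, "R2": 4, "R3": 16, "R4": 0}
    mem = {16: 7, 20: 9, 116: 3, 1001: 5, 7: 11}

    def operand(mode):
        if mode == "register":            # add R4, R3
            return regs["R3"]
        if mode == "immediate":           # add R4, #3
            return 3
        if mode == "displacement":        # add R4, 100(R3)
            return mem[100 + regs["R3"]]
        if mode == "register indirect":   # add R4, (R3)
            return mem[regs["R3"]]
        if mode == "memory indirect":     # add R4, @(R3)
            return mem[mem[regs["R3"]]]
        raise ValueError(mode)

    regs["R4"] += operand("displacement") # R4 <- R4 + Mem[100 + 16] = 3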
7
A "Typical" RISC
  • 32-bit fixed-format instructions (3 formats)
  • 32 32-bit GPRs (R0 contains zero; double-precision
    values take a register pair)
  • 3-address, reg-reg arithmetic instructions
  • Single address mode for load/store: base +
    displacement
  • no indirection
  • Simple branch conditions
  • Delayed branch

see SPARC, MIPS, HP PA-RISC, DEC Alpha, IBM
PowerPC, CDC 6600, CDC 7600, Cray-1,
Cray-2, Cray-3
8
Example MIPS
Register-Register:
  Op [31:26] | Rs1 [25:21] | Rs2 [20:16] | Rd [15:11] | Opx [10:0]

Register-Immediate:
  Op [31:26] | Rs1 [25:21] | Rd [20:16] | immediate [15:0]

Branch:
  Op [31:26] | Rs1 [25:21] | Rs2/Opx [20:16] | immediate [15:0]

Jump / Call:
  Op [31:26] | target [25:0]
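Because every instruction is one 32-bit word with fields in fixed positions, decode is a handful of shifts and masks. A sketch in Python (field boundaries as in the formats above; the function name is illustrative):

    def decode_rr(word):
        """Split a register-register instruction into its fields."""
        op  = (word >> 26) & 0x3F    # bits 31-26
        rs1 = (word >> 21) & 0x1F    # bits 25-21
        rs2 = (word >> 16) & 0x1F    # bits 20-16
        rd  = (word >> 11) & 0x1F    # bits 15-11
        opx = word & 0x7FF           # bits 10-0
        return op, rs1, rs2, rd, opx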
9
Warts: x86
  • Floating point co-processor design
  • Complex string move instructions
  • Used in practice
  • Self-modifying code
  • Condition registers

10
Pipelining: It's Natural!
  • Laundry Example
  • Ann, Brian, Cathy, Dave each have one load of
    clothes to wash, dry, and fold
  • Washer takes 30 minutes
  • Dryer takes 40 minutes
  • Folder takes 20 minutes

11
Sequential Laundry
[Figure: sequential laundry from 6 PM to midnight; each load uses the washer (30 min), dryer (40 min), and folder (20 min) in turn, and the next load starts only when the previous one is completely done]
  • Sequential laundry takes 6 hours for 4 loads
  • If they learned pipelining, how long would
    laundry take?

12
Pipelined LaundryStart work ASAP
[Figure: pipelined laundry from 6 PM to 9:30 PM; each load moves to the dryer as soon as the washer is free, so washer, dryer, and folder work on different loads at once]
  • Pipelined laundry takes 3.5 hours for 4 loads

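A quick sanity check of both numbers, sketched in Python; the stage times are the slide's, and the pipelined formula assumes the slowest stage (the dryer) sets the rate:

    stages = [30, 40, 20]      # washer, dryer, folder (minutes)
    loads = 4

    sequential = loads * sum(stages)                     # 4 x 90 = 360 min = 6 hours
    # The first load takes the full 90 minutes; after that the 40-minute
    # dryer (the bottleneck stage) releases one finished load per slot.
    pipelined = sum(stages) + (loads - 1) * max(stages)  # 90 + 3 x 40 = 210 min = 3.5 hours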
13
Pipelining Lessons
  • Pipelining doesn't help the latency of a single
    task; it helps the throughput of the entire workload
  • Pipeline rate is limited by the slowest pipeline
    stage
  • Multiple tasks operate simultaneously
  • Potential speedup = number of pipe stages
  • Unbalanced lengths of pipe stages reduce speedup
  • Time to fill the pipeline and time to drain it
    reduce speedup

14
Computer Pipelines
  • Execute billions of instructions, so throughput
    is what matters
  • DLX desirable features: all instructions the same
    length, registers located in the same place in the
    instruction format, memory operands only in loads
    and stores

15
5 Steps of DLX Datapath (Figure 3.1, Page 130)
[Figure: the five stages in order: Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, Write Back; the IR latch sits after fetch and the LMD (load memory data) register after memory access]
16
Fetch Decode
  • Instruction Fetch (IF)
  • IR <- Mem[PC]
  • NPC <- PC + 4
  • Decode / register fetch (ID)
  • A <- Regs[IR6..10]
  • B <- Regs[IR11..15]
  • Imm <- sign-extended IR16..31

17
Execute Step
  • Memory Reference
  • ALUOutput <- A + Imm
  • calculates the effective address of the memory
    operation
  • Reg-Reg ALU
  • ALUOutput <- A func B
  • Reg-Imm ALU
  • ALUOutput <- A op Imm
  • Branch
  • ALUOutput <- NPC + Imm
  • Cond <- (A op 0)

18
Memory Access
  • Memory Reference
  • LMD <- Mem[ALUOutput]
  • or, Mem[ALUOutput] <- B
  • Branch
  • if (cond) PC <- ALUOutput; else PC <- NPC

19
Writeback
  • Reg-Reg
  • Regs[IR16..20] <- ALUOutput
  • Reg-Imm
  • Regs[IR11..15] <- ALUOutput
  • Load
  • Regs[IR11..15] <- LMD

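Pulling slides 16-19 together, here is a minimal walk of one register-register instruction through the five steps, sketched in Python. The register-transfer names follow the slides; the bit-extraction helper and the tiny instruction memory are illustrative assumptions (the slides number bits from the left, so IR6..10 means the five bits starting at position 6 from the MSB).

    def bits(ir, left, right):
        """Extract IR[left..right], numbering bit 0 as the leftmost bit."""
        return (ir >> (31 - right)) & ((1 << (right - left + 1)) - 1)

    def step_reg_reg(pc, regs, mem, func):
        ir  = mem[pc]                        # IF:  IR <- Mem[PC]
        npc = pc + 4                         #      NPC <- PC + 4
        a = regs[bits(ir, 6, 10)]            # ID:  A <- Regs[IR6..10]
        b = regs[bits(ir, 11, 15)]           #      B <- Regs[IR11..15]
        alu_output = func(a, b)              # EX:  ALUOutput <- A func B
                                             # MEM: nothing for reg-reg
        regs[bits(ir, 16, 20)] = alu_output  # WB:  Regs[IR16..20] <- ALUOutput
        return npc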
20
Non-Pipelined Implementation
  • Branch and store instructions require four cycles
  • All others require five cycles
  • Assumes memory access completes in one cycle;
    otherwise it's slower
  • Alternatively, we could have implemented the machine
    with a single long clock cycle
  • No one would do this: it requires duplication of
    shared units / information

21
Pipelined DLX Datapath (Figure 3.4, Page 137)
[Figure: the five stages (Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc., Memory Access, Write Back) separated by pipeline latches]
  • Data stationary control
  • local decode for each instruction phase / pipeline
    stage

22
Pipelined Implementation
      cycle:  1   2   3   4    5    6    7    8
  I:          IF  ID  EX  MEM  WB
  I+1:            IF  ID  EX   MEM  WB
  I+2:                IF  ID   EX   MEM  WB
  I+3:                    IF   ID   EX   MEM  WB
23
Pipeline Latches
  • Each instruction is active in only a single
    pipeline stage at a time
  • The pipeline latches can also be used to simplify
    testing and debugging
  • Latches add overhead, though.
  • But, some latch designs let us overlap
    computation and latch overhead

24
Visualizing Pipelining Resources (Figure 3.3, Page 133)
[Figure: instructions in program order flow across clock cycles, each using the instruction memory, register file, ALU, and data memory in successive stages, so no resource is needed twice in the same cycle]
25
It's Not That Easy for Computers
  • Limits to pipelining: hazards prevent the next
    instruction from executing during its designated
    clock cycle
  • Structural hazards: HW cannot support this
    combination of instructions (a single person to
    fold and put clothes away)
  • Data hazards: an instruction depends on the result
    of a prior instruction still in the pipeline
    (missing sock)
  • Control hazards: pipelining of branches and other
    instructions that change the PC; the pipeline
    stalls until the hazard clears, leaving bubbles
    in the pipeline

26
One Memory Port / Structural Hazards (Figure 3.6, Page 142)
[Figure: Load followed by Instr 1-4 with a single memory; in cycle 4 the Load's data access and Instr 3's instruction fetch need the memory port at the same time]
27
One Memory Port / Structural Hazards (Figure 3.7, Page 143)
[Figure: the same sequence with a stall: Instr 3's fetch is delayed one cycle so it no longer collides with the Load's memory access]
28
Structural Hazards
  • How do you avoid them?
  • duplicate resources
  • pipeline the resources
  • Why would they exist?
  • cost: e.g., duplicating the memory interface is
    expensive
  • latency: it may be better not to pipeline a unit, to
    reduce the latency of a specific operation
  • Example: the CDC 7600 and the MIPS R2010 FPU chose
    shorter latency rather than fully pipelined FP
    operations
  • Typically FMUL is pipelined, but not, e.g., FDIV

29
Speed Up Equation for Pipelining
  • CPI_pipelined = Ideal CPI + pipeline stall clock
    cycles per instruction
  • Speedup = (Ideal CPI x Pipeline depth) / (Ideal CPI +
    Pipeline stall CPI) x (Clock Cycle_unpipelined /
    Clock Cycle_pipelined)
  • With Ideal CPI = 1:
  • Speedup = Pipeline depth / (1 + Pipeline stall CPI)
    x (Clock Cycle_unpipelined / Clock Cycle_pipelined)



30
Example Dual-port vs. Single-port
  • Machine A: dual-ported memory
  • Machine B: single-ported memory, but its pipelined
    implementation has a 1.05 times faster clock rate
  • Ideal CPI = 1 for both
  • Assume loads are 40% of executed instructions
  • Speedup_A = Pipeline Depth / (1 + 0) x
    (clock_unpipe / clock_pipe)
    = Pipeline Depth
  • Speedup_B = Pipeline Depth / (1 + 0.4 x 1) x
    (clock_unpipe / (clock_unpipe / 1.05))
    = (Pipeline Depth / 1.4) x 1.05
    = 0.75 x Pipeline Depth
  • Speedup_A / Speedup_B = Pipeline Depth /
    (0.75 x Pipeline Depth) = 1.33
  • Machine A is 1.33 times faster (checked in the
    sketch below)

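A sketch in Python of the same calculation; the function simply restates the speedup formula from the previous slide, and the depth is set arbitrarily since it cancels in the ratio:

    def speedup(depth, stall_cpi, clock_ratio=1.0):
        """Pipeline speedup with ideal CPI = 1.
        clock_ratio = unpipelined clock cycle / pipelined clock cycle."""
        return depth / (1 + stall_cpi) * clock_ratio

    depth = 5                                             # arbitrary; cancels below
    a = speedup(depth, stall_cpi=0.0)                     # dual-ported: no load stalls
    b = speedup(depth, stall_cpi=0.4, clock_ratio=1.05)   # 40% loads stall 1 cycle
    print(round(a / b, 2))                                # 1.33: machine A is faster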
31
Data Hazard on R1 (Figure 3.9, Page 147)
[Figure: the five instructions below flow through IF, ID/RF, EX, MEM, WB one cycle apart; each of the last four reads r1 before the add's WB writes it]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or  r8,r1,r9
xor r10,r1,r11
32
Data Hazards
  • SUB and AND read the old value of R1
  • And, depending on previous instructions, they may
    read different old values
  • OR may read the proper value if reads occur after
    writes within a register-file access (major / minor
    clocks)
  • Only XOR is certain to read the proper value
  • Not deterministic: interrupts affect the timing
  • But people have tried exposed pipelines:
  • MIPS
  • Intel i860

33
Three Generic Data Hazards
  • Instr_I followed by Instr_J
  • Read After Write (RAW): Instr_J tries to read an
    operand before Instr_I writes it
34
Three Generic Data Hazards
  • Instr_I followed by Instr_J
  • Write After Read (WAR): Instr_J tries to write an
    operand before Instr_I reads it
  • Gets the wrong operand
  • Can't happen in the DLX 5-stage pipeline because:
  • all instructions take 5 stages, and
  • reads are always in stage 2, and
  • writes are always in stage 5
35
Three Generic Data Hazards
  • Instr_I followed by Instr_J
  • Write After Write (WAW): Instr_J tries to write an
    operand before Instr_I writes it
  • Leaves the wrong result (Instr_I's, not Instr_J's)
  • Can't happen in the DLX 5-stage pipeline because:
  • all instructions take 5 stages, and
  • writes are always in stage 5
  • Could happen if WB for an ALU op were in the MEM
    stage, or if a MEM access took two cycles
  • We'll see WAR and WAW in later, more complicated
    pipes

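The three definitions reduce to simple checks on each instruction's destination and source registers. A toy classifier in Python (the tuple format is an illustrative assumption):

    def hazards(instr_i, instr_j):
        """Data hazards between instr_i and a later instr_j.
        Each instruction is (dest, (src, ...))."""
        di, srcs_i = instr_i
        dj, srcs_j = instr_j
        found = []
        if di in srcs_j: found.append("RAW")  # j reads what i writes
        if dj in srcs_i: found.append("WAR")  # j writes what i reads
        if di == dj:     found.append("WAW")  # both write the same register
        return found

    # add r1,r2,r3 followed by sub r4,r1,r3 -> ['RAW']
    print(hazards(("r1", ("r2", "r3")), ("r4", ("r1", "r3"))))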
36
Forwarding to Avoid Data Hazard (Figure 3.10, Page 149)
[Figure: the same five-instruction sequence; forwarding paths carry the add's result from the EX/MEM latch directly to the ALU inputs of the following instructions]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
37
HW Change for Forwarding (Figure 3.20, Page 161)
38
Data Hazard Even with Forwarding (Figure 3.12, Page 153)
[Figure: the load's result leaves memory at the end of MEM, one cycle too late to forward to the EX stage of the dependent sub]
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9
39
Data Hazards Requiring Stalls
  • LW doesn't have the data until the end of cycle 4
    (its MEM cycle)
  • SUB needs the data by the beginning of that cycle
  • Thus, forwarding can't completely eliminate this
    hazard
  • The easiest thing to do is use a pipeline interlock
    to force a stall

40
Data Hazard Even with Forwarding (Figure 3.13, Page 154)
[Figure: the same sequence with a one-cycle bubble so the sub's EX lines up with the load's MEM output]
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9
41
Prior to stall:
      cycle:  1   2   3   4    5    6    7    8
  LW:         IF  ID  EX  MEM  WB
  SUB:            IF  ID  EX   MEM  WB
  AND:                IF  ID   EX   MEM  WB
  OR:                     IF   ID   EX   MEM  WB
42
With stall:
      cycle:  1   2   3   4      5    6    7    8    9
  LW:         IF  ID  EX  MEM    WB
  SUB:            IF  ID  stall  EX   MEM  WB
  AND:                IF  stall  ID   EX   MEM  WB
  OR:                     stall  IF   ID   EX   MEM  WB
43
Example
  • Suppose that 30% of instructions are loads, and that
    half the time the instruction following the load
    depends on the load value
  • If this hazard creates a single-cycle delay, how
    much faster is the ideal pipelined machine?
  • CPI = 0.7 x 1 + 0.3 x 1.5 = 1.15
  • So, the ideal machine is 15% faster (see the sketch
    below)

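The same arithmetic as a Python sketch, with the example's parameters pulled out so they are easy to vary:

    load_frac = 0.30    # fraction of instructions that are loads
    dep_frac = 0.50     # chance the next instruction uses the load result
    stall = 1           # one-cycle load-use delay

    cpi = 1 + load_frac * dep_frac * stall   # 1 + 0.15 = 1.15
    print(cpi)          # the stall-free machine is 1.15x, i.e. 15%, faster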
44
Software Scheduling to Avoid Load Hazards
Try producing fast code for a = b + c; d = e - f,
assuming a, b, c, d, e, and f are in memory.

Slow code:
  LW   Rb,b
  LW   Rc,c
  (stall)
  ADD  Ra,Rb,Rc
  SW   a,Ra
  LW   Re,e
  LW   Rf,f
  (stall)
  SUB  Rd,Re,Rf
  SW   d,Rd

Fast code:
  LW   Rb,b
  LW   Rc,c
  LW   Re,e
  ADD  Ra,Rb,Rc
  LW   Rf,f
  SW   a,Ra
  SUB  Rd,Re,Rf
  SW   d,Rd
45
How common are load stalls?
46
Implementing Load Interlocks
  • Software: insert NOPs
  • Hardware:
  • is the load destination a source for the subsequent
    instruction?
  • two possible source registers in the subsequent
    instruction
  • have to check for all possible formats! (see the
    sketch below)

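A sketch of the hardware check in Python; representing instructions as dicts and the format names are illustrative assumptions, but the logic (compare the load's destination against every possible source field of the next instruction) is the slide's:

    def needs_interlock(load, nxt):
        """Stall if the load's destination feeds a source of nxt."""
        if load["op"] != "LW":
            return False
        if nxt["format"] == "reg-reg":         # two possible source registers
            sources = (nxt["rs1"], nxt["rs2"])
        else:                                  # reg-imm, load/store, branch
            sources = (nxt["rs1"],)
        return load["rd"] in sources

    lw  = {"op": "LW", "rd": "r1"}
    sub = {"format": "reg-reg", "rs1": "r1", "rs2": "r6"}
    print(needs_interlock(lw, sub))            # True -> insert one stall cycle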
47
Control Hazard on Branches: Three-Stage Stall
48
Branch Stall Impact
  • If CPI = 1 and 30% of instructions are branches, a
    3-cycle stall gives a new CPI of 1 + 0.3 x 3 = 1.9!
  • Two-part solution:
  • determine whether the branch is taken or not sooner
    (in ID), AND
  • compute the taken-branch address earlier
  • DLX branches test whether a register is = 0 or != 0
  • DLX solution:
  • move the zero test to the ID/RF stage
  • add an adder to calculate the new PC in the ID/RF
    stage
  • 1 clock-cycle penalty for a branch versus 3
  • Data hazard: stall if the branch depends on the
    result of a prior ALU operation, e.g.
  • ADD R1, R2, R3
  • BEQZ R1, foo

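The two CPI figures as a quick Python sketch (the 30% branch frequency is the slide's assumption):

    branch_frac = 0.30
    cpi_stall3 = 1 + branch_frac * 3   # branch resolved late: CPI = 1.9
    cpi_stall1 = 1 + branch_frac * 1   # zero test + adder in ID: CPI = 1.3
    print(cpi_stall3, cpi_stall1)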
49
Alternatives
  • Figuring out it's a branch:
  • pre-decode the branch
  • Computing the condition:
  • use condition codes
  • but the condition needs to be computed early enough
  • Address:
  • don't use relative branches

50
Pipelined DLX Datapath (Figure 3.22, Page 163)
[Figure: the five stages with the branch zero test and target adder moved into Instr. Decode / Reg. Fetch]
This is the correct 1-cycle branch latency implementation! (It needs a mux.)
51
Four Branch Hazard Alternatives
  • 1: Stall until the branch direction is clear
  • 2: Predict Branch Not Taken
  • execute successor instructions in sequence
  • squash instructions in the pipeline if the branch is
    actually taken
  • advantage of late pipeline state update
  • 47% of DLX branches are not taken on average
  • PC+4 is already calculated, so use it to get the
    next instruction
  • 3: Predict Branch Taken
  • 53% of DLX branches are taken on average
  • but the branch target address hasn't been calculated
    yet in DLX, so DLX still incurs a 1-cycle branch
    penalty
  • on other machines the branch target is known before
    the outcome

52
Four Branch Hazard Alternatives
  • 4: Delayed Branch
  • define the branch to take place AFTER a following
    instruction:
      branch instruction
      sequential successor_1
      sequential successor_2
      ........
      sequential successor_n   <- branch delay of length n
      branch target if taken
  • a 1-slot delay allows a proper decision and branch
    target address in the 5-stage pipeline
  • DLX uses this
53
Delayed Branch
  • Where to get instructions to fill the branch delay
    slot?
  • from before the branch instruction
  • from the target address: only valuable when the
    branch is taken
  • from the fall-through path: only valuable when the
    branch is not taken
  • canceling branches allow more slots to be filled
54
Delayed Branch
  • Compiler effectiveness for a single branch delay
    slot:
  • fills about 60% of branch delay slots
  • about 80% of instructions executed in branch delay
    slots are useful computation
  • about 50% (60% x 80%) of slots are usefully filled
  • Problems:
  • exposes the pipeline design to the user
  • increased pipeline depth -> need more slots
  • increased issue width -> need more slots

55
Evaluating Branch Alternatives
  Scheduling scheme    Branch penalty  CPI   Speedup v. unpipelined  Speedup v. stall
  Stall pipeline       3               1.42  3.5                     1.0
  Predict taken        1               1.14  4.4                     1.26
  Predict not taken    1               1.09  4.5                     1.29
  Delayed branch       0.5             1.07  4.6                     1.31

  • Conditional and unconditional branches are 14% of
    instructions; 65% of them change the PC

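The CPI column can be reproduced from the penalties and the branch statistics on the last line; a sketch in Python (the effective penalties per scheme are assumptions consistent with the table, and the speedup columns then follow from the earlier formula):

    branch_frac = 0.14    # branches are 14% of instructions
    taken_frac = 0.65     # 65% of branches change the PC

    penalties = {
        "stall pipeline":    3.0,                # every branch stalls 3 cycles
        "predict taken":     1.0,                # 1-cycle penalty on every branch (DLX)
        "predict not taken": 1.0 * taken_frac,   # penalty only when actually taken
        "delayed branch":    0.5,                # average over filled/unfilled slots
    }
    for scheme, penalty in penalties.items():
        cpi = 1 + branch_frac * penalty
        print(f"{scheme}: CPI = {cpi:.2f}")      # 1.42, 1.14, 1.09, 1.07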
56
Hardware / Software
  • Compiler-based static branch prediction:
  • use machine learning to guess the branch direction
  • Profile-based prediction:
  • run the program several times
  • record the behavior across runs
  • assume the past predicts the future

57
Complexity - Exceptions
  • Synchronous vs. asynchronous
  • e.g., page faults vs. I/O completion
  • User-requested vs. coerced
  • e.g., an O/S transition vs. a page fault
  • User-maskable vs. unmaskable
  • Within vs. between instructions
  • e.g., one word of a multi-word operation causes a
    fault
  • Resume vs. terminate

58
Complexity - Exceptions
  • Restartable:
  • the machine provides a mechanism to restart program
    execution
  • Precise:
  • all instructions prior to the excepting instruction
    are committed; none following are committed

59
Exceptions - Ordering
  • Consider exceptions arising in the MEM and IF
    stages:
  • MEM, because of an invalid access
  • IF, also because of an access fault
  • The (later instruction's) IF fault may occur before
    the MEM fault, but the MEM fault must be reported
    first
  • How? Pipeline the exception state and raise
    exceptions at WB

60
Exceptions
  • We'll soon read a seminal paper on handling
    precise exceptions
  • Another alternative is to use exception
    barriers or trap barriers
  • Precise exceptions may be more than many
    programs need
  • We can allow the compiler/program to specify trap
    barriers; this may allow better execution

61
NetBurst(TM) Micro-architecture Pipeline vs. P6
  • P6: introduced at 733 MHz, 0.18 µ
  • NetBurst: introduced at >= 1.4 GHz, 0.18 µ
  • Hyper-pipelined technology enables industry-leading
    performance and clock rate
62
Hyper Pipelined Technology
63
Pipelining Summary
  • Just overlap tasks; it's easy if the tasks are
    independent
  • Speedup <= pipeline depth; if ideal CPI is 1, then:

    Speedup = Pipeline Depth / (1 + Pipeline stall CPI)
              x (Clock Cycle Unpipelined / Clock Cycle Pipelined)

  • Hazards limit performance on computers:
  • structural: need more HW resources
  • data (RAW, WAR, WAW): need forwarding, compiler
    scheduling
  • control: delayed branch, prediction