Title: Chapter 5: ISAs
- In MARIE, we had simple instructions:
  - a 4-bit op code followed by either
    - a 12-bit address for Load, Store, Add, Subt, Jump
    - a 2-bit condition code for Skipcond
    - twelve 0s for instructions that did not need a datum
- However, most ISAs are much more complex, so there are many more op codes and possibly more than 1 operand
- How do we specify the operation?
  - Each operation will have a unique op code, although op codes might not be of equal length (in MARIE, all were 4 bits; in some ISAs, op codes range from 8 bits to 16 or more)
- How do we specify the number of operands?
  - This is usually determined by the op code, although it could also be specified in the instruction as an added piece of instruction information
- How do we specify the location of each operand?
  - We need addressing information
2. Instruction Formats
PDP-10: fixed-length instructions, with a 9-bit op code (512 operations) followed by 2 operands; one operand is in a register, the other in memory.

PDP-11: variable-length instructions with 13 different formats; the op code varies from 4 bits to 16 bits, and 0, 1, 2, or 3 operands can be specified based on the format.
3. Two More Formats
The variable-length Intel (Pentium) format is shown above: instructions can vary from 1 byte to 17 bytes, with op codes 1 or 2 bytes long, and all instructions have up to 2 operands.

The fixed-length PowerPC format is shown on the right: all instructions are 32 bits, but there are five basic forms, with up to 3 operands as long as all 3 operands are stored in registers.
4. Instruction Format Decisions
- Length decisions
  - Fixed length: makes instruction fetching predictable (which helps out in pipelining)
  - Variable length: flexible instructions that can accommodate up to 3 operands, including 3 memory references; length is determined by need, so memory space is not wasted
- Number of addressing modes
  - Fewer addressing modes makes things easier on the architect, but possibly harder on the programmer
  - Simple addressing modes make pipelining easier
- How many registers?
  - Generally, the more the better, but with more registers there is less space available for other circuits or cache (more registers, more expense)
5. Alignment
- Another question is what alignment should be used
- Recall that most machines today have word sizes of 32 bits or 64 bits, and the CPU fetches or stores 1 word at a time
- Yet memory is organized in bytes
- Should we allow the CPU to access something smaller than a word?
  - If so, we have to worry about alignment
- Two methods are used for ordering the bytes within a word:
  - Big Endian: bytes are placed in order in the word
  - Little Endian: bytes are placed in the opposite order
  - See below, where the word is 12345678
- Different architectures choose differently between these two orderings
  - Intel uses little Endian, and bitmaps were developed this way, so a bitmap must be converted before it can be viewed on a big Endian machine!
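Byte ordering is easy to see with Python's struct module; this sketch packs the slide's word 0x12345678 both ways (illustrative, not tied to any particular machine):

```python
import struct

word = 0x12345678  # the example word from the slide

big = struct.pack(">I", word)     # big endian: most significant byte first
little = struct.pack("<I", word)  # little endian: least significant byte first

print(big.hex())     # 12345678
print(little.hex())  # 78563412
```

A big Endian machine reading the little Endian bytes in order would see 0x78563412, which is why data such as bitmaps must be byte-swapped when moved between the two conventions.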
6. Types of CPU Storage
- Although all architectures today use register storage, other approaches have been tried
- Accumulator-based: a single data register, the accumulator (MARIE is like this)
  - This was common in early computers, when register storage was very expensive
- General-purpose registers: many data registers are available for the programmer's use
  - Most RISC architectures are of this form
- Special-purpose registers: many data registers, but each has its own implied use (e.g., a counter register for loops, an I/O register for I/O operations, a base register for arrays, etc.)
  - Pentium is of this form
- Stack-based: instead of general-purpose registers, storage is a stack, and operations are rearranged to be performed in postfix order
  - An early alternative to accumulator-based architectures, obsolete now
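The postfix ordering used by stack machines can be sketched with a small evaluator (a hypothetical illustration of the idea, not any real ISA): operands are pushed, and each operator pops two values and pushes the result.

```python
def eval_postfix(tokens):
    """Evaluate a postfix token list the way a stack machine would."""
    stack = []
    ops = {"+": lambda a, b: a + b,
           "-": lambda a, b: a - b,
           "*": lambda a, b: a * b,
           "/": lambda a, b: a / b}
    for tok in tokens:
        if tok in ops:
            b = stack.pop()  # the right operand was pushed last
            a = stack.pop()
            stack.append(ops[tok](a, b))
        else:
            stack.append(float(tok))
    return stack.pop()

# The infix expression (2 + 3) * 4 becomes "2 3 + 4 *" in postfix
print(eval_postfix("2 3 + 4 *".split()))  # 20.0
```

Because the operator always acts on the top of the stack, no parentheses and no explicit operand addresses are needed in the instructions.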
7. Load-Store Architectures
- When deciding on the number of registers to make available, architects also decide whether to support a load-store instruction set
- In a load-store instruction set, the only operations allowed to reference memory are loads and stores
- All other operations (ALU operations, branches) must reference only values in registers or immediate data (data in the instruction itself)
- This makes programming more difficult, because a simple operation like inc X must now load X into a register, increment the register, and store the result back to X
- But it is necessary to support a pipeline, which ultimately speeds up processing!
- All RISC architectures are load-store instruction sets and provide at least 16 registers (hopefully more!)
- Many CISC architectures permit memory-memory and memory-register ALU operations, so these machines can get by with fewer registers
  - Intel has 4 general-purpose data registers
8. Number of Operands
- The number of operands that an instruction can specify has an impact on instruction sizes
- Consider the instruction Add R1, R2, R3
  - The Add op code is 6 bits
  - Assume 32 registers; each register specifier takes 5 bits
  - This instruction is 21 bits long
- Consider Add X, Y, Z
  - Assume 256 MBytes of memory
  - Each memory reference is 28 bits
  - This instruction is 90 bits long!
- However, we do not necessarily want to limit our instructions to 1 or 2 operands, so we must either permit long instructions or find a compromise
- The load-store instruction set is a compromise: 3 operands can be referenced as long as they are all in registers, and 1 operand can be referenced in memory as long as it is in an instruction by itself (load and store use 1 memory reference only)
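The two instruction sizes quoted above follow from simple arithmetic, sketched here (the 6-bit op code, 32 registers, and 256 MBytes of memory are the slide's assumptions):

```python
import math

def instr_bits(opcode_bits, operand_bits, n_operands):
    """Total instruction length in bits."""
    return opcode_bits + n_operands * operand_bits

reg_bits = math.ceil(math.log2(32))           # 32 registers -> 5 bits each
mem_bits = math.ceil(math.log2(256 * 2**20))  # 256 MBytes -> 28-bit addresses

print(instr_bits(6, reg_bits, 3))  # Add R1, R2, R3 -> 21 bits
print(instr_bits(6, mem_bits, 3))  # Add X, Y, Z    -> 90 bits
```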
9. 1, 2 and 3 Operand Examples
Using three addresses:
    Instruction     Comment
    SUB Y, A, B     Y ← A - B
    MPY T, D, E     T ← D * E
    ADD T, T, C     T ← T + C
    DIV Y, Y, T     Y ← Y / T

Using one address:
    Instruction     Comment
    LOAD D          AC ← D
    MPY E           AC ← AC * E
    ADD C           AC ← AC + C
    STOR Y          Y ← AC
    LOAD A          AC ← A
    SUB B           AC ← AC - B
    DIV Y           AC ← AC / Y
    STOR Y          Y ← AC

Using two addresses:
    Instruction     Comment
    MOVE Y, A       Y ← A
    SUB Y, B        Y ← Y - B
    MOVE T, D       T ← D
    MPY T, E        T ← T * E
    ADD T, C        T ← T + C
    DIV Y, T        Y ← Y / T
Here we compare the length of the code when we have one-address, two-address, and three-address instructions; each sequence computes Y ← (A - B) / (C + D * E). Notice that the one- and two-address instructions write over a source operand, thus destroying data.
See pages 206-207 for another example
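As a sanity check, the one-address accumulator sequence can be traced in Python with sample values (the values chosen for A through E are arbitrary):

```python
# Sample values for the variables in Y <- (A - B) / (C + D * E)
A, B, C, D, E = 20.0, 8.0, 1.0, 2.0, 2.5

mem = {"A": A, "B": B, "C": C, "D": D, "E": E, "Y": 0.0}

ac = mem["D"]        # LOAD D
ac = ac * mem["E"]   # MPY  E
ac = ac + mem["C"]   # ADD  C
mem["Y"] = ac        # STOR Y   (Y now holds C + D*E)
ac = mem["A"]        # LOAD A
ac = ac - mem["B"]   # SUB  B
ac = ac / mem["Y"]   # DIV  Y
mem["Y"] = ac        # STOR Y

print(mem["Y"])               # 2.0
print((A - B) / (C + D * E))  # 2.0, the same result
```

Note how the one-address form needs a temporary trip through Y to hold the denominator, which is exactly the kind of source-operand overwriting mentioned above.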
10. Addressing Modes
- In our instruction, how do we specify the data?
- We have different modes to specify how to find the data
- Most modes generate memory addresses; some modes reference registers instead
- Below are the most common formats (we have already used Direct and Indirect in our MARIE examples of the last chapter)
11. Computing These Modes
In Register mode, the operand is stored in a register, and the register is specified in the instruction. Example: Add R1, R2, R3.

In Immediate mode, the operand is in the instruction itself, such as Add 5. This is used when the datum is known at compile time; it is the quickest form of addressing.

In Direct mode, the operand is in memory and the instruction contains a reference to the memory location. Because there is a memory access, this method is slower than Register. Examples: Add Y (in assembly), Add 110111000 (in machine language).

In Indirect mode, the memory reference is to a pointer; this requires two memory accesses and so is the slowest of all addressing modes.
12. Continued
Indexed or Based mode is like Direct except that the address referenced is computed as a combination of a base value stored in a register and an offset in the instruction. Example: Add R3(300). This is also called Displacement or Base Displacement.

Register Indirect mode is like Indirect except that the instruction references a pointer in a register, not in memory, so one memory access is saved. Notice that Register and Register Indirect can permit shorter instructions, because a register specification is shorter than a memory address specification.

In Stack mode, the operand is at the top of the stack, where the stack is pointed to by a special register called the Stack Pointer; this is like Register Indirect in that it accesses a register followed by memory.
13. Example
Assume memory stores the values as shown to the left, and that register R1 stores 800.

Assume our instruction is Load 800. The value loaded into the accumulator depends on the addressing mode used, as shown below:
- Immediate: the datum is 800 itself
- Direct: the datum's location is address 800
- Indirect: the datum's location is pointed to by the value stored at address 800
- Indexed: the datum's location is R1 + 800 (1600)
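Since the slide's memory figure is not reproduced here, the snippet below uses made-up memory contents to show how each mode interprets Load 800:

```python
# Hypothetical memory: address -> value (stands in for the slide's figure)
memory = {800: 900, 900: 1000, 1600: 700}
R1 = 800
operand = 800  # the address/constant field of "Load 800"

immediate = operand                 # the datum is 800 itself
direct = memory[operand]            # datum at address 800        -> 900
indirect = memory[memory[operand]]  # memory[800] points to datum -> 1000
indexed = memory[R1 + operand]      # datum at R1 + 800 = 1600    -> 700

print(immediate, direct, indirect, indexed)  # 800 900 1000 700
```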
14. Instruction Types
- Now that we have explored some of the issues in designing an instruction set, let's consider the types of instructions:
- Data movement (load, store)
- I/O
- Arithmetic (+, -, *, /)
- Boolean (AND, OR, NOT, XOR, Compare)
- Bit manipulation (rotate, shift)
- Transfer of control (conditional branch, unconditional branch, branch and link, trap)
- Special purpose (halt, interrupt, others)
- The arithmetic, Boolean, bit manipulation, and transfer-of-control instructions use the ALU
  - Note: branches add to, subtract from, or otherwise change the PC, so these use the ALU
- The data movement and I/O instructions use memory or I/O
15. Instruction-Level Pipelining
- We have already covered the fetch-execute process
- It turns out that, if we are clever about designing our architecture, we can design the fetch-execute cycle so that each phase uses different hardware
  - we can overlap instruction execution in a pipeline
  - the CPU becomes like an assembly line: instructions are fetched from memory and sent down the pipeline one at a time
  - the first instruction is at stage 2 when the second instruction is at stage 1
  - or, instruction j is at stage 1 when instruction j - 1 is at stage 2 and instruction j - 2 is at stage 3, etc.
- The length of the pipeline determines how many overlapping instructions we can have
  - The longer the pipeline, the greater the overlap, and so the greater the potential for speedup
  - It turns out that long pipelines are difficult to keep running efficiently, though, so shorter pipelines are often used
16. A 6-stage Pipeline
- Stage 1: Fetch instruction
- Stage 2: Decode op code
- Stage 3: Calculate operand addresses
- Stage 4: Fetch operands (usually from registers)
- Stage 5: Execute instruction (includes computing the new PC for branches, and doing loads and stores)
- Stage 6: Store result (in a register, for an ALU operation or a load)
This is a pipeline timing diagram showing how
instructions overlap
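The timing diagram itself is not reproduced here, but the overlap it depicts can be regenerated with a short script: instruction i enters stage 1 at cycle i + 1, so at cycle t it occupies stage t - i.

```python
K = 6  # pipeline stages
N = 4  # instructions shown

def stage_at(instr, cycle):
    """Stage occupied by instruction `instr` (0-based) at clock `cycle` (1-based), or None."""
    s = cycle - instr  # instruction i enters stage 1 at cycle i + 1
    return s if 1 <= s <= K else None

# Print one row per instruction, one column per clock cycle.
for i in range(N):
    cells = [f"S{stage_at(i, t)}" if stage_at(i, t) else "--"
             for t in range(1, N + K)]
    print(f"inst{i}: " + " ".join(cells))
```

Each row is shifted one cycle right of the row above it, which is exactly the staircase pattern of a pipeline timing diagram.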
17. Pipeline Performance
- Assume a machine has 6 steps in the fetch-execute cycle
- A non-pipelined machine will take 6 * n clock cycles to execute a program of n instructions
- A pipelined machine will take n + (6 - 1) clock cycles to execute the same program!
- If n is 1000, the pipelined machine is 6000 / 1005 times faster, or a speedup of almost 6 times!
- In general, a pipeline's performance is computed as
  - Time = (k + n - 1) * tp, where k = number of stages, n = number of instructions, and tp is the time per stage (plus delays caused by moving instructions down the pipeline)
- The non-pipelined machine's time is k * n * tp
- So the speedup is (k * n * tp) / ((k + n - 1) * tp)
- However, a pipeline faces problems, caused by overlapping the execution of instructions, that slow it down
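Once tp cancels, the speedup formula above reduces to k * n / (k + n - 1), which a one-line function makes concrete:

```python
def pipeline_speedup(k, n):
    """Speedup of a k-stage pipeline over a non-pipelined machine for n instructions."""
    return (k * n) / (k + n - 1)  # the per-stage time tp cancels out

print(pipeline_speedup(6, 1000))  # about 5.97, approaching 6 as n grows
```

For a single instruction (n = 1) the speedup is exactly 1, since pipelining only pays off when instructions overlap.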
18. Pipeline Problems
- Pipelines are impacted by:
- Resource conflicts
  - If it takes more than 1 cycle to perform a stage, then the next instruction cannot move into that stage
  - For instance, floating point operations often take 2-10 cycles to execute, rather than the single cycle of most integer operations
- Data dependences
  - Consider:
    - Load R1, X
    - Add R3, R2, R1
  - Since we want to add in the 5th stage, but the datum in R1 is not available until the previous instruction reaches the 6th stage, the Add must be postponed by at least 1 cycle
- Branches
  - In a branch, the PC is changed, but in a pipeline we may have already fetched one or more instructions before we reach the stage in the pipeline where the PC is changed!
19. Impact of Branches
- In our 6-stage pipeline:
  - we compute the new PC at the 5th stage, so we would have fetched 4 wrong instructions (these 4 instructions are the branch penalty)
  - thus, every branch slows the pipeline down by 4 cycles, because 4 wrong instructions were already fetched
- Consider the four-stage pipeline below:
  - S1: fetch instruction
  - S2: decode instruction, compute operand addresses
  - S3: fetch operands
  - S4: execute instruction, store result (this includes computing the PC value)
- Here, every branch instruction is followed by 3 incorrectly fetched instructions, or a branch penalty of 3
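Both penalties above follow from the same rule: if the PC is updated in stage s, then s - 1 wrong-path instructions have already been fetched behind the branch.

```python
def branch_penalty(resolve_stage):
    """Wrong-path instructions fetched before a branch resolves in `resolve_stage`."""
    return resolve_stage - 1

print(branch_penalty(5))  # 6-stage pipeline, PC computed in stage 5 -> penalty 4
print(branch_penalty(4))  # 4-stage pipeline, PC computed in stage 4 -> penalty 3
```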
20. Other Ideas
- In order to improve performance, architects have come up with all kinds of interesting ideas to maintain a pipeline's performance
- Superscalar: have multiple pipelines so that the CPU can fetch, decode, and execute 2 or more instructions at a time
- Branch prediction: when it comes to branching, try to guess in advance whether the branch will be taken, and if so, where it goes, to lower or remove the branch penalty; if you guess wrong, start over from where you guessed incorrectly
- Compiler optimizations: let the compiler rearrange your assembly code so that data dependencies are broken up and branch penalties are removed by filling the slots after a branch with neutral instructions
- Superpipelining: divide pipeline stages into substages to obtain greater overlap without necessarily changing the clock speed
- We study these ideas in 462
21. Real ISAs
- Intel: 2 operands, variable-length instructions, register-memory operations (but not memory-memory); pipelined and superscalar, with speculation, but at the microcode level
- MIPS: fixed-length, 3-operand instructions when the operands are in registers, load-store otherwise; 8-stage superpipeline; very simple instruction set