Title: Instruction Set Principles and Examples
1Instruction Set Principles and Examples
2Outline
- Introduction
- Classifying instruction set architectures
- Instruction set measurements
- Memory addressing
- Addressing modes for signal processing
- Type and size of operands
- Operations in the instruction set
- Operations for media and signal processing
- Instructions for control flow
- Encoding an instruction set
- Role of compilers
- MIPS architecture
3Brief Introduction to ISA
- Instruction Set Architecture: a set of instructions
- Each instruction is directly executed by the CPU's hardware
- How is it represented?
- By a binary format, since the hardware understands only bits
- Concatenate together binary encodings for instructions, registers, constants, memories
- Typical physical blobs are bits, bytes, words, n-words
- Word size is typically 16, 32, 64 bits today
- Options: fixed or variable length formats
- Fixed: each instruction encoded in same size field (typically 1 word)
- Variable: half-word, whole-word, multiple-word instructions are possible
4Example of Program Execution
- Commands (opcodes)
- 1: Load AC from memory
- 2: Store AC to memory
- 5: Add to AC from memory
- Add the contents of memory 940 to the contents of memory 941 and store the result at 941
[Figure: fetch and execution steps of the program]
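The fetch and execute steps above can be sketched as a toy accumulator machine. This is only an illustration under stated assumptions: the 16-bit instruction word (4-bit opcode, 12-bit address), the program location 0x300, the reading of 940 and 941 as hexadecimal addresses, and the operand values 3 and 2 are not given on the slide.

```python
# Assumed instruction word: top 4 bits = opcode, low 12 bits = address.
LOAD, STORE, ADD = 0x1, 0x2, 0x5  # opcodes 1, 2, 5 from the slide


def run(memory, pc, steps):
    """Run `steps` fetch/execute cycles and return the final accumulator."""
    ac = 0
    for _ in range(steps):
        instruction = memory[pc]  # fetch
        pc += 1
        opcode, address = instruction >> 12, instruction & 0xFFF
        if opcode == LOAD:        # execute
            ac = memory[address]
        elif opcode == STORE:
            memory[address] = ac
        elif opcode == ADD:
            ac += memory[address]
        else:
            raise ValueError(f"unknown opcode {opcode}")
    return ac


memory = {
    0x300: 0x1940,  # Load AC from 940
    0x301: 0x5941,  # Add contents of 941 to AC
    0x302: 0x2941,  # Store AC to 941
    0x940: 3,       # assumed operand value
    0x941: 2,       # assumed operand value
}
final_ac = run(memory, pc=0x300, steps=3)
print(memory[0x941])  # 5
```

Each loop iteration is one fetch followed by one execute, so the three-instruction program takes three cycles.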
5A Note on Measurements
- We're taking the quantitative approach
- BUT measurements will vary
- Due to application selection or application mix
- Due to the particular compiler being used
- Also dependent on compiler optimization selection
- And the target ISA
- Hence the measurements we'll talk about
- Are useful to understand the method
- Are a typical yet small sample derived from benchmark codes
- To do it for real
- You would want lots of real applications
- Plus your compiler and ISA
6Classifying Instruction Set Architecture
7Instruction Set Design
The instruction set influences everything
8Instruction Characteristics
- Usually a simple operation
- Which operation is identified by the op-code field
- But operations require operands: 0, 1, or 2
- To identify where they are, they must be addressed
- Address is to some piece of storage
- Typical storage possibilities are main memory, registers, or a stack
- 2 options: explicit or implicit addressing
- Implicit: the op-code implies the address of the operands
- ADD on a stack machine pops the top 2 elements of the stack, then pushes the result
- HP calculators work this way
- Explicit: the address is specified in some field of the instruction
- Note the potential for 3 addresses: 2 operands + the destination
9Classifying Instruction Set Architectures
Based on CPU internal storage options AND # of operands
These choices critically affect # of instructions, CPI, and cycle time
10Operand Locations for Four ISA Classes
11C = A + B
- Stack
- Push A
- Push B
- Add
- Pop the top 2 values of the stack (A, B) and push the result value onto the stack
- Pop C
- Accumulator (AC)
- Load A
- Add B
- Add AC (A) with B and store the result into AC
- Store C
- Register (register-memory)
- Load R1, A
- Add R3, R1, B
- Store R3, C
- Register (load-store)
- Load R1, A
- Load R2, B
- Add R3, R1, R2
- Store R3, C
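To make the contrast concrete, here is a minimal sketch that interprets the stack and load-store sequences above. The tuple-based program format, the function names, and the values A = 2 and B = 3 are assumptions made for illustration only.

```python
def run_stack(program, mem):
    """Stack machine: operands are implicit (top of stack)."""
    stack = []
    for op, *args in program:
        if op == "push":
            stack.append(mem[args[0]])
        elif op == "add":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif op == "pop":
            mem[args[0]] = stack.pop()
    return mem


def run_load_store(program, mem):
    """Load-store machine: ALU operands are explicit registers."""
    regs = {}
    for op, *args in program:
        if op == "load":
            regs[args[0]] = mem[args[1]]
        elif op == "add":
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "store":
            mem[args[1]] = regs[args[0]]
    return mem


stack_prog = [("push", "A"), ("push", "B"), ("add",), ("pop", "C")]
ls_prog = [("load", "R1", "A"), ("load", "R2", "B"),
           ("add", "R3", "R1", "R2"), ("store", "R3", "C")]

stack_result = run_stack(stack_prog, {"A": 2, "B": 3})
ls_result = run_load_store(ls_prog, {"A": 2, "B": 3})
print(stack_result["C"], ls_result["C"])  # 5 5
```

Both programs have four instructions, but only the load-store version names every operand explicitly, which is what gives a compiler freedom to reorder and reuse values.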
12Pros and Cons of Stack, Accumulator, Register
Machine
13Modern Choice Load-store Register (GPR)
Architecture
- Reasons for choosing GPR (general-purpose registers) architecture
- Registers (like stacks and accumulators) are faster than memory
- Registers are easier and more effective for a compiler to use
- (A*B) - (C*D) - (E*F)
- May be evaluated in any order (for pipelining concerns or ...)
- But on a stack machine -> must be evaluated left to right
- Registers can be used to hold variables
- Reduce memory traffic
- Speed up programs
- Improve code density (fewer bits are used to name a register)
- Compiler writers prefer that all registers be equivalent and unreserved
- The number of GPRs: at least 16
14Characteristics Divide GPR Architectures
- # of operands
- Three-operand: 1 result and 2 source operands
- Two-operand: 1 both source/result and 1 source
- How many operands are memory addresses
- 0 to 3 (two sources + 1 result)
Load-store
Register-memory
Memory-memory
15Pros and Cons of Three Most Common GPR Computers
16Short Summary Classifying Instruction Set
Architectures
- Expect the use of general-purpose registers
- Figure 2.4 pipelining (Appendix A)
- Expect the use of Register-Register (load-store)
GPR architecture
17Memory Addressing
18Memory Addressing Basics
All architectures must address memory
- What is accessed: byte, word, multiple words?
- Today's machines are byte addressable
- Main memory is organized in 32- to 64-byte lines
- Big-Endian or Little-Endian addressing
- Hence there is a natural alignment problem
- An access of size s bytes at byte address A is aligned if A mod s = 0
- A misaligned access takes multiple aligned memory references
- Memory addressing mode influences instruction count (IC) and clock cycles per instruction (CPI)
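The alignment rule and the two byte orders can be checked with a short sketch. The helper names are hypothetical, and counting "aligned references" as the number of size-s blocks an access touches is a simplifying assumption about the memory system.

```python
import struct


def is_aligned(address, size):
    """An access of `size` bytes at `address` is aligned if address mod size = 0."""
    return address % size == 0


def aligned_references(address, size):
    """Number of aligned size-byte blocks touched by the access."""
    first_block = address // size
    last_block = (address + size - 1) // size
    return last_block - first_block + 1


print(is_aligned(8, 4), is_aligned(6, 4))  # True False
print(aligned_references(8, 4))            # 1
print(aligned_references(6, 4))            # 2

# Byte order of the 32-bit value 0x12345678 in memory.
big = struct.pack(">I", 0x12345678).hex()
little = struct.pack("<I", 0x12345678).hex()
print(big, little)  # 12345678 78563412
```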
19Typical Address Modes (I)
20Typical Address Modes (II)
21Use of Memory Addressing Mode (Figure 2.7)
Based on a VAX, which supported everything
Not counting register mode (50% of all)
22Displacement Field Size
At least 12 to 16 bits (captures 75% to 99% of the displacements)
23Immediate Operands
24Distribution of Immediate Values
25Addressing Modes for Signal Processing
- DSPs deal with infinite, continuous streams of data, so they routinely rely on circular buffers
- Modulo or circular addressing mode
- Support data shuffling in the Fast Fourier Transform (FFT)
- Bit-reverse addressing
- 011₂ -> 110₂
- However, the two fancy addressing modes are not used heavily
- Mismatch between what programmers and compilers actually use versus what architects expect
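Both DSP addressing modes reduce to simple index arithmetic. The sketch below is a software illustration with hypothetical function names; real DSPs perform these updates in the address generation hardware.

```python
def circular_next(index, step, length):
    """Modulo (circular) addressing: wrap the index at the buffer length."""
    return (index + step) % length


def bit_reverse(value, bits):
    """Bit-reverse addressing: reverse the low `bits` bits of `value`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


# Walking a circular buffer of 8 entries starting at index 6.
walk = [6]
for _ in range(3):
    walk.append(circular_next(walk[-1], 1, 8))
print(walk)  # [6, 7, 0, 1]

print(format(bit_reverse(0b011, 3), "03b"))   # 110
print([bit_reverse(i, 3) for i in range(8)])  # [0, 4, 2, 6, 1, 5, 3, 7]
```

The last line is the access order an 8-point FFT uses when it shuffles its data.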
26Frequency of Addressing Modes for TI TMS320C54x
DSP
27Short Summary Memory Addressing
- Need to support at least three addressing modes
- Displacement, immediate, and register deferred (register indirect)
- They represent 75% to 99% of the addressing modes in benchmarks
- The size of the address for displacement mode should be at least 12 to 16 bits (75% to 99%)
- The size of the immediate field should be at least 8 to 16 bits (50% to 80%)
28Operand Type Size
- Specified by instruction (opcode) or by hardware tag
- Tagged machines are extinct
- Typical types (assume word = 32 bits)
- Character: byte, ASCII or EBCDIC (IBM), 4 per word
- Short integer: 2 bytes, 2's complement
- Integer: one word, 2's complement
- Float: one word, usually IEEE 754 these days
- Double precision float: 2 words, IEEE 754
- BCD or packed decimal: 4-bit values packed 8 per word
- Instructions will be needed for common conversions; software can do the rare ones
29Data Access Patterns
30Operands for Media and Signal Processing
- Graphics applications: vertex
- (x, y, z), plus w to help with color or hidden surfaces (R, G, B, A)
- 32-bit floating-point values
- DSPs
- Fixed point: a binary point just to the right of the sign bit
- Represent fractions between -1 and +1
- Have a separate exponent variable
- Blocked floating point: a block of variables has a common exponent
- Need some registers that are wider to guard against round-off error
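As an illustration of the fixed-point format described above, here is a sketch of a 16-bit fraction with the binary point just to the right of the sign bit (commonly called Q15). The 16-bit width, the saturation behaviour, and the function names are assumptions; the slide does not fix a word size.

```python
def to_q15(x):
    """Convert a fraction in [-1, 1) to a 16-bit fixed-point integer, saturating."""
    return max(-32768, min(32767, round(x * 32768)))


def from_q15(q):
    """Interpret a 16-bit fixed-point integer as a fraction."""
    return q / 32768


def dot_q15(xs, ys):
    """Multiply-accumulate in a wide accumulator, then scale back once."""
    acc = 0  # wider than 16 bits: holds full products with 30 fractional bits
    for x, y in zip(xs, ys):
        acc += x * y
    return acc >> 15  # back to 15 fractional bits


half = to_q15(0.5)
minus_quarter = to_q15(-0.25)
print(half, minus_quarter)  # 16384 -8192

product = dot_q15([half], [minus_quarter])
print(from_q15(product))  # -0.125
```

Keeping the running sum in the wide accumulator and rounding only once at the end is the reason DSPs provide registers wider than the operands in memory.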
31Operand Type and Size in DSP
32Short Summary Type and Size of Operand
- The future, as we go to 64-bit machines
- Decimal's future is unclear
- Larger offsets, immediates, etc. are likely
- Usage of 64- and 128-bit values will increase
- DSPs need wider accumulating registers than the
size in memory to aid accuracy in fixed-point
arithmetic
33What Operations are Needed
- Arithmetic and logical
- Integer arithmetic: ADD, SUB, MULT, DIV, SHIFT
- Logical operations: AND, OR, XOR, NOT
- Data transfer: copy, load, store
- Control: branch, jump, call, return, trap
- System: OS and memory management
- We'll ignore these for now, but remember they are needed
- Floating point
- Same as arithmetic but usually take bigger operands
- Decimal: if you go for it, what else do you need?
- Legacy from COBOL and the commercial application domain
- String: move, compare, search
- Graphics: pixel and vertex, compression/decompression operations
34Top 10 Instructions for 80x86
- load 22%
- conditional branch 20%
- compare 16%
- store 12%
- add 8%
- and 6%
- sub 5%
- move register-register 4%
- call 1%
- return 1%
- The most widely executed instructions are the simple operations of an instruction set
- The top-10 instructions for 80x86 account for 96% of instructions executed
- Make them fast, as they are the common case
35Control Instructions are a Big Deal
- Jumps: unconditional transfer
- Conditional branches
- How is the condition set? By a flag or as part of the instruction
- How is target specified? How far away is it?
- Calls
- How is target specified? How far away is it?
- Where is return address kept?
- How are the arguments passed? Callee vs. caller save!
- Returns
- Where is the return address? How far away is it?
- How are the results passed?
36Breakdown of Control Flows
- Call/Returns
- Integer 19%, FP 8%
- Jump
- Integer 6%, FP 10%
- Conditional Branch
- Integer 75%, FP 82%
37Branch Address Specification
- Known at compile time for unconditional and conditional branches, hence specified in the instruction
- As a register containing the target address
- As a PC-relative offset
- Consider word length addresses, registers, and instructions
- Full address desired? Then pick the register option.
- BUT setup and effective address will take longer.
- If you can deal with a smaller offset, then PC-relative works
- PC-relative is also position independent, so linker duty is simple
38Returns and Indirect Jumps
- Branch target is not known at compile time
- Need a way to specify the target dynamically
- Use a register
- Permit any addressing mode
- Regs[R4] <- Regs[R4] + Mem[Regs[R1]]
- Also useful for
- case or switch
- Dynamically shared libraries
- Higher-order functions or function pointers
- Virtual functions in OO
39Branch Stats - 90% are PC Relative
- Call/Return
- TeX 16%, Spice 13%, GCC 10%
- Jump
- TeX 18%, Spice 12%, GCC 12%
- Conditional
- TeX 66%, Spice 75%, GCC 78%
40Branch Distances
41Condition Testing Options
42What kinds of compares do Branches Use?
43Direction, Frequency, and real Change
Key points: 75% are forward branches
Most backward branches are loops, taken about 90% of the time
Branch statistics are both compiler and application dependent
Any loop optimizations may have a large effect
44Short Summary Operations in the Instruction Set
- Branch addressing should be able to jump to about 100 instructions either above or below the branch
- This implies a PC-relative branch displacement of at least 8 bits
- Register-indirect and PC-relative addressing for jump instructions to support returns as well as many other features of current systems
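The step from "about 100 instructions either way" to "at least 8 bits" follows from the range of a two's-complement field. A minimal sketch, assuming the displacement counts instructions, that instructions are 4 bytes, and that the offset is applied to the given PC value directly (real ISAs often add it to the address of the following instruction):

```python
def signed_bits_needed(reach):
    """Smallest two's-complement width holding every offset in [-reach, +reach]."""
    bits = 1
    while not (-(1 << (bits - 1)) <= -reach and reach <= (1 << (bits - 1)) - 1):
        bits += 1
    return bits


def branch_target(pc, offset_in_instructions, instruction_bytes=4):
    """PC-relative target; instruction_bytes=4 is an assumed fixed width."""
    return pc + offset_in_instructions * instruction_bytes


print(signed_bits_needed(100))            # 8  (range -128..127)
print(signed_bits_needed(63))             # 7  (range -64..63)
print(hex(branch_target(0x1000, -3)))     # 0xff4
```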
45Encoding an Instruction Set
46Encoding the ISA
- Encode instructions into a binary representation for execution by CPU
- Can pick anything but
- Affects the size of code, so it should be tight
- Affects the CPU design, in particular the instruction decode
- So it may have a big influence on the CPI or cycle time
- Must balance several competing forces
- Desire for lots of addressing modes and registers
- Desire to make average program size compact
- Desire to have instructions encoded into lengths that will be easy to handle in a pipelined implementation (multiple of bytes)
473 Popular Encoding Choices
- Variable (compact code but difficult to decode)
- Primary opcode is fixed in size, but opcode modifiers may exist
- Opcode specifies number of arguments, each used as address fields
- Best when there are many addressing modes and operations
- Use as few bits as possible, but individual instructions can vary widely in length
- e.g. VAX: integer ADD versions vary between 3 and 19 bytes
- Fixed (easy to decode, but lengthy code)
- Every instruction looks the same; some fields may be interpreted differently
- Combine the operation and the addressing mode into the opcode
- e.g. all modern RISC machines
- Hybrid
- Set of fixed formats
- e.g. IBM 360 and Intel 80x86
Trade-off between size of program vs. ease of decoding
483 Popular Encoding Choices (Cont.)
49An Example of Variable Encoding -- VAX
- addl3 r1, 737(r2), (r3): 32-bit integer add instruction with 3 operands -> needs 6 bytes to represent it
- Opcode for addl3: 1 byte
- A VAX address specifier is 1 byte (4-bit addressing mode, 4-bit register)
- r1: 1 byte (register addressing mode + r1)
- 737(r2)
- 1 byte for address specifier (displacement addressing + r2)
- 2 bytes for displacement 737
- (r3): 1 byte for address specifier (register indirect + r3)
- Length of VAX instructions: 1 to 53 bytes
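The 6-byte total for the addl3 example is just the sum of the pieces listed above. A small sketch of that arithmetic; the helper name and dictionary layout are hypothetical, and the byte counts are the ones from the slide:

```python
def operand_bytes(displacement_bytes=0):
    """One address-specifier byte plus any displacement bytes that follow it."""
    return 1 + displacement_bytes


addl3 = {
    "opcode": 1,                                      # addl3 opcode
    "r1": operand_bytes(),                            # register mode
    "737(r2)": operand_bytes(displacement_bytes=2),   # displacement mode
    "(r3)": operand_bytes(),                          # register indirect
}
total = sum(addl3.values())
print(total)  # 6
```

A decoder has to walk these specifiers one at a time to find where the instruction ends, which is the decoding cost that variable encoding pays for its compact code.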
50Short Summary Encoding the Instruction Set
- Choice between variable and fixed instruction encoding
- Code size matters more than performance -> variable encoding
- Performance matters more than code size -> fixed encoding
51Role of Compilers
- Critical goals in ISA from the compiler viewpoint
- What features will lead to high-quality code
- What makes it easy to write efficient compilers
for an architecture
52Compiler and ISA
- ISA decisions are no longer made for ease of programming in assembly language (AL)
- Due to HLLs, the ISA is a compiler target today
- Performance of a computer will be significantly affected by the compiler
- Understanding compiler technology today is critical to designing and efficiently implementing an instruction set
- Architecture choice affects the code quality and the complexity of building a compiler for it
53Goal of the Compiler
- Primary goal is correctness
- Second goal is speed of the object code
- Others
- Speed of the compilation
- Ease of providing debug support
- Inter-operability among languages
- Flexibility of the implementation: languages may not change much but they do evolve, e.g. Fortran 66 -> HPF
Make the frequent cases fast and the rare case correct
54Typical Modern Compiler Structure
Common Intermediate Representation
Somewhat language dependent, largely machine independent
Small language dependencies, slight machine dependencies
Language independent, highly machine dependent
55Typical Modern Compiler Structure (Cont.)
- Multi-pass structure -> easier to write bug-free compilers
- Transform high-level, more abstract representations into progressively lower-level representations, eventually reaching the instruction set
- Compilers must make assumptions about the ability of later steps to deal with certain problems
- Ex. 1: choose which procedure calls to expand inline before they know the exact size of the procedure being called
- Ex. 2: global common sub-expression elimination
- Find two instances of an expression that compute the same value and save the result of the first one in a temporary
- Temporary must be in a register, not memory (performance)
- Assume the register allocator will allocate the temporary into a register
56Optimization Types
- High level: done at source code level
- Procedure called only once, so put it in-line and save the CALL
- Local: done on a basic sequential block (straight-line code)
- Common sub-expressions produce the same value
- Constant propagation: replace a constant-valued variable with the constant, saving multiple variable accesses with the same value
- Global: same as local but done across branches
- Code motion: remove code from loops that computes the same value on each pass and put it before the loop
- Simplify or eliminate array addressing calculations in loops
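Two of the optimizations above, code motion and common sub-expression elimination, can be shown as a hand-written before/after pair. This is a source-level illustration with made-up function and variable names; a compiler would perform the same rewrite on its intermediate representation.

```python
def before(values, scale, offset):
    out = []
    for v in values:
        k = scale * 4 + offset     # same value on every pass of the loop
        out.append(v * k + v * k)  # v * k is computed twice
    return out


def after(values, scale, offset):
    k = scale * 4 + offset         # code motion: hoisted before the loop
    out = []
    for v in values:
        t = v * k                  # common sub-expression kept in a temporary
        out.append(t + t)
    return out


original = before([1, 2, 3], 2, 1)
optimized = after([1, 2, 3], 2, 1)
print(original == optimized)  # True
print(optimized)              # [18, 36, 54]
```

The temporary `t` only pays off if it lives in a register, which is the assumption about the register allocator that the earlier pass has to make.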
57Optimization Types (Cont.)
- Register allocation
- Use graph coloring (graph theory) to allocate registers
- NP-complete
- Heuristic algorithm works best when there are at least 16 (and preferably more) registers
- Processor-dependent optimization
- Strength reduction: replace multiply with shift and add sequence
- Pipeline scheduling: reorder instructions to minimize pipeline stalls
- Branch offset optimization: reorder code to minimize branch offsets
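Because exact graph coloring is NP-complete, allocators rely on heuristics. The sketch below is one simple greedy heuristic, not the algorithm of any particular compiler: variables that are live at the same time are joined by an edge, and each variable gets the lowest-numbered register not used by its neighbors. The interference graph and the "return None when out of registers" behaviour are assumptions; a real allocator would spill a variable to memory instead.

```python
def color_registers(interference, num_registers):
    """Greedy coloring: {variable: register index}, or None if registers run out."""
    assignment = {}
    # Heuristic: handle the variables with the most conflicts first.
    order = sorted(interference, key=lambda v: len(interference[v]), reverse=True)
    for var in order:
        used = {assignment[n] for n in interference[var] if n in assignment}
        free = [r for r in range(num_registers) if r not in used]
        if not free:
            return None  # a real allocator would spill here
        assignment[var] = free[0]
    return assignment


# Hypothetical interference graph: an edge means "live at the same time".
interference = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
with_three = color_registers(interference, 3)
with_two = color_registers(interference, 2)
print(with_three is not None, with_two)  # True None
```

With more registers the greedy choice rarely gets stuck, which is one way to read the "at least 16 registers" guideline.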
58Major Types of Optimizations and Example in Each
Class
59Change in IC Due to Optimization
- Level 1: local optimizations, code scheduling, and local register allocation
- Level 2: global optimization, loop transformation (software pipelining), global register allocation
- Level 3: procedure integration
60Optimization Observations
- Hard to reduce branches
- Biggest reduction is often memory references
- Some ALU operation reduction happens but it is usually small
- Implication
- Branch, call, and return become a larger relative % of the instruction mix
- Control instructions are among the hardest to speed up
61Impact of Compiler Technology on Architect's
Decisions
- Important questions
- How are variables allocated and addressed?
- How many registers will be needed?
- We must look at 3 areas to allocate data
62Where to allocate data?
- Stack
- Local variable access in activation records, almost no push/pop
- Addressing is relative to the stack pointer
- Grown or shrunk on calls and returns
- Global data area - the easy one
- Constants and global static structures
- For arrays addressing may be indexed off head
- Heap
- Used for dynamic objects
- Access usually by pointers
- Data is typically not scalar
63Register Allocation Data
- Reasonably simple for stack objects
- Hard for global data due to aliasing opportunity
- Must be conservative
- Heap objects and pointers in general are even harder
- Computed pointers make it impossible to allocate the target data to a register
- Any structured data (string, array, etc.) is too big to keep in a register
- Since register allocation is a major optimization source
- The effect is clearly important
Aliasing example: p = &a; a = ...; *p = ...; ...a...
64How can Architects Help Compiler Writers
- Provide regularity
- Address modes, operations, and data types should be orthogonal (independent) of each other
- Simplifies code generation, especially multi-pass
- Counterexample: restricting which registers can be used for certain classes of instructions
- Provide primitives, not solutions
- Special features that match an HLL construct are often unusable
- What works in one language may be detrimental to others
65How can Architects Help Compiler Writers (Cont.)
- Simplify trade-offs among alternatives
- How to write good code? What is good code?
- Metric: IC or code size (no longer true) -> caches and pipelining
- Anything that makes code sequence performance obvious is a definite win!
- How many times should a variable be referenced before it is cheaper to load it into a register?
- Provide instructions that bind the quantities known at compile time as constants
- Don't hide compile-time constants
- Instructions that work off of something the compiler thinks could be a run-time determined value handcuff the optimizer
66Short Summary -- Compilers
- ISA has at least 16 GPRs (not counting FP registers) to simplify allocation of registers using graph coloring
- Orthogonality suggests all supported addressing modes apply to all instructions that transfer data
- Simplicity: understand that less is more in ISA design
- Provide primitives instead of solutions
- Simplify trade-offs between alternatives
- Don't bind constants at runtime
- Counterexample: lack of compiler support for multimedia instructions
67The MIPS Architecture
68Expectations for New ISA
- Use general-purpose registers, with a load-store architecture
- Support displacement (offset size 12 to 16 bits), immediate (size 8 to 16 bits), and register indirect addressing
- Support 8-, 16-, 32-, and 64-bit integers and 64-bit IEEE 754 floating-point numbers
- Support the following simple instructions: load, store, add, subtract, move register-register, and, shift, compare equal, compare not equal, branch (with a PC-relative address at least 8 bits long), jump, call, return
- Use fixed instruction encoding if interested in performance and use variable instruction encoding if interested in code size
- Provide at least 16 general-purpose registers (GPRs) plus separate floating-point registers, be sure all addressing modes apply to all data transfer instructions, and aim for a minimalist instruction set
69MIPS
- Simple load-store ISA
- Enable efficient pipeline implementation
- Fixed instruction set encoding
- Efficiency as a compiler target
- MIPS64 variant is discussed here
70Register for MIPS
- 32 64-bit integer GPRs: R0, R1, ..., R31; R0 = 0 always
- 32 FPRs, used for single or double precision
- For single precision: F0, F1, ..., F31 (32-bit)
- For double precision: F0, F2, ..., F30 (64-bit)
- Extra status registers, moved via GPRs
- Instructions for moving between an FPR and a GPR
71Data Types for MIPS
- 8-bit bytes, 16-bit half words, 32-bit words, and 64-bit double words for integer data
- 32-bit single precision and 64-bit double precision for FP
- MIPS64 operations work on 64-bit integers and 32- or 64-bit floating point
- Bytes, half words, and words are loaded into the GPRs with zeros or the sign bit replicated to fill the 64 bits of the GPRs
- All references between memory and either GPRs or FPRs are through loads or stores
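Filling a 64-bit GPR from a narrower value "with zeros or the sign bit replicated" can be sketched as follows. The function names are hypothetical, and Python integers stand in for 64-bit register contents.

```python
MASK64 = (1 << 64) - 1


def zero_extend(value, bits):
    """Keep the low `bits` bits; the upper register bits become zero."""
    return value & ((1 << bits) - 1)


def sign_extend(value, bits):
    """Replicate bit (bits - 1) into all upper bits of a 64-bit register."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        value |= MASK64 & ~((1 << bits) - 1)
    return value


print(hex(zero_extend(0x80, 8)))      # 0x80
print(hex(sign_extend(0x80, 8)))      # 0xffffffffffffff80
print(hex(sign_extend(0x7F, 8)))      # 0x7f
print(hex(sign_extend(0x8000, 16)))   # 0xffffffffffff8000
```

The same byte 0x80 therefore reads as 128 after an unsigned load and as -128 (in two's complement) after a signed load.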
72Addressing Modes for MIPS
- Data addressing: immediate and displacement (16 bits)
- Displacement: Add R4, 100(R1) (Regs[R4] <- Regs[R4] + Mem[100 + Regs[R1]])
- Register-indirect: placing 0 in the displacement field
- Add R4, (R1) (Regs[R4] <- Regs[R4] + Mem[Regs[R1]])
- Absolute addressing (16 bits): using R0 as the base register
- Add R1, (1001) (Regs[R1] <- Regs[R1] + Mem[1001])
- Byte addressable with 64-bit address
- Mode selection for Big Endian or Little Endian
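All three data addressing forms above come from one calculation, base register plus displacement. A minimal sketch; the register contents (R1 = 2000) and the function name are assumptions for illustration.

```python
def effective_address(regs, base, displacement):
    """Displacement addressing: base register + displacement, wrapped to 64 bits."""
    return (regs[base] + displacement) & ((1 << 64) - 1)


regs = {"R0": 0, "R1": 2000}  # R0 always reads as 0; R1 value is assumed

displacement_ea = effective_address(regs, "R1", 100)  # 100(R1)
indirect_ea = effective_address(regs, "R1", 0)        # (R1): displacement 0
absolute_ea = effective_address(regs, "R0", 1001)     # (1001): base R0
print(displacement_ea, indirect_ea, absolute_ea)      # 2100 2000 1001
```

Synthesizing register-indirect and absolute addressing this way keeps a single addressing mode in hardware.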
73MIPS Instruction Format
- Encode addressing mode into the opcode
- All instructions are 32 bits with 6-bit primary
opcode
74MIPS Instruction Format (Cont.)
I-Type Instruction
- Loads and stores: LW R1, 30(R2); S.S F0, 40(R4)
- ALU ops on immediates: DADDIU R1, R2, 3
- rt <- rs op immediate
- Conditional branches: BEQZ R3, offset
- rs is the register checked
- rt unused
- immediate specifies the offset
- Jump register, jump and link register: JR R3
- rs is target register
- rt and immediate are unused but 011
75MIPS Instruction Format (Cont.)
R-Type Instruction
- Register-register ALU operations: rd <- rs funct rt, e.g. DADDU R1, R2, R3
- Function encodes the data path operations: Add, Sub, ...
- Read/write special registers
- Moves
J-Type Instruction: Jump, Jump and Link, Trap and return from exception
[Figure: J-type format, 6-bit opcode followed by 26-bit offset added to PC]
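The three formats can be sketched as bit-packing functions. The field widths assumed here are the conventional MIPS layout (I-type 6/5/5/16, R-type 6/5/5/5/5/6, J-type 6/26); the slide text itself only gives the 6-bit opcode and the 26-bit J-type offset. The LW opcode 0x23 is the standard MIPS value, while the R-type and J-type numbers below are example values chosen only to exercise the fields.

```python
def encode_i(opcode, rs, rt, immediate):
    """I-type: opcode(6) | rs(5) | rt(5) | immediate(16)."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (immediate & 0xFFFF)


def encode_r(opcode, rs, rt, rd, shamt, funct):
    """R-type: opcode(6) | rs(5) | rt(5) | rd(5) | shamt(5) | funct(6)."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


def encode_j(opcode, offset):
    """J-type: opcode(6) | offset(26)."""
    return (opcode << 26) | (offset & 0x3FFFFFF)


def decode_i(word):
    """Split a 32-bit word back into I-type fields."""
    return (word >> 26, (word >> 21) & 0x1F, (word >> 16) & 0x1F, word & 0xFFFF)


# LW R1, 30(R2): rs is the base register R2, rt is the destination R1.
lw = encode_i(0x23, rs=2, rt=1, immediate=30)
print(hex(lw))        # 0x8c41001e
print(decode_i(lw))   # (35, 2, 1, 30)

# Example values only: an R-type word with rd=1, rs=2, rt=3 and funct 0x2D.
r_word = encode_r(0, rs=2, rt=3, rd=1, shamt=0, funct=0x2D)
j_word = encode_j(0x02, 0x100)
```

Every field sits at a fixed position in all three formats, so the decoder can read register numbers before it has finished decoding the opcode.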
76MIPS instruction MIX
SPECint2000
77MIPS instruction MIX (Cont.)
SPECfp2000