Title: CPE 626: Advanced VLSI Design L02
1 CPE 626: Advanced VLSI Design L02
- Department of Electrical and Computer Engineering, University of Alabama in Huntsville
2 Outline
- Simple Processor MU0
  - Datapath Design
  - Control Logic
  - ALU Design
- Pipeline Processor DLX
  - ISA
    - Registers
    - Addressing Modes and Data Types
    - Instruction Format
    - Instruction Set
  - Non-pipeline Implementation
  - Pipeline Implementation
3 MU0: A Simple Processor
- Instruction format
- Instruction set
4 MU0 Logic Design
- Separate the design into two components
  - Datapath: all the components carrying, storing, or processing bits, including the accumulator, program counter, ALU, and instruction register
  - Control logic: everything that does not fit comfortably into the datapath
- Datapath design: there are many ways to do this
  - Assume that memory access is the limiting factor, and that each memory access takes exactly one clock cycle
5 MU0 Datapath Example
- Program Counter (PC)
- Accumulator (ACC)
- Arithmetic-Logic Unit (ALU)
- Instruction Register (IR)
- Instruction Decode and Control Logic
Follow the principle that memory is the limiting factor in the design: each instruction takes exactly the number of clock cycles defined by the number of memory accesses it must make.
Note: We do not have a dedicated PC incrementer! Why?
6 MU0 Datapath Design
- Assume that each instruction starts when it has arrived in the IR
- Step 1: EX (execute)
  - LDA S: ACC <- Mem[S]
  - STO S: Mem[S] <- ACC
  - ADD S: ACC <- ACC + Mem[S]
  - SUB S: ACC <- ACC - Mem[S]
  - JMP S: PC <- S
  - JGE S: if (ACC >= 0) PC <- S
  - JNE S: if (ACC != 0) PC <- S
- Step 2: IF (fetch the next instruction)
  - Either the PC or the address in the IR is issued to fetch the next instruction
  - The address is incremented in the ALU and the value is saved into the PC
- Initialization
  - A reset input starts instruction execution from a known address; here it is 000 hex
  - Provide zero at the ALU output and then load it into the PC register
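The register transfers above can be tried out with a small interpreter. This is only a sketch under stated assumptions: the slides do not give the word layout, so 16-bit words with a 4-bit opcode and a 12-bit address S are assumed here, STP is assumed to be opcode 0111 (the next code after JNE), and the helper names `encode` and `run` are made up for illustration.

```python
# Minimal MU0 interpreter sketch. Assumed (not stated in the slides):
# 16-bit words, opcode in the top 4 bits, 12-bit address S in the low bits.
MASK16 = 0xFFFF
LDA, STO, ADD, SUB, JMP, JGE, JNE, STP = range(8)  # STP = 0111 is assumed


def encode(opcode, s=0):
    """Pack an opcode and a 12-bit address into one 16-bit instruction."""
    return ((opcode & 0xF) << 12) | (s & 0xFFF)


def run(mem, max_steps=1000):
    """Execute from address 0x000 (the reset address) until STP."""
    pc, acc = 0, 0
    for _ in range(max_steps):
        ir = mem[pc]                      # fetch
        pc = (pc + 1) & 0xFFF             # increment (done in the ALU on MU0)
        opcode, s = ir >> 12, ir & 0xFFF
        if opcode == LDA:
            acc = mem[s]
        elif opcode == STO:
            mem[s] = acc
        elif opcode == ADD:
            acc = (acc + mem[s]) & MASK16
        elif opcode == SUB:
            acc = (acc - mem[s]) & MASK16
        elif opcode == JMP:
            pc = s
        elif opcode == JGE:
            if not acc & 0x8000:          # ACC15 = 0 means ACC >= 0
                pc = s
        elif opcode == JNE:
            if acc != 0:
                pc = s
        elif opcode == STP:
            break
        else:
            raise ValueError(f"undefined opcode {opcode:#x}")
    return acc, pc


# Hypothetical program: Mem[0x012] = Mem[0x010] + Mem[0x011]
memory = [0] * 4096
memory[0:4] = [encode(LDA, 0x010), encode(ADD, 0x011),
               encode(STO, 0x012), encode(STP)]
memory[0x010], memory[0x011] = 7, 5
final_acc, final_pc = run(memory)
```

The interpreter folds the execute and fetch steps into one loop iteration, so it models the instruction semantics only, not the cycle-by-cycle behavior of the datapath.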
7 MU0 RTL Organization
- Control Logic
- Asel
- Bsel
- ACCce (ACC change enable)
- PCce (PC change enable)
- IRce (IR change enable)
- ACCoe (ACC output enable)
- ALUfs (ALU function select)
- MEMrq (memory request)
- RnW (read/write)
- Ex/ft (execute/fetch)
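8 MU0 Control Logic
9 LDA S (0000)
- Ex/ft = 0: ALU function B
- Ex/ft = 1: ALU function B+1
10 STO S (0001)
- Ex/ft = 0: ALU function x (don't care)
- Ex/ft = 1: ALU function B+1
11 ADD S (0010)
- Ex/ft = 0: ALU function A+B
- Ex/ft = 1: ALU function B+1
12 SUB S (0011)
- Ex/ft = 0: ALU function A-B
- Ex/ft = 1: ALU function B+1
13 JMP S (0100)
- Ex/ft = 0: ALU function B+1
14 JGE S (0101)
- Ex/ft = 0, ACC15 = 1: ALU function B+1
- Ex/ft = 0, ACC15 = 0: ALU function B+1
15 JNE S (0110)
- Ex/ft = 0, ACCz = 1: ALU function B+1
- Ex/ft = 0, ACCz = 0: ALU function B+1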
16 STP (0111)
- Ex/ft = 0: ALU function x (don't care)
17 Reset
- Ex/ft = 0: ALU function 0
18 MU0 ALU Design
- ALU functions: A+B, A-B, B, B+1, 0 (used only when reset is active) => 4 functions
- Aen (enable operand A)
- Binv (invert operand B)
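One way to see why Aen and Binv are enough is to model the ALU as a single adder with gated inputs. The sketch below is an illustration, not the slide's circuit: the carry-in control `cin` and the 16-bit width are assumptions, and the reset function (output 0) is left out.

```python
MASK16 = 0xFFFF  # assumed 16-bit datapath


def alu(a, b, aen, binv, cin):
    """Adder-based ALU: gate operand A, optionally invert B, add carry-in."""
    a_in = a if aen else 0
    b_in = (~b & MASK16) if binv else b
    return (a_in + b_in + cin) & MASK16


# Control settings (aen, binv, cin) for the four normal functions.
FUNCTIONS = {
    "A+B": (1, 0, 0),
    "A-B": (1, 1, 1),   # two's complement: A + ~B + 1
    "B":   (0, 0, 0),
    "B+1": (0, 0, 1),
}


def alu_op(name, a, b):
    """Look up the control bits for a named function and apply them."""
    aen, binv, cin = FUNCTIONS[name]
    return alu(a, b, aen, binv, cin)
```

With this encoding, subtraction reuses the adder: inverting B and setting the carry-in adds the two's complement of B.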
19 Another Example: DLX Architecture
20 DLX Registers
- GPRs with load-store architecture
  - 32 32-bit GPRs named R0, R1, ..., R31; R0 = 0
- FPRs (floating-point registers)
  - Single precision: 32 32-bit registers named F0, F1, ..., F31 (accessed independently)
  - Double precision: 16 64-bit registers named F0, F2, ..., F30 (accessed in pairs)
- Instructions that support transfers between GPRs and FPRs
- Other status registers, e.g., the floating-point status register (holds information about the results of FP operations)
21 Addressing Modes and Data Types
- Immediate with a 16-bit value field
- Displacement with a 16-bit displacement
  - Register deferred: derived when disp = 0
  - Absolute: derived from displacement with R0 as the base
- Byte addressable in big-endian with 32-bit addresses
- All memory references are loads/stores through a GPR or FPR and must be aligned
- Data types
  - 8-bit bytes and 16-bit half words (loaded into registers with either zeros or the sign bit replicated to fill 32 bits)
  - 32-bit integers
  - 32-bit single precision and 64-bit double precision for FP
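The three memory addressing cases above all reduce to one computation, base register plus sign-extended displacement. A minimal sketch, assuming a register file modeled as a Python list and a hypothetical helper name `effective_address`:

```python
def effective_address(regs, base, disp, size=4):
    """Displacement addressing: sign-extended 16-bit disp plus base register."""
    disp &= 0xFFFF
    if disp & 0x8000:                 # sign-extend the 16-bit displacement
        disp -= 0x10000
    base_value = 0 if base == 0 else regs[base]   # R0 always reads as 0
    addr = (base_value + disp) & 0xFFFFFFFF
    if addr % size != 0:              # references must be aligned
        raise ValueError(f"unaligned {size}-byte access at {addr:#x}")
    return addr


regs = [0] * 32
regs[2] = 0x1000
displaced = effective_address(regs, base=2, disp=32)     # 32(R2)
deferred = effective_address(regs, base=2, disp=0)       # 0(R2): register deferred
absolute = effective_address(regs, base=0, disp=1000)    # 1000(R0): absolute
```

Register deferred and absolute addressing need no extra hardware: they are the same adder path with a zero displacement or with R0 as the base.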
22 Instruction Formats
- I-type: load, store, arithmetic, logic, relational, shift, branch
- R-type: arithmetic, logic, relational
- J-type: jump, jump and link, trap, return from exception
I-type instruction
- Encodes loads and stores of bytes, words, and half words
- All immediates (rd <- rs1 op immediate)
- Conditional branch instructions (rs1 is the register, rd is unused)
- Jump register, jump and link register (rd = 0, rs = destination, imm. = 0)
R-type instruction
- Register-register ALU operations (rd <- rs1 func rs2); func = add, sub, ...
- Read/write special registers and moves
J-type instruction
- Fields: Opcode (6 bits), Offset added to PC (26 bits)
- Jump and jump and link
- Trap and return from exception
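The field layout of the three formats can be sketched as a decoder. Bit 0 is taken to be the most significant bit, matching the IR6..10 style of numbering used later in these slides. The 11-bit func field in bits 21..31 is an assumption, and the opcode value in the example is a placeholder, not a real DLX code.

```python
def bits(word, first, last):
    """Extract bits first..last of a 32-bit word, with bit 0 as the MSB."""
    width = last - first + 1
    return (word >> (31 - last)) & ((1 << width) - 1)


def sign_extend(value, width):
    """Interpret a width-bit field as a two's complement number."""
    sign = 1 << (width - 1)
    return (value ^ sign) - sign


def decode(word):
    """Return the fields of one word under each of the three formats."""
    return {
        "opcode": bits(word, 0, 5),
        "I": {"rs1": bits(word, 6, 10), "rd": bits(word, 11, 15),
              "imm": sign_extend(bits(word, 16, 31), 16)},
        "R": {"rs1": bits(word, 6, 10), "rs2": bits(word, 11, 15),
              "rd": bits(word, 16, 20), "func": bits(word, 21, 31)},
        "J": {"offset": sign_extend(bits(word, 6, 31), 26)},
    }


# Hypothetical I-type word: opcode 0b001000 is a placeholder value.
word = (0b001000 << 26) | (2 << 21) | (1 << 16) | 0xFFFD
fields = decode(word)
```

Because rs1 sits in the same position in the I-type and R-type formats, the register file can be read before the opcode is fully decoded.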
23 Instructions for Data Transfers
Instruction          Meaning
LB, LBU, SB          Load byte, load byte unsigned, store byte
LH, LHU, SH          Load half word, load half word unsigned, store half word
LW, SW               Load word, store word (to/from integer registers)
LF, LD, SF, SD       Load SP float, load DP float, store SP float, store DP float (SP: single precision, DP: double precision)
MOVI2S, MOVS2I       Move from/to GPR to/from a special register
MOVF, MOVD           Copy one floating-point register or a DP pair to another register or pair
MOVFP2I, MOVI2FP     Move 32 bits from/to FP register to/from integer registers

Notation: <-n transfers n bits, x_0 is the most significant bit of x, (x)^n replicates x n times, ## concatenates.
Example              Meaning
LW R1, 30(R2)        Regs[R1] <-32 Mem[30 + Regs[R2]]
LW R1, 1000(R0)      Regs[R1] <-32 Mem[1000 + 0]
LB R1, 40(R3)        Regs[R1] <-32 (Mem[40 + Regs[R3]]_0)^24 ## Mem[40 + Regs[R3]]
LBU R1, 40(R3)       Regs[R1] <-32 (0)^24 ## Mem[40 + Regs[R3]]
LH R1, 40(R3)        Regs[R1] <-32 (Mem[40 + Regs[R3]]_0)^16 ## Mem[40 + Regs[R3]] ## Mem[41 + Regs[R3]]
LF F0, 50(R3)        Regs[F0] <-32 Mem[50 + Regs[R3]]
LD F0, 50(R2)        Regs[F0] ## Regs[F1] <-64 Mem[50 + Regs[R2]]
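The difference between the signed and unsigned byte and half-word loads above is only in how the upper bits of the register are filled. A small sketch, assuming a big-endian byte-addressed memory modeled as a `bytearray` and a hypothetical helper named `load`:

```python
def load(mem, addr, size, signed):
    """Read `size` bytes big-endian and extend the result to 32 bits."""
    if addr % size != 0:
        raise ValueError("unaligned access")
    value = int.from_bytes(mem[addr:addr + size], "big")
    sign_bit = 1 << (8 * size - 1)
    if signed and value & sign_bit:
        value -= 1 << (8 * size)      # replicate the sign bit
    return value & 0xFFFFFFFF


mem = bytearray(64)
mem[40:44] = bytes([0x80, 0x01, 0x02, 0x03])
lb = load(mem, 40, 1, signed=True)     # LB
lbu = load(mem, 40, 1, signed=False)   # LBU
lh = load(mem, 40, 2, signed=True)     # LH
lw = load(mem, 40, 4, signed=True)     # LW
```

Here the byte at address 40 has its top bit set, so LB fills the upper 24 bits with ones while LBU fills them with zeros.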
24 Arithmetic/Logical Instructions
- All ALU instructions are register-register
  - add, sub, and, or, xor, shift
- Immediate forms are also available
  - LHI loads an immediate value into the most significant 16 bits
- R0 is used to synthesize other operations
  - Loading a constant is an add immediate with R0 as one source
  - A register-register move is an add with R0 as one source
- Compare operations put 1 ("true") in the destination if the condition is met
25 Arithmetic/Logical Instructions (contd)
Instruction                        Meaning
ADD, ADDI, ADDU, ADDUI             Add, add immediate (all immediates are 16 bits); signed and unsigned
SUB, SUBI, SUBU, SUBUI             Subtract, subtract immediate; signed and unsigned
MULT, MULTU, DIV, DIVU             Multiply and divide, signed and unsigned; operands must be floating-point registers; all operations take and yield 32-bit values
AND, ANDI                          And, and immediate
OR, ORI, XOR, XORI                 Or, or immediate, exclusive or, exclusive or immediate
LHI                                Load high immediate: loads upper half of register with immediate
SLL, SRL, SRA, SLLI, SRLI, SRAI    Shifts, both immediate (S__I) and variable form (S__); shifts are shift left logical, right logical, right arithmetic
S__, S__I                          Set conditional: "__" may be LT, GT, LE, GE, EQ, NE

Example              Meaning
ADD R1, R2, R3       Regs[R1] <- Regs[R2] + Regs[R3]
ADDI R1, R2, 3       Regs[R1] <- Regs[R2] + 3
LHI R1, 42           Regs[R1] <- 42 ## 0^16
SLLI R1, R2, 5       Regs[R1] <- Regs[R2] << 5
SLT R1, R2, R3       if (Regs[R2] < Regs[R3]) Regs[R1] <- 1 else Regs[R1] <- 0
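A few of these instructions are easy to check numerically. The sketch below models register values as unsigned 32-bit Python integers; the function names are illustrative only, and the final line assumes that a 16-bit OR immediate is zero-extended.

```python
MASK32 = 0xFFFFFFFF


def to_signed(x):
    """Reinterpret an unsigned 32-bit value as two's complement."""
    return x - (1 << 32) if x & 0x80000000 else x


def lhi(imm):
    """LHI: immediate in the upper 16 bits, zeros in the lower 16."""
    return (imm & 0xFFFF) << 16


def addi(rs1, imm):
    """ADDI with an already sign-extended immediate."""
    return (rs1 + imm) & MASK32


def slli(rs1, shamt):
    """SLLI: shift left logical, discarding bits shifted out."""
    return (rs1 << shamt) & MASK32


def slt(rs1, rs2):
    """SLT: 1 if rs1 < rs2 as signed values, else 0."""
    return 1 if to_signed(rs1) < to_signed(rs2) else 0


r0 = 0
constant = addi(r0, 3)                 # load constant: ADDI R1, R0, 3
moved = (constant + r0) & MASK32       # move: ADD R2, R1, R0
wide = lhi(0x1234) | 0x5678            # LHI, then OR in the low half
```

The last three lines show how R0 and LHI synthesize operations that have no dedicated instruction: constant load, register move, and a full 32-bit constant.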
26 Control-Flow Instructions
- Jump can use a 26-bit signed offset from the PC or the contents of a register
  - Jump-and-link saves the PC in R31
- Conditional branches test the source for zero/non-zero and use a 16-bit signed offset
Instruction      Meaning
BEQZ, BNEZ       Branch GPR equal/not equal to zero; 16-bit offset from PC
BFPT, BFPF       Test comparison bit in the FP status register and branch; 16-bit offset from PC
J, JR            Jumps: 26-bit offset from PC (J) or target in register (JR)
TRAP             Transfer to operating system at a vectored address
RFE              Return to user code from an exception; restore user mode
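The next-PC rule for these instructions can be sketched as one function. Two points here are assumptions rather than statements from the slides: offsets are applied to PC + 4 (the NPC used in the implementation slides), and jump-and-link stores PC + 4 in R31. The mnemonic JAL and the function name `next_pc` are used for illustration.

```python
def sign_extend(value, width):
    """Interpret the low `width` bits as a two's complement number."""
    sign = 1 << (width - 1)
    value &= (1 << width) - 1
    return (value ^ sign) - sign


def next_pc(pc, op, regs, rs1=0, offset=0):
    """Return the PC after a control-flow instruction at address `pc`."""
    npc = (pc + 4) & 0xFFFFFFFF
    if op in ("BEQZ", "BNEZ"):
        taken = (regs[rs1] == 0) == (op == "BEQZ")
        return (npc + sign_extend(offset, 16)) & 0xFFFFFFFF if taken else npc
    if op in ("J", "JAL"):
        if op == "JAL":
            regs[31] = npc                 # link register (assumed PC + 4)
        return (npc + sign_extend(offset, 26)) & 0xFFFFFFFF
    if op == "JR":
        return regs[rs1]
    raise ValueError(op)
```

For example, with R1 equal to zero, `next_pc(100, "BEQZ", regs, rs1=1, offset=0xFFF8)` branches backward by 8 bytes from address 104.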
27 Floating-Point Instructions in DLX
- Moves between single-precision (32-bit) and double-precision (64-bit) registers
- Operations: add, subtract, multiply, divide
- Also, integer multiply/divide on floating-point registers
Instruction                                       Meaning
ADDD, ADDF                                        Add DP, SP numbers
SUBD, SUBF                                        Subtract DP, SP numbers
MULTD, MULTF                                      Multiply DP, SP floating point
DIVD, DIVF                                        Divide DP, SP floating point
CVTF2D, CVTF2I, CVTD2F, CVTD2I, CVTI2F, CVTI2D    Convert instructions: CVTx2y converts from type x to type y, where x and y are one of I (integer), D (double precision), or F (single precision). Both operands are in the FP registers.
__D, __F                                          DP and SP compares: "__" may be LT, GT, LE, GE, EQ, NE; set comparison bit in FP status register.
28 A Simple Implementation of DLX
29 Instruction Execution
- The process of instruction execution is usually broken up into stages (divide and conquer)
  - Smaller stages are easier to design
  - It is easy to optimize (change) one stage without touching the others
- 5 main stages for DLX; each stage takes one clock cycle
  - Instruction Fetch (IF)
  - Instruction Decode / Register fetch cycle (ID)
  - Execution / Effective address cycle (EX)
  - Memory access / Branch completion cycle (MEM)
  - Write-back cycle (WB)
30 Instruction Fetch (IF)
- Send out the PC and fetch the instruction from memory into the instruction register (IR)
  - IR is used to hold the instruction
- Increment the PC by 4 to address the next sequential instruction
  - NPC is used to hold the next sequential address
IR <- Mem[PC]; NPC <- PC + 4
31 Instruction Decode (ID)
- Decode the instruction to determine the instruction type (opcode field: the 6 most significant bits of the instruction)
- Read in data from all necessary registers
  - Temporary registers A and B hold the outputs of the GPRs
  - Imm holds the sign-extended lower 16 bits of the IR
  - Decoding is done in parallel with reading registers, since these fields are at fixed locations
  - A register may be read even if we do not use it
A <- Regs[IR6..10]; B <- Regs[IR11..15]; Imm <- (IR16)^16 ## IR16..31
32 Execution EX (1/2)
- Register-register ALU instruction
  - The ALU performs the operation specified by the opcode on the values in registers A and B; the result is placed in the temporary register ALUOutput
  - ALUOutput <- A op B
- Register-immediate ALU instruction
  - The ALU performs the operation specified by the opcode on the value in register A and the value in register Imm; the result is placed in the temporary register ALUOutput
  - ALUOutput <- A op Imm
33 Execution EX (2/2)
- Memory reference
  - The ALU adds the operands to form the effective address and places the result into the temporary register ALUOutput
  - ALUOutput <- A + Imm
- Branch
  - The ALU adds the NPC to the Imm to compute the address of the branch target
  - Register A is checked to determine whether the branch is taken (for BEQZ, op is ==; for BNEZ, op is !=)
  - Cond is a 1-bit register (1: branch is taken, 0: not taken)
  - ALUOutput <- NPC + Imm; Cond <- (A op 0)
34 Memory Access (MEM)
- Memory reference
  - Load: LMD <- Mem[ALUOutput]
  - Store: Mem[ALUOutput] <- B
- Branch
  - If the instruction branches, the PC is replaced with the branch destination; otherwise, it is replaced with NPC
  - if (Cond) PC <- ALUOutput else PC <- NPC
35 Write-Back (WB)
- Register-register ALU: Regs[IR16..20] <- ALUOutput
- Register-immediate ALU: Regs[IR11..15] <- ALUOutput
- Load instruction: Regs[IR11..15] <- LMD
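The five steps can be strung together in a short non-pipelined model. This is a sketch under several assumptions: instructions are pre-decoded tuples rather than 32-bit words (R-type as `(op, rs1, rs2, rd)`, I-type as `(op, rs1, rd, imm)`), data memory is a dictionary keyed by address, and only a handful of opcodes are modeled. The temporaries keep the names used on the slides (IR, NPC, A, B, Imm, ALUOutput, LMD, Cond).

```python
import operator

ALU_OPS = {"ADD": operator.add, "SUB": operator.sub, "XOR": operator.xor}
MASK32 = 0xFFFFFFFF


def step(pc, regs, imem, dmem):
    """Run one instruction through IF, ID, EX, MEM, WB; return the new PC."""
    # IF
    ir = imem[pc // 4]
    npc = pc + 4
    # ID: both register fields are read whether or not they are used
    op, f1, f2, f3 = ir
    a, b, imm = regs[f1], regs[f2], f3
    # EX
    cond = False
    if op in ALU_OPS:                       # register-register ALU
        alu_output = ALU_OPS[op](a, b) & MASK32
    elif op in ("ADDI", "LW", "SW"):        # immediate ALU / effective address
        alu_output = (a + imm) & MASK32
    elif op in ("BEQZ", "BNEZ"):            # branch target and condition
        alu_output = npc + imm
        cond = (a == 0) if op == "BEQZ" else (a != 0)
    else:
        raise ValueError(op)
    # MEM
    lmd = None
    if op == "LW":
        lmd = dmem[alu_output]
    elif op == "SW":
        dmem[alu_output] = b
    pc = alu_output if cond else npc
    # WB
    if op in ALU_OPS:
        regs[f3] = alu_output               # rd in the R-type position
    elif op == "ADDI":
        regs[f2] = alu_output               # rd in the I-type position
    elif op == "LW":
        regs[f2] = lmd
    regs[0] = 0                             # R0 stays zero
    return pc


# Hypothetical program: R3 = 5 + 7, stored to and reloaded from address 100.
imem = [("ADDI", 0, 1, 5), ("ADDI", 0, 2, 7), ("ADD", 1, 2, 3),
        ("SW", 0, 3, 100), ("LW", 0, 4, 100)]
regs, dmem, pc = [0] * 32, {}, 0
while pc < 4 * len(imem):
    pc = step(pc, regs, imem, dmem)
```

Each call to `step` corresponds to five clock cycles of the multi-cycle implementation, one per stage.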
36 Datapath
[Figure: non-pipelined DLX datapath, divided into Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, and Write Back. Components: PC, adder (+4), Next SEQ PC, NPC, Next PC mux, Instruction Memory, IR, Register File (RS1, RS2, RD), A, B, Sign Extend, Imm, Zero? test, ALU, ALUoutput, Data Memory, LMD, multiplexers, WB Data.]
37 Sequential Execution
[Figure: timing diagram, time in clock cycles against instructions Ii, Ii+1, Ii+2]
Sequential execution of these 3 instructions (Ii, Ii+1, Ii+2) takes 15 clock cycles
38 Pipelined Execution
- Analogy with an automobile assembly line
  - Many steps, each contributing something to the construction of the car
  - Each step operates in parallel with the other steps, though on a different car
[Figure: timing diagram, time in clock cycles against instructions Ii to Ii+4, overlapped across pipe stages (segments)]
Pipelined execution of instructions Ii, Ii+1, and Ii+2 takes 7 clock cycles
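The 15-cycle and 7-cycle figures follow from a simple count, sketched below for an ideal pipeline with no stalls. The function names are illustrative.

```python
def sequential_cycles(n_instructions, n_stages=5):
    """Each instruction finishes all stages before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages=5):
    """The first instruction fills the pipe; each later one adds one cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


seq = sequential_cycles(3)
pipe = pipelined_cycles(3)
speedup = seq / pipe
```

For three instructions this gives 15 and 7 cycles, a speedup of 15/7. As the instruction count grows, the ratio approaches the number of stages.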
39 Pipelining Lessons
- Pipelining does not help the latency of a single instruction; it helps the throughput of the entire workload
- Multiple instructions operate simultaneously using different resources
- Potential speedup = number of pipe stages
- Time to fill the pipeline and time to drain it reduce speedup: about 2.14X (15/7) vs. 5X in this example
- Latency and throughput
  - Latency: how long it takes to execute an instruction
  - Throughput: how often an instruction exits the pipeline
40 Pipelining Lessons (contd)
- Pipeline stages are hooked together => all stages must be ready to proceed at the same time
- Machine cycle: the time required to move an instruction one step down the pipeline (usually one clock cycle)
- The length of a machine cycle is determined by the time required for the slowest stage
- Unbalanced lengths of pipe stages also reduce speedup
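The effect of unbalanced stages can be estimated with a short calculation. The sketch assumes that the unpipelined machine spends the sum of the stage times on each instruction and that the pipelined machine is clocked at the slowest stage; the stage times in the example are made-up values in arbitrary units.

```python
def pipeline_speedup(stage_times, n_instructions):
    """Speedup of a pipeline clocked at its slowest stage over no pipelining."""
    unpipelined = n_instructions * sum(stage_times)
    cycle = max(stage_times)                      # slowest stage sets the cycle
    pipelined = (len(stage_times) + n_instructions - 1) * cycle
    return unpipelined / pipelined


balanced = pipeline_speedup([10, 10, 10, 10, 10], 1000)
unbalanced = pipeline_speedup([5, 5, 10, 5, 5], 1000)
```

With balanced stages the speedup is close to 5, while the unbalanced case is limited to about 3 even though four of its stages are faster.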
41 Visualizing Pipeline
[Figure: instructions in program order against time in clock cycles, CC 1 to CC 7]
42 Pipeline Datapath
[Figure: pipelined DLX datapath with stages Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, and Write Back. Components: PC, adder (+4), Next SEQ PC, Next PC mux, Instruction Memory, IR, Register File (read ports IR6..10 and IR11..15; write register MEM/WB.IR11..15 or MEM/WB.IR16..20), Sign Extend, Imm, Zero? test, ALU, Data Memory, multiplexers, WB Data.]
43 Instruction Flow through Pipeline Regs
Time (clock cycles)   IF/ID           ID/EX           EX/MEM          MEM/WB
CC 1                  Add R1,R2,R3    Nop             Nop             Nop
CC 2                  Lw R4,0(R2)     Add R1,R2,R3    Nop             Nop
CC 3                  Sub R6,R5,R7    Lw R4,0(R2)     Add R1,R2,R3    Nop
CC 4                  Xor R9,R8,R1    Sub R6,R5,R7    Lw R4,0(R2)     Add R1,R2,R3
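The way instructions advance through the pipeline registers can be reproduced with a shift-register model. This sketch assumes four pipeline registers (IF/ID, ID/EX, EX/MEM, MEM/WB) that start out holding Nops, and it ignores stalls; the function name is illustrative.

```python
def pipeline_register_trace(program, n_cycles, n_regs=4):
    """Return the contents of the pipeline registers after each clock cycle."""
    regs = ["Nop"] * n_regs
    trace = []
    for cycle in range(n_cycles):
        fetched = program[cycle] if cycle < len(program) else "Nop"
        regs = [fetched] + regs[:-1]      # every instruction moves one stage on
        trace.append(list(regs))
    return trace


program = ["Add R1,R2,R3", "Lw R4,0(R2)", "Sub R6,R5,R7", "Xor R9,R8,R1"]
trace = pipeline_register_trace(program, 4)
for cc, row in enumerate(trace, start=1):
    print(f"CC {cc}: " + " | ".join(row))
```

Each printed row lists the registers from IF/ID to MEM/WB, so the first instruction moves one column to the right on every cycle.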
44 DLX Pipeline Definition: IF, ID
- Stage IF
  - IF/ID.IR <- Mem[PC]
  - if EX/MEM.cond {IF/ID.NPC, PC <- EX/MEM.ALUOUT} else {IF/ID.NPC, PC <- PC + 4}
- Stage ID
  - ID/EX.A <- Regs[IF/ID.IR6..10]; ID/EX.B <- Regs[IF/ID.IR11..15]
  - ID/EX.Imm <- (IF/ID.IR16)^16 ## IF/ID.IR16..31
  - ID/EX.NPC <- IF/ID.NPC; ID/EX.IR <- IF/ID.IR
45 DLX Pipeline Definition: EX
- ALU
  - EX/MEM.IR <- ID/EX.IR
  - EX/MEM.ALUOUT <- ID/EX.A func ID/EX.B, or EX/MEM.ALUOUT <- ID/EX.A func ID/EX.Imm
  - EX/MEM.cond <- 0
- Load/store
  - EX/MEM.IR <- ID/EX.IR; EX/MEM.B <- ID/EX.B
  - EX/MEM.ALUOUT <- ID/EX.A + ID/EX.Imm
  - EX/MEM.cond <- 0
- Branch
  - EX/MEM.ALUOUT <- ID/EX.NPC + ID/EX.Imm
  - EX/MEM.cond <- (ID/EX.A func 0)
46 DLX Pipeline Definition: MEM, WB
- Stage MEM
  - ALU
    - MEM/WB.IR <- EX/MEM.IR
    - MEM/WB.ALUOUT <- EX/MEM.ALUOUT
  - Load/store
    - MEM/WB.IR <- EX/MEM.IR
    - MEM/WB.LMD <- Mem[EX/MEM.ALUOUT], or Mem[EX/MEM.ALUOUT] <- EX/MEM.B
- Stage WB
  - ALU
    - Regs[MEM/WB.IR16..20] <- MEM/WB.ALUOUT, or Regs[MEM/WB.IR11..15] <- MEM/WB.ALUOUT
  - Load
    - Regs[MEM/WB.IR11..15] <- MEM/WB.LMD
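The pipeline register transfers can be exercised with a small cycle-by-cycle model. This is a sketch under stated assumptions, not the full definition: instructions are pre-decoded tuples (R-type as `(op, rs1, rs2, rd)`, I-type as `(op, rs1, rd, imm)`), branches and EX/MEM.cond are left out, and there is no hazard detection or forwarding, so the example program separates dependent instructions with Nops. The register file is assumed to be written before it is read within a cycle.

```python
MASK32 = 0xFFFFFFFF
NOP = ("NOP", 0, 0, 0)


def run_pipeline(imem, regs, dmem, n_cycles):
    """Clock the four pipeline registers; no hazard detection or forwarding."""
    if_id = {"IR": NOP}
    id_ex = {"IR": NOP, "A": 0, "B": 0, "Imm": 0}
    ex_mem = {"IR": NOP, "ALUOUT": 0, "B": 0}
    mem_wb = {"IR": NOP, "ALUOUT": 0, "LMD": 0}
    pc = 0
    for _ in range(n_cycles):
        # Stages are evaluated last-to-first so that each one reads the
        # values latched at the end of the previous cycle.
        # WB
        op, f1, f2, f3 = mem_wb["IR"]
        if op in ("ADD", "SUB"):
            regs[f3] = mem_wb["ALUOUT"]
        elif op == "ADDI":
            regs[f2] = mem_wb["ALUOUT"]
        elif op == "LW":
            regs[f2] = mem_wb["LMD"]
        regs[0] = 0
        # MEM
        op = ex_mem["IR"][0]
        mem_wb = {"IR": ex_mem["IR"], "ALUOUT": ex_mem["ALUOUT"], "LMD": 0}
        if op == "LW":
            mem_wb["LMD"] = dmem.get(ex_mem["ALUOUT"], 0)
        elif op == "SW":
            dmem[ex_mem["ALUOUT"]] = ex_mem["B"]
        # EX
        op = id_ex["IR"][0]
        a, b, imm = id_ex["A"], id_ex["B"], id_ex["Imm"]
        if op == "ADD":
            alu = a + b
        elif op == "SUB":
            alu = a - b
        else:                     # ADDI, LW, SW (a NOP result is never used)
            alu = a + imm
        ex_mem = {"IR": id_ex["IR"], "ALUOUT": alu & MASK32, "B": b}
        # ID
        op, f1, f2, f3 = if_id["IR"]
        id_ex = {"IR": if_id["IR"], "A": regs[f1], "B": regs[f2], "Imm": f3}
        # IF
        index = pc // 4
        if_id = {"IR": imem[index] if index < len(imem) else NOP}
        pc += 4
    return regs, dmem


# Hypothetical program; Nops keep dependent instructions three slots apart.
imem = [("ADDI", 0, 1, 5), ("ADDI", 0, 2, 7), NOP, NOP,
        ("ADD", 1, 2, 3), NOP, NOP, ("SW", 0, 3, 64)]
regs, dmem = run_pipeline(imem, [0] * 32, {}, n_cycles=len(imem) + 4)
```

Removing the Nops from the program makes ADD read R1 and R2 before they are written, which is the data hazard that this simple definition does not handle.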