CPE 626: Advanced VLSI Design L02

Title: CA226: Advanced Computer Architectures Author: aleksander Last modified by: Default Created Date: 1/5/2001 1:58:05 PM Document presentation format – PowerPoint PPT presentation

Slides: 47

Provided by: Alek155

Learn more at: http://www.ece.uah.edu

1
CPE 626 Advanced VLSI Design L02

Department of Electrical and Computer
Engineering University of Alabama in Huntsville

2
Outline

Simple Processor MU0
Datapath Design
Control Logic
ALU Design
Pipeline Processor DLX
ISA
Registers
Addressing Modes and Data Types
Instruction Format
Instruction Set
Non-pipeline Implementation
Pipeline Implementation

3
MU0: A Simple Processor

Instruction format
Instruction set

4
MU0 Logic Design

Follow an approach that separates the design into
two components
Datapath: all the components carrying, storing,
or processing bits, including the accumulator,
program counter, ALU, and instruction register
Control logic: everything that does not fit
comfortably into the datapath
Datapath design: there are many ways to do this
Assume that memory access is the limiting factor, and
that each memory access takes exactly one
clock cycle

5
MU0 Datapath Example

Program Counter (PC)
Accumulator (ACC)
Arithmetic-Logic Unit (ALU)
Instruction Register (IR)
Instruction Decode and Control Logic

Follow the principle that memory will be the
limiting factor in the design: each instruction takes
exactly the number of clock cycles defined by the
number of memory accesses it must make.
Note: We do not have a dedicated PC incrementer!
Why?
6
MU0 Datapath Design

Assume that each instruction starts when it has
arrived in the IR
Step 1: EX (execute)
LDA S: ACC <- Mem[S]
STO S: Mem[S] <- ACC
ADD S: ACC <- ACC + Mem[S]
SUB S: ACC <- ACC - Mem[S]
JMP S: PC <- S
JGE S: if (ACC >= 0) PC <- S
JNE S: if (ACC != 0) PC <- S
Step 2: IF (fetch the next instruction)
Either the PC or the address in the IR is issued to
fetch the next instruction
The address is incremented in the ALU and the value is saved
into the PC

Initialization
Reset input to start executing instructions from
a known address; here it is 000 hex
Provide zero at the ALU output and then load it
into the PC register
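
The execute and fetch steps above can be sketched as a small instruction-level simulator. This is only a minimal sketch under stated assumptions: a 16-bit word with a 4-bit opcode in the top bits and a 12-bit address S in the low bits, opcode values taken from the control logic slides, and STP assumed to be 0111. The names `run_mu0`, `mem`, and `max_steps` are hypothetical. For simplicity the loop fetches first and then executes, which gives the same architectural result as the execute-then-fetch ordering on the slide.

```python
MASK16 = 0xFFFF   # assumed 16-bit data word
MASK12 = 0xFFF    # assumed 12-bit address field S

def run_mu0(mem, max_steps=1000):
    """Run an MU0 program stored in `mem` (a list of 16-bit words).

    Returns (acc, pc) when STP is reached or max_steps expires.
    """
    acc, pc = 0, 0                       # reset: PC is loaded with 000 hex
    for _ in range(max_steps):
        ir = mem[pc]                     # fetch: the instruction arrives in the IR
        pc = (pc + 1) & MASK12           # incremented address is saved into the PC
        op, s = ir >> 12, ir & MASK12    # split opcode and address S
        if op == 0b0000:                 # LDA S
            acc = mem[s]
        elif op == 0b0001:               # STO S
            mem[s] = acc
        elif op == 0b0010:               # ADD S
            acc = (acc + mem[s]) & MASK16
        elif op == 0b0011:               # SUB S
            acc = (acc - mem[s]) & MASK16
        elif op == 0b0100:               # JMP S
            pc = s
        elif op == 0b0101:               # JGE S: taken when sign bit ACC15 is 0
            if (acc & 0x8000) == 0:
                pc = s
        elif op == 0b0110:               # JNE S: taken when ACC is non-zero
            if acc != 0:
                pc = s
        else:                            # STP (assumed opcode 0111): stop
            break
    return acc, pc

# Usage: LDA 5; ADD 6; STO 7; STP, with data words at addresses 5 and 6.
program = [0x0005, 0x2006, 0x1007, 0x7000, 0, 10, 32, 0]
acc, pc = run_mu0(program)
```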

7
MU0 RTL Organization

Control Logic
Asel
Bsel
ACCce (ACC change enable)
PCce (PC change enable)
IRce (IR change enable)
ACCoe (ACC output enable)
ALUfs (ALU function select)
MEMrq (memory request)
RnW (read/write)
Ex/ft (execute/fetch)

8
MU0 control logic
9
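LDA S (0000)
Ex/ft = 0: ALU function B
Ex/ft = 1: ALU function B+1
10
STO S (0001)
Ex/ft = 0: ALU function x (don't care)
Ex/ft = 1: ALU function B+1
11
ADD S (0010)
Ex/ft = 0: ALU function A+B
Ex/ft = 1: ALU function B+1
12
SUB S (0011)
Ex/ft = 0: ALU function A-B
Ex/ft = 1: ALU function B+1
13
JMP S (0100)
Ex/ft = 0: ALU function B+1
14
JGE S (0101)
Ex/ft = 0, ACC15 = 1: ALU function B+1
Ex/ft = 0, ACC15 = 0: ALU function B+1
15
JNE S (0110)
Ex/ft = 0, ACCz = 1: ALU function B+1
Ex/ft = 0, ACCz = 0: ALU function B+1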
16
STP (0111)
Ex/ft = 0: ALU function x (don't care)
17
Reset
Ex/ft = 0: ALU function 0
18
MU0 ALU Design

ALU functions: A+B, A-B, B, B+1, 0 (used only
when reset is active) => 4 functions

Aen (enable operand A)
Binv (invert operand B)
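
One way the four functions could be derived from a single adder is sketched below. Aen and Binv are the control inputs named on the slide; the carry-in column and the separate handling of the reset-only 0 function are assumptions of this sketch, and `mu0_alu` and `ALU_CONTROLS` are hypothetical names.

```python
MASK16 = 0xFFFF  # assumed 16-bit datapath

# ALU function -> (Aen, Binv, carry-in). The carry-in is an assumed extra control.
ALU_CONTROLS = {
    "A+B": (1, 0, 0),
    "A-B": (1, 1, 1),   # two's complement subtraction: A + (not B) + 1
    "B":   (0, 0, 0),
    "B+1": (0, 0, 1),
}

def mu0_alu(fs, a, b):
    """Return the 16-bit ALU result for function select `fs`."""
    if fs == "0":                                  # reset only: force zero output
        return 0
    aen, binv, cin = ALU_CONTROLS[fs]
    a_in = a if aen else 0                         # Aen gates operand A
    b_in = (~b & MASK16) if binv else b            # Binv inverts operand B
    return (a_in + b_in + cin) & MASK16            # one adder serves all functions
```

With this arrangement, B+1 reuses the adder to increment the address, which is why no dedicated PC incrementer is needed.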

19
Another Example: DLX Architecture
20
DLX Registers

GPRs with load-store architecture
GPR: 32 32-bit registers named R0, R1, ..., R31; R0 = 0
FPR (floating-point registers)
single precision: 32 32-bit registers named F0, F1, ..., F31
(accessed independently)
double precision: 16 64-bit registers named F0, F2, ..., F30
(accessed in pairs)
Instructions that support transfers between
GPRs and FPRs
Other status registers, e.g., the floating-point
status register (holds information about the
results of FP ops)

21
Addressing Modes and Data Types

Immediate, with a 16-bit value field
Displacement, with a 16-bit displacement
register deferred: derived when disp = 0
absolute: derived from displacement with R0 as the base
Byte addressable, big-endian, with 32-bit
addresses
All memory references are loads/stores through a GPR
or FPR and must be aligned
Data types
8-bit bytes and 16-bit half words (loaded into
registers with either zeros or the sign bit
replicated to fill 32 bits)
32-bit integers
32-bit single precision and 64-bit
double precision for FP
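
The way register deferred and absolute addressing fall out of displacement addressing can be sketched as follows. The function name `effective_address` and the register-file list are hypothetical; the only rule taken from the slide is that the address is a base register plus a 16-bit displacement, with R0 always zero.

```python
def effective_address(regs, base, disp):
    """Displacement addressing: base register plus a signed 16-bit displacement."""
    if not -(1 << 15) <= disp < (1 << 15):
        raise ValueError("displacement must fit in 16 signed bits")
    return (regs[base] + disp) & 0xFFFFFFFF       # 32-bit address

regs = [0] * 32          # regs[0] models R0, which always holds 0
regs[2] = 1000

displacement = effective_address(regs, 2, 30)     # 30(R2)
deferred = effective_address(regs, 2, 0)          # 0(R2): register deferred
absolute = effective_address(regs, 0, 1000)       # 1000(R0): absolute
```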

22
Instruction Formats

I-type: load, store, arithmetic, logic,
relational, shift, branch
R-type: arithmetic, logic, relational
J-type: jump, jump and link, trap, return from
exception

I-type instruction
Encodes: loads and stores of bytes, words, and half
words; all immediates (rd <- rs1 op
immediate); conditional branch instructions (rs1
is the register, rd is unused); jump register and jump
and link register (rd = 0, rs1 = destination, immediate = 0)
R-type instruction
Encodes: register-register ALU operations (rd <- rs1 func rs2),
where func = add, sub, ...; reads/writes of special registers
and moves
J-type instruction
Fields: Opcode (6 bits), Offset added to PC (26 bits)
Encodes: jump and jump and link; trap and return from
exception
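
The three formats can be illustrated with a small field extractor. This is a sketch, assuming the usual DLX layout with bit 0 as the most significant bit: a 6-bit opcode, 5-bit register fields, a 16-bit immediate for I-type, an 11-bit func field for R-type, and a 26-bit offset for J-type. The names `decode` and `sign_extend` and the dictionary keys are hypothetical, and the opcode value in the usage line is arbitrary.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of `value` as a two's complement number."""
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign

def decode(word):
    """Extract every field a 32-bit DLX instruction word could contain.

    Which fields are meaningful depends on the format selected by the opcode.
    """
    return {
        "opcode":   (word >> 26) & 0x3F,                 # bits 0..5
        "rs1":      (word >> 21) & 0x1F,                 # bits 6..10
        "rs2":      (word >> 16) & 0x1F,                 # bits 11..15 (R-type source)
        "rd_itype": (word >> 16) & 0x1F,                 # bits 11..15 (I-type destination)
        "rd_rtype": (word >> 11) & 0x1F,                 # bits 16..20 (R-type destination)
        "func":     word & 0x7FF,                        # bits 21..31 (R-type)
        "imm":      sign_extend(word & 0xFFFF, 16),      # bits 16..31 (I-type)
        "offset":   sign_extend(word & 0x3FFFFFF, 26),   # bits 6..31 (J-type)
    }

# Usage: an I-type word with an arbitrary opcode 8, rs1 = R2, rd = R1, immediate = -3.
fields = decode((8 << 26) | (2 << 21) | (1 << 16) | (-3 & 0xFFFF))
```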
23
Instructions for Data Transfers
Instruction Opcode Instruction Meaning
LB, LBU, SB Load byte, load byte unsigned, store byte
LH, LHU, SH Load half word, load half word unsigned, store half word
LW, SW Load word, store word (to/from integer registers)
LF, LD, SF, SD Load SP float, load DP float, store SP float, store DP float (SP - single precision, DP - double precision)
MOVI2S, MOVS2I Move from/to GPR to/from a special register
MOVF, MOVD Copy one floating-point register or a DP pair to another register or pair
MOVFP2I, MOVI2FP Move 32 bits from/to FP register to/from integer registers
Example Instruction Meaning
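LW R1, 30(R2): Regs[R1] <-32 Mem[30 + Regs[R2]]
LW R1, 1000(R0): Regs[R1] <-32 Mem[1000 + 0]
LB R1, 40(R3): Regs[R1] <-32 (Mem[40 + Regs[R3]]_0)^24 ## Mem[40 + Regs[R3]]
LBU R1, 40(R3): Regs[R1] <-32 (0)^24 ## Mem[40 + Regs[R3]]
LH R1, 40(R3): Regs[R1] <-32 (Mem[40 + Regs[R3]]_0)^16 ## Mem[40 + Regs[R3]] ## Mem[41 + Regs[R3]]
LF F0, 50(R3): Regs[F0] <-32 Mem[50 + Regs[R3]]
LD F0, 50(R2): Regs[F0] ## Regs[F1] <-64 Mem[50 + Regs[R2]]
Notation: <-n transfers n bits, ## concatenates, x_0 is the most significant bit of x, and (x)^n replicates x n times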
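
The load semantics above (big-endian byte order, alignment, and sign or zero extension to 32 bits) can be sketched with Python's `int.from_bytes`. The function name `load` and the `bytearray` memory model are assumptions made for illustration.

```python
def load(mem, addr, size, signed=True):
    """Read `size` bytes at `addr` from byte-addressed, big-endian `mem`.

    The value is sign-extended (signed=True) or zero-extended to 32 bits.
    """
    if addr % size != 0:
        raise ValueError("memory references must be aligned")
    value = int.from_bytes(mem[addr:addr + size], "big", signed=signed)
    return value & 0xFFFFFFFF                     # keep the 32-bit register image

mem = bytearray(64)
mem[40:44] = bytes([0x80, 0x01, 0x00, 0x00])
r3 = 0                                            # base register value

lb = load(mem, 40 + r3, 1)                        # LB  R1, 40(R3)
lbu = load(mem, 40 + r3, 1, signed=False)         # LBU R1, 40(R3)
lh = load(mem, 40 + r3, 2)                        # LH  R1, 40(R3)
lw = load(mem, 40 + r3, 4)                        # LW  R1, 40(R3)
```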
24
Arithmetic/logical instructions

All ALU instructions are register-register
add, sub, and, or, xor, shift
Immediate forms are also available
LHI loads an immediate value into the most significant
16 bits
R0 is used to synthesise other operations
Loading a constant is an add immediate with R0
as the source
A register-register move is an add with R0 as one
source
Compare operations put 1 ("true") in the destination
if the condition is met, and 0 otherwise

25
Arithmetic/logical instructions (contd)
Instruction Opcode Instruction Meaning
ADD, ADDI, ADDU, ADDUI Add, add immediate (all immediates are 16 bits); signed and unsigned
SUB, SUBI, SUBU, SUBUI Subtract, subtract immediate; signed and unsigned
MULT, MULTU, DIV, DIVU Multiply and divide, signed and unsigned; operands must be floating-point registers; all operations take and yield 32-bit values
AND, ANDI And, and immediate
OR, ORI, XOR, XORI Or, or immediate, exclusive or, exclusive or immediate
LHI Load high immediate: loads upper half of register with immediate
SLL, SRL, SRA, SLLI, SRLI, SRAI Shifts: both immediate (S__I) and variable form (S__); shifts are shift left logical, right logical, right arithmetic
S__, S__I Set conditional: "__" may be LT, GT, LE, GE, EQ, NE
Example Instruction Meaning
ADD R1, R2, R3: Regs[R1] <- Regs[R2] + Regs[R3]
ADDI R1, R2, 3: Regs[R1] <- Regs[R2] + 3
LHI R1, 42: Regs[R1] <- 42 ## 0^16
SLLI R1, R2, 5: Regs[R1] <- Regs[R2] << 5
SLT R1, R2, R3: if (Regs[R2] < Regs[R3]) Regs[R1] <- 1 else Regs[R1] <- 0
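
The example instructions above can be checked with a small interpreter. This is a sketch only: `execute`, `to_signed`, and the string mnemonics are hypothetical conveniences, registers are modeled as a list of 32-bit unsigned values, and writes to R0 are discarded so that it always reads as zero.

```python
MASK32 = 0xFFFFFFFF

def to_signed(x):
    """Interpret a 32-bit register image as a signed integer."""
    return x - (1 << 32) if x & 0x80000000 else x

def execute(regs, op, rd, rs1, src2):
    """Apply one ALU instruction; `src2` is a register number or an immediate."""
    a = regs[rs1]
    if op == "ADD":
        result = a + regs[src2]
    elif op == "ADDI":
        result = a + src2
    elif op == "LHI":
        result = src2 << 16                       # immediate ## 0^16
    elif op == "SLLI":
        result = a << src2
    elif op == "SLT":
        result = 1 if to_signed(a) < to_signed(regs[src2]) else 0
    else:
        raise ValueError("unsupported operation: " + op)
    if rd != 0:                                   # R0 stays zero
        regs[rd] = result & MASK32

regs = [0] * 32
execute(regs, "ADDI", 2, 0, 7)     # load constant: ADDI R2, R0, 7
execute(regs, "ADD", 3, 2, 0)      # register move: ADD R3, R2, R0
execute(regs, "ADD", 1, 2, 3)      # ADD R1, R2, R3
execute(regs, "LHI", 4, 0, 42)     # LHI R4, 42
execute(regs, "SLLI", 5, 2, 5)     # SLLI R5, R2, 5
execute(regs, "SLT", 6, 2, 1)      # SLT R6, R2, R1
```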
26
Control-flow instructions

Jump can use 26-bit signed offset from PC or
contents of register
Jump-and-link saves PC in R31
Conditional branches test source for
zero/non-zero and use 16-bit signed offset

Instruction Opcode Instruction Meaning
BEQZ, BNEZ Branch if GPR is equal/not equal to zero; 16-bit offset from PC
BFPT, BFPF Test comparison bit in the FP status register and branch; 16-bit offset from PC
J, JR Jumps: 26-bit offset from PC (J) or target in register (JR)
TRAP Transfer to operating system at a vectored address
RFE Return to user code from an exception; restore user mode
27
Floating-point instructions in DLX

Moves between floating point (32-bit) and
double-precision (64-bit) registers
Operations: add, subtract, multiply, divide
Also, integer multiply/divide on floating-point
regs

Instruction Opcode Instruction Meaning
ADDD, ADDF Add DP, SP numbers
SUBD, SUBF Subtract DP, SP numbers
MULTD, MULTF Multiply DP, SP floating point
DIVD, DIVF Divide DP, SP floating point
CVTF2D, CVTF2I, CVTD2F, CVTD2I, CVTI2F, CVTI2D Convert instructions: CVTx2y converts from type x to type y, where x and y are one of I (integer), D (double precision), or F (single precision). Both operands are in the FP registers.
__D, __F DP and SP compares: "__" may be LT, GT, LE, GE, EQ, NE; sets comparison bit in FP status register.
28
A Simple Implementation of DLX
29
Instruction Execution

The process of instruction execution is usually
broken up into stages (divide and conquer)
smaller stages are easier to design
it is easy to optimize (change) one stage without
touching the others
5 main stages for DLX; each stage takes one
clock cycle
Instruction Fetch (IF)
Instruction Decode / Register fetch cycle (ID)
Execution / Effective address cycle (EX)
Memory access / Branch completion cycle (MEM)
Write-back cycle (WB)

30
Instruction Fetch (IF)

Send out the PC and fetch the instruction from
memory into the instruction register (IR)
IR is used to hold the instruction
Increment the PC by 4 to address the next
sequential instruction
NPC is used to hold the next sequential address

IR <- Mem[PC]
NPC <- PC + 4
31
Instruction Decode (ID)

Decode the instruction to determine the instruction
type (opcode field: the 6 most significant bits of the
instruction)
Read in data from all necessary registers
temporary registers A and B hold the outputs of the GPRs
Imm is used to hold the sign-extended lower 16 bits
of the IR
decoding is done in parallel with reading
registers, since these fields are at fixed
locations
a register may be read even if we do not use it

A <- Regs[IR6..10]
B <- Regs[IR11..15]
Imm <- (IR16)^16 ## IR16..31
32
Execution EX (1/2)

Register-register ALU instruction
The ALU performs the operation specified by the
opcode on the values in registers A and B; the
result is placed in the temporary register
ALUOutput
ALUOutput <- A op B
Register-immediate ALU instruction
The ALU performs the operation specified by the
opcode on the value in register A and on the
value in register Imm; the result is placed in
the temporary register ALUOutput
ALUOutput <- A op Imm
33
Execution EX (2/2)

Memory reference
The ALU adds the operands to form the effective address
and places the result into the temporary
register ALUOutput
ALUOutput <- A + Imm
Branch
The ALU adds the NPC to the Imm to compute the
address of the branch target
Register A is checked to determine whether the
branch is taken (for BEQZ, op is ==; for BNEZ, op
is !=)
Cond is a 1-bit register (1: branch is taken, 0:
not taken)
ALUOutput <- NPC + Imm
Cond <- (A op 0)
34
Memory access (MEM)

Memory reference
load: LMD <- Mem[ALUOutput]
store: Mem[ALUOutput] <- B
Branch
if the instruction branches, the PC is replaced
with the branch destination; otherwise, it is
replaced with NPC
if (Cond) PC <- ALUOutput else PC <- NPC
35
Write-back (WB)

Register-register ALU: Regs[IR16..20] <- ALUOutput
Register-immediate ALU: Regs[IR11..15] <- ALUOutput
Load instruction: Regs[IR11..15] <- LMD
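
The five stages can be traced end to end with a small non-pipelined model that uses the same temporaries (IR, NPC, A, B, Imm, ALUOutput, Cond, LMD). This is a sketch under assumptions: the opcode and func numbers below are invented for the example and are not the real DLX encodings, memory is a dictionary keyed by word-aligned byte address, and only a handful of instructions are modeled. `step`, `i_type`, and `r_type` are hypothetical helpers.

```python
MASK32 = 0xFFFFFFFF

# Hypothetical opcode and func numbers, chosen for this sketch only.
OP_RR, OP_ADDI, OP_LW, OP_SW, OP_BEQZ, OP_BNEZ = 0, 1, 2, 3, 4, 5
FUNC_ADD, FUNC_SUB = 0, 1

def sext16(x):
    return (x ^ 0x8000) - 0x8000

def i_type(op, rs1, rd, imm):
    return (op << 26) | (rs1 << 21) | (rd << 16) | (imm & 0xFFFF)

def r_type(func, rs1, rs2, rd):
    return (OP_RR << 26) | (rs1 << 21) | (rs2 << 16) | (rd << 11) | func

def step(regs, mem, pc):
    """Run one instruction through IF, ID, EX, MEM, WB and return the new PC."""
    # IF: IR <- Mem[PC]; NPC <- PC + 4
    ir = mem[pc]
    npc = pc + 4
    # ID: read both register fields and sign-extend the immediate
    op = ir >> 26
    a = regs[(ir >> 21) & 0x1F]            # A <- Regs[IR6..10]
    b = regs[(ir >> 16) & 0x1F]            # B <- Regs[IR11..15]
    imm = sext16(ir & 0xFFFF)              # Imm <- (IR16)^16 ## IR16..31
    # EX: ALU operation, effective address, or branch target and condition
    cond = False
    if op == OP_RR:
        alu = a + b if (ir & 0x7FF) == FUNC_ADD else a - b
    elif op in (OP_BEQZ, OP_BNEZ):
        alu = npc + imm
        cond = (a == 0) if op == OP_BEQZ else (a != 0)
    else:                                  # ADDI, LW, SW
        alu = a + imm
    alu &= MASK32
    # MEM: memory access and branch completion
    lmd = 0
    if op == OP_LW:
        lmd = mem[alu]                     # LMD <- Mem[ALUOutput]
    elif op == OP_SW:
        mem[alu] = b                       # Mem[ALUOutput] <- B
    pc = alu if cond else npc
    # WB: write the result to the destination register
    if op == OP_RR:
        dest, value = (ir >> 11) & 0x1F, alu      # Regs[IR16..20]
    elif op == OP_ADDI:
        dest, value = (ir >> 16) & 0x1F, alu      # Regs[IR11..15]
    elif op == OP_LW:
        dest, value = (ir >> 16) & 0x1F, lmd      # Regs[IR11..15]
    else:
        dest, value = 0, 0                        # stores and branches write nothing
    if dest != 0:                                 # R0 stays zero
        regs[dest] = value
    return pc

regs = [0] * 32
mem = {
    0: i_type(OP_ADDI, 0, 1, 5),       # ADDI R1, R0, 5
    4: i_type(OP_LW, 0, 2, 100),       # LW   R2, 100(R0)
    8: r_type(FUNC_ADD, 1, 2, 3),      # ADD  R3, R1, R2
    12: i_type(OP_SW, 0, 3, 104),      # SW   104(R0), R3
    16: i_type(OP_BEQZ, 0, 0, 8),      # BEQZ R0, +8 (always taken)
    100: 37,
}
pc = 0
for _ in range(5):
    pc = step(regs, mem, pc)
```

Each call to `step` corresponds to five clock cycles in the non-pipelined implementation, one per stage.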
36
Datapath
[Figure: DLX datapath divided into Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, and Write Back. Labeled elements: PC, Add (+4), Next SEQ PC, NPC, Next PC, Instruction Memory, IR, Reg. File (RS1, RS2, RD), A, B, Sign Extend, Imm, Zero?, ALU, ALUoutput, Data Memory, LMD, multiplexers, WB Data]
37
Sequential Execution
[Figure: timing diagram of instructions Ii, Ii+1, Ii+2 against time in clocks]
Sequential execution of these 3 instructions
(Ii, Ii+1, Ii+2) takes 15 clock cycles
38
Pipelined Execution

Analogy with an automobile assembly line
many steps, each contributing something to the
construction of the car
each step operates in parallel with the other steps,
though on a different car

[Figure: timing diagram of instructions Ii to Ii+4 overlapped in pipe stages (segments), against time in clocks]
Pipelined execution of instructions Ii, Ii+1,
and Ii+2 takes 7 clock cycles
39
Pipelining Lessons

Pipelining does not help the latency of a single
instruction; it helps the throughput of the entire
workload
Multiple instructions operate simultaneously
using different resources
Potential speedup = number of pipe stages
Time to fill the pipeline and time to drain it
reduce speedup: about 2.14X (15/7) vs. 5X in this example

Latency vs. throughput
Latency: how long it takes to execute an
instruction
Throughput: how often an instruction exits the
pipeline
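
The cycle counts behind these lessons can be reproduced with a few lines of arithmetic. This sketch assumes an ideal pipeline with one clock per stage and no stalls; the function names are hypothetical.

```python
def sequential_cycles(n, stages=5):
    """Each instruction runs all stages before the next one starts."""
    return n * stages

def pipelined_cycles(n, stages=5):
    """The first instruction fills the pipeline; each later one adds one cycle."""
    return stages + (n - 1)

def speedup(n, stages=5):
    return sequential_cycles(n, stages) / pipelined_cycles(n, stages)

three = (sequential_cycles(3), pipelined_cycles(3))   # the 3-instruction example
short_run = speedup(3)          # fill and drain time dominate
long_run = speedup(1000)        # approaches the number of pipe stages
```

For three instructions the fill and drain time keeps the speedup near 2.14; for long instruction streams it approaches 5, the number of stages.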

40
Pipelining Lessons (contd)

Pipeline stages are hooked together => all stages
must be ready to proceed at the same time
Machine cycle: the time required to move
an instruction one step down the pipeline
(usually one clock cycle)
The length of a machine cycle is determined by
the time required for the slowest stage
Unbalanced lengths of pipe stages also reduce
speedup
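
The effect of unbalanced stages can be illustrated numerically. The stage delays below are invented for the example, and the model is a simplification: it assumes the unpipelined machine needs the sum of the stage delays per instruction and it ignores latch overhead and fill/drain time. The function names are hypothetical.

```python
def machine_cycle(stage_delays):
    """The clock period is set by the slowest stage."""
    return max(stage_delays)

def steady_state_speedup(stage_delays):
    """Unpipelined time per instruction divided by the pipelined machine cycle."""
    return sum(stage_delays) / machine_cycle(stage_delays)

balanced = [10, 10, 10, 10, 10]      # hypothetical delays in ns
unbalanced = [10, 8, 10, 14, 8]      # same total work, one slow stage

balanced_speedup = steady_state_speedup(balanced)
unbalanced_speedup = steady_state_speedup(unbalanced)
```

Both pipelines do 50 ns of work per instruction, but the 14 ns stage stretches every machine cycle and pulls the speedup below the stage count.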
41
Visualizing Pipeline
[Figure: pipeline diagram showing instructions in program order against time in clock cycles CC 1 to CC 7]
42
Pipeline Datapath
[Figure: pipelined DLX datapath divided into Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc, Memory Access, and Write Back. Labeled elements: PC, Add (+4), Next SEQ PC, Next PC, Instruction Memory, IR, Reg. File (read addresses IR6..10 and IR11..15; write address MEM/WB.IR11..15 or MEM/WB.IR16..20), Sign Extend, Imm, Zero?, ALU, Data Memory, multiplexers, WB Data]
43
Instruction Flow through Pipeline Regs
[Figure: contents of the pipeline registers in clock cycles CC 1 to CC 4 as the instructions Add R1,R2,R3; Lw R4,0(R2); Sub R6,R5,R7; Xor R9,R8,R1 enter the pipeline; registers not yet reached by an instruction hold Nop]
44
DLX Pipeline Definition: IF, ID