Title: The Processor: Datapath and Control
1The Processor Datapath and Control
- We will design a microprocessor that includes a subset of the MIPS instruction set:
  - Memory access: load/store word (lw, sw).
  - Arithmetic-logic (AL) instructions: add, sub, and, or, and slt.
  - Branch instructions: beq and jump (j).
- The subset includes neither all the integer instructions nor any floating-point instructions, but the principle is the same.
- For every instruction the first two steps are identical:
  - Fetch an instruction from the memory location the PC points to.
  - Decode the instruction and read the registers or memory contents specified.
2Abstract View of the DataPath
- The datapath contains 2 types of logic elements:
  - Combinational elements, which operate on data values. Their outputs depend only on their current inputs. The ALU is a combinational element.
  - State elements, which have internal storage. Their state is defined by the values they contain (memory and registers).
3Clocking Methodology
- A state element has at least two inputs and one output. The inputs are the data value to be written into the element and the clock signal, which determines when the value is written. The output is the data value stored in the element. Thus a state element can be read at any time but is written only when the clock allows it.
- A clocking methodology defines when signals can be read and written. This is crucial to the correct design of a computer.
- We will assume an edge-triggered clocking methodology: any values stored in the machine are updated only on a clock edge.
4Edge-Triggered Clocking
- Because only state elements can store values, any collection of combinational logic must have its inputs coming from a set of state elements and its outputs written to a set of state elements. The time necessary for the signals to reach state element 2 defines the length of the clock cycle.
- An edge-triggered methodology allows us to read the contents of a register, send the value through some combinational logic, and write that register in the same clock cycle. We assume that state elements have implicit clock signals.
5Fetching an Instruction
- A memory unit holds the instructions that are to be executed. The address of the next instruction is in the PC. We need an adder (an ALU that performs only addition) to calculate the address of the next instruction to fetch.
- Thick arrows symbolize 32-bit buses unless specified otherwise. Thin arrows are 1-bit lines; colored lines are control lines.
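As a rough illustration of the fetch step, the sketch below models the instruction memory as a Python list of 32-bit words and computes PC + 4 with a plain addition. The name `fetch` and the list-based memory are assumptions made for this example; they do not come from the slides.

```python
def fetch(instruction_memory, pc):
    """Return (instruction, next_pc) for a byte-addressed, word-aligned PC.

    instruction_memory is a list of 32-bit words (an assumption of this
    sketch), so the byte address in the PC is divided by 4 to index it.
    """
    instruction = instruction_memory[pc // 4]
    next_pc = (pc + 4) & 0xFFFFFFFF  # the dedicated adder: PC + 4, 32 bits wide
    return instruction, next_pc
```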
6The Register File
- The R-type instructions (also called the arithmetic-logical instructions) read the contents of 2 registers, perform an ALU operation, and write the result back into a third register.
- The 32 registers are stored in the register file. The register file has three 5-bit inputs to specify the registers (two to read, one to write), two 32-bit outputs for the data read, one 32-bit input for the data written, and one control signal to decide whether data should be written. In addition we need an ALU to perform the operations.
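The register file interface described above can be sketched as a small Python class. The class and method names are hypothetical, and the sketch ignores the MIPS convention that register 0 always reads as zero.

```python
class RegisterFile:
    """Toy model of the 32-entry register file (names are illustrative)."""

    def __init__(self):
        self.regs = [0] * 32

    def read(self, read_reg1, read_reg2):
        """Two 5-bit register numbers in, two 32-bit values out."""
        return self.regs[read_reg1 & 0x1F], self.regs[read_reg2 & 0x1F]

    def write(self, write_reg, write_data, reg_write):
        """Store write_data only when the write control signal is asserted."""
        if reg_write:
            self.regs[write_reg & 0x1F] = write_data & 0xFFFFFFFF
```

Reading needs no control signal, which matches the point that a state element can be read at any time but is written only under control.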
7Data Memory
- The 2 elements needed to implement load and store instructions are the data memory and a unit that sign-extends the 16-bit constant in an I-type instruction. In addition we use the existing ALU to compute the address to access.
- The data memory has two 32-bit inputs, the address and the write data, and one 32-bit output, the read data. In addition it has 2 control lines, MemWrite and MemRead.
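A minimal sketch of the sign-extension unit and of the address computation for lw/sw follows. The function names are assumptions of this example.

```python
def sign_extend16(value):
    """Extend a 16-bit two's-complement value to 32 bits."""
    value &= 0xFFFF
    if value & 0x8000:              # sign bit set: fill the upper 16 bits with ones
        return value | 0xFFFF0000
    return value


def effective_address(base, offset16):
    """Address used by lw/sw: base register plus sign-extended offset."""
    return (base + sign_extend16(offset16)) & 0xFFFFFFFF
```

For instance, an offset field of 0xFFFC stands for -4, so with a base of 0x1000 the access goes to 0x0FFC.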
8Branch Equal
- The beq instruction has 3 operands: two registers that are compared for equality and a 16-bit offset used to compute the branch address relative to the PC. To implement this instruction we must add the sign-extended offset to the PC.
- There are 2 important details:
  1. The base for the address calculation is the address of the instruction after the branch. Since we compute PC + 4 when fetching, we already have this address.
  2. The offset is in words, not bytes, so we have to shift the offset left by 2.
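The two details above combine into one expression. The sketch below is only an illustration; the name `branch_target` is made up for this example.

```python
def branch_target(pc, offset16):
    """Branch address for beq: (PC + 4) + (sign-extended offset << 2)."""
    offset16 &= 0xFFFF
    # Interpret the 16-bit field as a signed number of words.
    signed_words = offset16 - 0x10000 if offset16 & 0x8000 else offset16
    return (pc + 4 + (signed_words << 2)) & 0xFFFFFFFF
```

An offset of -1 (0xFFFF) therefore branches back to the beq instruction itself, because the base is PC + 4 rather than PC.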
9Combining ALU and Memory Instructions
- The ALU datapath (slide 6) and the memory datapath (slide 7) are similar. The differences are:
  - The second input to the ALU is a register (R-type) or the sign-extended offset (I-type).
  - The value stored into the destination register comes from the ALU (R-type) or from memory (I-type).
- Using 2 multiplexors (MUXs) we can combine both datapaths.
10The Complete Datapath
- This simple processor can execute an ALU instruction, access memory, or compute the next instruction's address in a single cycle.
11ALU Control
- The ALU has 3 control inputs. We use 5 of the 8 possible input combinations:
  000  AND
  001  OR
  010  add
  110  subtract
  111  slt
- The ALU control takes as its inputs the funct field of the instruction and a 2-bit control field called ALUOp.
- For lw/sw the ALU computes the address using addition (ALUOp = 00); for beq the ALU performs a subtraction (ALUOp = 01); for the R-type instructions the ALU performs one of the 5 actions depending on the funct field of the instruction (ALUOp = 10).
- The ALU control is a large truth table that, given the funct field and ALUOp, outputs the 3-bit control for the ALU.
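One way to picture the ALU control truth table is as a lookup, sketched below. The 3-bit ALU codes and the ALUOp values are the ones listed above; the 6-bit funct values are the standard MIPS encodings, which the slides do not list, and the function name is hypothetical.

```python
# Standard MIPS funct codes (not given in the slides) mapped to the
# 3-bit ALU control values from the list above.
FUNCT_TO_ALU = {
    0b100000: 0b010,  # add
    0b100010: 0b110,  # sub
    0b100100: 0b000,  # and
    0b100101: 0b001,  # or
    0b101010: 0b111,  # slt
}


def alu_control(alu_op, funct):
    """Return the 3-bit ALU control for a 2-bit ALUOp and a 6-bit funct field."""
    if alu_op == 0b00:       # lw/sw: add to form the address
        return 0b010
    if alu_op == 0b01:       # beq: subtract to compare the registers
        return 0b110
    if alu_op == 0b10:       # R-type: the funct field decides
        return FUNCT_TO_ALU[funct & 0x3F]
    raise ValueError("ALUOp encoding not used in this design")
```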
12Main Control
- Look at the formats of the R-type and I-type instructions:
  R-type
  Field  opcode  rs     rt     rd     shamt  funct
  Bits   31-26   25-21  20-16  15-11  10-6   5-0
  I-type
  Field  opcode  rs     rt     address
  Bits   31-26   25-21  20-16  15-0
- The following observations can be made:
  - The opcode is always in bits 31-26.
  - The 2 registers to be read are always the rs (25-21) and rt (20-16) fields (R-type, beq, and store).
  - The base register for load/store instructions is always rs (25-21).
  - The 16-bit offset for beq, lw, and sw is always in bits 15-0.
  - The destination register is in one of two places: for a lw it is rt (20-16), for an R-type it is rd (15-11). Thus we need a MUX to select which field of the instruction names the register to be written.
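The field positions above translate directly into shifts and masks. The sketch below extracts every field from a 32-bit instruction word; the function name and the dictionary layout are assumptions of this example.

```python
def decode(instruction):
    """Split a 32-bit MIPS instruction word into its R-type and I-type fields."""
    return {
        "opcode": (instruction >> 26) & 0x3F,   # bits 31-26
        "rs": (instruction >> 21) & 0x1F,       # bits 25-21
        "rt": (instruction >> 16) & 0x1F,       # bits 20-16
        "rd": (instruction >> 11) & 0x1F,       # bits 15-11 (R-type only)
        "shamt": (instruction >> 6) & 0x1F,     # bits 10-6  (R-type only)
        "funct": instruction & 0x3F,            # bits 5-0   (R-type only)
        "address": instruction & 0xFFFF,        # bits 15-0  (I-type only)
    }
```

All fields are extracted regardless of the format, much as the hardware reads rs and rt before the control knows whether they are needed.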
13The Main Control Signals
- There are 7 control signals in our microprocessor. Let's see what happens when they are deasserted (set to 0) and asserted (set to 1):
  - RegDst. Deasserted: the Write register is rt. Asserted: the Write register is rd.
  - RegWrite. Deasserted: none. Asserted: the Write register is written with the Write data.
  - ALUSrc. Deasserted: the 2nd ALU operand comes from the register file. Asserted: the 2nd ALU operand is the 16-bit address field (sign-extended).
  - PCSrc. Deasserted: PC = PC + 4. Asserted: PC = branch target.
  - MemRead. Deasserted: none. Asserted: memory contents at the address input are put on the Read data output.
  - MemWrite. Deasserted: none. Asserted: memory contents at the address input are replaced by the Write data input.
  - MemtoReg. Deasserted: the value at the register Write data input comes from the ALU. Asserted: the value at the register Write data input comes from memory.
14Main Control Diagram
15Opcode to Control
- The control lines are determined by the opcodes of the instructions. The exception is the PCSrc line, which also depends on the result of the beq comparison (the ALU's Zero output). In the table, x means don't care.
  Line      R-type  lw   sw   beq
  RegDst    1       0    x    x
  ALUSrc    0       1    1    0
  MemtoReg  0       1    x    x
  RegWrite  1       1    0    0
  MemRead   0       1    0    0
  MemWrite  0       0    1    0
  Branch    0       0    0    1
  ALUOp     10      00   00   01
- At this stage the control is a black box, which receives inputs and gives outputs.
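The control table above can be written down as a lookup keyed by opcode. In this sketch the opcode numbers (0 for R-type, 35 for lw, 43 for sw, 4 for beq) are the standard MIPS values, which the slides do not list; don't-care entries are stored as None, and all names are hypothetical.

```python
X = None  # don't care

CONTROL = {
    0:  dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1,
             MemRead=0, MemWrite=0, Branch=0, ALUOp=0b10),   # R-type
    35: dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1,
             MemRead=1, MemWrite=0, Branch=0, ALUOp=0b00),   # lw
    43: dict(RegDst=X, ALUSrc=1, MemtoReg=X, RegWrite=0,
             MemRead=0, MemWrite=1, Branch=0, ALUOp=0b00),   # sw
    4:  dict(RegDst=X, ALUSrc=0, MemtoReg=X, RegWrite=0,
             MemRead=0, MemWrite=0, Branch=1, ALUOp=0b01),   # beq
}


def pc_src(opcode, alu_zero):
    """PCSrc is asserted only for a branch whose comparison succeeded."""
    return CONTROL[opcode]["Branch"] & alu_zero
```

PCSrc is the one line that is not a pure function of the opcode: it is the AND of the Branch line and the ALU's Zero output.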
16Operation of the Datapath
- Let's see the stages of execution of an R-type instruction, add $t1,$t2,$t3:
  1. An instruction is fetched from memory and the PC is incremented.
  2. Two registers, $t2 and $t3, are read from the register file.
  3. The ALU operates on the data read from the register file.
  4. The result of the ALU is written into register $t1.
- This doesn't really happen in 4 steps, because the implementation is combinational, but at the end of the clock cycle the result is written into the destination register.
- Let's look at lw $t1,offset($t2):
  1. An instruction is fetched from memory and the PC is incremented.
  2. The register $t2 is read from the register file.
  3. The ALU computes the sum of $t2 and the sign-extended offset.
  4. The sum from the ALU is used as the address for the data memory.
  5. The data from memory is written into register $t1.
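Steps 2 to 5 of the lw walk-through can be traced with a few lines of Python. This is only a sketch: the data memory is modeled as a dictionary from byte address to word, the registers as a list, and the function name is invented for the example.

```python
def execute_lw(regs, data_memory, rt, rs, offset16):
    """Carry out steps 2-5 of lw rt, offset(rs) and return the address used."""
    base = regs[rs]                                        # step 2: read the base register
    offset16 &= 0xFFFF
    offset = offset16 - 0x10000 if offset16 & 0x8000 else offset16
    address = (base + offset) & 0xFFFFFFFF                 # step 3: ALU adds base and offset
    value = data_memory[address]                           # step 4: the sum addresses data memory
    regs[rt] = value                                       # step 5: write the destination register
    return address
```

With register 10 (t2) holding 0x100 and an offset of 8, the load reads address 0x108 and writes the word found there into register 9 (t1).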
17Adding the Jump Instruction
- The j instruction uses pseudodirect addressing: the upper 4 bits of PC + 4 are concatenated with the 26-bit address field of the J-type instruction, shifted left by 2.
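The concatenation for the jump address can be sketched with masks and shifts. The function name is an assumption of this example.

```python
def jump_target(pc, address26):
    """Pseudodirect address: upper 4 bits of PC + 4, then the 26-bit field << 2."""
    upper = ((pc + 4) & 0xFFFFFFFF) & 0xF0000000   # bits 31-28 of PC + 4
    lower = (address26 & 0x03FFFFFF) << 2          # 26-bit field becomes bits 27-0
    return upper | lower
```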
18Performance of Single-Cycle Machines
- Let's assume that the operation times for the following units are: memory, 2 nanoseconds (ns); ALU and adders, 2 ns; register file, 1 ns. We will assume that MUXs, control, sign-extension, PC accesses, and wires have no delays.
- Which implementation is faster?
  1. Every instruction operates in 1 clock cycle of fixed length.
  2. Every instruction operates in a clock cycle of varying length.
- Let's look at the time (in ns) needed by each instruction:
  Inst.    Fetch  Reg. Rd  ALU op  Memory  Reg. Wr  Total
  R-type   2      1        2       0       1        6 ns
  Load     2      1        2       2       1        8 ns
  Store    2      1        2       2                7 ns
  Branch   2      1        2                        5 ns
  Jump     2                                        2 ns
19Fixed vs. Variable Cycle Length
- Let's assume a program has the following instruction mix: 24% loads, 12% stores, 44% R-type, 18% branches, 2% jumps.
- CPU execution time = Instruction count × Cycle time
- For the fixed cycle length the cycle time is 8 ns, long enough for the longest instruction (load). Thus each instruction takes 8 ns to execute.
- For the variable cycle time the average CPU clock cycle is 8 × 24% + 7 × 12% + 6 × 44% + 5 × 18% + 2 × 2% ≈ 6.3 ns.
- The variable clock implementation is clearly faster, but it is extremely hard to implement.
- So why not use the single-cycle implementation, which reaches 6.3/8 ≈ 79% of the variable-clock performance (it is only about 1.27 times slower)?
- When we add instructions such as multiply and divide, which can take tens of cycles, the fixed cycle must be long enough for them and this scheme is too slow.
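The arithmetic behind the comparison can be checked with a short script. It uses the unit delays and the instruction mix stated above; the variable names and the way each instruction's path is listed are choices made for this sketch.

```python
UNIT_NS = {"memory": 2, "alu": 2, "regfile": 1}

# Functional units each instruction class passes through, in order.
PATHS = {
    "load":   ["memory", "regfile", "alu", "memory", "regfile"],
    "store":  ["memory", "regfile", "alu", "memory"],
    "R-type": ["memory", "regfile", "alu", "regfile"],
    "branch": ["memory", "regfile", "alu"],
    "jump":   ["memory"],
}

MIX = {"load": 0.24, "store": 0.12, "R-type": 0.44, "branch": 0.18, "jump": 0.02}

# Time each class needs, the fixed cycle (longest class), and the weighted average.
LATENCY_NS = {name: sum(UNIT_NS[u] for u in path) for name, path in PATHS.items()}
FIXED_CYCLE_NS = max(LATENCY_NS.values())
AVERAGE_CYCLE_NS = sum(MIX[name] * LATENCY_NS[name] for name in MIX)

if __name__ == "__main__":
    print(LATENCY_NS)
    print(FIXED_CYCLE_NS, round(AVERAGE_CYCLE_NS, 2))
```

The unrounded average is 6.34 ns, which the slide rounds to 6.3 ns.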
20A Multicycle Implementation
- We broke each instruction into several steps; we can use these steps to build a multicycle implementation. Each step takes 1 clock cycle. The multicycle implementation allows a functional unit to be used more than once per instruction, as long as it is used on different clock cycles.
- We now have only a single memory unit and a single ALU. In addition we need registers to hold the output of each stage.
21New Registers and MUXs
- We have now added several new registers (which are transparent to the programmer) and some new MUXs. The new registers are:
  - Instruction Register (IR): the instruction fetched.
  - Memory Data Register (MDR): data read from memory.
  - A, B: registers read from the register file.
  - ALUOut: result of the ALU operation.
- The new MUXs added are:
  - An additional MUX on the 1st ALU input, which chooses between the A register and the PC.
  - The MUX on the 2nd ALU input is changed from a 2-way to a 4-way MUX. The additional inputs are the constant 4 (used to increment the PC) and the sign-extended and shifted offset field (used in beq).
22Multicycle Diagram
- There are 3 possible sources for the PC value:
  1. The output of the ALU, which is PC + 4.
  2. The register ALUOut, which holds the address of the computed branch target.
  3. The lower 26 bits of the IR shifted left by 2, concatenated with the 4 upper bits of the PC.
23The Instruction Execution Stages (1,2)
- 1. Instruction Fetch (IF): fetch the instruction from memory and compute the address of the next sequential instruction.
  IR = Memory[PC]
  PC = PC + 4
- 2. Instruction Decode (ID) and register fetch: get the registers from the register file and compute the potential branch address (even if it turns out not to be needed).
  A = Reg[IR[25-21]]
  B = Reg[IR[20-16]]
  ALUOut = PC + (sign-extend(IR[15-0]) << 2)
24The Instruction Execution Stages (3)
- 3. Execution (EX), memory address computation, or branch completion: in this stage the operation is determined by the instruction class.
  A. Memory reference: ALUOut = A + sign-extend(IR[15-0])
  B. R-type: ALUOut = A op B
  C. Branch: if (A == B) PC = ALUOut
  D. Jump: PC = PC[31-28] cat (IR[25-0] << 2)
25The Instruction Execution Stages (4,5)
- 4. Memory access (Mem) or R-type completion: during this step the load/store instruction accesses memory or the AL instruction writes its result.
  A. Memory reference: MDR = Memory[ALUOut] (load) or Memory[ALUOut] = B (store)
  B. R-type: Reg[IR[15-11]] = ALUOut
- 5. Memory read completion: the load completes by writing the value from memory into a register.
  Reg[IR[20-16]] = MDR
26Cycles Per Instruction (CPI)
- The CPI of a program defines how many cycles an average instruction takes. Assuming an instruction mix (for the gcc compiler) of 22% loads, 11% stores, 49% R-type, 16% branches, and 2% jumps, what is the CPI, assuming each state requires one clock cycle?
- The number of clock cycles for each instruction format is: loads 5, stores 4, R-type 4, branches 3, jumps 3.
- Thus the CPI = 0.22 × 5 + (0.11 + 0.49) × 4 + (0.16 + 0.02) × 3 = 4.04
- This is better than the worst-case CPI of 5, in which each instruction would have taken the same number of clock cycles as the longest one (load).
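The CPI figure is a weighted average, which the following sketch recomputes from the mix and cycle counts given above. The variable names are chosen for this example.

```python
CYCLES = {"load": 5, "store": 4, "R-type": 4, "branch": 3, "jump": 3}
GCC_MIX = {"load": 0.22, "store": 0.11, "R-type": 0.49, "branch": 0.16, "jump": 0.02}

# Average cycles per instruction, weighted by how often each class occurs.
CPI = sum(GCC_MIX[name] * CYCLES[name] for name in CYCLES)

# Worst case: every instruction takes as many cycles as the longest one.
WORST_CASE_CPI = max(CYCLES.values())

if __name__ == "__main__":
    print(round(CPI, 2), WORST_CASE_CPI)
```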
27Exceptions
- One of the hardest parts of control is implementing exceptions and interrupts: events other than branches and jumps that change the normal flow of instruction execution.
- An exception is an unexpected event that happens during program execution, such as an arithmetic overflow or an illegal instruction (which are the only 2 in our design).
- An interrupt is an event that is external to the processor, such as a request by an I/O device.
- When an exception occurs the machine must save the address of the offending instruction in the exception program counter (EPC), and then transfer execution to the OS. The OS might service the exception and return control to the program, or terminate execution.
28Causes of Exceptions
- In order for the OS to handle the exception it must know the cause of the exception. MIPS has a register called the Cause register, which holds the reason for the exception.
- A second method is called vectored interrupts. In a vectored interrupt the address to which control is transferred is determined by the exception cause. The OS knows the cause of the exception from the address that is jumped to.
- We need two additional registers: the EPC, which holds the address of the offending instruction, and the Cause register, which holds 0 for an undefined instruction and 1 for arithmetic overflow.
- We will need 2 control signals to write to the EPC and Cause registers (EPCWrite and CauseWrite) and a signal to set the LSB of the Cause register (IntCause).
29Datapath with Exceptions
- The control sets IntCause when it can't decode the instruction or when the ALU signals an overflow. The next-PC MUX now has 4 inputs: the exception handler address is added.
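What the hardware does when an exception is detected can be sketched as three register writes. The handler address below is a placeholder, since the slides do not give one, and the function and dictionary names are likewise hypothetical. The Cause values (0 for an undefined instruction, 1 for arithmetic overflow) are the ones stated on slide 28.

```python
HANDLER_ADDRESS = 0x40000000   # hypothetical; the slides give no handler address
UNDEFINED_INSTRUCTION = 0      # Cause value for an instruction that can't be decoded
ARITHMETIC_OVERFLOW = 1        # Cause value for an ALU overflow


def take_exception(state, offending_pc, cause):
    """Record the exception and redirect the PC to the handler."""
    state["EPC"] = offending_pc        # EPCWrite: save the offending instruction's address
    state["Cause"] = cause             # CauseWrite: value selected by IntCause
    state["PC"] = HANDLER_ADDRESS      # 4th input of the next-PC MUX
    return state
```

The OS can later read the Cause register to pick a service routine and the EPC to decide where, or whether, to resume the program.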