CS 4290/6290 Lecture 04: MIPS, Dataflow Design, Pipelining

Provided by: michaelt8
1
CS 4290/6290 Lecture 04: MIPS, Dataflow Design,
Pipelining
  • (Lectures based on the work of Jay Brockman,
    Sharon Hu, Randy Katz, Peter Kogge, Bill Leahy,
    Ken MacKenzie, Richard Murphy, Michael Niemier,
    and Milos Prvulovic)

2
The organization of a computer
  • Von Neumann Model
  • Stored-program machine: instructions are
    represented as numbers
  • Programs can be stored in memory to be
    read/written just like numbers.

[Figure: computer organization. Labels: Compiler, Processor (Control, Datapath), Memory, Input, Output]
3
Functions of Each Component
  • Datapath: performs data manipulation operations
  • arithmetic logic unit (ALU)
  • floating point unit (FPU)
  • Control: directs operation of other components
  • finite state machines
  • micro-programming
  • Memory: stores instructions and data
  • random access vs. sequential access
  • volatile vs. non-volatile
  • RAMs (SRAM, DRAM), ROMs (PROM, EEPROM), disk
  • tradeoff between speed and cost/bit
  • Input/Output: I/O devices interface to the
    environment
  • mouse, keyboard, display, device drivers

4
The Performance Perspective
  • Performance of a machine is determined by
  • Instruction count, clock cycles per instruction,
    clock cycle time
  • Processor design (datapath and control)
    determines
  • Clock cycles per instruction
  • Clock cycle time
  • We will discuss two implementations.
  • Single-Cycle Implementation (a bx cx2
    example)
  • Advantage: one clock cycle per instruction
  • Disadvantage: less flexible
  • Multiple-Cycle Implementation (bus based)
  • Advantage: shorter clock cycle times, different
    number of cycles for different instructions,
    functional unit sharing
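The three factors listed above multiply into total CPU time. A minimal sketch of that product follows; the instruction count, CPI values, and cycle times are made-up illustrative numbers, not figures from the lecture.

```python
def cpu_time_ns(instruction_count, cpi, cycle_time_ns):
    """CPU time = instruction count x cycles per instruction x clock cycle time."""
    return instruction_count * cpi * cycle_time_ns

# Hypothetical program of 1,000,000 instructions on two hypothetical designs:
# a single-cycle machine (CPI = 1, long cycle) and a multi-cycle machine
# (higher CPI, shorter cycle).
single_cycle = cpu_time_ns(1_000_000, cpi=1.0, cycle_time_ns=8.0)
multi_cycle = cpu_time_ns(1_000_000, cpi=4.0, cycle_time_ns=2.0)
```

With these particular assumed numbers the two designs tie, which is the point of the tradeoff: a shorter cycle only helps if CPI does not grow by the same factor.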

5
MIPS Instruction Formats
  • All MIPS instructions are 32 bits (4 bytes) long.
  • R-type
  • I-Type
  • J-type
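As a rough sketch of how the three formats share one 32-bit word, the function below slices a word into every field of the standard MIPS R-, I-, and J-type layouts. The function name `decode` and the dictionary keys are illustrative choices, not names from the slides.

```python
def decode(word):
    """Split a 32-bit MIPS instruction word into the fields of all three formats."""
    return {
        "op":     (word >> 26) & 0x3F,   # bits 31-26, all formats
        "rs":     (word >> 21) & 0x1F,   # bits 25-21, R- and I-type
        "rt":     (word >> 16) & 0x1F,   # bits 20-16, R- and I-type
        "rd":     (word >> 11) & 0x1F,   # bits 15-11, R-type
        "shamt":  (word >> 6) & 0x1F,    # bits 10-6,  R-type
        "funct":  word & 0x3F,           # bits 5-0,   R-type
        "imm16":  word & 0xFFFF,         # bits 15-0,  I-type
        "target": word & 0x3FFFFFF,      # bits 25-0,  J-type
    }

# Encoding of add $5, $6, $7: op=0, rs=6, rt=7, rd=5, shamt=0, funct=0x20
fields = decode(0x00C72820)
```

Which fields are meaningful depends on the opcode; the hardware extracts all of them in parallel and lets control decide which ones to use.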

6
The MIPS Subset
  • Consider a subset of instructions
  • memory-reference: lw, sw
  • arithmetic-logical: add, sub, and, or, slt
  • branching: beq, j
  • Organizational overview
  • fetch an instruction based on the content of PC
  • decode the instruction
  • fetch operands
  • (read one or two registers)
  • execute
  • (effective address calculation/arithmetic-logical
    operations/comparison)
  • store result
  • (write to memory / write to register / update PC)

At the simplest level, this is how the Von Neumann, RISC
model works
7
Implementation Overview
Simplest view of a Von Neumann, RISC µP
  • Abstract / Simplified View
  • Two types of signals: data and control
  • Clocking strategy: all storage elements are clocked
    by the same
  • clock edge.

[Figure: abstract datapath. Labels: PC, Instruction Memory (Address, Instruction), Register File (Ra, Rb, Rw, Data), ALU, Data Memory (Address, Data)]
8
Review of Design Steps
  • Instruction set architecture => RTL
    representation
  • RTL representation =>
  • Datapath components
  • Datapath interconnects
  • Datapath components => Control signals
  • Control signals => Control logic
  • Writing RTL: how many states (cycles) should an
    instruction take?
  • CPI
  • Datapath component sharing

i.e. PC <- PC + 4
(or 4 ? 3 2)
need these to do
9
Single Cycle Implementation
  • Each instruction takes one cycle to complete.
  • We wait for everything to settle down, and the
    right thing to be done
  • ALU might not produce right answer right away
    (why?)
  • we use write signals along with clock to
    determine when to write
  • Cycle time determined by length of the longest
    path

Referring to 2 slides ago, which instruction
takes the longest?
10
An exercise in dataflow design
  • OK, as a class exercise, we're going to design a
    simple MIPS dataflow.
  • FYI, the slides that describe this are in
    Appendix A
  • but let's do this together first
  • and think about ways to make it better along the
    way
  • We'll use the instruction formats to help

11
Let's start with a few instructions
  • For example
  • Add $5, $6, $7
  • SW 0($9), $10
  • Sub $1, $2, $3
  • LW $11, 0($12)
  • We want to execute these instructions in order.
  • What's the first thing we have to do?

12
Let's say we want to fetch an R-type
instruction (arithmetic)
  • Instruction format
  • RTL
  • Instruction fetch: mem[PC]
  • ALU operation: reg[rd] <- reg[rs] op reg[rt]
  • Go to next instruction: PC <- PC + 4
  • Ra, Rb and Rw are from the instruction's rs, rt, rd
    fields.
  • Actual ALU operation and register write should
    occur after decoding the instruction.
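The R-type register transfers can be sketched directly in software. This is a minimal illustration, assuming a 32-entry register list and a small set of operations; `exec_rtype` and its parameters are hypothetical names.

```python
def exec_rtype(regs, pc, rs, rt, rd, op):
    """reg[rd] <- reg[rs] op reg[rt]; returns the next PC (PC + 4)."""
    ops = {
        "add": lambda a, b: a + b,
        "sub": lambda a, b: a - b,
        "and": lambda a, b: a & b,
        "or":  lambda a, b: a | b,
    }
    regs[rd] = ops[op](regs[rs], regs[rt]) & 0xFFFFFFFF  # keep results to 32 bits
    regs[0] = 0  # register $0 is hard-wired to zero in MIPS
    return pc + 4

regs = [0] * 32
regs[6], regs[7] = 10, 32
next_pc = exec_rtype(regs, pc=0x400000, rs=6, rt=7, rd=5, op="add")
```

Here rs and rt play the role of the Ra and Rb read ports and rd the Rw write port.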

13
Let's say we want to fetch an I-Type
Arithmetic/Logic Instruction
  • Instruction format
  • RTL for arithmetic operations, e.g., ADDI
  • Instruction fetch: mem[PC]
  • Add operation: reg[rt] <- reg[rs] +
    SignExt(imm16)
  • Go to next instruction: PC <- PC + 4
  • Also, immediate instructions

14
Let's say we want to fetch an I-Type Load/Store
Instruction
  • Instruction format
  • RTL for load/store operations, e.g., LW
  • Instruction fetch: mem[PC]
  • Compute memory address: Addr <- reg[rs] +
    SignExt(imm16)
  • Load data into register: reg[rt] <- mem[Addr]
  • Go to next instruction: PC <- PC + 4
  • How about store?

Same thing, but the 3rd step becomes a memory write
(mem[Addr] <- reg[rt])
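A small sketch of the load and store transfers, including sign extension of the 16-bit immediate. Memory is modeled as a dictionary keyed by address, which is a simplification; the function names are illustrative. In MIPS the value a store writes comes from reg[rt].

```python
def sign_ext16(imm16):
    """Sign-extend a 16-bit immediate to a Python int."""
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16

def exec_lw(regs, mem, pc, rs, rt, imm16):
    addr = (regs[rs] + sign_ext16(imm16)) & 0xFFFFFFFF  # Addr <- reg[rs] + SignExt(imm16)
    regs[rt] = mem.get(addr, 0)                         # reg[rt] <- mem[Addr]
    return pc + 4

def exec_sw(regs, mem, pc, rs, rt, imm16):
    addr = (regs[rs] + sign_ext16(imm16)) & 0xFFFFFFFF  # same address computation
    mem[addr] = regs[rt]                                # mem[Addr] <- reg[rt]
    return pc + 4

regs = [0] * 32
mem = {}
regs[9], regs[10] = 0x1000, 77
pc = exec_sw(regs, mem, 0, rs=9, rt=10, imm16=0xFFFC)  # offset -4: store at 0x0FFC
regs[12] = 0x0FFC
pc = exec_lw(regs, mem, pc, rs=12, rt=11, imm16=0)     # load it back into $11
```

Load and store share the address computation; they differ only in the direction of the final transfer.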
15
Let's say we want to fetch an I-Type Branch
Instruction
  • Instruction format
  • RTL for branch operations, e.g., BEQ
  • Instruction fetch: mem[PC]
  • Compute condition: Cond <- reg[rs] - reg[rt]
  • Calculate the next instruction's address
  • if (Cond eq 0) then
  • PC <- PC + 4 + (SignExt(imm16) x 4)
  • else ?
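The branch decision can be sketched as a next-PC function. In MIPS the not-taken case simply falls through to PC + 4, which is how the sketch fills in the else branch; `beq_next_pc` is an illustrative name.

```python
def sign_ext16(imm16):
    """Sign-extend a 16-bit immediate to a Python int."""
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16

def beq_next_pc(pc, rs_val, rt_val, imm16):
    cond = (rs_val - rt_val) & 0xFFFFFFFF  # Cond <- reg[rs] - reg[rt]
    if cond == 0:
        # Taken: PC <- PC + 4 + (SignExt(imm16) x 4)
        return (pc + 4 + sign_ext16(imm16) * 4) & 0xFFFFFFFF
    return (pc + 4) & 0xFFFFFFFF           # Not taken: fall through

taken = beq_next_pc(0x100, 5, 5, 3)
not_taken = beq_next_pc(0x100, 5, 6, 3)
backward = beq_next_pc(0x100, 1, 1, 0xFFFF)  # offset of -1 word
```

The offset counts words relative to the instruction after the branch, so an offset of -1 branches back to the branch itself.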

16
Let's say we want to fetch a J-Type Jump
Instruction
  • Instruction format
  • RTL operations, e.g., J
  • Instruction fetch: mem[PC]
  • Set up PC: PC <- ((PC + 4)<31:28>
    CONCAT (target<25:0> x 4))
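A sketch of the jump target computation, assuming the standard MIPS encoding: the upper four bits of PC + 4 are kept, and the 26-bit target is shifted left by two to fill the remaining 28 bits. `jump_target` is an illustrative name.

```python
def jump_target(pc, target26):
    upper = (pc + 4) & 0xF0000000          # upper four bits of PC + 4
    lower = (target26 & 0x3FFFFFF) << 2    # 26-bit target x 4 gives 28 bits
    return upper | lower

new_pc = jump_target(0x10000008, 0x10)
```

Because only 28 bits come from the instruction, a jump stays within the current 256 MB region of the address space.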

17
What do we get? A Single Cycle Datapath
[Figure: single-cycle datapath. Recoverable labels: PCSrc, Add, 4, ALUctr, MemWrite, ALUSrc, MemtoReg, Zero, ALU, ALU result, Read data, Address, Mux, Data memory, Write data, RegWrite, Sign extend, MemRead]
If you don't understand this, take a look at
Appendix A
18
Control Logic
19
The HW needed, plus control
Single cycle MIPS machine
When we talk about control, we talk about these
blocks
20
Implementing Control
  • Implementation Steps Review
  • Identify control inputs and control outputs
  • Make a control signal table for each cycle
  • Derive control logic from the control table
  • As you've seen (and as we'll review), this logic
    can take on many forms: combinational logic,
    ROMs, microcode, or combinations

21
Single Cycle Control Input/Output
  • Control Inputs
  • Opcode (6 bits)
  • How about R-type instructions?
  • Control Outputs
  • RegDst
  • ALUSrc
  • MemtoReg
  • RegWrite
  • MemRead
  • MemWrite
  • Branch
  • Jump
  • ALUctr

Step 2: Make a control signal table for each cycle
22
Control Signal Table
[Table: control signals (outputs) for each instruction type (inputs), e.g. R-type]
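As a sketch of what such a table contains, the settings below follow the usual textbook single-cycle MIPS control table. Treat them as an assumption about the slide's table rather than a copy of it; `None` marks a don't-care, and the names `CONTROL`, `SIGNALS`, and `control` are illustrative.

```python
SIGNALS = ("RegDst", "ALUSrc", "MemtoReg", "RegWrite",
           "MemRead", "MemWrite", "Branch", "ALUOp")

CONTROL = {
    #         RegDst ALUSrc MemtoReg RegWrite MemRead MemWrite Branch ALUOp
    "R-type": (1,    0,     0,       1,       0,      0,       0,     0b10),
    "lw":     (0,    1,     1,       1,       1,      0,       0,     0b00),
    "sw":     (None, 1,     None,    0,       0,      1,       0,     0b00),
    "beq":    (None, 0,     None,    0,       0,      0,       1,     0b01),
}

def control(instr_class):
    """Look up the control outputs for one instruction class."""
    return dict(zip(SIGNALS, CONTROL[instr_class]))

lw_signals = control("lw")
```

Each row is one line of combinational logic driven by the opcode: only lw asserts MemRead, only sw asserts MemWrite, and only beq asserts Branch.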
23
The HW needed, plus control
Single cycle MIPS machine
24
Main control, ALU control
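[Figure: Main Control takes the 6-bit OP field (opcode) and produces the 2-bit ALUOp plus the other control signals; ALU Control takes ALUOp and the 6-bit Func field and produces the 3-bit ALUctr that drives the ALU]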
  • Use OP field to generate ALUOp (encoding)
  • Control signal fed to ALU control block
  • Use Func field and ALUOp to generate ALUctr
    (decoding)
  • Specifically sets 3 ALU control signals
  • B-Invert, Carry-in, operation

25
Main control, ALU control
Or in other words:
00: ALU performs add
01: ALU performs sub
10: ALU does what the function code says
(see p. 284 for more)
26
Generating ALUctr
  • We want these outputs

[Figure: ALU output mux inputs: and = 00, or = 01, adder = 10, less = 11]
ALUctr<2>: B-negate (C-in and B-invert)
ALUctr<1>: Select ALU Output
ALUctr<0>: Select ALU Output
Invert B and C-in must be 1 for subtract
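The two-level decode can be sketched as a lookup: ALUOp picks add or subtract directly, or defers to the function field. The ALUctr values combine the B-negate bit with the mux selects listed above; the funct codes are the standard MIPS encodings, and `alu_control` is an illustrative name.

```python
FUNCT_TO_ALUCTR = {
    0x20: 0b010,  # add: adder selected
    0x22: 0b110,  # sub: B-negate set, adder selected
    0x24: 0b000,  # and
    0x25: 0b001,  # or
    0x2A: 0b111,  # slt: B-negate set, "less" selected
}

def alu_control(alu_op, funct):
    """Map the 2-bit ALUOp and 6-bit funct field to the 3-bit ALUctr."""
    if alu_op == 0b00:              # lw/sw: add to form the address
        return 0b010
    if alu_op == 0b01:              # beq: subtract to compare
        return 0b110
    return FUNCT_TO_ALUCTR[funct]   # ALUOp = 10: R-type, use the function code

slt_ctr = alu_control(0b10, 0x2A)
```

Splitting the decode this way keeps the main control small: it only needs to produce two bits, and the function field never reaches it.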
27
The Logic
This table is used to generate the actual Boolean
logic gates that produce ALUctr.
Could generate gates by hand, often done w/SW.
[Table: truth table from ALUOp (ALUOp1, ALUOp0) and func<5:0> (F3 to F0) to ALUctr<2:0>. Example: ALUctr<2> is 1 for SUB/BEQ, whose ALUctr is 110]
28
Recall
Single cycle MIPS machine
Recall, for MIPS, we have to build a Main Control
Block and an ALU Control Block
29
Well, here's what we did
Single cycle MIPS machine
We came up with the information to generate this
logic which would fit here in the datapath.
30
Single cycle versus multi-cycle
31
Single Cycle Implementation
  • Calculate cycle time assuming negligible delays
    except:
  • memory (2 ns), ALU and adders (2 ns), register file
    access (1 ns)
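One way to work the calculation is to sum the unit delays along each instruction class's path and take the maximum. The delays come from the slide; which units each class passes through is the usual textbook assumption, and the names below are illustrative.

```python
MEM, ALU, REG = 2, 2, 1  # ns: memory, ALU/adders, register file access

PATHS = {
    "R-type": [MEM, REG, ALU, REG],       # fetch, register read, ALU, register write
    "lw":     [MEM, REG, ALU, MEM, REG],  # fetch, register read, address, data read, write
    "sw":     [MEM, REG, ALU, MEM],       # fetch, register read, address, data write
    "beq":    [MEM, REG, ALU],            # fetch, register read, compare
    "j":      [MEM],                      # fetch only
}

delay_ns = {name: sum(path) for name, path in PATHS.items()}
cycle_time_ns = max(delay_ns.values())    # the clock must fit the slowest class
```

Under these assumptions the load sets the cycle time at 8 ns, even though a jump needs only 2 ns.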

32
Single-Cycle Implementation (Cont'd)
  • Single-cycle, fixed-length clock
  • CPI = 1
  • Clock cycle = propagation delay of the longest
    datapath operations among all instruction types
  • Easy to implement
  • Single-cycle, variable-length clock
  • CPI = 1
  • Clock cycle = Σ (fraction of type-i instructions x
    propagation delay of the type-i instruction
    datapath operations)
  • Better than the previous, but impractical to
    implement
  • Disadvantages
  • What if we have floating-point operations?
  • How about component usage?
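A sketch comparing the two clocking schemes. The per-class delays assume the 2 ns / 2 ns / 1 ns unit delays given earlier and the usual datapath paths; the instruction mix is purely hypothetical.

```python
# Per-class datapath delays in ns (assumed paths, unit delays from the slides).
delay_ns = {"R-type": 6, "lw": 8, "sw": 7, "beq": 5, "j": 2}

# Hypothetical instruction mix (fractions sum to 1).
mix = {"R-type": 0.44, "lw": 0.24, "sw": 0.12, "beq": 0.18, "j": 0.02}

fixed_clock_ns = max(delay_ns.values())                          # longest path
variable_clock_ns = sum(mix[k] * delay_ns[k] for k in delay_ns)  # weighted average
speedup = fixed_clock_ns / variable_clock_ns
```

With this assumed mix the variable clock averages about 6.34 ns against 8 ns fixed, a gain of roughly 26 percent.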

33
Multiple Cycle Alternative
  • Break an instruction into smaller steps
  • Execute each step in one cycle.
  • Execution sequence
  • Balance amount of work to be done
  • Restrict each cycle to use only one major
    functional unit
  • At the end of a cycle
  • Store values for use in later cycles, why?
  • Introduce additional internal registers
  • The advantages
  • Cycle time much shorter
  • Different instructions take different numbers of cycles to
    complete
  • Functional unit used more than once per
    instruction

34
Step 1: Instruction Fetch
  • Use PC to get instruction, put it in IR.
  • Increment PC by 4, put the result back in PC.
  • Can you write this using the RTL notation?
  • IR <- Memory[PC], PC <- PC + 4
  • What is the advantage of updating the PC now?

35
Step 2: I-Decode and Register Fetch
  • Read registers rs and rt in case we need them
  • Compute branch address in case instruction is
    branch
  • RTL: A <- Reg[IR[25-21]]
  • B <- Reg[IR[20-16]]
  • ALUOut <- PC + (sign-extend(IR[15-0]) << 2)
  • Did we set any control lines based on the
    instruction type? (we are busy "decoding" it in
    our control logic)

Means in parallel
36
Step 3 (Instruction dependent)
  • ALU is performing 1 of 3 functions, based on
    instruction type
  • Memory Reference: ALUOut <- A +
    sign-extend(IR[15-0])
  • R-type: ALUOut <- A op B
  • Branch: if (A == B) then (PC <- ALUOut)

37
Step 4 (R-type or memory-access)
  • Loads and stores access memory: MDR <-
    Memory[ALUOut] or Memory[ALUOut] <- B
  • R-type instructions finish: Reg[IR[15-11]] <-
    ALUOut
  • When does the write actually take place?
  • At the end of the cycle, on the edge.

38
Step 5: Write-Back
  • Reg[IR[20-16]] <- MDR
  • What about all the other instructions?
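To make the five steps concrete, here is a sketch that walks a single lw through them, using the internal registers IR, A, B, ALUOut, and MDR. Memories are modeled as dictionaries keyed by address; the function name and the test values are illustrative assumptions.

```python
def sign_ext16(x):
    """Sign-extend a 16-bit value to a Python int."""
    return x - 0x10000 if x & 0x8000 else x

def run_lw_multicycle(pc, regs, imem, dmem):
    """Walk one lw through the five steps; returns (new_pc, cycles)."""
    # Step 1: instruction fetch
    ir = imem[pc]
    pc = pc + 4
    # Step 2: decode and register fetch; the branch target is computed
    # "in case we need it" and is simply overwritten for lw
    a = regs[(ir >> 21) & 0x1F]
    b = regs[(ir >> 16) & 0x1F]  # read speculatively; lw does not use it
    alu_out = pc + (sign_ext16(ir & 0xFFFF) << 2)
    # Step 3: memory address computation
    alu_out = a + sign_ext16(ir & 0xFFFF)
    # Step 4: memory access
    mdr = dmem[alu_out]
    # Step 5: write-back
    regs[(ir >> 16) & 0x1F] = mdr
    return pc, 5

regs = [0] * 32
regs[12] = 0x2000
imem = {0: 0x8D8B0000}  # lw $11, 0($12)
dmem = {0x2000: 99}
new_pc, cycles = run_lw_multicycle(0, regs, imem, dmem)
```

Other instruction classes stop earlier: they skip the steps they do not need, which is where the multi-cycle design gains its time.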

39
Single cycle
40
Multiple Cycle Design
  • Break up instructions into steps, each step takes
    1 cycle
  • balance work to be done
  • restrict each cycle to use only 1 major
    functional unit
  • At the end of a cycle
  • store values for use in later cycles (easiest
    thing to do)
  • introduce additional internal registers

41
Execution Sequence Summary
IR <- Memory[PC]
PC <- PC + 4
A <- Reg[IR(25:21)]
B <- Reg[IR(20:16)]
ALUOut <- PC + (SignExt(IR(15:0)) << 2)
42
Control Signals
New
  • PC: PCWrite, PCWriteCond, PCSource
  • Memory: IorD, MemRead, MemWrite
  • IR: IRWrite
  • Reg. File: RegWrite, MemtoReg, RegDst
  • ALU: ALUSrcA, ALUSrcB, ALUOp, ALUCnt.

Old
RegDst, MemToReg, RegWrite, MemRead, MemWrite,
Branch, ALUSrc, ALUOp, ALUCnt.
43
Implementing the Control
  • Value of control signals is dependent upon
  • what instruction is being executed
  • which step is being performed
  • Use accumulated information to specify a finite
    state machine
  • use a state diagram, or
  • use microprogramming
  • Implementation can be derived from specification

44
Graphical Specification of FSM
[Figure: FSM state diagram; states and asserted control signals listed below]
State 0, Instruction fetch (start): MemRead, ALUSrcA = 0, IorD = 0, IRWrite, ALUSrcB = 01, ALUOp = 00, PCWrite, PCSource = 00
State 1, Instruction decode / Register fetch: ALUSrcA = 0, ALUSrcB = 11, ALUOp = 00
State 2, Memory address computation: ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00
State 3, Memory access (load): MemRead, IorD = 1
State 4, Memory read completion: RegDst = 0, RegWrite, MemToReg = 1
State 5, Memory access (store): MemWrite, IorD = 1
State 6, Execution: ALUSrcA = 1, ALUSrcB = 00, ALUOp = 10
State 7, R-type completion: RegDst = 1, RegWrite, MemToReg = 0
State 8, Branch completion: ALUSrcA = 1, ALUSrcB = 00, ALUOp = 01, PCWriteCond, PCSource = 01
State 9, Jump completion: PCWrite, PCSource = 10
Tells us what values are needed and during what
step
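The state diagram can be sketched as a next-state function. The state numbering (0 fetch, 1 decode, 2 address computation, 3 load access, 4 read completion, 5 store access, 6 execution, 7 R-type completion, 8 branch, 9 jump) is an assumed mapping that follows the usual textbook FSM; the function names are illustrative.

```python
def next_state(state, op):
    """Next FSM state for op in {'lw', 'sw', 'rtype', 'beq', 'j'}."""
    if state == 0:
        return 1                                   # fetch -> decode
    if state == 1:
        return {"lw": 2, "sw": 2, "rtype": 6, "beq": 8, "j": 9}[op]
    if state == 2:
        return 3 if op == "lw" else 5              # load access or store access
    if state == 3:
        return 4                                   # memory read completion
    if state == 6:
        return 7                                   # R-type completion
    return 0                                       # states 4, 5, 7, 8, 9 go back to fetch

def state_sequence(op):
    """States visited by one instruction, starting at fetch."""
    seq, state = [0], next_state(0, op)
    while state != 0:
        seq.append(state)
        state = next_state(state, op)
    return seq

cycles = {op: len(state_sequence(op)) for op in ("lw", "sw", "rtype", "beq", "j")}
```

The length of each sequence is that instruction's cycle count: five for a load, four for a store or R-type, three for a branch or jump.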
45
Finite State Machine for Control
Control logic is inside this box (could be
implemented in many different ways)
The outputs that we want are now also dependent
on the current state.
could be ROM, logic, etc.
Inputs (which now also include the previous state)
(Still might need ALU control logic and hence
function code developed earlier)
46
Microprogramming
  • For our example, state diagrams and combinational
    logic are more than adequate
  • But we're dealing with a small subset of the MIPS
    processor
  • Full MIPS instruction set has over 100
    instructions
  • In one implementation, instructions take from 1 to
    20 clock cycles
  • Control would be much more complex for this case
  • Another alternative: microcoding
  • Think of control signals that must be asserted in
    a state as an instruction to be executed by
    datapath
  • Call these microinstructions

47
The entire microprogram
48
Sample Microinstruction
  • Ifetch: IR <- Mem[PC]; PC <- PC + 4

Microinstruction: 1d011ddd000100d11
49
Pipelining
50
Pipelining: It's Natural!
  • Laundry Example
  • Ann, Brian, Cathy, Dave each have one load of
    clothes to wash, dry, and fold
  • Washer takes 30 minutes
  • Dryer takes 40 minutes
  • Folder takes 20 minutes

51
Sequential Laundry
[Figure: sequential laundry timeline from 6 PM to midnight (axes: Time, Task Order); each of the four loads takes 30 + 40 + 20 minutes, one load after another]
  • Sequential laundry takes 6 hours for 4 loads
  • If they learned pipelining, how long would
    laundry take?

52
Pipelined Laundry: Start work ASAP
[Figure: pipelined laundry timeline from 6 PM to midnight (axes: Time, Task Order)]
Note: more time to go out later that night
  • Pipelined laundry takes 3.5 hours for 4 loads
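Both laundry totals can be checked with a short simulation. The sketch assumes each stage starts a load as soon as the load is ready and the stage is free, with a hamper between stages to hold waiting loads; the function names are illustrative.

```python
STAGES = [30, 40, 20]  # washer, dryer, folder (minutes)

def sequential_minutes(loads):
    """One load finishes completely before the next begins."""
    return loads * sum(STAGES)

def pipelined_minutes(loads):
    """Each stage starts as soon as its input is ready and the stage is free."""
    stage_free = [0] * len(STAGES)  # time at which each stage becomes free
    done = 0
    for _ in range(loads):
        ready = 0                   # all loads are available at time 0
        for s, duration in enumerate(STAGES):
            start = max(ready, stage_free[s])
            ready = stage_free[s] = start + duration
        done = ready
    return done

seq = sequential_minutes(4)
pipe = pipelined_minutes(4)
```

The pipelined total works out to 30 + 4 x 40 + 20 minutes: once the pipeline fills, the 40-minute dryer sets the pace.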

53
Pipelining Lessons
  • Multiple tasks operating simultaneously
  • Pipelining doesn't help latency of a single task,
    it helps throughput of the entire workload
  • Pipeline rate is limited by the slowest pipeline stage
  • Potential speedup = number of pipe stages
  • Unbalanced lengths of pipe stages reduce speedup
  • Also, need time to fill and drain the
    pipeline.

[Figure: pipelined laundry timeline, 6 PM to 9 PM (axes: Time, Task Order)]
54
Pipelining: Some terms
  • If you're doing laundry or implementing a µP,
    each stage where something is done is called a pipe
    stage
  • In the laundry example, washer, dryer, and folding
    table are pipe stages; clothes enter at one end and
    exit at the other
  • In a µP, instructions enter at one end and have
    been executed when they leave
  • Another example: auto assembly line
  • Throughput is how often stuff comes out of a
    pipeline

55
More technical detail
  • If times for all S stages are equal to T
  • Time for one initiation to complete is still ST
  • Time between 2 initiations is T, not ST
  • Initiations per second = 1/T
  • Pipelining: overlap multiple executions of the same
    sequence
  • Improves THROUGHPUT, not the time to perform a
    single operation
  • Other examples
  • Automobile assembly plant, chemical factory,
    garden hose, cooking

56
More technical detail
  • Book's approach to drawing pipeline timing diagrams
  • Time runs left-to-right, in units of stage time
  • Each row below corresponds to a distinct
    initiation
  • Boundary between 2 column entries: pipeline register
  • (i.e. hamper)
  • Must look at column contents to see what stage is
    doing what

Time for N initiations to complete = NT + (S-1)T
Throughput: time per initiation = T + (S-1)T/N ->
T!
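The timing formula is easy to check numerically. A minimal sketch, with illustrative function names; N = 1000 and S = 4 are example values.

```python
def pipelined_time(n, s, t):
    """Time for n initiations through s equal stages of length t: NT + (S-1)T."""
    return n * t + (s - 1) * t

def unpipelined_time(n, s, t):
    """Each initiation takes all s stages before the next one starts."""
    return n * s * t

def speedup(n, s, t=1):
    return unpipelined_time(n, s, t) / pipelined_time(n, s, t)

example = speedup(1000, 4)    # 4000 / 1003
large_n = speedup(10**6, 5)   # approaches 5 as N grows
```

As N grows the fill time (S-1)T is amortized away, so the time per initiation approaches T and the speedup approaches S.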
57
Ideal digital system pipeline speedup
Unpipelined
[Figure: four blocks of combinational logic, each with delay t, between an input latch and an output latch]
delay for 1 piece of data = 4t + latch setup
(assume small)
approximate delay for 1000 pieces of data = 4000t
Pipelined
[Figure: the same four blocks of combinational logic, delay t each, separated by latches]
delay for 1 piece of data = 4(t + latch setup)
approximate delay for 1000 pieces of data = 3t +
1000t
speedup for 1000 pieces of data = 4000/1003 ≈ 4
Ideal speedup = number of pipeline stages
58
The new look dataflow
[Figure: pipelined datapath. Pipeline registers: IF/ID, ID/EX, EX/MEM, MEM/WB. Labels: PC, ADD, 4, Mux, Inst. Memory, IR6...10, IR11..15, Register File, Sign Extend (16 to 32), Comp., Branch taken, ALU, Data Mem., MEM/WB.IR]
Data must be stored from one stage to the
next in pipeline registers/latches, which hold
temporary values between clocks and info needed
for execution.
59
Another way to look at it
Clock Number
Time
Program execution order (in instructions)
60
So, what about the details?
  • In each cycle, a new instruction is fetched and begins
    its 5-cycle execution
  • In a perfect world (pipeline), performance improves
    5 times over!
  • So, that's it, huh? Hardly!!!
  • What else do we have to worry about?
  • Must know what's going on in every cycle of the
    machine
  • What if 2 instructions try to use the same
    resource at same time?
  • (LOTS more on this later)
  • Separate instruction/data memories, multiple
    register ports, etc. help avoid this

61
Limits, limits, limits
  • So, now that the ideal stuff is out of the way,
    let's look at how a pipeline REALLY works
  • Pipelines are slowed b/c of
  • Pipeline latency
  • Imbalance of pipeline stages
  • (Think: a chain is only as strong as its weakest
    link)
  • Well, a pipeline is only as fast as its slowest
    stage
  • Pipeline overhead (from where?)
  • Register delay from pipe stage latches
  • Clock skew: once a clock cycle is as small as
    the sum of the clock skew and latch overhead, you
    can't get any work done
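The overhead limit can be seen with a small model: split a fixed amount of logic evenly across the stages and add a constant per-stage overhead for latches and skew. The 10 ns and 0.5 ns figures are hypothetical, and the function names are illustrative.

```python
def pipelined_cycle(total_logic_ns, stages, overhead_ns):
    """Cycle time with perfectly balanced stages plus latch/skew overhead."""
    return total_logic_ns / stages + overhead_ns

def speedup(total_logic_ns, stages, overhead_ns):
    """Throughput gain over running all the logic in one unpipelined cycle."""
    return total_logic_ns / pipelined_cycle(total_logic_ns, stages, overhead_ns)

# Hypothetical: 10 ns of logic, 0.5 ns of overhead per stage.
five_stage = speedup(10, 5, 0.5)      # cycle 2.5 ns
deep_pipe = speedup(10, 1000, 0.5)    # cycle barely above the overhead
```

In this model, five stages yield a speedup of 4 rather than 5, and no depth can exceed 10 / 0.5 = 20, because the cycle time can never drop below the overhead.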

62
Note
  • See Appendix B in the supplementary materials for
    more detail, examples.

63
Control Signals in a Pipeline
64
Questions about control signals
  • Following discussion relevant to a single
    instruction
  • Q: Are all control signals active at the same
    time?
  • A: ?
  • Q: Can we generate all these signals at the same
    time?
  • A: ?

65
Passing control w/pipe registers
  • Analogy: send instructions with the car on the assembly
    line
  • Install Corinthian leather interior on car 6 at
    stage 3
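The assembly-line analogy can be sketched in a few lines: each instruction carries its control bits from stage to stage, and a stage acts only on the bits of the instruction it currently holds. This is a simplified model (five stages, one control bit, control attached on entry rather than produced by decode); all names are illustrative.

```python
from collections import deque

def run(program, cycles):
    """Return (cycle, name) for every instruction whose RegWrite is used in WB."""
    pipe = deque([None] * 5, maxlen=5)  # slots: IF, ID, EX, MEM, WB
    pending = list(program)
    writes = []
    for cycle in range(1, cycles + 1):
        # Each clock edge, every instruction and its control bits advance one stage.
        pipe.appendleft(pending.pop(0) if pending else None)
        in_wb = pipe[-1]
        if in_wb is not None and in_wb["RegWrite"]:
            writes.append((cycle, in_wb["name"]))
    return writes

program = [
    {"name": "lw",  "RegWrite": 1},
    {"name": "sw",  "RegWrite": 0},
    {"name": "add", "RegWrite": 1},
]
writes = run(program, cycles=8)
```

The RegWrite bit set for lw is consumed four cycles after the instruction enters the pipeline, by which time two younger instructions are already in flight with their own control bits.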

66
Pipelined datapath w/control signals