Pipelining - PowerPoint PPT Presentation

About This Presentation
Title:

Pipelining

Slides: 29
Provided by: McM144
Learn more at: https://www.cs.unc.edu

Transcript and Presenter's Notes

1
Pipelining
Between 411 problem sets, I haven't had a minute
to do laundry
Now that's what I call dirty laundry
Read Chapter 4.5-4.8
2
Forget 411: Let's Solve a Relevant Problem
INPUT: dirty laundry
Device: Washer. Function: Fill, Agitate, Spin. WasherPD = 30 mins
OUTPUT: 4 more weeks
Device: Dryer. Function: Heat, Spin. DryerPD = 60 mins
3
One Load at a Time
  • Everyone knows that the real reason that UNC
    students put off doing laundry so long is not
    because they procrastinate, are lazy, or even
    have better things to do.
  • The fact is, doing laundry one load at a time is
    not smart.
  • (Sorry Mom, but you were wrong about this one!)

Step 1
Step 2
Total = WasherPD + DryerPD = 90 mins
4
Doing N Loads of Laundry
  • Here's how they do laundry at Duke, the
    combinational way.
  • (Actually, this is just an urban legend. No one
    at Duke actually does laundry. The butlers all
    arrive on Wednesday morning, pick up the dirty
    laundry and return it all pressed and starched by
    dinner.)

Step 2
Step 4

Total = N × (WasherPD + DryerPD) = N × 90 mins
5
Doing N Loads the UNC way
  • UNC students pipeline the laundry process.
  • That's why we wait!

Step 2
Step 3

Total = N × Max(WasherPD, DryerPD) = N × 60 mins
Actually, it's more like N × 60 + 30 if we account
for the startup transient correctly. When doing
pipeline analysis, we're mostly interested in the
steady state, where we assume we have an
infinite supply of inputs.
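The totals above can be checked with a small schedule simulation. This is only a sketch: the function names are made up for illustration, and it assumes a washed load can wait in a basket until the dryer is free.

```python
WASHER_PD = 30  # minutes, from the slides
DRYER_PD = 60   # minutes, from the slides


def one_at_a_time(n_loads):
    """The 'combinational' way: wash and dry each load before starting the next."""
    return n_loads * (WASHER_PD + DRYER_PD)


def pipelined(n_loads):
    """Overlap washing of the next load with drying of the current one."""
    washer_free = 0  # time at which the washer is next available
    dryer_free = 0   # time at which the dryer is next available
    for _ in range(n_loads):
        wash_done = washer_free + WASHER_PD
        washer_free = wash_done
        # Assumption: a wet load waits until the dryer is free.
        dry_start = max(wash_done, dryer_free)
        dryer_free = dry_start + DRYER_PD
    return dryer_free


for n in (1, 4, 10):
    print(n, one_at_a_time(n), pipelined(n))
# 1 90 90
# 4 360 270
# 10 900 630
```

The pipelined totals equal N × 60 + 30, which matches the startup-transient remark; for large N the extra 30 minutes becomes negligible.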
6
Recall Our Performance Measures
  • Latency: The delay from when an input is
    established until the output associated with that
    input becomes valid.
  • (Duke Laundry = 90 mins)
  • ( UNC Laundry = 120 mins)
  • Throughput: The rate at which inputs or outputs are
    processed.
  • (Duke Laundry = 1/90 outputs/min)
  • ( UNC Laundry = 1/60 outputs/min)

7
Okay, Back to Circuits
For combinational logic: latency = tPD,
throughput = 1/tPD. We can't get the answer
faster, but are we making effective use of our
hardware at all times?
[Figure: combinational circuit with input X, intermediate values F(X) and G(X), and output P(X)]
F & G are idle, just holding their outputs
stable while H performs its computation
8
Pipelined Circuits
use registers to hold H's input stable!
Now F & G can be working on input Xi+1 while H is
performing its computation on Xi. We've created
a 2-stage pipeline: if we have a valid input X
during clock cycle j, P(X) is valid during clock
j+2.
Suppose F, G, H have propagation delays of 15,
20, 25 ns and we are using ideal zero-delay
registers (ts = 0, tpd = 0).
Pipelining uses registers to improve the
throughput of combinational circuits.
              unpipelined    2-stage pipeline
latency       45 ns          ______
throughput    1/45 per ns    ______
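The blanks can be worked out from the delays given. The sketch below assumes F and G both read X in parallel and feed H, which is consistent with the unpipelined latency of 45 ns = max(15, 20) + 25. The function names are illustrative only.

```python
def unpipelined(t_f, t_g, t_h):
    """Latency and throughput of the purely combinational circuit."""
    latency = max(t_f, t_g) + t_h  # F and G in parallel, then H
    return latency, 1 / latency


def two_stage(t_f, t_g, t_h):
    """Stage 1 holds F and G, stage 2 holds H; registers are ideal (zero delay)."""
    clock = max(max(t_f, t_g), t_h)  # the slowest stage sets the clock period
    latency = 2 * clock              # two clock periods from input to output
    return latency, 1 / clock        # one result per clock in steady state


print(unpipelined(15, 20, 25))  # (45, 0.0222...) i.e. 45 ns, 1/45 per ns
print(two_stage(15, 20, 25))    # (50, 0.04)      i.e. 50 ns, 1/25 per ns
```

Under these assumptions, latency gets slightly worse (45 ns to 50 ns) while throughput nearly doubles.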
9
Pipeline Diagrams
[Pipeline diagram: columns are clock cycles i, i+1, i+2, i+3; rows are the Input and the pipeline stages F Reg, G Reg, H Reg; the first input is Xi.]
The results associated with a particular set of
input data move diagonally through the diagram,
progressing through one pipeline stage each clock
cycle.
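A pipeline diagram of this kind can be printed with a few lines of code. This is a rough sketch: the row layout (F Reg and G Reg one cycle after the input, H Reg two cycles after) is an assumption based on the 2-stage F/G/H pipeline described earlier, and `pipeline_diagram` is a made-up helper.

```python
def label(k):
    """Name a clock cycle or input index relative to i."""
    return "i" if k == 0 else f"i+{k}"


def pipeline_diagram(rows, n_cycles, width=10):
    """rows: (name, delay_in_cycles, template) tuples. Returns text lines."""
    lines = ["".ljust(width) + "".join(label(c).ljust(width) for c in range(n_cycles))]
    for name, delay, template in rows:
        cells = []
        for c in range(n_cycles):
            k = c - delay  # which input this row holds during cycle i+c
            cells.append(template.format(label(k)) if k >= 0 else "")
        lines.append(name.ljust(width) + "".join(cell.ljust(width) for cell in cells))
    return lines


ROWS = [
    ("Input", 0, "X{}"),
    ("F Reg", 1, "F(X{})"),
    ("G Reg", 1, "G(X{})"),
    ("H Reg", 2, "H(X{})"),
]

print("\n".join(pipeline_diagram(ROWS, 4)))
```

Reading the output along a diagonal follows a single input (Xi, then F(Xi) and G(Xi), then H(Xi)) through the stages.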
10
Pipelining Summary
  • Advantages
  • Higher throughput than combinational system
  • Different parts of the logic work on different
    parts of the problem
  • Disadvantages
  • Generally, increases latency
  • Only as good as the weakest link (often called
    the pipeline's BOTTLENECK)

11
Review of CPU Performance
MIPS = Millions of Instructions/Second
Freq = Clock Frequency, MHz
CPI = Clocks per Instruction
To increase MIPS:
1. DECREASE CPI.
   - RISC simplicity reduces CPI to 1.0.
   - CPI below 1.0? State-of-the-art multiple instruction issue.
2. INCREASE Freq.
   - Freq is limited by the delay along the longest combinational path; hence
   - PIPELINING is the key to improving performance.
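The three quantities are related by MIPS = Freq / CPI, which follows from the definitions (the formula itself did not survive in the transcript). A minimal sketch, with illustrative function names and a 20 ns clock period chosen as an example:

```python
def freq_mhz_from_period(period_ns):
    """Clock frequency in MHz for a clock period given in nanoseconds."""
    return 1000.0 / period_ns


def mips(freq_mhz, cpi):
    """Millions of instructions per second = clock frequency (MHz) / CPI."""
    return freq_mhz / cpi


freq = freq_mhz_from_period(20)  # 20 ns period -> 50 MHz
print(mips(freq, 1.0))           # 50.0
print(mips(freq, 2.0))           # 25.0, doubling CPI halves MIPS
```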
12
Where Are the Bottlenecks?
Pipelining goal: Break LONG combinational paths →
memories, ALU in separate stages
13
miniMIPS Timing
  • Different instructions use various parts of the
    data path.

1 instr every 14 nS, 14 nS, 20 nS, 9 nS, 19 nS
[Timing diagram: program execution order versus time, against CLK]
The above scenario is possible only if the system
could vary the clock period based on the
instruction being executed. This leads to
complicated timing generation, and, in the end,
slower systems, since it is not very compatible
with pipelining!
[Timing legend, in slide order: 6 nS, 2 nS, 2 nS, 5 nS, 4 nS, 6 nS, 1 nS; Instruction Fetch, Instruction Decode, Register Prop Delay, ALU Operation, Branch Target, Data Access, Register Setup]
14
Uniform miniMIPS Timing
  • With a fixed clock period, we have to allow for
    the worst case.

1 instr EVERY 20 nS
[Timing diagram: program execution order versus time, against CLK]
add $4, $5, $6
beq $1, $2, 40
lw $3, 30($0)
jal 20000
sw $2, 20($4)
By accounting for the worst-case path (i.e.,
allowing time for each possible combination of
operations) we can implement a fixed clock
period. This simplifies timing generation,
enforces a uniform processing order, and allows
for pipelining!
[Timing legend, in slide order: 6 nS, 2 nS, 2 nS, 5 nS, 4 nS, 6 nS, 1 nS; Instruction Fetch, Instruction Decode, Register Prop Delay, ALU Operation, Branch Target, Data Access, Register Setup]
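The cost of a fixed clock can be made concrete with the per-instruction times quoted on the previous slide (14, 14, 20, 9, 19 nS). The sketch below only does that arithmetic; the variable and function names are illustrative.

```python
# Per-instruction times from the variable-clock slide, in nanoseconds.
INSTR_TIMES_NS = [14, 14, 20, 9, 19]


def variable_clock_total(times):
    """Total time if the clock period could adapt to each instruction."""
    return sum(times)


def fixed_clock_period(times):
    """A fixed clock must allow for the slowest (worst-case) instruction."""
    return max(times)


def fixed_clock_total(times):
    """Total time when every instruction takes one worst-case period."""
    return len(times) * fixed_clock_period(times)


print(variable_clock_total(INSTR_TIMES_NS))  # 76
print(fixed_clock_period(INSTR_TIMES_NS))    # 20
print(fixed_clock_total(INSTR_TIMES_NS))     # 100
```

The fixed clock spends 100 nS instead of 76 nS on these five instructions, which is the price paid for simple timing and the ability to pipeline.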
15
Goal 5-Stage Pipeline
GOAL: Maintain (nearly) 1.0 CPI, but increase
clock speed to barely include slowest components
(mems, regfile, ALU).
APPROACH: structure processor as 5-stage pipeline
16
5-Stage miniMIPS
[Datapath diagram: 5-stage miniMIPS pipeline with stages Instruction Fetch, Register File, ALU, Memory, and Write Back. Labeled elements include PC, PC+4, Instruction Memory (A, D), Register File (RA1, RA2, WA, WD, WE, RD1, RD2), SEXT of Imm<15:0>, shamt<10:6>, BZ, ALU (A, B, ALUFN; flags Z, V, N, C), Data Memory (Adr, WD, R/W, RD), and the selects PCSEL, ASEL, WASEL (choosing among Rt<20:16>, Rd<15:11>, 31, 27), and WERF. PCSEL inputs include 0x80000000, 0x80000040, 0x80000080, JT, BT, and the jump address formed from PC<31:29>, J<25:0>, and 00.]
Omits some details; NO bypass or interlock logic.
Address is available right after instruction
enters Memory stage.
Almost 2 clock cycles.
Data is needed just before rising clock edge at
end of Write Back stage.
17
Pipelining
  • Improve performance by increasing instruction
    throughput
  • Ideal speedup is number of stages in the
    pipeline. Do we achieve this?
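One way to see why the ideal speedup is the number of stages, and why it is only approached, is to count cycles. This sketch assumes perfectly balanced stages, no hazards, and no register overhead, none of which hold exactly in practice; `speedup` is an illustrative name.

```python
def speedup(n_stages, n_instructions):
    """Unpipelined time / pipelined time, measured in stage delays."""
    unpipelined = n_instructions * n_stages      # each instruction takes all stages
    pipelined = n_stages + (n_instructions - 1)  # fill once, then one per cycle
    return unpipelined / pipelined


for n in (1, 10, 1000):
    print(n, round(speedup(5, n), 2))
# 1 1.0
# 10 3.57
# 1000 4.98
```

A single instruction gains nothing; only a long stream of instructions approaches the 5x limit.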

18
Pipelining
  • What makes it easy?
  • all instructions are the same length
  • just a few instruction formats
  • memory operands appear only in loads and stores
  • What makes it hard?
  • structural hazards: suppose we had only one
    memory
  • control hazards: need to worry about branch
    instructions
  • data hazards: an instruction depends on a
    previous instruction
  • Individual instructions still take the same
    number of cycles
  • But we've improved the throughput by increasing
    the number of simultaneously executing
    instructions

19
Structural Hazards
[Figure: four overlapped instructions, each passing through Inst Fetch, Reg Read, ALU, Data Access, Reg Write]
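With a single memory, a structural hazard appears whenever one instruction is in Data Access during the same cycle another is in Inst Fetch. The sketch below finds those cycles, assuming one instruction enters the pipeline per cycle and no stalls; the function name and input encoding are made up for illustration.

```python
STAGES = ["Inst Fetch", "Reg Read", "ALU", "Data Access", "Reg Write"]


def memory_conflicts(uses_data_memory):
    """uses_data_memory[i] is True if instruction i is a load or store.

    Instruction i is fetched in cycle i (0-based). Returns a list of
    (cycle, accessing_instr, fetching_instr) where both need the one memory.
    """
    offset = STAGES.index("Data Access") - STAGES.index("Inst Fetch")
    conflicts = []
    for i, uses in enumerate(uses_data_memory):
        fetching = i + offset  # instruction being fetched while i accesses data
        if uses and fetching < len(uses_data_memory):
            conflicts.append((i + offset, i, fetching))
    return conflicts


# A load followed by three other instructions, as in the four-row figure.
print(memory_conflicts([True, False, False, False]))  # [(3, 0, 3)]
```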
20
Data Hazards
  • Problem with starting next instruction before
    first is finished
  • dependencies that go backward in time are data
    hazards

21
Software Solution
  • Have compiler guarantee no hazards
  • Where do we insert the nops?
      sub $2, $1, $3
      and $12, $2, $5
      or $13, $6, $2
      add $14, $2, $2
      sw $15, 100($2)
  • Problem: this really slows us down!
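A compiler pass of this kind can be sketched as follows. The required spacing is an assumption: `gap=2` means two independent instructions (or nops) must separate a register write from a dependent read, which fits a pipeline whose register file writes before it reads within a cycle. A pipeline without that property would need a larger gap. The tuple encoding and `insert_nops` are hypothetical.

```python
NOP = ("nop", None, ())


def insert_nops(program, gap=2):
    """program: list of (op, dest_reg_or_None, source_regs). Returns a padded copy."""
    out = []
    last_write = {}  # register -> index in `out` of its most recent writer
    for op, dest, srcs in program:
        needed = 0
        for reg in srcs:
            if reg in last_write:
                distance = len(out) - last_write[reg]  # 1 = immediately after
                needed = max(needed, gap + 1 - distance)
        out.extend([NOP] * needed)
        if dest is not None:
            last_write[dest] = len(out)
        out.append((op, dest, srcs))
    return out


PROGRAM = [
    ("sub", 2, (1, 3)),
    ("and", 12, (2, 5)),
    ("or", 13, (6, 2)),
    ("add", 14, (2, 2)),
    ("sw", None, (15, 2)),
]

print([op for op, _, _ in insert_nops(PROGRAM)])
# ['sub', 'nop', 'nop', 'and', 'or', 'add', 'sw']
```

Under this assumption only the `and` needs padding; the later readers of $2 are already far enough from the `sub`.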

22
Forwarding
  • Use temporary results, don't wait for them to be
    written
  • register file forwarding to handle
    read/write to same register
  • ALU forwarding
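The selection an ALU forwarding unit makes can be sketched as a small function. This is an illustration in the style of textbook forwarding logic, not the slides' actual hardware: the names `forward_source`, "MEM", "WB", and "REGFILE" are hypothetical, and the destination arguments are the registers written by the instructions currently one and two stages ahead.

```python
def forward_source(src_reg, mem_stage_dest, wb_stage_dest):
    """Decide where an ALU operand should come from.

    mem_stage_dest / wb_stage_dest: destination register of the instruction
    in the Memory / Write Back stage, or None if it writes no register.
    """
    if src_reg == 0:
        return "REGFILE"  # MIPS register $0 always reads as zero
    if src_reg == mem_stage_dest:
        return "MEM"      # most recent result takes priority
    if src_reg == wb_stage_dest:
        return "WB"       # older result, not yet written back
    return "REGFILE"      # no hazard, use the register file value


print(forward_source(2, 2, None))  # MEM
print(forward_source(2, 12, 2))    # WB
print(forward_source(5, 2, 12))    # REGFILE
```

Checking the Memory-stage destination first matters when both in-flight instructions write the same register, since the younger one holds the current value.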

23
Can't always forward
  • Load word can still cause a hazard
  • an instruction tries to read a register following
    a load instruction that writes to the same
    register.
  • Thus, we need a hazard detection unit to stall
    the instruction
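The condition a hazard detection unit looks for can be sketched in software. This only detects the load-use pattern described above (an instruction reading a register written by the immediately preceding `lw`); how many cycles to stall depends on the pipeline and is not modeled. The tuple encoding and the example program are illustrative.

```python
def load_use_hazards(program):
    """program: list of (op, dest_reg_or_None, source_regs).

    Returns the indices of instructions that read the register written
    by the load immediately before them.
    """
    hazards = []
    for i in range(1, len(program)):
        prev_op, prev_dest, _ = program[i - 1]
        _, _, srcs = program[i]
        if prev_op == "lw" and prev_dest in srcs:
            hazards.append(i)
    return hazards


EXAMPLE = [
    ("lw", 2, (1,)),      # lw  $2, 20($1)
    ("and", 4, (2, 5)),   # and $4, $2, $5   reads $2 right after the load
    ("or", 8, (2, 6)),    # or  $8, $2, $6   far enough away in this sketch
]

print(load_use_hazards(EXAMPLE))  # [1]
```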

24
Stalling
  • We can stall the pipeline by keeping an
    instruction in the same stage

25
Branch Hazards
  • When we decide to branch, other instructions are
    in the pipeline!
  • We are predicting branch not taken
  • need to add hardware for flushing instructions if
    we are wrong
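The cost of predicting "not taken" can be estimated by charging a flush penalty for every branch that is actually taken. This is a back-of-the-envelope sketch: the penalty of 1 cycle and the example counts are assumptions, and pipeline fill time is ignored.

```python
def cycles_with_flushes(n_instructions, taken_branches, flush_penalty=1):
    """Cycles needed when each taken (mispredicted) branch flushes some slots."""
    return n_instructions + taken_branches * flush_penalty


def effective_cpi(n_instructions, taken_branches, flush_penalty=1):
    """Average clocks per instruction including flushed slots."""
    return cycles_with_flushes(n_instructions, taken_branches, flush_penalty) / n_instructions


# Hypothetical workload: 100 instructions, 10 of them taken branches.
print(cycles_with_flushes(100, 10))  # 110
print(effective_cpi(100, 10))        # 1.1
```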

26
5-Stage miniMIPS
[Datapath diagram: the same 5-stage miniMIPS pipeline as on slide 16 (Instruction Fetch, Register File, ALU, Memory, Write Back), shown without the timing annotations.]
We wanted a simple, clean pipeline but
27
Pipeline Summary (I)
Started with unpipelined implementation
  • direct execute, 1 cycle/instruction
  • it had a long cycle time: mem + regs + alu + mem + wb
We ended up with a 5-stage pipelined implementation
  • increase throughput (3x???)
  • delayed branch decision (1 cycle): choose to
    execute instruction after branch
  • delayed register writeback (3 cycles): add bypass
    paths (6 x 2 = 12) to forward correct value
  • memory data available only in WB stage: introduce
    NOPs at IRALU, to stall IF and RF stages until LD
    result was ready
28
Pipeline Summary (II)
  • Fallacy 1: Pipelining is easy
  • Smart people get it wrong all of the time!
  • Fallacy 2: Pipelining is independent of ISA
  • Many ISA decisions impact how easy/costly it is
    to implement pipelining (e.g., branch semantics,
    addressing modes).
  • Fallacy 3: Increasing pipeline stages improves
    performance
  • Diminishing returns. Increasing complexity.
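The diminishing returns behind Fallacy 3 show up as soon as pipeline registers are given a nonzero delay. In this sketch the 20 ns of logic is the worst-case instruction time from the timing slides, while the 1 ns register overhead is a made-up figure; stages are assumed perfectly balanced and hazards are ignored.

```python
LOGIC_NS = 20.0     # total combinational delay (worst-case instruction time)
REGISTER_NS = 1.0   # hypothetical setup + propagation overhead per stage


def clock_period(n_stages):
    """Clock period when the logic is split evenly across n_stages."""
    return LOGIC_NS / n_stages + REGISTER_NS


def throughput_speedup(n_stages):
    """Throughput relative to a single-stage (unpipelined) design."""
    return clock_period(1) / clock_period(n_stages)


for k in (1, 5, 10, 20):
    print(k, round(throughput_speedup(k), 2))
# 1 1.0
# 5 4.2
# 10 7.0
# 20 10.5
```

Going from 10 to 20 stages doubles the register count but improves throughput by only 1.5x, and the speedup can never exceed 21 under these numbers.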