Pipelining - PowerPoint PPT Presentation


Title: Pipelining

1
Pipelining

By Pradondet Nilagupta
Based on lecture notes on
Advanced Computer Architecture
Prof. Mike Schulte
Prof. Yirng-An Chen

2
Introduction to Pipelining

Pipelining: an implementation technique that
overlaps the execution of multiple instructions.
Laundry example:
Ann, Brian, Cathy, and Dave each have one load of
clothes to wash, dry, and fold
Washer takes 30 minutes
Dryer takes 40 minutes
Folder takes 20 minutes

3
Sequential Laundry
[Figure: timeline from 6 PM to midnight; the four loads run one after another in task order, each taking 30 + 40 + 20 minutes]

Sequential laundry takes 6 hours for 4 loads
If they learned pipelining, how long would
laundry take?

4
Pipelined Laundry: Start Work ASAP
[Figure: timeline from 6 PM to midnight; the four loads overlap in task order]

Pipelined laundry takes 3.5 hours for 4 loads
Speedup = 6 / 3.5 ≈ 1.7
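
The laundry numbers can be checked with a short calculation. This is a minimal sketch, assuming identical loads and that each machine handles one load at a time; the function names are illustrative and not part of the slides.

```python
def sequential_minutes(stage_minutes, loads):
    """Each load finishes all stages before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, loads):
    """Identical loads overlapped: the first load takes the full time,
    and every later load finishes one slowest-stage time after the
    previous one (assumption: identical loads, one load per machine)."""
    return sum(stage_minutes) + (loads - 1) * max(stage_minutes)


stages = [30, 40, 20]  # washer, dryer, folder (minutes, from the slide)
seq = sequential_minutes(stages, 4)   # 360 minutes = 6 hours
pipe = pipelined_minutes(stages, 4)   # 210 minutes = 3.5 hours
print(seq / 60, pipe / 60, round(seq / pipe, 1))
```

The 40-minute dryer sets the pace: after the first load, one load completes every 40 minutes.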

5
Pipelining Lessons

Pipelining doesn't help latency of a single task,
it helps throughput of the entire workload
Pipeline rate is limited by the slowest pipeline stage
Multiple tasks operate simultaneously
Potential speedup = number of pipe stages
Unbalanced lengths of pipe stages reduce speedup
Time to fill the pipeline and time to drain it
reduce speedup

[Figure: pipelined laundry timeline starting at 6 PM, in task order]
6
Computer Pipelines

Execute billions of instructions, so throughput
is what matters
Desirable RISC features: all instructions are the same
length, registers are located in the same place in the
instruction format, and memory operands appear only in
loads or stores

7
Pipelining Basics
Unpipelined System
Delay = 33 ns, Throughput = 30 MHz
[Figure: timeline of Op1, Op2, Op3, ... executed one after another]

One operation must complete before next can begin
Operations spaced 33ns apart

8
3 Stage Pipelining
Delay = 39 ns, Throughput = 77 MHz

Space operations 13 ns apart
3 operations occur simultaneously

[Figure: timeline of Op1, Op2, Op3, Op4, ... overlapped in time]
9
Limitation Nonuniform Pipelining
Delay = 18 x 3 = 54 ns, Throughput = 55 MHz

Throughput limited by slowest stage
Delay determined by clock period x number of
stages
Must attempt to balance stages
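
The delay and throughput figures on the last three slides follow from one rule: the clock period is set by the slowest stage plus the register overhead. The sketch below reproduces them under assumed values that are not stated in the slides: 30 ns of combinational logic in total, 3 ns of register delay per stage, and a 5/15/10 ns split for the nonuniform case.

```python
def pipeline_timing(stage_logic_ns, register_ns):
    """Return (clock period ns, total delay ns, throughput MHz)."""
    clock_ns = max(stage_logic_ns) + register_ns   # slowest stage sets the clock
    delay_ns = clock_ns * len(stage_logic_ns)      # every stage takes one clock
    throughput_mhz = 1000.0 / clock_ns             # one operation per clock
    return clock_ns, delay_ns, throughput_mhz


REGISTER_NS = 3  # assumed register overhead, chosen to match the slide totals

unpipelined = pipeline_timing([30], REGISTER_NS)          # 33 ns delay, about 30 MHz
balanced = pipeline_timing([10, 10, 10], REGISTER_NS)     # 39 ns delay, about 77 MHz
nonuniform = pipeline_timing([5, 15, 10], REGISTER_NS)    # 54 ns delay, about 55 MHz

for name, t in [("unpipelined", unpipelined),
                ("3-stage", balanced),
                ("nonuniform", nonuniform)]:
    print(name, t)
```

With the same total logic, the unbalanced split gives a slower clock and a longer delay than the balanced one, which is why the stages should be balanced.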

10
Limitation Deep Pipelines
Delay = 48 ns, Throughput = 128 MHz

Diminishing returns as more pipeline stages are added
Register delays become limiting factor
Increased latency
Small throughput gains

11
Limitation Sequential Dependencies
[Figure: three pipeline stages, each a block of combinational logic followed by a register, driven by a common clock; timeline of Op1, Op2, Op3, Op4, ...]

Op4 gets result from Op1!
Pipeline hazard
12
Speed Up Equation for Pipelining

Assumptions
No delays except component latencies
A fixed pipeline overhead of 2 ns
What is the cycle time for the pipeline version
of the circuit that maximizes performance without
allocating multiple cycles to a stage?
What is the total execution time for the pipeline
version?
What is the speedup versus a single-cycle
unpipelined version?

13
Multiple-Cycle DLX Cycles 1 and 2

Most DLX instructions can be implemented in 5
clock cycles (see Figure 3.1 on page 130).
The first two clock cycles are the same for every
instruction.
1. Instruction fetch cycle (IF)
IR <- Mem[PC] (load instruction)
NPC <- PC + 4 (update program counter)
2. Instruction decode / register fetch cycle (ID)
A <- Regs[IR[6..10]] (fetch source reg1)
B <- Regs[IR[11..15]] (fetch source reg2)
Imm <- (IR[16])^16 ## IR[16..31] (fetch and sign-extend
immediate)
14
Multiple-Cycle DLX Cycle 3

The third cycle is known as the
Execution/ effective address cycle (EX)
The actions performed in this cycle depend on the
type of operations.
Memory reference (e.g., LW R1, 30(R2))
ALUOutput <- A + Imm (calculate effective
address)
Register-Register ALU op. (e.g., ADD R1, R2, R3)
ALUOutput <- A op B (perform ALU operation)
Register-Immed. ALU op. (e.g., ADDI R1, R2, 3)
ALUOutput <- A op Imm (perform ALU operation)
Branch (e.g., BEQZ R4, next)
ALUOutput <- NPC + Imm (compute branch target)
Cond <- (A == 0) (compare A to 0)

15
Multiple-Cycle DLX Cycle 4

The fourth cycle is known as the
Memory access / branch completion cycle (MEM)
The only DLX instructions active in this cycle
are loads, stores, and branches
Loads (e.g., LW R1, 30(R2))
LMD <- Mem[ALUOutput] (load from memory into the
processor)
Stores (e.g., SW 500(R4), R3)
Mem[ALUOutput] <- B (store data into memory)
Branch (e.g., BEQZ R4, next)
if (Cond) PC <- ALUOutput (set PC based on
Cond)
else PC <- NPC

16
Multiple-Cycle DLX Cycle 5

The fifth cycle is known as the
Write-back cycle (WB)
During this cycle, results are written to the
register file
Register-Register ALU op. (e.g., ADD R1, R2, R3)
Regs[IR[16..20]] <- ALUOutput
Register-Immed. ALU op. (e.g., ADDI R1, R2, 3)
Regs[IR[11..15]] <- ALUOutput
Load instruction (e.g., LW R1, 30(R2))
Regs[IR[11..15]] <- LMD
17
5 Steps of DLX Datapath (Figure 3.1)
18
CPI for the Multiple-Cycle DLX

The multiple-cycle DLX requires 4 cycles for
branches and stores and 5 cycles for the other
operations.
Assuming 20% of the instructions are branches or
stores, this gives a CPI of 0.2 x 4 + 0.8 x 5 = 4.80.
We could improve the CPI by allowing ALU
operations to complete during the memory cycle.
Assuming 40% of the instructions are ALU
operations, this would reduce the CPI to 4.40.
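
Both CPI figures are weighted averages over the instruction mix. A minimal sketch, with the helper name `average_cpi` chosen here for illustration:

```python
def average_cpi(mix):
    """mix is a list of (fraction of instructions, cycles) pairs."""
    return sum(fraction * cycles for fraction, cycles in mix)


# 20% of instructions take 4 cycles, the remaining 80% take 5.
base_cpi = average_cpi([(0.2, 4), (0.8, 5)])

# If the 40% that are ALU operations also finish in 4 cycles,
# only the remaining 40% still need 5.
improved_cpi = average_cpi([(0.2, 4), (0.4, 4), (0.4, 5)])

print(round(base_cpi, 2), round(improved_cpi, 2))
```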

19
Pipelining DLX

To reduce the CPI, DLX can be implemented using a
five stage pipeline.
In this example, it takes 10 cycles to execute 5
instructions, for a CPI of 2.

20
Visualizing Pipelining (Figure 3.3, page 133)
[Figure: instructions in program order plotted against time in clock cycles]
21
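Pipelined DLX Datapath (Figure 3.4, page 134)
[Figure: datapath with five stages (Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc.; Memory Access; Write Back) separated by the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB. Components shown: PC, instruction memory, register file, sign extend (16 to 32 bits), ALU, zero test, data memory, and multiplexers]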

Pipeline registers are used to transfer results
from one pipeline stage to the next.

22
Basic Performance Issues in Pipelining

Pipelining increases the CPU instruction
throughput (the number of instructions completed
per unit of time), but it does not reduce the
execution time of an individual instruction.

23
Pipeline Speedup Example

Assume the multiple-cycle DLX has a 10-ns clock
cycle, loads take 5 clock cycles and account for
40% of the instructions, and all other
instructions take 4 clock cycles.
If pipelining the machine adds 1 ns to the clock
cycle, how much speedup in instruction execution
rate do we get from pipelining?
MC Ave Instr. Time = Clock cycle x Average CPI
= 10 ns x (0.6 x 4 + 0.4 x 5)
= 44 ns
PL Ave Instr. Time = 10 + 1 = 11 ns
Speedup = 44 / 11 = 4
This ignores the time needed to fill and empty the
pipeline and delays due to hazards.
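
The same arithmetic as a short script, assuming (as the slide does) an ideal pipelined CPI of 1 with no fill, drain, or hazard delays; the variable names are illustrative.

```python
clock_ns = 10.0            # multiple-cycle clock period
pipeline_overhead_ns = 1.0 # extra clock time added by pipelining

# Multiple-cycle machine: 60% of instructions take 4 cycles, 40% (loads) take 5.
mc_avg_cpi = 0.6 * 4 + 0.4 * 5
mc_time_ns = clock_ns * mc_avg_cpi                 # about 44 ns per instruction

# Pipelined machine: one instruction completes per (slightly longer) clock.
pl_time_ns = (clock_ns + pipeline_overhead_ns) * 1.0   # 11 ns per instruction

speedup = mc_time_ns / pl_time_ns
print(round(mc_time_ns, 1), pl_time_ns, round(speedup, 2))
```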

24
It's Not That Easy for Computers

Limits to pipelining: hazards prevent the next
instruction from executing during its designated
clock cycle
Structural hazards: the hardware cannot support this
combination of instructions because two instructions
need the same resource.
Data hazards: an instruction depends on the result of
a prior instruction still in the pipeline
Control hazards: caused by pipelining branches and other
instructions that change the PC
A common solution is to stall the pipeline until
the hazard is resolved, inserting one or more
bubbles in the pipeline

25
Speed Up Equations for Pipelining

Stalls reduce the speedup obtained from
pipelining
Speedup from pipelining
= Ave Instr Time unpipelined / Ave Instr Time pipelined
= (CPI_unpipelined x Clock cycle_unpipelined) / (CPI_pipelined x Clock cycle_pipelined)

CPI_pipelined = Ideal CPI + Pipeline stall CPI
= 1 + Pipeline stall CPI

Speedup = [CPI_unpipelined / (1 + Pipeline stall CPI)] x [Clock cycle_unpipelined / Clock cycle_pipelined]

Speedup <= Pipeline depth / (1 + Pipeline stall CPI)
26
Speed Up Equation for Pipelining

CPI_pipelined = Ideal CPI + Pipeline stall clock
cycles per instr

Speedup = [(Ideal CPI x Pipeline depth) / (Ideal CPI + Pipeline stall CPI)] x [Clock cycle_unpipelined / Clock cycle_pipelined]

Assuming an ideal CPI of 1:

Speedup = [Pipeline depth / (1 + Pipeline stall CPI)] x [Clock cycle_unpipelined / Clock cycle_pipelined]
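
The speedup equation translates directly into a small function. This is a sketch only; the function name, parameter names, and the sample stall rate of 0.25 are illustrative assumptions, not values from the slides.

```python
def pipeline_speedup(depth, stall_cpi, clock_unpipelined=1.0,
                     clock_pipelined=1.0, ideal_cpi=1.0):
    """Speedup = (ideal CPI x depth) / (ideal CPI + stall CPI)
                 x (unpipelined clock cycle / pipelined clock cycle)."""
    cpi_term = (ideal_cpi * depth) / (ideal_cpi + stall_cpi)
    clock_term = clock_unpipelined / clock_pipelined
    return cpi_term * clock_term


# No stalls and equal clock cycles: speedup equals the pipeline depth.
print(pipeline_speedup(depth=5, stall_cpi=0.0))

# A hypothetical 0.25 stall cycles per instruction cuts a 5-stage
# pipeline's speedup from 5 to 4.
print(pipeline_speedup(depth=5, stall_cpi=0.25))
```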
27
Structural Hazards

Sometimes called resource conflicts.
Example:
Some pipelined machines share a single
memory pipeline for data and instructions. As a
result, when an instruction contains a data
memory reference, it conflicts with the
instruction fetch for a later instruction.

28
One Memory Port / Structural Hazards (Figure 3.6,
page 142)
[Figure: pipeline diagram in instruction order: Load, Instr 1, Instr 2, Instr 3, Instr 4]
29
One Memory Port / Structural Hazards (Figure 3.7,
page 143)
[Figure: pipeline diagram in instruction order: Load, Instr 1, Instr 2, stall, Instr 3]
30
A Pipeline Stalled for a Structural Hazard
Instr.       1     2     3     4     5     6     7     8     9     10
Load instr   IF    ID    EX    MEM   WB
Instr i+1          IF    ID    EX    MEM   WB
Instr i+2                IF    ID    EX    MEM   WB
Instr i+3                      STALL IF    ID    EX    MEM   WB
Instr i+4                                  IF    ID    EX    MEM   WB
Instr i+5                                        IF    ID    EX    MEM
Instr i+6                                              IF    ID    EX
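
The stall pattern in the table above can be reproduced with a small scheduling sketch. It assumes a single memory port shared by instruction fetch (IF) and a load's data access (MEM), that only loads use the port in MEM, and that an instruction never stalls again once it has been fetched. The function name is illustrative.

```python
def fetch_cycles(is_load):
    """Return the clock cycle in which each instruction is fetched.

    is_load[k] is True if instruction k is a load. A fetch is delayed
    while an earlier load occupies the single memory port in its MEM stage.
    """
    port_busy = set()   # cycles in which a load's MEM stage uses the port
    cycles = []
    cycle = 1
    for load in is_load:
        while cycle in port_busy:   # structural hazard: insert a stall
            cycle += 1
        cycles.append(cycle)
        if load:
            port_busy.add(cycle + 3)  # MEM comes 3 cycles after IF
        cycle += 1
    return cycles


# One load followed by six non-load instructions, as in the table.
print(fetch_cycles([True] + [False] * 6))
```

The fourth instruction is fetched in cycle 5 instead of cycle 4, because the load is using the memory port for its MEM stage in cycle 4.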
31
Example One or Two Memory Ports?

Machine A has a two-port memory, so it can access
instructions and data simultaneously.
Machine B has a one-port memory, but its
pipelined implementation has a 1.05 times faster
clock rate
Ideal CPI = 1 for both
Loads are 40% of instructions executed
Ave Instr. Time A = Clock cycle A x CPI A
= Clock cycle A
Ave Instr. Time B = Clock cycle B x CPI B
= (Clock cycle A / 1.05) x (1 + 0.4 x 1)
= Clock cycle A x 1.33
Ave Instr. Time B / Ave Instr. Time A = 1.33
Machine A is 1.33 times faster
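
A quick numeric check of this comparison, assuming each load on machine B costs exactly one stall cycle; clock cycle A is normalized to 1 and the variable names are illustrative.

```python
clock_a = 1.0               # normalized clock cycle of machine A
clock_b = clock_a / 1.05    # machine B's clock is 1.05 times faster

load_fraction = 0.4
cpi_a = 1.0                         # two ports: no structural stalls
cpi_b = 1.0 + load_fraction * 1     # one port: one stall per load (assumed)

time_a = clock_a * cpi_a
time_b = clock_b * cpi_b
print(round(time_b / time_a, 2))    # ratio of average instruction times
```

The faster clock on machine B does not make up for the stall cycles it adds.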

32
Data Hazard

Data hazards occur when the pipeline changes the order
of read/write accesses to operands so that the
order differs from the order seen by sequentially
executing instructions on an unpipelined machine

33
Data Hazard on R1 (Figure 3.9, page 147)
34
Forwarding to Avoid Data Hazard (Figure 3.10, page
149)
35
Three Generic Data Hazards

InstrI followed by InstrJ
Read After Write (RAW): InstrJ tries to read an
operand before InstrI writes it

I: ADD R1, R2, R3    IF ID EX MEM WB
J: SUB R4, R1, R5       IF ID EX MEM WB
36
Three Generic Data Hazards

Write After Write (WAW):
InstrJ tries to write an operand before InstrI
writes it
Leaves the wrong result (InstrI's, not InstrJ's)
Can't happen in the DLX 5-stage pipeline because
All instructions take 5 stages, and
Writes are always in stage 5

I: LW R1, 0(R2)      IF ID EX MEM1 MEM2 WB
J: ADD R1, R2, R3       IF ID EX   WB
37
Three Generic Data Hazards

InstrI followed by InstrJ
Write After Read (WAR): InstrJ tries to write an
operand before InstrI reads it
Can't happen in the DLX 5-stage pipeline because
All instructions take 5 stages,
Reads are always in stage 2, and
Writes are always in stage 5

I: SW 0(R1), R2      IF ID EX MEM1 MEM2 WB
J: ADD R2, R3, R4       IF ID EX   WB
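
The three hazard types can be told apart purely from the registers each instruction reads and writes. The sketch below is illustrative and not from the slides: it only reports which dependences exist between InstrI and a later InstrJ. Whether a dependence actually becomes a hazard depends on the pipeline, as the slides note for WAW and WAR on DLX.

```python
def classify_dependences(i_dest, i_srcs, j_dest, j_srcs):
    """Return the set of dependence types between InstrI and a later InstrJ.

    i_dest / j_dest: register written (or None if nothing is written).
    i_srcs / j_srcs: registers read.
    """
    found = set()
    if i_dest is not None and i_dest in j_srcs:
        found.add("RAW")   # J reads what I writes
    if i_dest is not None and i_dest == j_dest:
        found.add("WAW")   # J writes what I writes
    if j_dest is not None and j_dest in i_srcs:
        found.add("WAR")   # J writes what I reads
    return found


# ADD R1, R2, R3 followed by SUB R4, R1, R5
print(classify_dependences("R1", ["R2", "R3"], "R4", ["R1", "R5"]))
# LW R1, 0(R2) followed by ADD R1, R2, R3
print(classify_dependences("R1", ["R2"], "R1", ["R2", "R3"]))
# SW 0(R1), R2 followed by ADD R2, R3, R4
print(classify_dependences(None, ["R1", "R2"], "R2", ["R3", "R4"]))
```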
38
Data Hazard Even with Forwarding (Figure 3.12,
page 153)
39
Data Hazard Even with Forwarding (Figure 3.13,
page 154)
40
HW Change for Forwarding (Figure 3.20, page 161)
41
Compiler Scheduling for Data Hazards

Rather than just allowing the pipeline to stall, the
compiler can try to avoid these stalls by
rearranging the code sequence
to eliminate the hazard. This technique is called
pipeline scheduling or instruction scheduling.

42
Software Scheduling to Avoid Load Hazards
Try producing fast code for a = b + c; d = e - f,
assuming a, b, c, d, e, and f are in memory.

Slow code
LW Rb,b
LW Rc,c
ADD Ra,Rb,Rc
SW a,Ra
LW Re,e
LW Rf,f
SUB Rd,Re,Rf
SW d,Rd

Fast code
LW Rb,b
LW Rc,c
LW Re,e
ADD Ra,Rb,Rc
LW Rf,f
SW a,Ra
SUB Rd,Re,Rf
SW d,Rd
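
The benefit of the reordering can be counted with a small sketch. It assumes a pipeline with forwarding in which an instruction that uses a loaded value immediately after the load costs one stall cycle, and it only understands the three instruction shapes used on this slide; the function name is illustrative.

```python
def load_use_stalls(code):
    """Count stalls caused by using a loaded register in the next instruction."""
    stalls = 0
    prev_load_dest = None
    for line in code:
        op, operands = line.split(None, 1)
        regs = [r.strip() for r in operands.split(",")]
        if op == "LW":                 # LW Rd,addr
            dest, sources = regs[0], []
        elif op == "SW":               # SW addr,Rs
            dest, sources = None, [regs[1]]
        else:                          # ALU op Rd,Rs1,Rs2
            dest, sources = regs[0], regs[1:]
        if prev_load_dest is not None and prev_load_dest in sources:
            stalls += 1                # value not ready yet: one bubble
        prev_load_dest = dest if op == "LW" else None
    return stalls


slow = ["LW Rb,b", "LW Rc,c", "ADD Ra,Rb,Rc", "SW a,Ra",
        "LW Re,e", "LW Rf,f", "SUB Rd,Re,Rf", "SW d,Rd"]
fast = ["LW Rb,b", "LW Rc,c", "LW Re,e", "ADD Ra,Rb,Rc",
        "LW Rf,f", "SW a,Ra", "SUB Rd,Re,Rf", "SW d,Rd"]

print(load_use_stalls(slow), load_use_stalls(fast))
```

In the slow code, ADD and SUB each follow directly after the load of one of their operands; in the fast code an independent instruction sits between each load and its first use.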

43
Compiler Avoiding Load Stalls
44
Pipelining Summary

Pipelining overlaps the execution of multiple
instructions.
With an ideal pipeline, the CPI is one, and the
speedup is equal to the number of stages in the
pipeline.
However, several factors prevent us from
achieving the ideal speedup, including
Not being able to divide the pipeline evenly
The time needed to fill and empty the pipeline
Overhead needed for pipelining
Structural, data, and control hazards

45
Pipelining Summary