Title: Pipelining
1Pipelining
- By Pradondet Nilagupta
- Based on lecture notes on
- Advanced Computer Architecture
- Prof. Mike Schulte
- Prof. Yirng-An Chen
2Introduction to Pipelining
- Pipelining: an implementation technique that overlaps the execution of multiple instructions.
- Laundry example
- Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold
- Washer takes 30 minutes
- Dryer takes 40 minutes
- Folder takes 20 minutes
3Sequential Laundry
[Figure: timeline from 6 PM to midnight. The four loads run one after another in task order, each taking 30 + 40 + 20 minutes (wash, dry, fold).]
- Sequential laundry takes 6 hours for 4 loads
- If they learned pipelining, how long would
laundry take?
4Pipelined Laundry: Start work ASAP
[Figure: timeline from 6 PM to midnight. The four loads overlap in task order, each starting a stage as soon as it is free.]
- Pipelined laundry takes 3.5 hours for 4 loads
- Speedup = 6 / 3.5 ≈ 1.7
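The 6-hour and 3.5-hour figures can be checked with a small simulation. This is a minimal sketch, assuming each machine handles one load at a time and a load can wait between machines without blocking the previous one; the function name `pipeline_finish_time` is illustrative, not from the lecture.

```python
def pipeline_finish_time(stage_minutes, num_loads):
    """Return the minute at which the last load leaves the last stage."""
    # free_at[s] is the time at which stage s finishes its current load.
    free_at = [0] * len(stage_minutes)
    for _ in range(num_loads):
        ready = 0  # time at which this load is ready for its next stage
        for s, minutes in enumerate(stage_minutes):
            start = max(ready, free_at[s])  # wait for the load and the machine
            ready = start + minutes
            free_at[s] = ready
    return free_at[-1]


stages = [30, 40, 20]  # washer, dryer, folder (minutes)
loads = 4

sequential = loads * sum(stages)                # one load at a time
pipelined = pipeline_finish_time(stages, loads)

print(sequential / 60, "hours sequential")      # 6.0
print(pipelined / 60, "hours pipelined")        # 3.5
print(round(sequential / pipelined, 2), "speedup")  # 1.71
```

After the first load, one load finishes every 40 minutes, which is the dryer time: the slowest stage sets the rate.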
5Pipelining Lessons
- Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
- Pipeline rate is limited by the slowest pipeline stage
- Multiple tasks operate simultaneously
- Potential speedup = number of pipe stages
- Unbalanced lengths of pipe stages reduce speedup
- Time to fill the pipeline and time to drain it reduce speedup
[Figure: pipelined laundry timeline starting at 6 PM, tasks listed in order.]
6Computer Pipelines
- Execute billions of instructions, so throughput is what matters
- Desirable RISC features: all instructions the same length, registers located in the same place in the instruction format, memory operands only in loads or stores
7Pipelining Basics
Unpipelined System
Delay = 33 ns, Throughput = 30 MHz
[Figure: timing diagram showing Op1, Op2, Op3, ... executing one after another over time.]
- One operation must complete before next can begin
- Operations spaced 33ns apart
8Three-Stage Pipelining
Delay = 39 ns, Throughput = 77 MHz
[Figure: timing diagram showing Op1 through Op4 overlapping in time.]
- Space operations 13 ns apart
- 3 operations occur simultaneously
9Limitation: Nonuniform Pipelining
Delay = 18 x 3 = 54 ns, Throughput = 55 MHz
- Throughput limited by slowest stage
- Delay determined by clock period x number of stages
- Must attempt to balance stages
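The delay and throughput figures quoted for the unpipelined, three-stage, and nonuniform designs follow from two relations: delay = clock period x number of stages, and throughput = 1 / clock period. A minimal sketch using only the clock periods and stage counts given on the slides (the function names are illustrative):

```python
def latency_ns(clock_ns, stages):
    """Delay of one operation: it spends one clock period in each stage."""
    return clock_ns * stages


def throughput_mhz(clock_ns):
    """One operation completes per clock period; 1000 / ns gives MHz."""
    return 1000.0 / clock_ns


designs = [
    ("unpipelined", 33, 1),
    ("3-stage, balanced", 13, 3),
    ("3-stage, nonuniform", 18, 3),
]

for name, clock_ns, stages in designs:
    print(f"{name}: delay {latency_ns(clock_ns, stages)} ns, "
          f"throughput {throughput_mhz(clock_ns):.1f} MHz")
```

The computed throughputs are about 30.3, 76.9, and 55.6 MHz; the slides quote them as 30, 77, and 55 MHz.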
10Limitation: Deep Pipelines
Delay = 48 ns, Throughput = 128 MHz
- Diminishing returns as more pipeline stages are added
- Register delays become the limiting factor
- Increased latency
- Small throughput gains
11Limitation: Sequential Dependencies
[Figure: three blocks of combinational logic, each followed by a register (REG) driven by a common clock, with a timing diagram of Op1 through Op4.]
- Op4 gets result from Op1!
- Pipeline hazard
12Speed Up Equation for Pipelining
- Assumptions
- No delays except component latencies
- A fixed pipeline overhead of 2 ns
- What is the cycle time for the pipelined version of the circuit that maximizes performance without allocating multiple cycles to a stage?
- What is the total execution time for the pipelined version?
- What is the speedup versus a single-cycle unpipelined version?
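The circuit for this exercise is not reproduced in these notes, so the sketch below uses hypothetical component latencies purely to show how the three questions are answered. It assumes one pipeline stage per component, a single-cycle version with no register overhead, and a steady-state speedup (fill and drain time ignored).

```python
# Hypothetical component latencies in ns; NOT taken from the lecture.
component_ns = [10, 8, 10, 10, 7]
overhead_ns = 2  # fixed pipeline overhead from the stated assumptions

# Single-cycle unpipelined version: one long cycle covering every component.
unpipelined_cycle_ns = sum(component_ns)

# Pipelined version: the clock must fit the slowest stage plus the overhead.
pipelined_cycle_ns = max(component_ns) + overhead_ns

# One operation spends one cycle in each stage.
pipelined_total_ns = len(component_ns) * pipelined_cycle_ns

# Steady state: one result per cycle in both designs.
speedup = unpipelined_cycle_ns / pipelined_cycle_ns

print(pipelined_cycle_ns, pipelined_total_ns, speedup)
```

With these made-up numbers the pipelined cycle is 12 ns, a single operation takes 60 ns instead of 45 ns, and the throughput speedup is 3.75 rather than the ideal 5.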
13Multiple-Cycle DLX: Cycles 1 and 2
- Most DLX instructions can be implemented in 5 clock cycles (see Figure 3.1 on page 130).
- The first two clock cycles are the same for every instruction.
- 1. Instruction fetch cycle (IF)
- IR <- Mem[PC] (load instruction)
- NPC <- PC + 4 (update program counter)
- 2. Instruction decode / register fetch cycle (ID)
- A <- Regs[IR[6..10]] (fetch source reg1)
- B <- Regs[IR[11..15]] (fetch source reg2)
- Imm <- ((IR[16])^16 ## IR[16..31]) (fetch and sign-extend the immediate: bit 16 replicated 16 times, concatenated with bits 16 to 31)
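The field extraction and sign extension done in the ID cycle can be sketched as follows. This assumes the DLX convention that bit 0 is the most significant bit of the 32-bit instruction word, so bits 6 to 10 sit just below the 6-bit opcode. The function name and the opcode value in the example are placeholders, not taken from the lecture.

```python
def decode_fields(ir):
    """Split a 32-bit instruction word into the fields read in the ID cycle."""
    rs1 = (ir >> 21) & 0x1F   # bits 6..10 (bit 0 is the most significant bit)
    rs2 = (ir >> 16) & 0x1F   # bits 11..15
    imm16 = ir & 0xFFFF       # bits 16..31
    # Sign-extend: if bit 16 (the immediate's sign bit) is set, the value is negative.
    imm = imm16 - 0x10000 if imm16 & 0x8000 else imm16
    return rs1, rs2, imm


# Hypothetical encoding: base register R2, second register field R1, offset 30.
# The opcode value 0x23 is a placeholder for illustration only.
ir = (0x23 << 26) | (2 << 21) | (1 << 16) | 30
print(decode_fields(ir))      # (2, 1, 30)
print(decode_fields(0xFFFF))  # (0, 0, -1): an all-ones immediate is -1
```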
14Multiple-Cycle DLX: Cycle 3
- The third cycle is known as the
- Execution / effective address cycle (EX)
- The actions performed in this cycle depend on the type of operation.
- Memory reference (e.g., LW R1, 30(R2))
- ALUOutput <- A + Imm (calculate effective address)
- Register-register ALU op. (e.g., ADD R1, R2, R3)
- ALUOutput <- A op B (perform ALU operation)
- Register-immediate ALU op. (e.g., ADDI R1, R2, 3)
- ALUOutput <- A op Imm (perform ALU operation)
- Branch (e.g., BEQZ R4, next)
- ALUOutput <- NPC + Imm (compute branch target)
- Cond <- (A == 0) (compare A to 0)
15Multiple-Cycle DLX: Cycle 4
- The fourth cycle is known as the
- Memory access / branch completion cycle (MEM)
- The only DLX instructions active in this cycle are loads, stores, and branches
- Loads (e.g., LW R1, 30(R2))
- LMD <- Mem[ALUOutput] (load data from memory into the processor)
- Stores (e.g., SW 500(R4), R3)
- Mem[ALUOutput] <- B (store data into memory)
- Branch (e.g., BEQZ R4, next)
- if (cond) PC <- ALUOutput (set PC based on cond)
- else PC <- NPC
16Multiple-Cycle DLX: Cycle 5
- The fifth cycle is known as the
- Write-back cycle (WB)
- During this cycle, results are written to the register file
- Register-register ALU op. (e.g., ADD R1, R2, R3)
- Regs[IR[16..20]] <- ALUOutput
- Register-immediate ALU op. (e.g., ADDI R1, R2, 3)
- Regs[IR[11..15]] <- ALUOutput
- Load instruction (e.g., LW R1, 30(R2))
- Regs[IR[11..15]] <- LMD
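The five cycles can be tied together in a small interpreter. This is a sketch under simplifying assumptions: instructions are given as already-decoded dictionaries instead of encoded words, memory is a dictionary keyed by address, and only LW, SW, ADD, ADDI, and BEQZ are handled. The name `run_multicycle` and the dictionary keys are illustrative.

```python
def run_multicycle(instr, regs, mem, pc):
    """Run one instruction through IF, ID, EX, MEM, WB.

    Returns (next_pc, cycles_used). Stores and branches finish in MEM.
    """
    op = instr["op"]
    # IF: the instruction is passed in already fetched; compute NPC.
    npc = pc + 4
    # ID: read both register operands and the immediate.
    a = regs[instr.get("rs1", 0)]
    b = regs[instr.get("rs2", 0)]
    imm = instr.get("imm", 0)
    # EX: effective address, ALU result, or branch target and condition.
    cond = False
    if op in ("LW", "SW", "ADDI"):
        alu_output = a + imm
    elif op == "ADD":
        alu_output = a + b
    elif op == "BEQZ":
        alu_output = npc + imm
        cond = (a == 0)
    else:
        raise ValueError(f"unsupported op: {op}")
    # MEM: only loads, stores, and branches are active.
    if op == "SW":
        mem[alu_output] = b
        return npc, 4
    if op == "BEQZ":
        return (alu_output if cond else npc), 4
    lmd = mem[alu_output] if op == "LW" else None
    # WB: write the loaded value or the ALU result to the register file.
    regs[instr["rd"]] = lmd if op == "LW" else alu_output
    return npc, 5


regs = [0] * 32
regs[2] = 100
mem = {130: 7}

pc, c1 = run_multicycle({"op": "LW", "rd": 1, "rs1": 2, "imm": 30}, regs, mem, 0)
pc, c2 = run_multicycle({"op": "ADD", "rd": 3, "rs1": 1, "rs2": 1}, regs, mem, pc)
pc, c3 = run_multicycle({"op": "BEQZ", "rs1": 4, "imm": 8}, regs, mem, pc)
print(regs[1], regs[3], pc, (c1, c2, c3))  # 7 14 20 (5, 5, 4)
```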
17Five Steps of DLX Datapath (Figure 3.1)
18CPI for the Multiple-Cycle DLX
- The multiple-cycle DLX requires 4 cycles for branches and stores and 5 cycles for the other operations.
- Assuming 20% of the instructions are branches or stores, this gives a CPI of 0.2 x 4 + 0.8 x 5 = 4.80.
- We could improve the CPI by allowing ALU operations to complete during the memory cycle.
- Assuming 40% of the instructions are ALU operations, this would reduce the CPI to 0.2 x 4 + 0.4 x 4 + 0.4 x 5 = 4.40.
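Both CPI figures are weighted averages over the instruction mix. A minimal sketch, assuming the 20% of instructions that take 4 cycles are the branches and stores (the function name `average_cpi` is illustrative):

```python
def average_cpi(mix):
    """mix is a list of (fraction_of_instructions, cycles) pairs."""
    total = sum(fraction for fraction, _ in mix)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("instruction fractions must sum to 1")
    return sum(fraction * cycles for fraction, cycles in mix)


# 20% branches/stores at 4 cycles, everything else at 5 cycles.
base = average_cpi([(0.2, 4), (0.8, 5)])

# ALU operations (40%) now also finish in 4 cycles; the remaining 40% take 5.
improved = average_cpi([(0.2, 4), (0.4, 4), (0.4, 5)])

print(round(base, 2), round(improved, 2))  # 4.8 4.4
```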
19Pipelining DLX
- To reduce the CPI, DLX can be implemented using a five-stage pipeline.
- In this example, it takes 10 cycles to execute 5 instructions, for a CPI of 2.
20Visualizing Pipelining (Figure 3.3, Page 133)
[Figure: pipeline diagram with time (clock cycles) on the horizontal axis and instruction order on the vertical axis.]
21Pipelined DLX Datapath (Figure 3.4, page 134)
[Figure: datapath divided into Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc., Memory Access, and Write Back. Components shown: PC, Inst. Mem., Regs, Sign Ext. (16 to 32 bits), ALU, Zero? test, Data Mem., and multiplexers. Pipeline registers: IF/ID, ID/EX, EX/MEM, MEM/WB.]
- Pipeline registers are used to transfer results from one pipeline stage to the next.
22Basic Performance Issues in Pipelining
- Pipelining increases the CPU instruction throughput (the number of instructions completed per unit of time), but it does not reduce the execution time of an individual instruction.
23Pipeline Speedup Example
- Assume the multiple-cycle DLX has a 10-ns clock cycle, loads take 5 clock cycles and account for 40% of the instructions, and all other instructions take 4 clock cycles.
- If pipelining the machine adds 1 ns to the clock cycle, how much speedup in instruction execution rate do we get from pipelining?
- MC Ave Instr. Time = Clock cycle x Average CPI
- = 10 ns x (0.6 x 4 + 0.4 x 5)
- = 44 ns
- PL Ave Instr. Time = 10 + 1 = 11 ns
- Speedup = 44 / 11 = 4
- This ignores the time needed to fill and empty the pipeline and delays due to hazards.
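The effect of the fill time that this example ignores can be estimated with a short sketch. It assumes a five-stage pipeline with no hazards, so n instructions need n + 4 cycles of 11 ns each; the function names are illustrative.

```python
def multi_cycle_time_ns(clock_ns, mix):
    """Average instruction time on the multiple-cycle machine."""
    return clock_ns * sum(fraction * cycles for fraction, cycles in mix)


def speedup_for_n_instructions(n, depth=5, pipelined_clock_ns=11):
    """Speedup when the time to fill the pipeline is included."""
    unpipelined = n * multi_cycle_time_ns(10, [(0.6, 4), (0.4, 5)])
    pipelined = (n + depth - 1) * pipelined_clock_ns  # fill, then one per cycle
    return unpipelined / pipelined


print(round(multi_cycle_time_ns(10, [(0.6, 4), (0.4, 5)]), 2))  # 44.0
for n in (5, 100, 1000):
    print(n, round(speedup_for_n_instructions(n), 3))
```

For long instruction streams the speedup approaches the value of 4 computed above; for very short ones the fill time dominates.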
24It's Not That Easy for Computers
- Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
- Structural hazards: hardware cannot support this combination of instructions (two instructions need the same resource)
- Data hazards: an instruction depends on the result of a prior instruction still in the pipeline
- Control hazards: pipelining of branches and other instructions that change the PC
- Common solution is to stall the pipeline until the hazard is resolved, inserting one or more bubbles in the pipeline
25Speed Up Equations for Pipelining
- Stalls reduce the speedup obtained from pipelining
- Speedup from pipelining = Ave Instr Time unpipelined / Ave Instr Time pipelined
- = (CPI unpipelined x Clock Cycle unpipelined) / (CPI pipelined x Clock Cycle pipelined)
- CPI pipelined = Ideal CPI + Pipeline stall CPI
- = 1 + Pipeline stall CPI
- Speedup = (CPI unpipelined / (1 + Pipeline stall CPI)) x (Clock Cycle unpipelined / Clock Cycle pipelined)
- Speedup <= Pipeline depth / (1 + Pipeline stall CPI)
26Speed Up Equation for Pipelining
- CPI pipelined = Ideal CPI + Pipeline stall clock cycles per instr
- Speedup = ((Ideal CPI x Pipeline depth) / (Ideal CPI + Pipeline stall CPI)) x (Clock Cycle unpipelined / Clock Cycle pipelined)
- Assuming an ideal CPI of 1:
- Speedup = (Pipeline depth / (1 + Pipeline stall CPI)) x (Clock Cycle unpipelined / Clock Cycle pipelined)
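The speedup equation is easy to evaluate for different stall rates. A minimal sketch; the function name and the stall values in the example are hypothetical and chosen only to show the trend.

```python
def pipeline_speedup(depth, stall_cpi, clock_unpipelined=1.0,
                     clock_pipelined=1.0, ideal_cpi=1.0):
    """Speedup = (ideal CPI x depth) / (ideal CPI + stall CPI) x clock ratio."""
    cpi_term = (ideal_cpi * depth) / (ideal_cpi + stall_cpi)
    clock_term = clock_unpipelined / clock_pipelined
    return cpi_term * clock_term


# Five-stage pipeline, equal clock cycles, hypothetical stall rates.
for stall_cpi in (0.0, 0.25, 1.0):
    print(stall_cpi, pipeline_speedup(5, stall_cpi))  # 5.0, 4.0, 2.5
```

With no stalls the speedup equals the pipeline depth; one stall cycle per instruction on average halves it.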
27Structural Hazards
- Sometimes called resource conflicts.
- Example.
- Some pipelined machines share a single memory pipeline for data and instructions. As a result, when an instruction contains a data memory reference, it conflicts with the instruction fetch for a later instruction.
28One Memory Port / Structural Hazards (Figure 3.6, Page 142)
[Figure: pipeline diagram, instruction order: Load, Instr 1, Instr 2, Instr 3, Instr 4.]
29One Memory Port / Structural Hazards (Figure 3.7, Page 143)
[Figure: pipeline diagram, instruction order: Load, Instr 1, Instr 2, stall, Instr 3.]
30A Pipeline Stalled for a Structural Hazard
Inst.       1      2      3      4      5      6      7      8      9      10
Load Inst   IF     ID     EX     MEM    WB
Instr i+1          IF     ID     EX     MEM    WB
Instr i+2                 IF     ID     EX     MEM    WB
Instr i+3                        STALL  IF     ID     EX     MEM    WB
Instr i+4                                      IF     ID     EX     MEM    WB
Instr i+5                                             IF     ID     EX     MEM
Instr i+6                                                    IF     ID     EX
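The stall pattern in this table can be reproduced with a small scheduler. This is a sketch under simplifying assumptions: memory has one port, only loads use it for data (in their MEM stage), there are no other hazards, and once fetched an instruction moves one stage per cycle. The function name is illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def schedule_single_port(is_load):
    """Return the 1-based IF cycle of each instruction with a one-port memory.

    is_load[i] is True when instruction i accesses data memory in MEM.
    A fetch is delayed while an earlier load occupies the memory port.
    """
    fetch_cycles = []
    busy = set()  # cycles in which a load's MEM stage uses the port
    cycle = 1
    for load in is_load:
        while cycle in busy:
            cycle += 1           # stall: the port is taken by a data access
        fetch_cycles.append(cycle)
        if load:
            busy.add(cycle + 3)  # MEM comes three cycles after IF
        cycle += 1
    return fetch_cycles


# One load followed by six instructions that do not touch data memory.
fetch = schedule_single_port([True] + [False] * 6)
print(fetch)  # [1, 2, 3, 5, 6, 7, 8]

for i, start in enumerate(fetch):
    row = ["     "] * (start - 1) + [f"{stage:<5}" for stage in STAGES]
    print(f"i+{i}".ljust(6) + "".join(row))
```

The fourth instruction cannot be fetched in cycle 4 because the load is using memory, so it and every later instruction slip by one cycle.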
31Example One or Two Memory Ports?
- Machine A has a two-port memory: it can access instructions and data simultaneously.
- Machine B has a one-port memory, but its pipelined implementation has a 1.05 times faster clock rate
- Ideal CPI = 1 for both
- Loads are 40% of instructions executed
- Ave Instr. Time A = Clock cycle A x CPI A
- = Clock cycle A
- Ave Instr. Time B = Clock cycle B x CPI B
- = (Clock cycle A / 1.05) x (1 + 0.4)
- = Clock cycle A x 1.33
- Ave Instr. Time B / Ave Instr. Time A = 1.33
- Machine A is 1.33 times faster
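The comparison can be checked numerically. A minimal sketch, assuming each load on Machine B costs exactly one stall cycle and taking Machine A's clock cycle as the unit of time:

```python
load_fraction = 0.4     # loads are 40% of instructions
clock_speedup_b = 1.05  # Machine B's clock rate relative to Machine A

clock_cycle_a = 1.0     # unit of time
cpi_a = 1.0             # two ports: no structural stalls
cpi_b = 1.0 + load_fraction * 1  # one stall cycle per load

time_a = clock_cycle_a * cpi_a
time_b = (clock_cycle_a / clock_speedup_b) * cpi_b

print(round(time_b / time_a, 2))  # 1.33
```

The 5% faster clock does not make up for the 40% increase in CPI.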
32Data Hazard
- Data hazards occur when the pipeline changes the order of read/write accesses to operands so that the order differs from the order seen by sequentially executing instructions on an unpipelined machine
33Data Hazard on R1 (Figure 3.9, page 147)
34Forwarding to Avoid Data Hazard (Figure 3.10, Page 149)
35Three Generic Data Hazards
- InstrI followed by InstrJ
- Read After Write (RAW): InstrJ tries to read operand before InstrI writes it
I: ADD R1, R2, R3    IF   ID   EX   MEM  WB
J: SUB R4, R1, R5         IF   ID   EX   MEM  WB
36Three Generic Data Hazards
- Write After Write (WAW)
- InstrJ tries to write operand before InstrI writes it
- Leaves the wrong result (InstrI's, not InstrJ's)
- Can't happen in the DLX 5-stage pipeline because
- All instructions take 5 stages, and
- Writes are always in stage 5
- Example in a pipeline with two memory stages, where it can happen:
I: LW R1, 0(R2)      IF   ID   EX   MEM1 MEM2 WB
J: ADD R1, R2, R3         IF   ID   EX   WB
37Three Generic Data Hazards
- InstrI followed by InstrJ
- Write After Read (WAR): InstrJ tries to write operand before InstrI reads it
- Can't happen in the DLX 5-stage pipeline because
- All instructions take 5 stages,
- Reads are always in stage 2, and
- Writes are always in stage 5
- Example in a pipeline with two memory stages:
I: SW 0(R1), R2      IF   ID   EX   MEM1 MEM2 WB
J: ADD R2, R3, R4         IF   ID   EX   WB
38Data Hazard Even with Forwarding (Figure 3.12, Page 153)
39Data Hazard Even with Forwarding (Figure 3.13, Page 154)
40HW Change for Forwarding (Figure 3.20, Page 161)
41Compiler Scheduling for Data Hazards
- Rather than just allowing the pipeline to stall, the compiler can try to avoid these stalls by rearranging the code sequence to eliminate the hazard. This technique is called pipeline scheduling or instruction scheduling.
42Software Scheduling to Avoid Load Hazards
Try producing fast code for
a = b + c;
d = e - f;
assuming a, b, c, d, e, and f are in memory.
- Slow code
- LW Rb,b
- LW Rc,c
- ADD Ra,Rb,Rc
- SW a,Ra
- LW Re,e
- LW Rf,f
- SUB Rd,Re,Rf
- SW d,Rd
- Fast code
- LW Rb,b
- LW Rc,c
- LW Re,e
- ADD Ra,Rb,Rc
- LW Rf,f
- SW a,Ra
- SUB Rd,Re,Rf
- SW d,Rd
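The benefit of the reordering can be counted directly. This sketch assumes a pipeline with forwarding in which the only remaining stall is one cycle when an instruction uses the result of the load immediately before it; the tuple representation and function name are illustrative.

```python
def count_load_stalls(code):
    """Count one-cycle stalls caused by using a load result in the next instruction.

    code is a list of (opcode, destination_or_None, source_registers).
    """
    stalls = 0
    for previous, current in zip(code, code[1:]):
        prev_op, prev_dest, _ = previous
        _, _, cur_sources = current
        if prev_op == "LW" and prev_dest in cur_sources:
            stalls += 1
    return stalls


slow = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("ADD", "Ra", ("Rb", "Rc")),
    ("SW", None, ("Ra",)), ("LW", "Re", ()), ("LW", "Rf", ()),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]
fast = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("LW", "Rf", ()), ("SW", None, ("Ra",)),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]

print(count_load_stalls(slow), count_load_stalls(fast))  # 2 0
```

Under these assumptions the slow sequence stalls twice (ADD after LW Rc, SUB after LW Rf) and the scheduled sequence not at all.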
43Compiler Avoiding Load Stalls
44Pipelining Summary
- Pipelining overlaps the execution of multiple instructions.
- With an ideal pipeline, the CPI is one, and the speedup is equal to the number of stages in the pipeline.
- However, several factors prevent us from achieving the ideal speedup, including
- Not being able to divide the pipeline evenly
- The time needed to fill and drain the pipeline
- Overhead needed for pipelining
- Structural, data, and control hazards
45Pipelining Summary
- Just overlap tasks; easy if tasks are independent
- Speedup vs. pipeline depth: if ideal CPI is 1, then
- Speedup = (Pipeline depth / (1 + Pipeline stall CPI)) x (Clock Cycle unpipelined / Clock Cycle pipelined)
- Hazards limit performance on computers
- Structural: need more HW resources
- Data: need forwarding, compiler scheduling
- Control: discussed next time