Title: History of Pipelining
1 History of Pipelining
- Introduced in the IBM 7030 (Stretch) computer
- CDC 6600 used pipelining and multiple functional units
- RISC processors in the 1980s were pipelined, in an
  effort to reach an IPC of 1
- The i486 was the first pipelined CISC processor
- Pipelined VAX from Digital
- Pipelined Motorola 68K
- Current trend: deep pipelines
2 Pipeline Illustrated
3 Pipeline Partitioning
Divide functionality into k stages: do we get a k-fold
speedup? Limiting factors: latches, clock skew, and
non-uniform sub-computations.
4 Earle Latch
5 Pipeline Partitioning
6 Pipeline Partitioning
K_opt = sqrt((G * T) / (L * S))
  G: cost of the non-pipelined design
  L: cost of each latch
  k: number of stages
  T: latency of the non-pipelined design
  S: latency increase due to a latch
  (i.e. T/k + S is the new clock period)
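The optimum can be checked numerically. This is a minimal sketch of the cost/performance trade-off behind the formula; the G, L, T, S values in the example call are hypothetical, not from the slide.

```python
import math

def optimal_stages(G, L, T, S):
    """Stage count k minimizing cost x clock period,
    where cost = G + k*L and clock period = T/k + S."""
    return math.sqrt((G * T) / (L * S))

# Hypothetical numbers: G = 175 (gate cost), L = 10 per latch,
# T = 400 ns non-pipelined latency, S = 22 ns latch overhead.
k_opt = optimal_stages(175, 10, 400, 22)  # ~17.8 stages
```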
7 Non-Pipelined FP Multiplier
8 Pipelined FP Multiplier
Non-pipelined chip count: 175
Non-pipelined delay: 400 ns
Non-pipelined clock: 2.5 MHz
Assume latching delay: 17 ns, set-up time: 5 ns
Max stage delay: 150 ns
Minimum clock period: 150 + 17 + 5 = 172 ns
Pipelined clock: 5.8 MHz
Latency for each multiply: 516 ns
(instead of 400 ns)
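The timing arithmetic above can be reproduced in a few lines; the 3-stage count is an assumption inferred from 516 ns / 172 ns, not stated explicitly on the slide.

```python
# Pipelined FP-multiplier timing from the slide's numbers.
max_stage_delay = 150  # ns, slowest pipeline stage
latch_delay = 17       # ns
setup_time = 5         # ns
stages = 3             # assumed: 516 ns latency / 172 ns period

clock_period_ns = max_stage_delay + latch_delay + setup_time  # 172 ns
clock_mhz = 1000.0 / clock_period_ns                          # ~5.8 MHz
latency_ns = stages * clock_period_ns                         # 516 ns
```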
9 Pipelined FP Multiplier
10 Pipeline Partitioning
11 CPU Example
- Suppose 2 ns for memory access, 2 ns for an ALU
  operation, and 1 ns for a register file read or
  write; compute the instruction rate
- Non-pipelined execution
  - lw: IF + Read Reg + ALU + Memory + Write Reg
    = 2 + 1 + 2 + 2 + 1 = 8 ns
  - add: IF + Read Reg + ALU + Write Reg
    = 2 + 1 + 2 + 1 = 6 ns
- Pipelined execution
  - clock = max(IF, Read Reg, ALU, Memory, Write Reg) = 2 ns
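The comparison above can be sketched directly from the per-stage latencies on the slide:

```python
# Stage latencies (ns) from the slide.
stage_ns = {"IF": 2, "RegRead": 1, "ALU": 2, "Mem": 2, "RegWrite": 1}

# Non-pipelined: each instruction pays for every stage it uses.
lw_ns = sum(stage_ns[s] for s in ("IF", "RegRead", "ALU", "Mem", "RegWrite"))  # 8 ns
add_ns = sum(stage_ns[s] for s in ("IF", "RegRead", "ALU", "RegWrite"))        # 6 ns

# Pipelined: one instruction completes per clock once the pipe is full,
# so the clock is set by the slowest stage.
clock_ns = max(stage_ns.values())  # 2 ns
```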
12 Problems for Computers
- Limits to pipelining: hazards prevent the next
  instruction from executing during its designated
  clock cycle
  - Data hazards: an instruction depends on the result of a
    prior instruction still in the pipeline (missing sock)
  - Structural hazards: 2 instructions need the same
    resource at the same time
  - Control hazards: pipelining of branches; other
    instructions stall the pipeline until the hazard
    resolves, creating bubbles in the pipeline
13 Structural Hazard 1 Single Memory (1/2)
Read the same memory twice in the same clock cycle
14 Structural Hazard 1 Single Memory (2/2)
- Solution
  - have both an L1 Instruction Cache and an L1 Data
    Cache
  - need more complex hardware to control when both
    caches miss
15 Structural Hazard 2 Registers (1/2)
Can't read and write to registers simultaneously
16 Structural Hazard 2 Registers (2/2)
- Fact: register access is VERY fast: it takes less
  than half the time of the ALU stage
- Solution: introduce a convention
  - always Write to Registers during the first half of
    each clock cycle
  - always Read from Registers during the second half of
    each clock cycle
- Result: can perform a Read and a Write during the same
  clock cycle
17 Things to Remember
- Optimal pipeline
  - Each stage is executing part of an instruction
    each clock cycle.
  - One instruction finishes during each clock cycle.
  - On average, instructions execute far more quickly
    than without pipelining.
- What makes this work?
  - Similarities between instructions allow us to use the
    same stages for all instructions (generally).
  - Each stage takes about the same amount of time as
    all the others: little wasted time.
18 MIPS ISA Handout
19 Data Hazards (1/2)
- Consider the following sequence of instructions
20 Data Hazards (2/2)
Dependencies backwards in time are hazards
21 Data Hazard Solution Forwarding
- Forward the result from one stage to another
- The sub and and instructions could use forwarding
22 Data Hazard Loads (1/4)
- Dependencies backwards in time are hazards
- Can't solve with forwarding alone
- Must stall the instruction dependent on the load, then
  forward (more hardware)
23 Data Hazard Loads (2/4)
- Hardware must stall the pipeline
- Called an interlock
24 Data Hazard Loads (3/4)
- Stall is equivalent to a nop
    lw  t0, 0(t1)
    nop
    sub t3, t0, t2
    and t5, t0, t4
25 Data Hazard Loads (4/4)
- The instruction slot after a load is called the load
  delay slot
- If that instruction uses the result of the load,
  then the hardware interlock will stall it for one
  cycle.
- If the compiler puts an unrelated instruction in
  that slot, then no stall occurs.
- Letting the hardware stall the instruction in the
  delay slot is equivalent to putting a nop in the
  slot (except the latter uses more code space)
26 Control Hazard Branching (1/5)
Where do we do the compare for the branch?
27 Control Hazard Branching (2/5)
- We put branch decision-making hardware in the ALU
  stage
  - therefore two more instructions after the branch
    will always be fetched, whether or not the branch
    is taken
- Desired functionality of a branch
  - if we do not take the branch, don't waste any
    time and continue executing normally
  - if we take the branch, don't execute any
    instructions after the branch, just go to the
    desired label
28 Control Hazard Branching (3/5)
- Initial solution: stall until the decision is made
  - insert no-op instructions: those that
    accomplish nothing, just take time
- Drawback: branches take 3 clock cycles each
  (assuming the comparator is put in the ALU stage)
29 Control Hazard Branching (4/5)
- Optimization 1
  - move the asynchronous comparator up to Stage 2
  - as soon as the instruction is decoded (the opcode
    identifies it as a branch), immediately make a
    decision and set the value of the PC (if
    necessary)
  - Benefit: since the branch completes in Stage 2,
    only one unnecessary instruction is fetched, so
    only one no-op is needed
  - Side note: this means that branches are idle in
    Stages 3, 4 and 5.
30 Control Hazard Branching (5/5)
- Insert a single no-op (bubble)
- Impact: 2 clock cycles per branch instruction →
  slow
31 Quiz
- Assume 1 instr/clock, delayed branch, 5-stage
  pipeline, forwarding, interlock on unresolved
  load hazards (after 10^3 loops, so pipeline is full)
  Loop: lw    t0, 0(s1)
        addu  t0, t0, s2
        sw    t0, 0(s1)
        addiu s1, s1, -4
        bne   s1, zero, Loop
        nop
- How many pipeline stages (clock cycles) per loop
  iteration to execute this code?
  1  2  3  4  5  6  7  8  9  10
32 Quiz Answer
- Assume 1 instr/clock, delayed branch, 5-stage
  pipeline, forwarding, interlock on unresolved
  load hazards. 10^3 iterations, so pipeline is full.
  Loop: lw    t0, 0(s1)       1.
        addu  t0, t0, s2      2. (data hazard, so stall)
        sw    t0, 0(s1)       3.
        addiu s1, s1, -4      4.
        bne   s1, zero, Loop  5.
        nop                   6.
- How many pipeline stages (clock cycles) per loop
  iteration to execute this code?
  Answer: 7 (6 instructions + 1 load-use stall)
  1  2  3  4  5  6  7  8  9  10
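The cycle count can be sketched as a simple sum, assuming (as the quiz states) a 1-instruction/clock pipeline where the only extra cycle is the load-use interlock between the lw and the addu:

```python
# Cycles per loop iteration: 6 instructions at 1 CPI, plus one
# interlock stall because addu needs t0 the cycle after lw loads it.
instructions = ["lw", "addu", "sw", "addiu", "bne", "nop"]
load_use_stalls = 1
cycles_per_iteration = len(instructions) + load_use_stalls  # 7
```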
33 Pipelining Idealism
- Uniform suboperations
  - The operation to be pipelined can be evenly
    partitioned into uniform-latency suboperations
- Repetition of identical operations
  - The same operations are to be performed
    repeatedly on a large number of different inputs
- Repetition of independent operations
  - All the repetitions of the same operation are
    mutually independent, i.e. no data dependences
    and no resource conflicts
- Good examples: automobile assembly line,
  floating-point multiplier,
  instruction pipeline???
34 Instruction Pipeline Design
- Uniform suboperations ...
  → balance pipeline stages
    - stage quantization to yield balanced stages
    - minimize internal fragmentation (some waiting
      stages)
- Identical operations ...
  → unify instruction types
    - coalescing instruction types into one
      multi-function pipe
    - minimize external fragmentation (some idling
      stages)
- Independent operations ...
  → resolve data and resource hazards
    - inter-instruction dependency detection and
      resolution
    - minimize performance loss
35 The Generic Instruction Cycle
- The computation to be pipelined
- Instruction Fetch (IF)
- Instruction Decode (ID)
- Operand(s) Fetch (OF)
- Instruction Execution (EX)
- Operand Store (OS)
- Update Program Counter (PC)
36 The GENERIC Instruction Pipeline (GNR)
- Based on Obvious Subcomputations
37 Balancing Pipeline Stages
- Without pipelining
  - Tcyc ≥ T_IF + T_ID + T_OF + T_EX + T_OS = 31 units
- Pipelined
  - Tcyc ≥ max{T_IF, T_ID, T_OF, T_EX, T_OS} = 9 units
- Speedup = 31 / 9 ≈ 3.4
- Can we do better in terms of either performance
  or efficiency?

  T_IF = 6 units
  T_ID = 2 units
  T_OF = 9 units
  T_EX = 5 units
  T_OS = 9 units
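The 31/9 speedup can be checked in a few lines; this reads the five stage times as IF = 6, ID = 2, OF = 9, EX = 5, OS = 9 units (taking the second 9-unit entry on the slide as T_OF):

```python
# Stage times in units; the pipelined clock is set by the slowest stage.
stage_units = {"IF": 6, "ID": 2, "OF": 9, "EX": 5, "OS": 9}
t_nonpipelined = sum(stage_units.values())  # 31 units
t_pipelined = max(stage_units.values())     # 9 units
speedup = t_nonpipelined / t_pipelined      # ~3.44
```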
38 Balancing Pipeline Stages
- Two methods for stage quantization
  - Merging of multiple subcomputations into one.
  - Subdividing a subcomputation into multiple
    subcomputations.
- Current trends
  - Deeper pipelines (more and more stages).
  - Multiplicity of different (sub)pipelines.
  - Pipelining of memory access (tricky).
39 Granularity of Pipeline Stages
- Coarser-grained machine cycle: 4 machine cycles /
  instruction cycle
- Finer-grained machine cycle: 11 machine cycles /
  instruction cycle

  T_IF,ID = 8 units
  T_OF = 9 units
  T_EX = 5 units
  T_OS = 9 units
  Tcyc = 3 units
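The finer-grained count follows from subdividing each coarse stage into 3-unit machine cycles; this sketch assumes the four coarse stages are IF+ID = 8, OF = 9, EX = 5, OS = 9 units as listed above:

```python
import math

# Each coarse stage is split into machine cycles of Tcyc = 3 units;
# a stage that doesn't divide evenly still occupies whole cycles.
coarse = {"IF+ID": 8, "OF": 9, "EX": 5, "OS": 9}
t_machine = 3
machine_cycles = sum(math.ceil(t / t_machine) for t in coarse.values())
# ceil(8/3) + ceil(9/3) + ceil(5/3) + ceil(9/3) = 3 + 3 + 2 + 3 = 11
```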
40 Hardware Requirements
- Logic needed for each pipeline stage
- Register file ports needed to support all the
stages - Memory accessing ports needed to support all the
stages
41 Pipeline Examples
[Figure: pipeline stage diagrams for the MIPS R2000/R3000
and the 12-stage AMDAHL 470V/7, whose stages run from PC
GEN (1) through Check Result (11) and writeback (12)]
42 Coalescing Resource Requirements
- Procedure
- Analyze the sequence of register transfers
required by each instruction type. - Find commonality across instruction types and
merge them to share the same pipeline stage. - If there exists flexibility, shift or reorder
some register transfers to facilitate further
merging.
43 Unifying Instructions to 6-stage Pipeline
- The 6-stage TYPICAL (TYP) pipeline
44 Physical Organization of 6-stage Pipeline
45 Pipeline Interface to Memory
46 Pipeline Interface to Register File
47 Minimizing Pipeline Stalls
- Dependences lead to pipeline hazards
48 Occurrence of Hazards
49 Penalties Due to RAW Hazards
50 Incorporation of Forwarding Paths in TYP Pipeline
51 Penalties with Forwarding Paths
52 Forwarding Paths for Leading ALU Instruction
53 Pipeline Interlocks for Leading ALU Instruction
54 Forwarding for Leading Load
55 Pipeline Interlocks for ALU, Load
56 Pipeline Interlocks for Branch
57 Historical Trivia
- The first MIPS design did not interlock and stall on a
  load-use data hazard
- Real reason for the name MIPS: Microprocessor
  without Interlocked Pipeline Stages
- Word play on the acronym for Millions of
  Instructions Per Second, also called MIPS
58 Load Delay Slot (MIPS R2000)

        t0   t1   t2   t3   t4   t5  ...
  i:    IF   ID   RD   ALU  MEM  WB
  j:         IF   ID   RD   ALU  MEM  WB
  k:              IF   ID   RD   ALU  MEM  WB

- The effect of a delayed Load is not visible
  to the instructions in its delay slots.
    h: Rk ← --
    i: Rk ← MEM[...]
    j: -- ← Rk
    k: -- ← Rk
  Which (Rk) do we really mean?
59 RISC Pipeline Example
60 Real Pipelined Processor Example MIPS R2000
61 Intel i486 5-Stage CISC Pipeline
62 IBM's Experience on Pipelined Processors
   Agerwala and Cocke 1987
- Attributes and assumptions
  - Memory bandwidth
    - at least one word/cycle to fetch 1
      instruction/cycle from the I-cache
    - 40% of instructions are load/store and require
      access to the D-cache
- Code characteristics (dynamic)
  - loads: 25%
  - stores: 15%
  - ALU/reg-reg: 40%
  - branches: 20%
    - 1/3 unconditional (always taken)
    - 1/3 conditional taken
    - 1/3 conditional not taken
63 More Statistics and Assumptions
- Cache performance
  - a hit ratio of 100% is assumed in the experiments
  - cache latency: I-cache = i cycles, D-cache = d
    cycles; default i = d = 1 cycle
- Load and branch scheduling
  - loads
    - 25% cannot be scheduled
    - 65% can be moved back 2 or more instructions
    - 10% can be moved back 1 instruction (1 delay slot)
  - branches
    - unconditional: 100% schedulable (fill 1 delay
      slot)
    - conditional: 50% schedulable (fill 1 delay
      slot)
64 CPI Calculations I
  Mix: 25% loads, 15% stores, 40% ALU, 20% branches
- No cache bypass of the reg. file, no scheduling of
  loads or branches
  - Load penalty: 2 cycles
  - Branch penalty: 2 cycles
  - Total CPI = 1 + 0.25*2 + 0.2*0.66*2
    = 1 + 0.5 + 0.27 = 1.77 CPI
    (assume branch not taken; penalty only for the 66%
    of branches that are taken)
- Bypass the reg file for loads
  - Load penalty: 1 cycle
  - Total CPI = 1 + 0.25 + 0.27 = 1.52 CPI
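Both CPI figures follow directly from the instruction mix; a quick check:

```python
# Mix from the slide: 25% loads, 20% branches; 66% of branches
# (unconditional + conditional taken) pay the branch penalty.
load_frac, branch_frac, taken_frac = 0.25, 0.20, 0.66

# 2-cycle load penalty without bypass, 1 cycle with bypass;
# branch penalty is 2 cycles in both cases.
cpi_no_bypass = 1 + load_frac * 2 + branch_frac * taken_frac * 2  # ~1.77
cpi_bypass = 1 + load_frac * 1 + branch_frac * taken_frac * 2     # ~1.52
```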
65 CPI Calculations II
- Bypass, plus scheduling of loads and branches
- Load penalty
  - 75% can be moved back 1 or more → no penalty
  - remaining 25% → 1 cycle penalty
  - Load overhead = 0.25 * 0.25 * 1 = 0.0625
- Branch penalty
  - 1/3 uncond.: 100% schedulable → 1 cycle
  - 1/3 cond. not taken: if biased for NT → no
    penalty
  - 1/3 cond. taken
    - 50% schedulable → 1 cycle
    - 50% unschedulable → 2 cycles
  - Branch overhead = 0.2 * (0.33*1 + 0.33*0.5*1
    + 0.33*0.5*2) = 0.167
- Total CPI = 1 + 0.063 + 0.167 = 1.23 CPI
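The overhead terms can be checked with the same fractions:

```python
# Load overhead: 25% of loads (25% of instructions) stall 1 cycle.
load_overhead = 0.25 * 0.25 * 1  # 0.0625

# Branch overhead: 20% branches, split 1/3 each way.
branch_overhead = 0.2 * (0.33 * 1            # uncond., 1 slot filled
                         + 0.33 * 0.5 * 1    # cond. taken, schedulable
                         + 0.33 * 0.5 * 2)   # cond. taken, unschedulable
cpi = 1 + load_overhead + branch_overhead    # ~1.23
```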
66 CPI Calculations III
- Parallel target address generation to reduce the
  branch penalty from 2 → 1 cycle
  - 90% of branches can be coded as PC-relative
    (instead of register indirect)
  - i.e. the target address can be computed without a
    register access
  - a separate adder can compute (PC + offset) during
    the reg read stage
- Branch penalty (conditional and unconditional)
  - Uncond: 0.2 * 0.33 * 0.1 * 1 = 0.0066 CPI
  - Cond: 0.2 * 0.66 * (0.9*0.5*1 + 0.1*0.5*1
    + 0.1*0.5*2) = 0.079
- Total CPI = 1 + 0.063 + 0.087 = 1.15 CPI = 0.87
  IPC
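The final figure can be verified numerically, keeping the load overhead (0.063) from the previous slide:

```python
# Branch overheads with parallel (PC + offset) target generation:
# 90% of branches are PC-relative and resolve one cycle earlier.
uncond = 0.2 * 0.33 * 0.1 * 1                # ~0.0066 CPI
cond = 0.2 * 0.66 * (0.9 * 0.5 * 1           # PC-relative, 1 cycle
                     + 0.1 * 0.5 * 1         # reg-indirect, schedulable
                     + 0.1 * 0.5 * 2)        # reg-indirect, unschedulable
cpi = 1 + 0.063 + uncond + cond              # ~1.15 CPI
ipc = 1 / cpi                                # ~0.87 IPC
```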
67 Deeply Pipelined Processors
68 Deeply Pipelined Processors
69 Problem Set 2