Title: History of Pipelining
1 History of Pipelining
- Introduced in the IBM 7030 (Stretch) computer
- CDC 6600 used pipelining and multiple functional units
- RISC processors in the 1980s were pipelined, in an
  effort to reach an IPC of 1
- The i486 was the first pipelined CISC processor
- Pipelined VAX from Digital
- Pipelined Motorola 68K
- Current trend: deep pipelines
2 Pipeline Illustrated
3 Pipeline Partitioning
Divide functionality into k stages: do we get a k-fold
speedup? Limiting factors: latches, clock skew, and
non-uniform sub-computations.
4 Earle Latch
5 Pipeline Partitioning
6 Pipeline Partitioning
K_opt = sqrt((G * T) / (L * S))
  G: cost of the non-pipelined design
  L: cost of each latch
  k: number of stages
  T: latency of the non-pipelined design
  S: latency increase due to a latch
  (i.e. T/k + S is the new clock period)
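The optimum can be checked numerically. This is a minimal sketch of the cost/performance trade-off behind the formula; the G, L, T, S values in the example call are hypothetical, not from the slide.

```python
import math

def optimal_stages(G, L, T, S):
    """Stage count k minimizing cost x clock period,
    where cost = G + k*L and clock period = T/k + S."""
    return math.sqrt((G * T) / (L * S))

# Hypothetical numbers: G = 175 (gate cost), L = 10 per latch,
# T = 400 ns non-pipelined latency, S = 22 ns latch overhead.
k_opt = optimal_stages(175, 10, 400, 22)  # ~17.8 stages
```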
7 Non-Pipelined FP Multiplier
8 Pipelined FP Multiplier
Non-pipelined chip count: 175
Non-pipelined delay: 400 ns
Non-pipelined clock: 2.5 MHz
Assume latching delay: 17 ns, set-up time: 5 ns
Max stage delay: 150 ns
Minimum clock period: 150 + 17 + 5 = 172 ns
Pipelined clock: 5.8 MHz
Latency for each multiply: 516 ns
(instead of 400 ns)
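The timing arithmetic above can be reproduced in a few lines; the 3-stage count is an assumption inferred from 516 ns / 172 ns, not stated explicitly on the slide.

```python
# Pipelined FP-multiplier timing from the slide's numbers.
max_stage_delay = 150  # ns, slowest pipeline stage
latch_delay = 17       # ns
setup_time = 5         # ns
stages = 3             # assumed: 516 ns latency / 172 ns period

clock_period_ns = max_stage_delay + latch_delay + setup_time  # 172 ns
clock_mhz = 1000.0 / clock_period_ns                          # ~5.8 MHz
latency_ns = stages * clock_period_ns                         # 516 ns
```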
9 Pipelined FP Multiplier
10 Pipeline Partitioning
11 CPU Example
- Suppose 2 ns for memory access, 2 ns for an ALU
  operation, and 1 ns for a register file read or
  write; compute the instruction rate
- Non-pipelined execution
  - lw: IF + Read Reg + ALU + Memory + Write Reg
    = 2 + 1 + 2 + 2 + 1 = 8 ns
  - add: IF + Read Reg + ALU + Write Reg
    = 2 + 1 + 2 + 1 = 6 ns
- Pipelined execution
  - clock = max(IF, Read Reg, ALU, Memory, Write Reg) = 2 ns
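The comparison above can be sketched directly from the per-stage latencies on the slide:

```python
# Stage latencies (ns) from the slide.
stage_ns = {"IF": 2, "RegRead": 1, "ALU": 2, "Mem": 2, "RegWrite": 1}

# Non-pipelined: each instruction pays for every stage it uses.
lw_ns = sum(stage_ns[s] for s in ("IF", "RegRead", "ALU", "Mem", "RegWrite"))  # 8 ns
add_ns = sum(stage_ns[s] for s in ("IF", "RegRead", "ALU", "RegWrite"))        # 6 ns

# Pipelined: one instruction completes per clock once the pipe is full,
# so the clock is set by the slowest stage.
clock_ns = max(stage_ns.values())  # 2 ns
```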
12 Problems for Computers
- Limits to pipelining: hazards prevent the next
  instruction from executing during its designated
  clock cycle
  - Data hazards: an instruction depends on the result of a
    prior instruction still in the pipeline (missing sock)
  - Structural hazards: 2 instructions need the same
    resource at the same time
  - Control hazards: pipelining of branches; other
    instructions stall the pipeline until the hazard
    resolves, creating bubbles in the pipeline
13 Structural Hazard 1 Single Memory (1/2)
Read the same memory twice in the same clock cycle
14 Structural Hazard 1 Single Memory (2/2)
- Solution
  - have both an L1 Instruction Cache and an L1 Data
    Cache
  - need more complex hardware to control when both
    caches miss
15 Structural Hazard 2 Registers (1/2)
Can't read and write to registers simultaneously
16 Structural Hazard 2 Registers (2/2)
- Fact: register access is VERY fast: it takes less
  than half the time of the ALU stage
- Solution: introduce a convention
  - always Write to Registers during the first half of
    each clock cycle
  - always Read from Registers during the second half of
    each clock cycle
- Result: can perform a Read and a Write during the same
  clock cycle
17 Things to Remember
- Optimal pipeline
  - Each stage is executing part of an instruction
    each clock cycle.
  - One instruction finishes during each clock cycle.
  - On average, instructions execute far more quickly
    than without pipelining.
- What makes this work?
  - Similarities between instructions allow us to use the
    same stages for all instructions (generally).
  - Each stage takes about the same amount of time as
    all the others: little wasted time.
18 MIPS ISA Handout
19 Data Hazards (1/2)
- Consider the following sequence of instructions
20 Data Hazards (2/2)
Dependencies backwards in time are hazards
21 Data Hazard Solution Forwarding
- Forward the result from one stage to another
- The sub and and instructions could use forwarding
22 Data Hazard Loads (1/4)
- Dependencies backwards in time are hazards
- Can't solve with forwarding alone
- Must stall the instruction dependent on the load, then
  forward (more hardware)
23 Data Hazard Loads (2/4)
- Hardware must stall the pipeline
- Called an interlock
24 Data Hazard Loads (3/4)
- Stall is equivalent to a nop
    lw  t0, 0(t1)
    nop
    sub t3, t0, t2
    and t5, t0, t4
25 Data Hazard Loads (4/4)
- The instruction slot after a load is called the load
  delay slot
- If that instruction uses the result of the load,
  then the hardware interlock will stall it for one
  cycle.
- If the compiler puts an unrelated instruction in
  that slot, then no stall occurs.
- Letting the hardware stall the instruction in the
  delay slot is equivalent to putting a nop in the
  slot (except the latter uses more code space)
26 Control Hazard Branching (1/5)
Where do we do the compare for the branch?
27 Control Hazard Branching (2/5)
- We put branch decision-making hardware in the ALU
  stage
  - therefore two more instructions after the branch
    will always be fetched, whether or not the branch
    is taken
- Desired functionality of a branch
  - if we do not take the branch, don't waste any
    time and continue executing normally
  - if we take the branch, don't execute any
    instructions after the branch, just go to the
    desired label
28 Control Hazard Branching (3/5)
- Initial solution: stall until the decision is made
  - insert no-op instructions: those that
    accomplish nothing, just take time
- Drawback: branches take 3 clock cycles each
  (assuming the comparator is put in the ALU stage)
29 Control Hazard Branching (4/5)
- Optimization 1
  - move the asynchronous comparator up to Stage 2
  - as soon as the instruction is decoded (the opcode
    identifies it as a branch), immediately make a
    decision and set the value of the PC (if
    necessary)
  - Benefit: since the branch completes in Stage 2,
    only one unnecessary instruction is fetched, so
    only one no-op is needed
  - Side note: this means that branches are idle in
    Stages 3, 4 and 5.
30 Control Hazard Branching (5/5)
- Insert a single no-op (bubble)
- Impact: 2 clock cycles per branch instruction →
  slow
31 Quiz
- Assume 1 instr/clock, delayed branch, 5-stage
  pipeline, forwarding, interlock on unresolved
  load hazards (after 10^3 loops, so pipeline is full)
  Loop: lw    t0, 0(s1)
        addu  t0, t0, s2
        sw    t0, 0(s1)
        addiu s1, s1, -4
        bne   s1, zero, Loop
        nop
- How many pipeline stages (clock cycles) per loop
  iteration to execute this code?
  1  2  3  4  5  6  7  8  9  10
32 Quiz Answer
- Assume 1 instr/clock, delayed branch, 5-stage
  pipeline, forwarding, interlock on unresolved
  load hazards. 10^3 iterations, so pipeline is full.
  Loop: lw    t0, 0(s1)       1.
        addu  t0, t0, s2      2. (data hazard, so stall)
        sw    t0, 0(s1)       3.
        addiu s1, s1, -4      4.
        bne   s1, zero, Loop  5.
        nop                   6.
- How many pipeline stages (clock cycles) per loop
  iteration to execute this code?
  Answer: 7 (6 instructions + 1 load-use stall)
  1  2  3  4  5  6  7  8  9  10
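The cycle count can be sketched as a simple sum, assuming (as the quiz states) a 1-instruction/clock pipeline where the only extra cycle is the load-use interlock between the lw and the addu:

```python
# Cycles per loop iteration: 6 instructions at 1 CPI, plus one
# interlock stall because addu needs t0 the cycle after lw loads it.
instructions = ["lw", "addu", "sw", "addiu", "bne", "nop"]
load_use_stalls = 1
cycles_per_iteration = len(instructions) + load_use_stalls  # 7
```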
33 Pipelining Idealism
- Uniform suboperations
  - The operation to be pipelined can be evenly
    partitioned into uniform-latency suboperations
- Repetition of identical operations
  - The same operations are to be performed
    repeatedly on a large number of different inputs
- Repetition of independent operations
  - All the repetitions of the same operation are
    mutually independent, i.e. no data dependences
    and no resource conflicts
- Good examples: automobile assembly line,
  floating-point multiplier,
  instruction pipeline???
34 Instruction Pipeline Design
- Uniform suboperations ...
  → balance pipeline stages
    - stage quantization to yield balanced stages
    - minimize internal fragmentation (some waiting
      stages)
- Identical operations ...
  → unify instruction types
    - coalescing instruction types into one
      multi-function pipe
    - minimize external fragmentation (some idling
      stages)
- Independent operations ...
  → resolve data and resource hazards
    - inter-instruction dependency detection and
      resolution
    - minimize performance loss
35 The Generic Instruction Cycle
- The computation to be pipelined
- Instruction Fetch (IF)
- Instruction Decode (ID)
- Operand(s) Fetch (OF)
- Instruction Execution (EX)
- Operand Store (OS)
- Update Program Counter (PC)
36 The GENERIC Instruction Pipeline (GNR)
- Based on Obvious Subcomputations
37 Balancing Pipeline Stages
- Without pipelining
  - Tcyc ≥ T_IF + T_ID + T_OF + T_EX + T_OS = 31 units
- Pipelined
  - Tcyc ≥ max{T_IF, T_ID, T_OF, T_EX, T_OS} = 9 units
- Speedup = 31 / 9 ≈ 3.4
- Can we do better in terms of either performance
  or efficiency?

  T_IF = 6 units
  T_ID = 2 units
  T_OF = 9 units
  T_EX = 5 units
  T_OS = 9 units
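The 31/9 speedup can be checked in a few lines; this reads the five stage times as IF = 6, ID = 2, OF = 9, EX = 5, OS = 9 units (taking the second 9-unit entry on the slide as T_OF):

```python
# Stage times in units; the pipelined clock is set by the slowest stage.
stage_units = {"IF": 6, "ID": 2, "OF": 9, "EX": 5, "OS": 9}
t_nonpipelined = sum(stage_units.values())  # 31 units
t_pipelined = max(stage_units.values())     # 9 units
speedup = t_nonpipelined / t_pipelined      # ~3.44
```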
38 Balancing Pipeline Stages
- Two methods for stage quantization
  - Merging of multiple subcomputations into one.
  - Subdividing a subcomputation into multiple
    subcomputations.
- Current trends
  - Deeper pipelines (more and more stages).
  - Multiplicity of different (sub)pipelines.
  - Pipelining of memory access (tricky).
39 Granularity of Pipeline Stages
- Coarser-grained machine cycle: 4 machine cycles /
  instruction cycle
- Finer-grained machine cycle: 11 machine cycles /
  instruction cycle

  T_IF,ID = 8 units
  T_OF = 9 units
  T_EX = 5 units
  T_OS = 9 units
  Tcyc = 3 units
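The finer-grained count follows from subdividing each coarse stage into 3-unit machine cycles; this sketch assumes the four coarse stages are IF+ID = 8, OF = 9, EX = 5, OS = 9 units as listed above:

```python
import math

# Each coarse stage is split into machine cycles of Tcyc = 3 units;
# a stage that doesn't divide evenly still occupies whole cycles.
coarse = {"IF+ID": 8, "OF": 9, "EX": 5, "OS": 9}
t_machine = 3
machine_cycles = sum(math.ceil(t / t_machine) for t in coarse.values())
# ceil(8/3) + ceil(9/3) + ceil(5/3) + ceil(9/3) = 3 + 3 + 2 + 3 = 11
```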
40 Hardware Requirements
- Logic needed for each pipeline stage
- Register file ports needed to support all the
stages - Memory accessing ports needed to support all the
stages
41 Pipeline Examples
[Figure: pipeline stage diagrams for the MIPS R2000/R3000
and the 12-stage AMDAHL 470V/7, whose stages run from PC
GEN (1) through Check Result (11) and writeback (12)]
42 Coalescing Resource Requirements
- Procedure
- Analyze the sequence of register transfers
required by each instruction type. - Find commonality across instruction types and
merge them to share the same pipeline stage. - If there exists flexibility, shift or reorder
some register transfers to facilitate further
merging.
43 Unifying Instructions to 6-stage Pipeline
- The 6-stage TYPICAL (TYP) pipeline
44 Physical Organization of 6-stage Pipeline
45 Pipeline Interface to Memory
46 Pipeline Interface to Register File
47 Minimizing Pipeline Stalls
- Dependences lead to pipeline hazards
48 Occurrence of Hazards
49 Penalties Due to RAW Hazards
50 Incorporation of Forwarding Paths in TYP Pipeline
51 Penalties with Forwarding Paths
52 Forwarding Paths for Leading ALU Instruction
53 Pipeline Interlocks for Leading ALU Instruction
54 Forwarding for Leading Load
55 Pipeline Interlocks for ALU, Load
56 Pipeline Interlocks for Branch
57 Historical Trivia
- The first MIPS design did not interlock and stall on a
  load-use data hazard
- Real reason for the name MIPS: Microprocessor
  without Interlocked Pipeline Stages
- Word play on the acronym for Millions of
  Instructions Per Second, also called MIPS
58 Load Delay Slot (MIPS R2000)

        t0   t1   t2   t3   t4   t5  ...
  i:    IF   ID   RD   ALU  MEM  WB
  j:         IF   ID   RD   ALU  MEM  WB
  k:              IF   ID   RD   ALU  MEM  WB

- The effect of a delayed Load is not visible
  to the instructions in its delay slots.
    h: Rk ← --
    i: Rk ← MEM[...]
    j: -- ← Rk
    k: -- ← Rk
  Which (Rk) do we really mean?
59 RISC Pipeline Example
60 Real Pipelined Processor Example MIPS R2000
61 Intel i486 5-Stage CISC Pipeline
62 IBM's Experience on Pipelined Processors
   Agerwala and Cocke 1987
- Attributes and assumptions
  - Memory bandwidth
    - at least one word/cycle to fetch 1
      instruction/cycle from the I-cache
    - 40% of instructions are load/store and require
      access to the D-cache
- Code characteristics (dynamic)
  - loads: 25%
  - stores: 15%
  - ALU/reg-reg: 40%
  - branches: 20%
    - 1/3 unconditional (always taken)
    - 1/3 conditional taken
    - 1/3 conditional not taken
63 More Statistics and Assumptions
- Cache performance
  - a hit ratio of 100% is assumed in the experiments
  - cache latency: I-cache = i cycles, D-cache = d
    cycles; default i = d = 1 cycle
- Load and branch scheduling
  - loads
    - 25% cannot be scheduled
    - 65% can be moved back 2 or more instructions
    - 10% can be moved back 1 instruction (1 delay slot)
  - branches
    - unconditional: 100% schedulable (fill 1 delay
      slot)
    - conditional: 50% schedulable (fill 1 delay
      slot)
64 CPI Calculations I
  Mix: 25% loads, 15% stores, 40% ALU, 20% branches
- No cache bypass of the reg. file, no scheduling of
  loads or branches
  - Load penalty: 2 cycles
  - Branch penalty: 2 cycles
  - Total CPI = 1 + 0.25*2 + 0.2*0.66*2
    = 1 + 0.5 + 0.27 = 1.77 CPI
    (assume branch not taken; penalty only for the 66%
    of branches that are taken)
- Bypass the reg file for loads
  - Load penalty: 1 cycle
  - Total CPI = 1 + 0.25 + 0.27 = 1.52 CPI
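Both CPI figures follow directly from the instruction mix; a quick check:

```python
# Mix from the slide: 25% loads, 20% branches; 66% of branches
# (unconditional + conditional taken) pay the branch penalty.
load_frac, branch_frac, taken_frac = 0.25, 0.20, 0.66

# 2-cycle load penalty without bypass, 1 cycle with bypass;
# branch penalty is 2 cycles in both cases.
cpi_no_bypass = 1 + load_frac * 2 + branch_frac * taken_frac * 2  # ~1.77
cpi_bypass = 1 + load_frac * 1 + branch_frac * taken_frac * 2     # ~1.52
```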
65 CPI Calculations II
- Bypass, plus scheduling of loads and branches
- Load penalty
  - 75% can be moved back 1 or more → no penalty
  - remaining 25% → 1 cycle penalty
  - Load overhead = 0.25 * 0.25 * 1 = 0.0625
- Branch penalty
  - 1/3 uncond.: 100% schedulable → 1 cycle
  - 1/3 cond. not taken: if biased for NT → no
    penalty
  - 1/3 cond. taken
    - 50% schedulable → 1 cycle
    - 50% unschedulable → 2 cycles
  - Branch overhead = 0.2 * (0.33*1 + 0.33*0.5*1
    + 0.33*0.5*2) = 0.167
- Total CPI = 1 + 0.063 + 0.167 = 1.23 CPI
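The overhead terms can be checked with the same fractions:

```python
# Load overhead: 25% of loads (25% of instructions) stall 1 cycle.
load_overhead = 0.25 * 0.25 * 1  # 0.0625

# Branch overhead: 20% branches, split 1/3 each way.
branch_overhead = 0.2 * (0.33 * 1            # uncond., 1 slot filled
                         + 0.33 * 0.5 * 1    # cond. taken, schedulable
                         + 0.33 * 0.5 * 2)   # cond. taken, unschedulable
cpi = 1 + load_overhead + branch_overhead    # ~1.23
```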
66 CPI Calculations III
- Parallel target address generation to reduce the
  branch penalty from 2 → 1 cycle
  - 90% of branches can be coded as PC-relative
    (instead of register indirect)
  - i.e. the target address can be computed without a
    register access
  - a separate adder can compute (PC + offset) during
    the reg read stage
- Branch penalty (conditional and unconditional)
  - Uncond: 0.2 * 0.33 * 0.1 * 1 = 0.0066 CPI
  - Cond: 0.2 * 0.66 * (0.9*0.5*1 + 0.1*0.5*1
    + 0.1*0.5*2) = 0.079
- Total CPI = 1 + 0.063 + 0.087 = 1.15 CPI = 0.87
  IPC
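The final figure can be verified numerically, keeping the load overhead (0.063) from the previous slide:

```python
# Branch overheads with parallel (PC + offset) target generation:
# 90% of branches are PC-relative and resolve one cycle earlier.
uncond = 0.2 * 0.33 * 0.1 * 1                # ~0.0066 CPI
cond = 0.2 * 0.66 * (0.9 * 0.5 * 1           # PC-relative, 1 cycle
                     + 0.1 * 0.5 * 1         # reg-indirect, schedulable
                     + 0.1 * 0.5 * 2)        # reg-indirect, unschedulable
cpi = 1 + 0.063 + uncond + cond              # ~1.15 CPI
ipc = 1 / cpi                                # ~0.87 IPC
```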
67 Deeply Pipelined Processors
68 Deeply Pipelined Processors
69 Problem Set 2