History of Pipelining - PowerPoint PPT Presentation

1
History of Pipelining
  • Pipelining introduced in the IBM 7030 (Stretch computer)
  • CDC 6600 used pipelining and multiple functional
    units
  • RISC processors in the 1980s were pipelined and were
    efforts to reach an IPC of 1
  • The Intel i486 was the first pipelined CISC processor
  • Pipelined VAX from Digital
  • Pipelined Motorola 68000
  • Current trend: deep pipelines

2
Pipeline Illustrated
3
Pipeline Partitioning
Divide functionality into k stages: do we get a k-fold
speedup? Limiting factors: latch overhead, clock skew,
and the need for uniform sub-computations.
4
Earle Latch
5
Pipeline Partitioning
6
Pipeline Partitioning
k_opt = sqrt(G·T / (L·S)), where
G = cost of the non-pipelined design
L = cost of each latch
k = number of stages
T = latency of the non-pipelined design
S = latency increase due to each latch (i.e. T/k + S
is the new clock period)
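As a sketch, the cost model above can be evaluated directly. The function name and the example parameter values below are illustrative assumptions, not taken from the slides:

```python
from math import sqrt

def optimal_stages(G, T, L, S):
    """Cost/performance-optimal stage count: k_opt = sqrt(G*T / (L*S)).

    G: cost of the non-pipelined design, T: its latency,
    L: cost of each latch, S: latency added per latch.
    """
    return sqrt((G * T) / (L * S))

# Illustrative values (assumed): G = 175 chips, T = 400 ns,
# L = 10 cost units per latch, S = 22 ns latch overhead per stage.
k = optimal_stages(G=175, T=400, L=10, S=22)  # roughly 18 stages
```

Note how k_opt grows with the design's size and latency (G·T) but shrinks as latches get more expensive or slower (L·S).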
7
Non-Pipelined FP Multiplier

8
Pipelined FP Multiplier
Non-pipelined chip count: 175
Non-pipelined delay: 400 ns
Non-pipelined clock: 2.5 MHz
Assume latching delay = 17 ns, setup time = 5 ns
Max stage delay: 150 ns
Minimum clock period: 150 + 17 + 5 = 172 ns
Pipelined clock: about 5.8 MHz
Latency for each multiply: 3 stages × 172 ns = 516 ns
(instead of 400 ns)
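The slide's timing can be checked with a few lines of arithmetic; the 3-stage count is inferred from 516 ns / 172 ns and is an assumption:

```python
# Pipelined FP multiplier timing, using the slide's numbers.
max_stage_delay = 150   # ns, slowest sub-computation
latch_delay = 17        # ns, Earle latch propagation
setup_time = 5          # ns
stages = 3              # assumed: 516 ns total / 172 ns per stage

clk_period = max_stage_delay + latch_delay + setup_time   # 172 ns
clk_freq_mhz = 1000 / clk_period                          # about 5.8 MHz
latency = stages * clk_period                             # 516 ns vs. 400 ns
```

The pipelined design trades 116 ns of extra latency per multiply for more than double the throughput (5.8 MHz vs. 2.5 MHz).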
9
Pipelined FP Multiplier
10
Pipeline Partitioning
11
CPU Example
  • Suppose 2 ns for a memory access, 2 ns for an ALU
    operation, and 1 ns for a register file read or
    write; compute the instruction rate
  • Non-pipelined execution:
  • lw: IF + Read Reg + ALU + Memory + Write Reg =
    2 + 1 + 2 + 2 + 1 = 8 ns
  • add: IF + Read Reg + ALU + Write Reg = 2 + 1 + 2 +
    1 = 6 ns
  • Pipelined execution:
  • max(IF, Read Reg, ALU, Memory, Write Reg) = 2 ns
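A minimal sketch of this comparison (stage names and the dictionary layout are illustrative):

```python
# Per-stage latencies from the example, in nanoseconds.
stage_ns = {"IF": 2, "RegRead": 1, "ALU": 2, "Mem": 2, "RegWrite": 1}

# Non-pipelined: each instruction pays for every stage it uses.
lw_ns = sum(stage_ns[s] for s in ["IF", "RegRead", "ALU", "Mem", "RegWrite"])
add_ns = sum(stage_ns[s] for s in ["IF", "RegRead", "ALU", "RegWrite"])

# Pipelined: one instruction completes per cycle, paced by the slowest stage.
pipelined_ns = max(stage_ns.values())
```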

12
Problems for Computers
  • Limits to pipelining: hazards prevent the next
    instruction from executing during its designated
    clock cycle
  • Data hazards: an instruction depends on the result
    of a prior instruction still in the pipeline
    (missing sock)
  • Structural hazards: two instructions need the same
    resource at the same time
  • Control hazards: branches and other instructions
    that change the PC; stalling until the hazard
    resolves puts bubbles in the pipeline

13
Structural Hazard 1 Single Memory (1/2)
Problem: reading the same memory twice in the same clock cycle
14
Structural Hazard 1 Single Memory (2/2)
  • Solution
  • have both an L1 Instruction Cache and an L1 Data
    Cache
  • need more complex hardware to control when both
    caches miss

15
Structural Hazard 2 Registers (1/2)
Can't read and write the registers simultaneously
16
Structural Hazard 2 Registers (2/2)
  • Fact: register access is VERY fast; it takes less
    than half the time of the ALU stage
  • Solution: introduce a convention:
  • always write to registers during the first half of
    each clock cycle
  • always read from registers during the second half of
    each clock cycle
  • Result: a read and a write can occur in the same
    clock cycle

17
Things to Remember
  • Optimal pipeline:
  • Each stage executes part of an instruction
    each clock cycle.
  • One instruction finishes during each clock cycle.
  • On average, instructions execute far more quickly.
  • What makes this work?
  • Similarities between instructions allow us to use the
    same stages for all instructions (generally).
  • Each stage takes about the same amount of time as
    all the others: little wasted time.

18
MIPS ISA Handout
19
Data Hazards (1/2)
  • Consider the following sequence of instructions

20
Data Hazards (2/2)
Dependencies backwards in time are hazards
21
Data Hazard Solution Forwarding
  • Forward result from one stage to another

Both the sub and the and could use forwarding here
22
Data Hazard Loads (1/4)
  • Dependencies backwards in time are hazards
  • Can't solve this one with forwarding alone
  • Must stall the instruction dependent on the load,
    then forward (more hardware)

23
Data Hazard Loads (2/4)
  • Hardware must stall pipeline
  • Called interlock

24
Data Hazard Loads (3/4)
  • A stall is equivalent to a nop

lw   $t0, 0($t1)
nop
sub  $t3, $t0, $t2
and  $t5, $t0, $t4
25
Data Hazard Loads (4/4)
  • The instruction slot after a load is called the load
    delay slot
  • If that instruction uses the result of the load,
    the hardware interlock will stall it for one
    cycle.
  • If the compiler puts an unrelated instruction in
    that slot, there is no stall
  • Letting the hardware stall the instruction in the
    delay slot is equivalent to putting a nop in the
    slot (except the latter uses more code space)
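The load-use interlock rule above can be sketched as a simple hazard check. This toy model (the function name and the instruction-tuple encoding) is an illustrative assumption; it only captures the one-slot rule, not full forwarding logic:

```python
# Minimal sketch: count the one-cycle interlock stalls a MIPS-style
# pipeline would insert for load-use hazards.
def count_load_use_stalls(instrs):
    """instrs: list of (op, dest_reg, source_regs).

    One stall whenever the instruction immediately after a load
    reads the load's destination register; anything one slot
    later can be handled by forwarding alone."""
    stalls = 0
    for prev, cur in zip(instrs, instrs[1:]):
        op, dest, _ = prev
        if op == "lw" and dest in cur[2]:
            stalls += 1
    return stalls

prog = [
    ("lw",  "$t0", ["$t1"]),          # load into $t0
    ("sub", "$t3", ["$t0", "$t2"]),   # uses $t0 immediately: 1 stall
    ("and", "$t5", ["$t0", "$t4"]),   # one slot later: forwarding suffices
]
```

Putting an unrelated instruction (or a nop) in the delay slot makes the count drop to zero, matching the slide's point.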

26
Control Hazard Branching (1/5)
Where do we do the compare for the branch?
27
Control Hazard Branching (2/5)
  • We put the branch decision-making hardware in the
    ALU stage
  • therefore two more instructions after the branch
    will always be fetched, whether or not the branch
    is taken
  • Desired functionality of a branch:
  • if we do not take the branch, don't waste any
    time and continue executing normally
  • if we take the branch, don't execute any
    instructions after the branch; just go to the
    desired label

28
Control Hazard Branching (3/5)
  • Initial solution: stall until the decision is made
  • insert no-op instructions: those that
    accomplish nothing, just take time
  • Drawback: branches take 3 clock cycles each
    (assuming the comparator is in the ALU stage)

29
Control Hazard Branching (4/5)
  • Optimization 1:
  • move the comparator up to Stage 2 (asynchronous)
  • as soon as the instruction is decoded (the opcode
    identifies it as a branch), immediately make a
    decision and set the value of the PC (if
    necessary)
  • Benefit: since the branch completes in Stage 2,
    only one unnecessary instruction is fetched, so
    only one no-op is needed
  • Side note: this means that branches are idle in
    Stages 3, 4 and 5.

30
Control Hazard Branching (5/5)
  • Insert a single no-op (bubble)
  • Impact: 2 clock cycles per branch instruction →
    slow

31
Quiz
  • Assume 1 instruction/clock, delayed branch, a
    5-stage pipeline, forwarding, and an interlock on
    unresolved load hazards (after 10³ loops, so the
    pipeline is full)

Loop: lw    $t0, 0($s1)
      addu  $t0, $t0, $s2
      sw    $t0, 0($s1)
      addiu $s1, $s1, -4
      bne   $s1, $zero, Loop
      nop

  • How many pipeline stages (clock cycles) per loop
    iteration does it take to execute this code?

(Answer choices: 1 2 3 4 5 6 7 8 9 10)
32
Quiz Answer
  • Assume 1 instruction/clock, delayed branch, a
    5-stage pipeline, forwarding, and an interlock on
    unresolved load hazards. 10³ iterations, so the
    pipeline is full.

Loop: lw    $t0, 0($s1)       1 cycle
      addu  $t0, $t0, $s2     2 cycles (load-use data hazard, so stall)
      sw    $t0, 0($s1)       1 cycle
      addiu $s1, $s1, -4      1 cycle
      bne   $s1, $zero, Loop  1 cycle
      nop                     1 cycle (branch delay slot)

  • Answer: 7 clock cycles per loop iteration
    (6 instructions + 1 load-use stall).
33
Pipelining Idealism
  • Uniform suboperations:
  • The operation to be pipelined can be evenly
    partitioned into uniform-latency suboperations
  • Repetition of identical operations:
  • The same operations are performed
    repeatedly on a large number of different inputs
  • Repetition of independent operations:
  • All the repetitions of the same operation are
    mutually independent, i.e. no data dependences
    and no resource conflicts
  • Good examples: an automobile assembly line,
  • a floating-point multiplier,
  • an instruction pipeline???

34
Instruction Pipeline Design
  • Uniform suboperations ...
  • ⇒ balance pipeline stages
  • - stage quantization to yield balanced stages
  • - minimize internal fragmentation (some waiting
    stages)
  • Identical operations ...
  • ⇒ unify instruction types
  • - coalesce instruction types into one
    multi-function pipe
  • - minimize external fragmentation (some idling
    stages)
  • Independent operations ...
  • ⇒ resolve data and resource hazards
  • - inter-instruction dependency detection and
    resolution
  • - minimize performance loss

35
The Generic Instruction Cycle
  • The computation to be pipelined
  • Instruction Fetch (IF)
  • Instruction Decode (ID)
  • Operand(s) Fetch (OF)
  • Instruction Execution (EX)
  • Operand Store (OS)
  • Update Program Counter (PC)

36
The GENERIC Instruction Pipeline (GNR)
  • Based on Obvious Subcomputations

37
Balancing Pipeline Stages
  • Without pipelining:
  • Tcyc ≥ TIF + TID + TOF + TEX + TOS = 31 units
  • Pipelined:
  • Tcyc ≥ max{TIF, TID, TOF, TEX, TOS} = 9 units
  • Speedup = 31 / 9 ≈ 3.4
  • Can we do better in terms of either performance
    or efficiency?

TIF = 6 units
TID = 2 units
TOF = 9 units
TEX = 5 units
TOS = 9 units
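The speedup arithmetic can be reproduced directly (stage names as on the slide):

```python
# Stage latencies from the slide, in arbitrary units.
stage_units = {"IF": 6, "ID": 2, "OF": 9, "EX": 5, "OS": 9}

t_nonpipelined = sum(stage_units.values())   # 31 units: one long cycle
t_pipelined = max(stage_units.values())      # 9 units: paced by slowest stage
speedup = t_nonpipelined / t_pipelined       # 31/9, well short of the ideal 5x
```

The unbalanced stages (9 vs. 2 units) are exactly why the speedup falls short of the stage count; the next slides address that by quantization.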
38
Balancing Pipeline Stages
  • Two methods for stage quantization:
  • Merging multiple subcomputations into one.
  • Subdividing a subcomputation into multiple
    subcomputations.
  • Current trends:
  • Deeper pipelines (more and more stages).
  • Multiplicity of different (sub)pipelines.
  • Pipelining of memory access (tricky).

39
Granularity of Pipeline Stages
Coarser-grained machine cycle: 4 machine cycles /
instruction cycle
Finer-grained machine cycle: 11 machine cycles /
instruction cycle
TIF+ID = 8 units
TOF = 9 units
TEX = 5 units
TOS = 9 units
Tcyc = 3 units
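A sketch of both quantization methods, assuming each finer-grained stage must fit into a 3-unit machine cycle:

```python
from math import ceil

# Sub-computation latencies from the earlier slide, in units.
units = {"IF": 6, "ID": 2, "OF": 9, "EX": 5, "OS": 9}

# Coarser-grained: merge IF and ID into one 8-unit stage.
coarse = {"IF+ID": units["IF"] + units["ID"],
          "OF": units["OF"], "EX": units["EX"], "OS": units["OS"]}
coarse_cycles = len(coarse)        # 4 machine cycles per instruction

# Finer-grained: subdivide each sub-computation into 3-unit cycles.
t_cyc = 3
fine_cycles = sum(ceil(t / t_cyc) for t in units.values())  # 2+1+3+2+3
```

Merging trades stage count for a longer cycle; subdividing does the opposite, at the cost of more latch overhead per instruction.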
40
Hardware Requirements
  • Logic needed for each pipeline stage
  • Register file ports needed to support all the
    stages
  • Memory accessing ports needed to support all the
    stages

41
Pipeline Examples
MIPS R2000/R3000
AMDAHL 470V/7
(Figure: the 5-stage MIPS R2000/R3000 pipeline shown
beside the 12-stage AMDAHL 470V/7 pipeline, whose
stages include PC GEN and Check Result.)
42
Coalescing Resource Requirements
  • Procedure
  • Analyze the sequence of register transfers
    required by each instruction type.
  • Find commonality across instruction types and
    merge them to share the same pipeline stage.
  • If there is flexibility, shift or reorder
    some register transfers to facilitate further
    merging.

43
Unifying instructions to 6-stage pipeline
  • The 6-stage TYPICAL (TYP) pipeline

44
Physical Organization of 6-stage pipeline
45
Pipeline Interface to Memory

46
Pipeline Interface to Register File

47
Minimizing Pipeline Stalls
  • Dependences lead to Pipeline Hazards

48
Occurrence of Hazards
  • Necessary Conditions

49
Penalties Due to RAW hazards
50
Incorporation of Forwarding Paths in TYP pipeline
51
Penalties with Forwarding Paths
52
Forwarding Paths for leading ALU instruction

53
Pipeline Interlocks for leading ALU instruction

54
Forwarding for leading Load

55
Pipeline Interlocks for ALU, Load

56
Pipeline Interlocks for Branch

57
Historical Trivia
  • The first MIPS design did not interlock and stall
    on the load-use data hazard
  • Real reason for the name MIPS: Microprocessor
    without Interlocked Pipeline Stages
  • A word play on the acronym for Millions of
    Instructions Per Second, also called MIPS

58
Load Delay Slot (MIPS R2000)
      t0   t1   t2   t3   t4   t5   ...
i     IF   ID   RD   ALU  MEM  WB
j          IF   ID   RD   ALU  MEM  WB
k               IF   ID   RD   ALU  MEM  WB

  • The effect of a delayed load is not visible
    to the instructions in its delay slots.

h: Rk ← (--)
i: Rk ← MEM[--]
j: (--) ← Rk
k: (--) ← Rk

Which (Rk) do we really mean?
59
RISC Pipeline Example
60
Real Pipelined Processor Example MIPS R2000
61
Intel i486 5-Stage CISC Pipeline
62
IBMs Experience on Pipelined Processors
Agerwala and Cocke 1987
  • Attributes and assumptions:
  • Memory bandwidth:
  • at least one word/cycle to fetch 1
    instruction/cycle from the I-cache
  • 40% of instructions are load/store and require
    access to the D-cache
  • Code characteristics (dynamic):
  • loads: 25%
  • stores: 15%
  • ALU/reg-reg: 40%
  • branches: 20%, of which 1/3 unconditional (always
    taken), 1/3 conditional taken, and
    1/3 conditional not taken

63
More Statistics and Assumptions
  • Cache performance:
  • a hit ratio of 100% is assumed in the experiments
  • cache latency: I-cache = i cycles, D-cache = d
    cycles; default i = d = 1 cycle
  • Load and branch scheduling:
  • loads:
  • 25% cannot be scheduled
  • 65% can be moved back 2 instructions; 10% can be
    moved back 1 instruction (1 delay slot)
  • branches:
  • unconditional: 100% schedulable (fill 1 delay
    slot)
  • conditional: 50% schedulable (fill 1 delay
    slot)

64
CPI Calculations I
25% loads, 15% stores, 40% ALU, 20% branches
  • No cache bypass of the register file, no scheduling
    of loads or branches:
  • Load penalty: 2 cycles
  • Branch penalty: 2 cycles
  • Total CPI = 1 + 0.25×2 + 0.2×0.66×2 = 1 + 0.5 +
    0.27 = 1.77 CPI (assume branches not taken, so the
    penalty applies only to the 66% that are taken)
  • Bypass the register file for loads:
  • Load penalty: 1 cycle
  • Total CPI = 1 + 0.25 + 0.27 = 1.52 CPI
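The CPI arithmetic can be reproduced in a few lines (variable names are illustrative):

```python
# Instruction mix and penalties from the slide.
f_load, f_branch = 0.25, 0.20
frac_taken = 2 / 3   # 1/3 unconditional + 1/3 conditional taken

# No bypass, no scheduling: 2-cycle load and taken-branch penalties.
cpi_naive = 1 + f_load * 2 + f_branch * frac_taken * 2    # about 1.77

# Bypass the register file for loads: load penalty drops to 1 cycle.
cpi_bypass = 1 + f_load * 1 + f_branch * frac_taken * 2   # about 1.52
```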

65
CPI Calculations II
  • Bypass, plus scheduling of loads and branches:
  • Load penalty:
  • 75% can be moved back 1 instruction → no penalty
  • remaining 25% → 1 cycle penalty
  • Load overhead = 0.25 × 0.25 × 1 = 0.0625
  • Branch penalty:
  • 1/3 unconditional, 100% schedulable → 1 cycle
  • 1/3 conditional not taken: if biased for NT → no
    penalty
  • 1/3 conditional taken:
  • 50% schedulable → 1 cycle
  • 50% unschedulable → 2 cycles
  • Branch overhead = 0.2 × (0.33×1 + 0.33×0.5×1 +
    0.33×0.5×2) = 0.167
  • Total CPI = 1 + 0.063 + 0.167 = 1.23 CPI

66
CPI Calculations III
  • Parallel target address generation to reduce the
    branch penalty from 2 → 1 cycle:
  • 90% of branches can be coded as PC-relative
    (instead of register indirect)
  • i.e. the target address can be computed without a
    register access
  • A separate adder can compute PC + offset during the
    register-read stage
  • Branch penalty (conditional and unconditional):
  • Uncond: 0.2 × 0.33 × 0.1 × 1 = 0.0066 CPI
  • Cond: 0.2 × 0.66 × (0.9×0.5×1 + 0.1×0.5×1 +
    0.1×0.5×2) = 0.079
  • Total CPI = 1 + 0.063 + 0.087 = 1.15 CPI → 0.87
    IPC
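The calculations on slides 65 and 66 can be checked end to end (variable names are illustrative; the fractions are written exactly rather than rounded to 0.33/0.66):

```python
# Slide 65: bypass plus scheduling of loads and branches.
f_load, f_branch = 0.25, 0.20

load_ovh = f_load * 0.25 * 1                 # 25% of loads unschedulable

branch_ovh = f_branch * (1/3 * 1             # uncond: delay slot filled
                         + 1/3 * 0           # cond not taken: no penalty
                         + 1/3 * (0.5 * 1 + 0.5 * 2))  # cond taken
cpi_sched = 1 + load_ovh + branch_ovh        # about 1.23

# Slide 66: separate PC+offset adder; 90% of branches are PC-relative.
uncond = f_branch * (1/3) * (0.9 * 0 + 0.1 * 1)
cond = f_branch * (2/3) * (0.9 * 0.5 * 1 + 0.1 * 0.5 * 1 + 0.1 * 0.5 * 2)
cpi_final = 1 + load_ovh + uncond + cond     # about 1.15, i.e. about 0.87 IPC
```

Each refinement attacks the largest remaining overhead term, which is why the branch contribution shrinks from 0.27 to 0.167 to 0.087 CPI across the three slides.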

67
Deeply Pipelined Processors
68
Deeply Pipelined Processors
69
Problem Set 2
  • 2.4, 2.8, 2.18, 2.21