Pipelining - PowerPoint PPT Presentation

Slides: 46
Provided by: Rand222

Transcript and Presenter's Notes

1
Pipelining
  • By Pradondet Nilagupta
  • Based on lecture notes on
  • Advanced Computer Architecture
  • Prof. Mike Schulte
  • Prof. Yirng-An Chen

2
Introduction to Pipelining
  • Pipelining: an implementation technique that
    overlaps the execution of multiple instructions.
  • Laundry example:
  • Ann, Brian, Cathy, and Dave each have one load of
    clothes to wash, dry, and fold
  • Washer takes 30 minutes
  • Dryer takes 40 minutes
  • Folder takes 20 minutes

3
Sequential Laundry
[Figure: timeline from 6 PM to midnight. In task order, each load takes 30 + 40 + 20 minutes, and the four loads run back to back.]
  • Sequential laundry takes 6 hours for 4 loads
  • If they learned pipelining, how long would
    laundry take?

4
Pipelined Laundry: Start Work ASAP
[Figure: timeline from 6 PM to midnight with the four loads overlapped in task order.]
  • Pipelined laundry takes 3.5 hours for 4 loads
  • Speedup = 6 / 3.5 ≈ 1.7
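
The 6-hour and 3.5-hour figures can be reproduced with a small model. This is a minimal sketch, assuming a load may wait between machines for as long as needed (for example in a basket); the function name `pipeline_finish_time` is made up for this illustration.

```python
def pipeline_finish_time(stage_minutes, loads):
    """Minutes until the last load leaves the last stage.

    A load enters a stage only once it has left the previous stage
    and the stage has finished the previous load.
    """
    stage_free_at = [0] * len(stage_minutes)  # when each machine is next free
    done = 0
    for _ in range(loads):
        done = 0  # every load is ready to start washing at time 0
        for s, minutes in enumerate(stage_minutes):
            start = max(done, stage_free_at[s])
            done = start + minutes
            stage_free_at[s] = done
    return done


stages = [30, 40, 20]  # washer, dryer, folder (minutes, from the slides)
loads = 4

sequential_minutes = loads * sum(stages)
pipelined_minutes = pipeline_finish_time(stages, loads)
speedup = sequential_minutes / pipelined_minutes

print(sequential_minutes / 60, pipelined_minutes / 60, round(speedup, 1))
```

The pipelined time works out to 30 + 4 × 40 + 20 minutes: the 40-minute dryer, the slowest stage, sets the pace once the pipeline is full.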

5
Pipelining Lessons
  • Pipelining doesn't help the latency of a single task;
    it helps the throughput of the entire workload
  • Pipeline rate is limited by the slowest pipeline stage
  • Multiple tasks operate simultaneously
  • Potential speedup = number of pipe stages
  • Unbalanced lengths of pipe stages reduce speedup
  • Time to fill the pipeline and time to drain it
    reduce speedup

6
Computer Pipelines
  • Execute billions of instructions, so throughput
    is what matters
  • Desirable RISC features: all instructions the same
    length, registers located in the same place in the
    instruction format, memory operands only in loads
    or stores

7
Pipelining Basics
Unpipelined System
Delay = 33 ns, throughput = 30 MHz
[Figure: timing diagram of Op1, Op2, Op3, ... executing one after another.]
  • One operation must complete before the next can begin
  • Operations spaced 33 ns apart

8
3 Stage Pipelining
Delay = 39 ns, throughput = 77 MHz
[Figure: timing diagram of Op1 to Op4 overlapping in a three-stage pipeline.]
  • Space operations 13 ns apart
  • 3 operations occur simultaneously

9
Limitation: Nonuniform Pipelining
Delay = 18 × 3 = 54 ns, throughput = 55 MHz
  • Throughput limited by slowest stage
  • Delay determined by clock period × number of
    stages
  • Must attempt to balance stages
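
The effect of unbalanced stages can be checked numerically. This is a minimal sketch: only the 18 ns slowest stage and the 13 ns balanced stage come from the slides; the other stage delays (8 ns and 13 ns) and the function name `pipeline_timing` are assumptions chosen for illustration.

```python
def pipeline_timing(stage_delays_ns):
    """Return (clock period in ns, latency in ns, throughput in MHz).

    The clock must be slow enough for the slowest stage, and every
    operation spends one clock period in each stage.
    """
    clock_ns = max(stage_delays_ns)
    latency_ns = clock_ns * len(stage_delays_ns)
    throughput_mhz = 1000.0 / clock_ns  # 1 / ns = 1000 MHz
    return clock_ns, latency_ns, throughput_mhz


balanced = pipeline_timing([13, 13, 13])    # three equal 13 ns stages
unbalanced = pipeline_timing([8, 18, 13])   # hypothetical split, same total delay

print(balanced)
print(unbalanced)
```

Both pipelines contain the same total work (39 ns), but the unbalanced one has to run at the 18 ns clock, so its latency rises to 54 ns and its throughput drops to about 55 MHz.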

10
Limitation: Deep Pipelines
Delay = 48 ns, throughput = 128 MHz
  • Diminishing returns as more pipeline stages are added
  • Register delays become the limiting factor
  • Increased latency
  • Small throughput gains

11
Limitation: Sequential Dependencies
[Figure: three blocks of combinational logic separated by registers (REG) on a common clock, with a timing diagram of Op1 to Op4.]
  • Op4 gets its result from Op1!
  • Pipeline hazard

12
Speed Up Equation for Pipelining
  • Assumptions:
  • No delays except component latencies
  • A fixed pipeline overhead of 2 ns
  • What is the cycle time for the pipeline version
    of the circuit that maximizes performance without
    allocating multiple cycles to a stage?
  • What is the total execution time for the pipeline
    version?
  • What is the speedup versus a single-cycle
    unpipelined version?

13
Multiple-Cycle DLX: Cycles 1 and 2
  • Most DLX instructions can be implemented in 5
    clock cycles (see Figure 3.1 on page 130).
  • The first two clock cycles are the same for every
    instruction.
  • 1. Instruction fetch cycle (IF)
  • IR ← Mem[PC] (load instruction)
  • NPC ← PC + 4 (update program counter)
  • 2. Instruction decode / register fetch cycle (ID)
  • A ← Regs[IR[6..10]] (fetch source reg1)
  • B ← Regs[IR[11..15]] (fetch source reg2)
  • Imm ← ((IR[16])^16 ## IR[16..31]) (fetch and sign-extend
    imm.)

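
The register-field and sign-extension steps of the ID cycle can be sketched in a few lines. This sketch assumes the usual DLX bit numbering, in which bit 0 is the most significant bit of the 32-bit instruction word; the helper names and the example opcode value are made up for this illustration.

```python
def field(ir, first, last):
    """Bits first..last of a 32-bit word, counting bit 0 as the most significant."""
    width = last - first + 1
    return (ir >> (31 - last)) & ((1 << width) - 1)


def sign_extend_16(value):
    """Interpret a 16-bit field as a signed two's-complement integer."""
    return value - 0x10000 if value & 0x8000 else value


def decode(ir):
    """Extract the fields read during the ID cycle."""
    return {
        "rs1": field(ir, 6, 10),                   # index used for A
        "rs2": field(ir, 11, 15),                  # index used for B
        "imm": sign_extend_16(field(ir, 16, 31)),  # Imm
    }


# Hypothetical instruction word: opcode bits 0x23, rs1 = 2, rs2 = 1, immediate = -4
ir = (0x23 << 26) | (2 << 21) | (1 << 16) | 0xFFFC
fields = decode(ir)
print(fields)
```

Replicating bit 16 sixteen times and concatenating it with bits 16..31, as the slide writes it, has the same effect as `sign_extend_16` here.
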
14
Multiple-Cycle DLX: Cycle 3
  • The third cycle is known as the
  • Execution / effective address cycle (EX)
  • The actions performed in this cycle depend on the
    type of operation.
  • Memory reference (e.g., LW R1, 30(R2))
  • ALUOutput ← A + Imm (calculate effective
    address)
  • Register-register ALU op. (e.g., ADD R1, R2, R3)
  • ALUOutput ← A op B (perform ALU operation)
  • Register-immed. ALU op. (e.g., ADDI R1, R2, 3)
  • ALUOutput ← A op Imm (perform ALU operation)
  • Branch (e.g., BEQZ R4, next)
  • ALUOutput ← NPC + Imm (compute branch target)
  • Cond ← (A == 0) (compare A to 0)

15
Multiple-Cycle DLX: Cycle 4
  • The fourth cycle is known as the
  • Memory access / branch completion cycle (MEM)
  • The only DLX instructions active in this cycle
    are loads, stores, and branches
  • Loads (e.g., LW R1, 30(R2))
  • LMD ← Mem[ALUOutput] (load memory data into the
    processor)
  • Stores (e.g., SW 500(R4), R3)
  • Mem[ALUOutput] ← B (store data into memory)
  • Branch (e.g., BEQZ R4, next)
  • if (Cond) PC ← ALUOutput (set PC based on
    Cond)
  • else PC ← NPC

16
Multiple-Cycle DLX: Cycle 5
  • The fifth cycle is known as the
  • Write-back cycle (WB)
  • During this cycle, results are written to the
    register file
  • Register-register ALU op. (e.g., ADD R1, R2, R3)
  • Regs[IR[16..20]] ← ALUOutput
  • Register-immed. ALU op. (e.g., ADDI R1, R2, 3)
  • Regs[IR[11..15]] ← ALUOutput
  • Load instruction (e.g., LW R1, 30(R2))
  • Regs[IR[11..15]] ← LMD

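
The five cycles can be tied together in a toy interpreter. This is only a sketch under simplifying assumptions: instructions are already-decoded tuples `(op, dest, src1, src2, imm)` rather than 32-bit words, memory is a dictionary keyed by address, only five opcodes are modelled, and the 4-versus-5 cycle counts follow the multiple-cycle DLX described in these slides. The names `step`, `program`, and so on are invented for the example.

```python
def step(pc, regs, mem, program):
    """Run one instruction through IF, ID, EX, MEM, WB; return (new_pc, cycles)."""
    # IF: IR <- Mem[PC]; NPC <- PC + 4
    op, dest, src1, src2, imm = program[pc // 4]
    npc = pc + 4
    # ID: A <- Regs[src1]; B <- Regs[src2]
    a = regs[src1]
    b = regs[src2] if src2 is not None else 0
    # EX: effective address, ALU result, or branch target and condition
    cond = False
    if op in ("LW", "SW", "ADDI"):
        alu_output = a + imm
    elif op == "ADD":
        alu_output = a + b
    elif op == "BEQZ":
        alu_output = npc + imm
        cond = (a == 0)
    else:
        raise ValueError(f"unsupported opcode {op}")
    # MEM: only loads, stores, and branches are active
    lmd = None
    if op == "LW":
        lmd = mem[alu_output]
    elif op == "SW":
        mem[alu_output] = b
        return npc, 4                                # stores finish in 4 cycles
    elif op == "BEQZ":
        return (alu_output if cond else npc), 4      # branches finish in 4 cycles
    # WB: write the result to the register file
    regs[dest] = lmd if op == "LW" else alu_output
    return npc, 5


regs = [0] * 8
regs[2] = 100
mem = {130: 7}
program = [
    ("LW", 1, 2, None, 30),      # LW   R1, 30(R2)
    ("ADDI", 3, 1, None, 3),     # ADDI R3, R1, 3
    ("SW", None, 2, 3, 40),      # SW   40(R2), R3
    ("BEQZ", None, 0, None, 8),  # BEQZ R0, +8 (taken, because R0 holds 0)
]

pc, total_cycles = 0, 0
while pc // 4 < len(program):
    pc, cycles = step(pc, regs, mem, program)
    total_cycles += cycles

print(regs[1], regs[3], mem[140], pc, total_cycles)
```

The four instructions take 5 + 5 + 4 + 4 = 18 cycles in this model, a CPI of 4.5 for this particular mix.
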
17
5 Steps of DLX Datapath (Figure 3.1)

18
CPI for the Multiple-Cycle DLX
  • The multiple-cycle DLX requires 4 cycles for
    branches and stores and 5 cycles for the other
    operations.
  • Assuming 20% of the instructions are branches or
    stores, this gives a CPI of 0.2 × 4 + 0.8 × 5 = 4.80.
  • We could improve the CPI by allowing ALU
    operations to complete during the memory cycle.
  • Assuming 40% of the instructions are ALU
    operations, this would reduce the CPI to
    0.2 × 4 + 0.4 × 4 + 0.4 × 5 = 4.40.
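
Both CPI figures are weighted averages over the instruction mix. A minimal sketch, assuming the 20% share covers the 4-cycle instructions (branches and stores) and that the mix fractions sum to 1; `average_cpi` is a made-up helper name.

```python
def average_cpi(mix):
    """mix: list of (fraction_of_instructions, cycles) pairs."""
    total = sum(fraction for fraction, _ in mix)
    assert abs(total - 1.0) < 1e-9, "fractions must sum to 1"
    return sum(fraction * cycles for fraction, cycles in mix)


# 20% 4-cycle instructions, everything else 5 cycles
base_cpi = average_cpi([(0.2, 4), (0.8, 5)])

# ALU operations (40%) now also finish in 4 cycles
improved_cpi = average_cpi([(0.2, 4), (0.4, 4), (0.4, 5)])

print(round(base_cpi, 2), round(improved_cpi, 2))
```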

19
Pipelining DLX
  • To reduce the CPI, DLX can be implemented using a
    five stage pipeline.
  • In this example, it takes 10 cycles to execute 5
    instructions, for a CPI of 2.

20
Visualizing Pipelining (Figure 3.3, page 133)
[Figure: instructions in program order (vertical axis) against time in clock cycles (horizontal axis).]

21
Pipelined DLX Datapath (Figure 3.4, page 134)
[Figure: datapath divided into Instruction Fetch, Instr. Decode / Reg. Fetch, Execute / Addr. Calc., Memory Access, and Write Back stages, separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers.]
  • Pipeline registers are used to transfer results
    from one pipeline stage to the next.

22
Basic Performance Issues in Pipelining
  • Pipelining increases the CPU instruction
    throughput (the number of instructions completed
    per unit of time), but it does not reduce the
    execution time of an individual instruction.

23
Pipeline Speedup Example
  • Assume the multiple-cycle DLX has a 10-ns clock
    cycle, loads take 5 clock cycles and account for
    40% of the instructions, and all other
    instructions take 4 clock cycles.
  • If pipelining the machine adds 1 ns to the clock
    cycle, how much speedup in instruction execution
    rate do we get from pipelining?
  • MC Ave Instr. Time = Clock cycle × Average CPI
  • = 10 ns × (0.6 × 4 + 0.4 × 5)
  • = 44 ns
  • PL Ave Instr. Time = 10 + 1 = 11 ns
  • Speedup = 44 / 11 = 4
  • This ignores the time needed to fill and empty the
    pipeline and delays due to hazards.
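
The arithmetic, and the effect of the fill time that the example ignores, can be sketched as follows. The five-stage depth is the DLX pipeline from the earlier slides; the function `speedup_for` and the assumption that an ideal pipeline needs `depth + n - 1` cycles for `n` instructions are illustrative additions, and hazards are still ignored.

```python
clock_ns = 10.0
multicycle_cpi = 0.6 * 4 + 0.4 * 5           # average CPI of the multiple-cycle DLX
multicycle_time = clock_ns * multicycle_cpi   # average time per instruction in ns

pipelined_clock_ns = clock_ns + 1.0           # pipelining adds 1 ns of overhead
pipelined_time = pipelined_clock_ns * 1       # ideal pipelined CPI of 1

ideal_speedup = multicycle_time / pipelined_time


def speedup_for(n, depth=5):
    """Speedup for a run of n instructions, including pipeline fill time."""
    unpipelined_ns = n * multicycle_time
    pipelined_ns = (depth + n - 1) * pipelined_clock_ns
    return unpipelined_ns / pipelined_ns


print(round(ideal_speedup, 2))
for n in (5, 100, 1000):
    print(n, round(speedup_for(n), 2))
```

For short instruction sequences the fill time costs a large share of the speedup; for long runs the result approaches the ideal value of 4.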

24
It's Not That Easy for Computers
  • Limits to pipelining: hazards prevent the next
    instruction from executing during its designated
    clock cycle
  • Structural hazards: hardware cannot support this
    combination of instructions because two instructions
    need the same resource.
  • Data hazards: instruction depends on the result of a
    prior instruction still in the pipeline
  • Control hazards: pipelining of branches and other
    instructions that change the PC
  • The common solution is to stall the pipeline until
    the hazard is resolved, inserting one or more
    bubbles in the pipeline

25
Speed Up Equations for Pipelining
  • Stalls reduce the speedup obtained from
    pipelining
  • Speedup from pipelining
    = Ave Instr Time unpipelined / Ave Instr Time pipelined
    = (CPI unpipelined × Clock Cycle unpipelined) /
      (CPI pipelined × Clock Cycle pipelined)
  • CPI pipelined = Ideal CPI + Pipeline stall CPI
    = 1 + Pipeline stall CPI
  • Speedup = [CPI unpipelined / (1 + Pipeline stall CPI)]
    × [Clock Cycle unpipelined / Clock Cycle pipelined]
  • Speedup ≤ Pipeline depth / (1 + Pipeline stall CPI)

26
Speed Up Equation for Pipelining
  • CPI pipelined = Ideal CPI + Pipeline stall clock
    cycles per instr
  • Speedup = [(Ideal CPI × Pipeline depth) /
    (Ideal CPI + Pipeline stall CPI)]
    × [Clock Cycle unpipelined / Clock Cycle pipelined]
  • Assuming an ideal CPI of 1:
  • Speedup = [Pipeline depth / (1 + Pipeline stall CPI)]
    × [Clock Cycle unpipelined / Clock Cycle pipelined]

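
The speedup equation translates directly into a small function. This is a sketch; the function name, parameter names, and sample values are chosen for illustration and do not come from the slides.

```python
def pipeline_speedup(depth, stall_cpi, clock_unpipelined=1.0,
                     clock_pipelined=1.0, ideal_cpi=1.0):
    """Speedup = (ideal CPI * depth) / (ideal CPI + stall CPI) * clock ratio."""
    cpi_ratio = (ideal_cpi * depth) / (ideal_cpi + stall_cpi)
    return cpi_ratio * (clock_unpipelined / clock_pipelined)


no_stalls = pipeline_speedup(depth=5, stall_cpi=0.0)     # equals the depth
some_stalls = pipeline_speedup(depth=5, stall_cpi=0.25)  # one stall every 4 instructions
slower_clock = pipeline_speedup(depth=5, stall_cpi=0.25,
                                clock_unpipelined=10.0, clock_pipelined=11.0)

print(no_stalls, some_stalls, round(slower_clock, 2))
```

With equal clock periods and no stalls the speedup equals the pipeline depth; every stall cycle per instruction and every bit of added clock overhead pulls it below that bound.
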
27
Structural Hazards
  • Sometimes called resource conflicts.
  • Example:
  • Some pipelined machines share a single
    memory pipeline for data and instructions. As a
    result, when an instruction contains a data
    memory reference, it conflicts with the
    instruction fetch for a later instruction.

28
One Memory Port / Structural Hazards (Figure 3.6, page 142)
[Figure: pipeline diagram of a Load followed by Instr 1 to Instr 4, in instruction order.]

29
One Memory Port / Structural Hazards (Figure 3.7, page 143)
[Figure: pipeline diagram of a Load followed by Instr 1, Instr 2, a stall, and Instr 3, in instruction order.]

30
A Pipeline Stalled for a Structural Hazard
Instruction   1      2      3      4      5      6      7      8      9      10
Load instr    IF     ID     EX     MEM    WB
Instr i+1            IF     ID     EX     MEM    WB
Instr i+2                   IF     ID     EX     MEM    WB
Instr i+3                          STALL  IF     ID     EX     MEM    WB
Instr i+4                                        IF     ID     EX     MEM    WB
Instr i+5                                               IF     ID     EX     MEM
Instr i+6                                                      IF     ID     EX

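
The stall in this table can be reproduced with a short scheduling sketch. It assumes a single memory port shared by instruction fetch (IF) and data access (MEM), a five-stage pipeline in which MEM comes three cycles after IF, and that a conflict is resolved by delaying the fetch; the function name `fetch_cycles` and the boolean encoding of instructions are invented for the example.

```python
def fetch_cycles(uses_data_memory):
    """Return the clock cycle (starting at 1) in which each instruction is fetched.

    uses_data_memory: list of booleans, True for loads and stores.
    """
    port_busy = set()  # cycles in which an earlier instruction's MEM stage uses the port
    cycles = []
    cycle = 1
    for accesses_memory in uses_data_memory:
        while cycle in port_busy:   # port taken by a load/store: stall the fetch
            cycle += 1
        cycles.append(cycle)
        if accesses_memory:
            port_busy.add(cycle + 3)  # IF, ID, EX, then MEM
        cycle += 1
    return cycles


# A load followed by six instructions that do not touch data memory
schedule = fetch_cycles([True, False, False, False, False, False, False])
print(schedule)
```

The fourth instruction (i+3) cannot be fetched in cycle 4 because the load is using the memory port for its MEM stage, so it and every later instruction slip by one cycle.
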
31
Example One or Two Memory Ports?
  • Machine A has a two-port memory, so it can access
    instructions and data simultaneously.
  • Machine B has a one-port memory, but its
    pipelined implementation has a 1.05 times faster
    clock rate
  • Ideal CPI = 1 for both
  • Loads are 40% of instructions executed
  • Ave Instr. Time A = Clock cycle A × CPI A
  • = Clock cycle A
  • Ave Instr. Time B = Clock cycle B × CPI B
  • = (Clock cycle A / 1.05) × (1 + 0.4 × 1), since each
    load stalls Machine B for one cycle
  • = Clock cycle A × 1.33
  • Ave Instr. Time B / Ave Instr. Time A = 1.33
  • Machine A is 1.33 times faster
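
The comparison is quick to verify. A minimal sketch, assuming each load on Machine B costs exactly one stall cycle and normalizing Machine A's clock cycle to 1; the variable names are illustrative.

```python
load_fraction = 0.4
stall_cycles_per_load = 1          # assumed: one stall per load on the one-port machine

clock_a = 1.0                      # normalized clock cycle of Machine A
time_a = clock_a * 1.0             # ideal CPI of 1, no structural stalls

clock_b = clock_a / 1.05           # Machine B's clock rate is 1.05 times faster
cpi_b = 1.0 + load_fraction * stall_cycles_per_load
time_b = clock_b * cpi_b

ratio = time_b / time_a
print(round(ratio, 2))
```

The 5% clock advantage does not make up for the 40% increase in CPI caused by the structural hazard.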

32
Data Hazard
  • Data hazards occur when the pipeline changes the order
    of read/write accesses to operands so that the
    order differs from the order seen by sequentially
    executing instructions on an unpipelined machine

33
Data Hazard on R1 (Figure 3.9, page 147)

34
Forwarding to Avoid Data Hazard (Figure 3.10, page 149)

35
Three Generic Data Hazards
  • InstrI followed by InstrJ
  • Read After Write (RAW): InstrJ tries to read an
    operand before InstrI writes it

I: ADD R1, R2, R3    IF ID EX MEM WB
J: SUB R4, R1, R5       IF ID EX MEM WB

36
Three Generic Data Hazards
  • Write After Write (WAW):
  • InstrJ tries to write an operand before InstrI
    writes it
  • Leaves the wrong result (InstrI's, not InstrJ's)
  • Can't happen in the DLX 5-stage pipeline because
  • All instructions take 5 stages, and
  • Writes are always in stage 5

I: LW R1, 0(R2)      IF ID EX MEM1 MEM2 WB
J: ADD R1, R2, R3       IF ID EX WB

37
Three Generic Data Hazards
  • InstrI followed by InstrJ
  • Write After Read (WAR): InstrJ tries to write an
    operand before InstrI reads it
  • Can't happen in the DLX 5-stage pipeline because
  • All instructions take 5 stages,
  • Reads are always in stage 2, and
  • Writes are always in stage 5

I: SW 0(R1), R2      IF ID EX MEM1 MEM2 WB
J: ADD R2, R3, R4       IF ID EX WB

38
Data Hazard Even with Forwarding (Figure 3.12, page 153)

39
Data Hazard Even with Forwarding (Figure 3.13, page 154)

40
HW Change for Forwarding (Figure 3.20, page 161)

41
Compiler Scheduling for Data Hazards
  • Rather than just allow the pipeline to stall, the
    compiler could try to schedule the pipeline to
    avoid these stalls by arranging the code sequence
    to eliminate the hazard. This technique is called
    pipeline scheduling or instruction scheduling.

42
Software Scheduling to Avoid Load Hazards
Try producing fast code for
    a = b + c;
    d = e - f;
assuming a, b, c, d, e, and f are in memory.
  • Slow code:
  • LW Rb,b
  • LW Rc,c
  • ADD Ra,Rb,Rc
  • SW a,Ra
  • LW Re,e
  • LW Rf,f
  • SUB Rd,Re,Rf
  • SW d,Rd
  • Fast code:
  • LW Rb,b
  • LW Rc,c
  • LW Re,e
  • ADD Ra,Rb,Rc
  • LW Rf,f
  • SW a,Ra
  • SUB Rd,Re,Rf
  • SW d,Rd
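
The benefit of the reordering can be counted with a simple model. This sketch assumes full forwarding and a one-cycle load delay, so the only stall is when an instruction uses a register loaded by the instruction immediately before it; the tuple encoding `(op, dest, sources)` and the function name `load_use_stalls` are illustrative.

```python
def load_use_stalls(code):
    """Count stalls caused by using a loaded register in the very next instruction."""
    stalls = 0
    for previous, current in zip(code, code[1:]):
        prev_op, prev_dest, _ = previous
        _, _, cur_sources = current
        if prev_op == "LW" and prev_dest in cur_sources:
            stalls += 1
    return stalls


slow_code = [
    ("LW", "Rb", []), ("LW", "Rc", []),
    ("ADD", "Ra", ["Rb", "Rc"]), ("SW", None, ["Ra"]),
    ("LW", "Re", []), ("LW", "Rf", []),
    ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, ["Rd"]),
]

fast_code = [
    ("LW", "Rb", []), ("LW", "Rc", []), ("LW", "Re", []),
    ("ADD", "Ra", ["Rb", "Rc"]), ("LW", "Rf", []),
    ("SW", None, ["Ra"]), ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, ["Rd"]),
]

slow_stalls = load_use_stalls(slow_code)
fast_stalls = load_use_stalls(fast_code)
print(slow_stalls, fast_stalls)
```

In the slow sequence both ADD and SUB immediately follow the load of one of their operands; the fast sequence puts an independent instruction in each of those slots.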

43
Compiler Avoiding Load Stalls
44
Pipelining Summary
  • Pipelining overlaps the execution of multiple
    instructions.
  • With an ideal pipeline, the CPI is one, and the
    speedup is equal to the number of stages in the
    pipeline.
  • However, several factors prevent us from
    achieving the ideal speedup, including
  • Not being able to divide the pipeline evenly
  • The time needed to fill and drain the pipeline
  • Overhead needed for pipelining
  • Structural, data, and control hazards

45
Pipelining Summary
  • Just overlap tasks; this is easy if tasks are
    independent
  • Speedup vs. pipeline depth: if ideal CPI is 1,
    then
    Speedup = [Pipeline depth / (1 + Pipeline stall CPI)]
    × [Clock Cycle unpipelined / Clock Cycle pipelined]
  • Hazards limit performance on computers:
  • Structural: need more HW resources
  • Data: need forwarding, compiler scheduling
  • Control: discuss next time