CS 2200 Lecture 09a: Pipelining

Provided by: michaelt8
1
CS 2200 Lecture 09a: Pipelining
  • (Lectures based on the work of Jay Brockman,
    Sharon Hu, Randy Katz, Peter Kogge, Bill Leahy,
    Ken MacKenzie, Richard Murphy, and Michael
    Niemier)

2
Class demo
  • Can someone come to the front of the class and
    explain to me how to do 5 loads of laundry?
  • I need 1 person to actually do the laundry
  • and 5 more to be, umm, the laundry.

3
Short review Single cycle MIPS machine
Single cycle MIPS machine
4
Short review Non-MIPS single cycle machine
  • $y = a + bx + cx^2$

[Figure: single-cycle datapath taking inputs x, a, b, c, forming the terms bx and cx^2, and producing y]
5
Short review Multi cycle MIPS machine
Multi cycle MIPS machine
6
Short review Multi cycle LC2200 machine
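[Figure: multi-cycle LC2200 datapath with a 1024 x 32-bit memory (Addr, Din, Dout), a 16 x 32-bit register file (regno, Din, Dout), A and B registers (LdA, LdB), IR[31..0], sign extend of IR[19..0], control signals WrREG and WrMEM, and an ALU whose func codes are 00 ADD, 01 NAND, 10 A - B, 11 A + 1]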
7
Other Processor Designs (with more than one bus)
  • One-bus is simple, recipe-oriented.
  • Alternatives
  • add parallel busses for data transfers that occur
    together
  • e.g. ALU input/input/output
  • add parallel compute units for operations that
    occur together
  • e.g. PC+1 in parallel with everything else
  • mux paths together as necessary
  • (somewhat ad-hoc)

8
Other Processor Designs (with more than one bus)
  • Add busses! One per ALU port

[Figure: datapath with a 3-port register file addressed by regnos]
9
Other Processor Designs (with more than one bus)
  • Fetch unit performs PC+1 and instruction lookup

[Figure: datapath with a separate instruction memory]
10
Cycles Per Instruction?
  • Well, you have a choice!
  • CPI = 1
  • one long cycle
  • Tclock = 5 ns?

11
Cycles Per Instruction?
  • Well, you have a choice!
  • CPI = 1
  • one long cycle!
  • Tclock = 5 ns
  • CPI = 5
  • five short cycles
  • Tclock = 1 ns
  • 5 ns/instruction either way

12
Transition
  • Can we do better?
  • What if we have 5 instructions?
  • With single cycle, 25 ns needed
  • With multi cycle, 25 ns needed
  • But it's also possible to do in less than 10 ns
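The three figures on this slide can be checked with a minimal sketch. It assumes the five-stage split with 1 ns stages from the previous slide and an ideal pipeline with no stalls; the function names are hypothetical and only for illustration.

```python
# Hypothetical helper names; stage count and cycle times come from the slides.

def single_cycle_ns(n_instr, t_long_ns=5):
    """One long 5 ns cycle per instruction (CPI = 1)."""
    return n_instr * t_long_ns


def multi_cycle_ns(n_instr, stages=5, t_short_ns=1):
    """Five short 1 ns cycles per instruction (CPI = 5), no overlap."""
    return n_instr * stages * t_short_ns


def pipelined_ns(n_instr, stages=5, t_short_ns=1):
    """Ideal pipeline: fill for `stages` cycles, then one result per cycle."""
    return (stages + n_instr - 1) * t_short_ns


print(single_cycle_ns(5), multi_cycle_ns(5), pipelined_ns(5))  # 25 25 9
```

Under these assumptions the pipelined machine needs 9 ns for 5 instructions, which is where "less than 10 ns" comes from.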

13
Pipelining
14
Pipelining: It's Natural!
  • Laundry Example
  • Ann, Brian, Cathy, Dave each have one load of
    clothes to wash, dry, and fold
  • Washer takes 30 minutes
  • Dryer takes 40 minutes
  • Folder takes 20 minutes

15
Sequential Laundry
[Figure: timeline from 6 PM to midnight, with task order on the vertical axis; the four loads run back to back, each taking 30, 40, and 20 minutes]
  • Sequential laundry takes 6 hours for 4 loads
  • If they learned pipelining, how long would
    laundry take?

16
Pipelined Laundry: Start work ASAP
[Figure: timeline from 6 PM to midnight, with task order on the vertical axis; the four loads overlap]
Note: More time to go out later that night
  • Pipelined laundry takes 3.5 hours for 4 loads
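Both laundry totals can be reproduced with a small simulation. This is a sketch under two assumptions that are not on the slides: a load may wait between stages (a hamper) as long as it needs to, and each machine handles one load at a time. The function names are hypothetical.

```python
def sequential_minutes(loads, stage_minutes):
    """Each load runs start to finish before the next one begins."""
    return loads * sum(stage_minutes)


def pipelined_minutes(loads, stage_minutes):
    """Each stage starts a load as soon as the load and the stage are both free."""
    stage_free_at = [0] * len(stage_minutes)  # when each machine is next idle
    for _ in range(loads):
        ready = 0  # time at which this load can enter the next stage
        for s, duration in enumerate(stage_minutes):
            start = max(ready, stage_free_at[s])
            stage_free_at[s] = start + duration
            ready = stage_free_at[s]
    return stage_free_at[-1]


stages = [30, 40, 20]  # washer, dryer, folder
print(sequential_minutes(4, stages) / 60)  # 6.0 hours
print(pipelined_minutes(4, stages) / 60)   # 3.5 hours
```

The pipelined total is 30 + 4 x 40 + 20 = 210 minutes: the 40-minute dryer sets the pace once the pipeline is full.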

17
Pipelining Lessons
  • Multiple tasks operating simultaneously
  • Pipelining doesn't help latency of single task,
    it helps throughput of entire workload
  • Pipeline rate limited by slowest pipeline stage
  • Potential speedup = number of pipe stages
  • Unbalanced lengths of pipe stages reduce speedup
  • Also, need time to fill and drain the
    pipeline.

[Figure: pipelined laundry timeline from 6 PM to 9 PM, with task order on the vertical axis]
18
Pipelining: Some terms
  • If you're doing laundry or implementing a µP,
    each stage where something is done is called a pipe
    stage
  • In laundry example, washer, dryer, and folding
    table are pipe stages; clothes enter at one end,
    exit other
  • In a µP, instructions enter at one end and have
    been executed when they leave
  • Another example: auto assembly line
  • Throughput is how often stuff comes out of a
    pipeline

19
More on throughput
  • All pipe stages are connected so everything must
    move from one to another at same time
  • How fast this happens is a function of time it
    takes for slowest stage to finish
  • Example: If laundry takes 30 min. to wash but 40
    min. to dry, it'll be idle in washer for 10 min.
  • In a µP, this is machine cycle time (usually 1
    clock)
  • If each pipe stage is perfectly balanced time
    wise
  • Time/Instruction = Time on unpipelined / number of pipe
    stages
  • Therefore speedup from pipelining = number of pipe
    stages
  • But of course nothing's perfect!

20
So really, how is pipelining faster?
  • Pipelining reduces average execution
    time/instruction
  • Could be viewed as decreasing number of clock cycles
    per instruction (CPI)
  • In perfect pipeline, you should see 1 instruction
    result each cycle even though that instruction
    actually required multiple pipe stages/multiple
    cycles
  • Pipelining is implementation technique, not
    visible to programmer
  • (a good thing b/c it's one less thing a programmer
    has to worry about!)

21
More technical detail
  • General characteristics
  • Complete process broken into S independent steps
  • Each step done independently at a stage
  • Stages arranged in linear order to match process
  • As each stage finishes its pieces, it passes it
    to the next stage
  • Time for 1 complete processing sequence = sum of
    all stages
  • BUT rate at which we can initiate new work is set by
    max of any stage time

22
More technical detail
  • If times for all S stages are equal to T
  • Time for one initiation to complete is still S × T
  • Time between 2 initiations = T, not S × T
  • Initiations per second = 1/T
  • Pipelining: Overlap multiple executions of same
    sequence
  • Improves THROUGHPUT, not the time to perform a
    single operation
  • Other examples
  • Automobile assembly plant, chemical factory,
    garden hose, cooking

23
More technical detail
  • Book's approach to draw pipeline timing diagrams
  • Time runs left-to-right, in units of stage time
  • Each row below corresponds to distinct
    initiation
  • Boundary b/t 2 column entries: pipeline register
  • (i.e. hamper)
  • Must look at column contents to see what stage is
    doing what

Time for N initiations to complete: $NT + (S-1)T$
Throughput: time per initiation $= T + (S-1)T/N \to T$ as N grows
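A short sketch of the two expressions above, assuming S equal stages of time T each and no stalls. The function names are hypothetical.

```python
def time_for_initiations(n, stages, t):
    """Total time for n initiations: n*T to start them plus (S-1)*T to drain."""
    return n * t + (stages - 1) * t


def time_per_initiation(n, stages, t):
    """Average time per initiation; approaches T as n grows."""
    return time_for_initiations(n, stages, t) / n


for n in (1, 10, 1000):
    print(n, time_per_initiation(n, stages=5, t=1.0))
```

With S = 5 and T = 1, a single initiation costs the full 5 time units, while 1000 initiations average just over 1 unit each.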
24
Ideal digital system pipeline speedup
Unpipelined
[Figure: four blocks of combinational logic, each with delay t, between two latches]
delay for 1 piece of data = 4t + latch setup (assume small)
approximate delay for 1000 pieces of data = 4000t
Pipelined
[Figure: the same four blocks of combinational logic, each with delay t, separated by latches]
delay for 1 piece of data = 4(t + latch setup)
approximate delay for 1000 pieces of data = 3t + 1000t
speedup for 1000 pieces of data = 4000/1003 ≈ 4
Ideal speedup = number of pipeline stages
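The speedup figure on this slide is one instance of a general expression. As a sketch, with latch setup time neglected and with S stages of delay t and N pieces of data (symbols introduced here for illustration):

```latex
\text{speedup}(N) = \frac{N \cdot S \cdot t}{(S - 1)\,t + N\,t} = \frac{N S}{N + S - 1},
\qquad
\text{speedup}(1000)\big|_{S=4} = \frac{4000}{1003} \approx 3.99,
\qquad
\lim_{N \to \infty} \text{speedup}(N) = S .
```

So the ideal speedup equals the number of stages only in the limit of many pieces of data; fill time always costs a little.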
25
Example
  • IF: The instruction fetch sequence (2 ns)
  • ID: Decode and fetch register operands (1 ns)
  • EX: Perform ALU operation (2 ns)
  • MEM: Perform data memory operation (2 ns)
  • WB: Write result (if any) back into reg. file
    (1 ns)
  • Hmmm... 5 stages → 5X performance increase over a
    single cycle design?
  • Electrical design challenge
  • Can we make HW do each stage in same time?

26
Example
[Figure: timing diagram of one initiation, with stage times of 1 ns or 2 ns each; total time 8 ns]
Try to overlap initiations: the stages don't line up!
Possible solution: insert 1 ns after ID to allow
alignment
Structural Hazard
27
More technical detail
Delay ID by 1 ns also
[Figure: overlapped timing diagram; one initiation spans 9 ns and four initiations span 15 ns, with no structural hazard]
  • One initiation: 9 ns or 10 ns (depending on how
    you look at it)
  • 4 initiations: 15 ns → average of 1 initiation
    every 3.75 ns
  • How long for 1000 initiations?
  • What is the equivalent time between
    initiations?
  • What is the effective speedup?
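One way to approach the three questions is sketched below. It rests on assumptions inferred from the slide rather than stated on it: the first initiation takes 9 ns, each later one completes 2 ns after the previous (which matches 9 + 3 x 2 = 15 ns for four), and the unpipelined design needs 8 ns per initiation. The function name is hypothetical.

```python
def pipelined_total_ns(n, first_ns=9, interval_ns=2):
    """First initiation takes first_ns; each later one finishes interval_ns after."""
    return first_ns + (n - 1) * interval_ns


UNPIPELINED_NS = 8  # total time for one initiation without overlap

n = 1000
total = pipelined_total_ns(n)
print(total)                       # total time in ns
print(total / n)                   # equivalent time between initiations
print(UNPIPELINED_NS * n / total)  # effective speedup
```

Under these assumptions the effective speedup approaches 8 / 2 = 4 rather than 5, because the stages are padded to the 2 ns of the slowest one.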

28
Transition
  • to a microprocessor

29
The Big Picture Literally!
30
The new look dataflow
[Figure: pipelined datapath with PC, adder, instruction memory, register file, sign extend (16 to 32 bits), ALU, branch comparator ("Branch taken"), data memory, and muxes, divided by the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB]
Data must be stored from one stage to the
next in pipeline registers/latches. These hold
temporary values between clocks, plus the info.
needed for execution.
31
Another way to look at it
[Figure: pipeline diagram with clock number (time) on the horizontal axis and program execution order (in instructions) on the vertical axis]
32
So, what about the details?
  • In each cycle, new instruction fetched and begins
    5 cycle execution
  • In perfect world (pipeline) performance improved
    5 times over!
  • So, that's it, huh? Hardly!!!
  • What else do we have to worry about?
  • Must know what's going on in every cycle of
    machine
  • What if 2 instructions try to use the same
    resource at same time?
  • (LOTS more on this later)
  • Separate instruction/data memories, multiple
    register ports, etc. help avoid this

33
So seriously, what does pipelining do for us?
  • For starters, pipelining does not reduce the
    execution time of a single instruction.
  • Actually, b/c of overhead of controlling
    pipeline, execution time usually increases!
  • So why do it?
  • Pipelining increases CPU instruction throughput.
  • Number of instructions executed in some given time
    frame should increase b/c of pipelining
  • Thus, a program runs faster but all instructions
    actually execute a little slower. Crazy, huh?

34
Limits, limits, limits
  • So, now that the ideal stuff is out of the way,
    let's look at how a pipeline REALLY works
  • Pipelines are slowed b/c of
  • Pipeline latency
  • Imbalance of pipeline stages
  • (Think: a chain is only as strong as its weakest
    link)
  • Well, a pipeline is only as fast as its slowest
    stage
  • Pipeline overhead (from where?)
  • Register delay from pipe stage latches
  • Clock skew: once a clock cycle is as small as
    the sum of the clock skew and latch overhead, you
    can't get any work done

35
Speed Up Equation for Pipelining
For simple RISC pipeline, CPI = 1. W/microcode,
unpipelined CPI = pipeline depth
Single-cycle HW would have a slow clock
36
Transition
  • to MIPS examples

37
CS 2200 Lecture 09b: MIPS Pipelining Examples

38
Executing Instructions in Pipelined Datapath
  • Following charts describe 3 scenarios
  • Processing of load word (lw) instruction
  • Bug included in design (make SURE you understand
    the bug)
  • Processing of lw
  • Bug corrected (make SURE you understand the fix)
  • Processing of lw followed in pipeline by sub
  • (Sets the stage for discussion of HAZARDS and
    inter-instruction dependencies)

39
Load word Cycle 1
40
Load Word Cycle 2
41
Load Word Cycle 3
42
Load Word Cycle 4
43
Load Word Cycle 5
44
Load Word Fixed Bug
45
A 2 instruction sequence
  • Examine multiple-cycle & single-cycle diagrams
    for a sequence of 2 independent instructions
  • (i.e. no common registers b/t them)
  • lw $10, 9($1)
  • sub $11, $2, $3

46
Single-cycle diagrams cycle 1
47
Single-cycle diagrams cycle 2
48
Single-cycle diagrams cycle 3
49
Single-cycle diagrams cycle 4
50
Single-cycle diagrams cycle 5
51
Single-cycle diagrams cycle 6
52
Pipelined Control
  • Potentially very complicated, approach
    methodically.
  • Example (independent instructions)
  • lw $10, 9($1)
  • sub $11, $2, $3
  • and $12, $4, $5
  • or $13, $6, $7
  • add $14, $8, $9

53
Pipelined Control
  • Example (dependent instructions)
  • ($2 used in sequential instructions)
  • sub $2, $1, $3      # register $2 written by sub
  • add $12, $2, $5     # 1st operand ($2) depends on sub
  • or $13, $6, $2      # 2nd operand ($2) depends on sub
  • add $14, $2, $2     # 1st and 2nd ($2) depend on sub
  • sw $15, 100($2)     # index ($2) depends on sub
  • Problem
  • write-back for sub won't occur until the 5th
    cycle
  • First assume sequence of independent instructions
  • later, remove this assumption

54
Control signal summary
55
Questions about control signals
  • Following discussion relevant to a single
    instruction
  • Q: Are all control signals active at the same
    time?
  • A: ?
  • Q: Can we generate all these signals at the same
    time?
  • A: ?

56
Control lines by pipe stage
  • Each data flow component is active in only one
    pipeline stage
  • So, divide control signals into groups according
    to active component
  • 1. Instruction Fetch
  • Always read instruction memory and write PC
  • (basically nothing special)
  • 2. Instruction Decode / Register Fetch
  • Still nothing special to control
  • (same action every time)
  • 3. Execution (must decode control sigs from
    inst.)
  • RegDst: does target reg come from bits 20-16 or
    15-11?
  • ALUOp: how to control ALU operation
  • ALUSrc: does 2nd ALU input come from reg. file
    or sign ext.?

57
Control lines by pipe stage
  • 4. Memory: likewise
  • Branch: used to generate PCSrc
  • PCSrc: does PC get incremented or replaced by
    output of branch adder?
  • MemRead: signal reads from memory
  • MemWrite: signal writes to memory
  • 5. Write Back: likewise
  • MemToReg: does value going back to reg file come
    from ALU or memory?
  • RegWrite: is there in fact a register write back
    to perform?

58
Passing control w/pipe registers
  • Analogy: send instruction with car on assembly
    line
  • "Install Corinthian leather interior on car 6 at
    stage 3"

59
Pipelined datapath w/control signals
60
CS 2200 Lecture X: Hazards

61
The hazards of pipelining
  • Pipeline hazards prevent next instruction from
    executing during designated clock cycle
  • There are 3 classes of hazards
  • Structural Hazards
  • Arise from resource conflicts
  • HW cannot support all possible combinations of
    instructions
  • Data Hazards
  • Occur when given instruction depends on data from
    an instruction ahead of it in pipeline
  • Control Hazards
  • Result from branch, other instructions that
    change flow of program (i.e. change PC)

62
How do we deal with hazards?
  • Often, pipeline must be stalled
  • Stalling pipeline usually lets some
    instruction(s) in pipeline proceed,
    another/others wait for data, resource, etc.
  • A note on terminology
  • If we say an instruction was issued later than
    instruction x, we mean that it was issued after
    instruction x and is not as far along in the
    pipeline
  • If we say an instruction was issued earlier than
    instruction x, we mean that it was issued before
    instruction x and is further along in the pipeline

63
Stalls and performance
  • Stalls impede progress of a pipeline and result
    in deviation from 1 instruction executing/clock
    cycle
  • Pipelining can be viewed to
  • Decrease CPI or clock cycle time for instruction
  • Let's see what effect stalls have on CPI
  • CPI pipelined
  • = Ideal CPI + Pipeline stall cycles per instruction
  • = 1 + Pipeline stall cycles per instruction
  • Ignoring overhead and assuming stages are
    balanced

64
More pipeline performance issues
  • Pipelining can appear to improve clock cycle time
  • Can assume the CPI of an unpipelined and a
    pipelined machine is 1
  • This results in
  • If pipe stages perfectly balanced, we assume no
    overhead
  • clock cycle on pipelined machine is smaller than
    unpipelined machine by a factor equal to pipeline
    depth.

65
Even more pipeline performance issues!
  • This results in
  • Which leads to
  • If no stalls, speedup equal to number of pipeline
    stages in ideal case
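The relationship these slides build toward can be sketched as follows. The formula used here, speedup = pipeline depth / (1 + stall cycles per instruction), is the standard textbook result under the assumptions stated on the previous slides (CPI of 1 on both machines, perfectly balanced stages, no overhead); it is offered as an illustration because the slide's own equations did not survive in this transcript. The function names are hypothetical.

```python
def pipelined_cpi(stall_cycles_per_instr, ideal_cpi=1.0):
    """CPI pipelined = ideal CPI + pipeline stall cycles per instruction."""
    return ideal_cpi + stall_cycles_per_instr


def pipeline_speedup(depth, stall_cycles_per_instr):
    """Speedup over the unpipelined machine, assuming balanced stages
    and a pipelined clock that is `depth` times faster."""
    return depth / pipelined_cpi(stall_cycles_per_instr)


print(pipeline_speedup(5, 0.0))   # 5.0: no stalls, speedup equals depth
print(pipeline_speedup(5, 0.25))  # 4.0: a quarter-cycle of stalls per instruction
```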

66
Structural hazards
  • 1 way to avoid structural hazards is to duplicate
    resources
  • i.e. An ALU to perform an arithmetic operation
    and an adder to increment PC
  • If not all possible combinations of instructions
    can be executed, structural hazards occur
  • Most common instances of structural hazards
  • When a functional unit not fully pipelined
  • When some resource not duplicated enough
  • Pipeline stalls result from hazards; CPI increases
    from the usual 1

67
An example of a structural hazard
Load
Instruction 1
Instruction 2
Instruction 3
Instruction 4
What's the problem here?
Time
68
How is it resolved?
Load
Instruction 1
Instruction 2
Stall
Instruction 3
Pipeline generally stalled by inserting a
bubble or NOP
Time
69
Or alternatively
Clock Number
  • LOAD instruction steals an instruction fetch
    cycle which will cause the pipeline to stall.
  • Thus, no instruction completes on clock cycle 8

70
A simple example
  • The facts
  • Data references constitute 40% of an instruction
    mix
  • Ideal CPI of the pipelined machine is 1
  • A machine with a structural hazard has a clock
    rate that's 1.05 times higher than a machine
    without the hazard.
  • How much does this LOAD problem hurt us?
  • Recall: Avg. Inst. Time = CPI × Clock Cycle Time
  • = (1 + 0.4 × 1) × (Clock cycle time_ideal / 1.05)
  • ≈ 1.3 × Clock cycle time_ideal
  • Therefore the machine without the hazard is
    better
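The arithmetic in this example can be checked with a minimal sketch. It assumes, as the slide does, that each data reference costs one stall cycle on the machine with the hazard; the ideal clock cycle time is normalized to 1, and the names are hypothetical.

```python
def avg_instr_time(cpi, clock_cycle_time):
    """Average instruction time = CPI x clock cycle time."""
    return cpi * clock_cycle_time


IDEAL_CYCLE = 1.0  # clock cycle time of the machine without the hazard

without_hazard = avg_instr_time(1.0, IDEAL_CYCLE)
# 40% of instructions stall for 1 cycle, but the clock is 1.05x faster.
with_hazard = avg_instr_time(1.0 + 0.4 * 1, IDEAL_CYCLE / 1.05)

print(round(with_hazard, 2))         # 1.33
print(with_hazard > without_hazard)  # True: the hazard-free machine wins
```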

71
Remember the common case!
  • All things being equal, a machine without
    structural hazards will always have a lower CPI.
  • But, in some cases it may be better to allow them
    than to eliminate them.
  • These are situations a computer architect might
    have to consider
  • Is pipelining functional units or duplicating
    them costly in terms of HW?
  • Does structural hazard occur often?
  • What's the common case???

72
Data hazards
  • These exist because of pipelining
  • Why do they exist???
  • Pipelining changes order of read/write accesses
    to operands
  • Order differs from order seen by sequentially
    executing instructions on unpipelined machine
  • Consider this example
  • ADD R1, R2, R3
  • SUB R4, R1, R5
  • AND R6, R1, R7
  • OR R8, R1, R9
  • XOR R10, R1, R11

All instructions after ADD use result of ADD.
For the DLX µP, ADD writes the register in WB
but SUB needs it in ID. This is a data hazard.
73
Illustrating a data hazard
ADD R1, R2, R3
SUB R4, R1, R5
Reg
Mem
DM
AND R6, R1, R7
Reg
Mem
OR R8, R1, R9
Reg
Mem
XOR R10, R1, R11
Time
ADD instruction causes a hazard in next 3
instructions b/c register not written until after
those 3 read it.
74
Forwarding
  • Problem illustrated on previous slide can
    actually be solved relatively easily with
    forwarding
  • In this example, result of the ADD instruction
    not really needed until after ADD actually
    produces it
  • Can we move the result from EX/MEM register to
    the beginning of ALU (where SUB needs it)?
  • Yes! Hence this slide!
  • Generally speaking
  • Forwarding occurs when a result is passed
    directly to functional unit that requires it.
  • Result goes from output of one unit to input of
    another

75
When can we forward?
[Figure: pipeline diagram over time for the sequence below, with forwarding paths drawn from ADD to the later instructions]
ADD R1, R2, R3
SUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11
SUB gets info. from EX/MEM pipe register.
AND gets info. from MEM/WB pipe register.
OR gets info. by forwarding from register file.
Rule of thumb: If line goes forward you can do
forwarding. If it's drawn backward, it's
physically impossible.
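The choice of forwarding source described on this slide can be sketched as a simple priority check. This is an illustration only: the function name and the string labels are hypothetical, and a real forwarding unit would also check that the older instruction actually writes a register.

```python
def forward_source(src_reg, ex_mem_dest, mem_wb_dest):
    """Pick where an ALU operand should come from.

    ex_mem_dest / mem_wb_dest are the destination registers held in the
    EX/MEM and MEM/WB pipeline registers (None if there is none).
    The most recent producer (EX/MEM) takes priority.
    """
    if ex_mem_dest == src_reg:
        return "EX/MEM"
    if mem_wb_dest == src_reg:
        return "MEM/WB"
    return "register file"


# ADD R1,R2,R3 is one instruction ahead of SUB, two ahead of AND.
print(forward_source("R1", ex_mem_dest="R1", mem_wb_dest=None))  # EX/MEM
print(forward_source("R1", ex_mem_dest=None, mem_wb_dest="R1"))  # MEM/WB
print(forward_source("R1", ex_mem_dest=None, mem_wb_dest=None))  # register file
```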
76
HW Change for Forwarding
77
Data hazard specifics
  • There are actually 3 different kinds of data
    hazards!
  • Read After Write (RAW)
  • Write After Write (WAW)
  • Write After Read (WAR)
  • We'll discuss/illustrate each on forthcoming
    slides. However, 1st a note on convention.
  • Discussion of hazards will use generic
    instructions i and j.
  • i is always issued before j.
  • Thus, i will always be further along in pipeline
    than j.
  • With an in-order issue/in-order completion
    machine, we're not as concerned with WAW, WAR

78
Read after write (RAW) hazards
  • With RAW hazard, instruction j tries to read a
    source operand before instruction i writes it.
  • Thus, j would incorrectly receive an old or
    incorrect value
  • Graphically/Example
  • Can use stalling or forwarding to resolve this
    hazard

i: ADD R1, R2, R3
j: SUB R4, R1, R6
Instruction j is a read instruction issued after i
Instruction i is a write instruction issued
before j
79
Write after write (WAW) hazards
  • With WAW hazard, instruction j tries to write an
    operand before instruction i writes it.
  • The writes are performed in wrong order leaving
    the value written by earlier instruction
  • Graphically/Example

i: DIV F1, F2, F3
j: SUB F1, F4, F6
Instruction j is a write instruction issued after
i
Instruction i is a write instruction issued
before j
80
Write after read (WAR) hazards
  • With WAR hazard, instruction j tries to write an
    operand before instruction i reads it.
  • Instruction i would incorrectly receive newer
    value of its operand
  • Instead of getting old value, it could receive
    some newer, undesired value
  • Graphically/Example

i: DIV F7, F1, F3
j: SUB F1, F4, F6
Instruction j is a write instruction issued after
i
Instruction i is a read instruction issued before
j
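The three definitions can be summarized in a small classifier. This is a sketch: it reports which register dependences exist between an earlier instruction i and a later instruction j, and whether each one becomes an actual hazard depends on the pipeline. The `Instr` type and the function name are hypothetical.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    dest: str               # register written
    srcs: Tuple[str, ...]   # registers read


def data_dependences(i: Instr, j: Instr) -> Set[str]:
    """Classify dependences of later instruction j on earlier instruction i."""
    kinds = set()
    if i.dest in j.srcs:
        kinds.add("RAW")  # j reads what i writes
    if i.dest == j.dest:
        kinds.add("WAW")  # both write the same register
    if j.dest in i.srcs:
        kinds.add("WAR")  # j writes what i reads
    return kinds


print(data_dependences(Instr("R1", ("R2", "R3")), Instr("R4", ("R1", "R6"))))  # {'RAW'}
print(data_dependences(Instr("F1", ("F2", "F3")), Instr("F1", ("F4", "F6"))))  # {'WAW'}
print(data_dependences(Instr("F7", ("F1", "F3")), Instr("F1", ("F4", "F6"))))  # {'WAR'}
```

The three calls use the i/j pairs from the RAW, WAW, and WAR slides.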
81
Forwarding: It doesn't always work
[Figure: pipeline diagram over time for the sequence below; the forwarding path from LW to SUB would have to go backward in time]
LW R1, 0(R2)
SUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
Load has a latency that forwarding can't
solve. Pipeline must stall until hazard cleared
(starting with instruction that wants to use
data until source produces it).
To get the data to subtract instruction we need a
time machine!
82
The solution pictorially
[Figure: pipeline diagram over time for the sequence below, with a bubble inserted after LW]
LW R1, 0(R2)
SUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
Insertion of bubble causes number of cycles to
complete this sequence to grow by 1
83
Data hazards and the compiler
  • Compiler should be able to help eliminate some
    stalls caused by data hazards
  • i.e. compiler could avoid generating a LOAD
    instruction that is immediately followed by
    instruction that uses result of LOAD's
    destination register.
  • Technique is called pipeline/instruction
    scheduling

84
What about control logic?
  • For DLX integer pipeline, all data hazards can be
    checked during ID phase of pipeline
  • If data hazard, instruction stalled before it's
    issued
  • Whether forwarding is needed can also be
    determined at this stage, control signals set
  • If hazard detected, control unit of pipeline must
    stall pipeline and prevent instructions in IF, ID
    from advancing
  • All control information carried along in pipeline
    registers so only these fields must be changed

85
Some example situations
86
Detecting Data Hazards
87
Hazard Detection Logic
  • Insert a bubble into pipeline if any are true
  • ID/EX.RegWrite AND
  • ((ID/EX.RegDst = 0 AND ID/EX.WriteRegRt = IF/ID.ReadRegRs) OR
  • (ID/EX.RegDst = 1 AND ID/EX.WriteRegRd = IF/ID.ReadRegRs) OR
  • (ID/EX.RegDst = 0 AND ID/EX.WriteRegRt = IF/ID.ReadRegRt) OR
  • (ID/EX.RegDst = 1 AND ID/EX.WriteRegRd = IF/ID.ReadRegRt))
  • OR EX/MEM.RegWrite AND
  • ((EX/MEM.WriteReg = IF/ID.ReadRegRs) OR
  • (EX/MEM.WriteReg = IF/ID.ReadRegRt))
  • OR MEM/WB.RegWrite AND
  • ((MEM/WB.WriteReg = IF/ID.ReadRegRs) OR
  • (MEM/WB.WriteReg = IF/ID.ReadRegRt))

Notation: in ID/EX.RegDst, ID/EX is the pipeline register and RegDst is the field
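The bubble condition above can be sketched as a Boolean function. This is an illustration of the slide's logic for a pipeline without forwarding, not the course's actual hardware description: the class and field names are hypothetical, and the EX/MEM term is assumed to be gated by that stage's RegWrite signal in the same way as the other two stages.

```python
from dataclasses import dataclass


@dataclass
class IdEx:
    reg_write: bool
    reg_dst: int       # 0: destination is Rt, 1: destination is Rd
    write_reg_rt: int
    write_reg_rd: int


@dataclass
class LaterStage:      # stands in for EX/MEM or MEM/WB
    reg_write: bool
    write_reg: int


def needs_bubble(read_rs, read_rt, id_ex, ex_mem, mem_wb):
    """True if the instruction in IF/ID reads a register that an
    instruction further along the pipeline has not yet written."""
    sources = (read_rs, read_rt)
    id_ex_dest = id_ex.write_reg_rd if id_ex.reg_dst == 1 else id_ex.write_reg_rt
    return bool(
        (id_ex.reg_write and id_ex_dest in sources)
        or (ex_mem.reg_write and ex_mem.write_reg in sources)
        or (mem_wb.reg_write and mem_wb.write_reg in sources)
    )


# sub $2,$1,$3 sits in ID/EX while add $12,$2,$5 is being decoded.
sub_in_ex = IdEx(reg_write=True, reg_dst=1, write_reg_rt=3, write_reg_rd=2)
idle = LaterStage(reg_write=False, write_reg=0)
print(needs_bubble(2, 5, sub_in_ex, idle, idle))  # True
```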
88
How to Insert Bubbles
  • If hazard detected
  • Don't write to PC or IF/ID reg.; de-assert
    signals for NOP

89
Incorporation of Hazard Detection Unit
90
Stall Ex. Cycle 1
91
Stall Ex. Cycle 2
92
Stall Ex. Cycle 3 1st Bubble Inserted
93
Stall Ex. Cycle 4 2nd Bubble Inserted
94
Stall Ex. Cycle 5 3rd Bubble Inserted
95
Stall Ex. Cycle 6 End of Stall
96
Stall Ex. Cycle 7
97
Control Hazards
98
[Figure: five-stage pipelined datapath (IF, ID, EX, MEM, WB) for R-type instructions, with PC, Instr Mem, DPRF, BEQ, SE, Data Mem, and muxes]
99
Control Hazard on Branches: 2 Stage Stall?
10 beq r1,r3,36
14 and r2,r3,r5
18 or r6,r1,r7
22 add r8,r1,r9
36 xor r10,r1,r11
100
Example
  • simulation

101
Scenario
  • We have the following code segment
  • lw R6, X(R0)
  • beq R1, R2, SKIP
  • add R1, R2, R3
  • SKIP: add R5, R4, R1
  • sw R7, X(R0)
  • X: .word 5

102
lw R6,X(R0)
[Figure: five-stage pipelined datapath (IF, ID, EX, MEM, WB) shown beside the program listing from the Scenario slide]
103
lw R6,X(R0)
beq R1,R2,SKIP
[Figure: five-stage pipelined datapath (IF, ID, EX, MEM, WB) shown beside the program listing from the Scenario slide]
104
lw R6,X(R0)
beq R1,R2,SKIP
BUBBLE
[Figure: five-stage pipelined datapath (IF, ID, EX, MEM, WB) shown beside the program listing from the Scenario slide]
Note: Bubble because no branch prediction or delay slot
fill.
105
lw R6,X(R0)
beq R1,R2,SKIP
BUBBLE
BUBBLE
[Figure: five-stage pipelined datapath (IF, ID, EX, MEM, WB) shown beside the program listing from the Scenario slide]
Second bubble because we're detecting BEQ in 3rd
stage.
106
lw R6,X(R0)
BUBBLE
BUBBLE
add R1,R2,R3
beq R1,R2,SKIP
[Figure: five-stage pipelined datapath (IF, ID, EX, MEM, WB) shown beside the program listing from the Scenario slide]
107
beq R1,R2,SKIP
add R5,R4,R1
BUBBLE
BUBBLE
add R1,R2,R3
[Figure: five-stage pipelined datapath (IF, ID, EX, MEM, WB) shown beside the program listing from the Scenario slide]
108
BUBBLE
add R5,R4,R1
sw R7,X(R0)
BUBBLE
add R1,R2,R3
[Figure: five-stage pipelined datapath (IF, ID, EX, MEM, WB) with the Forwarding Unit highlighted, shown beside the program listing from the Scenario slide]
109
BUBBLE
add R1,R2,R3
add R5,R4,R1
sw R7,X(R0)
[Figure: five-stage pipelined datapath (IF, ID, EX, MEM, WB) shown beside the program listing from the Scenario slide]
110
add R1,R2,R3
add R5,R4,R1
sw R7,X(R0)
[Figure: five-stage pipelined datapath (IF, ID, EX, MEM, WB) shown beside the program listing from the Scenario slide]
111
add R5,R4,R1
sw R7,X(R0)
[Figure: five-stage pipelined datapath (IF, ID, EX, MEM, WB) shown beside the program listing from the Scenario slide]
112
Dealing with Branch Hazards (more detail)
113
Branch Hazards
  • So far, we've limited discussion of hazards to
  • Arithmetic/logic operations
  • Data transfers
  • Also need to consider hazards involving branches
  • Example
  • 40 beq $1, $3, 28
  • 44 and $12, $2, $5
  • 48 or $13, $6, $2
  • 52 add $14, $2, $2
  • 72 lw $4, 50($7)
  • How long will it take before the branch decision
    takes effect?
  • What happens in the meantime?

114
Branch signal determined in MEM stage
115
Pipeline impact on branch
  • If branch condition true, must skip 44, 48, 52
  • But, these have already started down the pipeline
  • They will complete unless we do something about
    it
  • How do we deal with this?
  • We'll consider 2 possibilities

116
Dealing w/branch hazards: always stall
  • Branch taken
  • Wait 3 cycles
  • No proper instructions in the pipeline
  • Same delay as without stalls (no time lost)

117
Dealing w/branch hazards: always stall
  • Branch not taken
  • Still must wait 3 cycles
  • Time lost
  • Could have spent cycles fetching and decoding
    next instructions

118
Dealing w/branch hazards: assume branch not taken
  • On average, branches are taken ½ the time
  • If branch not taken
  • Continue normal processing
  • Else, if branch is taken
  • Need to flush improper instruction from pipeline
  • Cuts overall time for branch processing in ½

119
Flushing unwanted instructions from pipeline
  • Useful to compare w/stalling pipeline
  • Simple stall: inject bubble into pipe at ID
    stage only
  • Change control to 0 in the ID stage
  • Let bubbles percolate to the right
  • Flushing pipe: must change inst. in IF, ID, and
    EX
  • IF Stage
  • Zero instruction field of IF/ID pipeline register
  • Use new control signal IF.Flush
  • ID Stage
  • Use existing bubble injection mux that zeros
    control for stalls
  • Signal ID.Flush is ORed w/stall signal from
    hazard detection unit
  • EX Stage
  • Add new muxes to zero EX pipeline register
    control lines
  • Both muxes controlled by single EX.Flush signal
  • Control determines when to flush
  • Depends on Opcode and value of branch condition

120
Inserting bubbles v. flushing pipeline
121
Assume branch not taken... and branch is not taken
  • Execution proceeds normally no penalty

122
Assume branch not taken... and branch is taken
  • Bubbles injected into 3 stages during cycle 5

123
Reservation Table Picture
  • Another way of looking at it

40 beq $1, $3, 72
44 and $12, $2, $5
48 or $13, $6, $2
52 add $14, $2, $2
72 lw $4, 50($7)
Assume Branch Not Taken and Correct: no penalty
Assume Branch Not Taken and NOT Correct: 3 cycle penalty
124
Branch Penalty Impact
  • Assume 16% of all instructions are branches
  • 4% unconditional branches: 3 cycle penalty
  • 12% conditional: 50% taken
  • For a sequence of N instructions (assume N is
    large)
  • N cycles to initiate each
  • 3 × 0.04 × N delays due to unconditional branches
  • 0.5 × 3 × 0.12 × N delays due to conditional
    taken
  • Also, an extra 4 cycles for pipeline to empty
  • Total
  • 1.3N + 4 total cycles (or 1.3 cycles/instruction)
    (CPI)
  • 30% Performance Hit!!! (Bad thing)
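The cycle count on this slide can be reproduced with a short sketch. The parameter names are hypothetical; the default values are the percentages and penalties given above.

```python
def total_cycles(n, uncond_frac=0.04, cond_frac=0.12, taken_frac=0.5,
                 penalty=3, drain=4):
    """Cycles for n instructions under 'assume branch not taken'."""
    uncond_stalls = penalty * uncond_frac * n
    cond_stalls = taken_frac * penalty * cond_frac * n
    return n + uncond_stalls + cond_stalls + drain


n = 1_000_000
cycles = total_cycles(n)
print(cycles / n)  # approaches 1.3 cycles per instruction for large n
```

The per-instruction overhead is 3 x 0.04 + 0.5 x 3 x 0.12 = 0.12 + 0.18 = 0.30, hence the 30% hit.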

125
Branch Penalty Impact
  • Some solutions
  • In ISA: branches always execute next 1 or 2
    instructions
  • Instruction so executed said to be in delay slot
  • See SPARC ISA
  • In organization: move comparator to ID stage and
    decide in the ID stage
  • Reduces branch delay by 2 cycles
  • Increases the cycle time

126
Branch Prediction
  • Prior solutions are ugly
  • Better (& more common): guess in IF stage
  • Technique is called branch prediction; needs 2
    parts
  • Predictor: to guess if instruction will
    branch (and to where)
  • Recovery mechanism: i.e. a way to fix your
    mistake
  • Prior strategy
  • Predictor: always guess branch never taken
  • Recovery: flush instructions if branch taken
  • Alternative: accumulate info. in IF stage as to
  • Whether or not for any particular PC value a
    branch was taken next
  • To where it is taken
  • How to update with information from later stages

127
A Branch Predictor
128
Branch History Table
129
Branch Prediction Information
  • One bit predictor
  • Use result from last time we saw this instruction
  • Problem
  • Even if branch is almost always taken, we will be
    wrong at least twice
  • 1st time we see the instruction
  • 1st time the branch is not taken
  • Also, 1st time branch is taken again after that
  • And if branch alternates b/t taken, not taken
  • We get 0% accuracy
  • Can we do better? Yep.

130
Branch Prediction Information
  • How to do better?
  • Keep a counter in each entry of the number of
    times taken in the last N times executed
  • Keep information about the pattern of previous
    branches
  • Book's scheme: a 2-bit saturating counter
  • Increment when branch is taken
  • Decrement when branch is not taken
  • Don't increment or decrement above or below a
    max/min count
  • Use sign of count as predictor
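The one-bit and two-bit schemes can be compared with a small simulation. This is a sketch under assumptions not taken from the slides: the counter runs from 0 to 3 and predicts taken when it is 2 or more (equivalent to using the sign of a counter ranging from -2 to 1), both predictors start in the not-taken state, and a single predictor entry is modeled rather than a full branch history table. The class names are hypothetical.

```python
class OneBitPredictor:
    """Predict whatever the branch did last time."""

    def __init__(self):
        self.last_taken = False

    def predict(self):
        return self.last_taken

    def update(self, taken):
        self.last_taken = taken


class TwoBitPredictor:
    """2-bit saturating counter: 0..3, predict taken when count >= 2."""

    def __init__(self):
        self.count = 0

    def predict(self):
        return self.count >= 2

    def update(self, taken):
        if taken:
            self.count = min(3, self.count + 1)
        else:
            self.count = max(0, self.count - 1)


def accuracy(predictor, outcomes):
    """Fraction of outcomes predicted correctly, updating after each branch."""
    correct = 0
    for taken in outcomes:
        if predictor.predict() == taken:
            correct += 1
        predictor.update(taken)
    return correct / len(outcomes)


loop_branch = ([True] * 9 + [False]) * 10  # taken 9 times, then falls through
alternating = [True, False] * 10

print(accuracy(OneBitPredictor(), loop_branch))
print(accuracy(TwoBitPredictor(), loop_branch))
print(accuracy(OneBitPredictor(), alternating))  # 0.0
```

On the loop-style branch the two-bit counter is wrong only once per loop exit after warm-up, while the one-bit predictor is wrong twice per loop.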

131
Book's 2 Bit Branch Counter
132
Computing Performance
  • Program assumptions
  • 23% loads, and in ½ of cases, next instruction
    uses load value
  • 13% stores
  • 19% conditional branches
  • 2% unconditional branches
  • 43% other
  • Machine Assumptions
  • 5 stage pipe with all forwarding
  • Only penalty is 1 cycle on use of load value
    immediately after a load
  • Jumps are totally resolved in ID stage for a 1
    cycle branch penalty
  • 75% branch prediction accuracy
  • 1 cycle delay on misprediction

133
The Answer
  • CPI penalty calculation
  • Loads
  • 50% of the 23% of loads have 1 cycle penalty:
    0.5 × 0.23 = 0.115
  • Jumps
  • All of the 2% of jumps have 1 cycle penalty:
    0.02 × 1 = 0.02
  • Conditional Branches
  • 25% of the 19% are mispredicted for a 1 cycle
    penalty: 0.25 × 0.19 × 1 = 0.0475
  • Total Penalty: 0.115 + 0.02 + 0.0475 = 0.1825
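The penalty calculation can be checked in a few lines. The final CPI figure assumes an ideal CPI of 1, as on the earlier stall slides; that last step is an extrapolation and does not appear on this slide. Variable names are hypothetical.

```python
# Instruction mix and machine assumptions from the previous slide.
load_penalty = 0.5 * 0.23 * 1     # half of the loads stall for 1 cycle
jump_penalty = 0.02 * 1           # every jump costs 1 cycle
branch_penalty = 0.25 * 0.19 * 1  # 25% of conditional branches mispredicted

total_penalty = load_penalty + jump_penalty + branch_penalty
cpi = 1 + total_penalty           # assumes ideal CPI = 1

print(round(total_penalty, 4))  # 0.1825
print(round(cpi, 4))            # 1.1825
```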

134
Exception Hazards
  • 40hex sub $11, $2, $4
  • 44hex and $12, $2, $5
  • 48hex or $13, $6, $2
  • 4Chex add $1, $2, $1 (overflow in EX stage)
  • 50hex slt $15, $6, $7 (already in ID stage)
  • 54hex lw $16, 50($7) (already in IF stage)
  • 40000040hex sw $25, 1000($0) exception handler
  • 40000044hex sw $26, 1004($0)
  • Need to transfer control to exception handler
    ASAP
  • Don't want invalid data to contaminate registers
    or memory
  • Need to flush instructions already in the
    pipeline
  • Start fetching instructions from 40000040hex
  • Save addr. following offending instruction
    (50hex) in TrapPC (EPC)
  • Don't clobber $1: use for debugging

135
Flushing pipeline after exception
  • Cycle 6
  • Exception detected, flush signals generated,
    bubbles injected
  • Cycle 7
  • 3 bubbles appear in ID, EX, MEM stages
  • PC gets 40000040hex, TrapPC gets 50hex

136
Managing exception hazards gets much worse!
  • Different exception types may occur in different
    stages
  • Challenge is to associate exception with proper
    instruction: difficult!
  • Relax this requirement in non-critical cases:
    imprecise exceptions
  • Most machines use precise exceptions
  • Further challenge: exceptions can happen at same
    time

137
Discussion
  • How does instruction set design impact
    pipelining?
  • Does increasing the depth of pipelining always
    increase performance?

138
Comparative Performance
  • Throughput: instructions per clock cycle = 1/CPI
  • Pipeline has fast throughput and fast clock rate
  • Latency: inherent execution time, in cycles
  • High latency for pipelining causes problems
  • Increased time to resolve hazards

139
Summary
  • Performance
  • Execution time or throughput
  • Amdahl's law
  • Multi-bus/multi-unit circuits
  • one long clock cycle or N shorter cycles
  • Pipelining
  • overlap independent tasks
  • Pipelining in processors
  • hazards limit opportunities for overlap