It's Not That Easy for Computers

CS 641 Fall 2001
Slides: 26
Provided by: richarde67
Learn more at: https://www.cs.umb.edu
Transcript and Presenter's Notes



1
It's Not That Easy for Computers
  • Limits to pipelining: hazards prevent the next
    instruction from executing during its designated
    clock cycle
  • Structural hazards: the HW cannot support this
    combination of instructions (a single person to
    fold and put clothes away)
  • Data hazards: an instruction depends on the result
    of a prior instruction still in the pipeline (the
    missing sock)
  • Control hazards: pipelining of branches and other
    instructions that change the PC; the common solution
    is to stall the pipeline, inserting bubbles, until
    the hazard clears

2
One Memory Port / Structural Hazards (Figure 3.6, Page 142)
[Pipeline diagram: time in clock cycles across, instruction order down;
a Load followed by Instr 1 through Instr 4 contend for the single memory
port]
3
Example: Dual-port vs. Single-port
  • Machine A: dual-ported memory
  • Machine B: single-ported memory, but its
    pipelined implementation has a 1.05 times faster
    clock rate
  • Ideal CPI = 1 for both
  • Loads are 40% of instructions executed
  • SpeedUpA = Pipeline Depth / (1 + 0) x
    (clock_unpipe / clock_pipe)
    = Pipeline Depth
  • SpeedUpB = Pipeline Depth / (1 + 0.4 x 1)
    x (clock_unpipe / (clock_unpipe / 1.05))
    = (Pipeline Depth / 1.4) x 1.05
    = 0.75 x Pipeline Depth
  • SpeedUpA / SpeedUpB = Pipeline Depth /
    (0.75 x Pipeline Depth) = 1.33
  • Machine A is 1.33 times faster
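The arithmetic above can be checked in a few lines of Python. This is a sketch of the slide's formula, Speedup = Pipeline Depth / (1 + stall CPI) x (clock_unpipe / clock_pipe); the function name is ours. Since pipeline depth cancels in the ratio, we report speedup per unit of depth.

```python
# Sketch of the Machine A vs. Machine B speedup arithmetic (names are ours).

def speedup_per_depth(stall_cpi, clock_ratio):
    """Pipelined speedup divided by pipeline depth."""
    return 1.0 / (1.0 + stall_cpi) * clock_ratio

# Machine A: dual-ported memory -> no structural stalls, baseline clock.
speedup_a = speedup_per_depth(stall_cpi=0.0, clock_ratio=1.0)

# Machine B: single port -> each load (40% of instructions) stalls
# 1 cycle, but the clock is 1.05x faster.
speedup_b = speedup_per_depth(stall_cpi=0.4 * 1, clock_ratio=1.05)

print(round(speedup_b, 2))              # 0.75
print(round(speedup_a / speedup_b, 2))  # 1.33
```

Despite its faster clock, Machine B loses because every load pays a structural-hazard stall.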

4
Data Hazard on R1 (Figure 3.9, Page 147)
[Pipeline diagram: time in clock cycles; each instruction passes through
IF, ID/RF, EX, MEM, WB; all four later instructions read r1]
  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or r8,r1,r9
  xor r10,r1,r11
5
Generic Data Hazards
  • InstrI followed by InstrJ
  • Read After Write (RAW): InstrJ tries to read an
    operand before InstrI writes it

6
Generic Data Hazards
  • InstrI followed by InstrJ
  • Write after read (WAR): InstrJ tries to write an
    operand before InstrI reads it
  • Write after write (WAW): InstrJ tries to write an
    operand before InstrI writes it
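The three hazard classes amount to intersections of each instruction's read and write register sets. A minimal illustrative sketch (the function name and instruction encoding are ours, not from the slides):

```python
# Classify the dependence between InstrI and a later InstrJ
# by intersecting their read/write register sets.

def classify_hazards(i_reads, i_writes, j_reads, j_writes):
    """Return the set of hazard names for the pair (InstrI, InstrJ)."""
    hazards = set()
    if set(i_writes) & set(j_reads):
        hazards.add("RAW")  # J reads what I writes
    if set(i_reads) & set(j_writes):
        hazards.add("WAR")  # J writes what I reads
    if set(i_writes) & set(j_writes):
        hazards.add("WAW")  # J writes what I also writes
    return hazards

# add r1,r2,r3 followed by sub r4,r1,r3: sub needs r1 before add writes it back
print(classify_hazards(["r2", "r3"], ["r1"], ["r1", "r3"], ["r4"]))  # {'RAW'}
```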

7
Pipeline stages: Instruction Fetch; Instr. Decode / Reg. Fetch;
Execute / Addr. Calc.; Memory Access; Write Back
  • Data stationary control
  • local decode for each instruction phase /
    pipeline stage

8
Self-modifying sequence
  • 12: LW r2, 40(r0)
  • 16: SW 20(r0), r2
  • 20: LW r3, 50(r0)
  • 40: LW r3, 60(r0)
  • (the SW at address 16 overwrites the instruction at
    address 20 with the word loaded from address 40)

9
Forwarding to Avoid Data Hazard (Figure 3.10, Page 149)
[Pipeline diagram: time in clock cycles; r1 is forwarded from the EX/MEM
stage of the add to the dependent instructions below]
  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or r8,r1,r9
  xor r10,r1,r11
10
HW Change for Forwarding (Figure 3.20, Page 161)
11
Data Hazard Even with Forwarding (Figure 3.12, Page 153)
[Pipeline diagram: the lw result is not available until MEM, so the
immediately following sub must stall one cycle even with forwarding]
  lw r1, 0(r2)
  sub r4,r1,r6
  and r6,r1,r7
  or r8,r1,r9
12
Software Scheduling to Avoid Load Hazards
Try producing fast code for a = b + c; d = e - f,
assuming a, b, c, d, e, and f are in memory.
  • Slow code:
  • LW Rb,b
  • LW Rc,c
  • ADD Ra,Rb,Rc
  • SW a,Ra
  • LW Re,e
  • LW Rf,f
  • SUB Rd,Re,Rf
  • SW d,Rd
  • Fast code:
  • LW Rb,b
  • LW Rc,c
  • LW Re,e
  • ADD Ra,Rb,Rc
  • LW Rf,f
  • SW a,Ra
  • SUB Rd,Re,Rf
  • SW d,Rd
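Under the usual assumption of full forwarding with a 1-cycle load-use penalty, the stall counts for the two schedules can be checked with a small sketch (the instruction encoding and function name are ours):

```python
# Count load-use stalls: a LW whose destination is read by the very next
# instruction costs one stall cycle; forwarding covers every other case.
# Instructions are (op, dest, sources) tuples.

def count_load_stalls(code):
    stalls = 0
    for prev, cur in zip(code, code[1:]):
        op, dest, _ = prev
        if op == "LW" and dest in cur[2]:  # load result needed next cycle
            stalls += 1
    return stalls

slow = [("LW", "Rb", []), ("LW", "Rc", []), ("ADD", "Ra", ["Rb", "Rc"]),
        ("SW", None, ["Ra"]), ("LW", "Re", []), ("LW", "Rf", []),
        ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, ["Rd"])]

fast = [("LW", "Rb", []), ("LW", "Rc", []), ("LW", "Re", []),
        ("ADD", "Ra", ["Rb", "Rc"]), ("LW", "Rf", []), ("SW", None, ["Ra"]),
        ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, ["Rd"])]

print(count_load_stalls(slow))  # 2
print(count_load_stalls(fast))  # 0
```

Moving each ADD/SUB one slot away from the load that feeds it removes both stalls.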

13
Control Hazard on Branches
14
Branch Stall Impact
  • If CPI 1, 30 branch, Stall 3 cycles gt new CPI
    1.9!
  • Two part solution
  • Determine branch taken or not sooner, AND
  • Compute taken branch address earlier
  • DLX branch tests if register 0 or not 0
  • DLX Solution
  • Move Zero test to ID/RF stage
  • Adder to calculate new PC in ID/RF stage
  • 1 clock cycle penalty for branch versus 3

15
Pipelined DLX Datapath (Figure 3.22, Page 163)
[Datapath diagram: Instruction Fetch; Instr. Decode / Reg. Fetch;
Execute / Addr. Calc.; Memory Access; Write Back]
This is the correct 1-cycle-latency implementation!
16
Four Branch Hazard Alternatives
  • 1: Stall until branch direction is clear
  • 2: Predict Branch Not Taken
  • Execute successor instructions in sequence
  • Squash instructions in the pipeline if the branch is
    actually taken
  • Advantage of late pipeline state update
  • 47% of DLX branches are not taken on average
  • PC+4 is already calculated, so use it to fetch the
    next instruction
  • 3: Predict Branch Taken
  • 53% of DLX branches are taken on average
  • But the branch target address hasn't been calculated
    yet in DLX
  • DLX still incurs a 1-cycle branch penalty
  • On other machines the branch target is known before
    the outcome

17
Four Branch Hazard Alternatives
  • 4: Delayed Branch
  • Define the branch to take place AFTER the following
    instruction(s):
      branch instruction
      sequential successor_1
      sequential successor_2
      ........
      sequential successor_n    (branch delay of length n)
      branch target if taken
  • A 1-slot delay allows a proper decision and branch
    target address in the 5-stage pipeline
  • DLX uses this
18
Delayed Branch
  • Where to get instructions to fill the branch delay
    slot?
  • From before the branch instruction
  • From the target address: only valuable when the
    branch is taken
  • From the fall-through: only valuable when the
    branch is not taken
  • Cancelling branches allow more slots to be
    filled
  • Compiler effectiveness for a single branch delay
    slot:
  • Fills about 60% of branch delay slots
  • About 80% of instructions executed in branch
    delay slots are useful in computation
  • About 50% (60% x 80%) of slots usefully filled

19
Evaluating Branch Alternatives

  Scheduling scheme    Branch penalty   CPI    Speedup v. unpipelined   Speedup v. stall
  Stall pipeline       3                1.42   3.5                      1.0
  Predict taken        1                1.14   4.4                      1.26
  Predict not taken    1                1.09   4.5                      1.29
  Delayed branch       0.5              1.07   4.6                      1.31

  • Conditional + unconditional branches = 14% of
    instructions; 65% of them change the PC
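The CPI column follows from CPI = 1 + branch frequency x effective penalty, using the 14% branch frequency from the last bullet. A sketch (names are ours; predict-not-taken is omitted because its effective penalty depends on the taken fraction):

```python
# Reproduce the CPI column of the branch-alternatives table,
# assuming CPI = 1 + BRANCH_FREQ * penalty with a 14% branch frequency.

BRANCH_FREQ = 0.14

def branch_cpi(penalty):
    """Effective CPI when each branch costs `penalty` extra cycles."""
    return 1.0 + BRANCH_FREQ * penalty

print(round(branch_cpi(3), 2))    # 1.42 - stall pipeline
print(round(branch_cpi(1), 2))    # 1.14 - predict taken
print(round(branch_cpi(0.5), 2))  # 1.07 - delayed branch
```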

20
Pipelining Introduction Summary
  • Just overlap tasks; easy if the tasks are
    independent
  • Speedup <= Pipeline Depth; if ideal CPI is 1,
    then:

    Speedup = (Pipeline Depth / (1 + Pipeline stall CPI))
              x (Clock Cycle Unpipelined / Clock Cycle Pipelined)

  • Hazards limit performance on computers:
  • Structural: need more HW resources
  • Data (RAW, WAR, WAW): need forwarding, compiler
    scheduling
  • Control: delayed branch, prediction
21
Pipeline Performance (Cont'd)
  • For a non-pipelined machine with k stages and n
    instructions: T_np(k, n) = nk
  • For a pipelined machine: T_p(k, n) = n + (k - 1)
  • So the speed-up (SU) is:

    SU = T_np(k, n) / T_p(k, n) = nk / (n + (k - 1)) ≈ n for k >> n
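The formulas above in code (one stage per clock cycle, no hazards; function names are ours). The same expression also approaches k when n >> k, the usual "deep pipeline, long program" limit:

```python
# Ideal pipeline speedup: k stages, n instructions.

def t_nonpipelined(k, n):
    return n * k            # each instruction occupies all k stages serially

def t_pipelined(k, n):
    return n + (k - 1)      # fill the pipe once, then finish 1 per cycle

def speedup(k, n):
    return t_nonpipelined(k, n) / t_pipelined(k, n)

print(round(speedup(5, 1000), 2))  # approaches k (= 5) for n >> k
print(round(speedup(1000, 5), 2))  # approaches n (= 5) for k >> n
```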
22
Stalling
© 1998 Morgan Kaufmann Publishers
  • We can stall the pipeline by keeping an
    instruction in the same stage
[Pipeline diagram: a bubble inserted between the pipeline registers]
23
Branch Hazards
© 1998 Morgan Kaufmann Publishers
  • When we decide to branch, other instructions are
    already in the pipeline!
[Pipeline diagram]
24
Improving Performance
  • One way to avoid stalls is to reorder
    instructions
  • Another is to add a branch delay slot:
  • The next instruction after a branch is always
    executed
  • The compiler fills the slot with something useful
  • A third is a superpipelined machine, which
    simply means a longer pipeline
  • A fourth is to build a superscalar machine,
    which starts more than one instruction
    in the same cycle

25
Dynamic Scheduling
© 1998 Morgan Kaufmann Publishers
  • The hardware performs the scheduling:
  • Hardware tries to find instructions to execute
  • Out-of-order execution is possible
  • Speculative execution and dynamic branch
    prediction
  • All modern processors are very complicated:
  • DEC Alpha 21264: 9-stage pipeline, 6-instruction
    issue
  • PowerPC and Pentium: branch history table
  • Compiler technology is important