Lecture 5, Section A.8: Branch Hazards and Dynamic Scheduling via Scoreboarding


Provided by: juny8

Learn more at: http://www.cs.ucr.edu


1
Lecture 5, Section A.8: Branch Hazards and Dynamic Scheduling via Scoreboarding
CS 203A Advanced Computer Architecture

Instructor: L.N. Bhuyan

2
Control Hazards

Branch problem: branches are resolved in the EX stage => 2-cycle penalty on taken branches.
Ideal CPI = 1. Assuming a 2-cycle penalty for all branches and 32% branch instructions => new CPI = 1 + 0.32 × 2 = 1.64.
Solutions:
Reduce branch penalty: change the datapath; new adder needed in ID stage.
Fill branch delay slot(s) with a useful instruction.
Fixed branch prediction.
Static branch prediction.
Dynamic branch prediction.
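
The CPI arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: the helper name effective_cpi and its parameters are made up for illustration, and the 1-cycle case assumes the branch is resolved in ID, as the next slide describes.

```python
def effective_cpi(ideal_cpi, branch_fraction, penalty_cycles):
    """CPI when every branch pays a fixed stall penalty (illustrative helper)."""
    return ideal_cpi + branch_fraction * penalty_cycles

# Slide values: ideal CPI of 1, 32% branches, 2-cycle penalty (resolved in EX).
cpi_ex = effective_cpi(1.0, 0.32, 2)
# Assumed variant: branch resolved in ID, so the penalty drops to 1 cycle.
cpi_id = effective_cpi(1.0, 0.32, 1)
print(f"{cpi_ex:.2f} {cpi_id:.2f}")
```

Under these assumptions, halving the penalty removes half of the branch overhead, which is the motivation for the extra adder in the ID stage.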

3
Control Hazards: branch delay slots

Reduced branch penalty:
Compute condition and target address in the ID stage: 1-cycle stall.
Target and condition are computed even when the instruction is not a branch.
Branch delay slot filling:
Move an instruction into the slot right after the branch, hoping that its execution is necessary.
Three alternatives (next slide).
Limitations: restrictions on which instructions can be rescheduled; compile-time prediction of taken or untaken branches.

4
Example: Nondelayed vs. Delayed Branch
Nondelayed Branch
or M8, M9, M10
add M1, M2, M3
sub M4, M5, M6
beq M1, M4, Exit
xor M10, M1, M11
Exit:
5
Control Hazards: Branch Prediction

Idea: doing something is better than waiting around doing nothing.
Guess branch target, start executing at guessed position.
Execute branch, verify (check) your guess.
Minimizes penalty if the guess is right (to zero).
May increase penalty for wrong guesses.
Heavily researched area in the last 15 years.
Fixed branch prediction.
Each of these strategies must be applied to all branch instructions indiscriminately.
Predict not-taken (47% actually not taken):
continue to fetch instructions without stalling;
do not change any state (no register write);
if the branch is taken, turn the fetched instruction into a no-op and restart fetch at the target address: 1-cycle penalty.

6
Control Hazards: Branch Prediction

Predict taken (53% actually taken): more difficult, must know the target before the branch is decoded. No advantage in our simple 5-stage pipeline.
Static branch prediction.
Opcode-based: prediction based on the opcode itself and the related condition. Examples: MC 88110, PowerPC 601/603.
Displacement-based: if d < 0 predict taken, if d > 0 predict not taken. Examples: Alpha 21064 (as option), PowerPC 601/603 for regular conditional branches.
Compiler-directed: the compiler sets or clears a predict bit in the instruction itself. Examples: AT&T 9210 Hobbit, PowerPC 601/603 (predict bit reverses opcode or displacement predictions), HP PA 8000 (as option).
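
The displacement-based rule is simple enough to sketch in Python. The branch trace below is invented purely for illustration (it is not measured data), and treating d = 0 as "not taken" is an assumption, since the slide does not cover that case.

```python
def predict_taken_by_displacement(d):
    """Static rule from the slide: backward branch (d < 0) => predict taken,
    forward branch (d > 0) => predict not taken.
    Assumption: d == 0 is treated as not taken."""
    return d < 0

# Hypothetical trace of (displacement, actually_taken) pairs: a backward loop
# branch taken 9 times before falling through, plus two forward branches.
trace = [(-16, True)] * 9 + [(-16, False), (24, False), (40, True)]

correct = sum(predict_taken_by_displacement(d) == taken for d, taken in trace)
print(f"{correct}/{len(trace)} predicted correctly")
```

The rule does well on the loop-closing branch and mispredicts only the loop exit and the taken forward branch in this made-up trace.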

7
Control Hazards: Branch Prediction

Dynamic branch prediction.
Based on the history of a particular branch (covered later).

8
MIPS R4000 pipeline
9
MIPS FP Pipe Stages

FP instruction    1   2     3           4      5     6     7     8 ...
Add, Subtract     U   S+A   A+R         R+S
Multiply          U   E+M   M           M      M     N     N+A   R
Divide            U   A     R           D^28   ...   D+A   D+R, D+R, D+A, D+R, A, R
Square root       U   E     (A+R)^108   ...    A     R
Negate            U   S
Absolute value    U   S
FP compare        U   A     R
(A superscript such as D^28 is a repeat count for that stage.)

Stages:
A  Mantissa ADD stage
D  Divide pipeline stage
E  Exception test stage
M  First stage of multiplier
N  Second stage of multiplier
R  Rounding stage
S  Operand shift stage
U  Unpack FP numbers
10
R4000 Performance

Does not reach the ideal CPI of 1 because of:
Load stalls (1 or 2 clock cycles)
Branch stalls (2 cycles + unfilled slots)
FP result stalls: RAW data hazard (latency)
FP structural stalls: not enough FP hardware (parallelism)

11
FP Loop: Where are the Hazards?

Loop: LD   F0,0(R1)   ; F0 = vector element
      ADDD F4,F0,F2   ; add scalar from F2
      SD   0(R1),F4   ; store result
      SUBI R1,R1,8    ; decrement pointer 8B (DW)
      BNEZ R1,Loop    ; branch if R1 != zero
      NOP             ; delayed branch slot

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0
Integer op                     Integer op                 0

Where are the stalls?

12
FP Loop Showing Stalls
1 Loop: LD   F0,0(R1)   ; F0 = vector element
2       stall
3       ADDD F4,F0,F2   ; add scalar in F2
4       stall
5       stall
6       SD   0(R1),F4   ; store result
7       SUBI R1,R1,8    ; decrement pointer 8B (DW)
8       BNEZ R1,Loop    ; branch if R1 != zero
9       stall           ; delayed branch slot

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1

9 clocks. Rewrite code to minimize stalls?

13
Minimizing Stalls, Technique 1: Compiler Optimization
1 Loop: LD   F0,0(R1)
2       stall
3       ADDD F4,F0,F2
4       SUBI R1,R1,8
5       BNEZ R1,Loop    ; delayed branch
6       SD   8(R1),F4   ; offset altered when moved past SUBI
Swap BNEZ and SD by changing the address of SD.

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1

6 clocks
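
The 9-clock and 6-clock counts can be reproduced with a small in-order issue model. This is a sketch under stated assumptions, not the lecture's method: the schedule function, the instruction tuples, and the "kind" labels are all invented here, producer/consumer pairs missing from the latency table are assumed to have latency 0, and the branch is treated as an integer op.

```python
# Latency = stall cycles needed between a producer and a dependent consumer.
# Values come from the slide's table; missing pairs are ASSUMED to be 0.
LATENCY = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
    ("int", "int"): 0,
}

def schedule(instrs):
    """Issue cycle of each instruction for in-order, single-issue execution.

    Each instruction is (name, kind, dest_register_or_None, source_registers).
    """
    issue = []        # issue cycle per instruction
    last_writer = {}  # register -> index of the instruction that writes it
    for i, (_name, kind, dest, srcs) in enumerate(instrs):
        cycle = issue[-1] + 1 if issue else 1
        for reg in srcs:
            if reg in last_writer:
                p = last_writer[reg]
                lat = LATENCY.get((instrs[p][1], kind), 0)
                cycle = max(cycle, issue[p] + lat + 1)
        issue.append(cycle)
        if dest:
            last_writer[dest] = i
    return issue

original = [
    ("LD",   "load",   "F0", ["R1"]),
    ("ADDD", "fp_alu", "F4", ["F0", "F2"]),
    ("SD",   "store",  None, ["F4", "R1"]),
    ("SUBI", "int",    "R1", ["R1"]),
    ("BNEZ", "int",    None, ["R1"]),   # branch modeled as an integer op
    ("NOP",  "int",    None, []),       # unfilled delay slot
]
rescheduled = [
    ("LD",   "load",   "F0", ["R1"]),
    ("ADDD", "fp_alu", "F4", ["F0", "F2"]),
    ("SUBI", "int",    "R1", ["R1"]),
    ("BNEZ", "int",    None, ["R1"]),
    ("SD",   "store",  None, ["F4", "R1"]),  # fills the delay slot
]
orig_cycles = schedule(original)
new_cycles = schedule(rescheduled)
print(orig_cycles[-1], new_cycles[-1])
```

The last issue cycle of each schedule is the loop's clock count per iteration; moving SUBI and BNEZ between ADDD and SD hides the two-cycle FP-to-store latency and fills the delay slot.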

14
HW Schemes: Instruction Parallelism

Compiler or static instruction scheduling can avoid some pipeline hazards,
e.g. filling the branch delay slot.
Why in HW at run time?
Works when dependences can't be known at compile time.
WAW can only be detected at run time.
Compiler is simpler.
Code for one machine runs well on another.
Key idea: allow instructions behind a stall to proceed.
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F8,F8,F14
Enables out-of-order execution => out-of-order completion.
But both structural and data hazards are checked in the ID stage in MIPS:
ADDD is stalled at ID, so SUBD cannot even proceed to ID.

15
HW Schemes: Instruction Parallelism

Out-of-order execution divides the ID stage:
1. Issue: decode instructions, check for structural hazards. Issue in order if the functional unit is free and there is no WAW hazard.
2. Read operands (RO): wait until no data hazards, then read operands.
ADDD would stall at RO, and SUBD could proceed with no stalls.
Scoreboards allow an instruction to execute whenever 1 & 2 hold, not waiting for prior instructions.

Focusing on FP operations: assume no MEM stage.
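
To see why ADDD must wait while SUBD can go ahead, it helps to classify the dependences in the DIVD/ADDD/SUBD sequence. The helper below is an illustrative sketch; the function name and the (dest, src1, src2) tuple layout are assumptions, not part of the lecture.

```python
def hazards(earlier, later):
    """Data hazards between an earlier and a later instruction.

    Each instruction is a (dest, src1, src2) tuple of register names.
    """
    e_dest, *e_srcs = earlier
    l_dest, *l_srcs = later
    found = set()
    if e_dest in l_srcs:
        found.add("RAW")   # later reads what earlier writes
    if l_dest in e_srcs:
        found.add("WAR")   # later writes what earlier still has to read
    if l_dest == e_dest:
        found.add("WAW")   # both write the same register
    return found

divd = ("F0", "F2", "F4")     # DIVD F0,F2,F4
addd = ("F10", "F0", "F8")    # ADDD F10,F0,F8
subd = ("F8", "F8", "F14")    # SUBD F8,F8,F14

print(hazards(divd, addd))    # RAW on F0: ADDD waits for DIVD
print(hazards(addd, subd))    # WAR on F8: SUBD must not overwrite F8 too early
print(hazards(divd, subd))    # empty: SUBD does not depend on DIVD
```

SUBD has no dependence on the long-running DIVD, so it can read its operands and execute; the WAR on F8 against ADDD only constrains when SUBD may write its result.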
16
Scoreboard Implications

Out-of-order completion => WAR, WAW hazards.
Solutions for WAR:
CDC 6600: stall the write to allow reads to take place; read registers only during the Read Operands stage.
Tomasulo: register renaming.
For WAW, must detect the hazard: stall in the Issue stage until the other instruction completes.
Need to have multiple instructions in the execution phase => multiple execution units or pipelined execution units.
Scoreboard replaces ID with 2 stages (Issue and RO).
Scoreboard keeps track of dependencies and the state of operations.
Monitors every change in the hardware.
Determines when to read operands, when to execute, and when to write back.
Hazard detection and resolution are centralized.

17
Four Stages of Scoreboard Control

1. Issue: decode instructions and check for structural hazards (ID1).
If a functional unit for the instruction is free and no other active instruction has the same destination register (WAW), the scoreboard issues the instruction to the functional unit and updates its internal data structure. If a structural or WAW hazard exists, then the instruction issue stalls, and no further instructions will issue until these hazards are cleared.
2. Read operands: wait until no data hazards, then read operands (ID2).
A source operand is available if no earlier issued active instruction is going to write it, or if the register containing the operand is being written by a currently active functional unit. When the source operands are available, the scoreboard tells the functional unit to proceed to read the operands from the registers and begin execution. The scoreboard resolves RAW hazards dynamically in this step, and instructions may be sent into execution out of order.

18
Four Stages of Scoreboard Control

3. Execution: operate on operands (EX).
The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution.
4. Write result: finish execution (WB).
Once the scoreboard is aware that the functional unit has completed execution, the scoreboard checks for WAR hazards. If there are none, it writes the result. If there is a WAR hazard, it stalls the instruction.
Example:
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F8,F8,F14
The CDC 6600 scoreboard would stall SUBD's write to F8 until ADDD has read its operands.

19
Three Parts of the Scoreboard

1. Instruction status: which of the 4 steps the instruction is in.
2. Functional unit status: indicates the state of the functional unit (FU). 9 fields for each functional unit:
Busy: indicates whether the unit is busy or not
Op: operation to perform in the unit (e.g., + or -)
Fi: destination register
Fj, Fk: source-register numbers
Qj, Qk: functional units producing source registers Fj, Fk
Rj, Rk: flags indicating when Fj, Fk are ready and not yet read. Set to No after the operands are read.
3. Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.

20
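Detailed Scoreboard Pipeline Control

Issue
Wait until: not Busy(FU) and not Result(D) (the Result(D) test is the WAW check)
Bookkeeping: Busy(FU) ← yes; Op(FU) ← op; Fi(FU) ← D; Fj(FU) ← S1; Fk(FU) ← S2; Qj ← Result(S1); Qk ← Result(S2); Rj ← not Qj; Rk ← not Qk; Result(D) ← FU

Read operands
Wait until: Rj and Rk
Bookkeeping: Rj ← No; Rk ← No

Execution complete
Wait until: functional unit done

Write result
Wait until: for all f, (Fj(f) != Fi(FU) or Rj(f) = No) and (Fk(f) != Fi(FU) or Rk(f) = No) (the WAR check)
Bookkeeping: for all f, if Qj(f) = FU then Rj(f) ← Yes; for all f, if Qk(f) = FU then Rk(f) ← Yes; Result(Fi(FU)) ← 0; Busy(FU) ← No

See A.55 on page A-76.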
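
The wait-until conditions and bookkeeping above translate fairly directly into code. The following is a minimal sketch, not the CDC 6600 design: the class and method names are invented, execution latency is not modeled, the machine is assumed to have two adders (hypothetical units Add1 and Add2) so that ADDD and SUBD can be in flight together, and clearing Qj/Qk when a producer writes is a tidy-up that the slide does not show.

```python
from dataclasses import dataclass

@dataclass
class FU:
    """Functional unit status: the 9 fields from the slide."""
    busy: bool = False
    op: str = ""
    fi: str = ""      # destination register
    fj: str = ""      # source registers
    fk: str = ""
    qj: str = ""      # unit producing Fj / Fk ("" if none)
    qk: str = ""
    rj: bool = False  # source ready and not yet read
    rk: bool = False

class Scoreboard:
    def __init__(self, unit_names):
        self.fu = {name: FU() for name in unit_names}
        self.result = {}  # register result status: register -> writing unit

    def can_issue(self, unit, dest):
        # Structural hazard: unit busy. WAW hazard: dest has a pending write.
        return not self.fu[unit].busy and dest not in self.result

    def issue(self, unit, op, dest, s1, s2):
        qj, qk = self.result.get(s1, ""), self.result.get(s2, "")
        self.fu[unit] = FU(True, op, dest, s1, s2, qj, qk, not qj, not qk)
        self.result[dest] = unit

    def can_read(self, unit):
        return self.fu[unit].rj and self.fu[unit].rk  # RAW check

    def read_operands(self, unit):
        self.fu[unit].rj = self.fu[unit].rk = False

    def can_write(self, unit):
        fi = self.fu[unit].fi
        # WAR check: no unit still has to read the old value of Fi.
        return all((f.fj != fi or not f.rj) and (f.fk != fi or not f.rk)
                   for f in self.fu.values())

    def write_result(self, unit):
        for f in self.fu.values():
            if f.qj == unit:
                f.rj, f.qj = True, ""   # clearing Qj is an added tidy-up
            if f.qk == unit:
                f.rk, f.qk = True, ""
        del self.result[self.fu[unit].fi]
        self.fu[unit] = FU()

# Hypothetical machine with one divider and two adders.
sb = Scoreboard(["Divide", "Add1", "Add2"])
sb.issue("Divide", "DIVD", "F0", "F2", "F4")
sb.issue("Add1", "ADDD", "F10", "F0", "F8")
sb.issue("Add2", "SUBD", "F8", "F8", "F14")
sb.read_operands("Divide")
addd_ready_early = sb.can_read("Add1")    # RAW on F0: not ready yet
sb.read_operands("Add2")                  # SUBD proceeds past ADDD
subd_write_early = sb.can_write("Add2")   # WAR on F8: must wait
sb.write_result("Divide")                 # F0 becomes available
sb.read_operands("Add1")                  # ADDD reads F0 and F8
subd_write_late = sb.can_write("Add2")    # WAR cleared
print(addd_ready_early, subd_write_early, subd_write_late)
```

In this run SUBD reads its operands ahead of ADDD but cannot write F8 until ADDD has read the old value, which matches the DIVD/ADDD/SUBD example from the write-result stage.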
21
Scoreboard Example

The following latencies illustrate behavior; they are not representative:
LD is 1 cycle (compute address + data cache access)
ADDDs and SUBDs are 2 cycles
Multiply is 10 cycles
Divide is 40 cycles

22
Scoreboard Example
23
Scoreboard Example Cycle 1
24
Scoreboard Example Cycle 2
Note: can't issue I2 because the Integer unit is busy. Can't issue the next instruction either, due to in-order issue.
25
Scoreboard Example Cycle 3
26
Scoreboard Example Cycle 4
27
Scoreboard Example Cycle 5
Now I2 is issued
28
Scoreboard Example Cycle 6
29
Scoreboard Example Cycle 7
I3 stalled at read because I2 isn't complete
30
Scoreboard Example Cycle 8
31
Scoreboard Example Cycle 9
Note: I3 and I4 read operands because F2 is now available. ADDD (I6) can't be issued because SUBD (I4) uses the adder.
32
Scoreboard Example Cycle 11
Note: Add takes 2 cycles, so nothing happens in cycle 10. MUL continues.
33
Scoreboard Example Cycle 12
34
Scoreboard Example Cycle 13
Now ADDD is issued because SUBD has completed
35
Scoreboard Example Cycle 14
36
Scoreboard Example Cycle 15
Note: ADDD takes 2 cycles, so no change
37
Scoreboard Example Cycle 16
ADDD completes, but MULTD and DIVD go on
38
Scoreboard Example Cycle 17
ADDD stalls: it can't write back due to the WAR hazard with DIVD. MULT and DIV continue.
39
Scoreboard Example Cycle 18
MULT and DIV continue
40
Scoreboard Example Cycle 19
MULT completes after 10 cycles
41
Scoreboard Example Cycle 20
MULTD completes and writes to F0
42
Scoreboard Example Cycle 21
Now DIVD reads because F0 is available
43
Scoreboard Example Cycle 22
ADDD writes result because WAR is removed.
44
Scoreboard Example Cycle 61
DIVD completes execution
45
Scoreboard Example Cycle 62
Execution is finished
