Lecture 5 Section A.8 Branch Hazards and Dynamic Scheduling via scoreboarding - PowerPoint PPT Presentation

Slides: 46
Provided by: juny8
Learn more at: http://www.cs.ucr.edu
1
Lecture 5, Section A.8: Branch Hazards and Dynamic
Scheduling via Scoreboarding
CS 203A Advanced Computer Architecture
  • Instructor: L.N. Bhuyan

2
Control Hazards
  • Branch problem:
  • Branches are resolved in the EX stage
  • ⇒ 2-cycle penalty on taken branches
  • Ideal CPI = 1. Assuming a 2-cycle penalty for all branches
    and 32% branch instructions ⇒ new CPI = 1 +
    0.32 × 2 = 1.64
  • Solutions:
  • Reduce the branch penalty: change the datapath; a new
    adder is needed in the ID stage.
  • Fill branch delay slot(s) with a useful
    instruction.
  • Fixed branch prediction.
  • Static branch prediction.
  • Dynamic branch prediction.
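
The CPI figure on this slide can be reproduced with a short calculation. This is a minimal sketch; the function and parameter names are illustrative and not from the lecture, and it assumes every branch pays the same fixed penalty.

```python
def cpi_with_branch_stalls(ideal_cpi, branch_fraction, penalty_cycles):
    """Effective CPI when every branch pays the same stall penalty."""
    return ideal_cpi + branch_fraction * penalty_cycles


# Numbers from the slide: ideal CPI 1, 32% branches, 2-cycle penalty.
cpi = cpi_with_branch_stalls(ideal_cpi=1.0, branch_fraction=0.32, penalty_cycles=2)
print(round(cpi, 2))  # 1.64
```

Moving branch resolution into ID lowers `penalty_cycles` to 1 in this model, which is the motivation for the first solution listed above.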

3
Control Hazards: Branch Delay Slots
  • Reduced branch penalty:
  • Compute the condition and target address in the ID
    stage: 1-cycle stall.
  • Target and condition are computed even when the
    instruction is not a branch.
  • Branch delay slot filling:
  • Move an instruction into the slot right after the
    branch, hoping that its execution is necessary.
    Three alternatives (next slide).
  • Limitations: restrictions on which instructions
    can be rescheduled; compile-time prediction of
    taken or untaken branches.

4
Example: Nondelayed vs. Delayed Branch
Nondelayed Branch
      or  M8, M9, M10
      add M1, M2, M3
      sub M4, M5, M6
      beq M1, M4, Exit
      xor M10, M1, M11
Exit:
5
Control Hazards: Branch Prediction
  • Idea: doing something is better than waiting
    around doing nothing
  • Guess the branch target, start executing at the guessed
    position
  • Execute the branch, verify (check) your guess
  • Minimizes the penalty (to zero) if the guess is right
  • May increase the penalty for wrong guesses
  • Heavily researched area in the last 15 years
  • Fixed branch prediction.
  • Each of these strategies must be applied to all
    branch instructions indiscriminately.
  • Predict not-taken (47% actually not taken):
  • continue to fetch instructions without stalling
  • do not change any state (no register write)
  • if the branch is taken, turn the fetched instruction
    into a no-op and restart fetch at the target address:
    1-cycle penalty.
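
As a rough illustration of what predict-not-taken buys, the sketch below charges the 1-cycle penalty only to taken branches. Combining the 32% branch frequency from the earlier CPI example with the 53% taken rate implied by the 47% figure is an assumption made here for illustration; the names are hypothetical.

```python
def cpi_predict_not_taken(ideal_cpi, branch_fraction, taken_fraction, mispredict_penalty):
    """CPI when only taken (mispredicted) branches pay a penalty."""
    return ideal_cpi + branch_fraction * taken_fraction * mispredict_penalty


# Assumed inputs: 32% branches, 53% of them taken, 1-cycle penalty when taken.
cpi_pnt = cpi_predict_not_taken(1.0, 0.32, 0.53, 1)
print(round(cpi_pnt, 4))  # 1.1696
```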

6
Control Hazards: Branch Prediction
  • Predict taken (53%): more difficult, must know the
    target before the branch is decoded. No advantage in
    our simple 5-stage pipeline.
  • Static branch prediction.
  • Opcode-based: prediction based on the opcode itself
    and related condition. Examples: MC 88110,
    PowerPC 601/603.
  • Displacement-based: if d < 0 predict
    taken, if d > 0 predict not taken. Examples:
    Alpha 21064 (as option), PowerPC 601/603 for
    regular conditional branches.
  • Compiler-directed: the compiler sets or
    clears a predict bit in the instruction itself.
    Examples: AT&T 9210 Hobbit, PowerPC 601/603
    (predict bit reverses opcode or displacement
    predictions), HP PA 8000 (as option).
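
The displacement rule and the reversing predict bit can be sketched as two small functions. The names are hypothetical, and treating d = 0 as not taken is an assumption, since the slide only covers d < 0 and d > 0.

```python
def predict_taken_by_displacement(displacement):
    """Backward branches (negative displacement, typically loops) are predicted taken."""
    return displacement < 0  # d == 0 falls into "not taken" (assumption)


def predict_with_reverse_bit(displacement, reverse_bit):
    """A compiler-set bit reverses the displacement-based default prediction."""
    return predict_taken_by_displacement(displacement) != reverse_bit


print(predict_taken_by_displacement(-16))    # True: backward branch
print(predict_taken_by_displacement(8))      # False: forward branch
print(predict_with_reverse_bit(-16, True))   # False: compiler overrides the default
```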

7
Control Hazards: Branch Prediction
  • Dynamic branch prediction
  • Based on the history of a particular branch
    (covered later)

8
MIPS R4000 pipeline
9
MIPS FP Pipe Stages
  FP instruction    1   2     3           4      5     6     7     8
  Add, subtract     U   S+A   A+R         R+S
  Multiply          U   E+M   M           M      M     N     N+A   R
  Divide            U   A     R           D^28   D+A   D+R, D+R, D+A, D+R, A, R
  Square root       U   E     (A+R)^108   A      R
  Negate            U   S
  Absolute value    U   S
  FP compare        U   A     R
  • Stages:
  • A  Mantissa ADD stage
  • D  Divide pipeline stage
  • E  Exception test stage
  • M  First stage of multiplier
  • N  Second stage of multiplier
  • R  Rounding stage
  • S  Operand shift stage
  • U  Unpack FP numbers

10
R4000 Performance
  • Not the ideal CPI of 1, because of:
  • Load stalls (1 or 2 clock cycles)
  • Branch stalls (2 cycles + unfilled slots)
  • FP result stalls: RAW data hazard (latency)
  • FP structural stalls: not enough FP hardware
    (parallelism)

11
FP Loop: Where are the Hazards?
  Loop: LD   F0,0(R1)   ; F0 = vector element
        ADDD F4,F0,F2   ; add scalar from F2
        SD   0(R1),F4   ; store result
        SUBI R1,R1,8    ; decrement pointer 8B (DW)
        BNEZ R1,Loop    ; branch R1 != zero
        NOP             ; delayed branch slot

  Instruction producing result   Instruction using result   Latency in clock cycles
  FP ALU op                      Another FP ALU op          3
  FP ALU op                      Store double               2
  Load double                    FP ALU op                  1
  Load double                    Store double               0
  Integer op                     Integer op                 0
  • Where are the stalls?

12
FP Loop Showing Stalls
  1 Loop: LD   F0,0(R1)   ; F0 = vector element
  2       stall
  3       ADDD F4,F0,F2   ; add scalar in F2
  4       stall
  5       stall
  6       SD   0(R1),F4   ; store result
  7       SUBI R1,R1,8    ; decrement pointer 8B (DW)
  8       BNEZ R1,Loop    ; branch R1 != zero
  9       stall           ; delayed branch slot

  Instruction producing result   Instruction using result   Latency in clock cycles
  FP ALU op                      Another FP ALU op          3
  FP ALU op                      Store double               2
  Load double                    FP ALU op                  1
  • 9 clocks. Rewrite the code to minimize stalls?

13
Minimizing Stalls, Technique 1: Compiler Optimization
  1 Loop: LD   F0,0(R1)
  2       stall
  3       ADDD F4,F0,F2
  4       SUBI R1,R1,8
  5       BNEZ R1,Loop    ; delayed branch
  6       SD   8(R1),F4   ; offset altered when moved past SUBI
Swap BNEZ and SD by changing the address of SD

  Instruction producing result   Instruction using result   Latency in clock cycles
  FP ALU op                      Another FP ALU op          3
  FP ALU op                      Store double               2
  Load double                    FP ALU op                  1
  • 6 clocks
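
The 9-clock and 6-clock counts can be checked with a small in-order issue model. This is a sketch under stated assumptions: one instruction issues per cycle, a consumer issues no earlier than its producer's issue cycle plus the table latency plus one, producer/consumer pairs missing from the table cost 0, and the instruction classes and tuple layout are made up for illustration.

```python
# Latency table from the slides: (producer class, consumer class) -> stall cycles.
LATENCY = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
    ("int", "int"): 0,
}


def issue_cycles(program):
    """program: list of (text, class, dest, sources). Returns the issue cycle of each entry."""
    cycles = []
    last_writer = {}  # register -> index of the latest instruction that writes it
    for i, (_text, kind, dest, sources) in enumerate(program):
        earliest = cycles[-1] + 1 if cycles else 1  # in-order, one issue per cycle
        for src in sources:
            if src in last_writer:
                p = last_writer[src]
                lat = LATENCY.get((program[p][1], kind), 0)  # unknown pairs: 0 (assumption)
                earliest = max(earliest, cycles[p] + lat + 1)
        cycles.append(earliest)
        if dest is not None:
            last_writer[dest] = i
    return cycles


original = [
    ("LD F0,0(R1)", "load", "F0", ["R1"]),
    ("ADDD F4,F0,F2", "fp_alu", "F4", ["F0", "F2"]),
    ("SD 0(R1),F4", "store", None, ["R1", "F4"]),
    ("SUBI R1,R1,8", "int", "R1", ["R1"]),
    ("BNEZ R1,Loop", "branch", None, ["R1"]),
    ("NOP (delay slot)", "nop", None, []),
]
scheduled = [
    ("LD F0,0(R1)", "load", "F0", ["R1"]),
    ("ADDD F4,F0,F2", "fp_alu", "F4", ["F0", "F2"]),
    ("SUBI R1,R1,8", "int", "R1", ["R1"]),
    ("BNEZ R1,Loop", "branch", None, ["R1"]),
    ("SD 8(R1),F4 (delay slot)", "store", None, ["R1", "F4"]),
]

print(issue_cycles(original))   # [1, 3, 6, 7, 8, 9]
print(issue_cycles(scheduled))  # [1, 3, 4, 5, 6]
```

The last entry of each list is the clock count for one iteration: the gaps between consecutive cycles are the stalls shown on the slides.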

14
HW Schemes: Instruction Parallelism
  • Compiler (static) instruction scheduling can
    avoid some pipeline hazards,
  • e.g. filling the branch delay slot.
  • Why in HW at run time?
  • Works when dependences can't be known at compile time
  • WAW can only be detected at run time
  • Simpler compiler
  • Code for one machine runs well on another
  • Key idea: allow instructions behind a stall to
    proceed
  • DIVD F0,F2,F4
  • ADDD F10,F0,F8
  • SUBD F8,F8,F14
  • Enables out-of-order execution ⇒ out-of-order
    completion
  • But both structural and data hazards are checked
    in the ID stage in MIPS:
  • ADDD is stalled at ID, so SUBD cannot even proceed
    to ID.
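
To see why SUBD is a candidate for running ahead, the register dependences among the three instructions can be classified mechanically. A minimal sketch; the `(dest, sources)` tuple encoding and the function name are illustrative assumptions.

```python
def classify_hazards(earlier, later):
    """Hazards that `later` has with respect to `earlier`; each is (dest, sources)."""
    e_dest, e_srcs = earlier
    l_dest, l_srcs = later
    hazards = set()
    if e_dest in l_srcs:
        hazards.add("RAW")  # later reads what earlier writes
    if l_dest in e_srcs:
        hazards.add("WAR")  # later writes what earlier reads
    if l_dest == e_dest:
        hazards.add("WAW")  # both write the same register
    return hazards


divd = ("F0", ("F2", "F4"))    # DIVD F0,F2,F4
addd = ("F10", ("F0", "F8"))   # ADDD F10,F0,F8
subd = ("F8", ("F8", "F14"))   # SUBD F8,F8,F14

print(classify_hazards(divd, addd))  # {'RAW'}: ADDD needs F0 from DIVD
print(classify_hazards(divd, subd))  # set(): SUBD does not depend on DIVD
print(classify_hazards(addd, subd))  # {'WAR'}: SUBD overwrites F8, which ADDD reads
```

SUBD has no true (RAW) dependence on either earlier instruction; the only constraint is that it must not overwrite F8 before ADDD has read it.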

15
HW Schemes: Instruction Parallelism
  • Out-of-order execution divides the ID stage:
  • 1. Issue: decode instructions, check for
    structural hazards. Issue in order if the
    functional unit is free and there is no WAW hazard.
  • 2. Read operands (RO): wait until no data hazards,
    then read operands
  • ADDD would stall at RO, and SUBD could proceed
    with no stalls.
  • Scoreboards allow an instruction to execute whenever
    1 & 2 hold, not waiting for prior instructions.

Focusing on FP operations: assume no MEM stage
16
Scoreboard Implications
  • Out-of-order completion ⇒ WAR, WAW hazards
  • Solutions for WAR:
  • CDC 6600: stall the write to allow reads to take
    place; read registers only during the Read Operands
    stage.
  • Tomasulo: register renaming
  • For WAW, must detect the hazard: stall in the Issue
    stage until the other instruction completes
  • Need to have multiple instructions in the execution
    phase ⇒ multiple execution units or pipelined
    execution units
  • Scoreboard replaces ID with 2 stages (Issue and
    RO)
  • Scoreboard keeps track of dependencies and the state of
    operations
  • Monitors every change in the hardware.
  • Determines when to read operands, when execution can
    start, and when the result can be written back.
  • Hazard detection and resolution is centralized.

17
Four Stages of Scoreboard Control
  • 1. Issue: decode instructions and check for
    structural hazards (ID1)
  • If a functional unit for the instruction is
    free and no other active instruction has the same
    destination register (WAW), the scoreboard issues
    the instruction to the functional unit and
    updates its internal data structure. If a
    structural or WAW hazard exists, then the
    instruction issue stalls, and no further
    instructions will issue until these hazards are
    cleared.
  • 2. Read operands: wait until no data hazards, then
    read operands (ID2)
  • A source operand is available if no earlier
    issued active instruction is going to write it,
    or if the register containing the operand is
    being written by a currently active functional
    unit. When the source operands are available, the
    scoreboard tells the functional unit to proceed
    to read the operands from the registers and begin
    execution. The scoreboard resolves RAW hazards
    dynamically in this step, and instructions may be
    sent into execution out of order.

18
Four Stages of Scoreboard Control
  • 3. Execution: operate on operands (EX)
  • The functional unit begins execution upon
    receiving operands. When the result is ready, it
    notifies the scoreboard that it has completed
    execution.
  • 4. Write result: finish execution (WB)
  • Once the scoreboard is aware that the
    functional unit has completed execution, the
    scoreboard checks for WAR hazards. If there are none, it
    writes the result. If there is a WAR hazard, it stalls the
    instruction.
  • Example:
  • DIVD F0,F2,F4
  • ADDD F10,F0,F8
  • SUBD F8,F8,F14
  • The CDC 6600 scoreboard would stall SUBD's write of F8
    until ADDD reads its operands
19
Three Parts of the Scoreboard
  • 1. Instruction status: which of the 4 steps the
    instruction is in
  • 2. Functional unit status: indicates the state of
    the functional unit (FU). 9 fields for each
    functional unit:
  • Busy: indicates whether the unit is busy or not
  • Op: operation to perform in the unit (e.g., + or -)
  • Fi: destination register
  • Fj, Fk: source-register numbers
  • Qj, Qk: functional units producing source
    registers Fj, Fk
  • Rj, Rk: flags indicating when Fj, Fk are ready
    and not yet read. Set to No after the operands
    are read.
  • 3. Register result status: indicates which
    functional unit will write each register, if one
    exists. Blank when no pending instruction will
    write that register
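
One way to picture the functional unit status and register result status is as plain records. This is only a sketch: the class name, the unit names "Divide" and "Add", and the use of `None` for "blank" are assumptions, and the snapshot shows DIVD F0,F2,F4 followed by ADDD F10,F0,F8 just after both have issued.

```python
from dataclasses import dataclass, fields
from typing import Dict, Optional


@dataclass
class FunctionalUnitStatus:
    busy: bool = False        # Busy
    op: Optional[str] = None  # Op
    fi: Optional[str] = None  # Fi: destination register
    fj: Optional[str] = None  # Fj: first source register
    fk: Optional[str] = None  # Fk: second source register
    qj: Optional[str] = None  # Qj: unit producing Fj (None if none pending)
    qk: Optional[str] = None  # Qk: unit producing Fk (None if none pending)
    rj: bool = False          # Rj: Fj ready and not yet read
    rk: bool = False          # Rk: Fk ready and not yet read


NUM_FIELDS = len(fields(FunctionalUnitStatus))  # the 9 fields listed on the slide

# Snapshot after DIVD and ADDD have issued (hypothetical unit names).
divide = FunctionalUnitStatus(True, "DIVD", "F0", "F2", "F4", None, None, True, True)
add = FunctionalUnitStatus(True, "ADDD", "F10", "F0", "F8", "Divide", None, False, True)

# Register result status: which unit will write each register.
register_result: Dict[str, Optional[str]] = {"F0": "Divide", "F10": "Add"}

print(NUM_FIELDS)      # 9
print(add.qj, add.rj)  # Divide False: ADDD is waiting for F0
```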

20
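Detailed Scoreboard Pipeline Control
(A.55 on page A-76)
  • Issue
    Wait until: not Busy(FU) and not Result(D)   [not Result(D) is the WAW check]
    Bookkeeping: Busy(FU) ← Yes; Op(FU) ← op; Fi(FU) ← D; Fj(FU) ← S1;
      Fk(FU) ← S2; Qj ← Result(S1); Qk ← Result(S2); Rj ← not Qj;
      Rk ← not Qk; Result(D) ← FU
  • Read operands
    Wait until: Rj and Rk
    Bookkeeping: Rj ← No; Rk ← No
  • Execution complete
    Wait until: functional unit done
  • Write result
    Wait until: ∀f ((Fj(f) ≠ Fi(FU) or Rj(f) = No) and
      (Fk(f) ≠ Fi(FU) or Rk(f) = No))   [WAR check]
    Bookkeeping: ∀f (if Qj(f) = FU then Rj(f) ← Yes);
      ∀f (if Qk(f) = FU then Rk(f) ← Yes);
      Result(Fi(FU)) ← 0; Busy(FU) ← No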
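
The wait conditions and bookkeeping can be exercised on the DIVD/ADDD/SUBD example with a small model. This is a sketch rather than the CDC 6600 logic: the class and method names are hypothetical, the second adder ("Add2") is an assumption so that SUBD can issue, execution timing is not modeled, and clearing Qj/Qk when operands are read is an added assumption to avoid stale producer tags.

```python
class Scoreboard:
    def __init__(self, unit_names):
        self.fu = {n: dict(busy=False, op=None, fi=None, fj=None, fk=None,
                           qj=None, qk=None, rj=False, rk=False)
                   for n in unit_names}
        self.result = {}  # register result status: register -> unit that will write it

    def can_issue(self, unit, dest):
        # Structural check (unit free) and WAW check (no pending writer of dest).
        return not self.fu[unit]["busy"] and self.result.get(dest) is None

    def issue(self, unit, op, dest, s1, s2):
        assert self.can_issue(unit, dest)
        qj, qk = self.result.get(s1), self.result.get(s2)  # read before Result(D) is set
        self.fu[unit].update(busy=True, op=op, fi=dest, fj=s1, fk=s2,
                             qj=qj, qk=qk, rj=qj is None, rk=qk is None)
        self.result[dest] = unit

    def can_read(self, unit):
        return self.fu[unit]["rj"] and self.fu[unit]["rk"]  # RAW check

    def read_operands(self, unit):
        assert self.can_read(unit)
        # Qj/Qk are cleared here as well (assumption, see lead-in).
        self.fu[unit].update(rj=False, rk=False, qj=None, qk=None)

    def can_write(self, unit):
        # WAR check: no unit still has to read the old value of our destination.
        fi = self.fu[unit]["fi"]
        return all((f["fj"] != fi or not f["rj"]) and (f["fk"] != fi or not f["rk"])
                   for f in self.fu.values())

    def write_result(self, unit):
        assert self.can_write(unit)
        for f in self.fu.values():  # wake up consumers waiting on this unit
            if f["qj"] == unit:
                f["rj"] = True
            if f["qk"] == unit:
                f["rk"] = True
        self.result[self.fu[unit]["fi"]] = None
        self.fu[unit]["busy"] = False


sb = Scoreboard(["Divide", "Add1", "Add2"])
sb.issue("Divide", "DIVD", "F0", "F2", "F4")
sb.issue("Add1", "ADDD", "F10", "F0", "F8")
sb.issue("Add2", "SUBD", "F8", "F8", "F14")

addd_ready_before = sb.can_read("Add1")    # False: RAW on F0
sb.read_operands("Add2")                   # SUBD proceeds out of order
subd_write_before = sb.can_write("Add2")   # False: WAR, ADDD has not read F8
sb.read_operands("Divide")
sb.write_result("Divide")                  # F0 produced, ADDD's Rj becomes Yes
sb.read_operands("Add1")
subd_write_after = sb.can_write("Add2")    # True: WAR cleared
print(addd_ready_before, subd_write_before, subd_write_after)  # False False True
```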
21
Scoreboard Example
  • The following latencies are chosen to illustrate behavior
    and are not representative:
  • LD: 1 cycle
  • (compute address + data cache access)
  • ADDDs and SUBDs: 2 cycles
  • Multiply: 10 cycles
  • Divide: 40 cycles

22
Scoreboard Example
23
Scoreboard Example Cycle 1
24
Scoreboard Example Cycle 2
Note: can't issue I2 because the Integer unit is
busy. Can't issue the next instruction due to
in-order issue
25
Scoreboard Example Cycle 3
26
Scoreboard Example Cycle 4
27
Scoreboard Example Cycle 5
Now I2 is issued
28
Scoreboard Example Cycle 6
29
Scoreboard Example Cycle 7
I3 is stalled at read operands because I2 isn't complete
30
Scoreboard Example Cycle 8
31
Scoreboard Example Cycle 9
Note: I3 and I4 read operands because F2 is now
available. ADDD (I6) can't be issued because SUBD
(I4) uses the adder
32
Scoreboard Example Cycle 11
Note: Add takes 2 cycles, so nothing happens in
cycle 10. MULT continues.
33
Scoreboard Example Cycle 12
34
Scoreboard Example Cycle 13
Now ADDD is issued because SUBD has completed
35
Scoreboard Example Cycle 14
36
Scoreboard Example Cycle 15
Note: ADDD takes 2 cycles, so no change
37
Scoreboard Example Cycle 16
ADDD completes, but MULTD and DIVD go on
38
Scoreboard Example Cycle 17
ADDD stalls: it can't write back due to WAR with
DIVD. MULT and DIV continue
39
Scoreboard Example Cycle 18
MULT and DIV continue
40
Scoreboard Example Cycle 19
MULT completes after 10 cycles
41
Scoreboard Example Cycle 20
MULTD completes and writes to F0
42
Scoreboard Example Cycle 21
Now DIVD reads because F0 is available
43
Scoreboard Example Cycle 22
ADDD writes result because WAR is removed.
44
Scoreboard Example Cycle 61
DIVD completes execution
45
Scoreboard Example Cycle 62
Execution is finished
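
The completion cycles quoted in the walkthrough follow from the read-operands cycle plus the execution latency. A small check, assuming execution starts the cycle after operands are read and that I3 is the multiply (the instruction list itself is not reproduced in the transcript); the dictionary names are illustrative.

```python
# Execution latencies from the example slide, in cycles.
EXEC_LATENCY = {"SUBD": 2, "MULTD": 10, "DIVD": 40}

# Cycle in which each instruction reads its operands, taken from the slide notes.
READ_CYCLE = {"SUBD": 9, "MULTD": 9, "DIVD": 21}


def completion_cycle(op):
    """Cycle in which execution completes: read-operands cycle plus latency."""
    return READ_CYCLE[op] + EXEC_LATENCY[op]


for op in ("SUBD", "MULTD", "DIVD"):
    print(op, completion_cycle(op))
# SUBD 11
# MULTD 19
# DIVD 61
```

DIVD then writes its result in the following cycle, which matches the final slide at cycle 62.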