Advanced Computer Architecture

1
Advanced Computer Architecture
  • Chapter 4
  • Advanced Pipelining
  • Ioannis Papaefstathiou
  • CS 590.25
  • Easter 2003
  • (thanks to Hennessy & Patterson)

2
Chapter Overview
  • 4.1 Instruction Level Parallelism Concepts and
    Challenges
  • 4.2 Overcoming Data Hazards with Dynamic
    Scheduling
  • 4.3 Reducing Branch Penalties with Dynamic
    Hardware Prediction
  • 4.4 Taking Advantage of More ILP with Multiple
    Issue
  • 4.5 Compiler Support for Exploiting ILP
  • 4.6 Hardware Support for Extracting more
    Parallelism
  • 4.7 Studies of ILP

3
Chapter Overview
Technique                                    Reduces                        Section
Loop Unrolling                               Control stalls                 4.1
Basic Pipeline Scheduling                    RAW stalls                     4.1
Dynamic Scheduling with Scoreboarding        RAW stalls                     4.2
Dynamic Scheduling with Register Renaming    WAR and WAW stalls             4.2
Dynamic Branch Prediction                    Control stalls                 4.3
Issue Multiple Instructions per Cycle        Ideal CPI                      4.4
Compiler Dependence Analysis                 Ideal CPI & data stalls        4.5
Software Pipelining and Trace Scheduling     Ideal CPI & data stalls        4.5
Speculation                                  All data & control stalls      4.6
Dynamic Memory Disambiguation                RAW stalls involving memory    4.2, 4.6
4
Instruction Level Parallelism
  • 4.1 Instruction Level Parallelism Concepts and
    Challenges
  • 4.2 Overcoming Data Hazards with Dynamic
    Scheduling
  • 4.3 Reducing Branch Penalties with Dynamic
    Hardware Prediction
  • 4.4 Taking Advantage of More ILP with Multiple
    Issue
  • 4.5 Compiler Support for Exploiting ILP
  • 4.6 Hardware Support for Extracting more
    Parallelism
  • 4.7 Studies of ILP
  • ILP is the observation that many instructions in
    code don't depend on each other, which means it's
    possible to execute those instructions in parallel.
  • This is easier said than done.
  • Issues include
  • Building compilers to analyze the code,
  • Building hardware to be even smarter than that
    code.
  • This section looks at some of the problems to be
    solved.

5
Terminology
Instruction Level Parallelism
Pipeline Scheduling and Loop Unrolling
  • Basic Block - the set of instructions between
    entry points and between branches. A basic block
    has only one entry and one exit. Typically it
    is about 6 instructions long.
  • Loop Level Parallelism - parallelism that
    exists within a loop. Such parallelism can cross
    loop iterations.
  • Loop Unrolling - either the compiler or the
    hardware exploits the parallelism
    inherent in the loop.

6
Simple Loop and its Assembler Equivalent
Instruction Level Parallelism
Pipeline Scheduling and Loop Unrolling
This is a clean and simple example!
  • for (i=1; i<=1000; i++) x[i] = x[i] + s;

Loop:  LD    F0,0(R1)     ; F0 = vector element
       ADDD  F4,F0,F2     ; add scalar from F2
       SD    0(R1),F4     ; store result
       SUBI  R1,R1,#8     ; decrement pointer, 8 bytes (DW)
       BNEZ  R1,Loop      ; branch if R1 != zero
       NOP                ; delayed branch slot
7
FP Loop Hazards
Instruction Level Parallelism
Pipeline Scheduling and Loop Unrolling
Loop:  LD    F0,0(R1)     ; F0 = vector element
       ADDD  F4,F0,F2     ; add scalar in F2
       SD    0(R1),F4     ; store result
       SUBI  R1,R1,#8     ; decrement pointer, 8 bytes (DW)
       BNEZ  R1,Loop      ; branch if R1 != zero
       NOP                ; delayed branch slot

Instruction producing result   Instruction using result   Latency (clock cycles)
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0
Integer op                     Integer op                 0
Where are the stalls?
8
FP Loop Showing Stalls
Instruction Level Parallelism
Pipeline Scheduling and Loop Unrolling
1   Loop:  LD    F0,0(R1)     ; F0 = vector element
2          stall
3          ADDD  F4,F0,F2     ; add scalar in F2
4          stall
5          stall
6          SD    0(R1),F4     ; store result
7          SUBI  R1,R1,#8     ; decrement pointer, 8 bytes (DW)
8          stall
9          BNEZ  R1,Loop      ; branch if R1 != zero
10         stall              ; delayed branch slot

Instruction producing result   Instruction using result   Latency (clock cycles)
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0
Integer op                     Integer op                 0
  • 10 clocks per iteration. Can we rewrite the code to
    minimize stalls?
9
Scheduled FP Loop Minimizing Stalls
Instruction Level Parallelism
Pipeline Scheduling and Loop Unrolling
1   Loop:  LD    F0,0(R1)
2          SUBI  R1,R1,#8
3          ADDD  F4,F0,F2
4          stall
5          BNEZ  R1,Loop      ; delayed branch
6          SD    8(R1),F4     ; altered when moved past SUBI

The stall remains because SD can't proceed until ADDD completes.
We swapped BNEZ and SD by changing the address of SD.

Instruction producing result   Instruction using result   Latency (clock cycles)
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1

  • Now 6 clocks per iteration. Next, unroll the loop 4
    times to make it faster.

10
Unroll Loop Four Times (straightforward way)
Instruction Level Parallelism
Pipeline Scheduling and Loop Unrolling
1   Loop:  LD    F0,0(R1)
2          stall
3          ADDD  F4,F0,F2
4          stall
5          stall
6          SD    0(R1),F4
7          LD    F6,-8(R1)
8          stall
9          ADDD  F8,F6,F2
10         stall
11         stall
12         SD    -8(R1),F8
13         LD    F10,-16(R1)
14         stall
15         ADDD  F12,F10,F2
16         stall
17         stall
18         SD    -16(R1),F12
19         LD    F14,-24(R1)
20         stall
21         ADDD  F16,F14,F2
22         stall
23         stall
24         SD    -24(R1),F16
25         SUBI  R1,R1,#32
26         BNEZ  R1,LOOP
27         stall
28         NOP

15 + 4 x (1+2) + 1 = 28 clock cycles, or 7 per
iteration. Assumes the iteration count is a multiple of 4.
  • Rewrite loop to minimize stalls.

11
Unrolled Loop That Minimizes Stalls
Instruction Level Parallelism
Pipeline Scheduling and Loop Unrolling
  • What assumptions were made when we moved the code?
  • OK to move the store past SUBI even though SUBI
    changes the register
  • OK to move the loads before the stores: do we get the
    right data?
  • When is it safe for the compiler to make such changes?

1   Loop:  LD    F0,0(R1)
2          LD    F6,-8(R1)
3          LD    F10,-16(R1)
4          LD    F14,-24(R1)
5          ADDD  F4,F0,F2
6          ADDD  F8,F6,F2
7          ADDD  F12,F10,F2
8          ADDD  F16,F14,F2
9          SD    0(R1),F4
10         SD    -8(R1),F8
11         SD    -16(R1),F12
12         SUBI  R1,R1,#32
13         BNEZ  R1,LOOP
14         SD    8(R1),F16    ; 8-32 = -24

14 clock cycles, or 3.5 per iteration.
No stalls!!
12
Summary of Loop Unrolling Example
Instruction Level Parallelism
Pipeline Scheduling and Loop Unrolling
  • Determine that it was legal to move the SD after
    the SUBI and BNEZ, and find the amount to adjust
    the SD offset.
  • Determine that unrolling the loop would be useful
    by finding that the loop iterations were
    independent, except for the loop maintenance
    code.
  • Use different registers to avoid unnecessary
    constraints that would be forced by using the
    same registers for different computations.
  • Eliminate the extra tests and branches and adjust
    the loop maintenance code.
  • Determine that the loads and stores in the
    unrolled loop can be interchanged by observing
    that the loads and stores from different
    iterations are independent. This requires
    analyzing the memory addresses and finding that
    they do not refer to the same address.
  • Schedule the code, preserving any dependences
    needed to yield the same result as the original
    code. (A source-level sketch of the result follows below.)
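
To make the transformation concrete, here is a minimal C sketch (an
illustration, not the deck's code) of 4-way unrolling of the running
loop; the function name and the clean-up loop for trip counts that
are not a multiple of 4 are assumptions:

    /* 4-way unrolled version of: for (i=0; i<n; i++) x[i] = x[i] + s; */
    void scale_add(double *x, double s, int n) {
        int i;
        for (i = 0; i + 3 < n; i += 4) {   /* four independent operations per trip */
            x[i]     = x[i]     + s;
            x[i + 1] = x[i + 1] + s;
            x[i + 2] = x[i + 2] + s;
            x[i + 3] = x[i + 3] + s;
        }
        for (; i < n; i++)                 /* clean-up when n % 4 != 0 */
            x[i] = x[i] + s;
    }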

13
Compiler Perspectives on Code Movement
Instruction Level Parallelism
Dependencies
  • The compiler is concerned about dependencies in the
    program; it is not concerned with whether a HW hazard
    arises on a given pipeline.
  • It tries to schedule code to avoid hazards.
  • Looks for data dependencies (RAW if a hazard for
    HW)
  • Instruction i produces a result used by
    instruction j, or
  • Instruction j is data dependent on instruction k,
    and instruction k is data dependent on
    instruction i.
  • If dependent, they can't execute in parallel
  • Easy to determine for registers (fixed names)
  • Hard for memory
  • Does 100(R4) = 20(R6)?
  • From different loop iterations, does 20(R6) =
    20(R6)?

14
Compiler Perspectives on Code Movement
Instruction Level Parallelism
Data Dependencies
Where are the data dependencies?
1   Loop:  LD    F0,0(R1)
2          ADDD  F4,F0,F2
3          SUBI  R1,R1,#8
4          BNEZ  R1,Loop      ; delayed branch
5          SD    8(R1),F4     ; altered when moved past SUBI
15
Compiler Perspectives on Code Movement
Instruction Level Parallelism
Name Dependencies
  • Another kind of dependence, called name
    dependence: two instructions use the same name
    (register or memory location) but don't exchange
    data
  • Anti-dependence (WAR if a hazard for HW)
  • Instruction j writes a register or memory
    location that instruction i reads from and
    instruction i is executed first
  • Output dependence (WAW if a hazard for HW)
  • Instruction i and instruction j write the same
    register or memory location ordering between
    instructions must be preserved.
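
A three-statement sketch (with hypothetical scalar variables) showing
both kinds of name dependence at the source level:

    a = b + c;   /* i: reads b                                    */
    b = d + e;   /* j: writes b - anti-dependence (WAR) with i    */
    b = f + g;   /* k: writes b - output dependence (WAW) with j  */

Renaming the destinations of j and k (say, to b1 and b2) removes both.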

16
Compiler Perspectives on Code Movement
Instruction Level Parallelism
Name Dependencies
1   Loop:  LD    F0,0(R1)
2          ADDD  F4,F0,F2
3          SD    0(R1),F4
4          LD    F0,-8(R1)
5          ADDD  F4,F0,F2
6          SD    -8(R1),F4
7          LD    F0,-16(R1)
8          ADDD  F4,F0,F2
9          SD    -16(R1),F4
10         LD    F0,-24(R1)
11         ADDD  F4,F0,F2
12         SD    -24(R1),F4
13         SUBI  R1,R1,#32
14         BNEZ  R1,LOOP
15         NOP

How can we remove these dependencies?
Where are the name dependencies?
No data is passed in F0, but we can't reuse F0 in
instruction 4.
17
Where are the name dependencies?
Instruction Level Parallelism
Name Dependencies
Compiler Perspectives on Code Movement
1   Loop:  LD    F0,0(R1)
2          ADDD  F4,F0,F2
3          SD    0(R1),F4
4          LD    F6,-8(R1)
5          ADDD  F8,F6,F2
6          SD    -8(R1),F8
7          LD    F10,-16(R1)
8          ADDD  F12,F10,F2
9          SD    -16(R1),F12
10         LD    F14,-24(R1)
11         ADDD  F16,F14,F2
12         SD    -24(R1),F16
13         SUBI  R1,R1,#32
14         BNEZ  R1,LOOP
15         NOP

This is called register renaming.
Now there are data dependencies only: F0 appears
only in instructions 1 and 2.
18
Compiler Perspectives on Code Movement
Instruction Level Parallelism
Name Dependencies
  • Again, name dependencies are hard for memory
    accesses
  • Does 100(R4) = 20(R6)?
  • From different loop iterations, does 20(R6) =
    20(R6)?
  • Our example required the compiler to know that if R1
    doesn't change, then 0(R1) != -8(R1) != -16(R1) !=
    -24(R1)
  • There were no dependencies between some
    loads and stores, so they could be moved around
    each other

19
Instruction Level Parallelism
Control Dependencies
Compiler Perspectives on Code Movement
  • Final kind of dependence called control
    dependence
  • Example
  • if p1 { S1; }
  • if p2 { S2; }
  • S1 is control dependent on p1, and S2 is control
    dependent on p2 but not on p1.

20
Instruction Level Parallelism
Control Dependencies
Compiler Perspectives on Code Movement
  • Two (obvious) constraints on control dependences
  • An instruction that is control dependent on a
    branch cannot be moved before the branch so
    that its execution is no longer controlled by the
    branch.
  • An instruction that is not control dependent on a
    branch cannot be moved to after the branch so
    that its execution is controlled by the branch.
  • Control dependencies can be relaxed to get parallelism;
    we get the same effect if we preserve the order of
    exceptions (an address in a register is checked by a
    branch before use) and the data flow (the value in a
    register depends on the branch)

21
Where are the control dependencies?
Instruction Level Parallelism
Control Dependencies
Compiler Perspectives on Code Movement
1   Loop:  LD    F0,0(R1)
2          ADDD  F4,F0,F2
3          SD    0(R1),F4
4          SUBI  R1,R1,#8
5          BEQZ  R1,exit
6          LD    F0,0(R1)
7          ADDD  F4,F0,F2
8          SD    0(R1),F4
9          SUBI  R1,R1,#8
10         BEQZ  R1,exit
11         LD    F0,0(R1)
12         ADDD  F4,F0,F2
13         SD    0(R1),F4
14         SUBI  R1,R1,#8
15         BEQZ  R1,exit
           ....
22
When Safe to Unroll Loop?
Instruction Level Parallelism
Loop Level Parallelism
  • Example: Where are the data dependencies? (A, B, C
    are distinct, non-overlapping arrays)
  • 1. S2 uses the value A[i+1] computed by S1 in
    the same iteration.
  • 2. S1 uses a value computed by S1 in an earlier
    iteration, since iteration i computes A[i+1],
    which is read in iteration i+1. The same is true
    of S2 for B[i] and B[i+1]. This is a
    loop-carried dependence between iterations.
  • This implies that the iterations are dependent and can't
    be executed in parallel.
  • Note this was not the case for our prior example: each
    iteration was distinct.

for (i=1; i<=100; i=i+1) {
    A[i+1] = A[i] + C[i];      /* S1 */
    B[i+1] = B[i] + A[i+1];    /* S2 */
}
23
When Safe to Unroll Loop?
Instruction Level Parallelism
Loop Level Parallelism
  • Example: Where are the data dependencies? (A, B, C, D
    are distinct, non-overlapping arrays)
  • 1. There is no dependence from S1 to S2. If there
    were, there would be a cycle in the
    dependencies and the loop would not be parallel.
    Since this other dependence is absent,
    interchanging the two statements will not affect
    the execution of S2.
  • 2. On the first iteration of the loop,
    statement S1 depends on the value of B[1]
    computed prior to initiating the loop.

for (i=1; i<=100; i=i+1) {
    A[i] = A[i] + B[i];        /* S1 */
    B[i+1] = C[i] + D[i];      /* S2 */
}
24
Now Safe to Unroll Loop? (p. 240)
Instruction Level Parallelism
Loop Level Parallelism
for (i=1; i<=100; i=i+1) {
    A[i] = A[i] + B[i];        /* S1 */
    B[i+1] = C[i] + D[i];      /* S2 */
}

No circular dependencies.
OLD
The loop carried a dependence on B.

  • A[1] = A[1] + B[1];
  • for (i=1; i<=99; i=i+1) {
        B[i+1] = C[i] + D[i];
        A[i+1] = A[i+1] + B[i+1];
    }
  • B[101] = C[100] + D[100];

We have eliminated the loop-carried dependence.
NEW
25
Dynamic Scheduling
  • 4.1 Instruction Level Parallelism Concepts and
    Challenges
  • 4.2 Overcoming Data Hazards with Dynamic
    Scheduling
  • 4.3 Reducing Branch Penalties with Dynamic
    Hardware Prediction
  • 4.4 Taking Advantage of More ILP with Multiple
    Issue
  • 4.5 Compiler Support for Exploiting ILP
  • 4.6 Hardware Support for Extracting more
    Parallelism
  • 4.7 Studies of ILP
  • Dynamic Scheduling is when the hardware
    rearranges the order of instruction execution to
    reduce stalls.
  • Advantages
  • Dependencies unknown at compile time can be
    handled by the hardware.
  • Code compiled for one type of pipeline can be
    efficiently run on another.
  • Disadvantages
  • Hardware much more complex.

26
Dynamic Scheduling
The idea
HW Schemes Instruction Parallelism
  • Why do it in HW at run time?
  • Works when we can't know the real dependences at compile
    time
  • Compiler is simpler
  • Code for one machine runs well on another
  • Key Idea: Allow instructions behind a stall to
    proceed.
  • Key Idea: Instructions execute in parallel.
    There are multiple execution units, so use them.
  • DIVD F0,F2,F4
  • ADDD F10,F0,F8
  • SUBD F12,F8,F14
  • Enables out-of-order execution => out-of-order
    completion

27
Dynamic Scheduling
The idea
HW Schemes Instruction Parallelism
  • Out-of-order execution divides the ID stage:
  • 1. Issue: decode instructions, check for
    structural hazards
  • 2. Read operands: wait until no data hazards, then
    read operands
  • Scoreboards allow instructions to execute whenever
    1 & 2 hold, not waiting for prior instructions.
  • A scoreboard is a data structure that provides
    the information necessary for all pieces of the
    processor to work together.
  • We will use: in-order issue, out-of-order
    execution, out-of-order commit (also called
    completion)
  • First used in the CDC 6600. Our example is modified here
    for DLX.
  • The CDC had 4 FP units, 5 memory reference units, and 7
    integer units.
  • DLX has 2 FP multipliers, 1 FP adder, 1 FP divider, and
    1 integer unit.

28
Scoreboard Implications
Dynamic Scheduling
Using A Scoreboard
  • Out-of-order completion => WAR, WAW hazards?
  • Solutions for WAR
  • Queue both the operation and copies of its
    operands
  • Read registers only during the Read Operands stage
  • For WAW, must detect the hazard and stall until the
    other instruction completes
  • Need to have multiple instructions in the execution
    phase => multiple execution units or pipelined
    execution units
  • The scoreboard keeps track of dependencies and the state
    of operations
  • The scoreboard replaces ID, EX, WB with 4 stages

29
Four Stages of Scoreboard Control
Dynamic Scheduling
Using A Scoreboard
  • 1. Issue: decode instructions & check for
    structural hazards (ID1)
  • If a functional unit for the instruction is free
    and no other active instruction has the same
    destination register (WAW), the scoreboard issues
    the instruction to the functional unit and
    updates its internal data structure.
  • If a structural or WAW hazard exists, then the
    instruction issue stalls, and no further
    instructions will issue until these hazards are
    cleared.

30
Four Stages of Scoreboard Control
Dynamic Scheduling
Using A Scoreboard
  • 2. Read operands: wait until no data hazards,
    then read operands (ID2)
  • A source operand is available if no earlier
    issued active instruction is going to write it,
    or if the register containing the operand is
    being written by a currently active functional
    unit.
  • When the source operands are available, the
    scoreboard tells the functional unit to proceed
    to read the operands from the registers and begin
    execution. The scoreboard resolves RAW hazards
    dynamically in this step, and instructions may be
    sent into execution out of order.

31
Four Stages of Scoreboard Control
Dynamic Scheduling
Using A Scoreboard
  • 3. Execution: operate on operands (EX)
  • The functional unit begins execution upon
    receiving operands. When the result is ready, it
    notifies the scoreboard that it has completed
    execution.
  • 4. Write result: finish execution (WB)
  • Once the scoreboard is aware that the
    functional unit has completed execution, the
    scoreboard checks for WAR hazards. If none, it
    writes results. If WAR, then it stalls the
    instruction.
  • Example
  • DIVD F0,F2,F4
  • ADDD F10,F0,F8
  • SUBD F8,F8,F14
  • Scoreboard would stall SUBD until ADDD reads
    operands

32
Three Parts of the Scoreboard
Dynamic Scheduling
Using A Scoreboard
  • 1. Instruction status: which of the 4 steps the
    instruction is in
  • 2. Functional unit status: indicates the state of
    the functional unit (FU). 9 fields for each
    functional unit
  • Busy: indicates whether the unit is busy or not
  • Op: operation to perform in the unit (e.g., + or -)
  • Fi: destination register
  • Fj, Fk: source-register numbers
  • Qj, Qk: functional units producing source
    registers Fj, Fk
  • Rj, Rk: flags indicating when Fj, Fk are ready
  • 3. Register result status: indicates which
    functional unit will write each register, if one
    exists. Blank when no pending instructions will
    write that register
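
As a rough illustration (field names follow the slide; the C layout
is an assumption), the per-functional-unit record could look like:

    /* One scoreboard entry per functional unit (sketch). */
    typedef struct {
        int Busy;       /* is the unit busy?                    */
        int Op;         /* operation to perform, e.g. + or -    */
        int Fi;         /* destination register number          */
        int Fj, Fk;     /* source register numbers              */
        int Qj, Qk;     /* FUs producing Fj, Fk (0 = none)      */
        int Rj, Rk;     /* flags: are Fj, Fk ready to be read?  */
    } FUStatus;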

33
Detailed Scoreboard Pipeline Control
Dynamic Scheduling
Using A Scoreboard
Instruction status   Wait until                                   Bookkeeping
Issue                Not Busy(FU) and not Result(D)               Busy(FU) <- yes; Op(FU) <- op; Fi(FU) <- D;
                                                                  Fj(FU) <- S1; Fk(FU) <- S2; Qj <- Result(S1);
                                                                  Qk <- Result(S2); Rj <- not Qj; Rk <- not Qk;
                                                                  Result(D) <- FU
Read operands        Rj and Rk                                    Rj <- No; Rk <- No
Execution complete   Functional unit done                         (none)
Write result         for all f: (Fj(f) != Fi(FU) or Rj(f) = No)   for all f: (if Qj(f) = FU then Rj(f) <- Yes);
                     and (Fk(f) != Fi(FU) or Rk(f) = No)          for all f: (if Qk(f) = FU then Rk(f) <- Yes);
                                                                  Result(Fi(FU)) <- 0; Busy(FU) <- No
34
Scoreboard Example
Dynamic Scheduling
Using A Scoreboard
This is the sample code we'll be working with in
the example:

    LD    F6, 34(R2)
    LD    F2, 45(R3)
    MULT  F0, F2, F4
    SUBD  F8, F6, F2
    DIVD  F10, F0, F6
    ADDD  F6, F8, F2

What are the hazards in this code?

Latencies (clock cycles):
LD 1, MULT 10, SUBD 2, DIVD 40, ADDD 2
35
Scoreboard Example
Dynamic Scheduling
Using A Scoreboard
36
Scoreboard Example Cycle 1
Dynamic Scheduling
Using A Scoreboard
Issue LD 1
Shows in which cycle the operation occurred.
37
Scoreboard Example Cycle 2
Dynamic Scheduling
Using A Scoreboard
LD 2 can't issue since the integer unit is
busy. MULT can't issue because we require
in-order issue.
38
Scoreboard Example Cycle 3
Dynamic Scheduling
Using A Scoreboard
39
Scoreboard Example Cycle 4
Dynamic Scheduling
Using A Scoreboard
40
Scoreboard Example Cycle 5
Dynamic Scheduling
Using A Scoreboard
Issue LD 2 since integer unit is now free.
41
Scoreboard Example Cycle 6
Dynamic Scheduling
Using A Scoreboard
Issue MULT.
42
Scoreboard Example Cycle 7
Dynamic Scheduling
Using A Scoreboard
MULT can't read its operands (F2) because LD 2
hasn't finished.
43
Scoreboard Example Cycle 8a
Dynamic Scheduling
Using A Scoreboard
DIVD issues. MULT and SUBD both waiting for F2.
44
Scoreboard Example Cycle 8b
Dynamic Scheduling
Using A Scoreboard
LD 2 writes F2.
45
Scoreboard Example Cycle 9
Dynamic Scheduling
Using A Scoreboard
Now MULT and SUBD can both read F2. How can both
instructions do this at the same time??
46
Scoreboard Example Cycle 11
Dynamic Scheduling
Using A Scoreboard
ADDD can't start because the add unit is busy.
47
Scoreboard Example Cycle 12
Dynamic Scheduling
Using A Scoreboard
SUBD finishes. DIVD waiting for F0.
48
Scoreboard Example Cycle 13
Dynamic Scheduling
Using A Scoreboard
ADDD issues.
49
Scoreboard Example Cycle 14
Dynamic Scheduling
Using A Scoreboard
50
Scoreboard Example Cycle 15
Dynamic Scheduling
Using A Scoreboard
51
Scoreboard Example Cycle 16
Dynamic Scheduling
Using A Scoreboard
52
Scoreboard Example Cycle 17
Dynamic Scheduling
Using A Scoreboard
ADDD can't write because DIVD has not yet read F6. WAR!
53
Scoreboard Example Cycle 18
Dynamic Scheduling
Using A Scoreboard
Nothing Happens!!
54
Scoreboard Example Cycle 19
Dynamic Scheduling
Using A Scoreboard
MULT completes execution.
55
Scoreboard Example Cycle 20
Dynamic Scheduling
Using A Scoreboard
MULT writes.
56
Scoreboard Example Cycle 21
Dynamic Scheduling
Using A Scoreboard
DIVD loads operands
57
Scoreboard Example Cycle 22
Dynamic Scheduling
Using A Scoreboard
Now ADDD can write since WAR removed.
58
Scoreboard Example Cycle 61
Dynamic Scheduling
Using A Scoreboard
DIVD completes execution
59
Scoreboard Example Cycle 62
Dynamic Scheduling
Using A Scoreboard
DONE!!
60
Another Dynamic Algorithm Tomasulo Algorithm
Dynamic Scheduling
Using A Scoreboard
  • For the IBM 360/91, about 3 years after the CDC 6600
    (1966)
  • Goal: high performance without special compilers
  • Differences between the IBM 360 & CDC 6600 ISA
  • IBM has only 2 register specifiers / instruction
    vs. 3 in the CDC 6600
  • IBM has 4 FP registers vs. 8 in the CDC 6600
  • Why study? It led to the Alpha 21264, HP 8000, MIPS
    R10000, Pentium II, PowerPC 604, ...

61
Tomasulo Algorithm vs. Scoreboard
Dynamic Scheduling
Using A Scoreboard
  • Control & buffers distributed with the Function Units
    (FU) vs. centralized in the scoreboard
  • FU buffers, called reservation stations, hold
    pending operands
  • Registers in instructions are replaced by values or
    pointers to reservation stations (RS): called
    register renaming
  • avoids WAR and WAW hazards
  • More reservation stations than registers, so it can
    do optimizations compilers can't
  • Results go to FUs from the RSs, not through registers,
    over a Common Data Bus that broadcasts results to
    all FUs
  • Loads and stores are treated as FUs with RSs as well
  • Integer instructions can go past branches,
    allowing FP ops beyond the basic block in the FP queue

62
Tomasulo Organization
Using A Scoreboard
Dynamic Scheduling
[Figure: Tomasulo organization - the FP op queue and FP registers,
load buffers, and store buffers feed the FP add and FP multiply
reservation stations, all connected by the Common Data Bus]
63
Reservation Station Components
Dynamic Scheduling
Using A Scoreboard
  • Op: operation to perform in the unit (e.g., + or -)
  • Vj, Vk: values of the source operands
  • Store buffers have a V field: the result to be stored
  • Qj, Qk: reservation stations producing the source
    registers (the value to be written)
  • Note: no ready flags as in the scoreboard; Qj,Qk = 0 =>
    ready
  • Store buffers have only Qi, for the RS producing the
    result
  • Busy: indicates reservation station or FU is busy
  • Register result status: indicates which functional
    unit will write each register, if one exists.
    Blank when no pending instructions will
    write that register.
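
A sketch of one reservation-station entry with the fields above (the
C layout is an assumption):

    typedef struct {
        int    Busy;    /* station in use?                           */
        int    Op;      /* operation to perform                      */
        double Vj, Vk;  /* operand values, valid when Qj/Qk == 0     */
        int    Qj, Qk;  /* stations producing the operands, 0 = ready */
    } RSEntry;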

64
Three Stages of Tomasulo Algorithm
Dynamic Scheduling
Using A Scoreboard
  • 1. Issue: get instruction from the FP Op Queue
  • If a reservation station is free (no structural
    hazard), control issues the instruction & sends
    operands (renames registers).
  • 2. Execution: operate on operands (EX)
  • When both operands are ready, execute; if not
    ready, watch the Common Data Bus for the result
  • 3. Write result: finish execution (WB)
  • Write on the Common Data Bus to all awaiting units;
    mark the reservation station available
  • Normal data bus: data + destination ("go
    to" bus)
  • Common data bus: data + source ("come from" bus)
  • 64 bits of data + 4 bits of Functional Unit
    source address
  • Write if it matches the expected Functional Unit
    (produces the result)
  • Does the broadcast

65
Tomasulo Example Cycle 0
Dynamic Scheduling
Using A Scoreboard
66
Review Tomasulo
Dynamic Scheduling
Using A Scoreboard
  • Prevents the registers from being a bottleneck
  • Avoids the WAR and WAW hazards of the scoreboard
  • Allows loop unrolling in HW
  • Not limited to basic blocks (provided there is branch
    prediction)
  • Lasting contributions
  • Dynamic scheduling
  • Register renaming
  • Load/store disambiguation
  • The 360/91's descendants include the PowerPC 604 and 620,
    MIPS R10000, HP PA-8000, and Intel Pentium Pro

67
Dynamic Hardware Prediction
  • 4.1 Instruction Level Parallelism Concepts and
    Challenges
  • 4.2 Overcoming Data Hazards with Dynamic
    Scheduling
  • 4.3 Reducing Branch Penalties with Dynamic
    Hardware Prediction
  • 4.4 Taking Advantage of More ILP with Multiple
    Issue
  • 4.5 Compiler Support for Exploiting ILP
  • 4.6 Hardware Support for Extracting more
    Parallelism
  • 4.7 Studies of ILP

Dynamic branch prediction is the ability of the
hardware to make an educated guess about which
way a branch will go: will the branch be taken
or not? The hardware can look for clues based on
the instructions, or it can use past history; we
will discuss both of these directions.
68
Dynamic Branch Prediction
Dynamic Hardware Prediction
Basic Branch Prediction Branch Prediction Buffers
  • Performance = ƒ(accuracy, cost of misprediction)
  • Branch history: lower bits of the PC address index a
    table of 1-bit values
  • Says whether or not the branch was taken last time
  • Problem: in a loop, a 1-bit BHT causes two
    mispredictions
  • At the end of the loop, when it exits instead of
    looping as before
  • The first time through the loop on the next pass through
    the code, when it predicts exit instead of looping

[Figure: a 1024-entry branch-prediction buffer (entries 0-1023)
indexed by PC address bits 13-2, each entry holding one prediction
bit]
69
Dynamic Branch Prediction
Dynamic Hardware Prediction
Basic Branch Prediction Branch Prediction Buffers
  • Solution: a 2-bit scheme that changes the prediction
    only after two consecutive mispredictions (Figure 4.13, p.
    264); the sketch below simulates one such counter.
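
A minimal C sketch of one 2-bit saturating counter per entry; the
table size and index function are illustrative assumptions:

    #include <stdint.h>

    #define BHT_ENTRIES 1024
    static uint8_t bht[BHT_ENTRIES];   /* 0,1 predict not taken; 2,3 predict taken */

    static unsigned bht_index(uint32_t pc) { return (pc >> 2) % BHT_ENTRIES; }

    int predict_taken(uint32_t pc) {
        return bht[bht_index(pc)] >= 2;
    }

    void train(uint32_t pc, int taken) {   /* saturate at 0 and 3 */
        uint8_t *c = &bht[bht_index(pc)];
        if (taken)  { if (*c < 3) (*c)++; }
        else        { if (*c > 0) (*c)--; }
    }

A misprediction moves the counter one step, so a single anomalous
outcome (such as a loop exit) does not flip a strongly-taken prediction.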

70
BHT Accuracy
Dynamic Hardware Prediction
Basic Branch Prediction Branch Prediction Buffers
  • Mispredictions occur because either
  • The guess was wrong for that branch, or
  • We got the branch history of the wrong branch when
    indexing the table
  • With a 4096-entry table, programs vary from 1%
    mispredictions (nasa7, tomcatv) to 18% (eqntott),
    with spice at 9% and gcc at 12%
  • 4096 entries are about as good as an infinite table,
    but 4096 is a lot of HW

71
Correlating Branches
Dynamic Hardware Prediction
Basic Branch Prediction Branch Prediction Buffers
  • Idea: the taken/not-taken behavior of recently executed
    branches is related to the behavior of the next branch
    (as well as to the history of that branch's own behavior)
  • Then the behavior of recent branches selects between,
    say, four predictions for the next branch, updating
    just that prediction

72
Accuracy of Different Schemes
Dynamic Hardware Prediction
Basic Branch Prediction Branch Prediction Buffers
(Figure 4.21, p. 272)
[Figure: frequency of mispredictions, from 0% up to 18%, comparing a
4096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1024-entry
correlating predictor with 2 bits of history and 2 bits per entry]
73
Branch Target Buffer
Dynamic Hardware Prediction
Basic Branch Prediction Branch Target Buffers
  • Branch Target Buffer (BTB): use the address of the branch
    as an index to get the prediction AND the branch target
    address (if taken)
  • Note: must check for a branch match now, since we
    can't use the wrong branch address (Figure 4.22, p.
    273)
  • Return instruction addresses are predicted with a stack

[Figure: a BTB entry yields the predicted PC and whether the branch
is predicted taken or not taken]
74
Example
Dynamic Hardware Prediction
Basic Branch Prediction Branch Target Buffers
Instruction in buffer   Prediction   Actual branch   Penalty (cycles)
Yes                     Taken        Taken           0
Yes                     Taken        Not taken       2
No                      -            Taken           2
  • Example on page 274.
  • Determine the total branch penalty for a BTB
    using the above penalties. Assume also the
    following:
  • Prediction accuracy of 90% (so 10% of buffer hits
    are mispredicted)
  • Hit rate in the buffer of 90%
  • 60% taken branch frequency.

Branch penalty = (buffer hit rate x percent incorrect predictions x 2)
               + ((1 - buffer hit rate) x taken-branch frequency x 2)
Branch penalty = (90% x 10% x 2) + (10% x 60% x 2)
Branch penalty = 0.18 + 0.12 = 0.30 clock cycles
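
The same arithmetic as a tiny C check, plugging in the numbers from
the example above:

    #include <stdio.h>

    int main(void) {
        double hit = 0.90, incorrect = 0.10, taken = 0.60;
        double penalty = hit * incorrect * 2 + (1.0 - hit) * taken * 2;
        printf("branch penalty = %.2f clock cycles\n", penalty);  /* 0.30 */
        return 0;
    }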
75
Multiple Issue
Multiple issue is the ability of the processor to
start more than one instruction in a given
cycle. Flavor I: Superscalar processors issue a
varying number of instructions per clock and can be
either statically scheduled (by the compiler) or
dynamically scheduled (by the hardware). A superscalar
has a varying number of instructions/cycle
(1 to 8), scheduled by the compiler or by HW
(Tomasulo). Examples: IBM PowerPC, Sun UltraSPARC, DEC
Alpha, HP 8000
  • 4.1 Instruction Level Parallelism Concepts and
    Challenges
  • 4.2 Overcoming Data Hazards with Dynamic
    Scheduling
  • 4.3 Reducing Branch Penalties with Dynamic
    Hardware Prediction
  • 4.4 Taking Advantage of More ILP with Multiple
    Issue
  • 4.5 Compiler Support for Exploiting ILP
  • 4.6 Hardware Support for Extracting more
    Parallelism
  • 4.7 Studies of ILP

76
Issuing Multiple Instructions/Cycle
Multiple Issue
  • Flavor II:
  • VLIW - Very Long Instruction Word - issues a
    fixed number of instructions formatted either as
    one very large instruction or as a fixed packet
    of smaller instructions.
  • A fixed number of instructions (4-16) scheduled by
    the compiler; operations are put into wide templates
  • Joint HP/Intel agreement in 1999/2000
  • Intel Architecture-64 (IA-64): 64-bit address
  • Style: Explicitly Parallel Instruction Computer
    (EPIC)

77
Issuing Multiple Instructions/Cycle
Multiple Issue
  • Flavor II - continued
  • 3 instructions in 128-bit groups; a field
    determines whether the instructions are dependent or
    independent
  • Smaller code size than old VLIW, larger than
    x86/RISC
  • Groups can be linked to show independence of > 3
    instructions
  • 64 integer registers + 64 floating point
    registers
  • Not separate register files per functional unit as in
    old VLIW
  • Hardware checks dependencies (interlocks =>
    binary compatibility over time)
  • Predicated execution (select 1 out of 64 1-bit
    flags) => 40% fewer mispredictions?
  • IA-64 is the name of the instruction set architecture;
    EPIC is the type
  • Merced was the name of the first implementation
    (1999/2000?)

78
Issuing Multiple Instructions/Cycle
Multiple Issue
A SuperScalar Version of DLX
  • In our DLX example, we can handle 2
    instructions/cycle:
  • Floating Point
  • Anything Else
  • Fetch 64 bits/clock cycle; Int on left, FP on
    right
  • Can only issue the 2nd instruction if the 1st
    instruction issues
  • More ports for the FP registers to do an FP load & FP
    op in a pair
  • Type              Pipe Stages
  • Int. instruction  IF ID EX MEM WB
  • FP instruction    IF ID EX MEM WB
  • Int. instruction     IF ID EX MEM WB
  • FP instruction       IF ID EX MEM WB
  • Int. instruction        IF ID EX MEM WB
  • FP instruction          IF ID EX MEM WB
  • A 1-cycle load delay delays 3
    instructions in the superscalar:
  • the instruction in the right half can't use it, nor
    can the instructions in the next slot

79
Unrolled Loop Minimizes Stalls for Scalar
Multiple Issue
A SuperScalar Version of DLX
1   Loop:  LD    F0,0(R1)
2          LD    F6,-8(R1)
3          LD    F10,-16(R1)
4          LD    F14,-24(R1)
5          ADDD  F4,F0,F2
6          ADDD  F8,F6,F2
7          ADDD  F12,F10,F2
8          ADDD  F16,F14,F2
9          SD    0(R1),F4
10         SD    -8(R1),F8
11         SD    -16(R1),F12
12         SUBI  R1,R1,#32
13         BNEZ  R1,LOOP
14         SD    8(R1),F16    ; 8-32 = -24

14 clock cycles, or 3.5 per iteration

Latencies: LD to ADDD: 1 cycle; ADDD to SD: 2 cycles
80
Loop Unrolling in Superscalar
Multiple Issue
A SuperScalar Version of DLX
  Integer instruction     FP instruction      Clock cycle
  Loop: LD F0,0(R1)                           1
        LD F6,-8(R1)                          2
        LD F10,-16(R1)    ADDD F4,F0,F2       3
        LD F14,-24(R1)    ADDD F8,F6,F2       4
        LD F18,-32(R1)    ADDD F12,F10,F2     5
        SD 0(R1),F4       ADDD F16,F14,F2     6
        SD -8(R1),F8      ADDD F20,F18,F2     7
        SD -16(R1),F12                        8
        SD -24(R1),F16                        9
        SUBI R1,R1,#40                        10
        BNEZ R1,LOOP                          11
        SD 8(R1),F20                          12

  • Unrolled 5 times to avoid delays (1 more due to SS)
  • 12 clocks, or 2.4 clocks per iteration

81
Dynamic Scheduling in Superscalar
Multiple Issue
Multiple Instruction Issue Dynamic Scheduling
  • Code compiled for the scalar version will run poorly
    on the superscalar
  • May want the code to vary depending on how wide the
    superscalar is
  • Simple approach: separate Tomasulo control and
    separate reservation stations for the integer FU/registers
    and for the FP FU/registers

82
Dynamic Scheduling in Superscalar
Multiple Issue
Multiple Instruction Issue Dynamic Scheduling
  • How do we do instruction issue with two instructions
    and keep in-order instruction issue for Tomasulo?
  • Issue at 2X the clock rate, so that issue remains in
    order
  • Only FP loads might cause a dependency between
    integer and FP issue
  • Replace the load reservation station with a load
    queue; operands must be read in the order they
    are fetched
  • A load checks addresses in the store queue to avoid RAW
    violations
  • A store checks addresses in the load queue to avoid
    WAR and WAW

83
Performance of Dynamic Superscalar
Multiple Issue
Multiple Instruction Issue Dynamic Scheduling
  Iteration  Instruction      Issues   Executes   Writes result
  no.                         (clock-cycle number)
  1          LD F0,0(R1)      1        2          4
  1          ADDD F4,F0,F2    1        5          8
  1          SD 0(R1),F4      2        9
  1          SUBI R1,R1,#8    3        4          5
  1          BNEZ R1,LOOP     4        5
  2          LD F0,0(R1)      5        6          8
  2          ADDD F4,F0,F2    5        9          12
  2          SD 0(R1),F4      6        13
  2          SUBI R1,R1,#8    7        8          9
  2          BNEZ R1,LOOP     8        9

  • 4 clocks per iteration
  • Branches and decrements still take 1 clock cycle

84
Loop Unrolling in VLIW
Multiple Issue
VLIW
  Memory ref 1     Memory ref 2     FP op 1          FP op 2          Int. op/branch  Clock
  LD F0,0(R1)      LD F6,-8(R1)                                                       1
  LD F10,-16(R1)   LD F14,-24(R1)                                                     2
  LD F18,-32(R1)   LD F22,-40(R1)   ADDD F4,F0,F2    ADDD F8,F6,F2                    3
  LD F26,-48(R1)                    ADDD F12,F10,F2  ADDD F16,F14,F2                  4
                                    ADDD F20,F18,F2  ADDD F24,F22,F2                  5
  SD 0(R1),F4      SD -8(R1),F8     ADDD F28,F26,F2                                   6
  SD -16(R1),F12   SD -24(R1),F16                                                     7
  SD -32(R1),F20   SD -40(R1),F24                                     SUBI R1,R1,#48  8
  SD -0(R1),F28                                                       BNEZ R1,LOOP    9

  • Unrolled 7 times to avoid delays
  • 7 results in 9 clocks, or 1.3 clocks per
    iteration
  • Need more registers to use VLIW effectively

85
Limits to Multi-Issue Machines
Multiple Issue
Limitations With Multiple Issue
  • Inherent limitations of ILP
  • 1 branch in 5 instructions => how do we keep a 5-way
    VLIW busy?
  • Latencies of units => many operations must be
    scheduled
  • Need about Pipeline Depth x No. of Functional Units
    independent operations to keep the machine busy.
  • Difficulties in building HW
  • Duplicate functional units to get parallel
    execution
  • More ports to the register file (the VLIW example
    needs 6 read and 3 write ports for the Int. registers &
    6 read and 4 write ports for the FP registers)
  • More ports to memory
  • Decoding in SS and its impact on clock rate and
    pipeline depth

86
Limits to Multi-Issue Machines
Multiple Issue
Limitations With Multiple Issue
  • Limitations specific to either the SS or VLIW
    implementation
  • Decode & issue in SS
  • VLIW code size: unrolled loops & wasted fields in the
    VLIW word
  • VLIW lock step => 1 hazard & all instructions
    stall
  • VLIW & binary compatibility

87
Multiple Issue Challenges
Multiple Issue
Limitations With Multiple Issue
  • While the Integer/FP split is simple for the HW, we
    get a CPI of 0.5 only for programs with
  • Exactly 50% FP operations
  • No hazards
  • If more instructions issue at the same time, decode
    and issue get harder
  • Even 2-scalar => examine 2 opcodes, 6 register
    specifiers, and decide if 1 or 2 instructions can
    issue
  • VLIW: trade off instruction space for simple
    decoding
  • The long instruction word has room for many
    operations
  • By definition, all the operations the compiler
    puts in the long instruction word are independent
    => execute in parallel
  • E.g., 2 integer operations, 2 FP ops, 2 memory
    refs, 1 branch
  • 16 to 24 bits per field => 7 x 16 = 112 bits to
    7 x 24 = 168 bits wide
  • Need a compiling technique that schedules across
    several branches

88
Compiler Support For ILP
  • 4.1 Instruction Level Parallelism Concepts and
    Challenges
  • 4.2 Overcoming Data Hazards with Dynamic
    Scheduling
  • 4.3 Reducing Branch Penalties with Dynamic
    Hardware Prediction
  • 4.4 Taking Advantage of More ILP with Multiple
    Issue
  • 4.5 Compiler Support for Exploiting ILP
  • 4.6 Hardware Support for Extracting more
    Parallelism
  • 4.7 Studies of ILP

How can compilers be smart?
1. Produce good scheduling of code.
2. Determine which loops might contain parallelism.
3. Eliminate name dependencies.

Compilers must be REALLY smart to figure out
aliases -- pointers in C are a real problem.

Techniques lead to:
  Symbolic Loop Unrolling
  Critical Path Scheduling
89
Software Pipelining
Compiler Support For ILP
Symbolic Loop Unrolling
  • Observation: if iterations of a loop are
    independent, then we can get ILP by taking
    instructions from different iterations
  • Software pipelining reorganizes loops so that
    each iteration is made from instructions chosen
    from different iterations of the original loop
    (Tomasulo in SW)

90
SW Pipelining Example
Compiler Support For ILP
Symbolic Loop Unrolling
  • Before: Unrolled 3 times
  • 1 LD F0,0(R1)
  • 2 ADDD F4,F0,F2
  • 3 SD 0(R1),F4
  • 4 LD F6,-8(R1)
  • 5 ADDD F8,F6,F2
  • 6 SD -8(R1),F8
  • 7 LD F10,-16(R1)
  • 8 ADDD F12,F10,F2
  • 9 SD -16(R1),F12
  • 10 SUBI R1,R1,#24
  • 11 BNEZ R1,LOOP

After: Software Pipelined

         LD    F0,0(R1)      ; prologue
         ADDD  F4,F0,F2
         LD    F0,-8(R1)
1 Loop:  SD    0(R1),F4      ; stores M[i]
2        ADDD  F4,F0,F2      ; adds to M[i-1]
3        LD    F0,-16(R1)    ; loads M[i-2]
4        SUBI  R1,R1,#8
5        BNEZ  R1,Loop
         SD    0(R1),F4      ; epilogue
         ADDD  F4,F0,F2
         SD    -8(R1),F4

[Figure: overlapped pipelines - the SD, ADDD, and LD in one
software-pipelined iteration come from three different iterations
of the original loop; F0 and F4 are each written in one stage and
read in the next]
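
At the source level, the same schedule can be sketched in C (an
illustration under the assumption n >= 3; the deck works in assembly):

    /* Each loop trip stores element i, adds element i-1, loads element i-2. */
    void sw_pipe(double *x, double s, int n) {
        double loaded = x[n - 1];          /* prologue: load  */
        double summed = loaded + s;        /* prologue: add   */
        loaded = x[n - 2];                 /* prologue: load  */
        for (int i = n - 1; i >= 2; i--) {
            x[i]   = summed;               /* store M[i]      */
            summed = loaded + s;           /* add   M[i-1]    */
            loaded = x[i - 2];             /* load  M[i-2]    */
        }
        x[1] = summed;                     /* epilogue        */
        x[0] = loaded + s;
    }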
91
SW Pipelining Example
Compiler Support For ILP
Symbolic Loop Unrolling
  • Symbolic loop unrolling (software pipelining)
  • Less code space
  • Overhead paid only once, vs. at each iteration
    in loop unrolling

[Figure: software pipelining pays its start-up and wind-down overhead
once, while loop unrolling pays loop overhead at each unrolled loop
entry: 100 iterations = 25 loops with 4 unrolled iterations each]
92
Trace Scheduling
Compiler Support For ILP
Critical Path Scheduling
  • Parallelism across IF branches vs. LOOP branches
  • Two steps:
  • Trace Selection
  • Find a likely sequence of basic blocks (a trace) of
    (statically predicted or profile predicted) long
    sequences of straight-line code
  • Trace Compaction
  • Squeeze the trace into a few VLIW instructions
  • Need bookkeeping code in case the prediction is wrong
  • The compiler undoes a bad guess (discards values in
    registers)
  • Subtle compiler bugs can mean a wrong answer, not just
    poorer performance; there are no hardware interlocks

93
Hardware Support For Parallelism
  • 4.1 Instruction Level Parallelism Concepts and
    Challenges
  • 4.2 Overcoming Data Hazards with Dynamic
    Scheduling
  • 4.3 Reducing Branch Penalties with Dynamic
    Hardware Prediction
  • 4.4 Taking Advantage of More ILP with Multiple
    Issue
  • 4.5 Compiler Support for Exploiting ILP
  • 4.6 Hardware Support for Extracting more
    Parallelism
  • 4.7 Studies of ILP
  • Software support of ILP is best when code is
    predictable at compile time.
  • But what if there's no predictability?
  • Here we'll talk about hardware techniques. These
    include:
  • Conditional or Predicated Instructions
  • Hardware Speculation

94
Tell the Hardware To Ignore An Instruction
Hardware Support For Parallelism
Nullified Instructions
  • Avoid branch prediction by turning branches into
    conditionally executed instructions:
  • if (x) then A = B op C else NOP
  • If false, neither store the result nor cause an
    exception
  • Expanded ISAs of Alpha, MIPS, PowerPC, and SPARC have
    conditional move. PA-RISC can annul any
    following instruction.
  • IA-64: 64 1-bit condition fields selected, so
    conditional execution of any instruction
  • Drawbacks of conditional instructions:
  • Still takes a clock, even if annulled
  • Stalls if the condition is evaluated late
  • Complex conditions reduce effectiveness; the
    condition becomes known late in the pipeline.
  • This can be a major win because no time is
    lost by taking a branch!!
95
Tell the Hardware To Ignore An Instruction
Hardware Support For Parallelism
Nullified Instructions
  • Suppose we have the code:
  • if ( VarA == 0 )
  •     VarS = VarT;
  • Previous method:
  •     LD    R1, VarA
  •     BNEZ  R1, Label
  •     LD    R2, VarT
  •     SD    VarS, R2
  • Label:

Nullified method:
    LD      R1, VarA
    LD      R2, VarT
    CMPNNZ  R1, #0        ; compare and nullify next instr. if not zero
    SD      VarS, R2
Label:

Conditional-move method:
    LD      R1, VarA
    LD      R2, VarT
    CMOVZ   VarS, R2, R1  ; conditional move if zero
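
At the source level, the conditional move corresponds to a
branch-free select; a compiler will often emit CMOVZ-style code for a
hypothetical helper like this:

    /* Branch-free version of: if (VarA == 0) VarS = VarT; */
    int select_value(int VarA, int VarS, int VarT) {
        return (VarA == 0) ? VarT : VarS;   /* no branch required */
    }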
96
Hardware Support For Parallelism
Compiler Speculation
Increasing Parallelism
  • The idea here is to move an instruction across
    a branch so as to increase the size of a basic
    block and thus increase parallelism.
  • The primary difficulty is in avoiding exceptions.
    For example,
  • if (a != 0) c = b/a; may cause a divide-by-zero
    error in some cases if the division is hoisted above
    the test.
  • Methods for increasing speculation include:
  • 1. Use a set of status bits (poison bits)
    associated with the registers. They signal that
    an instruction's results are invalid until some
    later time.
  • 2. The result of an instruction isn't written until
    it's certain that the instruction is no longer
    speculative.

97
Hardware Support For Parallelism
Compiler Speculation
Increasing Parallelism
Original code:
       LW    R1, 0(R3)     ; load A
       BNEZ  R1, L1        ; test A
       LW    R1, 0(R2)     ; if clause (A = B)
       J     L2            ; skip else
L1:    ADDI  R1, R1, #4    ; else clause
L2:    SW    0(R3), R1     ; store A

  • Example on page 305.
  • Code for:
  • if ( A == 0 )
  •     A = B;
  • else
  •     A = A + 4;
  • Assume A is at 0(R3) and B is at 0(R2)

Speculated code:
       LW    R1, 0(R3)     ; load A
       LW    R14, 0(R2)    ; speculative load of B
       BEQZ  R1, L3        ; other branch of the if
       ADDI  R14, R1, #4   ; else clause
L3:    SW    0(R3), R14    ; non-speculative store

Note that here only ONE side needs to take a branch!!
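
A C rendering of the same transformation (variable names are
illustrative): the load of B is hoisted above the test, and a single
store commits the result.

    long speculate(long *A, long *B) {
        long a   = *A;        /* load A                      */
        long tmp = *B;        /* speculative load of B       */
        if (a != 0)
            tmp = a + 4;      /* else clause overwrites      */
        *A = tmp;             /* one non-speculative store   */
        return tmp;
    }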
98
Hardware Support For Parallelism
Compiler Speculation
Poison Bits
Speculated code:
       LW    R1, 0(R3)     ; load A
       LW    R14, 0(R2)    ; speculative load of B
       BEQZ  R1, L3        ; other branch of the if
       ADDI  R14, R1, #4   ; else clause
L3:    SW    0(R3), R14    ; non-speculative store

  • In the example above, if the speculative LW
    raises an exception, a poison bit is set on
    that register. Then, if a later instruction tries
    to use the register, an exception is raised at that
    point.
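
A toy C model of the poison-bit bookkeeping (the register count,
names, and fault test are assumptions; real hardware tracks this per
physical register):

    #include <stdbool.h>
    #include <stdlib.h>

    static long reg[32];
    static bool poison[32];

    /* A speculative load that would fault sets the poison bit
       instead of raising an exception immediately. */
    void spec_load(int rd, const long *addr) {
        if (addr == NULL) { poison[rd] = true; return; }  /* defer the fault */
        reg[rd] = *addr;
        poison[rd] = false;
    }

    /* A non-speculative use of a poisoned register raises the exception. */
    long use(int rs) {
        if (poison[rs]) abort();
        return reg[rs];
    }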

99
HW support for More ILP
Hardware Support For Parallelism
Hardware Speculation
  • Need a HW buffer for the results of uncommitted
    instructions: the reorder buffer
  • The reorder buffer can be an operand source
  • Once an operand commits, the result is found in the
    register
  • 3 fields: instruction type, destination, value
  • Use the reorder buffer number instead of the
    reservation station number
  • Discard instructions on mis-predicted branches or
    on exceptions
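
A sketch of one reorder-buffer entry with the three fields the slide
lists, plus a ready flag (type and field names are assumptions):

    typedef enum { ROB_BRANCH, ROB_STORE, ROB_REGOP } InstrType;

    typedef struct {
        InstrType type;    /* instruction type                   */
        int       dest;    /* destination register or store addr */
        long      value;   /* result, once computed              */
        int       ready;   /* has the value arrived yet?         */
    } ROBEntry;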

100
HW support for More ILP
Hardware Support For Parallelism
Hardware Speculation
  • How is this used in practice?
  • Rather than predicting the direction of a branch,
    execute the instructions on both sides!!
  • We know the target of a branch early on, long
    before we know whether it will be taken or not.
  • So begin fetching/executing at that new target
    PC.
  • But also continue fetching/executing as if the
    branch were NOT taken.

101
Studies of ILP
  • 4.1 Instruction Level Parallelism Concepts and
    Challenges
  • 4.2 Overcoming Data Hazards with Dynamic
    Scheduling
  • 4.3 Reducing Branch Penalties with Dynamic
    Hardware Prediction
  • 4.4 Taking Advantage of More ILP with Multiple
    Issue
  • 4.5 Compiler Support for Exploiting ILP
  • 4.6 Hardware Support for Extracting more
    Parallelism
  • 4.7 Studies of ILP
  • There are conflicting studies of the amount of
    improvement available, depending on
  • Benchmarks (vectorized FP Fortran vs. integer C
    programs)
  • Hardware sophistication
  • Compiler sophistication
  • How much ILP is available using existing
    mechanisms with increasing HW budgets?
  • Do we need to invent new HW/SW mechanisms to stay
    on the processor performance curve?

102
Limits to ILP
Studies of ILP
  • Initial HW model here: MIPS compilers.
  • Assumptions for an ideal/perfect machine to start:
  • 1. Register renaming: infinite virtual registers,
    and all WAW & WAR hazards are avoided
  • 2. Branch prediction: perfect; no mispredictions
  • 3. Jump prediction: all jumps perfectly predicted
    => machine with perfect speculation & an
    unbounded buffer of instructions available
  • 4. Memory-address alias analysis: addresses are
    known; a store can be moved before a load
    provided the addresses are not equal
  • 1-cycle latency for all instructions; an unlimited
    number of instructions issued per clock cycle

103
Upper Limit to ILP: Ideal Machine (Figure 4.38,
page 319)
Studies of ILP
This is the amount of parallelism when there are
no branch mispredictions and we're limited only
by data dependencies.

[Figure 4.38: IPC (instructions that could theoretically be issued
per cycle) on the ideal machine - FP programs 75-150, integer
programs 18-60]
104
Impact of Realistic Branch Prediction
Studies of ILP
  • What parallelism do we get when we don't allow
    perfect branch prediction, as in the last
    picture, but assume some realistic model?
  • Possibilities include:
  • 1. Perfect - all branches are perfectly
    predicted (the last slide)
  • 2. Selective history predictor - a complicated
    but doable mechanism for selection.
  • 3. Standard 2-bit history predictor with 512
    2-bit entries.
  • 4. Static prediction based on the past history of the
    program.
  • 5. None - parallelism is limited to the basic block.

105
Selective History Predictor
Studies of ILP
Bonus!!
[Figure: selective history predictor. An 8K x 2-bit selector table
chooses between a non-correlating 2-bit predictor (8096 x 2 bits,
indexed by branch address) and a correlating predictor (2048 x 4 x 2
bits, using 2 bits of global history); counter values 11 and 10
predict taken, 01 and 00 predict not taken.]
106
Impact of Realistic Branch Prediction (Figure
4.42, Page 325)
Studies of ILP
  • Limiting the type of branch prediction.

[Figure 4.42: IPC under different predictors (perfect, selective
history, 512-entry BHT, profile-based, and no prediction) -
FP 15-45, integer 6-12]
107
More Realistic HW: Register Impact (Figure 4.44,
Page 328)
Studies of ILP
  • Effect of limiting the number of renaming
    registers.

[Figure 4.44: IPC vs. number of renaming registers (infinite, 256,
128, 64, 32, none) - FP 11-45, integer 5-15]
108
More Realistic HW: Alias Impact (Figure 4.46,
Page 330)
Studies of ILP
  • What happens when there may be conflicts with
    memory aliasing?

[Figure 4.46: IPC vs. memory alias-analysis model (perfect,
global/stack perfect with heap conflicts, inspection of assembly,
none) - FP 4-45 (Fortran, no heap), integer 4-9]
109
Summary
  • 4.1 Instruction Level Parallelism Concepts and
    Challenges
  • 4.2 Overcoming Data Hazards with Dynamic
    Scheduling
  • 4.3 Reducing Branch Penalties with Dynamic
    Hardware Prediction
  • 4.4 Taking Advantage of More ILP with Multiple
    Issue
  • 4.5 Compiler Support for Exploiting ILP
  • 4.6 Hardware Support for Extracting more
    Parallelism
  • 4.7 Studies of ILP