Advanced Computer Architecture

1
Advanced Computer Architecture
  • Chapter 4
  • Advanced Pipelining
  • Ioannis Papaefstathiou
  • CS 590.25
  • Easter 2003
  • (thanks to Hennessy & Patterson)

2
Chapter Overview
  • 4.1 Instruction Level Parallelism Concepts and
    Challenges
  • 4.2 Overcoming Data Hazards with Dynamic
    Scheduling
  • 4.3 Reducing Branch Penalties with Dynamic
    Hardware Prediction
  • 4.4 Taking Advantage of More ILP with Multiple
    Issue
  • 4.5 Compiler Support for Exploiting ILP
  • 4.6 Hardware Support for Extracting more
    Parallelism
  • 4.7 Studies of ILP

3
Chapter Overview
Technique                                    Reduces                        Section
Loop Unrolling                               Control stalls                 4.1
Basic Pipeline Scheduling                    RAW stalls                     4.1
Dynamic Scheduling with Scoreboarding        RAW stalls                     4.2
Dynamic Scheduling with Register Renaming    WAR and WAW stalls             4.2
Dynamic Branch Prediction                    Control stalls                 4.3
Issue Multiple Instructions per Cycle        Ideal CPI                      4.4
Compiler Dependence Analysis                 Ideal CPI & data stalls        4.5
Software Pipelining and Trace Scheduling     Ideal CPI & data stalls        4.5
Speculation                                  All data & control stalls      4.6
Dynamic Memory Disambiguation                RAW stalls involving memory    4.2, 4.6
4
Instruction Level Parallelism
  • 4.1 Instruction Level Parallelism Concepts and
    Challenges
  • 4.2 Overcoming Data Hazards with Dynamic
    Scheduling
  • 4.3 Reducing Branch Penalties with Dynamic
    Hardware Prediction
  • 4.4 Taking Advantage of More ILP with Multiple
    Issue
  • 4.5 Compiler Support for Exploiting ILP
  • 4.6 Hardware Support for Extracting more
    Parallelism
  • 4.7 Studies of ILP
  • ILP is the observation that many instructions in
    code don't depend on each other, which means it's
    possible to execute those instructions in parallel.
  • This is easier said than done.
  • Issues include
  • Building compilers to analyze the code,
  • Building hardware to be even smarter than that
    code.
  • This section looks at some of the problems to be
    solved.

5
Terminology
Instruction Level Parallelism
Pipeline Scheduling and Loop Unrolling
  • Basic Block - the set of instructions between
    entry points and between branches. A basic block
    has only one entry and one exit. Typically it
    is about 6 instructions long.
  • Loop Level Parallelism - parallelism that
    exists within a loop. Such parallelism can cross
    loop iterations.
  • Loop Unrolling - either the compiler or the
    hardware exploits the parallelism
    inherent in the loop.

6
Simple Loop and its Assembler Equivalent
Instruction Level Parallelism
Pipeline Scheduling and Loop Unrolling
This is a clean and simple example!
  • for (i=1; i<=1000; i++) x[i] = x[i] + s;

Loop:  LD    F0,0(R1)     ; F0 = vector element
       ADDD  F4,F0,F2     ; add scalar from F2
       SD    0(R1),F4     ; store result
       SUBI  R1,R1,#8     ; decrement pointer, 8 bytes (DW)
       BNEZ  R1,Loop      ; branch if R1 != zero
       NOP                ; delayed branch slot
7
FP Loop Hazards
Instruction Level Parallelism
Pipeline Scheduling and Loop Unrolling
Loop:  LD    F0,0(R1)     ; F0 = vector element
       ADDD  F4,F0,F2     ; add scalar in F2
       SD    0(R1),F4     ; store result
       SUBI  R1,R1,#8     ; decrement pointer, 8 bytes (DW)
       BNEZ  R1,Loop      ; branch if R1 != zero
       NOP                ; delayed branch slot

Instruction producing result   Instruction using result   Latency (clock cycles)
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0
Integer op                     Integer op                 0
Where are the stalls?
8
FP Loop Showing Stalls
Instruction Level Parallelism
Pipeline Scheduling and Loop Unrolling
1   Loop:  LD    F0,0(R1)     ; F0 = vector element
2          stall
3          ADDD  F4,F0,F2     ; add scalar in F2
4          stall
5          stall
6          SD    0(R1),F4     ; store result
7          SUBI  R1,R1,#8     ; decrement pointer, 8 bytes (DW)
8          stall
9          BNEZ  R1,Loop      ; branch if R1 != zero
10         stall              ; delayed branch slot

Instruction producing result   Instruction using result   Latency (clock cycles)
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1
Load double                    Store double               0
Integer op                     Integer op                 0
  • 10 clocks per iteration. Can we rewrite the code to
    minimize stalls?
9
Scheduled FP Loop Minimizing Stalls
Instruction Level Parallelism
Pipeline Scheduling and Loop Unrolling
1   Loop:  LD    F0,0(R1)
2          SUBI  R1,R1,#8
3          ADDD  F4,F0,F2
4          stall
5          BNEZ  R1,Loop      ; delayed branch
6          SD    8(R1),F4     ; altered when moved past SUBI

The stall remains because SD can't proceed until ADDD completes.
We swapped BNEZ and SD by changing the address of SD.

Instruction producing result   Instruction using result   Latency (clock cycles)
FP ALU op                      Another FP ALU op          3
FP ALU op                      Store double               2
Load double                    FP ALU op                  1

  • Now 6 clocks per iteration. Next, unroll the loop 4
    times to make it faster.

10
Unroll Loop Four Times (straightforward way)
Instruction Level Parallelism
Pipeline Scheduling and Loop Unrolling
1   Loop:  LD    F0,0(R1)
2          stall
3          ADDD  F4,F0,F2
4          stall
5          stall
6          SD    0(R1),F4
7          LD    F6,-8(R1)
8          stall
9          ADDD  F8,F6,F2
10         stall
11         stall
12         SD    -8(R1),F8
13         LD    F10,-16(R1)
14         stall
15         ADDD  F12,F10,F2
16         stall
17         stall
18         SD    -16(R1),F12
19         LD    F14,-24(R1)
20         stall
21         ADDD  F16,F14,F2
22         stall
23         stall
24         SD    -24(R1),F16
25         SUBI  R1,R1,#32
26         BNEZ  R1,LOOP
27         stall
28         NOP

15 + 4 x (1+2) + 1 = 28 clock cycles, or 7 per
iteration. Assumes the iteration count is a multiple of 4.
  • Rewrite loop to minimize stalls.

11
Unrolled Loop That Minimizes Stalls
Instruction Level Parallelism
Pipeline Scheduling and Loop Unrolling
  • What assumptions were made when we moved the code?
  • OK to move the store past SUBI even though SUBI
    changes the register
  • OK to move the loads before the stores: do we get the
    right data?
  • When is it safe for the compiler to make such changes?

1   Loop:  LD    F0,0(R1)
2          LD    F6,-8(R1)
3          LD    F10,-16(R1)
4          LD    F14,-24(R1)
5          ADDD  F4,F0,F2
6          ADDD  F8,F6,F2
7          ADDD  F12,F10,F2
8          ADDD  F16,F14,F2
9          SD    0(R1),F4
10         SD    -8(R1),F8
11         SD    -16(R1),F12
12         SUBI  R1,R1,#32
13         BNEZ  R1,LOOP
14         SD    8(R1),F16    ; 8-32 = -24

14 clock cycles, or 3.5 per iteration.
No stalls!!
12
Summary of Loop Unrolling Example
Instruction Level Parallelism
Pipeline Scheduling and Loop Unrolling
  • Determine that it was legal to move the SD after
    the SUBI and BNEZ, and find the amount to adjust
    the SD offset.
  • Determine that unrolling the loop would be useful
    by finding that the loop iterations were
    independent, except for the loop maintenance
    code.
  • Use different registers to avoid unnecessary
    constraints that would be forced by using the
    same registers for different computations.
  • Eliminate the extra tests and branches and adjust
    the loop maintenance code.
  • Determine that the loads and stores in the
    unrolled loop can be interchanged by observing
    that the loads and stores from different
    iterations are independent. This requires
    analyzing the memory addresses and finding that
    they do not refer to the same address.
  • Schedule the code, preserving any dependences
    needed to yield the same result as the original
    code. (A source-level sketch of the result follows below.)
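
To make the transformation concrete, here is a minimal C sketch (an
illustration, not the deck's code) of 4-way unrolling of the running
loop; the function name and the clean-up loop for trip counts that
are not a multiple of 4 are assumptions:

    /* 4-way unrolled version of: for (i=0; i<n; i++) x[i] = x[i] + s; */
    void scale_add(double *x, double s, int n) {
        int i;
        for (i = 0; i + 3 < n; i += 4) {   /* four independent operations per trip */
            x[i]     = x[i]     + s;
            x[i + 1] = x[i + 1] + s;
            x[i + 2] = x[i + 2] + s;
            x[i + 3] = x[i + 3] + s;
        }
        for (; i < n; i++)                 /* clean-up when n % 4 != 0 */
            x[i] = x[i] + s;
    }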

13
Compiler Perspectives on Code Movement
Instruction Level Parallelism
Dependencies
  • The compiler is concerned about dependencies in the
    program; it is not concerned with whether a HW hazard
    arises on a given pipeline.
  • It tries to schedule code to avoid hazards.
  • Looks for data dependencies (RAW if a hazard for
    HW)
  • Instruction i produces a result used by
    instruction j, or
  • Instruction j is data dependent on instruction k,
    and instruction k is data dependent on
    instruction i.
  • If dependent, they can't execute in parallel
  • Easy to determine for registers (fixed names)
  • Hard for memory
  • Does 100(R4) = 20(R6)?
  • From different loop iterations, does 20(R6) =
    20(R6)?

14
Compiler Perspectives on Code Movement
Instruction Level Parallelism
Data Dependencies
Where are the data dependencies?
1   Loop:  LD    F0,0(R1)
2          ADDD  F4,F0,F2
3          SUBI  R1,R1,#8
4          BNEZ  R1,Loop      ; delayed branch
5          SD    8(R1),F4     ; altered when moved past SUBI
15
Compiler Perspectives on Code Movement
Instruction Level Parallelism
Name Dependencies
  • Another kind of dependence, called name
    dependence: two instructions use the same name
    (register or memory location) but don't exchange
    data
  • Anti-dependence (WAR if a hazard for HW)
  • Instruction j writes a register or memory
    location that instruction i reads from and
    instruction i is executed first
  • Output dependence (WAW if a hazard for HW)
  • Instruction i and instruction j write the same
    register or memory location ordering between
    instructions must be preserved.
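
A three-statement sketch (with hypothetical scalar variables) showing
both kinds of name dependence at the source level:

    a = b + c;   /* i: reads b                                    */
    b = d + e;   /* j: writes b - anti-dependence (WAR) with i    */
    b = f + g;   /* k: writes b - output dependence (WAW) with j  */

Renaming the destinations of j and k (say, to b1 and b2) removes both.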

16
Compiler Perspectives on Code Movement
Instruction Level Parallelism
Name Dependencies
1   Loop:  LD    F0,0(R1)
2          ADDD  F4,F0,F2
3          SD    0(R1),F4
4          LD    F0,-8(R1)
5          ADDD  F4,F0,F2
6          SD    -8(R1),F4
7          LD    F0,-16(R1)
8          ADDD  F4,F0,F2
9          SD    -16(R1),F4
10         LD    F0,-24(R1)
11         ADDD  F4,F0,F2
12         SD    -24(R1),F4
13         SUBI  R1,R1,#32
14         BNEZ  R1,LOOP
15         NOP

How can we remove these dependencies?
Where are the name dependencies?
No data is passed in F0, but we can't reuse F0 in
instruction 4.
17
Where are the name dependencies?
Instruction Level Parallelism
Name Dependencies
Compiler Perspectives on Code Movement
1   Loop:  LD    F0,0(R1)
2          ADDD  F4,F0,F2
3          SD    0(R1),F4
4          LD    F6,-8(R1)
5          ADDD  F8,F6,F2
6          SD    -8(R1),F8
7          LD    F10,-16(R1)
8          ADDD  F12,F10,F2
9          SD    -16(R1),F12
10         LD    F14,-24(R1)
11         ADDD  F16,F14,F2
12         SD    -24(R1),F16
13         SUBI  R1,R1,#32
14         BNEZ  R1,LOOP
15         NOP

This is called register renaming.
Now there are data dependencies only: F0 appears
only in instructions 1 and 2.
18
Compiler Perspectives on Code Movement
Instruction Level Parallelism
Name Dependencies
  • Again, name dependencies are hard for memory
    accesses
  • Does 100(R4) = 20(R6)?
  • From different loop iterations, does 20(R6) =
    20(R6)?
  • Our example required the compiler to know that if R1
    doesn't change, then 0(R1) != -8(R1) != -16(R1) !=
    -24(R1)
  • There were no dependencies between some
    loads and stores, so they could be moved around
    each other

19
Instruction Level Parallelism
Control Dependencies
Compiler Perspectives on Code Movement
  • Final kind of dependence called control
    dependence
  • Example
  • if p1 { S1; }
  • if p2 { S2; }
  • S1 is control dependent on p1, and S2 is control
    dependent on p2 but not on p1.

20
Instruction Level Parallelism
Control Dependencies
Compiler Perspectives on Code Movement
  • Two (obvious) constraints on control dependences
  • An instruction that is control dependent on a
    branch cannot be moved before the branch so
    that its execution is no longer controlled by the
    branch.
  • An instruction that is not control dependent on a
    branch cannot be moved to after the branch so
    that its execution is controlled by the branch.
  • Control dependencies can be relaxed to get parallelism;
    we get the same effect if we preserve the order of
    exceptions (an address in a register is checked by a
    branch before use) and the data flow (the value in a
    register depends on the branch)

21
Where are the control dependencies?
Instruction Level Parallelism
Control Dependencies
Compiler Perspectives on Code Movement
1   Loop:  LD    F0,0(R1)
2          ADDD  F4,F0,F2
3          SD    0(R1),F4
4          SUBI  R1,R1,#8
5          BEQZ  R1,exit
6          LD    F0,0(R1)
7          ADDD  F4,F0,F2
8          SD    0(R1),F4
9          SUBI  R1,R1,#8
10         BEQZ  R1,exit
11         LD    F0,0(R1)
12         ADDD  F4,F0,F2
13         SD    0(R1),F4
14         SUBI  R1,R1,#8
15         BEQZ  R1,exit
           ....
22
When Safe to Unroll Loop?
Instruction Level Parallelism
Loop Level Parallelism
  • Example: Where are the data dependencies? (A, B, C
    are distinct, non-overlapping arrays)
  • 1. S2 uses the value A[i+1] computed by S1 in
    the same iteration.
  • 2. S1 uses a value computed by S1 in an earlier
    iteration, since iteration i computes A[i+1],
    which is read in iteration i+1. The same is true
    of S2 for B[i] and B[i+1]. This is a
    loop-carried dependence between iterations.
  • This implies that the iterations are dependent and can't
    be executed in parallel.
  • Note this was not the case for our prior example: each
    iteration was distinct.

for (i=1; i<=100; i=i+1) {
    A[i+1] = A[i] + C[i];      /* S1 */
    B[i+1] = B[i] + A[i+1];    /* S2 */
}
23
When Safe to Unroll Loop?
Instruction Level Parallelism
Loop Level Parallelism
  • Example: Where are the data dependencies? (A, B, C, D
    are distinct, non-overlapping arrays)
  • 1. There is no dependence from S1 to S2. If there
    were, there would be a cycle in the
    dependencies and the loop would not be parallel.
    Since this other dependence is absent,
    interchanging the two statements will not affect
    the execution of S2.
  • 2. On the first iteration of the loop,
    statement S1 depends on the value of B[1]
    computed prior to initiating the loop.

for (i=1; i<=100; i=i+1) {
    A[i] = A[i] + B[i];        /* S1 */
    B[i+1] = C[i] + D[i];      /* S2 */
}
24
Now Safe to Unroll Loop? (p. 240)
Instruction Level Parallelism
Loop Level Parallelism
for (i=1; i<=100; i=i+1) {
    A[i] = A[i] + B[i];        /* S1 */
    B[i+1] = C[i] + D[i];      /* S2 */
}

No circular dependencies.
OLD
The loop carried a dependence on B.

  • A[1] = A[1] + B[1];
  • for (i=1; i<=99; i=i+1) {
        B[i+1] = C[i] + D[i];
        A[i+1] = A[i+1] + B[i+1];
    }
  • B[101] = C[100] + D[100];

We have eliminated the loop-carried dependence.
NEW
25
Dynamic Scheduling
  • 4.1 Instruction Level Parallelism Concepts and
    Challenges
  • 4.2 Overcoming Data Hazards with Dynamic
    Scheduling
  • 4.3 Reducing Branch Penalties with Dynamic
    Hardware Prediction
  • 4.4 Taking Advantage of More ILP with Multiple
    Issue
  • 4.5 Compiler Support for Exploiting ILP
  • 4.6 Hardware Support for Extracting more
    Parallelism
  • 4.7 Studies of ILP
  • Dynamic Scheduling is when the hardware
    rearranges the order of instruction execution to
    reduce stalls.
  • Advantages
  • Dependencies unknown at compile time can be
    handled by the hardware.
  • Code compiled for one type of pipeline can be
    efficiently run on another.
  • Disadvantages
  • Hardware much more complex.

26
Dynamic Scheduling
The idea
HW Schemes Instruction Parallelism
  • Why do it in HW at run time?
  • Works when we can't know the real dependences at compile
    time
  • Compiler is simpler
  • Code for one machine runs well on another
  • Key Idea: Allow instructions behind a stall to
    proceed.
  • Key Idea: Instructions execute in parallel.
    There are multiple execution units, so use them.
  • DIVD F0,F2,F4
  • ADDD F10,F0,F8
  • SUBD F12,F8,F14
  • Enables out-of-order execution => out-of-order
    completion

27
Dynamic Scheduling
The idea
HW Schemes Instruction Parallelism
  • Out-of-order execution divides the ID stage:
  • 1. Issue: decode instructions, check for
    structural hazards
  • 2. Read operands: wait until no data hazards, then
    read operands
  • Scoreboards allow instructions to execute whenever
    1 & 2 hold, not waiting for prior instructions.
  • A scoreboard is a data structure that provides
    the information necessary for all pieces of the
    processor to work together.
  • We will use: in-order issue, out-of-order
    execution, out-of-order commit (also called
    completion)
  • First used in the CDC 6600. Our example is modified here
    for DLX.
  • The CDC had 4 FP units, 5 memory reference units, and 7
    integer units.
  • DLX has 2 FP multipliers, 1 FP adder, 1 FP divider, and
    1 integer unit.

28
Scoreboard Implications
Dynamic Scheduling
Using A Scoreboard
  • Out-of-order completion => WAR, WAW hazards?
  • Solutions for WAR
  • Queue both the operation and copies of its
    operands
  • Read registers only during the Read Operands stage
  • For WAW, must detect the hazard and stall until the
    other instruction completes
  • Need to have multiple instructions in the execution
    phase => multiple execution units or pipelined
    execution units
  • The scoreboard keeps track of dependencies and the state
    of operations
  • The scoreboard replaces ID, EX, WB with 4 stages

29
Four Stages of Scoreboard Control
Dynamic Scheduling
Using A Scoreboard
  • 1. Issue: decode instructions & check for
    structural hazards (ID1)
  • If a functional unit for the instruction is free
    and no other active instruction has the same
    destination register (WAW), the scoreboard issues
    the instruction to the functional unit and
    updates its internal data structure.
  • If a structural or WAW hazard exists, then the
    instruction issue stalls, and no further
    instructions will issue until these hazards are
    cleared.

30
Four Stages of Scoreboard Control
Dynamic Scheduling
Using A Scoreboard
  • 2. Read operands: wait until no data hazards,
    then read operands (ID2)
  • A source operand is available if no earlier
    issued active instruction is going to write it,
    or if the register containing the operand is
    being written by a currently active functional
    unit.
  • When the source operands are available, the
    scoreboard tells the functional unit to proceed
    to read the operands from the registers and begin
    execution. The scoreboard resolves RAW hazards
    dynamically in this step, and instructions may be
    sent into execution out of order.

31
Four Stages of Scoreboard Control
Dynamic Scheduling
Using A Scoreboard
  • 3. Execution: operate on operands (EX)
  • The functional unit begins execution upon
    receiving operands. When the result is ready, it
    notifies the scoreboard that it has completed
    execution.
  • 4. Write result: finish execution (WB)
  • Once the scoreboard is aware that the
    functional unit has completed execution, the
    scoreboard checks for WAR hazards. If none, it
    writes results. If WAR, then it stalls the
    instruction.
  • Example
  • DIVD F0,F2,F4
  • ADDD F10,F0,F8
  • SUBD F8,F8,F14
  • Scoreboard would stall SUBD until ADDD reads
    operands

32
Three Parts of the Scoreboard
Dynamic Scheduling
Using A Scoreboard
  • 1. Instruction status: which of the 4 steps the
    instruction is in
  • 2. Functional unit status: indicates the state of
    the functional unit (FU). 9 fields for each
    functional unit
  • Busy: indicates whether the unit is busy or not
  • Op: operation to perform in the unit (e.g., + or -)
  • Fi: destination register
  • Fj, Fk: source-register numbers
  • Qj, Qk: functional units producing source
    registers Fj, Fk
  • Rj, Rk: flags indicating when Fj, Fk are ready
  • 3. Register result status: indicates which
    functional unit will write each register, if one
    exists. Blank when no pending instructions will
    write that register
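
As a rough illustration (field names follow the slide; the C layout
is an assumption), the per-functional-unit record could look like:

    /* One scoreboard entry per functional unit (sketch). */
    typedef struct {
        int Busy;       /* is the unit busy?                    */
        int Op;         /* operation to perform, e.g. + or -    */
        int Fi;         /* destination register number          */
        int Fj, Fk;     /* source register numbers              */
        int Qj, Qk;     /* FUs producing Fj, Fk (0 = none)      */
        int Rj, Rk;     /* flags: are Fj, Fk ready to be read?  */
    } FUStatus;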

33
Detailed Scoreboard Pipeline Control
Dynamic Scheduling
Using A Scoreboard
Instruction status   Wait until                                   Bookkeeping
Issue                Not Busy(FU) and not Result(D)               Busy(FU) <- yes; Op(FU) <- op; Fi(FU) <- D;
                                                                  Fj(FU) <- S1; Fk(FU) <- S2; Qj <- Result(S1);
                                                                  Qk <- Result(S2); Rj <- not Qj; Rk <- not Qk;
                                                                  Result(D) <- FU
Read operands        Rj and Rk                                    Rj <- No; Rk <- No
Execution complete   Functional unit done                         (none)
Write result         for all f: (Fj(f) != Fi(FU) or Rj(f) = No)   for all f: (if Qj(f) = FU then Rj(f) <- Yes);
                     and (Fk(f) != Fi(FU) or Rk(f) = No)          for all f: (if Qk(f) = FU then Rk(f) <- Yes);
                                                                  Result(Fi(FU)) <- 0; Busy(FU) <- No
34
Scoreboard Example
Dynamic Scheduling
Using A Scoreboard
This is the sample code we'll be working with in
the example:

    LD    F6, 34(R2)
    LD    F2, 45(R3)
    MULT  F0, F2, F4
    SUBD  F8, F6, F2
    DIVD  F10, F0, F6
    ADDD  F6, F8, F2

What are the hazards in this code?

Latencies (clock cycles):
LD 1, MULT 10, SUBD 2, DIVD 40, ADDD 2
35
Scoreboard Example
Dynamic Scheduling
Using A Scoreboard
36
Scoreboard Example Cycle 1
Dynamic Scheduling
Using A Scoreboard
Issue LD 1
Shows in which cycle the operation occurred.
37
Scoreboard Example Cycle 2
Dynamic Scheduling
Using A Scoreboard
LD 2 can't issue since the integer unit is
busy. MULT can't issue because we require
in-order issue.
38
Scoreboard Example Cycle 3
Dynamic Scheduling
Using A Scoreboard
39
Scoreboard Example Cycle 4
Dynamic Scheduling
Using A Scoreboard
40
Scoreboard Example Cycle 5
Dynamic Scheduling
Using A Scoreboard
Issue LD 2 since integer unit is now free.
41
Scoreboard Example Cycle 6
Dynamic Scheduling
Using A Scoreboard
Issue MULT.
42
Scoreboard Example Cycle 7
Dynamic Scheduling
Using A Scoreboard
MULT can't read its operands (F2) because LD 2
hasn't finished.
43
Scoreboard Example Cycle 8a
Dynamic Scheduling
Using A Scoreboard
DIVD issues. MULT and SUBD both waiting for F2.
44
Scoreboard Example Cycle 8b
Dynamic Scheduling
Using A Scoreboard
LD 2 writes F2.
45
Scoreboard Example Cycle 9
Dynamic Scheduling
Using A Scoreboard
Now MULT and SUBD can both read F2. How can both
instructions do this at the same time??
46
Scoreboard Example Cycle 11
Dynamic Scheduling
Using A Scoreboard
ADDD can't start because the add unit is busy.
47
Scoreboard Example Cycle 12
Dynamic Scheduling
Using A Scoreboard
SUBD finishes. DIVD waiting for F0.
48
Scoreboard Example Cycle 13
Dynamic Scheduling
Using A Scoreboard
ADDD issues.
49
Scoreboard Example Cycle 14
Dynamic Scheduling
Using A Scoreboard
50
Scoreboard Example Cycle 15
Dynamic Scheduling
Using A Scoreboard
51
Scoreboard Example Cycle 16
Dynamic Scheduling
Using A Scoreboard
52
Scoreboard Example Cycle 17
Dynamic Scheduling
Using A Scoreboard
ADDD can't write because DIVD has not yet read F6. WAR!
53
Scoreboard Example Cycle 18
Dynamic Scheduling
Using A Scoreboard
Nothing Happens!!
54
Scoreboard Example Cycle 19
Dynamic Scheduling
Using A Scoreboard
MULT completes execution.
55
Scoreboard Example Cycle 20
Dynamic Scheduling
Using A Scoreboard
MULT writes.
56
Scoreboard Example Cycle 21
Dynamic Scheduling
Using A Scoreboard
DIVD loads operands
57
Scoreboard Example Cycle 22
Dynamic Scheduling
Using A Scoreboard
Now ADDD can write since WAR removed.
58
Scoreboard Example Cycle 61
Dynamic Scheduling
Using A Scoreboard
DIVD completes execution
59
Scoreboard Example Cycle 62
Dynamic Scheduling
Using A Scoreboard
DONE!!
60
Another Dynamic Algorithm Tomasulo Algorithm
Dynamic Scheduling
Using A Scoreboard
  • For the IBM 360/91, about 3 years after the CDC 6600
    (1966)
  • Goal: high performance without special compilers
  • Differences between the IBM 360 & CDC 6600 ISA
  • IBM has only 2 register specifiers / instruction
    vs. 3 in the CDC 6600
  • IBM has 4 FP registers vs. 8 in the CDC 6600
  • Why study? It led to the Alpha 21264, HP 8000, MIPS
    R10000, Pentium II, PowerPC 604, ...

61
Tomasulo Algorithm vs. Scoreboard
Dynamic Scheduling
Using A Scoreboard
  • Control & buffers distributed with the Function Units
    (FU) vs. centralized in the scoreboard
  • FU buffers, called reservation stations, hold
    pending operands
  • Registers in instructions are replaced by values or
    pointers to reservation stations (RS): called
    register renaming
  • avoids WAR and WAW hazards
  • More reservation stations than registers, so it can
    do optimizations compilers can't
  • Results go to FUs from the RSs, not through registers,
    over a Common Data Bus that broadcasts results to
    all FUs
  • Loads and stores are treated as FUs with RSs as well
  • Integer instructions can go past branches,
    allowing FP ops beyond the basic block in the FP queue

62
Tomasulo Organization
Using A Scoreboard
Dynamic Scheduling
[Figure: Tomasulo organization - the FP op queue and FP registers,
load buffers, and store buffers feed the FP add and FP multiply
reservation stations, all connected by the Common Data Bus]
63
Reservation Station Components
Dynamic Scheduling
Using A Scoreboard
  • Op: operation to perform in the unit (e.g., + or -)
  • Vj, Vk: values of the source operands
  • Store buffers have a V field: the result to be stored
  • Qj, Qk: reservation stations producing the source
    registers (the value to be written)
  • Note: no ready flags as in the scoreboard; Qj,Qk = 0 =>
    ready
  • Store buffers have only Qi, for the RS producing the
    result
  • Busy: indicates reservation station or FU is busy
  • Register result status: indicates which functional
    unit will write each register, if one exists.
    Blank when no pending instructions will
    write that register.
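
A sketch of one reservation-station entry with the fields above (the
C layout is an assumption):

    typedef struct {
        int    Busy;    /* station in use?                           */
        int    Op;      /* operation to perform                      */
        double Vj, Vk;  /* operand values, valid when Qj/Qk == 0     */
        int    Qj, Qk;  /* stations producing the operands, 0 = ready */
    } RSEntry;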

64
Three Stages of Tomasulo Algorithm
Dynamic Scheduling
Using A Scoreboard
  • 1. Issue: get instruction from the FP Op Queue
  • If a reservation station is free (no structural
    hazard), control issues the instruction & sends
    operands (renames registers).
  • 2. Execution: operate on operands (EX)
  • When both operands are ready, execute; if not
    ready, watch the Common Data Bus for the result
  • 3. Write result: finish execution (WB)
  • Write on the Common Data Bus to all awaiting units;
    mark the reservation station available
  • Normal data bus: data + destination ("go
    to" bus)
  • Common data bus: data + source ("come from" bus)
  • 64 bits of data + 4 bits of Functional Unit
    source address
  • Write if it matches the expected Functional Unit
    (produces the result)
  • Does the broadcast

65
Tomasulo Example Cycle 0
Dynamic Scheduling
Using A Scoreboard
66
Review Tomasulo
Dynamic Scheduling
Using A Scoreboard
  • Prevents the registers from being a bottleneck
  • Avoids the WAR and WAW hazards of the scoreboard
  • Allows loop unrolling in HW
  • Not limited to basic blocks (provided there is branch
    prediction)
  • Lasting contributions
  • Dynamic scheduling
  • Register renaming
  • Load/store disambiguation
  • The 360/91's descendants include the PowerPC 604 and 620,
    MIPS R10000, HP PA-8000, and Intel Pentium Pro

67
Dynamic Hardware Prediction
  • 4.1 Instruction Level Parallelism Concepts and
    Challenges
  • 4.2 Overcoming Data Hazards with Dynamic
    Scheduling
  • 4.3 Reducing Branch Penalties with Dynamic
    Hardware Prediction
  • 4.4 Taking Advantage of More ILP with Multiple
    Issue
  • 4.5 Compiler Support for Exploiting ILP
  • 4.6 Hardware Support for Extracting more
    Parallelism
  • 4.7 Studies of ILP

Dynamic branch prediction is the ability of the
hardware to make an educated guess about which
way a branch will go: will the branch be taken
or not? The hardware can look for clues based on
the instructions, or it can use past history; we
will discuss both of these directions.
68
Dynamic Branch Prediction
Dynamic Hardware Prediction
Basic Branch Prediction Branch Prediction Buffers
  • Performance = ƒ(accuracy, cost of misprediction)
  • Branch history: lower bits of the PC address index a
    table of 1-bit values
  • Says whether or not the branch was taken last time
  • Problem: in a loop, a 1-bit BHT causes two
    mispredictions
  • At the end of the loop, when it exits instead of
    looping as before
  • The first time through the loop on the next pass through
    the code, when it predicts exit instead of looping

[Figure: a 1024-entry branch-prediction buffer (entries 0-1023)
indexed by PC address bits 13-2, each entry holding one prediction
bit]
69
Dynamic Branch Prediction
Dynamic Hardware Prediction
Basic Branch Prediction Branch Prediction Buffers
  • Solution: a 2-bit scheme that changes the prediction
    only after two consecutive mispredictions (Figure 4.13, p.
    264); the sketch below simulates one such counter.
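
A minimal C sketch of one 2-bit saturating counter per entry; the
table size and index function are illustrative assumptions:

    #include <stdint.h>

    #define BHT_ENTRIES 1024
    static uint8_t bht[BHT_ENTRIES];   /* 0,1 predict not taken; 2,3 predict taken */

    static unsigned bht_index(uint32_t pc) { return (pc >> 2) % BHT_ENTRIES; }

    int predict_taken(uint32_t pc) {
        return bht[bht_index(pc)] >= 2;
    }

    void train(uint32_t pc, int taken) {   /* saturate at 0 and 3 */
        uint8_t *c = &bht[bht_index(pc)];
        if (taken)  { if (*c < 3) (*c)++; }
        else        { if (*c > 0) (*c)--; }
    }

A misprediction moves the counter one step, so a single anomalous
outcome (such as a loop exit) does not flip a strongly-taken prediction.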

70
BHT Accuracy
Dynamic Hardware Prediction
Basic Branch Prediction Branch Prediction Buffers
  • Mispredictions occur because either
  • The guess was wrong for that branch, or
  • We got the branch history of the wrong branch when
    indexing the table
  • With a 4096-entry table, programs vary from 1%
    mispredictions (nasa7, tomcatv) to 18% (eqntott),
    with spice at 9% and gcc at 12%
  • 4096 entries are about as good as an infinite table,
    but 4096 is a lot of HW

71
Correlating Branches
Dynamic Hardware Prediction
Basic Branch Prediction Branch Prediction Buffers
  • Idea: the taken/not-taken behavior of recently executed
    branches is related to the behavior of the next branch
    (as well as to the history of that branch's own behavior)
  • Then the behavior of recent branches selects between,
    say, four predictions for the next branch, updating
    just that prediction

72
Accuracy of Different Schemes
Dynamic Hardware Prediction
Basic Branch Prediction Branch Prediction Buffers
(Figure 4.21, p. 272)
[Figure: frequency of mispredictions, from 0% up to 18%, comparing a
4096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1024-entry
correlating predictor with 2 bits of history and 2 bits per entry]
73
Branch Target Buffer
Dynamic Hardware Prediction
Basic Branch Prediction Branch Target Buffers
  • Branch Target Buffer (BTB): use the address of the branch
    as an index to get the prediction AND the branch target
    address (if taken)
  • Note: must check for a branch match now, since we
    can't use the wrong branch address (Figure 4.22, p.
    273)
  • Return instruction addresses are predicted with a stack

[Figure: a BTB entry yields the predicted PC and whether the branch
is predicted taken or not taken]
74
Example
Dynamic Hardware Prediction
Basic Branch Prediction Branch Target Buffers
Instruction in buffer   Prediction   Actual branch   Penalty (cycles)
Yes                     Taken        Taken           0
Yes                     Taken        Not taken       2
No                      -            Taken           2
  • Example on page 274.
  • Determine the total branch penalty for a BTB
    using the above penalties. Assume also the
    following:
  • Prediction accuracy of 90% (so 10% of buffer hits
    are mispredicted)
  • Hit rate in the buffer of 90%
  • 60% taken branch frequency.

Branch penalty = (buffer hit rate x percent incorrect predictions x 2)
               + ((1 - buffer hit rate) x taken-branch frequency x 2)
Branch penalty = (90% x 10% x 2) + (10% x 60% x 2)
Branch penalty = 0.18 + 0.12 = 0.30 clock cycles
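
The same arithmetic as a tiny C check, plugging in the numbers from
the example above:

    #include <stdio.h>

    int main(void) {
        double hit = 0.90, incorrect = 0.10, taken = 0.60;
        double penalty = hit * incorrect * 2 + (1.0 - hit) * taken * 2;
        printf("branch penalty = %.2f clock cycles\n", penalty);  /* 0.30 */
        return 0;
    }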
75
Multiple Issue
Multiple issue is the ability of the processor to
start more than one instruction in a given
cycle. Flavor I: Superscalar processors issue a
varying number of instructions per clock and can be
either statically scheduled (by the compiler) or
dynamically scheduled (by the hardware). A superscalar
has a varying number of instructions/cycle
(1 to 8), scheduled by the compiler or by HW
(Tomasulo). Examples: IBM PowerPC, Sun UltraSPARC, DEC
Alpha, HP 8000
  • 4.1 Instruction Level Parallelism Concepts and
    Challenges
  • 4.2 Overcoming Data Hazards with Dynamic
    Scheduling
  • 4.3 Reducing Branch Penalties with Dynamic
    Hardware Prediction
  • 4.4 Taking Advantage of More ILP with Multiple
    Issue
  • 4.5 Compiler Support for Exploiting ILP
  • 4.6 Hardware Support for Extracting more
    Parallelism
  • 4.7 Studies of ILP

76
Issuing Multiple Instructions/Cycle
Multiple Issue
  • Flavor II:
  • VLIW - Very Long Instruction Word - issues a
    fixed number of instructions formatted either as
    one very large instruction or as a fixed packet
    of smaller instructions.
  • A fixed number of instructions (4-16) scheduled by
    the compiler; operations are put into wide templates
  • Joint HP/Intel agreement in 1999/2000
  • Intel Architecture-64 (IA-64): 64-bit address
  • Style: Explicitly Parallel Instruction Computer
    (EPIC)

77
Issuing Multiple Instructions/Cycle
Multiple Issue
  • Flavor II - continued
  • 3 instructions in 128-bit groups; a field
    determines whether the instructions are dependent or
    independent
  • Smaller code size than old VLIW, larger than
    x86/RISC
  • Groups can be linked to show independence of > 3
    instructions
  • 64 integer registers + 64 floating point
    registers
  • Not separate register files per functional unit as in
    old VLIW
  • Hardware checks dependencies (interlocks =>
    binary compatibility over time)
  • Predicated execution (select 1 out of 64 1-bit
    flags) => 40% fewer mispredictions?
  • IA-64 is the name of the instruction set architecture;
    EPIC is the type
  • Merced was the name of the first implementation
    (1999/2000?)

78
Issuing Multiple Instructions/Cycle
Multiple Issue
A SuperScalar Version of DLX
  • In our DLX example, we can handle 2
    instructions/cycle:
  • Floating Point
  • Anything Else
  • Fetch 64 bits/clock cycle; Int on left, FP on
    right
  • Can only issue the 2nd instruction if the 1st
    instruction issues
  • More ports for the FP registers to do an FP load & FP
    op in a pair
  • Type              Pipe Stages
  • Int. instruction  IF ID EX MEM WB
  • FP instruction    IF ID EX MEM WB
  • Int. instruction     IF ID EX MEM WB
  • FP instruction       IF ID EX MEM WB
  • Int. instruction        IF ID EX MEM WB
  • FP instruction          IF ID EX MEM WB
  • A 1-cycle load delay delays 3
    instructions in the superscalar:
  • the instruction in the right half can't use it, nor
    can the instructions in the next slot

79
Unrolled Loop Minimizes Stalls for Scalar
Multiple Issue
A SuperScalar Version of DLX
1   Loop:  LD    F0,0(R1)
2          LD    F6,-8(R1)
3          LD    F10,-16(R1)
4          LD    F14,-24(R1)
5          ADDD  F4,F0,F2
6          ADDD  F8,F6,F2
7          ADDD  F12,F10,F2
8          ADDD  F16,F14,F2
9          SD    0(R1),F4
10         SD    -8(R1),F8
11         SD    -16(R1),F12
12         SUBI  R1,R1,#32
13         BNEZ  R1,LOOP
14         SD    8(R1),F16    ; 8-32 = -24

14 clock cycles, or 3.5 per iteration

Latencies: LD to ADDD: 1 cycle; ADDD to SD: 2 cycles
80
Loop Unrolling in Superscalar
Multiple Issue
A SuperScalar Version of DLX
  Integer instruction     FP instruction      Clock cycle
  Loop: LD F0,0(R1)                           1
        LD F6,-8(R1)                          2
        LD F10,-16(R1)    ADDD F4,F0,F2       3
        LD F14,-24(R1)    ADDD F8,F6,F2       4
        LD F18,-32(R1)    ADDD F12,F10,F2     5
        SD 0(R1),F4       ADDD F16,F14,F2     6
        SD -8(R1),F8      ADDD F20,F18,F2     7
        SD -16(R1),F12                        8
        SD -24(R1),F16                        9
        SUBI R1,R1,#40                        10
        BNEZ R1,LOOP                          11
        SD 8(R1),F20                          12

  • Unrolled 5 times to avoid delays (1 more due to SS)
  • 12 clocks, or 2.4 clocks per iteration

81
Dynamic Scheduling in Superscalar
Multiple Issue
Multiple Instruction Issue Dynamic Scheduling
  • Code compiled for the scalar version will run poorly
    on the superscalar
  • May want the code to vary depending on how wide the
    superscalar is
  • Simple approach: separate Tomasulo control and
    separate reservation stations for the integer FU/registers
    and for the FP FU/registers

82
Dynamic Scheduling in Superscalar
Multiple Issue
Multiple Instruction Issue Dynamic Scheduling
  • How do we do instruction issue with two instructions
    and keep in-order instruction issue for Tomasulo?
  • Issue at 2X the clock rate, so that issue remains in
    order
  • Only FP loads might cause a dependency between
    integer and FP issue
  • Replace the load reservation station with a load
    queue; operands must be read in the order they
    are fetched
  • A load checks addresses in the store queue to avoid RAW
    violations
  • A store checks addresses in the load queue to avoid
    WAR and WAW

83
Performance of Dynamic Superscalar
Multiple Issue
Multiple Instruction Issue Dynamic Scheduling
  Iteration  Instruction      Issues   Executes   Writes result
  no.                         (clock-cycle number)
  1          LD F0,0(R1)      1        2          4
  1          ADDD F4,F0,F2    1        5          8
  1          SD 0(R1),F4      2        9
  1          SUBI R1,R1,#8    3        4          5
  1          BNEZ R1,LOOP     4        5
  2          LD F0,0(R1)      5        6          8
  2          ADDD F4,F0,F2    5        9          12
  2          SD 0(R1),F4      6        13
  2          SUBI R1,R1,#8    7        8          9
  2          BNEZ R1,LOOP     8        9

  • 4 clocks per iteration
  • Branches and decrements still take 1 clock cycle

84
Loop Unrolling in VLIW
Multiple Issue
VLIW
  Memory ref 1     Memory ref 2     FP op 1          FP op 2          Int. op/branch  Clock
  LD F0,0(R1)      LD F6,-8(R1)                                                       1
  LD F10,-16(R1)   LD F14,-24(R1)                                                     2
  LD F18,-32(R1)   LD F22,-40(R1)   ADDD F4,F0,F2    ADDD F8,F6,F2                    3
  LD F26,-48(R1)                    ADDD F12,F10,F2  ADDD F16,F14,F2                  4
                                    ADDD F20,F18,F2  ADDD F24,F22,F2                  5
  SD 0(R1),F4      SD -8(R1),F8     ADDD F28,F26,F2                                   6
  SD -16(R1),F12   SD -24(R1),F16                                                     7
  SD -32(R1),F20   SD -40(R1),F24                                     SUBI R1,R1,#48  8
  SD -0(R1),F28                                                       BNEZ R1,LOOP    9

  • Unrolled 7 times to avoid delays
  • 7 results in 9 clocks, or 1.3 clocks per
    iteration
  • Need more registers to use VLIW effectively

85
Limits to Multi-Issue Machines
Multiple Issue
Limitations With Multiple Issue
  • Inherent limitations of ILP
  • 1 branch in 5 instructions => how do we keep a 5-way
    VLIW busy?
  • Latencies of units => many operations must be
    scheduled
  • Need about Pipeline Depth x No. of Functional Units
    independent operations to keep the machine busy.
  • Difficulties in building HW
  • Duplicate functional units to get parallel
    execution
  • More ports to the register file (the VLIW example
    needs 6 read and 3 write ports for the Int. registers &
    6 read and 4 write ports for the FP registers)
  • More ports to memory
  • Decoding in SS and its impact on clock rate and
    pipeline depth

86
Limits to Multi-Issue Machines
Multiple Issue
Limitations With Multiple Issue
  • Limitations specific to either the SS or VLIW
    implementation
  • Decode & issue in SS
  • VLIW code size: unrolled loops & wasted fields in the
    VLIW word
  • VLIW lock step => 1 hazard & all instructions
    stall
  • VLIW & binary compatibility

87
Multiple Issue Challenges
Multiple Issue
Limitations With Multiple Issue
  • While the Integer/FP split is simple for the HW, we
    get a CPI of 0.5 only for programs with
  • Exactly 50% FP operations
  • No hazards
  • If more instructions issue at the same time, decode
    and issue get harder
  • Even 2-scalar => examine 2 opcodes, 6 register
    specifiers, and decide if 1 or 2 instructions can
    issue
  • VLIW: trade off instruction space for simple
    decoding
  • The long instruction word has room for many
    operations
  • By definition, all the operations the compiler
    puts in the long instruction word are independent
    => execute in parallel
  • E.g., 2 integer operations, 2 FP ops, 2 memory
    refs, 1 branch
  • 16 to 24 bits per field => 7 x 16 = 112 bits to
    7 x 24 = 168 bits wide
  • Need a compiling technique that schedules across
    several branches

88
Compiler Support For ILP
  • 4.1 Instruction Level Parallelism Concepts and
    Challenges
  • 4.2 Overcoming Data Hazards with Dynamic
    Scheduling
  • 4.3 Reducing Branch Penalties with Dynamic
    Hardware Prediction
  • 4.4 Taking Advantage of More ILP with Multiple
    Issue
  • 4.5 Compiler Support for Exploiting ILP
  • 4.6 Hardware Support for Extracting more
    Parallelism
  • 4.7 Studies of ILP

How can compilers be smart?
1. Produce good scheduling of code.
2. Determine which loops might contain parallelism.
3. Eliminate name dependencies.

Compilers must be REALLY smart to figure out
aliases -- pointers in C are a real problem.

Techniques lead to:
  Symbolic Loop Unrolling
  Critical Path Scheduling
89
Software Pipelining
Compiler Support For ILP
Symbolic Loop Unrolling
  • Observation: if iterations of a loop are
    independent, then we can get ILP by taking
    instructions from different iterations
  • Software pipelining reorganizes loops so that
    each iteration is made from instructions chosen
    from different iterations of the original loop
    (Tomasulo in SW)

90
SW Pipelining Example
Compiler Support For ILP
Symbolic Loop Unrolling
  • Before: Unrolled 3 times
  • 1 LD F0,0(R1)
  • 2 ADDD F4,F0,F2
  • 3 SD 0(R1),F4
  • 4 LD F6,-8(R1)
  • 5 ADDD F8,F6,F2
  • 6 SD -8(R1),F8
  • 7 LD F10,-16(R1)
  • 8 ADDD F12,F10,F2
  • 9 SD -16(R1),F12
  • 10 SUBI R1,R1,#24
  • 11 BNEZ R1,LOOP

After: Software Pipelined

         LD    F0,0(R1)      ; prologue
         ADDD  F4,F0,F2
         LD    F0,-8(R1)
1 Loop:  SD    0(R1),F4      ; stores M[i]
2        ADDD  F4,F0,F2      ; adds to M[i-1]
3        LD    F0,-16(R1)    ; loads M[i-2]
4        SUBI  R1,R1,#8
5        BNEZ  R1,Loop
         SD    0(R1),F4      ; epilogue
         ADDD  F4,F0,F2
         SD    -8(R1),F4

[Figure: overlapped pipelines - the SD, ADDD, and LD in one
software-pipelined iteration come from three different iterations
of the original loop; F0 and F4 are each written in one stage and
read in the next]
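
At the source level, the same schedule can be sketched in C (an
illustration under the assumption n >= 3; the deck works in assembly):

    /* Each loop trip stores element i, adds element i-1, loads element i-2. */
    void sw_pipe(double *x, double s, int n) {
        double loaded = x[n - 1];          /* prologue: load  */
        double summed = loaded + s;        /* prologue: add   */
        loaded = x[n - 2];                 /* prologue: load  */
        for (int i = n - 1; i >= 2; i--) {
            x[i]   = summed;               /* store M[i]      */
            summed = loaded + s;           /* add   M[i-1]    */
            loaded = x[i - 2];             /* load  M[i-2]    */
        }
        x[1] = summed;                     /* epilogue        */
        x[0] = loaded + s;
    }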
91
SW Pipelining Example
Compiler Support For ILP
Symbolic Loop Unrolling
  • Symbolic loop unrolling (software pipelining)
  • Less code space
  • Overhead paid only once, vs. at each iteration
    in loop unrolling

[Figure: software pipelining pays its start-up and wind-down overhead
once, while loop unrolling pays loop overhead at each unrolled loop
entry: 100 iterations = 25 loops with 4 unrolled iterations each]
92
Trace Scheduling
Compiler Support For ILP
Critical Path Scheduling
  • Parallelism across IF branches vs. LOOP branches
  • Two steps:
  • Trace Selection
  • Find a likely sequence of basic blocks (a trace) of
    (statically predicted or profile predicted) long
    sequences of straight-line code
  • Trace Compaction
  • Squeeze the trace into a few VLIW instructions
  • Need bookkeeping code in case the prediction is wrong
  • The compiler undoes a bad guess (discards values in
    registers)
  • Subtle compiler bugs can mean a wrong answer, not just
    poorer performance; there are no hardware interlocks

93
Hardware Support For Parallelism
  • 4.1 Instruction Level Parallelism Concepts and
    Challenges
  • 4.2 Overcoming Data Hazards with Dynamic
    Scheduling
  • 4.3 Reducing Branch Penalties with Dynamic
    Hardware Prediction
  • 4.4 Taking Advantage of More ILP with Multiple
    Issue
  • 4.5 Compiler Support for Exploiting ILP
  • 4.6 Hardware Support for Extracting more
    Parallelism
  • 4.7 Studies of ILP
  • Software support of ILP is best when code is
    predictable at compile time.
  • But what if there's no predictability?
  • Here we'll talk about hardware techniques. These
    include:
  • Conditional or Predicated Instructions
  • Hardware Speculation

94
Tell the Hardware To Ignore An Instruction
Hardware Support For Parallelism
Nullified Instructions
  • Avoid branch prediction by turning branches into
    conditionally executed instructions:
  • if (x) then A = B op C else NOP
  • If false, neither store the result nor cause an
    exception
  • Expanded ISAs of Alpha, MIPS, PowerPC, and SPARC have
    conditional move. PA-RISC can annul any
    following instruction.
  • IA-64: 64 1-bit condition fields selected, so
    conditional execution of any instruction
  • Drawbacks of conditional instructions:
  • Still takes a clock, even if annulled
  • Stalls if the condition is evaluated late
  • Complex conditions reduce effectiveness; the
    condition becomes known late in the pipeline.
  • This can be a major win because no time is
    lost by taking a branch!!
95
Tell the Hardware To Ignore An Instruction
Hardware Support For Parallelism
Nullified Instructions
  • Suppose we have the code:
  • if ( VarA == 0 )
  •     VarS = VarT;
  • Previous method:
  •     LD    R1, VarA
  •     BNEZ  R1, Label
  •     LD    R2, VarT
  •     SD    VarS, R2
  • Label:

Nullified method:
    LD      R1, VarA
    LD      R2, VarT
    CMPNNZ  R1, #0        ; compare and nullify next instr. if not zero
    SD      VarS, R2
Label:

Conditional-move method:
    LD      R1, VarA
    LD      R2, VarT
    CMOVZ   VarS, R2, R1  ; conditional move if zero
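
At the source level, the conditional move corresponds to a
branch-free select; a compiler will often emit CMOVZ-style code for a
hypothetical helper like this:

    /* Branch-free version of: if (VarA == 0) VarS = VarT; */
    int select_value(int VarA, int VarS, int VarT) {
        return (VarA == 0) ? VarT : VarS;   /* no branch required */
    }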
96
Hardware Support For Parallelism
Compiler Speculation
Increasing Parallelism
  • The idea here is to move an instruction across
    a branch so as to increase the size of a basic
    block and thus increase parallelism.
  • The primary difficulty is in avoiding exceptions.
    For example,
  • if (a != 0) c = b/a; may cause a divide-by-zero
    error in some cases if the division is hoisted above
    the test.
  • Methods for increasing speculation include:
  • 1. Use a set of status bits (poison bits)
    associated with the registers. They signal that
    an instruction's results are invalid until some
    later time.
  • 2. The result of an instruction isn't written until
    it's certain that the instruction is no longer
    speculative.

97
Hardware Support For Parallelism
Compiler Speculation
Increasing Parallelism
Original code:
       LW    R1, 0(R3)     ; load A
       BNEZ  R1, L1        ; test A
       LW    R1, 0(R2)     ; if clause (A = B)
       J     L2            ; skip else
L1:    ADDI  R1, R1, #4    ; else clause
L2:    SW    0(R3), R1     ; store A

  • Example on page 305.
  • Code for:
  • if ( A == 0 )
  •     A = B;
  • else
  •     A = A + 4;
  • Assume A is at 0(R3) and B is at 0(R2)

Speculated code:
       LW    R1, 0(R3)     ; load A
       LW    R14, 0(R2)    ; speculative load of B
       BEQZ  R1, L3        ; other branch of the if
       ADDI  R14, R1, #4   ; else clause
L3:    SW    0(R3), R14    ; non-speculative store

Note that here only ONE side needs to take a branch!!
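
A C rendering of the same transformation (variable names are
illustrative): the load of B is hoisted above the test, and a single
store commits the result.

    long speculate(long *A, long *B) {
        long a   = *A;        /* load A                      */
        long tmp = *B;        /* speculative load of B       */
        if (a != 0)
            tmp = a + 4;      /* else clause overwrites      */
        *A = tmp;             /* one non-speculative store   */
        return tmp;
    }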
98
Hardware Support For Parallelism
Compiler Speculation
Poison Bits
Speculated code:
       LW    R1, 0(R3)     ; load A
       LW    R14, 0(R2)    ; speculative load of B
       BEQZ  R1, L3        ; other branch of the if
       ADDI  R14, R1, #4   ; else clause
L3:    SW    0(R3), R14    ; non-speculative store

  • In the example above, if the speculative LW
    raises an exception, a poison bit is set on
    that register. Then, if a later instruction tries
    to use the register, an exception is raised at that
    point.
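
A toy C model of the poison-bit bookkeeping (the register count,
names, and fault test are assumptions; real hardware tracks this per
physical register):

    #include <stdbool.h>
    #include <stdlib.h>

    static long reg[32];
    static bool poison[32];

    /* A speculative load that would fault sets the poison bit
       instead of raising an exception immediately. */
    void spec_load(int rd, const long *addr) {
        if (addr == NULL) { poison[rd] = true; return; }  /* defer the fault */
        reg[rd] = *addr;
        poison[rd] = false;
    }

    /* A non-speculative use of a poisoned register raises the exception. */
    long use(int rs) {
        if (poison[rs]) abort();
        return reg[rs];
    }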

99
HW support for More ILP
Hardware Support For Parallelism
Hardware Speculation
  • Need a HW buffer for the results of uncommitted
    instructions: the reorder buffer
  • The reorder buffer can be an operand source
  • Once an operand commits, the result is found in the
    register
  • 3 fields: instruction type, destination, value
  • Use the reorder buffer number instead of the
    reservation station number
  • Discard instructions on mis-predicted branches or
    on exceptions
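
A sketch of one reorder-buffer entry with the three fields the slide
lists, plus a ready flag (type and field names are assumptions):

    typedef enum { ROB_BRANCH, ROB_STORE, ROB_REGOP } InstrType;

    typedef struct {
        InstrType type;    /* instruction type                   */
        int       dest;    /* destination register or store addr */
        long      value;   /* result, once computed              */
        int       ready;   /* has the value arrived yet?         */
    } ROBEntry;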

100
HW support for More ILP
Hardware Support For Parallelism
Hardware Speculation
  • How is this used in practice?
  • Rather than predicting the direction of a branch,
    execute the instructions on both sides!!
  • We know the target of a branch early on, long
    before we know whether it will be taken or not.
  • So begin fetching/executing at that new target
    PC.
  • But also continue fetching/executing as if the
    branch were NOT taken.

101
Studies of ILP
  • 4.1 Instruction Level Parallelism Concepts and
    Challenges
  • 4.2 Overcoming Data Hazards with Dynamic
    Scheduling
  • 4.3 Reducing Branch Penalties with Dynamic
    Hardware Prediction
  • 4.4 Taking Advantage of More ILP with Multiple
    Issue
  • 4.5 Compiler Support for Exploiting ILP
  • 4.6 Hardware Support for Extracting more
    Parallelism
  • 4.7 Studies of ILP
  • There are conflicting studies of the amount of
    improvement available, depending on
  • Benchmarks (vectorized FP Fortran vs. integer C
    programs)
  • Hardware sophistication
  • Compiler sophistication
  • How much ILP is available using existing
    mechanisms with increasing HW budgets?
  • Do we need to invent new HW/SW mechanisms to stay
    on the processor performance curve?

102
Limits to ILP
Studies of ILP
  • Initial HW model here: MIPS compilers.
  • Assumptions for an ideal/perfect machine to start:
  • 1. Register renaming: infinite virtual registers,
    and all WAW & WAR hazards are avoided
  • 2. Branch prediction: perfect; no mispredictions
  • 3. Jump prediction: all jumps perfectly predicted
    => machine with perfect speculation & an
    unbounded buffer of instructions available
  • 4. Memory-address alias analysis: addresses are
    known; a store can be moved before a load
    provided the addresses are not equal
  • 1-cycle latency for all instructions; an unlimited
    number of instructions issued per clock cycle

103
Upper Limit to ILP: Ideal Machine (Figure 4.38,
page 319)
Studies of ILP
This is the amount of parallelism when there are
no branch mispredictions and we're limited only
by data dependencies.

[Figure 4.38: IPC (instructions that could theoretically be issued
per cycle) on the ideal machine - FP programs 75-150, integer
programs 18-60]
104
Impact of Realistic Branch Prediction
Studies of ILP
  • What parallelism do we get when we don't allow
    perfect branch prediction, as in the last
    picture, but assume some realistic model?
  • Possibilities include:
  • 1. Perfect - all branches are perfectly
    predicted (the last slide)
  • 2. Selective history predictor - a complicated
    but doable mechanism for selection.
  • 3. Standard 2-bit history predictor with 512
    2-bit entries.
  • 4. Static prediction based on the past history of the
    program.
  • 5. None - parallelism is limited to the basic block.

105
Selective History Predictor
Studies of ILP
Bonus!!
[Figure: selective history predictor. An 8K x 2-bit selector table
chooses between a non-correlating 2-bit predictor (8096 x 2 bits,
indexed by branch address) and a correlating predictor (2048 x 4 x 2
bits, using 2 bits of global history); counter values 11 and 10
predict taken, 01 and 00 predict not taken.]
106
Impact of Realistic Branch Prediction (Figure
4.42, Page 325)
Studies of ILP
  • Limiting the type of branch prediction.

[Figure 4.42: IPC under different predictors (perfect, selective
history, 512-entry BHT, profile-based, and no prediction) -
FP 15-45, integer 6-12]
107
More Realistic HW: Register Impact (Figure 4.44,
Page 328)
Studies of ILP
  • Effect of limiting the number of renaming
    registers.

[Figure 4.44: IPC vs. number of renaming registers (infinite, 256,
128, 64, 32, none) - FP 11-45, integer 5-15]
108
More Realistic HW: Alias Impact (Figure 4.46,
Page 330)
Studies of ILP
  • What happens when there may be conflicts with
    memory aliasing?

[Figure 4.46: IPC vs. memory alias-analysis model (perfect,
global/stack perfect with heap conflicts, inspection of assembly,
none) - FP 4-45 (Fortran, no heap), integer 4-9]
109
Summary
  • 4.1 Instruction Level Parallelism Concepts and
    Challenges
  • 4.2 Overcoming Data Hazards with Dynamic
    Scheduling
  • 4.3 Reducing Branch Penalties with Dynamic
    Hardware Prediction
  • 4.4 Taking Advantage of More ILP with Multiple
    Issue
  • 4.5 Compiler Support for Exploiting ILP
  • 4.6 Hardware Support for Extracting more
    Parallelism
  • 4.7 Studies of ILP