Title: Modern Computer Architectures, Lecture 4: Branch Prediction, Multiple-Issue Processors, Limits to ILP
1. Modern Computer Architectures, Lecture 4: Branch Prediction, Multiple-Issue Processors, Limits to ILP
- Dr. Ben Juurlink
- Fall 2001
2. Branch Prediction: Motivation
- Scoreboard and Tomasulo stop issuing instructions when a branch is encountered
- With, on average, one out of five instructions being a branch, the maximum ILP is five
- The situation is even worse for multiple-issue processors, because we need to provide an instruction stream of n instructions per cycle
- Idea: predict the outcome of branches based on their history and execute instructions speculatively
3. 5 Branch Prediction Schemes
- 1-bit Branch Prediction Buffer
- 2-bit Branch Prediction Buffer
- Correlating Branch Prediction Buffer
- Branch Target Buffer
- Return Address Predictors
- Plus: a way to get rid of those malicious branches altogether (predicated instructions)
4. 1-bit Branch Prediction Buffer
- 1-bit branch prediction buffer, also called branch history table (BHT)
- The buffer is like a cache without tags
- Does not help for the simple DLX pipeline, because the target address is calculated in the same stage as the branch condition
(Figure: low-order PC bits index the BHT, which holds one prediction bit per entry.)
5. 2-bit Branch Prediction Buffer
- sad = 0;
  for (i = 0; i < 16; i++)
    for (j = 0; j < 16; j++) {
      if ((v = a[i][j] - b[i][j]) < 0) v = -v;
      sad += v;
    }
- Problem: in a nested loop, a 1-bit BHT causes 2 mispredictions per execution of the inner loop
- End-of-loop case, when it exits instead of looping as before
- First time through the loop on the next time through the code, when it predicts exit instead of looping
- Only 88% accuracy, even though the branch is taken 94% of the time
6. 2-bit Branch Prediction Buffer
- Solution: a 2-bit scheme, where the prediction is changed only if it is mispredicted twice
- Can be implemented as a saturating counter
(Figure: 4-state diagram with two Predict Taken states and two Predict Not Taken states; a T edge moves toward "strongly taken", an NT edge toward "strongly not taken", so a single misprediction only moves to the weak state.)
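The saturating counter above can be sketched in a few lines of C. This is a minimal illustration of the scheme, not the lecture's hardware; the type and function names are our own:

```c
#include <assert.h>

/* One 2-bit saturating counter: states 0,1 predict not taken,
   states 2,3 predict taken. An outcome moves the counter one step
   and saturates at 0 and 3, so a single misprediction only weakens
   the prediction instead of flipping it. */
typedef unsigned char counter2;

int predict_taken(counter2 c) { return c >= 2; }

counter2 train(counter2 c, int taken) {
    if (taken)
        return (counter2)(c < 3 ? c + 1 : 3);  /* toward "strongly taken" */
    return (counter2)(c > 0 ? c - 1 : 0);      /* toward "strongly not taken" */
}
```

For the nested loop above this halves the damage: after the loop-exit misprediction the counter sits in the weak-taken state, so the first iteration of the next execution of the loop is still predicted taken.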
7. Correlating Branches
- Fragment from the SPEC92 benchmark eqntott:
  if (aa==2)          SUBI R3,R1,2
    aa = 0;       b1: BNEZ R3,L1
  if (bb==2)          ADD  R1,R0,R0
    bb = 0;       L1: SUBI R3,R2,2
  if (aa!=bb)     b2: BNEZ R3,L2
    ...               ADD  R2,R0,R0
                  L2: SUB  R3,R1,R2
                  b3: BEQZ R3,L3
- The outcome of b3 is correlated with those of b1 and b2: if both are not taken, then aa == bb == 0 and b3 will be taken
8. Correlating Branch Predictor
- Idea: the behavior of this branch is related to the taken/not taken history of recently executed branches
- The behavior of recent branches then selects between, say, 4 predictions of the next branch, updating just that prediction
- (2,2) predictor: 2-bit global, 2-bit local
- An (m,n) predictor uses the behavior of the last m branches to choose from 2^m predictors, each of which is an n-bit predictor
(Figure: 4 bits from the branch address select a row of 2-bit local predictors; a 2-bit global branch history shift register (01 = not taken, then taken) selects the column that supplies the prediction.)
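A (2,2) predictor of the kind described can be sketched as follows. The table size and names are illustrative assumptions, not values from the lecture:

```c
#include <assert.h>

/* (m,n) = (2,2) correlating predictor: a 2-bit global history of the
   last 2 branch outcomes selects one of 4 two-bit saturating counters
   in the entry indexed by the branch address. ENTRIES is an arbitrary
   illustrative size. */
#define ENTRIES 16

static unsigned char counters[ENTRIES][4];  /* all start at 0 (not taken) */
static unsigned ghist = 0;                  /* global history shift register */

int cpredict(unsigned pc) {
    return counters[pc % ENTRIES][ghist] >= 2;
}

void ctrain(unsigned pc, int taken) {
    unsigned char *c = &counters[pc % ENTRIES][ghist];
    if (taken && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
    ghist = ((ghist << 1) | (unsigned)(taken != 0)) & 3;  /* shift in outcome */
}
```

Only the counter selected by the current history is updated, so the same branch can learn different behaviors under different recent-branch patterns.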
9. Accuracy of Different Branch Predictors
(Figure: misprediction rates, 0% to 18%, for three predictors: a 4096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1024-entry (2,2) BHT.)
10. BHT Accuracy
- Mispredict because either
- Wrong guess for that branch
- Got the branch history of the wrong branch when indexing the table (the buffer has no tags)
- 4096-entry table: misprediction rates vary from 1% (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%
- For SPEC92, 4096 entries are about as good as an infinite table
- Real programs and the OS are more like gcc
11. Branch Target Buffer
- For the DLX pipeline, we need the target address at the same time as the prediction
- Branch Target Buffer (BTB): the address of the branch indexes the buffer to get the prediction AND the branch target address (if taken)
- Note: must check for a branch match now, since the entry could otherwise belong to any instruction
(Figure: the PC is compared against the stored branch addresses; on a match ("yes: instruction is a branch"), the predicted PC is used as the next PC if the branch is predicted taken; on a miss ("no: instruction is not a branch"), fetch proceeds normally.)
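A direct-mapped BTB with the tag check described above can be sketched like this; sizes and names are illustrative:

```c
#include <assert.h>

/* Direct-mapped Branch Target Buffer. Unlike the tagless BHT, each
   entry stores the full branch PC, so a hit proves the fetched
   instruction really is the branch the entry describes. */
#define BTB_SIZE 64

struct btb_entry { unsigned pc; unsigned target; int valid; };
static struct btb_entry btb[BTB_SIZE];

/* On a hit, write the predicted target to *next_pc and return 1;
   on a miss, return 0 and let fetch fall through to PC + 4. */
int btb_lookup(unsigned pc, unsigned *next_pc) {
    struct btb_entry *e = &btb[(pc >> 2) % BTB_SIZE];  /* word-aligned PCs */
    if (e->valid && e->pc == pc) { *next_pc = e->target; return 1; }
    return 0;
}

/* Called when a taken branch resolves, to fill or update its entry. */
void btb_update(unsigned pc, unsigned target) {
    struct btb_entry *e = &btb[(pc >> 2) % BTB_SIZE];
    e->pc = pc; e->target = target; e->valid = 1;
}
```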
12. Instruction Fetch Stage
- Not shown: the hardware needed when the prediction was wrong
(Figure: fetch stage with the BTB supplying the found/taken bits and the predicted target address.)
13. Special Case: Return Addresses
- Register-indirect branches: hard to predict the target address
- MIPS/DLX instruction jr r31: PC = r31
- useful for
- implementing switch/case statements
- FORTRAN computed GOTOs
- procedure returns (mainly)
- SPEC89: 85% of such branches are procedure returns
- Because procedures follow a stack discipline, save return addresses in a small buffer that acts like a stack; 8 to 16 entries give a small miss rate
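The return-address buffer can be sketched as a small circular stack. The 8-entry size matches the range quoted above; the rest is illustrative:

```c
#include <assert.h>

/* Return-address stack: a call pushes its return PC, and a register-
   indirect return (jr r31) is predicted by popping. On overflow the
   circular buffer silently overwrites the oldest entry. */
#define RAS_SIZE 8

static unsigned ras[RAS_SIZE];
static int ras_top = 0;

void ras_push(unsigned return_pc) {   /* on a procedure call */
    ras[ras_top] = return_pc;
    ras_top = (ras_top + 1) % RAS_SIZE;
}

unsigned ras_pop(void) {              /* predicted target of a return */
    ras_top = (ras_top - 1 + RAS_SIZE) % RAS_SIZE;
    return ras[ras_top];
}
```

Because calls and returns nest, the popped entry matches the pending return as long as the call depth stays within the buffer size.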
14. Dynamic Branch Prediction Summary
- Prediction is becoming an important part of scalar execution
- Branch History Table: 2 bits for loop accuracy
- Correlation: recently executed branches are correlated with the next branch
- Either different branches
- Or different executions of the same branch
- Branch Target Buffer: include branch target address prediction
- Return address stack for prediction of indirect jumps
15. Predicated Instructions
- Avoid branch prediction by turning branches into conditional or predicated instructions
- if (x) then A = B op C else NOP
- If false, then neither store the result nor cause an exception
- Expanded ISAs of Alpha, MIPS, PowerPC, and SPARC have conditional moves; PA-RISC can annul any following instruction
- IA-64: conditional execution of any instruction
- Examples:
- if (R1==0) R2 = R3       CMOVZ  R2,R3,R1
- if (R1 < R2)             SLT    R9,R1,R2
-   R3 = R1;               CMOVNZ R3,R1,R9
- else                     CMOVZ  R3,R2,R9
-   R3 = R2;
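The conditional-move example above corresponds to this branch-free C. The function name is ours, and a compiler would typically lower the conditional expression to conditional moves rather than a branch:

```c
#include <assert.h>

/* R3 = (R1 < R2) ? R1 : R2, computed without a taken/not-taken branch,
   mirroring the SLT / CMOVNZ / CMOVZ sequence on the slide. */
int min_predicated(int r1, int r2) {
    int r9 = (r1 < r2);        /* SLT R9,R1,R2: predicate register */
    int r3 = r9 ? r1 : r2;     /* CMOVNZ R3,R1,R9 / CMOVZ R3,R2,R9 */
    return r3;
}
```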
16. Administrivia
- Have you all chosen a project/topic?
- Once I have received all your topics, I will make a schedule of the presentations. Presentations related to the memory hierarchy will come last.
17. Getting CPI < 1: Multiple-Issue Processors
- Vector Processing: explicit coding of independent loops as operations on large vectors of numbers
- Multimedia instructions being added to many processors
- Multiple-Issue Processors:
- Superscalar: varying number of instructions/cycle (1 to 8), scheduled by the compiler or by HW (Tomasulo) (dynamic issue capability)
- IBM PowerPC, Sun UltraSparc, DEC Alpha, Pentium III/4
- (Very) Long Instruction Words ((V)LIW): fixed number of instructions (4-16) scheduled by the compiler (static issue capability)
- Intel Architecture-64 (IA-64)
- Renamed: Explicitly Parallel Instruction Computer (EPIC)
- The anticipated success of multiple issue led to the Instructions Per Cycle (IPC) metric instead of CPI
18. Statically Scheduled Superscalar
- Superscalar MIPS/DLX: 2 instructions, 1 anything + 1 FP
- Fetch 64 bits/clock cycle; Int on left, FP on right
- Can only issue the 2nd instruction if the 1st instruction issues
- More ports needed for the FP register file to execute FP load + FP op in parallel
- Type              Pipe Stages
- Int. instruction  IF ID EX MEM WB
- FP instruction    IF ID EX MEM WB
- Int. instruction     IF ID EX MEM WB
- FP instruction       IF ID EX MEM WB
- Int. instruction        IF ID EX MEM WB
- FP instruction          IF ID EX MEM WB
- The 1-cycle load delay expands to 3 instructions in the superscalar: the instruction in the right half can't use the loaded value, nor can the instructions in the next slot
- Using this design, only FP loads and convert-int-to-float instructions can cause data hazards
19. Example
- for (i = 1; i <= 1000; i++)
-   a[i] = a[i] + s;
- Integer instruction   FP instruction    Cycle
- L: LD F0,0(R1)                            1
-    LD F6,8(R1)                            2
-    LD F10,16(R1)      ADDD F4,F0,F2       3
-    LD F14,24(R1)      ADDD F8,F6,F2       4
-    LD F18,32(R1)      ADDD F12,F10,F2     5
-    SD 0(R1),F4        ADDD F16,F14,F2     6
-    SD 8(R1),F8        ADDD F20,F18,F2     7
-    SD 16(R1),F12                          8
-    ADDI R1,R1,40                          9
-    SD -16(R1),F16                        10
-    BNE R1,R2,L                           11
-    SD -8(R1),F20                         12
Load: 1 cycle latency; ALU op: 2 cycles latency
- 2.4 cycles per element vs. 3.5 for the ordinary DLX pipeline
- Int and FP instructions are not perfectly balanced
20. Dynamically Scheduled Superscalars: Issues
- Dynamically scheduled superscalar: HW potentially reorders instructions and sends them to the correct execution unit. Extend Scoreboard or Tomasulo.
- Issue packet: group of instructions from the fetch unit that could potentially issue in one cycle
- If an instruction causes a structural or data hazard, either due to an earlier instruction in execution or to an earlier instruction in the issue packet, then the instruction cannot be issued
- 0 to N instruction issues per clock cycle, for N-issue
- Performing the issue checks in 1 cycle could limit the clock cycle time: O(n^2) comparisons
- => the issue stage is usually split and pipelined:
- The 1st stage decides how many instructions from within this packet can issue; the 2nd stage examines hazards among the selected instructions and those already issued
- => higher branch penalties => prediction accuracy important
21. Multiple Issue Issues
- While the Integer/FP split is simple for the HW, we get an IPC of 2 only for programs with
- Exactly 50% FP operations AND no hazards
- If more instructions issue at the same time, greater difficulty of decode and issue
- Even a 2-issue superscalar must examine 2 opcodes and 6 register specifiers, and decide if 1 or 2 instructions can issue (N-issue: O(N^2) comparisons)
- Register file: a 2-issue superscalar needs 2x reads and 1x writes per cycle
- Rename logic must be able to rename the same register multiple times in one cycle! For instance, consider 4-way issue:
-   add r1, r2, r3       add p11, p4, p7
-   sub r4, r1, r2   =>  sub p22, p11, p4
-   lw  r1, 4(r4)        lw  p23, 4(p22)
-   add r5, r1, r2       add p12, p23, p4
- Imagine doing this transformation in a single cycle!
- Result buses: need to complete multiple instructions per cycle
- So we need multiple buses, with associated matching logic at every reservation station.
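The renaming transformation above can be sketched with a map table and a free counter. This is a simplified model with our own names, and physical registers are handed out sequentially rather than following the slide's numbering:

```c
#include <assert.h>

/* Register rename: each architectural source register reads the map,
   each destination allocates a fresh physical register and updates
   the map. A 4-way machine must do four of these back to back within
   one cycle, which is what makes the logic hard. */
#define ARCH_REGS 32

static int rmap[ARCH_REGS];    /* architectural -> physical mapping */
static int next_phys = 32;     /* next free physical register */

int rename_src(int areg) { return rmap[areg]; }

int rename_dst(int areg) {
    rmap[areg] = next_phys++;  /* later readers of areg see the new name */
    return rmap[areg];
}
```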
22. VLIW Processors
- Superscalar HW is too difficult to build => let the compiler find independent instructions and pack them into one Very Long Instruction Word (VLIW)
- Example: VLIW processor with 2 ld/st units, two FP units, one integer/branch unit, no branch delay
- Ld/st 1          Ld/st 2          FP 1              FP 2              Int
- LD F0,0(R1)      LD F6,8(R1)
- LD F10,16(R1)    LD F14,24(R1)
- LD F18,32(R1)    LD F22,40(R1)    ADDD F4,F0,F2     ADDD F8,F6,F2
- LD F26,48(R1)                     ADDD F12,F10,F2   ADDD F16,F14,F2
-                                   ADDD F20,F18,F2   ADDD F24,F22,F2
- SD 0(R1),F4      SD 8(R1),F8      ADDD F28,F26,F2
- SD 16(R1),F12    SD 24(R1),F16
- SD 32(R1),F20    SD 40(R1),F24                                        ADDI R1,R1,56
- SD -8(R1),F28                                                         BNE R1,R2,L
23. Limitations of Multiple-Issue Processors
- Available ILP is limited (we're not programming with parallelism in mind)
- Hardware cost:
- adding more functional units is easy
- more memory ports and register ports are needed
- dependency checking needs O(n^2) comparisons
- Limitations of VLIW processors:
- Loop unrolling increases code size
- Unfilled slots waste bits
- A cache miss stalls the pipeline
- Research topic: scheduling loads
- Binary incompatibility (not EPIC)
24. Limits to ILP
- Conflicting studies of the amount:
- Benchmarks (vectorized Fortran FP vs. integer C programs)
- Hardware sophistication
- Compiler sophistication
- How much ILP is available using existing mechanisms with increasing HW budgets?
- Do we need to invent new HW/SW mechanisms to keep on the processor performance curve?
- Intel MMX, SSE (Streaming SIMD Extensions): 64-bit ints
- Intel SSE2: 128 bit, including 2 64-bit FP operations per cycle
- Motorola AltiVec: 128-bit ints and FPs
- SuperSPARC multimedia ops, etc.
25. Ideal Processor
- Assumptions for the ideal/perfect processor:
- 1. Register renaming: infinite number of virtual registers => all register WAW and WAR hazards avoided
- 2. Branch and jump prediction: perfect => all program instructions available for execution
- 3. Memory-address alias analysis: addresses are known. A store can be moved before a load provided the addresses are not equal
- Also:
- unlimited number of instructions issued per cycle (unlimited resources)
- perfect caches
- 1 cycle latency for all instructions (FP *, /)
- Programs were compiled using the MIPS compiler with maximum optimization level
26. Upper Limit to ILP: Ideal Processor
(Figure: IPC of the ideal processor: 18-60 for integer programs, 75-150 for FP programs.)
27. More Realistic HW: Window Size and Branch Impact
- Change from an infinite window: examine 2000 and issue at most 64 instructions per cycle
(Figure: IPC for perfect prediction, tournament predictor, 512-entry BHT, profile-based prediction, and no prediction: FP 15-45, integer 6-12.)
28. More Realistic HW: Impact of Limited Renaming Registers
- Changes: 2000-instruction window, 64-instruction issue, 8K 2-level predictor (slightly better than a tournament predictor)
(Figure: IPC vs. number of renaming registers (infinite, 256, 128, 64, 32): FP 11-45, integer 5-15.)
29. More Realistic HW: Memory Address Alias Impact
- Changes: 2000-instruction window, 64-instruction issue, 8K 2-level predictor, 256 renaming registers
(Figure: IPC for perfect, global/stack-perfect, inspection-based, and no alias analysis: FP 4-45 (Fortran, no heap), integer 4-9.)
30. Realistic HW for '00: Window Size Impact
- Assumptions: perfect disambiguation, 1K selective predictor, 16-entry return stack, 64 renaming registers, issue as many as the window allows
(Figure: IPC vs. window size (infinite, 256, 128, 64, 32, 16, 8, 4): FP 8-45, integer 6-12.)
31. How to Exceed the ILP Limits of This Study?
- WAR and WAW hazards through memory: the study eliminated WAW and WAR hazards through register renaming, but not in memory
- Unnecessary dependences (the compiler did not unroll loops, so there are iteration-variable dependences)
- Overcoming the data flow limit: value prediction, predicting values and speculating on the prediction
- Address value prediction and speculation: predicts addresses and speculates by reordering loads and stores. Could provide better aliasing analysis
32. Workstation Microprocessors, 3/2001
- Max issue: 4 instructions (many CPUs)
- Max rename registers: 128 (Pentium 4)
- Max BHT: 4K x 9 (Alpha 21264B), 16K x 2 (UltraSPARC III)
- Max window size (OOO): 126 instructions (Pentium 4)
- Max pipeline: 22/24 stages (Pentium 4)
Source: Microprocessor Report, www.MPRonline.com
33. SPEC 2000 Performance, 3/2001
Source: Microprocessor Report, www.MPRonline.com
34. Conclusions
- 1985-2000: 1000X performance
- Moore's Law for transistors/chip => Moore's Law for Performance/CPU
- Hennessy: industry has been following a roadmap of ideas known in 1985 to exploit Instruction Level Parallelism and the (real) Moore's Law to get 1.55X/year
- Caches, Pipelining, Superscalar, Branch Prediction, Out-of-order execution, ...
- ILP limits: to make performance progress in the future, do we need explicit parallelism from the programmer instead of the implicit parallelism of ILP exploited by compiler and HW?
- Otherwise, drop to the old rate of 1.3X per year?
- Less than 1.3X because of the processor-memory performance gap?
- Impact on you: if you care about performance, better think about explicitly parallel algorithms rather than rely on ILP?
35. Tournament Predictors
- Motivation for correlating branch predictors: the 2-bit predictor failed on important branches; by adding global information, performance improved
- Tournament predictors use 2 predictors, 1 based on global information and 1 based on local information, and combine them with a selector
- The hope is to select the right predictor for the right branch
36. Tournament Predictor in Alpha 21264
- 4K 2-bit counters to choose from among a global predictor and a local predictor
- The global predictor also has 4K entries and is indexed by the history of the last 12 branches; each entry in the global predictor is a standard 2-bit predictor
- 12-bit pattern: ith bit 0 => ith prior branch not taken; ith bit 1 => ith prior branch taken
- The local predictor consists of a 2-level predictor:
- Top level: a local history table consisting of 1024 10-bit entries; each 10-bit entry corresponds to the most recent 10 branch outcomes for the entry. The 10-bit history allows patterns of up to 10 branches to be discovered and predicted.
- Next level: the selected entry from the local history table is used to index a table of 1K entries consisting of 3-bit saturating counters, which provide the local prediction
- Total size: 4K*2 + 4K*2 + 1K*10 + 1K*3 = 29K bits!
- (180,000 transistors)
37. % of Predictions from the Local Predictor in the Tournament Prediction Scheme
38. Accuracy of Branch Prediction
- Profile: branch profile from the last execution (static in that it is encoded in the instruction, but based on a profile)
39. Accuracy vs. Size (SPEC89)