Title: Recent Advances in Branch Prediction
1. Recent Advances in Branch Prediction
Daniel Ángel Jiménez
Department of Computer Science
Rutgers, The State University of New Jersey
2. This Talk
- Brief introduction to conditional branch prediction
  - Some motivation, some background
- Improving Branch Prediction with the Compiler
  - Pattern history table partitioning
- Improving Branch Prediction in the Microarchitecture
  - Perceptron predictor
  - Mathematical intuition
  - Some pictures and movies
  - Piecewise Linear Branch Prediction
3. Pipelining and Branches
Pipelining overlaps instructions to exploit
parallelism, allowing the clock rate to be
increased. Branches cause bubbles in the
pipeline, where some stages are left idle.
[Figure: a five-stage pipeline (instruction fetch, instruction decode, execute, memory access, write back) stalled by an unresolved branch instruction]
4. Branch Prediction
A branch predictor allows the processor to
speculatively fetch and execute instructions down
the predicted path.
[Figure: a five-stage pipeline (instruction fetch, instruction decode, execute, memory access, write back) with speculative execution proceeding past the predicted branch]
Branch predictors must be highly accurate to
avoid mispredictions!
5. Branch Predictor Accuracy is Critical
- The cost of a misprediction is proportional to pipeline depth
- Predictor accuracy is more important for deeper pipelines
  - The Pentium 4 with the Prescott core has a 31-stage pipeline!
  - Recent cores have calmed down, but branch prediction remains a problem (this used to be my favorite slide)
- Deeper pipelines allow higher clock rates by decreasing the delay of each pipeline stage
- Decreasing the misprediction rate from 9% to 4% results in a 31% speedup for a 32-stage pipeline

Simulations with SimpleScalar/Alpha
6. Conditional Branch Prediction
- Most predictors are based on two-level adaptive branch prediction [Yeh & Patt 91]
  - Branch outcomes are shifted into a history register: 1 = taken, 0 = not taken
  - History bits and address bits combine to index a pattern history table (PHT) of 2-bit saturating counters
  - The prediction is the high bit of the counter
  - The counter is incremented if the branch is taken, decremented if it is not taken
  - Branches tend to be highly biased
- GAs is a common type of such a predictor
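As a rough illustration of the scheme above, here is a minimal software model of a two-level predictor with global history and a PHT of 2-bit saturating counters. The class name, table sizes, and the gshare-style XOR used to combine history and address bits are my choices for the sketch, not details from the talk.

```python
# Minimal model of a two-level adaptive predictor: global history plus
# low address bits index a PHT of 2-bit saturating counters.
class TwoLevelPredictor:
    def __init__(self, history_bits=8, pht_bits=12):
        self.history_bits = history_bits
        self.pht = [1] * (1 << pht_bits)    # counters start weakly not taken
        self.mask = (1 << pht_bits) - 1
        self.ghr = 0                        # global history shift register

    def index(self, pc):
        # combine history with address bits (gshare-style XOR)
        return ((pc >> 2) ^ self.ghr) & self.mask

    def predict(self, pc):
        # prediction is the high bit of the 2-bit counter
        return self.pht[self.index(pc)] >= 2

    def update(self, pc, taken):
        i = self.index(pc)
        if taken:
            self.pht[i] = min(3, self.pht[i] + 1)   # saturate at 3
        else:
            self.pht[i] = max(0, self.pht[i] - 1)   # saturate at 0
        # shift the outcome into the history register
        self.ghr = ((self.ghr << 1) | int(taken)) & ((1 << self.history_bits) - 1)
```

A highly biased branch trains its counters after only a few outcomes, which is why the 2-bit scheme works so well in practice.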
7. Destructive Interference (Aliasing)
- Unrelated branches might accidentally use the same counter
  - If two branches behave differently, the predictor can't learn either behavior
  - This leads to decreased accuracy
  - A small decrease in accuracy can have a large impact on performance
- Much branch prediction research focuses on reducing this effect
  - Almost all known techniques change the microarchitecture
  - Techniques are shown to work well in simulation
  - But microprocessor manufacturers still use relatively simple predictors
- Can we reduce destructive interference without changing the processor?
8. Pattern History Table Partitioning [PLDI 2005]
- The compiler changes branch addresses to reduce interference
- Exploits the fact that branches are highly biased
- The PHT is divided into partitions based on branch bias
9. Simple Idea
- The predictor uses part of the branch address to form an index
- Bimodal PHT partitioning:
  - Make sure that one bit (bit k) of that address is
    - 1, for biased taken branches, and
    - 0, for biased not-taken branches
- Four-way PHT partitioning:
  - Control two address bits for the following four cases:
    - 00, for weakly biased not-taken branches,
    - 01, for weakly biased taken branches,
    - 10, for strongly biased not-taken branches,
    - 11, for strongly biased taken branches
- This way, branches with similar behavior are grouped together, reducing destructive interference
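To make the 4-way encoding concrete, here is a hedged sketch mapping a profiled taken fraction to the two controlled address bits. The talk does not give numeric cutoffs separating "weakly" from "strongly" biased, so the thresholds and the function name below are illustrative assumptions.

```python
# Illustrative mapping from a profiled taken fraction to the two
# controlled address bits. Encoding from the slide: low bit = direction,
# high bit = strength. The 0.1%/99.9% strength cutoffs are assumptions.
def target_bits(taken_fraction):
    taken = taken_fraction >= 0.5
    strong = taken_fraction >= 0.999 or taken_fraction <= 0.001
    return (int(strong) << 1) | int(taken)

# 00 = weakly biased not taken,   01 = weakly biased taken,
# 10 = strongly biased not taken, 11 = strongly biased taken
```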
10. Simple Idea, continued
11. Complicated Implementation
- How do we control the addresses of branch instructions?
  - By inserting no-op instructions to move branches where we want them
  - But we can't just insert them anywhere; that would hurt performance
  - Solution: insert no-ops between specially selected regions of instructions
- How do we know branch biases?
  - Profiling with a training run (or heuristics in the absence of profiling)
- What's the best value of k?
  - Predictor-specific; determine empirically for each microarchitecture
- How does the compiler know the addresses of instructions?
  - Through feedback from the assembler
  - This gets complicated when inserting a no-op can affect a PC-relative branch
12. Adding No-Ops Between Regions
- No-ops are inserted between regions of instructions
- Each procedure is divided into regions
  - The first region begins with the first basic block
  - A region can be ended by:
    - A basic block that does not fall through, e.g. one ending in an unconditional jump or return instruction
    - A basic block ending in a conditional branch that is taken at least 99.9% of the time
- This ensures that:
  - There is enough flexibility to insert no-ops in the program
  - No-ops are almost never actually fetched and executed
13. Deciding How Many No-Ops To Insert
- The goal is to maximize the number of branches whose biases match bit k of their addresses
- Each procedure is aligned on a 2^k-byte boundary
- For each region in sequence:
  - Determine the effect of inserting from 0 to 2^k - 1 single-byte no-ops
  - Choose the value that maximizes the number of branches assigned to their correct partition, weighted by dynamic execution frequency
- Each time the effect of a new number of no-ops is evaluated, we must compute the new addresses (modulo 2^k) of all affected instructions
- We can't just add 1 for each no-op because of:
  - PC-relative branches that can change size
  - Compiler-inserted alignment pseudo-ops
14. PHT Partitioning on the Intel Pentium 4
- Intel Pentium 4 branch predictor details are a closely guarded secret
  - It seems to include a GAs branch predictor
  - Intel engineers disappear in a puff of smoke when I ask them about it
- I used ad hoc trial-and-error experiments to find the best address bits for bimodal and 4-way PHT partitioning
- Camino compiler infrastructure:
  - Development began at UT Austin and continues at Rutgers
  - Camino uses gcc -S to produce assembly language from C and C++
  - Camino reads the assembly language, forming control-flow graphs
  - Transformations are done at this CFG level
  - The result is put back into assembly form and assembled
  - This slide represents 95% of the work: tedious development
15. Methodology
- Used SPEC CPU integer benchmarks
  - Benchmarks from SPEC CPU 2000
  - Also those from SPEC CPU 95 that weren't duplicated in 2000
  - A few benchmarks failed to work with Camino
- Dell workstation for evaluation of the optimization
  - Intel Pentium 4, 2.8 GHz
  - 2 GB SDRAM
- Compared with greedy branch alignment [Calder & Grunwald 94]
- Measured the median of 5 execution times on a quiescent system
- Measured the number of branch mispredictions with OProfile in a separate run
16. Results: Speedup
- Mean speedup for 4-way partitioning was 4.5%; the maximum speedup was 16%
- Speedup for branch alignment was virtually nil, probably because of the Intel Pentium 4's trace cache
17. Results: Decrease in Mispredictions
- Normalized MPKI was reduced by 3.5% for 4-way partitioning
18. Future For This Work
- PHT partitioning is crude; with path profiling, I hope to direct paths to specific PHT entries [FDDO 2001]
- A lot of work needs to be done to reverse-engineer industrial branch predictors so that better optimizations can be developed
- Convince chip manufacturers to use simple branch predictors and leave the hard work to the compiler
19. Changing the Branch Predictor
- Now let's consider changing the branch predictor itself
- The architecture literature is replete with this kind of branch prediction paper
- Before 2001, most work refined two-level adaptive branch prediction [Yeh & Patt 91]
  - A first-level table records recent global or per-branch pattern histories
  - A second-level table learns correlations between histories and outcomes
  - Refinements focus on reducing destructive interference in the tables
- Some of the better refinements (not an exhaustive list):
  - gshare [McFarling 93], agree [Sprangle et al. 97], hybrid predictors [Evers et al. 96], skewed predictors [Michaud et al. 97]
20. Conditional Branch Prediction is a Machine Learning Problem
- The machine learns to predict conditional branches
- So why not apply a machine learning algorithm?
- Artificial neural networks
  - A simple model of the neural networks in brain cells
  - Learn to recognize and classify patterns
- We used fast and accurate perceptrons [Rosenblatt 62, Block 62] for dynamic branch prediction [Jiménez & Lin, HPCA 2001]
- We were the first to use single-layer perceptrons and to achieve accuracy superior to PHT techniques; previous work used LVQ and MLP for branch prediction [Vintan & Iridon 99]
21. Basics of Neural Branch Prediction
- The inputs to a neuron are branch outcome histories
  - The last n branch outcomes
  - Can be global, local (per-branch), or both (alloyed)
- Conceptually, branch outcomes are represented as
  - 1, for taken
  - -1, for not taken
- The output of the neuron is
  - Non-negative, if the branch is predicted taken
  - Negative, if the branch is predicted not taken
- Ideally, each static branch is allocated its own neuron
22. Branch-Predicting Perceptron
- Inputs (x's) come from the branch history
- n + 1 small integer weights (w's) are learned by on-line training
- Output (y) is the dot product of the x's and w's; predict taken if y >= 0
- Training finds correlations between history and outcome
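The predict/train loop above can be sketched in a few lines. The class name and default history length are illustrative; the training threshold follows the formula reported in the HPCA 2001 paper.

```python
# Sketch of a branch-predicting perceptron: n history inputs as +1/-1,
# n + 1 small integer weights, predict taken when the output is non-negative.
class Perceptron:
    def __init__(self, n=16):
        self.n = n
        self.w = [0] * (n + 1)           # w[0] is the bias weight
        self.theta = int(1.93 * n + 14)  # training threshold (HPCA 2001)

    def output(self, history):
        # history: the n most recent outcomes as +1 (taken) / -1 (not taken)
        return self.w[0] + sum(wi * xi for wi, xi in zip(self.w[1:], history))

    def predict(self, history):
        return self.output(history) >= 0

    def train(self, history, taken):
        y = self.output(history)
        # train on a misprediction, or while confidence is below threshold
        if (y >= 0) != taken or abs(y) <= self.theta:
            t = 1 if taken else -1
            self.w[0] += t
            for i, xi in enumerate(history):
                self.w[i + 1] += t * xi  # reinforce agreeing history bits
```

Each weight ends up measuring how strongly one history bit correlates with the branch outcome, which is exactly the intuition developed on the next slides.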
23. Mathematical Intuition
A perceptron defines a hyperplane in (n+1)-dimensional space:
y = w_0 + x_1 w_1 + x_2 w_2 + ... + x_n w_n
For instance, in 2D space we have:
y = w_0 + x_1 w_1
This is the equation of a line, the same as y = mx + b.
24. Mathematical Intuition, continued
In 3D space, we have:
y = w_0 + x_1 w_1 + x_2 w_2
Or you can think of it as:
z = c + ax + by
i.e. the equation of a plane in 3D space. This hyperplane forms a decision surface separating predicted-taken from predicted-not-taken histories. The surface intersecting the feature space is linear: a line in 2D, a plane in 3D, a hyperplane in higher dimensions.
25. Example: The AND Function
- White means false, black means true for the output
- -1 means false, 1 means true for the inputs
- A linear decision surface (i.e. a plane in 3D space) intersecting the feature space (i.e. the 2D plane where z = 0) separates false from true instances

-1 AND -1 = false; -1 AND 1 = false; 1 AND -1 = false; 1 AND 1 = true
26. Example: AND, continued
- Watch a perceptron learn the AND function
27. Example: XOR
-1 XOR -1 = false; -1 XOR 1 = true; 1 XOR -1 = true; 1 XOR 1 = false
Perceptrons cannot learn such linearly inseparable functions.
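A tiny experiment makes the separability point concrete: the same integer update rule learns AND exactly, but can never classify all four XOR cases, because no single line separates them. The helper names below are mine.

```python
# Demonstration of linear separability: a perceptron learns AND but not XOR.
def train_perceptron(samples, epochs=20):
    w = [0, 0, 0]                       # bias, w1, w2
    for _ in range(epochs):
        for x1, x2, t in samples:       # t is +1 (true) / -1 (false)
            y = w[0] + w[1] * x1 + w[2] * x2
            if (1 if y >= 0 else -1) != t:
                w[0] += t; w[1] += t * x1; w[2] += t * x2
    return w

def num_correct(w, samples):
    return sum((w[0] + w[1] * x1 + w[2] * x2 >= 0) == (t == 1)
               for x1, x2, t in samples)

AND = [(-1, -1, -1), (-1, 1, -1), (1, -1, -1), (1, 1, 1)]
XOR = [(-1, -1, -1), (-1, 1, 1), (1, -1, 1), (1, 1, -1)]
```

Training converges for AND (all four cases correct), while for XOR at least one case is always misclassified no matter how long the training runs.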
28. My Previous Work on Neural Predictors
- The perceptron predictor [HPCA 2001, TOCS 2002] uses only pattern history information
  - The same weights vector is used for every prediction of a static branch
  - The ith history bit could come from any number of static branches
  - So the ith correlating weight is aliased among many branches
- The newer path-based neural predictor [MICRO 2003] uses path information
  - The ith correlating weight is selected using the ith branch address
  - This allows the predictor to be pipelined, mitigating latency [MICRO 2000, HPCA 2003]
  - This strategy improves accuracy because of the path information
  - But there is now even more aliasing, since the ith weight could be used to predict many different branches
29. Piecewise Linear Branch Prediction [ISCA 2005]
- A generalization of the perceptron and path-based neural predictors
- Ideally, there is a weight giving the correlation between each
  - static branch b, and
  - each pair of branch and history position (i.e. i) in b's history
- b might have thousands of correlating weights or just a few
  - It depends on the number of static branches in b's history
30. The Algorithm: Parameters and Variables
- GHL: the global history length
- GHR: a global history shift register
- GA: a global array of previous branch addresses
- W: an n × m × (GHL + 1) array of small integers
31. The Algorithm: Making a Prediction
Weights are selected based on the current branch
and the ith most recent branch
32. The Algorithm: Training
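Putting slides 30 through 32 together, a minimal software sketch of prediction plus training might look like the following. The table sizes and threshold are illustrative placeholders, and a real implementation would use bounded saturating weights and hardware-friendly indexing rather than Python modular arithmetic.

```python
# Illustrative model of piecewise linear branch prediction.
class PiecewiseLinearPredictor:
    def __init__(self, n=64, m=64, ghl=16, theta=20):
        self.n, self.m, self.ghl, self.theta = n, m, ghl, theta
        # W: n x m x (GHL + 1) array of small integer weights
        self.W = [[[0] * (ghl + 1) for _ in range(m)] for _ in range(n)]
        self.GHR = [False] * ghl   # outcomes of previous branches
        self.GA = [0] * ghl        # addresses of previous branches

    def _output(self, pc):
        b = pc % self.n
        y = self.W[b][0][0]        # bias weight
        for i in range(self.ghl):
            # weight selected by the current branch AND the ith most
            # recent branch in the path
            w = self.W[b][self.GA[i] % self.m][i + 1]
            y += w if self.GHR[i] else -w
        return y

    def predict(self, pc):
        return self._output(pc) >= 0

    def train(self, pc, taken):
        y = self._output(pc)
        # adjust weights on a misprediction or a low-confidence output
        if (y >= 0) != taken or abs(y) < self.theta:
            b = pc % self.n
            d = 1 if taken else -1
            self.W[b][0][0] += d
            for i in range(self.ghl):
                j = self.GA[i] % self.m
                self.W[b][j][i + 1] += d if self.GHR[i] else -d
        # shift this branch into the global history
        self.GA = [pc] + self.GA[:-1]
        self.GHR = [taken] + self.GHR[:-1]
```

Because the weight for history position i also depends on which branch occupied that position, each distinct path gets its own linear function, which is where the piecewise linear decision surface on the next slide comes from.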
33. Why It's Better
- Forms a piecewise linear decision surface
  - Each piece is determined by the path leading to the predicted branch
- Can solve more problems than the perceptron

The perceptron decision surface for XOR doesn't classify all inputs correctly.
The piecewise linear decision surface for XOR classifies all inputs correctly.
34. Learning XOR
- From a program that computes XOR using if statements

[Movies: perceptron prediction; piecewise linear prediction]
35. Methodology
- An aggressive out-of-order processor was simulated
- 15 SPEC CPU integer benchmarks from 2000 and 95 were simulated
- Highly accurate predictors from previous work were tuned for this workload
36. Results: Accuracy
- The practical piecewise linear predictor approaches the accuracy of an unlimited ideal predictor, and is 16% better than previous work
37. Results: Performance
- Harmonic mean normalized IPC improves by 8% over previous work
38. Future For This Work
- I still want more accuracy
- Let's quantify the benefit of neural prediction with respect to power and energy
  - It's not clear: perceptrons are like multipliers in terms of energy consumption
  - Still, energy saved by running faster might offset energy consumed by the predictor
- Can the ideas from neural branch prediction be applied to other domains, e.g. value prediction?
39. Conclusion
- PHT partitioning provides a real speedup for today's CPUs
  - A simple technique
  - But it would be nice to have detailed information about the predictor
- Neural prediction could be incorporated into future CPUs
  - The latency problem is diminished
  - Accuracy is very good
  - Power and energy need to be researched
- Acknowledgments
  - NSF, CCR-0311091, for piecewise linear branch prediction
  - Ministerio de Educación y Ciencia (Spanish Ministry of Education and Science), SB2003-0357, for both
40. The End