Title: Recent Advances in Branch Prediction
1. Recent Advances in Branch Prediction
Daniel Ángel Jiménez
Department of Computer Science
Rutgers, The State University of New Jersey
2. This Talk
- Brief introduction to conditional branch prediction
  - Some motivation, some background
- Improving Branch Prediction with the Compiler
  - Pattern history table partitioning
- Improving Branch Prediction in the Microarchitecture
  - Perceptron predictor
  - Mathematical intuition
  - Some pictures and movies
  - Piecewise Linear Branch Prediction
3. Pipelining and Branches
Pipelining overlaps instructions to exploit
parallelism, allowing the clock rate to be
increased. Branches cause bubbles in the
pipeline, where some stages are left idle.
[Figure: a five-stage pipeline (instruction fetch, instruction decode, execute, memory access, write back) stalled by an unresolved branch instruction]
4. Branch Prediction
A branch predictor allows the processor to
speculatively fetch and execute instructions down
the predicted path.
[Figure: a five-stage pipeline (instruction fetch, instruction decode, execute, memory access, write back) with speculative execution proceeding past the predicted branch]
Branch predictors must be highly accurate to
avoid mispredictions!
5. Branch Predictor Accuracy is Critical
- The cost of a misprediction is proportional to pipeline depth
- Predictor accuracy is more important for deeper pipelines
  - The Pentium 4 with the Prescott core has a 31-stage pipeline!
  - Recent cores have calmed down, but branch prediction remains a problem (this used to be my favorite slide)
- Deeper pipelines allow higher clock rates by decreasing the delay of each pipeline stage
- Decreasing the misprediction rate from 9% to 4% results in a 31% speedup for a 32-stage pipeline

Simulations with SimpleScalar/Alpha
6. Conditional Branch Prediction
- Most predictors are based on two-level adaptive branch prediction [Yeh & Patt 91]
  - Branch outcomes are shifted into a history register: 1 = taken, 0 = not taken
  - History bits and address bits combine to index a pattern history table (PHT) of 2-bit saturating counters
  - The prediction is the high bit of the counter
  - The counter is incremented if the branch is taken, decremented if it is not taken
  - Branches tend to be highly biased
- GAs is a common type of such a predictor
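As a rough illustration of the scheme above, here is a minimal software model of a two-level predictor with global history and a PHT of 2-bit saturating counters. The class name, table sizes, and the gshare-style XOR used to combine history and address bits are my choices for the sketch, not details from the talk.

```python
# Minimal model of a two-level adaptive predictor: global history plus
# low address bits index a PHT of 2-bit saturating counters.
class TwoLevelPredictor:
    def __init__(self, history_bits=8, pht_bits=12):
        self.history_bits = history_bits
        self.pht = [1] * (1 << pht_bits)    # counters start weakly not taken
        self.mask = (1 << pht_bits) - 1
        self.ghr = 0                        # global history shift register

    def index(self, pc):
        # combine history with address bits (gshare-style XOR)
        return ((pc >> 2) ^ self.ghr) & self.mask

    def predict(self, pc):
        # prediction is the high bit of the 2-bit counter
        return self.pht[self.index(pc)] >= 2

    def update(self, pc, taken):
        i = self.index(pc)
        if taken:
            self.pht[i] = min(3, self.pht[i] + 1)   # saturate at 3
        else:
            self.pht[i] = max(0, self.pht[i] - 1)   # saturate at 0
        # shift the outcome into the history register
        self.ghr = ((self.ghr << 1) | int(taken)) & ((1 << self.history_bits) - 1)
```

A highly biased branch trains its counters after only a few outcomes, which is why the 2-bit scheme works so well in practice.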
7. Destructive Interference (Aliasing)
- Unrelated branches might accidentally use the same counter
  - If two branches behave differently, the predictor can't learn either behavior
  - This leads to decreased accuracy
  - A small decrease in accuracy can have a large impact on performance
- Much branch prediction research focuses on reducing this effect
  - Almost all known techniques change the microarchitecture
  - Techniques are shown to work well in simulation
  - But microprocessor manufacturers still use relatively simple predictors
- Can we reduce destructive interference without changing the processor?
8. Pattern History Table Partitioning [PLDI 2005]
- The compiler changes branch addresses to reduce interference
- Exploits the fact that branches are highly biased
- The PHT is divided into partitions based on branch bias
9. Simple Idea
- The predictor uses part of the branch address to form an index
- Bimodal PHT partitioning:
  - Make sure that one bit (bit k) of that address is
    - 1, for biased taken branches, and
    - 0, for biased not-taken branches
- Four-way PHT partitioning:
  - Control two address bits for the following four cases:
    - 00, for weakly biased not-taken branches,
    - 01, for weakly biased taken branches,
    - 10, for strongly biased not-taken branches,
    - 11, for strongly biased taken branches
- This way, branches with similar behavior are grouped together, reducing destructive interference
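To make the 4-way encoding concrete, here is a hedged sketch mapping a profiled taken fraction to the two controlled address bits. The talk does not give numeric cutoffs separating "weakly" from "strongly" biased, so the thresholds and the function name below are illustrative assumptions.

```python
# Illustrative mapping from a profiled taken fraction to the two
# controlled address bits. Encoding from the slide: low bit = direction,
# high bit = strength. The 0.1%/99.9% strength cutoffs are assumptions.
def target_bits(taken_fraction):
    taken = taken_fraction >= 0.5
    strong = taken_fraction >= 0.999 or taken_fraction <= 0.001
    return (int(strong) << 1) | int(taken)

# 00 = weakly biased not taken,   01 = weakly biased taken,
# 10 = strongly biased not taken, 11 = strongly biased taken
```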
10. Simple Idea, continued
11. Complicated Implementation
- How do we control the addresses of branch instructions?
  - By inserting no-op instructions to move branches where we want them
  - But we can't just insert them anywhere; that would hurt performance
  - Solution: insert no-ops between specially selected regions of instructions
- How do we know branch biases?
  - Profiling with a training run (or heuristics in the absence of profiling)
- What's the best value of k?
  - Predictor-specific; determine empirically for each microarchitecture
- How does the compiler know the addresses of instructions?
  - Through feedback from the assembler
  - This gets complicated when inserting a no-op can affect a PC-relative branch
12. Adding No-Ops Between Regions
- No-ops are inserted between regions of instructions
- Each procedure is divided into regions
  - The first region begins with the first basic block
  - A region can be ended by:
    - A basic block that does not fall through, e.g. one ending in an unconditional jump or return instruction
    - A basic block ending in a conditional branch that is taken at least 99.9% of the time
- This ensures that:
  - There is enough flexibility to insert no-ops in the program
  - No-ops are almost never actually fetched and executed
13. Deciding How Many No-Ops To Insert
- The goal is to maximize the number of branches whose biases match bit k of their addresses
- Each procedure is aligned on a 2^k-byte boundary
- For each region in sequence:
  - Determine the effect of inserting from 0 to 2^k - 1 single-byte no-ops
  - Choose the value that maximizes the number of branches assigned to their correct partition, weighted by dynamic execution frequency
- Each time the effect of a new number of no-ops is evaluated, we must compute the new addresses (modulo 2^k) of all affected instructions
- We can't just add 1 for each no-op because of:
  - PC-relative branches that can change size
  - Compiler-inserted alignment pseudo-ops
14. PHT Partitioning on the Intel Pentium 4
- Intel Pentium 4 branch predictor details are a closely guarded secret
  - It seems to include a GAs branch predictor
  - Intel engineers disappear in a puff of smoke when I ask them about it
- I used ad hoc trial-and-error experiments to find the best address bits for bimodal and 4-way PHT partitioning
- Camino compiler infrastructure:
  - Development began at UT Austin and continues at Rutgers
  - Camino uses gcc -S to produce assembly language from C and C++
  - Camino reads the assembly language, forming control-flow graphs
  - Transformations are done at this CFG level
  - The result is put back into assembly form and assembled
  - This slide represents 95% of the work: tedious development
15. Methodology
- Used SPEC CPU integer benchmarks
  - Benchmarks from SPEC CPU 2000
  - Also those from SPEC CPU 95 that weren't duplicated in 2000
  - A few benchmarks failed to work with Camino
- Dell workstation for evaluation of the optimization
  - Intel Pentium 4, 2.8 GHz
  - 2 GB SDRAM
- Compared with greedy branch alignment [Calder & Grunwald 94]
- Measured the median of 5 execution times on a quiescent system
- Measured the number of branch mispredictions with OProfile in a separate run
16. Results: Speedup
- Mean speedup for 4-way partitioning was 4.5%; the maximum speedup was 16%
- Speedup for branch alignment was virtually nil, probably because of the Intel Pentium 4's trace cache
17. Results: Decrease in Mispredictions
- Normalized MPKI was reduced by 3.5% for 4-way partitioning
18. Future For This Work
- PHT partitioning is crude; with path profiling, I hope to direct paths to specific PHT entries [FDDO 2001]
- A lot of work needs to be done to reverse-engineer industrial branch predictors so that better optimizations can be developed
- Convince chip manufacturers to use simple branch predictors and leave the hard work to the compiler
19. Changing the Branch Predictor
- Now let's consider changing the branch predictor itself
- The architecture literature is replete with this kind of branch prediction paper
- Before 2001, most work refined two-level adaptive branch prediction [Yeh & Patt 91]
  - A first-level table records recent global or per-branch pattern histories
  - A second-level table learns correlations between histories and outcomes
  - Refinements focus on reducing destructive interference in the tables
- Some of the better refinements (not an exhaustive list):
  - gshare [McFarling 93], agree [Sprangle et al. 97], hybrid predictors [Evers et al. 96], skewed predictors [Michaud et al. 97]
20. Conditional Branch Prediction is a Machine Learning Problem
- The machine learns to predict conditional branches
- So why not apply a machine learning algorithm?
- Artificial neural networks
  - A simple model of the neural networks in brain cells
  - Learn to recognize and classify patterns
- We used fast and accurate perceptrons [Rosenblatt 62, Block 62] for dynamic branch prediction [Jiménez & Lin, HPCA 2001]
- We were the first to use single-layer perceptrons and to achieve accuracy superior to PHT techniques; previous work used LVQ and MLP for branch prediction [Vintan & Iridon 99]
21. Basics of Neural Branch Prediction
- The inputs to a neuron are branch outcome histories
  - The last n branch outcomes
  - Can be global, local (per-branch), or both (alloyed)
- Conceptually, branch outcomes are represented as
  - 1, for taken
  - -1, for not taken
- The output of the neuron is
  - Non-negative, if the branch is predicted taken
  - Negative, if the branch is predicted not taken
- Ideally, each static branch is allocated its own neuron
22. Branch-Predicting Perceptron
- Inputs (x's) come from the branch history
- n + 1 small integer weights (w's) are learned by on-line training
- Output (y) is the dot product of the x's and w's; predict taken if y >= 0
- Training finds correlations between history and outcome
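The predict/train loop above can be sketched in a few lines. The class name and default history length are illustrative; the training threshold follows the formula reported in the HPCA 2001 paper.

```python
# Sketch of a branch-predicting perceptron: n history inputs as +1/-1,
# n + 1 small integer weights, predict taken when the output is non-negative.
class Perceptron:
    def __init__(self, n=16):
        self.n = n
        self.w = [0] * (n + 1)           # w[0] is the bias weight
        self.theta = int(1.93 * n + 14)  # training threshold (HPCA 2001)

    def output(self, history):
        # history: the n most recent outcomes as +1 (taken) / -1 (not taken)
        return self.w[0] + sum(wi * xi for wi, xi in zip(self.w[1:], history))

    def predict(self, history):
        return self.output(history) >= 0

    def train(self, history, taken):
        y = self.output(history)
        # train on a misprediction, or while confidence is below threshold
        if (y >= 0) != taken or abs(y) <= self.theta:
            t = 1 if taken else -1
            self.w[0] += t
            for i, xi in enumerate(history):
                self.w[i + 1] += t * xi  # reinforce agreeing history bits
```

Each weight ends up measuring how strongly one history bit correlates with the branch outcome, which is exactly the intuition developed on the next slides.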
23. Mathematical Intuition
A perceptron defines a hyperplane in (n+1)-dimensional space:
y = w_0 + x_1 w_1 + x_2 w_2 + ... + x_n w_n
For instance, in 2D space we have:
y = w_0 + x_1 w_1
This is the equation of a line, the same as y = mx + b.
24. Mathematical Intuition, continued
In 3D space, we have:
y = w_0 + x_1 w_1 + x_2 w_2
Or you can think of it as:
z = c + ax + by
i.e. the equation of a plane in 3D space. This hyperplane forms a decision surface separating predicted-taken from predicted-not-taken histories. The surface intersecting the feature space is linear: a line in 2D, a plane in 3D, a hyperplane in higher dimensions.
25. Example: The AND Function
- White means false, black means true for the output
- -1 means false, 1 means true for the inputs
- A linear decision surface (i.e. a plane in 3D space) intersecting the feature space (i.e. the 2D plane where z = 0) separates false from true instances

-1 AND -1 = false; -1 AND 1 = false; 1 AND -1 = false; 1 AND 1 = true
26. Example: AND, continued
- Watch a perceptron learn the AND function
27. Example: XOR
-1 XOR -1 = false; -1 XOR 1 = true; 1 XOR -1 = true; 1 XOR 1 = false
Perceptrons cannot learn such linearly inseparable functions.
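A tiny experiment makes the separability point concrete: the same integer update rule learns AND exactly, but can never classify all four XOR cases, because no single line separates them. The helper names below are mine.

```python
# Demonstration of linear separability: a perceptron learns AND but not XOR.
def train_perceptron(samples, epochs=20):
    w = [0, 0, 0]                       # bias, w1, w2
    for _ in range(epochs):
        for x1, x2, t in samples:       # t is +1 (true) / -1 (false)
            y = w[0] + w[1] * x1 + w[2] * x2
            if (1 if y >= 0 else -1) != t:
                w[0] += t; w[1] += t * x1; w[2] += t * x2
    return w

def num_correct(w, samples):
    return sum((w[0] + w[1] * x1 + w[2] * x2 >= 0) == (t == 1)
               for x1, x2, t in samples)

AND = [(-1, -1, -1), (-1, 1, -1), (1, -1, -1), (1, 1, 1)]
XOR = [(-1, -1, -1), (-1, 1, 1), (1, -1, 1), (1, 1, -1)]
```

Training converges for AND (all four cases correct), while for XOR at least one case is always misclassified no matter how long the training runs.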
28. My Previous Work on Neural Predictors
- The perceptron predictor [HPCA 2001, TOCS 2002] uses only pattern history information
  - The same weights vector is used for every prediction of a static branch
  - The ith history bit could come from any number of static branches
  - So the ith correlating weight is aliased among many branches
- The newer path-based neural predictor [MICRO 2003] uses path information
  - The ith correlating weight is selected using the ith branch address
  - This allows the predictor to be pipelined, mitigating latency [MICRO 2000, HPCA 2003]
  - This strategy improves accuracy because of the path information
  - But there is now even more aliasing, since the ith weight could be used to predict many different branches
29. Piecewise Linear Branch Prediction [ISCA 2005]
- A generalization of the perceptron and path-based neural predictors
- Ideally, there is a weight giving the correlation between each
  - static branch b, and
  - each pair of branch and history position (i.e. i) in b's history
- b might have thousands of correlating weights or just a few
  - It depends on the number of static branches in b's history
30. The Algorithm: Parameters and Variables
- GHL: the global history length
- GHR: a global history shift register
- GA: a global array of previous branch addresses
- W: an n × m × (GHL + 1) array of small integers
31. The Algorithm: Making a Prediction
Weights are selected based on the current branch
and the ith most recent branch
32. The Algorithm: Training
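Putting slides 30 through 32 together, a minimal software sketch of prediction plus training might look like the following. The table sizes and threshold are illustrative placeholders, and a real implementation would use bounded saturating weights and hardware-friendly indexing rather than Python modular arithmetic.

```python
# Illustrative model of piecewise linear branch prediction.
class PiecewiseLinearPredictor:
    def __init__(self, n=64, m=64, ghl=16, theta=20):
        self.n, self.m, self.ghl, self.theta = n, m, ghl, theta
        # W: n x m x (GHL + 1) array of small integer weights
        self.W = [[[0] * (ghl + 1) for _ in range(m)] for _ in range(n)]
        self.GHR = [False] * ghl   # outcomes of previous branches
        self.GA = [0] * ghl        # addresses of previous branches

    def _output(self, pc):
        b = pc % self.n
        y = self.W[b][0][0]        # bias weight
        for i in range(self.ghl):
            # weight selected by the current branch AND the ith most
            # recent branch in the path
            w = self.W[b][self.GA[i] % self.m][i + 1]
            y += w if self.GHR[i] else -w
        return y

    def predict(self, pc):
        return self._output(pc) >= 0

    def train(self, pc, taken):
        y = self._output(pc)
        # adjust weights on a misprediction or a low-confidence output
        if (y >= 0) != taken or abs(y) < self.theta:
            b = pc % self.n
            d = 1 if taken else -1
            self.W[b][0][0] += d
            for i in range(self.ghl):
                j = self.GA[i] % self.m
                self.W[b][j][i + 1] += d if self.GHR[i] else -d
        # shift this branch into the global history
        self.GA = [pc] + self.GA[:-1]
        self.GHR = [taken] + self.GHR[:-1]
```

Because the weight for history position i also depends on which branch occupied that position, each distinct path gets its own linear function, which is where the piecewise linear decision surface on the next slide comes from.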
33. Why It's Better
- Forms a piecewise linear decision surface
  - Each piece is determined by the path leading to the predicted branch
- Can solve more problems than the perceptron

The perceptron decision surface for XOR doesn't classify all inputs correctly.
The piecewise linear decision surface for XOR classifies all inputs correctly.
34. Learning XOR
- From a program that computes XOR using if statements

[Movies: perceptron prediction; piecewise linear prediction]
35. Methodology
- An aggressive out-of-order processor was simulated
- 15 SPEC CPU integer benchmarks from 2000 and 95 were simulated
- Highly accurate predictors from previous work were tuned for this workload
36. Results: Accuracy
- The practical piecewise linear predictor approaches the accuracy of an unlimited ideal predictor, and is 16% better than previous work
37. Results: Performance
- Harmonic mean normalized IPC improves by 8% over previous work
38. Future For This Work
- I still want more accuracy
- Let's quantify the benefit of neural prediction with respect to power and energy
  - It's not clear: perceptrons are like multipliers in terms of energy consumption
  - Still, energy saved by running faster might offset energy consumed by the predictor
- Can the ideas from neural branch prediction be applied to other domains, e.g. value prediction?
39. Conclusion
- PHT partitioning provides a real speedup for today's CPUs
  - A simple technique
  - But it would be nice to have detailed information about the predictor
- Neural prediction could be incorporated into future CPUs
  - The latency problem is diminished
  - Accuracy is very good
  - Power and energy need to be researched
- Acknowledgments
  - NSF, CCR-0311091, for piecewise linear branch prediction
  - Ministerio de Educación y Ciencia (Spanish Ministry of Education and Science), SB2003-0357, for both
40. The End