Title: Two-level Adaptive Branch Prediction
1 Two-level Adaptive Branch Prediction
- Colin Egan
- University of Hertfordshire
- Hatfield
- U.K.
- c.egan_at_herts.ac.uk
2 Presentation Structure
- Two-level Adaptive Branch Prediction
- Cached Correlated Branch Prediction
- Neural Branch Prediction
- Conclusion and Discussion
- Where next?
3 Two-level Adaptive Branch Prediction Schemes
- First level
- History register(s) record the outcome of the last k branches encountered.
- A global history register records information from other branches leading to the branch.
- Local (Per-Address) history registers record information of specific branches.
4 Two-level Adaptive Branch Prediction Schemes
- Second level
- Is termed the Pattern History Table (PHT).
- The PHT consists of at least one array of two-bit up/down saturating counters that provide the prediction.
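- As a rough illustration (not taken from the talk), the sketch below models a GAg-style predictor: a k-bit global history register indexes a PHT of two-bit up/down saturating counters. The history length and counter initialisation are assumptions.

```python
# Minimal GAg-style two-level predictor sketch: a k-bit global history register
# indexes a PHT of two-bit up/down saturating counters (values 0..3).
K = 12                        # history register length (assumed)
pht = [1] * (1 << K)          # counters start at "weakly not taken"
ghr = 0                       # global history register

def predict():
    """Counter values 2 or 3 predict taken; 0 or 1 predict not taken."""
    return pht[ghr] >= 2

def update(taken):
    """At branch resolution: saturate the indexed counter, then shift the outcome in."""
    global ghr
    pht[ghr] = min(pht[ghr] + 1, 3) if taken else max(pht[ghr] - 1, 0)
    ghr = ((ghr << 1) | int(taken)) & ((1 << K) - 1)
```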
5 Two-level Adaptive Branch Prediction Schemes
- Global schemes
- Exploit correlation between the outcome of the current branch and neighbouring branches that were executed leading to this branch.
- Local schemes
- Exploit correlation between the outcome of the current branch and its past behaviour.
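- For contrast with the global sketch above, a local (per-address) first level keeps one history register per branch; a minimal sketch, with assumed table sizes and simple modulo indexing:

```python
# Local (PAg-style) sketch: each branch address selects its own k-bit history
# register; that pattern indexes a shared PHT of two-bit saturating counters.
K = 10                               # per-address history length (assumed)
N_BHT = 1024                         # number of local history registers (assumed)
bht = [0] * N_BHT                    # first level: local history registers
pht = [1] * (1 << K)                 # second level: shared 2-bit counters

def predict(pc):
    return pht[bht[pc % N_BHT]] >= 2

def update(pc, taken):
    i = pc % N_BHT
    hr = bht[i]
    pht[hr] = min(pht[hr] + 1, 3) if taken else max(pht[hr] - 1, 0)
    bht[i] = ((hr << 1) | int(taken)) & ((1 << K) - 1)
```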
6 Global Two-level Adaptive Branch Prediction
- GAg Predictor Implementation
7 Global Two-level Adaptive Branch Prediction
- GAp / GAs Predictor Implementation
8 Local Two-level Adaptive Branch Prediction
9 Local Two-level Adaptive Branch Prediction
10 Problems with Two-level Adaptive Branch Prediction
- Size of PHT
- Increases exponentially as a function of HR length.
- Use of uninitialised predictors
- No tag fields are associated with PHT prediction counters.
- Branch interference (aliasing)
- In GAg and PAg all branches share a common PHT.
- In GAs and PAs each PHT is shared by a set of branches.
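- To make the exponential growth concrete, a rough back-of-the-envelope calculation (illustrative history lengths; GAs(16) mirrors the configuration simulated later):

```python
# Rough storage cost of a conventional PHT as the history register length k grows.
# GAg needs 2**k two-bit counters; GAs(16) needs 16 such arrays.
for k in (12, 16, 26, 30):
    gag_bits = 2 * (2 ** k)
    gas_bits = 16 * gag_bits
    print(f"k={k}: GAg = {gag_bits // 8 // 1024} KB, GAs(16) = {gas_bits // 8 // 1024} KB")
```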
11 Cached Correlated Branch Prediction
- Minimises the number of initial mispredictions.
- Eliminates branch interference.
- Is used in a disciplined manner.
- Is cost-effective.
12 Cached Correlated Branch Prediction
- Today we are going to look at two types of cached correlated predictors
- A Global Cached Correlated Predictor.
- A Local Cached Correlated Predictor.
- We have also developed a combined predictor that uses both global and local history information.
13 Cached Correlated Branch Prediction
- The first level history register remains the same as in conventional two-level predictors.
- Uses a second level Prediction Cache instead of a PHT.
14 Cached Correlated Branch Prediction
- Uses a secondary default predictor (BTC).
- Both predictors provide a prediction.
- A priority selector chooses the actual prediction.
15 Cached Correlated Branch Prediction
- The Prediction Cache predicts on the past behaviour of the branch with the current history register pattern.
- The Default Predictor predicts on the overall past behaviour of the branch.
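- A minimal sketch of this two-stage lookup, using dictionaries in place of tagged hardware arrays (structure sizes and the static not-taken fall-back are assumptions):

```python
# Two-stage cached correlated lookup sketch: the Prediction Cache is keyed on
# (branch PC, history pattern); the default predictor keeps one counter per PC.
pred_cache = {}          # (pc, hr) -> 2-bit saturating counter
default_pred = {}        # pc -> 2-bit saturating counter

def predict(pc, hr):
    # Priority selector: prefer the correlated prediction when it hits.
    if (pc, hr) in pred_cache:
        return pred_cache[(pc, hr)] >= 2
    if pc in default_pred:
        return default_pred[pc] >= 2
    return False          # fall-back: statically predict not taken (assumption)

def update(pc, hr, taken):
    for table, key in ((pred_cache, (pc, hr)), (default_pred, pc)):
        c = table.get(key, 1)
        table[key] = min(c + 1, 3) if taken else max(c - 1, 0)
```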
16 Prediction Cache
- Size is not a function of the history register length.
- Size is determined by the number of prediction counters that are actually used.
- Requires a tag field.
- Is cost-effective as long as the cost of redundant counters removed from a conventional PHT exceeds the cost of the added tags.
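- A back-of-the-envelope version of that cost argument; the entry count matches the 32K configuration used later, while the tag width and valid bit are assumptions:

```python
# Conventional PHT (2 * 2**k bits) versus a tagged Prediction Cache of fixed size.
k = 30                                      # history register length
pht_bits = 2 * (2 ** k)                     # one 2-bit counter per history pattern
entries, tag_bits = 32 * 1024, 24           # 32K entries; assumed tag width
cache_bits = entries * (2 + tag_bits + 1)   # counter + tag + valid bit per entry
print(pht_bits, cache_bits, cache_bits < pht_bits)   # the cache is far cheaper here
```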
17 A Global Cached Correlated Branch Predictor
18 A Local Cached Correlated Branch Predictor
- Problem
- A local predictor requires two sequential clock accesses
- One to access the BTC to furnish HRl.
- A second to access the Prediction Cache.
- Solution
- Cache the next prediction for each branch in the BTC.
- Only one clock access is then needed.
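- One way to picture that single-access trick (field names are illustrative, not the talk's hardware): each BTC entry keeps the branch's local history together with a ready-made prediction computed when the branch was last resolved.

```python
# Sketch: the BTC caches the *next* prediction, so fetch needs one lookup; the
# Prediction Cache is consulted off the critical path at branch resolution.
btc = {}                 # pc -> {"hrl": local history, "next_pred": bool}
pred_cache = {}          # (pc, hrl) -> 2-bit saturating counter

def fetch_predict(pc):
    entry = btc.get(pc)
    return entry["next_pred"] if entry else False     # single access at fetch time

def resolve(pc, taken, hr_len=10):
    entry = btc.setdefault(pc, {"hrl": 0, "next_pred": False})
    key = (pc, entry["hrl"])
    c = pred_cache.get(key, 1)
    pred_cache[key] = min(c + 1, 3) if taken else max(c - 1, 0)
    # update the local history, then pre-compute the prediction for the next visit
    entry["hrl"] = ((entry["hrl"] << 1) | int(taken)) & ((1 << hr_len) - 1)
    entry["next_pred"] = pred_cache.get((pc, entry["hrl"]), 1) >= 2
```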
19 A Local Cached Correlated Branch Predictor
(Block diagram: the PC indexes the BTC, which supplies hrl and the default prediction; a hash of the PC and hrl indexes the Prediction Cache; the prediction selector uses the BTC hit and correlated hit signals to choose between the default and Prediction Cache predictions, giving the actual prediction.)
20 Simulations
- Stanford Integer Benchmark suite.
- These benchmarks are difficult to predict.
- Instruction traces were obtained from the Hatfield Superscalar Architecture (HSA).
21 Global Simulations
- A comparative study of misprediction rates
- A conventional GAg, a conventional GAs(16) and a conventional GAp.
- Against a Global Cached Correlated predictor (1K to 64K entries).
22 Global Simulation Results
23 Global Simulation Results
- For conventional global two-level predictors the best average misprediction rate of 9.23% is achieved by a GAs(16) predictor with a history register length of 26.
- In general there is little benefit from increasing the history register length beyond 16 bits for GAg and 14 bits for GAs/GAp.
24 Global Simulation Results
- A 32K entry Prediction Cache with a 30-bit history register achieved the best misprediction rate of 5.99% for the global cached correlated predictors.
- This represents a 54% reduction over the best misprediction rate achieved by a conventional global two-level predictor.
25 Global Simulations
- We repeated the same simulations without the default predictor.
26 Global Simulation Results (without default predictor)
27 Global Simulation Results (without default predictor)
- The best misprediction rate is now 9.12%.
- The high performance of the cached predictor depends crucially on the provision of the two-stage mechanism.
28 Local Simulations
- A comparative study of misprediction rates.
- A conventional PAg, a conventional PAs(16) and a conventional PAp.
- Against a local cached correlated predictor (1K to 64K entries), with and without the default predictor.
29 Local Simulation Results (with default predictor)
30 Local Simulation Results (without default predictor)
31 Local Simulation Results
- For conventional local two-level predictors the best average misprediction rate of 7.35% is achieved by a PAp predictor with a history register length of 30.
- The best misprediction rate achieved by a local cached correlated predictor (64K entries, HR length 28) is 6.19%.
- This is a 19% improvement over the best conventional local two-level predictor.
- However, without the default predictor the best misprediction rate achieved is 8.21% (32K entries, HR length 12).
32 Three-Stage Predictor
- Since the high performance of a cached predictor depends crucially on the provision of the two-stage mechanism, we were led to the development of the three-stage predictor.
33 Three-Stage Predictor
- Stages
- Primary Prediction Cache.
- Secondary Prediction Cache.
- Default Predictor.
- The predictions from the two Prediction Caches are stored in the BTC so that a prediction is furnished in a single clock cycle.
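- A sketch of the three-stage priority selection; which half of the history register the secondary cache uses is an assumption:

```python
# Three-stage selection sketch: primary Prediction Cache (full history), then a
# secondary Prediction Cache (half the history bits), then the default predictor.
primary, secondary, default_pred = {}, {}, {}

def predict(pc, hr, hr_len=16):
    half_hr = hr & ((1 << (hr_len // 2)) - 1)   # most recent half of the HR (assumed)
    if (pc, hr) in primary:
        return primary[(pc, hr)] >= 2
    if (pc, half_hr) in secondary:
        return secondary[(pc, half_hr)] >= 2
    return default_pred.get(pc, 1) >= 2
```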
34 Three-Stage Predictor Simulations
- We repeated the same set of simulations.
- We varied the Primary Prediction Cache size (1K to 64K entries).
- The Secondary Prediction Cache was always half the size of the Primary Prediction Cache and used exactly half of the history register bits.
35 Global Three-Stage Predictor Simulation Results
36 Global Three-Stage Predictor Simulation Results
- The global three-stage predictor consistently outperforms the simpler global two-stage predictor.
- The best misprediction rate is now 5.57%, achieved with a 32K Prediction Cache and a 30-bit HR.
- This represents a 7.5% improvement over the best global two-stage predictor.
37 Local Three-Stage Predictor Simulation Results
38 Local Three-Stage Predictor Simulation Results
- The local three-stage predictor consistently outperforms the simpler local two-stage predictor.
- The best misprediction rate is now 6.00%, achieved with a 64K Prediction Cache and a 28-bit HR.
- This represents a 3.2% improvement over the best local two-stage predictor.
39 Conclusion So Far
- Conventional PHTs use large amounts of hardware with increasing history register length.
- The history register size of a cached correlated predictor does not determine cost.
- A Prediction Cache can reduce the hardware cost over a conventional PHT.
40 Conclusion So Far
- Cached correlated predictors provide better prediction accuracy than conventional two-level predictors.
- The role of the default predictor in a cached correlated predictor is crucial.
- Three-stage predictors consistently record a small but significant improvement over their two-stage counterparts.
41 Neural Network Branch Prediction
- Dynamic branch prediction can be considered a specific instance of general time series prediction.
- Two-level Adaptive Branch Prediction is a very specific solution to the branch prediction problem.
- An alternative approach is to look at other application areas and fields for novel solutions to the problem.
- At Hatfield, we have examined the application of neural networks to the branch prediction problem.
42 Neural Network Branch Prediction
- Two neural networks are considered
- A Learning Vector Quantisation (LVQ) Network,
- A Backpropagation Network.
- One of our main research objectives is to use neural networks to identify new correlations that can be exploited by branch predictors.
- We also wish to determine whether more accurate branch prediction is possible and to gain a greater understanding of the underlying prediction mechanisms.
43 Neural Network Branch Prediction
- As with Cached Correlated Branch Prediction, we retain the first level of a conventional two-level predictor.
- The k-bit pattern of the history register is fed into the network as input.
- In fact, we concatenate the 10 lsb of the branch address with the HR as input to the network.
44 LVQ prediction
- The idea of using an LVQ predictor was to see if respectable prediction rates could be delivered by a simple LVQ network that was dynamically trained after each branch prediction.
45 LVQ prediction
- The LVQ predictor contains two codebook vectors
- Vt is associated with a taken branch.
- Vnt is associated with a not taken branch.
- The concatenated PC and HR form a single input vector.
- We call this vector X.
46 LVQ prediction
- Modified Hamming distances are then computed between X and each of Vt and Vnt.
- The winning vector, Vw, is the vector with the smallest HD.
47 LVQ prediction
- Vw is used to predict the branch.
- If Vt wins then the branch is predicted as taken.
- If Vnt wins then the branch is predicted as not taken.
48 LVQ prediction
- LVQ network training
- At branch resolution, Vw is adjusted
- Vw(t+1) = Vw(t) +/- a(t)[X(t) - Vw(t)]
- To reinforce correct predictions, the winning vector is incremented (+) whenever a prediction was proved to be correct, and decremented (-) whenever a prediction was proved to be incorrect.
- The factor a(t) represents the learning factor and was (usually) set to a small constant of less than 0.1.
- The losing vector remains unchanged.
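- A compact sketch of that training rule; the "modified Hamming distance" is stood in for by a squared distance here, and the input coding and length are assumptions:

```python
import numpy as np

# LVQ branch predictor sketch: X concatenates branch-address bits and the history
# register; the closest codebook vector wins and supplies the prediction.
N = 26                         # input length, e.g. 10 PC bits + 16 HR bits (assumed)
a = 0.001                      # learning step, as standardised in the simulations
v_t = np.zeros(N)              # codebook vector associated with taken
v_nt = np.zeros(N)             # codebook vector associated with not taken

def predict(x):
    return np.sum((x - v_t) ** 2) <= np.sum((x - v_nt) ** 2)   # True = taken

def train(x, predicted_taken, actual_taken):
    """Vw(t+1) = Vw(t) +/- a(t)[X(t) - Vw(t)]: '+' if the prediction was correct."""
    vw = v_t if predicted_taken else v_nt
    sign = 1.0 if predicted_taken == actual_taken else -1.0
    vw += sign * a * (x - vw)           # in-place update; the losing vector is untouched
```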
49 LVQ prediction
- LVQ network training
- Training is therefore dynamic.
- Training is also adaptive since the codebook vectors reflect the outcomes of the most recently encountered branches.
50 Backpropagation prediction
- Prediction information is fed into a backpropagation network.
51 Backpropagation prediction
- There are two steps through the backpropagation network.
- First step (forward step)
- From the input layer to the output layer.
- Propagates the input vector of the net into the first layer; outputs from this first layer are fed as inputs to the next layer, and so on to the final layer.
- The outputs from the final layer are the output signals of the net.
- In the case of branch prediction there is a single output signal: the prediction.
- Second step (backward step)
- Is similar to the forward step, except that error values are backpropagated through the network.
- These error values are used to train the net by changing the weights.
52 Backpropagation prediction
- Inputs into the net can be coded in two different ways
- Binary (0 for not taken, 1 for taken).
- Bipolar (-1 for not taken, 1 for taken).
- Two different activation functions are required
- Sigmoid function for binary inputs.
- Bipolar sigmoid function for bipolar inputs.
53 Backpropagation prediction
- Sigmoid function
- f(x) = 1 / (1 + e^(-λx))
- Bipolar sigmoid function
- f(x) = 2 / (1 + e^(-λx)) - 1
- N.B. At the moment we are not considering implementation costs; when we do, we will have to consider a simpler/cheaper function.
- The factor λ controls the degree of linearity of the two functions.
54 Backpropagation prediction
- Binary inputs
- An output value greater than or equal to 0.5 is predicted as taken.
- An output value less than 0.5 is predicted as not taken.
- Bipolar inputs
- An output value greater than or equal to 0 is predicted as taken.
- An output value less than 0 is predicted as not taken.
55 Backpropagation prediction
- A backpropagation network is not initially trained.
- Random values between -2/X and 2/X are used to initialise the net.
- X corresponds to the number of inputs into the net.
- This selection of weights guarantees that both the weights and their average value will be close to 0.
- This ensures that the net is initially not biased towards taken or not taken.
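- Pulling slides 51-55 together, a minimal sketch of a bipolar-input backpropagation predictor with one hidden layer; the hidden-layer width, λ = 1 and the derivative bookkeeping are assumptions, while the 0.125 learning rate and the [-2/X, 2/X] initialisation follow the talk:

```python
import numpy as np

# Minimal single-hidden-layer backpropagation branch predictor (bipolar coding).
# Input x is the PC bits concatenated with the HR, coded as -1/+1; one output neuron.
rng = np.random.default_rng(0)
N_IN, N_HID, LAM, LR = 26, 8, 1.0, 0.125
# Weights initialised uniformly in [-2/X, 2/X], where X is the number of inputs.
w1 = rng.uniform(-2 / N_IN, 2 / N_IN, (N_HID, N_IN))
w2 = rng.uniform(-2 / N_HID, 2 / N_HID, N_HID)

def bipolar_sigmoid(z):
    return 2.0 / (1.0 + np.exp(-LAM * z)) - 1.0

def forward(x):
    """Forward step: propagate the input layer by layer to the single output."""
    h = bipolar_sigmoid(w1 @ x)
    y = bipolar_sigmoid(w2 @ h)
    return h, y                         # y >= 0 predicts taken, y < 0 not taken

def train(x, taken):
    """Backward step: backpropagate the output error and adjust the weights."""
    global w1, w2
    h, y = forward(x)
    target = 1.0 if taken else -1.0
    dy = (target - y) * 0.5 * LAM * (1.0 - y * y)      # bipolar sigmoid derivative
    dh = (w2 * dy) * 0.5 * LAM * (1.0 - h * h)
    w2 = w2 + LR * dy * h
    w1 = w1 + LR * np.outer(dh, x)
```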
56 Simulations
- We simulated 3 LVQ predictors
- global LVQ,
- local LVQ,
- combined (global and local) LVQ.
- Inputs were
- global LVQ: PC and HRg,
- local LVQ: PC and HRl,
- combined LVQ: PC, HRg and HRl.
- The learning step a(t) was standardised at 0.001.
57 Simulations
- We simulated 2 types of Backpropagation predictors
- global Backpropagation,
- local Backpropagation.
- Inputs were either
- binary (0, 1),
- bipolar (-1, 1).
58 Simulations
- We therefore simulated 4 Backpropagation predictors
- global Backpropagation with binary inputs,
- global Backpropagation with bipolar inputs,
- local Backpropagation with binary inputs,
- local Backpropagation with bipolar inputs.
- A learning rate of 0.125 was used throughout.
59 Simulation Results
- global LVQ
- Achieved an average misprediction rate of 13.54% - worse than the BTC!
- Only modest improvements with increasing HR length beyond HR4.
- local LVQ
- Achieved an average misprediction rate of 10.91% - better than the BTC, but not as good as conventional two-level local prediction.
- Only modest improvements with increasing HR length beyond HR6.
- combined LVQ
- Was marginally better than the local LVQ.
60 Simulation Results
61 Simulation Results
- global Backpropagation with binary inputs
- Achieved an average misprediction rate of 11.28%.
- Not as good as conventional global two-level prediction.
- global Backpropagation with bipolar inputs
- Achieved an average misprediction rate of 8.77%.
- Better than conventional global two-level prediction.
62 Simulation Results
- local Backpropagation with binary inputs
- Achieved an average misprediction rate of 10.46%.
- Not as good as conventional local two-level prediction.
- local Backpropagation with bipolar inputs
- Achieved an average misprediction rate of 8.47%.
- Not as good as conventional local two-level prediction.
63 Simulation Results
- Backpropagation prediction
64 Simulation Results
- Backpropagation and conventional two-level prediction
65 Conclusion for NNs
- LVQ predictors do not compete with conventional two-level predictors.
- Backpropagation predictors with bipolar inputs compare well with conventional two-level predictors.
- The results suggest that in some cases neural network predictors may be able to exploit correlation information more effectively than conventional two-level predictors.
66 Conclusion for NNs
- We are looking at new composite input vectors that will outperform conventional two-level predictors.
- We are considering hardware implementations of NNs that work quickly and within a sensible silicon budget.
67 So where next?
- With deeper pipelines and increasing MII the branch prediction problem is going to be exacerbated further.
- Is dynamic branch prediction worth improving?
- Individual benchmarks have an upper limit of prediction accuracy.
- Have conventional predictors reached those ceilings?
- Cost can be high, even for CCP.
- Will NNs only achieve diminishing returns?
- So where next?
68 So where next?
- Why not remove branches from the instruction stream?
- Aggressive schedulers can remove branch instructions by guarded or predicated instruction execution (HSA and IA-64).
- In fact it is the difficult-to-predict branch instructions that are removed.
- But not all: IA-64 claims about 50%; at Hatfield we say about 30%!
- Why not multithread difficult-to-predict branches?
- Dynamically predict the simple-to-predict branches.
- Identify the difficult-to-predict branches and multithread them.
69 So where next?
- So why don't we apply a combination of techniques?
- Remove branches from the instruction stream by aggressive scheduling.
- Those that remain
- Dynamically predict the easy-to-predict branches.
- Multithread the difficult-to-predict branches.
- c.egan_at_herts.ac.uk