Title: Two-level Adaptive Branch Prediction
1 Two-level Adaptive Branch Prediction
- Colin Egan
- University of Hertfordshire
- Hatfield
- U.K.
- c.egan_at_herts.ac.uk
2 Presentation Structure
- Two-level Adaptive Branch Prediction
- Cached Correlated Branch Prediction
- Neural Branch Prediction
- Conclusion and Discussion
- Where next?
3 Two-level Adaptive Branch Prediction Schemes
- First level
- History register(s) record the outcome of the last k branches encountered.
- A global history register records information from other branches leading to the branch.
- Local (Per-Address) history registers record information of specific branches.
4 Two-level Adaptive Branch Prediction Schemes
- Second level
- Is termed the Pattern History Table (PHT).
- The PHT consists of at least one array of two-bit up/down saturating counters that provide the prediction.
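- As a rough illustration (not taken from the talk), the sketch below models a GAg-style predictor: a k-bit global history register indexes a PHT of two-bit up/down saturating counters. The history length and counter initialisation are assumptions.

```python
# Minimal GAg-style two-level predictor sketch: a k-bit global history register
# indexes a PHT of two-bit up/down saturating counters (values 0..3).
K = 12                        # history register length (assumed)
pht = [1] * (1 << K)          # counters start at "weakly not taken"
ghr = 0                       # global history register

def predict():
    """Counter values 2 or 3 predict taken; 0 or 1 predict not taken."""
    return pht[ghr] >= 2

def update(taken):
    """At branch resolution: saturate the indexed counter, then shift the outcome in."""
    global ghr
    pht[ghr] = min(pht[ghr] + 1, 3) if taken else max(pht[ghr] - 1, 0)
    ghr = ((ghr << 1) | int(taken)) & ((1 << K) - 1)
```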
5 Two-level Adaptive Branch Prediction Schemes
- Global schemes
- Exploit correlation between the outcome of the current branch and neighbouring branches that were executed leading to this branch.
- Local schemes
- Exploit correlation between the outcome of the current branch and its past behaviour.
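- For contrast with the global sketch above, a local (per-address) first level keeps one history register per branch; a minimal sketch, with assumed table sizes and simple modulo indexing:

```python
# Local (PAg-style) sketch: each branch address selects its own k-bit history
# register; that pattern indexes a shared PHT of two-bit saturating counters.
K = 10                               # per-address history length (assumed)
N_BHT = 1024                         # number of local history registers (assumed)
bht = [0] * N_BHT                    # first level: local history registers
pht = [1] * (1 << K)                 # second level: shared 2-bit counters

def predict(pc):
    return pht[bht[pc % N_BHT]] >= 2

def update(pc, taken):
    i = pc % N_BHT
    hr = bht[i]
    pht[hr] = min(pht[hr] + 1, 3) if taken else max(pht[hr] - 1, 0)
    bht[i] = ((hr << 1) | int(taken)) & ((1 << K) - 1)
```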
6 Global Two-level Adaptive Branch Prediction
- GAg Predictor Implementation
7 Global Two-level Adaptive Branch Prediction
- GAp / GAs Predictor Implementation
8 Local Two-level Adaptive Branch Prediction
9 Local Two-level Adaptive Branch Prediction
10 Problems with Two-level Adaptive Branch Prediction
- Size of PHT
- Increases exponentially as a function of HR length.
- Use of uninitialised predictors
- No tag fields are associated with PHT prediction counters.
- Branch interference (aliasing)
- In GAg and PAg all branches share a common PHT.
- In GAs and PAs each PHT is shared by a set of branches.
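- To make the exponential growth concrete, a rough back-of-the-envelope calculation (illustrative history lengths; GAs(16) mirrors the configuration simulated later):

```python
# Rough storage cost of a conventional PHT as the history register length k grows.
# GAg needs 2**k two-bit counters; GAs(16) needs 16 such arrays.
for k in (12, 16, 26, 30):
    gag_bits = 2 * (2 ** k)
    gas_bits = 16 * gag_bits
    print(f"k={k}: GAg = {gag_bits // 8 // 1024} KB, GAs(16) = {gas_bits // 8 // 1024} KB")
```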
11 Cached Correlated Branch Prediction
- Minimises the number of initial mispredictions.
- Eliminates branch interference.
- Is used in a disciplined manner.
- Is cost-effective.
12 Cached Correlated Branch Prediction
- Today we are going to look at two types of cached correlated predictors
- A Global Cached Correlated Predictor.
- A Local Cached Correlated Predictor.
- We have also developed a combined predictor that uses both global and local history information.
13 Cached Correlated Branch Prediction
- The first level history register remains the same as in conventional two-level predictors.
- Uses a second level Prediction Cache instead of a PHT.
14 Cached Correlated Branch Prediction
- Uses a secondary default predictor (BTC).
- Both predictors provide a prediction.
- A priority selector chooses the actual prediction.
15 Cached Correlated Branch Prediction
- The Prediction Cache predicts on the past behaviour of the branch with the current history register pattern.
- The Default Predictor predicts on the overall past behaviour of the branch.
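- A minimal sketch of this two-stage lookup, using dictionaries in place of tagged hardware arrays (structure sizes and the static not-taken fall-back are assumptions):

```python
# Two-stage cached correlated lookup sketch: the Prediction Cache is keyed on
# (branch PC, history pattern); the default predictor keeps one counter per PC.
pred_cache = {}          # (pc, hr) -> 2-bit saturating counter
default_pred = {}        # pc -> 2-bit saturating counter

def predict(pc, hr):
    # Priority selector: prefer the correlated prediction when it hits.
    if (pc, hr) in pred_cache:
        return pred_cache[(pc, hr)] >= 2
    if pc in default_pred:
        return default_pred[pc] >= 2
    return False          # fall-back: statically predict not taken (assumption)

def update(pc, hr, taken):
    for table, key in ((pred_cache, (pc, hr)), (default_pred, pc)):
        c = table.get(key, 1)
        table[key] = min(c + 1, 3) if taken else max(c - 1, 0)
```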
16 Prediction Cache
- Size is not a function of the history register length.
- Size is determined by the number of prediction counters that are actually used.
- Requires a tag field.
- Is cost-effective as long as the cost of redundant counters removed from a conventional PHT exceeds the cost of the added tags.
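- A back-of-the-envelope version of that cost argument; the entry count matches the 32K configuration used later, while the tag width and valid bit are assumptions:

```python
# Conventional PHT (2 * 2**k bits) versus a tagged Prediction Cache of fixed size.
k = 30                                      # history register length
pht_bits = 2 * (2 ** k)                     # one 2-bit counter per history pattern
entries, tag_bits = 32 * 1024, 24           # 32K entries; assumed tag width
cache_bits = entries * (2 + tag_bits + 1)   # counter + tag + valid bit per entry
print(pht_bits, cache_bits, cache_bits < pht_bits)   # the cache is far cheaper here
```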
17 A Global Cached Correlated Branch Predictor
18 A Local Cached Correlated Branch Predictor
- Problem
- A local predictor requires two sequential clock accesses
- One to access the BTC to furnish HRl.
- A second to access the Prediction Cache.
- Solution
- Cache the next prediction for each branch in the BTC.
- Only one clock access is then needed.
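- One way to picture that single-access trick (field names are illustrative, not the talk's hardware): each BTC entry keeps the branch's local history together with a ready-made prediction computed when the branch was last resolved.

```python
# Sketch: the BTC caches the *next* prediction, so fetch needs one lookup; the
# Prediction Cache is consulted off the critical path at branch resolution.
btc = {}                 # pc -> {"hrl": local history, "next_pred": bool}
pred_cache = {}          # (pc, hrl) -> 2-bit saturating counter

def fetch_predict(pc):
    entry = btc.get(pc)
    return entry["next_pred"] if entry else False     # single access at fetch time

def resolve(pc, taken, hr_len=10):
    entry = btc.setdefault(pc, {"hrl": 0, "next_pred": False})
    key = (pc, entry["hrl"])
    c = pred_cache.get(key, 1)
    pred_cache[key] = min(c + 1, 3) if taken else max(c - 1, 0)
    # update the local history, then pre-compute the prediction for the next visit
    entry["hrl"] = ((entry["hrl"] << 1) | int(taken)) & ((1 << hr_len) - 1)
    entry["next_pred"] = pred_cache.get((pc, entry["hrl"]), 1) >= 2
```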
19 A Local Cached Correlated Branch Predictor
(Block diagram: the PC indexes the BTC, which supplies hrl and the default prediction; a hash of the PC and hrl indexes the Prediction Cache; the prediction selector uses the BTC hit and correlated hit signals to choose between the default and Prediction Cache predictions, giving the actual prediction.)
20 Simulations
- Stanford Integer Benchmark suite.
- These benchmarks are difficult to predict.
- Instruction traces were obtained from the Hatfield Superscalar Architecture (HSA).
21 Global Simulations
- A comparative study of misprediction rates
- A conventional GAg, a conventional GAs(16) and a conventional GAp.
- Against a Global Cached Correlated predictor (1K to 64K entries).
22 Global Simulation Results
23 Global Simulation Results
- For conventional global two-level predictors the best average misprediction rate of 9.23% is achieved by a GAs(16) predictor with a history register length of 26.
- In general there is little benefit from increasing the history register length beyond 16 bits for GAg and 14 bits for GAs/GAp.
24 Global Simulation Results
- A 32K entry Prediction Cache with a 30-bit history register achieved the best misprediction rate of 5.99% for the global cached correlated predictors.
- This represents a 54% reduction over the best misprediction rate achieved by a conventional global two-level predictor.
25 Global Simulations
- We repeated the same simulations without the default predictor.
26 Global Simulation Results (without default predictor)
27 Global Simulation Results (without default predictor)
- The best misprediction rate is now 9.12%.
- The high performance of the cached predictor depends crucially on the provision of the two-stage mechanism.
28 Local Simulations
- A comparative study of misprediction rates.
- A conventional PAg, a conventional PAs(16) and a conventional PAp.
- Against a local cached correlated predictor (1K to 64K entries), with and without the default predictor.
29 Local Simulation Results (with default predictor)
30 Local Simulation Results (without default predictor)
31 Local Simulation Results
- For conventional local two-level predictors the best average misprediction rate of 7.35% is achieved by a PAp predictor with a history register length of 30.
- The best misprediction rate achieved by a local cached correlated predictor (64K entries, HR length 28) is 6.19%.
- This is a 19% improvement over the best conventional local two-level predictor.
- However, without the default predictor the best misprediction rate achieved is 8.21% (32K entries, HR length 12).
32 Three-Stage Predictor
- Since the high performance of a cached predictor depends crucially on the provision of the two-stage mechanism, we were led to the development of the three-stage predictor.
33 Three-Stage Predictor
- Stages
- Primary Prediction Cache.
- Secondary Prediction Cache.
- Default Predictor.
- The predictions from the two Prediction Caches are stored in the BTC so that a prediction is furnished in a single clock cycle.
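- A sketch of the three-stage priority selection; which half of the history register the secondary cache uses is an assumption:

```python
# Three-stage selection sketch: primary Prediction Cache (full history), then a
# secondary Prediction Cache (half the history bits), then the default predictor.
primary, secondary, default_pred = {}, {}, {}

def predict(pc, hr, hr_len=16):
    half_hr = hr & ((1 << (hr_len // 2)) - 1)   # most recent half of the HR (assumed)
    if (pc, hr) in primary:
        return primary[(pc, hr)] >= 2
    if (pc, half_hr) in secondary:
        return secondary[(pc, half_hr)] >= 2
    return default_pred.get(pc, 1) >= 2
```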
34 Three-Stage Predictor Simulations
- We repeated the same set of simulations.
- We varied the Primary Prediction Cache size (1K to 64K entries).
- The Secondary Prediction Cache was always half the size of the Primary Prediction Cache and used exactly half of the history register bits.
35 Global Three-Stage Predictor Simulation Results
36 Global Three-Stage Predictor Simulation Results
- The global three-stage predictor consistently outperforms the simpler global two-stage predictor.
- The best misprediction rate is now 5.57%, achieved with a 32K Prediction Cache and a 30-bit HR.
- This represents a 7.5% improvement over the best global two-stage predictor.
37 Local Three-Stage Predictor Simulation Results
38 Local Three-Stage Predictor Simulation Results
- The local three-stage predictor consistently outperforms the simpler local two-stage predictor.
- The best misprediction rate is now 6.00%, achieved with a 64K Prediction Cache and a 28-bit HR.
- This represents a 3.2% improvement over the best local two-stage predictor.
39 Conclusion So Far
- Conventional PHTs use large amounts of hardware with increasing history register length.
- The history register size of a cached correlated predictor does not determine cost.
- A Prediction Cache can reduce the hardware cost over a conventional PHT.
40 Conclusion So Far
- Cached correlated predictors provide better prediction accuracy than conventional two-level predictors.
- The role of the default predictor in a cached correlated predictor is crucial.
- Three-stage predictors consistently record a small but significant improvement over their two-stage counterparts.
41 Neural Network Branch Prediction
- Dynamic branch prediction can be considered a specific instance of general time series prediction.
- Two-level Adaptive Branch Prediction is a very specific solution to the branch prediction problem.
- An alternative approach is to look at other application areas and fields for novel solutions to the problem.
- At Hatfield, we have examined the application of neural networks to the branch prediction problem.
42 Neural Network Branch Prediction
- Two neural networks are considered
- A Learning Vector Quantisation (LVQ) Network,
- A Backpropagation Network.
- One of our main research objectives is to use neural networks to identify new correlations that can be exploited by branch predictors.
- We also wish to determine whether more accurate branch prediction is possible and to gain a greater understanding of the underlying prediction mechanisms.
43 Neural Network Branch Prediction
- As with Cached Correlated Branch Prediction, we retain the first level of a conventional two-level predictor.
- The k-bit pattern of the history register is fed into the network as input.
- In fact, we concatenate the 10 lsb of the branch address with the HR as input to the network.
44 LVQ prediction
- The idea of using an LVQ predictor was to see if respectable prediction rates could be delivered by a simple LVQ network that was dynamically trained after each branch prediction.
45 LVQ prediction
- The LVQ predictor contains two codebook vectors
- Vt is associated with a taken branch.
- Vnt is associated with a not taken branch.
- The concatenated PC and HR form a single input vector.
- We call this vector X.
46 LVQ prediction
- Modified Hamming distances are then computed between X and each of Vt and Vnt.
- The winning vector, Vw, is the vector with the smallest HD.
47 LVQ prediction
- Vw is used to predict the branch.
- If Vt wins then the branch is predicted as taken.
- If Vnt wins then the branch is predicted as not taken.
48 LVQ prediction
- LVQ network training
- At branch resolution, Vw is adjusted
- Vw(t+1) = Vw(t) +/- a(t)[X(t) - Vw(t)]
- To reinforce correct predictions, the winning vector is incremented (+) whenever a prediction was proved to be correct, and decremented (-) whenever a prediction was proved to be incorrect.
- The factor a(t) represents the learning factor and was (usually) set to a small constant of less than 0.1.
- The losing vector remains unchanged.
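- A compact sketch of that training rule; the "modified Hamming distance" is stood in for by a squared distance here, and the input coding and length are assumptions:

```python
import numpy as np

# LVQ branch predictor sketch: X concatenates branch-address bits and the history
# register; the closest codebook vector wins and supplies the prediction.
N = 26                         # input length, e.g. 10 PC bits + 16 HR bits (assumed)
a = 0.001                      # learning step, as standardised in the simulations
v_t = np.zeros(N)              # codebook vector associated with taken
v_nt = np.zeros(N)             # codebook vector associated with not taken

def predict(x):
    return np.sum((x - v_t) ** 2) <= np.sum((x - v_nt) ** 2)   # True = taken

def train(x, predicted_taken, actual_taken):
    """Vw(t+1) = Vw(t) +/- a(t)[X(t) - Vw(t)]: '+' if the prediction was correct."""
    vw = v_t if predicted_taken else v_nt
    sign = 1.0 if predicted_taken == actual_taken else -1.0
    vw += sign * a * (x - vw)           # in-place update; the losing vector is untouched
```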
49 LVQ prediction
- LVQ network training
- Training is therefore dynamic.
- Training is also adaptive since the codebook vectors reflect the outcomes of the most recently encountered branches.
50 Backpropagation prediction
- Prediction information is fed into a backpropagation network.
51 Backpropagation prediction
- There are two steps through the backpropagation network.
- First step (forward step)
- From the input layer to the output layer.
- Propagates the input vector of the net into the first layer; outputs from this first layer are fed as inputs to the next layer, and so on to the final layer.
- The outputs from the final layer are the output signals of the net.
- In the case of branch prediction there is a single output signal: the prediction.
- Second step (backward step)
- Is similar to the forward step, except that error values are backpropagated through the network.
- These error values are used to train the net by changing the weights.
52 Backpropagation prediction
- Inputs into the net can be coded in two different ways
- Binary (0 for not taken, 1 for taken).
- Bipolar (-1 for not taken, 1 for taken).
- Two different activation functions are required
- Sigmoid function for binary inputs.
- Bipolar sigmoid function for bipolar inputs.
53 Backpropagation prediction
- Sigmoid function
- f(x) = 1 / (1 + e^(-λx))
- Bipolar sigmoid function
- f(x) = 2 / (1 + e^(-λx)) - 1
- N.B. At the moment we are not considering implementation costs; when we do, we will have to consider a simpler/cheaper function.
- The factor λ controls the degree of linearity of the two functions.
54 Backpropagation prediction
- Binary inputs
- An output value greater than or equal to 0.5 is predicted as taken.
- An output value less than 0.5 is predicted as not taken.
- Bipolar inputs
- An output value greater than or equal to 0 is predicted as taken.
- An output value less than 0 is predicted as not taken.
55 Backpropagation prediction
- A backpropagation network is not initially trained.
- Random values between -2/X and 2/X are used to initialise the net.
- X corresponds to the number of inputs into the net.
- This selection of weights guarantees that both the weights and their average value will be close to 0.
- This ensures that the net is initially not biased towards taken or not taken.
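- Pulling slides 51-55 together, a minimal sketch of a bipolar-input backpropagation predictor with one hidden layer; the hidden-layer width, λ = 1 and the derivative bookkeeping are assumptions, while the 0.125 learning rate and the [-2/X, 2/X] initialisation follow the talk:

```python
import numpy as np

# Minimal single-hidden-layer backpropagation branch predictor (bipolar coding).
# Input x is the PC bits concatenated with the HR, coded as -1/+1; one output neuron.
rng = np.random.default_rng(0)
N_IN, N_HID, LAM, LR = 26, 8, 1.0, 0.125
# Weights initialised uniformly in [-2/X, 2/X], where X is the number of inputs.
w1 = rng.uniform(-2 / N_IN, 2 / N_IN, (N_HID, N_IN))
w2 = rng.uniform(-2 / N_HID, 2 / N_HID, N_HID)

def bipolar_sigmoid(z):
    return 2.0 / (1.0 + np.exp(-LAM * z)) - 1.0

def forward(x):
    """Forward step: propagate the input layer by layer to the single output."""
    h = bipolar_sigmoid(w1 @ x)
    y = bipolar_sigmoid(w2 @ h)
    return h, y                         # y >= 0 predicts taken, y < 0 not taken

def train(x, taken):
    """Backward step: backpropagate the output error and adjust the weights."""
    global w1, w2
    h, y = forward(x)
    target = 1.0 if taken else -1.0
    dy = (target - y) * 0.5 * LAM * (1.0 - y * y)      # bipolar sigmoid derivative
    dh = (w2 * dy) * 0.5 * LAM * (1.0 - h * h)
    w2 = w2 + LR * dy * h
    w1 = w1 + LR * np.outer(dh, x)
```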
56 Simulations
- We simulated 3 LVQ predictors
- global LVQ,
- local LVQ,
- combined (global and local) LVQ.
- Inputs were
- global LVQ: PC and HRg,
- local LVQ: PC and HRl,
- combined LVQ: PC, HRg and HRl.
- The learning step a(t) was standardised at 0.001.
57 Simulations
- We simulated 2 types of Backpropagation predictors
- global Backpropagation,
- local Backpropagation.
- Inputs were either
- binary (0, 1),
- bipolar (-1, 1).
58 Simulations
- We therefore simulated 4 Backpropagation predictors
- global Backpropagation with binary inputs,
- global Backpropagation with bipolar inputs,
- local Backpropagation with binary inputs,
- local Backpropagation with bipolar inputs.
- A learning rate of 0.125 was used throughout.
59 Simulation Results
- global LVQ
- Achieved an average misprediction rate of 13.54% - worse than the BTC!
- Only modest improvements with increasing HR length beyond HR4.
- local LVQ
- Achieved an average misprediction rate of 10.91% - better than the BTC, but not as good as conventional two-level local prediction.
- Only modest improvements with increasing HR length beyond HR6.
- combined LVQ
- Was marginally better than the local LVQ.
60 Simulation Results
61 Simulation Results
- global Backpropagation with binary inputs
- Achieved an average misprediction rate of 11.28%.
- Not as good as conventional global two-level prediction.
- global Backpropagation with bipolar inputs
- Achieved an average misprediction rate of 8.77%.
- Better than conventional global two-level prediction.
62 Simulation Results
- local Backpropagation with binary inputs
- Achieved an average misprediction rate of 10.46%.
- Not as good as conventional local two-level prediction.
- local Backpropagation with bipolar inputs
- Achieved an average misprediction rate of 8.47%.
- Not as good as conventional local two-level prediction.
63 Simulation Results
- Backpropagation prediction
64 Simulation Results
- Backpropagation and conventional two-level prediction
65 Conclusion for NNs
- LVQ predictors do not compete with conventional two-level predictors.
- Backpropagation predictors with bipolar inputs compare well with conventional two-level predictors.
- The results suggest that in some cases neural network predictors may be able to exploit correlation information more effectively than conventional two-level predictors.
66 Conclusion for NNs
- We are looking at new composite input vectors that will outperform conventional two-level predictors.
- We are considering hardware implementations of NNs that work quickly and within a sensible silicon budget.
67 So where next?
- With deeper pipelines and increasing MII the branch prediction problem is going to be exacerbated further.
- Is dynamic branch prediction worth improving?
- Individual benchmarks have an upper limit of prediction accuracy.
- Have conventional predictors reached those ceilings?
- Cost can be high, even for CCP.
- Will NNs only achieve diminishing returns?
- So where next?
68 So where next?
- Why not remove branches from the instruction stream?
- Aggressive schedulers can remove branch instructions by guarded or predicated instruction execution (HSA and IA-64).
- In fact it is the difficult-to-predict branch instructions that are removed.
- But not all: IA-64 claims about 50%; at Hatfield we say about 30%!
- Why not multithread difficult-to-predict branches?
- Dynamically predict the simple-to-predict branches.
- Identify the difficult-to-predict branches and multithread them.
69 So where next?
- So why don't we apply a combination of techniques?
- Remove branches from the instruction stream by aggressive scheduling.
- Those that remain
- Dynamically predict the easy-to-predict branches.
- Multithread the difficult-to-predict branches.
- c.egan_at_herts.ac.uk