1
Two-level Adaptive Branch Prediction
  • Colin Egan
  • University of Hertfordshire
  • Hatfield
  • U.K.
  • c.egan@herts.ac.uk

2
Presentation Structure
  • Two-level Adaptive Branch Prediction
  • Cached Correlated Branch Prediction
  • Neural Branch Prediction
  • Conclusion and Discussion
  • Where next?

3
Two-level Adaptive Branch Prediction Schemes
  • First level
  • History register(s) record the outcome of the
    last k branches encountered.
  • A global history register records information
    from other branches leading to the branch.
  • Local (Per-Address) history registers record
    information of specific branches.

4
Two-level Adaptive Branch Prediction Schemes
  • Second level
  • Is termed the Pattern History Table (PHT).
  • The PHT consists of at least one array of two-bit
    up/down saturating counters that provide the
    prediction.
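  • Illustrative sketch (a minimal Python model, not from the original
    slides) of one two-bit up/down saturating counter: values 2-3 predict
    taken, 0-1 predict not taken.

    # Two-bit up/down saturating counter (states 0..3).
    def predict(counter):
        return counter >= 2                   # 2 or 3 predicts taken

    def update(counter, taken):
        if taken:
            return min(counter + 1, 3)        # saturate at strongly taken
        return max(counter - 1, 0)            # saturate at strongly not taken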

5
Two-level Adaptive Branch Prediction Schemes
  • Global schemes
  • Exploit correlation between the outcome of the
    current branch and neighbouring branches that
    were executed leading to this branch.
  • Local schemes
  • Exploit correlation between the outcome of the
    current branch and its past behaviour.

6
Global Two-level Adaptive Branch Prediction
  • GAg Predictor Implementation
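  • A minimal sketch (assumed sizes, not the authors' code) of how GAg
    forms its prediction: the k-bit global history register alone indexes
    the PHT; GAs/GAp would additionally select a PHT using branch-address
    bits. Reuses predict() and update() from the counter sketch above.

    K = 12                                    # history register length (assumed)
    pht = [1] * (2 ** K)                      # 2^K counters, weakly not taken
    ghr = 0                                   # global history register

    def gag_predict():
        return predict(pht[ghr])              # PHT indexed by the history alone

    def gag_resolve(taken):
        global ghr
        pht[ghr] = update(pht[ghr], taken)    # train the counter that was used
        ghr = ((ghr << 1) | int(taken)) & ((1 << K) - 1)   # shift the outcome in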

7
Global Two-level Adaptive Branch Prediction
  • GAp / GAs Predictor Implementation

8
Local Two-level Adaptive Branch Prediction
  • PAg
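  • A companion sketch (assumed sizes) of the local first level: each
    branch has its own history register, and in PAg all of these share a
    single PHT; PAs/PAp would use branch-address bits to select among
    several PHTs. Reuses predict() and update() from the counter sketch.

    K = 10                                    # local history length (assumed)
    N = 1024                                  # per-address history registers (assumed)
    local_hr = [0] * N
    pht = [1] * (2 ** K)                      # one PHT shared by all branches

    def pag_predict(pc):
        hr = local_hr[pc % N]                 # this branch's own history
        return predict(pht[hr])               # shared PHT, hence aliasing

    def pag_resolve(pc, taken):
        i = pc % N
        hr = local_hr[i]
        pht[hr] = update(pht[hr], taken)
        local_hr[i] = ((hr << 1) | int(taken)) & ((1 << K) - 1)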

9
Local Two-level Adaptive Branch Prediction
  • PAs / PAp

10
Problems with Two-level Adaptive Branch Prediction
  • Size of PHT
  • Increases exponentially as a function of HR
    length.
  • Use of uninitialised predictors
  • No tag fields are associated with PHT prediction
    counters.
  • Branch interference (aliasing)
  • In GAg and PAg all branches share a common PHT.
  • In GAs and PAs each PHT is shared by a set of
    branches.

11
Cached Correlated Branch Prediction
  • Minimises the number of initial mispredictions.
  • Eliminates branch interference.
  • Uses prediction hardware in a disciplined manner.
  • Is cost-effective.

12
Cached Correlated Branch Prediction
  • Today we are going to look at two types of cached
    correlated predictors
  • A Global Cached Correlated Predictor.
  • A Local Cached Correlated Predictor.
  • We have also developed a combined predictor that
    uses both global and local history information.

13
Cached Correlated Branch Prediction
  • The first-level history register remains the same
    as in conventional two-level predictors.
  • Uses a second level Prediction Cache instead of a
    PHT.

14
Cached Correlated Branch Prediction
  • Uses a secondary default predictor (BTC).
  • Both predictors provide a prediction.
  • A priority selector chooses the actual prediction.

15
Cached Correlated Branch Prediction
  • The Prediction Cache predicts on the past
    behaviour of the branch with the current history
    register pattern.
  • The Default Predictor predicts on the overall
    past behaviour of the branch.
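  • A minimal sketch of the two-stage selection under assumed structures:
    a tagged Prediction Cache indexed by a hash of the PC and history, a
    BTC holding a per-branch default counter, and a priority selector that
    prefers a Prediction Cache hit (reuses predict() from the counter
    sketch above).

    # pred_cache maps an index to a (tag, counter) pair; btc maps a PC to a counter.
    def make_tag_index(pc, hr, index_bits=10):
        h = (pc << 4) ^ hr                           # illustrative hash of PC and history
        return h >> index_bits, h & ((1 << index_bits) - 1)

    def cached_correlated_predict(pc, hr, pred_cache, btc):
        tag, index = make_tag_index(pc, hr)
        entry = pred_cache.get(index)
        if entry is not None and entry[0] == tag:    # Prediction Cache (correlated) hit
            return predict(entry[1])                 # behaviour with this history pattern
        if pc in btc:                                # otherwise fall back to the BTC
            return predict(btc[pc])                  # overall behaviour of the branch
        return False                                 # miss everywhere: predict not taken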

16
Prediction Cache
  • Size is not a function of the history register
    length.
  • Size is determined by the number of prediction
    counters that are actually used.
  • Requires a tag-field.
  • Is cost-effective as long as the cost of the
    redundant counters removed from a conventional
    PHT exceeds the cost of the added tags (a rough
    worked example is sketched below).
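  • A rough worked example of the cost argument; all the sizes below are
    assumptions for illustration, not figures from the slides.

    k = 30                                    # history register length
    pht_bits = (2 ** k) * 2                   # conventional PHT: 2^k two-bit counters
    entries = 32 * 1024                       # Prediction Cache entries actually provided
    tag_bits = 20                             # assumed tag width per entry
    cache_bits = entries * (2 + tag_bits)     # counter plus tag per entry
    print(pht_bits, cache_bits)               # roughly 2 Gbit versus 0.7 Mbit here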

17
A Global Cached Correlated Branch Predictor
18
A Local Cached Correlated Branch Predictor
  • Problem
  • A local predictor would require two sequential
    clock accesses:
  • One to access the BTC to furnish HRl.
  • A second to access the Prediction Cache.
  • Solution
  • Cache the next prediction for each branch in the
    BTC.
  • Only one clock access is therefore needed
    (sketched below).
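  • A sketch of the single-cycle idea, with assumed BTC fields: at branch
    resolution the prediction for the new local history is precomputed and
    stored in the branch's BTC entry, so the next lookup needs only one
    BTC access (reuses update() and cached_correlated_predict() from the
    earlier sketches; Prediction Cache training is omitted for brevity).

    K = 12                                    # assumed local history length
    # btc maps a PC to {"hr": local history, "counter": default 2-bit counter,
    # "next_pred": prediction precomputed for the current history}.

    def lookup(pc, btc):
        e = btc.get(pc)
        return e["next_pred"] if e else False         # one BTC access gives the prediction

    def resolve(pc, taken, btc, pred_cache):
        e = btc.setdefault(pc, {"hr": 0, "counter": 1, "next_pred": False})
        e["counter"] = update(e["counter"], taken)    # train the default predictor
        e["hr"] = ((e["hr"] << 1) | int(taken)) & ((1 << K) - 1)
        # Precompute the prediction for the new history so the next lookup does
        # not need a second, sequential Prediction Cache access.
        e["next_pred"] = cached_correlated_predict(pc, e["hr"], pred_cache,
                                                   {pc: e["counter"]})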

19
A Local Cached Correlated Branch Predictor
(Diagram: the PC accesses the BTC, which supplies hrl and the default
prediction; the PC hashed with hrl accesses the Prediction Cache, which
supplies the Prediction Cache prediction and a correlated-hit signal; a
prediction selector uses the correlated hit and BTC hit to choose the
actual prediction.)
20
Simulations
  • Stanford Integer Benchmark suite.
  • These benchmarks are difficult to predict.
  • Instruction traces were obtained from the
    Hatfield Superscalar Architecture (HSA).

21
Global Simulations
  • A comparative study of misprediction rates
  • A conventional GAg, a conventional GAs(16) and a
    conventional GAp.
  • Against a Global Cached Correlated predictor
    (1K to 64K entries).

22
Global Simulation Results
23
Global Simulation Results
  • For conventional global two-level predictors the
    best average misprediction rate of 9.23% is
    achieved by a GAs(16) predictor with a history
    register length of 26.
  • In general there is little benefit from
    increasing the history register length beyond
    16-bits for GAg and 14-bits for GAs/GAp.

24
Global Simulation Results
  • A 32K entry Prediction Cache with a 30-bit
    history register achieved the best misprediction
    rate of 5.99% for the global cached correlated
    predictors.
  • This represents a 54% reduction over the best
    misprediction rate achieved by a conventional
    global two-level predictor.

25
Global Simulations
  • We repeated the same simulations without the
    default predictor.

26
Global Simulation Results (without default predictor)
27
Global Simulation Results (without default predictor)
  • The best misprediction rate is now 9.12%.
  • The high performance of the cached predictor
    depends crucially on the provision of the
    two-stage mechanism.

28
Local Simulations
  • A comparative study of misprediction rates.
  • A conventional PAg, a conventional PAs(16) and a
    conventional PAp.
  • Against a local cached correlated predictor
    (1K to 64K entries), with and without the default
    predictor.

29
Local Simulation Results (with default predictor)
30
Local Simulation Results (without default predictor)
31
Local Simulation Results
  • For conventional local two-level predictors the
    best average misprediction rate of 7.35% is
    achieved by a PAp predictor with a history
    register length of 30.
  • The best misprediction rate achieved by a local
    cached correlated predictor (64K entries, HR = 28)
    is 6.19%.
  • This is a 19% improvement over the best
    conventional local two-level predictor.
  • However, without the default predictor the best
    misprediction rate achieved is 8.21% (32K entries,
    HR = 12).

32
Three-Stage Predictor
  • Since the high performance of a cached predictor
    depends crucially on the provision of the
    two-stage mechanism, we were led to the
    development of the three-stage predictor.

33
Three-Stage Predictor
  • Stages
  • Primary Prediction Cache.
  • Secondary Prediction Cache.
  • Default Predictor.
  • The predictions from the two Prediction Caches
    are stored in the BTC so that a prediction is
    furnished in a single clock cycle.
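  • A sketch of the three-stage priority selection under the organisation
    described above (assumed details; reuses predict() and
    make_tag_index() from the earlier sketches).

    def cache_lookup(cache, pc, hr):
        tag, index = make_tag_index(pc, hr)
        entry = cache.get(index)
        return predict(entry[1]) if entry is not None and entry[0] == tag else None

    def three_stage_predict(pc, hr, primary, secondary, btc, k=16):
        p = cache_lookup(primary, pc, hr)                  # full k-bit history
        if p is not None:
            return p
        s = cache_lookup(secondary, pc, hr & ((1 << (k // 2)) - 1))  # half the history bits
        if s is not None:
            return s
        return predict(btc[pc]) if pc in btc else False    # default predictor last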

34
Three-Stage Predictor Simulations
  • We repeated the same set of simulations.
  • We varied the Primary Prediction Cache size
    (1K to 64K entries).
  • The Secondary Prediction Cache was always half
    the size of the Primary Prediction Cache and used
    exactly half of the history register bits.

35
Global Three-Stage Predictor Simulation Results
36
Global Three-Stage Predictor Simulation Results
  • The global three-stage predictor consistently
    outperforms the simpler global two-stage
    predictor.
  • The best misprediction rate is now 5.57%, achieved
    with a 32K Prediction Cache and a 30-bit HR.
  • This represents a 7.5% improvement over the best
    global two-stage predictor.

37
Local Three-Stage Predictor Simulation Results
38
Local Three-Stage Predictor Simulation Results
  • The local three-stage predictor consistently
    outperforms the simpler local two-stage
    predictor.
  • The best misprediction rate is now 6.00%, achieved
    with a 64K Prediction Cache and a 28-bit HR.
  • This represents a 3.2% improvement over the best
    local two-stage predictor.

39
Conclusion So Far
  • Conventional PHTs use large amounts of hardware
    with increasing history register length.
  • The history register size of a cached correlated
    predictor does not determine cost.
  • A Prediction Cache can reduce the hardware cost
    over a conventional PHT.

40
Conclusion So Far
  • Cached correlated predictors provide better
    prediction accuracy than conventional two-level
    predictors.
  • The role of the default predictor in a cached
    correlated predictor is crucial.
  • Three-stage predictors consistently record a
    small but significant improvement over their
    two-stage counterparts.

41
Neural Network Branch Prediction
  • Dynamic branch prediction can be considered to be
    a specific instance of the general time series
    prediction problem.
  • Two-level Adaptive Branch Prediction is a very
    specific solution to the branch prediction
    problem.
  • An alternative approach is to look at other
    application areas and fields for novel solutions
    to the problem.
  • At Hatfield, we have examined the application of
    neural networks to the branch prediction problem.

42
Neural Network Branch Prediction
  • Two neural networks are considered
  • A Learning Vector Quantisation (LVQ) Network,
  • A Backpropagation Network.
  • One of our main research objectives is to use
    neural networks to identify new correlations that
    can be exploited by branch predictors.
  • We also wish to determine whether more accurate
    branch prediction is possible and to gain a
    greater understanding of the underlying
    prediction mechanisms.

43
Neural Network Branch Prediction
  • As with Cached Correlated Branch Prediction, we
    retain the first level of a conventional
    two-level predictor.
  • The k-bit pattern of the history register is fed
    into the network as input.
  • In fact, we concatenate the 10 least-significant
    bits (lsb) of the branch address with the HR to
    form the network input (sketched below).
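  • A sketch of that input construction: the 10 lsb of the branch address
    concatenated with the k-bit history register, with an optional bipolar
    coding.

    def make_input(pc, hr, k, bipolar=False):
        bits = [(pc >> i) & 1 for i in range(10)] + [(hr >> i) & 1 for i in range(k)]
        return [2 * b - 1 for b in bits] if bipolar else bits   # bipolar: 0/1 -> -1/+1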

44
LVQ prediction
  • The idea of using an LVQ predictor was to see if
    respectable prediction rates could be delivered
    by a simple LVQ network that was dynamically
    trained after each branch prediction.

45
LVQ prediction
  • The LVQ predictor contains two codebook vectors:
  • Vt is associated with a taken branch.
  • Vnt is associated with a not-taken branch.
  • The concatenated PC and HR form a single input
    vector.
  • We call this vector X.

46
LVQ prediction
  • Modified Hamming distances are then computed
    between X and each of Vt and Vnt.
  • The winning vector, Vw, is the vector with the
    smallest HD.

47
LVQ prediction
  • Vw is used to predict the branch.
  • If Vt wins then the branch is predicted as taken.
  • If Vnt wins then the branch is predicted as not
    taken.

48
LVQ prediction
  • LVQ network training
  • At branch resolution, Vw is adjusted:
  • Vw(t+1) = Vw(t) ± a(t)·[X(t) − Vw(t)]
  • To reinforce correct predictions, the vector is
    incremented whenever a prediction was proved to
    be correct, and decremented whenever a prediction
    was proved to be incorrect.
  • The factor a(t) represents the learning factor
    and was (usually) set to a small constant of
    less than 0.1.
  • The losing vector remains unchanged.
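  • A minimal LVQ sketch of the scheme above; a squared distance over
    real-valued codebook vectors stands in for the modified Hamming
    distance (an assumption for illustration).

    def lvq_predict(x, v_t, v_nt):
        d_t = sum((xi - vi) ** 2 for xi, vi in zip(x, v_t))    # distance to taken vector
        d_nt = sum((xi - vi) ** 2 for xi, vi in zip(x, v_nt))  # distance to not-taken vector
        return d_t <= d_nt                                     # nearer vector wins

    def lvq_train(x, v_w, correct, a=0.001):
        # Vw(t+1) = Vw(t) +/- a(t)[X(t) - Vw(t)]: move the winner towards X when
        # the prediction was correct and away from X when it was incorrect.
        sign = 1 if correct else -1
        return [vi + sign * a * (xi - vi) for xi, vi in zip(x, v_w)]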

49
LVQ prediction
  • LVQ network training
  • Training is therefore dynamic.
  • Training is also adaptive since the codebook
    vectors reflect the outcomes of the most recently
    encountered branches.

50
Backpropagation prediction
  • Prediction information is fed into a
    backpropagation network.

51
Backpropagation prediction
  • There are two steps through the backpropagation
    network.
  • First Step (forward step)
  • From the input layer to the output layer.
  • Propagates the input vector into the first layer;
    the outputs from this layer are fed as inputs to
    the next layer, and so on through to the final
    layer.
  • The outputs from the final layer are the output
    signals of the net.
  • In the case of branch prediction there is a
    single output signal: the prediction.
  • Second step (backward step)
  • Is similar to the forward step, except that error
    values are backpropagated through the network.
  • These error values are used to train the net by
    changing the weights.

52
Backpropagation prediction
  • Inputs into the net can be coded in two different
    ways
  • Binary (0 for not taken, 1 for taken).
  • Bipolar (-1 for not taken, 1 for taken).
  • Two different activation functions are required:
  • Sigmoid function for binary inputs.
  • Bipolar sigmoid function for bipolar inputs.

53
Backpropagation prediction
  • Sigmoid function:
  • 1 / (1 + e^(-λx))
  • Bipolar sigmoid function:
  • 2 / (1 + e^(-λx)) - 1
  • N.B. At the moment we are not considering
    implementation costs; when we do, we will have to
    consider a simpler/cheaper function.
  • The factor λ controls the degree of linearity
    (steepness) of the two functions.
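  • The two activation functions written out in code (λ is shown as lam;
    the default values are assumptions only).

    import math

    def sigmoid(x, lam=1.0):
        return 1.0 / (1.0 + math.exp(-lam * x))          # binary coding: output in (0, 1)

    def bipolar_sigmoid(x, lam=1.0):
        return 2.0 / (1.0 + math.exp(-lam * x)) - 1.0    # bipolar coding: output in (-1, 1)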

54
Backpropagation prediction
  • Binary inputs
  • A value greater than or equal to 0.5 is predicted
    as taken.
  • A value less than 0.5 is predicted as not
    taken.
  • Bipolar inputs
  • A value greater than or equal to 0 is predicted
    as taken.
  • A value less than 0 is predicted as not taken.

55
Backpropagation prediction
  • A backpropagation network is not initially
    trained.
  • Random values in the range [-2/X, 2/X] are used to
    initialise the weights of the net.
  • X corresponds to the number of inputs into the
    net.
  • This selection of weights guarantees that both the
    weights and their average value will be close to
    0.
  • This ensures that the net is initially not biased
    towards taken or not taken.
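  • A sketch of the initialisation and the forward step for a single
    hidden layer (the layer sizes, and the use of one hidden layer, are
    assumptions for illustration; reuses sigmoid() from the previous
    slide).

    import random

    def init_weights(n_in, n_hidden):
        # Random weights in [-2/X, 2/X], X being the number of inputs, so the
        # net starts with no bias towards taken or not taken.
        r = 2.0 / n_in
        hidden = [[random.uniform(-r, r) for _ in range(n_in)] for _ in range(n_hidden)]
        output = [random.uniform(-r, r) for _ in range(n_hidden)]
        return hidden, output

    def forward(x, hidden, output, lam=1.0):
        h = [sigmoid(sum(w * xi for w, xi in zip(row, x)), lam) for row in hidden]
        y = sigmoid(sum(w * hi for w, hi in zip(output, h)), lam)
        return y >= 0.5                                  # binary coding: >= 0.5 predicts taken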

56
Simulations
  • We simulated 3 LVQ predictors:
  • global LVQ,
  • local LVQ,
  • combined (global + local) LVQ.
  • Inputs were:
  • global LVQ: PC concatenated with HRg,
  • local LVQ: PC concatenated with HRl,
  • combined LVQ: PC concatenated with HRg and HRl.
  • The learning step a(t) was standardised at 0.001.

57
Simulations
  • We simulated 2 types of Backpropagation
    predictors
  • global Backpropagation,
  • local Backpropagation.
  • Inputs were either
  • binary (0, 1),
  • bipolar (-1, 1).

58
Simulations
  • We therefore simulated 4 Backpropagation
    predictors:
  • global Backpropagation binary,
  • global Backpropagation bipolar,
  • local Backpropagation binary,
  • local Backpropagation bipolar.
  • A learning rate of 0.125 was used throughout.

59
Simulation Results
  • global LVQ
  • Achieved an average misprediction rate of 13.54%
    - worse than the BTC!
  • Only modest improvements with increasing HR
    length beyond HR = 4.
  • local LVQ
  • Achieved an average misprediction rate of 10.91%
    - better than the BTC, but not as good as
    conventional two-level local prediction.
  • Only modest improvements with increasing HR
    length beyond HR = 6.
  • combined LVQ
  • Was marginally better than the local LVQ.

60
Simulation Results
  • LVQ prediction

61
Simulation Results
  • global Backpropagation with binary inputs
  • Achieved an average misprediction rate of 11.28%.
  • Not as good as conventional global two-level.
  • global Backpropagation with bipolar inputs
  • Achieved an average misprediction rate of 8.77%.
  • Better than conventional global two-level.

62
Simulation Results
  • local Backpropagation with binary inputs
  • Achieved an average misprediction rate of 10.46%.
  • Not as good as conventional local two-level.
  • local Backpropagation with bipolar inputs
  • Achieved an average misprediction rate of 8.47%.
  • Not as good as conventional local two-level.

63
Simulation Results
  • Backpropagation prediction

64
Simulation Results
  • Backpropagation and conventional two-level
    prediction

65
Conclusion for NNs
  • LVQ predictors do not compete with conventional
    two-level predictors.
  • Backpropagation predictors with bipolar inputs
    compare well with conventional two-level
    predictors.
  • The results suggest that in some cases neural
    network predictors may be able to exploit
    correlation information more effectively than
    conventional two-level predictors.

66
Conclusion for NNs
  • We are looking at new composite input vectors
    that will outperform conventional two-level
    predictors.
  • We are considering hardware implementations of
    NNs that work quickly and within a sensible
    silicon budget.

67
So where next?
  • With deeper pipelines and increasing multiple
    instruction issue (MII) widths, the branch
    prediction problem is going to be further
    exacerbated.
  • Is dynamic branch prediction worth improving?
  • Individual benchmarks have an upper limit of
    prediction accuracy.
  • Have conventional predictors reached those
    ceilings?
  • Cost can be high, even for cached correlated
    prediction (CCP).
  • Will NNs only achieve diminishing returns?
  • So where next?

68
So where next?
  • Why not remove branches from the instruction
    stream?
  • Aggressive schedulers can remove branch
    instructions by guarded or predicated instruction
    executions (HSA and IA-64).
  • In fact it is the difficult-to-predict branch
    instructions that are removed.
  • But not all of them: IA-64 proponents claim about
    50%, at Hatfield we say about 30%!
  • Why not multithread difficult-to-predict
    branches?
  • Dynamically predict the simple-to-predict
    branches.
  • Identify the difficult-to-predict branches and
    multithread them.

69
So where next?
  • So why don't we apply a combination of
    techniques?
  • Remove branches from the instruction stream by
    aggressive scheduling.
  • For those that remain:
  • Dynamically predict the easy-to-predict
    branches.
  • Multithread the difficult-to-predict branches.
  • c.egan@herts.ac.uk