Recent Advances in Branch Prediction
1
Recent Advances in Branch Prediction
Daniel Ángel Jiménez
Department of Computer Science
Rutgers, The State University of New Jersey
2
This Talk
  • Brief introduction to conditional branch
    prediction
  • Some motivation, some background
  • Improving Branch Prediction with the Compiler
  • Pattern history table partitioning
  • Improving Branch Prediction in the
    Microarchitecture
  • Perceptron predictor
  • Mathematical intuition
  • Some pictures and movies
  • Piecewise Linear Branch Prediction

3
Pipelining and Branches
Pipelining overlaps instructions to exploit
parallelism, allowing the clock rate to be
increased. Branches cause bubbles in the
pipeline, where some stages are left idle.
Instruction fetch
Instruction decode
Execute
Memory access
Write back
Unresolved branch instruction
4
Branch Prediction
A branch predictor allows the processor to
speculatively fetch and execute instructions down
the predicted path.
Instruction fetch
Instruction decode
Execute
Memory access
Write back
Speculative execution
Branch predictors must be highly accurate to
avoid mispredictions!
5
Branch Predictor Accuracy is Critical
  • The cost of a misprediction is proportional to
    pipeline depth
  • Predictor accuracy is more important for deeper
    pipelines
  • Pentium 4 with Prescott core pipeline has 31
    stages!
  • Recent cores have calmed down but branch
    prediction remains a problem (this used to be my
    favorite slide)
  • Deeper pipelines allow higher clock rates by
    decreasing the delay of each pipeline stage
  • Decreasing the misprediction rate from 9% to
    4% results in a 31% speedup for a 32-stage
    pipeline

Simulations with SimpleScalar/Alpha
6
Conditional Branch Prediction
  • Most predictors based on 2-level adaptive branch
    prediction [Yeh & Patt '91]
  • Branch outcomes shifted into history register,
    1 = taken, 0 = not taken
  • History bits and address bits combine to index a
    pattern history table (PHT) of 2-bit saturating
    counters
  • Prediction is high bit of counter
  • Counter is incremented if branch is taken,
    decremented if branch is not taken
  • Branches tend to be highly biased

GAs, a common type of predictor
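
The mechanics above can be sketched as a toy model. This is an illustrative sketch only; the history length, table size, and the exact way history and address bits combine into an index are assumptions, not any particular shipping design:

```python
# Toy GAs-style two-level predictor: a global history register plus low
# branch-address bits index a pattern history table (PHT) of 2-bit
# saturating counters. Sizes below are illustrative assumptions.

HIST_BITS = 8          # global history length
ADDR_BITS = 4          # low branch-address bits used in the index
PHT_SIZE = 1 << (HIST_BITS + ADDR_BITS)

pht = [1] * PHT_SIZE   # 2-bit counters, start at "weakly not taken"
ghr = 0                # global history register: 1 = taken, 0 = not taken

def index(pc):
    # Concatenate low address bits with the history bits
    return ((pc & ((1 << ADDR_BITS) - 1)) << HIST_BITS) | ghr

def predict(pc):
    # Prediction is the high bit of the 2-bit counter
    return pht[index(pc)] >= 2

def update(pc, taken):
    global ghr
    i = index(pc)
    # Saturating increment if taken, decrement if not taken
    if taken:
        pht[i] = min(3, pht[i] + 1)
    else:
        pht[i] = max(0, pht[i] - 1)
    # Shift the outcome into the history register
    ghr = ((ghr << 1) | (1 if taken else 0)) & ((1 << HIST_BITS) - 1)
```

A strongly biased branch quickly drives its counters to a saturated state, which is why biased branches are so predictable.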
7
Destructive Interference (Aliasing)
  • Unrelated branches might accidentally use the
    same counter
  • If two branches behave differently, the predictor
    can't learn the behavior
  • Leads to decreased accuracy
  • A small decrease in accuracy can have a large
    impact on performance
  • Much branch prediction research focuses on
    reducing this effect
  • Almost all known techniques change the
    microarchitecture
  • Techniques shown to work well in simulation
  • But microprocessor manufacturers still use
    relatively simple predictors
  • Can we reduce destructive interference without
    changing the processor?

8
Pattern History Table Partitioning [PLDI 2005]
  • Compiler changes branch addresses to reduce
    interference
  • Exploits the fact that branches are highly biased
  • PHT divided into partitions based on branch bias

9
Simple Idea
  • The predictor uses part of the branch address to
    form an index
  • Bimodal PHT Partitioning
  • Make sure that one bit (bit k) of that address
    is
  • 1, for biased taken branches, and
  • 0, for biased not taken branches
  • Four-way PHT Partitioning
  • Control two address bits for the following four
    cases
  • 00, for weakly biased not taken branches,
  • 01, for weakly biased taken branches,
  • 10, for strongly biased not taken branches,
  • 11, for strongly biased taken branches
  • This way, branches with similar behavior are
    grouped together, reducing destructive
    interference
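
The bimodal case above can be shown in a few lines. This is an illustrative sketch; the bit position k, the PHT size, and the example addresses are made-up assumptions:

```python
# Bimodal PHT partitioning, illustrated: with a PHT indexed by low PC bits,
# forcing bit k of each branch's address to match its bias places
# biased-taken and biased-not-taken branches in disjoint halves of the
# table, so they cannot alias each other.

K = 5                      # the controlled address bit (assumed)
PHT_BITS = 10              # PHT indexed by 10 low PC bits (assumed)

def pht_index(pc):
    return pc & ((1 << PHT_BITS) - 1)

def partition_of(pc):
    # Which half of the PHT this branch falls in, given bit k
    return (pc >> K) & 1

# Two biased-taken branches placed so bit k = 1, and one biased-not-taken
# branch placed so bit k = 0 (hypothetical addresses):
taken_a, taken_b, not_taken_c = 0b100000, 0b1100100000, 0b010011

assert partition_of(taken_a) == partition_of(taken_b) == 1
assert partition_of(not_taken_c) == 0
```

Since the two taken branches share a partition, any aliasing between them is likely neutral (both push their counters the same way), while cross-bias aliasing is ruled out by construction.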

10
Simple Idea continued
11
Complicated Implementation
  • How do we control the addresses of branch
    instructions?
  • By inserting no-op instructions to move branches
    where we want them
  • But we can't just insert them anywhere; that
    would hurt performance
  • Solution: insert no-ops between specially
    selected regions of instructions
  • How do we know branch biases?
  • Profiling with a training run (or heuristics in
    the absence of profiling)
  • What's the best value of k?
  • Predictor-specific; determine empirically for
    each microarchitecture
  • How does the compiler know the addresses of
    instructions?
  • Through feedback from the assembler
  • Gets complicated when inserting a no-op can
    affect a PC-relative branch

12
Adding No-Ops Between Regions
  • No-ops are inserted between regions of
    instructions
  • Each procedure is divided into regions
  • The first region begins with the first basic
    block
  • A region can be ended by
  • A basic block that does not fall through, e.g. an
    unconditional jump or return instruction
  • A basic block ending in a conditional branch that
    is taken at least 99.9% of the time
  • This ensures that
  • There is enough flexibility to insert no-ops in
    the program
  • No-ops are almost never actually fetched and
    executed

13
Deciding How Many No-Ops To Insert
  • Goal is to maximize number of branches whose
    biases match bit k in their addresses
  • Each procedure is aligned on a 2^k-byte boundary
  • For each region in sequence,
  • Determine the effect of inserting from 0 to
    2^k − 1 single-byte no-ops
  • Choose the value that maximizes the number of
    branches assigned to their correct partition,
    weighted by dynamic execution frequency
  • Each time the effect of a new number of no-ops is
    evaluated, we must compute the new addresses
    (modulo 2^k) of all instructions affected
  • We can't just add 1 for each no-op because of
  • PC-relative branches that can change size
  • Compiler-inserted alignment pseudo-ops
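
The greedy choice above can be sketched as follows. This is a heavily simplified illustration: real Camino must recompute the sizes of PC-relative branches and honor alignment pseudo-ops, and the bit position and example weights here are assumptions:

```python
# Greedy selection of a no-op pad for one region: try every padding of
# 0 .. 2^K - 1 single-byte no-ops and keep the count that maximizes the
# dynamically weighted number of branches whose controlled address bit
# matches their bias.

K = 6        # padding modulus is 2^K bytes (illustrative assumption)
BIT = K - 1  # the controlled address bit (illustrative assumption)

def score(branches, pad):
    # branches: list of (offset_in_region, biased_taken, dynamic_weight).
    # Simplification: `pad` bytes shift every branch by exactly `pad`
    # (the real compiler must re-measure PC-relative branch sizes).
    s = 0
    for off, biased_taken, weight in branches:
        bit = ((off + pad) >> BIT) & 1
        if bit == (1 if biased_taken else 0):
            s += weight
    return s

def best_padding(branches):
    # Exhaustively try every legal pad; regions are small, so this is cheap
    return max(range(1 << K), key=lambda pad: score(branches, pad))

# Hypothetical region: a hot biased-taken branch and a cold
# biased-not-taken branch that initially conflict.
region = [(0, True, 10), (4, False, 1)]
```

Weighting by dynamic execution frequency means a pad that fixes one hot branch beats one that fixes several cold ones.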

14
PHT Partitioning on the Intel Pentium 4
  • Intel Pentium 4 branch predictor details are very
    secret
  • Seems to include a GAs branch predictor
  • Intel engineers disappear in a puff of smoke when
    I ask them about it
  • I used ad hoc trial-and-error experiments to find
    the best address bits for bimodal and 4-way PHT
partitioning
  • Camino compiler infrastructure
  • Development began at UT Austin, continues at
    Rutgers
  • Camino uses GCC -S to produce assembly language
    from C and C++
  • Camino reads the assembly language, forming
    control-flow-graphs
  • Transformations are done at this CFG level
  • Result is put back into assembly form and
    assembled
  • This slide represents 95% of the work: tedious
    development

15
Methodology
  • Used SPEC CPU integer benchmarks
  • Benchmarks from SPEC CPU 2000
  • Also from SPEC CPU 95 that weren't duplicated in
    2000
  • A few benchmarks failed to work with Camino
  • Dell workstation for evaluation of the
    optimization
  • Intel Pentium 4 2.8 GHz
  • 2GB SDRAM
  • Compare with greedy branch alignment [Calder &
    Grunwald '94]
  • Measure median of 5 execution times on quiescent
    system
  • Measure number of branch mispredictions with
    OProfile in a separate run

16
Results Speedup
  • Mean speedup for 4-way was 4.5%, maximum speedup
    16%
  • Speedup for branch alignment virtually nil,
    probably because of the Intel Pentium 4's trace cache

17
Results Decrease in Mispredictions
  • Normalized MPKI reduced by 3.5% for 4-way
    partitioning

18
Future For This Work
  • PHT partitioning is crude; with path profiling,
    I hope to direct paths to specific PHT entries
    [FDDO 2001]
  • A lot of work needs to be done to
    reverse-engineer industrial branch predictors to
    develop better optimizations
  • Convince chip manufacturers to use simple branch
    predictors and leave the hard work to the compiler

19
Changing the Branch Predictor
  • Now let's consider changing the branch predictor
    itself
  • The architecture literature is replete with these
    kinds of branch prediction papers
  • Before 2001, most work refined two-level adaptive
    branch prediction [Yeh & Patt '91]
  • A 1st-level table records recent global or
    per-branch pattern histories
  • A 2nd-level table learns correlations between
    histories and outcomes
  • Refinements focus on reducing destructive
    interference in the tables
  • Some of the better refinements (not an exhaustive
    list)
  • gshare [McFarling '93], agree [Sprangle et al.
    '97], hybrid predictors [Evers et al. '96],
    skewed predictors [Michaud et al. '97]

20
Conditional Branch Prediction is a Machine
Learning Problem
  • The machine learns to predict conditional
    branches
  • So why not apply a machine learning algorithm?
  • Artificial neural networks
  • Simple model of neural networks in brain cells
  • Learn to recognize and classify patterns
  • We used fast and accurate perceptrons
    [Rosenblatt '62, Block '62] for dynamic branch
    prediction [Jiménez & Lin, HPCA 2001]
  • We were the first to use single-layer perceptrons
    and to achieve accuracy superior to PHT
    techniques. Previous work used LVQ and MLP for
    branch prediction [Vintan & Iridon '99].

21
Basics of Neural Branch Prediction
  • The inputs to a neuron are branch outcome
    histories
  • The last n branch outcomes
  • Can be global or local (per-branch) or both
    (alloyed)
  • Conceptually, branch outcomes are represented as
  • 1, for taken
  • -1, for not taken
  • The output of the neuron is
  • Non-negative, if the branch is predicted taken
  • Negative, if the branch is predicted not taken
  • Ideally, each static branch is allocated its own
    neuron

22
Branch-Predicting Perceptron
  • Inputs (x's) are from branch history
  • n + 1 small integer weights (w's) learned by
    on-line training
  • Output (y) is dot product of x's and w's; predict
    taken if y ≥ 0
  • Training finds correlations between history and
    outcome

23
Mathematical Intuition
A perceptron defines a hyperplane in
(n+1)-dimensional space:

    y = w_0 + x_1 w_1 + ... + x_n w_n

For instance, in 2D space we have

    y = w_0 + x_1 w_1

This is the equation of a line, the same as

    y = mx + b
24
Mathematical Intuition continued
In 3D space, we have

    y = w_0 + x_1 w_1 + x_2 w_2

Or you can think of it as

    z = c + ax + by

i.e. the equation of a plane in 3D space. This
hyperplane forms a decision surface separating
predicted taken from predicted not taken
histories. This surface intersects the feature
space. It is a linear surface, e.g. a line in
2D, a plane in 3D, a hyperplane in 4D, etc.
25
Example The AND Function
  • White means false, black means true for the
    output
  • -1 means false, 1 means true for the input
  • A linear decision surface (i.e. a plane in 3D
    space) intersecting the feature space (i.e. the
    2D plane where z = 0) separates false from true
    instances

-1 AND -1 = false
-1 AND  1 = false
 1 AND -1 = false
 1 AND  1 = true
26
Example AND continued
  • Watch a perceptron learn the AND function

27
Example XOR
  • Here's the XOR function

-1 XOR -1 = false
-1 XOR  1 = true
 1 XOR -1 = true
 1 XOR  1 = false
Perceptrons cannot learn such linearly
inseparable functions
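
This claim can be checked directly. A small search over integer weight vectors suffices here, since XOR admits no real-valued separating hyperplane at all; the search range is an illustrative choice:

```python
# Brute-force check that no single perceptron computes XOR: no weight
# vector (w0, w1, w2) classifies all four bipolar XOR cases correctly.

cases = [(-1, -1, False), (-1, 1, True), (1, -1, True), (1, 1, False)]

def classifies_all(w0, w1, w2):
    # Perceptron rule: predict true iff w0 + w1*x1 + w2*x2 >= 0
    return all(((w0 + w1 * x1 + w2 * x2) >= 0) == out
               for x1, x2, out in cases)

separable = any(classifies_all(w0, w1, w2)
                for w0 in range(-3, 4)
                for w1 in range(-3, 4)
                for w2 in range(-3, 4))
# separable is False: XOR is linearly inseparable
```

Algebraically, the taken cases force w0 < w1 + w2 and the not-taken cases force w1 + w2 < -w0 with w0 ≥ 0, a contradiction.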
28
My Previous Work on Neural Predictors
  • The perceptron predictor [HPCA 2001, TOCS 2002]
    uses only pattern history information
  • The same weights vector is used for every
    prediction of a static branch
  • The ith history bit could come from any number of
    static branches
  • So the ith correlating weight is aliased among
    many branches
  • The newer path-based neural predictor [MICRO
    2003] uses path information
  • The ith correlating weight is selected using the
    ith branch address
  • This allows the predictor to be pipelined,
    mitigating latency [MICRO 2000, HPCA 2003]
  • This strategy improves accuracy because of path
    information
  • But there is now even more aliasing since the ith
    weight could be used to predict many different
    branches

29
Piecewise Linear Branch Prediction [ISCA 2005]
  • Generalization of perceptron and path-based
    neural predictors
  • Ideally, there is a weight giving the correlation
    between each
  • Static branch b, and
  • Each pair of branch and history position (i.e. i)
    in b's history
  • b might have 1000s of correlating weights or just
    a few
  • Depends on the number of static branches in b's
    history

30
The Algorithm Parameters and Variables
  • GHL: the global history length
  • GHR: a global history shift register
  • GA: a global array of previous branch addresses
  • W: an n × m × (GHL + 1) array of small integers

31
The Algorithm Making a Prediction
Weights are selected based on the current branch
and the ith most recent branch
32
The Algorithm Training
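
Prediction and training can be sketched together using the parameters from the previous slide. This is an illustrative software model with assumed table dimensions n and m and an assumed perceptron-style threshold, not the hardware algorithm verbatim:

```python
# Piecewise linear branch prediction [ISCA 2005], simplified: the weight
# for history position i is selected by BOTH the current branch address
# and the address of the ith most recent branch (GA[i]).

GHL = 8                      # global history length (assumed)
N, M = 64, 64                # weight-table dimensions (assumed)
THRESHOLD = int(2.14 * (GHL + 1) + 20.58)   # assumed threshold formula

W = [[[0] * (GHL + 1) for _ in range(M)] for _ in range(N)]
GHR = [1] * GHL              # bipolar outcomes: +1 taken, -1 not taken
GA = [0] * GHL               # addresses of the GHL most recent branches

def output(pc):
    a = pc % N
    y = W[a][0][0]           # bias weight
    for i in range(GHL):
        # Weight chosen by current branch AND ith most recent branch
        y += W[a][GA[i] % M][i + 1] * GHR[i]
    return y

def predict(pc):
    return output(pc) >= 0

def train(pc, taken):
    y = output(pc)
    t = 1 if taken else -1
    if (y >= 0) != taken or abs(y) <= THRESHOLD:
        a = pc % N
        W[a][0][0] += t
        for i in range(GHL):
            W[a][GA[i] % M][i + 1] += t * GHR[i]
    # Shift the outcome and the branch address into the global structures
    GHR.insert(0, t); GHR.pop()
    GA.insert(0, pc); GA.pop()
```

Because the weights used for a prediction depend on the path recorded in GA, each distinct path gets its own linear function, and the union of those functions is the piecewise linear decision surface of the next slide.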
33
Why It's Better
  • Forms a piecewise linear decision surface
  • Each piece determined by the path to the
    predicted branch
  • Can solve more problems than perceptron

Perceptron decision surface for XOR doesn't
classify all inputs correctly
Piecewise linear decision surface for
XOR classifies all inputs correctly
34
Learning XOR
  • From a program that computes XOR using if
    statements

perceptron prediction
piecewise linear prediction
35
Methodology
  • Aggressive out-of-order processor simulated
  • 15 SPEC CPU integer benchmarks from 2000 and 95
    simulated
  • Highly accurate predictors from previous work
    tuned for this workload

36
Results Accuracy
  • Practical piecewise linear predictor approaches
    accuracy of an unlimited ideal predictor, 16%
    better than previous work

37
Results Performance
  • Harmonic mean normalized IPC improves by 8% over
    previous work

38
Future For This Work
  • I still want more accuracy
  • Let's quantify the benefit of neural prediction
    with respect to power and energy
  • Not clear; perceptrons are like multipliers in
    terms of energy consumption
  • Still, energy saved by running faster might
    offset energy consumed by predictor
  • Can the ideas from neural branch prediction be
    applied to other domains, e.g. value prediction?

39
Conclusion
  • PHT partitioning provides a real speedup for
    today's CPUs
  • Simple technique
  • But it would be nice to have detailed information
    about the predictor
  • Neural prediction could be incorporated into
    future CPUs
  • Latency problem is diminished
  • Accuracy is very good
  • Power and energy need to be researched
  • Acknowledgments
  • NSF, CCR-0311091 for piecewise linear branch
    prediction
  • Ministerio de Educación y Ciencia (Spanish
    Ministry of Education and Science), SB2003-0357
    for both

40
The End