Title: COMP 740: Computer Architecture and Implementation
1. COMP 740: Computer Architecture and Implementation
- Montek Singh
- Thu, Feb 19, 2009
- Topic: Instruction-Level Parallelism III (Dynamic Branch Prediction)
2. Why Do We Need Branch Prediction?
- Basic blocks are short, and we have done about all we can do for them with dynamic scheduling
  - control dependences now become the bottleneck
- Since branches disrupt the sequential flow of instrs
  - we need to be able to predict branch behavior to avoid stalling the pipeline
- What we must predict
  - Branch outcome (Is the branch taken?)
  - Branch Target Address (What is the next non-sequential PC value?)
3. A General Model of Branch Prediction
[Figure: Branch predictor accuracy]
[Figure: Branch penalties]
- T = probability of a branch being taken
- p = fraction of branches that are predicted to be taken
- A = accuracy of prediction
- j, k, m, n = associated delays (penalties) for the four events (n is usually 0)
[Equation: Branch penalty of a particular prediction method]
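As a rough numerical sketch of this model: T, p and A, together with the fact that the four event probabilities sum to 1, are enough to determine the probability of each (prediction, outcome) pair. The mapping of j, k, m, n to particular events is an assumption here, not something the slide text confirms, and the function name `branch_penalty` is made up for illustration.

```python
def branch_penalty(T, p, A, j, k, m, n=0.0):
    """Expected penalty (cycles per branch) under an ASSUMED event mapping.

    Assumed mapping (not confirmed by the slides):
      j: predicted taken,     actually taken
      k: predicted taken,     actually not taken
      m: predicted not taken, actually taken
      n: predicted not taken, actually not taken
    """
    # Solve for the joint probabilities from
    #   P(actually taken) = T, P(predicted taken) = p, P(correct) = A, total = 1.
    pt_t = (p + T + A - 1.0) / 2.0   # predicted taken, taken
    pt_nt = p - pt_t                 # predicted taken, not taken
    pnt_t = T - pt_t                 # predicted not taken, taken
    pnt_nt = A - pt_t                # predicted not taken, not taken
    if min(pt_t, pt_nt, pnt_t, pnt_nt) < -1e-9:
        raise ValueError("T, p and A are not mutually consistent")
    return j * pt_t + k * pt_nt + m * pnt_t + n * pnt_nt


# Perfect prediction (A = 1, hence p = T) with n = 0 reduces to j * T.
print(branch_penalty(T=0.6, p=0.6, A=1.0, j=2, k=3, m=3))
```

Under this mapping, only the j term survives when A = 1 and n = 0, which is why the best case depends on j and T alone.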
4. Theoretical Limits of Branch Prediction
- Best case: branches are perfectly predicted (A = 1)
  - also assume that n = 0
  - minimum branch penalty = jT
- Let s be the pipeline stage where the BTA becomes known
  - Then j = s - 1
  - See static prediction methods in Lecture 7
- Thus, the performance of any branch prediction strategy is limited by
  - s, the location of the pipeline stage that develops the BTA
  - A, the accuracy of the prediction
5. Review: Static Branch Prediction Methods
- Several static prediction strategies
  - Predict all branches as NOT TAKEN
  - Predict all branches as TAKEN
  - Predict all branches with certain opcodes as TAKEN, and all others as NOT TAKEN
  - Predict all forward branches as NOT TAKEN, and all backward branches as TAKEN
  - Opcodes have default predictions, which the compiler may reverse by setting a bit in the instruction
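A minimal sketch of three of these strategies, written as functions of the branch address and its target. The trace, the addresses and the helper name `accuracy` are hypothetical; the opcode-based and compiler-hint strategies are left out because they depend on an instruction encoding the slides do not give.

```python
def always_not_taken(pc, target):
    return False

def always_taken(pc, target):
    return True

def backward_taken(pc, target):
    # Backward branch (target at a lower address, e.g. a loop) -> TAKEN;
    # forward branch -> NOT TAKEN.
    return target < pc

def accuracy(predict, trace):
    """Fraction of branches in `trace` whose outcome `predict` gets right."""
    hits = sum(predict(pc, target) == taken for pc, target, taken in trace)
    return hits / len(trace)

# Hypothetical trace: a loop-closing branch at 0x40 jumping back to 0x20,
# taken 9 times and then falling through once.
loop_trace = [(0x40, 0x20, True)] * 9 + [(0x40, 0x20, False)]

for strategy in (always_not_taken, always_taken, backward_taken):
    print(strategy.__name__, accuracy(strategy, loop_trace))
```

On this trace the backward-taken rule is right 9 times out of 10, the same as always-taken, while always-not-taken is right only once.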
6. Dynamic Branch Prediction
- Premise: the history of a branch instr's outcome matters!
  - whether a branch will be taken depends greatly on the way previous dynamic instances of the same branch were decided
- Dynamic prediction methods
  - take advantage of this fact by making their predictions dependent on the past behavior of the same branch instr
  - such methods are called Branch History Table (BHT) methods
7. BHT Methods for Branch Prediction
8. A One-Bit Predictor
State 0: Predict Not Taken
State 1: Predict Taken
- Predictor misses twice on typical loop branches
  - Once at the end of the loop
  - Once at the end of the 1st iteration of the next execution of the loop
- The outcome sequence NT-T-NT-T makes it miss all the time
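The two claims above can be checked with a small simulation. This is a sketch in which the predictor's state is simply the last outcome of the branch; the function name and the example loop (4 iterations, run 3 times) are made up for illustration.

```python
def one_bit_misses(outcomes, state=False):
    """Count mispredictions of a 1-bit predictor.

    `state` is the current prediction (False = state 0 = not taken,
    True = state 1 = taken); after each branch it becomes the last outcome.
    """
    misses = 0
    for taken in outcomes:
        if taken != state:
            misses += 1
        state = taken
    return misses

T, NT = True, False

# Loop-closing branch: taken 3 times, then not taken on exit; loop run 3 times.
loop = ([T] * 3 + [NT]) * 3
print(one_bit_misses(loop))                      # 6: two misses per execution

# Alternating outcomes, starting from the state that predicts taken.
print(one_bit_misses([NT, T] * 4, state=True))   # 8: every prediction is wrong
```

The alternating sequence misses every time only when the initial state disagrees with the first outcome; starting from state 0, the first NT is predicted correctly and every later branch is mispredicted.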
9. A Two-Bit Predictor
- A four-state Moore machine
- Predictor misses once on typical loop branches
  - hence popular
- The outcome sequence NT-NT-T-T-NT-NT-T-T makes it miss all the time
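The state diagram for this machine is not reproduced in the text, so the transition table below is an assumption: it is one commonly drawn two-bit scheme in which a misprediction in a "weak" state jumps to the opposite "strong" state. It is chosen here because it reproduces both claims above; the state numbering and names are illustrative.

```python
# State encoding (assumed): 3 = strongly taken,     2 = weakly taken,
#                           1 = weakly not taken,   0 = strongly not taken.
# NEXT[state] = (next state if not taken, next state if taken)
NEXT = {
    3: (2, 3),
    2: (0, 3),   # a miss in a weak state jumps to the opposite strong state
    1: (0, 3),
    0: (0, 1),
}

def two_bit_misses(outcomes, state=3):
    """Count mispredictions; the prediction depends on the state only (Moore)."""
    misses = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            misses += 1
        state = NEXT[state][int(taken)]
    return misses

T, NT = True, False

# Loop-closing branch: taken 3 times, then not taken on exit; loop run 3 times.
loop = ([T] * 3 + [NT]) * 3
print(two_bit_misses(loop))                    # 3: one miss per execution

# The pathological sequence, starting in the strongly-taken state.
print(two_bit_misses([NT, NT, T, T] * 2))      # 8: every prediction is wrong
```

A plain saturating up/down counter, another common two-bit scheme, also misses once per loop execution but cannot be wrong on every branch of the NT-NT-T-T pattern, so the worst-case sequence depends on which state machine is used.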
11. Correlating Branch Outcome Predictors
- The history-based branch predictors seen so far base their predictions on the past history of the branch that is being predicted
- A completely different idea
  - The outcome of a branch may well be predicted successfully based on the outcomes of the last k branches executed
  - i.e., the path leading to the branch being predicted
- Much-quoted example from the SPEC92 benchmark eqntott

    if (aa == 2)    /* b1 */
        aa = 0;
    if (bb == 2)    /* b2 */
        bb = 0;
    if (aa != bb)   /* b3 */

TAKEN(b1) and TAKEN(b2) implies NOT-TAKEN(b3): if both conditions hold, aa and bb are both set to 0, so aa != bb is false
12. Another Example of Branch Correlation

    if (d == 0)    // b1
        d = 1;
    if (d == 1)    // b2
        ...

- Assume multiple runs of the code fragment
  - d alternates between 2 and 0
- How would a 1-bit predictor initialized to state 0 behave?

          BNEZ R1, L1       ; b1: branch if d != 0
          ADDI R1, R0, 1    ; d == 0, so d = 1
    L1:   SUBI R3, R1, 1
          BNEZ R3, L2       ; b2: branch if d != 1
    L2:
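One way to answer the question is to simulate it. The sketch below models the two BNEZ branches of the compiled fragment (each is taken when the corresponding C condition is false) and gives each branch its own 1-bit predictor initialized to state 0; the function names are made up for illustration.

```python
def run_fragment(d):
    """Outcomes (b1_taken, b2_taken) of the two BNEZ branches for one run."""
    b1_taken = d != 0          # BNEZ R1, L1
    if d == 0:
        d = 1
    b2_taken = d != 1          # BNEZ R3, L2 with R3 = d - 1
    return b1_taken, b2_taken

def one_bit_misses_per_branch(d_values):
    state = {"b1": False, "b2": False}   # both predictors start in state 0
    misses = {"b1": 0, "b2": 0}
    for d in d_values:
        for name, taken in zip(("b1", "b2"), run_fragment(d)):
            if state[name] != taken:
                misses[name] += 1
            state[name] = taken          # remember only the last outcome
    return misses

print(one_bit_misses_per_branch([2, 0, 2, 0]))   # {'b1': 4, 'b2': 4}
```

With d alternating between 2 and 0, both branches alternate taken / not taken, so each 1-bit predictor is wrong on every single branch.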
13. A Correlating Branch Predictor
- Think of having a pair of 1-bit predictors p0, p1 for each branch, where we choose between predictors (and update them) based on the outcome of the most recent branch (i.e., B1 for B2, and B2 for B1)
  - If the most recent branch was not taken, use and update (if needed) predictor p0
  - If the most recent branch was taken, use and update (if needed) predictor p1
- How would such (1,1) correlating predictors behave if initialized to 0,0?
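A sketch of the (1,1) scheme on the code fragment where d alternates between 2 and 0. It assumes the global history starts out as "not taken" (the slide does not say), and the function names are made up for illustration.

```python
def run_fragment(d):
    """Outcomes (b1_taken, b2_taken) of the two BNEZ branches for one run."""
    b1_taken = d != 0
    if d == 0:
        d = 1
    b2_taken = d != 1
    return b1_taken, b2_taken

def correlating_misses(d_values):
    # predictors[branch][h] is the 1-bit predictor used when the most recent
    # branch outcome was h (index 0 = p0, index 1 = p1); all start at 0.
    predictors = {"b1": [False, False], "b2": [False, False]}
    last_taken = False      # ASSUMPTION: initial history is "not taken"
    misses = 0
    for d in d_values:
        for name, taken in zip(("b1", "b2"), run_fragment(d)):
            slot = int(last_taken)
            if predictors[name][slot] != taken:
                misses += 1
            predictors[name][slot] = taken   # update only the selected predictor
            last_taken = taken
    return misses

print(correlating_misses([2, 0, 2, 0, 2, 0]))   # 2: both in the first run
```

After the first run, each branch has one predictor trained for "previous branch taken" and one for "previous branch not taken", and the alternating pattern is predicted correctly from then on.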
14. Organization of (m,n) Correlating Predictor
- Using the results of the last m branches
  - 2^m outcomes
  - can be kept in an m-bit shift register
- n-bit self-history predictor
- BHT addressed using
  - m bits of global history
    - select column (particular predictor)
  - some lower bits of branch address
    - select row (particular branch instr)
  - entry holds n previous outcomes
- Aliasing can occur since the BHT uses only a portion of the branch instr address
  - state in the various predictors in a single row may correspond to different branches at different points in time
- m = 0 is an ordinary BHT
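The organization above can be sketched as a small table-driven predictor. Several details are assumptions made for illustration: each entry is an n-bit saturating counter, the row index drops two byte-offset bits (4-byte instrs), and the table has 16 rows; the class and function names are hypothetical.

```python
class CorrelatingPredictor:
    """(m, n) predictor: m bits of global history, n-bit counters per entry."""

    def __init__(self, m, n, rows=16):
        self.m, self.n, self.rows = m, n, rows
        self.history = 0                       # m-bit global shift register
        self.max_count = (1 << n) - 1
        # rows selected by branch address, 2**m columns selected by history
        self.table = [[0] * (1 << m) for _ in range(rows)]

    def _row(self, pc):
        # ASSUMPTION: 4-byte instrs, so drop 2 offset bits and keep low bits.
        return (pc >> 2) % self.rows

    def predict(self, pc):
        counter = self.table[self._row(pc)][self.history]
        return counter > self.max_count // 2   # upper half of range -> taken

    def update(self, pc, taken):
        row, col = self._row(pc), self.history
        counter = self.table[row][col]
        if taken:
            counter = min(counter + 1, self.max_count)
        else:
            counter = max(counter - 1, 0)
        self.table[row][col] = counter
        # shift the outcome into the global history, keeping only m bits
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)


def count_misses(predictor, trace):
    misses = 0
    for pc, taken in trace:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


# b1 and b2 from the example where d alternates between 2 and 0,
# placed at hypothetical addresses.
B1, B2 = 0x10, 0x1C
trace = [(B1, True), (B2, True), (B1, False), (B2, False)] * 3
print(count_misses(CorrelatingPredictor(m=0, n=1), trace))   # 12: ordinary BHT
print(count_misses(CorrelatingPredictor(m=1, n=1), trace))   # 2: (1,1) predictor
```

Two branches whose addresses agree in the low bits used by `_row` would share a row, which is the aliasing the slide describes.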
15. Improved Dynamic Branch Prediction
- Recall that, even with perfect accuracy of prediction, the branch penalty of a prediction method is (s-1)T
  - s is the pipeline stage where the BTA is developed
  - T is the frequency of taken branches
- Further improvements can be obtained only by using a cache storing BTAs, and accessing it simultaneously with the I-cache
  - Such a cache is called a Branch Target Buffer (BTB)
- BHT and BTB can be used together
  - Coupled: one table holds all the information
  - Uncoupled: two independent tables
16. Using BTB and BHT Together
- Uncoupled solution
  - BTB stores only the BTAs of taken branches recently executed
  - No separate branch outcome prediction (the presence of an entry in the BTB can be used as an implicit prediction of the branch being TAKEN next time)
  - Use the BHT in case of a BTB miss
- Coupled solution
  - Stores BTAs of all branches recently executed
  - Has a separate branch outcome prediction for each table entry
  - Use the BHT in case of a BTB hit
  - Predict NOT TAKEN otherwise
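The two lookup policies can be sketched as pure functions over table contents. This covers only the prediction step; how entries are inserted, updated and evicted is not modeled. The dictionaries stand in for the hardware tables, the BHT is reduced to one boolean per branch, and a branch missing from the uncoupled BHT is assumed to be predicted not taken. All names are hypothetical.

```python
from typing import Dict, Optional, Tuple

# (predict taken?, predicted target address or None if not available at fetch)
Prediction = Tuple[bool, Optional[int]]

def predict_uncoupled(pc: int,
                      btb: Dict[int, int],
                      bht: Dict[int, bool]) -> Prediction:
    """Uncoupled: the BTB holds targets of recently taken branches only."""
    if pc in btb:
        return True, btb[pc]          # BTB hit is an implicit TAKEN prediction
    # BTB miss: fall back to the separate BHT; no target is cached.
    # ASSUMPTION: a branch unknown to the BHT is predicted not taken.
    return bht.get(pc, False), None

def predict_coupled(pc: int,
                    btb: Dict[int, Tuple[int, bool]]) -> Prediction:
    """Coupled: each BTB entry carries its own outcome prediction."""
    if pc in btb:
        target, predicted_taken = btb[pc]
        return predicted_taken, (target if predicted_taken else None)
    return False, None                # BTB miss: predict NOT TAKEN

# Hypothetical table contents.
print(predict_uncoupled(0x40, btb={0x40: 0x20}, bht={}))          # (True, 32)
print(predict_coupled(0x60, btb={0x60: (0x90, False)}))           # (False, None)
```

In the uncoupled case a taken prediction can come from the BHT without a cached target, in which case the target still has to be computed in the pipeline.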
17. Parameters of Real Machines
18. Coupled BTB and BHT
19. Decoupled BTB and BHT
20. Reducing Misprediction Penalties
- Need to recover whenever a branch prediction is not correct
  - Discard all speculatively executed instructions
  - Resume execution along the alternative path (this is the costly step)
- Scenarios where recovery is needed
  - Predict taken, branch is taken, BTA wrong (case 7)
  - Predict taken, branch is not taken (cases 4 and 6)
  - Predict not taken, branch is taken (case 3)
- Preparing for recovery involves working on the alternative path
  - On instruction level
    - Two fetch address registers per speculated branch (PPC 603 640)
    - Two instruction buffers (IBM 360/91, SuperSPARC, Pentium)
  - On I-cache level
    - For PT (predict taken), also do next-line prefetching
    - For PNT (predict not taken), also do target-line prefetching
21. Predicting Dynamic BTAs
- The vast majority of dynamic BTAs come from procedure returns (85% for SPEC95)
- Since procedure call-return for the most part follows a stack discipline, a specialized return address buffer operated as a stack (a return address stack, RAS) is appropriate for high prediction accuracy
  - Pushes return address on call
  - Pops return address on return
  - Depth of the RAS should be as large as the maximum call depth to avoid mispredictions
  - 8-16 elements generally sufficient
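A minimal sketch of such a return address stack. The overflow policy (drop the oldest entry) and the behavior on an empty stack (no prediction) are assumptions, as the slides do not specify them; the class name and addresses are hypothetical.

```python
class ReturnAddressStack:
    """Fixed-depth stack of predicted return addresses."""

    def __init__(self, depth=8):
        self.depth = depth
        self.entries = []

    def push(self, return_addr):
        """On a call: remember where the matching return should go."""
        if len(self.entries) == self.depth:
            self.entries.pop(0)        # ASSUMPTION: overflow drops the oldest
        self.entries.append(return_addr)

    def pop(self):
        """On a return: predicted target, or None if the stack is empty."""
        return self.entries.pop() if self.entries else None


# Call depth 3 on a stack of depth 2: the outermost return cannot be predicted.
ras = ReturnAddressStack(depth=2)
for addr in (0x100, 0x200, 0x300):
    ras.push(addr)
print([ras.pop() for _ in range(3)])   # [768, 512, None]
```

The last pop returns nothing because the outermost return address was pushed out of the stack, which is the misprediction that a sufficiently deep stack avoids.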