Title: CSE 420/598 Computer Architecture Lec 10 Chapter 2 DynPred-BTB
1. CSE 420/598 Computer Architecture Lec 10
Chapter 2 - DynPred-BTB
- Sandeep K. S. Gupta
- School of Computing and Informatics
- Arizona State University
Based on Slides by David Patterson
2. Agenda
- Dynamic Branch Prediction (Review)
- BTB
3. Applying the Prediction
- The earliest time we can begin using the prediction is when
  - the prediction bits are available
  - the branch target is available
- The earliest time we can know whether we have predicted correctly is when
  - the branch condition is resolved
- The difference between these times is roughly what is saved by a correct prediction
  - If the branch target is available late, the window of savings is reduced
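The timing argument above can be made concrete with a small sketch. The function name and the cycle numbers are hypothetical, chosen only to illustrate the point; they are not taken from the slides.

```python
def savings_window(pred_bits_ready, target_ready, condition_resolved):
    """Cycles saved by a correct prediction (0 if it arrives too late)."""
    # The prediction is usable only once BOTH the direction bits
    # and the branch target are available.
    usable_at = max(pred_bits_ready, target_ready)
    return max(0, condition_resolved - usable_at)

# Hypothetical cycle numbers, counted from the fetch of the branch.
early_target = savings_window(pred_bits_ready=1, target_ready=1, condition_resolved=3)
late_target = savings_window(pred_bits_ready=1, target_ready=2, condition_resolved=3)
print(early_target, late_target)
```

With these assumed numbers, delaying the target by one cycle shrinks the window by one cycle, which is why getting the target early matters.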
4. Correlating Predictors
- The prediction is a function of the last k branch outcomes
- The branch history buffer is indexed by
  - m bits taken from address of branch
  - k bits of branch history
  - i.e., m + k bits all told
- Each entry in the branch history buffer has q bits (i.e., is a q-bit predictor)
- The branch history buffer has 2^(m+k) × q bits of storage
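A minimal sketch of the indexing and storage arithmetic above. The function names, the choice to drop two byte-offset bits (assuming 4-byte instructions), and the sample address are assumptions made for illustration.

```python
def bhb_index(branch_addr, history, m, k):
    """Concatenate m address bits with k history bits into an (m+k)-bit index."""
    # Assumption: 4-byte instructions, so the two low address bits carry no information.
    addr_bits = (branch_addr >> 2) & ((1 << m) - 1)
    hist_bits = history & ((1 << k) - 1)
    return (addr_bits << k) | hist_bits

def bhb_storage_bits(m, k, q):
    """Total storage: 2^(m+k) entries of q bits each."""
    return (1 << (m + k)) * q

storage = bhb_storage_bits(m=10, k=2, q=2)
idx = bhb_index(branch_addr=0x4000_1234, history=0b01, m=10, k=2)
print(storage, idx)
```

For m = 10, k = 2, q = 2 this gives 2^12 × 2 = 8192 bits (8K bits).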
5. Correlating predictor with 2 history bits and 2 state bits (2,2)
6. Local versus Global
7. Hashing Correlation
For the same amount of table storage, we can get
better associativity in the case of fewer
branches but highly correlated behavior.
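One common way to hash address and history into a single index is to XOR them (often called gshare). Whether this is the exact scheme the slide's figure shows is an assumption; the function name and sample values below are hypothetical.

```python
def hashed_index(branch_addr, history, n):
    """Fold address bits and history bits into one n-bit index by XOR."""
    mask = (1 << n) - 1
    # Assumption: 4-byte instructions, so the two low address bits are dropped.
    return ((branch_addr >> 2) ^ history) & mask

# With a 2^12-entry table, all 12 index bits see both address and history,
# instead of splitting the index into separate address and history fields.
hashed = hashed_index(branch_addr=0x4000_1234, history=0b1010_1010_1010, n=12)
print(hex(hashed))
```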
8. Tournament Predictor
- Move toward the other predictor when
  - I am wrong, and
  - he is right
- Stay put when I am right and he is right, or I am wrong and he is wrong.
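The update rule above can be sketched as a saturating selector counter. The 2-bit width, the encoding (0-1 favors predictor 1, 2-3 favors predictor 2), and the function names are assumptions for illustration.

```python
def update_selector(counter, p1_correct, p2_correct):
    """2-bit saturating selector: 0-1 favor predictor 1, 2-3 favor predictor 2."""
    if p1_correct and not p2_correct:
        return max(0, counter - 1)   # move toward predictor 1
    if p2_correct and not p1_correct:
        return min(3, counter + 1)   # move toward predictor 2
    return counter                   # both right or both wrong: stay put

def choose(counter, p1_prediction, p2_prediction):
    """Pick the prediction of whichever predictor the selector currently favors."""
    return p2_prediction if counter >= 2 else p1_prediction

# Hypothetical outcome stream: (predictor 1 correct?, predictor 2 correct?)
sel = 0
for p1_ok, p2_ok in [(False, True), (False, True), (True, True),
                     (False, False), (True, False)]:
    sel = update_selector(sel, p1_ok, p2_ok)
print(sel)
```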
9. Tournament predictor local vs global
10. Alpha 21264 Branch Predictor
- Tournament predictor (4K x 2) chooses between global and local
- Global has 4K 2-bit entries indexed by last 12 branch outcomes XORed with address
- Local is also a two-level predictor
  - 1K x 10 branch history buffer (last 10 outcomes for indexed branch) indexed by address
  - The selected 10-bit history is XORed with address to index a table of 3-bit entries
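Adding up the table sizes listed above gives a rough storage budget. The sketch assumes the table of 3-bit entries has 1K entries (because it is indexed by a 10-bit value); the slide does not state that size explicitly.

```python
K = 1024

# Table sizes in bits, as listed on the slide.
tables = {
    "selector (4K x 2)": 4 * K * 2,
    "global (4K x 2)": 4 * K * 2,
    "local history (1K x 10)": 1 * K * 10,
    "local counters (1K x 3, size assumed)": 1 * K * 3,
}
total_bits = sum(tables.values())
print(total_bits, total_bits / K)
```

Under that assumption the total is 29K bits, in line with the roughly 30K-bit figure quoted for tournament predictors in the summary slide.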
11. Alpha 21264 Predictor
12. Branch Target Buffers (BTB) or Caches (BTC)
- Branch target calculation is costly and stalls the instruction fetch.
- To reduce the branch penalty
  - need to know what the address is by the end of IF
  - but the instruction isn't even decoded yet
  - so we have to wait a cycle and perhaps get a branch (penalty 1 for MIPS)
  - so use the branch instruction address to predict the branch target
  - if prediction works then penalty goes to 0!
13. BTB - Idea
- BTB stores PCs the same way as caches
- Only PCs of predicted taken branches are stored (no need to store untaken)
- The match tag is the PC (associative memory OK if it's small)
- The data field is the predicted PC
- The PC of a (potential) branch is sent to the BTB
- When a match is found the corresponding predicted PC is returned
- If PC not in table, it is taken to mean
  - either not a branch
  - or not predicted taken
  - in either case, continue fetching from PC + k (k = 4 for MIPS)
- If the branch was predicted taken, instruction fetch continues at the returned predicted PC
- BTB gets us the branch target address early
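The lookup described above can be sketched as follows. The class and method names are hypothetical. A Python dict stands in for a small fully associative buffer with no capacity limit, and removing an entry when a branch resolves not taken is one possible update policy, not necessarily the one on the slides.

```python
class BranchTargetBuffer:
    def __init__(self, instr_bytes=4):
        self.entries = {}               # tag (branch PC) -> predicted target PC
        self.instr_bytes = instr_bytes  # k = 4 for MIPS

    def next_fetch_pc(self, pc):
        """Return (next PC to fetch, predicted_taken) for the instruction at pc."""
        if pc in self.entries:
            return self.entries[pc], True
        # Miss: either not a branch or not predicted taken, so fall through.
        return pc + self.instr_bytes, False

    def update(self, pc, taken, target):
        """Called when the branch resolves; keep only predicted-taken branches."""
        if taken:
            self.entries[pc] = target
        else:
            self.entries.pop(pc, None)

btb = BranchTargetBuffer()
first = btb.next_fetch_pc(0x1000)              # miss: fall through to PC + 4
btb.update(0x1000, taken=True, target=0x2000)
second = btb.next_fetch_pc(0x1000)             # hit: fetch from predicted target
btb.update(0x1000, taken=False, target=0x2000)
third = btb.next_fetch_pc(0x1000)              # entry removed: fall through again
```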
14. Branch Target Buffers
15. Changes in MIPS to incorporate BTB
16. Penalties Using BTB in MIPS
- Note
  - Misprediction penalties on more complex machines are much higher
17. Questions Concerning BTBs
- Can BTB be combined with branch prediction machinery introduced earlier in this lecture? How?
- What kind of branches can a BTB accelerate that are out of the reach of ordinary branch predictors?
18. BTB coupled with BHT
19. Improvements
- Store instructions rather than target address
  - increases entry size but removes Ifetch time
  - permits BTB to run slower and therefore be larger
  - permits branch folding: branches effectively disappear
    - branch job is to change PC and get the real instruction
    - if you have the instruction then the branch isn't there (folded out of the way)
    - result is 0-cycle jumps and effectively 0-cycle properly predicted branches
  - however, branches must be checked
    - in a parallel path the branch must be fetched and checked to see if the prediction is true
- Predicting indirect jumps
  - major source is procedure return
  - obvious model is to use a stack as the return predictor
  - note this can be combined with the above to get jump folding
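The stack-based return predictor mentioned above can be sketched as follows. The class name, the depth of 8, and the drop-oldest overflow policy are assumptions. The return address is taken as call PC + 4, which ignores MIPS branch delay slots for simplicity.

```python
class ReturnAddressStack:
    def __init__(self, depth=8):
        self.depth = depth
        self.stack = []

    def on_call(self, call_pc, instr_bytes=4):
        """Push the address of the instruction after the call."""
        if len(self.stack) == self.depth:
            self.stack.pop(0)           # overflow: drop the oldest entry
        self.stack.append(call_pc + instr_bytes)

    def predict_return(self):
        """Pop the predicted return target, or None if the stack is empty."""
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack()
ras.on_call(0x1000)          # caller invokes f
ras.on_call(0x2000)          # f invokes g
r1 = ras.predict_return()    # return from g
r2 = ras.predict_return()    # return from f
r3 = ras.predict_return()    # stack empty: no prediction
```

Because calls and returns nest, the most recent call is the one returned from first, which is why a stack fits better than a table indexed by the return instruction's PC.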
20. Dynamic Branch Prediction Summary
- Prediction becoming important part of execution
- Branch History Table: 2 bits for loop accuracy
- Correlation: Recently executed branches correlated with next branch
  - Either different branches (GA)
  - Or different executions of same branches (PA)
- Tournament predictors take insight to next level, by using multiple predictors
  - usually one based on global information and one based on local information, and combining them with a selector
  - In 2006, tournament predictors using ≈ 30K bits are in processors like the Power5 and Pentium 4
- Branch Target Buffer: include branch address and prediction
- Next Class: Dynamic Scheduling