CSE 420/598 Computer Architecture Lec 10 Chapter 2 - DynPred-BTB

1
CSE 420/598 Computer Architecture Lec 10
Chapter 2 - DynPred-BTB

Sandeep K. S. Gupta
School of Computing and Informatics
Arizona State University

Based on Slides by David Patterson
2
Agenda

Dynamic Branch Prediction (Review)
BTB

3
Applying the Prediction

The earliest time we can begin using the
prediction is when
the prediction bits are available
the branch target is available
The earliest time we can know whether we have
predicted correctly is when
the branch condition is resolved
The difference between these times is roughly
what is saved by a correct prediction
If the branch target is available late, the
window of savings is reduced

4
Correlating Predictors

The prediction is a function of the last k branch
outcomes
The branch history buffer is indexed by
m bits taken from address of branch
k bits of branch history
i.e., m + k bits all told
Each entry in the branch history buffer has q
bits (i.e., is a q-bit predictor)
The branch history buffer has 2^(m+k) × q bits of
storage
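The indexing and storage rule above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the class name, the choice of one global history register, the concatenation of address bits with history bits, the 4-byte instruction size, and the "weakly not taken" starting state are all assumptions made for the example.

```python
class CorrelatingPredictor:
    """Correlating predictor sketch: 2^(m+k) saturating counters of q bits each."""

    def __init__(self, m, k, q=2):
        self.m, self.k, self.q = m, k, q
        self.max_count = (1 << q) - 1
        self.history = 0  # last k branch outcomes, 1 = taken (assumed global)
        # Assumption: every counter starts just below the "taken" threshold.
        self.table = [(1 << (q - 1)) - 1] * (1 << (m + k))

    def storage_bits(self):
        # 2^(m+k) entries, q bits per entry.
        return (1 << (self.m + self.k)) * self.q

    def _index(self, pc):
        # Drop the two byte-offset bits (assumes 4-byte instructions),
        # keep m address bits, then append the k history bits.
        addr_bits = (pc >> 2) & ((1 << self.m) - 1)
        return (addr_bits << self.k) | self.history

    def predict(self, pc):
        # Predict taken when the counter is in its upper half.
        return self.table[self._index(pc)] >= (1 << (self.q - 1))

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(self.table[i] + 1, self.max_count)
        else:
            self.table[i] = max(self.table[i] - 1, 0)
        if self.k:
            mask = (1 << self.k) - 1
            self.history = ((self.history << 1) | int(taken)) & mask
```

With m = 4, k = 2 and q = 2 the table holds 2^6 × 2 = 128 bits. A branch that strictly alternates taken/not taken is learned after a short warm-up, because the two history patterns it produces select two different counters.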

5
Correlating predictor with 2 history bits and 2
state bits (2,2)
6
Local versus Global
7
Hashing Correlation
For the same amount of table storage, we can get
better associativity in the case of fewer
branches but highly correlated behavior.
8
Tournament Predictor

Move toward the other predictor when I am wrong
and he is right
Stay put when I am right and he is right, or I am
wrong and he is wrong.
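The selector rule above can be written as a small update function. This is a sketch under assumptions: a 2-bit saturating counter is assumed for the selector, the encoding (0-1 favors predictor 1, 2-3 favors predictor 2) is arbitrary, and the function names are hypothetical.

```python
def update_selector(counter, p1_correct, p2_correct):
    """Return the new value of a 2-bit selector counter (0..3).

    Assumed encoding: 0-1 select predictor 1, 2-3 select predictor 2.
    """
    if p1_correct and not p2_correct:
        return max(counter - 1, 0)  # move toward predictor 1
    if p2_correct and not p1_correct:
        return min(counter + 1, 3)  # move toward predictor 2
    return counter                  # both right or both wrong: stay put


def selected(counter):
    """Which predictor (1 or 2) the selector currently trusts."""
    return 2 if counter >= 2 else 1
```

Starting from 0, predictor 2 has to be the only correct one twice in a row before the selector switches to it, which gives the selector the same hysteresis as a 2-bit branch counter.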

9
Tournament predictor local vs global
10
Alpha 21264 Branch Predictor

Tournament predictor (4K x 2) chooses between
global and local
Global has 4K 2-bit entries indexed by last 12
branch outcomes XORed with address
Local is also a two-level predictor
1K x 10 branch history buffer (last 10 outcomes
for indexed branch) indexed by address
The selected 10-bit history is XORed with address
to index a table of 3-bit entries
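As a quick check on the sizes listed above, the storage can be added up. One figure is an assumption: the slide does not state how many 3-bit entries the local table has, so the sketch takes 1K entries because a 10-bit index selects among 2^10 locations.

```python
K = 1024

selector_bits = 4 * K * 2           # 4K x 2 tournament selector
global_bits = 4 * K * 2             # 4K 2-bit global entries
local_history_bits = 1 * K * 10     # 1K x 10 local branch history buffer
local_counter_bits = (1 << 10) * 3  # assumed 1K entries of 3 bits each

total_bits = (selector_bits + global_bits
              + local_history_bits + local_counter_bits)
print(total_bits, total_bits // K)
```

Under that assumption the total is 29K bits, i.e. 29,696 bits.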

11
Alpha 21264 Predictor
12
Branch Target Buffers (BTB) or Caches (BTC)

Branch target calculation is costly and stalls
the instruction fetch.
To reduce the branch penalty
need to know what the address is by the end of IF
but the instruction isn't even decoded yet
so we have to wait a cycle to find out whether it
is a branch (penalty = 1 cycle for MIPS)
so use the branch instruction address
to predict the branch target
if prediction works then penalty goes to 0!

13
BTB - Idea

BTB stores PCs the same way as caches
Only PCs of predicted taken branches are stored
(no need to store untaken)
The match tag is the PC (associative memory OK if
it's small)
The data field is the predicted PC
The PC of a (potential) branch is sent to the BTB
When a match is found the corresponding Predicted
PC is returned
If PC not in table, it is taken to mean
either not a branch
or not predicted taken
in either case, continue fetching from PC + k (k =
4 for MIPS)
If the branch was predicted taken, instruction
fetch continues at the returned predicted PC
BTB gets us the branch target address early
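The lookup described above can be sketched as follows. This is an illustration only: a Python dict stands in for the small associative memory, capacity and replacement are ignored, and the class and method names are hypothetical. Removing an entry when its branch turns out not taken is an assumed policy that keeps only predicted-taken branches in the table.

```python
class BranchTargetBuffer:
    """Toy BTB: maps the PC of a predicted-taken branch to its predicted target."""

    def __init__(self, instr_bytes=4):
        self.entries = {}               # branch PC (tag) -> predicted PC (data)
        self.instr_bytes = instr_bytes  # k = 4 for MIPS

    def next_fetch_pc(self, pc):
        # Hit: fetch continues at the predicted PC.
        # Miss: not a branch, or not predicted taken, so fetch from PC + k.
        return self.entries.get(pc, pc + self.instr_bytes)

    def update(self, pc, taken, target):
        # Called once the branch has resolved.
        if taken:
            self.entries[pc] = target
        else:
            self.entries.pop(pc, None)  # assumed policy: drop untaken branches


btb = BranchTargetBuffer()
print(hex(btb.next_fetch_pc(0x100)))  # miss: falls through to PC + 4
btb.update(0x100, taken=True, target=0x200)
print(hex(btb.next_fetch_pc(0x100)))  # hit: predicted target
```

Because the lookup uses only the fetch PC, it can run during IF, before the instruction has been decoded.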

14
Branch Target Buffers
15
Changes in MIPS to incorporate BTB
16
Penalties Using BTB in MIPS

Note
Misprediction penalties on more complex machines
are much higher

17
Questions Concerning BTBs

Can BTB be combined with branch prediction
machinery introduced earlier in this lecture?
How?
What kind of branches can a BTB accelerate that
are out of the reach of ordinary branch
predictors?

18
BTB coupled with BHT
19
Improvements

Store instructions rather than target address
increases entry size but removes Ifetch time
permits BTB to run slower and therefore be larger
permits branch folding - branches effectively
disappear
branch job is to change PC and get the real
instruction
if you have the instruction then the branch isn't
there (folded out of the way)
result is 0-cycle jumps and effectively 0-cycle
properly predicted branches
however - branches must be checked
in a parallel path the branch must be fetched and
checked to see if the prediction is true
Predicting indirect jumps
major source is procedure return
obvious model is to use a stack as the return
predictor
note this can be combined with the above to get
jump folding
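The stack-based return predictor mentioned above can be sketched like this. The fixed depth of 8, the drop-oldest overflow policy, the 4-byte instruction size, and all names are assumptions for the example, not details from the slides.

```python
class ReturnAddressStack:
    """Toy return predictor: push on call, pop on return."""

    def __init__(self, depth=8):
        self.depth = depth  # assumed small, fixed capacity
        self.stack = []

    def on_call(self, call_pc, instr_bytes=4):
        if len(self.stack) == self.depth:
            self.stack.pop(0)  # overflow: discard the oldest entry
        # The return address is the instruction after the call.
        self.stack.append(call_pc + instr_bytes)

    def predict_return(self):
        # Returns None when the stack is empty (no prediction available).
        return self.stack.pop() if self.stack else None


ras = ReturnAddressStack()
ras.on_call(0x1000)  # outer call
ras.on_call(0x2000)  # nested call
print(hex(ras.predict_return()))  # inner return address
print(hex(ras.predict_return()))  # outer return address
```

A stack works here because calls and returns nest: the most recent call is the first to return, so the top of the stack is the right prediction as long as the call depth does not exceed the stack's capacity.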

20
Dynamic Branch Prediction Summary

Prediction becoming important part of execution
Branch History Table: 2 bits for loop accuracy
Correlation: Recently executed branches
correlated with next branch
Either different branches (GA)
Or different executions of same branches (PA)
Tournament predictors take insight to next level,
by using multiple predictors
usually one based on global information and one
based on local information, and combining them
with a selector
In 2006, tournament predictors using ≈ 30K bits
are in processors like the Power5 and Pentium 4
Branch Target Buffer: includes branch address and
prediction
Next Class: Dynamic Scheduling