Title: Lecture 8: Instruction Fetch, ILP Limits
Lecture 8: Instruction Fetch, ILP Limits
- Today: advanced branch prediction, limits of ILP
- (Sections 3.4-3.5, 3.8-3.14)
1-Bit Prediction
- For each branch, keep track of what happened last time and use that outcome as the prediction
- What are the prediction accuracies for branches 1 and 2 below?
    while (1) {
        for (i=0; i<10; i++) {    /* branch-1 */
            ...
        }
        for (j=0; j<20; j++) {    /* branch-2 */
            ...
        }
    }
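The question above can be checked with a small simulation. This is only a sketch under stated assumptions: each loop branch is a bottom-tested loop-closing branch (taken to repeat, not taken on exit), each branch has its own history bit with no aliasing, and the helper names `loop_branch_outcomes` and `one_bit_accuracy` are made up for illustration.

```python
def loop_branch_outcomes(trip_count):
    # One pass through the loop: taken (True) to repeat the body,
    # not taken (False) on the final exit. Assumes a bottom-tested loop,
    # so the branch executes once per iteration.
    return [True] * (trip_count - 1) + [False]


def one_bit_accuracy(trip_count, passes=1000):
    last_outcome = False  # the single history bit; initial value is arbitrary
    correct = total = 0
    for p in range(passes):
        for outcome in loop_branch_outcomes(trip_count):
            if p > 0:  # skip the first pass so warm-up does not skew the result
                correct += int(last_outcome == outcome)
                total += 1
            last_outcome = outcome  # remember what happened last time
    return correct / total


branch1_accuracy = one_bit_accuracy(10)
branch2_accuracy = one_bit_accuracy(20)
print(branch1_accuracy, branch2_accuracy)
```

Under these assumptions the 1-bit predictor misses twice per pass through a loop (once on exit, once on re-entry), which gives 8/10 = 80% for branch-1 and 18/20 = 90% for branch-2.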
2-Bit Prediction
- For each branch, maintain a 2-bit saturating counter:
  - if the branch is taken: counter = min(3, counter+1)
  - if the branch is not taken: counter = max(0, counter-1)
- If (counter >= 2), predict taken, else predict not taken
- Advantage: a few atypical branches will not influence the prediction (a better measure of "the common case")
- Especially useful when multiple branches share the same counter (some bits of the branch PC are used to index into the branch predictor)
- Can be easily extended to N bits (in most processors, N=2)
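The counter rules above translate almost directly into code. The sketch below is illustrative only: the class name `TwoBitCounter`, the helper `accuracy`, and the choice of starting state are assumptions, and the test pattern reuses a 10-iteration loop branch (9 taken, then 1 not taken).

```python
class TwoBitCounter:
    def __init__(self, value=0):
        self.value = value  # always kept in the range 0..3

    def predict(self):
        return self.value >= 2  # True means "predict taken"

    def update(self, taken):
        if taken:
            self.value = min(3, self.value + 1)  # saturate at 3
        else:
            self.value = max(0, self.value - 1)  # saturate at 0


def accuracy(outcomes, counter):
    correct = 0
    for taken in outcomes:
        correct += int(counter.predict() == taken)
        counter.update(taken)
    return correct / len(outcomes)


# A loop branch with 10 iterations per pass, repeated 100 times.
pattern = ([True] * 9 + [False]) * 100
two_bit_accuracy = accuracy(pattern, TwoBitCounter(3))
print(two_bit_accuracy)
```

Starting from the strongly-taken state, the single not-taken outcome at loop exit only moves the counter from 3 to 2, so the next pass still begins with a taken prediction and the branch is mispredicted once per pass rather than twice.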
Correlating Predictors
- Basic branch prediction: maintain a 2-bit saturating counter for each entry (or use 10 branch PC bits to index into one of 1024 counters); this captures the recent "common case" for each branch
- Can we take advantage of additional information?
  - If a branch recently went 01111, expect 0; if it recently went 11101, expect 1; can we have a separate counter for each case?
  - If the previous branches went 01, expect 0; if the previous branches went 11, expect 1; can we have a separate counter for each case?
- Hence, build correlating predictors
Local/Global Predictors
- Instead of maintaining a counter for each branch to capture the common case, maintain a counter for each branch and surrounding pattern
  - If the surrounding pattern belongs to the branch being predicted, the predictor is referred to as a local predictor
  - If the surrounding pattern includes neighboring branches, the predictor is referred to as a global predictor
Global Predictor
A single register keeps track of recent history for all branches.
[Figure: an 8-bit global history register (e.g., 00110101) and 6 bits of the branch PC together form a 14-bit index into a table of 16K entries of 2-bit saturating counters.]
Also referred to as a two-level predictor.
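A minimal sketch of this organization follows. The bit widths (8 history bits, 6 PC bits, 16K counters) come from the slide; the class name `GlobalPredictor`, the initial counter value, the choice to place the history in the upper index bits, and dropping the two low PC bits (assuming 4-byte instructions) are all assumptions made for illustration.

```python
HISTORY_BITS = 8
PC_BITS = 6
TABLE_SIZE = 1 << (HISTORY_BITS + PC_BITS)  # 2^14 = 16K counters


class GlobalPredictor:
    def __init__(self):
        self.history = 0                  # one register shared by all branches
        self.counters = [1] * TABLE_SIZE  # start weakly not taken (assumption)

    def _index(self, pc):
        # Assumption: skip the 2 low PC bits, then keep PC_BITS bits.
        pc_bits = (pc >> 2) & ((1 << PC_BITS) - 1)
        return (self.history << PC_BITS) | pc_bits

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        c = self.counters[i]
        self.counters[i] = min(3, c + 1) if taken else max(0, c - 1)
        # Shift the newest outcome into the global history register.
        self.history = ((self.history << 1) | int(taken)) & ((1 << HISTORY_BITS) - 1)


gp = GlobalPredictor()
pc = 0x400
hits = 0
for n in range(200):
    taken = (n % 2 == 0)   # alternating taken / not taken
    if n >= 100:           # measure only after warm-up
        hits += int(gp.predict(pc) == taken)
    gp.update(pc, taken)
print(hits)
```

An alternating branch defeats a plain 2-bit counter, but here the two recurring history values select two different counters, so after warm-up every prediction in this toy run is correct.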
Local Predictor
Also a two-level predictor, but one that uses only local histories at the first level.
[Figure: 6 bits of the branch PC index into a local history table of 64 entries, each holding the 14-bit history of a single branch (e.g., 10110111011001); the 14-bit history then indexes into the next level, a table of 16K entries of 2-bit saturating counters.]
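The two levels can be sketched as follows. The table sizes (64 histories of 14 bits, 16K counters) are the slide's; the class name `LocalPredictor`, the initial values, and the PC bit selection are assumptions. The demo branch repeats the outcome pattern 0, 1, 1, 1, 1 from the correlating-predictor discussion.

```python
LOCAL_PC_BITS = 6
LOCAL_HISTORY_BITS = 14


class LocalPredictor:
    def __init__(self):
        self.histories = [0] * (1 << LOCAL_PC_BITS)      # 64 per-branch histories
        self.counters = [1] * (1 << LOCAL_HISTORY_BITS)  # 16K 2-bit counters

    def _slot(self, pc):
        # Assumption: skip the 2 low PC bits, then keep 6 bits.
        return (pc >> 2) & ((1 << LOCAL_PC_BITS) - 1)

    def predict(self, pc):
        history = self.histories[self._slot(pc)]
        return self.counters[history] >= 2

    def update(self, pc, taken):
        slot = self._slot(pc)
        h = self.histories[slot]
        c = self.counters[h]
        self.counters[h] = min(3, c + 1) if taken else max(0, c - 1)
        # Shift the outcome into this branch's own history only.
        self.histories[slot] = ((h << 1) | int(taken)) & ((1 << LOCAL_HISTORY_BITS) - 1)


lp = LocalPredictor()
pattern = [False, True, True, True, True]
hits = 0
for n in range(300):
    taken = pattern[n % 5]
    if n >= 200:  # measure only after warm-up
        hits += int(lp.predict(0x400) == taken)
    lp.update(0x400, taken)
print(hits)
```

Because each 14-bit history identifies where the branch is within its repeating pattern, the periodic not-taken outcome is predicted correctly once the counters are trained.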
Tournament Predictors
- A local predictor might work well for some branches or programs, while a global predictor might work well for others
- Provide one of each and maintain another predictor to identify which predictor is best for each branch
[Figure: the branch PC indexes a tournament predictor (a table of 2-bit saturating counters) whose output drives a MUX that selects between the local predictor and the global predictor.]
Alpha 21264: local predictor with 1K entries in level-1 and 1K entries in level-2; global predictor with 4K entries and a 12-bit global history; tournament predictor with 4K entries. Total capacity = ?
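The selection mechanism can be sketched as below. This is not the Alpha 21264 design: the component predictors use Python dicts in place of finite tables (so aliasing is not modeled), the history lengths are tiny, the chooser is trained only when the two components disagree, and all class and variable names are made up for illustration.

```python
def bump(counter, up):
    """Saturating update of a 2-bit counter (range 0..3)."""
    return min(3, counter + 1) if up else max(0, counter - 1)


class HistoryPredictor:
    """Two-level predictor. per_branch=True keeps a local history per branch;
    per_branch=False keeps one global history shared by all branches."""

    def __init__(self, history_bits, per_branch):
        self.mask = (1 << history_bits) - 1
        self.per_branch = per_branch
        self.histories = {}
        self.counters = {}

    def _history_key(self, pc):
        return pc if self.per_branch else "global"

    def predict(self, pc):
        h = self.histories.get(self._history_key(pc), 0)
        return self.counters.get((pc, h), 1) >= 2

    def update(self, pc, taken):
        key = self._history_key(pc)
        h = self.histories.get(key, 0)
        self.counters[(pc, h)] = bump(self.counters.get((pc, h), 1), taken)
        self.histories[key] = ((h << 1) | int(taken)) & self.mask


class TournamentPredictor:
    def __init__(self, local, global_, chooser_bits=12):
        self.local = local
        self.global_ = global_
        self.mask = (1 << chooser_bits) - 1
        self.chooser = [2] * (1 << chooser_bits)  # >= 2 selects the global predictor

    def predict(self, pc):
        if self.chooser[pc & self.mask] >= 2:
            return self.global_.predict(pc)
        return self.local.predict(pc)

    def update(self, pc, taken):
        local_ok = self.local.predict(pc) == taken
        global_ok = self.global_.predict(pc) == taken
        if local_ok != global_ok:  # train the chooser only when the two disagree
            i = pc & self.mask
            self.chooser[i] = bump(self.chooser[i], global_ok)
        self.local.update(pc, taken)
        self.global_.update(pc, taken)


# Branch A repeats taken, taken, not taken; branch B is always taken.
A, B = 0x40, 0x80
tp = TournamentPredictor(HistoryPredictor(4, per_branch=True),
                         HistoryPredictor(1, per_branch=False))
tail_hits = 0
for n in range(300):
    a_taken = (n % 3 != 2)
    if n >= 240:  # measure the last 60 executions of branch A
        tail_hits += int(tp.predict(A) == a_taken)
    tp.update(A, a_taken)
    tp.update(B, True)
print(tail_hits, tp.chooser[A & tp.mask])
```

In this toy workload the 1-bit global history seen by branch A only ever records branch B's outcome, so the global component keeps missing A's not-taken iterations, while the local component learns the period-3 pattern. The chooser entry for A therefore drifts to 0 (select local).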
Predictor Comparison
- Note that predictors of equal capacity must be compared
- Sizes of each level have to be selected to optimize prediction accuracy
- Influencing factors: degree of interference between branches, program likely to benefit from local/global history
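To compare predictors at equal capacity, it helps to count storage bits. The sketch below does this for the global and local predictor configurations described earlier in the lecture, and attempts the Alpha 21264 total. The slides do not give the 21264's history or counter widths; the 10-bit local histories and 3-bit level-2 counters used here are taken from commonly published descriptions of that design and should be treated as assumptions, as should the helper name `two_level_bits`.

```python
KI = 1024


def two_level_bits(level1_entries, history_bits, level2_entries, counter_bits=2):
    """Storage in bits for a two-level predictor."""
    return level1_entries * history_bits + level2_entries * counter_bits


# Global predictor from the lecture: one 8-bit history register, 16K 2-bit counters.
global_bits = two_level_bits(1, 8, 16 * KI)
# Local predictor from the lecture: 64 histories of 14 bits, 16K 2-bit counters.
local_bits = two_level_bits(64, 14, 16 * KI)

# Alpha 21264 (history and counter widths are assumptions, see the text above).
alpha_local = two_level_bits(1 * KI, 10, 1 * KI, counter_bits=3)
alpha_global = 4 * KI * 2   # 4K 2-bit counters; the history register is ignored
alpha_chooser = 4 * KI * 2  # 4K 2-bit counters in the tournament predictor
alpha_total = alpha_local + alpha_global + alpha_chooser

print(global_bits, local_bits, alpha_total)
```

With these numbers the two lecture configurations are within about 3% of each other (roughly 32 Kbits each), which is what makes a comparison between them fair, and the 21264 estimate comes to 29 Kbits.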
Branch Target Prediction
- In addition to predicting the branch direction, we must also predict the branch target address
- Branch PC indexes into a predictor table; indirect branches might be problematic
- Most common indirect branch: return from a procedure; can be easily handled with a stack of return addresses
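The two mechanisms above, a PC-indexed target table and a return address stack, can be sketched together. Everything here is illustrative: the class name `TargetPredictor`, the method names, the 512-entry table, the 16-entry stack depth, and the example addresses are assumptions, and a dict stands in for the hardware table.

```python
class TargetPredictor:
    def __init__(self, btb_entries=512, ras_depth=16):
        self.btb_entries = btb_entries
        self.btb = {}        # index -> (branch pc, last target)
        self.ras = []        # return address stack, top at the end
        self.ras_depth = ras_depth

    def predict(self, pc, is_return=False):
        if is_return:
            return self.ras[-1] if self.ras else None
        entry = self.btb.get(pc % self.btb_entries)
        # The stored pc acts as a tag so aliased branches do not hit.
        return entry[1] if entry and entry[0] == pc else None

    def resolve(self, pc, target, is_call=False, is_return=False, return_addr=None):
        """Train with the actual target once the branch has executed."""
        if is_return:
            if self.ras:
                self.ras.pop()
            return
        self.btb[pc % self.btb_entries] = (pc, target)
        if is_call:
            if len(self.ras) == self.ras_depth:
                self.ras.pop(0)  # overflow: drop the oldest return address
            self.ras.append(return_addr)


# One procedure at 0x1000 (returning from 0x1040) called from two sites.
tp = TargetPredictor()
results = []
for call_pc in (0x100, 0x200, 0x100):
    tp.resolve(call_pc, 0x1000, is_call=True, return_addr=call_pc + 4)
    predicted = tp.predict(0x1040, is_return=True)
    results.append(predicted == call_pc + 4)
    tp.resolve(0x1040, call_pc + 4, is_return=True)
print(results)
```

A table that only remembers the last target of the return instruction would mispredict whenever the call site changes; the stack gets every return in this example right because calls and returns nest.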
Multiple Instruction Issue
- The out-of-order processor implementation can be easily extended to have multiple instructions in each pipeline stage
- Increased complexity (lower clock speed!):
  - more reads and writes per cycle to register map table
  - more read and write ports in issue queue
  - more tags being broadcast to issue queue every cycle
  - higher complexity for bypassing/forwarding among FUs
  - more register read and write ports
  - more ports in the LSQ
  - more ports in the data cache
  - more ports in the ROB
ILP Limits
- The perfect processor:
  - Infinite registers (no WAW or WAR hazards)
  - Perfect branch direction and target prediction
  - Perfect memory disambiguation
  - Perfect instruction and data caches
  - Single-cycle latencies for all ALUs
  - Infinite ROB size (window of in-flight instructions)
  - No limit on number of instructions in each pipeline stage
- The last instruction may be scheduled in the first cycle
- The only constraint is a true dependence (register or memory RAW hazards) (with value prediction, how would the perfect processor behave?)
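On the perfect processor, the schedule is determined entirely by the RAW dependence graph. The sketch below computes that schedule for a made-up six-instruction trace; the function name `perfect_schedule` and the trace itself are illustrative assumptions.

```python
def perfect_schedule(deps):
    """deps[i] lists the earlier instructions that instruction i truly
    depends on (RAW). Returns the 1-based cycle in which each instruction
    executes when true dependences are the only constraint."""
    cycle = []
    for producers in deps:
        # Single-cycle latency: run one cycle after the latest producer.
        cycle.append(1 + max((cycle[p] for p in producers), default=0))
    return cycle


# Hypothetical trace: 0 -> 1 -> 2 form a chain, 3 -> 4 form another,
# and instruction 5 is independent of everything before it.
deps = [[], [0], [1], [], [3], []]
cycles = perfect_schedule(deps)
ilp = len(deps) / max(cycles)
print(cycles, ilp)
```

The last instruction in program order executes in cycle 1, as the slide notes is possible, and the achieved ILP (6 instructions in 3 cycles) is set by the longest dependence chain.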
Infinite Window Size and Issue Rate
Effect of Window Size
- Window size is affected by register file/ROB size, branch mispredict rate, fetch bandwidth, etc.
- We will use a window size of 2K instrs and a max issue rate of 64 for subsequent experiments
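A rough way to see how window size and issue rate limit ILP is to schedule a dependence graph cycle by cycle with both limits in place. The model below is a simplification and not the one used for the lecture's experiments: instructions enter the window in program order, a window slot is freed as soon as its instruction executes (no in-order commit), fetch bandwidth is unlimited, and all latencies are one cycle. The function name and the six-instruction trace are made up.

```python
def windowed_schedule(deps, window, issue_width):
    """deps[i] lists earlier instructions that instruction i depends on (RAW).
    Returns the 1-based cycle in which each instruction executes."""
    n = len(deps)
    done_cycle = [None] * n
    next_fetch = 0   # next instruction (program order) to enter the window
    in_window = []   # instructions waiting to execute, in program order
    cycle = 0
    while next_fetch < n or in_window:
        cycle += 1
        # Refill the window in program order.
        while next_fetch < n and len(in_window) < window:
            in_window.append(next_fetch)
            next_fetch += 1
        # Select up to issue_width ready instructions, oldest first.
        issued = []
        for i in in_window:
            if len(issued) == issue_width:
                break
            if all(done_cycle[p] is not None for p in deps[i]):
                issued.append(i)
        # Record results only after selection, so a consumer cannot
        # issue in the same cycle as its producer.
        for i in issued:
            done_cycle[i] = cycle
            in_window.remove(i)
    return done_cycle


deps = [[], [0], [1], [], [3], []]
unlimited = max(windowed_schedule(deps, window=6, issue_width=64))
small_window = max(windowed_schedule(deps, window=2, issue_width=64))
single_issue = max(windowed_schedule(deps, window=6, issue_width=1))
print(unlimited, small_window, single_issue)
```

For this trace, a window of 6 reaches the dataflow limit of 3 cycles, a window of 2 takes 4 cycles because independent instructions are not yet visible to the scheduler, and a single-issue machine takes 6 cycles.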
Imperfect Branch Prediction
- Note: no branch mispredict penalty; a branch mispredict restricts window size
- Assume a large tournament predictor for subsequent experiments
Effect of Name Dependences
- More registers → fewer WAR and WAW constraints (usually register file size goes hand in hand with in-flight window size)
- 256 int and fp registers for subsequent experiments
Memory Dependences
Limits of ILP: Summary
- Int programs are more limited by branches, memory disambiguation, etc., while FP programs are limited most by window size
- We have not yet examined the effect of branch mispredict penalty and imperfect caching
- All of the studied factors have relatively comparable influence on CPI: window/register size, branch prediction, memory disambiguation
- Can we do better? Yes: better compilers, value prediction, memory dependence prediction, multi-path execution
Pentium III (P6 Microarchitecture) Case Study
- 14-stage pipeline: 8 for fetch/decode/dispatch, 3 for o-o-o, 3 for commit → branch mispredict penalty of 10-15 cycles
- Out-of-order execution with a 40-entry ROB (40 temporary or virtual registers) and 20 reservation stations
- Each x86 instruction gets converted into RISC-like micro-ops; on average, one CISC instr → 1.37 micro-ops
- Three instructions in each pipeline stage → 3 instructions can simultaneously leave the pipeline → ideal CPµI = 0.33 → ideal CPI = 0.45
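The ideal CPI figure follows from the two numbers on the slide. A short check, with variable names chosen only for illustration:

```python
commit_width = 3            # micro-ops that can leave the pipeline per cycle
uops_per_x86_instr = 1.37   # average micro-ops per CISC instruction

ideal_cycles_per_uop = 1 / commit_width
ideal_cpi = uops_per_x86_instr * ideal_cycles_per_uop
print(round(ideal_cycles_per_uop, 2), round(ideal_cpi, 3))
```

The product is about 0.457 cycles per x86 instruction, which the slide quotes as 0.45.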
Branch Prediction
- 512-entry global two-level branch predictor and 512-entry BTB → 20% combined mispredict rate
- For every instruction committed, 0.2 instructions on the mispredicted path are also executed (wasted power!)
- Mispredict penalty is 10-15 cycles
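The wrong-path figure can be restated as a fraction of all executed instructions. This is simple arithmetic on the slide's number; the variable names are illustrative.

```python
wrong_path_per_committed = 0.2  # from the slide

executed_per_committed = 1 + wrong_path_per_committed
wasted_fraction = wrong_path_per_committed / executed_per_committed
print(round(wasted_fraction, 3))
```

In other words, roughly one in six executed instructions is later discarded.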
Where is Time Lost?
- Branch mispredict stalls
- Cache miss stalls (dominated by L1D misses)
- Instruction fetch stalls (happens often because subsequent stages are stalled, and occasionally because of an I-cache miss)
CPI Performance
- Owing to stalls, the processor can fall behind (no instructions are committed for 55% of all cycles), but then recover with multi-instruction commits (31% of all cycles) → average CPI = 1.15 (Int) and 2.0 (FP)
- Overlap of different stalls → CPI is not the sum of individual stalls
- IPC is also an attractive metric
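Since IPC is just the reciprocal of CPI, the averages above convert directly. The dictionary and variable names here are illustrative.

```python
average_cpi = {"Int": 1.15, "FP": 2.0}  # from the slide

average_ipc = {name: 1 / cpi for name, cpi in average_cpi.items()}
print({name: round(ipc, 2) for name, ipc in average_ipc.items()})
```

Against the ideal CPI of about 0.45 (an IPC a little above 2), the measured averages correspond to about 0.87 instructions per cycle for Int and 0.5 for FP.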