Title: CDA 5155
1. CDA 5155
- Week 3
- Branch Prediction
- Superscalar Execution
2. Pipelined datapath with branch hardware
[Diagram: PC, instruction memory, register file, sign extend, muxes, branch PC (bpc) and target, control, data memory, beq comparison, and pipeline registers IF/ID, ID/EX, EX/Mem, Mem/WB]
3. Branch Target Buffer
- Fetch PC; send the PC to the BTB
- Found?
  - Yes: use the stored target
  - No: use PC+1
- Result: the predicted target PC
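The BTB lookup above can be sketched in code. This is a minimal, hypothetical model, assuming a direct-mapped table of (PC, target) pairs indexed by the low-order PC bits; real BTBs also vary in associativity and tag width.

```python
# Minimal sketch of a branch target buffer (BTB): a direct-mapped table of
# (tag_pc, target) entries indexed by low-order PC bits (an assumption).
class BTB:
    def __init__(self, entries=16):
        self.entries = entries
        self.table = [None] * entries  # each slot: (tag_pc, target) or None

    def predict(self, pc):
        """Return the predicted next PC: the stored target on a hit,
        or the fall-through PC+1 on a miss."""
        slot = self.table[pc % self.entries]
        if slot is not None and slot[0] == pc:
            return slot[1]          # found: use target
        return pc + 1               # not found: use PC+1

    def update(self, pc, target):
        """Record the target of a taken branch."""
        self.table[pc % self.entries] = (pc, target)
```

Note that the tag check matters: two branches whose PCs alias to the same slot evict each other, which is why BTB size and organization affect prediction quality.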
4. Branch prediction
- Predict not-taken: ~50% accurate
  - No BTB needed; always use PC+1
- Predict backward taken: ~65% accurate
  - BTB holds targets for backward branches (loops)
- Predict same as last time: ~80% accurate
  - Update the BTB for any taken branch
5. What about indirect branches?
- Could use the same approach
  - PC+1 is an unlikely indirect target
- Indirect jumps often have multiple targets (for the same instruction)
  - Switch statements
  - Virtual function calls
  - Shared library (DLL) calls
6. Indirect jumps: a special case
- Return address stack (RAS)
- Function returns have deterministic behavior (usually)
  - Returns go to different locations (the BTB doesn't work well)
  - The return location is known ahead of time
    - In some register at the time of the call
- Build a specialized structure for return addresses
  - Call instructions write the return address to R31 AND to the RAS
  - Return instructions pop the predicted target off the stack
  - Issue: finite size (save or forget on overflow?)
  - Issue: long jumps (clear when wrong?)
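The return address stack can be sketched as follows. This is a minimal model, and it assumes the "forget on overflow" policy (dropping the oldest entry), which is only one of the two options raised above.

```python
# Sketch of a return address stack (RAS) with finite size, assuming the
# "forget on overflow" policy: the oldest entry is dropped when full.
class ReturnAddressStack:
    def __init__(self, size=8):
        self.size = size
        self.stack = []

    def push_call(self, return_pc):
        # A call pushes its return address; on overflow, drop the oldest.
        if len(self.stack) == self.size:
            self.stack.pop(0)
        self.stack.append(return_pc)

    def predict_return(self):
        # A return pops the predicted target; empty stack -> no prediction.
        return self.stack.pop() if self.stack else None
```

After an overflow, the deepest returns have lost their entries, so predictions for them fall back to whatever the rest of the predictor provides.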
7. Costs of branch prediction/speculation
- Performance costs?
  - Minimal: there is no difference between waiting and squashing, and it is a huge gain when the prediction is correct!
- Power?
  - Large in very long/wide pipelines: many instructions can be squashed
  - Squashed instructions per misprediction scale with the pipeline length/width before the target is resolved
8. Costs of branch prediction/speculation
- Area?
  - Can be large: predictors can get very big, as we will see next time
- Complexity?
  - Designs are more complex
  - Testing becomes more difficult, but ...
9. What else can be speculated?
- Dependencies
  - "I think this data is coming from that store instruction"
- Values
  - "I think I will load a 0 value"
- Accuracy?
  - Branch prediction (direction) is Boolean (T, NT)
  - Branch targets are stable or predictable (RAS)
  - Dependencies are limited
  - Values cover a huge space (0 to 4B)
10. Parts of the branch predictor
- Direction predictor
  - For conditional branches
  - Predicts whether the branch will be taken
  - Examples: always taken; backwards taken
- Address predictor
  - Predicts the target address (used if predicted taken)
  - Examples: BTB, Return Address Stack, Precomputed Branch
- Recovery logic
Ref: The Precomputed Branch Architecture
11. Characteristics of branches
- Individual branches differ
- Loops tend not to exit
  - Unoptimized code: not-taken
  - Optimized code: taken
- If-statements
  - Tend to be less predictable
- Unconditional branches
  - Still need address prediction
12. Example: gzip
- gzip loop branch A at 0x1200098d8
  - Executed 1359575 times
  - Taken 1359565 times
  - Not-taken 10 times
  - Percent of time taken: 99 to 100%
- Easy to predict (direction and address)
13. Example: gzip
- gzip if branch B at 0x12000fa04
  - Executed 151409 times
  - Taken 71480 times
  - Not-taken 79929 times
  - Percent of time taken: ~49%
- Easy to predict? (maybe not / maybe dynamically)
14. Example: gzip
[Chart: percent of time taken, 0 to 100%, for branches A and B]
- Direction prediction: always taken. Accuracy: 73%
15. Branch Backwards
- Most backward branches are heavily TAKEN
- Forward branches are slightly more likely to be NOT-TAKEN
Ref: The Effects of Predicated Execution on Branch Prediction
16. Using history
- 1-bit history (direction predictor)
  - Remember the last direction for a branch
[Diagram: branchPC indexes a Branch History Table]
- How big is the BHT?
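A 1-bit BHT can be sketched as below (a hypothetical model; the table size and modulo indexing are assumptions). The answer to the sizing question falls out directly: one bit per entry, so a 1024-entry table costs 1024 bits.

```python
# Sketch of a 1-bit branch history table (BHT): remember the last
# direction seen for each (possibly aliased) branch PC.
class OneBitBHT:
    def __init__(self, table_size=1024):
        self.table_size = table_size
        self.last_dir = [False] * table_size  # False = not-taken

    def predict(self, pc):
        return self.last_dir[pc % self.table_size]

    def update(self, pc, taken):
        self.last_dir[pc % self.table_size] = taken
```

A known weakness: a loop branch that is almost always taken mispredicts twice around each not-taken occurrence, once on the anomalous direction and once more on the next execution.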
17. Example: gzip
[Chart: percent of time taken, 0 to 100%, for branches A and B]
- Direction prediction: always taken. Accuracy: 73%
- How many times will branch A mispredict?
- How many times will branch B mispredict?
18. Using history
- 2-bit history (direction predictor)
[Diagram: branchPC indexes a Branch History Table]
- How big is the BHT?
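The 2-bit scheme replaces each history bit with a saturating counter (sketch below, with the same assumed table size and indexing as before). A single anomalous direction no longer flips the prediction, and the BHT now costs two bits per entry.

```python
# Sketch of a 2-bit saturating-counter direction predictor: each entry
# counts 0..3 and predicts taken when the counter is 2 or 3.
class TwoBitBHT:
    def __init__(self, table_size=1024):
        self.table_size = table_size
        self.counters = [1] * table_size  # start weakly not-taken

    def predict(self, pc):
        return self.counters[pc % self.table_size] >= 2

    def update(self, pc, taken):
        i = pc % self.table_size
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

For the gzip loop branch A above, each rare not-taken occurrence now costs only one misprediction instead of two.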
19. Example: gzip
[Chart: percent of time taken, 0 to 100%, for branches A and B]
- Direction prediction: always taken. Accuracy: 76%
- How many times will branch A mispredict?
- How many times will branch B mispredict?
20. Using History Patterns
- ~80 percent of branches are either heavily TAKEN or heavily NOT-TAKEN
- For the other 20%, we need to look at patterns of reference to see if they are predictable using a more complex predictor
- Example: gcc has a branch that flips each time
  - T(1), NT(0): 10101010101010101010101010101010101010
21. Local history
[Diagram: branchPC indexes a Branch History Table; the history pattern (10101010) indexes a Pattern History Table]
- What is the prediction for this BHT entry, 10101010?
- When do I update the tables?
22. Local history
[Diagram: branchPC indexes a Branch History Table; the history pattern (01010101) indexes a Pattern History Table]
- On the next execution of this branch instruction, the branch history is 01010101, pointing to a different pattern
- What is the accuracy of a flip/flop branch 0101010101010...?
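A two-level local-history predictor along these lines can be sketched as follows (table sizes, history length, and the 2-bit PHT counters are assumptions). Once the two patterns the flip/flop branch generates are both trained, it is predicted with essentially 100% accuracy.

```python
# Sketch of a two-level local-history predictor: a branch history table
# (BHT) records the last h directions per branch; that pattern indexes a
# pattern history table (PHT) of 2-bit saturating counters.
class LocalHistoryPredictor:
    def __init__(self, bht_size=256, history_bits=8):
        self.bht_size = bht_size
        self.history_bits = history_bits
        self.bht = [0] * bht_size                 # per-branch history bits
        self.pht = [1] * (1 << history_bits)      # 2-bit counters

    def predict(self, pc):
        pattern = self.bht[pc % self.bht_size]
        return self.pht[pattern] >= 2

    def update(self, pc, taken):
        i = pc % self.bht_size
        pattern = self.bht[i]
        # Train the counter for the pattern, then shift the outcome in.
        if taken:
            self.pht[pattern] = min(3, self.pht[pattern] + 1)
        else:
            self.pht[pattern] = max(0, self.pht[pattern] - 1)
        mask = (1 << self.history_bits) - 1
        self.bht[i] = ((pattern << 1) | int(taken)) & mask
```

This also answers the update question in the simplest way: both tables are updated when the branch resolves; real designs may update speculatively and repair on a misprediction.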
23. Global history
[Diagram: a Branch History Register (01110101) indexes a Pattern History Table]
- Example:
    for (i = 0; i < 100; i++)
      for (j = 0; j < 3; j++) { ... }
  - j < 3 with j == 1: history 1101 -> taken
  - j < 3 with j == 2: history 1011 -> taken
  - j < 3 with j == 3: history 0111 -> not taken
  - i < 100: history 1110 -> usually taken
- Example:
    if (aa == 2) aa = 0;
    if (bb == 2) bb = 0;
    if (aa != bb) { ... }
- How can branches interfere with each other?
24. Gshare predictor
[Diagram: branchPC XOR Branch History Register (01110101) indexes a Pattern History Table]
Ref: Combining Branch Predictors
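A gshare sketch follows (a minimal hypothetical model; the history length and 2-bit counters are assumptions). The XOR of the PC with the global history spreads the same history pattern from different branches across different PHT entries, reducing the interference asked about above.

```python
# Sketch of a gshare predictor: the global history register (GHR) is
# XORed with the branch PC to index a shared table of 2-bit counters.
class Gshare:
    def __init__(self, history_bits=10):
        self.history_bits = history_bits
        self.mask = (1 << history_bits) - 1
        self.ghr = 0                             # global history register
        self.pht = [1] * (1 << history_bits)     # 2-bit counters

    def _index(self, pc):
        return (pc ^ self.ghr) & self.mask

    def predict(self, pc):
        return self.pht[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.pht[i] = min(3, self.pht[i] + 1)
        else:
            self.pht[i] = max(0, self.pht[i] - 1)
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask
```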
25. Bi-Mode predictor
[Diagram: branchPC XOR global history register indexes two PHTs, one skewed taken and one skewed not-taken; a choice predictor drives a mux that selects between them]
26. Tournament predictors
- Local predictor (e.g. 2-bit)
- Global/gshare predictor (much more state)
- Prediction 1 and Prediction 2 feed a selection table (a 2-bit state machine) that produces the final prediction
- How do you select which predictor to use? How do you update the various predictors/selectors?
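The selection mechanism can be sketched like this. It is a hypothetical model: the components can be any pair of objects with predict/update methods, and the selector policy shown (move only when the components disagree, toward the winner) is one common choice of answer to the update question above.

```python
# Sketch of a tournament predictor: two component predictors run in
# parallel, and a PC-indexed table of 2-bit selector counters learns
# which component to trust for each branch.
class Tournament:
    def __init__(self, local, global_pred, table_size=1024):
        self.local = local                # any object with predict/update
        self.global_pred = global_pred
        self.table_size = table_size
        self.sel = [1] * table_size       # >= 2 means "trust global"

    def predict(self, pc):
        p_local = self.local.predict(pc)
        p_global = self.global_pred.predict(pc)
        return p_global if self.sel[pc % self.table_size] >= 2 else p_local

    def update(self, pc, taken):
        p_local = self.local.predict(pc)
        p_global = self.global_pred.predict(pc)
        i = pc % self.table_size
        if p_local != p_global:           # move selector toward the winner
            if p_global == taken:
                self.sel[i] = min(3, self.sel[i] + 1)
            else:
                self.sel[i] = max(0, self.sel[i] - 1)
        self.local.update(pc, taken)      # both components always train
        self.global_pred.update(pc, taken)
```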
27. Overriding Predictors
- Big predictors are slow, but more accurate
- Use a single-cycle predictor in fetch
  - Start the multi-cycle predictor
  - When it completes, compare it to the fast prediction
    - If same, do nothing
    - If different, assume the slow predictor is right and flush the pipeline
- Advantage: reduced branch penalty for those branches mispredicted by the fast predictor and correctly predicted by the slow predictor
28. Pipelined Gshare Predictor
- How can we get a pipelined global prediction by stage 1?
  - Start in stage 2
  - We don't have the most recent branch history
- Access multiple entries
  - E.g. if we are missing the last three branches, get 8 histories and pick between them during the fetch stage
Ref: Reconsidering Complex Branch Predictors
29. Exceptions
- Exceptions are events that are difficult or impossible to manage in hardware alone
- Exceptions are usually handled by jumping into a service (software) routine
- Examples: I/O device request, page fault, divide by zero, memory protection violation (seg fault), hardware failure, etc.
30. Taking an Exception
- Once an exception occurs, how does the processor proceed?
- Non-pipelined: don't fetch from the PC; save state; fetch from the interrupt vector table
- Pipelined: depends on the exception
  - Precise interrupt: must squash all instructions after the exception
    - Divide by zero: flush fetch/decode
    - Page fault (fetch or mem stage?)
  - Save state after the last instruction before the exception completes (PC, regs)
  - Fetch from the interrupt vector table
31. Optimizing CPU Performance
- Golden Rule: tCPU = Ninst x CPI x tCLK
- Given this, what are our options?
  - Reduce the number of instructions executed
    - The compiler's job (COP 5621, COP 5622)
  - Reduce the clock period
    - Fabrication (some engineering classes)
  - Reduce the cycles needed to execute an instruction
    - Approach: Instruction-Level Parallelism (ILP)
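A quick worked example of the golden rule, using made-up numbers for a hypothetical program:

```python
# Golden rule: t_CPU = N_inst * CPI * t_CLK.
# Assumed numbers: 1 billion instructions, CPI of 1.5, 500 ps clock (2 GHz).
n_inst = 1_000_000_000
cpi = 1.5
t_clk = 500e-12           # seconds per cycle

t_cpu = n_inst * cpi * t_clk
print(t_cpu)              # 0.75 seconds
```

Halving any one factor (fewer instructions, lower CPI, or a shorter clock) halves the run time, which is exactly why the three options above are the three levers available.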
32. Adding width to basic pipelining
- 5-stage RISC load-store architecture
  - About as simple as things get
- Instruction fetch
  - Get 2 instructions from memory/cache
- Instruction decode
  - Translate opcodes into control signals and read the registers
- Execute
  - Perform ALU operations
- Memory
  - Access memory if load/store
- Writeback/retire
  - Update the register file
33. Stage 1: Fetch
- Design a datapath that can fetch two instructions from memory every cycle
  - Use the PC to index memory and read the instructions
  - Read 2 instructions
  - Increment the PC (by 2)
- Write everything needed to complete execution to the pipeline register (IF/ID)
  - Instruction 1, instruction 2, PC+1, PC+2
34. Rest of pipelined datapath
35. Stage 2: Decode
- Design a datapath that reads the IF/ID pipeline register, decodes the instructions, and reads the register file (specified by the regA and regB instruction bits of both instructions)
- Write everything needed to complete execution to the pipeline register (ID/EX)
- Pass on both instructions
  - Including PC+1 and PC+2, even though decode didn't use them
36. Rest of pipelined datapath
[Diagram: Stage 1 Fetch datapath]
- Changes? Hazard detection?
37. Stage 3: Execute
- Design a datapath that performs the proper ALU operations for the instructions specified and the values present in the ID/EX pipeline register
  - The inputs to ALU-top are the contents of regA-top and either the contents of regB-top or the offset-top field of the instruction
  - The inputs to ALU-bottom are the contents of regA-bottom and either the contents of regB-bottom or the offset-bottom field of the instruction
- Also, calculate PC+1+offset-top in case the top instruction is a branch
- Also, calculate PC+2+offset-bottom in case the bottom instruction is a branch
38. Stage 2: Decode datapath
[Diagram: PC+1, control signals]
- How many data forwarding paths?
39. Stage 4: Memory Operation
- Design a datapath that performs the proper memory operation(s) for the instructions specified and the values present in the EX/Mem pipeline register
  - ALU results contain the addresses for ld and st instructions
  - Opcode bits control the memory R/W and enable signals
- Write everything needed to complete execution to the pipeline register (Mem/WB)
  - ALU results and MemData (x2)
  - Instruction bits for the opcodes and destReg specifiers
40. Stage 3: Execute datapath
[Diagram: PC+1+offset, contents of regB, control signals]
- Should we process 2 memory operations in one cycle?
41. Stage 5: Write back
- Design a datapath that completes the execution of these instructions, writing to the register file if required
  - Write MemData to destReg for ld instructions
  - Write the ALU result to destReg for add or nand instructions
  - Opcode bits also control the register write enable signal
42. Stage 4: Memory datapath
[Diagram: ALU result, memory read data, control signals, Mem/WB pipeline register]
- What about ordering the register writes if both instructions have the same destination specifier?
43. How Much ILP is There?
44. ALU Operation GOOD, Branch BAD
- Expected number of branches between mispredicts: E(X) = 1/(1-p)
- E.g., p = 95%: E(X) = 20 branches, or roughly 100 instructions
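The run-length formula is just the mean of a geometric distribution: if each branch is predicted correctly with probability p, the count of branches up to and including the first misprediction has mean 1/(1-p). A quick numeric check:

```python
# E(X) = 1/(1-p): expected branches between mispredicts.
# With p = 0.95, that is one misprediction per ~20 branches, and with
# roughly 1 branch in 5 instructions, ~100 instructions between flushes.
p = 0.95
expected_branches = 1 / (1 - p)
print(expected_branches)   # ~20 (up to floating-point rounding)
```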
45. How Accurate are Branch Predictors?