Title: CDA 5155
1. CDA 5155
- Week 3
- Branch Prediction
- Superscalar Execution
2. Pipelined datapath with branch hardware
[Diagram: PC, instruction memory, register file, sign extend, muxes, branch PC (bpc) and target, control, data memory, beq comparison, and pipeline registers IF/ID, ID/EX, EX/Mem, Mem/WB]
3. Branch Target Buffer
- Fetch PC; send the PC to the BTB
- Found?
  - Yes: use the stored target
  - No: use PC+1
- Result: the predicted target PC
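The BTB lookup above can be sketched in code. This is a minimal, hypothetical model, assuming a direct-mapped table of (PC, target) pairs indexed by the low-order PC bits; real BTBs also vary in associativity and tag width.

```python
# Minimal sketch of a branch target buffer (BTB): a direct-mapped table of
# (tag_pc, target) entries indexed by low-order PC bits (an assumption).
class BTB:
    def __init__(self, entries=16):
        self.entries = entries
        self.table = [None] * entries  # each slot: (tag_pc, target) or None

    def predict(self, pc):
        """Return the predicted next PC: the stored target on a hit,
        or the fall-through PC+1 on a miss."""
        slot = self.table[pc % self.entries]
        if slot is not None and slot[0] == pc:
            return slot[1]          # found: use target
        return pc + 1               # not found: use PC+1

    def update(self, pc, target):
        """Record the target of a taken branch."""
        self.table[pc % self.entries] = (pc, target)
```

Note that the tag check matters: two branches whose PCs alias to the same slot evict each other, which is why BTB size and organization affect prediction quality.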
4. Branch prediction
- Predict not-taken: ~50% accurate
  - No BTB needed; always use PC+1
- Predict backward taken: ~65% accurate
  - BTB holds targets for backward branches (loops)
- Predict same as last time: ~80% accurate
  - Update the BTB for any taken branch
5. What about indirect branches?
- Could use the same approach
  - PC+1 is an unlikely indirect target
- Indirect jumps often have multiple targets (for the same instruction)
  - Switch statements
  - Virtual function calls
  - Shared library (DLL) calls
6. Indirect jumps: a special case
- Return address stack (RAS)
- Function returns have deterministic behavior (usually)
  - Returns go to different locations (the BTB doesn't work well)
  - The return location is known ahead of time
    - In some register at the time of the call
- Build a specialized structure for return addresses
  - Call instructions write the return address to R31 AND to the RAS
  - Return instructions pop the predicted target off the stack
  - Issue: finite size (save or forget on overflow?)
  - Issue: long jumps (clear when wrong?)
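The return address stack can be sketched as follows. This is a minimal model, and it assumes the "forget on overflow" policy (dropping the oldest entry), which is only one of the two options raised above.

```python
# Sketch of a return address stack (RAS) with finite size, assuming the
# "forget on overflow" policy: the oldest entry is dropped when full.
class ReturnAddressStack:
    def __init__(self, size=8):
        self.size = size
        self.stack = []

    def push_call(self, return_pc):
        # A call pushes its return address; on overflow, drop the oldest.
        if len(self.stack) == self.size:
            self.stack.pop(0)
        self.stack.append(return_pc)

    def predict_return(self):
        # A return pops the predicted target; empty stack -> no prediction.
        return self.stack.pop() if self.stack else None
```

After an overflow, the deepest returns have lost their entries, so predictions for them fall back to whatever the rest of the predictor provides.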
7. Costs of branch prediction/speculation
- Performance costs?
  - Minimal: there is no difference between waiting and squashing, and it is a huge gain when the prediction is correct!
- Power?
  - Large in very long/wide pipelines: many instructions can be squashed
  - Squashed instructions per misprediction scale with the pipeline length/width before the target is resolved
8. Costs of branch prediction/speculation
- Area?
  - Can be large: predictors can get very big, as we will see next time
- Complexity?
  - Designs are more complex
  - Testing becomes more difficult, but ...
9. What else can be speculated?
- Dependencies
  - "I think this data is coming from that store instruction"
- Values
  - "I think I will load a 0 value"
- Accuracy?
  - Branch prediction (direction) is Boolean (T, NT)
  - Branch targets are stable or predictable (RAS)
  - Dependencies are limited
  - Values cover a huge space (0 to 4B)
10. Parts of the branch predictor
- Direction predictor
  - For conditional branches
  - Predicts whether the branch will be taken
  - Examples: always taken; backwards taken
- Address predictor
  - Predicts the target address (used if predicted taken)
  - Examples: BTB, Return Address Stack, Precomputed Branch
- Recovery logic
Ref: The Precomputed Branch Architecture
11. Characteristics of branches
- Individual branches differ
- Loops tend not to exit
  - Unoptimized code: not-taken
  - Optimized code: taken
- If-statements
  - Tend to be less predictable
- Unconditional branches
  - Still need address prediction
12. Example: gzip
- gzip loop branch A at 0x1200098d8
  - Executed 1359575 times
  - Taken 1359565 times
  - Not-taken 10 times
  - Percent of time taken: 99 to 100%
- Easy to predict (direction and address)
13. Example: gzip
- gzip if branch B at 0x12000fa04
  - Executed 151409 times
  - Taken 71480 times
  - Not-taken 79929 times
  - Percent of time taken: ~49%
- Easy to predict? (maybe not / maybe dynamically)
14. Example: gzip
[Chart: percent of time taken, 0 to 100%, for branches A and B]
- Direction prediction: always taken. Accuracy: 73%
15. Branch Backwards
- Most backward branches are heavily TAKEN
- Forward branches are slightly more likely to be NOT-TAKEN
Ref: The Effects of Predicated Execution on Branch Prediction
16. Using history
- 1-bit history (direction predictor)
  - Remember the last direction for a branch
[Diagram: branchPC indexes a Branch History Table]
- How big is the BHT?
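A 1-bit BHT can be sketched as below (a hypothetical model; the table size and modulo indexing are assumptions). The answer to the sizing question falls out directly: one bit per entry, so a 1024-entry table costs 1024 bits.

```python
# Sketch of a 1-bit branch history table (BHT): remember the last
# direction seen for each (possibly aliased) branch PC.
class OneBitBHT:
    def __init__(self, table_size=1024):
        self.table_size = table_size
        self.last_dir = [False] * table_size  # False = not-taken

    def predict(self, pc):
        return self.last_dir[pc % self.table_size]

    def update(self, pc, taken):
        self.last_dir[pc % self.table_size] = taken
```

A known weakness: a loop branch that is almost always taken mispredicts twice around each not-taken occurrence, once on the anomalous direction and once more on the next execution.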
17. Example: gzip
[Chart: percent of time taken, 0 to 100%, for branches A and B]
- Direction prediction: always taken. Accuracy: 73%
- How many times will branch A mispredict?
- How many times will branch B mispredict?
18. Using history
- 2-bit history (direction predictor)
[Diagram: branchPC indexes a Branch History Table]
- How big is the BHT?
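The 2-bit scheme replaces each history bit with a saturating counter (sketch below, with the same assumed table size and indexing as before). A single anomalous direction no longer flips the prediction, and the BHT now costs two bits per entry.

```python
# Sketch of a 2-bit saturating-counter direction predictor: each entry
# counts 0..3 and predicts taken when the counter is 2 or 3.
class TwoBitBHT:
    def __init__(self, table_size=1024):
        self.table_size = table_size
        self.counters = [1] * table_size  # start weakly not-taken

    def predict(self, pc):
        return self.counters[pc % self.table_size] >= 2

    def update(self, pc, taken):
        i = pc % self.table_size
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

For the gzip loop branch A above, each rare not-taken occurrence now costs only one misprediction instead of two.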
19. Example: gzip
[Chart: percent of time taken, 0 to 100%, for branches A and B]
- Direction prediction: always taken. Accuracy: 76%
- How many times will branch A mispredict?
- How many times will branch B mispredict?
20. Using History Patterns
- ~80 percent of branches are either heavily TAKEN or heavily NOT-TAKEN
- For the other 20%, we need to look at patterns of reference to see if they are predictable using a more complex predictor
- Example: gcc has a branch that flips each time
  - T(1), NT(0): 10101010101010101010101010101010101010
21. Local history
[Diagram: branchPC indexes a Branch History Table; the history pattern (10101010) indexes a Pattern History Table]
- What is the prediction for this BHT entry, 10101010?
- When do I update the tables?
22. Local history
[Diagram: branchPC indexes a Branch History Table; the history pattern (01010101) indexes a Pattern History Table]
- On the next execution of this branch instruction, the branch history is 01010101, pointing to a different pattern
- What is the accuracy of a flip/flop branch 0101010101010...?
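A two-level local-history predictor along these lines can be sketched as follows (table sizes, history length, and the 2-bit PHT counters are assumptions). Once the two patterns the flip/flop branch generates are both trained, it is predicted with essentially 100% accuracy.

```python
# Sketch of a two-level local-history predictor: a branch history table
# (BHT) records the last h directions per branch; that pattern indexes a
# pattern history table (PHT) of 2-bit saturating counters.
class LocalHistoryPredictor:
    def __init__(self, bht_size=256, history_bits=8):
        self.bht_size = bht_size
        self.history_bits = history_bits
        self.bht = [0] * bht_size                 # per-branch history bits
        self.pht = [1] * (1 << history_bits)      # 2-bit counters

    def predict(self, pc):
        pattern = self.bht[pc % self.bht_size]
        return self.pht[pattern] >= 2

    def update(self, pc, taken):
        i = pc % self.bht_size
        pattern = self.bht[i]
        # Train the counter for the pattern, then shift the outcome in.
        if taken:
            self.pht[pattern] = min(3, self.pht[pattern] + 1)
        else:
            self.pht[pattern] = max(0, self.pht[pattern] - 1)
        mask = (1 << self.history_bits) - 1
        self.bht[i] = ((pattern << 1) | int(taken)) & mask
```

This also answers the update question in the simplest way: both tables are updated when the branch resolves; real designs may update speculatively and repair on a misprediction.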
23. Global history
[Diagram: a Branch History Register (01110101) indexes a Pattern History Table]
- Example:
    for (i = 0; i < 100; i++)
      for (j = 0; j < 3; j++) { ... }
  - j < 3 with j == 1: history 1101 -> taken
  - j < 3 with j == 2: history 1011 -> taken
  - j < 3 with j == 3: history 0111 -> not taken
  - i < 100: history 1110 -> usually taken
- Example:
    if (aa == 2) aa = 0;
    if (bb == 2) bb = 0;
    if (aa != bb) { ... }
- How can branches interfere with each other?
24. Gshare predictor
[Diagram: branchPC XOR Branch History Register (01110101) indexes a Pattern History Table]
Ref: Combining Branch Predictors
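A gshare sketch follows (a minimal hypothetical model; the history length and 2-bit counters are assumptions). The XOR of the PC with the global history spreads the same history pattern from different branches across different PHT entries, reducing the interference asked about above.

```python
# Sketch of a gshare predictor: the global history register (GHR) is
# XORed with the branch PC to index a shared table of 2-bit counters.
class Gshare:
    def __init__(self, history_bits=10):
        self.history_bits = history_bits
        self.mask = (1 << history_bits) - 1
        self.ghr = 0                             # global history register
        self.pht = [1] * (1 << history_bits)     # 2-bit counters

    def _index(self, pc):
        return (pc ^ self.ghr) & self.mask

    def predict(self, pc):
        return self.pht[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.pht[i] = min(3, self.pht[i] + 1)
        else:
            self.pht[i] = max(0, self.pht[i] - 1)
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask
```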
25. Bi-Mode predictor
[Diagram: branchPC XOR global history register indexes two PHTs, one skewed taken and one skewed not-taken; a choice predictor drives a mux that selects between them]
26. Tournament predictors
- Local predictor (e.g. 2-bit)
- Global/gshare predictor (much more state)
- Prediction 1 and Prediction 2 feed a selection table (a 2-bit state machine) that produces the final prediction
- How do you select which predictor to use? How do you update the various predictors/selectors?
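The selection mechanism can be sketched like this. It is a hypothetical model: the components can be any pair of objects with predict/update methods, and the selector policy shown (move only when the components disagree, toward the winner) is one common choice of answer to the update question above.

```python
# Sketch of a tournament predictor: two component predictors run in
# parallel, and a PC-indexed table of 2-bit selector counters learns
# which component to trust for each branch.
class Tournament:
    def __init__(self, local, global_pred, table_size=1024):
        self.local = local                # any object with predict/update
        self.global_pred = global_pred
        self.table_size = table_size
        self.sel = [1] * table_size       # >= 2 means "trust global"

    def predict(self, pc):
        p_local = self.local.predict(pc)
        p_global = self.global_pred.predict(pc)
        return p_global if self.sel[pc % self.table_size] >= 2 else p_local

    def update(self, pc, taken):
        p_local = self.local.predict(pc)
        p_global = self.global_pred.predict(pc)
        i = pc % self.table_size
        if p_local != p_global:           # move selector toward the winner
            if p_global == taken:
                self.sel[i] = min(3, self.sel[i] + 1)
            else:
                self.sel[i] = max(0, self.sel[i] - 1)
        self.local.update(pc, taken)      # both components always train
        self.global_pred.update(pc, taken)
```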
27. Overriding Predictors
- Big predictors are slow, but more accurate
- Use a single-cycle predictor in fetch
  - Start the multi-cycle predictor
  - When it completes, compare it to the fast prediction
    - If same, do nothing
    - If different, assume the slow predictor is right and flush the pipeline
- Advantage: reduced branch penalty for those branches mispredicted by the fast predictor and correctly predicted by the slow predictor
28. Pipelined Gshare Predictor
- How can we get a pipelined global prediction by stage 1?
  - Start in stage 2
  - We don't have the most recent branch history
- Access multiple entries
  - E.g. if we are missing the last three branches, get 8 histories and pick between them during the fetch stage
Ref: Reconsidering Complex Branch Predictors
29. Exceptions
- Exceptions are events that are difficult or impossible to manage in hardware alone
- Exceptions are usually handled by jumping into a service (software) routine
- Examples: I/O device request, page fault, divide by zero, memory protection violation (seg fault), hardware failure, etc.
30. Taking an Exception
- Once an exception occurs, how does the processor proceed?
- Non-pipelined: don't fetch from the PC; save state; fetch from the interrupt vector table
- Pipelined: depends on the exception
  - Precise interrupt: must squash all instructions after the exception
    - Divide by zero: flush fetch/decode
    - Page fault (fetch or mem stage?)
  - Save state after the last instruction before the exception completes (PC, regs)
  - Fetch from the interrupt vector table
31. Optimizing CPU Performance
- Golden Rule: tCPU = Ninst x CPI x tCLK
- Given this, what are our options?
  - Reduce the number of instructions executed
    - The compiler's job (COP 5621, COP 5622)
  - Reduce the clock period
    - Fabrication (some engineering classes)
  - Reduce the cycles needed to execute an instruction
    - Approach: Instruction-Level Parallelism (ILP)
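A quick worked example of the golden rule, using made-up numbers for a hypothetical program:

```python
# Golden rule: t_CPU = N_inst * CPI * t_CLK.
# Assumed numbers: 1 billion instructions, CPI of 1.5, 500 ps clock (2 GHz).
n_inst = 1_000_000_000
cpi = 1.5
t_clk = 500e-12           # seconds per cycle

t_cpu = n_inst * cpi * t_clk
print(t_cpu)              # 0.75 seconds
```

Halving any one factor (fewer instructions, lower CPI, or a shorter clock) halves the run time, which is exactly why the three options above are the three levers available.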
32. Adding width to basic pipelining
- 5-stage RISC load-store architecture
  - About as simple as things get
- Instruction fetch
  - Get 2 instructions from memory/cache
- Instruction decode
  - Translate opcodes into control signals and read the registers
- Execute
  - Perform ALU operations
- Memory
  - Access memory if load/store
- Writeback/retire
  - Update the register file
33. Stage 1: Fetch
- Design a datapath that can fetch two instructions from memory every cycle
  - Use the PC to index memory and read the instructions
  - Read 2 instructions
  - Increment the PC (by 2)
- Write everything needed to complete execution to the pipeline register (IF/ID)
  - Instruction 1, instruction 2, PC+1, PC+2
34. Rest of pipelined datapath
35. Stage 2: Decode
- Design a datapath that reads the IF/ID pipeline register, decodes the instructions, and reads the register file (specified by the regA and regB instruction bits of both instructions)
- Write everything needed to complete execution to the pipeline register (ID/EX)
- Pass on both instructions
  - Including PC+1 and PC+2, even though decode didn't use them
36. Rest of pipelined datapath
[Diagram: Stage 1 Fetch datapath]
- Changes? Hazard detection?
37. Stage 3: Execute
- Design a datapath that performs the proper ALU operations for the instructions specified and the values present in the ID/EX pipeline register
  - The inputs to ALU-top are the contents of regA-top and either the contents of regB-top or the offset-top field of the instruction
  - The inputs to ALU-bottom are the contents of regA-bottom and either the contents of regB-bottom or the offset-bottom field of the instruction
- Also, calculate PC+1+offset-top in case the top instruction is a branch
- Also, calculate PC+2+offset-bottom in case the bottom instruction is a branch
38. Stage 2: Decode datapath
[Diagram: PC+1, control signals]
- How many data forwarding paths?
39. Stage 4: Memory Operation
- Design a datapath that performs the proper memory operation(s) for the instructions specified and the values present in the EX/Mem pipeline register
  - ALU results contain the addresses for ld and st instructions
  - Opcode bits control the memory R/W and enable signals
- Write everything needed to complete execution to the pipeline register (Mem/WB)
  - ALU results and MemData (x2)
  - Instruction bits for the opcodes and destReg specifiers
40. Stage 3: Execute datapath
[Diagram: PC+1+offset, contents of regB, control signals]
- Should we process 2 memory operations in one cycle?
41. Stage 5: Write back
- Design a datapath that completes the execution of these instructions, writing to the register file if required
  - Write MemData to destReg for ld instructions
  - Write the ALU result to destReg for add or nand instructions
  - Opcode bits also control the register write enable signal
42. Stage 4: Memory datapath
[Diagram: ALU result, memory read data, control signals, Mem/WB pipeline register]
- What about ordering the register writes if both instructions have the same destination specifier?
43. How Much ILP is There?
44. ALU Operation GOOD, Branch BAD
- Expected number of branches between mispredicts: E(X) = 1/(1-p)
- E.g., p = 95%: E(X) = 20 branches, or roughly 100 instructions
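The run-length formula is just the mean of a geometric distribution: if each branch is predicted correctly with probability p, the count of branches up to and including the first misprediction has mean 1/(1-p). A quick numeric check:

```python
# E(X) = 1/(1-p): expected branches between mispredicts.
# With p = 0.95, that is one misprediction per ~20 branches, and with
# roughly 1 branch in 5 instructions, ~100 instructions between flushes.
p = 0.95
expected_branches = 1 / (1 - p)
print(expected_branches)   # ~20 (up to floating-point rounding)
```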
45. How Accurate are Branch Predictors?