Title: Hardware-based Devirtualization (VPC Prediction)
1Hardware-based Devirtualization (VPC Prediction)
- Hyesoon Kim, Jose A. Joao, Onur Mutlu, Chang Joo Lee, Yale N. Patt, Robert Cohn
2Outline
- Background and Motivation
- VPC (Virtual Program Counter) Prediction
- Results
- Conclusion
3Direct vs. Indirect Branch
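[Figure: side-by-side comparison. Conditional (direct) branch: br.cond TARGET goes to one of two addresses, TARGET if taken (T) or the next instruction if not taken (N). Indirect branch: R1 = MEM[R2]; branch R1 goes to one of several possible target addresses.]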
- Indirect branches are costly for processor performance
- Much more difficult to predict than conditional (direct) branches: multiple target addresses
- Indirect branch predictor requires a large structure
4Source Code Examples
- Switch structures
- Virtual function calls
Source code:
  Shape* s = ...;
  a = s->area();     // virtual function call
Static assembly code:
  R1 = MEM[R2]       // function address lookup
  call R1            // a register-indirect call
5Indirect Branch Mispredictions
Data from Intel Core Duo processor
6Branch Predictor
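[Figure: baseline branch predictor. The PC address and the global history register (GHR) are hashed to index the direction predictor, which predicts taken/not taken. The Branch Target Buffer (BTB) supplies the target for direct branches; a separate indirect branch predictor supplies the predicted target for indirect branches.]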
7Outline
- Background and Motivation
- VPC (Virtual Program Counter) Prediction
- Results
- Conclusion
8VPC Prediction Basic Idea
- Key idea: treat an indirect branch as multiple virtual conditional branches
- Only for prediction purposes
- Use the conditional branch predictor
9VPC Branch Predictor
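[Figure: VPC branch predictor. The virtual PC addresses (VPC1, VPC2) and the GHR are hashed to index the direction predictor; the Branch Target Buffer holds the targets (TARG1, TARG2) and supplies the predicted target.]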
10VPC Prediction Basic Idea
- Key idea: treat an indirect branch as multiple virtual conditional branches
- Only for prediction purposes
- Use the conditional branch predictor
- Benefits:
- No separate complex structure
- Can be applied with any conditional branch prediction algorithm
- Improving the conditional branch prediction algorithm will also improve indirect branch prediction accuracy
11Inspiration Static Devirtualization
- Source code:
  Shape* s = ...;
  a = s->area();                     // an indirect call
- Optimized source code:
  Shape* s = ...;
  if (s->type == Rectangle)          // a conditional branch at PC: X
    a = Rectangle::area();
  else if (s->type == Circle)        // a conditional branch at PC: Y
    a = Circle::area();
  else
    a = s->area();                   // an indirect call at PC: Z
Smalltalk (84), Calder and Grunwald (94), Garret et al. (94), Ishizaki et al. (00)
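The transformation on this slide can be written out as a small, compilable C++ sketch. It is a hand-written illustration, not the output of any particular compiler: the Kind tag, the Triangle class (standing in for "any other type"), and the function names area_virtual and area_devirtualized are assumptions made for this example.

```cpp
#include <cassert>

// Hypothetical hierarchy: the slide names only Shape, Rectangle, Circle,
// a type tag, and area(). Everything else here is assumed for illustration.
enum class Kind { Rectangle, Circle, Other };

struct Shape {
    Kind type;
    explicit Shape(Kind k) : type(k) {}
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

struct Rectangle : Shape {
    double w, h;
    Rectangle(double w_, double h_) : Shape(Kind::Rectangle), w(w_), h(h_) {}
    double area() const override { return w * h; }
};

struct Circle : Shape {
    double r;
    explicit Circle(double r_) : Shape(Kind::Circle), r(r_) {}
    double area() const override { return 3.141592653589793 * r * r; }
};

struct Triangle : Shape {
    double b, h;
    Triangle(double b_, double h_) : Shape(Kind::Other), b(b_), h(h_) {}
    double area() const override { return 0.5 * b * h; }
};

// Original form: a single indirect call through the vtable.
double area_virtual(const Shape* s) {
    assert(s != nullptr);
    return s->area();
}

// Devirtualized form: type checks (conditional branches) guard direct calls
// for the expected types; the indirect call remains as the fallback.
double area_devirtualized(const Shape* s) {
    assert(s != nullptr);
    if (s->type == Kind::Rectangle) {
        // A qualified name suppresses virtual dispatch, so this is a direct call.
        return static_cast<const Rectangle*>(s)->Rectangle::area();
    } else if (s->type == Kind::Circle) {
        return static_cast<const Circle*>(s)->Circle::area();
    }
    return s->area();  // any other type: still an indirect call
}
```

Both functions return the same value for every object; only the kind of branch the processor has to predict differs. The direct calls are also candidates for inlining, which the indirect call is not.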
12VPC Prediction
- Source code:
  Shape* s = ...;
  a = s->area();              // an indirect call
- Static assembly code:
  R1 = MEM[R2]
  call R1                     // PC: L
- Dynamic virtual branches (for prediction purposes):
  conditional jump TARGET1    // virtual PC: L
  conditional jump TARGET2    // virtual PC: L XOR HASHVAL1
  conditional jump TARGET3    // virtual PC: L XOR HASHVAL2
  conditional jump TARGET4    // virtual PC: L XOR HASHVAL3
13Virtual PC Address Generation
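- Use original PC address and iteration counter value
[Figure: the iteration counter value indexes a hash value table; the selected hash value is combined with the original PC address to form the virtual PC address.]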
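As a rough sketch of the address generation described here, the function below XORs the original PC with a table entry selected by the iteration counter. The table size (kMaxIter), the hash constants in kHashVal, and the name vpc_address are assumptions for illustration; the slides do not give concrete values. Entry 0 is zero so that the first virtual branch uses the original PC unchanged, matching the earlier slide where the first virtual PC is L.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed parameters: neither the iteration limit nor the hash values
// are specified on the slides.
constexpr std::size_t kMaxIter = 4;
constexpr std::array<std::uint64_t, kMaxIter> kHashVal = {{
    0x0,                    // iteration 0: the original PC itself
    0x9e3779b97f4a7c15ULL,  // arbitrary fixed bit patterns
    0xc2b2ae3d27d4eb4fULL,
    0x165667b19e3779f9ULL,
}};

// Virtual PC address for one iteration: PC XOR HASHVAL[iter].
inline std::uint64_t vpc_address(std::uint64_t pc, std::size_t iter) {
    assert(iter < kMaxIter);
    return pc ^ kHashVal[iter];
}
```

Because the hash values are fixed, the same indirect branch always maps to the same sequence of virtual PCs, so each virtual branch gets stable entries in the BTB and the direction predictor.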
14VPC Prediction Process-I
Real instruction:
  call R1               // PC: L
Virtual instructions:
- cond. jump TARG1      // VPC: L
- cond. jump TARG2      // VPC: VL2
- cond. jump TARG3      // VPC: VL3
- cond. jump TARG4      // VPC: VL4
Iteration 1: PC = L, GHR = 1111
- BTB: TARG1
- Direction predictor: not taken
- Next iteration
15VPC Prediction Process-II
Real instruction:
  call R1               // PC: L
Virtual instructions:
- cond. jump TARG1      // VPC: L
- cond. jump TARG2      // VPC: VL2
- cond. jump TARG3      // VPC: VL3
- cond. jump TARG4      // VPC: VL4
Iteration 2: VPC = VL2, VGHR = 1110
- BTB: TARG2
- Direction predictor: not taken
- Next iteration
16VPC Prediction Process-III
Real instruction:
  call R1               // PC: L
Virtual instructions:
- cond. jump TARG1      // VPC: L
- cond. jump TARG2      // VPC: VL2
- cond. jump TARG3      // VPC: VL3
- cond. jump TARG4      // VPC: VL4
Iteration 3: VPC = VL3, VGHR = 1100
- BTB: TARG3
- Direction predictor: taken
- Predicted target: TARG3
17VPC Prediction Algorithm
- Access the conditional branch predictor and the BTB with VPCA and VGHR
- Compute VPCA and VGHR for the next iteration:
- VPCA = PC XOR HASHVAL[iter]
- VGHR = VGHR << 1
- Predicted not taken: move to the next iteration
- Predicted taken: use the target in the BTB as the target of the indirect branch
- Give up and stall if the iteration count reaches MAX_ITER or on a BTB miss
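The loop on this slide can be sketched in software as follows. This is a toy model under stated assumptions, not the hardware design: DirectionPredictor and Btb are simple hash-map stand-ins, the history is 4 bits wide to mirror the 1111 / 1110 / 1100 example on the preceding slides, and kMaxIter, kHashVal, and all type and function names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

using Addr = std::uint64_t;

// Assumed parameters; the slides give no concrete values.
constexpr std::size_t kMaxIter = 4;
constexpr std::array<Addr, kMaxIter> kHashVal = {{0x0, 0x5a5a, 0x3c3c, 0x9696}};
constexpr Addr kGhrMask = 0xF;  // 4-bit history, as in the slide example

// Toy conditional branch predictor: 2-bit counters indexed by a hash of
// the (virtual) PC and the (virtual) global history.
struct DirectionPredictor {
    std::unordered_map<Addr, int> counters;
    static Addr index(Addr vpca, Addr vghr) { return vpca ^ (vghr << 20); }
    bool predict_taken(Addr vpca, Addr vghr) const {
        auto it = counters.find(index(vpca, vghr));
        return it != counters.end() && it->second >= 2;
    }
};

// Toy BTB: maps a (virtual) PC to one stored target.
struct Btb {
    std::unordered_map<Addr, Addr> targets;
};

struct Prediction {
    bool stall;   // true: no prediction was made
    Addr target;  // meaningful only when stall is false
};

Prediction vpc_predict(const DirectionPredictor& bp, const Btb& btb,
                       Addr pc, Addr ghr) {
    assert(kHashVal[0] == 0);  // the first virtual branch uses the real PC
    Addr vghr = ghr & kGhrMask;
    for (std::size_t iter = 0; iter < kMaxIter; ++iter) {
        const Addr vpca = pc ^ kHashVal[iter];
        auto hit = btb.targets.find(vpca);
        if (hit == btb.targets.end()) {
            return Prediction{true, 0};            // BTB miss: give up and stall
        }
        if (bp.predict_taken(vpca, vghr)) {
            return Prediction{false, hit->second}; // use the BTB target
        }
        // Predicted not taken: shift a not-taken outcome into the history
        // and move on to the next virtual branch.
        vghr = (vghr << 1) & kGhrMask;
    }
    return Prediction{true, 0};                    // MAX_ITER reached: stall
}
```

With targets stored for the first three virtual PCs and only the third virtual branch predicted taken under history 1100, vpc_predict returns the third target, which is the scenario walked through on the three preceding slides.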
18VPC Training Algorithm
- An iterative process when an indirect branch is retired (not on the critical path)
- Update the conditional branch predictor:
- Virtual branch has the correct target: Taken
- Virtual branch has a wrong target: Not-taken
- Update replacement policy bits of the correct target in the BTB
- Insert the correct target into the BTB (if it is not already there):
- Conditional branch predictor: taken
- Replace the least frequently used target (LFU)
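A software sketch of this training step is shown below. It is one possible reading of the bullets, not the authors' implementation. The assumptions are: toy hash-map structures, 2-bit counters that start weakly not taken, a per-entry use count as the replacement policy bits, a 4-bit history, and a victim choice that prefers an empty (BTB-miss) slot before falling back to the least frequently used target. All names and constants are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

using Addr = std::uint64_t;

constexpr std::size_t kMaxIter = 4;  // assumed
constexpr std::array<Addr, kMaxIter> kHashVal = {{0x0, 0x5a5a, 0x3c3c, 0x9696}};
constexpr Addr kGhrMask = 0xF;       // assumed 4-bit history

struct DirectionPredictor {
    std::unordered_map<Addr, int> counters;  // 2-bit counters, range 0..3
    static Addr index(Addr vpca, Addr vghr) { return vpca ^ (vghr << 20); }
    bool predict_taken(Addr vpca, Addr vghr) const {
        auto it = counters.find(index(vpca, vghr));
        return it != counters.end() && it->second >= 2;
    }
    void train(Addr vpca, Addr vghr, bool taken) {
        // New counters start at 1 (weakly not taken).
        int& c = counters.emplace(index(vpca, vghr), 1).first->second;
        if (taken) { if (c < 3) ++c; } else { if (c > 0) --c; }
    }
};

struct BtbEntry { Addr target; unsigned uses; };  // uses: replacement bits
struct Btb { std::unordered_map<Addr, BtbEntry> entries; };

// Called when the indirect branch at pc retires with a known correct target.
void vpc_train(DirectionPredictor& bp, Btb& btb, Addr pc, Addr ghr,
               Addr correct_target) {
    Addr vghr = ghr & kGhrMask;
    bool have_victim = false, victim_is_miss = false;
    Addr victim_vpca = 0, victim_vghr = 0;
    unsigned victim_uses = 0;
    for (std::size_t iter = 0; iter < kMaxIter; ++iter) {
        const Addr vpca = pc ^ kHashVal[iter];
        auto it = btb.entries.find(vpca);
        const bool miss = (it == btb.entries.end());
        if (!miss && it->second.target == correct_target) {
            bp.train(vpca, vghr, true);   // correct target: train taken
            ++it->second.uses;            // update replacement policy bits
            return;
        }
        if (!miss) bp.train(vpca, vghr, false);  // wrong target: not taken
        // Remember where to insert: first empty slot, else the LFU entry.
        const unsigned uses = miss ? 0u : it->second.uses;
        if (!victim_is_miss && (miss || !have_victim || uses < victim_uses)) {
            have_victim = true; victim_is_miss = miss;
            victim_vpca = vpca; victim_vghr = vghr; victim_uses = uses;
        }
        vghr = (vghr << 1) & kGhrMask;
    }
    assert(have_victim);
    btb.entries[victim_vpca] = BtbEntry{correct_target, 1u};  // insert target
    bp.train(victim_vpca, victim_vghr, true);                 // train taken
}
```

Training the earlier virtual branches as not taken and the matching one as taken is what lets the prediction loop later walk past the wrong targets and stop at the right one.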
19Hardware Cost and Complexity
[Figure: hardware block diagram. Signal labels: Taken/Not Taken, Predict?, Direct/Indirect, Target Address.]
20Outline
- Background and Motivation
- VPC Prediction
- Results
- Conclusion
21Simulation Methodology
- Pin-based x86 Simulator
- Processor configuration
- 4K-entry BTB
- 64KB perceptron conditional branch predictor
- Minimum 30-cycle branch misprediction penalty
- 8-wide, 512-entry instruction window
- Less aggressive processor (in the paper)
- Gshare, O-GEHL conditional branch predictors
- Indirect branch intensive benchmarks
- 5 SPEC CPU2000, 5 SPEC CPU2006, 2 other C++ benchmarks
- IBM server benchmarks (OLTP) (in the paper)
22VPC MPKI
23VPC Performance
24Different Direction Predictors
Conditional branch accuracy (%): 98, 98.3, 99
Improving conditional branch prediction accuracy also improves indirect branch prediction accuracy!
25VPC vs. Static Devirtualization
- Advantages of static devirtualization:
- Enables other compiler optimizations (function inlining)
- Can reduce the number of mispredictions
- Disadvantages/Limitations:
- Not all indirect branches can be statically devirtualized
- Extensive static analysis/profiling
- Lack of adaptivity to run-time input set and phase behavior
- VPC prediction can be used with statically devirtualized binaries
- 10% improvement on top of static devirtualization
26Outline
- Background and Motivation
- VPC Prediction
- Results
- Conclusion
27Conclusion
- VPC dynamically converts indirect branches into multiple conditional branches and uses the existing conditional branch prediction hardware
- VPC prediction reduces the branch misprediction penalty without significant extra hardware storage:
- Baseline: 26% IPC improvement
- O-GEHL: 31% IPC improvement
- VPC can be an enabler, encouraging programmers to use object-oriented programming styles
28Thank you!
29VPC vs. Cascaded IBP
30VPC vs. Other Indirect BP
TTC: Chang et al. (96); Cascaded: Driesen and Holzle (98)
31Iterative prediction
- It doesn't hurt performance significantly
- Results
- Why?
- Most predictions complete within a few iterations.
- Results
32VPC Hit Iteration Counter
33Can the BTB be pipelined?
- Yes
- The next iteration of VPC can be started without knowing the outcome of the previous iteration in the pipeline.
- Consecutive VPC prediction iterations can simply be pipelined.
- If an iteration turns out not to be needed, simply discard its prediction.
34Is 4K-entry BTB too large?
- Pentium 4 has a 4K-entry BTB
- IBM Z series (z990) has an 8K-entry BTB
- AMD Athlon and Hammer have 2K-entry BTBs
35BTB Size Effects
36VPC Prediction Accuracy
37Target Distribution
38VPC vs. Tagged Target Cache
39VPC Prediction Delay Effects
40VPC with O-GEHL BP
41VPC with a Less Aggressive Processor
42Server Benchmarks
43Server Benchmarks (VPC vs. TTC)
44VPC Prediction vs. Compiler-Based Devirtualization (With TTC)
45Conditional Br. Prediction Effects
VPC prediction reduces the accuracy of conditional branch direction prediction, but only slightly!
46Indirect Branch Mispredictions
47VPC Prediction with Static Devirtualization
- VPC prediction can be used with statically devirtualized binaries.
- Not all indirect branches could be devirtualized
48VPC Training Correct Prediction
At retirement, real instruction:
  call R1               // PC: L
Known: correctly predicted, predicted iter = 3
Update the BTB replacement counter
49VPC Training Misprediction
At retirement, real instruction:
  call R1               // PC: L
Known: mispredicted, correct target address
Update the BTB replacement counter
50VPC Training Misprediction
At retirement, real instruction:
  call R1               // PC: L
Known: mispredicted, correct target address
No target (the correct target is not found in the BTB)
51VPC Training Misprediction
At retirement, real instruction:
  call R1               // PC: L
Known: mispredicted, correct target address
Replacement: insert the correct target into the BTB and train the direction predictor as taken
52Does VPC need an extra BTB port?
- No
- A read from the BTB is only needed when a branch is mispredicted.
- 95% of branches are correctly predicted with VPC.
- The read is performed only when there is an available BTB port.