Title: Dynamic Branch Prediction
1. Dynamic Branch Prediction
Jonathan Creekmore, Nicolas Spiegelberg
2. Overview
- Branch Prediction Techniques
- Context Switching
- Compression of Branch Tables
- Simulation
- Hardware Model
- Results
- Analysis
3. The Case for Branch Prediction
- Multiple instructions are issued at one time
- Between 15 and 20
- Branches occur roughly every 5 instructions
- if, while, for, function calls, etc.
- Stalling the pipeline is unacceptable
- It loses all the advantage of multiple instruction issue
4. Context Switch Time
- Causes program execution to be paused
- The state of the program is saved
- A new program is executed
- Eventually, the original program begins executing again
- Not all of the CPU state is saved
- For example, the branch predictor tables
5. Context Switch Time
- There is only 1 set of branch predictor state
- A context switch causes a new application to use the previous application's branch predictor state
- This degrades performance for all applications
- Solution: save the state of the branch predictor at context switch time
6. Saving the Branch State Table
- Even simple branch predictors have a large number of bits
- Storing and restoring the branch predictor should not take too long
- The gain of storing/restoring is lost if it takes longer than the warm-up time of the branch predictor
7. Compression
- Compression is the key
- It requires less storage
- It needs to be done carefully
- Some lossless compression schemes can inflate the number of bits
- Luckily, lossy compression is acceptable
8. Semi-Lossy Compression
- Applies to 2-bit predictors
- The key is to store just the taken/not-taken state
- Ignores strong/weak
9. Semi-Lossy Decompression
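The semi-lossy round trip for 2-bit counters can be sketched as follows. This is a minimal sketch: restoring each entry to the *weak* state of its stored direction is an assumption, since the slides do not say which strength the decompressor restores.

```python
# Sketch of semi-lossy compression for 2-bit saturating counters.
# Counter states: 0 = strong not-taken, 1 = weak not-taken,
#                 2 = weak taken,       3 = strong taken.
# Compression keeps only the taken/not-taken direction (the high bit),
# halving the storage; the strong/weak distinction is discarded.

def compress_2bit(counters):
    """Reduce each 2-bit counter to its 1-bit direction."""
    return [c >> 1 for c in counters]

def decompress_2bit(directions):
    """Expand each direction bit back to a 2-bit counter.
    Restoring to the weak state of the stored direction is an
    assumption; the slides do not specify this choice."""
    return [2 if d else 1 for d in directions]

table = [3, 2, 0, 1]                 # strong-T, weak-T, strong-NT, weak-NT
packed = compress_2bit(table)        # -> [1, 1, 0, 0], half the bits
restored = decompress_2bit(packed)   # -> [2, 2, 1, 1], direction preserved
```

Every restored entry predicts the same direction as the original, which is why this scheme is only "semi" lossy.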
10. Lossy Compression
- Branch prediction is just an educated guess
- A higher compression ratio can be achieved if some information is lost
- Majority rules
- Used by the correlating branch predictor
11. Lossy Compression
(Diagram: 4x majority-vote compression of a column of taken (T) / not-taken (NT) entries: T, T, NT, NT, NT, T, NT, NT, T, NT, NT)
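The majority-rule scheme might look like this in outline. This is a sketch: the group size of 4 is taken from the 4x figure, and resolving ties toward taken is an assumption.

```python
# Sketch of "majority rules" lossy compression: each group of
# direction bits collapses to a single bit recording whether the
# majority of the group was taken.

def majority_compress(directions, group=4):
    """Collapse each group of 1-bit directions to its majority vote."""
    votes = []
    for i in range(0, len(directions), group):
        chunk = directions[i:i + group]
        # Ties go to taken here -- an arbitrary choice, since the
        # slides do not specify tie handling.
        votes.append(1 if 2 * sum(chunk) >= len(chunk) else 0)
    return votes

bits = [1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0]   # 1 = taken, 0 = not taken
majority_compress(bits)                        # -> [1, 0, 0], a 4x reduction
```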
12. Lossy Decompression
- Reinitialize all elements for an address to the stored value
- Best case: all elements are correct
- Worst case: 50% of elements are correct
- Remember: branch predictors are just educated guesses
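Decompression, sketched below, simply broadcasts each stored majority bit over its whole group; the group size of 4 is assumed from the 4x figure, and the names are illustrative.

```python
# Sketch of lossy decompression: every element of a group is
# reinitialized to the single stored majority-vote bit.

def majority_decompress(votes, group=4):
    """Broadcast each stored bit back over its whole group."""
    restored = []
    for v in votes:
        restored.extend([v] * group)
    return restored

# Best case: the group was unanimous, so all 4 entries come back correct.
assert majority_decompress([1]) == [1, 1, 1, 1]

# Worst case: a 2-2 split group -- only half the entries come back correct.
original = [1, 1, 0, 0]
restored = majority_decompress([1])            # the stored vote was "taken"
correct = sum(o == r for o, r in zip(original, restored))   # -> 2 of 4
```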
13. Simulation
- Modified SimpleScalar's sim-bpred to support context switching
- It is not necessary to actually switch between programs
- On a context switch, corrupt the branch predictor table according to a dirty percentage to simulate another program running
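The dirty-percentage corruption described above could be sketched like this; the function and variable names are illustrative, not SimpleScalar's actual code.

```python
import random

# Sketch of the simulation trick above: instead of actually running a
# second program, overwrite a fraction (the "dirty percentage") of the
# branch predictor table with random 2-bit states at each simulated
# context switch.

def corrupt_table(table, dirty_pct, rng=None):
    """Randomize dirty_pct of the table's 2-bit entries in place."""
    rng = rng or random.Random()
    n_dirty = int(len(table) * dirty_pct)
    for idx in rng.sample(range(len(table)), n_dirty):
        table[idx] = rng.randrange(4)   # random 2-bit counter state
    return table

predictor = [3] * 2048                  # a warmed-up table, all strong-taken
corrupt_table(predictor, dirty_pct=0.25)
```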
14. Simulation
- Testing compression/decompression becomes simple
- Instead of corrupting the branch predictor table, replace entries with their value after compression/decompression
- Testing with:
- 2-bit semi-lossy compression
- 4-bit lossy compression
- 8-bit lossy compression
15. Hardware Model
- Compression and decompression blocks are fully pipelined
- Compression and decompression blocks can handle n bits of compressed data at a time
- Compression and decompression occur simultaneously
16. Hardware Model
- Utilize data independence
- Compress 128 bits into 64 bits at one time
- Pipeline overhead should be minimal compared to the clock cycle savings
17. Programs Simulated
- Several SPEC2000 integer (CINT2000) programs were simulated
- 164.gzip: Compression
- 175.vpr: FPGA place and route
- 181.mcf: Combinatorial optimization
- 197.parser: Word processing
- 256.bzip2: Compression
18. Predictor Types
- 2048 entry bimodal predictor (4096 bits)
- 4096 entry bimodal predictor (8192 bits)
- 1024 entry two-level predictor with 4-bit history size (16384 bits)
- 4096 entry two-level predictor with 8-bit history size (1048576 bits)
- 8192 entry two-level predictor with 8-bit history size (2097152 bits)
19-33. Results (charts not reproduced): three slides each for the 2048 entry bimodal predictor, the 4096 entry bimodal predictor, the 1024 entry two-level predictor with 4-bit history size, the 4096 entry two-level predictor with 8-bit history size, and the 8192 entry two-level predictor with 8-bit history size.
34. Timing Comparison
Miss penalty: 10 clock cycles
Bandwidth: 64 bits per clock cycle
35. Timing Equations
General timing equation (equation not reproduced)
Special case for a ratio of 0 (equation not reproduced)
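A plausible reconstruction of the general equation, under the assumption that the cost of a save/restore cycle is the state transfer time plus the penalty of the extra mispredictions introduced by lossy restoration, would be:

$$T_{\text{total}} = 2\,\frac{B \cdot r}{W} + M \cdot P$$

where $B$ is the predictor size in bits, $r$ the compression ratio, $W$ the bandwidth in bits per clock cycle, $P$ the miss penalty in clock cycles, and $M$ the number of extra mispredictions after decompression; the factor of 2 covers one save plus one restore. For a ratio of 0 no state is transferred, so only the misprediction cost of a cold predictor remains. All symbols here are assumptions, not taken from the original equations.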
36. Timing Comparison
Miss penalty: 15 clock cycles
Bandwidth: 64 bits per clock cycle
37. Timing Comparison
Miss penalty: 10 clock cycles
Bandwidth: 128 bits per clock cycle
38. Summary
- Dynamic branch prediction is necessary for modern high-performance processors
- Context switches reduce the effectiveness of dynamic branch prediction
- Naïvely saving the branch predictor state is costly
39. Summary
- Compression can be used to reduce the cost of saving branch predictor state
- Higher compression ratios improve the fixed save/restore time at the cost of increasing the number of mispredictions
- For low-frequency context switches, this yields an improvement in performance
40. Questions