Title: Performance-Aware Speculation Control using Wrong Path Usefulness Prediction
1. Performance-Aware Speculation Control using Wrong Path Usefulness Prediction
- Chang Joo Lee
- Hyesoon Kim
- Onur Mutlu
- Yale N. Patt
HPS Research Group, University of Texas at Austin
School of Computer Science, Georgia Institute of Technology
Microsoft Research
2. Outline
- Motivation
- Mechanism
- Experimental Evaluation
- Conclusion
3. Fetch Gating (Pipeline Gating)
- Proposed by Manne et al. [ISCA'98]
- Stops fetching instructions on the wrong path to save energy.
- Assumes wrong-path instructions do not contribute to performance but consume energy.
- Various fetch gating mechanisms
  - Baniasadi and Moshovos [ISLPED'01], Karkhanis et al. [ISLPED'02], Aragon et al. [HPCA'03], Buyuktosunoglu et al. [GLSVLSI'03], Collins et al. [MICRO'04]
4. Limitations of Previous Mechanisms
- Hardware complexity
  - Branch confidence estimator, changes to critical/power-hungry structures.
  - Additional hardware can offset the energy savings from fetch gating.
- Assumption
  - Wrong-path execution consumes energy but is useless for performance.
5. Is Wrong Path Execution Really Useless?
Performance of most benchmarks increases with perfect fetch gating.
mcf: performance degrades by 30% and energy consumption increases by 15%.
parser: energy consumption decreases by 28% but performance degrades by 5%.
6. Why Does Performance Degrade with Perfect Fetch Gating?
[Figure labels: MPKI 36.6, MPKI 1.5]
mcf: almost all wrong-path L2 fills are used, memory intensive (MPKI 36.6) -> 30% performance degradation with perfect fetch gating
parser: 37% used wrong-path fills, 14% unused wrong-path fills -> 5% performance degradation with perfect fetch gating
Wrong-path execution can prefetch useful data [Butler, Thesis'93; Pierce and Mudge, IPPS'94, MICRO'96; Mutlu et al., IEEE TC'05]
7. Why Can Wrong Path Execution Be Useful?
- From mcf
- Hammock structure within a frequently executed loop
- BR in BB2 is frequently mispredicted
- Since memory latency is large, the wrong-path prefetching benefit can be significant
- Taking wrong-path usefulness into account is important
[Figure: control-flow graph of the hammock. BB2 ends in the mispredicted branch (BR BB4), with taken and not-taken successors BB3 and BB4. BB3: Load A, Load B, .., JMP BB5. BB4: Load A, Load B, ... BB5: Load C, ... Loads are annotated "L2 cache miss" or "Cache hit", and misprediction recovery is marked.]
8. Outline
- Motivation
- Mechanism
- Experimental Evaluation
- Conclusion
9. Our Solution: Performance-Aware Speculation Control
- Hardware complexity -> simple, low-cost fetch gating mechanism
- Wrong-path usefulness -> low-cost Wrong Path Usefulness Predictor (WPUP)
[Figure: block diagram of Performance-Aware Speculation Control, with labels Fetch Engine, Lookup, WPUP, Useful, Fetch Gating, Branch Count, and Gate Enable.]
Fetch gate only when wrong-path execution is useless.
10. Our Fetch Gating Mechanism
- Branch-count based mechanism
  - More branches -> higher chance of misprediction.
  - Fetch gate if (# of branches) > threshold
- Mispredictions show phase behavior.
  - Threshold is determined by branch prediction accuracy over a certain period.
  - Higher accuracy -> higher threshold
- No need for complex logic (e.g., confidence estimator)
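The branch-count rule above can be sketched in a few lines. This is an illustrative model, not the authors' hardware: it assumes "# of branches" means branches that have been fetched but not yet resolved, and the class and method names are hypothetical.

```python
class BranchCountGate:
    """Toy model of branch-count based fetch gating.

    Assumption: the counted branches are those fetched but not yet resolved.
    """

    def __init__(self, threshold):
        self.threshold = threshold
        self.in_flight = 0

    def on_branch_fetched(self):
        self.in_flight += 1

    def on_branch_resolved(self):
        # Never let the count go negative in this toy model.
        self.in_flight = max(0, self.in_flight - 1)

    def should_gate(self):
        # Gate fetch only when the branch count exceeds the threshold.
        return self.in_flight > self.threshold


gate = BranchCountGate(threshold=3)
for _ in range(4):
    gate.on_branch_fetched()
gated_at_four = gate.should_gate()    # 4 > 3
gate.on_branch_resolved()
gated_at_three = gate.should_gate()   # 3 > 3 does not hold
```

A single counter and comparator is all this model needs, which is the point of the "no complex logic" bullet.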
11. Two WPUP Mechanisms
- Branch PC-based WPUP (fine-grained)
- Phase-based WPUP (coarse-grained)
Can be combined with other fetch gating mechanisms.
12. Branch PC-based WPUP
- Basic idea
  - Identifies and records conditional branch PCs that lead to useful wrong-path memory references
  - If the fetched branch is recorded as useful, do not fetch gate
13. Branch PC-based WPUP
- Implementation
  - Fetch Engine
    - Latest Branch PC Register (LBPC, 16 bits)
    - LBPC value carried through the pipeline
  - Miss Status Holding Registers (MSHR)
    - Branch ID field (BID, 10 bits)
    - Already used for branch misprediction recovery
    - Branch PC field (BPC, 16 bits)
    - Wrong Path field (WP, 1 bit)
  - WPUP Cache
    - 4-way set-associative, no data store, LRU
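A tag-only, 4-way set-associative cache with LRU replacement, as listed above, might look like the following sketch. The number of sets, the modulo indexing, and the method names are assumptions for illustration; the slides give only the associativity, the replacement policy, and the 16-bit branch PC width.

```python
class WPUPCache:
    """Tag-only set-associative cache of 'useful' branch PCs (no data store)."""

    def __init__(self, num_sets=16, ways=4):
        # num_sets is a hypothetical parameter; the slides do not state it.
        self.num_sets = num_sets
        self.ways = ways
        # Each set is a list of tags ordered from LRU (front) to MRU (back).
        self.sets = [[] for _ in range(num_sets)]

    def _set_for(self, bpc):
        return self.sets[(bpc & 0xFFFF) % self.num_sets]

    def lookup(self, bpc):
        """Return True on a hit and promote the tag to MRU."""
        tag, s = bpc & 0xFFFF, self._set_for(bpc)
        if tag in s:
            s.remove(tag)
            s.append(tag)
            return True
        return False

    def insert(self, bpc):
        """Record a branch PC, evicting the LRU tag if the set is full."""
        tag, s = bpc & 0xFFFF, self._set_for(bpc)
        if tag in s:
            s.remove(tag)
        elif len(s) == self.ways:
            s.pop(0)
        s.append(tag)


cache = WPUPCache(num_sets=1, ways=4)
for pc in (0x10, 0x20, 0x30, 0x40):
    cache.insert(pc)
hit_first = cache.lookup(0x10)      # hit; 0x10 becomes MRU
cache.insert(0x50)                  # evicts 0x20, now the LRU tag
hit_evicted = cache.lookup(0x20)
```

Because there is no data store, a hit carries only one bit of information: this branch PC previously led to a useful wrong-path memory reference.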
14. Branch PC-Based WPUP (Training)
[Figure: training example on the hammock. BR 2 in BB2 (PC 2, BID 2) is mispredicted, so LBPC = PC 2. The wrong path executes Load A and Load B in BB3 and Load C in BB5, each carrying PC 2 and BID 2. Their L2 cache misses allocate MSHR entries with fields Addr, BID, BPC, WP: entries A, B, and C each hold BID 2 and BPC PC 2, with WP shown as 0 and then 1. During misprediction recovery, BID 2 arrives from the branch unit. Load A in BB4 then accesses the MSHR.]
MSHR hit: the wrong path was useful. BPC (PC 2) is stored in the WPUP cache.
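The training flow in this example can be approximated as below. It is a simplified sketch under stated assumptions: MSHR entries are marked wrong-path when their BID equals the mispredicted branch's BID (as in the figure), any later request that hits a WP-marked entry counts as a correct-path use, and a plain Python set stands in for the WPUP cache. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MSHREntry:
    addr: int
    bid: int          # branch ID carried by the load
    bpc: int          # latest branch PC (LBPC) carried by the load
    wp: bool = False  # set once the entry is known to be wrong-path


class WPUPTrainer:
    def __init__(self):
        self.mshr = {}                   # addr -> MSHREntry
        self.useful_branch_pcs = set()   # stand-in for the WPUP cache

    def on_l2_miss(self, addr, bid, lbpc):
        entry = self.mshr.get(addr)
        if entry is None:
            self.mshr[addr] = MSHREntry(addr, bid, lbpc)
        elif entry.wp:
            # MSHR hit on a wrong-path entry: the wrong path was useful.
            self.useful_branch_pcs.add(entry.bpc)

    def on_misprediction_recovery(self, mispredicted_bid):
        # Assumption: matching BIDs identify wrong-path requests.
        for entry in self.mshr.values():
            if entry.bid == mispredicted_bid:
                entry.wp = True


PC2, A, B, C = 0x2000, 0xA0, 0xB0, 0xC0
trainer = WPUPTrainer()
for addr in (A, B, C):                   # wrong-path loads after BR 2
    trainer.on_l2_miss(addr, bid=2, lbpc=PC2)
trainer.on_misprediction_recovery(2)     # BID 2 from the branch unit
trainer.on_l2_miss(A, bid=3, lbpc=0x3000)  # correct-path Load A in BB4
trained = PC2 in trainer.useful_branch_pcs
```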
15. Branch PC-Based WPUP (Prediction)
[Figure: prediction example on the same hammock. BR 2 in BB2 sets LBPC = PC 2 and is mispredicted; "Fetch gate?" is asked on both the taken and not-taken paths during wrong-path execution. The WPUP cache (fields: Addr, LRU) holds PC 2.]
Hit: do not fetch gate.
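Put together, the prediction step reduces to one check that overrides the branch-count gate. The function below is a minimal sketch; its name and parameters are hypothetical, and a Python set stands in for the WPUP cache lookup.

```python
def should_fetch_gate(branch_count, threshold, lbpc, useful_branch_pcs):
    """Gate fetch only if the branch count is high AND the latest branch
    is not recorded as leading to useful wrong-path memory references."""
    if lbpc in useful_branch_pcs:   # WPUP hit: do not fetch gate
        return False
    return branch_count > threshold


useful = {0x2000}  # PC 2 was recorded as useful during training
gate_on_useful_branch = should_fetch_gate(10, 3, 0x2000, useful)
gate_on_other_branch = should_fetch_gate(10, 3, 0x2400, useful)
gate_below_threshold = should_fetch_gate(2, 3, 0x2400, useful)
```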
16. Phase-based WPUP
- Basic idea
  - Predict whether the current phase will provide useful wrong-path memory references
  - If so, do not fetch gate
17. Phase-based WPUP
- Implementation
  - Wrong Path Usefulness Counter (WPUC, 5 bits)
    - Incremented for each useful wrong-path memory reference
    - Reset periodically
  - Do not fetch gate if WPUC > threshold
  - BPC fields and WPUP cache are not needed
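The phase-based predictor is small enough to model directly. In the sketch below the counter width (5 bits) comes from the slide; the saturating behavior, the threshold value, and the reset period are assumptions chosen only for illustration.

```python
class PhaseWPUP:
    WPUC_MAX = (1 << 5) - 1  # 5-bit counter; saturation is an assumption

    def __init__(self, threshold=4, period=100_000):
        self.threshold = threshold  # hypothetical value
        self.period = period        # hypothetical reset interval, in cycles
        self.wpuc = 0
        self.cycles = 0

    def on_useful_wrong_path_reference(self):
        self.wpuc = min(self.wpuc + 1, self.WPUC_MAX)

    def tick(self):
        """Advance one cycle and reset the counter at each period boundary."""
        self.cycles += 1
        if self.cycles >= self.period:
            self.cycles = 0
            self.wpuc = 0

    def gating_allowed(self):
        # Do not fetch gate if WPUC > threshold.
        return self.wpuc <= self.threshold


phase = PhaseWPUP(threshold=2, period=10)
allowed_at_start = phase.gating_allowed()
for _ in range(40):
    phase.on_useful_wrong_path_reference()
saturated_value = phase.wpuc
allowed_when_useful = phase.gating_allowed()
for _ in range(10):
    phase.tick()
allowed_after_reset = phase.gating_allowed()
```

Compared with the branch PC-based variant, this keeps one counter instead of per-branch state, trading granularity for cost.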
18. Outline
- Motivation
- Mechanism
- Experimental Evaluation
- Conclusion
19. Simulation Methodology
- Alpha ISA execution-driven simulator
- Baseline processor configuration
  - 2 GHz, 8-wide issue, out-of-order, 128-entry ROB
  - Hybrid branch predictor (64K-entry gshare and 64K-entry PAs)
  - 11 stages (minimum branch misprediction penalty)
  - 1 MB, 8-way unified L2 cache
  - 32 L2 MSHRs, 300-cycle memory latency
  - Stream prefetcher
- Wattch power model: 100 nm, 1.2 V technology
- Manne's fetch gating
  - Gating threshold: 3 low-confidence branches
  - JRS confidence estimator (4K-entry, 4-bit MDC)
  - Tuned for the best energy-delay product
- Branch count-based fetch gating
BP Acc (%)   100-99  99-97  97-95  95-93  93-90  90-85  85-0
Threshold    18      16     13     12     11     7      3
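The accuracy-to-threshold table above maps naturally onto a lookup function. The thresholds are taken from the table; how the shared range endpoints (for example exactly 99%) are assigned is not stated, so this sketch assumes each range includes its lower bound. The function and constant names are hypothetical.

```python
# (lower bound of accuracy range in %, gating threshold), highest range first.
THRESHOLD_TABLE = [
    (99, 18), (97, 16), (95, 13), (93, 12), (90, 11), (85, 7), (0, 3),
]


def gating_threshold(accuracy_pct):
    """Return the branch-count threshold for a measured prediction accuracy."""
    if not 0 <= accuracy_pct <= 100:
        raise ValueError("accuracy must be a percentage in [0, 100]")
    for lower_bound, threshold in THRESHOLD_TABLE:
        if accuracy_pct >= lower_bound:
            return threshold
    raise AssertionError("unreachable: the last range starts at 0")


high_accuracy = gating_threshold(99.5)   # falls in the 100-99 range
mid_accuracy = gating_threshold(96.0)    # falls in the 97-95 range
low_accuracy = gating_threshold(50.0)    # falls in the 85-0 range
```

Higher accuracy yields a higher threshold, so fetch is gated less aggressively in phases where the predictor is doing well.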
20. Branch-Count Based Fetch Gating
Performance and energy savings are higher than with Manne's fetch gating.
Both Manne's and our fetch gating degrade the performance of mcf and parser.
21. WPUP Mechanisms
Improve performance and energy savings compared to Manne's fetch gating.
Improve the performance of mcf and parser.
22. Hardware Cost
Performance-Aware Speculation Control vs. Manne's Fetch Gating
                    Hardware cost
                    Fetch Gating   WPUP    Total
Manne               2049B          -       2049B
FG-BR/PC-WPUP       6B             260B    266B
FG-BR/PHASE-WPUP    6B             45B     51B
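As a quick check on the table, the arithmetic can be reproduced as follows. The byte counts are from the table; attributing 2048 of Manne's 2049 bytes to the 4K-entry, 4-bit JRS estimator from the methodology slide is an inference, and the slides do not say what the remaining byte holds.

```python
# JRS confidence estimator: 4K entries x 4-bit counters.
jrs_bytes = (4 * 1024 * 4) // 8

# (fetch gating bytes, WPUP bytes) per configuration, from the table.
costs = {
    "Manne": (2049, 0),
    "FG-BR/PC-WPUP": (6, 260),
    "FG-BR/PHASE-WPUP": (6, 45),
}
totals = {name: fg + wpup for name, (fg, wpup) in costs.items()}

# Storage of each scheme relative to Manne's fetch gating.
relative = {name: total / totals["Manne"] for name, total in totals.items()}
for name, ratio in relative.items():
    print(f"{name}: {totals[name]} B ({ratio:.1%} of Manne)")
```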
23. Comparison with Manne's Fetch Gating
WPUPs improve the performance and energy efficiency of Manne's fetch gating:
2.5% less performance degradation, 1.0% more energy savings
24. Energy-Delay Product
Improves energy-delay product (by 2.6% compared to Manne's fetch gating)
25. Conclusion
- Performance-Aware Speculation Control
- Branch count-based fetch gating
  - Simple and low cost.
- Introduced Wrong Path Usefulness Prediction
  - Recovers performance loss due to fetch gating by executing useful wrong-path instructions.
  - Can be combined with other fetch gating mechanisms.
- Reduces performance loss due to fetch gating and also saves energy.
26. Questions?