Title: TurboROB: A Low Cost Checkpoint/Restore Accelerator
1. TurboROB: A Low Cost Checkpoint/Restore Accelerator
Patrick Akl¹ and Andreas Moshovos
AENAO Research Group, Department of Electrical and Computer Engineering, University of Toronto
¹ Now with AMD/ATI
2. What Happens on a Branch Misprediction?
[Figure: timeline of a misprediction. Predict a branch outcome; execution follows the predicted path rather than the correct path; misprediction discovered; recover processor state and redirect fetch; resume execution.]
- We wish to make the recovery fast
3. Recovery Mechanisms Overview
- ROB
  - Buffer all changes
  - Slow
- Instantaneous checkpoints
  - Snapshot before speculating
  - Fast
  - Problem: can't have enough checkpoints
- Checkpoint prediction
  - Allocate the few checkpoints judiciously
- Speculation control
  - Sometimes deeper speculation means higher recovery cost
  - Can hurt performance
  - Throttle speculation
4. TurboROB Overview
- Complements or Replaces Existing Mechanisms
  - ROB: recover at any point
  - TurboROB: recover only at frequent points
- Improves performance for most programs
  - Misprediction performance penalty reduced by 28% on average
- BranchTap comes for free
- Very simple to implement
- Better than more accurate checkpoint predictors
5. Outline
- Background
- BranchTap
- Methodology and Results
- Summary
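6. State Recovery Example: Register Alias Table
[Figure: the RAT has one entry per architectural register (# arch. regs entries, indexed with lg(# arch. regs) bits), and each entry names a physical register. The entries shown hold p1, p2 and p3; the animated entry steps through p1, p4, p5 and back to p4.]
Original Code
A: add r1, r2, 100
B: breq r1, E
C: sub r1, r2, r2
Renamed Code
A: add p4, p2, 100
B: breq p4, E
C: sub p5, p2, p2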
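The renaming step on this slide can be sketched in a few lines of Python. This is an illustrative model only: the function name `rename`, the tuple layout, the initial mapping (r1 to p1, r2 to p2) and the free list (p4, p5) are assumptions chosen to reproduce the slide's renamed code, not details taken from the authors' simulator.

```python
# Hypothetical minimal renamer; names and data layout are illustrative only.
def rename(program, rat, free_list):
    """Rename (op, dest, sources) tuples using a register alias table (RAT)."""
    renamed = []
    for op, dest, sources in program:
        # Sources read the current mapping; immediates and labels pass through.
        new_sources = [rat.get(s, s) for s in sources]
        new_dest = dest
        if dest is not None:
            new_dest = free_list.pop(0)  # allocate a fresh physical register
            rat[dest] = new_dest         # later readers of dest see the new name
        renamed.append((op, new_dest, new_sources))
    return renamed


# Assumed starting state: r1 -> p1, r2 -> p2, with p4 and p5 free.
rat = {"r1": "p1", "r2": "p2"}
free_list = ["p4", "p5"]
program = [
    ("add", "r1", ["r2", 100]),   # A
    ("breq", None, ["r1", "E"]),  # B
    ("sub", "r1", ["r2", "r2"]),  # C
]
renamed = rename(program, rat, free_list)
```

After the call, the RAT entry for r1 holds p5. That is the speculative state that must be undone if branch B turns out to be mispredicted.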
7. ROB: Slow, Fine-Grain Recovery
- Each entry contains
  - Architectural destination register
  - Its previous RAT map
- Recovery steps shown on the figure (reorder buffer in program order; RAT marked INVALID during recovery)
  - 2. Locate newest instruction
  - 3. Undo RAT updates in reverse order
- Too slow: recovery latency is proportional to the number of instructions to squash
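A minimal sketch of the ROB walk described on this slide, assuming each ROB entry is a small dictionary with the two fields listed above. The helper names `dispatch` and `rob_recover` are hypothetical, and physical register reclamation is deliberately left out.

```python
# Hypothetical model of ROB-based recovery; not the authors' implementation.
def rob_recover(rob, rat, branch_index):
    """Squash every entry younger than rob[branch_index], restoring the RAT.

    Returns the number of entries walked, which stands in for recovery latency.
    """
    steps = 0
    # Walk from the newest entry back toward the branch, one entry per step.
    while len(rob) > branch_index + 1:
        entry = rob.pop()
        if entry["arch_dest"] is not None:
            rat[entry["arch_dest"]] = entry["prev_map"]  # undo the RAT update
        steps += 1
    return steps


rat = {"r1": "p1", "r2": "p2"}
rob = []


def dispatch(arch_dest, new_phys):
    """Append a ROB entry that records the previous mapping, then update the RAT."""
    prev = rat.get(arch_dest) if arch_dest is not None else None
    rob.append({"arch_dest": arch_dest, "prev_map": prev})
    if arch_dest is not None:
        rat[arch_dest] = new_phys


dispatch("r1", "p4")  # A: add
dispatch(None, None)  # B: breq (mispredicted)
dispatch("r1", "p5")  # C: sub, on the wrong path
steps = rob_recover(rob, rat, branch_index=1)
```

The loop runs once per squashed instruction, which is the proportionality the slide points out: with hundreds of wrong-path instructions in flight, the walk takes correspondingly longer.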
8. Global Checkpoints: Fast, Coarse-Grain Recovery
[Figure: reorder buffer in program order with four checkpoints; RAT marked INVALID.]
- Branch w/ a global checkpoint (GC): recovery is instantaneous
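The contrast with the ROB walk can be sketched as a snapshot and restore of the whole RAT. The class below is a hypothetical model: its name, its methods and the choice to discard the restored checkpoint together with all younger ones are assumptions made for illustration.

```python
# Hypothetical model of global checkpoints (GCs); illustrative only.
class CheckpointedRAT:
    def __init__(self, mapping, max_checkpoints):
        self.mapping = dict(mapping)
        self.max_checkpoints = max_checkpoints
        self.checkpoints = {}  # branch id -> RAT snapshot, oldest first

    def take_checkpoint(self, branch_id):
        """Snapshot the RAT before speculating; fail if no checkpoint is free."""
        if len(self.checkpoints) >= self.max_checkpoints:
            return False
        self.checkpoints[branch_id] = dict(self.mapping)
        return True

    def restore(self, branch_id):
        """Restore the snapshot in one step, independent of instructions squashed."""
        self.mapping = dict(self.checkpoints[branch_id])
        ids = list(self.checkpoints)
        # Assumption: the restored checkpoint and all younger ones are discarded.
        for younger in ids[ids.index(branch_id):]:
            del self.checkpoints[younger]


rat = CheckpointedRAT({"r1": "p4", "r2": "p2"}, max_checkpoints=1)
took_b = rat.take_checkpoint("B")  # snapshot at branch B
rat.mapping["r1"] = "p5"           # wrong-path rename
took_d = rat.take_checkpoint("D")  # a later branch finds no free checkpoint
rat.restore("B")
```

The failed second `take_checkpoint` call shows the limitation raised earlier in the talk: with only a few checkpoints, some branches go uncovered.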
9. Impact of More Checkpoints
[Figure: checkpointed RAT, concept vs. actual implementation, mapping architectural registers to physical registers.]
- More checkpoints lead to
  - Power hungry structure
  - Increased delay
- Only a few checkpoints can practically be implemented
- Cannot always cover all branches
10. Intelligent Checkpointing
- State of the art solution
- Checkpoint allocation: allocate checkpoints at hard-to-predict branches
- Checkpoint management: release checkpoints as soon as they are no longer needed
- Use few checkpoints efficiently
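The allocation and release rules above can be sketched as a small pool. This is a hypothetical model: the class name, the `low_confidence` flag supplied by the caller, and the choice to release a checkpoint when its branch resolves are assumptions, since the slide does not spell out the exact release condition.

```python
# Hypothetical checkpoint pool; illustrative only.
class CheckpointPool:
    def __init__(self, size):
        self.free = size
        self.owners = set()  # branches currently holding a checkpoint

    def on_branch(self, branch_id, low_confidence):
        """Give a checkpoint only to hard-to-predict branches, if one is free."""
        if low_confidence and self.free > 0:
            self.free -= 1
            self.owners.add(branch_id)
            return True
        return False

    def on_resolve(self, branch_id):
        """Assumed release point: the branch has resolved."""
        if branch_id in self.owners:
            self.owners.remove(branch_id)
            self.free += 1


pool = CheckpointPool(size=1)
got_b1 = pool.on_branch("B1", low_confidence=False)  # easy branch: skipped
got_b2 = pool.on_branch("B2", low_confidence=True)   # hard branch: covered
got_b3 = pool.on_branch("B3", low_confidence=True)   # hard branch: none left
pool.on_resolve("B2")
got_b4 = pool.on_branch("B4", low_confidence=True)   # reuses the released one
```

Branch B3 is the case the rest of the talk focuses on: a low confidence branch that ends up without a checkpoint.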
11. Conventional Mechanisms: Recovery Scenarios
- Misspeculation on a branch w/ a GC: direct recovery
- Misspeculation on a branch w/o a GC: indirect recovery
- With intelligent checkpointing
  - 30% of recoveries are indirect, yet they account for 75% of the performance loss
[Figure: two ROB diagrams, each with three branches (B) and one checkpoint, labeled Fast Recovery and Slow Recovery.]
13. BranchTap Motivation
[Figure: two timelines of branches (B) and checkpoints around a low confidence branch, comparing the recovery cost of the No Wait Scenario with that of the Wait Scenario.]
Sometimes, it is better to wait if no checkpoint
is available
14. BranchTap Concept
- Key idea: stall when speculation is likely to deteriorate performance
  - Count the number of low confidence branches w/o a checkpoint
  - If it exceeds a threshold, stall
- Threshold selection
  - Fixed
    - Varies greatly across programs
    - Can deteriorate performance significantly
  - Adaptive
    - Robust performance
- Minimize recovery cost while conserving good speculation opportunities
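The counting rule above can be sketched as follows. This is a minimal model with a fixed threshold; the class and method names are hypothetical, the assumption that a branch leaves the count when it resolves is ours, and the adaptive threshold is not modeled.

```python
# Hypothetical sketch of the BranchTap stall decision; illustrative only.
class BranchTapSketch:
    def __init__(self, threshold):
        self.threshold = threshold
        self.uncovered = set()  # unresolved low confidence branches w/o a checkpoint

    def on_branch(self, branch_id, low_confidence, has_checkpoint):
        """Track a newly fetched branch if it is risky and has no checkpoint."""
        if low_confidence and not has_checkpoint:
            self.uncovered.add(branch_id)

    def on_resolve(self, branch_id):
        """Assumption: a branch stops counting once it resolves."""
        self.uncovered.discard(branch_id)

    def should_stall(self):
        """Stall when the count exceeds the threshold."""
        return len(self.uncovered) > self.threshold


tap = BranchTapSketch(threshold=1)
tap.on_branch("B1", low_confidence=True, has_checkpoint=False)
stall_after_one = tap.should_stall()
tap.on_branch("B2", low_confidence=True, has_checkpoint=True)   # covered: ignored
tap.on_branch("B3", low_confidence=True, has_checkpoint=False)
stall_after_two = tap.should_stall()
tap.on_resolve("B1")
stall_after_resolve = tap.should_stall()
```

The state is one counter (modeled here as a set for clarity) and one comparison, which matches the talk's later claim that the mechanism needs only a few counters and comparators.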
15. Threshold Adaptation Policy
- BranchTap adapts across and within applications
17. Results Overview
- Performance w/o checkpoints
  - BranchTap improves even with just an ROB
- Performance w/ 4 checkpoints
  - BranchTap improves over conventional recovery methods
- Performance w/ larger checkpoint predictors
  - BranchTap offers better performance than a 64x larger predictor
18. Methodology
- Simulator based on SimpleScalar
- 24 SPEC CPU 2000 benchmarks
  - Reference inputs
- Processor configurations
  - 8-way OoO core
  - Up to 1K in-flight instructions
  - 1K-entry confidence table for low confidence branch identification
- 1B committed instructions after skipping 100B
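The slides do not say how the confidence table works internally. One common design, shown here purely as an assumption, is a table of resetting saturating counters indexed by branch address: a counter increments on each correct prediction, resets on a misprediction, and a branch counts as low confidence while its counter is below a threshold. The index function, counter width and threshold below are illustrative choices.

```python
# Hypothetical resetting-counter confidence estimator; the estimator actually
# used in the evaluation is not specified in the slides.
class ConfidenceTable:
    def __init__(self, entries=1024, max_count=15, threshold=15):
        self.entries = entries
        self.max_count = max_count
        self.threshold = threshold
        self.counters = [0] * entries

    def _index(self, pc):
        # Assumed index function: drop the low two bits, then take the modulo.
        return (pc >> 2) % self.entries

    def is_low_confidence(self, pc):
        return self.counters[self._index(pc)] < self.threshold

    def update(self, pc, prediction_correct):
        i = self._index(pc)
        if prediction_correct:
            self.counters[i] = min(self.max_count, self.counters[i] + 1)
        else:
            self.counters[i] = 0  # reset on a misprediction


table = ConfidenceTable(entries=1024, max_count=3, threshold=3)
pc = 0x400
low_at_start = table.is_low_confidence(pc)
for _ in range(3):
    table.update(pc, prediction_correct=True)
low_after_streak = table.is_low_confidence(pc)
table.update(pc, prediction_correct=False)
low_after_miss = table.is_low_confidence(pc)
```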
19. Perfect Checkpointing Configuration
- A checkpoint is auto-magically taken at all mispredicted branches
- All recoveries are fast
- We report the deterioration relative to perfect checkpointing
20. Performance with No Checkpoints
- Deterioration relative to perfect checkpointing
[Figure: deterioration chart with a "better" direction marker; annotation: -39%.]
- BranchTap improves over conventional mechanisms
- Adaptation leads to robust performance improvements
21. Performance Evaluation with 4 Checkpoints
- Deterioration relative to perfect checkpointing
- BranchTap with 4 checkpoints is better than 6 checkpoints alone
[Figure: deterioration chart with a "better" direction marker; annotation: -28% deterioration.]
22. BranchTap vs. Larger Checkpoint Predictors
- BranchTap with a 1K-entry confidence table and 4 GCs
  - Higher performance than a 64K-entry confidence table with 4 GCs
  - Lower complexity, virtually comes for free
[Figure: performance chart with a "better" direction marker; BranchTap highlighted.]
24. Summary
- Performance with 4 (no) checkpoints
  - 28% (39%) of misprediction penalty removed
- BranchTap is robust
  - Up to 6% (13%) better and max 1.2% (0.1%) worse than conventional mechanisms
- BranchTap is very simple to implement
  - Few counters and comparators
- BranchTap is better than other alternatives
  - BT + 1K predictor better than a 64K predictor alone
  - BT + 4 GCs better than 6 GCs alone
25. BranchTap: Improving Performance With Very Few Checkpoints Through Adaptive Speculation Control
Patrick Akl and Andreas Moshovos
AENAO Research Group, Department of Electrical and Computer Engineering, University of Toronto
{pakl, moshovos}@eecg.toronto.edu