Title: TurboROB: A Low Cost Checkpoint/Restore Accelerator
1 TurboROB: A Low Cost Checkpoint/Restore Accelerator
Patrick Akl and Andreas Moshovos
AENAO Research Group
Department of Electrical and Computer Engineering
University of Toronto
pakl, moshovos_at_eecg.toronto.edu
2 Recovering From Control Flow Mispredictions
- Predict a branch outcome and fetch down the predicted path
- Misprediction discovered
- Recover processor state and redirect fetch to the correct path
- Resume execution
- Accelerate Recovery → Improve Performance
3 State-of-the-Art Recovery
[Figure: timeline from "Predict a Branch Outcome" to "Misprediction Discovered", contrasting a State Snapshot with a Log of Changes (ROB) whose entries record what changed and its old value]
- Scalability and/or Performance Issues
4 Turbo-ROB
[Figure: timeline from "Predict a Branch Outcome" to "Misprediction Discovered", contrasting the ROB's Log of Changes with a Partial Log of Changes]
- Make common case fast
- Recover only at branches
- Store only as much as needed
- Partial Log
5 Outline
- Control Flow Misspeculation Recovery
- TurboROB
- Methodology and Results
- Summary
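6 State Recovery Example: Register Alias Table
Original Code
A add r1, r2, 100
B breq r1, E
C sub r1, r2, r2
Renamed Code
A add p4, p2, 100
B breq p4, E
C sub p5, p2, p2
[Figure: RAT with one entry per Architectural Register, each naming a Physical Register. Dimension labels: "Lg( arch. regs)" and "arch. regs". The entry for r1 is shown as p1, p4, p5 and then p5, p4; other entries hold p2 and p3.]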
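The renaming step in this example can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `rename`, the tuple encoding of instructions, and the free list contents are all assumptions made for the sketch.

```python
# Hypothetical sketch of register renaming through a RAT (all names are illustrative).

def rename(code, rat, free_list):
    """Rename (op, dest, srcs) tuples in program order.

    Returns the renamed code plus a log of (arch_dest, previous_map) pairs,
    which is the information a recovery mechanism needs later.
    """
    renamed, log = [], []
    for op, dest, srcs in code:
        # Sources read the current mapping; immediates and labels pass through.
        new_srcs = [rat.get(s, s) for s in srcs]
        new_dest = None
        if dest is not None:
            log.append((dest, rat[dest]))  # remember the previous RAT map
            new_dest = free_list.pop(0)    # allocate a fresh physical register
            rat[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed, log

code = [
    ("add", "r1", ["r2", 100]),    # A
    ("breq", None, ["r1", "E"]),   # B
    ("sub", "r1", ["r2", "r2"]),   # C
]
rat = {"r1": "p1", "r2": "p2"}
renamed, log = rename(code, rat, ["p4", "p5"])
for op, dest, srcs in renamed:
    print(op, dest, srcs)
```

After renaming, r1 maps to p5, and the log holds the two previous maps of r1 (p1 and p4) that recovery would have to restore.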
7 ROB: Slow, Fine-Grain Recovery
- Each entry contains
- Architectural destination register
- Its previous RAT map
[Figure: Reorder Buffer in Program Order; RAT marked INVALID during recovery]
- 2. Locate newest instruction
- 3. Undo RAT updates in reverse order
- Too slow: recovery latency is proportional to the number of instructions to squash
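The ROB walk described on this slide can be sketched as follows. This is a simplified model under stated assumptions: the ROB is a Python list in program order, each entry is either `(arch_dest, previous_map)` or `None` for instructions without a destination, and the function name `rob_recover` is hypothetical.

```python
# Hypothetical sketch of ROB-based RAT recovery (names are illustrative).

def rob_recover(rat, rob, branch_index):
    """Squash every entry younger than rob[branch_index], newest first.

    Each squashed entry restores the previous RAT map of its destination.
    Returns the number of entries walked, which models recovery latency.
    """
    steps = 0
    while len(rob) > branch_index + 1:
        entry = rob.pop()          # locate the newest instruction
        steps += 1
        if entry is not None:
            dest, previous_map = entry
            rat[dest] = previous_map   # undo the RAT update
    return steps

# State after renaming A (add), B (breq), C (sub): r1 currently maps to p5.
rat = {"r1": "p5", "r2": "p2"}
rob = [("r1", "p1"), None, ("r1", "p4")]
steps = rob_recover(rat, rob, branch_index=1)   # B is mispredicted
print(rat, steps)
```

The loop runs once per squashed instruction, which is why the latency grows with the number of instructions to squash.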
8 Global Checkpoints: Fast, Coarse-Grain Recovery
[Figure: Reorder Buffer in Program Order with four checkpoints; RAT marked INVALID]
- Branch w/ GC: Recovery is Instantaneous
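A global checkpoint can be modeled as a full copy of the RAT taken at a branch. The sketch below is an assumption-laden illustration: the class `CheckpointedRAT`, its method names, and the fixed checkpoint budget are invented for this example and do not describe a specific hardware design.

```python
# Hypothetical sketch of global checkpoints (GCs) of the RAT (names are illustrative).

class CheckpointedRAT:
    def __init__(self, mapping, max_checkpoints):
        self.map = dict(mapping)
        self.max_checkpoints = max_checkpoints
        self.checkpoints = []          # (branch_id, snapshot), oldest first

    def take_checkpoint(self, branch_id):
        """Snapshot the whole RAT; fails when no checkpoint is free."""
        if len(self.checkpoints) >= self.max_checkpoints:
            return False
        self.checkpoints.append((branch_id, dict(self.map)))
        return True

    def restore(self, branch_id):
        """Restore the RAT in one step, whatever the number of squashed instructions."""
        ids = [b for b, _ in self.checkpoints]
        i = ids.index(branch_id)
        self.map = dict(self.checkpoints[i][1])
        del self.checkpoints[i:]       # this and all younger checkpoints are released

crat = CheckpointedRAT({"r1": "p4", "r2": "p2"}, max_checkpoints=1)
took_first = crat.take_checkpoint("B")
crat.map["r1"] = "p5"                   # instruction C renames r1
took_second = crat.take_checkpoint("B2")  # no free checkpoint left
crat.restore("B")
print(crat.map)
```

The second `take_checkpoint` call fails, which mirrors the case where a branch is left without a checkpoint and must fall back to a slower mechanism.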
9 Impact of More Checkpoints
[Figure: RAT checkpoints, Concept vs. Actual Implementation, mapping architectural registers to physical registers]
- More checkpoints →
- Power hungry structure
- Increased delay
- Only a few checkpoints can practically be implemented
- Cannot always cover all branches
10 Intelligent Checkpointing: BranchTap
[Figure: instruction stream with four checkpoints]
- Use Few Checkpoints Effectively
- BranchTap
- Throttle Speculation
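One way to use few checkpoints effectively is to reserve them for branches that are likely to be mispredicted. The sketch below illustrates that general idea with a resetting-counter confidence table; it is not the BranchTap mechanism itself, and the class `ConfidenceTable`, the function `should_checkpoint`, the threshold, and the indexing scheme are all assumptions.

```python
# Hypothetical sketch of confidence-guided checkpoint allocation.
# This is NOT the BranchTap algorithm; it only illustrates spending
# scarce checkpoints on low-confidence branches.

class ConfidenceTable:
    def __init__(self, entries=1024, threshold=3):
        self.entries = entries
        self.threshold = threshold
        self.counters = [0] * entries   # one resetting counter per entry

    def is_low_confidence(self, pc):
        return self.counters[pc % self.entries] < self.threshold

    def update(self, pc, predicted_correctly):
        i = pc % self.entries
        if predicted_correctly:
            self.counters[i] = min(self.counters[i] + 1, self.threshold)
        else:
            self.counters[i] = 0        # a misprediction resets confidence

def should_checkpoint(table, pc, free_checkpoints):
    """Spend a checkpoint only if one is free and the branch is low confidence."""
    return free_checkpoints > 0 and table.is_low_confidence(pc)

table = ConfidenceTable()
pc = 0x40
before = should_checkpoint(table, pc, free_checkpoints=1)
for _ in range(3):
    table.update(pc, predicted_correctly=True)
after = should_checkpoint(table, pc, free_checkpoints=1)
print(before, after)
```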
11 Conventional Mechanisms: Recovery Scenarios
[Figure: three recovery scenarios, each over a stream of branches (B) with one checkpoint; one scenario is labeled Re-Execution]
12 Outline
- Background
- Turbo-ROB
- Methodology and Results
- Summary
13 Turbo-ROB
[Figure: Recovery Cost. After branch B, younger instructions write R2, R1, R1, R1, R2; each step of ROB Recovery is marked useful or redundant]
We only need to reverse the first subsequent change for every RAT entry
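The partial-log idea can be sketched as follows. This is a simplified model under stated assumptions, not the hardware design from the talk: the class `TurboROB`, its method names, and the use of a Python set to remember which RAT entries were already logged since the last branch are all illustrative.

```python
# Hypothetical sketch of a partial log of RAT changes (names are illustrative).

class TurboROB:
    def __init__(self, rat):
        self.rat = rat
        self.log = []         # ("branch", id, None) or ("map", arch_dest, previous_map)
        self.logged = set()   # RAT entries already logged since the last branch

    def branch(self, branch_id):
        self.log.append(("branch", branch_id, None))
        self.logged.clear()   # the next write to each register is logged again

    def write(self, dest, new_map):
        if dest not in self.logged:       # only the first change after a branch
            self.log.append(("map", dest, self.rat[dest]))
            self.logged.add(dest)
        self.rat[dest] = new_map

    def recover(self, branch_id):
        """Undo logged changes, newest first, back to the given branch."""
        steps = 0
        while self.log:
            kind, key, value = self.log.pop()
            if kind == "branch":
                if key == branch_id:
                    self.log.append((kind, key, value))  # the branch itself survives
                    break
                continue
            self.rat[key] = value
            steps += 1
        self.logged.clear()
        return steps

rat = {"R1": "p1", "R2": "p2"}
trob = TurboROB(rat)
trob.branch("B")
for dest, new_map in [("R2", "p3"), ("R1", "p4"), ("R1", "p5"), ("R1", "p6"), ("R2", "p7")]:
    trob.write(dest, new_map)
logged_entries = len(trob.log) - 1   # exclude the branch marker
steps = trob.recover("B")
print(logged_entries, steps, rat)
```

For the five writes in the figure, only two entries are logged and only two undo steps are needed, whereas a full ROB walk would take five.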
14 Turbo-ROB: Replacing the ROB
[Figure: two scenarios over a stream of branches (B) recovered with the TROB; the second is labeled Re-Execution]
15 Selective Turbo-ROB w/ ROB
[Figure: stream of branches (B) with a TROB]
Selective Turbo-ROB w/ GCs
[Figure: stream of branches (B) with a TROB and a checkpoint]
16 Outline
- Background
- TurboROB
- Methodology and Results
- Summary
17 Results Overview
- TROB as an ROB replacement
- BranchTap offers better performance than ROB
- Fewer resources
- Even for smaller windows
- Selective TROB as a GC reduction mechanism
- TROB reduces pressure for GCs
- Offload a critical structure: the RAT
- In the paper
- Selective TROB as an ROB accelerator
- Even the smallest TROB accelerates recovery
18 Methodology
- Simulator based on SimpleScalar
- Alpha/OSF
- 24 SPEC CPU 2000 benchmarks
- Reference Inputs
- Processor configurations
- 4-way OoO core
- 128/256/512 in-flight instructions
- 1K-entry confidence table for low-confidence branch identification
- Similar results with Anyweak
- 1B committed instructions after skipping 2B
19 Perfect Checkpointing Configuration
- A checkpoint is auto-magically taken at all mispredicted branches
- All recoveries are fast
- We report the deterioration relative to perfect checkpointing
20 TROB Replacing the ROB / 512-Entry Window
- 64-entry TROB ≈ ROB on average
- Pathological cases exist → 256-entry needed
- 512-entry TROB better than ROB
21 TROB Replacing the ROB / 128-Entry Window
- 64-entry TROB: 50% better than ROB
- Fewer pathological cases
- 128-entry TROB better than ROB
22 sTROB and Global Checkpoints / 128-Entry Window
- TROB + 1 GC better than 4 GCs
23 Summary
- TROB vs. ROB
- Replacement
- Same resources → better performance
- Fewer resources → often better performance
- Except when accuracy is high
- Acceleration
- ¼ resources → 35% improvement
- TROB vs. GCs
- Reduce pressure from the critical path
- With just 1 GC, match the performance of four GCs
- One more alternative for designers
- Allows different area/performance/power tradeoffs