TurboROB: A Low Cost Checkpoint/Restore Accelerator

Slides: 26
Provided by: davorca
1
TurboROB: A Low Cost Checkpoint/Restore Accelerator
Patrick Akl (1) and Andreas Moshovos
AENAO Research Group
Department of Electrical and Computer Engineering
University of Toronto
(1) Now with AMD/ATI
2
What Happens on a Branch Misprediction?
  • Execution timeline
    1. Predict a branch outcome and proceed down the predicted path
    2. Misprediction discovered
    3. Recover processor state and redirect fetch
    4. Resume execution on the correct path
  • We wish to make the recovery fast

3
Recovery Mechanisms Overview
  • ROB
    • Buffer all changes
    • Slow
  • Instantaneous checkpoints
    • Snapshot before speculating
    • Fast
    • Problem: can't have enough checkpoints
  • Checkpoint prediction
    • Allocate the few checkpoints judiciously
  • Speculation control
    • Sometimes deeper speculation means higher recovery cost
    • Can hurt performance
    • Throttle speculation

4
TurboROB Overview
  • Complements or replaces existing mechanisms
  • ROB: recover at any point
  • TurboROB: recover only at frequent points
  • Improves performance for most programs
  • Misprediction performance penalty reduced by 28% on average
  • BranchTap comes for free
  • Very simple to implement
  • Better than more accurate checkpoint predictors

5
Outline
  • Background
  • BranchTap
  • Methodology and Results
  • Summary

6
State Recovery Example: Register Alias Table
  • Original code
    A: add r1, r2, 100
    B: breq r1, E
    C: sub r1, r2, r2
  • Renamed code
    A: add p4, p2, 100
    B: breq p4, E
    C: sub p5, p2, p2
[Figure: the RAT maps architectural registers to physical registers (p1 through p5 shown); dimension labels: "Lg( arch. regs)" and "arch. regs"]
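The renaming step in this example can be sketched in a few lines of Python. This is only an illustration, not the presenters' implementation: the tuple format, the initial mapping (r1 to p1, r2 to p2), and the free list (p4, p5) are assumptions chosen so that the output matches the renamed code on the slide.

```python
def rename(instrs, rat, free_list):
    """Rename (op, dest, srcs) tuples; dest is None for instructions with no destination."""
    renamed = []
    for op, dest, srcs in instrs:
        # Sources read the current mapping; immediates and labels pass through unchanged.
        new_srcs = [rat.get(s, s) for s in srcs]
        new_dest = None
        if dest is not None:
            new_dest = free_list.pop(0)  # allocate a fresh physical register
            rat[dest] = new_dest         # later readers of dest see the new name
        renamed.append((op, new_dest, new_srcs))
    return renamed


rat = {"r1": "p1", "r2": "p2"}  # assumed initial mapping (hypothetical)
free_list = ["p4", "p5"]        # assumed free physical registers (hypothetical)
code = [
    ("add", "r1", ["r2", 100]),   # A
    ("breq", None, ["r1", "E"]),  # B
    ("sub", "r1", ["r2", "r2"]),  # C
]
renamed = rename(code, rat, free_list)
```

After the call, r1 maps to p5 in the RAT, which is the state that has to be undone if branch B turns out to be mispredicted.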
7
ROB: Slow, Fine-Grain Recovery
  • Each entry contains
    • Architectural destination register
    • Its previous RAT map
  • Recovery steps
    1. Misprediction discovered
    2. Locate newest instruction
    3. Undo RAT updates in reverse order
  • Too slow: recovery latency is proportional to the number of instructions to squash
[Figure: reorder buffer in program order; RAT marked INVALID]
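The ROB walk described on this slide can be sketched as follows. It is a minimal model under stated assumptions: the ROB is a Python list with the oldest entry first, `RobEntry` and `rob_walk_recover` are hypothetical names, and the returned step count stands in for recovery latency.

```python
from collections import namedtuple

# Each ROB entry logs the architectural destination and the RAT mapping it replaced.
RobEntry = namedtuple("RobEntry", ["arch_dest", "prev_map"])


def rob_walk_recover(rat, rob, branch_index):
    """Squash every entry younger than rob[branch_index], newest first.

    Returns the number of entries walked, a proxy for recovery latency.
    """
    steps = 0
    while len(rob) > branch_index + 1:
        entry = rob.pop()  # newest instruction still in the ROB
        if entry.arch_dest is not None:
            rat[entry.arch_dest] = entry.prev_map  # undo this RAT update
        steps += 1
    return steps


# State after renaming A, B, C from the earlier example (initial r1 -> p1 is assumed).
rat = {"r1": "p5", "r2": "p2"}
rob = [RobEntry("r1", "p1"), RobEntry(None, None), RobEntry("r1", "p4")]
steps = rob_walk_recover(rat, rob, branch_index=1)  # B is mispredicted
```

Only one instruction follows the branch here, so the walk takes one step; with hundreds of in-flight instructions the same loop runs hundreds of times, which is the latency problem the slide points out.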

8
Global Checkpoints: Fast, Coarse-Grain Recovery
  • Misprediction discovered
  • Branch w/ GC: recovery is instantaneous
[Figure: reorder buffer in program order with four checkpoints; RAT marked INVALID]
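For contrast with the ROB walk, checkpoint-based recovery can be sketched as a snapshot and restore of the whole RAT. The class and method names below are hypothetical, and the policy of discarding the used checkpoint together with all younger ones is an assumption of this sketch, not something the slides specify.

```python
class CheckpointedRAT:
    """RAT with a small, fixed number of global checkpoints (GCs)."""

    def __init__(self, mapping, max_checkpoints):
        self.map = dict(mapping)
        self.max_checkpoints = max_checkpoints
        self.checkpoints = []  # (branch_id, snapshot) pairs, oldest first

    def take_checkpoint(self, branch_id):
        """Snapshot the RAT at a branch; fails when no checkpoint is free."""
        if len(self.checkpoints) >= self.max_checkpoints:
            return False
        self.checkpoints.append((branch_id, dict(self.map)))
        return True

    def restore(self, branch_id):
        """Restore the snapshot taken at branch_id, if one exists."""
        for i, (bid, snapshot) in enumerate(self.checkpoints):
            if bid == branch_id:
                self.map = dict(snapshot)
                del self.checkpoints[i:]  # this and younger checkpoints are stale
                return True
        return False  # no GC for this branch: a slower recovery path is needed


rat = CheckpointedRAT({"r1": "p1", "r2": "p2"}, max_checkpoints=1)
rat.map["r1"] = "p4"                 # A renames r1
took_b = rat.take_checkpoint("B")    # checkpoint at branch B
rat.map["r1"] = "p5"                 # C renames r1
took_b2 = rat.take_checkpoint("B2")  # a later branch finds no free checkpoint
restored = rat.restore("B")          # B mispredicted: one-step restore
```

The restore cost does not depend on how many instructions follow the branch, but the failed second `take_checkpoint` shows the limitation raised on the next slide: with few checkpoints, not every branch can be covered.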

9
Impact of More Checkpoints
[Figure: checkpointed RAT, concept vs. actual implementation; cells map an architectural register to a physical register]
  • More checkpoints →
    • Power hungry structure
    • Increased delay
  • Only a few checkpoints can practically be implemented
  • Cannot always cover all branches

10
Intelligent Checkpointing
  • State of the art solution
  • Checkpoint allocation: allocate checkpoints at hard-to-predict branches
  • Checkpoint management: release checkpoints as soon as they are no longer needed
  • Use few checkpoints efficiently

11
Conventional Mechanisms: Recovery Scenarios
  • Misspeculation on a branch w/ a GC: direct recovery
  • Misspeculation on a branch w/o a GC: indirect recovery
  • With intelligent checkpointing
    • 30% indirect recoveries → 75% of performance loss
[Figure: two ROB diagrams with branches (B) and a checkpoint, labeled "Fast Recovery" and "Slow Recovery"]
12
Outline
  • Background
  • BranchTap
  • Methodology and Results
  • Summary

13
BranchTap Motivation
[Figure: two ROB diagrams with branches (B), checkpoints, and a low confidence branch, comparing the recovery cost once the misprediction is discovered: "No Wait Scenario" vs. "Wait Scenario"]
Sometimes, it is better to wait if no checkpoint is available
14
BranchTap Concept
  • Key idea: stall when speculation is likely to deteriorate performance
    • Count the number of low confidence branches w/o a checkpoint
    • If it exceeds a threshold, stall
  • Threshold selection
    • Fixed
      • Varies greatly across programs
      • Can deteriorate performance significantly
    • Adaptive
      • Robust performance
  • Minimize recovery cost while conserving good speculation opportunities
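The counting rule on this slide can be sketched as a single counter and a comparison. The class name, the method names, and the choice to decrement the counter when an unprotected branch resolves are assumptions of this sketch; the slides state only that low confidence branches without a checkpoint are counted and that the processor stalls when the count exceeds a threshold. The threshold is fixed here; the adaptive policy is not modeled.

```python
class SpeculationThrottle:
    """Stall when too many low confidence branches are in flight without a checkpoint."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.unprotected = 0  # in-flight low confidence branches w/o a checkpoint

    def on_branch_dispatch(self, low_confidence, has_checkpoint):
        if low_confidence and not has_checkpoint:
            self.unprotected += 1

    def on_branch_resolve(self, low_confidence, has_checkpoint):
        # Assumption: a resolved branch no longer counts against the threshold.
        if low_confidence and not has_checkpoint and self.unprotected > 0:
            self.unprotected -= 1

    def should_stall(self):
        return self.unprotected > self.threshold


throttle = SpeculationThrottle(threshold=1)
throttle.on_branch_dispatch(low_confidence=True, has_checkpoint=True)   # protected
throttle.on_branch_dispatch(low_confidence=True, has_checkpoint=False)  # count = 1
stall_after_one = throttle.should_stall()
throttle.on_branch_dispatch(low_confidence=True, has_checkpoint=False)  # count = 2
stall_after_two = throttle.should_stall()
throttle.on_branch_resolve(low_confidence=True, has_checkpoint=False)   # count = 1
stall_after_resolve = throttle.should_stall()
```

The state is one counter and one comparator, which is consistent with the claim later in the talk that the mechanism needs only a few counters and comparators.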

15
Threshold Adaptation Policy
  • BranchTap adapts across and within applications

16
Outline
  • Background
  • BranchTap
  • Methodology and Results
  • Summary

17
Results Overview
  • Performance w/o checkpoints
    • BranchTap improves even with just an ROB
  • Performance w/ 4 checkpoints
    • BranchTap improves over conventional recovery methods
  • Performance w/ larger checkpoint predictors
    • BranchTap offers better performance than a 64x larger predictor

18
Methodology
  • Simulator based on SimpleScalar
  • 24 SPEC CPU 2000 benchmarks
    • Reference inputs
  • Processor configurations
    • 8-way OoO core
    • Up to 1K in-flight instructions
    • 1K-entry confidence table for low confidence branch identification
  • 1B committed instructions after skipping 100B

19
Perfect Checkpointing Configuration
  • A checkpoint is auto-magically taken at all
    mispredicted branches
  • All recoveries are fast
  • We report the deterioration relative to perfect
    checkpointing
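One plausible reading of this metric, written out as a small sketch. The slides do not give the formula, so both function definitions are assumptions, and the IPC values are made-up round numbers for illustration only, not results from the talk.

```python
def deterioration(ipc_perfect, ipc_config):
    """Fraction of performance lost relative to perfect checkpointing (assumed definition)."""
    return (ipc_perfect - ipc_config) / ipc_perfect


def penalty_removed(det_baseline, det_improved):
    """Fraction of the baseline deterioration that an improved scheme removes (assumed definition)."""
    return (det_baseline - det_improved) / det_baseline


# Illustrative inputs only (hypothetical, not measured values).
det_base = deterioration(ipc_perfect=2.0, ipc_config=1.5)       # 0.25
det_improved = deterioration(ipc_perfect=2.0, ipc_config=1.75)  # 0.125
removed = penalty_removed(det_base, det_improved)               # 0.5
```

Under this reading, a statement such as "X% of the misprediction penalty removed" compares two deterioration values against each other rather than comparing raw IPC.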

20
Performance with No Checkpoints
  • Deterioration relative to perfect checkpointing


[Chart: deterioration relative to perfect checkpointing; arrow marks "better"; annotation: -39%]
  • BranchTap improves over conventional mechanisms
  • Adaptation leads to robust performance
    improvements

21
Performance Evaluation with 4 Checkpoints
  • Deterioration relative to perfect checkpointing
  • BranchTap with 4 checkpoints is better than 6
    checkpoints alone


[Chart: deterioration relative to perfect checkpointing; arrow marks "better"; annotation: -28%]
22
BranchTap vs. Larger Checkpoint Predictors
  • BranchTap with a 1K-entry confidence table and 4
    GCs
  • Higher performance than a 64K-entry confidence
    table with 4 GCs
  • Lower complexity, virtually comes for free

[Chart: deterioration vs. confidence table size, with BranchTap marked; arrow marks "better"]

23
Outline
  • Background
  • BranchTap
  • Methodology and Results
  • Summary

24
Summary
  • Performance with 4 (no) checkpoints
    • 28% (39%) of misprediction penalty removed
  • BranchTap is robust
    • Up to 6% (13%) better and max 1.2% (0.1%) worse than conventional mechanisms
  • BranchTap is very simple to implement
    • Few counters and comparators
  • BranchTap is better than other alternatives
    • BT + 1K predictor better than a 64K predictor alone
    • BT + 4 GCs better than 6 GCs alone

25
BranchTap: Improving Performance With Very Few Checkpoints Through Adaptive Speculation Control
Patrick Akl and Andreas Moshovos
AENAO Research Group
Department of Electrical and Computer Engineering
University of Toronto
pakl, moshovos_at_eecg.toronto.edu