Title: Exploiting Eager Register Release in a Redundantly Multi-threaded Processor
1 Exploiting Eager Register Release in a Redundantly Multi-threaded Processor
- Niti Madan
- Rajeev Balasubramonian
- University of Utah
2 Introduction
- Rising soft error rates due to shrinking transistor sizes and lower supply voltages
- Existing solutions
- Process level: SOI
- Circuit level: rad-hard cells, ECC, BISER
- Architecture level
- Redundant multi-threading
- Reducing the time useful state spends in unprotected structures
- Software-assisted fault tolerance
3 Introduction
- CMPs/SMTs enable redundant multi-threading (RMT)
- "Detailed Design and Evaluation of Redundant Multithreading Alternatives", ISCA 2002
- Two processors/threads execute the same program
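4 Chip-level Redundant Multi-threading (CRTR)
[Figure: two out-of-order (OoO) processors. Processor 1 runs leading thread 1 and trailing thread 2; Processor 2 runs trailing thread 1 and leading thread 2. Links between the processors are labeled branch outcomes, loads, register values, and stores. Each trailing thread lags behind its leading thread by some slack.]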
5 Motivation
- Register file is already a critical resource
- impacts ILP
- impacts cycle time
- impacts peak temperature
- Multiple threads increase pressure on the register file
6 Motivation
- Out-of-order processors are "conservative" since they must preserve correctness
- Example: registers are de-allocated conservatively
- Having a trailing thread allows the leading thread to be aggressive
- improves the performance of the leading thread
- trailer state can be used for ensuring correctness
- some errors may go undetected
7 [Figure: Processor 1 (Leading 1) and Processor 2 (Trailing 1) connected by the RVQ. In the leading thread, a write to lr5 maps lr5 to R1; after a branch, another write to lr5 maps lr5 to R2; the branch then mispredicts.]
8 [Figure: Processor 1 (Leading 1) and Processor 2 (Trailing 1) connected by the RVQ, each holding a copy of R1. A soft error occurs, mispredict recovery follows, and the fault propagates.]
Very few errors slip through. Slack is most of the time less than the RVQ size.
9 Our Approach
- RMT processor has duplicate register value state in the RVQ/trailer's state
- Improve register file efficiency using
- Eager Register Release
- A smaller register file can deliver the same performance using the above technique
- Reduced power
- Increased reliability: ECC is less expensive
- Potentially faster clock speed
10 Outline
- Background on RMT design space
- Proposed technique
- Evaluation
- Conclusions and Future Work
11 Redundant Multi-threading
- Fault model
- Trailer's state used for recovery
- Does not provide complete recovery
- Caches and Load Value Queue (LVQ) are ECC protected
- Can detect all single event upset faults
- Baseline RMT models include SRTR, CRTR, ST-P-CRTR, MT-P-CRTR
12 Baseline RMT Model
- SRTR: SMT-level RMT
- CRTR: chip-level RMT
- Proposed by Mukherjee et al., ISCA 2002; Gomaa et al., ISCA 2002, ISCA 2003
[Figure: SRTR runs leading thread 1 and trailing thread 1 on one out-of-order processor. CRTR uses two out-of-order processors connected by the LVQ, BOQ, and RVQ: Processor 1 runs leading 1 and trailing 2; Processor 2 runs trailing 1 and leading 2.]
13 Power-efficient RMT model
- Our earlier work explores a power-efficient RMT model
- P-CRTR (SELSE-2, Tech Report 2005)
- Observations
- Trailing thread doesn't suffer from D-cache misses and branch mispredictions
- Trailing thread is bound to have higher IPC
- High trailer IPC enables power reduction
- Techniques proposed for power-efficiency
- Dynamic Frequency Scaling
- In-order execution of trailer
14 Dynamic Frequency Scaling
- High trailer IPC enables frequency reduction
- Reduce the trailer's frequency to match the leader's throughput
- Reduction in the trailer's dynamic power
- Does not impact the trailer's leakage power
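The frequency-matching idea above can be sketched as a static calculation. This is only an illustration under stated assumptions: throughput is taken as IPC times frequency, voltage is held fixed so dynamic power scales linearly with frequency, and the function names and all numbers are hypothetical rather than taken from the talk. The scheme in the talk is dynamic, whereas this computes a single operating point.

```python
def trailer_frequency(leader_freq_ghz, leader_ipc, trailer_ipc):
    """Lowest trailer frequency that still matches the leader's throughput.

    Throughput is IPC * frequency, so equal throughput gives
    f_trailer = f_leader * IPC_leader / IPC_trailer.
    """
    return leader_freq_ghz * leader_ipc / trailer_ipc


def trailer_power(dynamic_w, leakage_w, freq_ratio):
    """Scale only the dynamic component with frequency (voltage held fixed).

    Leakage is left unchanged, matching the last bullet of the slide.
    """
    return dynamic_w * freq_ratio + leakage_w


# Hypothetical numbers, not taken from the talk.
f_trailer = trailer_frequency(leader_freq_ghz=4.0, leader_ipc=1.0, trailer_ipc=2.0)
power = trailer_power(dynamic_w=10.0, leakage_w=4.0, freq_ratio=f_trailer / 4.0)
```

With these made-up inputs the trailer can run at half the leader's frequency, which halves its dynamic power while the leakage term stays the same.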
15 In-order Execution of Checker
- Our approach
- Send all register values computed by the leading core to the trailer (register value prediction, 100% accurate if there is no fault)
- Trailer reads source operands from the RVQ
- Trailer verifies source operands at commit
- RVP enables perfect IPC: no stalls
- Cost: extra communication overhead
- Benefit: overall reduced dynamic and leakage power
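A minimal sketch of the checking flow described above, assuming a simplified model in which each RVQ entry carries the two source operand values the leader used for one instruction. The names `run_trailer`, `PROGRAM`, and `fresh_regs` are hypothetical, and the two-instruction program is made up for illustration.

```python
import operator


def run_trailer(program, rvq, regs):
    """Replay `program` in order using operand values forwarded in the RVQ.

    program: list of (dest, op, src_a, src_b) over logical register names.
    rvq:     list of (val_a, val_b), the operand values the leader used.
    regs:    the trailer's own architectural register file (updated in place).
    Returns the index of the first instruction whose forwarded operands
    disagree with the trailer's registers, or None if everything matches.
    """
    for index, ((dest, op, src_a, src_b), (val_a, val_b)) in enumerate(zip(program, rvq)):
        # Execute right away with the forwarded operands: no dependence stalls.
        result = op(val_a, val_b)
        # Verify the forwarded operands against the trailer's state at commit.
        if (regs[src_a], regs[src_b]) != (val_a, val_b):
            return index
        regs[dest] = result
    return None


PROGRAM = [("r3", operator.add, "r1", "r2"), ("r5", operator.mul, "r3", "r4")]


def fresh_regs():
    return {"r1": 2, "r2": 3, "r3": 0, "r4": 4, "r5": 0}


clean = fresh_regs()
clean_result = run_trailer(PROGRAM, [(2, 3), (5, 4)], clean)
# The leader's r3 was corrupted to 7 before being forwarded to the second instruction.
faulty_result = run_trailer(PROGRAM, [(2, 3), (7, 4)], fresh_regs())
```

In the fault-free run the check never fires. In the faulty run the mismatch is caught when the consumer of r3 commits, because the trailer recomputed r3 from operands it had already verified.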
16 ST-P-CRTR
[Figure: Processor 1 (out-of-order) runs leading 1; Processor 2 (in-order) runs trailing 1. The two are connected by the LVQ, BOQ, and RVQ.]
17 MT-P-CRTR
[Figure: Processor 1 (out-of-order) runs leading 1 and leading 2; Processor 2 (in-order) runs trailing 1; Processor 3 (in-order) runs trailing 2. Each trailer is connected to Processor 1 by its own LVQ, BOQ, and RVQ.]
18 Eager Register Release
Original code:
lr3 <- lr1, lr2
lr5 <- lr3, lr4
Branch to x
lr3 <- ...
Renamed code:
pr21 <- pr8, pr11
pr15 <- pr21, pr12
Branch to x
pr29 <- ...
lr3 has two mappings: new (pr29) and old (pr21). pr21 cannot be released until the branch resolves.
- Eager Register Release
- Involves releasing the older physical register after the value is rewritten and used by all consumers
- Requires a mechanism to store the released state elsewhere
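The renaming example on this slide can be reproduced with a small sketch. It assumes a toy renamer with an in-order free list; the initial mappings, the free-list order, and the function names are hypothetical choices made so that the output matches the slide. The branch is left out because it writes no register, and the last instruction is given no sources because the slide elides them.

```python
def rename(program, initial_map, free_list):
    """Rename logical destinations to fresh physical registers.

    program: list of (dest, srcs) using logical register names.
    Returns (renamed, superseded): the renamed instructions, plus one
    (old_preg, new_preg) pair for every logical register that is overwritten.
    """
    mapping = dict(initial_map)
    free = list(free_list)
    renamed, superseded = [], []
    for dest, srcs in program:
        phys_srcs = tuple(mapping[s] for s in srcs)
        phys_dest = free.pop(0)
        if dest in mapping:
            superseded.append((mapping[dest], phys_dest))
        mapping[dest] = phys_dest
        renamed.append((phys_dest, phys_srcs))
    return renamed, superseded


def can_release_eagerly(preg, renamed, superseded, have_read):
    """True once `preg` is overwritten and every consumer has read it.

    have_read: set of indices of instructions that already read their operands.
    """
    overwritten = any(old == preg for old, _ in superseded)
    consumers = [i for i, (_, srcs) in enumerate(renamed) if preg in srcs]
    return overwritten and all(i in have_read for i in consumers)


program = [("lr3", ("lr1", "lr2")), ("lr5", ("lr3", "lr4")), ("lr3", ())]
renamed, superseded = rename(
    program,
    {"lr1": "pr8", "lr2": "pr11", "lr4": "pr12"},
    ["pr21", "pr15", "pr29"],
)
```

Here pr21 becomes eligible as soon as the second instruction (index 1) has read it, without waiting for the branch to resolve. That is also why the released value has to remain recoverable from somewhere else if the branch turns out to be mispredicted.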
19 Implementation Details
- Need to keep track of various states for each physical register in a Usage Table
- Bit that tracks whether the logical register value has been overwritten
- RVQ address/register id in the trailing thread
- Counters for each physical register
- To track pending consumers
- Modification to the ROB to initiate recovery upon a mispredict
- Non-trivial complexity and overheads
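One way to picture the Usage Table is the sketch below. It is an assumption-laden illustration, not the design from the talk: the class and method names are hypothetical, and it assumes the caller (for example the ROB entry of the overwriting instruction) records the returned RVQ index so the value can be copied back on a mispredict.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UsageEntry:
    rvq_index: int              # where the value also lives in the RVQ/trailer
    overwritten: bool = False   # logical register has a newer mapping
    pending: int = 0            # consumers renamed but not yet read


class UsageTable:
    def __init__(self):
        self.entries = {}

    def allocate(self, preg, rvq_index):
        self.entries[preg] = UsageEntry(rvq_index=rvq_index)

    def add_consumer(self, preg):
        self.entries[preg].pending += 1

    def consumer_read(self, preg) -> Optional[int]:
        self.entries[preg].pending -= 1
        return self._try_release(preg)

    def mark_overwritten(self, preg) -> Optional[int]:
        self.entries[preg].overwritten = True
        return self._try_release(preg)

    def _try_release(self, preg) -> Optional[int]:
        """Free `preg` if both conditions hold; return its RVQ index if freed."""
        entry = self.entries[preg]
        if entry.overwritten and entry.pending == 0:
            del self.entries[preg]
            return entry.rvq_index
        return None


table = UsageTable()
table.allocate("pr21", rvq_index=7)
table.add_consumer("pr21")
early = table.mark_overwritten("pr21")   # a consumer is still pending
late = table.consumer_read("pr21")       # last consumer done: released
```

On a mispredict, recovery in this sketch would read the saved index (7 here) out of the RVQ and copy that value back into a free physical register.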
20 Evaluation Methodology
- SimpleScalar-3.0 (modified for CMP/SMT) for performance analysis and Wattch for processor power
- eCacti-3.0 to model register file power and area overheads
- SPEC2k Int and FP benchmark suites
- 16 benchmarks for single-thread experiments
- 10 pairs of high/low IPC and Int/FP combinations for multi-thread experiments
- Evaluated all RMT models for comprehensive analysis of all combinations of leading/trailing threads
- RVQ size: 600 entries
21 Performance Evaluation
22 Effect of Register File Size: SRTR
ROB size: 160
23 Effect of Register File Size: ST-P-CRTR
24 Effect of Register File Size: CRTR
25 Effect of Register File Size: MT-P-CRTR
26 Effect of Register File Size
- For SRTR, CRTR, MT-P-CRTR
- Performance of a 100-entry RF with ER is the same as the baseline with 160 entries (37.5% size reduction)
- Performance improvement of 34% for a 100-entry RF with ER compared to the baseline with 100 entries
- For ST-P-CRTR
- Performance of a 50-entry register file with ER is the same as the baseline with 80 entries (37.5% size reduction)
- Performance improvement of 12% for a 100-entry RF with ER compared to the baseline with 100 entries
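The two size-reduction figures above follow directly from the register file sizes. A quick check, with `size_reduction` as a hypothetical helper name:

```python
def size_reduction(baseline_entries, reduced_entries):
    """Fraction of register file entries saved relative to the baseline."""
    return 1 - reduced_entries / baseline_entries


multi_thread = size_reduction(160, 100)   # SRTR, CRTR, MT-P-CRTR
single_thread = size_reduction(80, 50)    # ST-P-CRTR
```

Both cases come out to the same 37.5% because 100/160 and 50/80 are the same ratio.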
27 Observations
- More favorable to models where the leading thread co-executes with another leading/trailing thread
- Most FP benchmarks perform better with ER (greater than 20% improvement)
- Int benchmarks that have poor branch prediction rates do not benefit much (gcc, equake, eon, etc.: up to 3%)
28 Performance Overheads
- For a 100 million instruction single-thread execution
- 70 million registers are released eagerly
- 6% are copied back upon mispredict recovery
- Cost of copying back depends on the program's mispredict rate
- Each mispredict requires, on average, 6.6 values to be copied back
- Cost of copying can possibly be hidden within the branch recovery time
29 Performance Overheads
Max IPC loss for a 5-cycle overhead is 4%
30 Power/Area Analysis
8 Rd/4 Wr ports assumed for ST RF
16 Rd/8 Wr ports assumed for MT RF
31 Power/Area Analysis
- A single-thread RF of size 50 with ER, compared to a baseline RF of size 80, can
- Improve clock speed by 19%
- Consume 11% less energy and 25% less area
- If SEC-DED ECC is implemented on the baseline register file
- 6% energy increase and 16% area increase
- A smaller RF can help afford ECC, even for multiple-bit soft error resilience
32 Fault-Injection Analysis
- Modified SimpleScalar for fault analysis
- Conservative analysis, as masking effects cannot be modeled
- Every 1000 cycles, a register bit is flipped in the trailing register file
- Only 0.0004% of faults go undetected
- On average, 99% of the time a logical register is rewritten within an interval of fewer than 100 instructions
- Ensures that the slack is less than the RVQ size
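The injection step described above can be sketched as follows. This is an illustrative model only: the helper names, the 64-bit register width, and the example values are assumptions, and the sketch does not model the rare cases in which a fault escapes detection.

```python
def flip_bit(value, bit, width=64):
    """Return `value` with one bit inverted, as a single-event upset would."""
    return (value ^ (1 << bit)) & ((1 << width) - 1)


def injection_cycles(total_cycles, period=1000):
    """Cycles at which one register bit is flipped (one injection per period)."""
    return list(range(period, total_cycles + 1, period))


def detected(leading_value, trailing_value):
    """A fault is caught when the redundant copies disagree on comparison."""
    return leading_value != trailing_value


schedule = injection_cycles(3500)
corrupted = flip_bit(0b1010, 1)
caught = detected(0b1010, corrupted)
```

Flipping the same bit twice restores the original value, which is a handy sanity check when wiring a step like this into a simulator.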
33 Conclusions and Future Work
- RMT model is very suitable for Eager Register Release
- A 100-entry RF can match the throughput of a 160-entry file and shows a 34% improvement over the baseline
- Fault-coverage reduction is marginal: 0.0004%
- Enables a smaller RF for lower power, higher clock speed, lower area overheads
- Enables reliability by making ECC affordable
- Nontrivial implementation overheads
- Need to explore a complexity-effective solution