Exploiting Eager Register Release in a Redundantly Multi-threaded Processor

Slides: 34

Provided by: NM15

Learn more at: https://users.cs.utah.edu

1
Exploiting Eager Register Release in a
Redundantly Multi-threaded Processor

Niti Madan
Rajeev Balasubramonian
University of Utah

2
Introduction

Rising soft error rates due to shrinking transistor sizes and lower supply voltages
Existing solutions:
Process level: SOI
Circuit level: rad-hard cells, ECC, BISER
Architecture level:
Redundant multithreading
Reducing the time useful state spends in unprotected structures
Software-assisted fault tolerance

3
Introduction

CMPs/SMTs enable redundant multi-threading (RMT)
Detailed Design and Evaluation of Redundant
Multithreading Alternatives, ISCA 2002
2 processors/threads execute the same program

4
Chip-level Redundant Multi-threading (CRTR)
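[Diagram: two out-of-order (OoO) processors. Processor 1 runs leading thread 1 and trailing thread 2; Processor 2 runs trailing thread 1 and leading thread 2. Labels on the links between the processors: branch outcomes, loads, register values, stores. Each trailing thread lags behind its leading thread by some slack.]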
5
Motivation

Register file is already a critical resource
impacts ILP
impacts cycle time
impacts peak temperature
Multiple threads increase pressure on register
file

6
Motivation

Out-of-order processors are "conservative" since
they must preserve correctness
Example: registers are de-allocated conservatively
Having a trailing thread allows the leading
thread to be aggressive
improves the performance of the leading thread
trailer state can be used for ensuring
correctness
some errors may go undetected

7
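[Diagram: Processor 1 (Leading 1) and Processor 2 (Trailing 1) connected by the RVQ. In the leading thread, a write to lr5 is mapped to R1; after a branch, a later write to lr5 is mapped to R2; the branch is then mispredicted.]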
8
[Diagram: Processor 1 (Leading 1) and Processor 2 (Trailing 1) connected by the RVQ, each holding a copy of R1. A soft error strikes a copy of R1; mispredict recovery then lets the fault propagate.]
Very few errors slip through: slack is most of the time less than the RVQ size
9
Our Approach

RMT processor has duplicate register value state in the RVQ/trailer's state
Improve register file efficiency using Eager Register Release
A smaller register file can deliver the same performance using the above technique
Reduced power
Increased reliability: ECC is less expensive
Potentially faster clock speed

10
Outline

Background on RMT design space
Proposed technique
Evaluation
Conclusions and Future Work

11
Redundant Multi-threading

Fault model
Trailer's state used for recovery
Does not provide complete recovery
Caches and Load Value Queue (LVQ) are ECC protected
Can detect all single event upset faults
Baseline RMT models include SRTR, CRTR,
ST-P-CRTR, MT-P-CRTR

12
Baseline RMT Model
Leading Thread 1 Trailing Thread
1 Out-of-Order Processor

SRTR: SMT-level RMT
CRTR: chip-level RMT
Proposed by Mukherjee et al. ISCA 2002, Gomaa et al. ISCA 2002, ISCA 2003

[Diagram: two out-of-order processors linked by the LVQ, BOQ, and RVQ. Processor 1 runs Leading 1 and Trailing 2; Processor 2 runs Trailing 1 and Leading 2.]
13
Power-efficient RMT model

Our earlier work explores a power-efficient RMT model
P-CRTR (Selse-2, Tech Report 2005)
Observations:
Trailing thread doesn't suffer from D-cache misses and branch mispredictions
Trailing thread is bound to have higher IPC
High trailer IPC enables power reduction
Techniques proposed for power efficiency:
Dynamic Frequency Scaling
In-order execution of trailer

14
Dynamic Frequency Scaling

High trailer IPC enables frequency reduction
Reduce the trailer's frequency to match the leader's throughput
Reduction in the trailer's dynamic power
Does not impact the trailer's leakage power
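
The frequency-matching idea on this slide can be illustrated with a small Python sketch. It assumes throughput is IPC times clock frequency and that dynamic power scales linearly with frequency at a fixed supply voltage; the function names and the example numbers are hypothetical and are not taken from the presentation.

```python
def matched_trailer_frequency(leader_freq_ghz, leader_ipc, trailer_ipc):
    """Lowest trailer frequency whose throughput (IPC * f) still matches the leader.

    The trailer is never clocked faster than the leader in this sketch.
    """
    return leader_freq_ghz * min(1.0, leader_ipc / trailer_ipc)


def dynamic_power_ratio(trailer_freq_ghz, leader_freq_ghz):
    """Trailer dynamic power relative to running at the leader's frequency.

    Assumes dynamic power is proportional to frequency at fixed voltage;
    leakage power is unaffected by frequency scaling and is not modeled here.
    """
    return trailer_freq_ghz / leader_freq_ghz


# Hypothetical numbers: the trailer sustains twice the leader's IPC.
trailer_freq = matched_trailer_frequency(2.0, leader_ipc=1.0, trailer_ipc=2.0)
power_ratio = dynamic_power_ratio(trailer_freq, 2.0)
print(trailer_freq, power_ratio)
```

With these made-up inputs the trailer can run at half the leader's clock, which halves its dynamic power under the stated assumption.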

15
In-order Execution of Checker

Our approach:
Send all register values computed by the leading core to the trailer (register value prediction, RVP: 100% accuracy if no fault)
Trailer reads source operands from the RVQ
Trailer verifies source operands at commit
RVP enables perfect IPC: no stalls
Cost: extra communication overhead
Benefit: overall reduced dynamic and leakage power
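
A minimal Python sketch of the checking flow described on this slide, under simplifying assumptions: each RVQ entry is modeled as a hypothetical tuple (operation, destination register, source registers, source values sent by the leader), and the trailer keeps its own architectural register file as a dictionary. This is an illustration only, not the authors' hardware design.

```python
import operator

# Hypothetical three-operation instruction set for the sketch.
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def run_trailer(rvq, arch_regs):
    """Execute RVQ entries in order; return the index of the first mismatch, or None.

    Source values come from the RVQ, so the trailer never waits on a producer.
    At commit, each predicted source value is checked against the trailer's
    own register file before the trailer's result is written back.
    """
    for index, (op, dest, src_regs, src_values) in enumerate(rvq):
        result = OPS[op](*src_values)  # execute with leader-supplied operands
        for reg, predicted in zip(src_regs, src_values):
            if arch_regs[reg] != predicted:  # commit-time verification
                return index
        arch_regs[dest] = result
    return None


regs = {"r1": 2, "r2": 3}
clean = [("add", "r3", ("r1", "r2"), (2, 3)),
         ("mul", "r4", ("r3", "r1"), (5, 2))]
print(run_trailer(clean, regs))  # no mismatch on a fault-free run

faulty_regs = {"r1": 2, "r2": 3}
faulty = [("add", "r3", ("r1", "r2"), (2, 3)),
          ("mul", "r4", ("r3", "r1"), (7, 2))]  # leader's r3 value is corrupted
print(run_trailer(faulty, faulty_regs))
```

In the second run the leader's corrupted value for r3 disagrees with the value the trailer computed itself, so the mismatch is caught when the consuming instruction commits.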

16
ST-P-CRTR

Single-thread workloads

[Diagram: Processor 1 (out-of-order) runs Leading 1; Processor 2 (in-order) runs Trailing 1; the two are linked by the LVQ, BOQ, and RVQ.]
17
MT-P-CRTR

Multi-threaded workloads

[Diagram: Processor 1 (out-of-order) runs Leading 1 and Leading 2; Processor 2 (in-order) runs Trailing 1; Processor 3 (in-order) runs Trailing 2. Each trailing processor is linked to Processor 1 by its own LVQ, BOQ, and RVQ.]
18
Eager Register Release
Original Code:
lr3 <- lr1, lr2
lr5 <- lr3, lr4
Branch to x
lr3 <- ...

Renamed Code:
pr21 <- pr8, pr11
pr15 <- pr21, pr12
Branch to x
pr29 <- ...

lr3 has 2 mappings: new (pr29) and old (pr21)
pr21 cannot be released until the branch resolves

Eager Register Release
Involves releasing the older physical register once the logical register has been overwritten and all consumers have read the old value
Requires a mechanism to store the released state elsewhere
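
A minimal Python sketch of the renaming example on this slide. The physical register numbers are supplied by hand to mirror the slide rather than drawn from a free list, and the source operands of the last instruction are not given in the slide, so they are left empty here. The `rename` helper is hypothetical.

```python
def rename(program, initial_map):
    """Rename logical registers and record the mapping each write displaces.

    program: list of (dest_logical, source_logicals, new_physical) tuples.
    Returns (renamed, final_map), where each renamed entry is
    (new_physical, physical_sources, displaced_physical_or_None).
    """
    rename_map = dict(initial_map)
    renamed = []
    for dest, sources, new_phys in program:
        phys_sources = [rename_map[s] for s in sources]
        displaced = rename_map.get(dest)  # old mapping of the destination
        rename_map[dest] = new_phys
        renamed.append((new_phys, phys_sources, displaced))
    return renamed, rename_map


initial_map = {"lr1": "pr8", "lr2": "pr11", "lr4": "pr12"}
program = [
    ("lr3", ["lr1", "lr2"], "pr21"),
    ("lr5", ["lr3", "lr4"], "pr15"),
    # the branch to x sits here in the slide's example
    ("lr3", [], "pr29"),  # sources not given in the slide
]
renamed, final_map = rename(program, initial_map)
for entry in renamed:
    print(entry)
```

The last entry reports pr21 as the displaced mapping of lr3. A conventional design holds pr21 until the instruction that overwrites lr3 commits, whereas eager release frees it as soon as its consumers are done, relying on a saved copy for recovery.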

19
Implementation Details

Need to keep track of various states for each physical register in a Usage Table:
Bit that tracks whether the logical register value has been overwritten
RVQ address/register id in the trailing thread
Counters for each physical register to track pending consumers
Modification in the ROB to initiate recovery upon mispredict
Non-trivial complexity and overheads
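
The bookkeeping listed on this slide might be sketched as follows. This is a software illustration under assumptions, not the authors' hardware: the class and method names are hypothetical, the RVQ is modeled as a plain dictionary, and a register is treated as releasable only once it is overwritten, has no pending consumers, and has a known RVQ location to recover from.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UsageEntry:
    pending_consumers: int = 0       # consumers that have not yet read the value
    overwritten: bool = False        # logical register has been renamed again
    rvq_index: Optional[int] = None  # where a copy of the value lives in the RVQ
    released: bool = False


class UsageTable:
    def __init__(self, num_phys_regs):
        self.entries = [UsageEntry() for _ in range(num_phys_regs)]
        self.free_list = []

    def _try_release(self, preg):
        e = self.entries[preg]
        if (e.overwritten and e.pending_consumers == 0
                and e.rvq_index is not None and not e.released):
            e.released = True
            self.free_list.append(preg)  # eagerly returned to the free list

    def add_consumer(self, preg):
        self.entries[preg].pending_consumers += 1

    def consumer_done(self, preg):
        self.entries[preg].pending_consumers -= 1
        self._try_release(preg)

    def mark_overwritten(self, preg):
        self.entries[preg].overwritten = True
        self._try_release(preg)

    def value_sent_to_rvq(self, preg, rvq_index):
        self.entries[preg].rvq_index = rvq_index
        self._try_release(preg)

    def copy_back(self, preg, rvq):
        """On mispredict recovery, fetch an eagerly released value from the RVQ."""
        e = self.entries[preg]
        return rvq[e.rvq_index] if e.released else None


table = UsageTable(4)
table.add_consumer(1)
table.value_sent_to_rvq(1, rvq_index=7)
table.mark_overwritten(1)   # still one pending consumer, so not released yet
table.consumer_done(1)      # last consumer finished, so the register is released
print(table.free_list, table.copy_back(1, {7: 42}))
```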

20
Evaluation Methodology

SimpleScalar-3.0 (modified for CMP/SMT) for performance analysis and Wattch for processor power
eCacti-3.0 to model register file power and area overheads
SPEC2k Int and FP benchmark suites
16 benchmarks for single-thread experiments
10 pairs of high/low IPC and Int/FP combinations for multi-thread experiments
Evaluated all RMT models for a comprehensive analysis of all combinations of leading/trailing threads
RVQ size: 600 entries

21
Performance Evaluation
22
Effect of Register File Size - SRTR
ROB size: 160
23
Effect of Register File Size - ST-P-CRTR
24
Effect of Register File Size - CRTR
25
Effect of Register File Size - MT-P-CRTR
26
Effect of Register File Size

For SRTR, CRTR, MT-P-CRTR:
Performance of a 100-entry RF with ER is the same as the baseline with 160 entries (37.5% size reduction)
Performance improvement of 34% in a 100-entry RF with ER compared to the baseline with 100 entries
For ST-P-CRTR:
Performance of a 50-entry register file with ER is the same as the baseline with 80 entries (37.5% size reduction)
Performance improvement of 12% in a 100-entry RF with ER compared to the baseline with 100 entries
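
The 37.5 size-reduction figure quoted for both configurations follows directly from the register file sizes on this slide. A short Python check (the helper name is hypothetical):

```python
def size_reduction_percent(baseline_entries, reduced_entries):
    """Percentage reduction in register file entries relative to the baseline."""
    return 100.0 * (baseline_entries - reduced_entries) / baseline_entries


print(size_reduction_percent(160, 100))  # SRTR, CRTR, MT-P-CRTR: 160 -> 100 entries
print(size_reduction_percent(80, 50))    # ST-P-CRTR: 80 -> 50 entries
```

Both calls give 37.5, matching the slide.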

27
Observations

More favorable to models where the leading thread co-executes with another leading/trailing thread
Most FP benchmarks perform better with ER (greater than 20% improvement)
Int benchmarks that have poor branch prediction rates do not benefit much (gcc, equake, eon, etc.: up to 3%)

28
Performance Overheads

For a 100 million instruction single-thread execution:
70 million registers are released eagerly
6% are copied back upon mispredict recovery
Cost of copying back depends on the program's mispredict rate
Each mispredict requires copying back 6.6 values on average
Cost of copying can possibly be hidden within the branch recovery time
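
A back-of-envelope Python calculation relating the figures on this slide. It rests on two assumptions that the slide does not state explicitly: that the run covers 100 million instructions, and that the figure 6 is a percentage of the eagerly released registers. The implied mispredict count is derived here for illustration and is not a number reported in the presentation.

```python
instructions = 100_000_000         # assumed unit: instructions
released_eagerly = 70_000_000      # from the slide
copy_back_percent = 6              # assumed to be a percentage of released registers
values_per_mispredict = 6.6        # from the slide

copied_back = released_eagerly * copy_back_percent // 100
implied_mispredicts = copied_back / values_per_mispredict
mispredicts_per_kilo_instr = implied_mispredicts / instructions * 1000

print(copied_back)                          # registers copied back in total
print(round(implied_mispredicts))           # roughly 636 thousand
print(round(mispredicts_per_kilo_instr, 2)) # a little over 6 per 1000 instructions
```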

29
Performance Overheads
Max IPC loss for a 5-cycle overhead is 4%
30
Power/Area Analysis
8 Rd/4 Wr ports assumed for ST RF
16 Rd/8 Wr ports assumed for MT RF
31
Power/Area Analysis

Single-thread RF of size 50 with ER, compared to a baseline RF of size 80:
Improves clock speed by 19%
Consumes 11% less energy and 25% less area
If SEC-DED ECC is implemented on the baseline register file:
6% energy increase and 16% area increase
A smaller RF can help afford ECC, even for multiple-bit soft error resilience

32
Fault-Injection Analysis

Modified SimpleScalar for fault analysis
Conservative analysis, as masking effects cannot be modeled
Every 1000 cycles, a register bit is flipped in the trailing register file
Only 0.0004% of faults go undetected
On average, 99% of the time a logical register is rewritten within an interval of fewer than 100 instructions
Ensures that slack is less than the RVQ size
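
The injection schedule on this slide can be sketched in Python as below. The slide only says that a register bit is flipped every 1000 cycles; choosing the register and bit uniformly at random, the 64-bit register width, and the function names are all assumptions made for this illustration, and the sketch does not model detection.

```python
import random


def flip_bit(value, bit, width=64):
    """Return value with one bit inverted, kept within the register width."""
    return (value ^ (1 << bit)) & ((1 << width) - 1)


def inject_faults(regfile, total_cycles, interval=1000, width=64, seed=0):
    """Flip one randomly chosen bit in the register file every `interval` cycles.

    Mutates regfile in place and returns a log of (cycle, register, bit).
    """
    rng = random.Random(seed)
    log = []
    for cycle in range(interval, total_cycles + 1, interval):
        reg = rng.randrange(len(regfile))
        bit = rng.randrange(width)
        regfile[reg] = flip_bit(regfile[reg], bit, width)
        log.append((cycle, reg, bit))
    return log


trailing_regfile = [0] * 8
fault_log = inject_faults(trailing_regfile, total_cycles=5000)
print(len(fault_log))  # one injection per 1000 cycles
```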

33
Conclusions and Future Work

RMT model is very suitable for Eager Register Release
A 100-entry RF can match the throughput of a 160-entry file and shows a 34% improvement over the 100-entry baseline
Fault-coverage reduction is marginal: 0.0004%
Enables a smaller RF for lower power, higher clock speed, and lower area overheads
Enables reliability by making ECC affordable
Nontrivial implementation overheads
Need to explore a complexity-effective solution