Exploiting Eager Register Release in a Redundantly Multi-threaded Processor

1
Exploiting Eager Register Release in a
Redundantly Multi-threaded Processor
  • Niti Madan
  • Rajeev Balasubramonian
  • University of Utah

2
Introduction
  • Rising soft error rates due to shrinking
    transistor sizes and lower supply voltages
  • Existing Solutions
    • Process level: SOI
    • Circuit level: Rad-hard cells, ECC, BISER
    • Architecture level:
      • Redundant Multithreading
      • Reducing the time useful state spends in
        unprotected structures
      • Software assisted fault tolerance

3
Introduction
  • CMPs/SMTs enable redundant multi-threading (RMT)
  • Detailed Design and Evaluation of Redundant
    Multithreading Alternatives, ISCA 2002
  • 2 processors/threads execute the same program

4
Chip-level Redundant Multi-threading (CRTR)
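[Figure: two out-of-order processors. Processor 1 runs
leading thread 1 and trailing thread 2; Processor 2 runs
trailing thread 1 and leading thread 2. Branch outcomes,
loads, register values and stores pass from each leading
thread to its trailing thread, which lags behind the
leading thread by some slack.]
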
5
Motivation
  • Register file is already a critical resource
    • impacts ILP
    • impacts cycle time
    • impacts peak temperature
  • Multiple threads increase pressure on register
    file

6
Motivation
  • Out-of-order processors are "conservative" since
    they must preserve correctness
    • Example: registers are de-allocated
      conservatively
  • Having a trailing thread allows the leading
    thread to be aggressive
    • improves the performance of the leading thread
    • trailer state can be used for ensuring
      correctness
    • some errors may go undetected

7
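[Figure: Leading 1 on Processor 1 and Trailing 1 on
Processor 2, connected by the RVQ. In the leading thread,
a write to lr5 is mapped to R1, followed by a branch and
a second write to lr5 mapped to R2. The branch is then
mispredicted.]
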
8
[Figure: Leading 1 on Processor 1 and Trailing 1 on
Processor 2, connected by the RVQ. A soft error strikes
the copy of R1 used for mispredict recovery, and the
fault propagates.]
  • Very few errors slip through: slack is most of
    the time less than the RVQ size

9
Our Approach
  • RMT processor has duplicate register value state
    in RVQ/trailer's state
  • Improve register file efficiency using
    Eager Register Release
  • Smaller register file size can deliver same
    performance using above technique
    • Reduced power
    • Increased reliability: ECC less expensive
    • Potentially faster clock speed

10
Outline
  • Background on RMT design space
  • Proposed technique
  • Evaluation
  • Conclusions and Future Work

11
Redundant Multi-threading
  • Fault model
  • Trailer's state used for recovery
  • Does not provide complete recovery
  • Caches and Load Value Queue (LVQ) are ECC protected
  • Can detect all single event upset faults
  • Baseline RMT models include SRTR, CRTR,
    ST-P-CRTR, MT-P-CRTR

12
Baseline RMT Model
[Figure: SRTR. Leading thread 1 and trailing thread 1
share one out-of-order processor.]
  • SRTR: SMT-level RMT
  • CRTR: Chip-level RMT
  • Proposed by Mukherjee et al. ISCA 2002, Gomaa et
    al. ISCA 2002, ISCA 2003

[Figure: CRTR. Two out-of-order processors connected by
the LVQ, BOQ and RVQ. Processor 1 runs Leading 1 and
Trailing 2; Processor 2 runs Trailing 1 and Leading 2.]

13
Power-efficient RMT model
  • Our earlier work explores a power-efficient RMT
    model
    • P-CRTR (SELSE-2, Tech Report 2005)
  • Observations
    • Trailing thread doesn't suffer from D-cache
      misses and branch mispredictions
    • Trailing thread bound to have higher IPC
    • High trailer IPC enables power reduction
  • Techniques proposed for power-efficiency
    • Dynamic Frequency Scaling
    • In-order execution of trailer

14
Dynamic Frequency Scaling
  • High trailer IPC enables frequency reduction
  • Reduce trailer's frequency to match the leader's
    throughput
  • Reduction in trailer's dynamic power
  • Does not impact trailer's leakage power
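
The frequency-matching idea above can be sketched as a small calculation. This is only an illustration under stated assumptions: throughput is taken as IPC times frequency, dynamic power is assumed to scale linearly with frequency at a fixed supply voltage, and the function names and the numbers (4 GHz, IPC of 1.0 and 2.0) are hypothetical, not taken from the slides.

```python
def trailer_frequency(leader_freq_ghz, leader_ipc, trailer_ipc):
    """Lowest trailer frequency that still matches the leader's throughput.

    Throughput = IPC * frequency, so the trailer keeps up when
    trailer_ipc * f_trailer >= leader_ipc * f_leader.
    The result is capped at the leader's frequency.
    """
    if trailer_ipc <= 0:
        raise ValueError("trailer_ipc must be positive")
    return min(leader_freq_ghz, leader_freq_ghz * leader_ipc / trailer_ipc)


def dynamic_power_ratio(trailer_freq_ghz, leader_freq_ghz):
    """Trailer dynamic power relative to running at the leader's frequency.

    Assumes dynamic power is proportional to frequency (fixed voltage).
    Leakage power is not affected by frequency scaling and is not modeled.
    """
    return trailer_freq_ghz / leader_freq_ghz


# Hypothetical numbers: the trailer sustains twice the leader's IPC.
f = trailer_frequency(leader_freq_ghz=4.0, leader_ipc=1.0, trailer_ipc=2.0)
print(f)                            # 2.0
print(dynamic_power_ratio(f, 4.0))  # 0.5
```

Because only frequency is scaled in this sketch, the saving applies to dynamic power alone, which matches the last bullet above.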

15
In-order Execution of Checker
  • Our approach
  • Send all register values computed by leading core
    to the trailer (register value prediction, 100%
    accuracy if no fault)
  • Trailer reads source operands from RVQ
  • Trailer verifies source operands at commit
  • RVP enables perfect IPC: no stalls
  • Cost: extra communication overhead
  • Benefit: overall reduced dynamic and leakage
    power
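
A minimal sketch of the checking scheme above, assuming each RVQ entry carries the leader's two source operand values for one instruction. The instruction encoding, the `FaultDetected` exception and the register names are hypothetical and chosen only for illustration.

```python
from collections import deque
import operator


class FaultDetected(Exception):
    """Raised when a predicted operand disagrees with the trailer's state."""


def run_trailer(program, rvq, regs):
    """In-order trailer: execute with operands from the RVQ, verify at commit.

    program: list of (op, dest, src1, src2) with logical register names
    rvq:     deque of (src1_value, src2_value) sent by the leading core
    regs:    the trailer's own architectural register file (dict)
    """
    for op, dest, src1, src2 in program:
        a, b = rvq.popleft()      # predicted operands, no dependence stall
        result = op(a, b)         # execute immediately
        # Commit: check the predicted operands against the trailer's state.
        if regs[src1] != a or regs[src2] != b:
            raise FaultDetected(f"operand mismatch for {dest}")
        regs[dest] = result
    return regs


program = [(operator.add, "r3", "r1", "r2"),
           (operator.add, "r5", "r3", "r4")]
regs = {"r1": 1, "r2": 2, "r3": 0, "r4": 4, "r5": 0}

# Fault-free run: the leader's values agree with the trailer's results.
print(run_trailer(program, deque([(1, 2), (3, 4)]), dict(regs))["r5"])  # 7

# A corrupted r3 value in the leader shows up at the consumer's commit.
try:
    run_trailer(program, deque([(1, 2), (7, 4)]), dict(regs))
except FaultDetected as err:
    print("detected:", err)
```

In this sketch an erroneous leader result is caught when a later instruction that consumes it commits in the trailer, which is why the operands rather than the results are compared.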

16
ST-P-CRTR
  • Single thread workloads

[Figure: Processor 1 (out-of-order) runs Leading 1;
Processor 2 (in-order) runs Trailing 1. The two are
connected by the LVQ, BOQ and RVQ.]

17
MT-P-CRTR
  • Multi-threaded Workloads

[Figure: Processor 1 (out-of-order) runs Leading 1 and
Leading 2. Processor 2 (in-order) runs Trailing 1 and
Processor 3 (in-order) runs Trailing 2. Each trailer is
connected to Processor 1 by its own LVQ, BOQ and RVQ.]

18
Eager Register Release
Original Code:
    lr3 <- lr1, lr2
    lr5 <- lr3, lr4
    Branch to x
    lr3 <- ...
Renamed Code:
    pr21 <- pr8, pr11
    pr15 <- pr21, pr12
    Branch to x
    pr29 <- ...
lr3 has 2 mappings: new pr29 and old pr21. pr21
cannot be released until the branch resolves.
  • Eager Register Release
  • Involves releasing older physical register after
    the value is rewritten and used by all consumers
  • Requires a mechanism to store the released state
    elsewhere
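
The renaming example on this slide can be reproduced with a small sketch. It assumes a simple rename table and free list; the `rename` helper is hypothetical, and the source operands of the last write to lr3 are not given on the slide, so they are left empty here.

```python
def rename(code, rat, free_list):
    """Rename logical registers and record each destination's old mapping.

    code:      list of (dest, [srcs]); dest is None for a branch
    rat:       initial logical-to-physical register map
    free_list: physical registers available for allocation, in order
    Returns a list of (new_dest, [phys_srcs], old_mapping_of_dest).
    """
    rat = dict(rat)
    free = list(free_list)
    renamed = []
    for dest, srcs in code:
        phys_srcs = [rat[s] for s in srcs]
        if dest is None:                  # branch: no destination register
            renamed.append((None, phys_srcs, None))
            continue
        old = rat.get(dest)               # mapping that this write replaces
        new = free.pop(0)
        rat[dest] = new
        renamed.append((new, phys_srcs, old))
    return renamed


code = [("lr3", ["lr1", "lr2"]),
        ("lr5", ["lr3", "lr4"]),
        (None, []),                       # Branch to x
        ("lr3", [])]                      # sources not shown on the slide
rat = {"lr1": "pr8", "lr2": "pr11", "lr4": "pr12"}
free_list = ["pr21", "pr15", "pr29"]

for new, srcs, old in rename(code, rat, free_list):
    print(new, srcs, old)
```

The last tuple pairs the new mapping pr29 with the old mapping pr21. A conventional processor holds pr21 until the branch resolves, whereas eager release frees it once its only consumer (the write to pr15) has read it.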

19
Implementation Details
  • Need to keep track of various states for each
    physical register in a Usage Table
    • Bit that tracks if logical register value is
      overwritten
    • RVQ address/register id in trailing thread
  • Counters for each physical register
    • To track pending consumers
  • Modification in ROB to initiate recovery upon
    mispredict
  • Non-trivial complexity and overheads
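
One possible shape for the Usage Table described above, as a rough sketch rather than the authors' design. The class and method names are hypothetical, the RVQ is modeled as a plain dict from address to value, and re-allocation of a released register is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UsageEntry:
    overwritten: bool = False        # logical register has a newer mapping
    pending_consumers: int = 0       # renamed readers that have not read yet
    rvq_addr: Optional[int] = None   # where the trailer-side copy is kept
    released: bool = False


class UsageTable:
    def __init__(self):
        self.entries = {}
        self.free_list = []

    def allocate(self, preg, rvq_addr):
        self.entries[preg] = UsageEntry(rvq_addr=rvq_addr)

    def add_consumer(self, preg):
        self.entries[preg].pending_consumers += 1

    def consumer_done(self, preg):
        self.entries[preg].pending_consumers -= 1
        self._try_release(preg)

    def mark_overwritten(self, preg):
        self.entries[preg].overwritten = True
        self._try_release(preg)

    def _try_release(self, preg):
        # Eager release: overwritten and no pending consumers,
        # without waiting for the overwriting instruction to commit.
        e = self.entries[preg]
        if e.overwritten and e.pending_consumers == 0 and not e.released:
            e.released = True
            self.free_list.append(preg)

    def recover(self, preg, rvq):
        """On a mispredict, return the value to copy back, or None if live."""
        e = self.entries[preg]
        e.overwritten = False
        if e.released:
            return rvq[e.rvq_addr]
        return None


table = UsageTable()
table.allocate("pr21", rvq_addr=0)
table.add_consumer("pr21")       # the instruction writing pr15 reads pr21
table.mark_overwritten("pr21")   # lr3 is remapped to pr29
print(table.free_list)           # []  (consumer still pending)
table.consumer_done("pr21")
print(table.free_list)           # ['pr21']
print(table.recover("pr21", {0: 99}))  # 99
```

The sketch makes the overheads noted above visible: every rename, operand read and mispredict touches the table, and recovery needs a path from the RVQ back into the register file.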

20
Evaluation Methodology
  • SimpleScalar-3.0 (modified for CMP/SMT) for
    performance analysis and Wattch for processor
    power
  • eCacti-3.0 to model register file power and area
    overheads
  • SPEC2k Int and FP benchmark suite
    • 16 benchmarks for single thread experiments
    • 10 pairs of High/Low IPC and Int/FP combinations
      for multi-thread experiments
  • Evaluated all RMT models for comprehensive
    analysis of all combinations of leading/trailing
    threads
  • RVQ size: 600 entries

21
Performance Evaluation
22
Effect of Register File Size - SRTR
ROB size: 160
23
Effect of Register File Size - ST-P-CRTR
24
Effect of Register File Size - CRTR
25
Effect of Register File Size - MT-P-CRTR
26
Effect of Register File Size
  • For SRTR, CRTR, MT-P-CRTR
    • Performance of 100 size RF with ER same as
      baseline with 160 size (37.5% size reduction)
    • Performance improvement of 34% in 100 size RF
      with ER compared to baseline with 100 size
  • For ST-P-CRTR
    • Performance of 50 size register file with ER same
      as baseline with 80 size (37.5% size reduction)
    • Performance improvement of 12% in 100 size RF
      with ER compared to baseline with 100 size
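
As a quick check of the size-reduction figures quoted above, the arithmetic can be written out. The helper name is hypothetical; the register file sizes are the ones on the slide.

```python
def size_reduction_pct(baseline_entries, reduced_entries):
    """Percentage reduction in register file entries."""
    return 100.0 * (baseline_entries - reduced_entries) / baseline_entries


print(size_reduction_pct(160, 100))  # 37.5  (SRTR, CRTR, MT-P-CRTR)
print(size_reduction_pct(80, 50))    # 37.5  (ST-P-CRTR)
```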

27
Observations
  • More favorable to models where leading thread
    co-executes with another leading/trailing thread
  • Most FP benchmarks perform better with ER
    (greater than 20% improvement)
  • Int benchmarks that have poor branch prediction
    rates do not benefit much (gcc, equake, eon etc.,
    up to 3%)

28
Performance Overheads
  • For a 100 million instruction single thread
    execution
    • 70 million registers are released eagerly
    • 6% copied back upon mispredict recovery
  • Cost of copying back dependent upon program
    mispredict rate
    • Each mispredict requires 6.6 copy back values
  • Cost of copying can possibly be hidden within
    branch recovery time
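
The overhead figures above can be related with some back-of-the-envelope arithmetic. This is a sketch under loud assumptions: the "6" is read as 6% of the eagerly released registers, and the implied number of mispredict recoveries is an inference from the quoted averages, not a figure reported on the slide.

```python
released_eagerly = 70_000_000   # registers released eagerly (from the slide)
copied_back_fraction = 0.06     # ASSUMPTION: "6" means 6% of released registers
copy_backs_per_mispredict = 6.6 # from the slide

copied_back = released_eagerly * copied_back_fraction
implied_recoveries = copied_back / copy_backs_per_mispredict

print(f"{copied_back / 1e6:.1f} million values copied back")    # 4.2 million
print(f"{implied_recoveries:,.0f} implied mispredict recoveries")  # 636,364
```

Under these assumptions most eagerly released registers are never needed again, and the copy-back cost grows with the program's mispredict rate, as the slide notes.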

29
Performance Overheads
Max IPC loss for 5-cycle overhead is 4%
30
Power/Area Analysis
8 Rd/4 Wr ports assumed for ST RF
16 Rd/8 Wr ports assumed for MT RF
31
Power/Area Analysis
  • Single thread RF size 50 with ER compared to
    baseline RF size 80 can
    • Improve clock speed by 19%
    • Consume 11% less energy and 25% less area
  • If SEC-DED ECC is implemented on baseline
    register file
    • 6% energy increase and 16% area increase
  • Smaller RF can help afford ECC for even multiple
    bit soft error resilience

32
Fault-Injection Analysis
  • Modified SimpleScalar for fault analysis
  • Conservative analysis as masking effects cannot
    be modeled
  • Every 1000 cycles, a register bit is flipped in
    the trailing register file
  • Only 0.0004% of faults go undetected
  • On average, 99% of the time a logical register is
    rewritten within a 100 instruction interval
  • Ensures that slack is less than RVQ size
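
The injection step described above (one bit flip every 1000 cycles) might look like the following sketch. It is not the modified SimpleScalar code: the register file is modeled as a Python list of integers, the 64-bit width and the random choice of register and bit are assumptions, and the function names are hypothetical.

```python
import random


def flip_bit(value, bit, width=64):
    """Return value with one bit inverted, truncated to the register width."""
    return (value ^ (1 << bit)) & ((1 << width) - 1)


def inject_faults(regfile, cycles, interval=1000, width=64, seed=0):
    """Flip one random bit of one random register every `interval` cycles.

    Mutates regfile in place and returns the list of (cycle, reg, bit)
    injections so each fault can later be classified as detected or not.
    """
    rng = random.Random(seed)
    injected = []
    for cycle in range(interval, cycles + 1, interval):
        reg = rng.randrange(len(regfile))
        bit = rng.randrange(width)
        regfile[reg] = flip_bit(regfile[reg], bit, width)
        injected.append((cycle, reg, bit))
    return injected


print(bin(flip_bit(0b1010, 1)))        # 0b1000
regs = [0] * 32
log = inject_faults(regs, cycles=5000)
print(len(log))                        # 5
```

Recording each injection makes it possible to count how many faults a checker misses, which is how a coverage figure like the one above would be obtained.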

33
Conclusions and Future Work
  • RMT model very suitable for Eager Register
    Release
  • A 100 entry RF can match the throughput of a 160
    entry file and shows 34% improvement over
    baseline
  • Fault-coverage reduction marginal: 0.0004%
  • Enables smaller RF for lower power, higher clock
    speed, lower area overheads
  • Enables reliability by making ECC affordable
  • Nontrivial implementation overheads
  • Need to explore complexity-effective solution