Hardware Fault Tolerance Through Simultaneous Multithreading (part 2)

Provided by: gregBron

1
Hardware Fault Tolerance Through Simultaneous
Multithreading (part 2)
  • Jonathan Winter

2
3 SMT Fault Tolerance Papers
  • Eric Rotenberg, "AR-SMT - A Microarchitectural
    Approach to Fault Tolerance in Microprocessors",
    Symposium on Fault-Tolerant Computing, 1999.
  • Steven K. Reinhardt and Shubhendu S. Mukherjee,
    "Transient Fault Detection via Simultaneous
    Multithreading", ISCA 2000.
  • Shubhendu S. Mukherjee, Michael Kontz and Steven
    K. Reinhardt, "Detailed Design and Evaluation of
    Redundant Multithreading Alternatives", ISCA 2002.

3
Outline
  • Background
    • SMT
    • Hardware fault tolerance
  • AR-SMT
    • Basic mechanisms
    • Implementation issues
    • Simulation and Results
  • Transient Fault Detection via SMT
    • Sphere of replication
    • Basic mechanisms
    • Comparison to AR-SMT
    • Simulation and Results
  • Redundant Multithreading Alternatives
    • Realistic processor implementation
    • CRT
    • Simulation and Results
  • Fault Recovery
  • Next Lecture

4
Transient Fault Detection via SMT
  • More detailed analysis of Simultaneous and
    Redundant Threading (SRT)
  • Introduces Sphere of Replication concept
  • Explores SRT design space
  • Discussion of input replication
  • Architecture for output comparison
  • Performance improving mechanisms
  • More depth in simulation

5
Sphere of Replication
  • Components inside sphere are protected against
    faults using replication
  • External components must use other means of fault
    tolerance (parity, ECC, etc.)
  • Inputs to sphere must be duplicated for each of
    the redundant processes
  • Outputs of the redundant processes are compared
    to detect faults
  • Simple to understand in lockstepping
  • Larger sphere
    • More state to replicate
    • Less input replication and output comparison
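
The sphere idea can be illustrated in a few lines of Python. This is only a sketch: `run_in_sphere`, `FaultDetected`, and the `copies` parameter are hypothetical names chosen for illustration, and a pure function stands in for the replicated hardware.

```python
class FaultDetected(Exception):
    """Raised when the redundant copies disagree."""


def run_in_sphere(compute, inputs, copies=2):
    """Run `compute` redundantly and compare results at the sphere boundary."""
    # Input replication: each redundant copy gets its own identical inputs.
    replicated = [list(inputs) for _ in range(copies)]
    # Redundant execution inside the sphere.
    outputs = [compute(args) for args in replicated]
    # Output comparison: any disagreement signals a fault.
    if any(out != outputs[0] for out in outputs[1:]):
        raise FaultDetected(f"outputs diverged: {outputs}")
    return outputs[0]
```

Anything `compute` reads that is not passed through `inputs` lies outside the sphere and would need its own protection, such as parity or ECC.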

6
Sphere of Replication (part 2)
  • Size of sphere of replication
  • Two alternatives: with and without the register file
  • Instruction and data caches kept outside

7
Input Replication
  • Must ensure that both threads receive the same
    inputs to guarantee they follow the same path
  • Instructions: assume no self-modification
  • Cached load data
    • Out-of-order execution issue
    • Multiprocessor cache coherence issues
  • Uncached load data: threads must synchronize
  • External interrupts
    • Stall lead thread and deliver interrupt
      synchronously
    • Record interrupt delivery point and deliver later

8
Cached Load Data - ALAB
  • Active Load Address Buffer (ALAB)
  • Delays cache block replacement or invalidation
  • ALAB is table with address tag, counter, and
    pending-invalidate bit
  • Counter tracks the trailing thread's outstanding
    loads
  • Blocks cannot be replaced or invalidated until
    counter is zero
  • Pending-invalidate set on unevictable block
  • Leading thread stalls when ALAB is full
  • Must detect and address deadlocks
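
The ALAB bookkeeping described on this slide can be sketched as a small Python model. The class and method names (`ALAB`, `leading_load`, `trailing_load`, `request_invalidate`) are hypothetical, the table is keyed by block address tag, and deadlock detection is not modeled.

```python
from dataclasses import dataclass


@dataclass
class AlabEntry:
    counter: int = 0                  # trailing-thread loads still outstanding
    pending_invalidate: bool = False  # a replacement/invalidation is waiting


class ALAB:
    """Toy model of an Active Load Address Buffer (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}             # address tag -> AlabEntry

    def leading_load(self, tag):
        """Leading thread loads from a block. Returns False if it must stall."""
        entry = self.entries.get(tag)
        if entry is None:
            if len(self.entries) >= self.capacity:
                return False          # ALAB full: leading thread stalls
            entry = self.entries[tag] = AlabEntry()
        entry.counter += 1
        return True

    def trailing_load(self, tag):
        """Trailing thread performs the matching load.

        Returns True if a deferred invalidation may now proceed.
        """
        entry = self.entries[tag]
        entry.counter -= 1
        if entry.counter == 0:
            del self.entries[tag]
            return entry.pending_invalidate
        return False

    def request_invalidate(self, tag):
        """Cache asks to replace or invalidate a block. True if allowed now."""
        entry = self.entries.get(tag)
        if entry is None:
            return True
        entry.pending_invalidate = True   # defer until the counter reaches zero
        return False
```

In this sketch a block with a nonzero counter simply refuses the invalidation and remembers it. A real design also has to break the deadlocks that such deferral can create.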

9
Cached Load Data - LVQ
  • Load Value Queue (LVQ)
  • Explicit designation of leading and trailing
    thread
  • Only leading thread issues loads and stores
  • Load addresses and values are forwarded to the
    trailing thread via LVQ
  • Trailing thread executes loads in-order and
    non-speculatively (why?)
  • Input replication guaranteed
  • Design simpler and less pressure on cache
  • Earlier fault detection
  • Constrains scheduling of trailing thread loads
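
A minimal sketch of the LVQ as a bounded FIFO follows. The names `LoadValueQueue`, `leading_load`, and `trailing_load` are hypothetical, and the address check on the trailing side is an assumption about how the queue could give earlier fault detection.

```python
from collections import deque


class FaultDetected(Exception):
    """Raised when the trailing thread's load address disagrees."""


class LoadValueQueue:
    """Toy FIFO model of a Load Value Queue (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()          # (address, value) pairs in program order

    def leading_load(self, address, value):
        """Leading thread records a load. Returns False if it must stall."""
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append((address, value))
        return True

    def trailing_load(self, address):
        """Trailing thread takes its load value from the queue, in order."""
        lead_address, value = self.queue.popleft()
        if lead_address != address:
            raise FaultDetected(
                f"load address mismatch: {lead_address:#x} vs {address:#x}"
            )
        return value
```

Because entries are consumed strictly from the head, the trailing thread's loads have to arrive in program order, which is the scheduling constraint noted on the slide.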

10
Output Comparison
  • Store buffer used to verify address and value of
    stores to be committed
  • Trailing thread searches for matching entry
  • Mismatch means fault occurred
  • Cached load values require no checking
  • Uncached load values could have side effects
  • Issue non-speculatively, so stall leading thread
  • Assumes uncached loads are always detected
  • Register Check Buffer used to match register
    writebacks.
  • 3 register files required: future files and
    architectural file

11
Enhancing SRT Performance
  • Slack Fetch
  • Maintain constant lag between the threads' execution
  • Lead thread updates branch and data predictors
  • Lead thread prefetches loads
  • Traditional SMT ICount fetch policy is modified
    to maintain slack
  • Branch Outcome Queue
  • Deliver branch outcomes directly to trailing
    thread
  • Trailing thread has no control speculation
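
The two mechanisms can be sketched together. This is a deliberate simplification: `choose_fetch_thread`, `simulate`, and `BranchOutcomeQueue` are hypothetical names, and the real policy biases ICount rather than picking a thread from instruction counts alone.

```python
from collections import deque


def choose_fetch_thread(lead_fetched, trail_fetched, slack):
    """Fetch for the leading thread until it is `slack` instructions ahead."""
    if lead_fetched - trail_fetched < slack:
        return "lead"
    return "trail"


def simulate(cycles, slack):
    """Fetch one instruction per cycle and report the largest lag seen."""
    lead = trail = max_lag = 0
    for _ in range(cycles):
        if choose_fetch_thread(lead, trail, slack) == "lead":
            lead += 1
        else:
            trail += 1
        max_lag = max(max_lag, lead - trail)
    return lead, trail, max_lag


class BranchOutcomeQueue:
    """Leading thread pushes resolved branches; trailing thread pops them."""

    def __init__(self):
        self.outcomes = deque()

    def push(self, pc, taken):
        self.outcomes.append((pc, taken))

    def pop(self):
        # The trailing thread uses this outcome instead of a prediction.
        return self.outcomes.popleft()
```

With this policy the lag climbs to `slack` and then stays within one instruction of it, and the trailing thread never fetches down a mispredicted path because it consumes outcomes the leading thread has already resolved.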

12
AR-SMT versus SRT
  • AR-SMT only has space redundancy in functional
    units
  • SRT can potentially have space redundancy across
    the pipeline
  • AR-SMT is trace processor-based while SRT is
    conventional
  • Register file of R-stream must be protected
  • AR-SMT forwards load data values
  • AR-SMT checks every instruction during fault
    detection
  • SRT requires no operating system modifications
  • AR-SMT doesn't support uncached loads and stores
    or multiprocessor coherence
  • Delay buffer performs function of register check
    buffer and branch outcome queue
  • All of main memory is in the AR-SMT sphere
    • Better fault coverage but very costly

13
Simulation Environment
  • Modified SimpleScalar sim-outorder
  • Long front-end pipeline because of out-of-order
    nature and SMT
  • Simple approximation of trace cache
  • Used 11 SPEC95 benchmarks

14
Results
  • Again, this paper only analyzes the performance
    impact of fault tolerance
  • Baseline Characterization
  • ORH-Dual: two pipelines, each with half the
    resources
  • SMT-Dual: replicated threads with no detection
    hardware
  • ORH-Dual and SMT-Dual are 32% slower than SMT-Single

15
Slack Fetch and Branch Outcome Queue
  • 10%, 14%, and 15% (27% max) performance
    improvements for SF, BOQ, and SF + BOQ
  • Reduced memory stalls through prefetching
  • Prevents trailing thread from wasting resources
    by speculating
  • Performance better with slack of 256 instructions
    over 32 or 128

16
Input Replication
  • Assumes output comparison performed by oracle
  • Almost no performance penalty paid for 64-entry
    ALAB or LVQ
  • With a 16-entry ALAB and LVQ, benchmark
    performance degraded 8% and 5% respectively

17
Output Comparison
  • Assumes inputs replicated by oracle
  • Leading thread can stall if store queue is full
  • 64-entry store buffer eliminates almost all
    stalls
  • Register check buffer of size 32, 64, and 128
    entries degrades performance by 27%, 6%, and 1%
    respectively

18
Overall Results
  • Speedup of SRT processor with slack fetch of 256
    instructions, 128-entry branch outcome queue,
    64-entry store buffer, and 64-entry load value queue
  • SRT demonstrates a 16% speedup on average (up to
    29%) over a lockstepping processor with the
    same hardware

19
Multi-cycle and Permanent Faults
  • Transient faults could potentially persist for
    multiple cycles and affect both threads
  • Increasing slack fetch decreases this possibility
  • Spatial redundancy can be increased by
    partitioning function units and forcing threads
    to execute on different groups
  • Performance loss for this approach is less than 2%

20
Conclusions
  • Sphere of replication helps analysis of input
    replication and output comparison
  • Keep register file in sphere
  • LVQ is superior to ALAB (simpler)
  • Slack fetch and branch outcome queue mechanisms
    enhance performance
  • SRT fault tolerance method performs 16% better on
    average than lockstepping