Title: Hardware Fault Tolerance Through Simultaneous Multithreading part 3


1
Hardware Fault Tolerance Through Simultaneous
Multithreading (part 3)
  • Jonathan Winter

2
3 SMT Fault Tolerance Papers
  • Eric Rotenberg, "AR-SMT - A Microarchitectural
    Approach to Fault Tolerance in Microprocessors",
    Symposium on Fault-Tolerant Computing, 1999.
  • Steven K. Reinhardt and Shubhendu S. Mukherjee,
    "Transient Fault Detection via Simultaneous
    Multithreading", ISCA 2000.
  • Shubhendu S. Mukherjee, Michael Kontz and Steven
    K. Reinhardt, "Detailed Design and Evaluation of
    Redundant Multithreading Alternatives", ISCA 2002.

3
Outline
  • Background
      • SMT
      • Hardware fault tolerance
  • AR-SMT
      • Basic mechanisms
      • Implementation issues
      • Simulation and Results
  • Transient Fault Detection via SMT
      • Sphere of replication
      • Basic mechanisms
      • Comparison to AR-SMT
      • Simulation and Results
  • Redundant Multithreading Alternatives
      • Realistic processor implementation
      • CRT
      • Simulation and Results
  • Fault Recovery
  • Future Lectures?

4
Sphere of Replication
  • Size of sphere of replication
  • Two alternatives: with and without the register file
    inside the sphere
  • Instruction and data caches kept outside

5
Redundant Multithreading Alternatives
  • Discusses real-world fault-tolerant processors
  • Evaluates SRT on a more realistic and detailed
    processor than the previous paper
  • Proposes Chip-level Redundant Threading (CRT)
  • Detailed simulation results with a new metric:
    relative SMT-Efficiency

6
Real World SMT, CMP, and FT
  • Simulated processor based on Compaq Alpha Araña
    (a.k.a. 21464 or EV8)
  • IBM Power4 and HP Mako are 2-way CMPs
  • Compaq Himalaya uses multi-chip lockstepping
  • IBM S/390 G5 uses on-chip lockstepping

7
Detailed Processor Description
  • 8-issue SMT processor with 4 hardware contexts
  • IBOX fetches chunks of 8 instructions and
    forwards them to the PBOX
  • Complex branch prediction mechanism
  • Line predictor
  • Branch predictor, jump target predictor, and
    return address stack

8
Detailed Processor Description (part 2)
  • PBOX performs initial processing
  • Register renaming and partial decoding
  • Maintains tables for recovery from
    mispredictions
  • QBOX issues instructions out-of-order to the
    EBOX, FBOX, or MBOX
  • Retires instructions and commits architectural
    state in program order
  • Consists of instruction queue, in-flight table,
    and completion unit
  • MBOX conducts loads and stores
  • Load and store queues divided between threads
  • Available queue space is very small per thread

9
SRT on Detailed Processor
  • Input replication uses LVQ variant that allows
    out-of-order load issue from trailing thread
  • Output comparison is the same as in the original
    SRT design
  • A suggested improvement gives each thread its own
    store queue (SQ)
  • PBOX storage structures made per-thread to avoid
    deadlock situations
  • Branch outcome queue converted to line prediction
    queue
  • Preferential space redundancy (PSR) implemented
    to better cover permanent faults
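
The input-replication idea on this slide can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the paper's hardware design: the class name `LoadValueQueue`, the per-load `tag`, and the dictionary storage are all hypothetical. The leading thread records each load's address and value; the trailing thread looks its load up by tag, which is what lets it issue loads out of order, and an address mismatch is treated as a detected fault.

```python
class LoadValueQueue:
    """Toy model of a load value queue (LVQ) with tag-based lookup."""

    def __init__(self):
        self._entries = {}  # tag -> (address, value); layout is an assumption

    def leading_load(self, tag, address, value):
        # The leading thread performs the real cache access and records it.
        self._entries[tag] = (address, value)

    def trailing_load(self, tag, address):
        # The trailing thread does not access the cache: it consumes the
        # recorded value and checks that it computed the same address.
        recorded_address, value = self._entries.pop(tag)
        if recorded_address != address:
            raise RuntimeError(f"fault detected on load {tag}: address mismatch")
        return value


lvq = LoadValueQueue()
lvq.leading_load(tag=0, address=0x100, value=7)
lvq.leading_load(tag=1, address=0x108, value=9)

# The trailing thread issues its loads out of order (tag 1 before tag 0).
out_of_order = [lvq.trailing_load(1, 0x108), lvq.trailing_load(0, 0x100)]
```

Because both threads see the same recorded value, the two copies cannot diverge on a load even if memory changes between the two executions.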

10
Chip-level Redundant Threading
  • Each core executes a leading thread and a trailing
    thread from different programs
  • LVQ and line prediction queue must forward data
    to the other processor's trailing thread
  • Store buffer must receive retired stores from
    the other processor for comparison
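
The store comparison described here can be illustrated with a small Python sketch. It is a simplification under assumptions: `StoreComparator`, its method names, and the in-order FIFO matching are hypothetical, and the sketch does not model which core hosts the comparison or the cross-core forwarding delay. The point it shows is that a store leaves the sphere of replication only after both redundant copies agree on address and value.

```python
from collections import deque


class StoreComparator:
    """Toy output comparison: a store commits only if both copies agree."""

    def __init__(self):
        self._pending = deque()  # retired leading-thread stores, program order
        self.memory = {}         # stands in for memory outside the sphere

    def leading_store(self, address, value):
        # Hold the leading thread's store until the trailing copy arrives.
        self._pending.append((address, value))

    def trailing_store(self, address, value):
        # Compare against the oldest unverified leading store.
        expected = self._pending.popleft()
        if expected != (address, value):
            raise RuntimeError("fault detected: store mismatch")
        self.memory[address] = value  # verified, safe to commit


checker = StoreComparator()
checker.leading_store(0x200, 42)   # from the core running the leading thread
checker.trailing_store(0x200, 42)  # from the other core in a CRT pairing
```

Only stores are compared, which is one way to read the later claim that CRT checks much less information than lockstepping.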

11
CRT Advantages
  • CRT checks much less information to detect faults
    than lockstepped processors
  • Lockstep fault detection circuitry is on the
    critical path for cache misses
  • CRT executes threads more efficiently because of
    SMT dynamic scheduling on each processor

12
Simulation Environment
  • Asim performance model framework used
  • Simulates processor like Alpha 21464
  • All 18 SPEC CPU95 benchmarks used
  • Combinations of SPEC used for multi-program
    simulations
  • Lockstepped processor simulated with zero fault
    detection delay (Lock0) and with 8-cycle delay
    (Lock8)
  • SRT architecture simulated with delays for
    forwarding line predictions and load values
  • Extra delays for CRT architecture

13
SMT-Efficiency
  • SMT-Efficiency (SMT-E) used instead of IPC
  • SMT-E of an individual thread is the IPC of that
    thread in SMT mode divided by its IPC when running
    alone (single-thread mode) on the same SMT processor
  • Overall SMT-E is the arithmetic mean of the
    individual SMT-Efficiencies
  • A. Snavely and D. M. Tullsen, "Symbiotic Job
    Scheduling for a Simultaneous Multithreading
    Processor", ASPLOS 2000
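
The metric defined on this slide is simple enough to write out directly. The sketch below follows the slide's definition; the function name `smt_efficiency` and the example IPC values are made up for illustration and are not results from the paper.

```python
def smt_efficiency(ipc_smt, ipc_alone):
    """Return (per-thread SMT-E values, overall SMT-E).

    ipc_smt[i]   : IPC of thread i when sharing the SMT processor
    ipc_alone[i] : IPC of thread i running alone on the same processor
    """
    if not ipc_smt or len(ipc_smt) != len(ipc_alone):
        raise ValueError("need one single-thread IPC per SMT-mode IPC")
    per_thread = [smt / alone for smt, alone in zip(ipc_smt, ipc_alone)]
    overall = sum(per_thread) / len(per_thread)  # arithmetic mean
    return per_thread, overall


# Hypothetical numbers: two threads that each reach IPC 2.0 when alone.
per_thread, overall = smt_efficiency([1.0, 1.5], [2.0, 2.0])
```

Normalizing each thread by its own single-thread IPC keeps a naturally high-IPC benchmark from dominating the average, which raw IPC would allow.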

14
SMT-Speedup ( SMT-Efficiency)
  • Y. Sazeides and T. Juan, "How to Compare the
    Performance of Two SMT Microarchitectures",
    ISPASS 2001

15
Preferential Space Redundancy
  • Without PSR, 65% of instructions execute on the same
    functional unit as their redundant copy
  • With PSR, only 0.06% of instructions run on the
    same unit
  • No performance degradation is experienced
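
One way to picture the "preferential" part of PSR is as a selection rule applied when the trailing copy of an instruction is scheduled. The sketch below is an assumption-laden simplification: the function `pick_unit` and the flat list of unit numbers are hypothetical, and the real design works on processor structures rather than a Python list. It prefers a unit different from the one the leading copy used and falls back to the same unit only when nothing else is free.

```python
def pick_unit(free_units, leading_unit):
    """Choose a functional unit for the trailing copy of an instruction.

    Prefer any free unit other than the one used by the leading copy;
    reuse that unit only if it is the sole free unit. Returns None when
    no unit is free.
    """
    for unit in free_units:
        if unit != leading_unit:
            return unit
    return leading_unit if leading_unit in free_units else None


choice_with_alternatives = pick_unit([0, 1, 2], leading_unit=0)
choice_without_alternatives = pick_unit([0], leading_unit=0)
```

Under this reading, a preference rather than a hard constraint would leave a small residue of instruction pairs on the same unit, while avoiding stalls that a strict rule could introduce.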

16
SRT One Logical Thread
  • SRT is 32% slower than a single thread on SMT
  • SRT is 11% faster than running two redundant copies
  • Degradation is 30% with per-thread store queue
  • Best-case degradation is 26% with oracle store queue

17
SRT Two Logical Threads
  • Degradation of SRT is 40%
  • Per-thread store queue gives 32% degradation
  • Store lifetime drops from 44 cycles to vs. 39 for
    single thread
  • Oracle store queue gives 5% better efficiency

18
Chip-level Redundant Threading
  • With one logical thread, CRT performs similarly
    to lockstepping
  • With two logical threads, CRT beats Lock0 and
    Lock8 by 10% and 2% respectively
  • Adding the per-thread store queue causes CRT to
    beat Lock8 by 13% on average (22% maximum)
  • Using an oracle store queue improves performance
    by 6% more

19
CRT with Four Logical Threads
  • Initial CRT configuration is no better than Lock8
  • Adding per-thread store queue gives CRT 13%
    better performance than Lock8
  • Using an oracle store queue improves performance
    only by another 2%

20
Conclusions
  • The benefits of SRT are not as great as in the
    original paper when using a detailed model
  • 30% and 32% degradation seen on single-thread and
    multithread workloads
  • SRT methods can be used to detect permanent
    faults
  • Chip-level redundant threading gives improved
    performance over lockstepped processors
  • Overall, CRT provided a 13% improvement

21
Transient Fault Recovery
  • AR-SMT suggests that the R-stream could be used
    as a checkpoint for recovery
  • SRT suggests checkpoint/restart or failover
  • Argues that since faults are infrequent, they will
    have a minor impact on performance

22
Future Lectures?
  • Hardware Transient Fault Recovery
  • T.N. Vijaykumar, Irith Pomeranz, and Karl Cheng,
    "Transient-Fault Recovery Using Simultaneous
    Multithreading", ISCA 2002
  • Mohamed Gomaa, Chad Scarbrough, T.N. Vijaykumar,
    and Irith Pomeranz, "Transient-Fault Recovery for
    Chip Multiprocessors", ISCA 2003
  • Slipstream Processors (an AR-SMT extension)
  • Karthik Sundaramoorthy, Zach Purser, and Eric
    Rotenberg, "Slipstream Processors: Improving both
    Performance and Fault Tolerance", ASPLOS 2000
  • Khaled Z. Ibrahim, Gregory T. Byrd, and Eric
    Rotenberg, "Slipstream Execution Mode for
    CMP-Based Multiprocessors", HPCA 2003