Title: Hardware Fault Tolerance Through Simultaneous Multithreading part 3
3 SMT Fault Tolerance Papers
- Eric Rotenberg, "AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors", Symposium on Fault-Tolerant Computing, 1999.
- Steven K. Reinhardt and Shubhendu S. Mukherjee, "Transient Fault Detection via Simultaneous Multithreading", ISCA 2000.
- Shubhendu S. Mukherjee, Michael Kontz, and Steven K. Reinhardt, "Detailed Design and Evaluation of Redundant Multithreading Alternatives", ISCA 2002.
Outline
- Background
  - SMT
  - Hardware fault tolerance
- AR-SMT
  - Basic mechanisms
  - Implementation issues
  - Simulation and Results
- Transient Fault Detection via SMT
  - Sphere of replication
  - Basic mechanisms
  - Comparison to AR-SMT
  - Simulation and Results
- Redundant Multithreading Alternatives
  - Realistic processor implementation
  - CRT
  - Simulation and Results
- Fault Recovery
- Future Lectures?
Sphere of Replication
- Size of the sphere of replication
  - Two alternatives: with and without the register file inside the sphere
  - Instruction and data caches are kept outside the sphere
Redundant Multithreading Alternatives
- Discusses real-world fault-tolerant processors
- Evaluates SRT on a more realistic and detailed processor than the previous paper
- Proposes Chip-level Redundant Threading (CRT)
- Detailed simulation results with a new metric
  - Relative SMT-Efficiency
Real-World SMT, CMP, and FT
- Simulated processor based on Compaq Alpha Araña (a.k.a. 21464 or EV8)
- IBM Power4 and HP Mako are 2-way CMPs
- Compaq Himalaya uses multi-chip lockstepping
- IBM S/390 G5 uses on-chip lockstepping
Detailed Processor Description
- 8-issue SMT with 4 hardware contexts
- IBOX fetches chunks of 8 instructions and forwards them to the PBOX
- Complex branch prediction mechanism
  - Line predictor
  - Branch predictor, jump target predictor, and return address stack
Detailed Processor Description (part 2)
- PBOX performs initial processing
  - Register renaming and partial decoding
  - Maintains tables for recovery from mispredictions
- QBOX issues instructions out of order to the EBOX, FBOX, or MBOX
  - Retires instructions and commits architectural state in program order
  - Consists of instruction queue, in-flight table, and completion unit
- MBOX conducts loads and stores
  - Load and store queues are divided between threads
  - Available queue space per thread is very small
SRT on Detailed Processor
- Input replication uses an LVQ variant that allows out-of-order load issue from the trailing thread
- Output comparison is the same as in the original SRT design
  - A suggested improvement gives each thread its own store queue (SQ)
- PBOX storage structures are made per-thread to avoid deadlock situations
- Branch outcome queue is converted to a line prediction queue
- Preferential space redundancy (PSR) is implemented to better cover permanent faults
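
The out-of-order LVQ idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the class name, the integer `tag` used to match a trailing load with its leading counterpart, and the address check are choices made for this sketch, not details taken from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class LoadValueQueue:
    """Toy LVQ: the leading thread deposits load results, the trailing thread consumes them."""

    entries: dict = field(default_factory=dict)  # tag -> (address, value)

    def leading_retire_load(self, tag: int, address: int, value: int) -> None:
        # The leading thread records each retired load under its tag (assumed scheme).
        self.entries[tag] = (address, value)

    def trailing_load(self, tag: int, address: int) -> int:
        # The trailing thread looks entries up by tag, so it may issue loads in any order.
        if tag not in self.entries:
            raise LookupError(f"load {tag} not yet retired by the leading thread")
        lead_address, value = self.entries.pop(tag)
        if lead_address != address:
            # The two copies computed different addresses: treat this as a detected fault.
            raise RuntimeError(f"fault detected on load {tag}")
        return value


lvq = LoadValueQueue()
lvq.leading_retire_load(0, 0x100, 7)
lvq.leading_retire_load(1, 0x104, 9)
# The trailing thread issues load 1 before load 0 and still gets matching values.
values = [lvq.trailing_load(1, 0x104), lvq.trailing_load(0, 0x100)]
```

A strict FIFO queue would force the trailing thread to issue loads in program order; the tagged lookup above is one simple way to remove that restriction.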
Chip-level Redundant Threading
- Each core executes a leading and a trailing thread from different programs
- LVQ and line prediction queue must forward data to the other processor's trailing thread
- Store buffer must receive retired stores from the other processor for comparison
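
The cross-coupled thread placement can be made concrete with a small sketch. The function names, the core labels `core0`/`core1`, and the program names are hypothetical and exist only for this illustration.

```python
def crt_assignment(program_a, program_b):
    """Place two programs on two cores so each core runs one leading
    and one trailing thread, taken from different programs."""
    return {
        "core0": {"leading": program_a, "trailing": program_b},
        "core1": {"leading": program_b, "trailing": program_a},
    }


def forwarding_paths(assignment):
    """For each program, report which core produces LVQ and line prediction
    data ("from") and which core consumes it ("to")."""
    paths = {}
    for core, threads in assignment.items():
        paths.setdefault(threads["leading"], {})["from"] = core
        paths.setdefault(threads["trailing"], {})["to"] = core
    return paths


assignment = crt_assignment("A", "B")
paths = forwarding_paths(assignment)
# Program A leads on core0 and trails on core1; program B is the mirror image.
```

Because every program's two copies sit on different cores, all input replication and store comparison traffic in this arrangement crosses between the cores.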
CRT Advantages
- CRT checks much less information to detect faults than lockstepped processors
- Lockstep fault detection circuitry is on the critical path for cache misses
- CRT executes threads more efficiently because of SMT dynamic scheduling on each processor
Simulation Environment
- Asim performance model framework used
- Simulates a processor like the Alpha 21464
- All 18 SPEC CPU95 benchmarks used
- Combinations of SPEC benchmarks used for multi-program simulations
- Lockstepped processor simulated with zero fault detection delay (Lock0) and with 8-cycle delay (Lock8)
- SRT architecture simulated with delays for forwarding line predictions and load values
  - Extra delays for CRT architecture
SMT-Efficiency
- SMT-Efficiency (SMT-E) used instead of IPC
- SMT-E of an individual thread is the thread's IPC in SMT mode divided by its IPC in single-thread mode on the same SMT processor
- Overall SMT-E is the arithmetic mean of the individual SMT-Efficiencies
- A. Snavely and D. M. Tullsen, "Symbiotic Job Scheduling for a Simultaneous Multithreading Processor", ASPLOS 2000
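
The definition above translates directly into a short calculation. The IPC numbers in the example are made up for illustration and are not results from the paper; the function name is likewise an assumption.

```python
def smt_efficiency(ipc_smt, ipc_single):
    """Overall SMT-Efficiency: arithmetic mean of per-thread
    (IPC in SMT mode) / (IPC in single-thread mode)."""
    if len(ipc_smt) != len(ipc_single) or not ipc_smt:
        raise ValueError("need one SMT-mode and one single-thread IPC per thread")
    per_thread = [smt / single for smt, single in zip(ipc_smt, ipc_single)]
    return sum(per_thread) / len(per_thread)


# Hypothetical two-thread workload: thread 0 runs at half of its
# stand-alone IPC, thread 1 at three quarters.
overall = smt_efficiency(ipc_smt=[1.0, 0.75], ipc_single=[2.0, 1.0])
# overall == 0.625
```

Normalizing each thread by its own single-thread IPC keeps a thread with naturally high IPC from dominating the average, which a raw sum of IPCs would not do.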
SMT-Speedup ( SMT-Efficiency)
- Y. Sazeides and T. Juan, "How to Compare the Performance of Two SMT Microarchitectures", ISPASS 2001
Preferential Space Redundancy
- Without PSR, 65% of instructions execute on the same functional unit
- With PSR, only 0.06% of instructions run on the same unit
- No performance degradation is experienced
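
A toy model shows why steering the trailing copy away from the leading copy's unit matters. It assumes just two functional-unit clusters, named `upper` and `lower` here, and a steering rule that always succeeds. Both are simplifications for this sketch, so the model does not reproduce the 65% and 0.06% figures above.

```python
import random


def same_unit_fraction(n_instructions, use_psr, seed=0):
    """Fraction of instructions whose leading and trailing copies
    run on the same cluster in this toy two-cluster model."""
    rng = random.Random(seed)
    clusters = ("upper", "lower")
    same = 0
    for _ in range(n_instructions):
        lead = rng.choice(clusters)
        if use_psr:
            # PSR (idealized): steer the trailing copy to the other cluster.
            trail = "lower" if lead == "upper" else "upper"
        else:
            # No steering: the trailing copy lands on an arbitrary cluster.
            trail = rng.choice(clusters)
        same += lead == trail
    return same / n_instructions


without_psr = same_unit_fraction(10_000, use_psr=False)
with_psr = same_unit_fraction(10_000, use_psr=True)  # 0.0 in this idealized model
```

If both copies of an instruction use the same faulty unit, they can produce the same wrong result and the comparison passes, which is why spatial separation helps with permanent faults.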
SRT One Logical Thread
- SRT is 32% slower than a single thread on SMT
- SRT is 11% faster than running two redundant copies
- Degradation is 30% with a per-thread store queue
- Best-case degradation is 26% with an oracle store queue
SRT Two Logical Threads
- Degradation of SRT is 40%
- Per-thread store queue gives 32% degradation
- Store lifetime drops from 44 cycles to ... (vs. 39 for single thread)
- Oracle store queue gives 5% better efficiency
Chip-level Redundant Threading
- With one logical thread, CRT performs similarly to lockstepping
- With two logical threads, CRT beats Lock0 and Lock8 by 10% and 2%, respectively
- Adding the per-thread store queue causes CRT to beat Lock8 by 13% on average (22% maximum)
- Using an oracle store queue improves performance by 6% more
CRT with Four Logical Threads
- Initial CRT configuration is no better than Lock8
- Adding the per-thread store queue gives CRT 13% better performance than Lock8
- Using an oracle store queue improves performance by only another 2%
Conclusions
- The benefits of SRT are not as great as in the original paper when using a detailed model
  - 30% and 32% degradation seen on single-thread and multithreaded workloads
- SRT methods can be used to detect permanent faults
- Chip-level redundant threading gives improved performance over lockstepped processors
  - Overall, CRT provided a 13% improvement
Transient Fault Recovery
- AR-SMT suggests that the R-stream could be used as a checkpoint for recovery
- SRT suggests checkpoint/restart or failover
  - Argues that since faults are infrequent, they will have a minor impact on performance
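
Checkpoint/restart on top of redundant execution can be sketched as follows. This is a generic illustration and not the mechanism of either paper. The function name, the integer state, and the `faults` parameter (a set of hypothetical `(step index, attempt)` pairs at which the trailing copy is corrupted) are all assumptions.

```python
def run_with_restart(initial, steps, faults=()):
    """Run each step twice; on a mismatch, discard both results and
    re-execute from the last good state (the checkpoint)."""
    faults = set(faults)
    state = initial
    retries = 0
    for index, step in enumerate(steps):
        attempt = 0
        while True:
            checkpoint = state           # last state known to be good
            leading = step(checkpoint)   # first redundant copy
            trailing = step(checkpoint)  # second redundant copy
            if (index, attempt) in faults:
                trailing ^= 1            # flip one bit to model a transient fault
            if leading == trailing:
                state = leading          # results agree: commit and move on
                break
            retries += 1                 # mismatch: restart this step from the checkpoint
            attempt += 1
    return state, retries


steps = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
result = run_with_restart(5, steps, faults=[(1, 0)])
# result == (9, 1): one injected fault costs one re-execution of step 1
```

The extra work is paid only when a mismatch occurs, which matches the argument that infrequent faults should have a minor effect on performance.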
Future Lectures?
- Hardware Transient Fault Recovery
  - T.N. Vijaykumar, Irith Pomeranz, and Karl Cheng, "Transient-Fault Recovery Using Simultaneous Multithreading", ISCA 2002
  - Mohamed Gomaa, Chad Scarbrough, T.N. Vijaykumar, and Irith Pomeranz, "Transient-Fault Recovery for Chip Multiprocessors", ISCA 2003
- Slipstream Processors (an AR-SMT extension)
  - Karthik Sundaramoorthy, Zach Purser, and Eric Rotenberg, "Slipstream Processors: Improving both Performance and Fault Tolerance", ASPLOS 2000
  - Khaled Z. Ibrahim, Gregory T. Byrd, and Eric Rotenberg, "Slipstream Execution Mode for CMP-Based Multiprocessors", HPCA 2003