Hardware Fault Tolerance Through Simultaneous Multithreading (part 3)

1
Hardware Fault Tolerance Through Simultaneous
Multithreading (part 3)

Jonathan Winter

2
3 SMT Fault Tolerance Papers

Eric Rotenberg, "AR-SMT - A Microarchitectural
Approach to Fault Tolerance in Microprocessors",
Symposium on Fault-Tolerant Computing, 1999.
Steven K. Reinhardt and Shubhendu S. Mukherjee,
"Transient Fault Detection via Simultaneous
Multithreading", ISCA 2000.
Shubhendu S. Mukherjee, Michael Kontz and Steven
K. Reinhardt, "Detailed Design and Evaluation of
Redundant Multithreading Alternatives", ISCA 2002.

3
Outline

Background
SMT
Hardware fault tolerance
AR-SMT
Basic mechanisms
Implementation issues
Simulation and Results
Transient Fault Detection via SMT
Sphere of replication
Basic mechanisms
Comparison to AR-SMT
Simulation and Results
Redundant Multithreading Alternatives
Realistic processor implementation
CRT
Simulation and Results
Fault Recovery
Future Lectures?

4
Sphere of Replication

Size of sphere of replication
Two alternatives: with and without the register file
Instruction and data caches kept outside

5
Redundant Multithreading Alternatives

Discusses real-world fault-tolerant processors
Evaluates SRT on a more realistic and detailed
processor model than the previous paper
Proposes Chip-level Redundant Threading (CRT)
Detailed simulation results with a new metric:
relative SMT-Efficiency

6
Real World SMT, CMP, and FT

Simulated processor based on Compaq Alpha Araña
(a.k.a. 21464 or EV8)
IBM Power4 and HP Mako are 2-way CMPs
Compaq Himalaya uses multi-chip lockstepping
IBM S/390 G5 uses on-chip lockstepping

7
Detailed Processor Description

8-wide issue SMT with 4 hardware thread contexts
IBOX fetches chunks of 8 instructions and
forwards them to the PBOX
Complex branch prediction mechanism
Line predictor
Branch predictor, jump target predictor, and
return address stack

8
Detailed Processor Description (part 2)

PBOX performs initial processing
Register renaming and partial decoding
Maintains tables for recovery from
mispredictions
QBOX issues instructions out-of-order to the
EBOX, FBOX, or MBOX
Retires instructions and commits architectural
state in program order
Consists of instruction queue, in-flight table,
and completion unit
MBOX conducts loads and stores
Load and store queues divided between threads
Available queue space is very small per thread

9
SRT on Detailed Processor

Input replication uses a load value queue (LVQ)
variant that allows out-of-order load issue from
the trailing thread
Output comparison is the same as in the original
SRT design
A suggested improvement is a per-thread store
queue (SQ)
PBOX storage structures are made per-thread to
avoid deadlock situations
Branch outcome queue is converted to a line
prediction queue
Preferential space redundancy (PSR) is implemented
to better cover permanent faults
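
The input replication and output comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the design from the paper: the class and function names are hypothetical, and the tag-indexed lookup is an assumption made here to show how a trailing thread could consume load values out of order.

```python
class FaultDetected(Exception):
    """Raised when the leading and trailing threads disagree."""


class LoadValueQueue:
    """Hypothetical LVQ: the leading thread deposits load results,
    the trailing thread consumes them instead of reading the cache."""

    def __init__(self):
        self._entries = {}  # tag -> (address, value); tag scheme is assumed

    def push(self, tag, address, value):
        # Called when the leading thread's load completes.
        self._entries[tag] = (address, value)

    def pop(self, tag, address):
        # Called by the trailing thread, in any order, keyed by tag.
        lead_address, value = self._entries.pop(tag)
        if lead_address != address:
            raise FaultDetected(f"load address mismatch for tag {tag}")
        return value


def compare_store(leading, trailing):
    """Output comparison: release a store only if both threads
    produced the same (address, data) pair."""
    if leading != trailing:
        raise FaultDetected(f"store mismatch: {leading} vs {trailing}")
    return leading
```

The trailing thread never touches the data cache in this sketch, which is what keeps the cache outside the sphere of replication: both threads see the same load value even if memory changes in between.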

10
Chip-level Redundant Threading

Each core executes a lead and trailing thread
from different programs
LVQ and line prediction queue must forward data
to the other processor's trailing thread
Store buffer must receive retired stores from
the other processor for comparison
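
The cross-coupled thread placement can be illustrated with a small Python sketch. It assumes a two-core chip and a hypothetical helper name (`crt_assignment`); the only property it is meant to show is that a program's leading and trailing threads always land on different cores.

```python
def crt_assignment(programs):
    """Place each program's leading thread on one core and its
    trailing thread on the other core (two-core chip assumed)."""
    cores = {0: [], 1: []}
    for i, program in enumerate(programs):
        lead_core = i % 2          # alternate leading threads across cores
        trail_core = 1 - lead_core  # trailing thread goes to the other core
        cores[lead_core].append((program, "leading"))
        cores[trail_core].append((program, "trailing"))
    return cores


# Two hypothetical programs, A and B.
for core, threads in crt_assignment(["A", "B"]).items():
    print(core, threads)
```

With two programs, each core ends up running one leading thread and the trailing thread of the other program, so load values, line predictions, and retired stores have to cross between cores.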

11
CRT Advantages

CRT checks much less information to detect faults
than lockstepped processors
Lockstep fault detection circuitry is on critical
path for cache misses
CRT executes threads more efficiently because of
SMT dynamic scheduling on each processor

12
Simulation Environment

Asim performance modeling framework used
Simulates a processor similar to the Alpha 21464
All 18 SPEC CPU95 benchmarks used
Combinations of SPEC benchmarks used for
multi-program simulations
Lockstepped processor simulated with zero fault
detection delay (Lock0) and with 8-cycle delay
(Lock8)
SRT architecture simulated with delays for
forwarding line predictions and load values
Extra delays for CRT architecture

13
SMT-Efficiency

SMT-Efficiency (SMT-E) is used instead of IPC
SMT-E of an individual thread is its IPC in SMT
mode divided by its IPC when running alone
(single-thread mode) on the same SMT processor
Overall SMT-E is the arithmetic mean of the
individual SMT-Efficiencies
A. Snavely and D. M. Tullsen, "Symbiotic Job
Scheduling for a Simultaneous Multithreading
Processor", ASPLOS 2000.
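
The metric above can be written out as a short Python sketch. The function names and the IPC numbers are hypothetical and chosen only to make the arithmetic easy to follow; they are not results from the paper.

```python
from statistics import mean


def smt_efficiency(ipc_smt, ipc_single):
    """Per-thread SMT-Efficiency: IPC in SMT mode divided by the IPC
    of the same thread running alone on the same SMT processor."""
    if ipc_single <= 0:
        raise ValueError("single-thread IPC must be positive")
    return ipc_smt / ipc_single


def overall_smt_efficiency(threads):
    """Arithmetic mean of the per-thread efficiencies.
    threads: list of (ipc_smt, ipc_single) pairs."""
    return mean(smt_efficiency(smt, single) for smt, single in threads)


# Illustrative values only: two threads sharing one SMT core.
workload = [(1.0, 2.0), (1.5, 2.0)]   # efficiencies 0.5 and 0.75
print(overall_smt_efficiency(workload))
```

Normalizing each thread by its own single-thread IPC keeps a high-IPC benchmark from dominating the average, which is the usual argument for preferring this metric over raw aggregate IPC.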

14
SMT-Speedup ( SMT-Efficiency)

Y. Sazeides and T. Juan, "How to Compare the
Performance of Two SMT Microarchitectures",
ISPASS 2001.

15
Preferential Space Redundancy

Without PSR, 65% of instructions execute on the
same functional unit in both threads
With PSR, only 0.06% of instructions run on the
same unit
No performance degradation is experienced
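
The idea behind preferential space redundancy can be sketched as a unit-selection policy. This is an assumption-laden illustration, not the hardware mechanism: the function name `pick_unit` and the fallback rule are hypothetical. The fallback is included because the redundancy is a preference rather than a guarantee, which is consistent with a small fraction of instructions still sharing a unit.

```python
def pick_unit(free_units, leading_unit=None):
    """Choose a functional unit for a trailing-thread instruction.

    free_units:   unit ids currently available
    leading_unit: unit the matching leading-thread instruction used
    """
    free_units = list(free_units)
    # Prefer any unit other than the one the leading copy ran on.
    preferred = [unit for unit in free_units if unit != leading_unit]
    if preferred:
        return preferred[0]
    # Assumed fallback: reuse the same unit rather than stall.
    if leading_unit in free_units:
        return leading_unit
    return None  # nothing available this cycle


print(pick_unit([0, 1], leading_unit=0))
```

Running the two copies of an instruction on different units matters for permanent faults: a stuck unit would otherwise corrupt both copies identically and the comparison would pass.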

16
SRT One Logical Thread

SRT is 32% slower than a single thread on SMT
SRT is 11% faster than running two redundant copies
Degradation is 30% with per-thread store queue
Best-case degradation is 26% with oracle store queue

17
SRT Two Logical Threads

Degradation of SRT is 40%
Per-thread store queue gives 32% degradation
Store lifetime drops from 44 cycles to vs. 39 for
single thread
Oracle store queue gives 5% better efficiency

18
Chip-level Redundant Threading

With one logical thread, CRT performs similarly
to lockstepping
With two logical threads, CRT beats Lock0 and
Lock8 by 10% and 2% respectively
Adding the per-thread store queue causes CRT to
beat Lock8 by 13% on average (22% maximum)
Using an oracle store queue improves performance
by 6% more

19
CRT with Four Logical Threads

Initial CRT configuration is no better than Lock8
Adding per-thread store queue gives CRT 13%
better performance than Lock8
Using an oracle store queue improves performance
by only another 2%

20
Conclusions

The benefits of SRT are not as great as in the
original paper when using a detailed model
30% and 32% degradation seen on single-thread and
multithreaded workloads (with per-thread store
queues)
SRT methods can be used to detect permanent
faults
Chip-level redundant threading gives improved
performance over lockstepped processors
Overall, CRT provided a 13% improvement

21
Transient Fault Recovery

AR-SMT suggests that the R-stream could be used
as a checkpoint for recovery
SRT suggests checkpoint/restart or failover
Argues that since faults are infrequent, they will
have a minor impact on performance

22
Future Lectures?

Hardware Transient Fault Recovery
T. N. Vijaykumar, Irith Pomeranz, and Karl Cheng,
"Transient-Fault Recovery Using Simultaneous
Multithreading", ISCA 2002.
Mohamed Gomaa, Chad Scarbrough, T. N. Vijaykumar,
and Irith Pomeranz, "Transient-Fault Recovery for
Chip Multiprocessors", ISCA 2003.
Slipstream Processors (an AR-SMT extension)
Karthik Sundaramoorthy, Zach Purser, and Eric
Rotenberg, "Slipstream Processors: Improving both
Performance and Fault Tolerance", ASPLOS 2000.
Khaled Z. Ibrahim, Gregory T. Byrd, and Eric
Rotenberg, "Slipstream Execution Mode for
CMP-Based Multiprocessors", HPCA 2003.