Title: Transient Fault Tolerance via Dynamic Process-Level Redundancy
1. Transient Fault Tolerance via Dynamic Process-Level Redundancy
- Alex Shye, Vijay Janapa Reddi, Tipp Moseley, and Daniel A. Connors
- University of Colorado at Boulder
- Department of Electrical and Computer Engineering
- DRACO Architecture Research Group
- Workshop on Binary Instrumentation and Applications, San Jose, CA
- 10.22.2006
2. Outline
- Introduction
- Background/Terminology
- Software-centric Fault Detection
- Process-Level Redundancy
- Experimental Results
- Conclusion
3. Introduction
- Process technology trends
  - Single transistor error rate expected to stay close to constant
  - Number of transistors is increasing exponentially with each generation
  - Transient faults will be a problem for microprocessors!
- Hardware Approaches
  - Specialized redundant hardware, redundant multi-threading
- Software Approaches
  - Compiler solutions: instruction duplication, control flow checking
  - Low-cost, flexible alternative but higher overhead
- Goal: Leverage available hardware parallelism in SMT and CMP machines to improve the performance of software transient fault tolerance
4. Background/Terminology
- Types of transient faults (based upon outcome)
  - Benign Faults
  - Silent Data Corruption (SDC)
  - Detected Unrecoverable Error (DUE)
    - True DUE
    - False DUE
- Sphere of Replication (SoR)
  - Indicates the scope of fault detection and containment
  - Input Replication
  - Output Comparison
5. Software-centric Fault Detection
[Figure: Hardware SoR vs. Software SoR. Component labels: Processor, Cache, Memory, Devices, Application, Libraries, Operating System. Panels: Software-centric Fault Detection, Hardware-centric Fault Detection.]
- Most previous approaches are hardware-centric
  - Even compiler approaches (e.g. EDDI, SWIFT)
- Software-centric able to leverage strengths of a software approach
  - Correctness is defined by software output
  - Ability to see larger scope effect of a fault
  - Ignore benign faults
6. Process-Level Redundancy (PLR)
[Figure: three copies of App and Libs, with the Watchdog Alarm, the SysCall Emulation Unit, and the Operating System.]
- Master Process
  - Only process allowed to perform system I/O
- Redundant Processes
  - Identical address space, file descriptors, etc.
  - Not allowed to perform system I/O
- Watchdog Alarm
  - Occasionally a process will hang
  - Set at beginning of barrier synchronization to ensure that all processes are alive
- System Call Emulation Unit
  - Creates redundant processes
  - Barrier synchronize at all system calls
  - Enforces SoR with input replication and output comparison
  - Emulates system calls to guarantee determinism among all processes
  - Detects and recovers from transient faults
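The watchdog idea above can be sketched in a few lines. This is a minimal illustration, not the PLR implementation: threads stand in for the redundant processes, and the names `run_barrier`, `worker_delays`, and `timeout` are hypothetical. The deadline is armed when the barrier begins, and any worker that has not arrived by then is reported as missing.

```python
import threading
import time

def run_barrier(worker_delays, timeout):
    """Return the sorted ids of workers that missed the barrier deadline.

    Assumption: threads stand in for PLR's processes, and `worker_delays`
    models how long each one takes to reach its next system call.
    """
    arrived = set()
    lock = threading.Lock()

    def worker(wid, delay):
        time.sleep(delay)              # execution up to the next system call
        with lock:
            arrived.add(wid)           # worker has reached the barrier

    threads = [threading.Thread(target=worker, args=(wid, d), daemon=True)
               for wid, d in enumerate(worker_delays)]
    for t in threads:
        t.start()

    # Watchdog: set at the beginning of barrier synchronization.
    deadline = time.monotonic() + timeout
    for t in threads:
        t.join(max(0.0, deadline - time.monotonic()))

    with lock:
        return sorted(set(range(len(worker_delays))) - arrived)
```

Calling `run_barrier([0.0, 0.0, 2.0], 0.3)` reports worker 2 as hung; a recovery step would then replace it.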
7. Enforcing SoR and Determinism
- Input Replication
  - All read events: read(), gettimeofday(), getrusage(), etc.
  - Return value from all system calls
- Output Comparison
  - All write events: write(), msync(), etc.
  - System call parameters
- Maintaining Determinism at System Calls
  - Master process executes system call
  - Redundant processes emulate it
    - Ignore some: rename(), unlink()
    - Execute similar/altered system call
      - Identical address space: mmap()
      - Process-specific data: open(), lseek()
[Figure: Example of handling a read() system call. Columns: Redundant Processes, Master Process. Steps, in order:]
- Barrier
- Write cmd line parameters and syscall type to shmem
- Compare syscall type and cmd line parameters
- read()
- Write resulting file offset and read buffer to shmem
- Copy the read buffer from shmem
- lseek() to correct file offset
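The read() steps above can be sketched as follows. This is a simplified, single-process illustration under stated assumptions: a plain dict named `shmem` stands in for the shared memory segment, the barrier is implied by call order, and the split of steps between `master_read` and `redundant_read` (both hypothetical names) is one plausible reading of the figure.

```python
import os
import tempfile

def master_read(fd, nbytes, shmem):
    """Master: publish the call, perform the real read(), publish the results."""
    shmem["syscall"] = ("read", nbytes)             # syscall type and parameters
    data = os.read(fd, nbytes)                      # only the master does system I/O
    shmem["offset"] = os.lseek(fd, 0, os.SEEK_CUR)  # resulting file offset
    shmem["buffer"] = data                          # read buffer
    return data

def redundant_read(fd, nbytes, shmem):
    """Redundant process: compare the call, copy the buffer, fix up the offset."""
    if shmem["syscall"] != ("read", nbytes):        # output comparison
        raise RuntimeError("mismatch in syscall type or parameters")
    data = bytes(shmem["buffer"])                   # input replication
    os.lseek(fd, shmem["offset"], os.SEEK_SET)      # keep the file offset in step
    return data

def demo():
    """Run one emulated read() and return both buffers and both file offsets."""
    fd, path = tempfile.mkstemp()
    os.write(fd, b"hello world")
    os.close(fd)
    master_fd = os.open(path, os.O_RDONLY)
    redundant_fd = os.open(path, os.O_RDONLY)
    try:
        shmem = {}
        m = master_read(master_fd, 5, shmem)
        r = redundant_read(redundant_fd, 5, shmem)
        offsets = (os.lseek(master_fd, 0, os.SEEK_CUR),
                   os.lseek(redundant_fd, 0, os.SEEK_CUR))
        return m, r, offsets
    finally:
        os.close(master_fd)
        os.close(redundant_fd)
        os.remove(path)
```

The redundant process never reads the file itself, yet it ends with the same data and the same file offset as the master, which is what keeps later system calls deterministic.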
8. Fault Detection and Recovery

| Type of Error | Detection Mechanism | Recovery Mechanism |
| --- | --- | --- |
| Output Mismatch | Detected as a mismatch of compare buffers on an output comparison | Use majority vote to ensure correct data exists, kill incorrect process, and fork() to create a new one |
| Program Failure | System call emulation unit registers signal handlers for SIGSEGV, SIGIOT, etc. | Re-create the dead process by forking one of the existing processes |
| Timeout | Watchdog alarm times out | Determine the missing process and fork() to create a new one |

- PLR supports detection/recovery from multiple faults by increasing number of redundant processes and scaling the majority vote logic
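A minimal sketch of the majority vote on compare buffers, assuming each process contributes one buffer of bytes. The function name `majority_vote` and the dict-based interface are hypothetical; killing and re-forking the faulty processes is left out.

```python
from collections import Counter

def majority_vote(buffers):
    """Return (winning_buffer, ids_of_disagreeing_processes).

    `buffers` maps a process id to its compare buffer (bytes).
    Raises ValueError if there are no buffers or no strict majority.
    """
    if not buffers:
        raise ValueError("no compare buffers to vote on")
    counts = Counter(buffers.values())
    winner, votes = counts.most_common(1)[0]
    if votes * 2 <= len(buffers):
        # A mismatch is detected, but no value can be trusted for recovery.
        raise ValueError("no majority among compare buffers")
    faulty = sorted(pid for pid, buf in buffers.items() if buf != winner)
    return winner, faulty
```

With three processes one faulty buffer is outvoted; with five processes two can be. Two processes that disagree raise an error, since a mismatch is detected but neither buffer can be trusted.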
9. Experimental Methodology
- Use a set of the SPEC2000 benchmarks
- PLR prototype developed with Pin
  - Intercept system calls to implement PLR
- Fault Injection
  - Gather an instruction count profile
  - Use profile to generate a test case
    - Test case: an instruction and a particular execution of the instruction to fault
  - Run with Pin in JIT mode and use IARG_RETURN_REGS to alter a random bit of the instruction's source or destination registers
- Fault Coverage
  - Use fault injector on test inputs, generating 1000 test cases per benchmark
  - specdiff in SPEC2000 harness determines output correctness
- PLR Performance
  - Run PLR (in Probe mode using Pin Probes) on reference inputs with two redundant processes
  - 4-way SMP machine, each processor is hyper-threaded
  - Use sched_setaffinity() to simulate various hardware platforms
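The fault injection steps can be sketched without Pin. This is an illustration only: `pick_test_case` and `flip_random_bit` are hypothetical names, the profile is a plain dict, registers are assumed to be 64 bits wide, and weighting the choice of instruction by its execution count is an assumption that the slides do not state.

```python
import random

def pick_test_case(profile, rng):
    """Choose (instruction address, execution index) from a count profile.

    `profile` maps an instruction address to its dynamic execution count.
    Assumption: instructions are chosen in proportion to how often they ran.
    """
    addresses = sorted(profile)
    weights = [profile[a] for a in addresses]
    addr = rng.choices(addresses, weights=weights, k=1)[0]
    return addr, rng.randrange(profile[addr])   # which execution to fault

def flip_random_bit(value, rng, width=64):
    """Flip one random bit of a register value; return (new_value, bit_index)."""
    bit = rng.randrange(width)
    return value ^ (1 << bit), bit

# Example: one test case and one injected single-bit fault.
rng = random.Random(0)
profile = {0x400000: 3, 0x400004: 1}
addr, nth = pick_test_case(profile, rng)
faulty, bit = flip_random_bit(0x1234, rng)
```

In the real setup the flip would be applied to a source or destination register of the chosen instruction on its chosen execution; here it is applied to a plain integer.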
10. Fault Coverage
- Watchdog timeout very rare so not shown
- PLR detects all Incorrect and Failed cases
- Effectively detects relevant faults and ignores benign faults
- Floating point correctness question (e.g. 168.wupwise, 172.mgrid)
  - Actually different results but tolerable difference for specdiff
11. Performance
- Performance for single processor (PLR 1x1), 2 SMT processors (PLR 2x1) and 4-way SMP (PLR 4x1)
- Slowdown for 4-way SMP only 1.26x
- Should be better on a CMP with faster processor interconnect
12. Conclusion
- Present a different way to use existing general purpose SMT and CMP machines for transient fault tolerance
- Differentiate between hardware-centric and software-centric fault detection models
- Show how software-centric can be effective in ignoring benign faults
- PLR on a 4-way SMP executes with only a 26% slowdown, a 36% improvement over the fastest compiler technique
- Future Work
  - Implementation in a run-time system allows for dynamically altering amount of fault tolerance
  - Simple PLR model is presented; work on handling interrupts, shared memory, and threads (the tough one)
Questions?