Transient Fault Tolerance via Dynamic Process-Level Redundancy

Alex Shye, Vijay Janapa Reddi, Tipp Moseley and Daniel A. Connors, University of Colorado at Boulder

1
Transient Fault Tolerance via Dynamic
Process-Level Redundancy
  • Alex Shye, Vijay Janapa Reddi, Tipp Moseley and
    Daniel A. Connors
  • University of Colorado at Boulder
  • Department of Electrical and Computer Engineering
  • DRACO Architecture Research Group
  • Workshop on Binary Instrumentation and
    Applications
  • San Jose, CA
  • 10.22.2006

2
Outline
  • Introduction
  • Background/Terminology
  • Software-centric Fault Detection
  • Process-Level Redundancy
  • Experimental Results
  • Conclusion

3
Introduction
  • Process technology trends
    • Single-transistor error rate expected to stay
      close to constant
    • Number of transistors is increasing exponentially
      with each generation
    • Transient faults will be a problem for
      microprocessors!
  • Hardware Approaches
    • Specialized redundant hardware, redundant
      multi-threading
  • Software Approaches
    • Compiler solutions: instruction duplication,
      control flow checking
    • Low-cost, flexible alternative, but higher
      overhead
  • Goal: Leverage available hardware parallelism in
    SMT and CMP machines to improve the performance
    of software transient fault tolerance

4
Background/Terminology
  • Types of transient faults (based upon outcome)
    • Benign Faults
    • Silent Data Corruption (SDC)
    • Detected Unrecoverable Error (DUE)
      • True DUE
      • False DUE
  • Sphere of Replication (SoR)
    • Indicates the scope of fault detection and
      containment
    • Input Replication
    • Output Comparison

5
Software-centric Fault Detection
[Figure: two panels, Hardware-centric Fault Detection and Software-centric Fault Detection, showing the Hardware SoR and the Software SoR over a system stack of Application, Libraries, Operating System, Processor, Cache, Memory, and Devices]
  • Most previous approaches are hardware-centric
    • Even compiler approaches (e.g. EDDI, SWIFT)
  • Software-centric detection is able to leverage the
    strengths of a software approach
    • Correctness is defined by software output
    • Ability to see the larger-scope effect of a fault
    • Ignore benign faults

6
Process-Level Redundancy (PLR)
  • Master Process
    • Only process allowed to perform system I/O
  • Redundant Processes
    • Identical address space, file descriptors, etc.
    • Not allowed to perform system I/O

[Figure: PLR overview, showing three App/Libs stacks, the Watchdog Alarm, the SysCall Emulation Unit, and the Operating System]
  • Watchdog Alarm
    • Occasionally a process will hang
    • Set at the beginning of barrier synchronization to
      ensure that all processes are alive
  • System Call Emulation Unit
    • Creates redundant processes
    • Barrier-synchronizes at all system calls
    • Enforces SoR with input replication and output
      comparison
    • Emulates system calls to guarantee determinism
      among all processes
    • Detects and recovers from transient faults
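
The watchdog-at-the-barrier idea can be sketched in a few lines. This is a minimal illustration, not the PLR implementation: threads stand in for processes, and the function name `barrier_with_watchdog`, the `workers` mapping, and the timeout value are all assumptions made for the example. Each worker runs up to its "system call", records its arrival, and waits at a barrier whose timeout plays the role of the watchdog alarm.

```python
import threading


def barrier_with_watchdog(workers, timeout_s):
    """Run workers up to a shared barrier and return the names that never arrived.

    `workers` maps a (hypothetical) process name to a callable that runs until
    the next system call. Threads stand in for processes in this sketch.
    """
    barrier = threading.Barrier(len(workers))
    arrived = set()
    lock = threading.Lock()

    def run(name, work):
        work()  # execute up to the "system call"
        with lock:
            arrived.add(name)
        try:
            # The timeout acts as the watchdog, armed on entry to the barrier.
            barrier.wait(timeout=timeout_s)
        except threading.BrokenBarrierError:
            pass  # the watchdog fired: at least one peer is missing

    threads = [
        threading.Thread(target=run, args=item, daemon=True)
        for item in workers.items()
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=timeout_s * 2)
    with lock:
        return sorted(set(workers) - arrived)
```

Calling it with one worker that blocks forever returns that worker's name once the timeout expires. A real system would then fork a replacement, as the recovery slide describes.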

7
Enforcing SoR and Determinism
  • Input Replication
    • All read events: read(), gettimeofday(),
      getrusage(), etc.
    • Return value from all system calls
  • Output Comparison
    • All write events: write(), msync(), etc.
    • System call parameters
  • Maintaining Determinism at System Calls
    • Master process executes system call
    • Redundant processes emulate it
      • Ignore some: rename(), unlink()
      • Execute similar/altered system call
        • Identical address space: mmap()
        • Process-specific data: open(), lseek()
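
One way to picture the classification above is as a lookup table that tells a redundant process what to do when it reaches a given system call. The sketch below is illustrative only: the enum `ReplicaAction`, the table `REPLICA_POLICY`, and the function `replica_action` are hypothetical names, and the table lists just the calls named on the slide, with one dominant treatment each (in PLR, parameters and return values are also handled for every call).

```python
from enum import Enum


class ReplicaAction(Enum):
    COPY_INPUT = "copy the master's result (input replication)"
    COMPARE_OUTPUT = "compare data with the master (output comparison)"
    IGNORE = "skip the call; only the master executes it"
    EXECUTE = "execute a similar or altered call locally"


# Only the system calls named on the slide; anything else is out of scope here.
REPLICA_POLICY = {
    "read": ReplicaAction.COPY_INPUT,
    "gettimeofday": ReplicaAction.COPY_INPUT,
    "getrusage": ReplicaAction.COPY_INPUT,
    "write": ReplicaAction.COMPARE_OUTPUT,
    "msync": ReplicaAction.COMPARE_OUTPUT,
    "rename": ReplicaAction.IGNORE,
    "unlink": ReplicaAction.IGNORE,
    "mmap": ReplicaAction.EXECUTE,   # keeps address spaces identical
    "open": ReplicaAction.EXECUTE,   # process-specific data (file descriptors)
    "lseek": ReplicaAction.EXECUTE,  # process-specific data (file offsets)
}


def replica_action(syscall_name):
    """Return how a redundant process treats `syscall_name` in this sketch."""
    try:
        return REPLICA_POLICY[syscall_name]
    except KeyError:
        raise ValueError(f"{syscall_name}() is not covered by this sketch") from None
```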

Example of handling a read() system call (master process and redundant processes):
  • Barrier
  • Write cmd line parameters and syscall type to
    shmem
  • Compare syscall type and cmd line parameters
  • read()
  • Write resulting file offset and read buffer to
    shmem
  • Copy the read buffer from shmem
  • lseek() to correct file offset
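
The read() example can be approximated in a single Python process. This is a sketch under stated assumptions: a plain dictionary stands in for shared memory, separate file descriptors on the same file stand in for the per-process descriptors, and the name `emulate_read` is hypothetical. Assigning the real read() to the master and the copy plus lseek() to the redundant processes follows the "master executes, redundant processes emulate" rule stated on the slide.

```python
import os


def emulate_read(master_fd, replica_fds, requested_sizes):
    """Emulate one read() across a master and its redundant processes.

    `requested_sizes` holds the byte count each process asked for, master first.
    Returns the buffer each process ends up with, master first.
    """
    if len(requested_sizes) != 1 + len(replica_fds):
        raise ValueError("expected one requested size per process")

    # Output comparison: every process must issue the same syscall parameters.
    if len(set(requested_sizes)) != 1:
        raise RuntimeError("syscall parameter mismatch: possible transient fault")

    # Only the master performs the real read().
    data = os.read(master_fd, requested_sizes[0])

    # The master publishes the buffer and its resulting file offset ("shmem").
    shmem = {"buffer": data, "offset": os.lseek(master_fd, 0, os.SEEK_CUR)}

    results = [data]
    for fd in replica_fds:
        # Input replication: copy the buffer instead of reading the file again,
        # then move the replica's own file offset to match the master's.
        results.append(bytes(shmem["buffer"]))
        os.lseek(fd, shmem["offset"], os.SEEK_SET)
    return results
```

After the call, every descriptor sits at the same offset, so a later read() emulated the same way continues from a consistent position in all processes.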
8
Fault Detection and Recovery
Type of Error | Detection Mechanism | Recovery Mechanism
Output Mismatch | Detected as a mismatch of compare buffers on an output comparison | Use majority vote to ensure correct data exists, kill the incorrect process, and fork() to create a new one
Program Failure | System call emulation unit registers signal handlers for SIGSEGV, SIGIOT, etc. | Re-create the dead process by forking one of the existing processes
Timeout | Watchdog alarm times out | Determine the missing process and fork() to create a new one
  • PLR supports detection of and recovery from multiple
    faults by increasing the number of redundant
    processes and scaling the majority vote logic
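
The majority vote used for output mismatches can be illustrated with a short function. This is a minimal sketch, not PLR's code: the name `majority_vote` and the choice to raise an error when no strict majority exists are assumptions made for the example. Each entry in `buffers` is the output one process is about to write.

```python
from collections import Counter


def majority_vote(buffers):
    """Return (majority_buffer, indices_of_disagreeing_processes).

    Raises RuntimeError when no value is held by a strict majority, in which
    case the mismatch is detected but voting cannot pick a correct copy.
    """
    if not buffers:
        raise ValueError("at least one buffer is required")
    winner, votes = Counter(buffers).most_common(1)[0]
    if votes * 2 <= len(buffers):
        raise RuntimeError("output mismatch detected, but no majority exists")
    faulty = [i for i, buf in enumerate(buffers) if buf != winner]
    return winner, faulty
```

The returned indices identify the processes to kill and replace with a fork() of a correct one. Passing more buffers lets the same vote tolerate more simultaneous disagreements, which is the scaling the last bullet refers to.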

9
Experimental Methodology
  • Use a set of the SPEC2000 benchmarks
  • PLR prototype developed with Pin
    • Intercept system calls to implement PLR
  • Fault Injection
    • Gather an instruction count profile
    • Use the profile to generate a test case
    • Test case: an instruction and a particular
      execution of that instruction to fault
    • Run with Pin in JIT mode and use IARG_RETURN_REGS
      to alter a random bit of the instruction's source
      or destination registers
  • Fault Coverage
    • Use the fault injector on test inputs, generating
      1000 test cases per benchmark
    • specdiff in the SPEC2000 harness determines output
      correctness
  • PLR Performance
    • Run PLR (in Probe mode using Pin Probes) on
      reference inputs with two redundant processes
    • 4-way SMP machine, each processor is
      hyper-threaded
    • Use sched_setaffinity() to simulate various
      hardware platforms
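
The fault-injection steps can be mimicked outside of Pin to show the idea. The sketch below is a hedged illustration in plain Python and does not use the Pin API. The function names, the 32-bit register width, and the choice to weight instructions by their execution counts are assumptions. The instruction addresses used as profile keys would be whatever the profiling run produced.

```python
import random


def make_test_case(instruction_counts, rng):
    """Pick (instruction, execution_number) from an instruction count profile.

    Instructions are weighted by how often they executed (an assumption), and
    the execution number is 1-based.
    """
    instructions = sorted(instruction_counts)
    weights = [instruction_counts[i] for i in instructions]
    target = rng.choices(instructions, weights=weights, k=1)[0]
    execution = rng.randrange(1, instruction_counts[target] + 1)
    return target, execution


def flip_random_bit(register_value, rng, width=32):
    """Flip one randomly chosen bit; return (faulty_value, bit_index)."""
    bit = rng.randrange(width)
    return register_value ^ (1 << bit), bit
```

A test case produced this way names where to inject. The bit flip models the transient fault itself, applied to a source or destination register value when the chosen execution of the instruction is reached.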

10
Fault Coverage
  • Watchdog timeouts are very rare, so they are not shown
  • PLR detects all Incorrect and Failed cases
  • Effectively detects relevant faults and ignores
    benign faults
  • Floating point correctness question (e.g.
    168.wupwise, 172.mgrid)
    • Results actually differ, but the difference is
      within the tolerance of specdiff

11
Performance
  • Performance for a single processor (PLR 1x1), 2 SMT
    processors (PLR 2x1), and 4-way SMP (PLR 4x1)
  • Slowdown for 4-way SMP is only 1.26x
  • Should be better on a CMP with a faster processor
    interconnect

12
Conclusion
  • Present a different way to use existing general
    purpose SMT and CMP machines for transient fault
    tolerance
  • Differentiate between hardware-centric and
    software-centric fault detection models
  • Show how software-centric detection can be effective
    in ignoring benign faults
  • PLR on a 4-way SMP executes with only a 26%
    slowdown, a 36% improvement over the fastest
    compiler technique
  • Future Work
    • Implementation in a run-time system allows for
      dynamically altering the amount of fault tolerance
    • A simple PLR model is presented; work remains on
      handling interrupts, shared memory, and threads
      (the tough one)

Questions?