Transient Fault Tolerance via Dynamic Process-Level Redundancy

Alex Shye, Vijay Janapa Reddi, Tipp Moseley and Daniel A. Connors, University of Colorado at Boulder


1
Transient Fault Tolerance via Dynamic
Process-Level Redundancy

Alex Shye, Vijay Janapa Reddi, Tipp Moseley and
Daniel A. Connors
University of Colorado at Boulder
Department of Electrical and Computer Engineering
DRACO Architecture Research Group
Workshop on Binary Instrumentation and
Applications
San Jose, CA
10.22.2006

2
Outline

Introduction
Background/Terminology
Software-centric Fault Detection
Process-Level Redundancy
Experimental Results
Conclusion

3
Introduction

Process technology trends
Single transistor error rate expected to stay
close to constant
Number of transistors is increasing exponentially
with each generation
Transient faults will be a problem for
microprocessors!
Hardware Approaches
Specialized redundant hardware, redundant
multi-threading
Software Approaches
Compiler solutions: instruction duplication,
control flow checking
Low-cost, flexible alternative but higher
overhead
Goal: Leverage available hardware parallelism in
SMT and CMP machines to improve the performance
of software transient fault tolerance

4
Background/Terminology

Types of transient faults (based upon outcome)
Benign Faults
Silent Data Corruption (SDC)
Detected Unrecoverable Error (DUE)
True DUE
False DUE
Sphere of Replication (SoR)
Indicates the scope of fault detection and
containment
Input Replication
Output Comparison
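
The outcome-based taxonomy above can be sketched as a small classifier. This is an illustrative sketch only: the names `Outcome` and `classify` are hypothetical, and the true/false DUE distinction follows the common usage (a false DUE is a detected fault that would otherwise have been benign), which the slide itself does not spell out.

```python
from enum import Enum


class Outcome(Enum):
    BENIGN = "benign fault"
    SDC = "silent data corruption"
    TRUE_DUE = "true DUE"
    FALSE_DUE = "false DUE"


def classify(detected: bool, output_correct_if_ignored: bool) -> Outcome:
    """Classify one transient fault by its outcome.

    detected: a detection mechanism flagged the fault (no recovery assumed).
    output_correct_if_ignored: the program output would still have been
        correct had the fault gone unflagged.
    """
    if detected:
        # Flagging a fault that would not have mattered is a false DUE.
        return Outcome.FALSE_DUE if output_correct_if_ignored else Outcome.TRUE_DUE
    # Undetected faults are either harmless or silently corrupt the output.
    return Outcome.BENIGN if output_correct_if_ignored else Outcome.SDC
```

For example, `classify(False, False)` yields `Outcome.SDC`: nothing was flagged and the output was wrong.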

5
Software-centric Fault Detection

[Figure: system stack (Application, Libraries, Operating System, Processor, Cache, Memory, Devices) with two regions labeled Hardware SoR and Software SoR, contrasting hardware-centric and software-centric fault detection]

Most previous approaches are hardware-centric
Even compiler approaches (e.g. EDDI, SWIFT)
Software-centric detection can leverage the strengths of a
software approach
Correctness is defined by software output
Ability to see larger scope effect of a fault
Ignore benign faults

6
Process-Level Redundancy (PLR)

Master Process: the only process allowed to perform system I/O

Redundant Processes: identical address space, file descriptors, etc.; not allowed to perform system I/O

[Figure labels: App and Libs (three copies), Watchdog Alarm, SysCall Emulation Unit, Operating System]

Watchdog Alarm: occasionally a process will hang; the alarm is set at the beginning of barrier synchronization to ensure that all processes are alive

System Call Emulation Unit
Creates redundant processes
Barrier synchronize at all system calls
Enforces SoR with input replication and output
comparison
Emulates system calls to guarantee determinism
among all processes
Detects and recovers from transient faults
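
The barrier synchronization with a watchdog described on this slide can be sketched as follows. This is a minimal model under stated assumptions, not the PLR implementation: threads stand in for the redundant processes, `threading.Barrier` with a timeout stands in for the watchdog alarm, and the function name `barrier_with_watchdog` is hypothetical.

```python
import threading


def barrier_with_watchdog(arrivals, n_procs, timeout_s):
    """Return the ids of processes that never reached the barrier.

    arrivals: ids of the processes that do reach the barrier; any id in
        range(n_procs) that is absent models a hung process.
    """
    barrier = threading.Barrier(n_procs)
    arrived = set()
    lock = threading.Lock()

    def reach_barrier(pid):
        with lock:
            arrived.add(pid)
        try:
            # Watchdog: stop waiting after timeout_s seconds.
            barrier.wait(timeout=timeout_s)
        except threading.BrokenBarrierError:
            pass  # the barrier timed out; the caller inspects `arrived`

    threads = [threading.Thread(target=reach_barrier, args=(pid,))
               for pid in arrivals]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return set(range(n_procs)) - arrived
```

With three processes where process 1 hangs, `barrier_with_watchdog([0, 2], 3, 0.1)` reports `{1}` once the timeout expires, which is the point at which a real system would re-create the missing process.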

7
Enforcing SoR and Determinism

Input Replication
All read events: read(), gettimeofday(),
getrusage(), etc.
Return value from all system calls
Output Comparison
All write events: write(), msync(), etc.
System call parameters
Maintaining Determinism at System Calls
Master process executes system call
Redundant processes emulate it
Ignore some: rename(), unlink()
Execute similar/altered system call
Identical address space: mmap()
Process-specific data: open(), lseek()

Example of handling a read() system call (diagram steps, master process and redundant processes):
Barrier
Write cmd line parameters and syscall type to shmem
Compare syscall type and cmd line parameters
read()
Write resulting file offset and read buffer to shmem
Copy the read buffer from shmem
lseek() to correct file offset

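
The read() handling steps can be sketched in a single process. This is a rough model, not the authors' code: a dictionary stands in for the shared memory segment, separate file descriptors stand in for each process's own descriptor, the assignment of steps to master versus redundant processes is an assumption drawn from the "master executes, redundant processes emulate" rule, and `emulate_read` and `demo` are hypothetical names.

```python
import os
import tempfile


def emulate_read(master_fd, redundant_fds, requests):
    """Handle one read() for a master and its redundant processes.

    requests holds one (syscall_name, nbytes) tuple per process, master
    first. Returns the buffer each process ends up with, master first.
    """
    # Compare the syscall type and parameters every process published.
    if len(set(requests)) != 1:
        raise ValueError("processes disagree on the system call")
    _, nbytes = requests[0]

    # Only the master performs the real read().
    data = os.read(master_fd, nbytes)
    shmem = {"offset": os.lseek(master_fd, 0, os.SEEK_CUR), "buffer": data}

    buffers = [data]
    for fd in redundant_fds:
        buffers.append(bytes(shmem["buffer"]))      # copy the read buffer
        os.lseek(fd, shmem["offset"], os.SEEK_SET)  # match the master's offset
    return buffers


def demo():
    """Run emulate_read on a temporary file with one redundant process."""
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"hello world")
        f.flush()
        master = os.open(f.name, os.O_RDONLY)
        redundant = os.open(f.name, os.O_RDONLY)
        try:
            buffers = emulate_read(master, [redundant], [("read", 5)] * 2)
            rest = os.read(redundant, 6)  # continues from the synced offset
        finally:
            os.close(master)
            os.close(redundant)
    return buffers, rest
```

The lseek() step matters because the redundant process never issued its own read(): without it, its file offset would lag behind the master's and the next real system call would diverge.
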
8
Fault Detection and Recovery
Type of Error: Output Mismatch
Detection Mechanism: Detected as a mismatch of compare buffers on an output comparison
Recovery Mechanism: Use majority vote to ensure correct data exists, kill the incorrect process, and fork() to create a new one

Type of Error: Program Failure
Detection Mechanism: System call emulation unit registers signal handlers for SIGSEGV, SIGIOT, etc.
Recovery Mechanism: Re-create the dead process by forking one of the existing processes

Type of Error: Timeout
Detection Mechanism: Watchdog alarm times out
Recovery Mechanism: Determine the missing process and fork() to create a new one

PLR supports detection/recovery from multiple
faults by increasing number of redundant
processes and scaling the majority vote logic
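
The majority vote used for output-mismatch recovery can be sketched as below. This is an illustrative sketch, assuming each process's compare buffer is available as a `bytes` object; `majority_vote` is a hypothetical name, and the strict-majority requirement reflects the general property of voting schemes (tolerating f simultaneous faulty processes needs at least 2f + 1 processes), not a detail stated on the slide.

```python
from collections import Counter


def majority_vote(buffers):
    """Return (agreed_buffer, indices_of_faulty_processes).

    buffers: one output-compare buffer per process.
    Raises RuntimeError when no strict majority exists, since the correct
    data can then no longer be identified.
    """
    winner, votes = Counter(buffers).most_common(1)[0]
    if votes * 2 <= len(buffers):
        raise RuntimeError("no majority: correct output cannot be determined")
    faulty = [i for i, buf in enumerate(buffers) if buf != winner]
    return winner, faulty
```

The returned indices identify the processes to kill and replace by forking a process that voted with the majority.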

9
Experimental Methodology

Use a set of the SPEC2000 benchmarks
PLR prototype developed with Pin
Intercept system calls to implement PLR
Fault Injection
Gather an instruction count profile
Use profile to generate a test case
Test case: an instruction and a particular
execution of that instruction to fault
Run with Pin in JIT mode and use IARG_RETURN_REGS
to alter a random bit of the instruction's source
or destination registers
Fault Coverage
Use fault injector on test inputs generating 1000
test cases per benchmark
specdiff in SPEC2000 harness determines output
correctness
PLR Performance
Run PLR (in Probe mode using Pin Probes) on
reference inputs with two redundant processes
4-way SMP machine, each processor is
hyper-threaded
Use sched_setaffinity() to simulate various
hardware platforms
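
The fault-injection test-case generation can be sketched outside of Pin. This is a sketch under assumptions, not the authors' injector: the profile is modeled as a mapping from instruction address to execution count, instructions are picked in proportion to their count, and the 32-bit register width is a default chosen for illustration. `make_test_case` and `flip_bit` are hypothetical names.

```python
import random


def make_test_case(profile, rng, width=32):
    """Pick an instruction, one of its dynamic executions, and a bit to flip.

    profile: maps instruction address -> dynamic execution count.
    Assumption: instructions are chosen in proportion to their execution
    count, so every dynamic instruction is equally likely.
    """
    addrs = sorted(profile)
    addr = rng.choices(addrs, weights=[profile[a] for a in addrs])[0]
    execution = rng.randrange(profile[addr])  # which execution to fault
    bit = rng.randrange(width)                # which register bit to flip
    return addr, execution, bit


def flip_bit(value, bit, width=32):
    """Return value with one bit inverted, truncated to the register width."""
    return (value ^ (1 << bit)) & ((1 << width) - 1)
```

Passing a seeded `random.Random` makes a test case reproducible, which helps when a particular injected fault needs to be replayed.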

10
Fault Coverage

Watchdog timeout very rare so not shown
PLR detects all Incorrect and Failed cases
Effectively detects relevant faults and ignores
benign faults
Floating point correctness question (e.g.
168.wupwise, 172.mgrid)
Results actually differ, but the difference is
within specdiff's tolerance
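
The floating-point point above comes down to the difference between exact and tolerance-based comparison. The sketch below only illustrates that idea: it is not specdiff, and the tolerance values are assumptions chosen for the example. `outputs_match` is a hypothetical name.

```python
import math


def outputs_match(expected, actual, rel_tol=1e-6, abs_tol=0.0):
    """Compare two sequences of floats within a tolerance.

    Two runs can differ bit-for-bit yet still count as matching here,
    which an exact byte-wise comparison would report as a mismatch.
    """
    return len(expected) == len(actual) and all(
        math.isclose(e, a, rel_tol=rel_tol, abs_tol=abs_tol)
        for e, a in zip(expected, actual)
    )
```

A fault that perturbs only the low-order bits of a floating-point result would therefore pass a tolerance-based output check while still being flagged by an exact comparison of output buffers.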

11
Performance

Performance for single processor (PLR 1x1), 2 SMT
processors (PLR 2x1) and 4-way SMP (PLR 4x1)
Slowdown on the 4-way SMP is only 1.26x
Should be better on a CMP with faster processor
interconnect

12
Conclusion

Present a different way to use existing general
purpose SMT and CMP machines for transient fault
tolerance
Differentiate between hardware-centric and
software-centric fault detection models
Show how software-centric can be effective in
ignoring benign faults
PLR on a 4-way SMP executes with only a 26%
slowdown, a 36% improvement over the fastest
compiler technique
Future Work
Implementation in a run-time system allows for
dynamically altering amount of fault tolerance
Simple PLR model is presented; work remains on handling
interrupts, shared memory, and threads (the tough
one)

Questions?
