1
Design and Evaluation of Hybrid Fault-Detection
Systems
  • George A. Reis
  • Jonathan Chang, Neil Vachharajani, Ram Rangan,
    David I. August
  • Liberty Research Group, Princeton University
  • Shubu Mukherjee
  • FACT Group, Intel Corporation
  • 32nd Annual International Symposium on Computer
    Architecture

2
Transient Faults
  • Randomly change bits of state element or
    computation
  • Caused by external energetic particle striking
    processor
  • Cannot test for fault before hardware use

[Figure: example values 0x8675309 and 0x42; the values 0x32AA36852 and 0x22AA36852 differ in a single bit]
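
A single-bit upset of this kind can be modelled as an XOR with a one-bit mask. The sketch below is illustrative only: the helper name `flip_bit` is an assumption, not something from the slides, and bit 32 is chosen because it is the one position in which the two long values shown on this slide differ.

```python
def flip_bit(value: int, bit: int) -> int:
    """Return `value` with one bit inverted, modelling a single-bit upset."""
    return value ^ (1 << bit)


original = 0x32AA36852
faulty = flip_bit(original, 32)  # invert bit 32 only
print(hex(original), "->", hex(faulty))
```

Applying the same flip twice restores the original value, which is why a fault of this kind leaves no trace in the hardware once the state is overwritten.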
3
Severity of Transient Faults
  • IBM historically adds 20-30% additional logic for
    mainframe processors for fault tolerance [Slegel
    1999]
  • In 2000, Sun server systems deployed to America
    Online, eBay, and others crashed due to cosmic
    rays [Baumann 2002]
  • In 2003, Fujitsu released SPARC64 with 80% of
    200,000 latches covered by transient fault
    protection [Ando 2003]
  • Processors are becoming more susceptible
    • lower voltage thresholds
    • increased transistor count
    • faster clock speeds

4
Existing Instruction Level Techniques
  • Hardware solutions
    • Lockstepping (Stratus DMR)
    • Redundant Multithreading (Reinhardt & Mukherjee,
      ISCA 00)
    • Hardware cost
    • No application software changes
    • Fixed solution applied to all
    • Visibility into all state
    • More resources reduce performance degradation
  • Software solutions
    • EDDI, CFCSS (Oh et al. Transactions on
      Reliability 02)
    • Source-to-source (Rebaudengo et al. Source Code
      Analysis and Manipulation 01)
    • No hardware cost
    • Require software changes
    • Flexibility to continually trade off reliability
      and costs in the field
    • Visibility limited to architectural state
    • Fixed resources
  • Hybrid solutions take benefits from both
    • Trade off hardware, performance, and reliability
5
Fault Detection Requirements
  • Mechanism to create redundant computation
  • Mechanism to compare original and redundant
    results
6
Store Protection
  • If a tree falls in the forest, but nobody is
    around to hear it, does it make a sound?
  • If a fault affects some data, but does not change
    the output, does it make an error?
  • Only store operations affect output, so validate
    data before stores.
7
Fault Detection Requirements
  • Mechanism to ensure original and redundant reads
    from memory receive same values
  • Mechanism to create redundant computation
  • Mechanism to compare original and redundant
    results before writes to memory
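
The three requirements can be sketched in a few lines for a single load, increment, and store. This is a minimal illustration, not the authors' implementation: the names `checked_increment`, `FaultDetected`, and `inject_fault` are assumptions, and the injected bit position is arbitrary.

```python
class FaultDetected(Exception):
    """Raised when the original and redundant copies disagree."""


def checked_increment(memory: dict, addr: int, inject_fault: bool = False) -> None:
    # Requirement 1: load once and give the same value to both copies,
    # so original and redundant computation start from identical inputs.
    value = memory[addr]
    r1, r1_dup = value, value
    addr_dup = addr

    # Requirement 2: perform the same computation on each copy.
    r1 = r1 + 1
    r1_dup = r1_dup + 1

    if inject_fault:
        r1 ^= 1 << 3  # simulated single-bit upset in the original copy

    # Requirement 3: compare address and data before the store, the only
    # point at which a corrupted value could reach memory.
    if addr != addr_dup or r1 != r1_dup:
        raise FaultDetected(f"mismatch before store to {addr:#x}")
    memory[addr] = r1
```

A fault injected after the duplication point is caught at the store, so memory keeps its last good value.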
8
Redundant Multithreading
  • Hardware-only approach
    • Redundant code executes in separate hardware
      context
  • Hardware requirements
    • Multi-threaded machine
    • Load Value Queue
      • to ensure data loaded from memory is identical
        in both hardware contexts
    • Checking Store Buffer
      • compare both versions of data before committing
        data to memory
  • No software changes
  • Fixed redundancy for application
  • Only half of the hardware contexts available to
    Operating System

[Figure: Fetch engine 1, Fetch engine 2, Pipeline]
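
The roles of the two hardware structures can be modelled in software. The following is a simplified sketch, not a description of any real microarchitecture: the class and method names are assumptions, the two threads run one after the other rather than concurrently, and the fault is injected by hand.

```python
from collections import deque


class FaultDetected(Exception):
    """Raised when the two redundant threads disagree."""


class LoadValueQueue:
    """Replays the leading thread's loaded values to the trailing thread."""

    def __init__(self, memory):
        self.memory = memory
        self.pending = deque()

    def leading_load(self, addr):
        value = self.memory[addr]
        self.pending.append((addr, value))
        return value

    def trailing_load(self, addr):
        lead_addr, value = self.pending.popleft()
        if lead_addr != addr:
            raise FaultDetected("load address mismatch")
        return value  # memory is not re-read, so both threads see one value


class CheckingStoreBuffer:
    """Holds the leading thread's stores until the trailing thread confirms them."""

    def __init__(self, memory):
        self.memory = memory
        self.pending = deque()

    def leading_store(self, addr, value):
        self.pending.append((addr, value))

    def trailing_store(self, addr, value):
        if self.pending.popleft() != (addr, value):
            raise FaultDetected("store mismatch")
        self.memory[addr] = value  # commit only after both copies agree


def redundant_increment(memory, addr, flip_trailing_bit=None):
    """Run ld/add/st in a leading and then a trailing thread."""
    lvq, csb = LoadValueQueue(memory), CheckingStoreBuffer(memory)
    csb.leading_store(addr, lvq.leading_load(addr) + 1)
    result = lvq.trailing_load(addr) + 1
    if flip_trailing_bit is not None:
        result ^= 1 << flip_trailing_bit  # simulated transient fault
    csb.trailing_store(addr, result)
```

Because the trailing thread takes its value from the queue rather than from memory, an intervening write by another agent cannot make the two threads diverge without a fault.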
9
Schedule
Thread 1 Thread 2
    ld  r1 = [r2]
    add r1 = r1 + 1
    st  [r2] = r1
10
Schedule
Thread 1 Thread 2
    ld  r1 = [r2]
    add r1 = r1 + 1
    st  [r2] = r1

    ld  r1 = [r2]
    add r1 = r1 + 1
    st  [r2] = r1
12
Schedule
Thread 1 Thread 2
    ld  r1 = [r2]
    add r1 = r1 + 1
    st  [r2] = r1

    ld  r1 = [r2]
    add r1 = r1 + 1
    st  [r2] = r1

    ld  r1 = [r2]
    add r1 = r1 + 1
    st  [r2] = r1

  • Instruction duplication, register allocation, and
    scheduling can be moved into software
  • Hybrid scheme: CRAFT (CompileR Assisted Fault
    Tolerance)
13
Hybrid Reliability
  • Leverage reliability that software can provide
    • Compiler duplicates and schedules instructions,
      allocates registers
  • Free up hardware thread resources
    • Work can be done on other thread
    • Applicable to single-threaded machines
  • Maintain Input / Output hardware
    • Load Value Queue
    • Checking Store Buffer

[Figure: Fetch engine 1, Fetch engine 2, Pipeline, Load Value Queue, Checking Store Buffer]
14
Fault Detection Requirements
  • Mechanism to ensure original and redundant reads
    from memory receive same values
  • Mechanism to create redundant computation
  • Mechanism to compare original and redundant
    results before writes to memory
  • Mechanism to guarantee correct control flow
15
Control Flow Protection
  • Original and redundant computation in one thread
    • Incorrect control flow will divert both versions
    • Redundant and original may compute the same, but
      incorrect, value

[Figure: br (b == 0), with "mov r1 = 1; mov r1' = 1" on one path and "mov r1 = 0; mov r1' = 0" on the other]
  • Compiler adds instructions to compute redundant
    PC
    • Set before branch
    • Validate at destination
  • Not perfect, but effective
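
The set-before-branch, validate-at-destination idea can be sketched with a tiny block interpreter. This is a hedged illustration rather than the compiler's actual scheme: the interpreter `run`, the `misroute` parameter, the block names, and the choice of which path assigns which value to `r1` are all assumptions made for the example.

```python
class FaultDetected(Exception):
    """Raised when execution arrives at a block other than the intended one."""


def run(blocks, entry, env, misroute=None):
    """Execute basic blocks, checking a redundant copy of the intended target."""
    expected = entry  # redundant copy of the intended program counter
    current = entry
    trace = []
    while current is not None:
        if current != expected:  # validate at destination
            raise FaultDetected(f"expected {expected!r}, arrived at {current!r}")
        trace.append(current)
        target = blocks[current](env)
        expected = target  # set before the branch is taken
        if misroute is not None and current == misroute[0]:
            target = misroute[1]  # simulated fault in the branch itself
        current = target
    return trace


def entry_block(env):
    return "then" if env["b"] == 0 else "else"


def then_block(env):
    env["r1"] = 1
    return None


def else_block(env):
    env["r1"] = 0
    return None


BLOCKS = {"entry": entry_block, "then": then_block, "else": else_block}
```

The check only catches arrival at the wrong block. A fault that corrupts the branch condition before the redundant target is computed would steer both copies the same way, which is one reason the scheme is not perfect.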
16
Fault Detection Requirements
  • Mechanism to ensure original and redundant reads
    from memory receive same values
  • Mechanism to create redundant computation
  • Mechanism to compare original and redundant
    results before writes to memory
  • Mechanism to guarantee correct control flow
17
Removing the Checking Store Buffer
  • ld r1 = [r2]
  • ld r1' = [r2']
  • add r1 = r1 + 1
  • add r1' = r1' + 1
  • br faultdet, r1 != r1'
  • br faultdet, r2 != r2'
  • st [r2] = r1
  • st [r2'] = r1'

[Figure: Fetch engine 1, Fetch engine 2, Pipeline, Load Value Queue, Checking Store Buffer]
CRAFT:LVQ (still has Load Value Queue)
18
Removing the Load Value Queue
  • br faultdet, r2 != r2'
  • ld r1 = [r2]
  • ld r1' = [r2']
  • mov r1' = r1
  • add r1 = r1 + 1
  • add r1' = r1' + 1
  • st [r2] = r1
  • st [r2'] = r1'

[Figure: Fetch engine 1, Fetch engine 2, r2 window, r1 window, Pipeline, Load Value Queue, Checking Store Buffer]
CRAFT:CSB (still has Checking Store Buffer)
19
Removing Both Structures
  • br faultdet, r2 != r2'
  • ld r1 = [r2]
  • ld r1' = [r2']
  • mov r1' = r1
  • add r1 = r1 + 1
  • add r1' = r1' + 1
  • br faultdet, r1 != r1'
  • br faultdet, r2 != r2'
  • st [r2] = r1
  • st [r2'] = r1'

[Figure: Fetch engine 1, Fetch engine 2, Pipeline, Load Value Queue, Checking Store Buffer]
CRAFT:NONE is SWIFT (Software-only)
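
The software-only sequence on this slide can be produced mechanically from the three-instruction original. The toy pass below is a sketch under stated assumptions: the tuple encoding of instructions, the function names `dup` and `protect`, and the prime suffix for redundant registers are all illustrative, and the pass handles only the three opcodes that appear in the example.

```python
def dup(reg: str) -> str:
    """Name of the redundant copy of a register (prime notation assumed)."""
    return reg + "'"


def protect(program):
    """Duplicate computation and insert checks before loads and stores."""
    out = []
    for op, *args in program:
        if op == "ld":
            dst, addr = args
            out.append(f"br faultdet, {addr} != {dup(addr)}")  # check address
            out.append(f"ld {dst} = [{addr}]")
            out.append(f"mov {dup(dst)} = {dst}")  # copy instead of a second load
        elif op == "add":
            dst, src, imm = args
            out.append(f"add {dst} = {src} + {imm}")
            out.append(f"add {dup(dst)} = {dup(src)} + {imm}")
        elif op == "st":
            addr, src = args
            out.append(f"br faultdet, {src} != {dup(src)}")    # check data
            out.append(f"br faultdet, {addr} != {dup(addr)}")  # check address
            out.append(f"st [{addr}] = {src}")
        else:
            raise ValueError(f"unsupported opcode: {op}")
    return out


program = [("ld", "r1", "r2"), ("add", "r1", "r1", 1), ("st", "r2", "r1")]
for line in protect(program):
    print(line)
```

Copying the loaded value with `mov` stands in for the Load Value Queue, and the two branches before the store stand in for the Checking Store Buffer.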
20
Spectrum of Solutions
  • Applicability
  • Spectrum of solutions
    • Redundant Multithreading
    • CRAFT w/ Load Value Queue and Checking Store
      Buffer
    • CRAFT w/ Load Value Queue
    • CRAFT w/ Checking Store Buffer
    • SWIFT

24
Hybrid Evaluation
  • Implemented SWIFT, CRAFT:CSB, CRAFT:LVQ, and
    CRAFT:CSB+LVQ
  • Modified pre-release version of OpenIMPACT
    compiler
  • Target IA-64 on Intel Itanium 2
  • Applied redundancy to all computation
  • Optimization and ILP scheduling applied to
    resulting code
  • Performance and reliability for benchmarks from
    SPECINT2000, SPECFP2000, SPECINT95, MediaBench

25
Hybrid Evaluation Performance
  • Execution times normalized to no fault detection

26
Hybrid Evaluation Reliability
  • Reliability evaluated using fault injection (3
    structures)
    • Single bit flip per execution
    • 5000 injection executions per structure per
      benchmark per system
    • Use combination of microarchitectural and
      architectural simulation
    • Using Liberty Simulation Environment framework
      with validated Intel Itanium 2 model (Penry et
      al. MOBS 05)
  • Mean Work To Failure
    • Encompasses longer execution time and increased
      reliability
    • Generalization of Mean Instructions To Failure
      (Weaver et al. ISCA 04)
    • Instruction not constant unit of work in hybrid
      systems
    • Proportional to
      1 / (Architectural Vulnerability × Execution
      time per unit of work)
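
Because the metric is proportional to the reciprocal of vulnerability times execution time, a protected system can be compared with the unprotected baseline using two ratios. The sketch below assumes the raw error rate is the same for both systems; the function name `relative_mwtf` and the numbers in the example are hypothetical and are not results from the paper.

```python
def relative_mwtf(avf_ratio: float, time_ratio: float) -> float:
    """Mean Work To Failure relative to the baseline without fault detection.

    avf_ratio:  protected vulnerability / baseline vulnerability
    time_ratio: protected execution time / baseline execution time
    """
    if avf_ratio <= 0 or time_ratio <= 0:
        raise ValueError("ratios must be positive")
    return 1.0 / (avf_ratio * time_ratio)


# Hypothetical example: vulnerability cut to 10% at a 1.5x slowdown.
print(round(relative_mwtf(0.10, 1.5), 2))
```

A technique that halves vulnerability but doubles execution time leaves the metric unchanged, which is why both factors have to be reported together.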

27
Hybrid Evaluation Reliability
  • Mean Work To Failure normalized to no fault
    detection (dSDC: see paper)

28
Benefits of Selective Protection
  • Software control provides selective protection
    • Hybrid and Software systems enable software
      control
    • Compiler/user/runtime system can make different
      decisions for different code regions
      • Programs, functions, or individual instructions
  • Regions have different levels of natural fault
    resistance
  • Output corrupting faults have different severity

[Figure: original jpegenc output; faulty jpegenc output; faulty? jpegenc output]
  • Selective protection can improve reliability

29
Conclusions
  • Hybrid solutions benefit from both hardware and
    software
    • Hardware support provides microarchitectural
      visibility
    • Software control allows parameters to be tailored
      to task
  • Hybrid solutions create a spectrum of reliability
    solutions
    • Spectrum offers choices
    • Best solution depends on system constraints,
      cost, etc.
    • Tradeoff between hardware, reliability, and
      performance