Fault injection: a way to build more reliable computers - PowerPoint PPT Presentation

Description:

Henrique Madeira, University of Coimbra, Portugal ECE Seminar ... SCIFI ... SCIFI: Scan Chain Implemented Fault Injection. IEEE 1149.1 boundary-scan ... – PowerPoint PPT presentation

1
Fault injection: a way to build more reliable
computers
  • Henrique Madeira
  • University of Coimbra, Portugal

2
Presentation outline
  • Basic concepts and fault injection technologies
  • Current research issues and problems in fault
    injection
  • Fault representativeness: emulation of software
    faults
  • Conclusions

I need fault injection: what are my choices?
I want to do research in this area: what are the
research problems in FI?
3
What is fault injection?
Deliberate insertion of upsets (faults or errors)
into computer systems to evaluate their behavior in
the presence of faults or to validate specific fault
tolerance mechanisms.
4
Faults, Errors, and Failures
[Figure: diagram relating fault, error, and failure]
5
Fault Injection goals
  • Fault removal: aims to detect the presence of
    design and implementation faults, and then to
    locate and remove them.
  • Fault forecasting: aims to quantify the
    confidence that can be attributed to a system by
    estimating the number and the consequences of
    possible faults in the system.

6
What is a fault injector?
  • A fault injector is a tool that performs the
    following tasks
  • Generates sets of faults according to different
    criteria
  • Injects faults in an automatic way (or with
    reduced manual intervention)
  • Collects results (direct measures) on fault impact
    in the target system
  • Provides support for result analysis.

7
Design of fault injection tools
Two technical problems
  • How to inject faults in the very complex systems
    available today?
  • How to monitor the effects of the faults?
  • Detect if faults have been injected
  • Collect info on the fault actually injected
  • Detect if errors have been discharged or
    overwritten
  • Trace target system behavior
  • ...

8
Main fault injection techniques
  • Physical faults
  • Pin-level
  • Heavy-ion radiation
  • Power supply disturbances
  • Other types of physical upsets
  • Simulation based fault injection
  • Emulation of faults by software
  • SWIFI
  • SCIFI (boundary scan)

9
Pin-level fault Injection
To monitoring logic
10
Pin-level fault injection in practice
Is this possible for complex processors?
11
Heavy-ion fault injection
Is this possible for complex processors?
12
Physical Fault Injection Techniques
  • Many problems today
  • Hardware is too complex
  • Poor controllability
  • Poor observability of fault effects
  • Huge development efforts
  • Low portability

13
Simulation Fault Injection
  • Major problems
  • Large development efforts
  • Time consuming
  • Models are not readily available

14
Main fault injection techniques
  • Physical faults
  • Pin-level
  • Heavy-ion radiation
  • Power supply disturbances
  • Other types of physical upsets
  • Simulation based fault injection
  • Emulation of faults by software
  • SWIFI
  • SCIFI (boundary scan)

15
Software Implemented Fault Injection
  • Basic SWIFI idea
  • 1) Interrupt the target application/system in
    some way (e.g., by inserting a trap instruction
    or by executing the application in trace mode)
  • 2) Execute a specific fault injection routine that
    emulates faults by inserting errors in different
    parts of the system (processor registers, memory,
    ...)
  • 3) Resume the execution of the interrupted
    program
  • 4) Collect results on fault manifestations at
    different levels (system level, application
    level, FTM, etc.).
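
The four steps above can be sketched on a toy target. This is only an illustration under stated assumptions: the "program" is a list of Python functions acting on a register dictionary, and the names make_workload, run, and classify are hypothetical, not part of any real SWIFI tool.

```python
def make_workload(n=5):
    """Toy target program: n identical steps that sum 1..n into regs['acc']."""
    def step(regs):
        regs["i"] += 1
        regs["acc"] += regs["i"]
    return [step] * n


def run(program, fault=None):
    """Execute the program; if a fault is given, inject it before the chosen step."""
    regs = {"i": 0, "acc": 0, "tmp": 0}
    for step_no, instr in enumerate(program):
        if fault is not None and step_no == fault["step"]:
            # 1) the target is "interrupted" here; 2) the injection routine
            # emulates a fault by flipping one bit of one register
            regs[fault["reg"]] ^= 1 << fault["bit"]
        instr(regs)  # 3) execution resumes
    return regs["acc"]  # the program's visible output


def classify(program, fault):
    """4) Compare the faulty run against a fault-free (golden) run."""
    golden = run(program)
    observed = run(program, fault)
    return "no effect" if observed == golden else "wrong result"


program = make_workload()
print(classify(program, {"step": 2, "reg": "i", "bit": 3}))    # fault in a live register
print(classify(program, {"step": 2, "reg": "tmp", "bit": 3}))  # fault in an unused register
```

The first injection corrupts the loop counter and changes the output; the second hits a register the program never reads, so the fault stays without visible effect.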

16
Fault definition
  • Faults are defined according to two main
    attributes
  • Fault model: describes what is corrupted and how
    it is corrupted
  • Fault trigger: describes when the fault should be
    injected.
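
One possible way to represent such a fault definition as data is sketched below. The class and field names (FaultModel, FaultTrigger, location, bit_mask, event) are assumptions made for illustration; they do not come from Xception or any other tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultModel:
    location: str   # what is corrupted, e.g. "GPR3" or "memory"
    bit_mask: int   # which bits of that location are affected
    operation: str  # how it is corrupted, e.g. "bit-flip"


@dataclass(frozen=True)
class FaultTrigger:
    event: str      # e.g. "instruction_fetch", "data_access", "time"
    value: int      # address for fetch/access events, tick count for "time"

    def fires(self, event: str, value: int) -> bool:
        """Return True when an observed event matches this trigger."""
        return event == self.event and value == self.value


@dataclass(frozen=True)
class Fault:
    model: FaultModel      # what and how
    trigger: FaultTrigger  # when


fault = Fault(
    model=FaultModel(location="GPR3", bit_mask=0x1, operation="bit-flip"),
    trigger=FaultTrigger(event="instruction_fetch", value=0x1000),
)
```

Keeping the model and the trigger separate lets the same corruption be reused with different injection moments, and vice versa.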

17
SWIFI pros and cons
  • Advantages
  • Not much affected by the complexity of the target
  • Low complexity
  • Low cost and development effort
  • Reasonably portable
  • No physical interferences
  • Typical disadvantages
  • Do not cover faults in peripheral devices, ASICs,
    etc.
  • Limited monitoring capabilities
  • Tools have great impact on the target system
    behavior

18
Why great impact on the target behavior?
Because the FI tool must
  • Detect when faults should be injected
  • Collect info on the fault actually injected
  • Detect if errors have been discharged or
    overwritten
  • Trace target system behavior
19
Impact of workload on the fail-silent violations
observed
Target: commercial parallel system from Parsytec
(no fault tolerance mechanisms)
Fault injection tool: Xception
Kind of faults: HW transient
Critical faults: percentage of faults that caused
wrong program results with no error detected in the
system.
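
The critical-fault measure defined on this slide is a simple ratio. A minimal sketch follows, assuming each experiment has already been labelled with one outcome string; the labels and the sample counts are invented for illustration and are not results from the Parsytec experiments.

```python
from collections import Counter


def critical_fault_percentage(outcomes):
    """Percentage of injected faults that gave wrong results with no error detected."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no experiments recorded")
    return 100.0 * counts["wrong_result_undetected"] / total


# Made-up outcome log for ten injections (illustrative only)
sample = ["no_effect"] * 6 + ["detected"] * 3 + ["wrong_result_undetected"]
print(critical_fault_percentage(sample))
```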
20
Coverage of a set of error detection techniques
for different workloads in the target system
Target: two industrial control systems based on
the Z80 and 68000
Fault injection tool: RIFLE
Kind of faults: HW transient
  • Error detection methods
  • Memory protection
  • WDT
  • Signature monitoring
  • SW assertions

[Chart: results for the 68K and Z80 systems, vertical axis from 70.0 to 100.0, by workload: Search (Pesquisa), CRC, Alea, QuickSort, Matrix (Matriz), Sieve, C library (Biblioteca C)]
21
Xception Approach
  • Use processor built-in debugging and performance
    monitoring features to inject faults by software
    with minimal interference (hybrid approach)
  • Breakpoint registers - normally used to facilitate
    the design and implementation of debuggers
  • Special counters - normally used to facilitate the
    design and implementation of performance
    monitoring tools.

22
Breakpoint Registers
  • Used to implement non-intrusive fault triggers.
    Faults can be injected when
  • Read from/write to a specific memory address
  • Instruction fetch from specific address
  • Execution of specific instructions (e.g.,
    floating point instructions)
  • After a given time has elapsed from application
    start or from the occurrence of another event
  • Combination of the above triggers.

23
Special Counters
  • Used to collect data on the target system
    behavior after the injection of a fault
  • Counts number of instructions executed from the
    injection of the fault until a given moment
  • The same for...
  • Number of clock cycles
  • Number of read clock cycles or write clock cycles
  • Number of instructions executed of a given type
  • etc, etc.

24
Using both the breakpoint registers and the
special counters
  • It is possible to detect (after the injection of
    each fault)
  • The activation of latent errors in memory
  • Corruption (erroneous writes) of specific memory
    areas
  • Execution of specific programs/routines
  • Measure the moment in time when some memory cell
    is written
  • Count the number of occurrences of events such as
    those mentioned above.

25
What kind of faults can be injected with Xception?
  • Transient faults in the memory and processor
    functional/structural units. For example
  • Integer Unit (IU)
  • Floating Point Unit (FPU)
  • Internal Data Bus
  • Internal Address Bus
  • General Purpose Registers (GPRs)
  • Condition Code Register
  • Memory Management Unit (MMU)

26
Xception fault injector tool
[Figure: Xception architecture, showing the Xception host and the target system connected by a network. Labels: Xception Experiment Control, Fault Parameters, Fault Config., Log File, Results, Faults, Fault Impact, Application, Application Output, Fault Injection Exception Handlers, Kernel]
27
Xception today
  • Is being marketed by Critical Software
    http://www.criticalsoftware.com/
  • Available for Windows NT, Linux, and LynxOS (for
    PowerPC and Pentium processors)
  • Has been used (or is going to be used) in several
    European universities, JPL, Cisco,...
  • Example of recent research works
  • Evaluation of an Oracle database running the TPC-C
    benchmark (DSN 2000)
  • Evaluation of a real-time control application
    (DSN 2001)

28
An alternative to Xception approach
  • SCIFI: Scan Chain Implemented Fault Injection
  • IEEE 1149.1 boundary-scan standard for HW testing
  • Available for most complex ICs (processors,
    DSPs, ASICs,...)
  • GOOFI: tool from Chalmers University (Sweden)
  • Advantage: improved control and monitoring
  • Disadvantage: greater intrusiveness and needs
    some HW support

29
What about the future of SWIFI tools?
Four attributes that must be improved
  • Precision of fault models
  • Intrusiveness
  • Usability
  • Portability

Software Faults
30
Is it possible to emulate software faults (bugs)
with a SWIFI tool?
31
Motivation
  • Problem: emulation of software faults by fault
    injection
  • Relevance: why are software faults important?
  • Software faults are probably the major cause of
    system outages
  • By injecting software faults it is possible to
    assess the consequences of hidden bugs
    (experimental risk assessment).
  • Traditional approaches to the injection of SW
    faults
  • Optimistic: we have injected so many different
    faults that at least some of them should emulate
    software faults...
  • Best effort: injected faults that mimic typical
    software bugs...

32
Our current research on software fault emulation
  • Two-step approach
  • 1 - The source code of the target program/system
    is available
  • 2 - The source code of the target program/system
    is not available
  • Goals
  • Evaluation of accuracy of software fault models
    used by SWIFI
  • Evaluation of SWIFI tools concerning the
    emulation of SW faults
  • New features of SWIFI tools required for SW fault
    emulation
  • Use of software metrics (and tools) to guide the
    SW fault generation and injection process
    (instead of using field data)

33
What is a software fault?
Software development process (in theory...):
Requirements, Specification, Design, Code
development, Test, Deployment
  • The requirements and the specification are
    correct (OK) but the deployed code is not

34
Characterization of software faults
  • A SW fault is characterized by the change in the
    code that is necessary to correct it (Orthogonal
    Defect Classification).
  • Defined according to two parameters
  • Fault trigger: conditions that cause the fault to
    be exposed
  • Fault type: type of mistake in the code

35
Types of software faults (ODC)
  • Assignment: values assigned incorrectly or not
    assigned
  • Checking: missing or incorrect validation of data,
    or incorrect loop, or incorrect conditional
    statement
  • Timing/serialization: missing or incorrect
    serialization of shared resources
  • Algorithm: incorrect or missing implementation
    that can be fixed without the need of design
    change
  • Function: incorrect or missing implementation that
    requires a design change to be corrected

36
Software fault definition and fault emulation
EMULATED FAULT (SWIFI level): What, Where, Which, When
ACTUAL FAULT (source code level): Fault Type, Fault
Trigger
37
Injecting hardware faults
  • Errors at machine code level (processor
    programming model) map nearly directly to
    hardware faults (bit flips, stuck-at)
  • Faults in processor units
  • Faults in memory
  • Faults in other addressable devices (I/O cards,
    etc).

Machine code level
Xception injects errors at this level (low
intrusion level)
Hardware faults
Hardware
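The two hardware fault models named on this slide, bit flips and stuck-at faults, reduce to simple bit operations on a register or memory word. The sketch below assumes a 32-bit word; the function names are hypothetical.

```python
WORD_MASK = 0xFFFFFFFF  # assumption: 32-bit registers and memory words


def bit_flip(word, bit):
    """Transient fault: invert a single bit of the word."""
    return (word ^ (1 << bit)) & WORD_MASK


def stuck_at(word, bit, level):
    """Stuck-at fault: force a single bit to 0 or 1 regardless of its value."""
    if level:
        return (word | (1 << bit)) & WORD_MASK
    return word & ~(1 << bit) & WORD_MASK


value = 0b1010
print(bin(bit_flip(value, 0)))     # bit 0 inverted
print(bin(stuck_at(value, 1, 0)))  # bit 1 forced to 0
print(bin(stuck_at(value, 1, 1)))  # bit 1 already 1: fault is masked
```

The last call shows why some injected faults have no effect: a stuck-at-1 on a bit that already holds 1 leaves the word unchanged.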
38
Injecting software faults
  • Software faults are defined at source code level.
    The questions are
  • Is it possible to map all classes of SW faults
    into errors injected at the Xception/SWIFI level?
  • Is this mapping accurate?

Source code level
Software faults
Machine code level
Hardware faults
Hardware
39
Injecting software faults
Find real software faults and try to emulate them
accurately using the Xception
40
Sources of real software faults
  • Programs resulting from the ACM International
    Collegiate Programming Contest (IOI98)
  • Rationale
  • Programs written according to a formal, clear,
    and correct specification (the problem proposed
    to the contest participants)
  • Programs written by very skilled programmers
  • Access to many implementations from the same
    specification (237 teams at IOI 98)
  • There is a test case associated to each problem
    specification that defines the acceptance
    criteria for correct programs for the contest.

41
Programs used and Xception version
  • Target programs - different implementations of
  • Camelot: computes the minimum number of moves
    required to gather all the pieces of a chessboard
    in the same square
  • JamesB: encodes strings according to a specific
    algorithm.
  • Xception for PowerPC 601 processor.

42
Software bug detection
  • Programs considered correct have been selected
  • These programs were intensively tested using a
    very thorough test case; programs that fail this
    intensive test case have software faults
  • Programs with faults have been analyzed to
    identify and classify the fault.

SW faults found
  • Assignment: 2 faults (JamesB.team6, Camelot.team4)
  • Checking: 1 fault (Camelot.team1)
  • Algorithm: 4 faults (JamesB.team7, Camelot.team2,
    Camelot.team3, Camelot.team5)
43
Example of assignment fault (camelot.team4)
Machine code (PowerPC):
    L..26:
        .line 12
        addi r3,r0,0
        stw  r3,24(sp)
        lwz  r4,T.n(toc)
        lwz  r4,0(r4)
        cmp  0x7,0x0,r3,r4
        bc   0x4,0x1c,L..28
    L..27:
Should be: addi r3,r0,1
Excerpt of source code:
    . . .
    for (i = 0; i < n; i++)
        visited[x[i]][y[i]] = TRUE;
    . . .
Should be: for (i = 1; i < n; i++)
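At machine-code level, the faulty and the corrected versions of this assignment differ only in the immediate field of one addi instruction, which is why a SWIFI tool can emulate it. The sketch below encodes both instructions using the standard PowerPC D-form layout (primary opcode 14, then target register, source register, and a 16-bit immediate); the helper name encode_addi is hypothetical.

```python
def encode_addi(rt, ra, simm):
    """Encode PowerPC `addi rt,ra,simm` (D-form, primary opcode 14)."""
    return (14 << 26) | (rt << 21) | (ra << 16) | (simm & 0xFFFF)


faulty = encode_addi(3, 0, 0)   # addi r3,r0,0 (as found in camelot.team4)
correct = encode_addi(3, 0, 1)  # addi r3,r0,1 (the corrected instruction)
difference = faulty ^ correct

print(hex(faulty), hex(correct), hex(difference))
```

The XOR of the two words has a single bit set, so injecting this software fault into a correct binary amounts to changing one bit of the instruction as it is fetched.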
44
Example of checking fault (camelot.team1)
Machine code (PowerPC):
    L..94:
        .line 23
        . . .
        cmp 0x6,0x0,r3,r4
        bc  0x4,0x18,L..96
        lwz r4,112(sp)
        . . .
    L..96:
Excerpt of source code:
. . . if ((depth lt timexy) (
timexy -1)) / Body of if statement
/ . . .
Translated into 19 machine code lines
Should be if ((depth lt timexy) (timexy
-1))
Should be bc 0x4,0x19,L..96
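The checking fault likewise comes down to one field of one instruction: the condition-register bit tested by the conditional branch. The sketch below uses the standard PowerPC B-form layout (primary opcode 16, then BO, BI, and a 14-bit word displacement, with AA and LK left at 0); encode_bc is a hypothetical helper and the displacement value is arbitrary.

```python
def encode_bc(bo, bi, bd_words=0):
    """Encode PowerPC `bc bo,bi,target` (B-form, primary opcode 16, AA=LK=0)."""
    return (16 << 26) | (bo << 21) | (bi << 16) | ((bd_words & 0x3FFF) << 2)


faulty = encode_bc(0x4, 0x18, bd_words=10)   # bc 0x4,0x18,L..96
correct = encode_bc(0x4, 0x19, bd_words=10)  # bc 0x4,0x19,L..96
difference = faulty ^ correct

print(hex(difference), bin(difference).count("1"))
```

Only the lowest bit of the BI field differs, so the two branches test adjacent bits of the same condition register field.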
45
Example of algorithm fault (camelot.team5)
Machine code (PowerPC):
    . . .
    add  r3,r3,r4
    .line 5
    .ef  69
    addi sp,sp,48
    bclr 0x14,0x0
    . . .
Excerpt of source code (translated into 26 machine code lines):
    . . .
    int dist(int x1, int y1, int x2, int y2)
    {
        int dx = x1 - x2;
        int dy = y1 - x2;
        return ((dx > 0) ? dx : -dx) + ((dy > 0) ? dy : -dy);
    }
    . . .
Xception cannot directly emulate this fault.
However, some algorithm faults can be emulated
indirectly by equivalent assignment and/or
checking faults. For example, this fault assigns
an incorrect value to the return value of
function dist.
Should be (machine code):
    bl    .max
    nop
    .line 5
    .ef   69
    lwz   r0,88(sp)
    addi  sp,sp,80
    mtspr lr,r0
    bclr  0x14,0x0
Should be (source code): max(((dx > 0) ? dx : -dx), ((dy > 0) ? dy : -dy))
46
Example of assignment fault (JamesB.team6)
Source code:
    . . .
    void main()
    {
        int i, i2, size, no_iter, code;
        char phrase[80];
        char phrase2[80];
        /* Rest of main */
    }
    . . .
Machine code (PowerPC):
    . . .
    addi r5,r5,-1
    stw  r5,240(sp)
    addi r5,r31,-1
    stw  r5,236(sp)
    . . .
  • Fault emulation: this fault can be emulated by
    shifting all the stack references in main
    that correspond to the affected variables phrase
    and phrase2 (and other variables defined in main
    after these ones in the code).
  • Limitations
  • The breakpoint registers needed to inject these
    faults with low intrusiveness are very limited in
    number.
  • Extra software tools are needed to assist the
    definition of this kind of fault

Should be (machine code):
    addi r5,r5,-1
    stw  r5,248(sp)
    addi r5,r31,-1
    stw  r5,244(sp)
Should be (source code): char phrase[81]; char phrase2[81];
47
What can be emulated by Xception (or other SWIFI
tools)
  • Assignment and checking faults can be accurately
    emulated by Xception (assignment faults emulated
    by stack shifts have limitations)
  • Algorithm and function faults cannot be emulated.
    Field data suggest that these kinds of faults
    account for nearly 44% of SW faults
    (Christmansson, Chillarege)
  • Tools to assist fault definition are required.

48
Emulation of software faults of a given class
  • 1) Locate in the source code all the lines in which
    a given type of fault can be injected (e.g.,
    assignment faults)
  • 2) Use compiler, assembler, linker, and loader
    facilities (tables of symbols, labels, variables,
    etc.) to map high-level target instructions into
    machine-level instructions (this task needs a
    tool)
  • 3) Steps 1) and 2) result in a list of possible
    locations (machine instructions) to inject faults
    of the desired class
  • 4) Define faults using typical SWIFI fault triggers
    and fault types to corrupt (insert errors in) the
    right machine instructions according to the
    desired type of software fault
    (e.g., for assignment faults: value + 1, value - 1,
    random value, unassigned (NOP))
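
The assignment fault types listed in the example can be generated mechanically for each candidate location. The sketch below assumes that the two "value 1" entries on the slide stand for value + 1 and value - 1 (the operators appear to have been lost in the transcript) and that values are 32-bit signed integers; the function name and dictionary keys are hypothetical.

```python
import random


def assignment_fault_values(value, rng=None):
    """Return the corrupted values for one assignment, keyed by fault type."""
    rng = rng or random.Random()
    return {
        "plus_one": value + 1,
        "minus_one": value - 1,
        "random": rng.randrange(-2**31, 2**31),
        # The store is replaced by a NOP: the variable keeps whatever it
        # held before, represented here as None (no new value written).
        "unassigned": None,
    }


variants = assignment_fault_values(10, random.Random(0))
for fault_type, corrupted in variants.items():
    print(fault_type, corrupted)
```

Passing a seeded random.Random makes a fault list reproducible, which helps when an experiment has to be repeated.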

49
Failure modes for assignment faults
Acceleration of the process or injection of naive
SW faults?
50
Failure modes for checking faults
51
Failure modes for type of assignment faults
52
Failure modes for type of checking faults
The high percentage of incorrect results can be
explained because many faults just represent
naive software errors
53
Tune the fault types and fault triggers
  • Fault types
  • Eliminate/reduce fault types that represent naive
    SW faults
  • Fault triggers
  • Use software metrics to choose the modules to
    inject faults and define trigger locations
    accordingly
  • Metrics of software complexity based on
  • Static features of the code
  • Dynamic features
  • Possible information on the development process
    (type of tests, etc)
  • ...

54
Conclusions
  • SWIFI tools are the tools today
  • FI is mature for hardware transient faults
  • Emulation of more complex classes of faults is
    required
  • SW faults can be emulated in part (assignment and
    checking, at least)
  • Current tools (Xception included) are not
    prepared to assist the user during fault
    definition
  • Further research is required to tune the process
    of emulating classes of complex faults
  • Interesting fault types
  • Software metrics to guide the fault trigger
    definition
  • Emulation of SW faults when the source code of
    the target is not available
  • Interaction faults (among modules) and operator
    faults.