Title: Fault injection: a way to build more reliable computers
1Fault injection: a way to build more reliable computers
- Henrique Madeira
- University of Coimbra, Portugal
2Presentation outline
- Basic concepts and fault injection technologies
- Current research issues and problems in fault injection
- Fault representativeness: emulation of software faults
- Conclusions
I need fault injection: what are my choices?
I want to do research in this area: what are the research problems in FI?
3What is fault injection?
Deliberate insertion of upsets (faults or errors) in computer systems to evaluate their behavior in the presence of faults or to validate specific fault tolerance mechanisms.
4Faults, Errors, and Failures
[Figure: relationship between Fault, Error, and Failure]
5Fault Injection goals
- Fault removal: aims to detect the presence of design and implementation faults, and then to locate and remove them.
- Fault forecasting: aims to quantify the confidence that can be attributed to a system by estimating the number and the consequences of possible faults in the system.
6What is a fault injector?
- A tool that performs the following tasks:
- Generates sets of faults according to different criteria
- Injects faults in an automatic way (or with reduced manual effort)
- Collects results (direct measures) on fault impact in the target system
- Provides support for result analysis.
7Design of fault injection tools
Two technical problems
- How to inject faults in the very complex systems available today?
- How to monitor the effects of the faults?
- Detect if faults have been injected
- Collect info on the fault actually injected
- Detect if errors have been discarded or overwritten
- Trace target system behavior
- ...
8Main fault injection techniques
- Physical faults
- Pin-level
- Heavy-ion radiation
- Power supply disturbances
- Other types of physical upsets
- Simulation based fault injection
- Emulation of faults by software
- SWIFI
- SCIFI (boundary scan)
9Pin-level fault Injection
To monitoring logic
10Pin-level fault injection in practice
Is this possible for complex processors?
11Heavy-ion fault injection
Is this possible for complex processors?
12Physical Fault Injection Techniques
- Many problems today
- Hardware is too complex
- Poor controllability
- Poor observability of the fault effects
- Huge development efforts
- Low portability
13Simulation Fault Injection
- Major problems
- Large development efforts
- Time consuming
- Models are not readily available
14Main fault injection techniques
- Physical faults
- Pin-level
- Heavy-ion radiation
- Power supply disturbances
- Other types of physical upsets
- Simulation based fault injection
- Emulation of faults by software
- SWIFI
- SCIFI (boundary scan)
15Software Implemented Fault Injection
- Basic SWIFI idea
- 1) Interrupt the target application/system in some way (e.g., by inserting a trap instruction or by executing the application in trace mode)
- 2) Execute a specific fault injection routine that emulates faults by inserting errors in different parts of the system (processor registers, memory, ...)
- 3) Resume the execution of the interrupted program
- 4) Collect results on fault manifestations at different levels (system level, application level, FTM, etc.).
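The four SWIFI steps can be illustrated with a toy sketch. It is not how Xception or any real tool works: the "target" is a small Python loop, the "register" is a dictionary entry, and the names `workload` and `make_injector` are hypothetical. It only shows the interrupt, inject, resume, and collect sequence.

```python
def workload(regs, hook):
    """Toy target program: sums 1..10 in regs['acc'].

    `hook` is called before every step; it stands in for the trap or
    trace-mode mechanism a real SWIFI tool would use to gain control.
    """
    regs["acc"] = 0
    for step in range(1, 11):
        hook(step, regs)
        regs["acc"] += step
    return regs["acc"]


def make_injector(trigger_step, reg, bit, log):
    """Build a hook that flips one bit of one 'register' at a given step."""
    def hook(step, regs):
        if step == trigger_step:              # 1) target is interrupted here
            before = regs[reg]
            regs[reg] ^= 1 << bit             # 2) error inserted (bit flip)
            log.append((step, reg, before, regs[reg]))
        # 3) returning from the hook resumes the interrupted program
    return hook


# Fault-free (golden) run, used as the reference result.
golden = workload({}, lambda step, regs: None)

# Faulty run: flip bit 3 of 'acc' just before step 5.
log = []
faulty = workload({}, make_injector(trigger_step=5, reg="acc", bit=3, log=log))

# 4) collect results: compare against the golden run.
outcome = "wrong result" if faulty != golden else "no effect"
print(golden, faulty, outcome, log)
```

A real tool would also record whether any error detection mechanism fired, which this sketch leaves out.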
16Fault definition
- Faults are defined according to two main attributes:
- Fault model: describes what is corrupted and how it is corrupted
- Fault trigger: describes when the fault should be injected.
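One possible way to represent a fault as a (model, trigger) pair is sketched below. The class names, field names, and the three operations are assumptions made for illustration, not the format used by any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultModel:
    """What is corrupted and how (hypothetical representation)."""
    target: str      # e.g. a register name or memory region
    mask: int        # which bits are affected
    operation: str   # "flip", "stuck_at_0" or "stuck_at_1"

    def apply(self, value: int) -> int:
        if self.operation == "flip":
            return value ^ self.mask
        if self.operation == "stuck_at_0":
            return value & ~self.mask
        if self.operation == "stuck_at_1":
            return value | self.mask
        raise ValueError(f"unknown operation: {self.operation}")


@dataclass(frozen=True)
class FaultTrigger:
    """When the fault is injected (hypothetical representation)."""
    kind: str    # e.g. "fetch", "read", "write", "time"
    value: int   # address, or elapsed time for kind == "time"


@dataclass(frozen=True)
class Fault:
    model: FaultModel
    trigger: FaultTrigger


# Flip the least significant bit of r3 when address 0x1000 is fetched.
fault = Fault(FaultModel("r3", 0x1, "flip"), FaultTrigger("fetch", 0x1000))
corrupted = fault.model.apply(0)
print(fault, corrupted)
```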
17SWIFI pros and cons
- Advantages
- Not much affected by the complexity of the target
- Low complexity
- Low cost and development effort
- Reasonably portable
- No physical interferences
- Typical disadvantages
- Do not cover faults in peripheral devices, ASICs, etc.
- Limited monitoring capabilities
- Tools have great impact on the target system behavior
18Why great impact on the target behavior?
Because the FI tool must:
- Detect when faults should be injected
- Collect info on the fault actually injected
- Detect if errors have been discarded or overwritten
- Trace target system behavior
19Impact of workload in the fail silent violations observed
Target: commercial parallel system from Parsytec (no fault tolerance mechanisms)
Fault injection tool: Xception
Kind of faults: HW transient
Critical faults: percentage of faults that caused wrong program results with no error detected in the system.
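The critical-fault measure defined above can be computed from per-fault outcomes as in the following sketch. The outcome encoding (a pair of booleans per injected fault) and the sample data are made up for illustration; they are not results from the experiments.

```python
def critical_fault_percentage(outcomes):
    """Percentage of injected faults that gave a wrong result with no error detected.

    `outcomes` is a list of (result_correct, error_detected) pairs,
    one per injected fault (hypothetical encoding).
    """
    if not outcomes:
        raise ValueError("no faults were injected")
    critical = sum(
        1 for result_correct, error_detected in outcomes
        if not result_correct and not error_detected
    )
    return 100.0 * critical / len(outcomes)


# Illustrative data only: 6 faults with no effect, 3 detected, 1 critical.
sample = [(True, False)] * 6 + [(False, True)] * 3 + [(False, False)] * 1
percentage = critical_fault_percentage(sample)
print(percentage)
```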
20Coverage of a set of error detection techniques for different workloads in the target system
Target: two industrial control systems based on the Z80 and 68000
Fault injection tool: RIFLE
Kind of faults: HW transient
- Error detection methods
- Memory protection
- WDT
- Signature monitoring
- SW assertions
[Chart: coverage for the 68K and Z80 systems, vertical axis from 70.0 to 100.0; workloads: Search (Pesquisa), CRC, Random (Alea), QuickSort, Matrix (Matriz), Sieve, C library (Biblioteca C)]
21Xception Approach
- Use processor built-in debugging and performance monitoring features to inject faults by software with minimal interference (hybrid approach)
- Breakpoint registers: normally used to facilitate the design and implementation of debuggers
- Special counters: normally used to facilitate the design and implementation of performance monitoring tools.
22Breakpoint Registers
- Used to implement non-intrusive fault triggers. Faults can be injected on:
- Read from/write to a specific memory address
- Instruction fetch from a specific address
- Execution of specific instructions (e.g., floating point instructions)
- After elapsing a given time from application start or occurrence of another event
- Combination of the above triggers.
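The idea of combining triggers can be sketched in software as composable predicates over observed events. This is only a model: real breakpoint registers match in hardware, and the `Event` fields and helper names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Event:
    """One observed target event (hypothetical model)."""
    kind: str       # "fetch", "read" or "write"
    address: int
    elapsed: int    # time units since application start


Trigger = Callable[[Event], bool]


def on_access(kind: str, address: int) -> Trigger:
    """Fire on a given kind of access to a specific address."""
    return lambda e: e.kind == kind and e.address == address


def after_time(t: int) -> Trigger:
    """Fire once at least `t` time units have elapsed."""
    return lambda e: e.elapsed >= t


def all_of(*triggers: Trigger) -> Trigger:
    """Combination of triggers: fire only when all of them match."""
    return lambda e: all(t(e) for t in triggers)


# Inject on the first write to 0x2000 that happens after time 100.
trigger = all_of(on_access("write", 0x2000), after_time(100))
events = [
    Event("write", 0x2000, 40),
    Event("fetch", 0x1000, 120),
    Event("write", 0x2000, 150),
]
fired_at = next(i for i, e in enumerate(events) if trigger(e))
print(fired_at)
```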
23Special Counters
- Used to collect data on the target system behavior after the injection of a fault
- Count the number of instructions executed from the injection of the fault until a given moment
- The same for...
- Number of clock cycles
- Number of read clock cycles or write clock cycles
- Number of instructions executed of a given type
- etc.
24Using both the breakpoint registers and the special counters
- It is possible to detect (after the injection of each fault)
- The activation of latent errors in memory
- Corruption (erroneous writes) of specific memory areas
- Execution of specific programs/routines
- Measure the moment in time when some memory cell is written
- Count the number of occurrences of events such as the ones mentioned above.
25What kind of faults can be injected with Xception?
- Transient faults in the memory and processor functional/structural units. For example:
- Integer Unit (IU)
- Floating Point Unit (FPU)
- Internal Data Bus
- Internal Address Bus
- General Purpose Registers (GPRs)
- Condition Code Register
- Memory Management Unit (MMU)
26Xception fault injector tool
[Figure: Xception fault injector architecture. Labels: Network; Target System (Application, Fault Injection Exception Handlers, Kernel); Xception Host (Xception Experiment Control, Fault Config., Log File, Results: Faults, Fault Impact); Fault Parameters; Application Output]
27Xception today
- Is being marketed by Critical Software http://www.criticalsoftware.com/
- Available for Windows NT, Linux, Lynx OS (for Power PC and Pentium processors)
- Has been used (or is going to be used) in several European universities, JPL, CISCO,...
- Examples of recent research works
- Evaluation of an Oracle database running the TPC-C benchmark (DSN 2000)
- Evaluation of a real-time control application (DSN 2001)
28An alternative to Xception approach
- SCIFI: Scan Chain Implemented Fault Injection
- IEEE 1149.1 boundary-scan standard for HW testing
- Available for most complex ICs (processors, DSPs, ASICs,...)
- GOOFI: Chalmers University (Sweden) tool
- Advantage: improved control and monitoring
- Disadvantage: greater intrusiveness and needs some HW support
29What about the future of SWIFI tools?
Four attributes that must be improved
- Precision of fault models
- Intrusiveness
- Usability
- Portability
Software Faults
30Is it possible to emulate software faults (bugs)
with a SWIFI tool?
31Motivation
- Problem: emulation of software faults by fault injection
- Relevance: why are software faults important?
- Software faults are probably the major cause of system outages
- By injecting software faults it is possible to assess the consequences of hidden bugs (experimental risk assessment).
- Traditional approaches to the injection of SW faults
- Optimistic: we have injected so many different faults that at least some of them should emulate software faults...
- Best effort: injected faults that mimic typical software bugs...
32Our current research on software fault emulation
- Two-step approach
- 1 - The source code of the target program/system is available
- 2 - The source code of the target program/system is not available
- Goals
- Evaluation of the accuracy of software fault models used by SWIFI
- Evaluation of SWIFI tools concerning the emulation of SW faults
- New features of SWIFI tools required for SW fault emulation
- Use of software metrics (and tools) to guide the SW fault generation and injection process (instead of using field data)
33What is a software fault?
Software development process (in theory...): Requirements, Specification, Design, Code development, Test, Deployment
- OK
- OK
- The requirements and the specification are correct but the deployed code is not
34Characterization of software faults
- A SW fault is characterized by the change in the code that is necessary to correct it (Orthogonal Defect Classification).
- Defined according to two parameters:
- Fault trigger: conditions that cause the fault to be exposed
- Fault type: type of mistake in the code
35Types of software faults (ODC)
- Assignment: values assigned incorrectly or not assigned
- Checking: missing or incorrect validation of data, or incorrect loop, or incorrect conditional statement
- Timing/serialization: missing or incorrect serialization of shared resources
- Algorithm: incorrect or missing implementation that can be fixed without the need of a design change
- Function: incorrect or missing implementation that requires a design change to be corrected
36Software fault definition and fault emulation
ACTUAL FAULT (source code level): Fault Type, Fault Trigger
EMULATED FAULT (SWIFI level): What, Where, Which, When
37Injecting hardware faults
- Errors at machine code level (processor programming model) map nearly directly to hardware faults (bit flips, stuck-at)
- Faults in processor units
- Faults in memory
- Faults in other addressable devices (I/O cards, etc.).
Machine code level
Xception injects errors at this level (low
intrusion level)
Hardware faults
Hardware
38Injecting software faults
- Software faults are defined at source code level. The questions are:
- Is it possible to map all classes of SW faults into errors injected at the Xception/SWIFI level?
- Is this mapping accurate?
Source code level
Software faults
Machine code level
Hardware faults
Hardware
39Injecting software faults
Find real software faults and try to emulate them
accurately using the Xception
40Sources of real software faults
- Programs resulting from the ACM International Collegiate Programming Contest (IOI98)
- Rationale
- Programs written according to a formal, clear, and correct specification (the problem proposed to the contest participants)
- Programs written by very skilled programmers
- Access to many implementations from the same specification (237 teams at IOI 98)
- There is a test case associated with each problem specification that defines the acceptance criteria for correct programs for the contest.
41Programs used and Xception version
- Target programs: different implementations of
- Camelot: computes the minimum number of moves required to gather all the pieces of a chessboard in the same square
- JamesB: encodes strings according to a specific algorithm.
- Xception for PowerPC 601 processor.
42Software bug detection
- Programs considered correct have been selected
- These programs were intensively tested using a very thorough test case: programs that fail this intensive test case have software faults
- Programs with faults have been analyzed to identify and classify the fault.
SW faults found:
Assignment: 2 faults (JamesB.team6, Camelot.team4)
Checking: 1 fault (Camelot.team1)
Algorithm: 4 faults (JamesB.team7, Camelot.team2, Camelot.team3, Camelot.team5)
43Example of assignment fault (camelot.team4)
Machine Code (PowerPC)
L..26:
.line 12
addi r3,r0,0
stw r3,24(sp)
lwz r4,T.n(toc)
lwz r4,0(r4)
cmp 0x7,0x0,r3,r4
bc 0x4,0x1c,L..28
L..27:
Should be: addi r3,r0,1
Excerpt of source code
. . .
for (i = 0; i < n; i++) visited[x[i]][y[i]] = TRUE;
. . .
Should be: for (i = 1; i < n; i++)
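At machine code level this assignment fault is a single-bit difference in the immediate field of the addi instruction. The sketch below encodes both instructions using the standard PowerPC D-form layout (opcode 14, rD, rA, 16-bit immediate) and shows that flipping one bit of the correct instruction word reproduces the faulty one. The helper name `encode_addi` is hypothetical.

```python
def encode_addi(rd: int, ra: int, simm: int) -> int:
    """Encode PowerPC `addi rD,rA,SIMM` (D-form, primary opcode 14)."""
    return (14 << 26) | (rd << 21) | (ra << 16) | (simm & 0xFFFF)


faulty = encode_addi(3, 0, 0)    # addi r3,r0,0  (code as written)
correct = encode_addi(3, 0, 1)   # addi r3,r0,1  (what it should be)

# The two instruction words differ in exactly one bit.
difference = faulty ^ correct
bits_changed = bin(difference).count("1")

# Emulating the software fault: flip that bit in the correct instruction.
emulated = correct ^ difference

print(hex(faulty), hex(correct), hex(difference), bits_changed)
```

This is why a bit-flip fault model applied to the fetched instruction is enough to emulate this particular bug.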
44Example of checking fault (camelot.team1)
Machine Code (PowerPC)
Excerpt of source code
L..94:
.line 23
. . .
cmp 0x6,0x0,r3,r4
bc 0x4,0x18,L..96
lwz r4,112(sp)
. . .
L..96:
. . . if ((depth lt timexy) (
timexy -1)) / Body of if statement
/ . . .
Translated into 19 machine code lines
Should be if ((depth lt timexy) (timexy
-1))
Should be bc 0x4,0x19,L..96
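The difference between the two bc instructions is one bit in the BI field, which selects the condition register bit to test. The sketch below decodes BI according to the PowerPC convention (BI div 4 is the CR field, BI mod 4 selects lt, gt, eq, or so) and computes the bit-flip mask. The branch displacement is a made-up value, since the slide does not give the actual offset of L..96.

```python
CR_BITS = ("lt", "gt", "eq", "so")


def encode_bc(bo: int, bi: int, displacement: int) -> int:
    """Encode PowerPC `bc BO,BI,target` (B-form, opcode 16, AA=0, LK=0)."""
    return (16 << 26) | (bo << 21) | (bi << 16) | (displacement & 0xFFFC)


def decode_bi(bi: int):
    """Return (CR field number, tested bit name) for a BI value."""
    return bi // 4, CR_BITS[bi % 4]


DISPLACEMENT = 0x40  # hypothetical offset to L..96, for illustration only

faulty = encode_bc(0x4, 0x18, DISPLACEMENT)    # bc 0x4,0x18,L..96
correct = encode_bc(0x4, 0x19, DISPLACEMENT)   # bc 0x4,0x19,L..96

flip_mask = faulty ^ correct                   # single bit inside the BI field

print(decode_bi(0x18), decode_bi(0x19), hex(flip_mask))
```

Both branches test CR field 6, the field written by the preceding cmp 0x6; only the tested bit changes.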
45Example of algorithm fault (camelot.team5)
Machine Code (PowerPC)
. . .
add r3,r3,r4
.line 5
.ef 69
addi sp,sp,48
bclr 0x14,0x0
. . .
Should be:
bl .max
nop
.line 5
.ef 69
lwz r0,88(sp)
addi sp,sp,80
mtspr lr,r0
bclr 0x14,0x0
Excerpt of source code (translated into 26 machine code lines)
. . .
int dist(int x1, int y1, int x2, int y2) {
int dx = x1 - x2;
int dy = y1 - x2;
return ((dx > 0) ? dx : -dx) + ((dy > 0) ? dy : -dy);
}
. . .
Should be: return max(((dx > 0) ? dx : -dx), ((dy > 0) ? dy : -dy));
Xception cannot directly emulate this fault. However, some algorithm faults can be emulated indirectly by equivalent assignment and/or checking faults. For example, this fault assigns an incorrect value to the return value of function dist.
46Example of assignment fault (JamesB.team6)
Source Code
. . .
void main() {
int i, i2, size, no_iter, code;
char phrase[80];
char phrase2[80];
/* Rest of main */
. . .
Should be: char phrase[81]; char phrase2[81];
Machine Code (PowerPC)
. . .
addi r5,r5,-1
stw r5,240(sp)
addi r5,r31,-1
stw r5,236(sp)
. . .
Should be:
addi r5,r5,-1
stw r5,248(sp)
addi r5,r31,-1
stw r5,244(sp)
- Fault emulation: this fault can be emulated by shifting all the stack references in main that correspond to the affected variables phrase and phrase2 (and other variables defined in main after these ones in the code).
- Limitations
- The breakpoint registers needed to inject these faults with low intrusiveness are very limited in number.
- Needs extra software tools to assist the definition of this kind of fault
47What can be emulated by Xception (or other SWIFI tools)
- Assignment and checking faults can be accurately emulated by Xception (assignment faults emulated by stack shifts have limitations)
- Algorithm and function faults cannot be emulated. Field data suggests that this kind of fault accounts for nearly 44% of SW faults (Christmansson, Chillarege)
- Tools to assist fault definition are required.
48Emulation of software faults of a given class
- 1) Locate in the source code all the lines in which a given type of fault can be injected (e.g., assignment faults)
- 2) Use compiler, assembler, linker, and loader facilities (tables of symbols, labels, variables, etc.) to map high level target instructions into machine level instructions (this task needs a tool)
- 3) From 1) and 2) results a list of possible locations (machine instructions) to inject faults of the desired class
- 4) Define faults using typical SWIFI fault triggers and fault types to corrupt (insert errors in) the right machine instructions according to the desired type of software fault (e.g., for assignment faults: value + 1, value - 1, random value, unassigned (NOP))
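The assignment fault types listed in the last step can be sketched as simple operators on the value being assigned. This reads the listed types as value plus one, value minus one, a random value, and a suppressed assignment; the function name, the 32-bit width, and the use of `None` for the NOP case are assumptions for illustration.

```python
import random


def assignment_mutations(value: int, rng: random.Random, bits: int = 32) -> dict:
    """Return the corrupted values for each assumed assignment fault type.

    Arithmetic wraps around at `bits` bits, as it would in a register.
    `None` means the assignment is suppressed (instruction replaced by a NOP),
    so the destination keeps whatever it held before.
    """
    mask = (1 << bits) - 1
    return {
        "value + 1": (value + 1) & mask,
        "value - 1": (value - 1) & mask,
        "random value": rng.getrandbits(bits),
        "unassigned (NOP)": None,
    }


rng = random.Random(0)  # fixed seed so experiments are repeatable
mutations = assignment_mutations(0, rng)
for name, corrupted in mutations.items():
    print(name, corrupted)
```

One such set would be generated for each candidate machine instruction found in step 3.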
49Failure modes for assignment faults
Acceleration of the process or injection of naive SW faults?
50Failure modes for checking faults
51Failure modes for type of assignment faults
52Failure modes for type of checking faults
The high percentage of incorrect results can be
explained because many faults just represent
naive software errors
53Tune the fault types and fault triggers
- Fault types
- Eliminate/reduce fault types that represent naive SW faults
- Fault triggers
- Use software metrics to choose the modules to inject faults and define trigger locations accordingly
- Metrics of software complexity based on
- Static features of the code
- Dynamic features
- Possible information on the development process (type of tests, etc.)
- ...
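One simple way to let a software metric guide where faults are injected is to sample modules with probability proportional to their complexity score. The sketch below assumes a single numeric score per module; the module names, the scores, and the function `pick_modules` are hypothetical, and a real approach might combine several static and dynamic metrics.

```python
import random


def pick_modules(complexity: dict, n: int, rng: random.Random) -> list:
    """Choose `n` injection targets, weighted by a complexity score per module."""
    names = list(complexity)
    weights = [complexity[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


# Hypothetical scores: more complex modules receive more injected faults.
scores = {"parser": 30, "io": 5, "util": 0}
targets = pick_modules(scores, 100, random.Random(1))
print({name: targets.count(name) for name in scores})
```

Modules with a score of zero are never selected, so trivial code does not consume injection runs.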
54Conclusions
- SWIFI tools are the tools of choice today
- FI is mature for hardware transient faults
- Emulation of more complex classes of faults is required
- SW faults can be emulated in part (assignment and checking, at least)
- Current tools (Xception included) are not prepared to assist the user during fault definition
- Further research is required to tune the process of emulating complex classes of faults
- Interesting fault types
- Software metrics to guide the fault trigger definition
- Emulation of SW faults when the source code of the target is not available
- Interaction faults (among modules) and operator faults.