Title: Frances Perry
1Reasoning about Software in the Presence of
Transient Faults
- Frances Perry
- Princeton University
2Transient Faults
- Occur when an energetic particle strikes a transistor or wire, causing a change in state
- Do not permanently damage hardware
- May corrupt computation by altering stored values and signals
3Issues: Crashes and Failures
- In 2000, Sun Microsystems acknowledged that transient faults interfered with cache memories and caused crashes in server systems at major customer sites, including AOL, eBay, and dozens of others. [Baumann 02]
- Cypress Semiconductor acknowledged, "the wake-up call came in the end of 2001 with a major customer reporting havoc at a large telephone company. Technically it was found that a single soft fail was causing an interleaved system farm to crash." [Ziegler & Puchner 04]
- At Los Alamos in 2003, the ASC Q supercomputer crashed regularly due to soft errors. [Michalak et al. 05]
4Issues: Vulnerabilities in Virtual Machines [Govindavajhala & Appel 03]
- The Java Virtual Machine relies on type safety to separate untrusted programs from the VM
- An attacker can craft a program that exploits transient faults to take over the VM
- Waits until a fault results in a pointer with a runtime type that doesn't match its static type
- Uses the mismatch to execute arbitrary code
- Successful demo at conference
- Can speed up the transient fault rate using a heat source
- 70% probability of taking complete control of the VM within 1 minute
5Issues: Breaking Cryptographic Protocols
- Certain implementations of RSA are vulnerable to a single fault [Boneh, DeMillo & Lipton 97]
- RSA relies on the inability to factor N into prime numbers p and q
- Attacker obtains two signatures (one correct, one faulty) of the same message
- GCD(correct signature - faulty signature, N) = p
- Many other examples [Biham & Shamir 97, Blomer & Seifert 03, Dusart et al. 03, Piret & Quisquater 03, ...]
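The GCD step can be tried on toy numbers. The following is a minimal sketch, assuming the vulnerable implementation signs with the Chinese Remainder Theorem and that the fault corrupts only the mod-q half; the function name `crt_sign`, the `flip_q_half` flag, and the tiny key are illustrative assumptions, not taken from the cited papers. It needs Python 3.8 or later for `pow(q, -1, p)`.

```python
from math import gcd


def crt_sign(m, p, q, d, flip_q_half=False):
    """Sign m with RSA via the CRT; optionally corrupt the mod-q half."""
    sp = pow(m, d % (p - 1), p)  # signature modulo p
    sq = pow(m, d % (q - 1), q)  # signature modulo q
    if flip_q_half:
        sq ^= 1  # hypothetical transient fault: one flipped bit
    # Garner recombination: the result is sp modulo p and sq modulo q.
    h = (pow(q, -1, p) * (sp - sq)) % p
    return sq + q * h


# Toy key, far too small for real use (assumption for illustration only).
p, q, e, d = 61, 53, 17, 2753
N = p * q
m = 65

good = crt_sign(m, p, q, d)
bad = crt_sign(m, p, q, d, flip_q_half=True)

# Both signatures agree modulo p but differ modulo q,
# so their difference shares exactly the factor p with N.
recovered = gcd((good - bad) % N, N)
print(recovered)
```

Which prime the attacker learns depends on which half the fault hits; corrupting the mod-p half instead would reveal q.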
6Transient Fault Trends
- In 2004, a typical laptop with 1GB RAM had 1 soft fail per year. [Ziegler 04]
- Faster clock rates, increasing transistor density, decreasing voltages, and smaller feature size all contribute to increasing fault rates of approximately 8% per generation. [Borkar 05]
7Dealing with Transient Faults
- Many existing solutions
- Provide protection by adding redundancy
- Hardware, software, hybrid hardware-software, single core, multi-core
- Tradeoffs: performance, cost, power, and reliability
8An Example Solution: SWIFT
- SWIFT [Reis et al. 05] is a software-based solution
- Compiler duplicates the original computation and inserts comparisons to ensure that the two versions agree
- Evaluation: randomly inject faults and look at resulting performance and detection rate
- The detection rate wasn't as good as they expected
- Compiler was adding the redundant computation and then performing optimizations that remove redundancy!
- Solution: permanently turn off certain optimizations
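The interaction between duplication and optimization can be illustrated with a toy. This is only a sketch of the idea, not SWIFT itself: the function names and the `zap` hook (which models a fault striking the first result) are hypothetical, and common subexpression elimination is assumed as the offending optimization.

```python
class FaultDetected(Exception):
    """Raised when the two redundant copies disagree."""


def hardened_add(x, y, zap=lambda v: v):
    """Duplicate the addition and compare the copies before using the result."""
    original = zap(x + y)  # first copy; zap models a fault in its register
    shadow = x + y         # independent redundant copy
    if original != shadow:
        raise FaultDetected("copies disagree")
    return original


def hardened_add_after_cse(x, y, zap=lambda v: v):
    """Same code after an optimizer reuses the first result for the second."""
    original = zap(x + y)
    shadow = original      # redundancy removed: both names hold one value
    if original != shadow:  # can never fail
        raise FaultDetected("copies disagree")
    return original


def flip_low_bit(v):
    """Hypothetical single-bit fault."""
    return v ^ 1
```

With `flip_low_bit` as the fault, `hardened_add(2, 3, flip_low_bit)` raises `FaultDetected`, while `hardened_add_after_cse(2, 3, flip_low_bit)` silently returns the corrupted value.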
9Transient Fault Solutions
- Many Existing Solutions
- [Borin et al. 06, Chang et al. 06, Gomaa et al. 03, Guerraoui & Schiper 97, Horst et al. 90, Kalbarczyk et al. 99, Oh et al. 02, Ohlsson & Rimen 95, Rebaudengo 01, Reinhardt & Mukherjee 00, Reis et al. 05, Reis et al. 06, Reis et al. 07, Shirvani et al. 00, Slegel et al. 99, Tremblay & Tamir 89, Venkatasubramanian 03, Vijaykumar et al. 02, Yan & Zhang 05, Yeh 96, Yeh 98, ...]
- Do these solutions actually work?
10Formalizing Fault Tolerance Solutions
- Goal: Formally reason about the behavior of fault tolerance solutions
- What is the right level of abstraction?
- Faults affect hardware
- Need to deal with primitive instructions, memory, registers, additional hardware structures
- Software redundancy is added during compilation
- Need to understand interactions with optimizations, register allocation, etc.
- Want to reason about the implementation as well as the algorithm
11Proof Carrying Code [Necula & Lee 96]
- Code producer supplies a safety proof along with the code binary
- User verifies proof before executing code
- How do we represent safety proofs?
[Figure: the producer turns source code into native code plus a safety proof (compilation and certification); the consumer performs proof validation.]
12Typed Assembly Language [Morrisett et al. 98]
- Assembly-level type system encapsulates desired invariants
- Type checking the generated code guarantees properties
- The compiler does not have to be trusted.
[Figure: Source Code -> Type-Preserving Compiler -> Typed Assembly Language -> Type Checker]
13Using Typed Assembly Languages
- Develop a machine model to reason about machine execution
- Design the type system
- Prove the type system is sound with respect to the machine model
- Show that the typed assembly language is expressive
- My research: Using typed assembly languages in the presence of transient faults
[Figure: Source Code -> Type-Preserving Compiler -> Typed Assembly Language -> Type Checker]
14Roadmap
- Transient Faults: Issues and Existing Solutions
- Formalizing Fault-tolerance Solutions with Typed Assembly Languages
- TALFT: Formalizing a hybrid transient fault solution
- Formalizing the machine model
- Designing the type system
- Proving soundness
- Discussion: Is this realistic?
- Related and Future Work in Fault Tolerance
- Other Recent Projects
- Conclusions and Goals
15TALFT: Fault-tolerant Typed Assembly Language
PLDI 07. Best Paper Award.
[Figure: compilation pipeline. CRAFT Compiler [Reis et al. 05] stages: Source Code; Syntactic & Semantic Analysis; Compilation to Low-level Code; Addition of Software Redundancy; Optimizations; Code Generation (with slightly modified hardware). Type-checking path: Type-Preserving Compiler; TALFT Typed Assembly Language; Type Checker.]
16Machine Instruction Set
- Colors c ::= G | B
- Values v ::= c n
- Instructions i ::= mov rd, v
- | op rd, rs, rt | op rd, rs, v
- | ld_c rd, rs | st_c rd, rs
- | brz_c rz, rd | jmp_c rd
17Machine State
[Figure: machine state. The code memory C holds, for example: ... 0x0393 mov r1 4; 0x0394 mov r3 4; 0x0395 stG r2 r1; 0x0396 stB r4 r3; ...]
18Example Store Instruction
- Goal
- store value 5 into address 256
- mov r1, G 5
- mov r2, G 256
- stG r2, r1
- mov r3, B 5
- mov r4, B 256
- stB r4, r3
[Figure: code memory at 0x0393 to 0x0398 holds the six instructions above. Register animation: r1 = 5, r2 = 256, r3 = 5, r4 = 256. Memory after the blue store: 256 -> 5.]
19Example Data Fault
- Goal
- store value 5 into address 256
- mov r1, G 5
- mov r2, G 256
- stG r2, r1
- mov r3, B 5
- mov r4, B 256
- stB r4, r3
[Figure: same code memory as in the previous example. Register animation: r1 = 5, then r1 = 21 after the fault; r2 = 256, r3 = 5, r4 = 256. Outcome: fault detected.]
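The two store examples can be replayed in a small simulation. This is a minimal sketch under stated assumptions: green stores are buffered in a FIFO queue, a blue store commits to memory only if it matches the oldest buffered green store, and color tags on values are omitted for brevity. The class and function names are hypothetical.

```python
class FaultDetected(Exception):
    """Raised when a blue store disagrees with the buffered green store."""


class Machine:
    def __init__(self):
        self.regs = {}    # register file
        self.memory = {}  # committed memory
        self.queue = []   # pending green stores as (address, value) pairs

    def mov(self, rd, n):
        self.regs[rd] = n

    def st_green(self, r_addr, r_val):
        # The green store is only buffered; memory is not updated yet.
        self.queue.append((self.regs[r_addr], self.regs[r_val]))

    def st_blue(self, r_addr, r_val):
        blue = (self.regs[r_addr], self.regs[r_val])
        if not self.queue or self.queue.pop(0) != blue:
            raise FaultDetected("green and blue stores disagree")
        self.memory[blue[0]] = blue[1]  # both copies agree: commit


def store_5_at_256(corrupt_r1_to=None):
    """Run the six-instruction example, optionally zapping r1 after its mov."""
    m = Machine()
    m.mov("r1", 5)
    if corrupt_r1_to is not None:
        m.regs["r1"] = corrupt_r1_to  # injected data fault
    m.mov("r2", 256)
    m.st_green("r2", "r1")
    m.mov("r3", 5)
    m.mov("r4", 256)
    m.st_blue("r4", "r3")
    return m.memory
```

Calling `store_5_at_256()` commits 5 to address 256, while `store_5_at_256(corrupt_r1_to=21)` raises `FaultDetected` before anything reaches memory.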
20Example Interleaved Loads and Stores
- /* r1 = r2 = value 5; r3 = r4 = address 256 */
- stG r3, r1
- ldG r5, r3
- add r5, r5, 1
- stG r3, r5
- stB r4, r2
- ldB r6, r4
- add r6, r6, 1
- stB r4, r6
[Figure: memory animation: 256 -> 5, then 256 -> 6. Register animation: r5 = 5, then r5 = 6; r6 = 5, then r6 = 6.]
21Example Flexible Instruction Scheduling
- /* r1 = r2 = value 5; r3 = r4 = address 256 */
- stG r3, r1
- stB r4, r2
- ldG r5, r3
- add r5, r5, 1
- ldB r6, r4
- stG r3, r5
- add r6, r6, 1
- stB r4, r6
[Figure: memory animation: 256 -> 5, then 256 -> 6. Register animation: r5 = 5, then r5 = 6; r6 = 5, then r6 = 6.]
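Both schedules can be checked with a tiny interpreter. This sketch rests on assumptions: a green load reads the newest pending store to that address from the queue, falling back to memory, whereas a blue load reads committed memory only. Color tags are again dropped, `add` is encoded as a three-field tuple meaning "add a constant to a register", and the names `run`, `INTERLEAVED`, and `SCHEDULED` are hypothetical.

```python
def run(program):
    """Execute a list of (opcode, operand, operand) tuples on a toy machine."""
    regs = {"r1": 5, "r2": 5, "r3": 256, "r4": 256}
    memory, queue = {}, []
    for op, a, b in program:
        if op == "stG":    # stG r_addr, r_val: buffer the green store
            queue.append((regs[a], regs[b]))
        elif op == "stB":  # stB r_addr, r_val: compare, then commit
            if not queue or queue.pop(0) != (regs[a], regs[b]):
                raise RuntimeError("fault detected")
            memory[regs[a]] = regs[b]
        elif op == "ldG":  # ldG r_dest, r_addr: pending stores take priority
            pending = [v for (addr, v) in queue if addr == regs[b]]
            regs[a] = pending[-1] if pending else memory[regs[b]]
        elif op == "ldB":  # ldB r_dest, r_addr: committed memory only
            regs[a] = memory[regs[b]]
        elif op == "add":  # add r, r, constant
            regs[a] += b
    return memory


INTERLEAVED = [
    ("stG", "r3", "r1"), ("ldG", "r5", "r3"), ("add", "r5", 1), ("stG", "r3", "r5"),
    ("stB", "r4", "r2"), ("ldB", "r6", "r4"), ("add", "r6", 1), ("stB", "r4", "r6"),
]

SCHEDULED = [
    ("stG", "r3", "r1"), ("stB", "r4", "r2"), ("ldG", "r5", "r3"), ("add", "r5", 1),
    ("ldB", "r6", "r4"), ("stG", "r3", "r5"), ("add", "r6", 1), ("stB", "r4", "r6"),
]
```

Under these assumptions both orderings leave the value 6 at address 256, which is the flexibility the two slides illustrate.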
22Example Control Flow Fault
- mov r1, G 5
- mov r2, G 256
- stG r2, r1
- mov r3, B 5
- mov r4, B 256
- stB r4, r3
[Figure: same code memory as in the store example. Program counter animation: pcG = 0x0393, 0x0394, 0x0395, then 0xbeef after the fault; pcB = 0x0393, 0x0394, 0x0395. Outcome: fault detected.]
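The control-flow check can be sketched as a guarded fetch. The assumption here, suggested by the figure but not spelled out on the slide, is that the hardware keeps a green and a blue program counter and refuses to fetch when they differ; `fetch` and `CODE` are hypothetical names.

```python
class FaultDetected(Exception):
    """Raised when the green and blue program counters disagree."""


# Code memory from the store example, keyed by address.
CODE = {
    0x0393: "mov r1, G 5",
    0x0394: "mov r2, G 256",
    0x0395: "stG r2, r1",
    0x0396: "mov r3, B 5",
    0x0397: "mov r4, B 256",
    0x0398: "stB r4, r3",
}


def fetch(code, pc_green, pc_blue):
    """Return the next instruction only if both program counters agree."""
    if pc_green != pc_blue:
        raise FaultDetected(f"pcG={pc_green:#06x} but pcB={pc_blue:#06x}")
    return code[pc_green]
```

A corrupted green counter such as 0xbeef is caught because the blue counter still holds the expected address.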
23Formalizing Program Execution
- $\Sigma$ represents the state of the machine at some point during program execution
- Define an operational semantics $\Sigma \longrightarrow_0 \Sigma'$ to express the result of executing a single instruction
- There are no rules for the undefined cases
\[
\frac{R(r_1) = c\ n \qquad R(r_2) = c\ m}
     {(R, C, M, Q, \mathtt{add}\ r_d\ r_1\ r_2) \longrightarrow_0 (R[r_d \mapsto c\ (n + m)], C, M, Q, ?)}
\ (\textit{add})
\]
24Formalizing the Fault Model
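- Add operational rules $\Sigma \longrightarrow_1 \Sigma'$ that nondeterministically introduce faults to the register file and queue
\[
\frac{R(r) = c\ n}
     {(R, C, M, Q, i) \longrightarrow_1 (R[r \mapsto c\ z], C, M, Q, i)}
\ (\textit{zap-reg})
\]
\[
\frac{Q = (a_1, v_1), \ldots, (a_i, v_i), \ldots, (a_n, v_n) \qquad Q' = (a_1, v_1), \ldots, (z, v_i), \ldots, (a_n, v_n)}
     {(R, C, M, Q, i) \longrightarrow_1 (R, C, M, Q', i)}
\ (\textit{zap-queue-addr})
\]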
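The two fault rules translate almost directly into functions on the machine state. This is a sketch under the assumption that a register holds a (color, number) pair and the queue is a list of (address, value) pairs; the function names mirror the rule names but are otherwise hypothetical, and the nondeterministic choice of target and of z is left to the caller.

```python
def zap_reg(R, r, z):
    """zap-reg: replace the number in register r with z, keeping its color."""
    color, _ = R[r]
    corrupted = dict(R)  # copy, so the fault-free state stays available
    corrupted[r] = (color, z)
    return corrupted


def zap_queue_addr(Q, i, z):
    """zap-queue-addr: replace the address of the i-th queue entry with z."""
    _, value = Q[i]
    return Q[:i] + [(z, value)] + Q[i + 1:]


R = {"r1": ("G", 5), "r2": ("G", 256)}
Q = [(256, 5), (260, 7)]

R_faulty = zap_reg(R, "r1", 21)
Q_faulty = zap_queue_addr(Q, 1, 0)
```

Neither function touches the code memory, the main memory, or the current instruction, matching the unchanged components in the rules.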
26Principles Behind the TALFT Type System
- In the absence of faults, standard type theory applies.
- Green and blue computations are independent.
- Green and blue computations are redundant.
- Observable actions depend on both computations.
27Typing Values
- The type of a value is a triple <c, b, E>
- c - a color (either green or blue)
- b - a basic type (int, b reference, code pointer)
- E - a static expression describing arithmetic and memory
G 3 : <G, int, 3>
28Instruction Typing Example Add
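Execution Behavior
\[
\frac{R(r_1) = c\ n \qquad R(r_2) = c\ m}
     {(R, C, M, Q, \mathtt{add}\ r_d\ r_1\ r_2) \longrightarrow_0 (R[r_d \mapsto c\ (n + m)], C, M, Q, ?)}
\]
Static Requirements
\[
\frac{\Gamma(r_1) = \langle c, \mathit{int}, E_1 \rangle \qquad \Gamma(r_2) = \langle c, \mathit{int}, E_2 \rangle}
     {\Gamma \vdash \mathtt{add}\ r_d\ r_1\ r_2 \Rightarrow \Gamma'}
\qquad \text{where } \Gamma' = \Gamma[r_d \mapsto \langle c, \mathit{int}, E_1 + E_2 \rangle]
\]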
29Using Expressions to Enforce Redundancy
r1 : <G, int, x>   r3 : <B, int, x>   r2 : <G, int, y>   r4 : <B, int, y>
r3 : <B, int, x + y>
r1 : <G, int, x + y>
Q : (E8, x + y)
3. Redundant computations
x + y = x + y
30Using Expressions to Enforce Redundancy
Error during compilation
r3 : <B, int, x + x>
r1 : <G, int, x + y>
Q : (E8, x + y)
x + x ≠ x + y
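The two cases can be replayed with a toy checker. This is a simplified sketch, not the TALFT type system: types are (color, basic type, expression) tuples, expressions are compared structurally rather than by arithmetic reasoning, and the names `type_add`, `check_store_pair`, and `TypeCheckError` are hypothetical.

```python
class TypeCheckError(Exception):
    """Raised when the redundancy invariant cannot be established."""


def type_add(gamma, rd, r1, r2):
    """Type 'add rd r1 r2': same color, both ints, result tracks E1 + E2."""
    c1, b1, e1 = gamma[r1]
    c2, b2, e2 = gamma[r2]
    if c1 != c2 or b1 != "int" or b2 != "int":
        raise TypeCheckError("add needs two ints of the same color")
    out = dict(gamma)
    out[rd] = (c1, "int", ("+", e1, e2))
    return out


def check_store_pair(gamma, green_reg, blue_reg):
    """A green/blue store pair must carry equal static expressions."""
    cg, _, eg = gamma[green_reg]
    cb, _, eb = gamma[blue_reg]
    if (cg, cb) != ("G", "B"):
        raise TypeCheckError("expected one green and one blue value")
    if eg != eb:
        raise TypeCheckError(f"{eg} does not equal {eb}")


gamma = {
    "r1": ("G", "int", "x"), "r2": ("G", "int", "y"),
    "r3": ("B", "int", "x"), "r4": ("B", "int", "y"),
}

# Correct compilation: both colors compute x + y.
good = type_add(type_add(gamma, "r1", "r1", "r2"), "r3", "r3", "r4")

# Compiler bug: the blue copy computes x + x instead.
bad = type_add(type_add(gamma, "r1", "r1", "r2"), "r3", "r3", "r3")
```

`check_store_pair(good, "r1", "r3")` passes, while `check_store_pair(bad, "r1", "r3")` raises `TypeCheckError`, mirroring the failed type check on the slide.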
Type Checking Fails
31Roadmap
- Transient Faults Issues and Existing Solutions
- Formalizing Fault-tolerance Solutions with Typed
Assembly Languages - TALFT Formalizing a hybrid transient fault
solution - Formalizing the machine model
- Designing the type system
- Proving soundness
- Discussion Is this realistic?
- Related and Future Work in Fault Tolerance
- Other Recent Projects
- Conclusions Goals
32Type Safety in the Presence of Faults
- Standard Type Safety: Well-typed programs continue to be well-typed during execution
- After a fault, some values may not be well-typed
r1 : <G, int, E1>   r3 : <G, int ref, E3>
stG r3, r1
33Abstracting Corruption with Zap Tags
- Transient faults occur during execution: can't statically track which values may be corrupted
- Abstract the possible scenarios using three zap tags
- Z ::= · | G | B
- When Z is a color, values of that color may have any type
[Figure: zap tag G; value G 3; candidate types <G, int ref, E3>, <G, int, 3>, <G, code ptr, E2 + E7>.]
34Type Safety Progress
- When no faults have occurred, well-typed machine states can execute the next instruction:
  If $\vdash \Sigma$ then $\Sigma \longrightarrow_0 \Sigma'$
- After a fault has occurred, well-typed states either execute the next instruction or detect the error:
  If $\vdash^{c} \Sigma$ then either $\Sigma \longrightarrow_0 \Sigma'$ or $\Sigma \longrightarrow_0 \mathit{fault}$
35Type Safety Preservation
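- Normal execution preserves typing:
  If $\vdash^{Z} \Sigma$ and $\Sigma \longrightarrow_0 \Sigma'$ then $\vdash^{Z} \Sigma'$
- Faulty execution preserves typing modulo the corrupted color:
  If $\vdash \Sigma$ and $\Sigma \longrightarrow_1 \Sigma'$ then there exists $c$ such that $\vdash^{c} \Sigma'$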
36Program Execution
[Figure: execution trace $\Sigma_1 \longrightarrow_0 \Sigma_2 \longrightarrow_0 \Sigma_3 \longrightarrow_0 \cdots$]
37Indistinguishable Machine States
38Fault Detection Theorem
- If a machine state is well-typed and a single
fault occurs somewhere during execution, then
there is no change in observable behavior until
the fault is detected.
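[Figure: a fault-free trace through states $\Sigma_1$, $\Sigma_2$, $\Sigma_3$, $\Sigma_5$, $\Sigma_6$ drawn alongside a faulty trace through $\Sigma_2^f$, $\Sigma_3^f$, $\Sigma_5^f$ that ends in fault.]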
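The theorem can be spot-checked on the store example by brute force. This is an illustrative experiment, not a proof: it assumes a simplified machine without color tags, considers only single register faults injected just before an instruction, and treats committed memory as the observable behavior. The names `PROGRAM`, `run`, `reference`, and `outcomes` are hypothetical.

```python
PROGRAM = [
    ("mov", "r1", 5), ("mov", "r2", 256), ("stG", "r2", "r1"),
    ("mov", "r3", 5), ("mov", "r4", 256), ("stB", "r4", "r3"),
]


def run(fault=None):
    """Run PROGRAM; fault = (step, register, value) zaps a register before that step."""
    regs, memory, queue = {}, {}, []
    for step, (op, a, b) in enumerate(PROGRAM):
        if fault is not None and fault[0] == step:
            regs[fault[1]] = fault[2]  # single injected register fault
        if op == "mov":
            regs[a] = b
        elif op == "stG":
            queue.append((regs[a], regs[b]))
        elif op == "stB":
            if not queue or queue.pop(0) != (regs[a], regs[b]):
                return "fault", memory  # detected before committing
            memory[regs[a]] = regs[b]
    return "done", memory


reference = run()

# Every single fault: each step, each register, a few corrupt values.
outcomes = [
    run((step, reg, z))
    for step in range(len(PROGRAM))
    for reg in ("r1", "r2", "r3", "r4")
    for z in (0, 5, 21, 256)
]

# Each run either matches the fault-free result or stops with nothing committed.
all_safe = all(o == reference or o == ("fault", {}) for o in outcomes)
print(all_safe)
```

Faults that land in a register before it is overwritten, or after its last use, leave the run identical to the reference; all others are caught at the blue store.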
39Formal Results for TALFT
- Well-typed TALFT programs
- Don't go wrong (Type Safety)
- Only detect a fault when a fault has occurred (No False Positives Lemma)
- Never allow a single fault to change the observable program behavior (Fault Detection Theorem)
40Work In Progress: Compiling to TALFT
- It's easy to design a sound type system: just make it very restrictive
- Claim: Can generate code for TALFT
- Work in progress
- Naively translate a source-level language into TALFT
- Show how to support register allocation and other optimizations.
42Model vs. Reality: Faults
- TALFT models a single fault to the register file or store buffer
- Faults may occur anywhere: ALU, control signals, sequential or combinational logic, instruction decoding, ...
- Multiple faults may occur
- Benefit: Precise specification of faults under consideration
- Allow arbitrary corruption
- Many intra-instruction faults can be modeled by correct instruction execution followed by a fault to the destination register
- Many others are likely to be caught by an eventual mismatch in the two computations
43Model vs. Reality: Program Inputs
- All inputs need to be duplicated
- Loads in concurrent applications
- Can use a load queue similar to the store queue
- Requires ldG to simultaneously put the correct value in the destination register and the correct pair into the load queue
- Ad hoc inputs: random(), user inputs, etc.
- Would need to be cached for blue computation.
- Creates a window of vulnerability without hardware support
44Model vs. Reality: Performance Cost
- Simulated execution of TALFT code to get a rough estimate of performance cost
- TALFT code is 34% slower than the fault-intolerant baseline
45TALFT Summary
- TALFT is an assembly-level type system that captures invariants about redundant computations in a hybrid transient fault solution
- Well-typed programs will always detect observable faults (relative to the fault model)
- The results are applicable to real-world scenarios
47Formalizing Fault Tolerance: Related Work
- Reasoning about Control Flow in the Presence of Transient Faults (to appear, SAS 08)
- Existing software-only solutions can catch many faults
- Takes a first step towards formalizing software-based control-flow fault detection
- Requires reasoning after a control flow fault has occurred
- Generalizes TALFT zap tags
- Higher-level abstractions
- Static Typing for a Faulty Lambda Calculus [Walker et al. 06]
- Fault-Tolerant Voting in a Simply-Typed Lambda Calculus [Elsman 07]
48Formalizing Fault Tolerance: Future Work
- Explore other fault tolerance techniques and richer fault models
- Look beyond fault detection, to reason about fault recovery
- Develop more powerful methods for reasoning in the presence of transient faults
50Other Research Projects
- Typed assembly languages for stacks
- L. Jia, F. Perry, D. Walker, and N. Glew. Certifying Compilation for a Language with Stack Allocation. Logic in Computer Science (LICS), June 2005
- F. Perry, C. Hawblitzel, and J. Chen. Simple and Flexible Stack Types. International Workshop on Aliasing, Confinement, and Ownership (IWACO), July 2007
- J. Chen, C. Hawblitzel, F. Perry, M. Emmi, J. Condit, D. Coetzee and P. Pratikakis. Type-Preserving Compilation for Realistic Object-Oriented Compilers. Programming Language Design and Implementation (PLDI), to appear June 2008
- Dynamic verification of aliasing invariants
- F. Perry, L. Jia, and D. Walker. Expressing Heap-shape Contracts in Linear Logic. Generative Programming and Component Engineering (GPCE), October 2006
- Static deadlock detection using dataflow analysis
- Identified over 100 confirmed concurrency bugs in Windows Vista
51Conclusion and Goals
- Typed Assembly Languages are well-suited for reasoning about solutions to transient faults
- Future Goal: Continuing to improve code reliability by
- Collaborating with researchers in other fields to identify domain-specific issues that will benefit from formal reasoning
- Developing new techniques for reasoning about program behavior
- Implementing analyses in real compilers
53Extra Slides
54Typed Assembly Languages for Stacks
- Typing stacks is tricky
- Stacks require frequent strong updates
- To support real stack use, need to allow stack locations to be aliased
- Strong updates are unsound in the presence of aliasing
- Insight: use Linear Logic to express stack types
- Certifying Compilation for a Language with Stack Allocation
- L. Jia, Frances Spalding Perry, D. Walker, and N. Glew
- IEEE Symposium on Logic in Computer Science (LICS), June 2005
- Simple and Flexible Stack Types
- Frances Perry, C. Hawblitzel, and J. Chen
- International Workshop on Aliasing, Confinement, and Ownership (IWACO), July 2007
- Type-Preserving Compilation for Realistic Object-Oriented Compilers
- J. Chen, C. Hawblitzel, Frances Perry, M. Emmi, J. Condit, D. Coetzee and P. Pratikakis
- Programming Language Design and Implementation (PLDI), to appear June 2008
55ESPC: Static Lock Use Analysis
Program Analysis Group, Microsoft
- Use static analysis to detect incorrect lock usage
- Deadlock detection: infer global lock ordering and look for conflicts
- Has found over 100 confirmed concurrency bugs in Windows Vista
- Lessons from ESPC
- There are times when it's OK to sacrifice soundness and completeness
- Analysis finds bugs you wouldn't find any other way
- Programmers appreciate help with difficult problems
56Related Work: Similar Systems
- Typed Assembly Language [Morrisett et al. 98]
- Proof Carrying Code [Necula & Lee 96]
- Information Flow [Denning 78, Volpano et al. 96]
- Control-Flow Integrity [Abadi et al. 05]