Title: Frances Perry
1Reasoning about Control Flow in the Presence of
Transient Faults
- Frances Perry
- Princeton University
- Joint work with David Walker
- SAS 2008
2Transient Faults
- Occur when an energetic particle strikes a transistor or wire, causing a change in state
- Do not permanently damage hardware
- May corrupt computation by altering stored values and signals
3Issues Caused by Transient Faults
- Crashes and Failures
  - Sun Microsystems acknowledged that transient faults caused crashes in server systems at AOL, eBay, and dozens of other major clients. [Baumann 02]
  - Cypress Semiconductor acknowledged that a single soft error caused a server farm at a telephone company to crash. [Ziegler & Puchner 04]
  - The ASC Q supercomputer at Los Alamos crashed regularly due to soft errors. [Michalak et al. 05]
- Exploiting Virtual Machine Vulnerabilities [Govindavajhala & Appel 03]
  - The Java Virtual Machine relies on type safety to separate untrusted programs from the VM
  - An attacker can craft a program that exploits transient faults to take over the VM
- Breaking Cryptographic Protocols, e.g., RSA [Boneh, DeMillo & Lipton 97]
  - RSA relies on the inability to factor N into prime numbers p and q
  - Attacker obtains two signatures (one correct, one faulty) of the same message
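The two-signature attack can be illustrated with a toy sketch. It assumes the signer uses the CRT speed-up (computing the signature separately mod p and mod q), which the slide does not state, and it uses tiny textbook primes; `crt_sign` and the fault position are hypothetical choices made for illustration only.

```python
from math import gcd


def crt_sign(m, d, p, q, fault_in_q=False):
    """Sign m with RSA-CRT; optionally corrupt the mod-q half (simulated fault)."""
    s_p = pow(m, d % (p - 1), p)
    s_q = pow(m, d % (q - 1), q)
    if fault_in_q:
        s_q ^= 1                      # single bit flip in one intermediate value
    q_inv = pow(q, -1, p)             # modular inverse (Python 3.8+)
    h = (q_inv * (s_p - s_q)) % p     # Garner recombination
    return s_q + h * q


p, q = 61, 53                         # toy primes, far too small for real use
n, e = p * q, 17
d = pow(e, -1, (p - 1) * (q - 1))
m = 65

good = crt_sign(m, d, p, q)
bad = crt_sign(m, d, p, q, fault_in_q=True)

# The two signatures agree mod p but differ mod q, so their difference
# shares exactly the factor p with n.
recovered = gcd(abs(good - bad), n)
```

Under these assumptions `recovered` is a prime factor of `n`, which is enough to reconstruct the private key.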
4Transient Fault Trends
- In 2004, a typical laptop with 1GB RAM had 1 soft fail per year. [Ziegler 04]
- Faster clock rates, increasing transistor density, decreasing voltages, and smaller feature size all contribute to increasing fault rates of approximately 8% per generation. [Borkar 05]
5Dealing with Transient Faults
- Many existing solutions
  - Provide protection by adding redundancy
  - Hardware, software, hybrid hardware-software, single core, multi-core
- Tradeoffs: performance, cost, power, and reliability
[Diagram: tradeoff axes labeled Performance, Power, Reliability, Cost]
6An Example Solution: SWIFT
- SWIFT [Reis et al. 05] is a software-based solution
  - Compiler duplicates the original computation and inserts comparisons before memory stores and control flow instructions to ensure that the two versions agree
- Evaluation: randomly inject faults and look at resulting performance and detection rate
  - The detection rate wasn't as good as they expected
  - Compiler was adding the redundant computation and then performing optimizations that remove redundancy!
  - Solution: permanently turn off certain optimizations
- Results
  - Experimental data showing a decrease in fault detection
  - English descriptions of faults not handled, but no formal reasoning
  - SWIFT will detect all but the most pathological single-upset faults.
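The duplicate-and-compare idea can be sketched as follows. This is not SWIFT's actual compiler output; `checked_store`, `run`, and the `flip` parameter are hypothetical names, and the bit flip is injected by hand to stand in for a transient fault.

```python
class FaultDetected(Exception):
    """Raised when the two redundant copies disagree."""


def checked_store(memory, addr, value, addr_dup, value_dup):
    # Compare both copies before the store becomes visible in memory.
    if addr != addr_dup or value != value_dup:
        raise FaultDetected(f"mismatch before store to address {addr}")
    memory[addr] = value


def run(x, flip=None):
    """Compute 2 * (x + 1) twice; `flip` optionally corrupts one copy."""
    memory = {}
    a = x + 1          # original computation
    a_dup = x + 1      # duplicate inserted by the (imagined) compiler
    if flip == "orig":
        a ^= 4         # simulated single bit flip in the original copy
    elif flip == "dup":
        a_dup ^= 4     # simulated single bit flip in the duplicate
    checked_store(memory, 0, 2 * a, 0, 2 * a_dup)
    return memory
```

If an optimizer noticed that `a` and `a_dup` are the same expression and merged them, the comparison would always succeed and the fault would go undetected, which is the problem the evaluation bullet describes.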
7Transient Fault Solutions
- Many Existing Solutions
  - [Borin et al. 06, Chang et al. 06, Gomaa et al. 03, Guerraoui & Schiper 97, Horst et al. 90, Kalbarczyk et al. 99, Oh et al. 02, Ohlsson & Rimen 95, Rebaudengo 01, Reinhardt & Mukherjee 00, Reis et al. 05, Reis et al. 06, Reis et al. 07, Shirvani et al. 00, Slegel et al. 99, Tremblay & Tamir 89, Venkatasubramanian 03, Vijaykumar et al. 02, Yan & Zhang 05, Yeh 96, Yeh 98, ...]
- Do these solutions actually work?
8Typed Assembly Language [Morrisett et al. 98]
- Instance of Proof-Carrying Code [Necula & Lee 96]
- Assembly-level type system encapsulates desired invariants
- Type checking the generated code guarantees properties
- The compiler does not have to be trusted.
[Diagram: Source Code → Type-Preserving Compiler → Typed Assembly Language → Type Checker]
9Using Typed Assembly Languages
- Develop a machine model to reason about machine execution
- Design the type system
- Prove the type system is sound with respect to the machine model
- Show that the typed assembly language is expressive
[Diagram: Source Code → Type-Preserving Compiler → Typed Assembly Language → Type Checker]
10Roadmap
- Transient Faults: Issues and Existing Solutions
- Verifying Assembly Code with Typed Assembly Languages
- Detecting Control Flow Faults in Software
- TALCF: Formalizing a Partial Solution to Control Flow Faults
  - Formalizing the machine model
  - Designing the type system
  - Proving soundness
  - Showing expressiveness
- Conclusions
11Control Flow Faults
L1: . . . mov r2 L2; jmp r2
12Control Flow Faults
L1: . . . mov r2 L2; jmp r2  →  ?
13Detecting Control Flow Faults in Software
- Existing software solutions can catch many (but not all) control flow faults.
- Method: have another value that approximates the PC within the current block
- Existing sequence of work handles increasing classes of control flow faults
  - CFCSS [Oh et al. 2002], SWIFT [Reis et al. 2005], RCF [Borin et al. 2006]
  - Still can't handle all faults
- Detecting control flow faults is difficult (and handling all may be impossible?)
- What can we say about these techniques mathematically?
14A Simplified Model of Control Flow Faults
- Goal: to formally analyze (parts of) the existing (imperfect) solutions for detecting control flow faults in software
- Fault model
  - Faults affect general registers r1, r2, ..., ri
  - Single Event Upset model
  - Hardware catches faulty jumps into the middle of blocks or to non-code addresses
- General approach: two redundant, independent computations
  - Green: used to determine control flow
  - Blue: backup copy, used to check that control flow was correct
15Stating and Verifying Intentions
Original:
L1:   . . . mov r2 L2; jmp r2
With intentions stated and verified:
L1:   . . . mov ri L2; mov r2 L2; jmp r2
L10:  mov r2 L10; sub r2 r2 ri; brnz r2 Lrec; . . .
Lrec: ... Recovery Code ...
L2:   mov r2 L2; sub r2 r2 ri; brnz r2 Lrec; . . .
16Stating and Verifying Intentions
L1:   . . . mov ri L2; mov r2 L2; jmp r2
L10:  mov r2 L10; sub r2 r2 ri; brnz r2 Lrec; . . .
Lrec: ... Recovery Code ...
L2:   mov r2 L2; sub r2 r2 ri; brnz r2 Lrec; . . .
17Instruction Set
- Instructions i ::=
    mov rd v
  | sub rd rs rs
  | intend r2        // mov ri r2
  | intendz rz r2    // if rz = 0, mov ri r2
  | recovernz rz     // if rz ≠ 0, jmp Lrec
- Blocks b ::=
    i b
  | jmp rt
  | brz rz rt
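To make the instruction set concrete, here is a small interpreter sketch. Several details are assumptions, not taken from the slides: labels are plain integers, `sub rd ra rb` computes `ra - rb`, `brz` jumps when its first operand is zero, a `halt` pseudo-instruction ends the run, `ri` initially names the start block, and the `fault` tuple (block, instruction index, register, new value) is a hypothetical hook for injecting one register upset.

```python
def run(code, start, fault=None):
    """Execute blocks; return (outcome, history of blocks entered)."""
    regs = {"r1": 0, "r2": 0, "r3": 0, "ri": start}
    label, history = start, []
    while True:
        if label not in code:
            return "hw-error", history        # jump to a non-code address
        history.append(label)
        target = None
        for idx, (op, *args) in enumerate(code[label]):
            if fault and fault[:2] == (label, idx):
                regs[fault[2]] = fault[3]     # zap one register
            if op == "mov":
                regs[args[0]] = args[1]
            elif op == "sub":
                regs[args[0]] = regs[args[1]] - regs[args[2]]
            elif op == "intend":
                regs["ri"] = regs[args[0]]
            elif op == "intendz":
                if regs[args[0]] == 0:
                    regs["ri"] = regs[args[1]]
            elif op == "recovernz":
                if regs[args[0]] != 0:
                    return "recover", history
            elif op == "jmp":
                target = regs[args[0]]
                break
            elif op == "brz":
                if regs[args[0]] == 0:
                    target = regs[args[1]]
                    break
            elif op == "halt":                # assumption: not in the slide's grammar
                return "done", history
        if target is None:
            return "done", history
        label = target


# Block L1 states its intention (L2) and jumps; L2 and L10 verify on entry.
CODE = {
    1: [("mov", "r3", 2), ("intend", "r3"), ("mov", "r2", 2), ("jmp", "r2")],
    2: [("mov", "r2", 2), ("sub", "r2", "r2", "ri"), ("recovernz", "r2"), ("halt",)],
    10: [("mov", "r2", 10), ("sub", "r2", "r2", "ri"), ("recovernz", "r2"), ("halt",)],
}
```

Calling `run(CODE, 1, fault=(1, 3, "r2", 10))` corrupts the jump target just before the `jmp`; control lands in block 10, whose entry check compares 10 against the intention stored in `ri` and diverts to recovery.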
18Roadmap
- Transient Faults: Issues and Existing Solutions
- Verifying Assembly Code with Typed Assembly Languages
- Detecting Control Flow Faults in Software
- TALCF: Formalizing a Partial Solution to Control Flow Faults
  - Formalizing the machine model
  - Designing the type system
  - Proving soundness
  - Showing expressiveness
- Conclusions
19TALCF: Reasoning about Control Flow
[Diagram: compilation pipeline. Labels: Source Code; Syntactic & Semantic Analysis; Type-Preserving Compiler; Compilation to Low-level Code; Addition of Redundancy & Detection Protocol; TALCF; Optimizations; Type Checker; Code Generation]
20Step 1: TALCF Machine Model
- Machine states Σ = (C, R, b, h) contain a code memory, register file, current block being executed, and a history (trace of blocks visited)
- Final states Σ = recover | hw-error
- Define an operational semantics Σ →0 Σ′
- Add operational rules Σ →1 Σ′ that nondeterministically introduce faults
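One way to picture the state tuple and the fault step is the sketch below. The field names, the opaque block bodies, and the finite `domain` of replacement values are all assumptions made so the fault successors can be enumerated; the slides only say that faults are introduced nondeterministically.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List


@dataclass
class State:
    code: Dict[int, str]    # C: block label -> block body (opaque here)
    regs: Dict[str, int]    # R: register file
    block: int              # b: block currently executing
    history: List[int]      # h: trace of blocks visited


def zap_steps(state: State, domain: Iterable[int]) -> Iterator[State]:
    """Yield every state reachable by one fault: a single register changes value."""
    values = list(domain)
    for reg, old in state.regs.items():
        for new in values:
            if new != old:
                yield State(state.code, {**state.regs, reg: new},
                            state.block, list(state.history))
```

Each yielded state differs from the input in exactly one register, which matches the single-event-upset assumption of the fault model.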
21Step 2: Type System Design
- Main concepts behind the TALCF type system
  - Stages of the fault tolerance protocol must occur in order.
  - Equivalence checking ensures that redundant values act as proper backups.
  - Values are classified based on their reliability properties.
22Concept 1: Protocol Stages
... mov r2 L ... sub r2 r2 ri ... recovernz r2 . . . intend r3 ... jmp r3
- Checking Code
  - ri has type check
  - Block may be invalid
- Block Body
  - ri has type ok
  - Block is valid
- Exit Code
  - ri has type go
  - Block is valid
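The ordering requirement can be read as a small state machine over the stage of ri. The sketch below is a simplification, not the TALCF typing rules: it looks only at opcode names, and it assumes that `recovernz` moves the stage from check to ok, `intend` moves it from ok to go, and `jmp` is allowed only at go.

```python
def stages_in_order(opcodes):
    """Return True if a block's opcodes respect the order check -> ok -> go."""
    stage = "check"                  # on entry the block has not been validated
    for op in opcodes:
        if op == "recovernz":
            if stage != "check":
                return False
            stage = "ok"             # entry check passed: block is valid
        elif op == "intend":
            if stage != "ok":
                return False
            stage = "go"             # intention recorded: ready to leave
        elif op == "jmp":
            return stage == "go"
        # mov / sub do not change the stage in this simplified view
    return False                     # block never reached a jump
```

For example, the slide's sequence `["mov", "sub", "recovernz", "mov", "intend", "jmp"]` is accepted, while a block that jumps without first stating an intention is rejected.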
23Concept 2: Equivalence Checking
- Values are typed with a triple <c, b, E>
  - c: a color
  - b: a basic type (int, codetp, protocol stage)
  - E: a static expression
- Static expressions are arithmetic expressions describing the value
- Typing rules use expressions to enforce that the Blue computation is a true copy of the Green computation
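A rough sketch of how static expressions could be compared is given below. The expression encoding (integers, variable names, and `("add" | "sub", e1, e2)` tuples), the linear normal form, and the `may_jump` rule are illustrative assumptions; the actual TALCF rules are in the paper.

```python
def normalize(expr):
    """Linear normal form: maps each variable (or 1 for the constant) to a coefficient."""
    if isinstance(expr, int):
        return {1: expr} if expr else {}
    if isinstance(expr, str):
        return {expr: 1}
    op, left, right = expr
    sign = {"add": 1, "sub": -1}[op]
    out = dict(normalize(left))
    for key, coef in normalize(right).items():
        out[key] = out.get(key, 0) + sign * coef
    return {k: c for k, c in out.items() if c != 0}


def equivalent(e1, e2):
    """Two static expressions are equivalent if their normal forms coincide."""
    return normalize(e1) == normalize(e2)


def may_jump(target, intention):
    """Allow a jump only if a green code pointer matches the blue stated intention."""
    (c1, b1, e1), (c2, b2, e2) = target, intention
    return (c1, b1) == ("G", "codetp") and (c2, b2) == ("B", "go") and equivalent(e1, e2)
```

Here `may_jump(("G", "codetp", "L2"), ("B", "go", "L2"))` holds, while replacing the blue expression with `"L10"` makes the check fail because the backup no longer describes the same value.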
24Concept 3: Approximating Trust
- How do we know if a runtime value actually has its static type? We can't!
- Type system approximates trust by assigning each value a color: G, B, O
  - All values with the same color share the same reliability properties
  - A value colored c only depends on other values colored c
- After a fault zaps a value colored c
  - all values colored c are untrusted
  - the trust level of other colors doesn't change
25Green and Blue Invariants
- Main computation values are green
- Backup computation values are blue
- If a corrupt value of either color is used during a control flow transfer, then a control flow fault has occurred
- Once a CF fault has occurred, we consider both green and blue values to be untrusted
26Orange Invariants
- Orange values continue to be trusted even after a control flow fault has occurred.
- Invariants still hold because either
  - Value isn't live across control flow transfers
  - Invariant is true across every control flow transfer
- All blocks must
  - Require nothing special about the value in ri ( <O, check, a> )
  - No other registers can be colored orange
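The trust rules for the three colors can be summarized in a few lines. The encoding of the fault status as `None`, a color letter, or `"CF"` is an assumption made for this sketch.

```python
def trusted(color, status):
    """Is a value of this color still trusted?

    status is None (no fault yet), "G"/"B"/"O" (a value of that color
    was zapped), or "CF" (a control flow fault has occurred).
    """
    if status is None:
        return True
    if status == "CF":
        return color == "O"      # only orange survives a control flow fault
    return color != status       # a zap taints exactly one color
```

So a fault in the green computation leaves blue usable as a backup, while after a control flow fault only orange values, such as the contents of ri on block entry, can be relied on.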
27Zap Tags
- Typing judgments are parameterized by a zap tag Z ( Z ? ) which classifies groups of values as trusted or untrusted
- Z ≤ Z′ if Z is at least as trusted as Z′
28Step Three: Soundness
- Progress: well-typed states can take a step.
- Preservation: execution preserves typing, but the zap tag may elevate to a supertype.
  - · ≤ c when a fault first occurs
  - G/B ≤ CF when a corrupt value is used to determine control flow.
- Fault Tolerance Theorem
  - If a single fault occurs, nothing goes too badly wrong before the error is detected.
- Fully formalized and proved in the paper
29Step 4: Proving Expressiveness
- Type preserving translation from a simple imperative language to TALCF
  - s ::= v := n | vd := va − vb
  -   | if0 vz then s1 else s2 | while vz ≠ 0 do s
  -   | s1 ; s2
- Prove that well-formed statements s can be translated into well-typed machine states Σ
30Conclusion
- Transient faults are a significant concern already and future processors will become more susceptible
- Contributions of TALCF
  - Introduces new proof techniques for reasoning about control flow.
  - Proves the correctness of a software technique for detecting control flow faults.
- For more information, visit the Project Zap Homepage: http://www.cs.princeton.edu/sip/projects/zap
32Extra Slides
33Formalizing Fault Tolerance Solutions
- λzap [Walker et al. ICFP 06]
  - at the level of the λ calculus
  - duplicates computation and uses atomic voting to compare computations and recover from faults
  - Avoids specifically dealing with control flow faults
- TALFT [Perry et al. PLDI 07]
  - formalizes an existing hybrid hardware-software solution
  - uses an assembly level type system to capture invariants about redundancy
  - Requires special hardware to address control flow faults
34Option 1: CF Identical, Registers Similar
[Figure: the original and faulty executions both visit blocks 1, 2, 3, 4, 5]
35Option 2: Hardware Error Detected
[Figure: block traces (blocks 1 to 5) of the original and faulty executions; the faulty execution ends in hw-error]
36Option 3: Fault to Backup Copy Detected
[Figure: block traces (blocks 1 to 5) of the original and faulty executions; the faulty execution reaches Lrec]
37Option 4: CF Fault Detected
[Figure: block traces (blocks 1 to 5) of the original and faulty executions; the faulty execution reaches Lrec]
38Typing Example
. . . intend r3; jmp r2
r2 : <G, codetp, e2>    r3 : <B, codetp, e3>
ri : <B, go, e3>
- Requires
  - Concept 1: intend has already occurred
  - Concept 2: backup is correct (e2 = e3)
ri : <O, check, xi>
Want to know: did r3 = r2? Can check: does ri = L?
L: mov r6 L; sub r6 r6 ri; recovernz r6; . . .
r6 : <O, check, L - xi>
Control only proceeds past this point if xi = L and block is valid.
39Formalizing Program Execution
- Σ represents the state of the machine at some point during program execution
- Define an operational semantics Σ →0 Σ′ to express the result of executing a single instruction
- There are no rules for the undefined cases

  R(r1) = c n    R(r2) = c m
  ---------------------------------------------------------------- (add)
  (R, C, M, Q, add rd r1 r2) →0 (R[rd ↦ c (n+m)], C, M, Q, ·)
40Formalizing the Fault Model
- Add operational rules Σ →1 Σ′ that nondeterministically introduce faults to the register file and queue

  R(r) = c n
  ------------------------------------------------ (zap-reg)
  (R, C, M, Q, i) →1 (R[r ↦ c z], C, M, Q, i)

  Q = (a1,v1), ..., (ai,vi), ..., (an,vn)    Q′ = (a1,v1), ..., (z,vi), ..., (an,vn)
  ------------------------------------------------ (zap-queue-addr)
  (R, C, M, Q, i) →1 (R, C, M, Q′, i)
41Fault Tolerance Theorem
- If a computation sustains a single fault, then one of the following holds:
  - the faulty computation looks identical to the original, modulo the corrupt color
  - the faulty computation visits the same sequence of blocks until a hardware error is detected
  - the faulty computation visits the same sequence of blocks until a fault is detected
  - the faulty computation veers off course to an invalid block but catches the error within that block
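The four cases can be phrased as a check on a pair of block traces. The trace encoding is an assumption of this sketch: a trace is a list of block labels, and a faulty trace may end with the marker `"hw-error"` or `"Lrec"`.

```python
def classify(original, faulty):
    """Return which case (1-4) a faulty trace falls into, or None if none applies."""
    end = faulty[-1] if faulty and faulty[-1] in ("hw-error", "Lrec") else None
    blocks = faulty[:-1] if end else faulty
    # Length of the common prefix of blocks visited by both executions.
    k = 0
    while k < min(len(original), len(blocks)) and original[k] == blocks[k]:
        k += 1
    if end is None and blocks == original:
        return 1                      # identical control flow
    if end == "hw-error" and k == len(blocks):
        return 2                      # same blocks until a hardware error
    if end == "Lrec" and k == len(blocks):
        return 3                      # same blocks until the fault is detected
    if end == "Lrec" and k == len(blocks) - 1:
        return 4                      # one stray block, error caught inside it
    return None                       # behaviour the theorem rules out
```

A result of `None` would mean the faulty run executed more than one unintended block, or finished normally on a different path, without the error being caught.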
42Control Flow Invariants
. . . mov r2 L; jmp r2
. . . mov r2 L; jmp r2
L: . . .
If the green and blue computations don't agree that I am the jump target, then there's been a fault.
43Control Flow Invariants
. . . mov r2 L; jmp r2
. . . mov r2 L; jmp r2
Arrival at this instruction depends on the green computation. Constants can be trusted as any color.
L: mov r2 L; sub r2 r2 ri; recovernz r2; . . .
Use a different color for the checking code.
ri is part of the blue computation, but L has no preconceived notion of ri's type, so it can trust it as any color.
44Machine State Typing
- Machine State Typing Judgment: ⊢Z Σ
- Code memory is described by Ψ
- Register file typing is described by Γ
45Related Work
- λzap: Fault-tolerant lambda calculus [Walker et al.]
  - High-level: lacks a program counter, register file, memory, load/store instructions, ...
- TAL: Typed Assembly Language [Morrisett et al.]