Title: Heavy-ion Fault Injections in the Time-triggered Communication Protocol
1Heavy-ion Fault Injections in the Time-triggered Communication Protocol
- Håkan Sivencrona, SP
- Per Johannessen, Volvo Car Corporation
- Mattias Persson and Jan Torin, Chalmers University of Technology
2Agenda
- Objective
- Time-triggered Protocol
- Membership Agreement
- Communication Failures
- Heavy-ion Fault Injections
- Experimental Set-up
- Results
- Discussion
- Conclusions
3Objective
- Validate the fault hypothesis and fault handling mechanisms of a specific implementation of TTP/C
- Use results for improvements of TTP/C and time-triggered systems in general
- Gain experience with safety-critical broadcast buses using FI-techniques
- Explore new failure modes of time-triggered communication
4Time-Triggered Protocol
- Time Division Multiple Access (TDMA)
- For safety-critical applications
- Fault tolerance is mainly implemented as redundant hardware and software mechanisms
- Fault hypothesis: tolerate any single fault
- Services
- Deterministic message sending
- Clock synchronization
- Membership service
- Clique avoidance
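Deterministic message sending under TDMA can be illustrated with a minimal sketch. The node names, the slot length and the helper functions `build_round` and `sender_at` are hypothetical and chosen only for illustration; they are not taken from the TTP/C implementation under test.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slot:
    node: str        # node that owns this sending slot
    start_us: int    # offset of the slot start within the round
    length_us: int   # slot duration


def build_round(nodes, slot_length_us):
    """Give every node exactly one fixed sending slot per TDMA round."""
    return [Slot(node, i * slot_length_us, slot_length_us)
            for i, node in enumerate(nodes)]


def sender_at(schedule, t_us):
    """Return the node allowed to transmit at global time t_us."""
    round_length = sum(slot.length_us for slot in schedule)
    offset = t_us % round_length
    for slot in schedule:
        if slot.start_us <= offset < slot.start_us + slot.length_us:
            return slot.node
    raise AssertionError("schedule does not cover the whole round")


# Hypothetical 4-node cluster with 250 us slots (assumed values).
schedule = build_round(["A", "B", "C", "D"], slot_length_us=250)
print(sender_at(schedule, 0))      # A
print(sender_at(schedule, 600))    # C
print(sender_at(schedule, 1000))   # A (a new round starts)
```

Because the sender is a pure function of global time, every receiver knows in advance who should be transmitting, which is why this scheme depends on the clock synchronization service.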
5Membership Agreement
- Gives a consistent system state
- All nodes have a membership vector
- The cluster's membership vector includes the nodes that have the same global state
- Every node is represented by a unique bit in the vectors in all nodes
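The one-bit-per-node idea can be sketched as a plain bit mask. This is an illustrative model only: the cluster of four nodes, the bit assignment and the helper names are assumptions, and taking the most common vector as the agreed one is a simplification of the actual agreement protocol.

```python
from collections import Counter

NODES = ["A", "B", "C", "D"]                      # hypothetical cluster
BIT = {name: 1 << i for i, name in enumerate(NODES)}


def vector(members):
    """Encode a collection of node names as a membership bit vector."""
    v = 0
    for name in members:
        v |= BIT[name]
    return v


def members(v):
    """Decode a membership bit vector back into node names."""
    return [name for name in NODES if v & BIT[name]]


def agreeing_nodes(views):
    """Return the nodes that hold the most common membership vector."""
    majority_vector, _ = Counter(views.values()).most_common(1)[0]
    return sorted(n for n, v in views.items() if v == majority_vector)


# Every node holds its own copy of the vector. Here D has dropped B,
# for example after receiving B's frame incorrectly.
views = {
    "A": vector("ABCD"),
    "B": vector("ABCD"),
    "C": vector("ABCD"),
    "D": vector("ACD"),
}
print(format(views["D"], "04b"))   # 1101
print(members(views["D"]))         # ['A', 'C', 'D']
print(agreeing_nodes(views))       # ['A', 'B', 'C']
```

A consistent system state corresponds to all nodes holding the same vector; any disagreement, as with D here, splits the cluster into groups with different views.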
6Communication Failures
- A node stops transmitting messages
- Application fault
- Controller crash/failure
- A message interference in the physical layer
- Permanent or temporary persistent
- Transient
- An asymmetric message interpretation
- Byzantine
- Omission inconsistent
- ... and the system behavior depends on the application
7Heavy-ion Fault Injection
- Californium-252 source, which radiates heavy ions with high energy, >> 1 MeV
- Causes so-called single event upsets (SEUs) and other effects in the CMOS device
- Can affect locations not accessible with other methods
- Only statistically reproducible
- Low controllability
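As a rough software analogy, a single event upset can be modelled as one inverted bit at a random position in a block of state. This sketch is only an emulation under that assumption: real heavy-ion injection acts on the physical device and can cause effects, such as latch-ups, that a bit flip does not capture. The function names and the seed are hypothetical.

```python
import random


def flip_bit(data, bit_index):
    """Return a copy of data (bytes) with exactly one bit inverted."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def inject_seu(data, rng):
    """Flip one randomly chosen bit; the location is not controllable."""
    bit_index = rng.randrange(len(data) * 8)
    return flip_bit(data, bit_index), bit_index


state = bytes([0x00, 0xFF, 0x10])
print(flip_bit(state, 0).hex())    # 01ff10
print(flip_bit(state, 15).hex())   # 007f10

rng = random.Random(252)           # arbitrary seed, for repeatable runs
corrupted, where = inject_seu(state, rng)
differing = sum(bin(a ^ b).count("1") for a, b in zip(state, corrupted))
print(differing)                   # 1
```

Seeding the generator makes a software campaign repeatable, which is exactly what the physical method lacks: with a radiation source the fault location and time are only statistically reproducible.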
8Experimental Set-up
System with 4-9 nodes with similar message schedules
Software that monitors and detects discrepancies
9Fault Injection Results
- Null frame: no transmission, e.g. fail silence
- Checksum errors (CRC): message has the right format but wrong content
- Invalid frame: a message that may or may not be readable but is not valid to use
- In time domain
- In value domain
- Time discrepancies, when times are close to the unacceptable
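The observation categories above can be sketched as a small receiver-side classifier. Everything concrete here is an assumption made for illustration: the one-byte header, the frame layout, the receive window, and the use of `zlib.crc32` as a stand-in for the protocol's own CRC.

```python
import zlib

HEADER = b"\x5a"   # hypothetical one-byte frame header


def make_frame(payload):
    """Build header + payload + 4-byte CRC over header and payload."""
    body = HEADER + payload
    return body + zlib.crc32(body).to_bytes(4, "big")


def classify(frame, arrival_us, window_start_us, window_end_us):
    """Classify what a receiver observed in one sending slot."""
    if not frame:
        return "null frame"                      # nothing was transmitted
    if not (window_start_us <= arrival_us <= window_end_us):
        return "invalid frame (time domain)"     # outside the receive window
    if len(frame) < 6 or frame[:1] != HEADER:
        return "invalid frame (value domain)"    # malformed frame
    body, received_crc = frame[:-4], frame[-4:]
    if zlib.crc32(body).to_bytes(4, "big") != received_crc:
        return "CRC error"                       # right format, wrong content
    return "correct frame"


good = make_frame(b"\x01\x02")
bit_flipped = good[:1] + bytes([good[1] ^ 0x04]) + good[2:]
bad_header = b"\x00" + good[1:]

print(classify(good, 100, 90, 110))          # correct frame
print(classify(bit_flipped, 100, 90, 110))   # CRC error
print(classify(bad_header, 100, 90, 110))    # invalid frame (value domain)
print(classify(good, 130, 90, 110))          # invalid frame (time domain)
print(classify(None, 100, 90, 110))          # null frame
```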
10CNI-register Error Log Files
Error diagnosis field
Invalid frame flagged
Correct frame received
11Example of Logged Data
12Results: Fail Silence Violations
- Approximately 12% of all faults were undetected by the FI-node, resulting in a fail silence violation
- More than 90% of these were CRC faults
- The rest were invalid frames
- Approximately 0.1% of all faults were SOS (slightly-off-specification) messages, mainly invalid frames in the time domain
13Fault Injection Results in Cluster
- A node stops transmitting messages
- The FI-node is silent
- Message Interference
- Babbling idiot, needed manual reset of the system
- Reintegration
- Asymmetric interpretation of messages
- Asymmetric timing faults: SOS faults in time domain
- Asymmetric value faults: SOS faults in value domain
- ... and the system behavior depends on the protocol implementation and the application
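An asymmetric timing fault can be illustrated with receivers whose acceptance windows differ slightly. The window values and node names below are hypothetical; the point is only that a frame arriving close to the edge of the window can be accepted by some receivers and rejected by others, while a clearly early or late frame is judged the same way by all.

```python
def accepts(arrival_us, window_start_us, window_end_us):
    """True if the frame arrives inside this receiver's window."""
    return window_start_us <= arrival_us <= window_end_us


# Hypothetical receive windows: nominally 90..110 us, but each receiver's
# local clock and hardware tolerances shift the edges slightly.
windows = {
    "A": (90, 110),
    "B": (91, 111),
    "C": (89, 109),
}


def views_of(arrival_us):
    """Each receiver's verdict on a frame arriving at arrival_us."""
    return {node: accepts(arrival_us, *w) for node, w in windows.items()}


def is_asymmetric(verdicts):
    """True when receivers disagree on whether the frame was valid."""
    return len(set(verdicts.values())) > 1


print(views_of(100))   # {'A': True, 'B': True, 'C': True}
print(views_of(130))   # {'A': False, 'B': False, 'C': False}
print(views_of(110))   # {'A': True, 'B': True, 'C': False}
print(is_asymmetric(views_of(110)))   # True
```

The same reasoning applies in the value domain if the window is replaced by a signal-level threshold that differs marginally between receivers.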
14Asymmetric value failure scenario
15Asymmetric timing failure scenario
16Cluster Size Comparisons
17Concerns: Membership vs. Asymmetry
- Faulty node remains undetected in case of SOS faults
- Applications within the minority partition: system safety?
- Protocol membership gives a brittle system
- Reintegration a possible hazard
- Reintegration a possible hazard
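How an asymmetric fault can push a correct node into the minority can be sketched with a simplified majority test. This is an assumption-laden model, loosely based on counting agreed and failed slots over one round; the function name, the verdict lists and the exact rule are illustrative and not taken from the implementation under test.

```python
def clique_check(slot_verdicts):
    """Decide whether a node believes it is in the majority.

    slot_verdicts holds one entry per frame seen in the last round
    (including the node's own slot): True if the frame agreed with
    this node's state, False if it was judged failed.
    """
    agreed = sum(1 for ok in slot_verdicts if ok)
    failed = len(slot_verdicts) - agreed
    return "continue" if agreed > failed else "restart and reintegrate"


# Hypothetical 4-node cluster. Faulty node D sent a slightly-off-specification
# frame that A and B accepted but C rejected, so C's view now differs.
seen_by_a = [True, True, False, True]     # A, B, D agree; C's frame fails
seen_by_c = [False, False, True, False]   # only C's own slot agrees

print(clique_check(seen_by_a))   # continue
print(clique_check(seen_by_c))   # restart and reintegrate
```

In this scenario the correct node C is the one forced to restart, while the faulty node D stays in the membership, which is the concern raised in the first bullet.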
18Discussion
- Active star coupler
- Modified membership agreement protocol
- Algorithms to detect and handle SOS failures
[Figure labels: Dependability increase; Membership]
19Conclusions: TTP/C
- Partitioning due to asymmetric faults should be resolved more smoothly, and perhaps not by forced reintegration
- Stronger fault containment regions are needed
- Larger system/cluster more resilient against SOS faults
20General Conclusions
- Heavy-ion fault injection is efficient in stressing silicon designs to arbitrary failure modes
- High-integrity systems must handle asymmetric and Byzantine faults
- Coverage against arbitrary faults is the only realistic approach for safety-critical systems, but difficult to achieve
21Questions? Thank you for listening!
22Recovery time distribution
Normal integration: < 400 ms
Latch-ups, automatic reset: > 1500 ms
Combinations of faults and latch-ups, automatic reset
Manual reset
Time axis: 0.5 s, 1.5 s, 2 s, > 5 s
23Node Outage during experiments
[Figure: node outages over the course of the experiments. Axis ticks from 256 to 2048 in steps of 256; labelled events: latch-up with automatic reset, persistent fault, latch-up with manual reset.]