Heavy-ion Fault Injections in the Time-triggered Communication Protocol

LADC 2003, São Paulo, Brazil. SP Swedish National Testing and Research Institute.

Provided by: stenkejo
1
Heavy-ion Fault Injections in the Time-triggered
Communication Protocol
  • Håkan Sivencrona, SP
  • Per Johannessen, Volvo Car Corporation
  • Mattias Persson and Jan Torin, Chalmers University
    of Technology

2
Agenda
  • Objective
  • Time-triggered Protocol
  • Membership Agreement
  • Communication Failures
  • Heavy-ion Fault Injections
  • Experimental Set-up
  • Results
  • Discussion
  • Conclusions

3
Objective
  • Validate the fault hypothesis and fault handling
    mechanisms of a specific implementation of TTP/C
  • Use results for improvements of TTP/C and
    time-triggered systems in general
  • To gain experience with safety-critical broadcast
    buses using FI-techniques
  • Explore new failure modes of time-triggered
    communication

4
Time-Triggered Protocol
  • Time Division Multiple Access (TDMA)
  • For safety-critical applications
  • Fault tolerance is mainly implemented as
    redundant hardware and software mechanisms
  • Fault hypothesis: tolerate any single fault
  • Services
      • Deterministic message sending
      • Clock synchronization
      • Membership service
      • Clique avoidance
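
As a rough illustration of deterministic message sending under TDMA, the sketch below gives each node one fixed slot per round, so the sender at any instant follows from the global time alone. The slot length, the node names and the `sender_at` helper are assumptions made for this example, not details of TTP/C.

```python
# Hypothetical TDMA round: one fixed slot per node (illustrative values only).
SLOT_LENGTH_US = 500                                   # assumed slot length in microseconds
SCHEDULE = ["node_a", "node_b", "node_c", "node_d"]    # assumed sending order


def sender_at(time_us: int) -> str:
    """Return the node that owns the slot containing the given global time."""
    round_length = SLOT_LENGTH_US * len(SCHEDULE)
    slot_index = (time_us % round_length) // SLOT_LENGTH_US
    return SCHEDULE[slot_index]


if __name__ == "__main__":
    # Print the slot owner at a few sample times, including slot boundaries.
    for t in (0, 499, 500, 1999, 2000):
        print(t, sender_at(t))
```

Because the schedule is static, a receiver can tell which node is expected to send in a slot without any arbitration on the bus.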

5
Membership Agreement
  • Gives a consistent system state
  • All nodes have a membership vector
  • The cluster's membership vector includes the
    nodes that have the same global state
  • Every node is represented by a unique bit in the
    vectors in all nodes
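
The one-bit-per-node representation can be sketched as a plain integer bit vector. This is a minimal illustration only: the helper names and the simple "all vectors equal" agreement check are assumptions for the example and do not reproduce the TTP/C membership algorithm.

```python
from typing import Dict


def set_member(vector: int, node_id: int) -> int:
    """Mark node_id as a member by setting its bit."""
    return vector | (1 << node_id)


def clear_member(vector: int, node_id: int) -> int:
    """Remove node_id from the membership by clearing its bit."""
    return vector & ~(1 << node_id)


def is_member(vector: int, node_id: int) -> bool:
    """Check whether the bit for node_id is set."""
    return bool((vector >> node_id) & 1)


def agrees(local_vectors: Dict[int, int]) -> bool:
    """True when every node holds the same membership vector (assumed check)."""
    return len(set(local_vectors.values())) <= 1


if __name__ == "__main__":
    v = set_member(set_member(0, 0), 2)      # nodes 0 and 2 are members
    print(bin(v), is_member(v, 1))
    print(agrees({0: v, 1: v, 2: clear_member(v, 0)}))
```

A disagreement between local vectors is what a consistent membership service is meant to rule out.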

6
Communication Failures
  • A node stops transmitting messages
      • Application fault
      • Controller crash/failure
  • Message interference in the physical layer
      • Permanent or temporary persistent
      • Transient
  • Asymmetric message interpretation
      • Byzantine
      • Inconsistent omission
  • ... and the system behavior depends on the
    application

7
Heavy-ion Fault Injection
  • Californium-252 source, which radiates high-energy
    heavy ions (>> 1 MeV)
  • Causes so-called single event upsets (SEUs) and
    other effects in the CMOS device
  • Can affect locations not accessible with other
    methods
  • Only statistically reproducible
  • Low controllability

8
Experimental Set-up
  • System with 4-9 nodes with similar message
    schedules
  • Software that monitors and detects
    discrepancies

9
Fault Injection Results
  • Null frame: no transmission, e.g. fail silence
  • Checksum (CRC) errors: the message has the right
    format but wrong content
  • Invalid frame: a message that may or may not be
    readable but is not valid to use
      • In time domain
      • In value domain
  • Time discrepancies: timing close to the limit of
    what is acceptable
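
One way a monitoring node might sort observations into these categories is sketched below. The category names follow the slide, but the decision rules, the window and margin values, and the use of `zlib.crc32` in place of the protocol's own CRC are all assumptions made for illustration. Invalid frames in the value domain are left out, since the slide does not say how they are recognized.

```python
import zlib
from typing import Optional

WINDOW_US = 10   # assumed acceptance window around the expected arrival time
MARGIN_US = 2    # assumed margin that counts as "close to the limit"


def classify(payload: Optional[bytes], crc: int, offset_us: int) -> str:
    """Classify one slot observation (rules are assumed, not taken from TTP/C)."""
    if payload is None:
        return "null frame"                      # nothing was transmitted
    if abs(offset_us) > WINDOW_US:
        return "invalid frame (time domain)"     # arrived outside the window
    if zlib.crc32(payload) != crc:
        return "CRC error"                       # well-formed but wrong content
    if abs(offset_us) > WINDOW_US - MARGIN_US:
        return "time discrepancy"                # accepted, but near the edge
    return "correct frame"


if __name__ == "__main__":
    data = b"example"
    good_crc = zlib.crc32(data)
    print(classify(None, 0, 0))
    print(classify(data, good_crc, 0))
    print(classify(data, good_crc ^ 1, 0))
    print(classify(data, good_crc, 9))
    print(classify(data, good_crc, 11))
```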

10
CNI-register Error Log Files
[Figure: log file excerpt with labels: error diagnosis field, invalid frame flagged, correct frame received]
11
Example of Logged Data
12
Results: Fail Silence Violations
  • Approximately 12% of all faults were undetected
    by the FI-node, resulting in a fail silence
    violation
      • More than 90% of these were CRC faults
      • The rest were invalid frames
  • Approximately 0.1% of all faults were
    slightly-off-specification (SOS) messages, mainly
    invalid frames in the time domain

13
Fault Injection Results in Cluster
  • A node stops transmitting messages
      • The FI-node is silent
  • Message interference
      • Babbling idiot, needed manual reset of the system
      • Reintegration
  • Asymmetric interpretation of messages
      • Asymmetric timing faults: SOS faults in time
        domain
      • Asymmetric value faults: SOS faults in value
        domain
  • ... and the system behavior depends on the protocol
    implementation and the application
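
An asymmetric timing fault can be pictured as a frame that arrives right at the edge of the receivers' acceptance windows, so that small differences between receivers decide whether each one accepts it. The sketch below is a simplified model under assumed values: the per-node window widths, the node names and the `partition_receivers` helper are hypothetical.

```python
from typing import Dict, List, Tuple


def partition_receivers(
    frame_offset_ns: int, windows_ns: Dict[str, int]
) -> Tuple[List[str], List[str]]:
    """Split receivers into those that accept and those that reject a frame.

    frame_offset_ns: deviation of the frame from its expected arrival time.
    windows_ns: each receiver's (slightly different) acceptance half-width.
    """
    accepted = sorted(n for n, w in windows_ns.items() if abs(frame_offset_ns) <= w)
    rejected = sorted(n for n, w in windows_ns.items() if abs(frame_offset_ns) > w)
    return accepted, rejected


if __name__ == "__main__":
    # Assumed tolerances: nominally 10 us, with small per-node differences.
    windows = {"n1": 10_000, "n2": 10_200, "n3": 9_800}
    print(partition_receivers(5_000, windows))    # clearly inside: all accept
    print(partition_receivers(10_100, windows))   # at the edge: receivers disagree
```

When the two groups are both non-empty, the receivers hold different views of the same frame, which is the situation the membership service then has to resolve.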

14
Asymmetric value failure scenario
15
Asymmetric timing failure scenario
16
Cluster Size Comparisons
17
Concerns: Membership vs. Asymmetry
  • Faulty node remains undetected in case of SOS
    faults
  • Applications within the minority partition:
    system safety?
  • Protocol membership gives a brittle system
  • Reintegration: a possible hazard

18
Discussion
  • Active star coupler
  • Modified membership agreement protocol
  • Algorithms to detect and handle SOS failures

[Figure labels: dependability increase; membership]
19
Conclusions: TTP/C
  • Partitioning due to asymmetric faults should be
    resolved more smoothly, and perhaps not by forced
    reintegration
  • Stronger fault containment regions are needed
  • Larger systems/clusters are more resilient against
    SOS faults

20
General Conclusions
  • Heavy-ion fault injection is efficient in
    stressing silicon designs to arbitrary failure
    modes
  • High-integrity systems must handle asymmetric and
    Byzantine faults
  • Coverage against arbitrary faults is the only
    realistic approach for safety-critical systems,
    but it is difficult to achieve

21
Questions? Thank you for listening!
22
Recovery time distribution
  • Normal integration: < 400 ms
  • Latch-ups, automatic reset: > 1500 ms
  • Combinations of faults and latch-ups, automatic
    reset
  • Manual reset
[Figure: time axis marked 0.5 s, 1.5 s, 2 s, > 5 s]
23
Node Outage during experiments

[Figure: node outage timeline; axis ticks from 256 to 2048 in steps of 256; annotated events: latch-up with automatic reset, persistent fault, latch-up with manual reset]
