Safety Critical Systems presentation

Title: Safety Critical Systems

1
Safety Critical Systems
2
Eight steps to safety

Identify the hazards
Determine the risks
Define the safety measures
Create safe requirements
Create safe designs
Implement safety
Assure the safety process
Test, test, test

3
Eight steps to safety

Identify the hazards
Determine the risks
Define the safety measures
Create safe requirements
Create safe designs
Implement safety
Assure the safety process
Test, test, test

Safety analysis
Handled at the architectural level and
mechanistic level
4
Safety Analysis

You must identify the hazards of the system
You must identify the faults that can lead to
hazards
You must define safety control measures to handle
hazards
These culminate in the Hazard Analysis
The Hazard Analysis feeds into the Requirements
Specification

5
Eight steps to safety

Identify the hazards
Determine the risks
Define the safety measures
Create safe requirements
Create safe designs
Implement safety
Assure the safety process
Test, test, test

6
Hazard Causes

Release of energy
electromagnetism (microwave oven)
radiation (nuclear power plant)
electricity (electrocution hazard from ECG leads)
heat (infant warmer)
kinetic (runaway train)
Release of toxins

7
Hazard Causes

Interference with life support or other
safety-related function
Misleading safety personnel
Failure to alarm
alarming too much (Therac-25): the alarms were
ignored and people were killed

8
Types of Hazards

Actions
inappropriate system actions taken
F-18 pilot pulling up landing gear
appropriate system actions not taken
Timing
too soon
too late
fault latency time

9
Types of Hazards

Sequence
skipping actions
actions out of order
Amount
too much
too little

10
Example Hazards

Actions
incorrectly energizing a medical treatment laser
failure to engage landing gear
Timing
cardiac pacemaker paces too fast
flight control surface adjusted too slowly

11
Example Hazards

Sequence
empty the vat, THEN add the reagent
out of sequence network packets controlling
industrial robot
Amount
electrocution from muscle stimulator
too little oxygen delivered to ventilator patient

12
Means of Hazard Control

Obviation: the possibility of the hazard can be
removed by making it physically impossible
use incompatible fasteners to prevent cross
connections
Education: the hazard can be handled by
educating the users so that they won't create
hazardous conditions through equipment misuse
don't look down the barrel when cleaning your
rifle

13
Means of Hazard Control

Alarming: announcing the hazard to the user when
it appears so that they can take appropriate
action
alarming when the heart stops beating
Interlocks: the hazard can be removed by using
secondary devices and/or logic to intercede when
a hazard presents itself
car won't start unless it is in Park

14
Means of Hazard Control

Internal checking: the hazard can be handled by
ensuring that a system can detect that it is
malfunctioning prior to an incident
CRC checks data for corruption whenever it is
accessed
Safety equipment
goggles, gloves

15
Means of Hazard Control

Restricting access: limit access to potential
hazards so that only knowledgeable users have it
using passwords to prevent inadvertently starting
service mode
Labelling
High Voltage -- DO NOT TOUCH

16
Hazard Analysis
Each column answers one question:
Hazard: the hazardous condition
Level of risk: how bad if it occurs?
Tolerance time (T1): how long can it be tolerated?
Fault: how can this happen?
Likelihood: how frequently?
Detection time: how long to discover?
Control measure: what do you do about it?
Exposure time: how long is the exposure to the hazard?

Hazard          | Level of risk | Tolerance time (T1) | Fault                              | Likelihood | Detection time | Control measure                              | Exposure time
Hypoventilation | Severe        | 5 min               | Ventilator fans                    | rare       | 30 sec         | Independent pressure alarm, action by doctor | 1 min
                |               |                     | Esophageal intubation              | often      | 30 sec         | CO2 sensor alarm                             | 1 min
                |               |                     | User misattaches breathing circuit | often      | 0              | Noncompatible mechanical fasteners used      | 0
Overpressure    | Severe        | 250 ms              | Release valve failure              | rare       | 50 ms          | Secondary valve opens                        | 55 ms
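The timing columns of the hazard analysis can be checked mechanically. A minimal Python sketch, assuming (this rule is an assumption, not stated on the slide) that a control measure is adequate only when the fault is detected within the exposure time and the exposure time does not exceed the tolerance time. The class and field names are hypothetical; the two rows use values from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HazardEntry:
    """One row of a hazard analysis; all times are in seconds."""
    hazard: str
    fault: str
    tolerance_s: float   # how long the hazard can be tolerated
    detection_s: float   # how long it takes to discover the fault
    exposure_s: float    # how long the user is exposed to the hazard

    def is_controlled_in_time(self) -> bool:
        # Assumed rule: detection happens within the exposure window,
        # and the exposure does not outlast the tolerance time.
        return self.detection_s <= self.exposure_s <= self.tolerance_s


ROWS = [
    HazardEntry("Hypoventilation", "Esophageal intubation", 300.0, 30.0, 60.0),
    HazardEntry("Overpressure", "Release valve failure", 0.250, 0.050, 0.055),
]

for row in ROWS:
    print(row.hazard, "/", row.fault, "->", row.is_controlled_in_time())
```

A row that fails this check points at a control measure that acts too slowly for the hazard it is meant to cover.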
17
When is a system safe enough?

(Minimal) No hazards in the absence of faults
(Minimal) No hazards in the presence of any
single point failure
a common mode failure is a single point failure
that affects multiple channels
a latent fault is an undetected fault which
allows another fault to cause a hazard
Your mileage may vary depending on the risk
introduced by your system

18
Safety Measures

You cannot depend on a safety measure that you
cannot test!
CAN bus with 2 nodes provides a CRC on messages
checked at the chip level, but the chips provide
no way of testing to see if it is working.
Therefore, it cannot be relied on as a safety
measure

19
Fail-Safe States

Off
Emergency stop -- immediately cut power
Production stop -- stop after the current task
Protection stop -- shut down without removing
power
Partial Shutdown
Degraded level of functionality

20
Fail-Safe States

Hold
No functionality, but with safety actions taken
Manual or External control
Restart (reboot)

21
Eight steps to safety

Identify the hazards
Determine the risks
Define the safety measures
Create safe requirements
Create safe designs
Implement safety
Assure the safety process
Test, test, test

22
Risk Assessment

For each hazard
determine the potential severity
determine the likelihood of the hazard
determine how long the user is exposed to the
hazard
determine whether the risk can be removed

23
TUV Risk Level Determination Chart

S  | E  | G  | W3 | W2 | W1
S1 |    |    | 1  | -  | -
S2 | E1 | G1 | 2  | 1  | -
S2 | E1 | G2 | 3  | 2  | 1
S2 | E2 | G1 | 4  | 3  | 2
S2 | E2 | G2 | 5  | 4  | 3
S3 | E1 |    | 6  | 5  | 4
S3 | E2 |    | 7  | 6  | 5
S4 |    |    | 8  | 7  | 6

Risk parameters
S: Extent of damage
S1 slight injury
S2 severe irreversible injury to one or more
persons, or the death of a single person
S3 death of several persons
S4 catastrophic consequences, several deaths
E: Exposure time
E1 seldom to relatively infrequent
E2 frequent to continuous
G: Hazard prevention
G1 possible under certain conditions
G2 hardly possible
W: Occurrence probability of hazardous event
W1 very low
W2 low
W3 relatively high
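The chart is a lookup from the four risk parameters to a risk level. A minimal Python sketch of that lookup, assuming the chart rows read S1; S2 with E1/E2 and G1/G2; S3 with E1/E2; and S4, each with levels for W3, W2, W1. The function and table names are hypothetical.

```python
# Chart rows keyed by (S, E, G); None means the parameter is not used
# on that row. Each value holds the levels for W3, W2, W1 in that order,
# with None where the chart shows "-".
RISK_CHART = {
    ("S1", None, None): (1, None, None),
    ("S2", "E1", "G1"): (2, 1, None),
    ("S2", "E1", "G2"): (3, 2, 1),
    ("S2", "E2", "G1"): (4, 3, 2),
    ("S2", "E2", "G2"): (5, 4, 3),
    ("S3", "E1", None): (6, 5, 4),
    ("S3", "E2", None): (7, 6, 5),
    ("S4", None, None): (8, 7, 6),
}
W_COLUMN = {"W3": 0, "W2": 1, "W1": 2}


def risk_level(s, w, e=None, g=None):
    """Return the charted risk level, or None where the chart shows '-'."""
    if s in ("S1", "S4"):
        e = g = None      # exposure and prevention are not used
    elif s == "S3":
        g = None          # prevention is not used
    return RISK_CHART[(s, e, g)][W_COLUMN[w]]


print(risk_level("S2", "W3", e="E2", g="G2"))  # 5
print(risk_level("S3", "W3", e="E1"))          # 6
```

An unknown parameter combination raises KeyError rather than returning a guess, which seems the safer default for a safety assessment tool.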
24
Sample Risk Assessments

Device         | Hazard          | Extent of damage | Exposure time | Hazard Prevention | Probability | TUV Risk level
Microwave oven | Irradiation     | S2               | E2            | G2                | W3          | 5
Pacemaker      | Pace too slowly | S2               | E2            | G2                | W3          | 5
Pacemaker      | Pace too fast   | S2               | E2            | G2                | W3          | 5
Power station  | Explosion       | S3               | E1            | --                | W3          | 6
Airliner       | Crash           | S4               | E2            | G2                | W2          | 8
25
Eight steps to safety

Identify the hazards
Determine the risks
Define the safety measures
Create safe requirements
Create safe designs
Implement safety
Assure the safety process
Test, test, test

26
Safety Measures

Safety measures do one of the following
remove the hazard
reduce the risk
identify the hazard to supervisory control
The purpose of the safety measure is to ensure
the system remains in a safe state

27
Safety Measures

Adequacy of measures
safety measures must be able to reliably detect
the fault
safety measures must be able to take appropriate
actions

Component | Fault/Error | Software class (1, 2) | Examples of acceptable measures
Interrupt handling and execution | no interrupt or too frequent | rq | functional test or time-slot monitoring
Interrupt handling and execution | no interrupt or too frequent and interrupt related to different sources | rq | comparison of redundant functional channels by either reciprocal comparison, independent hardware comparator, or independent time-slot and logical monitoring
28
Risk Reduction

Identify the fault
Take corrective action, either
use redundancy to correct and move on
feedforward error correction (Hamming codes)
redo the computational step
feedback error detection (take corrective action
first)
go to a fail-safe state
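As an illustration of feedforward error correction, here is a minimal Python sketch of a Hamming(7,4) code, which stores 4 data bits with 3 parity bits and can correct any single flipped bit. The function names are hypothetical, and a real system would more likely use a hardware ECC or a vetted library.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4      # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4      # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4      # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]   # positions 1..7


def hamming74_decode(codeword):
    """Return (data bits, error position); position 0 means no error seen."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3   # 1-based index of the flipped bit
    if position:
        c[position - 1] ^= 1          # correct and move on
    return [c[2], c[4], c[5], c[6]], position


word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1                           # seed a single-bit fault
print(hamming74_decode(word))          # ([1, 0, 1, 1], 5)
```

This code corrects one bit per codeword only; two flipped bits are miscorrected, so it does not replace the fail-safe path.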

29
Fault Identification at Run-Time

Faults must be identified in < TO
Fault identification requires redundancy
Redundancy can be in terms of
channel
device
data
control

Architectural

Detailed design
30
Fault Identification at Run-Time

Redundancy may be either
homogeneous (random faults only)
does not detect errors
perform functions the same way on the same thing
multiple times
heterogeneous (systematic and random faults)
includes errors -> present in all channels
perform processing differently and hopefully you
didn't make the same mistake!
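Heterogeneous redundancy can be sketched as two differently implemented computations of the same result, compared before the result is used. A minimal Python example, assuming a mean-of-samples calculation as the function under protection; all names are hypothetical.

```python
import math


class ChannelMismatch(Exception):
    """Raised when the two channels disagree."""


def mean_direct(samples):
    """Channel A: sum everything, then divide."""
    return sum(samples) / len(samples)


def mean_incremental(samples):
    """Channel B: running mean, a different algorithm for the same value."""
    mean = 0.0
    for count, value in enumerate(samples, start=1):
        mean += (value - mean) / count
    return mean


def redundant_mean(samples, rel_tol=1e-9):
    """Use the result only when both channels agree within tolerance."""
    a = mean_direct(samples)
    b = mean_incremental(samples)
    if not math.isclose(a, b, rel_tol=rel_tol, abs_tol=1e-12):
        raise ChannelMismatch(f"channel A={a!r}, channel B={b!r}")
    return a


print(redundant_mean([1.0, 2.0, 3.0, 4.0]))  # 2.5
```

Because the two channels do not share an algorithm, a coding error in one is unlikely to be mirrored in the other, which is what lets this form catch systematic faults as well as random ones.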

31
Fault Tree Analysis Symbology
[Symbol graphics not reproduced; each line
describes one symbol]
A condition that must be present to produce the
output of a gate
An event that results from a combination of
events through a logic gate
Transfer
A basic fault event that requires no further
development
A fault event that is not developed further
because the event is inconsequential or the
necessary information is not available
AND gate (also OR gate)
An event that is expected to occur normally
NOT gate
32
Subset of Pacemaker Fault Analysis
[Fault tree diagram; OR and AND gates connect the
events below]
Condition or event to avoid: Pacing too slowly
Secondary conditions or events: Shutdown fault,
Time-base fault, Invalid pacing rate
Primary or fundamental faults: Crystal failure,
Watchdog failure, Bad command rate, Data
corrupted in vivo, Software failure, CPU H/W
failure, Rate command corrupted, CRC hardware
failure
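A fault tree can be evaluated as nested AND/OR gates over a set of active faults. A minimal Python sketch using the event names from this slide. The wiring of events to gates below is an assumption made for illustration, since the diagram itself is not reproduced in the transcript.

```python
def OR(*children):
    return ("OR", children)


def AND(*children):
    return ("AND", children)


# ASSUMED wiring, for illustration only.
PACING_TOO_SLOWLY = OR(
    OR("Software failure", "CPU H/W failure"),                # Shutdown fault
    AND("Crystal failure", "Watchdog failure"),               # Time-base fault
    OR("Bad command rate",                                    # Invalid pacing rate
       AND("Rate command corrupted", "CRC hardware failure")),
)


def occurs(node, active_faults):
    """Return True when the event at this node occurs."""
    if isinstance(node, str):             # leaf: a primary fault
        return node in active_faults
    gate, children = node
    results = [occurs(child, active_faults) for child in children]
    return all(results) if gate == "AND" else any(results)


print(occurs(PACING_TOO_SLOWLY, {"Crystal failure"}))                      # False
print(occurs(PACING_TOO_SLOWLY, {"Crystal failure", "Watchdog failure"}))  # True
```

Under this wiring, any primary fault that sits below only OR gates is a single point failure, while faults below an AND gate need a second fault to produce the hazard.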
33
Eight steps to safety

Identify the hazards
Determine the risks
Define the safety measures
Create safe requirements
Create safe designs
Implement safety
Assure the safety process
Test, test, test

34
Safe Requirements

Requirements specification follows initial hazard
analysis
Specific requirements should track back to hazard
analysis
must be shown to FDA, etc
Architectural framework should be selected with
safety needs in mind
has the hooks in place

35
Eight steps to safety

Identify the hazards
Determine the risks
Define the safety measures
Create safe requirements
Create safe designs
Implement safety
Assure the safety process
Test, test, test

36
Use Good Design Practices

Good design practices allow you to
manage complexity
view the system at various levels of abstraction
zoom in on a particular area of interest
identify hot spots of special concern
have consistent quality
easily test
build and use high quality components
Regulatory agencies look at this!!

37
Use Good Design Practices

Manage your requirements
trace requirements to design elements
trace design elements back to requirements

[Diagram: use cases in the requirements
specification (remote communications, adjust
trajectory) traced to classes a through e in the
design model]
38
Use Good Design Practices

Use iterative development
integrating many times finds more defects
iterative prototypes can result in more reliable
and safe systems

39
Use Good Design Practices

Use component-based design architectures
third party components may be very well tested if
they are in wide use
require bug lists from component vendors
this bit Microsoft once

40
Use Good Design Practices

Use Visual Modeling
UML
Ward-Mellor
Use executable models
animate models
execute and debug at modeling level of abstraction

41
Use Good Design Practices

Use frameworks
a framework is a partially completed application
which is specialized by the user
Microsoft foundation classes
Object Execution Framework
frameworks reduce the work of developing new
applications
frameworks rely on well-tested patterns

42
Use Good Design Practices
[Diagram: User Model, Framework, System]

80-90% of application code is housekeeping code
43
Use Good Design Practices

Use Configuration Management
only use unit-tested components in builds

[Diagram: CM Database and SYSTEM; components:
parameters, data acquisition, drivers, OS]
44
Use Good Design Practices

Design for test
product testing
built-in-testing to ensure
invariants are truly invariant
functional invariants
quality of service invariants (e.g. performance)
faults are detected

45
Good Design Practices

Isolate Safety Functions
Safety-relevant systems are 200-300% more effort
to produce
Isolation of safety systems allows more expedient
development
Care must be taken that the safety system is
truly isolated so that a defect in the non-safety
system cannot affect the safety system
different processor
different heavy-weight tasks (depends on the OS)

46
Safety Critical Patterns
47
Safety Architecture Patterns

Protected Single-Channel Pattern
Dual-Channel Pattern
Homogeneous Dual-Channel Pattern
Heterogeneous Peer-Channel Pattern
Sanity Check Pattern
Monitor-Actuator Pattern
Voting Multichannel Pattern

48
Protected Single Channel Pattern

Within the single channel, mechanisms exist to
identify and handle faults
All faults must be detected within the fault
tolerance time
May be impossible
to test for all faults within the fault tolerance
time
to remove common mode failures from the single
channel
Generally, less recurring system cost
no additional hardware required

49
Protected Single Channel Pattern
If I'm not getting life ticks, I'll shut down!
Single Channel Train Braking System
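The life-tick idea can be sketched as a monitor that latches a shutdown when ticks stop arriving. A minimal Python example, assuming time is passed in as integer milliseconds so the logic can be exercised without a real clock; the class name and the 25 ms timeout are hypothetical.

```python
class LifeTickMonitor:
    """Shuts down when the monitored channel stops sending life ticks."""

    def __init__(self, timeout_ms, now_ms):
        self.timeout_ms = timeout_ms
        self.last_tick_ms = now_ms
        self.shut_down = False

    def tick(self, now_ms):
        """Called by the monitored channel to show it is still alive."""
        if not self.shut_down:
            self.last_tick_ms = now_ms

    def poll(self, now_ms):
        """Called periodically; returns True once shutdown is latched."""
        if now_ms - self.last_tick_ms > self.timeout_ms:
            self.shut_down = True   # fail-safe state; stays latched
        return self.shut_down


monitor = LifeTickMonitor(timeout_ms=25, now_ms=0)
monitor.tick(20)
print(monitor.poll(40))   # False: last tick was 20 ms ago
print(monitor.poll(50))   # True: 30 ms without a tick
```

The shutdown is latched on purpose: a late tick after a missed deadline does not bring the channel back without an explicit restart.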
50
Dual Channel Architecture Patterns

Separation of safety-relevant from non-safety
relevant where possible
Separation of monitoring from control
Generally easier to meet safety requirements
timing
common mode failures
Generally higher recurring system cost
additional hardware required

51
Homogeneous Dual-Channel Pattern

Identical channels used
Channels may operate simultaneously (Multichannel
Vote Pattern)
Channels may operate in series (Backup Pattern)
Good at identifying random faults but not
systematic faults
Low R&D cost, higher recurring cost
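When identical channels run simultaneously, their outputs can be compared by a voter. A minimal Python sketch of a strict-majority voter, assuming channel outputs are hashable values; the names are hypothetical.

```python
from collections import Counter


class NoMajority(Exception):
    """Raised when no value is reported by more than half the channels."""


def vote(channel_outputs):
    """Return the value reported by a strict majority of channels."""
    if not channel_outputs:
        raise NoMajority("no channel outputs")
    value, count = Counter(channel_outputs).most_common(1)[0]
    if count * 2 <= len(channel_outputs):
        raise NoMajority(f"outputs disagree: {channel_outputs!r}")
    return value


print(vote([72, 72, 15]))   # 72: the faulty third channel is outvoted
```

With only two channels a disagreement can be detected but not resolved, so `vote([72, 15])` raises `NoMajority` and the caller has to fall back to a fail-safe state.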

52
Homogeneous Dual-Channel Pattern
53
Heterogeneous Peer-Channel Pattern

Equal weight, differently implemented channels
may use algorithmic inversion to recreate initial
data
may use different algorithm
may use different teams (not fool proof because
of hot spots that can cause failures)
Good at identifying both random and systematic
faults

54
Heterogeneous Peer-Channel Pattern

Generally safest, but higher R&D and recurring
cost

55
Heterogeneous Peer-Channel Pattern
56
Sanity Check Pattern

A primary actuator channel does real computations
A light-weight secondary channel checks the
reasonableness of the primary channel
Good for detection of both random and systematic
faults
May not detect faults which result in small
variance
Relatively inexpensive to implement
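A sanity check can be as small as a range test on the primary channel's output. A minimal Python sketch, assuming a pacing-interval calculation as the primary computation; the limits and fail-safe value are hypothetical placeholders, not clinical figures.

```python
def primary_pacing_interval_ms(rate_bpm):
    """Primary channel: the real computation."""
    return 60_000.0 / rate_bpm


def is_reasonable(interval_ms, low_ms=300.0, high_ms=2000.0):
    """Secondary channel: a cheap range check, not a recomputation."""
    return low_ms <= interval_ms <= high_ms


def checked_interval_ms(rate_bpm, fail_safe_ms=1000.0):
    """Use the primary result only when the secondary channel accepts it."""
    try:
        interval = primary_pacing_interval_ms(rate_bpm)
    except ZeroDivisionError:
        return fail_safe_ms
    return interval if is_reasonable(interval) else fail_safe_ms


print(checked_interval_ms(75))    # 800.0, accepted
print(checked_interval_ms(600))   # 1000.0, primary result rejected
```

The range check accepts any value inside the limits, so a primary result that is wrong by a small amount passes unnoticed.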

57
Monitor-Actuator Pattern

Separates actuation from the monitoring of that
actuation
If the actuator channel fails, the monitor
channel detects it
If the monitor channel fails, the actuator
channel continues correctly
Requires fault isolation to be single-fault
tolerant
actuator channel cannot use the monitor itself

58
Monitor-Actuator Pattern
59
Dual-Channel Design Architecture
60
Safety Executive Pattern

Large scale architectural pattern
Controller subsystem (safety executive)
One or more watchdog subsystems
check on system health
ensure proper actuation is occurring
One or more actuation channels
Recovery subsystem (Fail safe processing channel)

61
Safety Executive Pattern

Appropriate when
A set of fail-safe system states needs to be
entered when failures identified
Determination of failures is complex
Several safety-related system actions are
controlled simultaneously
Safety-related actions are not independent
Determining proper safety action in the event of
a failure can be complex

62
Safety Executive Pattern
63
Eight steps to safety

Identify the hazards
Determine the risks
Define the safety measures
Create safe requirements
Create safe designs
Implement safety
Assure the safety process
Test, test, test

64
Detailed Design for Safety

Make it right before you make it fast
simple, clear algorithms and code
optimize only the 10-20% of code which affects
performance
use safe language subsets
ensure you haven't introduced any common mode
failures

65
Detailed Design for Safety

Thoroughly test
unit test and peer review
integration test
validation test

66
Detailed Design for Safety

Verify that it remains right throughout program
execution
exceptions
invariant assertions
range checking
index and boundary checking
When it's not right during execution, then make it
right with corrective or protective measures

67
Detailed Design for Safety

Use safe language subsets
strong compile-time checking
if you use C, use lint
strong run-time checking
exception handling
avoid void*
avoid error prone statements and syntax
you can make C safe but it's not safe out of the
box

68
Detailed Design for Safety

Language choice
Compile time checking (C versus Ada)
Run-time checking (C versus Ada)
Exceptions versus error codes
Language selection
C treats you like a consenting adult. Pascal
treats you like a naughty child. Ada treats you
like a criminal

69
Pascal example

program WontCompile;
type
  MySubRange = 10 .. 20;
  Day = (Monday, Tuesday, Wednesday, Thursday,
         Friday, Saturday, Sunday);
var
  MyVar : MySubRange;
  MyDate : Day;
begin
  MyVar := 9;    { will not compile -- range error! }
  MyDate := 0;   { will not compile -- wrong type! }
end.

70
Ada example
procedure MyProc is
   -- Byte is assumed to be a narrow integer type declared elsewhere
   MyArray : array (1 .. 10) of Integer;
   b : Byte;
begin
   for j in 0 .. 10 loop
      MyArray (j) := j * 6;   -- raises exception on first time through
   end loop;
   b := MyArray (10);         -- will fail run-time range check
end MyProc;
71
Exceptions

Some languages (Pascal, Modula-2) have a
draconian error handling policy
exception raised and program terminated
not good for embedded systems
Ada and C++ allow run-time recovery through
user-defined exceptions and exception handlers

72
Exceptions

A lot of extra code is needed to check the
statement
a[j] = b

73
Detailed Design for Safety

Do not allow ignoring of error indications
checking of return values is a manual process
user of the function must remember each and every
time
easy to circumvent this error handling system
Separate normal code from error handling code

74
Detailed Design for Safety

Handle errors at the lowest level with sufficient
context to correct the problem

75
Error handling code

a = getfone(b, c);
if (a)
    switch (a) {
        case 1: ..
        case 2: ..
    }
d = getftwo(b, c);
if (d)
    switch (d) {
        case 1: ..
        case 2: ..
    }

In this code the normal execution path is just
a = getfone(b, c); d = getftwo(b, c);
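For contrast, the same two-step flow can be written with exceptions so that the normal path reads straight through. A minimal Python sketch; `get_f_one`, `get_f_two`, their failure conditions, and `StepFailed` are hypothetical stand-ins for the functions on the slide.

```python
class StepFailed(Exception):
    """Carries which step failed and its error code."""

    def __init__(self, step, code):
        super().__init__(f"{step} failed with code {code}")
        self.step = step
        self.code = code


def get_f_one(b, c):
    if c == 0:                       # hypothetical failure condition
        raise StepFailed("get_f_one", 1)
    return b / c


def get_f_two(b, c):
    if b < 0:                        # hypothetical failure condition
        raise StepFailed("get_f_two", 2)
    return b + c


def compute(b, c):
    try:
        # Normal execution path, free of error checks.
        a = get_f_one(b, c)
        d = get_f_two(b, c)
        return a, d
    except StepFailed as err:
        # Error handling kept in one place, apart from the normal path.
        return ("failed", err.step, err.code)


print(compute(6, 3))   # (2.0, 9)
print(compute(1, 0))   # ('failed', 'get_f_one', 1)
```

The caller cannot silently ignore a failure here: an unhandled `StepFailed` propagates instead of being dropped like an unchecked return value.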
76
Built-in exception types

procedure enqueue (q : in out queue; v : in FLOAT)
is
begin
   if full (q) then
      raise overflow;
   end if;
   q.body ((q.head + q.length) mod qSize) := v;
   q.length := q.length + 1;
end enqueue;

77
Caller of the sequence handles exception

procedure testQ (q : in out queue) is
begin
   for j in 1 .. 10 loop
      enqueue (q, random (1000));
   end loop;
exception
   when overflow =>
      puts ("Test failed due to queue overflow");
end testQ;

78
C++ exception handling

Extends capabilities beyond that of Ada
Exceptions extended by type rather than value
possible to create hierarchies of exception
classes and catch by thrown subclass type
class can contain different types of information
about the kind of device that failed
this facilitates error recovery, debugging, and
user error reporting
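Catching by exception type, with the exception object carrying device details, can be sketched as follows. This is a Python illustration of the idea rather than C++ code, and every class, device name, and recovery string is hypothetical.

```python
class DeviceFault(Exception):
    """Base class; records which device failed and how."""

    def __init__(self, device_id, detail):
        super().__init__(f"{device_id}: {detail}")
        self.device_id = device_id
        self.detail = detail


class SensorFault(DeviceFault):
    pass


class ActuatorFault(DeviceFault):
    pass


def read_pressure():
    raise SensorFault("pressure-1", "no signal")


def open_valve():
    raise ActuatorFault("valve-2", "stuck closed")


def recover(action):
    """Run an action and choose a recovery based on the thrown type."""
    try:
        action()
        return "ok"
    except SensorFault as fault:      # most specific handler first
        return f"switch to backup sensor for {fault.device_id}"
    except DeviceFault as fault:      # any other device fault
        return f"enter fail-safe state ({fault.device_id})"


print(recover(read_pressure))
print(recover(open_valve))
```

The subclass handler gives a targeted recovery, while the base-class handler acts as the catch-all for device faults that have no specific recovery.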

79
Making C++ safe

Overloading the [] operator with index range
checking improves the safety of arrays
Making classes of scalars and overloading the
assignment operator allows additional range and
value checking
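The slide's technique is C++ operator overloading. The same idea can be sketched in Python, where `__getitem__`/`__setitem__` stand in for the index operator and a property setter stands in for the assignment operator. All class names are hypothetical.

```python
class RangedInt:
    """An integer that refuses values outside [low, high]."""

    def __init__(self, low, high, value):
        self.low, self.high = low, high
        self.value = value            # goes through the checked setter

    @property
    def value(self):
        return self._value

    @value.setter
    def value(self, new_value):
        if not self.low <= new_value <= self.high:
            raise ValueError(f"{new_value} outside {self.low}..{self.high}")
        self._value = new_value


class CheckedArray:
    """Fixed-size array whose index operator rejects bad indices."""

    def __init__(self, size, fill=0):
        self._items = [fill] * size

    def _check(self, index):
        # Plain Python lists accept negative indices; reject them here.
        if not 0 <= index < len(self._items):
            raise IndexError(f"index {index} outside 0..{len(self._items) - 1}")
        return index

    def __getitem__(self, index):
        return self._items[self._check(index)]

    def __setitem__(self, index, value):
        self._items[self._check(index)] = value


rate = RangedInt(10, 20, 15)
rate.value = 20          # accepted
cells = CheckedArray(10)
cells[9] = 54            # accepted; cells[10] or cells[-1] would raise
```

A rejected assignment leaves the previous value in place, so the variable never holds an out-of-range value.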

80
Detailed Design for Safety

Data Validity Checks
CRC (16 bit or 32 bit)
identifies all single or dual bit errors
detects high percentage of multiple bit errors
table or compute-driven
chips are available
checksum
redundant storage
one's complement

81
Detailed Design for Safety

Redundancy should be set every write access
Data should be checked every read access
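The write/read rule can be sketched as a wrapper that recomputes a CRC on every write and verifies it on every read. A minimal Python example using the standard library's `zlib.crc32`; the class and exception names are hypothetical.

```python
import struct
import zlib


class DataCorrupted(Exception):
    """Raised when stored data no longer matches its CRC."""


class ProtectedFloat:
    """A float stored together with a 32-bit CRC of its bytes."""

    def __init__(self, value):
        self.write(value)

    def write(self, value):
        self._raw = struct.pack("<d", value)
        self._crc = zlib.crc32(self._raw)        # redundancy set on every write

    def read(self):
        if zlib.crc32(self._raw) != self._crc:   # data checked on every read
            raise DataCorrupted("CRC mismatch")
        return struct.unpack("<d", self._raw)[0]


dose = ProtectedFloat(2.5)
print(dose.read())   # 2.5
```

Flipping any single bit of the stored bytes makes the next `read()` raise `DataCorrupted`, which gives the caller the chance to take corrective action or go to a fail-safe state.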

82
ANSI C++ Exception Class Hierarchy
exception
    logic_error
        domain_error
        invalid_argument
        length_error
        out_of_range
    runtime_error
        range_error
        overflow_error
83
Eight steps to safety

Identify the hazards
Determine the risks
Define the safety measures
Create safe requirements
Create safe designs
Implement safety
Assure the safety process
Test, test, test

84
Safety Process (Development)

Do Hazard Analysis early and often
Track safety measures from hazard analysis to
requirements specification
design
code
validation tests
Test safety measures with fault seeding

85
Safety Process (Deployment)

Install safely
ensure proper means are used to set up system
safety measures are installed and checked
Deploy safely
ensure safety measures are periodically checked
and serviced
Do not turn off safety measures
Decommission safely
removal of hazardous materials

86
IEC Overall Safety Lifecycle
Concept
Overall scope definition
Hazard and risk analysis
Overall safety requirements
Safety requirements allocation
Overall planning
    Ops and maintenance planning
    Validation planning
    Installation planning
SRS E/E/PES realization
SRS other technology realization
External risk reduction facilities
Overall installation and commissioning
Overall safety validation
Overall operation and maintenance
Overall modification and retrofit
Decommissioning
Key: SRS = Safety Related System; E/E/PES =
Electrical/Electronic/Programmable electronic
system
87
Eight steps to safety

Identify the hazards
Determine the risks
Define the safety measures
Create safe requirements
Create safe designs
Implement safety
Assure the safety process
Test, test, test

88
Safety in Testing in R&D

Use fault-seeding
Unit (class) testing
white box
procedural invariant violation assertions
peer reviews
Integration testing
grey box
Validation testing
black box
externally caused faults
(Grey box) internally seeded faults

89
Safety Testing During Operation

Power on Self-Test (POST)
Check for latent faults
All safety measures must be tested at power on
and periodically
RAM (stuck-at, shorts, cell failures)
ROM
Flash
Disks
CPU
Interfaces
Buses
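A power-on RAM check for stuck-at faults can be sketched as a pattern write and read-back over every address. The Python example below runs against a simulated RAM so that a fault can be seeded. The class, the function, and the 0x55/0xAA patterns are illustrative assumptions, and this simple test does not cover shorts between cells.

```python
class SimulatedRam:
    """Byte-wide RAM model with an optional seeded stuck-at-0 bit."""

    def __init__(self, size, stuck_at_zero=None):
        self._cells = bytearray(size)
        self._stuck = stuck_at_zero    # (address, bit) forced to 0, or None

    def write(self, addr, value):
        if self._stuck and self._stuck[0] == addr:
            value &= ~(1 << self._stuck[1]) & 0xFF
        self._cells[addr] = value

    def read(self, addr):
        return self._cells[addr]


def ram_pattern_test(ram, size):
    """Write and read back 0x55 and 0xAA; return the failing addresses."""
    failures = []
    for addr in range(size):
        saved = ram.read(addr)
        for pattern in (0x55, 0xAA):   # together they toggle every bit
            ram.write(addr, pattern)
            if ram.read(addr) != pattern:
                failures.append(addr)
                break
        ram.write(addr, saved)         # restore the original content
    return failures


print(ram_pattern_test(SimulatedRam(16), 16))                        # []
print(ram_pattern_test(SimulatedRam(16, stuck_at_zero=(5, 0)), 16))  # [5]
```

Seeding the fault in the simulation is also a small instance of fault seeding: it confirms that the test itself can detect the failure it is meant to catch.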

90
Safety Testing During Operation

Built-In Tests
Repeats some of POST
Data integrity checks
Index and pointer validity checking
Subrange value invariant assertions
Proper functioning
Watchdogs
Reasonableness checks
Lifeticks

91
A Simplified Example: A Linear Accelerator
92
Unsafe Linear Accelerator
[Diagram: CPU and Sensor; signals: Radiation
Dose, Beam Intensity, Beam Duration]
1. Set Dose 2. Start Beam 3. End Beam
93
Fault Tree Analysis
[Fault tree diagram; AND and OR gates connect the
events below]
Over radiation
Radiation command invalid
CPU Halted
Shutoff timer failure
Beam engaged
CPU failure
Software defect (appears twice)
EMI (appears twice)
94
Hazards of the Linear Accelerator

Hazard                            | Level of risk | Tolerance time (T1) | Fault                               | Likelihood | Detection time | Control measure                          | Exposure time
Over radiation                    | Severe        | 100 ms              | CPU locks up                        | rare       | 50 ms          | Safety CPU checks lifetick at 25 ms      | 50 ms
                                  |               |                     | Corrupt data settings               | often      | 10 ms          | 32 bit CRCs on data checked every access | 15 ms
Under radiation                   | Moderate      | 2 weeks             | Corrupt data setting                | often      | 10 ms          | 32 bit CRCs on data checked every access | 15 ms
Inadvertent radiation on power on | Severe        | 100 ms              | Beam left engaged during power down | often      | n/a            | Curtain mechanically shuts at power down | 0 ms
95
Safe Linear Accelerator
[Diagram: CPU, Safety CPU, and Sensor; signals:
Radiation Dose, Beam Intensity, Beam Duration,
Deenergize]
Self test results shared prior to operation
Periodic watchdog service
Mechanical shutoff when curtain low
1. Set Dose 2. Start Beam 3. End Beam
96
Summary

Safety is a system issue
It is cheaper and more effective to include
safety early on than to add it later
Safety architectures provide
programming-in-the-large safety
Safe coding rules and detailed design provide
programming-in-the-small safety
