Title: Safety Critical Systems
1. Safety Critical Systems
2. Eight steps to safety
- Identify the hazards
- Determine the risks
- Define the safety measures
- Create safe requirements
- Create safe designs
- Implement safety
- Assure the safety process
- Test, test, test
3. Eight steps to safety
- The same eight steps, annotated with two callouts:
  - Safety analysis
  - Handled at the architectural level and mechanistic level
4. Safety Analysis
- You must identify the hazards of the system
- You must identify the faults that can lead to hazards
- You must define safety control measures to handle hazards
- These culminate in the Hazard Analysis
- The Hazard Analysis feeds into the Requirements Specification
6. Hazard Causes
- Release of energy
  - electromagnetism (microwave oven)
  - radiation (nuclear power plant)
  - electricity (electrocution hazard from ECG leads)
  - heat (infant warmer)
  - kinetic (runaway train)
- Release of toxins
7. Hazard Causes
- Interference with life support or other safety-related function
- Misleading safety personnel
- Failure to alarm
- Alarming too much
  - Therac-25: the alarms were ignored and people were killed
8. Types of Hazards
- Actions
  - inappropriate system actions taken
    - F-18 pilot pulling up landing gear
  - appropriate system actions not taken
- Timing
  - too soon
  - too late
  - fault latency time
9. Types of Hazards
- Sequence
  - skipping actions
  - actions out of order
- Amount
  - too much
  - too little
10. Example Hazards
- Actions
  - incorrectly energizing a medical treatment laser
  - failure to engage landing gear
- Timing
  - cardiac pacemaker paces too fast
  - flight control surface adjusted too slowly
11. Example Hazards
- Sequence
  - empty the vat, THEN add the reagent
  - out-of-sequence network packets controlling an industrial robot
- Amount
  - electrocution from muscle stimulator
  - too little oxygen delivered to ventilator patient
12. Means of Hazard Control
- Obviation: the possibility of the hazard can be removed by being made physically impossible
  - use incompatible fasteners to prevent cross connections
- Education: the hazard can be handled by educating the users so that they won't create hazardous conditions through equipment misuse
  - don't look down the barrel when cleaning your rifle
13. Means of Hazard Control
- Alarming: announcing the hazard to the user when it appears so that they can take appropriate action
  - alarming when the heart stops beating
- Interlocks: the hazard can be removed by using secondary devices and/or logic to intercede when a hazard presents itself
  - car won't start unless it is in Park
14. Means of Hazard Control
- Internal checking: the hazard can be handled by ensuring that a system can detect that it is malfunctioning prior to an incident
  - CRC checks data for corruption whenever it is accessed
- Safety equipment
  - goggles, gloves
15. Means of Hazard Control
- Restricting access to potential hazards so that only knowledgeable users have such access
  - using passwords to prevent inadvertently starting service mode
- Labelling
  - High Voltage -- DO NOT TOUCH
16. Hazard Analysis
Each column of the hazard analysis answers one question:
- Hazard: the hazardous condition
- Level of risk: how bad if it occurs?
- Tolerance time: how long can it be tolerated?
- Fault: how can this happen?
- Likelihood: how frequently?
- Detection time: how long to discover?
- Control measure: what do you do about it?
- Exposure time: how long is the exposure to hazard?

Hazard | Level of risk | Tolerance time T1 | Fault | Likelihood | Detection time | Control measure | Exposure time
Hypoventilation | Severe | 5 min | Ventilator fails | rare | 30 sec | Independent pressure alarm, action by doctor | 1 min
 | | | Esophageal intubation | often | 30 sec | CO2 sensor alarm | 1 min
 | | | User misattaches breathing circuit | often | 0 | Noncompatible mechanical fasteners used | 0
Overpressure | Severe | 250 ms | Release valve failure | rare | 50 ms | Secondary valve opens | 55 ms
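The timing columns of a hazard analysis lend themselves to a mechanical check. The following is a minimal C++ sketch under an assumption the slide does not state explicitly: a hazard is treated as controlled when the fault is detected within the exposure window and the exposure ends before the tolerance time. `HazardEntry` and `isControlled` are hypothetical names introduced for illustration.

```cpp
#include <chrono>
#include <string>

// Hypothetical row of a hazard analysis (timing columns only).
struct HazardEntry {
    std::string fault;
    std::chrono::milliseconds tolerance;  // how long the hazard can be tolerated
    std::chrono::milliseconds detection;  // how long it takes to discover the fault
    std::chrono::milliseconds exposure;   // how long the user is exposed to the hazard
};

// Assumed acceptance rule (not taken from the slides): detection must fit
// inside the exposure window, and exposure must end before the tolerance time.
inline bool isControlled(const HazardEntry& e) {
    return e.detection <= e.exposure && e.exposure < e.tolerance;
}
```

With the overpressure figures from the table (250 ms tolerance, 50 ms detection, 55 ms exposure) the check passes; a control measure that took 300 ms would fail it.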
17. When is a system safe enough?
- (Minimal) No hazards in the absence of faults
- (Minimal) No hazards in the presence of any single point failure
  - a common mode failure is a single point failure that affects multiple channels
  - a latent fault is an undetected fault which allows another fault to cause a hazard
- Your mileage may vary depending on the risk introduced by your system
18. Safety Measures
- You cannot depend on a safety measure that you cannot test!
  - A CAN bus with 2 nodes provides a CRC on messages, checked at the chip level, but the chips provide no way of testing whether it is working.
  - Therefore, it cannot be relied on as a safety measure
19. Fail-Safe States
- Off
  - Emergency stop -- immediately cut power
  - Production stop -- stop after the current task
  - Protection stop -- shut down without removing power
- Partial Shutdown
  - Degraded level of functionality
20. Fail-Safe States
- Hold
  - No functionality, but with safety actions taken
- Manual or External control
- Restart (reboot)
22. Risk Assessment
- For each hazard
  - determine the potential severity
  - determine the likelihood of the hazard
  - determine how long the user is exposed to the hazard
  - determine whether the risk can be removed
23. TUV Risk Level Determination Chart

S | E | G | W3 | W2 | W1
S1 | any | any | 1 | - | -
S2 | E1 | G1 | 2 | 1 | -
S2 | E1 | G2 | 3 | 2 | 1
S2 | E2 | G1 | 4 | 3 | 2
S2 | E2 | G2 | 5 | 4 | 3
S3 | E1 | any | 6 | 5 | 4
S3 | E2 | any | 7 | 6 | 5
S4 | any | any | 8 | 7 | 6

Risk parameters
- S: Extent of damage
  - S1: slight injury
  - S2: severe irreversible injury to one or more persons, or the death of a single person
  - S3: death of several persons
  - S4: catastrophic consequences, several deaths
- E: Exposure time
  - E1: seldom to relatively infrequent
  - E2: frequent to continuous
- G: Hazard prevention
  - G1: possible under certain conditions
  - G2: hardly possible
- W: Occurrence probability of hazardous event
  - W1: very low
  - W2: low
  - W3: relatively high
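The chart can be encoded as a small lookup. This is a minimal C++ sketch, assuming the chart reads as eight rows (S1; S2 with E1/E2 and G1/G2; S3 with E1/E2; S4) whose level drops by one for each step from W3 to W1. The function name `tuvRiskLevel`, the integer encoding of the parameters, and the `kNoRiskLevel` marker for a "-" cell are assumptions made for illustration.

```cpp
#include <stdexcept>

const int kNoRiskLevel = 0;  // hypothetical marker for a "-" cell in the chart

// Parameters are the chart indices: s in 1..4, e in 1..2, g in 1..2, w in 1..3.
// E is ignored for S1 and S4; G is ignored for S1, S3 and S4.
inline int tuvRiskLevel(int s, int e, int g, int w) {
    if (s < 1 || s > 4 || e < 1 || e > 2 || g < 1 || g > 2 || w < 1 || w > 3)
        throw std::invalid_argument("risk parameter outside the chart");

    int levelAtW3 = 0;  // the chart row, expressed as its W3 column
    switch (s) {
        case 1:  levelAtW3 = 1; break;
        case 2:  levelAtW3 = 2 + 2 * (e - 1) + (g - 1); break;  // levels 2..5
        case 3:  levelAtW3 = 6 + (e - 1); break;                // levels 6..7
        default: levelAtW3 = 8; break;                          // S4
    }
    // Each step down in occurrence probability lowers the level by one.
    const int level = levelAtW3 - (3 - w);
    return level >= 1 ? level : kNoRiskLevel;
}
```

For example, `tuvRiskLevel(2, 2, 2, 3)` selects the S2/E2/G2 row in the W3 column, which the chart lists as level 5.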
24. Sample Risk Assessments

Device | Hazard | Extent of damage | Exposure time | Hazard prevention | Probability | TUV risk level
Microwave oven | Irradiation | S2 | E2 | G2 | W3 | 5
Pacemaker | Pace too slowly | S2 | E2 | G2 | W3 | 5
Pacemaker | Pace too fast | S2 | E2 | G2 | W3 | 5
Power station | Explosion | S3 | E1 | -- | W3 | 6
Airliner | Crash | S4 | E2 | G2 | W2 | 8
26. Safety Measures
- Safety measures do one of the following
  - remove the hazard
  - reduce the risk
  - identify the hazard to supervisory control
- The purpose of the safety measure is to ensure the system remains in a safe state
27. Safety Measures
- Adequacy of measures
  - safety measures must be able to reliably detect the fault
  - safety measures must be able to take appropriate actions

Component | Fault/Error | Software class (1, 2) | Examples of acceptable measures
Interrupt handling and execution | no interrupt or too frequent | rq | functional test or time-slot monitoring
 | no interrupt or too frequent and interrupt related to different sources | rq | comparison of redundant functional channels by either reciprocal comparison, independent hardware comparator, or independent time-slot and logical monitoring
28. Risk Reduction
- Identify the fault
- Take corrective action, either
  - use redundancy to correct and move on
    - feedforward error correction (Hamming codes)
  - redo the computational step
    - feedback error detection (take corrective action first)
  - go to a fail-safe state
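Feedforward error correction with a Hamming code can be sketched as follows. This is a minimal C++ illustration of the textbook Hamming(7,4) code, which corrects any single-bit error in a 7-bit codeword; the function names and the bit layout (parity at codeword positions 1, 2 and 4) are choices made for this sketch, not something the slides specify.

```cpp
#include <cstdint>

// Encode 4 data bits (low nibble) into a 7-bit Hamming(7,4) codeword.
// Bit i of the result (0-based) is codeword position i+1;
// positions 1, 2 and 4 hold parity, positions 3, 5, 6 and 7 hold data.
inline std::uint8_t hammingEncode(std::uint8_t data) {
    const unsigned d1 = (data >> 0) & 1u, d2 = (data >> 1) & 1u;
    const unsigned d3 = (data >> 2) & 1u, d4 = (data >> 3) & 1u;
    const unsigned p1 = d1 ^ d2 ^ d4;  // covers positions 3, 5, 7
    const unsigned p2 = d1 ^ d3 ^ d4;  // covers positions 3, 6, 7
    const unsigned p4 = d2 ^ d3 ^ d4;  // covers positions 5, 6, 7
    return static_cast<std::uint8_t>(p1 | (p2 << 1) | (d1 << 2) | (p4 << 3) |
                                     (d2 << 4) | (d3 << 5) | (d4 << 6));
}

// Decode a codeword, correcting at most one flipped bit, and return the data.
inline std::uint8_t hammingDecode(std::uint8_t code) {
    unsigned syndrome = 0;
    for (unsigned pos = 1; pos <= 7; ++pos)
        if ((code >> (pos - 1)) & 1u) syndrome ^= pos;  // XOR of set positions
    if (syndrome != 0)  // a non-zero syndrome is the position of the bad bit
        code = static_cast<std::uint8_t>(code ^ (1u << (syndrome - 1)));
    return static_cast<std::uint8_t>(((code >> 2) & 1u) |
                                     (((code >> 4) & 1u) << 1) |
                                     (((code >> 5) & 1u) << 2) |
                                     (((code >> 6) & 1u) << 3));
}
```

The redundancy lets the receiver correct and move on without asking for the step to be redone. Two or more flipped bits are beyond what this code can correct.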
29. Fault Identification at Run-Time
- Faults must be identified in < TO
- Fault identification requires redundancy
- Redundancy can be in terms of
  - channel
  - device
  - data
  - control
- Callouts on the slide: Architectural, Detailed design
30. Fault Identification at Run-Time
- Redundancy may be either
  - homogeneous (random faults only)
    - does not detect errors
    - perform functions the same way on the same thing multiple times
  - heterogeneous (systematic and random faults)
    - includes errors -> present in all channels
    - perform processing differently and hopefully you didn't make the same mistake!
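Heterogeneous redundancy can be illustrated on a small computation. In this minimal C++ sketch, two channels compute an integer square root in different ways (one with floating point, one with an integer binary search) and the result is accepted only when they agree. The choice of computation and all function names are hypothetical; the point is only that the two channels do not share an algorithm.

```cpp
#include <cmath>
#include <cstdint>
#include <stdexcept>

// Channel A: floating-point square root, truncated toward zero.
inline std::uint32_t isqrtFloat(std::uint32_t x) {
    return static_cast<std::uint32_t>(std::sqrt(static_cast<double>(x)));
}

// Channel B: integer-only binary search for the largest r with r*r <= x.
inline std::uint32_t isqrtSearch(std::uint32_t x) {
    std::uint64_t lo = 0, hi = 65536;  // invariant: lo*lo <= x < hi*hi
    while (hi - lo > 1) {
        const std::uint64_t mid = (lo + hi) / 2;
        if (mid * mid <= x) lo = mid; else hi = mid;
    }
    return static_cast<std::uint32_t>(lo);
}

// Heterogeneous check: accept the result only if both channels agree.
inline std::uint32_t checkedIsqrt(std::uint32_t x) {
    const std::uint32_t a = isqrtFloat(x);
    const std::uint32_t b = isqrtSearch(x);
    if (a != b) throw std::runtime_error("channel mismatch");
    return a;
}
```

A homogeneous version would call the same function twice: that can expose a transient random fault, but a coding mistake would produce the same wrong answer in both calls.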
31. Fault Tree Analysis Symbology
- A condition that must be present to produce the output of a gate
- An event that results from a combination of events through a logic gate
- Transfer
- A basic fault event that requires no further development
- A fault event that is not developed further because the event is inconsequential or the necessary information is not available
- AND gate (also OR gate)
- An event that is expected to occur normally
- NOT gate
32. Subset of Pacemaker Fault Analysis
[Figure: fault tree]
- Condition or event to avoid: Pacing too slowly
- Secondary conditions or events (joined by an OR gate): Shutdown fault, Time-base fault, Invalid pacing rate
- Primary or fundamental faults (combined through OR and AND gates): Crystal failure, Watchdog failure, Bad command rate, Data corrupted in vivo, Software failure, CPU H/W failure, Rate command corrupted, CRC hardware failure
34. Safe Requirements
- Requirements specification follows initial hazard analysis
- Specific requirements should track back to hazard analysis
  - must be shown to FDA, etc.
- Architectural framework should be selected with safety needs in mind
  - has the hooks in place
36. Use Good Design Practices
- Good design practices allow you to
  - manage complexity
  - view the system at various levels of abstraction
  - zoom in on a particular area of interest
  - identify hot spots of special concern
  - have consistent quality
  - easily test
  - build and use high quality components
- Regulatory agencies look at this!!
37. Use Good Design Practices
- Manage your requirements
  - trace requirements to design elements
  - trace design elements back to requirements
[Figure: traceability between the requirements specification (use cases: remote communications, adjust trajectory) and the design model (class a, class b, class c, class d, class e)]
38. Use Good Design Practices
- Use iterative development
  - integrating many times finds more defects
  - iterative prototypes can result in more reliable and safe systems
39. Use Good Design Practices
- Use component-based design architectures
  - third party components may be very well tested if they are in wide use
  - require bug lists from component vendors
    - this bit Microsoft once
40. Use Good Design Practices
- Use Visual Modeling
  - UML
  - Ward-Mellor
- Use executable models
  - animate models
  - execute and debug at modeling level of abstraction
41. Use Good Design Practices
- Use frameworks
  - a framework is a partially completed application which is specialized by the user
    - Microsoft Foundation Classes
    - Object Execution Framework
  - frameworks reduce the work of developing new applications
  - frameworks rely on well-tested patterns
42. Use Good Design Practices
[Figure labels: User Model, Framework, System]
- 80-90% of application code is housekeeping code
43. Use Good Design Practices
- Use Configuration Management
  - only use unit-tested components in builds
[Figure labels: parameters, data acquisition, SYSTEM, CM Database, drivers, OS]
44. Use Good Design Practices
- Design for test
  - product testing
  - built-in-testing to ensure
    - invariants are truly invariant
      - functional invariants
      - quality of service invariants (e.g. performance)
    - faults are detected
45. Good Design Practices
- Isolate Safety Functions
  - Safety-relevant systems are 200-300% more effort to produce
  - Isolation of safety systems allows more expedient development
  - Care must be taken that the safety system is truly isolated so that a defect in the non-safety system cannot affect the safety system
    - different processor
    - different heavy-weight tasks (depends on the OS)
46. Safety Critical Patterns
47. Safety Architecture Patterns
- Protected Single-Channel Pattern
- Dual-Channel Pattern
- Homogeneous Dual-Channel Pattern
- Heterogeneous Peer-Channel Pattern
- Sanity Check Pattern
- Monitor-Actuator Pattern
- Voting Multichannel Pattern
48. Protected Single Channel Pattern
- Within the single channel, mechanisms exist to identify and handle faults
- All faults must be detected within the fault tolerance time
- May be impossible
  - to test for all faults within the fault tolerance time
  - to remove common mode failures from the single channel
- Generally, less recurring system cost
  - no additional hardware required
49. Protected Single Channel Pattern
[Figure: Single Channel Train Braking System. Callout: "If I'm not getting life ticks, I'll shut down!"]
50. Dual Channel Architecture Patterns
- Separation of safety-relevant from non-safety-relevant where possible
- Separation of monitoring from control
- Generally easier to meet safety requirements
  - timing
  - common mode failures
- Generally higher recurring system cost
  - additional hardware required
51. Homogeneous Dual-Channel Pattern
- Identical channels used
- Channels may operate simultaneously (Multichannel Vote Pattern)
- Channels may operate in series (Backup Pattern)
- Good at identifying random faults but not systematic faults
- Low R&D cost, higher recurring cost
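When identical channels operate simultaneously, their outputs have to be reconciled. The following is a minimal C++ sketch of a majority voter, assuming three channels that each report an integer result; the function name and the choice of three channels are illustrative assumptions, since two channels alone can only detect a disagreement, not resolve it.

```cpp
#include <stdexcept>

// Returns the value reported by at least two of three identical channels.
// Throws when all three disagree, so the caller can enter a fail-safe state.
inline int majorityVote(int a, int b, int c) {
    if (a == b || a == c) return a;
    if (b == c) return b;
    throw std::runtime_error("no majority among channels");
}
```

Because the channels are identical, a random fault in one of them is outvoted, while a systematic fault (the same defect in every channel) produces three matching wrong answers and passes the vote.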
52. Homogeneous Dual-Channel Pattern
53. Heterogeneous Peer-Channel Pattern
- Equal weight, differently implemented channels
  - may use algorithmic inversion to recreate initial data
  - may use different algorithm
  - may use different teams (not foolproof because of hot spots that can cause failures)
- Good at identifying both random and systematic faults
54. Heterogeneous Peer-Channel Pattern
- Generally safest, but higher R&D and recurring cost
55. Heterogeneous Peer-Channel Pattern
56. Sanity Check Pattern
- A primary actuator channel does real computations
- A light-weight secondary channel checks the reasonableness of the primary channel
- Good for detection of both random and systematic faults
- May not detect faults which result in small variance
- Relatively inexpensive to implement
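A sanity check can be much simpler than the computation it guards. This minimal C++ sketch assumes a hypothetical heater controller: the primary channel computes a duty cycle, and the secondary channel only checks that the result is plausible. The gain, the 40 degree limit, the function names, and the choice of "heater off" as the fail-safe state are all assumptions made for illustration.

```cpp
// Primary channel: the "real" computation. Hypothetical proportional
// controller returning a heater duty cycle in percent.
inline double primaryDuty(double setpointC, double measuredC) {
    const double gain = 5.0;  // assumed tuning value
    return gain * (setpointC - measuredC);
}

// Secondary channel: a cheap reasonableness check, not a second computation.
inline bool isReasonable(double dutyPercent, double measuredC) {
    const double maxSafeTempC = 40.0;  // assumed safety limit
    if (dutyPercent < 0.0 || dutyPercent > 100.0) return false;
    if (measuredC >= maxSafeTempC && dutyPercent > 0.0) return false;
    return true;
}

// Commands 0 % (assumed fail-safe state: heater off) when the check fails.
inline double commandedDuty(double setpointC, double measuredC) {
    const double duty = primaryDuty(setpointC, measuredC);
    return isReasonable(duty, measuredC) ? duty : 0.0;
}
```

The check catches gross faults such as a duty cycle of 135 % or heating above the limit. A primary channel that returned 36 % instead of 35 % would pass, which is the small-variance weakness noted above.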
57. Monitor-Actuator Pattern
- Separates actuation from the monitoring of that actuation
- If the actuator channel fails, the monitor channel detects it
- If the monitor channel fails, the actuator channel continues correctly
- Requires fault isolation to be single-fault tolerant
  - actuator channel cannot use the monitor itself
58. Monitor-Actuator Pattern
59. Dual-Channel Design Architecture
60. Safety Executive Pattern
- Large scale architectural pattern
  - Controller subsystem (safety executive)
  - One or more watchdog subsystems
    - check on system health
    - ensure proper actuation is occurring
  - One or more actuation channels
  - Recovery subsystem (fail-safe processing channel)
61. Safety Executive Pattern
- Appropriate when
  - A set of fail-safe system states needs to be entered when failures are identified
  - Determination of failures is complex
  - Several safety-related system actions are controlled simultaneously
  - Safety-related actions are not independent
  - Determining proper safety action in the event of a failure can be complex
62. Safety Executive Pattern
64. Detailed Design for Safety
- Make it right before you make it fast
  - simple, clear algorithms and code
  - optimize only the 10-20% of code which affects performance
  - use safe language subsets
  - ensure you haven't introduced any common mode failures
65. Detailed Design for Safety
- Thoroughly test
  - unit test and peer review
  - integration test
  - validation test
66. Detailed Design for Safety
- Verify that it remains right throughout program execution
  - exceptions
  - invariant assertions
  - range checking
  - index and boundary checking
- When it's not right during execution, then make it right with corrective or protective measures
67. Detailed Design for Safety
- Use safe language subsets
  - strong compile-time checking
    - if you use C, use lint
  - strong run-time checking
  - exception handling
  - avoid void*
  - avoid error prone statements and syntax
  - you can make C++ safe but it's not safe out of the box
68. Detailed Design for Safety
- Language choice
  - Compile time checking (C versus Ada)
  - Run-time checking (C versus Ada)
  - Exceptions versus error codes
- Language selection
  - C treats you like a consenting adult. Pascal treats you like a naughty child. Ada treats you like a criminal.
69. Pascal example
    program WontCompile;
    type
      MySubRange = 0 .. 20;
      Day = (Monday, Tuesday, Wednesday, Thursday,
             Friday, Saturday, Sunday);
    var
      MyVar: MySubRange;
      MyDate: Day;
    begin
      MyVar := -9;   { will not compile -- range error! }
      MyDate := 0;   { will not compile -- wrong type! }
    end.
70. Ada example
    procedure MyProc is
       type Byte is range 0 .. 255;
       MyArray : array (1 .. 10) of Integer;
       b : Byte;
    begin
       for j in 0 .. 10 loop
          MyArray(j) := j ** 6;   -- raises exception on first time through
       end loop;
       b := Byte(MyArray(10));    -- will fail run-time range check
    end MyProc;
71. Exceptions
- Some languages (Pascal, Modula-2) have a draconian error handling policy
  - exception raised and program terminated
  - not good for embedded systems
- Ada and C++ allow run-time recovery through user-defined exceptions and exception handlers
72. Exceptions
- Without exceptions, a lot of extra code is needed to check a statement such as
    a[j] = b
73. Detailed Design for Safety
- Do not allow ignoring of error indications
  - checking of return values is a manual process
  - user of the function must remember each and every time
  - easy to circumvent this error handling system
- Separate normal code from error handling code
74. Detailed Design for Safety
- Handle errors at the lowest level with sufficient context to correct the problem
75. Error handling code
    a = getfone(b, c);
    if (a) {
        switch (a) {
            case 1: ...
            case 2: ...
        }
    }
    d = getftwo(b, c);
    if (d) {
        switch (d) {
            case 1: ...
            case 2: ...
        }
    }
In this code the normal execution path is only
    a = getfone(b, c);
    d = getftwo(b, c);
76. Built-in exception types
    procedure enqueue (q : in out queue; v : in FLOAT) is
    begin
       if full (q) then
          raise overflow;
       end if;
       q.body((q.head + q.length) mod qSize) := v;
       q.length := q.length + 1;
    end enqueue;
77. Caller of the sequence handles exception
    procedure testQ (q : in out queue) is
    begin
       for j in 1 .. 10 loop
          enqueue(q, random(1000));
       end loop;
    exception
       when overflow =>
          puts("Test failed due to queue overflow");
    end testQ;
78. C++ exception handling
- Extends capabilities beyond that of Ada
- Exceptions extended by type rather than value
  - possible to create hierarchies of exception classes and catch by thrown subclass type
  - class can contain different types of information about the kind of device that failed
  - this facilitates error recovery, debugging, and user error reporting
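Catching by type in a class hierarchy might look like the following minimal C++ sketch. `DeviceFault`, `SensorFault`, `describe`, `readOxygen` and `openValve` are hypothetical names invented for illustration; only `std::runtime_error` comes from the standard library.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical hierarchy: the type says what kind of failure occurred,
// and each class carries information about the device that failed.
class DeviceFault : public std::runtime_error {
public:
    DeviceFault(const std::string& device, const std::string& what)
        : std::runtime_error(device + ": " + what), device_(device) {}
    const std::string& device() const { return device_; }
private:
    std::string device_;
};

class SensorFault : public DeviceFault {
public:
    SensorFault(const std::string& device, double lastReading)
        : DeviceFault(device, "sensor fault"), lastReading_(lastReading) {}
    double lastReading() const { return lastReading_; }
private:
    double lastReading_;
};

// Hypothetical operations that fail in different ways.
inline void readOxygen() { throw SensorFault("O2", 0.0); }
inline void openValve() { throw DeviceFault("valve", "stuck"); }

// Catches the most specific type first, then falls back to the base class.
inline std::string describe(void (*operation)()) {
    try {
        operation();
        return "ok";
    } catch (const SensorFault& f) {
        return "sensor " + f.device() + " failed";
    } catch (const DeviceFault& f) {
        return "device " + f.device() + " failed";
    }
}
```

The handler order matters: a `SensorFault` is also a `DeviceFault`, so the subclass handler has to come first to be reachable.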
79. Making C++ safe
- Overloading the [] operator with index range checking improves the safety of arrays
- Making classes of scalars and overloading the assignment operator allows additional range and value checking
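Both techniques can be sketched in a few lines of C++. The class names `CheckedArray` and `RangedInt` are hypothetical, and throwing `std::out_of_range` is one possible reaction to a violation; a real system might instead enter a fail-safe state.

```cpp
#include <cstddef>
#include <stdexcept>

// Fixed-size array whose operator[] checks the index on every access.
template <typename T, std::size_t N>
class CheckedArray {
public:
    T& operator[](std::size_t i) {
        if (i >= N) throw std::out_of_range("CheckedArray index");
        return data_[i];
    }
    const T& operator[](std::size_t i) const {
        if (i >= N) throw std::out_of_range("CheckedArray index");
        return data_[i];
    }
private:
    T data_[N] = {};
};

// Scalar restricted to [Lo, Hi]; construction and assignment reject
// out-of-range values and leave the stored value unchanged.
template <int Lo, int Hi>
class RangedInt {
public:
    RangedInt(int v = Lo) : value_(check(v)) {}
    RangedInt& operator=(int v) { value_ = check(v); return *this; }
    operator int() const { return value_; }
private:
    static int check(int v) {
        if (v < Lo || v > Hi) throw std::out_of_range("RangedInt value");
        return v;
    }
    int value_;
};
```

A `RangedInt<0, 20>` behaves like a Pascal subrange `0 .. 20`, except that the violation is reported at run time rather than at compile time.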
80. Detailed Design for Safety
- Data Validity Checks
  - CRC (16 bit or 32 bit)
    - identifies all single or dual bit errors
    - detects high percentage of multiple bit errors
    - table or compute-driven
    - chips are available
  - checksum
  - redundant storage
    - one's complement
81. Detailed Design for Safety
- Redundancy should be set on every write access
- Data should be checked on every read access
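The write-sets, read-checks discipline can be sketched with a compute-driven CRC. This minimal C++ example uses the common reflected CRC-32 polynomial 0xEDB88320; the wrapper class `GuardedSetting`, its `corruptForTest` fault-seeding hook, and the function name `crc32Bitwise` are hypothetical names introduced for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Bitwise (compute-driven) CRC-32, reflected polynomial 0xEDB88320.
inline std::uint32_t crc32Bitwise(const unsigned char* data, std::size_t len) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

// Hypothetical wrapper: the CRC is set on every write and checked on every read.
class GuardedSetting {
public:
    explicit GuardedSetting(std::uint32_t v) { write(v); }
    void write(std::uint32_t v) { value_ = v; crc_ = compute(v); }
    std::uint32_t read() const {
        if (compute(value_) != crc_) throw std::runtime_error("setting corrupted");
        return value_;
    }
    // Fault-seeding hook: flips one bit without updating the CRC.
    void corruptForTest() { value_ ^= 1u; }
private:
    static std::uint32_t compute(std::uint32_t v) {
        const unsigned char bytes[4] = {
            static_cast<unsigned char>(v), static_cast<unsigned char>(v >> 8),
            static_cast<unsigned char>(v >> 16), static_cast<unsigned char>(v >> 24)};
        return crc32Bitwise(bytes, 4);
    }
    std::uint32_t value_;
    std::uint32_t crc_;
};
```

A table-driven CRC trades memory for speed and produces the same values. The fault-seeding hook exists so that the check itself can be tested.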
82. ANSI C++ Exception Class Hierarchy
- exception
  - logic_error
    - domain_error
    - invalid_argument
    - length_error
    - out_of_range
  - runtime_error
    - range_error
    - overflow_error
84. Safety Process (Development)
- Do Hazard Analysis early and often
- Track safety measures from hazard analysis to
  - requirements specification
  - design
  - code
  - validation tests
- Test safety measures with fault seeding
85. Safety Process (Deployment)
- Install safely
  - ensure proper means are used to set up system
  - safety measures are installed and checked
- Deploy safely
  - ensure safety measures are periodically checked and serviced
  - do not turn off safety measures
- Decommission safely
  - removal of hazardous materials
86. IEC Overall Safety Lifecycle
- Concept
- Overall scope definition
- Hazard and risk analysis
- Overall safety requirements
- Safety requirements allocation
- Overall planning
  - Operation and maintenance planning
  - Validation planning
  - Installation planning
- SRS E/E/PES realization
- SRS other technology realization
- External risk reduction facilities
- Overall installation and commissioning
- Overall safety validation
- Overall operation and maintenance
- Overall modification and retrofit
- Decommissioning
Legend: SRS = Safety Related System; E/E/PES = Electrical/Electronic/Programmable Electronic System
88. Safety in Testing in R&D
- Use fault-seeding
- Unit (class) testing
  - white box
  - procedural invariant violation assertions
  - peer reviews
- Integration testing
  - grey box
- Validation testing
  - black box
  - externally caused faults
  - (Grey box) internally seeded faults
89. Safety Testing During Operation
- Power on Self-Test (POST)
  - Check for latent faults
  - All safety measures must be tested at power on and periodically
    - RAM (stuck-at, shorts, cell failures)
    - ROM
    - Flash
    - Disks
    - CPU
    - Interfaces
    - Buses
90. Safety Testing During Operation
- Built-In Tests
  - Repeats some of POST
  - Data integrity checks
  - Index and pointer validity checking
  - Subrange value invariant assertions
  - Proper functioning
    - Watchdogs
    - Reasonableness checks
    - Lifeticks
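A lifetick check can be sketched as a small monitor object. This minimal C++ example assumes a free-running millisecond counter and passes the current time in as a parameter so the logic can be tested without a real clock; the class name `LifetickMonitor` and its interface are hypothetical.

```cpp
#include <cstdint>

// Hypothetical lifetick monitor. The monitored channel calls tick();
// the monitoring channel calls expired() with the current time.
class LifetickMonitor {
public:
    LifetickMonitor(std::uint32_t timeoutMs, std::uint32_t nowMs)
        : timeoutMs_(timeoutMs), lastTickMs_(nowMs) {}

    // Records that the monitored channel is still alive.
    void tick(std::uint32_t nowMs) { lastTickMs_ = nowMs; }

    // True when no tick has arrived within the timeout. Unsigned
    // subtraction stays correct when the millisecond counter wraps.
    bool expired(std::uint32_t nowMs) const {
        return static_cast<std::uint32_t>(nowMs - lastTickMs_) > timeoutMs_;
    }

private:
    std::uint32_t timeoutMs_;
    std::uint32_t lastTickMs_;
};
```

When `expired()` returns true, the monitoring channel would drive the system to its fail-safe state. The monitor is only useful if it runs independently of the channel it watches.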
91. A Simplified Example: A Linear Accelerator
92. Unsafe Linear Accelerator
[Figure: block diagram with a CPU and a Sensor; signals: Beam Intensity, Beam Duration, Radiation Dose; sequence: 1. Set Dose, 2. Start Beam, 3. End Beam]
93. Fault Tree Analysis
[Figure: fault tree. Top event: Over radiation. Gates: AND, OR, OR, OR. Events: Radiation command invalid, CPU Halted, Shutoff timer failure, Beam engaged, CPU failure, Software defect (twice), EMI (twice)]
94. Hazards of the Linear Accelerator

Hazard | Level of risk | Tolerance time T1 | Fault | Likelihood | Detection time | Control measure | Exposure time
Over radiation | Severe | 100 ms | CPU locks up | rare | 50 ms | Safety CPU checks lifetick at 25 ms | 50 ms
 | | | Corrupt data settings | often | 10 ms | 32 bit CRCs on data checked every access | 15 ms
Under radiation | Moderate | 2 weeks | Corrupt data setting | often | 10 ms | 32 bit CRCs on data checked every access | 15 ms
Inadvertent radiation on power on | Severe | 100 ms | Beam left engaged during power down | often | n/a | Curtain mechanically shuts at power down | 0 ms
95. Safe Linear Accelerator
[Figure: block diagram with a CPU, a Safety CPU, and a Sensor; signals: Beam Intensity, Beam Duration, Radiation Dose; sequence: 1. Set Dose, 2. Start Beam, 3. End Beam]
- Self test results shared prior to operation
- Periodic watchdog service
- Deenergize
- Mechanical shutoff when curtain low
96. Summary
- Safety is a system issue
- It is cheaper and more effective to include safety early on than to add it later
- Safety architectures provide programming-in-the-large safety
- Safe coding rules and detailed design provide programming-in-the-small safety