Safety Critical Systems - PowerPoint PPT Presentation

Title: Safety Critical Systems
Slides: 97
Provided by: RobOs3

Transcript and Presenter's Notes

1
Safety Critical Systems
2
Eight steps to safety
  • Identify the hazards
  • Determine the risks
  • Define the safety measures
  • Create safe requirements
  • Create safe designs
  • Implement safety
  • Assure the safety process
  • Test, test, test

3
Eight steps to safety
  • Identify the hazards
  • Determine the risks
  • Define the safety measures
  • Create safe requirements
  • Create safe designs
  • Implement safety
  • Assure the safety process
  • Test, test, test

Safety analysis
Handled at the architectural level and
mechanistic level
4
Safety Analysis
  • You must identify the hazards of the system
  • You must identify the faults that can lead to
    hazards
  • You must define safety control measures to handle
    hazards
  • These culminate in the Hazard Analysis
  • The Hazard Analysis feeds into the Requirements
    Specification

5
Eight steps to safety
  • Identify the hazards
  • Determine the risks
  • Define the safety measures
  • Create safe requirements
  • Create safe designs
  • Implement safety
  • Assure the safety process
  • Test, test, test

6
Hazard Causes
  • Release of energy
      • electromagnetism (microwave oven)
      • radiation (nuclear power plant)
      • electricity (electrocution hazard from ECG leads)
      • heat (infant warmer)
      • kinetic (runaway train)
  • Release of toxins

7
Hazard Causes
  • Interference with life support or other
    safety-related function
  • Misleading safety personnel
  • Failure to alarm
  • Alarming too much (Therac-25): the alarms were
    ignored and people were killed

8
Types of Hazards
  • Actions
      • inappropriate system actions taken
          • F-18 pilot pulling up landing gear
      • appropriate system actions not taken
  • Timing
      • too soon
      • too late
      • fault latency time

9
Types of Hazards
  • Sequence
      • skipping actions
      • actions out of order
  • Amount
      • too much
      • too little

10
Example Hazards
  • Actions
      • incorrectly energizing a medical treatment laser
      • failure to engage landing gear
  • Timing
      • cardiac pacemaker paces too fast
      • flight control surface adjusted too slowly

11
Example Hazards
  • Sequence
      • empty the vat, THEN add the reagent
      • out of sequence network packets controlling
        industrial robot
  • Amount
      • electrocution from muscle stimulator
      • too little oxygen delivered to ventilator patient

12
Means of Hazard Control
  • Obviation: the possibility of the hazard can be
    removed by making it physically impossible
      • use incompatible fasteners to prevent cross
        connections
  • Education: the hazard can be handled by
    educating the users so that they won't create
    hazardous conditions through equipment misuse
      • don't look down the barrel when cleaning your
        rifle

13
Means of Hazard Control
  • Alarming: announcing the hazard to the user when
    it appears so that they can take appropriate
    action
      • alarming when the heart stops beating
  • Interlocks: the hazard can be removed by using
    secondary devices and/or logic to intercede when
    a hazard presents itself
      • car won't start unless it is in Park

14
Means of Hazard Control
  • Internal checking: the hazard can be handled by
    ensuring that a system can detect that it is
    malfunctioning prior to an incident
      • CRC checks data for corruption whenever it is
        accessed
  • Safety equipment
      • goggles, gloves

15
Means of Hazard Control
  • Restricting access to potential hazards so that
    only knowledgeable users have such access
      • using passwords to prevent inadvertently starting
        service mode
  • Labelling
      • High Voltage -- DO NOT TOUCH

16
Hazard Analysis
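Each column of the analysis answers one question:
  • Hazard: the hazardous condition
  • Level of risk: how bad if it occurs?
  • Tolerance time T1: how long can it be tolerated?
  • Fault: how can this happen?
  • Likelihood: how frequently?
  • Detection time: how long to discover?
  • Control measure: what do you do about it?
  • Exposure time: how long is the exposure to hazard?

| Hazard           | Level of risk | Tolerance time T1 | Fault                              | Likelihood | Detection time | Control measure                               | Exposure time |
|------------------|---------------|-------------------|------------------------------------|------------|----------------|-----------------------------------------------|---------------|
| Hypoventilation  | Severe        | 5 min             | Ventilator fails                   | rare       | 30 sec         | Independent pressure alarm, action by doctor  | 1 min         |
|                  |               |                   | Esophageal intubation              | often      | 30 sec         | CO2 sensor alarm                              | 1 min         |
|                  |               |                   | User misattaches breathing circuit | often      | 0              | Noncompatible mechanical fasteners used       | 0             |
| Overpressure     | Severe        | 250 ms            | Release valve failure              | rare       | 50 ms          | Secondary valve opens                         | 55 ms         |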
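The timing columns of the table lend themselves to a mechanical check. The following is a minimal sketch, assuming a rule the slide does not spell out: detection must happen within the exposure time, and the exposure time must stay below the tolerance time. The names HazardEntry and controlMeasureIsFastEnough are hypothetical.

```cpp
#include <cassert>
#include <chrono>

using Millis = std::chrono::milliseconds;

// One row of the hazard analysis, reduced to its timing columns.
struct HazardEntry {
    Millis toleranceTime;  // how long the hazard can be tolerated
    Millis detectionTime;  // how long the fault takes to discover
    Millis exposureTime;   // how long the exposure to the hazard lasts
};

// Assumed rule (not stated on the slide): the fault is detected within
// the exposure time, and the exposure ends before the tolerance time.
bool controlMeasureIsFastEnough(const HazardEntry& h) {
    return h.detectionTime <= h.exposureTime &&
           h.exposureTime < h.toleranceTime;
}
```

With the overpressure row (250 ms tolerance, 50 ms detection, 55 ms exposure) the check passes; a control measure that took 300 ms would fail it.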
17
When is a system safe enough?
  • (Minimal) No hazards in the absence of faults
  • (Minimal) No hazards in the presence of any
    single point failure
  • a common mode failure is a single point failure
    that affects multiple channels
  • a latent fault is an undetected fault which
    allows another fault to cause a hazard
  • Your mileage may vary depending on the risk
    introduced by your system

18
Safety Measures
  • You cannot depend on a safety measure that you
    cannot test!
  • CAN bus with 2 nodes provides a CRC on messages
    checked at the chip level, but the chips provide
    no way of testing to see if it is working.
  • Therefore, it cannot be relied on as a safety
    measure

19
Fail-Safe States
  • Off
  • Emergency stop -- immediately cut power
  • Production stop -- stop after the current task
  • Protection stop -- shut down without removing
    power
  • Partial Shutdown
  • Degraded level of functionality

20
Fail-Safe States
  • Hold
  • No functionality, but with safety actions taken
  • Manual or External control
  • Restart (reboot)

21
Eight steps to safety
  • Identify the hazards
  • Determine the risks
  • Define the safety measures
  • Create safe requirements
  • Create safe designs
  • Implement safety
  • Assure the safety process
  • Test, test, test

22
Risk Assessment
  • For each hazard
  • determine the potential severity
  • determine the likelihood of the hazard
  • determine how long the user is exposed to the
    hazard
  • determine whether the risk can be removed

23
TUV Risk Level Determination Chart
| S  | E  | G  | W3 | W2 | W1 |
|----|----|----|----|----|----|
| S1 |    |    | 1  | -  | -  |
| S2 | E1 | G1 | 2  | 1  | -  |
| S2 | E1 | G2 | 3  | 2  | 1  |
| S2 | E2 | G1 | 4  | 3  | 2  |
| S2 | E2 | G2 | 5  | 4  | 3  |
| S3 | E1 |    | 6  | 5  | 4  |
| S3 | E2 |    | 7  | 6  | 5  |
| S4 |    |    | 8  | 7  | 6  |

Risk parameters
  • S: Extent of damage
      • S1 slight injury
      • S2 severe irreversible injury to one or more
        persons, or the death of a single person
      • S3 death of several persons
      • S4 catastrophic consequences, several deaths
  • E: Exposure time
      • E1 seldom to relatively infrequent
      • E2 frequent to continuous
  • G: Hazard prevention
      • G1 possible under certain conditions
      • G2 hardly possible
  • W: Occurrence probability of hazardous event
      • W1 very low
      • W2 low
      • W3 relatively high
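The chart is regular enough to encode as a small lookup. The sketch below is one possible encoding, not part of the original slides: the enum and function names are hypothetical, and a return value of 0 stands for the "-" cells where the chart assigns no risk level.

```cpp
#include <cassert>

// Risk parameters named as on the chart.
enum class Damage { S1, S2, S3, S4 };
enum class Exposure { E1, E2 };
enum class Prevention { G1, G2 };
enum class Probability { W1, W2, W3 };

// Returns the risk level 1..8, or 0 where the chart shows "-".
int riskLevel(Damage s, Exposure e, Prevention g, Probability w) {
    int levelAtW3 = 0;  // entry in the W3 column for the selected row
    switch (s) {
        case Damage::S1: levelAtW3 = 1; break;  // E and G are not used
        case Damage::S2:
            levelAtW3 = 2 + (e == Exposure::E2 ? 2 : 0)
                          + (g == Prevention::G2 ? 1 : 0);
            break;
        case Damage::S3: levelAtW3 = (e == Exposure::E1) ? 6 : 7; break;
        case Damage::S4: levelAtW3 = 8; break;  // E and G are not used
    }
    // Each step toward a lower probability lowers the level by one.
    const int shift = (w == Probability::W3) ? 0
                    : (w == Probability::W2) ? 1 : 2;
    const int level = levelAtW3 - shift;
    return level >= 1 ? level : 0;
}
```

For example, S2/E2/G2/W3 yields level 5 and S3/E1/W3 yields level 6, matching the corresponding rows of the chart.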
24
Sample Risk Assessments
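| Device         | Hazard          | Extent of damage | Exposure time | Hazard prevention | Probability | TUV risk level |
|----------------|-----------------|------------------|---------------|-------------------|-------------|----------------|
| Microwave oven | Irradiation     | S2               | E2            | G2                | W3          | 5              |
| Pacemaker      | Pace too slowly | S2               | E2            | G2                | W3          | 5              |
|                | Pace too fast   | S2               | E2            | G2                | W3          | 5              |
| Power station  | Explosion       | S3               | E1            | --                | W3          | 6              |
| Airliner       | Crash           | S4               | E2            | G2                | W2          | 8              |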
25
Eight steps to safety
  • Identify the hazards
  • Determine the risks
  • Define the safety measures
  • Create safe requirements
  • Create safe designs
  • Implement safety
  • Assure the safety process
  • Test, test, test

26
Safety Measures
  • Safety measures do one of the following
  • remove the hazard
  • reduce the risk
  • identify the hazard to supervisory control
  • The purpose of the safety measure is to ensure
    the system remains in a safe state

27
Safety Measures
  • Adequacy of measures
  • safety measures must be able to reliably detect
    the fault
  • safety measures must be able to take appropriate
    actions

| Component                        | Fault/Error                                                             | Software class (1, 2) | Examples of acceptable measures |
|----------------------------------|-------------------------------------------------------------------------|-----------------------|---------------------------------|
| Interrupt handling and execution | no interrupt or too frequent                                            | rq                    | functional test or time-slot monitoring |
|                                  | no interrupt or too frequent, and interrupt related to different sources | rq                    | comparison of redundant functional channels by either: reciprocal comparison; independent hardware comparator; independent time-slot and logical monitoring |
28
Risk Reduction
  • Identify the fault
  • Take corrective action, either
  • use redundancy to correct and move on
  • feedforward error correction (Hamming codes)
  • redo the computational step
  • feedback error detection (take corrective action
    first)
  • go to a fail-safe state

29
Fault Identification at Run-Time
  • Faults must be identified in < TO
  • Fault identification requires redundancy
  • Redundancy can be in terms of
  • channel
  • device
  • data
  • control


Architectural

Detailed design
30
Fault Identification at Run-Time
  • Redundancy may be either
  • homogenous (random faults only)
      • does not detect errors
      • perform functions the same way on the same thing
        multiple times
  • heterogenous (systematic and random faults)
      • includes errors -> present in all channels
      • perform processing differently and hopefully you
        didn't make the same mistake!

31
Fault Tree Analysis Symbology
  • A condition that must be present to produce the
    output of a gate
  • An event that results from a combination of
    events through a logic gate
  • Transfer
  • A basic fault event that requires no further
    development
  • A fault event that is not developed further
    because the event is inconsequential or the
    necessary information is not available
  • AND gate (also OR gate)
  • An event that is expected to occur normally
  • NOT gate
32
Subset of Pacemaker Fault Analysis
[Fault tree diagram]
  • Condition or event to avoid: Pacing too slowly
  • Secondary conditions or events: Shutdown fault,
    Time-base fault, Invalid pacing rate
  • Gates used: OR, AND
  • Lower-level events, down to primary or
    fundamental faults: Crystal failure, Watchdog
    failure, Bad command rate, Data corrupted in
    vivo, Software failure, CPU H/W failure, Rate
    command corrupted, CRC hardware failure
33
Eight steps to safety
  • Identify the hazards
  • Determine the risks
  • Define the safety measures
  • Create safe requirements
  • Create safe designs
  • Implement safety
  • Assure the safety process
  • Test, test, test

34
Safe Requirements
  • Requirements specification follows initial hazard
    analysis
  • Specific requirements should track back to hazard
    analysis
  • must be shown to FDA, etc
  • Architectural framework should be selected with
    safety needs in mind
  • has the hooks in place

35
Eight steps to safety
  • Identify the hazards
  • Determine the risks
  • Define the safety measures
  • Create safe requirements
  • Create safe designs
  • Implement safety
  • Assure the safety process
  • Test, test, test

36
Use Good Design Practices
  • Good design practices allow you to
  • manage complexity
  • view the system at various levels of abstraction
  • zoom in on a particular area of interest
  • identify hot spots of special concern
  • have consistent quality
  • easily test
  • build and use high quality components
  • Regulatory agencies look at this!!

37
Use Good Design Practices
  • Manage your requirements
  • trace requirements to design elements
  • trace design elements back to requirements

[Diagram: use cases in the requirements specification (remote communications, adjust trajectory) traced to classes in the design model (class a through class e)]
38
Use Good Design Practices
  • Use iterative development
  • integrating many times finds more defects
  • iterative prototypes can result in more reliable
    and safe systems

39
Use Good Design Practices
  • Use component-based design architectures
  • third party components may be very well tested if
    they are in wide use
  • require bug lists from component vendors
  • this bit Microsoft once

40
Use Good Design Practices
  • Use Visual Modeling
  • UML
  • Ward-Mellor
  • Use executable models
  • animate models
  • execute and debug at modeling level of abstraction

41
Use Good Design Practices
  • Use frameworks
  • a framework is a partially completed application
    which is specialized by the user
  • Microsoft foundation classes
  • Object Execution Framework
  • frameworks reduce the work of developing new
    applications
  • frameworks rely on well-tested patterns

42
Use Good Design Practices
[Diagram labels: User Model, Framework, System]

80-90% of application code is housekeeping code
43
Use Good Design Practices
  • Use Configuration Management
  • only use unit-tested components in builds

[Diagram labels: parameters, data acquisition, SYSTEM, CM Database, drivers, OS]
44
Use Good Design Practices
  • Design for test
  • product testing
  • built-in-testing to ensure
  • invariants are truly invariant
  • functional invariants
  • quality of service invariants (e.g. performance)
  • faults are detected

45
Good Design Practices
  • Isolate Safety Functions
  • Safety-relevant systems are 200-300% more effort
    to produce
  • Isolation of safety systems allows more expedient
    development
  • Care must be taken that the safety system is
    truly isolated so that a defect in the non-safety
    system cannot affect the safety system
  • different processor
  • different heavy-weight tasks (depends on the OS)

46
Safety Critical Patterns
47
Safety Architecture Patterns
  • Protected Single-Channel Pattern
  • Dual-Channel Pattern
  • Homogenous Dual Channel Pattern
  • Heterogenous Peer-Channel Pattern
  • Sanity Check Pattern
  • Actuator-Monitor Pattern
  • Voting Multichannel Pattern

48
Protected Single Channel Pattern
  • Within the single channel, mechanisms exist to
    identify and handle faults
  • All faults must be detected within the fault
    tolerance time
  • May be impossible
  • to test for all faults within the fault tolerance
    time
  • to remove common mode failures from the single
    channel
  • Generally, less recurring system cost
  • no additional hardware required

49
Protected Single Channel Pattern
"If I'm not getting life ticks, I'll shut down!"
Single Channel Train Braking System
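The life-tick idea can be sketched as a small monitor that latches a shutdown when ticks stop arriving. This is an illustrative sketch only: the class name, the millisecond timebase passed in by the caller, and the latching behaviour are assumptions, not details taken from the train braking example.

```cpp
#include <cassert>
#include <cstdint>

// Watches for periodic life ticks from the monitored channel.
class LifetickMonitor {
public:
    explicit LifetickMonitor(std::uint32_t timeoutMs)
        : timeoutMs_(timeoutMs), lastTickMs_(0), shutDown_(false) {}

    // Called by the monitored channel each time it proves it is alive.
    void tick(std::uint32_t nowMs) { lastTickMs_ = nowMs; }

    // Called periodically by the protecting logic. Returns true once the
    // fail-safe state has been entered; the decision is latched, so a
    // late tick cannot silently resume operation.
    bool check(std::uint32_t nowMs) {
        // Unsigned subtraction stays correct across counter wrap-around.
        if (nowMs - lastTickMs_ > timeoutMs_) shutDown_ = true;
        return shutDown_;
    }

private:
    std::uint32_t timeoutMs_;
    std::uint32_t lastTickMs_;
    bool shutDown_;
};
```

Passing the current time in as a parameter, rather than reading a clock inside the class, keeps the monitor easy to test with seeded faults such as a deliberately withheld tick.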
50
Dual Channel Architecture Patterns
  • Separation of safety-relevant from non-safety
    relevant where possible
  • Separation of monitoring from control
  • Generally easier to meet safety requirements
  • timing
  • common mode failures
  • Generally higher recurring system cost
  • additional hardware required

51
Homogenous Dual-Channel Pattern
  • Identical channels used
  • Channels may operate simultaneously (Multichannel
    Vote Pattern)
  • Channels may operate in series (Backup Pattern)
  • Good at identifying random faults but not
    systematic faults
  • Low R&D cost, higher recurring cost

52
Homogenous Dual-Channel Pattern
53
Heterogeneous Peer-Channel Pattern
  • Equal weight, differently implemented channels
  • may use algorithmic inversion to recreate initial
    data
  • may use different algorithm
  • may use different teams (not fool proof because
    of hot spots that can cause failures)
  • Good at identifying both random and systematic
    faults

54
Heterogeneous Peer-Channel Pattern
  • Generally safest, but higher R&D and recurring
    cost

55
Heterogeneous Peer-Channel Pattern
56
Sanity Check Pattern
  • A primary actuator channel does real computations
  • A light-weight secondary channel checks the
    reasonableness of the primary channel
  • Good for detection of both random and systematic
    faults
  • May not detect faults which result in small
    variance
  • Relatively inexpensive to implement

57
Monitor-Actuator Pattern
  • Separates actuation from the monitoring of that
    actuation
  • If the actuator channel fails, the monitor
    channel detects it
  • If the monitor channel fails, the actuator
    channel continues correctly
  • Requires fault isolation to be single-fault
    tolerant
  • actuator channel cannot use the monitor itself

58
Monitor-Actuator Pattern
59
Dual-Channel Design Architecture
60
Safety Executive Pattern
  • Large scale architectural pattern
  • Controller subsystem (safety executive)
  • One or more watchdog subsystems
  • check on system health
  • ensure proper actuation is occurring
  • One or more actuation channels
  • Recovery subsystem (Fail safe processing channel)

61
Safety Executive Pattern
  • Appropriate when
  • A set of fail-safe system states needs to be
    entered when failures identified
  • Determination of failures is complex
  • Several safety-related system actions are
    controlled simultaneously
  • Safety-related actions are not independent
  • Determining proper safety action in the event of
    a failure can be complex

62
Safety Executive Pattern
63
Eight steps to safety
  • Identify the hazards
  • Determine the risks
  • Define the safety measures
  • Create safe requirements
  • Create safe designs
  • Implement safety
  • Assure the safety process
  • Test, test, test

64
Detailed Design for Safety
  • Make it right before you make it fast
  • simple, clear algorithms and code
  • optimize only the 10-20% of code which affects
    performance
  • use safe language subsets
  • ensure you haven't introduced any common mode
    failures

65
Detailed Design for Safety
  • Thoroughly test
  • unit test and peer review
  • integration test
  • validation test

66
Detailed Design for Safety
  • Verify that it remains right throughout program
    execution
  • exceptions
  • invariant assertions
  • range checking
  • index and boundary checking
  • When it's not right during execution, then make it
    right with corrective or protective measures

67
Detailed Design for Safety
  • Use safe language subsets
  • strong compile-time checking
  • if you use C, use lint
  • strong run-time checking
  • exception handling
  • avoid void
  • avoid error prone statements and syntax
  • you can make C safe but it's not safe out of the
    box

68
Detailed Design for Safety
  • Language choice
  • Compile time checking (C versus Ada)
  • Run-time checking (C versus Ada)
  • Exceptions versus error codes
  • Language selection
  • C treats you like a consenting adult. Pascal
    treats you like a naughty child. Ada treats you
    like a criminal

69
Pascal example
    program WontCompile;
    type
      MySubRange = 10 .. 20;
      Day = (Monday, Tuesday, Wednesday, Thursday,
             Friday, Saturday, Sunday);
    var
      MyVar: MySubRange;
      MyDate: Day;
    begin
      MyVar := 9;     { will not compile -- range error! }
      MyDate := 0;    { will not compile -- wrong type! }
    end.

70
Ada example
    procedure MyProc is
       MyArray : array (1 .. 10) of integer;
       j : integer;
       b : byte;
    begin
       for j in 0 .. 10 loop
          MyArray(j) := j ** 6;   -- raises exception on first time through
       end loop;
       b := MyArray(10);          -- will fail run-time range check
    end MyProc;
71
Exceptions
  • Some languages (Pascal, Modula-2) have a
    draconian error handling policy
  • exception raised and program terminated
  • not good for embedded systems
  • Ada and C++ allow run time recovery through
    user-defined exceptions and exception handlers

72
Exceptions
  • A lot of extra code to check the statement
      • a[j] = b

73
Detailed Design for Safety
  • Do not allow ignoring of error indications
  • checking of return values is a manual process
  • user of the function must remember each and every
    time
  • easy to circumvent this error handling system
  • Separate normal code from error handling code

74
Detailed Design for Safety
  • Handle errors at the lowest level with sufficient
    context to correct the problem

75
Error handling code
    a = getfone(b, c);
    if (a)
        switch (a) {
            case 1: ..
            case 2: ..
        }
    d = getftwo(b, c);
    if (d)
        switch (d) {
            case 1: ..
            case 2: ..
        }

In this code the normal execution path is only
    a = getfone(b, c); d = getftwo(b, c);
76
Built-in exception types
    procedure enqueue (q : in out queue; v : in FLOAT) is
    begin
       if full (q) then
          raise overflow;
       end if;
       q.body((q.head + q.length) mod qSize) := v;
       q.length := q.length + 1;
    end enqueue;

77
Caller of the sequence handles exception
    procedure testQ (q : in out queue) is
    begin
       for j in 1 .. 10 loop
          enqueue(q, random(1000));
       end loop;
    exception
       when overflow =>
          puts("Test failed due to queue overflow");
    end testQ;

78
C++ exception handling
  • Extends capabilities beyond that of Ada
  • Exceptions extended by type rather than value
  • possible to create hierarchies of exception
    classes and catch by thrown subclass type
  • class can contain different types of information
    about the kind of device that failed
  • this facilitates error recovery, debugging, and
    user error reporting

79
Making C++ safe
  • Overloading the [] operator with index range
    checking improves the safety of arrays
  • Making classes of scalars and overloading the
    assignment operator allows additional range and
    value checking

80
Detailed Design for Safety
  • Data Validity Checks
  • CRC (16 bit or 32 bit)
  • identifies all single or dual bit errors
  • detects high percentage of multiple bit errors
  • table or compute-driven
  • chips are available
  • checksum
  • redundant storage
  • one's complement

81
Detailed Design for Safety
  • Redundancy should be set every write access
  • Data should be checked every read access

82
ANSI C++ Exception Class Hierarchy
  • exception
      • logic error
          • domain error
          • invalid argument
          • length error
          • out of range
      • runtime error
          • range error
          • overflow error
83
Eight steps to safety
  • Identify the hazards
  • Determine the risks
  • Define the safety measures
  • Create safe requirements
  • Create safe designs
  • Implement safety
  • Assure the safety process
  • Test, test, test

84
Safety Process (Development)
  • Do Hazard Analysis early and often
  • Track safety measures from hazard analysis to
  • requirements specification
  • design
  • code
  • validation tests
  • Test safety measures with fault seeding

85
Safety Process (Deployment)
  • Install safely
  • ensure proper means are used to set up system
  • safety measures are installed and checked
  • Deploy safely
  • ensure safety measures are periodically checked
    and serviced
  • Do not turn off safety measures
  • Decommission safely
  • removal of hazardous materials

86
IEC Overall Safety Lifecycle
  • Concept
  • Overall scope definition
  • Hazard and risk analysis
  • Overall safety requirements
  • Safety requirements allocation
  • Overall planning
      • Operation and maintenance planning
      • Validation planning
      • Installation planning
  • SRS E/E/PES realization
  • SRS other technology realization
  • External risk reduction facilities
  • Overall installation and commissioning
  • Overall safety validation
  • Overall operation and maintenance
  • Overall modification and retrofit
  • Decommissioning
SRS: Safety Related System. E/E/PES:
Electrical/Electronic/Programmable electronic
system.
87
Eight steps to safety
  • Identify the hazards
  • Determine the risks
  • Define the safety measures
  • Create safe requirements
  • Create safe designs
  • Implement safety
  • Assure the safety process
  • Test, test, test

88
Safety in Testing in R&D
  • Use fault-seeding
  • Unit (class) testing
  • white box
  • procedural invariant violation assertions
  • peer reviews
  • Integration testing
  • grey box
  • Validation testing
  • black box
  • externally caused faults
  • (Grey box) internally seeded faults

89
Safety Testing During Operation
  • Power on Self-Test (POST)
  • Check for latent faults
  • All safety measures must be tested at power on
    and periodically
  • RAM (stuck-at, shorts, cell failures)
  • ROM
  • Flash
  • Disks
  • CPU
  • Interfaces
  • Buses

90
Safety Testing During Operation
  • Built-In Tests
  • Repeats some of POST
  • Data integrity checks
  • Index and pointer validity checking
  • Subrange value invariant assertions
  • Proper functioning
  • Watchdogs
  • Reasonableness checks
  • Lifeticks

91
A Simplified Example: A Linear Accelerator
92
Unsafe Linear Accelerator
[Diagram labels: Beam Intensity, Beam Duration; CPU; Radiation Dose Sensor; steps 1. Set Dose, 2. Start Beam, 3. End Beam]
93
Fault Tree Analysis
[Fault tree diagram]
  • Condition to avoid: Over radiation
  • Gates used: AND, OR
  • Events: Radiation command invalid, CPU Halted,
    Shutoff timer failure, Beam engaged
  • Primary faults: CPU failure, Software defect, EMI
94
Hazards of the Linear Accelerator
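| Hazard                            | Level of risk | Tolerance time T1 | Fault                               | Likelihood | Detection time | Control measure                           | Exposure time |
|-----------------------------------|---------------|-------------------|-------------------------------------|------------|----------------|-------------------------------------------|---------------|
| Over radiation                    | Severe        | 100 ms            | CPU locks up                        | rare       | 50 ms          | Safety CPU checks lifetick at 25 ms       | 50 ms         |
|                                   |               |                   | Corrupt data settings               | often      | 10 ms          | 32 bit CRCs on data checked every access  | 15 ms         |
| Under radiation                   | Moderate      | 2 weeks           | Corrupt data setting                | often      | 10 ms          | 32 bit CRCs on data checked every access  | 15 ms         |
| Inadvertent radiation on power on | Severe        | 100 ms            | Beam left engaged during power down | often      | n/a            | Curtain mechanically shuts at power down  | 0 ms          |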
95
Safe Linear Accelerator
[Diagram labels: CPU and Safety CPU; self test results shared prior to operation; periodic watchdog service; Beam Intensity, Beam Duration; Radiation Dose Sensor; steps 1. Set Dose, 2. Start Beam, 3. End Beam; Deenergize; mechanical shutoff when curtain low]
96
Summary
  • Safety is a system issue
  • It is cheaper and more effective to include
    safety early on than to add it later
  • Safety architectures provide
    programming-in-the-large safety
  • Safe coding rules and detailed design provide
    programming-in-the-small safety