Title: Requirements Analysis and Design Engineering
1Requirements Analysis and Design Engineering
- Southern Methodist University
- CSE 7313
2Module 17 Validating the system
3Agenda
- Traceability
- Validating the system
- Safety critical systems
4Role of traceability
- Ability to trace is a significant factor in quality software implementation
- Tracking relationships and relating them is key in many high assurance processes
- Impact of change is often missed
- Small changes can create significant safety and reliability problems
5Traceability defined
- IEEE: the degree to which a relationship can be
established between two or more products of the
development process, especially products having a
predecessor-successor or master-subordinate
relationship to one another
6Traceability relationship
Vision Document (features)
Traceability link
Actor
SW requirement (use case)
7Traceability relationship
- Additional meanings can be placed on these relationships
- tested by
- traced to
- implemented by
8Implicit vs Explicit
- Explicit traceability: development of relationships stemming from external considerations supplied by the team
- product feature and use case
- Implicit traceability: driven by methodology and structure of the model
- hierarchical requirements have an implicit relationship between parent and related child
9Implicit vs Explicit
- Other implicit examples
- modeling tools in the development process may
provide other traceability relationships (use
cases and actors that interact with the use case)
10Project Relationships
Need
Note: This traceability link is optional, as it can be derived from the link between the product feature and the use case selection. This link is often used to relate the product features to the use cases before the use case selections are written.
Traces to
Product Feature
Traces to
Traces to
Use case
Traces to
Software Requirements
Use case selection
11Additional Traceability Options
- Additional less traditional elements of a project can be traced if they add value
- issues, for unresolved issues
- assumptions and rationales
- action items
- requests for new/revised features
- glossary and acronym terms
- bibliographic references
12Augmented Traceability relationships
Need
Note: This traceability link is optional, as it can be derived from the link between the product feature and the use case section. This link is often used to relate the product features to the use cases before the use case sections are written.
The SW requirements make up the formal SRS, of which the use case model is an interpretation.
Traces to
Product Features
Traces to
SW Rqmts
Traces to
Traces to
Use case
Glossary term
Traces to
Traces to
Traces to
In this case, we are tracing items to the glossary terms, as well as from them, as described when defining glossary terms as one of the supporting traceability types.
Actor
Use case selection
13Verification and traceability
- Must consider whether or not you have correctly and completely considered all of the links that should be established
- Deeper consideration often leads to some revisions
- Should hold formal and informal reviews
- It's not all mechanical processing
14Validating the system
15Validation (IEEE)
- The process of evaluating a system or component during or at the end of the development process to determine whether it satisfies requirements
- Use validation to confirm that the implemented system conforms to the requirements established.
16Acceptance Tests
- Bringing the customer into the final validation process in order to gain assurance that the product works the way the customer really needs it to
- May be part of the contract provisions
- IT environments do this by a customer alpha or beta evaluation
17Acceptance Tests
- Based on a specific number of scenarios that the user specifies and executes in the usage environment
- Freedom to think outside the box
- Construct interesting ways to test the system to gain confidence that the system works as needed
- Based on certain key use cases
18Acceptance Tests
- Apply these use cases in interesting combinations
- under certain types of system load and other environmental factors
- interoperability with other applications
- OS dependencies
- others likely to be present in the user's environment
19Acceptance Tests
- Iterative development environments will have generations of acceptance tests run at various milestones
- Will most often find at least some "undiscovered ruins"
20Validation Testing
- Primary activities in validation are testing activities
- IEEE 829-1983: IEEE Standard for Software Test Documentation
- The development process must include
- planning for test activities
- time and resources to design the tests
- time and resources to execute the tests
21Implementation documentation
User needs
Vision document
SRS package
Requirement specification
Hazard Analysis
Use cases
Implementation units (functions, use case
realizations, modules)
Test protocols
Test suites
22Validation Traceability
- Validation traceability gives confidence that two important goals have been addressed
- 1. Do we have enough tests to cover everything that needs testing?
- 2. Do we have any extra or gratuitous tests that serve no useful purpose?
- Validation focuses on whether the product works as it is supposed to
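The two coverage questions above can be checked mechanically once traceability links between test cases and requirements are recorded. The following is a minimal sketch, assuming the links are kept as a mapping from each test case to the requirements it traces to. The function name and the identifiers (`REQ-1`, `TC-1`, and so on) are hypothetical and do not come from the course material.

```python
def check_validation_traceability(requirements, test_links):
    """Return (untested requirements, gratuitous tests).

    requirements: set of requirement ids that need testing.
    test_links: dict mapping test case id -> set of requirement ids it traces to.
    """
    covered = set()
    for traced in test_links.values():
        covered |= traced
    # Question 1: requirements that no test traces to.
    untested = requirements - covered
    # Question 2: tests that trace to no known requirement.
    gratuitous = {test for test, traced in test_links.items()
                  if not (traced & requirements)}
    return untested, gratuitous


# Hypothetical project data, for illustration only.
requirements = {"REQ-1", "REQ-2", "REQ-3"}
test_links = {
    "TC-1": {"REQ-1"},
    "TC-2": {"REQ-1", "REQ-2"},
    "TC-3": set(),  # traces to nothing
}
untested, gratuitous = check_validation_traceability(requirements, test_links)
print(sorted(untested), sorted(gratuitous))
```

A check like this only finds missing or dangling links; whether a linked test actually exercises the requirement still needs review.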
23Requirements based testing
- Quality can be achieved only by testing the system against its requirements
- Many complex systems will pass all unit tests but fail as a system
- units interact in more complex behaviors
- resulting system has not been adequately tested against the requirements
24Use case and test cases
Test case 2
Test model
Test case 3
Test case 1
(traceability links)
Use case 2
Use case 1
Use cases
25Testing design constraints
- Consider design constraints as requirements
- Include design constraints as part of the validation effort
- Many design constraints will yield to simple inspections
- use an abbreviated test procedure
26Design Constraint validation approaches
27Using ROI to determine effort
- Must perform cost/benefit analysis with respect to V&V activities
- Plan V&V activities based on
- 1. What are the social and economic consequences of a failure of our system?
- 2. How much V&V do we need to do to ensure that we do not experience these consequences?
28V&V depth
- Depth defines the level of V&V effort to be applied to a system element
- the greater the depth, the more resources
- Match depth of the review to the importance of the element
- inspection
- simple test
- extensive white box testing
29V&V depth activities
- Examination: review the code or take some measurements.
- a prescribed and minimally invasive look is taken at the element under test
- minimal depth of review of an element
- Walkthrough: a peer group walks the element through its paces
- process is a structured inspection performed by a wider audience
- search for weaknesses, oversights, etc.
30V&V depth activities
- Independent reviews: an unrelated but knowledgeable group examines the element and searches for weaknesses
- may provide additional insights that were not in the mindset of the project group
31V&V depth activities
- Black box test: treats the element as a module that cannot be internally inspected
- supply inputs to the box and observe the box's outputs to ensure that the element is working to the required standards
- performed via instrumented code or with system emulators and other tools to simulate and record operation
32V&V depth activities
- White box test: allows you to open the box and examine the internal workings of the element
- most modules have too many combinatorial pathways to test in a reasonable amount of time
- apply a reasonable approach that does not take too much time
- coverage instead of combination
33V&V coverage
- Coverage defines the extent of the system elements to be verified and validated
- The amount of traceability and the corresponding level of specificity in the requirements determine coverage
34What to verify and validate
- 1. Verify and validate everything
- smaller projects
- simple and consistent application
- ensures uniformity
- selective V&V can be appropriate if you know what the risks are
- elements are omitted for a good reason
- NOT because you ran out of time or money
35What to verify and validate
- Possible repercussions
- embarrassment over an element not conforming to customer specification
- elements not working properly per the specification
- worst case: an unsafe product that can cause harm to its users
36What to verify and validate
- 2. Use a hazard analysis to determine V&V necessities
- Hazard analysis is the detailed examination of a device from the user and patient perspectives. Its purpose is to detect potential design flaws (possibilities of failure that could cause harm) and to enable manufacturers to correct them before a device is released for use
37What to verify and validate
- Hazard analysis guides the selection of project elements for V&V
- Always perform a hazard analysis for a human safety critical system
38Safety Critical Systems
39Why worry about safety?
- Safety is not discussed in the literature
- Safety is not taught in colleges
- Without training or guidance, embedded systems
are assuming more safety roles every day!
40Examples of safety-related computing systems
- Medical equipment (monitoring and therapy)
- Flight computers
- Automobile braking and engine control
- Chemical process control
- Robotic assembly systems
- a couple dozen deaths each year in Japan because of wayward robots
41Examples of safety-related computing systems
- Military weaponry
- Nuclear power plants
- Financial systems
42Therac-25 Story
- Radiation therapy treatment device
- Released in 1982
- Used S/W to enhance usability and lower cost of production
- Compounding of process, design, and implementation failures led to massive overdoses that killed 3 patients
- Fixing the identified problem did not make the device safer
43Other stories
- First Shuttle launch delayed 2 days because backup computer would not start correctly when an error was discovered
- Patriot missile failed because of clock drift and effectiveness downgraded from 95% to 13%
- 8080-based cement factory process control system mistakenly stacked huge boulders 80 ft above the ground which fell and crushed cars and damaged the building
44Other stories
- Stray electromagnetic interference blamed for 19 robot-inflicted deaths in Japan
- Low energy radiation blamed for several deaths related to reprogramming cardiac pacemakers
- Attempted suicides after incorrect diagnosis of diseases
45Other Stories
- Grady Booch: "The last announcement I want to hear from a pilot when I am flying is, 'Is there a programmer on board?'"
46Errors are systematic faults
- They are designed in
- The FORTRAN line
- DO I=1.10
- rather than
- DO I=1,10
- caused problems for the Project Mercury flight control software. This is a design error. Entire mainframes have been brought down with an inadvertent semicolon!
47What is safety?
- Safety is the freedom from accidents or losses
- Safety is not reliability!
- Reliability is the probability that a system will perform its intended function satisfactorily
- A handgun is a very reliable piece of equipment, but it is not very safe!
- Windows 95 is safe, but not very reliable!
48What is safety?
- Safety is not security!
- Security is protection or defense against attack,
interference, or espionage
49Safety related concepts
- Accident is a loss of some kind, such as injury, death, or equipment damage
- telephone system going down on the East coast was a big loss -> safety in one sense
- Risk is a combination of the likelihood of an accident and its severity
- risk = p(a) × s(a)
50Safety related concepts
- an airplane crashing has a high severity but a very low probability => ultimately low risk
- Hazard is a set of conditions and/or events that leads to an accident
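The risk definition above, the product of an accident's probability and its severity, can be made concrete with a small sketch. The numbers and the severity scale below are hypothetical, chosen only to show how a severe but rare accident can carry less risk than a mild but frequent one.

```python
def risk(probability, severity):
    """Risk as the product of accident probability and severity."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return probability * severity


# Hypothetical figures on an arbitrary severity scale, for illustration only.
airliner_crash = risk(probability=1e-7, severity=1_000_000)
phone_outage = risk(probability=1e-2, severity=1_000)

# The rare, severe accident ends up with the lower risk figure.
print(airliner_crash < phone_outage)
```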
51Other safety related concepts
- A failure is the nonperformance of a system or component, a random fault
- a random failure is one that can be estimated from a pdf (probability density function)
- failures are events
- e.g. a component failure
52Other safety related concepts
- An error is a systematic fault
- a systematic fault is a design error
- Errors are states or conditions
- e.g. a software bug
- A fault is either a failure or an error
53Other safety related concepts
- Safety must be considered in the context of the system, not the component
- It is less expensive and far more cost effective to build in safety early than to try to tack it on later
- The Hazard Analysis ties together hazards, faults, and safety measures
54Eight steps to safety
- Identify the hazards
- Determine the risks
- Define the safety measures
- Create safe requirements
- Create safe designs
- Implement safety
- Assure the safety process
- Test, test, test
55Eight steps to safety
- Identify the hazards
- Determine the risks
- Define the safety measures
- Create safe requirements
- Create safe designs
- Implement safety
- Assure the safety process
- Test, test, test
Safety analysis
Handled at the architectural level and
mechanistic level
56Safety Analysis
- You must identify the hazards of the system
- You must identify the faults that can lead to hazards
- You must define safety control measures to handle hazards
- These culminate in the Hazard Analysis
- The Hazard Analysis feeds into the Requirements Specification
58Hazard Causes
- Release of energy
- electromagnetism (microwave oven)
- radiation (nuclear power plant)
- electricity (electrocution hazard from ECG leads)
- heat (infant warmer)
- kinetic (runaway train)
- Release of toxins
59Hazard Causes
- Interference with life support or other safety-related function
- Misleading safety personnel
- Failure to alarm
- alarming too much (Therac-25): the alarms were ignored and people were killed
60Types of Hazards
- Actions
- inappropriate system actions taken
- F-18 pilot pulling up landing gear
- appropriate system actions not taken
- Timing
- too soon
- too late
- fault latency time
61Types of Hazards
- Sequence
- skipping actions
- actions out of order
- Amount
- too much
- too little
62Example Hazards
- Actions
- incorrectly energizing a medical treatment laser
- failure to engage landing gear
- Timing
- cardiac pacemaker paces too fast
- flight control surface adjusted too slowly
63Example Hazards
- Sequence
- empty the vat, THEN add the reagent
- out of sequence network packets controlling industrial robot
- Amount
- electrocution from muscle stimulator
- too little oxygen delivered to ventilator patient
64Means of Hazard Control
- Obviation: the possibility of the hazard can be removed by being made physically impossible
- use incompatible fasteners to prevent cross connections
- Education: the hazard can be handled by educating the users so that they won't create hazardous conditions through equipment misuse
- don't look down the barrel when cleaning your rifle
65Means of Hazard Control
- Alarming: announcing the hazard to the user when it appears so that they can take appropriate action
- alarming when the heart stops beating
- Interlocks: the hazard can be removed by using secondary devices and/or logic to intercede when a hazard presents itself
- car won't start unless it is in Park
66Means of Hazard Control
- Internal checking: the hazard can be handled by ensuring that a system can detect that it is malfunctioning prior to an incident
- CRC checks data for corruption whenever it is accessed
- Safety equipment
- goggles, gloves
67Means of Hazard Control
- Restricting access to potential hazards so that only knowledgeable users have such access
- using passwords to prevent inadvertently starting service mode
- Labelling
- High Voltage -- DO NOT TOUCH
68Hazard Analysis
Each column answers a question:
- Hazard: the hazardous condition
- Level of risk: how bad if it occurs?
- Tolerance time: how long can it be tolerated?
- Fault: how can this happen?
- Likelihood: how frequently?
- Detection time: how long to discover?
- Control measure: what do you do about it?
- Exposure time: how long is the exposure to hazard?

| Hazard | Level of risk | Tolerance time | Fault | Likelihood | Detection time | Control measure | Exposure time |
|---|---|---|---|---|---|---|---|
| Hypoventilation | Severe | 5 min | Ventilator fans | rare | 30 sec | Independent pressure alarm, action by doctor | 1 min |
| | | | Esophageal intubation | often | 30 sec | CO2 sensor alarm | 1 min |
| | | | User misattaches breathing circuit | often | 0 | Noncompatible mechanical fasteners used | 0 |
| Overpressure | Severe | 250 ms | Release valve failure | rare | 50 ms | Secondary valve opens | 55 ms |
69When is a system safe enough?
- (Minimal) No hazards in the absence of faults
- (Minimal) No hazards in the presence of any single point failure
- a common mode failure is a single point failure that affects multiple channels
- a latent fault is an undetected fault which allows another fault to cause a hazard
- Your mileage may vary depending on the risk introduced by your system
70TUV Single Fault Assessment
71Single Fault Assessment
- T0 is the fault tolerance time for the first fault
- T1 is the time after the first fault at which a second fault is likely (via MTBF)
- For testing used for safety, the test (taking time TT) must be done periodically
- TT < T1 < T0
- This is not always possible
- e.g. RAM tests
72Safety Measures
- You cannot depend on a safety measure that you cannot test!
- CAN bus with 2 nodes provides a CRC on messages checked at the chip level, but the chips provide no way of testing to see if it is working.
- Therefore, it cannot be relied on as a safety measure
73Fail-Safe States
- Off
- Emergency stop -- immediately cut power
- Production stop -- stop after the current task
- Protection stop -- shut down without removing power
- Partial Shutdown
- Degraded level of functionality
74Fail-Safe States
- Hold
- No functionality, but with safety actions taken
- Manual or External control
- Restart (reboot)
76Risk Assessment
- For each hazard
- determine the potential severity
- determine the likelihood of the hazard
- determine how long the user is exposed to the hazard
- determine whether the risk can be removed
77TUV Risk Level Determination Chart
| S | E | G | W3 | W2 | W1 |
|---|---|---|---|---|---|
| S1 | | | 1 | - | - |
| S2 | E1 | G1 | 2 | 1 | - |
| S2 | E1 | G2 | 3 | 2 | 1 |
| S2 | E2 | G1 | 4 | 3 | 2 |
| S2 | E2 | G2 | 5 | 4 | 3 |
| S3 | E1 | | 6 | 5 | 4 |
| S3 | E2 | | 7 | 6 | 5 |
| S4 | | | 8 | 7 | 6 |

Risk parameters
- S: Extent of damage. S1: slight injury; S2: severe irreversible injury to one or more persons, or the death of a single person; S3: death of several persons; S4: catastrophic consequences, several deaths
- E: Exposure time. E1: seldom to relatively infrequent; E2: frequent to continuous
- G: Hazard prevention. G1: possible under certain conditions; G2: hardly possible
- W: Occurrence probability of hazardous event. W1: very low; W2: low; W3: relatively high
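Reading a risk level off the chart above amounts to a table lookup. The sketch below assumes the chart rows read as S1, S2/E1/G1, S2/E1/G2, S2/E2/G1, S2/E2/G2, S3/E1, S3/E2, S4, with columns W3, W2, W1; that row assignment is reconstructed from the slide layout. The names `RISK_GRAPH` and `risk_level` are hypothetical.

```python
# Each row maps (S, E, G) to the levels under (W3, W2, W1); None stands for "-".
RISK_GRAPH = {
    ("S1", None, None): (1, None, None),
    ("S2", "E1", "G1"): (2, 1, None),
    ("S2", "E1", "G2"): (3, 2, 1),
    ("S2", "E2", "G1"): (4, 3, 2),
    ("S2", "E2", "G2"): (5, 4, 3),
    ("S3", "E1", None): (6, 5, 4),
    ("S3", "E2", None): (7, 6, 5),
    ("S4", None, None): (8, 7, 6),
}
W_COLUMN = {"W3": 0, "W2": 1, "W1": 2}


def risk_level(s, e=None, g=None, w="W3"):
    """Look up the risk level; returns None where the chart shows '-'."""
    # Parameters the chart does not use for a given severity are ignored.
    if s in ("S1", "S4"):
        e = g = None
    elif s == "S3":
        g = None
    return RISK_GRAPH[(s, e, g)][W_COLUMN[w]]


# A hazard rated S2, E2, G2, W3.
print(risk_level("S2", "E2", "G2", "W3"))
```

Keeping the chart as data rather than as nested conditionals makes it easy to review the table against the source standard line by line.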
78Sample Risk Assessments
| Device | Hazard | Extent of damage | Exposure time | Hazard prevention | Probability | TUV risk level |
|---|---|---|---|---|---|---|
| Microwave oven | Irradiation | S2 | E2 | G2 | W3 | 5 |
| Pacemaker | Pace too slowly | S2 | E2 | G2 | W3 | 5 |
| | Pace too fast | S2 | E2 | G2 | W3 | 5 |
| Power station | Explosion | S3 | E1 | -- | W3 | 6 |
| Airliner | Crash | S4 | E2 | G2 | W2 | 8 |
80Safety Measures
- Safety measures do one of the following
- remove the hazard
- reduce the risk
- identify the hazard to supervisory control
- The purpose of the safety measure is to ensure
the system remains in a safe state
81Safety Measures
- Adequacy of measures
- safety measures must be able to reliably detect the fault
- safety measures must be able to take appropriate actions

| Component | Fault/Error | Software class (1, 2) | Examples of acceptable measures |
|---|---|---|---|
| Interrupt handling and execution | no interrupt or too frequent interrupt | rq | functional test or time-slot monitoring |
| | no interrupt or too frequent interrupt related to different sources | rq | comparison of redundant functional channels by either reciprocal comparison, independent hardware comparator, or independent time-slot and logical monitoring |
82Risk Reduction
- Identify the fault
- Take corrective action, either
- use redundancy to correct and move on
- feedforward error correction (Hamming codes)
- redo the computational step
- feedback error detection (take corrective action first)
- go to a fail-safe state
83Fault Identification at Run-Time
- Faults must be identified in < T0
- Fault identification requires redundancy
- Redundancy can be in terms of
- channel
- device
- data
- control
Architectural
Detailed design
84Fault Identification at Run-Time
- Redundancy may be either
- homogenous (random faults only)
- does not detect errors
- perform functions the same way on the same thing multiple times
- heterogenous (systematic and random faults)
- includes errors -> present in all channels
- perform processing differently and hopefully you didn't make the same mistake!
85Fault Tree Analysis Symbology
- Conditional event: a condition that must be present to produce the output of a gate
- Intermediate event: an event that results from a combination of events through a logic gate
- Transfer
- Basic event: a basic fault event that requires no further development
- Undeveloped event: a fault event that is not developed further because the event is inconsequential or the necessary information is not available
- AND gate (also OR gate)
- Normal event: an event that is expected to occur normally
- NOT gate
86Subset of Pacemaker Fault Analysis
Pacing too slowly
Condition or event to avoid
Secondary conditions or events
OR
Shutdown fault
Time-base fault
Invalid pacing rate
OR
AND
OR
AND
Crystal failure
Watchdog failure
Bad command rate
Data corrupted in vivo
Software failure
CPU H/W failure
Rate command corrupted
CRC hardware failure
Primary or fundamental faults
88Safe Requirements
- Requirements specification follows initial hazard analysis
- Specific requirements should track back to hazard analysis
- must be shown to FDA, etc.
- Architectural framework should be selected with safety needs in mind
- has the hooks in place
90Use Good Design Practices
- Good design practices allow you to
- manage complexity
- view the system at various levels of abstraction
- zoom in on a particular area of interest
- identify hot spots of special concern
- have consistent quality
- easily test
- build and use high quality components
- Regulatory agencies look at this!!
91Use Good Design Practices
- Manage your requirements
- trace requirements to design elements
- trace design elements back to requirements
remote communications
adjust trajectory
class a
class b
remote communication
requirements specification
class c
class d
class e
use cases
design model
92Use Good Design Practices
- Use iterative development
- integrating many times finds more defects
- iterative prototypes can result in more reliable
and safe systems
93Use Good Design Practices
- Use component-based design architectures
- third party components may be very well tested if they are in wide use
- require bug lists from component vendors
- this bit Microsoft once
94Use Good Design Practices
- Use Visual Modeling
- UML
- Ward-Mellor
- Use executable models
- animate models
- execute and debug at modeling level of abstraction
95Use Good Design Practices
- Use frameworks
- a framework is a partially completed application which is specialized by the user
- Microsoft Foundation Classes
- Object Execution Framework
- frameworks reduce the work of developing new applications
- frameworks rely on well-tested patterns
96Use Good Design Practices
User Model
Framework
80-90% of application code is housekeeping code
System
97Use Good Design Practices
- Use Configuration Management
- only use unit-tested components in builds
parameters
data acquisition
SYSTEM
CM Database
drivers
OS
98Use Good Design Practices
- Design for test
- product testing
- built-in-testing to ensure
- invariants are truly invariant
- functional invariants
- quality of service invariants (e.g. performance)
- faults are detected
99Good Design Practices
- Isolate Safety Functions
- Safety-relevant systems are 200-300% more effort to produce
- Isolation of safety systems allows more expedient development
- Care must be taken that the safety system is truly isolated so that a defect in the non-safety system cannot affect the safety system
- different processor
- different heavy-weight tasks (depends on the OS)
100Safety Architecture Patterns
- Protected Single-Channel Pattern
- Dual-Channel Pattern
- Homogenous Dual Channel Pattern
- Heterogenous Peer-Channel Pattern
- Sanity Check Pattern
- Monitor-Actuator Pattern
- Voting Multichannel Pattern
101Protected Single Channel Pattern
- Within the single channel, mechanisms exist to identify and handle faults
- All faults must be detected within the fault tolerance time
- May be impossible
- to test for all faults within the fault tolerance time
- to remove common mode failures from the single channel
- Generally, less recurring system cost
- no additional hardware required
102Protected Single Channel Pattern
"If I'm not getting life ticks, I'll shut down!"
Single Channel Train Braking System
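The life-tick idea on this slide can be sketched as a small watchdog: the channel must report in regularly, and a missing tick forces the fail-safe state. This is a minimal illustration, assuming time is passed in explicitly so that the behavior is deterministic. The class name, the 0.5 second timeout, and the latching behavior are assumptions, not details from the slide.

```python
class LifeTickWatchdog:
    """Shuts a channel down if life ticks stop arriving within a timeout."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_tick = None
        self.shut_down = False

    def life_tick(self, now):
        """Called by the monitored channel to show it is still alive."""
        self.last_tick = now

    def check(self, now):
        """Call periodically; returns True once the fail-safe state is entered."""
        if self.last_tick is None or now - self.last_tick > self.timeout_s:
            self.shut_down = True  # e.g. apply the brakes; stays latched
        return self.shut_down


wd = LifeTickWatchdog(timeout_s=0.5)
wd.life_tick(now=0.0)
ok_at_0_3 = wd.check(now=0.3)       # tick is recent enough
tripped_at_1_0 = wd.check(now=1.0)  # no tick for more than 0.5 s
print(ok_at_0_3, tripped_at_1_0)
```

Treating "never received a tick" as a fault is the conservative choice here; a real system would also need a way to test the watchdog itself.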
103Dual Channel Architecture Patterns
- Separation of safety-relevant from non-safety-relevant where possible
- Separation of monitoring from control
- Generally easier to meet safety requirements
- timing
- common mode failures
- Generally higher recurring system cost
- additional hardware required
104Homogenous Dual-Channel Pattern
- Identical channels used
- Channels may operate simultaneously (Multichannel Vote Pattern)
- Channels may operate in series (Backup Pattern)
- Good at identifying random faults but not systematic faults
- Low R&D cost, higher recurring cost
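When identical channels run simultaneously, their outputs can be compared by a voter. The sketch below is one way this might look, assuming three channels and a strict-majority rule. The function names, the channel computation, and the simulated bit flip are all hypothetical.

```python
from collections import Counter


def vote(channel_outputs):
    """Return the majority value, or raise if no strict majority exists."""
    value, count = Counter(channel_outputs).most_common(1)[0]
    if count * 2 <= len(channel_outputs):
        raise RuntimeError("no majority: go to a fail-safe state")
    return value


def compute_rate(x):
    """Hypothetical computation performed identically by every channel."""
    return 2 * x + 1


# Three identical channels; one suffers a simulated random fault (bit flip).
outputs = [compute_rate(10), compute_rate(10), compute_rate(10) ^ 0x4]
voted = vote(outputs)
print(outputs, voted)
```

Because every channel runs the same code, a design error in `compute_rate` would appear in all three outputs and win the vote, which is why this arrangement catches random faults but not systematic ones.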
105Homogenous Dual-Channel Pattern
106Heterogenous Peer-Channel Pattern
- Equal weight, differently implemented channels
- may use algorithmic inversion to recreate initial data
- may use different algorithm
- may use different teams (not foolproof because of hot spots that can cause failures)
- Good at identifying both random and systematic faults
107Heterogenous Peer-Channel Pattern
- Generally safest, but higher R&D and recurring cost
108Heterogenous Peer-Channel Pattern
109Sanity Check Pattern
- A primary actuator channel does real computations
- A light-weight secondary channel checks the reasonableness of the primary channel
- Good for detection of both random and systematic faults
- May not detect faults which result in small variance
- Relatively inexpensive to implement
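A sanity check channel does far less work than the primary channel: it only asks whether the result is plausible. The following is a minimal sketch, assuming a speed computation and a hypothetical plausibility band of 0 to 100 m/s. None of the names or limits come from the slides.

```python
import math


def primary_channel(distance_m, time_s):
    """Primary channel: does the real computation (speed in m/s)."""
    return distance_m / time_s


def sanity_check(speed_mps, max_speed_mps=100.0):
    """Lightweight secondary channel: only checks reasonableness."""
    return math.isfinite(speed_mps) and 0.0 <= speed_mps <= max_speed_mps


def checked_speed(distance_m, time_s):
    """Run the primary channel and reject implausible results."""
    speed = primary_channel(distance_m, time_s)
    if not sanity_check(speed):
        raise RuntimeError("implausible speed: enter fail-safe state")
    return speed


print(checked_speed(50.0, 2.0))
```

A result that is wrong but still inside the band passes the check, which is the "small variance" limitation noted above.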
110Monitor-Actuator Pattern
- Separates actuation from the monitoring of that actuation
- If the actuator channel fails, the monitor channel detects it
- If the monitor channel fails, the actuator channel continues correctly
- Requires fault isolation to be single-fault tolerant
- actuator channel cannot use the monitor itself
111Monitor-Actuator Pattern
112Dual-Channel Design Architecture
113Safety Executive Pattern
- Large scale architectural pattern
- Controller subsystem (safety executive)
- One or more watchdog subsystems
- check on system health
- ensure proper actuation is occurring
- One or more actuation channels
- Recovery subsystem (Fail safe processing channel)
114Safety Executive Pattern
- Appropriate when
- A set of fail-safe system states needs to be entered when failures are identified
- Determination of failures is complex
- Several safety-related system actions are controlled simultaneously
- Safety-related actions are not independent
- Determining proper safety action in the event of
a failure can be complex
115Safety Executive Pattern
117Detailed Design for Safety
- Make it right before you make it fast
- simple, clear algorithms and code
- optimize only the 10-20% of code which affects performance
- use safe language subsets
- ensure you haven't introduced any common mode failures
118Detailed Design for Safety
- Thoroughly test
- unit test and peer review
- integration test
- validation test
119Detailed Design for Safety
- Verify that it remains right throughout program execution
- exceptions
- invariant assertions
- range checking
- index and boundary checking
- When it's not right during execution, then make it right with corrective or protective measures
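The run-time checks listed above (invariant assertions, range checking, index and boundary checking, and a protective measure when a value is out of range) can be combined in a few lines. This is a sketch only: the buffer class, the clamping policy, and the 30 to 180 limits are hypothetical illustration values, not clinical or course-specified figures.

```python
def clamp_pacing_rate(requested_bpm, low=30, high=180):
    """Range check with a protective measure: clamp into the allowed band."""
    return max(low, min(high, requested_bpm))


class RateBuffer:
    """Fixed-size buffer with index/boundary checks and an invariant."""

    def __init__(self, size):
        self._data = [0] * size
        self._count = 0

    def _check_invariant(self):
        # Invariant: the fill count never leaves 0..capacity.
        assert 0 <= self._count <= len(self._data), "count out of bounds"

    def append(self, bpm):
        if self._count >= len(self._data):
            raise IndexError("buffer full")
        self._data[self._count] = clamp_pacing_rate(bpm)
        self._count += 1
        self._check_invariant()

    def get(self, index):
        if not 0 <= index < self._count:
            raise IndexError(f"index {index} outside 0..{self._count - 1}")
        return self._data[index]


buf = RateBuffer(2)
buf.append(500)  # out of range, clamped to the upper limit
buf.append(10)   # out of range, clamped to the lower limit
print(buf.get(0), buf.get(1))
```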
120Detailed Design for Safety
- Use safe language subsets
- strong compile-time checking
- if you use C, use lint
- strong run-time checking
- exception handling
- avoid void*
- avoid error prone statements and syntax
- you can make C++ safe, but it's not safe out of the box
121Detailed Design for Safety
- Language choice
- Compile time checking (C versus Ada)
- Run-time checking (C versus Ada)
- Exceptions versus error codes
- Language selection
- C treats you like a consenting adult. Pascal
treats you like a naughty child. Ada treats you
like a criminal
122Pascal example
program WontCompile;
type
  MySubRange = 10 .. 20;
  Day = (Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday);
var
  MyVar: MySubRange;
  MyDate: Day;
begin
  MyVar := 9;   { will not compile -- range error! }
  MyDate := 0;  { will not compile -- wrong type! }
end.
123Ada example
procedure MyProc is
   type Byte is range 0 .. 255;
   MyArray : array (1 .. 10) of Integer;
   j : Integer;
   b : Byte;
begin
   for j in 0 .. 10 loop
      MyArray(j) := j * 6;  -- raises exception on first time through (index 0)
   end loop;
   b := Byte(MyArray(10));  -- conversion is range-checked at run time
end MyProc;
124Exceptions
- Some languages (Pascal, Modula-2) have a draconian error handling policy
- exception raised and program terminated
- not good for embedded systems
- Ada and C++ allow run time recovery through user-defined exceptions and exception handlers
125Exceptions
- A lot of extra code to check the statement
- a[j] = b
126Detailed Design for Safety
- Do not allow ignoring of error indications
- checking of return values is a manual process
- user of the function must remember each and every time
- easy to circumvent this error handling system
- Separate normal code from error handling code
127Detailed Design for Safety
- Handle errors at the lowest level with sufficient context to correct the problem
128Error handling code
a = getfone(b, c);
if (a)
    switch (a) {
        case 1: ..
        case 2: ..
    }
d = getftwo(b, c);
if (d)
    switch (d) {
        case 1: ..
        case 2: ..
    }
In this code the normal execution path is only
a = getfone(b, c);
d = getftwo(b, c);
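With exceptions, the same two calls can be written so that the normal path reads straight through and the error handling sits in one place. The sketch below reuses the names `getfone` and `getftwo` from the slide, but their bodies, the `SensorError` class, and the failure conditions are invented for illustration.

```python
class SensorError(Exception):
    """Hypothetical error raised by the two calls below."""


def getfone(b, c):
    if c == 0:
        raise SensorError("getfone: divisor is zero")
    return b / c


def getftwo(b, c):
    if b < 0:
        raise SensorError("getftwo: negative input")
    return b + c


def compute(b, c):
    try:
        # Normal execution path, uninterrupted by checks.
        a = getfone(b, c)
        d = getftwo(b, c)
        return a, d
    except SensorError as err:
        # Error handling kept apart from the normal path.
        return None, str(err)


print(compute(6, 3))
print(compute(1, 0))
```

Unlike a return code, an exception cannot be silently ignored by the caller: if nothing handles it, it propagates.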
129Built-in exception types
procedure enqueue (q : in out queue; v : in FLOAT) is
begin
   if full (q) then
      raise overflow;
   end if;
   q.body((q.head + q.length) mod qSize) := v;
   q.length := q.length + 1;
end enqueue;
130Caller of the sequence handles exception
procedure testQ (q : in out queue) is
begin
   for j in 1 .. 10 loop
      enqueue(q, random(1000));
   end loop;
exception
   when overflow =>
      puts("Test failed due to queue overflow");
end testQ;
131C++ exception handling
- Extends capabilities beyond that of Ada
- Exceptions extended by type rather than value
- possible to create hierarchies of exception classes and catch by thrown subclass type
- class can contain different types of information about the kind of device that failed
- this facilitates error recovery, debugging, and user error reporting
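Catching by exception type, with the exception object carrying information about the failed device, can be illustrated as follows. The sketch is in Python rather than the language the slide discusses, and the class names, devices, and recovery messages are hypothetical.

```python
class DeviceFault(Exception):
    """Base class: carries which device failed and how."""

    def __init__(self, device, detail):
        super().__init__(f"{device}: {detail}")
        self.device = device
        self.detail = detail


class SensorFault(DeviceFault):
    pass


class ActuatorFault(DeviceFault):
    pass


def handle(action):
    try:
        action()
        return "ok"
    except SensorFault as fault:
        # Specific subclass: a targeted recovery is possible.
        return f"switch to backup sensor for {fault.device}"
    except DeviceFault as fault:
        # Any other device fault falls back to the general handler.
        return f"enter fail-safe state ({fault})"


def bad_sensor():
    raise SensorFault("CO2 sensor", "no signal")


def bad_valve():
    raise ActuatorFault("release valve", "stuck closed")


print(handle(bad_sensor))
print(handle(bad_valve))
```

The handler order matters: the subclass handler must come before the base class handler, or it would never be reached.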
132Making C++ safe
- Overloading the [] operator with index range checking improves the safety of arrays
- Making classes of scalars and overloading the assignment operator allows additional range and value checking
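The same two techniques, a checked index operator and a scalar whose assignment is range checked, can be sketched in Python, where the index operator is customized through `__getitem__` and `__setitem__` and assignment through a property setter. The class names are hypothetical, and this is an analogue of the idea rather than the slide's own code.

```python
class CheckedArray:
    """Array whose index operator rejects anything outside 0..size-1."""

    def __init__(self, size, fill=0):
        self._items = [fill] * size

    def _check(self, index):
        # Plain Python lists accept negative indexes; this wrapper does not.
        if not isinstance(index, int) or not 0 <= index < len(self._items):
            raise IndexError(f"index {index!r} out of range")

    def __getitem__(self, index):
        self._check(index)
        return self._items[index]

    def __setitem__(self, index, value):
        self._check(index)
        self._items[index] = value


class BoundedInt:
    """Scalar whose assignment is range checked."""

    def __init__(self, low, high, value):
        self.low, self.high = low, high
        self.value = value  # goes through the checked setter

    @property
    def value(self):
        return self._value

    @value.setter
    def value(self, new_value):
        if not self.low <= new_value <= self.high:
            raise ValueError(f"{new_value} outside {self.low}..{self.high}")
        self._value = new_value


arr = CheckedArray(3)
arr[2] = 7
rate = BoundedInt(10, 20, 15)
print(arr[2], rate.value)
```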
133Detailed Design for Safety
- Data Validity Checks
- CRC (16 bit or 32 bit)
- identifies all single or dual bit errors
- detects high percentage of multiple bit errors
- table or compute-driven
- chips are available
- checksum
- redundant storage
- ones complement
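The difference between a compute-driven and a table-driven CRC can be illustrated with the widely used CRC-32 (reflected polynomial 0xEDB88320). The slides do not say which polynomial is intended, so that choice and the function names below are assumptions of this sketch.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Compute-driven CRC-32: processes the input one bit at a time.
std::uint32_t crc32Bitwise(const std::uint8_t* data, std::size_t len) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

// Table-driven variant: precompute the effect of each possible byte value.
std::array<std::uint32_t, 256> makeCrcTable() {
    std::array<std::uint32_t, 256> table{};
    for (std::uint32_t n = 0; n < 256; ++n) {
        std::uint32_t c = n;
        for (int bit = 0; bit < 8; ++bit) {
            c = (c & 1u) ? (c >> 1) ^ 0xEDB88320u : (c >> 1);
        }
        table[n] = c;
    }
    return table;
}

// Processes the input one byte at a time using the precomputed table.
std::uint32_t crc32Table(const std::uint8_t* data, std::size_t len) {
    static const std::array<std::uint32_t, 256> table = makeCrcTable();
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc = table[(crc ^ data[i]) & 0xFFu] ^ (crc >> 8);
    }
    return crc ^ 0xFFFFFFFFu;
}
```

The table costs 1 KB of memory and replaces eight shift-and-XOR steps per byte with one lookup, which is the usual trade-off on small embedded targets.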
134Detailed Design for Safety
- Redundancy should be set on every write access
- Data should be checked on every read access
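A minimal sketch of this rule, using ones-complement redundant storage: ProtectedU32 and its seedFault test hook are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Stores a value together with its ones-complement copy. Every write
// refreshes the redundant copy; every read verifies it.
class ProtectedU32 {
public:
    explicit ProtectedU32(std::uint32_t v = 0) { set(v); }

    void set(std::uint32_t v) {
        value_ = v;
        inverse_ = ~v;
    }

    std::uint32_t get() const {
        if (value_ != static_cast<std::uint32_t>(~inverse_)) {
            throw std::runtime_error("ProtectedU32: stored value is corrupt");
        }
        return value_;
    }

    // Hypothetical test hook: flips bits in the primary copy only,
    // imitating a memory fault for fault-seeding tests.
    void seedFault(std::uint32_t mask) { value_ ^= mask; }

private:
    std::uint32_t value_;
    std::uint32_t inverse_;
};
```

Because the check runs inside get, callers cannot read the value without validating it, and a seeded fault shows up on the very next access.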
135ANSI C++ Exception Class Hierarchy
- exception
- logic error: domain error, invalid argument, length error, out of range
- runtime error: range error, overflow error
136Eight steps to safety
- Identify the hazards
- Determine the risks
- Define the safety measures
- Create safe requirements
- Create safe designs
- Implement safety
- Assure the safety process
- Test, test, test
137Safety Process (Development)
- Do Hazard Analysis early and often
- Track safety measures from hazard analysis to
- requirements specification
- design
- code
- validation tests
- Test safety measures with fault seeding
138Safety Process (Deployment)
- Install safely
- ensure proper means are used to set up system
- safety measures are installed and checked
- Deploy safely
- ensure safety measures are periodically checked
and serviced
- Do not turn off safety measures
- Decommission safely
- removal of hazardous materials
139IEC Overall Safety Lifecycle
- Concept
- Overall scope definition
- Hazard and risk analysis
- Overall safety requirements
- Safety requirements allocation
- Overall planning: operation and maintenance planning, validation planning, installation planning
- SRS E/E/PES realization
- SRS other technology realization
- External risk reduction facilities
- Overall installation and commissioning
- Overall safety validation
- Overall operation and maintenance
- Overall modification and retrofit
- Decommissioning
SRS: Safety Related System. E/E/PES: Electrical/Electronic/Programmable Electronic System.
140Eight steps to safety
- Identify the hazards
- Determine the risks
- Define the safety measures
- Create safe requirements
- Create safe designs
- Implement safety
- Assure the safety process
- Test, test, test
141Safety Testing in R&D
- Use fault-seeding
- Unit (class) testing
- white box
- procedural invariant violation assertions
- peer reviews
- Integration testing
- grey box
- Validation testing
- black box
- externally caused faults
- (Grey box) internally seeded faults
142Safety Testing During Operation
- Power on Self-Test (POST)
- Check for latent faults
- All safety measures must be tested at power on
and periodically
- RAM (stuck-at, shorts, cell failures)
- ROM
- Flash
- Disks
- CPU
- Interfaces
- Buses
143Safety Testing During Operation
- Built-In Tests
- Repeats some of POST
- Data integrity checks
- Index and pointer validity checking
- Subrange value invariant assertions
- Proper functioning
- Watchdogs
- Reasonableness checks
- Lifeticks
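The lifetick and reasonableness-check ideas can be sketched as follows. LifetickMonitor and doseIsReasonable are hypothetical names, and time is passed in as a millisecond counter (an assumption that keeps the sketch independent of any particular timer hardware).

```cpp
#include <cassert>
#include <cstdint>

// Monitor side of a lifetick scheme: the monitored task calls tick() with
// the current time, and the monitor calls expired() periodically.
class LifetickMonitor {
public:
    LifetickMonitor(std::uint32_t timeoutMs, std::uint32_t nowMs)
        : timeoutMs_(timeoutMs), lastTickMs_(nowMs) {}

    void tick(std::uint32_t nowMs) { lastTickMs_ = nowMs; }

    // Unsigned subtraction stays correct across timer wrap-around.
    bool expired(std::uint32_t nowMs) const {
        return static_cast<std::uint32_t>(nowMs - lastTickMs_) > timeoutMs_;
    }

private:
    std::uint32_t timeoutMs_;
    std::uint32_t lastTickMs_;
};

// Reasonableness check: reject values outside a plausible operating band.
bool doseIsReasonable(double dose, double minDose, double maxDose) {
    return dose >= minDose && dose <= maxDose;
}
```

In a real system the monitor would run on independent hardware, such as a separate safety processor or a hardware watchdog, so that a halted main CPU cannot also halt the check.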
144A Simplified Example: A Linear Accelerator
145Unsafe Linear Accelerator
[Figure: block diagram with a single CPU, beam intensity and beam duration controls, and a radiation dose sensor. Operating sequence: 1. Set Dose, 2. Start Beam, 3. End Beam]
146Fault Tree Analysis
[Figure: fault tree with top event "Over radiation". AND and OR gates combine the events: radiation command invalid, CPU halted, shutoff timer failure, beam engaged, CPU failure, software defect, EMI]
147Hazards of the Linear Accelerator
Hazard | Level of risk | Tolerance time T1 | Fault | Likelihood | Detection time | Control measure | Exposure time
Over radiation | Severe | 100 ms | CPU locks up | rare | 50 ms | Safety CPU checks lifetick at 25 ms | 50 ms
 | | | Corrupt data settings | often | 10 ms | 32-bit CRCs on data, checked every access | 15 ms
Under radiation | Moderate | 2 weeks | Corrupt data setting | often | 10 ms | 32-bit CRCs on data, checked every access | 15 ms
Inadvertent radiation on power on | Severe | 100 ms | Beam left engaged during power down | often | n/a | Curtain mechanically shuts at power down | 0 ms
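148Safe Linear Accelerator
[Figure: block diagram of the safe design. A safety CPU is added alongside the CPU; self-test results are shared prior to operation; the watchdog is serviced periodically; the beam can be deenergized; a mechanical shutoff acts when the curtain is low. Beam intensity and beam duration controls, the radiation dose sensor, and the sequence 1. Set Dose, 2. Start Beam, 3. End Beam remain]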
149Summary
- Safety is a system issue
- It is cheaper and more effective to include
safety early on than to add it later
- Safety architectures provide programming-in-the-large safety
- Safe coding rules and detailed design provide
programming-in-the-small safety
150End of module