CS444A: Software for Critical Systems - PowerPoint PPT Presentation

Slides: 28
Provided by: carla3

1
CS444A: Software for Critical Systems
2
Staff
  • Prof. David L. Dill
  • Prof. Armando Fox

3
Topic
  • The engineering of software for applications
    where failure is unacceptable
  • . . . for some value of "failure" and
    "unacceptable."
  • Costs of failure exceed value of the software

4
Critical software is growing in importance
  • Computers are getting exponentially smaller,
    cheaper, faster, and better connected.
  • Communications are improving at least as fast.
  • Increased use of critical software is
    irresistible
  • Automation of tasks that were previously manual
    or infeasible.
  • Sophisticated control replacing simple control.
  • Replacing mechanical, analog, digital hardware.

5
Software is growing
  • Software will replace mechanical, analog, and
    digital hardware
  • Cheaper to copy.
  • Easier to manufacture.
  • Easier to upgrade.
  • Provides more functionality.
  • Software will replace manual processes
  • Cheaper and more reliable than human workers
  • Relieves them of tedious tasks
  • Faster and more predictable

6
Complexity is increasing
  • COTS is coming to software
  • Large projects increasingly use commercial
    off-the-shelf components
  • Commodity hardware, OSs, tools, other building
    blocks
  • Example: Mars Pathfinder
  • This is good and bad
  • COTS reduces development cost and development
    time
  • Sophisticated building blocks allow creation of
    more complex systems
  • But they are often brittle: intra-component and
    inter-component failure modes are poorly
    understood
  • Composition of pieces that were designed
    separately sometimes leads to unexpected failure
    modes

7
Software will be used in safety-critical
applications
  • All of the above reasons (esp. cost)
  • Software can make systems safer
  • TCAS - Aircraft collision avoidance system
  • Software can enhance system performance
  • Fly-by-wire
  • Antilock braking
  • Software can perform life-saving functions
  • Computer-controlled pacemakers

9
Subtopics
  • Successful engineering of software encompasses
    many different issues
  • Relationship of software to the larger system
  • Software development processes
  • Software design
  • Algorithms
  • Programming practices

10
Goal: Best of Both Worlds
  • Traditional safety-engineering perspective
  • Formal verification, requirements specification,
    related formal methods
  • Traditional hazard/fault analysis
  • Fault tolerance
  • Systems perspective
  • Design techniques and programming practices
  • As much folklore as formal
  • Especially recent experience in Internet-scale
    mission-critical systems

11
Formal Methodology Outline
  • Safety engineering of systems
  • Hazard identification
  • Hazard avoidance
  • Standards
  • Requirements specification and tools
  • Specification for reactive systems
  • Model checking
  • Logical specification (Z, VDM?)
  • Theorem proving
  • Fault tolerance
  • Fault models
  • Fault tolerant protocols
  • Etc.

12
The Case for the Systems Perspective
  • Many visible success stories
  • The Internet
  • Mars Pathfinder
  • Gargantuan-scale 24x7 mission-critical systems:
    Wal-Mart, financial exchanges, Visa, CIRRUS
    banking network
  • Some spectacular failures
  • Therac-25 (today)
  • System design combines engineering judgment and
    folklore with formal methodology

13
The Role of the Internet
  • The distributed system from hell
  • Evolved over >25 years, lots of legacy code
    layers
  • Widely distributed, both geographically and
    administratively
  • Transient failure (hardware and software) is a
    way of life
  • Yet it mostly works... What great ideas can we
    steal?
  • The Internet is a good testbed for new approaches
    to reliability
  • Internet scale implies large size, exponential
    growth, and 24x7 operational requirements
  • People don't die (usually) when systems go down
  • Strong financial incentive spurs industrial
    deployment :-)

14
Systems Track Outline
  • Conceptual vocabulary, research landscape
  • Fault isolation, fault containment, orthogonal
    guard mechanisms
  • Transactions, replication, consistency
  • State maintenance
  • Availability vs. consistency tradeoffs, harvest
    and yield
  • Application-level vs. OS-level mechanisms
  • Systems case studies
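
The "harvest and yield" item can be made concrete with a small sketch. It assumes the usual definitions (yield is the fraction of offered queries that complete; harvest is the fraction of the data reflected in an answer) and a deliberately simple model: data and load are spread uniformly over the nodes. The function names and the two policy labels are illustrative, not taken from the course material.

```python
def yield_fraction(completed_queries, offered_queries):
    """Yield: fraction of offered queries that were completed."""
    if offered_queries <= 0:
        raise ValueError("offered_queries must be positive")
    return completed_queries / offered_queries


def harvest_fraction(data_available, data_total):
    """Harvest: fraction of the full data set reflected in an answer."""
    if data_total <= 0:
        raise ValueError("data_total must be positive")
    return data_available / data_total


def after_node_loss(total_nodes, failed_nodes, policy):
    """Return (harvest, yield) after some nodes fail.

    Assumed model (illustrative only): data and query load are spread
    uniformly over the nodes.
      - "reduce_harvest": answer every query from the surviving data.
      - "reduce_yield":   refuse the share of queries the failed nodes
                          would have served, answer the rest completely.
    """
    surviving = total_nodes - failed_nodes
    if policy == "reduce_harvest":
        return (harvest_fraction(surviving, total_nodes), 1.0)
    if policy == "reduce_yield":
        return (1.0, yield_fraction(surviving, total_nodes))
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    print(after_node_loss(100, 2, "reduce_harvest"))
    print(after_node_loss(100, 2, "reduce_yield"))
```

Either policy loses the same 2% in this model; the design question is whether users are better served by slightly incomplete answers or by occasional refusals.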

15
Goals
  • Identify recurrent design philosophies that work
    well
  • Taxonomize the folklore in software systems
    design
  • Identify fertile crossover areas to the formal
    world

16
Example: Software failures in the Therac-25
17
Motivation
  • The "Therac-25" is a classic case study in
    engineering failure -- like Tacoma Narrows
    bridge, Challenger disaster, etc.
  • Illustrates many problems and issues of software
    safety.
  • Shows how not to do it.
  • Related to assignment.

18
The Machine
  • The Therac-25 is a linear accelerator used for
    radiation therapy (e.g. cancer treatment).
  • Safety issues
  • Overdose: patient is injured or dies from
    radiation burns.
  • Underdose: serious disease is not treated
    properly; patient may be injured or die because
    of this.
  • Therac-25 much more dependent on software for
    safety than its predecessors (Therac-20,
    Therac-6)
  • "Hardware interlocks" replaced by software.

19
Technical details
  • Multi-mode machine: electrons, X-rays (photons).
  • X-rays generated when electron beam collides with
    target.
  • - This is inefficient, so the electron beam
    must be very powerful.
  • Different modes require turntable to be properly
    positioned with targets, spreaders, etc. between
    beam and patient.

20
Accidents
  • Machine reliably treated thousands of patients,
    but occasionally weird things would happen.
  • There were at least 6 accidents.
  • Kennestone 1985
  • Patient treated for breast cancer is
    unexpectedly burned.
  • Est. 15K-20K rad dose (500 rad to whole body
    is 50% fatal).
  • Patient lost breast, shoulder and arm
    paralyzed.
  • Patient sued, settled out of court.
  • FDA not informed until much later.

21
Another accident
  • Tyler 1986
  • Patient to be treated with electron beam.
  • Operator said to treat with X-ray, then
    corrected.
  • Patient felt "electric shock."
  • Operator saw "malfunction 54" and under-dose
    reading, so said "proceed" to zap patient
    again.
  • Patient overdosed a second time (in arm) as he
    was trying to escape.
  • Patient died horribly of radiation overdose 5
    months later.

22
Software issues
  • No locks on shared variables (race conditions).
  • Control flow bug: some newly entered data can be
    ignored.
  • Timing sensitivity in user interface.
  • Wrap-around on counters.
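
Two of the issues above, unlocked shared variables and counter wrap-around, can be sketched in a few lines. This is an illustration under stated assumptions, not the Therac-25 code: the class names, the one-byte flag, and the mode and dose values are all hypothetical. The first class shows a "check needed" flag that is incremented instead of set, so it silently returns to zero on every 256th call; the second shows settings guarded by a lock so a reader never sees a half-updated pair.

```python
import threading


class OneByteFlag:
    """Hypothetical 8-bit 'safety check needed' flag."""

    def __init__(self):
        self.value = 0

    def mark_check_needed_buggy(self):
        # Incrementing a one-byte variable: only the low 8 bits survive,
        # so the 256th increment wraps the flag back to 0 ("no check").
        self.value = (self.value + 1) & 0xFF

    def mark_check_needed_fixed(self):
        # Setting a fixed nonzero value cannot wrap around.
        self.value = 1

    def check_needed(self):
        return self.value != 0


class SharedSettings:
    """Hypothetical treatment settings shared between tasks."""

    def __init__(self, mode, dose):
        self._lock = threading.Lock()
        self._mode = mode
        self._dose = dose

    def update(self, mode, dose):
        # Both fields change under one lock, so they change together.
        with self._lock:
            self._mode = mode
            self._dose = dose

    def snapshot(self):
        # Readers take the same lock and get a consistent pair.
        with self._lock:
            return (self._mode, self._dose)


if __name__ == "__main__":
    flag = OneByteFlag()
    for _ in range(256):
        flag.mark_check_needed_buggy()
    print("check needed after 256 buggy marks:", flag.check_needed())

    settings = SharedSettings("xray", 100)
    settings.update("electron", 1)
    print(settings.snapshot())
```

The lock alone does not fix a control-flow bug in which an edit arrives after the settings were already read; that case also needs the consumer to re-read or re-validate before acting.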

23
User interface issues
  • Malfunction 54 (patient might have received
    overdose or under-dose).
  • No indication about patient safety with error
    messages.
  • Proceed button continues after error message
  • - one patient overdosed twice.

24
System issues
  • Inadequate mechanical checks on turntable
  • - 3 microswitches for position sensing.
  • - 1-bit error in encoding makes position
    inaccurate.
  • - potentiometer installed later to sense
    position.
  • No independent hardware to suppress beam.
  • Dosage measurement devices (ion chambers) report
    inaccurate results for very high doses.
  • Therac-20 had same bugs, but no accidents because
    of independent protective systems.
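
The microswitch point can be illustrated with Hamming distance. The sketch below assumes three switches read as a 3-bit code and three turntable positions; the position names and both code tables are hypothetical, since the slides do not give the actual encoding. If two valid codes differ in only one bit, a single faulty switch turns one valid position into another; if every pair of valid codes differs in at least two bits, any single-bit error produces an invalid code that can be detected.

```python
from itertools import combinations


def hamming_distance(a, b):
    """Number of bit positions in which two codes differ."""
    return bin(a ^ b).count("1")


def min_distance(codes):
    """Smallest Hamming distance between any two valid codes."""
    return min(hamming_distance(a, b)
               for a, b in combinations(codes.values(), 2))


def decode(reading, codes):
    """Return the position for a switch reading, or None if invalid."""
    for position, code in codes.items():
        if code == reading:
            return position
    return None


# Hypothetical encodings, for illustration only.
FRAGILE = {"electron": 0b001, "xray": 0b010, "field_light": 0b011}
ROBUST = {"electron": 0b011, "xray": 0b101, "field_light": 0b110}


def misread_positions(codes, bits=3):
    """List (true, seen) pairs where one flipped bit yields a
    different valid position instead of an invalid code."""
    result = []
    for position, code in codes.items():
        for bit in range(bits):
            seen = decode(code ^ (1 << bit), codes)
            if seen is not None:
                result.append((position, seen))
    return result


if __name__ == "__main__":
    print("fragile min distance:", min_distance(FRAGILE))
    print("fragile misreads:", misread_positions(FRAGILE))
    print("robust min distance:", min_distance(ROBUST))
    print("robust misreads:", misread_positions(ROBUST))
```

Detection by encoding still relies on the same switches; an independent sensor or hardware interlock addresses the case where the reading itself is wrong in more than one bit.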

25
Management issues
  • Software complacency
  • - software errors not modelled in fault
    trees.
  • - users told no possibility of overdose.
  • Absurdly low probabilities assigned to SW
    failure.
  • Guesswork in analyzing observed failures
  • - blamed microswitches on turntable.
  • - no actual failures found in microswitches.
  • - problem was probably software.
  • Inadequate software processes
  • - unclear safety analyses.
  • - no audit trails.
  • - inadequate testing.
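
The fault-tree points can be made concrete with a minimal sketch of gate arithmetic, assuming independent basic events. All probabilities below are hypothetical placeholders chosen only to show the effect; none come from the actual Therac-25 analysis. The sketch shows how the top-event estimate depends almost entirely on the number assigned to software failure, and how an independent interlock turns a single point of failure into an AND of two events.

```python
from math import prod


def and_gate(probabilities):
    """All inputs must occur (independent events assumed)."""
    return prod(probabilities)


def or_gate(probabilities):
    """At least one input occurs (independent events assumed)."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical per-treatment probabilities, for illustration only.
P_SOFTWARE_FAULT_CLAIMED = 1e-11    # an implausibly optimistic figure
P_SOFTWARE_FAULT_PLAUSIBLE = 1e-4   # a more cautious guess
P_INTERLOCK_FAILS = 1e-3            # independent hardware interlock


def p_overdose(p_software_fault, independent_interlock):
    """Top event: overdose caused by a software fault.

    With an independent interlock, the overdose needs the software
    fault AND the interlock failure; without one, the software fault
    alone is enough.
    """
    if independent_interlock:
        return and_gate([p_software_fault, P_INTERLOCK_FAILS])
    return p_software_fault


if __name__ == "__main__":
    for p in (P_SOFTWARE_FAULT_CLAIMED, P_SOFTWARE_FAULT_PLAUSIBLE):
        print(p, p_overdose(p, False), p_overdose(p, True))
```

Leaving software out of the tree is equivalent to assigning it probability zero, which is why the choice of that one number deserves scrutiny.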

26
Regulatory and legal issues
  • FDA, Canadian regulators not heavily involved
  • - no software regulation in med. devices (at
    that time).
  • - not notified of incidents (no requirement to
    do so).
  • - inadequate investigation of early incidents.
  • When FDA got involved, the machine got fixed.
  • (speculation) Out of court settlements impeded
    dissemination of information about hazards.

27
A more Armando-like example?