Engineering Disasters - PowerPoint PPT Presentation

About This Presentation
Title:

Engineering Disasters

Slides: 24
Provided by: NKU

Transcript and Presenter's Notes

Title: Engineering Disasters


1
Engineering Disasters
  • While engineering disasters differ from software
    disasters, we can learn some useful lessons
  • Engineering disasters primarily occur because of:
  • human factors (ethical failures, accidents)
  • design flaws (often from unethical practices)
  • materials failures (software will not fail from
    this)
  • extreme conditions or environments (software
    might be thought to fail in such cases when
    running in an overloaded system, for example)
  • a combination of the above
  • The main difference is that software does not
    have wear and tear, but on the other hand,
    software systems can often be far more complex
    than engineering systems and so need very
    thorough review processes including testing

2
Software Failures
  • There are many reasons why software fails
  • It is impossible to discover all bugs
  • It is impossible to test all combinations of
    inputs or variable values
  • Most software is built by a team of people who
    must then integrate their pieces; there will
    always be unexpected consequences when software
    systems are integrated
  • Specifications may be incomplete or change over
    time
  • Software engineering principles may not be
    followed properly
  • Management may underestimate the effort required
    and/or put undue pressure on the developers to
    get the software done to meet a deadline
  • In many cases, software must interact with
    hardware which may not be predictable
  • Poor user interfaces may exacerbate problems
  • User errors
  • And of course, humans are not infallible

3
A Few Famous Software Disasters
  • Mariner 1 spacecraft
  • Intended to fly by Venus in 1962, it was blown up
    about 5 minutes after takeoff when its rocket
    went off course
  • The spacecraft used an Atlas booster rocket
  • The rocket was guided with the help of two radar
    systems, the Rate system (velocity) and the Track
    system (distance and angle) separated by a
    difference of 43 ms
  • To put both systems on the same time base, the
    guidance software had to use smoothed track data
    rather than the raw velocity data relayed
    directly from the radar
  • The on-board Rate System hardware failed but the
    Track System was working correctly and could have
    handled the ascent
  • But a software bug caused the smoothing function
    to not work correctly, so the computer received
    incorrect data that caused a fluctuation in
    velocity, requiring the safety officer to
    destroy the rocket

4
What Caused the Software Error
  • It was thought for some time that the error was
    an incorrect FORTRAN DO loop (the for-loop of
    FORTRAN)
  • The actual code looked like this
  • DO 10 I=1.10
  • The . should be a , to iterate for I from 1
    to 10; because FORTRAN ignores blanks, the
    statement above is instead read as an assignment
    of 1.10 to a variable named DO10I, so the loop
    body would execute just 1 time
  • However, while the software did have such an
    error, that error was in code pertaining to
    computing orbital data and so was not the cause
    of this error
  • The real cause?
  • Due to an error in transcribing the
    specification, the smoothing function never got
    implemented, leading to the failure
  • The mathematician who wrote the smoothing
    function on paper had written R with a bar over
    it (meaning R smoothed)
  • The programmer implementing the algorithm
    overlooked the bar and so used plain R; the code
    to compute R-smoothed was therefore never
    implemented

5
Patriot Missile Background
  • The Patriot missile, used during the first Gulf
    War, is an antiballistic missile
  • When fired, its goal is to destroy a ballistic
    missile; during this war, Iraq used Scud
    missiles
  • A Patriot missile must have a way of determining
    whether an airborne target is actually an
    incoming missile
  • this is determined by tracking the target to see
    if it is following an expected ballistic missile
    path
  • ballistic missiles travel at extremely high
    speeds so that the time interval between radar
    sightings must be very small
  • the Patriot tracks a target by first noting its
    original radar sighting location and then
    anticipates where the target should be at the
    next radar sighting (a fraction of a second
    later)
  • if the target does not appear in a given range,
    the target is classified a false alarm and
    ignored by the Patriot
  • In order to make this path calculation, the
    Patriot depends on its internal clock
  • However, to save memory space, the clock value
    was truncated slightly when stored (this would
    not normally cause a significant error)

6
The Error
  • However, the missile's software was written so
    that this small error accumulated over time into
    a much larger one
  • the Israeli military discovered this clock drift
    error when analyzing data from Patriot batteries
    operating in Israel
  • after only 8 hours of continuous operation, the
    Patriot's stored clock value would be off by
    0.0275 seconds
  • this would cause a range error of up to 55 meters
  • During the Gulf War, a Patriot battery had been
    operating continuously for more than 100 hours
  • the clock value was off by 0.3433 seconds
  • this caused range errors of up to 687 meters
  • and therefore, the Patriot would lock onto a
    Scud, not find it in the next time interval and
    forget it, possibly allowing the Scud to cause
    damage
  • On Feb 25, 1991, an American military barracks in
    Dhahran, Saudi Arabia was hit by a Scud,
    resulting in the deaths of 28 people
  • simply rebooting the missile's computer would
    have reset the clock and avoided the error!
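
The drift figures above can be reproduced with a short calculation. This is a sketch under stated assumptions: the clock counts tenths of a second and multiplies by 0.1 truncated to 23 binary fractional bits (the commonly cited analysis; the slides only say the value was truncated), and the closing speed of 2000 m/s is simply what the slide's own numbers imply (687 m / 0.3433 s).

```python
from fractions import Fraction

FRACTION_BITS = 23           # assumption: fractional bits kept after truncation
TICK = Fraction(1, 10)       # the clock counts tenths of a second
ASSUMED_SPEED_M_PER_S = 2000  # assumption implied by 687 m / 0.3433 s


def truncated_tick(bits: int = FRACTION_BITS) -> Fraction:
    """0.1 chopped (not rounded) to the given number of binary places."""
    scale = 2 ** bits
    return Fraction(int(TICK * scale), scale)


def clock_drift_seconds(hours: int, bits: int = FRACTION_BITS) -> float:
    """Accumulated clock error after the given hours of continuous running."""
    ticks = hours * 3600 * 10
    return float((TICK - truncated_tick(bits)) * ticks)


def range_error_meters(hours: int) -> float:
    """Distance the target covers during the accumulated clock error."""
    return clock_drift_seconds(hours) * ASSUMED_SPEED_M_PER_S


for hours in (8, 100):
    print(hours, round(clock_drift_seconds(hours), 4),
          round(range_error_meters(hours)))
```

The error per tick is under a ten-millionth of a second, but it is added 36,000 times per hour, which is why uptime rather than any single calculation decided whether a target was tracked.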

7
Ariane 5
  • This is a French rocket used by the European
    Space Agency
  • Previous versions (such as the Ariane 4) have
    been successful
  • The Ariane 5 is somewhat based on the Ariane 4,
    running much of the same software
  • However, the Ariane 5 is a superior rocket that
    can achieve a much greater velocity
  • One piece of legacy code was used to properly
    shut down the rocket prior to liftoff in case of
    an emergency by computing rocket alignment
  • This code would operate for 50 seconds after
    flight initiation, which includes some time
    after actual liftoff
  • This code does not need such a long operating
    window in the Ariane 5 and could actually be
    shut off 3 seconds prior to liftoff

8
The Problem
  • On June 4, 1996, an Ariane 5 was launched and
    everything went smoothly for the first 36 seconds
    of flight
  • The on-board inertial reference systems (both the
    main and the backup) were running the alignment
    code
  • the horizontal velocity was an exceptional value,
    caused when a 64-bit float was converted into a
    16-bit int
  • recall, the Ariane 4 had a lesser velocity, so
    the greater velocity of the Ariane 5 is not
    within acceptable limits for this legacy code,
    which shouldn't even have been operating!
  • the exception caused both inertial reference
    systems to shut down (note that most numeric
    conversions were protected, presumably through
    exception handling, but this particular one,
    along with 2 others, was not)
  • The rocket's on-board computers now had an
    exception report to deal with; however, the
    report was not treated as an exception but as
    actual flight data
  • This caused the on-board computers to drastically
    change the angles of the two on-board rockets,
    so the rocket veered off course, which in turn
    triggered self destruction
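
The difference between a protected and an unprotected conversion can be sketched in a few lines. This is an illustration only: the flight software was not written in Python, and the function names and the two sample velocity values are assumptions chosen to fall inside and outside the 16-bit range.

```python
INT16_MIN, INT16_MAX = -32768, 32767


class OperandError(Exception):
    """Raised when a value does not fit in a 16-bit signed integer."""


def to_int16_unprotected(value: float) -> int:
    """Convert and let an out-of-range value raise to the caller."""
    result = int(value)
    if not INT16_MIN <= result <= INT16_MAX:
        raise OperandError(f"{value} does not fit in 16 bits")
    return result


def to_int16_protected(value: float) -> int:
    """Convert, clamping out-of-range values to the nearest limit."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))


ariane4_like = 20000.0   # hypothetical value within range
ariane5_like = 40000.0   # hypothetical value beyond the 16-bit limit

print(to_int16_unprotected(ariane4_like))   # fits, no problem
print(to_int16_protected(ariane5_like))     # clamped to the upper limit
try:
    to_int16_unprotected(ariane5_like)
except OperandError as err:
    print("unhandled in flight this would stop the unit:", err)
```

Clamping is just one possible protection; the point is that the range assumption inherited from the Ariane 4 was never re-examined for the faster rocket.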

9
Therac-25
  • Radiation therapy machine
  • An example of a radiation therapy machine is
    shown to the right (this is NOT Therac-25)
  • Two companies produced the predecessor, Therac-20
    (which itself was based on the Therac-6)
  • The Therac-25 would be superior in that many of
    the hardware features were replaced by software
  • for instance, hardware interlocks to make sure
    that the mechanism was in the proper orientation
    were now checked by software routines
  • The Therac-6 could only deliver X-rays
  • The Therac-20 and -25 could deliver either X-rays
    or electron beams
  • both are used in cancer treatments
  • The company Atomic Energy of Canada Limited
    (AECL) produced the -6 and -25 and jointly the
    -20 with another company
  • They borrowed software routines from the -6 and
    -20 in building the -25

10
6 Accidents
  • AECL sold 11 Therac-25 units to hospitals in the
    US (5) and Canada (6)
  • Between 1985 and 1987, 6 accidents occurred at 4
    locations resulting in 3 deaths, 2 severe
    injuries and 1 mild injury, all caused by
    overexposure to radiation
  • In each case, the machine emitted a far greater
    amount of x-ray or electron beam radiation than
    was safe because of a combination of hardware
    and software problems with the machine
  • Several lawsuits were filed and the FDA and the
    Canadian Radiation Protection Bureau got involved
    with AECL to have them fix the problems
  • Some of AECL's responses were inappropriate and
    slow, possibly leading to further accidents that
    might have been avoided

11
Accident 1
  • Kennestone Regional Oncology Ctr, 1985, Marietta
    GA
  • Patient receiving follow-up radiation treatment
    after a lumpectomy to remove a malignant breast
    tumor
  • Treatment called for an electron beam
  • patient complained of tremendous heat, red-hot
    sensation
  • patient's shoulder froze
  • had reddening on her back
  • suffered great pain
  • her breast had to eventually be removed because
    of radiation burns
  • She sued the hospital and AECL; the lawsuit was
    settled out of court
  • The hospital's physicist later estimated that she
    received 15K-20K rads
  • At the time, AECL had no explanation for the
    accident and stated that the physicist's
    explanation was wrong
  • The lawsuit should have triggered certain
    actions by AECL that did not at this point take
    place (such as alerting other Therac-25 users of
    the accident)

12
Accident 2
  • Ontario Cancer Foundation, July 26, 1985
  • On her 24th Therac-25 treatment, a female patient
    felt a burning sensation like an electrical
    shock
  • The operator received an H-tilt error and the
    machine shut down within 5 seconds indicating no
    dose had been delivered
  • the operator tried to repeat the treatment as
    many as 4 more times and received the same error,
    at which point the machine shut itself off
    completely (standard process for the machine
    after 5 errors)
  • the patient received 13K-17K rads and died within
    5 months (she would most likely have died from
    her cancer, but not as quickly nor in as much
    pain)
  • AECL suspected that the fault was a mechanical
    problem with the turntable that aligns either the
    x-ray or electron gun and released a statement to
    users that they should visually confirm the
    turntable's alignment until further notice

13
Accident 3
  • Yakima Valley Memorial Hospital, Washington,
    January (or Feb) 1986
  • A patient developed erythema in parallel stripes
    after a treatment; the skin went on to harden
  • Neither AECL nor the hospital staff could
    identify the cause of the problem, and AECL
    denied that it was the Therac-25
  • as this was the woman's last treatment, no
    further exposure occurred and she survived;
    however, she had a chronic skin ulcer and tissue
    necrosis and was in constant pain
  • Up until this point, AECL did not contact other
    users of Therac-25 to warn them of these accidents

14
Accident 4
  • East Texas Cancer Center, March 21 1986
  • Up until this accident, the ETCC had treated 500
    patients with the Therac-25
  • A male patient came in for his 9th treatment to
    remove a tumor from his back
  • the patient was to be treated with the electron
    beam and receive 180 rads over a 10x17 cm area of
    his back per treatment (6000 rads over a 6 ½ week
    period)
  • The user sits in a different, shielded room, and
    monitors the patient through a video monitor
    (broken) and audio monitor (disabled)
  • the user entered the patient treatment through
    the user-interface but had to use the editing
    keys to move the cursor back up and change the
    mode from x (X-ray) to e (electron)
  • it was common for users to edit the treatment
    page this way

15
Continued
  • When the information was properly entered, the
    user hit B for beam on and received a malfunction
    notice (malfunction 54)
  • The user manual did not indicate what this
    malfunction was
  • The user tried again and received the same
    malfunction notice
  • at this point, the patient, who felt that he had
    been burned with hot coffee, pounded on the door
    between rooms to indicate a problem had arisen
  • The patient actually received between 16.5K and
    25K rads
  • the patient lost function of his arm, had
    periodic bouts of nausea and vomiting and wound
    up with paralysis of his arm, both legs and a
    vocal cord, along with several other problems
  • he died 5 months later
  • AECL could not reproduce the error but insisted
    that the Therac-25 could not be responsible for
    the overdose

16
Accident 5
  • Also at ETCC, on April 11, 1986
  • A male patient received an electron treatment
  • The situation was similar in that the machine
    shut down with Malfunction 54
  • On this occasion, the speaker was not disabled
    and the user heard moaning from the patient
  • This patient died of an overdose 3 weeks later
  • The physicist at ETCC was able to recreate the
    accident once he realized that the problem arose
    from using the user interface and typing in
    information quickly (explanation follows)
  • A 6th accident occurred at Yakima, where a
    patient received a much greater exposure than
    the previous Yakima patient, resulting in death

17
Explanations
  • There were several causes of the accidents
  • Many of these problems were only discovered
    through experimentation and through the Therac-25
    users group who compared notes at conferences
  • AECL was slow to discover these errors, if it
    discovered them at all, and was slow at
    communicating problems to the user community
  • had AECL acted sooner, some of the later
    accidents might have been avoided
  • The primary cause of the accidents is simply that
    the Therac-25 relies solely on software to
    ensure that overdoses cannot arise
  • In the Therac-20, by contrast, the software might
    lead to an overdose situation, but hardware
    interlocks ensure that the mechanism cannot fire
    when the result would be an overexposure

18
User Interface Problem
  • To understand one of the problems, we need to
    understand how Therac-25 works
  • It consists of a number of executable modules,
    some of which run concurrently
  • a variable, Tphase, indicates which of these
    modules should execute next
  • Tphase is a shared variable but without proper
    synchronization
  • For instance, one module is Datent (data entry),
    and when done, the system switches to Set-up Test
  • The system will not switch out of Datent mode
    until all information has been entered and the
    cursor resides on the bottom of the screen
  • a typo in Datent requires using the editing keys
    (arrows, escape, etc) to move the cursor to the
    proper position and correct the entry

19
Continued
  • Now consider this situation
  • The user has entered x for x-ray and completes
    all other information and the cursor is at the
    bottom of the screen
  • Datent is now completed and the mode switches to
    set-up
  • A magnet is used to shape the beam for delivery,
    and positioning it takes 8 seconds between
    set-up and delivery
  • A fast typist can move the cursor back up to
    change the mode to e and move the cursor back
    to the bottom within those 8 seconds
  • This is exactly what happened at ETCC and the
    result was that while the user expected the
    milder electron beam to be used, instead the
    x-ray was used and delivered a massive
    overexposure
  • the reason that the typist could change the mode
    while still in another mode is because Tphase
    could be set by two active processes (called a
    race condition)
  • While this problem existed in Therac-20, it could
    never deliver the overexposure because of the
    hardware interlocks

20
Other Problems
  • Another shared variable indicated which setting
    the turntable was in (a 2-bit number indicating
    one of 3 settings)
  • Unfortunately, if the data were corrupted (by
    indicating an illegal value), the turntable's
    position would be unchecked
  • Since this was a shared variable, it could become
    corrupted by some of the modules
  • Another variable value was stored in 1 byte so
    its maximum value was 255
  • On an overflow (256), it resets to 0 and does not
    throw any kind of error
  • Because of legacy code, a hardware device called
    the collimator is not checked when this variable
    is 0
  • In certain situations, the collimator (the device
    which helps tune the beam) should be checked but
    is ignored, and it is therefore possible for the
    machine to deliver a massive dose
  • Cryptic error messages and a poor user's guide
    didn't help
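
The one-byte overflow can be illustrated with a counter that wraps to 0 on every 256th increment. In this sketch the assumption is that the variable is incremented once per pass through a set-up loop and that 0 means "skip the collimator check"; the function names are hypothetical.

```python
def skipped_checks_with_counter(total_passes: int) -> list[int]:
    """Passes on which a one-byte counter wraps to 0 and the check is skipped."""
    flag = 0
    skipped = []
    for n in range(1, total_passes + 1):
        flag = (flag + 1) & 0xFF   # one byte: 255 + 1 wraps silently to 0
        if flag == 0:              # 0 is read as "no check needed"
            skipped.append(n)
    return skipped


def skipped_checks_with_fixed_value(total_passes: int) -> list[int]:
    """Same loop, but the flag is set to a fixed non-zero value each pass."""
    skipped = []
    for n in range(1, total_passes + 1):
        flag = 1                   # never wraps, so the check always runs
        if flag == 0:
            skipped.append(n)
    return skipped


print(skipped_checks_with_counter(600))
print(skipped_checks_with_fixed_value(600))
```

Using a counter where a simple true/false flag was intended is what made the failure intermittent and hard to reproduce: the check vanished only on the rare pass that landed on the wrap-around.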

21
AECL Responses
  • Part of the reason for so many accidents had to
    be AECL's inadequate and slow responses
  • In some cases, AECL simply denied that Therac-25
    could have caused the accidents and in other
    cases, offered hardware-only solutions that did
    not address the root of the problem
  • After each accident, the FDA and CDRH required
    AECL to respond with a corrective action plan
  • AECL was slow to respond
  • Further, AECL did not quickly alert other users
    to the accidents nor the problems, only some
    solutions
  • For instance, to resolve the problem with quick
    editing, AECL informed all users to remove the
    up-arrow key from the keyboard and make sure that
    the key (if pressed) would not make contact (that
    is, disable the up-arrow action)
  • The FDA had to get involved to mandate some
    changes as AECL offered some very weak fixes
    (kludges) to both the hardware and software
    problems

22
Some of the Solutions
  • All interruptions or errors would cause treatment
    to suspend, not pause
  • so that the operator could not resume, but would
    have to start over
  • Software-controlled single-pulse shutdown would
    be added
  • Hardware single-pulse shutdown would be added
  • Potentiometer added to turntable so that operator
    could determine the turntable's exact orientation
  • Beam would be disabled if turntable not in an
    appropriate orientation
  • Malfunction messages would be more meaningful and
    dose-rates highlighted
  • Editing keys would be limited to eliminate
    previous editing problem
  • Changes would be made to software to improve
    reliability

23
Lessons Learned
  • People who worked on the hardware felt that the
    software safeguards would be perfect and
    therefore hardware safeguards (e.g., interlocks)
    were unnecessary
  • Engineers and management greatly underestimated
    the complexity of the software
  • Management permitted (perhaps encouraged?) the
    use of legacy software
  • Poor software engineering techniques were most
    likely used given that there was little to no
    documentation available regarding the
    development of the software
  • Management was very slow to communicate the
    problems to other users
  • Management often cited safety improvements made
    based on ridiculous assumptions and possibly
    made-up values
  • between the 4th and 5th accidents, changes were
    made that the company indicated improved the
    system by 10,000,000 percent!