Title: Software Engineering II
1Software Engineering II
Software Reliability
2Dependable and Reliable Systems The Royal
Majesty
From the report of the National Transportation
Safety Board:
"On June 10, 1995, the Panamanian
passenger ship Royal Majesty grounded on Rose and
Crown Shoal about 10 miles east of Nantucket
Island, Massachusetts, and about 17 miles from
where the watch officers thought the vessel was.
The vessel, with 1,509 persons on board, was en
route from St. George's, Bermuda, to Boston,
Massachusetts."
"The Raytheon GPS unit installed
on the Royal Majesty had been designed as a
standalone navigation device in the mid- to
late 1980s, ... The Royal Majesty's GPS was
configured by Majesty Cruise Line to
automatically default to the Dead Reckoning mode
when satellite data were not available."
3The Royal Majesty Analysis
The ship was steered by an autopilot that
relied on position information from the Global
Positioning System (GPS). If the GPS could not
obtain a position from satellites, it provided an
estimated position based on Dead Reckoning
(distance and direction traveled from a known
point). The GPS failed one hour after leaving
Bermuda. The crew failed to see the warning
message on the display (or to check the
instruments). 34 hours and 600 miles later, the
Dead Reckoning error was 17 miles.
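The way a dead reckoning error accumulates can be sketched in a few lines. This is only an illustration on a flat plane: the heading and the 0.5 miles-per-hour sideways drift are assumed values chosen so that the arithmetic matches the 34 hours and 600 miles on the slide, and they are not taken from the report.

```python
import math

def dead_reckon(start, heading_deg, speed, hours):
    """Estimate position from a known point on a flat plane.

    start is (east, north) in miles, heading is degrees clockwise
    from north, speed is in miles per hour.
    """
    distance = speed * hours
    heading = math.radians(heading_deg)
    return (start[0] + distance * math.sin(heading),
            start[1] + distance * math.cos(heading))

def position_error(estimated, actual):
    """Straight-line distance between the estimate and the true position."""
    return math.hypot(actual[0] - estimated[0], actual[1] - estimated[1])

# Hypothetical voyage: 600 miles in 34 hours, as on the slide.
HOURS = 34
SPEED = 600 / HOURS
HEADING = 336.0          # assumed heading, for illustration only
DRIFT_EAST_MPH = 0.5     # assumed drift that dead reckoning cannot see

estimated = dead_reckon((0.0, 0.0), HEADING, SPEED, HOURS)
# The true position also includes the unmodelled drift for every hour sailed.
actual = (estimated[0] + DRIFT_EAST_MPH * HOURS, estimated[1])
error_miles = position_error(estimated, actual)
```

The point of the sketch is that the error grows with time since the last known position, so a small unmodelled effect becomes a large error when nobody notices the mode change.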
4The Royal Majesty Software Lessons
All the software worked as specified (no bugs), but ...
Since the GPS software had been specified, the requirements had changed (from a stand-alone system to part of an integrated system).
The manufacturers of the autopilot and GPS adopted different design philosophies about the communication of mode changes.
The autopilot was not programmed to recognize the valid/invalid status bits in messages from the GPS (NMEA 0183).
The warnings provided by the user interface were not sufficiently conspicuous to alert the crew.
The officers had not been properly trained on this equipment.
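As an illustration of what checking a status field involves, here is a minimal sketch for one NMEA 0183 sentence type, RMC, where the field after the time is "A" for a valid fix and "V" for a warning. Which sentences the Royal Majesty's equipment actually exchanged is not stated here, so RMC is an assumption made for the example, and the helper names are hypothetical.

```python
from functools import reduce

def nmea_checksum(payload: str) -> int:
    """XOR of every character between '$' and '*'."""
    return reduce(lambda acc, ch: acc ^ ord(ch), payload, 0)

def make_sentence(payload: str) -> str:
    """Frame a payload as '$payload*HH' (helper for building test data)."""
    return f"${payload}*{nmea_checksum(payload):02X}"

def parse_fields(sentence: str):
    """Return the comma-separated fields, or None if framing or checksum is bad."""
    sentence = sentence.strip()
    if not sentence.startswith("$") or "*" not in sentence:
        return None
    payload, _, given = sentence[1:].partition("*")
    try:
        if int(given[:2], 16) != nmea_checksum(payload):
            return None
    except ValueError:
        return None
    return payload.split(",")

def rmc_fix_is_valid(sentence: str) -> bool:
    """True only for a well-formed RMC sentence whose status field is 'A'."""
    fields = parse_fields(sentence)
    if fields is None or len(fields) < 3 or not fields[0].endswith("RMC"):
        return False
    return fields[2] == "A"

# Sample data: two sentences that differ only in the status field.
BODY = "4807.038,N,01131.000,E,022.4,084.4,230394,003.1,W"
good = make_sentence("GPRMC,123519,A," + BODY)
degraded = make_sentence("GPRMC,123519,V," + BODY)
```

A consumer that reads only the latitude and longitude fields would treat `good` and `degraded` identically, which is the kind of mismatch between components that the slide describes.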
5Reliability
Reliability: Probability of a failure occurring in operational use.
Perceived reliability: Depends upon
    user behavior
    set of inputs
    pain of failure
6User Perception of Reliability
1. A personal computer that crashes frequently v. a machine that is out of service for two days.
2. A database system that crashes frequently but comes back quickly with no loss of data v. a system that fails once in three years but data has to be restored from backup.
3. A system that does not fail but has unpredictable periods when it runs very slowly.
7Reliability Metrics
Traditional Measures
    Mean time between failures
    Availability (up time)
    Mean time to repair
Market Measures
    Complaints
    Customer retention
User Perception is Influenced by
    Distribution of failures
Hypothetical example: Cars are less safe than airplanes in accidents per hour, but safer in accidents per mile.
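The traditional measures are related by simple arithmetic. The sketch below assumes the common convention that mean time between failures counts only up time and that availability is MTBF / (MTBF + MTTR); other texts define MTBF slightly differently. The two systems are invented figures.

```python
def mean_time_between_failures(uptime_hours: float, failures: int) -> float:
    """Average up time per failure (assumed definition, see lead-in)."""
    return uptime_hours / failures

def mean_time_to_repair(repair_hours: float, failures: int) -> float:
    """Average time spent restoring service per failure."""
    return repair_hours / failures

def availability(mtbf: float, mttr: float) -> float:
    """Fraction of time the system is up."""
    return mtbf / (mtbf + mttr)

# System A: five short outages. System B: one long outage.
# Both were up for 990 hours and down for 10 hours in total.
mtbf_a = mean_time_between_failures(990, 5)
mttr_a = mean_time_to_repair(10, 5)
mtbf_b = mean_time_between_failures(990, 1)
mttr_b = mean_time_to_repair(10, 1)

availability_a = availability(mtbf_a, mttr_a)
availability_b = availability(mtbf_b, mttr_b)
```

Both systems have the same availability, yet users are likely to perceive them very differently, which is why the distribution of failures matters as well as the averages.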
8Reliability Metrics for Distributed Systems
Traditional metrics are hard to apply in multi-component systems:
    In a big network, at any given moment something will be giving trouble, but very few users will see it.
    A system that has excellent average reliability may give terrible service to certain users.
    There are so many components that system administrators rely on automatic reporting systems to identify problem areas.
9Requirements Specification of System Reliability
Example: ATM card reader

Failure class              Example                          Metric
Permanent non-corrupting   System fails to operate with     1 per 1,000 days
                           any card -- reboot
Transient non-corrupting   System can not read an           1 in 1,000 transactions
                           undamaged card
Corrupting                 A pattern of transactions        Never
                           corrupts database
10Cost of Improved Reliability
[Figure: cost of improved reliability against up time, with axis marks at 99 and 100]
Will you spend your money on new functionality or
improved reliability?
11Example Central Computing System
A central computer serves the entire organization. Any failure is serious.
Step 1: Gather data on every failure
    10 years of data in a simple database
    Every failure analyzed:
        hardware
        software (default)
        environment (e.g., power, air conditioning)
        human (e.g., operator error)
12Example Central Computing System
Step 2: Analyze the data
    Weekly, monthly, and annual statistics
        Number of failures and interruptions
        Mean time to repair
    Graphs of trends by component, e.g.,
        Failure rates of disk drives
        Hardware failures after power failures
        Crashes caused by software bugs in each module
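A failure log of this kind needs very little machinery. The following is a minimal sketch, assuming each failure is recorded with a date, one of the four categories from Step 1, the component involved, and the hours taken to repair; the record layout and the sample entries are invented for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Failure:
    day: date
    category: str        # "hardware", "software", "environment", or "human"
    component: str
    repair_hours: float

def failures_per_month(log):
    """Count failures for each (year, month)."""
    return Counter((f.day.year, f.day.month) for f in log)

def failures_by(log, field):
    """Count failures grouped by any field, e.g. 'category' or 'component'."""
    return Counter(getattr(f, field) for f in log)

def mean_time_to_repair(log):
    """Average repair time over all recorded failures."""
    return sum(f.repair_hours for f in log) / len(log)

# Invented sample data.
LOG = [
    Failure(date(2024, 1, 5), "hardware", "disk drive", 4.0),
    Failure(date(2024, 1, 19), "environment", "power", 1.0),
    Failure(date(2024, 1, 19), "hardware", "disk drive", 6.0),
    Failure(date(2024, 2, 2), "software", "scheduler", 1.0),
]

monthly = failures_per_month(LOG)
by_component = failures_by(LOG, "component")
mttr = mean_time_to_repair(LOG)
```

In this invented log a disk drive failure falls on the same day as a power failure, the kind of pattern that trend analysis by component is meant to expose.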
13Example Central Computing System
Step 3: Invest resources where benefit will be maximum, e.g.,
    Orderly shut down after power failure
    Priority order for software improvements
    Changed procedures for operators
    Replacement hardware
14Building Dependable Systems Three Principles
For a software system to be dependable:
    Each stage of development must be done well.
    Changes should be incorporated into the structure as carefully as the original system development.
    Testing and correction do not ensure quality, but dependable systems are not possible without systematic testing.
15Reliability Modified Waterfall Model
Feasibility study
Requirements
System design
Program design
Coding
Testing
Changes
Acceptance
Operation and maintenance
16Key Factors for Reliable Software
Organization culture that expects quality
Approach to software design and implementation that hides complexity (e.g., structured design, object-oriented programming)
Precise, unambiguous specification
Use of software tools that restrict or detect errors (e.g., strongly typed languages, source control systems, debuggers)
Programming style that emphasizes simplicity, readability, and avoidance of dangerous constructs
Incremental validation
17Building Dependable Systems Organizational
Culture
Good organizations create good systems:
    Acceptance of the group's style of work (e.g., meetings, preparation, support for juniors)
    Visibility
    Completion of a task before moving to the next (e.g., documentation, comments in code)
18Building Dependable Systems Complexity
The human mind can encompass only limited complexity:
    Comprehensibility
    Simplicity
    Partitioning of complexity
A simple system or subsystem is easier to get right than a complex one.
19Building Dependable Systems Specifications for
the Client
Specifications are of no value if they do not meet the client's needs:
    The client must understand and review the requirements specification in detail.
    Appropriate members of the client's staff must review relevant areas of the design (e.g., operations, training materials, system administration).
    The acceptance tests must belong to the client.
20Building Dependable Systems Quality Management
Processes
Assumption: Good processes lead to good software
The importance of routine:
    Standard terminology (requirements, specification, design, etc.)
    Software standards (naming conventions, etc.)
    Internal and external documentation
    Reporting procedures
21Building Dependable Systems Change
Change management:
    Source code management and version control
    Tracking of change requests and bug reports
    Procedures for changing requirements specifications, designs, and other documentation
    Regression testing
    Release control
22Reviews Process (Plan)
Objectives:
    To review progress against plan (formal or informal).
    To adjust plan (schedule, team assignments, functionality, etc.).
Impact on quality:
    Good quality systems usually result from plans that are demanding but realistic.
    Good people like to be stretched and to work hard, but must not be pressed beyond their capabilities.
23Reviews Design and Code
DESIGN AND CODE REVIEWS ARE A FUNDAMENTAL PART OF GOOD SOFTWARE DEVELOPMENT
Concept: Colleagues review each other's work
    can be applied to any stage of software development
    can be formal or informal
24Benefits of Design and Code Reviews
Benefits:
    Extra eyes spot mistakes, suggest improvements
    Colleagues share expertise, which helps with training
    An occasion to tidy loose ends
    Incompatibilities between components can be identified
    Helps scheduling and management control
Fundamental requirements:
    Senior team members must show leadership
    Good reviews require good preparation
    Everybody must be helpful, not threatening
25Review Team (Full Version)
A review is a structured meeting, with the following people:
    Moderator -- ensures that the meeting moves ahead steadily
    Scribe -- records discussion in a constructive manner
    Developer -- person(s) whose work is being reviewed
    Interested parties -- people above and below in the software process
    Outside experts -- knowledgeable people who are not working on this project
    Client -- representatives of the client who are knowledgeable about this part of the process
26Example Program Design
Moderator
Scribe
Developer -- the design team
Interested parties -- people who created the system design and/or requirements specification, and the programmers who will implement the system
Outside experts -- knowledgeable people who are not working on this project
Client -- only if the client has a strong technical representative
27Review Process
Preparation:
    The developer provides colleagues with documentation (e.g., specification or design), or code listing.
    Participants study the documentation in advance.
Meeting:
    The developer leads the reviewers through the documentation, describing what each section does and encouraging questions.
    Must allow plenty of time and be prepared to continue on another day.
28Static and Dynamic Verification
Static verification: Techniques of verification that do not include execution of the software.
    May be manual or use computer tools.
Dynamic verification:
    Testing the software with trial data.
    Debugging to remove errors.
29Static Validation & Verification
Carried out throughout the software development
process.
[Figure: validation & verification, by means of reviews, applied to the requirements specification, the design, and the program]
30Static Verification Program Inspections
Formal program reviews whose objective is to detect faults:
    Code may be read or reviewed line by line.
    150 to 250 lines of code in a 2-hour meeting.
    Use a checklist of common errors.
    Requires team commitment, e.g., trained leaders.
So effective that it is claimed that it can replace unit testing.
31Inspection Checklist Common Errors
Data faults: Initialization, constants, array bounds, character strings
Control faults: Conditions, loop termination, compound statements, case statements
Input/output faults: All inputs used; all outputs assigned a value
Interface faults: Parameter numbers, types, and order; structures and shared memory
Storage management faults: Modification of links, allocation and de-allocation of memory
Exceptions: Possible errors, error handlers
32Static Analysis Tools
Program analyzers scan the source of a program for possible faults and anomalies (e.g., Lint for C programs).
    Control flow: loops with multiple exit or entry points
    Data use: Undeclared or uninitialized variables, unused variables, multiple assignments, array bounds
    Interface faults: Parameter mismatches, non-use of function results, uncalled procedures
    Storage management: Unassigned pointers, pointer arithmetic
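One of the data-use checks, unused variables, can be sketched with Python's standard `ast` module, which parses source code without executing it. This is a toy analyzer written for illustration, not how Lint works: it ignores scopes, attributes, and many other cases that a real tool handles.

```python
import ast

def unused_variables(source: str):
    """Names that are assigned somewhere but never read anywhere.

    Deliberately simplistic: every name is treated as if it lived in a
    single scope.
    """
    tree = ast.parse(source)       # static: the program is never run
    assigned = set()
    used = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                assigned.add(node.id)
            elif isinstance(node.ctx, ast.Load):
                used.add(node.id)
    return sorted(assigned - used)

# Invented sample program for the analyzer to inspect.
SAMPLE = """
total = 0
count = 0
for value in [1, 2, 3]:
    total = total + value
print(total)
"""

report = unused_variables(SAMPLE)
```

Here `count` is assigned but never read, so it is the only name reported.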
33Static Analysis Tools (continued)
Static analysis tools:
    Cross-reference table: Shows every use of a variable, procedure, object, etc.
    Information flow analysis: Identifies input variables on which an output depends.
    Path analysis: Identifies all possible paths through the program.
34Failures and Faults
Failure: Software does not deliver the service expected by the user (e.g., mistake in requirements, confusing user interface)
Fault (BUG): Programming or design error whereby the delivered system does not conform to specification (e.g., coding error, interface error)
35Faults and Failures?
Actual examples:
(a) A mathematical function loops forever from rounding error.
(b) A distributed system hangs because of a concurrency problem.
(c) After a network is hit by lightning, it crashes on restart.
(d) A program dies because the programmer typed x = 1 instead of x == 1.
(e) The head of an organization is paid $5 a month instead of $10,005 because the maximum salary allowed by the program is $10,000.
(f) An operating system fails because of a page-boundary error in the firmware.
36Terminology
Fault avoidance: Build systems with the objective of creating fault-free (bug-free) software.
Fault tolerance: Build systems that continue to operate when faults (bugs) occur.
Fault detection (testing and validation): Detect faults (bugs) before the system is put into operation.
37Fault Avoidance
Software development process that aims to develop zero-defect software:
    Formal specification
    Incremental development with customer input
    Constrained programming options
    Static verification
    Statistical testing
It is always better to prevent defects than to remove them later.
Example: The four color problem.
38Defensive Programming
Murphy's Law: If anything can go wrong, it will.
Defensive Programming:
    Redundant code is incorporated to check system state after modifications.
    Implicit assumptions are tested explicitly.
    Risky programming constructs are avoided.
39Defensive Programming Error Avoidance
Risky programming constructs:
    Pointers
    Dynamic memory allocation
    Floating-point numbers
    Parallelism
    Recursion
    Interrupts
All are valuable in certain circumstances, but should be used with discretion.
40Defensive Programming Examples
Use boolean variable, not integer
Test i < n, not i == n
Assertion checking (e.g., validate parameters)
Build debugging code into program, with a switch to display values at interfaces
Error checking codes in data (e.g., checksum or hash)
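Several of these practices fit in one short sketch. The function names and record layout below are invented for illustration; the CRC-32 from Python's standard `zlib` module stands in for "checksum or hash", and the `logging` level plays the role of the debugging switch.

```python
import logging
import zlib

log = logging.getLogger("interfaces")   # set its level to DEBUG to see values

def sum_every(values, step):
    """Sum values[0], values[step], values[2 * step], ..."""
    # Validate parameters explicitly instead of assuming they are sane.
    if not isinstance(step, int) or isinstance(step, bool) or step < 1:
        raise ValueError("step must be a positive integer")
    log.debug("sum_every called with %d values, step=%d", len(values), step)
    total = 0
    i = 0
    n = len(values)
    # Inequality test: i still fails the test after stepping past n.
    # An equality test would miss n whenever step does not divide n.
    while i < n:
        total += values[i]
        i += step
    return total

def pack(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 so that corruption can be detected later."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def unpack(record: bytes) -> bytes:
    """Return the payload, or raise ValueError if the checksum disagrees."""
    payload, stored = record[:-4], record[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != stored:
        raise ValueError("checksum mismatch")
    return payload
```

The checks cost a few lines each, and they turn silent corruption or a runaway loop into an immediate, visible error.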
41Maintenance
Most production programs are maintained by people
other than the programmers who originally wrote
them. (a) What factors make a program easy for
somebody else to maintain? (b) What factors make
a program hard for somebody else to maintain?
42Fault Tolerance
General Approach:
    Failure detection
    Damage assessment
    Fault recovery
    Fault repair
N-version programming -- Execute independent implementations in parallel, compare results, accept the most probable.
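N-version programming can be sketched as a voter over independent implementations. This sketch makes two simplifying assumptions: it takes "most probable" to mean a strict majority, and it runs the versions one after another rather than in parallel. The three integer square root versions are invented, and the third is deliberately faulty.

```python
import math
from collections import Counter

def vote(implementations, *args):
    """Run every version and return the result a strict majority agrees on."""
    results = []
    for implementation in implementations:
        try:
            results.append(implementation(*args))
        except Exception:
            continue                     # a crashed version loses its vote
    if not results:
        raise RuntimeError("every version failed")
    value, count = Counter(results).most_common(1)[0]
    if count <= len(implementations) // 2:
        raise RuntimeError("no majority among versions")
    return value

def isqrt_library(n):
    """Version 1: the standard library routine."""
    return math.isqrt(n)

def isqrt_search(n):
    """Version 2: independent binary search."""
    low, high = 0, n
    while low < high:
        middle = (low + high + 1) // 2
        if middle * middle <= n:
            low = middle
        else:
            high = middle - 1
    return low

def isqrt_faulty(n):
    """Version 3: contains a deliberate bug for the demonstration."""
    return n // 2

VERSIONS = [isqrt_library, isqrt_search, isqrt_faulty]
answer = vote(VERSIONS, 16)
```

The voter masks the faulty version only because the other two agree; versions that share a misunderstanding of the specification would outvote a correct one.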
43Fault Tolerance
Basic Techniques:
    After error continue with next transaction (e.g., drop packet)
    Timers and timeout in networked systems
    Error correcting codes in data
    Bad block tables on disk drives
    Forward and backward pointers in databases
Report all errors for quality control
44Fault Tolerance
Backward Recovery:
    Record system state at specific events (checkpoints). After failure, recreate state at last checkpoint.
    Combine checkpoints with system log that allows transactions from last checkpoint to be repeated automatically.
Test the restore software!
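The combination of checkpoint and log can be sketched with an in-memory ledger. The class and its methods are invented for illustration; a real system would keep both the checkpoint and the log on stable storage so that they survive the failure.

```python
import copy

class Ledger:
    """Account balances with checkpoint-and-replay recovery."""

    def __init__(self):
        self.balances = {}
        self._checkpoint = {}
        self._log = []            # transactions since the last checkpoint

    def apply(self, account, amount):
        self._log.append((account, amount))     # log first, then change state
        self.balances[account] = self.balances.get(account, 0) + amount

    def checkpoint(self):
        """Record the current state and start a fresh log."""
        self._checkpoint = copy.deepcopy(self.balances)
        self._log.clear()

    def recover(self):
        """Recreate the checkpoint state, then repeat the logged transactions."""
        self.balances = copy.deepcopy(self._checkpoint)
        for account, amount in self._log:
            self.balances[account] = self.balances.get(account, 0) + amount

ledger = Ledger()
ledger.apply("alice", 100)
ledger.checkpoint()
ledger.apply("alice", -30)
ledger.apply("bob", 30)

ledger.balances = {"alice": -999}     # simulate a failure that damages state
ledger.recover()
```

Exercising `recover` in a test like this, before it is ever needed, is the small-scale version of testing the restore software.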
45Software Engineering for Real Time
The special characteristics of real time computing require extra attention to good software engineering principles:
    Requirements analysis and specification
    Development of tools
    Modular design
    Exhaustive testing
Heroic programming will fail!
46Software Engineering for Real Time
Testing and debugging need special tools and environments:
    Debuggers, etc., can not be used to test real time performance
    Simulation of environment may be needed to test interfaces -- e.g., adjustable clock speed
    General purpose tools may not be available
47Some Notable Bugs
Built-in function in Fortran compiler (e0 0)
Japanese microcode for Honeywell DPS virtual memory
The microfilm plotter with the missing byte (11023)
The Sun 3 page fault that IBM paid to fix
Left handed rotation in the graphics package
Good people work around problems. The best people track them down and fix them!
48End of Lecture 6