Title: Using Simplicity to Control Complexity
1Using Simplicity to Control Complexity
- Lui Sha
- Department of CS
- lrs_at_cs.uiuc.edu
- UIUC
- June, 2002
2The Goal
- Software systems are not static. They evolve.
- Our goal is to develop an engineering foundation that allows us to evolve software systems dependably.
- New features can be easily added, preferably online without downtime.
- The system never performs worse than before, even if the changes have bugs or even contain malicious attack code.
- To realize this goal, we need to first understand the nature of software reliability, and demonstrate the viability of this idea in some important class of applications.
3Which Side Would You Take?
- How to improve the reliability and availability of increasingly complex software is a serious challenge. There are two philosophical positions:
- The diversity camp: Diversity in crops resists diseases; diversity in software improves reliability. The likelihood of making the same mistakes decreases as the degree of diversity increases. Don't put all your eggs in one basket.
- The bullet-proof-your-basket camp: Concentrate all the available resources on one version and do it right. Do-it-right-the-first-time is the time-honored approach to quality products.
4Software Development Postulates
- In science we rely on facts and logic. Let's begin with well-known observations in software development. We make 3 postulates:
- P1, Complexity Breeds Bugs: Everything else being equal, the more complex the software project is, the harder it is to make it reliable.
- P2, All Bugs are Not Equal: You fix a bunch of obvious bugs quickly, but finding and fixing the last few bugs is much harder, if you can ever hunt them down.
- P3, All Budgets are Finite: There is only a finite amount of effort (budget) that we can spend on any project.
- Not so fast, Lui! Could you please define software complexity?
5Residual Logical Complexity
- Computational complexity is modeled as the number of steps to complete the computation. Likewise, logical complexity can be viewed as the number of steps that are needed to verify the correctness.
- A program can have different logical and computational complexities. For example, compared with heap-sort, bubble-sort has lower logical complexity but higher computational complexity. We focus on logical complexity in this talk.
- Residual logical complexity: A program could have high logical complexity initially. However, if it has been verified and can be used as is, then the residual complexity is zero.
- In the rest of the discussion, we shall focus on the (residual logical) complexity of software.
6The Implications of the 3 Postulates
- P1: The "Complexity Breeds Bugs" postulate implies that for a given mission duration t, the reliability of software decreases as complexity increases.
- P2: The "All Bugs are Not Equal" postulate implies that for a given degree of complexity, the reliability function has a monotonically decreasing rate of improvement with respect to development effort.
- A reliability function in the form of R(Effort, Complexity, t) = e^{-kCt/E} satisfies P1 and P2.
- P3: The "Finite Budget" assumption implies that diversity is not free. That is, if we go for n-version diversity, we must divide the available effort n ways. This allows us to compare different approaches fairly.
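The three implications can be checked numerically. The following is a minimal sketch, reading the reliability function above as R = exp(-k C t / E); the function names and all parameter values are illustrative assumptions, not figures from the talk.

```python
import math

def reliability(effort, complexity, t, k=1.0):
    """R(E, C, t) = exp(-k * C * t / E), the model assumed on this slide."""
    return math.exp(-k * complexity * t / effort)

# P1: with effort and mission time fixed, higher complexity lowers reliability.
r_simple = reliability(effort=10.0, complexity=1.0, t=1.0)
r_complex = reliability(effort=10.0, complexity=2.0, t=1.0)

# P2: one more unit of effort buys less once a lot of effort is already spent
# (shown here for the illustrative values 10 -> 11 versus 20 -> 21).
gain_early = reliability(11.0, 1.0, 1.0) - reliability(10.0, 1.0, 1.0)
gain_late = reliability(21.0, 1.0, 1.0) - reliability(20.0, 1.0, 1.0)

# P3: n-version diversity leaves each version with only total_effort / n.
def per_version_reliability(total_effort, n, complexity, t, k=1.0):
    return reliability(total_effort / n, complexity, t, k)

print(r_simple > r_complex, gain_early > gain_late)
```

Keeping the total effort as an explicit argument is what makes the later comparisons fair: every approach is charged against the same budget.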
7Modeling the Implications
- This is equivalent to assuming that
- the commonly used reliability function e^{-λt} is a useful model
- the failure rate, λ, in R(t) is proportional to complexity but inversely proportional to the effort spent on the software.
- Hold on Lui, how do you know the failure rate is proportional to complexity and inversely proportional to the effort spent? For God's sake, they could be very non-linear relations!
- OK, we will examine non-linear relationships later.
8A Unified Framework
- Recently Larry Bernstein extended the reliability model as follows: R = e^{-kCt/(Eε)}
- where ε expresses the ability to solve a problem with fewer instructions with a new tool such as a compiler.
- This equation expresses the reliability of a software system in a unified form as related to software engineering parameters. The longer the software system runs, the lower the reliability and the more likely a fault will be executed to become a failure. Reliability can be improved by investing in tools (ε), simplifying the design (C), or increasing the effort in development (E) to do more inspections or testing than required by software effort estimation techniques.
- This is a new idea. For this lecture, we assume ε = 1.
9N-Version Programming - 1
- Let's use the simple model to analyze N-version programming under the ideal condition that faults are independent. N-version programming suggests that we should independently develop N versions of a program according to the same specification, and then take the majority of the outputs.
3-version programming
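A minimal sketch of the comparison, assuming the reliability model R = exp(-k C t / E) from the earlier slides, independent faults, a perfect voter, and a total effort split evenly three ways. The function names and the parameter values are illustrative assumptions; the outcome depends on the values chosen, so other settings should be computed rather than assumed.

```python
import math

def reliability(effort, complexity, t, k=1.0):
    """Assumed model: R = exp(-k * C * t / E)."""
    return math.exp(-k * complexity * t / effort)

def single_version_reliability(total_effort, complexity, t, k=1.0):
    """One version receives the whole budget."""
    return reliability(total_effort, complexity, t, k)

def three_version_reliability(total_effort, complexity, t, k=1.0):
    """Three independent versions, each built with one third of the budget.

    The majority vote is correct when all three versions are correct
    or when exactly two of them are.
    """
    r = reliability(total_effort / 3.0, complexity, t, k)
    return r**3 + 3 * r**2 * (1 - r)

single = single_version_reliability(10.0, 1.0, 1.0)
triple = three_version_reliability(10.0, 1.0, 1.0)
print(f"single version: {single:.4f}, 3-version: {triple:.4f}")
```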
10N-version Programming - 2
- It turns out that "single-version is better than 3-version" is a robust result. Here are two examples.
3-version programming
11Recovery Block
- The idea of the recovery block is that you develop several alternatives. Checkpoint your state, try the primary, and test the output. If it passes the acceptance test, use it. Otherwise, roll back and try another alternative. We shall assume that we have a perfect acceptance test for now.
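The control flow of a recovery block can be sketched as follows. This is an illustrative outline, not code from the talk: `recovery_block`, the integer square root alternatives, and the acceptance test are all hypothetical names chosen for the example, and the acceptance test is "perfect" only for this toy problem.

```python
import copy

def recovery_block(state, alternatives, acceptance_test):
    """Try each alternative on a fresh copy of the checkpointed state."""
    checkpoint = copy.deepcopy(state)            # checkpoint the state
    for alternative in alternatives:
        working = copy.deepcopy(checkpoint)      # roll back before each try
        try:
            result = alternative(working)
        except Exception:
            continue                             # a crash counts as a failed try
        if acceptance_test(checkpoint, result):
            return result
    raise RuntimeError("all alternatives failed the acceptance test")

def buggy_primary(state):
    state["n"] += 1            # corrupts its working copy of the state...
    return state["n"] // 2     # ...and returns a wrong root

def simple_alternate(state):
    root = 0
    while (root + 1) ** 2 <= state["n"]:
        root += 1
    return root

def acceptance_test(state, root):
    """Checks the defining property of an integer square root."""
    return root * root <= state["n"] < (root + 1) ** 2

result = recovery_block({"n": 17}, [buggy_primary, simple_alternate], acceptance_test)
print(result)
```

Each alternative works on its own copy, so a faulty primary cannot damage the state seen by the next alternative.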
12The More Alternatives the Merrier?
13Power of Simplicity
14The Fly in the Ointment
- Alas, it is difficult to develop high coverage acceptance tests. Consider the case of a uniform random number generator.
- Can you determine that the distribution is indeed uniform using one isolated data point? No.
- Can you determine the distribution with a large sample? Yes.
- Many phenomena require a good size sample to diagnose. It is often difficult to diagnose a phenomenon with an isolated instance. This explains why it is so difficult to determine the correctness of each individual program output.
- Unfortunately, in many applications we cannot buffer a long sequence of outputs before we output them. We can't do it in interactive applications, nor can we buffer up the outputs in control applications.
- We need to find a way that tolerates incorrect outputs.
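The uniform generator example can be made concrete with a chi-square goodness-of-fit check. This is a sketch under stated assumptions: ten equal bins, a 0.1% significance level, and a deliberately biased generator invented for the illustration. A single value such as 0.25 is plausible under both generators, so only the large sample can separate them.

```python
import random

BINS = 10
# 0.1% critical value of the chi-square distribution with 9 degrees of
# freedom (BINS - 1); a larger statistic is strong evidence of non-uniformity.
CRITICAL_VALUE_9_DOF = 27.877

def chi_square_statistic(samples):
    """Chi-square statistic of samples in [0, 1) against a uniform distribution."""
    counts = [0] * BINS
    for x in samples:
        counts[min(int(x * BINS), BINS - 1)] += 1
    expected = len(samples) / BINS
    return sum((c - expected) ** 2 / expected for c in counts)

def looks_uniform(samples):
    return chi_square_statistic(samples) <= CRITICAL_VALUE_9_DOF

rng = random.Random(2002)
fair = [rng.random() for _ in range(10_000)]
biased = [rng.random() / 2 for _ in range(10_000)]   # never reaches 0.5

print("fair generator accepted:", looks_uniform(fair))
print("biased generator accepted:", looks_uniform(biased))
```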
15Feedback Control of Software Execution
- To tolerate output errors that cannot be detected instantaneously, the applications should have the following characteristics:
- Capability control: When the system is in an operational state, a single incorrect output cannot bring the system down instantaneously. (Cumulative errors can.)
- Measurable system behavior: We can evaluate the system behaviors under the software control.
- Control applications meet these 2 requirements. Control software errors map to measurable actuation errors. Errors are measurable and can be bounded by a combination of control authority and monitoring frequency.
- A simple and reliable core to provide acceptable performance
- Stability control: the system under complex software control remains in states that are controllable by the simple and reliable controller.
16The Idea
- Joe is a new student who partied a bit too much. He has mastered bubble sort but has only a 50% chance of writing a correct quick sort program.
- He must submit a program that will be evaluated as follows:
- Correct and fast, O(n log n): A
- Correct but slow: B
- Incorrect: F
- What is Joe's optimal strategy?
Quick Sort
Bubble Sort
Stability control: the set of numbers to be sorted cannot be altered. This is the precondition for Bubble Sort.
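One way to read Joe's dilemma in code: submit the quick sort, verify its output in linear time, and fall back to bubble sort on the untouched input if the check fails. The sketch below is illustrative; `joes_quick_sort` is a hypothetical stand-in with a deliberate bug, and `safe_sort` is an assumed name for the combined program.

```python
from collections import Counter

def bubble_sort(items):
    """The simple, trusted sort."""
    data = list(items)
    for end in range(len(data) - 1, 0, -1):
        for i in range(end):
            if data[i] > data[i + 1]:
                data[i], data[i + 1] = data[i + 1], data[i]
    return data

def joes_quick_sort(items):
    """Stand-in for the risky program; elements equal to the pivot are lost."""
    data = list(items)
    if len(data) <= 1:
        return data
    pivot = data[0]
    smaller = [x for x in data[1:] if x < pivot]
    larger = [x for x in data[1:] if x > pivot]   # deliberate bug: drops duplicates
    return joes_quick_sort(smaller) + [pivot] + joes_quick_sort(larger)

def is_correct_sort(original, result):
    """Linear-time check: the output is ordered and is a permutation of the input."""
    in_order = all(a <= b for a, b in zip(result, result[1:]))
    return in_order and Counter(original) == Counter(result)

def safe_sort(items):
    original = tuple(items)          # stability control: the input is never altered
    try:
        candidate = joes_quick_sort(list(original))
    except Exception:
        candidate = None
    if candidate is not None and is_correct_sort(original, candidate):
        return candidate             # correct and fast
    return bubble_sort(original)     # correct but slow

print(safe_sort([3, 1, 2]), safe_sort([2, 3, 2, 1]))
```

Because the check costs only O(n), the combined program keeps the O(n log n) running time whenever the quick sort happens to be correct, and is never worse than bubble sort otherwise.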
17Simplex Architecture
A simple verifiable core; diversity in the form of 2 alternatives; feedback control of the software execution.
Online replaceable
18Admissible States
- In the operation of a plant, there is a set of state constraints representing the safety, device physical limitations, environmental and other operation requirements.
- They can be represented as a normalized polytope, C^T X ≤ 1, in the N-dimensional state space. We must be able to
- take the control away from a faulty controller before the system state becomes inadmissible
- ensure that the future trajectory of the system state after the switch will stay within the set of admissible states.
State constraints
Admissible States
Operation Constraints and Admissible states
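A minimal sketch of the admissibility check, reading the normalized polytope as the condition that every constraint row c satisfies c . x <= 1. The two-dimensional state (position, velocity) and its limits are hypothetical values chosen for illustration.

```python
def is_admissible(constraints, state):
    """True when every normalized constraint row c satisfies c . x <= 1."""
    return all(
        sum(c_i * x_i for c_i, x_i in zip(row, state)) <= 1.0
        for row in constraints
    )

# Hypothetical limits: |position| <= 2 and |velocity| <= 5, each written as
# two half-planes and scaled so that the right-hand side is 1.
CONSTRAINTS = [
    [0.5, 0.0],
    [-0.5, 0.0],
    [0.0, 0.2],
    [0.0, -0.2],
]

print(is_admissible(CONSTRAINTS, [1.0, -4.0]))
print(is_admissible(CONSTRAINTS, [2.5, 0.0]))
```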
19The Error Bounds
- We cannot use the boundary of the admissible states as the switching rule, due to the inertia of the physical plant.
- The recovery region is closed with respect to the operations of the simple controller. It is defined by a Lyapunov function inside the polytope.
- The largest recovery region can be found using LMI.
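A switching rule based on a recovery region can be sketched as follows, assuming the region is the ellipsoid x^T P x <= 1 and that a one-step linear plant model x_next = A x + B u is available. The matrices A, B and P below are hypothetical placeholders; in particular P is not computed by LMI here and has not been verified against any controller.

```python
def quadratic_form(P, x):
    """V(x) = x^T P x for a square matrix P given as nested lists."""
    n = len(x)
    return sum(x[i] * P[i][j] * x[j] for i in range(n) for j in range(n))

def predict(A, B, x, u):
    """One-step prediction x_next = A x + B u for a scalar command u."""
    n = len(x)
    return [sum(A[i][j] * x[j] for j in range(n)) + B[i] * u for i in range(n)]

def choose_command(A, B, P, x, u_complex, u_simple):
    """Accept the complex controller's command only if the predicted
    state stays inside the recovery region; otherwise use the simple one."""
    if quadratic_form(P, predict(A, B, x, u_complex)) <= 1.0:
        return u_complex
    return u_simple

# Hypothetical plant and recovery region, for illustration only.
A = [[1.0, 0.1], [0.0, 1.0]]
B = [0.0, 0.1]
P = [[1.0, 0.0], [0.0, 1.0]]
x = [0.5, 0.5]

print(choose_command(A, B, P, x, u_complex=0.0, u_simple=-1.0))
print(choose_command(A, B, P, x, u_complex=10.0, u_simple=-1.0))
```

Checking the predicted state rather than the current one reflects the point about inertia: by the time the state reaches the boundary it may be too late to recover.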
20System Development Process
- The high assurance control subsystem
- Application level: using well-understood classical controllers
- System software level: using high assurance OS kernels such as certifiable Ada runtime
- Hardware level: using well-established and simple fault tolerant hardware configurations, such as pair-pair or TMR
- High assurance development and maintenance process, e.g., FAA DO-178B
- Requirement management: requirements here are limited to critical properties.
- The high performance control subsystem
- Application level: advanced control technologies
- System software level: using COTS real time operating systems and middleware
- Hardware level: using standard industrial hardware, e.g., VME
- Standard industrial software development process
- Requirement management: features and performance are handled here.
- System evolution support, e.g., online replaceable components
21Semiconductor Wafer Process State Control
Deposition rate Refractive index Si-H/Ni-H
bonds Uniformity etc.
DC bias Mass 60 (disilane) Mass 76
(triaminosilane)
SiH4 RF power Pressure
22DoD Applications
Software fault tolerance is particularly useful for cases in which some new functionality is available that has been only partially tested but that might help to achieve the success of a mission. By providing protection from faults, Simplex enables such functionality to be applied on a mission.
Joint Strike Fighter (JSF): the JSF mission software architecture builds on the architectural principles developed under the INSERT project.
http://www.sei.cmu.edu/pub/documents/99.reports/pdf/news-sei-fall-1999.pdf
The Space and Naval Warfare Systems Command (SPAWAR) has initiated a process to transition SIMPLEX technology. The technology will be transitioned to the Surface Combatant for the 21st Century (SC21), the Next Generation Carrier (CV(X)), and other Navy systems. SIMPLEX includes a software architecture, real-time middleware services and supporting tools to allow the safe insertion of new technology or upgrading of existing technology in high-assurance real-time systems. It permits the new technology to operate until an error condition (system, timing or semantic error) occurs, at which time the system rolls back to the baseline technology.
http://www.rl.af.mil/tech/programs/edcs/Accomplishments.html
23Summary
- We should never trust complex software that is beyond our means to verify.
- Untrusted complex software is useful, provided that when it malfunctions its adverse impacts on system behaviors are observable and bounded by design.
- We need a simple and reliable core to provide minimal essential services and constrain the impacts of malfunctioning software so as not to let faults turn into failures.
After 30 seconds of a planned 90 flight missile test in the 70s, the clock was not properly reset. The missile blew up. Some twenty-five years later AT&T experienced a massive network failure caused by a similar problem in the fault recovery subsystem they were upgrading. In both cases, the system failed because there were no limits placed on the results the software could produce. There were no boundary conditions set. Designers built with a point solution in mind and without bounding the domain of software execution. Testers were rushed to meet schedules and the planned fault recovery mechanisms did not work. --- Larry Bernstein
24Software Fault Model
- Timing fault: misses its deadlines
- Capability abuse
- Corrupt others' code or data
- Unauthorized acquisition of process/resource management capability
- Semantic fault: incorrect results that can lead to
- Poor control performance
- Instability in the plant
25Recent Extensions: Secured Reliable Upgrades
- Code and data access attacks: compiler-based protection
- Algorithmic attacks: algorithm-based protection
- Resource depletion attacks: OS-based protection
26Telelab
- www-drii.cs.uiuc.edu/download