Title: Architecture Based Software Reliability
1 Architecture Based Software Reliability
- Katerina Goseva-Popstojanova
- Kishor S. Trivedi
- Center for Advanced Computing and Communication
- Department of Electrical and Computer Engineering
- Duke University, Durham, NC
2 Outline
- Motivation and advantages
- Common requirements and classification
- Models elaboration
- Assumptions, limitations and applicability
- Conclusion
3 Software Reliability
- Increasing dependence on computer systems
- Failures are more often due to software faults than to hardware faults
- Examples of failures due to software
  - Excessive radiotherapy doses (1985-1987)
  - 9-hour outage of the long-distance phone network in the USA (1990)
  - Scud missile missed by the Patriot (1991)
  - Ariane 5 crash (1996)
  - 8-hour delay in opening the London Stock Exchange (2000)
- Need for evaluation, prediction and improvement of software reliability
4 Motivation
- Black-box models treat software as a monolithic whole, considering only its interactions with the external environment, without an attempt to model its internal structure
- With growing emphasis on reuse, the software development process moves toward component-based software design
- White-box approach can be used to analyze a system with many software components and how they fit together
5 Advantages
- Analyzing reliability and performance of an application built from reusable and COTS software components
- Studying sensitivity of the application reliability/performance to reliability/performance of components and interfaces
- Guiding the process of identifying critical components and interfaces
- Allocation of resources to each of the components, so that the system reliability objective is achieved
- Evaluation of design alternatives
6 Common Requirements and Classification
- Component identification
- Architecture of the software
- Failure behavior of the components and interfaces
- Combining the architecture with failure behavior
  - state-based models
  - path-based models
  - additive models
7 State-based Models
- Estimate software reliability analytically by combining software architecture with failure behavior
- Method of solution
  - Composite method
    - Combine the architecture and failure behavior into a composite model
    - Solve the composite model for reliability and performance measures
  - Hierarchical method
    - Solve the architectural model
    - Superimpose the failure behavior on the solution of the architectural model in order to predict reliability
8 Software Architecture
- Software behavior with respect to the manner in which different components interact
- May include information about the execution time of each component
- Use control flow graph to represent architecture
- Assume that transfer of control between components has the Markov property
9 Software Architecture - Contd
- Sequential program architecture modeled by
- Discrete Time Markov Chain (DTMC)
- Continuous Time Markov Chain (CTMC)
- Semi-Markov process (SMP)
- These can be
- absorbing - terminating applications
- irreducible - continuously running applications
10 Failure Behavior of Components and Interfaces
- Failure can happen
- during the execution of any component or
- during the transfer of control between components
- Failure behavior can be specified in terms of
- reliability
- constant failure rate
- time-dependent failure intensity
11 Terminating Applications: Cheung model (1980)
- Architecture: DTMC
  - $p_{ij}$ = Pr{transfer of control from module i to module j}
- Failure behavior: component reliabilities $R_i$
- Solution method: Composite
  - Two absorbing states S (correct output) and F (failure) are added, and the transition probability matrix P is modified appropriately
  - W is the matrix obtained by deleting the rows and columns corresponding to S and F
  - The element M(1,n) of the fundamental matrix $M = (I - W)^{-1}$ represents the probability of reaching state n from state 1
  - System reliability is $R = M(1,n) \, R_n$
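The composite solution above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes that "modified appropriately" means W(i,j) = R_i p_ij (component i executes correctly and then hands control to j), and it uses a hypothetical five-component graph with a loop between components 2 and 4 and hypothetical reliabilities. The helper names `solve_linear` and `cheung_reliability` are illustrative.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def cheung_reliability(p, rel):
    """System reliability R = M(1,n) * R_n for a terminating application.

    p[i][j]: probability of transferring control from component i to j
    rel[i]:  reliability of component i (the last component is the exit)
    """
    n = len(rel)
    # Assumed form of the modified matrix: W(i,j) = R_i * p_ij.
    w = [[rel[i] * p[i][j] for j in range(n)] for i in range(n)]
    # Row 1 of M = (I - W)^-1 solves (I - W)^T x = e_1.
    a = [[(1.0 if i == j else 0.0) - w[j][i] for j in range(n)]
         for i in range(n)]
    e1 = [1.0] + [0.0] * (n - 1)
    first_row_of_m = solve_linear(a, e1)
    return first_row_of_m[n - 1] * rel[n - 1]


# Hypothetical data: 1 -> 2, 2 -> 3 (p23), 2 -> 4 (p24), 4 -> 2, 3 -> 5.
p23, p24 = 0.2, 0.8
P = [[0, 1, 0, 0, 0],
     [0, 0, p23, p24, 0],
     [0, 0, 0, 0, 1],
     [0, 1, 0, 0, 0],
     [0, 0, 0, 0, 0]]
REL = [0.999, 0.98, 0.99, 0.96, 0.995]
print(cheung_reliability(P, REL))
```

Solving one linear system for the first row of M avoids inverting the whole matrix, which is all the formula R = M(1,n) R_n needs.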
12 Terminating Applications: Kubat model (1989)
- Architecture: SMP
  - $g_i(t)$ - density of the sojourn time in state i
- Failure behavior: component failure rates $\lambda_i$
- Solution method: Hierarchical
  - Reliability of component i is $R_i = \int_0^\infty e^{-\lambda_i t} g_i(t) \, dt$
  - Embedded DTMC provides the expected number of times $V_i$ each component is executed
13 Terminating Applications: Kubat model - Contd
- $R_i^{V_i}$ can be considered as the equivalent reliability of the component that takes into account the component utilization
- System reliability becomes $R = \prod_i R_i^{V_i}$
14 Terminating Applications: Gokhale et al. model (1998)
- Architecture: DTMC
  - $t_i$ - expected time spent in component i per visit
- Failure behavior: time-dependent failure rate $\lambda_i(t)$
- Solution method: Hierarchical
  - Given $\lambda_i(t)$ and the utilization of components, represented by the cumulative expected time $V_i t_i$ spent in component i per execution, component reliability is $R_i = \exp\left(-\int_0^{V_i t_i} \lambda_i(t) \, dt\right)$
  - System reliability becomes $R = \prod_i R_i$
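A minimal sketch of this hierarchical calculation, under a stated assumption: component reliability is taken as exp of minus the failure intensity integrated over the cumulative expected time (visits times time per visit), and system reliability as the product over components. The integral is approximated with the trapezoid rule. The function names and all numeric values are hypothetical.

```python
import math


def component_reliability(intensity, visits, time_per_visit, steps=1000):
    """exp(-integral of intensity(t) over [0, visits * time_per_visit])."""
    horizon = visits * time_per_visit
    h = horizon / steps
    # Trapezoid rule: half weight on the end points, full weight inside.
    area = 0.5 * (intensity(0.0) + intensity(horizon))
    area += sum(intensity(k * h) for k in range(1, steps))
    return math.exp(-area * h)


def system_reliability(components):
    """components: iterable of (intensity, expected visits, time per visit)."""
    r = 1.0
    for intensity, visits, time_per_visit in components:
        r *= component_reliability(intensity, visits, time_per_visit)
    return r


# Hypothetical components: one constant and one decaying failure intensity.
components = [
    (lambda t: 0.01, 5.0, 2.0),
    (lambda t: 0.02 * math.exp(-0.1 * t), 4.0, 1.5),
]
print(system_reliability(components))
```

With a constant intensity the result reduces to exp(-lambda * V * t), the case in which this model coincides with a constant-failure-rate model with deterministic execution times.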
15 Comments on Terminating Application Models
- Once the reliabilities are estimated, the solution method in the Kubat model reduces to a hierarchical treatment of the Cheung model
- The special case of the Kubat model that assumes deterministic execution times is equivalent to the special case of the Gokhale et al. model that assumes constant failure intensities
16 Example 1
- Terminating application
- Architecture described by a DTMC with transition probability matrix $P = [p_{ij}]$
- Component reliabilities are $R_i$
[Figure: control flow graph of components 1-5; edge labels 1, p23, p24, 1, 1]
17 Example 1 - Contd
- Solution method - Hierarchical
- $V_i$, the expected number of executions of component i, is a clear indication of component usage
  - when p24 = 0.8, components 2 and 4 are invoked within a loop many times, which results in a significantly higher expected number of executions compared to the case when p24 = 0.2
- Application reliability is highly dependent on component usage
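The effect of p24 on component usage can be checked numerically. The sketch below is an illustration under assumptions: the control flow graph is taken to be 1 -> 2, 2 -> 3 (p23), 2 -> 4 (p24), 4 -> 2, 3 -> 5, the expected visit counts solve V_j = q_j + sum_i V_i p_ij with execution starting in component 1, and the hierarchical estimate is the first-order approximation R = prod_i R_i^V_i. The reliabilities are hypothetical.

```python
def expected_visits(p, iterations=2000):
    """Expected number of executions V_i of each component per run.

    Iterates V <- q + P^T V, where q puts probability 1 on component 1.
    The iteration converges because the exit component is always reached.
    """
    n = len(p)
    v = [0.0] * n
    for _ in range(iterations):
        v = [(1.0 if j == 0 else 0.0) + sum(v[i] * p[i][j] for i in range(n))
             for j in range(n)]
    return v


def hierarchical_reliability(p, rel):
    """First-order approximation R = prod_i R_i ** V_i."""
    r = 1.0
    for r_i, v_i in zip(rel, expected_visits(p)):
        r *= r_i ** v_i
    return r


def example_matrix(p24):
    """Assumed transition matrix of the five-component example."""
    p23 = 1.0 - p24
    return [[0, 1, 0, 0, 0],
            [0, 0, p23, p24, 0],
            [0, 0, 0, 0, 1],
            [0, 1, 0, 0, 0],
            [0, 0, 0, 0, 0]]


REL = [0.999, 0.98, 0.99, 0.96, 0.995]  # hypothetical reliabilities
for p24 in (0.2, 0.8):
    p = example_matrix(p24)
    print(p24, expected_visits(p), hierarchical_reliability(p, REL))
```

Under these assumptions V2 = 1/(1 - p24) and V4 = p24/(1 - p24), so raising p24 from 0.2 to 0.8 raises V2 from 1.25 to 5 and V4 from 0.25 to 4.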
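18 Example 1 - Contd
Solution method - Composite
[Figure: composite DTMC for Example 1 with added states S and F; component i passes control on with probability Ri (from component 2: p23 R2 to component 3 and p24 R2 to component 4; from component 5: R5 to S) and moves to F with probability 1 - Ri]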
19 Example 1 - Contd
- p24 = 0.8
- Application reliability varies significantly with the variation in the reliability of components 2 and 4
- This is due to the fact that these two components are invoked a large number of times
[Figure: application reliability as a function of the reliability of one component while the reliabilities of the other components are fixed; curves for R2 and R4; x-axis: reliability of a component, y-axis: application reliability]
20 Example 1 - Contd
- p24 = 0.2
- Application reliability does not vary significantly with the variation in the reliability of component 4
- This is due to the fact that component 4 is invoked few times
- Even when R4 is low (including R4 = 0), overall application reliability is still high since component 4, unlike the other components, is not necessarily invoked in each execution
[Figure: application reliability as a function of the reliability of one component while the reliabilities of the other components are fixed; curves for R2 and R4; x-axis: reliability of a component, y-axis: application reliability]
21 Continuously Running Applications: Littlewood model (1975)
- Architecture: CTMC
  - transition rate from state i to state j
- Failure behavior
  - When component i is executed, failures occur according to a Poisson process with parameter $\lambda_i$
  - Transfer of control between component i and component j fails with a given probability
- Solution method: Composite
  - Moment generating function of the number of failures in a time interval
  - Asymptotic analysis leads to a Poisson process whose parameter is determined by the component failure rates and the interface failure probabilities
22 Continuously Running Applications: Littlewood model (1979)
- Architecture: SMP
- Failure behavior
  - When component i is executed, failures occur according to a Poisson process with parameter $\lambda_i$
  - Transfer of control between component i and component j fails with a given probability
- Solution method: Composite
  - Asymptotic analysis leads to a Poisson process whose parameter is determined by the component failure rates and the interface failure probabilities
23 Continuously Running Applications: Laprie model (1984)
- Architecture: CTMC
- Failure behavior: constant failure rates $\lambda_i$
- Solution method: Hierarchical
  - Assuming that failure rates are much smaller than execution rates leads to asymptotic behavior relative to the execution process
  - System failure rate tends to $\sum_i \pi_i \lambda_i$, where $\pi_i$ is the proportion of time spent in component i
24 Continuously Running Applications: Ledoux model (1999)
- Architecture: CTMC
- Failure behavior
  - primary failures, which lead to an execution break, occur with constant failure rates
  - secondary failures are described as a Poisson process with a constant rate
  - primary (secondary) interface failures occur with given probabilities
- Solution method: Composite
  - Using the matrix-analytic approach, the following are evaluated numerically
    - distribution function of the number of failures
    - reliability
    - failure intensity function
25 Comments on Continuously Running Application Models
- Models that describe software architecture with a CTMC are special cases of the Versatile Markov point processes introduced by Neuts (1979), which are shown to be equivalent to Batch Markovian Arrival Processes - BMAP (1991)
- This is a rich class of point processes that have been used extensively to model arrival processes in queuing theory
- The close relation of the BMAP with a finite CTMC results in a matrix-analytic approach that substantially reduces the computational complexity of the algorithmic solution
26 Comments on Continuously Running Application Models - Contd
- Littlewood model (1975) and Ledoux model (1999), which consider both component and interface failures (with batch size of 1), are Markovian Arrival Processes - MAP
- Laprie model (1984), which considers only component failures, is a doubly stochastic Poisson process known as a Markov Modulated Poisson Process - MMPP
27 Example 2
- Continuously running application
- Architecture described by a CTMC
  - transition probabilities $p_{ij}$
  - expected execution time of component i is $1/\gamma_i$
[Figure: CTMC of components 1-5 with transition rates γ1, p23 γ2, p24 γ2, γ3, γ4, γ5]
28 Example 2 - Contd
- This is a composite model that can be solved exactly
- Since failure rates are much smaller than execution rates, many exchanges of control would take place between successive program failures
- This leads to the asymptotic behavior relative to the execution process and allows us to adopt the hierarchical solution method
[Figure: composite CTMC for Example 2; execution transitions with rates γ1, p23 γ2, p24 γ2, γ3, γ4, γ5 between components 1-5, and failure transitions with rates λ1-λ5 from the components to state F]
29 Example 2 - Contd
Solution method - Hierarchical
- The proportion of time spent in each component, $\pi_i$, is a measure of component usage
  - when p24 = 0.8, the proportion of time spent in components 2 and 4 is significantly higher compared to the case when p24 = 0.2
- Application failure rate, and hence reliability, is affected by component usage
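A minimal sketch of the hierarchical calculation for a continuously running application, under stated assumptions: the proportion of time in component i is proportional to (visits per cycle) times (mean execution time), the system failure rate is approximated by the usage-weighted sum of component failure rates, and reliability over time t is exp(-rate * t). The cycle structure (1 -> 2, 2 -> 3 or 4, 4 -> 2, 3 -> 5, 5 -> 1), execution times and failure rates are hypothetical.

```python
import math


def time_proportions(visits, mean_times):
    """Proportion of time in each component: V_i * t_i, normalised to 1."""
    total = sum(v * t for v, t in zip(visits, mean_times))
    return [v * t / total for v, t in zip(visits, mean_times)]


def system_failure_rate(proportions, failure_rates):
    """Usage-weighted sum of the component failure rates."""
    return sum(pi * lam for pi, lam in zip(proportions, failure_rates))


def reliability(t, rate):
    """Probability of no failure up to time t for a constant failure rate."""
    return math.exp(-rate * t)


def visits_per_cycle(p24):
    """Visits to components 1-5 in one cycle of the assumed graph."""
    p23 = 1.0 - p24
    return [1.0, 1.0 / p23, 1.0, p24 / p23, 1.0]


MEAN_TIMES = [1.0, 1.0, 1.0, 1.0, 1.0]              # hypothetical
FAILURE_RATES = [0.001, 0.02, 0.001, 0.03, 0.001]   # hypothetical
for p24 in (0.2, 0.8):
    pi = time_proportions(visits_per_cycle(p24), MEAN_TIMES)
    rate = system_failure_rate(pi, FAILURE_RATES)
    print(p24, pi, rate, reliability(50.0, rate))
```

With these numbers the higher p24 shifts time toward components 2 and 4, which carry the larger failure rates, so the system failure rate rises and reliability at any fixed t falls.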
30 Example 2 - Contd
[Figure: application reliability as a function of time t (0 to 100); one curve for p24 = 0.2 and one for p24 = 0.8]
31 Path-based Models
- Method used to combine software architecture with failure behavior is not analytical
- The sequence of components executed along each path is obtained either experimentally by testing or algorithmically
- Path reliability is obtained by multiplying the reliabilities of the components and interfaces along the path
- System reliability is estimated by averaging path reliabilities over all paths
32 Path-based Models: Shooman model (1976)
- Architecture
  - Knowledge of all paths and their frequencies of execution $f_i$
- Failure behavior
  - Failure probability $q_i$ of path i
- Solution method
  - Total number of failures in N test runs is $n_f = N \sum_i f_i q_i$
  - System failure probability on any test run is $\sum_i f_i q_i$
33 Path-based Models: Krishnamurthy-Mathur model (1997)
- Architecture: sequence of components along different paths is observed using the component traces collected during testing
- Failure behavior: component reliability
- Solution method
  - Reliability of a path traversed when P is executed on test case t is the product of the reliabilities of the components in the trace of t
  - Reliability of P with respect to test set T is the average of the path reliabilities over the test cases in T
34 Path-based Models: Yacoub, Cukic, Ammar model (1999)
- Architecture: probabilistic model named Component Dependency Graph is constructed using scenarios
- Failure behavior
  - component reliability
  - transition reliability
- Solution method: Tree traversal algorithm
  - breadth expansions represent logical OR paths, translated into summation of reliabilities weighted by the transition probability along each path
  - depth of each path represents the sequential execution of components, logical AND, translated into multiplication of reliabilities
35 Comments on Path-based Models
- Account for each component utilization along each path, as well as among different paths
- Difference between state-based and path-based approach becomes evident when the control flow graph contains loops
  - state-based models analytically account for the infinite number of paths
  - path-based models
    - number of paths is restricted to the ones observed experimentally during testing, or
    - depth traversal of each path is terminated using the average execution time of the application
36 Example 3
Path based approach
[Figure: control flow graph of components 1-5]
37 Example 3 - Contd
This sample of test cases results in the same value for the transition probability, p24 = 0.2
- Assuming that components along each path fail independently
  $R_{in} = (6 R_1 R_2 R_3 R_5 + 2 R_1 R_2^2 R_4 R_3 R_5) / 8$
- Considering intra-component dependency by collapsing multiple occurrences of a component
  $R_{dep} = (6 R_1 R_2 R_3 R_5 + 2 R_1 R_2 R_4 R_3 R_5) / 8$
[Figure: control flow graph of components 1-5 with transition probabilities 0.8 and 0.2 out of component 2]
38 Example 3 - Contd
Comparison of the results
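39 Example 3 - Contd
Another sample of test cases, 12424235 and 1242424242424235, results in p24 = 0.8
  $R_{in} = (R_1 R_2^3 R_4^2 R_3 R_5 + R_1 R_2^7 R_4^6 R_3 R_5) / 2$
  $R_{dep} = (R_1 R_2 R_4 R_3 R_5 + R_1 R_2 R_4 R_3 R_5) / 2$
[Figure: control flow graph of components 1-5 with transition probabilities 0.2 and 0.8 out of component 2]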
40 Example 3 - Contd
- In this case the sample paths traverse components 2 and 4 within the loop a significantly larger number of times
- Assuming intra-component dependency results in significantly higher reliability compared to the independent case
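The two path-based estimates of this example can be reproduced with a short sketch. It assumes test traces are given as strings of component labels and that component reliabilities are supplied in a dictionary; the reliability values are hypothetical. In the independent case every occurrence of a component contributes a factor; in the dependent case repeated occurrences are collapsed to one factor.

```python
from collections import Counter
from math import prod


def path_reliability(path, rel, independent=True):
    """Reliability of one observed path (a string of component labels)."""
    counts = Counter(path)
    if independent:
        # Every execution of a component may fail independently.
        return prod(rel[c] ** k for c, k in counts.items())
    # Intra-component dependency: multiple occurrences count once.
    return prod(rel[c] for c in counts)


def system_reliability(paths, rel, independent=True):
    """Average of the path reliabilities over the observed test cases."""
    return sum(path_reliability(p, rel, independent) for p in paths) / len(paths)


REL = {"1": 0.999, "2": 0.98, "3": 0.99, "4": 0.96, "5": 0.995}  # hypothetical
sample_a = ["1235"] * 6 + ["124235"] * 2       # gives p24 = 0.2
sample_b = ["12424235", "1242424242424235"]    # gives p24 = 0.8
for sample in (sample_a, sample_b):
    print(system_reliability(sample, REL, independent=True),
          system_reliability(sample, REL, independent=False))
```

The gap between the two estimates widens for the second sample because components 2 and 4 appear up to seven and six times in a single trace.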
41 Example 3 - Contd
Path based model restricts the number of paths to the ones observed experimentally
Considering all possible paths
  $R = p_{23} R_1 R_2 R_3 R_5 \, [1 + p_{24} R_2 R_4 + (p_{24} R_2 R_4)^2 + (p_{24} R_2 R_4)^3 + \cdots]$
leads to the same solution as in the case of the composite model
  $R = p_{23} R_1 R_2 R_3 R_5 / (1 - p_{24} R_2 R_4)$
[Figure: control flow graph of components 1-5; edge labels 1, p23, p24, 1, 1]
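The convergence of the sum over all paths to the composite-model result can be checked numerically. The sketch below truncates the geometric series after a given number of loop traversals and compares it with the closed form; the reliabilities and transition probabilities are hypothetical.

```python
def truncated_path_sum(p23, p24, rel, max_loops):
    """Sum of path reliabilities over paths with at most max_loops loops."""
    r1, r2, r3, r4, r5 = rel
    base = p23 * r1 * r2 * r3 * r5     # path 1-2-3-5 without the loop
    loop = p24 * r2 * r4               # one extra traversal of 2-4-2
    return base * sum(loop ** k for k in range(max_loops + 1))


def closed_form(p23, p24, rel):
    """Limit of the geometric series, equal to the composite-model result."""
    r1, r2, r3, r4, r5 = rel
    return p23 * r1 * r2 * r3 * r5 / (1.0 - p24 * r2 * r4)


REL = [0.999, 0.98, 0.99, 0.96, 0.995]  # hypothetical reliabilities
for loops in (0, 1, 5, 20, 200):
    print(loops, truncated_path_sum(0.2, 0.8, REL, loops))
print("limit", closed_form(0.2, 0.8, REL))
```

With p24 = 0.8 the series converges slowly, which is why restricting the analysis to a few short observed paths can underestimate the contribution of the loop.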
42 Additive Models
- Do not consider software architecture explicitly
- Focused on estimating overall application reliability using component failure data
- Consider software reliability growth
  - Component failure processes are modeled by a non-homogeneous Poisson process (NHPP)
  - System failure process is also an NHPP, with cumulative number of failures and failure intensity function that are sums of the corresponding functions for each component
43 Additive Models: Xie and Wohlin model (1995)
- Failure behavior: component reliabilities are modeled by NHPP with failure intensity $\lambda_i(t)$
- Solution method
  - System failure intensity is $\lambda_s(t) = \sum_i \lambda_i(t)$
  - Expected cumulative number of system failures by time t, known as the mean value function, is $\mu_s(t) = \sum_i \mu_i(t)$
  - Time has to be adjusted appropriately to consider different starting points for different components
44 Additive Models: Everett model (1999)
- Failure behavior: component reliabilities are modeled by the Extended Execution Time model, which includes information about the relative usage stress imposed on each component
- Solution method: cumulative number of failures and failure intensity functions for the superposition of such models are just the sums of the corresponding functions for each component
- Keeps track of the cumulative processing time per component during testing, that is, considers software architecture implicitly
45 Example 4
Additive model
- Consider software that consists of two components, which are tested independently
- Second component is introduced into the system at t23
- Component reliabilities are modeled by the log-power model, which has mean value function $\mu(t) = a \ln^b(1 + t)$
- Fitting the model to a set of data from a large communication software project yields the mean value functions of the two components
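A minimal sketch of the additive calculation, under stated assumptions: each component follows a log-power mean value function of the form a * ln(1 + t)^b, and a component introduced later contributes only from its own start time, measured on its own clock. The parameters a and b and the start time used below are hypothetical placeholders, not the values fitted to the project data.

```python
import math


def log_power_mean(t, a, b):
    """Expected cumulative failures by time t; zero before the start."""
    if t <= 0:
        return 0.0
    return a * math.log(1.0 + t) ** b


def system_mean(t, components):
    """Sum of component mean value functions with shifted time origins.

    components: list of (a, b, start_time) tuples.
    """
    return sum(log_power_mean(t - start, a, b) for a, b, start in components)


# Hypothetical parameters; the second component starts later than the first.
COMPONENTS = [(30.0, 1.5, 0.0), (40.0, 1.8, 20.0)]
for month in (10, 20, 30, 40, 50):
    print(month, system_mean(month, COMPONENTS))
```

The jump in slope after the second start time is what lets the additive model follow a sudden change in failure behavior when a new component enters the system.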
46 Example 4 - Contd
- The model fits very well the sudden change in failure behavior upon introduction of the second component into the system
- It overestimates the number of failures because the log-power model is not the best software reliability growth model for this set of data
[Figure: estimated expected number of failures together with the empirical failure data for the whole system; x-axis: months (0 to 50), y-axis: number of failures (0 to 600)]
47 Discussion on Model Choice
- The choice can be based on different criteria
  - validity of the assumptions
  - accuracy of the solution
  - number of parameters in the model
  - ability to collect data
  - insight gained from the model evaluation
- The relative weight to be placed on different criteria may depend on the context in which the model is being applied
48 Validity of Assumptions and Accuracy of Solution
- Terminating Applications
  - Cheung model assumes that component reliabilities are known and uses the composite method to obtain the exact solution for system reliability
  - Kubat model and Gokhale et al. model use
    - two different approaches to estimate component reliabilities
    - hierarchical solution method
  - First order approximation for the reliability is based on the assumptions that
    - components are highly reliable
    - variances of the number of times each component is executed are very small
49 Validity of Assumptions and Accuracy of Solution
- Continuously Running Applications
  - Littlewood (1975), Laprie and Ledoux models assume that the sojourn times in each component are exponentially distributed
  - If that is not the case, one should use the Littlewood model (1979)
  - Asymptotic solutions are based on the additional assumption that times between failures are much larger than times between exchanges of control
50 Validity of Assumptions and Accuracy of Solution
- Path-based models
  - Unlike state-based models that analytically account for the infinite number of paths, path-based models
    - restrict the number of paths (Krishnamurthy, Mathur)
    - terminate the depth traversal of each path (Yacoub, Cukic, Ammar)
  - System reliability should not differ significantly since long paths are usually highly improbable
51 Validity of Assumptions and Accuracy of Solution
- Additive models
  - Xie and Wohlin model and Everett model could be the choice when interest is focused on the testing phase, when reliability growth is considered
  - Component reliabilities need to be modeled with NHPP
52 Number of Parameters and Ability to Collect Data
- Each model requires knowledge of component failure behavior
- Some models also require execution times of each component to be measured
  - may impose difficulties, especially when the distribution function is required
- Granularity of required data is different
  - many models consider software architecture explicitly in terms of the transfer of control between components
  - other models deal directly with quantities such as
    - path reliabilities
    - cumulative execution time per component
53 Unavailability of High Quality Data
- Major limitation to comparing and validating software reliability models is the lack of high quality data
- This limitation is even more significant for architecture based models, which need far more sophisticated data than black-box models
- Availability of high quality data should provide
  - sound basis for comparison
  - help with the clear choice between the models
54 Assumptions, Limitations, Applicability
- Level of decomposition
- Estimation of individual component reliabilities
- Estimation of interface reliabilities
- Validity of Markov assumption
- Estimation of transition probabilities
- Operational profile
- Considering failure dependencies
- Extracting software architecture
- Sensitivity analysis
- Considering multiple software quality attributes
- Considering different architectural styles
55 Level of Decomposition
- Decomposition level depends on factors such as the system being analyzed, the possibility of getting required data, etc.
- Too many small components may pose difficulties in measurement, parametrization, and solution of the model
- Too few components may cause the distinction of how components contribute to the system failure to be lost
56 Level of Decomposition - Contd
- Choices for level of decomposition in experimental studies published so far
  - Telephone switching software system - four components according to the main functions (Kanoun et al. 1987)
  - Unix utility grep - 8 components (Krishnamurthy et al. 1997)
  - SHARPE - 30 components, each corresponding to a single file (Gokhale et al. 1998)
  - Simulation of waiting queues - 6 reused components (Yacoub et al. 1999)
57 Estimation of Components Reliabilities
- Depends on whether or not component code is available, how well the component has been tested, whether it is a reused or a new component, etc.
- Reliability growth models
  - difficulty due to the scarcity of failure data
- Explicit consideration of non-failed executions, possibly together with failures
  - high number of executions is required
- Fault seeding and fault injection
  - depends on the range of fault classes that are simulated
58 Estimation of Interface Reliabilities
- Interface between two components could be
- another component
- collection of global variables
- set of files
- any combination of these
- Little information is available about interface
failures, apart from the general agreement that
they exist separately from component failures
revealed during unit testing
59 Validity of Markov Assumption
- State-space models assume that the next component to be executed depends only on the present component and is independent of the past history
  - the embedded Markov chain is a first order chain
- Hypothesis that the chain is of a given order needs to be tested
- Higher order Markov chain
  - makes it possible to consider dependency among components
  - can be represented as a first order chain by redefining the state space appropriately
  - size of the state space grows fast
60 Estimation of Transition Probabilities
- During the early phases, transition probabilities may be available by analyzing program structure and using a known operational profile
- During the design phase, before actual development, simulation can be used
- During the integration phase, as new data become available, the estimates have to be updated, thereby improving predictions
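When execution traces are available, transition probabilities can be estimated by counting observed transfers of control. The sketch below is one simple way to do it (relative frequencies, with no smoothing), assuming each trace is a string of single-character component labels; the traces reuse the two test cases of Example 3.

```python
from collections import Counter, defaultdict


def estimate_transition_probabilities(traces):
    """Relative frequency of each observed transfer of control.

    traces: iterable of strings, one component label per character.
    Returns {source: {destination: estimated probability}}.
    """
    counts = defaultdict(Counter)
    for trace in traces:
        for src, dst in zip(trace, trace[1:]):
            counts[src][dst] += 1
    return {src: {dst: n / sum(c.values()) for dst, n in c.items()}
            for src, c in counts.items()}


traces = ["12424235", "1242424242424235"]
estimates = estimate_transition_probabilities(traces)
print(estimates["2"])
```

As more traces are collected during integration, rerunning the estimate on the enlarged trace set updates the probabilities in the way the last bullet describes.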
61 Operational Profile
- Test cases selected with the aim of
  - finding faults
  - increasing various structure coverages
  - demonstrating different functional requirements
  are not representative of the users' operational profile
- Upgrades of software might invalidate any existing estimate of the operational profile because new features can change the way the software is used
- Change of the operational profile must be considered in assessing component reliabilities
62 Considering Failure Dependencies among Components and Interfaces
- Existing models assume
  - failure processes associated with different components are independent
  - when considered, interface failures are assumed to be mutually independent and independent of component failure processes
- If a component's failure behavior is affected by the previous component being executed, or by the interface between them, these assumptions are no longer acceptable, that is, inter-component and intra-component dependencies need to be considered
63 Extracting Software Architecture
- If the software architecture is not available, it has to be extracted from the source or object code
  - static architectural information
    - parser-based or lexically-based tools
  - dynamic architectural information
    - profilers or test coverage tools
64 Extracting Software Architecture - Contd
- Workbench for architectural extraction recently developed at the Software Engineering Institute
- Used at Duke University
  - in-house developed parser
  - GNU profiler gprof
  - coverage testing tool ATAC (Telcordia Technologies)
  - toolkit ATOM (Compaq Tru64 Unix)
65 Sensitivity Analysis
- Helps to identify the critical components which have the greatest impact on system reliability and performance
- Can be used for planning and certification activities during different phases of the software life cycle
  - reliability allocation to each component based on the target reliability for the entire system and the sensitivity of the system to the component
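One simple way to quantify sensitivity is a numerical derivative of application reliability with respect to each component reliability. The sketch below is an illustration under assumptions: it uses the closed-form reliability of the five-component example, R = p23 R1 R2 R3 R5 / (1 - p24 R2 R4) with p23 = 1 - p24, a central finite difference, and hypothetical component reliabilities.

```python
def app_reliability(rel, p24):
    """Closed-form reliability of the five-component example."""
    r1, r2, r3, r4, r5 = rel
    return (1.0 - p24) * r1 * r2 * r3 * r5 / (1.0 - p24 * r2 * r4)


def sensitivities(rel, p24, h=1e-6):
    """Central-difference estimate of dR/dR_i for every component."""
    result = []
    for i in range(len(rel)):
        up = list(rel)
        down = list(rel)
        up[i] += h
        down[i] -= h
        result.append((app_reliability(up, p24)
                       - app_reliability(down, p24)) / (2.0 * h))
    return result


REL = [0.999, 0.98, 0.99, 0.96, 0.995]  # hypothetical reliabilities
for p24 in (0.2, 0.8):
    print(p24, sensitivities(REL, p24))
```

Components with the largest derivative are the natural candidates for extra testing effort or a tighter reliability target when allocating the system objective.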
66 Considering Multiple Software Quality Attributes
- Architecture based models are mainly focused on reliability
- Performance as a software quality attribute characterizes timeliness of the service delivered
  - Terminating application
    - expected execution time of the application
  - Continuously running application
    - expected time of one cycle
67 Considering Multiple Software Quality Attributes - Contd
- This tutorial presents an overview from the perspective of the software reliability engineering community
- Software performance engineering perspective in Smith, Williams, 1993
- Unifying approach for reasoning about multiple software quality attributes needs to be developed
  - first step in that direction - Architecture Tradeoff Analysis Method (Software Engineering Institute)
68 Considering Different Architectural Styles
- Today's software applications are far more complex, frequently run on two or more processors, under different operating systems, and on geographically distributed machines
- Architectural style is determined by
  - set of component types
    - clients, servers, filters, databases, objects, etc.
  - topological layout of these components indicating their interrelationships
  - set of interaction mechanisms
    - as simple as procedure calls, pipes and event broadcast, or much more complex, such as client-server protocols, database accessing protocols, etc.
69 Considering Different Architectural Styles - Contd
- In today's network-centric world most software applications run in a distributed environment
- Assumptions like sequential execution of components and instantaneous transfer of control are not applicable
- Additional challenges
  - race conditions
  - deadlocks
  - communication errors
  - node failures
  - failures associated with deadline violations due to communication overheads
  - etc.
70 Conclusion
- Architectural decisions are made early in the life cycle; they are the hardest to change, most critical and far-reaching
- State of the research and practice of the architecture based approach to software reliability assessment
  - common requirements and classification
  - model elaboration
  - usefulness and limitations
  - key challenges for applicability
- Standardized architectural styles have to be developed, along with the methods for their qualitative and quantitative assessment