Title: Dependability Theory and Methods 5. Markov Models
- Andrea Bobbio
- Dipartimento di Informatica
- Università del Piemonte Orientale, A. Avogadro
- 15100 Alessandria (Italy)
- bobbio_at_unipmn.it - http://www.mfn.unipmn.it/bobbio
Bertinoro, March 10-14, 2003
2. State-Space-Based Models
- States and labeled state transitions
- State can keep track of
- Number of functioning resources of each type
- States of recovery for each failed resource
- Number of tasks of each type waiting at each resource
- Allocation of resources to tasks
- A transition
- Can occur from any state to any other state
- Can represent a simple or a compound event
3. State-Space-Based Models (Continued)
- Transitions between states represent the change of the system state due to the occurrence of an event
- Drawn as a directed graph
- Transition label:
- Probability: homogeneous discrete-time Markov chain (DTMC)
- Rate: homogeneous continuous-time Markov chain (CTMC)
- Time-dependent rate: non-homogeneous CTMC
- Distribution function: semi-Markov process (SMP)
4. Modeler's Options
- Should I Use Markov Models?
- State-Space-Based Methods
- Model Dependencies
- Model Fault-Tolerance and Recovery/Repair
- Model Contention for Resources
- Model Concurrency and Timeliness
- Generalize to Markov Reward Models for Modeling Degradable Performance
5. Modeler's Options
- Should I Use Markov Models?
- Generalize to Markov Regenerative Models for Allowing Generally Distributed Event Times
- Generalize to Non-Homogeneous Markov Chains for Allowing Weibull Failure Distributions
- Performance, Availability and Performability Modeling Possible
- Large (Exponential) State Space
6. In order to fulfill our goals
- Modeling Performance, Availability and Performability
- Modeling Complex Systems
- We Need:
- Automatic Generation and Solution of Large Markov Reward Models
7. Model-based evaluation
- Choice of the model type is dictated by
- Measures of interest
- Level of detail of the system behavior to be represented
- Ease of model specification and solution
- Representation power of the model type
- Access to suitable tools or toolkits
8. State space models
[State diagram: two states s and s′ joined by a transition labeled x_i]
A transition represents the change of state of a single component.
Z(t) is the stochastic process; Pr{Z(t) = s} is the probability of finding Z(t) in state s at time t.
Pr{s → s′, Δt} = Pr{Z(t + Δt) = s′ | Z(t) = s}
9. State space models
[State diagram: two states s and s′ joined by a transition labeled x_i]
If s → s′ represents a failure event:
Pr{s → s′, Δt} = Pr{Z(t + Δt) = s′ | Z(t) = s} ≈ λ_i Δt
If s → s′ represents a repair event:
Pr{s → s′, Δt} = Pr{Z(t + Δt) = s′ | Z(t) = s} ≈ μ_i Δt
10. Markov Process definition
11. Transition Probability Matrix
12. State Probability Vector
13. Chapman-Kolmogorov Equations
14. Time-homogeneous CTMC
15. Time-homogeneous CTMC
16. The transition rate matrix
17. C-K Equations for CTMC
18. Solution equations
19. Transient analysis
Given the initial state of the Markov chain, the system of differential equations is written based on: rate of buildup = rate of flow in - rate of flow out, for each state (continuity equation).
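As a sketch of this transient solution, the system dP(t)/dt = P(t)Q can be integrated numerically. The uniformization (Jensen) series below does this in pure NumPy for a one-component repairable model; the rates λ = 10⁻³/h and μ = 10⁻¹/h are illustrative assumptions, not values from the slides.

```python
import numpy as np

lam, mu = 1e-3, 1e-1          # failure and repair rates (1/hour), assumed values
Q = np.array([[-lam, lam],
              [mu, -mu]])      # states: 0 = up, 1 = down

def transient(Q, p0, t, eps=1e-12):
    """Return p0 * exp(Q t) via the uniformization (Jensen) series."""
    q = max(-Q.diagonal()) * 1.1     # uniformization rate, above the max exit rate
    P = np.eye(len(Q)) + Q / q       # embedded DTMC kernel
    term = np.array(p0, dtype=float)
    out = np.zeros_like(term)
    w = np.exp(-q * t)               # Poisson(q t) weight for k = 0
    k = 0
    while w > eps or k < q * t:
        out += w * term              # accumulate Poisson-weighted DTMC steps
        k += 1
        term = term @ P
        w *= q * t / k
    return out

p = transient(Q, [1.0, 0.0], t=200.0)
print(p)   # for large t this approaches the steady state (mu/(lam+mu), lam/(lam+mu))
```

Uniformization avoids the stiffness issues of naive ODE integration because every term in the series is a proper probability vector.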
20. Steady-state condition
If the process reaches a steady state condition,
then
21. Steady-state analysis (balance equations)
The steady-state equations can be written as flow balance equations with a normalization condition on the state probabilities:
(rate of buildup) = rate of flow in - rate of flow out = 0
rate of flow in = rate of flow out, for each state (balance equation).
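The balance equations plus the normalization condition form a linear system, solved below for a small birth-death chain (states = number of working units, single repairer). The rates are illustrative assumptions.

```python
import numpy as np

lam, mu = 1e-3, 1e-1   # failure and repair rates (1/hour), assumed values
# Birth-death chain on states 2, 1, 0 (number of working units), one repairer.
Q = np.array([[-2 * lam, 2 * lam, 0.0],
              [mu, -(lam + mu), lam],
              [0.0, mu, -mu]])

A = Q.T.copy()
A[-1, :] = 1.0          # replace one (redundant) balance equation by sum(pi) = 1
b = np.zeros(3); b[-1] = 1.0
pi = np.linalg.solve(A, b)
print(pi)               # steady-state probabilities of states 2, 1, 0
```

Replacing one row is needed because the balance equations πQ = 0 are linearly dependent (the rows of Q sum to zero), so the system is singular without the normalization.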
22. 2-component system
23. 2-component system
24. 2-component system
25. 2-component series system
2-component parallel system
26. 2-component stand-by system
27. Repairable system: Availability
28. Repairable system: 2 identical components
29. Repairable system: 2 identical components
30. 2-component Markov availability model
- Assume we have a two-component parallel redundant system with repair rate μ.
- Assume that the failure rate of both components is λ.
- When both components have failed, the system is considered to have failed.
31. Markov availability model
- Let the number of properly functioning components be the state of the system.
- The state space is {0, 1, 2}, where 0 is the system down state.
- We wish to examine the effects of shared vs. non-shared repair.
32. Markov availability model
[State diagram: states 2 → 1 → 0, non-shared (independent) repair]
[State diagram: states 2 → 1 → 0, shared repair]
33. Markov availability model
- Note: the non-shared case can be modeled and solved using an RBD or a FTREE, but the shared case needs the use of Markov chains.
34. Steady-state balance equations
- For any state:
- Rate of flow in = Rate of flow out
- Considering the shared case:
- π_i = steady-state probability that the system is in state i
35. Steady-state balance equations
36. Steady-state balance equations (Continued)
- Steady-state unavailability:
- For the shared case: π_0 = 1 - A_shared
- Similarly, for the non-shared case: steady-state unavailability = 1 - A_non-shared
- Downtime in minutes per year = (1 - A) × 8760 × 60
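The shared vs. non-shared comparison can be sketched numerically. The only difference between the two chains is the repair rate out of state 0 (μ when a single repairer is shared, 2μ when each component has its own). The rates below are illustrative assumptions.

```python
import numpy as np

lam, mu = 1e-3, 1e-1   # per-component failure and repair rates (assumed values)

def unavailability(repair_rate_from_0):
    """pi_0 for the chain 2 -> 1 -> 0 with the given repair rate out of state 0."""
    Q = np.array([[-2 * lam, 2 * lam, 0.0],
                  [mu, -(lam + mu), lam],
                  [0.0, repair_rate_from_0, -repair_rate_from_0]])
    A = Q.T.copy(); A[-1, :] = 1.0       # normalization replaces one equation
    b = np.zeros(3); b[-1] = 1.0
    return np.linalg.solve(A, b)[2]      # pi_0: both components down

for label, r in [("shared", mu), ("non-shared", 2 * mu)]:
    U = unavailability(r)
    print(f"{label}: unavailability {U:.3e}, downtime {U * 8760 * 60:.1f} min/year")
```

With these rates the shared-repair unavailability is roughly twice the non-shared one, which is the effect the slides set out to demonstrate.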
37. Steady-state balance equations
38. Absorbing states: MTTF
39. Absorbing states: MTTF
40. Markov Reliability Model with Imperfect Coverage
41. Markov model with imperfect coverage
- Next consider a modification of the 2-component parallel system, proposed by Arnold as a model of the duplex processors of an electronic switching system.
- We assume that not all faults are recoverable and that c is the coverage factor, which denotes the conditional probability that the system recovers given that a fault has occurred.
- The state diagram is now given by the following picture.
42. Now allow for Imperfect coverage
[State diagram with coverage factor c]
43. Markov model with imperfect coverage
- Assume that the initial state is 2, so that P2(0) = 1.
- Then the system of differential equations is:
44. Markov model with imperfect coverage
- After solving the differential equations we obtain:
- R(t) = P2(t) + P1(t)
- From R(t), we can obtain the system MTTF.
- It should be clear that the system MTTF and system reliability are critically dependent on the coverage factor.
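The dependence on c can be made concrete with the absorbing-state method: restrict the generator to the transient states and solve QB·x = -1 for the mean times to absorption. The rate λ and coverage c below are illustrative assumptions; setting μ = 0 gives the non-repairable case, for which the closed form 1/(2λ) + c/λ is known.

```python
import numpy as np

lam, c = 1e-3, 0.9   # failure rate per unit and coverage factor (assumed values)
mu = 0.0             # repair rate from the simplex state; 0 gives the no-repair case

# Transient states: 2 (duplex) and 1 (simplex). A fault in state 2 occurs at
# rate 2*lam and is covered with probability c; uncovered faults, and a fault
# in state 1, absorb into the failure state (omitted from QB).
QB = np.array([[-2 * lam, 2 * lam * c],
               [mu, -(lam + mu)]])
x = np.linalg.solve(QB, -np.ones(2))   # x[i] = MTTF starting from state i
print(f"MTTF from the duplex state: {x[0]:.1f} hours")
```

Rerunning with c = 1.0 versus c = 0.9 shows how strongly even a small coverage deficit cuts the MTTF.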
45. Sources of fault coverage data
- Measurement data from an operational system
- Large amount of data needed
- Improved instrumentation needed
- Fault-injection experiments
- Expensive but badly needed
- Tools from CMU, Illinois, LAAS (Toulouse)
- A fault/error handling submodel (FEHM)
- Phases: detection, location, retry, reconfiguration, reboot
- Estimate duration and probability of success of each phase
46. Redundant System with Finite Detection/Switchover Time
- Modify the Markov model with imperfect coverage to allow for a finite time to detect as well as imperfect detection.
- You will need to add an extra state, say D.
- The rate at which detection occurs is δ.
- Draw the state diagram and investigate the effects of detection delay on system reliability and mean time to failure.
47. Redundant System with Finite Detection/Switchover Time
- Assumptions:
- Two units have the same MTTF and MTTR
- Single shared repair person
- Average detection/switchover time t_sw = 1/δ
- We need to use a Markov model.
48. Redundant System with Finite Detection/Switchover Time
[State diagram: states 2, 1D, 1, 0]
49. Redundant System with Finite Detection/Switchover Time
- After solving the Markov model, we obtain the steady-state probabilities.
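Since drawing the diagram is left as an exercise, the transition structure below is one plausible encoding of it, not the slides' definitive answer: state 1D is "one unit failed but not yet detected", and only after detection (rate δ) does repair begin. All numeric rates are assumptions.

```python
import numpy as np

# States: 2 (both up), 1D (one unit failed, failure not yet detected),
# 1 (failure detected, repair under way), 0 (both units down).
lam, mu, delta = 1e-3, 1e-1, 60.0   # failure, repair, detection rates (assumed)

Q = np.array([
    # to:      2              1D        1       0
    [-2 * lam,        2 * lam,        0.0,    0.0],   # 2: a unit fails
    [0.0,    -(delta + lam),        delta,    lam],   # 1D: detect, or 2nd failure
    [mu,              0.0,   -(mu + lam),     lam],   # 1: repair, or 2nd failure
    [0.0,             0.0,            mu,     -mu],   # 0: shared repairer restores a unit
])
A = Q.T.copy(); A[-1, :] = 1.0      # normalization replaces one balance equation
b = np.zeros(4); b[-1] = 1.0
pi = np.linalg.solve(A, b)
print(dict(zip(["2", "1D", "1", "0"], np.round(pi, 8))))
```

Varying delta (i.e., the mean switchover time 1/δ) and re-solving shows the detection-delay effect the exercise asks about.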
50. Closed-form
52. A Workstations-Fileserver Example
- Computing system consisting of
- A file-server
- Two workstations
- Computing network connecting them
- System operational as long as:
- One of the workstations, and
- The file-server, are operational
- The computer network is assumed to be fault-free
53. The WFS Example
54. Markov Chain for WFS Example
- Assuming exponentially distributed times to failure:
- λ_w = failure rate of a workstation
- λ_f = failure rate of the file-server
- Assume that components are repairable:
- μ_w = repair rate of a workstation
- μ_f = repair rate of the file-server
- The file-server has priority for repair over workstations (such repair priority cannot be captured by non-state-space models)
55. Markov Availability Model for WFS
Since all states are reachable from every other state, the CTMC is irreducible. Furthermore, all states are positive recurrent.
56. Markov Availability Model for WFS (Continued)
- In the figure, the label (i, j) of each state is interpreted as follows:
- i represents the number of workstations that are still functioning
- j is 1 or 0 depending on whether the file-server is up or down, respectively.
57. Markov Availability Model for WFS (Continued)
- For the example problem, with the states ordered as (2,1), (2,0), (1,1), (1,0), (0,1), (0,0), the generator matrix Q is given by:
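The Q matrix itself did not survive extraction, but it can be rebuilt from the transition rules stated on the surrounding slides (workstation failures at i·λ_w, file-server failure at λ_f, and file-server repair priority, so no workstation repair while j = 0). The numeric rates below are assumed values for illustration.

```python
import numpy as np

lw, lf = 1e-4, 5e-5     # lambda_w, lambda_f (1/hour), assumed values
mw, mf = 1.0, 1.0       # mu_w, mu_f (1/hour), assumed values

S = [(2, 1), (2, 0), (1, 1), (1, 0), (0, 1), (0, 0)]
idx = {s: k for k, s in enumerate(S)}
Q = np.zeros((6, 6))
for i, j in S:
    if j == 1 and i > 0:
        Q[idx[(i, 1)], idx[(i - 1, 1)]] = i * lw    # a workstation fails
    if j == 1:
        Q[idx[(i, 1)], idx[(i, 0)]] = lf            # the file-server fails
    if j == 1 and i < 2:
        Q[idx[(i, 1)], idx[(i + 1, 1)]] = mw        # workstation repair
    if j == 0:
        Q[idx[(i, 0)], idx[(i, 1)]] = mf            # file-server repair (priority)
    if j == 0 and i > 0:
        Q[idx[(i, 0)], idx[(i - 1, 0)]] = i * lw    # workstations keep failing
np.fill_diagonal(Q, -Q.sum(axis=1))                 # diagonal = minus row sum

# Steady-state availability: system up in states (2,1) and (1,1).
M = Q.T.copy()
M[-1, :] = 1.0                      # normalization replaces one balance equation
b = np.zeros(6); b[-1] = 1.0
pi = np.linalg.solve(M, b)
avail = pi[idx[(2, 1)]] + pi[idx[(1, 1)]]
print("steady-state availability:", avail)
```

Generating Q programmatically from the state labels, rather than typing the 6×6 matrix by hand, is exactly the "automatic generation" approach the earlier slides advocate for larger models.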
58. Markov Model (steady-state)
- π = steady-state probability vector
- πQ = 0 with Σ_i π_i = 1: these are called the steady-state balance equations (rate of flow in = rate of flow out)
- After solving for π, we obtain the steady-state availability
59. Markov Availability Model
- We compute the availability of the system:
- The system is available as long as it is in states (2,1) or (1,1).
- Instantaneous availability of the system:
60. Markov Availability Model (Continued)
61. Markov Reliability Model with Repair
- Assume that the computer system does not recover if both workstations fail, or if the file-server fails.
62. Markov Reliability Model with Repair
States (0,1), (1,0) and (2,0) become absorbing states, while (2,1) and (1,1) are transient states. Note: we have made the simplification that, once the CTMC reaches a system failure state, we do not allow any more transitions.
63. Markov Model with Absorbing States
- If we solve for P2,1(t) and P1,1(t), then
- R(t) = P2,1(t) + P1,1(t)
- For a Markov chain with absorbing states:
- A = the set of absorbing states
- B = the set of remaining (transient) states
- zi,j = mean time spent in state (i,j) until absorption
64. Markov Model with Absorbing States (Continued)
QB is derived from Q by restricting it to only the states in B.
The mean time to absorption, MTTA, is given as the sum of the zi,j over all states (i,j) in B.
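A sketch of the MTTA computation for this model: with B = {(2,1), (1,1)}, solve QB·z = -1 so that z[i] is the mean time to absorption starting from state i. The rates below are assumptions, chosen so that the result reproduces the 19992-hour MTTF quoted on a later slide.

```python
import numpy as np

lw, lf = 1e-4, 5e-5   # lambda_w, lambda_f (1/hour), assumed values
mw = 1.0              # mu_w (1/hour); the file-server failure states are absorbing

# QB: generator restricted to the transient states, ordered (2,1), (1,1).
# From (2,1): workstation failure at 2*lw (covered), absorption at lf.
# From (1,1): repair mw back to (2,1); absorption at lw + lf.
QB = np.array([[-(2 * lw + lf), 2 * lw],
               [mw, -(lw + lf + mw)]])
z = np.linalg.solve(QB, -np.ones(2))   # z[i] = mean time to absorption from state i
print(f"MTTA from (2,1): {z[0]:.0f} hours")
```

Note that solving QB·z = -1 directly avoids explicitly inverting QB, which is the numerically preferred way to evaluate -QB⁻¹·1.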
65. Markov Reliability Model with Repair (Continued)
66. Markov Reliability Model with Repair (Continued)
- Mean time to failure is 19992 hours.
67. Markov Reliability Model without Repair
- Assume that neither the workstations nor the file-server is repairable.
68. Markov Reliability Model without Repair (Continued)
States (0,1), (1,0) and (2,0) become absorbing states.
69. Markov Reliability Model without Repair (Continued)
- Mean time to failure is 9333 hours.
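Without repair the MTTF has a simple closed form: the chain spends 1/(2λ_w + λ_f) hours in (2,1) on average, then with probability 2λ_w/(2λ_w + λ_f) visits (1,1) for a further 1/(λ_w + λ_f) hours. The rates below are assumed values consistent with the 9333-hour figure quoted above.

```python
# Assumed rates: lambda_w = 1e-4/h, lambda_f = 5e-5/h.
lw, lf = 1e-4, 5e-5
mttf = 1 / (2 * lw + lf) + (2 * lw / (2 * lw + lf)) * (1 / (lw + lf))
print(round(mttf))    # 9333
```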