Title: Reliability and Availability
Reliability and Availability Analysis
Learning Objectives
- Define availability and reliability in the context of computer and communications systems.
- Provide a quantitative approach to understanding and computing availability and reliability metrics.
A Motivating Availability Example
- Consider a simple example of an online brokerage that is in the process of designing its site and selecting the components that will be used in its design.
- The main consideration here is site availability, which has to be at least 99.99% (four 9s) according to a management decision.
- The site is used by customers to get quotes on stocks and mutual funds, manage portfolios, conduct risk analysis, and place orders to trade stocks and mutual funds.
- In the securities trading business, Web service availability is a key QoS metric.
- If customers are denied access to the trading services, they may incur financial losses and the trading company may be liable for these losses.
Trading site architecture
[Diagram: requests arrive from the Internet through a router to a load balancer, which distributes them among the Web servers; the Web servers access the database servers at the back end.]
- The trading site architecture is composed of a load balancer that distributes the incoming requests to one of the n_WS Web servers.
- The servers are all implemented using the same type of hardware and software.
- At the back end, n_DS database servers are used to store all the persistent data needed to support customer trading transactions.
- The database is fully replicated at each of the n_DS database servers to increase availability and distribute the load.
- The company is considering two types of boxes: highly reliable, expensive, high-end servers with hot-swappable CPU boards and disks, as well as less expensive, less reliable, low-end servers.
- Management wants to answer the following question: what is the least expensive configuration that meets the 99.99% availability requirement - all low-end servers, all high-end servers, or a mix of low-end and high-end servers?
Reasons for System Failure
- To categorize the different types of failure, three dimensions are considered: duration, effect, and scope.
- Duration of the failures
  - Permanent failures
    - The system stops working and there is no possibility of repairing or replacing it (e.g., an unmanned spaceship).
  - Recoverable failures
    - The system is placed back in operation after the fault is repaired (e.g., a Web site that is inaccessible because its connection to the Internet is down).
  - Transient failures
    - Characterized by a very short duration; they may not require major recovery actions (e.g., problems that can be solved by resetting network routers or rebooting servers).
- Effect of the failures
  - Functional failures
    - The system does not operate according to its functional specifications (e.g., an online bookstore failing to display information about a book even though it is in the catalog).
  - Performance failures
    - Even though the system may be executing the requested functions correctly, they are not executed in a timely fashion (e.g., a search engine that returns very accurate results but takes more than a minute on average to process each request).
- Scope of the failure
  - Partial failures
    - Some of the services provided by the computer system become unavailable while others can still be used (e.g., the services that allow customers to bid at an online auction site may become unavailable due to the failure of the servers that process these requests, while customers are still able to see existing bids).
  - Total failures
    - Characterized by a complete disruption of all services offered by the computer system (e.g., a power outage could cause a Web site to go down completely).
Reliability and Availability Basics
- Reliability
  - The reliability of a system or component is the probability that it functions properly and continuously over a fixed period of time.
- Availability
  - The fraction of time that a component (or system) is operational.
- Consider the notion that a component (or system) alternates between periods in which it is operational (the up periods) and periods in which it is down (the down periods).
- Mean Time To Failure (MTTF)
  - The average time it takes for a system to fail.
- Mean Time To Recover (MTTR)
  - The average time it takes for the system to recover.
- Mean Time Between Failures (MTBF)
  - The average time between failures can be written as
    MTBF = MTTF + MTTR
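As a minimal sketch, these three metrics could be estimated from measured up and down periods; the durations below are hypothetical:

```python
# Minimal sketch: estimating MTTF, MTTR, and MTBF from hypothetical logs
# of alternating up and down periods (durations in minutes).

up_periods = [4300.0, 5100.0, 3900.0, 4700.0]    # time operational before each failure
down_periods = [12.0, 8.0, 15.0, 9.0]            # time needed to recover from each failure

mttf = sum(up_periods) / len(up_periods)      # Mean Time To Failure
mttr = sum(down_periods) / len(down_periods)  # Mean Time To Recover
mtbf = mttf + mttr                            # Mean Time Between Failures

print(f"MTTF = {mttf:.1f} min, MTTR = {mttr:.1f} min, MTBF = {mtbf:.1f} min")
```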
Relationship between MTTF, MTTR, and MTBF
[Figure: a timeline alternating between up and down periods; the up periods correspond to the MTTF, the down periods to the MTTR, and the interval from the n-th failure to the (n+1)-th failure corresponds to the MTBF.]
Summary
- Computer systems tend to be labeled by the number of 9s in their availability. For example, a five-9s system has an availability of 99.999%.
- Computer systems are commonly classified according to their availability.
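As an illustrative sketch (the specific class boundaries vary by source), the number of 9s maps directly to a maximum amount of downtime per year:

```python
# Illustrative sketch: yearly downtime implied by an availability level
# expressed as a number of 9s (e.g., five 9s = 99.999%).

MINUTES_PER_YEAR = 365 * 24 * 60

for nines in range(2, 7):
    availability = 1 - 10 ** (-nines)                 # e.g., 3 nines -> 0.999
    downtime_min = (1 - availability) * MINUTES_PER_YEAR
    print(f"{nines} nines ({availability * 100:.4f}%): "
          f"about {downtime_min:.2f} minutes of downtime per year")
```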
Expression for the availability of a system
- The following state transition diagram can be used to show that the system can be in one of two states: up and down.
- The system fails, i.e., goes from up to down, with rate λ.
- It gets repaired, i.e., goes from down to up, with rate μ.
- Writing these rates in terms of the MTTF and MTTR, we get
    λ = 1 / MTTF and μ = 1 / MTTR
- Using the flow-in-flow-out principle, we can write
    λ × p_up = μ × p_down
- Here p_up and p_down are the probabilities that the system is up and down, respectively.
- Thus,
    p_down = (λ / μ) × p_up = (MTTR / MTTF) × p_up
- The availability A of a system is simply p_up.
- We also know that p_up + p_down = 1.
- Therefore,
    p_up + (MTTR / MTTF) × p_up = 1
- Therefore,
    A = p_up = MTTF / (MTTF + MTTR)
- And the system unavailability is simply
    U = 1 - A = MTTR / (MTTF + MTTR)
- In most systems of interest, it takes significantly longer for the system to fail than to be repaired, i.e., MTTF >> MTTR.
- Thus, the unavailability can be approximated as
    U ≈ MTTR / MTTF
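A minimal sketch, with assumed MTTF and MTTR values, comparing the exact unavailability with the MTTR/MTTF approximation:

```python
# Sketch with assumed values: exact availability and unavailability
# versus the U ≈ MTTR/MTTF approximation (valid when MTTF >> MTTR).

def availability(mttf: float, mttr: float) -> float:
    """A = MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)

mttf, mttr = 5_000.0, 20.0     # assumed values, in minutes
a = availability(mttf, mttr)

print(f"A = {a * 100:.4f}%")
print(f"U (exact, 1 - A) = {1 - a:.6f}")
print(f"U (approx., MTTR/MTTF) = {mttr / mttf:.6f}")
```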
- Consider a Web site composed of two Web servers, one application server, and one database server. Suppose that historical data show that the application server machine is rebooted every twenty days on average. Assuming that the system administrator takes 10 minutes to reboot the machine, what is the application server availability?
- Here the MTTF is 20 days (20 × 24 × 60 = 28,800 minutes) and the MTTR is 10 minutes.
- Therefore, the availability is given by
    A = MTTF / (MTTF + MTTR) = 28,800 / (28,800 + 10) = 99.965%
- If the system administrator were able to cut the reboot time by 20% (i.e., to 8 minutes), the availability would be
    A = 28,800 / (28,800 + 10 × 0.8) = 99.972%
- To achieve the same availability (99.972%) with the original MTTR of 10 minutes, the MTTF would have to be increased to 35,704 minutes, i.e., a 24% increase.
- This indicates the importance of reducing the time to recovery in order to improve the availability of a system.
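The two scenarios above, plus the MTTF that would be required to reach the same availability with the original 10-minute MTTR, can be reproduced with a short sketch (values taken from the example):

```python
# Sketch reproducing the application-server example above.

def availability(mttf: float, mttr: float) -> float:
    return mttf / (mttf + mttr)

MTTF = 20 * 24 * 60   # 20 days expressed in minutes = 28,800

print(f"Original reboot (MTTR = 10 min):  A = {availability(MTTF, 10) * 100:.3f}%")
print(f"Reboot cut by 20% (MTTR = 8 min): A = {availability(MTTF, 8) * 100:.3f}%")

# MTTF needed to reach the same availability while keeping MTTR = 10 minutes:
target_a = 0.99972
required_mttf = 10 * target_a / (1 - target_a)   # solve A = MTTF / (MTTF + MTTR) for MTTF
print(f"Required MTTF at MTTR = 10 min: {required_mttf:,.0f} minutes "
      f"({required_mttf / MTTF - 1:+.0%} vs. the original 28,800)")
```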
The Reliability of Systems of Components
- Q: What is the reliability of the system as a function of the reliability of the components used to build it?
- We'll consider two cases:
  - Components connected in series
  - Components connected in parallel
- An example of a serial system is a Web site in which a Web server is connected to an application server, which in turn is connected to a database server, each on its own dedicated machine.
- Inside each box in the diagram are the reliabilities r_1, ..., r_n of the n components.
- To compute the reliability, R_s, of the series system, we need to know the probability that the entire system is operational when needed.
- All n components must be operational for the system to be operational.
- Assume that the n components fail independently (the failure of one component does not affect any other component).
- From probability theory, the probability of an event expressed as the intersection of independent events (here, all n components being operational) is the product of the probabilities of those events. Thus,
    R_s = r_1 × r_2 × ... × r_n
- Implications: since each reliability value r_i is a probability, r_i ≤ 1.
- Therefore, as more components are added in series, the system reliability decreases, as illustrated in the sketch below.
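A minimal sketch of the series formula, here used to show how the reliability of a chain of identical components (assumed r = 0.99) drops as the chain grows:

```python
# Sketch: series-system reliability R_s = r_1 × r_2 × ... × r_n.
# Shows the reliability of a chain of identical components (assumed r = 0.99).
import math

def series_reliability(reliabilities):
    return math.prod(reliabilities)

r = 0.99
for n in (1, 2, 5, 10, 20):
    print(f"n = {n:2d}: R_s = {series_reliability([r] * n):.4f}")
```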
- A Web site has a Web server (WS), an application server (AS), and a database server (DS) in series. Let r_WS, r_AS, and r_DB be the reliabilities of these components and assume their values are r_WS = 0.9, r_AS = 0.95, and r_DB = 0.99.
- Management wants to replace the database server with a highly reliable and expensive model that is advertised as having a reliability of 0.999. Is it a wise decision?
- The reliability of the site with the current database server is
    R_site = r_WS × r_AS × r_DB = 0.9 × 0.95 × 0.99 = 0.84645
- The reliability of the site with the new database server is
    R_newDB_site = r_WS × r_AS × r_newDB = 0.9 × 0.95 × 0.999 = 0.85415
- If, instead of the database server, the Web server (the most unreliable component of the system) is replaced by a new one with r = 0.95, the reliability of the site becomes
    R_newWS_site = r_newWS × r_AS × r_DB = 0.95 × 0.95 × 0.99 = 0.89348
- Thus, it is evident that replacing the most unreliable component has a more pronounced effect on overall system reliability.
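The comparison above can be reproduced in a few lines (component reliabilities taken from the example):

```python
# Sketch reproducing the comparison above: which component upgrade helps more?
import math

def series_reliability(reliabilities):
    return math.prod(reliabilities)

r_WS, r_AS, r_DB = 0.9, 0.95, 0.99

print(f"Current site:              R = {series_reliability([r_WS, r_AS, r_DB]):.5f}")
print(f"New DB server (r = 0.999): R = {series_reliability([r_WS, r_AS, 0.999]):.5f}")
print(f"New Web server (r = 0.95): R = {series_reliability([0.95, r_AS, r_DB]):.5f}")
```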
Reliability block diagram for a parallel system
- Using components in parallel is one of the most common ways to introduce redundancy.
- The reliability of the parallel system, R_p, is the probability that it is in operation when needed.
- This probability is equal to one minus the probability that the system is not in operation.
- For the system not to be in operation, all n components must be down.
- The probability that component i is down is simply (1 - r_i).
- So, assuming independence of failures between components, we get
    R_p = 1 - (1 - r_1) × (1 - r_2) × ... × (1 - r_n)
- In the special case when all components have the same reliability r, we get
    R_p = 1 - (1 - r)^n
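A minimal sketch of the parallel formula, using assumed component reliabilities:

```python
# Sketch: parallel-system reliability R_p = 1 - (1 - r_1)(1 - r_2)...(1 - r_n).
import math

def parallel_reliability(reliabilities):
    return 1 - math.prod(1 - r for r in reliabilities)

# Assumed values: three replicas, each with individual reliability 0.85.
print(f"R_p = {parallel_reliability([0.85, 0.85, 0.85]):.6f}")   # 1 - 0.15**3 = 0.996625
```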
- Thus, as we increase the number of components in parallel, the system reliability grows very quickly, as shown in the following example.
- A search engine site wants to achieve a site reliability of 99.999% using a cluster of very cheap and unreliable Web servers. A cluster is a parallel combination of a number of servers, each with a reliability of 85%. How many servers should be used in the cluster?
- From the equation above we know that
    0.99999 = 1 - (1 - 0.85)^n = 1 - 0.15^n
- So,
    0.15^n = 1 - 0.99999 = 0.00001
- Applying logarithms to both sides of the above equation, and taking into consideration that n must be an integer, we get
    n = ⌈ln 0.00001 / ln 0.15⌉ = ⌈6.069⌉ = 7
- Thus, seven unreliable Web servers can provide a high level of reliability when used in parallel.
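The cluster-sizing calculation can be checked with a couple of lines (numbers taken from the example):

```python
# Sketch verifying the cluster-sizing calculation above.
import math

target = 0.99999   # required site reliability (five 9s)
r = 0.85           # reliability of each individual server

n = math.ceil(math.log(1 - target) / math.log(1 - r))
print(f"Servers needed: {n}")                                         # 7
print(f"Cluster reliability with {n} servers: {1 - (1 - r) ** n:.7f}")
```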