1
  • Reliability and Availability
  • Analysis

2
Learning Objectives
  • Define availability and reliability in the
    context of computer and communications systems.
  • Provide a quantitative approach to understand and
    compute availability and reliability metrics.

3
A Motivating Availability Example
  • Consider a simple example of an online brokerage
    that is designing its site and selecting the
    components to be used in the design.
  • The main consideration here is site availability,
    which, by management decision, has to be at least
    99.99% (four 9s).

4
  • The site is used by users to get quotes on stocks
    and mutual funds, manage portfolios, conduct risk
    analysis, and to place orders to trade stocks and
    mutual funds.
  • In the securities trading business, Web service
    availability is a key QoS metric.
  • If customers are denied access to the trading
    services, they may incur financial losses and the
    trading company may be liable for these losses.

5
Trading site architecture
(Diagram: requests arrive from the Internet through
a router to a load balancer, which forwards them to
the Web servers; the Web servers access the
database servers at the back end.)
6
  • The trading site architecture is composed of a
    load balancer that distributes the incoming
    requests to one of n_WS Web servers.
  • The servers are all implemented using the same
    type of hardware and software.
  • At the back end, n_DS database servers are used
    to store all the persistent data needed to
    support customer trading transactions.
  • The database is fully replicated at each of the
    n_DS database servers to increase availability
    and distribute the load.

7
  • The company is considering two types of boxes:
    highly reliable, expensive, high-end servers with
    hot-swappable CPU boards and disks, and less
    expensive, less reliable, low-end servers.
  • Management wants to answer the following
    questions:
  • What is the least expensive configuration that
    meets the 99.99% availability requirement?
  • All low-end servers, all high-end servers, or a
    mix of low-end and high-end servers?

8
Reasons for System Failure
  • To categorize the different types of failure,
    three dimensions are considered:
  • Duration, Effect, and Scope

9
  • Duration of the failure
  • Permanent failures
  • The system stops working and there is no
    possibility of repairing or replacing it (e.g.,
    an unmanned spacecraft).
  • Recoverable failures
  • The system is placed back in operation after the
    fault is repaired (e.g., a Web site becoming
    inaccessible because its connection to the
    Internet is down).
  • Transient failures
  • Characterized by a very short duration; they may
    not require major recovery actions (e.g.,
    problems that can be solved by resetting network
    routers or rebooting servers).

10
  • Effect of the failure
  • Functional failures
  • The system does not operate according to its
    functional specifications (e.g., an online
    bookstore failing to display information about a
    book even though it is in the catalog).
  • Performance failures
  • Even though the system may execute the requested
    functions correctly, it does not execute them in
    a timely fashion (e.g., a search engine that
    returns very accurate results but takes more than
    a minute on average to process each request).

11
  • Scope of the failure
  • Partial failures
  • Some of the services provided by the computer
    system become unavailable, while others can
    still be used (e.g., the services that allow
    customers to bid at an online auction site may
    become unavailable because the servers that
    process these requests fail, while customers can
    still see existing bids).
  • Total failures
  • Characterized by a complete disruption of all
    services offered by the computer system (e.g., a
    power outage could bring a Web site down
    completely).

12
Reliability and Availability Basics
  • Reliability
  • The reliability of a system or component is the
    probability that it functions properly and
    continuously over a fixed period of time.
  • Availability
  • The fraction of time that a component (or system)
    is operational.

13
  • A component (or system) alternates between
    periods in which it is operational (the up
    periods) and periods in which it is down (the
    down periods).
  • Mean Time To Failure (MTTF)
  • The average time it takes for a system to fail.

14
  • Mean Time To Recover (MTTR)
  • The average time it takes for the system to
    recover.
  • Mean Time Between Failures (MTBF)
  • The average time between failures can be written
    as,
  • MTBF = MTTF + MTTR

15
Relationship between MTTF, MTTR, and MTBF
(Timeline diagram: an up period of average length
MTTF ends at the n-th failure; a down period of
average length MTTR follows; the system is then up
again until the (n+1)-th failure. MTBF, the span
from one failure to the next, equals MTTF + MTTR.)
16
Summary
Computer systems tend to be labeled by the number
of 9s in their availability. For example, a
five-9s system has an availability of 99.999%.
Computer systems are thus classified according to
their availability by their number of 9s; the
sketch below shows the maximum downtime per year
that each class implies.
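
As a rough illustration (a minimal Python sketch,
not part of the original slides), the downtime per
year implied by each number of 9s can be computed
directly from the unavailability:

    # Downtime per year implied by an availability with k nines:
    # availability = 1 - 10**(-k)
    MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

    for nines in range(1, 6):
        availability = 1 - 10 ** (-nines)
        downtime = (1 - availability) * MINUTES_PER_YEAR
        print(f"{nines} nine(s): {availability:.5%} -> "
              f"{downtime:,.1f} min of downtime/year")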
17
Expression for the availability of a system
  • The following state transition diagram can be
    used to show that the system can be in one of two
    states: up and down.
  • The system fails, i.e., goes from up to down,
    with a rate λ.
  • It gets repaired, i.e., goes from down to up,
    with a rate μ.

18
  • Writing these rates in terms of the MTTF and the
    MTTR, we get
  • λ = 1/MTTF and μ = 1/MTTR
  • Using the flow-in-flow-out principle, we can
    write
  • λ × p_up = μ × p_down
  • Here p_up and p_down are the probabilities that
    the system is up and down, respectively.
  • Thus,
  • p_down = (λ/μ) × p_up = (MTTR/MTTF) × p_up

19
  • The availability A of a system is simply p_up.
  • We also know that p_up + p_down = 1.
  • Therefore,
  • p_up × (1 + MTTR/MTTF) = 1

20
  • Therefore,
  • A = p_up = MTTF/(MTTF + MTTR)
  • And the system unavailability is simply
  • U = 1 − A = MTTR/(MTTF + MTTR)

21
  • In most systems of interest, it takes
    significantly longer for the system to fail than
    to be repaired:
  • MTTF >> MTTR
  • Thus, the unavailability can be approximated as
  • U ≈ MTTR/MTTF

22
  • Consider a Web site composed of two Web servers,
    one application server, and one database server.
    Suppose that historical data show that the
    application server machine is rebooted every
    twenty days on average. Assuming that the system
    administrator takes 10 minutes to reboot the
    machine, what is the application server
    availability?
  • Here the MTTF is 20 days (20 × 24 × 60 = 28,800
    minutes) and
  • the MTTR is 10 minutes.
  • Therefore, the availability is given by
  • A = MTTF/(MTTF + MTTR)
      = 28,800/(28,800 + 10) = 99.965%
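
A minimal Python check of this calculation (the
variable names are mine, not from the slides):

    mttf = 20 * 24 * 60   # 20 days in minutes = 28,800
    mttr = 10             # minutes per reboot

    availability = mttf / (mttf + mttr)
    print(f"A = {availability:.5%}")      # ~99.965%

    # Since MTTF >> MTTR, U is well approximated by MTTR/MTTF:
    print(1 - availability, mttr / mttf)  # 0.000347... vs 0.000347...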

23
  • If the system administrator were able to cut the
    reboot time by 20% (i.e., to 8 minutes),
  • the availability would be
    A = 28,800/(28,800 + 10 × 0.8) = 99.972%
  • To achieve the same availability (99.972%) with
    the original MTTR of 10 minutes, the MTTF would
    have to be increased to 35,704 minutes, i.e., a
    24% increase.
  • This indicates the importance of reducing the
    time to recover as a way of improving the
    availability of a system (see the sketch below).
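
The required MTTF follows from solving
A = MTTF/(MTTF + MTTR) for MTTF; a short sketch,
assuming the rounded availability of 99.972% used
on the slide:

    # A = MTTF/(MTTF + MTTR)  =>  MTTF = A * MTTR / (1 - A)
    target_a = 0.99972   # availability reached with the 8-minute reboot
    mttr = 10            # original repair time, in minutes

    mttf_needed = target_a * mttr / (1 - target_a)
    print(f"MTTF needed: {mttf_needed:,.0f} min")       # ~35,704
    print(f"increase: {mttf_needed / 28_800 - 1:.0%}")  # ~24%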

24
The Reliability of Systems of Components
  • Q: What is the reliability of the system as a
    function of the reliability of the components
    used to build it?
  • We'll consider two cases:
  • Components connected in series
  • Components connected in parallel
  • An example of a serial system is a Web site in
    which a Web server is connected to an application
    server, which is in turn connected to a database
    server, each on its own dedicated machine.

25
  • Inside each box in the diagram are the
    reliabilities r_1, ..., r_n of the n components.
  • The reliability, R_s, of the series system is
    the probability that the entire system is
    operational when needed.
  • All n components must be operational for the
    system to be operational.

26
  • Assume that the n components fail independently
    (the failure of one component does not affect any
    other component).
  • By probability theory, the probability of an
    event expressed as the intersection of
    independent events (all n components operational)
    is the product of the probabilities of those
    events. Thus,
  • R_s = r_1 × r_2 × ... × r_n
  • Implications: since each reliability value, r_i,
    is a probability, r_i ≤ 1.
  • Therefore, as more components are added in
    series, the system reliability decreases (see the
    sketch below).
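
A minimal sketch of the series formula in Python
(the helper name series_reliability is mine, not
from the slides):

    from math import prod

    def series_reliability(reliabilities):
        """All components in series must be up."""
        return prod(reliabilities)

    # Example from the next slide: Web, application, database.
    print(series_reliability([0.9, 0.95, 0.99]))  # 0.84645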

27
  • A Web site has a Web server (WS), an application
    server (AS), and a database server (DS) in
    series. Let r_WS, r_AS, and r_DS be the
    reliabilities of these components and assume
    their values are r_WS = 0.9, r_AS = 0.95, and
    r_DS = 0.99.
  • Management wants to replace the database server
    with a highly reliable and expensive model that
    is advertised as having 0.999 reliability. Is
    this a wise decision?
  • The reliability of the site with the current
    database server is
  • R_site = r_WS × r_AS × r_DS
           = 0.9 × 0.95 × 0.99 = 0.84645
  • The reliability of the site with the new database
    server is
  • R_newDB_site = r_WS × r_AS × r_newDS
           = 0.9 × 0.95 × 0.999 = 0.85415

28
  • If, instead of the database server, the Web
    server (the least reliable component of the
    system) is replaced by a new one with r = 0.95,
    the reliability of the site becomes
  • R_newWS_site = r_newWS × r_AS × r_DS
           = 0.95 × 0.95 × 0.99 = 0.89348
  • Thus, replacing the least reliable component has
    a more pronounced effect on overall system
    reliability (the two options are compared in the
    sketch below).
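
Comparing the two upgrade options with the same
product formula (a self-contained sketch):

    from math import prod

    baseline = prod([0.90, 0.95, 0.99])   # 0.84645
    new_db   = prod([0.90, 0.95, 0.999])  # 0.85415 (~ +0.9%)
    new_ws   = prod([0.95, 0.95, 0.99])   # 0.89348 (~ +5.6%)
    print(baseline, new_db, new_ws)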

29
Reliability block diagram for a parallel system
  • Using components in parallel is one of the most
    common ways to provide redundancy.
  • The reliability of the parallel system, R_p, is
    the probability that it is in operation when
    needed.
  • This probability is equal to one minus the
    probability that the system is not in operation.

30
  • For that to happen, all n components must be
    down.
  • The probability that component i is down is
    simply (1 − r_i).
  • So, assuming independence of failures between
    components, we get
  • R_p = 1 − (1 − r_1)(1 − r_2) ... (1 − r_n)
  • In the special case where all components have the
    same reliability r, this becomes (see the sketch
    below)
  • R_p = 1 − (1 − r)^n
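
A companion sketch for the parallel formula (the
helper name parallel_reliability is mine):

    from math import prod

    def parallel_reliability(reliabilities):
        """Redundant components: system is up unless all fail."""
        return 1 - prod(1 - r for r in reliabilities)

    # Identical components: R_p = 1 - (1 - r)**n
    for n in (1, 2, 3, 4):
        print(n, parallel_reliability([0.85] * n))
    # 0.85, 0.9775, 0.996625, 0.99949...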

31
  • Thus, as we increase the number of components,
    system reliability grows very quickly,
  • as a plot of R_p = 1 − (1 − r)^n versus n shows.

32
  • A search engine site wants to achieve a site
    reliability of 99.999% using a cluster of very
    cheap and unreliable Web servers. A cluster is a
    parallel combination of a number of servers, each
    with a reliability of 85%. How many servers
    should be used in the cluster?
  • From the equation above we know that
  • 0.99999 = 1 − (1 − 0.85)^n = 1 − 0.15^n
  • So,
  • 0.15^n = 1 − 0.99999 = 0.00001
  • Taking logarithms of both sides and noting that n
    must be an integer, we get
  • n = ⌈ln 0.00001 / ln 0.15⌉ = ⌈6.069⌉ = 7
  • Thus, seven unreliable Web servers used in
    parallel can provide a high level of reliability
    (see the sketch below).
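
The same computation as a short sketch:

    from math import ceil, log

    target = 0.99999  # five 9s
    r = 0.85          # per-server reliability

    # 1 - (1 - r)**n >= target  =>  n >= ln(1 - target)/ln(1 - r)
    n = ceil(log(1 - target) / log(1 - r))
    print(n)                 # 7
    print(1 - (1 - r) ** n)  # 0.9999983 >= 0.99999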