CTIS 490 DISTRIBUTED SYSTEMS - PowerPoint PPT Presentation

Provided by: cneyt

1
CTIS 490 DISTRIBUTED SYSTEMS
  • WEEK 9
  • FAULT TOLERANCE

2
INTRODUCTION
  • Another feature of distributed systems that
    distinguishes them from non-distributed systems
    is the notion of a partial failure.
  • A partial failure may happen when one component
    in a distributed system fails. This failure may
    affect the proper operation of some components,
    while at the same time leaving yet other
    components unaffected.
  • One of the goals of distributed systems design is
    to construct the system so that it can
    automatically recover from partial failures. In
    particular, whenever a failure occurs, the
    distributed system should continue to operate in
    an acceptable way while repairs are being made.
  • In other words, it should tolerate faults and
    continue to operate.

3
FAULT TOLERANCE
  • Being fault tolerant is strongly related to what
    are called dependable systems.
  • Dependable systems have the following attributes:
  • Availability is the property that a system is
    ready to be used immediately. It is the
    probability that the system is operating
    correctly at any given moment and is available to
    perform its functions.
  • Reliability is the property that a system can
    run continuously without failure. In contrast to
    availability, reliability is defined in terms of
    a time interval instead of an instant in time.
  • A highly reliable system is one that will most
    likely continue to work without interruption
    during a relatively long period of time. If a
    system goes down for 1 millisecond every hour, it
    has an availability of over 99.999 percent, but
    it is unreliable. If a system never crashes but
    is shut down for two weeks every August, it has
    high reliability but only 96 percent availability.
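
The two figures above can be checked with a little arithmetic. The sketch below is only an illustration: the helper name `availability` and the 365-day year are assumptions, not part of the slides.

```python
MS_PER_HOUR = 60 * 60 * 1000
DAYS_PER_YEAR = 365  # assumption: a non-leap year


def availability(downtime: float, period: float) -> float:
    """Fraction of the period during which the system is up."""
    return 1.0 - downtime / period


# System A: down for 1 ms every hour (available, but it fails 24 times a day).
a = availability(1, MS_PER_HOUR)

# System B: never crashes, but is shut down for 14 days every year.
b = availability(14, DAYS_PER_YEAR)

print(f"System A availability: {a:.7%}")
print(f"System B availability: {b:.2%}")
```

Availability only measures the fraction of time the system is up. Reliability asks how long it runs between failures, which is why system A scores well on the first measure and poorly on the second.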

4
FAULT TOLERANCE
  • Safety refers to the situation in which, when a
    system temporarily fails to operate correctly,
    nothing catastrophic happens. For example, many
    process control systems, such as those used for
    controlling nuclear power plants or sending
    people into space, are required to provide a high
    degree of safety.
  • Maintainability refers to how easily a failed
    system can be repaired. A highly maintainable
    system may also show a high degree of
    availability, especially if failures can be
    detected and repaired automatically.
  • Dependable systems are also required to provide
    a high degree of security.

5
FAULT TOLERANCE
  • Faults can be classified as follows:
  • Transient faults occur once and then disappear.
    A bird flying through the beam of a transmitter
    may cause lost bits on some networks.
  • Intermittent faults occur, then vanish, then
    reappear. A loose contact on a connector will
    often cause this kind of fault.
  • Permanent faults continue to exist until the
    faulty component is replaced. Burnt-out chips and
    software bugs are examples of this type of fault.

6
FAILURE MODELS
7
FAILURE MODELS
  • Crash failures occur when a server prematurely
    halts, but was working correctly until it
    stopped. An example of a crash failure is an
    operating system that comes to a halt, where the
    only solution is to reboot it.
  • Omission failures occur when a server fails to
    respond to a request. Maybe the server never got
    the request in the first place, or maybe it could
    not send a response (the server is said to hang).
  • Timing (performance) failures occur when the
    response lies outside a specified real-time
    interval. Timing is very important in data
    streaming, for example.

8
FAILURE MODELS
  • Response failures occur when a server's response
    is incorrect. For example, a search engine
    returns Web pages not related to the search
    terms. When a server receives a message that it
    cannot recognize, it may incorrectly take default
    actions.
  • Arbitrary failures, also known as Byzantine
    failures, are the most serious faults. They may
    happen when a server produces an output it should
    never have produced, but which cannot be detected
    as being incorrect. A server may even be
    maliciously working together with other servers
    to produce intentionally wrong answers.
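
As a rough illustration of how these failure types look from the client side, the sketch below labels a single request/reply exchange. The function name `classify` and its parameters are hypothetical, and it assumes the client already knows the expected reply. For an arbitrary (Byzantine) failure that assumption does not hold, which is exactly why such failures cannot be detected this way.

```python
from typing import Optional


def classify(reply: Optional[str], elapsed: float,
             deadline: float, expected: str) -> str:
    """Label one request/reply exchange as observed by a client."""
    if reply is None:
        # No reply at all: the client cannot tell whether the server
        # crashed or merely omitted the response.
        return "crash or omission failure"
    if elapsed > deadline:
        # A reply arrived, but outside the agreed time interval.
        return "timing failure"
    if reply != expected:
        # A timely reply with the wrong content.
        return "response failure"
    return "ok"


print(classify(None, 0.2, 1.0, "pong"))
print(classify("pong", 2.5, 1.0, "pong"))
print(classify("ping", 0.2, 1.0, "pong"))
```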

9
REDUNDANCY
  • A fault tolerant system should hide the
    occurrence of failures from other processes.
  • The key technique for masking faults is to use
    redundancy.
  • There are three kinds of redundancy:
  • Information redundancy: extra bits are added to
    allow recovery from garbled bits.
  • Time redundancy: an action is performed, and
    then, if need be, it is performed again. For
    example, if a transaction aborts, it can be
    redone. Time redundancy is helpful when the
    faults are transient or intermittent.
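
The first two kinds of redundancy can be sketched in a few lines. The slides do not name a specific code, so the Hamming(7,4) code below is a swapped-in example of information redundancy: three parity bits are added to four data bits so that any single garbled bit can be corrected. The `retry` helper and the `TransientError` type are hypothetical names used to illustrate time redundancy.

```python
def hamming_encode(d):
    """Encode 4 data bits as 7 bits laid out as [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p4 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming_decode(word):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(word)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # 1-based position of the bad bit, or 0
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


class TransientError(Exception):
    """Hypothetical error type standing in for a transient fault."""


def retry(action, attempts=3):
    """Time redundancy: perform the action again if it fails transiently."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except TransientError:
            if attempt == attempts:
                raise  # the fault did not go away; give up


word = hamming_encode([1, 0, 1, 1])
word[4] ^= 1  # garble one bit in transit
print(hamming_decode(word))
```

Retrying only helps when the fault may have disappeared by the next attempt, which is why time redundancy suits transient and intermittent faults rather than permanent ones.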

10
REDUNDANCY
  • Physical redundancy: extra equipment or
    processes are added to make it possible for the
    system as a whole to tolerate the loss or
    malfunctioning of some components. Physical
    redundancy can be done either in hardware or
    software.
  • Physical redundancy is used in biology (mammals
    have two eyes, two ears, etc.), in aircraft (a
    747 has four engines, but it can fly on three),
    and in sports (multiple referees).
  • Physical redundancy is also used in electronic
    circuits.

11
REDUNDANCY
Triple modular redundancy.
12
TRIPLE MODULAR REDUNDANCY
  • In triple modular redundancy (TMR), each device
    is replicated three times.
  • Each voter is a circuit that has three inputs and
    one output. If two or three of the inputs are the
    same, the output is equal to that input. If all
    three inputs are different, the output is
    undefined.
  • Three voters are also needed at each stage,
    because a voter is also a component and can also
    be faulty.
  • Although not all fault-tolerant distributed
    systems use TMR, the technique is very general
    and gives a clear idea of what a fault-tolerant
    system is, as opposed to a system whose
    individual components are highly reliable but
    whose overall design cannot tolerate faults.
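
The voter rule described above can be sketched in software as follows. This is an illustration, not a circuit design: the names `vote` and `tmr_stage` are hypothetical, and `None` stands in for the "undefined" output.

```python
def vote(a, b, c):
    """Majority voter: return the value at least two inputs agree on."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    return None  # all three inputs differ: the output is undefined


def tmr_stage(devices, inputs):
    """Run three copies of a device, then three voters on their outputs.

    Each voter sees all three device outputs, so the next stage again
    receives three signals and can mask one faulty voter in turn.
    """
    outputs = [device(x) for device, x in zip(devices, inputs)]
    return [vote(*outputs) for _ in range(3)]


def good(x):
    return x + 1


def faulty(x):
    return -999  # a broken copy that always produces garbage


print(tmr_stage([good, faulty, good], [1, 1, 1]))
```

With one faulty copy per stage, every voter still outputs the value produced by the two working copies, so the fault is masked rather than reported.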

13
DISTRIBUTED COMMIT
  • The distributed commit problem involves having an
    operation, for example a distributed transaction,
    performed either by every member of a process
    group or by none at all.
  • Distributed commit is usually established by
    means of a coordinator.
  • In its simplest form, a coordinator tells the
    other processes whether or not to locally perform
    the operation. This scheme is referred to as a
    one-phase commit protocol.
  • In this scheme, if one process cannot perform the
    operation, there is no way to tell the
    coordinator.
  • The most commonly used protocol is two-phase
    commit.

14
TWO-PHASE COMMIT (2PC)
  • Consider a distributed transaction involving the
    participation of a number of processes each
    running on a different machine.
  • The protocol consists of 2 phases, each
    consisting of 2 steps:
  • Voting phase
  • Decision phase

15
TWO-PHASE COMMIT (2PC)
  • The finite state machine for the coordinator in
    2PC.
  • The finite state machine for a participant.

16
TWO-PHASE COMMIT (2PC)
  • The coordinator sends a VOTE_REQUEST message to
    all participants.
  • In response, each participant sends either a
    VOTE_COMMIT or a VOTE_ABORT message.
  • The coordinator collects all votes from the
    participants. If all participants have voted to
    commit, the coordinator sends a GLOBAL_COMMIT
    message. If at least one participant has voted to
    abort the transaction, the coordinator sends a
    GLOBAL_ABORT message.
  • Each participant that voted for a commit waits
    for the coordinator's decision. When a
    participant receives a GLOBAL_COMMIT message, it
    locally commits the transaction. Otherwise, if it
    receives a GLOBAL_ABORT message, the transaction
    is locally aborted.
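
The four steps above can be sketched as a failure-free, in-process simulation. The class and function names are hypothetical, and direct method calls stand in for network messages. Only the message names and the states INIT, READY, COMMIT, and ABORT come from the slides.

```python
from enum import Enum


class Vote(Enum):
    COMMIT = "VOTE_COMMIT"
    ABORT = "VOTE_ABORT"


class Decision(Enum):
    COMMIT = "GLOBAL_COMMIT"
    ABORT = "GLOBAL_ABORT"


class Participant:
    def __init__(self, name, can_commit):
        self.name = name
        self.can_commit = can_commit
        self.state = "INIT"

    def on_vote_request(self):
        """Step 2: answer VOTE_REQUEST with a vote."""
        if self.can_commit:
            self.state = "READY"
            return Vote.COMMIT
        self.state = "ABORT"
        return Vote.ABORT

    def on_decision(self, decision):
        """Step 4: apply the coordinator's global decision locally."""
        self.state = "COMMIT" if decision is Decision.COMMIT else "ABORT"


def run_two_phase_commit(participants):
    # Phase 1 (voting): send VOTE_REQUEST and collect every vote.
    votes = [p.on_vote_request() for p in participants]
    # Phase 2 (decision): commit only if every participant voted to commit.
    unanimous = all(v is Vote.COMMIT for v in votes)
    decision = Decision.COMMIT if unanimous else Decision.ABORT
    for p in participants:
        p.on_decision(decision)
    return decision


group = [Participant("P1", True), Participant("P2", False)]
print(run_two_phase_commit(group).value)
print([p.state for p in group])
```

A single VOTE_ABORT is enough to abort the transaction everywhere, which is what gives the protocol its all-or-nothing property when no process fails.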

17
TWO-PHASE COMMIT (2PC)
  • Several problems arise when the 2PC protocol is
    used in a system where failures occur, since the
    coordinator and the participants block waiting
    for incoming messages.
  • The protocol easily fails when one of the
    processes crashes. For this reason, timeout
    mechanisms are used.
  • There are a total of three states in which either
    a coordinator or a participant is blocked waiting
    for an incoming message.
  • A participant may be waiting in its INIT state
    for a VOTE_REQUEST. If that message is not
    received after some time, it will locally abort
    the transaction and send a VOTE_ABORT message to
    the coordinator.

18
TWO-PHASE COMMIT (2PC)
  • A coordinator can be blocked in state WAIT,
    waiting for the votes of each participant. If not
    all votes have been collected after a certain
    period of time, the coordinator should vote for
    abort as well, and send GLOBAL_ABORT.
  • A participant can be blocked in state READY,
    waiting for the global vote as sent by the
    coordinator. If that message is not received
    within a given time, the participant cannot
    simply decide to abort. Instead, it must find out
    which message the coordinator sent.
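
The coordinator's timeout in state WAIT might look like the sketch below. It assumes votes arrive on a local `queue.Queue` acting as an inbox; the function name `collect_votes` and its parameters are hypothetical.

```python
import queue
import time


def collect_votes(inbox, expected, timeout):
    """Coordinator in state WAIT: gather votes until all arrive or time is up."""
    deadline = time.monotonic() + timeout
    received = 0
    while received < expected:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return "GLOBAL_ABORT"  # timed out before all votes were in
        try:
            vote = inbox.get(timeout=remaining)
        except queue.Empty:
            return "GLOBAL_ABORT"  # a vote is missing: abort to be safe
        if vote == "VOTE_ABORT":
            return "GLOBAL_ABORT"  # one abort vote settles the outcome
        received += 1
    return "GLOBAL_COMMIT"


inbox = queue.Queue()
inbox.put("VOTE_COMMIT")  # only one of the two expected votes arrives
print(collect_votes(inbox, expected=2, timeout=0.1))
```

Aborting on a timeout is safe here because no participant can have committed before the coordinator sends GLOBAL_COMMIT. A participant in READY has no such shortcut, since the decision may already have been made without it.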

19
TWO-PHASE COMMIT (2PC)
  • The simplest solution to this problem is to let
    each participant block until the coordinator
    recovers again.
  • A better solution is to let a participant P
    contact another participant Q to see if it can
    decide from Q's state what it should do.

Actions taken by a participant P when residing in
state READY and having contacted another
participant Q.
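
The table this caption refers to is not reproduced in the transcript. The sketch below follows the usual textbook treatment of 2PC, so treat the mapping as an assumption to be checked against the original slide. The names `ACTION_BY_Q_STATE` and `resolve` are hypothetical.

```python
# What P (blocked in READY) may do, given the state of the peer Q it contacted.
ACTION_BY_Q_STATE = {
    "COMMIT": "COMMIT",  # Q received GLOBAL_COMMIT, so that was the decision
    "ABORT": "ABORT",    # Q aborted, so the global decision is abort
    "INIT": "ABORT",     # Q never got VOTE_REQUEST, so no commit was decided
    "READY": None,       # Q knows no more than P: contact another participant
}


def resolve(peer_states):
    """Return P's action, or None if every contacted peer is also in READY."""
    for q_state in peer_states:
        action = ACTION_BY_Q_STATE[q_state]
        if action is not None:
            return action
    return None  # P must block until the coordinator recovers


print(resolve(["READY", "INIT"]))
print(resolve(["READY", "READY"]))
```

If every participant is in READY, none of them can decide, and the protocol blocks until the coordinator recovers.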
20
TWO-PHASE COMMIT (2PC)
  • To ensure that a process can actually recover, it
    is necessary that it saves its state to
    persistent storage.
  • If a participant was in state INIT, it can safely
    decide to locally abort the transaction when it
    recovers, and then inform the coordinator.
  • Problems arise when a participant crashes while
    residing in state READY. When recovering, it
    cannot decide on its own what it should do next,
    that is, whether to commit or abort the
    transaction.
  • Consequently, it is forced to contact other
    participants to find out what it should do,
    similar to the situation when it times out while
    residing in state READY.
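
A participant's recovery logic could be sketched as below, assuming it records each state change in a small JSON file before acting on it. The file layout, the helper names `save_state` and `recover`, and the choice to treat a missing file as INIT are all assumptions made for illustration.

```python
import json
import os
import tempfile


def save_state(path, state):
    """Write the state durably: temp file, fsync, then atomic rename."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"state": state}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def recover(path):
    """Decide what a restarted participant should do next."""
    if not os.path.exists(path):
        return "ABORT"  # assumption: nothing logged means still in INIT
    with open(path) as f:
        state = json.load(f)["state"]
    if state == "INIT":
        return "ABORT"  # safe: this participant never voted to commit
    if state == "READY":
        return "ASK_PEERS"  # cannot decide alone; contact other participants
    return state  # COMMIT or ABORT was already decided before the crash


with tempfile.TemporaryDirectory() as workdir:
    log = os.path.join(workdir, "participant.json")
    save_state(log, "READY")
    print(recover(log))
```

The state must reach persistent storage before the corresponding message is sent. Otherwise a participant could vote to commit, crash, and come back believing it was still in INIT.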

21
TWO-PHASE COMMIT (2PC)
  • Outline of the steps taken by the coordinator in
    a two-phase commit protocol (continued on the
    next slide)

22
TWO-PHASE COMMIT (2PC)
  • Outline of the steps taken by the coordinator in
    a two-phase commit protocol

23
TWO-PHASE COMMIT (2PC)
  • (a) The steps taken by a participant process in
    2PC.

24
TWO-PHASE COMMIT (2PC)
  • (b) The steps for handling incoming decision
    requests.