Title: Fault Tolerance
1Fault Tolerance
- Introduction to Distributed Systems, CS 457/557, Fall 2008, Kenneth Chiu
2Nondistributed vs. Distributed
- Can you make a nondistributed system fault-tolerant?
- Can a nondistributed system have partial failures?
- Key goal of fault tolerance is to allow a system to continue to function after a partial failure.
3Basic Concepts
- Availability
  - Is it working?
- Reliability
  - Is it the same as availability?
  - How available is something that is down for 1 ms every hour? How reliable is it?
- Safety
  - How does it fail? What happens when there are no signals?
- Maintainability
  - How easy is it to repair? Can affect availability.
- What is failure?
  - Can't meet promises.
  - An error is something that might lead to failure. It is part of the system's state.
  - A fault is the cause of the error, like a short in a circuit.
    - Transient
    - Intermittent
    - Permanent
- A system is called k fault tolerant if it can tolerate k faults.
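The "down for 1 ms every hour" question can be worked out numerically. The sketch below is only an illustration of the arithmetic; the variable names are made up for this example.

```python
# Availability: fraction of time the system is up.
MS_PER_HOUR = 60 * 60 * 1000

downtime_ms_per_hour = 1
availability = (MS_PER_HOUR - downtime_ms_per_hour) / MS_PER_HOUR

# Reliability: how long the system runs without interruption.
# With one outage per hour, no uninterrupted run lasts a full hour.
longest_run_hours = (MS_PER_HOUR - downtime_ms_per_hour) / MS_PER_HOUR

print(f"availability = {availability:.7f}")
print(f"longest uninterrupted run = {longest_run_hours:.7f} hours")
```

So the system is highly available (better than 99.9999%), yet it is not very reliable if a job needs to run for more than an hour without interruption.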
4Failure Models
- How can things fail (from the viewpoint of an observer)?
- Why try to define them?
  - Because addressing them may have different levels of difficulty.
  - Some systems might be able to tolerate some kinds, but not others.
  - Some kinds of faults may be rare.
  - We may not want to address them, but we need to be able to define those kinds precisely.
5Failure Models
- Crashes are also called fail-stop.
- Arbitrary (Byzantine) failures are everything else, and include malicious or subverted servers.
- What kinds of failures are easiest? What kinds are hardest?
6Redundancy
- You can mask failures by being redundant.
- Redundant information. How?
- Time redundancy. How?
- Physical redundancy. How?
- Is nature redundant?
7Triple Modular Redundancy
Signals pass through three devices.
- Redundancy. Why three voters?
- How many fail-stop faults can this tolerate?
- How many response failures (wrong values)?
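A voter in triple modular redundancy can be sketched as a majority function. This is a minimal illustration, assuming the voter needs at least two matching inputs out of three and that a crashed (silent) device is modeled as `None`; the function name is hypothetical.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value agreed on by a strict majority of devices, or None."""
    present = [v for v in outputs if v is not None]  # None = device produced no signal
    if not present:
        return None
    value, count = Counter(present).most_common(1)[0]
    return value if count > len(outputs) / 2 else None

print(majority_vote([1, 1, 0]))     # one device gives a wrong value
print(majority_vote([1, None, 1]))  # one device is silent
print(majority_vote([1, 0, None]))  # two faults: no majority
```

Under this assumption a single stage masks one fault of either kind. A voter that trusts any surviving signal could do better against purely silent failures, which is the point of asking the two questions separately.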
8Process Resilience
9- Protect against process failures, so replicate in
groups.
10Process Groups
- Key technique to tolerating faulty processes is to organize them into a group.
- When a message is sent to a group, all members get it.
- Can be dynamic. A process can also be in multiple groups.
- Can be flat or hierarchical.
11- Flat groups: All processes are the same.
- Hierarchical: Might have a coordinator.
- What happens when a process is lost in a flat group vs. a hierarchical group?
  - In flat groups, all processes are the same, so there is no danger of losing the coordinator.
- Which one probably has more complicated decision-making algorithms?
  - Using a coordinator makes decision making easier.
[Figure: a flat group of workers, and a hierarchical group with a coordinator and workers]
13- Group membership
- Can use a group server. It is a centralized machine that maintains all groups.
- Distributed? How would you do it in real life?
  - Join by multicasting to all.
  - Send a goodbye message when leaving?
  - Problematic if there is a crash, because there is no message announcing that the crashed process has left.
  - Must notice that the crashed process no longer responds.
- Joining and leaving have to be synchronous.
  - As soon as it joins, should receive all messages.
  - As soon as it leaves, must stop receiving.
- Rebuilding the group is hard.
14Failure Masking and Replication
- Process groups are part of the solution for fault tolerance.
- Replace a single, vulnerable process with a (fault-tolerant) group.
- Two ways to approach replication: primary-based or replicated-write.
- A system is k fault tolerant if it can tolerate k faults.
- Suppose everything is fail-stop (silently). How many processes do we need to tolerate k faults for a simple voting system?
- Suppose the failures are Byzantine. Now how many?
15Agreement in Faulty Systems
- When using simple voting, we can tolerate k faults with 2k+1 processes. Not so simple when we are talking about agreement.
- What aspects are important?
  - Synchronous vs. asynchronous: Do the processes operate in lock-step?
  - Is communication latency bounded?
  - Is message delivery ordered?
  - Unicasting or multicasting?
16- Turns out that distributed agreement is only
possible under these conditions.
17Two Generals Problem
- Two generals want to attack a city from different sides. They will only succeed if both attack at the same time. They can communicate only by messengers sent by horse.
- How do they reach agreement?
18Lamport's Algorithm
- Assumptions
  - Synchronous
  - Unicast, preserving ordering.
  - Latency is bounded.
- Setup
  - N processes.
  - Each process i will send v_i to the others.
  - Goal is that each process will construct a vector V of length N such that if process i is nonfaulty, V[i] = v_i.
  - Otherwise V[i] is undefined.
  - Assume at most k faulty processes.
19Lamport's Algorithm
- N = 4, k = 1.
- Steps
  - Each process sends its value to the others using reliable unicasting.
  - Results are collected into vectors.
  - Vectors are re-distributed.
  - If any position has a majority, that is the value.
20- Lamport proved that you need 3k+1 processes to tolerate k faulty processes.
- If you may have infinite delays, no agreement is possible.
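The N = 4, k = 1 case can be traced with a small simulation. This is a sketch under stated assumptions, not Lamport's original formulation: process 2 is arbitrarily chosen as the faulty one, the values "a" to "d" and the "lie"/"junk" strings are invented for the example, and `None` stands for an undefined entry.

```python
from collections import Counter

N = 4
FAULTY = {2}  # assumption: process 2 is the single faulty process
values = {0: "a", 1: "b", 2: "c", 3: "d"}

def sent_value(sender, receiver):
    """Step 1: what `sender` claims its value is when talking to `receiver`."""
    if sender in FAULTY:
        return f"lie-{receiver}"  # a faulty process may tell everyone something different
    return values[sender]

# Step 2: each process collects what it was told into a vector.
vectors = {
    p: [values[p] if s == p else sent_value(s, p) for s in range(N)]
    for p in range(N)
}

def sent_vector(sender, receiver):
    """Step 3: the vector `sender` re-distributes to `receiver`."""
    if sender in FAULTY:
        return [f"junk-{receiver}-{i}" for i in range(N)]
    return vectors[sender]

def decide(p):
    """Step 4: per position, keep the value that a majority of received vectors agree on."""
    received = [sent_vector(s, p) for s in range(N) if s != p]
    result = []
    for i in range(N):
        value, count = Counter(vec[i] for vec in received).most_common(1)[0]
        result.append(value if count > len(received) / 2 else None)
    return result

decisions = {p: decide(p) for p in range(N) if p not in FAULTY}
for p, vector in decisions.items():
    print(p, vector)
```

All three nonfaulty processes end up with the same vector, with the faulty process's position left undefined. With only three processes and one faulty, the two honest vectors could not outvote the faulty one, which is the intuition behind the 3k+1 bound.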
21Failure Detection
- How do you detect that a process has failed?
- Two mechanisms
- Ping
- Heartbeat
- Nothing can really be done, except some kind of
timeout.
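A heartbeat-based detector with a timeout might look like the following. It is a minimal sketch: the class name, method names, and the explicit `now` argument (used instead of a real clock so the example is deterministic) are all assumptions for illustration.

```python
class HeartbeatDetector:
    """Suspects a process once no heartbeat has arrived within `timeout` seconds."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # process name -> time of the last heartbeat

    def heartbeat(self, process, now):
        self.last_seen[process] = now

    def suspected(self, now):
        return {p for p, t in self.last_seen.items() if now - t > self.timeout}

detector = HeartbeatDetector(timeout=3.0)
detector.heartbeat("p1", now=0.0)
detector.heartbeat("p2", now=0.0)
detector.heartbeat("p1", now=2.5)  # p2 stays silent
print(detector.suspected(now=4.0))
```

Note that the detector can only suspect a failure. A slow process or a slow network looks exactly like a crash until the next heartbeat arrives.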
22Reliable Client-Server Communication
23- Besides faulty processes, also need to look at communication failures.
- Point-to-point communication
  - TCP hides lost messages.
  - Crashes are not masked, however.
24RPC Failures
- Five different classes of failures.
  - Can't find server.
  - Request message lost.
  - Server crashes after receiving request.
  - Reply message is lost.
  - Client crashes after sending request.
25Can't Find Server
- One possibility is to raise an exception.
- Is RPC still completely transparent to the client?
26Lost Request
- Start a timer. If it expires, send another.
- Or is the server down?
27Server Crashes
- What should the proper way to handle the two crashes (before or after the request is executed) be?
- Can the client tell which one happened?
- Can use different semantics: at least once and at most once.
- No general solution for exactly once. Consider a print server that crashes and comes back up.
- Client sends a message, gets an ack that the message was received.
- Two strategies the server can use: send a completion message either right before or right after printing.
- If there is a crash, the client can: never reissue, always reissue, reissue only if there is no ack, or reissue only if there is an ack.
28- Three events that can happen at the server
- Send the completion message (M).
- Print the text (P).
- Crash (C).
29- These events can occur in six different orderings
- M → P → C: A crash occurs after sending the completion message and printing the text.
- M → C (→ P): A crash happens after sending the completion message, but before the text could be printed.
- P → M → C: A crash occurs after printing the text and sending the completion message.
- P → C (→ M): The text is printed, after which a crash occurs before the completion message could be sent.
- C (→ P → M): A crash happens before the server could do anything.
- C (→ M → P): A crash happens before the server could do anything.
30- Server crashes and comes back up. No combination
is satisfactory.
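The claim that no combination is satisfactory can be checked by enumerating every case. The sketch below makes two assumptions: "acked" means the client saw the completion message M before the crash, and a reissued request is printed exactly once after the server recovers. The names are invented for this example.

```python
# Events that happened before the crash, for each server strategy.
ORDERINGS = {
    "M->P": ["MPC", "MC", "C"],  # send completion message, then print
    "P->M": ["PMC", "PC", "C"],  # print, then send completion message
}

# Client reissue strategies: given whether M was seen, reissue or not?
CLIENT_STRATEGIES = {
    "always": lambda acked: True,
    "never": lambda acked: False,
    "only when acked": lambda acked: acked,
    "only when not acked": lambda acked: not acked,
}

def outcome(events, reissue):
    """Count how often the text ends up printed: ZERO, OK (once), or DUP (twice)."""
    printed_before_crash = 1 if "P" in events else 0
    printed_after_reissue = 1 if reissue("M" in events) else 0
    return {0: "ZERO", 1: "OK", 2: "DUP"}[printed_before_crash + printed_after_reissue]

table = {
    (client, server): [outcome(e, rule) for e in ORDERINGS[server]]
    for client, rule in CLIENT_STRATEGIES.items()
    for server in ORDERINGS
}

for (client, server), row in table.items():
    print(f"{client:20} {server}  {row}")
```

Every row contains at least one ZERO or DUP, so for each pairing of client and server strategy there is some crash timing that breaks exactly-once semantics.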
32Lost Reply Messages
- Start a timer. If no reply, send the request again.
- How well does this work?
- The problem is idempotency.
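One common way to make a retransmitted request safe is to tag each request with an identifier and have the server remember the replies it has already produced. The sketch below is an illustration of that idea only; the `Server` class, the `deposit` operation, and the request ids are hypothetical.

```python
class Server:
    """A deposit is not idempotent, so duplicates are filtered by request id."""

    def __init__(self):
        self.balance = 0
        self.replies = {}  # request id -> reply already sent for that request

    def deposit(self, request_id, amount):
        if request_id in self.replies:       # retransmission: do not execute again
            return self.replies[request_id]
        self.balance += amount
        self.replies[request_id] = self.balance
        return self.balance

server = Server()
first = server.deposit("req-1", 10)
retry = server.deposit("req-1", 10)  # client timed out and resent the same request
print(first, retry, server.balance)
```

The retry returns the cached reply and the balance is changed only once. The cost is that the server must keep the reply table, and must decide when entries can be discarded.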
33Client Crashes
- Can leave dangling resources at the server. Four strategies:
  - Use a client log. If a crash is detected, contact the server to free resources. How well does this work?
    - Very expensive.
    - May have grand-orphans.
    - Network may be partitioned.
  - When the client reboots, it broadcasts, freeing all resources. Disadvantage?
    - May have long-running computations.
  - When a reboot message comes in, contact owners.
  - Use expiration. If a resource has not been freed for a while, free it automatically. Similar to a lease.
34Reliable Group Communication
35- What is reliable multicasting?
- What happens if during communication, someone joins the group?
- What happens if the sending process crashes during the send?
- To cover these, make a distinction between reliable multicasting with process failure and without.
- With faulty processes, multicasting is reliable when it goes to all non-faulty group members. But how to agree on exactly what is the group?
- Simpler if there is agreement on the group.
36- Consider a single sender multicasting. Assume that the underlying system only has unreliable multicast.
- A sender wants to send message number 25.
37Scalability Issues with Reliable Multicasting
- What happens if there are a million receivers?
- How many ACKs?
- How about returning NAKs only?
- What happens if a packet is dropped early?
- How long to buffer a packet using NAKs?
38Nonhierarchical Feedback Control
- Only report NAKs, but multicast them to everyone. How does this resolve things?
- Will everyone multicast the NAK at the same time? How to resolve this?
- A packet was lost early. All receivers schedule a
retransmit, but one will go first.
39- Disadvantages
  - Still hard to make sure that only one NAK is sent.
  - Lots of interruptions. Receivers that got the packet successfully are forced to process useless NAKs. Solution?
    - Could have a separate multicast group for those.
    - But that requires highly efficient and reliable dynamic group management. Could involve just as many messages.
    - Could have ones that tend to miss them join.
- Can have local recovery, to improve efficiency.
40Hierarchical Feedback Control
- Scalability for flat schemes is hard.
- For large groups, need a hierarchy.
41- Receivers are divided into subgroups, based on physical topology.
- Subgroups are organized into a tree, with the subgroup containing the sender at the root. Within a subgroup?
  - Within a subgroup, use a method that works well for small groups.
- Each subgroup has a coordinator with a history buffer. What happens if the coordinator itself misses a packet?
  - If the coordinator misses a packet, it asks the coordinator of the parent subgroup.
- When can a coordinator remove a packet from its history buffer?
[Figure: a tree of LAN subgroups connected over a WAN. The sender (S) is in the root subgroup; each subgroup has a coordinator (C) and receivers (R).]
42Virtual Synchrony (1)
- Figure 8-12. The logical organization of a distributed system to distinguish between message receipt and message delivery.
43Virtual Synchrony (2)
- Figure 8-13. The principle of virtual synchronous
multicast.
44Message Ordering (1)
- Four different orderings are distinguished
- Unordered multicasts
- FIFO-ordered multicasts
- Causally-ordered multicasts
- Totally-ordered multicasts
45Message Ordering (2)
- Figure 8-14. Three communicating processes in the same group. The ordering of events per process is shown along the vertical axis.
46Message Ordering (3)
- Figure 8-15. Four processes in the same group
with two different senders, and a possible
delivery order of messages under FIFO-ordered
multicasting
47Implementing Virtual Synchrony (1)
- Figure 8-16. Six different versions of virtually
synchronous reliable multicasting.
48Implementing Virtual Synchrony (2)
- Figure 8-17. (a) Process 4 notices that process 7 has crashed and sends a view change.
- Figure 8-17. (b) Process 6 sends out all its unstable messages, followed by a flush message.
- Figure 8-17. (c) Process 6 installs the new view when it has received a flush message from everyone else.
49Distributed Commit
50Noncomputer-Based Distributed Systems
- This is the Clayton Tunnel in 1841 in England.
- A two-way tunnel.
- At each entrance is a semaphore system that flips red when a train passes. It must be manually reset to green.
- Before manual reset, the signal man must make sure that the train has exited.
- Only one train allowed per track in the tunnel.
- A telegraph, with a fixed set of 3 messages, was provided.
  - TRAIN-IN-TUNNEL, TUNNEL-IS-CLEAR, HAS-THE-TRAIN-LEFT-THE-TUNNEL?
- In case the semaphore failed, the signal man had red and white flags for manual signalling.
51Noncomputer-Based Distributed Systems
[Figure: the tunnel, with signal man A at one entrance and signal man B at the other]
- Normal
  1. A train enters, flips the semaphore signal red.
  2. Signal man A sends TRAIN-IN-TUNNEL.
  3. When the train exits, opposite signal man B sends TUNNEL-IS-CLEAR.
  4. Signal man A manually resets the signal to green.
- Semaphore failure
  1. A train enters, semaphore fails to flip, alarm rings.
  2. Signal man A sends TRAIN-IN-TUNNEL.
  3. Signal man A then manually raises a red flag.
  4. When the train exits, opposite signal man B sends TUNNEL-IS-CLEAR.
  5. Signal man A changes red flag to white flag.
  - Should 2 and 3 be reversed?
- Weaknesses?
  - What happens if the train has exited by the time the TRAIN-IN-TUNNEL message is sent?
  - How far apart do trains need to be? What happens if they are too close?
52- On August 25th, 1861
- Three trains left Brighton at 8:28, 8:31, and 8:35, due to late running of the first train.
- The first train entered the tunnel, but the semaphore failed to flip to red.
- Signal man A telegraphed TRAIN-IN-TUNNEL.
- He went to manually raise a red flag, but was too slow, due to the trains being too close together.
- The driver of the second train barely caught a glimpse of the red flag as he passed by, but couldn't stop in time and entered the tunnel. He stopped in the middle of the tunnel and began to back up.
- The third train saw the red flag in time, and stopped before entering.
- Signal man A then telegraphed TRAIN-IN-TUNNEL again, to indicate that there were two trains in the tunnel.
- Signal man A then asked, HAS-THE-TRAIN-LEFT-THE-TUNNEL?
- What should signal man B do now?
- Signal man B, after the first train had left, responded TUNNEL-IS-CLEAR, thinking A meant the first train.
- Signal man A thought B meant the second train, and changed the flag to white.
- The third train entered the tunnel.
- 21 people died, 176 were injured. Whose fault was it?
53Distributed Commit
- Given a group of actors, how do you get them to either all agree to do something (commit), or not do it (abort)?
- Suppose you are trying to arrange going to a movie with a group of friends by e-mail. You only want the event to happen if everyone can go. How do you do it?
  1. Send out a group e-mail asking if they can make it.
  2. Wait for responses, sent back just to you.
  3. If anyone says they cannot, then send out an abort message to everyone.
  4. If everyone says they can make it, send out a commit message.
- How can this fail?
  - If communication is unreliable, then problems are numerous. Consider only fail-stop (crash) failures that are recoverable.
  - Suppose one of your friends has e-mail problems before step 1, so they never get the request.
  - Suppose he has e-mail problems after sending back OK?
  - Suppose you have e-mail problems?
  - Suppose one of your friends gets a girlfriend and starts ignoring you?
54One-Phase Commit
- Coordinator decides whether or not to perform (commit) the operation, and tells others.
- What is the obvious problem?
55Two-Phase Commit
- Consists of a coordinator and participants.
- Coordinator multicasts a VOTE_REQUEST message to all participants.
- When a participant receives a VOTE_REQUEST message, it replies (unicast) with either VOTE_COMMIT or VOTE_ABORT.
  - A VOTE_COMMIT response is essentially a contractual guarantee that it will be able to commit.
- Coordinator collects all votes. If all are VOTE_COMMIT, then it multicasts a GLOBAL_COMMIT message. Otherwise, it will multicast a GLOBAL_ABORT message.
- When a participant receives GLOBAL_COMMIT, it locally commits; if it receives GLOBAL_ABORT, it locally aborts.
- Can a process have an error now?
56Two-Phase Commit FSMs
[Figure: finite state machines for the coordinator and for a participant]
- Where does the waiting/blocking occur?
  - Coordinator-WAIT
  - Participant-INIT
  - Participant-READY
57Two-Phase Commit Recovery
[Figure: coordinator and participant finite state machines, with the wait states marked]
- What happens in case of a crash? How do we detect a crash?
  - If timeout in Coordinator-WAIT, then abort.
  - If timeout in Participant-INIT, then abort.
  - If timeout in Participant-READY, then need to find out if globally committed or aborted.
    - Just wait for Coordinator to recover.
    - Check with others.
58Two-Phase Commit Recovery
- If in Participant-READY, and we wish to check with another participant Q:
  - If Q is in COMMIT, then commit. If Q is in ABORT, then abort.
  - If Q is in INIT, then can safely abort.
  - If all are in READY, nothing can be done.
59Two-Phase Commit Recovery
- If in Participant-READY, and we wish to check with another participant Q:
  - If Q is in COMMIT, then commit. If Q is in ABORT, then abort.
  - If Q is in INIT, then can safely abort.
    - Does it make a difference whether we change state, then send, or send, then change state?
  - If all are in READY, nothing can be done.
602PC Coordinator Code
    write START_2PC to local log
    multicast VOTE_REQUEST to all participants
    while not all votes have been collected
        wait for any incoming vote
        if timeout
            write GLOBAL_ABORT to local log
            multicast GLOBAL_ABORT to all participants
            exit
        record vote
    if all participants sent VOTE_COMMIT and coordinator votes COMMIT
        write GLOBAL_COMMIT to local log
        multicast GLOBAL_COMMIT to all participants
    else
        write GLOBAL_ABORT to local log
        multicast GLOBAL_ABORT to all participants
612PC Participant Code
    write INIT to local log
    wait for VOTE_REQUEST from coordinator
    if timeout
        write VOTE_ABORT to local log
        exit
    if participant votes COMMIT
        write VOTE_COMMIT to local log
        send VOTE_COMMIT to coordinator
        wait for DECISION from coordinator
        if timeout
            multicast DECISION_REQUEST to other participants
            wait until DECISION is received    /* Remain blocked */
            write DECISION to local log
        if DECISION == GLOBAL_COMMIT
            write GLOBAL_COMMIT to local log
        else if DECISION == GLOBAL_ABORT
            write GLOBAL_ABORT to local log
    else
        write VOTE_ABORT to local log
        send VOTE_ABORT to coordinator
622PC Decision Request Handler
    while true
        wait until any incoming DECISION_REQUEST is received
        read most recently recorded state from the local log
        if state == GLOBAL_COMMIT
            send GLOBAL_COMMIT to requesting participant
        else if state == INIT or state == GLOBAL_ABORT
            send GLOBAL_ABORT to requesting participant
        else
            skip    /* Participant remains blocked */
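The decision logic of the coordinator and of the decision-request handler can be condensed into a few runnable functions. This is a simplified in-memory sketch, not the protocol as deployed: there is no network or stable log, a timed-out vote is modeled as `None`, and the function names are invented for this example.

```python
def coordinator(votes, coordinator_vote="COMMIT"):
    """votes maps participant -> "VOTE_COMMIT", "VOTE_ABORT", or None (timed out)."""
    log = ["START_2PC"]
    for vote in votes.values():
        if vote is None:  # timeout while collecting votes
            log.append("GLOBAL_ABORT")
            return "GLOBAL_ABORT", log
    if coordinator_vote == "COMMIT" and all(v == "VOTE_COMMIT" for v in votes.values()):
        decision = "GLOBAL_COMMIT"
    else:
        decision = "GLOBAL_ABORT"
    log.append(decision)
    return decision, log

def answer_decision_request(logged_state):
    """What a participant replies to a DECISION_REQUEST, given its last logged state."""
    if logged_state == "GLOBAL_COMMIT":
        return "GLOBAL_COMMIT"
    if logged_state in ("INIT", "GLOBAL_ABORT"):
        return "GLOBAL_ABORT"
    return None  # it is itself waiting for the decision and cannot help

def resolve_when_blocked(peer_states):
    """A participant in READY asks its peers; None means it stays blocked."""
    for state in peer_states:
        answer = answer_decision_request(state)
        if answer is not None:
            return answer
    return None

print(coordinator({"p1": "VOTE_COMMIT", "p2": "VOTE_COMMIT"}))
print(coordinator({"p1": "VOTE_COMMIT", "p2": None}))
print(resolve_when_blocked(["VOTE_COMMIT", "INIT"]))
print(resolve_when_blocked(["VOTE_COMMIT", "VOTE_COMMIT"]))
```

The last call returns `None`: when every reachable peer has also voted to commit and is waiting, nobody knows the outcome, and the participant has to wait for the coordinator to recover.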
63- Can participants always make a decision to commit or abort in the face of failure?
- No, so this is called a blocking commit protocol.
64Three-Phase Commit
- To avoid blocking, there is a three phase commit
protocol.
65Three-Phase Commit
- The states of the coordinator and each participant satisfy the following two conditions:
  1. There is no single state from which it is possible to make a transition directly to either a COMMIT or an ABORT state.
  2. There is no state in which it is not possible to make a final decision, and from which a transition to a COMMIT state can be made.
66- Participant Timeout
  - Init → Abort
  - Precommit → Commit
  - Ready → Abort
- Coordinator Timeout
  - Wait → Abort
  - Precommit → Commit
67Recovery
68Recovery
- Recovery is to replace an erroneous state with an error-free state.
- Two kinds
  - Backward recovery: go back to a known good state (checkpoint).
  - Forward recovery: attempt to go forward to a known good state.
    - Requires knowing in advance which errors might occur.
- Examples of backward recovery? Forward recovery?
  - File system backups
  - Erasure codes
69Checkpointing and Logging
- Suppose you have a set of 20 files, totaling 100 MB. You are constantly modifying these files.
- You want to checkpoint the set of files as a whole. How do you do it?
  - Make a complete copy every time.
  - Suppose you want a checkpoint every hour, but you only change 1K scattered across all 20 files every 10 minutes? How much disk will this take? How much time?
  - Log the changes.
  - Suppose you completely replace every byte in half your files every 10 minutes. How much disk will this take?
- Best is usually some combination.
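The two scenarios can be compared with a little arithmetic. The sketch below assumes that "half your files" amounts to half of the 100 MB, and it ignores any per-entry overhead in the log; the variable names are made up for this example.

```python
KB = 1024
MB = 1024 * KB

total_size = 100 * MB
changes_per_hour = 6  # one batch of changes every 10 minutes

# Strategy 1: a complete copy at every hourly checkpoint.
full_copy_per_hour = total_size

# Strategy 2: log only the changed bytes.
small_change_log_per_hour = changes_per_hour * 1 * KB           # 1K changed per batch
large_change_log_per_hour = changes_per_hour * total_size // 2  # half the data rewritten per batch

print(f"full copy:              {full_copy_per_hour / MB:.0f} MB/hour")
print(f"log, small changes:     {small_change_log_per_hour / KB:.0f} KB/hour")
print(f"log, half rewritten:    {large_change_log_per_hour / MB:.0f} MB/hour")
```

Logging wins by several orders of magnitude when changes are small, and loses by a factor of three when half the data is rewritten every 10 minutes, which is why a combination is usually best.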
70Stable Storage
- Suppose you want to make a disk system that is resilient to non-catastrophic hardware failures, like a sector going bad, or a disk going bad. How would you do it?
- Use two disks. Make a duplicate of each sector on drive 1 on drive 2.
- When a block is written, first update and verify on drive 1, then update and verify on drive 2.
71Stable Storage
- (b) How do we recover from a crash, if the two drives hold different values? If one has a bad checksum?
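A two-drive scheme with per-block checksums can be sketched as follows. This is an illustration under assumptions: drives are modeled as Python lists of blocks, a crash between the two writes is simulated with a flag, and recovery prefers drive 1 when both copies are valid but different, because drive 1 is always written first. All names are hypothetical.

```python
import zlib

def make_block(data: bytes):
    return {"data": data, "crc": zlib.crc32(data)}

def is_valid(block):
    return zlib.crc32(block["data"]) == block["crc"]

def stable_write(drive1, drive2, index, data, crash_after_first=False):
    drive1[index] = make_block(data)  # update drive 1 first
    if crash_after_first:
        return                        # simulated crash before drive 2 is updated
    drive2[index] = make_block(data)

def recover(drive1, drive2):
    for i in range(len(drive1)):
        b1, b2 = drive1[i], drive2[i]
        if is_valid(b1) and not is_valid(b2):
            drive2[i] = dict(b1)      # bad checksum on drive 2: copy from drive 1
        elif is_valid(b2) and not is_valid(b1):
            drive1[i] = dict(b2)      # bad checksum on drive 1: copy from drive 2
        elif is_valid(b1) and b1["data"] != b2["data"]:
            drive2[i] = dict(b1)      # both valid but different: drive 1 is newer

drive1 = [make_block(b"old")]
drive2 = [make_block(b"old")]
stable_write(drive1, drive2, 0, b"new", crash_after_first=True)
recover(drive1, drive2)
print(drive1[0]["data"], drive2[0]["data"])
```

After recovery both drives hold the new value. If instead a block on one drive were corrupted, its checksum would no longer match and the copy on the other drive would be used.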
72Checkpointing
- A checkpoint is a complete record of the state of an application.
- How do you checkpoint a non-distributed application?
  - OS-level checkpointing: Save the complete process state. Transparent to the application.
  - Application-level checkpointing: Application saves its own state. Requires coding this functionality into the application.
73Example E-Mail Client
- Say you have an e-mail client like Outlook or Thunderbird.
- What would the OS need to do to implement OS-level checkpointing? What needs to be saved?
  - Contents of memory, including stack.
  - Window state.
  - Thread state. Which threads have been created with which options.
  - Signal handler state.
  - Open file descriptor state. File offset state.
  - Network connection state?
  - Others?
- What would have to be implemented in app-level checkpointing?
  - What folder is in main window. What message is showing.
  - Composition windows. Contents. Cursor position. Undo buffer.
- Which seems easier? How about a hybrid?
74Distributed Checkpointing
- Assume we have a complex computation running on 10,000 CPUs. Normally, a single CPU system might fail once in 5 years, which might be acceptable.
- Do you think this 10,000 CPU system is going to fail once in 5 years?
- Checkpointing is the answer, but how? The computation works by sending many messages between the various machines.
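A back-of-the-envelope estimate shows why the large system needs checkpointing. The sketch assumes that CPUs fail independently at a constant rate, so the failure rates simply add up; the variable names are made up for this example.

```python
HOURS_PER_YEAR = 365 * 24

single_cpu_mtbf_hours = 5 * HOURS_PER_YEAR  # one failure in 5 years
cpu_count = 10_000

# With independent failures, the system's failure rate is the sum of the
# per-CPU rates, so its mean time between failures shrinks by the CPU count.
system_mtbf_hours = single_cpu_mtbf_hours / cpu_count

print(f"single CPU: one failure every {single_cpu_mtbf_hours} hours")
print(f"10,000 CPUs: one failure every {system_mtbf_hours:.2f} hours")
```

Under these assumptions some CPU fails every few hours, so a computation that runs for days will almost never finish without a way to restart from saved state.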
75Example A Broadcast
- Assume some kind of computational fluid dynamics (CFD) code, computing the flow over a wing. For speed, we run on three processes.
- Periodically, process 1 broadcasts a new velocity to processes 2 and 3.
[Figure: process 1 broadcasting a velocity update to processes 2 and 3]
76Example A Broadcast
Process 1
    checkpoint(n)                     // Ckpt 1-1
    broadcast_new_velocity(vel)
    checkpoint(n)                     // Ckpt 1-2
    wr_log("upd %d", vupd)

Process 2
    checkpoint(n)                     // Ckpt 2-1
    vel = receive_new_velocity()
    checkpoint(n)                     // Ckpt 2-2
    wr_log("upd %d done", vupd)

Process 3
    checkpoint(n)                     // Ckpt 3-1
    vel = receive_new_velocity()
    checkpoint(n)                     // Ckpt 3-2
    wr_log("upd %d done", vupd)

- Each process enters its respective code fragment together (SPMD).
- Assume sending a broadcast is asynchronous.
- Process 1 broadcasts, processes 2 and 3 receive the new value.
- Everyone is happy.
77Example A Crash
Process 1
    checkpoint(n)                     // Ckpt 1-1
    broadcast_new_velocity(vel)
    checkpoint(n)                     // Ckpt 1-2
    wr_log("upd %d", upd)

Process 2
    checkpoint(n)                     // Ckpt 2-1
    vel = receive_new_velocity()
    checkpoint(n)                     // Ckpt 2-2
    wr_log("upd %d", upd)

Process 3
    checkpoint(n)                     // Ckpt 3-1
    vel = receive_new_velocity()
    checkpoint(n)                     // Ckpt 3-2
    wr_log("upd %d", upd)

- Suppose there is a power failure across the whole cluster. We can recover with checkpoints. Which ones do we use?
  - Suppose all logs have "upd 10" as the last entry.
  - Suppose none do, and we just use the last recorded checkpoint?
  - Suppose some have "upd 10", and some don't.
78Consistent Checkpoints
- A consistent collection (recovery line) of checkpoints cannot have a checkpoint from P1 before it has sent a message M, and a checkpoint from P2 after it has received M.
- Recovery to this state would lead to
  - P1 thinks it has not sent M. P2 thinks it has received M.
  - Impossible and inconsistent. There is no moment in time where this could have been the state of the distributed system.
[Figure: timelines for two processes with checkpoints C1-1 through C1-5 and C2-1 through C2-3, messages M1 and M2, and lines labeled R1 through R3]
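The consistency rule can be written as a small check over a set of checkpoints. This is a sketch with assumed data structures: each checkpoint is given as a local time per process, and each message as a (sender, send time, receiver, receive time) tuple in the local times of the two processes. The names are hypothetical.

```python
def is_consistent(checkpoint_time, messages):
    """A recovery line is inconsistent if some message is recorded as
    received in one checkpoint but not yet sent in the sender's checkpoint."""
    for sender, send_time, receiver, recv_time in messages:
        sent_before_checkpoint = send_time <= checkpoint_time[sender]
        received_before_checkpoint = recv_time <= checkpoint_time[receiver]
        if received_before_checkpoint and not sent_before_checkpoint:
            return False
    return True

checkpoints = {"P1": 5, "P2": 5}

# M was sent by P1 after its checkpoint but received by P2 before its checkpoint.
print(is_consistent(checkpoints, [("P1", 7, "P2", 4)]))

# M was sent before P1's checkpoint and received after P2's: merely in transit.
print(is_consistent(checkpoints, [("P1", 3, "P2", 8)]))
```

The first case is the forbidden one. The second is consistent, although the in-transit message still has to be dealt with somehow after a rollback, for example by logging it.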
79Independent Checkpointing
- Suppose we have a set of processes that independently take periodic checkpoints. Can we always find a recovery line?
- Not necessarily. Rollbacks can cascade all the way back to the initial state. This is known as the domino effect.
- There is a technique described in the book on how to find one, if it exists.
80Coordinated Checkpointing
- There are distributed snapshot techniques that can help, but they are complex.
- An alternative is to use a global coordinator.
  - Multicast a CHECKPOINT_REQUEST message.
  - Upon receipt, each process takes a local checkpoint, blocks any new messages the application hands it, and sends an ACK.
  - When the coordinator gets an ACK from all processes, it sends back CHECKPOINT_DONE.
[Figure: coordinator C exchanging CHECKPOINT_REQUEST (CR), ACK, and CHECKPOINT_DONE (CD) messages with processes P1 and P2, along with an application message M1]
81Incremental
- Every time a checkpoint is taken, all processes must write their local state.
- If this is a CFD computation, and each process has a 1000x1000x1000 3D grid, then that is a lot of storage.
- What if not all processes have changed their state?
- One way is for each process to decide. May lead to a lot of network traffic.
- If we only care about the coordinator state changes, then the coordinator can just send a checkpoint request to each process P it has sent a message to since last time.
- Each process P must then cascade this checkpoint request to each process that P has sent a message to since the last checkpoint request.
82Hybrid
- What happens if we only use update logging?
- What happens if we only use checkpointing?
83Message Logging
- Earlier, we mentioned that logging changes can be more efficient.
- Assume that in a distributed system, all state changes are caused by the sending and receiving of messages (piecewise deterministic). Then we can just log messages.
- These messages can then be replayed from the last checkpoint.
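Replay from a checkpoint can be illustrated with a toy process whose state is changed only by the messages it receives. This is a minimal sketch: the `Process` class and its methods are invented for the example, the "state" is just a running sum, and the log lives in memory rather than on stable storage.

```python
def apply_message(state, message):
    """Deterministic state transition: the same state and message give the same result."""
    return state + message

class Process:
    def __init__(self):
        self.state = 0
        self.checkpoint = 0
        self.log = []  # messages received since the last checkpoint

    def receive(self, message):
        self.log.append(message)  # log first, then deliver
        self.state = apply_message(self.state, message)

    def take_checkpoint(self):
        self.checkpoint = self.state
        self.log.clear()

    def recover(self):
        self.state = self.checkpoint
        for message in self.log:  # replay in the original order
            self.state = apply_message(self.state, message)

p = Process()
p.receive(1)
p.take_checkpoint()
p.receive(2)
p.receive(4)
state_before_crash = p.state

p.state = None  # simulated crash: volatile state is lost
p.recover()
print(state_before_crash, p.state)
```

Because every transition is deterministic, replaying the logged messages in order rebuilds exactly the state the process had before the crash.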
84Message Replay
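[Figure: after checkpoint C1, a process receives M1, M2, and M4 and sends M3, passing through states S1, S2, and S3. After recovering from C1, messages M1, M2, and M4 are replayed and the process passes through S1, S2, and S3 again.]
- How do we save the messages?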
85Pessimistic vs. Optimistic Logging
- To replay messages, we need to save them.
- We can save all messages to stable storage before we deliver to the application. Upon crash, just replay since last checkpoint.
- Disadvantages?
- We can save all messages asynchronously. However
- So R must be rolled back also, which can be expensive.
- In the synchronous method, we pay a (smaller) cost up front.
- In the asynchronous method, we pay a (bigger) cost upon failure.
86Recovery Lines Revisited
[Figure: timelines for P1 and P2 with checkpoints C1-1 and C2-1]
- Is the above a valid recovery line? If not, can we use logging to make it valid?