Revisiting failure detectors

Title: Concurrent Reading and Writing using Mobile Agents. Author: Sukumar Ghosh. Last modified by: Sukumar Ghosh. Created Date: 11/1/2002 2:53:35 AM

Learn more at: http://homepage.cs.uiowa.edu

Transcript and Presenter's Notes

1
Revisiting failure detectors

Some of you asked how implementing consensus using S
differs from reaching consensus using P. Here it is.
Recall the definition of the S (strong) failure detector:
strong completeness + weak accuracy.

2
Consensus using S

Program for process p
V_p := (⊥, ⊥, ..., ⊥); V_p[p] := input of p; D_p := V_p
(Phase 1) Same as phase 1 of consensus with P
(Phase 2)
send (V_p, p) to all
receive (D_q, q) from all q, or q is a suspect
k := 1
do k ≤ n →
  if ∃ V_q[k] : V_p[p] ≠ ⊥ ∧ V_q[k] = ⊥ → V_p[k] := D_p[k] := ⊥ fi
  k := k + 1
od
(Phase 3)
Decide on the first element V_p[j] : V_p[j] ≠ ⊥
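
A minimal Python sketch of Phases 2 and 3 may help here. It is an illustration under assumptions, not the slide's program: `phase2`, `phase3` and `BOTTOM` are hypothetical names, `None` stands for the undefined entry, and `received` is assumed to hold only the vectors that arrived from processes that are not suspected. Phase 1 is not shown.

```python
from typing import List, Optional

BOTTOM = None  # assumption: None plays the role of the undefined entry


def phase2(v_p: List[Optional[int]],
           received: List[List[Optional[int]]]) -> List[Optional[int]]:
    """Clear every entry k that some received vector reports as undefined."""
    result = list(v_p)
    for k in range(len(result)):
        if any(v_q[k] is BOTTOM for v_q in received):
            result[k] = BOTTOM
    return result


def phase3(v_p: List[Optional[int]]) -> int:
    """Decide on the first defined entry of the vector."""
    for value in v_p:
        if value is not BOTTOM:
            return value
    raise ValueError("no defined entry to decide on")


v = phase2([10, 20, None, 50], [[10, None, None, 50], [10, 20, None, 50]])
decision = phase3(v)
```

Entry 1 is cleared because one received vector does not know it, so every process that runs the same step ends up keeping only the entries that all of them know.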

3
Example
[Figure: example run with five processes, 0 to 4, showing each process's list of suspects, V after Phase 1, and V after Phase 2. One process has crashed and one is never suspected; the per-process values are not recoverable from this transcript.]

4
Atomic Commit Protocols

In a network of servers, the initiator of a transaction is called the
coordinator, and the remaining servers are participants.
Servers may crash.
[Figure: network of servers S1, S2, S3.]

5
Requirements of Atomic Commit Protocols

In a network of servers where servers may crash:
Termination. All non-faulty servers must
eventually reach an irrevocable decision.
Agreement. If any server decides to commit, then
every server must have voted to commit.
Validity. If all servers vote commit and there is
no failure, then all servers must commit.

6
One-phase Commit

[Figure: client, coordinator server, and participant servers exchanging commit / abort.]
If a participant deadlocks or runs into a problem,
the coordinator may never find out.
This is too simplistic.

7
Two-phase commit (2PC)

Phase 1: The coordinator sends VOTE to the
participants and receives yes / no from them.
Phase 2:
if ∀ server j : vote(j) = yes → multicast COMMIT to all servers
[] ∃ server j : vote(j) = no → multicast ABORT to all servers
fi
What if failures occur?
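
The failure-free case can be sketched in a few lines of Python. This is only an illustration: the `Participant` class, its method names and the in-process calls that stand in for messages are all assumptions, not part of the slides.

```python
class Participant:
    def __init__(self, name, vote):
        self.name = name
        self.vote = vote            # "yes" or "no"
        self.state = "undecided"

    def on_vote_request(self):
        return self.vote

    def on_decision(self, decision):
        self.state = decision


def two_phase_commit(participants):
    # Phase 1: send VOTE to every participant and collect the replies.
    votes = {p.name: p.on_vote_request() for p in participants}
    # Phase 2: COMMIT only if every vote is yes, otherwise ABORT.
    decision = "COMMIT" if all(v == "yes" for v in votes.values()) else "ABORT"
    for p in participants:
        p.on_decision(decision)
    return decision


servers = [Participant("S1", "yes"), Participant("S2", "no")]
outcome = two_phase_commit(servers)
```

A single "no" vote is enough to abort everywhere, which is why a missing vote is usually treated the same way.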

8
Failure scenarios in 2PC

(Phase 1)
Fault: The coordinator did not receive YES / NO,
or a participant did not receive VOTE.
Solution: Broadcast ABORT and
abort local transactions.
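
One way the coordinator could notice a missing vote is a timeout. The slides do not say how the fault is detected, so the sketch below is an assumption throughout: `collect_votes`, the `inbox` queue of `(sender, vote)` pairs and the timeout value are hypothetical.

```python
import queue
import time


def collect_votes(inbox, expected, timeout_s):
    """Wait for one vote per expected participant; abort at the deadline."""
    votes = {}
    deadline = time.monotonic() + timeout_s
    while len(votes) < len(expected):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return "ABORT"          # a vote is missing: treat it as a fault
        try:
            sender, vote = inbox.get(timeout=remaining)
        except queue.Empty:
            return "ABORT"          # nothing arrived before the deadline
        if sender in expected:
            votes[sender] = vote
    return "COMMIT" if all(v == "yes" for v in votes.values()) else "ABORT"


inbox = queue.Queue()
inbox.put(("S2", "yes"))            # S3 never answers
result = collect_votes(inbox, {"S2", "S3"}, 0.05)
```

A participant can apply the same idea in the other direction: if VOTE never arrives, it aborts its local transaction.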

9
Failure scenarios in 2PC

(Phase 2)
(Fault) A participant does not receive a COMMIT
or ABORT message from the coordinator
(the coordinator may have crashed
after sending ABORT or COMMIT to a fraction of the
servers). The participant then remains undecided until the
coordinator is repaired and reinstalled into the
system.
This blocking is a known weakness of 2PC.

10
Coping with blocking in 2PC

A non-faulty participant can ask other participants
which message (COMMIT or ABORT) they received from
the coordinator, and take appropriate action.
But what if no non-faulty participant received anything?
Who knows whether the coordinator committed or aborted the
local transaction before crashing? The participants must continue to wait.
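
The asking step can be sketched as a small function. The name `resolve_by_asking` and the encoding of answers are assumptions made for illustration.

```python
from typing import Iterable, Optional


def resolve_by_asking(peer_answers: Iterable[Optional[str]]) -> Optional[str]:
    """Adopt the coordinator's decision if any peer received it.

    Each answer is "COMMIT", "ABORT", or None (that peer received nothing).
    Returns None when nobody knows, which means: keep waiting.
    """
    for answer in peer_answers:
        if answer in ("COMMIT", "ABORT"):
            return answer
    return None


known = resolve_by_asking([None, "COMMIT", None])
unknown = resolve_by_asking([None, None])
```

The `None` result is the blocking case: no peer can tell what the coordinator decided, so the participant has nothing safe to adopt.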

11
Non-blocking Atomic Commit

A blocking protocol has the potential to prevent
non-faulty participants from reaching a final
decision.
A solution to the atomic commitment problem is
called non-blocking if, in spite of server
crashes, every non-faulty participant eventually
decides.
One solution is to impose the requirement of
uniform agreement.

12
Uniform agreement

If any participant (faulty or not) delivers a message m
(commit or abort), then all correct processes eventually
deliver m.
To implement uniform agreement, no server should
deliver a COMMIT or ABORT message until it has
relayed it to all other servers.
If a process times out in phase 2, then it
decides abort.
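
The relay-before-deliver rule can be sketched as follows. This is a simplified illustration: the `Server` class is hypothetical, direct method calls stand in for messages, and crashes and the phase 2 timeout are not modeled.

```python
class Server:
    def __init__(self, name):
        self.name = name
        self.peers = []        # other servers, filled in by the caller
        self.seen = set()      # decisions this server has already handled
        self.delivered = None  # decision delivered locally, if any

    def receive(self, decision):
        if decision in self.seen:
            return             # already handled; stop the flood
        self.seen.add(decision)
        # Relay to every other server first ...
        for peer in self.peers:
            peer.receive(decision)
        # ... and only then deliver locally.
        self.delivered = decision


s1, s2, s3 = Server("S1"), Server("S2"), Server("S3")
s1.peers, s2.peers, s3.peers = [s2, s3], [s1, s3], [s1, s2]
s1.receive("COMMIT")
```

Because delivery comes last, a server that delivers has already passed the decision to every other server.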

13
Recovery: Stable storage
Creates the illusion of an incorruptible storage,
even if a writer or a disk crashes at any time.
The implementation uses at least two independent
disks.
[Figure: two disks, A0 and A1, with update and inspect operations.]

14
Stable storage

To write, do the following:
copy on disk A0
record timestamp T0
compute checksum S0
copy on disk A1
record timestamp T1
compute checksum S1
Readers check four cases:
Both checksums OK and T1 > T0
Both checksums OK and T1 < T0
Checksum on A1 wrong
Checksum on A0 wrong
(Which copy to accept in each case?)
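
One possible answer to the question, as a sketch rather than the slide's own solution: accept the copy with a valid checksum and the later timestamp. The `Copy` class, the function names and the use of CRC32 as the checksum are assumptions; timestamps are plain integers supplied by the caller.

```python
import zlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class Copy:
    data: bytes = b""
    timestamp: int = 0
    checksum: int = zlib.crc32(b"")

    def valid(self) -> bool:
        return zlib.crc32(self.data) == self.checksum


def write_copy(copy: Copy, data: bytes, timestamp: int) -> None:
    copy.data = data
    copy.timestamp = timestamp
    copy.checksum = zlib.crc32(data)


def stable_write(a0: Copy, a1: Copy, data: bytes, t0: int, t1: int) -> None:
    write_copy(a0, data, t0)   # a crash after this line leaves A1 with the old value
    write_copy(a1, data, t1)


def stable_read(a0: Copy, a1: Copy) -> Optional[bytes]:
    # Ignore any copy whose checksum is wrong, then prefer the later timestamp.
    candidates = [c for c in (a0, a1) if c.valid()]
    if not candidates:
        return None
    return max(candidates, key=lambda c: c.timestamp).data


a0, a1 = Copy(), Copy()
stable_write(a0, a1, b"v1", 1, 2)
first = stable_read(a0, a1)
write_copy(a0, b"v2", 3)       # simulated crash before A1 is updated
second = stable_read(a0, a1)
a0.data = b"garbage"           # simulated corruption of A0
third = stable_read(a0, a1)
```

Under this rule a crash between the two copies yields the newer value from A0, and a corrupted copy is simply skipped in favor of the other one.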

15
Checkpointing

Mechanism for (backward) error recovery.
Transaction states are periodically stored on
stable storage. Following a failure, the
transaction rolls back to the nearest checkpoint.
Checkpointing can be independent (unsynchronized) or coordinated
(synchronized).
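
A single-process sketch of checkpoint and rollback is shown below. `CheckpointedTask` is a hypothetical name, and an in-memory list stands in for stable storage.

```python
import copy


class CheckpointedTask:
    def __init__(self, state):
        self.state = state
        self.checkpoints = []   # assumption: stands in for stable storage

    def checkpoint(self):
        # Save a deep copy so later updates do not change the saved state.
        self.checkpoints.append(copy.deepcopy(self.state))

    def rollback(self):
        """Restore the most recent checkpoint after a failure."""
        if not self.checkpoints:
            raise RuntimeError("no checkpoint to roll back to")
        self.state = copy.deepcopy(self.checkpoints[-1])


task = CheckpointedTask({"balance": 100})
task.checkpoint()
task.state["balance"] = 40      # work done after the checkpoint
task.rollback()                 # failure: go back to the saved state
```

With several processes, each one restoring its own latest checkpoint is only safe if those checkpoints together form a consistent state, which is what coordinated checkpointing provides.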

16
Classification of checkpointing
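Coordinated checkpointing takes a consistent
snapshot, at the cost of some coordination overhead. Uncoordinated
checkpointing apparently has no such overhead, but
the independent checkpoints may not form a consistent global state,
so recovery may have to roll back further (the domino effect).
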
17
Checkpointing (continued)

Some actions can be reversed, but some cannot
(like dispensing cash from an ATM or
printing a document).
Such actions are logged, and during replay, the log entries
substitute for the real actions.
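
The logging idea can be sketched with a cash dispenser. The `Dispenser` class and the use of an action id as the log key are assumptions for illustration.

```python
class Dispenser:
    def __init__(self):
        self.cash_out = 0   # total effect in the real world
        self.log = {}       # action id -> logged result

    def dispense(self, action_id, amount):
        if action_id in self.log:
            # Replay: the action already happened, reuse the logged result.
            return self.log[action_id]
        self.cash_out += amount          # the irreversible action
        self.log[action_id] = amount
        return amount


atm = Dispenser()
atm.dispense("tx1", 50)     # original execution
atm.dispense("tx1", 50)     # replay after a rollback
```

The second call returns the logged amount without dispensing again, so the replayed transaction sees the same result while the cash leaves the machine only once.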

18
Group Communication

Group-oriented activities are steadily increasing.
There are many types of groups:
- Open and closed groups
- Peer-to-peer and hierarchical groups

19
Major issues

Atomic multicast
Ordered multicast
Dynamic groups
Failure handling

20
Atomic multicast

A multicast is called atomic when the message is
delivered to every correct (i.e. functioning)
member, or to no member at all.
Sometimes, certain features available in the
infrastructure of a distributed system simplify
the implementation of multicast. Examples are (1)
multicast on an Ethernet LAN and (2) IP multicast.

21
Basic vs. reliable multicast