Title: EEC 688/788 Secure and Dependable Computing
1. EEC 688/788 Secure and Dependable Computing
- Lecture 14
- Wenbing Zhao
- Department of Electrical and Computer Engineering
- Cleveland State University
- wenbing_at_ieee.org
2. Outline
- Group communication systems
- Ordered multicast
- Techniques to implement ordered multicast
- Group membership service
- Agreed and safe delivery
- Checkpointing and recovery
- Reference
  - Reliable Distributed Systems, by K. P. Birman, Springer, Chapters 14-16
3. Group Communication System
- Services provided by the GCS
  - Membership service: who is up and who is down
    - Deals with failure detection and more
  - Reliable, ordered multicast service
    - FIFO, causal, total
  - Virtual synchrony service
    - Virtual synchrony synchronizes membership changes with multicasts
- A GCS is often used to build fault-tolerant systems
4. Reliable Multicast
- Reliable multicast: the message is targeted to multiple receivers, and all receivers receive the message reliably
- Positive or negative acknowledgement
- Need to avoid ack/nack implosion
- Distinguish receiving from delivery!
[Figure: a message is received by the middleware and later delivered to the application]
5. Ordered Reliable Multicast
- Ordered reliable multicast: if many messages are multicast by many senders, in what order are the messages delivered at the receivers?
  - First in, first out (FIFO)
  - Causal: the causal relationship among messages is preserved
  - Total: all messages are delivered at all receivers in the same order
6. FIFO Ordered Multicast
- FIFO or sender-ordered multicast
  - Messages are delivered in the order they were sent (by any single sender)
[Figure: time-space diagram of processes p, q, r, s with multicasts a, b, c, d, e]
Delivery of c to p is delayed until after b is delivered.
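The hold-back behaviour described on this slide can be sketched as follows. This is a minimal illustration, not the implementation of any particular GCS: it assumes each sender numbers its multicasts 0, 1, 2, ..., and the class name `FifoReceiver` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical receiver-side FIFO layer (assumption: per-sender sequence numbers start at 0).
class FifoReceiver {
    private final Map<String, Integer> nextSeq = new HashMap<>();           // next expected number per sender
    private final Map<String, Map<Integer, String>> held = new HashMap<>(); // received but not yet delivered
    private final List<String> delivered = new ArrayList<>();               // what the application has seen

    // Called when the middleware RECEIVES a message; DELIVERY may happen later.
    void receive(String sender, int seq, String payload) {
        int expected = nextSeq.getOrDefault(sender, 0);
        if (seq < expected) {
            return; // duplicate of a message that was already delivered
        }
        Map<Integer, String> queue = held.computeIfAbsent(sender, s -> new HashMap<>());
        queue.put(seq, payload);
        // Deliver every consecutive message from this sender that is now available.
        while (queue.containsKey(expected)) {
            delivered.add(queue.remove(expected));
            expected++;
        }
        nextSeq.put(sender, expected);
    }

    List<String> delivered() {
        return delivered;
    }
}
```

If c (number 1) arrives before b (number 0) from the same sender, c stays in the hold-back map until b arrives, and then both are delivered in sending order. Messages from different senders are not ordered relative to each other.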
7. Causally Ordered Multicast
- Causal or happens-before ordering
  - If send(a) → send(b), then deliver(a) occurs before deliver(b) at common destinations
[Figure: time-space diagram of processes p, q, r, s with multicasts a and b]
8. Causally Ordered Multicast
[Figure: time-space diagram of processes p, q, r, s with multicasts a, b, c]
Delivery of c to p is delayed until after b is delivered.
9. Causally Ordered Multicast
[Figure: time-space diagram of processes p, q, r, s with multicasts a, b, c, e]
Delivery of c to p is delayed until after b is delivered.
e is sent (causally) after b.
10. Causally Ordered Multicast
[Figure: time-space diagram of processes p, q, r, s with multicasts a, b, c, d, e]
Delivery of c to p is delayed until after b is delivered.
Delivery of e to r is delayed until after b and c are delivered.
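One common way to enforce the happens-before rule is to stamp each multicast with a vector clock. The slides do not name a mechanism, so the sketch below is an assumption. The names `CausalReceiver` and `Msg` are hypothetical, processes are identified by index, and `clock[k]` is taken to be the number of multicasts from process k that the sender had sent or delivered when it sent the message.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical causal-delivery layer based on vector clocks (an assumed technique).
class CausalReceiver {
    // A multicast stamped with the sender's vector clock.
    static final class Msg {
        final String name;
        final int sender;
        final int[] clock;

        Msg(String name, int sender, int[] clock) {
            this.name = name;
            this.sender = sender;
            this.clock = clock;
        }
    }

    private final int[] deliveredCount;               // deliveredCount[k] = multicasts from k delivered here
    private final List<Msg> pending = new ArrayList<>();
    private final List<String> delivered = new ArrayList<>();

    CausalReceiver(int groupSize) {
        deliveredCount = new int[groupSize];
    }

    // Deliverable when it is the next message from its sender and everything
    // the sender had seen from other processes has already been delivered here.
    private boolean deliverable(Msg m) {
        for (int k = 0; k < deliveredCount.length; k++) {
            if (k == m.sender) {
                if (m.clock[k] != deliveredCount[k] + 1) return false;
            } else if (m.clock[k] > deliveredCount[k]) {
                return false;
            }
        }
        return true;
    }

    void receive(Msg m) {
        pending.add(m);
        boolean progress = true;
        while (progress) {                            // one delivery may unblock others
            progress = false;
            for (Iterator<Msg> it = pending.iterator(); it.hasNext();) {
                Msg next = it.next();
                if (deliverable(next)) {
                    it.remove();
                    deliveredCount[next.sender]++;
                    delivered.add(next.name);
                    progress = true;
                }
            }
        }
    }

    List<String> delivered() {
        return delivered;
    }
}
```

A message that causally follows b carries a clock entry showing that b was already counted, so a receiver that has not yet delivered b keeps the later message pending.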
11. Totally Ordered Multicast
- Total ordering
  - Messages are delivered in the same order to all recipients (including the sender)
[Figure: time-space diagram of processes p, q, r, s with multicasts a, b, c, d, e]
All deliver a, b, c, d, then e.
12. Implementing Total Ordering
- Use a token that moves around
  - The token has a sequence number
  - When you hold the token, you can send the next burst of multicasts
- Use a sequencer to order all multicasts
  - A message is first multicast to all, including the sequencer; the sequencer then determines the order for the message and informs all
  - Or send to the sequencer, and the sequencer multicasts with total order information
  - Each sender can take a turn serving as the sequencer
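The first sequencer variant (multicast to all, then let the sequencer announce the order) can be sketched as follows. This is an illustration under simplifying assumptions: a single fixed sequencer, no failures, and hypothetical names `Sequencer` and `TotalOrderReceiver`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sequencer: hands out consecutive global sequence numbers.
class Sequencer {
    private long next = 0;

    // Returns the position of the given message in the total order.
    long assign(String messageId) {
        return next++;
    }
}

// Hypothetical receiver: delivers a message only when it holds both the
// message body and the sequencer's order announcement for the next slot.
class TotalOrderReceiver {
    private final Map<String, String> payloads = new HashMap<>(); // message id -> body
    private final Map<Long, String> order = new HashMap<>();      // sequence number -> message id
    private final List<String> delivered = new ArrayList<>();
    private long nextToDeliver = 0;

    void onMessage(String id, String payload) {
        payloads.put(id, payload);
        tryDeliver();
    }

    void onOrder(String id, long seq) {
        order.put(seq, id);
        tryDeliver();
    }

    private void tryDeliver() {
        while (true) {
            String id = order.get(nextToDeliver);
            if (id == null || !payloads.containsKey(id)) {
                return; // either the order or the body of the next message is still missing
            }
            delivered.add(payloads.remove(id));
            order.remove(nextToDeliver);
            nextToDeliver++;
        }
    }

    List<String> delivered() {
        return delivered;
    }
}
```

Two receivers that see the message bodies in different network orders still deliver them identically, because delivery follows the sequencer's numbering rather than arrival order.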
13. Group Membership Service
- Input
  - Process join events
  - Process leave events
  - Apparent failures
- Output
  - Membership views for the group(s) to which those processes belong
14. Issues?
- The service itself needs to be fault-tolerant
  - Otherwise our entire system could be crippled by a single failure!
- Hence the Group Membership Service (GMS) must run some form of group membership protocol (GMP)
15. Approach
- Assume that the GMS has members {p,q,r} at time t
- Designate the oldest of these as the protocol leader
- To initiate a change in GMS membership, the leader will run the GMP
- Others can't run the GMP; they report events to the leader
16. GMP Example
[Figure: timelines of processes p, q, r]
- Example
  - Initially, the GMS consists of {p,q,r}
  - Then q is believed to have crashed
17. Unreliable Failure Detection
- Recall that failures are hard to distinguish from network delay
- So we accept the risk of a mistake
- If p is running a protocol to exclude q because "q has failed", all processes that hear from p will cut channels to q
  - Avoids "messages from the dead"
- q must rejoin to participate in the GMS again
18. Basic GMP
- Someone reports that q has failed
- Leader (process p) runs a 2-phase commit protocol
  - Announces a proposed new GMS view
    - Excludes q, or might add some members who are joining, or could do both at once
  - Waits until a majority of members of the current view have voted "ok"
  - Then commits the change
19. GMP Example
[Figure: message exchange among p, q, r: "Proposed V1 = {p,r}", "OK", "Commit V1"; views V0 = {p,q,r} and V1 = {p,r}]
- Proposes new view {p,r} [-q]
- Needs majority consent: p itself, plus one more (current view had 3 members)
- Can add members at the same time
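The leader's commit rule can be written down as a small calculation. The sketch below only models the vote count, not the message exchange, and the names `ViewChange`, `majority`, and `decide` are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper modelling the leader's commit decision in the 2-phase GMP.
class ViewChange {
    // Smallest number of OK votes (leader included) that forms a strict majority.
    static int majority(int currentViewSize) {
        return currentViewSize / 2 + 1;
    }

    // Returns the proposed view if a majority of the CURRENT view voted OK;
    // otherwise returns the current view unchanged (the leader must keep waiting).
    static Set<String> decide(Set<String> currentView, Set<String> proposedView, Set<String> okVotes) {
        Set<String> valid = new LinkedHashSet<>(okVotes);
        valid.retainAll(currentView); // only members of the current view may vote
        return valid.size() >= majority(currentView.size()) ? proposedView : currentView;
    }
}
```

With the current view {p,q,r}, `majority(3)` is 2, so p's own vote plus r's OK is enough to commit {p,r}, while p's vote alone is not.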
20. Special Concerns?
- What if someone doesn't respond?
- p can tolerate failures of a minority of members of the current view
  - New first round overlaps its commit
  - "Commit that q has left. Propose add s and drop r"
- p must wait if it can't contact a majority
  - Avoids risk of partitioning
21. What If Leader Fails?
- Here we do a 3-phase protocol
  - The new leader identifies itself based on age ranking (oldest surviving process)
  - It runs an inquiry phase
    - "The adored leader has died. Did he say anything to you before passing away?"
    - Note that this causes participants to cut connections to the adored previous leader
  - Then it runs the normal 2-phase protocol, but terminates any interrupted view changes the leader had initiated
22. GMP Example
[Figure: message exchange among p, q, r: "Inquire [-p]", "OK: nothing was pending", "Proposed V1 = {r,s}", "OK", "Commit V1"; views V0 = {p,q,r} and V1 = {r,s}]
- New leader first sends an inquiry
- Then proposes new view {r,s} [-p]
- Needs majority consent: q itself, plus one more (current view had 3 members)
- Again, can add members at the same time
23. Safe and Agreed Delivery
- For totally ordered reliable multicast, there are two delivery policies
  - Safe delivery: a message is delivered only when all correct processes have received it
  - Agreed delivery: a message is delivered as long as it is the next message in the total order
24. Safe and Agreed Delivery
- Safe delivery guarantees the uniformity of multicast
  - If a message is delivered to any process, it is delivered to all correct processes
- Agreed delivery does not
  - It is possible that a message is delivered at one (or more) processes but is not delivered at some correct process
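The difference between the two policies comes down to the delivery condition. The sketch below assumes that "all correct processes" is approximated by the members of the current view and that receipt is learned through acknowledgements. Both are assumptions, and the name `DeliveryPolicy` is hypothetical.

```java
import java.util.Set;

// Hypothetical delivery predicates for a totally ordered multicast.
class DeliveryPolicy {
    // Agreed delivery: deliver as soon as the message is next in the total order.
    static boolean agreed(long seq, long nextInOrder) {
        return seq == nextInOrder;
    }

    // Safe delivery: additionally require that every member of the current view
    // (standing in for "all correct processes") is known to have received it.
    static boolean safe(long seq, long nextInOrder, Set<String> view, Set<String> ackedBy) {
        return agreed(seq, nextInOrder) && ackedBy.containsAll(view);
    }
}
```

Safe delivery pays extra latency for collecting the acknowledgements. In exchange, no process delivers a message that some correct process never received.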
25. Checkpointing and Recovery
- Faults occur over time. How do we ensure that a fault-tolerant system remains operational for an extended period of time?
  - Recover failed replicas, or replace failed replicas with new ones => recovery is needed
- How do we recover a failed replica or install a new replica?
  - Checkpoint a correct replica and transfer the state to the recovering replica
26. Checkpointing
- Checkpointing: the act of taking a snapshot of an entity so that we can restore it later
- A replica is a process running in an operating system. The state of a process:
  - The process's memory, stack, and registers
  - Threads
  - Open or mmap'ed files
  - Current working directory
  - Interprocess communication
    - Semaphores, shared memory, pipes, sockets
  - Dynamically loaded libraries
27. Checkpointing
- Many tools are available to perform checkpointing transparently or semi-transparently
  - http://www.checkpointing.org/
  - Condor, libckpt, etc.
- Checkpoints taken in general are not portable
- Checkpoint size might be big
28. Checkpointing of Application State
- Sometimes it is more efficient to save and store the application state only
- Checkpoints can be very portable and compact in size

    class Counter {
        int counter;
        Counter(int initVal) { counter = initVal; }
        void increment() { counter++; }
        void decrement() { counter--; }
        void setState(int c) { counter = c; }  // restore from a checkpoint
        int getState() { return counter; }     // take a checkpoint
    }
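A state transfer that uses only the `getState()`/`setState()` pair from this slide might look like the sketch below. The `Counter` class is restated so that the example is self-contained. The helper `CounterRecovery` is hypothetical, and the checkpoint is handed over in memory rather than sent across a network.

```java
// The Counter class from the slide, restated with explicit syntax.
class Counter {
    private int counter;

    Counter(int initVal) { counter = initVal; }

    void increment() { counter++; }

    void decrement() { counter--; }

    void setState(int c) { counter = c; }   // restore from a checkpoint

    int getState() { return counter; }      // take a checkpoint
}

// Hypothetical helper: copies application state from a correct replica to a recovering one.
class CounterRecovery {
    static void transferState(Counter correct, Counter recovering) {
        int checkpoint = correct.getState(); // the entire checkpoint is a single int
        recovering.setState(checkpoint);
    }
}
```

Because the checkpoint is just the integer value, it is independent of the platform and far smaller than an image of the whole process.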
29. Logging
- Logging of messages
  - Checkpointing in general is expensive
  - Logging of messages is cheaper
  - => We can checkpoint periodically, or checkpoint on demand, and log all messages in between
- Logging of other non-deterministic activities
  - Access order to shared data
30. Roll-Forward Recovery
- With replication in space, it is possible to recover from a fault while the system keeps making progress
- Roll-forward recovery is made possible by
  - Checkpointing of replica state
  - Logging of incoming messages
  - A reliable, totally ordered group communication system
31. Roll-Forward Recovery
- We want to ensure that a newly admitted replica has a state consistent with the others when it starts
- Steps of adding a new replica into a group (with on-demand checkpointing)
  - A recovered (or a new) replica joins a group
  - A join message is multicast in total order
  - On receipt, the join message is put into the incoming message queue, where it waits for processing
  - When the join message is at the head of the queue, a checkpoint is taken and transferred to the new replica
32. Roll-Forward Recovery
- The new replica starts queueing messages after it receives the join message (sent by itself)
- When the checkpoint is received by the new replica, its state is restored using the received checkpoint (the checkpoint is delivered out of order!)
- The queued messages are then delivered in order at the new replica
- Other replicas do not stop and wait for the new replica
- The steps of adding a new replica into a group with periodic checkpointing are similar
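The new replica's side of these steps can be sketched as a small state machine. The sketch assumes the application state is a single integer and every ordinary message is a delta to add. These simplifications and the name `JoiningReplica` are illustrative, not from the slides.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical model of a replica joining a group with on-demand checkpointing.
class JoiningReplica {
    private final Queue<Integer> queued = new ArrayDeque<>(); // messages ordered after the join
    private boolean joinSeen = false;
    private boolean restored = false;
    private int state = 0;

    // The replica's own join message arrives in the total order.
    void onJoinDelivered() {
        joinSeen = true;
    }

    // An ordinary totally ordered message (assumed to be a delta).
    void onMessage(int delta) {
        if (!joinSeen) {
            return;                // ordered before the join: already reflected in the checkpoint
        }
        if (restored) {
            state += delta;        // normal operation after recovery
        } else {
            queued.add(delta);     // hold until the checkpoint arrives
        }
    }

    // The checkpoint taken by an existing replica when the join reached the head of its queue.
    void onCheckpoint(int checkpointState) {
        state = checkpointState;
        restored = true;
        while (!queued.isEmpty()) {
            state += queued.remove(); // replay the queued messages in total order
        }
    }

    int state() {
        return state;
    }
}
```

The checkpoint captures everything ordered before the join, and the queue holds everything ordered after it. Restoring and then replaying therefore brings the new replica to the same state as the others without pausing them.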
33. Steps of Roll-Forward Recovery
37. Roll-Backward Recovery
- Roll-backward recovery is used for systems relying on replication in time for fault tolerance
- When a failure occurs, roll back using the most recent checkpoint (and retry)
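For a single process, "roll back and retry" can be sketched as below. The sketch assumes the fault is transient and shows up as a `RuntimeException`, and the name `RollbackRunner` is hypothetical.

```java
import java.util.function.IntUnaryOperator;

// Hypothetical single-process roll-backward recovery: retry a step from the last checkpoint.
class RollbackRunner {
    static int runWithRetry(int checkpoint, IntUnaryOperator step, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            int state = checkpoint;          // roll back: every attempt starts from the saved state
            try {
                return step.applyAsInt(state);
            } catch (RuntimeException failure) {
                // Assumed transient fault: discard partial work and retry from the checkpoint.
            }
        }
        throw new IllegalStateException("step kept failing after " + maxAttempts + " attempts");
    }
}
```

This is replication in time: the same computation is repeated on the same state, so recovery costs the time of the redone work rather than extra replicas.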
38. Roll-Backward Recovery in a Distributed System
- Performing roll-backward recovery in a distributed system is non-trivial
- Need to solve the distributed snapshot problem
  - It is easy to perform a local checkpoint of a process, but in a distributed system, when one process rolls back, other processes must also roll back to a consistent state
39. Distributed Snapshot Problem
- Goal: determine the global system state
  - e.g., the total amount of money
- Assumptions
  - Each process records its own state
  - No shared clock/memory
- Imagine a group of photographers taking snapshots of different portions of a scene and trying to combine them to get the overall picture
40. Distributed Snapshot
- A distributed snapshot reflects a state in which the distributed system might have been
- What constitutes a consistent global state?
  - If we have recorded that process P has received a message from another process Q, then we should also have recorded that process Q had actually sent the message
  - The reverse condition (Q has sent a message that P has not yet received) is allowed
41. Distributed Snapshot
- A pair of mutually consistent checkpoints
42. Distributed Snapshot
- A missing message
  - => Need to log messages (i.e., consider channel state in addition to process state)
43. Distributed Snapshot
- An orphan message
- The two checkpoints are definitely not consistent
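The three cases (mutually consistent, missing message, orphan message) can be told apart mechanically once we know, for each message, whether its send and its receive fall before the respective checkpoints. The sketch below assumes that information is available. The names `SnapshotCheck` and `MessageRecord` are hypothetical.

```java
import java.util.List;

// Hypothetical consistency check over a pair (or set) of local checkpoints.
class SnapshotCheck {
    // What the checkpoints say about one message.
    static final class MessageRecord {
        final String name;
        final boolean sendRecorded;     // sender's checkpoint was taken after the send
        final boolean receiveRecorded;  // receiver's checkpoint was taken after the receive

        MessageRecord(String name, boolean sendRecorded, boolean receiveRecorded) {
            this.name = name;
            this.sendRecorded = sendRecorded;
            this.receiveRecorded = receiveRecorded;
        }

        // Orphan: recorded as received but never recorded as sent.
        boolean isOrphan() { return receiveRecorded && !sendRecorded; }

        // Missing (in transit): recorded as sent but not yet as received.
        boolean isInTransit() { return sendRecorded && !receiveRecorded; }
    }

    // The checkpoints are consistent exactly when no message is an orphan.
    static boolean isConsistent(List<MessageRecord> messages) {
        return messages.stream().noneMatch(MessageRecord::isOrphan);
    }

    // In-transit messages do not break consistency but must be logged as channel state.
    static long messagesToLog(List<MessageRecord> messages) {
        return messages.stream().filter(MessageRecord::isInTransit).count();
    }
}
```

A missing message leaves the checkpoints consistent as long as the message is logged, whereas a single orphan message makes them unusable for rollback.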
44. Chandy and Lamport's Algorithm
- Assumptions
  - FIFO, unidirectional, reliable channels (a bidirectional channel is modelled as two unidirectional channels)
  - No process fails during the snapshot
- System state consists of process state and channel state (messages sent but not received)
- Any process P can initiate taking a distributed snapshot
45. Chandy and Lamport's Algorithm
- P starts by recording its own local state and sends a marker along each of its outgoing channels
- When Q receives a marker through channel C, its action depends on whether it has already recorded its local state
  - Not yet recorded
    - It records its local state and sends the marker along each of its outgoing channels
    - It starts recording incoming messages on the OTHER channels
  - Already recorded: the marker on C indicates that the channel's state should be recorded
    - All messages received on C before this marker and after Q recorded its own state
46. Chandy and Lamport's Algorithm
- Q is finished when it has received a marker along each of its incoming channels
- The recorded local state, as well as the state it recorded for each incoming channel, can be collected and sent to the process that initiated the snapshot
- The global state can subsequently be constructed
47. Chandy and Lamport's Algorithm
[Figure: process Q with incoming channels C1 and C2 and a marker M]
- Process Q receives a marker for the first time (from C1) and records its local state
- Q records all incoming messages on C2 (and on other incoming channels except C1, if any)
- Q receives a marker on its incoming channel C2 and finishes recording the state of C2
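The marker rules can be exercised in a small single-threaded simulation. The sketch below assumes a fully connected set of processes whose state is an account balance, money transfers as the only application messages, and a marker encoded as -1. The class `SnapshotSim` and all its members are hypothetical, and amounts must be non-negative.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical single-threaded simulation of the marker algorithm.
class SnapshotSim {
    static final int MARKER = -1; // assumed encoding; any other value is a transfer amount

    final int[] balance;
    final Map<String, Queue<Integer>> channel = new HashMap<>();        // "from->to" -> FIFO queue
    final Integer[] recordedBalance;                                    // null until recorded
    final Map<String, List<Integer>> recordedChannel = new HashMap<>(); // recorded channel state
    final Map<String, Boolean> recording = new HashMap<>();             // channels being recorded

    SnapshotSim(int... initialBalances) {
        balance = initialBalances.clone();
        recordedBalance = new Integer[balance.length];
        for (int i = 0; i < balance.length; i++) {
            for (int j = 0; j < balance.length; j++) {
                if (i != j) channel.put(key(i, j), new ArrayDeque<>());
            }
        }
    }

    static String key(int from, int to) { return from + "->" + to; }

    void send(int from, int to, int amount) {
        balance[from] -= amount;
        channel.get(key(from, to)).add(amount);
    }

    void startSnapshot(int initiator) { recordState(initiator); }

    // Record local state, send a marker on every outgoing channel,
    // and start recording every incoming channel.
    private void recordState(int p) {
        recordedBalance[p] = balance[p];
        for (int other = 0; other < balance.length; other++) {
            if (other == p) continue;
            channel.get(key(p, other)).add(MARKER);
            recordedChannel.put(key(other, p), new ArrayList<>());
            recording.put(key(other, p), true);
        }
    }

    // Delivers the message at the head of channel from->to, if there is one.
    void deliver(int from, int to) {
        String c = key(from, to);
        Integer m = channel.get(c).poll();
        if (m == null) return;
        if (m == MARKER) {
            if (recordedBalance[to] == null) recordState(to); // first marker seen by this process
            recording.put(c, false);                          // the state of channel c is complete
        } else {
            balance[to] += m;
            if (recording.getOrDefault(c, false)) recordedChannel.get(c).add(m);
        }
    }

    boolean finished() {
        for (Integer b : recordedBalance) if (b == null) return false;
        return !recording.containsValue(true);
    }

    // Total money in the recorded global state; call only once finished() is true.
    int recordedTotal() {
        int total = 0;
        for (Integer b : recordedBalance) total += b;
        for (List<Integer> msgs : recordedChannel.values()) for (int m : msgs) total += m;
        return total;
    }
}
```

When a process sees its first marker on channel C, the simulation starts recording on all incoming channels and immediately closes C, so C is recorded as empty and only the other channels accumulate messages. Money that was in flight when the snapshot started ends up in a recorded channel state, so the recorded total equals the real total.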