Title: EEC 688/788 Secure and Dependable Computing
1. EEC 688/788 Secure and Dependable Computing
- Lecture 7
- Wenbing Zhao
- Department of Electrical and Computer Engineering
- Cleveland State University
- wenbing_at_ieee.org
2. Outline
- Checkpointing and logging
- Checkpoint-based protocols
- Uncoordinated checkpointing
- Coordinated checkpointing
- Logging-based protocols
- Pessimistic logging
- Optimistic logging
- Causal logging
3. Chandy and Lamport Distributed Snapshot Protocol
- The CL snapshot protocol is a nonblocking protocol
- The TS checkpointing protocol is blocking
- The CL protocol is more desirable for applications that do not wish to suspend normal operation
- However, the CL protocol is concerned only with how to obtain a consistent global checkpoint
- The CL protocol has no coordinator: any node may initiate a global checkpoint
- Data structures
- Marker message: equivalent to the CHECKPOINT message
- Marker certificate: keeps track of whether a marker has been received from every incoming channel
4. CL Distributed Snapshot Protocol
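The marker handling described for the CL protocol can be sketched as a small simulation. This is a minimal illustration, not the lecture's own code: the `Process` and `Network` classes, the use of an integer "token balance" as application state, and the two-process scenario are all assumptions made for the example. Channels are assumed to be FIFO.

```python
from collections import deque

MARKER = "MARKER"  # stands in for the marker message


class Process:
    """One node; its application state is a token balance (an assumption)."""

    def __init__(self, pid, peers, state=0):
        self.pid = pid
        self.peers = peers            # ids of all other processes
        self.state = state
        self.snapshot = None          # local state recorded for the snapshot
        self.channel_state = {}       # sender id -> messages caught in transit
        self.markers_seen = set()     # marker certificate: channels with a marker

    def start_snapshot(self, net):
        """Record local state, then send a marker on every outgoing channel."""
        self.snapshot = self.state
        self.channel_state = {p: [] for p in self.peers}
        for p in self.peers:
            net.send(self.pid, p, MARKER)

    def receive(self, src, msg, net):
        if msg == MARKER:
            if self.snapshot is None:     # first marker: checkpoint now
                self.start_snapshot(net)
            self.markers_seen.add(src)    # stop logging this channel
            return
        self.state += msg                 # normal operation is never suspended
        if self.snapshot is not None and src not in self.markers_seen:
            self.channel_state[src].append(msg)   # message was in transit

    def done(self):
        return self.snapshot is not None and self.markers_seen == set(self.peers)


class Network:
    """One FIFO queue per ordered pair of processes."""

    def __init__(self):
        self.channels = {}

    def send(self, src, dst, msg):
        self.channels.setdefault((src, dst), deque()).append(msg)

    def deliver(self, src, dst, procs):
        msg = self.channels[(src, dst)].popleft()
        procs[dst].receive(src, msg, self)


net = Network()
procs = {0: Process(0, [1], state=10), 1: Process(1, [0], state=20)}
procs[1].state -= 5             # P1 transfers 5 tokens to P0 ...
net.send(1, 0, 5)               # ... which are now in transit
procs[0].start_snapshot(net)    # P0 initiates; any node could
net.deliver(1, 0, procs)        # P0 gets the transfer after its checkpoint
net.deliver(0, 1, procs)        # P1 gets the marker, checkpoints, sends a marker
net.deliver(1, 0, procs)        # P0 gets P1's marker; its snapshot is complete

recorded_total = (procs[0].snapshot + procs[1].snapshot
                  + sum(procs[0].channel_state[1]))
```

In this scenario the recorded local states plus the recorded channel state add up to the 30 tokens that exist in the system, which is what makes the global checkpoint consistent.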
5. Example
- P0: channel state m0 (P1-to-P0 channel)
- P1: channel state m1 (P2-to-P1 channel)
- P2: channel state empty
6. Comparison of TS and CL Protocols
- Similarities
- Both rely on control messages to coordinate checkpointing
- Both capture channel state in virtually the same way
- Start logging channel state upon receiving the first checkpoint message from another channel
- Stop logging a channel's state after receiving the checkpoint message on that incoming channel
- Communication overhead is similar
7. Comparison of TS and CL Protocols
- Differences: strategies in producing a global checkpoint
- The TS protocol suspends normal operation upon the first checkpoint message, while CL does not
- The TS protocol captures channel state prior to taking a checkpoint, while CL captures channel state after taking a checkpoint
- The TS protocol is more complete and robust than CL
- It has a fault-handling mechanism
8. Log-Based Protocols
- Work might be lost upon recovery using checkpoint-based protocols
- By logging messages, we may be able to recover the system to where it was prior to the failure
- System model: the execution of a process is modeled as a set of consecutive state intervals
- Each interval is initiated by a nondeterministic event or by the initial state
- We assume the only type of nondeterministic event is the receiving of a message
9. Log-Based Protocols
- In practice, logging is always used together with checkpointing
- Limits the recovery time: start with the latest checkpoint instead of from the initial state
- Limits the size of the log: after taking a checkpoint, previously logged events can be purged
- Logging protocol types
- Pessimistic logging: messages are logged prior to execution
- Optimistic logging: messages are logged asynchronously
- Causal logging: nondeterministic events that are not yet logged (to stable storage) are piggybacked on each message sent
- For optimistic and causal logging, the dependency of processes has to be tracked => more complexity, longer recovery time
10. Pessimistic Logging
- Synchronously log every incoming message to stable storage prior to execution
- Each process periodically checkpoints its state: no need for coordination
- Recovery: a process restores its state using the last checkpoint and replays all logged incoming messages
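The log-then-execute rule and the local recovery procedure can be sketched as follows. This is only an illustration under stated assumptions: the `PessimisticProcess` class, the file names, and the toy application (a running sum of integer messages) are hypothetical, and a temporary directory stands in for stable storage.

```python
import json
import os
import tempfile


class PessimisticProcess:
    """Sums the integers it receives; logs each one before executing it."""

    def __init__(self, workdir):
        self.log_path = os.path.join(workdir, "msg.log")
        self.ckpt_path = os.path.join(workdir, "checkpoint.json")
        self.total = 0

    def on_message(self, value):
        # Synchronously force the message to stable storage first ...
        with open(self.log_path, "a") as log:
            log.write(json.dumps(value) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self._execute(value)          # ... and only then execute it.

    def _execute(self, value):
        self.total += value

    def checkpoint(self):
        # Local, uncoordinated checkpoint; older log entries can be purged.
        # Simplification: a real system would make these two steps atomic.
        with open(self.ckpt_path, "w") as f:
            json.dump({"total": self.total}, f)
            f.flush()
            os.fsync(f.fileno())
        open(self.log_path, "w").close()

    def recover(self):
        # Restore the last checkpoint, then replay every logged message.
        self.total = 0
        if os.path.exists(self.ckpt_path):
            with open(self.ckpt_path) as f:
                self.total = json.load(f)["total"]
        if os.path.exists(self.log_path):
            with open(self.log_path) as log:
                for line in log:
                    self._execute(json.loads(line))


workdir = tempfile.mkdtemp()
p = PessimisticProcess(workdir)
p.on_message(3)
p.checkpoint()
p.on_message(4)
p.on_message(5)
before_crash = p.total

restarted = PessimisticProcess(workdir)   # simulated crash: volatile state lost
restarted.recover()
```

After `recover()`, the restarted process holds the same total as the original did before the simulated crash, without contacting any other process.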
11. Pessimistic Logging Example
- Pessimistic logging can cope with concurrent
failures and the recovery of two or more processes
12. Benefits of Pessimistic Logging
- Processes do not need to track their dependencies
- The logging mechanism is easy to implement and less error-prone
- Output commit is automatically ensured
- No need to carry out coordinated global checkpointing
- By replaying the logged messages, a process can always bring itself to be consistent with other processes
- Recovery can be done completely locally
- Only impact on other processes: duplicate messages (which can be discarded)
13. Pessimistic Logging Discussion
- Reconnection
- A process must be able to cope with temporary connection failures and be ready to accept reconnections from other processes
- Application logic should be made independent of transport-level events: use an event-based or document-based computing paradigm
- Message duplicate detection
- Messages may be replayed during recovery => duplicate messages
- Transport-level duplicate detection is irrelevant; a mechanism must be added in application-level protocols, e.g., WS-ReliableMessaging
- Atomic message receiving and logging
- A process may fail right after receiving a message, before it has a chance to log it to stable storage
- Need an application-level reliable messaging mechanism
14. Application-Level Reliable Messaging
- The sender buffers each message sent until it receives an application-level ack
- Benefits of application-level reliable messaging
- Atomic message receiving and logging
- Facilitates distributed system recovery from process failures: enables reconnection
- Enables optimization: a received message can be executed immediately, and the logging can be deferred until another message is to be sent
- Logging and message execution can be done concurrently
- If a process sends out a message after receiving several messages, the logging of those messages can be batched
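The buffering and deferred-logging ideas above can be illustrated with a small in-memory sketch. The class names `ReliableSender` and `DeferredLoggingReceiver`, the `before_send` hook, and the use of a Python list as a stand-in for stable storage are assumptions made for this example, not part of any real messaging library.

```python
class ReliableSender:
    """Keeps every message until an application-level ack arrives."""

    def __init__(self):
        self.next_seq = 1
        self.unacked = {}             # seq -> payload

    def send(self, payload):
        seq = self.next_seq
        self.next_seq += 1
        self.unacked[seq] = payload
        return seq, payload

    def on_ack(self, seq):
        self.unacked.pop(seq, None)

    def retransmit(self):
        # What would be resent after a reconnection.
        return sorted(self.unacked.items())


class DeferredLoggingReceiver:
    """Executes at once; logs (in one batch) before sending anything."""

    def __init__(self):
        self.executed = []
        self.pending = []             # executed but not yet logged
        self.stable_log = []          # stand-in for stable storage
        self.flushes = 0

    def on_message(self, seq, payload):
        self.executed.append(payload)         # execute immediately
        self.pending.append((seq, payload))   # logging is deferred

    def before_send(self):
        """Flush the pending batch; return the seqs that may now be acked."""
        if not self.pending:
            return []
        self.stable_log.extend(self.pending)  # one batched write
        self.flushes += 1
        acked = [seq for seq, _ in self.pending]
        self.pending = []
        return acked


sender = ReliableSender()
receiver = DeferredLoggingReceiver()
for payload in ["a", "b", "c"]:
    receiver.on_message(*sender.send(payload))

# If the receiver crashed here, nothing is lost: the sender still holds all three.
resend_if_crashed = sender.retransmit()

for seq in receiver.before_send():    # receiver is about to send a message
    sender.on_ack(seq)
```

Because the ack is issued only after the batch is logged, a message is either in the receiver's log or still in the sender's buffer, which is the atomicity property the slide refers to.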
15. Sender-Based Message Logging
- Basic idea
- Log the message at the sending side in volatile memory
- Should the receiving process fail, it can obtain the messages logged at the sending processes for recovery
- To avoid restarting from the initial state after a failure, a process can periodically checkpoint its local state and write the message log to stable storage (as part of the checkpoint) asynchronously
- Tradeoff
- The relative ordering of messages must be explicitly supplied by the receiver to the sender (quite counter-intuitive!)
- The receiver must wait for an explicit ack of the ordering message before it sends any messages to other processes (however, it can execute the received message immediately, without delay)
- This mechanism prevents the formation of orphan messages and orphan processes
16. Orphan Message and Orphan Process
- An orphan message is one that was sent by a process prior to a failure but cannot be guaranteed to be regenerated upon the recovery of the process
- An orphan process is a process that receives an orphan message
- If a process sends out a message and subsequently fails before the determinants of the messages it has received are properly logged, the message sent becomes an orphan message
17. Sender-Based Message Logging Protocol: Data Structures
- A counter, seq_counter, used to assign a sequence number (using the current value of the counter) to each outgoing message
- Needed for duplicate detection
- A table for duplicate detection
- Each entry has the form <process_id, max_seq>, where max_seq is the maximum sequence number that the current process has received from the process with identifier process_id
- A message is deemed a duplicate if it carries a sequence number lower than or equal to max_seq for the corresponding process
- Another counter, rsn_counter, used to record the receiving/execution order of an incoming message
- The counter is initialized to 0 and incremented by one for each message received
18. Sender-Based Message Logging Protocol: Data Structures
- A message log (in volatile memory) for messages sent by the process. In addition to the message sent, the following metadata is also recorded
- Destination process id, receiver_id
- Sending sequence number, seq
- Receiving sequence number, rsn
- A history list for the messages received since the last checkpoint. It is used to find the receiving order number for a duplicate message
- Upon receiving a duplicate message, the process should supply the corresponding (original) receiving order number so that the sender of the message can log the ordering information properly
- Each entry in the list has the following information
- Sending process id, sender_id
- Sending sequence number, seq
- Receiving sequence number, rsn (assigned by the current process)
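The data structures listed on the two slides above could be laid out as follows. This is a sketch: the field names seq_counter, rsn_counter, max_seq, receiver_id, sender_id, seq, and rsn come from the slides, while the class names (`LogEntry`, `HistoryEntry`, `SbmlState`), the helper methods, the choice to start sequence numbers at 1, and the choice to increment rsn_counter before assigning it are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class LogEntry:
    """One sent message kept in the sender's volatile message log."""
    receiver_id: int
    seq: int
    msg: Any
    rsn: Optional[int] = None     # filled in once the receiver reports it


@dataclass
class HistoryEntry:
    """One message received since the last checkpoint."""
    sender_id: int
    seq: int
    rsn: int


@dataclass
class SbmlState:
    seq_counter: int = 1          # assumption: sequence numbers start at 1
    rsn_counter: int = 0          # initialized to 0, as on the slide
    max_seq: Dict[int, int] = field(default_factory=dict)   # duplicate table
    message_log: List[LogEntry] = field(default_factory=list)
    history: List[HistoryEntry] = field(default_factory=list)

    def next_seq(self) -> int:
        """Assign the current counter value to an outgoing message."""
        seq = self.seq_counter
        self.seq_counter += 1
        return seq

    def is_duplicate(self, sender_id: int, seq: int) -> bool:
        return seq <= self.max_seq.get(sender_id, 0)

    def record_receipt(self, sender_id: int, seq: int) -> int:
        """Assign the next rsn to a new message and remember it."""
        self.rsn_counter += 1
        self.max_seq[sender_id] = max(seq, self.max_seq.get(sender_id, 0))
        self.history.append(HistoryEntry(sender_id, seq, self.rsn_counter))
        return self.rsn_counter


state = SbmlState()
first_seq = state.next_seq()
second_seq = state.next_seq()
seen_before = state.is_duplicate(sender_id=7, seq=1)
first_rsn = state.record_receipt(sender_id=7, seq=1)
seen_after = state.is_duplicate(sender_id=7, seq=1)
```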
19. What Should Be Checkpointed?
- All the data structures described above, except the history list, must be checkpointed together with the process state
- The two counters, one for assigning the message sequence number and the other for assigning the message receiving order, are needed so that the process can continue assigning them upon recovery from the checkpoint
- The table for duplicate detection is needed for a similar reason
- Why must the message log be checkpointed?
- The log is needed for the receiving processes to recover from a failure and hence cannot be garbage collected upon a checkpointing operation
- An additional mechanism is necessary to ensure that the message log does not grow indefinitely
20. Sender-Based Message Logging Protocol: Message Types
- REGULAR: used for sending regular messages generated by the application process; it has the form <REGULAR, seq, rsn, m>
- ORDER: used by the receiving process to notify the sending process of the receiving order of the message. An order message has the form <ORDER, m, rsn>
- Here m is the message identifier, consisting of a tuple <sender_id, receiver_id, seq>
- ACK: used by the sending process (of a regular message) to acknowledge the receipt of the order message. It has the form <ACK, m>
21. Sender-Based Message Logging Protocol: Normal Operation
- The protocol operates in three steps for each message
- A regular message, <REGULAR, seq, rsn, m>, is sent from one process, e.g., Pi, to another process, e.g., Pj
- Process Pj determines the receiving/execution order, rsn, of the regular message and reports this determinant information to Pi in an order message <ORDER, m, rsn>
- Process Pj waits until it has received the corresponding acknowledgment message, <ACK, m>, before it sends out any regular message
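The three steps can be traced with a small in-memory sketch. The message shapes follow the slides; everything else is an assumption made for illustration: the `Proc` class and its method names, passing the sender's id alongside the regular message (a real transport would know it from the channel), leaving the rsn field of a fresh regular message as `None`, and returning `None` when a send has to wait.

```python
class Proc:
    def __init__(self, pid):
        self.pid = pid
        self.seq_counter = 1          # assumption: sequence numbers start at 1
        self.rsn_counter = 0
        self.log = {}                 # (receiver_id, seq) -> {"msg", "rsn"}
        self.awaiting_ack = set()     # ids of messages whose ORDER is unacked

    def send_regular(self, receiver_id, payload):
        if self.awaiting_ack:         # step 3: hold regular messages until acked
            return None
        seq = self.seq_counter
        self.seq_counter += 1
        self.log[(receiver_id, seq)] = {"msg": payload, "rsn": None}
        return ("REGULAR", seq, None, payload)        # step 1

    def on_regular(self, sender_id, msg):
        _, seq, _rsn, _payload = msg
        self.rsn_counter += 1         # the message may be executed right away
        mid = (sender_id, self.pid, seq)
        self.awaiting_ack.add(mid)
        return ("ORDER", mid, self.rsn_counter)       # step 2

    def on_order(self, msg):
        _, mid, rsn = msg
        _sender_id, receiver_id, seq = mid
        self.log[(receiver_id, seq)]["rsn"] = rsn     # sender logs the determinant
        return ("ACK", mid)

    def on_ack(self, msg):
        _, mid = msg
        self.awaiting_ack.discard(mid)


pi, pj = Proc(0), Proc(1)
regular = pi.send_regular(1, "hello")      # Pi -> Pj
order = pj.on_regular(0, regular)          # Pj -> Pi
blocked = pj.send_regular(2, "reply")      # Pj may not send yet
ack = pi.on_order(order)                   # Pi -> Pj
pj.on_ack(ack)
released = pj.send_regular(2, "reply")     # now allowed
```

Holding back Pj's outgoing messages until the ack arrives is what ensures the rsn is safely logged at Pi before any other process can come to depend on Pj's new state.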
23. Sender-Based Message Logging Protocol: Recovery Mechanism
- On recovering from a failure, a process first restores its state using the latest local checkpoint, and then it must broadcast a request to all other processes in the system to retransmit all their logged messages that were sent to the process
- The recovering process also retransmits, from its own restored message log, the regular messages or the ack messages according to the following rule
- If the entry in the log for a message contains no rsn value, then the regular message is retransmitted, because the intended receiving process might not have received this message
- If the entry in the log for a message contains a valid rsn value, then an ack message is sent so that the receiving process can send regular messages
- When a process receives a regular message, it always sends a corresponding order message in response
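Both directions of retransmission during recovery can be sketched as two small functions. The function names, the dictionary layout of a log entry, and the sample data are hypothetical; only the rule itself (no rsn means resend the regular message, valid rsn means send an ack) comes from the slide.

```python
def replay_for(message_log, recovering_id):
    """A surviving process resends what it logged for the recovering process."""
    return [("REGULAR", e["seq"], e["rsn"], e["msg"])
            for e in message_log
            if e["receiver_id"] == recovering_id]


def resend_from_restored_log(pid, message_log):
    """The recovering process walks its own restored (checkpointed) log."""
    out = []
    for e in message_log:
        if e["rsn"] is None:
            # The receiver might never have seen this message: resend it.
            out.append((e["receiver_id"],
                        ("REGULAR", e["seq"], None, e["msg"])))
        else:
            # The order is already known: ack so the receiver can send again.
            mid = (pid, e["receiver_id"], e["seq"])
            out.append((e["receiver_id"], ("ACK", mid)))
    return out


# Hypothetical log restored by recovering process 0.
restored_log = [
    {"receiver_id": 1, "seq": 1, "rsn": 4, "msg": "a"},
    {"receiver_id": 2, "seq": 2, "rsn": None, "msg": "b"},
]
outgoing = resend_from_restored_log(0, restored_log)

# Hypothetical log held by surviving process 1, queried on behalf of process 0.
peer_log = [
    {"receiver_id": 0, "seq": 1, "rsn": 2, "msg": "x"},
    {"receiver_id": 2, "seq": 2, "rsn": 1, "msg": "y"},
    {"receiver_id": 0, "seq": 3, "rsn": None, "msg": "z"},
]
replayed = replay_for(peer_log, recovering_id=0)
```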
24. Actions upon Receiving a Regular Message
- A process always sends a corresponding order message in response
- Three scenarios with recovery
- The message is not a duplicate: the current rsn counter value is assigned to the message and the order message is sent. The process must wait until it receives the ack message before it can send any regular message
- The message is a duplicate, and the corresponding rsn is found in the history list: actions are identical to the above, except that the rsn is not newly assigned
- The message is a duplicate, and no rsn is found in the history list: the process must have checkpointed its state after receiving the message, and the message is no longer needed for recovery. Hence, the order message includes a special constant indicating so. The sender can then purge the message from its log
- The recovering process may receive two types of retransmitted regular messages
- Those with a valid rsn value: the rsn must already be part of the checkpoint. The process executes each message according to that order
- Those without: the process can assign the message any order
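The three scenarios can be sketched as one handler. The `Receiver` class, the value chosen for the special constant (`NO_LONGER_NEEDED = -1`), and the dictionary used for the history list are assumptions for illustration; the slides do not specify them.

```python
NO_LONGER_NEEDED = -1   # hypothetical value for the "special constant"


class Receiver:
    def __init__(self, pid):
        self.pid = pid
        self.rsn_counter = 0
        self.max_seq = {}          # sender_id -> highest seq seen (duplicate table)
        self.history = {}          # (sender_id, seq) -> rsn, since last checkpoint
        self.awaiting_ack = set()

    def on_regular(self, sender_id, seq):
        mid = (sender_id, self.pid, seq)
        if seq > self.max_seq.get(sender_id, 0):
            # Scenario 1: not a duplicate, so assign a fresh rsn.
            self.rsn_counter += 1
            self.max_seq[sender_id] = seq
            self.history[(sender_id, seq)] = self.rsn_counter
            self.awaiting_ack.add(mid)
            return ("ORDER", mid, self.rsn_counter)
        if (sender_id, seq) in self.history:
            # Scenario 2: duplicate with a known rsn; report the original one.
            self.awaiting_ack.add(mid)
            return ("ORDER", mid, self.history[(sender_id, seq)])
        # Scenario 3: duplicate from before the last checkpoint.
        return ("ORDER", mid, NO_LONGER_NEEDED)

    def checkpoint(self):
        # The history list covers only messages since the last checkpoint.
        self.history.clear()


r = Receiver(pid=1)
fresh = r.on_regular(sender_id=0, seq=1)
duplicate = r.on_regular(sender_id=0, seq=1)
r.checkpoint()
stale = r.on_regular(sender_id=0, seq=1)
next_fresh = r.on_regular(sender_id=0, seq=2)
```

On seeing the special constant, the sender would drop the corresponding entry from its message log.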
25. Limitations of Sender-Based Message Logging Protocol
- Won't work in the presence of 2 or more concurrent failures
- Determinants for some regular messages (i.e., rsn) might be lost => orphan processes and cascading rollbacks
- P2 may become an orphan process if P0 and P1 both crash: it has received mt, which no one has sent
26. Truncating the Sender's Message Log
- Once a process completes a local checkpoint, it broadcasts a message containing the highest rsn value for the messages that it has executed prior to the checkpoint
- All messages sent by other processes to this process that were assigned an rsn smaller than or equal to this value can now be purged from their message logs (including those in stable storage as part of a checkpoint)
- Alternatively, this highest rsn value can be piggybacked on each message (regular or control) sent to another process to enable asynchronous purging of the logged messages that are no longer needed
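The purge rule can be expressed as a small filter over the sender's log. The function name `purge_log`, the dictionary layout of a log entry, and the sample data are hypothetical.

```python
def purge_log(message_log, receiver_id, highest_rsn):
    """Keep only entries the receiver might still need for recovery.

    An entry is dropped when it was sent to receiver_id and its rsn is
    known and no greater than the highest rsn covered by that receiver's
    checkpoint. Entries with no rsn yet are always kept.
    """
    return [
        e for e in message_log
        if not (e["receiver_id"] == receiver_id
                and e["rsn"] is not None
                and e["rsn"] <= highest_rsn)
    ]


log = [
    {"receiver_id": 1, "seq": 1, "rsn": 3, "msg": "a"},
    {"receiver_id": 1, "seq": 2, "rsn": 6, "msg": "b"},
    {"receiver_id": 2, "seq": 3, "rsn": 2, "msg": "c"},
    {"receiver_id": 1, "seq": 4, "rsn": None, "msg": "d"},
]

# Process 1 announces that its checkpoint covers everything up to rsn 5.
remaining = purge_log(log, receiver_id=1, highest_rsn=5)
remaining_seqs = [e["seq"] for e in remaining]
```

The same function would serve the piggybacking variant: the sender simply calls it whenever an incoming message carries a newer highest rsn value.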