Checkpointing and Recovery

Slides: 20

Provided by: s10123

Tags: asynchronous | checkpointing | counter | recovery | synchronous

Transcript and Presenter's Notes

1
Checkpointing and Recovery
2
Purpose

3
Examples
4
What to Save?

5
Stable Storage

Checkpoints must survive failure of processes
(including failure during a disk write)
A simple approach for stable storage
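
The transcript does not preserve the approach this slide describes. As an illustration only, one common way to keep a checkpoint file intact when a crash happens during the write is to write a temporary file, force it to disk, and then atomically rename it over the old checkpoint. The sketch below assumes a local POSIX-like filesystem. The function name `write_checkpoint` is hypothetical and is not taken from the slides.

```python
import os
import tempfile


def write_checkpoint(path, data: bytes) -> None:
    """Replace the checkpoint at `path` so that a crash leaves either
    the old complete file or the new complete file, never a torn one."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to the storage device
        os.replace(tmp_path, path)  # atomic rename over the old checkpoint
    except BaseException:
        # Clean up the partial temporary file; the old checkpoint is untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A failure before `os.replace` leaves the previous checkpoint readable, which is the property the slide asks for. A production version would probably also sync the containing directory and keep more than one copy.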

6
Approaches

7
Asynchronous Checkpointing

[Figure label: Failed process]
8
Other Issues with Asynchronous Checkpointing

9
Asynchronous Checkpointing (Continued)

Identify dependencies between different checkpoint
intervals
This information is stored along with the checkpoints
in stable storage
When a process recovers, it requests this
information from the others to determine whether
a rollback is needed
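
A minimal sketch of how such dependency information might be collected, assuming every message carries the sender's id and current checkpoint interval. The class `Proc`, its fields, and the list standing in for stable storage are all hypothetical. The slides do not specify a data layout.

```python
class Proc:
    """One process that records which remote checkpoint intervals
    its current interval depends on."""

    def __init__(self, pid):
        self.pid = pid
        self.interval = 1   # index of the current checkpoint interval
        self.deps = set()   # (sender, sender_interval) pairs seen in this interval
        self.stable = []    # stand-in for stable storage

    def send(self, payload):
        # Piggyback the sender's id and current interval on the message.
        return (self.pid, self.interval, payload)

    def receive(self, message):
        sender, sender_interval, _payload = message
        # The current interval now depends on the sender's interval.
        self.deps.add((sender, sender_interval))

    def checkpoint(self, state):
        # Store the dependency information together with the checkpoint.
        self.stable.append(
            {"index": self.interval, "state": state, "deps": sorted(self.deps)}
        )
        self.interval += 1
        self.deps = set()


p, q = Proc(0), Proc(1)
q.receive(p.send("hello"))
q.checkpoint({"counter": 1})
print(q.stable[0]["deps"])
```

During recovery, a process could gather the `deps` lists saved by the others and use them to decide which checkpoints can no longer be used.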

10
Two Examples of Asynchronous Checkpointing

11
Algorithm by Bhargava et al

12
Algorithm by Wang et al

Difference
If a message sent from I_{i,x} is received in I_{j,y},
then draw an edge from c_{i,x-1} to c_{j,y}
The recovery line obtained is similar to that of
Bhargava and Lian
Advantage
Number of useful checkpoints is at most N(N+1)/2
This can be shown by counting the checkpoints
that are ahead of the recovery line
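
The edge rule can be tried out with a small sketch. It assumes that interval I_{p,k} lies between checkpoints c_{p,k-1} and c_{p,k}, and that an edge runs from the sender's checkpoint c_{i,x-1} to the receiver's checkpoint c_{j,y}. The functions `build_edges` and `recovery_line` are hypothetical helpers. The loop is a plain fixpoint that rolls receivers back until no edge is violated, not a transcription of the published algorithm.

```python
from collections import defaultdict


def build_edges(messages):
    """messages: iterable of (i, x, j, y), meaning process i sent in
    interval I_{i,x} and process j received in interval I_{j,y}."""
    edges = defaultdict(set)
    for i, x, j, y in messages:
        edges[(i, x - 1)].add((j, y))  # edge c_{i,x-1} -> c_{j,y}
    return edges


def recovery_line(latest, messages):
    """latest: dict mapping each process to the index of its most recent
    usable checkpoint. Returns the checkpoint index each process restarts from."""
    line = dict(latest)
    edges = build_edges(messages)
    changed = True
    while changed:
        changed = False
        for (i, cx), targets in edges.items():
            if line[i] > cx:
                continue  # the send is covered by i's restart checkpoint
            # Process i restarts at or before c_{i,x-1}, so the send is undone
            # and any checkpoint that recorded the receive is unusable.
            for j, y in targets:
                if line[j] >= y:
                    line[j] = y - 1
                    changed = True
    return line


# Two processes with crossing messages: each rollback forces another.
print(recovery_line({0: 2, 1: 2}, [(0, 3, 1, 2), (1, 2, 0, 2)]))
```

In this example P0's restart at c_{0,2} undoes a send that P1 recorded in c_{1,2}. P1 falls back to c_{1,1}, which in turn invalidates c_{0,2}, so both processes end at checkpoint 1.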

13
Coordinated Checkpointing

14
Algorithm by Tamir and Sequin

Blocking checkpoint
A coordinator decides when a checkpoint is taken
Coordinator sends a request message to all processes
Each process:
  Stops executing
  Flushes the channels
  Takes a tentative checkpoint
  Replies to the coordinator
When all processes have replied, the coordinator
asks them to make their tentative checkpoints permanent
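
The steps above can be sketched as a single-threaded simulation. It assumes reliable delivery and no failures during the protocol. The names `Process`, `on_request`, `on_commit`, `coordinated_checkpoint`, and the `flush_channels` callback are hypothetical. Resuming execution after the commit is also an assumption, since this slide does not say when processes resume.

```python
class Process:
    def __init__(self, pid, state):
        self.pid = pid
        self.state = state
        self.running = True
        self.tentative = None
        self.permanent = None

    def on_request(self, flush_channels):
        self.running = False               # stop executing
        flush_channels(self.pid)           # flush the channels
        self.tentative = dict(self.state)  # take a tentative checkpoint
        return self.pid                    # reply to the coordinator

    def on_commit(self):
        # Turn the tentative checkpoint into the permanent one.
        self.permanent, self.tentative = self.tentative, None
        self.running = True                # assumption: resume after commit


def coordinated_checkpoint(processes, flush_channels=lambda pid: None):
    """Coordinator: request, wait for every reply, then commit."""
    replies = {p.on_request(flush_channels) for p in processes}
    if replies != {p.pid for p in processes}:
        return False                       # a reply is missing: do not commit
    for p in processes:
        p.on_commit()
    return True


procs = [Process(0, {"x": 1}), Process(1, {"x": 2})]
print(coordinated_checkpoint(procs), [p.permanent for p in procs])
```

Because every process is stopped between the request and the commit, no message can cross the checkpoint line, which is what makes this a blocking scheme.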

15
Algorithm by Tamir and Sequin