Snapshots, checkpoints, rollback, and restart - PowerPoint PPT Presentation


1
Snapshots, checkpoints, rollback, and restart

Larry Rudolph
MIT CSAIL
January 2005

2
Overview

Computers fail
When they fail, work gets lost
Should save work often to reduce the pain
But I back up my disk only when I hear that
someone else's disk crashed
Does this make sense?

3
Checkpoints

This talk is concerned with running programs
crashing (not disks)
We use the term checkpoint to mean
a process that stores sufficient program state to
nonvolatile storage (disk) so that the program
can be restarted from that point
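
A minimal sketch of the definition above, not taken from the talk: the program state is a small dictionary, written to disk with `pickle` after every step and reloaded on restart. The function names (`save_checkpoint`, `load_checkpoint`, `run`), the `crash_at` parameter, and the toy summation loop are all hypothetical and only illustrate the idea.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write state to disk so that a crash mid-write cannot damage the last good checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to nonvolatile storage
    os.replace(tmp_path, path)  # atomically swap in the new checkpoint


def load_checkpoint(path, default):
    """Return the saved state, or default if no checkpoint exists yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


def run(path, n_steps, crash_at=None):
    """Sum 0..n_steps-1, checkpointing after each step; crash_at simulates a failure."""
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < n_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated crash")
        state["total"] += state["step"]
        state["step"] += 1
        save_checkpoint(state, path)
    return state["total"]
```

Calling `run` again with the same path after a crash resumes from the last saved step instead of starting over. Checkpointing after every step is deliberately wasteful here; how often to do it is the question the following slides address.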

4
Do Checkpoints matter?

For short computations, just rerun -- NO
For long running computations -- YES
An OS runs long but it doesn't matter
For parallel programs, it is harder (interesting)
Stop all processes and all communication
Write all relevant state to disk

5
OS does all the important stuff

Checkpointing incurs overhead
stops computation, uses network and disk bandwidth
System initiated (helps users, system pays)
airlines would rather you bought another ticket
if your flight was cancelled
Application initiated (looking out for number one)
Knows best places, e.g. outer loop

6
Checkpointing Overhead
[Figure: timeline labeled Failure, I, C]

Assume periodic checkpointing
done after every work period of length I
the overhead of each checkpoint is C
When a failure happens, redo the work since the last checkpoint
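
The model above can be sketched as a small deterministic simulation. This is an illustration under stated assumptions, not the study's code: the function name `time_to_finish` is hypothetical, restart after a failure is assumed to be instantaneous, a failure during a checkpoint loses that whole interval, and no checkpoint is taken after the final chunk of work.

```python
def time_to_finish(total_work, interval, ckpt_cost, failure_times=()):
    """Wall-clock time to complete total_work with a checkpoint after every `interval` of work.

    failure_times are wall-clock instants at which the program crashes.
    """
    failures = sorted(failure_times)
    clock = 0.0
    saved = 0.0  # work protected by the last completed checkpoint
    while saved < total_work:
        chunk = min(interval, total_work - saved)
        is_last = saved + chunk >= total_work
        cost = chunk + (0.0 if is_last else ckpt_cost)
        if failures and failures[0] < clock + cost:
            # Crash before the chunk (and its checkpoint) completed:
            # everything since the last checkpoint must be redone.
            clock = failures.pop(0)
            continue
        clock += cost
        saved += chunk
    return clock
```

For 10 units of work with I = 5 and C = 1, a single failure at time 8 gives a finish time of 13; with no intermediate checkpoint (I = 10) the same failure gives 18. Without any failure the checkpointed run costs 11 instead of 10, which is the overhead being paid for the protection.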

7
Checkpointing Overhead
[Figure: timeline labeled Failure, I, C]

Don't do checkpoints if C > I
Surprisingly, our study shows that this improves
performance of existing systems!

8
Key Idea: Just say no (if you don't want to do it)

Application initiates checkpoints (at best place)
It asks the system to actually perform it
The system can decide to not do it
The system knows more about the overall system
state than the application
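
One way to picture this split of responsibilities is sketched below. The class `CheckpointService`, its `request` method, and the pluggable `policy` callable are hypothetical names, not an interface from the talk. The example policy is the rule from the earlier slide: do not checkpoint when the cost C exceeds the work I done since the last one.

```python
class CheckpointService:
    """System-side service: applications ask, the system decides."""

    def __init__(self, policy):
        self.policy = policy  # callable: work_since_last -> bool
        self.performed = 0
        self.skipped = 0

    def request(self, work_since_last, save_fn):
        """Called by the application at a good place, e.g. the end of an outer loop."""
        if self.policy(work_since_last):
            save_fn()  # the application knows how to save its own state
            self.performed += 1
            return True
        self.skipped += 1  # the system just said no
        return False


CKPT_COST = 5  # assumed checkpoint overhead C, in the same units as work
service = CheckpointService(policy=lambda work: work >= CKPT_COST)

saved_states = []
first = service.request(2, lambda: saved_states.append("state after 2 units"))
second = service.request(7, lambda: saved_states.append("state after 7 units"))
```

Here the first request is declined and the second is honored. Because the policy is a plain callable, the system can swap in a rule that uses information the application does not have.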

9
Key Idea: Just say no (if you don't want to)

10
Risk-based decision

Recall: skip the checkpoint if I < C (not worth it)
What we really want to compare is
the cost of doing the checkpoint vs. the cost of not doing it
Let p be the probability of a failure
Skip the checkpoint if pI < C

11
Derivation

Perform the checkpoint when
expected cost of skipping > expected cost of performing
p(2I + C) + (1-p)(0) > p(I + 2C) + (1-p)C
pI > pC + (1-p)C
pI > C
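
As a numeric check on the derivation, the two expected costs can be written out directly. This sketch reads the slide's expressions as p(2I + C) for skipping and p(I + 2C) + (1-p)C for performing; the plus signs are an assumption, since the operators are missing in the transcript. The function names are hypothetical.

```python
def expected_cost_skip(p, I, C):
    """Expected cost of skipping the checkpoint: pay 2I + C only if a failure occurs."""
    return p * (2 * I + C) + (1 - p) * 0


def expected_cost_perform(p, I, C):
    """Expected cost of checkpointing: I + 2C on failure, just C otherwise."""
    return p * (I + 2 * C) + (1 - p) * C


def should_checkpoint(p, I, C):
    """Checkpoint exactly when skipping is expected to cost more."""
    return expected_cost_skip(p, I, C) > expected_cost_perform(p, I, C)
```

With I = 10 and C = 2, a failure probability of 0.5 gives pI = 5 > C, so the checkpoint is worth doing; at p = 0.1, pI = 1 < C and it should be skipped. The difference between the two expected costs is always pI - C, which is why the decision collapses to the one-line rule.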

12
Failure prediction

By examining logs, the system can predict failures
based on temperature, power spikes, retries, alarms
have already shown a success rate of 70%
System has good idea of p
System has good idea of cost of failure

13
Results are good (trust me)

Hard to believe real numbers
hard to get real statistics

14
When can I trust you?

If I ask someone to deliver a message
My wife will; my kids might forget
The post office will always do it
The internet will only make a best effort (TCP/IP
drops packets)
If I ask you to save my seat, will you?

15
How to do checkpoints?

Shared memory supercomputers, no choice
On message-passing clusters, can do better

16
Fault Tolerance

Example: bank a sends 50 to bank b
a must remove 50, b must add 50
If a fails after sending, they both have the 50
If b fails after receiving, neither has the 50
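
The two failure cases can be played out in a few lines. This is a toy sketch: the starting balance of 100 per bank, the function name `transfer_with_failure`, and the assumption that each bank's only checkpoint was taken before the transfer are all illustrative choices, not details from the talk.

```python
def transfer_with_failure(failed=None):
    """Bank a sends 50 to bank b; `failed` names the bank that crashes and rolls back."""
    a, b = 100, 100            # assumed starting balances
    ckpt_a, ckpt_b = a, b      # each bank checkpointed before the transfer

    a -= 50                    # a removes 50 and sends it
    b += 50                    # b receives it and adds 50

    if failed == "a":
        a = ckpt_a             # a restarts from its checkpoint: the 50 is back
    elif failed == "b":
        b = ckpt_b             # b restarts from its checkpoint: the 50 is gone
    return a, b


no_failure = transfer_with_failure()
a_failed = transfer_with_failure("a")
b_failed = transfer_with_failure("b")
```

Without a failure the total stays at 200. If a rolls back alone the total becomes 250 (both have the 50); if b rolls back alone it becomes 150 (neither has it). Independent local checkpoints are not enough, which motivates the coordinated snapshot on the next slides.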

17
Checkpoint in message-passing systems

Each processor or node does a checkpoint of its
local state
The system takes a snapshot: checkpoints plus
messages not yet part of state

18
Consistent State

A system is consistent if
all received messages have been sent
all sent messages have been received
If an agent has rolled back, all msgs sent since
previous checkpoint are invalid and must be
removed.
If an agent saves its state after receiving a
msg, the sender must also save its state
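
The two conditions can be checked mechanically if each checkpoint records which messages its process had sent and received. The sketch below assumes that bookkeeping exists and that messages carry unique ids; the names `check_cut` and `is_consistent`, and the `saved_in_transit` parameter for messages stored alongside the checkpoints, are hypothetical.

```python
def check_cut(sent, received):
    """sent / received map a process name to the set of message ids in its checkpoint.

    Returns (orphans, in_transit):
      orphans    - received according to some checkpoint, but sent according to none
      in_transit - sent according to some checkpoint, but not yet received
    """
    all_sent = set().union(*sent.values())
    all_received = set().union(*received.values())
    return all_received - all_sent, all_sent - all_received


def is_consistent(sent, received, saved_in_transit=frozenset()):
    """Consistent if there are no orphans and every in-transit message was saved."""
    orphans, in_transit = check_cut(sent, received)
    return not orphans and in_transit <= set(saved_in_transit)


# b's checkpoint includes receiving m1, but a's checkpoint predates sending it.
bad = is_consistent({"a": set(), "b": set()}, {"a": set(), "b": {"m1"}})
# a sent m1 before its checkpoint, b has not received it, and the snapshot saved it.
good = is_consistent({"a": {"m1"}, "b": set()}, {"a": set(), "b": set()}, {"m1"})
```

The first case is the one the last bullet rules out: the receiver saved its state after getting the message, so the sender must save its state too. The second case shows why a snapshot is more than the local checkpoints.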

19
Chandy-Lamport

One guy begins a snapshot by sending markers to
everyone else
Any agent getting a marker sends markers to
everyone else
etc ...

20
Marker Sent Before Message

Ensure consistency during snapshot
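
A small single-threaded simulation of the marker protocol is sketched below. It assumes FIFO channels between every pair of processes, uses account balances as the local state, and delivers messages by explicit `deliver` calls so the interleaving is visible. The `Process` and `Network` classes are hypothetical scaffolding for illustration, not code from the talk.

```python
from collections import deque

MARKER = "MARKER"


class Process:
    def __init__(self, name, balance):
        self.name = name
        self.balance = balance
        self.recorded_state = None  # local state saved for the snapshot
        self.recording = {}         # sender -> messages seen since our cut
        self.channel_state = {}     # sender -> finalized in-flight messages


class Network:
    def __init__(self, balances):
        self.procs = {n: Process(n, b) for n, b in balances.items()}
        # One FIFO channel per ordered pair (src, dst).
        self.chan = {(s, d): deque() for s in balances for d in balances if s != d}

    def send(self, src, dst, amount):
        self.procs[src].balance -= amount
        self.chan[(src, dst)].append(amount)

    def start_snapshot(self, name):
        p = self.procs[name]
        p.recorded_state = p.balance
        p.recording = {s: [] for s in self.procs if s != name}
        for d in self.procs:
            if d != name:
                self.chan[(name, d)].append(MARKER)  # marker goes out before any later message

    def deliver(self, src, dst):
        msg = self.chan[(src, dst)].popleft()
        p = self.procs[dst]
        if msg == MARKER:
            if p.recorded_state is None:
                self.start_snapshot(dst)  # first marker: save state, send markers on
            # The marker closes the recording for this incoming channel.
            p.channel_state[src] = p.recording.pop(src)
        else:
            p.balance += msg
            if src in p.recording:
                p.recording[src].append(msg)  # was in flight when we cut

    def snapshot_total(self):
        states = sum(p.recorded_state for p in self.procs.values())
        in_flight = sum(sum(m) for p in self.procs.values() for m in p.channel_state.values())
        return states + in_flight


net = Network({"a": 100, "b": 100})
net.send("a", "b", 50)   # 50 is in flight on the channel a -> b
net.start_snapshot("b")  # b saves 100 and sends a marker to a
net.deliver("a", "b")    # the 50 arrives and is recorded as channel state
net.deliver("b", "a")    # a gets the marker, saves 50, sends its own marker
net.deliver("a", "b")    # a's marker closes the channel a -> b
```

In this run the snapshot records a = 50, b = 100, and one in-flight message of 50, so the total is still 200 even though no single instant of the real execution looked like that. Every process sends a marker on each of its outgoing channels, which is where the message cost comes from.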

21
(No Transcript)

22
CL Evaluation

Good news: it works
Bad news: too many messages

23
Extension

Do not snapshot the entire system
Snapshot only connected components
Let any node initiate a snapshot
Let any node initiate a roll back
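
To make the idea concrete, the sketch below finds the group of nodes a snapshot or rollback would have to cover. It assumes that two nodes count as connected when they have exchanged a message since the last snapshot, which is one possible reading of the slide; the function name `connected_component` and the edge-list input are hypothetical.

```python
def connected_component(edges, start):
    """Return the set of nodes reachable from start over undirected communication edges."""
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)

    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


# Pairs of nodes that exchanged messages since the last snapshot (made-up data).
recent_messages = [("a", "b"), ("b", "c"), ("d", "e")]
group_of_a = connected_component(recent_messages, "a")
group_of_d = connected_component(recent_messages, "d")
```

If node a initiates a snapshot or a rollback, only a, b, and c need to take part; d and e are unaffected and keep making progress.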

24
Snapshot per subgroup

multiple snapshots per column
Agents that fail cause only their column to roll back
with lots of failures, there is better progress

25
Snapshot per subgroup

When there is a phase change, may need all to
roll back
If this happens rarely, that is OK
Can adapt snapshot time to match this model
Take snapshot when connected component size is
small enough
System-wide knowledge

26
Conclusion

Failures are part of life
Must learn how to mitigate their effect
Want to understand differences/similarities
between parallel programs and pervasive
Humans intuitively know some of this but ...
