Snapshots, checkpoints, rollback, and restart

1
Snapshots, checkpoints, rollback, and restart
  • Larry Rudolph
  • MIT CSAIL
  • January 2005

2
Overview
  • Computers fail
  • When they fail, work gets lost
  • Should save work often to reduce the pain
  • But I back up my disk only when I hear that
    someone else's disk crashed
  • Does this make sense?

3
Checkpoints
  • This talk is concerned with running programs
    that crash (not disks)
  • We use the term checkpoint to mean a process
    that stores sufficient program state to
    nonvolatile storage (disk) so that the program
    can be restarted from that point
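
A minimal sketch of this definition, assuming the program state fits in a picklable Python dict. The function names, the file layout, and the toy workload are hypothetical and not from the talk. The state is written to a temporary file and renamed into place, so a crash during the write leaves the previous checkpoint intact.

```python
import os
import pickle
import tempfile

def save_checkpoint(state, path):
    """Write state to disk atomically: temp file, fsync, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomically replaces any older checkpoint

def load_checkpoint(path, default):
    """Return the last saved state, or default if no checkpoint exists."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default

def run(path, n_steps, crash_at=None):
    """Toy workload (assumption): sum 0..n_steps-1, checkpointing each step."""
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < n_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated crash")
        state["total"] += state["step"]
        state["step"] += 1
        save_checkpoint(state, path)
    return state["total"]
```

Calling run() again with the same path after a crash resumes from the last saved step instead of starting from zero.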

4
Do Checkpoints matter?
  • For short computations, just rerun -- NO
  • For long running computations -- YES
  • An OS runs long but it doesn't matter
  • For parallel programs, it is harder (interesting)
  • Stop all processes and all communication
  • Write all relevant state to disk

5
OS does all the important stuff
  • Checkpointing incurs overhead
  • stops computation; uses network and disk bandwidth
  • System initiated (helps users, system pays)
  • airlines would rather you bought another ticket
    if flight was cancelled
  • Application initiated (looking out for #1)
  • Knows best places, e.g. outer loop

6
Checkpointing Overhead
[Figure: timeline showing work interval I, checkpoint overhead C, and a failure]
  • Assume periodic checkpointing
  • done every I work period
  • the overhead of checkpointing is C
  • When failure happens, redo work since ckpt
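
A rough sketch of this model, under assumptions not stated on the slide: restart is instantaneous, failure times are given up front as wall-clock instants, and a failure during a checkpoint write also loses the interval being saved. The function name and parameters are hypothetical.

```python
def completion_time(work, interval, ckpt_cost, failure_times):
    """Wall-clock time to finish `work` units with periodic checkpoints.

    interval      -- work done between checkpoints (I on the slide)
    ckpt_cost     -- time to write one checkpoint (C on the slide)
    failure_times -- sorted wall-clock instants at which the program fails
    """
    failures = list(failure_times)
    clock = 0.0
    saved = 0.0  # work that is safely on disk
    while saved < work:
        chunk = min(interval, work - saved)
        segment_end = clock + chunk + ckpt_cost
        if failures and failures[0] < segment_end:
            # Failure before the checkpoint completed:
            # everything since the last checkpoint must be redone.
            clock = max(clock, failures.pop(0))
        else:
            clock = segment_end
            saved += chunk
    return clock
```

For example, 100 units of work with I = 10 and C = 1 take 110 time units with no failures. A single failure at time 15 adds only the 4 units spent since the checkpoint at time 11, whereas with I = 100 a failure at time 50 throws away all 50.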

7
Checkpointing Overhead
[Figure: timeline showing work interval I, checkpoint overhead C, and a failure]
  • Don't do checkpoints if C > I
  • Surprisingly, our study shows that this improves
    performance of existing systems!

8
Key Idea: Just say no (if you don't want to do it)
  • Application initiates checkpoints (at best place)
  • It asks the system to actually perform it
  • The system can decide to not do it
  • The system knows more about the overall system
    state than the application

9
Key Idea: Just say no (if you don't want to)

10
Risk-based decision
  • Recall, skip checkpoint if I < C (not worth it)
  • Really want to compare
  • cost of doing ckpt vs. cost of not doing ckpt
  • Let p be probability of a failure
  • skip checkpoint if pI < C

11
Derivation
  • Expected cost of skipping > expected cost of
    performing
  • p(2I + C) + (1-p)(0) > p(I + 2C) + (1-p)C
  • pI > pC + (1-p)C
  • pI > C
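
The derivation can be checked numerically. This is a sketch that takes the cost terms (2I + C, I + 2C, and C) directly from the inequality above. The function names are hypothetical, and the estimate of p is assumed to come from the system.

```python
def expected_cost_skip(p, I, C):
    """Expected cost if the system says no to this checkpoint request."""
    return p * (2 * I + C) + (1 - p) * 0

def expected_cost_perform(p, I, C):
    """Expected cost if the system performs the checkpoint."""
    return p * (I + 2 * C) + (1 - p) * C

def should_checkpoint(p, I, C):
    """Perform the checkpoint only when skipping is expected to cost more."""
    return expected_cost_skip(p, I, C) > expected_cost_perform(p, I, C)

# With I = 100 and C = 5 the threshold is p = C / I = 0.05.
print(should_checkpoint(0.10, 100, 5))  # pI = 10 > 5, so checkpoint
print(should_checkpoint(0.01, 100, 5))  # pI = 1 < 5, so skip
```

The full comparison and the shortcut pI > C give the same answer, so a system that already tracks p only needs I and C to decide.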

12
Failure prediction
  • By examining logs, can predict failures
  • based on temp, power spikes, retries, alarms
  • have already shown success rate of 70%
  • System has good idea of p
  • System has good idea of cost of failure

13
Results are good (trust me)
  • Hard to believe real numbers
  • hard to get real statistics

14
When can I trust you?
  • If I ask someone to deliver a message
  • My wife will; my kids might forget
  • The post office will always do it
  • The internet will only make a best effort (TCP/IP
    drops packets)
  • If I ask you to save my seat, will you?

15
How to do checkpoints?
  • Shared memory supercomputers, no choice
  • On message-passing clusters, can do better

16
Fault Tolerance
  • Example: bank a sends 50 to bank b
  • a must remove 50, b must add 50
  • If a fails after sending, they both have the 50
  • If b fails after receiving, neither has the 50

17
Checkpoint in message-passing systems
  • Each processor or node does a checkpoint of its
    local state
  • The system takes a snapshot: checkpoints plus
    messages not yet part of state

18
Consistent State
  • A system is consistent if
  • all received messages have been sent
  • all sent messages have been received
  • If an agent has rolled back, all msgs sent since
    previous checkpoint are invalid and must be
    removed.
  • If an agent saves its state after receiving a
    msg, the sender must also save its state
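
A small sketch of the two consistency conditions, assuming each checkpoint records the ids of messages sent and received, and that a snapshot may also record messages still in flight. The function names and message ids are hypothetical.

```python
def orphan_messages(sent, received):
    """Messages recorded as received but never recorded as sent."""
    return received - sent

def lost_messages(sent, received, in_channel=frozenset()):
    """Messages recorded as sent but neither received nor saved as in flight."""
    return sent - received - in_channel

def is_consistent(sent, received, in_channel=frozenset()):
    """True when there are no orphan and no lost messages."""
    return (not orphan_messages(sent, received)
            and not lost_messages(sent, received, in_channel))

# Bank example: a's checkpoint predates the transfer, b's follows it.
# The transfer is an orphan, so after restart both banks hold the 50.
print(is_consistent(sent=set(), received={"transfer-50"}))

# b's checkpoint predates the receipt, a's follows the send.
# Without the in-flight message, neither bank holds the 50.
print(is_consistent(sent={"transfer-50"}, received=set()))

# Recording the in-flight message as part of the snapshot fixes it.
print(is_consistent({"transfer-50"}, set(), in_channel={"transfer-50"}))
```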

19
Chandy-Lamport
  • One guy begins a snapshot by sending markers to
    everyone else
  • Any agent getting a marker sends markers to
    everyone else
  • etc ...
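
A toy, single-process simulation of the marker protocol, assuming FIFO channels between every pair of nodes and the standard rule that a node saves its state and forwards markers only on its first marker. The class names and the bank-style balances are hypothetical, chosen so that a consistent snapshot must conserve the total amount of money.

```python
from collections import deque

MARKER = object()  # sentinel that travels on the same channels as payments

class Node:
    def __init__(self, name, balance, peers):
        self.name, self.balance, self.peers = name, balance, peers
        self.saved_balance = None  # local checkpoint, None until taken
        self.recording = {}        # open incoming channels: peer -> amounts
        self.channel_state = {}    # closed incoming channels: peer -> amounts

class Network:
    def __init__(self, balances):
        names = list(balances)
        self.nodes = {n: Node(n, b, [m for m in names if m != n])
                      for n, b in balances.items()}
        # One FIFO channel per ordered pair (src, dst).
        self.channels = {(s, d): deque()
                         for s in names for d in names if s != d}
        self.markers_sent = 0

    def send(self, src, dst, amount):
        self.nodes[src].balance -= amount
        self.channels[(src, dst)].append(amount)

    def start_snapshot(self, name):
        self._record(self.nodes[name])

    def _record(self, node):
        """Save local state, then send a marker on every outgoing channel."""
        node.saved_balance = node.balance
        node.recording = {p: [] for p in node.peers}
        for p in node.peers:
            self.channels[(node.name, p)].append(MARKER)
            self.markers_sent += 1

    def deliver(self, src, dst):
        msg = self.channels[(src, dst)].popleft()
        node = self.nodes[dst]
        if msg is MARKER:
            if node.saved_balance is None:
                self._record(node)  # first marker: checkpoint and forward
            # A marker closes the channel it arrived on.
            node.channel_state[src] = node.recording.pop(src)
        else:
            node.balance += msg
            if src in node.recording:  # sent before src's checkpoint
                node.recording[src].append(msg)

    def drain(self):
        """Deliver everything still in flight, in a fixed order."""
        while any(self.channels.values()):
            for (s, d), q in self.channels.items():
                if q:
                    self.deliver(s, d)

    def snapshot_total(self):
        return sum(n.saved_balance
                   + sum(sum(v) for v in n.channel_state.values())
                   for n in self.nodes.values())

net = Network({"a": 100, "b": 100, "c": 100})
net.send("a", "b", 50)   # payment is in flight
net.start_snapshot("b")  # b checkpoints before the payment arrives
net.deliver("a", "b")    # payment arrives ahead of a's marker
net.drain()
print(net.snapshot_total(), net.markers_sent)
```

In this run the in-flight payment is captured as channel state at b, so the snapshot still accounts for all 300. The marker count grows as n(n-1) for n nodes.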

20
Marker Sent Before Message
  • Ensure consistency during snapshot

22
CL Evaluation
  • Good news: it works
  • Bad news: too many messages

23
Extension
  • Do not snapshot the entire system
  • Snapshot only connected components
  • Let any node initiate a snapshot
  • Let any node initiate a roll back
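
One way to read "connected components" here, as an assumption: two nodes are connected if they have exchanged a message since the last snapshot, so only the initiator's component needs to take part. The sketch below finds that component with a breadth-first search. The function name and the message-log format are hypothetical.

```python
from collections import defaultdict, deque

def component_of(initiator, messages):
    """Nodes reachable from initiator via messages exchanged so far.

    messages -- iterable of (src, dst) pairs logged since the last snapshot
    """
    neighbors = defaultdict(set)
    for src, dst in messages:
        neighbors[src].add(dst)
        neighbors[dst].add(src)
    seen = {initiator}
    frontier = deque([initiator])
    while frontier:
        node = frontier.popleft()
        for other in neighbors[node]:
            if other not in seen:
                seen.add(other)
                frontier.append(other)
    return seen

log = [("a", "b"), ("b", "c"), ("d", "e")]
print(sorted(component_of("a", log)))  # a, b, c take part; d and e do not
```

A node that has not communicated at all forms a component of size one, so its snapshot or rollback involves no one else.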

24
Snapshot per subgroup
  • multiple snapshots per column
  • Agents that fail cause only their column to roll back
  • with lots of failures, there is better progress

25
Snapshot per subgroup
  • When there is a phase change, may need all to
    roll back
  • If this happens rarely, OK
  • Can adapt snapshot time to match this model
  • Take snapshot when connected component size is
    small enough
  • System-wide knowledge

26
Conclusion
  • Failures are part of life
  • Must learn how to mitigate their effect
  • Want to understand differences/similarities
    between parallel programs and pervasive
  • Humans intuitively know some of this but ...