Parallel Checkpointing - PowerPoint PPT Presentation

Provided by: SathishV4

1
Parallel Checkpointing
  • Sources/Credits
  • James S. Plank, "An Overview of Checkpointing
    in Uniprocessor and Distributed Systems, Focusing
    on Implementation and Performance", University
    of Tennessee Technical Report CS-97-372, July
    1997.
  • Georg Stellner, "CoCheck: Checkpointing and
    Process Migration for MPI", IPPS 1996, pp. 526-531.
  • "MPICH-V: Toward a Scalable Fault Tolerant MPI for
    Volatile Nodes" -- George Bosilca, Aurélien
    Bouteiller, Franck Cappello, Samir Djilali,
    Gilles Fédak, Cécile Germain, Thomas Hérault,
    Pierre Lemarinier, Oleg Lodygensky, Frédéric
    Magniette, Vincent Néri, Anton Selikhov --
    SuperComputing 2002, Baltimore USA, November 2002.
  • "MPICH-V2: a Fault Tolerant MPI for Volatile Nodes
    based on the Pessimistic Sender Based Message
    Logging" -- Aurélien Bouteiller, Franck Cappello,
    Thomas Hérault, Géraud Krawezik, Pierre
    Lemarinier, Frédéric Magniette -- To appear in
    SuperComputing 2003, Phoenix USA, November 2003.
  • "A Fault-Tolerant Communication Library for Grid
    Environments", Edgar Gabriel, Graham E. Fagg,
    Antonin Bukovsky, Thara Angskun, and Jack J.
    Dongarra, 17th Annual ACM International
    Conference on Supercomputing (ICS'03),
    International Workshop on Grid Computing and
    e-Science, June 21, 2003, San Francisco.
  • "FT-MPI: Fault Tolerant MPI, supporting dynamic
    applications in a dynamic world", Graham Fagg and
    Jack Dongarra. In J. Dongarra, P. Kacsuk, N.
    Podhorszki (Eds.), Recent Advances in Parallel
    Virtual Machine and Message Passing Interface: 7th
    European PVM/MPI Users' Group Meeting,
    Balatonfüred, Hungary, September 2000. Lecture
    Notes in Computer Science 1908, Springer Verlag,
    Berlin, p. 346 ff.
  • Vadhiyar, S. and Dongarra, J., "SRS - A Framework
    for Developing Malleable and Migratable Parallel
    Applications for Distributed Systems", Parallel
    Processing Letters, Vol. 13, No. 2, pp.
    291-312, June 2003.

2
Introduction
  • Checkpointing: storing an application's state in
    order to resume it later
  • Uses of checkpointing: fault tolerance and
    rollback recovery, process migration and job
    swapping, debugging
  • Types: OS-level (e.g. Sprite), user-level transparent
    (e.g. libckpt), user-level non-transparent (e.g.
    Dome, SRS)

3
Introduction
4
Introduction
  • In a parallel program, each process has events
    and local state
  • An event changes the local state of a process
  • Global state: an external view of the parallel
    application (e.g. lines S, S', S''), used for
    checkpointing and restarting

5
Introduction
  • Types of global states
  • Consistent: a global state from which the program
    can be restarted correctly
  • Inconsistent: otherwise
  • Chandy and Lamport: 2 rules for consistent global
    states
  • 1. If a receive event is part of the local state of
    a process, the corresponding send event must be
    part of the local state of the sender.
  • 2. If a send event is part of the local state of
    a process and the matching receive is not part of
    the local state of the receiver, then the message
    must be part of the state of the network.
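
The two rules above can be checked mechanically. The following is a minimal sketch, not taken from the slides: the `GlobalState` class, the `is_consistent` function and the message-id representation are all hypothetical names chosen for illustration. It assumes each message has a unique id and a single sender and receiver.

```python
from dataclasses import dataclass, field


@dataclass
class GlobalState:
    """Hypothetical record of a global state (illustration only)."""
    # sent[p]: ids of messages whose send event is in p's local state
    sent: dict
    # received[p]: ids of messages whose receive event is in p's local state
    received: dict
    # ids of messages recorded as in transit (state of the network)
    in_network: set = field(default_factory=set)


def is_consistent(state, messages):
    """Apply the two rules; messages maps id -> (sender, receiver)."""
    for msg, (src, dst) in messages.items():
        send_recorded = msg in state.sent.get(src, set())
        recv_recorded = msg in state.received.get(dst, set())
        # Rule 1: a recorded receive needs a recorded send.
        if recv_recorded and not send_recorded:
            return False
        # Rule 2: a sent but not yet received message must be
        # captured as part of the network state.
        if send_recorded and not recv_recorded and msg not in state.in_network:
            return False
    return True
```

For example, with a single message `m1` from `P0` to `P1`, a state that records the receive on `P1` but not the send on `P0` violates rule 1, and a state that records only the send is consistent only if `m1` is also recorded in `in_network`.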

6
Independent checkpoints and domino effect
  • Checkpointing in distributed systems
  • Coordinated checkpointing
  • All processes coordinate to take a consistent
    checkpoint
  • Checkpointing with message logging
  • Each process takes checkpoints independently and
    logs the messages it receives after its last
    checkpoint
  • Recovery restores the previous checkpoint and
    replays the logged messages
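
Checkpointing with message logging can be sketched for a single process as follows. This is an illustrative toy, not the protocol of any system cited in the sources: the `LoggingProcess` class and its methods are hypothetical, in-memory copies stand in for stable storage, and message handling is assumed to be deterministic so that replay reproduces the same state.

```python
import copy


class LoggingProcess:
    """Toy process: independent checkpoints plus a receive log."""

    def __init__(self):
        self.state = {"total": 0, "handled": 0}
        # Stand-ins for stable storage (an assumption of this sketch).
        self._checkpoint = copy.deepcopy(self.state)
        self._log = []

    def _apply(self, value):
        # Deterministic message handler.
        self.state["total"] += value
        self.state["handled"] += 1

    def deliver(self, value):
        # Log the received message before acting on it.
        self._log.append(value)
        self._apply(value)

    def checkpoint(self):
        # Save the local state; earlier messages no longer need replay.
        self._checkpoint = copy.deepcopy(self.state)
        self._log.clear()

    def recover(self):
        # Restore the previous checkpoint, then replay logged messages.
        self.state = copy.deepcopy(self._checkpoint)
        for value in self._log:
            self._apply(value)
```

After `deliver(1)`, `deliver(2)`, `checkpoint()`, `deliver(3)`, a failure that destroys `state` is repaired by `recover()`, which restores the checkpointed state and re-applies the logged message `3`.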

7
Different kinds of checkpointing
  • OS-level checkpointing
  • Checkpointing of process states for homogeneous
    processors
  • Checkpointing at the MPI layer, i.e.
    checkpointing of messages, using a separate MPI
    implementation
  • User-level checkpointing: insertion of
    checkpoint calls in the application
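
User-level checkpointing, where the application itself calls checkpoint routines, might look like the sketch below. It is not the API of SRS, Dome or libckpt: `save_checkpoint`, `load_checkpoint` and `run` are hypothetical helpers, the state is a small dictionary serialized with Python's `pickle`, and the `stop_after` parameter exists only to simulate an interruption.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    # Write to a temporary file, then rename it over the old
    # checkpoint so a crash never leaves a half-written file.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path, default):
    # Resume from the checkpoint if one exists, else start fresh.
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, n_steps, stop_after=None):
    """Sum 0..n_steps-1, checkpointing after every step."""
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < n_steps:
        if stop_after is not None and state["step"] >= stop_after:
            return None  # simulated interruption
        state["total"] += state["step"]
        state["step"] += 1
        save_checkpoint(path, state)  # explicit checkpoint call
    return state["total"]
```

Calling `run(path, 10, stop_after=4)` and then `run(path, 10)` with the same path resumes from step 4 instead of recomputing from the start. Checkpointing after every step keeps the sketch simple; a real application would typically checkpoint less often to limit overhead.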