CoCheck: Checkpointing and Process Migration for MPI

1
CoCheck: Checkpointing and Process Migration for MPI
  • Reviewed by Saswati Swami

2
Outline
  • 1. Background
  • 2. CoCheck
  • -- Definition
  • -- Components
  • -- Basic Concepts and Properties
  • -- Protocol
  • -- tuMPI
  • -- tuMPI Performance
  • -- Limitations
  • -- Comparison with MPICH-V2
  • 3. Work Extension - MPICH-CL
  • 4. References

3
Background: Fault Tolerant MPI
4
Background: Checkpoint
  • Saving the state of a program at a certain
    point so that it can be restarted from that point
    at a later time or on a different machine

5
Background: Coordinated Checkpointing
  • Coordinated Checkpointing
  • - All processes coordinate their checkpoints so
    that the global system state is coherent.
  • - Efficient when fault frequency is low
  • - Negligible overhead on fault-free execution

6
Background: Coordinated Checkpointing (contd.)
7
Background: Determining Global States
8
Background: Chandy-Lamport Algorithm
  • 1. If a receive event is part of a local state
    of a process, the corresponding send event must
    be part of the local state of the sender or has
    not occurred at all.
  • 2. If a send event is part of a local state of a
    process and the matching receive event is not
    part of the local state of the receiver, the
    message must be part of the state of the network.
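
The two conditions above can be checked mechanically for a proposed set of local states. The following is a minimal sketch, not part of the original slides: the `Message` type and the `check_cut` helper are hypothetical names, and each local state is reduced to the set of message ids whose send or receive event it contains.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    msg_id: int
    sender: str
    receiver: str


def check_cut(messages, sent_in_cut, received_in_cut):
    """Check a set of local states against the two conditions.

    sent_in_cut / received_in_cut hold the ids of messages whose send /
    receive event is part of the recorded local states (an assumption of
    this sketch). Returns (is_consistent, in_transit).
    """
    # Condition 1 is violated by a message recorded as received
    # whose send event is not recorded at the sender.
    orphans = [m for m in messages
               if m.msg_id in received_in_cut and m.msg_id not in sent_in_cut]
    # Condition 2: messages sent but not yet received must be saved
    # as part of the state of the network.
    in_transit = [m for m in messages
                  if m.msg_id in sent_in_cut and m.msg_id not in received_in_cut]
    return (not orphans, in_transit)


m1 = Message(1, "A", "B")
m2 = Message(2, "B", "A")
good = check_cut([m1, m2], sent_in_cut={1, 2}, received_in_cut={1})
bad = check_cut([m1, m2], sent_in_cut={1}, received_in_cut={1, 2})
```

In the first call the states are consistent and `m2` has to be stored as a message in transit; in the second call `m2` is recorded as received without a recorded send, so the states do not form a consistent global state.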

9
CoCheck: Definition
  • CoCheck: an environment that provides
    application-transparent checkpointing for
    parallel applications on networks of
    workstations.

Application
CoCheck Overlay Library
Message Passing Environment (MPE) Library
Checkpointing Library
OS Library
Operating System
10
CoCheck: Components
  • CoCheck consists of 3 components:
  • 1. An overlay library for the message passing API
  • 2. A single-process checkpoint library
  • These 2 libraries are linked into every
    application process, forming a service layer.
  • 3. A resource management process which
    coordinates the checkpointing protocol.
  • This process is external to the application and
    runs as part of the scheduling system.

11
CoCheck: Basic Concepts and Properties
  • 1. Virtual address: application address vs.
    current address.
  • 2. Ready Message (RM): special message used to
    clear each channel.
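
The virtual address idea can be illustrated with a small lookup table. This is a sketch under assumptions: the `AddressTable` class, its method names, and the host strings are hypothetical and are not taken from CoCheck itself. The application keeps using the same virtual address while the table behind it changes after a migration.

```python
class AddressTable:
    """Hypothetical mapping from stable virtual addresses to current ones."""

    def __init__(self, mapping=None):
        self._current = dict(mapping or {})

    def resolve(self, virtual_addr):
        """Translate the address the application uses into the current one."""
        return self._current[virtual_addr]

    def apply_new_mapping(self, new_mapping):
        """Install the table distributed after a restart or migration."""
        self._current.update(new_mapping)


table = AddressTable({0: "hostA:5000", 1: "hostB:5000"})
before = table.resolve(1)
table.apply_new_mapping({1: "hostC:6200"})  # process 1 has migrated
after = table.resolve(1)
```

The application-side identifier `1` never changes; only the value returned by `resolve` does.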

12
CoCheck: Protocol
  • Checkpointing
  • 1. central instance: send notification to all
    processes
  • 2. each process: send RM to all other processes
  • 3. each process: receive incoming messages until
    all RMs have been received
  • -- store user messages in the buffer area; count RMs
  • 4. each process: disconnect from parallel
    environment
  • 5. each process: create checkpoint or migrate
  • Restarting
  • 1. central instance: restart all processes
  • 2. each process: reconnect to parallel
    environment
  • 3. each process: send new address to central
    instance
  • 4. central instance: collect new addresses and
    distribute the new mapping table to all processes
  • 5. each process: resume execution (using the new
    addresses and the messages in the buffer area)
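
Steps 2 and 3 of the checkpointing phase can be sketched as a small simulation. This is an illustration under assumptions, not CoCheck code: channels are modeled as FIFO queues (so an RM arrives behind every message sent before it), and the `flush_channels` function and the `READY` sentinel are hypothetical names.

```python
from collections import deque

READY = object()  # stands in for the ready message (RM)


def flush_channels(channels, ranks):
    """Drain every channel up to its RM and buffer the user messages.

    channels maps (src, dst) to a deque of in-flight user messages.
    Returns a dict: rank -> list of (src, message) in the buffer area.
    """
    # Step 2: each process sends an RM to all other processes.
    for src in ranks:
        for dst in ranks:
            if src != dst:
                channels.setdefault((src, dst), deque()).append(READY)

    buffers = {}
    for dst in ranks:
        buffered, rm_count = [], 0
        # Step 3: receive until an RM has arrived from every peer.
        for src in ranks:
            if src == dst:
                continue
            channel = channels[(src, dst)]
            while channel:
                msg = channel.popleft()
                if msg is READY:
                    rm_count += 1
                    break
                buffered.append((src, msg))  # store user message
        if rm_count != len(ranks) - 1:
            raise RuntimeError("missing ready message")
        buffers[dst] = buffered
    return buffers


channels = {(0, 1): deque(["a", "b"]), (2, 0): deque(["c"])}
buffers = flush_channels(channels, ranks=[0, 1, 2])
```

After the call every channel is empty, so each process can disconnect and write its checkpoint; the buffered messages are delivered from the buffer area after restart.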

13
CoCheck: Protocol (contd.)
14
CoCheck: tuMPI
  • Process structure of tuMPI

15
CoCheck: tuMPI
  • Migrating a tuMPI AP with CoCheck

16
CoCheck: tuMPI Performance
  • Migration time of a process
  • t(x): migration time of a process with memory
    size x.
  • t(x) = 1.77 s + x / (763 KB/s)
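
As a worked example, the migration-time model can be evaluated directly. This sketch assumes the slide's formula reads t(x) = 1.77 s + x / (763 KB/s), that is, a fixed overhead plus a transfer term (the plus sign appears to have been lost in extraction); the function and constant names are hypothetical.

```python
FIXED_OVERHEAD_S = 1.77       # constant term from the slide, in seconds
THROUGHPUT_KB_PER_S = 763.0   # transfer rate from the slide, in KB/s


def migration_time_s(memory_kb: float) -> float:
    """Estimated migration time in seconds for a process of memory_kb KB."""
    if memory_kb < 0:
        raise ValueError("memory size must be non-negative")
    return FIXED_OVERHEAD_S + memory_kb / THROUGHPUT_KB_PER_S


small = migration_time_s(763)    # one second of transfer plus overhead
large = migration_time_s(7630)   # ten seconds of transfer plus overhead
```

Under this reading, a process with 7630 KB of memory would take roughly 11.77 s to migrate, and the fixed 1.77 s dominates only for very small processes.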

17
CoCheck: Limitations
  • 1. Requires global synchronization - a checkpoint
    may take a long time to complete because of the
    load on the checkpoint server
  • 2. High cost of fault recovery - in the case of a
    single fault, all processes have to roll back to
    their checkpoints
  • 3. A checkpoint request cannot be processed while
    a send operation is in progress.

18
CoCheck: Comparison with MPICH-V2
  • CoCheck
  • 1. Coordinated Checkpointing
  • 2. Efficient when fault frequency is low
  • 3. High cost of fault recovery
  • 4. Checkpoint scheduler determines when a
    checkpoint should take place and what checkpoint
    server should be used.
  • MPICH-V2
  • 1. Uncoordinated checkpointing, pessimistic
    logging
  • 2. Efficient when fault frequency is high
  • 3. Fault recovery overhead is limited
  • 4. Checkpoint scheduler is not required by
    pessimistic protocol. Minimizes size of
    checkpointed payload by using a best effort
    heuristic.

19
Work Extension: MPICH-CL
  • "Coordinated checkpoint versus message log for
    fault tolerant MPI", Aurelien Bouteiller, Pierre
    Lemarinier, Geraud Krawezik, Franck Cappello

20
Work Extension: MPICH-CL (contd.)
21
References
  • 1. Managing Checkpoints for Parallel Programs,
    Jim Pruyne and Miron Livny
  • 2. www.cs.hku.hk/cluster2003/presentation/technical/4A-1.pdf
  • 3. http://www.lri.fr/gk/MPICH-V/papers/Cluster2003.pdf
  • 4. http://www.globusworld.org/program/slides/7c_3.pdf
  • 5. http://www.cs.utk.edu/~bosilca/classes/gb_lecture7.pdf

22
  • Thank You