Fault Tolerance in MPI Programs - PowerPoint PPT Presentation

1
Fault Tolerance in MPI Programs
  • William Gropp, Ewing Lusk
  • Presenter Zhicheng Qiu

2
Contents
  • Declaration
  • Existing FT MPI
  • FT MPI standard
  • Writing (non-transparent) FT programs in MPI
  • Summary and discussion

3
Declaration
  • Fault tolerance is a property of a program, not
    of an API specification or an implementation.
  • Within certain constraints, MPI can provide a
    useful context for writing application programs
    that exhibit significant degrees of fault
    tolerance.

4
Current FT MPI
Classification used in the diagram:
  • Non-automatic vs. automatic
  • Automatic approaches: coordinated-checkpoint based or log based
    (pessimistic log, causal log, optimistic log)
  • Level: framework, API, or communication library
Systems shown:
  • CoCheck: independent of MPI [Ste96]
  • Optimistic recovery in distributed systems: n faults with coherent
    checkpoint [SY85]
  • Manetho: n faults [EZ92]
  • Starfish: enrichment of MPI [AF99]
  • FT-MPI: modification of MPI routines, user fault treatment [FD00]
  • Egida [RAV99]
  • Clip: semi-transparent checkpoint [CLP97]
  • MPI/FT: redundancy of tasks [BNC01]
  • Pruitt 98: 2 faults, sender based [Pru98]
  • MPI-FT: N faults, centralized server [LNLE00]
  • MPICH-V2: N faults, distributed logging [ABFC03]
  • Sender-based message logging: 1 fault, sender based [JZ87]
  • MPICH-CL: N faults ???
5
Fault Tolerance MPI standard
  • FT is a property of an MPI program coupled with
    the MPI implementation.
  • Four levels of survival
  • Automatically recovers (MPICH)
  • Error notification (FT-MPI)
  • Failures can be ignored (manager/worker)
  • Restart from checkpoint (CoCheck etc)

Easy for programmer
6
Fault Tolerance MPI standard
  • The MPI Standard does address fault tolerance
  • It requires implementations to provide reliable communication
  • Built-in or user-defined error handlers
  • Predefined error classes

7
Writing FT App in MPI
  • Basic approach
  • Checkpointing & rollback
  • System directed
  • User directed
  • Redundancy & vote
  • Approach technique
  • MPI
  • Modify / Extend MPI
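
User-directed checkpointing and rollback can be sketched without any MPI calls. The example below is a minimal single-process illustration, not code from the presentation: the file layout, the function names, and the `fail_at` crash-injection parameter are all assumptions made for the demo. The program saves its own state every few steps and, on restart, rolls back to the last saved state.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over path."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if none exists."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "total": 0}


def run(path, n_steps, checkpoint_every, fail_at=None):
    """Sum 0..n_steps-1, checkpointing every `checkpoint_every` steps.

    `fail_at` is a hypothetical knob that injects a crash for the demo.
    """
    state = load_checkpoint(path)  # roll back to the last checkpoint
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("injected failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["total"]


def demo():
    """Crash once at step 57, then restart and finish the computation."""
    path = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")
    try:
        run(path, n_steps=100, checkpoint_every=10, fail_at=57)
    except RuntimeError:
        pass  # the "process" died; work done after step 50 is lost
    resumed_from = load_checkpoint(path)["step"]
    total = run(path, n_steps=100, checkpoint_every=10)
    return resumed_from, total
```

Calling `demo()` loses only the steps done since the last checkpoint. In a real MPI program every rank would have to checkpoint a mutually consistent state, which this sketch does not address.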

8
The checkpoint frequency
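  • $E_T = T\left(1 + \frac{k_0}{t_0} + a\left(k_1 + \frac{t_0}{2}\right)\right)$
  • $0 = \frac{dE_T}{dt_0} = T\left(-\frac{k_0}{t_0^2} + \frac{a}{2}\right)$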

Additional cost
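
The slide's cost model can be evaluated numerically. The sketch below reads it as E_T = T(1 + k0/t0 + a(k1 + t0/2)) and assumes the following meanings, which the slide itself does not spell out: T is the run time without checkpoints, k0 the cost of writing one checkpoint, k1 the cost of restarting from one, a the failure rate per unit time, and t0 the time between checkpoints. Setting the derivative with respect to t0 to zero gives t0 = sqrt(2 k0 / a).

```python
import math


def expected_time(T, k0, k1, a, t0):
    """Expected run time under the cost model, for checkpoint interval t0."""
    return T * (1 + k0 / t0 + a * (k1 + t0 / 2))


def optimal_interval(k0, a):
    """Interval that zeroes dE_T/dt0: -k0/t0**2 + a/2 = 0."""
    return math.sqrt(2 * k0 / a)


if __name__ == "__main__":
    # Illustrative numbers only; they are not taken from the presentation.
    k0, k1, a, T = 2.0, 5.0, 0.01, 1000.0
    best = optimal_interval(k0, a)
    for t0 in (best / 2, best, best * 2):
        print(f"t0={t0:6.1f}  E_T={expected_time(T, k0, k1, a, t0):8.1f}")
```

Note that k1 shifts the expected time but does not affect the optimal interval, because it does not depend on t0.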
9
Use intercommunicators

Manager/worker model
  • Manager(s): centralized or distributed work pool
  • Managers and workers communicate through an intercommunicator
  • Worker processors
  • The intermediate state of the computation is stored on the
    manager side.
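
The reason a worker failure can be ignored in this model is that the work pool and the results live with the manager. The sketch below is a single-process simulation of that idea, not MPI code: workers are plain Python callables, and `WorkerFailure`, `make_worker`, and `manager` are hypothetical names introduced for illustration. In an MPI program the failure would instead surface as an error on the intercommunicator to that worker.

```python
from collections import deque


class WorkerFailure(Exception):
    """Stands in for a lost worker process (hypothetical, for the demo)."""


def make_worker(fail_on=None):
    """Return a worker that squares its input and 'dies' on task `fail_on`."""
    def worker(task):
        if task == fail_on:
            raise WorkerFailure(task)
        return task * task
    return worker


def manager(tasks, workers):
    """Hand out tasks round-robin; requeue a task if its worker fails."""
    pool = deque(tasks)   # the work pool lives at the manager
    results = {}          # intermediate results live at the manager too
    alive = list(workers)
    while pool:
        if not alive:
            raise RuntimeError("all workers failed")
        task = pool.popleft()
        worker = alive.pop(0)
        try:
            results[task] = worker(task)
            alive.append(worker)  # healthy worker goes back in rotation
        except WorkerFailure:
            pool.append(task)     # drop the dead worker, retry the task
    return results, len(alive)


if __name__ == "__main__":
    workers = [make_worker(), make_worker(fail_on=1), make_worker()]
    results, survivors = manager(range(6), workers)
    print(sorted(results.items()), "surviving workers:", survivors)
```

Because no worker holds state that the manager cannot regenerate, losing one worker costs only the task it was running.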
10
Modify/Extend MPI
  • Modify MPI Semantics
  • Relax the constraints of the MPI semantics
  • Provide the programmer with more error information
    and error-handling methods
  • Extending MPI
  • Define extensions to MPI (MPE_XXX)
  • Encapsulate the MPI procedures

11
Summary
  • The MPI Standard provides some support for
    writing fault-tolerant programs.
  • Many approaches can be used to write
    non-transparent fault-tolerant MPI programs.

12
Discussion
  • Are there any new ideas in this paper?
  • What kinds of failures can occur in a parallel
    system? What are the basic approaches to detecting
    and handling those failures?

13
References
  • The Message Passing Interface (MPI) standard
    http://www-unix.mcs.anl.gov/mpi/