Title: HARNESS and fault tolerant MPI
1- HARNESS and fault tolerant MPI
- Reviewed by Saswati Swami
2- Outline
- 1. Motivation
- 2. FT-MPI definition
- 3. FT-MPI semantics
- 4. FT-MPI implementation
- 5. Applications
- 6. Limitations
- 7. References
3- Motivation
- As application and machine sizes grow, the MTBF becomes less than the application run time.
- Current MPI implementations either abort everything or use checkpointing to roll back, which is expensive.
- All communication is via a communicator. The MPI standard is based on a static model, so any decrease in tasks leads to a corrupted communicator.
- Goal: develop an MPI plug-in that takes advantage of HARNESS robustness to offer a range of recovery alternatives to an MPI application. Not just another MPI implementation.
4- Motivation
- Harness gives us the basic functionality of starting tasks, basic communication between them, attribute storage, and indication of errors and failures.
(Diagram: the application sits on the Harness run-time over a basic TCP/IP link; the run-time talks to the HARNESS daemon via pipes / sockets over TCP/IP.)
5- FT-MPI definition
- 1. FT-MPI is a fault tolerant MPI system developed under the DOE HARNESS project.
- 2. FT-MPI extends MPI and allows applications to decide what to do when an error occurs:
  - restarting a failed node
  - continuing with a smaller number of nodes
- 3. Under FT-MPI, when a member of a communicator dies:
  - the communicator state changes to indicate a problem
  - message transfers can continue if safe, or be stopped / ignored
  - to continue, the user's application can fix the communicators or abort.
6- FT-MPI semantics
- FT-MPI
  - 1. Communicator states: FT_OK, FT_DETECTED, FT_RECOVER, FT_RECOVERED, FT_FAILED
  - 2. Process states: OK, UNAVAILABLE, JOINING, FAILED
- MPI
  - 1. Communicator states: VALID, INVALID
  - 2. Process states: OK, FAILED
7- FT-MPI semantics: communicator failure handling
- Communicators are invalidated when a failure is detected.
- The underlying system sends a state update to all processes for that communicator.
- For a communication error, not all communicators are updated.
- For a process exit, all communicators that included this process are changed.
- System behavior depends on the communicator failure mode chosen.
  - Modes are set using MPI attribute calls.
8- FT-MPI semantics: five failure modes
- 1. SHRINK
  - Re-orders the surviving processes to make a contiguous communicator.
  - On a rebuild this forces the missing process to disappear from the communicator.
  - The size changes, and process ranks may also change.
  - Useful when users need the communicator's size to match its extent, e.g. when using home-grown collectives.
12- FT-MPI semantics: five failure modes
- 2. BLANK
  - Rebuilds the communicator so that gaps are allowed.
  - Collectives must still do the right thing afterwards.
  - MPI_COMM_SIZE returns the extent of the communicator, not the number of valid processes in it.
  - P2P operations to a gap fail.
  - Good for parameter sweeps / Monte Carlo simulations, where process loss only means resending data.
13- FT-MPI semantics: five failure modes
- 3. REBUILD
  - Automatic recovery: a communicator that has lost a process is rebuilt.
  - The new process is inserted either in the gap or at the end.
  - The new process is notified by the return value from MPI_Init.
  - Used for applications that need a constant number of processes, as in power-of-two FFT solvers.
14- FT-MPI semantics: five failure modes
- 4. REBUILD ALL
  - Same as REBUILD, except it rebuilds all communicators and groups and resets all key values etc.
  - Does much more work than REBUILD behind the scenes.
  - Useful for applications that use multiple communicators (e.g. one per dimension) and communicator key values.
  - Slower, with slightly higher overhead due to the extra state it has to distribute.
15- FT-MPI semantics: five failure modes
- 5. ABORT
  - The default MPI behavior.
  - The user cannot trap the failure for a graceful exit.
16- FT-MPI semantics: message modes
- 1. NO-OP (NOP)
  - No user-level message operations are allowed on error.
  - All operations return an error code.
  - The user re-posts all operations after recovery.
- 2. CONTINUE (CONT)
  - All messages that can be sent are sent.
  - You always get to receive if a message is waiting for you.
  - All operations that returned MPI_SUCCESS will be finished after recovery.
17- FT-MPI semantics: P2P vs. collective correctness
- 1. Collective operations are dealt with differently than P2P.
  - An operation returns success only if it would have given the surviving members the same answer as if no failure had occurred.
- 2. Two classes of collective operations:
  - broadcast / scatter: succeed if a non-root node fails
  - gather / reduce: fail if there is an error
18- FT-MPI semantics: basic usage
- Simple FT-MPI send usage:

      do {
          rc = MPI_Send(..., com);
          if (rc == MPI_ERR_OTHER) {
              MPI_Comm_dup(com, &newcom);   /* rebuild the communicator */
              MPI_Comm_free(&com);
              com = newcom;                 /* continue */
          }
      } while (rc != MPI_SUCCESS);
- Checking every call is not always necessary. SPMD master-worker codes only need error checking in the master code, if the user is willing to accept the master as the only point of failure.
19- FT-MPI implementation
- 1. Built in multiple layers.
- 2. Has tuned collectives and user-derived data type handling.
- 3. Users need to re-compile against libftmpi and start the application with the ftmpirun command.
- 4. Can be run both with and without a HARNESS core.
20- FT-MPI overall implementation structure
(Diagram: the layered implementation structure, including the collective library.)
21- Derived Data Types
23- FT-MPI DDT performance
26- Reordering of a collective topology
27- Applications
- Two example applications:
- 1. ScaLAPACK
  - Unmodified, just to check that a pre-existing, standard MPI application can be handled.
- 2. PSTSWM (Parallel Spectral Transform Shallow Water Model)
  - Two versions:
    - standard, to test performance
    - user-level checkpoint with rebuild
28- Limitations
- Applications need to be designed to use FT-MPI by including the extended APIs.
- Changing legacy applications is not feasible.
- Additional user-directed checkpointing is needed for applications that require a higher level of fault tolerance.
- The semantics of existing MPI objects and functions are time-tested; FT-MPI changes them.
- Also, the paper needs better logical organization.
29- References
- 1. Graham E. Fagg and Jack J. Dongarra, "Building and Using a Fault Tolerant MPI Implementation".
- 2. Graham E. Fagg, Edgar Gabriel, and Jack J. Dongarra, "Fault Tolerant MPI in High Performance Computing: Semantics and Applications".
- 3. Graham E. Fagg, "Making of the holy grail or a YAMI that is FT".