Title: HARNESS and fault tolerant MPI
1- HARNESS and fault tolerant MPI
- Reviewed by Saswati Swami
2- Outline
- 1. Motivation
- 2. FT-MPI definition
- 3. FT-MPI semantics
- 4. FT-MPI implementation
- 5. Applications
- 6. Limitations
- 7. References
3- Motivation
- As application and machine sizes grow, the MTBF becomes less than the application run time.
- Current MPI implementations either abort everything or use checkpointing to roll back, which is expensive.
- All communication is via a communicator. The MPI standard is based on a static model, so any decrease in tasks leads to a corrupted communicator.
- Goal: develop an MPI plug-in that takes advantage of HARNESS robustness to offer a range of recovery alternatives to an MPI application. Not just another MPI implementation.
4- Motivation
- Harness gives us the basic functionality of starting tasks, basic communication between them, attribute storage, and indication of errors and failures.
(Diagram: the application sits on the Harness run-time over a basic TCP/IP link; the run-time talks to the HARNESS daemon via pipes / sockets over TCP/IP.)
5- FT-MPI definition
- 1. FT-MPI is a fault tolerant MPI system developed under the DOE HARNESS project.
- 2. FT-MPI extends MPI and allows applications to decide what to do when an error occurs:
  - restarting a failed node
  - continuing with a smaller number of nodes
- 3. Under FT-MPI, when a member of a communicator dies:
  - the communicator state changes to indicate a problem
  - message transfers can continue if safe, or be stopped / ignored
  - to continue, the user's application can fix the communicators or abort.
6- FT-MPI semantics
- FT-MPI
  - 1. Communicator states: FT_OK, FT_DETECTED, FT_RECOVER, FT_RECOVERED, FT_FAILED
  - 2. Process states: OK, UNAVAILABLE, JOINING, FAILED
- MPI
  - 1. Communicator states: VALID, INVALID
  - 2. Process states: OK, FAILED
7- FT-MPI semantics: communicator failure handling
- Communicators are invalidated when a failure is detected.
- The underlying system sends a state update to all processes for that communicator.
- For a communication error, not all communicators are updated.
- For a process exit, all communicators that included this process are changed.
- System behavior depends on the communicator failure mode chosen.
  - Modes are set using MPI attribute calls.
8- FT-MPI semantics: five failure modes
- 1. SHRINK
  - Re-orders the surviving processes to make a contiguous communicator.
  - On a rebuild this forces the missing process to disappear from the communicator.
  - The size changes, and process ranks may also change.
  - Useful when users need the communicator's size to match its extent, e.g. when using home-grown collectives.
12- FT-MPI semantics: five failure modes
- 2. BLANK
  - Rebuilds the communicator so that gaps are allowed.
  - Collectives must still do the right thing afterwards.
  - MPI_COMM_SIZE returns the extent of the communicator, not the number of valid processes in it.
  - P2P operations to a gap fail.
  - Good for parameter sweeps / Monte Carlo simulations, where process loss only means resending data.
13- FT-MPI semantics: five failure modes
- 3. REBUILD
  - Automatic recovery: a communicator that has lost a process is rebuilt.
  - The new process is inserted either in the gap or at the end.
  - The new process is notified by the return value from MPI_Init.
  - Used for applications that need a constant number of processes, as in power-of-two FFT solvers.
14- FT-MPI semantics: five failure modes
- 4. REBUILD ALL
  - Same as REBUILD, except it rebuilds all communicators and groups and resets all key values etc.
  - Does much more work than REBUILD behind the scenes.
  - Useful for applications that use multiple communicators (e.g. one per dimension) and communicator key values.
  - Slower, with slightly higher overhead due to the extra state it has to distribute.
15- FT-MPI semantics: five failure modes
- 5. ABORT
  - The default MPI behavior.
  - The user cannot trap the failure for a graceful exit.
16- FT-MPI semantics: message modes
- 1. NO-OP (NOP)
  - No user-level message operations are allowed on error.
  - All operations return an error code.
  - The user re-posts all operations after recovery.
- 2. CONTINUE (CONT)
  - All messages that can be sent are sent.
  - You always get to receive if a message is waiting for you.
  - All operations that returned MPI_SUCCESS will be finished after recovery.
17- FT-MPI semantics: P2P vs. collective correctness
- 1. Collective operations are dealt with differently than P2P.
  - An operation returns success only if it would have given the surviving members the same answer as if no failure had occurred.
- 2. Two classes of collective operations:
  - broadcast / scatter: succeed if a non-root node fails
  - gather / reduce: fail if there is an error
18- FT-MPI semantics: basic usage
- Simple FT-MPI send usage:

      do {
          rc = MPI_Send(..., com);
          if (rc == MPI_ERR_OTHER) {
              MPI_Comm_dup(com, &newcom);   /* rebuild the communicator */
              MPI_Comm_free(&com);
              com = newcom;                 /* continue */
          }
      } while (rc != MPI_SUCCESS);
- Checking every call is not always necessary. SPMD master-worker codes only need error checking in the master code, if the user is willing to accept the master as the only point of failure.
19- FT-MPI implementation
- 1. Built in multiple layers.
- 2. Has tuned collectives and user-derived data type handling.
- 3. Users need to re-compile against libftmpi and start the application with the ftmpirun command.
- 4. Can be run both with and without a HARNESS core.
20- FT-MPI overall implementation structure
(Diagram: the layered implementation structure, including the collective library.)
21- Derived Data Types
23- FT-MPI DDT performance
26- Reordering of a collective topology
27- Applications
- Two example applications:
- 1. ScaLAPACK
  - Unmodified, just to check that a pre-existing, standard MPI application can be handled.
- 2. PSTSWM (Parallel Spectral Transform Shallow Water Model)
  - Two versions:
    - standard, to test performance
    - user-level checkpoint with rebuild
28- Limitations
- Applications need to be designed to use FT-MPI by including the extended APIs.
- Changing legacy applications is not feasible.
- Additional user-directed checkpointing is needed for applications that require a higher level of fault tolerance.
- The semantics of existing MPI objects and functions are time-tested; FT-MPI changes them.
- Also, the paper needs better logical organization.
29- References
- 1. Graham E. Fagg and Jack J. Dongarra, "Building and Using a Fault Tolerant MPI Implementation".
- 2. Graham E. Fagg, Edgar Gabriel, and Jack J. Dongarra, "Fault Tolerant MPI in High Performance Computing: Semantics and Applications".
- 3. Graham E. Fagg, "Making of the holy grail or a YAMI that is FT".