Title: MARMOT MPI Analysis and Checking Tool
1. MARMOT - MPI Analysis and Checking Tool
Bettina Krammer, Matthias Müller
krammer@hlrs.de, mueller@hlrs.de
HLRS High Performance Computing Center Stuttgart
Allmandring 30, D-70550 Stuttgart
http://www.hlrs.de
2. Overview
- General Problems of MPI Programming
- Related Work
- New Approach: MARMOT
- Results within CrossGrid
- Outlook
3. Problems of MPI Programming
- All problems of serial programming
- Additional problems:
  - Increased difficulty to verify correctness of the program
  - Increased difficulty to debug N parallel processes
  - Portability issues between different MPI implementations and platforms, e.g. with mpich on the CrossGrid testbed (a small portable check is sketched after this list):
    - WARNING: MPI_Recv: tag 36003 > 32767! MPI only guarantees tags up to this. THIS implementation allows tags up to 137654536.
    - Versions of LAM-MPI < v7.0 only guarantee tags up to 32767.
  - New problems such as race conditions and deadlocks
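As noted in the warning above, the only portable upper bound for tags is the MPI_TAG_UB attribute (guaranteed to be at least 32767). A minimal sketch of how a program can query this limit at runtime; the variable names are illustrative, only the MPI-1 call MPI_Attr_get is assumed:

/* Minimal sketch: ask the implementation for its tag upper bound instead
 * of assuming a particular limit.  MPI guarantees MPI_TAG_UB >= 32767. */
#include <stdio.h>
#include "mpi.h"

int main(int argc, char **argv)
{
    int *tag_ub;   /* MPI_Attr_get returns a pointer to the attribute value */
    int  flag;

    MPI_Init(&argc, &argv);
    MPI_Attr_get(MPI_COMM_WORLD, MPI_TAG_UB, &tag_ub, &flag);
    if (flag)
        printf("This implementation allows tags up to %d\n", *tag_ub);
    MPI_Finalize();
    return 0;
}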
4. Related Work
- Parallel debuggers, e.g. TotalView, DDT, p2d2
- Debug version of MPI Library
- Post-mortem analysis of trace files
- Special verification tools for runtime analysis
- MPI-CHECK
- Limited to Fortran
- Umpire
- First version limited to shared memory platforms
- Distributed memory version in preparation
- Not publicly available
- MARMOT
5. What is MARMOT?
- Tool for the development of MPI applications
- Automatic runtime analysis of the application:
  - Detect incorrect use of MPI
  - Detect non-portable constructs
  - Detect possible race conditions and deadlocks
- MARMOT does not require source code modifications, just relinking
- The C and Fortran bindings of MPI-1.2 are supported
6. Design of MARMOT
7. Basics of MARMOT
- Library written in C that is linked to the application in addition to the underlying MPI library
- This library consists of the debug server and the debug clients
- No source code modification of the application is required
- An additional process works as debug server, i.e. the application has to be run with n+1 instead of n processes
8. Basics of MARMOT (cont'd)
- Main interface: the MPI profiling interface according to the MPI-1.2 standard (a minimal wrapper sketch follows this list)
- Implementation of the C language binding of MPI
- Implementation of the Fortran language binding as a wrapper to the C interface
- Environment variables control the tool behaviour and output (report of errors, warnings and/or remarks, traceback, etc.)
- MARMOT produces a human-readable logfile
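To illustrate the profiling-interface approach, here is a minimal sketch (not MARMOT's actual source) of how a tool library intercepts an MPI call and forwards it to the implementation via the PMPI_ entry point; the check hook named in the comment is a hypothetical placeholder:

/* Sketch of an MPI profiling-interface wrapper: the tool provides MPI_Send,
 * runs its checks, and then calls the real implementation as PMPI_Send. */
#include "mpi.h"

int MPI_Send(void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    /* check_send_args(count, datatype, dest, tag, comm);  -- hypothetical hook */
    return PMPI_Send(buf, count, datatype, dest, tag, comm);
}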
9. Examples of Client Checks: verification on the local nodes
- Verification of proper construction and usage of MPI resources such as communicators, groups, datatypes etc., for example:
- Verification of MPI_Request usage (see the sketch after this list):
  - invalid recycling of an active request
  - invalid use of an unregistered request
  - warning if the number of requests is zero
  - warning if all requests are MPI_REQUEST_NULL
- Check for pending messages and active requests in MPI_Finalize
- Verification of all other arguments such as ranks, tags, etc.
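A simplified, assumed sketch of the kind of local bookkeeping behind the request checks (MARMOT's real registration code is more elaborate; the names active_reqs and MAX_ACTIVE are illustrative): the wrapper remembers active requests and warns when an active handle is passed to MPI_Irecv again.

/* Sketch of a client-side request check: track requests that are still
 * active and warn if one of them is handed to MPI_Irecv again. */
#include <stdio.h>
#include "mpi.h"

#define MAX_ACTIVE 1024
static MPI_Request active_reqs[MAX_ACTIVE];
static int n_active = 0;

int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source,
              int tag, MPI_Comm comm, MPI_Request *request)
{
    int i, ret;

    for (i = 0; i < n_active; i++)
        if (active_reqs[i] == *request)
            fprintf(stderr, "ERROR: MPI_Irecv: Request is still in use!\n");

    ret = PMPI_Irecv(buf, count, datatype, source, tag, comm, request);

    if (n_active < MAX_ACTIVE)              /* register the new active request; */
        active_reqs[n_active++] = *request; /* MPI_Wait/MPI_Test wrappers would unregister it */
    return ret;
}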
10. Examples of Server Checks: verification between the nodes, control of the program
- Everything that requires a global view
- Control the execution flow, trace the MPI calls on each node throughout the whole application
- Signal conditions, e.g. deadlocks, with traceback on each node (a sketch of the timeout idea follows this list)
- Check matching send/receive pairs for consistency
- Check collective calls for consistency
- Output of a human-readable log file
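A minimal sketch of the timeout idea behind deadlock detection (an assumption about the mechanism, not MARMOT's code; the names client_state, deadlock_suspected and TIMEOUT_S are illustrative): the debug server records when each client entered its current MPI call and suspects a deadlock once every client has been blocked longer than a threshold.

/* Sketch of a server-side deadlock heuristic: if every client has been
 * stuck in its current MPI call for longer than TIMEOUT_S seconds,
 * report a suspected deadlock (and print each node's traceback). */
#include <time.h>

#define TIMEOUT_S 60

typedef struct {
    time_t entered;   /* when the client entered its current MPI call */
    int    pending;   /* 1 while that call has not returned yet */
} client_state;

static int deadlock_suspected(const client_state *clients, int nclients)
{
    time_t now = time(NULL);
    int i;

    for (i = 0; i < nclients; i++)
        if (!clients[i].pending || now - clients[i].entered < TIMEOUT_S)
            return 0;   /* at least one client is still making progress */
    return 1;           /* all clients blocked past the timeout */
}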
11. MARMOT within CrossGrid
12. CrossGrid Application Tasks
13. CrossGrid Application WP 1.4: Air pollution modelling (STEM-II)
- Air pollution modelling with the STEM-II model
- Transport equation solved with the Petrov-Crank-Nicolson-Galerkin method
- Chemistry and mass transfer are integrated using semi-implicit Euler and pseudo-analytical methods
- 15,500 lines of Fortran code
- 12 different MPI calls:
  - MPI_Init, MPI_Comm_size, MPI_Comm_rank, MPI_Type_extent, MPI_Type_struct, MPI_Type_commit, MPI_Type_hvector, MPI_Bcast, MPI_Scatterv, MPI_Barrier, MPI_Gatherv, MPI_Finalize
14. STEM application on an IA32 cluster with Myrinet
15. CrossGrid Application WP 1.3: High Energy Physics
- Filtering of real-time data with neural networks (ANN application)
- 11,500 lines of C code
- 11 different MPI calls:
  - MPI_Init, MPI_Comm_size, MPI_Comm_rank, MPI_Get_processor_name, MPI_Barrier, MPI_Gather, MPI_Recv, MPI_Send, MPI_Bcast, MPI_Reduce, MPI_Finalize
16. HEP application on an IA32 cluster with Myrinet
17. CrossGrid Application WP 1.1: Medical Application
- Calculation of blood flow with the Lattice-Boltzmann method
- Stripped-down application with 6,500 lines of C code
- 14 different MPI calls:
  - MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Pack, MPI_Bcast, MPI_Unpack, MPI_Cart_create, MPI_Cart_shift, MPI_Send, MPI_Recv, MPI_Barrier, MPI_Reduce, MPI_Sendrecv, MPI_Finalize
18. Medical application on an IA32 cluster with Myrinet
19. Message statistics with native MPI
20. Message statistics with MARMOT
21. Medical application on an IA32 cluster with Myrinet, without barrier
22. Barrier with native MPI
23. Barrier with MARMOT
24. Conclusion
- MARMOT supports the MPI-1.2 standard with C and Fortran bindings
- Tested with several applications and benchmarks
- Tested on different platforms, using different compilers and MPI implementations, e.g.:
  - IA32/IA64 clusters (Intel and g++ compilers) with mpich (CrossGrid testbed)
  - IBM Regatta
  - NEC SX5, SX6
  - Hitachi SR8000, Cray T3E
25. Impact and Exploitation
- MARMOT is mentioned in the U.S. DoD High Performance Computing Modernization Program (HPCMP) Programming Environment and Training (PET) project as one of two freely available tools (MPI-Check and MARMOT) for the development of portable MPI programs.
- Contact with the developers of the other two known verification tools comparable to MARMOT (MPI-Check, Umpire)
- Contact with users outside CrossGrid
- MARMOT has been presented at various conferences and in publications:
  - http://www.hlrs.de/organization/tsc/projects/marmot/pubs.html
26. Future Work
- Scalability and general performance improvements
- Better user interface to present problems and warnings
- Extended functionality:
  - more tests to verify collective calls
  - MPI-2
  - hybrid programming
27. Thanks for your attention!
Additional information and download: http://www.hlrs.de/organization/tsc/projects/marmot
Thanks to all CrossGrid partners: http://www.eu-crossgrid.org/
28. Backup
29. Impact and Exploitation
- MARMOT is mentioned in the U.S. DoD High Performance Computing Modernization Program (HPCMP) Programming Environment and Training (PET) project as one of two freely available tools (MPI-Check and MARMOT) for the development of portable MPI programs.
- Contact with the developers of the other two known verification tools comparable to MARMOT (MPI-Check, Umpire)
- MARMOT has been presented at various conferences and in publications:
  - http://www.hlrs.de/organization/tsc/projects/marmot/pubs.html
30. Impact and Exploitation (cont'd)
- Tescico proposal (Technical and Scientific Computing)
  - French/German consortium: Bull, CAPS Entreprise, Université Versailles, T-Systems, Intel Germany, GNS, Universität Dresden, HLRS
  - ITEA-labelled
  - French authorities agreed to give funding
  - German authorities did not
31. Impact and Exploitation (cont'd)
- MARMOT is integrated in the HLRS training classes
- Deployment in Germany: MARMOT is in use in two of the three National High Performance Computing Centers:
  - HLRS at Stuttgart (IA32, IA64, NEC SX)
  - NIC at Jülich (IBM Regatta)
    - just replaced their Cray T3E with an IBM Regatta
    - spent a lot of time finding a problem that would have been detected automatically by MARMOT
32. Feedback from CrossGrid Applications
- Task 1.1 (biomedical)
  - C application
  - Identified issues:
    - possible race conditions due to use of MPI_ANY_SOURCE
    - deadlock
- Task 1.2 (flood)
  - Fortran application
  - Identified issues:
    - tags outside of the valid range
    - possible race conditions due to use of MPI_ANY_SOURCE
- Task 1.3 (hep)
  - ANN (C application)
  - no issues found by MARMOT
- Task 1.4 (meteo)
  - STEM-II (Fortran)
  - MARMOT detected holes in self-defined datatypes used in MPI_Scatterv and MPI_Gatherv. These holes were removed, which helped to improve the performance of the communication. (A sketch of such a datatype hole follows this list.)
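For illustration, a hedged sketch (not the STEM-II code; struct cell and build_cell_type are assumed names) of the kind of "hole" MARMOT reported: a derived datatype built with the MPI-1.2 call MPI_Type_struct whose layout includes compiler padding, so every element sent via MPI_Scatterv/MPI_Gatherv carries unused bytes.

/* Sketch of a derived datatype with a hole: on most ABIs there are
 * 4 padding bytes between 'id' and 'value', and a struct datatype that
 * mirrors this layout transfers those bytes as well. */
#include <stddef.h>
#include "mpi.h"

struct cell { int id; double value; };

void build_cell_type(MPI_Datatype *newtype)
{
    int          blocklens[2] = { 1, 1 };
    MPI_Aint     displs[2]    = { offsetof(struct cell, id),
                                  offsetof(struct cell, value) };
    MPI_Datatype types[2]     = { MPI_INT, MPI_DOUBLE };

    MPI_Type_struct(2, blocklens, displs, types, newtype);
    MPI_Type_commit(newtype);
}

Packing the fields contiguously, or communicating them as separate arrays, removes the hole; this is the kind of change that improved the communication performance here.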
33. Examples of the Log File
- 54 rank 1 performs MPI_Cart_shift
- 55 rank 2 performs MPI_Cart_shift
- 56 rank 0 performs MPI_Send
- 57 rank 1 performs MPI_Recv
- WARNING: MPI_Recv: Use of MPI_ANY_SOURCE may cause race conditions!
- 58 rank 2 performs MPI_Recv
- WARNING: MPI_Recv: Use of MPI_ANY_SOURCE may cause race conditions!
- 59 rank 0 performs MPI_Send
- 60 rank 1 performs MPI_Recv
- WARNING: MPI_Recv: Use of MPI_ANY_SOURCE may cause race conditions!
- 61 rank 0 performs MPI_Send
- 62 rank 1 performs MPI_Bcast
- 63 rank 2 performs MPI_Recv
- WARNING: MPI_Recv: Use of MPI_ANY_SOURCE may cause race conditions!
- 64 rank 0 performs MPI_Pack
- 65 rank 2 performs MPI_Bcast
- 66 rank 0 performs MPI_Pack
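The warnings above concern the following pattern; this is a hedged illustration (not the application's code, collect_results is an assumed name) of why MPI_ANY_SOURCE can introduce a race: the receive matches whichever message arrives first, so the processing order may differ between runs.

/* Sketch: with MPI_ANY_SOURCE the i-th receive may match any sender,
 * so the order in which messages are processed is non-deterministic. */
#include "mpi.h"

void collect_results(int nprocs, int *value)
{
    MPI_Status status;
    int i;

    for (i = 1; i < nprocs; i++) {
        MPI_Recv(value, 1, MPI_INT, MPI_ANY_SOURCE, 17,
                 MPI_COMM_WORLD, &status);
        /* status.MPI_SOURCE reveals which rank was actually matched */
    }
}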
34. Examples of the Log File (continued)
- 7883 rank 2 performs MPI_Barrier
- 7884 rank 0 performs MPI_Sendrecv
- 7885 rank 1 performs MPI_Sendrecv
- 7886 rank 2 performs MPI_Sendrecv
- 7887 rank 0 performs MPI_Sendrecv
- 7888 rank 1 performs MPI_Sendrecv
- 7889 rank 2 performs MPI_Sendrecv
- 7890 rank 0 performs MPI_Barrier
- 7891 rank 1 performs MPI_Barrier
- 7892 rank 2 performs MPI_Barrier
- 7893 rank 0 performs MPI_Sendrecv
- 7894 rank 1 performs MPI_Sendrecv
35. Examples of the Log File (Deadlock)
- 9310 rank 1 performs MPI_Sendrecv
- 9311 rank 2 performs MPI_Sendrecv
- 9312 rank 0 performs MPI_Barrier
- 9313 rank 1 performs MPI_Barrier
- 9314 rank 2 performs MPI_Barrier
- 9315 rank 1 performs MPI_Sendrecv
- 9316 rank 2 performs MPI_Sendrecv
- 9317 rank 0 performs MPI_Sendrecv
- 9318 rank 1 performs MPI_Sendrecv
- 9319 rank 0 performs MPI_Sendrecv
- 9320 rank 2 performs MPI_Sendrecv
- 9321 rank 0 performs MPI_Barrier
- 9322 rank 1 performs MPI_Barrier
- 9323 rank 2 performs MPI_Barrier
- 9324 rank 1 performs MPI_Comm_rank
- 9325 rank 1 performs MPI_Bcast
- 9326 rank 2 performs MPI_Comm_rank
- 9327 rank 2 performs MPI_Bcast
- 9328 rank 0 performs MPI_Sendrecv
36. Examples of the Log File (Deadlock: traceback on rank 0)
- timestamp 9298: MPI_Sendrecv(sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 2, sendtag = 1, recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 1, recvtag = 1, comm = self-defined communicator, status)
- timestamp 9300: MPI_Sendrecv(sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 1, sendtag = 1, recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 2, recvtag = 1, comm = self-defined communicator, status)
- timestamp 9304: MPI_Barrier(comm = MPI_COMM_WORLD)
- timestamp 9307: MPI_Sendrecv(sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 2, sendtag = 1, recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 1, recvtag = 1, comm = self-defined communicator, status)
- timestamp 9309: MPI_Sendrecv(sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 1, sendtag = 1, recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 2, recvtag = 1, comm = self-defined communicator, status)
- timestamp 9312: MPI_Barrier(comm = MPI_COMM_WORLD)
- timestamp 9317: MPI_Sendrecv(sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 2, sendtag = 1, recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 1, recvtag = 1, comm = self-defined communicator, status)
- timestamp 9319: MPI_Sendrecv(sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 1, sendtag = 1, recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 2, recvtag = 1, comm = self-defined communicator, status)
- timestamp 9321: MPI_Barrier(comm = MPI_COMM_WORLD)
- timestamp 9328: MPI_Sendrecv(sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 2, sendtag = 1, recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 1, recvtag = 1, comm = self-defined communicator, status)
37. Examples of the Log File (Deadlock: traceback on rank 1)
- timestamp 9301: MPI_Sendrecv(sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 2, sendtag = 1, recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 0, recvtag = 1, comm = self-defined communicator, status)
- timestamp 9302: MPI_Barrier(comm = MPI_COMM_WORLD)
- timestamp 9306: MPI_Sendrecv(sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 0, sendtag = 1, recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 2, recvtag = 1, comm = self-defined communicator, status)
- timestamp 9310: MPI_Sendrecv(sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 2, sendtag = 1, recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 0, recvtag = 1, comm = self-defined communicator, status)
- timestamp 9313: MPI_Barrier(comm = MPI_COMM_WORLD)
- timestamp 9315: MPI_Sendrecv(sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 0, sendtag = 1, recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 2, recvtag = 1, comm = self-defined communicator, status)
- timestamp 9318: MPI_Sendrecv(sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 2, sendtag = 1, recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 0, recvtag = 1, comm = self-defined communicator, status)
- timestamp 9322: MPI_Barrier(comm = MPI_COMM_WORLD)
- timestamp 9324: MPI_Comm_rank(comm = MPI_COMM_WORLD, rank)
- timestamp 9325: MPI_Bcast(buffer, count = 3, datatype = MPI_DOUBLE, root = 0, comm = MPI_COMM_WORLD)
38. Examples of the Log File (Deadlock: traceback on rank 2)
- timestamp 9303: MPI_Sendrecv(sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 0, sendtag = 1, recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 1, recvtag = 1, comm = self-defined communicator, status)
- timestamp 9305: MPI_Barrier(comm = MPI_COMM_WORLD)
- timestamp 9308: MPI_Sendrecv(sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 1, sendtag = 1, recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 0, recvtag = 1, comm = self-defined communicator, status)
- timestamp 9311: MPI_Sendrecv(sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 0, sendtag = 1, recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 1, recvtag = 1, comm = self-defined communicator, status)
- timestamp 9314: MPI_Barrier(comm = MPI_COMM_WORLD)
- timestamp 9316: MPI_Sendrecv(sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 1, sendtag = 1, recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 0, recvtag = 1, comm = self-defined communicator, status)
- timestamp 9320: MPI_Sendrecv(sendbuf, sendcount = 7220, sendtype = MPI_DOUBLE, dest = 0, sendtag = 1, recvbuf, recvcount = 7220, recvtype = MPI_DOUBLE, source = 1, recvtag = 1, comm = self-defined communicator, status)
- timestamp 9323: MPI_Barrier(comm = MPI_COMM_WORLD)
- timestamp 9326: MPI_Comm_rank(comm = MPI_COMM_WORLD, rank)
- timestamp 9327: MPI_Bcast(buffer, count = 3, datatype = MPI_DOUBLE, root = 0, comm = MPI_COMM_WORLD)
39. Classical Solutions I: Parallel Debuggers
- Examples: TotalView, DDT, p2d2
- Advantages:
  - same approach and tool as in the serial case
- Disadvantages:
  - can only fix problems after, and if, they occur
  - scalability: how can you debug programs that crash after 3 hours on 512 nodes?
  - reproducibility: how do you debug a program that crashes only every fifth time?
  - does not help to improve portability
40. Classical Solutions II: Debug Version of the MPI Library
- Examples:
  - catches some incorrect usage, e.g. the node count in MPI_CART_CREATE (mpich; see the sketch after this list)
  - deadlock detection (NEC MPI)
- Advantages:
  - good scalability
  - better debugging in combination with TotalView
- Disadvantages:
  - portability: only helps with this particular MPI implementation
  - trade-off between performance and safety
  - reproducibility: does not help to debug irreproducible programs
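As an illustration of the mpich check mentioned above (the concrete numbers and the name make_grid are assumed, not taken from the slides): MPI_Cart_create is erroneous if the requested grid is larger than the communicator, which a checking MPI build can flag at the call site.

/* Sketch: a 4x4 Cartesian grid requires at least 16 processes in
 * MPI_COMM_WORLD; starting the job with fewer makes this call erroneous,
 * and a debug MPI library can report the mismatch. */
#include "mpi.h"

void make_grid(MPI_Comm *grid_comm)
{
    int dims[2]    = { 4, 4 };
    int periods[2] = { 0, 0 };

    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, grid_comm);
}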
41. Comparison with related projects
42. Comparison with related projects
43. Examples
44. Example 1: request-reuse (source code)

/*
 * Here we re-use a request we didn't free before.
 */
#include <stdio.h>
#include <assert.h>
#include "mpi.h"

int main( int argc, char **argv )
{
    int size = -1;
    int rank = -1;
    int value = -1;
    int value2 = -1;
    MPI_Status send_status, recv_status;
    MPI_Request send_request, recv_request;

    printf( "We call Irecv and Isend with non-freed requests.\n" );
    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &size );
    /* restored from the output log on slide 46, which shows MPI_Comm_rank
       and the "I am rank ... of ... PEs" messages */
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    printf( "I am rank %d of %d PEs\n", rank, size );
45. Example 1: request-reuse (source code continued)

    if( rank == 0 )
    {
        /* this is just to get the request used */
        MPI_Irecv( &value, 1, MPI_INT, 1, 18, MPI_COMM_WORLD, &recv_request );
        /* going to receive the message and reuse a non-freed request */
        MPI_Irecv( &value, 1, MPI_INT, 1, 17, MPI_COMM_WORLD, &recv_request );
        MPI_Wait( &recv_request, &recv_status );
        assert( value == 19 );
    }
    if( rank == 1 )
    {
        value2 = 19;
        /* this is just to use the request */
        MPI_Isend( &value, 1, MPI_INT, 0, 18, MPI_COMM_WORLD, &send_request );
        /* going to send the message */
        MPI_Isend( &value2, 1, MPI_INT, 0, 17, MPI_COMM_WORLD, &send_request );
        MPI_Wait( &send_request, &send_status );
    }
    MPI_Finalize();
    return 0;
}
46. Example 1: request-reuse (output log)
- We call Irecv and Isend with non-freed requests.
- 1 rank 0 performs MPI_Init
- 2 rank 1 performs MPI_Init
- 3 rank 0 performs MPI_Comm_size
- 4 rank 1 performs MPI_Comm_size
- 5 rank 0 performs MPI_Comm_rank
- 6 rank 1 performs MPI_Comm_rank
- I am rank 0 of 2 PEs
- 7 rank 0 performs MPI_Irecv
- I am rank 1 of 2 PEs
- 8 rank 1 performs MPI_Isend
- 9 rank 0 performs MPI_Irecv
- 10 rank 1 performs MPI_Isend
- ERROR: MPI_Irecv: Request is still in use!!
- 11 rank 0 performs MPI_Wait
- ERROR: MPI_Isend: Request is still in use!!
- 12 rank 1 performs MPI_Wait
- 13 rank 0 performs MPI_Finalize
- 14 rank 1 performs MPI_Finalize
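A possible repair for the error reported above (an assumed fix, not part of the original slides; rank0_fixed is an illustrative name): complete the first request before its handle is reused, for example by waiting on it.

/* Fixed variant of Example 1's rank-0 branch: wait on the first request
 * before reusing the handle, so no active request is recycled. */
#include <assert.h>
#include "mpi.h"

void rank0_fixed(void)
{
    int value = -1;
    MPI_Status recv_status;
    MPI_Request recv_request;

    MPI_Irecv( &value, 1, MPI_INT, 1, 18, MPI_COMM_WORLD, &recv_request );
    MPI_Wait( &recv_request, &recv_status );   /* first request completes here */
    MPI_Irecv( &value, 1, MPI_INT, 1, 17, MPI_COMM_WORLD, &recv_request );
    MPI_Wait( &recv_request, &recv_status );   /* handle is reused only when inactive */
    assert( value == 19 );
}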
47. Example 2: deadlock (source code)

/* This program produces a deadlock.
 * At least 2 nodes are required to run the program.
 *
 * Rank 0 receives a message from Rank 1.
 * Rank 1 receives a message from Rank 0.
 *
 * AFTERWARDS:
 * Rank 0 sends a message to Rank 1.
 * Rank 1 sends a message to Rank 0.
 */
#include <stdio.h>
#include "mpi.h"

int main( int argc, char **argv )
{
    int rank = 0;
    int size = 0;
    int dummy = 0;
    MPI_Status status;
48. Example 2: deadlock (source code continued)

    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );

    if( size < 2 )
    {
        fprintf( stderr, " This program needs at least 2 PEs!\n" );
    }
    else
    {
        if( rank == 0 )
        {
            MPI_Recv( &dummy, 1, MPI_INT, 1, 17, MPI_COMM_WORLD, &status );
            MPI_Send( &dummy, 1, MPI_INT, 1, 18, MPI_COMM_WORLD );
        }
        if( rank == 1 )
        {
            MPI_Recv( &dummy, 1, MPI_INT, 0, 18, MPI_COMM_WORLD, &status );
            MPI_Send( &dummy, 1, MPI_INT, 0, 17, MPI_COMM_WORLD );
        }
    }
    MPI_Finalize();
    return 0;
}
49. Example 2: deadlock (output log)
- mpirun -np 3 deadlock1
- 1 rank 0 performs MPI_Init
- 2 rank 1 performs MPI_Init
- 3 rank 0 performs MPI_Comm_rank
- 4 rank 1 performs MPI_Comm_rank
- 5 rank 0 performs MPI_Comm_size
- 6 rank 1 performs MPI_Comm_size
- 7 rank 0 performs MPI_Recv
- 8 rank 1 performs MPI_Recv
- 8 Rank 0 is pending!
- 8 Rank 1 is pending!
- WARNING: deadlock detected, all clients are pending
50. Example 2: deadlock (output log continued)
- Last calls (max. 10) on node 0:
  - timestamp 1: MPI_Init( argc, argv )
  - timestamp 3: MPI_Comm_rank( comm, rank )
  - timestamp 5: MPI_Comm_size( comm, size )
  - timestamp 7: MPI_Recv( buf, count = -1, datatype = non-predefined datatype, source = -1, tag = -1, comm, status )
- Last calls (max. 10) on node 1:
  - timestamp 2: MPI_Init( argc, argv )
  - timestamp 4: MPI_Comm_rank( comm, rank )
  - timestamp 6: MPI_Comm_size( comm, size )
  - timestamp 8: MPI_Recv( buf, count = -1, datatype = non-predefined datatype, source = -1, tag = -1, comm, status )
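One possible repair of the deadlocking pattern (an assumed fix, not shown on the original slides; exchange_fixed is an illustrative name) is to combine each rank's send and receive into a single MPI_Sendrecv, so neither rank blocks in MPI_Recv before its partner has sent:

/* Fixed exchange for Example 2: MPI_Sendrecv pairs the send and receive,
 * so ranks 0 and 1 no longer wait for a send that never comes. */
#include "mpi.h"

void exchange_fixed(int rank)
{
    int send_val = 0, recv_val = 0;
    int partner  = (rank == 0) ? 1 : 0;
    int sendtag  = (rank == 0) ? 18 : 17;   /* tags as in the original example */
    int recvtag  = (rank == 0) ? 17 : 18;
    MPI_Status status;

    MPI_Sendrecv(&send_val, 1, MPI_INT, partner, sendtag,
                 &recv_val, 1, MPI_INT, partner, recvtag,
                 MPI_COMM_WORLD, &status);
}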
51. Bandwidth on an IA64 cluster with Myrinet
52. Latency on an IA64 cluster with Myrinet
53. cg.B on an IA32 cluster with Myrinet
54. is.B on an IA32 cluster with Myrinet
55. is.A