Title: Generic MPI Job Submission by the P-GRADE Grid Portal

1. Generic MPI Job Submission by the P-GRADE Grid Portal
Zoltán Farkas (zfarkas_at_sztaki.hu)
MTA SZTAKI
2. Contents
- MPI
  - Standards
  - Implementations
- P-GRADE Grid Portal
  - Workflow execution, file handling
  - Direct job submission
  - Brokered job submission
3. MPI
- MPI stands for Message Passing Interface
- Standards 1.1 and 2.0
- MPI Standard features
  - Collective communication (1.1)
  - Point-to-Point communication (1.1)
  - Group management (1.1)
  - Dynamic Processes (2.0)
- Programming Language APIs (C, C++, Fortran; see the sketch below)
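To make the API concrete, here is a minimal sketch of an MPI C program using point-to-point communication; the message value and tag are illustrative only:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);                /* enter the MPI environment */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* id of this process */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */

        if (rank == 0) {
            int value = 42;
            /* master rank sends a value to every other rank */
            for (int i = 1; i < size; i++)
                MPI_Send(&value, 1, MPI_INT, i, 0, MPI_COMM_WORLD);
        } else {
            int value;
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank %d received %d\n", rank, value);
        }

        MPI_Finalize();
        return 0;
    }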
4. MPI Implementations
- MPICH
  - Freely available implementation of MPI
  - Runs on many architectures (even on Windows)
  - Implements Standards 1.1 (MPICH) and 2.0 (MPICH2)
  - Supports Globus (MPICH-G2)
  - Nodes are allocated upon application execution
- LAM/MPI
  - Open-source implementation of MPI
  - Implements Standard 1.1 and parts of 2.0
  - Many interesting features (e.g. checkpointing)
  - Nodes are allocated before application execution
- Open MPI
  - Implements Standard 2.0
  - Builds on technologies of other MPI projects (e.g. LAM/MPI)
5. MPICH execution on x86 clusters
- The application is started using mpirun, specifying:
  - the number of requested nodes (-np <nodenumber>),
  - optionally, a file listing the nodes to be allocated (-machinefile <arg>),
  - the executable,
  - the executable's arguments.
- Example: mpirun -np 7 ./cummu N M p 32 (see the machine-file example below)
- Processes are spawned using rsh or ssh, depending on the configuration
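For illustration, a machine file is just a list of worker host names, one per line; the host names and application arguments below are made up:

    # contents of the file "machines"
    node01
    node02
    node03

    # allocate three nodes from the machine file and start the application
    mpirun -np 3 -machinefile machines ./app arg1 arg2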
6. MPICH x86 execution requirements
- The executable (and input files) must be present on the worker nodes:
  - using a shared filesystem, or
  - the user distributes the files before invoking mpirun.
- The worker nodes must be accessible from the host running mpirun:
  - using rsh or ssh,
  - without user interaction (host-based authentication; see the check below).
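A quick way to verify this requirement: spawning only works if the master can reach each worker without a password prompt (the host name below is illustrative):

    # must print the remote host name without asking for a password
    ssh node01 uname -n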
7. P-GRADE Grid Portal
- Technologies
  - Apache Tomcat
  - GridSphere
  - Java Web Start
  - Condor
  - Globus
  - EGEE Middleware
  - Scripts
8. P-GRADE Grid Portal
- Workflow execution
  - DAGMan as workflow scheduler
  - pre and post scripts perform tasks around job execution (see the DAGMan sketch below)
- Direct job execution using GT-2
  - GridFTP, GRAM
  - pre: create a temporary storage directory, copy input files
  - job: Condor-G executes a wrapper script
  - post: download results
- Job execution using the EGEE broker (both LCG and gLite)
  - pre: create the application context as the input sandbox
  - job: a Scheduler universe Condor job executes a script, which does job submission, status polling and output downloading; a wrapper script is submitted to the broker
  - post: error checking
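A minimal sketch of how a DAGMan input file wires pre and post scripts around each workflow node; the job and script names are illustrative:

    # two workflow nodes, each a Condor(-G) submit description
    JOB jobA jobA.submit
    JOB jobB jobB.submit
    # runs before the job (e.g. create storage, stage inputs)
    SCRIPT PRE jobA pre_jobA.sh
    # runs after the job (e.g. download results, check for errors)
    SCRIPT POST jobA post_jobA.sh
    # execution order between the nodes
    PARENT jobA CHILD jobB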
9. Workflow Manager Portlet
10. Workflow example
- Jobs
- Input/output files
- Data transfers
11. Portal File handling
- Local files
  - User has access to these files through the Portal
  - Local input files are uploaded from the user machine
  - Local output files are downloaded to the user machine
- Remote files
  - Files reside on EGEE Storage Elements or are accessible using GridFTP
  - EGEE SE files are referenced by logical file name (lfn:) or by GUID (guid:)
  - GridFTP files are referenced by gsiftp:// URLs
12. Workflow Files
- File Types
  - In/Out
  - Local/Remote
- File Names
  - Internal
  - Global
- File Lifetime
  - Permanent
  - Volatile
13. Portal Direct job execution
- The resource to be used is known before job execution
- The user must have a valid, accepted certificate
- Local files are supported
- Remote GridFTP files are supported, even for grid-unaware applications
- Jobs may be sequential or MPI applications
14. Direct exec step-by-step I.
- Pre script
  - creates a storage directory on the selected site's front-end node, using the fork jobmanager
  - local input files are copied to this directory from the Portal machine using GridFTP
  - remote input files are copied using GridFTP (in case of errors, a two-phase copy via the Portal machine is tried)
- Condor-G job
  - a wrapper script (wrapperp) is specified as the real executable
  - a single job is submitted to the requested jobmanager; for MPI jobs the hostCount RSL attribute specifies the number of requested nodes (see the RSL sketch below)
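A minimal sketch of the corresponding GRAM RSL; only the hostCount usage is taken from the slides, the executable name and node count are illustrative:

    &(executable = wrapperp)
     (count = 7)
     (hostCount = 7)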
15. Direct exec step-by-step II.
- LRMS
  - allocates the requested number of nodes (if needed)
  - starts wrapperp on one of the allocated nodes (the master worker node)
- Wrapperp (running on the master worker node)
  - copies the executable and input files from the front-end node (scp or rcp)
  - in case of PBS jobmanagers, the executable and input files are copied to the allocated nodes (listed in PBS_NODEFILE); in case of non-PBS jobmanagers, a shared filesystem is required, as the host names of the allocated nodes cannot be determined
  - wrapperp searches for mpirun
  - the real executable is started using the found mpirun
  - in case of PBS jobmanagers, output files are copied from the allocated worker nodes to the master worker node
  - output files are copied to the front-end node
16. Direct exec step-by-step III.
- Post script
  - local output files are copied from the temporary working directory created by the pre script to the Portal machine using GridFTP
  - remote output files are copied using GridFTP (in case of errors, a two-phase copy via the Portal machine is tried)
- DAGMan continues scheduling the remaining jobs
17. Direct execution
[Figure: data-flow diagram of direct execution. Components: the Portal machine, remote file storage, the front-end node (fork jobmanager, GridFTP, temporary storage), PBS, the master worker node running wrapperp and mpirun, and slave worker nodes WN1..WNn-1. Numbered arrows trace the staging of inputs and the executable to the nodes and the copying of outputs back.]
18. Direct Submission Summary
- Pros
  - users can add remote file support to legacy applications
  - works for both sequential and MPI(CH) applications
  - for PBS jobmanagers, no shared filesystem is needed (support for other jobmanagers can be added, depending on the information they provide)
  - works with jobmanagers that do not support MPI
  - faster than submitting through the broker
- Cons
  - the user needs to specify the execution resource
  - currently does not work on non-PBS jobmanagers without a shared filesystem
19. Portal Brokered job submission
- The EGEE Resource Broker is used
- The resource to be used is unknown before job execution
- The user must have a valid, accepted certificate
- Local files are supported
- Remote files residing on Storage Elements are supported, even for grid-unaware applications
- Jobs may be sequential or MPI applications
20. Broker exec step-by-step I.
- Pre script
  - creates the Scheduler universe Condor submit file
- Scheduler universe Condor job
  - the job is a shell script
  - the script is responsible for:
    - job submission (a wrapper script, wrapperrb, is specified as the real executable in the JDL file; see the JDL sketch below)
    - job status polling
    - job output downloading
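A minimal sketch of such a JDL file for an MPICH job; the node count and file names are illustrative, only the wrapper-as-executable idea comes from the slides:

    JobType       = "MPICH";     // treated by the broker as an MPI job
    NodeNumber    = 7;           // number of requested nodes
    Executable    = "wrapperrb"; // the wrapper, not the real executable
    InputSandbox  = {"wrapperrb", "real_exe", "input.dat"};
    OutputSandbox = {"std.out", "std.err"};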
21. Broker exec step-by-step II.
- Resource Broker
  - handles requests of the Scheduler universe Condor job
  - sends the job to a CE
  - watches its execution
  - reports errors
- LRMS on the CE
  - allocates the requested number of nodes
  - starts wrapperrb on the master worker node using mpirun
22. Broker exec step-by-step III.
- Wrapperrb
  - the script is started by mpirun, so it runs on every allocated worker node like an MPICH process
  - checks whether remote input files are already present; if not, they are downloaded from the Storage Element (see the example below)
  - if the user specified any remote output files, existing copies are removed from the storage
  - the real executable is started with the arguments passed to the script; these already contain the MPICH-specific ones
  - after the executable has finished, remote output files are uploaded to the Storage Element (gLite only)
- Post script
  - nothing special
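As an illustration, wrapperrb could fetch a missing input from a Storage Element with the LCG data management utilities; the VO and file names are assumptions, not taken from the slides:

    # download the input by logical file name into the working directory
    lcg-cp --vo myvo lfn:/grid/myvo/inputs/input.dat file://$PWD/input.dat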
23. Broker execution
[Figure: data-flow diagram of brokered execution. Components: the Portal machine, the Resource Broker, a Storage Element, and the CE's front-end node (Globus, PBS) with a master worker node running mpirun and slave worker nodes WN1..WNn-1, each running wrapperrb and the real executable. Numbered arrows trace submission, file staging and output upload.]
24. Broker Submission Summary
- Pros
  - adds remote file handling support for legacy applications
  - extends the functionality of the EGEE broker
  - one solution supports both sequential and MPI applications
- Cons
  - slow application execution
  - status polling generates high load with 500 jobs
26. Thank you for your attention
- Questions?