Title: Slide 1
1. How Can the JSDL Exploit Parallelism?
I. Rodero, F. Guim, J. Corbalan, J. Labarta
{irodero, francesc.guim, julita.corbalan, jesus.labarta}@bsc.es
Barcelona Supercomputing Center (http://www.bsc.es)
Computer Architecture Department, Technical University of Catalonia (UPC)
CCGrid 2006, 19 May 2006, Singapore
2. Agenda
- Introduction to the JSDL
- Extension of the JSDL for Parallel Jobs (Proposal)
- Barcelona Supercomputing Center (BSC) Use Case
- Experiences
- Conclusions and Current Work
3. Job Description Language (JDL)
- Grid jobs need to be described accurately for their submission
  - Executable and arguments
  - Input, output and error files
  - Environment variables
  - Etc.
- And how to submit them
  - Requirements
  - Staging files
  - Etc.
- Needed for scheduling and for interacting with local schedulers
- Grid jobs are more sophisticated than typical POSIX applications.
4. Job Submission Description Language (JSDL)
- Different JDLs are defined in several projects:
  - European Data Grid JDL
  - SGE Execution Scripts
  - GRMS Job Description (GJD)
  - Resource Specification Language (RSL)
  - Etc.
- Standardization (GGF): JSDL
5. Job Submission Description Language (JSDL)
- JSDL just describes how to submit a job, not the job itself
  - Job Identification
  - Application
  - Resources
  - Data Staging
- Defined by the JSDL working group of the GGF
- XML based
- Current version: 1.0
- POSIX Extension
  - Description of an application executed on a POSIX compliant system
  - Executable, Arguments, Input, Output, Error, WorkingDirectory, Environment and some limits
6. Extension of the JSDL 1.0 for Parallel Jobs (Proposal)
7. Current Specification for Parallel Jobs
- The current specification (JSDL v1.0 + POSIX Extension) allows specifying some parallelism details:
  - TotalCPUCount
  - IndividualCPUCount
  - ...
- OK, this is suitable for homogeneous resources and rigid MPI applications.
- Of course it covers a lot of use cases, but not all (for example, ours).
- Problems:
  - There is no way to specify the programming model
  - No multilevel parallelism is taken into account (MPI oriented)
  - No information regarding the application behavior is included
TotalCPUCount: "This element is a range value specifying the total number of CPUs required for this job submission. If this is not present then it is not defined and the consuming system MAY choose any value."
IndividualCPUCount: "This element is a range value specifying the number of CPUs for each of the resources to be allocated to the job submission. If this is not present then it is not defined and the consuming system MAY choose any value."
8. Extension Proposal
- Our intention is to specify an extension for parallel applications
  - General
  - Suitable for all the programming models
  - Suitable for future approaches
- In this proposal we have chosen extension by new XML elements.
- The namespace prefix used for this schema is jsdl-par. Since this is only a proposal, the normative namespace is not given.
9. ParallelApplication Element
- ParallelApplication Element
  <ParallelApplication name=xsd:NCName?>
    <ApplType ... />
    <Levels ... />
    <LevelDescription ... />
    <xsd:any##other/>
  </ParallelApplication>
- Describes the parallelism details of an application:
  - type of the application
  - number of levels
  - topology of each level
- Sub-element of the JSDL Application element
- It MUST appear only once and MUST NOT appear for sequential applications.
10. Application Type Element
- ApplType Element
  <ApplType>
    jsdl-par:ParallelApplTypeEnumeration
  </ApplType>
- Enumeration type
- Type of application with respect to the programming model
- Should be interpreted by the local system to select the most suitable tool or runtime (e.g. POE, OpenMP runtime, etc.).
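As an illustration only, the enumeration could be declared in the schema along these lines; the values shown (mpi, openmp, mpi_openmp) are the ones used in this talk, and the normative value list is not fixed by the proposal:

<!-- Sketch of a possible XSD declaration for the application type.
     Value set assumed from the examples in this talk; not normative. -->
<xsd:simpleType name="ParallelApplTypeEnumeration">
  <xsd:restriction base="xsd:string">
    <xsd:enumeration value="mpi"/>
    <xsd:enumeration value="openmp"/>
    <xsd:enumeration value="mpi_openmp"/>
  </xsd:restriction>
</xsd:simpleType>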
11. Levels Element
- Levels Element
  <Levels> xsd:positiveInteger </Levels>
- Positive integer
- Number of parallelism levels of the application.
- Example: an MPI+OpenMP application has 2 levels of parallelism.
12. Level Description Element
- LevelDescription Element
  <LevelDescription level=xsd:positiveInteger malleable=xsd:boolean?>
    <Parallelism ... />?
    <Topology ... />?
    <xsd:any##other/>
  </LevelDescription>
- Complex type specifying the parallelism of the different levels
- The description of at least one level is required
- Attributes
  - level: the level of parallelism that the element describes. The first level of parallelism corresponds to the most external layer. Its type is xsd:positiveInteger and the default value is 1 (the case of an MPI application).
  - malleable: indicates whether the application is malleable (the number of processes/threads can be dynamically modified at run-time). Its type is xsd:boolean and the default is false.
- Moldable applications can be defined implicitly just by using an unrestricted topology and disabling the malleable attribute. Example:

<ParallelApplication>
  <ApplType>mpi</ApplType>
  <Levels>1</Levels>
  <LevelDescription level="1">
    <Parallelism>
      <jsdl:LowerBoundedRange>4.0</jsdl:LowerBoundedRange>
    </Parallelism>
    <Topology>unrestricted</Topology>
  </LevelDescription>
</ParallelApplication>
13. Parallelism Element
- Parallelism Element
  <Parallelism>
    jsdl:RangeValue_Type
  </Parallelism>?
- Range value
- Number of processes or threads
- The Topology element only makes sense if the value is not an exact number; in that case the application is moldable.
- Malleability is indicated by an attribute of the LevelDescription element.
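To illustrate the difference (arbitrary values, not taken from the examples in this talk): a rigid level fixes the count exactly, while a moldable level gives the local system a range to choose from, guided by the Topology element:

<!-- Rigid level: exactly 8 processes; a Topology element adds nothing. -->
<Parallelism>
  <jsdl:Exact>8.0</jsdl:Exact>
</Parallelism>

<!-- Moldable level: at least 4 processes; the LRMS chooses the final
     number according to the Topology of the same LevelDescription. -->
<Parallelism>
  <jsdl:LowerBoundedRange>4.0</jsdl:LowerBoundedRange>
</Parallelism>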
14. Topology Element
- Topology Element
  <Topology>
    jsdl-par:ParallelApplTopologyEnumeration
  </Topology>
- Enumeration type
- Topology of the application
- Used by the LRMS to decide the number of processes or threads that will be spawned for the application.
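Again as a sketch only, a possible schema declaration; "unrestricted" and "power2" are the only values that appear in the examples of this talk, and the proposal does not fix a normative list:

<!-- Sketch of a possible XSD declaration for the topology enumeration.
     Value set assumed from the examples in this talk; not normative. -->
<xsd:simpleType name="ParallelApplTopologyEnumeration">
  <xsd:restriction base="xsd:string">
    <xsd:enumeration value="unrestricted"/>
    <xsd:enumeration value="power2"/>
  </xsd:restriction>
</xsd:simpleType>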
15. Complete Schema
16. Document Example
- MPI+OpenMP application
- At least 4 MPI processes (not limited)
- The OpenMP threads must be a power of 2 (if not enough CPUs are available, it should spawn only 8, 4, 2 or 1 threads)
- The OpenMP threads can be modified dynamically (malleable)

<ParallelApplication>
  <ApplType>mpi_openmp</ApplType>
  <Levels>2</Levels>
  <LevelDescription level="1">
    <Parallelism>
      <jsdl:LowerBoundedRange>4.0</jsdl:LowerBoundedRange>
    </Parallelism>
    <Topology>unrestricted</Topology>
  </LevelDescription>
  <LevelDescription level="2" malleable="true">
    <Parallelism>
      <jsdl:UpperBoundedRange>16.0</jsdl:UpperBoundedRange>
    </Parallelism>
    <Topology>power2</Topology>
  </LevelDescription>
</ParallelApplication>
17. Barcelona Supercomputing Center (BSC) Use Case
http://www.bsc.es/grid/enanos
18. Motivation: eNANOS Project
- HPC applications
  - Hybrid MPI+OpenMP programming model (including pure MPI and OpenMP)
- HPC Grids, composed of HPC resources
  - Clusters of SMPs, CC-NUMA architectures with a medium to high number of processors
  - Heterogeneous resources
- Efficient execution on the local resources
- Coordination between the Grid and local levels
- Experience obtained from the HPC-Europa project
19. eNANOS Execution Framework
1. The Grid broker receives a JSDL 1.0 document from the HPC-Europa Portal.
2. The JSDL document is converted to RSL, because the Grid broker is built on top of the Globus infrastructure; some details are passed in environment variables.
3. The job manager transforms the RSL document into a LoadLeveler script.
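As a sketch of step 2 (an assumed shape, not the broker's actual output): a GRAM RSL fragment for Example 1 later in this talk could carry the second-level information through the environment attribute, using the same variable names (OMP_NUM_THREADS, PAR_MALLEABLE) that show up in the generated LoadLeveler scripts:

& (executable  = /user1/uni/upc/ac/irodero/enanos/benchmarks/ExecNas)
  (arguments   = "bt-mz.A" "2" "4")
  (jobType     = mpi)
  (count       = 2)
  (environment = (OMP_SCHEDULE static)
                 (OMP_NUM_THREADS 4)
                 (PAR_MALLEABLE true))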
20. Examples Description
- The following examples show two kinds of files:
  - JSDL documents
    - Real jobs (NAS BT and simple MPI)
    - Described with the proposal (CPMD)
  - LoadLeveler scripts obtained from the previous JSDL documents. Some problems:
    - LL is not able to manage multilevel parallel applications.
    - Some semantics are lost.
    - The total_tasks field is used for the total number of MPI processes.
    - There is no mechanism to specify the number of OpenMP threads.
    - The eNANOS Scheduler manages the second level of parallelism and other details.
    - Currently we use environment variables as the mechanism.
21. Example 1: NAS BT-MZ (multilevel)
- The JSDL document describes a job composed of the NAS BT-MZ benchmark (MPI+OpenMP), class A. This job has 2 levels of parallelism.
- 2 MPI processes and 4 OpenMP threads per process are requested, and the standard error and output are redirected to a specific machine (pcmas).
<?xml version="1.0" encoding="UTF-8"?>
<JobDefinition xmlns="http://schemas.ggf.org/jsdl/2005/10/jsdl">
  <JobDescription>
    <JobIdentification>
      <Description>Execution of a NAS MultiZone class A</Description>
      <JobProject>BSC_Test</JobProject>
    </JobIdentification>
    <Application>
      <ns1:POSIXApplication xmlns:ns1="http://schemas.ggf.org/jsdl/2005/06/jsdl-posix">
        <ns1:Executable filesystemName="__user1_uni_upc_ac_irodero_enanos_benchmarks_">ExecNas</ns1:Executable>
        <ns1:Argument>bt-mz.A</ns1:Argument>
        <ns1:Argument>2</ns1:Argument>
        <ns1:Argument>4</ns1:Argument>
        <ns1:Output>BT.A.OUT</ns1:Output>
        <ns1:Error>BT.A.ERR</ns1:Error>
        <ns1:Environment name="OMP_SCHEDULE">static</ns1:Environment>
        <ns1:Environment name="THREAD_BOUND">1</ns1:Environment>
      </ns1:POSIXApplication>
    </Application>
    <Resources>
      <CandidateHosts>
        <HostName>kadesh8.cepba.upc.edu</HostName>
      </CandidateHosts>
      <FileSystem name="__user1_uni_upc_ac_irodero_enanos_benchmarks_">
        <MountPoint>/user1/uni/upc/ac/irodero/enanos/benchmarks</MountPoint>
      </FileSystem>
    </Resources>
    (...)
22. Example 1: NAS BT-MZ (multilevel)

    (...)
    <DataStaging>
      <FileName>BT.A.ERR</FileName>
      <CreationFlag>append</CreationFlag>
      <DeleteOnTermination>false</DeleteOnTermination>
      <Target>
        <URI>gsiftp://pcmas.ac.upc.es/home/irodero/tests/BT.A.ERR</URI>
      </Target>
    </DataStaging>
    <DataStaging>
      <FileName>BT.A.OUT</FileName>
      <CreationFlag>append</CreationFlag>
      <DeleteOnTermination>false</DeleteOnTermination>
      <Target>
        <URI>gsiftp://pcmas.ac.upc.es/home/irodero/tests/BT.A.OUT</URI>
      </Target>
    </DataStaging>
  </JobDescription>
  <ns2:ParallelApplication xmlns:ns2="http://schemas.ggf.org/jsdl/2006/03/jsdl-par">
    <ns2:ApplType>mpi_openmp</ns2:ApplType>
    <ns2:Levels>2</ns2:Levels>
    <ns2:LevelDescription level="1">
      <ns2:Parallelism>
        <jsdl:Exact>2.0</jsdl:Exact>
      </ns2:Parallelism>
    </ns2:LevelDescription>
    <ns2:LevelDescription level="2">
      <ns2:Parallelism>
        <jsdl:Exact>4.0</jsdl:Exact>
      </ns2:Parallelism>
    </ns2:LevelDescription>
  </ns2:ParallelApplication>
</JobDefinition>
23. Example 1: NAS BT-MZ (multilevel)
- LoadLeveler Script
- We have reused existing ways to specify parallelism (i.e. OMP_NUM_THREADS)
#!/bin/sh
# Job command file created by GRAM/JobManager/loadleveler.pm
# @ job_type = parallel
# @ initialdir = /user1/uni/upc/ac/irodero
# @ input = /dev/null
# @ output = /user1/uni/upc/ac/irodero/.globus/.gass_cache/local/md5/37/c304c3b0417c6d05ad764f2f3db3e0/md5/2e/1cdbe76729af40306461c2447b1e4c/data
# @ error = /user1/uni/upc/ac/irodero/.globus/.gass_cache/local/md5/37/c304c3b0417c6d05ad764f2f3db3e0/md5/e0/db2941aeb23a18167bd2c821a2d7ba/data
# @ account_no = BSC_Test
# @ class = short
# @ restart = yes
# @ requirements = (LL_Version > "2.0") && (Adapter == "ethernet")
# @ total_tasks = 2
# @ node = 1
# @ environment = COPY_ALL \
    X509_USER_PROXY=/user1/uni/upc/ac/irodero/.globus/.gass_cache/local/md5/37/c304c3b0417c6d05ad764f2f3db3e0/md5/b7/c5ea252c9f577a52da2db277d3058b/data \
    GLOBUS_LOCATION=/aplic/GLOBUS/2.4 \
    GLOBUS_GRAM_JOB_CONTACT=https://kadesh8.cepba.upc.edu:37895/33752/1143561915/ \
    GLOBUS_GRAM_MYJOB_CONTACT=URLx-nexus://kadesh8.cepba.upc.edu:37896/ \
    HOME=/user1/uni/upc/ac/irodero \
    LOGNAME=irodero \
    GRID_ID_ENV=1@1143561909633 \
    OMP_SCHEDULE=static \
    OMP_NUM_THREADS=4 \
    THREAD_BOUND=1 \
    PAR_MALLEABLE=true
# @ queue
/user1/uni/upc/ac/irodero/enanos/benchmarks/ExecNas bt-mz.A 2 4
# End of job command file.
MPI processes
OpenMP threads
Is not malleable
24. Example 2: Simple MPI (Solver)
- The JSDL document describes a job composed of a simple MPI application. In particular, it is a typical solver that should be executed with 16 MPI processes.
<?xml version="1.0" encoding="UTF-8"?>
<JobDefinition xmlns="http://schemas.ggf.org/jsdl/2005/10/jsdl">
  <JobDescription>
    <JobIdentification>
      <Description>Execution of a simple MPI-based Solver</Description>
      <JobProject>BSC_Test</JobProject>
    </JobIdentification>
    <Application>
      <ns1:POSIXApplication xmlns:ns1="http://schemas.ggf.org/jsdl/2005/06/jsdl-posix">
        <ns1:Executable filesystemName="__user1_uni_upc_ac_irodero_enanos__solver_">Solver</ns1:Executable>
        <ns1:Output>/user1/uni/upc/ac/irodero/enanos/solver/solver.out</ns1:Output>
        <ns1:Error>/user1/uni/upc/ac/irodero/enanos/solver/solver.out</ns1:Error>
      </ns1:POSIXApplication>
    </Application>
    <Resources>
      <CandidateHosts>
        <HostName>kadesh8.cepba.upc.edu</HostName>
      </CandidateHosts>
      <FileSystem name="__user1_uni_upc_ac_irodero_enanos_solver_">
        <MountPoint>/user1/uni/upc/ac/irodero/enanos/benchmarks/solver</MountPoint>
      </FileSystem>
    </Resources>
  </JobDescription>
  <ns2:ParallelApplication xmlns:ns2="http://schemas.ggf.org/jsdl/2006/03/jsdl-par">
    <ns2:ApplType>mpi</ns2:ApplType>
    <ns2:Levels>1</ns2:Levels>
    <ns2:LevelDescription level="1">
      <ns2:Parallelism>
        <jsdl:Exact>16.0</jsdl:Exact>
      </ns2:Parallelism>
    </ns2:LevelDescription>
  </ns2:ParallelApplication>
</JobDefinition>
25. Example 2: Simple MPI (Solver)
- LoadLeveler Script
- This is a case in which the LoadLeveler system can manage the job by itself, because it only has one level of parallelism and it is a rigid application. The added information is only for our execution framework.
#!/bin/sh
# Job command file created by GRAM/JobManager/loadleveler.pm
# @ job_type = parallel
# @ initialdir = /user1/uni/upc/ac/irodero/enanos/solver
# @ input = /dev/null
# @ output = /user1/uni/upc/ac/irodero/enanos/solver/solver.out
# @ error = /user1/uni/upc/ac/irodero/enanos/solver/solver.err
# @ class = short
# @ restart = yes
# @ total_tasks = 16
# @ node = 1
# @ environment = COPY_ALL \
    X509_USER_PROXY=/user1/uni/upc/ac/irodero/.globus/.gass_cache/local/md5/37/c304c3b0417c6d05ad764f2f3db3e0/md5/b7/c5ea252c9f577a52da2db277d3058b/data \
    GLOBUS_LOCATION=/aplic/GLOBUS/2.4 \
    GLOBUS_GRAM_JOB_CONTACT=https://kadesh8.cepba.upc.edu:37895/33752/1143561915/ \
    GLOBUS_GRAM_MYJOB_CONTACT=URLx-nexus://kadesh8.cepba.upc.edu:37896/ \
    HOME=/user1/uni/upc/ac/irodero \
    LOGNAME=irodero \
    GRID_ID_ENV=7@1143561974629
# @ queue
/user1/uni/upc/ac/irodero/enanos/solver/Solver
# End of job command file.
MPI processes
26. Example 3: CPMD (malleable)
- The JSDL document describes a multilevel parallel job composed of a CPMD application (MPI+OpenMP).
- The job requires 2 MPI processes, but the second level of parallelism (OpenMP threads) is expected to be a power of 2 with a maximum of 16 threads. The second level of parallelism is malleable as well.
<?xml version="1.0" encoding="UTF-8"?>
<JobDefinition xmlns="http://schemas.ggf.org/jsdl/2005/10/jsdl">
  <JobDescription>
    <JobIdentification>
      <Description>Execution of a CPMD application</Description>
      <JobProject>BSC_Test</JobProject>
    </JobIdentification>
    <Application>
      <ns1:POSIXApplication xmlns:ns1="http://schemas.ggf.org/jsdl/2005/06/jsdl-posix">
        <ns1:Executable>/scratch_tmp/irodero/CPMD-3.9.1/cpmd.x</ns1:Executable>
        <ns1:Argument>/scratch_tmp/irodero/CPMD-3.9.1/inputs/small.inp</ns1:Argument>
        <ns1:Output>cpmd.4.pwr4.out</ns1:Output>
        <ns1:Error>cpmd.4.pwr4.err</ns1:Error>
        <ns1:Environment name="PP_LIBRARY_PATH">/scratch_tmp/irodero/CPMD-3.9.1/PP_LIB</ns1:Environment>
        <ns1:Environment name="OMP_SCHEDULE">static</ns1:Environment>
      </ns1:POSIXApplication>
    </Application>
    <Resources>
      <CandidateHosts>
        <HostName>kadesh8.cepba.upc.edu</HostName>
      </CandidateHosts>
    </Resources>
  </JobDescription>
  (...)
27. Example 3: CPMD (malleable)

  (...)
  <ns2:ParallelApplication xmlns:ns2="http://schemas.ggf.org/jsdl/2006/03/jsdl-par">
    <ns2:ApplType>mpi_openmp</ns2:ApplType>
    <ns2:Levels>2</ns2:Levels>
    <ns2:LevelDescription level="1">
      <ns2:Parallelism>
        <jsdl:Exact>2.0</jsdl:Exact>
      </ns2:Parallelism>
    </ns2:LevelDescription>
    <ns2:LevelDescription level="2" malleable="true">
      <ns2:Parallelism>
        <jsdl:UpperBoundedRange>16.0</jsdl:UpperBoundedRange>
      </ns2:Parallelism>
      <ns2:Topology>power2</ns2:Topology>
    </ns2:LevelDescription>
  </ns2:ParallelApplication>
</JobDefinition>
28. Example 3: CPMD (malleable)
- LoadLeveler Script
- It is a multilevel application
- The OpenMP level is malleable (PAR_MALLEABLE=true)
#!/bin/sh
# Job command file created by GRAM/JobManager/loadleveler.pm
# @ job_type = parallel
# @ initialdir = /scratch_tmp/irodero/cpmd
# @ input = /dev/null
# @ output = cpmd.4.pwr4.out
# @ error = cpmd.4.pwr4.err
# @ class = short
# @ restart = yes
# @ total_tasks = 2
# @ node = 1
# @ environment = COPY_ALL \
    MP_EUILIB=ip \
    MP_EUIDEVICE=en0 \
    PP_LIBRARY_PATH=/scratch_tmp/irodero/CPMD-3.9.1/PP_LIB \
    GRID_ID_ENV=24@1143531900624 \
    OMP_NUM_THREADS=16 \
    PAR_TOPOLOGY=power2 \
    PAR_MALLEABLE=true
# @ queue
/scratch_tmp/irodero/CPMD-3.9.1/cpmd.x /scratch_tmp/irodero/CPMD-3.9.1/inputs/small.inp
# End of job command file.
MPI processes
OpenMP is malleable
29. Experiences at BSC
- NAS BT-MZ, MPI+OpenMP (multilevel)
- Power3 SMP-based node with 16 CPUs
- The original JSDL 1.0 causes overload of the node
- The extension allows a good resource usage
[Chart: execution time (seconds). Callouts: poor performance is obvious when 4 MPI processes x 8 OpenMP threads are spawned; dynamic management is needed.]
30. Conclusions and Current Work
- Some extensions are needed for parallel jobs
- Multilevel parallel applications should be taken into account
- JSDL can be extended to achieve these goals
- We are currently working on new mechanisms at the local systems to implement all the functionality in a more convenient way.
- Collaborating with the GGF JSDL Working Group
  - Extension proposal presented at GGF16, Athens, February 13-16, 2006.
  - BSC use case presented at GGF17, Japan, May 10-12, 2006.
31. Thank you for your attention!