Title: Job Submission
1Job Submission
- Fokke Dijkstra RuG/SARA
- Grid tutorial Groningen September 2006
2Contents
- The LCG Workload Management System (WMS) in gLite
- Job Submission to EGEE / NL-Grid
- Job Preparation
- A simple example Job Lifecycle
- Job Description Language (JDL)
- Job Submission Monitoring
- Some more advanced topics
3WMS
?
4The LCG WMS
- The user submits jobs via the Workload Management System
- The goal of the WMS is distributed scheduling and resource management in a Grid environment.
- What does it allow Grid users to do?
- To submit their jobs
- To execute them
- To get information about their status
- To retrieve their output
- The WMS tries to
- Optimize the usage of resources
- Execute user jobs as fast as possible
5WMS components
6Job Preparation
- You need to provide
- A complete (enough) job description
- What program?
- What data?
- Any requirements on OS, installed software, ...
- Possibly a program
- You're submitting in unknown territory!
- Program portably!
- Don't rely on hard-coded paths or special locations
- The program you send may not even be in $HOME!
- Perhaps some input data
- Perhaps instructions on what to do with the output
7How to Write a Job Description
- Here is a minimal job description (call it hello.jdl)
- We specified
- The program to run and its arguments
- Directed the standard error and output streams to files
- Told it what to do with the output
Executable = "/bin/echo";
Arguments = "Goedemiddag";
StdError = "stderr.log";
StdOutput = "stdout.log";
OutputSandbox = {"stderr.log", "stdout.log"};
8Job Submission Example
- User issues a voms-proxy-init
- enters his certificate's password
- Receives a valid Globus proxy
- User issues an edg-job-submit mytest.jdl
- and gets back from the system a unique Job Identifier (JobId)
- User issues an edg-job-status JobId
- to get logging information about the current status of his job
- When the OutputReady status is reached, the user can issue an edg-job-get-output JobId
- and the system returns the name of the temporary directory where the job output can be found on the UI machine.
9Submitting it
- voms-proxy-init --voms tutor
- Your identity: /O=edgtutorial/O=users/O=rug/OU=rc/CN=Fokke Dijkstra
- Enter GRID pass phrase:
- Creating temporary proxy ................................................................. Done
- Contacting mu4.matrix.sara.nl:30007 [/O=dutchgrid/O=hosts/OU=sara.nl/CN=mu4.matrix.sara.nl] "tutor" Done
- Creating proxy .............................................. Done
- Your proxy is valid until Mon Sep 11 23:22:12 2006
- edg-job-submit hello.jdl
- Selected Virtual Organisation name (from UI conf file): tutor
- Connecting to host mu3.matrix.sara.nl, port 7772
- Logging to host mu3.matrix.sara.nl, port 9002
- JOB SUBMIT OUTCOME
- The job has been successfully submitted to the Network Server.
- Use edg-job-status command to check job current status. Your job identifier (edg_jobId) is:
- https://mu3.matrix.sara.nl:9000/Nz6PWWJCjtT7YY3PJWDu5Q
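When the submission is scripted, the job identifier has to be picked out of the text that edg-job-submit prints. The sketch below shows one way to do that in Python. It assumes the identifier always has the shape seen in this example (https://host:port/unique-string); the pattern and the function name extract_job_id are illustrative and not part of the WMS tools.

```python
import re

# Assumed shape of a job identifier, based on the example in this tutorial:
# https://<host>:<port>/<unique string>
JOB_ID_PATTERN = re.compile(r"https://[\w.-]+:\d+/[\w-]+")


def extract_job_id(submit_output):
    """Return the first job identifier found in the text, or None."""
    match = JOB_ID_PATTERN.search(submit_output)
    return match.group(0) if match else None


sample = """The job has been successfully submitted to the Network Server.
Your job identifier (edg_jobId) is:
- https://mu3.matrix.sara.nl:9000/Nz6PWWJCjtT7YY3PJWDu5Q
"""
print(extract_job_id(sample))
```

The -o option of edg-job-submit, described later, writes the identifier to a file and is usually the simpler route.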
10A Job Submission Example
[Diagram of the Grid components: User Interface (UI), Resource Broker (RB), Job Submission Service (JSS), Computing Element (CE), Storage Element (SE), Information System (IS), LCG File Catalog (LFC), Logging & Bookkeeping (LB). Job Status: submitted]
11Checking the status
- edg-job-status https://mu3.matrix.sara.nl:9000/Nz6PWWJCjtT7YY3PJWDu5Q
- BOOKKEEPING INFORMATION
- Status info for the Job: https://mu3.matrix.sara.nl:9000/Nz6PWWJCjtT7YY3PJWDu5Q
- Current Status: Done (Success)
- Exit code: 0
- Status Reason: Job terminated successfully
- Destination: mu6.matrix.sara.nl:2119/jobmanager-pbs-long
- reached on: Tue Jun 1 08:14:25 2004
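A script that waits for a job usually needs only a few fields from this report. The following Python sketch collects "Name: value" lines into a dictionary. It assumes each field sits on its own line with a colon after the field name, as in the listing above; parse_status is a hypothetical helper, not a WMS command.

```python
def parse_status(output):
    """Collect 'Name: value' lines of a status report into a dict."""
    fields = {}
    for line in output.splitlines():
        # Split at the first colon only, so values may contain colons (URLs, times).
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields


report = """BOOKKEEPING INFORMATION
Status info for the Job: https://mu3.matrix.sara.nl:9000/Nz6PWWJCjtT7YY3PJWDu5Q
Current Status: Done (Success)
Exit code: 0
Destination: mu6.matrix.sara.nl:2119/jobmanager-pbs-long
"""
status = parse_status(report)
print(status["Current Status"])
```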
12A Job Submission Example
[Diagram of the Grid components: User Interface (UI), Resource Broker (RB), Job Submission Service (JSS), Computing Element (CE), Storage Element (SE), Information System (IS), LCG File Catalog (LFC), Logging & Bookkeeping (LB). Job Status: submitted]
13Getting the Output
- edg-job-get-output https://mu3.matrix.sara.nl:9000/Nz6PWWJCjtT7YY3PJWDu5Q
- Retrieving files from host mu3.matrix.sara.nl (for https://mu3.matrix.sara.nl:9000/Nz6PWWJCjtT7YY3PJWDu5Q)
- JOB GET OUTPUT OUTCOME
- Output sandbox files for the job:
- https://mu3.matrix.sara.nl:9000/Nz6PWWJCjtT7YY3PJWDu5Q
- have been successfully retrieved and stored in the directory:
- /tmp/jobOutput/fokke_Nz6PWWJCjtT7YY3PJWDu5Q
- cat /tmp/jobOutput/fokke_Nz6PWWJCjtT7YY3PJWDu5Q/stdout.log
- Goedemiddag
14A Job Submission Example
[Diagram of the Grid components: Resource Broker (RB), Job Submission Service (JSS), Computing Element (CE), Storage Element (SE), Information System (IS), LCG File Catalog (LFC), Logging & Bookkeeping (LB). Job Status sequence: submitted, waiting, ready, scheduled, running, done, outputready]
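The order of the job states can be written down as a simple list, which is enough for a script to decide whether it is time to fetch the output. This Python sketch covers only the normal lifecycle shown in this example. Failure states such as an aborted or cancelled job are deliberately left out, and the function names are illustrative.

```python
# Job states in the order a successful job passes through them.
JOB_STATES = ["submitted", "waiting", "ready", "scheduled",
              "running", "done", "outputready"]


def has_reached(current, target):
    """True if `current` is at or past `target` in the normal lifecycle."""
    return JOB_STATES.index(current) >= JOB_STATES.index(target)


def output_can_be_fetched(current):
    """The output sandbox is retrieved once the outputready state is reached."""
    return has_reached(current, "outputready")


print(has_reached("running", "scheduled"))
print(output_can_be_fetched("done"))
```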
15Job Description Language (JDL)
- Based upon Condor's CLASSified ADvertisement language (ClassAd)
- ClassAd is an extensible language
- Sequence of attributes (key, value pairs) separated by semi-colons.
Executable = "/bin/echo";
Arguments = "Goedemiddag";
StdError = "stderr.log";
StdOutput = "stdout.log";
OutputSandbox = {"stderr.log", "stdout.log"};
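Because a JDL file is just a sequence of attribute = value pairs ended by semi-colons, it is easy to generate from a script. The Python sketch below renders a dictionary as JDL text. It handles only strings, numbers and lists of those. Expressions such as Requirements and Rank are not covered, and the helper names jdl_value and to_jdl are made up for this illustration.

```python
def jdl_value(value):
    """Format a Python value as a JDL literal (strings, numbers, lists only)."""
    if isinstance(value, (list, tuple)):
        return "{" + ", ".join(jdl_value(v) for v in value) + "}"
    if isinstance(value, str):
        return '"' + value + '"'
    return str(value)


def to_jdl(attributes):
    """Render attribute/value pairs as 'Name = value;' lines."""
    return "\n".join(f"{name} = {jdl_value(value)};"
                     for name, value in attributes.items())


hello = {
    "Executable": "/bin/echo",
    "Arguments": "Goedemiddag",
    "StdError": "stderr.log",
    "StdOutput": "stdout.log",
    "OutputSandbox": ["stderr.log", "stdout.log"],
}
print(to_jdl(hello))
```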
16Types of Attributes
- The supported attributes are grouped in two categories
- Job
- Define the job itself
- Resources
- Taken into account by the RB for carrying out the matchmaking algorithm
- Computing Resource (Attributes)
- Used to build expressions of Requirements and/or Rank attributes by the user
- Have to be prefixed with "other."
- Data and Storage resources (Attributes)
- Input data to process, SE where to store output data, protocols spoken by the application when accessing SEs
17Job Definition Attributes
- Executable (mandatory)
- The command name
- Arguments (optional)
- Job command line arguments
- StdInput, StdOutput, StdError (optional)
- Standard input/output/error of the job
- Environment (optional)
- List of environment settings
- InputSandbox (optional)
- List of files on the UI local disk needed by the job for running
- The listed files are staged from the UI to the remote CE
- OutputSandbox (optional)
- List of files, generated by the job, which have to be retrieved
18Resource Attributes
- Requirements
- Job requirements on computing resources
- Specified using attributes of resources published in the Information System
- If not specified, the default value defined in the UI configuration file is used
- Default: other.GlueCEStateStatus == "Production" (the resource has to be in the Production grid)
- Rank
- Expresses preference (how to rank resources that have already met the Requirements expression)
- Specified using attributes of resources published in the Information System
- If not specified, the default value defined in the UI configuration file is used
- Default: other.GlueCEStateFreeCPUs (the highest number of free CPUs)
19Data Attributes
- InputData (optional)
- Refers to data used as input by the job (these data are published in the Replica Catalog and stored in the SEs)
- PFNs and/or LFNs
- DataAccessProtocol (mandatory if InputData specified)
- The protocol or the list of protocols which the application is able to speak for accessing InputData on a given SE
- OutputSE (optional)
- The hostname of the output SE
- RB uses it to choose a CE that is compatible with the job and is close to the SE
- OutputData (optional)
- Output data that will be registered at the end of the job
20Example JDL File
Executable = "gridTest";
StdError = "stderr.log";
StdOutput = "stdout.log";
InputSandbox = {"/home/joda/test/gridTest"};
OutputSandbox = {"stderr.log", "stdout.log"};
InputData = "lfn:/grid/tutor/testbed0-00019";
DataAccessProtocol = "gridftp";
Requirements = other.Architecture == "INTEL" && \
               other.OpSys == "LINUX" && other.FreeCpus >= 4;
Rank = other.GlueHostBenchmarkSF00;
21Job Submission
- edg-job-submit [-r <res_id>] [-n <user e-mail address>] [-c <config file>] [-o <output file>] <job.jdl>
- -r: the job is submitted by the RB directly to the computing element identified by <res_id>
- -c: the configuration file <config file> is used by the UI instead of the standard configuration file
- -o: the generated edg_jobId is written in the <output file>
- Useful for other commands, e.g.
- edg-job-status -i <input file> (or edg_jobId)
- -i: the status information about the edg_jobIds contained in the <input file> is displayed
- --vo: the VO under which the job will be run
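When jobs are submitted from a script, it helps to assemble the command line in one place. The Python sketch below builds the argument list for edg-job-submit from the options listed above and puts the JDL file last. The function name and its keyword arguments are invented for this example; only the flags themselves come from the slide.

```python
def submit_command(jdl_file, resource=None, config=None, output=None, vo=None):
    """Build the argument list for edg-job-submit; the JDL file goes last."""
    cmd = ["edg-job-submit"]
    if vo:
        cmd += ["--vo", vo]          # VO under which the job will run
    if resource:
        cmd += ["-r", resource]      # submit directly to this CE
    if config:
        cmd += ["-c", config]        # alternative UI configuration file
    if output:
        cmd += ["-o", output]        # file that receives the edg_jobId
    cmd.append(jdl_file)
    return cmd


print(submit_command("hello.jdl", output="jobids.txt", vo="tutor"))
```

The resulting list can be handed to subprocess.run on a UI machine where the edg-job-* commands are installed.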
22Other WMS UI Commands
- edg-job-list-match
- Lists resources matching a job description
- Performs the matchmaking without submitting the job
- edg-job-cancel
- Cancels a given job
- edg-job-status
- Displays the status of the job
- edg-job-get-output
- Returns the job output (the OutputSandbox files) to the user
- edg-job-get-logging-info
- Displays logging information about submitted jobs (all the events pushed by the various components of the WMS)
- Very useful for debugging purposes
23WMS Match Making
- The RB is the core component of WMS.
- It has to find the most suitable computing resource (CE) where the job will be executed
- It interacts with the Data Management service and the Information System
- They supply the RB with all the information required for the resolution of the matches
- The CE chosen by the RB has to match the job requirements (e.g. runtime environment, data access requirements, and so on)
- If 2 or more CEs satisfy all the requirements, the one with the best Rank is chosen
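The two steps, filtering on the requirements and then picking the best rank, can be sketched in a few lines of Python. This is a simplified model and not the RB's actual implementation: CEs are plain dictionaries, the requirements and rank are ordinary Python functions, and the CE names and values are invented.

```python
def choose_ce(ces, requirements, rank):
    """Keep the CEs that satisfy `requirements`, then return the best-ranked one."""
    candidates = [ce for ce in ces if requirements(ce)]
    if not candidates:
        return None  # no CE matches the job
    return max(candidates, key=rank)


# Hypothetical CEs with two published attributes each.
ces = [
    {"id": "ce-a", "GlueCEStateStatus": "Production", "GlueCEStateFreeCPUs": 4},
    {"id": "ce-b", "GlueCEStateStatus": "Production", "GlueCEStateFreeCPUs": 12},
    {"id": "ce-c", "GlueCEStateStatus": "Closed", "GlueCEStateFreeCPUs": 40},
]

best = choose_ce(
    ces,
    requirements=lambda ce: ce["GlueCEStateStatus"] == "Production",
    rank=lambda ce: ce["GlueCEStateFreeCPUs"],
)
print(best["id"])  # ce-b
```

Note that ce-c has the most free CPUs but is never ranked, because it fails the requirements check first.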
24Direct Job submission
- The RB has to deal with three possible scenarios.
- Scenario 1 Direct Job Submission
- Job is scheduled on a given CE (specified in the edg-job-submit command via the -r option)
- RB doesn't perform any matchmaking algorithm
- Take care if InputData is specified!
25Brokered Job Submission, No InputData
- Scenario 2: Job Submission without data-access requirements
- Neither CE nor input data are specified.
- RB starts the matchmaking algorithm, which consists of two phases
- Requirements check (RB contacts the IS to check which CEs satisfy all the requirements)
- Rank computation: if more than one CE satisfies the job requirements, the CE with the best rank is chosen by the RB
26Brokered Job Submission, Grid Data
- Scenario 3: CE is not specified in the JDL
- RB contacts the Data Management service to find out which SEs have copies of the requested input data sets
- RB makes a best effort match between
- Computing resources for which the user is authorized
- SEs nearby which can provide the requested data sets via the requested transfer protocol
- Any optional output SE specified in the job description
- RB strategy consists of submitting jobs close to data!
- The main two phases of the matchmaking algorithm remain unchanged
- Requirements check
- Rank computation
- The matchmaking is only performed for CEs satisfying the data-access requirements (i.e. which are close to data)
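The extra data-access step amounts to a pre-filter applied before the usual requirements check and rank computation. The Python sketch below illustrates the idea under simplifying assumptions: each CE lists its close SEs, a replica table maps each logical file name to the SEs holding a copy, and access protocols are ignored. All names and data shapes are hypothetical.

```python
def ces_close_to_data(ces, replicas, input_files):
    """Return the CEs whose close SEs hold a replica of every input file."""
    result = []
    for ce in ces:
        close = set(ce["close_ses"])
        # Every requested file needs at least one replica on a close SE.
        if all(close & set(replicas.get(name, ())) for name in input_files):
            result.append(ce)
    return result


ces = [
    {"id": "ce-a", "close_ses": ["se1"]},
    {"id": "ce-b", "close_ses": ["se2"]},
]
replicas = {"lfn:/grid/tutor/testbed0-00019": ["se2"]}

nearby = ces_close_to_data(ces, replicas, ["lfn:/grid/tutor/testbed0-00019"])
print([ce["id"] for ce in nearby])  # ['ce-b']
```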
27Proxy Renewal
- Why?
- To avoid job failure because it outlived the validity of the initial proxy
- WMS supports an automatic proxy renewal mechanism as long as the user credentials are handled by a proxy server.
- Create a proxy using
- voms-proxy-init
- Register this proxy with the MyProxy server using
- myproxy-init -s <server> -t <cred> -c <proxy> -d -n
- server is the server address (e.g. px.matrix.sara.nl)
- cred is the number of hours the proxy should be valid on the server
- proxy is the number of hours renewed proxies should be valid
- Short term proxies can then be used to start jobs using
- grid-proxy-init -hours <hours> command
- The proxy is automatically renewed by WMS without user intervention for the whole job lifetime
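To see when renewal matters, compare the proxy expiry time with the time the job is expected to finish. The Python sketch below is only that arithmetic. The function name is made up, and the expected runtime is an estimate the user must supply (queueing time counts as well).

```python
from datetime import datetime, timedelta


def proxy_outlives_job(proxy_expiry, expected_runtime, now):
    """True if the job is expected to finish before the proxy expires."""
    return now + expected_runtime <= proxy_expiry


# Proxy from the earlier example: valid until Mon Sep 11 23:22:12 2006.
expiry = datetime(2006, 9, 11, 23, 22, 12)
now = datetime(2006, 9, 11, 11, 22, 12)

print(proxy_outlives_job(expiry, timedelta(hours=8), now))
print(proxy_outlives_job(expiry, timedelta(hours=24), now))
```

In the second case the job would outlive the proxy, which is the situation the MyProxy registration is meant to handle.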
28MPI jobs
- MPI
- Message passing
- Link with parallel library
- Run on multiple processors
- gLite
- Limited support
- Some sites can run MPI jobs
- JobType
- JobType = "MPICH";
- NodeNumber = 8;
- Adds MPICH support as requirement
- Executable run in parallel on 8 CPUs
29Other JobTypes
- Interactive
- StdOutput, StdInput and StdError forwarded to user
- default X window
- Other tools
- Checkpointable
- Job must save checkpoints
- Checkpoints can be retrieved
- Not fully supported yet
30Further Information
- The gLite User Guide!
- http://glite.web.cern.ch/glite/documentation/default.asp
- ClassAd: https://www.cs.wisc.edu/condor/classad/
- Sara Grid pages: http://www.sara.nl/userinfo/grid/
31UI configuration file
- Can be set if the (expert) user is not happy with the default one
- Most relevant attributes
- RB(s)
- When submitting a job, the first specified RB is tried; if the operation fails the second one is considered, etc.
- LBserver(s)
- The LB to be used for a job is chosen by the RB
- So when an edg-job-status <edg-jobid> is issued, the LB to contact is specified in the edg-jobid
- This list specifies the LB(s) that must be contacted when issuing an edg-job-status -all / edg-job-get-logging-info -all (to have information for all the jobs belonging to that user)
- Default JDL Requirements
- other.GlueCEStateStatus == "Production"
- Default JDL Rank
- other.GlueCEStateFreeCPUs
- Default Virtual Organisation
- Which VO the job should use to run
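The behaviour described for the RB list (try the first RB, fall back to the next on failure) is a plain failover loop. The Python sketch below shows the pattern. The submit argument stands in for whatever actually contacts a broker, and the broker names and the use of ConnectionError are assumptions made for the example.

```python
def submit_with_failover(brokers, submit):
    """Try each RB in order; return (broker, result) for the first that works."""
    errors = []
    for broker in brokers:
        try:
            return broker, submit(broker)
        except ConnectionError as exc:
            errors.append(f"{broker}: {exc}")
    raise ConnectionError("no Resource Broker available: " + "; ".join(errors))


def fake_submit(broker):
    """Stand-in for a real submission: the first broker is unreachable."""
    if broker == "rb1.example.org":
        raise ConnectionError("connection refused")
    return "job-accepted"


print(submit_with_failover(["rb1.example.org", "rb2.example.org"], fake_submit))
```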
32UI Command Error Messages
- The UI commands accept some arguments as input. If the user makes a mistake on the command line, the following messages can appear
- Argument is not allowed (the argument is not known)
- Argument must be specified at the end of the command (both the jobId and JDL file name must be put at the end of the command line)
- Argument is missing for the output option (the user forgot to add the parameter required by the argument)
- Argument -all cannot be specified with argument input (some arguments are mutually exclusive)
- CEId format is <full hostname>:<port number>/jobmanager-<service>. The provided CEID http://lx01.absolute.com:10854/jobmanager has a wrong format. (the user has mis-spelled the CE identifier after the resource option)
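The CEId format quoted in the last message can be checked before a job is submitted. The Python sketch below encodes the format <full hostname>:<port number>/jobmanager-<service> as a regular expression. The exact character sets allowed for the hostname and service name are an assumption, and is_valid_ce_id is a hypothetical helper.

```python
import re

# <full hostname>:<port number>/jobmanager-<service>
CE_ID = re.compile(r"^[A-Za-z0-9.-]+:\d+/jobmanager-[A-Za-z0-9_-]+$")


def is_valid_ce_id(ce_id):
    """True if the string follows the CEId format described above."""
    return CE_ID.match(ce_id) is not None


print(is_valid_ce_id("mu6.matrix.sara.nl:2119/jobmanager-pbs-long"))
print(is_valid_ce_id("http://lx01.absolute.com:10854/jobmanager"))
```

The second identifier is rejected for two reasons: it carries a URL scheme and it lacks the -<service> suffix.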
33Resource Broker errors
- While calling the RB API, the following can happen
- Resource Broker grid013g.cnaf.infn.it:7771 not available (can't open a connection with the RB specified in the UI configuration file)
- Unable to get LB address from RB grid013g.cnaf.infn.it (the function get_lb_contact returned an error)
34JDL Proxy Error Messages
- While the UI commands are checking the JDL file, the following errors may occur
- Mandatory Attribute default error in the configuration file /opt/edg/etc/UI_ConfigENV.cfg (there aren't any default values)
- Mandatory Attribute missing in JDL file: Executable (Executable is one of the mandatory attributes)
- Multiple InputSandbox attribute found in JDL file (the InputSandbox attribute is repeated)
- Wrong function call for list attribute . Function usage is Member/IsMember(List, Value) (e.g. in the requirements attribute the function Member/IsMember is used with a wrong syntax)
- Proxy (this refers to the security grid proxy and not to a proxy machine)
- If the user specifies a duration for the proxy that he wants to provide, using the -h option of edg-job-submit, a possible message is
- Proxy certificate will expire in less then X hours. Creating a new X-hours-duration certificate (this is to make sure that at least the required proxy validity is granted)
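Two of the JDL checks listed above (a missing Executable and a repeated attribute) are easy to reproduce locally before submitting. The Python sketch below is a naive check that looks only at "Name = ..." at the start of a line. It is not the parser used by the UI commands, and the messages it returns are merely modelled on the ones quoted here.

```python
import re

# An attribute name at the start of a line, followed by a single '=' (not '==').
ATTRIBUTE = re.compile(r"^\s*(\w+)\s*=(?!=)", re.MULTILINE)


def check_jdl(text):
    """Return a list of problems found in the JDL text (empty if none)."""
    names = ATTRIBUTE.findall(text)
    problems = []
    if "Executable" not in names:
        problems.append("Mandatory attribute missing: Executable")
    for name in sorted(set(names)):
        if names.count(name) > 1:
            problems.append(f"Multiple {name} attribute found")
    return problems


bad = '''Arguments = "Goedemiddag";
InputSandbox = {"a.txt"};
InputSandbox = {"b.txt"};
'''
print(check_jdl(bad))
```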