Title: Using the Batch System at NERSC
1 Using the Batch System at NERSC
- Mark Durst
- NERSC/USG
- ERSUG Training, Argonne, IL
- 28 April 1999
2 Outline
- Quick example
- How batch processing works
- Batch and pipe queues
- How to submit jobs
- Monitoring jobs
- Reminders and Pointers
3
#!/bin/csh
# file: simple1
#QSUB -q serial
#QSUB -J y          # keep job log
set myname=`whoami`
set now=`date`
set mylocn=`pwd`
echo ""
echo "Hello $myname, this is your shell script $0,"
echo "running at $now."
echo ""
echo "Your current directory is $mylocn, which should"
echo "be the same as $HOME."
echo ""
echo "I'm going to sleep now."
echo ""
sleep 90
exit
4
% cqsub simple1
Task id t51847 inserted into database nqedb.
% cqstatl t51847
----------------------------- NQE 3.3.0.9 Database Task Summary -----------------------------
IDENTIFIER            NAME    SYSTEM-OWNER     OWNER    LOCATION            ST
--------------------  ------- ---------------- -------- ------------------- ----
t51847                simple1 scheduler.main   mjdurst  NQE Database        NNew
% cqstatl t51847
----------------------------- NQE 3.3.0.9 Database Task Summary -----------------------------
IDENTIFIER            NAME    SYSTEM-OWNER     OWNER    LOCATION            ST
--------------------  ------- ---------------- -------- ------------------- ----
t51847                simple1 scheduler.main   mjdurst  NQE Database        NPend
% cqstatl t51847
----------------------------- NQE 3.3.0.9 Database Task Summary -----------------------------
IDENTIFIER            NAME    SYSTEM-OWNER     OWNER    LOCATION            ST
--------------------  ------- ---------------- -------- ------------------- ----
t51847                simple1 lws.mcurie       mjdurst  NQE Database        NSche
% cqstatl t51847
----------------------------- NQE 3.3.0.9 Database Task Summary -----------------------------
IDENTIFIER            NAME    SYSTEM-OWNER     OWNER    LOCATION            ST
--------------------  ------- ---------------- -------- ------------------- ----
t51847 (49939.mcurie) simple1 lws.mcurie       mjdurst  nqs@mcurie          NSubm
5
% qstat 49939
--------------------------------- NQS 3.3.0.9 BATCH REQUEST SUMMARY ---------------------------------
IDENTIFIER    NAME     USER     LOCATION/QUEUE        JID  PRTY REQMEM REQTIM ST
------------- -------  -------- --------------------- ---- ---- ------ ------ ---
49939.mcurie  simple1  mjdurst  serial_short@mcurie   3753   25    364   1800 R03
% qstat 49939
nqs-100 qstat: CAUTION
  Request <49939> not found.
% cqstatl t51847
----------------------------- NQE 3.3.0.9 Database Task Summary -----------------------------
IDENTIFIER            NAME    SYSTEM-OWNER     OWNER    LOCATION            ST
--------------------  ------- ---------------- -------- ------------------- ----
t51847 (49939.mcurie) simple1 monitor.main     mjdurst  NQE Database        NComp
% ls -l
total 12
-rwxrw-r--   1 mjdurst  mpccc        365 Jan 15 10:47 simple1
-rw-r--r--   1 mjdurst  mpccc          0 Jan 15 10:50 simple1.e51847
-rw-r--r--   1 mjdurst  mpccc       1285 Jan 15 10:50 simple1.l51847
-rw-r--r--   1 mjdurst  mpccc       2638 Jan 15 10:50 simple1.o51847
6
% cat simple1.l51847
01/15 10:48:13 Submitting to queue <serial> by <mjdurst(12113)>
01/15 10:48:13 Command line options <-e /u1/mjdurst/tests/bat.simple/simple1.e51847
               -J y -j /u1/mjdurst/tests/bat.simple/simple1.l51847
               -lM 28mw 28mw -lT 1800 1800 -mu mjdurst@mcurie
               -o /u1/mjdurst/tests/bat.simple/simple1.o51847
               -r simple1 -x -q serial>.
01/15 10:48:13 Script file options <-q serial -J y # keep job log>.
01/15 10:48:15 Arrived in <serial@mcurie> from <mcurie>.
01/15 10:48:15 Request-id is <49939.mcurie>, Request name=<simple1>.
01/15 10:48:15 NQE Task ID is <nqedb.t51847>.
01/15 10:48:15 Origin uid=<12113>, Target username=<mjdurst>.
01/15 10:48:15 Account/Project name=<mpccc>, Account/Project ID=<105>.
01/15 10:48:15 Submission security level=<0>, compartments=<0>.
01/15 10:48:17 Account/Project name=<mpccc>, Account/Project ID=<105>.
01/15 10:48:17 Arrived in <serial_short@mcurie> from <serial@mcurie>.
01/15 10:48:20 Submission security level=<0>, compartments=<0>.
01/15 10:48:20 Execution security level=<0>, compartments=<0>.
01/15 10:48:23 Started, pid=<36967>, jid=<3753>, shell=</bin/csh>, umask=<18>.
01/15 10:48:23 Running in queue <serial_short>.
01/15 10:50:02 Finished.
01/15 10:50:02 Returning stderr output file.
01/15 10:50:03 Returning stdout output file.
7
% cat simple1.o51847
mcurie.nersc.gov, a Cray T3E-900 running UNICOS/mk 2.0.3.32
------------------------------Contact Information------------------------------
NERSC Web   http://www.nersc.gov/
ESnet Web   http://www.es.net/
ESCHER Web  http://www.nersc.gov/hardware/servers/vis-server.html
<snip>
                                CFS CONVERSION
CFS to HPSS conversion was successfully completed on January 7, 1999.
Users can access all of their CFS files on the new HPSS system, "archive".
The cfs command on the NERSC Crays now points to the new HPSS interface, hsi.
For more info on using hsi reference this URL
http://www.nersc.gov/hardware/storage/hsi.ch1.html.
If your HPSS password fails or you don't have an HPSS account, contact the
Account Support group at 1-800-66NERSC, option 2, or (510) 486-8612
------------------------------------------------------------------------------
Your current working directory is /u/mpccc/mjdurst.

Hello mjdurst, this is your shell script /usr/spool/nqe/spool/scripts/BBI 0,
running at Fri Jan 15 10:48:31 PST 1999.

Your current directory is /u1/mjdurst, which should
be the same as /u/mpccc/mjdurst.

I'm going to sleep now.

logout
8 Why Batch Processing?
- Batch queues are necessary
  - On systems with many jobs
  - When scheduling is difficult
  - To assure greater throughput
- Interactive jobs are limited
  - J90: 10 hrs.
  - T3E: < 64 PEs, < 30 minutes parallel (1 hr serial)
- Some machines/processors batch-only
  - J90: all batch machines
  - T3E: many APP PEs (at night, almost all)
9 The Batch Process
- User creates shell script myscript
- Submits to NQE with cqsub myscript
  - Returns NQE task id (e.g., t4913)
- NQE forwards to NQS
  - J90: selects a machine (J90 wait time here)
- NQS runs the job
  - Assigns NQS job id (e.g., 6859.mcurie)
  - Selects a batch queue
  - Places the job there (T3E wait time here)
  - Runs it when appropriate
- NQS/NQE returns job logs at completion
10 Pipe Queues
- Groups of batch queues
- Direct to a pipe with #QSUB -q serial
- Default is production
- To see them: qstat -p
- T3E
  - serial, debug, production, long
- J90
  - production
  - batchk (for evening, weekend killeen queues)
  - batchb,f,s,c,j (not recommended)
11 Preparing for Batch Submission
- Write your shell script
  - C shell or Bourne/Korn shell
  - Starts in user's home directory
  - Debug interactively (if possible)
- Decide on needed resources
  - J90: CPU time, memory
  - T3E: amount of parallel, serial time; number of PEs
- Select other #QSUB options
- Check for appropriate queue and submit
12 Essential options to cqsub (#QSUB directives)
- J90
  - -lM <mem>
  - -lT <time>
- T3E
  - -l mpp_p=<num>
  - -l mpp_t=<par_time>
  - -lT <ser_time>
  - don't use -lM
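As an illustration of how the T3E directives fit together, here is a minimal Bourne-shell sketch that prints a batch script header. The function name make_t3e_script, the production queue, the PE count, the time limits (in seconds), and the placeholder job body are all assumptions chosen for the example, not NERSC recommendations.

```sh
#!/bin/sh
# Sketch: print a T3E batch script with the essential #QSUB directives.
# make_t3e_script is a hypothetical helper; all values are illustrative.
make_t3e_script() {
    pes=$1        # number of PEs                 (-l mpp_p)
    par_time=$2   # parallel time limit, seconds  (-l mpp_t)
    ser_time=$3   # serial time limit, seconds    (-lT)
    cat <<EOF
#!/bin/csh
#QSUB -q production
#QSUB -l mpp_p=$pes
#QSUB -l mpp_t=$par_time
#QSUB -lT $ser_time
#QSUB -J y
echo "job body goes here"
EOF
}

# Usage: write the script to a file, then submit it with cqsub.
#   make_t3e_script 16 1800 600 > myjob
#   cqsub myjob
```

Note that no -lM line is generated, in keeping with the T3E advice above.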
13 Other cqsub options
- -J y: save job log (recommended)
  - -j <file>: save it in file
- -mb: send mail when job starts (-me: ends)
- -a <time>: hold job until after time
- -o <file>: put standard output in file
  - (default name <batfile>.o<id>)
- -eo: combine standard error and output
  - makes output look like terminal record
- -x: exports user's environment to job
- -s <shell>: specify shell
14 Job Submission
- cqsub <file>
- Can give options at submission time
  - Override file options
  - Less dependable
- If no file name, expects commands from terminal
  - Useful behavior in automated script generation and submission
- Response:
  - Task id t16839 inserted into database nqedb.
  - Task id useful for tracking with cqstatl.
- Don't break (Ctrl-C) out of cqsub!
  - Instead, allow to finish, then use cqdel
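Because the task id is what cqstatl and cqdel need later, a script that submits jobs may want to capture it from the cqsub response. The following is a minimal sketch that assumes the response line has exactly the form shown above; task_id_from_response is a hypothetical helper name.

```sh
#!/bin/sh
# Sketch: pull the NQE task id out of a cqsub response read from stdin.
# Assumes the "Task id tNNNNN inserted into database ..." wording shown
# on the slide; prints nothing if no such line is present.
task_id_from_response() {
    sed -n 's/^Task id \(t[0-9][0-9]*\) inserted into database .*/\1/p'
}

# Usage (on a system with NQE installed):
#   taskid=$(cqsub myscript | task_id_from_response)
#   cqstatl "$taskid"
```

An empty result is worth checking for, since "no task id returned" is itself a symptom that NQE may be down.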
15 Monitoring Jobs
- cqstatl <taskid>
  - cqstatl -a | grep <username> (if no <taskid>)
- ST column (status) indicates progress
  - NNew, NPend, NSche: still in NQE
  - NSubm: submitted to NQS
  - NComp: done
  - NTerm: killed
  - NFail: job failed (user or system error)
- IDENTIFIER column holds NQS job id
  - (once submitted)
- cqstatl -f <taskid>: details for your job
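The status codes above can be turned into a small lookup for use in monitoring scripts. This is only a sketch: describe_nqe_status is a hypothetical helper, and it covers just the codes listed on this slide.

```sh
#!/bin/sh
# Sketch: translate an NQE ST code (from cqstatl) into the meaning
# given on the slide. Codes not listed there are reported as unknown.
describe_nqe_status() {
    case "$1" in
        NNew|NPend|NSche) echo "still in NQE" ;;
        NSubm)            echo "submitted to NQS" ;;
        NComp)            echo "done" ;;
        NTerm)            echo "killed" ;;
        NFail)            echo "job failed (user or system error)" ;;
        *)                echo "unknown status: $1" ;;
    esac
}

# Usage:
#   describe_nqe_status NSubm
```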
16 Monitoring Jobs (cont'd)
- T3E: qstat <jobid> once your job reaches NQS
  - cqstatl -d nqs is equivalent to qstat
  - qstat -au <username> (if no <jobid>)
- J90: qstat -h <hostname> <jobid>
  - Find hostname from NQS id (from cqstatl)
  - e.g., 2861.seymour
- ST column (status) now indicates
  - RNN: Running (with NN processes)
  - Qxy: waiting in the queue (xy encodes reason)
  - man qstat to decode
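Two of the steps above are simple string operations: finding the J90 hostname in an NQS id, and reading the first letter of the qstat ST code. A minimal sketch follows; nqs_host and qstat_state are hypothetical helper names, and only the R and Q states described on this slide are decoded.

```sh
#!/bin/sh
# Sketch: helpers for working with NQS ids and qstat status codes.

# Print the host part of an NQS id such as 2861.seymour.
# If the id contains no dot, it is printed unchanged.
nqs_host() {
    echo "${1#*.}"
}

# Classify a qstat ST code by its first letter. Only R (running) and
# Q (queued) are handled; everything else is left to the man page.
qstat_state() {
    case "$1" in
        R*) echo "running" ;;
        Q*) echo "waiting in the queue" ;;
        *)  echo "see man qstat" ;;
    esac
}

# Usage (J90):
#   qstat -h "$(nqs_host 2861.seymour)" <jobid>
```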
17
% cqstatl -a
----------------------------- NQE 3.3.0.9 Database Task Summary -----------------------------
IDENTIFIER            NAME    SYSTEM-OWNER     OWNER    LOCATION            ST
--------------------  ------- ---------------- -------- ------------------- ----
t48217 (46356.mcurie) PCM     lws.mcurie       alewife  nqs@mcurie          NSubm
t48713 (46848.mcurie) third   lws.mcurie       u6670    nqs@mcurie          NSubm
t49200 (47518.mcurie) int566A lws.mcurie       u61176   nqs@mcurie          NSubm
t49245 (47368.mcurie) xqcd_ho lws.mcurie       snm      nqs@mcurie          NSubm
t50349 (48480.mcurie) int650  lws.mcurie       u61176   nqs@mcurie          NSubm
t50881 (49338.mcurie) lte34-0 lws.mcurie       lungfish nqs@mcurie          NSubm
<snip>
t51870                case17c scheduler.main   salmon   NQE Database        NTerm
t51871                case1c9 scheduler.main   salmon   NQE Database        NFail
t51872                case16c scheduler.main   salmon   NQE Database        NPend
t51873 (49967.mcurie) q_lsms  lws.mcurie       marlin   nqs@mcurie          NSubm
t51875                case11c scheduler.main   salmon   NQE Database        NPend
t51877 (49970.mcurie) G08     lws.mcurie       u66870   nqs@mcurie          NSubm
t51878 (49971.mcurie) qHsig.3 lws.mcurie       bass     nqs@mcurie          NSubm
t51881 (49975.mcurie) Jobge_b lws.mcurie       carp     nqs@mcurie          NSubm
t51884 (49979.mcurie) job16.a lws.mcurie       adt      nqs@mcurie          NSubm
t51885 (49980.mcurie) run_dyn lws.mcurie       flounder nqs@mcurie          NSubm
t51886 (49981.mcurie) jupiter lws.mcurie       grouper  nqs@mcurie          NSubm
t51887 (49983.mcurie) JobCZ.b lws.mcurie       tarpon   nqs@mcurie          NComp
(output greatly abridged)
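With a listing this long it can help to count how many tasks are in a given state. The sketch below assumes the ST code is the last whitespace-separated field of each task line, as in the listing above; count_status is a hypothetical helper name.

```sh
#!/bin/sh
# Sketch: count task lines (read from stdin) whose last field equals
# the given NQE status code. Header and rule lines never match because
# their last field is not a status code.
count_status() {
    awk -v want="$1" '$NF == want { n++ } END { print n + 0 }'
}

# Usage (on a system with NQE installed):
#   cqstatl -a | grep <username> | count_status NPend
```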
18
% qstat -a
--------------------------------- NQS 3.3.0.9 BATCH REQUEST SUMMARY ---------------------------------
IDENTIFIER    NAME     USER     LOCATION/QUEUE        JID  PRTY REQMEM REQTIM ST
------------- -------  -------- --------------------- ---- ---- ------ ------ ---
49979.mcurie  job16.ag adt      pe32@mcurie           4164   25    255   1520 R03
49936.mcurie  akr520   u6677    pe32@mcurie           3732   25    323   1800 R03
49964.mcurie  case14c9 salmon   pe32@mcurie           3944   25    255   1795 R03
49967.mcurie  q_lsms   marlin   pe32@mcurie                 999  28672   1800 Cge
49983.mcurie  JobCZ.bb tarpon   pe32@mcurie                 317  28672   1800 Qge
49984.mcurie  bitgc11  u62098   pe32@mcurie                 244  28672   1800 Qge
49985.mcurie  bitgc11  u62098   pe32@mcurie                 242  28672   1800 Qge
49362.mcurie  Job_a2   carp     pe128@mcurie          5308   25    323   1800 R03
49335.mcurie  script.2 sturgeon pe256@mcurie                999  28672   1800 Qqs
49033.mcurie  uo2_3h2o dorado   gc128@mcurie                ---  28672   7200 Hop
49255.mcurie  run010_A bluegill long128@mcurie        4617   25    255   1800 R03
49276.mcurie  sg3D10   aku      long128@mcurie              999   4096   1800 Qce
49277.mcurie  sg3D10   aku      long128@mcurie              999   4096   1800 Qqu
49867.mcurie  run_t4   flounder long128@mcurie               70  28672   1800 Cgg
no pipe queue entries
(output greatly abridged)
19
% qstat -f pe32
------------------------------------
NQS 3.3.0.9 BATCH QUEUE: pe32@mcurie          Status: ENABLED/RUNNING
------------------------------------
                                              Priority: 15
<ENTRIES>
  Total:     17
  Running:    5    Queued:    12    Waiting:   0
  Holding:    0    Arriving:   0    Exiting:   0
<RUN LIMITS>
  Queue: 13    User: 2    Group: 20
<COMPLEX MEMBERSHIP>
  regular
<LOCAL SCHEDULER EXTENSIONS>
  Miser Queue: unspecified    Scheduling Window: 00.0
<RESOURCE USAGE>
                                LIMIT               ALLOCATED
  Memory Size                   unlimited           143360kw
  Quick File Space              0b                  0kw
  MPP Processor Elements        416                 60
<REQUEST LIMITS>
                                PER-PROCESS         PER-REQUEST
  type a Tape Drives                                unspecified (0)
  type b Tape Drives                                unspecified (0)
  type c Tape Drives                                unspecified (0)
  type d Tape Drives                                unspecified (0)
(cont'd)
20
                                PER-PROCESS         PER-REQUEST
  type e Tape Drives                                unspecified (0)
  type f Tape Drives                                unspecified (0)
  type g Tape Drives                                unspecified (0)
  type h Tape Drives                                unspecified (0)
  Core File Size                unspecified (256mw)
  Data Size                     unspecified (256mw)
  Permanent File Space          20gb                25gb
  Memory Size                   28mw                29mw
  Nice Increment                5
  Quick File Space              unspecified (0b)    0b
  Stack Size                    unspecified (256mw)
  CPU Time Limit                3600sec             7200sec
  Temporary File Space          unspecified (0b)    unspecified (0b)
  Working Set Limit             unspecified (256mw)
  MPP Processor Elements                            32
  MPP Time Limit                15000sec            15000sec
  Shared Memory Limit                               unspecified (0mw)
  Shared Memory Segments                            unspecified (0)
  MPP Memory Size               unspecified (256mw) unlimited
<ACCESS>
  Route: Pipe Only    Users: Unrestricted
<CUMULATIVE TIME>
  System Time: 3563114615067464.00 secs
  User Time:   281421545294442428.00 secs
(qstat -f output, cont'd from previous slide)
21 Troubleshooting
- No task id returned
  - Typically means NQE down
  - Message like "Can't connect"
- Job doesn't make it to NQS: try cqstatl <taskid>
  - NFail usually indicates submission error
  - Nabort could be a system problem
  - No listing if many days old (NQE database is purged frequently)
- Stuck in NPend status
  - J90: Many jobs ahead of you?
  - T3E: over pipe queue limit?
22 Troubleshooting (cont'd)
- Stuck in NSubm: use qstat
  - Q normal on T3E, rare on J90
- T3E
  - Hop can be allocation problem
  - C (checkpointed) may be daily shuffling
  - May need both pslist and qstat -m to sort it all out
- Job crashes
  - Read job log, stdout, stderr
  - "...limit exceeded": ran out of time (or memory, or ...)
- Job vanishes
  - Did machine(s) crash? If not, collect info and contact Consultants
23 Pointers
- Batch job is like a login session
  - Starts in your home directory
  - Uses your startup files
  - But doesn't inherit environment (unless you use -x)
- Environment variable ENVIRONMENT
  - Not set in interactive work, set to BATCH in batch jobs
  - Can exclude parts of startup files
- /usr/tmp faster than home directory
  - $TMPDIR vanishes (avoids littering)
  - Just one quota for $TMPDIR, rest of /usr/tmp/
  - Can't monitor batch J90 temp file systems
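The ENVIRONMENT test mentioned above might look like this in a Bourne/Korn startup file. It is a sketch only: in_batch_job is a hypothetical helper name, and the two branches are left as placeholders for whatever your own startup file does.

```sh
#!/bin/sh
# Sketch: skip interactive-only setup when running under the batch system.
# Relies on ENVIRONMENT being set to BATCH in batch jobs and unset in
# interactive sessions, as described on the slide.
in_batch_job() {
    [ "${ENVIRONMENT:-}" = "BATCH" ]
}

if in_batch_job; then
    :   # batch job: nothing interactive to set up
else
    :   # interactive session: prompt, aliases, terminal settings go here
fi
```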
24 Pointers (cont'd)
- Don't submit blindly
  - Debug executables, scripts first
  - Don't trust inherited shell scripts
  - Spend time with man pages
- J90: large memory jobs should/must multitask
- T3E: reduce serial time in parallel jobs
  - Stage HPSS retrievals (dmget)
  - Submit follow-on serial jobs within your job