Portable Batch System PBS aka TORQUE


1
Portable Batch System (PBS aka TORQUE)
  • Lars Schley

2
OpenPBS / TORQUE Batch Processing
  • PBS is a workload management system for Linux
    clusters.
  • It supplies commands for
    • job submission,
    • job monitoring (tracing), and
    • job deletion.
  • It consists of the following components:
  • Job server (pbs_server)
    • provides the basic batch services:
    • receiving/creating a batch job,
    • modifying the job,
    • protecting the job against system crashes,
    • and running the job.

3
OpenPBS / TORQUE Batch Processing
  • Job executor (pbs_mom)
    • receives a copy of the job from the job server
    • places the job into execution
    • creates a new session that is as identical as
      possible to a user login session
    • returns the job's output to the user
  • Job scheduler (pbs_sched)
    • implements the site's policy controlling which job
      is run, and where and when it is run
    • PBS allows each site to create its own scheduler
    • Currently the Maui Scheduler is used

4
OpenPBS / TORQUE Batch Processing
  • Maui communicates
    • with the MOMs to monitor the state of the system's
      resources
    • with the server to retrieve information about the
      availability of jobs to execute
  • Steps needed to run your first production code
    • 1. Create a job script
      • containing the PBS options that request the needed
        resources (e.g. number of processors, wall-clock
        time)
      • and the commands that prepare for execution of the
        executable (e.g. cd to the working directory).
    • 2. Submit the job script file to PBS/TORQUE
    • 3. Monitor the job
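
The three steps above can be sketched as follows. This is a minimal illustration, not a site-specific recipe: the helper name write_job_script, the job name, and the resource values are placeholders, and the sketch assumes that PBS sets PBS_O_WORKDIR to the directory from which qsub was called.

```shell
#!/bin/bash
# Sketch: write a minimal PBS/TORQUE job script to the given path.
# write_job_script is a hypothetical helper name; the job name and the
# resource values below are placeholders to adapt to your site.
write_job_script() {
    script_path=$1
    cat > "$script_path" <<'EOF'
#!/bin/bash
#PBS -N myJob
#PBS -l nodes=4:ppn=2
#PBS -l walltime=01:00:00

# Change to the directory from which the job was submitted.
cd "$PBS_O_WORKDIR" || exit 1

# Placeholder for the real executable.
/bin/hostname -f
EOF
}

# Usage on a cluster with PBS/TORQUE installed:
#   write_job_script myJob.sh    # step 1: create the job script
#   qsub myJob.sh                # step 2: submit it
#   qstat                        # step 3: monitor the job
```

The helper only writes a file, so it can be tried on any machine; the qsub and qstat calls are left as comments because they need a running PBS server.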

5
PBS Options
  • #PBS -N myJob
    • Assigns a job name. The default is the name of
      the PBS job script.
  • #PBS -l nodes=4:ppn=2
    • The number of nodes and processors per node.
  • #PBS -l walltime=01:00:00
    • The maximum wall-clock time during which this
      job can run.
  • #PBS -o mypath/my.out
    • The path and file name for standard output.
  • #PBS -e mypath/my.err
    • The path and file name for standard error.
  • #PBS -j oe
    • Join option that merges the standard error stream
      with the standard output stream.
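
The walltime value is written as hours, minutes and seconds separated by colons. A small sketch of how such a value could be produced from a number of seconds; walltime_from_seconds is a hypothetical helper for illustration, not a PBS command.

```shell
# Sketch: format a number of seconds as HH:MM:SS for "-l walltime=...".
# walltime_from_seconds is a hypothetical helper, not part of PBS/TORQUE.
walltime_from_seconds() {
    total=$1
    printf '%02d:%02d:%02d\n' \
        "$(( total / 3600 ))" "$(( total % 3600 / 60 ))" "$(( total % 60 ))"
}

# Prints: #PBS -l walltime=01:00:00
echo "#PBS -l walltime=$(walltime_from_seconds 3600)"
```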

6
PBS Options
  • #PBS -k oe
    • Defines which output of the batch job is retained
      on the execution host.
  • #PBS -W stagein=file_list
    • Copies the file onto the execution host before
      the job starts.
  • #PBS -W stageout=file_list
    • Copies the file from the execution host after
      the job completes.
  • #PBS -r n
    • Indicates that a job should not rerun if it
      fails.
  • #PBS -V
    • Exports all environment variables to the job.

7
First example
  • #!/bin/bash
  • #PBS -N MyAppName
  • #PBS -l nodes=1
  • #PBS -l walltime=00:01:00
  • #PBS -e /home/dgtest/dgtest0200/test.err
  • #PBS -o /home/dgtest/dgtest0200/test.out
  • #PBS -V
  • /bin/hostname -f

8
Procedure
  • Use the command line
  • Use the editor mcedit to create an executable script
    • mcedit myExample.sh
    • Use the first example code
  • Make myExample.sh executable
    • chmod +x myExample.sh
  • Test your script
    • ./myExample.sh
  • Submit your script
    • qsub -q dgtest myExample.sh
  • Remember your job identifier
    • e.g. 96682.udo-torque01.grid.uni-dortmund.de
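
Instead of remembering the job identifier by hand, it can be captured from the output of qsub, which prints the identifier on standard output. The sketch below only shows how the numeric part could be split off; job_number is a hypothetical helper name, and the qsub call is left as a comment because it needs a PBS server.

```shell
# Sketch: strip the server name from a job identifier, leaving the number.
# job_number is a hypothetical helper, not part of PBS/TORQUE.
job_number() {
    # ${1%%.*} removes everything from the first dot onwards.
    printf '%s\n' "${1%%.*}"
}

# Usage on a cluster:
#   jobid=$(qsub -q dgtest myExample.sh)   # capture the identifier
#   echo "$jobid" > last_job.id            # keep it for later

job_number 96682.udo-torque01.grid.uni-dortmund.de   # prints 96682
```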

9
Procedure
  • Check whether your job runs
    • qstat
  • Something wrong?
    • tracejob 96682.udo-torque01.grid.uni-dortmund.de
  • 01/29/2008 16:07:29  S
    • enqueuing into dgtest, state 1 hop 1
  • 01/29/2008 16:07:29  S
    • Job Queued at request of
      dgtest0200@udo-torque01.grid.uni-dortmund.de,
      owner = dgtest0200@udo-torque01.grid.uni-dortmund.de,
      job name = MyHelloWorldExample, queue = dgtest
  • 01/29/2008 16:07:30  S
    • Job Modified at request of
      root@udo-torque01.grid.uni-dortmund.de

10
Procedure
  • 01/29/2008 16:07:30  S
    • Job Run at request of
      root@udo-torque01.grid.uni-dortmund.de
  • 01/29/2008 16:07:30  S
    • child reported success for job after 0 seconds
      (dest=udo-wn044.grid.uni-dortmund.de), rc=0
  • 01/29/2008 16:07:30  S
    • sending 'b' mail for job
      96682.udo-torque01.grid.uni-dortmund.de to
      dgtest0200@udo-torque01.grid.uni-dortmund.de (---)
  • 01/29/2008 16:07:30  S
    • Job Modified at request of
      root@udo-torque01.grid.uni-dortmund.de

11
Procedure
  • 01/29/2008 16:07:30  S
    • sending 'e' mail for job
      96682.udo-torque01.grid.uni-dortmund.de to
      dgtest0200@udo-torque01.grid.uni-dortmund.de
      (Exit_status=0
  • 01/29/2008 16:07:30  S
    • Exit_status=0 resources_used.cput=00:00:00
      resources_used.mem=0kb resources_used.vmem=0kb
      resources_used.walltime=00:00:00
  • 01/29/2008 16:07:31  S
    • dequeuing from dgtest, state COMPLETE
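
The exit status is usually the first thing to look for in such a trace. A minimal sketch of how it could be filtered out, assuming the real tracejob output carries the field as Exit_status=N; exit_status_from_trace is a hypothetical helper name.

```shell
# Sketch: read tracejob output on standard input and print the value of
# the first Exit_status=N field found.
# exit_status_from_trace is a hypothetical helper, not a PBS command.
exit_status_from_trace() {
    sed -n 's/.*Exit_status=\([0-9-]*\).*/\1/p' | head -n 1
}

# Usage on a cluster:
#   tracejob 96682.udo-torque01.grid.uni-dortmund.de | exit_status_from_trace

# Demonstration with one line in the format shown in the trace:
printf '%s\n' 'Exit_status=0 resources_used.cput=00:00:00' | exit_status_from_trace
```

A value of 0 corresponds to the successful run shown in the trace; any other value is worth checking against the job's standard error file.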

12
Monitor / Control a Job
  • qstat -a
    • check status of jobs, queues, and the PBS server
  • qstat -f
    • get all the information about a job, e.g.
      resources requested, resource limits, owner,
      source, destination, queue, etc.
  • canceljob job.ID
    • delete a job from the queue
  • qhold job.ID
    • hold a job if it is in the queue
  • qrls job.ID
    • release a job from hold
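
For scripted monitoring, the job state can be read from the full status listing. The sketch below assumes that qstat -f prints a line of the form "job_state = R" (with state letters such as Q for queued, R for running, H for held and C for completed); job_state is a hypothetical helper name.

```shell
# Sketch: read "qstat -f" output on standard input and print the
# one-letter job state.
# job_state is a hypothetical helper, not part of PBS/TORQUE.
job_state() {
    sed -n 's/^[[:space:]]*job_state = \([A-Z]\).*/\1/p' | head -n 1
}

# Usage on a cluster:
#   qstat -f "$jobid" | job_state
```

A submit script could call this in a loop with a sleep between calls to wait until the state changes, which avoids polling qstat by hand.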

13
Exercise
  • Write a bash script that automatically submits a
    simple script 10 times
  • Redirect the job IDs into a file
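
One possible starting point for the exercise is sketched below; it is not the only solution. The helper name submit_many and the file name job_ids.txt are assumptions, and a queue option such as -q dgtest may have to be added to the qsub call depending on the site.

```shell
# Sketch: submit the same job script several times and collect the job
# identifiers printed by qsub in a file, one per line.
# submit_many is a hypothetical helper, not part of PBS/TORQUE.
submit_many() {
    script=$1
    count=$2
    id_file=$3
    : > "$id_file"                     # start with an empty file
    i=1
    while [ "$i" -le "$count" ]; do
        # qsub prints the job identifier on standard output.
        qsub "$script" >> "$id_file" || return 1
        i=$((i + 1))
    done
}

# Usage on a cluster:
#   submit_many myExample.sh 10 job_ids.txt
```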