Queuing Tutorial - PowerPoint PPT Presentation


About This Presentation

Title:

Queuing Tutorial


Provided by: geoffz


Transcript and Presenter's Notes


1
Queuing Tutorial

An overview of Torque/Moab queuing.

2
Topics

Architecture of the queuing system
Workflow
Job Scripts
Some queuing strategies

3
Architecture

Resource Manager
Scheduler
Allocation Manager

4
Resource Manager

Torque
A fork of OpenPBS
2 parts
pbs_mom
Daemon on each compute node
Handles job start-up and keeps track of the
node's state
pbs_server
Server that jobs are submitted to
Keeps track of all nodes and jobs

5
Scheduler

Moab
Takes state information from the resource manager
and then schedules jobs to run.
The Brains

6
Allocation Manager

Gold
Keeps track of cpu-hours

7
Workflow
8
Workflow

From the queuing system's point of view
When a scheduling interval starts
Moab asks pbs_server for the state of the nodes
and of any jobs.
Moab attempts to schedule any eligible jobs if
there are enough resources free.
Moab tells pbs_server to start any jobs that can
be started.
pbs_server contacts the pbs_mom on the first node
assigned to the job (that pbs_mom is called the
mother superior).
The mother superior executes the job script
submitted by the user.
When a PBS heartbeat happens
pbs_server contacts each pbs_mom and asks for
the status of its node.

9
Workflow

From a user's point of view
Submit a job script to the queuing system.
Wait for the job to be scheduled and run.
Get the results.

10
Job Scripts

A job script is a script that starts your job.
If there were no queuing system, you should be
able to execute that script and start your job.
The job script has a few definitions that tell
the queuing system your job's requirements and
who you are.

11
Script Definitions

Walltime request
#PBS -l walltime=hh:mm:ss
CPU request
For System X
#PBS -l nodes=X:ppn=2
X nodes with 2 processors per node
For Cauldron
#PBS -l ncpus=X
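
The walltime request is written as hh:mm:ss. As a small illustration, the helper below converts a run-time estimate in seconds into that form. The function name format_walltime is made up for this sketch and is not part of Torque or Moab.

```shell
# Hypothetical helper: turn a number of seconds into hh:mm:ss,
# the form used in a "#PBS -l walltime=..." line.
format_walltime() {
    local secs="$1"
    printf '%02d:%02d:%02d\n' \
        $((secs / 3600)) $((secs % 3600 / 60)) $((secs % 60))
}

# Example: print the directive for a 90-minute job.
echo "#PBS -l walltime=$(format_walltime 5400)"
```

The last line only prints the directive text; paste the result into your job script.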

12
Script Definitions

Which queue you want to use
#PBS -q <queue name>
3 queues available now
System X OS X partition: production_q
System X Linux partition: linux_q
Cauldron: cauldron_q

13
Script Definitions

Some information about who you are
Your submission group
#PBS -W group_list=<group>
For System X it is tcf_user
For Cauldron it is sgiusers
Type groups when logged into a head node to
check that you belong to the group of the machine
you wish to submit to
Your cpu-hour hat
#PBS -A <hat>
On Cauldron it is sgim0000
System X users were told their hat in their
welcome letters.
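
The group check can also be scripted. This is a minimal sketch that uses the standard id -Gn command to list your groups. The function name in_group is invented for the example.

```shell
# Hypothetical helper: succeed if the current user belongs to the
# named group, using the POSIX "id -Gn" listing of group names.
in_group() {
    local wanted="$1" g
    for g in $(id -Gn); do
        [ "$g" = "$wanted" ] && return 0
    done
    return 1
}

# Example: check the Cauldron submission group before submitting.
if in_group sgiusers; then
    echo "OK: you are in sgiusers"
else
    echo "You are not in sgiusers"
fi
```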

14
Job Script Template

#!/bin/bash
#PBS -l walltime=01:00:00
#PBS -l ncpus=8
#PBS -q cauldron_q
#PBS -W group_list=sgiusers
#PBS -A sgim0000

15
Job Script

After the PBS definitions, put in the commands to
start your job
There are example job scripts found in /apps/doc
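
As an illustration only, here is what a complete script for Cauldron might look like once commands are added after the PBS definitions. The program name my_program is a placeholder, and the fallback to the current directory is an assumption added so the script also runs by hand, outside the queuing system.

```shell
#!/bin/bash
#PBS -l walltime=01:00:00
#PBS -l ncpus=8
#PBS -q cauldron_q
#PBS -W group_list=sgiusers
#PBS -A sgim0000

# Torque sets PBS_O_WORKDIR to the directory qsub was run from.
# Fall back to the current directory when run outside the queue.
workdir="${PBS_O_WORKDIR:-$PWD}"
cd "$workdir" || exit 1

echo "Job started in $workdir on $(uname -n)"

# Placeholder for the real work; replace with your own command.
# ./my_program
```

Because the PBS lines are ordinary shell comments, the same file can be executed directly to test it before submitting.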

16
Running Your Job

Use qsub to submit your job to the queue
qsub ./jobscript
To check on your job's status
qstat -a <queue name>
showq -p <partition name>
Partition names: OSX, LINUX, or CAULDRON
checkjob <job id number>
cstat (on Cauldron)
To delete a job, use qdel
qdel <job id number>
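
qsub prints the identifier of the new job when it accepts a script. The sketch below assumes the identifier has the usual Torque form of a number followed by a dot and the server name, and pulls out the number for use with checkjob and qdel. The function name job_number and the server name headnode are made up for the example.

```shell
# Hypothetical helper: keep only the numeric part of a job
# identifier such as "12345.headnode" (assumed format).
job_number() {
    # Strip everything from the first dot onward.
    printf '%s\n' "${1%%.*}"
}

# Typical use on a head node (commented out, needs the queuing system):
#   full_id="$(qsub ./jobscript)"
#   checkjob "$(job_number "$full_id")"
#   qdel "$(job_number "$full_id")"

job_number "12345.headnode"
```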

17
Queuing Strategies

Queue early, queue often
Queue your jobs up!
Jobs can't run if they aren't in the queue
Don't wait for the queue to get smaller: the job
has to wait either way, so it may as well wait in the queue
Smaller jobs may get started early through backfill
Have an accurate walltime
Accurate walltimes help the scheduler backfill
smaller jobs in between runs of larger jobs, but
only if doing so won't affect the start time of
the next job
Try to queue large jobs before downtimes
If you have a large job that never seems to find
enough CPUs available, queue it up before a
downtime.
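
To make the backfill condition concrete, here is a deliberately simplified sketch. It treats a job as a backfill candidate when it needs no more CPUs than are free and its walltime fits in the gap before the next large job is due to start. Moab's real scheduling decision takes more into account; the function name fits_backfill and the numbers are assumptions for illustration.

```shell
# Simplified backfill check (not Moab's actual algorithm).
# Arguments: job CPUs, job walltime in seconds,
#            free CPUs, seconds until the next reserved job starts.
fits_backfill() {
    local job_cpus="$1" job_secs="$2" free_cpus="$3" window_secs="$4"
    [ "$job_cpus" -le "$free_cpus" ] && [ "$job_secs" -le "$window_secs" ]
}

# Example: 8 CPUs are free for the next hour.
if fits_backfill 4 1800 8 3600; then
    echo "a 4-CPU, 30-minute job fits in the gap"
fi
if ! fits_backfill 4 7200 8 3600; then
    echo "a 4-CPU, 2-hour job does not fit"
fi
```

This is why an inflated walltime hurts: the same job with a 2-hour request is skipped even if it would really finish in 30 minutes.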

18
Queuing Strategies

The showbf command
Shows the CPUs available right now, and for how
long.
Checkpointing
If your code does checkpointing, you can exploit
backfill by queuing short jobs that fill the small
gaps, even if they do not run to completion
A good idea in general, in case of hardware failure
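
The checkpointing idea can be sketched as a loop that records its progress in a file and resumes from that file on the next run. This is a toy illustration, not how any particular application checkpoints. The function name run_steps, the checkpoint file, and the step counts are all hypothetical.

```shell
# Toy checkpoint/restart loop.
# Arguments: checkpoint file, total steps, steps allowed in this run.
# Returns success only when all steps are finished.
run_steps() {
    local ckpt="$1" total="$2" budget="$3"
    local step=0
    # Resume from the last completed step if a checkpoint exists.
    if [ -f "$ckpt" ]; then
        step="$(cat "$ckpt")"
    fi
    while [ "$step" -lt "$total" ] && [ "$budget" -gt 0 ]; do
        step=$((step + 1))
        # The real unit of work would go here.
        echo "$step" > "$ckpt"    # record progress after each step
        budget=$((budget - 1))
    done
    [ "$step" -ge "$total" ]
}

# Example: a 5-step job run in two short slots.
ckpt_file="$(mktemp)"
rm -f "$ckpt_file"
run_steps "$ckpt_file" 5 2 || echo "stopped after step $(cat "$ckpt_file")"
run_steps "$ckpt_file" 5 10 && echo "finished at step $(cat "$ckpt_file")"
rm -f "$ckpt_file"
```

Each short run makes some progress in a backfill gap, and a later submission picks up where the previous one stopped.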
