Queuing Tutorial - PowerPoint PPT Presentation

Provided by: geoffz

1
Queuing Tutorial
  • An overview of Torque/Moab queuing.

2
Topics
  • Architecture of the queuing system
  • Workflow
  • Job Scripts
  • Some queuing strategies

3
Architecture
  • Resource Manager
  • Scheduler
  • Allocation Manager

4
Resource Manager
  • Torque
  • A fork of OpenPBS
  • Two parts
  • pbs_mom
  • Daemon on each compute node
  • Handles job startup and keeps track of the
    node's state
  • pbs_server
  • Server that jobs are submitted to.
  • Keeps track of all nodes and jobs.

5
Scheduler
  • Moab
  • Takes state information from the resource manager
    and then schedules jobs to run.
  • The Brains

6
Allocation Manager
  • Gold
  • Keeps track of cpu-hours

7
Workflow

8
Workflow
  • From the queuing system's point of view
  • When a scheduling interval starts
  • Moab asks pbs_server for the state of the nodes
    and of any jobs.
  • Moab attempts to schedule any eligible jobs if
    there are enough free resources.
  • Moab tells pbs_server to start any jobs that can
    be started.
  • pbs_server contacts the pbs_mom on the first node
    assigned to the job (that pbs_mom is called the
    mother superior).
  • The mother superior executes the job script
    submitted by the user.
  • When a PBS heartbeat happens
  • pbs_server contacts each pbs_mom and asks for
    the status of its node.

9
Workflow
  • From a user's point of view
  • Submit a job script to the queuing system.
  • Wait for the job to be scheduled and run.
  • Get the results.

10
Job Scripts
  • A job script is a script to start your job
  • If there were no queuing system, you should
    still be able to execute that script and start
    your job.
  • The job script has a few definitions to inform
    the queuing system of your job requirements and
    who you are

11
Script Definitions
  • Walltime request
  • #PBS -lwalltime=hh:mm:ss
  • CPU request
  • For System X
  • #PBS -lnodes=X:ppn=2
  • X nodes with 2 processors per node
  • For Cauldron
  • #PBS -lncpus=X
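
If you have timed a test run in seconds, the walltime request still has to be written as hh:mm:ss. The helper below is a minimal sketch of that conversion; the function name to_walltime is made up for this example and is not part of Torque or Moab.

```shell
#!/bin/bash
# Hypothetical helper: format a number of seconds as hh:mm:ss
# so it can be pasted into a "#PBS -lwalltime=..." line.
to_walltime() {
    local total=$1
    printf '%02d:%02d:%02d\n' $((total / 3600)) $((total % 3600 / 60)) $((total % 60))
}

to_walltime 5400    # a 1.5 hour run
```

The hours field is not capped at 24, so a request longer than a day simply shows a larger hour count.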

12
Script Definitions
  • Which queue you want to use
  • #PBS -q <queue name>
  • 3 queues available now
  • System X OS X partition: production_q
  • System X Linux partition: linux_q
  • Cauldron: cauldron_q

13
Script Definitions
  • Some information about who you are
  • Your submission group
  • #PBS -W group_list=<group>
  • For System X it is tcf_user
  • For Cauldron it is sgiusers
  • Type groups when logged into a head node to
    check that you belong to the group of the machine
    you wish to submit to
  • Your cpu-hour hat
  • #PBS -A <hat>
  • On Cauldron it is sgim0000
  • System X users were told their hat in their
    welcome letters.
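
The group check described above can also be scripted. This is a minimal sketch that uses the standard id command instead of reading the output of groups by eye; the function name in_group is made up for this example.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the current user belongs to the named group.
in_group() {
    local wanted=$1 g
    for g in $(id -Gn); do
        [ "$g" = "$wanted" ] && return 0
    done
    return 1
}

# tcf_user is the System X group; use sgiusers for Cauldron.
if in_group tcf_user; then
    echo "OK to submit with group_list=tcf_user"
else
    echo "Not in tcf_user; check your group membership before submitting"
fi
```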

14
Job Script Template
  • #!/bin/bash
  • #PBS -lwalltime=01:00:00
  • #PBS -lncpus=8
  • #PBS -q cauldron_q
  • #PBS -W group_list=sgiusers
  • #PBS -A sgim0000

15
Job Script
  • After the PBS definitions, put in the commands to
    start your job
  • Example job scripts can be found in /apps/doc
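
A complete Cauldron job script might look like the sketch below. The PBS lines are the template values from this tutorial; the commands after them are placeholders standing in for your real workload. Because the #PBS lines are ordinary comments to bash, the script also runs by hand when there is no queuing system. PBS_O_WORKDIR is the variable Torque sets to the directory the job was submitted from.

```shell
#!/bin/bash
#PBS -lwalltime=01:00:00
#PBS -lncpus=8
#PBS -q cauldron_q
#PBS -W group_list=sgiusers
#PBS -A sgim0000

# Torque sets PBS_O_WORKDIR to the directory qsub was run from.
# Fall back to the current directory so the script also runs by hand.
cd "${PBS_O_WORKDIR:-$PWD}"

# Placeholder workload: replace these lines with the commands that start your job.
echo "Job started on $(uname -n) in $(pwd)"
result=$((6 * 7))
echo "result=$result"
```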

16
Running Your Job
  • Use qsub to submit your job to the queue
  • qsub ./jobscript
  • To check on your job's status
  • qstat -a <queue name>
  • showq -p <partition name>
  • Partition names: OSX, LINUX, or CAULDRON
  • checkjob <job id number>
  • cstat (on Cauldron)
  • To delete a job, use qdel
  • qdel <job id number>
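
qsub prints the identifier of the new job, typically the job number followed by a server name, while checkjob and qdel above take the job id number. The sketch below captures the identifier and strips the suffix. The function name job_number is made up for this example, and the script only calls qsub when it is actually installed.

```shell
#!/bin/bash
# Hypothetical helper: strip the server suffix from a Torque job identifier
# such as "12345.headnode" ("headnode" is a made-up server name).
job_number() {
    echo "${1%%.*}"
}

if command -v qsub >/dev/null 2>&1; then
    # qsub prints the new job's identifier on success.
    jobid=$(qsub ./jobscript) && checkjob "$(job_number "$jobid")"
else
    echo "qsub not found; run this on a head node"
fi
```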

17
Queuing Strategies
  • Queue early, queue often
  • Queue your jobs up!
  • Jobs can't run if they aren't in the queue
  • Don't wait for the queue to get shorter before
    submitting: your job has to wait either way, so
    it may as well wait in the queue
  • Smaller jobs may get backfilled
  • Have an accurate walltime
  • Accurate walltimes help the scheduler backfill
    smaller jobs in between runs of larger jobs, but
    only if doing so won't affect the start time of
    the next job
  • Try to queue large jobs before downtimes
  • If you have a large job that never seems to have
    enough cpus available, queue it up before a
    downtime.
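
To see why an accurate walltime matters for backfill, consider the toy check below. It is a deliberately simplified model, not Moab's actual algorithm: a job is allowed to backfill only if it fits in the idle cpus and its requested walltime ends before the next large job is due to start. The function name and all numbers are made up for this example.

```shell
#!/bin/bash
# Toy model, not Moab's real algorithm: a job may backfill only if it fits in
# the idle cpus AND its requested walltime ends before the next big job starts.
can_backfill() {
    local idle_cpus=$1 window_secs=$2 job_cpus=$3 job_walltime_secs=$4
    [ "$job_cpus" -le "$idle_cpus" ] && [ "$job_walltime_secs" -le "$window_secs" ]
}

# 16 cpus are idle for the next 2 hours (7200 s).
if can_backfill 16 7200 8 3600; then echo "1 h request: backfills"; fi
if ! can_backfill 16 7200 8 86400; then echo "24 h request: waits"; fi
```

The same 8-cpu job is backfilled with a 1 hour request but not with a padded 24 hour request, even if it would really finish in under an hour.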

18
Queuing Strategies
  • The showbf command
  • Shows the cpus available right now, and for how
    long.
  • Checkpointing
  • If your code does checkpointing you can exploit
    backfill by queuing jobs that fit into small
    gaps, even if they will not run to completion
  • A good idea in general, in case of hardware
    failure
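
The checkpointing idea can be sketched as follows. This is an illustration only: the step counter stands in for your code's real state, and run_job, the checkpoint file, and the step limits are all made up for this example. Each call does a limited number of steps, as a short backfill slot would allow, and the next call resumes where the previous one stopped.

```shell
#!/bin/bash
# Minimal checkpoint/restart sketch. A real code would save its own state;
# here the "state" is just the number of the last completed step.
run_job() {
    local checkpoint=$1 total_steps=$2 max_steps_this_run=$3
    local step=0 done_now=0
    # Resume from the last completed step if a non-empty checkpoint exists.
    [ -s "$checkpoint" ] && step=$(cat "$checkpoint")
    while [ "$step" -lt "$total_steps" ] && [ "$done_now" -lt "$max_steps_this_run" ]; do
        step=$((step + 1))
        done_now=$((done_now + 1))
        # ... one unit of real work would go here ...
        echo "$step" > "$checkpoint"    # record progress after each step
    done
    echo "stopped at step $step of $total_steps"
}

ckpt=$(mktemp)
run_job "$ckpt" 10 4    # first short slot
run_job "$ckpt" 10 4    # second slot resumes from the checkpoint
run_job "$ckpt" 10 4    # third slot finishes the job
rm -f "$ckpt"
```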