Wolfgang Friebel,

1
Installing and Running SGE at DESY (Zeuthen)

Wolfgang Friebel,
15.10.2001
HEPiX Meeting Berkeley

2
Introduction

Motivations for using a batch system:
more effective usage of available computers (e.g. more uniform load)
usage of resources 24h/day
assignment of resources according to policies (who gets how much CPU when)
quicker execution of tasks (system knows the most powerful, least loaded nodes)
Our goal:
You tell the batch system a script name and what you need in terms of disk space, memory, CPU time
The batch system guarantees the fastest possible turnaround
Could even be used to get xterm windows on the least loaded machines for interactive use

3
Batch Systems Overview

Condor: targeted at using idle workstations (not used at DESY)
NQS: public domain and commercial versions, basic functionality. Used for APE100 projects
Loadleveler: mostly found on IBM machines, used at DESY
LSF: popular, rich set of features, licensed software, used at DESY
PBS: public domain and commercial versions, origin NASA, rich set of features, became popular recently, used in H1
Codine/GRD: batch system similar to LSF in functionality, used in HERA-B and for all farms at DESY Zeuthen
SGE/SGEEE: Sun Grid Engine (Enterprise Edition), open source successors of Codine/GRD. Became the only batch system at Zeuthen (except for the legacy APE100 batch system)

4
The old Batch System Concept

Each group ran a separate cluster with separate instances of GRD or Codine
Project priorities within a group were maintained by configuring several queues reflecting the priorities
Queues were named after priority, e.g. long, medium, short, idle, ...
Could also be named according to task, e.g. simulation, production, test, ...
Individuals had to obey group dependent rules to submit jobs
Priorities between different groups were realized by the cluster size (CPU power)
Urgent tasks were handled by asking other groups for temporary use of their cluster
Administrative overhead to enable accounts on machines
Users had to adapt their batch jobs to the new environment
There were always heavily overloaded clusters next to machines with lots of idle CPU cycles

5
A new Scheme for Batch Processing

Two factors led us to design a new batch processing scheme:
shortcomings of the old system, especially the non-uniform usage pattern
licensing situation: our GRD license ended and we wanted to move to the open source successor of GRD
One central batch system for all groups
dynamic allocation of resources according to the current needs of groups
more uniform configuration of batch nodes
Very few queue types
basically only two types: a queue for ordinary batch jobs and an idle queue
most of the scheduling decisions are based on other mechanisms (see below)
Resource requests for jobs determine queuing
Resource definition is based on the concept of complexes (explained later)
Users should request resources if the defaults are not well suited for their jobs
Bookkeeping of resources within the batch system
6
The Sun Grid Engine Components

Components of the system:
Queues: contain information on the number of jobs and the job characteristics that are allowed on a given host. Jobs need to fit into a queue to get executed. Queues are bound to specific hosts.
Resources: features of hosts or queues that are known to SGE. Resource attributes are defined in so-called (global, host, queue and user defined) complexes.
Projects: contain lists of users (usersets) that are working together. The importance relative to other projects may be defined using shares.
Policies: algorithms that define which jobs are scheduled to which queues and how the priority of running jobs has to be set. SGEEE knows functional, share based, urgency based and override policies.
Shares: SGEEE can use a pool of tickets to determine the importance of jobs. The pool of tickets owned by a project/job etc. is called its share.

7
Benefits Using the SGEEE Batch System

For users:
jobs get executed on the most suitable (least loaded, fastest) machine
fair scheduling according to defined sharing policies
no one else can overuse the system and provoke system degradation
users need no knowledge of host names where their jobs can run
quick access to load parameters of all managed hosts
For administrators:
one time allocation of resources to users, projects, groups
no manual intervention to guarantee policies
reconfiguration of the running system (to adapt to changing usage patterns)
easy monitoring of hosts and jobs

8
Policies for the Job handling within SGEEE

Within SGEEE tickets are used to distribute the workload
User based functional policy:
Tickets are assigned to projects, users and jobs. More tickets mean higher priority and faster execution (if concurrent jobs are running on a CPU)
Share based policy:
Certain fractions of the system resources (shares) can be assigned to projects and users. Projects and users receive those shares over a configurable moving time window (e.g. CPU usage for a month based on usage during the past month)
Deadline policy:
By redistributing tickets the system can assign jobs an increasing weight to meet a certain deadline. Can be used by authorized users only
Override policy:
Sysadmins can give additional tickets to jobs, users or projects to temporarily adjust their relative importance.
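
The idea that more tickets mean a larger share of the machine can be illustrated with a small calculation. The sketch below is only a simplified model of proportional sharing, not the scheduler algorithm SGEEE actually uses; the project names and ticket counts are invented for the example.

```shell
#!/bin/sh
# Simplified model: each project's share is its fraction of the ticket pool.
# Reads "name tickets" lines on stdin, prints "name percent" lines.
ticket_shares() {
    awk '{ name[NR] = $1; tickets[NR] = $2; total += $2 }
         END {
             if (total > 0)
                 for (i = 1; i <= NR; i++)
                     printf "%s %.1f%%\n", name[i], 100 * tickets[i] / total
         }'
}

# Hypothetical ticket assignment for three projects.
printf '%s\n' 'projectA 500' 'projectB 300' 'projectC 200' | ticket_shares
```

In this model an override by a sysadmin corresponds to raising one project's ticket count for a while, which shrinks the percentage of all the others.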

9
Classes of Hosts and Users

Submit Host: node that is allowed to submit jobs (qsub) and query their status
Exec Host: node that is allowed to run (and submit) jobs
Admin Host: node from which admin commands may be issued
Master Host: node controlling all SGE activity, collecting status information, keeping access control lists etc.
A given host can have any mixture of the roles above
Administrator: user that is allowed to fully control SGE
Operator: user with admin privileges who is not allowed to change the queue configuration
Owner: user that is allowed to suspend jobs in queues he owns or to disable owned queues
User: can manipulate only his own jobs

10
The Zeuthen SGEEE Installation

SGEEE built from the source with AFS support
Another system (SGE with AFS) was built for the HERA-B experiment
Two separate clusters (no mix of operating systems):
95 Linux nodes in the default SGEEE cell
Other Linux machines (public login) used as submit hosts
17 HP-UX nodes in cell hp
A cell is a separate pool of nodes controlled by a master node

11
The Zeuthen SGEEE Installation

In production since 9/2001
Smooth migration from the old system:
Two batch systems were running in parallel for a limited time
Coexistence of old queue configuration scheme and the new one
Ongoing tuning of the new system:
Initial goal was to reestablish functionality of the old system
Now step by step changes towards a truly homogeneous system
Initially some projects were bound to subgroups of hosts

12
Our Queue Concept

one queue per CPU with large time limit and low priority
users have to specify at least a CPU time limit (usually much smaller)
Users can request other resources (memory, disk) differing from default values
optionally a second queue that gets suspended as soon as there are jobs in the first queue (idle queue)
interactive use is possible because of low batch priority
relation between jobs, users and projects is respected because of sharing policies

13
Complexes within SGE

Complexes are containers for resource definitions
Resources can be requested by a batch job:
You can have hard requests that need to be fulfilled (e.g. host architecture)
Soft requests are fulfilled if possible
The actual value for some resource parameters is known:
Amount of available main memory or disk space can be used for decisions
Arbitrary "load sensors" can be written to measure resource parameters
Resources can be reserved for the current job:
Parameters can be made "consumable". The requested portion of a resource gets subtracted from the currently available value of that resource parameter
The most important parameters are known to SGEEE:
Parameters like CPU time, virtual free memory etc. are built in already
To be used, some of them need to be activated in the configuration
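
The bookkeeping behind a "consumable" parameter can be pictured as a counter that each accepted job decrements. The following sketch only models that idea in plain shell; the variable name, the unit (MB of scratch space) and the numbers are assumptions for illustration and do not reflect how SGEEE stores its complexes.

```shell
#!/bin/sh
# Hypothetical consumable: free scratch space on one host, in MB.
available=10000

# Try to reserve $1 MB. Succeeds and subtracts the amount only if
# enough of the resource is still available.
reserve() {
    if [ "$1" -le "$available" ]; then
        available=$((available - $1))
        return 0
    fi
    return 1
}

reserve 6000 && echo "job 1 accepted, $available MB left"
reserve 6000 || echo "job 2 must wait, only $available MB left"
```

The second request fails because the first one already consumed 6000 of the 10000 MB, which is the behaviour a job sees when the resource it asks for is currently booked by others.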

14
Our Complexes Concept

Users have to specify for a job:
Time limit (CPU time)
Users can request for a job:
A certain amount of virtual and real free memory
The existence of one or two scratch disks (coming soon)
The available free disk space for a given scratch disk
To have a guaranteed amount of disk space reserved
More hardware oriented features like:
Using only machines from a subcluster (farm)
Run on a specific host (not recommended)

15
Experiences

System is easy to use from a user's point of view
System is highly configurable (needs some time to find the optimum policies to implement)
System is very stable
crashing jobs are mostly due to failing token renewal (our plugin procedure based on arc and batchtkauth)
other failures are due to missing (on purpose!) path aliases for the automounter
System dynamically adapts process priorities to meet share policies or to keep up with changing policies
SGE(EE) maintainers are very active and keep implementing new ideas
quick incorporation of patches, reported bugs get fixed asap.

16
Advanced Use of SGEEE

Using the perl API:
every aspect of the batch system is accessible through the perl API
the perl API becomes available after "use SGE" in perl scripts
there is almost no documentation, but there are a few sample scripts in /afs/ifh.de/user/f/friebel/public and in /afs/ifh.de/products/source/gridengine/source/experimental/perlgui
Using the load information reported by SGEEE:
each host reports a number of load values to the master host (qmaster)
there is a default set of load parameters that are always reported
further parameters can be reported by writing load sensors
qhost is a simple interface to display that information
a powerful monitoring system could be built around that feature, which is based on the built-in "Performance Data Collection" (PDC) subsystem
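
As a small illustration of building on the reported load values, the sketch below picks the least loaded host from a qhost-style table. The column layout (two header lines, host name in column 1, load in column 4), the host names and the numbers are assumptions made for this example; the real qhost output at a given site should be checked before relying on it.

```shell
#!/bin/sh
# Print the host with the lowest load from a qhost-style table on stdin.
# Assumed layout: two header lines, host name in column 1, load in column 4.
# Rows without a load value ("-") are skipped.
least_loaded() {
    awk 'NR > 2 && $4 != "-" && (best == "" || $4 + 0 < min) {
             min = $4 + 0; best = $1
         }
         END { if (best != "") print best }'
}

# Illustrative sample data, not real output.
least_loaded <<'EOF'
HOSTNAME   ARCH    NPROC  LOAD  MEMTOT  MEMUSE
---------------------------------------------
global     -       -      -     -       -
node01     glinux  2      1.95  1.0G    400M
node02     glinux  2      0.12  1.0G    120M
node03     glinux  2      0.80  1.0G    300M
EOF
```

On a live system the here-document would be replaced by a pipe from qhost itself.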

17
Conclusions

Ease of installation from source
Access to source code
Chance of integration into a monitoring system
API for C and Perl
Excellent load balancing mechanisms
Managing the requests of concurrent groups
Mechanisms for recovery from machine crashes
Fallback solutions for dying daemons
Weakest point is the AFS integration and token prolongation mechanism (basically the same code as for Loadleveler and for older LSF versions)

18
Conclusions

SGEEE has all ingredients to build a company wide batch infrastructure
Allocation of resources according to policies ranging from departmental policies to individual user policies
Dynamic adjustment of priorities for running jobs to meet policies
Supports interactive jobs, array jobs, parallel jobs
Can be used with Kerberos (4 and 5) and AFS
SGEEE is open source, maintained by Sun
Getting deeper knowledge by studying the code
Can enhance the code (examples: more schedulers, tighter AFS integration, monitoring only daemons)
Code is centrally maintained by a core developer team
Could play a more important role in HEP (component of a grid environment, open industry grade batch system as recommended solution within HEPiX?)

19
References

http://gridengine.sunsource.net/servlets/ProjectSource
Download page for the source code of SGE(EE)
http://www.arl.hpc.mil/docs/grd/
lots of docs from Raytheon
http://supportforum.Sun.COM/gridengine/
Support forum, mailing lists
http://hoover.hpac.tudelft.nl/cugs98cd/S98PROC/AUTHORS/FERSTL/INDEX.HTM
GRD on a conference, 1998
http://www-zeuthen.desy.de/computing/services/batch/
Zeuthen pages with URL to the reference manual
http://www-zeuthen.desy.de//batch/sge53.pdf
The SGEEE reference manual, user and installation guide

20
Technical Details of SGEEE (not presented)

Submitting Jobs
The graphical interface qmon
Job submission and file systems
Sample job script
Advanced usage of qsub
Abnormal job termination

21
Submitting Jobs

Requirements for submitting jobs:
have a valid token (verify with tokens), otherwise obtain a new one (klog)
ensure that in your .tcshrc or .zshrc no commands are executed that need a terminal (tty) (users often have an stty command in their startup scripts)
you are within batch if the env variable JOB_NAME is set or if the env variable ENVIRONMENT is set to BATCH
Submitting a job:
specify what resources you need (-l option) and what script should be executed
qsub -l t=1:00:00 job_script
in the simplest case the job script contains 1 line, the name of the executable
many more options are available
alternatively use the graphical interface to submit jobs
qmon
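
The rule about terminal commands in startup scripts can be followed by guarding them with the batch test described above. The sketch below uses sh/zsh syntax (a .tcshrc needs the equivalent csh constructs); the function name in_batch and the particular stty setting are illustrative choices, not part of the Zeuthen setup.

```shell
# Fragment for a shell startup file such as .zshrc.
# True when running inside the batch system: JOB_NAME is set,
# or ENVIRONMENT equals BATCH.
in_batch() {
    [ -n "${JOB_NAME:-}" ] || [ "${ENVIRONMENT:-}" = "BATCH" ]
}

# Run terminal-only commands only outside batch and only on a real tty.
if ! in_batch && [ -t 0 ]; then
    stty erase '^?'   # example setting; any command needing a tty goes here
fi
```

The extra [ -t 0 ] check also protects other non-interactive uses of the shell, for example remote commands run without a terminal.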

22
The Submit Window of qmon

23
Job Submission and File Systems

Current working directory:
the directory from where the qsub command was called. By default, STDOUT and STDERR of a job go into files that are created in $HOME. Because of quota limits and archiving policies that is not recommended.
With the -cwd option to qsub the files get created in the current working directory. For performance reasons that should be on a local file system
If the cwd is in NFS space, the batch system must not use the real mount point; the path has to be translated according to /usr/SGE/default/common/sge_aliases. As every job stores the full info from sge_aliases, it is advantageous to get rid of that file and to discourage the use of NFS as current working directory
If required, create your own $HOME/.sge_aliases file
Local file space (Zeuthen policies):
/usr1/tmp is guaranteed to exist on all linux nodes and typically has > 10 GB
/data exists on some linux nodes and typically has > 15 GB capacity. A job can request the existence of /data by -l datadir
$TMPDIR is a unique directory below /usr1/tmp that gets erased at the end of the job. Normal jobs should make use of that mechanism if possible
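
A job that follows the TMPDIR recommendation might start like the sketch below. The fallback to /tmp, the subdirectory name and the file name are assumptions added so the fragment also runs outside the batch system, where nothing cleans up automatically.

```shell
#!/bin/sh
# Work in the per-job scratch area if the batch system provides one,
# otherwise fall back to /tmp (e.g. when testing interactively).
scratch=${TMPDIR:-/tmp}

# Private working directory; removed on exit so that a test run outside
# the batch system leaves nothing behind either.
workdir=$(mktemp -d "$scratch/job.XXXXXX") || exit 1
trap 'rm -rf "$workdir"' EXIT

cd "$workdir" || exit 1
echo "intermediate result" > step1.out   # placeholder for real work
```

Results that must survive the job still have to be copied to permanent storage before the script ends.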

24
A Simple Job Script

#!/bin/zsh
# otherwise the default shell would be used
#$ -S /bin/zsh
# the time limit for this job
#$ -l t=0:30:00
# merge STDERR into STDOUT
#$ -j y
WORKDIR=/usr1/tmp/$LOGNAME/$JOB_ID
DATADIR=/net/ilos/h1data7
echo using working directory $WORKDIR
mkdir -p $WORKDIR
cp $DATADIR/large_input $WORKDIR
cd $WORKDIR
h1_reco
cp large_out $DATADIR
# remove the working directory only if the output was copied successfully
if [[ -s large_out && -s $DATADIR/large_out ]]; then
    cd; rm -r $WORKDIR
fi

25
Advanced Usage of qsub

Option files:
instead of giving qsub options on the command line, users may store those in .sge_request files in their $HOME or current working directories
content of a sample .sge_request file
-cwd -S /usr/local/bin/perl -j y -l t=24:00:00
Array jobs:
SGE allows scheduling n identical jobs with one qsub call using the -t option
qsub -t 1-10 array_job_script
within the script use the variable SGE_TASK_ID to select different inputs and write to distinct output files (SGE_TASK_ID is 1...10 in the example above)
Conditional job execution:
jobs can be scheduled to wait for dependent jobs to finish successfully (rc=0)
jobs can be submitted in hold state (needs to be released by user or operator)
jobs can be told not to start before a given date
start dependent jobs on the same host (using qalter -q $QUEUE ... within the script)
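
For the array jobs described above, a task script could derive its file names from SGE_TASK_ID as sketched here. The naming scheme input.N.dat / output.N.dat and the helper task_files are hypothetical; the default of 1 only exists so the script can be tried outside the batch system.

```shell
#!/bin/sh
# Map a task number to its input and output file names
# (hypothetical naming scheme).
task_files() {
    echo "input.$1.dat output.$1.dat"
}

# SGE_TASK_ID is set by the batch system for each task of an array job.
task=${SGE_TASK_ID:-1}
set -- $(task_files "$task")
echo "task $task: reading $1, writing $2"
```

Submitted with qsub -t 1-10, each of the ten tasks would then read and write its own pair of files.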

26
Abnormal Job Termination

Termination because of CPU limit exceeded:
jobs get an XCPU signal that can be caught by the job. In that case termination procedures can be executed before the SIGKILL signal is sent
SIGKILL will be sent a few minutes after XCPU was sent. It cannot be caught.
Restart after execution host crashes:
if a host crashes while a given job is running, the job will be restarted. In that case the variable RESTARTED is set to 1
The job will be reexecuted from the beginning on any free host. If the job can be restarted using some results achieved so far, then the variable RESTARTED can be checked. The job can be forced to be executed on the same host by inserting
qalter -q $QUEUE $JOB_ID
literally in the job script
Signaling the end of the job:
with the qsub option -notify a warning signal is sent to the job a few minutes before the job is suspended (SIGUSR1) or terminated (SIGUSR2)
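
The signal handling and restart check can be combined in a job script roughly as follows. This is a sketch under assumptions: the checkpoint file name and the helper functions are invented, and the handlers only print messages where a real job would save its state. Trapping USR2 in addition to USR1 follows the qsub documentation of -notify as far as known; the slide itself only mentions SIGUSR1.

```shell
#!/bin/bash
# Hypothetical file holding partial results of an earlier run.
checkpoint=checkpoint.dat

# Called when the CPU limit is exceeded; SIGKILL follows later
# and cannot be caught.
on_cpu_limit() {
    echo "CPU limit reached, saving state" >&2
    exit 1
}
trap on_cpu_limit XCPU

# Warning signals delivered when the job was submitted with -notify.
trap 'echo "warning: job is about to be suspended or terminated" >&2' USR1 USR2

# Decide whether to resume: only after a restart and only if usable
# partial results exist.
start_mode() {
    if [ "${RESTARTED:-0}" = "1" ] && [ -s "$checkpoint" ]; then
        echo resume
    else
        echo fresh
    fi
}

echo "starting in $(start_mode) mode"
```

Checking for a non-empty checkpoint in addition to RESTARTED avoids resuming from a file that was never written before the host crashed.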