1
Installing and Running SGE at DESY (Zeuthen)
  • Wolfgang Friebel
  • 15.10.2001
  • HEPiX Meeting Berkeley

2
Introduction
  • Motivations for using a batch system
  • more effective usage of available computers
    (e.g. more uniform load)
  • usage of resources 24h/day
  • assignment of resources according to policies
    (who gets how much CPU when)
  • quicker execution of tasks (the system knows the
    most powerful, least loaded nodes)
  • Our goal
  • You tell the batch system a script name and what
    you need in terms of disk space, memory, CPU time
  • The batch system guarantees fastest possible
    turnaround
  • Could even be used to get xterm windows on the
    least loaded machines for interactive use
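SGE ships the qsh command for exactly this case: it schedules an xterm on a lightly loaded host in the same way as a batch job. A minimal sketch (the CPU time request is a hypothetical value):

    # open an xterm on the least loaded suitable machine
    qsh -l t=3600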

3
Batch Systems Overview
  • Condor: targeted at using idle workstations (not
    used at DESY)
  • NQS: public domain and commercial versions, basic
    functionality. Used for APE100 projects
  • Loadleveler: mostly found on IBM machines, used
    at DESY
  • LSF: popular, rich set of features, licensed
    software, used at DESY
  • PBS: public domain and commercial versions,
    originated at NASA. Rich set of features, became
    popular recently, used in H1
  • Codine/GRD: batch system similar to LSF in
    functionality, used in HERA-B and for all farms
    at DESY Zeuthen
  • SGE/SGEEE: Sun Grid Engine (Enterprise Edition),
    open source successors of Codine/GRD. Became the
    only batch system at Zeuthen (except for the
    legacy APE100 batch system)

4
The old Batch System Concept
  • Each group runs a separate cluster with separate
    instances of GRD or Codine
  • Project priorities within a group are maintained
    by configuring several queues reflecting the
    priorities
  • Queues were named after priority, e.g. long,
    medium, short, idle, ...
  • Could also be named according to task, e.g.
    simulation, production, test, ...
  • Individuals had to obey group dependent rules to
    submit jobs
  • Priorities between different groups were realized
    by the cluster size (CPU power)
  • Urgent tasks were handled by asking other groups
    for permission to temporarily use their cluster
  • Administrative overhead to enable accounts on
    machines
  • Users had to adapt their batch jobs to the new
    environment
  • There were always heavily overloaded clusters
    next to machines with lots of idle CPU cycles

5
A new Scheme for Batch Processing
  • Two factors led us to design a new batch
    processing scheme
  • shortcomings of the old system, especially the
    non uniform usage pattern
  • licensing situation: our GRD license ended, and
    we wanted to move to the open source successor
    of GRD
  • One central batch system for all groups
  • dynamic allocation of resources according to the
    current needs of groups
  • more uniform configuration of batch nodes
  • Very few queue types
  • basically only two types: a queue for ordinary
    batch jobs and an idle queue
  • most of the scheduling decisions based on other
    mechanisms (see below)
  • Resource requests for jobs determine queuing
  • Resource definition based on the concept of
    complexes (explained later)
  • User should request resources if the defaults are
    not well suited for the jobs
  • Bookkeeping of resources within the batch system

6
The Sun Grid Engine Components
  • Components of the system
  • Queues: contain information on the number of jobs
    and the job characteristics that are allowed on a
    given host. Jobs need to fit into a queue to get
    executed. Queues are bound to specific hosts.
  • Resources: features of hosts or queues that are
    known to SGE. Resource attributes are defined in
    so called complexes (global, host, queue and user
    defined)
  • Projects: contain lists of users (usersets) that
    are working together. The relative importance to
    other projects may be defined using shares.
  • Policies: algorithms that define which jobs are
    scheduled to which queues and how the priority of
    running jobs has to be set. SGEEE knows
    functional, share based, deadline and override
    policies
  • Shares: SGEEE can use a pool of tickets to
    determine the importance of jobs. The pool of
    tickets owned by a project/job etc. is called a
    share
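As a rough illustration, these components can be inspected with qconf; option spellings vary slightly between SGE versions, and the queue name below is hypothetical:

    qconf -sql           # list all queues
    qconf -sq default.q  # show the configuration of one queue
    qconf -scl           # list the defined complexes
    qconf -sprjl         # list the SGEEE projects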

7
Benefits of Using the SGEEE Batch System
  • For users
  • jobs get executed on the most suitable (least
    loaded, fastest) machine
  • fair scheduling according to defined sharing
    policies
  • no one else can overuse the system and provoke
    system degradation
  • users need no knowledge of host names where their
    jobs can run
  • quick access to load parameters of all managed
    hosts
  • For administrators
  • one-time allocation of resources to users,
    projects and groups
  • no manual intervention to guarantee policies
  • reconfiguration of the running system (to adapt
    to changing usage patterns)
  • easy monitoring of hosts and jobs

8
Policies for the Job handling within SGEEE
  • Within SGEEE tickets are used to distribute the
    workload
  • User based functional policy
  • Tickets are assigned to projects, users and jobs.
    More tickets mean higher priority and faster
    execution (if concurrent jobs are running on a
    CPU)
  • Share based policy
  • Certain fractions of the system resources
    (shares) can be assigned to projects and users.
  • Projects and users receive their shares within a
    configurable moving time window (e.g. CPU usage
    for a month is based on the usage during the past
    month)
  • Deadline policy
  • By redistributing tickets the system can assign
    jobs an increasing weight to meet a certain
    deadline. Can be used by authorized users only
  • Override policy
  • Sysadmins can give additional tickets to jobs,
    users or projects to temporarily adjust their
    relative importance.
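A sketch of how these policies surface on the command line in an SGEEE installation of this vintage (the project name is hypothetical):

    qstat -ext          # extended job listing, including ticket counts
    qconf -sstree       # display the share tree of the share based policy
    qconf -sprj hera-b  # show a project; its oticket field holds
                        # override tickets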

9
Classes of Hosts and Users
  • Submit host: node that is allowed to submit jobs
    (qsub) and query their status
  • Exec host: node that is allowed to run (and
    submit) jobs
  • Admin host: node from which admin commands may be
    issued
  • Master host: node controlling all SGE activity,
    collecting status information, keeping access
    control lists etc.
  • A certain host can have any mixture of the roles
    above
  • Administrator: user that is allowed to fully
    control SGE
  • Operator: user with admin privileges, who is not
    allowed to change the queue configuration
  • Owner: user that is allowed to suspend jobs in
    queues he owns or to disable owned queues
  • User: can manipulate only his own jobs
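A minimal sketch of how these roles are assigned with qconf (host and user names are hypothetical):

    qconf -as pub1.ifh.de  # register a submit host
    qconf -ah pub1.ifh.de  # register an admin host
    qconf -am alice        # grant administrator (manager) rights
    qconf -ao bob          # grant operator rights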

10
The Zeuthen SGEEE Installation
  • SGEEE built from the source with AFS support
  • Another system (SGE with AFS) was built for the
    HERA-B experiment
  • Two separate clusters (no mix of operating
    systems)
  • 95 Linux nodes in default SGEEE cell
  • Other Linux machines (public login) used as
    submit hosts
  • 17 HP-UX nodes in cell hp
  • A cell is a separate pool of nodes controlled by
    a master node

11
The Zeuthen SGEEE Installation
  • In production since 9/2001
  • Smooth migration from the old system
  • Two batch systems were running in parallel for a
    limited time
  • Coexistence of old queue configuration scheme and
    the new one
  • Ongoing tuning of the new system
  • Initial goal was to reestablish functionality of
    the old system
  • Now step by step changes towards a truly
    homogeneous system
  • Initially some projects were bound to subgroups
    of hosts

12
Our Queue Concept
  • one queue per CPU with a large time limit and low
    priority
  • users have to specify at least a CPU time limit
    (usually much smaller)
  • Users can request other resources (memory, disk)
    differing from default values
  • optionally a second queue that gets suspended as
    soon as there are jobs in the first queue (idle
    queue)
  • interactive use is possible because of the low
    batch priority
  • the relative importance of jobs, users and
    projects is respected because of the sharing
    policies
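The idle queue behaviour can be realized with SGE queue subordination; a sketch with hypothetical queue names:

    # in the configuration of the normal queue (qconf -mq batch.q):
    subordinate_list   idle.q=1
    # idle.q is then suspended as soon as one slot of batch.q is in use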

13
Complexes within SGE
  • Complexes are containers for resource definitions
  • Resources can be requested by a batch job
  • You can have hard requests that need to be
    fulfilled (e.g. host architecture)
  • Soft requests are fulfilled if possible
  • The actual value for some resource parameters is
    known
  • Amount of available main memory or disk space can
    be used for decisions
  • Arbitrary "load sensors" can be written to
    measure resource parameters
  • Resources can be reserved for the current job
  • Parameters can be made "consumable". A portion of
    a requested resource gets subtracted from the
    value of the currently available resource
    parameter
  • The most important parameters are known to SGEEE
  • Parameters like CPU time, virtual free memory
    etc. are built in already
  • Some of them need to be activated in the
    configuration before they can be used
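A sketch of hard and soft requests on the qsub command line (t is the Zeuthen CPU time complex used throughout this talk; the architecture value is an assumption for SGE 5.x Linux hosts):

    # hard requests must be fulfilled, soft requests if possible
    qsub -hard -l t=10000 -soft -l arch=glinux job_script
    # request a consumable: 500 MB of virtual free memory
    qsub -l t=10000,vf=500M job_script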

14
Our Complexes Concept
  • Users have to specify for a job
  • Time limit (CPU time)
  • Users can request for a job
  • A certain amount of virtual and real free memory
  • The existence of one or two scratch disks
  • (coming soon)
  • The available free disk space for a given scratch
    disk
  • To have a guaranteed amount of disk space
    reserved
  • More hardware oriented features like
  • Using only machines from a subcluster (farm)
  • Run on a specific host (not recommended)
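Put together, a typical Zeuthen request might look like this (values hypothetical; t and datadir are the complexes described in this talk):

    # 2.5 h of CPU time, 300 MB virtual memory, a node with a /data disk
    qsub -l t=9000,vf=300M,datadir job_script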

15
Experiences
  • The system is easily usable from a user's point
    of view
  • The system is highly configurable (it takes some
    time to find the optimum policies to implement)
  • The system is very stable
  • crashing jobs are mostly due to failing token
    renewal (our plugin procedure based on arc and
    batchtkauth)
  • other failures are due to missing (on purpose!)
    path aliases for the automounter
  • The system dynamically adapts process priorities
    to meet share policies or to keep up with
    changing policies
  • The SGE(EE) maintainers are very active and keep
    implementing new ideas
  • quick incorporation of patches; reported bugs get
    fixed ASAP

16
Advanced Use of SGEEE
  • Using the perl API
  • every aspect of the batch system is accessible
    through the Perl API
  • the Perl API is available after a "use SGE"
    statement in Perl scripts
  • there is almost no documentation, but a few
    sample scripts exist in
    /afs/ifh.de/user/f/friebel/public and in
    /afs/ifh.de/products/source/gridengine/source/
    experimental/perlgui
  • Using the load information reported by SGEEE
  • each host reports a number of load values to the
    master host (qmaster)
  • there is a default set of load parameters that
    are always reported
  • further parameters can be reported by writing
    load sensors
  • qhost is a simple interface to display that
    information
  • a powerful monitoring system could be built
    around that feature, which is based on the built
    in "Performance Data Collection" (PDC) subsystem
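A sketch of qhost usage (the architecture value is an assumption for SGE 5.x):

    qhost                  # one line of load values per exec host
    qhost -l arch=glinux   # only hosts matching a resource request
    qhost -F               # full listing of all reported load parameters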

17
Conclusions
  • Ease of installation from source
  • Access to source code
  • Chance of integration into a monitoring system
  • API for C and Perl
  • Excellent load balancing mechanisms
  • Managing the requests of concurrent groups
  • Mechanisms for recovery from machine crashes
  • Fallback solutions for dying daemons
  • Weakest point is the AFS integration and the
    token prolongation mechanism (basically the same
    code as for Loadleveler and for older LSF
    versions)

18
Conclusions
  • SGEEE has all the ingredients to build a company
    wide batch infrastructure
  • Allocation of resources according to policies
    ranging from departmental policies to individual
    user policies
  • Dynamic adjustment of priorities for running jobs
    to meet policies
  • Supports interactive jobs, array jobs, parallel
    jobs
  • Can be used with Kerberos (4 and 5) and AFS
  • SGEEE is open source maintained by Sun
  • Getting deeper knowledge by studying the code
  • The code can be enhanced (examples: more
    schedulers, tighter AFS integration, monitoring
    only daemons)
  • Code is centrally maintained by a core developer
    team
  • Could play a more important role in HEP
    (component of a grid environment, open industry
    grade batch system as recommended solution within
    HEPiX?)

19
References
  • http://gridengine.sunsource.net/servlets/ProjectSource
  • Download page for the source code of SGE(EE)
  • http://www.arl.hpc.mil/docs/grd/
  • lots of docs from Raytheon
  • http://supportforum.sun.com/gridengine/
  • Support forum, mailing lists
  • http://hoover.hpac.tudelft.nl/cugs98cd/S98PROC/AUTHORS/FERSTL/INDEX.HTM
  • GRD presented at a conference in 1998
  • http://www-zeuthen.desy.de/computing/services/batch/
  • Zeuthen pages with a URL to the reference manual
  • http://www-zeuthen.desy.de//batch/sge53.pdf
  • The SGEEE reference manual, user and installation
    guide

20
Technical Details of SGEEE (not presented)
  • Submitting Jobs
  • The graphical interface qmon
  • Job submission and file systems
  • Sample job script
  • Advanced usage of qsub
  • Abnormal job termination

21
Submitting Jobs
  • Requirements for submitting jobs
  • have a valid token (verify with tokens),
    otherwise obtain a new one (klog)
  • ensure that your .tcshrc or .zshrc does not
    execute commands that need a terminal (tty);
    users often have a stty command in their startup
    scripts (see the sketch after this list)
  • you are within batch if the env variable JOB_NAME
    is set or if the env variable ENVIRONMENT is set
    to BATCH
  • Submitting a job
  • specify what resources you need (-l option) and
    which script should be executed
  • qsub -l t=10000 job_script
  • in the simplest case the job script contains one
    line, the name of the executable
  • many more options available
  • alternatively use the graphical interface to
    submit jobs
  • qmon
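The sketch referenced above: guarding terminal dependent commands in a startup script, using the batch indicators named in this slide:

    # in ~/.zshrc: only touch the terminal outside of batch jobs
    if [[ -z $JOB_NAME && $ENVIRONMENT != BATCH ]]; then
      stty erase '^?'   # would fail without a tty inside a batch job
    fi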

22
The Submit Window of qmon
23
Job Submission and File Systems
  • Current working directory
  • the directory from where the qsub command was
    called. STDOUT and STDERR of a job go into files
    that are created in $HOME. Because of quota
    limits and archiving policies that is not
    recommended.
  • With the -cwd option to qsub the files get
    created in the current working directory. For
    performance reasons that should be on a local
    file system
  • If cwd is in NFS space, the batch system must not
    use the real mount point; the path is translated
    according to /usr/SGE/default/common/sge_aliases.
    As every job stores the full info from
    sge_aliases, it is advantageous to get rid of
    that file and to discourage the use of NFS
    directories as current working directories
  • If required, create your own $HOME/.sge_aliases
    file
  • Local file space (Zeuthen policies)
  • /usr1/tmp is guaranteed to exist on all Linux
    nodes and typically has > 10 GB capacity
  • /data exists on some Linux nodes and typically
    has > 15 GB capacity. A job can request the
    existence of /data by -l datadir
  • TMPDIR is a unique directory below /usr1/tmp that
    gets erased at the end of the job. Normal jobs
    should make use of that mechanism if possible
    (see the sketch below)
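A sketch of that usage, with the program and file names taken from the sample script on the next slide:

    cd $TMPDIR                          # unique scratch dir below /usr1/tmp
    cp /net/ilos/h1data7/large_input .  # stage the input
    h1_reco
    cp large_out /net/ilos/h1data7      # save results before TMPDIR is erased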

24
A Simple Job Script
  • #!/bin/zsh
  • # -S is needed, otherwise the default shell would be used
  • #$ -S /bin/zsh
  • # the CPU time limit for this job
  • #$ -l t=03000
  • #$ -j y
  • WORKDIR=/usr1/tmp/$LOGNAME/$JOB_ID
  • DATADIR=/net/ilos/h1data7
  • echo using working directory $WORKDIR
  • mkdir -p $WORKDIR
  • cp $DATADIR/large_input $WORKDIR
  • cd $WORKDIR
  • h1_reco
  • cp large_out $DATADIR
  • if [[ -s large_out && -s $DATADIR/large_out ]]; then
  •   cd; rm -r $WORKDIR
  • fi

25
Advanced Usage of qsub
  • Option files
  • instead of giving qsub options on the command
    line, users may store them in .sge_projects
    files in their $HOME or current working
    directories
  • content of a sample .sge_projects file
  • -cwd -S /usr/local/bin/perl -j y -l t=240000
  • Array jobs
  • SGE allows scheduling n identical jobs with one
    qsub call using the -t option (see the sketch
    after this list)
  • qsub -t 1-10 array_job_script
  • within the script use the variable SGE_TASK_ID to
    select different inputs and write to distinct
    output files (SGE_TASK_ID is 1...10 in the
    example above)
  • Conditional job execution
  • jobs can be scheduled to wait until jobs they
    depend on have finished successfully (rc=0)
  • jobs can be submitted in hold state (to be
    released by the user or an operator)
  • jobs can be told not to start before a given date
  • dependent jobs can be started on the same host
    (using qalter -q $QUEUE ... within the script)
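The sketch referenced above, combining array jobs and conditional execution (job id, script and file names are hypothetical):

    qsub -t 1-10 array_job_script     # inside: $SGE_TASK_ID runs from 1 to 10
    #   e.g. h1_reco < input.$SGE_TASK_ID > output.$SGE_TASK_ID
    qsub -hold_jid 4711 merge_script  # wait for job 4711 to finish first
    qsub -h job_script                # submit in hold state, release with qrls
    qsub -a 200111010800 job_script   # do not start before 2001-11-01 08:00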

26
Abnormal Job Termination
  • Termination because of CPU limit exceeded
  • jobs get an XCPU signal that can be caught by the
    job. In that case termination procedures can be
    executed before the SIGKILL signal is sent (see
    the sketch after this list)
  • SIGKILL will be sent a few minutes after XCPU was
    sent. It cannot be caught.
  • Restart after execution host crashes
  • if a host crashes while a given job is running,
    the job will be restarted. In that case the
    variable RESTARTED is set to 1
  • The job will be reexecuted from the beginning on
    any free host. If the job can be restarted using
    results achieved so far, the variable RESTARTED
    can be checked. The job can be forced to run on
    the same host by inserting
  • qalter -q $QUEUE $JOB_ID
  • literally in the job script
  • Signaling the end of the job
  • with the qsub option -notify a SIGUSR1 signal is
    sent to the job a few minutes before the job is
    suspended or terminated
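The sketch referenced above: a job script reacting to XCPU and to a restart, in the style of the sample script on slide 24 (file names hypothetical; zsh runs a TRAPXCPU function when the XCPU signal arrives):

    #!/bin/zsh
    #$ -S /bin/zsh
    #$ -l t=10000

    # called a few minutes before SIGKILL: save what can be saved
    TRAPXCPU() {
      cp partial_result /net/ilos/h1data7
      exit 1
    }

    # after an execution host crash the job starts from the beginning;
    # RESTARTED=1 allows it to pick up intermediate results
    if [[ $RESTARTED == 1 ]]; then
      cp /net/ilos/h1data7/partial_result .
    fi

    h1_reco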