Platform LSF - PowerPoint PPT Presentation

About This Presentation
Title:

Platform LSF


Slides: 16
Provided by: dta1
Tags: lsf | platform | resuming


Transcript and Presenter's Notes


1
Platform LSF
2
Cluster Contents
  • Cluster-level Architecture
  • Fault Tolerance
  • How LSF deals with Resources
  • How users interface with LSF
  • Security Architecture
  • Parallel Jobs
  • Extensibility

3
Cluster-level Architecture
4
Cluster-level Architecture
  • Four daemons run on each server host:
  • lim - Collects host load and configuration
    information and forwards it to the master LIM
    running on the master host.
  • pim - Collects information about job processes
    running on the host, such as CPU and memory used
    by the job, and reports the information to
    sbatchd.
  • res - Accepts remote execution requests to
    provide transparent and secure remote execution
    of jobs and tasks.
  • sbatchd - Receives the request to run the job
    from mbatchd and manages local execution of the
    job.
  • Two additional daemons run on the master host:
  • mbatchd - Receives job submission and
    information query requests. Manages jobs held in
    queues. Dispatches jobs to hosts as determined by
    mbschd.
  • mbschd - Makes scheduling decisions based on job
    requirements and policies. Works with and is
    started by mbatchd.
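
The daemon layout above can be summarized as a small lookup. This is only an illustrative sketch: the daemon names come from the slide, while the function name and data layout are assumptions made for the example.

```python
# Daemon names are taken from the slide; everything else is illustrative.
SERVER_DAEMONS = ("lim", "pim", "res", "sbatchd")
MASTER_ONLY_DAEMONS = ("mbatchd", "mbschd")


def daemons_for_host(is_master):
    """Return the daemons expected on a host, as described on the slide."""
    if is_master:
        # The master host is also a server host, so it runs all six daemons.
        return SERVER_DAEMONS + MASTER_ONLY_DAEMONS
    return SERVER_DAEMONS


print(daemons_for_host(False))
print(daemons_for_host(True))
```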

5
Fault Tolerance
  • System Level
  • The master host is the first host listed in the
    lsf.cluster.cluster_name file or is defined along
    with other candidate master hosts by
    LSF_MASTER_LIST in lsf.conf.
  • Every event in the system is logged to the
    lsb.events file, including all job submissions
    and job and host state changes.
  • If the master LIM becomes unavailable, a LIM on
    another host automatically takes over. The
    sbatchd on the new master starts the mbatchd.
  • The new mbatchd reads the lsb.events file to
    recover the state of the system.
  • When the new mbatchd starts up, it polls the
    sbatchd on each host to find the current
    status of its jobs.
  • If an sbatchd fails but the host is still
    running, jobs running on the host are not lost.
    When sbatchd is restarted it regains control of
    all jobs running on the host.
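
The recovery steps above can be sketched roughly as follows. This is a minimal model, not LSF's implementation: it assumes the new master is the first available host in candidate order, and it uses a hypothetical (kind, job_id, state) event tuple rather than the real lsb.events record format.

```python
from __future__ import annotations


def pick_master(candidates: list[str], available: set[str]) -> str | None:
    """Return the first candidate host that is currently available.

    Assumption: candidates are tried in list order, as in LSF_MASTER_LIST.
    """
    for host in candidates:
        if host in available:
            return host
    return None


def replay_events(events: list[tuple[str, int, str]]) -> dict[int, str]:
    """Rebuild job states by applying logged events in order.

    Each event is a hypothetical (kind, job_id, new_state) tuple; the real
    lsb.events format is different and is not modeled here.
    """
    jobs: dict[int, str] = {}
    for kind, job_id, new_state in events:
        if kind == "JOB_NEW":
            jobs[job_id] = "PEND"      # a newly submitted job starts pending
        elif kind == "JOB_STATUS":
            jobs[job_id] = new_state   # later events overwrite earlier state
    return jobs


log = [("JOB_NEW", 1, ""), ("JOB_STATUS", 1, "RUN"), ("JOB_NEW", 2, "")]
print(pick_master(["hostA", "hostB"], {"hostB"}))
print(replay_events(log))
```

Replaying the log only recovers the last recorded state; polling each sbatchd, as the slide describes, is what brings that state up to date.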

6
Fault Tolerance
  • Job Level: Rerun and Requeue
  • Automatic job requeue occurs when a job finishes
    with one of the exit codes configured for
    requeue (indicating some type of failure).
  • Automatic job rerun occurs when the execution
    host becomes unavailable, or the system fails
    while a job is running.
  • Job rerun can be configured at the queue or job
    level (RERUNNABLE = yes).
  • Job requeue can be configured at the queue level
    (REQUEUE_EXIT_VALUES = 99).
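
The difference between rerun and requeue can be expressed as a small decision function. The function name, arguments, and returned labels are hypothetical; only the two triggering conditions come from the slide.

```python
def next_action(exit_code, host_available, rerunnable, requeue_exit_values):
    """Decide what happens to a job, following the two rules on the slide.

    exit_code is None when the job never finished because its host was lost.
    """
    if not host_available:
        # Rerun applies when the execution host or system fails mid-job.
        return "rerun" if rerunnable else "exit"
    if exit_code in requeue_exit_values:
        # Requeue applies when the job finishes with a configured exit code.
        return "requeue"
    return "done" if exit_code == 0 else "exit"


print(next_action(99, True, False, {99}))
print(next_action(None, False, True, {99}))
```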

7
Resources
  • The LSF system uses built-in and configured
    resources to track resource availability and
    usage. Jobs are scheduled according to the
    resources available on individual hosts.
  • Dynamic Resources
  • Load Indices - Periodically collected by the
    lim: status, r15s, r1m, r15m, ut, pg, io, ls, it,
    tmp, swp, mem
  • Dynamic External Load Indices - Collected by the
    elim daemon. Typically used to track floating
    licenses or available scratch space.
  • Static Resources
  • Static Indices - Collected by the lim at
    startup: ncpus, ndisks, maxmem, maxswp, maxtmp
  • Static External Resources - Defined in the
    lsf.shared file and attached to hosts in the
    lsf.cluster.cluster_name file. Typically used to
    represent node-locked licenses or parallel file
    systems.
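
As an illustration of a dynamic external load index, the sketch below builds the kind of report an elim might emit for free scratch space. It assumes the commonly documented elim convention of one line on standard output holding a count followed by name/value pairs; the resource name "scratch" and both function names are made up for the example, and the format should be checked against the LSF documentation for your version.

```python
import shutil


def format_elim_report(indices):
    """Format a report line: "<count> <name1> <value1> <name2> <value2> ...".

    Assumption: this matches the elim output convention; verify before use.
    """
    parts = [str(len(indices))]
    for name, value in indices.items():
        parts += [name, str(value)]
    return " ".join(parts)


def scratch_free_mb(path="."):
    """Return free space under path in whole megabytes."""
    return shutil.disk_usage(path).free // (1024 * 1024)


# "scratch" is a hypothetical resource name chosen for this example.
print(format_elim_report({"scratch": scratch_free_mb()}))
```

A real elim would print such a line repeatedly at a fixed interval and flush standard output after each report.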

8
Resources
  • LSF monitors the resource consumption of jobs
    submitted to the LSF system.
  • The pim daemon samples the following
    information at regular intervals (defined by
    SBD_SLEEP_TIME in the lsb.params file):
  • Total CPU time consumed by all processes in the
    job
  • Total resident memory usage in KB of all
    currently running processes in a job
  • Total virtual memory usage in KB of all currently
    running processes in a job
  • Currently active process group ID in a job
  • Currently active processes in a job
  • The information is used to enforce resource
    limits and load thresholds, as well as for
    fairshare scheduling.
  • Workload and infrastructure information can be
    archived for analysis by using Platform
    Analytics. By default, job-related information is
    summarized and retained in the lsb.acct file.
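
The per-job totals listed above can be pictured as a sum over per-process samples. The sketch below is a simplified model with hypothetical names (ProcessSample, summarize_job, exceeds_limit); it does not reflect how pim or sbatchd actually store or exchange this data.

```python
from dataclasses import dataclass


@dataclass
class ProcessSample:
    """One sampled process belonging to a job (hypothetical structure)."""
    pid: int
    cpu_seconds: float
    rss_kb: int   # resident memory in KB
    vsz_kb: int   # virtual memory in KB


def summarize_job(samples):
    """Aggregate per-process samples into job-level totals."""
    return {
        "cpu_seconds": sum(s.cpu_seconds for s in samples),
        "rss_kb": sum(s.rss_kb for s in samples),
        "vsz_kb": sum(s.vsz_kb for s in samples),
        "pids": [s.pid for s in samples],
    }


def exceeds_limit(usage, mem_limit_kb):
    """Return True when total resident memory is over the limit."""
    return usage["rss_kb"] > mem_limit_kb


job = summarize_job([ProcessSample(10, 1.5, 2000, 5000),
                     ProcessSample(11, 2.5, 3000, 7000)])
print(job, exceeds_limit(job, 4096))
```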

9
User Interface
  • Command Line
  • Jobs can be submitted on the command line, or
    more commonly by using scripts
  • Web Interface
  • LSF provides a web GUI that operates via web
    services to submit, monitor, and operate on user
    jobs
  • API
  • C - Applications coded in C can use this API. The
    application must be running on an LSF server or
    client.
  • Java - Java applications can use this API. The
    application must be running on an LSF server or
    client.
  • Web Services - Any web services aware application
    can use the LSF web services API to submit jobs
    to LSF. The application can run on any host.
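
For the scripted command-line route, a submission wrapper might look like the sketch below. It relies on the standard bsub options -q (queue), -J (job name), and -o (output file) and on the usual "Job <id> is submitted to queue <name>." reply; the helper names are assumptions, and the script must run on a host where bsub is installed.

```python
import re
import subprocess


def build_bsub_command(command, queue=None, job_name=None, output_file=None):
    """Build the argument list for a bsub submission."""
    args = ["bsub"]
    if queue:
        args += ["-q", queue]
    if job_name:
        args += ["-J", job_name]
    if output_file:
        args += ["-o", output_file]
    args.append(command)
    return args


def parse_job_id(bsub_output):
    """Extract the numeric job ID from bsub's reply, or None if absent."""
    match = re.search(r"Job <(\d+)>", bsub_output)
    return int(match.group(1)) if match else None


def submit(command, **options):
    """Run bsub and return the new job ID (requires an LSF client host)."""
    result = subprocess.run(build_bsub_command(command, **options),
                            capture_output=True, text=True, check=True)
    return parse_job_id(result.stdout)


print(build_bsub_command("sleep 10", queue="normal", job_name="demo"))
```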

10
User Interface
Monitor/control/configure/submit
11
User Interface
  • Access to Enterprise Grid anytime, anywhere
  • Job Submission, Monitoring, and Administration
  • Cluster Management
  • SOAP/XML API

12
Security Architecture
  • LSF supports the following three authentication
    mechanisms:
  • External authentication (eauth) -- This is the
    default authentication method and is used by the
    vast majority of our customers. Eauth uses a
    configurable internal encryption key to encrypt
    authentication data.
  • Privileged ports (setuid) -- If you do not use
    external authentication, privileged ports
    (setuid) authentication is used. This is the
    mechanism most UNIX remote utilities use. The LSF
    commands must be installed as setuid programs and
    owned by root.
  • Identification daemon (identd) -- LSF also
    supports authentication using the RFC 931 or RFC
    1413 identification protocols. Under these
    protocols, user commands do not need to be
    installed as setuid programs owned by root. You
    must install the identd daemon available in the
    public domain.

13
Security Architecture
  • External Authentication (eauth)
  • LSF ships with an eauth binary which is
    installed in LSF_SERVERDIR
  • Customers are free to write their own eauth
    programs to enforce site specific security
    requirements. Example eauth source code is
    provided in the LSF distribution.
  • The LSF Kerberos integration package comes with
    an eauth (source code) that implements Kerberos
    authentication. Because Kerberos authentication
    is widely used within the DOD, the discussion
    that follows will focus on this method.
  • The eauth mechanism can pass data (such as
    authentication credentials) from users to
    execution hosts. The environment variable
    LSF_EAUTH_AUX_DATA specifies the full path to a
    file where data, such as a credential, is stored.
    The mechanisms of eauth -c and eauth -s allow the
    LSF daemons to pass this data using a secure
    exchange.
  • eauth operates in two modes: client mode and
    server mode. When called as eauth -c, eauth
    acquires credentials. When called as eauth -s, it
    validates them.
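
The split between acquiring and validating credentials can be illustrated with a keyed hash. This is not LSF's eauth protocol or data format: the credential layout, the shared key, and both function names are assumptions chosen only to show the client/server division of work.

```python
import hashlib
import hmac


def make_credential(user, key):
    """Client side (the role of eauth -c): produce a credential for user."""
    tag = hmac.new(key, user.encode(), hashlib.sha256).hexdigest()
    return f"{user}:{tag}"


def verify_credential(credential, key):
    """Server side (the role of eauth -s): check a presented credential."""
    user, _, tag = credential.rpartition(":")
    expected = hmac.new(key, user.encode(), hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through comparison timing.
    return bool(user) and hmac.compare_digest(tag.encode(), expected.encode())


shared_key = b"example-site-key"   # hypothetical key, for illustration only
cred = make_credential("alice", shared_key)
print(verify_credential(cred, shared_key))
```

A site-specific eauth would replace the keyed hash with whatever mechanism the site requires, such as the Kerberos calls described in the following slide.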

14
Security Architecture
  • Kerberos Example
  • The purpose of the Kerberos (krb5) integration
    with LSF is to provide the following
    functionality
  • Perform krb5 authentication of end users to LSF
    when submitting jobs
  • Authenticate the LSF daemon communications
  • Provide the functionality to forward krb5
    credentials to the execution environment
  • The integration makes use of LSF's eauth
    mechanism, which allows all krb5-specific calls to
    be encapsulated within a small set of programs:
  • eauth - The main driver for user and daemon
    communication and initial credential creation
  • daemons.wrap - Wraps the LSF daemons and is
    responsible for setting up the credential cache
    on the execution host
  • eexec - Creates the initial credential cache with
    a job-specific name
  • krenewPendJob - Renews credentials for pending
    jobs
  • krenew - Renews credentials for running jobs

15
Extensibility
  • LSF can be extended in many different ways to
    achieve almost any objective
  • Dynamic external resources can be added via the
    elim facility
  • LSF can run a pre- and/or post-execution script
    to set up or clean up after a job
  • A job starter script can be configured to run
    just before a job (defined at the queue level).
    This script runs in the same shell as the user
    job
  • Job submission parameters can be validated by an
    esub script. Esub can also pass information to
    eexec, which can modify the environment of the job
  • A script can be configured to run in the case
    of:
  • Job termination
  • Termination/suspension/resuming action
  • Checkpointing/restarting
  • Custom authentication requirements can be
    satisfied by the creation of an external
    authentication program.
  • Sites can even create their own scheduling
    algorithms and plug them into the LSF scheduler
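
As an example of the esub hook mentioned above, the sketch below validates the requested queue. It assumes the documented esub conventions: submission parameters arrive as NAME="value" lines in the file named by LSB_SUB_PARM_FILE, and exiting with the value of LSB_SUB_ABORT_VALUE rejects the job. The allowed-queue policy and the helper names are hypothetical.

```python
import os
import sys

ALLOWED_QUEUES = {"normal", "short"}   # hypothetical site policy


def parse_sub_parms(text):
    """Parse NAME="value" lines into a dictionary."""
    parms = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        name, _, value = line.partition("=")
        parms[name.strip()] = value.strip().strip('"')
    return parms


def validate(parms):
    """Return an error message, or None when the submission is acceptable."""
    queue = parms.get("LSB_SUB_QUEUE")
    if queue is not None and queue not in ALLOWED_QUEUES:
        return f"queue {queue!r} is not allowed"
    return None


def main():
    with open(os.environ["LSB_SUB_PARM_FILE"]) as handle:
        problem = validate(parse_sub_parms(handle.read()))
    if problem:
        print(problem, file=sys.stderr)
        return int(os.environ["LSB_SUB_ABORT_VALUE"])   # reject the job
    return 0


# Only act as an esub when LSF has provided the parameter file.
if __name__ == "__main__" and "LSB_SUB_PARM_FILE" in os.environ:
    sys.exit(main())
```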