Transcript and Presenter's Notes

Title: Cluster Resources Training


1
  • Cluster Resources Training
  • February 2006

2
Presentation Protocols
  • For problems or questions, send email to
    training@clusterresources.com
  • We will pause for questions at the end of each
    section
  • Please remain on mute except for questions
  • Please do not put your call on hold (the entire
    group will hear your music)
  • Please be conscientious of the other people
    attending the conference
  • You can also submit questions during the training
    to the AOL Instant Messenger screen name CRI Web
    Training

3
Session 1
  • 1. Introduction
  • 2. Deployment, Diagnostics & Troubleshooting
  • 3. Optimization & Troubleshooting
  • 4. End Users
  • 5. Political Sharing & Service Delivery

4
Session 2
  • 6. Reporting & Monitoring
  • 7. Grids
  • 8. Utility
  • 9. Torque
  • 10. Future

5
1. Introduction
  • Overview of the Modern Cluster
  • Cluster Evolution
  • Cluster Productivity Losses
  • Moab Workload Manager Architecture
  • What Moab Does
  • What Moab Does Not Do

6
Purpose of the Cluster: Get Work Done
Workload Types
  • Interactive (Jobs and Reservations w/ Workflow)
  • Batch (Normal Batch Jobs)
  • Fixed Deadline (Fixed Time or QoS-based)
  • Services (Web, Data-mining, Visualization,
    Accounting)
7
Cluster Stack / Framework
[Stack diagram showing the layers of the cluster framework: Grid
Workload Manager (scheduler, policy manager, integration platform),
Cluster Workload Manager (scheduler, policy manager, integration
platform), applications (serial, parallel, message passing), resource
manager, operating system, and hardware (cluster or SMP), with
portal, GUI, and CLI access for admins and users, and security
spanning the stack.]
8
The Initial Cluster
  • Standalone
  • Uniform Resources
  • Uniform Workload
  • Dedicated Usage
  • Single Support Staff

9
  • Increasing numbers of users, groups and
    organizations want to access clusters
  • Increasing workload complexity: more jobs, more
    data-intensive jobs, more dependencies
  • Increasing numbers of clusters, resource
    managers, hardware, storage, network and
    licensing to manage
  • More demands on administrators
  • Increasing demands for higher user productivity
    and work completion

[Diagram: increasing complexity across networks, groups,
data/storage, applications/licenses, resource managers, workload,
security, hardware, and OSs.]
10
The Evolved Cluster
[Diagram: the evolved cluster, showing admins, users, a job queue,
and compute nodes.]
11
Productivity Losses
  • Scheduling Inefficiencies
  • Managing complex site policies
  • Keeping Everybody Happy
  • Scheduling jobs where they will finish faster
    rather than where they will start sooner
  • Middleware Failures/ Overhead
  • Licensing
  • Network Applications
  • Resource Managers
  • Hardware Failures
  • Job Loss and Delay due to Node, Network, and
    other Infrastructure Failures

Remaining Productivity
  • Partitioning Losses
  • Underutilization of Resources Due to Physical
    Access Limits
  • Environmental Losses
  • File System Failures
  • Network Failures
  • Hardware Failures
  • Intra-job inefficiencies
  • Poorly Designed Jobs
  • Poorly Functioning Jobs
  • Heterogeneous Resources Allocated
  • Political Losses
  • Underutilization of Resources Due to Overly
    Strict Political Access Constraints

12
Moab Architecture
13
What Moab Does
  • Optimizes Resource Utilization with Intelligent
    Scheduling and Advanced Reservations
  • Unifies Cluster Management across Varied
    Resources and Services
  • Dynamically Adjusts Workload to Enforce Policies
    and Service Level Agreements
  • Automates Diagnosis and Failure Response

14
What Moab Does Not Do
  • Does not do resource management (usually)
  • Does not install the system (usually)
  • Not a storage manager
  • Not a license manager
  • Does not do message passing

15
Presentation Protocols
  • For problems or questions, send email to
    training@clusterresources.com
  • We will pause for questions at the end of each
    section
  • Please remain on mute except for questions
  • Please do not put your call on hold (the entire
    group will hear your music)
  • Please be conscientious of the other people
    attending the conference
  • You can also submit questions during the training
    to the AOL Instant Messenger screen name CRI Web
    Training

16
2. Deployment, Diagnostics and Troubleshooting
  • 2.1 Installation and Deployment
  • 2.2 Troubleshooting and Diagnostics

17
2.1 Deployment
  • Installation
  • Configuration
  • Testing
  • Simulation
  • Scheduling Sequence
  • Commands Overview
  • Scheduling Objects

18
Moab Workload Manager Installation
  • You only install Moab Workload Manager on
    the head node.
  • > tar -xzvf moab-4.5.0p0.linux.tar.gz
  • > cd moab-4.5.4
  • > ./configure
  • > make
  • When you are ready to use Moab in
    production, you can install it into the
    configured install directory by running 'make
    install'.
  • Workload Manager must be running before
    Cluster Manager and Access Portal will work.
  • You can choose to install client commands on
    a remote system as well.

19
File Locations
  • $(MOABHOMEDIR)
  • moab.cfg (general config file containing
    information required by both the Moab server and
    user interface clients)
  • moab-private.cfg (config file containing private
    information required by the Moab server only)
  • .moab.ck  (Moab checkpoint file)
  • .moab.pid (Moab 'lock' file to prevent multiple
    instances)
  • log (directory for Moab log files - REQUIRED BY
    DEFAULT)
  • moab.log  (Moab log file)
  • moab.log.1 (previous 'rolled' Moab log file)
  • stats (directory for Moab statistics files -
    REQUIRED BY DEFAULT)
  • Moab stats files (in format
    'stats.<YYYY>_<MM>_<DD>')
  • Moab fairshare data files (in format
    'FS.<EPOCHTIME>')
  • tools (directory for local tools called by Moab -
    OPTIONAL BY DEFAULT)
  • traces (directory for Moab simulation trace files
    - REQUIRED FOR SIMULATIONS)
  • resource.trace1 (sample resource trace file)
  • workload.trace1 (sample workload trace file)

20
  • spool (directory for temporary Moab files -
    REQUIRED FOR ADVANCED FEATURES)
  • contrib (directory containing contributed code in
    the areas of GUI's, algorithms, policies, etc)
  • $(MOABINSTDIR)
  • bin (directory for installed Moab executables)
  • moab (Moab scheduler executable)
  • mclient (Moab user interface client executable)
  • /etc/moab.cfg (optional file.  This file is used
    to override default '$(MOABHOMEDIR)' settings.
    It should contain the string
    'MOABHOMEDIR=<DIRECTORY>' to override the
    'built-in' $(MOABHOMEDIR) setting.)

21
Initial Configuration: moab.cfg
  • moab.cfg contains the parameters and settings for
    Moab Workload Manager. This is where you will
    set most of the policy settings.

Example of what moab.cfg will look like after
installation:

# moab.cfg
SCHEDCFG[Moab]  SERVER=test.icluster.org:4255
ADMINCFG[1]     USERS=root
RMCFG[base]     TYPE=PBS
22
Supported Platforms/Environments
  • Resource Managers
  • TORQUE, OpenPBS, PBSPro, LSF, Loadleveler, SLURM,
    BProc, clubMASK, S3, WIKI
  • Operating Systems
  • RedHat, SUSE, Fedora, Debian, FreeBSD (all
    known variants of Linux), AIX, IRIX, HP-UX, OS/X,
    OSF/Tru-64, SunOS, Solaris (all known variants
    of UNIX)
  • Hardware
  • Intel x86, Intel IA-32, Intel IA-64, AMD x86, AMD
    Opteron, SGI Altix, HP, IBM SP, IBM x-Series, IBM
    p-Series, IBM i-Series, Mac G4 and G5

23
Basic Parameters
  • SCHEDCFG
  • Specifies how the Moab server will execute and
    communicate with client requests. 
  • Example: SCHEDCFG[orion] SERVER=cw.psu.edu
  • ADMINCFG
  • Moab provides role-based security enabled by way
    of multiple levels of admin access. 
  • Example: The following may be used to enable
    users greg and thomas as level 1 admins
  • ADMINCFG[1] USERS=greg,thomas  (NOTE: Moab may
    only be launched by the primary admin user ID.)
  • RMCFG
  • In order for Moab to properly interact with a
    resource manager, the interface to this resource
    manager must be defined.
  • For example, to interface to a TORQUE resource
    manager, the following may be used
  • RMCFG[torque1] TYPE=pbs

24
Scheduling Modes
- Configure modes in moab.cfg
  • Simulation Mode
  • Allows a test drive of the scheduler. You can
    evaluate how various policies can improve the
    current performance on a stable production
    system.
  • Test Mode
  • Test mode allows evaluation of new Moab releases,
    configurations, and policies in a risk-free
    manner. In test mode, Moab behaves identically to
    live (normal) mode except that it cannot start,
    cancel, or modify jobs.
  • Normal Mode
  • Live mode (set automatically after installation)
  • Interactive Mode
  • Like test mode but instead of disabling all
    resource and job control functions, Moab sends
    the desired change request to the screen and asks
    for permission to complete it.
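
For reference, a minimal sketch of selecting a mode in moab.cfg; the
scheduler name, hostname, and port below are placeholders, and the
MODE attribute follows the SCHEDCFG examples shown elsewhere in this
training:

# moab.cfg -- run the scheduler in test mode
SCHEDCFG[Moab]  MODE=TEST  SERVER=headnode.example.com:42559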

25
(No Transcript)
26
Testing New Policies
  • Verifying Correct Specification of New Policies
  • If manually editing the moab.cfg file, use the
    mdiag -C command
  • Moab Cluster Manager automatically verifies
    proper policy specification
  • Verifying Correct Behavior of New Policies
  • Put Moab in INTERACTIVE mode to confirm each
    change before it is applied
  • Determining Long Term Impact of New Policies
  • Put Moab in SIMULATION mode

27
  • Moab 'Side-by-Side'
  • Allows a production cluster or other resource to
    be logically partitioned along resource and
    workload boundaries and allows different
    instances of Moab to schedule different
    partitions.
  • Use parameters IGNORENODES, IGNORECLASSES,
    IGNOREUSERS

moab.cfg for production partition:
SCHEDCFG[prod]  MODE=NORMAL  SERVER=orion.cxz.com:42020
RMCFG[TORQUE]   TYPE=PBS
IGNORENODES     node61,node62,node63,node64
IGNOREUSERS     gridtest1,gridtest2

moab.cfg for test partition:
SCHEDCFG[prod]  MODE=NORMAL  SERVER=orion.cxz.com:42020
RMCFG[TORQUE]   TYPE=PBS
IGNORENODES     !node61,node62,node63,node64
IGNOREUSERS     !gridtest1,gridtest2
28
Simulation
  • What is the impact of additional hardware on
    cluster utilization?
  • What delays to key projects can be expected with
    the addition of new users?
  • How will new prioritization weights alter cycle
    distribution among existing workload?
  • What total loss of compute resources will result
    from introducing a maintenance downtime?
  • Are the benefits of cycle stealing from
    non-dedicated desktop systems worth the effort?
  • How much will anticipated grid workload delay the
    average wait time of local jobs?
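
One possible way to set up such a run, assuming the
SIMRESOURCETRACEFILE and SIMWORKLOADTRACEFILE parameters and the
sample trace files from the traces/ directory listed earlier; treat
this as a sketch, not a definitive configuration:

# moab.cfg -- answer "what if" questions against recorded traces
SCHEDCFG[Moab]        MODE=SIMULATION
SIMRESOURCETRACEFILE  traces/resource.trace1
SIMWORKLOADTRACEFILE  traces/workload.trace1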

29
Scheduling Iterations
  • Update State Information
  • Refresh Reservations
  • Schedule Reserved Jobs
  • Schedule Priority Jobs
  • Backfill Jobs
  • Update Statistics
  • Handle User Requests
  • Perform Next Scheduling Cycle

30
Job Flow
  • Determine Basic Job Feasibility
  • Prioritize Jobs
  • Enforce Configured Throttling Policies
  • Determine Resource Availability
  • Allocate Resources to Job
  • Launch Job

31
Commands Overview
Command Description
checkjob provide detailed status report for specified job
checknode provide detailed status report for specified node
mcredctl controls various aspects about the credential objects within Moab
mdiag provide diagnostic reports for resources, workload, and scheduling
mjobctl control and modify job
mnodectl control and modify nodes
mrmctl query and control resource managers
mrsvctl control and modify reservations
mschedctl modify scheduler state and behavior
mshow displays various diagnostic messages about the system and job queues
msub scheduler job submission
resetstats reset scheduler statistics
showbf show current resource availability
showq show queued jobs
showres show existing reservations
showstart show estimates of when job can/will start
showstate show current state of resources
showstats show usage statistics
32
End User Commands
Command Flags Description
canceljob cancel existing job
checkjob display job state, resource requirements, environment, constraints, credentials, history, allocated resources, and resource utilization
showbf show resource availability for jobs with specific resource requirements
showq display detailed prioritized list of active and idle jobs
showstart show estimated start time of idle jobs
showstats show detailed usage statistics for users, groups, and accounts which the end user has access to
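
A typical end-user session might look like the following (the job ID
is a placeholder):

> showq              # where is my job in the queue?
> checkjob 1072      # what does it need, and why is it still idle?
> showstart 1072     # when is it estimated to start?
> showbf             # what resources are available right now?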
33
Creating a Remap Class
  • REMAPCLASS parameter

# moab.cfg
# jobs submitted to 'batch' should be remapped
REMAPCLASS         batch
# stevens only queue
CLASSCFG[stevens]  MIN.FEATURES=stevens REQUIREDUSERLIST=stevens,stevens2
# special queue for I/O nodes
CLASSCFG[io]       MAX.PROC=8 MIN.FEATURES=io
# general access queues
CLASSCFG[quick]    MIN.PROC=2 MAX.PROC=8 MIN.FEATURES=fastshort
CLASSCFG[medium]   MIN.PROC=2 MAX.PROC=8
CLASSCFG[default]  MAX.PROC=64
...
34
Scheduling Objects
  • Moab functions by manipulating five primary,
    elementary objects 
  • Jobs
  • Nodes
  • Reservations
  • QOS structures
  • Policies

35
Jobs
  • Job information is provided to the Moab scheduler
    from a resource manager
  • (Such as Loadleveler, PBS, Wiki, or LSF)
  • Job attributes include
  • Ownership of the job
  • Job state
  • Amount and type of resources required by the job
  • Wallclock limit
  • A job consists of one or more requirements each
    of which requests a number of resources of a
    given type. 

36
Job States
  • Indicates the job's current status and
    eligibility for execution and can be any of the
    values listed below
  • Idle
  • Job is currently queued and eligible to run but
    is not executing
  • Starting
  • The batch system has attempted to start the job
    and the job is currently performing pre-start
    tasks, which may include provisioning resources,
    staging data, executing system pre-launch
    scripts, etc.
  • Running
  • Job is currently executing the user application
  • Suspended
  • Job was running but has been suspended by the
    scheduler or an admin.  The user application is
    still in place on the allocated compute resources
    but it is not executing
  • Completed
  • The job has completed running
  • Hold
  • The job is idle and is not eligible to run due to
    a user, admin, or batch system hold

37
Job Requirement
  • Consists of a request for a single type of
    resources.  Each requirement consists of the
    following components
  • Task Definition
  • A specification of the elementary resources which
    compose an individual task.
  • Resource Constraints
  • A specification of conditions which must be met
    in order for resource matching to occur.  Only
    resources from nodes which meet all resource
    constraints may be allocated to the job req.
  • Task Count
  • The number of task instances required by the req.
  • Task List
  • The list of nodes on which the task instances
    have been located.
  • Req Statistics
  • Statistics tracking resource utilization
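
As a rough illustration (the exact mapping depends on the resource
manager and node-match policy in use), a request for four tasks of
two processors each with a one-hour wallclock limit might be
submitted as:

> msub -l nodes=4:ppn=2,walltime=1:00:00 job.cmd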

38
Nodes
  • Within Moab, a node is a collection of resources
    with a particular set of associated attributes.
  • A node is defined as one or more CPUs, together
    with associated memory and possibly other
    compute resources such as local disk, swap,
    network adapters, software licenses, etc.

39
Advance Reservations
  • An object which dedicates a block of specific
    resources for a particular use.
  • Each reservation consists of a list of resources,
    an access control list, and a time range for
    which this access control list will be enforced.
  • The reservation prevents the listed resources
    from being used in a way not described by the
    access control list during the time range
    specified.
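
As an illustration, a two-hour reservation on two named nodes for the
staff group could be created with mrsvctl, using the same flags shown
in the System Reservations section later (host list, group, start
time, and duration here are placeholders):

> mrsvctl -c -h node01,node02 -g staff -s 14:00:00 -d 2:00:00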

40
Policies
  • Generally specified via a config file and serve
    to control how and when jobs start. 
  • Include
  • Job prioritization
  • Fairness policies
  • Fairshare configuration policies
  • Scheduling policies.
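
A small, illustrative sample of such policies in moab.cfg (the values
are placeholders, not recommendations):

# moab.cfg -- examples of priority, fairness, throttling, and scheduling policies
QUEUETIMEWEIGHT   10           # job prioritization
FSPOLICY          DEDICATEDPS  # fairshare configuration
USERCFG[DEFAULT]  MAXJOB=8     # per-user throttling/fairness limit
BACKFILLPOLICY    FIRSTFIT     # scheduling policy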

41
Resources
  • Jobs, nodes, and reservations all deal with the
    abstract concept of a resource.  A resource in
    the Moab world is one of the following    
  • Processors
  • Specified with a simple count value.
  • Memory
  • Real memory or 'RAM' is specified in megabytes
    (MB).
  • Swap
  • Virtual memory or 'swap' is specified in
    megabytes (MB).
  • Disk
  • Local disk is specified in megabytes (MB).
  • In addition to these elementary resource types,
    there are two higher level resource concepts used
    within Moab. 
  • Task
  • Processor equivalent, or PE.

42
 Task
  • A collection of elementary resources which must
    be allocated together within a single node. 
  • In Moab, when jobs or reservations request
    resources, they do so in terms of tasks typically
    using a task count and a task definition. 

43
Resource Manager (RM)
  • While other systems may have more strict
    interpretations of a resource manager and its
    responsibilities, Moab's multi-resource manager
    support allows a much more liberal
    interpretation.
  • In essence, any object which can provide
    environmental information and environmental
    control can be utilized as a resource manager.
  • Moab is able to aggregate information from
    multiple unrelated sources into a larger, more
    complete world view of the cluster, which includes
    all the information and control found within a
    standard resource manager such as TORQUE,
    including
  • Node
  • Job
  • Queue management services.

44
Node Attributes
  • ACCESS
  • Node access policy which can be one of SHARED,
    SHAREDONLY, SINGLEJOB, SINGLETASK, or SINGLEUSER
  • CHARGERATE
  • Assign specific charging rates to the usage of
    particular resources
  • COMMENT
  • Allows an organization to annotate a node via the
    config file to indicate special information
    regarding this node to both users and
    administrators

NODECFG[node013] COMMENT="Login Node"
45
  • FEATURES
  • The NODECFG parameter can be used to directly
    assign a list of node features to individual
    nodes.
  • GRES
  • The NODECFG parameter can be used to directly
    assign a list of consumable generic attributes to
    individual nodes or to the special pseudo-node
    global, which provides shared cluster (aka
    floating) consumable resources.

NODECFG[node013] FEATURES=gpfs,fastio
NODECFG[node013] GRES=quickcalc:20
46
 Resource Managers
  • Moab can be configured to manage more than one
    resource manager simultaneously, even resource
    managers of different types.
  • Moab aggregates information from the RMs to fully
    manage workload, resources, and cluster policies
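
For example, two resource managers of different types can be defined
side by side; the RM names below are placeholders, and SLURM was
commonly driven through the WIKI interface type listed on the
supported-platforms slide:

# moab.cfg -- aggregate a TORQUE/PBS cluster and a SLURM cluster
RMCFG[cluster-a]  TYPE=PBS
RMCFG[cluster-b]  TYPE=WIKI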

47
Moab's Interaction with the RM
  • Load global resource information
  • Load node specific information (optional)
  • Load job information
  • Load queue/policy information (optional)
  • Cancel/preempt/modify jobs according to cluster
    policies
  • Start jobs in accordance with available resources
    and policy constraints
  • Handle user commands

48
Resource Manager Configuration
  • RMCFG parameter with TYPE attribute
  • Scheduler/Resource Manager Interactions
  • GETJOBINFO
  • GETNODEINFO
  • STARTJOB
  • CANCELJOB

RMCFG[orion] TYPE=PBS
49
License Management
  • Moab supports both node-locked and floating
    license models and even allows mixing the two
    models simultaneously
  • Methods for determining license availability
  • Local Consumable Resources
  • Resource Manager Based Consumable Resources
  • Interfacing to an External License Manager
  • Requesting Licenses within Jobs

qsub example:
> qsub -l nodes=2,software=blast cmdscript.txt
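
A locally tracked floating license can be sketched with the pseudo-node
and GRES syntax from the node attributes slides; the license name and
count below are placeholders:

# moab.cfg -- 4 floating 'blast' licenses shared cluster-wide
NODECFG[GLOBAL]  GRES=blast:4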
50
2.2 Troubleshooting and Diagnostics
  • Object Messages
  • Diagnostic Commands
  • Admin Notification
  • Logging
  • Tracking System Failures
  • Checkpointing
  • Debuggers

http://www.clusterresources.com/products/mwm/moabdocs/14.0troubleshootingandsysmaintenance.shtml
51
Object Messages
  • Messages can hold information regarding failures
    and key events
  • Messages possess event time, owner, expiration
    time, and event count information
  • Resource managers and peer services can attach
    messages to objects
  • Admins can attach messages
  • Multiple messages per object are supported
  • Messages are persistent
  • http://www.clusterresources.com/products/mwm/moabdocs/commands/mschedctl.shtml
  • http://www.clusterresources.com/products/mwm/moabdocs/14.3messagebuffer.shtml

52
Diagnostics
  • Moab's diagnostic commands present detailed state
    information
  • Scheduling problems
  • Summarize performance
  • Evaluate current operation, reporting on any
    unexpected or potentially erroneous conditions
  • Where possible, correct detected problems if
    desired

53
mdiag
  • Displays object state/health
  • Displays object configuration
  • Attributes, resources, policies
  • Displays object history and performance
  • Displays object failures and messages
  • http://www.clusterresources.com/products/mwm/moabdocs/commands/mdiag.shtml

54
mdiag usage
  • Most common diagnostics
  • Scheduler (mdiag -S)
  • Jobs (mdiag -j)
  • Nodes (mdiag -n)
  • Resource manager (mdiag -R)
  • Blocked jobs (mdiag -b)
  • Configuration (mdiag -C)
  • Other diagnostics
  • Fairshare, Priority
  • Users, Accounts, Classes
  • Reservations, QoS, etc

DEMO
mdiag
  • http://www.clusterresources.com/products/mwm/moabdocs/commands/mdiag.shtml

55
mdiag details
  • Performs numerous internal health and consistency
    checks
  • Race conditions, object configuration
    inconsistencies, possible external failures
  • Not just for failures
  • Provides status, config, and current performance
  • Enables Moab as an information service
    (--flags=xml)

56
Job Troubleshooting
  • To determine why a particular job will not start,
    several commands can be helpful
  • checkjob -v
  • Checkjob will evaluate the ability of a job to
    start immediately.  Tests include resource
    access, node state, and job constraints (e.g.,
    startdate, taskspernode, QOS).  Additionally,
    command line flags may be specified to provide
    further information.
  • -l <POLICYLEVEL>       // evaluate impact of throttling policies on job feasibility
    -n <NODENAME>          // evaluate resource access on specific node
    -r <RESERVATION_LIST>  // evaluate access to specified reservations
  • checknode
  • Display detailed status of node
  • mdiag -b
  • Display various reasons job is considered
    'blocked' or 'non-queued'.
  • mdiag -j
  • Display high level summary of job attributes and
    perform sanity check on job attributes/state.
  • showbf -v
  • Determine general resource availability subject
    to specified constraints.
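
Putting these together, a diagnosis session might look like this (job
and node names are placeholders):

> checkjob -v 1072           # why can't this job start right now?
> checkjob -n node013 1072   # can it access resources on this node?
> mdiag -b                   # list blocked jobs and the reasons
> showbf -v                  # what is actually available?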

57
Other Diagnostics
  • checkjob and checknode commands
  • Why a job cannot start
  • Which nodes are available
  • Information regarding the recent events impacting
    current jobs
  • Node state

58
Admin Notification
  • MAILPROGRAM must be set to DEFAULT
  • Notifications can be delivered as a result of
    generic events, generic metric thresholds and
    QoS-based service levels
  • http://www.clusterresources.com/products/mwm/moabdocs/14.4eventmgmt.shtml
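
Per the note above, the corresponding moab.cfg entry is simply:

# moab.cfg -- use the default mail delivery program for notifications
MAILPROGRAM  DEFAULT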

59
Issues with Client Commands
  • Utilize built-in Moab logging
  • showq --loglevel=9
  • Or
  • Check the Moab log files

60
Logging Facilities
  • Moab Log
  • Report detailed scheduler actions, configuration,
    events, failures, etc
  • Event Log
  • Report scheduler, job, node, and reservation
    events and failures
  • Syslog
  • USESYSLOG

stats/events.Wed_Aug_24_2005:
1124979598  rm     base     RMUP        initialized
1124979598  sched  Moab     SCHEDSTART  -
1124982013  node   node017  GEVENT      CPU2 Down
1124989457  node   node135  GEVENT      /var/tmp Full
1124996230  node   node139  GEVENT      /home Full
1125013524  node   node407  GEVENT      Transient Power Supply Failure
  • http://www.clusterresources.com/products/mwm/moabdocs/a.fparameters.shtml#eventrecordlist
  • http://www.clusterresources.com/products/mwm/moabdocs/14.2logging.shtml
  • http://www.clusterresources.com/products/mwm/moabdocs/a.fparameters.shtml#usesyslog

61
Logging Basics
  • LOGDIR - Indicates directory for log files
  • LOGFILE - Indicates path name of log file
  • LOGFILEMAXSIZE - Indicates maximum size of log
    file before rolling
  • LOGFILEROLLDEPTH - Indicates maximum number of
    log files to maintain
  • LOGLEVEL - Indicates verbosity of logging
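
A possible logging setup using these parameters (the size, depth, and
verbosity values are illustrative only):

# moab.cfg
LOGDIR            log
LOGFILE           moab.log
LOGFILEMAXSIZE    10000000
LOGFILEROLLDEPTH  5
LOGLEVEL          3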

62
Function Level Information
  • In source and debug releases, each subroutine is
    logged, along with all printable parameters.

moab.log:
MPolicyCheck(orion.322,2,Reason)
63
Status Information
  • Information about internal status is logged at
    all LOGLEVELs.  Critical internal status is
    indicated at low LOGLEVELs while less critical
    and more verbose status information is logged at
    higher LOGLEVELs.

moab.log:
INFO  job orion.4228 rejected (max user jobs)
INFO  job fr4n01.923.0 rejected (maxjobperuser policy failure)
64
Scheduler Warnings
  • Warnings are logged when the scheduler detects an
    unexpected value or receives an unexpected result
    from a system call or subroutine.

moab.log:
WARNING  cannot open fairshare data file '/opt/moab/stats/FS.87000'
65
Scheduler Alerts
  • Alerts are logged when the scheduler detects
    events of an unexpected nature which may indicate
    problems in other systems or in objects.

moab.log:
ALERT  job orion.72 cannot run.  deferring job for 360 seconds
66
Scheduler Errors
  • Errors are logged when the scheduler detects
    problems which impact the scheduler's ability to
    properly schedule the cluster.

moab.log:
ERROR  cannot connect to Loadleveler API
67
Searching Moab Logs
  • While major failures will be reported via the
    mdiag -S command, these failures can also be
    uncovered by searching the logs using the grep
    command as in the following

> grep -E "WARNING|ALERT|ERROR" moab.log
68
Event Logs
  • Major events are reported to both the Moab log
    file as well as the Moab event log.  By default,
    the event log is maintained in the statistics
    directory and rolls on a daily basis, using the
    naming convention
  • events.WWW_MMM_DD_YYYY (e.g., events.Fri_Aug_19_2005)

event log format: <EPOCHTIME> <OBJECT> <OBJECTID> <EVENT> <DETAILS>
69
Enabling Syslog
  • In addition to the log file, the Moab Scheduler
    can report events it determines to be critical to
    the UNIX syslog facility via the daemon facility
    using priorities ranging from INFO to ERROR. 
  • The verbosity of this logging is not affected by
    the LOGLEVEL parameter.  In addition to errors
    and critical events, user commands that affect
    the state of the jobs, nodes, or the scheduler
    may also be logged to syslog.  
  • Moab syslog messages are reported using the INFO,
    NOTICE, and ERR syslog priorities.
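
A minimal sketch of switching this on via the USESYSLOG parameter
mentioned on the Logging Facilities slide (default facility assumed):

# moab.cfg
USESYSLOG  TRUE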

70
Tracking System Failures
  • The scheduler has a number of dependencies which
    may cause failures if not satisfied.
  • Disk Space
  • The scheduler utilizes a number of files. If the
    file system is full or otherwise inaccessible,
    the following behaviors might be noted

File Failure
moab.pid scheduler cannot perform single instance check
moab.ck scheduler cannot store persistent record of reservations, jobs, policies, summary statistics, etc.
moab.cfg/moab.dat scheduler cannot load local configuration
log/ scheduler cannot log activities
stats/ scheduler cannot write job records
71
  • Network
  • The scheduler utilizes a number of socket
    connections to perform basic functions. Network
    failures may affect the following facilities.

Network Connection Failure
scheduler client scheduler client commands fail
resource manager scheduler is unable to load/update information regarding nodes and jobs
allocation manager scheduler is unable to validate account access or reserve/debit account balances
72
  • Processor Utilization
  • On a heavily loaded system, the scheduler may
    appear sluggish and unresponsive.  However no
    direct failures should result from this slowdown.
     Indirect failures may include timeouts of peer
    services (such as the resource manager or
    allocation manager) or timeouts of client
    commands.
  • Memory
  • Depending on cluster size and configuration,
    the scheduler may require up to 120 MB of
    memory on the server host.  If inadequate memory
    is available, multiple aspects of scheduling may
    be negatively affected.

73
Internal Errors
  • Logs
  • Tracing the Failure with a Debugger

74
Checkpointing
  • Moab checkpoints its internal state.  The
    checkpoint file records statistics and attributes
    for jobs, nodes, reservations, users, groups,
    classes, and almost every other scheduling
    object.
  • CHECKPOINTEXPIRATIONTIME - Indicates how long
    unmodified data should be kept after the
    associated object has disappeared (e.g., job
    priority for a job no longer detected).
  • FORMAT - DD:HH:MM:SS
  • EXAMPLE - CHECKPOINTEXPIRATIONTIME  1:00:00:00
  • CHECKPOINTFILE - Indicates path name of
    checkpoint file
  • FORMAT - <STRING>
  • EXAMPLE - CHECKPOINTFILE  /var/adm/moab/moab.ck
  • CHECKPOINTINTERVAL - Indicates interval between
    subsequent checkpoints.
  • FORMAT - DD:HH:MM:SS
  • EXAMPLE - CHECKPOINTINTERVAL  00:15:00

moab.cfg (assembled from the examples above):
CHECKPOINTEXPIRATIONTIME  1:00:00:00
CHECKPOINTFILE            /var/adm/moab/moab.ck
CHECKPOINTINTERVAL        00:15:00
75
Using a Debugger
  • Attach to a running Moab process
  • or
  • Start Moab under the debugger

> export MOABDEBUG=yes
> cd <MOABHOMEDIR>/src/moab
> gdb ../../bin/moab
(gdb) b MQOSInitialize
(gdb) r
76
Cluster Resources Training
  • We will reconvene at 12:00 pm EST

77
Presentation Protocols
  • For problems or questions, send email to
    training@clusterresources.com
  • We will pause for questions at the end of each
    section
  • Please remain on mute except for questions
  • Please do not put your call on hold (the entire
    group will hear your music)
  • Please be conscientious of the other people
    attending the conference
  • You can also submit questions during the training
    to the AOL Instant Messenger screen name CRI Web
    Training

78
3. Optimization and Uptime
  • 3.1 Optimization
  • 3.2 Uptime

79
3.1 Optimization
  • Optimization is maximizing performance while
    fully addressing all mission objectives. True
    optimization includes aspects of policy
    selection, increased availability, user training,
    and other factors.
  • Identifying Policy Bottlenecks
  • Identifying Resource Fragmentation
  • Preemption
  • Malleable/Dynamic Jobs
  • Backfill

80
Productivity Losses
  • Scheduling Inefficiencies
  • Managing complex site policies
  • Keeping Everybody Happy
  • Scheduling jobs where they will finish faster
    rather than where they will start sooner
  • Middleware Failures/ Overhead
  • Licensing
  • Network Applications
  • Resource Managers
  • Hardware Failures
  • Job Loss and Delay due to Node, Network, and
    other Infrastructure Failures

Remaining Productivity
  • Partitioning Losses
  • Underutilization of Resources Due to Physical
    Access Limits
  • Environmental Losses
  • File System Failures
  • Network Failures
  • Hardware Failures
  • Intra-job inefficiencies
  • Poorly Designed Jobs
  • Poorly Functioning Jobs
  • Heterogeneous Resources Allocated
  • Political Losses
  • Underutilization of Resources Due to Overly
    Strict Political Access Constraints

81
Identifying Policy Bottlenecks
  • Most Optimization is Enabled by Default
  • Sources of Bottlenecks
  • Usage Limits, Fairshare Caps
  • Eval Steps
  • Verify priority (are most important jobs getting
    access to resources first?)
  • http://clusterresources.com/moabdocs/commands/mdiag-priority.shtml
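
The priority check above is typically done with the priority
diagnostic referenced by that URL, for example:

> mdiag -p    # per-job priority breakdown by component/subcomponent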

82
Identifying Policy Bottlenecks (contd)
  • Sources of Bottlenecks (contd)
  • Eval Steps (contd)
  • Check job blockage
  • Adjust Limits, Caps, Priority as needed
  • If needed, use simulation to determine
    performance impact of changes
  • http://clusterresources.com/moabdocs/commands/mdiag-queues.shtml

83
Identifying Resource Fragmentation
  • Fragmentation based on queues, reservations,
    partitions, OSs, architectures, etc.
  • Recommended changes: use node sets, soften
    reservations, use time-based reservations, etc.
  • User training to eliminate user specified
    fragmentation
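
As an illustration of the node-set suggestion above, Moab's node set
parameters can steer each job onto nodes sharing a common feature.
This is only a sketch: it assumes the NODESETPOLICY, NODESETATTRIBUTE,
and NODESETLIST parameters are available in your release, and the
feature names are placeholders:

# moab.cfg -- allocate each job within one feature group when possible
NODESETPOLICY     ONEOF
NODESETATTRIBUTE  FEATURE
NODESETLIST       switchA,switchB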

84
Preemption
  • Conflict between high utilization for cluster and
    guarantees for important jobs
  • Preemption allows scheduler to 'retract' some
    scheduling decisions to address newly submitted
    workload
  • QoS-based preemption allows scheduler to enable
    preemption only if targets cannot be satisfied in
    other ways

http://www.clusterresources.com/products/mwm/docs/8.4preemption.shtml
85
QoS Based Preemption
  • Preemption only occurs when the following 3
    conditions are satisfied
  • The preemptor job has the PREEMPTOR attribute set
  • The preemptee job has the PREEMPTEE attribute set
  • The preemptor job has a higher priority than the
    preemptee job

PREEMPTPOLICY  REQUEUE

# enable qos priority to make preemptors higher priority than preemptees
QOSWEIGHT      1
QOSCFG[high]   QFLAGS=PREEMPTOR  PRIORITY=1000
QOSCFG[med]
QOSCFG[low]    QFLAGS=PREEMPTEE

# associate class 'special' with QOS high
CLASSCFG[special]  QDEF=high
86
Preemption based Backfill
  • The PREEMPT backfill policy allows the scheduler
    to start backfill jobs even if required walltime
    is not available.
  • If the job runs too long and interferes with
    another job which was guaranteed a particular
    timeslot, the backfill job is preempted and the
    priority job is allowed to run.  
  • When another potential timeslot becomes
    available, the preempted backfill job will again
    be optimistically executed.
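
Enabling this behavior combines the two parameters already introduced
in this section:

# moab.cfg -- optimistic backfill with preemption as the safety net
BACKFILLPOLICY  PREEMPT
PREEMPTPOLICY   REQUEUE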

87
Trigger and Context Based Preemption Policies
  • Mark a job a preemptor if its delivered or
    expected response time exceeds a specified
    threshold
  • Mark a job preemptible if it violates soft policy
    usage limits or fairshare targets
  • Mark a job a preemptor if it is running in a
    reservation it owns
  • Preempt a job as the result of a specific user,
    node, job, reservation, or other object event
    using object triggers
  • Preempt a job as the result of an external
    generic event or generic metric

88
Types of Preemption
  • PREEMPTPOLICY Parameter
  • Job Requeue
  • active jobs are terminated and returned to the
    job queue in an idle state.
  • Job Suspend
  • active jobs stop executing but remain in memory
    on the allocated compute nodes; they must be
    resumed
  • Job Checkpoint
  • the job saves its current state and either
    terminates or continues running
  • Job Cancel
  • active jobs are canceled

Moab is only able to utilize preemption if the
underlying resource manager/OS combination
supports this capability.
89
Malleable/Dynamic Jobs
  • Moab adjusts jobs to utilize available resources
    and fill holes
  • Moab adjusts both job size and job duration
  • Only supported with resource managers which
    support dynamic job modification (i.e. TORQUE) or
    with msub

http://www.clusterresources.com/products/mwm/docs/22.4dynamicjobs.shtml
90
Backfill
  • Allows a scheduler to make better use of
    available resources by running jobs out of order
  • Prioritizes the jobs in the queue according to a
    number of factors and then orders the jobs into a
    highest priority first (or priority FIFO) sorted
    list

91
Backfill Algorithms
  • FIRSTFIT
  • Selects those which actually fit in the current
    backfill window
  • First job is started
  • While backfill jobs and idle resources remain,
    step 1 repeats
  • BESTFIT
  • Selects those which actually fit in the current
    backfill window
  • Degree of fit determined by the BACKFILLMETRIC
    parameter
  • Job with best fit started
  • While backfill jobs and idle resources remain,
    step 1 repeats
  • GREEDY
  • Selects those which actually fit in the current
    backfill window
  • All possible combinations of jobs are evaluated,
    degree of fit determined by the BACKFILLMETRIC
    parameter
  • OPTIMISTIC/PREEMPT
  • Selects jobs which can most likely fit using
    wall-clock accuracy based estimates

92
Configuring Backfill
  • Use BACKFILLPOLICY parameter
  • FORMAT - use one of the following
  • FIRSTFIT
  • BESTFIT
  • GREEDY
  • OPTIMISTIC
  • PREEMPT
  • NONE

BACKFILLPOLICY    BESTFIT
93
3.2 Uptime
  • Importing Failures and Events
  • Responses
  • Triggers
  • Reducing the Downtime Associated with Downtime

94
Importing Failures and Events
  • Standard Resource Manager Info
  • Native Resource Manager Info
  • Generic Events
  • Generic Metrics
  • Moab will detect and report failures with
    resource managers as well as failures reported by
    resource managers
  • Moab will also locally generate events based on
    reservations, SLA thresholds and other factors

95
Event Responses
  • Policy Changes
  • Resource Allocation Priority
  • Reservation Creation
  • Email Notification
  • Log Event to Internal and External Targets
  • Resource State Changes
  • Triggers
  • http://clusterresources.com/moabdocs/9.2accounting.shtml#gevents

96
Triggers
How can I provision a set of nodes with the O.S.
requested by a job before it starts?
  • Arbitrary Actions
  • Object
  • Event
  • Action
  • Additional Attributes
  • Action Type
  • Offset
  • Threshold

How can I open up secure access for interactive
work when a user places a personal reservation on
a set of nodes?
  • http://clusterresources.com/moabdocs/20.1triggers.shtml

97
Triggers (contd)
  • Advanced Trigger Usage
  • Variable import/export
  • Dependencies
  • Multi-fire, user flags
  • Periodic Triggers
  • http://clusterresources.com/moabdocs/20.1triggers.shtml

98
Trigger Usage
  • Respond to failures or external events
  • Dynamically change environment
  • Provision resources
  • Actuate external services
  • Manage dynamic security
  • Manage peer applications
  • Modify policies
  • Respond to health check info, notifications, etc.
  • http://clusterresources.com/moabdocs/20.1triggers.shtml

99
Reducing the Downtime Associated with Downtime
  • Automated Failure Recovery
  • System Reservations
  • Interleaved Maintenance
  • High Availability

100
Automated Failure Recovery
  • Node Triggers

moab.cfg:
NODECFG[DEFAULT] TRIGGER=AType=exec,EType=fail,Action="$HOME/triggers/send_node_down_email.pl $OID",MultiFire=TRUE,RearmTime=100
  • When a node goes down, an email containing the
    name of the node is sent to the administrator
  • Can be expanded to execute any arbitrary script,
    handing that script the name of the node, the
    time of the failure and other useful information.

101
Automating Recovery
  • The MOABRECOVERYACTION environment variable can
    be used to control scheduler action in the case
    of a catastrophic internal failure.

Recovery Mode Description
die Moab will exit, and if core files are externally enabled, will create a core file for analysis (this is the default behavior)
ignore Moab will ignore the signal and continue processing.  This may cause Moab to continue running with corrupt data which may be dangerous.  Use this setting with caution.
restart When a SIGSEGV is received, Moab will relaunch using the current checkpoint file, the original launch environment, and the original command line flags.  The receipt of the signal will be logged but Moab will continue scheduling.  Because the scheduler is restarted with a new memory image, no corrupt scheduler data should exist.  One caution with this mode is that it may mask underlying system failures by allowing Moab to overcome them.  If used, the event log should be checked occasionally to determine if failures are being detected.
trap When a SIGSEGV is received, Moab will stay alive but will enter diagnostic mode.  In this mode, Moab will stop scheduling but will respond to client requests allowing analysis of the failure to occur using internal diagnostics available via the mdiag command.
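
For example, to select the restart behavior from the table in the
environment that launches Moab (exact value casing may vary by
release):

> export MOABRECOVERYACTION=RESTART
> moab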
102
System Reservations
  • Easy to reserve entire cluster, or only sections
    of it
  • Can easily be scripted to roll out updates
    across the entire cluster at specific times,
    ensuring that no workload will be interrupted

> mrsvctl -c -t ALL
> mrsvctl -c -t ALL -s 1:00:00 -g staff
> mrsvctl -c -h node0[0-9][0-9] -d 24:00:00
> mrsvctl -c -h node[0-9][0-9][0-9] \
    -T Action=/tmp/update.pl $HOSTLIST,atype=exec,etype=start \
    -s 23:50:00_6/15 -d 15:00
103
Rolling/Interleaved Maintenance
[Diagram comparing maintenance approaches (nodes vs. time): the node
draining method, where announcing that jobs can no longer start
wastes cycles draining the system ahead of the actual maintenance
window(s); a completion-time scheduled maintenance window; and
rolling maintenance windows across subsets of nodes.]
104
High Availability
  • High Availability allows Moab to run on two
    different machines, a primary and secondary
    server.
  • While both are running, the secondary server, or
    fallback server, will continually update its
    internal statistics, reservations, and other
    information to stay synchronized with the primary
    server.
  • Should the primary server stop running, the
    secondary will pick up all responsibilities of
    the primary server and begin to schedule jobs and
    track internal data.
  • When the primary server comes back online, the
    secondary server will hand over its data and
    resume functionality as the secondary server.
  • http://clusterresources.com/moabdocs/22.2ha.shtml

105
High Availability Example
moab.cfg on master server (duplicate the moab.cfg of the
master or share the same file using a shared file system):
SCHEDCFG[colony]  SERVER=head1  FBSERVER=head2

moab-private.cfg on head1 server:
CLIENTCFG[colony]  KEY=1dfv-fewv443v  HOST=head2  AUTH=admin1

moab-private.cfg on head2 server:
CLIENTCFG[colony]  KEY=1dfv-fewv443v  HOST=head1  AUTH=admin1
  • http://clusterresources.com/moabdocs/22.2ha.shtml

106
Enabling High Availability Features
  • Moab runs on two machines, primary and secondary
    server
  • The secondary server, or fallback server, will
    continually update its internal statistics,
    reservations, and other information to stay
    synchronized with the primary server and take
    over scheduling should the primary server fail

107
Configuring High Availability
  • moab.cfg
  • SCHEDCFG[mycluster] SERVER=primaryhostname:3000
  • SCHEDCFG[mycluster] FBSERVER=secondaryhostname
  • Both SERVER and FBSERVER are of the format
    <HOST>:<PORT>.  It is also necessary to ensure a
    few configuration settings for correct operation
  • each server must specify a shared key using the
    CLIENTCFG parameter in the moab-private.cfg file
  • each server must be properly configured as an
    administrator inside of the resource manager
    using the CLIENTCFG AUTH parameter
  • each server must be able to properly communicate
    with the resource manager (see the TORQUE/PBS
    integration guide for a specific example)

108
Confirming Configuration
  • Run mdiag -R to confirm the fallback Moab is able
    to communicate with the primary Moab

node40> mdiag -R
RM[rmnode30]  Type: PBS  State: Active  ResourceType: COMPUTE
  Version:           '1.2.0p6-snap.1122589577'
  Nodes Reported:    4
  Flags:             executionServer,noTaskOrdering,typeIsExplicit
  Partition:         rmnode30
  Event Management:  EPORT=15004  (NOTE: SSS protocol enabled)
  Submit Command:    /usr/local/bin/qsub
  DefaultClass:      batch
  RM Performance:    AvgTime=0.01s  MaxTime=1.03s  (218 samples)

RM[internal]  Type: SSS  State: Active
  Version:           'SSS2.0'
  Flags:             executionServer,localQueue,typeIsExplicit
  RM Performance:    AvgTime=0.00s  MaxTime=0.00s  (125 samples)

NOTE: use 'mrmctl -f -r' to clear stats/failures
109
Confirmation cont.
  • Run mdiag -n to confirm the fallback Moab is able
    to communicate with the primary resource manager.

compute node summary
Name       State    Procs     Memory    Opsys
node31     Idle      1:1      27:27     Linux-2.6
node32     Idle      1:1      27:27     Linux-2.6
node33     Idle      1:1      27:27     Linux-2.6
node34     Idle      1:1      27:27     Linux-2.6
-----      ---       4:4     108:108    -----
Total Nodes: 4  (Active: 0  Idle: 4  Down: 0)
110
Presentation Protocols
  • For problems or questions, send email to
    training@clusterresources.com
  • We will pause for questions at the end of each
    section
  • Please remain on mute except for questions
  • Please do not put your call on hold (the entire
    group will hear your music)
  • Please be conscientious of the other people
    attending the conference
  • You can also submit questions during the training
    to the AOL Instant Messenger screen name CRI Web
    Training

111
4. End Users
  • Moab Access Portal
  • Moab Cluster Manager
  • End User Commands
  • End User Empowerment

112
Moab Access Portal TM
  • Submit Jobs from a web browser
  • View and Modify only your own Workload
  • Assist end-users to self-manage behaviours

http://clusterresources.com/map
113
Moab Cluster Manager TM
  • Administer Resources and Workload Policies
    Through an Easy-to-Use Graphical User Interface
  • Monitor, Diagnose and Report Resource Allocation
    and Usage
  • http://clusterresources.com/mcm

114
End User Commands
Command Flags Description
canceljob cancel existing job
checkjob display job state, resource requirements, environment, constraints, credentials, history, allocated resources, and resource utilization
releaseres release a user reservation
setres create a user reservation
showbf show resource availability for jobs with specific resource requirements
showq display detailed prioritized list of active and idle jobs
showstart show estimated start time of idle jobs
showstats show detailed usage statistics for users, groups, and accounts which the end user has access to
115
Assist Users in Better Utilizing Resources
  • General info
  • Job eval
  • Completed job failure post-mortem
  • Job start time estimates
  • Job control
  • Reservation control

116
Assist Users in Better Utilizing Resources (contd)
  • How do You Evaluate a Request?
  • showstart (earliest start, completion time, etc.)
  • showstats -f (general service level statistics)
  • showstats -u (user statistics)
  • showbf (immediately available resources)
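
For example, a user weighing a new request might run (job ID is a
placeholder; output omitted):

> showstart 1072   # earliest start and completion estimates
> showstats -f     # general service-level statistics
> showstats -u     # per-user statistics
> showbf           # resources available immediately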

117
Presentation Protocols
  • For problems or questions, send email to
    training@clusterresources.com
  • We will pause for questions at the end of each
    section
  • Please remain on mute except for questions
  • Please do not put your call on hold (the entire
    group will hear your music)
  • Please be conscientious of the other people
    attending the conference
  • You can also submit questions during the training
    to the AOL Instant Messenger screen name CRI Web
    Training

118
5. Political Sharing and Service Delivery
  • Credentials
  • Priority
  • Usage Limits
  • Fairshare
  • Quality of Service
  • Distributing Resources
  • Partitions
  • Reservations
  • Allocation Management

119
Credentials
  • Certain job attributes (such as user, group,
    account, class and qos) describe entities the job
    belongs to and can be used to associate policies
    with jobs.
  • Every Job has credentials
  • Users (The only mandatory credential)
  • Groups (Standard Unix group or arbitrary
    collection of users)
  • Accounts (Associated with projects and billing)
  • Class (Associated with RM queues)
  • Quality Of Service (QoS) (Policy overrides,
    resource access, service targets, charge rates)
  • All Credentials can have Usage Limits, Fairshare
    Targets, Priorities, Usage History, Credential
    Access Lists / Defaults
  • http://clusterresources.com/moabdocs/3.5credoverview.shtml

120
Credential Membership
  • Membership Examples

# moab.cfg: user steve can access accounts a14, a7, a2, a6,
# and a1.  If no account is explicitly requested, his job
# will be assigned to account a2
USERCFG[steve]  ADEF=a2  ALIST=a14,a7,a2,a6,a1

# moab.cfg: account omega3 can only be accessed by users
# johnh, stevek, jenp
ACCOUNTCFG[omega3]  MEMBERULIST=johnh,stevek,jenp

# moab.cfg: controlling QoS access on a per-group basis
GROUPCFG[staff]  QLIST=standard,special  QDEF=standard
121
Fairness
  • Definition
  • giving all users equal access to compute
    resources
  • incorporating historical resource usage,
    political issues, and job value
  • Moab provides a comprehensive and flexible set of
    tools to address the many and varied fairness
    management needs.

http://clusterresources.com/moabdocs/6.0managingfairness.shtml
122
Performance Metrics
  • Metrics of Responsiveness
  • Queue Time
  • How long a job has been waiting
  • X Factor
  • Duration-weighted time responsiveness factor
  • Strongest single factor of perceived fairness
  • Metrics of Utilization
  • Throughput
  • Jobs per unit time
  • Utilization
  • Percentage of cluster in use

http://clusterresources.com/moabdocs/5.1.2priorityfactors.shtml
123
General Fairness Strategies
  • Maximize Scheduler Options -- Do Not Overspecify
  • Keep It Simple: Do Not Address Hypothetical
    Issues
  • Seek To Adjust User Behaviour, Not Limit User
    Options
  • Allow Users to Specify Required Service Level
  • Monitor Cluster Performance Regularly
  • Tune Policies As Needed

124
Priority
  • 2-tier prioritization structure
  • Independent component and subcomponent
    weights/caps
  • Components include service, target, fairshare,
    resource, usage, job attribute, and credential
  • Negative priority jobs may be blocked
  • Tuning facility available with mdiag -p
  • http://clusterresources.com/moabdocs/5.1jobprioritization.shtml
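
A sketch of the 2-tier weighting in moab.cfg; the weights are
illustrative only, and REJECTNEGPRIOJOBS is assumed here to be the
parameter that blocks negative-priority jobs:

# moab.cfg -- component weights (tier 1) and subcomponent weights (tier 2)
SERVICEWEIGHT      1      # service component
QUEUETIMEWEIGHT    10     # subcomponent within the service component
XFACTORWEIGHT      100    # subcomponent within the service component
CREDWEIGHT         1      # credential component
USERWEIGHT         10     # subcomponent within the credential component
REJECTNEGPRIOJOBS  TRUE   # block jobs whose total priority is negative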

125
Job Prioritization Component Overview
  • Service
  • Level of service delivered or anticipated
  • Includes queue time, xfactor, bypass, policy
    violation
  • Target
  • Desired service level
  • Provides exponential factor growth
  • Includes target queue time, target xfactor
  • Credential
  • Based on credential priorities
  • Includes user, group, account, QoS, and class

http://www.clusterresources.com/moabdocs/5.1.2priorityfactors.shtml#service
http://www.clusterresources.com/moabdocs/5.1.2priorityfactors.shtml#target
http://www.clusterresources.com/moabdocs/5.1.2priorityfactors.shtml#cred
126
Job Prioritization Component Overview
  • Fairshare
  • Based on historical resource consumption
  • Includes user, group, account, QoS, and class
    fairshare
  • Includes current usage metrics of jobs per user,
    procs per user, and processor-seconds per user
  • May allow prioritization with capped fairshare
    targets
  • Resource
  • Based on requested resources
  • Includes nodes