Title: Cluster Resources Training
1- Cluster Resources Training
- February 2006
2Presentation Protocols
- For problems or questions, send email to training_at_clusterresources.com
- We will pause for questions at the end of each section
- Please remain on mute except for questions
- Please do not put your call on hold (the entire group will hear your music)
- Please be conscientious of the other people attending the conference
- You can also submit questions during the training to the AOL Instant Messenger screen name CRI Web Training
3Session 1
- 1. Introduction
- 2. Deployment, Diagnostics and Troubleshooting
- 3. Optimization and Troubleshooting
- 4. End Users
- 5. Political Sharing and Service Delivery
4Session 2
- 6. Reporting and Monitoring
- 7. Grids
- 8. Utility
- 9. Torque
- 10. Future
51. Introduction
- Overview of the Modern Cluster
- Cluster Evolution
- Cluster Productivity Losses
- Moab Workload Manager Architecture
- What Moab Does
- What Moab Does Not Do
6Purpose of the Cluster: Get Work Done
Workload Types
Interactive (Jobs and Reservations w/ Workflow)
Batch (Normal Batch Jobs)
Fixed Deadline (Fixed Time or QoS-based)
Services (Web, Data-mining, Visualization, Accounting)
7Cluster Stack / Framework
Grid Workload Manager: Scheduler, Policy Manager, Integration Platform
Cluster Workload Manager: Scheduler, Policy Manager, Integration Platform
Application
Resource Manager
Portal
Application
Serial
Parallel
Security
GUI
Message Passing
CLI
Operating System
Hardware (Cluster or SMP)
Admin
8The Initial Cluster
- Standalone
- Uniform Resources
- Uniform Workload
- Dedicated Usage
- Single Support Staff
9- Increasing numbers of users, groups and organizations want to access clusters
- Increasing workload complexity: more jobs, more data intensive jobs, more dependencies
- Increasing numbers of clusters, resource managers, hardware, storage, network and licensing to manage
- More demands on administrators
- Increasing demands for higher user productivity and work completion
Networks
Groups
Data / Storage
Applications / Licenses
Resource Managers
Workload
Security
Hardware
OSs
Increasing Complexity
10The Evolved Cluster
Admin
Job Queue
User
Compute Nodes
11Productivity Losses
- Scheduling Inefficiencies
- Managing complex site policies
- Keeping Everybody Happy
- Scheduling jobs where they will finish faster rather than where they will start sooner
- Middleware Failures/Overhead
- Licensing
- Network Applications
- Resource Managers
- Hardware Failures
- Job Loss and Delay due to Node, Network, and other Infrastructure Failures
Remaining Productivity
- Partitioning Losses
- Underutilization of Resources Due to Physical Access Limits
- Environmental Losses
- File System Failures
- Network Failures
- Hardware Failures
- Intra-job inefficiencies
- Poorly Designed Jobs
- Poorly Functioning Jobs
- Heterogeneous Resources Allocated
- Political Losses
- Underutilization of Resources Due to Overly Strict Political Access Constraints
12Moab Architecture
13What Moab Does
- Optimizes Resource Utilization with Intelligent Scheduling and Advanced Reservations
- Unifies Cluster Management across Varied Resources and Services
- Dynamically Adjusts Workload to Enforce Policies and Service Level Agreements
- Automates Diagnosis and Failure Response
14What Moab Does Not Do
- Does not do resource management (usually)
- Does not install the system (usually)
- Not a storage manager
- Not a license manager
- Does not do message passing
162. Deployment, Diagnostics and Troubleshooting
- 2.1 Installation and Deployment
- 2.2 Troubleshooting and Diagnostics
172.1 Deployment
- Installation
- Configuration
- Testing
- Simulation
- Scheduling Sequence
- Commands Overview
- Scheduling Objects
18Moab Workload Manager Installation
- You only install Moab Workload Manager on the head node.
- > tar -xzvf moab-4.5.0p0.linux.tar.gz
- > cd moab-4.5.4
- > ./configure
- > make
- When you are ready to use Moab in production, you may install it into the install directory you have configured using make install.
- Workload Manager must be running before
Cluster Manager and Access Portal will work.
- You can choose to install client commands on
a remote system as well.
19File Locations
- (MOABHOMEDIR)
- moab.cfg (general config file containing information required by both the Moab server and user interface clients)
- moab-private.cfg (config file containing private information required by the Moab server only)
- .moab.ck (Moab checkpoint file)
- .moab.pid (Moab 'lock' file to prevent multiple instances)
- log (directory for Moab log files - REQUIRED BY DEFAULT)
- moab.log (Moab log file)
- moab.log.1 (previous 'rolled' Moab log file)
- stats (directory for Moab statistics files - REQUIRED BY DEFAULT)
- Moab stats files (in format 'stats.<YYYY>_<MM>_<DD>')
- Moab fairshare data files (in format 'FS.<EPOCHTIME>')
- tools (directory for local tools called by Moab - OPTIONAL BY DEFAULT)
- traces (directory for Moab simulation trace files - REQUIRED FOR SIMULATIONS)
- resource.trace1 (sample resource trace file)
- workload.trace1 (sample workload trace file)
20- spool (directory for temporary Moab files - REQUIRED FOR ADVANCED FEATURES)
- contrib (directory containing contributed code in the areas of GUIs, algorithms, policies, etc.)
- (MOABINSTDIR)
- bin (directory for installed Moab executables)
- moab (Moab scheduler executable)
- mclient (Moab user interface client executable)
- /etc/moab.cfg (optional file. This file is used to override the default '(MOABHOMEDIR)' setting. It should contain the string 'MOABHOMEDIR (DIRECTORY)' to override the 'built-in' '(MOABHOMEDIR)' setting.)
21Initial Configuration moab.cfg
- moab.cfg contains the parameters and settings for Moab Workload Manager. This is where you will set most of the policy settings.
Example of what moab.cfg will look like after installation
moab.cfg
SCHEDCFG[Moab] SERVER=test.icluster.org:4255
ADMINCFG[1] USERS=root
RMCFG[base] TYPE=PBS
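The entries in the installed moab.cfg share one line shape, which makes the file easy to inspect from a script. The following is a minimal Python sketch, assuming each line has the form PARAM[index] ATTR=value (with the index and the ATTR=value pairs both optional). The helper name parse_moab_cfg is hypothetical and is not part of Moab; quoted values containing spaces are not handled.

```python
import re

# Assumed line shape: PARAM[index] ATTR=value ATTR=value ...
# Both the [index] part and the ATTR=value pairs are optional.
LINE_RE = re.compile(
    r"^(?P<param>[A-Za-z_.]+)(?:\[(?P<index>[^\]]*)\])?\s*(?P<rest>.*)$"
)


def parse_moab_cfg(text):
    """Return a list of (param, index, attrs, bare_values) tuples.

    Hypothetical helper for illustration only; it is not a Moab API.
    """
    entries = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # line does not start with a parameter name
        attrs, bare = {}, []
        for token in match.group("rest").split():
            if "=" in token:
                key, value = token.split("=", 1)
                attrs[key] = value
            else:
                bare.append(token)  # e.g. the node list after IGNORENODES
        entries.append((match.group("param"), match.group("index"), attrs, bare))
    return entries


SAMPLE = """\
SCHEDCFG[Moab]  SERVER=test.icluster.org:4255
ADMINCFG[1]     USERS=root
RMCFG[base]     TYPE=PBS
"""

if __name__ == "__main__":
    for entry in parse_moab_cfg(SAMPLE):
        print(entry)
```

A script like this is only a convenience for reading the file; the training recommends mdiag for verifying that a configuration is valid.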
22Supported Platforms/Environments
- Resource Managers
- TORQUE, OpenPBS, PBSPro, LSF, Loadleveler, SLURM, BProc, clubMASK, S3, WIKI
- Operating Systems
- RedHat, SUSE, Fedora, Debian, FreeBSD (all known variants of Linux), AIX, IRIX, HP-UX, OS/X, OSF/Tru-64, SunOS, Solaris (all known variants of UNIX)
- Hardware
- Intel x86, Intel IA-32, Intel IA-64, AMD x86, AMD Opteron, SGI Altix, HP, IBM SP, IBM x-Series, IBM p-Series, IBM i-Series, Mac G4 and G5
23Basic Parameters
- SCHEDCFG
- Specifies how the Moab server will execute and communicate with client requests.
- Example: SCHEDCFG[orion] SERVER=cw.psu.edu
- ADMINCFG
- Moab provides role-based security enabled by way of multiple levels of admin access.
- Example: The following may be used to enable users greg and thomas as level 1 admins:
- ADMINCFG[1] USERS=greg,thomas
- NOTE: Moab may only be launched by the primary admin user id.
- RMCFG
- In order for Moab to properly interact with a resource manager, the interface to this resource manager must be defined.
- Example: To interface to a TORQUE resource manager, the following may be used:
- RMCFG[torque1] TYPE=pbs
24Scheduling Modes
- Configure modes in moab.cfg
- Simulation Mode
- Allows a test drive of the scheduler. You can evaluate how various policies can improve the current performance on a stable production system.
- Test Mode
- Test mode allows evaluation of new Moab releases, configurations, and policies in a risk-free manner. In test mode, Moab behaves identically to live or normal mode except that it cannot start, cancel, or modify jobs.
- Normal Mode
- Live (after installation, automatically set this way)
- Interactive Mode
- Like test mode, but instead of disabling all resource and job control functions, Moab sends the desired change request to the screen and asks for permission to complete it.
26Testing New Policies
- Verifying Correct Specification of New Policies
- If manually editing the moab.cfg file, use the mdiag -C command
- Moab Cluster Manager automatically verifies proper policy specification
- Verifying Correct Behavior of New Policies
- Put in INTERACTIVE Mode to ensure you want to make each change
- Determining Long Term Impact of New Policies
- Put in SIMULATION Mode
27- Moab 'Side-by-Side'
- Allows a production cluster or other resource to be logically partitioned along resource and workload boundaries and allows different instances of Moab to schedule different partitions.
- Use parameters IGNORENODES, IGNORECLASSES, IGNOREUSERS
moab.cfg for production partition
SCHEDCFG[prod] MODE=NORMAL SERVER=orion.cxz.com:42020
RMCFG[TORQUE] TYPE=PBS
IGNORENODES node61,node62,node63,node64
IGNOREUSERS gridtest1,gridtest2
moab.cfg for test partition
SCHEDCFG[prod] MODE=NORMAL SERVER=orion.cxz.com:42020
RMCFG[TORQUE] TYPE=PBS
IGNORENODES !node61,node62,node63,node64
IGNOREUSERS !gridtest1,gridtest2
28Simulation
- What is the impact of additional hardware on cluster utilization?
- What delays to key projects can be expected with the addition of new users?
- How will new prioritization weights alter cycle distribution among existing workload?
- What total loss of compute resources will result from introducing a maintenance downtime?
- Are the benefits of cycle stealing from non-dedicated desktop systems worth the effort?
- How much will anticipated grid workload delay the average wait time of local jobs?
29Scheduling Iterations
- Update State Information
- Refresh Reservations
- Schedule Reserved Jobs
- Schedule Priority Jobs
- Backfill Jobs
- Update Statistics
- Handle User Requests
- Perform Next Scheduling Cycle
30Job Flow
- Determine Basic Job Feasibility
- Prioritize Jobs
- Enforce Configured Throttling Policies
- Determine Resource Availability
- Allocate Resources to Job
- Launch Job
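The six job flow steps can be read as a pipeline. The following is a toy Python sketch of one pass through that pipeline, not Moab's implementation. The Job fields, the single processor pool, and the max_jobs_per_user limit are all assumptions made for illustration, and the sketch simply skips a job that does not fit instead of reserving resources for it.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    user: str
    procs: int
    priority: int


def schedule_pass(jobs, free_procs, total_procs, max_jobs_per_user):
    """One simplified pass over the queue; returns (started_names, free_procs)."""
    started = []
    running_per_user = {}
    # 1. Basic feasibility: a job larger than the whole cluster can never run.
    feasible = [job for job in jobs if job.procs <= total_procs]
    # 2. Prioritize: highest priority first.
    for job in sorted(feasible, key=lambda j: j.priority, reverse=True):
        # 3. Throttling policy (assumed here: a per-user running-job limit).
        if running_per_user.get(job.user, 0) >= max_jobs_per_user:
            continue
        # 4. Resource availability right now.
        if job.procs > free_procs:
            continue
        # 5. Allocate resources to the job.
        free_procs -= job.procs
        running_per_user[job.user] = running_per_user.get(job.user, 0) + 1
        # 6. Launch (recorded only; a real scheduler would call the resource manager).
        started.append(job.name)
    return started, free_procs


if __name__ == "__main__":
    queue = [
        Job("A", "alice", 4, 100),
        Job("B", "alice", 2, 90),
        Job("C", "bob", 8, 80),
        Job("D", "bob", 2, 10),
        Job("E", "carol", 64, 500),
    ]
    print(schedule_pass(queue, free_procs=8, total_procs=16, max_jobs_per_user=1))
```

In this example job E fails the feasibility check, B is stopped by the per-user limit, and C does not fit in the remaining processors, so only A and D start.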
31Commands Overview
Command Description
checkjob provide detailed status report for specified job
checknode provide detailed status report for specified node
mcredctl controls various aspects about the credential objects within Moab
mdiag provide diagnostic reports for resources, workload, and scheduling
mjobctl control and modify job
mnodectl control and modify nodes
mrmctl query and control resource managers
mrsvctl control and modify reservations
mschedctl modify scheduler state and behavior
mshow displays various diagnostic messages about the system and job queues
msub scheduler job submission
resetstats reset scheduler statistics
showbf show current resource availability
showq show queued jobs
showres show existing reservations
showstart show estimates of when job can/will start
showstate show current state of resources
showstats show usage statistics
32End User Commands
Command Flags Description
canceljob cancel existing job
checkjob display job state, resource requirements, environment, constraints, credentials, history, allocated resources, and resource utilization
showbf show resource availability for jobs with specific resource requirements
showq display detailed prioritized list of active and idle jobs
showstart show estimated start time of idle jobs
showstats show detailed usage statistics for users, groups, and accounts which the end user has access to
33Creating a Remap Class
moab.cfg
# jobs submitted to 'batch' should be remapped
REMAPCLASS batch
# stevens only queue
CLASSCFG[stevens] MIN.FEATURES=stevens REQUIREDUSERLIST=stevens,stevens2
# special queue for I/O nodes
CLASSCFG[io] MAX.PROC=8 MIN.FEATURES=io
# general access queues
CLASSCFG[quick] MIN.PROC=2 MAX.PROC=8 MIN.FEATURES=fast|short
CLASSCFG[medium] MIN.PROC=2 MAX.PROC=8
CLASSCFG[default] MAX.PROC=64
...
34Scheduling Objects
- Moab functions by manipulating five primary, elementary objects:
- Jobs
- Nodes
- Reservations
- QOS structures
- Policies
35Jobs
- Job information is provided to the Moab scheduler from a resource manager (such as Loadleveler, PBS, Wiki, or LSF)
- Job attributes include ownership of the job, job state, amount and type of resources required by the job, and a wallclock limit
- A job consists of one or more requirements, each of which requests a number of resources of a given type.
36Job States
- Indicates the job's current status and eligibility for execution and can be any of the values listed below
- Idle
- Job is currently queued and eligible to run but is not executing
- Starting
- The batch system has attempted to start the job and the job is currently performing pre-start tasks which may include provisioning resources, staging data, executing system pre-launch scripts, etc.
- Running
- Job is currently executing the user application
- Suspended
- Job was running but has been suspended by the scheduler or an admin. The user application is still in place on the allocated compute resources but it is not executing
- Completed
- The job has completed running
- Hold
- The job is idle and is not eligible to run due to a user, admin, or batch system hold
37Job Requirement
- Consists of a request for a single type of resources. Each requirement consists of the following components:
- Task Definition
- A specification of the elementary resources which compose an individual task.
- Resource Constraints
- A specification of conditions which must be met in order for resource matching to occur. Only resources from nodes which meet all resource constraints may be allocated to the job req.
- Task Count
- The number of task instances required by the req.
- Task List
- The list of nodes on which the task instances have been located.
- Req Statistics
- Statistics tracking resource utilization
38Nodes
- Within Moab, a node is a collection of resources with a particular set of associated attributes.
- A node is defined as one or more CPUs, together with associated memory, and possibly other compute resources such as local disk, swap, network adapters, software licenses, etc.
39Advance Reservations
- An object which dedicates a block of specific resources for a particular use.
- Each reservation consists of a list of resources, an access control list, and a time range for which this access control list will be enforced.
- The reservation prevents the listed resources from being used in a way not described by the access control list during the time range specified.
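The three parts of a reservation (resource list, access control list, time range) map directly onto a small data structure. The following Python sketch is an illustration under simplifying assumptions: the ACL is reduced to a set of user names, times are epoch seconds, and the names Reservation and blocked_by are hypothetical rather than Moab identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reservation:
    nodes: frozenset      # list of reserved resources (node names here)
    acl_users: frozenset  # simplified access control list: allowed users
    start: int            # start of enforced time range, epoch seconds
    end: int              # end of enforced time range, epoch seconds


def blocked_by(rsv, user, node, job_start, job_end):
    """True if the reservation prevents `user` from using `node` in the interval."""
    on_reserved_node = node in rsv.nodes
    # Half-open intervals [start, end) overlap when each starts before the other ends.
    overlaps = job_start < rsv.end and rsv.start < job_end
    allowed = user in rsv.acl_users
    return on_reserved_node and overlaps and not allowed


if __name__ == "__main__":
    rsv = Reservation(frozenset({"node01", "node02"}), frozenset({"alice"}), 1000, 2000)
    print(blocked_by(rsv, "bob", "node01", 1500, 1800))    # overlapping, not in ACL
    print(blocked_by(rsv, "alice", "node01", 1500, 1800))  # in ACL
    print(blocked_by(rsv, "bob", "node01", 2000, 2500))    # after the time range
```

Outside the time range, or on nodes not listed, the reservation has no effect, which matches the description that the ACL is enforced only for the listed resources during the specified range.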
40Policies
- Generally specified via a config file and serve to control how and when jobs start.
- Include
- Job prioritization
- Fairness policies
- Fairshare configuration policies
- Scheduling policies
41Resources
- Jobs, nodes, and reservations all deal with the abstract concept of a resource. A resource in the Moab world is one of the following:
- Processors
- Specified with a simple count value.
- Memory
- Real memory or 'RAM' is specified in megabytes (MB).
- Swap
- Virtual memory or 'swap' is specified in megabytes (MB).
- Disk
- Local disk is specified in megabytes (MB).
- In addition to these elementary resource types, there are two higher level resource concepts used within Moab.
- Task
- Processor equivalent, or PE.
42 Task
- A collection of elementary resources which must be allocated together within a single node.
- In Moab, when jobs or reservations request resources, they do so in terms of tasks, typically using a task count and a task definition.
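Because a task must be allocated within a single node, the number of tasks a node can host is limited by its scarcest resource. The following Python sketch illustrates that arithmetic. The dictionary layout and the function names are assumptions for illustration, and the resource names simply mirror the elementary types listed earlier (processors, memory, swap, disk).

```python
def tasks_per_node(node, task):
    """Number of whole copies of `task` that fit on `node`.

    Both arguments are dicts mapping a resource name to an amount
    (processors as a count; memory, swap and disk in MB).
    """
    ratios = [node.get(res, 0) // amount for res, amount in task.items() if amount > 0]
    if not ratios:
        raise ValueError("task definition requests no resources")
    return min(ratios)


def place_tasks(nodes, task, task_count):
    """Greedily place `task_count` tasks; returns {node_name: tasks} or None."""
    placement = {}
    remaining = task_count
    for name, resources in nodes.items():
        if remaining == 0:
            break
        fit = min(tasks_per_node(resources, task), remaining)
        if fit > 0:
            placement[name] = fit
            remaining -= fit
    return placement if remaining == 0 else None


if __name__ == "__main__":
    task = {"procs": 1, "mem": 4096}  # task definition: 1 processor + 4096 MB
    nodes = {
        "node01": {"procs": 8, "mem": 16384},
        "node02": {"procs": 2, "mem": 16384},
    }
    print(tasks_per_node(nodes["node01"], task))
    print(place_tasks(nodes, task, task_count=6))
```

Here node01 has eight processors but only enough memory for four tasks, so memory is the limiting resource; the remaining two tasks land on node02.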
43Resource Manager (RM)
- While other systems may have more strict interpretations of a resource manager and its responsibilities, Moab's multi-resource manager support allows a much more liberal interpretation.
- In essence, any object which can provide environmental information and environmental control can be utilized as a resource manager.
- Moab is able to aggregate information from multiple unrelated sources into a larger, more complete world view of the cluster which includes all the information and control found within a standard resource manager such as TORQUE, including:
- Node
- Job
- Queue management services
44Node Attributes
- ACCESS
- Node access policy which can be one of SHARED, SHAREDONLY, SINGLEJOB, SINGLETASK, or SINGLEUSER
- CHARGERATE
- Assign specific charging rates to the usage of particular resources
- COMMENT
- Allows an organization to annotate a node via the config file to indicate special information regarding this node to both users and administrators
NODECFG[node013] COMMENT="Login Node"
45- FEATURES
- The NODECFG parameter can be used to directly assign a list of node features to individual nodes.
- GRES
- The NODECFG parameter can be used to directly assign a list of consumable generic attributes to individual nodes or to the special pseudo-node global, which provides shared cluster (aka floating) consumable resources.
NODECFG[node013] FEATURES=gpfs,fastio
NODECFG[node013] GRES=quickcalc:20
46 Resource Managers
- Moab can be configured to manage more than one resource manager simultaneously, even resource managers of different types.
- Moab aggregates information from the RMs to fully manage workload, resources, and cluster policies
47Moab's Interaction With RM
- Load global resource information
- Load node specific information (optional)
- Load job information
- Load queue/policy information (optional)
- Cancel/preempt/modify jobs according to cluster policies
- Start jobs in accordance with available resources and policy constraints
- Handle user commands
48Resource Manager Configuration
- RMCFG parameter with TYPE attribute
- Scheduler/Resource Manager Interactions
- GETJOBINFO
- GETNODEINFO
- STARTJOB
- CANCELJOB
RMCFG[orion] TYPE=PBS
49License Management
- Moab supports both node-locked and floating license models and even allows mixing the two models simultaneously
- Methods for determining license availability
- Local Consumable Resources
- Resource Manager Based Consumable Resources
- Interfacing to an External License Manager
- Requesting Licenses within Jobs
qsub
> qsub -l nodes=2,software=blast cmdscript.txt
502.2 Troubleshooting and Diagnostics
- Object Messages
- Diagnostic Commands
- Admin Notification
- Logging
- Tracking System Failures
- Checkpointing
- Debuggers
http://www.clusterresources.com/products/mwm/moabdocs/14.0troubleshootingandsysmaintenance.shtml
51Object Messages
- Messages can hold information regarding failures and key events
- Messages possess event time, owner, expiration time, and event count information
- Resource managers and peer services can attach messages to objects
- Admins can attach messages
- Multiple messages per object are supported
- Messages are persistent
- http://www.clusterresources.com/products/mwm/moabdocs/commands/mschedctl.shtml
- http://www.clusterresources.com/products/mwm/moabdocs/14.3messagebuffer.shtml
52Diagnostics
- Moab's diagnostic commands present detailed state information
- Scheduling problems
- Summarize performance
- Evaluate current operation, reporting on any unexpected or potentially erroneous conditions
- Where possible, correct detected problems if desired
53mdiag
- Displays object state/health
- Displays object configuration
- Attributes, resources, policies
- Displays object history and performance
- Displays object failures and messages
- http://www.clusterresources.com/products/mwm/moabdocs/commands/mdiag.shtml
54mdiag usage
- Most common diagnostics
- Scheduler (mdiag -S)
- Jobs (mdiag -j)
- Nodes (mdiag -n)
- Resource manager (mdiag -R)
- Blocked jobs (mdiag -b)
- Configuration (mdiag -C)
- Other diagnostics
- Fairshare, Priority
- Users, Accounts, Classes
- Reservations, QoS, etc.
DEMO
mdiag
- http://www.clusterresources.com/products/mwm/moabdocs/commands/mdiag.shtml
55mdiag details
- Performs numerous internal health and consistency checks
- Race conditions, object configuration inconsistencies, possible external failures
- Not just for failures
- Provides status, config, and current performance
- Enables Moab as an information service
- --flags=xml
56Job Troubleshooting
- To determine why a particular job will not start, there are several commands which can be helpful
- checkjob -v
- Checkjob will evaluate the ability of a job to start immediately. Tests include resource access, node state, and job constraints (i.e., startdate, taskspernode, QOS, etc.). Additionally, command line flags may be specified to provide further information.
- -l <POLICYLEVEL> // evaluate impact of throttling policies on job feasibility
- -n <NODENAME> // evaluate resource access on specific node
- -r <RESERVATION_LIST> // evaluate access to specified reservations
- checknode
- Display detailed status of node
- mdiag -b
- Display various reasons job is considered 'blocked' or 'non-queued'.
- mdiag -j
- Display high level summary of job attributes and perform sanity check on job attributes/state.
- showbf -v
- Determine general resource availability subject to specified constraints.
57Other Diagnostics
- checkjob and checknode commands
- Why a job cannot start
- Which nodes can be available
- Information regarding the recent events impacting the current job
- Node state
58Admin Notification
- MAILPROGRAM must be set to DEFAULT
- Notifications can be delivered as a result of
generic events, generic metric thresholds and
QoS-based service levels
- http://www.clusterresources.com/products/mwm/moabdocs/14.4eventmgmt.shtml
59Issues with Client Commands
- Utilize built-in Moab logging
- showq --loglevel=9
- Or
- Check the Moab log files
60Logging Facilities
- Moab Log
- Report detailed scheduler actions, configuration, events, failures, etc.
- Event Log
- Report scheduler, job, node, and reservation events and failures
- Syslog
- USESYSLOG
stats/events.Wed_Aug_24_2005
1124979598 rm base RMUP initialized
1124979598 sched Moab SCHEDSTART -
1124982013 node node017 GEVENT CPU2 Down
1124989457 node node135 GEVENT /var/tmp Full
1124996230 node node139 GEVENT /home Full
1125013524 node node407 GEVENT Transient Power Supply Failure
- http://www.clusterresources.com/products/mwm/moabdocs/a.fparameters.shtml#eventrecordlist
- http://www.clusterresources.com/products/mwm/moabdocs/14.2logging.shtml
- http://www.clusterresources.com/products/mwm/moabdocs/a.fparameters.shtml#usesyslog
61Logging Basics
- LOGDIR - Indicates directory for log files
- LOGFILE - Indicates path name of log file
- LOGFILEMAXSIZE - Indicates maximum size of log file before rolling
- LOGFILEROLLDEPTH - Indicates maximum number of log files to maintain
- LOGLEVEL - Indicates verbosity of logging
62Function Level Information
- In source and debug releases, each subroutine is
logged, along with all printable parameters.
moab.log MPolicyCheck(orion.322,2,Reason)
63Status Information
- Information about internal status is logged at all LOGLEVELs. Critical internal status is indicated at low LOGLEVELs while less critical and more verbose status information is logged at higher LOGLEVELs.
moab.log
INFO: job orion.4228 rejected (max user jobs)
INFO: job fr4n01.923.0 rejected (maxjobperuser policy failure)
64Scheduler Warnings
- Warnings are logged when the scheduler detects an
unexpected value or receives an unexpected result
from a system call or subroutine.
moab.log
WARNING: cannot open fairshare data file '/opt/moab/stats/FS.87000'
65Scheduler Alerts
- Alerts are logged when the scheduler detects
events of an unexpected nature which may indicate
problems in other systems or in objects.
moab.log
ALERT: job orion.72 cannot run. deferring job for 360 Seconds
66Scheduler Errors
- Errors are logged when the scheduler detects problems which impact the scheduler's ability to properly schedule the cluster.
moab.log
ERROR: cannot connect to Loadleveler API
67Searching Moab Logs
- While major failures will be reported via the mdiag -S command, these failures can also be uncovered by searching the logs using the grep command as in the following:
> grep -E "WARNING|ALERT|ERROR" moab.log
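Beyond listing matching lines with grep, it can help to summarize how many messages of each severity a log contains. The following Python sketch does this under one assumption: that each flagged line carries one of the keywords WARNING, ALERT, or ERROR as a whole word, as in the sample messages on the preceding slides. The function name count_severities is hypothetical.

```python
import re
from collections import Counter

# Same alternation as the grep pattern, restricted to whole words.
SEVERITY_RE = re.compile(r"\b(WARNING|ALERT|ERROR)\b")


def count_severities(lines):
    """Count log lines by the first severity keyword found on each line."""
    counts = Counter()
    for line in lines:
        match = SEVERITY_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


SAMPLE_LOG = [
    "INFO: job orion.4228 rejected (max user jobs)",
    "WARNING: cannot open fairshare data file '/opt/moab/stats/FS.87000'",
    "ALERT: job orion.72 cannot run. deferring job for 360 Seconds",
    "ERROR: cannot connect to Loadleveler API",
    "WARNING: cannot open fairshare data file '/opt/moab/stats/FS.87000'",
]

if __name__ == "__main__":
    for severity, total in sorted(count_severities(SAMPLE_LOG).items()):
        print(severity, total)
```

To run it against a real log, pass an open file object instead of the sample list, for example count_severities(open("moab.log")).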
68Event Logs
- Major events are reported to both the Moab log file as well as the Moab event log. By default, the event log is maintained in the statistics directory and rolls on a daily basis, using the naming convention
- events.WWW_MMM_DD_YYYY (e.g. events.Fri_Aug_19_2005)
event log format
<EPOCHTIME> <OBJECT> <OBJECTID> <EVENT> <DETAILS>
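The five-field event log format is straightforward to parse. The following Python sketch assumes the fields are whitespace-separated in the order epoch time, object, object id, event, details, and that only the details field may contain spaces. The names EventRecord and parse_event are hypothetical.

```python
from collections import namedtuple

EventRecord = namedtuple("EventRecord", "epoch object_type object_id event details")


def parse_event(line):
    """Split one event log line into its five fields.

    Only the last field (details) is allowed to contain spaces, so the
    line is split at most four times.
    """
    fields = line.split(None, 4)
    if len(fields) < 4:
        raise ValueError("not an event record: %r" % line)
    details = fields[4].strip() if len(fields) == 5 else ""
    return EventRecord(int(fields[0]), fields[1], fields[2], fields[3], details)


if __name__ == "__main__":
    sample = [
        "1124979598 sched Moab SCHEDSTART -",
        "1124982013 node node017 GEVENT CPU2 Down",
        "1125013524 node node407 GEVENT Transient Power Supply Failure",
    ]
    for line in sample:
        print(parse_event(line))
```

Once parsed, records can be filtered by object type or event name, for example to list every GEVENT reported for a given node.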
69Enabling Syslog
- In addition to the log file, the Moab Scheduler can report events it determines to be critical to the UNIX syslog facility via the daemon facility using priorities ranging from INFO to ERROR.
- The verbosity of this logging is not affected by the LOGLEVEL parameter. In addition to errors and critical events, user commands that affect the state of the jobs, nodes, or the scheduler may also be logged to syslog.
- Moab syslog messages are reported using the INFO, NOTICE, and ERR syslog priorities.
70Tracking System Failures
- The scheduler has a number of dependencies which may cause failures if not satisfied.
- Disk Space
- The scheduler utilizes a number of files. If the file system is full or otherwise inaccessible, the following behaviors might be noted:
File Failure
moab.pid scheduler cannot perform single instance check
moab.ck scheduler cannot store persistent record of reservations, jobs, policies, summary statistics, etc.
moab.cfg/moab.dat scheduler cannot load local configuration
log/ scheduler cannot log activities
stats/ scheduler cannot write job records
71- Network
- The scheduler utilizes a number of socket
connections to perform basic functions. Network
failures may affect the following facilities.
Network Connection Failure
scheduler client scheduler client commands fail
resource manager scheduler is unable to load/update information regarding nodes and jobs
allocation manager scheduler is unable to validate account access or reserve/debit account balances
72- Processor Utilization
- On a heavily loaded system, the scheduler may
appear sluggish and unresponsive. However no
direct failures should result from this slowdown.
Indirect failures may include timeouts of peer
services (such as the resource manager or
allocation manager) or timeouts of client
commands.
- Memory
- Depending on cluster size and configuration,
the scheduler may require up to 120 MB of
memory on the server host. If inadequate memory
is available, multiple aspects of scheduling may
be negatively affected.
73Internal Errors
- Logs
- Tracing the Failure with a Debugger
74Checkpointing
- Moab checkpoints its internal state. The checkpoint file records statistics and attributes for jobs, nodes, reservations, users, groups, classes, and almost every other scheduling object.
- CHECKPOINTEXPIRATIONTIME - Indicates how long unmodified data should be kept after the associated object has disappeared (i.e., job priority for a job no longer detected).
- FORMAT - [[[DD:]HH:]MM:]SS
- EXAMPLE - CHECKPOINTEXPIRATIONTIME 1:00:00:00
- CHECKPOINTFILE - Indicates path name of checkpoint file
- FORMAT - <STRING>
- EXAMPLE - CHECKPOINTFILE /var/adm/moab/moab.ck
- CHECKPOINTINTERVAL - Indicates interval between subsequent checkpoints.
- FORMAT - [[[DD:]HH:]MM:]SS
- EXAMPLE - CHECKPOINTINTERVAL 00:15:00
moab.cfg
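The two checkpoint time parameters take a duration with optional leading fields. The following Python sketch converts such a value to seconds, assuming the colon-separated form [[[DD:]HH:]MM:]SS, in which fields are filled from the right. The function name to_seconds is hypothetical.

```python
def to_seconds(spec):
    """Convert a [[[DD:]HH:]MM:]SS duration string to seconds.

    Fields are read from the right: the last one is seconds, the one
    before it minutes, then hours, then days.
    """
    parts = [int(part) for part in spec.split(":")]
    if not 1 <= len(parts) <= 4:
        raise ValueError("expected 1 to 4 colon-separated fields: %r" % spec)
    weights = [1, 60, 3600, 86400]  # seconds, minutes, hours, days
    return sum(value * weight for value, weight in zip(reversed(parts), weights))


if __name__ == "__main__":
    print(to_seconds("00:15:00"))    # a 15 minute checkpoint interval
    print(to_seconds("1:00:00:00"))  # a 1 day expiration time
```

Under this reading, the example interval is 900 seconds and the example expiration time is 86400 seconds.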
75Using a Debugger
- Attach to a running Moab process
- or
- Start Moab under the debugger
> export MOABDEBUG=yes
> cd <MOABHOMEDIR>/src/moab
> gdb ../../bin/moab
(gdb) b MQOSInitialize
(gdb) r
76Cluster Resources Training
- We will reconvene at 12:00 pm EST
783. Optimization and Uptime
- 3.1 Optimization
- 3.2 Uptime
793.1 Optimization
- Optimization is maximizing performance while fully addressing all mission objectives. True optimization includes aspects of policy selection, increased availability, user training, and other factors.
- Identifying Policy Bottlenecks
- Identifying Resource Fragmentation
- Preemption
- Malleable/Dynamic Jobs
- Backfill
81Identifying Policy Bottlenecks
- Most Optimization is Enabled by Default
- Sources of Bottlenecks
- Usage Limits, Fairshare Caps
- Eval Steps
- Verify priority (are most important jobs getting access to resources first?)
- http://clusterresources.com/moabdocs/commands/mdiag-priority.shtml
82Identifying Policy Bottlenecks (contd)
- Sources of Bottlenecks (contd)
- Eval Steps (contd)
- Check job blockage
- Adjust Limits, Caps, Priority as needed
- If needed, use simulation to determine performance impact of changes
- http://clusterresources.com/moabdocs/commands/mdiag-queues.shtml
83Identifying Resource Fragmentation
- Fragmentation based on queues, reservations, partitions, OSs, architectures, etc.
- Recommend changes, use node sets, soften reservations, time-based reservations, etc.
- User training to eliminate user specified fragmentation
84Preemption
- Conflict between high utilization for cluster and guarantees for important jobs
- Preemption allows scheduler to 'retract' some scheduling decisions to address newly submitted workload
- QoS-based preemption allows scheduler to enable preemption only if targets cannot be satisfied in other ways
http://www.clusterresources.com/products/mwm/docs/8.4preemption.shtml
85QoS Based Preemption
- Preemption only occurs when the following 3 conditions are satisfied:
- The preemptor job has the PREEMPTOR attribute set
- The preemptee job has the PREEMPTEE attribute set
- The preemptor job has a higher priority than the preemptee job
PREEMPTPOLICY REQUEUE
# enable qos priority to make preemptors higher priority than preemptees
QOSWEIGHT 1
QOSCFG[high] QFLAGS=PREEMPTOR PRIORITY=1000
QOSCFG[med]
QOSCFG[low] QFLAGS=PREEMPTEE
# associate class 'special' with QOS high
CLASSCFG[special] QDEF=high
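The three preemption conditions amount to a single boolean test. The following Python sketch expresses that test. The Job class and its fields are assumptions made for illustration; only the PREEMPTOR and PREEMPTEE flag names come from the slide.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int
    flags: frozenset = frozenset()


def can_preempt(preemptor, preemptee):
    """All three conditions from the slide must hold."""
    return (
        "PREEMPTOR" in preemptor.flags
        and "PREEMPTEE" in preemptee.flags
        and preemptor.priority > preemptee.priority
    )


if __name__ == "__main__":
    high = Job("urgent", priority=1000, flags=frozenset({"PREEMPTOR"}))
    med = Job("normal", priority=10)
    low = Job("scavenger", priority=1, flags=frozenset({"PREEMPTEE"}))
    print(can_preempt(high, low))  # both flags set, higher priority
    print(can_preempt(high, med))  # target is not marked PREEMPTEE
    print(can_preempt(med, low))   # source is not marked PREEMPTOR
```

This also shows why the sample configuration gives the high QoS a priority boost: without a higher priority, the PREEMPTOR flag alone is not sufficient.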
86Preemption based Backfill
- The PREEMPT backfill policy allows the scheduler to start backfill jobs even if required walltime is not available.
- If the job runs too long and interferes with another job which was guaranteed a particular timeslot, the backfill job is preempted and the priority job is allowed to run.
- When another potential timeslot becomes available, the preempted backfill job will again be optimistically executed.
87Trigger and Context Based Preemption Policies
- Mark a job a preemptor if its delivered or expected response time exceeds a specified threshold
- Mark a job preemptible if it violates soft policy usage limits or fairshare targets
- Mark a job a preemptor if it is running in a reservation it owns
- Preempt a job as the result of a specific user, node, job, reservation, or other object event using object triggers
- Preempt a job as the result of an external generic event or generic metric
88Types of Preemption
- PREEMPTPOLICY Parameter
- Job Requeue
- Active jobs are terminated and returned to the job queue in an idle state.
- Job Suspend
- Active jobs stop executing but remain in memory on the allocated compute nodes and must later be resumed.
- Job Checkpoint
- The job saves its current state and either terminates or continues running.
- Job Cancel
- Active jobs are canceled.
Moab is only able to utilize preemption if the underlying resource manager/OS combination supports this capability.
89Malleable/Dynamic Jobs
- Moab adjusts jobs to utilize available resources
and fill holes
- Moab adjusts both job size and job duration
- Only supported with resource managers which
support dynamic job modification (e.g. TORQUE) or
with msub
http://www.clusterresources.com/products/mwm/docs/22.4dynamicjobs.shtml
90Backfill
- Allows a scheduler to make better use of
available resources by running jobs out of order
- Prioritizes the jobs in the queue according to a
number of factors and then orders the jobs into a
highest priority first (or priority FIFO) sorted
list
91 Backfill Algorithms
- FIRSTFIT
- Selects the jobs which actually fit in the current
backfill window
- First job is started
- While backfill jobs and idle resources remain,
step 1 repeats
- BESTFIT
- Selects the jobs which actually fit in the current
backfill window
- Degree of fit determined by the BACKFILLMETRIC
parameter
- Job with best fit started
- While backfill jobs and idle resources remain,
step 1 repeats
- GREEDY
- Selects the jobs which actually fit in the current
backfill window
- All possible combinations of jobs are evaluated,
degree of fit determined by the BACKFILLMETRIC
parameter
- OPTIMISTIC/PREEMPT
- Selects jobs which can most likely fit using
wall-clock accuracy based estimates
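The FIRSTFIT and BESTFIT loops can be sketched in a few lines. This is a simplified model, not Moab's implementation: the `Job` tuple, the `backfill` function, and the choice of procs * walltime as the degree-of-fit metric are all assumptions made for illustration, and the window's time limit is treated as fixed.

```python
from typing import List, NamedTuple


class Job(NamedTuple):
    name: str
    procs: int       # processors requested
    walltime: int    # requested wall-clock limit, in minutes


def fits(job: Job, idle_procs: int, window: int) -> bool:
    """A job fits if it needs no more procs or time than the window offers."""
    return job.procs <= idle_procs and job.walltime <= window


def backfill(queue: List[Job], idle_procs: int, window: int,
             policy: str = "FIRSTFIT") -> List[Job]:
    """Start jobs from a priority-ordered queue until nothing else fits."""
    started: List[Job] = []
    remaining = list(queue)
    while True:
        # Step 1: select the jobs that fit in the current backfill window.
        candidates = [j for j in remaining if fits(j, idle_procs, window)]
        if not candidates:
            return started
        if policy == "FIRSTFIT":
            chosen = candidates[0]          # highest-priority job that fits
        elif policy == "BESTFIT":
            # Assumed metric: procs * walltime (largest consumer wins).
            chosen = max(candidates, key=lambda j: j.procs * j.walltime)
        else:
            raise ValueError(f"unsupported policy: {policy}")
        started.append(chosen)
        remaining.remove(chosen)
        idle_procs -= chosen.procs


queue = [Job("a", 2, 30), Job("b", 8, 60), Job("c", 6, 60), Job("d", 16, 10)]
print([j.name for j in backfill(queue, idle_procs=10, window=60)])   # ['a', 'b']
print([j.name for j in backfill(queue, 10, 60, policy="BESTFIT")])   # ['b', 'a']
```

With the same queue and window, the two policies start the same jobs here but in a different order; with other queues they can start different sets of jobs.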
92Configuring Backfill
- Use BACKFILLPOLICY parameter
- FORMAT - use one of the following
- FIRSTFIT
- BESTFIT
- GREEDY
- OPTIMISTIC
- PREEMPT
- NONE
BACKFILLPOLICY BESTFIT
933.2 Uptime
- Importing Failures and Events
- Responses
- Triggers
- Reducing the Downtime Associated with Downtime
94Importing Failures and Events
- Standard Resource Manager Info
- Native Resource Manager Info
- Generic Events
- Generic Metrics
- Moab will detect and report failures with
resource managers as well as failures reported by
resource managers
- Moab will also locally generate events based on
reservations, SLA thresholds and other factors
95Event Responses
- Policy Changes
- Resource Allocation Priority
- Reservation Creation
- Email Notification
- Log Event to Internal and External Targets
- Resource State Changes
- Triggers
- http://clusterresources.com/moabdocs/9.2accounting.shtml#gevents
96Triggers
How can I install a set of nodes to the O.S.
requested by a job before it starts?
- Arbitrary Actions
- Object
- Event
- Action
- Additional Attributes
- Action Type
- Offset
- Threshold
How can I open up secure access for interactive
work when a user places a personal reservation on
a set of nodes?
- http://clusterresources.com/moabdocs/20.1triggers.shtml
97Triggers (contd)
- Advanced Trigger Usage
- Variable import/export
- Dependencies
- Multi-fire, user flags
- Periodic Triggers
- http://clusterresources.com/moabdocs/20.1triggers.shtml
98Trigger Usage
- Respond to failures or external events
- Dynamically change environment
- Provision resources
- Actuate external services
- Manage dynamic security
- Manage peer applications
- Modify policies
- Respond to health check info, notifications, etc.
- http://clusterresources.com/moabdocs/20.1triggers.shtml
99Reducing the Downtime Associated with Downtime
- Automated Failure Recovery
- System Reservations
- Interleaved Maintenance
- High Availability
100Automated Failure Recovery
# moab.cfg
NODECFG[DEFAULT] TRIGGER=AType=exec,EType=fail,Action="$HOME/triggers/send_node_down_email.pl $OID",MultiFire=TRUE,RearmTime=1:00
- When a node goes down, an email containing the
name of the node is sent to the administrator
- Can be expanded to execute any arbitrary script,
handing that script the name of the node, the
time of the failure and other useful information.
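As an illustration of what such a trigger script might do, the sketch below composes the notification email for a failed node. It is not the script referenced on the slide: the function name, the sender address, and the recipient are hypothetical, and the message is built but deliberately not sent.

```python
from datetime import datetime, timezone
from email.message import EmailMessage


def build_node_down_email(node: str, admin: str, when: datetime) -> EmailMessage:
    """Compose (but do not send) the notification for a failed node."""
    msg = EmailMessage()
    msg["To"] = admin
    msg["From"] = "moab@localhost"          # hypothetical sender address
    msg["Subject"] = f"Node down: {node}"
    msg.set_content(f"Node {node} was reported down at {when.isoformat()}.\n")
    return msg


# The trigger would hand the script the node name; here it is hard-coded.
message = build_node_down_email(
    "node01", "admin@example.com", datetime.now(timezone.utc)
)
print(message["Subject"])  # Node down: node01
```

A real script would pass the message to a mail transport (for example `smtplib`), and could include the failure time or other details supplied by the trigger.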
101Automating Recovery
- The MOABRECOVERYACTION environment variable can
be used to control scheduler action in the case
of a catastrophic internal failure.
Recovery Mode   Description
die             Moab will exit, and if core files are externally enabled, will create a core file for analysis (this is the default behavior).
ignore          Moab will ignore the signal and continue processing. This may cause Moab to continue running with corrupt data which may be dangerous. Use this setting with caution.
restart         When a SIGSEGV is received, Moab will relaunch using the current checkpoint file, the original launch environment, and the original command line flags. The receipt of the signal will be logged but Moab will continue scheduling. Because the scheduler is restarted with a new memory image, no corrupt scheduler data should exist. One caution with this mode is that it may mask underlying system failures by allowing Moab to overcome them. If used, the event log should be checked occasionally to determine if failures are being detected.
trap            When a SIGSEGV is received, Moab will stay alive but will enter diagnostic mode. In this mode, Moab will stop scheduling but will respond to client requests allowing analysis of the failure to occur using internal diagnostics available via the mdiag command.
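Because the recovery mode is read from the environment, a launcher script can validate the value before starting the scheduler. The helper below is a hypothetical sketch: `recovery_env` is not a Moab tool, and it only prepares an environment dictionary.

```python
import os

# The four modes listed in the table above.
RECOVERY_MODES = ("die", "ignore", "restart", "trap")


def recovery_env(mode: str, base_env=None) -> dict:
    """Return a copy of the environment with MOABRECOVERYACTION set."""
    if mode not in RECOVERY_MODES:
        raise ValueError(f"unknown recovery mode: {mode!r}")
    env = dict(os.environ if base_env is None else base_env)
    env["MOABRECOVERYACTION"] = mode
    return env


env = recovery_env("restart", base_env={"PATH": "/usr/bin"})
print(env["MOABRECOVERYACTION"])  # restart
```

The resulting dictionary could be passed as the `env` argument of `subprocess.run` when launching the scheduler daemon; the exact launch command depends on the site installation.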
102System Reservations
- Easy to reserve the entire cluster, or only sections
of it
- Can easily be scripted to roll out updates
across the entire cluster at specific times,
ensuring that no workload will be interrupted
> mrsvctl -c -t ALL
> mrsvctl -c -t ALL -s 1:00:00 -g staff
> mrsvctl -c -h node0[0-9][0-9] -d 24:00:00
> mrsvctl -c -h node[0-9][0-9][0-9] -T Action="/tmp/update.pl \$HOSTLIST",atype=exec,etype=start -s 23:50:00_6/15 -d 15:00
103Rolling/Interleaved Maintenance
[Figure: nodes vs. time for three maintenance approaches]
1. Node Draining Method Maintenance Window: after the
announcement, jobs can no longer start; cycles are
wasted draining the system before the actual
maintenance window(s)
2. Completion-time Scheduled Maintenance Window
3. Rolling Maintenance Windows
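One way to picture rolling maintenance is as a series of staggered windows, each covering a small group of nodes, so that most of the cluster keeps running jobs at any given moment. The planner below is a sketch under that assumption; the function name, group size, dates, and durations are illustrative and not taken from Moab.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def rolling_windows(nodes: List[str], group_size: int, start: datetime,
                    duration: timedelta) -> List[Tuple[datetime, datetime, List[str]]]:
    """Split nodes into groups and give each group its own back-to-back window."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    windows = []
    for i in range(0, len(nodes), group_size):
        begin = start + duration * (i // group_size)
        windows.append((begin, begin + duration, nodes[i:i + group_size]))
    return windows


nodes = [f"node{n:02d}" for n in range(6)]
for begin, end, group in rolling_windows(
        nodes, 2, datetime(2006, 6, 15, 23, 50), timedelta(minutes=15)):
    print(begin.strftime("%H:%M"), end.strftime("%H:%M"), ",".join(group))
```

Each tuple in the plan corresponds to one reservation that would be created over that group of nodes for that window.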
104High Availability
- High Availability allows Moab to run on two
different machines, a primary and secondary
server.
- While both are running, the secondary server, or
fallback server, will continually update its
internal statistics, reservations, and other
information to stay synchronized with the primary
server.
- Should the primary server stop running, the
secondary will pick up all responsibilities of
the primary server and begin to schedule jobs and
track internal data.
- When the primary server comes back online, the
secondary server will hand over its data and
resume functionality as the secondary server.
- http://clusterresources.com/moabdocs/22.2ha.shtml
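The failover sequence described above can be modeled as a small state machine. This is a toy illustration of the hand-over logic only; the `HAPair` class is hypothetical and says nothing about how Moab actually synchronizes its servers.

```python
class HAPair:
    """Toy model of a primary/fallback scheduler pair (not Moab's protocol)."""

    def __init__(self):
        self.primary_state = {}
        self.fallback_state = {}
        self.active = "primary"

    def record(self, key, value):
        """Record scheduler data on whichever server is currently active."""
        if self.active == "primary":
            self.primary_state[key] = value
            # The fallback continually mirrors the primary's data.
            self.fallback_state = dict(self.primary_state)
        else:
            self.fallback_state[key] = value

    def primary_fails(self):
        """The fallback takes over scheduling."""
        self.active = "fallback"

    def primary_returns(self):
        """The fallback hands its data back and resumes its secondary role."""
        self.primary_state = dict(self.fallback_state)
        self.active = "primary"


pair = HAPair()
pair.record("rsv.1", "staff")
pair.primary_fails()
pair.record("job.42", "running")
pair.primary_returns()
print(pair.active, sorted(pair.primary_state))  # primary ['job.42', 'rsv.1']
```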
105High Availability Example
# moab.cfg on master server (duplicate
# moab.cfg of the master or the same file using a
# shared file system)
SCHEDCFG[colony] SERVER=head1 FBSERVER=head2
# moab-private.cfg on head1 server
CLIENTCFG[colony] KEY=1dfv-fewv443v HOST=head2 AUTH=admin1
# moab-private.cfg on head2 server
CLIENTCFG[colony] KEY=1dfv-fewv443v HOST=head1 AUTH=admin1
- http://clusterresources.com/moabdocs/22.2ha.shtml
106Enabling High Availability Features
- Moab runs on two machines, primary and secondary
server
- The secondary server, or fallback server, will
continually update its internal statistics,
reservations, and other information to stay
synchronized with the primary server and take
over scheduling should the primary server fail
107Configuring High Availability
- moab.cfg
- SCHEDCFG[mycluster] SERVER=primaryhostname:3000
- SCHEDCFG[mycluster] FBSERVER=secondaryhostname
- Both the SERVER and FBSERVER are of the format
<HOST>[:<PORT>]. It is also necessary to ensure
a few configuration settings for correct
operation:
- each server must specify a shared key using the
CLIENTCFG parameter in the moab-private.cfg file.
- each server must be properly configured as an
administrator inside of the resource manager
using the CLIENTCFG AUTH parameter.
- each server must be able to communicate properly
with the resource manager. (See the torque/pbs
integration guide for a specific example.)
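A site script that checks these settings might first split the SERVER and FBSERVER values into host and port. The parser below is a minimal sketch that assumes a host name followed by an optional colon-separated port; `parse_server` is a hypothetical helper, not a Moab utility.

```python
from typing import Optional, Tuple


def parse_server(spec: str) -> Tuple[str, Optional[int]]:
    """Split a host[:port] value into host and optional port."""
    host, sep, port = spec.partition(":")
    if not host:
        raise ValueError(f"missing host in {spec!r}")
    if not sep:
        return host, None          # no port given
    return host, int(port)


print(parse_server("primaryhostname:3000"))  # ('primaryhostname', 3000)
print(parse_server("secondaryhostname"))     # ('secondaryhostname', None)
```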
108Confirming Configuration
- Run mdiag -R to confirm fallback Moab is able to
communicate with the primary Moab
node40/ mdiag -R
RM[rmnode30]  Type: PBS  State: Active  ResourceType: COMPUTE
  Version:          '1.2.0p6-snap.1122589577'
  Nodes Reported:   4
  Flags:            executionServer,noTaskOrdering,typeIsExplicit
  Partition:        rmnode30
  Event Management: EPORT=15004
  NOTE:  SSS protocol enabled
  Submit Command:   /usr/local/bin/qsub
  DefaultClass:     batch
  RM Performance:   AvgTime=0.01s  MaxTime=1.03s  (218 samples)
RM[internal]  Type: SSS  State: Active
  Version:          'SSS2.0'
  Flags:            executionServer,localQueue,typeIsExplicit
  RM Performance:   AvgTime=0.00s  MaxTime=0.00s  (125 samples)
NOTE:  use 'mrmctl -f -r ' to clear stats/failures
109Confirmation cont.
- Run mdiag -n to confirm fallback Moab is able to
communicate with the primary resource manager.
compute node summary
Name      State   Procs   Memory    Opsys
node31    Idle    1:1     27:27     Linux-2.6
node32    Idle    1:1     27:27     Linux-2.6
node33    Idle    1:1     27:27     Linux-2.6
node34    Idle    1:1     27:27     Linux-2.6
-----     ---     4:4     108:108   -----
Total Nodes: 4  (Active: 0  Idle: 4  Down: 0)
1114. End Users
- Moab Access Portal
- Moab Cluster Manager
- End User Commands
- End User Empowerment
112Moab Access Portal TM
- Submit Jobs from a web browser
- View and Modify only your own Workload
- Assist end-users to self-manage behaviours
http://clusterresources.com/map
113Moab Cluster Manager TM
- Administer Resources and Workload Policies
Through an Easy-to-Use Graphical User Interface
- Monitor, Diagnose and Report Resource Allocation
and Usage
- http://clusterresources.com/mcm
114End User Commands
Command      Flags  Description
canceljob           cancel existing job
checkjob            display job state, resource requirements, environment, constraints, credentials, history, allocated resources, and resource utilization
releaseres          release a user reservation
setres              create a user reservation
showbf              show resource availability for jobs with specific resource requirements
showq               display detailed prioritized list of active and idle jobs
showstart           show estimated start time of idle jobs
showstats           show detailed usage statistics for users, groups, and accounts which the end user has access to
115Assist Users in Better Utilizing Resources
- General info
- Job eval
- Completed job failure post-mortem
- Job start time estimates
- Job control
- Reservation control
116Assist Users in Better Utilizing Resources (contd)
- How do You Evaluate a Request?
- showstart (Earliest start, completion time,
etc.)
- showstats -f (General service level statistics)
- showstats -u (User Statistics)
- showbf (Immediately available resources)
1185. Political Sharing and Service Delivery
- Credentials
- Priority
- Usage Limits
- Fairshare
- Quality of Service
- Distributing Resources
- Partitions
- Reservations
- Allocation Management
119Credentials
- Certain job attributes (such as user, group,
account, class and qos) describe entities the job
belongs to and can be used to associate policies
with jobs.
- Every Job has credentials
- Users (The only mandatory credential)
- Groups (Standard Unix group or arbitrary
collection of users)
- Accounts (Associated with projects and billing)
- Class (Associated with RM queues)
- Quality Of Service (QoS) (Policy overrides,
resource access, service targets, charge rates)
- All Credentials can have Usage Limits, Fairshare
Targets, Priorities, Usage History, Credential
Access Lists / Defaults
- http://clusterresources.com/moabdocs/3.5credoverview.shtml
120Credential Membership
# moab.cfg
# user steve can access accounts a14, a7, a2, a6, and a1. If no account is
# explicitly requested, his job will be assigned to account a2
USERCFG[steve] ADEF=a2 ALIST=a14,a7,a2,a6,a1
# moab.cfg
# account omega3 can only be accessed by users johnh, stevek, jenp
ACCOUNTCFG[omega3] MEMBERULIST=johnh,stevek,jenp
# moab.cfg
# Controlling QoS Access on a Per Group Basis
GROUPCFG[staff] QLIST=standard,special QDEF=standard
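The default-plus-access-list behaviour described for user steve can be sketched as a lookup. The dictionary layout and the `resolve_account` function are hypothetical and only mirror the slide's example values; they are not how Moab stores its configuration.

```python
from typing import Optional

# Mirrors the slide's example for user steve: default account a2,
# access list a14, a7, a2, a6, a1.
USERCFG = {"steve": {"ADEF": "a2", "ALIST": ["a14", "a7", "a2", "a6", "a1"]}}


def resolve_account(user: str, requested: Optional[str] = None) -> str:
    """Pick the account a job is assigned to, or raise if access is denied."""
    cfg = USERCFG[user]
    if requested is None:
        return cfg["ADEF"]               # no explicit request: use the default
    if requested not in cfg["ALIST"]:
        raise PermissionError(f"{user} may not use account {requested}")
    return requested


print(resolve_account("steve"))         # a2
print(resolve_account("steve", "a7"))   # a7
```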
121Fairness
- Definition
- giving all users equal access to compute
resources
- incorporating historical resource usage,
political issues, and job value
- Moab provides a comprehensive and flexible set of
tools to address the many and varied fairness
management needs.
http://clusterresources.com/moabdocs/6.0managingfairness.shtml
122Performance Metrics
- Metrics of Responsiveness
- Queue Time
- How long a job has been waiting
- X Factor
- Duration-weighted time responsiveness factor
- Strongest single factor of perceived fairness
- Metrics of Utilization
- Throughput
- Jobs per unit time
- Utilization
- Percentage of cluster in use
http://clusterresources.com/moabdocs/5.1.2priorityfactors.shtml
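These metrics are simple ratios. The sketch below assumes the usual expansion-factor definition, (queue time + run time) / run time, for X Factor; the function names and sample numbers are illustrative.

```python
def xfactor(queue_time: float, run_time: float) -> float:
    """Expansion factor: (queue time + run time) / run time."""
    if run_time <= 0:
        raise ValueError("run_time must be positive")
    return (queue_time + run_time) / run_time


def throughput(jobs_completed: int, hours: float) -> float:
    """Jobs completed per hour."""
    return jobs_completed / hours


def utilization(busy_proc_hours: float, total_proc_hours: float) -> float:
    """Percentage of available processor-hours that were in use."""
    return 100.0 * busy_proc_hours / total_proc_hours


# The same one-hour wait hurts a ten-minute job far more than a ten-hour job.
print(xfactor(queue_time=3600, run_time=600))     # 7.0
print(xfactor(queue_time=3600, run_time=36000))   # 1.1
```

This is why X Factor is described as duration-weighted: identical queue times produce very different values depending on how long the job runs.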
123General Fairness Strategies
- Maximize Scheduler Options -- Do Not Overspecify
- Keep It Simple -- Do Not Address Hypothetical
Issues
- Seek To Adjust User Behaviour,
Not Limit User Options
- Allow Users to Specify Required Service Level
- Monitor Cluster Performance Regularly
- Tune Policies As Needed
124Priority
- 2-tier prioritization structure
- Independent component and
subcomponent weights/caps
- Components include service,
target, fairshare, resource, usage,
job attribute, and credential
- Negative priority jobs may be blocked
- Tuning facility available with mdiag -p
- http://clusterresources.com/moabdocs/5.1jobprioritization.shtml
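The 2-tier structure can be read as a weighted sum of weighted sums: each component's subcomponent values are weighted and added, and the result is scaled by the component weight. The sketch below assumes that form; the function name, the dictionary layout, the sample weights, and the choice to apply the cap to each component's weighted contribution are all assumptions for illustration.

```python
from typing import Dict, Optional


def job_priority(values: Dict[str, Dict[str, float]],
                 comp_weights: Dict[str, float],
                 sub_weights: Dict[str, Dict[str, float]],
                 comp_caps: Optional[Dict[str, float]] = None) -> float:
    """Sum over components of weight * (sum of subcomponent weight * value)."""
    comp_caps = comp_caps or {}
    total = 0.0
    for comp, subs in values.items():
        weights = sub_weights.get(comp, {})
        inner = sum(weights.get(name, 0.0) * v for name, v in subs.items())
        contribution = comp_weights.get(comp, 0.0) * inner
        cap = comp_caps.get(comp)
        if cap is not None:
            contribution = min(contribution, cap)   # assumed cap placement
        total += contribution
    return total


values = {"service": {"queuetime": 120, "xfactor": 3.0}, "cred": {"qos": 1000}}
comp_weights = {"service": 1, "cred": 1}
sub_weights = {"service": {"queuetime": 1, "xfactor": 10}, "cred": {"qos": 1}}

print(job_priority(values, comp_weights, sub_weights))                  # 1150.0
print(job_priority(values, comp_weights, sub_weights, {"cred": 500}))   # 650.0
```

Setting a component weight to zero removes that component from the total entirely, which is one way to keep a policy simple.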
125Job Prioritization Component Overview
- Service
- Level of service delivered or anticipated
- Includes queue time, xfactor, bypass, policy
violation
- Target
- Desired service level
- Provides exponential factor growth
- Includes target queue time, target xfactor
- Credential
- Based on credential priorities
- Includes user, group, account, QoS, and class
http://www.clusterresources.com/moabdocs/5.1.2priorityfactors.shtml#service
http://www.clusterresources.com/moabdocs/5.1.2priorityfactors.shtml#target
http://www.clusterresources.com/moabdocs/5.1.2priorityfactors.shtml#cred
126Job Prioritization Component Overview
- Fairshare
- Based on historical resource consumption
- Includes user, group, account, QoS, and class
fairshare
- Includes current usage metric of jobs per user,
procs per user, and ps per user
- May allow prioritization with cap fairshare
target
- Resource
- Based on requested resources
- Includes nodes