Title: Condor Tutorial
1. Condor Tutorial
- Prabhaker Mateti
- Wright State University
2. Acknowledgements
- Many of these slides are adapted from tutorials by Miron Livny and his associates at the University of Wisconsin-Madison: http://www.cs.wisc.edu/condor
3. Clusters with Part-Time Nodes
- Cycle Stealing: running jobs on workstations that do not belong to the job's owner
- Definition of Idleness: e.g., no keyboard and no mouse activity
- Tools/Libraries
- Condor
- PVM
- MPI
4. Performance v. Throughput
- High Performance: very large amounts of processing capacity over short time periods
- FLOPS: Floating Point Operations Per Second
- High Throughput: large amounts of processing capacity sustained over very long time periods
- FLOPY: Floating Point Operations Per Year
- FLOPY = 365 x 24 x 60 x 60 x FLOPS? (a year is about 31.5 million seconds)
5. Cooperation
- Workstations are personal
- Others' use slows you down
- Immediate-Eviction
- Pause-and-Migrate
- Willing to share
- Letting you cycle-steal
- Willing to trust
6. Granularity of Migration
- Process migration
- Process: a collection of objects, with at least one active object
- Object migration
- Passive objects
- Active objects
7. Migration of Jobs: Technical Issues
- Checkpointing: preserving the state of the process so it can be resumed
- Moving from one architecture to another
- Your environment: keyboard, mouse, display, files, ...
8. Condor
- A system for high-throughput computing that makes use of idle computing resources
- Lots of jobs over a long period of time, not a short burst of high performance
- Manages both machines and jobs
- Has been stable, and has delivered thousands of CPU hours
9. Condor Techniques
- Migratory programs
- Checkpointing
- Remote IO
- Resource matching
10. Condor Assumptions
- Large numbers of workstations are idle most of the time
- Owners of such machines would not mind their use by others while idle
- Owners want their own work to be given high priority
11. Roles
- Owner: offers his machine for use by others
- User: requests to run his jobs
- Administrator: manages the pool of available machines
- Multiple roles possible
12. Classified Advertisements Example
- MyType = "Machine"
- TargetType = "Job"
- Name = "froth.cs.wisc.edu"
- StartdIpAddr = "<128.105.73.44:33846>"
- Arch = "INTEL"
- OpSys = "SOLARIS26"
- VirtualMemory = 225312
- Disk = 35957
- KFlops = 21058
- Mips = 103
- LoadAvg = 0.011719
- KeyboardIdle = 12
- Cpus = 1
- Memory = 128
- Requirements = LoadAvg < 0.300000 && KeyboardIdle > 15 * 60
- Rank = 0
13. Condor User Requests
- Describes the program and its needs
- Example condor_submit file:
- Universe = standard
- Executable = /home/wsu03/condor/my_job.condor
- Input = my_job.stdin
- Output = my_job.stdout
- Error = my_job.stderr
- Log = my_job.log
- Arguments = -arg1 -arg2
- InitialDir = /home/wsu03/condor/run_1
- Queue
14. ClassAds Example for Jobs
- Requirements = Arch == "INTEL" && OpSys == "LINUX" && Memory >= 20
- Rank = (Memory > 32) * ( (Memory * 100) + (IsDedicated * 10000) + Mips )
15. Condor Pool of Machines
- A pool can be a single machine, or a group of machines volunteered by their owners
- Determined by a central manager: the matchmaker and centralized information repository
- Each machine runs various daemons to provide different services, either to the users who submit jobs, the machine owners, or the pool itself
16. Condor System Structure
17. Condor Agents
- Condor Resource Agent
- condor_startd daemon
- allows a machine to execute Condor jobs
- enforces owner policy
- Condor User Agent
- condor_schedd daemon
- allows a machine to submit jobs to a pool
18. Condor Robustness
- Checkpointing allows guaranteed forward progress of your jobs, even jobs that run for weeks before completion
- If an execute machine crashes, you only lose work done since the last checkpoint
- Condor maintains a persistent job queue: if the submit machine crashes, Condor will recover
19. What's Condor Good For?
- Managing a large number of jobs
- You specify the jobs in a file and submit them to Condor, which runs them all and can send you email when they complete
- Mechanisms to help you manage huge numbers of jobs (1000s), all the data, etc.
- Condor can handle inter-job dependencies (DAGMan)
20. Throughput
- Checkpointing allows your job to run on opportunistic resources, not just dedicated ones
- Checkpointing permits migration: if a machine is no longer available, migrate
- With remote system calls, you don't even need an account on the machine where your job executes
21. Can your program work with Condor?
- What kind of I/O does it do?
- Does it use TCP/IP? (network sockets)
- Can the job be resumed?
- Multiple processes?
- fork(), pvm_addhost(), etc.
22. Typical IO
- Interactive TTY
- Batch TTY (just reads from STDIN and writes to STDOUT or STDERR, but you can redirect to/from files)
- X Windows
- NFS, AFS, or another network file system
- Local file system
- TCP/IP
23. Condor Universes
- Different universes support different functionality
- Vanilla
- Standard
- Scheduler
- PVM
24. Condor Universes: IO Support
- No support for interactive TTY
             X11   NFS   Local Files   TCP
  Vanilla     x     x        -          x
  Standard    -     x        x          -
  Scheduler   x     x        x          x
  PVM         x     x        x          x
25. Condor Universes
- PVM (Parallel Virtual Machine)
- Multiple processes in Condor
- Scheduler
- The job is run on the submit machine, not on a remote execute machine
- The job is automatically restarted if the condor_schedd is shut down
- Used to schedule jobs
26. Submitting Jobs to Condor
- Choosing a Universe for your job
- Preparing your job
- Making it batch-ready
- Re-linking if checkpointing and remote system calls are desired (condor_compile)
- Creating a submit description file
- condor_submit your request to the User Agent (condor_schedd)
27. Making your job batch-ready
- Must be able to run in the background: no interactive input, windows, GUI, etc.
- Can still use STDIN, STDOUT, and STDERR, but files are used for these instead of the actual devices
- If your job expects input from the keyboard, you have to put the input you want into a file
28. Preparing Your Job (cont'd)
- If you are going to use the standard universe with checkpointing and remote system calls, you must re-link your job with Condor's libraries:
- condor_compile gcc -o myjob myjob.c
29. Submit Description File
- Tells Condor about your job
- Which executable, universe, input, output and error files to use; command-line arguments; environment variables; any special requirements or preferences (more on this later)
- Can describe many jobs at once (a cluster), each with different input, arguments, output, etc.
30. Example condor_submit File
Universe    = standard
Executable  = /home/wsu03/condor/my_job.condor
Input       = my_job.stdin
Output      = my_job.stdout
Error       = my_job.stderr
Log         = my_job.log
Arguments   = -arg1 -arg2
InitialDir  = /home/wsu03/condor/run_1
Queue
31. Example Submit Description File
- Submits a single job to the standard universe; specifies files for STDIN, STDOUT and STDERR; creates a UserLog; defines command-line arguments; and specifies the directory the job should be run in
- As if you did:
cd /home/wsu03/condor/run_1
/home/wsu03/condor/my_job.condor -arg1 -arg2 \
  > my_job.stdout 2> my_job.stderr \
  < my_job.stdin
32. Clusters and Processes
- A submit file describes one or more jobs
- The collection of jobs is called a cluster
- Each job is called a process or proc
- A Condor Job ID is the cluster number, a period, and the proc number (e.g., 23.5)
- Proc numbers always start at 0
33. A Cluster Submit Description File
- Universe = standard
- Executable = /home/wsu03/condor/my_job.condor
- Input = my_job.stdin
- Output = my_job.stdout
- Error = my_job.stderr
- Log = my_job.log
- Arguments = -arg1 -arg2
- InitialDir = /home/wsu03/condor/run_$(Process)
- Queue 500
34. A Cluster Submit Description File
- Queue 500 submits 500 jobs at once
- The initial directory for each job is specified with the $(Process) macro (see the sketch below)
- $(Process) will be expanded to the process number of each job in the cluster
- run_0, run_1, ... run_499 directories
- All the input/output files will be in different directories
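The same macro can also be used in file names, so each process gets its own input and output files instead of (or as well as) its own directory; a minimal sketch, with illustrative file names:
  Input  = my_job.stdin.$(Process)
  Output = my_job.stdout.$(Process)
  Error  = my_job.stderr.$(Process)
  Queue 500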
35. condor_submit
- condor_submit the-submit-file-name
- condor_submit parses the file and creates a ClassAd that describes your job(s)
- Creates the files you specified for STDOUT and STDERR
- Sends your job's ClassAd(s) and executable to the condor_schedd, which stores the job in its queue
36. Monitoring Your Jobs
- Using condor_q
- Using a User Log file
- Using condor_status
- Using condor_rm
- Getting email from Condor
- Using condor_history after completion
37. Using condor_q
- Displays the status of your jobs, how much compute time they have accumulated, etc.
- Many different options (examples below)
- A single job, a single cluster, all jobs that match a certain constraint, or all jobs
- Can view remote job queues, either individual queues or, with -global, all of them
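A few typical invocations, as a sketch (the job IDs and the user name are illustrative; check the manual for the exact options in your Condor version):
  condor_q                                  # all of your jobs on this submit machine
  condor_q 23.5                             # a single job: cluster 23, proc 5
  condor_q 23                               # every job in cluster 23
  condor_q -constraint 'Owner == "wsu03"'   # jobs matching a constraint
  condor_q -global                          # the queues of all submit machines in the pool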
38. Using a User Log file
- Specify in your submit file:
- Log = filename
- Entries are logged for:
- when the job was submitted
- when it started executing
- if it is checkpointed or vacated
- if there are any problems, etc.
39. Using condor_status
- Use the -run option to see:
- Machines running jobs
- The user who submitted each job
- The machine they submitted from
- Can also view the status of various submitters with -submitter <name> (examples below)
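For example (the submitter name is hypothetical):
  condor_status                      # one line per machine in the pool
  condor_status -run                 # only machines currently running Condor jobs
  condor_status -submitter wsu03     # resources in use by one submitter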
40. Using condor_rm
- Removes a job from the Condor queue
- You can only remove jobs that you own
- Root can condor_rm someone else's jobs
- You can give specific job IDs (cluster or cluster.proc), or you can remove all of your jobs with the -a option (examples below)
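For example, using the forms just described:
  condor_rm 23.5     # remove only proc 5 of cluster 23
  condor_rm 23       # remove every job in cluster 23
  condor_rm -a       # remove all of your jobs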
41. Getting Email from Condor
- By default, Condor will send you email when your job completes
- If you don't want this email, put this in your submit file:
- notification = never
- If you want email every time something happens to your job (checkpoint, exit, etc.), use this:
- notification = always
42. Getting Email from Condor
- If you only want email if your job exits with an error, use this:
- notification = error
- By default, the email is sent to your account on the host you submitted from. If you want the email to go to a different address, use this:
- notify_user = email@address.here
43. Using condor_history
- Once your job completes, it will no longer show up in condor_q
- Now you must use condor_history to view the job's ClassAd (examples below)
- The status field (ST) will have either a "C" for completed, or an "X" if the job was removed with condor_rm
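For example (a sketch; the -l option, which prints the full ClassAd, is assumed here):
  condor_history            # one line per completed job
  condor_history 23.5       # the history entry for one specific job
  condor_history -l 23.5    # that job's complete ClassAd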
44. Classified Advertisements
- A ClassAd is a set of named expressions
- Each named expression is an attribute
- Expressions are similar to those in C
- Constants, attribute references, operators
45. Classified Advertisements Example
- MyType = "Machine"
- TargetType = "Job"
- Name = "froth.cs.wisc.edu"
- StartdIpAddr = "<128.105.73.44:33846>"
- Arch = "INTEL"
- OpSys = "SOLARIS26"
- VirtualMemory = 225312
- Disk = 35957
- KFlops = 21058
- Mips = 103
- LoadAvg = 0.011719
- KeyboardIdle = 12
- Cpus = 1
- Memory = 128
- Requirements = LoadAvg < 0.300000 && KeyboardIdle > 15 * 60
- Rank = 0
46. ClassAd Matching
- ClassAds are always considered in pairs
- Does ClassAd A match ClassAd B (and vice versa)?
- This is called 2-way matching
- If the same attribute appears in both ClassAds,
you can specify which attribute you mean by
putting MY. or TARGET. in front of the
attribute name
47. ClassAd Matching Example
- ClassAd B
- MyType = "ApartmentRenter"
- TargetType = "Apartment"
- UnderGrad = False
- RentOffer = 900
- Rank = 1/(TARGET.RentOffer + 100.0) + 50*HeatIncluded
- Requirements = OnBusLine && SquareArea > 2700
- ClassAd A
- MyType = "Apartment"
- TargetType = "ApartmentRenter"
- SquareArea = 3500
- RentOffer = 1000
- OnBusLine = True
- Rank = (UnderGrad == False) * TARGET.RentOffer
- Requirements = MY.RentOffer - TARGET.RentOffer < 150
48. ClassAds in the Condor System
- ClassAds allow Condor to be a general system
- Constraints and ranks on matches are expressed by the entities themselves
- Only the priority logic is integrated into the Matchmaker
- All principal entities in the Condor system are represented by ClassAds
- Machines, Jobs, Submitters
49. ClassAds Example for Machines
- Friend = Owner == "tannenba" || Owner == "wright"
- ResearchGroup = Owner == "jbasney" || Owner == "raman"
- Trusted = Owner != "rival" && Owner != "riffraff"
- Requirements = Trusted && ( ResearchGroup || (LoadAvg < 0.3 && KeyboardIdle > 15*60) )
- Rank = Friend + ResearchGroup*10
50. ClassAd Machine Example
- The machine will never start a job submitted by "rival" or "riffraff"
- If someone from ResearchGroup ("jbasney" or "raman") submits a job, it will always run
- If anyone else submits a job, it will only run here if the keyboard has been idle for more than 15 minutes and the load average is less than 0.3
51. Machine Rank Example Described
- If the machine is running a job submitted by owner "foo", it will give this a Rank of 0, since "foo" is neither a friend nor in the same research group
- If "wright" or "tannenba" submits a job, it will be ranked at 1 (since Friend will evaluate to 1 and ResearchGroup to 0)
- If "raman" or "jbasney" submits a job, it will have a rank of 10
- While a machine is running a job, it will be preempted for a higher-ranked job
52. ClassAds Example for Jobs
- Requirements = Arch == "INTEL" && OpSys == "LINUX" && Memory >= 20
- Rank = (Memory > 32) * ( (Memory * 100) + (IsDedicated * 10000) + Mips )
53. Job Example Described
- The job must run on an Intel CPU, running Linux, with at least 20 megs of RAM
- All machines with 32 megs of RAM or less are ranked at 0
- Machines with more than 32 megs of RAM are ranked according to how much RAM they have, whether the machine is dedicated (which counts a lot to this job!), and how fast the machine is, as measured in MIPS
54. ClassAd Attributes in your Pool
- Condor defines a number of attributes by default, which are listed in the User Manual (About Requirements and Rank)
- To see if machines in your pool have other attributes defined, use:
- condor_status -long <hostname>
- A custom-defined attribute might not be defined on all machines in your pool, so you'll probably want to use meta-operators
55. ClassAd Meta-Operators
- Meta-operators allow you to compare against UNDEFINED as if it were a real value
- =?= is meta-equal-to
- =!= is meta-not-equal-to
- (Color != "Red") (non-meta) would evaluate to UNDEFINED if Color is not defined
- (Color =!= "Red") would evaluate to True if Color is not defined, since UNDEFINED is not "Red" (example below)
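For instance, a job that insists a custom attribute is both defined and true might write something like this (HasMyDataSet is a hypothetical attribute name):
  Requirements = (HasMyDataSet =?= True) && (Arch == "INTEL")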
56. Priorities In Condor
- User Priorities
- Priorities between users in the pool, to ensure fairness
- The lower the value, the better the priority
- Job Priorities
- Priorities that users give to their own jobs, to determine the order in which they will run
- The higher the value, the better the priority
- Only matters within a given user's jobs
57. User Priorities in Condor
- Each active user in the pool has a user priority
- Viewed or changed with condor_userprio
- The lower the number, the better
- A given user's share of the available machines is inversely related to the ratio between user priorities
- Example: Fred's priority is 10, Joe's is 20. Fred will be allocated twice as many machines as Joe (worked out below).
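Worked out, treating each share as proportional to 1/priority (a simplification of Condor's actual fair-share algorithm):
  Fred: (1/10) / (1/10 + 1/20) = 2/3 of the available machines
  Joe:  (1/20) / (1/10 + 1/20) = 1/3 of the available machines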
58. User Priorities in Condor, cont.
- Condor continuously adjusts user priorities over time
- If you are allocated more machines than your priority warrants, your priority worsens
- If you are allocated fewer machines than your priority warrants, your priority improves
- Priority Preemption
- Higher-priority users will grab machines away from lower-priority users (thanks to checkpointing)
- Starvation is prevented
- Priority thrashing is prevented
59. Job Priorities in Condor
- Can be set at submit time in your description file with:
- priority = <number>
- Can be viewed with condor_q
- Can be changed at any time with condor_prio (examples below)
- The higher the number, the more likely the job will run (only among the jobs of an individual user)
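A sketch of condor_prio usage (the job ID is the one from the earlier example; check your version's manual for the exact option syntax):
  condor_prio -p 15 23.5    # set job 23.5 to priority 15
  condor_prio +5 23.5       # raise its priority by 5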
60. Managing a Large Cluster of Jobs
- Condor can manage huge numbers of jobs
- Special features of the submit description file make this easier
- Condor can also manage inter-job dependencies with condor_dagman
- For example: job A should run first; then run jobs B and C; when those finish, submit D; etc.
- We'll discuss DAGMan later
61. Submitting a Large Cluster
- Each process runs in its own directory
- InitialDir = dir.$(Process)
- Can either have multiple Queue entries, or put a number after Queue to tell Condor how many to submit (see the sketch below)
- Queue 1000
- A cluster is more efficient: your jobs will run faster, and they'll use less space
- Can only have one executable per cluster - different executables must be in different clusters!
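A sketch of the multiple-Queue-entries style, where each Queue statement reuses whatever settings are in effect above it (the executable name and -seed arguments are illustrative):
  Executable = my_job
  Arguments  = -seed 1
  Queue
  Arguments  = -seed 2
  Queue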
62. Inter-Job Dependencies with DAGMan
- DAGMan handles a set of jobs that must be run in a certain order
- Also provides PRE and POST operations, so you can have a program or script run before each job is submitted and after it completes
- Robust: handles errors and submit-machine crashes
63. Using DAGMan
- You define a DAG description file, which is similar in function to the submit file you give to condor_submit
- DAGMan restrictions:
- Each job in the DAG must be in its own cluster (for now)
- All jobs in the DAG must have a User Log, and they must all share the same file
64. DAGMan Description File
- # starts a comment
- The first section names the jobs in your DAG and associates a submit description file with each job
- The second (optional) section defines PRE and POST scripts to run
- The final section defines the job dependencies
65. Example DAGMan File
Job A A.submit
Job B B.submit
Job C C.submit
Job D D.submit
Script PRE  D d_input_checker
Script POST A a_output_processor A.out
PARENT A CHILD B C
PARENT B C CHILD D
66. Setting up a DAG for Condor
- Create all the submit description files for the individual jobs
- Prepare any executables you plan to use
- Can have a mix of Vanilla and Standard jobs
- Set up any PRE/POST commands or scripts you wish to use
67. Submitting a DAG to Condor
- condor_submit_dag DAG-description-file
- This will check your input file for errors and
submit a copy of condor_dagman as a scheduler
universe job with all the necessary command-line
arguments
68. Removing a DAG
- On shutdown, DAGMan will remove any jobs currently in the queue that are associated with its DAG
- Once all the jobs are gone, DAGMan itself will exit, and the scheduler universe job will be removed from the queue
69. Typical Problems
- Special Requirements expressions for vanilla jobs
- You didn't submit it from a directory that is shared
- Condor isn't running as root
- You don't have your file permissions set up correctly
70. Special Requirements Expressions for Vanilla Jobs
- When you submit a vanilla job, Condor automatically appends two extra Requirements:
- UID_DOMAIN == <submit_uid_domain>
- FILESYSTEM_DOMAIN == <submit_fs>
- Since there are no remote system calls with vanilla jobs, they depend on a shared file system and a common UID space to run as you and to access your files
71. Special Requirements Expressions for Vanilla Jobs
- By default, each machine in your pool is in its own UID_DOMAIN and FILESYSTEM_DOMAIN, so your pool administrator has to configure the pool specially if there really is a common UID space and a network file system (see the sketch below)
- If you don't have an account on the remote system, vanilla jobs won't work
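A sketch of the pool-wide settings such an administrator would put in the Condor configuration (the domain name is hypothetical):
  UID_DOMAIN        = cs.wright.edu
  FILESYSTEM_DOMAIN = cs.wright.edu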
72. Shared Files for Vanilla Jobs
- Maybe not all directories are shared: Initialdir = /tmp will probably cause trouble for vanilla jobs!
- You must be sure to set Initialdir to a shared directory (or cd into one before running condor_submit) for vanilla jobs
73. Why Don't My Jobs Run?
- Try condor_q -analyze
- Try specifying a User Log for your job
- Look at condor_userprio: maybe you have a low priority and higher-priority users are being served
- Problems with file permissions or network file systems
- Look at the SchedLog
74. Using condor_q -analyze
- Analyzes your job's ClassAd, gets all the ClassAds of the machines in the pool, and tells you what's going on
- Will report errors in your Requirements expression (impossible to match, etc.)
- Will tell you about user priorities in the pool (other people have better priority)
75. Looking at condor_userprio
- You can run condor_userprio yourself
- If your priority value is a really high number (because you've been running a lot of Condor jobs), other users will have priority to run jobs in your pool
76. File Permissions in Condor
- If Condor isn't running as root, the condor_shadow process runs as the user the condor_schedd is running as (usually "condor")
- You must grant this user write access to your output files, and read access to your input files (both STDOUT and STDIN from your submit file, as well as files your job explicitly opens)
77. File Permissions in Condor
- Often there will be a "condor" group, and you can make your files owned and writable by this group
- For vanilla jobs, even if the UID_DOMAIN setting is correct and matches for your submit and execute machines, if Condor isn't running as root, your job will be started as user "condor", not as you!
78. Problems with NFS in Condor
- For NFS, sometimes the administrators will set up read-only mounts, or have UIDs remapped for certain partitions (the classic example is root mapped to nobody, but modern NFS can do arbitrary remappings)
79. Problems with NFS in Condor
- If your pool uses NFS automounting, the directory that Condor thinks is your InitialDir might not exist on a remote machine
- With automounting, you always need to specify InitialDir explicitly
- InitialDir = /home/me/...
80. Problems with AFS in Condor
- If your pool uses AFS, the condor_shadow, even if it is running with your UID, will not have your AFS token
- You must grant an unauthenticated AFS user the appropriate access to your files
- Some sites provide a better alternative than world-writable files:
- Host ACLs
- Network-specific ACLs
81. Looking at the SchedLog
- Looking at the log file of the condor_schedd, the SchedLog file, can give you a clue if there are problems
- Find it with:
- condor_config_val schedd_log
- You might need your pool administrator to turn on a higher debugging level to see more verbose output
82. Other User Features
- Submit-Only installation
- Heterogeneous Submit
- PVM jobs
83. Submit-Only Installation
- Can install just a condor_master and condor_schedd on your machine
- Can submit jobs into a remote pool
- Special option to condor_install
84. Heterogeneous Submit
- The job you submit doesn't have to be the same platform as the machine you submit from
- Maybe you have access to a pool that is full of Alphas, but you have a Sparc on your desk, and moving all your data is a pain
- You can take an Alpha binary, copy it to your Sparc, and submit it with a Requirements expression that says you need to run on ALPHA/OSF1 (see the sketch below)
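A sketch of such a submit file (the executable name is hypothetical; the Arch/OpSys values are the ones named above):
  Executable   = my_alpha_binary
  Requirements = Arch == "ALPHA" && OpSys == "OSF1"
  Queue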
85. PVM Jobs in Condor
- Condor can run parallel applications
- PVM applications now
- Future work includes support for MPI
- Master-Worker Paradigm
- What does Condor-PVM do?
- How to compile and submit Condor-PVM jobs
86. Master-Worker Paradigm
- Condor-PVM is designed to run PVM applications based on the master-worker paradigm
- Master
- has a pool of work; sends pieces of work to the workers; manages the work and the workers
- Worker
- gets a piece of work, does the computation, sends the result back
87. What does Condor-PVM do?
- Condor acts as the PVM resource manager
- All pvm_addhost requests get re-mapped to Condor
- Condor dynamically constructs PVM virtual machines out of non-dedicated desktop machines
- When a machine leaves the pool, the user gets notified via the normal PVM notification mechanisms
88. Submission of Condor-PVM jobs
- Binary compatible
- Compile and link with the PVM library just as for normal PVM applications. No need to link with Condor.
- In the submit description file, set (full sketch below):
- universe = PVM
- machine_count = <min>..<max>
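Putting those two settings into a minimal submit file (the executable name and the machine counts are illustrative):
  universe      = PVM
  executable    = my_master.pvm
  machine_count = 1..10
  Queue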
89. Resource Agent Configuration Expressions
90. Resource Agent Configuration
- Default Setup (see the sketch below)
- WANT_VACATE = True
- WANT_SUSPEND = True
- START = Keyboard_Idle && CPU_Idle
- SUSPEND = Keyboard_Busy || CPU_Busy
- CONTINUE = keyboard and CPU idle again
- VACATE = if suspended > 10 minutes
- KILL = if spent > 10 minutes in the VACATE state
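A sketch of how such a policy might look in the condor_config file. The thresholds and the helper macros (KeyboardBusy, CPUIdle, CPUBusy, ActivityTimer) are illustrative rather than the shipped defaults; KeyboardIdle, LoadAvg, CondorLoadAvg, CurrentTime, and EnteredCurrentActivity are standard startd attributes assumed to be available:
  MINUTE        = 60
  KeyboardBusy  = (KeyboardIdle < $(MINUTE))
  CPUIdle       = ((LoadAvg - CondorLoadAvg) <= 0.3)
  CPUBusy       = ((LoadAvg - CondorLoadAvg) >= 0.5)
  ActivityTimer = (CurrentTime - EnteredCurrentActivity)

  WANT_SUSPEND  = True
  WANT_VACATE   = True
  START         = $(CPUIdle) && KeyboardIdle > 15 * $(MINUTE)
  SUSPEND       = $(CPUBusy) || $(KeyboardBusy)
  CONTINUE      = $(CPUIdle) && KeyboardIdle > 5 * $(MINUTE)
  VACATE        = $(ActivityTimer) > 10 * $(MINUTE)
  KILL          = $(ActivityTimer) > 10 * $(MINUTE)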
91. condor_master
- Watches/restarts other daemons
- Sends Email if suspicious problems arise
- Runs condor_preen
- Provides administrator remote control
92. Condor Administrator Commands
- condor_off hostname
- condor_on
- condor_restart
- condor_reconfig
- condor_vacate
- Can be used by the Owner also
93. Host-based Access Control
- HOSTALLOW_* and HOSTDENY_* settings grant machines (subnets, domains) different access levels
- READ access
- WRITE access
- ADMINISTRATOR access
- OWNER access
94. Host-based Access Control Ex.
- HOSTDENY_READ = *.com
- HOSTALLOW_WRITE = *.cs.wright.edu
- HOSTDENY_WRITE = ppp.wright.edu, 172.44.*
- HOSTALLOW_ADMINISTRATOR = osis111.cs.wright.edu
- HOSTALLOW_OWNER = $(FULL_HOSTNAME), $(HOSTALLOW_ADMINISTRATOR)
95. Configuration File Hierarchy
- condor_config
- Pool-wide default
- The Condor pool administrator's requirements
- condor_config.local
- Overrides for a specific machine
- Reflects the Owner's requirements
- condor_config.root
- System Administrator requirements
96. Obtaining Condor
- Condor accounts available! E-mail: miron@cs.wisc.edu
- Condor executables can be downloaded from http://www.cs.wisc.edu/condor
- Complete User's and Administrator's manual: http://www.cs.wisc.edu/condor/manual