Using Condor: An Introduction (Condor Week 2006 presentation)

1
Using Condor: An Introduction
Condor Week 2006
2
Tutorial Outline

The story of Frieda, the scientist
Using Condor to manage jobs
Using Condor to manage resources
Condor architecture and mechanisms
Condor on the grid
Flocking
Condor and other grid technologies
Stop me if you have any questions!

3
Meet Frieda.
4
Frieda's Application

Run a Parameter Sweep of F(x,y,z) for 20 values
of x, 10 values of y and 3 values of z
20 × 10 × 3 = 600 combinations
F takes on average 6 hours to compute on a
typical workstation (total = 600 × 6 = 3600
hours)
F requires a moderate (256 MB) amount of memory
F performs moderate I/O: (x,y,z) is 5 MB and
F(x,y,z) is 50 MB
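The arithmetic on this slide can be checked with a short sketch. Only the counts (20, 10, 3) and the 6-hour average come from the slide; the value ranges and variable names below are placeholders chosen for illustration.

```python
from itertools import product

# Placeholder parameter values: only the counts (20, 10, 3) come from the slide.
xs = range(20)
ys = range(10)
zs = range(3)

HOURS_PER_RUN = 6  # average time for one F(x, y, z) on a typical workstation

# Every (x, y, z) triple is one independent run of F.
combinations = list(product(xs, ys, zs))
total_hours = len(combinations) * HOURS_PER_RUN

print(len(combinations), "combinations,", total_hours, "CPU hours")
```

Run one after another on a single workstation, that is 150 days of computing, which is why Frieda goes looking for help.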

5
I have 600 simulations to run. Where can I get
help?
6
As if by magic, a genie appears from a lamp, and
says, "Install a Personal Condor!"
7
Getting Condor

Available as a free download from
http://www.cs.wisc.edu/condor
Download Condor for your operating system
Available for most UNIX (including Linux and
Apple's OS X) platforms
Also for Windows NT / XP

8
Condor Releases

Stable / Developer Releases
Version numbering scheme similar to that of the
(pre-2.6) Linux kernels
Major.minor.release
Minor is even (a.b.c): Stable
Examples: 6.4.3, 6.6.8, 6.6.9
Very stable, mostly bug fixes
Minor is odd (a.b.c): Developer
New features, may have some bugs
Examples: 6.5.5, 6.7.18, 6.7.19
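The even/odd rule above can be expressed in a few lines. This is a minimal sketch of the numbering scheme as described on this slide; the function name release_series is hypothetical and is not part of Condor.

```python
def release_series(version: str) -> str:
    """Classify a Major.minor.release string by the parity of its minor number."""
    major, minor, release = (int(part) for part in version.split("."))
    # Even minor number: stable series. Odd minor number: developer series.
    return "stable" if minor % 2 == 0 else "developer"


for v in ("6.6.9", "6.7.19"):
    print(v, "is a", release_series(v), "release")
```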

9
Frieda Installs a Personal Condor on her
machine

What do we mean by a Personal Condor?
Condor on your own workstation
No root / administrator access required
No system administrator intervention needed
After installation, Frieda submits her jobs to
her Personal Condor

10
Frieda's Condor Pool
11
Personal Condor?! What's the benefit of a Condor
Pool with just one user and one machine?
12
Your Personal Condor will ...

Keep an eye on your jobs and keep you posted
on their progress
Implement your policy on the execution order of
the jobs
Keep a log of your job activities
Add fault tolerance to your jobs
Implement your policy on when the jobs can run on
your workstation

13
Getting Started: Submitting Jobs to Condor

Overview
Choose a Universe for your job
Make your job batch-ready
Create a submit description file
Run condor_submit to put your job in the queue

14
1. Choose the Universe

Controls how Condor handles jobs
Choices include
Vanilla
Standard
Grid
Java
Parallel

15
Using the Vanilla Universe

The Vanilla Universe
Allows running almost any serial job
Provides automatic file transfer, etc.
Like vanilla ice cream
Can be used in just about any situation

16
2. Make your job batch-ready

Must be able to run in the background: no
interactive input, windows, GUI, etc.
Can still use STDIN, STDOUT, and STDERR (the
keyboard and the screen), but files are used for
these instead of the actual devices
Similar to UNIX shell redirection:
./myprogram <input.txt >output.txt

17
3. Create a Submit Description File

A plain ASCII text file
Condor does not care about file extensions
Tells Condor about your job
Which executable, universe, input, output and
error files to use, command-line arguments,
environment variables, any special requirements
or preferences (more on this later)
Can describe many jobs at once (a cluster),
each with different input, arguments, output, etc.

18
Simple Submit Description File

# Simple condor_submit input file
# (Lines beginning with # are comments)
# NOTE: the words on the left side are not
#       case sensitive, but filenames are!
Universe   = vanilla
Executable = my_job
Output     = output.txt
Queue

19
4. Run condor_submit

You give condor_submit the name of the submit
file you have created:
condor_submit my_job.submit
condor_submit parses the submit file, checks it
for errors, and creates a ClassAd that describes
your job(s)
ClassAds: Condor's internal data representation
Similar to classified ads (as the name implies)
Represent an object and its attributes
Can also describe what an object matches with

20
The Job Queue

condor_submit sends your job's ClassAd(s) to the
schedd
The schedd (more details later)
Manages the local job queue
Stores the job in the job queue
Atomic operation, two-phase commit
Like money in the bank
View the queue with condor_q

21
Example: condor_submit and condor_q

condor_submit my_job.submit
Submitting job(s).
1 job(s) submitted to cluster 1.
condor_q
-- Submitter: perdita.cs.wisc.edu : <128.105.165.34:1027>
 ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
 1.0     frieda   6/16 06:52  0+00:00:00 I  0   0.0  my_job
1 jobs; 1 idle, 0 running, 0 held

22
Input, output & error files

Controlled by submit file settings
You can define the job's standard input, standard
output and standard error
Read job's standard input from input_file
Input = input_file
Shell equivalent: program <input_file
Write job's standard output to output_file
Output = output_file
Shell equivalent: program >output_file
Write job's standard error to error_file
Error = error_file
Shell equivalent: program 2>error_file

23
Feedback on your job

Condor sends you email about events
Turn it off: Notification = Never
Only on errors: Notification = Error
Condor creates a log file (user log)
"The Life Story of a Job"
Shows all events in the life of a job
Always have a log file
To turn it on: Log = filename

24
Sample Condor User Log
000 (0001.000.000) 05/25 19:10:03 Job submitted from host: <128.105.146.14:1816>
...
001 (0001.000.000) 05/25 19:12:17 Job executing on host: <128.105.146.14:1026>
...
005 (0001.000.000) 05/25 19:13:06 Job terminated.
    (1) Normal termination (return value 0)
...
25
Example Submit Description File With Logging
# Example condor_submit input file
# (Lines beginning with # are comments)
# NOTE: the words on the left side are not
#       case sensitive, but filenames are!
Universe   = vanilla
Executable = /home/frieda/condor/my_job.condor
# Job log (from Condor)
Log        = my_job.log
# Program's standard input
Input      = my_job.in
# Program's standard output
Output     = my_job.out
# Program's standard error
Error      = my_job.err
Arguments  = -a1 -a2
InitialDir = /home/frieda/condor/run
Queue
26
Clusters and Processes

If your submit file describes multiple jobs, we
call this a cluster
Each cluster has a unique cluster number
Each job in a cluster is called a process
Process numbers always start at zero
A Condor Job ID is the cluster number, a
period, and the process number (e.g. 2.1)
A cluster can have a single process
Job ID = 20.0 (cluster 20, process 0)
Or, a cluster can have more than one process
Job IDs = 21.0, 21.1, 21.2 (cluster 21, processes 0,
1, 2)
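The Job ID format above is easy to take apart in a script, for example when post-processing queue listings. The sketch below is illustrative only; JobId and parse_job_id are hypothetical names, not Condor tools.

```python
from typing import NamedTuple


class JobId(NamedTuple):
    cluster: int
    process: int


def parse_job_id(text: str) -> JobId:
    """Split a 'cluster.process' job ID such as '21.2' into its two numbers."""
    cluster, process = text.split(".")
    return JobId(int(cluster), int(process))


print(parse_job_id("21.2"))
```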

27
Submit File for a Cluster
# Example submit file for a cluster of 2 jobs
# with separate input, output, error and log files
Universe   = vanilla
Executable = my_job
# Job 2.0 (cluster 2, process 0)
Arguments  = -x 0
Log        = my_job_0.log
Input      = my_job_0.in
Output     = my_job_0.out
Error      = my_job_0.err
Queue
# Job 2.1 (cluster 2, process 1)
Arguments  = -x 1
Log        = my_job_1.log
Input      = my_job_1.in
Output     = my_job_1.out
Error      = my_job_1.err
Queue
28
Submitting The Job
condor_submit my_job.submit-file
Submitting job(s).
2 job(s) submitted to cluster 2.
condor_q
-- Submitter: perdita.cs.wisc.edu : <128.105.165.34:1027>
 ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
 1.0     frieda   4/15 06:52  0+00:02:11 R  0   0.0  my_job -a1 -a2
 2.0     frieda   4/15 06:56  0+00:00:00 I  0   0.0  my_job -x 0
 2.1     frieda   4/15 06:56  0+00:00:00 I  0   0.0  my_job -x 1
3 jobs; 2 idle, 1 running, 0 held
29
Back to our 600 jobs

We could put all input, output, error & log files
in the one directory
One of each type for each job
That'd be 2400 files (4 files × 600 jobs)
Difficult to sort through
Better: create a subdirectory for each run

30
Organize your files and directories for big runs

Create subdirectories for each run
run_0, run_1, ..., run_599
Create input files in each of these
run_0/my_job.in
run_1/my_job.in
...
run_599/my_job.in
The output, error & log files for each job will
be created by Condor from your job's output
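Creating 600 directories by hand is tedious, so a small script helps. The following is a sketch under stated assumptions: make_run_dirs is a hypothetical helper, and it writes empty placeholder input files where a real sweep would write the (x, y, z) data for each run.

```python
import tempfile
from pathlib import Path


def make_run_dirs(base: Path, count: int, input_name: str = "my_job.in") -> list[Path]:
    """Create base/run_0 ... base/run_{count-1}, each holding an input file."""
    created = []
    for i in range(count):
        run_dir = base / f"run_{i}"
        run_dir.mkdir(parents=True, exist_ok=True)
        # Placeholder: a real sweep would write this run's (x, y, z) input here.
        (run_dir / input_name).touch()
        created.append(run_dir)
    return created


# Demonstrate in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    dirs = make_run_dirs(Path(tmp), 600)
    print(len(dirs), "directories:", dirs[0].name, "...", dirs[-1].name)
```

To lay out a real run, call make_run_dirs with the directory you will submit from instead of a temporary one.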

31
Submit Description File for 600 Jobs

# Cluster of 600 jobs with different directories
Universe   = vanilla
Executable = my_job
Arguments  = -x 0
Log        = my_job.log
Input      = my_job.in
Output     = my_job.out
Error      = my_job.err
# Log, input, output & error files -> run_0
InitialDir = run_0
# Job 3.0 (Cluster 3, Process 0)
Queue
Arguments  = -x 1
# Log, input, output & error files -> run_1
InitialDir = run_1
# Job 3.1 (Cluster 3, Process 1)
Queue
# Do this 598 more times...

32
Submit File for a Big Cluster of Jobs

We just submitted 1 cluster with 600 processes
All the input/output files will be in different
directories
The submit file is pretty unwieldy (over 1200
lines)
Isn't there a better way?

33
Submit File for a Big Cluster of Jobs (the better
way) 1

We can queue all 600 in 1 Queue command
Queue 600
Condor provides $(Process) and $(Cluster)
$(Process) will be expanded to the process number
for each job in the cluster
0, 1, ..., 599
$(Cluster) will be expanded to the cluster number
Will be 4 for all jobs in this cluster

34
Submit File for a Big Cluster of Jobs (the better
way) 2

The initial directory for each job can be
specified using $(Process)
InitialDir = run_$(Process)
Condor will expand these to run_0, run_1, ...,
run_599 directories
Similarly, arguments can be variable
Arguments = -x $(Process)
Condor will expand these to -x 0, -x 1, ...,
-x 599
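To see what the macro expansion produces, here is a small stand-in written in Python. It only imitates the substitution of $(Process) and $(Cluster) for illustration; expand_macros is a hypothetical function and not how condor_submit is implemented.

```python
import re


def expand_macros(text: str, cluster: int, process: int) -> str:
    """Replace $(Cluster) and $(Process), case-insensitively, with their numbers."""
    values = {"cluster": str(cluster), "process": str(process)}

    def substitute(match: re.Match) -> str:
        name = match.group(1).lower()
        # Leave macros this sketch does not know about untouched.
        return values.get(name, match.group(0))

    return re.sub(r"\$\((\w+)\)", substitute, text)


for proc in (0, 1, 599):
    print(expand_macros("InitialDir = run_$(Process)", cluster=4, process=proc))
```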

35
Better Submit File for 600 Jobs

# Example condor_submit input file that defines
# a cluster of 600 jobs with different
# directories
Universe   = vanilla
Executable = my_job
# -x 0, -x 1, ..., -x 599
Arguments  = -x $(Process)
Log        = my_job.log
Input      = my_job.in
Output     = my_job.out
Error      = my_job.err
# run_0 ... run_599
InitialDir = run_$(Process)
# Jobs 4.0 ... 4.599
Queue 600

36
Now, we submit it

condor_submit my_job.submit
Submitting job(s) ................................
..................................................
..................................................
..................................................
..................................................
.......................
Logging submit event(s) ..........................
..................................................
..................................................
..................................................
..................................................
.............................
600 job(s) submitted to cluster 4.

37
And, Check the queue

condor_q
-- Submitter: x.cs.wisc.edu : <128.105.121.53:510> : x.cs.wisc.edu
 ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
 4.0     frieda   4/20 12:08  0+00:00:05 R  0   9.8  my_job -arg1 -x 0
 4.1     frieda   4/20 12:08  0+00:00:03 I  0   9.8  my_job -arg1 -x 1
 4.2     frieda   4/20 12:08  0+00:00:01 I  0   9.8  my_job -arg1 -x 2
 4.3     frieda   4/20 12:08  0+00:00:00 I  0   9.8  my_job -arg1 -x 3
...
 4.598   frieda   4/20 12:08  0+00:00:00 I  0   9.8  my_job -arg1 -x 598
 4.599   frieda   4/20 12:08  0+00:00:00 I  0   9.8  my_job -arg1 -x 599
600 jobs; 599 idle, 1 running, 0 held

38
Removing jobs

If you want to remove a job from the Condor
queue, you use condor_rm
You can only remove jobs that you own (you can't
run condor_rm on someone else's jobs unless you
are root on UNIX or administrator on Windows)
You can give specific job IDs (cluster or
cluster.proc)
condor_rm 4.0    Removes a single job
condor_rm 4      Removes the whole cluster
Or, remove all of your jobs with -a
condor_rm -a     Removes all jobs / clusters

39
Another Universe
40
More about Condor Universes

Multiple Condor Universes
Different feature sets
We've been using the Vanilla universe
Can be used to run any serial job
And, introducing
Scheduler
Local

41
Condor Universes: Scheduler and Local

Scheduler Universe
Plug in a meta-scheduler
Developed for DAGMan (more later)
Similar to Globus's fork job manager
Local
Very similar to vanilla, but jobs run on the
local host
Has more control over jobs than scheduler
universe

42
Frieda's Condor Pool
Frieda can still only run one job at a time,
however.
43
The Boss says Frieda can add her co-workers'
desktop machines into her Condor pool as
well, but only if they can also submit jobs.
Good News
44
Adding nodes

Frieda installs Condor on the desktop machines,
and configures them with her machine as the
central manager
The central manager
Central repository for the whole pool
Performs job / machine matching, etc.
These are non-dedicated nodes, meaning that
they can't always run Condor jobs

45
Frieda's Condor Pool
Now, Frieda and her co-workers can run multiple
jobs at a time so their work completes sooner.
46
condor_status
condor_status
Name          OpSys  Arch   State      Activ  LoadAv  Mem  ActvtyTime
antipholus.cs LINUX  INTEL  Unclaimed  Idle   0.020   511  0+02:28:42
coral.cs.wisc LINUX  INTEL  Claimed    Busy   0.990   511  0+01:27:21
doc.cs.wisc.e LINUX  INTEL  Unclaimed  Idle   0.260   511  0+00:20:04
dsonokwa.cs.w LINUX  INTEL  Claimed    Busy   0.810   511  0+00:01:45
ferdinand.cs. LINUX  INTEL  Claimed    Suspe  1.130   511  0+00:00:55
vm1@pinguino. LINUX  INTEL  Unclaimed  Idle   0.000   255  0+01:03:28
vm2@pinguino. LINUX  INTEL  Unclaimed  Idle   0.190   255  0+01:03:29
47
How can my jobs access their data files?
48
Access to Data in Condor

Use shared filesystem if available
No shared filesystem?
Condor can transfer files
Can automatically send back changed files
Atomic transfer of multiple files
Can be encrypted over the wire
Remote I/O Socket
Standard Universe can use remote system calls
(more on this later)

49
Condor File Transfer

ShouldTransferFiles = YES
Always transfer files to execution site
ShouldTransferFiles = NO
Rely on a shared filesystem
ShouldTransferFiles = IF_NEEDED
Will automatically transfer the files if the
submit and execute machine are not in the same
FileSystemDomain
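The three settings above amount to a simple decision rule. The sketch below models that rule as described on this slide; will_transfer_files is a hypothetical name, and the domain strings in the example are made up.

```python
def will_transfer_files(setting: str, submit_domain: str, execute_domain: str) -> bool:
    """Model the three ShouldTransferFiles settings described above."""
    setting = setting.upper()
    if setting == "YES":
        return True
    if setting == "NO":
        return False
    if setting == "IF_NEEDED":
        # Transfer only when the two machines do not share a FileSystemDomain.
        return submit_domain != execute_domain
    raise ValueError(f"unknown ShouldTransferFiles value: {setting!r}")


print(will_transfer_files("IF_NEEDED", "cs.wisc.edu", "stat.wisc.edu"))
```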

50
We Need More

Condor is managing and running our jobs, but
Our CPU requirements are greater than our
resources
Jobs get vacated when people use their
workstations

51
Happy Day! Frieda's organization purchased a
Dedicated Cluster!

Frieda installs Condor on all the dedicated
cluster nodes
Frieda also adds a dedicated central manager
She configures her entire pool with this new host
as the central manager

52
Frieda's Condor Pool
With the additional resources, Frieda and her
co-workers can get their jobs completed even
faster.
53
What Condor Daemons are running on my machine,
and what do they do?
54
condor_master

Starts up all other Condor daemons
If there are any problems and a daemon exits, it
restarts the daemon and sends email to the
administrator
Acts as the server for many Condor remote
administration commands
condor_reconfig, condor_restart, condor_off,
condor_on, condor_config_val, etc.

55
Condor Daemon Layout
(Diagram; labeled daemon: master)
56
Central Manager: condor_collector

Central manager: central repository and
matchmaker for the whole pool
Collects information from all other Condor
daemons in the pool
Directory Service / Database for a Condor pool
Each daemon sends a periodic update called a
ClassAd to the collector
Services queries for information
Queries from other Condor daemons
Queries from users (condor_status)
Only on the Central Manager
At least one collector per pool

57
Condor Pool Layout: Collector
(Diagram: ClassAd communication pathway; labeled daemons: master, collector)
58
Central Manager: condor_negotiator

Performs matchmaking in Condor
Each Negotiation Cycle (typically 5 minutes):
Gets information from the collector about all
available machines and all idle jobs
Tries to match jobs with machines that will serve
them
Both the job and the machine must satisfy each
other's requirements
Only one negotiator per pool
Only on the Central Manager

59
Condor Pool Layout: Negotiator
(Diagram: ClassAd communication pathway; labeled daemons: master, negotiator, collector)
60
Execute Hosts: condor_startd

Execute hosts: machines that run user jobs
Represents a machine to the Condor system
Responsible for starting, suspending, and
stopping jobs
Enforces the wishes of the machine owner (the
owner's policy; more on this in the
administrator's tutorial)
Creates a starter for each running job
One startd runs on each execute node

61
Condor Pool Layout: startd
(Diagram: ClassAd communication pathway; labeled daemons: master, negotiator, collector)
62
Submit Hosts: condor_schedd

Submit hosts: machines that users can submit
jobs on
Maintains the persistent queue of jobs
Responsible for contacting available machines and
sending them jobs
Services user commands which manipulate the job
queue
condor_submit, condor_rm, condor_q, condor_hold,
condor_release, condor_prio, ...
Creates a shadow for each running job
One schedd runs on each submit host

63
Condor Pool Layout: schedd
(Diagram: ClassAd communication pathway; labeled daemons: master, negotiator, collector, schedd, startd)
64
Condor Pool Layout: master
(Diagram: ClassAd communication pathway; labeled daemons: master, negotiator, collector, startd, schedd)
65
Some of the machines in the Pool do not have
enough memory or scratch disk space to run my job!
66
Specify Requirements!

An expression (syntax similar to C or Java)
Must evaluate to True for a match to be made

67
Specify Rank!

All matches which meet the requirements can be
sorted by preference with a Rank expression.
The higher the Rank, the better the match
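The interplay of Requirements and Rank can be pictured as "filter, then sort". The sketch below models only the job's side of the match (in Condor the machine's own requirements must be satisfied too), uses plain Python functions in place of ClassAd expressions, and runs on made-up machine data. match_and_rank is a hypothetical name.

```python
from typing import Callable

Machine = dict  # attribute name -> value, a stand-in for a machine ClassAd


def match_and_rank(
    machines: list[Machine],
    requirements: Callable[[Machine], bool],
    rank: Callable[[Machine], float],
) -> list[Machine]:
    """Keep the machines that satisfy requirements, highest Rank first."""
    candidates = [m for m in machines if requirements(m)]
    return sorted(candidates, key=rank, reverse=True)


machines = [  # made-up machine ads for illustration
    {"Name": "a", "Memory": 256, "Disk": 5_000_000, "KFlops": 900_000},
    {"Name": "b", "Memory": 512, "Disk": 20_000_000, "KFlops": 700_000},
    {"Name": "c", "Memory": 1024, "Disk": 20_000_000, "KFlops": 1_200_000},
]

ranked = match_and_rank(
    machines,
    requirements=lambda m: m["Memory"] >= 512 and m["Disk"] >= 10_000_000,
    rank=lambda m: m["KFlops"],
)
print([m["Name"] for m in ranked])
```

Machine "a" is dropped because it fails the memory and disk requirements; of the two that remain, "c" is preferred because its Rank value is higher.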

68
Now my jobs aren't running... What's wrong?
69
Checking the queue

Check the queue with condor_q
bash-2.05a$ condor_q
-- Submitter: x.cs.wisc.edu : <128.105.121.53:510> : x.cs.wisc.edu
 ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
 5.0     frieda   4/20 12:23  0+00:00:00 I  0   9.8  my_job -arg1 -n 0
 5.1     frieda   4/20 12:23  0+00:00:00 I  0   9.8  my_job -arg1 -n 1
 5.2     frieda   4/20 12:23  0+00:00:00 I  0   9.8  my_job -arg1 -n 2
 5.3     frieda   4/20 12:23  0+00:00:00 I  0   9.8  my_job -arg1 -n 3
 5.4     frieda   4/20 12:23  0+00:00:00 I  0   9.8  my_job -arg1 -n 4
 5.5     frieda   4/20 12:23  0+00:00:00 I  0   9.8  my_job -arg1 -n 5
 5.6     frieda   4/20 12:23  0+00:00:00 I  0   9.8  my_job -arg1 -n 6
 5.7     frieda   4/20 12:23  0+00:00:00 I  0   9.8  my_job -arg1 -n 7
 6.0     frieda   4/20 13:22  0+00:00:00 H  0   9.8  my_job -arg1 arg2
9 jobs; 8 idle, 0 running, 1 held

70
Check machine status

Verify that there are idle machines with
condor_status
bash-2.05a$ condor_status
Name          OpSys  Arch   State    Activity  LoadAv  Mem  ActvtyTime
vm1@tonic.c   LINUX  INTEL  Claimed  Busy      0.000   501  0+00:00:20
vm2@tonic.c   LINUX  INTEL  Claimed  Busy      0.000   501  0+00:00:19
vm3@tonic.c   LINUX  INTEL  Claimed  Busy      0.040   501  0+00:00:17
vm4@tonic.c   LINUX  INTEL  Claimed  Busy      0.000   501  0+00:00:05
              Total  Owner  Claimed  Unclaimed  Matched  Preempting
INTEL/LINUX       4      0        4          0        0           0
Total             4      0        4          0        0           0

71
Look in Job Log

Look in your job log for clues
bash-2.05a$ cat my_job.log
000 (031.000.000) 04/20 14:47:31 Job submitted from host: <128.105.121.53:48740>
...
007 (031.000.000) 04/20 15:02:00 Shadow exception!
    Error from starter on gig06.stat.wisc.edu:
    Failed to open '/scratch.1/frieda/workspace/v67/condor-test/test3/run_0/my_job.in'
    as standard input: No such file or directory (errno 2)
    0 - Run Bytes Sent By Job
    0 - Run Bytes Received By Job
...

72
Still not running? Exercise a little patience

On a busy pool, it can take a while to match and
start your jobs
Wait at least a negotiation cycle or two
(typically 5 minutes)

73
Look to condor_q for help: condor_q -analyze

bash-2.05a$ condor_q -ana 29
---
029.000: Run analysis summary. Of 1243 machines,
   1243 are rejected by your job's requirements
      0 are available to run your job
WARNING: Be advised:
   No resources matched request's constraints
   Check the Requirements expression below:
Requirements = ((Memory > 8192)) && (Arch == "INTEL") &&
(OpSys == "LINUX") && (Disk >= DiskUsage) &&
(TARGET.FileSystemDomain == MY.FileSystemDomain)

74
Queue analysis for 6.7: condor_q -better-analyze

bash-2.05a$ condor_q -better-ana 29
The Requirements expression for your job is:
( ( target.Memory > 8192 ) ) && ( target.Arch == "INTEL" ) &&
( target.OpSys == "LINUX" ) && ( target.Disk >= DiskUsage ) &&
( TARGET.FileSystemDomain == MY.FileSystemDomain )
   Condition                                       Machines Matched  Suggestion
   ---------                                       ----------------  ----------
1  ( ( target.Memory > 8192 ) )                    0                 MODIFY TO 4000
2  ( TARGET.FileSystemDomain == "cs.wisc.edu" )    584
3  ( target.Arch == "INTEL" )                      1078
4  ( target.OpSys == "LINUX" )                     1100
5  ( target.Disk >= 13 )                           1243

75
Use condor_status to learn about resources

bash-2.05a$ condor_status -const 'Memory > 8192'
(no output means no matches)
bash-2.05a$ condor_status -const 'Memory > 4096'
Name          OpSys  Arch    State      Activ  LoadAv  Mem   ActvtyTime
vm1@s0-03.cs. LINUX  X86_64  Unclaimed  Idle   0.000   5980  1+05:35:05
vm2@s0-03.cs. LINUX  X86_64  Unclaimed  Idle   0.000   5980  13+05:37:03
vm1@s0-04.cs. LINUX  X86_64  Unclaimed  Idle   0.000   7988  1+06:00:05
vm2@s0-04.cs. LINUX  X86_64  Unclaimed  Idle   0.000   7988  13+06:03:47
              Total  Owner  Claimed  Unclaimed  Matched  Preempting
X86_64/LINUX      4      0        0          4        0           0
Total             4      0        0          4        0           0

76
We've seen how Condor can

Keep an eye on your jobs and keep you posted
on their progress
Implement your policy on the execution order of
the jobs
Keep a log of your job activities

77
My new jobs run for 20 days

What happens when a job is forced off its CPU?
Preempted by higher priority user or job
Vacated because of user activity
How can I add fault tolerance to my jobs?

78
Run them in Todd's Private Universe?
79
Condor's Standard Universe to the rescue!

Support for transparent process checkpoint and
restart
Remote system calls (remote I/O)
Your job can read / write files as if they were
local

80
Process Checkpointing in the Standard Universe

Condor's process checkpointing provides a
mechanism to automatically save the state of a
job
The process can then be restarted from right
where it was checkpointed
After preemption, crash, etc.

81
Checkpointing details

The entire state of a process is saved into a
checkpoint file
Memory image
CPU registers
I/O, etc.
Typically, no changes to your job's source code
needed
Your job must be relinked with Condor's Standard
Universe support library

82
Relinking Your Job for Standard Universe

To do this, just place condor_compile in front
of the command you normally use to link your job:

condor_compile gcc -o myjob myjob.c
- OR -
condor_compile f77 -o myjob filea.f fileb.f
- OR -
condor_compile make -f MyMakefile
83
Limitations of the Standard Universe

Condor's checkpointing is not at the kernel
level.
In the Standard Universe, the job may not
fork()
Use kernel threads
Use some forms of IPC, such as pipes and shared
memory
Must have access to source code to relink
Many typical scientific jobs are OK

84
When will Condor checkpoint your job?

Periodically, if desired
For fault tolerance
When your job is preempted by a higher priority
job
When your job is vacated because the execution
machine becomes busy
When you explicitly run the condor_checkpoint,
condor_vacate, condor_off or condor_restart
command

85
Remote System Calls in the Standard Universe

I/O system calls are trapped and sent back to
submit machine
Allows transparent migration across
administrative domains
Checkpoint on machine A, restart on B
No source code changes required
Language independent
Opportunities for application steering
Example: Condor tells customer process how to
open files

86
Connecting Condors

Frieda knows people with their own Condor pools,
and gets permission to use their computing
resources
How can Condor help her do this?

87
Connect Condors with Flocking

Frieda configures her Condor pool to flock to
her friend's pool.
Flocking is a Condor-specific technology.

88
Frieda's Condor Pool
89
Frieda meets The Grid

Frieda also has access to grid resources she
wants to use
She has certificates and access to Globus or
other resources at remote institutions
But Frieda wants Condor's queue management
features for her jobs!
She installs Condor so she can submit Grid
Universe jobs to Condor

90
Grid Universe

All handled in your submit file
Supports a number of back end types
Globus GT2, GT3, GT4
NorduGrid
UNICORE
Condor
PBS
LSF

91
Grid Universe: Globus 2/3

Used for a Globus GT2 / GT3 back-end
Condor-G
Grid_Resource = (gt2|gt3) <Head-Node>
Globus_rsl = <RSL-String>
Example:
Universe = grid
Grid_Resource = gt2 beak.cs.wisc.edu/jobmanager
Globus_rsl = (queue=long)(project=atom-smasher)

92
Grid Universe: Globus 4

Used for a Globus GT4 back-end
Grid_Resource = gt4 <Head-Node> <Scheduler-Type>
Globus_XML = <XML-String>
Example:
Universe = grid
Grid_Resource = gt4 beak.cs.wisc.edu Condor
Globus_xml = <queue>long</queue><project>atom-smasher</project>

93
Grid Universe: Condor

Used for a Condor back-end
Condor-C
Grid_Resource = condor <Schedd-Name> <Collector-Name>
Remote_<param> = <value>
Remote_ is stripped off
Example:
Universe = grid
Grid_Resource = condor beak condor.cs.wisc.edu
Remote_Universe = standard

94
Grid Universe: NorduGrid

Used for a NorduGrid back-end
Grid_Resource = nordugrid <Host-Name>
Example:
Universe = grid
Grid_Resource = nordugrid ngrid.cs.wisc.edu

95
Grid Universe: UNICORE

Used for a UNICORE back-end
Grid_Resource = unicore <USite> <VSite>
Example:
Universe = grid
Grid_Resource = unicore uhost.cs.wisc.edu vhost

96
Grid Universe: PBS

Used for a PBS back-end
New in 6.7.19
Grid_Resource = pbs
Example:
Universe = grid
Grid_Resource = pbs

97
Grid Universe: LSF

Used for an LSF back-end
New in 6.7.19
Grid_Resource = lsf
Example:
Universe = grid
Grid_Resource = lsf

98
Credential Management

Condor will do The Right Thing with your X509
certificate and proxy
Override default proxy:
X509UserProxy = /home/frieda/other/proxy
Proxy may expire before jobs finish executing
Condor can use MyProxy to renew your proxy
When a new proxy is available, Condor will
forward the renewed proxy to the job
This works for non-grid jobs, too

99
My jobs have dependencies

Can Condor help solve my dependency problems?

100
Frieda learns DAGMan

Directed Acyclic Graph Manager
DAGMan allows you to specify the dependencies
between your Condor jobs, so it can manage them
automatically for you.
(e.g., "Don't run job B until job A has
completed successfully.")

101
What is a DAG?

A DAG is the data structure used by DAGMan to
represent these dependencies.
Each job is a node in the DAG.
Each node can have any number of parent or
children nodes as long as there are no loops!

102
Defining a DAG

A DAG is defined by a .dag file, listing each of
its nodes and their dependencies:
# diamond.dag
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D
Each node will run the Condor job specified by
its accompanying Condor submit file
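To make the dependency structure concrete, the sketch below reads the Job and Parent/Child lines of the diamond example and prints one valid execution order. It is an illustration only, not DAGMan's implementation: parse_dag is a hypothetical helper that understands just these two kinds of lines, and the ordering comes from Python's standard graphlib module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

DIAMOND_DAG = """\
Job A a.sub
Job B b.sub
Job C c.sub
Job D d.sub
Parent A Child B C
Parent B C Child D
"""


def parse_dag(text: str) -> tuple[dict[str, str], dict[str, set[str]]]:
    """Return (node -> submit file, node -> set of parent nodes)."""
    submit_files: dict[str, str] = {}
    parents: dict[str, set[str]] = {}
    for line in text.splitlines():
        words = line.split()
        if not words or words[0].startswith("#"):
            continue  # skip blank lines and comments
        keyword = words[0].lower()
        if keyword == "job":
            submit_files[words[1]] = words[2]
            parents.setdefault(words[1], set())
        elif keyword == "parent":
            # Names before "Child" are parents, names after it are children.
            split = [w.lower() for w in words].index("child")
            for child in words[split + 1:]:
                parents.setdefault(child, set()).update(words[1:split])
    return submit_files, parents


submit_files, parents = parse_dag(DIAMOND_DAG)
# TopologicalSorter takes node -> predecessors and yields parents first.
order = list(TopologicalSorter(parents).static_order())
print(order)
```

B and C may appear in either order, since neither depends on the other; A always comes first and D last. A cycle in the file would make TopologicalSorter raise CycleError, which mirrors the "no loops" rule.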

103
Submitting a DAG

To start your DAG, just run condor_submit_dag
with your .dag file, and Condor will start a
personal DAGMan daemon to begin running
your jobs:
condor_submit_dag diamond.dag
condor_submit_dag submits DAGMan as a job that is
run by the schedd
The DAGMan daemon itself is watched by Condor, so
you don't have to

104
Running a DAG

DAGMan acts as a meta-scheduler, managing the
submission of your jobs to Condor based on the
DAG dependencies.

105
Running a DAG (cont'd)

DAGMan holds & submits jobs to the Condor queue
at the appropriate times.

106
Running a DAG (cont'd)

In case of a job failure, DAGMan continues until
it can no longer make progress, and then creates
a rescue file with the current state of the DAG.

107
Recovering a DAG

Once the failed job is ready to be re-run, the
rescue file can be used to restore the prior
state of the DAG.

108
Recovering a DAG (cont'd)

Once that job completes, DAGMan will continue the
DAG as if the failure never happened.

109
Finishing a DAG

Once the DAG is complete, the DAGMan job itself
is finished, and exits.

110
Additional DAGMan Features

Provides other handy features for job management
nodes can have PRE & POST scripts
failed nodes can be automatically re-tried a
configurable number of times
job submission can be throttled

111
General User Commands

condor_status      View Pool Status
condor_q           View Job Queue
condor_submit      Submit new Jobs
condor_rm          Remove Jobs
condor_prio        Intra-User Prios
condor_history     Completed Job Info
condor_submit_dag  Submit new DAG
condor_checkpoint  Force a checkpoint
condor_compile     Link Condor library

112
Condor Job Universes

Serial Jobs
Vanilla Universe
Standard Universe
Grid Universe
Scheduler
Local Universe
Java Universe

Parallel Jobs
MPI Universe
PVM Universe
Parallel Universe

113
Why have a special Universe for Java jobs?

Java Universe provides more than just inserting
java at the start of the execute line of a
vanilla job
Knows which machines have a JVM installed
Knows the location, version, and performance of
JVM on each machine
Knows about jar files, etc.
Provides more information about Java job
completion than just JVM exit code
Program runs in a Java wrapper, allowing Condor
to report Java exceptions, etc.

114
Universe Java Job

# Example Java Universe Submit file
Universe   = java
Executable = Main.class
jar_files  = MyLibrary.jar
Input      = infile
Output     = outfile
Arguments  = Main 1 2 3
Queue

115
Java support, cont.

bash-2.05a$ condor_status -java
Name         JavaVendor   Ver     State      Actv  LoadAv  Mem
abulafia.cs  Sun Microsy  1.5.0_  Claimed    Busy  0.180   503
acme.cs.wis  Sun Microsy  1.5.0_  Unclaimed  Idle  0.000   503
adelie01.cs  Sun Microsy  1.5.0_  Claimed    Busy  0.000   1002
adelie02.cs  Sun Microsy  1.5.0_  Claimed    Busy  0.000   1002
                 Total  Owner  Claimed  Unclaimed  Matched  Preempting
INTEL/LINUX        965    179      516        250       20           0
INTEL/WINNT50      102      6       65         31        0           0
SUN4u/SOLARIS28      1      0        0          1        0           0
X86_64/LINUX       128      2      106         20        0           0
Total             1196    187      687        302       20           0

116
Frieda wants Condor features on remote resources

She wants to run standard universe jobs on
Grid-managed resources
For matchmaking and dynamic scheduling of jobs
For job checkpointing and migration
For remote system calls

117
Condor GlideIn

Frieda can use the Grid Universe to run Condor
daemons on Grid resources
When the resources run these GlideIn jobs, they
will temporarily join her Condor Pool
She can then submit Standard, Vanilla, PVM, or
MPI Universe jobs and they will be matched and
run on the remote resources
Currently only supports Globus GT2
We hope to fix this limitation

118
(No Transcript)
119
How It Works
Personal Condor
Remote Resource
120
GlideIn Concerns

What if the remote resource kills my GlideIn job?
That resource will disappear from your pool and
your jobs will be rescheduled on other machines
Standard universe jobs will resume from their
last checkpoint like usual
What if all my jobs are completed before a
GlideIn job runs?
If a GlideIn Condor daemon is not matched with a
job in 10 minutes, it terminates, freeing the
resource

121
In Review

With Condor's help, Frieda can:
Manage her compute job workload
Access local machines
Access remote Condor Pools via flocking
Access remote compute resources on the Grid via
Grid Universe jobs
Carve out her own personal Condor Pool from the
Grid with GlideIn technology

122
Advanced Topics
123
Administrator Commands

condor_vacate      Leave a machine now
condor_on          Start Condor
condor_off         Stop Condor
condor_reconfig    Reconfig on-the-fly
condor_config_val  View/set config
condor_userprio    User Priorities
condor_stats       View detailed usage accounting stats

124
Job Policy Expressions

User can supply job policy expressions in the
submit file.
Can be used to describe a successful run.
on_exit_remove = <expression>
on_exit_hold = <expression>
periodic_remove = <expression>
periodic_hold = <expression>

125
Job Policy Examples

Do not remove if exits with a signal:
on_exit_remove = ExitBySignal == False
Place on hold if exits with nonzero status or ran
for less than an hour:
on_exit_hold = ( (ExitBySignal==False) && (ExitSignal != 0) ) || ( (ServerStartTime - JobStartDate) < 3600)
Place on hold if job has spent more than 50% of
its time suspended:
periodic_hold = CumulativeSuspensionTime > (RemoteWallClockTime / 2.0)

126
My boss wants to watch what Condor is doing
127
Use CondorView!

Provides visual graphs of current and past
utilization
Data is derived from Condor's own accounting
statistics
Interactive Java applet
Quickly and easily view
How much Condor is being used
How many cycles are being delivered
Who is using them
Utilization by machine platform or by user

128
CondorView Usage Graph
129
A Common Question

My Personal Condor is flocking with a bunch of
Solaris and Linux machines, and also doing a
GlideIn to an SGI O2K. I do not want to
statically partition my jobs.
Solution: In your submit file, specify:
Executable = myjob.$$(OpSys).$$(Arch)
Requirements = (Arch=="INTEL" && OpSys=="LINUX") || \
               (Arch=="SUN4u" && OpSys=="SOLARIS8") || \
               (Arch=="SGI" && OpSys=="IRIX65")
The $$(xxx) notation is replaced with
attributes from the machine ClassAd which was
matched with your job.
