Title: Condor COD (Computing On Demand), Condor Week, 5/5/2003
1. Condor COD (Computing On Demand)
Condor Week, 5/5/2003
2. What problem are we trying to solve?
- Some people want to run interactive, yet compute-intensive, applications
- Jobs that take lots of compute power over a relatively short period of time
- They want to use batch computing resources, but need them right away
- Ideally, when they're not in use, resources would go back to the batch system
3. Some example applications
- A distributed build/compilation of a large software system
- A very complex spreadsheet that takes a lot of cycles when you press recalculate
- High-energy physics (HEP) analysis jobs
- Visualization tools for data-mining, rendering graphics, etc.
4. Example application for COD
[Diagram labels: User's Workstation; Compute Farm; On-demand workers; Idle nodes; Controller application]
5. What's the Condor solution?
- Condor COD: Computing on Demand
- Use Condor to manage the batch resources when they're not in use by the interactive jobs
- Allow the interactive jobs to come in with high priority and run instead of the batch job on any given resource
6. Why did we have to change Condor for that?
- Doesn't Condor already notice when an interactive job starts on a CPU?
- Doesn't Condor already provide checkpointing when that happens?
- Can't I configure Condor to run whatever jobs I want with a higher priority on my own machines?
7. Well, yes... But that's not good enough
- Not all jobs can be checkpointed, and even for those that can, checkpointing takes some time
- We want this to be instantaneous, not waiting for the batch system to schedule tasks
- You can configure Condor to run higher priority jobs, but the other jobs are kicked off the machine
8. What's new about COD?
- Checkpoint to swap space
- When a high-priority COD job appears, the lower-priority batch job is suspended
- The COD job can run right away, while the batch job is suspended
- Batch jobs (even those that can't checkpoint) can resume instantly once there are no more active COD jobs
9. But wait, there's more
- The condor_startd can now manage multiple claims on each resource
- If any COD claim becomes active, the regular Condor claim is automatically suspended
- Without an active COD claim, the regular claim resumes
- There is a new command-line tool to request, activate, suspend, resume and release these claims
- There's even a C++ object to do all of that, if you really want it
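
The suspend/resume rule described above can be sketched as a small model. This is only an illustration of the rule, not the condor_startd's actual implementation; the `Resource` class, its fields, and the claim name `cod1` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """Toy model of one compute resource (hypothetical, for illustration only)."""
    has_batch_claim: bool = True
    # Maps a COD claim ID to True while that claim is actively running.
    cod_claims: dict = field(default_factory=dict)

    def any_cod_active(self) -> bool:
        return any(self.cod_claims.values())

    def batch_state(self) -> str:
        """The regular claim runs only while no COD claim is active."""
        if not self.has_batch_claim:
            return "none"
        return "suspended" if self.any_cod_active() else "running"


node = Resource()
node.cod_claims["cod1"] = False   # claimed but not active: batch keeps running
state_idle_cod = node.batch_state()
node.cod_claims["cod1"] = True    # COD claim becomes active: batch is suspended
state_active_cod = node.batch_state()
node.cod_claims["cod1"] = False   # COD claim goes quiet again: batch resumes
state_after = node.batch_state()
print(state_idle_cod, state_active_cod, state_after)
```

Holding a COD claim costs the batch job nothing by itself in this model; only an active claim pushes the regular claim aside.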
10. COD claim-management commands
- Request: authorizes the user and returns a unique claim ID for future commands
- Activate: spawns an application on a given COD claim, with various options to define the application, job ID, etc.
- Activating a claim suspends any regular Condor job
- You can have multiple COD claims on a single resource, and they can all be running simultaneously
11. COD commands (cont'd)
- Suspend: the given COD claim is suspended
- If there are no more active COD claims, a regular Condor batch job can now run
- Resume: the given COD claim is resumed, suspending the Condor batch job (if any)
- Deactivate: kill the application but hold onto the COD claim
- Release: get rid of the COD claim itself
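
One way to read the six commands is as transitions in a claim's life cycle. The table below is a sketch under assumptions: the state names Idle, Running and Suspended are the ones condor_status reports for COD claims, `Unclaimed` is a hypothetical placeholder for "no claim exists", and allowing deactivate on a suspended claim is a guess rather than documented behavior.

```python
# Allowed (state, command) -> next-state pairs, read off the command descriptions.
TRANSITIONS = {
    ("Unclaimed", "request"): "Idle",
    ("Idle", "activate"): "Running",
    ("Running", "suspend"): "Suspended",
    ("Suspended", "resume"): "Running",
    ("Running", "deactivate"): "Idle",
    ("Suspended", "deactivate"): "Idle",   # assumption, see the lead-in
    ("Idle", "release"): "Unclaimed",
    ("Running", "release"): "Unclaimed",
    ("Suspended", "release"): "Unclaimed",
}


def apply(state: str, command: str) -> str:
    """Return the next state, or raise if the command makes no sense here."""
    try:
        return TRANSITIONS[(state, command)]
    except KeyError:
        raise ValueError(f"cannot {command} a claim in state {state}") from None


def run(commands):
    """Apply a sequence of commands to a fresh, not-yet-requested claim."""
    state = "Unclaimed"
    for command in commands:
        state = apply(state, command)
    return state


final = run(["request", "activate", "suspend", "resume", "deactivate"])
print(final)
```

After deactivate the claim is still held, so a later activate can start a new application without a fresh request.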
12. COD command protocol
- All commands use ClassAds
- Allows for a flexible protocol
- Excellent error propagation
- Can use existing ClassAd technology
- Similar to existing Condor protocol
- Separation of claiming from activation, so you can have hot-spares, etc.
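
To illustrate why an attribute/value request and reply makes error propagation easy, here is a rough sketch in which plain Python dicts stand in for ClassAds. The command names come from the tool output shown in this talk; the attribute names (`Command`, `Result`, `ClaimId`, `ErrorString`), the claim ID format, and the `handle_command` function are illustrative assumptions, not the real wire protocol.

```python
def handle_command(request: dict, known_claims: set) -> dict:
    """Answer a command ad with a reply ad (dicts stand in for ClassAds)."""
    command = request.get("Command")
    if command == "CA_REQUEST_CLAIM":
        claim_id = f"claim-{len(known_claims) + 1}"   # hypothetical ID format
        known_claims.add(claim_id)
        return {"Result": "Success", "ClaimId": claim_id}
    if request.get("ClaimId") not in known_claims:
        # The failure travels back as data in the reply, with a reason attached.
        return {"Result": "Failure", "ErrorString": "unknown claim ID"}
    return {"Result": "Success"}


claims = set()
granted = handle_command({"Command": "CA_REQUEST_CLAIM"}, claims)
ok = handle_command(
    {"Command": "CA_ACTIVATE_CLAIM", "ClaimId": granted["ClaimId"]}, claims
)
bad = handle_command(
    {"Command": "CA_SUSPEND_CLAIM", "ClaimId": "no-such-claim"}, claims
)
print(granted, ok, bad)
```

Because the reply is just another set of attributes, new fields can be added later without breaking clients that ignore them, which is one reading of the "flexible protocol" bullet.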
13. How does all of that solve the problem?
- The interactive COD application starts up, and goes out to claim some compute nodes
- Once the helper applications are in place and ready, these COD claims are suspended, allowing batch jobs to run
- When the interactive application has work, it can instantly suspend the batch jobs and resume the COD applications to perform the computations
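
The order of operations above can be sketched from the controller's point of view. Nothing here talks to Condor: the hypothetical `Controller` class only records which command it would send to which node, so the sequence is easy to inspect.

```python
class Controller:
    """Sketch of an interactive controller driving COD claims (no real Condor calls)."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.log = []                 # (command, node) pairs, in the order sent

    def _send_all(self, command):
        for node in self.nodes:
            self.log.append((command, node))

    def setup(self):
        # Claim the nodes, start the helpers, then park them so batch jobs can run.
        self._send_all("request")
        self._send_all("activate")
        self._send_all("suspend")

    def burst(self, work):
        # Wake the helpers for the computation, then hand the nodes back to batch.
        self._send_all("resume")
        result = work()
        self._send_all("suspend")
        return result

    def shutdown(self):
        self._send_all("release")


ctl = Controller(["c1.cluster.org", "c6.cluster.org"])
ctl.setup()
answer = ctl.burst(lambda: sum(range(10)))   # stand-in for the real computation
ctl.shutdown()
commands_for_c1 = [cmd for cmd, node in ctl.log if node == "c1.cluster.org"]
print(answer, commands_for_c1)
```

The expensive part (claiming nodes and starting helpers) happens once in `setup`; each later burst costs only a resume and a suspend per node.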
14. Step 1: Initial state
[Diagram labels: User's Workstation; Compute Farm; Idle nodes]
15. Step 2: Application spawned
[Diagram labels: User's Workstation; Compute Farm; Idle nodes; Controller application spawned]
16. Step 3: Compute node setup
[Diagram labels: User's Workstation; Compute Farm; On-demand workers; Idle nodes; arrows: request, activate]
Screen text: Claiming and initializing 4 compute nodes for rendering. Got reply from: c1.cluster.org, c6.cluster.org, c14.cluster.org, c17.cluster.org. SUCCESS!
17. Step 3: Commands used
condor_cod_request -name c1.cluster.org \
    -classad c1.out
Successfully sent CA_REQUEST_CLAIM to startd at <128.105.143.14:55642>
Result ClassAd written to c1.out
ID of new claim is: <128.105.143.14:55642>#1051656208#2
condor_cod_activate -keyword fractgen \
    -id "<128.105.143.14:55642>#1051656208#2"
Successfully sent CA_ACTIVATE_CLAIM to startd at <128.105.143.14:55642>
18. Step 4: Checkpoint to swap
[Diagram labels: User's Workstation; Compute Farm; Suspended worker; arrow: suspend]
Screen text: SELECT FRACTAL TYPE <Mandelbrot> (more user input)
19. Step 4: Commands used
condor_cod_suspend \
    -id "<128.105.143.14:55642>#1051656208#2"
Successfully sent CA_SUSPEND_CLAIM to startd at <128.105.143.14:55642>
- Rendering application on each COD node is suspended while interactive tool waits for input
- The resources are now available for batch Condor jobs
20. Step 5: Batch jobs can run
[Diagram labels: User's Workstation; Compute Farm; Batch queue]
Screen text: SPECIFY PARAMETERS max_iterations 400000 TL -0.65865, -0.56254 BR -0.45865, -0.71254 (more user input)
21. Step 6: Computation burst
[Diagram labels: User's Workstation; Compute Farm; Interactive workers; On-demand workers; Idle nodes; Suspended batch job; arrow: resume]
Screen text: CLICK <RENDER> TO VIEW YOUR FRACTAL (button: RENDER)
22. Step 6: Commands used
condor_cod_resume \
    -id "<128.105.143.14:55642>#1051656208#2"
Successfully sent CA_RESUME_CLAIM to startd at <128.105.143.14:55642>
- Batch Condor jobs on COD nodes are suspended
- All COD rendering applications are resumed on each node
23. Step 7: Results produced
[Diagram labels: User's Workstation; Compute Farm; Interactive workers; On-demand workers; Idle nodes; Suspended batch job; Data; Display]
24. Step 8: User input while batch work resumes
[Diagram labels: User's Workstation; Compute Farm; Suspended worker; Idle nodes; arrow: suspend]
Screen text: ZOOM BOX COORDINATES TL -0.60301, -0.61087 BR -0.58037, -0.62785
25. Step 9: Computation burst 2
[Diagram labels: User's Workstation; Compute Farm; Interactive workers; On-demand workers; Idle nodes; Suspended batch job; Data; Display; arrow: resume; button: RENDER]
26. Step 10: Clean-up
[Diagram labels: User's Workstation; Compute Farm; Idle nodes; arrow: release]
Screen text: REALLY QUIT? Y/N. Releasing compute nodes. 4 nodes terminated successfully!
27. Step 10: Commands used
condor_cod_release \
    -id "<128.105.143.14:55642>#1051656208#2"
Successfully sent CA_RELEASE_CLAIM to startd at <128.105.143.14:55642>
State of claim when it was released: "Running"
- The jobs are cleaned up, claims released, and resources returned to batch system
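
A controller application would typically issue these commands programmatically. The sketch below only builds argument lists and does not execute anything. The tool names and option names (name, classad, keyword, id) come from the walk-through; the leading dashes and the punctuation of the claim ID are reconstructions of text that was damaged in extraction, and the `cod_command` helper is hypothetical.

```python
def cod_command(verb, **options):
    """Build an argv list for a condor_cod_<verb> call without executing it."""
    allowed = {"request", "activate", "suspend", "resume", "release"}
    if verb not in allowed:
        raise ValueError(f"unsupported COD verb: {verb}")
    argv = [f"condor_cod_{verb}"]
    for name, value in options.items():
        argv += [f"-{name}", str(value)]   # assumed single-dash option style
    return argv


# Claim ID as reconstructed from the walk-through output (an assumption).
CLAIM_ID = "<128.105.143.14:55642>#1051656208#2"

request = cod_command("request", name="c1.cluster.org", classad="c1.out")
activate = cod_command("activate", keyword="fractgen", id=CLAIM_ID)
release = cod_command("release", id=CLAIM_ID)
print(request)
print(activate)
print(release)
```

Passing such a list to `subprocess.run(argv, check=True)` on a machine with the tools installed avoids shell quoting problems, since the claim ID contains `<`, `>` and `#`.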
28. Other changes for COD
- The condor_starter has been modified so that it can run jobs without communicating with a condor_shadow
- All the great job control features of the starter without a shadow
- Starter can write its own UserLog
- Other useful features for COD
29. condor_status -cod
- New -cod option to condor_status to view COD claims in a Condor pool

Name         ID    ClaimState  TimeInState  RemoteUser  JobId  Keyword
astro.cs.wi  COD1  Idle        0+00:00:04   wright
chopin.cs.w  COD1  Running     0+00:02:05   wright      3.0    fractgen
chopin.cs.w  COD2  Suspended   0+00:10:21   wright      4.0    fractgen

             Total  Idle  Running  Suspended  Vacating  Killing
INTEL/LINUX      3     1        1          1         0        0
      Total      3     1        1          1         0        0
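
As a rough illustration of how a script might consume this listing, the sketch below splits the three claim rows on whitespace and tallies them by state. The sample text is retyped from the slide, with the TimeInState values assumed to be in days+hh:mm:ss form; the field list and `parse_rows` helper are hypothetical, and real output may differ in detail.

```python
from collections import Counter

SAMPLE = """\
astro.cs.wi COD1 Idle 0+00:00:04 wright
chopin.cs.w COD1 Running 0+00:02:05 wright 3.0 fractgen
chopin.cs.w COD2 Suspended 0+00:10:21 wright 4.0 fractgen
"""

FIELDS = ["Name", "ID", "ClaimState", "TimeInState", "RemoteUser", "JobId", "Keyword"]


def parse_rows(text):
    """Split each row on whitespace; idle claims carry no JobId or Keyword."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if parts:
            # zip stops at the shorter sequence, so missing trailing fields are omitted.
            rows.append(dict(zip(FIELDS, parts)))
    return rows


rows = parse_rows(SAMPLE)
totals = Counter(row["ClaimState"] for row in rows)
print(len(rows), dict(totals))
```

The tallies agree with the summary block of the listing: three claims in total, one each in Idle, Running and Suspended.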
30. What else could I use all these new features for?
- Short-running system administration tasks that need quick access but don't want to disturb the jobs in your batch system
- A Grid Shell
- A condor_starter that doesn't need a condor_shadow is a powerful job management environment that can monitor a job running under a hostile batch system on the grid
31. Future work
- More ways to tell COD about your application
- For now, you define important attributes in your condor_config file and pre-stage the executables
- Ability to transfer files to and from a COD job at a remote machine
- We've already got the functionality in Condor, so why rely on a shared filesystem or pre-staging?
32. More future work
- Accounting for COD jobs
- Working with some real-world applications and integrating these new COD features
- Would the real users please stand up?
- Better Grid Shell support
- This is really a separate-yet-related area of work
33. How do you use COD?
- Upgrade to Condor version 6.5.3 or greater; COD is already included
- There will be a new section in the Condor manual (coming soon)
- If you need more help, ask the ever helpful condor-admin@cs.wisc.edu
- Find me at the BoF on Wednesday, 9am to Noon (room TBA)