New Ways to Fetch Work - PowerPoint PPT Presentation

New Ways to Fetch Work The new hook infrastructure in Condor 7.1.* – PowerPoint PPT presentation

Slides: 24
Provided by: Derek196

1
New Ways to Fetch Work
  • The new hook infrastructure in Condor 7.1.*

2
What's the problem?
  • Users wanted to take advantage of Condor's
    resource management daemon (condor_startd) to run
    jobs, but they had their own scheduling system.
  • Specialized scheduling needs
  • Jobs live in their own database or some storage
    other than a Condor job queue

3
Fetch vs. push
  • Instead of trying to get these jobs into a
    condor_schedd, or trying to push them to the
    condor_startd, just get the condor_startd to
    fetch (pull) the work
  • Lower latency than the overhead of matchmaking
    and the schedd
  • Fetching only requires an outbound network
    connection, which makes life easier if you
    glide in behind a firewall

4
What's the dumb solution?
  • Put code directly into the condor_startd that can
    talk directly to the other scheduling system(s)
  • We'd have to support other protocols
  • We'd have to link even more libraries and
    dependencies into our code
  • Very inflexible

5
Another dumb solution
  • Make it a web service!
  • Mostly the same problems
  • What protocol?
  • What format to describe the jobs?
  • Add a dependency on libcurl?
  • What if I don't want a webserver to be handling
    my jobs?
  • Security? Authentication? Privacy?

6
Our solution (hopefully not dumb)
  • Make a system of hooks that you can plug into
  • A hook is a point during the life-cycle of a job
    where the Condor daemons will invoke an external
    program
  • The hook invocation points have to be hard-coded
    into Condor, but then anyone can implement their
    own hooks to do what they want

7
Why isn't that dumb?
  • All the logic, code, libraries, etc., to fetch
    jobs from any given system live completely
    outside of the Condor source and binaries
  • New hooks can be installed without a new version
    of Condor
  • No new library dependencies for us
  • Hooks are written by people who know what they're
    doing

8
How does Condor communicate with hooks?
  • Passing around ASCII ClassAds via standard input
    and standard output
  • Some hooks get control data via a command-line
    argument (argv)
  • Hooks can be written in any language (scripts,
    binaries, whatever you want) so long as you can
    read/write STDIN/OUT
  • Decades of UNIX wisdom can't be wrong!
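
A minimal sketch of what reading and writing an ASCII ClassAd could look like on the hook side, assuming the simple one-attribute-per-line "Name = expression" layout. The helper names parse_classad and format_classad are hypothetical, and expressions are kept as raw strings rather than evaluated.

```python
def parse_classad(text):
    """Parse 'Attribute = expression' lines into a dict of raw expression strings."""
    ad = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first '=' only, so expressions such as (Memory >= 64) survive.
        name, sep, expr = line.partition("=")
        if not sep:
            continue  # not an attribute line; ignore it in this sketch
        ad[name.strip()] = expr.strip()
    return ad


def format_classad(ad):
    """Serialize a dict of raw expressions back to one 'Name = expression' per line."""
    return "".join(f"{name} = {expr}\n" for name, expr in ad.items())
```

A hook written this way would typically call parse_classad(sys.stdin.read()) on entry and write format_classad(...) to sys.stdout when it has something to return.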

9
What hooks are available?
  • Hooks for fetching work (condor_startd)
  • FETCH_WORK
  • REPLY_FETCH
  • EVICT_CLAIM
  • Hooks for running jobs (condor_starter)
  • PREPARE_JOB
  • UPDATE_JOB_INFO
  • JOB_EXIT

10
HOOK_FETCH_WORK
  • Invoked by the startd whenever it wants to try to
    fetch new work
  • FetchWorkDelay expression
  • Hook gets a current copy of the slot ClassAd
  • Hook prints the job ClassAd to STDOUT
  • If STDOUT is empty, there's no work
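
The contract above can be sketched as a small function: read the slot ClassAd, then either print one job ClassAd or print nothing. This is only an illustration; the in-memory pending_jobs list is a hypothetical stand-in for the external scheduling system, and the attributes in the sample job ad are illustrative rather than a statement of what the startd requires.

```python
import io


def fetch_hook(stdin, stdout, pending_jobs):
    """Emit one job ClassAd on stdout, or nothing when no work is available."""
    stdin.read()  # the slot ClassAd; a real hook could use it to pick a suitable job
    if not pending_jobs:
        return  # empty output tells the startd there is no work
    job = pending_jobs.pop(0)
    for name, expr in job.items():
        stdout.write(f"{name} = {expr}\n")


# Hypothetical queue standing in for the external scheduling system.
queue = [{"Cmd": '"/bin/sleep"', "Args": '"60"', "Owner": '"alice"'}]
out = io.StringIO()
fetch_hook(io.StringIO('Name = "slot1@example"\n'), out, queue)
print(out.getvalue(), end="")
```

In a real hook script the call would be fetch_hook(sys.stdin, sys.stdout, ...), with the job coming from the database or service being integrated.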

11
HOOK_REPLY_FETCH
  • Invoked by the startd once it decides what to do
    with the job ClassAd returned by HOOK_FETCH_WORK
  • Gives your external system a chance to know what
    happened
  • argv[1]: accept or reject
  • Gets a copy of slot and job ClassAds
  • Condor ignores all output
  • Optional hook

12
HOOK_EVICT_CLAIM
  • Invoked if the startd has to evict a claim that's
    running fetched work
  • Informational only: you can't stop or delay this
    train once it's left the station
  • STDIN: both slot and job ClassAds
  • STDOUT: > /dev/null

13
HOOK_PREPARE_JOB
  • Invoked by the condor_starter when it first
    starts up (only if defined)
  • Opportunity to prepare the job execution
    environment
  • Transfer input files, executables, etc.
  • INPUT: both slot and job ClassAds
  • OUTPUT: ignored, but the starter won't continue
    until this hook exits
  • Not specific to fetched work

14
HOOK_UPDATE_JOB_INFO
  • Periodically invoked by the starter to let you
    know what's happening with the job
  • INPUT: both ClassAds
  • Job ClassAd is updated with additional attributes
    computed by the starter
  • ImageSize, JobState, RemoteUserCpu, etc.
  • OUTPUT: ignored

15
HOOK_JOB_EXIT
  • Invoked by the starter whenever the job exits for
    any reason
  • argv[1] indicates what happened
  • exit: died a natural death
  • evict: booted off prematurely by the startd
    (PREEMPT == TRUE, condor_off, etc.)
  • remove: removed by condor_rm
  • hold: held by condor_hold
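
One way a job-exit hook might act on argv[1] is a simple lookup table. In this sketch only the four reason strings come from the slide; the action names on the right are hypothetical labels for whatever the external system would do.

```python
# Hypothetical actions for the external system, keyed by the argv[1] values
# listed on the slide.
EXIT_ACTIONS = {
    "exit": "mark_done",
    "evict": "requeue",
    "remove": "mark_cancelled",
    "hold": "mark_held",
}


def job_exit_action(argv):
    """Return the action for the exit reason the starter passed in argv[1]."""
    if len(argv) < 2:
        raise ValueError("expected the exit reason in argv[1]")
    reason = argv[1]
    try:
        return EXIT_ACTIONS[reason]
    except KeyError:
        raise ValueError(f"unknown exit reason: {reason!r}") from None
```

A hook script would call job_exit_action(sys.argv) and then read the job ClassAd from standard input for details such as the exit code.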

16
HOOK_JOB_EXIT
  • HUH!?! condor_rm? What are you talking about?
  • The starter hooks can be defined even for regular
    Condor jobs, local universe, etc.
  • INPUT: copy of the job ClassAd with extra
    attributes about what happened
  • ExitCode, JobDuration, etc.
  • OUTPUT: ignored
  • Except for dumb exceptions: the schedd doesn't
    distinguish rm vs. hold when telling the starter
    to go away (yet). Argh!

17
Defining hooks
  • Each slot can have its own hook keyword
  • Prefix for config file parameters
  • Can use different sets of hooks to talk to
    different external systems on each slot
  • Global keyword used when the per-slot keyword is
    not defined
  • Keyword is inserted by the startd into its copy
    of the job ClassAd and given to the starter
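
The keyword lookup described above can be modeled in a few lines. This is a sketch of the rule as stated on the slide, not Condor's actual code; the function name hook_path and the dict-based config are assumptions, and the parameter names follow the SLOT<N>_JOB_HOOK_KEYWORD, STARTD_JOB_HOOK_KEYWORD, and <KEYWORD>_HOOK_<NAME> pattern.

```python
def hook_path(config, slot_id, hook_name):
    """Resolve the program to run for a hook on a given slot, or None if undefined."""
    # The per-slot keyword wins; the global keyword is the fallback.
    keyword = config.get(f"SLOT{slot_id}_JOB_HOOK_KEYWORD") or config.get(
        "STARTD_JOB_HOOK_KEYWORD"
    )
    if keyword is None:
        return None
    # The keyword is the prefix of the config parameter naming the hook.
    return config.get(f"{keyword}_HOOK_{hook_name}")


# Illustrative configuration with macros already expanded.
config = {
    "STARTD_JOB_HOOK_KEYWORD": "DATABASE",
    "SLOT4_JOB_HOOK_KEYWORD": "WEB",
    "DATABASE_HOOK_FETCH_WORK": "/usr/local/condor/fetch/db/fetch_work.php",
    "DATABASE_HOOK_REPLY_FETCH": "/usr/local/condor/fetch/db/reply_fetch.php",
    "WEB_HOOK_FETCH_WORK": "/usr/local/condor/fetch/web/fetch_work.php",
}
print(hook_path(config, 4, "FETCH_WORK"))
```

Note that a slot with its own keyword does not inherit hooks from the global keyword: in this model slot 4 has no reply hook at all.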

18
Defining hooks example
  • Most slots fetch work from the database system
  • STARTD_JOB_HOOK_KEYWORD = DATABASE
  • Slot 4 fetches and runs work from a web service
  • SLOT4_JOB_HOOK_KEYWORD = WEB
  • The database system needs to both provide work and
    know the reply for each attempted claim
  • DB_DIR = /usr/local/condor/fetch/db
  • DATABASE_HOOK_FETCH_WORK = $(DB_DIR)/fetch_work.php
  • DATABASE_HOOK_REPLY_FETCH = $(DB_DIR)/reply_fetch.php
  • The web system only needs to fetch work
  • WEB_DIR = /usr/local/condor/fetch/web
  • WEB_HOOK_FETCH_WORK = $(WEB_DIR)/fetch_work.php

19
Semantics of fetched jobs
  • The condor_startd treats them just like any other
    kind of job
  • All the standard resource policy expressions
    apply (START, SUSPEND, PREEMPT, RANK, etc.)
  • Fetched jobs can coexist in the same pool with
    jobs pushed by Condor, COD, etc.
  • Fetched work != Backfill

20
Semantics continued
  • If the startd is unclaimed and fetches a job, a
    claim is created
  • If that job completes, the claim is reused and
    the startd fetches again
  • Keep fetching until either:
  • The claim is evicted by Condor
  • The fetch hook returns no more work
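
The claim life cycle above amounts to a loop. The toy model below is only meant to make the two exit conditions explicit; it is not the startd's real state machine, it ignores FetchWorkDelay, and the three callables are hypothetical stand-ins for the fetch hook, the starter, and Condor's eviction decision.

```python
def run_claim(fetch, run_job, evicted):
    """Reuse one claim for fetched jobs until eviction or until no work is left.

    fetch()      -> a job, or None when the fetch hook returns no work
    run_job(job) -> runs the job to completion
    evicted()    -> True once Condor has evicted the claim
    """
    completed = []
    while not evicted():
        job = fetch()
        if job is None:
            break  # the fetch hook returned no more work
        run_job(job)
        completed.append(job)
    return completed


jobs = ["job-a", "job-b", "job-c"]
done = run_claim(lambda: jobs.pop(0) if jobs else None, lambda job: None, lambda: False)
print(done)
```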

21
Limitations for fetched jobs
  • No schedd/shadow means no standard universe for
    checkpointing, migration, and remote system calls
  • Could use stand-alone checkpointing
  • Application-specific checkpointing
  • Other features that are unavailable
  • User policy expressions (e.g. periodic hold)
  • No DAGMan (you're on your own)

22
Limitations of the hooks
  • If the starter can't run your fetched job because
    your ClassAd is bogus, no hook is invoked to tell
    you about it
  • We need a HOOK_STARTER_FAILURE
  • No hook when the starter is about to evict you
    (so you can checkpoint)
  • Can implement this yourself with a wrapper script
    and the SoftKillSig attribute
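
A rough sketch of the wrapper idea: the wrapper runs the real job as a child process, and when the soft-kill signal arrives it gives the application a chance to checkpoint before the child is stopped. The function name, the checkpoint callback, and the default of SIGTERM are all assumptions made for illustration; the signal passed in would need to match whatever SoftKillSig names in the job ClassAd.

```python
import signal
import subprocess


def run_with_checkpoint(cmd, checkpoint, soft_kill_signal=signal.SIGTERM):
    """Run cmd as a child; on the soft-kill signal, checkpoint and then stop it."""
    child = subprocess.Popen(cmd)

    def on_soft_kill(signum, frame):
        checkpoint(child)   # application-specific: save state somewhere durable
        child.terminate()   # then let the job go away

    # Must be called from the main thread, as Python requires for signal handlers.
    previous = signal.signal(soft_kill_signal, on_soft_kill)
    try:
        return child.wait()  # the job's exit status becomes the wrapper's result
    finally:
        signal.signal(soft_kill_signal, previous)  # restore the old handler
```

The job ClassAd would then name the wrapper as the program to run, with the real executable and its arguments passed through as cmd.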

23
More information
  • New section in the Condor 7.1 manual
  • Chapter 4: Miscellaneous Concepts
  • 4.4: Job Hooks
  • http://www.cs.wisc.edu/condor/manual/v7.1/4_4Job_Hooks.html
  • Any questions?