Transcript and Presenter's Notes

Title: Running persistent jobs in Condor


1
Running persistent jobs in Condor
  • Derek Weitzel, Brian Bockelman
  • Holland Computing Center

2
Introduction: HCC
  • University of Nebraska Holland Computing Center,
    2009
  • 162 research groups
  • Not all of them are using Condor; a wide variety
    of users.
  • Condor applications:
  • Bioinformatics: CS-Rosetta, AutoDock
  • Math: Graph Transversal
  • Physics: CMS and DZero

3
Introduction: HCC
  • 7,000 cores, all opportunistic for the grid
  • HCC's use of the Open Science Grid:
  • 2.1 million hours this year
  • Grid usage is based on GlideinWMS and Engage's
    OSG-MM

4
Open Science Non-HEP Grid Usage
[Chart: non-HEP Open Science Grid usage; GPN and HCC highlighted]
5
Persistent Jobs
  • Today, I'll be discussing a unique way we create
    backfill at HCC using what we call "persistent
    jobs".
  • We are sharing it because it's an unexpected
    combination of several unique-to-Condor
    capabilities that solves a problem usually
    solved by writing a lot of other code.

6
Persistent Jobs - Motivation
  • What is a persistent job?
  • A single task which lasts longer than a single
    batch job can run.
  • It is thus necessary to keep state between runs.
  • If your cluster limits batch jobs to 12 hours, a
    24-hour task would be a persistent job.
  • At a minimum, between runs, the persistent job
    must keep some state: "which job am I?"
  • This state could also be a quite large
    checkpoint!

7
Persistent Jobs - Examples
  • A few examples:
  • A traditional Condor standard universe job that
    uses checkpoints.
  • Processing streams of data, where there is one
    job per stream.
  • Taken to the extreme, a persistent job might
    never exit.
  • We will show an example of persistent jobs used
    for grid backfill.

8
Persistent Jobs - Difficulties
  • Persistent jobs are difficult because:
  • We want no more than one instance running,
  • but prefer more than zero instances running.
  • To satisfy the first, we need some sort of
    locking and heartbeat mechanism.
  • To satisfy the second, we need something to
    automatically re-queue the job.
  • And then something to monitor the automatic
    submission framework.
  • Oh, and deadlock detection!

9
Traditional Approach
  • Keep the state in a database.
  • Have a locking check-in/check-out mechanism to
    make sure each checkpoint is used by one job.
  • Have a server listening for job heartbeats that
    updates the DB.
  • Have a submit daemon.
  • A lot of infrastructure! At least three services
    you have to write, care for, and feed.

10
Yuck!
  • We just invented a ton of architecture.
  • Too much for a grad student to write, much less
    operate.
  • Distributed architectures are ripe for bugs.
  • You can take a semester-long course in this and
    still not be very good at it.
  • Having a cron job/daemon that submits jobs is
    asking for a bug that submits an infinite number
    of jobs, or deadlocks and submits zero.

11
Persistent Jobs - the Condor Way!
  • Can we come up with a solution that:
  • Uses only Condor,
  • Runs on the grid (Grid Universe or glide-in),
  • Allows us to checkpoint?
  • Insight: use the Condor global job ID as the
    unique state!
  • And Condor can re-run completed jobs!
  • Don't do bookkeeping; let Condor do it for you.

12
Persistent Jobs - the Condor Way!
  • The state kept by each job is the Condor job ID.
  • Use the fact that Condor can resubmit completed
    jobs.
  • Set OnExitRemove = FALSE
  • For the grid universe:
  • Globus_Resubmit = TRUE
  • Globus_Rematch = TRUE
  • Use the job ID to uniquely identify a checkpoint
    file kept in a storage element.
  • Periodically overwrite it with new checkpoints.
  • Sit back, and use the fact that Condor will keep
    trying to re-run the job indefinitely and will
    make sure that the job only has one running
    instance (these settings are sketched in a
    submit file below).
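
A minimal sketch of how these settings might fit into a submit description
file (the grid resource, executable, and arguments are assumptions for
illustration; only the three keywords above come from the slides):

  # Hypothetical submit description for a persistent grid job
  universe        = grid
  grid_resource   = gt2 gatekeeper.example.edu/jobmanager-condor  # assumed site
  executable      = boinc_wrapper.py             # assumed wrapper script
  arguments       = $(Cluster).$(Process)        # hand the job its own ID
  # Never leave the queue on exit; Condor re-runs the job indefinitely.
  on_exit_remove  = FALSE
  # Allow the completed grid job to be resubmitted and rematched.
  globus_resubmit = TRUE
  globus_rematch  = TRUE
  queue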

13
Our Use Case
  • We want to have an infinite amount of work for
    our clusters.
  • Yet people will be mad if we just run sleep().
  • Solution: BOINC.
  • Einstein@Home is an acceptable use of our
    resources.
  • BOINC is a persistent job! As long as you feed
    it the same checkpoint directory, it will
    continue to run indefinitely!
  • Actually, it gets fed jobs internally, but we
    don't know anything about the job boundaries.

14
Our Solution
  • We take the basic BOINC binary, statically
    compiled for RHEL4/32-bit for maximum
    compatibility.
  • BOINC uses the md5sum of the host's MAC address
    to verify that the checkpoint directory is
    correct between executions.
  • Replace the MAC address with the md5sum of the
    Condor job ID (sketched below)!
  • Now, regardless of the worker node running the
    job, the correct checkpoint is used.
  • Checkpoints (~100 MB each) are kept on SRM
    storage. No NFS used!
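
A minimal Python sketch of that idea, assuming the wrapper learns its Condor
global job ID from an environment variable (the variable name and SRM path
layout are illustrative, not from the slides):

  import hashlib
  import os

  # Hypothetical: the wrapper is handed its Condor global job ID; the exact
  # mechanism (environment variable, argument) is an assumption.
  job_id = os.environ.get("CONDOR_GLOBAL_JOB_ID", "submit.example.edu#1234.0#1")

  # Hash the job ID the way BOINC would hash the host's MAC address, so every
  # run of this job maps to the same checkpoint, wherever it lands.
  checkpoint_key = hashlib.md5(job_id.encode()).hexdigest()

  # Illustrative checkpoint location inside the SRM storage element.
  checkpoint_url = "srm://srm.example.edu/boinc/%s.tar.gz" % checkpoint_key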

15
Data Flow
[Diagram: data flow among the submit host (job wrapper), the worker node
running BOINC, and the SRM storage element at Nebraska. The initial binary
and checkpoint are pulled down to the worker, job output flows back, and
checkpoints are written to SRM (1,000 jobs, about 27 MB/s).]
16
Workflow
[Flowchart: the job starts and downloads the checkpoint and binary, then
forks BOINC (example values: checkpoint every 1 hour, 12-hour run lifetime).
The wrapper sleeps in a loop, sending a checkpoint whenever the checkpoint
interval elapses; once the run lifetime is over, it kills BOINC and exits
the job, which Condor then resubmits. A sketch of this loop follows below.]
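
A rough Python sketch of that wrapper loop (the helper functions, binary
name, and polling interval are assumptions; the real project's two Python
scripts are not reproduced here):

  import subprocess
  import time

  CHECKPOINT_INTERVAL = 60 * 60   # example value from the slide: 1 hour
  RUN_LIFETIME = 12 * 60 * 60     # example value from the slide: 12 hours

  def download_checkpoint_and_binary():
      pass  # hypothetical: fetch BOINC and this job's checkpoint from SRM

  def send_checkpoint():
      pass  # hypothetical: upload the current checkpoint directory to SRM

  def run():
      download_checkpoint_and_binary()
      boinc = subprocess.Popen(["./boinc_client"])  # fork BOINC (name assumed)
      start = time.time()
      last_checkpoint = start
      while time.time() - start < RUN_LIFETIME:
          time.sleep(60)                            # sleep between checks
          if time.time() - last_checkpoint >= CHECKPOINT_INTERVAL:
              send_checkpoint()                     # time to checkpoint
              last_checkpoint = time.time()
      boinc.terminate()                             # kill BOINC
      send_checkpoint()                             # final checkpoint; Condor
                                                    # will resubmit the job

  if __name__ == "__main__":
      run()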
17
BOINC Results
18
Why not Condor backfill?
  • Why this and not Condor's built-in backfill?
  • Not all of our sites run Condor. We want to be
    able to utilize an arbitrary site.
  • Lots of exercise for operating grid components!
  • Standard backfill is tied to the physical worker
    node (MAC address). There is no telling when you
    will see that node again on the grid, so you
    might just be throwing away work; BOINC progress
    expires in 3 days.
  • Can't push out backfill jobs using OSG-MM.
  • Possibly could do this for gWMS; I don't know.

19
GlideinWMS Backfill
  • Keep a glide-in instance busy with backfill
    until you get user jobs.
  • Preempt backfill jobs for user jobs.
  • Removes the user ramp-up time.
  • Use flocking to GlideinWMS.
  • GlideinWMS is resource-hungry; it needs
    interactive nodes for deployment.
  • Run backfill to have glideins already running
    for the flocked jobs.

20
Future Uses
  • Where else could persistent jobs be useful?
  • Glidein pilot factory. Removes the need to have
    processes that submit jobs.
  • If the thing stops working, then it's a Condor
    bug, and you have someone else to yell at.

21
Conclusions
  • Persistent jobs can be a nightmare due to the
    necessary bookkeeping.
  • Condor jobs can be made persistent.
  • Don't go inventing things Condor does for you.
  • By utilizing basic Condor and nothing else (the
    entire project is two Python scripts), we can
    generate an arbitrary amount of backfill for our
    site or the OSG!

22
Back Page
Questions? Code location: svn://t2.unl.edu/brian/osg_boinc
23
Back Page
Backup Slides
24
Issues found during implementation
  • Rematching in the grid universe
  • Using matchmaking in the grid universe
  • Needs a special keyword:
  • GlobusRematch = TRUE
  • Was a huge pain to get right.
  • Job accounting
  • Job accounting collects data when the Condor job
    finishes, not for each run.
  • Not solved; relying on grid resource accounting.

25
Issues found during implementation
  • Need to easily modify existing jobs.
  • In Arguments, use special variables:
  • $$(Variable)
  • Variables are evaluated at the time of the match.
  • Used for job lifetime, checkpoint location, and
    binary location (see the example below).
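
For example, such an Arguments line might look like the following sketch
(the attribute names are made up for illustration; only the $$() mechanism
comes from the slide):

  # Hypothetical submit-file line using match-time $$() substitution
  arguments = --lifetime $$(MaxJobLifetime) --checkpoint $$(CheckpointURL) --binary $$(BinaryURL)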

26
Results
[Chart: glideins running and jobs queued over time]

27
Glidein Usage by Site