1
Introduction to Condor and CamGrid
Mark Calleja
Cambridge eScience Centre
www.escience.cam.ac.uk
2
The problem
  • I've got some computational jobs to run. Where
    can I run them?
  • Well, if they're 'few' (how many's that?),
    'small' (minimal memory footprint) and 'short'
    (less than an hour?), I may be tempted to use my
    desktop.

3
My problem's bigger than that!
  • Well, if you need many gigabytes (terabytes?) of
    memory, or your application can make use of
    many (> 10?) processors simultaneously, e.g.
    using MPI or OpenMP, then maybe you need an HPC
    facility.

4
Actually, it's somewhere in between
  • Many (most?) scientific jobs won't require an HPC
    facility, but we may have many such independent
    jobs that need to be run.
  • So, we need access to lots of machines like my
    desktop, or maybe slightly more powerful.
  • Enter…

5
Condor
  • From the University of Wisconsin, since 1985.
  • Allows a heterogeneous mix of machines to be
    linked into a pool, making them accessible for
    useful work.
  • These can be lowly desktops, to bigger server
    types, even whole clusters.
  • A pool is defined by one special machine, the
    central manager, and the other machines which
    allow it to coordinate them.
  • Separate pools can cooperate, or flock (more
    anon).
  • Machines in a pool can decide how much CPU time
    they'll give to Condor, e.g. only when idle, or
    outside office hours, etc.
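  • For instance, such a policy is expressed on an
    execute machine using standard condor_config
    macros. A sketch (the exact expressions vary from
    pool to pool):
  START    = KeyboardIdle > 15 * $(MINUTE) && LoadAvg < 0.3
  SUSPEND  = KeyboardIdle < $(MINUTE)
  CONTINUE = KeyboardIdle > 5 * $(MINUTE)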

6
Great! How do I use it?
  • First, you need to be given access to a submit node,
    but that's the only account you'll need.
  • This is an example of the grid computing
    paradigm.
  • Condor has some useful features beyond just
    making many machines available simultaneously,
    e.g.
  • MPI support (with some effort)
  • Failure resilience
  • Process checkpoint / restart / migration
  • Workflow support
  • Condor's ideal for performing parameter sweeps.

7
How do I start?
  • First, make the application batch ready. This
    means:
  • No interactive input (but can redirect
    stdin/stdout from/to files)
  • No GUIs
  • Ideally statically linked, but not necessary
  • Pick your universe, i.e. the Condor environment to
    use.
  • Create a submit description file. This gives
    Condor clues about how/where the job can run.
  • Submit your job!

8
Condor's universes
  • Controls how Condor handles jobs
  • Choices include:
  • Vanilla
  • Standard
  • Grid
  • Java
  • Parallel
  • VM
  • Today we'll just deal with the vanilla (simple,
    takes your application as is) and the standard
    (allows for checkpointing).

9
The job description file
  • This is a plain ASCII text file.
  • It tells Condor about your job, i.e. which
    executable, universe, input, output and error
    files to use, command-line arguments, environment
    variables, any special requirements or
    preferences.
  • Can describe many jobs at once (a "cluster"),
    each with different input, arguments, output,
    etc.
  • Suppose I have an application called a.out
    which needs to take the command-line argument
    -a <number> and reads input data from the files
    inp1 and inp2. Output files are returned
    automatically.
  • Furthermore, this application must run on a
    32-bit Linux box. Then, a suitable job description
    file would look like:

10
  # Simple condor_submit input file
  # (Lines beginning with # are comments)
  # NOTE: the words on the left side are not
  # case sensitive, but filenames are!
  Universe                = vanilla
  Executable              = a.out
  should_transfer_files   = YES
  when_to_transfer_output = ON_EXIT_OR_EVICT
  transfer_input_files    = inp1, inp2
  Requirements            = OpSys == "LINUX" && Arch == "INTEL"
  Arguments               = -a 0
  Log                     = log.txt
  Input                   = input.txt
  Output                  = output.txt
  Error                   = error.txt
  notify_user             = mark@cam.ac.uk
  Queue

11
Submitting the job
  • If we've created the submit description file
    "job", then we can submit it to our Condor pool
    with:
  woolly> condor_submit job
  Submitting job(s).
  Logging submit event(s).
  1 job(s) submitted to cluster 684.
  • We can keep track of our job with condor_q:
  woolly> condor_q
  -- Submitter: woolly.escience.cam.ac.uk : <172.24.116.79:683>
   ID     OWNER    SUBMITTED     RUN_TIME   ST PRI SIZE CMD
   684.0  mcal00   3/3  14:56  0+00:00:01   R  0   0.0  a.out

12
Submitting LOTS of jobs
  • I have 600 jobs to run. Also, I can build 32 and
    64 bit versions of my applications, say
    a.out.INTEL and a.out.X86_64.
  • First, prepare 600 subdirectories with all
    relevant input files (easier for sorting files).
    The corresponding output files will go in these
    directories.
  • Call these directories dir_0, dir_1, …, dir_599.
  • Also, I require 1GB of memory for each job, and
    I'd prefer to have the fastest machines.
  • If you submit lots of jobs, make sure you have
    the available disk space to receive the output!
  • My job description file can now look like:

13
  # job description file for 600 jobs
  Universe                = vanilla
  Executable              = a.out.$$(Arch)
  should_transfer_files   = YES
  when_to_transfer_output = ON_EXIT_OR_EVICT
  transfer_input_files    = inp1, inp2
  Requirements            = Memory > 1000 && OpSys == "LINUX" && \
                            (Arch == "INTEL" || Arch == "X86_64")
  # I'd prefer the fastest processors
  Rank                    = Kflops
  Arguments               = -a $(Process)
  InitialDir              = dir_$(Process)
  Log                     = log.txt
  Input                   = input.txt
  Output                  = output.txt
  Error                   = error.txt
  Queue 600
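  • (Aside: the 600 input directories themselves can
    be prepared with a short shell loop. A sketch,
    reusing the inp1/inp2 names from above:)
  for i in $(seq 0 599); do
      mkdir -p dir_$i
      cp inp1 inp2 dir_$i/   # in practice, each job's own input files
  done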

14
The Standard universe
  • The Vanilla universe is easy to get started with,
    but has some limitations:
  • If the remote machine disappears (host or network
    problems) then the job is restarted from scratch.
  • If I don't have an account on the execute host,
    and the odds are that I won't, then I can't
    monitor the output files as they're being
    generated (actually, we have a local solution).
  • The Standard universe aims to address these
    shortcomings. However, we must be able to link
    our application's code with Condor's libraries:
  condor_compile gcc -o myjob myjob.c
  • or even
  condor_compile make -f MyMakeFile

15
The Standard universe (2)
  • Not many compilers are supported, but gcc, g++,
    g77 and now gfortran (F95 support!) work.
  • However, your job must be well behaved, i.e. it
    can't fork, use kernel threads or certain IPC
    tools (pipes, shared memory).
  • If it passes these tests, and most scientific
    codes do, then use "universe = standard" in the
    submit script and don't include anything about
    file transfer.
  • I/O calls on the execute host will now be echoed
    back to the submit machine, so you'll see all
    output files as they're created.
  • Also, Condor will periodically save the state of
    your job, so if there's an outage it will restart
    your job from the last saved image, and not from
    scratch (cf. the Vanilla universe).
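  • A minimal standard-universe submit file might then
    look like this (a sketch; the executable name is
    illustrative, and note the absence of any
    file-transfer commands):
  Universe   = standard
  Executable = myjob        # built with condor_compile, as above
  Log        = log.txt
  Input      = input.txt
  Output     = output.txt
  Error      = error.txt
  Queue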

16
DAGMan: Condor's workflow manager
  • Directed Acyclic Graph Manager
  • DAGMan allows you to specify the dependencies
    between your Condor jobs, so it can manage them
    automatically for you.
  • Allows complicated workflows to be built up (can
    embed DAGs).
  • E.g., "Don't run job B until job A has
    completed successfully."
  • Each node is a Condor job.
  • A node can have any number of parent or
    child nodes, as long as there are no loops!

17
Defining a DAG
  • A DAG is defined by a .dag file, listing each of
    its nodes and their dependencies:
  # diamond.dag
  Job A a.sub
  Job B b.sub
  Job C c.sub
  Job D d.sub
  Parent A Child B C
  Parent B C Child D
  • Each node will run the Condor job specified by
    its accompanying Condor submit file.
  • One can also have pre- and post- jobs to run on
    the submit machine before or after any node (see
    the sketch at the end of this slide).
  • To start your DAG, just run condor_submit_dag
    with your .dag file, and Condor will start a
    personal DAGMan daemon with which to begin
    running your jobs:
  condor_submit_dag diamond.dag
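  • Pre- and post- jobs are attached to nodes with
    SCRIPT lines in the .dag file, e.g. (the script
    names here are illustrative):
  SCRIPT PRE  A setup.sh
  SCRIPT POST D cleanup.sh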

18
DAGMan continued
  • DAGMan holds and submits jobs to the Condor queue
    at the appropriate times.
  • In case of a job failure, DAGMan continues until
    it can no longer make progress, and then creates
    a rescue file with the current state of the
    DAG.
  • Once the failed job is ready to be re-run, the
    rescue file can be used to restore the prior
    state of the DAG.
  • DAGMan has other useful features (see the sketch
    after this list):
  • nodes can have PRE and POST scripts
  • failed nodes can be automatically re-tried a
    configurable number of times
  • job submission can be throttled to limit number
    of active jobs in the queue.
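  • For example (the values here are illustrative):
  # in the .dag file: re-run node B up to 3 times if it fails
  RETRY B 3
  # at submission time: keep at most 50 jobs in the queue at once
  condor_submit_dag -maxjobs 50 diamond.dag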

19
Some general useful commands
  • condor_status       view pool status
  • condor_q            view the job queue
  • condor_submit       submit new jobs
  • condor_rm           remove jobs
  • condor_history      completed job info
  • condor_submit_dag   submit a new DAG
  • condor_checkpoint   force a checkpoint
  • condor_compile      link against the Condor library
  • These commands can take several arguments.
    Run them with a -h argument to see the options.
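  • For example, to remove the cluster submitted
    earlier and confirm it has gone:
  condor_rm 684
  condor_q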

20
What we haven't covered
  • Parrot: a user-space file system from the Condor
    project.
  • Condor-C: Condor's delegated job submission
    mechanism.
  • Parallel/MPI jobs: Condor's way of dealing with
    multi-process jobs, either spanning multiple
    machines, or multi-processors/cores on a single
    machine, or both.
  • Other universes: e.g. Virtual/Java/Grid/Scheduler.
  • Checkpointing Vanilla universe jobs: can be
    done, but requires the user to do it, e.g. either
    by using DAGMan or one's own shell script recursively.
  • Lots of bells and whistles, e.g. submit script
    options.

21
CamGrid
  • Based around the Condor middleware that you've
    just heard about.
  • Started in Jan 2005 by five groups (now up to
    eleven groups running 13 pools).
  • Each group sets up and runs its own pool, and
    flocks to/from other pools.
  • Hence a decentralised, federated model.
  • Strengths
  • No single point of failure
  • Sysadmin tasks shared out
  • Weaknesses
  • Debugging can be complicated, especially
    networking issues.

22
Condor Flocking
  • Condor attempts to run a submitted job in its
    local pool.
  • However, queues can be configured to try sending
    jobs to other pools: "flocking".
  • The user-priority system is flocking-aware:
  • A pool's local users can have priority over
    remote users flocking in.
  • This is how CamGrid works: each group/department
    maintains its own pool and flocks with the
    others.
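  • A sketch of the condor_config settings involved
    (the host names here are hypothetical):
  # on a submit machine: pools to try when the local pool is full
  FLOCK_TO   = cm.other-dept.cam.ac.uk
  # on a central manager: remote submit machines allowed to flock in
  FLOCK_FROM = submit.other-dept.cam.ac.uk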

23
Actually, CamGrid currently has 13 pools.
24
Participating departments/groups
  • Cambridge eScience Centre
  • Dept. of Earth Science (2)
  • High Energy Physics
  • School of Biological Sciences
  • National Institute for Environmental eScience (2)
  • Chemical Informatics
  • Semiconductors
  • Astrophysics
  • Dept. of Oncology
  • Dept. of Materials Science and Metallurgy
  • Biological and Soft Systems

25
CamGrid's vanilla-universe file viewer
  • Condor's Vanilla universe is nice and easy to
    use, but comes at the cost of no real-time
    visibility as output files get generated on
    execute machines.
  • We have our own web-based solution for CamGrid.
  • First, ask me for a password (tell me what
    username you submit jobs as, preferably your
    CRSid).
  • Then, use these details on the form at the bottom
    of:
  • http://www.escience.cam.ac.uk/projects/camgrid/condor_tools.html
  • Uses cookies for session information (1 hour
    sessions).
  • Has a UK eScience CA certificate for
    tempo.escience.cam.ac.uk

26
(No Transcript)
27
(No Transcript)
28
Some details
  • First point of contact for help is your local CO
    (Computer Officer).
  • ucam-camgrid-users mailing list
  • Currently have 1,000 cores/processors, mostly
    4-core Dell 1950s (8GB memory) like the HPCF.
  • Pretty much all Linux, and mostly x86_64.
  • Run the latest Condor stable version, currently
    7.0.5, but we'll upgrade to 7.2.2 when it appears
    (which will provide Standard universe support for
    gfortran).
  • Can run MPI jobs, but only within some individual
    pools, and then preferably as multi-core SMP jobs
    on individual machines.
  • The Condor manual is a great learning resource,
    and we keep an online copy with an added search
    facility at
  • http://holbein.escience.cam.ac.uk/condor_manual/

29
That's 808 years, back in the reign of John I,
just before the Magna Carta
30
It's still only March!
56 refereed publications to date (Science, Phys.
Rev. Lett., …)
31
Links
  • CamGrid: www.escience.cam.ac.uk/projects/camgrid/
  • Condor: www.cs.wisc.edu/condor/
  • Email: mc321@cam.ac.uk
  • Questions?

32
  • Examples URL
  • Please point your browsers at
    http://www.escience.cam.ac.uk/projects/camgrid/workshop/
  • CamGrid Vanilla/Parallel file viewer
  • Your password for this session is the same as
    your username, but with the "i" changed to "1",
    e.g.
  • username: trainXX
  • password: tra1nXX