Condor: What It Is and Why You Should Worry About It

1
Condor: What It Is and Why You Should Worry About It
  • Bruce Beckles
  • e-Science Specialist
  • University of Cambridge Computing Service
  • TECHLINK SEMINAR 23 JUNE 2004

2
Overview of Seminar
  • What is Condor and what is it good for?
  • How does it work?
  • What are we planning to do with it in the CS?
  • Why should I care?
  • What will it do for me?
  • What are the security implications?

3
What is Condor?
http://www.cs.wisc.edu/condor/description.html
  • A full-featured cross-platform batch scheduling
    system like PBS or Sun Grid Engine (SGE)
  • A specialised workload management system for
    compute-intensive jobs, specially designed for
    utilising spare CPU cycles, i.e. resource
    harvesting or resource scavenging
  • Especially good for high throughput computing:
    processing large numbers of independent jobs
    (embarrassingly parallel jobs) as efficiently
    as possible

4
What is Condor good for? (1)
  • Consider the following scenario:
  • I have a simulation which takes two hours to run
    on my high-end PC workstation
  • I need to run it 1000 times with slightly
    different parameters each time
  • If I do this serially it will take at least 2000
    hours (about 83 days, roughly 12 weeks, or 3
    months). If I try running more than one
    simulation on my PC at once, this won't really
    improve things, as each simulation will then take
    much longer than 2 hours to run

5
What is Condor good for? (1a)
  • Suppose my department has 100 PCs like mine that
    are mostly sitting around idle overnight (say for
    an average of 8 hours a day).
  • If I could run jobs on those machines when they
    were idle (i.e. when their legitimate users
    aren't using them, so I don't inconvenience
    anyone) then I could get about 800 CPU hours a
    day!
  • This is an ideal scenario for Condor, and so
    using Condor in this situation means it would
    only take me 2.5 days to run my simulations
    instead of months. Hurrah!
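
The figures on this slide and the previous one can be checked with a few lines of arithmetic. This is only a back-of-envelope sketch in Python: the function names are made up for illustration, and it assumes every idle hour is usable and ignores scheduling overhead.

```python
# Back-of-envelope check of the scenario: 1000 runs of 2 hours each,
# and 100 PCs that are each idle for about 8 hours a day.

def serial_days(runs: int, hours_per_run: float) -> float:
    """Days needed to run everything one after another on a single PC."""
    return runs * hours_per_run / 24


def pool_days(runs: int, hours_per_run: float,
              machines: int, idle_hours_per_day: float) -> float:
    """Days needed if every idle CPU hour in the pool can be used."""
    cpu_hours_per_day = machines * idle_hours_per_day  # 800 in the scenario
    return runs * hours_per_run / cpu_hours_per_day


if __name__ == "__main__":
    print(round(serial_days(1000, 2), 1))  # a little over 83 days
    print(pool_days(1000, 2, 100, 8))      # 2.5
```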

6
What is Condor good for? (2)
  • Suppose I manage either a Linux or UNIX cluster,
    or a collection of general purpose machines that
    many people are supposed to be able to run jobs
    on (or both). I could just let users access the
    machines individually and run jobs on them, but
    that might be a bad idea because:
  • Too much trouble for the users to do that for
    each machine they want to use
  • Want to make sure that use of the machines is
    fair, i.e. no one user hogging the machines at
    the expense of everyone else
  • Want jobs to be distributed evenly across
    machines to stop individual machines being
    overloaded
  • May want certain users' jobs to have priority
    over other users' jobs
  • etc.

7
What is Condor good for? (2a)
  • So I want a batch queuing system of some
    description to manage jobs on these machines. I
    probably want it to have the following features:
  • Easy to install, use and administer
  • Free (because I probably don't have much money
    for such things!)
  • Allows me to control who can use particular
    resources, how much use they can make of them,
    who has priority on them and so on
  • Distribute jobs evenly across all the machines
  • If machines die, want to be able to restart jobs
    somewhere else. If queue handler dies, don't
    want all jobs in the queue to be lost
  • Can preempt or suspend jobs (if desired)
  • May want it to support parallel jobs using MPI or
    PVM
  • May want it to be cross-platform (e.g. my users
    require both Solaris and Linux machines)
  • May want it to provide mechanisms for
    checkpointing
  • Can be made secure at both the machine and user
    levels (if desired), and can encrypt its network
    traffic (if desired)
  • Condor has all these features! (with some
    caveats, of course)

8
How does Condor work?
  • A collection of machines running Condor is called
    a pool. Individual pools can be joined together
    by a process known as flocking.
  • Machines in Condor have one (or more) of four
    different roles; see the following two slides
    for details.
  • Each machine can have more than one role, so it
    is possible to have a Condor pool consisting of
    one machine(!), or less trivially, two machines.

9
Machine Roles in Condor (1)
  • Central Manager: Resource broker for a pool;
    keeps track of which machines are available, what
    jobs are running, negotiates which machine will
    run which job, etc. The central manager is also
    sometimes confusingly called the "master" or
    (mistakenly) the "server". There must be exactly
    one central manager per pool. For large pools
    this machine should not have any other role in
    the pool.
  • Submit Machine (or Submit Host): Machine which
    submits jobs to the pool. There must be at least
    one submit host in a pool.

10
Machine Roles in Condor (2)
  • Execution Machine (or Execute Host): Machine on
    which jobs can be run. There must be at least
    one execute host in a pool.
  • Checkpoint Server: Machine which stores all the
    checkpoint files produced by jobs which
    checkpoint. Having such a machine is optional,
    and there can only be one checkpoint server per
    pool.
  • So, typically you would have one central
    manager, one or more submit hosts and one or more
    execute hosts. It is often the case that all
    submit hosts are also execute hosts (and
    vice-versa).

11
A typical Condor Pool
[Diagram of a pool, with the following labels]
  • Central Manager: monitors status of execute hosts
    and assigns jobs to them; matches jobs from
    submit hosts to appropriate execute hosts
  • Submit hosts
  • Execute hosts
  • Machines that are both submit and execute hosts
  • Checkpoint server (optional): checkpoint files
    from jobs that checkpoint are stored on the
    checkpoint server

12
Let's follow a job under Condor (1)
  • A job is submitted at a submit host.
  • The submit host tells the central manager about
    the job using Condor's ClassAd mechanism, which
    provides a (user-customizable) way of describing
    what the job requires in order to run, as well as
    what it desires from the execute host (e.g. a job
    might require a minimum of 256 MB of RAM, but
    run significantly better with 512 MB, and so it
    would prefer to run on machines with 512 MB or
    more (if any are available), but would accept a
    machine with 256 MB if that is all that is
    available).
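
The "requires" versus "desires" distinction in the memory example can be sketched as a hard filter plus a preference score. This is plain Python, not ClassAd syntax: the Machine and Job classes and the best_match function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Machine:
    name: str
    memory_mb: int


@dataclass
class Job:
    # Hard constraint: a machine failing this can never run the job.
    requirements: Callable[[Machine], bool]
    # Preference: among acceptable machines, a higher score is better.
    rank: Callable[[Machine], float]


def best_match(job: Job, machines: List[Machine]) -> Optional[Machine]:
    """Return the acceptable machine the job likes best, or None."""
    acceptable = [m for m in machines if job.requirements(m)]
    if not acceptable:
        return None
    return max(acceptable, key=job.rank)


# The example from the slide: needs 256 MB, prefers 512 MB or more.
job = Job(
    requirements=lambda m: m.memory_mb >= 256,
    rank=lambda m: 1.0 if m.memory_mb >= 512 else 0.0,
)
```

With a 256 MB machine and a 512 MB machine on offer, best_match picks the 512 MB one; if only the 256 MB machine is available it is still accepted, and a 128 MB machine is never chosen.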

13
Let's follow a job under Condor (2)
  • The central manager has been monitoring all the
    execute hosts, so it knows which are available
    and what sort of machine they are (OS, memory,
    etc.). Execute hosts periodically send a ClassAd
    describing themselves to the central manager.
  • Every so often the central manager enters a
    negotiation cycle where it matches waiting jobs
    with available execute hosts.
  • So eventually the job from the submit host is
    matched to a suitable execute host (unless there
    are no suitable execute hosts in the pool, of
    course).
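
A single negotiation cycle, as described above, can be pictured as a loop over the waiting jobs that hands each one an available machine it can accept. The sketch below is a deliberate simplification in Python with hypothetical names: it matches on memory only and ignores user priorities and preferences, which the real central manager also takes into account.

```python
from typing import Dict, List, Tuple


def negotiation_cycle(waiting_jobs: List[Tuple[str, int]],
                      available_machines: Dict[str, int]) -> Dict[str, str]:
    """Match waiting jobs to available machines for one cycle.

    waiting_jobs: (job name, minimum memory in MB) in queue order.
    available_machines: machine name -> memory in MB.
    Returns job name -> machine name; unmatched jobs wait for the next cycle.
    """
    free = dict(available_machines)  # copy, so the caller's dict is untouched
    matches: Dict[str, str] = {}
    for job_name, min_memory_mb in waiting_jobs:
        chosen = next(
            (name for name, mem in free.items() if mem >= min_memory_mb),
            None,
        )
        if chosen is not None:
            matches[job_name] = chosen
            del free[chosen]  # a claimed machine is no longer available
    return matches
```

A job with no suitable machine simply stays unmatched, which corresponds to the "unless there are no suitable execute hosts" case.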

14
Let's follow a job under Condor (3)
  • The central manager informs the chosen execute
    host that it has been claimed and gives it a
    ticket.
  • The central manager then informs the submit host
    which execute host to use and gives it a matching
    ticket.
  • The submit host contacts the execute host,
    presenting its matching ticket, and transfers the
    job's executable and data files to the execute
    host if necessary (Condor can instead make use of
    a shared filesystem between submit and execute
    hosts); the execute host then begins to run the
    job.
  • For some types of jobs the job running on the
    execute host will access files and resources on
    the submit host via remote procedure calls.

15
Let's follow a job under Condor (4)
  • For most jobs, a TCP connection is maintained
    between the submit and execute host while the job
    is running. If the submit host dies (or the TCP
    connection is broken) the execute host aborts the
    job. If the execute host dies (or the TCP
    connection is broken) the job is re-submitted to
    the Condor pool.
  • Certain sorts of jobs can checkpoint, both
    periodically (for safety) and when interrupted.
    If a job is interrupted and has successfully
    checkpointed, it will resume from its last
    checkpointed state when it starts to run again
    (possibly on a different execute host) rather
    than starting from the beginning. Checkpointing
    is only supported for certain sorts of jobs, and
    on certain platforms (most notably, it is not
    supported under Windows).
  • When the job finishes, the results are returned
    to the submit host (unless a shared filesystem is
    in use between submit and execute hosts).

16
Let's see that in a diagram
[Diagram of the job flow, with the following labels]
  • Central Manager: Condor daemons (normally listen
    on ports 9614 and 9618).
  • Execute Host and Central Manager: Execute Host
    tells Central Manager about itself. Central
    Manager tells it when to accept a job from
    Submit Host.
  • Submit Host and Central Manager: Submit Host
    tells Central Manager about job. Central Manager
    tells it to which Execute Host it should send
    job.
  • Submit Host and Execute Host: send job to
    Execute Host; send results to Submit Host.
  • Execute Host: Condor daemons. Spawns job and
    signals it when to abort, suspend, or checkpoint.
  • Submit Host: Condor daemons and condor_shadow
    process.
  • User's job: user's executable code and Condor
    libraries. All system calls performed as remote
    procedure calls back to Submit Host.
  • Checkpoint file (if any) is saved to disk.

17
Types of Jobs
  • Condor classifies jobs according to the type of
    environment it provides for them to run in. Each
    of these environments is called a universe.
    There are currently 7 different universes, as
    outlined on the next three slides.
  • Not all universes exist for all platforms; in
    particular, only the vanilla universe is
    supported for Windows platforms.

18
Job Universes (1)
  • Standard: For jobs compiled with the Condor
    libraries; this universe provides checkpointing
    and remote system calls. Jobs must be
    single-threaded and use a supported compiler.
    This universe does not exist under Windows.
  • Vanilla: For jobs which cannot be compiled with
    the Condor libraries and for shell scripts and
    (Windows) batch files. For these jobs Condor
    simply spawns a process to run the given
    executable, shell script or batch file. No
    checkpointing or remote system calls are provided
    by Condor for jobs in this universe.

19
Job Universes (2)
  • PVM: For programs written to the Parallel Virtual
    Machine interface.
  • MPI: For programs written to the MPICH interface,
    i.e. MPI jobs. Currently supports MPICH versions
    1.2.2, 1.2.3 and 1.2.4.
  • Globus: This is simply a mechanism to submit jobs
    to resources managed by the Globus Toolkit 2.2 or
    higher.

20
Job Universes (3)
  • Java: For jobs written for the Java Virtual
    Machine (JVM). All JVMs should be supported by
    this universe.
  • Scheduler: For special circumstances such as for
    the Condor workflow tool DAGMan. Scheduler
    universe jobs ignore any machine requirements
    given, run on the submit host immediately upon
    submission, and will never be preempted.
    Normally an end-user would never explicitly use
    this universe.

21
Some Features of note (1)
  • DAGMan: The Directed Acyclic Graph Manager
    (DAGMan) is a meta-scheduler or workflow tool for
    Condor. It handles running sequences of jobs
    where there are dependencies between the jobs
    (e.g. one job has to finish before another can
    start). A directed acyclic graph (DAG) is a way
    of representing such sequences of jobs.
  • Flexible resource management: Condor's ClassAd
    mechanism provides great flexibility in managing
    resources. Resources can specify what sorts of
    jobs they are prepared to run (e.g. only jobs
    from certain users) as well as advertise unique
    features (e.g. special software installed on that
    resource).
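
To make the DAGMan bullet concrete, here is a small sketch of how a dependency graph determines a valid running order. It uses Python's standard graphlib module rather than DAGMan itself, and the job names are invented for illustration.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical workflow: each job maps to the set of jobs that must
# finish before it can start.
dependencies: Dict[str, Set[str]] = {
    "prepare": set(),
    "simulate_a": {"prepare"},
    "simulate_b": {"prepare"},
    "collate": {"simulate_a", "simulate_b"},
}


def run_order(deps: Dict[str, Set[str]]) -> List[str]:
    """Return one order in which every job follows its prerequisites."""
    return list(TopologicalSorter(deps).static_order())


order = run_order(dependencies)
```

Here "prepare" always comes first and "collate" last; the two simulation jobs do not depend on each other, so a workflow tool is free to run them at the same time.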

22
Some Features of note (2)
  • User priorities: Condor supports a priority
    system for users; by default it implements a
    "fair share" policy (similar to SGE's share
    based policy) so that the user priority for
    users who frequently run jobs gradually gets
    worse until they run fewer jobs, and then their
    priority gradually returns to normal.
  • Job priorities: Condor supports a priority system
    for jobs, independent of its user priority
    system. Job priorities determine which of a
    user's jobs is given priority for appropriate,
    available resources. By default all jobs have
    the same priority.
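
The "gets worse with heavy use, then gradually recovers" behaviour of a fair share policy can be illustrated with a simple decaying average. This is only a sketch of the general idea: the formula, the half_life_intervals parameter and the function name are assumptions made for illustration, not Condor's actual calculation.

```python
def update_priority(priority: float, jobs_running: int,
                    half_life_intervals: float = 4.0) -> float:
    """Move a user's priority value part of the way towards current usage.

    A larger value means a worse priority in this sketch. With no jobs
    running, the value halves every `half_life_intervals` updates.
    """
    beta = 0.5 ** (1.0 / half_life_intervals)
    return beta * priority + (1.0 - beta) * jobs_running


# A user who runs 10 jobs for a long stretch drifts towards a value of 10...
p = 0.0
for _ in range(20):
    p = update_priority(p, jobs_running=10)

# ...and recovers once they stop: after one half-life the value has halved.
q = p
for _ in range(4):
    q = update_priority(q, jobs_running=0)
```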

23
Some Features of note (3)
  • Backfilling: Backfilling is a method by which a
    batch queuing system more efficiently utilises
    resources. If a high priority job cannot run
    because resources it needs are unavailable, the
    scheduler will look for lower priority jobs which
    can run on the currently available resources to
    ensure resources are maximally utilised. Because
    of the way Condor matches jobs and resources, a
    resource will not remain idle if there is some
    job it can run (OK, there are some caveats to
    that statement!), and this results in resources
    being utilised as efficiently as a queuing system
    which supports backfilling.
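
The backfilling idea described above can be sketched as a greedy pass over the jobs in priority order, starting anything that fits in the resources currently free. This Python sketch uses hypothetical names and counts only CPUs; real backfilling schedulers usually also check that the lower priority jobs will not delay the high priority one, which is left out here.

```python
from typing import List, Tuple


def backfill(jobs: List[Tuple[str, int, int]], free_cpus: int) -> List[str]:
    """Start jobs in priority order, skipping any that do not fit.

    jobs: (name, priority, cpus_needed); a higher priority goes first.
    Returns the names of the jobs started, in the order they were started.
    """
    started: List[str] = []
    for name, _priority, cpus_needed in sorted(jobs, key=lambda j: -j[1]):
        if cpus_needed <= free_cpus:
            started.append(name)
            free_cpus -= cpus_needed
    return started


# The high priority job needs 8 CPUs but only 4 are free, so the two
# smaller, lower priority jobs run instead of leaving the CPUs idle.
queue = [("big", 10, 8), ("small1", 5, 2), ("small2", 1, 2)]
started = backfill(queue, free_cpus=4)
```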

24
Some Features of note (4)
  • Job Preemption: Condor supports job preemption;
    it can be configured so that low priority jobs
    already running will be killed or checkpointed,
    and returned to the queue so that higher priority
    jobs can run instead.
  • Computing on Demand: Computing on Demand (COD)
    allows Condor to immediately run short-term jobs
    on instantly available resources (at the expense
    of any other jobs running on those resources) so
    that it can support compute-intensive jobs which
    require interactive response times, e.g. for the
    rendering process of a graphics rendering
    application which waits for user input and then
    renders the chosen image (where the image
    rendering requires a burst of intensive computing
    power).

25
Some Features of note (5)
  • Perl interface: The Condor Perl module provides a
    Perl interface to Condor that allows job
    submission, job monitoring and administration of
    Condor.
  • Management of Globus jobs: Condor-G is Condor's
    interface to Globus and provides many of the
    benefits of Condor's job management system for
    such jobs. In particular, Condor's job
    submission and monitoring tools are much easier
    to use than those provided by the Globus Toolkit.
  • DRMAA Support: The current stable release of
    Condor does not provide support for the
    Distributed Resource Management Application API
    (DRMAA), but the current experimental release
    will, and this will eventually be incorporated in
    the next major stable release.

26
Some Features of note (6)
  • Cygwin: Under Windows, Condor works well with jobs
    that use the Cygwin libraries, although this is
    not an official feature of Condor. This may
    provide a method of using some UNIX or Linux
    programs on Windows execute hosts with Condor.
  • Free: Condor is freely available to anyone. A
    paid support option is available, but is unlikely
    to be necessary except for people with extremely
    unusual environments, or thousands of machines in
    their pool(s).
  • Open Source: Technically, Condor is "open
    source". That is, its source is released under an
    open source license to anyone who can convince
    the Condor Team they should have it (they aren't
    that difficult to convince). Yes, we have it,
    but no, I'm not going to give it to just anyone.

27
Some points worth knowing (1)
  • Condor is useful for much more than just cycle
    scavenging, but its default configuration is one
    designed for such resource harvesting.
  • Condor is extremely happy to work with a
    heterogeneous pool of machines; in particular,
    the central manager can run on a completely
    different platform to any other machine in the
    pool.
  • By default, all the security (including
    encryption) features in Condor are switched off!
  • When used on more than just one machine, Condor
    wants to run its daemons as root (Administrator
    under Windows). This is not compulsory,
    although there are consequences to not running it
    as root that need to be considered.

28
Some points worth knowing (2)
  • Condor is extremely simple to install (normally a
    few minutes per machine) and its installation can
    be automated very simply.
  • However, precisely because it is so flexible and
    powerful, its configuration can be quite complex
    and is not entirely intuitive. Its manual,
    although reasonably comprehensive (except where
    security is concerned), requires considerable
    work before it can be considered helpful.
  • Condor has a very active and friendly user
    community who are usually happy to provide help
    and advice.

29
Some points worth knowing (3)
  • In its default configuration Condor starts a
    negotiation cycle every 300 seconds (5 minutes),
    and it waits for a machine to be idle for 15
    minutes before trying to run a job on it. This
    is why new pool administrators often report that
    it takes Condor up to 20 minutes to run jobs even
    when all machines in the pool are idle; this can
    be easily fixed by adjusting the appropriate
    configuration file parameters.
  • By default Condor will run one job per virtual
    machine. A virtual machine is normally a CPU (so
    dual CPU machines normally count as two
    machines for Condor's purposes). However, if
    you want Condor to run more jobs on a particular
    machine than that machine has CPUs then you can
    configure the machine to claim to have as many
    virtual machines as you wish.
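
The "up to 20 minutes" figure is just the two default delays added together in the worst case: the machine has to be idle for 15 minutes, and it may then have to wait a full 5 minute negotiation cycle. A quick check in Python (the constant and function names are illustrative, not Condor configuration parameter names):

```python
NEGOTIATION_INTERVAL_S = 300   # default: one negotiation cycle every 300 s
IDLE_WAIT_S = 15 * 60          # default: machine must be idle for 15 minutes


def worst_case_start_delay_minutes(negotiation_interval_s: int,
                                   idle_wait_s: int) -> float:
    """Longest wait before a job starts on a pool of otherwise idle machines."""
    return (negotiation_interval_s + idle_wait_s) / 60


if __name__ == "__main__":
    print(worst_case_start_delay_minutes(NEGOTIATION_INTERVAL_S, IDLE_WAIT_S))  # 20.0
```

Shortening either delay in the configuration shortens this worst case accordingly.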

30
Some points worth knowing (4)
  • Much of the communication in Condor uses UDP
    rather than TCP. This may have implications if
    machines running Condor are to communicate across
    firewalls, or on congested networks.
  • Communication in Condor happens on two well-known
    (configurable) ports (9614, 9618) and many, many
    transient ports. You can restrict the range of
    transient ports it uses, but you can't restrict
    it too much, as it won't re-use ports as quickly
    as one might like and so will quickly run out if
    the range is too small (from experimentation, a
    range of 10 ports is definitely too small!).

31
Some points worth knowing (5)
  • Condor uses a very different metaphor to other
    batch queuing systems for describing its
    scheduling process: in Condor there is no notion
    of queues, so it is not usually possible to
    straightforwardly translate the behaviour of
    other scheduling systems into Condor terms.
  • Despite this, I have so far found only one
    significant feature of other batch queuing
    systems that Condor doesn't implement: namely, it
    doesn't support so-called interactive jobs,
    where the input and output of the job are
    directed to a terminal window or console for user
    input and control.

32
Officially Supported Platforms
  • UNIX
  • HP-UX 10.20 (on PA7000 and PA8000)
  • Solaris 2.6, 2.7, 8, 9 (all on SPARC)
  • IRIX 6.5 (on R5000, R8000, R10000)
  • MacOS X (10.2, 10.3) (on PowerPC)
  • Digital Unix 4.0 (on Alpha)
  • Tru64 5.1 (on Alpha)
  • AIX 5.2L (on PowerPC)
  • Linux: Red Hat Linux 7.x (on Alpha, Intel x86
    and Itanium), 8.0 (on Intel x86), 9.0 (on
    Intel x86)
  • Windows: Windows 2000 Professional and
    Server, Windows XP Professional, Windows 2003
    Server

Support is being dropped for the italicised
underlined platforms. Not all the features of
Condor are available on the platforms in blue,
most notably checkpointing and remote system
calls (the standard universe) are not available.
33
But it will also run on
  • Linux
  • Debian GNU/Linux 3.0r2 (woody), and sarge
    (although there are some reported problems with
    compiling user programs against the Condor
    libraries)
  • SuSE Linux 8.1 and 9.0
  • Fedora Core 1 (although there are some reported
    problems)
  • Gentoo Linux
  • Probably also on Windows NT 4.0 (Service Pack 6a)
    and Windows XP Home

34
Official Support is planned for
  • UNIX
  • HP-UX 11.11 (on PA-RISC)
  • Solaris 10
  • FreeBSD (on Intel x86)
  • Linux
  • Fedora Core 2
  • Also support for PowerPC and AMD64 architectures

35
University Computing Service plans for Condor
  • PWF Linux, CamGrid, etc.

36
CS Plans for Condor (1)
  • Deploy Condor across all the CS-owned PWF
    machines, followed by those departments which
    make use of the MCS Service and who wish to join.
    Any user who would normally be eligible for CS
    resources will be able, on application to the CS,
    to use Condor on CS-owned PWF machines.
  • Initially on PWF Linux only, and then expanding
    to include PWF Windows and the PWF Macs
  • Part of the CamGrid project
  • Seeks to deploy a University-wide grid
  • Initially using Condor; a Globus gateway to be
    added later?
  • Wants to use as many PCs as possible, so
    attempting to support as many different
    requirements from resource owners as possible
  • Different deployment models for CS-managed
    resources and non-CS-managed resources

37
CS Plans for Condor (2)
  • Default behaviour of PWF Linux machines
    (rebooting into PWF Windows when idle) will be
    changed
  • Machines with PWF Linux installed will boot into
    PWF Linux when machine is idle at times when the
    room is believed to be largely unused (e.g. at
    night time, outside of the University term).
  • These machines will then reboot into Windows just
    before the room is due to start being seriously
    used by ordinary users (e.g. the next morning).
  • Clarification (added after talk was given): This
    has not been finalised and is currently in
    negotiation.
  • This means that during term, jobs will probably
    only have about 8 hours or so when they can run
    on machines before they are killed by the machine
    rebooting.

38
Condor pool architecture
  • Will use a dedicated machine as the central
    manager (may also act as a Kerberos domain
    controller for machines in this Condor pool)
  • Initially only a single submit host (more added
    as necessary to cope with more execute hosts) and
    1TB of short-term storage will be provided.
    Access to this submit host will initially be via
    SSH only.
  • No checkpoint server will be provided (maybe
    later if there is sufficient demand).

39
CS Condor pool
[Diagram of the planned pool, with the following labels]
  • Central Manager: Condor daemons; Kerberos domain
    controller?
  • Submit Host: SSH access only; 1TB of dedicated
    short-term storage.
  • PWF Linux machines (Execute Hosts).
  • Submit Host tells Central Manager about a job.
    Central Manager tells it to which Execute Host it
    should send job.
  • Each Execute Host tells Central Manager about
    itself. Central Manager tells each Execute Host
    when to accept a job from Submit Host.
  • Submit Host sends job to Execute Host. Execute
    Host returns results to Submit Host.
  • All daemon-daemon (i.e. machine-machine)
    communication in the pool is authenticated via
    Kerberos.

40
Security Configuration
  • Users will have to SSH into the submit host to
    submit jobs and this will also provide user
    authentication (other secure access mechanisms,
    e.g. a web portal over HTTPS, will be added
    later).
  • Within the Condor pool authentication will be
    machine-based and will use Kerberos.
  • Most (probably all) Condor daemons will not run
    as root; the most significant security
    consequence of this for end-users is that any
    files left behind by a Condor job will be
    accessible to the next Condor job that runs on
    that machine. We will probably implement a
    script to clean up after jobs.
  • Network transmissions will not be encrypted.

41
Job Universes supported
  • Initially will only provide official support
    for the vanilla universe and possibly the Java
    universe.
  • Official support for the standard universe will
    be provided in due course.
  • Unlikely to ever provide official support for
    the PVM, Globus, and scheduler universes. In
    fact, probably will not install support for the
    PVM universe. Possible, but not very likely,
    that official support may eventually be
    provided for the MPI universe.

42
Authorisation and Priority
  • As mentioned before, any user normally eligible
    for CS resources will be able, on application to
    the CS, to use Condor on CS-owned PWF machines.
  • At least some Departments want to restrict use of
    Condor on their MCS machines to members of the
    Department and we will support this.
  • We will also support prioritising a Department's
    machines for jobs from members of that Department
    if anyone wants this.
  • For CS-owned PWF machines we will probably use
    Condor's default fair share policy (or a close
    variant) for user priorities.

43
Potential Users
  • Particle Physics (HEP Group at the Cavendish
    Laboratory)
  • Molecular Informatics (Unilever Centre for
    Molecular Informatics, Department of Chemistry):
    so far, will be our most intensive users, with
    plans to analyse the molecular structure of
    100,000 molecules using the General Atomic and
    Molecular Electronic Structure System (GAMESS).
  • Mineral Science (Department of Earth Sciences)
  • Anyone else? Please get in touch!

44
Why Should I Care?
  • Use of Condor amongst scientific researchers is
    rapidly increasing, so, for the science
    departments, chances are someone in your
    department is either already using Condor or will
    soon want to.
  • This means you may be asked to support it, either
    explicitly, or implicitly (by allowing it to be
    used on your network).
  • There are security implications to having Condor
    running on your network; more on this later.

45
So, What Will Condor on the PWF Do For Me?
  • Condor provides a way of increasing the
    utilisation of your institution's computing
    resources at no extra financial cost, and so
    maximising the return on your institution's
    investment in those resources. For some
    Departments it is particularly important that
    they can show they are making maximal use of
    their IT equipment. And if you already use the
    MCS Service, we will set this up for you.
  • Also, if you wanted to have a cluster of Linux
    machines, but didn't want the MCS Service with
    both PWF Windows and PWF Linux, we could offer a
    Linux-only installation with Condor as a
    scheduler for your cluster (provided the way we
    configure Condor and PWF Linux meets your needs)
    at a reduced cost (currently 500 15 per
    machine, per annum).
  • So what more do you want!?!

asks a quizzical kitten
46
What Else Will Condor on the PWF Do For Me?
  • You want more!?! Well...
  • The CS will be providing a significant
    computational resource which will be freely
    available to members of your Department or
    College; it may provide much needed computing
    power for them that they can't get in the
    Department, which may (if you're lucky!) mean
    that they harass you less for more machines, a
    greater share of your cluster(s), etc. So tell
    them about it!
  • We will be a source of expertise in Condor, so we
    will be able to provide some help and advice with
    Condor pools in your institution.

47
Why Should I Worry About Condor?
  • or, The Security Implications of Having a Condor
    Pool on Your Network

48
What Hackers Want
  • Before considering the security implications of a
    system like Condor, it is useful to consider just
    what it is an attacker will be trying to achieve.
    Typically they want to achieve one or more of
    the following overlapping goals:
  • A higher level of access to your system(s) than
    they are permitted (worst case: root compromise)
  • The ability to run arbitrary code on your
    system(s)
  • A denial of service (DoS) attack against you or
    someone who can be attacked from your system(s)
  • If they can achieve either of the first two
    goals, they can often achieve all of them.

49
Condor: Hacker's Dream?
  • Condor can help them achieve all of these
    goals, but particularly the second (ability to
    run arbitrary code).
  • Condor is designed to allow users to run
    arbitrary code (no questions asked). For some
    attackers this will be enough, but others may
    want more, and if there is anything on your
    system that could allow your system to be
    compromised, Condor may well allow an attacker to
    exploit this.
  • By design, Condor allows remote users (who do not
    have an account on your system) to have access to
    your system(s). By default, it allows the whole
    world to have access to your system(s).
  • By default, Condor will want you to install it as
    root, so a vulnerability in Condor could lead to
    a root compromise.
  • Condor-related processes can consume a lot of CPU
    cycles (perfect for a DoS attack against you) and
    there are ways to force some of the Condor
    daemons to do this. Even if the attacker only
    manages to perform a DoS attack against Condor,
    this may cause you significant problems if your
    users make serious use of Condor.

50
And now the bad news
  • No one seems to have seriously attacked Condor,
    so we don't have any idea of how secure it is...
  • ...and because no one has attacked it, or attacked
    using it, many muppets out there will tell you
    that it must be secure because no one has
    managed to use it to compromise anyone (yet).
  • It is being increasingly deployed over very large
    collections of machines (800 at UCL alone) which
    will make it a very attractive target.
  • As previously mentioned, by default Condor wants
    to be installed as root, has all security
    features (authentication and encryption)
    disabled, and gives everyone in the whole world
    access to your machine. Go, evil hacker, go!
  • We don't yet have any way of remotely probing
    machines to see if they are running Condor.

51
Authentication in Condor
  • Condor supports the authentication methods
    described on the following four slides.
  • None of the so-called "strong" authentication
    methods are supported on all platforms, so
    heterogeneous Condor pools may well have problems
    here.
  • User authentication and machine (i.e. daemon)
    authentication are controlled separately.
  • Authentication and encryption are implemented and
    controlled separately.
  • Authentication and encryption can be set to one
    of NEVER, OPTIONAL, PREFERRED or REQUIRED, with
    the obvious meanings.
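
Since both ends of a connection have their own setting, the four levels have to be reconciled somehow. The sketch below shows one plausible reading of the "obvious meanings" in Python; the rule table is an assumption made for illustration and should be checked against the Condor manual before relying on it.

```python
from enum import Enum


class Level(Enum):
    NEVER = "NEVER"
    OPTIONAL = "OPTIONAL"
    PREFERRED = "PREFERRED"
    REQUIRED = "REQUIRED"


def reconcile(client: Level, daemon: Level) -> bool:
    """Decide whether a feature (authentication or encryption) is used.

    Assumed rules: REQUIRED against NEVER cannot be satisfied; REQUIRED
    otherwise forces the feature on; NEVER otherwise forces it off; of
    the remaining cases it is used only if at least one side prefers it.
    """
    pair = {client, daemon}
    if pair == {Level.REQUIRED, Level.NEVER}:
        raise ValueError("one side requires the feature, the other forbids it")
    if Level.REQUIRED in pair:
        return True
    if Level.NEVER in pair:
        return False
    return Level.PREFERRED in pair
```

Under these assumed rules, two sides that are both set to OPTIONAL end up not using the feature at all, which is worth bearing in mind when choosing settings.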

52
Strong Authentication Methods in Condor (1)
  • Kerberos: Uses the MIT implementation of Kerberos
    V5. Only supported under Linux and (I believe)
    most versions of UNIX, excluding MacOS X.
    Support for Kerberos under Windows is due to be
    added in the current development release of
    Condor.
  • Windows (NTSSPI) Authentication: This uses
    Microsoft's Security Support Provider Interface
    (SSPI) to enforce NT LAN Manager (NTLM)
    authentication, which is based on challenge and
    response, using the user's password as a key.
    NTLM authentication apparently bears some
    similarity to Kerberos. (Obviously) only
    available under Windows.

53
Strong Authentication Methods in Condor (2)
  • File System Authentication: Utilises file
    ownership to verify identity. The authenticating
    daemon requires the party to be authenticated to
    write a file to a specific location and then
    checks the ownership of this file. This is only
    available on non-Windows platforms. There are
    only specific circumstances under which one
    should regard this as strong authentication.
  • Remote File System Authentication: Utilises file
    ownership on a remote filesystem to verify
    identity. The authenticating daemon requires the
    party to be authenticated to write a file to a
    specific location on a remote filesystem and then
    checks the ownership of this file. This is only
    available on non-Windows platforms. This is an
    undocumented authentication mechanism. And
    again, there are only specific circumstances
    under which one should regard this as strong
    authentication.
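
The write-a-file-and-check-ownership exchange described above can be sketched in a few lines of Python. This is an illustration of the idea only, not Condor's implementation: the function names, the random file name and the use of a temporary directory as the agreed location are all assumptions, and the sketch needs a POSIX system.

```python
import os
import secrets
import tempfile


def issue_challenge(directory: str) -> str:
    """Authenticating side: pick an unpredictable file name to be created."""
    return os.path.join(directory, "fs_auth_" + secrets.token_hex(8))


def answer_challenge(path: str) -> None:
    """Party being authenticated: create the requested file.

    O_EXCL makes the call fail if something already exists at that path.
    """
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    os.close(fd)


def verify_challenge(path: str) -> int:
    """Authenticating side: return the UID that owns the file, then remove it."""
    info = os.lstat(path)  # lstat, so a symbolic link is not followed
    os.unlink(path)
    return info.st_uid


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as agreed_location:
        challenge = issue_challenge(agreed_location)
        answer_challenge(challenge)
        print("authenticated as uid", verify_challenge(challenge))
```

The result is only as trustworthy as the file ownership reported by the filesystem in question, which is presumably why this counts as strong authentication only in specific circumstances.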

54
Strong Authentication Methods in Condor (3)
  • GSI: GSI is the Grid Security Infrastructure,
    developed by the Globus Alliance. It is a PKI
    which uses X.509 digital certificates, and is so
    appallingly implemented that my little sister's
    cat could probably break it. The Globus Alliance
    have yet to produce a stable implementation of
    GSI under Windows so Condor only supports GSI
    under Linux (and possibly some versions of UNIX?).

55
Other Authentication Methods in Condor
  • IP/Host-Based Security: This form of security is
    now considered outdated but remains available.
    It allows or denies access based on the IP
    address or DNS name of the remote machine.
  • Claim To Be Authentication: Accept whatever
    identity is presented by the client, i.e. no
    authentication. Normally used for testing
    purposes.
  • Anonymous Authentication: Skip authentication
    checks, i.e. no authentication. Normally used
    for testing purposes.
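
As an illustration of IP-based allow/deny decisions, here is a minimal Python sketch using the standard ipaddress module. It does not use Condor's configuration syntax; the function name, the example networks (taken from the ranges reserved for documentation) and the rule that a deny entry overrides an allow entry are all assumptions.

```python
import ipaddress


def is_allowed(remote_ip: str, allow: list[str], deny: list[str]) -> bool:
    """Decide access purely from the remote machine's IP address.

    Assumed policy: any matching deny entry wins; otherwise the address
    must match at least one allow entry.
    """
    addr = ipaddress.ip_address(remote_ip)
    if any(addr in ipaddress.ip_network(net) for net in deny):
        return False
    return any(addr in ipaddress.ip_network(net) for net in allow)


if __name__ == "__main__":
    allow = ["192.0.2.0/24"]   # hypothetical pool network
    deny = ["192.0.2.13/32"]   # hypothetical banned machine
    for ip in ("192.0.2.7", "192.0.2.13", "198.51.100.1"):
        print(ip, is_allowed(ip, allow, deny))
```

Note that a check like this says something about where a connection appears to come from, not about who is making it.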

56
Other Security Considerations
  • Condor will, at least in some circumstances,
    overwrite existing files if they have the same
    name as the output file(s) produced by the job.
    Also, output files can have any legal file name
    the job chooses. The inventive amongst you can
    probably see how these (particularly in
    combination) could lead to problems.
  • It is possible to submit a Condor job which will
    spawn a process that does not die or get killed
    when the job completes or when Condor terminates
    it. Apparently Condor's behaviour in this regard
    will improve under the 2.6 kernel (that is, when
    Condor officially supports this kernel).
  • There exists a trivial DoS attack against Condor
    that will cripple the central manager, and anyone
    who can submit jobs to a Condor pool can carry it
    out. This will eventually be addressed in a
    future version of Condor.

57
And now
California Condor by Jamie Spangler
58
What I haven't discussed
  • How Condor compares to batch queuing systems and
    other batch scheduling systems such as PBS, LSF
    and Sun Grid Engine (now N1 Grid Engine)
  • The roles of the different Condor daemons
  • Consequences of Condor daemons not running as
    root: some of these consequences require a more
    detailed knowledge of how Condor works than I
    have given here, so if you really need to know,
    ask me privately or consult the Condor manual,
    Sections 3.7.1.1 and 3.7.2.
  • Compiling code against the Condor libraries for
    the standard universe
  • The intricacies of job submission: this is a
    seminar in itself. Consult the Condor manual,
    Sections 2.4 and 2.5 and the condor_submit
    section in Section 9, and then ask (nicely)
    myself or Mark Calleja (Earth Sciences) for
    guidance.
  • Job monitoring

59
References: Condor Project and documentation
  • Condor Project
  • http://www.cs.wisc.edu/condor/
  • Condor Manual
  • Current Stable Release
  • http://www.cs.wisc.edu/condor/manual/v6.6
  • Current Development Release
  • http://www.cs.wisc.edu/condor/manual/v6.7
  • Previous Stable Release
  • http://www.cs.wisc.edu/condor/manual/v6.4

60
References: other users of Condor
  • Testimonials from Condor users
  • http://www.cs.wisc.edu/condor/wts/stories.html
  • UCL Condor
  • http://grid.ucl.ac.uk/Condor.html
  • eMinerals minigrid (Department of Earth
    Sciences)
  • http://www.esc.cam.ac.uk/mcal00/grid.html
  • Southampton University Computing Services
    Windows Condor Pilot Service
  • http://www.iss.soton.ac.uk/research/e-science/condor/
  • University of Reading Department of Meteorology
    GRID
  • http://www.met.rdg.ac.uk/swsellis/system/grid/
  • Condor at the University of Essex
  • http://cswww.essex.ac.uk/intranet/students/TechnicalGroup/TechnicalHelp/condor.htm

61
References: security and firewalls
  • Condor v6.6 Manual, Section 3.7: Security In
    Condor
  • http://www.cs.wisc.edu/condor/manual/v6.6/3_7Security_In.html
  • MIT Kerberos
  • http://web.mit.edu/kerberos/www/
  • Microsoft's Security Support Provider Interface
  • http://msdn.microsoft.com/library/en-us/security/security/sspi.asp
  • Grid Security Infrastructure (GSI)
  • http://www-unix.globus.org/toolkit/docs/3.2/gsi/index.html
  • http://www.globus.org/security/v2.0/
  • Documentation from the CamGrid project on Condor
    and firewalls
  • http://www.escience.cam.ac.uk/projects/camgrid/documentation.html

62
References: other schedulers
  • Sun Grid Engine (now known as N1 Grid Engine)
  • http://wwws.sun.com/software/gridware/
  • http://www.sun.com/products-n-solutions/hardware/docs/Software/Sun_Grid_Engine/
  • Portable Batch System (PBS)
  • http://www.openpbs.org/
  • http://www.pbspro.com/
  • Platform LSF
  • http://www.platform.com/products/LSF/

63
Contacts
  • Myself: Bruce Beckles, e-Science Specialist
  • condor-support_at_ucs.cam.ac.uk
  • mbb10_at_cam.ac.uk
  • Mark Calleja, Department of Earth Sciences
  • Probably the most experienced user of Condor in
    the University.

64
Questions?