Title: Farming with Condor
1 Farming with Condor
- Douglas Thain
- thain_at_cs.wisc.edu
- INFN Bologna, December 2001
2 Outline
- Introduction
- What is Condor? Why Condor on the Farm?
- Components
- Daemons, pools, flocks, ClassAds
- Short Example
- Executing 1000 jobs.
- Complications
- Firewalls, security, etc
3 The Condor Project (Est. 1985)
- Distributed systems CS research performed by a team that faces:
- software engineering challenges in a UNIX/Linux/NT environment,
- active interaction with users and collaborators,
- daily maintenance and support challenges of a distributed production environment,
- and educating and training students.
- Funding: NSF, NASA, DoE, DoD, IBM, Intel, Microsoft, and the UW Graduate School
4 A Bird of Opportunity
[Figure labels: timeline of a desktop machine marked Busy, Idle, and Job]
Over the course of a week, 80% of a desktop machine's time is wasted.
5 The Condor Principle
The owner is absolutely in charge!
The Condor Corollary
The visitor must be prepared for the unexpected!
6 Tricky Details
- What if the user returns?
- Checkpoint the job periodically.
- Restart the job elsewhere from a checkpoint.
- What if the machine does not have your files?
- Perform I/O via Remote System Calls.
- These two features require that you link with the Condor C library.
- Can't relink? You may still use Condor, but with some loss of opportunities.
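The checkpoint-and-restart idea above can be illustrated with a small application-level sketch. Condor's own checkpointing is transparent and comes from linking with the Condor C library; the code below is only an analogue, in which a loop saves its state to a file and resumes from it. The function name `run_with_checkpoints`, the checkpoint file format (a pickled dict), and the toy workload are all assumptions made for illustration.

```python
import os
import pickle


def run_with_checkpoints(total_steps: int, checkpoint_path: str, every: int = 100) -> int:
    """Sum the integers 0..total_steps-1, saving progress every `every` steps.

    If `checkpoint_path` already exists, the loop resumes from the saved
    state instead of starting over (the "restart elsewhere" case).
    """
    state = {"step": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            state = pickle.load(f)  # restart from the last checkpoint

    while state["step"] < total_steps:
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % every == 0:
            # Write to a temporary file first so that a crash in the middle
            # of a write cannot corrupt the previous checkpoint.
            tmp_path = checkpoint_path + ".tmp"
            with open(tmp_path, "wb") as f:
                pickle.dump(state, f)
            os.replace(tmp_path, checkpoint_path)
    return state["total"]
```

If the process is killed between checkpoints, at most `every` steps of work are repeated on the next run, which is the trade-off behind checkpointing periodically rather than continuously.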
7 Checkpointing
[Figure labels: Job, Checkpoint, Restart, Job]
8 Remote System Calls
Just like home!
[Figure labels: Job, Remote System Calls, Shadow, Disk]
9 The INFN Condor Pool
10 Top 10 Condor Pools
226 Condor Pools
5576 Condor Hosts
11 Back to the Farm
- The cluster is the new engine of scientific computing.
- Inexpensive to:
- procure
- expand
- repair
12 The Ideal Cluster
- The ideal cluster has every node identical, in every way:
- CPU
- Memory
- File system
- User accounts
- Software installation
- Users expect to be able to execute on any node.
- Some models (MPI) require perfectly matched nodes.
13 The Bad News
- Keeping the entire cluster available for use is very difficult when users expect complete symmetry!
- Software failures
- Full disk, wild process, etc.
- Hardware failures
- Replace with exact match? (not the best buy)
- Replace with better hardware? (goes unused)
- Much better to query rather than assume the state of the cluster.
14 High Throughput Computing is a 24-7-365 activity.
FLOPY ≠ (60 × 60 × 24 × 7 × 52) × FLOPS
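The arithmetic behind this slide can be worked through in a few lines. The sketch below computes the number of seconds in the slide's 52-week year and compares the nominal yearly operation count with what is delivered when a machine does useful work only part of the time. The function name `flopy`, the 1 GFLOPS peak rate, and the 60% availability figure are illustrative assumptions, not measurements.

```python
# Seconds in a year as written on the slide:
# 60 s * 60 min * 24 h * 7 days * 52 weeks.
SECONDS_PER_YEAR = 60 * 60 * 24 * 7 * 52


def flopy(peak_flops: float, availability: float) -> float:
    """Floating-point operations actually delivered over one year.

    peak_flops   -- peak rate in operations per second (hypothetical figure)
    availability -- fraction of the year spent doing useful work (0.0 to 1.0)
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return peak_flops * SECONDS_PER_YEAR * availability


if __name__ == "__main__":
    peak = 1e9  # assumed 1 GFLOPS machine
    print("seconds per year:", SECONDS_PER_YEAR)
    print("nominal FLOPY   :", flopy(peak, 1.0))
    print("delivered FLOPY :", flopy(peak, 0.6))  # assumed 60% availability
```

The nominal figure is reached only at 100% availability, which is why sustained throughput over the year matters more than peak speed.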
15 Why Condor on the Farm?
- Condor is expert at managing very heterogeneous resources for high-throughput computing.
- Large clusters, despite our best efforts, will always be slightly heterogeneous.
- (It may not be in your financial interest to keep them perfectly homogeneous.)
- Condor assists users in making progress, despite the imperfections of the cluster.
- Few users require the whole identical cluster.
- The pursuit of cluster perfection is then an issue of small throughput improvement, rather than 0 or max.
16 Basic HTC Mechanisms
- Matchmaking - enables requests for services and offers to provide services to find each other (ClassAds).
- Persistence - records are kept in stable storage -- any component may crash and reboot.
- Asynchronous API - enables management of dynamic (opportunistic) resources.
- Checkpointing - enables preemptive-resume scheduling (go ahead and use it as long as it is available!).
- Remote I/O - enables remote (from execution site) access to local (at submission site) data.
17 City Bird, Country Farm
- The lessons learned and techniques used in stealing cycles from workstations are just as important when trying to maximize the throughput of a homogeneous cluster.
18 Outline
- Introduction
- What is Condor? Why Condor on the Farm?
- Components
- Daemons, pools, flocks, ClassAds
- Short Example
- Executing 1000 jobs.
- Complications
- Firewalls, security, etc
19 Components
- Condor can be quite complicated
- Many daemons, many connections, many logs...
- The complexity is necessary and desirable
- Each process represents an independent interest
- Machine requirements (startd)
- User requirements (schedd)
- System requirements (central manager)
- Explain the structure by working from the bottom
up.
20 A Single Machine
[Figure labels: administrator email, condor_startd, local policy file, CPU, RAM, disk, keyboard]
Local policy: Only run jobs submitted from Bologna or Milan. Prefer jobs owned by thain. Evict jobs that don't fit in memory.
21 A Single Pool
[Figure label, shown twice: Machine state and policy.]
Global Policy: All things being equal, Bologna gets 2x as many machines as Milan.
22 A Typical Pool
[Figure labels: condor_startd, CPU, RAM, disk, NFS / AFS Server]
Global Policy: All things being equal, Bologna gets 2x as many machines as Milan.
Uniform Local Policy: All machines except 3 prefer mazzanti.
23 Schedulers
24 Multiple Pools
25 Matchmaking
- Each Central Manager is an introduction service that matches compatible machines and jobs.
- A simple language (ClassAds) is used to represent everyone's needs and desires.
- The match is not a binding contract -- each side is responsible for enforcing its own needs.
- If a central manager crashes, jobs will continue to run, but no further introductions are made.
26 ClassAd Example
- Job Ad
- Type = "Job"
- Cmd = "cmsim.exe"
- Owner = "thain"
- Requirements = (OpSys == "LINUX") && (Memory > 128)
- Machine Ad
- Type = "Machine"
- Name = "vulture"
- OpSys = "LINUX"
- Memory = 256
- Requirements = (Owner == "thain")
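The two ads above match because each side's requirements are satisfied by the other side's attributes. A minimal sketch of that symmetric check follows. It is not Condor's ClassAd evaluator: the ads are modeled as plain Python dicts, the requirements as hand-written predicates, and the function name `matches` is hypothetical.

```python
# Each ad is a dict of attributes plus a "Requirements" predicate that is
# evaluated against the *other* party's ad.
job_ad = {
    "Type": "Job",
    "Cmd": "cmsim.exe",
    "Owner": "thain",
    "Requirements": lambda other: other.get("OpSys") == "LINUX"
    and other.get("Memory", 0) > 128,
}

machine_ad = {
    "Type": "Machine",
    "Name": "vulture",
    "OpSys": "LINUX",
    "Memory": 256,
    "Requirements": lambda other: other.get("Owner") == "thain",
}


def matches(ad_a: dict, ad_b: dict) -> bool:
    """Return True only if each ad's Requirements accepts the other ad."""
    return bool(ad_a["Requirements"](ad_b)) and bool(ad_b["Requirements"](ad_a))


if __name__ == "__main__":
    print(matches(job_ad, machine_ad))
```

Because the check is symmetric, a machine with too little memory is rejected by the job, and a job from another owner is rejected by the machine, without either side needing to know the other's policy in advance.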
27 Matchmaking with ClassAds
[Figure labels: Central Manager, Startd, Schedd]
28 Placement vs. Scheduling
- A Condor Central Manager suggests the placement of jobs on machines, with the understanding that all jobs are ready to run.
- A Condor scheduler is responsible for executing a list of jobs with various requirements. It may order jobs according to the user's requests.
- Neither component plans ahead to make a schedule or a reservation for execution -- it is assumed that change is so frequent that schedules are not useful.
29 Can we Schedule?
- Of course, scheduling is important for users that have strict time constraints.
- Scheduling is more important to High-Performance Computing (HPC) than to High-Throughput Computing (HTC).
- Scheduling requirements may be worked into Condor in one of two ways:
- 1 - Users may share a single submission point.
- 2 - The administrator may periodically reconfigure policy according to a schedule established elsewhere.
30 Scheduling
Method 1: All users share a schedd.
31 Outline
- Introduction
- What is Condor? Why Condor on the Farm?
- Components
- Daemons, pools, flocks, ClassAds
- Short Example
- Executing 1000 jobs.
- Complications
- Firewalls, security, etc
32 How Many Machines?
- condor_status
Name           OpSys        Arch   State      Activity  LoadAv  Mem
lxpc1.na.infn  LINUX-GLIBC  INTEL  Unclaimed  Idle      0.000   30
axpd21.pd.inf  OSF1         ALPHA  Owner      Idle      0.266   96
vlsi11.pd.inf  SOLARIS26    SUN4u  Claimed    Busy      0.000   256
. . .

                   Machines  Owner  Claimed  Unclaimed  Matched  Preempting
ALPHA/OSF1              115     67       46          1        0           1
INTEL/LINUX              53     18        0         35        0           0
INTEL/LINUX-GLIBC        16      7        0          9        0           0
SUN4u/SOLARIS251          1      1        0          0        0           0
SUN4u/SOLARIS26           6      2        0          4        0           0
SUN4u/SOLARIS27           1      1        0          0        0           0
SUN4x/SOLARIS26           2      1        0          1        0           0
33 Submit the Job
- Create a submit file
- vi sim.submit
- Submit the job
- condor_submit sim.submit
Contents of sim.submit:
Executable = sim
Input = sim.in
Output = sim.out
Log = sim.log
Queue
34 Watch the Progress
- condor_q
-- Submitter: axpbo8.bo.infn.it : <131.154.10.29:1038>
ID    OWNER  SUBMITTED    RUN_TIME    ST  PRI  SIZE  CMD
5.0   thain  6/21 12:40   0+00:00:15  R   0    2.5   sim.exe
ID: Each job gets a unique number.
ST: Status (Unexpanded, Running, or Idle).
SIZE: Size of program image (MB).
35 Receive E-mail When Done
- This is an automated email from the Condor system
- on machine "axpbo8.bo.infn.it". Do not reply.
- Your condor job
- /tmp_mnt/usr/users/ccl/thain/test/sim 40
- exited with status 0.
- Submitted at: Wed Jun 21 14:24:42 2000
- Completed at: Wed Jun 21 14:36:36 2000
- Real Time: 0 00:11:54
- Run Time: 0 00:06:52
- Committed Time: 0 00:01:37
- . . .
36 Running Many Processes
- The real benefit of Condor comes from managing 1000s of jobs.
- First, get organized. Write a script to make 1000 input files.
- Now, simply adjust your submit file:
Executable = sim.exe
Input = sim.in.$(PROCESS)
Output = sim.out.$(PROCESS)
Log = sim.log
Queue 1000
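The "write a script to make 1000 input files" step might look like the sketch below. It creates files named sim.in.0 through sim.in.999, one per process number, since Condor numbers the processes in a cluster starting from 0. The function name `make_inputs` and the file contents (a single "seed" line) are placeholders; a real simulation would write whatever parameters it expects.

```python
from pathlib import Path


def make_inputs(directory: str, count: int = 1000) -> list:
    """Create `count` input files named sim.in.0 ... sim.in.<count-1>.

    Returns the list of paths that were written. The "seed N" content is a
    hypothetical stand-in for real simulation parameters.
    """
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for process in range(count):
        path = out_dir / f"sim.in.{process}"
        path.write_text(f"seed {process}\n")
        paths.append(path)
    return paths
```

Calling `make_inputs(".", 1000)` in the submit directory would produce one input file for each of the 1000 queued processes.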
37 What can go wrong?
- If an execution site crashes
- Your job will restart elsewhere.
- If the central manager crashes
- Jobs will continue to run, but no new matches will be made.
- If the submit machine crashes
- Jobs will stop, but will be restarted when it reboots.
- The only way to lose a job is to throw away the disk on the submit machine!
38 Outline
- Introduction
- What is Condor? Why Condor on the Farm?
- Components
- Daemons, pools, flocks, ClassAds
- Short Example
- Executing 1000 jobs.
- Complications
- Firewalls, security, etc
39 Firewalls
- Why a firewall?
- Prevent all outside contact.
- Prevent non-approved contact.
- Carefully securing every node is too much work.
- What's the problem?
- A variety of processes make up Condor.
- A variety of ports must be used at once.
- Submit and execute machines must communicate directly, not through the CM.
40 The Firewall Problem
[Figure labels: Public Network, Firewall, Private Network]
41 Firewall Solution 1
Allow ports 1000-1010.
[Figure labels: Public Network, Firewall, Private Network]
42 Firewall Solution 1
- Pros
- Easy to configure Condor.
- Easy to configure firewall.
- Machines remain a part of the pool.
- Cons
- Number of ports limits the number of simultaneous interactions with the node (running jobs, queue ops, negotiations, etc.).
- More ports mean more connections and less security.
43 Firewall Solution 2
[Figure labels: Private Network, Firewall, condor_schedd, Public Network, ssh]
44 Firewall Solution 2
- Pros
- Only port through router is ssh.
- Cons
- Pool is partitioned!
- Users must manually submit to every pool that is behind a firewall. (I.e., they won't.)
- No global policy possible.
- No global management/status possible.
45 Network Address Translation
- Both solutions only work as long as the firewall simply drops packets it doesn't like.
- If the firewall is a Network Address Translator (masquerade), then only solution 2 works.
- Research in progress: a Condor NAT that runs on the firewall and exports the pool to the outside world.
46 Security
- Current Condor security
- Authenticate via DNS.
- Authorize classes of hosts for certain tasks.
- New Condor (6.3.X?) security
- Authenticate with encrypted credentials.
- Authorize on a per-user basis.
- Forward credentials to necessary sites.
47 Condor 6.2 Security
- Authentication: DNS is queried for each incoming connection in order to determine the name.
- Authorization: Each participant permits a class of hosts to perform certain tasks. At UW-CS:
- HOSTALLOW_READ = *.wisc.edu, *.infn.it
- Hosts that may query the machine state.
- HOSTALLOW_WRITE = *.cs.wisc.edu, *.infn.it
- Hosts that may execute jobs, send updates, etc.
- HOSTALLOW_OWNER = $(FULL_HOSTNAME)
- Hosts that may cause this machine to vacate.
- HOSTALLOW_ADMINISTRATOR = condor.cs.wisc.edu
- Hosts that may change priorities, turn Condor on/off.
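The host-based authorization described on this slide amounts to matching a resolved hostname against a list of patterns for each access level. The sketch below illustrates that idea only; it is not Condor's implementation. The table name `HOSTALLOW`, the function `is_allowed`, and the assumption that domain entries are shell-style wildcard patterns (such as `*.infn.it`) are all hypothetical.

```python
from fnmatch import fnmatch

# Hypothetical policy table mirroring the slide's settings. The leading "*"
# wildcards are an assumption about how the domain patterns are written.
HOSTALLOW = {
    "READ": ["*.wisc.edu", "*.infn.it"],
    "WRITE": ["*.cs.wisc.edu", "*.infn.it"],
    "ADMINISTRATOR": ["condor.cs.wisc.edu"],
}


def is_allowed(level: str, hostname: str) -> bool:
    """Return True if `hostname` matches any pattern listed for `level`.

    Unknown levels have no patterns, so they deny every host.
    """
    host = hostname.lower()
    return any(fnmatch(host, pattern) for pattern in HOSTALLOW.get(level, []))


if __name__ == "__main__":
    print(is_allowed("READ", "axpbo8.bo.infn.it"))
    print(is_allowed("ADMINISTRATOR", "axpbo8.bo.infn.it"))
```

A scheme like this is only as trustworthy as the name lookup behind it, which is the limitation the newer credential-based security is meant to address.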
48 Condor 6.3.X? Security
- Principle: No single security mechanism is appropriate for all sites. Condor must have many tools.
- United States Air Force
- Kerberos authentication, all connections encrypted
- Cluster behind a firewall
- Host authentication, no encryption
- Grid Computing
- GSI credentials from certain authorities, encryption is up to the user.
49 Condor 6.3.X Security
[Figure labels: Submit, Execute, Data storage, Disk, I/O, GSI?, KRB 5?, YES!, NO, FORWARD CERT]
50 You don't have to be a super person to do super computing!
51 Getting Condor
- Condor Home Page
- http://www.cs.wisc.edu
- Binaries are freely available.
- Versions
- 6.2.x - Stable releases, bug fixes only
- 6.3.x - Development releases
52 For More Info
- Condor Home Page
- http://www.cs.wisc.edu/condor
- These slides
- http://www.cs.wisc.edu/thain
- Douglas Thain
- thain_at_cs.wisc.edu
- Questions Now?