Farming with Condor - PowerPoint PPT Presentation

1
Farming with Condor
  • Douglas Thain
  • thain_at_cs.wisc.edu
  • INFN Bologna, December 2001

2
Outline
  • Introduction
  • What is Condor? Why Condor on the Farm?
  • Components
  • Daemons, pools, flocks, ClassAds
  • Short Example
  • Executing 1000 jobs.
  • Complications
  • Firewalls, security, etc

3
The Condor Project (Est. 1985)
  • Distributed systems CS research performed by a
    team that faces
  • software engineering challenges in a
    UNIX/Linux/NT environment,
  • active interaction with users and collaborators,
  • daily maintenance and support challenges of a
    distributed production environment,
  • and educating and training students.
  • Funding -
  • NSF, NASA, DoE, DoD, IBM, INTEL,
  • Microsoft and the UW Graduate School

4
A Bird of Opportunity
(Timeline figure: a desktop machine alternates between Busy and Idle periods; a Job runs in the Idle periods.)
Over the course of a week, 80% of a desktop
machine's time is wasted.
5
The Condor Principle
The owner is absolutely in charge!
The Condor Corollary
The visitor must be prepared for the unexpected!
6
Tricky Details
  • What if the user returns?
  • Checkpoint the job periodically.
  • Restart the job elsewhere from a checkpoint.
  • What if the machine does not have your files?
  • Perform I/O via Remote System Calls
  • These two features require that you link with the
    Condor C library.
  • Can't relink? You may still use Condor, but with
    some loss in opportunities.
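
The checkpoint-and-restart idea above can be illustrated with a minimal application-level sketch. This is not how Condor does it: Condor checkpoints the whole process image through its C library, whereas the hypothetical `run_with_checkpoints` function below saves its own loop state to a file. The function name, file format, and `stop_after` eviction switch are assumptions made for illustration.

```python
import json
import os


def run_with_checkpoints(total_steps, ckpt_path, interval=100, stop_after=None):
    """Sum the integers 0..total_steps-1, saving progress every `interval` steps.

    `stop_after` simulates the owner returning: the function gives up
    (returns None) after performing that many steps in this run.
    """
    # Resume from the last checkpoint if one exists.
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "total": 0}

    done_this_run = 0
    while state["step"] < total_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return None  # evicted: work since the last checkpoint is lost
        state["total"] += state["step"]
        state["step"] += 1
        done_this_run += 1
        if state["step"] % interval == 0:
            # Write to a temporary file and rename it, so a crash during
            # the write never leaves a corrupt checkpoint behind.
            tmp = ckpt_path + ".tmp"
            with open(tmp, "w") as f:
                json.dump(state, f)
            os.replace(tmp, ckpt_path)
    return state["total"]
```

If the first run is evicted after 250 steps, the checkpoint on disk holds step 200, and a second call with the same checkpoint path (possibly on another machine that can see the file) redoes only the 50 lost steps before finishing.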

7
Checkpointing
Job
Checkpoint
Restart
Job
8
Remote System Calls
Just like home!
Job
Shadow
Remote System Calls
Disk
9
The INFN Condor Pool
10
Top 10 Condor Pools
226 Condor Pools
5576 Condor Hosts
11
Back to the Farm
  • The cluster is the new engine of scientific
    computing.
  • Inexpensive to
  • procure
  • expand
  • repair

12
The Ideal Cluster
  • The ideal cluster has every node identical, in
    every way
  • CPU
  • Memory
  • File system
  • User accounts
  • Software installation
  • Users expect to be able to execute on any node.
  • Some models (MPI) require perfectly matched nodes.

13
The Bad News
  • Keeping the entire cluster available for use is
    very difficult, when users expect complete
    symmetry!
  • Software failures
  • Full disk, wild process, etc...
  • Hardware failures
  • Replace with exact match? (not best buy)
  • Replace with better hardware? (goes unused)
  • It is much better to query the state of the
    cluster than to assume it.

14
High Throughput Computing is a 24-7-365
activity.
FLOPY ≠ (60*60*24*7*52)*FLOPS
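
The slide's point is that floating-point operations per year (FLOPY) are not simply peak FLOPS multiplied by the seconds in a year. A small sketch of the arithmetic, assuming a hypothetical `availability` factor (the fraction of the year a machine actually spends on your work) that is not part of the original slide:

```python
# Seconds in a 52-week year: the factor that appears on the slide.
SECONDS_PER_YEAR = 60 * 60 * 24 * 7 * 52


def flopy(peak_flops, availability):
    """Floating-point operations actually delivered in one year.

    `availability` is an illustrative parameter: 1.0 means the machine
    computes for you every second of the year, 0.5 means half the time.
    """
    return SECONDS_PER_YEAR * peak_flops * availability


if __name__ == "__main__":
    print(SECONDS_PER_YEAR)      # 31449600
    print(flopy(1e9, 1.0))       # the naive upper bound for a 1 GFLOPS machine
    print(flopy(1e9, 0.5))       # the same machine, available half the time
```

The naive product is only an upper bound; what a user receives over a year depends on how much of that time the resources are really working for them.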
15
Why Condor on the Farm?
  • Condor is expert at managing very heterogeneous
    resources for high-throughput computing.
  • Large clusters, despite our best efforts, will
    always be slightly heterogeneous.
  • (It may not be in your financial interests to
    keep them perfectly homogeneous.)
  • Condor assists users in making progress, despite
    the imperfections of the cluster.
  • Few users require the whole identical cluster.
  • The pursuit of cluster perfection is then an
    issue of small throughput improvement, rather
    than 0 or max.

16
Basic HTC Mechanisms
  • Matchmaking - enables requests for services and
    offers to provide services to find each other
    (ClassAds).
  • Persistence - records are kept in stable storage
    -- any component may crash and reboot.
  • Asynchronous API - enables management of dynamic
    (opportunistic) resources.
  • Checkpointing - enables preemptive-resume
    scheduling (go ahead and use it as long as it is
    available!).
  • Remote I/O - enables remote (from execution site)
    access to local (at submission site) data.

17
City Bird, Country Farm
  • The lessons learned and techniques used in
    stealing cycles from workstations are just as
    important when trying to maximize the throughput
    of a homogeneous cluster.

18
Outline
  • Introduction
  • What is Condor? Why Condor on the Farm?
  • Components
  • Daemons, pools, flocks, ClassAds
  • Short Example
  • Executing 1000 jobs.
  • Complications
  • Firewalls, security, etc

19
Components
  • Condor can be quite complicated
  • Many daemons, many connections, many logs...
  • The complexity is necessary and desirable
  • Each process represents an independent interest
  • Machine requirements (startd)
  • User requirements (schedd)
  • System requirements (central manager)
  • Explain the structure by working from the bottom
    up.

20
A Single Machine
administrator email
Only run jobs submitted from Bologna or
Milan. Prefer jobs owned by thain. Evict jobs
that don't fit in memory.
condor startd
disk
Local policy file
RAM
cpu
keyboard
21
A Single Pool
Machine state and policy.
Global Policy: All things being equal, Bologna
gets 2x as many machines as Milan.
22
A Typical Pool
condor startd
RAM
cpu
Global Policy: All things being equal, Bologna
gets 2x as many machines as Milan.
Uniform Local Policy: All machines except 3
prefer mazzanti
NFS / AFS Server
disk
RAM
cpu
23
Schedulers
24
Multiple Pools
25
Matchmaking
  • Each Central Manager is an introduction service
    that matches compatible machines and jobs.
  • A simple language (ClassAds) is used to represent
    everyone's needs and desires.
  • The match is not a binding contract -- each side
    is responsible for enforcing its own needs.
  • If a central manager crashes, jobs will continue
    to run, but no further introductions are made.

26
ClassAd Example
  • Job Ad
      Type = "Job"
      Cmd = "cmsim.exe"
      Owner = "thain"
      Requirements = (OpSys == "LINUX") && (Memory > 128)
  • Machine Ad
      Type = "Machine"
      Name = "vulture"
      OpSys = "LINUX"
      Memory = 256
      Requirements = (Owner == "thain")
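
The two-sided nature of the match can be sketched in a few lines of Python. This is only an illustration of the idea, not the ClassAd language or Condor's matchmaker: the ads are plain dictionaries, each Requirements expression is written as a Python function evaluated against the other party's ad, and `matches` and `matchmake` are hypothetical helper names.

```python
# Each ad is a dict of attributes plus a Requirements predicate that is
# evaluated against the *other* party's ad.
job_ad = {
    "Type": "Job",
    "Cmd": "cmsim.exe",
    "Owner": "thain",
    "Requirements": lambda other: other.get("OpSys") == "LINUX"
    and other.get("Memory", 0) > 128,
}

machine_ad = {
    "Type": "Machine",
    "Name": "vulture",
    "OpSys": "LINUX",
    "Memory": 256,
    "Requirements": lambda other: other.get("Owner") == "thain",
}


def matches(a, b):
    """A match requires that each ad's Requirements accept the other ad."""
    return bool(a["Requirements"](b) and b["Requirements"](a))


def matchmake(jobs, machines):
    """Pair each job with the first free compatible machine (greedy sketch)."""
    free = list(machines)
    pairs = []
    for job in jobs:
        for machine in free:
            if matches(job, machine):
                pairs.append((job["Cmd"], machine["Name"]))
                free.remove(machine)
                break
    return pairs
```

With the ads from the slide, `matches(job_ad, machine_ad)` is true because the machine runs LINUX with more than 128 MB and the job is owned by thain. A machine advertising `Memory = 64` would be rejected by the job, and a job owned by someone else would be rejected by the machine.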

27
Matchmaking with ClassAds
Central Manager
Startd
Schedd
28
Placement vs. Scheduling
  • A Condor Central Manager suggests the placement
    of jobs on machines, with the understanding that
    all jobs are ready to run.
  • A Condor scheduler is responsible for executing a
    list of jobs with various requirements. It may
    order jobs according to the users' requests.
  • Neither component plans ahead to make a schedule
    or a reservation for execution -- it is assumed
    change is so frequent that schedules are not
    useful.

29
Can we Schedule?
  • Of course, scheduling is important for users who
    have strict time constraints.
  • Scheduling is more important to High-Performance
    Computing (HPC) than to High-Throughput Computing
    (HTC).
  • Scheduling requirements may be worked into Condor
    in one of two ways:
  • 1 - Users may share a single submission point.
  • 2 - The administrator may periodically
    reconfigure policy according to a schedule
    established elsewhere.

30
Scheduling
Method 1: All users share a schedd.
31
Outline
  • Introduction
  • What is Condor? Why Condor on the Farm?
  • Components
  • Daemons, pools, flocks, ClassAds
  • Short Example
  • Executing 1000 jobs.
  • Complications
  • Firewalls, security, etc

32
How Many Machines?
  • condor_status

    Name           OpSys        Arch   State      Activity  LoadAv  Mem
    lxpc1.na.infn  LINUX-GLIBC  INTEL  Unclaimed  Idle       0.000   30
    axpd21.pd.inf  OSF1         ALPHA  Owner      Idle       0.266   96
    vlsi11.pd.inf  SOLARIS26    SUN4u  Claimed    Busy       0.000  256
    . . .

                      Machines  Owner  Claimed  Unclaimed  Matched  Preempting
    ALPHA/OSF1             115     67       46          1        0           1
    INTEL/LINUX             53     18        0         35        0           0
    INTEL/LINUX-GLIBC       16      7        0          9        0           0
    SUN4u/SOLARIS251         1      1        0          0        0           0
    SUN4u/SOLARIS26          6      2        0          4        0           0
    SUN4u/SOLARIS27          1      1        0          0        0           0
    SUN4x/SOLARIS26          2      1        0          1        0           0
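
To answer "how many machines?" for the whole pool, the per-platform summary rows can be added up. A short sketch, with the numbers copied by hand from the output above (the `SUMMARY` list and `totals` function are illustrative names, and no Condor command is being invoked):

```python
# (Arch/OpSys, Machines, Owner, Claimed, Unclaimed, Matched, Preempting)
SUMMARY = [
    ("ALPHA/OSF1", 115, 67, 46, 1, 0, 1),
    ("INTEL/LINUX", 53, 18, 0, 35, 0, 0),
    ("INTEL/LINUX-GLIBC", 16, 7, 0, 9, 0, 0),
    ("SUN4u/SOLARIS251", 1, 1, 0, 0, 0, 0),
    ("SUN4u/SOLARIS26", 6, 2, 0, 4, 0, 0),
    ("SUN4u/SOLARIS27", 1, 1, 0, 0, 0, 0),
    ("SUN4x/SOLARIS26", 2, 1, 0, 1, 0, 0),
]
STATES = ("Owner", "Claimed", "Unclaimed", "Matched", "Preempting")


def totals(rows):
    """Return the total machine count and per-state totals, checking each row."""
    per_state = dict.fromkeys(STATES, 0)
    machines = 0
    for name, count, *states in rows:
        # In this summary each machine is counted in exactly one state,
        # so the state columns of a row must add up to its machine count.
        if sum(states) != count:
            raise ValueError(f"row {name} does not add up")
        machines += count
        for state, n in zip(STATES, states):
            per_state[state] += n
    return machines, per_state


if __name__ == "__main__":
    print(totals(SUMMARY))
```

For the rows shown, this gives 194 machines in total: 97 in the Owner state, 46 Claimed, 50 Unclaimed, 0 Matched, and 1 Preempting.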

33
Submit the Job
  • Create a submit file
  • vi sim.submit
  • Submit the job
  • condor_submit sim.submit

Executable = sim
Input = sim.in
Output = sim.out
Log = sim.log
queue
34
Watch the Progress
  • condor_q

    -- Submitter: axpbo8.bo.infn.it : <131.154.10.29:1038>
     ID    OWNER   SUBMITTED     RUN_TIME  ST  PRI  SIZE  CMD
     5.0   thain   6/21 12:40  0+00:00:15  R   0    2.5   sim.exe

Each job gets a unique number.
Status: Unexpanded, Running, or Idle
Size of program image (MB)
35
Receive E-mail When Done
  • This is an automated email from the Condor system
  • on machine "axpbo8.bo.infn.it". Do not reply.
  • Your condor job
  • /tmp_mnt/usr/users/ccl/thain/test/sim 40
  • exited with status 0.
  • Submitted at: Wed Jun 21 14:24:42 2000
  • Completed at: Wed Jun 21 14:36:36 2000
  • Real Time: 0 00:11:54
  • Run Time: 0 00:06:52
  • Committed Time: 0 00:01:37
  • . . .

36
Running Many Processes
  • The real benefit of Condor comes from managing
    1000s of jobs.
  • First, get organized. Write a script to make
    1000 input files.
  • Now, simply adjust your submit file

Executable = sim.exe
Input = sim.in.$(PROCESS)
Output = sim.out.$(PROCESS)
Log = sim.log
Queue 1000
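
The "script to make 1000 input files" could look like the following sketch. The file names follow the `sim.in.N` pattern used in the submit file, numbered from 0 because the process number of a queued job starts at 0. The contents written to each file (a seed line) are a placeholder assumption, since the slides do not say what the simulation reads.

```python
import os


def make_inputs(directory, count, prefix="sim.in"):
    """Create `count` input files named prefix.0 ... prefix.(count-1)."""
    os.makedirs(directory, exist_ok=True)
    paths = []
    for process in range(count):
        path = os.path.join(directory, f"{prefix}.{process}")
        with open(path, "w") as f:
            # Placeholder content: a real simulation would write its own
            # parameters here (for example, a distinct random seed).
            f.write(f"seed {process}\n")
        paths.append(path)
    return paths


if __name__ == "__main__":
    created = make_inputs(".", 1000)
    print(f"wrote {len(created)} input files")
```

Running it once in the submit directory before `condor_submit` gives every one of the 1000 queued processes its own input file.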
37
What can go wrong?
  • If an execution site crashes
  • Your job will restart elsewhere.
  • If the central manager crashes
  • Jobs will continue to run, but no new matches
    will be made.
  • If the submit machine crashes
  • Jobs will stop, but be re-started when it
    reboots.
  • The only way to lose a job is to throw away the
    disk on the submit machine!
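
Why the submit machine's disk matters can be shown with a toy persistent job queue. This is a sketch of the general technique (append every state change to a log on disk and replay it at startup), not Condor's actual queue format; the `JobQueue` class and its JSON-lines log are assumptions for illustration.

```python
import json
import os


class JobQueue:
    """Job queue whose every state change is appended to a log on disk."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.jobs = {}  # job id -> state
        # Replay the log to rebuild the queue as it was before the crash.
        if os.path.exists(log_path):
            with open(log_path) as f:
                for line in f:
                    record = json.loads(line)
                    self.jobs[record["id"]] = record["state"]
        # Anything that was running when we went down must run again.
        for job_id, state in list(self.jobs.items()):
            if state == "running":
                self._record(job_id, "idle")

    def _record(self, job_id, state):
        with open(self.log_path, "a") as f:
            f.write(json.dumps({"id": job_id, "state": state}) + "\n")
            f.flush()
            os.fsync(f.fileno())  # force the record onto stable storage
        self.jobs[job_id] = state

    def submit(self, job_id):
        self._record(job_id, "idle")

    def start(self, job_id):
        self._record(job_id, "running")

    def finish(self, job_id):
        self._record(job_id, "done")
```

Creating a new `JobQueue` on the same log file after a simulated crash restores every job: finished jobs stay finished, and jobs that were running go back to idle so they can be started again. Only deleting the log (throwing away the disk) loses them.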

38
Outline
  • Introduction
  • What is Condor? Why Condor on the Farm?
  • Components
  • Daemons, pools, flocks, ClassAds
  • Short Example
  • Executing 1000 jobs.
  • Complications
  • Firewalls, security, etc

39
Firewalls
  • Why a firewall?
  • Prevent all outside contact.
  • Prevent non-approved contact.
  • Carefully securing every node is too much work.
  • What's the problem?
  • A variety of processes comprise Condor.
  • A variety of ports must be used at once.
  • Submit and execute machines must communicate
    directly, not through the CM.

40
The Firewall Problem
Firewall
Public Network
Private Network
41
Firewall Solution 1
Allow ports 1000-1010.
Firewall
Public Network
Private Network
42
Firewall Solution 1
  • Pros
  • Easy to configure Condor.
  • Easy to configure firewall.
  • Machines remain a part of the pool.
  • Cons
  • Number of ports limits number of simultaneous
    interactions with the node (running jobs,
    queue ops, negotiations, etc.).
  • More ports means more connections, but less
    security.
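
The first drawback can be seen in a small socket sketch: a process restricted to a fixed port range can only hold as many listening endpoints as there are ports in the range. The `bind_in_range` helper is hypothetical and only illustrates the general technique; it does not show how Condor itself is configured to use a range.

```python
import socket


def bind_in_range(low, high, host="127.0.0.1"):
    """Return a TCP socket listening on the first free port in [low, high]."""
    for port in range(low, high + 1):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
        except OSError:
            sock.close()  # port already in use, try the next one
            continue
        sock.listen()
        return sock
    # Every port in the range is taken: no further interaction is possible.
    raise OSError(f"no free port in {low}-{high}")
```

With the range from the slide (1000-1010) there are eleven ports, so the twelfth simultaneous listener would fail. Note that binding ports below 1024 normally requires administrator privileges, so a higher range is more convenient for experimenting.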

43
Firewall Solution 2
Private Network
Firewall
condor schedd
Public Network
ssh
44
Firewall Solution 2
  • Pros
  • Only port through router is ssh.
  • Cons
  • Pool is partitioned!
  • Users must manually submit to every pool that is
    behind a firewall. (I.e., they won't.)
  • No global policy possible.
  • No global management/status possible.

45
Network Address Translation
  • Both solutions only work as long as the firewall
    simply drops packets it doesn't like.
  • If the firewall is a Network Address Translator
    (masquerade), then only solution 2 works.
  • Research in Progress: A Condor NAT that runs on
    the firewall and exports the pool to the outside
    world.

46
Security
  • Current Condor security
  • Authenticate via DNS.
  • Authorize classes of hosts for certain tasks.
  • New Condor (6.3.X?) security
  • Authenticate with encrypted credentials.
  • Authorize on a per-user basis.
  • Forward credentials to necessary sites.

47
Condor 6.2 Security
  • Authentication: DNS is queried for each incoming
    connection in order to determine the name.
  • Authorization: Each participant permits a class
    of hosts to perform certain tasks. At UW-CS:
  • HOSTALLOW_READ = *.wisc.edu, *.infn.it
  • Hosts that may query the machine state.
  • HOSTALLOW_WRITE = *.cs.wisc.edu, *.infn.it
  • Hosts that may execute jobs, send updates, etc...
  • HOSTALLOW_OWNER = $(FULL_HOSTNAME)
  • Hosts that may cause this machine to vacate.
  • HOSTALLOW_ADMINISTRATOR = condor.cs.wisc.edu
  • Hosts that may change priorities, turn Condor
    on/off
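
Host-based authorization of this kind amounts to matching the (DNS-derived) name of the connecting host against a list of patterns per access level. A minimal sketch, assuming the entries are wildcard patterns such as `*.wisc.edu`; the `HOSTALLOW` dictionary layout and the `is_allowed` function are illustrative and are not Condor's implementation:

```python
from fnmatch import fnmatch

# Patterns taken from the slide; the dictionary layout is an assumption.
HOSTALLOW = {
    "READ": ["*.wisc.edu", "*.infn.it"],
    "WRITE": ["*.cs.wisc.edu", "*.infn.it"],
    "ADMINISTRATOR": ["condor.cs.wisc.edu"],
}


def is_allowed(level, hostname):
    """True if hostname matches any pattern configured for the access level."""
    host = hostname.lower()  # DNS names are case-insensitive
    return any(fnmatch(host, pattern) for pattern in HOSTALLOW.get(level, []))
```

Under these settings a host such as `www.wisc.edu` may query machine state (READ) but may not execute jobs (WRITE), while any host under `infn.it` may do both. The check is only as trustworthy as the DNS lookup that produced the name, which is the weakness the newer security mechanisms address.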

48
Condor 6.3.X? Security
  • Principle: No single security mechanism is
    appropriate for all sites. Condor must have many
    tools.
  • United States Air Force
  • Kerberos authentication, all connections
    encrypted
  • Cluster behind a firewall
  • Host authentication, no encryption
  • Grid Computing
  • GSI credentials from certain authorities,
    encryption is up to the user.

49
Condor 6.3.X Security
(Diagram: Submit, Execute, and Data storage (Disk) nodes exchange authentication offers such as "GSI?" and "KRB 5?", answered "YES!" or "NO"; other labels are "FORWARD CERT" and "I/O".)
50
You don't have to be a super person to do super
computing!
51
Getting Condor
  • Condor Home Page
  • http://www.cs.wisc.edu
  • Binaries are freely available.
  • Versions
  • 6.2.x - Stable releases, bug fixes only
  • 6.3.x - Development releases

52
For More Info
  • Condor Home Page
  • http://www.cs.wisc.edu/condor
  • These slides
  • http://www.cs.wisc.edu/thain
  • Douglas Thain
  • thain_at_cs.wisc.edu
  • Questions Now?