Condor by Example - PowerPoint PPT Presentation

Slides: 65
Provided by: nd2
Learn more at: https://www3.nd.edu

1
Condor by Example
2
Lecture Format
  • In each lecture
  • Lecture to whole group.
  • Workshop and examples at computer.
  • Oops!
  • Some items are filled in at the last minute.
  • Please fill the _______ with notes.

3
Outline
  • Overview
  • Submitting Jobs, Getting Feedback
  • Setting Requirements with ClassAds
  • Which Universe?
  • Move to Workshop

4
What is Condor?
  • Condor converts a collection of unrelated
    workstations into a high-throughput computing
    facility.
  • Condor uses matchmaking to make sure that
    everyone is happy.

5
What is High-Throughput Computing?
  • High-performance: CPU cycles/second under ideal
    circumstances.
  • "How fast can I run simulation X on this
    machine?"
  • High-throughput: CPU cycles/day (week, month,
    year?) under non-ideal circumstances.
  • "How many times can I run simulation X in the
    next week using all available machines?"
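
The throughput question can be made concrete with a little arithmetic. The sketch below is only an illustration: the function name, the pool size, the availability fraction, and the job length are all hypothetical numbers, not figures from the lecture.

```python
def jobs_per_week(machines, availability, hours_per_job):
    """Estimate how many jobs finish in one week.

    machines:      number of machines in the pool
    availability:  fraction of the week each machine is free for jobs (0..1)
    hours_per_job: run time of one simulation on one machine
    """
    hours_in_week = 7 * 24
    usable_hours = machines * availability * hours_in_week
    return int(usable_hours // hours_per_job)

# Hypothetical comparison: one dedicated machine versus 100 desktops
# that are idle half of the time, for a 2-hour simulation.
dedicated = jobs_per_week(1, 1.0, 2.0)
pool = jobs_per_week(100, 0.5, 2.0)
print(dedicated, pool)
```

Under these assumed numbers the pool of part-time machines completes far more runs per week than the single dedicated machine, which is the point of the high-throughput view.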

6
What is High-Throughput Computing?
  • Condor does whatever it takes to run your jobs,
    even if some machines:
  • Crash!
  • Are disconnected
  • Run out of disk space
  • Are removed from or added to the pool
  • Are put to other uses

7
What is Matchmaking?
  • Condor uses Matchmaking to make sure that work
    gets done within the constraints of both users
    and owners.
  • Users (jobs) have constraints:
  • "I need an Alpha with 256 MB RAM."
  • Owners (machines) have constraints:
  • "Only run jobs when I am away from my desk and
    never run jobs owned by Bob."
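
The two-sided nature of matchmaking can be sketched in a few lines of Python. This is a toy model, not Condor's matchmaker: the dictionary keys, the function names, and the user "alice" are hypothetical, and "256 MB RAM" is interpreted here as "at least 256 MB".

```python
def job_accepts(job, machine):
    # User's constraint: "I need an Alpha with 256 MB RAM"
    # (read as "at least 256 MB", which is an assumption).
    return machine["arch"] == "ALPHA" and machine["memory_mb"] >= 256

def machine_accepts(machine, job):
    # Owner's constraint: "Only run jobs when I am away from my desk
    # and never run jobs owned by Bob."
    return machine["owner_away"] and job["owner"] != "bob"

def is_match(job, machine):
    """A match needs both sides to accept each other."""
    return job_accepts(job, machine) and machine_accepts(machine, job)

machine = {"arch": "ALPHA", "memory_mb": 256, "owner_away": True}
print(is_match({"owner": "alice"}, machine))  # both sides satisfied
print(is_match({"owner": "bob"}, machine))    # owner's constraint fails
```

Neither side can force a match alone: a job is placed only where the user's and the owner's constraints hold at the same time.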

8
Who uses Condor?
  • Hundreds of universities and companies around the
    world!
  • University of Wisconsin, USA
  • 682 CPUs in one building
  • Computer architecture simulations
  • National Institute of Physics, Italy
  • 200 CPUs in many cities
  • Reconstruction of collider events
  • And many others!

9
What can Condor do for me?
  • Condor can
  • increase your throughput.
  • do your housekeeping.
  • improve reliability.
  • give performance feedback.

10
Cluster Overview
[Figure: one server (512 MB, 800 MHz, 20 GB disk) and five clients (each 128 MB, 666 MHz, 10 GB disk) connected by a 100 Mb/s network.]
11
How many machines now?
  • The map is out of date!
  • The system is always changing.
  • First example: What machines (and of what kind)
    are in the pool now?

12
How Many Machines?
  • condor_status

    Name           OpSys        Arch   State      Activity  LoadAv  Mem
    lxpc1.na.infn  LINUX-GLIBC  INTEL  Unclaimed  Idle      0.000    30
    axpd21.pd.inf  OSF1         ALPHA  Owner      Idle      0.266    96
    vlsi11.pd.inf  SOLARIS26    SUN4u  Claimed    Busy      0.000   256
    . . .

                       Machines  Owner  Claimed  Unclaimed  Matched  Preempting
    ALPHA/OSF1              115     67       46          1        0           1
    INTEL/LINUX              53     18        0         35        0           0
    INTEL/LINUX-GLIBC        16      7        0          9        0           0
    SUN4u/SOLARIS251          1      1        0          0        0           0
    SUN4u/SOLARIS26           6      2        0          4        0           0
    SUN4u/SOLARIS27           1      1        0          0        0           0
    SUN4x/SOLARIS26           2      1        0          1        0           0
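
As a quick sanity check on how to read the summary, the per-platform rows from the slide can be totalled in Python. The numbers are copied from the output above; the variable and function names are only illustrative.

```python
# Summary rows from the condor_status output on this slide:
# (platform, machines, owner, claimed, unclaimed, matched, preempting)
SUMMARY = [
    ("ALPHA/OSF1", 115, 67, 46, 1, 0, 1),
    ("INTEL/LINUX", 53, 18, 0, 35, 0, 0),
    ("INTEL/LINUX-GLIBC", 16, 7, 0, 9, 0, 0),
    ("SUN4u/SOLARIS251", 1, 1, 0, 0, 0, 0),
    ("SUN4u/SOLARIS26", 6, 2, 0, 4, 0, 0),
    ("SUN4u/SOLARIS27", 1, 1, 0, 0, 0, 0),
    ("SUN4x/SOLARIS26", 2, 1, 0, 1, 0, 0),
]

def totals(rows):
    """Sum each numeric column across all platforms."""
    return [sum(column) for column in zip(*[row[1:] for row in rows])]

# In every row, the state columns add up to the machine count.
for platform, machines, *states in SUMMARY:
    assert sum(states) == machines, platform

pool_totals = totals(SUMMARY)
print(pool_totals)  # [machines, owner, claimed, unclaimed, matched, preempting]
```

Each machine is in exactly one state at a time, which is why the state columns of every row sum to its machine count.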

13
Machine States
  • Most machines will be
  • Owner
  • The machine's owner is busy at the console, so no
    Condor jobs may run.
  • Claimed
  • Condor has selected the machine to run jobs for
    other users.

14
Machine States
  • Only a few should be
  • Unclaimed
  • The owner is gone, but Condor has not yet
    selected the machine.
  • Matched
  • Between claimed and unclaimed.
  • Preempting
  • Condor is busy removing a job.

15
More Things to Try
  • condor_status -help
  • condor_status -avail
  • condor_status -run
  • condor_status -total
  • condor_status -pool condor.cs.wisc.edu

16
Submitting Jobs
17
Steps to Running a Job
  • Re-link for Condor.
  • Submit the job.
  • Watch the progress.
  • Receive email when done.

18
Example Job
  • Integrate sin(x) from 0 to 10, using 10 million
    slices.
  • Simple program takes a few seconds.
  • ./integrate 10 10000000
  • 2.0445075

19
      PROGRAM INTEGRATE
      CHARACTER STR*10
      REAL X, SLICES, LIMIT
!     First argument: upper limit of the integral
      CALL GETARG(1,STR)
      READ (STR,*) LIMIT
!     Second argument: number of slices
      CALL GETARG(2,STR)
      READ (STR,*) SLICES
      TOTAL = 0
      STEP = LIMIT/SLICES
!     Add up the area of each slice
      DO X = 0, LIMIT, STEP
        TOTAL = TOTAL + SIN(X)*STEP
      END DO
      PRINT *, TOTAL
      END
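
For readers without a Fortran compiler, the same idea can be sketched in Python. This is not a line-by-line port: it uses an integer loop counter and double precision, and fewer slices so that it runs quickly. The function name is an assumption for illustration.

```python
import math

def integrate_sin(limit, slices):
    """Left Riemann sum of sin(x) on [0, limit] using `slices` steps."""
    step = limit / slices
    total = 0.0
    for i in range(slices):
        total += math.sin(i * step) * step
    return total

approx = integrate_sin(10.0, 100_000)
exact = 1.0 - math.cos(10.0)  # closed form of the integral of sin from 0 to 10
print(approx, exact)
```

In double precision the sum lands close to the closed form 1 - cos(10), about 1.839. The value printed on the previous slide differs from this; single-precision REAL accumulation in the Fortran loop is a plausible cause, though the transcript does not say.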

20
Re-link for Condor
  • If you normally compile like this
  • g77 integrate.f -o integrate
  • Then compile for Condor like this
  • condor_compile g77 integrate.f -o integrate

21
Submit the Job
  • Create a submit file
  • emacs integrate.submit
  • Submit the job
  • condor_submit integrate.submit

    Executable = integrate
    Arguments  = 10 10000000
    Output     = integrate.out
    Log        = integrate.log
    queue
22
Watch the Progress
  • condor_q

    -- Submitter: axpbo8.bo.infn.it
     ID    OWNER  SUBMITTED    RUN_TIME    ST  PRI  SIZE  CMD
     5.0   thain  6/21 12:40   0+00:00:15  R   0    2.5   fib 40

  • ID: Each job gets a unique number.
  • ST: Status is Unexpanded, Running, or Idle.
  • SIZE: Size of program image (MB).
23
Receive E-mail When Done
  • This is an automated email from the Condor system
  • on machine "axpbo8.bo.infn.it". Do not reply.
  • Your condor job
  • /tmp_mnt/usr/users/ccl/thain/test/fib 40
  • exited with status 0.
  • Submitted at: Wed Jun 21 14:24:42 2000
  • Completed at: Wed Jun 21 14:36:36 2000
  • Real Time: 0 00:11:54
  • Run Time: 0 00:06:52
  • Committed Time: 0 00:01:37
  • . . .

24
Running Many Processes
  • 100 processes are almost as easy as 1.
  • Each condor_submit makes one cluster of one or
    more processes.
  • Add the number of processes to run to the Queue
    statement.
  • Use the $(PROCESS) variable to give each process
    slightly different instructions.

25
Running Many Processes
  • Run the same program on 50 different
    intervals.
  • Output goes in integrate.out.0, integrate.out.1,
    and so on.

    Executable = integrate
    Arguments  = $(PROCESS) 10000000
    Output     = integrate.out.$(PROCESS)
    Log        = integrate.log
    Queue 50
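
To see what each of the 50 processes receives, the macro substitution can be imitated in Python. This is a simplified sketch, not Condor's parser: the helper names are hypothetical, and it only replaces the exact text `$(PROCESS)`. Process numbers start at 0.

```python
def expand(template, process):
    """Replace the $(PROCESS) macro with one process number."""
    return template.replace("$(PROCESS)", str(process))

def expand_queue(arguments, output, count):
    """Return (arguments, output file) for each process, numbered from 0."""
    return [(expand(arguments, p), expand(output, p)) for p in range(count)]

jobs = expand_queue("$(PROCESS) 10000000", "integrate.out.$(PROCESS)", 50)
print(jobs[0])   # first process
print(jobs[-1])  # last process
```

One submit file therefore describes 50 runs, each integrating up to a different limit and writing to its own output file.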
26
Running Many Processes
  • condor_q

    -- Submitter: axpbo8.bo.infn.it
     ID    OWNER  SUBMITTED    RUN_TIME    ST  PRI  SIZE  CMD
     9.3   thain  6/23 10:47   0+00:05:40  R   0    2.5   fib 3
     9.6   thain  6/23 10:47   0+00:05:11  R   0    2.5   fib 6
     9.7   thain  6/23 10:47   0+00:05:09  R   0    2.5   fib 7
     . . .
    21 jobs; 2 idle, 19 running, 0 held

  • In the ID column, the number before the dot is the
    cluster number and the number after it is the
    process number.
27
Where Are They Running?
  • condor_q -run

    -- Submitter: axpbo8.bo.infn.it
     ID     OWNER  SUBMITTED    RUN_TIME    HOST(S)
     9.47   thain  6/23 10:47   0+00:07:03  ax4bbt.bo.infn.it
     9.48   thain  6/23 10:47   0+00:06:51  pewobo1.bo.infn.it
     9.49   thain  6/23 10:47   0+00:06:30  osde01.pd.infn.it

  • HOST(S): Current location of each process.
28
Help! I'm buried in Email!
  • By default, Condor sends one email for each
    completed process.
  • Add these to your submit file
  • notification = error
  • notification = never
  • To send it to someone else
  • notify_user = thain@cs.wisc.edu

29
Removing Processes
  • Remove one process
  • condor_rm 9.47
  • Remove a whole cluster
  • condor_rm 9
  • Remove everything!
  • condor_rm -a
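
The difference between naming a process and naming a cluster can be sketched in Python. This only models the selection rule described on this slide; the function names are hypothetical and the queue contents (including job 10.0) are made up for illustration.

```python
def parse_target(target):
    """Split '9.47' into (9, 47) and '9' into (9, None)."""
    cluster, _, process = target.partition(".")
    return int(cluster), (int(process) if process else None)

def selected(job_ids, target):
    """Return the job IDs named by a 'cluster' or 'cluster.process' target."""
    cluster, process = parse_target(target)
    chosen = []
    for job_id in job_ids:
        job_cluster, job_process = parse_target(job_id)
        if job_cluster == cluster and (process is None or job_process == process):
            chosen.append(job_id)
    return chosen

queue = ["9.47", "9.48", "9.49", "10.0"]  # hypothetical queue contents
print(selected(queue, "9.47"))  # one process
print(selected(queue, "9"))     # the whole cluster
```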

30
Getting Feedback
31
What have I done?
  • The user log file (fib.log) shows a chronological
    list of everything important that happened to a
    job.
  • 001 (007.035.000) 06/21 17:03:44 Job executing on
    host
  • 004 (007.035.000) 06/21 17:04:58 Job was evicted.
  • 009 (007.035.000) 06/21 17:05:10 Job was aborted
    by the user.
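
Because each log entry starts with a fixed header, the user log is easy to process with a script. The sketch below assumes the header layout "code (cluster.process.subprocess) MM/DD HH:MM:SS text" with colon-separated times; the regular expression and field names are my own, not part of Condor.

```python
import re

# Assumed header layout: event code, job ID triple, date, time, message.
EVENT = re.compile(
    r"^(?P<code>\d{3}) "
    r"\((?P<cluster>\d+)\.(?P<process>\d+)\.(?P<subproc>\d+)\) "
    r"(?P<date>\d\d/\d\d) (?P<time>\d\d:\d\d:\d\d) "
    r"(?P<text>.*)$"
)

def parse_event(line):
    """Return a dict describing one log header line, or None if it does not match."""
    match = EVENT.match(line.strip())
    if match is None:
        return None
    event = match.groupdict()
    for key in ("cluster", "process", "subproc"):
        event[key] = int(event[key])
    return event

sample = "004 (007.035.000) 06/21 17:04:58 Job was evicted."
event = parse_event(sample)
print(event["cluster"], event["process"], event["text"])
```

A loop over the lines of fib.log with this function would give a per-process timeline of executions, evictions, and removals.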

32
What have I done?
  • condor_history

     ID     OWNER  SUBMITTED    CPU_USAGE   ST  COMPLETED    CMD
     9.3    thain  6/23 10:47   0+00:00:00  C   6/23 10:58   fib 3
     9.40   thain  6/23 10:47   0+00:00:24  C   6/23 10:59   fib 40
     9.10   thain  6/23 10:47   0+00:00:00  C   6/23 11:01   fib 10
     9.47   thain  6/23 10:47   0+00:05:45  C   6/23 11:01   fib 47
     9.7    thain  6/23 10:47   0+00:00:00  C   6/23 11:01   fib 7

33
Brief I/O Summary
  • condor_q -io

    -- Schedd: c01.cs.wisc.edu
     ID      OWNER  READ      WRITE     SEEK  XPUT       BUFSIZE   BLKSIZE
     756.15  joe    244.9 KB  379.8 KB    71  1.3 KB/s   512.0 KB  32.0 KB
     758.24  joe    198.8 KB  219.5 KB    78  45.0 B/s   512.0 KB  32.0 KB
     758.26  joe     44.7 KB   22.1 KB  2727  13.0 B/s   512.0 KB  32.0 KB
    3 jobs; 0 idle, 3 running, 0 held

34
Complete I/O Summary in Email

    Your condor job "/usr/joe/records.remote input output"
    exited with status 0.

    Total I/O:
      104.2 KB/s effective throughput
      5 files opened
      104 reads totaling 411.0 KB
      316 writes totaling 1.2 MB
      102 seeks

    I/O by File:
      buffered file /usr/joe/input
        opened 2 times
        100 reads totaling 398.6 KB
        311 writes totaling 1.2 MB
        101 seeks

(Only since Condor Version 6.1.11)
35
Complete I/O Summary in Email
  • The summary helps identify performance problems.
    Even advanced users don't know exactly how their
    programs and libraries operate.

36
Complete I/O Summary in Email
  • Example:
  • CMSSIM - collider simulation
  • Why is this job so slow?
  • Data summary:
  • Read 250 MB from a 20 MB file.
  • Very high SEEK total - random access.
  • Solution: Increase buffer to 20 MB.
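
The reasoning behind this diagnosis is simple arithmetic, sketched here in Python. The second function is a deliberately crude model that I am assuming for illustration; it is not Condor's actual buffering policy.

```python
def reread_factor(total_read_mb, file_size_mb):
    """How many times, on average, each byte of the file was read."""
    return total_read_mb / file_size_mb

def remote_traffic_mb(total_read_mb, file_size_mb, buffer_mb):
    """Crude model (an assumption, not Condor's real behaviour):
    a buffer that holds the whole file fetches each byte once;
    a smaller buffer is charged the full read volume as a worst case."""
    if buffer_mb >= file_size_mb:
        return file_size_mb
    return total_read_mb

print(reread_factor(250, 20))
print(remote_traffic_mb(250, 20, 0.5), remote_traffic_mb(250, 20, 20))
```

Reading 250 MB from a 20 MB file means the same data is fetched many times over, so a buffer large enough to hold the whole file removes most of the repeated remote reads.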

37
Who Uses Condor?
  • condor_q -global

    -- Schedd: to02xd.to.infn.it
     ID     OWNER     SUBMITTED    RUN_TIME     ST  PRI  SIZE  CMD
     127.0  garzelli  6/21 18:45   1+14:18:16   R   0    17.2  tosti2trisdn

    -- Schedd: quark.ts.infn.it
     ID     OWNER     SUBMITTED    RUN_TIME     ST  PRI  SIZE  CMD
     600.0  dellaric  4/10 14:57   55+09:20:31  R   0    9.1   john p2.dat
     665.0  dellaric  6/2  11:14   20+03:27:30  R   0    9.2   john p1.dat
     788.0  pamela    6/20 09:27   3+04:41:43   R   0    15.4  montepamela

38
Who uses Condor?
  • condor_status -submitters

    Name                  Machine     Running  IdleJobs  MaxJobsRunning
    rebuzzin@pv.infn.it   decux1.pv.       22        34             200
    pamela@ts.infn.it     quark.ts.i        6         1             200
    giunti@to.infn.it     to05xd.to.       21        49             200
    . . .

                          RunningJobs  IdleJobs
    cattaneo@pv.infn.it             0         1
    pamela@ts.infn.it               6         1
    rebuzzin@pv.infn.it            22        34
    Total                          59        86

39
Who Uses Condor?
  • condor_userprio

    Last Priority Update: 6/23 16:27
                                    Effective
    User Name                       Priority
    ------------------------------  ---------
    meucci@pv.infn.it                    0.50
    longof@ts.infn.it                    0.50
    thain@bo.infn.it                     0.50
    dellaric@ts.infn.it                  2.00
    clueoff@pd.infn.it                   3.00
    pamela@ts.infn.it                    5.81
    rebuzzin@pv.infn.it                 18.18
    giunti@to.infn.it                   19.72
    ------------------------------  ---------
    Number of users shown: 8

40
Who Uses Condor?
  • The user priority is computed by Condor to
    estimate how much of the pool's CPU resources
    have been used by each submitter.
  • Lighter users receive a lower priority value, so
    they will be allocated CPUs before heavy users.
  • Users consuming the same amount of CPU will be
    allocated an equal amount.
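
The ordering rule can be illustrated with the priority values from the previous slide. The first function just sorts by priority value. The second splits CPUs in inverse proportion to the priority value, which is my assumption about how "lighter users first" might be turned into shares, not a statement of Condor's exact algorithm.

```python
# Effective priorities from the condor_userprio slide (user names only).
PRIORITIES = {
    "meucci": 0.50, "longof": 0.50, "thain": 0.50, "dellaric": 2.00,
    "clueoff": 3.00, "pamela": 5.81, "rebuzzin": 18.18, "giunti": 19.72,
}

def allocation_order(priorities):
    """Users with the lowest priority value come first (ties broken by name)."""
    return sorted(priorities, key=lambda user: (priorities[user], user))

def inverse_shares(priorities, cpus):
    """Assumed model: each user's share is proportional to 1 / priority."""
    weights = {user: 1.0 / value for user, value in priorities.items()}
    total = sum(weights.values())
    return {user: cpus * weight / total for user, weight in weights.items()}

order = allocation_order(PRIORITIES)
print(order[0], order[-1])
print(inverse_shares({"light": 1.0, "heavy": 3.0}, 40))
```

Under this model two users with equal priority values receive equal shares, matching the last bullet above.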

41
Measuring Goodput
  • Goodput is the amount of time a workstation
    spends making forward progress on work assigned
    by Condor.
  • This is a big topic all by itself:
    http://www.cs.wisc.edu/condor/goodput
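
One way to turn this definition into a number is to express useful time as a percentage of the time a machine was allocated to the job. The formula and the example figures below are assumptions for illustration; the slides do not spell out how Condor computes its goodput figure.

```python
def goodput_percent(useful_seconds, allocated_seconds):
    """Share of allocated time that produced forward progress, in percent."""
    if allocated_seconds <= 0:
        raise ValueError("allocated_seconds must be positive")
    return 100.0 * useful_seconds / allocated_seconds

# Hypothetical job: allocated a machine for 1000 s, but 595 s of work
# was lost when the job was evicted before it could save its state.
print(goodput_percent(1000 - 595, 1000))
print(goodput_percent(1000, 1000))
```

Time lost to evictions counts against goodput even though the CPU was busy, which is why goodput and CPU utilisation can differ.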

42
Measuring Goodput
  • condor_q -goodput

    -- Submitter: coral.cs.wisc.edu coral.cs.wisc.edu
     ID      OWNER  SUBMITTED    RUN_TIME    GOODPUT  CPU_UTIL  Mb/s
     719.74  thain  6/23 07:35   2+20:47:59    100.0      87.6  0.00
     719.75  thain  6/23 07:35   2+20:38:45     40.5      99.8  0.00
     719.76  thain  6/23 07:35   2+20:38:16     96.9      98.7  0.00
     719.77  thain  6/23 07:35   2+21:10:06    100.0      99.8  0.00

43
Setting Requirements
  • We believe that Condor must allow both users
    (jobs) and owners (machines) to set requirements.
  • This is an absolute necessity in order to
    convince people to participate in the community.

44
ClassAds
  • ClassAds are a simple language for describing
    both the properties and the requirements of jobs
    and machines.
  • Condor stores nearly everything in ClassAds --
    use the -l option to condor_q and condor_status
    to get the full details.

45
ClassAd for a Machine
  • condor_status -l axpbo8

    MyType = "Machine"
    TargetType = "Job"
    Name = "axpbo8.bo.infn.it"
    START = TRUE
    VirtualMemory = 342696
    Disk = 28728536
    Memory = 160
    Cpus = 1
    Arch = "ALPHA"
    OpSys = "OSF1"

46
ClassAd for a Job
  • condor_q -l 9.49

    MyType = "Job"
    TargetType = "Machine"
    Owner = "thain"
    Cmd = "/tmp_mnt/usr/users/ccl/thain/test/fib"
    Out = "fib.out.49"
    Args = "49"
    ImageSize = 2544
    DiskUsage = 2544
    Requirements = (Arch == "ALPHA") && (OpSys == "OSF1")
                   && (Disk >= DiskUsage)
                   && (VirtualMemory >= ImageSize)

47
Default Requirements
  • By default, Condor assumes the requirements for
    your job are: "I need a machine with:"
  • The same operating system and architecture as my
    workstation.
  • Enough disk to store the program.
  • Enough virtual memory to run the program.
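
The default requirements can be checked by hand against the two ClassAds shown on the previous slides. The Python sketch below uses the attribute values from those slides; the function name is hypothetical, and the use of ">=" for the size comparisons is an assumption, since the transcript does not preserve the operators.

```python
# Attribute values copied from the machine and job ClassAd slides.
MACHINE = {"Arch": "ALPHA", "OpSys": "OSF1", "Disk": 28728536,
           "VirtualMemory": 342696, "Memory": 160}
JOB = {"ImageSize": 2544, "DiskUsage": 2544}

def default_requirements(job, machine, submit_arch="ALPHA", submit_opsys="OSF1"):
    """Same platform as the submitting workstation, enough disk,
    and enough virtual memory (">=" is an assumed reading)."""
    return (machine["Arch"] == submit_arch
            and machine["OpSys"] == submit_opsys
            and machine["Disk"] >= job["DiskUsage"]
            and machine["VirtualMemory"] >= job["ImageSize"])

print(default_requirements(JOB, MACHINE))
linux_box = dict(MACHINE, Arch="INTEL", OpSys="LINUX")  # hypothetical machine
print(default_requirements(JOB, linux_box))
```

The Alpha machine satisfies all three default conditions, while a machine of a different platform is rejected before disk or memory are even considered.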

48
ClassAd Requirements
  • Similar to C/C++/Java expressions
  • Symbols: Arch, OpSys, Memory, Mips
  • Values: 15, 6.5, "LINUX"
  • Operators:
  • ==, !=, <, <=, >, >=
  • &&, ||
  • ( )

49
Adding Requirements
  • In the submit file, add a line beginning with
    requirements

    Executable   = fib
    Arguments    = 40
    Output       = fib.out
    Log          = fib.log
    Requirements = (Memory > 64)
    queue
50
Example Requirements
  • (Memory > 64)
  • (Machine == "axpbo3.bo.infn.it")
  • (Mips > 100) || (Kflops > 10000)
  • (Subnet != "131.154.10")
  • (Disk > 20000000)

51
Preferences
  • Condor assumes that any machines that match your
    requirements are suitable.
  • However, you may prefer some machines over
    others. (100 Mips is better than 10)
  • To indicate a preference, you may provide a
    ClassAd expression which ranks all matches.

52
Rank
  • The rank expression is evaluated into a number
    for every potential matching machine.
  • A machine with a higher number will be preferred
    over a machine with a lower number.

53
Rank Examples
  • Prefer machines with more Mips:
  • Rank = Mips
  • Prefer machines with a high ratio of memory to
    CPU performance:
  • Rank = Memory/Mips
  • Prefer more memory, but add 100 to the rank if
    the machine is Solaris 2.7:
  • Rank = Memory + 100*(OpSys == "SOLARIS27")
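
The last rank expression relies on a true comparison counting as 1 and a false one as 0. Python treats booleans the same way in arithmetic, so the idea can be tried out directly. The machine list below is invented for illustration.

```python
def rank(machine):
    """Memory plus a 100-point bonus for Solaris 2.7 machines."""
    return machine["Memory"] + 100 * (machine["OpSys"] == "SOLARIS27")

# Hypothetical machines that all satisfy the job's requirements.
machines = [
    {"Name": "a", "Memory": 256, "OpSys": "SOLARIS26"},
    {"Name": "b", "Memory": 192, "OpSys": "SOLARIS27"},
    {"Name": "c", "Memory": 96, "OpSys": "OSF1"},
]

for m in machines:
    print(m["Name"], rank(m))
best = max(machines, key=rank)
print("preferred:", best["Name"])
```

Machine "b" wins despite having less memory than "a", because the bonus outweighs the 64 MB difference. Rank only orders machines that already match; it never makes an unsuitable machine acceptable.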

54
Standard or Vanilla?
55
Which Universe?
  • Each Condor universe provides different services
    to different kinds of programs
  • Standard: Relinked UNIX programs
  • Vanilla: Unmodified UNIX programs
  • PVM
  • Scheduler (Not described here)
  • Globus

56
Standard Universe
  • Submit a specially-linked UNIX application to the
    Condor system.
  • Advantages
  • Checkpointing for fault tolerance.
  • Remote I/O services
  • Friendly environment anywhere in the world.
  • Data buffering and staging.
  • I/O performance feedback.
  • User remapping of data sources.

57
Standard Universe
  • Disadvantages
  • Must statically link with Condor library.
  • Limited class of applications
  • Single-process UNIX binaries.
  • Certain system calls prohibited.

58
System Call Limitations
  • Standard universe does not allow
  • Multiple processes
  • fork(), exec(), system()
  • Inter-process communication
  • semaphores, messages, shared memory
  • Complex I/O
  • mmap(), select(), poll(), non-blocking I/O
  • Kernel-level threads
  • (User level threads are OK.)

59
System Call Limitations
  • Too restrictive?
  • Use the vanilla universe.
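
The choice described on these slides can be written down as a small decision helper. The feature names and the function are hypothetical labels for the restrictions listed above; this is a planning aid, not something Condor provides.

```python
# Features the slides list as unavailable in the standard universe.
STANDARD_FORBIDDEN = {
    "fork", "exec", "system",                    # multiple processes
    "semaphores", "messages", "shared_memory",   # inter-process communication
    "mmap", "select", "poll", "nonblocking_io",  # complex I/O
    "kernel_threads",
}

def choose_universe(features, can_relink=True):
    """Pick 'standard' only if the program can be relinked and
    uses none of the forbidden features; otherwise 'vanilla'."""
    blocked = set(features) & STANDARD_FORBIDDEN
    if can_relink and not blocked:
        return "standard"
    return "vanilla"

print(choose_universe({"user_threads"}))
print(choose_universe({"fork", "select"}))
print(choose_universe(set(), can_relink=False))
```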

60
Vanilla Universe
  • Submit any sort of UNIX program to the Condor
    system.
  • Advantages
  • No relinking required.
  • Any program at all, including
  • Binaries
  • Shell scripts
  • Interpreted programs (java, perl)
  • Multiple processes

61
Vanilla Universe
  • Disadvantages
  • No checkpointing.
  • Very limited remote I/O services.
  • Specify input files explicitly.
  • Specify output files explicitly.
  • Condor will refuse to start a vanilla job on a
    machine that is unfriendly.
  • ClassAds: FilesystemDomain and UIDDomain

62
Which Universe?
  • Standard
  • Good for mixed Condor pools, flocked pools, and
    the Grid at large.
  • Vanilla
  • Good for a Condor pool of identical machines.

63
Conclusion
  • Condor expands your reach to many CPUs, even
    those you cannot log in to.
  • Condor makes it easy to run and manage large
    numbers of jobs.
  • Good candidates for the standard universe are
    single-process CPU-bound jobs with simple I/O.
  • Too restrictive? Use the vanilla universe, but
    expect fewer available machines.

64
Move to Workshop
  • Meet again in room ____ at _____.
  • Bring printouts to follow along.