Condor and GridShell - PowerPoint PPT Presentation

Provided by: JeffreyP157
1
Condor and GridShell
  • How to Execute 1 Million Jobs on the Teragrid

Jeffrey P. Gardner - PSC
Edward Walker - TACC
Miron Livny - U. Wisconsin
Todd Tannenbaum - U. Wisconsin
And many others!
2
Scientific Motivation
  • Astronomy is increasingly being done by using
    large surveys with 100s of millions of objects.
  • Analyzing large astronomical datasets frequently
    means performing the same analysis task on
    >100,000 objects.
  • Each object may take several hours of computing.
  • The amount of computing time required may vary,
    sometimes dramatically, from object to object.

3
Solution: PBS?
  • In theory, PBS should provide the answer.
  • Submit 100,000 single-processor PBS jobs
  • In practice, this does not work.
  • Teragrid nodes are multiprocessor
  • Only 1 PBS job per node
  • Teragrid machines frequently restrict the number
    of jobs a single user may run.
  • Chad might get really mad if I submitted 100,000
    PBS jobs!

4
Solution: mprun?
  • We could submit a single job that uses many
    processors.
  • Now we have a reasonable number of PBS jobs (Chad
    will now be happy).
  • Scheduling priority would reflect our actual
    resource usage.
  • This still has problems.
  • Each job takes a different amount of time to run:
    we are using resources inefficiently.
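
The inefficiency can be illustrated with a small simulation. The sketch below is not from the presentation: the runtimes are randomly generated, and the function names `static_makespan` and `dynamic_makespan` are hypothetical. It compares splitting work units evenly across processors up front (roughly what a single multi-processor job would do) with handing each unit to whichever processor becomes free first.

```python
import heapq
import random


def static_makespan(runtimes, n_procs):
    """Wall-clock time when work units are pre-assigned round-robin."""
    totals = [0.0] * n_procs
    for i, t in enumerate(runtimes):
        totals[i % n_procs] += t
    # The job ends only when the most heavily loaded processor finishes.
    return max(totals)


def dynamic_makespan(runtimes, n_procs):
    """Wall-clock time when each free processor pulls the next work unit."""
    free_at = [0.0] * n_procs  # min-heap of times at which processors go idle
    heapq.heapify(free_at)
    for t in runtimes:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + t)
    return max(free_at)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical workload: most units are short, a few take much longer.
    runtimes = [rng.choice([1.0, 1.0, 1.0, 8.0]) for _ in range(1000)]
    print("ideal   :", round(sum(runtimes) / 16, 1))
    print("static  :", round(static_makespan(runtimes, 16), 1))
    print("dynamic :", round(dynamic_makespan(runtimes, 16), 1))
```

With a static split, every processor that finishes early sits idle until the slowest one is done, yet the whole allocation is still charged for that time.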

5
The Real Solution: Condor + GridShell
  • The real solution is to submit one large PBS job,
    then use a private scheduler to manage serial
    work units within each PBS job.
  • We can even submit large PBS jobs to multiple
    Teragrid machines, then farm out serial work
    units as resources become available.

Vocabulary
JOB (n): a thing that is submitted via Globus or PBS
WORK UNIT (n): an independent unit of work (usually serial),
such as the analysis of a single astronomical object
7
Condor Overview
  • Condor was first designed as a CPU cycle
    harvester for workstations sitting on people's
    desks.
  • Condor is designed to schedule large numbers of
    jobs across a distributed, heterogeneous and
    dynamic set of computational resources.

8
Condor: The User Experience
1. User writes a simple Condor submit script, my_job.submit:
    # my_job.submit: A simple Condor submit script
    Universe   = vanilla
    Executable = my_program
    Queue
2. User submits the job:
    condor_submit my_job.submit
    Submitting job(s).
    1 job(s) submitted to cluster 1.
9
Condor: The User Experience
3. User watches job run:
    condor_q
    -- Submitter: perdita.cs.wisc.edu : <128.105.165.34:1027>
     ID      OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
     1.0     Jeff     6/16 06:52  0+00:01:21 R  0   0.0  my_program
    1 jobs; 0 idle, 1 running, 0 held
4. Job completes. User is happy.
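
For a survey-sized workload, one submit script can queue many work units at once. A minimal sketch, assuming a hypothetical executable `analyze_object` that takes an object index as its only argument. `Queue N` and the `$(Process)` macro are standard Condor submit-file features; the helper `make_submit_description` and the output file names are made up for illustration.

```python
def make_submit_description(executable, n_units):
    """Build a Condor submit description that queues n_units work units.

    $(Process) expands to 0, 1, ..., n_units - 1, so each work unit
    receives its own index and writes to its own output files.
    """
    lines = [
        "Universe   = vanilla",
        f"Executable = {executable}",
        "Arguments  = $(Process)",
        "Output     = unit_$(Process).out",
        "Error      = unit_$(Process).err",
        "Log        = units.log",
        f"Queue {n_units}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "analyze_object" is a hypothetical program name.
    print(make_submit_description("analyze_object", 100000), end="")
```

The generated text can be saved to a file and passed to condor_submit exactly as in step 2.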
10
Advantages of Condor
  • Condor user experience is simple
  • Condor is flexible
      • Resources can be any mix of architectures
      • Resources do not need a common filesystem
      • Resources do not need common user accounting
  • Condor is dynamic
      • Resources can disappear and reappear
  • Condor is fault-tolerant
      • Jobs are automatically migrated to new resources
        if existing ones become unavailable.

11
Condor Daemons
  • condor_startd (runs on execution node)
      • Advertises specs and availability of execution
        node (ClassAds). Starts jobs on execution node.
  • condor_schedd (runs on submit node)
      • Handles job submission. Tracks job status.
  • condor_collector (runs on central manager)
      • Collects system information from execution nodes.
  • condor_negotiator (runs on central manager)
      • Matches schedd jobs to machines.

12
Condor Daemon Layout
Startd sends system specifications (ClassAds) and
system status to Collector
13
Condor Daemon Layout
Schedd sends job info to Negotiator
User submits Condor job
14
Condor Daemon Layout
Negotiator uses information from Collector to
match Schedd jobs to available Startds
15
Condor Daemon Layout
Schedd sends job to Startd on assigned execution
node
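
The steps in the daemon layout can be sketched as a toy matchmaker. This is an illustration only, not Condor's actual ClassAd language or negotiation algorithm: machine and job ads are plain dictionaries, requirements are Python callables, and all names (`negotiate`, `memory_mb`, the node names) are hypothetical.

```python
def negotiate(machine_ads, job_ads):
    """Match each idle job to the first unclaimed machine it accepts.

    machine_ads: list of dicts describing execution nodes (what a
                 startd would advertise to the collector).
    job_ads:     list of dicts with an "id" and a "requirements"
                 callable (what a schedd would advertise for its jobs).
    Returns a dict mapping job id -> machine name.
    """
    unclaimed = list(machine_ads)
    matches = {}
    for job in job_ads:
        for machine in unclaimed:
            if job["requirements"](machine):
                matches[job["id"]] = machine["name"]
                unclaimed.remove(machine)
                break  # this job is placed; move on to the next one
    return matches


if __name__ == "__main__":
    machines = [
        {"name": "node01", "arch": "IA64", "memory_mb": 2048},
        {"name": "node02", "arch": "X86_64", "memory_mb": 4096},
    ]
    jobs = [
        {"id": "1.0", "requirements": lambda m: m["memory_mb"] >= 4096},
        {"id": "1.1", "requirements": lambda m: m["arch"] == "IA64"},
        {"id": "1.2", "requirements": lambda m: True},
    ]
    print(negotiate(machines, jobs))
```

In this toy run the third job stays unmatched because both machines are already claimed; in a real pool it would simply remain idle in the queue until a machine became available.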
16
Personal Condor on a Teragrid Platform
  • Condor daemons can be run as a normal user.
  • Condor's GlideIn feature can launch
    condor_startds on nodes within an LSF or
    PBS job.

17
Personal Condor on a Teragrid Platform
(Condor runs with normal user permissions)
Login Node
Submission Machine (could be login node)
PBS Job - GlideIn
18
GridShell Overview
  • Allows users to interact with distributed grid
    computing resources from a simple shell-like
    interface.
  • Extends TCSH version 6.12 to incorporate
    grid-enabled features:
      • parallel inter-script message-passing and
        synchronization
      • output redirection to remote files
      • parametric sweep

19
GridShell Examples
  • Redirecting the standard output of a command to a
    remote file location using GlobusFTP:
        a.out > gsiftp://tg-login.ncsa.teragrid.org/data
  • Message passing between 2 parallel tasks:
        if ( $_GRID_TASKID == 0 ) then
            echo "hello" > task_1
        else
            set msg=`cat < task_0`
        endif
  • Executing 256 instances of a job:
        a.out on 256 procs

20
Merging GridShell with Condor
  • Use GridShell to launch Condor GlideIn jobs at
    multiple grid sites
  • All Condor GlideIn jobs report back to a central
    collector
  • This converts the entire Teragrid into your own
    personal Condor pool!

21
Merging GridShell with Condor
NCSA
SDSC
PSC
User starts GridShell Session at PSC
22
Merging GridShell with Condor
NCSA
SDSC
PSC
GridShell session starts event monitor on remote
login nodes via Globus
23
Merging GridShell with Condor
NCSA
SDSC
PSC
Local event monitor starts condor daemons on
login node
24
All event monitors submit Condor GlideIn PBS jobs
NCSA
SDSC
PSC
25
Condor startds tell collector that they have
started
NCSA
SDSC
PSC
26
Condor schedd distributes independent work units
to compute nodes
NCSA
SDSC
PSC
27
GridShell in a NutShell
  • Using GridShell coupled with Condor one can
    easily harness the power of the Teragrid to
    process large numbers of independent work units.
  • Scheduling can be done dynamically from a central
    Condor queue to multiple grid sites as clusters
    of processors become available.
  • All of this fits into existing Teragrid software.
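
As a rough illustration of dynamic scheduling from a central queue, the sketch below simulates glide-in jobs at several sites that start at different times (for example, because each site's PBS queue has a different wait). The site names come from the slides, but the processor counts, start times, and runtimes are invented, and `farm_out` is a hypothetical name; real Condor scheduling involves much more than this.

```python
import heapq
from collections import Counter


def farm_out(runtimes, sites):
    """Hand work units from one central queue to processors as they free up.

    runtimes: list of work-unit durations, in queue order.
    sites:    dict of site name -> (n_procs, start_time), where start_time
              is when that site's glide-in job begins running.
    Returns (finish_time, Counter of work units completed per site).
    """
    # Each heap entry is (time the processor becomes free, tie-breaker, site).
    free = []
    slot = 0
    for name, (n_procs, start_time) in sites.items():
        for _ in range(n_procs):
            free.append((start_time, slot, name))
            slot += 1
    heapq.heapify(free)

    done = Counter()
    finish = 0.0
    for t in runtimes:
        start, slot_id, name = heapq.heappop(free)
        end = start + t
        heapq.heappush(free, (end, slot_id, name))
        done[name] += 1
        finish = max(finish, end)
    return finish, done


if __name__ == "__main__":
    # Invented numbers: (processors, time at which the glide-in starts).
    sites = {"PSC": (4, 0.0), "NCSA": (8, 2.0), "SDSC": (2, 5.0)}
    finish, done = farm_out([1.0] * 100, sites)
    print(finish, dict(done))
```

Sites whose glide-ins start late still pick up work as soon as they join, without the user having to decide in advance how many work units each site receives.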

28
Merging GridShell with Condor
NCSA
SDSC
PSC