Title: Status of the UTA MC Production Farm and Its Software
1. Status of the UTA MC Production Farm and Its Software
- David Adams
- Karthik Gopalratnam
- Drew Meyer
- Tomasz Wlodek
- Jae Yu
2. UTA D0 Monte Carlo Farm
- UTA operates two Linux MC farms: HEP and CSE
- HEP farm: 6 566 MHz and 36 866 MHz processors, 3 file servers (250 GB), one job server, 8 mm tape drive
- CSE farm: 10 866 MHz processors, 1 file server (20 GB), 1 job server
- There is a possibility of adding a third farm (ACS, 36 866 MHz)
- A possibility of a fourth one emerged a few days ago
- Control software (job submission, load balancing, archiving, bookkeeping, job execution control, etc.) was developed entirely at UTA by a former UTA student, Drew Meyer
- Scalable: started with 7 processors, then 25, now 52
- http://www-hep.uta.edu/mcfarm/mcfarm/main.html
3. HEP Monte Carlo farm at UTA
4. MCFARM: the UTA farm control system
- MCFARM is a specialized batch system for Pythia, Isajet, D0gstar, D0sim, D0reco, and recoanalyze
- Can be adapted for ATLAS and CDF
- It is intelligent: it knows how to handle, and in most cases recover from, typical error conditions
- Hard to break: even if several nodes crash, production can continue for a few hours
- Interfaced to SAM and to a bookkeeping package (more about bookkeeping later)
- http://www-hep.uta.edu/mcfarm/mcfarm/main.html
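The error handling described above can be sketched as a retry loop around a production stage. This is a minimal illustration only; the function names and retry policy are assumptions, not the actual MCFARM code:

```python
# Illustrative sketch of MCFARM-style error recovery (hypothetical names,
# not the real MCFARM implementation).

def run_with_recovery(stage, max_retries=3):
    """Run one production stage, retrying on typical error conditions."""
    for attempt in range(1, max_retries + 1):
        try:
            return stage()
        except RuntimeError as err:
            # A real farm would classify the error (node crash, stale NFS
            # mount, ...) and decide whether a retry can succeed.
            print(f"attempt {attempt} failed: {err}")
    raise RuntimeError(f"stage failed after {max_retries} attempts")

# Toy stage that fails twice, then succeeds.
state = {"calls": 0}

def flaky_stage():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("simulated node crash")
    return "done"

result = run_with_recovery(flaky_stage)
```

This is why the farm is "hard to break": a crashed stage is retried rather than aborting the whole run.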
5. A couple of experimental groups in D0 have expressed interest in our software and plan to install it on their farms.
6. Main server (the job manager): can read and write to all other nodes; contains the executables and the job archive.

Execution node (the worker): mounts its home directory on the main server; can read and write to the file server disk.

File server: mounts /home on the main server; its disk stores min bias and generator files and is readable and writable by everybody.

Both the CSE and HEP farms share the same layout; they differ only in the number of nodes involved and in the software which exports completed jobs to their final destination. The layout is flexible enough to allow for farm expansion when new nodes become available.
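The three node roles above can be captured as plain data; the dictionary below is a hypothetical description based only on the slide, useful for seeing why adding nodes is cheap:

```python
# Hypothetical data model of the farm layout (roles follow the slide;
# nothing here is taken from the real mcfarm configuration files).
layout = {
    "main_server": {
        "role": "job manager",
        "holds": ["executables", "job archive"],
        "access": "read/write to all other nodes",
    },
    "execution_node": {
        "role": "worker",
        "mounts": ["home directory on main server"],
        "access": "read/write on file server disk",
    },
    "file_server": {
        "mounts": ["/home on main server"],
        "stores": ["min bias files", "generator files"],
        "access": "readable and writable by everybody",
    },
}

def add_workers(farm_nodes, count):
    # Expansion is just more worker entries; no other role changes.
    return farm_nodes + ["execution_node"] * count

hep_nodes = add_workers(["main_server", "file_server"], 21)
```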
7. D0 Monte Carlo production chain
Generator job (Pythia, Isajet, ...)
D0gstar (D0 GEANT)
Background events (prepared in advance)
D0sim (detector response)
D0reco (reconstruction)
SAM storage at FNAL
RecoA (ROOT tuple)
SAM storage at FNAL
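The chain above is a strictly sequential pipeline: each stage consumes the previous stage's output. A minimal sketch, with stand-in stage functions instead of the real Pythia/d0gstar/d0sim/d0reco binaries:

```python
# Toy version of the D0 MC production chain. Each function is a stand-in
# for a real executable; the tags only record which stages an event passed.

def generator(events):   return [f"gen({e})" for e in events]
def d0gstar(events):     return [f"geant({e})" for e in events]
def d0sim(events):       return [f"sim({e})" for e in events]    # detector response
def d0reco(events):      return [f"reco({e})" for e in events]   # reconstruction
def recoanalyze(events): return [f"tuple({e})" for e in events]  # ROOT tuple

CHAIN = [generator, d0gstar, d0sim, d0reco, recoanalyze]

def run_chain(events):
    for stage in CHAIN:
        events = stage(events)
    return events  # final output, shipped to SAM storage

out = run_chain(["evt1"])
```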
8. MC farm software daemons and their control
- WWW
- Root daemon
- Lock manager
- Bookkeeper
- Monitor daemon
- Distribute daemon
- Execute daemon
- Gather daemon
- Job archive
- Cache disk
- Tape
- SAM
- Remote machine
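One plausible reading of the distribute/execute/gather daemons is a hand-off pipeline over queues; the sketch below is an assumption based on the daemon names, not the real mcfarm code:

```python
# Hypothetical sketch of the distribute -> execute -> gather hand-off.
import queue

to_execute = queue.Queue()   # jobs waiting for a worker
to_gather = queue.Queue()    # finished jobs waiting to be collected
archive = []                 # stands in for the job archive / SAM export

def distribute(job):
    # Distribute daemon: assign a job to a worker's input queue.
    to_execute.put(job)

def execute():
    # Execute daemon: run the production chain on one job.
    job = to_execute.get()
    to_gather.put(job + ":done")

def gather():
    # Gather daemon: move completed output to the archive.
    archive.append(to_gather.get())

distribute("job42")
execute()
gather()
```

In the real system each daemon runs continuously on its own node, with the lock manager and monitor daemon coordinating access; here each runs once to show the flow.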
9. UTA cluster of Linux farms
- HEP farm at UTA (1 supervisor, 3 file servers, 21 workers, tape drive): 25 dual 866 MHz nodes, 8 mm tape; exports to SAM mass storage at FNAL via bb_ftp
- CSE farm at UTA (1 supervisor, 1 file server, 10 workers): 12 866 MHz nodes
- ACS farm at UTA (planned): 32 866 MHz nodes; bb_ftp (planned)
- UTA analysis server (300 GB)
- UTA www server
10. Production bookkeeping
- During a running period the farms produce a few thousand jobs
- Some jobs crash and need to be restarted
- Users must be kept up to date about the status of their MC requests (waiting? running? done?)
- Dedicated bookkeeping software is needed
11. Original bookkeeping
- Each farm server (e.g. on the HEP farm) runs a bookkeeper which keeps track of production progress
- Every few hours the bookkeeper compiles an HTML table with the production status and pushes it to the WWW server
- This works fine as long as you have a small number of farms! We need a Grid-enabled bookkeeper
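The bookkeeper's periodic report amounts to turning a list of request states into an HTML table. A minimal sketch, with made-up request names and field layout:

```python
# Sketch of the bookkeeper's HTML status table. The request names and the
# two-column layout are illustrative assumptions, not the real report format.

def status_table(jobs):
    """Render (request, status) pairs as an HTML table for the WWW server."""
    rows = "".join(
        f"<tr><td>{name}</td><td>{state}</td></tr>" for name, state in jobs
    )
    return f"<table><tr><th>request</th><th>status</th></tr>{rows}</table>"

html = status_table([
    ("mc_request_1", "running"),   # hypothetical request names
    ("mc_request_2", "done"),
])
```

The farm-local bookkeeper would write this page out every few hours; with many farms, every farm needs its own copy of this machinery, which is the scaling problem the next slides address.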
12. Before we started writing the Globus-based bookkeeper, we had to learn a little bit about Globus!

- Installed Globus 2.0-beta on the UTA MC farm
- Checked that we can communicate between the D0 and Atlas farms and that authentication works
- Executed simple "Hello World" programs between farms
- Executed shell scripts from farm to farm
- Executed simple Python scripts
- Executed Python scripts with module dependencies

Everything works, we are experts! Now we can write the Globus-enabled bookkeeper.
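A farm-to-farm "Hello World" of the kind listed above would be launched with the Globus 2 command-line tool globus-job-run. The sketch below only builds the argument list (the gatekeeper host name is a placeholder), so it runs without a Globus installation:

```python
# Building a globus-job-run invocation for a cross-farm hello-world test.
# globus-job-run takes a gatekeeper contact string, then the executable
# and its arguments. The host name below is a placeholder, not a real
# UTA gatekeeper address.

def hello_cmd(gatekeeper):
    return ["globus-job-run", gatekeeper, "/bin/echo", "hello world"]

cmd = hello_cmd("hep-farm.example.edu")
# On a machine with Globus installed, one would then run this command,
# e.g. with subprocess.run(cmd, check=True).
```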
13. The new, Globus-enabled bookkeeper
A machine on the Atlas farm runs the bookkeeper and the www server. It talks to the HEP and CSE farms inside the Globus domain via globus-job-run and GridFTP.
14. The new bookkeeper
- One dedicated bookkeeper machine can serve any number of MC production farms running the mcfarm software
- Communication with remote centers is done using Globus tools only
- No need to install the bookkeeper on every farm; this makes life simpler when many farms participate!
15. What next?
- Right now MC runs are submitted from every farm server
- I would like to start runs on the farms from the bookkeeping machine, via Globus
- In this world the bookkeeper becomes a supervisor of the farm servers, also known as the "King of the world"
16. Conclusions
- The UTA farm is very successful
- The UTA MCFARM software is solid and robust
- The first step towards Grid-enabling the farm clusters has been made