Title: Upgrade D0 farm
Slide 1: Upgrade D0 farm
Slide 2: Reasons for upgrade

- RedHat 7 needed for D0 software
- New versions of:
  - ups/upd v4_6
  - fbsng v1_3fp2_1
  - sam
- Use of farm for MC and analysis
- Integration in farm network
Slide 3: MC production on farm

- Input requests
- Request translated into an mc_runjob macro
- Stages:
  - mc_runjob on batch server (hoeve)
  - MC job on node
  - SAM store on file server (schuur)
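The three stages end up as one fbs job whose sections depend on each other (compare the job description file on slide 7). A minimal sketch of generating such a jdf; the `make_jdf` helper and the job directory are hypothetical, only the SECTION/EXEC/DEPEND layout follows the slides:

```python
def make_jdf(jobdir):
    """Build an fbs JDF with three dependent sections: mcc -> rcp -> sam.

    `jobdir` stands in for a request directory such as
    /d0gstar/curr/<job_name> (illustrative only)."""
    sections = [
        ("mcc", "batch",     "FastQ", None),         # run the MC chain on a node
        ("rcp", "batch_rcp", "IOQ",   "done(mcc)"),  # copy output to the file server
        ("sam", "batch_sam", "IOQ",   "done(rcp)"),  # declare/store files in SAM
    ]
    lines = []
    for name, script, queue, dep in sections:
        lines.append("SECTION %s" % name)
        lines.append("EXEC=%s/%s" % (jobdir, script))
        lines.append("NUMPROC=1")
        lines.append("QUEUE=%s" % queue)
        if dep:
            lines.append("DEPEND=%s" % dep)
        lines.append("STDOUT=%s/stdout_%s" % (jobdir, name))
        lines.append("STDERR=%s/stdout_%s" % (jobdir, name))
    return "\n".join(lines)

jdf = make_jdf("/d0gstar/curr/minbias-example")
```

The DEPEND lines are what serialize the stages: rcp only starts after mcc is done, sam only after rcp.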
Slide 4: [Diagram] MC production data flow. An mcc request goes to the farm
server, which submits an fbs job with three steps: 1 mcc, 2 rcp, 3 sam.
fbs(mcc) runs on the nodes (100 cpus, 40 GB data disk each); fbs(rcp,sam)
runs on the file server (1.2 TB). mcc input and output move between the
nodes and the file server (data), metadata goes to the SAM DB, and files go
to the datastore at FNAL or SARA. Control flows from the farm server.
Slide 5: [Diagram] Same data flow, with the SAM store decoupled: the fbs job
now has two steps (1 mcc, 2 rcp), and a cron job on the file server performs
the sam store to the SAM DB and the datastore (FNAL, SARA).
Slide 6: [Diagram] Who runs what. On hoeve, fbsuser runs mc_runjob (started
by cron) and does the fbs submit. On the node, fbsuser runs cp and mcc. On
schuur, fbsuser runs rcp (after a second fbs submit) and willem runs sam.
The diagram distinguishes data and control flows.
Slide 7: The fbs job description file for an mcc job:

  SECTION mcc
    EXEC=/d0gstar/curr/minbias-02073214824/batch
    NUMPROC=1
    QUEUE=FastQ
    STDOUT=/d0gstar/curr/minbias-02073214824/stdout
    STDERR=/d0gstar/curr/minbias-02073214824/stdout
  SECTION rcp
    EXEC=/d0gstar/curr/minbias-02073214824/batch_rcp
    NUMPROC=1
    QUEUE=IOQ
    DEPEND=done(mcc)
    STDOUT=/d0gstar/curr/minbias-02073214824/stdout_rcp
    STDERR=/d0gstar/curr/minbias-02073214824/stdout_rcp
Slide 8: The batch script (runs on the node):

  #!/bin/sh
  . /usr/products/etc/setups.sh
  cd /d0gstar/mcc/mcc-dist
  . mcc_dist_setup.sh
  mkdir -p /data/curr/minbias-02073214824
  cd /data/curr/minbias-02073214824
  cp -r /d0gstar/curr/minbias-02073214824/* .
  touch /d0gstar/curr/minbias-02073214824/.`uname -n`
  sh minbias-02073214824.sh `pwd` > log
  touch /d0gstar/curr/minbias-02073214824/`uname -n`
  /d0gstar/bin/check minbias-02073214824

The batch_rcp script (runs on schuur):

  #!/bin/sh
  i=minbias-02073214824
  if [ -f /d0gstar/curr/$i/OK ]
  then
    mkdir -p /data/disk2/sam_cache/$i
    cd /data/disk2/sam_cache/$i
    node=`ls /d0gstar/curr/$i/node*`
    node=`basename $node`
    job=`echo $i | awk '{print substr($0,length-8,9)}'`
    rcp -pr $node:/data/dest/d0reco/reco$job .
    rcp -pr $node:/data/dest/reco_analyze/rAtpl$job .
    rcp -pr $node:/data/curr/$i/Metadata/*.params .
    rcp -pr $node:/data/curr/$i/Metadata/*.py .
    rsh -n $node rm -rf /data/curr/$i
    rsh -n $node rm -rf /data/dest/*$job
    touch /d0gstar/curr/$i/RCP
  fi
Slide 9: The store script (runs on schuur, called by fbs or cron):

  #!/bin/sh
  locate()
  {
    file=`grep "import " import_$1_job.py | awk -F \" '{print $2}'`
    sam locate $file | fgrep -q $file
    return $?
  }
  . /usr/products/etc/setups.sh
  setup sam
  SAM_STATION=hoeve
  export SAM_STATION
  tosam=$1
  LIST=`cat $tosam`
  for job in $LIST
  do
    cd /data/disk2/sam_cache/$job
    list='gen d0g sim'
    for i in $list                  # declare gen, d0g, sim
    do
      until locate $i || (sam declare import_$i_job.py; locate $i)
      do
        sleep 60
      done
    done
    list='reco recoanalyze'
    for i in $list                  # store reco, recoanalyze
    do
      sam store --descrip=import_$i_job.py --source=`pwd`
      return=$?
      echo Return code sam store $return
    done
  done
  echo Job finished ...
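The until/sleep idiom in the store script (keep trying `locate`, declaring the file when it is missing) generalizes to a small polling helper. A sketch in Python; `wait_until` and the example state are hypothetical, only the retry pattern comes from the script:

```python
import time

def wait_until(predicate, action, interval=0.01, attempts=100):
    # Poll `predicate`; while it is false, run `action` and sleep, like the
    # store script's "until locate $i ... do sleep 60 done" loop.
    for _ in range(attempts):
        if predicate():
            return True
        action()
        time.sleep(interval)
    return False

# Example: each "declare" attempt brings the state one step closer.
state = {"declared": 0}
ok = wait_until(lambda: state["declared"] >= 3,
                lambda: state.update(declared=state["declared"] + 1))
```

A bounded number of attempts avoids the unbounded loop of the shell version, which spins forever if a declare keeps failing.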
Slide 10: Filestream

- Fetch input from sam
- Read input file from schuur
- Process data on node
- Copy output to schuur
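The filestream steps can be sketched as a loop over delivered files; the `filestream` function and the lambda stand-ins are hypothetical, only the fetch/process/copy-back order follows the slide:

```python
def filestream(files, fetch, process, copy_back):
    # Walk the file stream: fetch each input (from sam via schuur),
    # process it on the node, and copy the result back to schuur.
    outputs = []
    for name in files:
        local = fetch(name)        # fetch input / read input file from schuur
        result = process(local)    # process data on the node
        copy_back(result)          # copy output to schuur
        outputs.append(result)
    return outputs

# Toy usage with placeholder callables instead of rcp/d0exe:
results = filestream(
    ["evt_001", "evt_002"],
    fetch=lambda f: "/data/" + f,
    process=lambda p: p + ".out",
    copy_back=lambda p: None,
)
```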
Slide 11: [Diagram] Filestream job flow: on hoeve, mc_runjob (started by
cron) does the fbs submit and attaches the filestream; on the node, rcp and
d0exe run; on schuur, rcp and sam run after a second fbs submit. The diagram
distinguishes data and control flows.
Slide 12: Analysis on farm

- Stages:
  - Read files from sam
  - Copy files to node(s)
  - Perform analysis on node
  - Copy files to file server
  - Store files in sam
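The analysis stages form a chain that should abort as soon as one step fails (a failed copy must not be followed by a store). A sketch of that control flow; `run_stages` and the toy steps are hypothetical:

```python
def run_stages(stages):
    # Run the stages in order; stop at the first non-zero exit code,
    # mirroring how a failed copy should abort the analysis chain.
    for name, step in stages:
        rc = step()
        if rc != 0:
            return name, rc
    return None, 0

# Toy usage: every step succeeds, so no stage name is reported.
failed, rc = run_stages([
    ("sam+rcp", lambda: 0),   # read from sam, copy to node
    ("analyze", lambda: 0),   # perform analysis on node
    ("rcp+sam", lambda: 0),   # copy to file server, store in sam
])
```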
Slide 13: [Diagram] Analysis data flow: the farm server controls the job via
fbs; fbs(1) and fbs(3) run on the file server (1.2 TB), fbs(2) on a node
(100 cpus, 40 GB disk). Steps: sam + rcp (fetch), analyze, rcp + sam
(store). Metadata goes to the SAM DB; files move between the datastore
(FNAL, SARA) and the file server.
Slide 14: [Diagram] Analysis test setup on triviaal and node-2: willem runs
sam on triviaal to fetch the input; fbsuser runs rcp to node-2, the analysis
program on node-2, and rcp back (output); willem stores the result with sam.
Slide 15: batch.jdf and the batch_sam script:

  SECTION sam
    EXEC=/home/willem/batch_sam
    NUMPROC=1
    QUEUE=IOQ
    STDOUT=/home/willem/stdout
    STDERR=/home/willem/stdout

batch_sam:

  #!/bin/sh
  . /usr/products/etc/setups.sh
  setup sam
  SAM_STATION=triviaal
  export SAM_STATION
  sam run project get_file.py --interactive > log
  /usr/bin/rsh -n -l fbsuser triviaal rcp -r /stage/triviaal/sam_cache/boo node-2:/data/test >> log
Slide 16: [Diagram] Variant of the analysis flow in which the node does the
copying itself: fbs(1) and fbs(3) on the file server run sam; fbs(2) on the
node runs rcp, analyze, rcp. Otherwise as slide 13: farm server control via
fbs, metadata to the SAM DB, files to the datastore (FNAL, SARA).
Slide 17: [Diagram] Test setup for this variant: willem runs sam on
triviaal; fbsuser does the fbs submit, and on node-2 runs rcp, the analysis
program, and rcp back; willem stores the output with sam.
Slide 18: Submitting the node job remotely:

  rsh -l fbsuser triviaal fbs submit willem/batch_node.jdf

batch_node.jdf:

  SECTION sam
    EXEC=/d0gstar/batch_node
    NUMPROC=1
    QUEUE=FastQ
    STDOUT=/d0gstar/stdout
    STDERR=/d0gstar/stdout

batch_node:

  #!/bin/sh
  uname -a
  date
Slide 19: Chained version, where the sam section submits the analysis job:

  SECTION ana
    EXEC=/d0gstar/batch_node
    NUMPROC=1
    QUEUE=FastQ
    STDOUT=/d0gstar/stdout
    STDERR=/d0gstar/stdout

  SECTION sam
    EXEC=/home/willem/batch
    NUMPROC=1
    QUEUE=IOQ
    STDOUT=/home/willem/stdout
    STDERR=/home/willem/stdout

The batch script:

  #!/bin/sh
  . /usr/products/etc/setups.sh
  setup fbsng
  setup sam
  SAM_STATION=triviaal
  export SAM_STATION
  sam run project get_file.py --interactive > log
  /usr/bin/rsh -n -l fbsuser triviaal fbs submit /home/willem/batch_node.jdf

The batch_node script:

  #!/bin/sh
  rcp -pr server:/stage/triviaal/sam_cache/boo /data/test
  . /d0/fnal/ups/etc/setups.sh
  setup root -q KCC_4_0:exception:opt:thread
  setup kailib
  root -b -q /d0gstar/test.C

test.C:

  {
    gSystem->cd("/data/test/boo");
    gSystem->Exec("pwd");
    gSystem->Exec("ls -l");
  }
Slide 20: get_file.py:

  # This file sets up and runs a SAM project.
  import os, sys, string, time, signal
  from re import *
  from globals import *
  import run_project
  from commands import *
  # Set the following variables to appropriate values
  # Consult database for valid choices
  sam_station = "triviaal"
  # Consult database for valid choices
  project_definition = "op_moriond_p1014"
  # A particular snapshot version, last or new
  snapshot_version = 'new'
  # Consult database for valid choices
  appname = "test"
  version = "1"
  group = "test"
  # The maximum number of files to get from sam
  max_file_amt = 5
  # for additional debug info use "--verbose"
  verbosity = "--verbose"
  verbosity = ""
  # Give up on all exceptions
  give_up = 1
  def file_ready(filename):
      # Replace this python subroutine with whatever you want to do to
      # process the file that was retrieved. This function will only be
      # called in the event of a successful delivery.
      print "File ",filename," has been delivered!"
      os.system('cp ' + filename + ' /stage/triviaal/sam')
      return
Slide 21: Disk partitioning on hoeve

  /d0
  /fnal
  /mcc
  /fbsng
  /mcc-dist
  /mc_runjob
  /d0dist
  /d0usr
  /curr

Symbolic links:

  /fnal -> /d0/fnal
  /d0usr -> /fnal/d0usr
  /d0dist -> /fnal/d0dist
  /usr/products -> /fnal/ups
Slide 22: ana_runjob

- Is analogous to mc_runjob
- Creates and submits analysis jobs
- Input:
  - get_file.py with SAM project name
    - Project defines files to be processed
  - analysis script
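ana_runjob's inputs can be combined by specialising the get_file.py template: fill in the project and make `file_ready()` hand each delivered file to the analysis script. A sketch; `make_get_file_config` is a hypothetical helper, not the real ana_runjob code:

```python
def make_get_file_config(project, analysis_cmd):
    # Generate a get_file.py fragment: the SAM project selects the files
    # to process, and file_ready() runs the user's analysis script on
    # every delivered file. Illustrative only.
    lines = [
        'project_definition = "%s"' % project,
        "snapshot_version = 'new'",
        "def file_ready(filename):",
        "    os.system('%s ' + filename)" % analysis_cmd,
        "    return 1",
    ]
    return "\n".join(lines)

cfg = make_get_file_config("op_moriond_p1014", "./analyze.sh")
```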
Slide 23: Integration with grid (1)

- At present separate clusters
  - D0, LHCb, Alice, DAS cluster
- hoeve and schuur in farm network
Slide 24: [Diagram] Present network layout: hefnet and ajax, router to
surfnet; hoeve, schuur and the nodes on a switch; NFS between the servers
and the nodes.
Slide 25: [Diagram] New network layout: hefnet, ajax, and a lambda
connection into the farm router; booder and three switches serving the LHCb,
D0 and alice nodes, with hoeve and schuur attached; NFS between the servers
and the nodes.
Slide 26: [Diagram] New network layout (continued): as slide 25, with the
das-2 cluster also attached to the farm router.
Slide 27: Server tasks

- hoeve
  - software server
  - farm server
- schuur
  - fileserver
  - sam node
- booder
  - home directory server
  - in backup scheme
Slide 28: Integration with grid (2)

- Replace fbs with pbs or condor
  - pbs on Alice and LHCb nodes
  - condor on das cluster
- Use EDG installation tool LCFG
- Install d0 software with rpm
  - Problem with sam (uses ups/upd)
Slide 29: Integration with grid (3)
- Package mcc in rpm
- Separate programs from working space
- Use cfg commands to steer mc_runjob
- Find better place for card files
- Input structure now created on node
Slide 30: Grid job

The submit script:

  #!/bin/sh
  macro=$1
  pwd=`pwd`
  cd /opt/fnal/d0/mcc/mcc-dist
  . mcc_dist_setup.sh
  cd $pwd
  dir=/opt/fnal/d0/mcc/mc_runjob/py_script
  python $dir/Linker.py script=$macro

The PBS job that runs it:

  [willem@tbn09 willem]$ cat test.pbs
  # PBS batch job script
  #PBS -o /home/willem/out
  #PBS -e /home/willem/err
  #PBS -l nodes=1
  # Changing to directory as requested by user
  cd /home/willem
  # Executing job as requested by user
  ./submit minbias.macro
Slide 31: RunJob class for grid

  class RunJob_farm(RunJob_batch):
      def __init__(self, name=None):
          RunJob_batch.__init__(self, name)
          self.myType = "runjob_farm"

      def Run(self):
          self.jobname = self.linker.CurrentJob()
          self.jobnaam = string.splitfields(self.jobname, '/')[-1]
          comm = 'chmod +x ' + self.jobname
          commands.getoutput(comm)
          if self.tdconf['RunOption'] == 'RunInBackground':
              RunJob_batch.Run(self)
          else:
              bq = self.tdconf['BatchQueue']
              dirn = os.path.dirname(self.jobname)
              print dirn
              comm = 'cd ' + dirn + '; sh ' + self.jobnaam + ' `pwd` > stdout'
              print comm
              runcommand(comm)
Slide 32: To be decided
- Location of minimum bias files
- Location of MC output
Slide 33: Job status

- Job status is recorded in:
  - fbs
  - /d0/mcc/curr/<job_name>
  - /data/mcc/curr/<job_name>
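The status in the job directory can be read back from the marker files the batch scripts touch (OK when the MC step finished, RCP when the copy to the file server finished). A sketch; `job_status` is a hypothetical helper, only the marker names follow the batch and batch_rcp scripts:

```python
import os

def job_status(jobdir):
    # Infer the job's stage from marker files in /d0/mcc/curr/<job_name>:
    # RCP means the output was copied to the file server, OK means the
    # MC step finished, otherwise the job is still running.
    if os.path.exists(os.path.join(jobdir, "RCP")):
        return "copied to file server"
    if os.path.exists(os.path.join(jobdir, "OK")):
        return "mcc finished"
    return "running"
```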
Slide 34: SAM servers

- On master node
  - station
  - fss
- On master and worker nodes
  - stager
  - bbftp