Title: Running Inspiral Workflows on the OSG
1Running Inspiral Work-flows on the OSG
- Britta Daudert
- LIGO-Pegasus Face to Face
- February 11/12 2008
G070489-00-R
2Work-flow Planning
- Abstract work-flow (Dax): resource independent, logical file names, no site selection
- Pegasus work-flow planner: finds physical file names, writes submit files
- Concrete work-flow (Dag): resource dependent, ready for submission to OSG site(s)
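The planning step can be pictured with a toy model. The sketch below is not Pegasus code: the catalogue dictionaries, the site name SITE_A, the file names, and the `plan` function are all hypothetical, and only illustrate how the logical names in an abstract work-flow could be resolved into physical paths for one chosen site.

```python
# Toy illustration of work-flow planning; NOT the Pegasus implementation.
# Every name, path, and catalogue entry below is hypothetical.

# Abstract work-flow: jobs refer only to logical names, no site is chosen.
abstract_workflow = [
    {"job": "tmpltbank", "executable": "lalapps_tmpltbank",
     "inputs": ["H1-frames.cache"]},
    {"job": "inspiral", "executable": "lalapps_inspiral",
     "inputs": ["H1-bank.xml"]},
]

# Stand-ins for a replica catalogue (logical file -> physical file per site)
# and a transformation catalogue (executable -> installed path per site).
replica_catalogue = {
    ("H1-frames.cache", "SITE_A"): "/data/ligo/H1-frames.cache",
    ("H1-bank.xml", "SITE_A"): "/data/ligo/H1-bank.xml",
}
transformation_catalogue = {
    ("lalapps_tmpltbank", "SITE_A"): "/opt/lalapps/bin/lalapps_tmpltbank",
    ("lalapps_inspiral", "SITE_A"): "/opt/lalapps/bin/lalapps_inspiral",
}


def plan(workflow, site):
    """Return a concrete work-flow: every logical name resolved for `site`."""
    concrete = []
    for job in workflow:
        exe_key = (job["executable"], site)
        if exe_key not in transformation_catalogue:
            raise RuntimeError(f"no executable {job['executable']} at {site}")
        concrete.append({
            "job": job["job"],
            "site": site,
            "executable": transformation_catalogue[exe_key],
            "inputs": [replica_catalogue[(name, site)] for name in job["inputs"]],
        })
    return concrete


concrete_workflow = plan(abstract_workflow, "SITE_A")
for job in concrete_workflow:
    print(job["job"], "->", job["executable"], job["inputs"])
```

Asking the toy planner for a site that has no catalogue entries raises an error, which is roughly the situation behind the planner failures described later in the talk.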
3Running Workflows on the OSG
(Diagram: Dax to Dag)
4What Pegasus needs
- 1) The dax file, say inspiral-0.dax (abstract work-flow)
- 2) The site catalogue sites.xml (site layout)
- 3) The transformation catalogue tc.data (location of executables)
- 4) The property file properties.bundle.STAR_BNL (Pegasus config file)
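Before planning it can help to confirm that all four inputs are in place. The following is a minimal sketch, assuming the four files sit together in one directory; the helper `missing_inputs` is hypothetical and the file names are simply the ones used in this talk.

```python
import tempfile
from pathlib import Path

# File names as used in this talk; the helper itself is a hypothetical sketch.
REQUIRED_INPUTS = {
    "inspiral-0.dax": "abstract work-flow",
    "sites.xml": "site catalogue",
    "tc.data": "transformation catalogue",
    "properties.bundle.STAR_BNL": "Pegasus property file",
}


def missing_inputs(directory="."):
    """Return the required planner inputs that are absent from `directory`."""
    base = Path(directory)
    return [name for name in REQUIRED_INPUTS if not (base / name).is_file()]


# Demonstration in a scratch directory that holds only the two catalogues.
with tempfile.TemporaryDirectory() as demo_dir:
    (Path(demo_dir) / "sites.xml").touch()
    (Path(demo_dir) / "tc.data").touch()
    still_missing = missing_inputs(demo_dir)

for name in still_missing:
    print(f"missing {REQUIRED_INPUTS[name]}: {name}")
```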
5How to get what Pegasus needs
- 1) The Dax is generated with the LIGO software Glue, Lal and Lalapps
- 2) and 3) tc.data and sites.xml are generated from the command line:
- pegasus-get-sites --source vors --grid osg
- pegasus-get-sites queries VORS: VO Resource Selector
- http://vors.grid.iu.edu/cgi-bin/index.cgi
- 4) The property file is written manually (by me)
6Pegasus-plan and pegasus-run
- From the command line we generate the dag:
pegasus-plan -Dpegasus.user.properties=./properties.bundle.STAR_BNL --dir ./test_STAR_BNL --sites STAR_BNL --output local --dax inspiral-0.dax
- --dir: storage of submit files
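When the same planning step is repeated for several sites, the command line can be assembled programmatically. This is a minimal sketch, assuming the property file is passed in the usual Java form `-Dname=value`; the function `pegasus_plan_command` is hypothetical, and only the options that appear on the slide are used.

```python
import shlex


def pegasus_plan_command(properties, submit_dir, site, dax, output_site="local"):
    """Assemble the pegasus-plan argument list shown on the slide."""
    return [
        "pegasus-plan",
        f"-Dpegasus.user.properties={properties}",
        "--dir", submit_dir,      # where the submit files are written
        "--sites", site,          # execution site
        "--output", output_site,  # where output data is staged
        "--dax", dax,             # abstract work-flow
    ]


argv = pegasus_plan_command(
    properties="./properties.bundle.STAR_BNL",
    submit_dir="./test_STAR_BNL",
    site="STAR_BNL",
    dax="inspiral-0.dax",
)
command_line = " ".join(shlex.quote(arg) for arg in argv)
print(command_line)
```

On a machine where pegasus-plan is installed, `argv` could be handed to `subprocess.run(argv, check=True)`; the sketch only prints the command so that it can be inspected first.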
7Pegasus-plan and pegasus-run(2)
I have concretized your abstract workflow. The workflow has been entered into the workflow database with a state of "planned". The next step is to start or execute your workflow. The invocation required is
pegasus-run -Dpegasus.user.properties=/usr2/bdaudert/test_STAR_BNL/bdaudert/pegasus/inspiral/run0001/pegasus.32712.properties --nodatabase /usr2/bdaudert/test_STAR_BNL/bdaudert/pegasus/inspiral
8Pegasus-plan and pegasus-run(3)
- And all that is left to do to submit the work-flow to STAR_BNL is to copy and paste to the command line:
pegasus-run -Dpegasus.user.properties=/usr2/bdaudert/test_STAR_BNL/bdaudert/pegasus/inspiral/run0001/pegasus.32712.properties --nodatabase /usr2/bdaudert/test_STAR_BNL/bdaudert/pegasus/inspiral
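Instead of copying and pasting by hand, the suggested invocation could be pulled out of the planner's message. The sketch below is an illustration under assumptions: the sample text is modelled on the message shown in this talk, the path /path/to/submit is a placeholder, and `extract_run_command` is a hypothetical helper that assumes the invocation sits on a single line.

```python
import re
import shlex

# Sample planner output modelled on the message shown in this talk.
# /path/to/submit is a placeholder, not a real location.
planner_output = """\
I have concretized your abstract workflow. The workflow has been entered
into the workflow database with a state of "planned". The next step is
to start or execute your workflow. The invocation required is

pegasus-run -Dpegasus.user.properties=/path/to/submit/run0001/pegasus.32712.properties --nodatabase /path/to/submit/run0001
"""


def extract_run_command(text):
    """Return the pegasus-run invocation as an argument list, or None."""
    match = re.search(r"^\s*(pegasus-run\b.*)$", text, flags=re.MULTILINE)
    return shlex.split(match.group(1)) if match else None


run_argv = extract_run_command(planner_output)
print(run_argv)
```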
9The Inspiral Pipeline On UWMilwaukee vs Local Cluster
10Monitoring with Gratia
11Challenges and Solutions
Large Data Sets
- Long Data Transfer Times
- Disk Space Problems
12Challenges and Solutions
Large Data Sets
- Compressing data (LIGO)
- Restructure Work-flow (LIGO)
- SRM (OSG)
13Challenges and Solutions
Work-flow design
- Cleanup issues
- Disk space problems
14Challenges and Solutions
Work-flow design
- Dynamic Cleanup (Pegasus 2.0)
- Depth First (Condor)
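Why cleanup and depth-first ordering ease the disk-space problem can be seen in a toy simulation. The sketch below is not Pegasus or Condor code: the job names, file sizes (arbitrary units), and the `peak_disk_usage` function are hypothetical. It runs jobs one at a time and deletes each intermediate file as soon as its last consumer has finished.

```python
def peak_disk_usage(schedule, produces, consumes):
    """Simulate sequential execution with cleanup after the last consumer.

    schedule: job names in execution order
    produces: job -> {file name: size}
    consumes: job -> [file names read]
    """
    # Count how many scheduled jobs still need each file.
    remaining_readers = {}
    for job in schedule:
        for name in consumes.get(job, []):
            remaining_readers[name] = remaining_readers.get(name, 0) + 1

    on_disk, used, peak = {}, 0, 0
    for job in schedule:
        for name, size in produces.get(job, {}).items():
            on_disk[name] = size
            used += size
        peak = max(peak, used)
        for name in consumes.get(job, []):
            remaining_readers[name] -= 1
            if remaining_readers[name] == 0:
                used -= on_disk.pop(name)  # cleanup removes the file
    return peak


# Four independent chains: a bank job writes a large file that one
# inspiral job reads before writing a small result (hypothetical sizes).
chains = range(4)
produces, consumes = {}, {}
for i in chains:
    produces[f"bank{i}"] = {f"bank{i}.xml": 10}
    produces[f"inspiral{i}"] = {f"trig{i}.xml": 1}
    consumes[f"inspiral{i}"] = [f"bank{i}.xml"]

breadth_first = [f"bank{i}" for i in chains] + [f"inspiral{i}" for i in chains]
depth_first = [job for i in chains for job in (f"bank{i}", f"inspiral{i}")]

breadth_peak = peak_disk_usage(breadth_first, produces, consumes)
depth_peak = peak_disk_usage(depth_first, produces, consumes)
print("breadth-first peak:", breadth_peak)
print("depth-first peak:", depth_peak)
```

In this toy model the breadth-first order keeps every large intermediate file on disk at once, while the depth-first order finishes and cleans up one chain before starting the next.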
15When things go wrong
Error analysis
- Britta
- Condor_G/Globus
- Pegasus
- LIGO software
16When things go wrong (2)
- Britta Errors
- Check for typos
- Check tc.data, sites.xml for correct locations of executables
17When things go wrong (3)
- Pegasus Errors
- java.lang.RuntimeException: There are no entries for the sites
  - Site was down at time of VORS query or site is not supporting LIGO
- java.lang.RuntimeException: Site Selector could not map the job ligo::lalapps_tmpltbank
  - Executable lalapps_tmpltbank was compiled on a 32-bit machine but listed as 64-bit
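A mismatch between the listed and the actual architecture of an executable can be caught before planning by looking at the binary itself. This is a minimal sketch, assuming the executables are ELF files (the usual format on Linux): byte 4 of an ELF header is the class field, 1 for 32-bit and 2 for 64-bit. The function names and the `listed_bits` argument are hypothetical; how the architecture is spelled in tc.data is not modelled here.

```python
ELF_MAGIC = b"\x7fELF"
ELF_CLASS_BITS = {1: 32, 2: 64}  # EI_CLASS values ELFCLASS32 and ELFCLASS64


def elf_word_size(header):
    """Return 32 or 64 for the first bytes of an ELF file, else None."""
    if len(header) < 5 or header[:4] != ELF_MAGIC:
        return None
    return ELF_CLASS_BITS.get(header[4])


def executable_word_size(path):
    """Read just enough of the file at `path` to classify it."""
    with open(path, "rb") as handle:
        return elf_word_size(handle.read(5))


def catalogue_mismatch(listed_bits, header):
    """True when the catalogue entry disagrees with the binary itself."""
    actual = elf_word_size(header)
    return actual is not None and actual != listed_bits


# Synthetic 5-byte header standing in for a 32-bit build of an executable.
fake_32bit_header = ELF_MAGIC + b"\x01"
print(elf_word_size(fake_32bit_header))
print(catalogue_mismatch(64, fake_32bit_header))
```

For a real check, `executable_word_size` would be pointed at the installed path of the executable on the site in question.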
18When things go wrong (4)
- LIGO Software issues
- Error messages in .out of failed job could look like
- XLAL Error XLALFrGenerateCache Gap in Frame data found
  - Bug in glue software (3 months to identify and fix)
- Or
- XLAL Error XLALFrGenerateCache No Frame Files found in ./gwf
  - Data corruption during file transfer (re-submit)
  - WNs can't see files in DATA
  - Condor issue???
19When things go wrong (5)
- Condor_G/Globus errors
- Recently, at STAR_BNL
- XLAL Error XLALFrGenerateCache No Frame Files found in ./gwf
  - Condor upgrade on gatekeeper needed!!!
  - Open GOC ticket, troubleshooting with sys admin, config changes, testing
  - Took 2 months to identify and resolve
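The experience from these failures can be kept as a small lookup from message fragment to the causes seen so far. The sketch below is only an illustration: the fragments and causes are the ones reported in this talk, while the table layout and the `possible_causes` function are hypothetical. As the talk shows, one message can have several unrelated causes, so the result is a list of candidates rather than a diagnosis.

```python
# Message fragments and causes as reported in this talk; the matching
# logic is a hypothetical sketch, not part of Pegasus, Condor, or LAL.
KNOWN_FAILURES = [
    ("There are no entries for the sites",
     ["site was down at the time of the VORS query",
      "site is not supporting LIGO"]),
    ("Site Selector could not map the job",
     ["executable compiled on a 32-bit machine but listed as 64-bit"]),
    ("Gap in Frame data found",
     ["bug in glue software"]),
    ("No Frame Files found",
     ["data corruption during file transfer (re-submit)",
      "worker nodes cannot see files in DATA",
      "Condor upgrade needed on the gatekeeper"]),
]


def possible_causes(log_text):
    """Return every previously seen cause whose message fragment appears."""
    causes = []
    for fragment, reasons in KNOWN_FAILURES:
        if fragment in log_text:
            causes.extend(reasons)
    return causes


sample = "XLAL Error XLALFrGenerateCache No Frame Files found in ./gwf"
sample_causes = possible_causes(sample)
for cause in sample_causes:
    print("possible cause:", cause)
```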