Title: Grid Usecase BioMed
1Grid Usecase BioMed
- How to get biologists to compute
Surfnet / Grid Tutorial
Jan Bot
2Who am I
- Graduated March 2008
- Bioinformatics group TU Delft
- BioAssist programmer
- Happy grid user
- Working on the grid as part of the TU Delft / NKI collaboration
  - Chris Klijn: human copy number variation
  - Jeroen de Ridder: viral insertions in mice
3DNA Genes
4Copy number variation Viral insertions
- Pieces of DNA can be added, deleted, moved or removed
- Viruses can insert themselves into a genome
- This causes all kinds of problems, for example cancer
- Multiple mutations are needed before a tumor starts to develop
5aCGH data
- Array comparative genomic hybridization
- Compare DNA of sample against a reference
6KCSmart Datasets
- Leukaemia and lymphoma cell lines
- aCGH data (10k affy) from the Sanger Institute
- Same samples measured on 1.8M SNP6
- 105 cell-line samples
- About 350 MB of data
7KCSmart Overview
For each tumor we construct a pair-wise space by
comparing each chromosome arm with each other
chromosome arm. A point in this space is a pair
of genomic loci.
8KCSmart Compute Co-occurrence Score
Using a 2D Gaussian kernel we look for local enrichment of high
scores in the pairwise space. Peaks in the convolved space allow
us to define two genomic loci that can be said to be co-aberrated
to a certain degree.
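The smoothing step can be illustrated with a minimal sketch in plain Python. It is not the KCSmart implementation: it assumes the pairwise scores sit in a small dense matrix, measures the kernel width in grid cells rather than base pairs, and the function names are made up for this example.

```python
import math


def gaussian_kernel_2d(sigma, radius):
    """Return a normalised (2*radius+1) x (2*radius+1) Gaussian kernel."""
    size = 2 * radius + 1
    kernel = [[0.0] * size for _ in range(size)]
    total = 0.0
    for i in range(size):
        for j in range(size):
            dx, dy = i - radius, j - radius
            value = math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))
            kernel[i][j] = value
            total += value
    return [[v / total for v in row] for row in kernel]


def convolve_2d(scores, kernel):
    """Smooth a 2D score matrix; cells outside the matrix count as 0.

    The kernel is symmetric, so correlation and convolution coincide.
    """
    rows, cols = len(scores), len(scores[0])
    radius = len(kernel) // 2
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0.0
            for i, kernel_row in enumerate(kernel):
                rr = r + i - radius
                if not 0 <= rr < rows:
                    continue
                for j, weight in enumerate(kernel_row):
                    cc = c + j - radius
                    if 0 <= cc < cols:
                        acc += weight * scores[rr][cc]
            out[r][c] = acc
    return out


def find_peak(smoothed):
    """Return (row, col, value) of the highest cell in the smoothed space."""
    value, row, col = max(
        (value, r, c)
        for r, matrix_row in enumerate(smoothed)
        for c, value in enumerate(matrix_row)
    )
    return row, col, value


# Toy pairwise space: three neighbouring high scores and one isolated one.
scores = [[0.0] * 7 for _ in range(7)]
for r, c in [(2, 4), (2, 5), (3, 4), (6, 0)]:
    scores[r][c] = 1.0
smoothed = convolve_2d(scores, gaussian_kernel_2d(sigma=1.0, radius=2))
peak_row, peak_col, peak_value = find_peak(smoothed)
```

In this toy input the peak lands inside the cluster of neighbouring scores rather than on the isolated one, which is the "local enrichment" effect the kernel is meant to pick up.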
9KCSmart Parameters (1)
- Chromosome arms
  - Natural split at the centromere to better divide the work load
  - Not all p-arms contain measurements (39 out of 44)
- Resolution
  - 'Grid points' are fixed on the genome
  - Location of the grid points, and thus the computational
    complexity, doesn't change when using different datasets
  - Measurements are allocated to grid points
  - Tried this for 20, 25, 35, 50 kbp
  - Choice based on the best resolution which still fits in memory
[Figure: 10k data and 1.8M data allocated to the same grid]
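A rough sketch of the fixed-grid idea, under stated assumptions: probes are assigned to the nearest grid point (the slides only say measurements are "allocated"), the pairwise space is stored as a dense matrix of 8-byte values, and all function names are hypothetical.

```python
from collections import defaultdict


def grid_points(arm_start, arm_end, resolution):
    """Fixed grid positions (in bp) on one chromosome arm."""
    return list(range(arm_start, arm_end + 1, resolution))


def allocate(measurements, arm_start, resolution):
    """Map each (position, value) probe to the index of its nearest grid point.

    Ties halfway between two grid points go to the higher one.
    """
    bins = defaultdict(list)
    for position, value in measurements:
        index = (position - arm_start + resolution // 2) // resolution
        bins[index].append(value)
    return dict(bins)


def pairwise_space_bytes(arm_len_a, arm_len_b, resolution, bytes_per_cell=8):
    """Memory needed for a dense pairwise space of two arms (assumption: doubles)."""
    points_a = arm_len_a // resolution + 1
    points_b = arm_len_b // resolution + 1
    return points_a * points_b * bytes_per_cell


# A sparse and a dense dataset end up on the same grid.
grid = grid_points(0, 100_000, 20_000)
sparse = allocate([(9_999, 0.1), (45_000, -0.3)], 0, 20_000)
dense = allocate([(1_000, 0.1), (9_999, 0.2), (10_000, 0.0), (45_000, -0.3)], 0, 20_000)
```

Because the grid depends only on the arm length and the resolution, the cost of the pairwise space is the same for the 10k and the 1.8M arrays; a helper like `pairwise_space_bytes` gives a quick way to check whether a candidate resolution could fit in the available memory.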
10KCSmart Parameters (2)
- Scale
  - The kernel width in base pairs
  - Capture changes on different scales
  - 0.2, 2, 10, 20 Mbp (6 sigma)
- Amplification type
  - Either insertion or deletion
  - All possible combinations for two chromosomes
  - ins-ins, del-del, ins-del, del-ins
  - (ins = amplification, del = loss)
11KCSmart Getting the Parameters Right
- 10k data to estimate memory consumption and running times
- Find the best resolution and scale that still fit in 2.3 GB of memory
- Final parameters
  - chr: 1.0, 1.5, ..., 22.5
  - res: 20000
  - scale: 0.2, 2, 10, 20
  - amp: 'ins-ins', 'del-del', 'ins-del', 'del-ins'
- Roughly 10k jobs (without the jobs required for finding the
  correct parameter settings!)
- All parameters generated using a Python script
- In a jdl it looks like
    Parameters = {"19.5 15.5 2 1 20000", "2.5 4.0 2 1 20000"};
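A parameter generator along these lines might look as follows. This is a sketch, not the script used for the talk: the field order (arm, arm, scale, amplification type, resolution) is inferred from the two example strings, encoding the amplification type as an index 1 to 4 is an assumption, and each unordered pair of arms is emitted once.

```python
from itertools import combinations_with_replacement, product

# Chromosome arms 1.0, 1.5, ..., 22.5 (assumption: .0 = p-arm, .5 = q-arm).
ARMS = [chrom + half for chrom in range(1, 23) for half in (0.0, 0.5)]
SCALES = [0.2, 2, 10, 20]  # kernel widths in Mbp
AMPS = ["ins-ins", "del-del", "ins-del", "del-ins"]
RESOLUTION = 20000


def parameter_strings():
    """Return one space-separated parameter string per job."""
    jobs = []
    arm_pairs = combinations_with_replacement(ARMS, 2)
    amp_codes = range(1, len(AMPS) + 1)  # hypothetical numeric encoding
    for (arm_a, arm_b), scale, amp in product(arm_pairs, SCALES, amp_codes):
        jobs.append(f"{arm_a} {arm_b} {scale} {amp} {RESOLUTION}")
    return jobs


def jdl_parameters(jobs):
    """Format the job strings as a JDL Parameters list."""
    quoted = ", ".join(f'"{job}"' for job in jobs)
    return f"Parameters = {{{quoted}}};"


jobs = parameter_strings()
jdl_line = jdl_parameters(jobs[:2])
```

Enumerating all 44 arms this way produces more combinations than the roughly 10k jobs mentioned in the talk; arms without measurements would have to be filtered out first.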
12KCSmart Output
- +/- 10k files
- 7.5 GB of 'peak-info'
- 1 TB of raw data
- Problems with the grid
  - once you have all the scripts in place to run jobs it's easy
    to create more output than a biologist can analyze
  - once the biologist has some results he'll ask you to do it
    again (and again...)
13KCSmart Results 10k data
14KCSmart Results 1.8m data
15KCSmart Results 1.8M data
Found a known deletion pair (T-cell receptor): the method works.
16KCSmart Future work
- Higher resolution (once we have 64 bit WNs)
- Smaller scale
- Mutual exclusiveness tests
- Run on real tumor dataset
17Matlab jobs
- Compile code using Matlab (on a UI), run using MCR
- Add ctf and executable to the input sandbox
    InputSandbox = {"kcsmart_topos.sh","kcsmart_large.bin","kcsmart_large_run.ctf","curl.gz"};
- Add 'require code' to the jdl
    Requirements = Member("lsgmcr-7.5", other.GlueHostApplicationSoftwareRunTimeEnvironment);
- Load module on the WN
    module load mcr
- Call executable
18Job status tracking problems
- How do you check which jobs failed?
- Use output files as indicators
    lcg-ls lfn:///grid/lsgrid/jbot/chris_large/output/ > output.txt
    cat output.txt | /code/chris/check_missing.pl > to_do.txt
- Copy subset of parameters to jdl file
- Submit job again
- This takes too long!
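The "output files as indicators" check boils down to a set difference between the jobs that were submitted and the output files that exist. The sketch below shows that idea in Python; it is not the Perl script from the slide, and the naming scheme (parameter string with underscores plus a `.out` suffix) is purely hypothetical.

```python
def expected_names(parameters, suffix=".out"):
    """Map each expected output file name to its parameter string.

    The naming scheme is a made-up example, not the one used in the talk.
    """
    return {params.replace(" ", "_") + suffix: params for params in parameters}


def missing_parameters(parameters, listing_lines):
    """Return the parameter strings whose output file is not in the listing."""
    present = {
        line.strip().rsplit("/", 1)[-1]
        for line in listing_lines
        if line.strip()
    }
    expected = expected_names(parameters)
    return [params for name, params in expected.items() if name not in present]


submitted = ["19.5 15.5 2 1 20000", "2.5 4.0 2 1 20000", "1.0 1.5 10 3 20000"]
listing = [
    "/grid/lsgrid/jbot/chris_large/output/19.5_15.5_2_1_20000.out\n",
    "1.0_1.5_10_3_20000.out\n",
]
to_do = missing_parameters(submitted, listing)
```

The resulting list can be pasted into the `Parameters` attribute of a new jdl file for resubmission.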
19The Annoyances glite-wms-job-
- glite-wms-job-status
  - It barely tells me anything (unless I specified error codes myself)
- I would rather know
  - the number of failed / running jobs
  - the error output or the parameters with which this job was run
- Use with grep and awk
    glite-wms-job-status job-ids > status.txt
    cat status.txt | gawk '{prev=$7; getline; if ($0 ~ /Exit\ Code/) print prev}'
- Output
    https://wms.grid.sara.nl:9000/ztINwkKvTJfKnUuZBTYs_g

    Status info for the Job : https://wms.grid.sara.nl:9000/ztINwkKvTJfKnUuZBTYs_g
    Current Status:     Done (Exit Code !=0)
    Exit code:          1
    Status Reason:      Warning: job exit code != 0
    Destination:        gb-ce-lumc.lumc.nl:2119/jobmanager-pbs-medium
    Submitted:          Sun Sep 7 21:24:56 2008 CEST
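The wish list on this slide (how many jobs failed or are running, and which ones) can be met by parsing the saved status text. The following is a sketch that assumes the layout of the status block shown on this slide, with "Status info for the Job" and "Current Status" lines; the function names are invented for the example.

```python
import re
from collections import Counter

JOB_RE = re.compile(r"^Status info for the Job\s*:\s*(\S+)")
STATUS_RE = re.compile(r"^Current Status:\s*(.+?)\s*$")


def parse_status(text):
    """Return (job_id, status) pairs found in saved job-status output."""
    jobs, current = [], None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        job = JOB_RE.match(line)
        if job:
            current = job.group(1)
            continue
        status = STATUS_RE.match(line)
        if status and current is not None:
            jobs.append((current, status.group(1)))
            current = None
    return jobs


def summarise(jobs):
    """Count jobs per status string."""
    return Counter(status for _, status in jobs)


def failed_jobs(jobs):
    """Job ids whose status reports a non-zero exit code or an abort."""
    return [
        job_id
        for job_id, status in jobs
        if "Exit Code !=0" in status or status.startswith("Aborted")
    ]


sample = """
Status info for the Job : https://wms.grid.sara.nl:9000/ztINwkKvTJfKnUuZBTYs_g
Current Status:     Done (Exit Code !=0)
Exit code:          1
Status info for the Job : https://wms.grid.sara.nl:9000/exampleSecondJob
Current Status:     Running
"""
parsed = parse_status(sample)
counts = summarise(parsed)
failed = failed_jobs(parsed)
```

The second job id in the sample is made up for illustration. Unlike the `getline` one-liner, this parser does not depend on the two relevant lines falling on a particular line pair.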
20The Annoyances glite-wms-job-
- glite-wms-job-cancel
  - Does not recursively cancel jobs stored in a file
- Fix
    glite-wms-job-status -i jobs.txt | grep 'http' | gawk '{print $7}' > to_cancel.txt
    glite-wms-job-cancel -i to_cancel.txt

    Status info for the Job : https://wms.grid.sara.nl:9000/ztINwkKvTJfKnUuZBTYs_g
    Current Status:     Done (Exit Code !=0)
    Exit code:          1
    Status Reason:      Warning: job exit code != 0
    Destination:        gb-ce-lumc.lumc.nl:2119/jobmanager-pbs-medium
    Submitted:          Sun Sep 7 21:24:56 2008 CEST
21The Annoyances lcg-
- lcg-cr
  - Getting files to and from the SEs
  - What, lcg-cr doesn't always work?
  - On error: try again
  - No error: good to go, right?
  - Try copying the file back to the WN
- lcg-cp
  - Copying > 3000 files from a SE to the UI machine takes > 1 hour
  - Copying the same files over ssh (scp) to my (remote) machine: 2 minutes
  - Security overhead?
- Work-around
  - lcg-rec-cp: slow
  - custom script (do it in parallel): nasty
  - Neither works when the MCR is loaded
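The "try again" and "do it in parallel" work-arounds can be combined in one small helper. This is a sketch rather than the custom script from the talk: it assumes `lcg-cp` is on the path and that the VO is configured in the environment, and the retry count and worker count are arbitrary defaults.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_copy(source, destination):
    """Copy one file with lcg-cp; True when the command exits with code 0.

    Assumption: the VO is set in the environment, so no extra options are needed.
    """
    result = subprocess.run(["lcg-cp", source, destination])
    return result.returncode == 0


def copy_with_retry(source, destination, copy=run_copy, attempts=3):
    """Try a copy up to `attempts` times; True as soon as one attempt succeeds."""
    for _ in range(attempts):
        if copy(source, destination):
            return True
    return False


def copy_all(pairs, copy=run_copy, attempts=3, workers=8):
    """Copy (source, destination) pairs in parallel; return the pairs that failed."""
    pairs = list(pairs)

    def worker(pair):
        return copy_with_retry(pair[0], pair[1], copy=copy, attempts=attempts)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(worker, pairs))
    return [pair for pair, ok in zip(pairs, results) if not ok]
```

Passing the copy function as an argument keeps the retry and parallel logic testable without grid access. A zero exit code alone does not prove the file arrived intact, which is why the slide suggests copying it back as a check.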
22ToPoS
- Main developer: Pieter van Beek
- WebDav Tokens pilot job
- Instead of submitting one job at a time, claim a
(bunch of) computer(s) until all jobs are done
23ToPoS Overview
[Diagram with three parties: User, ToPoS Server, The Grid]
(1) Job tokens
(2) Pilot Jobs
(3) Job Request
(4) Job Token
(5) Job Output
(6) All Output
24Token renewal
[Flowchart. Nodes: Pilot job, Submit, Running pilot job, Get unused
token, Pilot job with token, Execute token task, Affirm token use,
Finished? (yes / no), Delete token]
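The pilot-job loop suggested by the flowchart labels can be sketched with an in-memory token pool. This is an illustration of the pattern only: `TokenPool` is a hypothetical stand-in and not the ToPoS API, the order of steps is one reading of the flowchart, and the "affirm token use" renewal step is left out.

```python
import threading


class TokenPool:
    """In-memory stand-in for a token server (hypothetical, not the ToPoS API)."""

    def __init__(self, tasks):
        self._lock = threading.Lock()
        self._unused = dict(enumerate(tasks))  # token id -> task description
        self._claimed = {}

    def claim(self):
        """Atomically hand out one unused token, or None when none are left."""
        with self._lock:
            if not self._unused:
                return None
            token_id, task = self._unused.popitem()
            self._claimed[token_id] = task
            return token_id, task

    def delete(self, token_id):
        """Remove a token whose task finished successfully."""
        with self._lock:
            self._claimed.pop(token_id, None)

    def release(self, token_id):
        """Return a token to the pool so it can be retried."""
        with self._lock:
            if token_id in self._claimed:
                self._unused[token_id] = self._claimed.pop(token_id)


def pilot_job(pool, execute, max_failures=3):
    """Claim and execute tokens until no unused token is left."""
    done, failures = [], {}
    while True:
        claimed = pool.claim()
        if claimed is None:
            return done
        token_id, task = claimed
        try:
            execute(task)
        except Exception:
            failures[token_id] = failures.get(token_id, 0) + 1
            if failures[token_id] < max_failures:
                pool.release(token_id)  # give the task another chance
            continue
        pool.delete(token_id)
        done.append(task)
```

Several `pilot_job` loops can share one pool, which mirrors the idea of claiming a bunch of computers until all jobs are done: a failed task simply goes back into the pool instead of requiring a manual resubmission.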
25ToPoS Conclusion
- Advantages
  - Easy output handling using Curl with atomic operations
  - Handles failed jobs
  - Less overhead
  - Able to dynamically add or remove nodes
  - Easy to re-run jobs
  - Easy access to output
- Disadvantages
  - Little / no security
  - Some overhead at the end of a run (unless you're reserving tokens)
- Feature request: progress bar
26Fixing the difficulties LEARN BASH!
- diff is your friend
  - Useful to transfer missing files to and from SE
- grep
  - Useful for querying status of jobs (use with the -c option)
- (g)awk
  - Handy to cancel jobs
- Redirect output to file and push processes to background
  - lcg-ls is a typical example
27Why not let the biologist do it?
- Resources needed to get this working on the grid
  - +/- 180 replies from grid support
  - +/- 100 messages exchanged with the biologists
  - Many hours of work, mostly finding out about the 'quirks' of the software
- Advantage of making a programmer submit the jobs
  - One person to handle support
  - Re-use experience with other projects
28Some other tricks
- Nikhef does not 'advertise' the installed software
- Do your own load balancing (once the job is in a queue, it
  doesn't get re-scheduled)
  - Easy to do with the cancel-script shown previously
- Don't keep your stuff in home when on WNs, change directory to
  TMPDIR at the beginning of your script
- Keep in mind: once you have retrieved your job output it's gone
  from the grid
- Use startGridSession
- When using ToPoS make sure you land in the 'long' queue
29Thanks!
- Sara Grid Support
- Jeroen Engelberts
- Pieter van Beek
- Machiel Jansen
- NikHef
- Jan Just Keijser
- Collaborators
- Chris Klijn
- Jeroen de Ridder