Grid Usecase BioMed - PowerPoint PPT Presentation

1
Grid Usecase BioMed
  • How to get biologists to compute

Surfnet / Grid Tutorial
Jan Bot
2
Who am I
  • Graduated March 2008
  • Bioinformatics group TU Delft
  • BioAssist programmer
  • Happy grid user
  • Working on the grid as part of the TU Delft / NKI
    collaboration
      • Chris Klijn: human copy number variation
      • Jeroen de Ridder: viral insertions in mice

3
DNA Genes
4
Copy number variation & viral insertions
  • Pieces of DNA can be added, deleted or moved
  • Viruses can insert themselves into a genome
  • This causes all kinds of problems, for example
    cancer
  • Multiple mutations needed before a tumor starts
    to develop

5
aCGH data
  • Array comparative genomic hybridization
  • Compare DNA of sample against a reference

6
KCSmart: Datasets
  • Leukaemia & lymphoma cell lines
  • aCGH data (10k Affymetrix) from the Sanger Institute
  • Same samples measured on 1.8M SNP6
  • 105 cell-line samples
  • About 350 MB of data

7
KCSmart: Overview
For each tumor we construct a pair-wise space by
comparing each chromosome arm with each other
chromosome arm. A point in this space is a pair
of genomic loci.
8
KCSmart: Compute Co-occurrence Score
Using a 2D Gaussian kernel we look for local
enrichment of high scores in the pairwise space.
Peaks in the convolved space allow us to define
two genomic loci that can be said to be
co-aberrated to a certain degree.
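
The idea of kernel smoothing followed by peak picking can be sketched in a few lines of plain Python. This is only an illustration, not the KCSmart implementation (which the slides say is written in Matlab): the function names, the isotropic kernel and the brute-force grid search are all assumptions made for the example.

```python
import math

def smoothed_score(points, grid_x, grid_y, sigma):
    """Gaussian-weighted sum of scores around one grid location.

    points: iterable of (x, y, score) in the pairwise space, where x and y
            are genomic positions in base pairs (assumed representation).
    sigma:  standard deviation of the isotropic 2D Gaussian, in base pairs.
    """
    total = 0.0
    for x, y, score in points:
        d2 = (x - grid_x) ** 2 + (y - grid_y) ** 2
        total += score * math.exp(-d2 / (2.0 * sigma ** 2))
    return total

def find_peak(points, xs, ys, sigma):
    """Return (score, x, y) for the grid location with the highest score."""
    return max(
        (smoothed_score(points, gx, gy, sigma), gx, gy)
        for gx in xs
        for gy in ys
    )

# Two nearby high-scoring pairs and one isolated weaker pair (made-up data).
pairs = [(100, 200, 1.0), (110, 210, 1.0), (900, 900, 0.5)]
grid = range(0, 1001, 100)
score, x, y = find_peak(pairs, grid, grid, sigma=50)
print(x, y)
```

The two neighbouring pairs reinforce each other under the kernel, so the peak lands at their shared grid location rather than at the isolated pair.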
9
KCSmart: Parameters (1)
  • Chromosome arms
      • Natural split at the centromere to better divide
        the work load
      • Not all p-arms contain measurements (39 out of
        44 arms do)
  • Resolution
      • 'Grid points' are fixed on the genome
      • Location of the grid points, and thus the
        computational complexity, doesn't change when
        using different datasets
      • Measurements are allocated to grid points
      • Tried this for 20, 25, 35, 50 kbp
      • Choice based on the best resolution which still
        fits in memory
[Figure: measurements from the 10k data and the 1.8M data allocated to the same grid]
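
A minimal sketch of allocating measurements to fixed grid points follows. The slides do not say how the allocation is done, so nearest-grid-point assignment is an assumption here, as are the function name and the (position, value) representation of a measurement.

```python
from collections import defaultdict

def allocate_to_grid(measurements, resolution):
    """Assign each (position_bp, value) measurement to its nearest grid point.

    Grid points sit at multiples of `resolution` (in base pairs), so their
    locations depend only on the genome coordinate, never on the dataset.
    """
    grid = defaultdict(list)
    for position, value in measurements:
        # Integer arithmetic: add half a cell, then floor-divide.
        index = (position + resolution // 2) // resolution
        grid[index * resolution].append(value)
    return dict(grid)

# Made-up probe positions; 20 kbp resolution as in the final parameters.
probes = [(5000, 0.1), (19000, 0.3), (31000, -0.2)]
print(allocate_to_grid(probes, 20000))
```

Because the grid is tied to the genome, a sparse dataset and a dense one produce grids of the same size; only the number of values per grid point differs.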
10
KCSmart: Parameters (2)
  • Scale
      • The kernel width in base pairs
      • Capture changes on different scales
      • 0.2, 2, 10, 20 Mbp (6 sigma)
  • Amplification type
      • Either insertion or deletion
      • All possible combinations for two chromosomes
      • ins-ins, del-del, ins-del, del-ins
      • (ins = amplification, del = loss)

11
KCSmart: Getting the Parameters Right
  • 10k data to estimate memory consumption and
    running times
  • Find the best resolution and scale that still fit in
    2.3 GB of memory
  • Final parameters
      • chr = 1.0, 1.5, ..., 22.5
      • res = 20000
      • scale = 0.2, 2, 10, 20
      • amp = 'ins-ins', 'del-del', 'ins-del',
        'del-ins'
  • Roughly 10k jobs (without the jobs required for
    finding the correct parameter settings!)
  • All parameters generated using a python script
  • In a jdl it looks like:
    Parameters = {"19.5 15.5 2 1 20000", "2.5 4.0 2 1 20000"};
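
The original generator script is not shown, so the following is a hedged sketch of what such a script might look like. The field order (arm, arm, scale, amplification-type index, resolution) is inferred from the example strings on the slide, and the encoding of 1.0 as the p-arm and 1.5 as the q-arm of chromosome 1 is an assumption.

```python
from itertools import combinations, product

def chromosome_arms(n_chromosomes=22):
    """Arms encoded as 1.0, 1.5, ..., 22.5 (assumed: .0 = p-arm, .5 = q-arm)."""
    return [c + half for c in range(1, n_chromosomes + 1) for half in (0.0, 0.5)]

def parameter_lines(arms, scales=(0.2, 2, 10, 20), n_amp_types=4, res=20000):
    """One parameter string per (arm pair, scale, amplification type)."""
    lines = []
    amp_types = range(1, n_amp_types + 1)
    for (a, b), scale, amp in product(combinations(arms, 2), scales, amp_types):
        lines.append(f"{a:.1f} {b:.1f} {scale:g} {amp} {res}")
    return lines

lines = parameter_lines(chromosome_arms())
print(len(lines))
print(lines[0])
```

Removing arms without measurements from the `arms` list before calling `parameter_lines` shrinks the job count accordingly.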

12
KCSmart: Output
  • +/- 10k files
  • 7.5 GB of 'peak-info'
  • 1 TB of raw data
  • Problems with the grid
      • once you have all the scripts in place to run
        jobs, it's easy to create more output than a
        biologist can analyze
      • once the biologist has some results he'll ask you
        to do it again (and again...)

13
KCSmart: Results 10k data
14
KCSmart: Results 1.8M data
15
KCSmart: Results 1.8M data
Found a known deletion pair (T-cell receptor): the
method works.
16
KCSmart: Future work
  • Higher resolution (once we have 64-bit WNs)
  • Smaller scale
  • Mutual exclusiveness tests
  • Run on real tumor dataset

17
Matlab jobs
  • Compile code using Matlab (on a UI), run using
    MCR
  • Add ctf & executable to the input sandbox:
    InputSandbox = {"kcsmart_topos.sh", "kcsmart_large.bin",
    "kcsmart_large_run.ctf", "curl.gz"};
  • Add 'require code' to the jdl:
    Requirements = Member("lsgmcr-7.5",
    other.GlueHostApplicationSoftwareRunTimeEnvironment);
  • Load module on WN: module load mcr
  • Call executable
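
Since the parameters are already generated with a Python script, the JDL fragment can be produced the same way. The sketch below only assembles the attributes named on the slide; treating `kcsmart_topos.sh` as the `Executable` is an assumption, and a complete JDL file would need further attributes that the slides do not show.

```python
def jdl_list(items):
    """Format a Python list as a JDL list of quoted strings."""
    return "{" + ", ".join(f'"{item}"' for item in items) + "}"

def build_jdl(executable, sandbox_files, runtime_tag):
    """Return a partial JDL description as text (not a complete job file)."""
    lines = [
        f'Executable = "{executable}";',
        f"InputSandbox = {jdl_list(sandbox_files)};",
        # Only match sites that publish this runtime-environment tag.
        f'Requirements = Member("{runtime_tag}", '
        "other.GlueHostApplicationSoftwareRunTimeEnvironment);",
    ]
    return "\n".join(lines)

print(build_jdl(
    "kcsmart_topos.sh",  # assumed to be the wrapper script that gets executed
    ["kcsmart_topos.sh", "kcsmart_large.bin", "kcsmart_large_run.ctf", "curl.gz"],
    "lsgmcr-7.5",
))
```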

18
Job status tracking problems
  • How do you check which jobs failed?
  • Use output files as indicators:
    lcg-ls lfn:///grid/lsgrid/jbot/chris_large/output/ > output.txt
    cat output.txt | /code/chris/check_missing.pl > to_do.txt
  • Copy subset of parameters to jdl file
  • Submit job again
  • This takes too long!
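
The check for missing output boils down to a set difference between the jobs that were submitted and the output files that exist. The `check_missing.pl` script itself is not shown, so this is a sketch of the idea only; in particular the output naming scheme in `example_name` is hypothetical.

```python
def missing_jobs(parameter_lines, output_names, name_for):
    """Return the parameter strings whose expected output file is absent.

    name_for: function mapping a parameter string to its output file name.
    """
    present = set(output_names)
    return [p for p in parameter_lines if name_for(p) not in present]

def example_name(params):
    """Hypothetical naming scheme: parameters joined by underscores."""
    return "_".join(params.split()) + ".out"

submitted = ["19.5 15.5 2 1 20000", "2.5 4.0 2 1 20000"]
listing = ["19.5_15.5_2_1_20000.out"]  # e.g. the lines of output.txt
print(missing_jobs(submitted, listing, example_name))
```

The returned list can be pasted straight back into the `Parameters` attribute of a new jdl file.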

19
The Annoyances: glite-wms-job-*
  • glite-wms-job-status
      • It barely tells me anything (unless I specified
        error codes myself)
      • I would rather know
          • the number of failed / running jobs
          • the error output or the parameters with which
            this job was run
  • Use with grep & awk
      • glite-wms-job-status job-ids > status.txt
      • cat status.txt | gawk '{prev=$7; getline; if ($0 ~ /Exit\ Code/) print prev}'
      • Output: https://wms.grid.sara.nl:9000/ztINwkKvTJfKnUuZBTYs_g

Status info for the Job : https://wms.grid.sara.nl:9000/ztINwkKvTJfKnUuZBTYs_g
Current Status:     Done (Exit Code !=0)
Exit code:          1
Status Reason:      Warning: job exit code != 0
Destination:        gb-ce-lumc.lumc.nl:2119/jobmanager-pbs-medium
Submitted:          Sun Sep 7 21:24:56 2008 CEST
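
The summary the slide asks for (how many jobs failed, how many are running) can be computed from saved status output. The sketch below assumes the text contains line pairs of the form `Status info for the Job : <url>` and `Current Status: <status>`, as in the excerpt above; the function names are made up for this example.

```python
import re
from collections import Counter

JOB_RE = re.compile(r"Status info for the Job\s*:\s*(\S+)")
STATUS_RE = re.compile(r"Current Status\s*:\s*(.+)")

def parse_status(text):
    """Return a list of (job_id, status) pairs found in status output."""
    jobs = []
    current = None
    for line in text.splitlines():
        match = JOB_RE.search(line)
        if match:
            current = match.group(1)
            continue
        match = STATUS_RE.search(line)
        if match and current is not None:
            jobs.append((current, match.group(1).strip()))
            current = None
    return jobs

def summarize(jobs):
    """Count jobs per status string."""
    return Counter(status for _, status in jobs)

def failed(jobs):
    """Job ids whose status mentions an exit code, like the gawk filter."""
    return [job_id for job_id, status in jobs if "Exit Code" in status]
```

Typical use would be `jobs = parse_status(open("status.txt").read())`, followed by `summarize(jobs)` for the counts and `failed(jobs)` for the ids to resubmit.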
20
The Annoyances: glite-wms-job-*
  • glite-wms-job-cancel
      • Does not recursively cancel jobs stored in a file
  • Fix
      • glite-wms-job-status -i jobs.txt | grep 'http' | gawk '{print $7}' > to_cancel.txt
      • glite-wms-job-cancel -i to_cancel.txt
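
The same extraction can be done without counting awk fields, by pulling every job URL out of the status text. This is a small sketch under the assumption that job identifiers are the only `https://` URLs in the output; `job_ids` is a name chosen for the example.

```python
import re

def job_ids(status_text):
    """Return the distinct https:// job URLs in order of first appearance."""
    seen = []
    for url in re.findall(r"https://\S+", status_text):
        if url not in seen:
            seen.append(url)
    return seen

text = (
    "Status info for the Job : https://wms.example.org:9000/aaa\n"
    "Current Status:     Scheduled\n"
    "Status info for the Job : https://wms.example.org:9000/bbb\n"
    "Current Status:     Running\n"
)
print("\n".join(job_ids(text)))
```

Writing the result to a file, one id per line, gives the input for the cancel command.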

21
The Annoyances: lcg-*
  • lcg-cr
      • Getting files to and from the SEs
      • What, lcg-cr doesn't always work?
      • On error: try again
      • No error: good to go, right?
      • Try copying the file back to the WN
  • lcg-cp
      • Copying > 3000 files from an SE to the UI machine
        takes > 1 hour
      • Copying the same files over ssh (scp) to my
        (remote) machine: 2 minutes
      • Security overhead?
  • Work-around
      • lcg-rec-cp: slow
      • custom script (do it in parallel): nasty
      • Neither works when the MCR is loaded
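
Both work-arounds on this slide, retrying on error and copying in parallel, fit in a short script. The sketch below is generic on purpose: it runs whatever command lists it is given and does not assume any particular lcg-* options. The helper names and the optional `verify` callback (standing in for "try copying the file back") are assumptions for illustration.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

def run_with_retry(cmd, attempts=3, verify=None):
    """Run cmd (a list of arguments) until it succeeds or attempts run out.

    A zero exit code alone is not trusted when `verify` is given: the
    optional callable must also return True for the attempt to count.
    """
    for _ in range(attempts):
        if subprocess.run(cmd).returncode == 0 and (verify is None or verify()):
            return True
    return False

def run_all(commands, workers=8, attempts=3):
    """Run many commands in parallel; return the ones that still failed."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda c: run_with_retry(c, attempts), commands))
    return [cmd for cmd, ok in zip(commands, results) if not ok]

# Stand-in commands; in practice each entry would be one copy command.
ok = [sys.executable, "-c", "import sys; sys.exit(0)"]
bad = [sys.executable, "-c", "import sys; sys.exit(1)"]
print(len(run_all([ok, bad, ok], workers=2, attempts=2)))
```

Threads are sufficient here because each worker only waits on an external process.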

22
ToPoS
  • Main developer: Pieter van Beek
  • WebDAV, tokens, pilot jobs
  • Instead of submitting one job at a time, claim a
    (bunch of) computer(s) until all jobs are done

23
ToPoS: Overview
[Diagram: User, ToPoS Server and The Grid, connected by numbered steps]
  (1) Job tokens
  (2) Pilot jobs
  (3) Job request
  (4) Job token
  (5) Job output
  (6) All output
24
[Flowchart: life cycle of a pilot job. Box labels: Submit, Running pilot job, Get unused token, Pilot job with token, Token renewal, Execute token task, Affirm token use, Delete token, Finished? (yes / no)]
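
The pilot-job loop from the flowchart can be sketched with an in-memory token pool. This is not the ToPoS API, which the slides do not show: `TokenPool`, its methods and the `pilot` function are stand-ins invented for the example, and token renewal is left out.

```python
import threading

class TokenPool:
    """In-memory stand-in for a token server (not the ToPoS API)."""

    def __init__(self, tokens):
        self._free = list(tokens)
        self._claimed = set()
        self._lock = threading.Lock()

    def claim(self):
        """Hand out an unused token, or None when no free tokens remain."""
        with self._lock:
            if not self._free:
                return None
            token = self._free.pop(0)
            self._claimed.add(token)
            return token

    def delete(self, token):
        """The task finished: the token is gone for good."""
        with self._lock:
            self._claimed.discard(token)

    def release(self, token):
        """The task failed: make the token available to another pilot."""
        with self._lock:
            if token in self._claimed:
                self._claimed.remove(token)
                self._free.append(token)

def pilot(pool, task):
    """Keep claiming and executing tokens until none are left.

    If a task raises, the token is released and this pilot stops, which
    models a crashed pilot job whose work is picked up by another one.
    """
    done = []
    while True:
        token = pool.claim()
        if token is None:
            return done
        try:
            task(token)
        except Exception:
            pool.release(token)
            return done
        pool.delete(token)
        done.append(token)

pool = TokenPool(["a", "b", "c"])
print(pilot(pool, lambda token: None))
```

Deleting a token only after its task has finished is what makes failed jobs cheap to handle: an unfinished token simply stays in the pool.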
25
ToPoS: Conclusion
  • Advantages
      • Easy output handling using curl with atomic
        operations
      • Handles failed jobs
      • Less overhead
      • Able to dynamically add or remove nodes
      • Easy to re-run jobs
      • Easy access to output
  • Disadvantages
      • Little / no security
      • Some overhead at the end of a run (unless you're
        reserving tokens)
  • Feature request: progress bar

26
Fixing the difficulties: LEARN BASH!
  • diff is your friend
      • Useful to transfer missing files to and from the SE
  • grep
      • Useful for querying the status of jobs (use with the
        -c option)
  • (g)awk
      • Handy to cancel jobs
  • Redirect output to file and push processes to the
    background
      • lcg-ls is a typical example

27
Why not let the biologist do it?
  • Resources needed to get this working on the grid
      • +/- 180 replies from grid support
      • +/- 100 messages exchanged with the biologists
      • Many hours of work, mostly finding out about the
        'quirks' of the software
  • Advantage of making a programmer submit the jobs
      • One person to handle support
      • Re-use experience with other projects

28
Some other tricks
  • Nikhef does not 'advertise' the installed
    software
  • Do your own load balancing (once a job is in a
    queue, it doesn't get re-scheduled)
      • Easy to do with the cancel script shown
        previously
  • Don't keep your stuff in $HOME when on WNs;
    change directory to $TMPDIR at the beginning of
    your script
  • Keep in mind: once you have retrieved your job
    output, it's gone from the grid
  • Use startGridSession
  • When using ToPoS, make sure you land in the
    'long' queue
29
Thanks!
  • SARA Grid Support
      • Jeroen Engelberts
      • Pieter van Beek
      • Machiel Jansen
  • Nikhef
      • Jan Just Keijser
  • Collaborators
      • Chris Klijn
      • Jeroen de Ridder