BioMake - PowerPoint PPT Presentation

About This Presentation
Title:

BioMake

Description:

Wildcards allow basic 'parameterization' of make rules ... Be on guard make sure no target is created if the action fails, otherwise you ... – PowerPoint PPT presentation

Number of Views:54
Avg rating:3.0/5.0
Slides: 34
Provided by: skamSour
Category:
Tags: biomake | make

less

Transcript and Presenter's Notes

Title: BioMake


1
BioMake
  • Specifying biocompute pipelines with Makefiles
    and beyond
  • Chris Mungall
  • BDGP

2
There is no escaping pipelines in bioinformatics
  • A pipeline is any specification of a set of tasks
    which are interdependent
  • Compute pipelines
  • genomic, transcriptomic,proteomic and others
  • Data transformation pipelines
  • db exports, generating xml/html, data warehouses
  • Laboratory Information Management Systems
    (LIMS).....

3
A Simple Compute Pipeline
MultiFasta
FastaSeq
  • A project consists of
  • targets/goals (e.g files)
  • dependencies between targets (e.g blast requires
    formatted db)
  • actions (e.g run blastall)
  • Chaining of targets (e.g formatdb run before
    blastall)
  • Similar to metabolic pathways (actions as
    reactions programs as catalysts)

formatdb
repeatmask
BlastIndex
MaskedSeq
genscan
blastall
GenscanResults
BlastResults
bop
key
cheesy analogy
SourceFile (user supplied)
GAME-XML
NonSource (derived)
4
First Attempt Shell Pipes
generalised form
actionA input-file actionB - actionC - gt
final-target rmask myseq.fst blastx mydb -
bop - gt myseq.genscan rmask myseq.fst genscan -
bop - gt myseq.genscan cat glucose hexo-kinase
- fructose-bis-kinase - gt glucose-2-6-p
  • Unix shell pipe '' feeds the output of one
    program in as the input to another hides
    intermediate files (e.g glucose-6-p)
  • OK for manual use, less good for full automation
  • Linear dependency flows only pipelines are
    graphs
  • Not all programs accept '-' as an input need
    intermediate tmp files
  • no re-use of previous results (full path
    recompute!)
  • no knowledge of when to re-compute final target
  • e.g. redo blast if blastdb changes

5
Second Attempt write a script
  • Code pipeline in an imperative language
    (java/perl/C/C/python/etc)
  • explicitly code rules such as
  • don't recompute if we have output file already
  • Repetitive, redundant code unaesthetic
  • Underlying logic becomes obscured once we start
    extending things (asynchronous execution via PBS,
    comparing timestamps to determine if a recompute
    is required)

sub blastpipe (mydb, myseq) formatdb mydb
unless -f mydb rmask myseq gt myseq.m
unless -f myseq.m blastout
blast-mydb-myseq.out blastx mydb myseq gt
blastout unless -f blastout
6
Third Attempt Objects
class BlastJob extends Job method run
seq self-gtseq mydb
self-gtmydb j new FormatDBJob(seq)
self-gtcheck(j)
self-gtshell(blastx mydb seq)
  • Attempts to abstract and reuse code
  • Similar problems to scripting
  • Different approaches, but end results are very
    complex (eg gadfly-pipe, biopipe, ensembl) and
    (IMHO) somewhat unsatisfactory
  • Crux of problem mixing code/rules with
    data/config

7
Back to basics Makefiles
  • Makefiles are used in a variety of contexts
  • Mostly for building and compiling applications
  • .class file depends on .java file
  • .jar file depends on list of .class files
  • avoids unnecessary re-compiling uses timestamps
  • A Makefile is a set of rules
  • Each rule has a target, dependencies (subtargets)
    and action(s)
  • Targets are files, actions are unix shell commands

8
Anatomy of a makefile
make masked sequence myseq.m myseq rmask
myseq gt myseq.m run blast on masked
seq blastout mydb.psq myseq.m blastx mydb
myseq.m gt blastout echo ran blast! index
blastable db mydb.psq mydb formatdb -p T
mydb rules follow this pattern target
subtarget1, ..., subtargetN shell command
1 shell command 2...
  • our starting point is the file myseq, the end
    point is the blast results blastout
  • we first want to mask out any repeats using rmask
    to create myseq.m
  • we then blastx myseq.m against a protein db
    called mydb
  • before blastx is run the protein db must be
    indexed using formatdb

9
The make command
make blastout formatdb -p T mydb rmask
myseq.fst gt myseq.m blastx mydb myseq.m gt
blastout make blastout make 'blastout' is up
to date cat newseqs gtgt mydb make
blastout formatdb -p T mydb blastx mydb myseq.m gt
blastout
run blast on masked seq blastout mydb.psq
myseq.m blastx mydb myseq.m gt blastout echo
ran blast! index blastable db mydb.psq
mydb formatdb -p T mydb make masked
sequence myseq.m myseq rmask myseq gt myseq.m
  • make uses unix file modification timestamps when
    checking dependencies
  • if a subtarget is more recent than the goal
    target, then re-execute action

10
If compounds were files and cells were makefiles
things might look a bit like this
cheesy analogy
glucose-2-6-p synthesis from glucose via
glucose-6-p using fructose-bis-kinase and
hexo-kinase-p enymes as catalysts (all
reactions carried out by unix command react)
final step glucose-2-6-p glucose-6-p
fructose-bis-kinase react -catalyst
fructose-bis-kinase glucose-6-p gt glucose-2-6-p
initial step glucose-6-p glucose
hexo-kinase-p react -catalyst hexo-kinase-p
glucose gt glucose-6-p
11
Get wild with makefiles
  • The preceding examples used hardcoded targets
  • what if we want to use the same logic over
    multiple targets (eg multiple input sequence
    fasta files and/or multiple blast databases)?
  • Makefiles allow this via wildcards
  • Wildcards allow basic 'parameterization' of make
    rules
  • Syntax is a little strange even stranger than
    perl

12
Pattern matching examples
  • A is a wildcard match on the target/dependency
    file
  • A in the action line(s) is replaced with the
    match
  • A _at_ in the action line is replaced by the
    target/goal
  • Concise but ugly!
  • We can iterate through lots of input seqs

blastout_.m mydb.psq .m blastx mydb .m gt
_at_ .m .fst rmask .fst gt _at_
make blastout_seq1.m rmask seq1.fst gt
seq1.m blastx mydb seq1.m gt \ blastout_seq1.m
13
Defensive makefile logic
blastout_.m mydb.psq .m blastx mydb .m gt
_at_.tmp mv _at_.tmp _at_ .m .fst rmask .fst
gt _at_.tmp mv _at_.tmp _at_
  • Be on guard make sure no target is created if
    the action fails, otherwise you get rubbish all
    the way downstream
  • With the unix shell construct foo1 foo2, foo2
    will not be run if foo1 exits with non-zero
    status (error)
  • Things are looking even uglier! (but safer)

14
(some) Makefile limitations
  • Wildcard problems difficult to have more than
    one wildcard parameter for a rule
  • blast needs at least 2 fastadb and input
    sequence
  • Dependencies inherently tied to filesystem
  • Logic is fundamentally synchronous (always waits
    for actions to finish)
  • not good for managing remotely executed tasks
  • Nasty syntax takes a while to master (and is
    easily forgotten)
  • No programming language constructs
  • Too dumb/low-level leads to repetitive patterns

One side effect of these limitations is the large
number of makefile makers these typically take a
makefile templates and automatically generate one
or more makefiles. This takes us back to evil
scripting logic! Despite this, makefiles are an
invaluable tool. But can they be improved?
15
Alternatives to make
  • nmake
  • minor improvements (aimed at software
    maintenance)
  • scons
  • build
  • apache ant
  • XML/java based - focused mainly on java code
    maintenance (does this very well)
  • Nothing that would suffice for a full-on genomic
    pipeline.........?

16
Some things that would be nice
  • Abstract away from filesystem
  • Specify Goals and requirements as parameterized
    functions

blastout_.m mydb.psq .m blastx mydb .m gt
_at_
NO!
maskblastout(DB,Seq) formatdb(DB)
rmask(Seq) blastx DB rmask(Seq) gt target()
YES!
17
BioMake using skam
  • Skam Skolem Asynchronous Make
  • Functional-Logical Makefile replacement
  • Lightweight (400 lines of prolog code! prolog
    does most of the work)
  • Like makefiles, but with function constructs (aka
    skolems)
  • Works with filesystem, databases, GH-style
    datastores (URLs?)
  • Sync and Async dependency logic
  • works with remote execution e.g PBS
  • Fully programmable other nice features (see
    later)
  • As generic as makefiles no bioinformatics logic
  • BioMake
  • A set of rules useful for bioinformatics pipelines

18
Skam is based on skolem function constructs
  • May be recursively applied functions as trees
  • A skolem function uniquely identifies a target /
    piece of data for example

xml2html(bop(genscan(rmask(MySeqID))))
  • This skolem identifies the data that comes from
    repeatmasking MySeqID, putting it through
    genscan, feeding the results to bop, then
    transforming the resulting GAME XML to HTML
  • In the example above, MySeqID is a variable

19
Tangent Other uses of skolems not necessarily
related to pipelines
  • Uniquely identifying features typed by SO
  • general form for splices sites, exons and introns
  • donor(GeneID, Number)
  • acceptor(GeneID, Number)
  • intron(DonorID,AcceptorID)
  • exon(AcceptorID,init,DonorID,term)
  • derived features can be uniquely and persistently
    identified with nested functions
  • intron(donor(CG1234, 4),acceptor(CG1234, 7))
  • exon(acceptor(CG1234, 7), term)


primitive

derived
20
Example Skolem Rule
mblastx(Seq,DB) flat blast_results/Seq.DB.bla
stx req formatdb(DB) rmask(Seq) srun
blastx -filter SEGXNU DB rmask(Seq) comment
this target is for the results of running blastx
on a masked input genomic
sequence (wublast)
  • skolem function followed by tag value metadata
  • variables always begin with UpperCase

21
formatdb(DB) req DB run formatdb DB
comment prepares blastdb for blasting
(wublast) rmask(Seq) flat masked_seqs/Seq.mas
ked req Seq srun RepeatMasker -lib
(LIB) Seq comment masks out repeats from
input sequence mblastx(Seq,DB) flat
blast_results/Seq.DB.blastx req formatdb(DB)
rmask(Seq) srun blastx -filter SEGXNU DB
rmask(Seq) comment this target is for the
results of running blastx on
a masked input genomic sequence (wublast)
22
Skolem Specifications
  • Each rule concerns a skolem function construct
  • eg sim4(Seq,DB) similar to makefile target
  • Each goal has metadata tagvalue pairs, including
  • flat How to flatten skolem function (to map to
    filesystem/URL/whatever)
  • unix makefiles conflate this with the target
  • req required SubGoals (also specified as
    functions)
  • just like makefile dependencies
  • (s)run Shell command for generating target
  • just like makefile action
  • Other meta tags allowed (...)

23
BioMake Data
  • relational (ie tabular) data can be included
  • use xml-like ltdatagt...lt/datagt tags
  • implemented as prolog facts
  • useful for
  • specifying metadata on local fasta databases
  • logical name, file path, SO type, alphabet,
    species
  • we can use this to intelligently automate analyses

24
Example data section
ltdata relation"fastadb"
delimiter"ws" spec"fastadb(ID,SeqAlph
abet,SOType,BopType,Taxon)"
comment"formal description of local multifasta
datasources"gt na_pe.dros
na transposable_element P_Element dmel ne_EST.a
ll_nr.dros na EST EST dmel na_gb.dros na
mRNA mRNA dmel na_cDNA.dros na
mRNA cDNA dmel na_DGC.dros na
EST EST dmel na_ARGs.dros na
mRNA mRNA dmel na_users.dros na
mRNA mRNA dmel lt/datagt
25
Controlling BioMake logic with Rules
  • Data can be queried with prolog rules
  • Useful for
  • Adding logic/decisions/code to specifications
  • figuring out which blast program to run against
    which fastadb
  • Or for determinig bop type
  • Querying ontologies eg SO
  • Skolem targets can be qualified with prolog
    constraints in squiggly braces ...

26
ltprologgt exprdb(D)- fastadb(D,na,'mRNA'). SO
type mRNA is expressed exprdb(D)-
fastadb(D,na,'snRNA'). SO type snRNA is
expressed exprdb(D)- fastadb(D,na,'EST').
SO type EST is expressed only use sim4
against fastadbs of expressed seq from
dmel sim4db(D)- exprdb(D),fastadb(D,_,_,_,dmel).
lt/prologgt all_sim4(Seq) req L
setof(sim4(S,D)),sim4db(D),L)
comment the requirement for this target is to
sim4 the sequence S against
all databases D such that sim4db(D) is true
i.e. All databases of
expressed drosophila melanogaster
sequence. We do this with the prolog
predicate setof, which
finds all solutions to a problem.
27
Runmodes
  • Default runmode is 'sync' (runs locally, waits
    until sh command is done before proceeding just
    like makefiles)
  • PBS wrapper also provided
  • Extensible you can add other runmodes
  • run in background, locally
  • rsh to remote machine and run

28
Specifying Runmodes
specify runmode as a USER-VARIABLE can be
overridden on the command line (RUNMODE)
async(submit('qsub -q qresearch1_at_bam')) mblastx(S
eq,DB) flat blast_results/Seq.DB.blastx
req formatdb(DB) rmask(Seq) srun blastx
-filter SEGXNU DB rmask(Seq) runode
(RUNMODE)
29
How skam executes actions
  • Actions are specified with the run tag
  • Actions are unix shell (sh) commands
  • Function arguments are flattened
  • sh command is wrapped in a shell mini-pipeline
  • Check status functions for goal run if no status
  • mini-pipeline can be run locally/remotely,
    synchronously/asynchronously (eg PBS)
  • shell pipe deals with copying/moving files and
    setting status flags

30
Algorithm for running an action for function
construct F e.g. genscan(unit1.fasta) First
check for the existence of status_run(F) e.g
status_run(genscan(unit1.fasta)) uses
unix . test -f genscan/unit.fasta.status_run
IF function is true (ie file exists) THEN do
nothing (status of target is still run) ELSE
touch status_run(F) e.g. touch
genscan/unit.fasta.status_run launch
shell-pipe-command for F (see below)
local
!/bin/sh cd /data/cluster/cjm/test-pipeline/ (
( mv genscan/unit1.fasta /tmp/unit1.fasta
genscan /tmp/unit1.fasta 1gt /tmp/unit1.fasta.gensc
an 2gt /tmp/unit1.fasta.genscan.stderr )
( touch genscan/unit1.fasta.genscan.status-o
k ) ( touch genscan/unit1.fasta.genscan
.status-err ) ) ( mv /tmp/unit1.fasta.genscan
genscan/unit1.fasta.genscan mv
/tmp/unit1.fasta.genscan.stderr
genscan/unit1.fasta.genscan.stderr )
remote
can be fed directly to qsub for farm submission
31
DataShifters
  • BioMake defaults to the local filesystem
  • function constructs flattened to make relative
    file paths
  • files automatically shifted between working dir
    and temporary dir
  • Other DataShifters possible
  • just provide a wrapper script that implements
    simple basic data-atom operations
    (fetch,store,test-id)
  • MySQL wrapper provided (tiny fast perl script)
  • copies files back and forth between tmpdir and db

32
cluster node
local machine
qsub/qstat
scheduler
shell mini-pipelines
- user loads input fasta seqs into db - user asks
skam for full genomic analysis on seq 34 -
foreach subgoal of full_genomic(S), send a shell
pipe to PBS the shell pipe will run on the
node, copy data from the db to the node tmpdir,
execute the analysis program writing again to
the node tmpdir on completion, the shell
pipe copies the tmpdir output into the db
33
Status/Discussion
  • Almost ready for public consumption
  • memory/garbage collection problem
  • No fancy user interface just command line
  • Prolog extremely powerful and concise
  • may be abstruse to some
  • ...but should be hidden behind the scenes
  • Almost in use for GO/OBO
  • db construction, release management and file
    transformations
Write a Comment
User Comments (0)
About PowerShow.com