Title: BioMake
1BioMake
- Specifying biocompute pipelines with Makefiles
and beyond - Chris Mungall
- BDGP
2There is no escaping pipelines in bioinformatics
- A pipeline is any specification of a set of tasks
which are interdependent - Compute pipelines
- genomic, transcriptomic,proteomic and others
- Data transformation pipelines
- db exports, generating xml/html, data warehouses
- Laboratory Information Management Systems
(LIMS).....
3A Simple Compute Pipeline
MultiFasta
FastaSeq
- A project consists of
- targets/goals (e.g files)
- dependencies between targets (e.g blast requires
formatted db) - actions (e.g run blastall)
- Chaining of targets (e.g formatdb run before
blastall) - Similar to metabolic pathways (actions as
reactions programs as catalysts)
formatdb
repeatmask
BlastIndex
MaskedSeq
genscan
blastall
GenscanResults
BlastResults
bop
key
cheesy analogy
SourceFile (user supplied)
GAME-XML
NonSource (derived)
4First Attempt Shell Pipes
generalised form
actionA input-file actionB - actionC - gt
final-target rmask myseq.fst blastx mydb -
bop - gt myseq.genscan rmask myseq.fst genscan -
bop - gt myseq.genscan cat glucose hexo-kinase
- fructose-bis-kinase - gt glucose-2-6-p
- Unix shell pipe '' feeds the output of one
program in as the input to another hides
intermediate files (e.g glucose-6-p) - OK for manual use, less good for full automation
- Linear dependency flows only pipelines are
graphs - Not all programs accept '-' as an input need
intermediate tmp files - no re-use of previous results (full path
recompute!) - no knowledge of when to re-compute final target
- e.g. redo blast if blastdb changes
5Second Attempt write a script
- Code pipeline in an imperative language
(java/perl/C/C/python/etc) - explicitly code rules such as
- don't recompute if we have output file already
- Repetitive, redundant code unaesthetic
- Underlying logic becomes obscured once we start
extending things (asynchronous execution via PBS,
comparing timestamps to determine if a recompute
is required)
sub blastpipe (mydb, myseq) formatdb mydb
unless -f mydb rmask myseq gt myseq.m
unless -f myseq.m blastout
blast-mydb-myseq.out blastx mydb myseq gt
blastout unless -f blastout
6Third Attempt Objects
class BlastJob extends Job method run
seq self-gtseq mydb
self-gtmydb j new FormatDBJob(seq)
self-gtcheck(j)
self-gtshell(blastx mydb seq)
- Attempts to abstract and reuse code
- Similar problems to scripting
- Different approaches, but end results are very
complex (eg gadfly-pipe, biopipe, ensembl) and
(IMHO) somewhat unsatisfactory - Crux of problem mixing code/rules with
data/config
7Back to basics Makefiles
- Makefiles are used in a variety of contexts
- Mostly for building and compiling applications
- .class file depends on .java file
- .jar file depends on list of .class files
- avoids unnecessary re-compiling uses timestamps
- A Makefile is a set of rules
- Each rule has a target, dependencies (subtargets)
and action(s) - Targets are files, actions are unix shell commands
8Anatomy of a makefile
make masked sequence myseq.m myseq rmask
myseq gt myseq.m run blast on masked
seq blastout mydb.psq myseq.m blastx mydb
myseq.m gt blastout echo ran blast! index
blastable db mydb.psq mydb formatdb -p T
mydb rules follow this pattern target
subtarget1, ..., subtargetN shell command
1 shell command 2...
- our starting point is the file myseq, the end
point is the blast results blastout - we first want to mask out any repeats using rmask
to create myseq.m - we then blastx myseq.m against a protein db
called mydb - before blastx is run the protein db must be
indexed using formatdb
9The make command
make blastout formatdb -p T mydb rmask
myseq.fst gt myseq.m blastx mydb myseq.m gt
blastout make blastout make 'blastout' is up
to date cat newseqs gtgt mydb make
blastout formatdb -p T mydb blastx mydb myseq.m gt
blastout
run blast on masked seq blastout mydb.psq
myseq.m blastx mydb myseq.m gt blastout echo
ran blast! index blastable db mydb.psq
mydb formatdb -p T mydb make masked
sequence myseq.m myseq rmask myseq gt myseq.m
- make uses unix file modification timestamps when
checking dependencies - if a subtarget is more recent than the goal
target, then re-execute action
10If compounds were files and cells were makefiles
things might look a bit like this
cheesy analogy
glucose-2-6-p synthesis from glucose via
glucose-6-p using fructose-bis-kinase and
hexo-kinase-p enymes as catalysts (all
reactions carried out by unix command react)
final step glucose-2-6-p glucose-6-p
fructose-bis-kinase react -catalyst
fructose-bis-kinase glucose-6-p gt glucose-2-6-p
initial step glucose-6-p glucose
hexo-kinase-p react -catalyst hexo-kinase-p
glucose gt glucose-6-p
11Get wild with makefiles
- The preceding examples used hardcoded targets
- what if we want to use the same logic over
multiple targets (eg multiple input sequence
fasta files and/or multiple blast databases)? - Makefiles allow this via wildcards
- Wildcards allow basic 'parameterization' of make
rules - Syntax is a little strange even stranger than
perl
12Pattern matching examples
- A is a wildcard match on the target/dependency
file - A in the action line(s) is replaced with the
match - A _at_ in the action line is replaced by the
target/goal - Concise but ugly!
- We can iterate through lots of input seqs
blastout_.m mydb.psq .m blastx mydb .m gt
_at_ .m .fst rmask .fst gt _at_
make blastout_seq1.m rmask seq1.fst gt
seq1.m blastx mydb seq1.m gt \ blastout_seq1.m
13Defensive makefile logic
blastout_.m mydb.psq .m blastx mydb .m gt
_at_.tmp mv _at_.tmp _at_ .m .fst rmask .fst
gt _at_.tmp mv _at_.tmp _at_
- Be on guard make sure no target is created if
the action fails, otherwise you get rubbish all
the way downstream - With the unix shell construct foo1 foo2, foo2
will not be run if foo1 exits with non-zero
status (error) - Things are looking even uglier! (but safer)
14(some) Makefile limitations
- Wildcard problems difficult to have more than
one wildcard parameter for a rule - blast needs at least 2 fastadb and input
sequence - Dependencies inherently tied to filesystem
- Logic is fundamentally synchronous (always waits
for actions to finish) - not good for managing remotely executed tasks
- Nasty syntax takes a while to master (and is
easily forgotten) - No programming language constructs
- Too dumb/low-level leads to repetitive patterns
One side effect of these limitations is the large
number of makefile makers these typically take a
makefile templates and automatically generate one
or more makefiles. This takes us back to evil
scripting logic! Despite this, makefiles are an
invaluable tool. But can they be improved?
15Alternatives to make
- nmake
- minor improvements (aimed at software
maintenance) - scons
- build
- apache ant
- XML/java based - focused mainly on java code
maintenance (does this very well) - Nothing that would suffice for a full-on genomic
pipeline.........?
16Some things that would be nice
- Abstract away from filesystem
- Specify Goals and requirements as parameterized
functions
blastout_.m mydb.psq .m blastx mydb .m gt
_at_
NO!
maskblastout(DB,Seq) formatdb(DB)
rmask(Seq) blastx DB rmask(Seq) gt target()
YES!
17BioMake using skam
- Skam Skolem Asynchronous Make
- Functional-Logical Makefile replacement
- Lightweight (400 lines of prolog code! prolog
does most of the work) - Like makefiles, but with function constructs (aka
skolems) - Works with filesystem, databases, GH-style
datastores (URLs?) - Sync and Async dependency logic
- works with remote execution e.g PBS
- Fully programmable other nice features (see
later) - As generic as makefiles no bioinformatics logic
- BioMake
- A set of rules useful for bioinformatics pipelines
18Skam is based on skolem function constructs
- May be recursively applied functions as trees
- A skolem function uniquely identifies a target /
piece of data for example
xml2html(bop(genscan(rmask(MySeqID))))
- This skolem identifies the data that comes from
repeatmasking MySeqID, putting it through
genscan, feeding the results to bop, then
transforming the resulting GAME XML to HTML - In the example above, MySeqID is a variable
19Tangent Other uses of skolems not necessarily
related to pipelines
- Uniquely identifying features typed by SO
- general form for splices sites, exons and introns
- donor(GeneID, Number)
- acceptor(GeneID, Number)
- intron(DonorID,AcceptorID)
- exon(AcceptorID,init,DonorID,term)
- derived features can be uniquely and persistently
identified with nested functions - intron(donor(CG1234, 4),acceptor(CG1234, 7))
- exon(acceptor(CG1234, 7), term)
primitive
derived
20Example Skolem Rule
mblastx(Seq,DB) flat blast_results/Seq.DB.bla
stx req formatdb(DB) rmask(Seq) srun
blastx -filter SEGXNU DB rmask(Seq) comment
this target is for the results of running blastx
on a masked input genomic
sequence (wublast)
- skolem function followed by tag value metadata
- variables always begin with UpperCase
21formatdb(DB) req DB run formatdb DB
comment prepares blastdb for blasting
(wublast) rmask(Seq) flat masked_seqs/Seq.mas
ked req Seq srun RepeatMasker -lib
(LIB) Seq comment masks out repeats from
input sequence mblastx(Seq,DB) flat
blast_results/Seq.DB.blastx req formatdb(DB)
rmask(Seq) srun blastx -filter SEGXNU DB
rmask(Seq) comment this target is for the
results of running blastx on
a masked input genomic sequence (wublast)
22Skolem Specifications
- Each rule concerns a skolem function construct
- eg sim4(Seq,DB) similar to makefile target
- Each goal has metadata tagvalue pairs, including
- flat How to flatten skolem function (to map to
filesystem/URL/whatever) - unix makefiles conflate this with the target
- req required SubGoals (also specified as
functions) - just like makefile dependencies
- (s)run Shell command for generating target
- just like makefile action
- Other meta tags allowed (...)
23BioMake Data
- relational (ie tabular) data can be included
- use xml-like ltdatagt...lt/datagt tags
- implemented as prolog facts
- useful for
- specifying metadata on local fasta databases
- logical name, file path, SO type, alphabet,
species - we can use this to intelligently automate analyses
24Example data section
ltdata relation"fastadb"
delimiter"ws" spec"fastadb(ID,SeqAlph
abet,SOType,BopType,Taxon)"
comment"formal description of local multifasta
datasources"gt na_pe.dros
na transposable_element P_Element dmel ne_EST.a
ll_nr.dros na EST EST dmel na_gb.dros na
mRNA mRNA dmel na_cDNA.dros na
mRNA cDNA dmel na_DGC.dros na
EST EST dmel na_ARGs.dros na
mRNA mRNA dmel na_users.dros na
mRNA mRNA dmel lt/datagt
25Controlling BioMake logic with Rules
- Data can be queried with prolog rules
- Useful for
- Adding logic/decisions/code to specifications
- figuring out which blast program to run against
which fastadb - Or for determinig bop type
- Querying ontologies eg SO
- Skolem targets can be qualified with prolog
constraints in squiggly braces ...
26ltprologgt exprdb(D)- fastadb(D,na,'mRNA'). SO
type mRNA is expressed exprdb(D)-
fastadb(D,na,'snRNA'). SO type snRNA is
expressed exprdb(D)- fastadb(D,na,'EST').
SO type EST is expressed only use sim4
against fastadbs of expressed seq from
dmel sim4db(D)- exprdb(D),fastadb(D,_,_,_,dmel).
lt/prologgt all_sim4(Seq) req L
setof(sim4(S,D)),sim4db(D),L)
comment the requirement for this target is to
sim4 the sequence S against
all databases D such that sim4db(D) is true
i.e. All databases of
expressed drosophila melanogaster
sequence. We do this with the prolog
predicate setof, which
finds all solutions to a problem.
27Runmodes
- Default runmode is 'sync' (runs locally, waits
until sh command is done before proceeding just
like makefiles) - PBS wrapper also provided
- Extensible you can add other runmodes
- run in background, locally
- rsh to remote machine and run
28Specifying Runmodes
specify runmode as a USER-VARIABLE can be
overridden on the command line (RUNMODE)
async(submit('qsub -q qresearch1_at_bam')) mblastx(S
eq,DB) flat blast_results/Seq.DB.blastx
req formatdb(DB) rmask(Seq) srun blastx
-filter SEGXNU DB rmask(Seq) runode
(RUNMODE)
29How skam executes actions
- Actions are specified with the run tag
- Actions are unix shell (sh) commands
- Function arguments are flattened
- sh command is wrapped in a shell mini-pipeline
- Check status functions for goal run if no status
- mini-pipeline can be run locally/remotely,
synchronously/asynchronously (eg PBS) - shell pipe deals with copying/moving files and
setting status flags
30Algorithm for running an action for function
construct F e.g. genscan(unit1.fasta) First
check for the existence of status_run(F) e.g
status_run(genscan(unit1.fasta)) uses
unix . test -f genscan/unit.fasta.status_run
IF function is true (ie file exists) THEN do
nothing (status of target is still run) ELSE
touch status_run(F) e.g. touch
genscan/unit.fasta.status_run launch
shell-pipe-command for F (see below)
local
!/bin/sh cd /data/cluster/cjm/test-pipeline/ (
( mv genscan/unit1.fasta /tmp/unit1.fasta
genscan /tmp/unit1.fasta 1gt /tmp/unit1.fasta.gensc
an 2gt /tmp/unit1.fasta.genscan.stderr )
( touch genscan/unit1.fasta.genscan.status-o
k ) ( touch genscan/unit1.fasta.genscan
.status-err ) ) ( mv /tmp/unit1.fasta.genscan
genscan/unit1.fasta.genscan mv
/tmp/unit1.fasta.genscan.stderr
genscan/unit1.fasta.genscan.stderr )
remote
can be fed directly to qsub for farm submission
31DataShifters
- BioMake defaults to the local filesystem
- function constructs flattened to make relative
file paths - files automatically shifted between working dir
and temporary dir - Other DataShifters possible
- just provide a wrapper script that implements
simple basic data-atom operations
(fetch,store,test-id) - MySQL wrapper provided (tiny fast perl script)
- copies files back and forth between tmpdir and db
32cluster node
local machine
qsub/qstat
scheduler
shell mini-pipelines
- user loads input fasta seqs into db - user asks
skam for full genomic analysis on seq 34 -
foreach subgoal of full_genomic(S), send a shell
pipe to PBS the shell pipe will run on the
node, copy data from the db to the node tmpdir,
execute the analysis program writing again to
the node tmpdir on completion, the shell
pipe copies the tmpdir output into the db
33Status/Discussion
- Almost ready for public consumption
- memory/garbage collection problem
- No fancy user interface just command line
- Prolog extremely powerful and concise
- may be abstruse to some
- ...but should be hidden behind the scenes
- Almost in use for GO/OBO
- db construction, release management and file
transformations