BioMake

About This Presentation

Title:

BioMake

Description:

Wildcards allow basic 'parameterization' of make rules ... Be on guard make sure no target is created if the action fails, otherwise you ... – PowerPoint PPT presentation

Number of Views:54

Avg rating:3.0/5.0

Slides: 34

Provided by: skamSour

Category:

more less

Transcript and Presenter's Notes

Title: BioMake

1
BioMake

Specifying biocompute pipelines with Makefiles
and beyond
Chris Mungall
BDGP

2
There is no escaping pipelines in bioinformatics

A pipeline is any specification of a set of tasks
which are interdependent
Compute pipelines
genomic, transcriptomic,proteomic and others
Data transformation pipelines
db exports, generating xml/html, data warehouses
Laboratory Information Management Systems
(LIMS).....

3
A Simple Compute Pipeline
MultiFasta
FastaSeq

A project consists of
targets/goals (e.g files)
dependencies between targets (e.g blast requires
formatted db)
actions (e.g run blastall)
Chaining of targets (e.g formatdb run before
blastall)
Similar to metabolic pathways (actions as
reactions programs as catalysts)

formatdb
repeatmask
BlastIndex
MaskedSeq
genscan
blastall
GenscanResults
BlastResults
bop
key
cheesy analogy
SourceFile (user supplied)
GAME-XML
NonSource (derived)
4
First Attempt Shell Pipes
generalised form
actionA input-file actionB - actionC - gt
final-target rmask myseq.fst blastx mydb -
bop - gt myseq.genscan rmask myseq.fst genscan -
bop - gt myseq.genscan cat glucose hexo-kinase
- fructose-bis-kinase - gt glucose-2-6-p

Unix shell pipe '' feeds the output of one
program in as the input to another hides
intermediate files (e.g glucose-6-p)
OK for manual use, less good for full automation
Linear dependency flows only pipelines are
graphs
Not all programs accept '-' as an input need
intermediate tmp files
no re-use of previous results (full path
recompute!)
no knowledge of when to re-compute final target
e.g. redo blast if blastdb changes

5
Second Attempt write a script

Code pipeline in an imperative language
(java/perl/C/C/python/etc)
explicitly code rules such as
don't recompute if we have output file already
Repetitive, redundant code unaesthetic
Underlying logic becomes obscured once we start
extending things (asynchronous execution via PBS,
comparing timestamps to determine if a recompute
is required)

sub blastpipe (mydb, myseq) formatdb mydb
unless -f mydb rmask myseq gt myseq.m
unless -f myseq.m blastout
blast-mydb-myseq.out blastx mydb myseq gt
blastout unless -f blastout
6
Third Attempt Objects
class BlastJob extends Job method run
seq self-gtseq mydb
self-gtmydb j new FormatDBJob(seq)
self-gtcheck(j)
self-gtshell(blastx mydb seq)

Attempts to abstract and reuse code
Similar problems to scripting
Different approaches, but end results are very
complex (eg gadfly-pipe, biopipe, ensembl) and
(IMHO) somewhat unsatisfactory
Crux of problem mixing code/rules with
data/config

7
Back to basics Makefiles

Makefiles are used in a variety of contexts
Mostly for building and compiling applications
.class file depends on .java file
.jar file depends on list of .class files
avoids unnecessary re-compiling uses timestamps
A Makefile is a set of rules
Each rule has a target, dependencies (subtargets)
and action(s)
Targets are files, actions are unix shell commands

8
Anatomy of a makefile
make masked sequence myseq.m myseq rmask
myseq gt myseq.m run blast on masked
seq blastout mydb.psq myseq.m blastx mydb
myseq.m gt blastout echo ran blast! index
blastable db mydb.psq mydb formatdb -p T
mydb rules follow this pattern target
subtarget1, ..., subtargetN shell command
1 shell command 2...

our starting point is the file myseq, the end
point is the blast results blastout
we first want to mask out any repeats using rmask
to create myseq.m
we then blastx myseq.m against a protein db
called mydb
before blastx is run the protein db must be
indexed using formatdb

9
The make command
make blastout formatdb -p T mydb rmask
myseq.fst gt myseq.m blastx mydb myseq.m gt
blastout make blastout make 'blastout' is up
to date cat newseqs gtgt mydb make
blastout formatdb -p T mydb blastx mydb myseq.m gt
blastout
run blast on masked seq blastout mydb.psq
myseq.m blastx mydb myseq.m gt blastout echo
ran blast! index blastable db mydb.psq
mydb formatdb -p T mydb make masked
sequence myseq.m myseq rmask myseq gt myseq.m

make uses unix file modification timestamps when
checking dependencies
if a subtarget is more recent than the goal
target, then re-execute action

10
If compounds were files and cells were makefiles
things might look a bit like this
cheesy analogy
glucose-2-6-p synthesis from glucose via
glucose-6-p using fructose-bis-kinase and
hexo-kinase-p enymes as catalysts (all
reactions carried out by unix command react)
final step glucose-2-6-p glucose-6-p
fructose-bis-kinase react -catalyst
fructose-bis-kinase glucose-6-p gt glucose-2-6-p
initial step glucose-6-p glucose
hexo-kinase-p react -catalyst hexo-kinase-p
glucose gt glucose-6-p
11
Get wild with makefiles

The preceding examples used hardcoded targets
what if we want to use the same logic over
multiple targets (eg multiple input sequence
fasta files and/or multiple blast databases)?
Makefiles allow this via wildcards
Wildcards allow basic 'parameterization' of make
rules
Syntax is a little strange even stranger than
perl

12
Pattern matching examples

A is a wildcard match on the target/dependency
file
A in the action line(s) is replaced with the
match
A _at_ in the action line is replaced by the
target/goal
Concise but ugly!
We can iterate through lots of input seqs

blastout_.m mydb.psq .m blastx mydb .m gt
_at_ .m .fst rmask .fst gt _at_
make blastout_seq1.m rmask seq1.fst gt
seq1.m blastx mydb seq1.m gt \ blastout_seq1.m
13
Defensive makefile logic
blastout_.m mydb.psq .m blastx mydb .m gt
_at_.tmp mv _at_.tmp _at_ .m .fst rmask .fst
gt _at_.tmp mv _at_.tmp _at_

Be on guard make sure no target is created if
the action fails, otherwise you get rubbish all
the way downstream
With the unix shell construct foo1 foo2, foo2
will not be run if foo1 exits with non-zero
status (error)
Things are looking even uglier! (but safer)

14
(some) Makefile limitations

Wildcard problems difficult to have more than
one wildcard parameter for a rule
blast needs at least 2 fastadb and input
sequence
Dependencies inherently tied to filesystem
Logic is fundamentally synchronous (always waits
for actions to finish)
not good for managing remotely executed tasks
Nasty syntax takes a while to master (and is
easily forgotten)
No programming language constructs
Too dumb/low-level leads to repetitive patterns

One side effect of these limitations is the large
number of makefile makers these typically take a
makefile templates and automatically generate one
or more makefiles. This takes us back to evil
scripting logic! Despite this, makefiles are an
invaluable tool. But can they be improved?
15
Alternatives to make

nmake
minor improvements (aimed at software
maintenance)
scons
build
apache ant
XML/java based - focused mainly on java code
maintenance (does this very well)
Nothing that would suffice for a full-on genomic
pipeline.........?

16
Some things that would be nice

Abstract away from filesystem
Specify Goals and requirements as parameterized
functions

blastout_.m mydb.psq .m blastx mydb .m gt
_at_
NO!
maskblastout(DB,Seq) formatdb(DB)
rmask(Seq) blastx DB rmask(Seq) gt target()
YES!
17
BioMake using skam

Skam Skolem Asynchronous Make
Functional-Logical Makefile replacement
Lightweight (400 lines of prolog code! prolog
does most of the work)
Like makefiles, but with function constructs (aka
skolems)
Works with filesystem, databases, GH-style
datastores (URLs?)
Sync and Async dependency logic
works with remote execution e.g PBS
Fully programmable other nice features (see
later)
As generic as makefiles no bioinformatics logic
BioMake
A set of rules useful for bioinformatics pipelines

18
Skam is based on skolem function constructs

May be recursively applied functions as trees
A skolem function uniquely identifies a target /
piece of data for example

xml2html(bop(genscan(rmask(MySeqID))))

This skolem identifies the data that comes from
repeatmasking MySeqID, putting it through
genscan, feeding the results to bop, then
transforming the resulting GAME XML to HTML
In the example above, MySeqID is a variable

19
Tangent Other uses of skolems not necessarily
related to pipelines

Uniquely identifying features typed by SO
general form for splices sites, exons and introns
donor(GeneID, Number)
acceptor(GeneID, Number)
intron(DonorID,AcceptorID)
exon(AcceptorID,init,DonorID,term)
derived features can be uniquely and persistently
identified with nested functions
intron(donor(CG1234, 4),acceptor(CG1234, 7))
exon(acceptor(CG1234, 7), term)

primitive

derived
20
Example Skolem Rule
mblastx(Seq,DB) flat blast_results/Seq.DB.bla
stx req formatdb(DB) rmask(Seq) srun
blastx -filter SEGXNU DB rmask(Seq) comment
this target is for the results of running blastx
on a masked input genomic
sequence (wublast)

skolem function followed by tag value metadata
variables always begin with UpperCase

21
formatdb(DB) req DB run formatdb DB
comment prepares blastdb for blasting
(wublast) rmask(Seq) flat masked_seqs/Seq.mas
ked req Seq srun RepeatMasker -lib
(LIB) Seq comment masks out repeats from
input sequence mblastx(Seq,DB) flat
blast_results/Seq.DB.blastx req formatdb(DB)
rmask(Seq) srun blastx -filter SEGXNU DB
rmask(Seq) comment this target is for the
results of running blastx on
a masked input genomic sequence (wublast)
22
Skolem Specifications

Each rule concerns a skolem function construct
eg sim4(Seq,DB) similar to makefile target
Each goal has metadata tagvalue pairs, including
flat How to flatten skolem function (to map to
filesystem/URL/whatever)
unix makefiles conflate this with the target
req required SubGoals (also specified as
functions)
just like makefile dependencies
(s)run Shell command for generating target
just like makefile action
Other meta tags allowed (...)

23
BioMake Data

relational (ie tabular) data can be included
use xml-like ltdatagt...lt/datagt tags
implemented as prolog facts
useful for
specifying metadata on local fasta databases
logical name, file path, SO type, alphabet,
species
we can use this to intelligently automate analyses

24
Example data section
ltdata relation"fastadb"
delimiter"ws" spec"fastadb(ID,SeqAlph
abet,SOType,BopType,Taxon)"
comment"formal description of local multifasta
datasources"gt na_pe.dros
na transposable_element P_Element dmel ne_EST.a
ll_nr.dros na EST EST dmel na_gb.dros na
mRNA mRNA dmel na_cDNA.dros na
mRNA cDNA dmel na_DGC.dros na
EST EST dmel na_ARGs.dros na
mRNA mRNA dmel na_users.dros na
mRNA mRNA dmel lt/datagt
25
Controlling BioMake logic with Rules

Data can be queried with prolog rules
Useful for
Adding logic/decisions/code to specifications
figuring out which blast program to run against
which fastadb
Or for determinig bop type
Querying ontologies eg SO
Skolem targets can be qualified with prolog
constraints in squiggly braces ...

26
ltprologgt exprdb(D)- fastadb(D,na,'mRNA'). SO
type mRNA is expressed exprdb(D)-
fastadb(D,na,'snRNA'). SO type snRNA is
expressed exprdb(D)- fastadb(D,na,'EST').
SO type EST is expressed only use sim4
against fastadbs of expressed seq from
dmel sim4db(D)- exprdb(D),fastadb(D,_,_,_,dmel).
lt/prologgt all_sim4(Seq) req L
setof(sim4(S,D)),sim4db(D),L)
comment the requirement for this target is to
sim4 the sequence S against
all databases D such that sim4db(D) is true
i.e. All databases of
expressed drosophila melanogaster
sequence. We do this with the prolog
predicate setof, which
finds all solutions to a problem.
27
Runmodes

Default runmode is 'sync' (runs locally, waits
until sh command is done before proceeding just
like makefiles)
PBS wrapper also provided
Extensible you can add other runmodes
run in background, locally
rsh to remote machine and run

28
Specifying Runmodes
specify runmode as a USER-VARIABLE can be
overridden on the command line (RUNMODE)
async(submit('qsub -q qresearch1_at_bam')) mblastx(S
eq,DB) flat blast_results/Seq.DB.blastx
req formatdb(DB) rmask(Seq) srun blastx
-filter SEGXNU DB rmask(Seq) runode
(RUNMODE)
29
How skam executes actions

Actions are specified with the run tag
Actions are unix shell (sh) commands
Function arguments are flattened
sh command is wrapped in a shell mini-pipeline
Check status functions for goal run if no status
mini-pipeline can be run locally/remotely,
synchronously/asynchronously (eg PBS)
shell pipe deals with copying/moving files and
setting status flags

30
Algorithm for running an action for function
construct F e.g. genscan(unit1.fasta) First
check for the existence of status_run(F) e.g
status_run(genscan(unit1.fasta)) uses
unix . test -f genscan/unit.fasta.status_run
IF function is true (ie file exists) THEN do
nothing (status of target is still run) ELSE
touch status_run(F) e.g. touch
genscan/unit.fasta.status_run launch
shell-pipe-command for F (see below)
local
!/bin/sh cd /data/cluster/cjm/test-pipeline/ (
( mv genscan/unit1.fasta /tmp/unit1.fasta
genscan /tmp/unit1.fasta 1gt /tmp/unit1.fasta.gensc
an 2gt /tmp/unit1.fasta.genscan.stderr )
( touch genscan/unit1.fasta.genscan.status-o
k ) ( touch genscan/unit1.fasta.genscan
.status-err ) ) ( mv /tmp/unit1.fasta.genscan
genscan/unit1.fasta.genscan mv
/tmp/unit1.fasta.genscan.stderr
genscan/unit1.fasta.genscan.stderr )
remote
can be fed directly to qsub for farm submission
31
DataShifters

BioMake defaults to the local filesystem
function constructs flattened to make relative
file paths
files automatically shifted between working dir
and temporary dir
Other DataShifters possible
just provide a wrapper script that implements
simple basic data-atom operations
(fetch,store,test-id)
MySQL wrapper provided (tiny fast perl script)
copies files back and forth between tmpdir and db

32
cluster node
local machine
qsub/qstat
scheduler
shell mini-pipelines
- user loads input fasta seqs into db - user asks
skam for full genomic analysis on seq 34 -
foreach subgoal of full_genomic(S), send a shell
pipe to PBS the shell pipe will run on the
node, copy data from the db to the node tmpdir,
execute the analysis program writing again to
the node tmpdir on completion, the shell
pipe copies the tmpdir output into the db
33
Status/Discussion