biomake chris mungall berkeley drosophila genome project

23
BioMake BioMake Chris Mungall Chris Mungall Berkeley Drosophila Genome Berkeley Drosophila Genome Project Project [email protected] [email protected]

Upload: angel-norris

Post on 19-Jan-2018

215 views

Category:

Documents


0 download

DESCRIPTION

Existing approaches  Run tasks by hand, or with ad-hoc scripts  doesn’t scale, leads to insanity  Unix/GNU makefiles  ++: concise, generic, high level abstraction  --: limited expressive power, hacky  makefile replacements (cons,scons,ant,build…)  geared towards software development  Bio compute pipeline software (biopipe, enspipe)  excellent for certain tasks  not completely generic

TRANSCRIPT

Page 1: BioMake Chris Mungall Berkeley Drosophila Genome Project

BioMakeBioMakeChris MungallChris Mungall

Berkeley Drosophila Genome Berkeley Drosophila Genome ProjectProject

[email protected]@fruitfly.org

Page 2: BioMake Chris Mungall Berkeley Drosophila Genome Project

build networksbuild networks Many bioinformaticians spend large amounts of Many bioinformaticians spend large amounts of

time coding and running time coding and running build/make networksbuild/make networks A build network is a recipe describing the A build network is a recipe describing the

execution of a collection of execution of a collection of interdependent interdependent heterogeneous tasksheterogeneous tasks sequence analysis pipelinessequence analysis pipelines data ‘compilation’data ‘compilation’

importing, transforming and exporting dataimporting, transforming and exporting data LIMSLIMS

Error prone, tedious repetitive code and hard Error prone, tedious repetitive code and hard to configureto configure

Page 3: BioMake Chris Mungall Berkeley Drosophila Genome Project

Existing approachesExisting approaches Run tasks by hand, or with ad-hoc scriptsRun tasks by hand, or with ad-hoc scripts

doesn’t scale, leads to insanitydoesn’t scale, leads to insanity Unix/GNU makefilesUnix/GNU makefiles

++: concise, generic, high level abstraction++: concise, generic, high level abstraction --: limited expressive power, hacky--: limited expressive power, hacky

makefile replacements (cons,scons,ant,build…)makefile replacements (cons,scons,ant,build…) geared towards software developmentgeared towards software development

Bio compute pipeline software (biopipe, Bio compute pipeline software (biopipe, enspipe)enspipe) excellent for certain tasksexcellent for certain tasks not completely genericnot completely generic

Page 4: BioMake Chris Mungall Berkeley Drosophila Genome Project

biomake: executable biomake: executable computational protocolscomputational protocols

A declarative language for specifying A declarative language for specifying build networksbuild networks concise, Turing-complete, highly configurableconcise, Turing-complete, highly configurable

Dependency managementDependency management e.ge.g mask genomic sequence prior to mask genomic sequence prior to

genefindinggenefinding Local and remote job executionLocal and remote job execution

compute farm job managementcompute farm job management Filesystem or database orientedFilesystem or database oriented

Page 5: BioMake Chris Mungall Berkeley Drosophila Genome Project

Example Genomic Example Genomic Sequence Analysis Sequence Analysis

PipelinePipeline Prepare and cut assembled sequence into slicesPrepare and cut assembled sequence into slices Download latest NR peptide dataset and index itDownload latest NR peptide dataset and index it Blast genomic slices against NR and other Blast genomic slices against NR and other

datasetsdatasets RepeatMask genomic slices and run genefinders RepeatMask genomic slices and run genefinders

on masked sequenceon masked sequence Synthesise gene models and do peptide analysis Synthesise gene models and do peptide analysis

on resultson results Store everything in a relational db, and prepare Store everything in a relational db, and prepare

files for export to publicfiles for export to public

Page 6: BioMake Chris Mungall Berkeley Drosophila Genome Project

Target dependenciesTarget dependenciesAssembledGenomic Sequence

ChunkedSequence

RepeatMaskedSequence

GenePredictions

BlastAlignments

FastaDB(local)

FastaDB

(remote)BlastIndexed

FastaDB

RelationalDatabase

GeneModels

BlastP & HMMAlignments

XMLExpo

rt

XMLImport

FlatfileExport

Page 7: BioMake Chris Mungall Berkeley Drosophila Genome Project

Specifying TargetsSpecifying Targets

ChunkedSequence

BlastAlignments

FastaDB(local)

BlastIndexedFastaDB

formatdb(D) run: formatdb -i D

blast(P,S,D,A) req: formatdb(D) run: blastall -p P -i S -d D A flat: S-blast/bn(S).D.P.out

A generic targetpattern has a nameand arguments

A target pattern has tags for specifying dependencies, actions and filesystem or database IDs

Page 8: BioMake Chris Mungall Berkeley Drosophila Genome Project

BioMake ExecutionBioMake ExecutionTARGET:blast(blastx, my.seq, nr.fly, -)

SUBTARGET:formatdb(nr.fly)

RUN: formatdb -i nr -p TOUT: nr.fly.{psq,pin,phr} formatdb-nr.fly.OK

RUN: blastall -p blastx -i my.seq -d nr.fly

OUT:my.seq-blast/my.seq.nr.fly.blastx.outmy.seq-blast/my.seq.nr.fly.blastx.out.OK

formatdb(D) run: formatdb -i D

blast(P,S,D,A) req: formatdb(D) run: blastall -p P -i S -d D A flat: S-blast/bn(S).D.P.out

Page 9: BioMake Chris Mungall Berkeley Drosophila Genome Project

Targets can be nestedTargets can be nested

bop(S,B) run: apollo -bop -s S B -o target flat: B.game.xml

store(XML) run: xml2db XML

store(bop(my.seq, genscan(repeatmask(my.seq, drosophila))))

target instantiations can be thought of as a skolem IDs

repeatmask(S,Org) run: repeatmask S -a Org flat: S.mask

genscan(S) run: genscan S flat: S-pred/bn(S).genscan.out

Page 10: BioMake Chris Mungall Berkeley Drosophila Genome Project

IterationIteration Pipelines frequently involve iterating over Pipelines frequently involve iterating over

collections of data:collections of data: Perform a sequence analysis on every entry Perform a sequence analysis on every entry

in a multi-fasta format filein a multi-fasta format file Perform a peptide analysis on every gene Perform a peptide analysis on every gene

prediction in some genscan outputprediction in some genscan output Query a database for a list of IDs and perform Query a database for a list of IDs and perform

some task on eachsome task on each biomake has language constructs for biomake has language constructs for

iterationiteration

Page 11: BioMake Chris Mungall Berkeley Drosophila Genome Project

Iterating over datasetsIterating over datasets

splitfasta(F) run: splitfasta.pl -d F-split -md5 F flat: F-split/bn(F).pathlist comment: splitfasta.pl is part of the biomake distro

analyze_multifasta(F) iterate: analyze_seq(S) where S in splitfasta(F)

analyze_seq(S) req: genie(S) blast(blastx,S,nr,-)

MultiFasta>seq1TAGGTATTGGTTAGGTGCGTCCTC>seq2GCGGTATAGCTTTTCCTTCTCTCT>seq3CAAAGCAGAGATATATTTATTCGC

>seq1TAGGTATTGGTTAGGTGCGTCCTC

>seq2GCGGTATAGCTTTTCCTTCTCTCT

>seq3CAAAGCAGAGATATATTTATTCGC

seq1.genie.out

seq1.nr.blastx.out

seq2.genie

seq2.nr.blastx.out

seq3.nr.blastx.out

seq3.genie

Page 12: BioMake Chris Mungall Berkeley Drosophila Genome Project

Controlling the runmodeControlling the runmode Tasks can be run locally or on a Tasks can be run locally or on a

compute farm, synchronously or compute farm, synchronously or asynchronouslyasynchronously wrapper provided for PBSwrapper provided for PBS

runmode:runmode: tag states the mode and tag states the mode and wrapper for a particular target patternwrapper for a particular target pattern can be set globally and per-patterncan be set globally and per-pattern

special status targets provide execution special status targets provide execution statusstatus

Page 13: BioMake Chris Mungall Berkeley Drosophila Genome Project

runmode examplerunmode exampleblast(P,S,D,A) req: formatdb(D) run: blastall -p P -i S -d D A flat: S-blast/bn(S).D.P.out runmode: async(qsubwrap)

The blast job will beexecuted on the computefarm via qsubwrap (comes with biomake distro)

Upon submission, the status targetstatus_run(blast(P,S,D,A))will be generated; oncompletion, the targetstatus_ok (blast(P,S,D,A))will be generated

biomake can automaticallyhandle moving data inand out between user’sfilesystem (or db) and local cluster nodes

Page 14: BioMake Chris Mungall Berkeley Drosophila Genome Project

DatastoresDatastores BioMake persists targets in DatastoresBioMake persists targets in Datastores The The flat:flat: tag flattens targets to unique tag flattens targets to unique

datastore IDsdatastore IDs Datastore can be filesystem or relational Datastore can be filesystem or relational

databasedatabase default is filesystemdefault is filesystem can be set globally or per targetcan be set globally or per target

e.g. analysis result targets can be stored on filesystem, e.g. analysis result targets can be stored on filesystem, status targets stored in DBstatus targets stored in DB

NFS traffic can be avoided on compute farm by NFS traffic can be avoided on compute farm by storing targets and status targets in a databasestoring targets and status targets in a database

Page 15: BioMake Chris Mungall Berkeley Drosophila Genome Project

Asynchronous ExecutionAsynchronous Executionlocal machinerunning biomake

scheduler

node

NFS

For each target T to be built:1) biomake fetches status of T skips T if status = ok/run2) biomake stores status_run(T)3) biomake creates a runner agent script and submits it to the cluster4) continue onto next target

1) agent fetches any input data2) agent runs command (eg blast)synchronously3) agent stores result4) agent stores status of T as‘ok’ or ‘err’

AGENT

AGENT

Page 16: BioMake Chris Mungall Berkeley Drosophila Genome Project

Specifying rulesSpecifying rules Pipeline systems often require a rule Pipeline systems often require a rule

basebase only do nuc to nuc alignments on one species or only do nuc to nuc alignments on one species or

two recently diverged speciestwo recently diverged species use sequence ontology hierarchy to decide use sequence ontology hierarchy to decide

analyses or parametersanalyses or parameters biomake protocols can have prolog facts biomake protocols can have prolog facts

and rules embedded inside themand rules embedded inside them biomake distro comes with SO prolog db biomake distro comes with SO prolog db

and rules for graph traversaland rules for graph traversal

Page 17: BioMake Chris Mungall Berkeley Drosophila Genome Project

<data relation=“fastadb” cols=“ID,SeqAlphabet,SOType,Org” del=“ws”>protein.fly.fstprotein.fly.fst aa

aapolypeptidepolypeptide D melanogasterD melanogaster

est.fly.fstest.fly.fst nnaa

ESTEST D melanogasterD melanogaster

cdna.fly.fstcdna.fly.fst nnaa

cDNAcDNA D melanogasterD melanogaster</data><prolog> nucdb(D):- fastadb(D,na,_,_). pepdb(D):- fastadb(D,aa,_,_).</prolog>

formatdb(D) run: formatdb -i D -p ‘T’ {pepdb(D)} run: formatdb -i D -p ‘F’ {nucdb(D)}

Embedding prolog factsEmbedding prolog facts

Page 18: BioMake Chris Mungall Berkeley Drosophila Genome Project

BioMake module systemBioMake module system The biomake core language is generic The biomake core language is generic

no bioinformatics-specific code or tweaksno bioinformatics-specific code or tweaks biomake uses a module systembiomake uses a module system biomake distro comes withbiomake distro comes with

biosequence_analysis modulebiosequence_analysis module Sequence Ontology prolog db and rulesSequence Ontology prolog db and rules scripts for handling bioinformatics datascripts for handling bioinformatics data

Page 19: BioMake Chris Mungall Berkeley Drosophila Genome Project

biomake extensibilitybiomake extensibility biomake is a declarative languagebiomake is a declarative language embodies both logical and functional embodies both logical and functional

paradigmsparadigms targets are actually higher order functionstargets are actually higher order functions standard FP functions available in fp standard FP functions available in fp

modulemodule cons,map,grep,filter,fold,…cons,map,grep,filter,fold,…

goal: expressive power, concise goal: expressive power, concise specifications and simplicityspecifications and simplicity

Page 20: BioMake Chris Mungall Berkeley Drosophila Genome Project

BioMake in useBioMake in use currently we’re using biomake for…currently we’re using biomake for…

analysis of repeat families found in analysis of repeat families found in orthologous and paralogous intronsorthologous and paralogous introns

Building the Gene Ontology databaseBuilding the Gene Ontology database ……but we are still dependent on but we are still dependent on

legacy pipeline code for many legacy pipeline code for many analysesanalyses

Page 21: BioMake Chris Mungall Berkeley Drosophila Genome Project

Running biomakeRunning biomake Get distro from Get distro from http://skam.sourceforge.nethttp://skam.sourceforge.net Requires XSB PrologRequires XSB Prolog

http://xsb.sourceforge.nethttp://xsb.sourceforge.net Run via command lineRun via command line

similar to unix make commandsimilar to unix make command Works on both OS X and LinuxWorks on both OS X and Linux Relational datastore requires mysql (Pg Relational datastore requires mysql (Pg

soon)soon) Better docs coming soon, lots of examplesBetter docs coming soon, lots of examples

Page 22: BioMake Chris Mungall Berkeley Drosophila Genome Project

AcknowledgementsAcknowledgementsShengqiang ShuSima MisraErwin FriseEric SmithMark YandellGeorge HartzellChris SmithSimon Prochnik

Jon TupyJosh KaminkerKaren EilbeckNomi HarrisSuzanna LewisGerry Rubin

Page 23: BioMake Chris Mungall Berkeley Drosophila Genome Project

Problem SpecificationProblem Specification A build network consist of multiple A build network consist of multiple TargetsTargets

e.ge.g the output from a the output from a blastxblastx alignment of alignment of my.seqmy.seq to the to the protein database protein database nr.flynr.fly

Targets have a Targets have a logical patternlogical pattern e.ge.g blastblast alignment using alignment using PP of some seq of some seq SS vs some db vs some db DD

Targets are Targets are dependentdependent on other Targets on other Targets e.ge.g blast depends on the indexing of db blast depends on the indexing of db DD using using

formatdbformatdb Upstream changes trigger downstream actionsUpstream changes trigger downstream actions Targets are built by Targets are built by running actionsrunning actions

““formatdb -p T -i nr.flyformatdb -p T -i nr.fly”” ““blastall -p blastx -i my.seq -d nr.flyblastall -p blastx -i my.seq -d nr.fly””