software openaccess uap:reproducibleandrobusthtsdata …

9
Kämpf et al. BMC Bioinformatics (2019) 20:664 https://doi.org/10.1186/s12859-019-3219-1 SOFTWARE Open Access uap: reproducible and robust HTS data analysis Christoph Kämpf 1,2,3,4† , Michael Specht 3† , Alexander Scholz 3 , Sven-Holger Puppel 3 , Gero Doose 2,5 , Kristin Reiche 3* , Jana Schor 1* and Jörg Hackermüller 1,2* Abstract Background: A lack of reproducibility has been repeatedly criticized in computational research. High throughput sequencing (HTS) data analysis is a complex multi-step process. For most of the steps a range of bioinformatic tools is available and for most tools manifold parameters need to be set. Due to this complexity, HTS data analysis is particularly prone to reproducibility and consistency issues. We have defined four criteria that in our opinion ensure a minimal degree of reproducible research for HTS data analysis. A series of workflow management systems is available for assisting complex multi-step data analyses. However, to the best of our knowledge, none of the currently available work flow management systems satisfies all four criteria for reproducible HTS analysis. Results: Here we present uap, a workflow management system dedicated to robust, consistent, and reproducible HTS data analysis. uap is optimized for the application to omics data, but can be easily extended to other complex analyses. It is available under the GNU GPL v3 license at https://github.com/yigbt/uap. Conclusions: uap is a freely available tool that enables researchers to easily adhere to reproducible research principles for HTS data analyses. Keywords: Sequencing data analysis, Work-flow management, Reproducible research Background Next generation or high throughput sequencing (HTS) methods that rely on massively parallel DNA sequenc- ing have opened a new era of molecular life sciences. A continuous growth in sequencing throughput, precision, length of the sequencing reads, and increasing automation and miniaturization led to a huge advantage in manual efforts and costs per experiment compared to traditional Sanger sequencing. HTS has thus become a mainstay in many fields of biology and biomedicine: Apart from de novo genome sequencing, HTS is not only increasingly replacing other massively parallel techniques like microar- rays in transcriptomics or epigenomics, but also allows *Correspondence: [email protected]; [email protected]; [email protected] Christoph Kämpf and Michael Specht contributed equally to this work 1 Young Investigators Group Bioinformatics and Transcriptomics, Department Molecular Systems Biology, Helmholtz Center for Environmental Research – UFZ, Permoserstraße 15, 04318 Leipzig, Germany 2 Bioinformatics Department, Universität Leipzig, Härtelstraße 16-18, 04107 Leipzig, Germany Full list of author information is available at the end of the article for new approaches e.g. in microbial ecology, population genetics, clinical diagnostics, or breeding. As a conse- quence, the amounts of HTS-generated data are exploding in many disciplines, associated with challenges for storage, transfer, curation and particularly reproducible analysis of these data [1]. HTS data analysis is a multi-step process and fairly complex compared to the analysis of other biological data. Analysis of e.g. gene expression microarray data is involving image processing and statistical analysis of the expression signal. Analysis of a comparable transcriptome sequencing (RNA-seq) data set involves – apart from the initial base calling – at least clipping of adapter and low quality sequences, mapping to a reference genome, count- ing of mapped reads per annotation element and statistical analysis and may include many other steps like transcript de novo assembly or analysis of differential splicing. For most of these steps a selection from a range of available bioinformatic tools needs to be made and for most tools a variety of parameters needs to be set. © The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Upload: others

Post on 29-Jul-2022

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: SOFTWARE OpenAccess uap:reproducibleandrobustHTSdata …

Kaumlmpf et al BMC Bioinformatics (2019) 20664 httpsdoiorg101186s12859-019-3219-1

SOFTWARE Open Access

uap reproducible and robust HTS dataanalysisChristoph Kaumlmpf1234dagger Michael Specht3dagger Alexander Scholz3 Sven-Holger Puppel3 Gero Doose25Kristin Reiche3 Jana Schor1 and Joumlrg Hackermuumlller12

Abstract

Background A lack of reproducibility has been repeatedly criticized in computational research High throughputsequencing (HTS) data analysis is a complex multi-step process For most of the steps a range of bioinformatic tools isavailable and for most tools manifold parameters need to be set Due to this complexity HTS data analysis isparticularly prone to reproducibility and consistency issues We have defined four criteria that in our opinion ensure aminimal degree of reproducible research for HTS data analysis A series of workflow management systems is availablefor assisting complex multi-step data analyses However to the best of our knowledge none of the currently availablework flow management systems satisfies all four criteria for reproducible HTS analysis

Results Here we present uap a workflow management system dedicated to robust consistent and reproducibleHTS data analysis uap is optimized for the application to omics data but can be easily extended to other complexanalyses It is available under the GNU GPL v3 license at httpsgithubcomyigbtuap

Conclusions uap is a freely available tool that enables researchers to easily adhere to reproducible researchprinciples for HTS data analyses

Keywords Sequencing data analysis Work-flow management Reproducible research

BackgroundNext generation or high throughput sequencing (HTS)methods that rely on massively parallel DNA sequenc-ing have opened a new era of molecular life sciences Acontinuous growth in sequencing throughput precisionlength of the sequencing reads and increasing automationand miniaturization led to a huge advantage in manualefforts and costs per experiment compared to traditionalSanger sequencing HTS has thus become a mainstay inmany fields of biology and biomedicine Apart from denovo genome sequencing HTS is not only increasinglyreplacing othermassively parallel techniques likemicroar-rays in transcriptomics or epigenomics but also allows

Correspondence kristinreicheizifraunhoferde janaschorufzdejoerghackermuellerufzdedaggerChristoph Kaumlmpf and Michael Specht contributed equally to this work1Young Investigators Group Bioinformatics and Transcriptomics DepartmentMolecular Systems Biology Helmholtz Center for Environmental Research ndashUFZ Permoserstraszlige 15 04318 Leipzig Germany2Bioinformatics Department Universitaumlt Leipzig Haumlrtelstraszlige 16-18 04107Leipzig GermanyFull list of author information is available at the end of the article

for new approaches eg in microbial ecology populationgenetics clinical diagnostics or breeding As a conse-quence the amounts of HTS-generated data are explodinginmany disciplines associated with challenges for storagetransfer curation and particularly reproducible analysis ofthese data [1]HTS data analysis is a multi-step process and fairly

complex compared to the analysis of other biologicaldata Analysis of eg gene expression microarray data isinvolving image processing and statistical analysis of theexpression signal Analysis of a comparable transcriptomesequencing (RNA-seq) data set involves ndash apart from theinitial base calling ndash at least clipping of adapter and lowquality sequences mapping to a reference genome count-ing ofmapped reads per annotation element and statisticalanalysis and may include many other steps like transcriptde novo assembly or analysis of differential splicing Formost of these steps a selection from a range of availablebioinformatic tools needs to be made and for most tools avariety of parameters needs to be set

copy The Author(s) 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 40International License (httpcreativecommonsorglicensesby40) which permits unrestricted use distribution andreproduction in any medium provided you give appropriate credit to the original author(s) and the source provide a link to theCreative Commons license and indicate if changes were made The Creative Commons Public Domain Dedication waiver(httpcreativecommonsorgpublicdomainzero10) applies to the data made available in this article unless otherwise stated

Kaumlmpf et al BMC Bioinformatics (2019) 20664 Page 2 of 9

In general an HTS data analysis can be described asa directed acyclic graph (DAG) like structure We calla node in the DAG a step Steps may depend on pre-viously computed results (produced by preceding steps)andor branch out to subsequent parts of the analysisResults of individual steps also may be merged again inlater steps and processed further See Fig 1 for a sketch ofa prototypic HTS analysisIndependent replication is a fundamental principle for

evaluating published findings If a complete replication isnot feasible eg due to the access to samples or cost andeffort for HTS experiments reproducing the analysis fromraw data to published claims is the second best alternative[2] The degree of reproducibility of biomedical research

Fig 1 Sketch of a prototypic DAG describing the analysis ofunmapped HTS reads HTS data analysis typically follows a DAG-likestructure Nodes of the DAG are called steps and may depend onpreceding steps andor branch out to subsequent parts of theanalysis Results of individual steps may be merged again in latersteps and processed further

has been criticized [3] and a rdquoreproducibility crisisrdquo hasbeen diagnosed recently [4] Given the complexity of HTSdata analyses outlined above reproducibility is particu-larly dependent on a detailed reporting of analysis detailsHowever a critical analysis of published HTS-based geno-typing studies revealed that less than a third of the studiesanalyzed provided sufficient information to reproduce themapping step [1]Several appeals have been made to alleviate these repro-

ducibility issues in computer science and computationalbiology Roger Peng emphasized the necessity of linkingexecutable analysis code and data as the gold standard sec-ond to full replication [2] Sandve and colleagues calledfor adhering to rdquoten simple rules for reproducible compu-tational researchrdquo which fully apply to HTS data analysis[5] Finally Gruumlning and coauthors defined a technologystack for reproducible research and formulate guidelinesthat particularly consider the numerical reproducibility ofcomputation in the life sciences [6]In our opinion a minimal degree of reproducible

research in managing HTS data analyses requires a toolwhich ensures that (i) the dependencies between analysissteps and intermediate results are correctly maintained(ii) analysis steps are successfully completed prior to exe-cution of subsequent steps (iii) all tools their versionsand full parameter sets (including standard parameterswhich are usually not set when starting the tool from thecommandline) are logged (iv) the consistency betweenthe code defining the analysis and the currently availableresults is ensuredA series of different bioinformatic workflow manage-

ment systems (WMS) is available to support complexDAG-like analyses WMS that are appropriate for HTSanalyses are in part general purpose systems in partspecific for HTS or even designed to address individualaspects of HTS data analysis Also differentWMS designsrequire various levels of experience from the userswhile providing different degrees of flexibility WMSapproaches like Ruffus [7] or SnakeMake [8] allowthe implementation of highly individual analyses eithervia domain-specific programming languages or a general-purpose programming language iRAP [9] RseqFlow[10] or MAP-RSeq [11] belong to a group of WMSthat implement a single specific type of HTS analysisSeveral WMS encapsulate the individual steps of an anal-ysis within modules and allow for their free combinationeg Galaxy [12] Unipro UGENE [13] KNIME [14] orTaverna [15] ndash many of which come with a graphicaluser interface Finally a group of lightweight modularizedWMS aims at a modular customizable command-line-based approach which includes eg bcbio-nextgen[16] Bpipe [17] or nextflow [18] Several of theseWMS rely on logging detailed information of used toolsrsquoversions and issued commands However to the best of

Kaumlmpf et al BMC Bioinformatics (2019) 20664 Page 3 of 9

our knowledge none of the published WMS satisfies allfour criteria that we defined for maintaining a minimaldegree of reproducible research in HTS data analysisWe compared the essential features particularly regard-ing reproducibility of a set of actively maintained flexibleand modular WMS in more detail (Table 1 Additional file1 Table S1)Here we introduce the workflow management system

uap (Universal Analysis Pipeline) that may be used toimplement any DAG-like data analysis workflow but isprimarily aimed at HTS data analysis It is constructedto execute control and keep track of the analysis oflarge data sets uap encapsulates the usage of (not nec-essarily bioinformatic) tools and handles data flow andprocessing of the complete analysis Produced data istightly linked with the code specifying the analysis Thusit enables users to perform reproducible robust andconsistent data analyses We provide complete work-flows for handling genomic data and analyzing RNA-Seqand Chromatin Immuno-Precipitation DNA-Sequencing(ChIP-seq) data which can be used as templates thatallow for easy customization As we are also integratingsteps for downloading published sequencing raw data (egfrom SRA) uap enables users to efficiently reproduce thedata analysis of published studies The provided work-flows have been optimized for minimal IO load on highperformance computing (HPC) environments Althoughinitially designed for HTS data analysis the plugin archi-tecture of uap allows for the expansion to any kind of dataanalysis

Implementationuap is a workflow management software (WMS) imple-mented in Python It provides user-friendly access to arange of bioinformatic analyses of large datasets such ashigh throughput sequencing data Each analysis is com-pletely described by an individual configuration file inYAML1 format The steps of the analysis and their depen-dencies as well as the required tools parametrization andlocations in the file system are specified there Based onthese settings uap constructs a directed acyclic graph(DAG) that represents the workflow of the analysis

Analysis as a directed acyclic graph The DAG repre-sents single analysis steps as nodes and pairwise depen-dencies between steps as directed edges A step is ablueprint for a particular analysis with defined input andoutput data The passing of input data to a step and thegeneration of a particular type of output data is mod-eled as connections They control the data flow betweensteps by grouping output files and providing them todown-stream steps uap distinguishes between source

1httpyamlorg

and processing steps Source steps emit data into the work-flow and hand the files over to downstream processingsteps via output connections The user is free to categorizethe input data files for a workflow into user specifiedgroups to create separate output locations for each cate-gory Processing steps on the other hand receive data fromupstream steps via input connections define a sequenceof execution commands and assign output file locationsfor each input connection The entirety of these configu-rations of a step for a particular set of input connectionsis called a run It can be interpreted as an instance of astep and is the atomic unit of the analysis Additional file1 Figure S1 shows the DAG including its runs renderedby uap based on the configuration file for the analysis ofa published data set

Plug-in architecture Steps encapsulate the usage of atool in a single python class which allows users to easilycustomize uap by adding steps Every new step inheritsfrom a super class defines incoming and outgoing con-nections the required tools and has to implement theruns() method A step can individually be optimizedfor efficient use of CPU and memory usage For allowinga flexible accommodation to different high performancecomputing environments uap supports a step-specificadaptation of the environment eg for setting variablesor automatic loading and unloading of software modulesAdditional file 1 Figure S3 sketches how new processingsteps are defined

Enforcing consistency and integrity When computingon large data sets partial processing of large files due topremature termination of tools may remain undetectedwithout stringent monitoring of processes and poses asevere threat to data integrity In uap runs are there-fore executed in a temporary directory and monitoredthroughout execution The overall workflow is not com-promised in case a single run fails Result files are onlymoved to their final location if all processes of a run exitedgracefully and all expected output files existuap automatically re-schedules runs if it detects failed

processes or missing files Also changes in the configu-ration trigger re-scheduling of the affected runs and alldependent runs in the DAG

Maintaining reproducibility uap tightly links analysiscode and resulting data by hashing over the completesequence of commands including parameter specifica-tions of a run and appending the key to the output pathThus any changes to the analysis code alter the expectedoutput location which allows uap to check whether anal-ysis code and output correspond to each otherAt execution time an annotation file in YAML format is

captured for each run that contains the complete content

Kaumlmpf et al BMC Bioinformatics (2019) 20664 Page 4 of 9

Table

1Com

parison

ofworkflowmanagem

entsystemsandHTS

analysispipe

lines

basedon

essentialfeaturespa

rticularlyregardingreprod

ucibility

Arvados

bcbio-ne

xtge

n[16]

BigD

ataScript[19]Bp

ipe[17]

ClusterFlow

[20]

Cosmos

[21]

CromwellGalaxy[40]

GwfHive

Nextflow[18]

Pype

Flow

Rabix[22]

Toil

uap

Mod

ular

ndash

Customizab

le

ndash

Flexible

ndash

Failurerecovery

ndashndash

ndash

ndashndash

ndash

Reprod

ucibility

Dep

ende

ncies

ndashndash

ndash

ndashndash

ndash

ndashndash

Con

sisten

cy

Linkingcode

andresult

ndashndash

ndashndash

ndashndash

ndash

ndashndash

ndash

ndashndash

Logg

ing

ndashndash

Dataauthen

ticity

ndash

ndashndash

ndash

ndash

ndashndash

ndashndash

ndashndash

Dataintegrity

ndash

ndashndash

ndash

ndashndash

ndashndash

ndash

Supp

platforms

Cluster

Docker

ndashndash

ndashndash

ndash

ndash

Cloud

ndash

ndash

ndash

ndash

Local

ndash

Toolswereselected

from

httpsgith

ubcom

pditommasoaw

esom

e-pipe

lineretainingon

lytoolsthatwereconsidered

tobe

activelyde

velope

dby

atleastfivecontrib

utors(latestcommitafter3

1012018)arelicen

sedas

open

sourcew

herethescop

eof

applicationisclearly

onbioinformaticsandwhich

supp

ortclusterba

tchsystem

sModularWorkflowsareassembled

from

reusab

lestep

sCustomizab

leCon

figurationof

theanalysisissepa

ratedfro

mthe

code

defin

ingthestep

sFlex

ibleW

orkflowscaneasilybe

alteredwith

outp

rogram

mingskillsFa

ilure

reco

veryFailuredu

ringexecutionisde

tected

andsubseq

uent

analyses

halte

dto

avoidcorrup

tedresultsRep

roducibility

-Dep

enden

ciesW

MSen

forces

thatde

pend

encies

betw

eenanalysisstep

sandinterm

ediate

results

arecorrectly

maintaine

dConsisten

cyW

MSsafegu

ards

thatanalysisstep

saresuccessfullycompleted

priortoexecutionof

subseq

uent

step

sLinkingco

dean

dresu

ltW

MSen

suresconsistencybe

tweenthecode

defin

ingtheanalysisandthecurren

tlyavailableresultsLoggingW

MSlogs

Stdo

utStderrexitstatustoo

lversion

sWMSversionexecuted

commandsexecutio

ndateand

in-ou

tput

files

(see

correspo

ndingAdd

ition

alfile1Tables

S1andS2

ford

etails)Dataau

then

ticity

WMSrecordsinform

ationab

outcreator

andcreatin

gprocess(es)o

fthe

dataD

ataintegrityW

MS

recordsinform

ationthatallowsto

verifytheintegrity

ofcreateddata(eghashsums)Suppp

latform

s-C

lusterW

MSsupp

ortscompu

teclusterm

anagem

entsystemstonrunjobsD

ockerW

MScanrunjobs

usingDockerimages

CloudW

MScanbe

deployed

inacompu

teclou

dSu

ppC

WLWMSsupp

ortscommon

workflowlang

uage

LocalWMScanbe

executed

locally

with

outd

epen

ding

onaclusterm

anagem

entsystem

ora(web

)serverResults

are

givenas

(not

met)(p

artially

met)(f

ullymet)and

ndash(not

stated

)Ratin

gsareba

sedon

inform

ationprovided

inthepa

persdocum

entatio

nsandmanualsA

morede

tailedcompa

rison

ispresen

tedinAdd

ition

alfile1TableS1

Kaumlmpf et al BMC Bioinformatics (2019) 20664 Page 5 of 9

of the configuration at this point Hence an executed runis documented with the releases of all used software andthe invoked command line with all parameter settings Inaddition memory and CPU usage of each process check-sums of result files as well as the last kB of stdout andstderr output are reported The annotation file is storednext to the result files of a run

Process flow Initially uap reads the configuration gen-erates the respective DAG and defines all commands andoutput file names Throughout this initiation process uapinspects the planned analysis for potential errors Thegraph is tested to be acyclic all required tools (in definedreleases) are tested for their availability and the status ofall steps is determined This initiation phase is executedearly ie before submitting runs to a compute cluster uapthus implements a failing fast technique This is an impor-tant feature when working with large amounts of data on

HPC systems where software is dynamically loaded anderroneous configurations might otherwise only becomeapparent after hours of computation Figure 2 illustratesuaprsquos process flow error reporting and the link betweenconfiguration and result filesSubsequently uap can start runs display the com-

mands of runs show the state of the runs and ren-der execution graphs Execution graphs are useful toolsto inspect the performance of an analysis eg to iden-tify resource bottlenecks in a pipeline of commandsAdditional file 1 Figure S2 shows such an execution graphFigure 3 provides an overview of the main principles uapis built on

Resultsuap is a workflow management system dedicated to dataconsistency and adoption of a Reproducible Researchparadigm in HTS data analysis uap runs on UNIX-

Fig 2 uaprsquos process flow error reporting and the link between the analysis code and result files uap implements a failing fast approach the DAGis built from the configuration file tested to be acyclic all required tools are tested for their availability and the status of all steps is determinedSubsequently uap can start runs display the commands of runs show the state of the runs and render execution graphs Runs are executed intemporary directories and monitored throughout execution Result files are only generated at their final location if all processes of a run exitedgracefully and all expected output files exist Analysis code and resulting data are tightly linked by hashing over the complete sequence ofcommands and parameters of a run and appending the key to the output path Each run generates an annotation file in YAML format that capturesthe configuration software versions and releases the invoked command line all parameters memory and CPU usage of each process involvedchecksums of the result files as well as the last kB of stdout and stderr

Kaumlmpf et al BMC Bioinformatics (2019) 20664 Page 6 of 9

Fig 3 Sketch of the main principles uap is built on a An analysis with uap comprises three parts (i) the uap source code itself implemented inPython - it contains the complete framework of uap and 2 classes for the implementation of source and processing steps These classes are used towrap any tool that is part of the analysis enabling an easy extension of the uaprsquos repertoire of steps (ii) the uap configuration in YAML format Itcontains all necessary information to run and reproduce the analysis given the data and (iii) the uap results - organized in one folder per step inthe output directory The special folder temp contains the expected results until the computation of the step has finished successfully and keepsthe intermediate results and log files upon failure b The progress or state of an analysis can be monitored with a call to uap status Itdetermines the state of each individual step in dependence of the state(s) of its previous step(s) and provides this information to the user

like operating systems and can interact with batchqueuing systems like the SunOracleUniva grid engines(SGEOGEUGE) and SLURM [23] to submit analyses tohigh performance computing systems uap is distributedunder the GNU GPL v3 license and is publicly avail-able at httpsgithubcomyigbtuap Its documentationis hosted at httpuapreadthedocsorg A docker con-tainer with a core set of tools is available at httpshubdockercomryigbtuaptags

uap is distributed with predefined workflows for (i)genome sequence download and index generation forread mapping programs (ii) transcriptome sequencing(RNA-seq) data analysis and (iii) Chromatin Immuno-Precipitation DNA-Sequencing (ChIP-Seq) data analysisFurther we provide small test data sets enabling a quickstart for each of these workflows An additional exampleusing a larger data set is provided via code for download-ing and analyzing a publicly available ChIP-Seq data set

Kaumlmpf et al BMC Bioinformatics (2019) 20664 Page 7 of 9

(Barski et al [24]) The provided workflows are intendedto serve as an easy entry point into a uap analysis as wellas a template for similar analyses eg for other species orwith another set of tools

Preparing genomic data for HTS analysis An impor-tant prerequisite for HTS projects where a referencegenome is available is aligning (mapping) the sequencingreads to this genome Most mapping software requires aspecific data structure (index) of the genome to efficientlysolve this alignment problem Indices have to be gener-ated once prior to any mapping procedure We provideuap configuration files for the bwa [25] bowtie [26]and segemehl [27] mapping programs and samtools(fasta indexing) of the a) Mycoplasma genitaliumand b) Homo sapiens genome Genomic sequencescan be downloaded automatically prior to indexgeneration

Transcriptome assembly from RNA-Seq data Tran-scriptome sequencing identifies and estimates the quan-tity of RNA in biological samples Beyond quantification ofknown transcripts based on overlapping sequence readsRNA-seq allows the assembly of novel transcripts Weprovide a uap configuration file for combining split-readmapping with de novo transcript assembly uap readsthe sequencing data either from an Illumina sequencingrun folder or a set of fastq files applies quality con-trol removes adapter sequences and maps the reads toa genome using tophat2 [28] and segemehl [27 29]The mapped reads from tophat2 are directly processedby cufflinks [30] for de novo transcript assemblySplit-reads mapped with segemehl are prepared forcufflinks using an adapter script and then also pro-cessed with cufflinks The configuration also containsa step to determine the number of mapped reads pertranscript applying htseq-count [31]

Identification of enriched regions from ChIP-Seqdata ChIP sequencing is a method that integrates chro-matin immunoprecipitation (ChIP) and high-throughputDNA sequencing to identify sites of protein-DNA inter-actions The provided uap-configuration file for this taskinitially resembles the RNA-Seq workflow Here readsare expected to correspond to genomic DNA and themapping is done without considering split-reads usingbowtie2 [26] Mapped reads are subsequently sortedduplicates removed and enriched regions are detectedusing MACS2 [32]

ConclusionsThe critique on a lack of reproducibility in science anda growing awareness that many reported facts do not

seem to hold up to repeated investigation has mean-while reached a broader audience beyond the scien-tific community (eg [33ndash35]) In our opinion HTSdata analysis is particularly prone to consistency andreproducibility issues ndash especially due to the complex-ity of the analysis the involved data volume and thebroad range of available tools and their multitude ofparametersWorkflow management systems are indispensable for

reproducibly controlling more complex analyses like HTSdata analysis Published workflow management systemsfor HTS data analysis are either highly flexible and madefor experienced programmers or lack a lot of flexibil-ity but can be used intuitively and some are specific toa certain type of analysis In the introduction we listedfour minimal requirements that we consider essential forensuring reproducibility and consistency of HTS dataanalyses Additionally but beyond the programmatic con-trol of a of a WMS using versioned external data likegenome sequences or annotation is key for reproducibleanalysis None of the published systems we are awareof however completely satisfies these criteria uap hasbeen designed to fulfill these One critical requirement islinking analysis code and generated data While Repro-ducible Research in statistics [36] uses tools like Sweave[37] knitR [38] or Jupyter [39] to combine analysis codeand resulting data in one output file such a strategy isnot feasible for most steps of an HTS analysis due tothe size of generated data uap therefore relies on hash-ing over the complete sequence of commands includingparameters of a run and appending the key to the outputpath In addition uap performs logging and process mon-itoring supports different cluster management systemscreates recovery points plots execution graphs managesjob dependencies and is extensible to any kind of multi-step analysis It provides pre-built steps for the prepara-tion of genomic data and the analysis of RNA-Seq andChIP-Seq dataAmong the many different flavors of WMS uap is

clearly easier to operate for users with limited program-ming experience than systems based on domain specificprogramming languages while offering a lot more flexi-bility than single purpose tools Based on the comparisonof tools in Additional file 1 Table S1 the tools most ded-icated to reproducible research and providing the mostsimilar set of features compared to uap are Galaxy andNextflowIn our opinion Galaxy [40] and uap address dif-

ferent user groups and tasks and we use both WMSin our research environment Galaxy is great for pro-viding predefined workflows to users without experi-ence in programming and working on the commandline It allows these users to adapt parameters and exe-cute such workflows on their data For users working

Kaumlmpf et al BMC Bioinformatics (2019) 20664 Page 8 of 9

frequently with large HTS data sets adapting work-flows to a larger extent or duplicating branches of theDAG for performing variants of the analysis in paral-lel is much more efficiently performed in uap Galaxydoes not link data and code in the sense of uap Butas any change to parameters in a Galaxy workflowtriggers re-execution of the sub-workflow below this isnot necessary If many changes throughout a workflowhave to be made this behavior of Galaxy may be hin-dering Obviously running HTS analyses on Galaxyrequires a Galaxy server integrated with an HPC envi-ronment which is not trivial to set up and demandscontinuous maintenance Starting from scratch settingup HTS analysis is significantly less effort using uapthan GalaxyNextflow is a powerful WMS dedicated to scalability

and reproducibility [18] Its approach to reproducibil-ity relies on a tight integration with github and thesupport of scalable containerization of pipelines usingeg Docker Nextflow and uap share several con-cepts eg using temporary files for intermediate resultsor analyzing the workflow DAG to enable failing fastIntermediate results of an HTS analysis can consumelarge amounts of storage space uap therefore providesa means for volatilization of intermediate results with-out breaking dependencies in the DAG - a feature whichdoes not seem to be available in Nextflow Loggingis somewhat limited in Nextflow compared to uapbut Nextflow provides a broader support for HPCenvironments including also support for cloud comput-ing Nextflowrsquos approach to reproducibility is pow-erful when software from Github is used as it enablesthe user to request a specific commit or when thetools used are publicly available as a container How-ever when a tool is run in rsquonative task supportrsquo likethe Kallisto example provided in [18] uap is morestringent in logging version and the full set (includingdefault) of parameters Where uap and Nextflow dif-fer most clearly regarding reproducibility is linking dataand code as to our understanding based on publica-tions and the online documentation this is not availablein NextflowIn summary we are convinced that reproducible

research principles need to be advanced for HTS dataanalysis and that uap is a highly useful system for facingthis challenge

Availability and requirementsProject name uapProject home page httpsgithubcomyigbtuapOperating system(s) LinuxProgramming language PythonOther requirements virtualenv git and graphviz

License GNU GPL v3Any restrictions to use by non-academics None

Supplementary informationSupplementary information accompanies this paper athttpsdoiorg101186s12859-019-3219-1

Additional file 1 This file includes supplemental tables and figures

AbbreviationsChIP-seq Chromatin immuno-precipitation dna-sequencing DAG Directedacyclic graph HTS High throughput sequencing HPC High performancecomputing RNA-seq Transcriptome sequencing WMS Workflowmanagement system

AcknowledgmentsNot applicable

Authorsrsquo contributionsKR MS and JH initially designed and implemented the software SHP CK ASKR JS and JH took over the further development of the software GDimplemented the script that formats segemehl-output to be compatiblewith the cufflinks suite of tools CK and AS provided the dockercontainers of uap CK KR JS and JH wrote the manuscript All authors readcorrected and approved the final manuscript

FundingThis work was supported in part by the Initiative and Networking Fund of theHelmholtz Association through a grant to JH (VH-NG738) by the DeutscheForschungsgemeinschaft (DFG) through a grant to JH (SPP1738) by theFraunhofer Future Fund and through the LIFE Leipzig Research Center forCivilization Diseases The funding bodies did not have any influence on thedesign of the study nor on collection analysis or interpretation of data nor onwriting the manuscript

Availability of data andmaterialsSource code is available at httpsgithubcomyigbtuap links to Dockercontainers and documentation are provided at httpswwwufzdeindexphpen=44919 All datasets used within the example workflows are publiclyavailable and are fully referenced within the workflow configuration files

Ethics approval and consent to participateNot applicable

Consent for publicationNot applicable

Competing interestsThe authors declare that they have no competing interests KR is member ofthe editorial board (Associate Editor) of BMC Bioinformatics

Author details1Young Investigators Group Bioinformatics and Transcriptomics DepartmentMolecular Systems Biology Helmholtz Center for Environmental Research ndashUFZ Permoserstraszlige 15 04318 Leipzig Germany 2Bioinformatics DepartmentUniversitaumlt Leipzig Haumlrtelstraszlige 16-18 04107 Leipzig Germany3Bioinformatics Unit Department of Diagnostics Fraunhofer Institute for CellTherapy and Immunology Perlickstraszlige 1 04103 Leipzig Germany 4Presentaddress Department of Diagnostics Fraunhofer Institute for Cell Therapy andImmunology Perlickstraszlige 1 04103 Leipzig Germany 5Present address ecSeqBioinformatics GmbH Sternwartenstraszlige 29 04103 Leipzig Germany

Received 11 June 2019 Accepted 12 November 2019

References1 Nekrutenko A Taylor J Next-generation sequencing data interpretation

enhancing reproducibility and accessibility Nat Rev Genet 201213667ndash72

Kaumlmpf et al BMC Bioinformatics (2019) 20664 Page 9 of 9

2 Peng RD Reproducible research in computational science Sci N Y20113341226ndash7

3 Bustin SA The reproducibility of biomedical research Sleepers awakeBiomol Detect Quantif 2014235ndash42

4 Baker M 1500scientists lift the lid on reproducibility Nature 2016533452ndash45 Sandve GK Nekrutenko A Taylor J Hovig E Ten simple rules for

reproducible computational research PLoS Comput Biol 2013910032856 Gruumlning B Chilton J Koumlster J Dale R Soranzo N van den Beek M

Goecks J Backofen R Nekrutenko A Taylor J Practical computationalreproducibility in the life sciences Cell Syst 20186631ndash5

7 Goodstadt L Ruffus A lightweight python library for computationalpipelines Bioinformatics 2010262778ndash9

8 Koumlster J Rahmann S Snakemake-a scalable bioinformatics workflowengine Bioinformatics 201228(19)2520ndash2

9 Fonseca Na Petryszak R Marioni J iRAP - an integrated RNA-seq AnalysisPipeline iRAP - an integrated RNA-seq Analysis Pipeline 20140ndash10httpsdoiorg101101005991

10 Wang Y Mehta G Mayani R Lu J Souaiaia T Chen Y Clark A Yoon HJWan L EvgrafovOV Knowles JA Deelman E Chen T RseqFlow Workflowsfor RNA-Seq data analysis Bioinformatics 201127(18)2598ndash600

11 Kalari KR Nair Aa Bhavsar JD O Brien DR Davila JI Bockol Ma Nie JTang X Baheti S Doughty JB Middha S Sicotte H Thompson AEAsmann YW Kocher J-Pa MAP-RSeq Mayo Analysis Pipeline for RNAsequencing BMC Bioinformatics 201415(1)224

12 Goecks J Nekrutenko A Taylor J Galaxy a comprehensive approach forsupporting accessible reproducible and transparent computationalresearch in the life sciences Genome Biol 20101186

13 Golosova O Henderson R Vaskin Y Gabrielian A Grekhov G NagarajanV Oler AJ Quintildeones M Hurt D Fursov M Huyen Y Unipro UGENE NGSpipelines and components for variant calling RNA-seq and ChIP-seq dataanalyses PeerJ 20142644

14 Berthold MR Cebron N Dill F Gabriel TR Koumltter T Meinl T Ohl P SiebC Thiel K Wiswedel B KNIME The Konstanz Information Miner InStudies in Classification Data Analysis and Knowledge Organization (GfKL2007) New York Springer 2007

15 Wolstencroft K Haines R Fellows D Williams A Withers D Owen SSoiland-Reyes S Dunlop I Nenadic A Fisher P Bhagat J Belhajjame KBacall F Hardisty A Nieva de la Hidalga A Balcazar Vargas MP Sufi SGoble C The Taverna workflow suite designing and executing workflowsof Web Services on the desktop web or in the cloud Nucleic acidsresearch 201341(Web Server issue)557ndash61

16 Guimera RV bcbio-nextgen Automated distributed next-gensequencing pipeline EMBnet J 20121730

17 Sadedin SP Pope B Oshlack A Bpipe a tool for running and managingbioinformatics pipelines Bioinformatics (Oxford Engl) 201228(11)1525ndash6

18 Di Tommaso P Chatzou M Floden EW Barja PP Palumbo ENotredame C Nextflow enables reproducible computational workflowsNat Biotechnol 201735316ndash9

19 Cingolani P Sladek R Blanchette M BigDataScript a scripting languagefor data pipelines Bioinformatics 201431(1)10ndash16

20 Ewels P Krueger F Kaumlller M Andrews S Cluster flow A user-friendlybioinformatics workflow tool F1000Research 201652824

21 Gafni E Luquette LJ Lancaster AK Hawkins JB Jung J-Y Souilmi Y WallDP Tonellato PJ Cosmos Python library for massively parallel workflowsBioinformatics 2014302956ndash8

22 Kaushik G Ivkovic S Simonovic J Tijanic N Davis-Dusenbery B Kural DRabix An open-source workflow executor supporting recomputabilityand interoperability of workflow descriptions Pac Symp Biocomput201722154ndash65

23 Yoo AB Jette MA Grondona M SLURM Simple Linux Utility for ResourceManagement In Feitelson D Rudolph L Schwiegelshohn U editorsJob Scheduling Strategies for Parallel Processing 9th InternationalWorkshop JSSPP 2003 Seattle WA USA June 24 2003 Revised PaperBerlin Springer 2003 p 44ndash60

24 Barski A Cuddapah S Cui K Roh T-Y Schones DE Wang Z Wei GChepelev I Zhao K High-resolution profiling of histone methylations inthe human genome Cell 2007129(4)823ndash37

25 Li H Durbin R Fastandaccurate long-read alignment with Burrows-Wheelertransform Bioinformatics (Oxford Engl) 201026(5)589ndash95

26 Langmead B Salzberg SL Fast gapped-read alignment with Bowtie 2Nat Methods 20129(4)357ndash9

27 Hoffmann S Otto C Kurtz S Sharma CM Khaitovich P Vogel J StadlerPF Hackermuumlller J Fast mapping of short sequences with mismatchesinsertions and deletions using index structures PLoS Comput Biol20095(9)1000502

28 Kim D Pertea G Trapnell C Pimentel H Kelley R Salzberg SL TopHat2accurate alignment of transcriptomes in the presence of insertionsdeletions and gene fusions Genome Biol 201314(4)36

29 Hoffmann S Otto C Doose G Tanzer A Langenberger D Christ S KunzM Holdt LM Teupser D Hackermuumlller J Stadler PF A multi-splitmapping algorithm for circular RNA splicing trans-splicing and fusiondetection Genome Biol 201415(2)34

30 Trapnell C Williams Ba Pertea G Mortazavi A Kwan G van Baren MJSalzberg SL Wold BJ Pachter L Transcript assembly and abundanceestimation from RNA-Seq reveals thousands of new transcripts andswitching among isoforms Nat Biotechnol 201128(5)511ndash5

31 Anders S Pyl PT Huber W HTSeq-A Python framework to work withhigh-throughput sequencing data Bioinformatics 201531(2)166ndash9

32 Zhang Y Liu T Meyer Ca Eeckhoute J Johnson DS Bernstein BENusbaum C Myers RM Brown M Li W Liu XS Model-based analysis ofChIP-Seq (MACS) Genome Biol 20089(9)137

33 Fidler F Gordon A Science is in a reproducibility crisis ndash how do weresolve it Conversation 2013 Available at httptheconversationcomscience-is-in-a-reproducibility-crisis-how-do-we-resolve-it-16998Accessed 11 Jun 2019

34 Lehrer J The truth wears off New Yorker 2010Dec 1352ndash735 Van Bavel J Why do so many studies fail to replicate N Y Times 201610

Available at httpswwwnytimescom20160529opinionsundaywhy-do-so-many-studies-fail-to-replicatehtml Accessed 11 Jun 2019

36 Fomel S Claerbout JF Guest Editorsrsquo Introduction ReproducibleResearch Comput Sci Eng 200911(1)5ndash7

37 Leisch F Sweave Dynamic generation of statistical reports using literatedata analysis In Haumlrdle W Roumlnz B editors Compstat 2002mdashProceedings in Computational Statistics Heidelberg Physica Verlag 2002p 575ndash80 ISBN 3-7908-1517-9 httpwwwstatuni-muenchendetexttildelowleischSweave

38 Xie Y Implementing Reproducible Computational Research In StoddenV Leisch F Peng RD editors Boca Raton Chapman and HallCRC 2014ISBN 978-1466561595 httpwwwcrcpresscomproductisbn9781466561595

39 Peacuterez F Granger BE IPython a system for interactive scientificcomputing Comput Sci Eng 20079(3)21ndash9

40 Afgan E Baker D Batut B van den Beek M Bouvier D Cech M ChiltonJ Clements D Coraor N Gruumlning BA Guerler A Hillman-Jackson JHiltemann S Jalili V Rasche H Soranzo N Goecks J Taylor JNekrutenko A Blankenberg D The galaxy platform for accessiblereproducible and collaborative biomedical analyses 2018 update NucleicAcids Res 201846537ndash44

Publisherrsquos NoteSpringer Nature remains neutral with regard to jurisdictional claims inpublished maps and institutional affiliations

  • Abstract
    • Background
    • Results
    • Conclusions
    • Keywords
      • Background
      • Implementation
        • Analysis as a directed acyclic graph
          • Plug-in architecture
            • Enforcing consistency and integrity
            • Maintaining reproducibility
            • Process flow
              • Results
                • Preparing genomic data for HTS analysis
                  • Transcriptome assembly from RNA-Seq data
                    • Identification of enriched regions from ChIP-Seq data
                      • Conclusions
                      • Availability and requirements
                      • Supplementary informationSupplementary information accompanies this paper at httpsdoiorg101186s12859-019-3219-1
                        • Additional file 1
                          • Abbreviations
                          • Acknowledgments
                          • Authors contributions
                          • Funding
                          • Availability of data and materials
                          • Ethics approval and consent to participate
                          • Consent for publication
                          • Competing interests
                          • Author details
                          • References
                          • Publishers Note
Page 2: SOFTWARE OpenAccess uap:reproducibleandrobustHTSdata …

Kaumlmpf et al BMC Bioinformatics (2019) 20664 Page 2 of 9

In general an HTS data analysis can be described asa directed acyclic graph (DAG) like structure We calla node in the DAG a step Steps may depend on pre-viously computed results (produced by preceding steps)andor branch out to subsequent parts of the analysisResults of individual steps also may be merged again inlater steps and processed further See Fig 1 for a sketch ofa prototypic HTS analysisIndependent replication is a fundamental principle for

evaluating published findings If a complete replication isnot feasible eg due to the access to samples or cost andeffort for HTS experiments reproducing the analysis fromraw data to published claims is the second best alternative[2] The degree of reproducibility of biomedical research

Fig 1 Sketch of a prototypic DAG describing the analysis ofunmapped HTS reads HTS data analysis typically follows a DAG-likestructure Nodes of the DAG are called steps and may depend onpreceding steps andor branch out to subsequent parts of theanalysis Results of individual steps may be merged again in latersteps and processed further

has been criticized [3] and a rdquoreproducibility crisisrdquo hasbeen diagnosed recently [4] Given the complexity of HTSdata analyses outlined above reproducibility is particu-larly dependent on a detailed reporting of analysis detailsHowever a critical analysis of published HTS-based geno-typing studies revealed that less than a third of the studiesanalyzed provided sufficient information to reproduce themapping step [1]Several appeals have been made to alleviate these repro-

ducibility issues in computer science and computationalbiology Roger Peng emphasized the necessity of linkingexecutable analysis code and data as the gold standard sec-ond to full replication [2] Sandve and colleagues calledfor adhering to rdquoten simple rules for reproducible compu-tational researchrdquo which fully apply to HTS data analysis[5] Finally Gruumlning and coauthors defined a technologystack for reproducible research and formulate guidelinesthat particularly consider the numerical reproducibility ofcomputation in the life sciences [6]In our opinion a minimal degree of reproducible

research in managing HTS data analyses requires a toolwhich ensures that (i) the dependencies between analysissteps and intermediate results are correctly maintained(ii) analysis steps are successfully completed prior to exe-cution of subsequent steps (iii) all tools their versionsand full parameter sets (including standard parameterswhich are usually not set when starting the tool from thecommandline) are logged (iv) the consistency betweenthe code defining the analysis and the currently availableresults is ensuredA series of different bioinformatic workflow manage-

ment systems (WMS) is available to support complexDAG-like analyses WMS that are appropriate for HTSanalyses are in part general purpose systems in partspecific for HTS or even designed to address individualaspects of HTS data analysis Also differentWMS designsrequire various levels of experience from the userswhile providing different degrees of flexibility WMSapproaches like Ruffus [7] or SnakeMake [8] allowthe implementation of highly individual analyses eithervia domain-specific programming languages or a general-purpose programming language iRAP [9] RseqFlow[10] or MAP-RSeq [11] belong to a group of WMSthat implement a single specific type of HTS analysisSeveral WMS encapsulate the individual steps of an anal-ysis within modules and allow for their free combinationeg Galaxy [12] Unipro UGENE [13] KNIME [14] orTaverna [15] ndash many of which come with a graphicaluser interface Finally a group of lightweight modularizedWMS aims at a modular customizable command-line-based approach which includes eg bcbio-nextgen[16] Bpipe [17] or nextflow [18] Several of theseWMS rely on logging detailed information of used toolsrsquoversions and issued commands However to the best of

Kaumlmpf et al BMC Bioinformatics (2019) 20664 Page 3 of 9

our knowledge none of the published WMS satisfies allfour criteria that we defined for maintaining a minimaldegree of reproducible research in HTS data analysisWe compared the essential features particularly regard-ing reproducibility of a set of actively maintained flexibleand modular WMS in more detail (Table 1 Additional file1 Table S1)Here we introduce the workflow management system

uap (Universal Analysis Pipeline) that may be used toimplement any DAG-like data analysis workflow but isprimarily aimed at HTS data analysis It is constructedto execute control and keep track of the analysis oflarge data sets uap encapsulates the usage of (not nec-essarily bioinformatic) tools and handles data flow andprocessing of the complete analysis Produced data istightly linked with the code specifying the analysis Thusit enables users to perform reproducible robust andconsistent data analyses We provide complete work-flows for handling genomic data and analyzing RNA-Seqand Chromatin Immuno-Precipitation DNA-Sequencing(ChIP-seq) data which can be used as templates thatallow for easy customization As we are also integratingsteps for downloading published sequencing raw data (egfrom SRA) uap enables users to efficiently reproduce thedata analysis of published studies The provided work-flows have been optimized for minimal IO load on highperformance computing (HPC) environments Althoughinitially designed for HTS data analysis the plugin archi-tecture of uap allows for the expansion to any kind of dataanalysis

Implementationuap is a workflow management software (WMS) imple-mented in Python It provides user-friendly access to arange of bioinformatic analyses of large datasets such ashigh throughput sequencing data Each analysis is com-pletely described by an individual configuration file inYAML1 format The steps of the analysis and their depen-dencies as well as the required tools parametrization andlocations in the file system are specified there Based onthese settings uap constructs a directed acyclic graph(DAG) that represents the workflow of the analysis

Analysis as a directed acyclic graph The DAG repre-sents single analysis steps as nodes and pairwise depen-dencies between steps as directed edges A step is ablueprint for a particular analysis with defined input andoutput data The passing of input data to a step and thegeneration of a particular type of output data is mod-eled as connections They control the data flow betweensteps by grouping output files and providing them todown-stream steps uap distinguishes between source

1httpyamlorg

and processing steps Source steps emit data into the work-flow and hand the files over to downstream processingsteps via output connections The user is free to categorizethe input data files for a workflow into user specifiedgroups to create separate output locations for each cate-gory Processing steps on the other hand receive data fromupstream steps via input connections define a sequenceof execution commands and assign output file locationsfor each input connection The entirety of these configu-rations of a step for a particular set of input connectionsis called a run It can be interpreted as an instance of astep and is the atomic unit of the analysis Additional file1 Figure S1 shows the DAG including its runs renderedby uap based on the configuration file for the analysis ofa published data set

Plug-in architecture Steps encapsulate the usage of atool in a single python class which allows users to easilycustomize uap by adding steps Every new step inheritsfrom a super class defines incoming and outgoing con-nections the required tools and has to implement theruns() method A step can individually be optimizedfor efficient use of CPU and memory usage For allowinga flexible accommodation to different high performancecomputing environments uap supports a step-specificadaptation of the environment eg for setting variablesor automatic loading and unloading of software modulesAdditional file 1 Figure S3 sketches how new processingsteps are defined

Enforcing consistency and integrity When computingon large data sets partial processing of large files due topremature termination of tools may remain undetectedwithout stringent monitoring of processes and poses asevere threat to data integrity In uap runs are there-fore executed in a temporary directory and monitoredthroughout execution The overall workflow is not com-promised in case a single run fails Result files are onlymoved to their final location if all processes of a run exitedgracefully and all expected output files existuap automatically re-schedules runs if it detects failed

processes or missing files Also changes in the configu-ration trigger re-scheduling of the affected runs and alldependent runs in the DAG

Maintaining reproducibility uap tightly links analysiscode and resulting data by hashing over the completesequence of commands including parameter specifica-tions of a run and appending the key to the output pathThus any changes to the analysis code alter the expectedoutput location which allows uap to check whether anal-ysis code and output correspond to each otherAt execution time an annotation file in YAML format is

captured for each run that contains the complete content

Kaumlmpf et al BMC Bioinformatics (2019) 20664 Page 4 of 9

Table

1Com

parison

ofworkflowmanagem

entsystemsandHTS

analysispipe

lines

basedon

essentialfeaturespa

rticularlyregardingreprod

ucibility

Arvados

bcbio-ne

xtge

n[16]

BigD

ataScript[19]Bp

ipe[17]

ClusterFlow

[20]

Cosmos

[21]

CromwellGalaxy[40]

GwfHive

Nextflow[18]

Pype

Flow

Rabix[22]

Toil

uap

Mod

ular

ndash

Customizab

le

ndash

Flexible

ndash

Failurerecovery

ndashndash

ndash

ndashndash

ndash

Reprod

ucibility

Dep

ende

ncies

ndashndash

ndash

ndashndash

ndash

ndashndash

Con

sisten

cy

Linkingcode

andresult

ndashndash

ndashndash

ndashndash

ndash

ndashndash

ndash

ndashndash

Logg

ing

ndashndash

Dataauthen

ticity

ndash

ndashndash

ndash

ndash

ndashndash

ndashndash

ndashndash

Dataintegrity

ndash

ndashndash

ndash

ndashndash

ndashndash

ndash

Supp

platforms

Cluster

Docker

ndashndash

ndashndash

ndash

ndash

Cloud

ndash

ndash

ndash

ndash

Local

ndash

Toolswereselected

from

httpsgith

ubcom

pditommasoaw

esom

e-pipe

lineretainingon

lytoolsthatwereconsidered

tobe

activelyde

velope

dby

atleastfivecontrib

utors(latestcommitafter3

1012018)arelicen

sedas

open

sourcew

herethescop

eof

applicationisclearly

onbioinformaticsandwhich

supp

ortclusterba

tchsystem

sModularWorkflowsareassembled

from

reusab

lestep

sCustomizab

leCon

figurationof

theanalysisissepa

ratedfro

mthe

code

defin

ingthestep

sFlex

ibleW

orkflowscaneasilybe

alteredwith

outp

rogram

mingskillsFa

ilure

reco

veryFailuredu

ringexecutionisde

tected

andsubseq

uent

analyses

halte

dto

avoidcorrup

tedresultsRep

roducibility

-Dep

enden

ciesW

MSen

forces

thatde

pend

encies

betw

eenanalysisstep

sandinterm

ediate

results

arecorrectly

maintaine

dConsisten

cyW

MSsafegu

ards

thatanalysisstep

saresuccessfullycompleted

priortoexecutionof

subseq

uent

step

sLinkingco

dean

dresu

ltW

MSen

suresconsistencybe

tweenthecode

defin

ingtheanalysisandthecurren

tlyavailableresultsLoggingW

MSlogs

Stdo

utStderrexitstatustoo

lversion

sWMSversionexecuted

commandsexecutio

ndateand

in-ou

tput

files

(see

correspo

ndingAdd

ition

alfile1Tables

S1andS2

ford

etails)Dataau

then

ticity

WMSrecordsinform

ationab

outcreator

andcreatin

gprocess(es)o

fthe

dataD

ataintegrityW

MS

recordsinform

ationthatallowsto

verifytheintegrity

ofcreateddata(eghashsums)Suppp

latform

s-C

lusterW

MSsupp

ortscompu

teclusterm

anagem

entsystemstonrunjobsD

ockerW

MScanrunjobs

usingDockerimages

CloudW

MScanbe

deployed

inacompu

teclou

dSu

ppC

WLWMSsupp

ortscommon

workflowlang

uage

LocalWMScanbe

executed

locally

with

outd

epen

ding

onaclusterm

anagem

entsystem

ora(web

)serverResults

are

givenas

(not

met)(p

artially

met)(f

ullymet)and

ndash(not

stated

)Ratin

gsareba

sedon

inform

ationprovided

inthepa

persdocum

entatio

nsandmanualsA

morede

tailedcompa

rison

ispresen

tedinAdd

ition

alfile1TableS1

Kaumlmpf et al BMC Bioinformatics (2019) 20664 Page 5 of 9

of the configuration at this point Hence an executed runis documented with the releases of all used software andthe invoked command line with all parameter settings Inaddition memory and CPU usage of each process check-sums of result files as well as the last kB of stdout andstderr output are reported The annotation file is storednext to the result files of a run

Process flow Initially uap reads the configuration gen-erates the respective DAG and defines all commands andoutput file names Throughout this initiation process uapinspects the planned analysis for potential errors Thegraph is tested to be acyclic all required tools (in definedreleases) are tested for their availability and the status ofall steps is determined This initiation phase is executedearly ie before submitting runs to a compute cluster uapthus implements a failing fast technique This is an impor-tant feature when working with large amounts of data on

HPC systems where software is dynamically loaded anderroneous configurations might otherwise only becomeapparent after hours of computation Figure 2 illustratesuaprsquos process flow error reporting and the link betweenconfiguration and result filesSubsequently uap can start runs display the com-

mands of runs show the state of the runs and ren-der execution graphs Execution graphs are useful toolsto inspect the performance of an analysis eg to iden-tify resource bottlenecks in a pipeline of commandsAdditional file 1 Figure S2 shows such an execution graphFigure 3 provides an overview of the main principles uapis built on

Resultsuap is a workflow management system dedicated to dataconsistency and adoption of a Reproducible Researchparadigm in HTS data analysis uap runs on UNIX-

Fig 2 uaprsquos process flow error reporting and the link between the analysis code and result files uap implements a failing fast approach the DAGis built from the configuration file tested to be acyclic all required tools are tested for their availability and the status of all steps is determinedSubsequently uap can start runs display the commands of runs show the state of the runs and render execution graphs Runs are executed intemporary directories and monitored throughout execution Result files are only generated at their final location if all processes of a run exitedgracefully and all expected output files exist Analysis code and resulting data are tightly linked by hashing over the complete sequence ofcommands and parameters of a run and appending the key to the output path Each run generates an annotation file in YAML format that capturesthe configuration software versions and releases the invoked command line all parameters memory and CPU usage of each process involvedchecksums of the result files as well as the last kB of stdout and stderr

Kaumlmpf et al BMC Bioinformatics (2019) 20664 Page 6 of 9

Fig 3 Sketch of the main principles uap is built on a An analysis with uap comprises three parts (i) the uap source code itself implemented inPython - it contains the complete framework of uap and 2 classes for the implementation of source and processing steps These classes are used towrap any tool that is part of the analysis enabling an easy extension of the uaprsquos repertoire of steps (ii) the uap configuration in YAML format Itcontains all necessary information to run and reproduce the analysis given the data and (iii) the uap results - organized in one folder per step inthe output directory The special folder temp contains the expected results until the computation of the step has finished successfully and keepsthe intermediate results and log files upon failure b The progress or state of an analysis can be monitored with a call to uap status Itdetermines the state of each individual step in dependence of the state(s) of its previous step(s) and provides this information to the user

like operating systems and can interact with batchqueuing systems like the SunOracleUniva grid engines(SGEOGEUGE) and SLURM [23] to submit analyses tohigh performance computing systems uap is distributedunder the GNU GPL v3 license and is publicly avail-able at httpsgithubcomyigbtuap Its documentationis hosted at httpuapreadthedocsorg A docker con-tainer with a core set of tools is available at httpshubdockercomryigbtuaptags

uap is distributed with predefined workflows for (i)genome sequence download and index generation forread mapping programs (ii) transcriptome sequencing(RNA-seq) data analysis and (iii) Chromatin Immuno-Precipitation DNA-Sequencing (ChIP-Seq) data analysisFurther we provide small test data sets enabling a quickstart for each of these workflows An additional exampleusing a larger data set is provided via code for download-ing and analyzing a publicly available ChIP-Seq data set

Kaumlmpf et al BMC Bioinformatics (2019) 20664 Page 7 of 9

(Barski et al [24]) The provided workflows are intendedto serve as an easy entry point into a uap analysis as wellas a template for similar analyses eg for other species orwith another set of tools

Preparing genomic data for HTS analysis An impor-tant prerequisite for HTS projects where a referencegenome is available is aligning (mapping) the sequencingreads to this genome Most mapping software requires aspecific data structure (index) of the genome to efficientlysolve this alignment problem Indices have to be gener-ated once prior to any mapping procedure We provideuap configuration files for the bwa [25] bowtie [26]and segemehl [27] mapping programs and samtools(fasta indexing) of the a) Mycoplasma genitaliumand b) Homo sapiens genome Genomic sequencescan be downloaded automatically prior to indexgeneration

Transcriptome assembly from RNA-Seq data Tran-scriptome sequencing identifies and estimates the quan-tity of RNA in biological samples Beyond quantification ofknown transcripts based on overlapping sequence readsRNA-seq allows the assembly of novel transcripts Weprovide a uap configuration file for combining split-readmapping with de novo transcript assembly uap readsthe sequencing data either from an Illumina sequencingrun folder or a set of fastq files applies quality con-trol removes adapter sequences and maps the reads toa genome using tophat2 [28] and segemehl [27 29]The mapped reads from tophat2 are directly processedby cufflinks [30] for de novo transcript assemblySplit-reads mapped with segemehl are prepared forcufflinks using an adapter script and then also pro-cessed with cufflinks The configuration also containsa step to determine the number of mapped reads pertranscript applying htseq-count [31]

Identification of enriched regions from ChIP-Seqdata ChIP sequencing is a method that integrates chro-matin immunoprecipitation (ChIP) and high-throughputDNA sequencing to identify sites of protein-DNA inter-actions The provided uap-configuration file for this taskinitially resembles the RNA-Seq workflow Here readsare expected to correspond to genomic DNA and themapping is done without considering split-reads usingbowtie2 [26] Mapped reads are subsequently sortedduplicates removed and enriched regions are detectedusing MACS2 [32]

ConclusionsThe critique on a lack of reproducibility in science anda growing awareness that many reported facts do not

seem to hold up to repeated investigation has mean-while reached a broader audience beyond the scien-tific community (eg [33ndash35]) In our opinion HTSdata analysis is particularly prone to consistency andreproducibility issues ndash especially due to the complex-ity of the analysis the involved data volume and thebroad range of available tools and their multitude ofparametersWorkflow management systems are indispensable for

reproducibly controlling more complex analyses like HTSdata analysis Published workflow management systemsfor HTS data analysis are either highly flexible and madefor experienced programmers or lack a lot of flexibil-ity but can be used intuitively and some are specific toa certain type of analysis In the introduction we listedfour minimal requirements that we consider essential forensuring reproducibility and consistency of HTS dataanalyses Additionally but beyond the programmatic con-trol of a of a WMS using versioned external data likegenome sequences or annotation is key for reproducibleanalysis None of the published systems we are awareof however completely satisfies these criteria uap hasbeen designed to fulfill these One critical requirement islinking analysis code and generated data While Repro-ducible Research in statistics [36] uses tools like Sweave[37] knitR [38] or Jupyter [39] to combine analysis codeand resulting data in one output file such a strategy isnot feasible for most steps of an HTS analysis due tothe size of generated data uap therefore relies on hash-ing over the complete sequence of commands includingparameters of a run and appending the key to the outputpath In addition uap performs logging and process mon-itoring supports different cluster management systemscreates recovery points plots execution graphs managesjob dependencies and is extensible to any kind of multi-step analysis It provides pre-built steps for the prepara-tion of genomic data and the analysis of RNA-Seq andChIP-Seq dataAmong the many different flavors of WMS uap is

clearly easier to operate for users with limited program-ming experience than systems based on domain specificprogramming languages while offering a lot more flexi-bility than single purpose tools Based on the comparisonof tools in Additional file 1 Table S1 the tools most ded-icated to reproducible research and providing the mostsimilar set of features compared to uap are Galaxy andNextflowIn our opinion Galaxy [40] and uap address dif-

ferent user groups and tasks and we use both WMSin our research environment Galaxy is great for pro-viding predefined workflows to users without experi-ence in programming and working on the commandline It allows these users to adapt parameters and exe-cute such workflows on their data For users working

Kaumlmpf et al BMC Bioinformatics (2019) 20664 Page 8 of 9

frequently with large HTS data sets adapting work-flows to a larger extent or duplicating branches of theDAG for performing variants of the analysis in paral-lel is much more efficiently performed in uap Galaxydoes not link data and code in the sense of uap Butas any change to parameters in a Galaxy workflowtriggers re-execution of the sub-workflow below this isnot necessary If many changes throughout a workflowhave to be made this behavior of Galaxy may be hin-dering Obviously running HTS analyses on Galaxyrequires a Galaxy server integrated with an HPC envi-ronment which is not trivial to set up and demandscontinuous maintenance Starting from scratch settingup HTS analysis is significantly less effort using uapthan GalaxyNextflow is a powerful WMS dedicated to scalability

and reproducibility [18] Its approach to reproducibil-ity relies on a tight integration with github and thesupport of scalable containerization of pipelines usingeg Docker Nextflow and uap share several con-cepts eg using temporary files for intermediate resultsor analyzing the workflow DAG to enable failing fastIntermediate results of an HTS analysis can consumelarge amounts of storage space uap therefore providesa means for volatilization of intermediate results with-out breaking dependencies in the DAG - a feature whichdoes not seem to be available in Nextflow Loggingis somewhat limited in Nextflow compared to uapbut Nextflow provides a broader support for HPCenvironments including also support for cloud comput-ing Nextflowrsquos approach to reproducibility is pow-erful when software from Github is used as it enablesthe user to request a specific commit or when thetools used are publicly available as a container How-ever when a tool is run in rsquonative task supportrsquo likethe Kallisto example provided in [18] uap is morestringent in logging version and the full set (includingdefault) of parameters Where uap and Nextflow dif-fer most clearly regarding reproducibility is linking dataand code as to our understanding based on publica-tions and the online documentation this is not availablein NextflowIn summary we are convinced that reproducible

research principles need to be advanced for HTS dataanalysis and that uap is a highly useful system for facingthis challenge

Availability and requirementsProject name uapProject home page httpsgithubcomyigbtuapOperating system(s) LinuxProgramming language PythonOther requirements virtualenv git and graphviz

License GNU GPL v3Any restrictions to use by non-academics None

Supplementary informationSupplementary information accompanies this paper athttpsdoiorg101186s12859-019-3219-1

Additional file 1 This file includes supplemental tables and figures

AbbreviationsChIP-seq Chromatin immuno-precipitation dna-sequencing DAG Directedacyclic graph HTS High throughput sequencing HPC High performancecomputing RNA-seq Transcriptome sequencing WMS Workflowmanagement system

AcknowledgmentsNot applicable

Authorsrsquo contributionsKR MS and JH initially designed and implemented the software SHP CK ASKR JS and JH took over the further development of the software GDimplemented the script that formats segemehl-output to be compatiblewith the cufflinks suite of tools CK and AS provided the dockercontainers of uap CK KR JS and JH wrote the manuscript All authors readcorrected and approved the final manuscript

FundingThis work was supported in part by the Initiative and Networking Fund of theHelmholtz Association through a grant to JH (VH-NG738) by the DeutscheForschungsgemeinschaft (DFG) through a grant to JH (SPP1738) by theFraunhofer Future Fund and through the LIFE Leipzig Research Center forCivilization Diseases The funding bodies did not have any influence on thedesign of the study nor on collection analysis or interpretation of data nor onwriting the manuscript

Availability of data andmaterialsSource code is available at httpsgithubcomyigbtuap links to Dockercontainers and documentation are provided at httpswwwufzdeindexphpen=44919 All datasets used within the example workflows are publiclyavailable and are fully referenced within the workflow configuration files

Ethics approval and consent to participateNot applicable

Consent for publicationNot applicable

Competing interestsThe authors declare that they have no competing interests KR is member ofthe editorial board (Associate Editor) of BMC Bioinformatics

Author details1Young Investigators Group Bioinformatics and Transcriptomics DepartmentMolecular Systems Biology Helmholtz Center for Environmental Research ndashUFZ Permoserstraszlige 15 04318 Leipzig Germany 2Bioinformatics DepartmentUniversitaumlt Leipzig Haumlrtelstraszlige 16-18 04107 Leipzig Germany3Bioinformatics Unit Department of Diagnostics Fraunhofer Institute for CellTherapy and Immunology Perlickstraszlige 1 04103 Leipzig Germany 4Presentaddress Department of Diagnostics Fraunhofer Institute for Cell Therapy andImmunology Perlickstraszlige 1 04103 Leipzig Germany 5Present address ecSeqBioinformatics GmbH Sternwartenstraszlige 29 04103 Leipzig Germany

Received 11 June 2019 Accepted 12 November 2019

References1 Nekrutenko A Taylor J Next-generation sequencing data interpretation

enhancing reproducibility and accessibility Nat Rev Genet 201213667ndash72

Kaumlmpf et al BMC Bioinformatics (2019) 20664 Page 9 of 9

2 Peng RD Reproducible research in computational science Sci N Y20113341226ndash7

3 Bustin SA The reproducibility of biomedical research Sleepers awakeBiomol Detect Quantif 2014235ndash42

4 Baker M 1500scientists lift the lid on reproducibility Nature 2016533452ndash45 Sandve GK Nekrutenko A Taylor J Hovig E Ten simple rules for

reproducible computational research PLoS Comput Biol 2013910032856 Gruumlning B Chilton J Koumlster J Dale R Soranzo N van den Beek M

Goecks J Backofen R Nekrutenko A Taylor J Practical computationalreproducibility in the life sciences Cell Syst 20186631ndash5

7 Goodstadt L Ruffus A lightweight python library for computationalpipelines Bioinformatics 2010262778ndash9

8 Koumlster J Rahmann S Snakemake-a scalable bioinformatics workflowengine Bioinformatics 201228(19)2520ndash2

9 Fonseca Na Petryszak R Marioni J iRAP - an integrated RNA-seq AnalysisPipeline iRAP - an integrated RNA-seq Analysis Pipeline 20140ndash10httpsdoiorg101101005991

10 Wang Y Mehta G Mayani R Lu J Souaiaia T Chen Y Clark A Yoon HJWan L EvgrafovOV Knowles JA Deelman E Chen T RseqFlow Workflowsfor RNA-Seq data analysis Bioinformatics 201127(18)2598ndash600

11 Kalari KR Nair Aa Bhavsar JD O Brien DR Davila JI Bockol Ma Nie JTang X Baheti S Doughty JB Middha S Sicotte H Thompson AEAsmann YW Kocher J-Pa MAP-RSeq Mayo Analysis Pipeline for RNAsequencing BMC Bioinformatics 201415(1)224

12 Goecks J Nekrutenko A Taylor J Galaxy a comprehensive approach forsupporting accessible reproducible and transparent computationalresearch in the life sciences Genome Biol 20101186

13 Golosova O Henderson R Vaskin Y Gabrielian A Grekhov G NagarajanV Oler AJ Quintildeones M Hurt D Fursov M Huyen Y Unipro UGENE NGSpipelines and components for variant calling RNA-seq and ChIP-seq dataanalyses PeerJ 20142644

14 Berthold MR Cebron N Dill F Gabriel TR Koumltter T Meinl T Ohl P SiebC Thiel K Wiswedel B KNIME The Konstanz Information Miner InStudies in Classification Data Analysis and Knowledge Organization (GfKL2007) New York Springer 2007

15 Wolstencroft K Haines R Fellows D Williams A Withers D Owen SSoiland-Reyes S Dunlop I Nenadic A Fisher P Bhagat J Belhajjame KBacall F Hardisty A Nieva de la Hidalga A Balcazar Vargas MP Sufi SGoble C The Taverna workflow suite designing and executing workflowsof Web Services on the desktop web or in the cloud Nucleic acidsresearch 201341(Web Server issue)557ndash61

16 Guimera RV bcbio-nextgen Automated distributed next-gensequencing pipeline EMBnet J 20121730

17 Sadedin SP Pope B Oshlack A Bpipe a tool for running and managingbioinformatics pipelines Bioinformatics (Oxford Engl) 201228(11)1525ndash6

18 Di Tommaso P Chatzou M Floden EW Barja PP Palumbo ENotredame C Nextflow enables reproducible computational workflowsNat Biotechnol 201735316ndash9

19 Cingolani P Sladek R Blanchette M BigDataScript a scripting languagefor data pipelines Bioinformatics 201431(1)10ndash16

20 Ewels P Krueger F Kaumlller M Andrews S Cluster flow A user-friendlybioinformatics workflow tool F1000Research 201652824

21 Gafni E Luquette LJ Lancaster AK Hawkins JB Jung J-Y Souilmi Y WallDP Tonellato PJ Cosmos Python library for massively parallel workflowsBioinformatics 2014302956ndash8

22 Kaushik G Ivkovic S Simonovic J Tijanic N Davis-Dusenbery B Kural DRabix An open-source workflow executor supporting recomputabilityand interoperability of workflow descriptions Pac Symp Biocomput201722154ndash65

23 Yoo AB Jette MA Grondona M SLURM Simple Linux Utility for ResourceManagement In Feitelson D Rudolph L Schwiegelshohn U editorsJob Scheduling Strategies for Parallel Processing 9th InternationalWorkshop JSSPP 2003 Seattle WA USA June 24 2003 Revised PaperBerlin Springer 2003 p 44ndash60

24 Barski A Cuddapah S Cui K Roh T-Y Schones DE Wang Z Wei GChepelev I Zhao K High-resolution profiling of histone methylations inthe human genome Cell 2007129(4)823ndash37

25 Li H Durbin R Fastandaccurate long-read alignment with Burrows-Wheelertransform Bioinformatics (Oxford Engl) 201026(5)589ndash95

26 Langmead B Salzberg SL Fast gapped-read alignment with Bowtie 2Nat Methods 20129(4)357ndash9

27 Hoffmann S Otto C Kurtz S Sharma CM Khaitovich P Vogel J StadlerPF Hackermuumlller J Fast mapping of short sequences with mismatchesinsertions and deletions using index structures PLoS Comput Biol20095(9)1000502

28 Kim D Pertea G Trapnell C Pimentel H Kelley R Salzberg SL TopHat2accurate alignment of transcriptomes in the presence of insertionsdeletions and gene fusions Genome Biol 201314(4)36

29 Hoffmann S Otto C Doose G Tanzer A Langenberger D Christ S KunzM Holdt LM Teupser D Hackermuumlller J Stadler PF A multi-splitmapping algorithm for circular RNA splicing trans-splicing and fusiondetection Genome Biol 201415(2)34

30 Trapnell C Williams Ba Pertea G Mortazavi A Kwan G van Baren MJSalzberg SL Wold BJ Pachter L Transcript assembly and abundanceestimation from RNA-Seq reveals thousands of new transcripts andswitching among isoforms Nat Biotechnol 201128(5)511ndash5

31 Anders S Pyl PT Huber W HTSeq-A Python framework to work withhigh-throughput sequencing data Bioinformatics 201531(2)166ndash9

32 Zhang Y Liu T Meyer Ca Eeckhoute J Johnson DS Bernstein BENusbaum C Myers RM Brown M Li W Liu XS Model-based analysis ofChIP-Seq (MACS) Genome Biol 20089(9)137

33 Fidler F Gordon A Science is in a reproducibility crisis ndash how do weresolve it Conversation 2013 Available at httptheconversationcomscience-is-in-a-reproducibility-crisis-how-do-we-resolve-it-16998Accessed 11 Jun 2019

34 Lehrer J The truth wears off New Yorker 2010Dec 1352ndash735 Van Bavel J Why do so many studies fail to replicate N Y Times 201610

Available at httpswwwnytimescom20160529opinionsundaywhy-do-so-many-studies-fail-to-replicatehtml Accessed 11 Jun 2019

36 Fomel S Claerbout JF Guest Editorsrsquo Introduction ReproducibleResearch Comput Sci Eng 200911(1)5ndash7

37 Leisch F Sweave Dynamic generation of statistical reports using literatedata analysis In Haumlrdle W Roumlnz B editors Compstat 2002mdashProceedings in Computational Statistics Heidelberg Physica Verlag 2002p 575ndash80 ISBN 3-7908-1517-9 httpwwwstatuni-muenchendetexttildelowleischSweave

38 Xie Y Implementing Reproducible Computational Research In StoddenV Leisch F Peng RD editors Boca Raton Chapman and HallCRC 2014ISBN 978-1466561595 httpwwwcrcpresscomproductisbn9781466561595

39 Peacuterez F Granger BE IPython a system for interactive scientificcomputing Comput Sci Eng 20079(3)21ndash9

40 Afgan E Baker D Batut B van den Beek M Bouvier D Cech M ChiltonJ Clements D Coraor N Gruumlning BA Guerler A Hillman-Jackson JHiltemann S Jalili V Rasche H Soranzo N Goecks J Taylor JNekrutenko A Blankenberg D The galaxy platform for accessiblereproducible and collaborative biomedical analyses 2018 update NucleicAcids Res 201846537ndash44

Publisherrsquos NoteSpringer Nature remains neutral with regard to jurisdictional claims inpublished maps and institutional affiliations

  • Abstract
    • Background
    • Results
    • Conclusions
    • Keywords
      • Background
      • Implementation
        • Analysis as a directed acyclic graph
          • Plug-in architecture
            • Enforcing consistency and integrity
            • Maintaining reproducibility
            • Process flow
              • Results
                • Preparing genomic data for HTS analysis
                  • Transcriptome assembly from RNA-Seq data
                    • Identification of enriched regions from ChIP-Seq data
                      • Conclusions
                      • Availability and requirements
                      • Supplementary informationSupplementary information accompanies this paper at httpsdoiorg101186s12859-019-3219-1
                        • Additional file 1
                          • Abbreviations
                          • Acknowledgments
                          • Authors contributions
                          • Funding
                          • Availability of data and materials
                          • Ethics approval and consent to participate
                          • Consent for publication
                          • Competing interests
                          • Author details
                          • References
                          • Publishers Note
Page 3: SOFTWARE OpenAccess uap:reproducibleandrobustHTSdata …

Kaumlmpf et al BMC Bioinformatics (2019) 20664 Page 3 of 9

our knowledge none of the published WMS satisfies allfour criteria that we defined for maintaining a minimaldegree of reproducible research in HTS data analysisWe compared the essential features particularly regard-ing reproducibility of a set of actively maintained flexibleand modular WMS in more detail (Table 1 Additional file1 Table S1)Here we introduce the workflow management system

uap (Universal Analysis Pipeline) that may be used toimplement any DAG-like data analysis workflow but isprimarily aimed at HTS data analysis It is constructedto execute control and keep track of the analysis oflarge data sets uap encapsulates the usage of (not nec-essarily bioinformatic) tools and handles data flow andprocessing of the complete analysis Produced data istightly linked with the code specifying the analysis Thusit enables users to perform reproducible robust andconsistent data analyses We provide complete work-flows for handling genomic data and analyzing RNA-Seqand Chromatin Immuno-Precipitation DNA-Sequencing(ChIP-seq) data which can be used as templates thatallow for easy customization As we are also integratingsteps for downloading published sequencing raw data (egfrom SRA) uap enables users to efficiently reproduce thedata analysis of published studies The provided work-flows have been optimized for minimal IO load on highperformance computing (HPC) environments Althoughinitially designed for HTS data analysis the plugin archi-tecture of uap allows for the expansion to any kind of dataanalysis

Implementationuap is a workflow management software (WMS) imple-mented in Python It provides user-friendly access to arange of bioinformatic analyses of large datasets such ashigh throughput sequencing data Each analysis is com-pletely described by an individual configuration file inYAML1 format The steps of the analysis and their depen-dencies as well as the required tools parametrization andlocations in the file system are specified there Based onthese settings uap constructs a directed acyclic graph(DAG) that represents the workflow of the analysis

Analysis as a directed acyclic graph The DAG repre-sents single analysis steps as nodes and pairwise depen-dencies between steps as directed edges A step is ablueprint for a particular analysis with defined input andoutput data The passing of input data to a step and thegeneration of a particular type of output data is mod-eled as connections They control the data flow betweensteps by grouping output files and providing them todown-stream steps uap distinguishes between source

1httpyamlorg

and processing steps Source steps emit data into the work-flow and hand the files over to downstream processingsteps via output connections The user is free to categorizethe input data files for a workflow into user specifiedgroups to create separate output locations for each cate-gory Processing steps on the other hand receive data fromupstream steps via input connections define a sequenceof execution commands and assign output file locationsfor each input connection The entirety of these configu-rations of a step for a particular set of input connectionsis called a run It can be interpreted as an instance of astep and is the atomic unit of the analysis Additional file1 Figure S1 shows the DAG including its runs renderedby uap based on the configuration file for the analysis ofa published data set

Plug-in architecture Steps encapsulate the usage of atool in a single python class which allows users to easilycustomize uap by adding steps Every new step inheritsfrom a super class defines incoming and outgoing con-nections the required tools and has to implement theruns() method A step can individually be optimizedfor efficient use of CPU and memory usage For allowinga flexible accommodation to different high performancecomputing environments uap supports a step-specificadaptation of the environment eg for setting variablesor automatic loading and unloading of software modulesAdditional file 1 Figure S3 sketches how new processingsteps are defined

Enforcing consistency and integrity When computingon large data sets partial processing of large files due topremature termination of tools may remain undetectedwithout stringent monitoring of processes and poses asevere threat to data integrity In uap runs are there-fore executed in a temporary directory and monitoredthroughout execution The overall workflow is not com-promised in case a single run fails Result files are onlymoved to their final location if all processes of a run exitedgracefully and all expected output files existuap automatically re-schedules runs if it detects failed

processes or missing files Also changes in the configu-ration trigger re-scheduling of the affected runs and alldependent runs in the DAG

Maintaining reproducibility uap tightly links analysiscode and resulting data by hashing over the completesequence of commands including parameter specifica-tions of a run and appending the key to the output pathThus any changes to the analysis code alter the expectedoutput location which allows uap to check whether anal-ysis code and output correspond to each otherAt execution time an annotation file in YAML format is

captured for each run that contains the complete content

Kaumlmpf et al BMC Bioinformatics (2019) 20664 Page 4 of 9

Table

1Com

parison

ofworkflowmanagem

entsystemsandHTS

analysispipe

lines

basedon

essentialfeaturespa

rticularlyregardingreprod

ucibility

Arvados

bcbio-ne

xtge

n[16]

BigD

ataScript[19]Bp

ipe[17]

ClusterFlow

[20]

Cosmos

[21]

CromwellGalaxy[40]

GwfHive

Nextflow[18]

Pype

Flow

Rabix[22]

Toil

uap

Mod

ular

ndash

Customizab

le

ndash

Flexible

ndash

Failurerecovery

ndashndash

ndash

ndashndash

ndash

Reprod

ucibility

Dep

ende

ncies

ndashndash

ndash

ndashndash

ndash

ndashndash

Con

sisten

cy

Linkingcode

andresult

ndashndash

ndashndash

ndashndash

ndash

ndashndash

ndash

ndashndash

Logg

ing

ndashndash

Dataauthen

ticity

ndash

ndashndash

ndash

ndash

ndashndash

ndashndash

ndashndash

Dataintegrity

ndash

ndashndash

ndash

ndashndash

ndashndash

ndash

Supp

platforms

Cluster

Docker

ndashndash

ndashndash

ndash

ndash

Cloud

ndash

ndash

ndash

ndash

Local

ndash

Toolswereselected

from

httpsgith

ubcom

pditommasoaw

esom

e-pipe

lineretainingon

lytoolsthatwereconsidered

tobe

activelyde

velope

dby

atleastfivecontrib

utors(latestcommitafter3

1012018)arelicen

sedas

open

sourcew

herethescop

eof

applicationisclearly

onbioinformaticsandwhich

supp

ortclusterba

tchsystem

sModularWorkflowsareassembled

from

reusab

lestep

sCustomizab

leCon

figurationof

theanalysisissepa

ratedfro

mthe

code

defin

ingthestep

sFlex

ibleW

orkflowscaneasilybe

alteredwith

outp

rogram

mingskillsFa

ilure

reco

veryFailuredu

ringexecutionisde

tected

andsubseq

uent

analyses

halte

dto

avoidcorrup

tedresultsRep

roducibility

-Dep

enden

ciesW

MSen

forces

thatde

pend

encies

betw

eenanalysisstep

sandinterm

ediate

results

arecorrectly

maintaine

dConsisten

cyW

MSsafegu

ards

thatanalysisstep

saresuccessfullycompleted

priortoexecutionof

subseq

uent

step

sLinkingco

dean

dresu

ltW

MSen

suresconsistencybe

tweenthecode

defin

ingtheanalysisandthecurren

tlyavailableresultsLoggingW

MSlogs

Stdo

utStderrexitstatustoo

lversion

sWMSversionexecuted

commandsexecutio

ndateand

in-ou

tput

files

(see

correspo

ndingAdd

ition

alfile1Tables

S1andS2

ford

etails)Dataau

then

ticity

WMSrecordsinform

ationab

outcreator

andcreatin

gprocess(es)o

fthe

dataD

ataintegrityW

MS

recordsinform

ationthatallowsto

verifytheintegrity

ofcreateddata(eghashsums)Suppp

latform

s-C

lusterW

MSsupp

ortscompu

teclusterm

anagem

entsystemstonrunjobsD

ockerW

MScanrunjobs

usingDockerimages

CloudW

MScanbe

deployed

inacompu

teclou

dSu

ppC

WLWMSsupp

ortscommon

workflowlang

uage

LocalWMScanbe

executed

locally

with

outd

epen

ding

onaclusterm

anagem

entsystem

ora(web

)serverResults

are

givenas

(not

met)(p

artially

met)(f

ullymet)and

ndash(not

stated

)Ratin

gsareba

sedon

inform

ationprovided

inthepa

persdocum

entatio

nsandmanualsA

morede

tailedcompa

rison

ispresen

tedinAdd

ition

alfile1TableS1

Kaumlmpf et al BMC Bioinformatics (2019) 20664 Page 5 of 9

of the configuration at this point Hence an executed runis documented with the releases of all used software andthe invoked command line with all parameter settings Inaddition memory and CPU usage of each process check-sums of result files as well as the last kB of stdout andstderr output are reported The annotation file is storednext to the result files of a run

Process flow Initially uap reads the configuration gen-erates the respective DAG and defines all commands andoutput file names Throughout this initiation process uapinspects the planned analysis for potential errors Thegraph is tested to be acyclic all required tools (in definedreleases) are tested for their availability and the status ofall steps is determined This initiation phase is executedearly ie before submitting runs to a compute cluster uapthus implements a failing fast technique This is an impor-tant feature when working with large amounts of data on

HPC systems where software is dynamically loaded anderroneous configurations might otherwise only becomeapparent after hours of computation Figure 2 illustratesuaprsquos process flow error reporting and the link betweenconfiguration and result filesSubsequently uap can start runs display the com-

mands of runs show the state of the runs and ren-der execution graphs Execution graphs are useful toolsto inspect the performance of an analysis eg to iden-tify resource bottlenecks in a pipeline of commandsAdditional file 1 Figure S2 shows such an execution graphFigure 3 provides an overview of the main principles uapis built on

Resultsuap is a workflow management system dedicated to dataconsistency and adoption of a Reproducible Researchparadigm in HTS data analysis uap runs on UNIX-

Fig 2 uaprsquos process flow error reporting and the link between the analysis code and result files uap implements a failing fast approach the DAGis built from the configuration file tested to be acyclic all required tools are tested for their availability and the status of all steps is determinedSubsequently uap can start runs display the commands of runs show the state of the runs and render execution graphs Runs are executed intemporary directories and monitored throughout execution Result files are only generated at their final location if all processes of a run exitedgracefully and all expected output files exist Analysis code and resulting data are tightly linked by hashing over the complete sequence ofcommands and parameters of a run and appending the key to the output path Each run generates an annotation file in YAML format that capturesthe configuration software versions and releases the invoked command line all parameters memory and CPU usage of each process involvedchecksums of the result files as well as the last kB of stdout and stderr

Kaumlmpf et al BMC Bioinformatics (2019) 20664 Page 6 of 9

Fig 3 Sketch of the main principles uap is built on a An analysis with uap comprises three parts (i) the uap source code itself implemented inPython - it contains the complete framework of uap and 2 classes for the implementation of source and processing steps These classes are used towrap any tool that is part of the analysis enabling an easy extension of the uaprsquos repertoire of steps (ii) the uap configuration in YAML format Itcontains all necessary information to run and reproduce the analysis given the data and (iii) the uap results - organized in one folder per step inthe output directory The special folder temp contains the expected results until the computation of the step has finished successfully and keepsthe intermediate results and log files upon failure b The progress or state of an analysis can be monitored with a call to uap status Itdetermines the state of each individual step in dependence of the state(s) of its previous step(s) and provides this information to the user

like operating systems and can interact with batchqueuing systems like the SunOracleUniva grid engines(SGEOGEUGE) and SLURM [23] to submit analyses tohigh performance computing systems uap is distributedunder the GNU GPL v3 license and is publicly avail-able at httpsgithubcomyigbtuap Its documentationis hosted at httpuapreadthedocsorg A docker con-tainer with a core set of tools is available at httpshubdockercomryigbtuaptags

uap is distributed with predefined workflows for (i)genome sequence download and index generation forread mapping programs (ii) transcriptome sequencing(RNA-seq) data analysis and (iii) Chromatin Immuno-Precipitation DNA-Sequencing (ChIP-Seq) data analysisFurther we provide small test data sets enabling a quickstart for each of these workflows An additional exampleusing a larger data set is provided via code for download-ing and analyzing a publicly available ChIP-Seq data set

Kaumlmpf et al BMC Bioinformatics (2019) 20664 Page 7 of 9

(Barski et al [24]) The provided workflows are intendedto serve as an easy entry point into a uap analysis as wellas a template for similar analyses eg for other species orwith another set of tools

Preparing genomic data for HTS analysis An impor-tant prerequisite for HTS projects where a referencegenome is available is aligning (mapping) the sequencingreads to this genome Most mapping software requires aspecific data structure (index) of the genome to efficientlysolve this alignment problem Indices have to be gener-ated once prior to any mapping procedure We provideuap configuration files for the bwa [25] bowtie [26]and segemehl [27] mapping programs and samtools(fasta indexing) of the a) Mycoplasma genitaliumand b) Homo sapiens genome Genomic sequencescan be downloaded automatically prior to indexgeneration

Transcriptome assembly from RNA-Seq data Tran-scriptome sequencing identifies and estimates the quan-tity of RNA in biological samples Beyond quantification ofknown transcripts based on overlapping sequence readsRNA-seq allows the assembly of novel transcripts Weprovide a uap configuration file for combining split-readmapping with de novo transcript assembly uap readsthe sequencing data either from an Illumina sequencingrun folder or a set of fastq files applies quality con-trol removes adapter sequences and maps the reads toa genome using tophat2 [28] and segemehl [27 29]The mapped reads from tophat2 are directly processedby cufflinks [30] for de novo transcript assemblySplit-reads mapped with segemehl are prepared forcufflinks using an adapter script and then also pro-cessed with cufflinks The configuration also containsa step to determine the number of mapped reads pertranscript applying htseq-count [31]

Identification of enriched regions from ChIP-Seqdata ChIP sequencing is a method that integrates chro-matin immunoprecipitation (ChIP) and high-throughputDNA sequencing to identify sites of protein-DNA inter-actions The provided uap-configuration file for this taskinitially resembles the RNA-Seq workflow Here readsare expected to correspond to genomic DNA and themapping is done without considering split-reads usingbowtie2 [26] Mapped reads are subsequently sortedduplicates removed and enriched regions are detectedusing MACS2 [32]

ConclusionsThe critique on a lack of reproducibility in science anda growing awareness that many reported facts do not

seem to hold up to repeated investigation has mean-while reached a broader audience beyond the scien-tific community (eg [33ndash35]) In our opinion HTSdata analysis is particularly prone to consistency andreproducibility issues ndash especially due to the complex-ity of the analysis the involved data volume and thebroad range of available tools and their multitude ofparametersWorkflow management systems are indispensable for

reproducibly controlling more complex analyses like HTSdata analysis Published workflow management systemsfor HTS data analysis are either highly flexible and madefor experienced programmers or lack a lot of flexibil-ity but can be used intuitively and some are specific toa certain type of analysis In the introduction we listedfour minimal requirements that we consider essential forensuring reproducibility and consistency of HTS dataanalyses Additionally but beyond the programmatic con-trol of a of a WMS using versioned external data likegenome sequences or annotation is key for reproducibleanalysis None of the published systems we are awareof however completely satisfies these criteria uap hasbeen designed to fulfill these One critical requirement islinking analysis code and generated data While Repro-ducible Research in statistics [36] uses tools like Sweave[37] knitR [38] or Jupyter [39] to combine analysis codeand resulting data in one output file such a strategy isnot feasible for most steps of an HTS analysis due tothe size of generated data uap therefore relies on hash-ing over the complete sequence of commands includingparameters of a run and appending the key to the outputpath In addition uap performs logging and process mon-itoring supports different cluster management systemscreates recovery points plots execution graphs managesjob dependencies and is extensible to any kind of multi-step analysis It provides pre-built steps for the prepara-tion of genomic data and the analysis of RNA-Seq andChIP-Seq dataAmong the many different flavors of WMS uap is

clearly easier to operate for users with limited program-ming experience than systems based on domain specificprogramming languages while offering a lot more flexi-bility than single purpose tools Based on the comparisonof tools in Additional file 1 Table S1 the tools most ded-icated to reproducible research and providing the mostsimilar set of features compared to uap are Galaxy andNextflowIn our opinion Galaxy [40] and uap address dif-

ferent user groups and tasks and we use both WMSin our research environment Galaxy is great for pro-viding predefined workflows to users without experi-ence in programming and working on the commandline It allows these users to adapt parameters and exe-cute such workflows on their data For users working

Kaumlmpf et al BMC Bioinformatics (2019) 20664 Page 8 of 9

frequently with large HTS data sets adapting work-flows to a larger extent or duplicating branches of theDAG for performing variants of the analysis in paral-lel is much more efficiently performed in uap Galaxydoes not link data and code in the sense of uap Butas any change to parameters in a Galaxy workflowtriggers re-execution of the sub-workflow below this isnot necessary If many changes throughout a workflowhave to be made this behavior of Galaxy may be hin-dering Obviously running HTS analyses on Galaxyrequires a Galaxy server integrated with an HPC envi-ronment which is not trivial to set up and demandscontinuous maintenance Starting from scratch settingup HTS analysis is significantly less effort using uapthan GalaxyNextflow is a powerful WMS dedicated to scalability

and reproducibility [18] Its approach to reproducibil-ity relies on a tight integration with github and thesupport of scalable containerization of pipelines usingeg Docker Nextflow and uap share several con-cepts eg using temporary files for intermediate resultsor analyzing the workflow DAG to enable failing fastIntermediate results of an HTS analysis can consumelarge amounts of storage space uap therefore providesa means for volatilization of intermediate results with-out breaking dependencies in the DAG - a feature whichdoes not seem to be available in Nextflow Loggingis somewhat limited in Nextflow compared to uapbut Nextflow provides a broader support for HPCenvironments including also support for cloud comput-ing Nextflowrsquos approach to reproducibility is pow-erful when software from Github is used as it enablesthe user to request a specific commit or when thetools used are publicly available as a container How-ever when a tool is run in rsquonative task supportrsquo likethe Kallisto example provided in [18] uap is morestringent in logging version and the full set (includingdefault) of parameters Where uap and Nextflow dif-fer most clearly regarding reproducibility is linking dataand code as to our understanding based on publica-tions and the online documentation this is not availablein NextflowIn summary we are convinced that reproducible

research principles need to be advanced for HTS dataanalysis and that uap is a highly useful system for facingthis challenge

Availability and requirementsProject name uapProject home page httpsgithubcomyigbtuapOperating system(s) LinuxProgramming language PythonOther requirements virtualenv git and graphviz

License GNU GPL v3Any restrictions to use by non-academics None

Supplementary informationSupplementary information accompanies this paper athttpsdoiorg101186s12859-019-3219-1

Additional file 1 This file includes supplemental tables and figures

AbbreviationsChIP-seq Chromatin immuno-precipitation dna-sequencing DAG Directedacyclic graph HTS High throughput sequencing HPC High performancecomputing RNA-seq Transcriptome sequencing WMS Workflowmanagement system

AcknowledgmentsNot applicable

Authorsrsquo contributionsKR MS and JH initially designed and implemented the software SHP CK ASKR JS and JH took over the further development of the software GDimplemented the script that formats segemehl-output to be compatiblewith the cufflinks suite of tools CK and AS provided the dockercontainers of uap CK KR JS and JH wrote the manuscript All authors readcorrected and approved the final manuscript

FundingThis work was supported in part by the Initiative and Networking Fund of theHelmholtz Association through a grant to JH (VH-NG738) by the DeutscheForschungsgemeinschaft (DFG) through a grant to JH (SPP1738) by theFraunhofer Future Fund and through the LIFE Leipzig Research Center forCivilization Diseases The funding bodies did not have any influence on thedesign of the study nor on collection analysis or interpretation of data nor onwriting the manuscript

Availability of data andmaterialsSource code is available at httpsgithubcomyigbtuap links to Dockercontainers and documentation are provided at httpswwwufzdeindexphpen=44919 All datasets used within the example workflows are publiclyavailable and are fully referenced within the workflow configuration files

Ethics approval and consent to participateNot applicable

Consent for publicationNot applicable

Competing interestsThe authors declare that they have no competing interests KR is member ofthe editorial board (Associate Editor) of BMC Bioinformatics

Author details1Young Investigators Group Bioinformatics and Transcriptomics DepartmentMolecular Systems Biology Helmholtz Center for Environmental Research ndashUFZ Permoserstraszlige 15 04318 Leipzig Germany 2Bioinformatics DepartmentUniversitaumlt Leipzig Haumlrtelstraszlige 16-18 04107 Leipzig Germany3Bioinformatics Unit Department of Diagnostics Fraunhofer Institute for CellTherapy and Immunology Perlickstraszlige 1 04103 Leipzig Germany 4Presentaddress Department of Diagnostics Fraunhofer Institute for Cell Therapy andImmunology Perlickstraszlige 1 04103 Leipzig Germany 5Present address ecSeqBioinformatics GmbH Sternwartenstraszlige 29 04103 Leipzig Germany

Received 11 June 2019 Accepted 12 November 2019

References1 Nekrutenko A Taylor J Next-generation sequencing data interpretation

enhancing reproducibility and accessibility Nat Rev Genet 201213667ndash72

Kaumlmpf et al BMC Bioinformatics (2019) 20664 Page 9 of 9

2 Peng RD Reproducible research in computational science Sci N Y20113341226ndash7

3 Bustin SA The reproducibility of biomedical research Sleepers awakeBiomol Detect Quantif 2014235ndash42

4 Baker M 1500scientists lift the lid on reproducibility Nature 2016533452ndash45 Sandve GK Nekrutenko A Taylor J Hovig E Ten simple rules for

reproducible computational research PLoS Comput Biol 2013910032856 Gruumlning B Chilton J Koumlster J Dale R Soranzo N van den Beek M

Goecks J Backofen R Nekrutenko A Taylor J Practical computationalreproducibility in the life sciences Cell Syst 20186631ndash5

7 Goodstadt L Ruffus A lightweight python library for computationalpipelines Bioinformatics 2010262778ndash9

8 Koumlster J Rahmann S Snakemake-a scalable bioinformatics workflowengine Bioinformatics 201228(19)2520ndash2

9 Fonseca Na Petryszak R Marioni J iRAP - an integrated RNA-seq AnalysisPipeline iRAP - an integrated RNA-seq Analysis Pipeline 20140ndash10httpsdoiorg101101005991

10 Wang Y Mehta G Mayani R Lu J Souaiaia T Chen Y Clark A Yoon HJWan L EvgrafovOV Knowles JA Deelman E Chen T RseqFlow Workflowsfor RNA-Seq data analysis Bioinformatics 201127(18)2598ndash600

11 Kalari KR Nair Aa Bhavsar JD O Brien DR Davila JI Bockol Ma Nie JTang X Baheti S Doughty JB Middha S Sicotte H Thompson AEAsmann YW Kocher J-Pa MAP-RSeq Mayo Analysis Pipeline for RNAsequencing BMC Bioinformatics 201415(1)224

12 Goecks J Nekrutenko A Taylor J Galaxy a comprehensive approach forsupporting accessible reproducible and transparent computationalresearch in the life sciences Genome Biol 20101186

13 Golosova O Henderson R Vaskin Y Gabrielian A Grekhov G NagarajanV Oler AJ Quintildeones M Hurt D Fursov M Huyen Y Unipro UGENE NGSpipelines and components for variant calling RNA-seq and ChIP-seq dataanalyses PeerJ 20142644

14 Berthold MR Cebron N Dill F Gabriel TR Koumltter T Meinl T Ohl P SiebC Thiel K Wiswedel B KNIME The Konstanz Information Miner InStudies in Classification Data Analysis and Knowledge Organization (GfKL2007) New York Springer 2007

15 Wolstencroft K Haines R Fellows D Williams A Withers D Owen SSoiland-Reyes S Dunlop I Nenadic A Fisher P Bhagat J Belhajjame KBacall F Hardisty A Nieva de la Hidalga A Balcazar Vargas MP Sufi SGoble C The Taverna workflow suite designing and executing workflowsof Web Services on the desktop web or in the cloud Nucleic acidsresearch 201341(Web Server issue)557ndash61

16 Guimera RV bcbio-nextgen Automated distributed next-gensequencing pipeline EMBnet J 20121730

17 Sadedin SP Pope B Oshlack A Bpipe a tool for running and managingbioinformatics pipelines Bioinformatics (Oxford Engl) 201228(11)1525ndash6

18 Di Tommaso P Chatzou M Floden EW Barja PP Palumbo ENotredame C Nextflow enables reproducible computational workflowsNat Biotechnol 201735316ndash9

19 Cingolani P Sladek R Blanchette M BigDataScript a scripting languagefor data pipelines Bioinformatics 201431(1)10ndash16

20 Ewels P Krueger F Kaumlller M Andrews S Cluster flow A user-friendlybioinformatics workflow tool F1000Research 201652824

21 Gafni E Luquette LJ Lancaster AK Hawkins JB Jung J-Y Souilmi Y WallDP Tonellato PJ Cosmos Python library for massively parallel workflowsBioinformatics 2014302956ndash8

22 Kaushik G Ivkovic S Simonovic J Tijanic N Davis-Dusenbery B Kural DRabix An open-source workflow executor supporting recomputabilityand interoperability of workflow descriptions Pac Symp Biocomput201722154ndash65

23 Yoo AB Jette MA Grondona M SLURM Simple Linux Utility for ResourceManagement In Feitelson D Rudolph L Schwiegelshohn U editorsJob Scheduling Strategies for Parallel Processing 9th InternationalWorkshop JSSPP 2003 Seattle WA USA June 24 2003 Revised PaperBerlin Springer 2003 p 44ndash60

24 Barski A Cuddapah S Cui K Roh T-Y Schones DE Wang Z Wei GChepelev I Zhao K High-resolution profiling of histone methylations inthe human genome Cell 2007129(4)823ndash37

25 Li H Durbin R Fastandaccurate long-read alignment with Burrows-Wheelertransform Bioinformatics (Oxford Engl) 201026(5)589ndash95

26 Langmead B Salzberg SL Fast gapped-read alignment with Bowtie 2Nat Methods 20129(4)357ndash9

27 Hoffmann S Otto C Kurtz S Sharma CM Khaitovich P Vogel J StadlerPF Hackermuumlller J Fast mapping of short sequences with mismatchesinsertions and deletions using index structures PLoS Comput Biol20095(9)1000502

28 Kim D Pertea G Trapnell C Pimentel H Kelley R Salzberg SL TopHat2accurate alignment of transcriptomes in the presence of insertionsdeletions and gene fusions Genome Biol 201314(4)36

29 Hoffmann S Otto C Doose G Tanzer A Langenberger D Christ S KunzM Holdt LM Teupser D Hackermuumlller J Stadler PF A multi-splitmapping algorithm for circular RNA splicing trans-splicing and fusiondetection Genome Biol 201415(2)34

30 Trapnell C Williams Ba Pertea G Mortazavi A Kwan G van Baren MJSalzberg SL Wold BJ Pachter L Transcript assembly and abundanceestimation from RNA-Seq reveals thousands of new transcripts andswitching among isoforms Nat Biotechnol 201128(5)511ndash5

31 Anders S Pyl PT Huber W HTSeq-A Python framework to work withhigh-throughput sequencing data Bioinformatics 201531(2)166ndash9

32 Zhang Y Liu T Meyer Ca Eeckhoute J Johnson DS Bernstein BENusbaum C Myers RM Brown M Li W Liu XS Model-based analysis ofChIP-Seq (MACS) Genome Biol 20089(9)137

33 Fidler F Gordon A Science is in a reproducibility crisis ndash how do weresolve it Conversation 2013 Available at httptheconversationcomscience-is-in-a-reproducibility-crisis-how-do-we-resolve-it-16998Accessed 11 Jun 2019

34 Lehrer J The truth wears off New Yorker 2010Dec 1352ndash735 Van Bavel J Why do so many studies fail to replicate N Y Times 201610

Available at httpswwwnytimescom20160529opinionsundaywhy-do-so-many-studies-fail-to-replicatehtml Accessed 11 Jun 2019

36 Fomel S Claerbout JF Guest Editorsrsquo Introduction ReproducibleResearch Comput Sci Eng 200911(1)5ndash7

37 Leisch F Sweave Dynamic generation of statistical reports using literatedata analysis In Haumlrdle W Roumlnz B editors Compstat 2002mdashProceedings in Computational Statistics Heidelberg Physica Verlag 2002p 575ndash80 ISBN 3-7908-1517-9 httpwwwstatuni-muenchendetexttildelowleischSweave

38 Xie Y Implementing Reproducible Computational Research In StoddenV Leisch F Peng RD editors Boca Raton Chapman and HallCRC 2014ISBN 978-1466561595 httpwwwcrcpresscomproductisbn9781466561595

39 Peacuterez F Granger BE IPython a system for interactive scientificcomputing Comput Sci Eng 20079(3)21ndash9

40 Afgan E Baker D Batut B van den Beek M Bouvier D Cech M ChiltonJ Clements D Coraor N Gruumlning BA Guerler A Hillman-Jackson JHiltemann S Jalili V Rasche H Soranzo N Goecks J Taylor JNekrutenko A Blankenberg D The galaxy platform for accessiblereproducible and collaborative biomedical analyses 2018 update NucleicAcids Res 201846537ndash44

Publisherrsquos NoteSpringer Nature remains neutral with regard to jurisdictional claims inpublished maps and institutional affiliations

  • Abstract
    • Background
    • Results
    • Conclusions
    • Keywords
      • Background
      • Implementation
        • Analysis as a directed acyclic graph
          • Plug-in architecture
            • Enforcing consistency and integrity
            • Maintaining reproducibility
            • Process flow
              • Results
                • Preparing genomic data for HTS analysis
                  • Transcriptome assembly from RNA-Seq data
                    • Identification of enriched regions from ChIP-Seq data
                      • Conclusions
                      • Availability and requirements
                      • Supplementary informationSupplementary information accompanies this paper at httpsdoiorg101186s12859-019-3219-1
                        • Additional file 1
                          • Abbreviations
                          • Acknowledgments
                          • Authors contributions
                          • Funding
                          • Availability of data and materials
                          • Ethics approval and consent to participate
                          • Consent for publication
                          • Competing interests
                          • Author details
                          • References
                          • Publishers Note
Page 4: SOFTWARE OpenAccess uap:reproducibleandrobustHTSdata …

Kaumlmpf et al BMC Bioinformatics (2019) 20664 Page 4 of 9

Table

1Com

parison

ofworkflowmanagem

entsystemsandHTS

analysispipe

lines

basedon

essentialfeaturespa

rticularlyregardingreprod

ucibility

Arvados

bcbio-ne

xtge

n[16]

BigD

ataScript[19]Bp

ipe[17]

ClusterFlow

[20]

Cosmos

[21]

CromwellGalaxy[40]

GwfHive

Nextflow[18]

Pype

Flow

Rabix[22]

Toil

uap

Mod

ular

ndash

Customizab

le

ndash

Flexible

ndash

Failurerecovery

ndashndash

ndash

ndashndash

ndash

Reprod

ucibility

Dep

ende

ncies

ndashndash

ndash

ndashndash

ndash

ndashndash

Con

sisten

cy

Linkingcode

andresult

ndashndash

ndashndash

ndashndash

ndash

ndashndash

ndash

ndashndash

Logg

ing

ndashndash

Dataauthen

ticity

ndash

ndashndash

ndash

ndash

ndashndash

ndashndash

ndashndash

Dataintegrity

ndash

ndashndash

ndash

ndashndash

ndashndash

ndash

Supp

platforms

Cluster

Docker

ndashndash

ndashndash

ndash

ndash

Cloud

ndash

ndash

ndash

ndash

Local

ndash

Toolswereselected

from

httpsgith

ubcom

pditommasoaw

esom

e-pipe

lineretainingon

lytoolsthatwereconsidered

tobe

activelyde

velope

dby

atleastfivecontrib

utors(latestcommitafter3

1012018)arelicen

sedas

open

sourcew

herethescop

eof

applicationisclearly

onbioinformaticsandwhich

supp

ortclusterba

tchsystem

sModularWorkflowsareassembled

from

reusab

lestep

sCustomizab

leCon

figurationof

theanalysisissepa

ratedfro

mthe

code

defin

ingthestep

sFlex

ibleW

orkflowscaneasilybe

alteredwith

outp

rogram

mingskillsFa

ilure

reco

veryFailuredu

ringexecutionisde

tected

andsubseq

uent

analyses

halte

dto

avoidcorrup

tedresultsRep

roducibility

-Dep

enden

ciesW

MSen

forces

thatde

pend

encies

betw

eenanalysisstep

sandinterm

ediate

results

arecorrectly

maintaine

dConsisten

cyW

MSsafegu

ards

thatanalysisstep

saresuccessfullycompleted

priortoexecutionof

subseq

uent

step

sLinkingco

dean

dresu

ltW

MSen

suresconsistencybe

tweenthecode

defin

ingtheanalysisandthecurren

tlyavailableresultsLoggingW

MSlogs

Stdo

utStderrexitstatustoo

lversion

sWMSversionexecuted

commandsexecutio

ndateand

in-ou

tput

files

(see

correspo

ndingAdd

ition

alfile1Tables

S1andS2

ford

etails)Dataau

then

ticity

WMSrecordsinform

ationab

outcreator

andcreatin

gprocess(es)o

fthe

dataD

ataintegrityW

MS

recordsinform

ationthatallowsto

verifytheintegrity

ofcreateddata(eghashsums)Suppp

latform

s-C

lusterW

MSsupp

ortscompu

teclusterm

anagem

entsystemstonrunjobsD

ockerW

MScanrunjobs

usingDockerimages

CloudW

MScanbe

deployed

inacompu

teclou

dSu

ppC

WLWMSsupp

ortscommon

workflowlang

uage

LocalWMScanbe

executed

locally

with

outd

epen

ding

onaclusterm

anagem

entsystem

ora(web

)serverResults

are

givenas

(not

met)(p

artially

met)(f

ullymet)and

ndash(not

stated

)Ratin

gsareba

sedon

inform

ationprovided

inthepa

persdocum

entatio

nsandmanualsA

morede

tailedcompa

rison

ispresen

tedinAdd

ition

alfile1TableS1

Kaumlmpf et al BMC Bioinformatics (2019) 20664 Page 5 of 9

of the configuration at this point Hence an executed runis documented with the releases of all used software andthe invoked command line with all parameter settings Inaddition memory and CPU usage of each process check-sums of result files as well as the last kB of stdout andstderr output are reported The annotation file is storednext to the result files of a run

Process flow Initially uap reads the configuration gen-erates the respective DAG and defines all commands andoutput file names Throughout this initiation process uapinspects the planned analysis for potential errors Thegraph is tested to be acyclic all required tools (in definedreleases) are tested for their availability and the status ofall steps is determined This initiation phase is executedearly ie before submitting runs to a compute cluster uapthus implements a failing fast technique This is an impor-tant feature when working with large amounts of data on

HPC systems where software is dynamically loaded anderroneous configurations might otherwise only becomeapparent after hours of computation Figure 2 illustratesuaprsquos process flow error reporting and the link betweenconfiguration and result filesSubsequently uap can start runs display the com-

mands of runs show the state of the runs and ren-der execution graphs Execution graphs are useful toolsto inspect the performance of an analysis eg to iden-tify resource bottlenecks in a pipeline of commandsAdditional file 1 Figure S2 shows such an execution graphFigure 3 provides an overview of the main principles uapis built on

Resultsuap is a workflow management system dedicated to dataconsistency and adoption of a Reproducible Researchparadigm in HTS data analysis uap runs on UNIX-

Fig 2 uaprsquos process flow error reporting and the link between the analysis code and result files uap implements a failing fast approach the DAGis built from the configuration file tested to be acyclic all required tools are tested for their availability and the status of all steps is determinedSubsequently uap can start runs display the commands of runs show the state of the runs and render execution graphs Runs are executed intemporary directories and monitored throughout execution Result files are only generated at their final location if all processes of a run exitedgracefully and all expected output files exist Analysis code and resulting data are tightly linked by hashing over the complete sequence ofcommands and parameters of a run and appending the key to the output path Each run generates an annotation file in YAML format that capturesthe configuration software versions and releases the invoked command line all parameters memory and CPU usage of each process involvedchecksums of the result files as well as the last kB of stdout and stderr

Kaumlmpf et al BMC Bioinformatics (2019) 20664 Page 6 of 9

Fig 3 Sketch of the main principles uap is built on a An analysis with uap comprises three parts (i) the uap source code itself implemented inPython - it contains the complete framework of uap and 2 classes for the implementation of source and processing steps These classes are used towrap any tool that is part of the analysis enabling an easy extension of the uaprsquos repertoire of steps (ii) the uap configuration in YAML format Itcontains all necessary information to run and reproduce the analysis given the data and (iii) the uap results - organized in one folder per step inthe output directory The special folder temp contains the expected results until the computation of the step has finished successfully and keepsthe intermediate results and log files upon failure b The progress or state of an analysis can be monitored with a call to uap status Itdetermines the state of each individual step in dependence of the state(s) of its previous step(s) and provides this information to the user

like operating systems and can interact with batchqueuing systems like the SunOracleUniva grid engines(SGEOGEUGE) and SLURM [23] to submit analyses tohigh performance computing systems uap is distributedunder the GNU GPL v3 license and is publicly avail-able at httpsgithubcomyigbtuap Its documentationis hosted at httpuapreadthedocsorg A docker con-tainer with a core set of tools is available at httpshubdockercomryigbtuaptags

uap is distributed with predefined workflows for (i)genome sequence download and index generation forread mapping programs (ii) transcriptome sequencing(RNA-seq) data analysis and (iii) Chromatin Immuno-Precipitation DNA-Sequencing (ChIP-Seq) data analysisFurther we provide small test data sets enabling a quickstart for each of these workflows An additional exampleusing a larger data set is provided via code for download-ing and analyzing a publicly available ChIP-Seq data set

Kaumlmpf et al BMC Bioinformatics (2019) 20664 Page 7 of 9

(Barski et al [24]) The provided workflows are intendedto serve as an easy entry point into a uap analysis as wellas a template for similar analyses eg for other species orwith another set of tools

Preparing genomic data for HTS analysis An impor-tant prerequisite for HTS projects where a referencegenome is available is aligning (mapping) the sequencingreads to this genome Most mapping software requires aspecific data structure (index) of the genome to efficientlysolve this alignment problem Indices have to be gener-ated once prior to any mapping procedure We provideuap configuration files for the bwa [25] bowtie [26]and segemehl [27] mapping programs and samtools(fasta indexing) of the a) Mycoplasma genitaliumand b) Homo sapiens genome Genomic sequencescan be downloaded automatically prior to indexgeneration

Transcriptome assembly from RNA-Seq data Tran-scriptome sequencing identifies and estimates the quan-tity of RNA in biological samples Beyond quantification ofknown transcripts based on overlapping sequence readsRNA-seq allows the assembly of novel transcripts Weprovide a uap configuration file for combining split-readmapping with de novo transcript assembly uap readsthe sequencing data either from an Illumina sequencingrun folder or a set of fastq files applies quality con-trol removes adapter sequences and maps the reads toa genome using tophat2 [28] and segemehl [27 29]The mapped reads from tophat2 are directly processedby cufflinks [30] for de novo transcript assemblySplit-reads mapped with segemehl are prepared forcufflinks using an adapter script and then also pro-cessed with cufflinks The configuration also containsa step to determine the number of mapped reads pertranscript applying htseq-count [31]

Identification of enriched regions from ChIP-Seqdata ChIP sequencing is a method that integrates chro-matin immunoprecipitation (ChIP) and high-throughputDNA sequencing to identify sites of protein-DNA inter-actions The provided uap-configuration file for this taskinitially resembles the RNA-Seq workflow Here readsare expected to correspond to genomic DNA and themapping is done without considering split-reads usingbowtie2 [26] Mapped reads are subsequently sortedduplicates removed and enriched regions are detectedusing MACS2 [32]

ConclusionsThe critique on a lack of reproducibility in science anda growing awareness that many reported facts do not

seem to hold up to repeated investigation has mean-while reached a broader audience beyond the scien-tific community (eg [33ndash35]) In our opinion HTSdata analysis is particularly prone to consistency andreproducibility issues ndash especially due to the complex-ity of the analysis the involved data volume and thebroad range of available tools and their multitude ofparametersWorkflow management systems are indispensable for

reproducibly controlling more complex analyses like HTSdata analysis Published workflow management systemsfor HTS data analysis are either highly flexible and madefor experienced programmers or lack a lot of flexibil-ity but can be used intuitively and some are specific toa certain type of analysis In the introduction we listedfour minimal requirements that we consider essential forensuring reproducibility and consistency of HTS dataanalyses Additionally but beyond the programmatic con-trol of a of a WMS using versioned external data likegenome sequences or annotation is key for reproducibleanalysis None of the published systems we are awareof however completely satisfies these criteria uap hasbeen designed to fulfill these One critical requirement islinking analysis code and generated data While Repro-ducible Research in statistics [36] uses tools like Sweave[37] knitR [38] or Jupyter [39] to combine analysis codeand resulting data in one output file such a strategy isnot feasible for most steps of an HTS analysis due tothe size of generated data uap therefore relies on hash-ing over the complete sequence of commands includingparameters of a run and appending the key to the outputpath In addition uap performs logging and process mon-itoring supports different cluster management systemscreates recovery points plots execution graphs managesjob dependencies and is extensible to any kind of multi-step analysis It provides pre-built steps for the prepara-tion of genomic data and the analysis of RNA-Seq andChIP-Seq dataAmong the many different flavors of WMS uap is

clearly easier to operate for users with limited program-ming experience than systems based on domain specificprogramming languages while offering a lot more flexi-bility than single purpose tools Based on the comparisonof tools in Additional file 1 Table S1 the tools most ded-icated to reproducible research and providing the mostsimilar set of features compared to uap are Galaxy andNextflowIn our opinion Galaxy [40] and uap address dif-

ferent user groups and tasks and we use both WMSin our research environment Galaxy is great for pro-viding predefined workflows to users without experi-ence in programming and working on the commandline It allows these users to adapt parameters and exe-cute such workflows on their data For users working

Kaumlmpf et al BMC Bioinformatics (2019) 20664 Page 8 of 9

frequently with large HTS data sets adapting work-flows to a larger extent or duplicating branches of theDAG for performing variants of the analysis in paral-lel is much more efficiently performed in uap Galaxydoes not link data and code in the sense of uap Butas any change to parameters in a Galaxy workflowtriggers re-execution of the sub-workflow below this isnot necessary If many changes throughout a workflowhave to be made this behavior of Galaxy may be hin-dering Obviously running HTS analyses on Galaxyrequires a Galaxy server integrated with an HPC envi-ronment which is not trivial to set up and demandscontinuous maintenance Starting from scratch settingup HTS analysis is significantly less effort using uapthan GalaxyNextflow is a powerful WMS dedicated to scalability

and reproducibility [18] Its approach to reproducibil-ity relies on a tight integration with github and thesupport of scalable containerization of pipelines usingeg Docker Nextflow and uap share several con-cepts eg using temporary files for intermediate resultsor analyzing the workflow DAG to enable failing fastIntermediate results of an HTS analysis can consumelarge amounts of storage space uap therefore providesa means for volatilization of intermediate results with-out breaking dependencies in the DAG - a feature whichdoes not seem to be available in Nextflow Loggingis somewhat limited in Nextflow compared to uapbut Nextflow provides a broader support for HPCenvironments including also support for cloud comput-ing Nextflowrsquos approach to reproducibility is pow-erful when software from Github is used as it enablesthe user to request a specific commit or when thetools used are publicly available as a container How-ever when a tool is run in rsquonative task supportrsquo likethe Kallisto example provided in [18] uap is morestringent in logging version and the full set (includingdefault) of parameters Where uap and Nextflow dif-fer most clearly regarding reproducibility is linking dataand code as to our understanding based on publica-tions and the online documentation this is not availablein NextflowIn summary we are convinced that reproducible

research principles need to be advanced for HTS dataanalysis and that uap is a highly useful system for facingthis challenge

Availability and requirementsProject name uapProject home page httpsgithubcomyigbtuapOperating system(s) LinuxProgramming language PythonOther requirements virtualenv git and graphviz

License GNU GPL v3Any restrictions to use by non-academics None

Supplementary informationSupplementary information accompanies this paper athttpsdoiorg101186s12859-019-3219-1

Additional file 1 This file includes supplemental tables and figures

AbbreviationsChIP-seq Chromatin immuno-precipitation dna-sequencing DAG Directedacyclic graph HTS High throughput sequencing HPC High performancecomputing RNA-seq Transcriptome sequencing WMS Workflowmanagement system

AcknowledgmentsNot applicable

Authorsrsquo contributionsKR MS and JH initially designed and implemented the software SHP CK ASKR JS and JH took over the further development of the software GDimplemented the script that formats segemehl-output to be compatiblewith the cufflinks suite of tools CK and AS provided the dockercontainers of uap CK KR JS and JH wrote the manuscript All authors readcorrected and approved the final manuscript

FundingThis work was supported in part by the Initiative and Networking Fund of theHelmholtz Association through a grant to JH (VH-NG738) by the DeutscheForschungsgemeinschaft (DFG) through a grant to JH (SPP1738) by theFraunhofer Future Fund and through the LIFE Leipzig Research Center forCivilization Diseases The funding bodies did not have any influence on thedesign of the study nor on collection analysis or interpretation of data nor onwriting the manuscript

Availability of data andmaterialsSource code is available at httpsgithubcomyigbtuap links to Dockercontainers and documentation are provided at httpswwwufzdeindexphpen=44919 All datasets used within the example workflows are publiclyavailable and are fully referenced within the workflow configuration files

Ethics approval and consent to participateNot applicable

Consent for publicationNot applicable

Competing interestsThe authors declare that they have no competing interests KR is member ofthe editorial board (Associate Editor) of BMC Bioinformatics

Author details1Young Investigators Group Bioinformatics and Transcriptomics DepartmentMolecular Systems Biology Helmholtz Center for Environmental Research ndashUFZ Permoserstraszlige 15 04318 Leipzig Germany 2Bioinformatics DepartmentUniversitaumlt Leipzig Haumlrtelstraszlige 16-18 04107 Leipzig Germany3Bioinformatics Unit Department of Diagnostics Fraunhofer Institute for CellTherapy and Immunology Perlickstraszlige 1 04103 Leipzig Germany 4Presentaddress Department of Diagnostics Fraunhofer Institute for Cell Therapy andImmunology Perlickstraszlige 1 04103 Leipzig Germany 5Present address ecSeqBioinformatics GmbH Sternwartenstraszlige 29 04103 Leipzig Germany

Received 11 June 2019 Accepted 12 November 2019

References1 Nekrutenko A Taylor J Next-generation sequencing data interpretation

enhancing reproducibility and accessibility Nat Rev Genet 201213667ndash72

Kaumlmpf et al BMC Bioinformatics (2019) 20664 Page 9 of 9

2 Peng RD Reproducible research in computational science Sci N Y20113341226ndash7

3 Bustin SA The reproducibility of biomedical research Sleepers awakeBiomol Detect Quantif 2014235ndash42

4 Baker M 1500scientists lift the lid on reproducibility Nature 2016533452ndash45 Sandve GK Nekrutenko A Taylor J Hovig E Ten simple rules for

reproducible computational research PLoS Comput Biol 2013910032856 Gruumlning B Chilton J Koumlster J Dale R Soranzo N van den Beek M

Goecks J Backofen R Nekrutenko A Taylor J Practical computationalreproducibility in the life sciences Cell Syst 20186631ndash5

7 Goodstadt L Ruffus A lightweight python library for computationalpipelines Bioinformatics 2010262778ndash9

8 Koumlster J Rahmann S Snakemake-a scalable bioinformatics workflowengine Bioinformatics 201228(19)2520ndash2

9 Fonseca Na Petryszak R Marioni J iRAP - an integrated RNA-seq AnalysisPipeline iRAP - an integrated RNA-seq Analysis Pipeline 20140ndash10httpsdoiorg101101005991

10 Wang Y Mehta G Mayani R Lu J Souaiaia T Chen Y Clark A Yoon HJWan L EvgrafovOV Knowles JA Deelman E Chen T RseqFlow Workflowsfor RNA-Seq data analysis Bioinformatics 201127(18)2598ndash600

11 Kalari KR Nair Aa Bhavsar JD O Brien DR Davila JI Bockol Ma Nie JTang X Baheti S Doughty JB Middha S Sicotte H Thompson AEAsmann YW Kocher J-Pa MAP-RSeq Mayo Analysis Pipeline for RNAsequencing BMC Bioinformatics 201415(1)224

12 Goecks J Nekrutenko A Taylor J Galaxy a comprehensive approach forsupporting accessible reproducible and transparent computationalresearch in the life sciences Genome Biol 20101186

13 Golosova O Henderson R Vaskin Y Gabrielian A Grekhov G NagarajanV Oler AJ Quintildeones M Hurt D Fursov M Huyen Y Unipro UGENE NGSpipelines and components for variant calling RNA-seq and ChIP-seq dataanalyses PeerJ 20142644

14 Berthold MR Cebron N Dill F Gabriel TR Koumltter T Meinl T Ohl P SiebC Thiel K Wiswedel B KNIME The Konstanz Information Miner InStudies in Classification Data Analysis and Knowledge Organization (GfKL2007) New York Springer 2007

15 Wolstencroft K Haines R Fellows D Williams A Withers D Owen SSoiland-Reyes S Dunlop I Nenadic A Fisher P Bhagat J Belhajjame KBacall F Hardisty A Nieva de la Hidalga A Balcazar Vargas MP Sufi SGoble C The Taverna workflow suite designing and executing workflowsof Web Services on the desktop web or in the cloud Nucleic acidsresearch 201341(Web Server issue)557ndash61

16 Guimera RV bcbio-nextgen Automated distributed next-gensequencing pipeline EMBnet J 20121730

17 Sadedin SP Pope B Oshlack A Bpipe a tool for running and managingbioinformatics pipelines Bioinformatics (Oxford Engl) 201228(11)1525ndash6

18 Di Tommaso P Chatzou M Floden EW Barja PP Palumbo ENotredame C Nextflow enables reproducible computational workflowsNat Biotechnol 201735316ndash9

19 Cingolani P Sladek R Blanchette M BigDataScript a scripting languagefor data pipelines Bioinformatics 201431(1)10ndash16

20 Ewels P Krueger F Kaumlller M Andrews S Cluster flow A user-friendlybioinformatics workflow tool F1000Research 201652824

21 Gafni E Luquette LJ Lancaster AK Hawkins JB Jung J-Y Souilmi Y WallDP Tonellato PJ Cosmos Python library for massively parallel workflowsBioinformatics 2014302956ndash8

22 Kaushik G Ivkovic S Simonovic J Tijanic N Davis-Dusenbery B Kural DRabix An open-source workflow executor supporting recomputabilityand interoperability of workflow descriptions Pac Symp Biocomput201722154ndash65

23 Yoo AB Jette MA Grondona M SLURM Simple Linux Utility for ResourceManagement In Feitelson D Rudolph L Schwiegelshohn U editorsJob Scheduling Strategies for Parallel Processing 9th InternationalWorkshop JSSPP 2003 Seattle WA USA June 24 2003 Revised PaperBerlin Springer 2003 p 44ndash60

24 Barski A Cuddapah S Cui K Roh T-Y Schones DE Wang Z Wei GChepelev I Zhao K High-resolution profiling of histone methylations inthe human genome Cell 2007129(4)823ndash37

25 Li H Durbin R Fastandaccurate long-read alignment with Burrows-Wheelertransform Bioinformatics (Oxford Engl) 201026(5)589ndash95

26 Langmead B Salzberg SL Fast gapped-read alignment with Bowtie 2Nat Methods 20129(4)357ndash9

27 Hoffmann S Otto C Kurtz S Sharma CM Khaitovich P Vogel J StadlerPF Hackermuumlller J Fast mapping of short sequences with mismatchesinsertions and deletions using index structures PLoS Comput Biol20095(9)1000502

28 Kim D Pertea G Trapnell C Pimentel H Kelley R Salzberg SL TopHat2accurate alignment of transcriptomes in the presence of insertionsdeletions and gene fusions Genome Biol 201314(4)36

29 Hoffmann S Otto C Doose G Tanzer A Langenberger D Christ S KunzM Holdt LM Teupser D Hackermuumlller J Stadler PF A multi-splitmapping algorithm for circular RNA splicing trans-splicing and fusiondetection Genome Biol 201415(2)34

30 Trapnell C Williams Ba Pertea G Mortazavi A Kwan G van Baren MJSalzberg SL Wold BJ Pachter L Transcript assembly and abundanceestimation from RNA-Seq reveals thousands of new transcripts andswitching among isoforms Nat Biotechnol 201128(5)511ndash5

31 Anders S Pyl PT Huber W HTSeq-A Python framework to work withhigh-throughput sequencing data Bioinformatics 201531(2)166ndash9

32 Zhang Y Liu T Meyer Ca Eeckhoute J Johnson DS Bernstein BENusbaum C Myers RM Brown M Li W Liu XS Model-based analysis ofChIP-Seq (MACS) Genome Biol 20089(9)137

33 Fidler F Gordon A Science is in a reproducibility crisis ndash how do weresolve it Conversation 2013 Available at httptheconversationcomscience-is-in-a-reproducibility-crisis-how-do-we-resolve-it-16998Accessed 11 Jun 2019

34 Lehrer J The truth wears off New Yorker 2010Dec 1352ndash735 Van Bavel J Why do so many studies fail to replicate N Y Times 201610

Available at httpswwwnytimescom20160529opinionsundaywhy-do-so-many-studies-fail-to-replicatehtml Accessed 11 Jun 2019

36 Fomel S Claerbout JF Guest Editorsrsquo Introduction ReproducibleResearch Comput Sci Eng 200911(1)5ndash7

37 Leisch F Sweave Dynamic generation of statistical reports using literatedata analysis In Haumlrdle W Roumlnz B editors Compstat 2002mdashProceedings in Computational Statistics Heidelberg Physica Verlag 2002p 575ndash80 ISBN 3-7908-1517-9 httpwwwstatuni-muenchendetexttildelowleischSweave

38 Xie Y Implementing Reproducible Computational Research In StoddenV Leisch F Peng RD editors Boca Raton Chapman and HallCRC 2014ISBN 978-1466561595 httpwwwcrcpresscomproductisbn9781466561595

39 Peacuterez F Granger BE IPython a system for interactive scientificcomputing Comput Sci Eng 20079(3)21ndash9

40 Afgan E Baker D Batut B van den Beek M Bouvier D Cech M ChiltonJ Clements D Coraor N Gruumlning BA Guerler A Hillman-Jackson JHiltemann S Jalili V Rasche H Soranzo N Goecks J Taylor JNekrutenko A Blankenberg D The galaxy platform for accessiblereproducible and collaborative biomedical analyses 2018 update NucleicAcids Res 201846537ndash44

Publisherrsquos NoteSpringer Nature remains neutral with regard to jurisdictional claims inpublished maps and institutional affiliations

  • Abstract
    • Background
    • Results
    • Conclusions
    • Keywords
      • Background
      • Implementation
        • Analysis as a directed acyclic graph
          • Plug-in architecture
            • Enforcing consistency and integrity
            • Maintaining reproducibility
            • Process flow
              • Results
                • Preparing genomic data for HTS analysis
                  • Transcriptome assembly from RNA-Seq data
                    • Identification of enriched regions from ChIP-Seq data
                      • Conclusions
                      • Availability and requirements
                      • Supplementary informationSupplementary information accompanies this paper at httpsdoiorg101186s12859-019-3219-1
                        • Additional file 1
                          • Abbreviations
                          • Acknowledgments
                          • Authors contributions
                          • Funding
                          • Availability of data and materials
                          • Ethics approval and consent to participate
                          • Consent for publication
                          • Competing interests
                          • Author details
                          • References
                          • Publishers Note
Page 5: SOFTWARE OpenAccess uap:reproducibleandrobustHTSdata …

Kaumlmpf et al BMC Bioinformatics (2019) 20664 Page 5 of 9

of the configuration at this point Hence an executed runis documented with the releases of all used software andthe invoked command line with all parameter settings Inaddition memory and CPU usage of each process check-sums of result files as well as the last kB of stdout andstderr output are reported The annotation file is storednext to the result files of a run

Process flow Initially uap reads the configuration gen-erates the respective DAG and defines all commands andoutput file names Throughout this initiation process uapinspects the planned analysis for potential errors Thegraph is tested to be acyclic all required tools (in definedreleases) are tested for their availability and the status ofall steps is determined This initiation phase is executedearly ie before submitting runs to a compute cluster uapthus implements a failing fast technique This is an impor-tant feature when working with large amounts of data on

HPC systems where software is dynamically loaded anderroneous configurations might otherwise only becomeapparent after hours of computation Figure 2 illustratesuaprsquos process flow error reporting and the link betweenconfiguration and result filesSubsequently uap can start runs display the com-

mands of runs show the state of the runs and ren-der execution graphs Execution graphs are useful toolsto inspect the performance of an analysis eg to iden-tify resource bottlenecks in a pipeline of commandsAdditional file 1 Figure S2 shows such an execution graphFigure 3 provides an overview of the main principles uapis built on

Resultsuap is a workflow management system dedicated to dataconsistency and adoption of a Reproducible Researchparadigm in HTS data analysis uap runs on UNIX-

Fig 2 uaprsquos process flow error reporting and the link between the analysis code and result files uap implements a failing fast approach the DAGis built from the configuration file tested to be acyclic all required tools are tested for their availability and the status of all steps is determinedSubsequently uap can start runs display the commands of runs show the state of the runs and render execution graphs Runs are executed intemporary directories and monitored throughout execution Result files are only generated at their final location if all processes of a run exitedgracefully and all expected output files exist Analysis code and resulting data are tightly linked by hashing over the complete sequence ofcommands and parameters of a run and appending the key to the output path Each run generates an annotation file in YAML format that capturesthe configuration software versions and releases the invoked command line all parameters memory and CPU usage of each process involvedchecksums of the result files as well as the last kB of stdout and stderr

Kaumlmpf et al BMC Bioinformatics (2019) 20664 Page 6 of 9

Fig 3 Sketch of the main principles uap is built on a An analysis with uap comprises three parts (i) the uap source code itself implemented inPython - it contains the complete framework of uap and 2 classes for the implementation of source and processing steps These classes are used towrap any tool that is part of the analysis enabling an easy extension of the uaprsquos repertoire of steps (ii) the uap configuration in YAML format Itcontains all necessary information to run and reproduce the analysis given the data and (iii) the uap results - organized in one folder per step inthe output directory The special folder temp contains the expected results until the computation of the step has finished successfully and keepsthe intermediate results and log files upon failure b The progress or state of an analysis can be monitored with a call to uap status Itdetermines the state of each individual step in dependence of the state(s) of its previous step(s) and provides this information to the user

like operating systems and can interact with batchqueuing systems like the SunOracleUniva grid engines(SGEOGEUGE) and SLURM [23] to submit analyses tohigh performance computing systems uap is distributedunder the GNU GPL v3 license and is publicly avail-able at httpsgithubcomyigbtuap Its documentationis hosted at httpuapreadthedocsorg A docker con-tainer with a core set of tools is available at httpshubdockercomryigbtuaptags

uap is distributed with predefined workflows for (i)genome sequence download and index generation forread mapping programs (ii) transcriptome sequencing(RNA-seq) data analysis and (iii) Chromatin Immuno-Precipitation DNA-Sequencing (ChIP-Seq) data analysisFurther we provide small test data sets enabling a quickstart for each of these workflows An additional exampleusing a larger data set is provided via code for download-ing and analyzing a publicly available ChIP-Seq data set

Kaumlmpf et al BMC Bioinformatics (2019) 20664 Page 7 of 9

(Barski et al [24]) The provided workflows are intendedto serve as an easy entry point into a uap analysis as wellas a template for similar analyses eg for other species orwith another set of tools

Preparing genomic data for HTS analysis An impor-tant prerequisite for HTS projects where a referencegenome is available is aligning (mapping) the sequencingreads to this genome Most mapping software requires aspecific data structure (index) of the genome to efficientlysolve this alignment problem Indices have to be gener-ated once prior to any mapping procedure We provideuap configuration files for the bwa [25] bowtie [26]and segemehl [27] mapping programs and samtools(fasta indexing) of the a) Mycoplasma genitaliumand b) Homo sapiens genome Genomic sequencescan be downloaded automatically prior to indexgeneration

Transcriptome assembly from RNA-Seq data Tran-scriptome sequencing identifies and estimates the quan-tity of RNA in biological samples Beyond quantification ofknown transcripts based on overlapping sequence readsRNA-seq allows the assembly of novel transcripts Weprovide a uap configuration file for combining split-readmapping with de novo transcript assembly uap readsthe sequencing data either from an Illumina sequencingrun folder or a set of fastq files applies quality con-trol removes adapter sequences and maps the reads toa genome using tophat2 [28] and segemehl [27 29]The mapped reads from tophat2 are directly processedby cufflinks [30] for de novo transcript assemblySplit-reads mapped with segemehl are prepared forcufflinks using an adapter script and then also pro-cessed with cufflinks The configuration also containsa step to determine the number of mapped reads pertranscript applying htseq-count [31]

Identification of enriched regions from ChIP-Seqdata ChIP sequencing is a method that integrates chro-matin immunoprecipitation (ChIP) and high-throughputDNA sequencing to identify sites of protein-DNA inter-actions The provided uap-configuration file for this taskinitially resembles the RNA-Seq workflow Here readsare expected to correspond to genomic DNA and themapping is done without considering split-reads usingbowtie2 [26] Mapped reads are subsequently sortedduplicates removed and enriched regions are detectedusing MACS2 [32]

ConclusionsThe critique on a lack of reproducibility in science anda growing awareness that many reported facts do not

seem to hold up to repeated investigation has mean-while reached a broader audience beyond the scien-tific community (eg [33ndash35]) In our opinion HTSdata analysis is particularly prone to consistency andreproducibility issues ndash especially due to the complex-ity of the analysis the involved data volume and thebroad range of available tools and their multitude ofparametersWorkflow management systems are indispensable for

reproducibly controlling more complex analyses like HTSdata analysis Published workflow management systemsfor HTS data analysis are either highly flexible and madefor experienced programmers or lack a lot of flexibil-ity but can be used intuitively and some are specific toa certain type of analysis In the introduction we listedfour minimal requirements that we consider essential forensuring reproducibility and consistency of HTS dataanalyses Additionally but beyond the programmatic con-trol of a of a WMS using versioned external data likegenome sequences or annotation is key for reproducibleanalysis None of the published systems we are awareof however completely satisfies these criteria uap hasbeen designed to fulfill these One critical requirement islinking analysis code and generated data While Repro-ducible Research in statistics [36] uses tools like Sweave[37] knitR [38] or Jupyter [39] to combine analysis codeand resulting data in one output file such a strategy isnot feasible for most steps of an HTS analysis due tothe size of generated data uap therefore relies on hash-ing over the complete sequence of commands includingparameters of a run and appending the key to the outputpath In addition uap performs logging and process mon-itoring supports different cluster management systemscreates recovery points plots execution graphs managesjob dependencies and is extensible to any kind of multi-step analysis It provides pre-built steps for the prepara-tion of genomic data and the analysis of RNA-Seq andChIP-Seq dataAmong the many different flavors of WMS uap is

clearly easier to operate for users with limited program-ming experience than systems based on domain specificprogramming languages while offering a lot more flexi-bility than single purpose tools Based on the comparisonof tools in Additional file 1 Table S1 the tools most ded-icated to reproducible research and providing the mostsimilar set of features compared to uap are Galaxy andNextflowIn our opinion Galaxy [40] and uap address dif-

ferent user groups and tasks and we use both WMSin our research environment Galaxy is great for pro-viding predefined workflows to users without experi-ence in programming and working on the commandline It allows these users to adapt parameters and exe-cute such workflows on their data For users working

Kaumlmpf et al BMC Bioinformatics (2019) 20664 Page 8 of 9

frequently with large HTS data sets adapting work-flows to a larger extent or duplicating branches of theDAG for performing variants of the analysis in paral-lel is much more efficiently performed in uap Galaxydoes not link data and code in the sense of uap Butas any change to parameters in a Galaxy workflowtriggers re-execution of the sub-workflow below this isnot necessary If many changes throughout a workflowhave to be made this behavior of Galaxy may be hin-dering Obviously running HTS analyses on Galaxyrequires a Galaxy server integrated with an HPC envi-ronment which is not trivial to set up and demandscontinuous maintenance Starting from scratch settingup HTS analysis is significantly less effort using uapthan GalaxyNextflow is a powerful WMS dedicated to scalability

and reproducibility [18] Its approach to reproducibil-ity relies on a tight integration with github and thesupport of scalable containerization of pipelines usingeg Docker Nextflow and uap share several con-cepts eg using temporary files for intermediate resultsor analyzing the workflow DAG to enable failing fastIntermediate results of an HTS analysis can consumelarge amounts of storage space uap therefore providesa means for volatilization of intermediate results with-out breaking dependencies in the DAG - a feature whichdoes not seem to be available in Nextflow Loggingis somewhat limited in Nextflow compared to uapbut Nextflow provides a broader support for HPCenvironments including also support for cloud comput-ing Nextflowrsquos approach to reproducibility is pow-erful when software from Github is used as it enablesthe user to request a specific commit or when thetools used are publicly available as a container How-ever when a tool is run in rsquonative task supportrsquo likethe Kallisto example provided in [18] uap is morestringent in logging version and the full set (includingdefault) of parameters Where uap and Nextflow dif-fer most clearly regarding reproducibility is linking dataand code as to our understanding based on publica-tions and the online documentation this is not availablein NextflowIn summary we are convinced that reproducible

research principles need to be advanced for HTS dataanalysis and that uap is a highly useful system for facingthis challenge

Availability and requirementsProject name uapProject home page httpsgithubcomyigbtuapOperating system(s) LinuxProgramming language PythonOther requirements virtualenv git and graphviz

License GNU GPL v3Any restrictions to use by non-academics None

Supplementary informationSupplementary information accompanies this paper athttpsdoiorg101186s12859-019-3219-1

Additional file 1 This file includes supplemental tables and figures

AbbreviationsChIP-seq Chromatin immuno-precipitation dna-sequencing DAG Directedacyclic graph HTS High throughput sequencing HPC High performancecomputing RNA-seq Transcriptome sequencing WMS Workflowmanagement system

AcknowledgmentsNot applicable

Authorsrsquo contributionsKR MS and JH initially designed and implemented the software SHP CK ASKR JS and JH took over the further development of the software GDimplemented the script that formats segemehl-output to be compatiblewith the cufflinks suite of tools CK and AS provided the dockercontainers of uap CK KR JS and JH wrote the manuscript All authors readcorrected and approved the final manuscript

FundingThis work was supported in part by the Initiative and Networking Fund of theHelmholtz Association through a grant to JH (VH-NG738) by the DeutscheForschungsgemeinschaft (DFG) through a grant to JH (SPP1738) by theFraunhofer Future Fund and through the LIFE Leipzig Research Center forCivilization Diseases The funding bodies did not have any influence on thedesign of the study nor on collection analysis or interpretation of data nor onwriting the manuscript

Availability of data andmaterialsSource code is available at httpsgithubcomyigbtuap links to Dockercontainers and documentation are provided at httpswwwufzdeindexphpen=44919 All datasets used within the example workflows are publiclyavailable and are fully referenced within the workflow configuration files

Ethics approval and consent to participateNot applicable

Consent for publicationNot applicable

Competing interestsThe authors declare that they have no competing interests KR is member ofthe editorial board (Associate Editor) of BMC Bioinformatics

Author details1Young Investigators Group Bioinformatics and Transcriptomics DepartmentMolecular Systems Biology Helmholtz Center for Environmental Research ndashUFZ Permoserstraszlige 15 04318 Leipzig Germany 2Bioinformatics DepartmentUniversitaumlt Leipzig Haumlrtelstraszlige 16-18 04107 Leipzig Germany3Bioinformatics Unit Department of Diagnostics Fraunhofer Institute for CellTherapy and Immunology Perlickstraszlige 1 04103 Leipzig Germany 4Presentaddress Department of Diagnostics Fraunhofer Institute for Cell Therapy andImmunology Perlickstraszlige 1 04103 Leipzig Germany 5Present address ecSeqBioinformatics GmbH Sternwartenstraszlige 29 04103 Leipzig Germany

Received 11 June 2019 Accepted 12 November 2019

References1 Nekrutenko A Taylor J Next-generation sequencing data interpretation

enhancing reproducibility and accessibility Nat Rev Genet 201213667ndash72

Kaumlmpf et al BMC Bioinformatics (2019) 20664 Page 9 of 9

2 Peng RD Reproducible research in computational science Sci N Y20113341226ndash7

3 Bustin SA The reproducibility of biomedical research Sleepers awakeBiomol Detect Quantif 2014235ndash42

4 Baker M 1500scientists lift the lid on reproducibility Nature 2016533452ndash45 Sandve GK Nekrutenko A Taylor J Hovig E Ten simple rules for

reproducible computational research PLoS Comput Biol 2013910032856 Gruumlning B Chilton J Koumlster J Dale R Soranzo N van den Beek M

Goecks J Backofen R Nekrutenko A Taylor J Practical computationalreproducibility in the life sciences Cell Syst 20186631ndash5

7 Goodstadt L Ruffus A lightweight python library for computationalpipelines Bioinformatics 2010262778ndash9

8 Koumlster J Rahmann S Snakemake-a scalable bioinformatics workflowengine Bioinformatics 201228(19)2520ndash2

9 Fonseca Na Petryszak R Marioni J iRAP - an integrated RNA-seq AnalysisPipeline iRAP - an integrated RNA-seq Analysis Pipeline 20140ndash10httpsdoiorg101101005991

10 Wang Y Mehta G Mayani R Lu J Souaiaia T Chen Y Clark A Yoon HJWan L EvgrafovOV Knowles JA Deelman E Chen T RseqFlow Workflowsfor RNA-Seq data analysis Bioinformatics 201127(18)2598ndash600

11 Kalari KR Nair Aa Bhavsar JD O Brien DR Davila JI Bockol Ma Nie JTang X Baheti S Doughty JB Middha S Sicotte H Thompson AEAsmann YW Kocher J-Pa MAP-RSeq Mayo Analysis Pipeline for RNAsequencing BMC Bioinformatics 201415(1)224

12 Goecks J Nekrutenko A Taylor J Galaxy a comprehensive approach forsupporting accessible reproducible and transparent computationalresearch in the life sciences Genome Biol 20101186

13 Golosova O Henderson R Vaskin Y Gabrielian A Grekhov G NagarajanV Oler AJ Quintildeones M Hurt D Fursov M Huyen Y Unipro UGENE NGSpipelines and components for variant calling RNA-seq and ChIP-seq dataanalyses PeerJ 20142644

14 Berthold MR Cebron N Dill F Gabriel TR Koumltter T Meinl T Ohl P SiebC Thiel K Wiswedel B KNIME The Konstanz Information Miner InStudies in Classification Data Analysis and Knowledge Organization (GfKL2007) New York Springer 2007

15 Wolstencroft K Haines R Fellows D Williams A Withers D Owen SSoiland-Reyes S Dunlop I Nenadic A Fisher P Bhagat J Belhajjame KBacall F Hardisty A Nieva de la Hidalga A Balcazar Vargas MP Sufi SGoble C The Taverna workflow suite designing and executing workflowsof Web Services on the desktop web or in the cloud Nucleic acidsresearch 201341(Web Server issue)557ndash61

16 Guimera RV bcbio-nextgen Automated distributed next-gensequencing pipeline EMBnet J 20121730

17 Sadedin SP Pope B Oshlack A Bpipe a tool for running and managingbioinformatics pipelines Bioinformatics (Oxford Engl) 201228(11)1525ndash6

18 Di Tommaso P Chatzou M Floden EW Barja PP Palumbo ENotredame C Nextflow enables reproducible computational workflowsNat Biotechnol 201735316ndash9

19 Cingolani P Sladek R Blanchette M BigDataScript a scripting languagefor data pipelines Bioinformatics 201431(1)10ndash16

20 Ewels P Krueger F Kaumlller M Andrews S Cluster flow A user-friendlybioinformatics workflow tool F1000Research 201652824

21 Gafni E Luquette LJ Lancaster AK Hawkins JB Jung J-Y Souilmi Y WallDP Tonellato PJ Cosmos Python library for massively parallel workflowsBioinformatics 2014302956ndash8

22 Kaushik G Ivkovic S Simonovic J Tijanic N Davis-Dusenbery B Kural DRabix An open-source workflow executor supporting recomputabilityand interoperability of workflow descriptions Pac Symp Biocomput201722154ndash65

23 Yoo AB Jette MA Grondona M SLURM Simple Linux Utility for ResourceManagement In Feitelson D Rudolph L Schwiegelshohn U editorsJob Scheduling Strategies for Parallel Processing 9th InternationalWorkshop JSSPP 2003 Seattle WA USA June 24 2003 Revised PaperBerlin Springer 2003 p 44ndash60

24 Barski A Cuddapah S Cui K Roh T-Y Schones DE Wang Z Wei GChepelev I Zhao K High-resolution profiling of histone methylations inthe human genome Cell 2007129(4)823ndash37

25 Li H Durbin R Fastandaccurate long-read alignment with Burrows-Wheelertransform Bioinformatics (Oxford Engl) 201026(5)589ndash95

26 Langmead B Salzberg SL Fast gapped-read alignment with Bowtie 2Nat Methods 20129(4)357ndash9

27 Hoffmann S Otto C Kurtz S Sharma CM Khaitovich P Vogel J StadlerPF Hackermuumlller J Fast mapping of short sequences with mismatchesinsertions and deletions using index structures PLoS Comput Biol20095(9)1000502

28 Kim D Pertea G Trapnell C Pimentel H Kelley R Salzberg SL TopHat2accurate alignment of transcriptomes in the presence of insertionsdeletions and gene fusions Genome Biol 201314(4)36

29 Hoffmann S Otto C Doose G Tanzer A Langenberger D Christ S KunzM Holdt LM Teupser D Hackermuumlller J Stadler PF A multi-splitmapping algorithm for circular RNA splicing trans-splicing and fusiondetection Genome Biol 201415(2)34

30 Trapnell C Williams Ba Pertea G Mortazavi A Kwan G van Baren MJSalzberg SL Wold BJ Pachter L Transcript assembly and abundanceestimation from RNA-Seq reveals thousands of new transcripts andswitching among isoforms Nat Biotechnol 201128(5)511ndash5

31 Anders S Pyl PT Huber W HTSeq-A Python framework to work withhigh-throughput sequencing data Bioinformatics 201531(2)166ndash9

32 Zhang Y Liu T Meyer Ca Eeckhoute J Johnson DS Bernstein BENusbaum C Myers RM Brown M Li W Liu XS Model-based analysis ofChIP-Seq (MACS) Genome Biol 20089(9)137

33 Fidler F Gordon A Science is in a reproducibility crisis ndash how do weresolve it Conversation 2013 Available at httptheconversationcomscience-is-in-a-reproducibility-crisis-how-do-we-resolve-it-16998Accessed 11 Jun 2019

34 Lehrer J The truth wears off New Yorker 2010Dec 1352ndash735 Van Bavel J Why do so many studies fail to replicate N Y Times 201610

Available at httpswwwnytimescom20160529opinionsundaywhy-do-so-many-studies-fail-to-replicatehtml Accessed 11 Jun 2019

36 Fomel S Claerbout JF Guest Editorsrsquo Introduction ReproducibleResearch Comput Sci Eng 200911(1)5ndash7

37 Leisch F Sweave Dynamic generation of statistical reports using literatedata analysis In Haumlrdle W Roumlnz B editors Compstat 2002mdashProceedings in Computational Statistics Heidelberg Physica Verlag 2002p 575ndash80 ISBN 3-7908-1517-9 httpwwwstatuni-muenchendetexttildelowleischSweave

38 Xie Y Implementing Reproducible Computational Research In StoddenV Leisch F Peng RD editors Boca Raton Chapman and HallCRC 2014ISBN 978-1466561595 httpwwwcrcpresscomproductisbn9781466561595

39 Peacuterez F Granger BE IPython a system for interactive scientificcomputing Comput Sci Eng 20079(3)21ndash9

40 Afgan E Baker D Batut B van den Beek M Bouvier D Cech M ChiltonJ Clements D Coraor N Gruumlning BA Guerler A Hillman-Jackson JHiltemann S Jalili V Rasche H Soranzo N Goecks J Taylor JNekrutenko A Blankenberg D The galaxy platform for accessiblereproducible and collaborative biomedical analyses 2018 update NucleicAcids Res 201846537ndash44

Publisherrsquos NoteSpringer Nature remains neutral with regard to jurisdictional claims inpublished maps and institutional affiliations

  • Abstract
    • Background
    • Results
    • Conclusions
    • Keywords
      • Background
      • Implementation
        • Analysis as a directed acyclic graph
          • Plug-in architecture
            • Enforcing consistency and integrity
            • Maintaining reproducibility
            • Process flow
              • Results
                • Preparing genomic data for HTS analysis
                  • Transcriptome assembly from RNA-Seq data
                    • Identification of enriched regions from ChIP-Seq data
                      • Conclusions
                      • Availability and requirements
                      • Supplementary informationSupplementary information accompanies this paper at httpsdoiorg101186s12859-019-3219-1
                        • Additional file 1
                          • Abbreviations
                          • Acknowledgments
                          • Authors contributions
                          • Funding
                          • Availability of data and materials
                          • Ethics approval and consent to participate
                          • Consent for publication
                          • Competing interests
                          • Author details
                          • References
                          • Publishers Note
Page 6: SOFTWARE OpenAccess uap:reproducibleandrobustHTSdata …

Kaumlmpf et al BMC Bioinformatics (2019) 20664 Page 6 of 9

Fig 3 Sketch of the main principles uap is built on a An analysis with uap comprises three parts (i) the uap source code itself implemented inPython - it contains the complete framework of uap and 2 classes for the implementation of source and processing steps These classes are used towrap any tool that is part of the analysis enabling an easy extension of the uaprsquos repertoire of steps (ii) the uap configuration in YAML format Itcontains all necessary information to run and reproduce the analysis given the data and (iii) the uap results - organized in one folder per step inthe output directory The special folder temp contains the expected results until the computation of the step has finished successfully and keepsthe intermediate results and log files upon failure b The progress or state of an analysis can be monitored with a call to uap status Itdetermines the state of each individual step in dependence of the state(s) of its previous step(s) and provides this information to the user

like operating systems and can interact with batchqueuing systems like the SunOracleUniva grid engines(SGEOGEUGE) and SLURM [23] to submit analyses tohigh performance computing systems uap is distributedunder the GNU GPL v3 license and is publicly avail-able at httpsgithubcomyigbtuap Its documentationis hosted at httpuapreadthedocsorg A docker con-tainer with a core set of tools is available at httpshubdockercomryigbtuaptags

uap is distributed with predefined workflows for (i)genome sequence download and index generation forread mapping programs (ii) transcriptome sequencing(RNA-seq) data analysis and (iii) Chromatin Immuno-Precipitation DNA-Sequencing (ChIP-Seq) data analysisFurther we provide small test data sets enabling a quickstart for each of these workflows An additional exampleusing a larger data set is provided via code for download-ing and analyzing a publicly available ChIP-Seq data set

Kaumlmpf et al BMC Bioinformatics (2019) 20664 Page 7 of 9

(Barski et al [24]) The provided workflows are intendedto serve as an easy entry point into a uap analysis as wellas a template for similar analyses eg for other species orwith another set of tools

Preparing genomic data for HTS analysis An impor-tant prerequisite for HTS projects where a referencegenome is available is aligning (mapping) the sequencingreads to this genome Most mapping software requires aspecific data structure (index) of the genome to efficientlysolve this alignment problem Indices have to be gener-ated once prior to any mapping procedure We provideuap configuration files for the bwa [25] bowtie [26]and segemehl [27] mapping programs and samtools(fasta indexing) of the a) Mycoplasma genitaliumand b) Homo sapiens genome Genomic sequencescan be downloaded automatically prior to indexgeneration

Transcriptome assembly from RNA-Seq data Tran-scriptome sequencing identifies and estimates the quan-tity of RNA in biological samples Beyond quantification ofknown transcripts based on overlapping sequence readsRNA-seq allows the assembly of novel transcripts Weprovide a uap configuration file for combining split-readmapping with de novo transcript assembly uap readsthe sequencing data either from an Illumina sequencingrun folder or a set of fastq files applies quality con-trol removes adapter sequences and maps the reads toa genome using tophat2 [28] and segemehl [27 29]The mapped reads from tophat2 are directly processedby cufflinks [30] for de novo transcript assemblySplit-reads mapped with segemehl are prepared forcufflinks using an adapter script and then also pro-cessed with cufflinks The configuration also containsa step to determine the number of mapped reads pertranscript applying htseq-count [31]

Identification of enriched regions from ChIP-Seqdata ChIP sequencing is a method that integrates chro-matin immunoprecipitation (ChIP) and high-throughputDNA sequencing to identify sites of protein-DNA inter-actions The provided uap-configuration file for this taskinitially resembles the RNA-Seq workflow Here readsare expected to correspond to genomic DNA and themapping is done without considering split-reads usingbowtie2 [26] Mapped reads are subsequently sortedduplicates removed and enriched regions are detectedusing MACS2 [32]

ConclusionsThe critique on a lack of reproducibility in science anda growing awareness that many reported facts do not

seem to hold up to repeated investigation has mean-while reached a broader audience beyond the scien-tific community (eg [33ndash35]) In our opinion HTSdata analysis is particularly prone to consistency andreproducibility issues ndash especially due to the complex-ity of the analysis the involved data volume and thebroad range of available tools and their multitude ofparametersWorkflow management systems are indispensable for

reproducibly controlling more complex analyses like HTSdata analysis Published workflow management systemsfor HTS data analysis are either highly flexible and madefor experienced programmers or lack a lot of flexibil-ity but can be used intuitively and some are specific toa certain type of analysis In the introduction we listedfour minimal requirements that we consider essential forensuring reproducibility and consistency of HTS dataanalyses Additionally but beyond the programmatic con-trol of a of a WMS using versioned external data likegenome sequences or annotation is key for reproducibleanalysis None of the published systems we are awareof however completely satisfies these criteria uap hasbeen designed to fulfill these One critical requirement islinking analysis code and generated data While Repro-ducible Research in statistics [36] uses tools like Sweave[37] knitR [38] or Jupyter [39] to combine analysis codeand resulting data in one output file such a strategy isnot feasible for most steps of an HTS analysis due tothe size of generated data uap therefore relies on hash-ing over the complete sequence of commands includingparameters of a run and appending the key to the outputpath In addition uap performs logging and process mon-itoring supports different cluster management systemscreates recovery points plots execution graphs managesjob dependencies and is extensible to any kind of multi-step analysis It provides pre-built steps for the prepara-tion of genomic data and the analysis of RNA-Seq andChIP-Seq dataAmong the many different flavors of WMS uap is

clearly easier to operate for users with limited program-ming experience than systems based on domain specificprogramming languages while offering a lot more flexi-bility than single purpose tools Based on the comparisonof tools in Additional file 1 Table S1 the tools most ded-icated to reproducible research and providing the mostsimilar set of features compared to uap are Galaxy andNextflowIn our opinion Galaxy [40] and uap address dif-

ferent user groups and tasks and we use both WMSin our research environment Galaxy is great for pro-viding predefined workflows to users without experi-ence in programming and working on the commandline It allows these users to adapt parameters and exe-cute such workflows on their data For users working

Kaumlmpf et al BMC Bioinformatics (2019) 20664 Page 8 of 9

frequently with large HTS data sets adapting work-flows to a larger extent or duplicating branches of theDAG for performing variants of the analysis in paral-lel is much more efficiently performed in uap Galaxydoes not link data and code in the sense of uap Butas any change to parameters in a Galaxy workflowtriggers re-execution of the sub-workflow below this isnot necessary If many changes throughout a workflowhave to be made this behavior of Galaxy may be hin-dering Obviously running HTS analyses on Galaxyrequires a Galaxy server integrated with an HPC envi-ronment which is not trivial to set up and demandscontinuous maintenance Starting from scratch settingup HTS analysis is significantly less effort using uapthan GalaxyNextflow is a powerful WMS dedicated to scalability

and reproducibility [18] Its approach to reproducibil-ity relies on a tight integration with github and thesupport of scalable containerization of pipelines usingeg Docker Nextflow and uap share several con-cepts eg using temporary files for intermediate resultsor analyzing the workflow DAG to enable failing fastIntermediate results of an HTS analysis can consumelarge amounts of storage space uap therefore providesa means for volatilization of intermediate results with-out breaking dependencies in the DAG - a feature whichdoes not seem to be available in Nextflow Loggingis somewhat limited in Nextflow compared to uapbut Nextflow provides a broader support for HPCenvironments including also support for cloud comput-ing Nextflowrsquos approach to reproducibility is pow-erful when software from Github is used as it enablesthe user to request a specific commit or when thetools used are publicly available as a container How-ever when a tool is run in rsquonative task supportrsquo likethe Kallisto example provided in [18] uap is morestringent in logging version and the full set (includingdefault) of parameters Where uap and Nextflow dif-fer most clearly regarding reproducibility is linking dataand code as to our understanding based on publica-tions and the online documentation this is not availablein NextflowIn summary we are convinced that reproducible

research principles need to be advanced for HTS dataanalysis and that uap is a highly useful system for facingthis challenge

Availability and requirementsProject name uapProject home page httpsgithubcomyigbtuapOperating system(s) LinuxProgramming language PythonOther requirements virtualenv git and graphviz

License GNU GPL v3Any restrictions to use by non-academics None

Supplementary informationSupplementary information accompanies this paper athttpsdoiorg101186s12859-019-3219-1

Additional file 1 This file includes supplemental tables and figures

AbbreviationsChIP-seq Chromatin immuno-precipitation dna-sequencing DAG Directedacyclic graph HTS High throughput sequencing HPC High performancecomputing RNA-seq Transcriptome sequencing WMS Workflowmanagement system

AcknowledgmentsNot applicable

Authorsrsquo contributionsKR MS and JH initially designed and implemented the software SHP CK ASKR JS and JH took over the further development of the software GDimplemented the script that formats segemehl-output to be compatiblewith the cufflinks suite of tools CK and AS provided the dockercontainers of uap CK KR JS and JH wrote the manuscript All authors readcorrected and approved the final manuscript

FundingThis work was supported in part by the Initiative and Networking Fund of theHelmholtz Association through a grant to JH (VH-NG738) by the DeutscheForschungsgemeinschaft (DFG) through a grant to JH (SPP1738) by theFraunhofer Future Fund and through the LIFE Leipzig Research Center forCivilization Diseases The funding bodies did not have any influence on thedesign of the study nor on collection analysis or interpretation of data nor onwriting the manuscript

Availability of data andmaterialsSource code is available at httpsgithubcomyigbtuap links to Dockercontainers and documentation are provided at httpswwwufzdeindexphpen=44919 All datasets used within the example workflows are publiclyavailable and are fully referenced within the workflow configuration files

Ethics approval and consent to participateNot applicable

Consent for publicationNot applicable

Competing interestsThe authors declare that they have no competing interests KR is member ofthe editorial board (Associate Editor) of BMC Bioinformatics

Author details1Young Investigators Group Bioinformatics and Transcriptomics DepartmentMolecular Systems Biology Helmholtz Center for Environmental Research ndashUFZ Permoserstraszlige 15 04318 Leipzig Germany 2Bioinformatics DepartmentUniversitaumlt Leipzig Haumlrtelstraszlige 16-18 04107 Leipzig Germany3Bioinformatics Unit Department of Diagnostics Fraunhofer Institute for CellTherapy and Immunology Perlickstraszlige 1 04103 Leipzig Germany 4Presentaddress Department of Diagnostics Fraunhofer Institute for Cell Therapy andImmunology Perlickstraszlige 1 04103 Leipzig Germany 5Present address ecSeqBioinformatics GmbH Sternwartenstraszlige 29 04103 Leipzig Germany

Received 11 June 2019 Accepted 12 November 2019

References1 Nekrutenko A Taylor J Next-generation sequencing data interpretation

enhancing reproducibility and accessibility Nat Rev Genet 201213667ndash72

Kaumlmpf et al BMC Bioinformatics (2019) 20664 Page 9 of 9

2 Peng RD Reproducible research in computational science Sci N Y20113341226ndash7

3 Bustin SA The reproducibility of biomedical research Sleepers awakeBiomol Detect Quantif 2014235ndash42

4 Baker M 1500scientists lift the lid on reproducibility Nature 2016533452ndash45 Sandve GK Nekrutenko A Taylor J Hovig E Ten simple rules for

reproducible computational research PLoS Comput Biol 2013910032856 Gruumlning B Chilton J Koumlster J Dale R Soranzo N van den Beek M

Goecks J Backofen R Nekrutenko A Taylor J Practical computationalreproducibility in the life sciences Cell Syst 20186631ndash5

7 Goodstadt L Ruffus A lightweight python library for computationalpipelines Bioinformatics 2010262778ndash9

8 Koumlster J Rahmann S Snakemake-a scalable bioinformatics workflowengine Bioinformatics 201228(19)2520ndash2

9 Fonseca Na Petryszak R Marioni J iRAP - an integrated RNA-seq AnalysisPipeline iRAP - an integrated RNA-seq Analysis Pipeline 20140ndash10httpsdoiorg101101005991

10 Wang Y Mehta G Mayani R Lu J Souaiaia T Chen Y Clark A Yoon HJWan L EvgrafovOV Knowles JA Deelman E Chen T RseqFlow Workflowsfor RNA-Seq data analysis Bioinformatics 201127(18)2598ndash600

11 Kalari KR Nair Aa Bhavsar JD O Brien DR Davila JI Bockol Ma Nie JTang X Baheti S Doughty JB Middha S Sicotte H Thompson AEAsmann YW Kocher J-Pa MAP-RSeq Mayo Analysis Pipeline for RNAsequencing BMC Bioinformatics 201415(1)224

12 Goecks J Nekrutenko A Taylor J Galaxy a comprehensive approach forsupporting accessible reproducible and transparent computationalresearch in the life sciences Genome Biol 20101186

13 Golosova O Henderson R Vaskin Y Gabrielian A Grekhov G NagarajanV Oler AJ Quintildeones M Hurt D Fursov M Huyen Y Unipro UGENE NGSpipelines and components for variant calling RNA-seq and ChIP-seq dataanalyses PeerJ 20142644

14 Berthold MR Cebron N Dill F Gabriel TR Koumltter T Meinl T Ohl P SiebC Thiel K Wiswedel B KNIME The Konstanz Information Miner InStudies in Classification Data Analysis and Knowledge Organization (GfKL2007) New York Springer 2007

15 Wolstencroft K Haines R Fellows D Williams A Withers D Owen SSoiland-Reyes S Dunlop I Nenadic A Fisher P Bhagat J Belhajjame KBacall F Hardisty A Nieva de la Hidalga A Balcazar Vargas MP Sufi SGoble C The Taverna workflow suite designing and executing workflowsof Web Services on the desktop web or in the cloud Nucleic acidsresearch 201341(Web Server issue)557ndash61

16 Guimera RV bcbio-nextgen Automated distributed next-gensequencing pipeline EMBnet J 20121730

17 Sadedin SP Pope B Oshlack A Bpipe a tool for running and managingbioinformatics pipelines Bioinformatics (Oxford Engl) 201228(11)1525ndash6

18 Di Tommaso P Chatzou M Floden EW Barja PP Palumbo ENotredame C Nextflow enables reproducible computational workflowsNat Biotechnol 201735316ndash9

19 Cingolani P Sladek R Blanchette M BigDataScript a scripting languagefor data pipelines Bioinformatics 201431(1)10ndash16

20 Ewels P Krueger F Kaumlller M Andrews S Cluster flow A user-friendlybioinformatics workflow tool F1000Research 201652824

21 Gafni E Luquette LJ Lancaster AK Hawkins JB Jung J-Y Souilmi Y WallDP Tonellato PJ Cosmos Python library for massively parallel workflowsBioinformatics 2014302956ndash8

22 Kaushik G Ivkovic S Simonovic J Tijanic N Davis-Dusenbery B Kural DRabix An open-source workflow executor supporting recomputabilityand interoperability of workflow descriptions Pac Symp Biocomput201722154ndash65

23 Yoo AB Jette MA Grondona M SLURM Simple Linux Utility for ResourceManagement In Feitelson D Rudolph L Schwiegelshohn U editorsJob Scheduling Strategies for Parallel Processing 9th InternationalWorkshop JSSPP 2003 Seattle WA USA June 24 2003 Revised PaperBerlin Springer 2003 p 44ndash60

24 Barski A Cuddapah S Cui K Roh T-Y Schones DE Wang Z Wei GChepelev I Zhao K High-resolution profiling of histone methylations inthe human genome Cell 2007129(4)823ndash37

25 Li H Durbin R Fastandaccurate long-read alignment with Burrows-Wheelertransform Bioinformatics (Oxford Engl) 201026(5)589ndash95

26 Langmead B Salzberg SL Fast gapped-read alignment with Bowtie 2Nat Methods 20129(4)357ndash9

27 Hoffmann S Otto C Kurtz S Sharma CM Khaitovich P Vogel J StadlerPF Hackermuumlller J Fast mapping of short sequences with mismatchesinsertions and deletions using index structures PLoS Comput Biol20095(9)1000502

28 Kim D Pertea G Trapnell C Pimentel H Kelley R Salzberg SL TopHat2accurate alignment of transcriptomes in the presence of insertionsdeletions and gene fusions Genome Biol 201314(4)36

29 Hoffmann S Otto C Doose G Tanzer A Langenberger D Christ S KunzM Holdt LM Teupser D Hackermuumlller J Stadler PF A multi-splitmapping algorithm for circular RNA splicing trans-splicing and fusiondetection Genome Biol 201415(2)34

30 Trapnell C Williams Ba Pertea G Mortazavi A Kwan G van Baren MJSalzberg SL Wold BJ Pachter L Transcript assembly and abundanceestimation from RNA-Seq reveals thousands of new transcripts andswitching among isoforms Nat Biotechnol 201128(5)511ndash5

31 Anders S Pyl PT Huber W HTSeq-A Python framework to work withhigh-throughput sequencing data Bioinformatics 201531(2)166ndash9

32 Zhang Y Liu T Meyer Ca Eeckhoute J Johnson DS Bernstein BENusbaum C Myers RM Brown M Li W Liu XS Model-based analysis ofChIP-Seq (MACS) Genome Biol 20089(9)137

33 Fidler F Gordon A Science is in a reproducibility crisis ndash how do weresolve it Conversation 2013 Available at httptheconversationcomscience-is-in-a-reproducibility-crisis-how-do-we-resolve-it-16998Accessed 11 Jun 2019

34 Lehrer J The truth wears off New Yorker 2010Dec 1352ndash735 Van Bavel J Why do so many studies fail to replicate N Y Times 201610

Available at httpswwwnytimescom20160529opinionsundaywhy-do-so-many-studies-fail-to-replicatehtml Accessed 11 Jun 2019

36 Fomel S Claerbout JF Guest Editorsrsquo Introduction ReproducibleResearch Comput Sci Eng 200911(1)5ndash7

37 Leisch F Sweave Dynamic generation of statistical reports using literatedata analysis In Haumlrdle W Roumlnz B editors Compstat 2002mdashProceedings in Computational Statistics Heidelberg Physica Verlag 2002p 575ndash80 ISBN 3-7908-1517-9 httpwwwstatuni-muenchendetexttildelowleischSweave

38 Xie Y Implementing Reproducible Computational Research In StoddenV Leisch F Peng RD editors Boca Raton Chapman and HallCRC 2014ISBN 978-1466561595 httpwwwcrcpresscomproductisbn9781466561595

39 Peacuterez F Granger BE IPython a system for interactive scientificcomputing Comput Sci Eng 20079(3)21ndash9

40 Afgan E Baker D Batut B van den Beek M Bouvier D Cech M ChiltonJ Clements D Coraor N Gruumlning BA Guerler A Hillman-Jackson JHiltemann S Jalili V Rasche H Soranzo N Goecks J Taylor JNekrutenko A Blankenberg D The galaxy platform for accessiblereproducible and collaborative biomedical analyses 2018 update NucleicAcids Res 201846537ndash44

Publisherrsquos NoteSpringer Nature remains neutral with regard to jurisdictional claims inpublished maps and institutional affiliations

  • Abstract
    • Background
    • Results
    • Conclusions
    • Keywords
      • Background
      • Implementation
        • Analysis as a directed acyclic graph
          • Plug-in architecture
            • Enforcing consistency and integrity
            • Maintaining reproducibility
            • Process flow
              • Results
                • Preparing genomic data for HTS analysis
                  • Transcriptome assembly from RNA-Seq data
                    • Identification of enriched regions from ChIP-Seq data
                      • Conclusions
                      • Availability and requirements
                      • Supplementary informationSupplementary information accompanies this paper at httpsdoiorg101186s12859-019-3219-1
                        • Additional file 1
                          • Abbreviations
                          • Acknowledgments
                          • Authors contributions
                          • Funding
                          • Availability of data and materials
                          • Ethics approval and consent to participate
                          • Consent for publication
                          • Competing interests
                          • Author details
                          • References
                          • Publishers Note
Page 7: SOFTWARE OpenAccess uap:reproducibleandrobustHTSdata …

Kaumlmpf et al BMC Bioinformatics (2019) 20664 Page 7 of 9

(Barski et al [24]) The provided workflows are intendedto serve as an easy entry point into a uap analysis as wellas a template for similar analyses eg for other species orwith another set of tools

Preparing genomic data for HTS analysis An impor-tant prerequisite for HTS projects where a referencegenome is available is aligning (mapping) the sequencingreads to this genome Most mapping software requires aspecific data structure (index) of the genome to efficientlysolve this alignment problem Indices have to be gener-ated once prior to any mapping procedure We provideuap configuration files for the bwa [25] bowtie [26]and segemehl [27] mapping programs and samtools(fasta indexing) of the a) Mycoplasma genitaliumand b) Homo sapiens genome Genomic sequencescan be downloaded automatically prior to indexgeneration

Transcriptome assembly from RNA-Seq data Tran-scriptome sequencing identifies and estimates the quan-tity of RNA in biological samples Beyond quantification ofknown transcripts based on overlapping sequence readsRNA-seq allows the assembly of novel transcripts Weprovide a uap configuration file for combining split-readmapping with de novo transcript assembly uap readsthe sequencing data either from an Illumina sequencingrun folder or a set of fastq files applies quality con-trol removes adapter sequences and maps the reads toa genome using tophat2 [28] and segemehl [27 29]The mapped reads from tophat2 are directly processedby cufflinks [30] for de novo transcript assemblySplit-reads mapped with segemehl are prepared forcufflinks using an adapter script and then also pro-cessed with cufflinks The configuration also containsa step to determine the number of mapped reads pertranscript applying htseq-count [31]

Identification of enriched regions from ChIP-Seqdata ChIP sequencing is a method that integrates chro-matin immunoprecipitation (ChIP) and high-throughputDNA sequencing to identify sites of protein-DNA inter-actions The provided uap-configuration file for this taskinitially resembles the RNA-Seq workflow Here readsare expected to correspond to genomic DNA and themapping is done without considering split-reads usingbowtie2 [26] Mapped reads are subsequently sortedduplicates removed and enriched regions are detectedusing MACS2 [32]

ConclusionsThe critique on a lack of reproducibility in science anda growing awareness that many reported facts do not

seem to hold up to repeated investigation has mean-while reached a broader audience beyond the scien-tific community (eg [33ndash35]) In our opinion HTSdata analysis is particularly prone to consistency andreproducibility issues ndash especially due to the complex-ity of the analysis the involved data volume and thebroad range of available tools and their multitude ofparametersWorkflow management systems are indispensable for

reproducibly controlling more complex analyses like HTSdata analysis Published workflow management systemsfor HTS data analysis are either highly flexible and madefor experienced programmers or lack a lot of flexibil-ity but can be used intuitively and some are specific toa certain type of analysis In the introduction we listedfour minimal requirements that we consider essential forensuring reproducibility and consistency of HTS dataanalyses Additionally but beyond the programmatic con-trol of a of a WMS using versioned external data likegenome sequences or annotation is key for reproducibleanalysis None of the published systems we are awareof however completely satisfies these criteria uap hasbeen designed to fulfill these One critical requirement islinking analysis code and generated data While Repro-ducible Research in statistics [36] uses tools like Sweave[37] knitR [38] or Jupyter [39] to combine analysis codeand resulting data in one output file such a strategy isnot feasible for most steps of an HTS analysis due tothe size of generated data uap therefore relies on hash-ing over the complete sequence of commands includingparameters of a run and appending the key to the outputpath In addition uap performs logging and process mon-itoring supports different cluster management systemscreates recovery points plots execution graphs managesjob dependencies and is extensible to any kind of multi-step analysis It provides pre-built steps for the prepara-tion of genomic data and the analysis of RNA-Seq andChIP-Seq dataAmong the many different flavors of WMS uap is

clearly easier to operate for users with limited program-ming experience than systems based on domain specificprogramming languages while offering a lot more flexi-bility than single purpose tools Based on the comparisonof tools in Additional file 1 Table S1 the tools most ded-icated to reproducible research and providing the mostsimilar set of features compared to uap are Galaxy andNextflowIn our opinion Galaxy [40] and uap address dif-

ferent user groups and tasks and we use both WMSin our research environment Galaxy is great for pro-viding predefined workflows to users without experi-ence in programming and working on the commandline It allows these users to adapt parameters and exe-cute such workflows on their data For users working

Kaumlmpf et al BMC Bioinformatics (2019) 20664 Page 8 of 9

frequently with large HTS data sets adapting work-flows to a larger extent or duplicating branches of theDAG for performing variants of the analysis in paral-lel is much more efficiently performed in uap Galaxydoes not link data and code in the sense of uap Butas any change to parameters in a Galaxy workflowtriggers re-execution of the sub-workflow below this isnot necessary If many changes throughout a workflowhave to be made this behavior of Galaxy may be hin-dering Obviously running HTS analyses on Galaxyrequires a Galaxy server integrated with an HPC envi-ronment which is not trivial to set up and demandscontinuous maintenance Starting from scratch settingup HTS analysis is significantly less effort using uapthan GalaxyNextflow is a powerful WMS dedicated to scalability

and reproducibility [18] Its approach to reproducibil-ity relies on a tight integration with github and thesupport of scalable containerization of pipelines usingeg Docker Nextflow and uap share several con-cepts eg using temporary files for intermediate resultsor analyzing the workflow DAG to enable failing fastIntermediate results of an HTS analysis can consumelarge amounts of storage space uap therefore providesa means for volatilization of intermediate results with-out breaking dependencies in the DAG - a feature whichdoes not seem to be available in Nextflow Loggingis somewhat limited in Nextflow compared to uapbut Nextflow provides a broader support for HPCenvironments including also support for cloud comput-ing Nextflowrsquos approach to reproducibility is pow-erful when software from Github is used as it enablesthe user to request a specific commit or when thetools used are publicly available as a container How-ever when a tool is run in rsquonative task supportrsquo likethe Kallisto example provided in [18] uap is morestringent in logging version and the full set (includingdefault) of parameters Where uap and Nextflow dif-fer most clearly regarding reproducibility is linking dataand code as to our understanding based on publica-tions and the online documentation this is not availablein NextflowIn summary we are convinced that reproducible

research principles need to be advanced for HTS dataanalysis and that uap is a highly useful system for facingthis challenge

Availability and requirementsProject name uapProject home page httpsgithubcomyigbtuapOperating system(s) LinuxProgramming language PythonOther requirements virtualenv git and graphviz

License GNU GPL v3Any restrictions to use by non-academics None

Supplementary informationSupplementary information accompanies this paper athttpsdoiorg101186s12859-019-3219-1

Additional file 1 This file includes supplemental tables and figures

AbbreviationsChIP-seq Chromatin immuno-precipitation dna-sequencing DAG Directedacyclic graph HTS High throughput sequencing HPC High performancecomputing RNA-seq Transcriptome sequencing WMS Workflowmanagement system

AcknowledgmentsNot applicable

Authorsrsquo contributionsKR MS and JH initially designed and implemented the software SHP CK ASKR JS and JH took over the further development of the software GDimplemented the script that formats segemehl-output to be compatiblewith the cufflinks suite of tools CK and AS provided the dockercontainers of uap CK KR JS and JH wrote the manuscript All authors readcorrected and approved the final manuscript

FundingThis work was supported in part by the Initiative and Networking Fund of theHelmholtz Association through a grant to JH (VH-NG738) by the DeutscheForschungsgemeinschaft (DFG) through a grant to JH (SPP1738) by theFraunhofer Future Fund and through the LIFE Leipzig Research Center forCivilization Diseases The funding bodies did not have any influence on thedesign of the study nor on collection analysis or interpretation of data nor onwriting the manuscript

Availability of data andmaterialsSource code is available at httpsgithubcomyigbtuap links to Dockercontainers and documentation are provided at httpswwwufzdeindexphpen=44919 All datasets used within the example workflows are publiclyavailable and are fully referenced within the workflow configuration files

Ethics approval and consent to participateNot applicable

Consent for publicationNot applicable

Competing interestsThe authors declare that they have no competing interests KR is member ofthe editorial board (Associate Editor) of BMC Bioinformatics

Author details1Young Investigators Group Bioinformatics and Transcriptomics DepartmentMolecular Systems Biology Helmholtz Center for Environmental Research ndashUFZ Permoserstraszlige 15 04318 Leipzig Germany 2Bioinformatics DepartmentUniversitaumlt Leipzig Haumlrtelstraszlige 16-18 04107 Leipzig Germany3Bioinformatics Unit Department of Diagnostics Fraunhofer Institute for CellTherapy and Immunology Perlickstraszlige 1 04103 Leipzig Germany 4Presentaddress Department of Diagnostics Fraunhofer Institute for Cell Therapy andImmunology Perlickstraszlige 1 04103 Leipzig Germany 5Present address ecSeqBioinformatics GmbH Sternwartenstraszlige 29 04103 Leipzig Germany

Received 11 June 2019 Accepted 12 November 2019

References1 Nekrutenko A Taylor J Next-generation sequencing data interpretation

enhancing reproducibility and accessibility Nat Rev Genet 201213667ndash72

Kaumlmpf et al BMC Bioinformatics (2019) 20664 Page 9 of 9

2 Peng RD Reproducible research in computational science Sci N Y20113341226ndash7

3 Bustin SA The reproducibility of biomedical research Sleepers awakeBiomol Detect Quantif 2014235ndash42

4 Baker M 1500scientists lift the lid on reproducibility Nature 2016533452ndash45 Sandve GK Nekrutenko A Taylor J Hovig E Ten simple rules for

reproducible computational research PLoS Comput Biol 2013910032856 Gruumlning B Chilton J Koumlster J Dale R Soranzo N van den Beek M

Goecks J Backofen R Nekrutenko A Taylor J Practical computationalreproducibility in the life sciences Cell Syst 20186631ndash5

7 Goodstadt L Ruffus A lightweight python library for computationalpipelines Bioinformatics 2010262778ndash9

8 Koumlster J Rahmann S Snakemake-a scalable bioinformatics workflowengine Bioinformatics 201228(19)2520ndash2

9 Fonseca Na Petryszak R Marioni J iRAP - an integrated RNA-seq AnalysisPipeline iRAP - an integrated RNA-seq Analysis Pipeline 20140ndash10httpsdoiorg101101005991

10 Wang Y Mehta G Mayani R Lu J Souaiaia T Chen Y Clark A Yoon HJWan L EvgrafovOV Knowles JA Deelman E Chen T RseqFlow Workflowsfor RNA-Seq data analysis Bioinformatics 201127(18)2598ndash600

11 Kalari KR Nair Aa Bhavsar JD O Brien DR Davila JI Bockol Ma Nie JTang X Baheti S Doughty JB Middha S Sicotte H Thompson AEAsmann YW Kocher J-Pa MAP-RSeq Mayo Analysis Pipeline for RNAsequencing BMC Bioinformatics 201415(1)224

12 Goecks J Nekrutenko A Taylor J Galaxy a comprehensive approach forsupporting accessible reproducible and transparent computationalresearch in the life sciences Genome Biol 20101186

13 Golosova O Henderson R Vaskin Y Gabrielian A Grekhov G NagarajanV Oler AJ Quintildeones M Hurt D Fursov M Huyen Y Unipro UGENE NGSpipelines and components for variant calling RNA-seq and ChIP-seq dataanalyses PeerJ 20142644

14 Berthold MR Cebron N Dill F Gabriel TR Koumltter T Meinl T Ohl P SiebC Thiel K Wiswedel B KNIME The Konstanz Information Miner InStudies in Classification Data Analysis and Knowledge Organization (GfKL2007) New York Springer 2007

15 Wolstencroft K Haines R Fellows D Williams A Withers D Owen SSoiland-Reyes S Dunlop I Nenadic A Fisher P Bhagat J Belhajjame KBacall F Hardisty A Nieva de la Hidalga A Balcazar Vargas MP Sufi SGoble C The Taverna workflow suite designing and executing workflowsof Web Services on the desktop web or in the cloud Nucleic acidsresearch 201341(Web Server issue)557ndash61

16 Guimera RV bcbio-nextgen Automated distributed next-gensequencing pipeline EMBnet J 20121730

17 Sadedin SP Pope B Oshlack A Bpipe a tool for running and managingbioinformatics pipelines Bioinformatics (Oxford Engl) 201228(11)1525ndash6

18 Di Tommaso P Chatzou M Floden EW Barja PP Palumbo ENotredame C Nextflow enables reproducible computational workflowsNat Biotechnol 201735316ndash9

19 Cingolani P Sladek R Blanchette M BigDataScript a scripting languagefor data pipelines Bioinformatics 201431(1)10ndash16

20 Ewels P Krueger F Kaumlller M Andrews S Cluster flow A user-friendlybioinformatics workflow tool F1000Research 201652824

21 Gafni E Luquette LJ Lancaster AK Hawkins JB Jung J-Y Souilmi Y WallDP Tonellato PJ Cosmos Python library for massively parallel workflowsBioinformatics 2014302956ndash8

22 Kaushik G Ivkovic S Simonovic J Tijanic N Davis-Dusenbery B Kural DRabix An open-source workflow executor supporting recomputabilityand interoperability of workflow descriptions Pac Symp Biocomput201722154ndash65

23 Yoo AB Jette MA Grondona M SLURM Simple Linux Utility for ResourceManagement In Feitelson D Rudolph L Schwiegelshohn U editorsJob Scheduling Strategies for Parallel Processing 9th InternationalWorkshop JSSPP 2003 Seattle WA USA June 24 2003 Revised PaperBerlin Springer 2003 p 44ndash60

24 Barski A Cuddapah S Cui K Roh T-Y Schones DE Wang Z Wei GChepelev I Zhao K High-resolution profiling of histone methylations inthe human genome Cell 2007129(4)823ndash37

25 Li H Durbin R Fastandaccurate long-read alignment with Burrows-Wheelertransform Bioinformatics (Oxford Engl) 201026(5)589ndash95

26 Langmead B Salzberg SL Fast gapped-read alignment with Bowtie 2Nat Methods 20129(4)357ndash9

27 Hoffmann S Otto C Kurtz S Sharma CM Khaitovich P Vogel J StadlerPF Hackermuumlller J Fast mapping of short sequences with mismatchesinsertions and deletions using index structures PLoS Comput Biol20095(9)1000502

28 Kim D Pertea G Trapnell C Pimentel H Kelley R Salzberg SL TopHat2accurate alignment of transcriptomes in the presence of insertionsdeletions and gene fusions Genome Biol 201314(4)36

29 Hoffmann S Otto C Doose G Tanzer A Langenberger D Christ S KunzM Holdt LM Teupser D Hackermuumlller J Stadler PF A multi-splitmapping algorithm for circular RNA splicing trans-splicing and fusiondetection Genome Biol 201415(2)34

30 Trapnell C Williams Ba Pertea G Mortazavi A Kwan G van Baren MJSalzberg SL Wold BJ Pachter L Transcript assembly and abundanceestimation from RNA-Seq reveals thousands of new transcripts andswitching among isoforms Nat Biotechnol 201128(5)511ndash5

31 Anders S Pyl PT Huber W HTSeq-A Python framework to work withhigh-throughput sequencing data Bioinformatics 201531(2)166ndash9

32 Zhang Y Liu T Meyer Ca Eeckhoute J Johnson DS Bernstein BENusbaum C Myers RM Brown M Li W Liu XS Model-based analysis ofChIP-Seq (MACS) Genome Biol 20089(9)137

33 Fidler F Gordon A Science is in a reproducibility crisis ndash how do weresolve it Conversation 2013 Available at httptheconversationcomscience-is-in-a-reproducibility-crisis-how-do-we-resolve-it-16998Accessed 11 Jun 2019

34 Lehrer J The truth wears off New Yorker 2010Dec 1352ndash735 Van Bavel J Why do so many studies fail to replicate N Y Times 201610

Available at httpswwwnytimescom20160529opinionsundaywhy-do-so-many-studies-fail-to-replicatehtml Accessed 11 Jun 2019

36 Fomel S Claerbout JF Guest Editorsrsquo Introduction ReproducibleResearch Comput Sci Eng 200911(1)5ndash7

37 Leisch F Sweave Dynamic generation of statistical reports using literatedata analysis In Haumlrdle W Roumlnz B editors Compstat 2002mdashProceedings in Computational Statistics Heidelberg Physica Verlag 2002p 575ndash80 ISBN 3-7908-1517-9 httpwwwstatuni-muenchendetexttildelowleischSweave

38 Xie Y Implementing Reproducible Computational Research In StoddenV Leisch F Peng RD editors Boca Raton Chapman and HallCRC 2014ISBN 978-1466561595 httpwwwcrcpresscomproductisbn9781466561595

39 Peacuterez F Granger BE IPython a system for interactive scientificcomputing Comput Sci Eng 20079(3)21ndash9

40 Afgan E Baker D Batut B van den Beek M Bouvier D Cech M ChiltonJ Clements D Coraor N Gruumlning BA Guerler A Hillman-Jackson JHiltemann S Jalili V Rasche H Soranzo N Goecks J Taylor JNekrutenko A Blankenberg D The galaxy platform for accessiblereproducible and collaborative biomedical analyses 2018 update NucleicAcids Res 201846537ndash44

Publisherrsquos NoteSpringer Nature remains neutral with regard to jurisdictional claims inpublished maps and institutional affiliations

  • Abstract
    • Background
    • Results
    • Conclusions
    • Keywords
      • Background
      • Implementation
        • Analysis as a directed acyclic graph
          • Plug-in architecture
            • Enforcing consistency and integrity
            • Maintaining reproducibility
            • Process flow
              • Results
                • Preparing genomic data for HTS analysis
                  • Transcriptome assembly from RNA-Seq data
                    • Identification of enriched regions from ChIP-Seq data
                      • Conclusions
                      • Availability and requirements
                      • Supplementary informationSupplementary information accompanies this paper at httpsdoiorg101186s12859-019-3219-1
                        • Additional file 1
                          • Abbreviations
                          • Acknowledgments
                          • Authors contributions
                          • Funding
                          • Availability of data and materials
                          • Ethics approval and consent to participate
                          • Consent for publication
                          • Competing interests
                          • Author details
                          • References
                          • Publishers Note
Page 8: SOFTWARE OpenAccess uap:reproducibleandrobustHTSdata …

Kaumlmpf et al BMC Bioinformatics (2019) 20664 Page 8 of 9

frequently with large HTS data sets adapting work-flows to a larger extent or duplicating branches of theDAG for performing variants of the analysis in paral-lel is much more efficiently performed in uap Galaxydoes not link data and code in the sense of uap Butas any change to parameters in a Galaxy workflowtriggers re-execution of the sub-workflow below this isnot necessary If many changes throughout a workflowhave to be made this behavior of Galaxy may be hin-dering Obviously running HTS analyses on Galaxyrequires a Galaxy server integrated with an HPC envi-ronment which is not trivial to set up and demandscontinuous maintenance Starting from scratch settingup HTS analysis is significantly less effort using uapthan GalaxyNextflow is a powerful WMS dedicated to scalability

and reproducibility [18] Its approach to reproducibil-ity relies on a tight integration with github and thesupport of scalable containerization of pipelines usingeg Docker Nextflow and uap share several con-cepts eg using temporary files for intermediate resultsor analyzing the workflow DAG to enable failing fastIntermediate results of an HTS analysis can consumelarge amounts of storage space uap therefore providesa means for volatilization of intermediate results with-out breaking dependencies in the DAG - a feature whichdoes not seem to be available in Nextflow Loggingis somewhat limited in Nextflow compared to uapbut Nextflow provides a broader support for HPCenvironments including also support for cloud comput-ing Nextflowrsquos approach to reproducibility is pow-erful when software from Github is used as it enablesthe user to request a specific commit or when thetools used are publicly available as a container How-ever when a tool is run in rsquonative task supportrsquo likethe Kallisto example provided in [18] uap is morestringent in logging version and the full set (includingdefault) of parameters Where uap and Nextflow dif-fer most clearly regarding reproducibility is linking dataand code as to our understanding based on publica-tions and the online documentation this is not availablein NextflowIn summary we are convinced that reproducible

research principles need to be advanced for HTS dataanalysis and that uap is a highly useful system for facingthis challenge

Availability and requirementsProject name uapProject home page httpsgithubcomyigbtuapOperating system(s) LinuxProgramming language PythonOther requirements virtualenv git and graphviz

License GNU GPL v3Any restrictions to use by non-academics None

Supplementary informationSupplementary information accompanies this paper athttpsdoiorg101186s12859-019-3219-1

Additional file 1 This file includes supplemental tables and figures

AbbreviationsChIP-seq Chromatin immuno-precipitation dna-sequencing DAG Directedacyclic graph HTS High throughput sequencing HPC High performancecomputing RNA-seq Transcriptome sequencing WMS Workflowmanagement system

AcknowledgmentsNot applicable

Authorsrsquo contributionsKR MS and JH initially designed and implemented the software SHP CK ASKR JS and JH took over the further development of the software GDimplemented the script that formats segemehl-output to be compatiblewith the cufflinks suite of tools CK and AS provided the dockercontainers of uap CK KR JS and JH wrote the manuscript All authors readcorrected and approved the final manuscript

FundingThis work was supported in part by the Initiative and Networking Fund of theHelmholtz Association through a grant to JH (VH-NG738) by the DeutscheForschungsgemeinschaft (DFG) through a grant to JH (SPP1738) by theFraunhofer Future Fund and through the LIFE Leipzig Research Center forCivilization Diseases The funding bodies did not have any influence on thedesign of the study nor on collection analysis or interpretation of data nor onwriting the manuscript

Availability of data andmaterialsSource code is available at httpsgithubcomyigbtuap links to Dockercontainers and documentation are provided at httpswwwufzdeindexphpen=44919 All datasets used within the example workflows are publiclyavailable and are fully referenced within the workflow configuration files

Ethics approval and consent to participateNot applicable

Consent for publicationNot applicable

Competing interestsThe authors declare that they have no competing interests KR is member ofthe editorial board (Associate Editor) of BMC Bioinformatics

Author details1Young Investigators Group Bioinformatics and Transcriptomics DepartmentMolecular Systems Biology Helmholtz Center for Environmental Research ndashUFZ Permoserstraszlige 15 04318 Leipzig Germany 2Bioinformatics DepartmentUniversitaumlt Leipzig Haumlrtelstraszlige 16-18 04107 Leipzig Germany3Bioinformatics Unit Department of Diagnostics Fraunhofer Institute for CellTherapy and Immunology Perlickstraszlige 1 04103 Leipzig Germany 4Presentaddress Department of Diagnostics Fraunhofer Institute for Cell Therapy andImmunology Perlickstraszlige 1 04103 Leipzig Germany 5Present address ecSeqBioinformatics GmbH Sternwartenstraszlige 29 04103 Leipzig Germany

Received 11 June 2019 Accepted 12 November 2019

References1 Nekrutenko A Taylor J Next-generation sequencing data interpretation

enhancing reproducibility and accessibility Nat Rev Genet 201213667ndash72

Kaumlmpf et al BMC Bioinformatics (2019) 20664 Page 9 of 9

2 Peng RD Reproducible research in computational science Sci N Y20113341226ndash7

3 Bustin SA The reproducibility of biomedical research Sleepers awakeBiomol Detect Quantif 2014235ndash42

4 Baker M 1500scientists lift the lid on reproducibility Nature 2016533452ndash45 Sandve GK Nekrutenko A Taylor J Hovig E Ten simple rules for

reproducible computational research PLoS Comput Biol 2013910032856 Gruumlning B Chilton J Koumlster J Dale R Soranzo N van den Beek M

Goecks J Backofen R Nekrutenko A Taylor J Practical computationalreproducibility in the life sciences Cell Syst 20186631ndash5

7 Goodstadt L Ruffus A lightweight python library for computationalpipelines Bioinformatics 2010262778ndash9

8 Koumlster J Rahmann S Snakemake-a scalable bioinformatics workflowengine Bioinformatics 201228(19)2520ndash2

9 Fonseca Na Petryszak R Marioni J iRAP - an integrated RNA-seq AnalysisPipeline iRAP - an integrated RNA-seq Analysis Pipeline 20140ndash10httpsdoiorg101101005991

10 Wang Y Mehta G Mayani R Lu J Souaiaia T Chen Y Clark A Yoon HJWan L EvgrafovOV Knowles JA Deelman E Chen T RseqFlow Workflowsfor RNA-Seq data analysis Bioinformatics 201127(18)2598ndash600

11 Kalari KR Nair Aa Bhavsar JD O Brien DR Davila JI Bockol Ma Nie JTang X Baheti S Doughty JB Middha S Sicotte H Thompson AEAsmann YW Kocher J-Pa MAP-RSeq Mayo Analysis Pipeline for RNAsequencing BMC Bioinformatics 201415(1)224

12 Goecks J Nekrutenko A Taylor J Galaxy a comprehensive approach forsupporting accessible reproducible and transparent computationalresearch in the life sciences Genome Biol 20101186

13 Golosova O Henderson R Vaskin Y Gabrielian A Grekhov G NagarajanV Oler AJ Quintildeones M Hurt D Fursov M Huyen Y Unipro UGENE NGSpipelines and components for variant calling RNA-seq and ChIP-seq dataanalyses PeerJ 20142644

14 Berthold MR Cebron N Dill F Gabriel TR Koumltter T Meinl T Ohl P SiebC Thiel K Wiswedel B KNIME The Konstanz Information Miner InStudies in Classification Data Analysis and Knowledge Organization (GfKL2007) New York Springer 2007

15 Wolstencroft K Haines R Fellows D Williams A Withers D Owen SSoiland-Reyes S Dunlop I Nenadic A Fisher P Bhagat J Belhajjame KBacall F Hardisty A Nieva de la Hidalga A Balcazar Vargas MP Sufi SGoble C The Taverna workflow suite designing and executing workflowsof Web Services on the desktop web or in the cloud Nucleic acidsresearch 201341(Web Server issue)557ndash61

16 Guimera RV bcbio-nextgen Automated distributed next-gensequencing pipeline EMBnet J 20121730

17 Sadedin SP Pope B Oshlack A Bpipe a tool for running and managingbioinformatics pipelines Bioinformatics (Oxford Engl) 201228(11)1525ndash6

18 Di Tommaso P Chatzou M Floden EW Barja PP Palumbo ENotredame C Nextflow enables reproducible computational workflowsNat Biotechnol 201735316ndash9

19 Cingolani P Sladek R Blanchette M BigDataScript a scripting languagefor data pipelines Bioinformatics 201431(1)10ndash16

20 Ewels P Krueger F Kaumlller M Andrews S Cluster flow A user-friendlybioinformatics workflow tool F1000Research 201652824

21 Gafni E Luquette LJ Lancaster AK Hawkins JB Jung J-Y Souilmi Y WallDP Tonellato PJ Cosmos Python library for massively parallel workflowsBioinformatics 2014302956ndash8

22 Kaushik G Ivkovic S Simonovic J Tijanic N Davis-Dusenbery B Kural DRabix An open-source workflow executor supporting recomputabilityand interoperability of workflow descriptions Pac Symp Biocomput201722154ndash65

23 Yoo AB Jette MA Grondona M SLURM Simple Linux Utility for ResourceManagement In Feitelson D Rudolph L Schwiegelshohn U editorsJob Scheduling Strategies for Parallel Processing 9th InternationalWorkshop JSSPP 2003 Seattle WA USA June 24 2003 Revised PaperBerlin Springer 2003 p 44ndash60

24 Barski A Cuddapah S Cui K Roh T-Y Schones DE Wang Z Wei GChepelev I Zhao K High-resolution profiling of histone methylations inthe human genome Cell 2007129(4)823ndash37

25 Li H Durbin R Fastandaccurate long-read alignment with Burrows-Wheelertransform Bioinformatics (Oxford Engl) 201026(5)589ndash95

26 Langmead B Salzberg SL Fast gapped-read alignment with Bowtie 2Nat Methods 20129(4)357ndash9

27 Hoffmann S Otto C Kurtz S Sharma CM Khaitovich P Vogel J StadlerPF Hackermuumlller J Fast mapping of short sequences with mismatchesinsertions and deletions using index structures PLoS Comput Biol20095(9)1000502

28 Kim D Pertea G Trapnell C Pimentel H Kelley R Salzberg SL TopHat2accurate alignment of transcriptomes in the presence of insertionsdeletions and gene fusions Genome Biol 201314(4)36

29 Hoffmann S Otto C Doose G Tanzer A Langenberger D Christ S KunzM Holdt LM Teupser D Hackermuumlller J Stadler PF A multi-splitmapping algorithm for circular RNA splicing trans-splicing and fusiondetection Genome Biol 201415(2)34

30 Trapnell C Williams Ba Pertea G Mortazavi A Kwan G van Baren MJSalzberg SL Wold BJ Pachter L Transcript assembly and abundanceestimation from RNA-Seq reveals thousands of new transcripts andswitching among isoforms Nat Biotechnol 201128(5)511ndash5

31 Anders S Pyl PT Huber W HTSeq-A Python framework to work withhigh-throughput sequencing data Bioinformatics 201531(2)166ndash9

32 Zhang Y Liu T Meyer Ca Eeckhoute J Johnson DS Bernstein BENusbaum C Myers RM Brown M Li W Liu XS Model-based analysis ofChIP-Seq (MACS) Genome Biol 20089(9)137

33 Fidler F Gordon A Science is in a reproducibility crisis ndash how do weresolve it Conversation 2013 Available at httptheconversationcomscience-is-in-a-reproducibility-crisis-how-do-we-resolve-it-16998Accessed 11 Jun 2019

34 Lehrer J The truth wears off New Yorker 2010Dec 1352ndash735 Van Bavel J Why do so many studies fail to replicate N Y Times 201610

Available at httpswwwnytimescom20160529opinionsundaywhy-do-so-many-studies-fail-to-replicatehtml Accessed 11 Jun 2019

36 Fomel S Claerbout JF Guest Editorsrsquo Introduction ReproducibleResearch Comput Sci Eng 200911(1)5ndash7

37 Leisch F Sweave Dynamic generation of statistical reports using literatedata analysis In Haumlrdle W Roumlnz B editors Compstat 2002mdashProceedings in Computational Statistics Heidelberg Physica Verlag 2002p 575ndash80 ISBN 3-7908-1517-9 httpwwwstatuni-muenchendetexttildelowleischSweave

38 Xie Y Implementing Reproducible Computational Research In StoddenV Leisch F Peng RD editors Boca Raton Chapman and HallCRC 2014ISBN 978-1466561595 httpwwwcrcpresscomproductisbn9781466561595

39 Peacuterez F Granger BE IPython a system for interactive scientificcomputing Comput Sci Eng 20079(3)21ndash9

40 Afgan E Baker D Batut B van den Beek M Bouvier D Cech M ChiltonJ Clements D Coraor N Gruumlning BA Guerler A Hillman-Jackson JHiltemann S Jalili V Rasche H Soranzo N Goecks J Taylor JNekrutenko A Blankenberg D The galaxy platform for accessiblereproducible and collaborative biomedical analyses 2018 update NucleicAcids Res 201846537ndash44

Publisherrsquos NoteSpringer Nature remains neutral with regard to jurisdictional claims inpublished maps and institutional affiliations

  • Abstract
    • Background
    • Results
    • Conclusions
    • Keywords
      • Background
      • Implementation
        • Analysis as a directed acyclic graph
          • Plug-in architecture
            • Enforcing consistency and integrity
            • Maintaining reproducibility
            • Process flow
              • Results
                • Preparing genomic data for HTS analysis
                  • Transcriptome assembly from RNA-Seq data
                    • Identification of enriched regions from ChIP-Seq data
                      • Conclusions
                      • Availability and requirements
                      • Supplementary informationSupplementary information accompanies this paper at httpsdoiorg101186s12859-019-3219-1
                        • Additional file 1
                          • Abbreviations
                          • Acknowledgments
                          • Authors contributions
                          • Funding
                          • Availability of data and materials
                          • Ethics approval and consent to participate
                          • Consent for publication
                          • Competing interests
                          • Author details
                          • References
                          • Publishers Note
Page 9: SOFTWARE OpenAccess uap:reproducibleandrobustHTSdata …

Kaumlmpf et al BMC Bioinformatics (2019) 20664 Page 9 of 9

2 Peng RD Reproducible research in computational science Sci N Y20113341226ndash7

3 Bustin SA The reproducibility of biomedical research Sleepers awakeBiomol Detect Quantif 2014235ndash42

4 Baker M 1500scientists lift the lid on reproducibility Nature 2016533452ndash45 Sandve GK Nekrutenko A Taylor J Hovig E Ten simple rules for

reproducible computational research PLoS Comput Biol 2013910032856 Gruumlning B Chilton J Koumlster J Dale R Soranzo N van den Beek M

Goecks J Backofen R Nekrutenko A Taylor J Practical computationalreproducibility in the life sciences Cell Syst 20186631ndash5

7 Goodstadt L Ruffus A lightweight python library for computationalpipelines Bioinformatics 2010262778ndash9

8 Koumlster J Rahmann S Snakemake-a scalable bioinformatics workflowengine Bioinformatics 201228(19)2520ndash2

9 Fonseca Na Petryszak R Marioni J iRAP - an integrated RNA-seq AnalysisPipeline iRAP - an integrated RNA-seq Analysis Pipeline 20140ndash10httpsdoiorg101101005991

10 Wang Y Mehta G Mayani R Lu J Souaiaia T Chen Y Clark A Yoon HJWan L EvgrafovOV Knowles JA Deelman E Chen T RseqFlow Workflowsfor RNA-Seq data analysis Bioinformatics 201127(18)2598ndash600

11 Kalari KR Nair Aa Bhavsar JD O Brien DR Davila JI Bockol Ma Nie JTang X Baheti S Doughty JB Middha S Sicotte H Thompson AEAsmann YW Kocher J-Pa MAP-RSeq Mayo Analysis Pipeline for RNAsequencing BMC Bioinformatics 201415(1)224

12 Goecks J Nekrutenko A Taylor J Galaxy a comprehensive approach forsupporting accessible reproducible and transparent computationalresearch in the life sciences Genome Biol 20101186

13 Golosova O Henderson R Vaskin Y Gabrielian A Grekhov G NagarajanV Oler AJ Quintildeones M Hurt D Fursov M Huyen Y Unipro UGENE NGSpipelines and components for variant calling RNA-seq and ChIP-seq dataanalyses PeerJ 20142644

14 Berthold MR Cebron N Dill F Gabriel TR Koumltter T Meinl T Ohl P SiebC Thiel K Wiswedel B KNIME The Konstanz Information Miner InStudies in Classification Data Analysis and Knowledge Organization (GfKL2007) New York Springer 2007

15 Wolstencroft K Haines R Fellows D Williams A Withers D Owen SSoiland-Reyes S Dunlop I Nenadic A Fisher P Bhagat J Belhajjame KBacall F Hardisty A Nieva de la Hidalga A Balcazar Vargas MP Sufi SGoble C The Taverna workflow suite designing and executing workflowsof Web Services on the desktop web or in the cloud Nucleic acidsresearch 201341(Web Server issue)557ndash61

16 Guimera RV bcbio-nextgen Automated distributed next-gensequencing pipeline EMBnet J 20121730

17 Sadedin SP Pope B Oshlack A Bpipe a tool for running and managingbioinformatics pipelines Bioinformatics (Oxford Engl) 201228(11)1525ndash6

18 Di Tommaso P Chatzou M Floden EW Barja PP Palumbo ENotredame C Nextflow enables reproducible computational workflowsNat Biotechnol 201735316ndash9

19 Cingolani P Sladek R Blanchette M BigDataScript a scripting languagefor data pipelines Bioinformatics 201431(1)10ndash16

20 Ewels P Krueger F Kaumlller M Andrews S Cluster flow A user-friendlybioinformatics workflow tool F1000Research 201652824

21 Gafni E Luquette LJ Lancaster AK Hawkins JB Jung J-Y Souilmi Y WallDP Tonellato PJ Cosmos Python library for massively parallel workflowsBioinformatics 2014302956ndash8

22 Kaushik G Ivkovic S Simonovic J Tijanic N Davis-Dusenbery B Kural DRabix An open-source workflow executor supporting recomputabilityand interoperability of workflow descriptions Pac Symp Biocomput201722154ndash65

23 Yoo AB Jette MA Grondona M SLURM Simple Linux Utility for ResourceManagement In Feitelson D Rudolph L Schwiegelshohn U editorsJob Scheduling Strategies for Parallel Processing 9th InternationalWorkshop JSSPP 2003 Seattle WA USA June 24 2003 Revised PaperBerlin Springer 2003 p 44ndash60

24 Barski A Cuddapah S Cui K Roh T-Y Schones DE Wang Z Wei GChepelev I Zhao K High-resolution profiling of histone methylations inthe human genome Cell 2007129(4)823ndash37

25 Li H Durbin R Fastandaccurate long-read alignment with Burrows-Wheelertransform Bioinformatics (Oxford Engl) 201026(5)589ndash95

26 Langmead B Salzberg SL Fast gapped-read alignment with Bowtie 2Nat Methods 20129(4)357ndash9

27 Hoffmann S Otto C Kurtz S Sharma CM Khaitovich P Vogel J StadlerPF Hackermuumlller J Fast mapping of short sequences with mismatchesinsertions and deletions using index structures PLoS Comput Biol20095(9)1000502

28 Kim D Pertea G Trapnell C Pimentel H Kelley R Salzberg SL TopHat2accurate alignment of transcriptomes in the presence of insertionsdeletions and gene fusions Genome Biol 201314(4)36

29 Hoffmann S Otto C Doose G Tanzer A Langenberger D Christ S KunzM Holdt LM Teupser D Hackermuumlller J Stadler PF A multi-splitmapping algorithm for circular RNA splicing trans-splicing and fusiondetection Genome Biol 201415(2)34

30 Trapnell C Williams Ba Pertea G Mortazavi A Kwan G van Baren MJSalzberg SL Wold BJ Pachter L Transcript assembly and abundanceestimation from RNA-Seq reveals thousands of new transcripts andswitching among isoforms Nat Biotechnol 201128(5)511ndash5

31 Anders S Pyl PT Huber W HTSeq-A Python framework to work withhigh-throughput sequencing data Bioinformatics 201531(2)166ndash9

32 Zhang Y Liu T Meyer Ca Eeckhoute J Johnson DS Bernstein BENusbaum C Myers RM Brown M Li W Liu XS Model-based analysis ofChIP-Seq (MACS) Genome Biol 20089(9)137

33 Fidler F Gordon A Science is in a reproducibility crisis ndash how do weresolve it Conversation 2013 Available at httptheconversationcomscience-is-in-a-reproducibility-crisis-how-do-we-resolve-it-16998Accessed 11 Jun 2019

34 Lehrer J The truth wears off New Yorker 2010Dec 1352ndash735 Van Bavel J Why do so many studies fail to replicate N Y Times 201610

Available at httpswwwnytimescom20160529opinionsundaywhy-do-so-many-studies-fail-to-replicatehtml Accessed 11 Jun 2019

36 Fomel S Claerbout JF Guest Editorsrsquo Introduction ReproducibleResearch Comput Sci Eng 200911(1)5ndash7

37 Leisch F Sweave Dynamic generation of statistical reports using literatedata analysis In Haumlrdle W Roumlnz B editors Compstat 2002mdashProceedings in Computational Statistics Heidelberg Physica Verlag 2002p 575ndash80 ISBN 3-7908-1517-9 httpwwwstatuni-muenchendetexttildelowleischSweave

38 Xie Y Implementing Reproducible Computational Research In StoddenV Leisch F Peng RD editors Boca Raton Chapman and HallCRC 2014ISBN 978-1466561595 httpwwwcrcpresscomproductisbn9781466561595

39 Peacuterez F Granger BE IPython a system for interactive scientificcomputing Comput Sci Eng 20079(3)21ndash9

40 Afgan E Baker D Batut B van den Beek M Bouvier D Cech M ChiltonJ Clements D Coraor N Gruumlning BA Guerler A Hillman-Jackson JHiltemann S Jalili V Rasche H Soranzo N Goecks J Taylor JNekrutenko A Blankenberg D The galaxy platform for accessiblereproducible and collaborative biomedical analyses 2018 update NucleicAcids Res 201846537ndash44

Publisherrsquos NoteSpringer Nature remains neutral with regard to jurisdictional claims inpublished maps and institutional affiliations

  • Abstract
    • Background
    • Results
    • Conclusions
    • Keywords
      • Background
      • Implementation
        • Analysis as a directed acyclic graph
          • Plug-in architecture
            • Enforcing consistency and integrity
            • Maintaining reproducibility
            • Process flow
              • Results
                • Preparing genomic data for HTS analysis
                  • Transcriptome assembly from RNA-Seq data
                    • Identification of enriched regions from ChIP-Seq data
                      • Conclusions
                      • Availability and requirements
                      • Supplementary informationSupplementary information accompanies this paper at httpsdoiorg101186s12859-019-3219-1
                        • Additional file 1
                          • Abbreviations
                          • Acknowledgments
                          • Authors contributions
                          • Funding
                          • Availability of data and materials
                          • Ethics approval and consent to participate
                          • Consent for publication
                          • Competing interests
                          • Author details
                          • References
                          • Publishers Note