Centralizing Sequence Analysis


TRANSCRIPT

Page 1: Centralizing sequence analysis

FOR FURTHER INFORMATION

CSIRO MATHEMATICS, INFORMATICS AND STATISTICS
Denis Bauer
e [email protected]
w www.csiro.au/CMIS

Figure 1: [Pipeline framework] Project information is kept separate from scripts/programs; a ‘config’ file defines how the pipeline is invoked to produce the data and analysis steps relevant to this particular project. Each module is dual-functioning: data generation (‘armed’) and quality control (‘verify’).
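To make this concrete, the ‘config’ file could be as simple as a sourceable list of key-value pairs. The following is a minimal sketch with hypothetical names and paths, since the poster does not show the actual format:

```bash
# Config.txt -- hypothetical example of a project 'config' file
# (all names and paths are illustrative; the poster does not show the format).
# Project-specific information lives here, apart from the shared scripts.
PROJECT_NAME="exome_2012_03"
RAW_DATA_DIR="/projects/exome_2012_03/raw"
OUT_DIR="/projects/exome_2012_03/processed"
REFERENCE="/refdata/genomes/hg19.fa"
# Pipeline modules relevant for this particular project, in run order
MODULES="align dedup call"
```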

REFERENCES

[1] Bauer, Denis. Variant calling comparison CASAVA1.8 and GATK. Available from Nature Precedings (2011).

Centralizing Sequence Analysis

Crowd-sourcing not Wheel-reinvention

Academic tools will remain the methods of choice for cutting-edge data analysis [1]; however, most do not comply with even very basic software-development practice (e.g. poor documentation, lack of legacy support), which makes set-up and maintenance time consuming.

A similar issue applies to reference data sets, which need to be downloaded and often filtered and converted into a usable format.

Summarizing quality control and data yield in a meaningful way remains a labour-intensive expert task.

Rather than individually battling these issues, a more efficient way would be to have a centralized system set up that is collectively maintained by the researchers who are using the system.

Benefits would be:

• Sharing modular methods/scripts for data analysis and summary

• Ensuring consistency and reproducibility by keeping scripts separate from data

• Benchmarking quality amongst other datasets within CSIRO

• Enabling collaborative knowledge gain

• Making developers’ expert knowledge available to users by enforcing scripts to have a self-contained quality control stage (see the module sketch after this list)
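The last benefit is the ‘armed’/‘verify’ idea from Figure 1. A minimal sketch of such a dual-function module follows; the script name, file names, and the BWA alignment step are assumptions for illustration:

```bash
#!/bin/bash
# module_align.sh -- hypothetical dual-function module (illustrative only).
# 'armed' generates data; 'verify' is the self-contained QC stage that
# encodes the developer's expectations about the output.
set -euo pipefail
source Config.txt   # project settings kept separate from this shared script

case "${1:-}" in
  armed)
    # Data generation: align the project's reads against its reference.
    bwa aln "$REFERENCE" "$RAW_DATA_DIR/reads.fastq" > "$OUT_DIR/reads.sai"
    ;;
  verify)
    # Quality control: fail loudly if the expected output is missing or empty.
    [ -s "$OUT_DIR/reads.sai" ] || { echo "QC FAILED: no alignment produced" >&2; exit 1; }
    echo "QC OK: alignment present ($(du -h "$OUT_DIR/reads.sai" | cut -f1))"
    ;;
  *)
    echo "usage: $0 {armed|verify}" >&2
    exit 2
    ;;
esac
```

Calling `module_align.sh armed` produces the data, and the pipeline would only proceed once `module_align.sh verify` exits cleanly, so the developer’s QC expectations travel with the script.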

The first steps of analysing sequencing data (2GS/NGS) have entered a transitional period: on the one hand, most analysis steps can be automated and standardized into a pipeline, while on the other, constantly evolving protocols and software updates make maintaining these analysis pipelines labour intensive.

I propose a centralized system within CSIRO that is flexible enough to cater for different analyses while also being generic enough to efficiently distribute labour-intensive maintenance and extension amongst the user community.

Taking the grind out of the analysis pipelines
Denis C. Bauer

Big picture: flexible yet low-maintenance framework


Figure 2: [Examples of applications that can be shared] A quality-control overview, highly informative performance plots, and a system to browse data in real time with cross-references to other data sets are all examples of labour-intensive tasks (set-up/maintenance) whose results would benefit other users.

[Figure 2 panels: QC overview, performance plots, data browsing]

[Big-picture diagram, summarized] The framework rests on three pillars, overarched by documentation:
• Application (data processing and analysis): external software such as BWA, GATK and samtools (>35 external programs), custom scripts (>41 scripts, 4,197 lines of code), visualization via the IGV genome browser and RStudio, plus quality control, statistical analysis and hypothesis generation, running on the cluster (//cherax + //cluster-vm) and a Galaxy project server.
• Data warehousing: raw data from external service providers, processed data (200 GB per project) and external genomic resources such as genomes and annotation (57 GB of reference data), held on the project server (//fsnsw3_syd/Bioinfo) and synchronized via rsync.
• Backup and version control: versioned scripts and data (storage location //???).
• Documentation: project summary cards (web), wiki pages, task logs, accessed through a web browser.
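As one flavour of a shared QC overview, a sketch like the following could tabulate raw read counts across projects so a new dataset can be benchmarked at a glance against earlier ones; the /projects directory layout is an assumption:

```bash
#!/bin/bash
# qc_overview.sh -- hypothetical sketch; the /projects layout is assumed.
# Prints one line per project so new datasets can be compared to old ones.
printf "%-25s %12s\n" "project" "reads"
for fq in /projects/*/raw/reads_1.fastq; do
  project=$(basename "$(dirname "$(dirname "$fq")")")
  # FASTQ stores each read as four lines
  n=$(( $(wc -l < "$fq") / 4 ))
  printf "%-25s %12d\n" "$project" "$n"
done
```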

Utilizing international efforts

There are several international attempts to automate and standardize NGS data analysis. Investigating which efforts are beneficial to CSIRO is likely to be more successful as a group effort than by each individual alone:

Nectar - http://nectar.org.au/

Australia-wide effort for Cloud-computing and large data storage with emphasis on NGS (Mike Pheasant)

Bpipe - http://code.google.com/p/bpipe/
Effort for streamlining pipeline calls/re-calls (a brief usage example follows this list)

ISAtools - http://isatab.sourceforge.net/tools.html

Metadata annotation and documentation

BioStore - http://www.seqan-biostore.de/wp/

C++ framework for developing and sharing sequencing analysis programs based on solid algorithmic foundations and template-based interfaces.
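Taking Bpipe as an example, the call/re-call workflow reduces to two shell commands; ‘pipeline.groovy’ is a placeholder for an actual pipeline definition file:

```bash
# Run a pipeline defined in a Bpipe script
# ('pipeline.groovy' is a placeholder for a real pipeline definition).
bpipe run pipeline.groovy sample_1.fastq sample_2.fastq

# After fixing the problem, resume from the failed stage instead of
# recomputing everything from scratch.
bpipe retry
```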

Figure 3: [Framework overview] Elements from the three main pillars (Apps, Data, Backup) are overarched by a documentation server, which displays current states and annotates changes.

[Figure 3 diagram, summarized] A user call to trigger.sh reads the project’s Config.txt and launches the shared scripts (pbs1.sh, pbs2.sh). Each script runs ‘armed’ to generate data (data1, data2) and ‘verify’ to evaluate it (eval1, eval2), writing a logfile and the executed code at every step; the results are collected into Summary.html.
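Read together with Figure 1, trigger.sh could chain each module’s ‘armed’ and ‘verify’ runs as dependent cluster jobs. The following PBS-flavoured sketch assumes the Config.txt variables from earlier and hypothetical job-script names:

```bash
#!/bin/bash
# trigger.sh -- hypothetical sketch of the user-facing entry point
# (job-script names and config variables are assumptions for illustration).
# Submits each module twice: once 'armed' to generate data, then 'verify'
# for QC; the next module only starts if the previous QC job succeeded.
set -euo pipefail
source Config.txt

previous=""
for module in $MODULES; do
  # Hold each 'armed' job until the previous module's 'verify' passed.
  deps=${previous:+-W depend=afterok:$previous}
  armed_id=$(qsub $deps -v MODE=armed "pbs_${module}.sh")
  verify_id=$(qsub -W depend=afterok:"$armed_id" -v MODE=verify "pbs_${module}.sh")
  previous=$verify_id
done

echo "Pipeline for $PROJECT_NAME submitted; each stage writes its logfile and code."
```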