Figure 1: [Pipeline framework] Project information is kept separate from scripts/programs: a ‘config’ file defines how the pipeline is invoked to produce the data and analysis steps relevant to this particular project. Each module is dual-functioning: data generation (‘armed’) and quality control (‘verify’).
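As a rough illustration only (the file contents, keys, and paths below are hypothetical, not the framework's actual implementation), the config-driven, dual-mode design could look like this:

  ## config.txt (hypothetical example -- project information only, no code)
  PROJECT=example_resequencing
  REFERENCE=/data/external/hg19/genome.fa
  MODULES="data1 data2 eval1 eval2"

  ## trigger.sh (hypothetical shared entry point, called as: bash trigger.sh config.txt)
  #!/bin/bash
  set -e
  source "$1"                                  # load the project's config.txt
  for module in $MODULES; do
      bash "modules/${module}.sh" armed  "$1"  # 'armed': generate data
      bash "modules/${module}.sh" verify "$1"  # 'verify': run the module's quality control
  done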
REFERENCES
[1] Bauer, Denis. Variant calling comparison CASAVA 1.8 and GATK. Available from Nature Precedings (2011).
Centralizing Sequence Analysis
Crowd-sourcing, not Wheel-reinvention
Academic tools will remain the methods of choice for cutting-edge data analysis [1]; however, most do not comply with even very basic software-development practice (e.g. poor documentation, lack of legacy support), which makes set-up and maintenance time consuming.
A similar issue applies to reference data sets, which need to be downloaded and often filtered and converted into a usable format.
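A minimal sketch of what a centrally maintained, scripted reference-data preparation might look like (the mirror URL and paths below are placeholders, not the actual store):

  #!/bin/bash
  # fetch_reference.sh -- hypothetical helper for a centrally maintained reference store
  set -e
  REF_DIR=/data/external/hg19                                      # placeholder location
  mkdir -p "$REF_DIR"
  rsync -av rsync://mirror.example.org/hg19/genome.fa "$REF_DIR/"  # placeholder mirror URL
  samtools faidx "$REF_DIR/genome.fa"                              # index for random access
  bwa index "$REF_DIR/genome.fa"                                   # index for read mapping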
Summarizing quality control and data yield in a meaningful way remains a labour-intensive expert task.
Rather than individually battling these issues, a more efficient way would be to have a centralized system set up that is collectively maintained by the researchers who use it.
Benefits would be:
• Sharing modular methods/scripts for data analysis and summary
• Ensuring consistency and reproducibility by keeping scripts separate from data
• Benchmarking quality amongst other datasets within CSIRO
• Enabling collaborative knowledge gain
• Making developers’ expert knowledge available to users by requiring every script to have a self-contained quality-control stage (see the sketch after this list)
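For illustration, a single shared module could make the dual ‘armed’/‘verify’ structure explicit; the module name, input files, and commands below are assumed, not taken from the actual framework:

  #!/bin/bash
  # modules/data1.sh -- hypothetical module with a self-contained quality-control stage
  set -e
  MODE="$1"; source "$2"          # mode (armed|verify) and the project's config.txt
  case "$MODE" in
      armed)                      # data generation
          bwa aln "$REFERENCE" reads.fq > reads.sai
          bwa samse "$REFERENCE" reads.sai reads.fq | samtools view -bS - > aligned.bam
          ;;
      verify)                     # quality control written by the module's developer
          samtools flagstat aligned.bam > data1.flagstat.txt
          ;;
      *)  echo "usage: $0 {armed|verify} config.txt" >&2; exit 1 ;;
  esac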
The first steps of analysing sequencing data (2GS, NGS) have entered a transitional period: on the one hand, most analysis steps can be automated and standardized (pipeline), while on the other, constantly evolving protocols and software updates make maintaining these analysis pipelines labour intensive.
I propose a centralized system within CSIRO that is flexible enough to cater for different analyses while also being generic enough to distribute the labour-intensive maintenance and extension efficiently amongst the user community.
Taking the grind out of the analysis pipelines
Denis C. Bauer
Big picture: flexible yet low-maintenance framework
CSIRO MATHEMATICS, INFORMATICS AND STATISTICS
FOR FURTHER INFORMATION
Denis Bauer
w www.csiro.au/CMIS
Figure 2: [Examples of applications that can be shared] A quality-control overview, highly informative performance plots, and a system to browse data in real time with cross-references to other data sets are all examples of labour-intensive (set-up/maintenance) applications that would be of benefit to other users.
[Figure labels: QC overview, performance plots, browsing, documentation; pillars Application, Backup and Version Control, Data Warehousing; components include RStudio, project cards (web), software, processed data, raw data from external service providers, external genomic resources (genomes, annotation, etc., via rsync), custom scripts, visualization (IGV genome browser), statistical analysis, quality control, hypothesis generation, data processing and analysis, version control, project summary cards (wiki pages, task logs, web browser); >35 external programs, >41 custom scripts (4197 lines of code); 200 GB processed data per project, 57 GB external reference data; servers: //cherax + //cluster-vm, //fsnsw3_syd/Bioinfo, //???, project server, cluster, Galaxy project server; tools: BWA, GATK, samtools, etc.]
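The QC overview from Figure 2, for instance, could be a shared script that collects each project's ‘verify’ outputs into one table; the directory layout and file names below are assumptions for the sketch:

  #!/bin/bash
  # qc_overview.sh -- hypothetical cross-project summary of mapping statistics
  echo -e "project\ttotal_reads\tmapped_reads" > qc_overview.tsv
  for flagstat in /projects/*/qc/*.flagstat.txt; do                # placeholder directory layout
      project=$(basename "$(dirname "$(dirname "$flagstat")")")    # project name from the path
      total=$(awk 'NR==1 {print $1}' "$flagstat")                  # first line: total reads
      mapped=$(awk '/ mapped \(/ {print $1; exit}' "$flagstat")    # line reporting mapped reads
      echo -e "${project}\t${total}\t${mapped}" >> qc_overview.tsv
  done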
Utilizing international efforts
There are several international attempts to automate and standardize NGS data analysis. Investigating which of these efforts are beneficial to CSIRO is likely to be more successful as a group effort than by each individual alone:
Nectar - http://nectar.org.au/
Australia-wide effort for Cloud-computing and large data storage with emphasis on NGS (Mike Pheasant)
Bpipe - http://code.google.com/p/bpipe/
Effort for streamlining pipeline calls/re-calls
ISAtools - http://isatab.sourceforge.net/tools.html
Metadata annotation and documentation
BioStore - http://www.seqan-biostore.de/wp/
C++ framework for developing and sharing sequencing analysis programs based on solid algorithmic foundations and template-based interfaces.
Figure 3: [Framework overview] Elements from the three main pillars (Apps, Data, Backup) are overarched by a documentation server, which displays current states and annotates changes.
[Figure labels: user call; project Config.txt; shared scripts trigger.sh, pbs1.sh, pbs2.sh; modules data1/data2 and eval1/eval2, each with ‘armed’ and ‘verify’ modes; logfile & code; Summary.html.]
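As a last illustration, the Summary.html shown in the figure could be assembled from the modules' ‘verify’ logs by a small helper like the following (the log naming scheme is hypothetical):

  #!/bin/bash
  # make_summary.sh -- hypothetical assembly of Summary.html from per-module 'verify' logs
  {
      echo "<html><body><h1>Pipeline summary</h1>"
      for log in logs/*_verify.log; do                         # placeholder log naming scheme
          echo "<h2>$(basename "$log" _verify.log)</h2><pre>"  # one section per module
          cat "$log"
          echo "</pre>"
      done
      echo "</body></html>"
  } > Summary.html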