Initial steps towards a production platform for DNA sequence analysis on the grid
Presented at the ISMB/ECCB 2011 conference. https://www.iscb.org/cms_addon/conferences/ismbeccb2011/highlights.php#HL13
![Page 1: Initial steps towards a production platform for DNA sequence analysis on the grid](https://reader033.vdocuments.us/reader033/viewer/2022060108/554e9426b4c90573338b4ffb/html5/thumbnails/1.jpg)
Initial steps towards a production platform for DNA sequence analysis on the grid
ISMB/ECCB conference – 18 July 2011
Barbera van Schaik, Angela Luyf, Michel de Vries,
Frank Baas, Antoine van Kampen and Silvia Olabarriaga
![Page 2: Initial steps towards a production platform for DNA sequence analysis on the grid](https://reader033.vdocuments.us/reader033/viewer/2022060108/554e9426b4c90573338b4ffb/html5/thumbnails/2.jpg)
Overview
Grid computing and workflow technology
Example: Virus discovery
Analysis of larger data sets
Example: Genome of the Netherlands
Challenges and summary
![Page 3: Initial steps towards a production platform for DNA sequence analysis on the grid](https://reader033.vdocuments.us/reader033/viewer/2022060108/554e9426b4c90573338b4ffb/html5/thumbnails/3.jpg)
Sequencing, Moore’s law and personnel
http://www.politigenomics.com/2009/02/the-scale-up.html
Acceleration
Note: Only slope is meaningful in this graph
![Page 4: Initial steps towards a production platform for DNA sequence analysis on the grid](https://reader033.vdocuments.us/reader033/viewer/2022060108/554e9426b4c90573338b4ffb/html5/thumbnails/4.jpg)
What are the options?
Local cluster
Desktop grid
Supercomputer
Hadoop cluster
GPU cluster
Cloud computing
(Inter)national Grid
DNA computing
National computing facilities
Each system has its own interface
Need to learn how they all work
![Page 5: Initial steps towards a production platform for DNA sequence analysis on the grid](https://reader033.vdocuments.us/reader033/viewer/2022060108/554e9426b4c90573338b4ffb/html5/thumbnails/5.jpg)
Grids
Distributed resources
Computing
Data storage
Open protocols
It's all about sharing
Resources
Methods
Collaborations
![Page 6: Initial steps towards a production platform for DNA sequence analysis on the grid](https://reader033.vdocuments.us/reader033/viewer/2022060108/554e9426b4c90573338b4ffb/html5/thumbnails/6.jpg)
Dutch grid (resources)
grid
http://www.biggrid.nl/
![Page 7: Initial steps towards a production platform for DNA sequence analysis on the grid](https://reader033.vdocuments.us/reader033/viewer/2022060108/554e9426b4c90573338b4ffb/html5/thumbnails/7.jpg)
People, resources and data flow
My role
grid
Sequence facility
Research laboratories
Bioinformatics NGS team
e-BioScience team
![Page 8: Initial steps towards a production platform for DNA sequence analysis on the grid](https://reader033.vdocuments.us/reader033/viewer/2022060108/554e9426b4c90573338b4ffb/html5/thumbnails/8.jpg)
Example: Virus discovery
Virus discovery unit
VIDISCA method
GenBank - NR
[Figure: sequencing experiments (labels exp1, exp2, exp3, exp6)]
Goal: Identify known and discover new viruses in samples
Michel de Vries et al. (2011) PLoS ONE
![Page 9: Initial steps towards a production platform for DNA sequence analysis on the grid](https://reader033.vdocuments.us/reader033/viewer/2022060108/554e9426b4c90573338b4ffb/html5/thumbnails/9.jpg)
BLAST analysis workflow
Input: sequence reads
Conversion step (sff to fasta)
BLAST
Output: BLAST results
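The two steps above can be sketched as a pair of command lines. This is only an illustration under stated assumptions: the slides do not say which BLAST program or version was used, so BLAST+ `blastn` is assumed here, and the positional arguments of `sff2fasta.pl` are hypothetical because the script's real interface is not shown. The commands are built but not executed.

```python
from pathlib import Path


def conversion_command(sff_file: str) -> tuple[list[str], str]:
    """Build the sff-to-fasta conversion command and return it with the output name.

    Assumption: sff2fasta.pl takes the input and output paths as positional
    arguments; the real script's interface is not given in the slides.
    """
    fasta_file = str(Path(sff_file).with_suffix(".fasta"))
    return ["perl", "sff2fasta.pl", sff_file, fasta_file], fasta_file


def blast_command(fasta_file: str, database: str) -> tuple[list[str], str]:
    """Build a BLAST+ blastn command (assumed program) and return it with the output name."""
    result_file = f"{Path(fasta_file).stem}.{Path(database).name}.blast.txt"
    command = ["blastn", "-query", fasta_file, "-db", database, "-out", result_file]
    return command, result_file


def blast_workflow(sff_file: str, database: str) -> list[list[str]]:
    """Chain the two steps: the fasta output of step 1 is the query of step 2."""
    convert, fasta_file = conversion_command(sff_file)
    blast, _ = blast_command(fasta_file, database)
    return [convert, blast]


if __name__ == "__main__":
    # "sample1.sff" and "viruses" are illustrative names only.
    for command in blast_workflow("sample1.sff", "viruses"):
        print(" ".join(command))
```

Keeping each step as a separate function mirrors the idea of separate workflow components connected by their input and output files.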
![Page 10: Initial steps towards a production platform for DNA sequence analysis on the grid](https://reader033.vdocuments.us/reader033/viewer/2022060108/554e9426b4c90573338b4ffb/html5/thumbnails/10.jpg)
Workflow description (XML)
Component 1 (XML), Component 2 (XML)
Implementation of workflow components
Executable/script: BLAST
In: sequences (fasta)
In: database (fasta)
Out: blast result (txt)
Executable/script: sff2fasta.pl
In: sequences (sff)
Out: sequences (fasta)
X
Tristan Glatard (2008) Future generation computer systems
http://gwendia.i3s.unice.fr/doku.php?id=gwendia
![Page 11: Initial steps towards a production platform for DNA sequence analysis on the grid](https://reader033.vdocuments.us/reader033/viewer/2022060108/554e9426b4c90573338b4ffb/html5/thumbnails/11.jpg)
Run workflow on the grid
Silvia Olabarriaga et al. (2010) IEEE Transactions on Information Technology in Biomedicine
Tristan Glatard (2008) International Journal of High Performance Computing Applications
![Page 12: Initial steps towards a production platform for DNA sequence analysis on the grid](https://reader033.vdocuments.us/reader033/viewer/2022060108/554e9426b4c90573338b4ffb/html5/thumbnails/12.jpg)
Graphical user interface: VBrowser
http://www.vl-e.nl/vbrowser
![Page 13: Initial steps towards a production platform for DNA sequence analysis on the grid](https://reader033.vdocuments.us/reader033/viewer/2022060108/554e9426b4c90573338b4ffb/html5/thumbnails/13.jpg)
Workflow monitoring
![Page 14: Initial steps towards a production platform for DNA sequence analysis on the grid](https://reader033.vdocuments.us/reader033/viewer/2022060108/554e9426b4c90573338b4ffb/html5/thumbnails/14.jpg)
Speed up
[Figure: experiments (labels exp1, exp2, exp3, exp6) analyzed with BLAST]
15 experiments
722 samples
2 databases: Human ribosomal, Viruses
Total CPU time: 413 hrs (~17 days)
Elapsed time workflow: 13.7 hrs
= 30x speed up
Angela Luyf, Barbera van Schaik et al (2010) BMC Bioinformatics
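The speed-up figure follows directly from the two reported times. A small worked check, assuming the speed-up is defined as total CPU time divided by elapsed workflow time (that is, relative to running all jobs one after another on a single comparable core):

```python
TOTAL_CPU_HOURS = 413.0   # reported total CPU time
ELAPSED_HOURS = 13.7      # reported elapsed time of the workflow


def speed_up(cpu_hours: float, wall_hours: float) -> float:
    """Ratio of sequential CPU time to wall-clock time."""
    return cpu_hours / wall_hours


cpu_days = TOTAL_CPU_HOURS / 24
factor = speed_up(TOTAL_CPU_HOURS, ELAPSED_HOURS)

if __name__ == "__main__":
    print(f"{cpu_days:.1f} CPU days")   # 17.2 CPU days
    print(f"{factor:.1f}x speed up")    # 30.1x speed up
```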
![Page 15: Initial steps towards a production platform for DNA sequence analysis on the grid](https://reader033.vdocuments.us/reader033/viewer/2022060108/554e9426b4c90573338b4ffb/html5/thumbnails/15.jpg)
Benefits of workflow technology
Agile development
Re-use of components
Iteration strategy
Knowledge about analysis steps captured in workflow
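An iteration strategy describes how the items arriving on a component's input ports are combined into individual tasks. The sketch below is an assumption-laden illustration, not the workflow engine's implementation: it shows a cross-product strategy (every item on one port with every item on the other) and a dot-product strategy (items paired by position), with made-up sample names and the two databases from the virus discovery example.

```python
from itertools import product


def cross(left: list[str], right: list[str]) -> list[tuple[str, str]]:
    """Cross product: every left item is combined with every right item."""
    return list(product(left, right))


def dot(left: list[str], right: list[str]) -> list[tuple[str, str]]:
    """Dot product: items are paired by position."""
    return list(zip(left, right))


if __name__ == "__main__":
    samples = ["sample1", "sample2", "sample3"]   # illustrative names
    databases = ["human_ribosomal", "viruses"]

    # Cross product: 3 samples x 2 databases = 6 BLAST tasks.
    for sample, database in cross(samples, databases):
        print(f"BLAST {sample} against {database}")
```

If every one of the 722 samples were compared against both databases, a cross product would yield 1,444 BLAST tasks; the slides do not state the actual task count.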
![Page 16: Initial steps towards a production platform for DNA sequence analysis on the grid](https://reader033.vdocuments.us/reader033/viewer/2022060108/554e9426b4c90573338b4ffb/html5/thumbnails/16.jpg)
Analysis of larger data sets
Genome of the Netherlands (GoNL)
Whole genome sequencing of 250 trios
Enrich biobanks
Reference set for disease studies
http://www.bbmri.nl/
http://www.nlgenome.nl/
770 samples
45 TB raw data
Many partners (data sharing)
Analysis on distributed sites
![Page 17: Initial steps towards a production platform for DNA sequence analysis on the grid](https://reader033.vdocuments.us/reader033/viewer/2022060108/554e9426b4c90573338b4ffb/html5/thumbnails/17.jpg)
GoNL alignment pipeline
BWA aln, sampe, sam-to-bam, sort bam, index
Picard mark duplicates
GATK realignment
GATK recalibration
Picard fix mates
Input: Pair1.fastq, Pair2.fastq, reference genome
Output: Result.bam
Pipeline similar to what is used at the Broad Institute. Implemented for GoNL by Freerk van Dijk (Groningen)
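The BWA part of the pipeline above can be sketched as a list of command lines for one read pair. Several things here are assumptions rather than the GoNL implementation: the slides do not say which tool performs the sam-to-bam, sort and index steps (samtools with its current `sort -o` syntax is assumed), the file naming is hypothetical, and the Picard and GATK stages are listed by name only because their exact invocations are not given. Commands are built, not executed.

```python
def bwa_alignment_commands(reference: str, fastq1: str, fastq2: str,
                           prefix: str) -> list[list[str]]:
    """Build commands for: BWA aln (per mate), sampe, sam-to-bam, sort bam, index."""
    sai1, sai2 = f"{prefix}_1.sai", f"{prefix}_2.sai"
    sam = f"{prefix}.sam"
    bam = f"{prefix}.bam"
    sorted_bam = f"{prefix}.sorted.bam"
    return [
        ["bwa", "aln", "-f", sai1, reference, fastq1],    # align mate 1
        ["bwa", "aln", "-f", sai2, reference, fastq2],    # align mate 2
        ["bwa", "sampe", "-f", sam, reference, sai1, sai2, fastq1, fastq2],
        ["samtools", "view", "-b", "-S", "-o", bam, sam],  # sam-to-bam (assumed tool)
        ["samtools", "sort", "-o", sorted_bam, bam],       # sort bam (assumed tool)
        ["samtools", "index", sorted_bam],                 # index (assumed tool)
    ]


# Later stages as named on the slide; command lines intentionally omitted.
LATER_STAGES = [
    "Picard mark duplicates",
    "GATK realignment",
    "GATK recalibration",
    "Picard fix mates",
]

if __name__ == "__main__":
    for command in bwa_alignment_commands(
            "reference.fa", "Pair1.fastq", "Pair2.fastq", "sample"):
        print(" ".join(command))
    for stage in LATER_STAGES:
        print(f"# then: {stage}")
```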
160 samples (478 lanes) are currently analyzed on the Dutch grid
Development and small tests: Nov 22, 2010 - now
Analysis: Mar 25, 2011 - Jul 15, 2011
Jobs: 13,981
Total CPU time: 5.5 years
Disk space used: 315 TB
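To put these figures in perspective, the averages below are derived from the reported numbers. They rest on two assumptions that the slides do not confirm: that the job count and CPU time refer to the analysis period (Mar 25 to Jul 15, 2011), and that a CPU year is 365 days.

```python
from datetime import date

JOBS = 13_981        # reported number of jobs
CPU_YEARS = 5.5      # reported total CPU time
START = date(2011, 3, 25)
END = date(2011, 7, 15)

elapsed_days = (END - START).days            # length of the analysis period
cpu_hours = CPU_YEARS * 365 * 24             # assuming 365-day CPU years
hours_per_job = cpu_hours / JOBS             # average CPU time per job
avg_busy_cores = (CPU_YEARS * 365) / elapsed_days  # average cores in use

if __name__ == "__main__":
    print(f"Analysis period: {elapsed_days} days")             # 112 days
    print(f"Average CPU time per job: {hours_per_job:.1f} h")  # 3.4 h
    print(f"Average cores in use: {avg_busy_cores:.1f}")       # 17.9
```

These are averages over the whole period; actual usage on a shared grid is likely to have been bursty.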
![Page 18: Initial steps towards a production platform for DNA sequence analysis on the grid](https://reader033.vdocuments.us/reader033/viewer/2022060108/554e9426b4c90573338b4ffb/html5/thumbnails/18.jpg)
Challenges
• Error handling
• Data management
• Data protection
• Provenance tracking
• Transparent addition of other resources
![Page 19: Initial steps towards a production platform for DNA sequence analysis on the grid](https://reader033.vdocuments.us/reader033/viewer/2022060108/554e9426b4c90573338b4ffb/html5/thumbnails/19.jpg)
Summary
More research and development needed in e-bioscience
Latest IT infrastructures needed for scaling up NGS data analysis (grids, clouds, big clusters)
Workflow technology assists agile implementation of bioinformatics software
Separate workflow development from IT infrastructure for easier migration and expansion (middleware)
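The last point, keeping workflow development separate from the IT infrastructure, can be illustrated with a small sketch. All names here (`Backend`, `LocalBackend`, `DryRunBackend`, `run_workflow`) are hypothetical and are not part of any middleware mentioned in the talk. The workflow only knows that a backend can run a command, so a grid or cloud backend could be added without changing the workflow itself.

```python
import subprocess
from typing import Protocol


class Backend(Protocol):
    """Anything that can execute one workflow step."""

    def run(self, command: list[str]) -> None: ...


class LocalBackend:
    """Runs each step as a local process."""

    def run(self, command: list[str]) -> None:
        subprocess.run(command, check=True)


class DryRunBackend:
    """Records steps instead of running them (useful for testing a workflow)."""

    def __init__(self) -> None:
        self.submitted: list[list[str]] = []

    def run(self, command: list[str]) -> None:
        self.submitted.append(command)


def run_workflow(commands: list[list[str]], backend: Backend) -> None:
    """The workflow definition is independent of where the steps execute."""
    for command in commands:
        backend.run(command)


if __name__ == "__main__":
    backend = DryRunBackend()
    run_workflow([["step_one", "input.txt"], ["step_two", "output.txt"]], backend)
    print(backend.submitted)
```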
![Page 20: Initial steps towards a production platform for DNA sequence analysis on the grid](https://reader033.vdocuments.us/reader033/viewer/2022060108/554e9426b4c90573338b4ffb/html5/thumbnails/20.jpg)
Acknowledgements
Genome of the Netherlands, NL
Cisca Wijmenga
Morris Swertz
All project partners
Virus discovery unit, AMC
Lia van der Hoek
Michel de Vries
Department of genome analysis, AMC
Frank Baas
Ted Bradley
Marja Jakobs
Bioinformatics Laboratory, AMC
Antoine van Kampen
NGS bioinformatics team
Aldo Jongejan
Marcel Willemsen
e-Bioscience team
Silvia Olabarriaga
Angela Luyf
Mark Santcroos
Shayan Shahand
University of Amsterdam
Piter de Boer
BiG Grid
Jan Just Keijser
Tom Visser
Grid support
Modalis, France
Johan Montagnat
Creatis, France
Tristan Glatard
http://www.bioinformaticslaboratory.nl/
![Page 21: Initial steps towards a production platform for DNA sequence analysis on the grid](https://reader033.vdocuments.us/reader033/viewer/2022060108/554e9426b4c90573338b4ffb/html5/thumbnails/21.jpg)
![Page 22: Initial steps towards a production platform for DNA sequence analysis on the grid](https://reader033.vdocuments.us/reader033/viewer/2022060108/554e9426b4c90573338b4ffb/html5/thumbnails/22.jpg)
BWA on grid – component description
![Page 23: Initial steps towards a production platform for DNA sequence analysis on the grid](https://reader033.vdocuments.us/reader033/viewer/2022060108/554e9426b4c90573338b4ffb/html5/thumbnails/23.jpg)
BWA on grid – component description
![Page 24: Initial steps towards a production platform for DNA sequence analysis on the grid](https://reader033.vdocuments.us/reader033/viewer/2022060108/554e9426b4c90573338b4ffb/html5/thumbnails/24.jpg)
BWA on grid – workflow description
![Page 25: Initial steps towards a production platform for DNA sequence analysis on the grid](https://reader033.vdocuments.us/reader033/viewer/2022060108/554e9426b4c90573338b4ffb/html5/thumbnails/25.jpg)
e-BioInfra gateway
No grid certificate needed
Data upload via sFTP (intranet)
Synced with grid storage
Workflows are started from web page
http://orange.ebioscience.amc.nl/ebioinfragateway/
![Page 26: Initial steps towards a production platform for DNA sequence analysis on the grid](https://reader033.vdocuments.us/reader033/viewer/2022060108/554e9426b4c90573338b4ffb/html5/thumbnails/26.jpg)
Implemented workflow components for next generation sequencing
Existing software
• BLAST
• BLAT
• BWA
• Annovar
• Varscan
• Newbler
• FastQC
• Roche software
• GATK
• Picard
• Samtools
In-house software
• Data format converters
• Quality trimming
• Alternative splice product detection
• CDR3 detection (T- and B-cell variation)
• Genome comparison (small genomes)