#code2cure: a field guide for software engineers on their journey to the world of genomics
Post on 21-Apr-2017
#Code2Cure: Engineering Genomics
Twitter: @mirkiani
A field guide for software engineers on their journey to the world of genomics.
Amirhossein Kiani, Sr. Lead Software Engineer
Email: amir@bina.com
Image courtesy of http://circos.ca
DISCLAIMER: The views expressed in this talk are mine alone and not those of my employer.
Bina products are for Research Use Only. Not for use in diagnostic procedures.
Also, I’m a computer scientist by training, trying to help people with a similar background learn about genomics. The scientific concepts in this talk are therefore heavily simplified.
https://www.youtube.com/watch?v=G1ZLyGW8rKY
www.bina.com
Why Genomics?
Past: $3,000,000,000 and 13 years
http://en.wikipedia.org/wiki/Human_Genome_Project
Present: $1,000 and 24 hours
Future
Why Genomics?
Some things we could do with genomics:
• Carrier Screening
• Prenatal Screening
• Newborn Screening
• Inherited Disease
• Infectious Disease
• Cancer Diagnostics
• Microbiome
• Personalized Medicine
But I have no genomics background!
It’s OK.
My personal story…
Now
Then
What is a cell? What is DNA?
http://en.wikipedia.org/wiki/Cell_%28biology%29 http://en.wikipedia.org/wiki/DNA
Image courtesy of Pinterest
Image courtesy of Tumblr
Crash Course on Genomics
Genomics: the field that studies the structure of genomes.
http://en.wikipedia.org/wiki/Genomics http://en.wikipedia.org/wiki/RNA http://en.wikipedia.org/wiki/Protein
DNA → RNA → Protein → You!
How do we figure out what’s in DNA?
Like everything else: we turn the analog signal into a digital one, and then analyze it.
http://en.wikipedia.org/wiki/DNA_sequencing http://en.wikipedia.org/wiki/FASTQ_format
Sequencers (Illumina, Ion Torrent, Genia, …) → Primary Analysis → FASTQ Format
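To make the FASTQ output of primary analysis concrete, here is a minimal sketch of a reader. It assumes the common four-lines-per-record layout (header, bases, "+" separator, qualities) and Phred+33 quality encoding; the names `FastqRecord` and `read_fastq` are illustrative, not part of any library.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]  # one Phred score per base


def read_fastq(handle: TextIO, phred_offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming exactly four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line carries no data we need
        quality_line = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"expected '@' at start of record, got {header!r}")
        if len(sequence) != len(quality_line):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # Each quality character encodes a Phred score as chr(score + offset).
        scores = [ord(char) - phred_offset for char in quality_line]
        yield FastqRecord(header[1:], sequence, scores)


# Toy input: one 7-base read whose last base has a very low quality score.
example = io.StringIO("@read1\nGATTACA\n+\nIIIIII#\n")
records = list(read_fastq(example))
print(records[0].name, records[0].sequence, records[0].qualities)
```

Real files are usually gzip-compressed and hold hundreds of millions of reads, so production code streams them rather than calling `list()` on the whole file.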
Image courtesy of PersonalGenomes.org
RAW Data to Variants (Secondary Analysis)
Step 1. Alignment
http://en.wikipedia.org/wiki/DNA_sequencing http://en.wikipedia.org/wiki/FASTQ_format
Image courtesy of Wall Woodworks
Image courtesy of Wallpaper Up
From “Raw” DNA to “Variants” (Secondary Analysis)
Step 1. Short-Read Sequence Alignment
http://en.wikipedia.org/wiki/Reference_genome http://en.wikipedia.org/wiki/Single-nucleotide_polymorphism http://en.wikipedia.org/wiki/Indel http://en.wikipedia.org/wiki/Structural_variation
Reference Genome (~3B bases):
AACACACCCAAGGGGGAAACTTTGGTCCACCCAAGGGGGAAACCCAAGGGGGAAACTTTG
Aligned short reads:
ACTTTGGTCCACCCAAGGAAGGGGGACACCCAAGGACACCC__GGGGGAAACT
GGACACCCAAGGGGGAAACCCAAGGGGGACACCC
ACCC__GGGGGAAACTTTGAACACACCC__GGGGGAA
Figure labels: Coverage, Deletion, Single Nucleotide Polymorphism
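The idea in the figure (place each short read on the reference, then look at coverage and at bases that disagree) can be sketched with a brute-force toy. This is not how real aligners work: it tries every offset and counts mismatches, ignores insertions and deletions, and only makes sense for tiny inputs. The reference below is the first 28 bases of the slide's reference; the three reads are made up for illustration and each carries the same G→A change.

```python
from typing import Dict, List, Tuple


def best_offset(reference: str, read: str) -> Tuple[int, int]:
    """Return (offset, mismatches) for the placement with the fewest mismatches."""
    best = (0, len(read) + 1)
    for offset in range(len(reference) - len(read) + 1):
        mismatches = sum(1 for a, b in zip(reference[offset:], read) if a != b)
        if mismatches < best[1]:
            best = (offset, mismatches)
    return best


def pileup(reference: str, reads: List[str]) -> Tuple[List[int], Dict[int, List[str]]]:
    """Compute per-position coverage and the read bases that differ from the reference."""
    coverage = [0] * len(reference)
    differences: Dict[int, List[str]] = {}
    for read in reads:
        offset, _ = best_offset(reference, read)
        for i, base in enumerate(read):
            position = offset + i
            coverage[position] += 1
            if base != reference[position]:
                differences.setdefault(position, []).append(base)
    return coverage, differences


reference = "AACACACCCAAGGGGGAAACTTTGGTCC"
reads = ["ACCCAAGGAGGA", "AAGGAGGAAACT", "GGAGGAAACTTTG"]  # toy reads, not real data
coverage, differences = pileup(reference, reads)
for position, bases in differences.items():
    print(f"position {position}: reference {reference[position]}, "
          f"reads say {bases}, coverage {coverage[position]}")
```

When every read covering a position disagrees with the reference in the same way, that position is a candidate single nucleotide polymorphism; a single disagreeing read among many is more likely a sequencing error.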
From “Raw” DNA to “Variants” (Secondary Analysis)
• Burrows-Wheeler Aligner (BWA)
• Uses the Burrows-Wheeler transform (also used in bzip2)
• Uses the Smith-Waterman algorithm
• Written in C
• Uses ~4 GB of memory for the human genome
http://bio-bwa.sourceforge.net http://bioinformatics.oxfordjournals.org/content/25/14/1754.full.pdf+html
Example:
$ bwa mem ref.fa read1.fq read2.fq > aln-pe.sam
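For readers who have not met the Burrows-Wheeler transform, here is a naive sketch of the transform and its inverse. It sorts all rotations of the text, which is fine for a slide-sized string but far too slow and memory-hungry for a genome; aligners such as BWA build a compressed index over the transform instead. The function names are illustrative.

```python
def bwt(text: str, terminator: str = "$") -> str:
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    if terminator in text:
        raise ValueError("text must not contain the terminator character")
    text += terminator
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str, terminator: str = "$") -> str:
    """Rebuild the original text by repeatedly prepending and re-sorting columns."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(transformed[i] + table[i] for i in range(len(transformed)))
    # The row that ends with the terminator is the original text.
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


print(bwt("banana"))                   # annb$aa
print(inverse_bwt(bwt("GATTACA")))     # GATTACA
```

The transform tends to group identical characters together, which is why compressors like it, and it is reversible, which is what lets an index built on it answer "where does this read occur?" without storing the genome many times over.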
From “Raw” DNA to “Variants” (Secondary Analysis)
FASTQ → Alignment → SAM
SAM → Convert to binary, BGZF-compressed (samtools) → BAM File + BAM File Index
http://samtools.github.io/hts-specs/SAMv1.pdf http://samtools.github.io
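SAM is plain tab-separated text, so a single alignment line can be picked apart without any library. The sketch below reads a few of the 11 mandatory columns and decodes two FLAG bits; the record is a made-up example and the helper names are illustrative. For BAM files you would use samtools or a binding to it rather than hand-written parsing.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flags
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # how the read lines up against the reference
    seq: str     # read bases
    qual: str    # base qualities (Phred+33 text)


def parse_sam_line(line: str) -> SamRecord:
    """Parse one alignment line (not a '@' header line) of a SAM file."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has at least 11 columns")
    return SamRecord(
        qname=fields[0], flag=int(fields[1]), rname=fields[2],
        pos=int(fields[3]), mapq=int(fields[4]), cigar=fields[5],
        seq=fields[9], qual=fields[10],
    )


def is_unmapped(record: SamRecord) -> bool:
    return bool(record.flag & 0x4)


def is_reverse_strand(record: SamRecord) -> bool:
    return bool(record.flag & 0x10)


# Made-up record: a 7-base read aligned to chr1 at position 100, reverse strand.
line = "read1\t16\tchr1\t100\t60\t7M\t*\t0\t0\tGATTACA\tIIIIIII"
record = parse_sam_line(line)
print(record.rname, record.pos, record.cigar, is_reverse_strand(record))
```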
From “Raw” DNA to “Variants” (Secondary Analysis)
BAM File
BAM File Index
http://www.broadinstitute.org/igv https://github.com/ekg/freebayes http://arxiv.org/abs/1207.3907 https://www.broadinstitute.org/gatk
Visualize: Integrative Genomics Viewer (IGV)
Variant Calling
Example:
$ freebayes -f ref.fa aln.bam > var.vcf
From “Raw” DNA to “Variants” (Secondary Analysis)
… and here are your variants (VCF file)!
http://samtools.github.io/hts-specs/VCFv4.2.pdf
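A VCF file is also tab-separated text: "##" meta lines, one "#CHROM" header line, then one line per variant site. Here is a minimal sketch that reads the first five columns and labels each alternate allele as a SNP, insertion, or deletion. The three data lines are invented for illustration, and the classification is deliberately simplistic (anything unusual is reported as "other").

```python
from typing import Iterable, Iterator, List, NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int          # 1-based position on the reference
    ref: str          # reference allele
    alts: List[str]   # one or more alternate alleles


def parse_vcf(lines: Iterable[str]) -> Iterator[Variant]:
    """Yield variants from VCF text, skipping header and blank lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        chrom, pos, _variant_id, ref, alt = line.split("\t")[:5]
        yield Variant(chrom, int(pos), ref, alt.split(","))


def variant_kind(ref: str, alt: str) -> str:
    """Rough label for a single REF/ALT pair."""
    if not set(alt) <= set("ACGTN"):
        return "other"  # e.g. symbolic alleles such as <DEL>
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) > len(alt):
        return "deletion"
    if len(ref) < len(alt):
        return "insertion"
    return "other"


toy_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t1000\t.\tG\tA\t50\tPASS\t.",
    "chr1\t2000\t.\tACT\tA\t40\tPASS\t.",
    "chr2\t3000\t.\tC\tCGG,T\t30\tPASS\t.",
]
variants = list(parse_vcf(toy_vcf))
kinds = [[variant_kind(v.ref, alt) for alt in v.alts] for v in variants]
print(kinds)
```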
What do we do with variant calls then?
Zooming in on the Central Dogma of Molecular Biology:
• The genetic code is redundant: several codons encode the same amino acid.
• But a mutation can still change the protein that gets built.
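Both bullets can be shown with the standard genetic code: 64 codons map to 20 amino acids plus a stop signal, so some single-base changes leave the amino acid unchanged and others do not. The sketch below builds the standard codon table and labels a single-base change within one codon; the function name `classify_change` is illustrative.

```python
from itertools import product

# Standard genetic code, listed in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_change(codon: str, index: int, new_base: str) -> str:
    """Label a single-base substitution within one codon."""
    mutated = codon[:index] + new_base + codon[index + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    if before == after:
        return "synonymous"
    return "non-synonymous"


print(CODON_TABLE["GAA"], CODON_TABLE["GAG"])   # E E  (redundancy)
print(classify_change("GAA", 2, "G"))           # synonymous
print(classify_change("GAG", 1, "T"))           # non-synonymous (E becomes V)
```

A change that turns a codon into a stop codon ("*" in the table) also counts as non-synonymous here; annotation tools report it as its own, more severe category.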
Image courtesy of Wikipedia
What do we do with variant calls then?
Annotation & Interpretation
• Functional Annotation: figure out whether the mutation is dangerous (use SnpEff)
  • Synonymous
  • Non-Synonymous
  • Frame-shift
  • …
• Put in the context of existing findings
  • dbSNP
  • ClinVar
  • COSMIC
  • ESP
  • 1000 Genomes
  • …
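"Putting variants in the context of existing findings" is, at its simplest, a join between your calls and a table of known variants keyed by position and alleles. The sketch below uses an in-memory dictionary with made-up entries; the IDs and significance labels are placeholders, not records from dbSNP, ClinVar, or any other real database.

```python
from typing import Dict, List, Optional, Tuple

VariantKey = Tuple[str, int, str, str]  # (chromosome, position, ref, alt)

# Hypothetical entries standing in for a downloaded annotation database.
KNOWN_VARIANTS: Dict[VariantKey, Dict[str, str]] = {
    ("chr1", 1000, "G", "A"): {"id": "toy0001", "significance": "benign"},
    ("chr2", 2500, "C", "T"): {"id": "toy0002", "significance": "pathogenic"},
}


def annotate(calls: List[VariantKey],
             known: Dict[VariantKey, Dict[str, str]]) -> List[dict]:
    """Attach the known-variant entry (or None) to every called variant."""
    annotated = []
    for key in calls:
        entry: Optional[Dict[str, str]] = known.get(key)
        annotated.append({"variant": key, "known": entry is not None,
                          "annotation": entry})
    return annotated


calls = [("chr1", 1000, "G", "A"), ("chr3", 42, "T", "G")]
results = annotate(calls, KNOWN_VARIANTS)
novel = [result["variant"] for result in results if not result["known"]]
print(novel)
```

Variants that match nothing are often the interesting ones, which is why functional annotation and database lookup are used together rather than as alternatives.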
http://snpeff.sourceforge.net http://www.ncbi.nlm.nih.gov/SNP
CASE STUDY:
Bringing three disciplines together
• Statistics, Data Analytics
• Bioinformatics, Genomics
• Big Data Technologies, Compute and Data Science
Case Study: Bina GMS
Sequencing → 2º Analysis → 3º Analysis → Interpretation
Meaningful Results & Clinical Relevance
20+ DBs including over 140 annotations:
HGMD // PGMD // ClinVar // COSMIC // dbNSFP // TRANSFAC // 1000 Genomes and more.
Tools & Workflows for:
WGS // WES // RNAseq // Somatic Mutations // Multi-sample // Gene Panels
Bina Products are for Research Use Only
Bina RAVE Architecture (1)
[Architecture diagram; labels recovered from the slide]
• Interactive UI // Command Line SDK
• Secure REST Interface
• Portal Server(s)
• Portal Backend (Distributed)
  • Workflow Definition
  • Templates
  • QC/Monitoring
  • System Management/Updates
• Workflow Generation
• Task Dependency Graphs
• Distributed Workflow Orchestration
• Static Scheduling
• Secure Push Interface
• Executor Nodes / VMs
  • Executor
  • Dynamic Scheduling
  • Execution Engine
  • Local Storage
• Workflows, Tools, Commands
• Network Storage: Input/Output Data
Bina RAVE Architecture (2)
[Architecture diagram; labels recovered from the slide]
Genome-aware Workflow Generation
• JSON Request (UI/CMD/SDK)
• Workflows (DNA, RNA, ...)
• Tools (BWA, GATK, SVs)
• Commands (Samtools, GATK, URL, ...)
• Services (Logging, Storage, Caching, Streaming)
• Task Graph
Distributed Coordination
• Syncing all Nodes
• Dependency Graph
• Task Status
• Graph, Triggers, Updates
• Static Scheduling
Genome-aware Distributed Execution Framework
• Nodes / VMs
• Executor
• Dynamic Scheduling
• Dependency-Aware Execution
• Locality-Aware Execution (Caching)
• Streaming Through “Engines”
• In-Memory Computation
• Input (BAM, FASTQ)
• Output (VCF, SV)
• Network storage: input/output data
• Local storage
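"Dependency-aware execution" over a task graph can be illustrated with a topological ordering: a task becomes runnable only when everything it depends on has finished. This is a generic sketch using Python's standard `graphlib` module (Python 3.9+), not a description of how Bina's scheduler is implemented; the task names are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def run_workflow(dependencies: Dict[str, Set[str]],
                 run_task: Callable[[str], None]) -> List[str]:
    """Run tasks so that each one starts only after its dependencies are done."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    order: List[str] = []
    while sorter.is_active():
        # Everything returned here is independent and could be dispatched
        # to different executor nodes in parallel; we run it sequentially.
        for task in sorted(sorter.get_ready()):
            run_task(task)
            order.append(task)
            sorter.done(task)
    return order


# Hypothetical secondary-analysis workflow: task -> tasks it depends on.
workflow = {
    "align": {"fastq"},
    "sort_index": {"align"},
    "qc_report": {"align"},
    "call_variants": {"sort_index"},
    "annotate": {"call_variants"},
}
order = run_workflow(workflow, run_task=lambda name: print("running", name))
```

In this toy graph `qc_report` and `sort_index` become ready at the same time once `align` finishes, which is exactly the kind of parallelism a distributed scheduler exploits.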
Bina AAiM Architecture
[Architecture diagram; labels recovered from the slide]
• Input: VCF
• Bina Secondary
• UI/CMD
• Annotation and Indexing Engine
  • Clinical Annotations
  • Genomic Context
  • Predicted Functional Impact
  • Population Frequency
• Distributed Execution Framework
  • Annotation (Join static DBs)
  • Indexing & Functional Filters
  • MapReduce Jobs
• Analytics Engine
  • Queries, Filters, Variant Sets, Reports
  • Tumor/Normal, Pedigree, Cohort Study, Proband
• NoSQL Data Store, Indices, Metadata Store
What next?
http://www.genomicsengland.co.uk http://www.personalgenomes.org
• Apply this process to different domains and applications
• Come up with ways of ranking variants
• Keep learning from data
• Sequence everyone!
  • Genomics England 100,000 Genomes Project
  • Personal Genome Project
• Decrease cost
• Increase accuracy
• Make the technology faster and more usable!
Map of sequencers around the globe: http://omicsmaps.com
Challenges in Genomics
• Accuracy
  • Gold standard? Which tool is best? There are so many!
  • NIST, DREAM Challenge
• Need to speak the same language… interoperability
  • Global Alliance
  • API, format, metadata, …
• Regulations
  • HIPAA, CLIA: security, accuracy, anonymity and encryption
• Scalability
  • Storage
    • Need terabytes
    • Each genome could be up to 1 TB
  • Computation
    • We still pretty much have no idea what most of DNA is doing…
    • Can’t run on a single machine. Need to scale to many nodes
    • Need to leverage cloud technologies
• Provenance and auditability
• Importance of usability
  • Different personas
  • Errors are very expensive (life and death)
  • Better visualization → faster discovery → faster cure
Why should software engineers move to genomics?
Because genomics needs you, and you need genomics.
Work on something that matters! (#Code2Cure)
Things that SWEs do very well:
• Automation
• Elegant solutions for complex problems
• Enabling non-savvy users by making the technology robust and accessible
• Scale
• Optimization
• Building production-grade platforms
  • Tested
  • Robust
  • Secure
THESE ARE ALL NEEDED IN GENOMICS YESTERDAY!
Image courtesy of http://silvsoul.blogspot.com
Open projects/resources to check out/contribute to
Projects/Conferences
• Galaxy -- http://galaxyproject.org
• Arvados -- https://arvados.org
• Open Bio Conference -- http://www.open-bio.org
• BioVis -- http://www.biovis.net
• Biopython -- http://biopython.org
• Global Alliance for Genomics and Health -- http://ga4gh.org
• Rosalind Project -- http://rosalind.info
Blogs/Websites
• http://bcb.io
• http://nextgenseek.com/
• http://ngs-expert.com/
• http://seqanswers.com/
• http://core-genomics.blogspot.com
• http://www.genomesunzipped.org
• http://genomeweb.com
Thank you. And I hope you consider moving to genomics! http://info.bina.com/code2cure-community
Twitter: @mirkiani
Amirhossein Kiani, Sr. Lead Software Engineer
Email: amir@bina.com