#code2cure: a field guide for software engineers on their journey to the world of genomics


Upload: amirhossein-kiani

Post on 21-Apr-2017


TRANSCRIPT

Page 1: #Code2Cure: A field guide for software engineers on their journey to the world of genomics

#Code2Cure: Engineering Genomics

: @mirkiani

A field guide for software engineers on their journey to the world of genomics.

Amirhossein Kiani, Sr. Lead Software Engineer

: [email protected]

Image courtesy of http://circos.ca

DISCLAIMER: The views expressed in this talk are mine alone and not those of my employer.

Bina products are for Research Use Only. Not for use in diagnostic procedures.

Also, I’m a computer scientist by training, trying to help people with a similar background learn about genomics. The scientific concepts in this talk have therefore been heavily simplified.

Page 3: #Code2Cure: A field guide for software engineers on their journey to the world of genomics

www.bina.com

Why Genomics?

$3,000,000,000, 13 years

http://en.wikipedia.org/wiki/Human_Genome_Project

Past Present

$1000, 24 hours

Future

Page 4: #Code2Cure: A field guide for software engineers on their journey to the world of genomics

Why Genomics?

Some things we could do with genomics:

• Carrier Screening
• Prenatal Screening
• Newborn Screening
• Inherited Disease
• Infectious Disease
• Cancer Diagnostics
• Microbiome
• Personalized Medicine

Page 5: #Code2Cure: A field guide for software engineers on their journey to the world of genomics

But I have no genomics background! It’s OK.

Page 6: #Code2Cure: A field guide for software engineers on their journey to the world of genomics

My personal story…

Now

Then

Page 8: #Code2Cure: A field guide for software engineers on their journey to the world of genomics

Crash Course on Genomics

Genomics is the field that studies the structure and function of genomes.

http://en.wikipedia.org/wiki/Genomics http://en.wikipedia.org/wiki/RNA http://en.wikipedia.org/wiki/Protein

DNA → RNA → Protein → You!

Page 9: #Code2Cure: A field guide for software engineers on their journey to the world of genomics

How do we figure out what’s in DNA?

Like everything else, we turn the analog signal into a digital one, and then analyze it.

http://en.wikipedia.org/wiki/DNA_sequencing http://en.wikipedia.org/wiki/FASTQ_format

Illumina, Ion Torrent, Genia, …

Primary Analysis

FASTQ Format

Image courtesy of PersonalGenomes.org
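To make the FASTQ output of primary analysis concrete, here is a minimal Python sketch of a reader. It assumes the common four-lines-per-read layout and the Phred+33 quality encoding; the names `FastqRecord`, `parse_fastq` and `error_probability` are made up for this illustration, and the example read is invented.

```python
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: list[int]  # Phred scores, one per base


def parse_fastq(lines: list[str], offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text laid out as four lines per read."""
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = (line.strip() for line in lines[i:i + 4])
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        # Each quality character encodes a Phred score as its ASCII code minus the offset.
        yield FastqRecord(header[1:], seq, [ord(c) - offset for c in qual])


def error_probability(phred: int) -> float:
    """A Phred score Q corresponds to an error probability of 10^(-Q/10)."""
    return 10 ** (-phred / 10)


# Invented example: one 4-base read with qualities 40, 40, 20, 0.
example = ["@read1", "ACGT", "+", "II5!"]
records = list(parse_fastq(example))
```

The quality string is the "digital" trace of how confident the sequencer was in each base, which is why downstream tools carry it along with the sequence.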

Page 10: #Code2Cure: A field guide for software engineers on their journey to the world of genomics

RAW Data to Variants (Secondary Analysis)

Step 1. Alignment

http://en.wikipedia.org/wiki/DNA_sequencing http://en.wikipedia.org/wiki/FASTQ_format

Image courtesy of Wall Woodworks

Image courtesy of Wallpaper Up

Page 11: #Code2Cure: A field guide for software engineers on their journey to the world of genomics

From “Raw” DNA to “Variants” (Secondary Analysis)

Step 1. Short-Read Sequence Alignment

http://en.wikipedia.org/wiki/Reference_genome http://en.wikipedia.org/wiki/Single-nucleotide_polymorphism http://en.wikipedia.org/wiki/Indel http://en.wikipedia.org/wiki/Structural_variation

Reference Genome (~3B bases):

AACACACCCAAGGGGGAAACTTTGGTCCACCCAAGGGGGAAACCCAAGGGGGAAACTTTG

Coverage (reads stacked against the reference):

ACTTTGGTCCACCCAAGGAAGGGGGACACCCAAGGACACCC__GGGGGAAACT

GGACACCCAAGGGGGAAACCCAAGGGGGACACCC

ACCC__GGGGGAAACTTTGAACACACCC__GGGGGAA

Deletion // Single Nucleotide Polymorphism
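The idea of stacking short reads against a reference and reading variants off the pile can be sketched in a few lines of Python. This is a toy under loud assumptions: it places each read by brute-force Hamming distance with no gaps (real aligners use an index and handle insertions and deletions), the reference is just the first 28 bases of the illustrative string on this slide, and the three reads are invented so that they share one G>T change.

```python
from collections import Counter


def best_hit(reference: str, read: str) -> tuple[int, int]:
    """Return (position, mismatches) of the best ungapped placement of the read."""
    best_pos, best_mm = -1, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        mm = sum(a != b for a, b in zip(reference[pos:pos + len(read)], read))
        if mm < best_mm:
            best_pos, best_mm = pos, mm
    return best_pos, best_mm


def pileup(reference: str, reads: list[str]) -> list[Counter]:
    """Count the bases observed at each reference position across all reads."""
    columns = [Counter() for _ in reference]
    for read in reads:
        pos, _ = best_hit(reference, read)
        for offset, base in enumerate(read):
            columns[pos + offset][base] += 1
    return columns


reference = "AACACACCCAAGGGGGAAACTTTGGTCC"  # toy reference, not a real genome
reads = ["CACCCAAGGTGG", "CCAAGGTGGAAA", "AAGGTGGAAACT"]  # invented reads

columns = pileup(reference, reads)
coverage = [sum(c.values()) for c in columns]  # reads covering each position

# Candidate SNPs: covered positions where every read disagrees with the reference.
candidates = [
    (i, reference[i], next(iter(c)))
    for i, c in enumerate(columns)
    if len(c) == 1 and reference[i] not in c
]
```

Here `candidates` holds 0-based positions; a real variant caller weighs base qualities, mapping qualities and coverage instead of requiring unanimous reads.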

Page 12: #Code2Cure: A field guide for software engineers on their journey to the world of genomics

From “Raw” DNA to “Variants” (Secondary Analysis)

• Burrows-Wheeler Aligner (BWA)
• Uses Burrows-Wheeler transform (also used in bzip2)
• Uses Smith-Waterman algorithm
• Written in C
• Uses ~4GB memory for human genome

http://bio-bwa.sourceforge.net http://bioinformatics.oxfordjournals.org/content/25/14/1754.full.pdf+html

Example:

$ bwa mem ref.fa read1.fq read2.fq > aln-pe.sam
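For readers who have not met the Burrows-Wheeler transform, the following Python sketch shows the transform and its inverse on a short string. It is deliberately naive (it sorts all rotations, which is quadratic in memory) and is not how BWA implements its index; the function names and the `$` sentinel are choices made for this illustration.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler transform via sorted rotations (fine for a demo only)."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str, sentinel: str = "$") -> str:
    """Invert the transform by repeatedly prepending the column and re-sorting."""
    table = [""] * len(transformed)
    for _ in transformed:
        table = sorted(c + row for c, row in zip(transformed, table))
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


encoded = bwt("GATTACA")
decoded = inverse_bwt(encoded)  # round-trips back to "GATTACA"
```

The transform tends to group equal characters together, which helps compression, and it can be searched without being decoded, which is what makes it attractive for indexing a 3-billion-base reference.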

Page 13: #Code2Cure: A field guide for software engineers on their journey to the world of genomics

From “Raw” DNA to “Variants” (Secondary Analysis)

FASTQ → Alignment → SAM

SAM → Convert to binary, BGZF-compressed (samtools) → BAM File + BAM File Index

http://samtools.github.io/hts-specs/SAMv1.pdf http://samtools.github.io
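A SAM file is tab-separated text, so the first few mandatory fields of an alignment line can be read with plain string handling. The sketch below parses the first six fields, decodes two flag bits, and measures how many reference bases a CIGAR string covers. The class and function names are invented for this example and the alignment line is made up; in practice a library such as the samtools/htslib tooling would be used instead.

```python
import re
from typing import NamedTuple

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


class Alignment(NamedTuple):
    read_name: str
    flag: int
    reference: str
    position: int  # 1-based leftmost mapping position
    mapq: int
    cigar: list[tuple[int, str]]

    @property
    def is_unmapped(self) -> bool:
        return bool(self.flag & 0x4)

    @property
    def is_reverse(self) -> bool:
        return bool(self.flag & 0x10)


def parse_sam_line(line: str) -> Alignment:
    """Parse the first six mandatory fields of one SAM alignment line."""
    qname, flag, rname, pos, mapq, cigar = line.rstrip("\n").split("\t")[:6]
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    return Alignment(qname, int(flag), rname, int(pos), int(mapq), ops)


def reference_span(cigar: list[tuple[int, str]]) -> int:
    """Reference bases covered: M, D, N, = and X consume the reference."""
    return sum(n for n, op in cigar if op in "MDN=X")


# Invented record: reverse-strand read with a 2-base deletion and a 1-base insertion.
line = "read1\t16\tchr1\t100\t60\t5M2D3M1I4M\t*\t0\t0\tACGTACGTACGTA\t*"
aln = parse_sam_line(line)
```

BAM stores the same records in a compressed binary form, and the index lets tools jump straight to a genomic region instead of scanning the whole file.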

Page 14: #Code2Cure: A field guide for software engineers on their journey to the world of genomics

From “Raw” DNA to “Variants” (Secondary Analysis)

BAM File

BAM File Index

http://www.broadinstitute.org/igv https://github.com/ekg/freebayes http://arxiv.org/abs/1207.3907 https://www.broadinstitute.org/gatk

Visualize: Integrative Genomics Viewer (IGV)

Variant Calling

Example:

$ freebayes -f ref.fa aln.bam > var.vcf

Page 15: #Code2Cure: A field guide for software engineers on their journey to the world of genomics

From “Raw” DNA to “Variants” (Secondary Analysis)

… and here are your variants (VCF file)!

http://samtools.github.io/hts-specs/VCFv4.2.pdf
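A VCF file is also tab-separated text: meta lines start with `##`, one header line starts with `#CHROM`, and each remaining line describes a variant. The sketch below reads the eight fixed columns only and ignores per-sample genotype columns; the `Variant` type, `parse_vcf` and the two data lines are invented for illustration.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class Variant(NamedTuple):
    chrom: str
    pos: int  # 1-based position of the first REF base
    ref: str
    alts: list[str]
    qual: Optional[float]
    info: dict[str, str]


def parse_vcf(lines: Iterable[str]) -> Iterator[Variant]:
    """Yield variants, skipping '##' meta lines and the '#CHROM' header."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt, qual, _filter, info = fields[:8]
        info_map = {}
        if info != ".":
            for entry in info.split(";"):
                key, _, value = entry.partition("=")
                info_map[key] = value  # flag-style keys get an empty value
        yield Variant(chrom, int(pos), ref, alt.split(","),
                      None if qual == "." else float(qual), info_map)


# Invented example file content.
vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t14\t.\tG\tT\t50\tPASS\tDP=3;AF=1",
    "chr1\t30\t.\tCAA\tC\t.\t.\tDP=12;INDEL",
]
variants = list(parse_vcf(vcf_lines))
```

The first record is a single-base substitution and the second a two-base deletion, the same two variant kinds pictured on the alignment slide.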

Page 16: #Code2Cure: A field guide for software engineers on their journey to the world of genomics

What do we do with variant calls then?

Zooming in on the Central Dogma of Molecular Biology:

• There is redundancy in the genetic code: several codons encode the same amino acid.
• But a mutation can still change the protein that gets encoded.
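Both bullets can be checked with a small Python sketch that builds the standard genetic code and translates a toy coding sequence before and after a single-base change. The sequence and the `substitute` helper are invented for this example, and the sketch assumes the input is already the coding strand in the correct reading frame.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: 64 codons map onto 20 amino acids plus stop ('*').
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(coding_dna: str) -> str:
    """Translate codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(coding_dna) - 2, 3):
        aa = CODON_TABLE[coding_dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def substitute(seq: str, index: int, base: str) -> str:
    """Return seq with the base at a 0-based index replaced (hypothetical helper)."""
    return seq[:index] + base + seq[index + 1:]


original = "ATGGAAGTTTGA"               # ATG GAA GTT TGA
silent = substitute(original, 5, "G")   # GAA -> GAG, still glutamate (E)
changed = substitute(original, 4, "T")  # GAA -> GTA, glutamate (E) -> valine (V)

proteins = [translate(s) for s in (original, silent, changed)]
leucine_codons = sum(1 for aa in CODON_TABLE.values() if aa == "L")
```

The first change is absorbed by the redundancy of the code, while the second alters the protein; leucine alone is encoded by six different codons.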

Image courtesy of Wikipedia

Page 17: #Code2Cure: A field guide for software engineers on their journey to the world of genomics

What do we do with variant calls then?

Annotation & Interpretation

• Functional Annotation: figure out if the mutation is dangerous (use SnpEff)
  • Synonymous
  • Non-Synonymous
  • Frame-shift
  • …

• Put in the context of existing findings
  • dbSNP
  • ClinVar
  • COSMIC
  • ESP
  • 1000 Genomes
  • …

http://snpeff.sourceforge.net http://www.ncbi.nlm.nih.gov/SNP
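One small piece of functional annotation can be shown without any database: deciding from the REF and ALT alleles alone whether a change inside a coding region shifts the reading frame. This Python sketch is a simplification and not how SnpEff works internally; the function name and labels are made up, and telling synonymous from non-synonymous changes additionally needs the transcript's codon context.

```python
def classify_coding_change(ref: str, alt: str) -> str:
    """Coarse label for a REF/ALT pair that falls inside a coding region."""
    if len(ref) == len(alt) == 1:
        return "single-nucleotide substitution"
    delta = len(alt) - len(ref)
    if delta == 0:
        return "multi-nucleotide substitution"
    kind = "insertion" if delta > 0 else "deletion"
    # Codons are three bases long, so only length changes divisible by 3
    # leave the downstream reading frame intact.
    frame = "in-frame" if delta % 3 == 0 else "frame-shift"
    return f"{frame} {kind}"


labels = [
    classify_coding_change("G", "T"),     # one base swapped
    classify_coding_change("CAA", "C"),   # two bases deleted
    classify_coding_change("C", "CAAG"),  # three bases inserted
]
```

A frame-shift changes every codon after it, which is why it is usually treated as more severe than a single substituted base.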

Page 18: #Code2Cure: A field guide for software engineers on their journey to the world of genomics

CASE STUDY:

Page 19: #Code2Cure: A field guide for software engineers on their journey to the world of genomics

Bringing three disciplines together

• Statistics // Data Analytics
• Bioinformatics // Genomics
• Big Data Technologies // Compute and Data Science

Page 20: #Code2Cure: A field guide for software engineers on their journey to the world of genomics

Case Study: Bina GMS

Sequencing → 2º Analysis → 3º Analysis → Interpretation

Meaningful Results & Clinical Relevance

20+ DBs including 140+ annotations:

HGMD // PGMD // ClinVar // COSMIC // dbNSFP // TRANSFAC // 1000 Genomes and more.

Tools & Workflows for:

WGS // WES // RNAseq // Somatic Mutations // Multi sample // Gene Panels

Bina Products are for Research Use Only

Ketchum, Elias {DQEE~Pleasanton}
This is an area to be aware of. I'm not proposing we change the wording here, but rather that you tailor how you talk about the utility of interpreting the results: they are not currently intended for clinical diagnostic purposes, only for research.
Page 21: #Code2Cure: A field guide for software engineers on their journey to the world of genomics

Bina RAVE Architecture (1)

Diagram labels:

• Secure REST Interface
• Portal Server(s)
• Portal Backend (Distributed)
• Workflow Definition
• Templates
• QC/Monitoring
• System Management/Updates
• Task Dependency Graphs
• Distributed Workflow Orchestration
• Secure Push Interface
• Workflow Generation
• Interactive UI // Command Line SDK
• Executor
• Dynamic Scheduling
• Local Storage
• Execution Engine
• Executor Nodes / VMs
• Network Storage – Input/Output Data
• Static Scheduling
• Workflows
• Tools
• Commands

Page 22: #Code2Cure: A field guide for software engineers on their journey to the world of genomics

Bina RAVE Architecture (2)

Diagram labels:

• Workflows (DNA, RNA ..)
• Tools (BWA, GATK, SVs)
• Services (Logging, Storage, Caching, Streaming)
• Commands (Samtools, GATK, URL, ..)
• Genome-aware – Workflow Generation
• Distributed Coordination
• Task Graph
• JSON Request (UI/CMD/SDK)
• Nodes / VMs
• Executor
• Dynamic scheduling
• Graph
• Triggers
• Updates
• Genome aware – Distributed Execution Framework
• Syncing all Nodes
• Dependency Graph
• Task Status
• Network storage – Input/output data
• Local storage
• Output (VCF, SV)
• Input (BAM, FASTQ)
• Static Scheduling

Execution properties:

• Dependency Aware Execution
• Locality Aware Execution (Caching)
• Streaming Through “Engines”
• In-Memory Computation
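Dependency-aware execution of a task graph can be illustrated with Python's standard-library `graphlib`. This is a generic sketch and says nothing about how Bina RAVE is actually implemented: the workflow, the task names and the `run` function are all hypothetical, and tasks are "executed" by simply appending their names to a list.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical secondary-analysis workflow: task -> set of tasks it depends on.
workflow = {
    "align": set(),
    "sort_and_index": {"align"},
    "call_variants": {"sort_and_index"},
    "call_structural_variants": {"sort_and_index"},
    "annotate": {"call_variants", "call_structural_variants"},
}


def run(graph, execute):
    """Run tasks in dependency order; return the batches that were ready together."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose dependencies are all done
        for task in ready:  # a real scheduler would dispatch these to nodes in parallel
            execute(task)
            sorter.done(task)
        batches.append(ready)
    return batches


log = []
batches = run(workflow, log.append)
```

Each batch contains tasks with no unmet dependencies, so the two variant-calling steps land in the same batch and could run on different nodes at the same time.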

Page 23: #Code2Cure: A field guide for software engineers on their journey to the world of genomics

Bina AAiM Architecture

Annotation and Indexing Engine

Diagram labels:

• Input VCF
• UI/CMD
• Clinical Annotations
• Genomic Context
• Prediction Func. Impact
• Population Frequency
• Distributed Execution Framework
• Annotation (Join static DBs)
• Indexing & Functional Filters
• MapReduce Jobs
• Analytics Engine
• NoSQL Data Store
• Indices
• Metadata Store
• Tumor/Normal
• Pedigree
• Queries, Filters, Variant Sets, Reports
• Bina Secondary
• Cohort Study
• Proband
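The "join static DBs" and "functional filters" steps can be pictured as a left join of called variants against lookup tables followed by a filter. The Python sketch below is only an illustration of that idea, not Bina AAiM's implementation: the tables, their values, the 1% frequency cutoff and all names are invented.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


# Hypothetical static annotation tables keyed by variant; all values are made up.
POPULATION_FREQUENCY = {
    Variant("chr1", 14, "G", "T"): 0.32,
    Variant("chr2", 200, "A", "C"): 0.0004,
}
CLINICAL_SIGNIFICANCE = {
    Variant("chr2", 200, "A", "C"): "pathogenic",
}


def annotate(variants):
    """Left-join each variant against the static tables; missing keys become None."""
    return [
        {
            "variant": v,
            "frequency": POPULATION_FREQUENCY.get(v),
            "clinical": CLINICAL_SIGNIFICANCE.get(v),
        }
        for v in variants
    ]


def rare(annotated, max_frequency=0.01):
    """Keep variants that are absent from the frequency table or below the cutoff."""
    return [
        a for a in annotated
        if a["frequency"] is None or a["frequency"] < max_frequency
    ]


calls = [
    Variant("chr1", 14, "G", "T"),
    Variant("chr2", 200, "A", "C"),
    Variant("chr3", 5, "T", "TA"),
]
shortlist = rare(annotate(calls))
```

At genome scale the same join runs over millions of variants and many databases, which is where distributed execution and indexing come in.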

Page 24: #Code2Cure: A field guide for software engineers on their journey to the world of genomics

What next?

http://www.genomicsengland.co.uk http://www.personalgenomes.org

• Apply this process to different domains and applications
• Come up with ways of ranking variants
• Keep learning from data
• Sequence everyone!
  • Genomics England 100,000 Genomes Project
  • Personal Genome Project

• Decrease cost
• Increase accuracy
• Make the technology faster and more usable!

Map of sequencers around the globe: http://omicsmaps.com

Page 25: #Code2Cure: A field guide for software engineers on their journey to the world of genomics

Challenges in Genomics

• Accuracy
  • Gold standard? Which tool is best? There are so many!
  • NIST, DREAM Challenge
• Need to speak the same language… interoperability
  • Global Alliance
  • API, format, metadata, …
• Regulations
  • HIPAA, CLIA: security, accuracy, anonymity and encryption
• Scalability
  • Storage
    • Need terabytes
    • Each genome could be up to 1 TB
  • Computation
    • We still pretty much have no idea what most of DNA is doing…
    • Can’t run on a single machine. Need to scale to many nodes
    • Need to leverage cloud technologies
• Provenance and auditability
• Importance of usability
  • Different personas
  • Errors are very expensive (life and death)
  • Better visualization → faster discovery → faster cure

Ketchum, Elias {DQEE~Pleasanton}
I would recommend taking out "life or death" from this slide. Maybe swap it for "high-risk medical field" and then speak to what that means to you (that errors have implications for the health and well-being of patients).
Page 26: #Code2Cure: A field guide for software engineers on their journey to the world of genomics

Why should software engineers move to genomics?

Because genomics needs you, and you need genomics.

Work on something that matters! (#Code2Cure)

Things that SWEs do very well:

• Automation
• Elegant solutions for complex problems
• Enabling non-savvy users by making the technology robust and accessible
• Scale
• Optimization
• Building production-grade platforms
  • Tested
  • Robust
  • Secure

THESE ARE ALL NEEDED IN GENOMICS YESTERDAY!

Image courtesy of http://silvsoul.blogspot.com

Page 27: #Code2Cure: A field guide for software engineers on their journey to the world of genomics

Open projects/resources to check out/contribute to

Projects/Conferences

• Galaxy -- http://galaxyproject.org
• Arvados -- https://arvados.org
• Open Bio Conference -- http://www.open-bio.org
• BioVis -- http://www.biovis.net
• Biopython -- http://biopython.org
• Global Alliance for Genomics and Health -- http://ga4gh.org
• Rosalind Project -- http://rosalind.info

Blogs/Websites

• http://bcb.io
• http://nextgenseek.com/
• http://ngs-expert.com/
• http://seqanswers.com/
• http://core-genomics.blogspot.com
• http://www.genomesunzipped.org
• http://genomeweb.com

Page 28: #Code2Cure: A field guide for software engineers on their journey to the world of genomics

Thank you. And I hope you consider moving to genomics!

http://info.bina.com/code2cure-community

: @mirkiani

Amirhossein Kiani, Sr. Lead Software Engineer

: [email protected]