analytics - homolog.us · ngs workflow: bina analytics solutions . 4. sequencing . 2º analysis ....

Accurate, Scalable and Easy to Use Genomic Data Analysis analytics Gianfranco de Feo, Ph. D.


Post on 04-Oct-2020


Page 1

Accurate, Scalable and Easy to Use Genomic Data Analysis

analytics Gianfranco de Feo, Ph. D.

Page 2

www.binatechnologies.com

The Opportunity and Challenge

Clinical Research Transformation:

• Ability to sequence entire genomes at accessible costs and timeframes
• Robustness of sequencing technologies: reagents and instruments
• Deluge of data linking clinical phenotypes to genomic aberrations

The Challenge:

• Huge amounts of data (both in sequence volume and in gigabytes/petabytes)
  • Large investments in bioinformatics/IT/engineering to handle the data
• Analytical workflows are immature and prone to errors
  • Large efforts have produced public-domain tools that are ‘best-in-class’ but difficult to use
  • Software tools and algorithms are constantly being updated
• Much more work is required to link genomic aberrations to clinical actionability

Page 3

Statistics Data Analytics Bioinformatics

Genomics

Big Data Technologies Compute and Data Science

Page 4

NGS Workflow: Bina Analytics solutions

Sequencing → 2º Analysis → 3º Analysis → Interpretation

• Sequencing: Raw reads
• 2º Analysis: Variant calling (SVs and SNVs), RNAseq analysis
• 3º Analysis: Annotation (public and private DBs)
• Interpretation: Scientific and medical interpretation

Integrated Workflows for: Whole Genome, Whole Exome, RNAseq, and targeted panels

Page 5

ACCURACY SCALABILITY SIMPLICITY.

Page 6

ACCURACY

Tools: TopHat, Cufflinks, BWA, Pindel, Picard, BreakDancer, GATK, BreakSeq, CNVnator, SAMtools, Bowtie

Page 7

‘Under the hood’: DNA pipeline

Whole Genome, Whole Exome and Targeted Panel Datasets
Single/Multi-Node Bina Platform with Genome-Aware Load-Balancing

1. Alignment Engine
   Tools: Parallelized BWA 0.7, Bina Aligner
   Formats: in: FASTQ
2. In-Memory Sorter
   Tools: In-memory Bina proprietary sorting, duplicate marking, and QC calculation
   Formats: out: Sorted BAM
3. Parallelized GATK Pipeline (Best Practices)
   Tools: GATK 1.x, 2.x: (1) RealignerTargetCreator, (2) IndelRealigner, (3) BaseRecalibrator, (4) PrintReads, (5) UnifiedGenotyper, (6) VQSR
   Formats: in: Sorted BAM; out: Realigned BAM, Recalibrated BAM, VCF
4. Structural and Copy Number Variants
   Tools: SV tools BreakDancer, CNVnator, BreakSeq, Pindel, SVMerge
   Formats: in: Re-calibrated BAM; out: Custom, VCF
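The GATK stage order on this slide can be sketched as a small command builder. This is only an illustration, assuming GATK 2.x-style invocations through `GenomeAnalysisTK.jar`; the `build_gatk_commands` helper, the file-naming scheme, and the use of a single known-sites VCF are hypothetical, and nothing here reflects how the Bina platform parallelizes the steps. VQSR is left out because it needs site-specific training resources.

```python
from typing import List

GATK_JAR = "GenomeAnalysisTK.jar"  # assumed jar name and location


def build_gatk_commands(ref: str, sorted_bam: str, known_sites: str,
                        prefix: str) -> List[List[str]]:
    """Return argv lists for GATK steps 1-5, each consuming the previous output."""
    base = ["java", "-jar", GATK_JAR, "-R", ref]
    intervals = f"{prefix}.intervals"      # hypothetical file names
    realigned = f"{prefix}.realigned.bam"
    recal_table = f"{prefix}.recal.grp"
    recal_bam = f"{prefix}.recal.bam"
    raw_vcf = f"{prefix}.raw.vcf"
    return [
        # 1. Find regions that need local realignment around indels
        base + ["-T", "RealignerTargetCreator", "-I", sorted_bam, "-o", intervals],
        # 2. Realign reads in those regions
        base + ["-T", "IndelRealigner", "-I", sorted_bam,
                "-targetIntervals", intervals, "-o", realigned],
        # 3. Model systematic base-quality errors using known variant sites
        base + ["-T", "BaseRecalibrator", "-I", realigned,
                "-knownSites", known_sites, "-o", recal_table],
        # 4. Write a BAM with recalibrated base qualities
        base + ["-T", "PrintReads", "-I", realigned,
                "-BQSR", recal_table, "-o", recal_bam],
        # 5. Call SNPs and indels
        base + ["-T", "UnifiedGenotyper", "-I", recal_bam, "-o", raw_vcf],
    ]


commands = build_gatk_commands("ref.fa", "sample.sorted.bam", "dbsnp.vcf", "sample")
for argv in commands:
    print(" ".join(argv))
```

Returning argument lists rather than running anything keeps the sketch side-effect free; each list could be handed to `subprocess.run` once the paths are real.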

Page 8

‘Under the hood’: RNA pipeline

Single/Multi-Node Bina Platform with Per-Sample Load-Balancing

Input format: FASTQ

1. Alignment
   Tool set: Tophat2, SpliceMap; In-Memory Sorting, QC Calculation
   Formats: Sorted BAM, BED
2. Assembly
   Tool set: Cufflinks, LSC (long read error correction)
   Formats: .FPKM_tracking, GTF
3. Expression
   Tool set: Cuffmerge/Cuffdiff, IDP (Isoform Detection and Prediction)
   Formats: GTF, DIFF, .*_tracking
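As an illustration of consuming the `.FPKM_tracking` output named on this slide, the sketch below reads a tab-delimited tracking table and keeps entries above an expression threshold. It assumes the usual Cufflinks column names (`tracking_id`, `FPKM`, `FPKM_status`); the `read_fpkm` helper, the threshold, and the three example rows are made up for the example.

```python
import csv
import io
from typing import Dict, TextIO


def read_fpkm(handle: TextIO, min_fpkm: float = 0.0) -> Dict[str, float]:
    """Map tracking_id -> FPKM for rows whose status is OK and FPKM >= min_fpkm."""
    reader = csv.DictReader(handle, delimiter="\t")
    result: Dict[str, float] = {}
    for row in reader:
        if row["FPKM_status"] != "OK":  # skip rows the quantifier flagged
            continue
        fpkm = float(row["FPKM"])
        if fpkm >= min_fpkm:
            result[row["tracking_id"]] = fpkm
    return result


# Made-up rows with only the columns this sketch needs
EXAMPLE = (
    "tracking_id\tgene_short_name\tFPKM\tFPKM_status\n"
    "GENE_A\tA\t12.5\tOK\n"
    "GENE_B\tB\t0.2\tOK\n"
    "GENE_C\tC\t30.0\tFAIL\n"
)

expressed = read_fpkm(io.StringIO(EXAMPLE), min_fpkm=1.0)
print(expressed)
```

With a real file, `open("genes.fpkm_tracking")` would replace the in-memory example.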

Page 9

www.binatechnologies.com Bina Technologies Confidential and Proprietary

Accuracy Validation

[Figure: computational and experimental validation of the Bina Genome Analysis Platform]

• Simulation: synthetic diploid genome; validate SNP, indel and SV calls
• Trios (M, F, C): validate SNP/indel and SV/CNV calls
• Replicates (R1, R2)
• Gold call sets: NIST, 1KG, CG, EXP/ASM
• QC statistics: # SNP, Ti/Tv, Het/Hom
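The QC statistics named on this slide (Ti/Tv and Het/Hom) can be computed from a call set along the following lines. This is a minimal sketch with hypothetical helper names; it assumes biallelic SNVs given as (ref, alt) pairs and VCF-style genotype strings such as `0/1`.

```python
from typing import Iterable, Tuple

# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T) changes
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ti_tv_ratio(snvs: Iterable[Tuple[str, str]]) -> float:
    """Ratio of transitions to transversions over (ref, alt) pairs."""
    ti = tv = 0
    for ref, alt in snvs:
        if (ref.upper(), alt.upper()) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else float("inf")


def het_hom_ratio(genotypes: Iterable[str]) -> float:
    """Ratio of heterozygous to homozygous non-reference genotypes."""
    het = hom = 0
    for gt in genotypes:
        alleles = gt.replace("|", "/").split("/")
        if len(alleles) != 2 or "." in alleles:
            continue  # skip missing or non-diploid calls
        if alleles[0] != alleles[1]:
            het += 1
        elif alleles[0] != "0":
            hom += 1  # homozygous reference calls are not counted
    return het / hom if hom else float("inf")


example_snvs = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C")]
example_gts = ["0/1", "0/1", "1/1", "0/0", "1|0"]
print(ti_tv_ratio(example_snvs), het_hom_ratio(example_gts))
```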

Page 10

Validation by Gold Set

• High-confidence variants for NA12878

• Currently integrates 13 datasets from 5 platforms, with sophisticated filtering

• Includes SNPs and indels with genotypes

• 99.69% of SNPs and 89.57% of indels are known polymorphic variants

SNPs: 2.8M
Indels: 360K
SVs: Coming soon

http://genomeinabottle.org/
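Validation against a gold set is usually summarized by sensitivity and genotype concordance. The deck does not spell out its exact definitions, so the sketch below rests on assumed ones: sensitivity is the fraction of gold-set sites present in the call set, and genotype concordance is the fraction of shared sites whose genotypes agree. The helper names and the toy data are hypothetical.

```python
from typing import Dict, Tuple

Site = Tuple[str, int, str, str]   # (chrom, pos, ref, alt)
Calls = Dict[Site, str]            # site -> genotype string, e.g. "0/1"


def _normalize(gt: str) -> Tuple[str, ...]:
    """Treat phased and unphased genotypes alike and ignore allele order."""
    return tuple(sorted(gt.replace("|", "/").split("/")))


def sensitivity(calls: Calls, gold: Calls) -> float:
    """Fraction of gold-set sites that were called."""
    found = sum(1 for site in gold if site in calls)
    return found / len(gold)


def genotype_concordance(calls: Calls, gold: Calls) -> float:
    """Among sites in both sets, fraction with matching genotypes."""
    shared = [site for site in gold if site in calls]
    agree = sum(1 for s in shared if _normalize(calls[s]) == _normalize(gold[s]))
    return agree / len(shared)


gold_set: Calls = {
    ("1", 100, "A", "G"): "0/1",
    ("1", 200, "C", "T"): "1/1",
    ("2", 300, "G", "A"): "0/1",
    ("2", 400, "T", "C"): "0/1",
}
call_set: Calls = {
    ("1", 100, "A", "G"): "1|0",   # same genotype, phased
    ("1", 200, "C", "T"): "0/1",   # genotype disagrees
    ("2", 300, "G", "A"): "0/1",
    ("3", 500, "A", "T"): "0/1",   # not in the gold set
}
print(sensitivity(call_set, gold_set), genotype_concordance(call_set, gold_set))
```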

Page 11

Benchmarking: Alignment effect on indel calling accuracy

Metric | Bina Aligner / GATK unique calls | Shared calls | Accelerated BWA / GATK unique calls
Insertions | 65,555 | 267,144 | 18,996
Deletions | 38,165 | 274,874 | 21,094
Known | 70.7% | 85.2% | 59.7%
Het / Hom | 1.4 | 1.26 | 1.6

Overlap (shared indel calls vs. total indel calls): 79.9%

11% more known indels

Page 12

Benchmarking: Alignment and SNV Accuracy

• Single nucleotide variants

Metric | Bina Aligner / GATK unique calls | Shared SNV calls | Accelerated BWA / GATK unique calls
All SNVs | 98,870 | 3,097,329 | 84,481
Known | 95.5% | 99.2% | 94.1%
Ti / Tv | 1.8 | 2.13 | 2.09
Het / Hom | 2.3 | 1.45 | 3.4

Overlap (shared SNV calls vs. total SNV calls): 94.4%

More accurate SNV calls
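The overlap figure can be reproduced from the counts on this slide if "shared calls vs. total calls" is read as shared calls divided by the union of both call sets. That reading is an assumption, and the function name is hypothetical.

```python
def overlap_fraction(shared: int, unique_a: int, unique_b: int) -> float:
    """Shared calls as a fraction of all distinct calls made by either pipeline."""
    total = shared + unique_a + unique_b
    return shared / total


# SNV counts from this slide
snv_overlap = overlap_fraction(shared=3_097_329, unique_a=98_870, unique_b=84_481)

# Indel counts from the previous slide (insertions + deletions per category)
indel_overlap = overlap_fraction(
    shared=267_144 + 274_874,
    unique_a=65_555 + 38_165,
    unique_b=18_996 + 21_094,
)

print(f"SNV overlap:   {snv_overlap:.1%}")
print(f"Indel overlap: {indel_overlap:.1%}")
```

Under this definition the SNV counts give 94.4%, matching the slide. The indel counts give about 79.0%, slightly below the 79.9% printed on the indel slide, so that figure may rest on a slightly different definition or rounding.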

Page 13

Alignment Benchmarking

Aligner | Reads/s | Mapping rate (%) | Uniq. mapping rate (%) | Uniq. mismatch rate (%) | Uniq. gap rate (%)
Bina | 69K | 94.6 | 88.1 | 0.56 | 0.03
BWA accel. | 35K | 92.4 | 86.8 | 0.28 | 0.013
BWA mem | 50K | 95.7 | 88.5 | 0.88 | 0.022
Novoalign | 9K | 86.6 | 86.6 | 0.34 | 0.018
Isaac | 72K | 88.9 | 82.9 | 0.17 | 0.0108

Page 14

Our overall performance

Metric | BWA+GATK2
Platform | Illumina
Mode | WGS
Type | Paired End
Sample | NA12878
Read Length | 2x100bp
Reads | 1.2G
Coverage | 37.8X

SNP
% Known (dbSNP) | 98.62%
Ti/Tv | 2.1
Het/Hom | 1.53
Sensitivity (Gold Set) | 98.55%
GT Concordance (Gold Set) | 99.98%

Indel
% Known (dbSNP) | 89.29%
Sensitivity (Gold Set) | 84.82%
GT Concordance (Gold Set) | 97.88%

Page 15

SNP Accuracy

[Figure: number of SNPs and run time by pipeline version]

Pipeline | SNPs (HC), millions | SNPs (LC), millions | Run time (hr)
BWA+GATK 1.6 | 3.65 | 0.30 | 5.7
BWA+GATK 2.3-9 Lite | 3.66 | 0.43 | 7.1
BWA+GATK 2.5-2 | 3.67 | 0.24 | 6.2
BWA+GATK 2.6-5 | 3.70 | 0.21 | 6.2
BWA+GATK 2.7-2 | 3.75 | 0.21 | 6.1

[Figure: SNP sensitivity and GT concordance by pipeline version]

Pipeline | Sensitivity | GT concordance
BWA+GATK 1.6 | 98.47% | 99.98%
BWA+GATK 2.3-9 Lite | 98.55% | 99.98%
BWA+GATK 2.5-2 | 98.56% | 99.98%
BWA+GATK 2.6-5 | 98.57% | 99.98%
BWA+GATK 2.7-2 | 98.60% | 99.98%

Page 16

Indel Accuracy

[Figure: indel sensitivity and GT concordance by pipeline version]

Pipeline | Sensitivity | GT concordance
BWA+GATK 1.6 | 84.88% | 98.01%
BWA+GATK 2.3-9 Lite | 84.82% | 97.88%
BWA+GATK 2.5-2 | 84.92% | 97.99%
BWA+GATK 2.6-5 | 84.87% | 97.99%
BWA+GATK 2.7-2 | 84.91% | 97.97%

[Figure: number of indels by pipeline version]

Pipeline | Indels (HC), thousands | Indels (LC), thousands
BWA+GATK 1.6 | 590.0 | 4.6
BWA+GATK 2.3-9 Lite | 589.4 | 5.5
BWA+GATK 2.5-2 | 590.7 | 4.0
BWA+GATK 2.6-5 | 590.6 | 4.1
BWA+GATK 2.7-2 | 590.2 | 4.4

Page 17

HaplotypeCaller Excels in Indels

[Figure: sensitivity (%) by caller and variant type]

Caller | SNPs | Indels
HaplotypeCaller | 99.62 | 96.35
UnifiedGenotyper | 98.6 | 84.91

Page 18

SV Callers Analysis

Unpublished results

Page 19

TOPHAT2 is More Reliable

[Figures: read alignment, junction calls, validation rate, and total transcripts for TOPHAT1 vs. TOPHAT2]

Metric | TOPHAT1 | TOPHAT2 | RefSeq
Total number of reads (millions) | 44 | 44 |
Aligned reads (millions) | 33 | 29 |
Alignments (millions) | 56 | 62 |
Junction calls (thousands) | 165 | 184 |
Novel (thousands) | 43 | 42 |
Validation rate | 19.1% | 25.9% |
Total transcripts (thousands) | 84 | 52 | 44

Page 20

Integrated Easy to use RNASeq Workflows

Long Reads (Raw) + Short Reads → Long Read Correction (LSC) → Long Reads (Corrected) → Alignment → Isoform Detection & Prediction (IDP)

Au et al. Improving PacBio Long Read Accuracy by Short Read Alignment. PLoS ONE
Au et al. Characterization of the human ESC transcriptome by hybrid sequencing. PNAS

Page 21

Science innovation

• Validation
  • SV gold set
• Accuracy
  • Filtering, feature enhancements, replacing GATK tools in some areas (SNP variant calling)
  • Improving SV tools
  • Incorporate more tools (BreakSeq, Pindel)
• Cancer pipeline:
  • Tool selection
  • Workflow definition (many use cases)
  • Annotation
• Contribute back to open source genomic tools
  • Bug fixes
  • Validation tools, data quality framework
  • Functionality
  • New tools

Page 22

SCALABILITY Flexible Deployment

High Performance Computing Best Practices

Bina Box Bina Lite Bina Cloud

Page 23

Bina Box Genome Analysis Pipeline Performance

Metric | Whole Genome Sequencing | Whole Exome Sequencing | RNA Sequencing
Turnaround Time | ~4h | ~45m | 2.5 h
1 Bina Box Throughput | 6/day | ~50/day | ~38/day (152/day)

Data:
• WGS: Three lanes of paired-end HiSeq data from the NA12878 cell line (37X)
• WES: NA12878 Whole Exome Dataset, 100X
• RNA: Human Body Map 2.0, Skeletal Muscle, 82M reads, 75bp

Pipeline:
• WGS: Parallelized BWA, GATK + VQSR
• WES: Parallelized BWA, GATK
• RNA: Tophat 2.0, bowtie2 2.1.0, cufflinks 2.1.1

Hardware Configuration:
• WGS & WES: 4-node, 64-core Bina Box appliance
• RNA: 1-node, 64-core Bina Box appliance

Page 24

Integrated Management for Scalable Backend

Page 25

SIMPLICITY

Best Practices Workflows

Hardware/Software

On-premises Integration

Integration

Page 26

SIMPLICITY Intuitive User Interface

Monitoring & Management

Quality Control

Visualization

Page 27

Intuitive QC Metrics Summaries

Page 28

Filtering the annotated VCF

Miley Trio

NA 12878

Page 29

Annotation Platform – high level architecture

Inputs:
• External Genomic Data Sources
  • Small Variations: dbSNP
  • Structural Variations: dbVar, DGVa
  • Genome: RefSeq, Ensembl
  • Impact prediction: SnpEff, SIFT, PolyPhen-2
  • Genotype – Phenotype: ClinVar
• Bina Box Data: VCF
• User-Defined Data Sources

Data preparation: Versioning, Standardization, Compression

Batch Layer and Serving Layer:
• Compact Representation of Knowledgebase
• Smart Integration of User Data with Knowledgebase
• Unbounded Data Storage
• Real-time Filtering
• Cached Queries
• Annotation

Query Engine:
• Custom Queries
• Saved filters

Visualization
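To make "saved filters" over an annotated VCF concrete, here is a minimal sketch that parses the fixed VCF columns and applies named predicates. It is not the Bina query engine: the `SAVED_FILTERS` registry, the filter names, and the example records are hypothetical. It assumes the standard tab-separated VCF layout (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO).

```python
from typing import Callable, Dict, Iterable, Iterator, List, NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    vid: str               # ID column, "." when the variant has no known identifier
    ref: str
    alt: str
    filt: str              # FILTER column, e.g. "PASS"
    info: Dict[str, str]   # INFO key -> value ("" for flag entries)


def parse_vcf(lines: Iterable[str]) -> Iterator[Variant]:
    """Yield one Variant per data line, skipping headers and blank lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        info: Dict[str, str] = {}
        for item in f[7].split(";"):
            key, _, value = item.partition("=")
            info[key] = value
        yield Variant(f[0], int(f[1]), f[2], f[3], f[4], f[6], info)


# Hypothetical registry of reusable, named filters
SAVED_FILTERS: Dict[str, Callable[[Variant], bool]] = {
    "passing": lambda v: v.filt == "PASS",
    "novel": lambda v: v.vid == ".",
}


def apply_filters(variants: Iterable[Variant], names: List[str]) -> List[Variant]:
    """Keep variants that satisfy every named filter."""
    preds = [SAVED_FILTERS[n] for n in names]
    return [v for v in variants if all(p(v) for p in preds)]


EXAMPLE_VCF = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\trs1\tA\tG\t50\tPASS\tDP=30",
    "1\t200\t.\tC\tT\t60\tPASS\tDP=25",
    "2\t300\t.\tG\tA\t10\tLowQual\tDP=5",
]

kept = apply_filters(parse_vcf(EXAMPLE_VCF), ["passing", "novel"])
print([(v.chrom, v.pos) for v in kept])
```

Keeping filters as named predicates is one simple way to let the same filter be reused across samples, which is the idea the "Saved filters" bullet points at.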

Page 30

Academic Core Labs: Bina-on-Demand


• High capacity compute available, when needed, with no overhead costs

• Similar to a shared network printer: pay only for actual use

In Dr. Snyder’s group, secondary analysis time was reduced from 10 days per whole genome (using a 1200-core shared cluster) to 6h on one Bina Box. A second Bina-on-Demand platform accelerates NGS research across the Stanford Campus.

Stanford Center for Genomics and Personalized Medicine

Page 31

Clinical /Translational Centers: Bina Subscription


• On-premises solution for maximum privacy and security

• Fastest time to results – and therefore fastest time to decision and reporting

• Roadmap to clinical-grade software (HIPAA, robustness, training, and support)

Elizabeth Worthey’s team focuses on high-volume clinical sequencing of distressed newborns in a neonatal intensive care unit (NICU). Bina reduces bioinformatics analysis time from 17h to 3.5h while cutting costs by half. The team plans to sequence all newborns at MCW by 2015.

Page 32

Bina Custom: Large scale Translational Research


• High-throughput platform for large cohort NGS studies (WGS, WES, RNA-Seq)

• Capable of processing at least 100 WGS/month or more than 600 WES/month

• Additional Bina Boxes increase throughput; straightforward subscription model

• Secondary analysis today; data management, aggregation and storage in Q4

• 400 WGS samples related to cardiovascular disease

• Alignment, variant calling and SV / CNV results

• Joint sample variant calling; Aggregation of results

• Dramatically reduced analysis time from years to months

• Results being prepared for publication in a leading journal

U.S. Department of Veterans Affairs

Page 33

Next Steps

Want to know more?

• Try it with your own datasets on the cloud (no cost): www.binatechnologies.com/
• Contact your Bina representative: Take Ogawa, Director of Sales, [email protected]
• Contact me! Gianfranco de Feo, VP Marketing, [email protected]

Page 34

ACCURACY SCALABILITY SIMPLICITY

Low to very high throughput solutions

Full incorporation of best-in-class tools (benchmarking)

RNAseq Whole Genome Whole Exome

Easy-to-use interfaces

analytics