Novel simplified bioinformatics for NGS data analysis
Marc Noguera-Julian, PhD
IrsiCaixa AIDS Research Institute
Badalona, Catalonia
Bringing the next generation
sequencing to the clinic
Outline
1. Intro:
• The role of bioinformatics
• How is bioinformatics used?
• Challenges & potential for bioinformatics in clinical use
• What should a bioinformatic tool be like?
• Simplified Bioinformatic tools
1. Geno2Pheno-[NGS]
2. Exatype
3. MiCall
4. Hydra
5. PASeq.org
• Standardization efforts
Skurtur (http://skurtur.com)
NGS data deluge
Poster #86
• All NGS platforms produce tens to hundreds of thousands of sequences
• Bioinformatics skills need to be acquired (hired/learned) when adopting
NGS technologies.
• Bioinformatics needed to:
• Ensure data quality
• Design & Ensure adequate data analysis
• Ensure result data structure for further exploitation (when possible)
NGS data deluge
How is NGS being used for HIV-DR testing?
• Mostly Illumina HiSeq/MiSeq platforms are used in research, but
• Many in-house experimental protocols are in place for sample preparation
and library preparation
• Poll Data on how NGS data is generated and processed for HIV-DR
testing
How is NGS data currently being analyzed?
• Most research sites use in-house pipelines and have bioinformatics
support
• Poll Data from INTEGRATE project on how NGS data is generated and
processed for HIV-DR testing
NGS Data deluge
HIV-DR testing challenges
• Computational (storage & processing) resources scale-up
• Many experimental designs exist that may affect data analysis
efficacy/validity.
• Need for automated, robust and reproducible analysis
• Multiple sequencing platforms with different sequencing chemistries and
intrinsic/specific error models or data processing need.
• Heterogeneous analysis strategies and output formats render results
difficult to compare.
• Data formats and interoperability are an important issue. Need for
standardization.
• Data regulatory aspects need to be embedded in software platforms:
• Storage time
• Access policies
However,
[Figure: centralized NGS-based HIVDR testing workflow. Recoverable labels:]
• Health Centers: HIVDR test required; Sample Shipment; Individual Report and DR interpretation
• Central Diagnostics Lab/NGS Sequencing: High Throughput; Low Cost; Validated Assay
• Automated Bioinformatics: One-click use by Lab Tech; Secure; Highly Scalable; Automated QA/QC; Automated integration with HIVDR interpretation systems (HIVDB-Stanford)
• Cloud Computing: Centralized Data Analysis
• Quality-Curated HIVDR Data, Structured Database
• Web Interface; Pre-built queries; Embedded Statistics
• Program Officer: Real Time Surveillance (HIVDR Epidemiology, QA/QC Monitoring); Actionable Report
• Training & Quality Improvement
Great Potential from NGS in HIV DR clinics
Noguera-Julian. et al. J. Infect. Dis. 2017
NGS Data Deluge
HIV-NGS-DR Testing Needs
Several key features for bioinformatics to move into clinical practice:
• Usually no bioinformatics support on site in the clinical setting
• Usually no or minimal computational infrastructure available
• Be remotely usable by users with no bioinformatics skills through a
user-friendly web interface accessible from simple computers or
smartphones
• Provide robust, reproducible, and easy-to-interpret results using
standard and well-established HIV resistance interpretation rules (eg,
Stanford HIVdb or equivalent);
• Incorporate built-in quality standards
• Avoid unnecessary transfer of large data volumes
• Provide clinically actionable results that can be downloadable with
limited network access
• Demand minimal or no local computational infrastructure
• Seamlessly respond to a varying number of samples in a highly scalable
manner without an impact on time to results
• Minimal (or no) cost to enable their sustainable adoption by LMICs
Available Software
Multiple Software packages/platforms exist:
Hezhao. et al. JIAS 2018, submitted
Which usually means bioinformatics support
Available software for HIV-NGS-DR Testing
Noguera-Julian. et al. J. Infect. Dis. 2017
[Figure: comparison of PASeq, MiCall and Hydra results at 15%, 5% and 1% thresholds; axes 0 to 100; panel B: MiCall vs Hydra]
Good agreement between analysis pipelines
Simplified Data Analysis
Common Concepts Data Analysis for NGS HIV-DR Testing
FastQ file: Similar to a FastA file but containing sequences with per-nucleobase quality
values. This is the raw material produced by most (if not all) sequencers.
@HWI-D00283:145:C6BEJANXX:5:1101:18971:2203
AGGCCTTGAATGAGATTCCAAAAATCTATCGACTACAATCCCCCAAAAATCTATCGACTAC
+
EBB=B>FEGGGEGGFEFC@G:CDD>FGGGCBCAGGGGGGEEBB=B>FEGGGEGGFEFC
Sequence/base Quality: Numeric value (1-40 range) indicating the probability that the
sequence/base readout is wrong. Higher value means higher quality.
Paired-end/Single-end: Refers to experimental design in how every DNA fragment has
been read: from both ends or from a single end.
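As a minimal sketch of how the quality line relates to error probabilities: the snippet below assumes the common Phred+33 encoding (quality score = ASCII code minus 33) and the standard relation P = 10^(-Q/10). The function names are illustrative and not taken from any of the tools discussed in this talk.

```python
def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string to Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong: 10^(-q/10)."""
    return 10 ** (-q / 10)


# First ten quality characters of the example record shown above.
quality = "EBB=B>FEGG"
scores = phred_scores(quality)
print(scores)
print([round(error_probability(q), 5) for q in scores])
```

A score of 30 corresponds to a 1 in 1000 chance of a wrong base call, and a score of 40 to 1 in 10000.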
Common Concepts Data Analysis for NGS HIV-DR Testing
Sequence Alignment: usually stored as a SAM or BAM file. Contains all information
regarding how every one of the FastQ sequences has aligned against a reference.
Depth of coverage or, simply, “coverage”, refers to the number of times each
genomic position has been read by an independent sequence read.
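The definition of coverage can be illustrated with a small sketch. It assumes each aligned read is reduced to a (start, end) interval on the reference, 1-based and inclusive, which is a simplified stand-in for the records of a SAM/BAM file; the read positions are hypothetical.

```python
from collections import Counter


def depth_of_coverage(alignments):
    """Count how many reads cover each reference position.

    `alignments` is a list of (start, end) tuples, 1-based and inclusive.
    """
    depth = Counter()
    for start, end in alignments:
        for pos in range(start, end + 1):
            depth[pos] += 1
    return depth


reads = [(1, 5), (3, 8), (4, 6)]  # hypothetical aligned reads
coverage = depth_of_coverage(reads)
for pos in sorted(coverage):
    print(pos, coverage[pos])
```

Positions covered by all three reads (4 and 5 here) have a depth of 3, while positions outside every read have a depth of 0.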
Common steps in Data Analysis for NGS HIV-DR Testing
Quality Filter/control
Sequence alignment: reference wise
Variant Calling
Resistance Interpretation
Structured Data Storage
[Workflow diagram. Recoverable labels:]
• Contamination Control
• Alignment
• Quality Control
• Alignment files / Coverage Plots
• Codon/AA Tables / Codon/AA Freqs
• Consensus / Variant Calling
• Nucleotide table / Consensus Seq
• Resistance Reports
• Queryable Database
• Real-time Algorithms / [Real time] surveillance
Sequence Quality & Contamination Control
Quality Control: Essentially removing all sequences that either have low
quality values (a higher probability of miscalled nucleobases) or that are
too short to obtain a reliable alignment.
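A minimal sketch of this kind of filter follows. It is a simplification (mean read quality plus a minimum length) and not the algorithm of Trimmomatic or any other tool named in this talk. The default thresholds echo the MinLen=75 and minQual=30 values that appear in the PASeq log later in the deck, Phred+33 encoding is assumed, and the reads are made up.

```python
def mean_quality(quality_string, offset=33):
    """Mean Phred score of a read (Phred+33 encoding assumed)."""
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores)


def passes_qc(sequence, quality_string, min_length=75, min_mean_quality=30):
    """Keep a read only if it is long enough and of sufficient mean quality."""
    if len(sequence) < min_length:
        return False
    return mean_quality(quality_string) >= min_mean_quality


reads = [
    ("ACGT" * 25, "I" * 100),  # 100 bases, Phred 40: kept
    ("ACGT" * 25, "#" * 100),  # 100 bases, Phred 2: dropped (low quality)
    ("ACGT" * 10, "I" * 40),   # 40 bases: dropped (too short)
]
kept = [r for r in reads if passes_qc(*r)]
print(f"{len(kept)} of {len(reads)} survived")
```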
Contamination Control: Remove all
sequences that do not originate from the
sample being studied
- External contamination: Usually human
DNA. It does not interfere with HIV
alignments but can be a reason for
very low throughput
- Cross-contamination: DNA
contamination from one sample to
another during either library
preparation or sequencing. It can
interfere with DR testing, especially at
low thresholds
Available Software
Multiple Software packages/platforms exist:
Hezhao. et al. JIAS 2018, submitted
Geno2Pheno-[NGS]
https://ngs.geno2pheno.org/
Döring M, et al. Nucleic Acids Res. gky349.
Poster #25
Geno2Pheno-[NGS]
geno2pheno[ngs-freq] Report
© ngs.geno2pheno.org (Version 1.0.453)
Sample: 1_HIV Case Study
Patient:
Date of birth:
Sample received:
Sample type:
Physician:
Study:
Viral load:
Sample collected:
Date of report: May 28, 2018
Treatment:
Resistance Interpretation: HIV-1, Subtype B (100%)
Left columns: consensus sequence at prevalence >= 10%. Right columns: consensus sequence at prevalence >= 2%.

Class  Drug  Z     Mutations >= 10%                Z     Mutations < 10% and >= 2%
NRTI   ABC   0.2   No resistance mutations found.  3.8   M184V (2%)
NRTI   ddI   0.3   R211E (73%)                     2.7   M184V (2%)
NRTI   3TC   0.1   No resistance mutations found.  7.8   M184V (2%)
NRTI   d4T   0.6   No resistance mutations found.  0.6   No further resistance mutations found.
NRTI   TDF   -0.6  No resistance mutations found.  0     E138K (7%)
NRTI   ZDV   0.5   No resistance mutations found.  0.6   No further resistance mutations found.
NNRTI  EFV   -0.4  No resistance mutations found.  -0.1  No further resistance mutations found.
NNRTI  ETR   0.7   No resistance mutations found.  3.3   E138K (7%)
NNRTI  NVP   -0.1  No resistance mutations found.  0.2   No further resistance mutations found.
NNRTI  RPV   -0.2  No resistance mutations found.  1.8   E138K (7%)
PI     APV   1     L63P (84%)                      2.1   No further resistance mutations found.
PI     ATV   3.3   I93L (99%)                      4.4   No further resistance mutations found.
PI     DRV   0     No resistance mutations found.  0.7   No further resistance mutations found.
PI     IDV   3.1   No resistance mutations found.  3.1   No further resistance mutations found.
PI     LPV   1.9   L63P (84%)                      2.1   No further resistance mutations found.
PI     NFV   1.9   No resistance mutations found.  1.9   No further resistance mutations found.
PI     SQV   1.4   No resistance mutations found.  1.7   No further resistance mutations found.
PI     TPV   0.7   No resistance mutations found.  1.4   No further resistance mutations found.

The clinical relevance of resistant variants below 10% is still unclear. Since their presence may reduce treatment success, these mutations should be considered in unison with other factors such as the viral load. The Z-column indicates the change of the outcome estimate (e.g. resistance factor) in terms of standard deviations relative to the mean outcome for treatment-naive persons.
Legend (SIR): susceptible, intermediate, resistant
FEATURES:
• User downloadable reports
• Automatic Coverage alerts
• Detects APOBEC mutations
• Codon table as input requires pre-processing
but allows flexibility in sequence analysis
• User-definable detection thresholds
• Uses g2p-resistance for mutation
interpretation
• https://ngs.geno2pheno.org/
• No registration required
• Accepts HCV Data
Hyrax Exatype
https://exatype.com
Hyrax Exatype: Quality Control
Hyrax Exatype: Resistance Interpretation
Hyrax Exatype
FEATURES
• Commercial platform for cloud-based DR testing in HIV
• Embedded all-in-one quality control, sequence alignment,
variant calling and Stanford HIVDB based Resistance
interpretation
• Cloud Based, highly multiplexed and scalable
• Processes data from different NGS platforms.
• Codon-aware aligner strategy (RAMICS/Examap)*
• Free analysis for 50 samples.
• https://www.exatype.com
*Wright et al, NAR, 2014
MiCall for NGS-based data analysis
FEATURES:
• User downloadable resistance reports
• Raw Data used as input
• Embedded error model building
• Raw data used directly from Illumina
BaseSpace
• Can download all intermediate files
• Registration/Authorization required
• Free to use(?)
• Cloud Based, highly multiplexed and
scalable
Hydra Web for HIV NGS data analysis
Sequence length threshold
Score Cutoff
Error rate (platform-specific)
Quality for variants
Depth of coverage
Minimum mutation count
Frequency threshold
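To illustrate how parameters of this kind can interact, here is a hedged sketch of a variant filter. It is not Hydra's implementation: the function names, defaults, and the use of a binomial test against a platform error rate are assumptions made for illustration. A variant is kept only if depth, mutation count and frequency pass their thresholds and the observed count is unlikely to arise from sequencing error alone.

```python
import math


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed in log space."""
    if k <= 0:
        return 1.0
    below = 0.0
    for i in range(k):
        log_pmf = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log1p(-p)
        )
        below += math.exp(log_pmf)
    return max(0.0, 1.0 - below)


def call_variant(count, depth, error_rate=0.01, min_depth=100,
                 min_count=5, min_frequency=0.01, alpha=0.001):
    """Return True if the variant passes all (hypothetical) filters."""
    if depth < min_depth or count < min_count:
        return False
    if count / depth < min_frequency:
        return False
    # Reject variants that sequencing error alone could plausibly explain.
    return binomial_tail(count, depth, error_rate) < alpha


print(call_variant(30, 1000))  # 3% frequency, well above the 1% error rate
print(call_variant(12, 1000))  # 1.2% frequency, compatible with error alone
print(call_variant(30, 50))    # insufficient depth of coverage
```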
Hydra Web for HIV NGS data analysis
Hydra Web : Quality Control
Hydra Web : Results
FEATURES:
- Highly customizable analysis parameters
- Multi-sample upload & results download for high scale analysis
- User registration and access-controlled data storage
- No drug-level resistance interpretation.
- Can download intermediate files
- No pdf report generated.
- https://hydra.canada.ca/
PASeq : Polymorphism Analysis by Sequencing
www.paseq.org
• Minimal User intervention – Drag & Drop raw files
• User can input sample metadata (optional)
HIV-NGS-DR Testing in PASeq
HIV-NGS-DR Testing
2017-11-06 07:27:25. Sample created
2017-11-06 07:28:10. Allocating Resources
2017-11-06 07:28:47. Server assigned. Launching instance
2017-11-06 07:29:40. Initiating process
2017-11-06 07:29:40. Downloading Fastq R1 file
2017-11-06 07:29:40. Downloading Fastq R2 file
2017-11-06 07:29:40. This is a Paired-end Analysis
2017-11-06 07:29:40. Creating WorkSpace
2017-11-06 07:29:41. Going Through Quality Control Using Trimmomatic
2017-11-06 07:31:01. Quality Filtering and Adapter Trimming: Trimmomatic
2017-11-06 07:31:01. MinLen=75 minQual=30 SlidingWindow=20
2017-11-06 07:32:53. 17937 of 21643 survived
2017-11-06 07:32:54. Checking for External Contamination
2017-11-06 07:34:32. Found 17016 HIV sequences and 921 non-HIV sequences
2017-11-06 07:34:40. Indentifying potential contamination source
2017-11-06 07:35:44. Compressing Files and moving on
2017-11-06 07:36:00. Creating Coverage plots
2017-11-06 07:36:53. Calling Deep Variants
2017-11-06 07:39:59. Querying HIVDB using deep Variant Data at different thresholds
2017-11-06 07:40:49. Querying HIVDB-Stanford with consensus sequence for resistance interpretation
2017-11-06 07:40:50. Storing Consensus sequences for surveillance
2017-11-06 07:41:05. Uploading Results
2017-11-06 07:41:42. Job Finished
~30 min Analysis Time
Data Download & setup
Quality Control
Contamination Control
Sequence Alignment &
Variant Calling
Resistance Interpretation
HIV-NGS-DR Testing – PASeq Output – Embedded Quality
Control
HIV-NGS-DR Testing – PASeq Output – User-defined
threshold
[Report excerpts at user-defined thresholds: 5% and 1%]
Mutation  Protein  Frequency (%)
Q148R     INT      3.177
N155H     INT      99.877
S147G     INT      99.838
M46I      PR       1.825
E138A     RT       99.624
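The effect of a user-defined threshold can be sketched with the frequencies from the table above. The helper name is hypothetical and the snippet only filters the reported values; it does not reproduce how PASeq computes them.

```python
# Mutation, protein and frequency (%) as listed in the report table above.
mutations = [
    ("Q148R", "INT", 3.177),
    ("N155H", "INT", 99.877),
    ("S147G", "INT", 99.838),
    ("M46I", "PR", 1.825),
    ("E138A", "RT", 99.624),
]


def mutations_above(threshold_percent, table=mutations):
    """Names of mutations whose frequency is at or above the threshold."""
    return [name for name, _protein, freq in table if freq >= threshold_percent]


for threshold in (15, 5, 1):
    print(f">= {threshold}%:", ", ".join(mutations_above(threshold)))
```

With these values, Q148R and M46I are reported only at the 1% threshold, which is why the choice of threshold can change the resistance interpretation.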
IrsiCaixa. Report date: 2017-11-06 18:17:11 CET. Stanford HIVDB Version: 8.2. PASEQ Version:
Resistance interpretation information obtained from Stanford HIVDB (https://hivdb.stanford.edu/). This report has been generated by PASeq Web Service. All data interpretations are for research use only, not for diagnostic or clinical purposes, and are provided as is with NO guarantee of any class.
Quality Control
HIVdb mutation comments
E138A is a common polymorphic accessory mutation weakly selected in patients receiving ETR and RPV. It reduces ETR and RPV susceptibility ~2-fold. It has a weight of 1.5 in the Tibotec ETR genotypic susceptibility score.
H51Y is a rare non-polymorphic accessory mutation selected in patients receiving RAL and EVG and in vitro by DTG. H51Y reduces EVG susceptibility 2 to 3-fold. It does not reduce RAL or DTG susceptibility.
M46I/L are relatively non-polymorphic PI-selected mutations. In combination with other PI-resistance mutations, they are associated with reduced susceptibility to each of the PIs except DRV.
N155H is a non-polymorphic mutation selected in patients receiving RAL and EVG. Alone, it reduces RAL and EVG susceptibility ~15-fold and 30-fold, respectively. N155H has been selected by DTG in RAL-experienced patients but alone does not reduce DTG susceptibility.
Q148H/K/R are non-polymorphic mutations selected by RAL and EVG. Alone, Q148H reduces RAL and EVG susceptibility ~5 to 10-fold. Alone, Q148R/K reduce RAL and EVG susceptibility ~30 to 100-fold. In combination with G140S/A or E138K/A, they reduce RAL and EVG susceptibility >100-fold. Alone, Q148H/K/R have minimal effects on DTG susceptibility. In combination with G140S/A/C and/or E138K/A, they reduce DTG susceptibility up to 10-fold.
S147G is a non-polymorphic mutation selected in patients receiving EVG. It reduces EVG susceptibility 5 to 10-fold. It does not reduce RAL or DTG susceptibility.
Mutation Protein Frequency (%)
H51Y INT 3.256
PASeq.org
FEATURES:
• User downloadable resistance reports
• Configurable report
• Automatic Low coverage alerts
• Detects contamination
• Detects APOBEC mutations
• Raw Data used as input
• Can download all intermediate files
• User-definable detection thresholds
• Use real-time updated HIVdb-Stanford
• Registration required
• Free to use
• Cloud Based, highly multiplexed and
scalable
• Shareable Outputs
www.paseq.org
Ongoing Bioinformatics Standardization Efforts
Winnipeg Consensus (a):
- From an international symposium of NGS HIV-DR testing and pipeline
developers
- Contains recommendations/considerations for each of the analysis steps
when developing new pipelines.
VQA Dry Panel:
- Set of real samples distributed & sequenced among 13 different centers
- NGS data from 6 centers & synthetic data used to evaluate different
pipelines
Variant Data Format (AVF Format) (b):
- Intended as a common data exchange format for amino acid variants.
Based on the Hydra HMCF format by Eric Enns (NML/PHAC).
- Applicable to HIV but also to other genomic populations, particularly
viral ones
- Overcomes limitations/artifacts of consensus-like sequences for this kind
of data.
(a) Hezhao et al, JIAS, 2018, submitted
(b) AVF format, https://github.com/winhiv/aavf-spec
THANKS!
Toti Herms @Microbial Genomics @