TRANSCRIPT
National Center for HIV/AIDS, Viral Hepatitis, STD, and TB Prevention
Network Identification for Prevention – Session 1
Ellsworth Campbell
Computational Biologist
Molecular Epidemiology and Bioinformatics Team
Centers for Disease Control and Prevention
Division of HIV/AIDS Prevention
Laboratory Branch
Bioinformatic Methods for Cluster Analysis
• Immense field of literature, recent and detailed reviews abound
• Cluster analysis can be broken down into four categories:
1. Gene region: polymerase (pol), envelope (env), gag, whole genome…
2. Technology: Sanger vs NGS*, with various subdivisions therein
3. Analysis: phylogenetic trees vs genetic networks
4. Filtration: Thresholds vs algorithms
*Next-generation sequencing
Why? Comparability
Comparability of Laboratory and Epidemiologic Data Sources
Trees are Networks!
What if I prefer my phylogenetic trees?
• Complex modeling (e.g., evolution, phylogeography, etc.)
• Time-resolved trees (e.g., BEAST)
• Complex genomics (e.g., recombination)
Pairwise patristic distances, measured across a tree, enable us to cast a tree as a pairwise genetic network
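A minimal sketch of how a tree can be cast as a pairwise network: the patristic distance between two tips is the sum of the branch lengths on the path joining them. The tree, its node names, and its branch lengths below are made up for illustration, and the helper names are hypothetical rather than taken from any particular tool.

```python
from itertools import combinations

# Hypothetical rooted tree stored as child -> (parent, branch length).
# Branch lengths are in substitutions per site; all values are invented.
TREE = {
    "A": ("n1", 0.0005),
    "B": ("n1", 0.0005),
    "n1": ("root", 0.004),
    "C": ("n2", 0.002),
    "D": ("n2", 0.003),
    "n2": ("root", 0.005),
}

def path_to_root(node, tree):
    """Map each ancestor of `node` (and the node itself) to its distance from `node`."""
    dist, out = 0.0, {node: 0.0}
    while node in tree:
        parent, length = tree[node]
        dist += length
        out[parent] = dist
        node = parent
    return out

def patristic_distance(a, b, tree):
    """Sum of branch lengths on the path between tips a and b."""
    up_a, up_b = path_to_root(a, tree), path_to_root(b, tree)
    # With non-negative branch lengths, the most recent common ancestor
    # is the shared ancestor that gives the shortest combined path.
    return min(up_a[n] + up_b[n] for n in up_a if n in up_b)

def tree_to_network(tree):
    """Return {(tip, tip): patristic distance} for every pair of tips."""
    parents = {parent for parent, _ in tree.values()}
    tips = sorted(n for n in tree if n not in parents)
    return {(a, b): patristic_distance(a, b, tree) for a, b in combinations(tips, 2)}

network = tree_to_network(TREE)
for (a, b), d in network.items():
    print(f"{a} - {b}: {d:.2%}")
```

Each tip pair becomes one weighted link, so the result can be filtered and compared like any other genetic distance network.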
[Figure: A and B joined by a genetic link (0.1%) and by a contact link (Sex)]
Comparability
Results are directly comparable to traditional contact tracing data: Simple Integration
• Provides an epidemiologic method to filter close genetic links between sequences
[Figure: A and D joined by a genetic link (1.4%)]
Wertheim JO, Kosakovsky Pond SL, Forgione LA, Mehta SR, Murrell B, Shah S, et al. (2017) Social and Genetic Networks of HIV-1 Transmission in New York City. PLoS Pathog 13(1): e1006000. doi:10.1371/journal.ppat.1006000
Figure 1
[Figure annotations: distances between epi-linked sequences; distances between unlinked sequences; threshold that separates most blue from red]
Genetic distance (pol) network: algorithm-filtered, mean d = 0.1%
Left network, filtered by the presence of a reported high-risk contact
Filter with Epi Data
Campbell et al., Detailed Transmission Network Analysis of a Large Opiate-Driven Outbreak of HIV Infection in the United States, The Journal of Infectious Diseases, Volume 216, Issue 9, 1 November 2017, Pages 1053–1062, https://doi.org/10.1093/infdis/jix307
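Filtering genetic links with epidemiologic data can be sketched as a simple join: keep a genetic link only when the same pair also appears in the contact tracing data. The links, contacts, and function names below are hypothetical and only illustrate the idea; they are not the procedure used in the cited study.

```python
# Hypothetical genetic links: (node, node, genetic distance as a fraction).
genetic_links = [("A", "B", 0.001), ("A", "D", 0.014), ("B", "C", 0.004)]

# Hypothetical reported contacts: (case, contact, contact type).
contacts = [("A", "B", "Sex"), ("C", "B", "IDU")]

def pair(a, b):
    """Order-independent key, so A-B and B-A refer to the same link."""
    return frozenset((a, b))

def filter_by_contacts(genetic_links, contacts):
    """Keep genetic links whose endpoints also share a reported contact."""
    reported = {pair(a, b): kind for a, b, kind in contacts}
    return [
        (a, b, d, reported[pair(a, b)])
        for a, b, d in genetic_links
        if pair(a, b) in reported
    ]

supported = filter_by_contacts(genetic_links, contacts)
for a, b, d, kind in supported:
    print(f"{a} - {b}: {d:.1%} genetic distance, reported contact: {kind}")
```

The A-D link is dropped because no contact was reported for that pair, while the B-C link is kept even though the contact was recorded in the opposite direction.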
Why? Precision
Algorithmic Filtering Improves Precision
Precision
Given the data volume, filtration is a requirement… but methodology matters.
• Thresholds: “Throw away data above this value.”
• Nearest Neighbor(s) algorithm (aka union of all minimum spanning trees): “Keep only the closest link, for all individuals.”
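The two filtering rules quoted above can be sketched side by side. This is a simplified reading of the slide, not MicrobeTrace's implementation: the nearest-neighbor function below keeps, for every individual, the link(s) tied for its smallest distance, and it is not claimed to reproduce a full minimum-spanning-tree computation. The distances and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical pairwise genetic distances (fractions, so 0.002 = 0.2%).
LINKS = [
    ("A", "B", 0.002), ("A", "C", 0.013), ("B", "C", 0.003),
    ("C", "D", 0.017), ("D", "E", 0.023), ("A", "E", 0.013),
]

def threshold_filter(links, max_distance):
    """'Throw away data above this value.'"""
    return [link for link in links if link[2] <= max_distance]

def nearest_neighbor_filter(links):
    """'Keep only the closest link, for all individuals' (ties are all kept)."""
    best = defaultdict(lambda: float("inf"))
    for a, b, d in links:
        best[a] = min(best[a], d)
        best[b] = min(best[b], d)
    # A link survives if it is the closest link of at least one of its endpoints.
    return [(a, b, d) for a, b, d in links if d == best[a] or d == best[b]]

by_threshold = threshold_filter(LINKS, 0.005)
by_neighbor = nearest_neighbor_filter(LINKS)
print("threshold 0.5%:  ", by_threshold)
print("nearest neighbor:", by_neighbor)
```

In this toy data the 0.5% threshold discards every link to D and E, while the nearest-neighbor rule still attaches each of them through its single closest link.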
Precision
Algorithmic filtering enables a more precise reconstruction of the genetic network
[Figure: network links labeled with genetic distances 1.3%, 1.3%, 2.3%, 1.7%, 0.2%, 0.3%, 0.0%, 0.0%]
Explanatory Example
Realistic Example
Downsides of threshold filtering:
1. Strongly depends on the threshold chosen
2. Threat of over-inclusion
3. Threat of over-exclusion
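To illustrate the dependence on the threshold, the sketch below groups sequences into clusters (connected components) at three different cutoffs. The links and cutoffs are invented for illustration, and the `clusters` helper is a hypothetical name, not part of any tool mentioned in this talk.

```python
# Hypothetical pairwise genetic distances (fractions, so 0.002 = 0.2%).
LINKS = [
    ("A", "B", 0.002), ("B", "C", 0.004), ("C", "D", 0.012),
    ("D", "E", 0.003), ("E", "F", 0.016),
]

def clusters(links, max_distance):
    """Connected components after dropping links above max_distance."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    for a, b, d in links:
        root_a, root_b = find(a), find(b)  # registers both nodes
        if d <= max_distance:
            parent[root_a] = root_b        # merge the two components

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), []).append(node)
    # Largest clusters first, then alphabetical, for stable output.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: (-len(g), g))

for cutoff in (0.001, 0.005, 0.015):
    print(f"d <= {cutoff:.1%}: {clusters(LINKS, cutoff)}")
```

A cutoff that is too strict leaves every sequence isolated (over-exclusion), while a looser one merges separate groups into one cluster (over-inclusion).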
Genetic distance (pol) network: threshold-filtered, d ≤ 0.5%
Genetic distance (pol) network: algorithm-filtered, mean d = 0.1%
Campbell et al., Detailed Transmission Network Analysis of a Large Opiate-Driven Outbreak of HIV Infection in the United States, The Journal of Infectious Diseases, Volume 216, Issue 9, 1 November 2017, Pages 1053–1062, https://doi.org/10.1093/infdis/jix307
Adapting to the Setting
Setting: Outbreak Response
Gene Region
• Comparability to available sequences: pol
• Recent evolutionary past: env
Technology
• Cost*, speed* and availability: Sanger
Analysis
• Comparability to contact tracing: genetic networks
• Lowest training requirements: genetic networks
• Time of divergence of interest: timed phylogenetic trees
* This fact is changing rapidly, and short-read NGS methods will soon dominate
Setting: Public Health Surveillance
Analysis
• Lowest training requirements: genetic networks
• Speed of analysis: genetic networks
Gene Region
• Comparability to available sequences: pol
• Evolutionary past (regardless of scale): pol
Technology
• Cost*, speed* and availability: Sanger
* This fact is changing rapidly, and short-read NGS methods will soon dominate
Setting: Research Study
Gene Region
• Comparability to available sequences: pol
• Distant/recent evolutionary past: pol/env
• Maximum reliability: pol + gag or whole genome
Technology
• High-fidelity transmission analysis: long read, next-generation sequencing
• Intra- and interhost diversity: short read, next-generation sequencing
Analysis
• Transmission-oriented study with contact tracing data: genetic networks
• Study of recombinant variants: genetic networks
• Time of divergence of interest: timed phylogenetic trees
MicrobeTrace* Accommodates all considerations
* There are other tools in the field, but all require multiple tools to be used in tandem
Sequences + Epidemiologic Data
github.com/cdcgov/microbetrace
id venue riskFactor Gender Age
1807 1 HET male 31
1311 1 MSM female 33
1518 2 HET male 26
1481 2 IDU male 52
1480 1 IDU male 22
1422 3 MSM-IDU male 32
Case Contact Type
1807 1311 Sex+IDU
1518 1311 Sex
1481 1311 IDU
1480 1311 Sex+IDU
1422 1311 Sex+IDU
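A minimal sketch of how a node list and a link list like the two tables above might be joined into one network. The rows are copied from the slide and written as CSV; treating the first link column as the case and the second as the named contact follows the table header, and the function names are hypothetical. This does not describe how MicrobeTrace performs the integration internally.

```python
import csv
import io

# Node list from the slide, written as CSV.
NODES_CSV = """id,venue,riskFactor,Gender,Age
1807,1,HET,male,31
1311,1,MSM,female,33
1518,2,HET,male,26
1481,2,IDU,male,52
1480,1,IDU,male,22
1422,3,MSM-IDU,male,32
"""

# Link list from the slide, written as CSV.
LINKS_CSV = """Case,Contact,Type
1807,1311,Sex+IDU
1518,1311,Sex
1481,1311,IDU
1480,1311,Sex+IDU
1422,1311,Sex+IDU
"""

def load(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))

def integrate(nodes, links):
    """Attach node attributes to both ends of every link."""
    by_id = {row["id"]: row for row in nodes}
    network = []
    for link in links:
        case, contact = by_id.get(link["Case"]), by_id.get(link["Contact"])
        if case is None or contact is None:
            continue  # skip links that point at ids missing from the node list
        network.append({"case": case, "contact": contact, "type": link["Type"]})
    return network

network = integrate(load(NODES_CSV), load(LINKS_CSV))
for edge in network:
    print(edge["case"]["id"], edge["case"]["riskFactor"], "->",
          edge["contact"]["id"], f"({edge['type']})")
```

Once the two files share an id column, every link carries the demographic attributes of both endpoints, which is what makes filtering or coloring the network by epidemiologic variables possible.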
What is MicrobeTrace?
1. Network analysis (both sequence and contact tracing)
2. Robust data integration and visualization platform
3. Pathogen-agnostic (e.g., HIV, HCV, TB & Zika so far)
Interactive Exploration of Transmission Networks
MicrobeTrace
• Delivered securely via browser
  • Redirects to latest version
• All analyses and visuals run locally
  • Can disconnect from internet
• Customizable inputs and many formats
  • Excel formats (CSV and XLSX)
  • Raw sequences (.FASTA)
  • Distance matrix format
https://microbetrace.cdc.gov
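As a rough illustration of how raw FASTA sequences relate to a distance-based network, the sketch below parses aligned sequences and computes a plain p-distance (the proportion of compared sites that differ). The p-distance is a deliberately simple stand-in chosen for this example; the slides do not state which distance metric MicrobeTrace uses. The sequences and function names are hypothetical.

```python
from itertools import combinations

# Hypothetical, already-aligned sequences of equal length.
FASTA = """>seq1
ACGTACGTAC
>seq2
ACGTACGTAT
>seq3
ACGAACGTTT
"""

def parse_fasta(text):
    """Return {sequence name: sequence} from FASTA-formatted text."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records

def p_distance(a, b):
    """Proportion of sites that differ, ignoring gaps and ambiguous bases."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    sites = [(x, y) for x, y in zip(a, b) if x in "ACGT" and y in "ACGT"]
    if not sites:
        return float("nan")
    return sum(x != y for x, y in sites) / len(sites)

def distance_links(records):
    """Return one (name, name, distance) link per pair of sequences."""
    return [(a, b, p_distance(records[a], records[b]))
            for a, b in combinations(records, 2)]

links = distance_links(parse_fasta(FASTA))
for a, b, d in links:
    print(f"{a} - {b}: {d:.0%}")
```

The resulting list of weighted links has the same shape as a distance matrix input, so either format can feed the same filtering step.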
MicrobeTrace Design Decisions and Process
Interaction with and feedback from end users at local health departments (HDs) has guided our primary design decisions and feature requirements
1. Delivering MicrobeTrace via a web browser is preferable to downloading and installing software due to administrative red tape.
2. Local HDs do not currently have, or plan to obtain, the bioinformatic expertise necessary to perform phylodynamic analyses.
3. The overwhelming stigma and general sensitivity of HIV, HCV, and TB data require that most data, especially high-risk contact networks, be held locally on ‘air-gapped’ machines.
4. Contact networks, demographic data, and sequence data are fundamentally different and difficult to integrate. We provide automatic integration of arbitrary numbers or types of files.
MicrobeTrace Usage Statistics
Launch to February (06.08.18 – Today, 551 days)
High user count in GA is primarily from partner training events held at CDC.
Users by Area in the US
• >1900 unique users
• Representing 44 states, including DC
  • median of 5 users per state
• Representing 66 major metropolitan areas
  • median of 3 users per metro area
Activity
• 2.2 visits per user
• ~10 min of use per visit
International Users
• 193 unique users representing 38 other countries
The Road Ahead – Feedback and Support
• User feedback reigns supreme
  • Feedback via Github.com issues
  • Support via [email protected]
• Open source development practices enable us to leverage the community
  • Help wanted ‘bounties’
Code.gov Entry
User-requested features
Open Issue Tracker
Conclusions
• Bioinformatic methods for cluster analysis are varied
• Method selection depends most powerfully on the setting and questions of interest
• Genetic networks offer improved comparability and compatibility with contact tracing data
• Traditional phylogenetic trees are readily mapped to network formats w/o loss of information
• Nearest neighbor (algorithmic) filtration offers more precision than threshold criteria
MicrobeTrace
• MicrobeTrace supports integration of genetic, patristic and/or contact tracing networks
• MicrobeTrace enables filtration based on threshold, nearest neighbor algorithm, or both simultaneously
Acknowledgements
Bill Switzer
Molecular Epidemiology and Bioinformatics Team Lead
Tony Boyles, Jay Kim
MicrobeTrace Developers
Anupama Shankar
Training and Content Coordination, Quality Control, User Manual
Sergei Kniazev
Companion Tool Developer, Bioinformatic Algorithm Developer
Field Testing, Training and Feedback
DHAP – HICSB: Angela Hernandez, Alexa Oster, Anne Marie France, Cheryl Ocfemia, Paul Wiedle, Nivedha Panneer, Scott Cope
DTBE: Ben Silk, Kathryn Winglee, Sarah Talarico
DVH: Seth Sims, David Campos
Administration & Leadership
Bill Switzer, Walid Heneine
Michele Owen, Caryn Kim
CDC IT Office
Thom Sukalac, Rob MacGregor
Funding and Support by OAMD and DHAP OD
CDCTraining Production Content Production