Bioinformatic Methods for Cluster Analysis
regist2.virology-education.com/presentations/2019/...

Post on 23-May-2020


Page 1

National Center for HIV/AIDS, Viral Hepatitis, STD, and TB Prevention

Network Identification for Prevention – Session 1

Ellsworth Campbell

Computational Biologist

Molecular Epidemiology and Bioinformatics Team

Centers for Disease Control and Prevention

Division of HIV/AIDS Prevention

Laboratory Branch

Bioinformatic Methods for Cluster Analysis

Page 4

Bioinformatic Methods for Cluster Analysis

• Immense field of literature; recent and detailed reviews abound

• Cluster analysis can be broken down into four categories:

1. Gene region: polymerase (pol), envelope (env), gag, whole genome…

2. Technology: Sanger vs NGS*, with various subdivisions therein

3. Analysis: phylogenetic trees vs genetic networks

4. Filtration: thresholds vs algorithms

*Next-generation sequencing

Why? Comparability

Page 5

Comparability of Laboratory and Epidemiologic Data Sources

Page 6

Trees are Networks!

What if I prefer my phylogenetic trees?
• Complex modeling (e.g., evolution, phylogeography, etc.)
• Time-resolved trees (e.g., BEAST)
• Complex genomics (e.g., recombination)

Pairwise patristic distances, measured across a tree, enable us to cast a tree as a pairwise genetic network
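A minimal sketch of the tree-to-network idea, assuming a small rooted tree stored as a child-to-parent map with branch lengths. The tree, the tip names, and the names `TREE`, `TIPS`, `patristic`, and `tree_to_network` are all hypothetical and chosen for illustration; they are not taken from the slides or from any particular tool.

```python
from itertools import combinations

# Hypothetical rooted tree: child -> (parent, branch length in substitutions/site).
TREE = {
    "A": ("n1", 0.001),
    "B": ("n1", 0.002),
    "n1": ("root", 0.004),
    "C": ("root", 0.010),
}
TIPS = ["A", "B", "C"]


def path_to_root(node):
    """Map every ancestor of `node` (and the node itself) to its distance from `node`."""
    dist, path = 0.0, {node: 0.0}
    while node in TREE:
        parent, length = TREE[node]
        dist += length
        path[parent] = dist
        node = parent
    return path


def patristic(a, b):
    """Sum of branch lengths along the tree path between tips a and b."""
    up_a, up_b = path_to_root(a), path_to_root(b)
    # With non-negative branch lengths, the most recent common ancestor
    # is the shared ancestor with the smallest combined distance.
    return min(up_a[n] + up_b[n] for n in up_a if n in up_b)


def tree_to_network(tips):
    """Cast the tree as a pairwise network: one weighted link per pair of tips."""
    return {(a, b): patristic(a, b) for a, b in combinations(tips, 2)}


network = tree_to_network(TIPS)
for pair, d in sorted(network.items()):
    print(pair, round(d, 4))
```

The resulting dictionary of weighted links can then be filtered in the same way as a network built directly from pairwise genetic distances.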

Page 7

Comparability
Results are directly comparable to traditional contact tracing data: Simple Integration
• Provides an epidemiologic method to filter close genetic links between sequences

[Figure: example links. A-B, genetic distance 0.1%; A-B, reported contact: Sex; A-D, genetic distance 1.4%]

Page 8

Comparability
Results are directly comparable to traditional contact tracing data: Simple Integration
• Provides an epidemiologic method to filter close genetic links between sequences

Wertheim JO, Kosakovsky Pond SL, Forgione LA, Mehta SR, Murrell B, Shah S, et al. (2017) Social and Genetic Networks of HIV-1 Transmission in New York City. PLoS Pathog 13(1): e1006000. doi:10.1371/journal.ppat.1006000

Figure 1 legend:
• Distances between epi-linked sequences
• Distances between unlinked sequences
• Threshold that separates most blue from red
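One simple way to pick a threshold that separates two sets of distances is sketched below. It assumes the cutoff should minimize the number of misclassified pairs; this criterion, the function name `best_threshold`, and the distance values are illustrative assumptions, not the procedure used in the cited paper.

```python
def best_threshold(linked, unlinked):
    """Return the cutoff that misclassifies the fewest pairs.

    A pair is called 'linked' when its distance is <= the cutoff.
    Candidate cutoffs are the observed distances themselves.
    """
    candidates = sorted(set(linked) | set(unlinked))

    def errors(cutoff):
        missed = sum(d > cutoff for d in linked)        # epi-linked pairs above the cutoff
        spurious = sum(d <= cutoff for d in unlinked)   # unlinked pairs at or below it
        return missed + spurious

    return min(candidates, key=errors)


# Made-up distances (proportion of sites that differ), for illustration only.
epi_linked = [0.001, 0.003, 0.004, 0.006, 0.025]
unlinked = [0.011, 0.020, 0.035, 0.050, 0.060]

threshold = best_threshold(epi_linked, unlinked)
print(f"cutoff = {threshold:.3f}")
```

No cutoff separates the two groups perfectly in this toy data, which mirrors the slide's wording that the threshold separates most, not all, of the two distributions.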

Page 9

[Figure, left panel: Genetic distance (pol) network, algorithm-filtered, mean d = 0.1%]

[Figure, right panel: Left network, filtered by the presence of a reported high-risk contact]

Comparability
Results are directly comparable to traditional contact tracing data: Simple Integration
• Provides an epidemiologic method to filter close genetic links between sequences

Filter with Epi Data

Campbell et al., Detailed Transmission Network Analysis of a Large Opiate-Driven Outbreak of HIV Infection in the United States, The Journal of Infectious Diseases, Volume 216, Issue 9, 1 November 2017, Pages 1053–1062, https://doi.org/10.1093/infdis/jix307
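Filtering genetic links by reported contacts can be sketched as an intersection of two link sets. The example below reuses the A-B and A-D links from the earlier slide and adds made-up ones; the helper names `normalize` and `filter_by_contact` are hypothetical, and the sketch assumes links are undirected.

```python
def normalize(a, b):
    """Treat a link as undirected by sorting its two endpoints."""
    return tuple(sorted((a, b)))


# Genetic links: pair -> genetic distance (illustrative values).
genetic_links = {
    normalize("A", "B"): 0.001,
    normalize("A", "D"): 0.014,
    normalize("B", "C"): 0.002,
}

# Reported high-risk contacts: pair -> contact type (illustrative values).
reported_contacts = {
    normalize("B", "A"): "Sex",
    normalize("C", "E"): "IDU",
}


def filter_by_contact(genetic, contacts):
    """Keep only genetic links that are also supported by a reported contact."""
    return {
        pair: (distance, contacts[pair])
        for pair, distance in genetic.items()
        if pair in contacts
    }


supported = filter_by_contact(genetic_links, reported_contacts)
print(supported)
```

Only the A-B link survives here, because it is the only pair present in both the genetic and the contact data.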

Page 10

Bioinformatic Methods for Cluster Analysis

• Immense field of literature; recent and detailed reviews abound

• Cluster analysis can be broken down into four categories:

1. Gene region: polymerase (pol), envelope (env), gag, whole genome…

2. Technology: Sanger vs NGS*, with various subdivisions therein

3. Analysis: phylogenetic trees vs genetic networks

4. Filtration: thresholds vs algorithms

Why? Precision

*Next-generation sequencing

Page 11

Algorithmic Filtering Improves Precision

Page 12

Precision
Given the data volume, filtration is a requirement… but methodology matters.

• Thresholds: “Throw away data above this value.”

• Nearest Neighbor(s) algorithm (aka union of all minimum spanning trees): “Keep only the closest link, for all individuals.”
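The two filtering strategies can be compared on a toy distance table. This is a sketch under stated assumptions: the distances in `DIST` are made up, the function names are hypothetical, and the nearest-neighbor filter follows the quoted description literally (keep each individual's closest link, keeping ties) and does not construct minimum spanning trees explicitly.

```python
from collections import defaultdict

# Hypothetical pairwise genetic distances (proportion of sites that differ).
DIST = {
    ("A", "B"): 0.002,
    ("A", "C"): 0.004,
    ("B", "C"): 0.003,
    ("C", "D"): 0.013,
    ("A", "D"): 0.017,
    ("B", "D"): 0.023,
}


def threshold_filter(dist, cutoff):
    """'Throw away data above this value.'"""
    return {pair for pair, d in dist.items() if d <= cutoff}


def nearest_neighbor_filter(dist):
    """'Keep only the closest link, for all individuals.' Ties are all kept."""
    best = defaultdict(lambda: float("inf"))
    for (a, b), d in dist.items():
        best[a] = min(best[a], d)
        best[b] = min(best[b], d)
    # A link survives if it is the closest link of at least one endpoint.
    return {(a, b) for (a, b), d in dist.items() if d == best[a] or d == best[b]}


by_threshold = threshold_filter(DIST, cutoff=0.005)
by_neighbor = nearest_neighbor_filter(DIST)
print(sorted(by_threshold))
print(sorted(by_neighbor))
```

With a 0.5% cutoff, individual D is dropped entirely and all three links among A, B, and C are kept. The nearest-neighbor filter keeps D attached through its closest link and drops the A-C link, without needing a cutoff to be chosen.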

Page 15

Precision
Algorithmic filtering enables a more precise reconstruction of the genetic network

[Figure: "Explanatory Example" and "Realistic Example" networks; links labeled with genetic distances 1.3%, 1.3%, 2.3%, 1.7%, 0.2%, 0.3%, 0.0%, 0.0%]

Downsides of threshold filtering:

1. Depends strongly on the chosen threshold

2. Risk of over-inclusion

3. Risk of over-exclusion

Page 16

Precision
Algorithmic filtering enables a more precise reconstruction of the genetic network

[Figure: example network; links labeled with genetic distances 1.3%, 1.3%, 2.3%, 1.7%, 0.2%, 0.3%, 0.0%, 0.0%]

Page 17

Precision
Algorithmic filtering enables a more precise reconstruction of the genetic network

Figure legend:
• Distances between epi-linked sequences
• Distances between unlinked sequences
• Threshold that separates most blue from red

Page 18

Precision
Algorithmic filtering enables a more precise reconstruction of the genetic network

[Figure panel: Genetic distance (pol) network, threshold-filtered, d ≤ 0.5%]

[Figure panel: Genetic distance (pol) network, algorithm-filtered, mean d = 0.1%]

Campbell et al., Detailed Transmission Network Analysis of a Large Opiate-Driven Outbreak of HIV Infection in the United States, The Journal of Infectious Diseases, Volume 216, Issue 9, 1 November 2017, Pages 1053–1062, https://doi.org/10.1093/infdis/jix307

Page 19

Adapting to the Setting

Page 20

Setting: Outbreak Response

Gene Region
• Comparability to available sequences: pol
• Recent evolutionary past: env

Technology
• Cost*, speed* and availability: Sanger

Analysis
• Comparability to contact tracing: genetic networks
• Lowest training requirements: genetic networks
• Time of divergence of interest: timed phylogenetic trees

* This fact is changing rapidly, and short-read NGS methods will soon dominate

Page 21

Setting: Public Health Surveillance

Analysis
• Lowest training requirements: genetic networks
• Speed of analysis: genetic networks

Gene Region
• Comparability to available sequences: pol
• Evolutionary past (regardless of scale): pol

Technology
• Cost*, speed* and availability: Sanger

* This fact is changing rapidly, and short-read NGS methods will soon dominate

Page 22

Setting: Research Study

Gene Region
• Comparability to available sequences: pol
• Distant/recent evolutionary past: pol/env
• Maximum reliability: pol + gag or whole genome

Technology
• High-fidelity transmission analysis: long-read next-generation sequencing
• Intra- and interhost diversity: short-read next-generation sequencing

Analysis
• Transmission-oriented study with contact tracing data: genetic networks
• Study of recombinant variants: genetic networks
• Time of divergence of interest: timed phylogenetic trees

Page 23

MicrobeTrace* Accommodates all considerations

* There are other tools in the field, but all require multiple tools to be used in tandem

Page 24

Sequences + Epidemiologic Data

github.com/cdcgov/microbetrace

id    venue  riskFactor  Gender  Age
1807  1      HET         male    31
1311  1      MSM         female  33
1518  2      HET         male    26
1481  2      IDU         male    52
1480  1      IDU         male    22
1422  3      MSM-IDU     male    32

Case  Contact  Type
1807  1311     Sex+IDU
1518  1311     Sex
1481  1311     IDU
1480  1311     Sex+IDU
1422  1311     Sex+IDU

What is MicrobeTrace?
1. Network analysis (both sequence and contact tracing)
2. Robust data integration and visualization platform
3. Pathogen-agnostic (e.g., HIV, HCV, TB & Zika so far)

Interactive Exploration of Transmission Networks
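The node table and contact table on this slide can be joined on the case identifier. The sketch below assumes the two tables are stored as CSV text with the column names shown on the slide; the variable names and the derived fields `case_risk` and `contact_risk` are hypothetical, and this is not a description of how MicrobeTrace performs its integration.

```python
import csv
import io
from collections import Counter

# Node list, assumed to be stored as CSV with the slide's column names.
NODES_CSV = """id,venue,riskFactor,Gender,Age
1807,1,HET,male,31
1311,1,MSM,female,33
1518,2,HET,male,26
1481,2,IDU,male,52
1480,1,IDU,male,22
1422,3,MSM-IDU,male,32
"""

# Contact tracing link list, same assumption.
LINKS_CSV = """Case,Contact,Type
1807,1311,Sex+IDU
1518,1311,Sex
1481,1311,IDU
1480,1311,Sex+IDU
1422,1311,Sex+IDU
"""

nodes = {row["id"]: row for row in csv.DictReader(io.StringIO(NODES_CSV))}
links = list(csv.DictReader(io.StringIO(LINKS_CSV)))

# Attach the risk factor of both endpoints to every reported contact.
annotated = [
    {
        **link,
        "case_risk": nodes[link["Case"]]["riskFactor"],
        "contact_risk": nodes[link["Contact"]]["riskFactor"],
    }
    for link in links
]

# Count how often each individual is named as a contact.
times_named = Counter(link["Contact"] for link in links)
print(times_named.most_common(1))
```

In this sample data every case names individual 1311, which is the kind of pattern that becomes visible once the two files are integrated.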

Page 25

MicrobeTrace

• Delivered securely via browser
  • Redirects to latest version

• All analyses and visuals run locally
  • Can disconnect from internet

• Customizable inputs and many formats
  • Excel formats (CSV and XLSX)
  • Raw sequences (.FASTA)
  • Distance matrix format

https://microbetrace.cdc.gov
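To show how raw sequences relate to a distance matrix, the sketch below parses FASTA text and computes pairwise p-distances (the proportion of compared sites that differ). The p-distance is used here only as a simple stand-in; the slides do not state which distance metric MicrobeTrace applies. The sequences and the function names `parse_fasta` and `p_distance` are made up, and the sequences are assumed to be already aligned.

```python
from itertools import combinations

# Made-up, pre-aligned sequences in FASTA format.
FASTA = """>s1
ACGTACGTAC
>s2
ACGTACGTAT
>s3
ACGAACGTTT
"""


def parse_fasta(text):
    """Return {name: sequence} from FASTA text; sequences may span several lines."""
    seqs, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            seqs[name] = ""
        elif name is not None:
            seqs[name] += line.upper()
    return seqs


def p_distance(a, b):
    """Proportion of differing sites, skipping gaps and ambiguous bases."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    compared = [(x, y) for x, y in zip(a, b) if x in "ACGT" and y in "ACGT"]
    if not compared:
        return float("nan")
    return sum(x != y for x, y in compared) / len(compared)


sequences = parse_fasta(FASTA)
matrix = {
    (a, b): p_distance(sequences[a], sequences[b])
    for a, b in combinations(sorted(sequences), 2)
}
for pair, d in matrix.items():
    print(pair, f"{d:.1%}")
```

The resulting pair-to-distance table is one possible representation of the distance matrix input listed on the slide.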

Page 26

MicrobeTrace Design Decisions and Process

Interaction with and feedback from end users at local health departments (HDs) have guided our primary design decisions and feature requirements:

1. Delivering MicrobeTrace via a web browser is preferable to downloading and installing software, because installation involves administrative red tape.

2. Local HDs do not currently have, or plan to obtain, the bioinformatic expertise necessary to perform phylodynamic analyses.

3. The overwhelming stigma and general sensitivity of HIV, HCV, and TB data require that most data, especially high-risk contact networks, be held locally on ‘air-gapped’ machines.

4. Contact networks, demographic data, and sequence data are fundamentally different and difficult to integrate. We provide automatic integration of arbitrary numbers or types of files.

Page 29

MicrobeTrace Usage Statistics
Launch to February (06.08.18 – Today, 551 days)

Users by Area in the US
• >1900 unique users
• Representing 44 states, including DC
  • median of 5 users per state
• Representing 66 major metropolitan areas
  • median of 3 users per metro area

High user count in GA is primarily from partner training events held at CDC.

Activity
• 2.2 visits per user
• ~10 min of use per visit

International Users
• 193 unique users representing 38 other countries

Page 30

The Road Ahead – Feedback and Support

• User feedback reigns supreme
  • Feedback via GitHub.com issues
  • Support via [email protected]

• Open source development practices enable us to leverage the community
  • Help wanted ‘bounties’

[Screenshots: Code.gov Entry; User-requested features; Open Issue Tracker]

Page 31

Conclusions

• Bioinformatic methods for cluster analysis are varied

• Method selection depends most strongly on the setting and questions of interest

• Genetic networks offer improved comparability and compatibility with contact tracing data

• Traditional phylogenetic trees are readily mapped to network formats without loss of information

• Nearest neighbor (algorithmic) filtration offers more precision than threshold criteria

Page 32

MicrobeTrace

• MicrobeTrace supports integration of genetic, patristic and/or contact tracing networks

• MicrobeTrace enables filtration based on threshold, nearest neighbor algorithm, or both simultaneously

Page 33

Acknowledgements

Bill Switzer
Molecular Epidemiology and Bioinformatics Team Lead

Tony Boyles, Jay Kim
MicrobeTrace Developers

Anupama Shankar
Training and Content Coordination, Quality Control, User Manual

Sergei Kniazev
Companion Tool Developer, Bioinformatic Algorithm Developer

Field Testing, Training and Feedback
DHAP – HICSB: Angela Hernandez, Alexa Oster, Anne Marie France, Cheryl Ocfemia, Paul Wiedle, Nivedha Panneer, Scott Cope
DTBE: Ben Silk, Kathryn Winglee, Sarah Talarico
DVH: Seth Sims, David Campos

Administration & Leadership
Bill Switzer, Walid Heneine, Michele Owen, Caryn Kim

CDC IT Office
Thom Sukalac, Rob MacGregor

Funding and Support by OAMD and DHAP OD

CDC Training Production, Content Production