algorithms and methods for cross-species forensics and...

Algorithms and Methods for Cross-Species Forensics and Classification

Darryl Reeves, Ph.D.Clinical and Research Genomics04/11/2018

OutlineI. Introduction

II. An alternative view of biological sequences

III. A model for sequence identification

IV. Conclusion/Future plans

O




IV. Conclusion/Future Research Agenda

O

Why study communities of organisms (metagenomics)?

I

§ Environment contains some interesting biological activity

§ Interaction between community and external environment/host

§ Understand how environmental changes impact organisms

Motivating examplesI

Metagenomics researchI

¡ Two very different research domains (environmental and medical)

¡ Same basic questions

¡ Given a heterogeneous community of organisms¡ Who is there?

¡ What are they doing (function)?¡ What is the abundance of members?

The GoalS

DNA Sequence

Metagenomics – dataI

Metagenomicidentification/classification

I


I

Species X

Sequence alignment

Sub-sequence markers

Sub-sequence composition

Hybrid approaches

Common Limitation


I

Species X

Species Y 95%

Species X 90%


I

Species X

Species Y

Species Z

Species 1

Species 2

Species 3

Species 3

HypothesisH

Current methods fail to utilize all available information

Results in a local view of community composition

Statistical learning can be used to develop a global view of communities





O

An alternative view

¡ Sequence identification fundamentally a comparison problem

K

¡ Comparing text vs. numbers

K-mersATGCTAGGAACCCTAGCTTACAGAGCAGTTGCAGG

K


K

3-mer CountATG 1TGC 2CTA 2TAG 2AGG 2GGA 1GAA 1

K-mer Frequency Distribution


K

3-mers, 64 combinations

5-mers, 1024 combinations

23-mers, 70,368,744,177,664 combinations


K

4K

K-mer diversity analysis –excluding k-mers

K

AGTAGTGGATA

GTGGATA

CAGTAG

AGTAGATA

Keys (m) Hash Function Hashes (n)

0

1

2

m >> n

KHashing

Data from KMerge

Sequence SiHash Val Count0 41 02 9

|H|-3 13|H|-2 1|H|-1 2

. . .

. . .

S

KMerge - reference

>genome1AGCTCTATCATGCCCTTAGTTTTAAACTAGGTCTAAGCTAGAAGG... +

Kingdom xDomain xyPhylum xzClass xaOrder xbFamily xtGenus xxSpecies y

GGGCATCTAT 12TAATC 65AGT 2ACCTA 1ACGATCATGC 78ACGTT 921

CAT 2702TGCCTGA 53 CTGAT 12GCA 97GCATA 231TTATATC 780

123214 116350840 310534543535 3911000001231 13201232144523 324341432234243 85322301430583 432402434238402 5453424320890 954353432800243 142344124023804 89243

DBCount K-mers

Hash K-mersFASTA Taxonomy

>genome2AGCTCTATCATGCCCTTAGTTTTAAACTAGGTCTAAGCTAGAAGG...

+Kingdom zDomain zyPhylum zzClass zaOrder zbFamily ztGenus zxSpecies b

GCTCAAATAT 12TAATCGG 22GGTAAG 5ACCTA 45ACGATCATGC 70ACG 1901

CATGG 702TGTGA 59 CTGATAC 14GCAGTCA 209ATA 210TTAGGTA 283

12321114 235033840 65255351545 31100231 232123223807 32411432243 85822303058331 430122438402312 5676420890 95513400243537 1342402380443 8431

FASTATaxonomy

>genome3AGCTCTATCATGCCCTTAGTTTTAAACTAGGTCTAAGCTAGAAGG...

+Kingdom wDomain wyPhylum wzClass waOrder wbFamily wtGenus wxSpecies d

GGGTAT 17TAATGCATC 623AGTAA 29ACCTAGTCA 905ACGATCC 63ACGTT 101

CATAC 372TGTGA 93 CTGATAG 52GCT 37GCATAAC 31TTATC 706

12331214 13135350840 326534563435 978230001231 19731232144523 3142412234243 8553014583 24024348402 54545342432 943964320243 142431124004 8937

FASTATaxonomy

. . .

. . .

. . .

K

KMerge - sample

@read1AGTCTAGATCTAG+?????BBBDBDDD#@read2CACCCTGACGGT+FBFFHHHH@@C>@read3...

GCATCTAT 2TAATCTC 5AGAAT 2ACCTA 1ACGATCACT 6ACGACAT 1CATCC 2TGCCTGA 5 GATAA 2GCACGAT 7GCATAAA 3ATATCGG 7

121 163540 9534535 51000121 21234523 414234243 83014383 424338402 53424320 934300243 1413804 2

DBCount K-mers

Hash K-mers

FASTQ


121 163540 9534535 51000121 21234523 414234243 83014383 424338402 53424320 934300243 1413804 2


121 163540 9534535 51000121 21234523 414234243 83014383 424338402 53424320 934300243 1413804 2

K

K-mer information contentK

H (X) = P(x)log21

P(x )x∑

RE = H (Xhash ) /H (Xraw )

8-bit16-bit32-bit





O

Flipping Coins

H = Flip Coin A

T = Flip Coin B

S

Flipping Coins

1

2

3

4

5

S

Flipping CoinsS

H,T

H,T x 2

X1

X2

X3 YN

M

f1,2

f2,3

Flipping Multiple CoinsS

X1

X2

X3 YN

M

f1,2

f2,3

Flipping Multiple Coins

H,T

H,T

H,T

H,T

x 2

x 4

x 8

S

Rolling Dice

X1

X2

X3 YN

M

f1,2

f2,3

1,2,3,4,5,6

1,2,3,4,5,6

1,2,3,4,5,6

1,2,3,4,5,6

S

x 6

x 36

x 216

Generative model

¡ Each factor can be envisioned as one or more sub-rank wheels

fskXsk

Xk

Fsk,k

EuksViruses

ProksArchaea

S

E AP

V

Generative process

¡ |H|- number of possible hash values

01

2

|H|-3

|H|-2

|H|-1

3Y

Xst

fst,y

S

. . .

¡ Hash wheel for St* is the distribution of hashes in genome¡ Hash wheel spun Ns = n times¡ Y0,…Yn representing n observed k-mer hashes

Graphical modelS

Xsk

Xk

Xp

Xc

Xst YNs

M

Xi – Categorical-distributed nodesfp,c – Factors

Y – Multinonmial-distributed k-mer hash values

Ns – number of observations in sequence Sifsk,k

fsk

fk,p

fp,c

fc,o

fsp,st

M – number of sequences in sample

Generative process

Sequence SHash Val Count0 41 02 9

|H|-3 13|H|-2 1|H|-1 2

. . .

. . .

S

Model inference

P(Z |Y ) = ?

P(Z |Y ) = P(Y | Z )P(Z )P(Y )

Computationally Intractable Integral

S

Parameter Learning:

variationalapproximation

Z = {X, f } Z: hidden variablesX: rank variablesf: conditional probabilities





O

The GoalS

KMerge+

Model

DNA Sequence

Next Steps

¡ Scaling, tuning, and benchmarking method

¡ Web application/visualization platform

C

¡ Publish results

¡ Implement complete model

AcknowledgementsAMason Lab

Chris Mason, Ph.D.*Dhruva Chandramohan, Ph.D.*Cem Meydan, Ph.D.*Elizabeth Hénaff,Ph.D.*Francine Garrett-Bakelman, MD, Ph.D.Virginia Wagatsuma, Ph.D.Niamh O’Hara, Ph.D.Bharath Prithiviraj, Ph.D.Marjan Bozinoski, Ph.D.Lenore Pipes, Ph.D.Priya Vijay, Ph.D.Pradeep AmbroseJake ReedAlexa McIntyreNoah AlexanderSofia AhsanuddinEbrahim AfshinnekooChou ChouMaureen Milici

Thesis CommitteeOlivier Elemento, Ph.D.*Gunnar Rätsch, Ph.D.Jason Mezey, Ph.D.

PhD Survival TeamNoah DuklerTom Vincent, Ph.D.Andre Kahles, Ph.D.Jaakko Luttinen, Ph.D.PBTech

Tri-I CBMDavid Christini, Ph.D.Christina Leslie, Ph.D.Margie Hinonangan-MendozaKathleen PickeringFrancine Collazo-Espinell

FundingNIH Ruth L. KirschteinNational Research ServiceAward (F31GM111053)

algorithms and methods for cross-species forensics and...

Documents