algorithms and methods for cross-species forensics and...

50
Algorithms and Methods for Cross-Species Forensics and Classification Darryl Reeves, Ph.D. Clinical and Research Genomics 04/11/2018

Upload: others

Post on 29-Jan-2021

1 views

Category:

Documents


0 download

TRANSCRIPT

  • Algorithms and Methods for Cross-Species Forensics and Classification

    Darryl Reeves, Ph.D.Clinical and Research Genomics04/11/2018

  • OutlineI. Introduction

    II. An alternative view of biological sequences

    III. A model for sequence identification

    IV. Conclusion/Future plans

    O

  • OutlineI. Introduction

    II. An alternative view of biological sequences

    III. A model for sequence identification

    IV. Conclusion/Future Research Agenda

    O

  • Why study communities of organisms (metagenomics)?

    I

    § Environment contains some interesting biological activity

    § Interaction between community and external environment/host

    § Understand how environmental changes impact organisms

  • Motivating examplesI

  • Metagenomics researchI

    ¡ Two very different research domains (environmental and medical)

    ¡ Same basic questions

    ¡ Given a heterogeneous community of organisms¡ Who is there?

    ¡ What are they doing (function)?¡ What is the abundance of members?

  • The GoalS

    DNA Sequence

  • Metagenomics – dataI

  • Metagenomics – dataI

  • Metagenomics – dataI

  • Metagenomicidentification/classification

    I

  • Metagenomicidentification/classification

    I

    Species X

    Sequence alignment

    Sub-sequence markers

    Sub-sequence composition

    Hybrid approaches

    Common Limitation

  • Metagenomicidentification/classification

    I

    Species X

    Species Y 95%

    Species X 90%

  • Metagenomicidentification/classification

    I

    Species X

    Species Y

    Species Z

    Species 1

    Species 2

    Species 3

    Species 3

  • HypothesisH

    Current methods fail to utilize all available information

    Results in a local view of community composition

    Statistical learning can be used to develop a global view of communities

  • OutlineI. Introduction

    II. An alternative view of biological sequences

    III. A model for sequence identification

    IV. Conclusion/Future plans

    O

  • An alternative view

    ¡ Sequence identification fundamentally a comparison problem

    K

    ¡ Comparing text vs. numbers

  • K-mersATGCTAGGAACCCTAGCTTACAGAGCAGTTGCAGG

    K

  • K-mersATGCTAGGAACCCTAGCTTACAGAGCAGTTGCAGG

    K

  • K-mersATGCTAGGAACCCTAGCTTACAGAGCAGTTGCAGG

    K

  • K-mersATGCTAGGAACCCTAGCTTACAGAGCAGTTGCAGG

    K

  • K-mersATGCTAGGAACCCTAGCTTACAGAGCAGTTGCAGG

    K

  • K-mersATGCTAGGAACCCTAGCTTACAGAGCAGTTGCAGG

    K

  • K-mersATGCTAGGAACCCTAGCTTACAGAGCAGTTGCAGG

    K

  • K-mersATGCTAGGAACCCTAGCTTACAGAGCAGTTGCAGG

    K

    3-mer CountATG 1TGC 2CTA 2TAG 2AGG 2GGA 1GAA 1

    K-mer Frequency Distribution

  • K-mersATGCTAGGAACCCTAGCTTACAGAGCAGTTGCAGG

    K

    3-mers, 64 combinations

    5-mers, 1024 combinations

    23-mers, 70,368,744,177,664 combinations

  • K-mersATGCTAGGAACCCTAGCTTACAGAGCAGTTGCAGG

    K

    4K

  • K-mer diversity analysis –excluding k-mers

    K

  • K-mer diversity analysis –excluding k-mers

    K

  • AGTAGTGGATA

    GTGGATA

    CAGTAG

    AGTAGATA

    Keys (m) Hash Function Hashes (n)

    0

    1

    2

    m >> n

    KHashing

  • Data from KMerge

    Sequence SiHash Val Count0 41 02 9

    |H|-3 13|H|-2 1|H|-1 2

    . . .

    . . .

    S

  • KMerge - reference

    >genome1AGCTCTATCATGCCCTTAGTTTTAAACTAGGTCTAAGCTAGAAGG... +

    Kingdom xDomain xyPhylum xzClass xaOrder xbFamily xtGenus xxSpecies y

    GGGCATCTAT 12TAATC 65AGT 2ACCTA 1ACGATCATGC 78ACGTT 921

    CAT 2702TGCCTGA 53 CTGAT 12GCA 97GCATA 231TTATATC 780

    123214 116350840 310534543535 3911000001231 13201232144523 324341432234243 85322301430583 432402434238402 5453424320890 954353432800243 142344124023804 89243

    DBCount K-mers

    Hash K-mersFASTA Taxonomy

    >genome2AGCTCTATCATGCCCTTAGTTTTAAACTAGGTCTAAGCTAGAAGG...

    +Kingdom zDomain zyPhylum zzClass zaOrder zbFamily ztGenus zxSpecies b

    GCTCAAATAT 12TAATCGG 22GGTAAG 5ACCTA 45ACGATCATGC 70ACG 1901

    CATGG 702TGTGA 59 CTGATAC 14GCAGTCA 209ATA 210TTAGGTA 283

    12321114 235033840 65255351545 31100231 232123223807 32411432243 85822303058331 430122438402312 5676420890 95513400243537 1342402380443 8431

    FASTATaxonomy

    >genome3AGCTCTATCATGCCCTTAGTTTTAAACTAGGTCTAAGCTAGAAGG...

    +Kingdom wDomain wyPhylum wzClass waOrder wbFamily wtGenus wxSpecies d

    GGGTAT 17TAATGCATC 623AGTAA 29ACCTAGTCA 905ACGATCC 63ACGTT 101

    CATAC 372TGTGA 93 CTGATAG 52GCT 37GCATAAC 31TTATC 706

    12331214 13135350840 326534563435 978230001231 19731232144523 3142412234243 8553014583 24024348402 54545342432 943964320243 142431124004 8937

    FASTATaxonomy

    . . .

    . . .

    . . .

    K

  • KMerge - sample

    @read1AGTCTAGATCTAG+?????BBBDBDDD#@read2CACCCTGACGGT+FBFFHHHH@@C>@read3...

    GCATCTAT 2TAATCTC 5AGAAT 2ACCTA 1ACGATCACT 6ACGACAT 1CATCC 2TGCCTGA 5 GATAA 2GCACGAT 7GCATAAA 3ATATCGG 7

    121 163540 9534535 51000121 21234523 414234243 83014383 424338402 53424320 934300243 1413804 2

    DBCount K-mers

    Hash K-mers

    FASTQ

    GCATCTAT 2TAATCTC 5AGAAT 2ACCTA 1ACGATCACT 6ACGACAT 1CATCC 2TGCCTGA 5 GATAA 2GCACGAT 7GCATAAA 3ATATCGG 7

    121 163540 9534535 51000121 21234523 414234243 83014383 424338402 53424320 934300243 1413804 2

    GCATCTAT 2TAATCTC 5AGAAT 2ACCTA 1ACGATCACT 6ACGACAT 1CATCC 2TGCCTGA 5 GATAA 2GCACGAT 7GCATAAA 3ATATCGG 7

    121 163540 9534535 51000121 21234523 414234243 83014383 424338402 53424320 934300243 1413804 2

    K

  • K-mer information contentK

    H (X) = P(x)log21

    P(x )x∑

    RE = H (Xhash ) /H (Xraw )

    8-bit16-bit32-bit

  • OutlineI. Introduction

    II. An alternative view of biological sequences

    III. A model for sequence identification

    IV. Conclusion/Future plans

    O

  • Flipping Coins

    H = Flip Coin A

    T = Flip Coin B

    S

  • Flipping Coins

    1

    2

    3

    4

    5

    S

  • Flipping CoinsS

    H,T

    H,T x 2

  • X1

    X2

    X3 YN

    M

    f1,2

    f2,3

    Flipping Multiple CoinsS

  • X1

    X2

    X3 YN

    M

    f1,2

    f2,3

    Flipping Multiple Coins

    H,T

    H,T

    H,T

    H,T

    x 2

    x 4

    x 8

    S

  • Rolling Dice

    X1

    X2

    X3 YN

    M

    f1,2

    f2,3

    1,2,3,4,5,6

    1,2,3,4,5,6

    1,2,3,4,5,6

    1,2,3,4,5,6

    S

    x 6

    x 36

    x 216

  • Generative model

    ¡ Each factor can be envisioned as one or more sub-rank wheels

    fskXsk

    Xk

    Fsk,k

    EuksViruses

    ProksArchaea

    S

    E AP

    V

  • Generative process

    ¡ |H|- number of possible hash values

    01

    2

    |H|-3

    |H|-2

    |H|-1

    3Y

    Xst

    fst,y

    S

    . . .

    ¡ Hash wheel for St* is the distribution of hashes in genome¡ Hash wheel spun Ns = n times¡ Y0,…Yn representing n observed k-mer hashes

  • Graphical modelS

    Xsk

    Xk

    Xp

    Xc

    Xst YNs

    M

    Xi – Categorical-distributed nodesfp,c – Factors

    Y – Multinonmial-distributed k-mer hash values

    Ns – number of observations in sequence Sifsk,k

    fsk

    fk,p

    fp,c

    fc,o

    fsp,st

    M – number of sequences in sample

  • Generative process

    Sequence SHash Val Count0 41 02 9

    |H|-3 13|H|-2 1|H|-1 2

    . . .

    . . .

    S

  • Model inference

    P(Z |Y ) = ?

    P(Z |Y ) = P(Y | Z )P(Z )P(Y )

    Computationally Intractable Integral

    S

    Parameter Learning:

    variationalapproximation

    Z = {X, f } Z: hidden variablesX: rank variablesf: conditional probabilities

  • OutlineI. Introduction

    II. An alternative view of biological sequences

    III. A model for sequence identification

    IV. Conclusion/Future plans

    O

  • The GoalS

    KMerge+

    Model

    DNA Sequence

  • Next Steps

    ¡ Scaling, tuning, and benchmarking method

    ¡ Web application/visualization platform

    C

    ¡ Publish results

    ¡ Implement complete model

  • AcknowledgementsAMason Lab

    Chris Mason, Ph.D.*Dhruva Chandramohan, Ph.D.*Cem Meydan, Ph.D.*Elizabeth Hénaff,Ph.D.*Francine Garrett-Bakelman, MD, Ph.D.Virginia Wagatsuma, Ph.D.Niamh O’Hara, Ph.D.Bharath Prithiviraj, Ph.D.Marjan Bozinoski, Ph.D.Lenore Pipes, Ph.D.Priya Vijay, Ph.D.Pradeep AmbroseJake ReedAlexa McIntyreNoah AlexanderSofia AhsanuddinEbrahim AfshinnekooChou ChouMaureen Milici

    Thesis CommitteeOlivier Elemento, Ph.D.*Gunnar Rätsch, Ph.D.Jason Mezey, Ph.D.

    PhD Survival TeamNoah DuklerTom Vincent, Ph.D.Andre Kahles, Ph.D.Jaakko Luttinen, Ph.D.PBTech

    Tri-I CBMDavid Christini, Ph.D.Christina Leslie, Ph.D.Margie Hinonangan-MendozaKathleen PickeringFrancine Collazo-Espinell

    FundingNIH Ruth L. KirschteinNational Research ServiceAward (F31GM111053)