algorithms and methods for cross-species forensics and...
TRANSCRIPT
-
Algorithms and Methods for Cross-Species Forensics and Classification
Darryl Reeves, Ph.D.Clinical and Research Genomics04/11/2018
-
OutlineI. Introduction
II. An alternative view of biological sequences
III. A model for sequence identification
IV. Conclusion/Future plans
O
-
OutlineI. Introduction
II. An alternative view of biological sequences
III. A model for sequence identification
IV. Conclusion/Future Research Agenda
O
-
Why study communities of organisms (metagenomics)?
I
§ Environment contains some interesting biological activity
§ Interaction between community and external environment/host
§ Understand how environmental changes impact organisms
-
Motivating examplesI
-
Metagenomics researchI
¡ Two very different research domains (environmental and medical)
¡ Same basic questions
¡ Given a heterogeneous community of organisms¡ Who is there?
¡ What are they doing (function)?¡ What is the abundance of members?
-
The GoalS
DNA Sequence
-
Metagenomics – dataI
-
Metagenomics – dataI
-
Metagenomics – dataI
-
Metagenomicidentification/classification
I
-
Metagenomicidentification/classification
I
Species X
Sequence alignment
Sub-sequence markers
Sub-sequence composition
Hybrid approaches
Common Limitation
-
Metagenomicidentification/classification
I
Species X
Species Y 95%
Species X 90%
-
Metagenomicidentification/classification
I
Species X
Species Y
Species Z
Species 1
Species 2
Species 3
Species 3
-
HypothesisH
Current methods fail to utilize all available information
Results in a local view of community composition
Statistical learning can be used to develop a global view of communities
-
OutlineI. Introduction
II. An alternative view of biological sequences
III. A model for sequence identification
IV. Conclusion/Future plans
O
-
An alternative view
¡ Sequence identification fundamentally a comparison problem
K
¡ Comparing text vs. numbers
-
K-mersATGCTAGGAACCCTAGCTTACAGAGCAGTTGCAGG
K
-
K-mersATGCTAGGAACCCTAGCTTACAGAGCAGTTGCAGG
K
-
K-mersATGCTAGGAACCCTAGCTTACAGAGCAGTTGCAGG
K
-
K-mersATGCTAGGAACCCTAGCTTACAGAGCAGTTGCAGG
K
-
K-mersATGCTAGGAACCCTAGCTTACAGAGCAGTTGCAGG
K
-
K-mersATGCTAGGAACCCTAGCTTACAGAGCAGTTGCAGG
K
-
K-mersATGCTAGGAACCCTAGCTTACAGAGCAGTTGCAGG
K
-
K-mersATGCTAGGAACCCTAGCTTACAGAGCAGTTGCAGG
K
3-mer CountATG 1TGC 2CTA 2TAG 2AGG 2GGA 1GAA 1
K-mer Frequency Distribution
-
K-mersATGCTAGGAACCCTAGCTTACAGAGCAGTTGCAGG
K
3-mers, 64 combinations
5-mers, 1024 combinations
23-mers, 70,368,744,177,664 combinations
-
K-mersATGCTAGGAACCCTAGCTTACAGAGCAGTTGCAGG
K
4K
-
K-mer diversity analysis –excluding k-mers
K
-
K-mer diversity analysis –excluding k-mers
K
-
AGTAGTGGATA
GTGGATA
CAGTAG
AGTAGATA
Keys (m) Hash Function Hashes (n)
0
1
2
m >> n
KHashing
-
Data from KMerge
Sequence SiHash Val Count0 41 02 9
|H|-3 13|H|-2 1|H|-1 2
. . .
. . .
S
-
KMerge - reference
>genome1AGCTCTATCATGCCCTTAGTTTTAAACTAGGTCTAAGCTAGAAGG... +
Kingdom xDomain xyPhylum xzClass xaOrder xbFamily xtGenus xxSpecies y
GGGCATCTAT 12TAATC 65AGT 2ACCTA 1ACGATCATGC 78ACGTT 921
CAT 2702TGCCTGA 53 CTGAT 12GCA 97GCATA 231TTATATC 780
123214 116350840 310534543535 3911000001231 13201232144523 324341432234243 85322301430583 432402434238402 5453424320890 954353432800243 142344124023804 89243
DBCount K-mers
Hash K-mersFASTA Taxonomy
>genome2AGCTCTATCATGCCCTTAGTTTTAAACTAGGTCTAAGCTAGAAGG...
+Kingdom zDomain zyPhylum zzClass zaOrder zbFamily ztGenus zxSpecies b
GCTCAAATAT 12TAATCGG 22GGTAAG 5ACCTA 45ACGATCATGC 70ACG 1901
CATGG 702TGTGA 59 CTGATAC 14GCAGTCA 209ATA 210TTAGGTA 283
12321114 235033840 65255351545 31100231 232123223807 32411432243 85822303058331 430122438402312 5676420890 95513400243537 1342402380443 8431
FASTATaxonomy
>genome3AGCTCTATCATGCCCTTAGTTTTAAACTAGGTCTAAGCTAGAAGG...
+Kingdom wDomain wyPhylum wzClass waOrder wbFamily wtGenus wxSpecies d
GGGTAT 17TAATGCATC 623AGTAA 29ACCTAGTCA 905ACGATCC 63ACGTT 101
CATAC 372TGTGA 93 CTGATAG 52GCT 37GCATAAC 31TTATC 706
12331214 13135350840 326534563435 978230001231 19731232144523 3142412234243 8553014583 24024348402 54545342432 943964320243 142431124004 8937
FASTATaxonomy
. . .
. . .
. . .
K
-
KMerge - sample
@read1AGTCTAGATCTAG+?????BBBDBDDD#@read2CACCCTGACGGT+FBFFHHHH@@C>@read3...
GCATCTAT 2TAATCTC 5AGAAT 2ACCTA 1ACGATCACT 6ACGACAT 1CATCC 2TGCCTGA 5 GATAA 2GCACGAT 7GCATAAA 3ATATCGG 7
121 163540 9534535 51000121 21234523 414234243 83014383 424338402 53424320 934300243 1413804 2
DBCount K-mers
Hash K-mers
FASTQ
GCATCTAT 2TAATCTC 5AGAAT 2ACCTA 1ACGATCACT 6ACGACAT 1CATCC 2TGCCTGA 5 GATAA 2GCACGAT 7GCATAAA 3ATATCGG 7
121 163540 9534535 51000121 21234523 414234243 83014383 424338402 53424320 934300243 1413804 2
GCATCTAT 2TAATCTC 5AGAAT 2ACCTA 1ACGATCACT 6ACGACAT 1CATCC 2TGCCTGA 5 GATAA 2GCACGAT 7GCATAAA 3ATATCGG 7
121 163540 9534535 51000121 21234523 414234243 83014383 424338402 53424320 934300243 1413804 2
K
-
K-mer information contentK
H (X) = P(x)log21
P(x )x∑
RE = H (Xhash ) /H (Xraw )
8-bit16-bit32-bit
-
OutlineI. Introduction
II. An alternative view of biological sequences
III. A model for sequence identification
IV. Conclusion/Future plans
O
-
Flipping Coins
H = Flip Coin A
T = Flip Coin B
S
-
Flipping Coins
1
2
3
4
5
S
-
Flipping CoinsS
H,T
H,T x 2
-
X1
X2
X3 YN
M
f1,2
f2,3
Flipping Multiple CoinsS
-
X1
X2
X3 YN
M
f1,2
f2,3
Flipping Multiple Coins
H,T
H,T
H,T
H,T
x 2
x 4
x 8
S
-
Rolling Dice
X1
X2
X3 YN
M
f1,2
f2,3
1,2,3,4,5,6
1,2,3,4,5,6
1,2,3,4,5,6
1,2,3,4,5,6
S
x 6
x 36
x 216
-
Generative model
¡ Each factor can be envisioned as one or more sub-rank wheels
fskXsk
Xk
Fsk,k
EuksViruses
ProksArchaea
S
E AP
V
-
Generative process
¡ |H|- number of possible hash values
01
2
|H|-3
|H|-2
|H|-1
3Y
Xst
fst,y
S
. . .
¡ Hash wheel for St* is the distribution of hashes in genome¡ Hash wheel spun Ns = n times¡ Y0,…Yn representing n observed k-mer hashes
-
Graphical modelS
Xsk
Xk
Xp
Xc
Xst YNs
M
Xi – Categorical-distributed nodesfp,c – Factors
Y – Multinonmial-distributed k-mer hash values
Ns – number of observations in sequence Sifsk,k
fsk
fk,p
fp,c
fc,o
fsp,st
M – number of sequences in sample
-
Generative process
Sequence SHash Val Count0 41 02 9
|H|-3 13|H|-2 1|H|-1 2
. . .
. . .
S
-
Model inference
P(Z |Y ) = ?
P(Z |Y ) = P(Y | Z )P(Z )P(Y )
Computationally Intractable Integral
S
Parameter Learning:
variationalapproximation
Z = {X, f } Z: hidden variablesX: rank variablesf: conditional probabilities
-
OutlineI. Introduction
II. An alternative view of biological sequences
III. A model for sequence identification
IV. Conclusion/Future plans
O
-
The GoalS
KMerge+
Model
DNA Sequence
-
Next Steps
¡ Scaling, tuning, and benchmarking method
¡ Web application/visualization platform
C
¡ Publish results
¡ Implement complete model
-
AcknowledgementsAMason Lab
Chris Mason, Ph.D.*Dhruva Chandramohan, Ph.D.*Cem Meydan, Ph.D.*Elizabeth Hénaff,Ph.D.*Francine Garrett-Bakelman, MD, Ph.D.Virginia Wagatsuma, Ph.D.Niamh O’Hara, Ph.D.Bharath Prithiviraj, Ph.D.Marjan Bozinoski, Ph.D.Lenore Pipes, Ph.D.Priya Vijay, Ph.D.Pradeep AmbroseJake ReedAlexa McIntyreNoah AlexanderSofia AhsanuddinEbrahim AfshinnekooChou ChouMaureen Milici
Thesis CommitteeOlivier Elemento, Ph.D.*Gunnar Rätsch, Ph.D.Jason Mezey, Ph.D.
PhD Survival TeamNoah DuklerTom Vincent, Ph.D.Andre Kahles, Ph.D.Jaakko Luttinen, Ph.D.PBTech
Tri-I CBMDavid Christini, Ph.D.Christina Leslie, Ph.D.Margie Hinonangan-MendozaKathleen PickeringFrancine Collazo-Espinell
FundingNIH Ruth L. KirschteinNational Research ServiceAward (F31GM111053)