CompostBin: A DNA composition based metagenomic binning algorithm
Sourav Chatterji*, Ichitaro Yamazaki, Zhaojun Bai and Jonathan Eisen
UC Davis schatterji@ucdavis.edu
Overview of Talk
Metagenomics and the binning problem.
CompostBin
The Microbial World
Exploring the Microbial World
Culturing: majority of microbes currently unculturable; no ecological context.
Molecular surveys (e.g. 16S rRNA): “who is out there?” “what are they doing?”
Metagenomics
Interpreting Metagenomic Data
Nature of metagenomic data: mosaic, intraspecies polymorphism, fragmentary.
New sequencing technologies: enormous amount of data, short reads.
Metagenomic Binning
Classification of sequences by taxa
Binning in Action
Glassy Winged Sharpshooter (Homalodisca coagulata).
Feeds on plant xylem (poor in organic nutrients).
Microbial Endosymbionts
Current Binning Methods
Assembly
Align with Reference Genome
Database Search [MEGAN, BLAST]
Phylogenetic Analysis
DNA Composition [TETRA, PhyloPythia]
Current Binning Methods
Need closely related reference genomes.
Poor performance on short fragments.
Sanger sequence reads are 500-1000 bp long.
Current assembly methods are unreliable.
Complex communities are hard to bin.
Overview of Talk
Metagenomics and the binning problem.
CompostBin
Genome Signatures
Does genomic sequence from an organism have a unique “signature” that distinguishes it from genomic sequence of other organisms? Yes [Karlin et al. 1990s]
What is the minimum length sequence that is required to distinguish genomic sequence of one organism from the genomic sequence of another organism?
Imperfect World
Horizontal Gene Transfer
Recent estimates [Ge et al. 2005]: varies between 0-6% of genes, typically ~2%.
But… Amelioration
DNA-composition metrics
The K-mer Frequency Metric
CompostBin uses hexamers.
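A minimal sketch of a K-mer frequency vector, assuming plain overlapping counts normalized to sum to 1. The function name `kmer_frequencies` is hypothetical, and details such as merging reverse complements are not stated on the slide and are left out here.

```python
from itertools import product

def kmer_frequencies(seq, k=6):
    """Return the normalized k-mer frequency vector of a DNA sequence."""
    # Fixed ordering of all 4**k possible k-mers over the alphabet ACGT.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    seq = seq.upper()
    for start in range(len(seq) - k + 1):
        # Windows containing N or other symbols are skipped (an assumption).
        j = index.get(seq[start:start + k])
        if j is not None:
            counts[j] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)

vec = kmer_frequencies("ACGTACGTACGTAC")
print(len(vec))  # 4**6 = 4096 dimensions for hexamers
```

With hexamers every read becomes a point in a 4096-dimensional space, regardless of the read length.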
DNA-composition metrics
Working with K-mers for binning.
Curse of dimensionality: O(4^K) independent dimensions.
Statistical noise increases with decreasing fragment lengths.
Project data into a lower dimensional space to decrease noise: Principal Component Analysis.
PCA separates species
Gluconobacter oxydans [65% GC] and Rhodospirillum rubrum [61% GC]
Effect of Skewed Relative Abundance
B. anthracis and L. monocytogenes
Panels: abundance 1:1 and abundance 20:1.
A Weighting Scheme
For each read, find overlap with other sequences
A Weighting Scheme
Calculate the redundancy of each position (values shown in the figure: 4 5 5 3).
Weight is the inverse of the average redundancy.
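As a small worked example, the weight can be sketched as follows, treating the values 4, 5, 5, 3 from the slide as the redundancy at four positions of one read. The function name `read_weight` is hypothetical, and it is an assumption that every position has redundancy of at least 1 (the read itself).

```python
def read_weight(redundancy):
    """Weight of a read: inverse of the average redundancy over its positions."""
    if not redundancy:
        raise ValueError("redundancy list must be non-empty")
    average = sum(redundancy) / len(redundancy)
    return 1.0 / average

w = read_weight([4, 5, 5, 3])  # average redundancy is 17 / 4 = 4.25
print(round(w, 4))  # 0.2353
```

Reads from an abundant species overlap many other reads, so they get high redundancy and a low weight, which counteracts the skew in relative abundance.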
Weighted PCA
Calculate the weighted mean $\mu_w = \sum_{i=1}^{N} w_i X_i$.
Calculate the weighted covariance matrix $M_w = \sum_{i=1}^{N} w_i (X_i - \mu_w)(X_i - \mu_w)^T$.
PCs are eigenvectors of $M_w$. Use the first three PCs for further analysis.
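The weighted PCA steps on this slide can be sketched with NumPy as below. This is an illustration under stated assumptions, not the authors' implementation: the function name `weighted_pca` is hypothetical, the weights are assumed to be normalized to sum to 1, and the input here is random toy data rather than K-mer vectors.

```python
import numpy as np

def weighted_pca(X, w, n_components=3):
    """Project rows of X onto the leading eigenvectors of the weighted covariance."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                        # assumption: weights sum to 1
    mu_w = w @ X                           # weighted mean: sum_i w_i X_i
    centered = X - mu_w
    # Weighted covariance: sum_i w_i (X_i - mu_w)(X_i - mu_w)^T
    M_w = (centered * w[:, None]).T @ centered
    eigvals, eigvecs = np.linalg.eigh(M_w)  # symmetric input, ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]          # columns are the leading PCs
    return centered @ components, components

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))               # 50 toy points in 8 dimensions
w = rng.uniform(0.1, 1.0, size=50)         # toy per-read weights
projected, pcs = weighted_pca(X, w)
print(projected.shape)  # (50, 3)
```

`np.linalg.eigh` is used because the weighted covariance matrix is symmetric; it returns eigenvalues in ascending order, so the columns are reordered to put the largest first.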
Weighted PCA separates species
B. anthracis and L. monocytogenes, abundance 20:1
Panels: PCA and weighted PCA.
Un-supervised Classification?
Semi-Supervised Classification
31 marker genes [courtesy Martin Wu]: omnipresent, relatively immune to lateral gene transfer.
Reads containing these marker genes can be classified with high reliability.
Semi-supervised Classification
Use a semi-supervised version of the normalized cut algorithm
The Semi-supervised Normalized Cut Algorithm
1. Calculate the K-nearest neighbor graph from the point set.
2. Update graph with marker information.
   o If two nodes are from the same species, add an edge between them.
   o If two nodes are from different species, remove any edge between them.
3. Bisect the graph using the normalized-cut algorithm.
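The three steps can be sketched as follows. This is a hedged illustration, not the CompostBin code: the function name `semi_supervised_ncut`, the unweighted edges, the Euclidean distance, and the split at zero are all assumptions, and the bisection uses the standard spectral relaxation of the normalized cut (an eigenvector of the normalized graph Laplacian).

```python
import numpy as np

def semi_supervised_ncut(points, labels, k=5):
    """Bisect points into two groups; labels[i] is a species id for reads
    carrying a marker gene and None for all other reads."""
    points = np.asarray(points, dtype=float)
    n = len(points)

    # Step 1: symmetric k-nearest-neighbor graph (Euclidean distance).
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)            # a point is not its own neighbor
    W = np.zeros((n, n))
    for i in range(n):
        W[i, np.argsort(dist[i])[:k]] = 1.0
    W = np.maximum(W, W.T)

    # Step 2: same-species marker reads gain an edge, different-species lose it.
    labeled = [i for i in range(n) if labels[i] is not None]
    for a in labeled:
        for b in labeled:
            if a != b:
                W[a, b] = 1.0 if labels[a] == labels[b] else 0.0

    # Step 3: spectral relaxation of the normalized cut.
    degree = np.maximum(W.sum(axis=1), 1e-12)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    L_sym = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # Shift the trivial eigenvector D^(1/2) 1 from eigenvalue 0 to 2, so the
    # smallest remaining eigenvector is the one that defines the cut.
    u0 = np.sqrt(degree)
    u0 /= np.linalg.norm(u0)
    _, vecs = np.linalg.eigh(L_sym + 2.0 * np.outer(u0, u0))
    indicator = d_inv_sqrt * vecs[:, 0]
    return indicator > 0                       # split at zero

# Toy example: ten points on a line, two marker reads per species.
xs = [0, 1, 2, 3, 4, 5.5, 6.5, 7.5, 8.5, 9.5]
points = [[x] for x in xs]
labels = ["A", None, None, None, "A", "B", None, None, None, "B"]
side = semi_supervised_ncut(points, labels, k=2)
print(side)
```

In the toy example the only edge joining the two halves runs between a marker read of species A and one of species B, so step 2 removes it and the bisection follows the species boundary. Which half is reported as `True` depends on the arbitrary sign of the eigenvector.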
Generalization to multiple bins
Gluconobacter oxydans [0.61], Granulibacter bethesdensis [0.59] and Nitrobacter hamburgensis [0.62]
Apply algorithm recursively.
Testing
Simulate metagenomic sequencing: Sanger reads.
Variables: number of species, relative abundance, GC content, phylogenetic diversity.
Test on a “real” dataset where answer is well-established.
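A toy sketch of such a simulation is shown below, assuming reads of 500-1000 bp (the Sanger read length quoted earlier in the talk) drawn from each genome in proportion to a relative abundance. The names `simulate_reads` and `random_genome` are hypothetical, the genomes are random strings standing in for real sequences, and sequencing errors and genome-size effects are ignored.

```python
import random

def simulate_reads(genomes, abundance, n_reads, min_len=500, max_len=1000, seed=0):
    """Sample (species, read) pairs; species are drawn in proportion to abundance."""
    rng = random.Random(seed)
    names = list(genomes)
    weights = [abundance[name] for name in names]
    reads = []
    for _ in range(n_reads):
        name = rng.choices(names, weights=weights)[0]
        genome = genomes[name]
        length = min(rng.randint(min_len, max_len), len(genome))
        start = rng.randint(0, len(genome) - length)
        reads.append((name, genome[start:start + length]))
    return reads

def random_genome(length, gc, rng):
    """Toy genome with the requested GC fraction (stand-in for a real sequence)."""
    return "".join(rng.choice("GC") if rng.random() < gc else rng.choice("AT")
                   for _ in range(length))

rng = random.Random(1)
genomes = {"high_gc": random_genome(20000, 0.65, rng),
           "low_gc": random_genome(20000, 0.35, rng)}
# Relative abundance 20:1, one of the variables listed on the slide.
reads = simulate_reads(genomes, {"high_gc": 20, "low_gc": 1}, n_reads=200)
print(len(reads))  # 200
```

Because the true species of every simulated read is known, binning output can be scored directly against the labels.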
Results
Conclusions/Future Directions
Satisfactory performance: no training on existing genomes, Sanger reads, low number of species.
Future work
Holy grail: complex communities.
Semi-supervised projection?
Hybrid assembly/binning.
Acknowledgements
UC Davis: Jonathan Eisen, Martin Wu, Dongying Wu, Ichitaro Yamazaki, Amber Hartman, Marcel Huntemann
UC Berkeley: Lior Pachter, Richard Karp, Ambuj Tewari, Narayanan Manikandan
Princeton University: Simon Levin, Josh Weitz, Jonathan Dushoff