sourav chatterji uc davis genome center schatterji@ucdavis
![Page 1: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/1.jpg)
Computational Metagenomics: Algorithms for Understanding the "Unculturable" Microbial Majority
Sourav Chatterji
UC Davis Genome Center
schatterji@ucdavis
![Page 2: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/2.jpg)
Background
![Page 3: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/3.jpg)
The Microbial World
![Page 4: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/4.jpg)
Exploring the Microbial World
• Culturing
  – Majority of microbes currently unculturable.
  – No ecological context.
• Molecular Surveys (e.g. 16S rRNA)
  – “who is out there?”
  – “what are they doing?”
![Page 5: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/5.jpg)
Environmental Shotgun Sequencing
![Page 6: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/6.jpg)
Interpreting Metagenomic Data
• Nature of Metagenomic Data
  – Mosaic
  – Intraspecies polymorphism
  – Fragmentary
• New Sequencing Technologies
  – Enormous amount of data
  – Short Reads
![Page 7: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/7.jpg)
Overview of Talk
• Metagenomic Binning
• PhyloMetagenomics
• The Big Picture / Future Work
![Page 8: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/8.jpg)
Overview of Talk
• Metagenomic Binning
  – Background
  – CompostBin
• PhyloMetagenomics
• The Big Picture / Future Work
![Page 9: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/9.jpg)
Metagenomic Binning
Classification of sequences by taxa
![Page 10: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/10.jpg)
Current Binning Methods
• Assembly
• Align with Reference Genome
• Database Search [MEGAN, BLAST]
• Phylogenetic Analysis
• DNA Composition [TETRA, PhyloPythia]
![Page 11: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/11.jpg)
Current Binning Methods
• Need closely related reference genomes.
• Poor performance on short fragments.
  – Sanger sequence reads 500-1000 bp long.
  – Current assembly methods unreliable.
• Complex Communities Hard to Bin.
![Page 12: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/12.jpg)
Genome Signatures
• Does genomic sequence from an organism have a unique “signature” that distinguishes it from genomic sequence of other organisms?
  – Yes [Karlin et al. 1990s]
• What is the minimum length sequence that is required to distinguish genomic sequence of one organism from the genomic sequence of another organism?
![Page 13: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/13.jpg)
DNA-composition metrics
The K-mer Frequency Metric
CompostBin uses hexamers
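A minimal sketch of a hexamer frequency vector for a single read. The function name `kmer_frequencies`, the fixed `ACGT` ordering, and the choice to skip windows containing ambiguity codes are assumptions made for illustration. The slides do not say how CompostBin handles ambiguous bases or reverse complements, and this sketch does not merge reverse-complement k-mers.

```python
from itertools import product


def kmer_frequencies(seq, k=6):
    """Return the normalized k-mer frequency vector of a DNA sequence."""
    alphabet = "ACGT"
    # Map every possible k-mer to a fixed position in the vector.
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    counts = [0] * len(index)
    seq = seq.upper()
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if kmer in index:  # skip windows containing N or other ambiguity codes
            counts[index[kmer]] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts


# Hypothetical toy read; real Sanger reads are several hundred bases long.
freqs = kmer_frequencies("ACGTACGTACGTAAN", k=6)
```

Each read becomes one point in a 4096-dimensional space, which is the input to the projection step described on the following slides.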
![Page 14: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/14.jpg)
DNA-composition metrics
• Working with K-mers for Binning.
  – Curse of Dimensionality: O(4^K) independent dimensions.
  – Statistical noise increases with decreasing fragment lengths.
• Project data into a lower dimensional space to decrease noise.
  – Principal Component Analysis.
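To make the dimensionality problem concrete, the short sketch below compares the number of k-mer dimensions with the number of k-mer windows a single read provides. The helper names and the 1000 bp read length (the upper end of the Sanger range quoted earlier) are illustrative choices.

```python
def kmer_dimensions(k):
    """Number of distinct k-mers over the alphabet A, C, G, T."""
    return 4 ** k


def windows_per_read(read_length, k):
    """Number of overlapping k-mer windows in a read of the given length."""
    return max(read_length - k + 1, 0)


hexamer_dims = kmer_dimensions(6)
hexamer_windows = windows_per_read(1000, 6)
```

For hexamers a 1000 bp read offers 995 windows spread over 4096 dimensions, so most entries of its frequency vector are zero and the estimate is noisy. That is the motivation for projecting into a lower-dimensional space.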
![Page 15: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/15.jpg)
PCA separates species
Gluconobacter oxydans [61% GC] and Rhodospirillum rubrum [65% GC]
![Page 16: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/16.jpg)
Effect of Skewed Relative Abundance
B. anthracis and L. monocytogenes
Abundance 1:1 Abundance 20:1
![Page 17: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/17.jpg)
A Weighting Scheme
For each read, find overlap with other sequences
![Page 18: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/18.jpg)
A Weighting Scheme
Calculate the redundancy of each position.
4 5 5 3
Weight is inverse of average redundancy.
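A small sketch of the weight computation. It treats the slide's example values 4, 5, 5, 3 as the complete per-position redundancy profile of one read, which is an assumption. The function name `read_weight` is hypothetical, and each position is assumed to have a redundancy of at least 1 because the read covers itself.

```python
def read_weight(redundancy):
    """Weight of a read: the inverse of its average per-position redundancy."""
    if not redundancy:
        raise ValueError("redundancy profile must not be empty")
    average = sum(redundancy) / len(redundancy)
    return 1.0 / average


# Example profile from the slide: average redundancy 17/4 = 4.25.
weight = read_weight([4, 5, 5, 3])
```

Reads from an abundant species overlap many other reads, get high redundancy, and are down-weighted, so a 20:1 abundance ratio no longer dominates the projection.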
![Page 19: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/19.jpg)
Weighted PCA
• Calculate the weighted mean µw:

$$\mu_w = \frac{\sum_{i=1}^{N} w_i X_i}{N}$$

• Calculate the weighted covariance matrix Mw:

$$M_w = \sum_{i=1}^{N} w_i (X_i - \mu_w)(X_i - \mu_w)^T$$

• Principal Components are eigenvectors of Mw.
  – Use first three PCs for further analysis.
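A hedged NumPy sketch of the weighted PCA step. It reads the slide's formulas as a weighted mean (the sum of w_i X_i divided by N) and a weighted covariance (the sum of w_i (X_i − µw)(X_i − µw)^T). The rescaling of the weights so that they sum to N is an assumption added here so that dividing by N gives a proper weighted mean. The function name and the random test data are illustrative.

```python
import numpy as np


def weighted_pca(X, w, n_components=3):
    """Project rows of X onto the top eigenvectors of the weighted covariance."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    n = X.shape[0]
    w = w * n / w.sum()                          # assumption: weights rescaled to sum to N
    mu_w = (w[:, None] * X).sum(axis=0) / n      # weighted mean
    centered = X - mu_w
    M_w = (w[:, None] * centered).T @ centered   # sum_i w_i (X_i - mu_w)(X_i - mu_w)^T
    eigvals, eigvecs = np.linalg.eigh(M_w)       # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    return centered @ components, components, mu_w


# Illustrative random data standing in for k-mer frequency vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
w = rng.uniform(0.1, 1.0, size=50)
projected, components, mu_w = weighted_pca(X, w)
```

With all weights equal this reduces to ordinary PCA on the mean-centered data.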
![Page 20: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/20.jpg)
Weighted PCA separates species
B. anthracis and L. monocytogenes: 20:1
PCA Weighted PCA
![Page 21: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/21.jpg)
Un-supervised Classification?
![Page 22: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/22.jpg)
Semi-Supervised Classification
• 31 Marker Genes [courtesy Martin Wu]– Omni-present– Relatively Immune to Lateral Gene Transfer
• Reads containing these marker genes can be classified with high reliability.
![Page 23: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/23.jpg)
Semi-supervised Classification
Use a semi-supervised version of the normalized cut algorithm
![Page 24: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/24.jpg)
The Semi-supervised Normalized Cut Algorithm
1. Calculate the K-nearest neighbor graph from the point set.
2. Update graph with marker information.
   o If two nodes are from the same species, add an edge between them.
   o If two nodes are from different species, remove any edge between them.
3. Bisect the graph using the normalized-cut algorithm.
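The three steps can be sketched as follows. This is an illustration under stated assumptions, not the CompostBin implementation. The graph is unweighted, marker edges get weight 1, and the bisection uses the standard spectral relaxation of normalized cut (the second eigenvector of the normalized Laplacian, thresholded at zero), which assumes the graph stays connected. All function names and the toy points are hypothetical.

```python
import numpy as np


def knn_graph(points, k):
    """Step 1: symmetric 0/1 adjacency matrix of the k-nearest-neighbor graph."""
    points = np.asarray(points, dtype=float)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)              # a point is not its own neighbor
    nearest = np.argsort(dists, axis=1)[:, :k]
    A = np.zeros((len(points), len(points)))
    for i in range(len(points)):
        A[i, nearest[i]] = 1.0
    return np.maximum(A, A.T)


def apply_markers(A, labels):
    """Step 2: labels[i] is a species name for marker-bearing reads, else None."""
    A = A.copy()
    known = [i for i, lab in enumerate(labels) if lab is not None]
    for a in known:
        for b in known:
            if a != b:
                # Same species: add an edge. Different species: remove any edge.
                A[a, b] = 1.0 if labels[a] == labels[b] else 0.0
    return A


def normalized_cut_bisect(A):
    """Step 3: spectral relaxation of normalized cut (assumes a connected graph)."""
    degree = A.sum(axis=1)
    degree[degree == 0] = 1e-12                  # guard against isolated nodes
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    L_sym = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)              # ascending eigenvalues
    fiedler = d_inv_sqrt * vecs[:, 1]            # generalized second eigenvector
    return fiedler > 0


# Toy point set: two groups of six points, one marker pair per group.
points = [(0, y) for y in range(6)] + [(10, y) for y in range(6)]
labels = ["sp1", None, None, None, None, "sp1",
          "sp2", None, None, "sp2", None, None]
side = normalized_cut_bisect(apply_markers(knn_graph(points, k=6), labels))
```

In this toy case the marker step removes the single edge joining a known `sp1` read to a known `sp2` read, and the bisection separates the two groups of six points.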
![Page 25: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/25.jpg)
Generalization to multiple bins
Gluconobacter oxydans [0.61], Granulibacter bethesdensis [0.59] and Nitrobacter hamburgensis [0.62]
Apply algorithm recursively
![Page 26: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/26.jpg)
Generalization to multiple bins
Gluconobacter oxydans [0.61], Granulibacter bethesdensis [0.59] and Nitrobacter hamburgensis [0.62]
![Page 27: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/27.jpg)
Testing
• Simulate Metagenomic Sequencing
  – Variables
    • Number of species
    • Relative abundance
    • GC content
    • Phylogenetic Diversity
• Test on a “real” dataset where answer is well-established.
![Page 28: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/28.jpg)
Results
![Page 29: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/29.jpg)
Conclusions
• Satisfactory performance
• No Training on Existing Genomes
• Sanger Reads
• Low number of Species
![Page 30: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/30.jpg)
Overview of Talk
• Metagenomic Binning
• Phylo-Metagenomics
  – Background
  – Incorporating Alignment Accuracy
• The Big Picture / Future Work
![Page 31: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/31.jpg)
Phylogenetic Trees
Charles Darwin, First Notebook on Transmutation of Species (1837)
![Page 32: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/32.jpg)
Garcia Martin et al., Nat. Biotechnology (2006)
Population Structure of Communities
![Page 33: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/33.jpg)
Yooseph et al., PLoS Biology (2007)
Gene Family Characterization
![Page 34: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/34.jpg)
![Page 35: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/35.jpg)
Wong et al., Science, 2008
![Page 36: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/36.jpg)
Manual Masking
• Requires skilled and tedious manual intervention
• Subjective and non-reproducible
• Impractical for high throughput data
  – Frequently ignored. “Garbage in, garbage out”
![Page 37: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/37.jpg)
Gblocks
![Page 38: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/38.jpg)
Probabilistic Masking using pair-HMMs
• Probabilistic formulation of alignment problem.
• Can answer additional questions
  – Alignment Reliability
  – Sub-optimal Alignments
Durbin et al., Cambridge University Press (1998)
![Page 39: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/39.jpg)
Probabilistic Masking
• What is the probability that residues xi and yj are homologous?
• Posterior probability that residues xi and yj are homologous:

$$\Pr[x_i \diamond y_j \mid x, y] = \frac{\Pr[x_i \diamond y_j,\; x,\; y]}{\Pr[x, y]}$$

• Can be calculated efficiently for all pairs (and gaps) in quadratic time.
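The quadratic-time claim can be illustrated with a forward-backward computation. The sketch below is deliberately simplified and is not a full pair-HMM. It uses a single state with unnormalized weights for match, mismatch, and gap (the values 1.0, 0.1, and 0.05 are made up) instead of trained transition and emission probabilities. The posterior that xi is aligned to yj is the total weight of alignments containing that pair, divided by the total weight of all alignments.

```python
def match_posteriors(x, y, match=1.0, mismatch=0.1, gap=0.05):
    """Posterior that x[i] is aligned to y[j] under a single-state toy model."""
    n, m = len(x), len(y)

    def emit(i, j):  # weight of aligning x_i with y_j (1-based indices)
        return match if x[i - 1] == y[j - 1] else mismatch

    # Forward: F[i][j] = total weight of alignments of x[:i] with y[:j].
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    F[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            total = 0.0
            if i > 0 and j > 0:
                total += F[i - 1][j - 1] * emit(i, j)
            if i > 0:
                total += F[i - 1][j] * gap
            if j > 0:
                total += F[i][j - 1] * gap
            F[i][j] = total

    # Backward: B[i][j] = total weight of alignments of x[i:] with y[j:].
    B = [[0.0] * (m + 1) for _ in range(n + 1)]
    B[n][m] = 1.0
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            total = 0.0
            if i < n and j < m:
                total += emit(i + 1, j + 1) * B[i + 1][j + 1]
            if i < n:
                total += gap * B[i + 1][j]
            if j < m:
                total += gap * B[i][j + 1]
            B[i][j] = total

    Z = F[n][m]  # total weight of all alignments
    return [[F[i - 1][j - 1] * emit(i, j) * B[i][j] / Z
             for j in range(1, m + 1)] for i in range(1, n + 1)]


post = match_posteriors("ACGT", "ACGT")
```

Both tables have (n+1)(m+1) cells and each cell takes constant work, so all pairwise posteriors come out in time proportional to the product of the sequence lengths.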
![Page 40: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/40.jpg)
Scoring Multiple Alignments
• Calculate the “posterior probability matrix” and distances dij between every pair of sequences.
• Weighted “sum of pairs” score for column r:

$$\frac{\sum_{i,j} d_{ij} \Pr[r_i \diamond r_j]}{\sum_{i,j} d_{ij}}$$
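A minimal sketch of the weighted sum-of-pairs column score, read as the distance-weighted average of the pairwise posteriors in a column. The dictionary layout keyed by sequence pairs, the function name, and the numbers in the example are assumptions for illustration. The slides do not say how pairs involving a gap are treated, so the caller decides which pairs to include.

```python
def column_score(pair_posteriors, distances):
    """Distance-weighted average of pairwise posteriors for one column.

    pair_posteriors[(i, j)] is the posterior that the residues of sequences
    i and j in this column are homologous; distances[(i, j)] is d_ij.
    """
    numerator = sum(distances[p] * pair_posteriors[p] for p in pair_posteriors)
    denominator = sum(distances[p] for p in pair_posteriors)
    return numerator / denominator


# Hypothetical column over three sequences.
posteriors = {(0, 1): 0.9, (0, 2): 0.6, (1, 2): 0.8}
distances = {(0, 1): 0.2, (0, 2): 0.5, (1, 2): 0.3}
score = column_score(posteriors, distances)
```

Weighting by distance lets agreement between distantly related sequences count for more than agreement between near-identical ones. Columns whose score falls below a chosen threshold would then be masked.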
![Page 41: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/41.jpg)
Testing
The BAliBASE 3.0 Benchmark Database
![Page 42: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/42.jpg)
Testing
• Realign sequences using MSA programs like ClustalW.
• Sensitivity: of all correctly aligned columns, the fraction that has been masked as good
• Specificity: of all incorrectly aligned columns, the fraction that has been masked as bad
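The two definitions translate directly into code. In this sketch `correct[c]` says whether column c agrees with the benchmark alignment and `kept[c]` says whether the mask marked it as good. Both names and the six-column example are hypothetical.

```python
def masking_sensitivity_specificity(correct, kept):
    """Return (sensitivity, specificity) of a column mask against a benchmark."""
    true_pos = sum(1 for c, k in zip(correct, kept) if c and k)
    true_neg = sum(1 for c, k in zip(correct, kept) if not c and not k)
    n_correct = sum(1 for c in correct if c)
    n_incorrect = len(correct) - n_correct
    sensitivity = true_pos / n_correct if n_correct else float("nan")
    specificity = true_neg / n_incorrect if n_incorrect else float("nan")
    return sensitivity, specificity


# Four correctly aligned columns (three kept), two incorrect (one masked).
correct = [True, True, True, True, False, False]
kept = [True, True, True, False, False, True]
sens, spec = masking_sensitivity_specificity(correct, kept)
```

A mask with low sensitivity discards many correctly aligned columns and therefore throws away phylogenetic signal, even when its specificity is high.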
![Page 43: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/43.jpg)
Performance
Gblocks
Prob Mask
Sensitivity Specificity
97% 93%
53% 94%
![Page 44: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/44.jpg)
Effect on Phylogenetic Inference
| Protocol | Symmetric Tree Inference Accuracy | Asymmetric Tree Inference Accuracy |
|---|---|---|
| No Masking | 84.08 % | 80.51 % |
| Gblocks | 76.92 % | 79.99 % |
| Prob. Masking | 85.11 % | 84.60 % |

Gblocks simulated data-set, PhyML likelihood tree
![Page 45: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/45.jpg)
Consistency between Alignment Programs
• Yeast Genome Data Set
  – 7 yeast species, 1502 “orthologs” in each.
• Wong et al., Science (2008).
  – Aligned using 7 programs
  – Different programs often give inconsistent answers.
• Garbage in, Garbage Out?
  – Partial Data, confusing global alignment programs.
  – No Masking
![Page 46: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/46.jpg)
Consistency between Alignment Programs
| Protocol | Inconsistent | Consistent |
|---|---|---|
| No Masking | 4.05 % | 95.95 % |
| Prob. Masking | 2.74 % | 97.26 % |

Masking removes ~33% of inconsistencies
![Page 47: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/47.jpg)
Consistency between Alignment Programs
| Protocol | Inconsistent, No Bootstrap Support | Inconsistent, Bootstrap Support | Consistent, No Bootstrap Support | Consistent, Bootstrap Support |
|---|---|---|---|---|
| No Masking | 3.73 % | 0.32 % | 23.41 % | 72.54 % |
| Prob. Masking | 2.67 % | 0.07 % | 23.77 % | 73.48 % |

Masking removes ~75% of inconsistencies with high support
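The two "removes ~X% of inconsistencies" figures are relative reductions computed from the table entries. A short check using only the percentages on the slides (the helper name is hypothetical):

```python
def relative_reduction(before, after):
    """Fraction of the original quantity that was removed."""
    return (before - after) / before


# All inconsistent trees: 4.05 % without masking, 2.74 % with masking.
all_inconsistent = relative_reduction(4.05, 2.74)

# Inconsistent trees with bootstrap support: 0.32 % versus 0.07 %.
supported_inconsistent = relative_reduction(0.32, 0.07)
```

These come to roughly 32% and 78%, in line with the rounded ~33% and ~75% quoted on the slides.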
![Page 48: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/48.jpg)
The Final Result
A Phylogenetic Database/Pipeline (with Martin Wu)
![Page 49: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/49.jpg)
Overview of Talk
• Metagenomic Binning
• Phylo-Metagenomics
• The Big Picture / Future Work
![Page 50: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/50.jpg)
Population Structure
Venter et al. , Science (2004)
![Page 51: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/51.jpg)
Future Directions/Challenges
• What defines a species (OTU)?
  – Clustering Problem
• Handling Partial Data
• Improved Phylogenetic Inference
• How to integrate information from multiple markers?
![Page 52: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/52.jpg)
Species Interactions
![Page 53: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/53.jpg)
Interactions in Microbial Communities
![Page 54: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/54.jpg)
Time Series Data
Ruan et al., Bioinformatics (2006)
![Page 55: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/55.jpg)
Interaction Networks in Microbial Communities
Ruan et al., Bioinformatics (2006)
![Page 56: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/56.jpg)
Functional Profiling
Prediction of Gene Function
Prediction of Metabolic Pathway
![Page 57: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/57.jpg)
Functional Profiling (with Binning)
McCutcheon and Moran, PNAS (2007)
![Page 58: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/58.jpg)
Future Directions/Challenges
• Inferring Species Interactions
  – Time Series Analysis
  – Network Dynamics
• Generalizing Binning to Multiple Classes
  – Semi-supervised Approach
• Semi-Supervised Projection?
  – More Phylogenetic Markers
• Iterative Binning/Assembly
  – Problem: Modeling variations within a species
![Page 59: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/59.jpg)
Single Cell Genomics
Reads From Single Cell “Simulated” Contamination
With Ramunas Stepanauskas at Bigelow Institute
![Page 60: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/60.jpg)
Detecting Genetic Engineering
Caveat: also detects anomalous host DNA (e.g. LGT); comparative genomics helps.
![Page 61: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/61.jpg)
The Big Picture
Microbial Community
Metagenomic Sampling Single Cell Genomics
Population Structure Functional Profiling
Species Interaction Network
Time Series Data
![Page 62: Sourav Chatterji UC Davis Genome Center schatterji@ucdavis](https://reader035.vdocuments.us/reader035/viewer/2022062410/568163ed550346895dd55fd9/html5/thumbnails/62.jpg)
Acknowledgements
UC Davis
• Jonathan Eisen
• Martin Wu
• Dongying Wu
• Ichitaro Yamazaki
• Amber Hartman
• Marcel Huntemann

UC Berkeley
• Lior Pachter
• Richard Karp
• Ambuj Tewari
• Narayanan Manikandan

Princeton University
• Simon Levin
• Josh Weitz
• Jonathan Dushoff