Rachel L. Harris 1
Insights into the phylogeny and coding potential of microbial dark matter: a replication of phylogenetic anchoring methods described by Rinke et al., 2013
Rachel Harris, David Zhao, Melany Ruiz Urigen, & Chuhan Zong
QCB 455 – MOL 455 – COS 551, Fall 2014 | Instructor: Dr. Anastasia Baryshnikova
ABSTRACT
In their 2013 study, Rinke et al. challenge the efficacy of a gold-standard reference genome database in accurately anchoring metagenomic reads. By appending 201 uncultivated archaeal and bacterial genomes representing largely uncharted taxa (so-called “microbial dark matter”) to this gold standard, Rinke et al. significantly improve phylogenetic anchoring of 475 metagenomes. In this study, we apply Rinke et al.’s methods to ten of their reported top-recruiting metagenomes. We not only replicate but also improve upon their results, concluding that microbial dark matter genomes are key players in improving the phylogenetic anchoring performance of reference genome databases.
INTRODUCTION
A clear cultivation bias exists in microbial phylogenetics. As of 2010, half of all
sequenced prokaryotic phyla lacked a cultivated representative, and 88% of all microbial
isolates belonged to just four bacterial phyla: Proteobacteria, Firmicutes,
Actinobacteria, and Bacteroidetes¹. Rinke et al. attempt to address these biases by testing whether
the genomes of microbial dark matter (uncultivated prokaryotes representing poorly sequenced
branches on the tree of life), when appended to an NCBI BLASTx reference database, improve
phylogenetic anchoring for queried metagenomic reads². In this study we aimed to replicate
Rinke et al.’s approach using alternative tools, including BLASTn, the R software
environment, and the Galaxy platform. Results of the two studies were in
agreement, and our analyses often improved upon the original results.
MATERIALS AND METHODS
Ten metagenomes were selected for analysis from publicly available databases according
to their relative anchoring performance (Rinke et al., Figure 4) and their representation of nine
diverse habitats: Sakinaw Lake (SAK), TA Mother Reactor (TAM), GBS 85C sediment (GBS),
Saanich Inlet pooled fosmids (SAA), GOS Mangrove on Isabella Island (MAN), Yellowstone
Bison Hot Spring (BIS), Line P J08P26-500 (LNP), TA reactor biofilm (BIO), Peru Margin
(PER), and Marine Sediments sample SCG71 (MAR). All 201 single amplified genome (SAG) assemblies
were accessed from the Microbial Dark Matter (MDM) project website (http://genome.jgi.doe.gov/MDM).
A random subset of 10,000 reads was extracted from each metagenome and subjected to
two runs of NCBI’s Nucleotide BLAST (BLASTn). The first run searched each metagenome
against NCBI’s non-redundant nucleotide (nt) database, whereas the second run searched
each metagenome against a modified database composed of nt plus Rinke et al.’s 201 SAG
assemblies. The resulting BLASTn hits against the nt and nt+SAGs databases are
hereafter referred to as BLAST Hits 1 (BH1) and BLAST Hits 2 (BH2), respectively. Target
labels of these hits were either NCBI GI sequence identifiers (GI IDs) or one of
Rinke et al.’s SAG IDs. A third BLASTn run was subsequently performed on all SAG
assemblies in order to replace any SAG target labels in BH2 with their respective GI
IDs. Queries originally assigned to SAG targets that had no corresponding GI ID
were considered false positives and were removed from the BH2 analysis. Duplicate
queries with the same target label were also removed from both BH1 and BH2 so that
repeated hits would not distort community composition or the subsequent statistical analysis.
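The relabeling and de-duplication steps described above can be sketched in Python as follows. This is a minimal illustration rather than the scripts used in this study: the `(query, target)` pair layout, the `SAG_` prefix, the example IDs, and the `sag_to_gi` lookup (standing in for the outcome of the third BLASTn run) are all hypothetical.

```python
def clean_hits(hits, sag_to_gi=None, sag_prefix="SAG_"):
    """Relabel SAG targets with GI IDs, drop unmapped SAG hits, de-duplicate.

    hits       -- iterable of (query_id, target_label) pairs from one BLAST run
    sag_to_gi  -- hypothetical mapping of SAG ID -> GI ID
    sag_prefix -- hypothetical naming convention that identifies SAG targets
    """
    sag_to_gi = sag_to_gi or {}
    seen = set()
    cleaned = []
    for query, target in hits:
        if target.startswith(sag_prefix):
            gi = sag_to_gi.get(target)
            if gi is None:
                # SAG target without a corresponding GI ID: treated as a false positive.
                continue
            target = gi
        pair = (query, target)
        if pair in seen:
            # Same query with the same target label: keep only the first occurrence.
            continue
        seen.add(pair)
        cleaned.append(pair)
    return cleaned


# Toy BH2 input (made-up IDs, not data from the study).
bh2_raw = [
    ("read1", "gi|111"),
    ("read1", "gi|111"),  # duplicate query/target pair
    ("read2", "SAG_A"),   # SAG hit that maps to a GI ID
    ("read3", "SAG_B"),   # SAG hit with no GI ID
]
bh2 = clean_hits(bh2_raw, sag_to_gi={"SAG_A": "gi|222"})
print(bh2)
```

Applying the same function to BH1 with an empty mapping performs only the de-duplication step.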
Whereas Rinke et al. determined the phylogenies of BLAST hits with the aid of MEGAN4
software³, we obtained taxonomic summaries by submitting the GI ID targets from both BLAST hit
sets of each metagenome to Princeton University’s Galaxy Project⁴ server
(https://galaxy.princeton.edu). All statistical analyses were performed on BLAST hits at the
phylum level in the R software environment (http://r-project.org) to determine whether
searching against nt+SAGs yielded significant improvements in read anchoring,
phylogenetic binning, and percent identity distribution relative to searching against nt alone.
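As a rough illustration of what a phylum-level summary involves, the Python sketch below tallies reads per phylum and computes the percentage of reads anchored. It is not the Galaxy workflow used here: the `target_to_phylum` lookup, the example IDs, and the rule that a read counts once under the phylum of its first classifiable hit are assumptions made for the example.

```python
from collections import Counter


def phylum_summary(hits, target_to_phylum, n_reads):
    """Return (percent of reads anchored at phylum level, reads per phylum).

    hits             -- iterable of (query_id, target_label) pairs
    target_to_phylum -- hypothetical lookup of target label -> phylum name
    n_reads          -- number of reads submitted (10,000 per metagenome in this study)
    """
    read_phylum = {}
    for query, target in hits:
        phylum = target_to_phylum.get(target)
        # Assumption: each read is counted once, under its first classifiable hit.
        if phylum is not None and query not in read_phylum:
            read_phylum[query] = phylum
    counts = Counter(read_phylum.values())
    percent_anchored = 100.0 * len(read_phylum) / n_reads
    return percent_anchored, counts


# Toy example with made-up IDs and a 10-read "metagenome".
toy_hits = [("r1", "gi|1"), ("r2", "gi|2"), ("r2", "gi|3"), ("r3", "gi|9")]
toy_taxonomy = {"gi|1": "Proteobacteria", "gi|2": "Firmicutes", "gi|3": "Firmicutes"}
percent, per_phylum = phylum_summary(toy_hits, toy_taxonomy, n_reads=10)
print(percent, dict(per_phylum))
```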
RESULTS
Anchoring of Metagenomic Reads
Rinke et al. report >2% BH2 read anchoring at the phylum level for all ten metagenomes
analyzed in this study. We largely confirm these findings and improve upon them,
recovering greater read anchoring for six of the ten metagenomes (Fig. 1). Only the BIO
and PER metagenomes yield <2% read anchoring (1.49% and 1.57%, respectively).
Whereas the Rinke study reports SAK as demonstrating the greatest recovery of read hits
(19.56%), our highest-recruiting metagenome, BIS, reaches 60.22%, roughly a three-fold
improvement. A paired Student’s t-test of all BLAST hits from our ten analyzed
metagenomes reveals that significantly (P=0.00023) more reads were assigned at the phylum
level for BH2 relative to BH1. This result is in concordance with that of Rinke et al.
(P=0.00024), who performed the same analysis for their top 19 recruiting metagenomes (Fig.
1a).
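For readers unfamiliar with the paired test, the statistic can be computed as in the Python sketch below. The analysis in this study was run in R; this standard-library version only shows the arithmetic, and the anchoring percentages are made-up toy values, not our data. A P value would then be obtained from the t distribution with the returned degrees of freedom.

```python
import math
from statistics import mean, stdev


def paired_t(before, after):
    """Paired Student's t statistic and degrees of freedom for matched samples."""
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    # t = mean difference divided by the standard error of the differences.
    t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t_stat, n - 1


# Hypothetical phylum-level anchoring percentages for four metagenomes.
bh1_percent = [1.0, 2.0, 3.0, 4.0]
bh2_percent = [2.0, 4.0, 5.0, 7.0]
t_stat, df = paired_t(bh1_percent, bh2_percent)
print(round(t_stat, 3), df)
```

Pairing matters here because each metagenome serves as its own control: the test asks whether the per-metagenome change from BH1 to BH2 differs from zero.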
Phylogenetic Binning
Figure 4 in Rinke et al. depicts the 23 most anchored phyla following classification of
BH2 hits by MEGAN4. By contrast, phylogenetic binning conducted via Galaxy in this analysis
reveals an overlap of only eight phyla as top recruiters in the surveyed metagenomes. However, all
eight overlapping phyla (Acetothermia, Caldiserica, Cloacimonetes,
Marinimicrobia, Synergistetes, Euryarchaeota, Nanoarchaeota, and Thaumarchaeota) show
improved phylogenetic anchoring from BH1 to BH2 in several metagenomes where no such
improvement was noted by Rinke et al. (Fig. 1a). Furthermore, our results indicate at
least four additional phyla not mentioned in the parent study that demonstrate an average
improvement of ≥1% in binning across all metagenomes: Aquificae, Firmicutes, Ignavibacteriae, and
Proteobacteria (Fig. 1b).
Fig. 1 | Phylogenetic anchoring. 1a. Modified version of Figure 4 from the original Rinke et al. publication, depicting 19 top-recruiting metagenomes characterized by >2% phylum-level read anchoring. Highlighted metagenomes are those analyzed in this study. Top-recruiting phyla in Rinke et al.’s study are listed at the top, with marked phyla representing overlapping top recruiters in our own analysis. Black rectangles represent additional phylum-level recruits elucidated in our analysis that were not reported by Rinke et al. 1b. Heat map modeled on the one in 1a, portraying all phyla that show improved read anchoring from BH1 to BH2. Grey cells label phyla showing 0% anchoring improvement.
Beyond the analysis conducted by Rinke et al., we tested each metagenome individually
for significant phylum-level differences in community composition. A paired Student’s t-test
revealed a significant difference in classification at the phylum level for LNP (P=0.02627), SAK
(P=0.03152), and TAM (P=0.01966) metagenomes (Fig. 2).
Percent Identity Distribution
We also determined whether searching metagenomes against nt+SAGs
improved phylogenetic anchoring at a threshold of ≥97% query-target nucleotide
identity relative to searching against nt alone. To do so, we removed all BLAST
hits that shared the same query and target label in both BH1 and BH2 for each
metagenome, leaving only novel classifications for consideration.
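The filtering step can be sketched in Python as below. This is an illustrative reconstruction under assumptions, not the code used in this study: hits are represented as hypothetical `(query, target, percent_identity)` tuples with made-up values, and "novel" is taken to mean a BH2 query/target pair that does not occur in BH1.

```python
def novel_hits(bh1, bh2):
    """Return BH2 hits whose (query, target) pair is absent from BH1."""
    shared = {(query, target) for query, target, _ in bh1}
    return [hit for hit in bh2 if (hit[0], hit[1]) not in shared]


def fraction_at_or_above(hits, threshold=97.0):
    """Fraction of hits with percent identity at or above the threshold."""
    if not hits:
        return 0.0
    return sum(1 for _, _, pid in hits if pid >= threshold) / len(hits)


# Toy hits (made-up IDs and identities).
bh1 = [("r1", "gi|1", 92.0), ("r2", "gi|2", 98.5)]
bh2 = [("r1", "gi|7", 99.1), ("r2", "gi|2", 98.5), ("r3", "gi|8", 95.0)]

novel = novel_hits(bh1, bh2)
print(novel)
print(fraction_at_or_above(novel))
```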
Unpaired, one-sided t-tests revealed significant improvements in phylum-level clustering
for the GBS (P=0.03569), LNP (P=5.1E-06), SAK (P=2.2E-16), and TAM (P=2.2E-16)
metagenomes (Fig. 3a). Although only four of the ten analyzed metagenomes showed
significant improvement in anchoring at ≥97% identity, novel BH2 classifications across
the study had significantly higher percent identities (Welch two-sample t-test, P=2.2E-16)
than queries that kept the same target label in both BH1 and BH2 (Fig. 3b).
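The Welch two-sample test differs from the classical unpaired t-test in that it does not assume equal variances in the two groups. A minimal Python sketch of the statistic and its Welch-Satterthwaite degrees of freedom follows; the percent identities are made-up toy values, and the study itself used R for this test.

```python
import math
from statistics import mean, variance


def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    nx, ny = len(x), len(y)
    # Squared standard error contributed by each sample.
    vx, vy = variance(x) / nx, variance(y) / ny
    t_stat = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t_stat, df


# Hypothetical percent identities: novel BH2 hits vs. hits unchanged between runs.
novel_identity = [99.0, 98.0, 100.0, 99.0]
unchanged_identity = [90.0, 95.0, 85.0, 92.0, 88.0]
t_stat, df = welch_t(novel_identity, unchanged_identity)
print(round(t_stat, 3), round(df, 3))
```

Because the degrees of freedom are generally non-integer, the P value is read from a t distribution with fractional degrees of freedom, which statistical packages handle directly.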
Fig. 2 | Community Composition at the Phylum Level. Distribution of phylum-level classifications for BH1 hits against the nt database (left panel) and BH2 hits against the nt+SAGs database (right panel). Metagenomes whose BH2 hits show statistically significant changes in community composition are marked: P = 0.02627, 0.03152, and 0.01966 for LNP, SAK, and TAM, respectively.
Fig. 3 | True hit (≥97% identity) trends relative to BLAST type. 3a. Per-metagenome BH1:BH2 hit proportions with percent identities ≥97%. Significant improvements in classification above this threshold for BH2 data are marked (GBS, P=0.03569; LNP, P=5.1E-06; SAK, P=2.2E-16; and TAM, P=2.2E-16). 3b. Distribution of percent identities for all BLAST hits across all analyzed metagenomes. A significant increase in the number of hits with ≥97% query-target identity is generally associated with BH2 hits (P=2.2E-16). This increase is more pronounced when only novel hits are considered.
DISCUSSION
With the exception of a few inconsistencies, the results of this analysis illustrate robust
agreement with those of Rinke et al. In several instances we were able not only to duplicate their
results but also to improve upon them. We attribute these improvements to the exponential growth of
NCBI’s non-redundant reference databases. As of January 2015, more than 11,100 reference
genomes are publicly available in NCBI’s databases (http://ncbi.nlm.nih.gov); this is more than
twice the number of reference genomes that were available at the time of the parent study’s
publication in early 2013 and more than ten times the number available at the start of that study in
mid-2010⁵. Advances in high-throughput sequencing technologies have enabled swift and
reliable taxonomic identification from uncultured microbial samples. As per-sample costs have
dropped, the number of published metagenomes has risen, drastically improving our knowledge
of microbial diversity⁶. This improvement is particularly evident in our own read-anchoring
data (Fig. 1).
It is possible that some discrepancies between our data and those published by Rinke
et al. are attributable to differences in the choice of processing tools. For example, where Rinke et
al. used NCBI’s BLASTx algorithm to search metagenomic reads against the non-redundant
protein database (nr), we used BLASTn against the nt database. Both methods are valid for
extracting taxonomic information from raw reads; we elected to employ BLASTn over
BLASTx because of its faster processing time (BLASTx translates nucleotide queries as they are
submitted for analysis, whereas BLASTn directly searches nucleotide strings against
nt) and more concise output (BLASTx outputs protein-specific GI IDs, which were useful to
Rinke et al. in another part of their study that is not relevant to this investigation).
Nevertheless, we maintain that major statistical differences between our results and Rinke et al.’s
are most likely the result of the tremendous growth of NCBI reference databases. For instance,
discrepancies between the two studies’ top-recruiting phyla can be attributed to the expansion of
the number of unique representative genomes per phylum since Rinke et al.’s original analysis. Our
results reflect this improvement, and are supported by significantly increased read anchoring
(Fig. 1) and improved binning of true hits (Fig. 3).
Notwithstanding major improvements in reference genome databases, this study’s
replication of Rinke et al.’s methods supports the notion that appending MDM single-cell
genomes to these databases still significantly improves phylogenetic anchoring of
submitted queries. As such, we regard single-cell genomics as a viable next step in
elucidating rare taxa in microbial communities, because our results indicate that such
genomes are key to correctly inferring community composition.
WORKS CITED
1. Hugenholtz, P. & Kyrpides, N. C. A changing of the guard. Environ. Microbiol. 11, 551–553 (2009).
2. Rinke, C. et al. Insights into the phylogeny and coding potential of microbial dark matter. Nature 499, 431–437 (2013).
3. Huson, D. H., Mitra, S., Ruscheweyh, H.-J., Weber, N. & Schuster, S. C. Integrative analysis of environmental sequences using MEGAN4. Genome Res. 21, 1552–1560 (2011).
4. Blankenberg, D. et al. Galaxy: A web-based genome analysis tool for experimentalists. Current Protocols in Molecular Biology (2010). doi:10.1002/0471142727.mb1910s89
5. Lagesen, K., Ussery, D. W. & Wassenaar, T. M. Genome update: the 1000th genome – a cautionary tale. Microbiology 156, 603–608 (2010).
6. Ni, J., Yan, Q. & Yu, Y. How much metagenomic sequencing is enough to achieve a given goal? Sci. Rep. 3, 1968 (2013).
SUPPLEMENTARY MATERIAL
Metagenome   BH2-unique Phyla
BIO          N/A
BIS          Caldiserica*, Dictyoglomi, Elusimicrobia, Tenericutes
GBS          Gemmatimonadetes, Synergistetes*
LNP          Cloacimonetes*, Phaeophyceae, Xanthophyceae
MAN          Cloacimonetes*
MAR          N/A
PER          N/A
SAA          Gemmatimonadetes
SAK          Acetothermia*, Elusimicrobia, Fusobacteria, Synergistetes*
TAM          Aquificae, Chlamydiae, Deferribacteres, Dictyoglomi, Gemmatimonadetes, Nitrospirae, Synergistetes*
Table S1 | Unique phylum-level assignments in BH2. Seven of the ten analyzed metagenomes exhibit novel phylum-level hits when raw reads are searched against the nt+SAGs database. Phyla marked with * are overlapping top recruiters in the analysis by Rinke et al.