Rachel L. Harris 1

Insights into the phylogeny and coding potential of microbial dark matter: a replication of phylogenetic anchoring methods described by Rinke et al., 2013

Rachel Harris, David Zhao, Melany Ruiz Urigen, & Chuhan Zong QCB 455 – MOL 455 – COS 551, Fall 2014 | Instructor: Dr. Anastasia Baryshnikova

ABSTRACT

In their 2013 study, Rinke et al. challenge the efficacy of a gold standard reference genome database in accurately anchoring metagenomic reads. By appending 201 uncultivated archaeal and bacterial genomes representing largely uncharted taxa (so-called “microbial dark matter”) to this gold standard, Rinke et al. significantly improve phylogenetic anchoring of 475 metagenomes. In this study, we apply Rinke et al.’s methods to ten of their reported top-recruiting metagenomes. We not only replicate but also improve upon their results, concluding that microbial dark matter genomes are key to improving the phylogenetic anchoring performance of reference genome databases.

INTRODUCTION

A clear cultivation bias exists in microbial phylogenetics. As of 2010, half of all sequenced prokaryotic phyla lacked a cultivable representative, and 88% of all these known phyla were phylogenetically anchored as belonging to Proteobacteria, Firmicutes, Actinobacteria, or Bacteroidetes¹. Rinke et al. attempt to abolish these biases by testing whether the genomes of microbial dark matter (uncultivated prokaryotes representing poorly sequenced branches on the tree of life), when appended to an NCBI BLASTx reference database, improve phylogenetic anchoring for queried metagenomic reads². In this study we aimed to replicate Rinke et al.’s methodology using alternative tools, including BLASTn, the R software environment, and the Galaxy platform. Our results agreed with those of Rinke et al. and in several cases showed greater read anchoring.

MATERIALS AND METHODS

Ten metagenomes were selected for analysis from publicly available databases according to their relative anchoring performance (Rinke et al., Figure 4) and their representation of nine diverse habitats: Sakinaw Lake (SAK), TA Mother Reactor (TAM), GBS 85C sediment (GBS), Saanich Inlet pooled fosmids (SAA), GOS Mangrove on Isabella Island (MAN), Yellowstone Bison Hot Spring (BIS), Line P J08P26-500 (LNP), TA reactor biofilm (BIO), Peru Margin (PER), and Marine Sediments sample SCG71 (MAR). All 201 single amplified genome (SAG) assemblies were accessed from the Microbial Dark Matter project website (http://genome.jgi.doe.gov/MDM).

A random subset of 10,000 reads was extracted from each metagenome and subjected to two runs of NCBI’s Nucleotide BLAST (BLASTn). The first run BLASTed each metagenome against NCBI’s non-redundant nucleotide (nt) database, whereas the second run BLASTed each metagenome against a modified database composed of nt and Rinke et al.’s 201 SAG assemblies. The resulting BLASTn hits against the nt and nt+SAGs databases are hereafter referred to as BLAST Hits 1 (BH1) and BLAST Hits 2 (BH2), respectively. Target labels of these hits were identified by either NCBI’s GI sequence identification markers (GI IDs) or one of Rinke et al.’s SAG IDs. A third BLASTn run was subsequently performed on all SAG assemblies in order to exchange any SAG target labels identified in BH2 for their respective GI IDs. Queries originally assigned to SAG targets that had no corresponding GI ID were considered false positives, and these entries were removed from the BH2 analysis. Duplicate queries with the same target label were also removed from both BH1 and BH2 to ensure the validity of the community composition and subsequent statistical analyses.
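The hit-cleaning steps described above can be sketched in Python as follows. This is a minimal illustration rather than the pipeline used in this study: the `Hit` record, the `clean_hits` function, and the `sag_to_gi` lookup (standing in for the result of the third BLASTn run) are hypothetical names introduced here, and the input is assumed to be already parsed from BLAST output.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str       # read identifier (hypothetical field name)
    target: str      # GI ID or SAG ID of the matched sequence
    identity: float  # percent identity reported by BLASTn


def clean_hits(hits, sag_to_gi=None):
    """Relabel SAG targets with GI IDs and drop duplicate query-target pairs.

    sag_to_gi maps a SAG ID to its GI ID, or to None when no GI ID was
    found. Targets missing from the map are assumed to be GI IDs already.
    """
    sag_to_gi = sag_to_gi or {}
    seen = set()
    cleaned = []
    for hit in hits:
        target = hit.target
        if target in sag_to_gi:
            target = sag_to_gi[target]
            if target is None:
                # SAG hit with no corresponding GI ID: treated as a false positive
                continue
        key = (hit.query, target)
        if key in seen:
            # duplicate query with the same target label: keep the first only
            continue
        seen.add(key)
        cleaned.append(Hit(hit.query, target, hit.identity))
    return cleaned
```

Calling `clean_hits(bh1_hits)` without a lookup only deduplicates, which corresponds to the BH1 case; passing the lookup corresponds to the BH2 case.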

Whereas Rinke et al. determined phylogenies of BLAST hits with the aid of MEGAN4 software³, we obtained taxonomic summaries by submitting GI ID targets from both BLAST hit databases of each metagenome to Princeton University’s Galaxy Project⁴ server (https://galaxy.princeton.edu). All statistical analyses were performed on BLAST hits at the phylum level in the R software environment (http://r-project.org) to determine whether BLASTing against nt+SAGs yielded significant improvements in read anchoring, phylogenetic binning, and percent identity distribution relative to BLASTing against nt alone.

RESULTS

Anchoring of Metagenomic Reads

Rinke et al. report >2% BH2 read anchoring at the phylum level for all ten metagenomes analyzed in this study. We confirm these findings for eight of the ten metagenomes and improve upon them, recovering greater read anchoring for six out of ten metagenomes (Fig. 1). Only the BIO and PER metagenomes yield <2% read anchoring, achieving 1.49% and 1.57%, respectively. Whereas the Rinke study reports SAK as demonstrating the greatest recovery of read hits (19.56%), our results show roughly three-fold greater anchoring for our highest-recruiting metagenome, BIS (60.22%). A paired Student’s t-test of all BLAST hits from our ten analyzed metagenomes reveals that significantly (P=0.00023) more reads were assigned at the phylum level for BH2 relative to BH1. This result is in concordance with that of Rinke et al. (P=0.00024), who performed the same analysis for their top 19 recruiting metagenomes (Fig. 1a).
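For readers unfamiliar with the paired test used above, the statistic can be sketched in Python using only the standard library. The function name `paired_t` is hypothetical, and the sketch returns only the t statistic and degrees of freedom; the P value would be read from a t distribution with those degrees of freedom, which in R is reported directly by `t.test(x, y, paired = TRUE)`.

```python
import math
from statistics import mean, stdev


def paired_t(before, after):
    """Return (t, df) for a paired Student's t-test on matched samples.

    before[i] and after[i] must describe the same metagenome, for example
    the number of phylum-level assignments in BH1 and in BH2.
    """
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two matched samples with at least 2 pairs")
    diffs = [a - b for b, a in zip(before, after)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    n = len(diffs)
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1
```

A positive t here means that the second sample (BH2 in the setting above) has the larger mean across pairs.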

Phylogenetic Binning

Figure 4 in Rinke et al. depicts the 23 most anchored phyla following classification of BH2 hits by MEGAN4. By contrast, phylogenetic binning conducted via Galaxy in this analysis reveals an overlap of only eight phyla as top recruiters in the surveyed metagenomes. However, all eight overlapping phyla between the two studies – Acetothermia, Caldiserica, Cloacimonetes, Marinimicrobia, Synergistetes, Euryarchaeota, Nanoarchaeota, and Thaumarchaeota – show improved phylogenetic anchoring from BH1 to BH2 in several metagenomes for which Rinke et al. noted no such improvement (Fig. 1a). Furthermore, our results indicate at least four additional phyla not mentioned in the parent study that demonstrate an average of ≥1% improvement in binning across all metagenomes – Aquificae, Firmicutes, Ignavibacteriae, and Proteobacteria (Fig. 1b).

Fig. 1 | Phylogenetic anchoring. 1a. Modified Figure 4 from the original Rinke et al. publication, depicting 19 top-recruiting metagenomes characterized by >2% phylum-level read anchoring. Highlighted metagenomes are those analyzed in this study. Top-recruiting phyla in Rinke et al.’s study are listed at the top, with marked phyla representing overlapping top recruiters in our own analysis. Black rectangles represent additional phylum-level recruits elucidated in our analysis that were not discovered by Rinke et al. 1b. Duplication of the Rinke heat map described in 1a, portraying all phyla demonstrating improved read anchoring from BH1 to BH2. Grey cells label phyla showing 0% anchoring improvement.

Beyond the analysis conducted by Rinke et al., we tested each metagenome individually for significant phylum-level differences in community composition. A paired Student’s t-test revealed a significant difference in classification at the phylum level for LNP (P=0.02627), SAK (P=0.03152), and TAM (P=0.01966) metagenomes (Fig. 2).

Percent Identity Distribution

We also determined whether BLASTing metagenomes against nt+SAGs improved phylogenetic anchoring at a threshold of 97% query-target shared nucleotide identity relative to BLASTing against nt alone. This was performed by removing all BLAST hits that shared the same query and target label across the BH1 and BH2 databases for each metagenome, leaving behind only novel classifications for consideration.
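The filtering step described above amounts to discarding query-target pairs that appear in both hit sets. A minimal Python sketch follows, assuming each hit set has already been reduced to a dictionary keyed by (query, target) with percent identity as the value; the names `remove_shared` and `fraction_at_or_above` are hypothetical and are not taken from the analysis scripts used in this study.

```python
def remove_shared(bh1, bh2):
    """Drop hits whose (query, target) pair occurs in both BH1 and BH2.

    bh1 and bh2 map (query, target) tuples to percent identity.
    Returns the two remainders: hits unique to BH1 and hits unique to BH2.
    """
    shared = bh1.keys() & bh2.keys()
    bh1_only = {pair: ident for pair, ident in bh1.items() if pair not in shared}
    bh2_only = {pair: ident for pair, ident in bh2.items() if pair not in shared}
    return bh1_only, bh2_only


def fraction_at_or_above(identities, threshold=97.0):
    """Fraction of hits whose percent identity meets the threshold."""
    identities = list(identities)
    if not identities:
        return 0.0
    return sum(1 for value in identities if value >= threshold) / len(identities)
```

Applying `fraction_at_or_above` to the values of each remainder gives the per-metagenome proportions of hits at or above 97% identity that can then be compared between BH1 and BH2.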

Unpaired, one-sided t-tests revealed significant improvements in phylum-level clustering for GBS (P=0.03569), LNP (P=5.1E-06), SAK (P=2.2E-16), and TAM (P=2.2E-16) metagenomes (Fig. 3a). Although only four of the ten analyzed metagenomes showed significant improvement in anchoring with ≥97% identity, all novel BH2 classifications in this study demonstrated significantly improved percent identities (Welch two-sample t-test, P=2.2E-16) relative to queries that maintained the same target label in both BH1 and BH2 databases (Fig. 3b).
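The Welch two-sample comparison mentioned above does not assume equal variances in the two groups. As a hedged illustration using only the Python standard library, the statistic and its Welch-Satterthwaite degrees of freedom can be computed as below; `welch_t` is a hypothetical name, and the P value itself would come from a t distribution with the returned degrees of freedom (R's `t.test` applies the Welch correction by default for two samples).

```python
import math
from statistics import mean, variance


def welch_t(x, y):
    """Return (t, df) for Welch's unequal-variance two-sample t-test.

    x and y are independent samples, for example percent identities of
    novel BH2 hits and of hits that kept the same target label.
    """
    nx, ny = len(x), len(y)
    if nx < 2 or ny < 2:
        raise ValueError("each sample needs at least 2 observations")
    vx = variance(x) / nx  # squared standard error of mean(x)
    vy = variance(y) / ny  # squared standard error of mean(y)
    if vx + vy == 0:
        raise ValueError("both samples are constant; t is undefined")
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, df
```

For a one-sided test of whether the first sample has the larger mean, only the upper tail of the t distribution is used.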

Fig. 2 | Community Composition at the Phylum Level. Distribution of phylum-level classifications for BH1 hits against the nt database (left panel) and BH2 hits against the nt+SAGs database (right panel). BH2 hits characterized by statistically significant changes in community composition are marked. P = 0.02627, 0.03152, and 0.01966 for LNP, SAK, and TAM, respectively.

Fig. 3 | True hit (CI≥97%) trends relative to BLAST type. 3a. Per metagenome BH1:BH2 hit proportions with percent identities ≥97%. Significant improvements in classification above this threshold for BH2 data are marked (GBS, P=0.03569; LNP, P=5.1E-06; SAK, P=2.2E-16; and TAM, P=2.2E-16). 3b. Distribution of percent identities for all BLAST hits across all analyzed metagenomes. A significant increase in the number of hits with ≥97% query-target identity is generally associated with BH2 hits (P=2.2E-16). This increase is clearly enhanced when only novel hits are considered.

DISCUSSION

With the exception of a few inconsistencies, the results of this analysis illustrate robust agreement with those of Rinke et al. In several instances we were able not only to duplicate their results but also to improve upon them. We attribute these improvements to the exponential growth of NCBI’s non-redundant reference databases. As of January 2015, more than 11,100 reference genomes are publicly available in NCBI’s databases (http://ncbi.nlm.nih.gov); this is more than twice the number of reference genomes that were available at the time of the parent study’s publication in early 2013 and more than ten times the number available at the start of the study in mid-2010⁵. Advancements in high-throughput sequencing technologies have enabled swift and reliable taxonomic identifications from uncultured microbial samples. As per-sample costs have dropped, the number of published metagenomes has risen, drastically improving our knowledge of microbial diversity⁶. This improvement is particularly relevant in our own data pertaining to read anchoring (Fig. 1).

It is possible that some discrepancies between our own data and those published by Rinke et al. may be attributed to differences in the choice of processing tools. For example, where Rinke et al. used NCBI’s BLASTx algorithm to BLAST metagenomic reads against the non-redundant protein database (nr), we used BLASTn against the nt database. Both methods are valid for elucidating taxonomic information from raw reads; we elected to employ BLASTn over BLASTx due to its faster processing time (BLASTx translates nucleotide queries before searching, whereas BLASTn searches nucleotide strings directly against nt) and more concise output (BLASTx outputs protein-specific GI IDs, which were useful to Rinke et al. in another part of their study that was irrelevant to this particular investigation). Nevertheless, we maintain that major statistical differences between our results and Rinke et al.’s are most likely the result of the tremendous growth of NCBI reference databases. For instance, discrepancies between the two studies’ top recruiting phyla can be attributed to the expansion of the number of unique representative genomes per phylum since Rinke et al.’s original analysis. Our results reflect this improvement, and are supported by significantly increased read anchoring (Fig. 1) and improved binning of true hits (Fig. 3).

Notwithstanding major improvements in reference genome databases, this study’s replication of Rinke et al.’s methods continues to support the notion that appending microbial dark matter (MDM) single cell genomes to these databases still results in significantly improved phylogenetic anchoring for submitted queries. As such, we regard single cell genomics as a viable next step in elucidating rare taxa in microbial communities, since single cell genomes significantly improved the inference of community composition in our analysis.

WORKS CITED

1. Hugenholtz, P. & Kyrpides, N. C. A changing of the guard. Environ. Microbiol. 11, 551–553 (2009).
2. Rinke, C. et al. Insights into the phylogeny and coding potential of microbial dark matter. Nature 499, 431–437 (2013).
3. Huson, D. H., Mitra, S., Ruscheweyh, H.-J., Weber, N. & Schuster, S. C. Integrative analysis of environmental sequences using MEGAN4. Genome Res. 21, 1552–1560 (2011).
4. Blankenberg, D. et al. Galaxy: A web-based genome analysis tool for experimentalists. Current Protocols in Molecular Biology (2010). doi:10.1002/0471142727.mb1910s89
5. Lagesen, K., Ussery, D. W. & Wassenaar, T. M. Genome update: the 1000th genome – a cautionary tale. Microbiology 156, 603–608 (2010).
6. Ni, J., Yan, Q. & Yu, Y. How much metagenomic sequencing is enough to achieve a given goal? Sci. Rep. 3, 1968 (2013).

SUPPLEMENTARY MATERIAL

Metagenome   BH2-unique Phyla
BIO          N/A
BIS          Caldiserica*, Dictyoglomi, Elusimicrobia, Tenericutes
GBS          Gemmatimonadetes, Synergistetes*
LNP          Cloacimonetes*, Phaeophyceae, Xanthophyceae
MAN          Cloacimonetes*
MAR          N/A
PER          N/A
SAA          Gemmatimonadetes
SAK          Acetothermia*, Elusimicrobia, Fusobacteria, Synergistetes*
TAM          Aquificae, Chlamydiae, Deferribacteres, Dictyoglomi, Gemmatimonadetes, Nitrospirae, Synergistetes*

Table S1 | Unique phylum-level assignments in BH2. Seven of ten analyzed metagenomes exhibit novel phylum-level hits when raw reads are BLASTed against the nt+SAGs database. Phyla distinguished by * represent overlapping top recruiters in the analysis by Rinke et al.
