
Page 1: Identification and Characterization of Accessory Genomes in ...homes.sice.indiana.edu/yye/lab/mypaper/MGE_BIBM_2014.pdfMingjie Wang, Haixu Tang and Yuzhen Ye School of Informatics

Identification and Characterization of Accessory Genomes in Bacterial Species Based on Genome Comparison and Metagenomic Recruitment

Mingjie Wang, Haixu Tang and Yuzhen Ye∗
School of Informatics and Computing
Indiana University
Bloomington, IN 47405, USA
∗Contact: [email protected]

Abstract—Accessory genomes in bacterial species carry important genetic elements that are frequently related to antibiotic resistance, virulence factors, and the biotransformation of xenobiotics. Facilitated by recent advances in sequencing technology, bacterial genomes and metagenomes are accumulating at an unprecedented pace, providing opportunities for studies of accessory genomes. Comparison of closely related genomes reveals potential (and static) accessory genomes, and metagenomic recruitment (i.e., mapping metagenomic sequences onto reference genomes) provides insights into the nature and the dynamics of the accessory portion of the genomes. Recent metagenomic recruitment approaches focus on the identification of ‘metagenomic islands’ (MIs), segments in reference genomes that are under-recruiting in metagenomic samples and therefore likely to be mobile genetic elements (MGEs) in accessory genomes. However, the discovery of MIs often relies on manual inspection of the read recruitment plots.

Here we introduce a method that integrates comparison of closely related genomes using the A-Bruijn graph, metagenomic recruitment, and recurrent analysis for the identification and characterization of accessory genomes. In addition to metagenomic islands (valleys), our method reveals ‘metagenomic peaks’ (MPs), segments in a reference genome that recruit disproportionately more metagenomic sequencing reads than the remainder of the reference genome, indicating an enrichment of those segments in specific environments. Our method facilitates automated detection and characterization of accessory genomes at a large scale, and leads to the observation that MGEs are largely specific to environments, as demonstrated in the discovery of MGEs related to Streptococcus mitis in human microbiomes. Our tools are available for download at http://omics.informatics.indiana.edu/mg/iMGE/.

I. INTRODUCTION

Bacterial genomes harbor great variation at the species level, which partly results from frequent genetic alterations within and between bacterial genomes. For a given species, the entire pool of genetic material is called the pan-genome, which can be divided into the ‘core genome’ containing genes shared by all strains, and the ‘accessory genome’ containing genes present in multiple strains and genes unique to single strains [1]. The proportion of genes in those two categories varies among species, and the size of the accessory genome determines the scale of the pan-genome; some bacterial species are ‘closed’ in the sense that the size of the pan-genome does not increase much as more strains are incorporated, while other species are ‘open’ as the size of the pan-genome is infinite in theory [2]. For bacterial species with an open pan-genome, there is a large pool of genes that are unique to single strains, which are termed the “unique accessory genome”. For instance, it has been shown that more than 20% of genes are strain-specific in Escherichia coli [3], and Pohl et al. found new strain-specific compositions of accessory genomic elements and a high portion (10-20%) of genes without Pseudomonas aeruginosa homologues in P. aeruginosa genomes, despite the many strains of this species that have been sequenced [4].

The unique accessory genome can be identified indirectly by first identifying the core genome through the construction of synteny blocks from multiple genomes, which is computationally equivalent to de novo repeat classification in a ‘virtual genome’ of concatenated individual genomic sequences [5]. Accordingly, synteny blocks have been detected in mammalian and plant genomes by using the A-Bruijn graph framework, which can detect inexact repeats [5], [6]. Minkin et al. argued that it is computationally challenging to adapt the algorithm for large numbers of bacterial genomes and developed Sibelia to generate synteny blocks by constructing de Bruijn graphs iteratively to capture multi-scale and multi-granular repeats [7]. Constructing a de Bruijn graph is faster than constructing an A-Bruijn graph because the former only captures perfect repeats, but the A-Bruijn graph can utilize inexact repeats that cannot be detected using a de Bruijn graph. Both the A-Bruijn graph approach and the iterative de Bruijn graph approach merge repeats and reveal synteny blocks as non-branching paths. The synteny blocks include both the core genome and accessory regions shared by two or more strains. The unique accessory genome for each strain can thus be identified by subtracting the synteny blocks from the original genomic sequence. We show here that the A-Bruijn graph algorithm can be used for comparison of bacterial genomes, and that it is rewarding to do so when the species to be analyzed is under-represented by sequenced strains/genomes.

A large fraction of the accessory segments are mobile genetic elements (MGEs), which are universally distributed in all bacterial genomes [8]. MGEs are capable of moving within or between bacterial genomes by means of excision and integration reactions that are independent of homologous recombination. In this sense, the bacterial genomes are dynamic and the bacterial space should be regarded as a connected but compartmentalized network [9]. The mobile elements tend to be strain-specific and may not be required for the survival of the bacterial cell, but they clearly confer adaptive phenotypes on the host cell, placing the host cell in an advantageous position to live in a specific environment [2], [10]. Activities of mobile elements (among others) cause the plasticity of bacterial genomes on an evolutionary time scale [11], and a recent study shows that MGEs drive recombination hotspots in the core genome of Staphylococcus aureus [12].

Fig. 1: The pipeline for identification and dissection of accessory genomes by graph construction and metagenomic recruitment analysis. (A) For demonstration purposes, only two closely related genomes are shown in this example. (B) Construction of the de Bruijn graph or A-Bruijn graph of the genomes, in which similar regions between genomes or repeats within genomes are collapsed while unique regions form branches in the graph. Strain-specific regions (i.e., the unique accessory genome) can be inferred from the graph. (C) Metagenomic recruitment by mapping community sequencing reads onto the accessory genome of each strain, revealing blocks in the accessory genome with different read coverage. Coverage peaks and valleys are detected for each genome independently. For this metagenomic sample, a coverage peak (edge 7) is observed in genome 2, and a valley (edge 3) is observed in genome 1. (D) Dissection of accessory segments into different categories by analyzing the coverage of each edge based on a Poisson model. Note that the dissection is conducted for each genome in each sample (S1, S2, S3), resulting in candidate MGEs for each genome. (E) Recurrent analysis of MGEs. A candidate MGE is validated if it is observed in multiple samples from the same environment. The accessory segments are dissected into three groups: active MGEs (metagenomic peaks; red), lost MGEs (metagenomic islands; blue) and truly accessory segments (yellow).

There are several types of computational methods available for the detection of MGEs. One straightforward method is to search against a database of known elements, but similarity-based methods are limited by the completeness of the database. Alternatively, MGEs can be identified by altered properties such as G+C content and codon usage, based on the assumption that MGEs have different nucleotide compositions from the host genome [2]. But this approach is restricted to the detection of recent horizontal transfers, and the assumption does not always hold. For example, not all prophages are atypical in nucleotide content [13]. Furthermore, most of these methods are only able to detect a certain type of mobile element due to the enormous space of MGEs.

As an application of metagenomic sequencing, metagenomic recruitment analysis has been shown to be useful in detecting MGEs by mapping community sequencing reads onto reference genomes. Coleman et al. aligned the Sargasso Sea metagenomic reads to 11 completely sequenced Prochlorococcus strains and observed ‘metagenomic islands’ (MIs) that hardly recruited any reads, while the majority of the genomes were covered consistently at a high rate [14]. Belda-Ferre et al. mined virulence genes from MIs by mapping metagenomic reads from the microbiomes of healthy individuals against pathogenic bacteria [15].

We propose to expand metagenomic recruitment analysis by considering not only metagenomic islands but also metagenomic peaks (MPs), and therefore to dissect the accessory regions detected by the graph-based approaches (A-Bruijn graph or iterative de Bruijn graph) into three groups: lost MGEs, active MGEs, and truly strain-specific accessory regions. Lost MGEs refer to the regions where metagenomic islands are found: these regions are present in the reference genomes, but are absent in the species living in the community characterized by metagenomic sequencing. Active MGEs are DNA segments shared by multiple genomes (the reference strain, unsequenced strain(s) of the same species or strain(s) of other species) in the environment, and they consequently have significantly higher sequencing coverage than the rest of the accessory regions. Thus, a portion of MGEs can be mined from accessory regions by searching for segments with extremely low (lost MGEs) or high (active MGEs) depth of coverage in recruitment analysis. The rest of the accessory regions can be considered truly unique to an individual genome in the environment: the coverage of those regions is not affected by other genomes and can be used to quantify the genome in that environment.

We applied our method to identify MGEs associated with Streptococcus mitis, a species that is most commonly found in all oral sites and human subjects [16], [17]. There are several hundred samples from the human oral microbiome in the HMP datasets [18], providing a great resource for studying the MGEs. Our method facilitates the automatic detection and characterization of accessory genomes on a scale that involves hundreds of datasets, and leads to the observation that MGEs are largely specific to environments, as demonstrated in the discovery of MGEs related to Streptococcus mitis in human microbiomes.

II. METHODS

Given a collection of genomes from a bacterial species, our method first identifies strain-specific segments for each reference genome through the construction of iterative de Bruijn graphs or A-Bruijn graphs and then classifies the segments based on their coverage profiles from metagenomic recruitment. Using a Poisson model, we are able to screen for over-recruiting (MPs) or under-recruiting (MIs) segments in each genome in each sample. When multiple metagenomic datasets exist for the same environment, metagenomic recruitment results for each genome can be combined for recurrent analysis of the predicted accessory genome (Figure 1).

A. Identification of unique accessory genome in closely related bacterial genomes

The unique accessory genome can be identified in multiple strains of the same bacterial species through the construction of iterative de Bruijn graphs or A-Bruijn graphs using the genomic sequences. In these graphs, the genomes can be viewed as permutations of edges, each representing either a synteny block (from repeats within single genomes or shared by two or more genomes) or a strain-specific genome segment. By definition, the ‘unique accessory genome’ should include both repeats within genomes and segments with multiplicity 1. To simplify the subsequent large-scale metagenomic recruitment analysis, here we refer to the ‘unique accessory genome’ as strain-specific DNA segments with multiplicity 1 (single copy unique regions). Sibelia was developed for the purpose of generating synteny blocks for closely related microbial genomes, from which the unique accessory genome needs to be identified by eliminating synteny blocks from the bacterial genomes. We adopted the ‘fine’ parameter set in Sibelia to generate synteny blocks that are longer than 100 bp. Single copy strain-specific regions were then obtained by subtraction and further filtered for segments of at least 300 bp.
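The subtraction step described above can be sketched as simple interval arithmetic. The following is a minimal illustration, not the authors' implementation: it assumes synteny blocks for one genome are already available as half-open (start, end) coordinates, and the function name `unique_segments` and the input layout are hypothetical.

```python
def unique_segments(genome_length, synteny_blocks, min_len=300):
    """Return half-open (start, end) intervals not covered by any synteny block.

    synteny_blocks: iterable of (start, end) pairs on one genome (assumed layout).
    min_len: minimum length of a reported segment (300 bp in the text).
    """
    segments = []
    cursor = 0  # first position not yet covered by a block
    for start, end in sorted(synteny_blocks):
        # A gap between the covered prefix and this block is strain-specific.
        if start > cursor and start - cursor >= min_len:
            segments.append((cursor, start))
        cursor = max(cursor, end)
    # Trailing region after the last block.
    if genome_length - cursor >= min_len:
        segments.append((cursor, genome_length))
    return segments


if __name__ == "__main__":
    blocks = [(0, 500), (450, 900), (1300, 1400), (1500, 2000)]
    print(unique_segments(2000, blocks))
```

Sorting the blocks and sweeping a single cursor handles overlapping blocks without merging them first; the 100 bp gap between positions 1400 and 1500 in the usage example is dropped by the 300 bp filter.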

The de Bruijn graph can only merge identical DNA segments of synteny blocks from the respective genomes (Sibelia uses the sequence modification algorithm to remove small bulges from the graph [7]), while the A-Bruijn graph approach [6] can collapse nearly identical sequences into a single edge [19]. BLASTN [20] is used to detect similar regions within each genome or between each pair of selected microbial genomes. The A-Bruijn graph can then be constructed by gluing together the aligned positions from alignments that are longer than a length threshold (e.g., 100 bp) and have similarity higher than a similarity threshold (e.g., 97%) [6], [19]. We implemented the A-Bruijn graph algorithm and directly detected single copy strain-specific regions by identifying unique edges with multiplicity 1 in each input genome. Compared with the iterative de Bruijn graph approach [7], the A-Bruijn graph approach increases the sensitivity of detecting synteny blocks for genomes that are not well conserved across sequenced genomes, by decreasing the similarity threshold in graph construction. Similarly, we keep unique edges in the A-Bruijn graph of at least 300 bp (accessory segments) for further analysis.
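As an illustration of the alignment-selection step that precedes gluing, the sketch below filters BLASTN hits by the two example thresholds given in the text (100 bp and 97%). It assumes the hits were written in BLAST's default tabular layout (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The function name `select_alignments` and the rule for skipping trivial self-hits are assumptions of this sketch, not details taken from the paper.

```python
def select_alignments(lines, min_len=100, min_identity=97.0):
    """Keep tabular BLASTN hits that are long and similar enough to be glued."""
    kept = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.rstrip("\n").split("\t")
        qseqid, sseqid = fields[0], fields[1]
        pident, length = float(fields[2]), int(fields[3])
        qstart, qend, sstart, send = (int(x) for x in fields[6:10])
        # Assumption: a sequence aligned to itself over identical coordinates
        # carries no repeat information, so it is ignored.
        if qseqid == sseqid and (qstart, qend) == (sstart, send):
            continue
        if length >= min_len and pident >= min_identity:
            kept.append((qseqid, qstart, qend, sseqid, sstart, send))
    return kept


if __name__ == "__main__":
    hits = [
        "g1\tg2\t98.50\t250\t3\t0\t1\t250\t101\t350\t1e-100\t400",
        "g1\tg2\t92.00\t500\t40\t0\t300\t799\t400\t899\t1e-90\t350",
    ]
    print(select_alignments(hits))
```

Lowering `min_identity` (for example to 80.0) corresponds to the more permissive gluing discussed in the text, which collapses more divergent regions into shared edges.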

B. Dissection of accessory genome by metagenomic recruitment

Metagenomic reads are mapped onto the unique accessory segments via BWA [21]. After read mapping, we calculate the average coverage of each accessory segment based on the number of reads aligned to the segment. It is generally assumed that the sequencing coverage follows a Poisson distribution in shotgun sequencing [22]. By adopting the Poisson model, we can detect outlier accessory segments with abnormally high or low coverage (see Algorithm 1). The median coverage of the segments from each reference genome is used to approximate the average coverage of that bacterial genome, as the median is a robust statistic. It is well known that bacterial species have various abundances in the same environment: some are very abundant (the dominating species) while others may be rare. Therefore we treat rare strains and relatively abundant strains differently in Algorithm 1. We propose to dissect the accessory genome of rare strains into two groups: active MGEs (MPs, shown as coverage peaks) and truly unique accessory regions, whereas for relatively abundant strains in the environment we can further look for the third group, lost MGEs (MIs, shown as coverage valleys).
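The coverage calculation and the median-based abundance estimate can be sketched as follows. The text only states that coverage is derived from the number of aligned reads; the formula used here (reads times read length divided by segment length) and the uniform 100 bp read length are assumptions of this sketch, and the function names are hypothetical.

```python
from statistics import median


def segment_coverage(read_count, read_length, segment_length):
    """Average depth of one segment, assuming reads of uniform length."""
    return read_count * read_length / segment_length


def genome_abundance(read_counts, segment_lengths, read_length=100):
    """Return per-segment coverages and their median for one genome.

    The median serves as a robust proxy for genome abundance: a few segments
    with extreme coverage (peaks or valleys) barely move it.
    """
    coverages = [
        segment_coverage(n, read_length, length)
        for n, length in zip(read_counts, segment_lengths)
    ]
    return coverages, median(coverages)


if __name__ == "__main__":
    covs, abundance = genome_abundance([30, 60, 0, 900], [300, 600, 500, 300])
    print(covs, abundance)
```

In the usage example one segment has no reads and one has 300-fold coverage, yet the median stays at 10, which is the behavior the text relies on when it calls the median a robust statistic.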

Algorithm 1 Detection of MGEs based on Poisson model

Require: coverage c_i for each accessory segment f_i (1 ≤ i ≤ n); probability cutoff t (significance level)

 1: m ← median of c_i
 2: if 1 ≥ m > 0 then /* rare strains */
 3:     for i from 1 to n do
 4:         if Pr(Pois(λ ← 1, X ← c_i)) ≤ t and c_i > 1 then
 5:             f_i is a MP
 6: else if m > 1 then
 7:     m′ ← int(m)
 8:     for i from 1 to n do
 9:         if Pr(Pois(λ ← m′, X ← c_i)) ≤ t and c_i > 1 then
10:             f_i is a MP
11:     if Pr(Pois(λ ← m′, X ← 0)) ≤ t then /* strains sufficiently abundant for MI detection */
12:         for i from 1 to n do
13:             if c_i = 0 then
14:                 f_i is a MI
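A runnable sketch of Algorithm 1 is given below under two stated assumptions. First, Pr(Pois(λ, X)) is read as the Poisson point probability P(X = c_i), with coverage rounded to the nearest integer. Second, a peak is required to exceed λ: this coincides with the listing's c_i > 1 for rare strains (λ = 1), but for abundant strains it is a choice made in this sketch so that low-coverage segments are not labeled as peaks. Neither reading is confirmed by the listing, and the function names are hypothetical.

```python
import math
from statistics import median


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam), evaluated in log space for stability."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def classify_segments(coverages, t=1e-5):
    """Label each segment 'MP', 'MI', or 'accessory' for one genome in one sample."""
    labels = ["accessory"] * len(coverages)
    if not coverages:
        return labels
    m = median(coverages)
    if m <= 0:
        return labels  # genome not detected in this sample
    lam = 1 if m <= 1 else int(m)  # rare strains use lambda = 1
    for i, c in enumerate(coverages):
        k = int(round(c))
        # Assumption of this sketch: peaks must lie above lambda.
        if k > lam and poisson_pmf(k, lam) <= t:
            labels[i] = "MP"
    # Islands are called only when zero coverage is itself improbable.
    if m > 1 and poisson_pmf(0, lam) <= t:
        for i, c in enumerate(coverages):
            if c == 0:
                labels[i] = "MI"
    return labels


if __name__ == "__main__":
    print(classify_segments([20, 21, 19, 22, 0, 60, 20]))
    print(classify_segments([0, 1, 1, 1, 12]))
```

With t = 10^-5, the island test P(X = 0) = exp(-λ) ≤ t is met only when the integer median coverage is at least 12, which is why the zero-coverage segment in the second (rare-strain) example is left as an ordinary accessory segment.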

The availability of many metagenomic datasets for the same environment (or similar environments) allows us to study the dynamics of the MGEs by metagenomic recruitment in multiple samples. In addition, we can apply recurrent analysis to the candidate MGEs using multiple samples to confirm MGEs: MGEs identified recurrently are more likely to be real. The frequency with which an MGE is detected in the samples is termed the “recurrent rate”. As MIs (metagenomic islands) can only be detected in abundant species, we only consider the samples in which the reference genome is abundant when calculating the recurrent rate of MIs. The recurrent analysis renders our method robust to noise. Yet we note that the multiple samples from the same body site in the HMP datasets are not exact experimental replicates because they are not from the same individual. Also, different MGEs might have varied rates of insertion and loss in a ‘population’ of individuals, so we should be careful when setting the threshold for the recurrent rate. The recurrent rate actually reflects the dynamics of MGE transfer among genomes in the microbial community, so a very stringent cutoff is not needed for the purpose of ruling out mapping noise.
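The recurrent rate can be sketched as a per-segment frequency with different denominators for MPs and MIs, following the rule stated above that only samples in which the reference genome is abundant count toward MIs. The data layout (a dictionary of per-sample labels), the function names, and the label strings are hypothetical; the 0.1 threshold is the one reported later in the results.

```python
def recurrent_rates(labels_per_sample, abundant_samples):
    """Compute recurrent rates of MPs and MIs for one genome at one site.

    labels_per_sample: {sample_id: {segment_id: 'MP' | 'MI' | 'accessory'}}
    abundant_samples: sample ids in which the genome is abundant enough for
                      MI detection (assumed to be determined beforehand).
    """
    n_all = len(labels_per_sample)
    n_abundant = len(abundant_samples)
    mp_counts, mi_counts = {}, {}
    for sample, labels in labels_per_sample.items():
        for segment, label in labels.items():
            if label == "MP":
                mp_counts[segment] = mp_counts.get(segment, 0) + 1
            elif label == "MI" and sample in abundant_samples:
                mi_counts[segment] = mi_counts.get(segment, 0) + 1
    # MPs are normalized by all samples, MIs only by the abundant ones.
    mp_rate = {s: c / n_all for s, c in mp_counts.items()}
    mi_rate = {s: c / n_abundant for s, c in mi_counts.items()} if n_abundant else {}
    return mp_rate, mi_rate


def validated(rates, threshold=0.1):
    """Segments whose recurrent rate exceeds the threshold."""
    return {segment for segment, rate in rates.items() if rate > threshold}


if __name__ == "__main__":
    calls = {
        "s1": {"a": "MP", "b": "MI"},
        "s2": {"a": "accessory", "b": "MI"},
        "s3": {"a": "MP", "b": "accessory"},
        "s4": {"a": "accessory", "b": "accessory"},
    }
    print(recurrent_rates(calls, {"s1", "s2"}))
```

Because the two rates use different denominators, the same segment can pass the threshold as an MP and as an MI at one site, which is how the overlaps reported in Table I can arise.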

We used myRAST (downloaded from http://blog.theseed.org/downloads/myRAST-Intel.dmg) to predict genes in identified MGEs and predicted their functions according to the SEED functional category [23].

C. Datasets

The 13 Streptococcus mitis genomes (the B6 genome is complete and the others are draft genomes) were downloaded from the NCBI ftp site (ftp.ncbi.nlm.nih.gov). We used 386 oral microbiomes from the Human Microbiome Project (HMP) datasets [18], [24] downloaded from the HMP DACC website (http://www.hmpdacc.org/). These samples cover three oral subsites: buccal mucosa (122 samples), supragingival plaque (128 samples), and tongue dorsum (136 samples).

III. RESULTS

We tested the performance of the iterative de Bruijn graph approach and the A-Bruijn graph approach for the identification of the unique accessory genome in Streptococcus mitis by comparing 13 sequenced strains of this species. The MUMi distance [25] (on a 0 to 1 scale) was used to quantify the genomic distances between the strains, based on which the neighbor-joining tree was constructed (Figure 2). From the phylogenetic tree, we can see that S. mitis ATCC 6249 and S. mitis SPAR10 are relatively distant from the other strains. No two strains are highly similar (within a MUMi distance of 0.05), so we included all 13 genomes in our study. Below we first report the identification of the accessory genome of S. mitis using the two graph-based approaches. We then show the characterization of the accessory genome based on metagenomic recruitment.

[Figure 2 tree leaves: S. mitis NCTC 12261, SK1073, SK321, B6, SK1080, SK597, SK564, SK575, SK579, SK569, SK616, ATCC 6249, SPAR10; scale bar: 0.1]

Fig. 2: Neighbor-joining tree for 13 Streptococcus mitis genomes based on the MUMi distance. The genome of S. mitis B6 is complete and the other 12 genomes are draft genomes.

A. Comparison of two graph-based approaches for accessory genome detection

First we screened for accessory regions in each of the S. mitis genomes using Sibelia (the iterative de Bruijn graph approach) [7]. As S. mitis ATCC 6249 and S. mitis SPAR10 are substantially distant from the other genomes (see Figure 2), a large proportion of their genomic sequences (64% and 81%, respectively) was identified as strain-specific in these two strains. Apart from those two strains, the other strains had 3%–13% unique accessory genome, which revealed a relatively ‘open’ pan-genome for S. mitis. Since S. mitis ATCC 6249 and S. mitis SPAR10 were not well represented by the available S. mitis genomes, it is very likely that a significant proportion of the large number of ‘accessory’ segments identified from these two genomes are not real accessory regions, as revealed by our A-Bruijn graph based approach.

[Figure 3 plot: x-axis categories IDB, AB(95%), AB(90%), AB(85%), AB(80%), AB(75%); y-axis from 0 to 1; one series per S. mitis strain]

Fig. 3: Comparison of the two graph-based approaches in the detection of the unique accessory genome for 13 S. mitis genomes. IDB stands for the iterative de Bruijn graph approach (Sibelia) and AB stands for the A-Bruijn graph approach (our approach). A-Bruijn graphs were constructed using different similarity thresholds (95%, 90%, 85%, 80%, and 75% sequence identity, respectively). The y-axis values represent the proportion of accessory genome identified for individual strains.

To compare the performance of the A-Bruijn graph approach and the iterative de Bruijn graph approach for the detection of the accessory genome, we used comparable parameters in the construction of A-Bruijn graphs: we glued alignments of at least 100 bp in length and filtered for strain-specific edges of at least 300 bp. While the de Bruijn graph approach can only merge identical segments, the A-Bruijn graph approach has an extra parameter (the similarity threshold) that determines the degree of compression of the genomes: the lower the cutoff value is, the more compact the graph will be, because relatively divergent repeats are glued together. This allows us to reduce false accessory segments in S. mitis ATCC 6249 and S. mitis SPAR10. As shown in Figure 3, A-Bruijn graphs constructed at different similarity parameters (95%, 90%, 85%, 80%, and 75%) resulted in different proportions of accessory genomes in these two genomes (at 90% sequence identity, the proportion of accessory genome in each genome was nearly identical to that detected by the iterative de Bruijn graph approach). By decreasing the similarity cutoff, the proportion of unique accessory genome shrank to 19% for S. mitis ATCC 6249 and 20% for S. mitis SPAR10 (Figure 3), reaching a plateau at ∼80% sequence identity (the plateau for the other S. mitis genomes arrives earlier, at ∼90% sequence identity). Based on this result, we chose 80% as the similarity cutoff in our A-Bruijn graph approach to identify strain-specific segments from each genome, and we collected the segments of at least 300 bp (accessory segments) for further characterization.


Fig. 4: Circos [26] plot for the unique accessory genome and MGEs identified in S. mitis ATCC 6249. The three highlighted circles from outside in represent the accessory segments detected by the iterative de Bruijn graph approach (purple), the A-Bruijn graph approach at 90% sequence identity (green), and the A-Bruijn graph approach at 80% sequence identity (green), respectively. The three histogram circles from outside in represent MGEs detected in buccal mucosa, supragingival plaque, and tongue dorsum, respectively: MPs are colored in red while MIs are colored in blue. The height of a rectangle in the histogram is proportional to the recurrent rate of the corresponding MGE.

TABLE I: Summary of MGEs (recurrent rate > 0.1) identified in human oral microbiomes for S. mitis

Genome      # Seg¹ | Buccal mucosa (122 samples)          | Supragingival plaque (128 samples)   | Tongue dorsum (136 samples)
                   |     # AS²    # MPs³    # MIs⁴   # OLPs⁵ |      # AS     # MPs     # MIs    # OLPs |      # AS     # MPs     # MIs    # OLPs
ATCC 6249      257 |         3       87       78        8 |        45      164      105       21 |         0       10        0        0
B6              96 |         0       32        0        0 |         0       13        0        0 |         0        7        0        0
NCTC 12261      46 |        13       33       21       11 |         1        7        8        0 |         0        5        0        0
SK1073          77 |         0       37        0        0 |         0        8        0        0 |         0       16        0        0
SK1080          95 |         0       16        0        0 |         0        1        0        0 |         0        3        0        0
SK321           66 |         5       24       33       10 |         0       11        0        0 |         0       13        0        0
SK564           81 |         0       32        0        0 |         0        2        0        0 |         0        5        0        0
SK569           21 |        13       12       13        4 |         1        3        7        0 |         0        5        0        0
SK575           71 |         2       34       27        7 |         0        3        0        0 |         0        7        0        0
SK579           49 |         5       24       30        7 |         0        3        0        0 |         0        5        0        0
SK597           74 |         0       24        0        0 |         0        6        0        0 |         0        6        0        0
SK616           28 |         0       12        0        0 |         1        1        8        0 |         0        4        0        0
SPAR10         316 |         3       13       84        2 |         1       20       12        1 |       124      238       37        2

¹ Seg: total number of accessory segments identified by the A-Bruijn graph approach using the 80% similarity cutoff in each genome.
² AS: samples in which the strain is significantly abundant (see Algorithm 1).
³ MPs: set of MPs identified from > 10% of the samples from the oral subsite.
⁴ MIs: set of MIs identified from > 10% of the AS samples.
⁵ OLPs: MGEs identified as MPs in some samples while as MIs in other samples (Set(MPs) ∩ Set(MIs)).


B. Mining of MGEs from unique accessory genome based on metagenomic recruitment

In order to predict which of the accessory segments are MGEs in the S. mitis genomes in human oral microbiomes, we mapped short reads of 386 samples from the human mouth (buccal mucosa, supragingival plaque and tongue dorsum) onto the accessory segments and identified outliers with significant over-recruitment (MPs) or under-recruitment (MIs). Compared to the rest of the accessory segments, MPs and MIs (of abnormal coverage) are more likely to be real MGEs.

We show that the median coverage of unique segments gives a good approximation of the abundance of the genome in the community, as demonstrated using an HMP sample (SRS015985) from buccal mucosa. The median coverages of the 13 Streptococcus mitis genomes in the sample are strongly correlated with the relative abundances of the genomes estimated by our de Bruijn graph approach for quantification of closely related genomes in a microbiome, which utilizes read mapping to both common and unique regions of the genomes [19] (Figure 5). Our de Bruijn graph based approach [19] avoids redundant mapping of metagenomic reads to multiple copies of the common regions in different genomes (common regions are collapsed), and provides more accurate abundance estimates for closely related species (by contrast, naive mapping of metagenomic reads onto individual genomes will result in artificially high coverage at regions that are common to multiple genomes). The abundance quantification approach used in this work further simplifies the mapping step (as we only need to map metagenomic sequences onto unique segments), but still gives comparable performance for abundance quantification.

Fig. 5: Scatter plot of the estimated abundances of the different S. mitis strains in an HMP dataset (SRS015985). The x-axis represents the relative abundances estimated by our previous de Bruijn graph approach for quantification of closely related genomes in a microbial community [19]. The y-axis represents the median coverage of accessory segments for individual genomes, which is used in this work to approximate the abundance of the genomes. R² = 0.9027.
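The abundance proxy and the agreement measure can be sketched in a few lines of Python. This is a minimal illustration on toy numbers (not data from the paper); the function names are hypothetical, and the R² computed here is the squared Pearson correlation, which equals the R² of an ordinary least-squares line with an intercept (an assumption about how the value in Figure 5 was obtained).

```python
from statistics import mean, median


def genome_abundance(segment_coverages):
    """Median coverage over a genome's unique accessory segments."""
    return median(segment_coverages)


def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


# Toy coverages (not data from the paper): one list of segment coverages per genome.
toy = {"g1": [10, 12, 11, 300], "g2": [2, 2, 3], "g3": [0, 0, 1, 0]}
abundances = {g: genome_abundance(cov) for g, cov in toy.items()}
```

In the toy genome g1, the single over-recruiting segment (coverage 300) does not move the median, which is why a median is a natural choice when some segments are expected to be peaks or islands.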

In each sample, we screened for MGEs using Algorithm 1 with the probability cutoff set to 10⁻⁵. As there were over 100 samples from each site, we were able to filter the MGEs by applying a recurrent rate threshold of 0.1, which means we only keep MGEs identified in > 10% of the samples. We note that MPs were identified in all of the samples while MIs were detected only in genomes with relatively high coverage. Thus the recurrent rates of MPs and MIs are based on different sets of samples (all samples for MPs, but only the samples in which the strain is significantly abundant for MIs; see Table I). Each of the strains was significantly abundant in only a few samples, except that the SPAR10 strain was abundant in 124 out of 136 tongue dorsum samples and the ATCC 6249 strain was abundant in 45 of 128 supragingival plaque samples (Table I). Moreover, SPAR10 was the only abundant strain in the tongue dorsum samples, while the other strains were barely sequenced at this oral site (Table I), resulting in a smaller number of MGEs identified in tongue dorsum samples than in the other two subsites through metagenomic recruitment. This result shows a clear strain-level compositional difference among oral subsites.
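The recurrence filter can be sketched as follows, assuming the per-sample MP/MI calls are already available. The data layout and the function name are hypothetical; the denominators follow the footnotes of Table I (all samples of the subsite for MPs, only the samples in which the strain is significantly abundant for MIs), and the overlap set corresponds to Set(MPs) ∩ Set(MIs).

```python
from collections import Counter


def recurrent_mges(calls, abundant_samples, min_rate=0.1):
    """Filter per-sample calls for one strain at one body site.

    calls: {sample_id: {segment_id: 'MP' or 'MI'}}
    abundant_samples: set of sample ids in which the strain is abundant (AS).
    Returns (MPs, MIs, OLPs) as sets of segment ids.
    """
    n_all = len(calls)
    n_as = len(abundant_samples)
    mp_counts, mi_counts = Counter(), Counter()
    for sample, labels in calls.items():
        for seg, label in labels.items():
            if label == "MP":
                mp_counts[seg] += 1
            elif label == "MI" and sample in abundant_samples:
                mi_counts[seg] += 1
    # Keep a segment only if it recurs in more than min_rate of the relevant samples.
    mps = {s for s, c in mp_counts.items() if c > min_rate * n_all}
    mis = {s for s, c in mi_counts.items() if n_as and c > min_rate * n_as}
    return mps, mis, mps & mis
```

With no abundant samples the MI set is empty by construction, which matches the rows of Table I where a strain has zero AS samples and zero MIs.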

Fig. 6: Overlaps of the metagenomic peaks identified from different body sites for S. mitis ATCC 6249. The MGEs were validated using a recurrent rate of 0.1. B (in blue), S (in green) and T (in yellow) represent buccal mucosa, supragingival plaque and tongue dorsum, respectively.

The results indicate a dynamic picture of MGEs in different body sites and in different samples from the same body site. First, MGEs vary from site to site (i.e., different sites have different sets of MGEs; see Table I, Figure 4). For example, although we mined the largest number of MPs from supragingival plaque for strain ATCC 6249, there were 12 and 4 unique MPs found in buccal mucosa and tongue dorsum, respectively (see Figure 6). Second, the existence of an MGE in one sample does not guarantee its activity in other samples from the same body site (Figure 7); some MGEs were detected as MPs in some samples but as MIs in others (Table I). For example, an accessory segment of 8,367 bp was detected as an MP in 30 out of 128 (23.4%) supragingival plaque samples and as an MI in 14 out of the 45 (31.1%) samples in which the source genome of the accessory segment was abundant. Annotation of this segment revealed a phage integrase and a DNA segregation ATPase FtsK/SpoIIIE, providing evidence that it is from a prophage region. In the datasets in which this MGE was identified as an MP (metagenomic peak), it is likely that there was a burst of the phage, or that the same phage had integrated into other genomes (which would also contribute to the disproportionately large number of reads mapped to this MGE). In the datasets with few reads mapped (shown as MIs), it is likely that the prophage is depleted from its host genome. Additionally, the percentage of the accessory genome made up of MGEs differed among samples; for example, the total length of MGEs (active or lost) in S. mitis ATCC 6249 ranges from 0% to 85% of the unique accessory genome in supragingival plaque samples.
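Two of the quantities discussed above reduce to simple set and length arithmetic. The Python sketch below, with hypothetical function names and toy inputs, computes the MPs found at one body site only (the non-overlapping regions of a diagram such as Figure 6) and the fraction of the unique accessory genome covered by MGEs in one sample.

```python
def site_specific(mp_sets):
    """mp_sets: {site: set of segment ids}. Return ids found at that site only."""
    unique = {}
    for site, ids in mp_sets.items():
        # Union of the MPs seen at every other site.
        others = set().union(*(v for k, v in mp_sets.items() if k != site))
        unique[site] = ids - others
    return unique


def mge_fraction(segment_lengths, mge_ids):
    """Fraction of total accessory-segment length covered by the given MGEs."""
    total = sum(segment_lengths.values())
    if total == 0:
        return 0.0
    return sum(segment_lengths[s] for s in mge_ids) / total
```

For example, `mge_fraction({"a": 100, "b": 300}, {"b"})` gives 0.75; applied per sample, this is the kind of value that is reported above as ranging from 0% to 85%.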


Fig. 7: Heatmap illustration of the dynamics of MGEs across samples. Rows represent accessory segments identified in S. mitis ATCC 6249, and columns represent different supragingival plaque samples. MPs and MIs identified from individual samples are colored in red and blue, respectively. The two green horizontal lines divide the segments into three groups based on visual inspection of their profiles across samples: segments in the top group are consistently found as MPs, segments in the bottom group consistently appear as MIs, and segments in the middle group show the highest variation.

C. Annotation of predicted MGEs

Predicted MGEs were annotated according to the SEED functional categories [23]. As expected, a large portion of the genes were related to virulence subsystems, including restriction-modification systems and phage proteins and packaging. We found two accessory segments that contain type II CRISPR–Cas bacterial immune systems [27], as genes encoding Cas1, Cas2, and Cas9 proteins were predicted in these segments (the cas9 gene is the signature gene of type II CRISPR–Cas systems) [28]. One accessory segment (of 9,879 bp) was identified from S. mitis ATCC 6249, and the other (of 7,717 bp) was identified from SK321. These two segments, although both contain a type II CRISPR–Cas system, are not very similar to each other (the two Cas9 proteins predicted from these two segments share only about 30% sequence identity, suggesting the involvement of HGT of the bacterial immune systems in S. mitis), and were therefore identified as two unique edges in the A-Bruijn graph (the other 11 S. mitis genomes do not have this system). More interestingly, both segments were detected as MPs in the oral microbiomes, which indicates that they are shared by the reference strain, accessory genomes of unsequenced S. mitis strains, or even phylogenetically distant genomes in the bacterial communities.

IV. CONCLUSION

We developed a pipeline that combines comparison of available genomes and mapping of metagenomic sequences for studying accessory genomes. This pipeline facilitates automated detection and characterization of accessory genomes on a large scale, and enables the study of the dynamics of the accessory genomes in different microbial communities.

In principle, to study the pan-genome or core genome of a species, we want to incorporate as many genomes as possible from different isolates of the species of interest. In practice, however, including many highly similar genomes of the same species makes the identification of accessory regions difficult. A simple strategy to alleviate the problem is to select representative genomes for the species of interest. For example, we may infer the distance (on a scale of 0 to 1) between genomic sequences using the maximal unique matches index (MUMi) [19], [25] and regard two genomes as highly similar to each other if their distance is less than a cutoff (default 0.05). Only one representative genome from each group of extremely similar genomes is then incorporated for comparison. We can also try a different strategy that is more closely tied to phylogenetic distances: different phylogenetic distance cutoffs can be used to reveal ‘accessory genomes’ at different phylogenetic distances.
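One way to realize the representative-selection strategy is a greedy pass over the genomes, sketched below in Python. This is an assumption about how the grouping could be done, not the procedure used in this work: pairwise distances (e.g., MUMi values) are taken as precomputed input, the function name and data layout are hypothetical, and the result depends on the order in which genomes are visited.

```python
def pick_representatives(genomes, dist, cutoff=0.05):
    """Greedy selection of representative genomes.

    genomes: list of genome ids, visited in order.
    dist: {frozenset((a, b)): distance in [0, 1]} for every pair of genomes.
    A genome is kept only if it is at least `cutoff` away from every
    representative chosen so far; otherwise it is treated as highly similar
    to an existing representative and skipped.
    """
    reps = []
    for g in genomes:
        if all(dist[frozenset((g, r))] >= cutoff for r in reps):
            reps.append(g)
    return reps
```

Visiting the genomes in a meaningful order (for instance, most complete assembly first) would decide which member of each group of near-identical genomes is retained.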

Genome comparison using the iterative construction of de Bruijn graphs runs faster than the A-Bruijn-graph-based approach when the pan-genome is well sampled (and both result in similar accessory genome identifications). However, for species that are not well represented by sequenced genomes, construction of the A-Bruijn graph is more advantageous than the de-Bruijn-graph-based approach, as shown in the Results (Figure 3). We note that we only compared our comparison tool to Sibelia, considering that 1) Sibelia has shown superior performance compared to other existing approaches (including Mugsy [29], Multiz [30], and Mauve [31]) on comparison of bacterial genomes [7], and 2) the focus of our work is to integrate genome comparison with metagenomic recruitment to study the dynamics of MGEs.

As a showcase, we used a stringent probability cutoff (10⁻⁵) in Algorithm 1 for the identification of metagenomic islands. As a result, we detected only a limited number of MIs, from the few samples in which the reference strain was significantly abundant. In practical use of our approach, one can relax this cutoff in a preliminary screen for MGEs. Also, we chose 80% as the similarity cutoff when we constructed the A-Bruijn graph for the S. mitis genomes. A different cutoff may work better for another species, depending on its divergence rate and on how well it is represented by sequenced genomes. A curve of the size of accessory genomes versus the similarity cutoff (e.g., Figure 3) will be useful for determining a proper cutoff. Furthermore, species other than S. mitis are likely to show distinct dynamics of MGEs in microbial communities.

ACKNOWLEDGMENT

This work was supported by the National Science Foundation (grant numbers DBI-0845685 and DBI-1262588). We thank John McCurley for reading the manuscript.

REFERENCES

[1] D. Medini, C. Donati, H. Tettelin, V. Masignani, and R. Rappuoli, “The microbial pan-genome.” Curr Opin Genet Dev, vol. 15, pp. 589–594, Dec 2005.

[2] A. Mira, A. B. Martin-Cuadrado, G. D. Auria, and F. Rodriguez-Valera, “The bacterial pan-genome: a new paradigm in microbiology.” Int Microbiol, vol. 13, pp. 45–57, Jun 2010.


[3] S. Fukiya, H. Mizoguchi, T. Tobe, and H. Mori, “Extensive genomic diversity in pathogenic Escherichia coli and Shigella strains revealed by comparative genomic hybridization microarray.” J Bacteriol, vol. 186, pp. 3911–3921, Jun 2004.

[4] S. Pohl, J. Klockgether, D. Eckweiler, A. Khaledi, M. Schniederjans, P. Chouvarine, B. Tummler, and S. Haussler, “The extensive set of accessory Pseudomonas aeruginosa genomic components,” FEMS Microbiol. Lett., Apr 2014.

[5] Q. Peng, M. A. Alekseyev, G. Tesler, and P. A. Pevzner, “Decoding synteny blocks and large-scale duplications in mammalian and plant genomes.” WABI 2009, pp. 220–232, 2009.

[6] P. A. Pevzner, H. Tang, and G. Tesler, “De novo repeat classification and fragment assembly.” Genome Res, vol. 14, pp. 1786–1796, Sep 2004.

[7] I. Minkin, A. Patel, M. Kolmogorov, N. Vyahhi, and S. Pham, “Sibelia: a scalable and comprehensive synteny block generation tool for closely related microbial genomes.” WABI 2013, pp. 215–229, 2013.

[8] L. S. Frost, R. Leplae, A. O. Summers, and A. Toussaint, “Mobile genetic elements: the agents of open source evolution.” Nat Rev Microbiol, vol. 3, pp. 722–732, Sep 2005.

[9] E. V. Koonin and Y. I. Wolf, “Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world.” Nucleic Acids Res, vol. 36, pp. 6688–6719, Dec 2008.

[10] T. D. Read and D. W. Ussery, “Opening the pan-genomics box.” Curr Opin Microbiol, vol. 9, pp. 496–498, Sep 2006.

[11] E. Darmon and D. R. Leach, “Bacterial genome instability,” Microbiol. Mol. Biol. Rev., vol. 78, no. 1, pp. 1–39, Mar 2014.

[12] R. G. Everitt, X. Didelot, E. M. Batty, R. R. Miller, K. Knox, B. C. Young, R. Bowden, A. Auton, A. Votintseva, H. Larner-Svensson, J. Charlesworth, T. Golubchik, C. L. Ip, H. Godwin, R. Fung, T. E. Peto, A. S. Walker, D. W. Crook, and D. J. Wilson, “Mobile elements drive recombination hotspots in the core genome of Staphylococcus aureus,” Nat Commun, vol. 5, p. 3956, 2014.

[13] K. E. Nelson, C. Weinel, C. M. Fraser, et al., “Complete genome sequence and comparative analysis of the metabolically versatile Pseudomonas putida KT2440.” Environ Microbiol, vol. 4, pp. 799–808, Dec 2002.

[14] M. L. Coleman, M. B. Sullivan, A. C. Martiny, C. Steglich, K. Barry, E. F. DeLong, and S. W. Chisholm, “Genomic islands and ecology and evolution of Prochlorococcus.” Science, vol. 311, pp. 1768–1770, Mar 2006.

[15] P. Belda-Ferre, R. Cabrera-Rubio, and A. Mira, “Mining virulence genes using metagenomics.” PLoS One, vol. 6, p. e24975, Oct 2011.

[16] J. A. Aas, B. J. Paster, L. N. Stokes, I. Olsen, and F. E. Dewhirst, “Defining the normal bacterial flora of the oral cavity.” J Clin Microbiol, vol. 43, pp. 5721–5732, Nov 2005.

[17] F. E. Dewhirst, T. Chen, W. G. Wade, et al., “The human oral microbiome.” J Bacteriol, vol. 192, pp. 5002–5017, Oct 2010.

[18] C. Huttenhower, D. Gevers, R. Knight, S. Abubucker, J. H. Badger, A. T. Chinwalla, H. H. Creasy, A. M. Earl, M. G. FitzGerald, R. S. Fulton, et al., “Structure, function and diversity of the healthy human microbiome,” Nature, vol. 486, no. 7402, pp. 207–214, Jun 2012.

[19] M. Wang, Y. Ye, and H. Tang, “A de Bruijn graph approach to the quantification of closely-related genomes in a microbial community.” J Comput Biol, vol. 19, pp. 814–825, Jun 2012.

[20] S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman, “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.” Nucleic Acids Res, vol. 25, pp. 3389–3402, Sep 1997.

[21] H. Li and R. Durbin, “Fast and accurate short read alignment with Burrows-Wheeler transform.” Bioinformatics, vol. 25, pp. 1754–1760, Jul 2009.

[22] E. S. Lander and M. S. Waterman, “Genomic mapping by fingerprinting random clones: a mathematical analysis.” Genomics, vol. 2, pp. 231–239, Apr 1988.

[23] R. Overbeek, R. Olson, G. D. Pusch, G. J. Olsen, J. J. Davis, T. Disz, R. A. Edwards, S. Gerdes, B. Parrello, M. Shukla, V. Vonstein, A. R. Wattam, F. Xia, and R. Stevens, “The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST).” Nucleic Acids Res, vol. 42, pp. D206–D214, Jan 2014.

[24] J. Peterson, S. Garges, M. Giovanni, M. Guyer, et al., “The NIH Human Microbiome Project.” Genome Res, vol. 19, pp. 2317–2323, Dec 2009.

[25] M. Deloger, M. E. Karoui, and M.-A. Petit, “A genomic distance based on MUM indicates discontinuity between most bacterial species and genera.” J Bacteriol, vol. 191, pp. 91–99, Jan 2009.

[26] M. Krzywinski, J. Schein, I. Birol, J. Connors, R. Gascoyne, D. Horsman, S. J. Jones, and M. A. Marra, “Circos: an information aesthetic for comparative genomics.” Genome Res, vol. 19, pp. 1639–1645, Sep 2009.

[27] P. Horvath and R. Barrangou, “CRISPR/Cas, the immune system of bacteria and archaea,” Science, vol. 327, no. 5962, pp. 167–170, Jan 2010.

[28] Q. Zhang, T. G. Doak, and Y. Ye, “Expanding the catalog of cas genes with metagenomes,” Nucleic Acids Res., vol. 42, no. 4, pp. 2448–2459, Feb 2014.

[29] S. V. Angiuoli and S. L. Salzberg, “Mugsy: fast multiple alignment of closely related whole genomes,” Bioinformatics, vol. 27, no. 3, pp. 334–342, Feb 2011.

[30] M. Blanchette, W. J. Kent, C. Riemer, L. Elnitski, A. F. Smit, K. M. Roskin, R. Baertsch, K. Rosenbloom, H. Clawson, E. D. Green, D. Haussler, and W. Miller, “Aligning multiple genomic sequences with the threaded blockset aligner,” Genome Res., vol. 14, no. 4, pp. 708–715, Apr 2004.

[31] A. C. Darling, B. Mau, F. R. Blattner, and N. T. Perna, “Mauve: multiple alignment of conserved genomic sequence with rearrangements,” Genome Res., vol. 14, no. 7, pp. 1394–1403, Jul 2004.