bioinformatics for discovery of microbiome variation daniel brejnrod.pdf · introduction to...

114
Main Academic supervisor: Professor Søren Johannes Sørensen University of Copenhagen, Section of Microbiology Bioinformatics for discovery of microbiome variation UNIVERSITY OF COPENHAGEN FACULTY OF SCIENCE Chair of Assessment committee: Associate professor Sten S TRUWE University of Copenhagen, Denmark Assessment committee member: Professor Henrik Bjørn Nielsen Clinical-Microbiomics Assessment committee member: Professor Tim Vogel Université de Lyon PHD THESIS Asker Daniel Brejnrod

Upload: others

Post on 29-Jul-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Main Academic supervisor:

Professor Søren Johannes Sørensen

University of Copenhagen, Section of Microbiology

Bioinformatics for discovery of microbiome

variation

U N I V E R S I T Y O F C O P E N H A G E N

FACULTY OF SCIENCE

Chair of Assessment committee:

Associate professor Sten STRUWE

University of Copenhagen, Denmark

Assessment committee member:

Professor Henrik Bjørn Nielsen

Clinical-Microbiomics

Assessment committee member:

Professor Tim Vogel

Université de Lyon

PHD THESIS Asker Daniel Brejnrod

Page 2: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references
Page 3: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Bioinformatics for discovery of microbiome variation I

List of contents

List of contents I

Acknowledgement II

List of abbreviations IV

Abstract V

Resume VII

List of manuscripts IX

Introduction 1

Sequencing 1

DNA extraction 4

Amplicons: Quality control 5

Amplicons: OTU picking 8

Metagenomes: Mapping 9

Metagenomes: Assembly and binning 11

Metatranscriptomes 12

General Objectives 15

Introduction to manuscripts 17

Manuscript 1 17

Manuscript 2 21

Manuscripts 3 & 4 25

List of references 30

Page 4: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

II ACKNOWLEDGEMENTS

Acknowledgements

This thesis is the result of 3 years of work that has included many collaborations, both on

the papers included in the thesis and the ones that were not. I have really appreciated that

so many people wanted to work with me. Unfortunately space constraints how many I

can acknowledge here, but I hope you know that I have appreciated it.

First, I would like to thank my supervisor Søren Johannes Sørensen for having me in his

lab and extending all of his support to my projects.

The same goes for Karsten Kristiansen, my informal supervisor and, who has made room

for me in his lab and introduced me to many exciting projects.

Karsten, Søren and the institute of biology has all funded ne with support from the

Lundbeck foundation for which I am very grateful.

Anders Priemé has been an informal supervisor on many projects performed at MME and

has always contributed enthusiastically and knowingly about any and all environmental

microbiology topic.

The ABC project has been my main project in Sørens group, and I would like to thank

everybody involved. In particular thanks to the COPSAC research group at Gentofte

Hospital. Hans Bisgaard and the omics group: Jonathan Thorsen, Jakob Stokholm, Johan-

nes Waage and Morten Rasmussen. Writing manuscript 1 with them was a lot of fun, and

I learnt a lot from it.

I would like to thank Dr. Christopher Quince of Warwick University for letting me stay in

his lab. Being introduced to the metagenomics field by an expert who has contributed

many key algorithms was a pleasure. I would also like to thank him for introducing me to

Page 5: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Bioinformatics for discovery of microbiome variation III

a collaboration with Dr. Konstantinos Gerasimidis of Glasgow University whose data I

worked on during my stay in Britain.

I am grateful to Sam Jacquoid and Ines Nunes for all their help and support in writing the

Hygum manuscript, and for the general collaboration we have had on many things over

the years.

I have been lucky enough to be in two groups, Sørens and Karstens, that are very social

and supportive, and have made the experience a lot of fun.

I would like to thank Annelise Kjøller for reading and editing both this thesis, but pretty

much everything I have written since my master thesis began, and to Sten Struwe for

agreeing to be the chairman of the Phd committee and handle the planning aspect of it.

Thanks to both for our many discussions.

Finally I would like to thank my family and friends for always being there.

Asker Brejnrod, Copenhagen, 2017

Page 6: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

IV LIST OF ABBREVIATIONS

List of abbreviations

16S – 16S small ribosomal subunit

COPSAC – Copenhage prospective Studies on Asthma in Childhood

FRG – Functional Response Group

FPR – False Positive Rate

HMP – Human Microbiome project

IBD – Intestinal Bowel Disease

KEGG – Kyoto Encyclopedia of Genes and Genomes

LPS – Lipopolysaccharides

MetaHIT – Metagenomics of Human Intestinal Tract

NGS – Next Generation Sequncing

OTU - Operational Taxonomic Unit

PCA – Principal Component Analysis

PCoA – Principal Coordinate analysis

PCR – Polymerase Chain Reaction

PICRUST – Phylogenetic Investigation of Communities by Reconstruction of Unobserved

States

RNA-seq – RNA sequencing, transcriptomics

SSU – (ribosomal) Small Subunit

UC – Ulcerative Colitis

WGS – Whole Genome Sequencing

Page 7: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Bioinformatics for discovery of microbiome variation V

Abstract

Sequencing based tools have revolutionized microbiology in recent years. High-

throughput DNA sequencing have allowed high-resolution studies on microbial life in

many different environments and at unprecedented low cost. These culture-independent

methods have helped discovery of novel bacteria in both the environment and human

hosts that are not accessible through traditional means, and genome bioinformatics have

allowed interrogation of their metabolic capabilities through novel tools of reconstructing

genomes from short DNA sequencing reads. These new approaches to studying bacterial

taxonomy and function enables hypothesis generation of how variation in microbial

communities and functionality impacts interaction with their hosts and environments.

This thesis contributes new tools and improvements to existing ones to discover relevant

variation in microbiomes. Finally, it explores the integration of various molecular meth-

ods to build hypotheses about the impact of a copper contaminated soil.

The introduction is a broad introduction to the field of microbiome research with a focus

on the technologies that enable these discoveries and how some of the broader issues have

related to this thesis particularly for considerations there was no space for in the papers.

The main focus is on the bioinformatics, the processing of the data after it has been gener-

ated by the sequencing machines. However, some topics such as DNA extraction is

touched upon as it have a big impact on downstream results.

Manuscript 1 ,“Large-scale benchmarking reveals false discoveries and count transfor-

mation sensitivity in 16S rRNA gene amplicon data analysis methods used in microbiome

studies”, benchmarked the performance of a variety of popular statistical methods for

discovering differentially abundant bacteria . between two conditions. The purpose is to

assess the false discovery rate, recovery of truly differential abundant bacteria and the

impact of beta diversity exploration strategies commonly used in microbiome research.

We assess these differences by simulation and by making biological assumptions about

Page 8: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

VI ABSTRACT

the data, and conclude that researchers should be careful in their choice of methods, as it

might lead to inflated rates of both type I and type II errors.

In Manuscript 2 a new software package is described that allows the exploration of mi-

crobial functionality that is found on mobile genetic elements. This exploration is done

through annotation of metagenomic contigs and subsequent statistical discovery. Statisti-

cal validation of the approach is performed and two datasets are tested, the Hygum soil

metagenomes and a public dataset of UC patients. A number of significant candidates are

discovered and their feasibility is discussed in the context of available knowledge of the

respective microbiomes.

Manuscripts 3 and 4 detail the exploration of a heavily copper-contaminated site in Hy-

gum, Denmark. The site has been used for impregnating wood using CuSO4, and while it

was shut down in 1924 it is still contaminated with a gradient of copper spanning 100-

folds. “Coping with copper: legacy effect of copper on potential activity of soil bacteria

following a century of exposure” uses RNA-based sequencing to study the active fraction

of the microbioal community. It extends on past work on the soil by enumerating only the

bacteria of the active fraction and defining functional response groups, groups that re-

spond similarly to the copper concentrations. Manuscript 4 extends previous literature by

using metagenomics sequencing and RNA sequencing of mRNA to characterize the func-

tionality of the microbial community. Based on an assessment of catabolic diversity and

expression of growth-related proteins we conclude that the copper contamination selects

for K-strategists, and that they are likely infected with lytic phages that prey on the com-

munity.

Page 9: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Bioinformatics for discovery of microbiome variation VII

Resume

Sekventerings-baserede værktøjer har revolutioneret mikrobiologi I de seneste år. Næste-

generations sekventering har givet mulighed for studier af mikrobielt liv i hidtil uset høj

opløsning og i mange forskellige miljøer. Disse metoder, som er uafhængige af dyrkning

har givet mulighed for at opdage nye bakterier i både naturen og menneskelige værter

som ikke har været tilgængelige med traditionelle værktøjer. Genom og bioinformatik

værktøjer lader forskere afklare metabolsk potentiale ved at rekonstruere fulde genomer

fra kort DNA stykker. Disse nye metoder kan bruges til at generere hypoteser om hvorle-

des variation i mikrobielle samfund påvirker deres miljø eller vært.

Denne afhandling bidrager med nye værktøjer og forbedringer af eksisterende der kan

bruges til at finde variation i mikrobiomer. Endeligt udforskes integration af molekylære

metoder til at lave nye hypoteser om kopper forurenings indflydelse i jord.

Introduktionen er en bred introduktion til mikrobiom forskningfeltet med fokus på den

teknologi der bruges til at lave nye opdagelser og til at uddybe overvejelser som der ikke

var plads til i artiklerne. Hovedfokus er på bioinformatik siden, behandling af data efter

det kommer ud af sekventeringsmaskinen. Dog påkalder andre emner som f.eks DNA

ekstraktion sig også opmærksomhed da det har stor indflydelse på resultaterne.

Manuskript 1, “Large-scale benchmarking reveals false discoveries and count transforma-

tion sensitivity in 16S rRNA gene amplicon data analysis methods used in microbiome

studies”, beskriver et benchmarking studie af forskellige aspekter af ofte brugte statistiske

metoder til at afklare hvilke bakterier der findes i forskellig mængde under forskellige

betingelser. Formålet med dette er at belyse hvor ofte sande nulhypoteser forkastes og at

finde ud af hvor ofte metoder accepterer sande alternative hypoteser. Vi undersøger dette

ved hjælp af simulations-studier og ved at lave biologiske antagelser om hvilke hypoteser

er sande. Vi konkludere at man bør være påpasselig i sit metode-valg fordi anvendelse af

metoderne kan føre til type I fejl-rater der er større end forventet.

Page 10: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

VIII RESUME

Manuskript 2 beskrives udviklingen af en ny softwarepakke der tillader udforskning af

den del af mikrobielle funktionelle gener der findes på mobile elementer. Dette er baseret

på annotation af metagenom contigs og efterfølgende statistisk modellering. Vi foretager

en statistisk validering af metoden og afprøver den på to datasæt, Hygum jord meta-

genomerne og et offentligt tilgængeligt datasæt fra patienter med blødende tyktarmsbe-

tændelse. Vi identificerer nogle statistisk signifikante kandidat-gener og diskuterer deres

mulige funktion i lyset af hvad man ved om deres respektive mikrobiomer.

Manuskript 3 og 4 beskriver udforskningen af en grund i Hygum i Danmark som er

stærkt forunenet med kopper. Grunden har været brugt til at imprægnere træ ved hjælp

af koppersulfat. Den blev lukket i 1924, men den er stadig foruneret med en gradient der

spænder over 100-fold forskelle. “Coping with copper: legacy effect of copper on potential

activity of soil bacteria following a century of exposure” anvender RNA-baserede

metoder til at studere den metabolsk aktive del af bakterierne. Vi bygger videre på eksi-

sterende observationer ved kun at kigge på den aktive del, og at gruppere disse i grupper

efter deres respons på forurening. Manuskript 4 bidrager yderligere ved at bruge meta-

genom og metatranskriptom sekventering til at udforske den metabolske funktionalitet af

bakterierne. Baseret på diversiteten af det katabolske potentiale og udtrykket af DNA po-

lymerase konkluderer vi at kopper selekterer for bakterier der bruger den såkaldte K-

strategi. Vi konkluderer yderligere at de formentligt inficeres af phager i den lytiske fase.

Page 11: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Bioinformatics for discovery of microbiome variation IX

List of manuscripts

Manuscript 1:

Thorsen, J*., Brejnrod, A*., Mortensen, M., Rasmussen, M.A., Stokholm, J., Al-Soud, W.A.,

Sørensen, S., Bisgaard, H. and Waage, J., 2016. Large-scale benchmarking reveals false

discoveries and count transformation sensitivity in 16S rRNA gene amplicon data analysis

methods used in microbiome studies. Microbiome, 4(1), p.62.

Manuscript 2:

Asker Daniel Brejnrod, Sam Jacquiod , Gisle Vestergaard, Jakob Russel , Michael Schloter

and Søren Johannes Sørensen.

SYNTENY: A new tool to explore the functional content of the mobile pool from meta-

genomes

Manuscript 3:

Nunes, I., Jacquiod, S., Brejnrod, A., Holm, P.E., Johansen, A., Brandt, K.K., Priemé, A.

and Sørensen, S.J., 2016. Coping with copper: legacy effect of copper on potential activity

of soil bacteria following a century of exposure. FEMS Microbiology Ecology, 92(11),

p.fiw175.

Manuscript 4:

Asker Daniel Brejnrod, Samul Jacquoid, Inés Marques Nunes, Anders Priemé, Søren Jo-

hannes Sørensen.

An integrative molecular approach reveals the lifestyle and phage susceptibility the mi-

crobial community of a long-term contaminated soil

Page 12: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

X LIST OF MANUSCRIPTS

Other publications not included in the thesis

Svolos, V., Hansen, R., Ijaz, U.Z., Quince, C., Watson, D., Alghamdi, A., Brejnrod, A.,

Ansalone, C., Milling, S., Gaya, D. and Russell, R., 2017. Dop008 Dietary manipulation of

the healthy human and colitic murine gut microbiome by Cd-treat diet and exclusive en-

teral nutrition; a proof of concept study. Journal of Crohn's & colitis, 11(suppl_1), pp.S29-

S30.

Roghayeh Habibi, Saeed Tarighi, Javad Behravan, Parissa Taheri, Annelise Helene Kjøller,

Asker Brejnrod, Jonas Stenløkke Madsen, and Søren Johannes Sørensen. 2017. Whole-

Genome Sequence of Pseudomonas fluorescens EK007-RG4, a Promising Biocontrol Agent

against a Broad Range of Bacteria, Including the Fire Blight Bacterium Erwinia amylovora.

Genome Announcements. 5:e00026-17; doi:10.1128/genomeA.00026-17

Mortensen, M.S., Brejnrod, A.D., Roggenbuck, M., Al-Soud, W.A., Balle, C., Krogfelt,

K.A., Stokholm, J., Thorsen, J., Waage, J., Rasmussen, M.A. and Bisgaard, H., 2016. The

developing hypopharyngeal microbiota in early life. Microbiome, 4(1), p.70.

Danneskiold‐Samsøe, N.B., Andersen, D., Radulescu, I.D., Normann‐Hansen, A.,

Brejnrod, A., Kragh, M., Madsen, T., Nielsen, C., Josefsen, K., Fretté, X. and Fjære, E.,

2017. A safflower oil based high‐fat/high‐sucrose diet modulates the gut microbiota and

liver phospholipid profiles associated with early glucose intolerance in the absence of tis-

sue inflammation. Molecular Nutrition & Food Research.

Sverrild, A., Kiilerich, P., Brejnrod, A., Pedersen, R., Porsbjerg, C., Bergqvist, A., Erjefält,

J.S., Kristiansen, K. and Backer, V., 2016. Eosinophilic airway inflammation in asthmatic

patients is associated with an altered airway microbiome. Journal of Allergy and Clinical

Immunology.

Page 13: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Bioinformatics for discovery of microbiome variation XI

Behrendt, L., Brejnrod, A., Schliep, M., Sørensen, S.J., Larkum, A.W. and Kühl, M., 2015.

Chlorophyll f-driven photosynthesis in a cavernous cyanobacterium. The ISME journal,

9(9), pp.2108-2111.

Røder, H.L., Raghupathi, P.K., Herschend, J., Brejnrod, A., Knøchel, S., Sørensen, S.J. and

Burmølle, M., 2015. Interspecies interactions result in enhanced biofilm formation by co-

cultures of bacteria isolated from a food processing environment. Food microbiology, 51,

pp.18-24.

Harder, C.B., Rønn, R., Brejnrod, A., Bass, D., Al-Soud, W.A. and Ekelund, F., 2016. Local

diversity of heathland Cercozoa explored by in-depth sequencing. The ISME journal.

Riber, L., Poulsen, P.H., Al‐Soud, W.A., Skov Hansen, L.B., Bergmark, L., Brejnrod, A.,

Norman, A., Hansen, L.H., Magid, J. and Sørensen, S.J., 2014. Exploring the immediate

and long‐term impact on bacterial communities in soil amended with animal and urban

organic waste fertilizers using pyrosequencing and screening for horizontal transfer of

antibiotic resistance. FEMS microbiology ecology, 90(1), pp.206-224.

Page 14: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references
Page 15: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Unravelling the mechanisms of bacterial interactions in model communities 1

Introduction

Sequencing

High-throughput sequencing has enabled a new era in microbiology research. Frederick

Sanger shared the 1980 Nobel prize in chemistry for a method that allowed the determina-

tion of the sequence of base pairs in DNA. In 1977 they had applied this method to se-

quence the first complete genome of an organism, Phi-174 with 5386 basepairs[1]. The first

complete microbial genome, from Haemophilus influenzae (1,830,137 bp) was completed

in 1995[2]. Finally in the early 2000s the complete human genome was reported from two

competing projects at around 3 billion bp [3-5].

While these milestones were all completed with the same fundamental technology, Sanger

sequencing, they spurred innovation in both optimizing the wet-lab procedures, such as

shotgun sequencing[6], and handling the exponentially growing data output. Sanger se-

quencing as well as the next generation of technologies sequence short fragments of ge-

nomes that then requires assembly to a long contiguous chromosome. Furthermore, ob-

taining the sequence of a gene is not immediately informative of its function and so much

effort has been put into the task of searching newly sequenced genes against databases of

known gene functions. These two basic tasks of sequence bioinformatics have necessitated

a wealth of algorithm development and infrastructure requirements to handle the output

from the rapidly developing sequencing technology.

Page 16: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

2 INTRODUCTION

Figure 1: Cost of sequencing dropping faster than Moores law (from https://www.genome.gov/sequencingcosts/)

While initially Sanger sequencing was a laborious procedure that required manual cycling

of reagents and running gels, it has been automated into cycle sequencing. This was one

of the factors that drove the sequencing of the human genome to be a contentious race

between a public and a private effort, because the throughput grew exponentially.

The 2nd generation of DNA sequencing technology was driven by miniaturization of wet

lab procedures that allowed them to be carried out on small plates, millions in parallel.

While the read length was generally shorter and the error rates were higher, the through-

put was orders of magnitudes larger. Three companies in particular, Illumina, SOLiD and

454 made their entry into the market with fundamentally different chemistries. The high

throughput and low price opened up a new era in microbial and human genome sequenc-

ing, and the amount of sequenced genomes rose exponentially.

Page 17: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Bioinformatics for discovery of microbiome variation 3

In addition to the amount of genomes, cheap sequencing allowed the development of oth-

er applications. Of particular interest in this thesis will be amplicon sequencing, meta-

genomic and, metatranscriptomic sequencing.

Amplicon sequencing is the targeted sequencing of a particular gene by amplifying a re-

gion of the gene using PCR primers. For microbiome purposes the 16S rRNA gene is often

the target.This was enabled in particular by the longer and longer reads of the 454 ma-

chines. Before the machine was obsoleted in 2016 the Titanium FLX kit was sold as pro-

ducing 800bp long reads. As reads are made from targeted and amplified regions of bac-

terial conserved 16s genes (figure X), longer reads allow better discrimation of bacteria. As

the field has shifted away from 454 technology, it is now typically Illuminas 2x300bp or

the Ion Torrents 400bp kit. This change in chemistry has also had a large impact. Pyrose-

quencing as used by the 454 had a high rate of “indel” type errors. The chemistry relies on

adding a fluorescent base to an extending sequence and reading the light signal that re-

sults if the base was incorporated. If the sequence consists of several bases of the same

type, several bases are added, but the light signal is not completely proportional. The ma-

chines own estimate of the error rate, PHREDS score[7], is incorrect which leads to the

“homopolymer” problem. Amplicon sequencing uses the similarity of reads to group

reads into putative species, so this required the development of algorithms to control the

error rate and avoid inflated diversity estimates. The Illumina Sequencing-by-synthesis

chemistry does not have theses same issues, and neither does the newer Ion Torrent kits.

While amplicon sequencing relies on amplification of particular regions of DNA, meta-

genomics aim to sequence all available DNA in a given sample. This is done by extracting

DNA and building libraries as if it was a regular genome, ie. fragmenting the DNA and

attaching adapters that allow the sequencing machine to capture the short fragment.

The final application relevant to this thesis is the sequencing of metatranscriptomes, RNA

instead of DNA. As the sequencers cannot directly operate on RNA molecules, the proto-

col includes a step for converting RNA to DNA by reverse transcriptase, an enzyme iso-

Page 18: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

4 INTRODUCTION

lated from RNA viruses. This reaction is primed by random hexamers, ensuring that the

amplification is unbiased. After this amplification regular libraries can be build and se-

quenced.

RNA sequencing does come with several caveats. First, RNA is a very fragile molecule

and to get useful data samples must be immediately frozen in at -80 degrees, complicating

sample collection and which types of samples it is even meaningful to collect. Second, a

large fraction of the RNA pool in a given sample will be ribosomal, so consideration must

be made whether to remove this, at the cost of extra complexity and possible bias.

DNA extraction

DNA extraction is an important issue in microbiome work. Sequencing of DNA requires

the DNA to be accessible to PCR enzymes and library building, and the final estimation of

bacterial abundances rely on all cells to have been lysed equally well. Different organisms

have evolved different cell walls to protect them from the environment. While gram-

positive bacteria have one membrane with a thick layer of peptidoglycan, gram-negative

bacteria have two membranes and a thinner layer of peptidoglycan in the the periplasmic

space between the membranes the so-called cell envelope, with an additional layer of lip-

opolysaccharides. Other domains of life have evolved different cell walls, of most rele-

vance to this thesis is the distinct chitin and ergosterol in fungal cell walls that we exploit

to quantify the biomass of fungi. Indeed, differences in cell-wall is thought to have deep

evolutionary roots [8]. The consequence of these differences is that lysis of cells in DNA

extraction must be optimized to lyse equally well.

Extraction of a wide variety of cells is only one practical concern, investigators must con-

sider other parameters such as quality and amount of eluate. For amplion sequencing that

involves amplifying specific pieces of DNA very low amounts are needed, but for build-

ing metagenomic libraries much more DNA is needed. For amplicon sequencing PCR

Page 19: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Bioinformatics for discovery of microbiome variation 5

inhibitors such as humic acid [9, 10] in general, but heavy metals for contaminated soils in

particular [11], found in soil can prevent amplification and must be removed, typically by

dilution. Another quality aspect is the shearing of DNA by harsh treatment in the extrac-

tion process, which can cause chimeric PCR products [12, 13]. Sanquing et al [14] observed

ranges of 9-23 kb, but observed fragments down to 120bp.

DNA extraction of both soil and faeces can generally proceed by two routes, direct and

indirect. These lyse directly from the soil or faecal matter or pull out cells and then do the

lysis subsequently. Indirect extraction can free cells from the soil particulate by eg. using

nycodenz gradiets, a solute with higher density than bacterial cells. After ultra-

centrifugation cells will be found on top of the gradient. While it has been shown to bias

the extracted microbial community [15] it gives substantially more DNA of high molecu-

lar weight [16].

Direct extraction is most common for microbiome purposes, and is generally considered

less laborious. Bead beating is an essential part of the protocol for that ensures efficient

lysis [17] but it has also been recognized that longer bead beating will result in more

shearing [18]. Without this step it has been demonstrated that DNA extraction systemati-

cally misrepresents some bacteria[19]. Even with bead beating different methods can sys-

tematically bias extraction of members that are crucial to study in both soil and faeces,

with possible consequences for controversial subjects in human microbiome research such

as the Fimicutes/bacteroidetes ratio and it’s relation to obesity [20].

Amplicons: Quality control

Quality control of amplicons is of greater importance in amplicon sequencing than most

other NGS applications. Because of the de novo clustering to form OTUs, errors in reads

falsely increases diversity because a read with sufficiently many errors will look like a

new species. Additionally, amplicon sequencing relies on PCR. Not only are some poly-

merases known to be error-prone but all PCR reactions can generate chimeras. Chimeras

are amplicons that are the result of one prematurely terminated amplicon being used as a

Page 20: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

6 INTRODUCTION

“mega-primer” in the subsequent round of amplification that makes it seems as if one

part of the amplicon was generated from one species, and the rest from another.

Figure 2: Variability of the 16S gene. From Neefs et al. 1990 [21]

The problem of chimera removal was studied before NGS, when most 16S sequences were

described by Sanger sequencing [22]. Indeed, Hugenholtz et al [23] demonstrated that

several studies of PCR libraries had chimeric artefacts

In addition to the issue of the quality of individual reads, PCR-based methods also suffer

from some general issues like amplification bias and contamination, particularly for low-

DNA yield samples. The issues with contamination has now been well-documented in the

literature [24-26]. The use of negative controls for both the DNA extraction step and the

PCR amplification has been recommended, but it is still unclear if this information can be

used to derive contamination free microbiome profiles, as many contaminantants are also

the organisms that are of biological interest in the samples. Another useful addition to

experiments is a positive control in the form of a mock community, a constructed com-

munity with known composition. These are part of the human microbiomes standard pro-

tocols[27], and can be ordered commercially. Another approach is to use a so-called “gen-

erous donor” approach where several samples are mixed and this mix sequenced across

batches[28]. The latter is similar to the controls used in LC-MS where a lot of effort have

Page 21: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Bioinformatics for discovery of microbiome variation 7

been put into correcting spectra based on these pooled samples run at different times [29].

To my knowledge no such corrections exist yet for sequencing data.

Figure 3: Evaluation of the mock communities generated for the data in manuscript 1. Figures text taken from C. Balles

master thesis [30]

For the 5000-sample dataset generated in manuscript 2. Both negative and positive con-

trols were sequenced. The genus-level composition of the positive controls are shown in

figure X. the leftmost column shows the ideal composition. Clearly none of the sequenced

mock communities are exactly identical, but none of the genera seem to deviate more than

~10%. The internal correlations between the mocks are mostly > 0.9 (figure X) giving some

confidence in the between-sample conclusions we make.

Page 22: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

8 INTRODUCTION

Figure 4: Correlations between mock community compositions [30]

Amplicons: OTU picking

From amplicon sequencing data it is possible to follow several strategies for data analysis.

One strategy is to simply take the reads and search them against a database of previously

observed 16S sequences, so-called closed reference OTU picking[31]. This approach is

popular for large datasets because it is trivially parallelizable and it immediately gives the

phylogenetic identity of the query read. Because of the conservation pattern of the PCR

amplified fragment is possible to analyse the sequencing data on the basis of sequence-

defined groups of reads, so called Operational Taxonomic Units (OTUs). The exact formu-

lation of the OTU concept rely on many technical considerations and have been the sub-

ject of much discussion in the literature. One initial consideration is how identical se-

quences should be to be considered to have originated from the same species. The current

standard is to define everything within 97% identity the same species. This threshold was

originally defined because full-length 16S sequences are correlated with the older DNA-

DNA hybridization concept of 70% full-genome identity for grouping species [32]. Addi-

tionally, because amplicons was originally sequenced on 454 chemistry the chosen thresh-

old had to be higher than the error-rate of 1%. However, as the sequenced fragment is

expected to be more variable than the full-length 16S on average, it is not obvious that this

is the best threshold for short, and much effort has gone into testing this empirically.

Complicating matters, is the fact that the algorithm used for grouping and the exact defi-

Page 23: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Bioinformatics for discovery of microbiome variation 9

nition of the measurement of separation of clusters impact the observed species. Schloss et

al. [33] has done comprehensive empirical testing of choice of algorithm, but even so in

practice many datasets must rely on approximations to these algorithms without mathe-

matical guarantees of finding the optimal clustering for the chosen definition. One exam-

ple of such an approximate algorithm is usearch [34] in which computationally expensive

alignments are replaced with cheap k-mer based overlap similarity to narrow the search

space for candidates to align. This approximation is very fast, but is not guaranteed to

find the optimal clustering. Validation on mock community datasets, however, shows that

in practice this strategy performs very well.

Newer algorithms eschew the clustering step and simply attempt to denoise the observed

sequences so that taxonomy classifications can be made a single-nucleotide resolution [35-

37]. This is in part made possible by the lower error rate, and more manageable error pro-

file of the Illumina Truseq chemistry as compared to 454 [38].

Metagenomes: Mapping

The most widely used bioinformatics strategy for analyzing metagenome data is simply

mapping the reads against databases of known DNA material. Databases such as NCBIs

Non-redundant Proteins Database allow annotation against a database that has been pre-

clustered to remove very similar proteins for reduced search times. Databases such as

Metacyc[39] and Kyoto encyclopedia of genes and genomes (KEGG)[40] offers additional

structure for the understanding of the output. The latter databases are organized into bio-

logically meaningful structures such as modules and pathways that allow the estimation

of whether the joint microbial community can complete certain chemical transformations,

and the abundance of the enzymes mapped to these pathways provide a measure through

which it is possible to look for overrepresentation of pathways under different experi-

mental or environmental conditions.

Page 24: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

10 INTRODUCTION

The mapping of reads against databases is trivially parallelizable, and it is easy to dedi-

cate as much computer power to the problem as is available, and sequencing search of

DNA against proteins has seen big advances recently [34, 41]

Abundances of mapped reads have to be normalized to interpret them correctly. The sta-

tistics section discusses the effects of library size that needs to be accounted for, but an-

other consideration is that genes have different lengths. In a model where reads are com-

pletely randomly generated from genomes, longer genes are more likely to have generat-

ed reads and this must be accounted for. The Humann [42] and Humann2 packages ac-

count for this by taking the family of the specific gene that a read mapped to and normal-

izing with the average length in that family.

This mapping approach also extends to taxonomic identification. In manuscript 4 we de-

ploy two different strategies to inform us about the taxonomic composition. The first is

extracting reads that are possibly ribosomal Small Subunits (SSU) and identifying them

using the software Metaxa2 [43]. The amount of extracted sequences from each clade of

interest forms the basis for a rarefaction analysis and an analysis of the correlations of

taxonomic groups with environmental parameters in manuscript 4. We use the rarefaction

analysis to conclude that we have not quite sequenced deep enough to discover all diver-

sity at the genus level but that there is significant differences between the groups when

sampling a fixed number of reads.

The second is the mapping strategy included in the humann2 package: metaphlan2 [44].

This software maps all metagenome reads against a database of pre-selected markers that

unequivocally identifies microbial clades as well as viruses. This database was construct-

ed by finding discriminatory genes for each clade in the IMG database[45]. To compute

relative abundance for the clades metaphlan2 normalizes the number of hits to the nucleo-

tide length and the amount of markers for the clade. We use the correlation between

abundance and copper contamination to elaborate on unidentified groups from the previ-

ous SSU analysis, demonstrating the complementarity of these approaches on meta-

genomic data. In addition we select a Gammaproteobacteria marker from the literature to

investigate unclassified bacteria [46]

Page 25: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Bioinformatics for discovery of microbiome variation 11

Metagenomes: Assembly and binning

As sequencing throughput and the complexity of experimental designs has increased, so

has the state of assembly technology matured. Assembly in the early days of Sanger se-

quencing was simply constructing overlapping sequences of short reads aided by com-

puter visualization. The human genome project saw the development of an automated

computer program for doing the full assembly. As next generation sequencing became

common, bioinformaticians has used a lot of energy on more effective assemblers. Over-

lap assemblers, formally known as overlap-layout-consensus assemblers, such as the the

Celera Assembler[47] used for the first human genomes and Newbler[48] used for 454

sequencing essentially formalizes and optimizes the process a human operator would go

through to stitch short reads together.

The newer paradigm in assembly however uses a different representation of the data that

would not be immediately intuitive to a human operator. Overlapping DNA words of

length k, called k-mers, are arranged in a directed graph so that each node is a word and

edges connecting them represent overlap of all but one character in the linked k-mers.

In this representation of reads the ideal full genome would be represented by the so-called

Eulerian path through the graph, a path that visits all edges exactly once. This is a well-

studied problem in graph theory and has several algorithms of acceptable complexity.

The first of these from Pevzner et al [49] represented a substantial improvement on the

celera assembler used for the first human genome assembly, and was soon adapted to

short reads [50]. This representation also has the advantage that the assembly problem can

be solved with a bounded amount of memory that depends on the genome size [51].

However, in practice a complication is that sequencing has an error rate means that there

will exist many k-mers in the data that is not of biological origin. Additionally, it is not

obvious which choice of word-length k is the best. DNA often has repeats and ideally the

choice of k-mer would be longer than these repeats so their overlap is uniquely identifia-

ble. However, longer k-mers allow for more errors in the individual k-mers, increasing

the number of false paths in the graph.

Page 26: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

12 INTRODUCTION

There is no analytically tractable way of choosing k, and while heuristics exists they must

also account for the fact that in practice memory usage increases with longer k. The only

hard and fast rule is that k should be an uneven number so that k-mers cannot be palin-

dromic, ie. the same sequence no matter if read from 3’ or 5’ end. One widely used solu-

tion is to use digital normalization[52] in which all k-mers below some specified abun-

dance is simply removed, on the intuition that if it is only found few times it is more likely

to be noise. However, for the problem of metagenomes specifically, it is likely that there is

some amount of organisms with low coverage, and this approach would discard those.

The assembler Megahit expands on this strategy by removing all edges with abundance

less than d, but recovering them if they come from a context of surrounding high-

abundance k-mers, and it is fast enough to simply try a selection of k’s for evaluation.

Assemblies for metagenomics binning is currently made from one of two overall strate-

gies. Some projects, such as most MetaHit projects, prefer to assemble each sample indi-

vidually while others prefer coassembly, pooling all samples together and doing one big

assembly. To my knowledge no systematic studies have been performed on what works

best and the choice is likely best made based on considerations of the particular dataset.

While coassembly would seem to be the obvious choice to obtain better coverage of all

organisms, for some projects the number and depth of samples prohibit coassembly in

practice because of RAM limitations. It also depends on the choice of downstream binning

method. Some binning methods were made for clustering covariances of protein abun-

dances [22] while others, such as CONCOCT[53] are made for contig covariance and

composition.

Metatranscriptomes

Metatranscriptomes in this thesis is used in two different ways. In Nunes et al [54] RNA

is isolated and reverse transcribed to cDNA using random hexamer primers. The pool of

cDNA is then amplified by 16s primers and sequenced as an amplicon dataset. This tech-

nology is not as widespread as amplifying directly from DNA, and comes with a separate

set of benefits and challenges. One is the additional wet-lab work required to secure the

Page 27: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Bioinformatics for discovery of microbiome variation 13

fragile RNA and transcribe it into cDNA. Another is primarily related to the source being

the transcribed RNA in ribosomes. With DNA the copy number is fixed by the number of

rrnX loci multiplied by the number of chromosomes in each cell. The copy number of loci

is a well-studied quantity from sequenced full genomes [55], but more importantly it is

phylogenetically conserved. So for statistical purposes it is reasonable to assume that the

number in a particular bacterium will be the same for the same bacteria in a different

sample. The chromosomal copy number for prokaryotes can vary quite widely [55], but

can still be expected to be conserved phylogenetically. Ribosome amount on the contrary

will vary [56]. Indeed Schlaecther et al demonstrated in 1958 that the amount of ribosomes

in Salmonella is dependent on the growth rate [57].

RNA however degrades very quickly in the environment. This lets us get a clearer picture

of the active microbes in an environment [58]. It has been show in several cases that se-

quencing rRNA is the only way to observe important organism that were not abundant at

the DNA level such as the case of Legionella in water pipes [59] and eukaryotes in soil

[60].

For the other type of transcriptome in this thesis, RNA is simply transcribed to cDNA and

all of it is sequenced as in the case of metagenomics.

Like metagenomes, meta-transcriptomes can be analysed by either assembly or reference

based methods. An additional consideration is that the reference can be made from meta-

genomes of the same samples, the approach used in this thesis. This allows quantification

of expression of novel proteins, not found in existing databases. For high-diversity envi-

ronments where many new organisms exists from which proteins have not been charac-

terized this can provide leads on novel proteins in which the expression is important for

the system being studied. Alternatively, it is possible to assemble the transcripts them-

selves into longer fragments[61, 62], and then map the short fragments back for quantifi-

cation of transcript abundance.

Isolated RNA consists of mostly ribosomal RNA (Matthew et al estimated it at 85% [63]),

so a decision must be made whether to deplete this so that the mRNA fraction can be

Page 28: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

14 INTRODUCTION

studied, or to investigate the phylogenetic structure of the sample via the expressed ribo-

somal RNA. The latter is a popular option in studying soil where the organisms of interest

are not only prokaryotic, but also includes fungi, protozoa and other biota. Enumerations

of soil biota with ribosomal RNA can lead to better discovery of non-prokaryotes than

their metagenomics counterparts, and several studies have been published [64]

For the functional fraction, it is reasonable to assume that bacteria do no express their

entire genome at the same time. Furthermore, inactive DNA can be found in high

amounts in the environment,[58] so metatranscriptomes can potentially offer a more fo-

cused perspective on the microbiome. Expression studies have previously been per-

formed on microarrays, but microarrays have the inherent limitation that nucleotides tend

to cross-hybridise[65] and concerns about reproducibility has been raised [66] The ex-

pression of mRNA is also a more direct measure of the enzymatic activities performed by

a community. Some chemical transformation such as nitrogen or carbon cycling impacts

the environment or the host organism, and more accurate estimates of the rate of trans-

formations can contribute to a greater understanding of the effects. The early attempts at

estimating turnover of bulk elements came from marine environments such as using en-

zyme abundance measured by metagenomics to predict primary production [67], and the

marine setting was also the background for some of the early metatranscriptomic studies

[68, 69]. These were also the studies that demonstrated that depletion of ribosomal RNA

was practical [70, 71].

Page 29: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Bioinformatics for discovery of microbiome variation 15

General Objectives

The overall objective of this thesis is to use and build bioinformatics tools to explore varia-

tion in microbiome research. Interest is on several types of microbiome data. Manuscripts

1 and 3 are centered on amplicons, and the statistical tools to discover variation in this

type of data. Manuscripts 2 and 4 explores hypothesis generation for metagenomic data.

The chosen methods are statistical. The first is direct modelling of the hypothesis that

there might be an interaction between elements in a mobile genetic context and an out-

come. The second is an integrative approach to link several types of data. The objectives

across the studies are the following:

Ensure robustness of the inferences. This thesis attempts this

through statistical validation of methods and through complimen-

tary approaches

Generate new hypotheses about the chosen environments.

Through the collection and exploration of molecular data new as-

pects of well-known environments are explored

Page 30: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

16 GENERAL OBJECTIVES

Page 31: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Bioinformatics for discovery of microbiome variation 17

Introduction to manuscripts

Manuscript 1

This thesis explores the statistical tools developed for the new sequencing methods in [72].

While the paper is focused on 16s amplicon studies many of the same considerations can

be given to other types of sequencing methods. Indeed, several of the tools benchmarked

were developed for RNA-seq.

The major challenges that necessitates development of these new tools are related to the

way sequencing generates the data. To estimate abundance the bioinformatic workflow

typically ends up mapping reads to a reference, OTUs, contigs or reference genomes, the

data is counts. This is usually better modeled by other distributions than the Gaussian,

such as the poisson- or negative binomial distributions. Furthermore, sequencing usually

has a fixed amount of possible reads, and controlling how many we get per sample is not

always reliable. Consequently, random variation in library sizes may add noise, or in the

worst case confound inference[73]. Because the sum of reads is fixed the data is also in-

herently compositional, meaning that changing the abundance of one thing necessarily

changes something else because all abundances must sum to 100%. Compositional data

analysis is a field with a long history[74] and its own tools that are just recently influenc-

ing microbiome analysis[75, 76].

Another common characteristic of sequencing data is that it tends to be quite high-

dimensional, meaning that we observe orders of magnitude more units, OTUs, genes or

transcripts, than we have samples. This has implications for data analysis, no matter if

viewed from a predictive or an inference perspective.

Page 32: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

18 INTRODUCTION TO MANUSCRIPTS

From a predictive point of view many dimensions typically means that the models we use

has many parameters. This is known to lead to overfitting [77, 78], in which the model fits

to peculiarities in the particular sample of data available but does not generalize well to

previously unseen data from the underlying population.

From an inference perspective we often try to estimate differences in the underlying pop-

ulations of sampling units. However, as the number of required hypotheses test increases

so does the chance of committing a type I error, ie. getting a false positive[79]

In practice, multivariate modeling has proved very effective in this type of data analysis.

In particular, calculating similarities between samples and using that as the basis of plot-

ting and inference is a widely used strategy, and many tools have been developed to

quantify similarities and to analyse them.

Principal Components Analysis (PCA) is one such choice widely used for dimensionality

reduction [80]. The key point of PCA is that It can be used to take a set of vectors of di-

mensionality D in a coordinate system specified by the observations, and rotate them into

vectors of dimensionality D-1, in such a way that the variance of the basis components is

successively maximized, ie. the first component explains the most variance, the second

component the second highest amount of variance etc. These new vectors (called scores)

can then be visualized with their first few components to produce a 2 or 3 dimensional

visualization that hopefully highlights the important features of the data. The assumption

that the maximal variance is also the most interesting is not always warranted, and other

methods that highlights other aspects of the data has been developed. Multidimensional

Scaling (MDS) is the most widely used of these alternative methods primarily in the form

introduced by Kruskal [81]. The objective in MDS is different from PCA as it attempts to

find a new set of points in R^n (where n is a number chosen by the user, 2 is common for

visualisation) so that the difference in distances between these new points, and the differ-

ences in the original points are minimized. This objective is in some sense a more direct

representation of the layout of the datapoints,that could more accurately reflect the nature

of the data. Instead of evaluating the quality of the new representation in terms of vari-

ance explained, MDS uses the residual variance, termed stress, for evaluation, and Krus-

Page 33: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Bioinformatics for discovery of microbiome variation 19

kal provides in the original paper a table of what benchmarks indicated was acceptable

levels of stress.

The underlying tool of PCA is the Euclidian distance that is used to calculate the covari-

ance matrix. This can be replaced with a variety of distances. The rotation must then be

estimated by Principal Coordinates Analysis (PCoA) instead of PCA. Beta diversity met-

rics have an extensive history in ecology [82], and many of the tools developed there has

been adopted in microbial ecology, such as Jaccard [83] and Bray-Curtis[84] distances.

There has also been development of microbiome specific measures that incorporate phy-

logenetic trees [85, 86]. Finally, before the distance is calculated some investigators prefer

to do transformations, such as the classic log or square root transformations, or use nor-

malization such as dividing by the library size or use methods designed for RNA-seq such

as TMM [87]

In [72] we attempt to quantify the impact of analysis choices in the multivariate case by

simply choosing data sets that assume to be separable and then seeing how well different

combinations of transformations, normalization and similarity measures can separate

them, in a factorial manner.

From the false-discovery rate simulations we conclude that some of the methods designed

for sequencing-based analyses have poor control of FPR. MetagenomeSeq [88], a tool that

is widely used in microbiome research, does poorly on the default settings that the soft-

ware was released with, but a newer update to their model keeps the FPR at 0.019, but

still suffers from the influence of sparsity in the dataset. BaySeq [89], designed for RNA-

seq experiments, seemingly has an unpredictable FPR profile, while most methods con-

verge towards very high p-values when faced with sparsity. The classical methods such as

t-test, wilcoxons test and a permutation test all have FPR < 0.05. However, there is often a

trade-off between false positive and false negative rate, and in the spike-in recovery test,

neither the t-test nor the Wilcoxon test has very good performance. The permutation test

does well in the large sample setting. The sequencing-specific tools seem to do well in this

test. Another interesting observation is that contrary to expectation, the proportion of

Page 34: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

20 INTRODUCTION TO MANUSCRIPTS

case/controls have little influence on performance. Running these benchmarks on smaller

datasets changes the performance characteristics of several methods, and the performance

of edgeR [90] is suddenly the overall best as measured by AUC and FPR together. The

final recommendations from this part of the paper is that sample size is the key parameter

for the choice between edgeR (small sample sizes) and permutation test or meta-

genomeSeqs featureModel (large sample sizes).

Figure 5: The extremes of the best and worst performances of various beta diversity

metrics

The beta-diversity benchmark highlights the importance of

choosing the best metric by illustrating the best and the worst

choices for the samples (figur 5). The results are fairly striking

as it is clearly impossible to separate any groups under a poor

choice of metric. More systematically, we use various types of

normalization and transformations and largely conclude that

the impact of these is small. Choice of beta diversity metric,

however, is key, and we conclude that Unifrac [85]in its

weighted formulation seems to perform the best. We also see

the Shannon-Jensen divergence perform above average in some

situations. This metric has important roles in the literature as it

has been used to define clusters for various microbiome such as

the vaginal [91] and gut [92], known as enterotypes. Literature

has suggested that these enterotypes are not reproducible un-

der different choices of distance metric [93] and subsequent suggestions that they are not

robust [94]. Our results seem to elaborate on the situational specificity of the Shannon-

Jensen divergence.

With these results, the overall conclusion of the paper is that there is a wide array of

choices in conducting microbiome analyses, and the data characteristics matter. With this

in mind we make concret recommendations for conducting these studies.

Page 35: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Bioinformatics for discovery of microbiome variation 21

Manuscript 2

Horizontal gene transfer (HGT) is the lateral transfer of genetic elements between bacte-

ria, as opposed to inheritance of genetic features between ophav and progeny. Horizontal

gene transfer play a central role in bacterial adaptation to their environment, both on the

short time-scale of becoming antibiotic resistant in the event of an antibiotic exposure and

on a longer timescale of bacterial evolution [95].

HGT has traditionally been thought to proceed through several mechanisms known as

transformation, transduction and conjugation.

Transformation is the uptake of naked DNA from the environment, and is an important

mechanism for the uptake of plasmids. It is also widely exploited in biotechnological ap-

plications for cloning purposes since it was discovered that the competence, the rate of

DNA uptake, of cells can be greatly improved over their natural state by calcium chloride

[96]. Transduction is the transfer of DNA from viruses into bacterial cell. This route has

been demonstrated to be of great functional importance in particular in marine systems,

where phages have been demonstrated to function as resevoirs for functional genes that

greatly change the bacterial interaction with such key systems as the nitrogen cycle [97]

and the carbon cycle [98]. Finally, conjugation is transfer mediated by a complex gene

system for expressing pili, transport proteins, through cell-to-cell contact of bacteria. This

system, known as the Type IV secretion system, can transfer large islands of genomic

DNA. This mode of transfer has been well-characterized in many clinically relevant bacte-

ria such as Staphylococous areus that have been known to to have large pathogenicity is-

lands [99].

In this thesis HGT is primarily studied with a focus on the functionality that can be trans-

ferred without much concern for which particular mechanism was used. Manuscript X

describes the development of software for identifying horizontally transferred functionali-

Page 36: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

22 INTRODUCTION TO MANUSCRIPTS

ty that can be associated with an outcome. The justification for doing this exploratory type

of work in a field that is otherwise focused on reductionist identification of mechanisms of

transfer and functionality in well-defined systems is the observation that some highly

relevant areas of biology would not be relevant without horizontal gene transfer.

Antibiotic resistance is the most prominent case of this. Most instances of problematic

antibiotic resistance arise not from the evolution of new resistance genes, but from the

wide spread of existing ones. This wide spread has the quite obvious cause that without it

cells would succumb to antibiotics, and so there is strong selective pressure to aquire and

maintain antibiotic resistance. Other well-described cases of this are pathogenicity is-

lands, containing antibiotics to kill other bacteria competing for the same nutrients or iron

scavenging genes for improved fitness in a host where iron is often a growth limiting fac-

tor, such as the case of Prevotella copri in rheumatoid arthriritis [100]. These cases are all

characterized by having a very explicit effect on the host. The exploratory strategy out-

lined in manuscript 2 aims to identify other more subtle cases, essentially by modeling the

interaction between the effect of functionality and the location of this, possibly identifying

more subtle cases of microbial traits doing disproportionate harm to the host because of

mediation by horizontal gene transfer. The aim in a human microbiome context is to iden-

tify elements that have a positive impact on the bacterial fitness, while having a negative

effect on the host, analogously to the case of antibiotics resistance.

The implication of there being a significant horizontal transfer of dysbiosis-inducing traits

would be important to the way microbiomes are currently studied. The majority of stud-

ies currently rely on 16S rRNA gene sequencing. These cannot tell much about the func-

tion of the bacteria. Even with functional inference tools like PiCrust, only phylogenetical-

ly conserved functionality can be examined, missing horizontally transferred elements.

For that expensive metagenome sequencing would be required. Additionally, because

plasmids and phage often have a specific host range [101], horizontal gene transfer as the

main driver could confound a purely phylogenetic study and make it appear as If phy-

logeny were related to outcome when in fact the latent variable of mobile host range

would be driving the association, producing candidate hits that would not be reproduci-

ble in other populations.

Page 37: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Bioinformatics for discovery of microbiome variation 23

One such human condition could be the disease state Ulcerative Colitis (UC), in which the

gut of the patients are inflamed. This inflammation is a stress factor for the microbiome

that could use mobile genetic elements to aquire protective functionality, such as anti-

oxidants to counter reactive oxygen species used by both the innate and adaptive immune

system [102] . Indeed, UC is known to be characterized by an adaptive response using

Type-2 helper cells that has a variety of bactericidal effects . We used the MetaHIT human

microbiome catalogue [103] to investigate this hypothesis. According to the supplemen-

tary information all patients sampled were in clinical remission, but since UC is a relaps-

ing disease that will frequently flare up, we assume that there is some selective pressure

that can be regarded as constant because of the constant need to be able to react to in-

flammation. In the metadata generously uploaded by the MetaHIT consortium it is speci-

fied that there is 219 samples in this set, 92 controls and 127 UC patients. The contigs from

these samples form the basis of the predictions made by the SYNTENY tool.

The other dataset used is from the long-term copper contaminated soil also used in manu-

script 4,

We validate the tool statistically by several methods. One method is assessment of the

assumptions of the linear models, from which we conclude that although there are some

deviations they seem to be largely fulfilled. To quantify this we assess the False Positive

Rate under a variety of randomizations and conclude that although there is some chances

of false positives induced by the modelling, the alpha rate is controlled. As a further vali-

dation on our results we follow up candidates by checking if they have ever appeared on

a mobile genetic element in the databases. If not it would be a big surprise to identify it in

this manner, and we conclude that it is probably a false positive. Finally, we explore the

power properties of the modelling by simulating new data assuming a significant model,

while varying the number of observations. We somewhat arbitrarily set the power of in-

terest at 80%, as is common in randomized controlled trials [104], but the most interesting

observations is that we can still identify candidates in fairly low numbers of observations,

limiting the amount of sequencing needed to use the tool, and expanding the scope of

investigation to shallower sequenced experiments.

Page 38: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

24 INTRODUCTION TO MANUSCRIPTS

Finally, we identify a number of candidates and follow up on their plausibility. One such

candidate is thiamine synthesis. We discuss the feasibility of the candidate and demon-

strate that it is encountered in plasmid databases. In the discussion we argue for its possi-

ble role as an antioxidant, conferring a fitness advantage to the cell, and demonstrate that

previous literature has shown it has higher abundance in children with Crohns, and that

its relative abundance decreases when they are cured (Figure X). For the other candidates

as well we discuss their possible biological relevance in the two dimensions described:

How do they help bacterial fitness, how do relate to the environment, and for the case of

the human microbiome, could they be harming the host?

Figure 6: Exclusive Enteral Nutrition decreases the amount of thiamine synthesis (M00127) potential in childen [105]

Page 39: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Bioinformatics for discovery of microbiome variation 25

Manuscripts 3 & 4

Soil covers much terrestrial land and studying it is important for biogeochemical cycling,

agriculture and climate change. The soil microbiome harbors one of the most diverse bac-

terial communities of big importance. The soil microbiome is engaged in a complex inter-

play with other soil organisms such as fungi, earthworms, and protozoa, and with plant

roots in the so-called rhizosphere. In the rhizosphere soil microbes play both parasitic and

symbiotic roles where they complement plant roots in what was termed the “rhizosphere

effect” in the beginning of the 20 th century [106]. For microbiologists the soil microbiome

represents a wealth of micro-niches and food webs where bacteria interact. This complexi-

ty also drives soil diversity and evolution. The exact extent of the taxonomic diversity is

still a matter of debate. Gans et al, 2005 estimated 10^6 different species based on extrapo-

lation of DNA re-association experiments [107]. This estimate has been highly contested

both on its own merits [108] and by subsequent estimation by various sequencing experi-

ments [109-111] . It is one of the most diverse ecosystems [112] and this presents challeng-

es for sequencing based studies of these environments[113]. Additionally, soil is highly

heterogeneous in all dimensions so sampling strategies need to account for this by pool-

ing samples from plots so that the final result will represent an average of a plot, and

make stringent decisions about which exact layer of soil to sample. Even with these sam-

pling practices, sequencing coverage still fall short and rarefaction curves of soil organ-

isms rarely saturate. This gives problems for the reproducibility even within experiments.

Page 40: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

26 INTRODUCTION TO MANUSCRIPTS

Figur 7: (A) Sampling site location for manuscript 3 and 4. (B) The degree of copper pollution at the hotspot is so high

that is forms visible clots

Soil health is an important topic for agriculture and large disturbances such as climate

change and contamination, gives us the need to monitor and quantify soil health. Tradi-

tional bioindicators include denitrifing activity [114], that are very sensitive to soil dis-

turbance [115]. Methanotroph activity has also been recommend as a soil indicator. Se-

quencing-based bioindicators is a new option provided by the increase in sequencing

throughput, and mobile DNA sequencers such as the Oxford Nanopore hold the potential

to make this even more relevant. Efforts have been made to systematise the use of soil

bioindicators such as the French ADEME project[116] that offers a range of both sequenc-

ing, molecular and ecological bioindicators along with short instruction sheets on their

indication.

In this thesis the focus is on a particular soil that has been contaminated since 1924 by

anthropogenic activity, wood impregnation with CuSO4 (figure X). But even today it is

estimated that many thousands tons of copper in its sulfate form is released at various

sites in the world [117]. Copper has the potential to accumulate in soils reaching toxic lev-

els [118]. Cells need trace amount of copper to function but high levels have many nega-

tive effects on cells [119]. The mechanism highlighted in manuscript 2 is induced oxida-

tive stress, and we describe how microbes might share a mobile element relieving this in

the form of vitamin b6 synthesis. There are many other forms of microbial resistance and

tolerance measures, particularly such as efflux transporters, pumping out ions from the

cell. Indeed in manuscript 4 we observe an increase in relative abundance and diversity of

Page 41: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Bioinformatics for discovery of microbiome variation 27

copA transporters in both contaminated soils. The phylogenetic diversity in the hotspot

was also lower and thus this not an explanation for the increase in copA diversity.

In addition to all these direct effects on the cells of copper, there are also indirect effects.

The physic-chemical properties of the soil such as pH change which is known to have a

large impact on soil organisms [120, 121]. Additionally, the plant cover of the site also

disappears and with that also the roots. This information from manuscript 3, the signifi-

cant increase in carbon content of the contaminated soil observed, and the wider catabolic

diversity potential we observe in the metagenome leads us to conclude that a new ecolog-

ical niche opens up in the contaminated soil. This niche is well-suited for slow-growing,

wide substrate utilizing bacteria, which in the literature has been termed K-selected mi-

crobes [122].

Additionally, we observe an increased expression of phages in the lytic cycle. In a cross-

sectional study design such as this it is very hard to identify causality. Ecological theory

offers some possible hypotheses that we discuss the observations in context of. However,

these observations do contribute to the general field of soil health and can form the basis

for causal studies.

Page 42: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

28 CONCLUSION

Conclusion

The stated general objective of this thesis was to build and use tools to discover microbi-

ome variation, in a robust manner that generates new hypotheses. Two of the manu-

scripts draw upon the tool of simulation to assess the robustness of statistical techniques.

The conclusions from these assessments are that the tools generally used for hypothesis

testing based discovery in microbiome research must be approached with caution, and

that the sparsity and size of datasets influence the optimal tool choice. For the SYNTENY

tool developed in in this thesis we conclude that it has a statistical performance that is

almost as good as expected. The SYNTENY tool was used to generate several new hy-

potheses about the mobile genetic elements bacteria can use to adapt to their local envi-

ronment, and in particular oxidative stress seemed a common stressor whether bacteria

are being selected by copper contamination or the immune system. The Hygum soil pa-

pers proposed several hypotheses about the altered lifestyle of bacteria in copper con-

taminated soil based on observation from DNA and expression sequencing

Page 43: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Bioinformatics for discovery of microbiome variation 29

Page 44: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

30 LIST OF REFERENCES

List of references

1. Sanger, F., Nucleotide sequence of bacteriophage (D X174 DNA. 1977.

2. Fleischmann, R.D., et al., Whole-genome random sequencing and assembly of

Haemophilus influenzae Rd. Science, 1995. 269(5223): p. 496.

3. Venter, J.C., et al., The sequence of the human genome. science, 2001. 291(5507): p.

1304-1351.

4. Lander, E.S., et al., Initial sequencing and analysis of the human genome. Nature, 2001.

409(6822): p. 860-921.

5. Consortium, I.H.G.S., Finishing the euchromatic sequence of the human genome.

Nature, 2004. 431(7011): p. 931-945.

6. Gardner, R.C., et al., The complete nucleotide sequence of an infectious clone of

cauliflower mosaic virus by M13mp7 shotgun sequencing. Nucleic acids research, 1981.

9(12): p. 2871-2888.

7. Cock, P.J., et al., The Sanger FASTQ file format for sequences with quality scores, and the

Solexa/Illumina FASTQ variants. Nucleic acids research, 2010. 38(6): p. 1767-1771.

8. Errington, J., L-form bacteria, cell walls and the origins of life. Open biology, 2013. 3(1):

p. 120143.

9. Roose-Amsaleg, C., E. Garnier-Sillam, and M. Harry, Extraction and purification of

microbial DNA from soil and sediment samples. Applied Soil Ecology, 2001. 18(1): p.

47-60.

Page 45: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Bioinformatics for discovery of microbiome variation 31

10. Frostegård, Å., et al., Quantification of bias related to the extraction of DNA directly

from soils. Applied and environmental microbiology, 1999. 65(12): p. 5409-5420.

11. Gomes, P.C., et al., Selectivity sequence and competitive adsorption of heavy metals by

Brazilian soils. Soil Science Society of America Journal, 2001. 65(4): p. 1115-1121.

12. Liesack, W., H. Weyland, and E. Stackebrandt, Potential risks of gene amplification by

PCR as determined by 16S rDNA analysis of a mixed-culture of strict barophilic bacteria.

Microbial Ecology, 1991. 21(1): p. 191-198.

13. Wintzingerode, F.v., U.B. Göbel, and E. Stackebrandt, Determination of microbial

diversity in environmental samples: pitfalls of PCR-based rRNA analysis. FEMS

microbiology reviews, 1997. 21(3): p. 213-229.

14. Yuan, S., et al., Evaluation of methods for the extraction and purification of DNA from

the human microbiome. PloS one, 2012. 7(3): p. e33865.

15. Holmsgaard, P.N., et al., Bias in bacterial diversity as a result of Nycodenz extraction

from bulk soil. Soil Biology and Biochemistry, 2011. 43(10): p. 2152-2159.

16. Bertrand, H., et al., High molecular weight DNA recovery from soils prerequisite for

biotechnological metagenomic library construction. Journal of Microbiological

Methods, 2005. 62(1): p. 1-11.

17. Griffiths, R.I., et al., Rapid method for coextraction of DNA and RNA from natural

environments for analysis of ribosomal DNA-and rRNA-based microbial community

composition. Applied and environmental microbiology, 2000. 66(12): p. 5488-5491.

18. Bürgmann, H., et al., A strategy for optimizing quality and quantity of DNA extracted

from soil. Journal of microbiological methods, 2001. 45(1): p. 7-20.

19. Guo, F. and T. Zhang, Biases during DNA extraction of activated sludge samples

revealed by high throughput sequencing. Applied microbiology and biotechnology,

2013. 97(10): p. 4607-4616.

20. Wesolowska-Andersen, A., et al., Choice of bacterial DNA extraction method from fecal

material influences community structure as evaluated by metagenomic analysis.

Microbiome, 2014. 2(1): p. 19.

21. Neefs, J.-M., et al., Compilation of small ribosomal subunit RNA sequences. Nucleic

acids research, 1990. 18(Suppl): p. 2237.

Page 46: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

32 LIST OF REFERENCES

22. Nielsen, H.B., et al., Identification and assembly of genomes and genetic el ements in

complex metagenomic samples without using reference genomes. Nature biotechnology,

2014. 32(8): p. 822-828.

23. Hugenholtz, P. and T. Huber, Chimeric 16S rDNA sequences of diverse origin are

accumulating in the public databases. International journal of systematic and

evolutionary microbiology, 2003. 53(1): p. 289-293.

24. Salter, S.J., et al., Reagent and laboratory contamination can critically impact sequence-

based microbiome analyses. BMC biology, 2014. 12(1): p. 87.

25. Jervis-Bardy, J., et al., Deriving accurate microbiota profiles from human samples with

low bacterial content through post-sequencing processing of Illumina MiSeq data.

Microbiome, 2015. 3(1): p. 19.

26. Weiss, S., et al., Tracking down the sources of experimental contamination in microbiome

studies. Genome biology, 2014. 15(12): p. 564.

27. Consortium, H.M.P., A framework for human microbiome research. Nature, 2012.

486(7402): p. 215-221.

28. Kozich, J.J., et al., Development of a dual-index sequencing strategy and curation pipeline

for analyzing amplicon sequence data on the MiSeq Illumina sequencing platform.

Applied and environmental microbiology, 2013. 79(17): p. 5112-5120.

29. Brunius, C., L. Shi, and R. Landberg, Large-scale untargeted LC-MS metabolomics data

correction using between-batch feature alignment and cluster-based within-batch signal

intensity drift correction. Metabolomics, 2016. 12(11): p. 173.

30. Balle, C., Analysis of human microbiomes and a prospective model system - paving the

way for future health, in Section of Microbiology. 2014, University of copenhagen:

Denmark.

31. Caporaso, J.G., et al., QIIME allows analysis of high-throughput community sequencing

data. Nature methods, 2010. 7(5): p. 335-336.

32. Kang, C., et al., Relationship between genome similarity and DNA-DNA hybridization

among closely related bacteria. Journal of microbiology and biotechnology, 2007.

17(6): p. 945.

Page 47: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Bioinformatics for discovery of microbiome variation 33

33. Schloss, P.D. and S.L. Westcott, Assessing and improving methods used in operational

taxonomic unit-based approaches for 16S rRNA gene sequence analysis. Applied and

environmental microbiology, 2011. 77(10): p. 3219-3226.

34. Edgar, R.C., Search and clustering orders of magnitude faster than BLAST.

Bioinformatics, 2010. 26(19): p. 2460-2461.

35. Callahan, B.J., et al., DADA2: high-resolution sample inference from Illumina amplicon

data. Nature methods, 2016.

36. Edgar, R.C., UNOISE2: improved error-correction for Illumina 16S and ITS amplicon

sequencing. bioRxiv, 2016: p. 081257.

37. Tikhonov, M., R.W. Leach, and N.S. Wingreen, Interpreting 16S metagenomic data

without clustering to achieve sub-OTU resolution. The ISME journal, 2015. 9(1): p. 68-

80.

38. Schirmer, M., et al., Insight into biases and sequencing errors for amplicon sequencing

with the Illumina MiSeq platform. Nucleic acids research, 2015: p. gku1341.

39. Caspi, R., et al., The MetaCyc Database of metabolic pathways and enzymes and the

BioCyc collection of Pathway/Genome Databases. Nucleic acids research, 2008.

36(suppl 1): p. D623-D631.

40. Kanehisa, M. and S. Goto, KEGG: kyoto encyclopedia of genes and genomes. Nucleic

acids research, 2000. 28(1): p. 27-30.

41. Buchfink, B., C. Xie, and D.H. Huson, Fast and sensitive protein alignment using

DIAMOND. Nature methods, 2015. 12(1): p. 59-60.

42. Abubucker, S., et al., Metabolic reconstruction for metagenomic data and its application

to the human microbiome. PLoS Comput Biol, 2012. 8(6): p. e1002358.

43. Bengtsson‐Palme, J., et al., METAXA2: improved identification and taxonomic

classification of small and large subunit rRNA in metagenomic data. Molecular Ecology

Resources, 2015. 15(6): p. 1403-1414.

44. Segata, N., et al., Metagenomic microbial community profiling using unique clade-

specific marker genes. Nature methods, 2012. 9(8): p. 811-814.

45. Markowitz, V.M., et al., The integrated microbial genomes system: an expanding

comparative analysis resource. Nucleic acids research, 2009: p. gkp887.

Page 48: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

34 LIST OF REFERENCES

46. Brix, S., et al., Metagenomic heterogeneity explains dual immune effects of endotoxins.

Journal of Allergy and Clinical Immunology, 2015. 135(1): p. 277.

47. Myers, E.W., et al., A whole-genome assembly of Drosophila. Science, 2000. 287(5461):

p. 2196-2204.

48. Egholm, M., et al., Genome sequencing in open microfabricated high density picoliter

reactors. Nature, 2005. 437: p. 376-380.

49. Pevzner, P.A., H. Tang, and M.S. Waterman, An Eulerian path approach to DNA

fragment assembly. Proceedings of the National Academy of Sciences, 2001. 98(17):

p. 9748-9753.

50. Chaisson, M., P. Pevzner, and H. Tang, Fragment assembly with short reads.

Bioinformatics, 2004. 20(13): p. 2067-2074.

51. Zerbino, D.R. and E. Birney, Velvet: algorithms for de novo short read assembly using de

Bruijn graphs. Genome research, 2008. 18(5): p. 821-829.

52. Brown, C.T., et al., A reference-free algorithm for computational normalization of

shotgun sequencing data. arXiv preprint arXiv:1203.4802, 2012.

53. Alneberg, J., et al., CONCOCT: clustering contigs on coverage and composition. arXiv

preprint arXiv:1312.4038, 2013.

54. Nunes, I., et al., Coping with copper: legacy effect of copper on potential activity of soil

bacteria following a century of exposure. FEMS Microbiology Ecology, 2016. 92(11): p.

fiw175.

55. Klappenbach, J.A., et al., rrndb: the ribosomal RNA operon copy number database.

Nucleic acids research, 2001. 29(1): p. 181-184.

56. Blazewicz, S.J., et al., Evaluating rRNA as an indicator of microbial activity in

environmental communities: limitations and uses. The ISME journal, 2013. 7(11): p.

2061-2068.

57. Schaechter, M., O. Maaløe, and N.O. Kjeldgaard, Dependency on medium and

temperature of cell size and chemical composition during balanced growth of Salmonella

typhimurium. Microbiology, 1958. 19(3): p. 592-606.

58. Carini, P., et al., Relic DNA is abundant in soil and obscures estimates of soil microbial

diversity. bioRxiv, 2016: p. 043372.

Page 49: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Bioinformatics for discovery of microbiome variation 35

59. Inkinen, J., et al., Diversity of ribosomal 16S DNA‐and RNA‐based bacterial community

in an office building drinking water system. Journal of applied microbiology, 2016.

120(6): p. 1723-1738.

60. Urich, T., et al., Simultaneous assessment of soil microbial community structure and

function through analysis of the meta -transcriptome. PloS one, 2008. 3(6): p. e2527.

61. Grabherr, M.G., et al., Full-length transcriptome assembly from RNA-Seq data without a

reference genome. Nature biotechnology, 2011. 29(7): p. 644-652.

62. Birol, I., et al., De novo transcriptome assembly with ABySS. Bioinformatics, 2009.

25(21): p. 2872-2877.

63. Scott, M., et al., Interdependence of cell growth and gene expression: origins and

consequences. Science, 2010. 330(6007): p. 1099-1102.

64. Geisen, S., et al., Metatranscriptomic census of active protists in soils. The ISME

journal, 2015. 9(10): p. 2178-2190.

65. Kane, M.D., et al., Assessment of the sensitivity and specificity of oligonucleotide (50mer)

microarrays. Nucleic acids research, 2000. 28(22): p. 4552-4557.

66. Draghici, S., et al., Reliability and reproducibility issues in DNA microarray

measurements. TRENDS in Genetics, 2006. 22(2): p. 101-109.

67. Larsen, P.E., et al., Predicted Relative Metabolomic Turnover (PRMT): determining

metabolic turnover from a coastal marine metagenomic dataset. Microbial informatics

and experimentation, 2011. 1(1): p. 4.

68. Frias-Lopez, J., et al., Microbial community gene expression in ocean surface waters.

Proceedings of the National Academy of Sciences, 2008. 105(10): p. 3805-3810.

69. Gilbert, J.A. and M. Hughes, Gene expression profiling: metatranscriptomics. High-

Throughput Next Generation Sequencing: Methods and Applications, 2011: p. 195-

205.

70. Gifford, S.M., et al., Quantitative analysis of a deeply sequenced marine microbial

metatranscriptome. The ISME journal, 2011. 5(3): p. 461-472.

71. Poretsky, R.S., et al., Comparative day/night metatranscriptomic analysis of microbial

communities in the North Pacific subtropical gyre. Environmental microbiology, 2009.

11(6): p. 1358-1375.

Page 50: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

36 LIST OF REFERENCES

72. Thorsen, J., et al., Large-scale benchmarking reveals false discoveries and count

transformation sensitivity in 16S rRNA gene amplicon data analysis methods used in

microbiome studies. Microbiome, 2016. 4(1): p. 62.

73. Weiss, S., et al., Normalization and microbial differential abundance strategies depend

upon data characteristics. Microbiome, 2017. 5(1): p. 27.

74. Aitchison, J., The statistical analysis of compositional data. 1986.

75. Friedman, J. and E.J. Alm, Inferring correlation networks from genomic survey data.

PLoS Comput Biol, 2012. 8(9): p. e1002687.

76. Mandal, S., et al., Analysis of composition of microbiomes: a novel method for studying

microbial composition. Microbial ecology in health and disease, 2015. 26.

77. Friedman, J., T. Hastie, and R. Tibshirani, The elements of sta tistical learning. Vol. 1.

2001: Springer series in statistics Springer, Berlin.

78. Bishop, C., Pattern Recognition and Machine Learning (Information Science and

Statistics), 1st edn. 2006. corr. 2nd printing edn. 2007, Springer, New York.

79. Bretz, F., T. Hothorn, and P. Westfall, Multiple comparisons using R. 2016: CRC

Press.

80. Pearson, K., LIII. On lines and planes of closest fit to systems of points in space. The

London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science,

1901. 2(11): p. 559-572.

81. Kruskal, J.B., Multidimensional scaling by optimizing goodness of fit to a nonmetric

hypothesis. Psychometrika, 1964. 29(1): p. 1-27.

82. Legendre, P. and L.F. Legendre, Numerical ecology. Vol. 24. 2012: Elsevier.

83. Jaccard, P., The distribution of the flora in the alpine zone. New phytologist, 1912.

11(2): p. 37-50.

84. Beals, E.W., Bray-Curtis ordination: an effective strategy for analysis of multivariate

ecological data. Advances in Ecological Research, 1984. 14: p. 1-55.

85. Lozupone, C. and R. Knight, UniFrac: a new phylogenetic method for comparing

microbial communities. Applied and environmental microbiology, 2005. 71(12): p.

8228-8235.

Page 51: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Bioinformatics for discovery of microbiome variation 37

86. Chen, J., et al., Associating microbiome composition with environmental covariates using

generalized UniFrac distances. Bioinformatics, 2012. 28(16): p. 2106-2113.

87. Robinson, M.D. and A. Oshlack, A scaling normalization method for differential

expression analysis of RNA-seq data. Genome biology, 2010. 11(3): p. R25.

88. Paulson, J.N., et al., Differential abundance analysis for microbial marker-gene surveys.

Nature methods, 2013. 10(12): p. 1200-1202.

89. Hardcastle, T.J. and K.A. Kelly, baySeq: empirical Bayesian methods for identifying

differential expression in sequence count data. BMC bioinformatics, 2010. 11(1): p. 422.

90. Robinson, M.D., D.J. McCarthy, and G.K. Smyth, edgeR: a Bioconductor package for

differential expression analysis of digital gene expression data. Bioinformatics, 2010.

26(1): p. 139-140.

91. Ravel, J., et al., Vaginal microbiome of reproductive-age women. Proceedings of the

National Academy of Sciences, 2011. 108(Supplement 1): p. 4680-4687.

92. Arumugam, M., et al., Enterotypes of the human gut microbiome. nature, 2011.

473(7346): p. 174-180.

93. Koren, O., et al., A guide to enterotypes across the human body: meta -analysis of

microbial community structures in human microbiome datasets. PLoS Comput Biol,

2013. 9(1): p. e1002863.

94. Knights, D., et al., Rethinking “enterotypes”. Cell host & microbe, 2014. 16(4): p. 433-

437.

95. Boto, L., Horizontal gene transfer in evolution: facts and challenges. Proceedings of the

Royal Society of London B: Biological Sciences, 2010. 277(1683): p. 819-827.

96. Hanahan, D., Studies on transformation of Escherichia coli with plasmids. Journal of

molecular biology, 1983. 166(4): p. 557-580.

97. Cairns, J., et al., Evolving interactions between diazotrophic cyanobacterium and phage

mediate nitrogen release and host competitive ability. Royal Society Open Science, 2016.

3(12): p. 160839.

98. Puxty, R.J., et al., Viruses inhibit CO 2 fixation in the most abundant phototrophs on

Earth. Current Biology, 2016. 26(12): p. 1585-1589.

Page 52: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

38 LIST OF REFERENCES

99. Dearborn, A.D., et al., The Staphylococcus aureus pathogenicity island 1 protein gp6

functions as an internal scaffold during capsid size determination. Journal of molecular

biology, 2011. 412(4): p. 710-722.

100. Scher, J.U., et al., Expansion of intestinal Prevotella copri correlates with enhanced

susceptibility to arthritis. Elife, 2013. 2: p. e01202.

101. Pukall, R., H. Tschäpe, and K. Smalla, Monitoring the spread of broad host and narrow

host range plasmids in soil microcosms. FEMS microbiology ecology, 1996. 20(1): p.

53-66.

102. Yang, Y., et al., Reactive oxygen species in the immune system. International reviews of

immunology, 2013. 32(3): p. 249-270.

103. Qin, J., et al., A human gut microbial gene catalogue established by metagenomic

sequencing. nature, 2010. 464(7285): p. 59-65.

104. Röhrig, B., et al., Sample size calculation in clinical trials. Dtsch Arztebl Int, 2010.

107(31-32): p. 552-6.

105. Quince, C., et al., Extensive modulation of the fecal metagenome in children with Crohn’s

disease during exclusive enteral nutrition. The American journal of gastroenterology,

2015. 110(12): p. 1718-1729.

106. Hiltner, L.t., Über neuere Erfahrungen und Probleme auf dem Gebiete der

Bodenbakteriologie unter besonderer Berücksichtigung der Gründüngung und Brache.

Arbeiten der Deutschen Landwirtschaftlichen Gesellschaft, 1904. 98: p. 59-78.

107. Gans, J., M. Wolinsky, and J. Dunbar, Computational improvements reveal great

bacterial diversity and high metal toxicity in soil. Science, 2005. 309(5739): p. 1387-1390.

108. Bunge, J., S.S. Epstein, and D.G. Peterson, Comment on" Computational improvements

reveal great bacterial diversity and high metal toxicity in soil". Science, 2006. 313(5789):

p. 918-918.

109. Singh, B.K., et al., Loss of microbial diversity in soils is coincident with reductions in

some specialized functions. Environmental microbiology, 2014. 16(8): p. 2408-2420.

110. Roesch, L.F., et al., Pyrosequencing enumerates and contrasts soil microbial diversity.

The ISME journal, 2007. 1(4): p. 283-290.

Page 53: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Bioinformatics for discovery of microbiome variation 39

111. Knietsch, A., et al., Metagenomes of complex microbial consortia derived from different

soils as sources for novel genes conferring formation of carbonyls from short-chain polyols

on Escherichia coli. Journal of molecular microbiology and biotechnology, 2003. 5(1):

p. 46-56.

112. van Elsas, J.D., et al., Modern soil microbiology. 2006: CRc Press.

113. Howe, A.C., et al., Tackling soil diversity with the assembly of large, complex

metagenomes. Proceedings of the National Academy of Sciences, 2014. 111(13): p.

4904-4909.

114. Casella, S. and W.J. Payne, Potential of denitrifiers for soil environment protection.

FEMS microbiology letters, 1996. 140(1): p. 1-8.

115. Nielsen, M.N., et al., Microorganisms as indicators of soil health . 2002, National

Environmental Research Institute.

116. Pauget, B., et al. Soil bioindicators: how soil properties influence their responses and how

to select them in function of the site issues? in SETAC Europe 25th Annual Meeting.

2015.

117. Rusjan, D., Copper in horticulture. 2012: INTECH Open Access Publisher.

118. Gaetke, L.M. and C.K. Chow, Copper toxicity, oxidative stress, and antioxidant

nutrients. Toxicology, 2003. 189(1): p. 147-163.

119. Bruins, M.R., S. Kapil, and F.W. Oehme, Microbial resistance to metals in the

environment. Ecotoxicology and environmental safety, 2000. 45(3): p. 198-207.

120. Rousk, J., et al., Soil bacterial and fungal communities across a pH gradient in an arable

soil. The ISME journal, 2010. 4(10): p. 1340-1351.

121. Lauber, C.L., et al., Pyrosequencing-based assessment of soil pH as a predictor of soil

bacterial community structure at the continental scale. Applied and environmental

microbiology, 2009. 75(15): p. 5111-5120.

122. Fierer, N., M.A. Bradford, and R.B. Jackson, Toward an ecological classification of soil

bacteria. Ecology, 2007. 88(6): p. 1354-1364.

Page 54: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

METHODOLOGY Open Access

Large-scale benchmarking reveals falsediscoveries and count transformationsensitivity in 16S rRNA gene amplicon dataanalysis methods used in microbiomestudiesJonathan Thorsen1†, Asker Brejnrod2,3†, Martin Mortensen2, Morten A. Rasmussen1, Jakob Stokholm1,Waleed Abu Al-Soud2, Søren Sørensen2, Hans Bisgaard1* and Johannes Waage1*

Abstract

Background: There is an immense scientific interest in the human microbiome and its effects on human physiology,health, and disease. A common approach for examining bacterial communities is high-throughput sequencing of 16SrRNA gene hypervariable regions, aggregating sequence-similar amplicons into operational taxonomic units (OTUs).Strategies for detecting differential relative abundance of OTUs between sample conditions include classical statisticalapproaches as well as a plethora of newer methods, many borrowing from the related field of RNA-seq analysis. Thiseffort is complicated by unique data characteristics, including sparsity, sequencing depth variation, and nonconformityof read counts to theoretical distributions, which is often exacerbated by exploratory and/or unbalanced study designs.Here, we assess the robustness of available methods for (1) inference in differential relative abundance analysis and (2)beta-diversity-based sample separation, using a rigorous benchmarking framework based on large clinical 16Smicrobiome datasets from different sources.

Results: Running more than 380,000 full differential relative abundance tests on real datasets with permutedcase/control assignments and in silico-spiked OTUs, we identify large differences in method performance on arange of parameters, including false positive rates, sensitivity to sparsity and case/control balances, and spike-inretrieval rate. In large datasets, methods with the highest false positive rates also tend to have the best detection power.For beta-diversity-based sample separation, we show that library size normalization has very little effect and that thedistance metric is the most important factor in terms of separation power.

Conclusions: Our results, generalizable to datasets from different sequencing platforms, demonstrate how the choice ofmethod considerably affects analysis outcome. Here, we give recommendations for tools that exhibit low false positiverates, have good retrieval power across effect sizes and case/control proportions, and have low sparsity bias. Result outputfrom some commonly used methods should be interpreted with caution. We provide an easily extensible framework forbenchmarking of new methods and future microbiome datasets.

Keywords: 16S sequencing, Microbiome, Benchmark, Differential relative abundance, Beta-diversity

* Correspondence: [email protected]; [email protected]†Equal contributors1COPSAC, Copenhagen Prospective Studies on Asthma in Childhood, Herlevand Gentofte Hospital, University of Copenhagen, Copenhagen, DenmarkFull list of author information is available at the end of the article

© The Author(s). 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, andreproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link tothe Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Thorsen et al. Microbiome (2016) 4:62 DOI 10.1186/s40168-016-0208-8

Page 55: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

BackgroundTechnical advances in DNA sequencing have allowed forthe collection of high-dimensional biological data on anunprecedented scale. This development has ignited asurge of scientific opportunities and interest in thehuman microbiome and its effects on human physiology,health, and disease [1, 2]. A common approach tomicrobiome studies is the amplification of hypervariableregions of bacterial 16S rRNA genes from biologicalsamples, sequencing of amplicons in a high-throughputfashion, and grouping of sequences into operational taxo-nomic units (OTUs) [3–5] for downstream applications.A common statistical analysis of OTU data is differen-

tial relative abundance (DA) testing, a serial univariatetest of each OTU between two sample groups, e.g., phe-notypes, compartments, or time-points. This relativelysimple endeavor is complicated by certain characteristicsof the data, in particular three major points. First, theOTU count matrix is sparse, with often between 80 and95% of the counts being zero [6, 7]. Second, the librarysizes (sum of counts in each sample; also referred to assequencing depth) vary significantly, sometimes byseveral orders of magnitude, making it nonsensical tocompare counts directly between samples, since theyeach represent a different fraction of the composition ofa given sample. Third, as is well known from ecologicalliterature, the variances of these count distributions aregreater than their means, a phenomenon known as over-dispersion [8, 9]. In the RNA-seq field which is based onsimilar sequencing technology, explicit modeling of thismean-variance relationship has been attempted [10, 11].The aim of this work is to benchmark the many options

investigators face when analyzing 16S amplicon-basedsequencing data. Previous work with similar objectives hasfocused on the practice of rarefaction [12, 13], i.e., resam-pling reads within each sample to equal amounts to over-come the differences in sequencing depths. This workattempts three separate benchmarks of inference robust-ness, all based on real datasets generated from clinicalsamples, obtained from different compartments in thehuman microbiome, including the gut, hypopharynx, andvagina, covering a wide range of human ecological niches.First, we have quantified the false discovery rate of the

most popular differential relative abundance (DA) methodsby randomly assigning case/control status to samples, thuscreating an empirical null distribution, and testing eachOTU for differential relative abundance. Second, we havesimulated in silico spiking of known magnitudes andexamined how well these can be recovered. We have useda range of multiplicative and additive spike-in magnitudesapplied to OTUs from different relative abundance tertilesto explicitly control the range of OTUs to be recovered.Furthermore, as the microbiome field is currently in a statewhere many projects are exploratory and not explicitly

designed, we have examined the effect of the case/controlproportion. The included methods are well-establishedchoices for data analysis in many fields. The Welch two-sample t test is the default choice for comparing twosample means, while the Wilcoxon rank sum test is anonparametric alternative. Negative binomial generalizedlinear models (GLM) have long been a popular option inecology for modeling count data such as species observa-tion counts [14], by adding an additional parameter toaccount for the aforementioned overdispersion. From thefield of RNA-seq, which have faced many of the same dataanalysis challenges, we have included two widely usedpackages estimating counts parametrically, also utilizingthe negative binomial distribution, DEseq2 [15] and edgeR[16], as well as baySeq [17], using an empirical Bayesmethod for parameter estimation. From the field of micro-bial ecology, metagenomeSeq [7] has been designed withmicrobial marker surveys in mind, using a normalizationprocedure and a zero-inflated gaussian (ZIG) mixturemodel, designed to handle sequence depth issues andsparsity, as well as an alternative zero-inflated log-normalmodel with included parameter shrinkage (feature model)[18]. The ALDEx2 method has been developed withemphasis on the compositional nature of sequencing data,implementing Monte Carlo sampling of Dirichlet distribu-tions and averaging p values across resamples [19]. Inaddition, we have implemented a simple custom permuta-tion test, based on the null distribution of a test statistic

defined as log mean counts in casesmean counts in controls

! "2 obtained through ran-dom permutations of samples as cases/controls. Finally, wehave quantified the effect of normalization, transformation,and choice of distance measure on the beta-diversityseparation of samples with a known biological grouping.Multivariate analysis and choice of distance measure inparticular are currently being debated in microbial ecologyas claims of inherent clustering of vaginal [20] and gutmicrobiomes [2] have been made. The robustness of theseclaims has been shown to be sensitive to data analysischoices [21], visualization choice [22], and copy numberestimation procedures [23]. Ecology has a long traditionfor multivariate analysis of species tables, and many of thecurrently available tools have therefore been adapted fromthis field, such as the Bray-Curtis dissimilarity measure[24]. Microbial ecology has seen the development ofmeasures exploiting phylogenetic information in thesequencing reads. Here, we include the weighted andunweighted UniFrac [25, 26]. Additionally, we included theJensen-Shannon divergence, which plays a key role inenterotyping and clustering of vaginal microbiomes. TheEuclidean distance is known to be unsuited for ecologicaldistance measurements due to what has been termed the“double zero” problem, the fact that it is not possible to

Thorsen et al. Microbiome (2016) 4:62 Page 2 of 14

Page 56: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

distinguish if a species is absent from two samples due toundersampling [7, 27]. It has been included as a baseline,since many early papers explored microbiomes usingEuclidean-driven principal component analysis.

ResultsStudy design and data characteristicsThe study was divided into three parts (Fig. 1), namely(1) false positive rate (FPR) testing, (2) spike-in retrievaltesting, and (3) beta-diversity optimization. We usedseven large datasets: three for the FPR tests and spike-inretrieval tests (labeled A1-A3), one simulated set (labeledA4) for assessing the effect of spike-in independent sam-ple strata, and three for the beta-diversity optimizationtests (labeled B1–B3). The datasets and their characteristicsare presented in Additional file 1: Table S1. The datasetsare characterized by having high degrees of sparsity, largevariation in library sizes, and an overdispersed mean-variance relationship (Additional file 2: Figure S1).

False positive ratesWe found striking differences in the FPR of the testedmethods using identical permutations of the three largedatasets A1–A3 (Additional file 3: Figure S2A). A totalof 17,550 full DA tests were analyzed. Generally, manymethods were robust, with FPR close to or below 0.05,as expected under the null hypothesis. However, edgeR,metagenomeSeq ZIG (unfiltered, see below), and espe-cially baySeq displayed very high FPRs, indicating thatmodels did not fit well to the data. Intriguingly, baySeq,edgeR, and negative binomial GLMs performed worseunder balanced conditions, i.e., 50% cases and 50%controls, than under unbalanced conditions with only10% cases. Most methods had low variance of FPRacross iterations, but metagenomeSeq ZIG and especiallybaySeq showed considerable variation within parameter

sets. To ensure that observed differences in FPRsbetween balanced conditions were not due to inherentbiological signals or sample structures in the datasetsused, we repeated the analysis in an additional simulateddataset (A4, n = 5850), based on within-OTU countpermutation, retaining the biological distribution ofOTU count data but breaking within-sample characteris-tics. With the exception of baySeq, no major deviationswere observed from the results obtained with dataset A3(Additional file 3: Figure S2B).Next, we investigated the effect of OTU sparsity on

test inference (Fig. 2) and observed that the sparsity ofany given OTU had different effects when applying thedifferent methods, in the feces dataset A1. OTU-wise pvalues from non-spiked single DA runs with 50% cases,selected by the median FPR, depended to some extenton the percentage of zeroes in the OTU in question.The methods edgeR, negative binomial GLM, metagen-omeSeq ZIG (unfiltered), and especially baySeq displayedbiased results at high sparsity, meaning that many zeroeslead to lower p values, irrespective of any signal in thedata. This effect was to some extent ameliorated for meta-genomeSeq by its filtering step, which essentially removesmost of the sparse OTUs after the model has been esti-mated, as demonstrated in Fig. 2. The feature model didnot exhibit this inflation. DESeq2 in particular exhibited aconservative estimation of p values at high sparsity and ttests, and Wilcoxon and the permutation tests were allvery robust across the range. The ALDEx2 method wasvery conservative and showed a narrow band of p valuesresembling a Gaussian distribution around 0.5, regardlessof sparsity.

Spike-in retrieval testsA total of 175,500 spiked DA analyses from datasetsA1–A3 were considered. The spike-ins were performed

Fig. 1 Spike-in approach and analysis flowchart. Left, a theoretical sample where the count data for OTU A is multiplied by 3 before rescaling tooriginal sequencing depth. Right, flowchart of the simulation study, yielding FPR and AUC for each method, dataset, and set of variables, as wellas R2 values for all combinations of normalization, transformation, and distance in the beta-diversity optimization

Thorsen et al. Microbiome (2016) 4:62 Page 3 of 14

Page 57: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

by increasing the raw counts of random OTUs in cases,either multiplicatively or additively (see Fig. 1), by arange of spike-in magnitudes, in order to measure therelative retrieval performance between methods.Additional file 4: Figure S3A shows the area under the

curve (AUC) value distributions of the receiver operatingcharacteristics (ROC) curve when using p values todiscriminate between multiplicatively spiked and non-spiked OTUs in the feces dataset A1, for all themethods, case proportions, and magnitudes. We foundthat most methods improved detection power as thespike-in magnitudes increase, though both Wilcoxonand metagenomeSeq ZIG (filtered) did not exhibit thisproperty to the same extent as the others. The multi-plicative spike-in AUC distributions for the other twodatasets A2 and A3 (Additional file 4: Figure S3B, S3C)showed very similar characteristics. Overall, the bestperformance in terms of AUC was exhibited by thesequencing-specific methods edgeR, metagenomeSeqZIG, and baySeq, as well as the assumption-free permu-tation test. The mid-level in performance was representedby negative binomial GLM, DESeq2, and ALDEx2, whereast tests and Wilcoxon performed the worst. The robustnessof these methods varied greatly, with some tests yieldingAUC values below 0.5, in case of the t tests and Wilcoxoneven with a median value below 0.5 in the unbalanced tests

with case proportions at 10%. The most robust test was thepermutation test and metagenomeSeq feature model, whichonly very rarely fell below 0.5 in AUC.We repeated these analyses with additive spiking

(Additional file 5: Figure S4A, S4B, S4C), yielding verysimilar results, albeit with lower variance in the distribu-tions. We found that the methods exhibited the samehierarchy of performance across the three datasets as inthe multiplicative spike-in tests. The best performingmodels were saturated already at magnitude 10, render-ing magnitude of change 20 unnecessary.Finally, we considered a mixed spike-in setup (Add-

itional file 6: Figure S5). In this setup, methods were notas clearly separated in AUC values, although the samegeneral hierarchy was retrieved. The highest AUC valueswere found in the methods edgeR, metagenomeSeq ZIG,negative binomial GLMs, and the permutation test.Figure 3 shows the median AUC values vs the median

FPRs, illustrating the overall performance of the variousmethods in the three datasets, at the highest magnitude(20) with multiplicative spiking.

Subsets and simulated dataWe repeated the FPR (n = 11,700) and spike-in tests (n= 117,000) to examine the relative performance of thevarious methods in small- and medium-sized datasets by

FPR = 0.417 FPR = 0.206 FPR = 0.019 FPR = 0.501 FPR = 0.039 FPR = 0.314 FPR = 0.133

FPR = 0.015 FPR = 0.025 FPR = 0.033 FPR = 0.028 FPR = 0 FPR = 0.001 FPR = 0.052

metagenomeSeq metagenomeSeq filtered mgS featureModel baySeq DESeq2 edgeR Negative binomial GLM

t−test Log t−test Wilcoxon Permutation ALDEx2 t−test ALDEx2 Wilcoxon Random uniform

0.98

0.79

0.1

1e−10

1e−100

0.98

0.79

0.1

1e−10

1e−100

0 87.1 93.3 97.2 100 0 87.1 93.3 97.2 100 0 87.1 93.3 97.2 100 0 87.1 93.3 97.2 100 0 87.1 93.3 97.2 100 0 87.1 93.3 97.2 100 0 87.1 93.3 97.2 100Percent zeroes

p va

lue,

una

djus

ted

Fig. 2 OTU sparsity vs. p value. Scatterplots of OTU sparsity vs p value with panels representing each differential relative abundance test methodin feces dataset A1, with 50% cases. Colored line represents the LOESS regression on data. False positive rate (FPR) is defined as the fraction ofOTUs with p< 0.05. Each differential relative abundance test represents the median FPR for that method, out of all 150 permutations. Contour lines indicatepoint density and can be compared to a hypothetical null distribution of p values demonstrated in the final panel (“Random uniform”)

Thorsen et al. Microbiome (2016) 4:62 Page 4 of 14

Page 58: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

subsetting datasets A1–A3, yielding datasets of lowersparsity (A1s–A3s and A1m–A3m) (Additional file 1:Table S1). We found that the performance hierarchy wasvery similar across all six subsets with regard to bothFPR and spike-in AUC (Fig. 4). There were some devia-tions from the larger sets, especially metagenomeSeqZIG and the permutation test performed worse in thesubsets, whereas edgeR kept a high AUC but with muchlower FPR. The distributions of FPR and AUC under allparameter combinations can be seen in Additional file 7:Figure S6A–D.In the simulated dataset A4, the AUC results (n =

58,500, Additional file 8: Figure S7A–C) were nearlyidentical to those from dataset A3, on which it wasbased, except for baySeq, which performed worse indataset A4, but with very high variability.

Spike-in retrieval sensitivity to sparsityWe examined the effect of OTU sparsity on the ability ofthe various methods to retrieve spiked OTUs, expressedas the p value quantile of a spiked OTU within all p valuesfrom a dataset, for each method, which should ideally beas low as possible. Additional file 9: Figure S8A–C shows

that almost all methods had better detection power at lowsparsity (many positive samples), but the patterns of satur-ation were quite different. Additional file 9: Figure S8Ashows that most methods were primarily dependent onthe number of positive samples, with the lines from thedifferently sized datasets following each other closely.Notably, Wilcoxon tests were negatively influenced byzero-inflation, meaning that the detection power wasdecreased with dataset size for the same number of posi-tive samples. For the smaller datasets, metagenomeSeqZIG and baySeq did not show better performance withmore positive samples. The filtered metagenomeSeq ZIGand feature model had several OTUs with quantiles of 1,since p values were filtered or not computed and thereforeset to a value of 1. Additional file 9: Figure S8B showssensitivity to case proportion, where almost all methodshad better performance across the range with morebalanced group sizes, though the effect was greatest formetagenomeSeq feature model and the t test. Finally, inAdditional file 9: Figure S8C, we see that most methodssaturate faster with high spike-in magnitude. Notably, theWilcoxon test gains little from increased magnitudes.Overall, the metagenomeSeq feature model seemed to

0.05

0.1

0.2

0.3

0.4

0.5

0.05

0.1

0.2

0.3

0.4

0.50.6

0.05

0.1

0.2

0.3

0.4

0.50.60.70.8

A1 − F

aeces622 sam

ples, 2153 OT

Us

A2 − V

aginal670 sam

ples, 898 OT

Us

A3 − A

irways

144 samples, 588 O

TU

s

0.4 0.5 0.6 0.7 0.8 0.9Median Area Under the Curve (AUC)

Med

ian

Fal

se P

ositi

ve R

ate

(FP

R)

MethodmetagenomeSeq

metagenomeSeq, filtered

mgs featureModel

baySeq

DESeq2

edgeR

Negative Binomial GLM

t−test

Log t−test

Wilcoxon

Permutation

ALDEx2 t−test

ALDEx2 Wilcoxon

Case proportion (%)10

25

50

Fig. 3 Median test AUC vs median test FPR. Scatterplots of median test area under the curve (AUC) vs median test false positive rate (FPR), forthe datasets A1–A3; three different compartments in the COPSAC2010 cohort. FPR is defined as the fraction of OTUs with p < 0.05. Dot colorrepresents differential relative abundance test method, while dot shape represents experiment case/control balance

Thorsen et al. Microbiome (2016) 4:62 Page 5 of 14

Page 59: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

saturate at the lowest number of positive samples foracross these figures, although it required at least two posi-tive samples in each group to compute a p value, mostevident in Additional file 9: Figure S8B.

Beta-diversity optimizationWe studied the effects of library size normalization,count transformation, and distance metric on the abilityto separate biologically relevant groups in beta-diversityanalyses (Fig. 5). All analyses were significant with p <0.001, but with large differences in R2 values. In thefeces dataset B1, the optimal separation was found inlog transformed counts using 10−5 as pseudocountwith weighted UniFrac, yielding an Adonis R2 value of0.367. In the HMP dataset B2, the optimal was non-transformed weighted UniFrac, with an R2 value of0.166. In the feces dataset B3, TMM normalizationwas not possible due to high sparsity, and this methodwas omitted from the analysis. The optimal separationwas found in log transformed counts using 10−5 aspseudocount with weighted UniFrac, yielding an R2

value of 0.145. As a sensitivity analysis to includeTMM, we agglomerated closely related OTUs, thereby

reducing the number of OTUs and the sparsity, whichdid not materially change the results, see Additionalfile 10: Figure S9.Across all three datasets, several characteristics were

very similar. The most important factor was the choiceof distance metric, with the weighted UniFrac metricscoring the highest in terms of separation power in allthree datasets. The transformation applied to (normal-ized) counts was of less importance. In the feces datasetB1, log transformations with very small pseudocountswere best, whereas the untransformed counts were opti-mal in the HMP oral dataset B2. In both cases, the effectswere large. The effect of library size normalization was byfar the lowest, especially between no normalization, CSS,DESeq2, and TMM normalization, which essentially didnot matter in terms of separation in any of the three data-sets. TSS was the normalization type with the highestimpact, but it mostly changed the optimal transformationchoice, rather than improve or deteriorate the separationpower as such. These effects were very apparent whencomparing the optimal combinations of normalization,transformation, and distance with the poorest for eachdataset, as described above, visualized with principal

A1s Faeces8+8 samples, 703 OTUs

A2s Vaginal8+8 samples, 213 OTUs

A3s Airways8+8 samples, 223 OTUs

A1m Faeces25+25 samples, 977 OTUs

A2m Vaginal25+25 samples, 430 OTUs

A3m Airways25+25 samples, 418 OTUs

0.05

0.1

0.2

0.3

0.4

0.5

0.6

0.05

0.1

0.2

0.3

0.4

0.5

0.05

0.1

0.2

0.3

0.4

0.5

0.6

0.05

0.1

0.2

0.3

0.4

0.5

0.6

0.05

0.1

0.2

0.3

0.4

0.5

0.05

0.1

0.2

0.3

0.4

0.5

0.5 0.6 0.7 0.8 0.5 0.6 0.7 0.5 0.6 0.7

0.5 0.6 0.7 0.8 0.5 0.6 0.7 0.8 0.5 0.6 0.7 0.8Median Area Under the Curve (AUC)

Med

ian

Fals

e P

ositi

ve R

ate

(FP

R)

MethodmetagenomeSeq

metagenomeSeq, filtered

metagenomeSeq featureModel

baySeq

DESeq2

edgeR

Negative Binomial GLM

t test

Log t test

Wilcoxon

Permutation

ALDEx2 t test

ALDEx2 Wilcoxon

Fig. 4 Median test AUC vs median test FPR, small and medium subsets. Scatterplots of median test area under the curve (AUC) vs median testfalse positive rate (FPR), for the datasets A1s–A3s and A1m–A3m; three different compartments in the COPSAC2010 cohort, with 50% case/controlproportion. FPR is defined as the fraction of OTUs with p < 0.05. Dot color represents differential relative abundance test method

Thorsen et al. Microbiome (2016) 4:62 Page 6 of 14

Page 60: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

coordinates analysis (PCoA) plots in panels B and C ofeach dataset subplot.

DiscussionWe conducted extensive benchmarking of the most popu-lar available methods for differential relative abundance

testing of large microbiome datasets. The main character-istics of each method in terms of FPR, AUC, balancesensitivity, and computational burden are summarized inAdditional file 11: Table S3. We found that severalmethods, including edgeR, metagenomeSeq ZIG, and bay-Seq, had high false positive rates when testing randomly

No normalization TSS CSS DESeq2 TMM No effect of normalization

G

C

E

N/A N/A

B

D

F

0.0

0.1

0.2

0.3

0.00

0.05

0.10

0.15

0.00

0.05

0.10

0.15

B1 − F

aeces (3 groups)1,082 sam

ples, 2,431 OT

Us

B2 − P

alate / Tongue (2 groups)

617 samples, 19,507 O

TU

sB

3 − Faeces (5 groups)

992 samples, 22,406 O

TU

s

Bray−CurtisEuclidean

Bray−CurtisEuclidean

Bray−CurtisEuclidean

Bray−CurtisEuclidean

Bray−CurtisEuclidean

Jensen−Shannon DivergenceUniFrac

Weighted UniFrac

Distance

R2

Transformation None Log(x+1) Log(x+0.01) Log(x+0.0001) Sqrt(x) Cubrt(x)

Axis.1 [46.5%]

Axi

s.2

[13

%]

Weighted UniFrac, Log(x+0.0001)

Axis.1 [46.8%]

Axi

s.2

[23

.8%

]

TMM, Euclidean

Axis.1 [41%]

Axi

s.2

[22

.8%

]

Weighted UniFrac

Axis.1 [5.1%]

Axi

s.2

[3.

7%]

TMM, Euclidean, Log(x + 0.0001)

Axis.1 [48.1%]

Axi

s.2

[13

.1%

]

Weighted UniFrac, Log(x+0.0001)

Axis.1 [23.6%]

Axi

s.2

[14

.7%

]

CSS, Euclidean

a b

c

d

e

f

g

Fig. 5 (See legend on next page.)

Thorsen et al. Microbiome (2016) 4:62 Page 7 of 14

Page 61: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

permuted data, often grossly overestimating the OTU-wise differences between two groups, which indicates thatassumptions made by the models were not met by thedata. Intriguingly, the methods with the highest FPR alsohad the highest AUC values for recovering our spikedOTUs. Thus, the p values obtained from these methodsare very well suited to distinguish differentially abundantOTUs from non-differentially abundant OTUs, but arenot meaningful in relation to normal thresholds for sig-nificance (i.e., 5%), instead representing an arbitrary classi-fier value. This problem could potentially be solved bysetting a more restrictive significance threshold, althoughthe value of this threshold would need to be set empiric-ally for each dataset, for instance through a permutationsimilar to the setup of this study. Even then, for some ofthe methods, especially metagenomeSeq ZIG and baySeq,we found that p values varied greatly with the sparsity of agiven OTU, meaning that this empirical cutoff valueshould not be the same in all OTUs but, to some extent,depend on the sparsity of that OTU to accurately reflectthe null distribution. It has to be noted that baySeq wasrun with the default negative binomial prior distribution,but allows the user to define a custom parameterizationfor the prior, which could improve the performance ofbaySeq in this regard. We have not explicitly addressedpre-inference filtering, which is a common practice toreduce the strain of correction for multiple inference.However, we have examined the effects of metagenome-Seq ZIG’s recommended filtering step. We found that thisfiltering removed the most sparse/rare OTUs, which ame-liorates the abovementioned dependence of p values onsparsity. However, it is a very conservative filtering, whichcould also be applied to any of the other methods, anddoes not fix the underlying problems with the fit of thestatistical model. Indeed, many rare OTUs could be trulydifferentially abundant in many types of studies. We haveanalyzed crude p values across all methods, and not expli-citly corrected p values for multiple inference, leaving theexpected null FPR at 0.05. This correction is a necessarystep in most situations and is often done by controllingthe false discovery rate using the Benjamini-Hochbergapproach [28] or the familywise error rate using, e.g., theBonferroni correction [29]. However, this step is inde-pendent of model choice and should be applied regardlessof which method is used to obtain p values, which makesit irrelevant in our study setup.

The inclusion of a permutation test is not meant as arecommendation or novel method, but it proves aninteresting comparison as it is simple, extremely robust,and has good detection power, such that it far outper-forms the other simple methods—t test and Wilcoxon. Itshould be noted that the many ties in sparse data maydisproportionately limit the maximum statistical powerof rank-based tests such as the Wilcoxon test (illustratedin Additional file 12: Figure S10), which was also evidentin Additional file 9: Figure S8A–C.Furthermore, under some circumstances, the t test

produced AUC values that were below 0.5, i.e., worsethan random performance. As can be seen from thecontour lines in Fig. 2, this occurred due to the t testproducing too low p values at extreme levels of sparsity,where only one sample was positive, which overpoweredthe p value decrease from the weaker spike-in magni-tudes as these were selected to represent low, medium,and high levels of sparsity. Naturally, this phenomenoncan be attributed to unmet distributional assumptions inthe data.The AUC statistic is usually employed as a measure of

separation, e.g., how well does a certain biomarkerdistinguish between healthy and sick. However, it canalso be used as a scale-independent enrichment statistic,as in the present study. Importantly, at low spike-inmagnitudes, AUC values should not be expected to beclose to 1 but should rather be used to compare powerbetween different methods.We repeated the analyses in small- and medium-sized

datasets, since many researchers opt for smaller, balanceddesigns when testing specific experimental hypotheses.These results showed that some methods performedworse (permutation test, metagenomeSeq ZIG), whileothers improved (edgeR) when compared to the resultsfrom the large datasets. This phenomenon may be linkedto the decreased sparsity of these smaller sets, as describedin Additional file 1: Table S1, due to a lower amount ofrare taxa compared to common taxa, in addition to thedifferences in statistical power of the methods given lowsample sizes, which may limit real-world applications.Previously, the optimal library size normalization for

beta-diversity measures have been thoroughly discussed,and count transformations have been recognized as im-portant approaches for optimally separating biologicallymeaningful groups [7, 13, 30–32]. Our results highlight

(See figure on previous page.)Fig. 5 The effect of normalization, transformation, and distance metric on beta-diversity separation. a Datasets B1–B3 (vertical panels) with all com-binations of library size normalizations (horizontal panels) and count transformations (color) applied prior to the calculation of distances and use ofAdonis permanova model. Effect of design variable in question (B1—age; B2—tongue versus palate; B3—age group) measured as model R2 value.The highest and lowest R2 values (yielding best and worst separation, respectively) are demonstrated in subplots b–g for each dataset as principalcoordinate analysis plots, colored by design variable, with overlaid prediction ellipses (B1—subplot (b, c); B2—subplot (d, e); B3—subplot (f, g)).CSS cumulative sum scaling, TMM trimmed mean of M values, TSS total sum scaling

Thorsen et al. Microbiome (2016) 4:62 Page 8 of 14

Page 62: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

the importance of carefully considering which normali-zations, count transformations, and distance metricsshould be applied to identify the best separation in thebeta-diversity space. In particular, the relative impact ofthese three factors has been elucidated. Generally, wefound that library size normalization is the least importantof the three. Especially the difference between nonormalization, CSS, DESeq2, and TMM normalizationswas negligible in all three datasets. TSS normalization wasthe most different from the others, but mostly changedthe optimal transformation choice rather than improve ordeteriorate the separation, which highlights the import-ance of choosing an appropriate pseudocount relative tothe scale of the data when using log transformations.Count transformations did not necessarily improve separ-ation but was very impactful in all cases. The effect of alog transform is down-weighting of high-abundant taxaand up-weighting of low-abundant taxa, which is a pivotalconsideration in terms of expected abundance levels oftaxonomic differences between groups in a study. Themost important factor was the choice of distance metric.In all of our examples, the best separation was found withthe weighted UniFrac metric. However, this study was notdesigned to infer which distance metric is best, as this willdepend on the data, but rather the relative importance ofthese three factors. It should also be noted that R2 valuesfor presence/absence distance metrics such as unweightedUniFrac may be inherently limited by sparsity and raretaxa. Additionally, the problems with increasing sparsityin large datasets observed in the taxon-wise DA testsshould not affect beta-diversity tests, since pairwisesample-wise distances do not depend on taxa absent fromboth samples.It is a strength of the study that we, through a large

amount of computations, have generated results fromcombinations of parameters of relevance to the field,namely statistical method, normalization method, case/control ratio, sample size, and spike-in magnitude. It isalso a strength of this study that our analyses areconducted on large and biologically diverse humanmicrobiome data. Many large-scale microbiome studiesare being conducted and planned presently, with diversehuman ecological niches represented. Thus, it is import-ant to survey different body sites, since the uniquelydifferent microbial compositions present may influencethe distributional characteristics of the resulting datasets.While the results presented here derive from biologicaldata, our results from the spike-in analyses rely on insilico spike-ins, rather than actual biological signals orwet-lab spike-ins. This approach is both a strength and alimitation, in that it allows very precise manipulationand complete control throughout the experiment, aswell as the opportunity for nearly limitless repetition toexamine well-resolved distributions of the parameters of

interest. Conversely, it does not represent an actual bio-logical signal, and manipulating the data may skew certaindistributional qualities present, such as the inherent countratios between OTUs. Though not feasible in this study, fu-ture studies could conduct wet-lab spike-ins to track andcompare detection power between packages. However,great care must be taken to control the relative concentra-tions of original content vs. spike-in content, especially inthe case of rare OTUs. Hence, this approach also posesmany potential issues that may lead to skewed data notaccurately reflecting true biological changes.In previous studies, the best way to account for variation

in library sizes has been discussed with a primary focus onthe procedure of rarefying counts that is random within-sample resampling of counts without replacement to aneven sequencing depth across all samples [12]. This preva-lent approach has been criticized due to discarding of validdata, but others argue that it can be the optimal method insome situations, as uneven library sizes disproportionatelyaffects unweighted distance measures and presence/absence analyses [13]. Since the topic of rarefaction alreadyhas been debated in detail, and is currently regardedunfavorably, we chose not to include it in this study.

ConclusionThis study represents an independent attempt to bench-mark various methods for differential relative abundanceanalysis of count-based microbiome datasets, using realbiological large-scale datasets. The results presented herewarrant an increased awareness of the potential for spuri-ous findings in differential relative abundance analyses. 16Sdata poses problems to both parametric and nonparametricstatistical models, and new methods should explicitlyaccount for sparsity, which is increased in large datasets.Considering the results presented here as a whole, werecommend researchers choose tools for detecting DA thatexhibit low false positive rates, that have good retrievalpower across effect sizes and case/control proportions, andthat are not biased for these parameters at differing levelsof (high) sparsity: metagenomeSeq feature model and thebasic permutation test both fulfill these criteria for largeand small datasets, and edgeR for small datasets. Whenexploring beta diversity of microbiome data, analysts shouldcarefully consider their choice of count transformation anddistance metric, the latter having the largest impact onresults. We have provided all source code and source datanecessary to reproduce the results presented in this study,including random seeds for random processes. This willallow other investigators to verify and expand upon ourresults and aid in selecting the optimal analysis methodsgiven the unique characteristics of their own data. Thecomparisons can easily be extended to analysis methodsnot covered in this paper, ensuring that computation time,rather than coding time, should be the main limiting factor.

Thorsen et al. Microbiome (2016) 4:62 Page 9 of 14

Page 63: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

MethodsSample collection and preparationFor dataset A1, A2, A3, and B1, the primary samplematerials were collected from the COpenhagen Pro-spective Studies on Asthma in Childhood 2010 (COP-SAC2010) mother-child cohort, following 700 childrenand their families from pregnancy into childhood, aspreviously described in detail [33]. In this study, we usedfecal samples collected at ages 1 week (n = 95), 1 month(n = 361), and 1 year (n = 622); vaginal swabs collected atweek 36 of pregnancy (n = 670); and hypopharyngealaspirates (n = 144) collected at acute wheezy episodes inchildren with persistent wheeze aged 1–3 years, using asoft suction catheter passed through the nose. DNA wasextracted using MoBio PowerSoil kits on an EpMotion5075, amplified using a two-step PCR reaction with for-ward and reverse 16S V4 primers, and sequenced using250bp paired-end sequencing on an Illumina MiSeq. Afull description of the laboratory workflow and the bio-informatics pipeline is available in the Additional file 13.To examine effects in smaller datasets, we subset data-

sets A1, A2, and A3 into 16 (small) and 50 samples(medium) by random sampling with recorded randomseeds, resulting in datasets A1s–A3s and A1m–A3m.Additionally, we created a simulated dataset A4 by inde-pendent resampling of all OTUs across samples, withoutreplacement, of dataset A3.Additionally, for dataset B2, we used public data from the

Human Microbiome Project [34], testing separation abilitybetween the tongue dorsum (n = 316) and hard palate (n =301) 16S V3-5 samples (http://hmpdacc.org/HMQCP/).For dataset B3, we used data from Pop et al. [35],downloaded from Bioconductor (http://bioconductor.org/packages/release/data/experiment/html/msd16s.html), test-ing separation between age groups 0–6 months (n = 112),6–12 months (n = 308), 12–18 months (n = 173), 18–24 months (n = 146), and 24–60 months (n = 253).To reduce sparsity of dataset B3, chimeras were

rechecked using USEARCH v7.0.1090 [36] against thegold database [37], and 3624 chimeras (listed inAdditional file 14: Table S2) were removed from theOTU table. Since a phylogenetic tree file was notpublished along with the OTU table and samplemetadata from this paper, we built one using thesupplied reference sequences as described in the “Bio-informatics” section of the Additional file 13. Due toissues with TMM normalization of this dataset (seethe “Results” section), we agglomerated similar OTUsto reduce the sparsity as a sensitivity analysis. Thiswas achieved by computing pairwise phylogeneticdistances using the tree and grouping together allOTUs who were closer to each other than the 0.001quantile of the distance distribution, see Additionalfile 1: Table S1. The OTUs were merged with the

merge_taxa function in the R package phyloseq [38],using the OTU with the highest sum of counts asarchetype.

Differential relative abundance testingWe selected several widely used methods for differentialrelative abundance testing to apply on the datasets, usingbuilt-in library size normalization, default parameters,and testing as applicable for each method. All tests wereconducted using the statistical software package R [39]and parallelized using custom bash scripts with GNUparallel [40]. The source code for all testing proceduresis available in an online repository. The selectedmethods and associated transformation steps were asfollows:

! metagenomeSeq ZIG [7]: using raw counts,cumulative sum scaling (CSS) was applied withthe quantile supplied by cumNormStat. Testingwas done using fitZig.

! metagenomeSeq ZIG, filtered: as above, butdiscarding all OTUs below median effective samplesize (supplied by the fitZig model), as recommendedby the authors in the package vignette.

! mgS featureModel: as with metagenomeSeq, butusing fitFeatureModel to test, rather than fitZig.

! baySeq [17]: using raw counts, two models weredefined; no changes and cases vs controls. Librarysizes were supplied to the object. Priors wereestimated with the negative binomial distributionbefore estimating likelihoods for cases vs controls.P values were defined as 1-likelihood.

! DESeq2 [15]: using raw counts, geometric meanswere calculated manually and supplied to theestimateSizeFactors function. Standard testing wasinvoked with DESeq.

! edgeR [16]: using raw counts, normalization factorswere calculated with trimmed mean of M values(TMM), common and tagwise dispersions wereestimated, and testing was done with exactTest.

! Negative binomial generalized linear model (GLM):using raw counts, a model was fitted to each OTUwith glm.nb from the MASS package [41], using log(library size) as offset and cases vs controls as theonly dependent variable.

! t test: counts were transformed to relativeabundances, before applying the R built-in t.testfunction using default parameters, including theWelch/Satterthwaite approximation to degrees offreedom due to possibly unequal variances.

! Log t test: as above, but counts were logtransformed first, with a pseudocount of 1.

! Wilcoxon: counts were transformed to relativeabundances, before applying the R built-in

Thorsen et al. Microbiome (2016) 4:62 Page 10 of 14

Page 64: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Mann-Whitney/Wilcoxon rank sum test withwilcox.test using default parameters.

! Permutation: a simple custom permutation test waswritten and applied to counts normalized to relativeabundances. First, a test statistic was computed as log

mean counts in casesmean counts in controls

! "2. Next, 104 permutations of thistest statistic was calculated by random resampling ofall cases and controls without replacement toempirically estimate the null distribution of eachOTU. The p value associated with an OTU wascalculated as the proportion of permuted test statisticsequal to or greater than the real test statistic.

! ALDEx2 [19]: raw counts were supplied to the aldexfunction, using 32 Monte Carlo samples, and both theWelch t test (we.ep) and Wilcoxon (wi.ep) p valueswere used.

If a method did not return a p value for any givenOTU, it was set to 1.

False positive ratesUnannotated OTU tables were tested for FPR by randomlyselecting samples as cases or controls, thus assuring thenull hypothesis, with varying case proportions of 10, 25, or50%, and subsequently applying all the abovementionedDA methods. This was repeated 150 times using uniquerecorded random seeds, identical between methods. A falsepositive was defined as an OTU with a crude p value below0.05. The FPR was defined as number of false positiveOTUs/total number of OTUs.

Spike-in retrieval performanceUnannotated OTU tables were randomized as describedabove. Then, five random OTUs from each relative abun-dance tertile were modified (“spiked”) with a given magni-tude, only in cases, to induce a signal in the data. OnlyOTUs present in at least one of the assigned case sampleswere eligible. Spiking was done either by multiplyingcounts by a given magnitude (multiplicative), multiplyingby a range of magnitudes (mixed multiplicative), or addingthe mean proportion of nonzero counts multiplied by amagnitude to all non-zero counts (additive). After spiking,samples were rescaled to original sequencing depth. Thiswas repeated for the magnitudes 0.5, 2, 5, 10, and 20 formultiplicative and 0.5, 2, 5, and 10 for additive, with thecase/control proportions 10, 25, and 50%. All DA methodswere applied, and p values were obtained as describedabove. This was repeated 150 times for all combinationsof case proportion, spike method, magnitude, and methodon each dataset. The area under the receiver operatingcharacteristic curve (AUC) value was calculated using rawp values, assuming they were lower in spiked vs. non-spiked OTUs, with the “pROC” package [42].

Beta-diversity optimizationTo assess the effects of normalization, transformation,and distance metrics on the ability of beta-diversityanalysis to distinguish between groups, we selected data-sets with less-than-perfect separation by a particulardesign variable. Next, datasets were subjected to differ-ent normalization methods (none (included as baseline),total sum scaling/relative abundance (TSS), metagenome-Seq’s cumulative sum scaling (CSS), edgeR’s trimmedmean of M values (TMM) and DESeq2’s size factors), andtransformations (natural logarithm with pseudocount 1,0.01, and 0.0001, square root, and cubic root). In case ofpseudocounts below one, all post-transformation valueswere corrected by subtracting the log of the pseudocount,effectively preserving zeroes from the original counts.Finally, selected distance metrics were applied (Bray-Curtis,Euclidean, Jensen-Shannon Divergence, UniFrac, weightedUniFrac) to provide beta-diversity distance matrices fromall these combinations. Jensen-Shannon Divergence,UniFrac, and weighted UniFrac are independent ofnormalization by design and were only computed onceper dataset and transformation. We then fitted adistance-based permutation multiple analysis of vari-ance (PERMANOVA) model (Adonis from the R pack-age vegan [43]) to assess the separation power of thegiven design variable in the dataset, measured as an R-squared value. In datasets B1 and B2, where the designincluded repeated measurements, these were suppliedto the model in the strata argument. All data handlingand distance calculations were done using the statis-tical software package R [39] with the add-on packagephyloseq [38]. All plots were produced with the pack-age ggplot2 [44].

Additional files

Additional file 1: Table S1. Overview of the datasets used in the study.Sampling and data characteristics of the seven datasets used in thestudy, A1–A4 for the false positive rate and spike-in retrieval tests andB1–B3 for the beta-diversity optimization tests. (XLSX 5 kb)

Additional file 2: Figure S1. Data characteristics for feces dataset A1.(A) Dot plot overview of the dataset; black if an OTU was present, blank ifnot present in a given sample. Sparsity in the dataset is 90.8%. (B) Librarysize distribution showing differences of several orders of magnitude. (C)Mean-variance relationship showing overdispersion, i.e., higher variancethan mean value of a given OTU. (PDF 504 kb)

Additional file 3: Figure S2. A. False positive rate distributions fordatasets A1–A3. Violin plot of distributions of false positive rate (FPR) in150 iterations for each case proportion in datasets A1–A3 (verticalpanels), analyzed with all differential relative abundance methods(horizontal panels). FPR is defined as the fraction of OTUs with p < 0.05.P values were not corrected for multiple testing. Black dots representmedians in each distribution. B. False positive rate distributions fordataset A4. Violin plot of distributions of false positive rate (FPR) in 150iterations dataset A4, analyzed with all differential relative abundancemethods (horizontal panels). FPR is defined as the fraction of OTUs withp < 0.05. P values were not corrected for multiple testing. Black dotsrepresent medians in each distribution. (ZIP 649 kb)

Thorsen et al. Microbiome (2016) 4:62 Page 11 of 14

Page 65: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Additional file 4: Figure S3. A. Area under the curve distributions formultiplicative spike-ins in dataset A1. Violin plot of distributions of area underthe receiver operating characteristic curve (AUC) for spiked vs non-spiked pvalues from differential relative abundance (DA) tests. AUC distributions from150 iterations for each combination of spike-in magnitude and caseproportion (vertical panels) in dataset A1, analyzed with all differentialrelative abundance methods (horizontal panels). Black dots representmedians in each distribution. B. Area under the curve distributions formultiplicative spike-ins in dataset A2. Violin plot of distributions ofarea under the receiver operating characteristic curve (AUC) forspiked vs non-spiked p values from differential relative abundance(DA) tests. AUC distributions from 150 iterations for each combinationof spike-in magnitude and case proportion (vertical panels) in datasetA2, analyzed with all differential relative abundance methods (horizontalpanels). Black dots represent medians in each distribution. C. Area under thecurve distributions for multiplicative spike-ins in dataset A3. Violin plot ofdistributions of area under the receiver operating characteristic curve (AUC)for spiked vs non-spiked p values from differential relative abundance (DA)tests. AUC distributions from 150 iterations for each combination of spike-inmagnitude and case proportion (vertical panels) in dataset A3, analyzedwith all differential relative abundance methods (horizontal panels). Blackdots represent medians in each distribution. (ZIP 2685 kb)

Additional file 5: Figure S4. A. Area under the curve distributions foradditive spike-ins in dataset A1. Violin plot of distributions of area under thereceiver operating characteristic curve (AUC) for spiked vs non-spiked pvalues from differential relative abundance (DA) tests. AUCdistributions from 150 iterations for each combination of spike-in magnitudeand case proportion (vertical panels) in dataset A1, analyzed with alldifferential relative abundance methods (horizontal panels). Black dotsrepresent medians in each distribution. B. Area under the curve distributionsfor additive spike-ins in dataset A2. Violin plot of distributions of area underthe receiver operating characteristic curve (AUC) for spiked vs non-spiked pvalues from differential relative abundance (DA) tests. AUC distributions from150 iterations for each combination of spike-in magnitude and caseproportion (vertical panels) in dataset A2, analyzed with all differentialrelative abundance methods (horizontal panels). Black dots representmedians in each distribution. C. Area under the curve distributions foradditive spike-ins in dataset A3. Violin plot of distributions of areaunder the receiver operating characteristic curve (AUC) for spiked vsnon-spiked p values from differential relative abundance (DA) tests.AUC distributions from 150 iterations for each combination of spike-inmagnitude and case proportion (vertical panels) in dataset A3, analyzedwith all differential relative abundance methods (horizontal panels). Blackdots represent medians in each distribution. (ZIP 2183 kb)

Additional file 6: Figure S5. Area under the curve distributions formixed multiplicative spike-ins in datasets A1–A3. Violin plot of distributions ofarea under the receiver operating characteristic curve (AUC) for spiked vs non-spiked p values from differential relative abundance tests. AUC distributionsfrom 150 iterations for each case proportion in datasets A1–A3 (verticalpanels), analyzed with all differential relative abundance methods (horizontalpanels). Black dots represent medians in each distribution. (PDF 588 kb)

Additional file 7: Figure S6. A. False positive rate distributions fordatasets A1s–A3s and A1m–A3m. Violin plot of distributions of falsepositive rate (FPR) in 150 iterations for datasets A1s–A3s and A1m–A3m(vertical panels), analyzed with all differential relative abundance methods(horizontal panels). FPR is defined as the fraction of OTUs with p < 0.05.P values were not corrected for multiple testing. Black dots representmedians in each distribution. B. Area under the curve distributions formultiplicative spike-ins in datasets A1s–A3s and A1m–A3m. Violin plot ofdistributions of area under the receiver operating characteristic curve(AUC) for spiked vs non-spiked p values from differential relative abundance(DA) tests. AUC distributions from 150 iterations for each multiplicativespike-in magnitude in datasets A1s–A3s and A1m–A3m (vertical panels),analyzed with all differential relative abundance methods (horizontal panels).Black dots represent medians in each distribution. C. Area under the curvedistributions for additive spike-ins in datasets A1s–A3s and A1m–A3m. Violinplot of distributions of area under the receiver operating characteristic curve(AUC) for spiked vs non-spiked p values from differential relative abundance(DA) tests. AUC distributions from 150 iterations for each additive spike-in

magnitude in datasets A1s–A3s and A1m–A3m (vertical panels), analyzedwith all differential relative abundance methods (horizontal panels). Blackdots represent medians in each distribution. D. Area under the curvedistributions for mixed multiplicative spike-ins in datasets A1s–A3sand A1m–A3m. Violin plot of distributions of area under the receiveroperating characteristic curve (AUC) for spiked vs non-spiked p valuesfrom differential relative abundance (DA) tests. AUC distributions from150 iterations for mixed multiplicative spike-in magnitudes in datasetsA1s–A3s and A1m–A3m (vertical panels), analyzed with all differentialrelative abundance methods (horizontal panels). Black dots representmedians in each distribution. (ZIP 4027 kb)

Additional file 8: Figure S7. A. Area under the curve distributions formultiplicative spike-ins in dataset A4. Violin plot of distributions ofarea under the receiver operating characteristic curve (AUC) forspiked vs non-spiked p values from differential relative abundance(DA) tests. AUC distributions from 150 iterations for each combinationof multiplicative spike-in magnitude and case proportion (verticalpanels) in dataset A4, analyzed with all differential relative abundancemethods (horizontal panels). Black dots represent medians in eachdistribution. B. Area under the curve distributions for additive spike-ins indataset A4. Violin plot of distributions of area under the receiver operatingcharacteristic curve (AUC) for spiked vs non-spiked p values from differentialrelative abundance (DA) tests. AUC distributions from 150 iterations for eachcombination of additive spike-in magnitude and case proportion (verticalpanels) in dataset A4, analyzed with all differential relative abundancemethods (horizontal panels). Black dots represent medians in each distribution.C. Area under the curve distributions for mixed multiplicative spike-ins indataset A4. Violin plot of distributions of area under the receiver operatingcharacteristic curve (AUC) for spiked vs non-spiked p values from differentialrelative abundance (DA) tests. AUC distributions from 150 iterations for eachcase proportion (vertical panels) in dataset A4, analyzed with all differentialrelative abundance methods (horizontal panels). Black dots represent mediansin each distribution. (ZIP 1831 kb)

Additional file 9: Figure S8. A. Spike-in retrieval as a function of numberof positive samples, by dataset size. Aggregated results across 150 iterationsof multiplicative spike-ins of magnitude 5 with 50% cases, in datasets A1,A1s, and A1m. Each dot represents a spiked OTU. The Y-axis displays its pvalue quantile (0 is lowest p value, 1 is highest p value) within that DA run,and the X axis shows how many samples are positive (nonzero) for thatOTU. Results from the three datasets are overlaid with different colors andfaceted by statistical method. B. Spike-in retrieval as a function of numberof positive samples, by case proportion. Aggregated results across 150iterations of multiplicative spike-ins of magnitude 5 with 10, 25, or 50%cases, in dataset A1. Each dot represents a spiked OTU. The Y-axis displaysits p value quantile (0 is lowest p value, 1 is highest p value) within that DArun, and the X axis shows how many samples are positive (nonzero) for thatOTU.Results from the three case proportions are overlaid with different colors,and faceted by statistical method. C. Spike-in retrieval as a function ofnumber of positive samples, by spike-in magnitude. Aggregated resultsacross 150 iterations of multiplicative spike-ins of magnitudes 0.5, 2, 5, 10,and 20 with 50% cases, in dataset A1. Each dot represents a spiked OTU.The Y-axis displays its p value quantile (0 is lowest p value, 1 ishighest p value) within that DA run, and the X axis shows howmany samples are positive (nonzero) for that OTU. Results from thedifferent spike-in magnitudes are overlaid with different colors, andfaceted by statistical method. (ZIP 16398 kb)

Additional file 10: Figure S9. The effect of normalization,transformation, and distance metric on beta-diversity separation in themodified dataset B3. Dataset B3 modified by grouping of several OTUs,reducing the dataset to 8770 OTUs. All combinations of library sizenormalizations (horizontal panels) and count transformations (color)applied prior to calculation of distances and use of Adonis permanovamodel. Effect of age group measured as model R2 value. The highest andlowest R2 values (yielding best and worst separation, respectively) aredemonstrated in subplots B & C as principal coordinate analysis plots,colored by design variable, with overlaid prediction ellipses. CSScumulative sum scaling, TMM trimmed mean of M values, TSS totalsum scaling. (PDF 124 kb)

Thorsen et al. Microbiome (2016) 4:62 Page 12 of 14

Page 66: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Additional file 11: Table S3. Summary of benchmark performance forincluded tools. (PDF 32 kb)

Additional file 12: Figure S10. The lowest obtainable p values from aWilcoxon rank sum test depend on sample size and sparsity. Overview ofthe lowest p value theoretically obtainable using optimal conditions for agiven sample size and level of sparsity. Optimal conditions refer to (a)50% cases, i.e., n1 = n2, (b) only nonzero counts in one of the groups,limited by level of sparsity, (c) no ties between nonzero counts. Onesided p values from normal approximation shown. Since the p value onlydepends on the rank distribution, the actual count values do not matter.Color of the lines indicates level of sparsity; dotted red line correspondsto p = 0.05. (PDF 9 kb)

Additional file 13: Supplementary methods. (DOCX 15 kb)

Additional file 14: Table S2. Chimeras removed from dataset B3.(XLSX 49 kb)

AbbreviationsAUC: Area under the receiver operating characteristic (ROC) curve;CSS: Cumulative sum scaling; DA: Differential (relative) abundance;FPR: False positive rate; GLM: Generalized linear model; OTU: Operationaltaxonomic unit; PCoA: Principal coordinates analysis; ROC: Receiveroperating characteristic; TMM: Trimmed mean of M values

AcknowledgementsWe express our deepest gratitude to the children and families of theCOPSAC2010 cohort study for all their support and commitment. Weacknowledge and appreciate the unique efforts of the COPSAC researchteam.

FundingCOPSAC is funded by private and public research funds all listed onwww.copsac.com. The Lundbeck Foundation (Grant No. R16-A1694); TheDanish Ministry of Health (Grant No. 903516); Danish Council for StrategicResearch (Grant No. 0603-00280B), and The Capital Region ResearchFoundation (no grant number) have provided core support for COPSAC.No pharmaceutical company was involved in the study. The fundingagencies did not have any role in the design and conduct of the study;collection, management, and interpretation of the data; or preparation,review, or approval of the manuscript.

Availability of data and materialsThe datasets generated during and/or analyzed during the current study areavailable in the following repositories:Benchmark code and COPSAC 16S OTU datasets (A1-A4, A1s-A3s, A1m-A3m,B1): Bitbucket, https://bitbucket.org/jonathanth/large_scale_16s [REVIEWERSLOGIN to bitbucket: user: [email protected], password: january16.Repository will be made public upon publication].HMP OTU dataset (B2): HMP Data Analysis and Coordination Center (DACC),http://hmpdacc.org/HMQCP/MSD (Pop) OTU dataset (B3): Bioconductor: http://bioconductor.org/packages/release/data/experiment/html/msd16s.html

Authors’ contributionsJT, AB, and JW devised and ran the analyses and wrote the paper. MM andMR contributed to the generation of datasets A1–A4 and B1. JS, SS, and HBdevised and planned the experiments for datasets A1–A4 and B1. All authorsread and approved the final manuscript.

Competing interestsThe authors declare that they have no competing interests.

Consent for publicationNot applicable.

Ethics approval and consent to participateThe COPASC2010 study was conducted in accordance with the guidingprinciples of the Declaration of Helsinki and was approved by the LocalEthics Committee (COPSAC2010: H-B-2008-093) and the Danish Data Protection

Agency (COPSAC2010: 2015-41-3696). Both parents gave written informedconsent before enrollment.

Author details1COPSAC, Copenhagen Prospective Studies on Asthma in Childhood, Herlevand Gentofte Hospital, University of Copenhagen, Copenhagen, Denmark.2Section of Microbiology, Department of Biology, University of Copenhagen,Copenhagen, Denmark. 3Department of Biology, Laboratory of Genomicsand Molecular Biomedicine, University of Copenhagen, Copenhagen,Denmark.

Received: 28 July 2016 Accepted: 15 November 2016

References1. Qin J, Li R, Raes J, Arumugam M, Burgdorf KS, Manichanh C, et al. A human

gut microbial gene catalogue established by metagenomic sequencing.Nature. 2010;464:59–65.

2. Arumugam M, Raes J, Pelletier E, Le Paslier D, Yamada T, Mende DR, et al.Enterotypes of the human gut microbiome. Nature. 2011;473:174–80.

3. Stackebrandt E, Goebel BM. Taxonomic note: a place for DNA-DNAreassociation and 16S rRNA sequence analysis in the present speciesdefinition in bacteriology. Int J Syst Bacteriol. 1994;44;846–49.

4. Huse SM, Welch DM, Morrison HG, Sogin ML. Ironing out the wrinklesin the rare biosphere through improved OTU clustering. EnvironMicrobiol. 2010;12:1889–98.

5. Schloss PD, Westcott SL. Assessing and improving methods used inoperational taxonomic unit-based approaches for 16S rRNA gene sequenceanalysis. Appl Environ Microbiol. 2011;77:3219–26.

6. McDonald D, Clemente JC, Kuczynski J, Rideout JR, Stombaugh J, Wendel D,et al. The biological observation matrix (BIOM) format or: how I learned tostop worrying and love the ome-ome. GigaScience. 2012;1:7.

7. Paulson JN, Stine OC, Bravo HC, Pop M. Differential abundance analysis formicrobial marker-gene surveys. Nat Methods. 2013;10:1200–2.

8. Zuur A, Leno EN, Walker N, Saveliev AA, Smith GM. Mixed effects modelsand extensions in ecology with R. Springer Science & Business Media. Berlin,Heidelberg; 2009.

9. Richards SA. Dealing with overdispersed count data in applied ecology:overdispersed count data. J Appl Ecol. 2007;45:218–27.

10. Robinson MD, Smyth GK. Moderated statistical tests for assessing differencesin tag abundance. Bioinformatics. 2007;23:2881–7.

11. Anders S, Huber W. Differential expression analysis for sequence count data.Genome Biol. 2010;11:R106.

12. McMurdie PJ, Holmes S. Waste not, want not: why rarefying microbiomedata is inadmissible. PLoS Comput Biol. 2014;10:e1003531.

13. Weiss SJ, Xu Z, Amir A, Peddada S, Bittinger K, Gonzalez A, et al. Effects oflibrary size variance, sparsity, and compositionality on the analysis ofmicrobiome data [Internet]. PeerJ PrePrints; 2015. Report No.: e1408.Available from: https://peerj.com/preprints/1157.

14. Bolker BM, Brooks ME, Clark CJ, Geange SW, Poulsen JR, Stevens MHH, et al.Generalized linear mixed models: a practical guide for ecology andevolution. Trends Ecol Evol. 2009;24:127–35.

15. Love MI, Huber W, Anders S. Moderated estimation of fold change anddispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15:550.

16. Robinson MD, McCarthy DJ, Smyth GK. edgeR: a bioconductor package fordifferential expression analysis of digital gene expression data. BioinformaOxf Engl. 2010;26:139–40.

17. Hardcastle TJ, Kelly KA. baySeq: empirical Bayesian methods for identifyingdifferential expression in sequence count data. BMC Bioinformatics. 2010;11:422.

18. Paulson J, Pop M, Bravo H. metagenomeSeq: statistical analysis for sparsehigh-throughput sequncing. Bioconductor package: 1.10.0. [Internet].Available from: http://cbcb.umd.edu/software/metagenomeSeq.

19. Fernandes AD, Reid JN, Macklaim JM, McMurrough TA, Edgell DR, Gloor GB.Unifying the analysis of high-throughput sequencing datasets:characterizing RNA-seq, 16S rRNA gene sequencing and selective growthexperiments by compositional data analysis. Microbiome. 2014;2:15.

20. Ravel J, Gajer P, Abdo Z, Schneider GM, Koenig SSK, McCulle SL, et al.Vaginal microbiome of reproductive-age women. Proc Natl Acad Sci U S A.2011;108 Suppl 1:4680–7.

21. Koren O, Knights D, Gonzalez A, Waldron L, Segata N, Knight R, et al. Aguide to enterotypes across the human body: meta-analysis of microbial

Thorsen et al. Microbiome (2016) 4:62 Page 13 of 14

Page 67: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

community structures in human microbiome datasets. PLoS Comput Biol.2013;9:e1002863.

22. Knights D, Ward TL, McKinlay CE, Miller H, Gonzalez A, McDonald D, et al.Rethinking “enterotypes.”. Cell Host Microbe. 2014;16:433–7.

23. Angly FE, Dennis PG, Skarshewski A, Vanwonterghem I, Hugenholtz P,Tyson GW. CopyRighter: a rapid tool for improving the accuracy ofmicrobial community profiles through lineage-specific gene copy numbercorrection. Microbiome. 2014;2:11.

24. Bray JR, Curtis JT. An ordination of the upland forest communities ofsouthern Wisconsin. Ecol Monogr. 1957;27:325–49.

25. Lozupone C, Knight R. UniFrac: a new phylogenetic method for comparingmicrobial communities. Appl Environ Microbiol. 2005;71:8228–35.

26. Lozupone CA, Hamady M, Kelley ST, Knight R. Quantitative and qualitativeβ diversity measures lead to different insights into factors that structuremicrobial communities. Appl Environ Microbiol. 2007;73:1576–85.

27. Legendre P, Legendre LFJ. Numerical Ecology. Amsterdam: Elsevier; 1998.28. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical

and powerful approach to multiple testing. J R Stat Soc Ser B Methodol.1995;57:289–300.

29. Bonferroni CE. Teoria statistica delle classi e calcolo delle probabilità. Libreriainternazionale Seeber: Florence; 1936.

30. Costea PI, Zeller G, Sunagawa S, Bork P. A fair comparison. Nat Methods.2014;11:359.

31. Paulson JN, Bravo HC, Pop M. Reply to: a fair comparison. Nat Methods.2014;11:359–60.

32. Fukuyama J, McMurdie PJ, Dethlefsen L, Relman DA, Holmes S. Comparisonsof distance methods for combining covariates and abundances inmicrobiome studies. Biocomput. 2012 [Internet]. WORLD SCIENTIFIC; 2011.p. 213–24. Available from: http://www.worldscientific.com/doi/abs/10.1142/9789814366496_0021. [cited 19 May 2015].

33. Bisgaard H, Vissing NH, Carson CG, Bischoff AL, Følsgaard NV, Kreiner-Møller E,et al. Deep phenotyping of the unselected COPSAC2010 birth cohort study.Clin Exp Allergy J Br Soc Allergy Clin Immunol. 2013;43:1384–94.

34. Human Microbiome Project Consortium. Structure, function and diversity ofthe healthy human microbiome. Nature. 2012;486:207–14.

35. Pop M, Walker AW, Paulson J, Lindsay B, Antonio M, Hossain MA, et al.Diarrhea in young children from low-income countries leads to large-scalealterations in intestinal microbiota composition. Genome Biol. 2014;15:R76.

36. Edgar RC. Search and clustering orders of magnitude faster than BLAST.Bioinforma Oxf Engl. 2010;26:2460–1.

37. Haas BJ, Gevers D, Earl AM, Feldgarden M, Ward DV, Giannoukos G, et al.Chimeric 16S rRNA sequence formation and detection in Sanger and 454-pyrosequenced PCR amplicons. Genome Res. 2011;21:494–504.

38. McMurdie PJ, Holmes S. phyloseq: an R package for reproducible interactiveanalysis and graphics of microbiome census data. PLoS One. 2013;8:e61217.

39. R Core Team. R: a language and environment for statistical computing.Vienna: R Foundation for Statistical Computing, Vienna, Austria. [Internet];2015. Available from: http://www.R-project.org/.

40. Tange O. GNU parallel—the command-line power tool. Login USENIX Mag.2011;36:42–7.

41. Venables WN, Ripley BD. Modern applied statistics with S [Internet].Fourth. New York: Springer; 2002. Available from: http://www.stats.ox.ac.uk/pub/MASS4.

42. Robin X, Turck N, Hainard A, Tiberti N, Lisacek F, Sanchez J-C, et al. pROC: anopen-source package for R and S+ to analyze and compare ROC curves.BMC Bioinformatics. 2011;12:77.

43. Oksanen J, Blanchet FG, Kindt R, Legendre P, Minchin PR, O’Hara RB, et al.vegan: community ecology package [Internet]. 2015. Available from: https://cran.r-project.org/web/packages/vegan/index.html. [cited 21 Sept 2015].

44. Wickham H. ggplot2: elegant graphics for data analysis [Internet]. New York:Springer; 2009. Available from: http://ggplot2.org/book/. • We accept pre-submission inquiries

• Our selector tool helps you to find the most relevant journal• We provide round the clock customer support • Convenient online submission• Thorough peer review• Inclusion in PubMed and all major indexing services • Maximum visibility for your research

Submit your manuscript atwww.biomedcentral.com/submit

Submit your next manuscript to BioMed Central and we will help you at every step:

Thorsen et al. Microbiome (2016) 4:62 Page 14 of 14

Page 68: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

SYNTENY: A new tool to explore the functional content of the

mobile pool from metagenomes

Asker Daniel Brejnrod1,

*, Samuel Jacquiod2,

Gisle Alberg Vestergaard3, Jakob Russel

1 , Michael

schloter3, and Søren Johannes Sørensen

1

1 Institute of Biology, University of Copenhagen, Copenhagen, 2200, Denmark

2 UMR1347 Agroécologie, Centre INRA Dijon, Région Bourgogne, 2100 Dijon, France

3 Research Unit for Comparative Microbiome Analysis, Helmholtz Zentrum München, German

Research Center for Environmental Health, 85764 Neuherberg, Germany

* To whom correspondence should be addressed. Tel: +45 51 82 70 07; Fax: +45 35 32 20 40; Email:

[email protected]

Present Address: Søren Johannes Sørensen, Institute of Biology, University of Copenhagen,

Copenhagen, 2200, Denmark

ABSTRACT

Horisontal gene transfer (HGT) is an important mechanism for bacteria to adapt to the environment.

HGTallows bacteria to quickly embed DNA needed to respond to environmental change. This can

impact the environment and human health, as demonstrated by the antibiotics resistance crisis.

Mobile functionality is not limited to resistance genes, but can be metabolic or structural. We have

developed tools facilitating the identification of genes associated with mobile genetic elements thus

potentially identifying key responders to environmental changes.

The current work presents software to explore functionality originating from mobile elements.

Metagenomic contigs are scanned with curated databases of transposases to identify likely

recombination loci. From this mobile frames are defined and the functionality contained in these are

analysed for enrichment and co-occurrence of elements. Finally we implement statistical models to

identify candidate functionality that is likely to have an impact on outcomes and found in mobile

frames

We apply the tools to two datasets. The first is a soil metagenome generated from a long-term copper

contaminated site as well as a control soil, in which we identify elements related to transport and

stress response. Second, we use the publically available metagenome set of patients suffering from

Ulcerative Colitis (UC) to identify thiamine synthesizing domains in the mobile pool as associated with

UC. Both datasets had highly significant hits of unknown function.

INTRODUCTION

The terminology “synteny” was first coined by JH Renwick in 1971 when mapping the Human genome

(1) and originally referred to the genetic linkage between loci positioned on the same chromosome.

The definition has since evolved, and now refers to gene loci on a chromosomal region in different

organisms sharing common evolution ancestry (2).

Mobile genetic elements (MGE) play a crucial role in a diverse set of contexts, no matter the mode, be

it conjugation, transduction or transformation. Potentially conferring novel phenotypes to the recipient,

they are often the basis for the rapid spread of antibiotic resistance in clinical and agricultural

settings(3,4). They can have big impacts on the bacterial interaction with the environment. For

instance, it was recently shown that phage-induced inhibition of CO2 fixation could impact primary

Page 69: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

productionof biomass more than “the sum of all coral reefs, salt marshes, estuaries, and marine

macrophytes on Earth” (5). MGEs are also important for bacterial evolution (6,7).

The case of antibiotics resistance has demonstrated clearly that mobility can enhance the spread of

bacterial traits harmful to human health. After having sequenced the human genome and completed

Genome Wide Associations studies, it has emerged that for several diseases genetics can only

explain a low percentage of the variance. Attention has turned to the microbiome for finding factors

that are directly responsible for disease as well as microbiome members that enhance penetrance of

genetic variance, and several studies have suggested interactions between genetics and the

microbiome (8-10). Analogous to the antibiotics case, mobilisation and spread of bacterial traits might

be affecting disease-presenting populations.

Various methods have been developed to study the fraction of the DNA pool that is mobile with wetlab

methods, many relying on enriching or selecting smaller circular elements by rolling circle

amplification(11). This has produced new insights into the functional genes carried on these elements,

such as antibiotics- and heavy metal resistance(12).

As the number of sequenced prokaryotic genomes has been steadily growing over the last years,

algorithms and computational tools were concomitantly developed to carry on synteny

investigations (see Deniélou et al (13) for a review). For metagenomics there has been several

attempts at identifying mobile elements, both by searching against databases (14), extracting

information from the assembly graphs, and categorizing elements from metagenomics binning (15)

The recent advent of DNA sequencing as a tool for studying bacterial populations and their

interactions with their environment and hosts has opened up new possibilities of studying microbial

function at a large scale. Major metagenome projects such as MetaHit and the Human Microbiome

Project has found associations between microbiome and wide range of diseases(16), while the Earth

Microbiome Project has found that microbiome both impact and are impacted by their local

environment(17).

In this work we develop software to characterize the functionality of mobile genetic elements and

identify candidates for further investigation into bacterial impact on their environment. The chosen

strategy is to identify transposases and integrases, instigators of horizontal gene transfer, in

metagenomic contigs by Hidden Markov Models (HMMs) of these elements curated from the literature

and characterize the functionality that are physically adjacent to it. We define the concept of open and

closed frames to denote collections of protein sequences that are next to or enclosed in two

transposase annotated genes.

Based on orthologue clustering of the proteins in the contigs we group proteins into functional groups.

Based on this, the software offers functionality for identifying functions enriched in frames compared

to the background contigs. Additionally the software provides statistical integration with expression or

abundance matrices, allowing the user to identify candidate functionality that interacts with being

found in a mobile context and being associated with sample groups, disease-status or environmental

characteristics. Finally a network based approach allows the identification of functionality that is often

carried together in mobile frames.

We illustrate the use of the software by analyzing a metagenome from a long-term copper

contaminated soil and the publically available Metahit data, and demonstrate how the exploration of

functional content can be used to generate hypotheses.

MATERIAL AND METHODS

Page 70: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Implementation

SYNTENY is implemented in python2 with the following dependencies: igraph, numpy, pandas, scipy,

and, statsmodels.

The provided tools can perform the analyses for various annotation sources. Enrichment can be

analysed for eg. pfam domains, KOs or GO terms.

The script integrate_synteny.py takes annotations from various sources, in two different classes

Horisontal Gene Transfer Instigators (HGTi) annotations or functional annotations. The HGTi class is

used downstream to aggregate HGTis of all origins into determination of mobile frames.

For identifying frames the script find_frames.py is provided. This will search contigs for open frames,

protein windows (by default 10, possibly overlapping, proteins long) adjacent to HGTi elements on the

3’ and 5’ sides. Assignments to frames are terminated if ribosomal elements are identified as

indicated by the GO term of the Pfam assignment containing the string “ribosom”. Closed frames,

functional genes enclosed by two HGTi elements, less than 20 genes in size by default will also be

extracted in the same format for downstream analysis. The HGTi genes themselves are stripped from

the frame prior to downstream analyses.

SYNTENY database

To effectively annotate ORFs for identification of frames a helper script is provided to construct a

custom database. This database is the concatenation of the Pfam database (18), the tnppred

database (19), and an entry from TIGRFAM HMM collection (20), an HMM model of an integron.

This concatenated database is constructed to resolve ambiguities in HMM search between different

databases, as comparison of hits based on e-values should be based on databases of the same size.

For computational reasons annotations were done with hmmsearch (http://hmmer.org/). Besides the

default e-value filter we used the gathering thresholds embedded in the Pfam models to filter hits,

where possible. Certain domain hits were validated as previously observed in a mobile context by

searching the domain against the ACLAME (21) database using the gathering threshold cut-off.

Enrichment analysis

Enrichment analysis of frame functional content is reported in terms of Odds Ratios (OR), calculated

based on the number of times proteins with a tag of interest appear either in a mobile frame context or

in the remaining background set of proteins.

SYNTENY offers two methods for assessing significance. The first is a simple one-sided permutation

test, permuting the labels of frame/background, recalculating the OR, and after the desired number of

permutations calculating the number of times the OR were lower than the observed OR.

Page 71: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

The second method uses Fishers exact test, as implemented in Scipy for providing a p-value. The p-

value computed is one-sided as we are only interested in in-frame enrichment.

For many practical applications we recommend the second option as in practice the number of

permutations required to converge towards the p-value can be quite large and thus the procedure is

slow.

Abundance interaction modelling

Interaction modelling is performed by treating the abundance of every protein annotated to the tag of

interest as an observation, and performing a linear regression of the following form:

Abundance = Beta1*Obs + Beta2*Group+Beta3*Obs*Group + residual

Where “Obs” contains either “background or “frame” and Group contains “group1” or “group2”.

We minimally require 5 observations in each group to model a tag. Abundances were log-transformed,

as this gave the best overall model fit, when tested on a broad range of models.

The significance is determined by the p-value for the Beta3 coefficient alone and is set at 0.05. The

software provides p-values adjusted by either Bonferroni or Benjamini-Hochberg (BH) corrections. All

reported p-values were corrected using BH.

Statistical validation of the abundance interaction modelling

Two types of analyses were carried out to ensure the modelling performs well when faced with

several anticipated problems such as heterogeneous response distributions for proteins and

imbalanced number of observations. These were false discovery rate assessment and power

calculations.

Assessment of the false discovery rate was performed by permuting the data under various schemes

and re-computing the p-value. P-values should be homogenously distributed under the null

hypothesis, and the proportion of significant p-values should be less than 0.05. To assess sensitivity

to the modelled factors permutation proceeded in three schemes: Randomisation of both variables

and individual randomization. 23 models of orthologues from the Hygum soils were permuted with the

three schemes a 1000 times each.

Power analysis

Power analysis of linear models attempts to elucidate the relationship between the number of

samples and the power of a model to detect significant differences with noisy data. To do this ,

assumptions needs to be made about the effect size and residual distribution. We picked these

parameters from a model of the top hit in terms of significance so that we could explore the effect of

having fewer observations on a strong effect. We chose to do the power analysis by means of

simulation. This strategy generates new abundance observations by filling numbers into the linear

Page 72: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

equation described previously. The Beta coefficients are given by the previously assumed model, as

well as the distribution of residuals that we randomly sampled from a normal distribution. In addition

we wanted to explore the effect of the proportion of observations in frame, and we randomly sampled

this proportion from a binomial distribution. 10000 samples were drawn.

Network clustering

For identification of co-occurring functional elements we implemented a weighted network based

strategy. Each chosen functional unit (De novo orthologue, Pfam domain, KEGG orthologue, etc.) is

modelled as a node. Edges between the nodes are defined as the sum of distances that the units

have been found from each other within a frame context. In order to weight physically closer elements

higher, the weights were defined as 1-(distance/frame length). This ensures that highly connected

elements appearing close together are represented as a densely connected subnet. For identification

of dense subnets we used the python igraph implementation of the fastgreedy algorithm (22). This

computationally efficient algorithm ensures that the detection can scale to potentially very large

networks.

Datasets

We study two applications of the tool, both in various metagenomes. One is a soil dataset, the other is

from a human metagenome.

Generation of the Hygum soil dataset has been described in [The other paper]. Briefly, 6 samples

from a control spot and a heavily contaminated spot respectively were taken. The contaminated site

had a copper concentration of ~4500 mg/kg and has previously been described in detail in Nunes et

al. (23). The DNA was extracted and sequenced on Illumina Hiseq. Reads were quality

filtered ,trimmed and assembled with Ray (24). ORFs were called with Prodigal(25).

From the the Metahit human metagenome catalogue, predicted amino acid sequences and their

associated metagenomics abundances from Ulcerative Colitis (UC) subset has been generously

uploaded to http://www.cbs.dtu.dk/databases/CAG/ in support of Nielsen et. al (15).

The proteins and abundances were downloaded. All proteins from ~27k contigs longer than 10 genes

in length were extracted. This subset of genes were clustered and annotated.

Orthologue clustering

For computational reasons we use two different strategies for orthologue identification. Because “all

vs all” searches scale poorly, we use a more exact strategy for the smaller protein set and a more

approximate for the large set. For the smaller Hygum protein set we compute alignments with

ssearch36 (26), while for the larger MetaHit the search was performed with diamond blastp (27),

using the “–more-sensitive” option.

The hits from the two approaches were filtered by how well the alignment was covered as in (28).

Finally Markov clustering (29) were run with unweighted edges to obtain orthologues.

KEGG annotation

Page 73: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

For the Hygum dataset called ORFs were searched against the KEGG 2014 release genes database

(30) with diamond (27) using the “–more-sensitive” option and the top hit was used to translate to a

Kegg Orthologue using the provided mapping file (ko_gene.txt).

For the Metahit dataset the KO annotations can be found on the aforementioned website.

Methodological details about the annotation can be found in (15)

RESULTS

Statistical validation of the abundance interaction modelling

Classic linear regression diagnostic plots are shown for a representative selection of 5 models in

figure 2A and B. A shows the fitted values against the residual which is traditionally used to evaluate

the residuals for all combinations of dummy variable levels. There are no obvious trends, but it should

be noted that there is some minor discrepancies in the variance of the factor levels. Figure 2B shows

a QQ plot of the same 5 models. The QQ plot is used to evaluate the distribution of residuals against

a theoretical distribution. While the overall trend is linear there is some notable deviation at the tail

end of the distribution.

To evaluate the severity of the noted deviations from model assumptions simulations were performed

to asses false discovery rate (FDR). 23 models were included in the FDR assessment. As can be

seen in figure 2, the distribution of p-values for all the permutation schemes was largely homogenous.

The largest differences were mainly observed at the very ends of the interval, near 0 and 1. In

particular, permutation of the Obs variable (specifying whether an observation came from a frame or

from the background) has bump at the low end. And indeed, while the two other permutation scheme

has a proportion of p-values near the desired 0.05, the Obs permutations are at 0.11 as seen in the

insert in figure 2.

Another aspect of the model is how effective it is at discovery, the so-called power of the model. This

property of statistical modelling can be explored by simulation. To explore the effects of having fewer

or more unbalanced samples some assumptions are necessary. Power can only be computed in light

of an assumed effect size and standard deviation. With these in hand samples can be drawn from the

putative distribution and the summarised power plotted. Under the assumptions that a highly

significant model from the Hygum data has made a real discovery we use the effect size and standard

deviation to generate data and compute the modelling power, visualized in figure 2 D-F. As the power

in this scenario is both a function of the total number of observation we randomly drew sample sizes

(N) and the proportion of frame vs background, while the proportion of case/controls were fixed at 0.5.

Panel A demonstrates the relationship between the number of observations of a certain tag (N) and

the power of the model. Power has traditionally been sought to be 0.8 (31)illustrated by the horizontal

line. Reaching 0.8 power in a model with this high effect size requires less than 50 observations. The

power as a function of the absolute number of in-frame observations can be seen in panel E. Power is

seemingly not a monotonous function of the number of in-frame observations, likely because heavily

Page 74: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

unbalanced number of observations is a problem for Ordinary Least Squares estimation of

coefficients. Because the sample size, N, is random we also explored the effect of the proportion of

in-frame observations in panel F, demonstrating that this should not be skewed to heavily.

Hygum soils analysis

The hygum soil metagenome were assembled into 93928 contigs with an N50 of 943. 230440 ORFs

were called. These were clustered into 103144 orthologues. To validate this de novo orthologue

clustering we compared the KO annotation against this new clustering by computing the Shannon

index of KO annotations in each cluster. A Shannon index of 0 would mean that all KO annotations

were the same. 70% of the clusters with a KO annotation had a Shannon index of 0. Annotation

against the provided database identified 60.8% of the ORFs using the constructed database.

The find_frames.py tool detected 2172 mobile frames with functional content in the Hygum soil

contigs.

The enrichment analyses were performed at both Pfam and GO terms level. The significant hits for

GO were unidentified functionality and DNA integration. The same were seen at the Pfam level,

where both no identification and a domain of unknown function were enriched. However one other

domain. RHH_1, was also enriched. To identify candidates for further investigation the de novo

orthologue clusters were investigated for enrichment. One cluster (13312) has no annotation for any

of the two underlying proteins, but the other top hit, with p=0.067, was formed from 238 proteins. For

explorative purposes we can condense their functional assignments into a word cloud representation

shown in figure 3A where the size of the terms are weighted by occurrence.

The Usp domain was identified which is a general stress response protein as well as

Pyridox_oxidase a domain from Pyridoxamine 5'-phosphate

The top hit for abundance interaction modelling of Pfam domains was Proton_antipo_M, a domain of

an antiporter. In addition to antiporters. Significant hits with enrichment in the background protein set

included several transport related domains such as MFS_1/3, Major facilitator superfamily, ABC_tran,

ABC transporters and BPD_transp_1, binding-protein-dependent transporters.

MetaHIT Ulcerative Colitis subset data analyses

Around 550,000 sequences were clustered into X orthologs. X proteins had the best hit annotation as

an HGTi element. Integration of annotations also included the original MetaHIT KO assignements

downloaded from the website.

3346 frames were detected in the contigs. The most significant enrichment were unclassified protein

domains, both unannotated by KEGG and Pfam and Pfam Domains of Unknown Function (DUF). At

the KO level several orthologues were related to mobility K07497 and K07484, putative transposases,

Page 75: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

K00986 a reverse transcriptase, and finally K02314 and K01971, helicase and ligases respectively .

Other top hits included K06148, an ABC transporter for an unknown substrate and finally K06412, the

sporulation initiation gene spoVG. Validation of this hit by searching for SpoVG domains in the

ACLAME database showed no hits, indicating that this domain has not previously been observed in

the mobile pool of genetic elements, and could possibly be a false positive. A near-significant hit was

the NMT1 domain (p=0.11)

The top hit of the abundance interaction analysis was a Domain of Unknown Function, DUF4199.

The second top hit was NMT1. A search against the ACLAME database revealed 80 hits of which 79

were in plasmids and 1 from prophage. Figure 3B shows the adjusted interaction means from the

model, illustrating that the abundance is higher in frames and in group2 (UC patients).

For KO annotations the top hits were two transporter genes: one is K02051 a sulfonate/nitrate/taurine

transporter, the other is K06148 an ABC transporter of unknown substrate.

DISCUSSION

We have to validate and asses the power of the abundance interaction tool included in the SYNTENY

software. The residuals from the models seem to approximately follow assumptions. Depending on

how the null hypothesis is generated the models seem to have an appropriate false positive rate.

From the power analysis it appears that 50 number of observations is needed for the models to have

any detection power. The element of interest must be found in frame a minimum of X times to reach a

power of 0.8, but cannot be found so many times that the comparison to the background becomes

very unbalanced.

The analyses of enrichment of mobile functionality in the Hygum soils showed hits on all levels of

resolutions, both varying levels of database hits, and de novo orthologues.

The enriched domain RHH_1 is a transcriptional regulator previously identified on a plasmid (32).

De novo functional orthologue clustering enables the discovery of novel functionality and with the

word cloud functionality it is possible to explore what functionality the proteins in the cluster have

been assigned to. For the Hygum soil, the most well-described candidate appears to be an

oxidoreductase class of enzyme possibly involved in stress response.

The antiporter hit Proton_antipo_M has an unknown substrate, but it has previously been

demonstrated to confer copper transport capability (33)

Pyridox_oxidase is the domain in the Pyridoxamine 5'-phosphate oxidase which is a rate-limiting step

in vitamin b6 synthesis first characterized in fungi(34). This enzyme has previously been shown in

yeast to have a critical role in cadmium/arsenic tolerance (35). Vitamin B6 has been shown to be

involved in oxidative stress management of various types of fungi (36,37) which is a toxicity

mechanism for high copper concentrations, as free Cu ions can induce free oxygen radical formation

Page 76: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

(38). These anti-oxidative properties have also been shown in plants and Plasmodium Falciparum (39)

but has not to our knowledge been shown to be explicitly protective in bacteria.

As is the case for the Hygum soil, the top hits for both enrichment and abundance interaction analysis

of the MetaHIT UC dataset was unidentified functionality

Another hit is a taurine-related gene. Taurine plays an important role in the host-microbiome

relationship, because bile acids are conjugate with taurine. Conjugated taurine has been shown to be

more effective at eliminating harmful bacteria from the gut (40) (41) (42). The microbiome of IBD

patients have also been shown to have less potential for bile acid deconjugation (43,44) and this has

been corroborated by targeted metabolomics(45). As anaerobic bacteria can use taurine as a

substrate for growth (46) one hypothesis consistent with these observations is that gut bacteria

redirects the taurine pool that the host would have used for conjugating bile acids for clearing out

bacteria, further promoting a high bacterial load.

NMT1 is a protein-domain that has been related to the bio-synthesis of vitamins B1 and B6 (47).

Micronutrients status play a role in IBD (48). Previous microbiome analyses have suggested an

increase in thiamine production in ileal Crohns disease (49), as well as a decrease in successful

treatment of children with Crohns disease by Exclusive Enteral Nutrition (50). In the same experiment

there was no significant difference in serum vitamin B1 levels, and significantly higher vitamin b6

levels in serum (51), but this is likely due to improved uptake by the host as inflammation is lowered,

and independent of the microbiota production. However, a mechanistic suggestion is not straight-

forward from these data as it is not clear how the microbiome impacts the available levels and the

consequence of this might not be the same in all forms of IBD or inflamed tissue.

Conclusions SYNTENY is a tool for exploring the functional content in regions around HGTi elements. We have

implemented python scripts to integrate a wide variety of functional annotation and include this in

mobile frames that are extracted from the called proteins. We have developed tools for the

downstream analyses of enrichment and abundance interactions with an outcome variable. These

tools were statistically validated, and we estimated how many observations of an element we

must have to reach satisfactory power. Two datasets have been analysed and particular interesting

findings validated by searching for these in a database of previously observed functionality in mobile

elements. Both datasets demonstrate that there is a large pool of uncharacterised genomic

functionality in mobile elements. For both datasets we identify candidate functionality that is likely to

be transferred horizontally and impacting the outcome of the samples. In particular, we see that

Vitamin B1/B6 synthesis that has previously been linked to IBD in microbiome studies, appear to be a

mobile trait, which could have an impact on the way microbiome studies are conducted in IBD where

the functional content rather than the phylogenetic content should have the main focus.

AVAILABILITY

Page 77: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Metahit data is available at http://www.cbs.dtu.dk/databases/CAG/

ACCESSION NUMBERS

Accession numbers for raw sequences of the Hygum soil metagenome is: X

SUPPLEMENTARY DATA

Supplementary Data are available at NAR online.

ACKNOWLEDGEMENT

The authors would like to acknowledge Karin Pinholt Vestberg for DNA extraction and library

preparation of the soil metagenomes. The authors would like to thank Lars Hestbjerg Hansen for

fruitful discussion on the topic of mobile genetic elements. The authors would finally like to

acknowledge the MetaHIT project and Henrik Bjørn Nielsen in particular for excellent availability of

their generated data.

FUNDING

This work was funded by the Lundbeck foundation [grant R93-A8499]. Funding for open access

charge: [X]

CONFLICT OF INTEREST

The authors have no conflics of interest to declare.

REFERENCES

1. Renwick, J.H. (1971) The mapping of human chromosomes. Annual review of genetics, 5, 81-120.

2. Passarge, E., Horsthemke, B. and Farber, R.A. (1999) Incorrect use of the term synteny. Nat Genet, 23, 387.

3. von Wintersdorff, C.J., Penders, J., van Niekerk, J.M., Mills, N.D., Majumder, S., van Alphen, L.B., Savelkoul, P.H. and Wolffs, P.F. (2016) Dissemination of antimicrobial resistance in microbial ecosystems through horizontal gene transfer. Frontiers in microbiology, 7.

4. Blair, J.M., Webber, M.A., Baylay, A.J., Ogbolu, D.O. and Piddock, L.J. (2015) Molecular mechanisms of antibiotic resistance. Nature Reviews Microbiology, 13, 42-51.

5. Puxty, R.J., Millard, A.D., Evans, D.J. and Scanlan, D.J. (2016) Viruses inhibit CO 2 fixation in the most abundant phototrophs on Earth. Current Biology, 26, 1585-1589.

6. Polz, M.F., Alm, E.J. and Hanage, W.P. (2013) Horizontal gene transfer and the evolution of bacterial and archaeal population structure. Trends in Genetics, 29, 170-175.

7. Boto, L. (2010) Horizontal gene transfer in evolution: facts and challenges. Proceedings of the Royal Society of London B: Biological Sciences, 277, 819-827.

Page 78: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

8. Goodrich, J.K., Davenport, E.R., Beaumont, M., Jackson, M.A., Knight, R., Ober, C., Spector, T.D., Bell, J.T., Clark, A.G. and Ley, R.E. (2016) Genetic determinants of the gut microbiome in UK twins. Cell host & microbe, 19, 731-743.

9. Blekhman, R., Goodrich, J.K., Huang, K., Sun, Q., Bukowski, R., Bell, J.T., Spector, T.D., Keinan, A., Ley, R.E. and Gevers, D. (2015) Host genetic variation impacts microbiome composition across human body sites. Genome biology, 16, 191.

10. Wang, J., Thingholm, L.B., Skiecevičienė, J., Rausch, P., Kummen, M., Hov, J.R., Degenhardt, F., Heinsen, F.-A., Rühlemann, M.C. and Szymczak, S. (2016) Genome-wide association analysis identifies variation in vitamin D receptor and other host factors influencing the gut microbiota. Nature Genetics.

11. Jørgensen, T.S., Kiil, A.S., Hansen, M.A., Sørensen, S.J. and Hansen, L.H. (2014) Current strategies for mobilome research. Frontiers in microbiology, 5.

12. Luo, W., Xu, Z., Riber, L., Hansen, L.H. and Sørensen, S.J. (2016) Diverse gene functions in a soil mobilome. Soil Biology and Biochemistry, 101, 175-183.

13. Deniélou, Y.-P., Sagot, M.-F., Boyer, F. and Viari, A. (2011) Bacterial syntenies: an exact approach with gene quorum. BMC bioinformatics, 12, 193.

14. Jacquiod, S., Demanèche, S., Franqueville, L., Ausec, L., Xu, Z., Delmont, T.O., Dunon, V., Cagnon, C., Mandic-Mulec, I. and Vogel, T.M. (2014) Characterization of new bacterial catabolic genes and mobile genetic elements by high throughput genetic screening of a soil metagenomic library. Journal of biotechnology, 190, 18-29.

15. Nielsen, H.B., Almeida, M., Juncker, A.S., Rasmussen, S., Li, J., Sunagawa, S., Plichta, D.R., Gautier, L., Pedersen, A.G. and Le Chatelier, E. (2014) Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes. Nature biotechnology, 32, 822-828.

16. Consortium, H.M.P. (2012) A framework for human microbiome research. Nature, 486, 215-221.

17. Gilbert, J.A., Jansson, J.K. and Knight, R. (2014) The Earth Microbiome project: successes and aspirations. BMC biology, 12, 69.

18. Bateman, A., Coin, L., Durbin, R., Finn, R.D., Hollich, V., Griffiths‐Jones, S., Khanna, A., Marshall, M., Moxon, S. and Sonnhammer, E.L. (2004) The Pfam protein families database. Nucleic acids research, 32, D138-D141.

19. Riadi, G., Medina-Moenne, C. and Holmes, D.S. (2012) TnpPred: a web service for the robust prediction of prokaryotic transposases. Comparative and functional genomics, 2012.

20. Haft, D.H., Selengut, J.D. and White, O. (2003) The TIGRFAMs database of protein families. Nucleic acids research, 31, 371-373.

21. Leplae, R., Hebrant, A., Wodak, S.J. and Toussaint, A. (2004) ACLAME: a CLAssification of Mobile genetic Elements. Nucleic acids research, 32, D45-D49.

22. Clauset, A., Newman, M.E. and Moore, C. (2004) Finding community structure in very large networks. Physical review E, 70, 066111.

23. Nunes, I., Jacquiod, S., Brejnrod, A., Holm, P.E., Johansen, A., Brandt, K.K., Priemé, A. and Sørensen, S.J. (2016) Coping with copper: legacy effect of copper on potential activity of soil bacteria following a century of exposure. FEMS Microbiology Ecology, 92, fiw175.

24. Boisvert, S., Raymond, F., Godzaridis, É., Laviolette, F. and Corbeil, J. (2012) Ray Meta: scalable de novo metagenome assembly and profiling. Genome biology, 13, 1.

25. Hyatt, D., Chen, G.-L., LoCascio, P.F., Land, M.L., Larimer, F.W. and Hauser, L.J. (2010) Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC bioinformatics, 11, 1.

26. Pearson, W.R. and Lipman, D.J. (1988) Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences, 85, 2444-2448.

27. Buchfink, B., Xie, C. and Huson, D.H. (2015) Fast and sensitive protein alignment using DIAMOND. Nature methods, 12, 59-60.

Page 79: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

28. Vestergaard, G., Garrett, R.A. and Shah, S.A. (2014) CRISPR adaptive immune systems of Archaea. RNA biology, 11, 156-167.

29. Vlasblom, J. and Wodak, S.J. (2009) Markov clustering versus affinity propagation for the partitioning of protein interaction graphs. BMC bioinformatics, 10, 99.

30. Kanehisa, M. and Goto, S. (2000) KEGG: kyoto encyclopedia of genes and genomes. Nucleic acids research, 28, 27-30.

31. Röhrig, B., du Prel, J.-B., Wachtlin, D., Kwiecien, R. and Blettner, M. (2010) Sample size calculation in clinical trials. Dtsch Arztebl Int, 107, 552-556.

32. Acebo, P., Garcia de Lacoba, M., Rivas, G., Andreu, J.M., Espinosa, M. and Solar, G.d. (1998)

Structural features of the plasmid pMV158‐encoded transcriptional repressor CopG, a

protein sharing similarities with both helix‐turn‐helix and β‐sheet DNA binding proteins. Proteins: Structure, Function, and Bioinformatics, 32, 248-261.

33. Goldberg, M., Pribyl, T., Juhnke, S. and Nies, D.H. (1999) Energetics and topology of CzcA, a cation/proton antiporter of the resistance-nodulation-cell division protein family. Journal of Biological Chemistry, 274, 26065-26070.

34. Domitrovic, T., Raymundo, D.P., da Silva, T.F. and Palhano, F.L. (2015) Experimental Evidence for a Revision in the Annotation of Putative Pyridoxamine 5'-Phosphate Oxidases P (N/M) P from Fungi. PloS one, 10, e0136761.

35. Guo, L., Ganguly, A., Sun, L., Suo, F., Du, L.-L. and Russell, P. (2016) Global Fitness Profiling Identifies Arsenic and Cadmium Tolerance Mechanisms in Fission Yeast. G3: Genes| Genomes| Genetics, 6, 3317-3333.

36. Benabdellah, K., Azcón‐Aguilar, C., Valderas, A., Speziga, D., Fitzpatrick, T.B. and Ferrol, N.

(2009) GintPDX1 encodes a protein involved in vitamin B6 biosynthesis that is up‐regulated by oxidative stress in the arbuscular mycorrhizal fungus Glomus intraradices. New Phytologist, 184, 682-693.

37. Bilski, P., Li, M., Ehrenshaft, M., Daub, M. and Chignell, C. (2000) Symposium-in-Print Vitamin B6 (Pyridoxine) and Its Derivatives Are Efficient Singlet Oxygen Quenchers and Potential Fungal Antioxidants. Photochemistry and Photobiology, 71, 129-134.

38. Gaetke, L.M. and Chow, C.K. (2003) Copper toxicity, oxidative stress, and antioxidant nutrients. Toxicology, 189, 147-163.

39. Knoeckel, J., Mueller, I.B., Butzloff, S., Bergmann, B., Walter, R.D. and Wrenger, C. (2012) The antioxidative effect of de novo generated vitamin B6 in Plasmodium falciparum validated by protein interference. Biochemical Journal, 443, 397-405.

40. Buffie, C.G., Bucci, V., Stein, R.R., McKenney, P.T., Ling, L., Gobourne, A., No, D., Liu, H., Kinnebrew, M. and Viale, A. (2015) Precision microbiome reconstitution restores bile acid mediated resistance to Clostridium difficile. Nature, 517, 205-208.

41. Theriot, C.M., Bowman, A.A. and Young, V.B. (2016) Antibiotic-induced alterations of the gut microbiota alter secondary bile acid production and allow for Clostridium difficile spore germination and outgrowth in the large intestine. MSphere, 1, e00045-00015.

42. Weingarden, A.R., Dosa, P.I., DeWinter, E., Steer, C.J., Shaughnessy, M.K., Johnson, J.R., Khoruts, A. and Sadowsky, M.J. (2016) Changes in colonic bile acid composition following fecal microbiota transplantation are sufficient to control Clostridium difficile germination and growth. PloS one, 11, e0147210.

43. Gevers, D., Kugathasan, S., Denson, L.A., Vázquez-Baeza, Y., Van Treuren, W., Ren, B., Schwager, E., Knights, D., Song, S.J. and Yassour, M. (2014) The treatment-naive microbiome in new-onset Crohn’s disease. Cell host & microbe, 15, 382-392.

44. Labbé, A., Ganopolsky, J.G., Martoni, C.J., Prakash, S. and Jones, M.L. (2014) Bacterial bile metabolising gene abundance in Crohn's, ulcerative colitis and type 2 diabetes metagenomes. PLoS One, 9, e115175.

Page 80: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

45. Fiasse, R., Eyssen, H., Leonard, J. and Dive, C. (1983) Faecal bile acid analysis and intestinal absorption in Crohn's disease before and after ileal resection. European journal of clinical investigation, 13, 185-192.

46. Chien, C., Leadbetter, E. and Godchaux, W. (1997) Taurine-sulfur assimilation and taurine-pyruvate aminotransferase activity in anaerobic bacteria. Applied and environmental microbiology, 63, 3021-3024.

47. Rodríguez‐Navarro, S., Llorente, B., Rodríguez‐Manzaneque, M.T., Ramne, A., Uber, G.,

Marchesan, D., Dujon, B., Herrero, E., Sunnerhagen, P. and Pérez‐Ortín, J.E. (2002) Functional analysis of yeast gene families involved in metabolism of vitamins B1 and B6. Yeast, 19, 1261-1276.

48. Weisshof, R. and Chermesh, I. (2015) Micronutrient deficiencies in inflammatory bowel disease. Current Opinion in Clinical Nutrition & Metabolic Care, 18, 576-581.

49. Morgan, X.C., Tickle, T.L., Sokol, H., Gevers, D., Devaney, K.L., Ward, D.V., Reyes, J.A., Shah, S.A., LeLeiko, N. and Snapper, S.B. (2012) Dysfunction of the intestinal microbiome in inflammatory bowel disease and treatment. Genome biology, 13, R79.

50. Quince, C., Ijaz, U.Z., Loman, N., Eren, A.M., Saulnier, D., Russell, J., Haig, S.J., Calus, S.T., Quick, J. and Barclay, A. (2015) Extensive modulation of the fecal metagenome in children with Crohn’s disease during exclusive enteral nutrition. The American journal of gastroenterology, 110, 1718-1729.

51. Gerasimidis, K. (2009), University of Glasgow.

TABLE AND FIGURES LEGENDS

Figure 1: Flowchart of data processing. Steps in red is accomplished by external software, steps in

blue with SYNTENY. Metagenomic contigs are annotated with various sources of information. This is

integrated, and used to identify the mobile frames. These are finally used to compute the statistical

associations.

Figure 2: (A) Residuals vs fitted values for 5 representative abundance interaction models. (B) QQ

plot of the residual distribution of the same 5 models. (C) The distribution of p-values for 23

representative models under varying permutation schemes. All is the total randomization of both

factors, Group randomizes the outcome variable, while Obs randomizes wether observations from

the background or a frame. (D) Model power as a function of number of observed proteins from a

particular class (ie. a certain Pfam or KO annotation). (E) Power as a function of the number of in

frame observations. (F) Power as a function of the proportion of in frame observations.

Figure 3: (A) Word cloud example of a de novo orthologue candidate hit (cluster 60) from the Hygum

soils. (B) Adjusted Interaction means of the abundances of protein observations.

Page 81: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Figure 1

04/04/2017 1

ORFs

Contigs

Genome(s)

KEGG

Pfam.A

Tnpred

synteny.py

Ortholog clustering

Find_frames.py

Integrate_expression_matrix.py

Frames_enrichmentt.py

Cluster_net.py

Page 82: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Figure 2

04/04/2017 2

Page 83: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Figure 3

Page 84: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

FEMS Microbiology Ecology, 92, 2016, fiw175

doi: 10.1093/femsec/fiw175Advance Access Publication Date: 18 August 2016Research Article

RESEARCH ARTICLE

Coping with copper: legacy effect of copper onpotential activity of soil bacteria following a centuryof exposureInes Nunes1, Samuel Jacquiod1,∗, Asker Brejnrod1, Peter E. Holm2,Anders Johansen3, Kristian K. Brandt2, Anders Prieme1

and Søren J. Sørensen1

1Section of Microbiology, University of Copenhagen, Universitetsparken 15, Building 1, 2100 Copenhagen,Denmark, 2Department of Plant and Environmental Sciences (PLEN), University of Copenhagen,Thorvaldsensvej 40, DK-1871 Frederiksberg C, Denmark and 3Department of Environmental Science, AarhusUniversity, Frederiksborgvej 399, 4000 Roskilde, Denmark∗Corresponding author: Section of Microbiology, Department of Biology, University of Copenhagen, Universitetsparken 15, Building 1, 2100 Copenhagen,Denmark. Tel: +45-28704746; E-mail: [email protected] sentence summary: Long-term effect of a copper gradient on soil bacteria.Editor: Angela Sessitsch

ABSTRACT

Copper has been intensively used in industry and agriculture since mid-18th century and is currently accumulating in soils.We investigated the diversity of potential active bacteria by 16S rRNA gene transcript amplicon sequencing in a temperategrassland soil subjected to century-long exposure to normal (∼15mgkg−1), high (∼450mgkg−1) or extremely high(∼4500mgkg−1) copper levels. Results showed that bioavailable copper had pronounced impacts on the structure of thetranscriptionally active bacterial community, overruling other environmental factors (e.g. season and pH). As copperconcentration increased, bacterial richness and evenness were negatively impacted, while distinct communities with anenhanced relative abundance of Nitrospira and Acidobacteria members and a lower representation of Verrucomicrobia,Proteobacteria and Actinobacteria were selected. Our analysis showed the presence of six functional response groups(FRGs), each consisting of bacterial taxa with similar tolerance response to copper. Furthermore, the use of FRGs revealedthat specific taxa like the genus Nitrospira and several Acidobacteria groups could accurately predict the copper legacyburden in our system, suggesting a potential promising role as bioindicators of copper contamination in soils.

Keywords: diversity; functional response group; RNA; bioindicator; bioavailable copper; PLFA

INTRODUCTION

Environmental heavy metal pollution is a widespread problem,and associated effects on soil microbes have, therefore, beenstudied for decades (Baath 1989; Giller, Witter and McGrath1998). Microbial adaptation and resistance to metal contami-

nants is well known and described in the literature, includingmechanisms such as (i) extra- and intracellular sequestration,(ii) exclusion by permeability barriers, (iii) enzymatic detoxifi-cation, (iv) efflux-pumps and (v) specific reduction of the cel-lular targets’ sensibility (Rouch, Lee and Morby 1995; Hobman,

Received: 23 February 2016; Accepted: 16 August 2016C© FEMS 2016. All rights reserved. For permissions, please e-mail: [email protected]

1

Page 85: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

2 FEMS Microbiology Ecology, 2016, Vol. 92, No. 11

Yamamoto and Oshima 2007; Epelde et al. 2015). The latter is oneof themost complexmetal resistancemechanisms as it requirescellular restructuration and can be achieved by mutation of thetarget component (maintaining its functionality), by increasingthe target’s quantity, by repairing the component (mostly fea-sible when the target is DNA) and by bypassing processes in-cluding plasmid-encoded metal-resistant forms of the targetor alternative metal-resistant pathways (Rouch, Lee and Morby1995; Hobman, Yamamoto and Oshima 2007). Nevertheless, pre-vious DNA-based studies have commonly reported a high sen-sitivity of soil bacterial communities toward these pollutants(Brandt et al. 2010; Berg et al. 2012). Moreover, model studies inmicrocosms have shown the limited resistance and resiliencecapacities of soil microbial communities after metal exposures,especially in terms of function maintenance and recovery (Grif-fiths and Phillipot 2013). However, contrasting results were re-ported in the literature about diversity losses after heavy metalexposure in different systems (Gans,Wolinsky and Dunbar 2005;Wakelin et al. 2010; Berg et al. 2012; Gillan et al. 2015). For in-stance, similar richness levels were reported between normaland contaminated conditions after long-term exposure in riversediments (Gillan et al. 2015), and also in soil (Berg et al. 2012).On the other hand, diversity loss was observed in several occur-rences including long-term exposure in soils (Naveed et al. 2014;Li et al. 2015).

Amongst heavy metals, copper is one of the most commonsoil contaminants (Griffiths and Philippot 2013), and copper-derived biocides (e.g. copper sulfate) have been intensively usedin wood impregnation stations and agriculture since the mid-18th century (Rusjan 2012). Although copper is an essential ele-ment supporting many metabolic reactions, it becomes toxic athigh concentrations (Arguello, Raimunda and Padilla-Benavides2013), and the effects of copper compounds on soil microbialcommunities have been the focus of several recent studies (Berget al. 2012; Griffiths and Phillipot 2013; Mackie et al. 2014), includ-ing long-term exposure (Dell’Amico et al. 2008; Mackie et al. 2013;Naveed et al. 2014). Nevertheless, very little is known regard-ing the active fraction and functionality of the correspondingselected microbial communities after extended period of time.The remarkable adaptive potential of microbial communities tothrive under inhospitable conditions is already well establishedand documented (Cowan et al. 2015). However, understandingthe long-term selection dynamics behind the evolution of an ini-tial copper resistance trait into a fully developed lifestyle strat-egy is still not completely understood. Indeed, long-term expo-sure to a persistent perturbation in complex ecosystems like soilis expected to triggermany direct and indirect effects, whichwillall impact the bacterial communities in an indiscernible and en-tangled manner. For instance, it is known that copper directlyimpacts other components of the ecosystem (e.g. fungi, pro-tists, the plant cover), which might in return also influence thebacterial communities to some extent (Ekelund, Olsson and Jo-hansen 2003; Rajapaksha, Tobor-Kap�lon and Baath 2004; Strand-berg et al. 2006; Fernandez-Calvino et al. 2010; Fernandez-Calvinoand Baath 2016). In addition, long-term exposure to several per-sistent concentrations is expected to trigger serial successionsin the bacterial communities, resulting in selection of differentlifestyle strategies (e.g. opportunistic, generalists, specialists).This will ultimately reflect the different evolution trajectoriestaken by the former community over time, going beyond the ini-tial step of resistance development. All these intermixed and cu-mulative processeswill bemerged into a so-called ‘legacy’ effect,making any attempts to decipher the effects of each individualfactor complicated, if not impossible.

The Hygum fallow research site (Jutland, Denmark) offersa remarkable opportunity to shed light on this topic. Heavilycontaminated by repeated use of blue vitriol (CuSO4) a centuryago, the Hygum site harbors a well-documented copper gradi-ent which has been kept for research purposes (Strandberg et al.2006). In this study, we aimed to assess the long-term legacy ef-fect of copper contamination on the structure of the transcrip-tionally active soil bacterial community. In order to achieve thisgoal, an array of complementary techniques was used, includ-ing phospholipid fatty acid (PLFA) analysis, MicroResp and 16SrRNA gene transcript amplicon sequencing. Here, the terminol-ogy ‘transcriptional activity’ is used to avoid confusion with 16SrRNA gene sequencing results obtained from DNA instead ofRNA.While results obtained fromDNA account for the total bac-terial community, results obtained from RNA are showing thediversity of the bacteria with active transcription of 16S rRNAgene at sampling time, indicating the potential to produce pro-teins (transcriptionally active). The experiment was performedwith soil samples collected every 3 months between November2013 andAugust 2014, from three plots established at theHygumsite. We hypothesized that direct and indirect effects linked to acentury of exposure to different copper levels will act as a ‘legacyburden’, resulting in selection of specific bacterial strategies andlifestyles. We also intend to shed light on the contrasting diver-sity status reported in the literature by going one step furtherand use state-of-the-art sensitive RNA-based approaches to lookinto the transcriptionally active bacteria. We suspect that thedifferent legacy burden resulting from the copper concentrationgradient would reduce the richness of the transcriptionally ac-tive soil bacterial community, selecting divergent groups, withaltered functionalities. To better understand the dynamics atplay under copper legacies and the resulting bacterial strategies,we applied the macroecological concept of functional responsegroup (FRG), operationally defined as a group of organisms re-sponding similarly to environmental change (Lavorel and Gar-nier 2002).

MATERIAL AND METHODSExperimental setup and soil sampling

The 60 × 120 m flat experimental field is located at Hygum(55◦46’N, 9◦27’E), Denmark. Intensively studied since 1993, thecopper contamination gradient is well documented (Arthuret al. 2012; Berg et al. 2012) and a permanent 4 × 4 m gridallows the use of a transversal–longitudinal coordinate sys-tem. Within 80 m, copper concentration varies from ∼15 mgCu kg−1 soil (slightly above typical Danish background values)to more than 4000 mg Cu kg−1 soil. Soil samples were ob-tained within three 4 × 4 m areas corresponding to three lev-els of copper contamination along a longitudinal transect: con-trol (C), semi-contaminated (SC) and hotspot (HS). The con-trol plot was located outside the contaminated area and con-tained 13–16 mg Cu kg−1 soil. The SC and HS plots contained be-tween 400–500 mg Cu kg−1 soil and 3000–6000 mg Cu kg−1 soil,respectively.

Each plot was sampled every 3 months between November2013 and August 2014 resulting in a total of four sampling cam-paigns.

Detailed plot location, soil characterization andweather con-ditions in the samplingmonths are presented in supplementarydata (Table S1, Supporting Information). Six soil samples (4–14cm soil depth) per plot were collected at each of the four sam-pling campaigns (total n = 6 × 3 × 4 = 72 samples). Each sample

Page 86: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Nunes et al. 3

consisted of six subsamples collected at six randomly chosenlocations within each 16 m2 plot. All samples were manuallyhomogenized and subsamples of 2 g were immediately storedin 5mL of ice-cold LifeGuard Soil Preservation Solution (MOBIOlaboratories, Carlsbad, CA, USA) for RNA extraction. Sample ori-gin and nomenclature are presented in supplementary data (Ta-ble S2, Supporting Information).

Soil pH and moisture

Five grams of moist soil were used for measuring pH in wa-ter using a 1:5 dilution and an Inlab R©Routine-L (pH 0–14.0;100◦C) pH meter with a 3 M KCl electrolyte (Mettler Toledo,Glostrup, Denmark). Detailed protocol and results are pre-sented in supplementary data (Supplementary Technical Noteand Fig. S1, Supporting Information). Moisture content of eachsample was determined by drying 10 g of moist soil for 24h at 90◦C in a drying oven (Termaks AS, Bergen, Norway)and calculated by the difference of weight before and afterdrying.

Bioavailable copper and chemical soil metal analysis

The amount of bioavailable Cu ([Cubio]) in soil–water suspen-sions was determined in all soil samples using a Cu-specificPseudomonas fluorescens DF57-Cu15 bioreporter strain as de-scribed previously in Brandt, Holm and Nybroe (2008) with afew modifications (Supplementary Technical Note and Fig. S1,Supporting Information). The amount of bioavailable copperper mass of dry soil (Cubio-soil) was calculated assuming zerobioavailability of particle-associated copper (Brandt, Holm andNybroe 2006a).

Total soil Cu and total metal composition were measured byGraphite Furnace Atomic Absorption (GFAAS) (Brandt, Holm andNybroe 2008; Table S1, Supporting Information) and InductivelyCoupled Plasma Mass Spectrometry (ICP-MS) (Berg et al. 2010;Table S1, Supporting Information), respectively.

Catabolic response profiling (MicroResp)

Soil samples were sieved through a 2-mm sieve, adjusted toproper water content and pre-incubated for a week (according tothemanufacturer’smanual) prior to theMicroResp assay (Camp-bell et al. 2003).

Water and the following seven substrates were used: D-(+)-Galactose (GAL), L-Malic acid (MAL), gamma amino butyric acid(GABI), n-acetyl glucosamine (AGL), D(+) glucose (GLU), alpha ke-toglutarate (AKET) and citric acid (CIT) (Campbell et al. 2003; asmodified by Creamer et al. 2009). Percentage of CO2 was obtainedfrom respiration rates (μg CO2-C g−1 dry soil h−1) using a calibra-tion procedure where indicator gel microplates were incubatedat known CO2 concentrations for 3 h. Absorbance of the indi-cator gel was recorded at 590 nm (Chameleon FP plate reader;Hidex, Turku, Finland). For the assay, samples were incubatedin 96 deep well plates with water, and the seven selected sub-strates for 6 h, and respiration (CO2 production) was recordedby the change of color in the indicator gel microplate invertedabove the samples. Longer incubation time resulted in nearly to-tal de-coloration in all wells regardless of copper concentration.The final percentage of CO2 was obtained from the absorbancevalues as the fit between absorbance and CO2 follows the ex-pression A + B / (1 + D × Ai) (www.microresp.com; A: –0.235, B:–2.01, D: –10.8, Ai: normalized absorbance value from individualmicro well).

PLFA analysis

PLFAs were extracted from the soil samples according toJohansen et al. (2013). Briefly, 10 g of dry weight of soilwere placed in Teflon centrifuge tubes (Nalgene Nunc Int.,Oak Ridge, Tennessee, USA) and extracted in 10mL ofdichloromethane/methanol/citrate buffer (0.15M; pH 4.0; 1:2:0.8,vol:vol:vol). Supernatants from two consecutive extractionswere pooled and split into two phases by the addition of chlo-roform and citrate buffer. Polar lipids (in the lower phase) werepurified and derivatized according to Joner et al. (2001). Analyt-ical procedure, GC settings and affiliation of individual FAMEs(Fatty Acid Methyl Ester) to taxonomic groups were performedas described in Johansen et al. (2013). Results are presented insupplementary data (Fig. S2, Supporting Information).

RNA extraction and reverse transcription

Total RNA was extracted from the preserved 2 g of soil usingthe RNA PowerSoil R© Total RNA Isolation Kit (MOBIO laborato-ries, USA) according to the manufacturer’s instructions. TotalRNA extracts were resuspended in sodium citrate (1 mM), quan-tified using a QubitTM fluorometer (Invitrogen, by Life Technolo-gies, Nærum, Denmark) with a Quant-iT RNA Assay Kit (range5–100 ng; Invitrogen) and stored at –80◦C. Samples with totalRNA concentrations lower than 20 ng/μL were excluded fromthe experiment (Table S2, Supporting Information). The DNA-free Kit (Ambion, by Life Technologies) protocol was optimizedfor DNase treatment of the samples and cDNA was obtained us-ing the Roche Reverse transcription kit (Roche, Hvidovre, Den-mark) with random hexamers (100 μM; TAG Copenhagen). De-tailed protocols are provided in supplementary data (Supple-mentary Technical Note, Supporting Information).

Sequencing and trimming of 16S rRNA gene transcripts

PCR of a 1:10 dilution of the obtained cDNA (previ-ous section) was performed using the primers 341F(5’CCTACGGGRBGCASCAG-3’) and 806R (5’GGACTACNNGGGTATCTAAT-3’) (Sigma-Aldrich, Brøndby, Denmark) flankingthe V3 and V4 regions of the 16S rRNA gene (Yu et al. 2005; Berget al. 2012). A detailed protocol of the library construction ispresented in supplementary data (Supplementary TechnicalNote, Supporting Information). Paired-end sequencing of the16S rRNA gene transcripts was done using MiSeq reagent kit v2(500 cycles) and a MiSeq sequencer (Illumina Inc., San Diego,CA, USA). Sequences were demultiplexed using the MiSeq Con-troller Software, trimmed for diversity spacers using biopieces(www.biopieces.org), mate-paired using usearch v7.0.1090 (Edgar2010) and filtered using usearch—maxee 0.5. OTU clusteringwas performed using UPARSE (Edgar 2013) as recommended,in particular removing singletons after dereplication. usearchagainst the ‘Gold’ database included in the ChimeraSlayerpackage (Haas et al. 2011) was performed for chimera checkingusing the recommended settings, and OTU representativesequences were classified using Mothur v.1.25.0 wang() function(Schloss et al. 2009) at 0.8 confidence threshold. A phyloge-netic tree was constructed using QIIME wrappers for PyNAST(Caporaso et al. 2010a), FastTree (Price, Dehal and Arkin 2009)and filter˙alignment.py (Caporaso et al. 2010b). Alignments werebuilt against the 2011 version of Greengenes (DeSantis et al.2006) and filtered using allowed gap frac 0.999999 and threshold3.0. The rrarefy function of the vegan package (Oksanen et al.2013; the R 3.1.1 software (R Core Team 2014)) was used to rarefy

Page 87: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

4 FEMS Microbiology Ecology, 2016, Vol. 92, No. 11

the generated annotation tables to 15 000 reads per sample.Samples with lower number of reads were excluded from theanalysis (Table S2, Supporting Information).

The potential biases introduced by differences in the 16SrRNA gene copy number of different bacterial groups for abso-lute quantification are well known, and have been actively dis-cussed by the scientific community in the last decades (Klappen-bach, Dunbar and Schmidt 2000; Langille et al. 2013; Vetrovskyand Baldrian 2013). Since we work with cDNA in this study, wekept the 16S rRNA gene transcript counts without normalizationin order to perform a comparative analysis, as we are not inter-ested in absolute quantification, and also because the differen-tial expression levels between species in complex communitiesare unknown and difficult to standardize.

Statistical analysis of 16S rRNA gene transcripts

Rarefaction curves (Fig. S3, Supporting Information), alpha-diversity (richness, Simpson’s diversity, Shannon’s diversity,Shannon’s evenness and Chao-1 index) and paired-group clus-ter analysis (Bray–Curtis dissimilarity index) at genus level wereobtained using the PAST software ver.2.17 (Hammer, Harper andRyan 2001). Differences in alpha diversity were determined us-ing two-way ANOVA (SigmaPlot 12.5). Multivariate analysis us-ing the vegan package was applied to the samples for beta-diversity analysis at genus level. Briefly, the rarefied and log10-transformed compositional dataset was subjected to a redun-dancy analysis (RDA) using seven environmental factors: totaland bioavailable copper, pH, moisture, season, substrate use(based on MicroResp profiles) and bacterial biomass (based onPLFAs). The significance of the model, of the RDA axes and ofeach of the used factors was tested by ANOVA (respectively,200, 100–300 and 10 000 permutations). A parallel PERMANOVAusing 10 000 permutations and Bray–Curtis dissimilarity indexwas performed using total copper, sampling month or the inter-action of both, as factors (vegan package; Table S3, SupportingInformation).

For selection of taxa (genus level) with significant changesin prevalence according to copper contamination level, a nega-tive binomial generalized linear model (nbGLM) was applied tothe dataset (mvabund package; Wang et al. 2012) and analysis ofdeviance (P < 0.01) was used. The selected taxa were plotted ina heatmap of centered and scaled counts, using a combinationof the following R packages: gplots (Warnes et al. 2015), vegan, ri-oja (Juggins 2014) and RcolorBrewer (Neuwirth 2007), as previouslydescribed (Jacquiod et al. 2016).

SIMPER analysis (similarity percentage) using Bray–Curtisdissimilarity index was then performed in order to determinehow much dissimilarity (percentage) was overall explained bythem (PAST ver. 2.17).

Groups of organisms that change their activity pattern sim-ilarly according to the copper contamination level were classi-fied in FRGs using cluster analysis (pvclust package; Suzuki andShimodaira 2006; complete agglomeration method, Euclideandistance and bootstrapping 10 000 times) on the centered andscaled counts of the selected taxa, and cut-off between 10.5 and12.5 according to the contamination level.

RESULTSMicroResp and PLFA profiles

Overall, the catabolic response profiles from the three plotswere quite different, mostly due to a progressively lower

mineralization rate of GAL, GABI, AGL, GLU and AKET with in-creasing copper contamination (P < 0.001; Fig. S1, SupportingInformation). MAL mineralization was also reduced in the twocontaminated plots (SC and HS) but at a lower degree (P = 0.01).The turnover of other substrates, such as CIT, was unimpairedby the Cu concentrations (P = 0.71).

The results of the PLFA analysis showed a decrease in bothbacterial and especially fungal biomass with increasing cop-per concentration (Fig. S2, Supporting Information). Moreover,the effect of copper seemed to affect the Gram-negative bacte-ria most, having decreased levels of total PLFA g−1 of dry soilin the contaminated plots (SC and HS) compared to the non-contaminated soil. This pattern was less pronounced for Gram-positive bacteria. The concentration of total PLFA was relativelystable along the four seasons.

Diversity of 16S rRNA gene transcripts

A total of 3 016 459 partial 16S rRNA gene transcript sequenceswere obtained from 69 out of 72 soil samples, ranging from 16031 to 92 777 sequences per sample (Table S2, Supporting Infor-mation). Alpha-diversity and rarefaction curves were calculatedand are presented, respectively in Fig. 1 and Fig. S3 (SupportingInformation). Richness and evenness of the transcriptionally ac-tive bacterial community (Richness, Shannon’s evenness, Chao-1, Shannon and Simpson diversity indices) decreased with in-creasing copper contamination (ANOVA; P < 0.001), but did notchange with season (ANOVA; P > 0.05; Fig. 1). The combined ef-fect of copper and sampling month also significantly affectedthe alpha-diversity indices (ANOVA; P < 0.001) in the three con-tamination levels, indicating seasonal variation of Cu effects(Fig. 1).

Changes in the structure of the transcriptionally active soilbacterial communities were investigated by multivariate beta-diversity analysis. Results confirmed the previous observations,indicating copper as the main driving force in the system (Figs 2and 3). RDAwith the included factors (total and bioavailable cop-per, pH, moisture, season, substrate use and bacterial biomass)explained 69.0% of the total variance, ofwhich 57.8% is describedin the first two axes (respectively, 46.0% and 11.8%; Fig. 2). Themost explanatory variable was bioavailable copper (42.2%; Ta-ble S4, Supporting Information), which was associated with thefirst axis, followed by total copper (16.5%; Table S4, SupportingInformation) associated with both first and second axis. A sea-sonal effect was also significant, but at a much lower degree(6.0%; Table S4, Supporting Information), being mostly associ-ated with the third axis (5.8%), which segregated summer fromthe other time points (Fig. 2). Despite being significant, pH onlycontributed 2.8% for the total explained variance (Fig. 2; Table S4,Supporting Information). PERMANOVA on Bray–Curtis dissim-ilarity index confirmed the observed trends, indicating bothcopper (60.3%) and season (13.5%) as significantly impactingthe community profiles (Table S3, Supporting Information; P <

0.001). The overruling effect of copper was assessed by a clusteranalysis with Bray–Curtis dissimilarity index (Fig. 3; copheneticcorrelation coefficient = 0.87), revealing a clear first separationof the HS samples from the others, while the second dichotomysegregated the C samples from the SCs. This clear copper gradi-ent clustering pattern was mostly due to major phylum-drivendifferences, namely a progressive increase in the relative abun-dance of the transcriptionally active Acidobacteria and Nitro-spira phyla, associated with a steady decline of Actinobacte-ria, Proteobacteria and Verrucomicrobia along the copper gradi-ent (Fig. 3). Hence, long-term exposure to copper promoted the

Page 88: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Nunes et al. 5

Figure 1. Alpha-diversity: Simpson (A), Shannon (B), Evenness (C), Richness (D) and Chao-1 (E) indices calculated from 16S rRNA gene transcript sequencing resultsfor the different copper contamination levels (C, SC and HS) and sampling months (Nov13, Feb14, May14 and Aug14). Error bars represent standard error of average

values (4 < n < 6). The P-values obtained in the ANOVAs (α = 0.05) performed to each of the alpha-diversity indices, using copper, sampling month or the interactionof both as factors are presented in (F). Stars represent the level of significance of the differences found in the diversity indices accordingly to the code: (–) > 0.05, (∗)0.05 ≤ P-values < 0.01; (∗∗) 0.01 ≤ P-values< 0.001; (∗∗∗) P-values ≤ 0.001.

differentiation of three distinct transcriptionally active bacterialcommunities, with major phylum-driven differences.

Functional response groups

We investigated abundance shift patterns at the genus level be-tween copper contamination levels (Fig. 4). A total of 124 outof 353 genera showed significantly different prevalence accord-ing to the three copper contamination levels (nbGLM, P < 0.01;Fig. 4). These genera accounted for 71.1% of the Bray–Curtis dis-similarity (SIMPER analysis; Fig. 4) and represented 51.9%, 53.9%and 57.8% of the total genera abundance respectively in C, SCand HS plots (Fig. 5, panel A).

Six FRGs were identified corresponding to six different toler-ance ranges towards copper (Fig. 4). FRG1 and FRG2 presentedhigher relative abundance in the C plot. FRG1 corresponded tothe most sensitive group, with 50% relative abundance decreasein the intermediate copper level (SC plot). On the other hand,FRG2 showed a slightly higher tolerance, with a progressive rel-ative abundance decrease along copper level (Fig. 5, panel A).These two FRGs presented the highest richness levels, but withdifferent compositions, as FRG1 consisted mainly of Proteobac-terial and Actinobacterial genera, while FRG2 was mostly con-stituted by Verrucomicrobial and Acidobacterial genera (Fig. 5,panel B).

FRG3 and FRG4 were associated with an intermediate toler-ance to copper contamination (Figs 4 and 5). FRG3 had stable rep-resentation in the C and SC plots and dramatically dropped to athird in the HS plot (Fig. 5, panel A). FRG4 corresponded to thegroup of organismswith enhanced transcriptional activity in the

SC plot and low relative abundance in the HS plot (Fig. 5, panelA). Both FRGs were mostly constituted by Proteobacterial gen-era (Fig. 5, panel B). However, FRG3 presented a higher richnessthan FRG4, including members from several other phyla (Bacte-riodetes, Verrucomicrobia, Firmicutes, Chloroflexi, Armatinon-adetes and Acidobacteria). Besides Proteobacteria, FRG4 wasonly constituted by Acidobacteriamembers. However, 70% of se-quences belonging to this group were affiliated to Rhodomicro-bium and an unclassified group of Rhodospirilllales (Alphapro-teobacteria, Figs 4 and 5).

FRG5 and FRG6 included the taxa with higher representa-tion (43.6%) in the HS plot (Fig. 5, panel A). While members ofFRG6 were particularly abundant in the HS plot, taxa in FRG5displayed a steady increase along the copper gradient (Fig. 5,panel A). FRG5 only included five taxa, four Acidobacterial gen-era (34.1% of the FRG5 proportion) and one group of unclas-sified Betaproteobacteria (65.9% of the FRG5 proportion). FRG6was more diverse than FRG5, being dominated by Acidobacteria(52.3% of the FRG6 proportion) followed by Nitrospira, Alphapro-teobacteria and Actinobacteria (Fig. 5, panel B).

DISCUSSIONCopper as an ecosystem legacy burden

We present here the first RNA-based study addressing the ef-fect of long-term copper exposure on the structure of thetranscriptionally active soil bacterial community. The usedcomplementary methodologies (PLFAs, MicroResp profiles and16S rRNA gene transcript sequencing) allowed us to have a

Page 89: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

6 FEMS Microbiology Ecology, 2016, Vol. 92, No. 11

Figure 2. 2D representation of redundancy analysis (RDA) performed to rarefied and log10-transformed 16S rRNA amplicon sequencing data using total and bioavailablecopper, soil moisture, substrate use, bacterial biomass, pH and season as explanatory factors (n = 7). Significance of the model, axes and factors was determined by

ANOVA (respectively, 200, 100–300, 10 000 permutations; α = 0.05). Only significant factors (P < 0.05) are depicted. Stars stand for the level of significance according tothe code: (∗) 0.05 ≤ P-values < 0.01; (∗∗) 0.01 ≤ P-values< 0.001; (∗∗∗) P-values ≤ 0.001.

clearer picture of the overruling copper legacy effect on the bac-terial communities. As expected, the results showed that cop-per is the main driving force in the system, resulting in a stronglegacy burden seen at many levels. Indeed, copper has affectedmany components of the ecosystem, including soil properties(e.g. compaction, C/N ratio, pH) but also plant cover (e.g. absentin the hotspot). This is coherent with a previous study report-ing Cu-induced changes in organic matter content through al-teration of the plant cover (Sauve 2006). Furthermore, the mi-crobial component was also strongly impacted, both in terms ofdiversity and numbers.

Our results point to a decline in bacterial biomass as doc-umented by the PLFA analysis. Despite being more tolerantto heavy metals than bacteria (Baath 1989; Rajapaksha, Tobor-Kap�lon and Baath 2004), fungal biomass also dropped with in-creasing concentrations, which is in agreement with the previ-ously attested sensitivity to copper compounds (e.g. Bordeauxmixture; Baath 1989; Rusjan 2012). This legacy burden effect ob-served on the overall components of the site tends to indicatethat ecosystem functioning is impacted.

A significant decrease of all alpha-diversity indices along thecopper gradient, reflecting a loss in both richness and even-ness of the transcriptionally active bacterial communities, es-pecially at the highest Cu concentration, was also observed inthis study. Consequences of diversity loss on ecosystem func-tioning are still unclear and contrasted in the literature depend-ing on the traits considered. For instance, soil bacterial richnesshas been linked to the capacity to cope with disturbance, thanksto functional redundancy (Fetzer et al. 2015), and some deleteri-ous effects of lowered diversity have been detected in specifickey functions such as denitrification (Philippot et al. 2013). Ourresults indicate that the transcriptionally active bacterial com-munity had a decline in its diversity over a century of exposure,being unable to recover to levels close to the non-contaminatedareas and pointing to a potential effect on the functional-ity in the ecosystem. This observation is correlating with the

accumulation of total carbon and nitrogen in the most con-taminated soil (Table S1, Supporting Information) and conse-quently with potential loss or reduction of functions relatedwith organic matter degradation. This is further supported bythe catabolic response profiles obtained, showing a progressivedecreased mineralization rate of GAL, GABI, AGL and GLU withincreasing copper level, which is indicative of slower metabolicturnover activities. Although linking specific functions to taxo-nomical changes remains highly speculative, our results clearlyindicate a concomitant microbial biomass and biodiversity loss,together with catabolic activity reduction and higher C/N ra-tio with increasing copper level. This could indicate the spe-cific loss/reduction of key catabolic functions due to the dele-terious effect of copper on soil microbes, resulting in slower car-bon turnover. These observations are clearly highlighting thepresence of a real copper concentration-driven legacy burdenaffecting the entire ecosystem.

The copper effect on diversity and community structure

Previously reported effects of copper on soil microbial com-munities have been contradictory, probably due to majorinfluences of exposure time and resolution of the employedtechniques (Dell’Amico et al. 2008; Ippolito, Ducey and Tarkalson2010; Wakelin et al. 2010; Berg et al. 2012). Berg et al. (2012) pre-sented the first study addressing the effect of copper on soil bac-terial communities using high-throughput DNA sequencing ofthe 16S rRNA gene at the Hygum site (Berg et al. 2012). Theirresults showed maintenance of the OTU richness with increas-ing copper. Considering that both studies were performed at thesame site and using similar levels of contamination, but target-ing different fractions of the bacterial community, the differ-ences observed could result from the presence of a large fractionof inactive/latent bacteria in the copper-contaminated plots,especially in the hotspot. The Actinobacteria phylum, for exam-ple, illustrates this assumption, as it was not previously affected

Page 90: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Nunes et al. 7

Figure 3. Cluster analysis of the 16S rRNA gene transcripts sequencing results using the Bray–Curtis dissimilarity index, and stacked horizontal bars depicting therelative abundance of the taxa present in the samples. Colors in the stacked horizontal bars correspond to the phylum level classification of each taxon, except for

Proteobacteria which were classified to class level. Stars depict the significance of the differences in the taxon proportion found between each copper contaminationlevel: (∗) 0.05 ≤ P-values < 0.01; (∗∗) 0.01 ≤ P-values< 0.001; (∗∗∗) P-values ≤ 0.001.

by increasing copper concentration (Berg et al. 2012), but was sig-nificantly impacted in this study. Nevertheless, it should also benoted that the lack of effects of copper on OTU richness from theprevious DNA-based survey may possibly be explained by insuf-ficient community coverage (i.e. insufficient sequencing depth).This notion is indeed supported by results from a subsequentfollow-up study showing a decrease of Pseudomonas spp. rich-ness with increasing copper contamination at the Hygum site(Thorsen, Brandt and Nybroe 2013).

In terms of community structure, our results pointed to ma-jor phylogenetic changes (e.g. phyla and classes) of the tran-scriptionally active community along the copper gradient, corre-sponding to an increased prevalence of Nitrospira andAcidobac-teria, while Proteobacteria, Actinobacteria and Verrucomicrobiawere receding. These results are in accordance with previousstudies at DNA level in the same site (Berg et al. 2012; Naveedet al. 2014). However, Actinobacteria members were reported tosurvive exposure to several heavy metals (Epelde et al. 2015),

including copper (Lejon et al. 2008), while on the other handAcidobacterial-related taxa are known to be sensitive (Wakelinet al. 2010). Considering the different setups and methodologiesused in the literature, we cannot exclude that the differences ob-served are related to other parameters (e.g. the exposure time,the soil characteristics, the resolution of the molecular tech-niques and potential synergistic/antagonistic effects of other as-sociated heavy metals).

Nevertheless, observing such changes in representation oflarge taxonomic groups is fitting our initial hypothesis towarddifferential selection of microbial strategists over time. Indeed,many important ecological traits related to trophic lifestyles (e.g.oligotrophs and copiotrophs) and strategies (specialists and gen-eralists) have been previously reported to correlate with largetaxonomical groups as family, class or phylum (Fierer, Bradfordand Jackson 2007; Philippot et al. 2009, 2010; Fierer et al. 2012;Placella, Brodie and Firestone 2012; Ramirez, Craine and Fierer2012; Barnard, Osborne and Firestone 2013; Trivedi, Anderson

Page 91: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

8 FEMS Microbiology Ecology, 2016, Vol. 92, No. 11

Figure 4. Heatmap plotting the relative abundance of the 124 taxa differing significantly with copper contamination level. Taxa were selected from the annotationtable fitted to a negative binomial generalized linear model (nbGLM) using analysis of deviance (α = 0.05). Presented data are centered and scaled to the average ofeach taxon abundance and each vertical bar corresponds to a sample. Sample clustering (constrained average agglomerationmethod) and SIMPER analysis were basedon Bray–Curtis dissimilarity index and taxa clustering was based on Euclidean distance (complete agglomeration method; bootstrap = 10 000). Functional response

groups (FRGs; different color in the lateral dendrogram) were identified using a distance cut-off of 10.5 for the control and 12.5 for both semi-contaminated and hotspotsamples. The right vertical bars in the heatmap correspond to the phylum level classification of each taxon with exception of Proteobacteria which were classified toclass level and to the FRGs.

and Singh 2013). In this way, our results tend to indicate thatthe initial selection of copper resistance traits evolved, and wasintegrated into the bacterial lifestyle strategies after a century ofexposure to the copper legacy burden.

Interestingly, it has been shown that different groups withinActinobacteria and Acidobacteria phyla are responding withcontrasting patterns to heavy metal contamination (Berg et al.2012; Epelde et al. 2015). This emphasizes that phylogenetic-driven strategies should not be generalized, and accurate deter-mination should be complemented by other approaches in thiskind of studies. Themacroecological concept of FRGs has provento be a useful complementing tool for this purpose, allowing theidentification of bacterial taxa with different levels of tolerancetoward copper. Indeed, six FRGs were identified in our setup cor-responding to three main levels of sensitivity/tolerance.

The sensitive fraction of the community

FRGs 1 and 2 grouped the taxa with highest prevalence inthe control plot and a high sensitivity to the two tested cop-per concentrations (FRG1) or only to the highest one (FRG2).Those two groups were the most diverse of all six, attestingthe decrease in biodiversity along the increasing copper gradi-ent. Moreover, several taxa with previously reported sensitivityto either heavy metals in general (Epelde et al. 2015) or specif-ically to copper (Giller, Witter and McGrath 1998; Brandt et al.2006b; Berg et al. 2012) were identified within these FRGs, in-cluding the Pseudomonas and the Ilumatobacter genera and theActinobacteria orders Acidimicrobiales and Solirubrobacterales(FRG1). Acidimicrobiales are known to inhabit acidic environ-ments (Clark and Norris 1996), correlating with the higher pH

Page 92: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Nunes et al. 9

Figure 5. Relative abundance (A) and taxonomical affiliation (B) of the functional response groups (FRGs) present in the control, the semi-contaminated and the

hotspot plots. Percentages in vertical staked bars (A) correspond to the total abundance represented by the six FRGs in each plot. Colors in the stacked horizontal barscorrespond to the phylum level classification of each taxon, with exception of Proteobacteria which were classified to class level. Error bars represent standard errorof average values (22< n < 24 in A and n = 69 for B).

in the copper-polluted plots, especially the hotspot. Verrucomi-crobia constituted a large portion of FRG2 and showed a pro-gressive decrease in relative abundance with increasing copperconcentration as previously shown for DNA-based communitystructure (Berg et al. 2012).

FRGs with intermediate copper tolerance

FRG3 and FRG4 grouped the taxa with intermediate tolerance tocopper. Organismswith higher prevalence in both C and SC plotsand sudden decrease in the HS constitute FRG3, whereas FRG4contained the bacteria with higher prevalence only at the inter-mediate contamination level. Between those two groups, severalbacteria belonging to the Rhizobiales (e.g. Rhizobium, Hyphomi-crobium, Rhodomicrobium, whose members are typically nitrogenfixers) were found. Rhizobium species have previously been sug-gested as bioindicators of metal toxicity (Giller, Witter and Mc-Grath 1998). Our results showed the specific potential activityenhancement of these bacteria at quite high copper concentra-tions, reflecting a larger tolerance range for thesemembers com-pared towhatwas previously reported (Tom-Petersen et al. 2003).

Autotrophic nitrifiers have also been suggested as gen-eral toxicity biosensors (Tanaka, Taguchi and Utsumi 2002)presenting different levels of sensitivity to copper in soils

(10–1000mgkg−1) depending on experimental conditions(Dell’Amico et al. 2008). Corroborating these previous findings,the ammonia-oxidizing Nitrosospira was found in FRG4 in ourstudy, indicating enhancement at copper concentrations ashigh as 500mgkg−1 but significant reduction at concentrationsabove 3000mgkg−1. This is consistent with a previous studyat the Hygum site showing decreased potential nitrificationactivity (EC50 = 2060mgkg−1) and tolerance developmentamong ammonia oxidizers present at Cu concentrations above1518mgkg−1 (Mertens et al. 2010).

FRGs showing extreme copper tolerance

Associated with the highest tolerance toward extreme copperconcentrations, FRG5 and FRG6 represent the hardiest groupsfound in this study. Several Acidobacteria groups could be iden-tified among these two FRGs as well as Alphaproteobacteria(FRG6), Nitrospira (FRG6) and unclassified groups of Betapro-teobacteria (FRG5) and Actinobacteria (FRG6).

The genus Nitrospira was the only representative of the Ni-trospira phylum in FRG6. This genus contains the main nitrite-oxidizing bacteria and hence, plays an important role in thenitrification process (Edwards et al. 2013). Despite being stud-ied mostly for its contribution in the nitrogen cycle, Nitrospira

Page 93: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

10 FEMS Microbiology Ecology, 2016, Vol. 92, No. 11

members are also known to have several transporters and effluxsystems (Lucker et al. 2010) that may provide them a selectiveadvantage for survival under extreme copper levels.

Potential copper pollution bioindicators

It is interesting to note that the Acidobacteria phylum was theonly phylumwith representatives in all the FRGs detected in thisstudy. This emphasizes the functional heterogeneity of this phy-lum and may explain the contradictory copper tolerances re-ported in previous studies for the taxa within this group (Berget al. 2012; Epelde et al. 2015). Despite being acknowledged asubiquitous, diverse and abundant in soils (Naether et al. 2012;Catao et al. 2014), very little information is available regardingthe functionality of Acidobacteria members. Our results sug-gest a gradual involvement and loss of different Acidobacteriasubgroups along the copper gradient, which was indicated, tosome extent, by Berg et al. (2012) at the DNA level. The subgroupscan be ranked from the most sensitive (e.g. Gp25 in FRG1, fol-lowed by Gp4 and Gp10 in FRG2) to intermediate tolerant (e.g.Gp5 and Gp22 in FRG3, followed by Gp2 and Gp13 only active inthe semi-contaminated plot) and the highly tolerant subgroupswith potential activity in the hotspot (e.g. the gradually increas-ing Gp1, Gp3 and Gp15 along the copper gradient, as well as Gp6,Gp11 and Gp12 only in the hotspot). Acidobacteria Gp1, Gp4 andGp6 are the most common Acidobacterial groups in soil envi-ronments (Naether et al. 2012). While Gp6 was not classified intoany of the FRGs identified in this study, the others presentedopposite tolerance patterns, with Gp4 receding with increas-ing copper and Gp1 being promoted. Hence, our results suggestthat these different Acidobacterial groups could represent ex-cellent bioindicators of copper-related pollution in soil, as dif-ferential change in their transcriptional activity can be directlylinked to different long-term exposure levels and specific strate-gies. In addition, the genus Nitrospira also seems to be a goodbioindicator candidate, as it was important in discriminating thehighly copper-contaminated soils from the semi-contaminatedand control soils.

SUPPLEMENTARY DATA

Supplementary data are available at FEMSEC online.

ACKNOWLEDGEMENTS

The authors acknowledge the support from the Villum Centerof Excellence for Environmental and Agricultural Microbiology(CREAM).

FUNDING

This work was supported by the People Program (Marie CurieActions FP7-PEOPLE-2011-ITN) of the European Union’s SeventhFramework Program FP7/2007-2013, under the project TRAIN-BIODIVERSE, Training for functional soil microorganism biodi-versity [REA 289949].

Conflict of interest. None declared.

REFERENCES

Arguello JM, Raimunda D, Padilla-Benavides T. Mechanisms ofcopper homeostasis in bacteria. Front Cell Infect Microbiol2013;5:73.

Arthur E, Moldrup P, Holmstrup M et al. Soil microbial and phys-ical properties and their relations along a steep copper gra-dient. Agr Ecosyst Environ 2012;159:9–18.

Baath E. Effects of heavy metals in soil on microbial processesand populations (a review).Water Air Soil Poll 1989;47:335–79.

Barnard RL, Osborne CA, Firestone MK. Responses of soil bac-terial and fungal communities to extreme desiccation andrewetting. ISME J 2013;7:2229–41.

Berg J, Brandt KK, Al-Soud WA et al. Selection for Cu-tolerantbacterial communities with altered composition, but unal-tered richness, via long-term Cu exposure. Appl Environ Mi-crob 2012;78:7438.

Berg J, Thorsen MK, Holm PE et al. Cu exposure under field con-ditions coselects for antibiotic resistance as determined bya novel cultivation-independent bacterial community toler-ance assay. Environ Sci Technol 2010;44:8724–8.

Brandt KK, Frandsen RJN, Holm PE et al. Development ofpollution-induced community tolerance is linked to struc-tural and functional resilience of a soil bacterial communityfollowing a five-year field exposure to copper. Soil Biol Biochem2010;42:748–57.

Brandt KK, Holm PE, Nybroe O. Bioavailability and toxicity ofsoil particle-associated copper as determined by two Pseu-domonas fluorescens biosensor strains. Environ Toxicol andChem 2006a;25:1738–41.

Brandt KK, Holm PE, Nybroe O. Evidence for bioavailable Cu-dissolved organic matter complexes and transiently in-creased Cu bioavailability in manure-amended soils as de-termined by bioluminescent bacterial biosensors. Environ SciTechnol 2008;42:3102–8.

Brandt KK, Petersen A, Holm PE et al. Decreased abundance anddiversity of culturable Pseudomonas spp. populations withincreasing copper exposure in the sugar beet rhizosphere.FEMS Microbiol Ecol 2006b;56:281–91.

Campbell CD, Chapman SJ, Cameron CM et al. A rapid microtiterplate method to measure carbon dioxide evolved from car-bon substrate amendments so as to determine the physio-logical profiles of soil microbial communities by using wholesoil. Appl Environ Microb 2003;69:3593–9.

Caporaso JG, Bittinger K, Bushman FD et al. PyNAST: a flexibletool for aligning sequences to a template alignment. Bioinfor-matics 2010a;26:266–7.

Caporaso JG, Kuczynski J, Stombaugh J et al. QIIME allows anal-ysis of high-throughput community sequencing data. NatMethods 2010b;7:335–6.

Catao ECP, Lopes FAC, Araujo JF et al. Soil acidobacterial 16S rRNAgene sequences reveal subgroup level differences betweenSavanna-like Cerrado and Atlantic Forest Brazilian Biomes.Int J Microbiol 2014;2014:156341.

Clark DA, Norris PR. Acidimicrobium ferrooxidans gen. nov., sp.nov.: mixed-culture ferrous iron oxidation with Sulfobacillusspecies. Microbiology 1996;142:785–90.

Cowan DA, Ramond JB, Makhalanyane TP et al.Metagenomics ofextreme environments. Curr Opin Microbiol 2015;25:97–102.

Creamer RE, Bellamy P, Black HIJ et al. An inter-laboratory com-parison of multi-enzyme and multiple substrate-inducedrespiration assays to assessmethod consistency in soil mon-itoring. Biol Fert Soils 2009;45:623–33.

Dell’Amico E, Mazzocchi M, Cavalca L et al. Assessment of bac-terial community structure in a long-term copper-pollutedex-vineyard soil. Microbiol Res 2008;163:671–83.

DeSantis TZ, Hugenholtz P, Larsen N et al. Greengenes, achimera-checked 16S rRNA gene database and workbenchcompatible with ARB. Appl Environ Microb 2006;72:5069–72.

Page 94: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Nunes et al. 11

Edgar RC. Search and clustering orders of magnitude faster thanBLAST. Bioinformatics 2010;26:2460–1.

Edgar RC. UPARSE: highly accurate OTU sequences from micro-bial amplicon reads. Nat methods 2013;10:996–8.

Edwards TA, Calica NA, Huang DA et al. Cultivation and charac-terization of thermophilic Nitrospira species from geother-mal springs in the US Great Basin, China, and Armenia. FEMSMicrobiol Ecol 2013;85:283–92.

Ekelund F, Olsson S, Johansen A. Changes in the successionand diversity of protozoan and microbial populations insoil spiked with a range of copper concentrations. Soil BiolBiochem 2003;35:1507–16.

Epelde L, Lanzen A, Blanco F et al. Adaptation of soil micro-bial community structure and function to chronicmetal con-tamination at an abandoned Pb-Zn mine. FEMS Microbiol Ecol2015;91:1–11.

Fernandez-Calvino D, Baath E. Interaction between pH and Cutoxicity on fungal and bacterial performance in soil. Soil BiolBiochem 2016;96:20–9.

Fernandez-Calvino D, Martın A, Arias-Estevez M et al. Microbialcommunity structure of vineyard soils with diferente pH andcopper content. Appl Soil Ecol 2010;46:276–82.

Fetzer I, Johst K, Schawe R et al. The extent of functional redun-dancy changes as species’ roles shift in different environ-ments. P Natl Acad Sci USA 2015;112:14888–93.

Fierer N, Bradford MA, Jackson RB. Toward an ecological classifi-cation of soil bacteria. Ecology 2007;88:1354–64.

Fierer N, Lauber CL, Ramirez KS et al.Comparativemetagenomic,phylogenetic and physiological analyses of soil microbialcommunities across nitrogen gradients. ISME J 2012;6:1007–17.

Gans J, Wolinsky M, Dunbar J. Computational improvements re-veal great bacterial diversity and high metal toxicity in soil.Science 2005;309:1387–90.

Gillan DC, Roosa S, Kunath B et al. The long-term adaptationof bacterial communiies in metal-contaminated sediments:a metaproteogenomic study. Environ Microbiol 2015;17:1991–2005.

Giller KE, Witter E, McGrath SP. Toxicity of heavy metals to mi-croorganisms and microbial processes in agricultural soils: areview. Soil Biol Biochem 1998;30:1389–414.

Griffiths BS, Philippot L. Insights into the resistance and re-silience of the soil microbial community. FEMS Microbiol Rev2013;37:112–29.

Haas BJ, Gevers D, Earl AM et al. Chimeric 16S rRNA sequenceformation and detection in Sanger and 454-pyrosequencedPCR amplicons. Genome Res 2011;21:494–504.

Hammer Ø, Harper DAT, Ryan PD. PAST: paleontological statis-tics software package for education and data analysis.Palaeontol Electron 2001;4:9.

Hobman J, Yamamoto K, Oshima T. Transcriptomic responsesof bacterial cells to sublethal metal ion stress. In: NieD, Silver S (eds). Molecular Microbiology of Heavy Metals.Berlin/Heidelberg: Springer, 2007, 73–115.

Ippolito JA, Ducey T, Tarkalson D. Copper impacts on corn,soil extractability, and the soil bacterial community. Soil Sci2010;175:586–92.

Jacquiod S, Stenbæk J, Santos SS et al. Metagenomes providevaluable comparative information on soil microeukaryotes.Res Microbiol 2016;167:436–50.

Johansen A, Carter MS, Jensen ES et al. Effects of digestatefrom anaerobic fermented cattle slurry and plant materialson soil microbiota and fertility. Appl Soil Ecol 2013;16:297–305.

Joner EJ, Johansen A, Loibner AP et al. Rhizosphere effects andmicrobial community structure, and dissipation and tox-icity of PAH in spiked soil. Environ Sci Technol 2001;835:2773–7.

Juggins S. rioja: Analysis of Quaternary Science Data, R package ver-sion (0.9-3), 2014, http://cran.r-project.org/package=rioja (22August 2016, date last accessed).

Klappenbach JA, Dunbar JM, Schmidt TM. rRNA operon copynumber reflects ecological strategies of bacteria. Appl Envi-ron Microb 2000;66:1328–33.

Langille MGI, Zaneveld J, Caporaso JG et al. Predictive func-tional profiling of microbial communities using 16SrRNA marker gene sequences. Nat Biotechnol 2013;9:814–21.

Lavorel S, Garnier E. Predicting changes in community composi-tion and ecosystem functioning from plant traits. Funct Ecol2002;16:545–56.

Lejon DPH, Martins JMF, Leveque J et al. Copper dynamics andimpact onmicrobial communities in soils of variable organicstatus. Environ Sci Technol 2008;42:2819–25.

Li J, Ma Y-B, Hu H-W et al. Field-based evidence for consistentresponses of bacterial communities to copper contaminationin two contrasting agricultural soils. Front Microbiol 2015;31:1–11.

Lucker S, Wagner M, Maixner F et al. A Nitrospira metagenomeilluminates the physiology and evolution of globally im-portant nitrite-oxidizing bacteria. P Natl Acad Sci USA2010;107:13479–84.

Mackie KA, Muller T, Zikeli S et al. Long-term copper applicationin an organic vineyard modifies spatial distribution of soilmicro-organisms. Soil Biol Biochem 2013;65:245–53.

Mackie KA, Schmidt HP, Muller T et al. Cover crops influence soilmicroorganisms and phytoextraction of copper from a mod-erately contaminated vineyard. Sci Total Environ 2014;500-501:34–43.

Mertens J, Wakelin SA, Broos K et al. Extent of copper toleranceand consequences for functional stability of the amonia-oxidizing community in long-term copper contaminatedsoils. Environ Toxicol Chem 2010;29:27–37.

Naether A, Foesel BU, Naegele V et al. Environmental factors af-fect acidobacterial communities below the subgroup levelin grassland and forest soils. Appl Environ Microb 2012;78:7398–406.

Naveed M, Moldrup P, Arthur E et al. Simultaneous loss ofsoil biodiversity and functions along a copper contamina-tion gradient: when soil goes to sleep. Soil Sci Soc Am J2014;78:1239–50.

Neuwirth E. RColorBrewer: ColorBrewer palettes. R pack-age version (1.0-2), 2007, http://cran.r-project.org/web/packages/RColorBrewer/index.html (22 August 2016, datelast accessed).

Oksanen J, Blanchet FG, Kindt R et al. vegan: Commu-nity Ecology Package. R package version 2.0-10, 2013,http://CRAN.R-project.org/package=vegan (22 August 2016,date last accessed).

Philippot L, Andersson SG, Battin TJ et al. The ecological co-herence of high bacterial taxonomic ranks. Nat Rev Microbiol2010;8:523–9.

Philippot L, Cuhel J, Saby NP et al. Spatial patterns of bacte-rial taxa in nature reflect ecological traits of deep branchesof the 16S rRNA bacterial tree. Environ Microbiol 2009;11:1518–26.

Philippot L, Spor A, Henault C et al. Loss in microbial diversityaffects nitrogen cycling in soil. ISME J 2013;7:1609–19.

Page 95: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

12 FEMS Microbiology Ecology, 2016, Vol. 92, No. 11

Placella SA, Brodie EL, Firestone MK. Rainfall-induced car-bon dioxide pulses result from sequential resuscitation ofphylogenetically clustered microbial groups. P Natl Acad SciUSA 2012;109:10931–6.

Price MN, Dehal PS, Arkin AP. FastTree: computing large mini-mum evolution trees with profiles instead of a distance ma-trix. Mol Biol Evol 2009;26:1641–50.

R Development Core Team. R: A Language and Environmentfor Statistical Computing. Vienna, Austria: R Foundationfor Statistical Computing, 2014. ISBN 3-900051-07-0,http://www.R-project.org/ (22 August 2016, date last ac-cessed).

Rajapaksha RMCP, Tobor-Kap�lon MA, Baath E. Metal toxicity af-fects fungal and bacterial activities in soil differently. ApplEnviron Microb 2004;70:2966–73.

Ramirez KS, Craine JM, Fierer N. Consistent effects of ni-trogen amendments on soil microbial communitiesand processes across biomes. Global Change Biol 2012;18:1918–27.

Rouch DA, Lee BTO, Morby AP. Understanding cellular responsesto toxic agents: a model for mechanism-choice in bacterialmetal resistance. J Ind Microbiol 1995;14:132–41.

Rusjan D. Copper in Horticulture. In: Dhanasekaran D (ed). Fungi-cides for Plant and Animal Diseases. Shanghai, China: InTech,2012. ISBN: 978-953-307-804-5.

Sauve S. Copper inhibition of soil organic matter decomposi-tion in a seventy-year field exposure. Environ Toxicol Chem2006;25:854–7.

Schloss PD, Westcott SL, Ryabin T et al. Introducing mothur:open-source, platform-independent, community-supportedsoftware for describing and comparing microbial communi-ties. Appl Environ Microb 2009;75:7537–41.

Strandberg B, Axelsen JA, Pedersen MB et al. Effect of a coppergradient on plant community structure. Environ Toxicol Chem2006;25:743–53.

Suzuki R, Shimodaira H. Pvclust: an R package for assess-ing the uncertainty in hierarchical clustering. Bioinformatics2006;22:1540–2.

Tanaka Y, Taguchi K, Utsumi H. Toxicity assessment of 255chemicals to pure cultured nitrifying bacteria using biosen-sor. Water Sci Technol 2002;46:331–5.

Thorsen MK, Brandt KK, Nybroe O. Abundance and diversityof culturable Pseudomonas constitute sensitive indicators foradverse long-term copper impacts in soil. Soil Biol Biochem2013;57:933–5.

Tom-PetersenA, Leser T,Marsh TL et al. Effects of copper amend-ment on the bacterial community in agricultural soil anal-ysed by the T-RFLP technique. FEMS Microbiol Ecol 2003;46:53–62.

Trivedi P, Anderson IC, Singh BK. Microbial modulators of soilcarbon storage: integrating genomic and metabolic knowl-edge for global prediction. Trends Microbiol 2013;21:641–51.

Vetrovsky T, Baldrian P. The variability of the 16S rRNA gene inbacterial genomes and its consequences for bacterial com-munity analyses. PLoS One 2013;8:e57923

Wakelin SA, Chu G, Lardner R et al. A single application of Cuto field soil has long-term effects on bacterial communitystructure, diversity, and soil processes. Pedobiologia 2010;53:149–58.

Wang Y, NaumannU,Wright ST et al.Mvabund- an R package formodel-based analysis of multivariate abundance data.Meth-ods Ecol Evol 2012;3:471–4.

Warnes GR, Bolker B, Bonebakker L et al. gplots: Various R Program-ming Tools for Plotting Data. R package version 2.17.0, 2015,http://CRAN.R-project.org/package=gplots (22 August 2016,date last accessed).

Yu Y, Lee C, Kim J et al. Group-specific primer and probesets to detect methanogenic communities using quantita-tive real-time polymerase chain reaction. Biotechnol Bioeng2005;89:670–9.

Page 96: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

An integrative molecular approach reveals the lifestyle and phage

susceptibility of the microbial community of a long-term contaminated soil Asker Daniel Brejnrod, Samul Jacquoid, Inés Marques Nunes, Anders Priemé, Søren Johannes Sørensen

Summary

Industrial production has left many sites contaminated with metals. High metal concentrations are

immediately toxic to microorganisms, and microbial communities must adapt to cope with long term

exposure.

This study examines soil samples from a Danish site heavily contaminated with copper (Cu) for 100 years.

The metagenomes and rRNA-depleted metatranscriptomes from control (~15 ppm Cu), semi-contaminated

(~450 ppm Cu) and heavily contaminated “hotspot” soil (~4500 ppm Cu) were sequenced using Illumina

technology. The metagenomic data were mapped against functional and taxonomic databases and

assembled for mapping the metatranscriptome.

The hotspot soil had lower microbial diversity and a phylogenetic shift compared to the two other soils, and

revealed an enhanced relative abundance of transport proteins. The carbon cycle of the contaminated soils

was changed, e.g. the glycoside hydrolase diversity was broader at the hotspot. The microbial growth rate

was lower at the hotspot suggesting that these conditions enhance K-selected microorganisms. In addition,

the contaminated soils showed an elevated expression of phage proteins suggestive of a shift toward the

lytic cycle. Overall, our data suggest that the persistent copper contamination has adapted the microbial

community into a slow-growing lifestyle making them susceptible to attack by lytic phages.

Introduction

Metal and metalloid contamination is a common leftover from previous and current industrial activities,

and can be found on many sites throughout the world. This pollution is characterized by extreme level of

environmental persistence as they can accumulate locally for extended time (Mortatti, de Moraes, &

Probst, 2012; Reis et al., 2016). Metals and metalloids may turn bioavailable under specific physicochemical

conditions, becoming toxic for living organisms. Copper is a common metal contaminant in soils (Griffiths &

Philippot, 2013) , as it is massively used in biocides mixtures applied in agriculture and wood impregnation

plants (Horsfall, 1945). While copper is an essential trace metal for cell functioning, high concentrations are

toxic. High copper bioavailability impacts both microbial and plant life, as well as all other soil

biota(Paoletti, 1999). Following accumulation and long-term exposure, copper pollution may result in

deleterious “legacy burdens”, affecting all ecosystem trophic levels and components (Nunes et al., 2016).

Long-term exposure to copper could lead to selection of resistant or tolerant microorganisms, often

through the acquisition of horizontally transferred elements conferring resistance and tolerance (Luo, Xu,

Riber, Hansen, & Sørensen, 2016). Several mechanisms may be involved in resistance to elevated copper

concentration, e.g. efflux pumps, copper sequestration, sensibility reduction, enzymatic detoxification and

permeable barrier exclusion (Epelde, Lanzén, Blanco, Urich, & Garbisu, 2015).

Page 97: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Resistance, resilience and stability are key characteristics of soil ecosystem services, with practical

importance in scenarios such as land use and climate change. The underlying mechanisms are still being

unraveled (Graham et al., 2016), but it is clear that both the history of the soil and the current stress

influence the microbial response. Copper contamination has the potential to induce stress at several levels.

First, the direct oxidative stress induced by high concentrations of copper ions (Gaetke & Chow, 2003), and

second, the indirect effects such as changes in physicochemical conditions such as pH, and loss of plant

cover, fungi and other soil biota can impact nutrient availability and functioning of the bacterial cell s.

Soil microbiome research has recently adapted r/K selection theory (Pianka, 1970) as an ecological

framework for explaining observations about microbial dispersal and colonization patterns. Much of this

work has focused on characterizing the carbon metabolism of microbes (Fierer, Bradford, & Jackson, 2007)

and has been used to make predictions about the consequences of soil disturbance (De Vries & Shade,

2013). Briefly, r selected microorganisms are fast growers that can utilize simple carbohydrates for biomass

production, while K selected microorganisms can utilize a wide range of substrates, but grow slower.

Microbial communities are highly susceptible to infection by phages, and phages can drive evolutionary

changes (Koskella & Brockhurst, 2014), as well as have an impact on biogeochemical cycles (Puxty, Millard,

Evans, & Scanlan, 2016). Dynamic modelling of microcosms with recalcitrant aromatic compounds have

suggested a decrease in taxonomic richness, but an increase in genetic diversity of the microbial

communities, under the simultaneous stress of pollution and phage attack (Kato et al., 2015). Accordingly,

several studies have suggested a phage response when microbial communities are stressed with heavy

metals (Lee et al., 2006; Motlagh, Bhattacharjee, & Goel, 2015).

In the present work, we used an integrated approach complimenting metagenomics with

metatranscriptomics to study microbial function and diversity in soil from a Danish site, Hygum, heavily

contaminated with copper. The site of a timber preservation facility from 1911 to 1924, the soil at Hygum

has been contaminated by CuSO4 used for impregnation of timber (Nunes et al., 2016).

We hypothesized that organisms in long-term contaminated soil would have the genetic potential for

copper resistance, either by efflux pumps or other mechanisms. Additionally, we expected that with the

plant cover missing, altered substrate availability would shape the life-style of the bacteria present.

Materials and methods

Site and soil description

The Hygum site is located in Jutland, Denmark (55o45’N, 9o27’E). The site has previously been extensively

described by Nunes et al. (2016). The site has a local copper gradient, varying from ~15 mg Cu kg-1 to ~4500

mg Cu kg-1, at the control (C) and the Hotspot (HS) areas, respectively. The semi-contaminated (SC) samples

contained ~450mg kg-1. The samples were obtained in May 2014. 16-m2 plots were drawn and six samples

taken within each plot at 4-14 cm soil depth.

Samples for RNA were stored in LifeGuard Soil Preservation Solution (MOBIO Laboratories, Carlsbad, CA,

USA) as described in (Nunes et al., 2016).

Page 98: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Details about the particular samples can be found in (Nunes et al., 2016) for the May 2014 samples,

including soil characteristics and copper measurements.

DNA isolation and sequencing

Total microbial community DNA was isolated from 1-g soil samples using the FastDNA® SPIN Kit for Soil (MP

Biomedicals, Santa Ana, CA, USA) according to manufacturer’s instructions and stored at -20°C. DNA

qualification was carried out by agarose gel electrophoresis and comparison to a known standard.

Metagenomic libraries for shotgun sequencing were prepared using the Illumina Nextera XT DNA Library

Preparation Kit (Illumina, San Diego, CA, USA). Briefly, the extracted DNA samples, each adjusted to a final

concentration of 0.2 ng μL-1, were subjected to a one-step tagmentation step in which the DNA was

fragmented and tagged with adapter sequences containing unique index and barcode sequences using the

Nextera XT Transposome (Illumina, San Diego, CA, USA). The resulting DNA was subsequently amplified via

a limited-cycle PCR program, according to manufacturer’s instructions. PCR products were purified and size-

selected using Agencourt AMPure XP beads (Beckman Coulter, Brea, CA, USA) according to manufacturer’s

instructions and stored at -20 oC for further processing. The average size distribution of the library

fragments was determined to range from 800 to 1200 bp using a Fragment Analyzer™ Automated CE

System (Advanced Analytical Technologies, Ames, IA, USA) accompanied by the High Sensitivity NGS

Fragment Analysis Kit (Advanced Analytical Technologies, Ames, IA, USA), according to manufacturer’s

instructions. Obtained data were processed with the PROSize® 2.0 Software (Advanced Analytical

Technologies, Ames, IA, USA), and finally the prepared metagenomic DNA libraries were subjected to 2x300

bp paired-end sequencing on an Illumina® MiSeq® platform (Illumina, San Diego, CA, USA).

RNA isolation and cDNA sequencing

Sample preservation and total RNA isolation/purification are described by Nunes et al. (2016). Ribo-Zero™

rRNA Removal Kit (Bacteria) with the Magnetic Core Kit (both from epicenter®, an Illumina® company, USA)

was used for rRNA depletion following manufacturer's instruction. Depleted samples were purified using

RNeasy® MiniElute® kit (Qiagen®, Copenhagen, Denmark) according to the manufacturer´s instructions, and

ScriptSeq v2 RNA-seq libraries were constructed (Epicenter® an Illumina® company, USA). Based on

electrophoresis quality and yields, the three best replicates were kept for further analysis. Sequencing was

done at the Danish National High-Throughput DNA Sequencing Center using a 2x150bp Illumina® HiSeq®

Rapid Paired End run. rRNA sequences were filtered using the Bonsai software SortMeRNA version 2.0

against appropriated databases (Kopylova, Noé, & Touzet, 2012). Filtered mRNA sequences were cleaned

and overlaps merged using a Biopieces (www.biopieces.org) based pipeline.

Page 99: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Bioinformatics analyses

The full bioinformatics analyses strategy is outlined in Figure 1.

Coverage of the genome pool was estimated using the Nonpareil algorithm in k-mer mode, building a

statistical model of the observed k-mers. Rarefaction curves at the genus level were built by Metaxa2

(Bengtsson‐Palme et al., 2015) by extracting all Small Subunit (SSU) sequences from the samples using

HMMs and Blast. These SSU sequences were classified by Metaxa2 for the taxonomic overview. To further

investigate the unresolved part of the taxonomic composition data were classified using Metaphlan2

(Segata et al., 2012).

To map the metagenome against a functional and taxonomic reference database Humann2 (Abubucker et

al., 2012) was run for all samples with default parameters against the uniref90 database. Resulting tables

were merged with humann2_join_tables and normalized to relative abundances with the

humann2_renorm_table script.

Quality-controlled reads were concatenated and assembled with Ray (Boisvert, Raymond, Godzaridis,

Laviolette, & Corbeil, 2012) using a k-mer size of 41. Contigs were filtered to sizes > 1 kb. Open reading

frames (ORF) were called and translated to amino acid sequences with Prodigal (Hyatt et al., 2010) in

metagenome mode (’meta’). Translated search of reads against protein coding sequences was done with

Diamond (Buchfink, Xie, & Huson, 2015), picking the top hit and filtering at an e-value of 10-3. All

abundances were normalized by dividing with the gene length to avoid the bias that more reads could map

to longer ORFs. PFAM (Bateman et al., 2004) annotation was performed by using hmmer 3b1

(http://hmmer.org/) against PFAM release 30, using the conservative “–cut_ga” option to only include

annotations passing the Gathering Threshold set by PFAM curators. Additional differences in sequencing

depth between samples were normalized by dividing by the total amount of mapped reads. This abundance

matrix formed the basis for subsequent statistics processing. The abundance matrix of GO terms

(Ashburner et al., 2000) (Consortium, 2015) was created by mapping PFAM annotations to GO terms

(Mitchell et al., 2014), and expanding the table with a GO term for every instance, then aggregating them

by summation.

To assign the called ORFs to copA (a copper-exporting ATPase) we searched them against the database

provided in Li et al. (Li et al., 2015) of 122 sequences using usearch 9 (Edgar, 2010). The nucleotide

sequences were searched at an identity threshold of 80 % with an e-value of 1E-6.

DbCan (Yin et al., 2012) annotation was done according to the instructions on the website

(http://csbl.bmb.uga.edu/dbCAN/download/readme.txt, retrieved 22-03-2017). Briefly, hmmscan was run

against the called ORFs and the supplied script was used to parse the results. Finally, the recommended e-

value for bacterial annotation of 1E-18 was used to filter hits.

To check for confounding of the phage analyses by contamination of the PhiX used in Illumina-based

sequencing we used the “filter_phix” functionality in usearch.

Statistics

Page 100: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Hypothesis testing was performed with t-test in cases of data that either were or could be transformed to

normality, or with the non-parametric Wilcoxon test if this was not the case.

PCA was performed in R using the prcomp() function from the stats package, and results were visualized

using ggplot2 (Wickham, 2016). PCoA was performed in R using the pcoa() function from the ape package

(Paradis, Claude, & Strimmer, 2004). Projection of exogenous variables was performed as recommended in

(Legendre & Legendre, 2012). Briefly, the covariance of all variables with the axes of interest was computed

and scaled with eigenvalues. Finally, these ‘triplot scores’ were normalized to the interval [0-1], along with

the PCoA scores.

Results and discussion

Sequencing data characteristics and processing

Deep sequencing and subsequent quality filtering produced 144.2 Gbps of metagenomics data, averaging

6.1 Gbp (SD: 0.97) per sample. Soil is known to be a highly species-rich environment (Nesme et al., 2016;

Tringe et al., 2005) and estimating coverage is key to interpretation of results. We used the Nonpareil

algorithm to estimate the pooled coverage at 0.82, with an estimated 239 Gbps required to cover 95% of

the metagenome (Figure 2 a). The hotspot soil had a significantly higher coverage than the two other soils

(figure 2A inset)

Because of this uncovered diversity the analysis proceeded in two complementary tracks as outlined in

figure 1. One based on the Humann2 mapping pipeline and one based on a co-assembly of all data and

subsequent ORF calling and annotation. This combined strategy allows both discovery of new ORFs from

the assembly and mapping of reads against functional genes in databases that might be represented in too

low abundance to be assembled. Co-assembly of the metagenome resulted in 93928 contigs longer than

1kb, with an N50 of: X. From this 230440 ORFs were called of which 60% could be annotated with Pfam.

The translated search against the ORFs from the de novo assembly mapped an ave rage of 32.8% (SD: 9%) of

the reads.

Diversity and composition

Extraction of small ribosomal subunit genes via metaxa2 was performed to elucidate the taxonomic

coverage and composition. Rarefaction curves of resampling at the genus level shows no saturation (figure

2B), which is not uncommon in complex environments such as soil (Gans, Wolinsky, & Dunbar, 2005). Clear

group differences were present, with the hotspot samples showing lower richness than both the control

and semi-contaminated soil (t-test, p < 0.05), mirroring the previous observation that higher coverage of

hotspot samples were obtained.

Bacterial diversity at the Hygum site has been explored previously using a broad array of methods and

technologies. 16S rRNA cDNA amplicon sequencing showed a clear impact of copper contamination on

diversity (Nunes et al., 2016), along with transcriptional activity of mRNA (Jacquiod et al. in prep). 16S rRNA

gene amplicon sequencing has shown conflicting results with two studies showing no relation between OTU

richness and bioavailable copper concentration (Berg et al., 2012); Altimira et al. 2012), while Naveed et al

(Naveed et al., 2014) showed an association. It should be noted that Naveed et al. showed a threshold for

dose-response effects, and that the studies involved soil sampled at different high copper concentrations.

Page 101: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

The soils were dominated by Proteobacteria, Actinobacteria, Acidobacteria and Firmicutes (figure 2C).

Hierachical clustering of the samples separated all but one sample into discrete groups related to

contamination status. The division into groups was driven by relative abundance differences in the

dominating clades, but there was also heterogeneity within the Proteobacterial classes. In particular

Alphaproteobacteria was enriched in control soils.

The extracted SSUs could not be classified to reveal which groups were dominating the HS samples, but

mapping of reads using metaphlan2 revealed that the group unidentified by using SSU reads was likely to

be Betaprotebacteria. There was also a significant enrichment of proteins annotated to lpxM in the HS

group (P = 0.0043), a cell-wall synthesizing protein specific to Gammaproteobacteria (Brix, Eriksen, Larsen,

& Bisgaard, 2015). Cultivation of microbes at similar long-term contaminated sites has previously shown

enrichment of gram-negative bacteria in the polluted soil (Dell’Amico, Mazzocchi, Cavalca, Allievi, &

Andreoni, 2008). Recent work with bacterial communities exposed to paint containing metals have

suggested that the primary resistance driver was a shift in taxonomic composition towards

Gammaproteobacteria (Flach et al., 2017) and it has been suggested that the mechanism for this is related

to biofilm formation (O'Brien, Hodgson, & Buckling, 2014)

Associations of metagenome with environmental variables

Figure 3 shows environmental parameters projected onto a PCoA representation of the taxonomic and

functional composition, respectively, of the metagenomes. Both plots show first axis separating the copper

gradient which is confirmed by projecting the bio available and total copper concentrations on the plot. The

second axis seems to be driven by a combination of pH and moisture. Other measures of healthy biological

activity such as total respiration, fungal lipids and bacterial biomass are negatively impacted by the high

copper concentration.

Taxonomically, the putative Betaproteobacteria were enriched in the hotspot while Alphaproteobacteria

were highly correlated to the control soil. Axis 2 shows a clear inverse correlation between pH and

Acidobacteria, while the Actinobacteria were positively correlated to more alkaline soil. The raw

correlations between these two classes are shown in supporting figure 2. Previous 16S rDNA PCR amplicon

surveys of bacterial phyla have found strong correlations with pH for Acidobacteria but not for

Actinobacteria (Jones et al., 2009; Rousk et al., 2010).

The PCoA of the functional composition revealed major axes similar to the taxonomical composition,

probably reflecting that the taxonomically different bacteria have different gene sets. Figure 3B illustrates

the top ten GO terms correlated with the axes. While no negative correlations with copper concentrations

are shown, positive correlations indicate transport related enzymes, while the positive correlations for axis

2 are DNA binding and transcription factor related. High-level analysis of GO terms seemed to indicate that

the primary correlations to the high copper concentrations were membrane and transport related, while

low soil pH was characterized by higher abundance of transcription regulation. The full table of significant

correlations can be found in supplement X

Catabolic potential is associated to copper contamination status

To explore differences in catabolic potential, the assembled metagenome ORFs were annotated with

Database of Carbohydrate-active enzyme Annotation (dbCan). 3901 ORFs could be assigned a carbohydrate

Page 102: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

active function. A PCoA plot of the protein metagenome abundances showed distinct catabolic profiles

(figure 4B)Notable associations included several significant lipopolysaccharide related families such as GT83

and GT4 (rho >0.89, adjusted P <0.05 ) positively correlated with high copper content. Negative correlations

with copper content included GH87, alpha-1,3-glucanase, capable of degrading fungal cell walls (Ait-Lahsen

et al., 2001), as well as CE5, cutinase, that enhances the breakdown of cellulose found in plant cell walls

(Zhang, Siika-aho, Tenkanen, & Viikari, 2011) by allowing access to cellulose substrate.

Overall Shannon diversity of glycoside hydrolases families were substantially higher in the hotspot soils

(Wilcoxon test, P < 0.05 ), while the diversity in control soils was significantly higher than in semi-

contaminated soils (Wilxocon test, P < 0.05) (figure 4A). As previously mentioned, fungal biomass was lower

in the contaminated soils, and previous studies of enchytraeid worms have demonstrated that they are

severely impacted by copper contamination, suggesting that at the copper concentrations of the soil

sampled for this study they would be nearly gone (Konečný et al., 2014). Earthworms would conceivably be

reduced in numbers because of the lack of substrate from the missing plant cover. Additionally, Nunes et al.

(2016) showed lower respiration of simple carbohydrates and a significantly higher percentage of carbon

content in the hotspot, suggesting decreased carbon turnover at elevated copper concentration.

These facts taken together suggest that the copper contamination has opened up a cataboli c niche of

complex carbon substrates for copper tolerant bacteria (at the expense of fungi) reflected in their

metagenomic potential. In the r/K-selection framework heavy copper contamination would be said to

select for K-strategists, slow growers utilizing a wide range of carbon substrates.

Dose-dependent copA diversity

To cover the largest diversity of the copA gene we searched the ORFs against a previously constructed

database of copA genes from Li et al. (Li et al., 2015). This revealed a significantly higher diversity of copA

(Figure 4C), in both semi-contaminated and hotspot samples as compared to controls (Wilcoxon, P < 0.001).

The normalized summed abundance of reads allocated to copA affiliated genes was significantly higher in

both contaminated groups compared to the control soil (figure 4D), however there was no significant

difference between the two contaminated groups

Increased expression of phage and CRISPR related proteins in heavily polluted soils.

To check for contamination by the PhiX used in Illumina sequencing we searched all the samples and found

no significant differences between groups (supplementary figure 1).

Phages may be a pronounced stress to microorganisms. Many bacteria have evolved CRISPR systems,

bacterial “immune” systems that detect the presence of foreign nucleotide sequences and degrade them

[REF]. The relative abundance of CRISPR/Cas-related proteins was significantly higher in the metagenomes

of the hotspot (t-test, P < 0.001) suggesting an enhanced need of bacteria in the highly contaminated soil to

fend off phages. The semi-contaminated shows a trend towards higher abundance (T-test, P = 0.055).

Expression of these proteins tends (Wilcoxon, p=0.1) to show a similar pattern, but more replicates would

be needed to conclude this. In addition, both hotspot and semi-contaminated soils have significantly higher

expression of total phage protein (t-test, P < 0.05).

Page 103: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Examining the contigs for probable prophages only revealed two incomplete hits; one putative rhizobium

phage at 40% completeness and one putative bacillus phage at 70% completeness. However, features

annotated by Pfam as “Phage_lysis” were significantly correlated with bioavailable copper (rho=0.83, P <

0.001). Additionally, proteins annotated as “Amidase_2” were significantly enriched in hotspot samples (t-

test, P = 0.015). This domain is a peptidoglycan hydrolase that is both found in Bacillus strains and in phage

lysins(Ward, Curtis, Taylor, & Buxton, 1982).

It has been suggested that phages are primarily lysogenic in situations with high microbial abundance s and

growth rates, while in situations with low growth rates they use lysis as the main mechanism of spread

(Silveira & Rohwer, 2016). Nunes et al demonstrated that metabolic activity is lower in hotspot soil. As a

proxy for DNA replication we chose expression of the polA, DNA polymerase-coding gene with a ~2 fold

increase (Wilcoxon signed-rank test, P = 0.1) in the control soil compared to the contaminated soils.

It has also previously been shown that phage abundances tend to be higher in low diversity microbial

populations, because phages have a very specific host-range.

Strengths and limitations

The two major strengths of this study are the sampling site and the comprehensive sequencing, allowing

for integration of metagenomic and metatranscriptomic data.

The sampling site has a persistent long term (>100 years) contamination at very high levels, ensuring that

the observed selected traits are not transient, that the full scope of consequences of copper contamination

can be studied, and that changes in intracellular DNA are mirrored by extracellular DNA (Carini et al., 2016) .

Additionally, the site is well-characterized having been the subject of many contributions studying different

aspects, both conventionally and molecularly.

We used deep sequencing to characterize the metagenome, which is suitable for the complex soil

environment. Additionally, we supplemented with metatranscriptome sequencing and utilized the

assembly strategy to tie the two types of data together.

Nevertheless, the current study has several limitations that should be acknowledged, some technical and

some inherent to observational studies of this environment.

We have attempted to explore the technical limitation of sequencing depth by estimating how much

sequencing effort would be needed to capture 95% of the genetic content. Sequencing depth is particularly

challenging in high-diversity environments such as soil. There are other technical challenges with working in

soil with inhibitors such as humic acid complicating PCR and other reactions. Finally, all metagenome

studies have the limitation that DNA and RNA isolation protocols may not effectively lyse all cells equally,

and as such the observed composition will be biased (Jacquiod et al., 2016; Wesolowska-Andersen et al.,

2014), as well as problems with contaminants from kits (Salter et al., 2014).

Observational studies in soil that has large differences in abiotic factors are prone to confounding, because

this induces changes in not only microbial community structure and activity but also pH, plant cover, soil

composition, and soil fauna composition. Untangling these factors requires follow-up by designed

Page 104: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

intervention studies, which can separate multi-factorial effects. However, we believe that observational

studies contribute to the generation of relevant ecological hypotheses for testing.

Conclusion

Characterization of contaminated soils by sequencing at ~6Gbp can be an effective tool for testing

hypotheses that can be formulated as changes in relative abundance and/or expression of genes. The

rarefaction curves showed that we have far from exhausted all taxonomic and genetic microbial diversity,

but some of the major patterns have been captured. The hotspot soil is lower in microbial diversity, while

the semi-contaminated are at the same level. Overall functionality patterns indicate a large dependence on

both copper and pH, but separating these effects completely requires designed experiments. The hotspot

soil appears to enhance K-selected bacteria, through the opening of a new catabolic niche by the

disappearance of other biota., e.g. saprotrophic fungi, and the data from our study and previous literature

indicate some degree of interplay between the direct copper effect and the indirect copper effect mediated

through e.g. altered plant cover. Finally we demonstrate that phage RNA is more expressed and the main

phage lifestyle appears to be lytic. We speculate that this could be due to a combination of direct toxic

stress due to copper, the lower bacterial diversity and the lower bacterial growth rates in the highly

contaminated soil.

Abubucker, S., Segata, N., Goll, J., Schubert, A. M., Izard, J., Cantarel, B. L., . . . Henrissat, B. (2012). Metabolic reconstruction for metagenomic data and its application to the human microbiome. PLoS Comput Biol, 8(6), e1002358.

Ait-Lahsen, H., Soler, A., Rey, M., de la Cruz, J., Monte, E., & Llobell, A. (2001). An antifungal exo-α-1, 3-glucanase (AGN13. 1) from the biocontrol fungus Trichoderma harzianum. Applied and environmental microbiology, 67(12), 5833-5839.

Altimira, F., Yáñez, C., Bravo, G., González, M., Rojas, L. A., & Seeger, M. (2012). Characterization of copper-resistant bacteria and bacterial communities from copper-polluted agricultural soils of central Chile. BMC microbiology, 12(1), 193.

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., . . . Eppig, J. T. (2000). Gene Ontology: tool for the unification of biology. Nature genetics, 25(1), 25-29.

Bateman, A., Coin, L., Durbin, R., Finn, R. D., Hollich, V., Griffiths‐Jones, S., . . . Sonnhammer, E. L. ( 2004). The Pfam protein families database. Nucleic acids research, 32(suppl 1), D138-D141.

Bengtsson‐Palme, J., Hartmann, M., Eriksson, K. M., Pal, C., Thorell, K., Larsson, D. G. J., & Nilsson, R. H. (2015). METAXA2: improved identification and taxonomic classification of small and large subunit rRNA in metagenomic data. Molecular Ecology Resources, 15(6), 1403-1414.

Berg, J., Brandt, K. K., Al-Soud, W. A., Holm, P. E., Hansen, L. H., Sørensen, S. J., & Nybroe, O. (2012). Selection for Cu-tolerant bacterial communities with altered composition, but unaltered richness, via long-term Cu exposure. Applied and environmental microbiology, 78(20), 7438-7446.

Boisvert, S., Raymond, F., Godzaridis, É., Laviolette, F., & Corbeil, J. (2012). Ray Meta: scalable de novo metagenome assembly and profiling. Genome biology, 13(12), 1.

Brix, S., Eriksen, C., Larsen, J. M., & Bisgaard, H. (2015). Metagenomic heterogeneity explains dual immune effects of endotoxins. Journal of Allergy and Clinical Immunology, 135(1), 277.

Buchfink, B., Xie, C., & Huson, D. H. (2015). Fast and sensitive protein alignment using DIAMOND. Nature methods, 12(1), 59-60.

Consortium, G. O. (2015). Gene ontology consortium: going forward. Nucleic acids research, 43(D1), D1049-D1056.

Page 105: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

De Vries, F. T., & Shade, A. (2013). Controls on soil microbial community stability under climate change. Dell’Amico, E., Mazzocchi, M., Cavalca, L., Allievi, L., & Andreoni, V. (2008). Assessment of bacterial

community structure in a long-term copper-polluted ex-vineyard soil. Microbiological research, 163(6), 671-683.

Edgar, R. C. (2010). Search and clustering orders of magnitude faster than BLAST. Bioinformatics, 26(19), 2460-2461.

Epelde, L., Lanzén, A., Blanco, F., Urich, T., & Garbisu, C. (2015). Adaptation of soil microbial community structure and function to chronic metal contamination at an abandoned Pb-Zn mine. FEMS Microbiology Ecology, 91(1), 1-11.

Fierer, N., Bradford, M. A., & Jackson, R. B. (2007). Toward an ecological classification of soil bacteria. Ecology, 88(6), 1354-1364.

Flach, C.-F., Pal, C., Svensson, C. J., Kristiansson, E., Östman, M., Bengtsson-Palme, J., . . . Larsson, D. J. (2017). Does antifouling paint select for antibiotic resistance? Science of The Total Environment.

Gaetke, L. M., & Chow, C. K. (2003). Copper toxicity, oxidative stress, and antioxidant nutrients. Toxicology, 189(1), 147-163.

Gans, J., Wolinsky, M., & Dunbar, J. (2005). Computational improvements reveal great bacterial diversity and high metal toxicity in soil. Science, 309(5739), 1387-1390.

Graham, E. B., Knelman, J. E., Schindlbacher, A., Siciliano, S., Breulmann, M., Yannarell, A., . . . Prosser, J. (2016). Microbes as engines of ecosystem function: when does community structure enhance predictions of ecosystem processes? Frontiers in microbiology, 7.

Griffiths, B. S., & Philippot, L. (2013). Insights into the resistance and resilience of the soil microbial community. FEMS microbiology reviews, 37(2), 112-129.

Horsfall, J. G. (1945). Fungicides and their action. Fungicides and their action. Hyatt, D., Chen, G.-L., LoCascio, P. F., Land, M. L., Larimer, F. W., & Hauser, L. J. (2010). Prodigal: prokaryotic

gene recognition and translation initiation site identification. BMC bioinformatics, 11(1), 1. Jacquiod, S., Stenbæk, J., Santos, S. S., Winding, A., Sørensen, S. J., & Priemé, A. (2016). Metagenomes

provide valuable comparative information on soil microeukaryotes. Research in microbiology, 167(5), 436-450.

Jones, R. T., Robeson, M. S., Lauber, C. L., Hamady, M., Knight, R., & Fierer, N. (2009). A comprehensive survey of soil acidobacterial diversity using pyrosequencing and clone library analyses. The ISME journal, 3(4), 442-453.

Kato, H., Mori, H., Maruyama, F., Toyoda, A., Oshima, K., Endo, R., . . . Ohtsubo, Y. (2015). Time -series metagenomic analysis reveals robustness of soil microbiome against chemical disturbance. DNA Research, 22(6), 413-424.

Konečný, L., Ettler, V., Kristiansen, S. M., Amorim, M. J. B., Kříbek, B., Mihaljevič, M., . . . Scott -Fordsmand, J. J. (2014). Response of Enchytraeus crypticus worms to high metal levels in tropical soils polluted by copper smelting. Journal of Geochemical Exploration, 144, 427-432.

Kopylova, E., Noé, L., & Touzet, H. (2012). SortMeRNA: fast and accurate filtering of ribosomal RNAs in metatranscriptomic data. Bioinformatics, 28(24), 3211-3217.

Koskella, B., & Brockhurst, M. A. (2014). Bacteria–phage coevolution as a driver of ecological and evolutionary processes in microbial communities. FEMS microbiology reviews, 38(5), 916-931.

Lee, L. H., Lui, D., Platner, P. J., Hsu, S.-F., Chu, T.-C., Gaynor, J. J., . . . Lustigman, B. K. (2006). Induction of temperate cyanophage AS-1 by heavy metal–copper. BMC microbiology, 6(1), 17.

Legendre, P., & Legendre, L. F. (2012). Numerical ecology (Vol. 24): Elsevier. Li, X., Zhu, Y.-G., Shaban, B., Bruxner, T. J., Bond, P. L., & Huang, L. (2015). Assessing the genetic diversity of

Cu resistance in mine tailings through high-throughput recovery of full-length copA genes. Scientific reports, 5.

Luo, W., Xu, Z., Riber, L., Hansen, L. H., & Sørensen, S. J. (2016). Diverse gene functions in a soil mobilome. Soil Biology and Biochemistry, 101, 175-183.

Page 106: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Mitchell, A., Chang, H.-Y., Daugherty, L., Fraser, M., Hunter, S., Lopez, R., . . . Pesseat, S. (2014). The InterPro protein families database: the classification resource after 15 years. Nucleic acids research, gku1243.

Mortatti, J., de Moraes, G. M., & Probst, J.-L. (2012). Heavy metal distribution in recent sediments along the Tietê River basin (São Pauro, Brazil). Geochemical Journal, 46(1), 13-19.

Motlagh, A. M., Bhattacharjee, A. S., & Goel, R. (2015). Microbiological study of bacteriophage induction in the presence of chemical stress factors in enhanced biological phosphorus removal (EBPR). Water research, 81, 1-14.

Naveed, M., Moldrup, P., Arthur, E., Holmstrup, M., Nicolaisen, M., Tuller, M., . . . Komatsu, T. (2014). Simultaneous loss of soil biodiversity and functions along a copper contamination gradient: when soil goes to sleep. Soil Science Society of America Journal, 78(4), 1239-1250.

Nesme, J., Achouak, W., Agathos, S. N., Bailey, M., Baldrian, P., Brunel, D., . . . Jurkevitch, E. (2016). Back to the future of soil metagenomics. Frontiers in microbiology, 7, 73.

Nunes, I., Jacquiod, S., Brejnrod, A., Holm, P. E., Johansen, A., Brandt, K. K., . . . Sørensen, S. J. (2016). Coping with copper: legacy effect of copper on potential activity of soil bacteria following a century of exposure. FEMS Microbiology Ecology, 92(11), fiw175.

O'Brien, S., Hodgson, D. J., & Buckling, A. (2014). Social evolution of toxic metal bioremediation in Pseudomonas aeruginosa. Proceedings of the Royal Society of London B: Biological Sciences, 281(1787), 20140858.

Paoletti, M. G. (1999). The role of earthworms for assessment of sustainability and as bioindicators. Agriculture, Ecosystems & Environment, 74(1), 137-155.

Paradis, E., Claude, J., & Strimmer, K. (2004). APE: analyses of phylogenetics and evolution in R language. Bioinformatics, 20(2), 289-290.

Pianka, E. R. (1970). On r-and K-selection. The American Naturalist, 104(940), 592-597. Puxty, R. J., Millard, A. D., Evans, D. J., & Scanlan, D. J. (2016). Viruses inhibit CO 2 fixation in the most

abundant phototrophs on Earth. Current Biology, 26(12), 1585-1589. Reis, M. P., Dias, M. F., Costa, P. S., Ávila, M. P., Leite, L. R., de Araújo, F. M., . . . Chartone -Souza, E. (2016).

Metagenomic signatures of a tropical mining-impacted stream reveal complex microbial and metabolic networks. Chemosphere, 161, 266-273.

Rousk, J., Bååth, E., Brookes, P. C., Lauber, C. L., Lozupone, C., Caporaso, J. G., . . . Fierer, N. (2010). Soil bacterial and fungal communities across a pH gradient in an arable soil. The ISME journal, 4(10), 1340-1351.

Salter, S. J., Cox, M. J., Turek, E. M., Calus, S. T., Cookson, W. O., Moffatt, M. F., . . . Walker, A. W. (2014). Reagent and laboratory contamination can critically impact sequence-based microbiome analyses. BMC biology, 12(1), 87.

Segata, N., Waldron, L., Ballarini, A., Narasimhan, V., Jousson, O., & Huttenhower, C. (2012). Metagenomic microbial community profiling using unique clade-specific marker genes. Nature methods, 9(8), 811-814.

Silveira, C. B., & Rohwer, F. L. (2016). Piggyback-the-Winner in host-associated microbial communities. npj Biofilms and Microbiomes, 2, 16010.

Tringe, S. G., Von Mering, C., Kobayashi, A., Salamov, A. A., Chen, K., Chang, H. W., . . . Detter, J. C. (2005). Comparative metagenomics of microbial communities. Science, 308(5721), 554-557.

Ward, J. B., Curtis, C. A., Taylor, C., & Buxton, R. S. (1982). Purification and characterization of two phage PBSX-induced lytic enzymes of Bacillus subtilis 168: an N-acetylmuramoyl-L-alanine amidase and an N-acetylmuramidase. Microbiology, 128(6), 1171-1178.

Wesolowska-Andersen, A., Bahl, M. I., Carvalho, V., Kristiansen, K., Sicheritz-Pontén, T., Gupta, R., & Licht, T. R. (2014). Choice of bacterial DNA extraction method from fecal material influences community structure as evaluated by metagenomic analysis. Microbiome, 2(1), 19.

Wickham, H. (2016). ggplot2: elegant graphics for data analysis: Springer.

Page 107: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Yin, Y., Mao, X., Yang, J., Chen, X., Mao, F., & Xu, Y. (2012). dbCAN: a web resource for automated carbohydrate-active enzyme annotation. Nucleic acids research, 40(W1), W445-W451.

Zhang, J., Siika-aho, M., Tenkanen, M., & Viikari, L. (2011). The role of acetyl xylan esterase in the solubilization of xylan and enzymatic hydrolysis of wheat straw and giant reed. Biotechnology for

Biofuels, 4(1), 60.

Figure Legends

Figure 1: Overview of the strategy for the integration of the molecular data. Quality Control (QC) was

performed on the metagenomics reads (xx bp) before assembly, mapping and identification (search for

ribosomal small subunit sequences in databases). Following assembly into contigs, ORFs were called and

translated into amino acid sequences. Metagenome and transcriptome reads were then mapped to the

ORFs and abundances normalized to sample depth and protein length were calculated. Normalization was

performed by by dividing each element by the sequencing depth and the protein length. Finally the four

abundance matrices were used to do the statistics.

Figure 2: Diversity and composition of the soil groups. (A) shows the estimated coverage of k -mers from

non-pareil algoritm. The inset shows the final coverage estimate. (B) illustrates rarefaction of the classified

small subunit ribosomal sequences. The vertical line indicates 4500 ribosomal sequences, and the inset

shows the observed number of genera when resampling 4500 ribosomal sequences. (C) Heatmap of the

taxonomic composition of the samples. Rows and columns are organized by hierarchical clustering of the

taxa abundances.

Figure 3: PCoA plots of the (A) taxonomic and (B) functional composition of the soil samples. Projected on

top is the correlations of various measured factors from the soil. TotalCu and BioCu are the total and

bioavailable amounts of copper, respectively. Bacterial and fungal biomass was quantified by PLFA.

Figure 4: (A) Shannon diversity of the proteins classified as glycoside hydrolases in the metagenomes of the

three soils. (B) PCoA plot of the distinct carbohydrate-active enzyme profiles. (C) Shannon diversity of copA

genes. (D) Total relative abundance of metagenomics reads allocated to copA.

Figure 5: (A) Relative abundance of metagenomic reads from CRISPR/cas proteins. (B) Relative

metatranscriptome expression of CRISPR/CAS proteins. (C) Relative metanogenomic abundance of phage

proteins. (D) Relative metratranscriptome expression of phage proteins.

Supplementary figure 1: Correlations of (A) Actinobacteria and (B) Acidbacteria with pH

Supplementary figure 2: PhiX contamination of metagenomes in the groups.

Page 108: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

04/04/2017 1

QC Mapping

Assembly Metagenomic reads

Call ORFs

Contigs

Amino Acid sequences

Transcriptome mapping

Statistics

Metagenome mapping

Figure 1

Ribosomal identification

Normalization

Page 109: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Figure 2

A B

C

Page 110: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Figure 3

Page 111: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Figure 4

Page 112: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Figure 5

Page 113: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Supporting figure 1

Page 114: Bioinformatics for discovery of microbiome variation Daniel Brejnrod.pdf · Introduction to manuscripts 17 Manuscript 1 17 Manuscript 2 21 Manuscripts 3 & 4 25 List of references

Supporting figure 2