metagenomeseq: statistical analysis for sparse high...

35
metagenomeSeq: Statistical analysis for sparse high-throughput sequencing Joseph Nathaniel Paulson Applied Mathematics & Statistics, and Scientific Computation Center for Bioinformatics and Computational Biology University of Maryland, College Park [email protected] Modified: October 4, 2016. Compiled: January 4, 2019 Contents 1 Introduction 3 2 Data preparation 4 2.1 Biom-Format ...................................... 4 2.2 Loading count data ................................... 5 2.3 Loading taxonomy ................................... 5 2.4 Loading metadata ................................... 6 2.5 Creating a MRexperiment object .......................... 6 2.6 Example datasets .................................... 7 2.7 Useful commands .................................... 9 3 Normalization 14 3.1 Calculating normalization factors ........................... 14 3.2 Exporting data ..................................... 14 4 Statistical testing 16 4.1 Zero-inflated Log-Normal mixture model for each feature .............. 16 4.1.1 Example using fitFeatureModel for differential abundance testing ..... 16 4.2 Zero-inflated Gaussian mixture model ........................ 16 4.2.1 Example using fitZig for differential abundance testing ........... 17 4.2.2 Multiple groups ................................. 19 4.2.3 Exporting fits .................................. 20 4.3 Time series analysis .................................. 20 4.4 Log Normal permutation test ............................. 21 4.5 Presence-absence testing ................................ 21 4.6 Discovery odds ratio testing .............................. 22 4.7 Feature correlations .................................. 23 4.8 Unique OTUs or features ............................... 23 5 Aggregating counts 25 1

Upload: lambao

Post on 16-Feb-2019

245 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: metagenomeSeq: Statistical analysis for sparse high ...bioconductor.org/packages/release/bioc/vignettes/metagenomeSeq/... · metagenomeSeq: Statistical analysis for sparse high-throughput

metagenomeSeq: Statistical analysis for sparse

high-throughput sequencing

Joseph Nathaniel Paulson

Applied Mathematics & Statistics, and Scientific ComputationCenter for Bioinformatics and Computational Biology

University of Maryland, College Park

[email protected]

Modified: October 4, 2016. Compiled: January 4, 2019

Contents

1 Introduction 3

2 Data preparation 42.1 Biom-Format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42.2 Loading count data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52.3 Loading taxonomy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52.4 Loading metadata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62.5 Creating a MRexperiment object . . . . . . . . . . . . . . . . . . . . . . . . . . 62.6 Example datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72.7 Useful commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

3 Normalization 143.1 Calculating normalization factors . . . . . . . . . . . . . . . . . . . . . . . . . . . 143.2 Exporting data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

4 Statistical testing 164.1 Zero-inflated Log-Normal mixture model for each feature . . . . . . . . . . . . . . 16

4.1.1 Example using fitFeatureModel for differential abundance testing . . . . . 164.2 Zero-inflated Gaussian mixture model . . . . . . . . . . . . . . . . . . . . . . . . 16

4.2.1 Example using fitZig for differential abundance testing . . . . . . . . . . . 174.2.2 Multiple groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194.2.3 Exporting fits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

4.3 Time series analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204.4 Log Normal permutation test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214.5 Presence-absence testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214.6 Discovery odds ratio testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224.7 Feature correlations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 234.8 Unique OTUs or features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

5 Aggregating counts 25

1

Page 2: metagenomeSeq: Statistical analysis for sparse high ...bioconductor.org/packages/release/bioc/vignettes/metagenomeSeq/... · metagenomeSeq: Statistical analysis for sparse high-throughput

6 Visualization of features 266.1 Interactive Display . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266.2 Structural overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266.3 Feature specific . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

7 Summary 317.1 Citing metagenomeSeq . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317.2 Session Info . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

8 Appendix 338.1 Appendix A: MRexperiment internals . . . . . . . . . . . . . . . . . . . . . . . . 338.2 Appendix B: Mathematical model . . . . . . . . . . . . . . . . . . . . . . . . . . 338.3 Appendix C: Calculating the proper percentile . . . . . . . . . . . . . . . . . . . 33

2

Page 3: metagenomeSeq: Statistical analysis for sparse high ...bioconductor.org/packages/release/bioc/vignettes/metagenomeSeq/... · metagenomeSeq: Statistical analysis for sparse high-throughput

1 Introduction

This is a vignette for pieces of an association study pipeline. For a full list offunctions available in the package: help(package=metagenomeSeq). For more in-formation about a particular function call: ?function. See fitFeatureModel for our latestdevelopment.

To load the metagenomeSeq library:

library(metagenomeSeq)

Metagenomics is the study of genetic material targeted directly from an environmental com-munity. Originally focused on exploratory and validation projects, these studies now focus onunderstanding the differences in microbial communities caused by phenotypic differences. Ana-lyzing high-throughput sequencing data has been a challenge to researchers due to the uniquebiological and technological biases that are present in marker-gene survey data.

We present a R package, metagenomeSeq, that implements methods developed to accountfor previously unaddressed biases specific to high-throughput sequencing microbial marker-genesurvey data. Our method implements a novel normalization technique and method to accountfor sparsity due to undersampling. Other methods include White et al.’s Metastats and Segataet al.’s LEfSe. The first is a non-parametric permutation test on t-statistics and the second is anon-parametric Kruskal-Wallis test followed by subsequent wilcox rank-sum tests on subgroupsto guard against positive discoveries of differential abundance driven by potential confounders -neither address normalization nor sparsity.

This vignette describes the basic protocol when using metagenomeSeq. A normalizationmethod able to control for biases in measurements across taxanomic features and a mixturemodel that implements a zero-inflated Gaussian distribution to account for varying depths ofcoverage are implemented. Using a linear model methodology, it is easy to include confoundingsources of variability and interpret results. Additionally, visualization functions are provided toexamine discoveries.

The software was designed to determine features (be it Operational Taxanomic Unit (OTU),species, etc.) that are differentially abundant between two or more groups of multiple samples.The software was also designed to address the effects of both normalization and undersamplingof microbial communities on disease association detection and testing of feature correlations.

3

Page 4: metagenomeSeq: Statistical analysis for sparse high ...bioconductor.org/packages/release/bioc/vignettes/metagenomeSeq/... · metagenomeSeq: Statistical analysis for sparse high-throughput

Prepare MRexperiment

object

Normalization

Statistical testing

Step 1

Step 2

Step 3

metagenomeSeq overview

1. Visualize data2. Save results

1. Abundance2. P/A

Figure 1: General overview. metagenomeSeq requires the user to convert their data into MR-experiment objects. Using those MRexperiment objects, one can normalize their data, runstatistical tests (abundance or presence-absence), and visualize or save results.

2 Data preparation

Microbial marker-gene sequence data is preprocessed and counts are algorithmically defined fromproject-specific sequence data by clustering reads according to read similarity. Given m featuresand n samples, the elements in a count matrix C (m,n), cij , are the number of reads annotatedfor a particular feature i (whether it be OTU, species, genus, etc.) in sample j.

sample1 sample2 . . . samplen

feature1 c11 c12 . . . c1nfeature2 c21 c22 . . . c2n...

......

. . ....

featurem cm1 cm2 . . . cmn

Count data should be stored in a delimited (tab by default) file with sample names along

the first row and feature names along the first column.Data is prepared and formatted as a MRexperiment object. For an overview of the internal

structure please see Appendix A.

2.1 Biom-Format

You can load in BIOM file format data, the output of many commonly used, using the loadBiomfunction. The biom2MRexperiment and MRexperiment2biom functions serve as a gatewaybetween the biom-class object defined in the biom package and a MRexperiment-classobject. BIOM format files IO is available thanks to the biomformat package.

As an example, we show how one can read in a BIOM file and convert it to a MRexperimentobject.

4

Page 5: metagenomeSeq: Statistical analysis for sparse high ...bioconductor.org/packages/release/bioc/vignettes/metagenomeSeq/... · metagenomeSeq: Statistical analysis for sparse high-throughput

# reading in a biom filelibrary(biomformat)biom_file <- system.file("extdata", "min_sparse_otu_table.biom",

package = "biomformat")b <- read_biom(biom_file)biom2MRexperiment(b)

## MRexperiment (storageMode: environment)## assayData: 5 features, 6 samples## element names: counts## protocolData: none## phenoData: none## featureData: none## experimentData: use 'experimentData(object)'## Annotation:

As an example, we show how one can write a MRexperiment object out as a BIOM file.Here is an example writing out the mouseData MRexperiment object to a BIOM file.

data(mouseData)# options include to normalize or notb <- MRexperiment2biom(mouseData)write_biom(b, biom_file = "˜/Desktop/otu_table.biom")

2.2 Loading count data

Following preprocessing and annotation of sequencing data metagenomeSeq requires a countmatrix with features along rows and samples along the columns. metagenomeSeq includesfunctions for loading delimited files of counts loadMeta and phenodata loadPhenoData.

As an example, a portion of the lung microbiome [1] OTU matrix is provided in metagenomeSeq’slibrary ”extdata” folder. The OTU matrix is stored as a tab delimited file. loadMeta loadsthe taxa and counts into a list.

dataDirectory <- system.file("extdata", package = "metagenomeSeq")lung = loadMeta(file.path(dataDirectory, "CHK_NAME.otus.count.csv"))dim(lung$counts)

## [1] 1000 78

2.3 Loading taxonomy

Next we want to load the annotated taxonomy. Check to make sure that your taxa annotationsand OTUs are in the same order as your matrix rows.

taxa = read.delim(file.path(dataDirectory, "CHK_otus.taxonomy.csv"),stringsAsFactors = FALSE)

As our OTUs appear to be in order with the count matrix we loaded earlier, the next stepis to load phenodata.

Warning: features need to have the same names as the rows of the count matrix when wecreate the MRexperiment object for provenance purposes.

5

Page 6: metagenomeSeq: Statistical analysis for sparse high ...bioconductor.org/packages/release/bioc/vignettes/metagenomeSeq/... · metagenomeSeq: Statistical analysis for sparse high-throughput

2.4 Loading metadata

Phenotype data can be optionally loaded into R with loadPhenoData. This function loadsthe data as a list.

clin = loadPhenoData(file.path(dataDirectory, "CHK_clinical.csv"),tran = TRUE)

ord = match(colnames(lung$counts), rownames(clin))clin = clin[ord, ]head(clin[1:2, ])

## SampleType## CHK_6467_E3B11_BRONCH2_PREWASH_V1V2 Bronch2.PreWash## CHK_6467_E3B11_OW_V1V2 OW## SiteSampled## CHK_6467_E3B11_BRONCH2_PREWASH_V1V2 Bronchoscope.Channel## CHK_6467_E3B11_OW_V1V2 OralCavity## SmokingStatus## CHK_6467_E3B11_BRONCH2_PREWASH_V1V2 Smoker## CHK_6467_E3B11_OW_V1V2 Smoker

Warning: phenotypes must have the same names as the columns on the count matrix whenwe create the MRexperiment object for provenance purposes.

2.5 Creating a MRexperiment object

Function newMRexperiment takes a count matrix, phenoData (annotated data frame), andfeatureData (annotated data frame) as input. Biobase provides functions to create annotateddata frames. Library sizes (depths of coverage) and normalization factors are also optionalinputs.

phenotypeData = AnnotatedDataFrame(clin)phenotypeData

## An object of class 'AnnotatedDataFrame'## rowNames: CHK_6467_E3B11_BRONCH2_PREWASH_V1V2## CHK_6467_E3B11_OW_V1V2 ...## CHK_6467_E3B09_BAL_A_V1V2 (78 total)## varLabels: SampleType SiteSampled SmokingStatus## varMetadata: labelDescription

A feature annotated data frame. In this example it is simply the OTU numbers, but it canas easily be the annotated taxonomy at multiple levels.

OTUdata = AnnotatedDataFrame(taxa)OTUdata

## An object of class 'AnnotatedDataFrame'## rowNames: 1 2 ... 1000 (1000 total)## varLabels: OTU Taxonomy ... strain (10 total)## varMetadata: labelDescription

6

Page 7: metagenomeSeq: Statistical analysis for sparse high ...bioconductor.org/packages/release/bioc/vignettes/metagenomeSeq/... · metagenomeSeq: Statistical analysis for sparse high-throughput

obj = newMRexperiment(lung$counts,phenoData=phenotypeData,featureData=OTUdata)# Links to a paper providing further details can be included optionally.# experimentData(obj) = annotate::pmid2MIAME("21680950")obj

## MRexperiment (storageMode: environment)## assayData: 1000 features, 78 samples## element names: counts## protocolData: none## phenoData## sampleNames: CHK_6467_E3B11_BRONCH2_PREWASH_V1V2## CHK_6467_E3B11_OW_V1V2 ...## CHK_6467_E3B09_BAL_A_V1V2 (78 total)## varLabels: SampleType SiteSampled SmokingStatus## varMetadata: labelDescription## featureData## featureNames: 1 2 ... 1000 (1000 total)## fvarLabels: OTU Taxonomy ... strain (10 total)## fvarMetadata: labelDescription## experimentData: use 'experimentData(object)'## Annotation:

2.6 Example datasets

There are two datasets included as examples in the metagenomeSeq package. Data needs tobe in a MRexperiment object format to normalize, run statistical tests, and visualize. As anexample, throughout the vignette we’ll use the following datasets. To understand a function’susage or included data simply enter ?functionName.

1. Human lung microbiome [1]: The lung microbiome consists of respiratory flora sampledfrom six healthy individuals. Three healthy nonsmokers and three healthy smokers. Theupper lung tracts were sampled by oral wash and oro-/nasopharyngeal swabs. Sampleswere taken using two bronchoscopes, serial bronchoalveolar lavage and lower airway pro-tected brushes.

data(lungData)lungData

## MRexperiment (storageMode: environment)## assayData: 51891 features, 78 samples## element names: counts## protocolData: none## phenoData## sampleNames: CHK_6467_E3B11_BRONCH2_PREWASH_V1V2## CHK_6467_E3B11_OW_V1V2 ...## CHK_6467_E3B09_BAL_A_V1V2 (78 total)## varLabels: SampleType SiteSampled SmokingStatus## varMetadata: labelDescription## featureData## featureNames: 1 2 ... 51891 (51891 total)

7

Page 8: metagenomeSeq: Statistical analysis for sparse high ...bioconductor.org/packages/release/bioc/vignettes/metagenomeSeq/... · metagenomeSeq: Statistical analysis for sparse high-throughput

## fvarLabels: taxa## fvarMetadata: labelDescription## experimentData: use 'experimentData(object)'## pubMedIds: 21680950## Annotation:

2. Humanized gnotobiotic mouse gut [2]: Twelve germ-free adult male C57BL/6J mice werefed a low-fat, plant polysaccharide-rich diet. Each mouse was gavaged with healthy adulthuman fecal material. Following the fecal transplant, mice remained on the low-fat, plantpolysacchaaride-rich diet for four weeks, following which a subset of 6 were switched to ahigh-fat and high-sugar diet for eight weeks. Fecal samples for each mouse went throughPCR amplification of the bacterial 16S rRNA gene V2 region weekly. Details of exper-imental protocols and further details of the data can be found in Turnbaugh et. al.Sequences and further information can be found at: http://gordonlab.wustl.edu/TurnbaughSE_10_09/STM_2009.html

data(mouseData)mouseData

## MRexperiment (storageMode: environment)## assayData: 10172 features, 139 samples## element names: counts## protocolData: none## phenoData## sampleNames: PM1:20080107 PM1:20080108 ...## PM9:20080303 (139 total)## varLabels: mouseID date ... status (5 total)## varMetadata: labelDescription## featureData## featureNames: Prevotellaceae:1 Lachnospiraceae:1## ... Parabacteroides:956 (10172 total)## fvarLabels: superkingdom phylum ... OTU (7 total)## fvarMetadata: labelDescription## experimentData: use 'experimentData(object)'## Annotation:

8

Page 9: metagenomeSeq: Statistical analysis for sparse high ...bioconductor.org/packages/release/bioc/vignettes/metagenomeSeq/... · metagenomeSeq: Statistical analysis for sparse high-throughput

2.7 Useful commands

Phenotype information can be accessed with the phenoData and pData methods:

phenoData(obj)

## An object of class 'AnnotatedDataFrame'## sampleNames: CHK_6467_E3B11_BRONCH2_PREWASH_V1V2## CHK_6467_E3B11_OW_V1V2 ...## CHK_6467_E3B09_BAL_A_V1V2 (78 total)## varLabels: SampleType SiteSampled SmokingStatus## varMetadata: labelDescription

head(pData(obj), 3)

## SampleType## CHK_6467_E3B11_BRONCH2_PREWASH_V1V2 Bronch2.PreWash## CHK_6467_E3B11_OW_V1V2 OW## CHK_6467_E3B08_OW_V1V2 OW## SiteSampled## CHK_6467_E3B11_BRONCH2_PREWASH_V1V2 Bronchoscope.Channel## CHK_6467_E3B11_OW_V1V2 OralCavity## CHK_6467_E3B08_OW_V1V2 OralCavity## SmokingStatus## CHK_6467_E3B11_BRONCH2_PREWASH_V1V2 Smoker## CHK_6467_E3B11_OW_V1V2 Smoker## CHK_6467_E3B08_OW_V1V2 NonSmoker

Feature information can be accessed with the featureData and fData methods:

featureData(obj)

## An object of class 'AnnotatedDataFrame'## featureNames: 1 2 ... 1000 (1000 total)## varLabels: OTU Taxonomy ... strain (10 total)## varMetadata: labelDescription

head(fData(obj)[, -c(2, 10)], 3)

## OTU superkingdom phylum class## 1 1 Bacteria Proteobacteria Epsilonproteobacteria## 2 2 <NA> <NA> <NA>## 3 3 Bacteria Actinobacteria Actinobacteria (class)## order family genus## 1 Campylobacterales Campylobacteraceae Campylobacter## 2 <NA> <NA> <NA>## 3 Actinomycetales Actinomycetaceae Actinomyces## species## 1 Campylobacter rectus## 2 <NA>## 3 Actinomyces radicidentis

9

Page 10: metagenomeSeq: Statistical analysis for sparse high ...bioconductor.org/packages/release/bioc/vignettes/metagenomeSeq/... · metagenomeSeq: Statistical analysis for sparse high-throughput

The raw or normalized counts matrix can be accessed with the MRcounts function:

head(MRcounts(obj[, 1:2]))

## CHK_6467_E3B11_BRONCH2_PREWASH_V1V2## 1 0## 2 0## 3 0## 4 0## 5 0## 6 0## CHK_6467_E3B11_OW_V1V2## 1 0## 2 0## 3 0## 4 0## 5 0## 6 0

A MRexperiment-class object can be easily subsetted, for example:

featuresToKeep = which(rowSums(obj) >= 100)samplesToKeep = which(pData(obj)$SmokingStatus == "Smoker")obj_smokers = obj[featuresToKeep, samplesToKeep]obj_smokers

## MRexperiment (storageMode: environment)## assayData: 1 features, 33 samples## element names: counts## protocolData: none## phenoData## sampleNames: CHK_6467_E3B11_BRONCH2_PREWASH_V1V2## CHK_6467_E3B11_OW_V1V2 ...## CHK_6467_E3B09_BAL_A_V1V2 (33 total)## varLabels: SampleType SiteSampled SmokingStatus## varMetadata: labelDescription## featureData## featureNames: 570## fvarLabels: OTU Taxonomy ... strain (10 total)## fvarMetadata: labelDescription## experimentData: use 'experimentData(object)'## Annotation:

head(pData(obj_smokers), 3)

## SampleType## CHK_6467_E3B11_BRONCH2_PREWASH_V1V2 Bronch2.PreWash## CHK_6467_E3B11_OW_V1V2 OW## CHK_6467_E3B11_BAL_A_V1V2 BAL.A## SiteSampled## CHK_6467_E3B11_BRONCH2_PREWASH_V1V2 Bronchoscope.Channel## CHK_6467_E3B11_OW_V1V2 OralCavity

10

Page 11: metagenomeSeq: Statistical analysis for sparse high ...bioconductor.org/packages/release/bioc/vignettes/metagenomeSeq/... · metagenomeSeq: Statistical analysis for sparse high-throughput

## CHK_6467_E3B11_BAL_A_V1V2 Lung## SmokingStatus## CHK_6467_E3B11_BRONCH2_PREWASH_V1V2 Smoker## CHK_6467_E3B11_OW_V1V2 Smoker## CHK_6467_E3B11_BAL_A_V1V2 Smoker

Alternative normalization scaling factors can be accessed or replaced with the normFactorsmethod:

head(normFactors(obj))

## [,1]## CHK_6467_E3B11_BRONCH2_PREWASH_V1V2 NA## CHK_6467_E3B11_OW_V1V2 NA## CHK_6467_E3B08_OW_V1V2 NA## CHK_6467_E3B07_BAL_A_V1V2 NA## CHK_6467_E3B11_BAL_A_V1V2 NA## CHK_6467_E3B09_OP_V1V2 NA

normFactors(obj) <- rnorm(ncol(obj))head(normFactors(obj))

## CHK_6467_E3B11_BRONCH2_PREWASH_V1V2## 1.3709584## CHK_6467_E3B11_OW_V1V2## -0.5646982## CHK_6467_E3B08_OW_V1V2## 0.3631284## CHK_6467_E3B07_BAL_A_V1V2## 0.6328626## CHK_6467_E3B11_BAL_A_V1V2## 0.4042683## CHK_6467_E3B09_OP_V1V2## -0.1061245

Library sizes (sequencing depths) can be accessed or replaced with the libSize method:

head(libSize(obj))

## [,1]## CHK_6467_E3B11_BRONCH2_PREWASH_V1V2 0## CHK_6467_E3B11_OW_V1V2 16## CHK_6467_E3B08_OW_V1V2 1## CHK_6467_E3B07_BAL_A_V1V2 2## CHK_6467_E3B11_BAL_A_V1V2 118## CHK_6467_E3B09_OP_V1V2 5

libSize(obj) <- rnorm(ncol(obj))head(libSize(obj))

## CHK_6467_E3B11_BRONCH2_PREWASH_V1V2## -0.88577630

11

Page 12: metagenomeSeq: Statistical analysis for sparse high ...bioconductor.org/packages/release/bioc/vignettes/metagenomeSeq/... · metagenomeSeq: Statistical analysis for sparse high-throughput

## CHK_6467_E3B11_OW_V1V2## -1.09978090## CHK_6467_E3B08_OW_V1V2## 1.51270701## CHK_6467_E3B07_BAL_A_V1V2## 0.25792144## CHK_6467_E3B11_BAL_A_V1V2## 0.08844023## CHK_6467_E3B09_OP_V1V2## -0.12089654

12

Page 13: metagenomeSeq: Statistical analysis for sparse high ...bioconductor.org/packages/release/bioc/vignettes/metagenomeSeq/... · metagenomeSeq: Statistical analysis for sparse high-throughput

Additionally, data can be filtered to maintain a threshold of minimum depth or OTU pres-ence:

data(mouseData)filterData(mouseData, present = 10, depth = 1000)

## MRexperiment (storageMode: environment)## assayData: 1057 features, 137 samples## element names: counts## protocolData: none## phenoData## sampleNames: PM1:20080108 PM1:20080114 ...## PM9:20080303 (137 total)## varLabels: mouseID date ... status (5 total)## varMetadata: labelDescription## featureData## featureNames: Erysipelotrichaceae:8## Lachnospiraceae:129 ... Collinsella:34 (1057## total)## fvarLabels: superkingdom phylum ... OTU (7 total)## fvarMetadata: labelDescription## experimentData: use 'experimentData(object)'## Annotation:

Two MRexperiment-class objects can be merged with the mergeMRexperiments func-tion, e.g.:

data(mouseData)newobj = mergeMRexperiments(mouseData, mouseData)

## MRexperiment 1 and 2 share sample ids; adding labels to sample ids.

newobj

## MRexperiment (storageMode: environment)## assayData: 10172 features, 278 samples## element names: counts## protocolData: none## phenoData## sampleNames: PM1:20080107.x PM1:20080108.x ...## PM9:20080303.y (278 total)## varLabels: mouseID date ... status (5 total)## varMetadata: labelDescription## featureData## featureNames: Prevotellaceae:1 Lachnospiraceae:1## ... Parabacteroides:956 (10172 total)## fvarLabels: superkingdom phylum ... OTU (7 total)## fvarMetadata: labelDescription## experimentData: use 'experimentData(object)'## Annotation:

13

Page 14: metagenomeSeq: Statistical analysis for sparse high ...bioconductor.org/packages/release/bioc/vignettes/metagenomeSeq/... · metagenomeSeq: Statistical analysis for sparse high-throughput

3 Normalization

Normalization is required due to varying depths of coverage across samples. cumNorm is anormalization method that calculates scaling factors equal to the sum of counts up to a particularquantile.

Denote the lth quantile of sample j as qlj , that is, in sample j there are l taxonomic features

with counts smaller than qlj . For l = b.95mc then qlj corresponds to the 95th percentile of thecount distribution for sample j.

Denote slj =∑

(i|cij≤qlj)cij as the sum of counts for sample j up to the lth quantile. Our

normalization chooses a value l̂ ≤ m to define a normalization scaling factor for each sampleto produce normalized counts c̃ij =

cij

sl̂jN where N is an appropriately chosen normalization

constant. See Appendix C for more information on how our method calculates the properpercentile.

These normalization factors are stored in the experiment summary slot. Functions to deter-mine the proper percentile cumNormStat, save normalized counts exportMat, or save varioussample statistics exportStats are also provided. Normalized counts can be called easily bycumNormMat(MRexperimentObject) or MRcounts(MRexperimentObject,norm=TRUE,log=FALSE).

3.1 Calculating normalization factors

After defining a MRexperiment object, the first step is to calculate the proper percentile bywhich to normalize counts. There are several options in calculating and visualizing the relativedifferences in the reference. Figure 3 is an example from the lung dataset.

data(lungData)p = cumNormStatFast(lungData)

To calculate the scaling factors we simply run cumNorm

lungData = cumNorm(lungData, p = p)

The user can alternatively choose different percentiles for the normalization scheme by spec-ifying p.

There are other functions, including normFactors, cumNormMat, that return the normal-ization factors or a normalized matrix for a specified percentile. To see a full list of functionsplease refer to the manual and help pages.

3.2 Exporting data

To export normalized count matrices:

mat = MRcounts(lungData, norm = TRUE, log = TRUE)[1:5, 1:5]exportMat(mat, file = file.path(dataDirectory, "tmp.tsv"))

To save sample statistics (sample scaling factor, quantile value, number of identified featuresand library size):

exportStats(lungData[, 1:5], file = file.path(dataDirectory,"tmp.tsv"))

## Default value being used.

14

Page 15: metagenomeSeq: Statistical analysis for sparse high ...bioconductor.org/packages/release/bioc/vignettes/metagenomeSeq/... · metagenomeSeq: Statistical analysis for sparse high-throughput

head(read.csv(file = file.path(dataDirectory, "tmp.tsv"), sep = "\t"))

## Subject Scaling.factor## 1 CHK_6467_E3B11_BRONCH2_PREWASH_V1V2 67## 2 CHK_6467_E3B11_OW_V1V2 2475## 3 CHK_6467_E3B08_OW_V1V2 2198## 4 CHK_6467_E3B07_BAL_A_V1V2 836## 5 CHK_6467_E3B11_BAL_A_V1V2 1008## Quantile.value Number.of.identified.features Library.size## 1 2 60 271## 2 1 3299 7863## 3 1 2994 8360## 4 1 1188 5249## 5 1 1098 3383

15

Page 16: metagenomeSeq: Statistical analysis for sparse high ...bioconductor.org/packages/release/bioc/vignettes/metagenomeSeq/... · metagenomeSeq: Statistical analysis for sparse high-throughput

4 Statistical testing

Now that we have taken care of normalization we can address the effects of under sampling ondetecting differentially abundant features (OTUs, genes, etc). This is our latest developmentand we recommend fitFeatureModel over fitZig. MRcoefs, MRtable and MRfulltable are usefulsummary tables of the model outputs.

4.1 Zero-inflated Log-Normal mixture model for each feature

By reparametrizing our zero-inflation model, we’re able to fit a zero-inflated model for eachspecific OTU separately. We currently recommend using the zero-inflated log-normal model asimplemented in fitFeatureModel.

4.1.1 Example using fitFeatureModel for differential abundance testing

Here is an example comparing smoker’s and non-smokers lung microbiome.

data(lungData)lungData = lungData[, -which(is.na(pData(lungData)$SmokingStatus))]lungData = filterData(lungData, present = 30, depth = 1)lungData <- cumNorm(lungData, p = 0.5)pd <- pData(lungData)mod <- model.matrix(˜1 + SmokingStatus, data = pd)lungres1 = fitFeatureModel(lungData, mod)head(MRcoefs(lungres1))

## logFC se pvalues adjPvalues## 3465 -4.824949 0.5697511 0.000000e+00 0.000000e+00## 35827 -4.304266 0.5445548 2.664535e-15 1.079137e-13## 2817 2.320656 0.4324661 8.045793e-08 1.629273e-06## 2735 2.260203 0.4331098 1.803341e-07 2.921412e-06## 5411 1.748296 0.3092461 1.572921e-08 4.246888e-07## 48745 -1.645805 0.3293117 5.801451e-07 7.831959e-06

4.2 Zero-inflated Gaussian mixture model

The depth of coverage in a sample is directly related to how many features are detected in asample motivating our zero-inflated Gaussian (ZIG) mixture model. Figure 2 is representative ofthe linear relationship between depth of coverage and OTU identification ubiquitous in marker-gene survey datasets currently available. For a quick overview of the mathematical model seeAppendix B.

Function fitZig performs a complex mathematical optimization routine to estimate prob-abilities that a zero for a particular feature in a sample is a technical zero or not. The functionrelies heavily on the limma package [4]. Design matrices can be created in R by using themodel.matrix function and are inputs for fitZig.

For large survey studies it is often pertinent to include phenotype information or confoundersinto a design matrix when testing the association between the abundance of taxonomic featuresand a phenotype phenotype of interest (disease, for instance). Our linear model methodology caneasily incorporate these confounding covariates in a straightforward manner. fitZig outputincludes weighted fits for each of them features. Results can be filtered and saved using MRcoefsor MRtable.

16

Page 17: metagenomeSeq: Statistical analysis for sparse high ...bioconductor.org/packages/release/bioc/vignettes/metagenomeSeq/... · metagenomeSeq: Statistical analysis for sparse high-throughput

Figure 2: The number of unique features is plotted against depth of coverage for samples from the HumanMicrobiome Project [3]. Including the depth of coverage and the interaction of body site and sequencing sitewe are able to acheive an adjusted R2 of .94. The zero-inflated Gaussian mixture was developed to account formissing features.

4.2.1 Example using fitZig for differential abundance testing

Warning: The user should restrict significant features to those with a minimum number ofpositive samples. What this means is that one should not claim features are significant unlessthe effective number of samples is above a particular percentage. For example, fold-changeestimates might be unreliable if an entire group does not have a positive count for the featurein question.

We recommend the user remove features based on the number of estimated effective samples,please see calculateEffectiveSamples. We recommend removing features with less thanthe average number of effective samples in all features. In essence, setting eff = .5 when usingMRcoefs, MRfulltable, or MRtable. To find features absent from a group the functionuniqueFeatures provides a table of the feature ids, the number of positive features and readsfor each group.

In our analysis of the lung microbiome data, we can remove features that are not presentin many samples, controls, and calculate the normalization factors. The user needs to decidewhich metadata should be included in the linear model.

data(lungData)controls = grep("Extraction.Control", pData(lungData)$SampleType)lungTrim = lungData[, -controls]rareFeatures = which(rowSums(MRcounts(lungTrim) > 0) < 10)lungTrim = lungTrim[-rareFeatures, ]lungp = cumNormStat(lungTrim, pFlag = TRUE, main = "Trimmed lung data")

## Default value being used.

17

Page 18: metagenomeSeq: Statistical analysis for sparse high ...bioconductor.org/packages/release/bioc/vignettes/metagenomeSeq/... · metagenomeSeq: Statistical analysis for sparse high-throughput

0.0

0.1

0.2

0.3

0.4

0.5

0.6

Trimmed lung data

Index

Rel

ativ

e di

ffere

nce

for

refe

renc

e

0 0.25 0.5 0.75 1

Figure 3: Relative difference for the median difference in counts from the reference.

lungTrim = cumNorm(lungTrim, p = lungp)

After the user defines an appropriate model matrix for hypothesis testing there are optionalinputs to fitZig, including settings determined by zigControl. We ask the user to reviewthe help files for both fitZig and zigControl. For this example we include body site ascovariates and want to test for the bacteria differentially abundant between smokers and non-smokers.

smokingStatus = pData(lungTrim)$SmokingStatusbodySite = pData(lungTrim)$SampleTypenormFactor = normFactors(lungTrim)normFactor = log2(normFactor/median(normFactor) + 1)mod = model.matrix(˜smokingStatus + bodySite + normFactor)settings = zigControl(maxit = 10, verbose = TRUE)fit = fitZig(obj = lungTrim, mod = mod, useCSSoffset = FALSE,

control = settings)

## it= 0, nll=88.42, log10(eps+1)=Inf, stillActive=1029## it= 1, nll=93.56, log10(eps+1)=0.06, stillActive=261## it= 2, nll=93.46, log10(eps+1)=0.05, stillActive=120## it= 3, nll=93.80, log10(eps+1)=0.05, stillActive=22## it= 4, nll=93.94, log10(eps+1)=0.03, stillActive=3## it= 5, nll=93.93, log10(eps+1)=0.00, stillActive=1## it= 6, nll=93.90, log10(eps+1)=0.00, stillActive=1## it= 7, nll=93.87, log10(eps+1)=0.00, stillActive=1## it= 8, nll=93.86, log10(eps+1)=0.00, stillActive=1## it= 9, nll=93.85, log10(eps+1)=0.00, stillActive=1

# The default, useCSSoffset = TRUE, automatically includes# the CSS scaling normalization factor.

18

Page 19: metagenomeSeq: Statistical analysis for sparse high ...bioconductor.org/packages/release/bioc/vignettes/metagenomeSeq/... · metagenomeSeq: Statistical analysis for sparse high-throughput

The result, fit, is a list providing detailed estimates of the fits including a limma fit infit$fit and an ebayes statistical fit in fit$eb. This data can be analyzed like any limmafit and in this example, the column of the fitted coefficientsrepresents the fold-change for our”smoker” vs. ”nonsmoker” analysis.

Looking at the particular analysis just performed, there appears to be OTUs representing twoPrevotella, two Neisseria, a Porphyromonas and a Leptotrichia that are differentially abundant.One should check that similarly annotated OTUs are not equally differentially abundant incontrols.

Alternatively, the user can input a model with their own normalization factors including themdirectly in the model matrix and specifying the option useCSSoffset = FALSE in fitZig.

4.2.2 Multiple groups

Assuming there are multiple groups it is possible to make use of Limma’s topTable functionsfor F-tests and contrast functions to compare multiple groups and covariates of interest. Theoutput of fitZig includes a ’MLArrayLM’ Limma object that can be called on by other functions.When running fitZig by default there is an additional covariate added to the design matrix. Thefit and the ultimate design matrix are crucial for contrasts.

# maxit=1 is for demonstration purposessettings = zigControl(maxit = 1, verbose = FALSE)mod = model.matrix(˜bodySite)colnames(mod) = levels(bodySite)# fitting the ZIG modelres = fitZig(obj = lungTrim, mod = mod, control = settings)# The output of fitZig contains a list of various useful# items. hint: names(res). Probably the most useful is the# limma 'MLArrayLM' object called fit.zigFit = res$fitfinalMod = res$fit$design

contrast.matrix = makeContrasts(BAL.A - BAL.B, OW - PSB, levels = finalMod)fit2 = contrasts.fit(zigFit, contrast.matrix)fit2 = eBayes(fit2)topTable(fit2)

## BAL.A...BAL.B OW...PSB AveExpr F## 18531 0.37318792 2.075648 0.7343081 12.715105## 6291 -0.10695735 1.658829 0.4671470 12.956898## 37977 -0.37995461 2.174071 0.4526060 12.528733## 6901 0.17344138 1.466113 0.2435881 12.018652## 40291 0.06892926 1.700238 0.2195735 11.803380## 36117 -0.28665883 2.233996 0.4084024 10.571931## 7343 -0.22859078 1.559465 0.3116465 10.090602## 7342 0.59882970 1.902346 0.5334647 9.410984## 1727 1.09837459 -2.160466 0.7780167 9.346013## 40329 -0.07145998 1.481582 0.2475735 9.700136## P.Value adj.P.Val## 18531 5.359780e-05 0.02813711## 6291 5.482439e-05 0.02813711## 37977 8.203239e-05 0.02813711## 6901 1.335806e-04 0.03212047

19

Page 20: metagenomeSeq: Statistical analysis for sparse high ...bioconductor.org/packages/release/bioc/vignettes/metagenomeSeq/... · metagenomeSeq: Statistical analysis for sparse high-throughput

## 40291 1.560761e-04 0.03212047## 36117 3.012092e-04 0.05013569## 7343 3.931844e-04 0.05013569## 7342 4.901651e-04 0.05013569## 1727 5.027597e-04 0.05013569## 40329 5.259032e-04 0.05013569

# See help pages on decideTests, topTable, topTableF,# vennDiagram, etc.

Further specific details can be found in section 9.3 and beyond of the Limma user guide.The take home message is that to make use of any Limma functions one needs to extract thefinal model matrix used: res$fit$design and the MLArrayLM Limma fit object: res$fit.

4.2.3 Exporting fits

Currently functions are being developed to wrap and output results more neatly, but MRcoefs,MRtable, MRfulltable can be used to view coefficient fits and related statistics and exportthe data with optional output values - see help files to learn how they differ. An important noteis that the by variable controls which coefficients are of interest whereas coef determines thedisplay.

To only consider features that are found in a large percentage of effectively positive (positivesamples + the weight of zero counts included in the Gaussian mixture) use the eff option in theMRtables.

taxa = sapply(strsplit(as.character(fData(lungTrim)$taxa), split = ";"),function(i) {

i[length(i)]})

head(MRcoefs(fit, taxa = taxa, coef = 2))

## smokingStatusSmoker## Neisseria polysaccharea -4.031612## Neisseria meningitidis -3.958899## Prevotella intermedia -2.927686## Porphyromonas sp. UQD 414 -2.675306## Prevotella paludivivens 2.575672## Leptotrichia sp. oral clone FP036 2.574172## pvalues adjPvalues## Neisseria polysaccharea 3.927097e-11 2.959194e-08## Neisseria meningitidis 5.751592e-11 2.959194e-08## Prevotella intermedia 4.339587e-09 8.930871e-07## Porphyromonas sp. UQD 414 1.788697e-07 1.357269e-05## Prevotella paludivivens 1.360718e-07 1.272890e-05## Leptotrichia sp. oral clone FP036 3.544957e-04 1.414122e-03

4.3 Time series analysis

Implemented in the fitTimeSeries function is a method for calculating time intervals forwhich bacteria are differentially abundant. Fitting is performed using Smoothing Splines ANOVA

20

Page 21: metagenomeSeq: Statistical analysis for sparse high ...bioconductor.org/packages/release/bioc/vignettes/metagenomeSeq/... · metagenomeSeq: Statistical analysis for sparse high-throughput

(SS-ANOVA), as implemented in the gss package. Given observations at multiple time pointsfor two groups the method calculates a function modeling the difference in abundance across alltime. Using group membership permutations weestimate a null distribution of areas under thedifference curve for the time intervals of interest and report significant intervals of time.

Use of the function for analyses should cite: ”Finding regions of interest in high throughputgenomics data using smoothing splines” Talukder H, Paulson JN, Bravo HC. (Submitted)

For a description of how to perform a time-series / genome based analysis call the fitTimeSeriesvignette.

# vignette('fitTimeSeries')

4.4 Log Normal permutation test

Included is a standard log normal linear model with permutation based p-values permutation.We show the fit for the same model as above using 10 permutations providing p-value resolutionto the tenth. The coef parameter refers to the coefficient of interest to test. We first generatethe list of significant features.

coeffOfInterest = 2res = fitLogNormal(obj = lungTrim, mod = mod, useCSSoffset = FALSE,

B = 10, coef = coeffOfInterest)

# extract p.values and adjust for multiple testing res$p are# the p-values calculated through permutationadjustedPvalues = p.adjust(res$p, method = "fdr")

# extract the absolute fold-change estimatesfoldChange = abs(res$fit$coef[, coeffOfInterest])

# determine features still significant and order by thesigList = which(adjustedPvalues <= 0.05)sigList = sigList[order(foldChange[sigList])]

# view the top taxa associated with the coefficient of# interest.head(taxa[sigList])

## [1] "Veillonella montpellierensis"## [2] "Veillonella rodentium"## [3] "Listeria grayi"## [4] "Anaeroglobus geminatus"## [5] "Veillonella montpellierensis"## [6] "Macrococcus caseolyticus"

4.5 Presence-absence testing

The hypothesis for the implemented presence-absence test is that the proportion/odds of a givenfeature present is higher/lower among one group of individuals compared to another, and wewant to test whether any difference in the proportions observed is significant. We use Fisher’sexact test to create a 2x2 contingency table and calculate p-values, odd’s ratios, and confidence

21

Page 22: metagenomeSeq: Statistical analysis for sparse high ...bioconductor.org/packages/release/bioc/vignettes/metagenomeSeq/... · metagenomeSeq: Statistical analysis for sparse high-throughput

intervals. fitPA calculates the presence-absence for each organism and returns a table of p-values, odd’s ratios, and confidence intervals. The function will accept either a MRexperimentobject or matrix. MRfulltable when sent a result of fitZig will also include the results offitPA.

classes = pData(mouseData)$dietres = fitPA(mouseData[1:5, ], cl = classes)# Warning - the p-value is calculating 1 despite a high odd's# ratio.head(res)

## oddsRatio lower upper## Prevotellaceae:1 Inf 0.01630496 Inf## Lachnospiraceae:1 Inf 0.01630496 Inf## Unclassified-Screened:1 Inf 0.01630496 Inf## Clostridiales:1 0 0.00000000 24.77661## Clostridiales:2 Inf 0.01630496 Inf## pvalues adjPvalues## Prevotellaceae:1 1.0000000 1## Lachnospiraceae:1 1.0000000 1## Unclassified-Screened:1 1.0000000 1## Clostridiales:1 0.3884892 1## Clostridiales:2 1.0000000 1

4.6 Discovery odds ratio testing

The hypothesis for the implemented discovery test is that the proportion of observed countsfor a feature of all counts are comparable between groups. We use Fisher’s exact test to createa 2x2 contingency table and calculate p-values, odd’s ratios, and confidence intervals. fitDOcalculates the proportion of counts for each organism and returns a table of p-values, odd’sratios, and confidence intervals. The function will accept either a MRexperiment object ormatrix.

classes = pData(mouseData)$dietres = fitDO(mouseData[1:100, ], cl = classes, norm = FALSE, log = FALSE)head(res)

## oddsRatio lower upper## Prevotellaceae:1 Inf 0.01630496 Inf## Lachnospiraceae:1 Inf 0.01630496 Inf## Unclassified-Screened:1 Inf 0.01630496 Inf## Clostridiales:1 0 0.00000000 24.77661## Clostridiales:2 Inf 0.01630496 Inf## Firmicutes:1 0 0.00000000 24.77661## pvalues adjPvalues## Prevotellaceae:1 1.0000000 1.0000000## Lachnospiraceae:1 1.0000000 1.0000000## Unclassified-Screened:1 1.0000000 1.0000000## Clostridiales:1 0.3884892 0.7470946## Clostridiales:2 1.0000000 1.0000000## Firmicutes:1 0.3884892 0.7470946

22

Page 23: metagenomeSeq: Statistical analysis for sparse high ...bioconductor.org/packages/release/bioc/vignettes/metagenomeSeq/... · metagenomeSeq: Statistical analysis for sparse high-throughput

4.7 Feature correlations

To test the correlations of abundance features, or samples, in a pairwise fashion we have imple-mented correlationTest and correctIndices. The correlationTest function willcalculate basic pearson, spearman, kendall correlation statistics for the rows of the input andreport the associated p-values. If a vector of length ncol(obj) it will also calculate the correlationof each row with the associated vector.

cors = correlationTest(mouseData[55:60, ], norm = FALSE, log = FALSE)head(cors)

## correlation## Clostridiales:11-Lachnospiraceae:35 -0.02205882## Clostridiales:11-Coprobacillus:3 -0.01701180## Clostridiales:11-Lactobacillales:3 -0.01264304## Clostridiales:11-Enterococcaceae:3 0.57315130## Clostridiales:11-Enterococcaceae:4 -0.01264304## Lachnospiraceae:35-Coprobacillus:3 0.24572606## p## Clostridiales:11-Lachnospiraceae:35 7.965979e-01## Clostridiales:11-Coprobacillus:3 8.424431e-01## Clostridiales:11-Lactobacillales:3 8.825644e-01## Clostridiales:11-Enterococcaceae:3 1.663001e-13## Clostridiales:11-Enterococcaceae:4 8.825644e-01## Lachnospiraceae:35-Coprobacillus:3 3.548360e-03

Caution: http://www.ncbi.nlm.nih.gov/pubmed/23028285

4.8 Unique OTUs or features

To find features absent from any number of classes the function uniqueFeatures provides atable of the feature ids, the number of positive features and reads for each group. Thresholdingfor the number of positive samples or reads required are options.

cl = pData(mouseData)[["diet"]]uniqueFeatures(mouseData, cl, nsamples = 10, nreads = 100)

## featureIndices Samp. in BK## Enterococcaceae:28 2458 0## Lachnospiraceae:1453 2826 16## Firmicutes:367 4165 0## Prevotellaceae:143 7030 50## Lachnospiraceae:4122 7844 15## Enterococcus:182 8384 0## Lachnospiraceae:4347 8668 50## Prevotella:79 8749 50## Prevotella:81 8994 71## Prevotellaceae:433 9223 53## Samp. in Western Reads in BK## Enterococcaceae:28 36 0## Lachnospiraceae:1453 0 192## Firmicutes:367 32 0## Prevotellaceae:143 0 109

23

Page 24: metagenomeSeq: Statistical analysis for sparse high ...bioconductor.org/packages/release/bioc/vignettes/metagenomeSeq/... · metagenomeSeq: Statistical analysis for sparse high-throughput

## Lachnospiraceae:4122 0 505## Enterococcus:182 34 0## Lachnospiraceae:4347 0 130## Prevotella:79 0 100## Prevotella:81 0 370## Prevotellaceae:433 0 154## Reads in Western## Enterococcaceae:28 123## Lachnospiraceae:1453 0## Firmicutes:367 112## Prevotellaceae:143 0## Lachnospiraceae:4122 0## Enterococcus:182 163## Lachnospiraceae:4347 0## Prevotella:79 0## Prevotella:81 0## Prevotellaceae:433 0

24

Page 25: metagenomeSeq: Statistical analysis for sparse high ...bioconductor.org/packages/release/bioc/vignettes/metagenomeSeq/... · metagenomeSeq: Statistical analysis for sparse high-throughput

5 Aggregating counts

Normalization is recommended at the OTU level. However, functions are in place to aggregatethe count matrix (normalized or not), based on a particular user defined level. Using the feature-Data information in the MRexperiment object, calling aggregateByTaxonomy or aggTax ona MRexperiment object and declaring particular featureData column name (i.e. ’genus’) willaggregate counts to the desired level with the aggfun function (default colSums). Possible aggfunalternatives include colMeans and colMedians.

obj = aggTax(mouseData, lvl = "phylum", out = "matrix")head(obj[1:5, 1:5])

## PM1:20080107 PM1:20080108 PM1:20080114## Actinobacteria 0 3 2## Bacteroidetes 486 921 1103## Cyanobacteria 0 0 0## Firmicutes 455 922 1637## NA 5 25 5## PM1:20071211 PM1:20080121## Actinobacteria 37 0## Bacteroidetes 607 818## Cyanobacteria 0 0## Firmicutes 772 1254## NA 8 4

Additionally, aggregating samples can be done using the phenoData information in the MR-experiment object. Calling aggregateBySample or aggsamp on a MRexperiment object anddeclaring a particular phenoData column name (i.e. ’diet’) will aggregate counts with the aggfunfunction (default rowMeans). Possible aggfun alternatives include rowSums and rowMedians.

obj = aggSamp(mouseData, fct = "mouseID", out = "matrix")head(obj[1:5, 1:5])

## PM1 PM10 PM11 PM12 PM2## Prevotellaceae:1 0 0 0.00000000 0 0## Lachnospiraceae:1 0 0 0.00000000 0 0## Unclassified-Screened:1 0 0 0.08333333 0 0## Clostridiales:1 0 0 0.00000000 0 0## Clostridiales:2 0 0 0.00000000 0 0

The aggregateByTaxonomy,aggregateBySample, aggTax aggSamp functions are flex-ible enough to put in either 1) a matrix with a vector of labels or 2) a MRexperiment object witha vector of labels or featureData column name. The function can also output either a matrix orMRexperiment object.

25

Page 26: metagenomeSeq: Statistical analysis for sparse high ...bioconductor.org/packages/release/bioc/vignettes/metagenomeSeq/... · metagenomeSeq: Statistical analysis for sparse high-throughput

6 Visualization of features

To help with visualization and analysis of datasets metagenomeSeq has several plotting func-tions to gain insight of the dataset’s overall structure and particular individual features. Aninitial interactive exploration of the data can be displayed with the display function.

For an overall look at the dataset we provide a number of plots including heatmaps offeature counts: plotMRheatmap, basic feature correlation structures: plotCorr, PCA/MDScoordinates of samples or features: plotOrd, rarefaction effects: plotRare and contingencytable style plots: plotBubble.

Other plotting functions look at particular features such as the abundance for a single feature:plotOTU and plotFeature, or of multiple features at once: plotGenus. Plotting multipleOTUs with similar annotations allows for additional control of false discoveries.

6.1 Interactive Display

Due to recent advances in the interactiveDisplay package, calling the display functionon MRexperiment objects will bring up a browser to explore your data through several inter-active visualizations. For more detailed interactive visualizations one might be interested in theshiny-phyloseq package.

# Calling display on the MRexperiment object will start a# browser session with interactive plots.

# require(interactiveDisplay) display(mouseData)

6.2 Structural overview

Many studies begin by comparing the abundance composition across sample or feature pheno-types. Often a first step of data analysis is a heatmap, correlation or co-occurence plot or someother data exploratory method. The following functions have been implemented to provide afirst step overview of the data:

1. plotMRheatmap - heatmap of abundance estimates (Fig. 4 left)

2. plotCorr - heatmap of pairwise correlations (Fig. 4 right)

3. plotOrd - PCA/CMDS components (Fig. 5 left)

4. plotRare - rarefaction effect (Fig. 5 right)

5. plotBubble - contingency table style plot (see help)

Each of the above can include phenotypic information in helping to explore the data.Below we show an example of how to create a heatmap and hierarchical clustering of log2

transformed counts for the 200 OTUs with the largest overall variance. Red values indicatecounts close to zero. Row color labels indicate OTU taxonomic class; column color labelsindicate diet (green = high fat, yellow = low fat). Notice the samples cluster by diet in thesecases and there are obvious clusters. We then plot a correlation matrix for the same features.

trials = pData(mouseData)$dietheatmapColColors = brewer.pal(12, "Set3")[as.integer(factor(trials))]heatmapCols = colorRampPalette(brewer.pal(9, "RdBu"))(50)

26

Page 27: metagenomeSeq: Statistical analysis for sparse high ...bioconductor.org/packages/release/bioc/vignettes/metagenomeSeq/... · metagenomeSeq: Statistical analysis for sparse high-throughput

PM

5:20

0803

03P

M12

:200

8012

8P

M12

:200

8020

4P

M9:

2008

0108

PM

10:2

0080

108

PM

6:20

0801

08P

M12

:200

8010

8P

M5:

2008

0108

PM

8:20

0801

08P

M5:

2008

0114

PM

6:20

0801

14P

M10

:200

8011

4P

M8:

2008

0114

PM

9:20

0802

11P

M6:

2008

0121

PM

9:20

0801

14P

M6:

2008

0211

PM

5:20

0801

21P

M5:

2008

0225

PM

5:20

0802

18P

M5:

2008

0211

PM

8:20

0801

21P

M6:

2008

0128

PM

6:20

0802

04P

M9:

2008

0121

PM

9:20

0801

28P

M9:

2008

0204

PM

9:20

0802

18P

M9:

2008

0225

PM

12:2

0080

218

PM

12:2

0080

225

PM

12:2

0080

303

PM

12:2

0080

114

PM

12:2

0080

121

PM

6:20

0803

03P

M6:

2008

0225

PM

6:20

0802

18P

M8:

2008

0211

PM

8:20

0802

04P

M8:

2008

0128

PM

12:2

0080

211

PM

10:2

0080

211

PM

10:2

0080

225

PM

10:2

0080

218

PM

10:2

0080

303

PM

9:20

0803

03P

M8:

2008

0303

PM

8:20

0802

25P

M8:

2008

0218

PM

10:2

0080

128

PM

10:2

0080

204

PM

10:2

0080

121

PM

5:20

0802

04P

M5:

2008

0128

PM

7:20

0712

11P

M11

:200

7121

1P

M8:

2007

1211

PM

2:20

0712

11P

M3:

2007

1211

PM

10:2

0071

211

PM

5:20

0712

11P

M9:

2007

1211

PM

1:20

0712

11P

M7:

2007

1217

PM

7:20

0803

03P

M7:

2008

0121

PM

7:20

0801

14P

M7:

2008

0204

PM

7:20

0801

28P

M7:

2008

0211

PM

7:20

0802

18P

M7:

2008

0225

PM

12:2

0071

217

PM

9:20

0712

17P

M2:

2007

1217

PM

10:2

0071

217

PM

11:2

0071

217

PM

5:20

0712

17P

M8:

2007

1217

PM

4:20

0712

17P

M6:

2007

1217

PM

1:20

0712

17P

M3:

2007

1217

PM

3:20

0802

18P

M11

:200

8030

3P

M11

:200

8022

5P

M3:

2008

0303

PM

1:20

0803

03P

M4:

2008

0303

PM

2:20

0803

03P

M2:

2008

0225

PM

2:20

0802

18P

M4:

2008

0114

PM

4:20

0802

04P

M4:

2008

0211

PM

4:20

0802

25P

M4:

2008

0218

PM

1:20

0802

18P

M1:

2008

0225

PM

1:20

0801

14P

M1:

2008

0128

PM

4:20

0801

28P

M11

:200

8021

1P

M11

:200

8020

4P

M1:

2008

0211

PM

1:20

0801

21P

M3:

2008

0121

PM

11:2

0080

218

PM

9:20

0801

07P

M4:

2008

0121

PM

3:20

0802

11P

M3:

2008

0128

PM

11:2

0080

128

PM

3:20

0802

25P

M2:

2008

0204

PM

2:20

0802

11P

M3:

2008

0204

PM

1:20

0802

04P

M1:

2008

0107

PM

3:20

0801

07P

M4:

2008

0107

PM

4:20

0801

08P

M2:

2008

0108

PM

6:20

0801

07P

M8:

2008

0107

PM

2:20

0801

07P

M5:

2008

0107

PM

3:20

0801

14P

M3:

2008

0108

PM

1:20

0801

08P

M10

:200

8010

7P

M12

:200

8010

7P

M2:

2008

0121

PM

2:20

0801

28P

M2:

2008

0114

PM

11:2

0080

121

PM

11:2

0080

107

PM

11:2

0080

114

PM

11:2

0080

108

Prevotella:84Lachnospiraceae:2961Bacteroides:890Bacteroides:925Bacteroides:369Veillonellaceae:60Parabacteroides:743Bacteroides:1132Prevotella:85Prevotella:86Coprobacillus:38Bacteroides:811Bacteroides:921Veillonellaceae:54Faecalibacterium:288Faecalibacterium:276Bryantella:94Lachnospiraceae:4319Lachnospiraceae:3663Clostridiales:1116Ruminococcus:15LachnospiraceaeIncertaeSedis:977Lachnospiraceae:3493Coprobacillus:49ErysipelotrichaceaeIncertaeSedis:313LachnospiraceaeIncertaeSedis:936Bacteroides:1055Bacteroides:891Alistipes:151Bacteroides:1111Bacteroides:1081Bacteroides:956Bacteroides:1153Alistipes:161Alistipes:156Lachnospiraceae:3899Bacteroides:1150Bacteroides:1151Bacteroides:1194Bacteroides:1166Bacteroides:1060Betaproteobacteria:13Betaproteobacteria:14Clostridiales:1114Bacteroides:868Ruminococcaceae:612Ruminococcaceae:593Lachnospiraceae:3861Clostridiales:1117Bacteroides:1050Lachnospiraceae:4003Lachnospiraceae:3963Lachnospiraceae:4394Lachnospiraceae:4261Lachnospiraceae:4362Lachnospiraceae:4233LachnospiraceaeIncertaeSedis:981Lachnospiraceae:685Lachnospiraceae:3743Bacteroides:768Bacteroides:1146Bacteroides:978Lachnospiraceae:3867Lachnospiraceae:3407LachnospiraceaeIncertaeSedis:962Bacteroidales:105Bacteroides:965Bacteroidales:72Lachnospiraceae:4374Lachnospiraceae:3509Firmicutes:340Bacteroides:820Bacteroides:667Bacteroides:854Ruminococcaceae:453Bacteroides:265Dorea:58Faecalibacterium:282Faecalibacterium:275Sutterella:13Lachnospiraceae:4347Bacteroides:901Prevotellaceae:143Bacteria:2Prevotellaceae:433Prevotellaceae:334Bacteria:4Bacteria:5Prevotella:74Prevotella:76Prevotella:81Prevotella:83Enterococcus:153Enterococcus:39Lachnospiraceae:3906ErysipelotrichaceaeIncertaeSedis:248ErysipelotrichaceaeIncertaeSedis:240LachnospiraceaeIncertaeSedis:854LachnospiraceaeIncertaeSedis:967LachnospiraceaeIncertaeSedis:1018ErysipelotrichaceaeIncertaeSedis:315Ruminococcaceae:80Eubacterium:44Ruminococcaceae:635LachnospiraceaeIncertaeSedis:506LachnospiraceaeIncertaeSedis:929Lachnospiraceae:4308Lachnospiraceae:4295Collinsella:28Lachnospiraceae:3067Enterobacteriaceae:18Enterobacter:1Lachnospiraceae:3907Lachnospiraceae:1453Lachnospiraceae:4122Coprobacillus:126Lachnospiraceae:4196Lachnospiraceae:4056Lachnospiraceae:2958Firmicutes:278Ruminococcaceae:274Ruminococcaceae:541Enterococcus:43Ruminococcaceae:575Lachnospiraceae:3800LachnospiraceaeIncertaeSedis:290Lachnospiraceae:796Lachnospiraceae:4248Lachnospiraceae:209LachnospiraceaeIncertaeSedis:277Coriobacteriaceae:42ErysipelotrichaceaeIncertaeSedis:255ErysipelotrichaceaeIncertaeSedis:232Ruminococcaceae:374Lachnospiraceae:3839Lachnospiraceae:1726Lachnospiraceae:4074Bilophila:37Bilophila:38Akkermansia:54IncertaeSedisXIII:17Lachnospiraceae:2908Bacteroides:1058Lachnospiraceae:4070Ruminococcaceae:634Clostridiales:313Lachnospiraceae:2423Enterococcus:182Enterococcus:162Lachnospiraceae:4148LachnospiraceaeIncertaeSedis:955Lactococcus:48Lactococcus:42Enterococcaceae:28Ruminococcaceae:469LachnospiraceaeIncertaeSedis:1021LachnospiraceaeIncertaeSedis:1009Ruminococcaceae:526Lachnospiraceae:3410Erysipelotrichaceae:49Coprobacillus:67Erysipelotrichaceae:43Akkermansia:40RuminococcaceaeIncertaeSedis:49LachnospiraceaeIncertaeSedis:1014Ruminococcaceae:547Ruminococcaceae:639Lachnospiraceae:4098Clostridiales:1075RuminococcaceaeIncertaeSedis:55Lachnospiraceae:4244Lachnospiraceae:3854LachnospiraceaeIncertaeSedis:471Lachnospiraceae:3636Coprobacillus:85Lachnospiraceae:313Lachnospiraceae:4403IncertaeSedisXIII:13ErysipelotrichaceaeIncertaeSedis:254Ruminococcaceae:387Lachnospiraceae:3992Anaerostipes:38Lachnospiraceae:2700Lachnospiraceae:4440Erysipelotrichaceae:26Parabacteroides:745Lachnospiraceae:3189Proteobacteria:26Betaproteobacteria:10Bacteroides:1048Bacteroides:1084LachnospiraceaeIncertaeSedis:1016Lachnospiraceae:4411Lachnospiraceae:4084Holdemania:23Clostridia:36Ruminococcaceae:642Lachnospiraceae:4047Ruminococcaceae:357Lachnospiraceae:3796

0 2 4 6 8 12

Value

060

00

Color Keyand Histogram

Cou

nt

Ent

erob

acte

r:1

Ent

erob

acte

riace

ae:1

8La

chno

spira

ceae

:430

8La

chno

spira

ceae

:429

5La

chno

spira

ceae

:390

7La

chno

spira

ceae

:145

3La

chno

spira

ceae

:412

2La

chno

spira

ceae

:306

7C

ollin

sella

:28

Lach

nosp

irace

ae:2

09La

chno

spira

ceae

Ince

rtae

Sed

is:2

77La

chno

spira

ceae

Ince

rtae

Sed

is:9

29La

chno

spira

ceae

:424

8La

chno

spira

ceae

:385

4La

chno

spira

ceae

:796

Lach

nosp

irace

aeIn

cert

aeS

edis

:471

Cop

roba

cillu

s:85

Lach

nosp

irace

ae:4

403

Lach

nosp

irace

ae:4

244

Akk

erm

ansi

a:54

Akk

erm

ansi

a:40

Lach

nosp

irace

aeIn

cert

aeS

edis

:101

4La

chno

spira

ceae

:290

8E

rysi

pelo

tric

hace

aeIn

cert

aeS

edis

:254

Rum

inoc

occa

ceae

Ince

rtae

Sed

is:4

9E

rysi

pelo

tric

hace

ae:4

9C

opro

baci

llus:

49C

lost

ridia

:36

Lach

nosp

irace

ae:3

636

Lach

nosp

irace

aeIn

cert

aeS

edis

:936

Hol

dem

ania

:23

Ince

rtae

Sed

isX

III:1

3E

rysi

pelo

tric

hace

aeIn

cert

aeS

edis

:313

Rum

inoc

occa

ceae

:642

Vei

llone

llace

ae:5

4B

acte

roid

es:9

21La

chno

spira

ceae

:318

9P

arab

acte

roid

es:7

45B

etap

rote

obac

teria

:10

Pro

teob

acte

ria:2

6E

rysi

pelo

tric

hace

ae:4

3C

opro

baci

llus:

67C

opro

baci

llus:

38E

rysi

pelo

tric

hace

ae:2

6R

umin

ococ

cace

ae:3

87A

naer

ostip

es:3

8La

chno

spira

ceae

:313

Lach

nosp

irace

ae:3

992

Bac

tero

ides

:116

6B

etap

rote

obac

teria

:13

Bac

tero

ides

:106

0B

etap

rote

obac

teria

:14

Par

abac

tero

ides

:743

Lach

nosp

irace

ae:3

493

Bac

tero

ides

:891

Lach

nosp

irace

ae:4

411

Lach

nosp

irace

ae:4

084

Alis

tipes

:161

Alis

tipes

:156

Rum

inoc

occa

ceae

:541

Ery

sipe

lotr

icha

ceae

Ince

rtae

Sed

is:2

40E

nter

ococ

cus:

43E

nter

ococ

cus:

39La

ctoc

occu

s:42

Lact

ococ

cus:

48La

chno

spira

ceae

:341

0R

umin

ococ

cace

ae:5

26La

chno

spira

ceae

:380

0R

umin

ococ

cace

ae:5

75La

chno

spira

ceae

:390

6E

rysi

pelo

tric

hace

aeIn

cert

aeS

edis

:232

Ery

sipe

lotr

icha

ceae

Ince

rtae

Sed

is:2

48La

chno

spira

ceae

Ince

rtae

Sed

is:9

55La

chno

spira

ceae

Ince

rtae

Sed

is:8

54La

chno

spira

ceae

:419

6E

nter

ococ

cus:

182

Ent

eroc

occu

s:16

2La

chno

spira

ceae

:242

3La

chno

spira

ceae

Ince

rtae

Sed

is:1

021

Rum

inoc

occa

ceae

:634

Ent

eroc

occa

ceae

:28

Rum

inoc

occa

ceae

:274

Lach

nosp

irace

aeIn

cert

aeS

edis

:100

9La

chno

spira

ceae

:414

8E

nter

ococ

cus:

153

Lach

nosp

irace

aeIn

cert

aeS

edis

:967

Rum

inoc

occa

ceae

:469

Lach

nosp

irace

aeIn

cert

aeS

edis

:101

8La

chno

spira

ceae

:295

8F

irm

icut

es:2

78La

chno

spira

ceae

Ince

rtae

Sed

is:2

90La

chno

spira

ceae

:172

6R

umin

ococ

cace

ae:3

74La

chno

spira

ceae

:407

0C

opro

baci

llus:

126

Lach

nosp

irace

ae:4

056

Lach

nosp

irace

ae:4

074

Clo

strid

iale

s:31

3E

rysi

pelo

tric

hace

aeIn

cert

aeS

edis

:255

Cor

ioba

cter

iace

ae:4

2La

chno

spira

ceae

:383

9La

chno

spira

ceae

:409

8A

listip

es:1

51B

acte

roid

es:1

111

Bac

tero

ides

:105

8B

acte

roid

es:1

081

Bilo

phila

:38

Bilo

phila

:37

Lach

nosp

irace

ae:3

796

Rum

inoc

occa

ceae

:357

Bac

tero

ides

:956

Bac

tero

ides

:115

3In

cert

aeS

edis

XIII

:17

Rum

inoc

occa

ceae

:639

Clo

strid

iale

s:10

75R

umin

ococ

cace

ae:5

47E

rysi

pelo

tric

hace

aeIn

cert

aeS

edis

:315

Rum

inoc

occa

ceae

:635

Rum

inoc

occa

ceae

Ince

rtae

Sed

is:5

5La

chno

spira

ceae

:404

7La

chno

spira

ceae

Ince

rtae

Sed

is:5

06E

ubac

teriu

m:4

4R

umin

ococ

cace

ae:8

0P

revo

tella

:83

Pre

vote

lla:8

1P

revo

tella

:85

Pre

vote

llace

ae:3

34B

acte

ria:4

Bac

teria

:2P

revo

tella

ceae

:143

Pre

vote

llace

ae:4

33S

utte

rella

:13

Bac

tero

ides

:965

Bac

tero

ides

:369

Bac

tero

idal

es:7

2La

chno

spira

ceae

:270

0B

acte

roid

ales

:105

Pre

vote

lla:7

6B

acte

ria:5

Pre

vote

lla:8

6P

revo

tella

:84

Pre

vote

lla:7

4B

acte

roid

es:8

11Fa

ecal

ibac

teriu

m:2

88Fa

ecal

ibac

teriu

m:2

76D

orea

:58

Faec

alib

acte

rium

:275

Faec

alib

acte

rium

:282

Lach

nosp

irace

aeIn

cert

aeS

edis

:981

Lach

nosp

irace

aeIn

cert

aeS

edis

:101

6V

eillo

nella

ceae

:60

Bac

tero

ides

:105

5La

chno

spira

ceae

:389

9B

acte

roid

es:1

048

Bac

tero

ides

:105

0B

acte

roid

es:1

084

Bac

tero

ides

:115

0B

acte

roid

es:1

151

Bac

tero

ides

:119

4B

acte

roid

es:1

132

Lach

nosp

irace

ae:2

961

Bac

tero

ides

:901

Bac

tero

ides

:265

Bac

tero

ides

:768

Lach

nosp

irace

ae:3

407

Bac

tero

ides

:890

Rum

inoc

occu

s:15

Lach

nosp

irace

ae:4

374

Fir

mic

utes

:340

Bac

tero

ides

:925

Bac

tero

ides

:820

Bac

tero

ides

:667

Bac

tero

ides

:854

Lach

nosp

irace

ae:4

347

Lach

nosp

irace

ae:4

440

Lach

nosp

irace

aeIn

cert

aeS

edis

:962

Lach

nosp

irace

ae:6

85La

chno

spira

ceae

:374

3La

chno

spira

ceae

:400

3La

chno

spira

ceae

:439

4La

chno

spira

ceae

:436

2La

chno

spira

ceae

:423

3B

acte

roid

es:1

146

Rum

inoc

occa

ceae

:612

Rum

inoc

occa

ceae

:593

Bac

tero

ides

:978

Bac

tero

ides

:868

Clo

strid

iale

s:11

14R

umin

ococ

cace

ae:4

53C

lost

ridia

les:

1117

Lach

nosp

irace

ae:3

861

Lach

nosp

irace

ae:4

261

Lach

nosp

irace

ae:3

963

Lach

nosp

irace

ae:3

867

Clo

strid

iale

s:11

16B

ryan

tella

:94

Lach

nosp

irace

ae:4

319

Lach

nosp

irace

aeIn

cert

aeS

edis

:977

Lach

nosp

irace

ae:3

509

Lach

nosp

irace

ae:3

663

Enterobacter:1Enterobacteriaceae:18Lachnospiraceae:4308Lachnospiraceae:4295Lachnospiraceae:3907Lachnospiraceae:1453Lachnospiraceae:4122Lachnospiraceae:3067Collinsella:28Lachnospiraceae:209LachnospiraceaeIncertaeSedis:277LachnospiraceaeIncertaeSedis:929Lachnospiraceae:4248Lachnospiraceae:3854Lachnospiraceae:796LachnospiraceaeIncertaeSedis:471Coprobacillus:85Lachnospiraceae:4403Lachnospiraceae:4244Akkermansia:54Akkermansia:40LachnospiraceaeIncertaeSedis:1014Lachnospiraceae:2908ErysipelotrichaceaeIncertaeSedis:254RuminococcaceaeIncertaeSedis:49Erysipelotrichaceae:49Coprobacillus:49Clostridia:36Lachnospiraceae:3636LachnospiraceaeIncertaeSedis:936Holdemania:23IncertaeSedisXIII:13ErysipelotrichaceaeIncertaeSedis:313Ruminococcaceae:642Veillonellaceae:54Bacteroides:921Lachnospiraceae:3189Parabacteroides:745Betaproteobacteria:10Proteobacteria:26Erysipelotrichaceae:43Coprobacillus:67Coprobacillus:38Erysipelotrichaceae:26Ruminococcaceae:387Anaerostipes:38Lachnospiraceae:313Lachnospiraceae:3992Bacteroides:1166Betaproteobacteria:13Bacteroides:1060Betaproteobacteria:14Parabacteroides:743Lachnospiraceae:3493Bacteroides:891Lachnospiraceae:4411Lachnospiraceae:4084Alistipes:161Alistipes:156Ruminococcaceae:541ErysipelotrichaceaeIncertaeSedis:240Enterococcus:43Enterococcus:39Lactococcus:42Lactococcus:48Lachnospiraceae:3410Ruminococcaceae:526Lachnospiraceae:3800Ruminococcaceae:575Lachnospiraceae:3906ErysipelotrichaceaeIncertaeSedis:232ErysipelotrichaceaeIncertaeSedis:248LachnospiraceaeIncertaeSedis:955LachnospiraceaeIncertaeSedis:854Lachnospiraceae:4196Enterococcus:182Enterococcus:162Lachnospiraceae:2423LachnospiraceaeIncertaeSedis:1021Ruminococcaceae:634Enterococcaceae:28Ruminococcaceae:274LachnospiraceaeIncertaeSedis:1009Lachnospiraceae:4148Enterococcus:153LachnospiraceaeIncertaeSedis:967Ruminococcaceae:469LachnospiraceaeIncertaeSedis:1018Lachnospiraceae:2958Firmicutes:278LachnospiraceaeIncertaeSedis:290Lachnospiraceae:1726Ruminococcaceae:374Lachnospiraceae:4070Coprobacillus:126Lachnospiraceae:4056Lachnospiraceae:4074Clostridiales:313ErysipelotrichaceaeIncertaeSedis:255Coriobacteriaceae:42Lachnospiraceae:3839Lachnospiraceae:4098Alistipes:151Bacteroides:1111Bacteroides:1058Bacteroides:1081Bilophila:38Bilophila:37Lachnospiraceae:3796Ruminococcaceae:357Bacteroides:956Bacteroides:1153IncertaeSedisXIII:17Ruminococcaceae:639Clostridiales:1075Ruminococcaceae:547ErysipelotrichaceaeIncertaeSedis:315Ruminococcaceae:635RuminococcaceaeIncertaeSedis:55Lachnospiraceae:4047LachnospiraceaeIncertaeSedis:506Eubacterium:44Ruminococcaceae:80Prevotella:83Prevotella:81Prevotella:85Prevotellaceae:334Bacteria:4Bacteria:2Prevotellaceae:143Prevotellaceae:433Sutterella:13Bacteroides:965Bacteroides:369Bacteroidales:72Lachnospiraceae:2700Bacteroidales:105Prevotella:76Bacteria:5Prevotella:86Prevotella:84Prevotella:74Bacteroides:811Faecalibacterium:288Faecalibacterium:276Dorea:58Faecalibacterium:275Faecalibacterium:282LachnospiraceaeIncertaeSedis:981LachnospiraceaeIncertaeSedis:1016Veillonellaceae:60Bacteroides:1055Lachnospiraceae:3899Bacteroides:1048Bacteroides:1050Bacteroides:1084Bacteroides:1150Bacteroides:1151Bacteroides:1194Bacteroides:1132Lachnospiraceae:2961Bacteroides:901Bacteroides:265Bacteroides:768Lachnospiraceae:3407Bacteroides:890Ruminococcus:15Lachnospiraceae:4374Firmicutes:340Bacteroides:925Bacteroides:820Bacteroides:667Bacteroides:854Lachnospiraceae:4347Lachnospiraceae:4440LachnospiraceaeIncertaeSedis:962Lachnospiraceae:685Lachnospiraceae:3743Lachnospiraceae:4003Lachnospiraceae:4394Lachnospiraceae:4362Lachnospiraceae:4233Bacteroides:1146Ruminococcaceae:612Ruminococcaceae:593Bacteroides:978Bacteroides:868Clostridiales:1114Ruminococcaceae:453Clostridiales:1117Lachnospiraceae:3861Lachnospiraceae:4261Lachnospiraceae:3963Lachnospiraceae:3867Clostridiales:1116Bryantella:94Lachnospiraceae:4319LachnospiraceaeIncertaeSedis:977Lachnospiraceae:3509Lachnospiraceae:3663

−1 0 0.5 1

Value

010

00

Color Keyand Histogram

Cou

nt

Figure 4: Left) Abundance heatmap (plotMRheatmap). Right) Correlation heatmap (plotCorr).

# plotMRheatmapplotMRheatmap(obj = mouseData, n = 200, cexRow = 0.4, cexCol = 0.4,

trace = "none", col = heatmapCols, ColSideColors = heatmapColColors)

# plotCorrplotCorr(obj = mouseData, n = 200, cexRow = 0.25, cexCol = 0.25,

trace = "none", dendrogram = "none", col = heatmapCols)

Below is an example of plotting CMDS plots of the data and the rarefaction effect at theOTU level. None of the data is removed (we recommend removing outliers typically).

cl = factor(pData(mouseData)$diet)

# plotOrd - can load vegan and set distfun = vegdist and use# dist.method='bray'plotOrd(mouseData, tran = TRUE, usePCA = FALSE, useDist = TRUE,

bg = cl, pch = 21)

# plotRareres = plotRare(mouseData, cl = cl, pch = 21, bg = cl)

# Linear fits for plotRare / legendtmp = lapply(levels(cl), function(lv) lm(res[, "ident"] ˜ res[,

"libSize"] - 1, subset = cl == lv))for (i in 1:length(levels(cl))) {

abline(tmp[[i]], col = i)}legend("topleft", c("Diet 1", "Diet 2"), text.col = c(1, 2),

box.col = NA)

27

Page 28: metagenomeSeq: Statistical analysis for sparse high ...bioconductor.org/packages/release/bioc/vignettes/metagenomeSeq/... · metagenomeSeq: Statistical analysis for sparse high-throughput

●●

●●

●●

●●

●●

●●

● ●●●

●●

●●●

●●

●●

●●

●●

●●●

●●

● ●

● ●

●●

●●

●●

●●

−20 −10 0 10 20 30

−40

−30

−20

−10

010

20

MDS component: 1

MD

S c

ompo

nent

: 2

●●

●●

● ●

●●

●●

●●

● ●

● ●

●●

● ●

1000 2000 3000 4000 5000 6000

300

400

500

600

Depth of coverage

Num

ber

of d

etec

ted

feat

ures

Diet 1Diet 2

Figure 5: Left) CMDS of features (plotOrd). Right) Rarefaction effect (plotRare).

6.3 Feature specific

Reads clustered with high similarity represent functional or taxonomic units. However, it is pos-sible that reads from the same organism get clustered into multiple OTUs. Following differentialabundance analysis. It is important to confirm differential abundance. One way to limit falsepositives is ensure that the feature is actually abundant (enough positive samples). Anotherway is to plot the abundances of features similarly annotated.

1. plotOTU - abundances of a particular feature by group (Fig. 6 left)

2. plotGenus - abundances for several features similarly annotated by group (Fig. 6 right)

3. plotFeature - abundances of a particular feature by group (similar to plotOTU, Fig.7)

Below we use plotOTU to plot the normalized log(cpt) of a specific OTU annotated as Neis-seria meningitidis, in particular the 779th row of lungTrim’s count matrix. Using plotGenuswe plot the normalized log(cpt) of all OTUs annotated as Neisseria meningitidis.

It would appear that Neisseria meningitidis is differentially more abundant in nonsmokers.

head(MRtable(fit, coef = 2, taxa = 1:length(fData(lungTrim)$taxa)))

## +samples in group 0 +samples in group 1## 63 24 6## 779 23 7## 358 24 1## 499 21 2## 25 15 26## 928 2 11## counts in group 0 counts in group 1 smokingStatusSmoker## 63 1538 11 -4.031612## 779 1512 22 -3.958899

28

Page 29: metagenomeSeq: Statistical analysis for sparse high ...bioconductor.org/packages/release/bioc/vignettes/metagenomeSeq/... · metagenomeSeq: Statistical analysis for sparse high-throughput

## 358 390 1 -2.927686## 499 326 2 -2.675306## 25 162 1893 2.575672## 928 4 91 2.574172## pvalues adjPvalues## 63 3.927097e-11 2.959194e-08## 779 5.751592e-11 2.959194e-08## 358 4.339587e-09 8.930871e-07## 499 1.788697e-07 1.357269e-05## 25 1.360718e-07 1.272890e-05## 928 3.544957e-04 1.414122e-03

patients = sapply(strsplit(rownames(pData(lungTrim)), split = "_"),function(i) {

i[3]})

pData(lungTrim)$patients = patientsclassIndex = list(smoker = which(pData(lungTrim)$SmokingStatus ==

"Smoker"))classIndex$nonsmoker = which(pData(lungTrim)$SmokingStatus ==

"NonSmoker")otu = 779

# plotOTUplotOTU(lungTrim, otu = otu, classIndex, main = "Neisseria meningitidis")

# Now multiple OTUs annotated similarlyx = fData(lungTrim)$taxa[otu]otulist = grep(x, fData(lungTrim)$taxa)

# plotGenusplotGenus(lungTrim, otulist, classIndex, labs = FALSE, main = "Neisseria meningitidis")

lablist <- c("S", "NS")axis(1, at = seq(1, 6, by = 1), labels = rep(lablist, times = 3))

classIndex = list(Western = which(pData(mouseData)$diet == "Western"))classIndex$BK = which(pData(mouseData)$diet == "BK")otuIndex = 8770

# par(mfrow=c(1,2))dates = pData(mouseData)$dateplotFeature(mouseData, norm = FALSE, log = FALSE, otuIndex, classIndex,

col = dates, sortby = dates, ylab = "Raw reads")

29

Page 30: metagenomeSeq: Statistical analysis for sparse high ...bioconductor.org/packages/release/bioc/vignettes/metagenomeSeq/... · metagenomeSeq: Statistical analysis for sparse high-throughput

●●

●●

● ●

02

46

8

Neisseria meningitidis

Groups of comparison

Nor

mal

ized

log(

cpt)

smoker nonsmoker

●● ●

● ●●● ● ●●●

●●

●●● ●● ●●● ●

●●

●●

●●

●●

●●●● ●●●● ●●●● ●●● ●●●●● ●●● ●● ●● ●●

●●●

●●●

● ●

● ●●● ●●

●●●

●●

●●●●●

●● ● ●●●●● ● ●● ●●● ●● ●●●●● ●●●●● ●●●● ●●●●●

●●●

●●

● ● ●●●●

●●●

●●

●●●

02

46

8

Neisseria meningitidis

Groups of comparison

Nor

mal

ized

log(

cpt)

S NS S NS S NS

Figure 6: Left) Abundance plot (plotOTU). Right) Multiple OTU abundances (plotGenus).

0 10 20 30 40 50

010

020

030

040

050

060

070

0

Western

Raw

rea

ds

0 20 40 60 80

010

020

030

040

050

060

070

0

BK

Raw

rea

ds

Figure 7: Plot of raw abundances

30

Page 31: metagenomeSeq: Statistical analysis for sparse high ...bioconductor.org/packages/release/bioc/vignettes/metagenomeSeq/... · metagenomeSeq: Statistical analysis for sparse high-throughput

7 Summary

metagenomeSeq is specifically designed for sparse high-throughput sequencing experimentsthat addresses the analysis of differential abundance for marker-gene survey data. The package,while designed for marker-gene survey datasets, may be appropriate for other sparse data setsfor which the zero-inflated Gaussian mixture model may apply. If you make use of the statis-tical method please cite our paper. If you made use of the manual/software, please cite themanual/software!

7.1 Citing metagenomeSeq

citation("metagenomeSeq")

#### Please cite the top for the original statistical## method and normalization method implemented in## metagenomeSeq and the bottom for the## software/vignette guide. Time series## analysis/function is described in the third citation.#### JN Paulson, OC Stine, HC Bravo, M Pop.## Differential abundance analysis for microbial## marker-gene surveys. Nat Meth Accepted#### JN Paulson, H Talukder, M Pop, HC Bravo.## metagenomeSeq: Statistical analysis for sparse## high-throughput sequencing. Bioconductor package:## 1.24.1. http://cbcb.umd.edu/software/metagenomeSeq#### H Talukder*, JN Paulson*, HC Bravo. Longitudinal## differential abundance analysis of marker-gene## surveys. Submitted#### To see these entries in BibTeX format, use## 'print(<citation>, bibtex=TRUE)', 'toBibtex(.)', or## set 'options(citation.bibtex.max=999)'.

7.2 Session Info

sessionInfo()

## R version 3.5.2 (2018-12-20)## Platform: x86_64-pc-linux-gnu (64-bit)## Running under: Ubuntu 16.04.5 LTS#### Matrix products: default## BLAS: /home/biocbuild/bbs-3.8-bioc/R/lib/libRblas.so## LAPACK: /home/biocbuild/bbs-3.8-bioc/R/lib/libRlapack.so#### locale:

31

Page 32: metagenomeSeq: Statistical analysis for sparse high ...bioconductor.org/packages/release/bioc/vignettes/metagenomeSeq/... · metagenomeSeq: Statistical analysis for sparse high-throughput

## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C## [9] LC_ADDRESS=C LC_TELEPHONE=C## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C#### attached base packages:## [1] parallel stats graphics grDevices utils## [6] datasets methods base#### other attached packages:## [1] biomformat_1.10.1 gss_2.1-9## [3] metagenomeSeq_1.24.1 RColorBrewer_1.1-2## [5] glmnet_2.0-16 foreach_1.4.4## [7] Matrix_1.2-15 limma_3.38.3## [9] Biobase_2.42.0 BiocGenerics_0.28.0## [11] knitr_1.21#### loaded via a namespace (and not attached):## [1] Rcpp_1.0.0 magrittr_1.5## [3] lattice_0.20-38 plyr_1.8.4## [5] stringr_1.3.1 highr_0.7## [7] caTools_1.17.1.1 tools_3.5.2## [9] grid_3.5.2 rhdf5_2.26.2## [11] xfun_0.4 KernSmooth_2.23-15## [13] iterators_1.0.10 gtools_3.8.1## [15] matrixStats_0.54.0 Rhdf5lib_1.4.2## [17] formatR_1.5 bitops_1.0-6## [19] codetools_0.2-16 evaluate_0.12## [21] gdata_2.18.0 stringi_1.2.4## [23] compiler_3.5.2 gplots_3.0.1## [25] jsonlite_1.6

32

Page 33: metagenomeSeq: Statistical analysis for sparse high ...bioconductor.org/packages/release/bioc/vignettes/metagenomeSeq/... · metagenomeSeq: Statistical analysis for sparse high-throughput

8 Appendix

8.1 Appendix A: MRexperiment internals

The S4 class system in R allows for object oriented definitions. metagenomeSeq makes useof the Biobase package in Bioconductor and their virtual-class, eSet. Building off of eSet,the main S4 class in metagenomeSeq is termed MRexperiment. MRexperiment is a simpleextension of eSet, adding a single slot, expSummary.

The experiment summary slot is a data frame that includes the depth of coverage andthe normalization factors for each sample. Future datasets can be formated as MRexperi-ment objects and analyzed with relative ease. A MRexperiment object is created by callingnewMRexperiment, passing the counts, phenotype and feature data as parameters.

We do not include normalization factors or library size in the currently available slot specifiedfor the sample specific phenotype data. All matrices are organized in the assayData slot. Allphenotype data (disease status, age, etc.) is stored in phenoData and feature data (OTUs,taxanomic assignment to varying levels, etc.) in featureData. Additional slots are availablefor reproducibility and annotation.

8.2 Appendix B: Mathematical model

Defining the class comparison of interest as k(j) = I{j ∈ groupA}. The zero-inflated modelis defined for the continuity-corrected log2 of the count data yij = log2(cij + 1) as a mixtureof a point mass at zero I{0}(yij) and a count distribution fcount(yij ;µi, σ

2i ) ∼ N(µi, σ

2i ). Given

mixture parameters πj , we have that the density of the zero-inflated Gaussian distribution forfeature i, in sample j with Sj total counts is:

fzig(yij ; θ) = πj(Sj) · I{0}(yij) + (1− πj(Sj)) · fcount(yij ; θ) (1)

Maximum-likelihood estimates are approximated using an EM algorithm, where we treatmixture membership ∆ij = 1 if yij is generated from the zero point mass as latent indicatorvariables [5]. We make use of an EM algorithm to account for the linear relationship betweensparsity and depth of coverage. The user can specify within the fitZig function a non-defaultzero model that accounts for more than simply the depth of coverage (e.g. country, age, anymetadata associated with sparsity, etc.). See Figure 8 for the graphical model.

More information will be included later. For now, please see the online methods in:http://www.nature.com/nmeth/journal/vaop/ncurrent/full/nmeth.2658.html

8.3 Appendix C: Calculating the proper percentile

To be included: an overview of the two methods implemented for the data driven percentilecalculation and more description below.

The choice of the appropriate quantile given is crucial for ensuring that the normalizationapproach does not introduce normalization-related artifacts in the data. At a high level, thecount distribution of samples should all be roughly equivalent and independent of each other upto this quantile under the assumption that, at this range, counts are derived from a commondistribution.

More information will be included later. For now, please see the online methods in:http://www.nature.com/nmeth/journal/vaop/ncurrent/full/nmeth.2658.html

33

Page 34: metagenomeSeq: Statistical analysis for sparse high ...bioconductor.org/packages/release/bioc/vignettes/metagenomeSeq/... · metagenomeSeq: Statistical analysis for sparse high-throughput

Figure 8: Graphical model. Green nodes represent observed variables: Sj is the total number of reads in samplej; kj the case-control status of sample j; and yij the logged normalized counts for feature i in sample j. Yellownodes represent counts obtained from each mixture component: counts come from either a spike-mass at zero,y0ij , or the “count” distribution, y1ij . Grey nodes b0i, b1i and σ2

i represent the estimated overall mean, fold-changeand variance of the count distribution component for feature i. πj , is the mixture proportion for sample j whichdepends on sequencing depth via a linear model defined by parameters β0 and β1. The expected value of latentindicator variables ∆ij give the posterior probability of a count being generated from a spike-mass at zero, i.e.y0ij . We assume M features and N samples.

34

Page 35: metagenomeSeq: Statistical analysis for sparse high ...bioconductor.org/packages/release/bioc/vignettes/metagenomeSeq/... · metagenomeSeq: Statistical analysis for sparse high-throughput

References

[1] Emily S Charlson, Kyle Bittinger, Andrew R Haas, Ayannah S Fitzgerald, Ian Frank, AnjanaYadav, Frederic D Bushman, and Ronald G Collman. Topographical continuity of bacterialpopulations in the healthy human respiratory tract. American Journal of Respiratory andCritical Care Medicine, 184, 2011.

[2] Peter J Turnbaugh, Vanessa K Ridaura, Jeremiah J Faith, Federico E Rey, Rob Knight, andJeffrey I Gordon. The effect of diet on the human gut microbiome: a metagenomic analysisin humanized gnotobiotic mice. Science translational medicine, 1(6):6ra14, 2009.

[3] Consortium HMP. A framework for human microbiome research. Nature, 486(7402), 2012.

[4] Gordon K Smyth. Limma: linear models for microarray data. Number October. Springer,2005.

[5] A P Dempster, N M Laird, and D B Rubin. Maximum likelihood from incomplete data viathe em algorithm. Journal of the Royal Statistical Society Series B Methodological, 39(1):1–38, 1977.

35