visualization and statistical comparisons of microbial communities using r...

September 20, 2010 15:55 WSPC - Proceedings Trim Size: 11in x 8.5in phylochip09n

VISUALIZATION AND STATISTICAL COMPARISONS OF MICROBIALCOMMUNITIES USING R PACKAGES ON PHYLOCHIP DATA

SUSAN HOLMES∗Statistics Department, Stanford University,

Stanford, CA 94305, USA∗E-mail:[email protected]/˜susan/

ALEXANDER ALEKSEYENKOCenter for Health Informatics and Bioinformatics,

NYU School of Medicine,New York, NY USA

ALDEN TIMMEStatistics Department, Stanford University,

Stanford, CA 94305, USA

TYRRELL NELSONDepartment of Civil and Environmental Engineering

Stanford University, Clark Center E-250318 Campus Drive, Stanford CA, 94305, USA

PANKAJ JAY PASRICHADivision of Gastroenterology and Hepatology

Stanford University Medical CenterAlway Building, Room M211

300 Pasteur Drive,Stanford, CA 94305, USA

ALFRED SPORMANNDepartment of Civil and Environmental Engineering

Stanford University, Clark Center E-250318 Campus Drive, Stanford CA, 94305, USA

This article explains the statistical and computational methodology used to analyze species abun-dances collected using the LNBL Phylochip in a study of Irritable Bowel Syndrome (IBS) in rats.

Some tools already available for the analysis of ordinary microarray data are useful in this typeof statistical analysis. For instance in correcting for multiple testing we use Family Wise Error ratecontrol and step-down tests (available in the multtest package). Once the most significant speciesare chosen we use the hypergeometric tests familiar for testing GO categories to test specific phylaand families.

We provide examples of normalization, multivariate projections, batch effect detection and in-tegration of phylogenetic covariation, as well as tree equalization and robustification methods.

Keywords: Hypergeometric Test; PhyloChip; projections; Quality Control; R; Phylogenetic Tree

1. Introduction

We present here some examples of using robust multivariate methods for the specific challengesof microbiome studies. We use as a running example a comparative study of microbiologicalcommunities in healthy and IBS rats sampled at different locations in the intestine. Theresults of the biological analysis have been submitted elsewhere,1 we concentrate here on thestatistical and computational challenges involved in such a project.


1.1. IBS in humans and rats

It is believed that alterations in the microflora of humans with IBS comes from changes incolonic fermentation patterns as has been described in King et al.2 Recently, some researchgroups have been able to use culture-independent methods and deep high throughput 16S ri-bosomal RNA gene sequencing to demonstrate significant differences in the microbiome of IBSpatients.3,4 The complexity induced by high individual variation of the microbiome suggestedthat a good starting point in this comparative study would be a rodent model that mimicsthe human condition. We have as our working hypothesis that the enteric microflora of adultrats with colonic hypersensitivity would differ from that of controls. We use a comprehensiveand relatively simple way of studying the microflora using a 16S rRNA gene DNA microar-ray called the Phylochip.5 The Phylochip has the advantage over high-throughput sequencingassays in that it is designed to detect presence and abundance of individual species. A majordrawback of utilizing the Phylochip platform for this project was that the chip design was notspecific to the intestinal microbiome and as a consequence there is a very unequal resolution incertain phyla, representing unequal knowledge about prokaryotic constituents of these phyla.

1.2. The data and software platform

Data were collected on the microbial community of different sections of the large bowel ofrats with colonic hypersensitivity induced by neonatal acetic acid irritation. This microar-ray consists of 500,000 oligonucleotide probes capable of identifying 8743 of bacteria andarchaea and provides a comprehensive census for presence and relative abundance of mostknown prokaryotes in a massive parallel assay. This array uses the the GeneChip (AffymetrixCorporation) technology, thus we could use the Bioconductor6 suite of tools for annotation7

and normalization of the data in the same way as is usual for microarray studies.8 We thenused multivariate methods to visualize comparisons between different groupings of the dataenabling us to enhance our quality control of the experimental protocol.

We then separated the data into consistently present species and those presenting highervariability. Previous computational approaches include the use of the weighted unifrac

(Wasserstein distance9) between communities.10 Here we take a geometrical approach to the vi-sualization and detection of various multidimensional biases and changes in variability, as wellas the combination of phylogenetic and low rank information. This is more akin to Purdom11

who also combines phylogenetic and abundance data, but for PCR sequenced phylotypes.Figure 1(a) shows a diagram of the data analysis workflow we chose to follow.

2. Details of the Data Analysis Procedures

2.1. Prefiltering and Normalization of the Microarray Data

We created and used a custom-tailored package containing the annotation of all the probeson the Phylochip using the makecdfenv7 package. As with standard expression data, the dataneed to be preprocessed to ensure that the variance was independent of the level of abundanceas described in Durbin et al12 and implemented in the vsn package8 in the Bioconductor6 suiteof R13 packages. Figure 1 (b) shows the densities of each of the arrays in the two groups after


variance stabilizing normalization.

Annotation andNormalization

Input phylogeneticinformation

Transform data throughthresholding and ranking

Dimension reduction (PCA)

Combine data with Tree

Prefilter the OTUs

Test for difference(adjust for multiple testing)

(lattice graph)

Highly consistent Highly variable

(most abundant)

Dimension reduction (PCA)

Project by batch and locations

Project by batch and locations

makecdfenvvsn

genefilter

ade4

s.class

multtest

aperead.tree

plot.tree

sort

veganpicante

(a) Different stages of Data Analysis

4 6 8 10 14

0.00

0.05

0.10

0.15

0.20

0.25

VSN−CTL

Den

sity

CCCDCMCNTR23CFCNTR23DFCNTR23MFCNTR23PFCNTRCFCNTRDFCNTRMFCNTRPFCPIBS1_CntrCbIBS1_CntrPb

4 6 8 10 14

0.00

0.05

0.10

0.15

0.20

0.25

VSN−IBS

Den

sity

20CF20DF20MF20PFIBS1_IBSCbIBS1_IBSPbIBSCIBSDIBSMIBSPICIDIMIP

(b) Density after variance stabilizing transformations.

Fig. 1: Tools were transposed from the standard microarray analyses

3. Batch Effect Detection using projections on Principal Planes

A standard principal component analysis was done on the centered and scaled abundancedata. In the first set of data, we had originally 24 samples, 12 from IBS, 12 from healthycontrols that we wanted to compare, the 12 samples for each group came from 4 locations inthe large intestine, however the first apparent differences came from batch groups. We had afirst batch of samples corresponding to analyses that were done on day 1 consisted of 6 arrays(3 IBS/3CTL), a second batch 18 arrays (9IBS and 9CTL), done on a second date with adifferent protocol and array batch. We used the additional ability provided by the projectionof supplementary group means and variance as in the function s.class in the ade414 packageto explore these batch effects in the laboratory methods used to generate the data. The ellipsesare computed using the means, variances and covariance of each group of points on both axes,and are drawn with these parameters: the center of the ellipse is centered on the means, itswidth and height are given by the variances, and the covariance sets the slope of the main axisof the ellipse. In Figure 2, on the left we see the first two batches although both balanced withregards to IBS and healthy rats were extremely different in variability and overall multivariatelocation. In order to explore this further, a third batch was generated with the same arrays asbatch 2 but the same experimental protocol as batch 1. We see that the third group faithfullyoverlaps with batch 1 thus showing that the batch effect was not due to a difference in arrays


but to the experimental protocol. This shows the utility of PCA in quality control. After

d = 0.02

●

●

●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

1

2

Two Batches of Arrays−First Plane

d = 0.02

●

●

●

● ●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

●

● 1

2

3

Fig. 2: On the left the first plane of the PCA shows the first set of data with two batches andon the right the third set of arrays was added.

finding this particular effect we redid part of the data collection procedure, using only theprotocol used in batches 1 and 3, we analyzed 24 samples. We also added 8 samples frommucosal linings, 4 from IBS , 4 for control in each of the 4 intestinal locations. We combinedthe data into a 32 column matrix of abundance of 8364 species. Since the abundance datawere extremely variable and we had seen the sensitivity of the data to varying conditions andprotocols we decided to pair the data by location and type. For each pair we had an IBS anda CTL rat, for a sample collected in the location and in the same way, we used the pairingdesign to minimize the biases from experimental artifacts.

3.1. Ranking and Thresholding

In order to deliver a more robust statistical analysis, we ranked the species abundances withineach array: the ranks go from 1 (small) to 8364 (large). This is a standard non parametricstatistical procedure that enhances the stability of the results because a few outliers cannotbias the analyses. We considered that there were not more than 2000 species present so weset a threshold at 6000 (this is conservative as for instance a recent study in humans placesthe estimate of numbers of species in the human gut at between 1,000 and 1,20015). We thussuppose that all ranks smaller than 6000 were just noise and set them all to be equal to 6000.This avoids finding large differences in ranks for species that are only present at the noise level.We restrict the first part of our analysis here to the species that appeared present in almostall 32 arrays, ie those that had a ranking larger than 6000 in all but one of the arrays. Wecan see the distribution patterns with varying thresholds from 5000 to 8400 in Table 1. As inmicroarray studies, it is important to prefilter the species so that only those yielding consistent


#Arrays 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

# > 5000 3997 241 144 91 64 60 55 43 43 37 45 46 41 28 28 38 23# > 6000 5180 207 136 71 62 48 32 39 38 31 34 25 25 24 24 24 22# > 7000 6492 120 82 40 32 27 31 13 33 22 22 18 19 12 13 20 17# > 8000 7737 70 35 25 9 9 10 12 13 12 15 14 9 11 9 11 7# > 8400 8235 36 24 21 14 6 6 11 10 5 8 5 5 5 6 2 6

#Arrays 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32

# > 5000 26 26 38 42 41 33 40 47 56 54 47 59 80 88 167 2766# > 6000 24 20 22 20 26 28 46 43 41 41 45 40 46 72 109 1989# > 7000 14 18 18 25 18 20 26 22 18 21 19 26 30 59 83 1204# > 8000 11 7 11 11 11 10 9 16 9 12 13 12 18 23 38 415# > 8400 7 2 8 4 5 7 6 13 10 5 5 5 6 16 18 112

Table 1: Tables showing the number of species present at a given level of abundance asmeasured by ranks in 0,1,2,. . . ,32 arrays. We can see in particular that there are about 2,000species present at least at the rank 6000 in all 32 arrays and about 415 which are highlyabundant (> 8000) in all arrays.

signals enter the analysis. In particular, this is important for various testing procedures we willuse later (testing for differences between IBS and CTL), where having extra non-meaningfulspecies costs us extra power requiring us to perform more tests than necessary. Table 1 is thebasis of most of the prefiltering presented in the paper.

4. Incorporating and adjusting the phylogenetic information

4.1. Difficulty with the Original Tree: heterogeneous levels of resolution

We entered the complete phylogeny of 16sRNA provided by GreenGenes into the R13 packageape.16 We can see in the left tree of Figure 3 that the phylogenetic tree of all the bacteriatested for on the microarrays is not ultra-metric. That is, not every species is at the samedistance from the root. When looking at the phylogenetic tree (Figure 3), it is evident thatsome areas of the tree have much greater resolution than others. The problem with this is thatsome species of bacteria are probed multiple times by the array. Therefore, they have morechances than other bacteria of showing significance under the null hypothesis. For example, ofthe 158 bacteria found to be significantly over- or under-abundant in IBS rats at the α = 0.05level in the first dataset (excluding the mucosal samples), nine are C. leptum, ten are R.hansenii, and ten are P. ruminicola. One of the questions that must be answered is whetheror not higher resolution in certain areas of the phylogenetic tree caused these species to beover-represented among the bacteria of interest. We will see below that the hypergeometrictest provides a way to control the phylogenetic bias at the higher-order level, but there is a lotof information lost when we look only at phyla. In an attempt to conserve information whilecorrecting for this oversampling in certain regions, we also propose a method for collapsingthe tree by merging the tips of related species with similar microarray intensities.


Fig. 3: On the left, we have the tree of all operational taxonomic units (otus) present on thePhylochip, we can observe that the distance to the root of many of the otus is variable, thusindicating a heterogeneous degree of resolution. The two trees on the right are filtered treesrepresenting only the 400 most abundant species. The blue tree on the right was computedby using the collapsing algorithm presented in this section, we see that the long right cladeat the bottom of the middle tree has disappeared.

The idea is to control for over-resolution by merging tips of the clades that are moreresolved, creating a more level playing field for the multiple testing. We used the lengthfrom the root of the tree as the main parameter for collapsing tips. That is, for any twospecies further from the root than the given maximum distance, we try to merge the two tips.However, merging is only done if the microarray data from the two species are similar enoughto be merged. Tips are only merged if there is a low enough variance across the bacteria foreach microarray measurement. What is a low enough variance, however, is difficult to define.For the purposes of the analysis here, we used a bootstrap procedure17 that estimated theq = 0.9-quantile for a random collection of groups of size n bacteria. This served as the cutoffof what could be considered a small enough variability within that clade. A collection of n


bacteria is merged only when all their tips are farther than the maximum length from rootand p = 80% of the 32 variances across the collection (one for each microarray) are below thecomputed thresholds. These were arbitrary thresholds that we have only evaluated empiricallyby running the algorithm with varying values for n, q and p.

4.2. Consistently Abundant Species and their place on the Tree

Here we chose about the top 100 most consistently abundant species following the choiceof a threshold of about 8400 as in Table 1. Here we show how we can use the enhancedplotting facilities in R through the Lattice compatible packages, we can plot the completetree, identifying the part of the tree which is covered by a subset of species.

y.coo

rd

y.coo

rd

range

(y.co

ord)

c(y.co

ord, y

.coord

)

Environmental_clone_wchb1-41_subgroup V eCytophaga_group_ii CFB group clone P. paCytophaga_group_ii Cytophagales clone HyBac.fragilis_subgroup clone L11-6Flx.sancti_subgroup Cilia-associated resRhodothermus_group sequences clone: IheBEnvironmental_clone_rb25_group HolophagaEnvironmental_clone_rb40_group clone JG3Environmental_clone_rb40_group clone DA0Dfi.dehalogenans_subgroup clone MB-C2-15Achromatium_assemblage Gamma IRMethylophaga buryaticaAeromonas eucrenophilaEscherichia_subgroup str. KN4.Pantoea agglomerans c3Secondary endosymbiont c11Secondary endosymbiont c14Altm.macleodii_group Alteromonadaceae clPseudomonas syringaeLeg.wadsworthii_subgroup clone ARKCH2Br2Rhf.fermentans_subgroup clone KD2-104Av.avenae_subgroup clone AF1-8Oxalobacter_group clone 29Acinetobacter tandoiiAcn.lwoffii_subgroup clone AF2-1DSpg.pruni_subgroup alpha clone:APe4_19Brevundimonas diminuta c2Blb.capsulatus_subgroup Alpha Kaza-35Rickettsia conoriiClostridium tetaniC.cochlearium_subgroup anoxSCC-33Environmental_clone_tm7_group gold mine C.thermolacticum_subgroup low G+C Gram-pDfm.geothermicum_group low G+C Gram-posiSymbiobacterium toebiiDesulfotomaculum thermoacetoxidansDfi.dehalogenans_subgroup clone RCP2-71Selenomonas ruminantium c2Bb.brevis_group Brevibacillus sp. MN 47.Bacillus horikoshii c2Caryophanon latumCrn.divergens_subgroup Carnobacterium spNostocoida limicola c6Tetragenococcus muriaticusEnterococcus faecium c2Streptococcus downeiStreptococcus gordoniiStreptococcus bovis c2Lactobacillus parakefiriPediococcus acidilactici c2Lactobacillus vaginalisLactobacillus pontisLactobacillus reuteriLactobacillus letivaziLactobacillus perolens c2Staphylococcus_subgroup Staphylococcus sLis.monocytogenes_subgroup clone S24-10Streptococcus pleomorphusPae.polymyxa_subgroup low G+C Gram-positC.thermocellum_subgroup low G+C Gram-posAcetanaerobacter thermotoleransAnaerococcus vaginalisC.glycolicum_subgroup Clostridium sp. MDC.aminobutyricum_subgroup clone: Rs-050C.cochlearium_subgroup clone RC30.C.thermobutyricum_subgroup ClostridiaceaC.leptum_subgroup Clostridiaceae clone:RC.subterminale_subgroup clone:Rs-Q18C.propionicum_subgroup clone:Rs-G77Clostridium leptumEub.ventriosum_subgroup clone HuCA20C.polysaccharolyticum_subgroup rumen cloC.thermocellum_subgroup clone 29D19C.leptum_subgroup Clostridiaceae clone: C.leptum_subgroup clone p-2657-65A5C.leptum_subgroup Clostridiaceae clone:RC.leptum_subgroup clone p-320-a3C.leptum_subgroup clone p-2179-s959-3C.leptum_subgroup rumen clone BF30C.leptum_subgroup Oscillospira guillermoC.leptum_subgroup clone HuCB5C.leptum_subgroup Butyrate-producing L2-C.leptum_subgroup clone HuCA1C.leptum_subgroup Clostridiaceae clone:RC.leptum_subgroup Clostridiaceae clone:RClostridium thermocellum c2Clostridium putrefaciensClostridium septicumClostridium beijerinckii c2Clostridium butyricumPeb.acetylenicus_subgroup delta isolate Gbc.metallireducens_subgroup isolate a2bDesulfobulbus_assemblage clone SHD-1Desulfobulbus_assemblage isolate a2b025Dbb.propionicus_assemblage delta clone HSyntrophus gentianaeNocardia transvalensisBif.infantis_subgroup human infant S6C

-50.0

-5.0

40.0

fold c

hang

e

CTL IBS Differ. Species

Fig. 4: The left tree shows the complete tree on all species in black with the subtree of mostabundant species in red, this subtree is the one plotted on the next panel. Values of abundancein CTL and IBS rats are plotted in the next two columns, the pink/blue scaled variables arethe truncated rank differences between the two groups.


4.3. Highly variable species

We concentrate now on the species which are abundant enough to be considered consistentlypresent (more than 15 out of 32 arrays over 7800) but that also show high variability (standarddeviation above 150). These values were arbitrarily chosen to retain about 100 species. Therewere actually 99 such species for which we had complete annotation information. We then

d = 5

C23CF2

C23DF2

C23MF2

C23PF2

CC

CD

CM

CP

IBS20CF2

IBS20DF2

IBS20MF2

IBS20PF2

IC

ID

IM

IP

MucCTLC MucCTLD

MucCTLM

MucCTLP

MucIBSC MucIBSD MucIBSM MucIBSP PCTL2 PCTLCF2

PCTLM2

PCTLP2

PIBSC2

PIBSD2

PIBSM2

PIBSP2

Fig. 5: Principal component analysis of top most abundant and variable species, we see theMucosal location is the explanation for the first component, all the mucosal samples havenegative loadings on this factor.

took the results of the PCA analysis and combined them with the tree information by usingthe loadings on the first two components (which account for 55% variance) and plotted themalongside the phylogenetic sub tree of the species we had retained as most variable. This plotis much easier to read than the projections of long species names in the two dimensionalprincipal plane. We have colored in red the species that are more abundant in the mucosalsamples.

4.4. Multiple Testing for finding differentially expressed species

The first set of analyses showed that the main differences were batch effects and differencesbetween the mucosal and other samples, so we decided to proceed by pairing the data bylocation, batch and mucosal types, thus removing the extra variance due to these factors.Thus we proceed into the testing phase using a paired design and we will use corrections madeon the paired t-test rather than the ordinary one. We will use truncated paired differences


y.co

ord

PC1+PC1−

y.co

ord

PC2− PC2+

c(y.

coor

d, y

.coo

rd)

Environmental_clone_4b7_subgroup unnamedMrb.hydrocarbonoclasticus_group clone SHSpm.paucivorans_subgroup clone SHA−109Prosthecobacter dejongeiiEnvironmental_clone_s027_group rumen cloPirellula_schlesner_isolates clone HT2F1Pirellula_schlesner_isolates planctomyceCytophaga_group_ii Bacteroidetes clone MPpm.macacae_subgroup clone ABLCf15Rik.microfusus_subgroup Bacteroidaceae cCy.fermentans_subgroup Bacteroidetes sp.Nitrospira cf c2Environmental_clone_s027_group 17F9Acbt.capsulatum_group K−5b10Nitrosococcus halophilusHalomonas variabilis c2Shewanella benthicaShewanella waksmaniiRum.amylophilus_subgroup AnaerobiospirilHistophilus somniAltm.macleodii_group clone ARKIA−34Pal.haloplanktis_subgroup Marine Tw−10Ps.amygdali_subgroup isolate C1_B045Pseudomonas marginalisPs.putida_subgroup Pseudomonas sp. dcm7BAzc.indigens_subgroup beta clone:Rs−B77Nss.multiformis_subgroup Banisveld landfBurkholderia caribiensisBurkholderia malleiRalstonia basilensisRalstonia insidiosaOxalobacter_group Beta A1020Oxalobacter formigenes c2Herbaspirillum seropedicaeDv.riboflavina_subgroup alpha clone JG34Ruegeria atlanticaBei.indica_subgroup alpha clone Gitt−KF−Rickettsia_subgroup alpha clone:Rs−B60Desulfovibrio giganteusPrc.marinus_group clone TK09Pseudanabaena_group Synechococcus sp. PCLeptospira faineiTrp.pallidum_subgroup Treponema III:C:BABeet leafhopperBacillus benzoevorans c2B.cohnii_subgroup soil clone 1448−1Bacillus sphaericus c3Bacillus silvestris c2Abiotrophia_group feedlot manure B7Enterococcus_group Swine manure RT−5DStc.salivarius_subgroup clone 32CRLactobacillus murinusLactobacillus gallinarumL.delbrueckii_subgroup clone L10−37Staphylococcus_subgroup Staphylococcus sC.purinolyticum_subgroup low G+C Gram−poAnaerococcus prevotiiC.aminobutyricum_subgroup clone:Rs−N82C.polysaccharolyticum_subgroup isolate HRuc.gnavus_subgroup sp. III−35Ruc.gnavus_subgroup clone L10−3Eubacterium oxidoreducensBtv.fibrisolvens_subgroup clone p−2418−5Eubacterium plexicaudatumRuc.hansenii_subgroup A17Ruc.hansenii_subgroup GA36Ruminococcus obeumClostridium bolteiC.leptum_subgroup CCAJG148C.xylanolyticum_subgroup equine intestinRuc.hansenii_subgroup equine intestinal C.xylanolyticum_subgroup equine intestinC.xylanolyticum_subgroup clone p−5263−4WRuc.gnavus_subgroup clone p−57−a5C.aminovalericum_subgroup from anoxic buC.thermocellum_subgroup clone p−1062−a5Ruc.hansenii_subgroup equine intestinal C.polysaccharolyticum_subgroup rumen cloC.leptum_subgroup clone RFN71.C.leptum_subgroup str. ASF 500.C.leptum_subgroup clone p−953−s962−5C.leptum_subgroup ckncm326−B4−13C.leptum_subgroup Oscillospira guillermoC.leptum_subgroup clone p−241−o5Clostridium aurantibutyricumPol.cellulosum_subgroup clone BTM36Dbb.propionicus_assemblage delta isolateEnvironmental_clone_ocs307_group MERTZ_2Rco.rhodochrous_group clone RCP2−103Dactylosporangium aurantiacumDactylosporangium matsuzakienseMycobacterium terraeKineococcus aurantiacus c3Act.neuii_subgroup Actinomyces sp. CCUG Bifidobacterium psychroaerophilumBifidobacterium breveEnvironmental_clone_opb80_group UASB_TL2Anr.thermoterrenum_group Flexistipes sp.Lpp.ferrooxidans_subgroup clone JG34−KF−

Fig. 6: Complete tree with the subtree of most variable among the consistently abundantspecies and the loadings on the first two principal components.

in ranks as input to standard multiple testing programs for finding the adjusted p-values. Tocontrol for false discovery due to multiple testing, p-values were adjusted according to theBenjamini-Hochberg procedure, which is able to control for FDR given some assumptions onthe expression levels of the bacteria on the microarray. We used the multtest package fromBioconductor.6

4.5. Significant differences projected onto the Tree

In order to visualize the parts of the phylogenetic tree most influenced by changes in speciesabundance between groups we retained the most significantly changed species (up in IBS or


up in CTL) on the tree and used the facilities available through the ape18 and the lattice19

packages.

y.coo

rd

CTLIBS

y.coo

rd

p-v alue

c(y.co

ord, y

.coord

)

Flx.maritimus_subgroup CFB group clone R

Csb.balustinum_subgroup clone BIOEST-10

Ppm.macacae_subgroup clone ABLCf15

Frankia_subgroup actinobacterium clone J

Nitrosococcus halophilus

Nsc.oceani_subgroup clone RCP2-96

Shewanella benthica

Rum.amylophilus_subgroup Anaerobiospiril

Marine gamma

Ps.amygdali_subgroup isolate C1_B045

Burkholderia caribiensis

Ralstonia basilensis

Oxalobacter_group Beta A1020

Oxalobacter formigenes c2

Herbaspirillum seropedicae

Dv.riboflavina_subgroup alpha clone JG34

Rickettsia_subgroup alpha clone:Rs-B60

Rhv.sulfidophilum_subgroup clone ZA2526c

Alvinella_pompejana_symbionts_subgroup S

C.thermocellum_subgroup equine intestina

Prc.marinus_group clone TK09

Halobacillus locisalis

B.thermocloacae clone p-2330-s962-2

Anaerococcus prevotii

Ruc.gnavus_subgroup clone L10-3

Ruc.gnavus_subgroup clone A11.

Ruc.gnavus_subgroup clone HuCA28

Ruc.hansenii_subgroup GA36

Clostridium boltei

C.xylanolyticum_subgroup equine intestin

Ruc.gnavus_subgroup clone p-57-a5

Lcn.multipara_subgroup clone p-1594-c5

C.aminovalericum_subgroup from anoxic bu

C.thermocellum_subgroup clone p-1062-a5

Btv.fibrisolvens_subgroup Butyrate-produ

C.leptum_subgroup clone p-2573-9F5

C.leptum_subgroup ckncm326-B4-13

Anr.thermoterrenum_group Flexistipes sp.

p-value Species

Fig. 7: The left tree shows the complete tree on all species in black with the subtree of setof species that show the most significantly differences between CTL and IBS in red in thesecond panel. Values of abundance in CTL and IBS rats are plotted in the next two columns,the next column shows the − log(pvalue), so the largest bars represent the most significantlydifferent species.

4.6. Category Based Comparisons

We chose as the list of most significant species those that had adjusted p-values lower than0.05 in the multiple testing procedure detailed above. We created two lists, one for whichthe ranked abundances were larger in the IBS, the other for which the ranked abundanceswere larger in the CTL group. We wanted to find specific families or phyla that are over-represented in either of the lists. This is a similar situation as that of testing significanceof Gene Ontology categories for expression studies. We recall that in both situations the


relevant test is the hypergeometric and that Fisher’s exact test and the hypergeometric testformulation are equivalent.20 We define the set of prefiltered species (species universe) asthose that passed the threshold test of being present (> 6000) in at least 31 of the arrays (seeTable 1). The chosen species (universe and significant) are then binned by phyla or families,these categories replace the Gene Ontology categories used in microarray studies. We arelooking for overrepresentation of certain families or phyla. This method is especially relevanthere as the chip does not have equal representation of different families and phyla.

The results and details of the hypergeometric tests can be consulted in Nelson et al, 20101

where we conclude in particular that the IBS had significantly more Bacteriodetes and on theother hand there is an overrepresentation of Firmicutes in the healthy controls. At the familylevel, the results showed that the families of Oxalobacteraceae, Prevotellaceae, Burkholderi-aceae, Sphingobacteriaceae were significantly overrepresented in IBS rat. Conversely, the mostsignificantly enriched family in control rats were Lachnospiraceae, including Ruminococcus sp.,followed by Erysipelotrichaeceae and Clostridiaceae.

5. Summary

Some methods developed for standard microarray studies can be useful in Phylochip studies,examples shown here include variance stabilization, prefiltering, multiple testing and hyper-geometric tests.

Batch effects can be detected through multivariate projections using methods such as PCAcomplemented with the projections of the relevant means, variance and covariance ellipses onthe principal planes. We concluded that the best way to counter batch effects was then to usepaired differences between subjects if a comparative design is available.

High between subject variability in bacterial abundances suggests the use of ranks is moreeffective than the original intensities. This method is known to be robust in the sense that ifsome of the abundance values are on very different scales, their effect on the overall outcomecan be minimized by replacing the original values by the ranks within each array. We haveprovided an example of such an approach here.

Finally the integration of complex phylogenetic structure is possible through the conjointuse of the many available packages in R for doing phylogenetics and community analysis. Wehave provided an example of a complex combination of plotting trees and results from PCA.

Acknowledgments

We thank Todd DeSantis and Cindy Wu for enlightening conversations about the Phylochipand all participants in the R and Bioconductor projects for their packages. This work wassupported by a grant from the Bio-X Interdisciplinary Initiative Program IIP4-32 to P.J.Pasricha and A. Spormann and grant NIH-5-R01GM086884 to S. Holmes. We thank thereviewers who made useful comments that helped improve the paper.


References

1. T. A. Nelson, S. Holmes, A. V. Alekseyenko, M. Shenoy, T. DeSantis, C. Wu, G. L. Anderson,J. Sonnenburg, P. J. Pasricha and A. Spormann, Phylochip microarray analysis reveals alteredgastrointestinal microbial communities in a rat model of colonic hypersensitivity, Submitted.

2. T. King, M. Elia and J. Hunter, The Lancet (Jan 1998).3. E. Malinen, T. Rinttila, K. Kajander and J. Matto, The American journal of Gastroenterology

100, 373 (Jan 2005).4. A. Kassinen, L. Krogius-Kurikka and H. Makivuokko, Gastroenterology (Jan 2007).5. K. H. Wilson, W. J. Wilson, J. L. Radosevich, T. Z. DeSantis, V. S. Viswanathan, T. A. Kucz-

marski and G. L. Andersen, Appl. Environ. Microbiol. 68, 2535 (2002).6. R. C. Gentleman, V. J. Carey, D. M. Bates, B. Bolstad, M. Dettling, S. Dudoit, B. Ellis, L. Gau-

tier, Y. Ge, J. Gentry, K. Hornik, T. Hothorn, W. Huber, S. Iacus, R. Irizarry, F. Leisch, C. Li,M. Maechler, A. J. Rossini, G. Sawitzki, C. Smith, G. Smyth, L. Tierney, J. Y. H. Yang andJ. Zhang, Genome Biology 5, p. R80 (Jan 2004).

7. R. A. Irizarry, L. Gautier, W. Huber and B. Bolstad, makecdfenv 1.18 (2009), http://cran.r-project.org/doc/packages/makecdfenv.pdf.

8. W. Huber, A. von Heydebreck, H. Sultmann, A. Poustka and M. Vingron, Bioinformatics 18Suppl 1, S96 (Jan 2002).

9. S. N. Evans and F. A. Matsen, arXiv q-bio.PE (Jan 2010).10. M. Hamady, C. Lozupone and R. Knight, The ISME Journal (Jan 2009).11. E. Purdom, Annals of Applied Statistics .12. B. Durbin, J. Hardin, D. Hawkins and D. Rocke, Bioinformatics 19, 1360 (2003).13. R. Ihaka and R. Gentleman, Journal of Computational and Graphical Statistics 5, 299 (1996).14. D. Chessel, A. Dufour and J. Thioulouse, R News 4, 5 (2004).15. J. Qin, R. Li, J. Raes, M. Arumugam, K. er Solvsten Burgdorf, C. Manichanh, T. Nielsen,

N. Pons, F. ce Levenez, T. Yamada, D. R. Mende, J. Li, J. Xu, S. Li, D. Li, J. Cao, B. Wang,H. Liang, H. Zheng, Y. ong Xie, J. Tap, P. Lepage, M. Bertalan, J.-M. Batto, T. orben Hansen,D. L. Paslier, A. Linneberg, H. B. Nielsen, E. P. tier, P. Renault, T. Sicheritz-Ponten, K. Turner,H. Zhu, C. ang Yu, S. Li, M. Jian, Y. Zhou, Y. Li, X. Zhang, S. gang Li, N. Qin, H. Yang,J. Wang, S. Brunak, J. D. a nd Francisco Guarner, K. Kristiansen, O. Pedersen, J. Parkhill,J. Weissenbach, M. Consortium, P. Bork, S. D. Ehrlich and J. Wang, Nature 464, 59 (Mar2010).

16. E. Paradis, Ape (analysis of phylogenetics and evolution) v1.8-2 (2006), http://cran.r-project.org/doc/packages/ape.pdf.

17. B. Efron, R. Tibshirani and R. Tibshirani, An introduction to the bootstrap (Chapman &Hall/CRC, 1993).

18. E. Paradis, J. Claude and K. Strimmer, Bioinformatics 20, 289 (2004).19. D. Sarkar, lattice: Lattice Graphics, (2009). R package version 0.17-26.20. I. Rivals, L. Personnaz, L. Taing and M.-C. Potier, Bioinformatics 23, 401 (Feb 2007).21. S. Kembel, P. Cowan, M. Helmus, W. Cornwell, H. Morlon, D. Ackerly, S. Blomberg and C. Webb,

Bioinformatics 26, 1463 (2010).22. K. S. Pollard, H. N. Gilbert, Y. Ge, S. Taylor and S. Dudoit, multtest: Resampling-based multiple

hypothesis testing, (2010). R package version 2.4.0.23. J. Oksanen, F. G. Blanchet, R. Kindt, P. Legendre, R. G. O’Hara, G. L. Simpson, P. Solymos,

M. H. H. Stevens and H. Wagner, vegan: Community Ecology Package, (2010). R package version1.17-0.

visualization and statistical comparisons of microbial communities using r...

Documents