microbiomes and metabolomes - biostatistics · metabolomics normalisation • no internal...

Microbiomes and metabolomes

Michael Inouye, Baker Heart and Diabetes Institute, Univ of Melbourne / Monash Univ

Summer Institute in Statistical Genetics 2017, Integrative Genomics Module

Seattle

@minouye271 www.inouyelab.org

Interactions between microbes and metabolites

Phelan et al, Nat Chem Biol 2011

Metabolites are…

• Nutrients

• Signals between cells (microbe-microbe, microbe-host)

• Control of multicellular/community behaviour

Human microbiota and metabolism

Le Chatelier et al, Nature 2013

LGC – Low bacterial gene count (‘richness’)

Koropatkin et al, 2013

Roseburia, Eubacterium, Clostridium, Ruminococcus, Bifidobacterium

SCFA – Short chain fatty acids

Background (microbiome)

• Culture
  • Looking only at a few candidate species
  • Need prior hypothesis
  • Akin to candidate gene approach

• Sequencing
  • >99% of microbes cannot be cultured
  • Hypothesis-free/unbiased approach
  • Akin to GWAS

Courtesy of Shu Mei Teo

Background


What bacteria?

• Sequence all the genes

• Sequence a marker gene
  • Cost efficient
  • Depends on research question
  • 16S rRNA gene

http://www.alimetrics.net/en/index.php/dna-sequence-analysis

Why 16S rRNA gene?

• Present in all bacteria

• Conserved regions → primer design

• Variable regions
• Large databases
→ Taxonomic assignment

Sample preparation for multiplex sequencing

• DNA extraction
  • Bead beating step for lysis

• PCR reaction mix
  • Forward primer
  • PCR mix

• Well-specific mix
  • Sample
  • Sample-specific tagged reverse primer

Wells: 1, 2, 3, …, 96

• Purify the sample
• Quantify purified sample
• Dilute samples so that each sample has the same concentration
• Pool equal amounts of DNA
• Final pool for sequencing

Sample preparation
• DNA extraction
• PCR

[Figure: fusion primers for the 16S V4 amplicon. Forward fusion primer (× 1): sequencing primer and target primer F (PCR amplification). Reverse fusion primers (× n): sequencing primer, unique barcode and target primer R (PCR amplification).]
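Because every sample carries its own barcode, pooled reads can be assigned back to samples after sequencing. The sketch below is a minimal illustration, assuming the barcode sits at the very start of each read and must match exactly; the barcodes, sample names and reads are made up.

```python
from collections import defaultdict


def demultiplex(reads, barcodes):
    """Group reads by sample using an exact barcode match at the read start.

    reads: iterable of sequence strings
    barcodes: dict mapping barcode sequence -> sample name (hypothetical)
    """
    by_sample = defaultdict(list)
    for read in reads:
        for barcode, sample in barcodes.items():
            if read.startswith(barcode):
                # Strip the barcode so only the amplicon sequence is kept
                by_sample[sample].append(read[len(barcode):])
                break
        else:
            # No barcode matched this read
            by_sample["unassigned"].append(read)
    return dict(by_sample)


# Made-up barcodes and reads, for illustration only
barcodes = {"ACGT": "sample1", "TGCA": "sample2"}
reads = ["ACGTTTTT", "TGCAGGGG", "ACGTCCCC", "GGGGAAAA"]
assigned = demultiplex(reads, barcodes)
```

Real pipelines usually tolerate one or two barcode mismatches and may read the barcode from a separate index read; neither is modelled here.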

Sample1.read1.fastq: ~200,000 sequences × >1000 samples

@M00267:65:000000000-A5GLT:1:1101:16947:1503 1:N:0:27
TACGTAGGTCCCGAGCGTTGTCCGGATTTATTGGGCGTACAGCGCGCGCAGGCGGTTTTTTTTGTCTGATGTTACCGCCTCTGGCTTTACCTTTGTACTCCTTCTTATCCTTTTTTCCTTTCTTCCTTTCGGGTGACTTGTACTTCCTGTT
+
>>>>>FFAFDFF1EFGEEGCF1GGG0EEH1AFF1/AEE///A1///A////////>>/>?>///B>>1121BF221//<//111??F01?11<11<111111?1111111<1111-00==0000000000.-.;-.00:0;0:;0::00:0
@M00267:65:000000000-A5GLT:1:1101:13913:1503 1:N:0:27
TACGTAGGTCCCGAGCGTTGTCCGGCTTTATTGGGCGTACAGCGTGCTCTGTCGGTTTTTTTTTTCTTCTTTTTTTTGCTTTTTCTTTCCCTTTTTCTTCTTTTTTTTCTTTTTTTCTTTTCTTCCTTTCTGTTTCCTTTTCTTCCCTTTT
+
>>>>AFFBFFFFGCFGGEGEGFHDGCAEH5DG53222A00153011115535510>B111>///>3B4444?B3////222B22BBB22<22?22?22222??11//-011>=<--/0=000=00000000000000;00:00;00/:0;0
@M00267:65:000000000-A5GLT:1:1101:17490:1510 1:N:0:27
TACGTAGGTCCCGAGCGTTGTCCGGCTTTCTTGGGCGTCTCGCTCGCGCTGTCGTTTTTTTTCTTCTTTTGTTTCCTCCTTTTTCTTTTCCTTTTTTCTCTTTTTTCTCTTTTTTTCTTTTCTTCTTTTCTTTTTTCTTGTTTTTCCTTTT
+
>3>>>C?AFFFFGCEEGEGGGBFAE2AAG5BD53222A01231110000001311B?11/>/34B4B4443BB04443303433BBF44?31?21?/01222?11<01222?<1/-01>111<111111111100-/=00=0/<./000;0
@M00267:65:000000000-A5GLT:1:1101:17477:1523 1:N:0:27
TACGTAGGTCCCGAGCGTTGTCCGGATTTATTGGGCGTCACGCGCGCGCTGGCGGTTTTTTTTTTCTTCTGTTTTCTGCTCTTTCTTCTCCTTTTTTCTCTTTTTTTTCTTTTTTTCTTTTCTTCCTTTCTCTTTCCTTGTCTTTCCTTTT
+
>>>>ACFBFFFFGFGG?EFEGFHDE?AEH5DG53AAAA0111100000////1//>>///<///>2?2222@@22222222>21?<F11?11?1<1-01111>11<--<00;;0--:0;000900000000000000;00;00;09090;0
@M00267:65:000000000-A5GLT:1:1101:13603:1530 1:N:0:27
TACGTAGGTCCCGAGCGTTGTCCGGCTTTATTGTGCGTAAAGCGATCTCTTTCGTTTTTTTTTTTCTTCTTTTTTCTTCTTTTTCTTTACCCTTTTTTTTTTTTTTTTCTTTTTTTCTTTTGTTCTCTTCTTTTTTCTTTTCTTTCCTTTT
+
>>>>>FFAFFBFEGCEAEEEFDGG22AEHDFG55552B21155111115555511BA11>>///>3B4444BF3/344444443B?B44?33?0111/////>--<--/00;<0--/0;900;/00000000090-/900;00;00000;0
@M00267:65:000000000-A5GLT:1:1101:13615:1542 1:N:0:27
TACGTAGGTCCCGAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGTCTGTTTGTTTTTTCTTCTGTTTACTGCTTTTTCTTTACCCTTTTTTTCTTTTTTATCTTTTTTTCTTTTCTTCTTTTGTCGTTAGTTTTCTTCCCTTTT
+
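Each FASTQ record has four lines: a header starting with `@`, the sequence, a `+` separator, and one quality character per base. The sketch below parses that layout and decodes qualities, assuming the Phred+33 encoding used by current Illumina instruments; the toy record is made up and far shorter than a real read.

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    header: str
    sequence: str
    qualities: List[int]  # Phred scores, one per base


def parse_fastq(text: str) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text laid out as four lines per read."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        # Assumption: Phred+33, i.e. score = ASCII code - 33
        yield FastqRecord(header[1:], seq, [ord(c) - 33 for c in qual])


# Toy record, made up for illustration
toy = "@read1\nTACGTAGG\n+\n>>>>>FFA\n"
records = list(parse_fastq(toy))
mean_quality = sum(records[0].qualities) / len(records[0].qualities)
```

A Phred score Q corresponds to an error probability of 10^(-Q/10), so the long runs of low-scoring characters towards the end of the reads above mark bases that are much less reliable than those at the start.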


Merge reads (software: FLASH)

• Illumina MiSeq: 2 × (150–250) bp, paired end; 1 run > 16M reads
• Roche 454 / Ion Torrent: ~400–1000 bp, single end; 1 run > 160K reads
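Paired-end reads come from opposite ends of the same fragment, so when the fragment is shorter than the two reads combined, their ends overlap and can be stitched together. The sketch below shows the idea with an exact-match overlap search. It is not FLASH's algorithm, which scores overlaps while tolerating mismatches; the function names and the toy fragment are made up.

```python
from typing import Optional


def reverse_complement(seq: str) -> str:
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def merge_pair(read1: str, read2: str, min_overlap: int = 10) -> Optional[str]:
    """Merge a read pair using the longest exact suffix/prefix overlap.

    read2 is reverse-complemented first so both reads are on the same strand.
    Returns None when no overlap of at least min_overlap bases is found.
    """
    r2 = reverse_complement(read2)
    for k in range(min(len(read1), len(r2)), min_overlap - 1, -1):
        if read1[-k:] == r2[:k]:
            return read1 + r2[k:]
    return None


# Made-up 30 bp fragment sequenced with two 20 bp reads (10 bp overlap)
fragment = "ACGTTGCAAGGCTTAACCGGATCGATTGCA"
read1 = fragment[:20]
read2 = reverse_complement(fragment[-20:])
merged = merge_pair(read1, read2)
```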


Cleaning the data

Homopolymer error correction (software: Acacia)

Balzer S et al. Bioinformatics 2010;26:i420-i425

Merge reads (software: FLASH)

Stitch/merge the reads together

Workflow:

• Homopolymer error correction
• Merge reads
• Remove chimeras
• Cluster reads (sequence similarity)
• Pick one sequence per cluster
• Align seqs
• Assign taxonomy
• Build table
• Build tree
• Alpha and beta diversity
• Taxon-based association analysis

Databases: Greengenes, SILVA, RDP

Operational Taxonomic Units (OTUs)

• A cluster of highly similar sequences is termed an OTU.
• Typically, cluster all the sequences at a predefined similarity threshold (97%).
  • Within a cluster (OTU), sequences are >=97% similar.
  • Between OTUs, the sequences are <97% similar.

• Computationally demanding for large datasets (>200 million sequences)
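One common way to form such clusters is greedy: the first sequence seeds an OTU, and each later sequence joins the first OTU whose seed it matches at the threshold, or else seeds a new one. The sketch below assumes equal-length, pre-aligned sequences and uses a simple per-position identity; real tools align sequences and use heuristics for speed. All sequences are made up.

```python
from typing import List


def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch only compares equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(seqs: List[str], threshold: float = 0.97) -> List[List[str]]:
    """Greedy clustering: the first member of each OTU acts as its seed."""
    otus: List[List[str]] = []
    for s in seqs:
        for otu in otus:
            if identity(s, otu[0]) >= threshold:
                otu.append(s)
                break
        else:
            otus.append([s])  # no seed was close enough: start a new OTU
    return otus


# Made-up 100 bp sequences: `near` differs from the seed at 2 positions (98%),
# `far` differs at 8 positions (92%)
seed = "ACGT" * 25
near = "TT" + seed[2:]
far = "T" * 10 + seed[10:]
otus = greedy_otus([seed, near, far])
```

In this greedy scheme every member is at least 97% similar to its seed, but two members of the same OTU are not guaranteed to be 97% similar to each other. Every sequence is compared against the seeds found so far, which is why the cost grows quickly with dataset size.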


Closed-reference OTU picking
• Greengenes 99% OTUs
• Highly parallelizable
• Sequences that do not match the reference are discarded
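In closed-reference picking each read is compared only against a fixed reference set, so reads can be processed independently of one another, which is what makes the approach easy to parallelise. The sketch below is a toy version with a made-up three-entry reference and a naive per-position identity; it is not how any particular tool searches the Greengenes database.

```python
from collections import Counter


def identity(a: str, b: str) -> float:
    """Fraction of matching positions (naive, no alignment)."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))


def closed_reference_pick(reads, reference, threshold=0.97):
    """Count reads per reference OTU; reads below the threshold are discarded."""
    counts = Counter()
    discarded = 0
    for read in reads:
        best_id, best_score = None, 0.0
        for otu_id, ref_seq in reference.items():
            score = identity(read, ref_seq)
            if score > best_score:
                best_id, best_score = otu_id, score
        if best_id is not None and best_score >= threshold:
            counts[best_id] += 1
        else:
            discarded += 1
    return counts, discarded


# Hypothetical reference and reads, for illustration only
reference = {"OTU1": "A" * 40, "OTU2": "C" * 40, "OTU3": "G" * 40}
reads = ["A" * 40, "A" * 39 + "T", "C" * 40, "T" * 40]
counts, discarded = closed_reference_pick(reads, reference)
```

The last read matches nothing in the reference and is dropped, which illustrates the main cost of the approach: organisms missing from the database are invisible in the results.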

[Figure: each sequence is compared against the Greengenes database and assigned to a reference OTU (OTU1, OTU2, OTU3).]

Example output – 1 sample

• Note that different OTUs can have the same taxonomy

• Can also summarize in terms of taxonomy, for example at the genus level

Summarized table

Phylum          Sample1   Sample2   Sample3
Acidobacteria         0        22         0
Firmicutes       291777       728        21
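Since several OTUs can share a taxonomy, a summarized table is obtained by adding up the counts of all OTUs with the same label, sample by sample. A minimal sketch, using a made-up OTU table rather than the numbers above:

```python
# Hypothetical OTU table: OTU id -> (phylum, read counts per sample)
otu_table = {
    "OTU1": ("Firmicutes", [10, 5, 0]),
    "OTU2": ("Firmicutes", [3, 0, 7]),
    "OTU3": ("Acidobacteria", [0, 2, 0]),
}


def summarise_by_taxon(table):
    """Sum per-sample counts over all OTUs that share a taxonomic label."""
    summary = {}
    for taxon, counts in table.values():
        totals = summary.setdefault(taxon, [0] * len(counts))
        for i, c in enumerate(counts):
            totals[i] += c
    return summary


phylum_table = summarise_by_taxon(otu_table)
```

The same function works at any rank (genus, family, and so on) as long as the label stored with each OTU is taken from that rank.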

Alpha diversity
• Which sample has the most compositionally diverse microbiome?

• Rarefaction – subsample an equal number of reads from each sample
• Qualitative measure – presence/absence
• Quantitative measure – considers relative abundance

[Figure: two example communities, A and B]

Qualitative – A and B equally diverse
Quantitative – B is more diverse
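Samples sequenced more deeply tend to show more OTUs simply because more reads were drawn, so diversity is usually compared after rarefying every sample to the same depth. The sketch below subsamples reads without replacement and then counts observed OTUs as a qualitative measure; the counts and the fixed random seed are made up for illustration.

```python
import random
from collections import Counter
from typing import Dict


def rarefy(counts: Dict[str, int], depth: int, seed: int = 0) -> Counter:
    """Subsample `depth` reads without replacement from one sample's OTU counts."""
    pool = [otu for otu, n in counts.items() for _ in range(n)]
    if depth > len(pool):
        raise ValueError("sample has fewer reads than the rarefaction depth")
    return Counter(random.Random(seed).sample(pool, depth))


def richness(counts) -> int:
    """Qualitative alpha diversity: number of OTUs observed at least once."""
    return sum(1 for n in counts.values() if n > 0)


# Hypothetical samples with the same composition but different depths
deep = {"OTU1": 500, "OTU2": 300, "OTU3": 200}
shallow = {"OTU1": 50, "OTU2": 30, "OTU3": 20}

deep_rarefied = rarefy(deep, depth=100)
shallow_rarefied = rarefy(shallow, depth=100)
```

Rarefying the shallow sample to its own depth returns it unchanged, while the deep sample is reduced to 100 reads, so the two can be compared on equal footing.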

Shannon's diversity index

H = -Σ p_i ln(p_i), summed over i = 1, …, N

N: total number of species/OTUs in the community (richness)
p_i: proportion of the community made up of species i (its relative abundance)

[Figure: plot of -p*ln(p) against p, for p from 0.0 to 0.5]
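The Shannon index is H = -Σ p_i ln(p_i), where p_i is the relative abundance of OTU i. The sketch below computes it from raw counts for two made-up communities with the same number of OTUs but different evenness.

```python
import math


def shannon(counts):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over OTUs with non-zero counts."""
    total = sum(counts)
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in proportions)


# Hypothetical communities, each with four OTUs
community_a = [97, 1, 1, 1]      # dominated by a single OTU
community_b = [25, 25, 25, 25]   # perfectly even

h_a = shannon(community_a)
h_b = shannon(community_b)
```

Both communities have the same richness, but the even one has the higher index; for N equally abundant OTUs the index reaches its maximum, ln(N).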

Beta diversity
• “Between sample” diversity – how similar/dissimilar are two samples
• UniFrac distance = fraction of the total branch length that is unique to one community

[Figure: example trees with D = 1 and D = 0.5]

Lozupone and Knight 2005
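The unweighted UniFrac calculation can be sketched by listing every branch of the tree with its length and the set of tips below it. A branch is shared if both communities have a tip below it and unique if only one does. The tree, tip names and communities below are made up, and the representation is chosen for brevity rather than taken from any UniFrac implementation.

```python
def unweighted_unifrac(branches, community_a, community_b):
    """Fraction of branch length leading to tips from only one community.

    branches: list of (length, set of tip names below that branch)
    community_a, community_b: sets of tip names present in each sample
    """
    unique = shared = 0.0
    for length, tips in branches:
        in_a = not tips.isdisjoint(community_a)
        in_b = not tips.isdisjoint(community_b)
        if in_a and in_b:
            shared += length
        elif in_a or in_b:
            unique += length
        # branches leading to neither community are ignored
    return unique / (unique + shared)


# Hypothetical tree ((t1:1,t2:1):1,(t3:1,t4:1):1) written as a branch list
toy_tree = [
    (1.0, {"t1"}), (1.0, {"t2"}), (1.0, {"t1", "t2"}),
    (1.0, {"t3"}), (1.0, {"t4"}), (1.0, {"t3", "t4"}),
]

d_disjoint = unweighted_unifrac(toy_tree, {"t1", "t2"}, {"t3", "t4"})
d_nested = unweighted_unifrac(toy_tree, {"t1", "t2"}, {"t1", "t2", "t3", "t4"})
```

Two communities on separate halves of the tree share no branch length and get D = 1; when one community contains the other, half of the branch length here is shared and D = 0.5.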

Beta diversity

• Principal coordinates analysis of UniFrac distance matrix
• Which factors correlate with differences in microbiota composition?

Costello et al. (2009), Science 326:1694

Concepts recap

• What kind of animals are there? (Taxonomy)

• How many kinds of animals can I find where I hunt? (Alpha diversity)

• How different is one place from another? (Beta diversity)

Metabolism

Map of human metabolic pathways (it’s complicated)

Metabolites

3-methoxytyramine

Metabolomics

University of Sydney

Metabolomics data

• Right skewed
• All the usual technical effects

Metabolomics normalisation
• No internal standard(s)
  • Divide by the total sum of metabolite abundances in each sample
  • Divide by the median

• For each metabolite, subtract (log) abundance of internal standard
• Multiple internal standards
  • Selection of internal standard (across all metabolites or per metabolite)
  • Model variation in internal standards and remove it from each metabolite

metabolite_j = mean_j + standards × coeffs + error

Subtract ’unwanted’ variation from original metabolite levels

De Livera et al, Anal Chem 2012
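The normalisation options above can be sketched in a few lines. The first three functions scale or shift one sample at a time. The last one fits metabolite_j = mean_j + standard × coeff + error across samples by least squares and subtracts the fitted standard term. It uses a single internal standard and is a simplified illustration, not the published method; all data are made up.

```python
import math
from statistics import mean, median


def total_sum_normalise(row):
    """Divide each abundance by the sample's total abundance."""
    total = sum(row)
    return [x / total for x in row]


def median_normalise(row):
    """Divide each abundance by the sample's median abundance."""
    m = median(row)
    return [x / m for x in row]


def subtract_internal_standard(log_row, log_standard):
    """Subtract the (log) abundance of the internal standard in the same sample."""
    return [v - log_standard for v in log_row]


def remove_standard_variation(log_metabolite, log_standard):
    """Regress one metabolite on one centred internal standard across samples
    and subtract the fitted standard term, keeping the metabolite's mean."""
    m_bar, s_bar = mean(log_metabolite), mean(log_standard)
    centred = [s - s_bar for s in log_standard]
    coeff = (sum(c * (m - m_bar) for c, m in zip(centred, log_metabolite))
             / sum(c * c for c in centred))
    return [m - coeff * c for m, c in zip(log_metabolite, centred)]


# Hypothetical samples: sample2 is sample1 at twice the overall concentration
sample1 = [100.0, 50.0, 10.0]
sample2 = [200.0, 100.0, 20.0]

# Hypothetical log abundances over four samples in which the metabolite
# follows the internal standard exactly (pure unwanted variation)
log_standard = [1.0, 2.0, 3.0, 4.0]
log_metabolite = [7.0, 9.0, 11.0, 13.0]
corrected = remove_standard_variation(log_metabolite, log_standard)
```

After total-sum or median scaling the two samples become identical, and the regression step flattens the metabolite to its mean because all of its variation was explained by the standard.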

Metabolomics normalisation
• Usual approaches result in one normalised value for each metabolite, for each sample (‘global’ normalisation)

• These values can then be used in downstream analysis
  • PCA/clustering
  • Association analysis (the standard approaches apply)
  • Classification

• What if we combined normalisation and association analysis?
  • Thus minimising the chance that (biological) variation of interest is removed

De Livera et al, Anal Chem 2012

Modeling and removing ’unwanted’ variation

• The 2-step Remove Unwanted Variation approach (RUV2)

• Use of non-changing metabolites as internal standards
  • These metabolites should be present in the sample, exposed to unwanted variation, and not associated with the (biological) factors of interest

(non-changing) metabolite_j = (unwanted) variance component × coeffs + error

Number of non-changing metabolites. Which to choose?

• Selecting the number of unwanted factors (e.g. non-changing metabolites)

• Metabolites known a priori not to change
• Metabolites derived from QC samples (e.g. replicates)
• Spike-in metabolites
• PCA

(unchanging) metabolite_j = (unwanted) variance component × coeffs + error

(changing) metabolite_j = factors of interest × coeffs + (unwanted) variance component × coeffs + error

Can also adjust for additional covariates
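The two steps can be sketched with NumPy: first estimate the unwanted variance component from the non-changing metabolites, then fit every metabolite on the factors of interest together with that component. Taking the top k singular vectors of the centred control metabolites as the unwanted component is an assumption of this sketch; the function name, the simulated data and the choice k = 1 are all made up for illustration.

```python
import numpy as np


def ruv2(Y, X, control_idx, k):
    """Two-step sketch of removing unwanted variation.

    Y: samples x metabolites (log scale); X: samples x factors of interest;
    control_idx: columns of Y assumed not to change; k: number of unwanted factors.
    """
    # Step 1: estimate unwanted factors from the non-changing metabolites
    Yc = Y[:, control_idx]
    Yc = Yc - Yc.mean(axis=0)
    U, s, _ = np.linalg.svd(Yc, full_matrices=False)
    W = U[:, :k] * s[:k]                       # samples x k
    # Step 2: regress every metabolite on [intercept, X, W]
    design = np.column_stack([np.ones(Y.shape[0]), X, W])
    coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
    n_interest = X.shape[1]
    return coef[1:1 + n_interest], W           # coefficients for factors of interest


# Simulated data: 40 samples, one factor of interest, one unwanted factor,
# metabolite 0 changes with the factor and metabolites 1-5 do not
rng = np.random.default_rng(0)
n = 40
x = np.repeat([0.0, 1.0], n // 2).reshape(-1, 1)   # e.g. case/control status
w = rng.normal(size=(n, 1))                         # e.g. batch drift
alpha = rng.uniform(0.5, 1.5, size=(1, 6))
Y = w @ alpha + rng.normal(scale=0.05, size=(n, 6))
Y[:, 0] += 1.5 * x[:, 0]                            # true effect of 1.5

beta, W = ruv2(Y, x, control_idx=[1, 2, 3, 4, 5], k=1)
```

Additional covariates can be handled by adding columns to the design matrix next to X. Choosing k and the set of control metabolites remains the hard part in practice, as the list above indicates.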


Metabolomics pathway analysis
• Similar to gene set-based enrichment approaches
• MetPA (now integrated into MetaboAnalyst)
  • KEGG database
  • Fisher exact test
  • Hypergeometric
  • GSEA
  • Network centrality approaches
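The hypergeometric test (equivalent to a one-sided Fisher exact test) asks how surprising it is to see at least a given number of pathway members among the selected metabolites. A minimal sketch using only the standard library, with made-up numbers:

```python
from math import comb


def hypergeom_enrichment_p(total, in_pathway, selected, overlap):
    """P(X >= overlap) when `selected` metabolites are drawn without replacement
    from `total`, of which `in_pathway` belong to the pathway."""
    upper = min(in_pathway, selected)
    tail = sum(
        comb(in_pathway, k) * comb(total - in_pathway, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / comb(total, selected)


# Hypothetical example: 20 measured metabolites, 5 in the pathway,
# 5 associated with the phenotype, 3 of which are in the pathway
p_value = hypergeom_enrichment_p(total=20, in_pathway=5, selected=5, overlap=3)
```

Here the p-value is 1126/15504, about 0.073. With many pathways tested, the resulting p-values would normally be corrected for multiple testing.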

Metabolic ‘potential’ of microbial communities
• Infer what metabolites (and levels thereof) are associated with sequences from any microbial community

• HUMAnN (Abubucker et al, PLoS Comput Biol 2012)
  • Input: metagenomic sequences
  • Output: estimates of gene and pathway abundances

HUMAnN

• Input QC’ed (non-human) metagenomic sequences

• BLAST against a protein sequence database (e.g. KEGG)

• Estimate gene family abundances, normalise by gene family sequence length

• Assign genes to pathways (e.g. using MinPath)
• Use inferred microbial taxa to normalise for gene copy number and remove unlikely pathways

• ‘Fill in’ abundant pathways which may be missing a few genes

• Assign each pathway scores for presence/absence and for abundance
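Two of these steps, length normalisation and pathway scoring, can be illustrated with a toy calculation. This is not HUMAnN's implementation: the gene family identifiers, hit counts and lengths are hypothetical, and scoring a pathway by the fraction of its gene families observed and by their mean abundance is a simplification chosen for illustration.

```python
def gene_family_abundance(hit_counts, gene_lengths):
    """Hits per gene family divided by gene family sequence length."""
    return {g: hit_counts[g] / gene_lengths[g] for g in hit_counts}


def pathway_scores(abundance, pathways):
    """Return {pathway: (coverage, abundance)}.

    coverage: fraction of the pathway's gene families with non-zero abundance
    abundance: mean abundance over all gene families in the pathway
    """
    scores = {}
    for name, genes in pathways.items():
        values = [abundance.get(g, 0.0) for g in genes]
        coverage = sum(v > 0 for v in values) / len(genes)
        scores[name] = (coverage, sum(values) / len(genes))
    return scores


# Hypothetical gene families (K1-K3) and pathways (P1, P2)
hits = {"K1": 300, "K2": 100}
lengths = {"K1": 1500, "K2": 500, "K3": 900}
pathways = {"P1": ["K1", "K2"], "P2": ["K2", "K3"]}

abundance = gene_family_abundance(hits, lengths)
scores = pathway_scores(abundance, pathways)
```

Without length normalisation K1 would look three times as abundant as K2 only because it is a longer target for reads; after normalisation the two are equal.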
