Microbiomes and metabolomes
Michael Inouye, Baker Heart and Diabetes Institute, Univ of Melbourne / Monash Univ
Summer Institute in Statistical Genetics 2017, Integrative Genomics Module
Seattle
@minouye271 www.inouyelab.org
Interactions between microbes and metabolites
Phelan et al, Nat Chem Biol 2011
Metabolites are…
• Nutrients
• Signals between cells (microbe-microbe, microbe-host)
• Control of multicellular/community behaviour
Human microbiota and metabolism
Le Chatelier et al, Nature 2013
LGC – Low bacterial gene count ('richness')
Koropatkin et al, 2013
Roseburia, Eubacterium, Clostridium, Ruminococcus, Bifidobacterium
SCFA – Short chain fatty acids
Background (microbiome)
• Culture
  • Looking only at a few candidate species
  • Need prior hypothesis
  • Akin to candidate gene approach
• Sequencing
  • >99% of microbes cannot be cultured
  • Hypothesis free/unbiased approach
  • Akin to GWAS
Courtesy of Shu Mei Teo
What bacteria?
• Sequence all the genes
• Sequence a marker gene
  • Cost efficient
  • Depends on research question
  • 16S rRNA gene
http://www.alimetrics.net/en/index.php/dna-sequence-analysis
Why 16S rRNA gene?
• Present in all bacteria
• Conserved regions → primer design
• Variable regions → taxonomic assignment
• Large databases
Sample preparation for multiplex sequencing
• DNA extraction
  • Bead beating step for lysis
• PCR
  • Reaction mix: forward primer, PCR mix
  • Well-specific mix: sample, sample-specific tagged reverse primer
[Figure: 96-well plate, wells 1, 2, 3, …, 96]
Sample preparation
• DNA extraction
• PCR
• Purify the sample
• Quantify purified sample
• Dilute samples so that each sample has the same concentration
• Pool equal amounts of DNA
• Final pool for sequencing
[Figure: fusion primer design for the 16S V4 amplicon]
• Forward fusion primer × 1: sequencing primer, target primer F (PCR amplification)
• Reverse fusion primers × n: sequencing primer, unique barcode, target primer R (PCR amplification)
Sample1.read1.fastq
~200,000 sequences × >1000 samples
@M00267:65:000000000-A5GLT:1:1101:16947:1503 1:N:0:27
TACGTAGGTCCCGAGCGTTGTCCGGATTTATTGGGCGTACAGCGCGCGCAGGCGGTTTTTTTTGTCTGATGTTACCGCCTCTGGCTTTACCTTTGTACTCCTTCTTATCCTTTTTTCCTTTCTTCCTTTCGGGTGACTTGTACTTCCTGTT
+
>>>>>FFAFDFF1EFGEEGCF1GGG0EEH1AFF1/AEE///A1///A////////>>/>?>///B>>1121BF221//<//111??F01?11<11<111111?1111111<1111-00==0000000000.-.;-.00:0;0:;0::00:0
@M00267:65:000000000-A5GLT:1:1101:13913:1503 1:N:0:27
TACGTAGGTCCCGAGCGTTGTCCGGCTTTATTGGGCGTACAGCGTGCTCTGTCGGTTTTTTTTTTCTTCTTTTTTTTGCTTTTTCTTTCCCTTTTTCTTCTTTTTTTTCTTTTTTTCTTTTCTTCCTTTCTGTTTCCTTTTCTTCCCTTTT
+
>>>>AFFBFFFFGCFGGEGEGFHDGCAEH5DG53222A00153011115535510>B111>///>3B4444?B3////222B22BBB22<22?22?22222??11//-011>=<--/0=000=00000000000000;00:00;00/:0;0
@M00267:65:000000000-A5GLT:1:1101:17490:1510 1:N:0:27
TACGTAGGTCCCGAGCGTTGTCCGGCTTTCTTGGGCGTCTCGCTCGCGCTGTCGTTTTTTTTCTTCTTTTGTTTCCTCCTTTTTCTTTTCCTTTTTTCTCTTTTTTCTCTTTTTTTCTTTTCTTCTTTTCTTTTTTCTTGTTTTTCCTTTT
+
>3>>>C?AFFFFGCEEGEGGGBFAE2AAG5BD53222A01231110000001311B?11/>/34B4B4443BB04443303433BBF44?31?21?/01222?11<01222?<1/-01>111<111111111100-/=00=0/<./000;0
@M00267:65:000000000-A5GLT:1:1101:17477:1523 1:N:0:27
TACGTAGGTCCCGAGCGTTGTCCGGATTTATTGGGCGTCACGCGCGCGCTGGCGGTTTTTTTTTTCTTCTGTTTTCTGCTCTTTCTTCTCCTTTTTTCTCTTTTTTTTCTTTTTTTCTTTTCTTCCTTTCTCTTTCCTTGTCTTTCCTTTT
+
>>>>ACFBFFFFGFGG?EFEGFHDE?AEH5DG53AAAA0111100000////1//>>///<///>2?2222@@22222222>21?<F11?11?1<1-01111>11<--<00;;0--:0;000900000000000000;00;00;09090;0
@M00267:65:000000000-A5GLT:1:1101:13603:1530 1:N:0:27
TACGTAGGTCCCGAGCGTTGTCCGGCTTTATTGTGCGTAAAGCGATCTCTTTCGTTTTTTTTTTTCTTCTTTTTTCTTCTTTTTCTTTACCCTTTTTTTTTTTTTTTTCTTTTTTTCTTTTGTTCTCTTCTTTTTTCTTTTCTTTCCTTTT
+
>>>>>FFAFFBFEGCEAEEEFDGG22AEHDFG55552B21155111115555511BA11>>///>3B4444BF3/344444443B?B44?33?0111/////>--<--/00;<0--/0;900;/00000000090-/900;00;00000;0
@M00267:65:000000000-A5GLT:1:1101:13615:1542 1:N:0:27
TACGTAGGTCCCGAGCGTTGTCCGGATTTATTGGGCGTAAAGCGAGCGCAGTCTGTTTGTTTTTTCTTCTGTTTACTGCTTTTTCTTTACCCTTTTTTTCTTTTTTATCTTTTTTTCTTTTCTTCTTTTGTCGTTAGTTTTCTTCCCTTTT
+
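A FASTQ file stores each read as four lines: an identifier starting with "@", the base calls, a "+" separator, and one quality character per base. The sketch below is a minimal reader, assuming Phred+33 quality encoding (an assumption, not stated on the slide); the names `FastqRecord` and `parse_fastq` are illustrative only.

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    header: str            # read identifier, without the leading '@'
    sequence: str          # base calls
    qualities: List[int]   # Phred scores, one per base


def parse_fastq(lines: List[str]) -> Iterator[FastqRecord]:
    """Yield records from 4-line FASTQ text (header, sequence, '+', quality).

    A trailing incomplete record (fewer than 4 lines) is silently skipped.
    """
    stripped = [ln.rstrip("\n") for ln in lines if ln.strip()]
    for i in range(0, len(stripped) - 3, 4):
        header, seq, plus, qual = stripped[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        # Assumption: Phred+33 encoding, so quality = ASCII code - 33.
        yield FastqRecord(header[1:], seq, [ord(c) - 33 for c in qual])


def mean_quality(record: FastqRecord) -> float:
    """Average Phred score of a read (a crude per-read quality summary)."""
    return sum(record.qualities) / len(record.qualities)
```

Reads with a low mean quality could then be filtered out before merging and clustering.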
Merge reads (software: FLASH)
• Illumina MiSeq: 2 × (150–250) bp, paired end; 1 run > 16M reads
• Roche 454 / Ion Torrent: ~400–1000 bp, single end; 1 run > 160K reads
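Merging paired-end reads means finding where the end of read 1 overlaps the reverse complement of read 2 and joining the two into a single sequence. The following is a toy illustration of that idea only; it is not FLASH's algorithm (FLASH scores candidate overlaps and uses base qualities), and the function names and default thresholds are assumptions.

```python
from typing import Optional

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(read1: str, read2: str, min_overlap: int = 4,
               max_mismatch_ratio: float = 0.1) -> Optional[str]:
    """Merge a read pair on the longest acceptable overlap, or return None.

    Toy rule (an assumption): try overlaps from longest to shortest and accept
    the first one whose mismatch fraction is at most max_mismatch_ratio.
    """
    rc2 = reverse_complement(read2)
    for k in range(min(len(read1), len(rc2)), min_overlap - 1, -1):
        mismatches = sum(a != b for a, b in zip(read1[-k:], rc2[:k]))
        if mismatches <= max_mismatch_ratio * k:
            # Keep read 1 in full, then append the non-overlapping part of read 2.
            return read1 + rc2[k:]
    return None
```

For example, `merge_pair("ACGTACGTAA", "AACCGGTTAC")` joins the two reads on their 4-base overlap `GTAA`.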
Cleaning the data
• Homopolymer error correction: Acacia
Balzer S et al. Bioinformatics 2010;26:i420-i425
• Merge reads: stitch/merge the reads together (FLASH)
• Remove chimeras
• Cluster reads (sequence similarity)
• Pick one sequence per cluster
• Align seqs
• Assign taxonomy
• Build table
• Build tree
• Alpha and beta diversity
• Taxon-based association analysis
Databases: Greengenes, SILVA, RDP
Operational Taxonomic Units (OTUs)
• A cluster of highly similar sequences is termed an OTU
• Typically, cluster all the sequences at a predefined similarity threshold (97%)
  • Within a cluster (OTU), sequences are ≥97% similar
  • Between OTUs, the sequences are <97% similar
• Computationally demanding for large datasets (>200 million sequences)
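One common way to approximate threshold-based OTU clustering is greedy centroid clustering: each sequence joins the first cluster whose seed it matches at or above the threshold, otherwise it seeds a new cluster. The sketch below assumes equal-length, pre-aligned sequences and uses a simple per-position identity; real tools use proper alignment, and the slides do not say which algorithm was used. Note that greedy clustering only guarantees similarity to the seed, not between every pair of members.

```python
from typing import List


def identity(a: str, b: str) -> float:
    """Fraction of matching positions (assumes equal-length, aligned sequences)."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length in this toy example")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otu_clustering(seqs: List[str], threshold: float = 0.97) -> List[List[str]]:
    """Cluster sequences greedily; the first member of each cluster is its seed."""
    clusters: List[List[str]] = []
    for seq in seqs:
        for cluster in clusters:
            if identity(seq, cluster[0]) >= threshold:
                cluster.append(seq)
                break
        else:
            # No existing seed is similar enough: start a new OTU.
            clusters.append([seq])
    return clusters
```

Each sequence is compared against every seed, which hints at why the slide calls this computationally demanding for very large datasets.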
Closed reference OTU picking
• Greengenes 99% OTUs
• Highly parallelizable
• Sequences that do not match the reference are discarded
[Figure: each sequence is compared against the Greengenes database (OTU1, OTU2, OTU3)]
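Closed-reference picking can be sketched as follows: every read is compared with each reference OTU sequence, counted towards its best match if that match reaches the threshold, and otherwise dropped. This is a toy version under strong assumptions (equal-length sequences, per-position identity, an in-memory dictionary standing in for the reference database); the names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def identity(a: str, b: str) -> float:
    """Fraction of matching positions (assumes equal-length, aligned sequences)."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length in this toy example")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def closed_reference_pick(reads: List[str], reference: Dict[str, str],
                          threshold: float = 0.97) -> Tuple[Counter, int]:
    """Return (reads counted per reference OTU, number of discarded reads)."""
    counts: Counter = Counter()
    discarded = 0
    for read in reads:
        best_id, best_score = None, 0.0
        for otu_id, ref_seq in reference.items():
            score = identity(read, ref_seq)
            if score > best_score:
                best_id, best_score = otu_id, score
        if best_id is not None and best_score >= threshold:
            counts[best_id] += 1
        else:
            discarded += 1  # no reference match: the read is thrown away
    return counts, discarded
```

Because each read is processed independently of the others, the loop over reads can be split across workers, which is why the approach parallelises well.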
Example output – 1 sample
• Note different OTUs can have the same taxonomy
• Can also summarize in terms of taxonomy, for example at the genus level
Summarized table
Phylum          Sample1   Sample2   Sample3
Acidobacteria   0         22        0
Firmicutes      291777    728       21
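Collapsing an OTU table to a taxonomic level amounts to summing the counts of all OTUs that share the same label. A minimal sketch, with hypothetical OTU identifiers and a plain dictionary layout chosen for illustration:

```python
from collections import defaultdict
from typing import Dict


def summarize_by_taxon(otu_table: Dict[str, Dict[str, int]],
                       taxonomy: Dict[str, str]) -> Dict[str, Dict[str, int]]:
    """Sum per-sample counts over all OTUs that carry the same taxon label."""
    summary: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for otu_id, sample_counts in otu_table.items():
        # OTUs without a label are grouped under "Unassigned".
        taxon = taxonomy.get(otu_id, "Unassigned")
        for sample, count in sample_counts.items():
            summary[taxon][sample] += count
    return {taxon: dict(samples) for taxon, samples in summary.items()}


# Hypothetical input: two OTUs share the phylum Firmicutes.
otu_table = {
    "otu_a": {"Sample1": 10, "Sample2": 0},
    "otu_b": {"Sample1": 5, "Sample2": 7},
    "otu_c": {"Sample1": 0, "Sample2": 22},
}
taxonomy = {"otu_a": "Firmicutes", "otu_b": "Firmicutes", "otu_c": "Acidobacteria"}
phylum_table = summarize_by_taxon(otu_table, taxonomy)
```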
Alpha diversity
• Which sample has the most compositionally diverse microbiome?
• Rarefaction – subsample equal number of reads
• Qualitative measure – absence/presence
• Quantitative measure – considers relative abundance
[Figure: example communities A and B]
• Qualitative – A and B equally diverse
• Quantitative – B is more diverse
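Rarefaction can be sketched as drawing the same number of reads, without replacement, from every sample before comparing them. The example below also computes observed richness, a qualitative (presence/absence) measure. The function names and the fixed random seed are illustrative assumptions.

```python
import random
from collections import Counter
from typing import Dict


def rarefy(counts: Dict[str, int], depth: int, seed: int = 0) -> Dict[str, int]:
    """Subsample `depth` reads without replacement from per-OTU read counts."""
    pool = [otu for otu, n in counts.items() for _ in range(n)]
    if depth > len(pool):
        raise ValueError("sample has fewer reads than the rarefaction depth")
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    return dict(Counter(rng.sample(pool, depth)))


def observed_richness(counts: Dict[str, int]) -> int:
    """Qualitative alpha diversity: number of OTUs with at least one read."""
    return sum(1 for n in counts.values() if n > 0)
```

Samples with fewer reads than the chosen depth have to be dropped or the depth lowered, which is the usual trade-off when picking a rarefaction depth.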
Shannon's diversity index
H = −Σ_{i=1}^{N} p_i ln(p_i)
• N: total number of species/OTUs in the community (richness)
• p_i: proportion of the community made up of species i
[Plot: contribution of a single species, −p·ln(p), against p]
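Shannon's index can be computed directly from per-OTU counts by converting them to proportions and summing −p·ln(p) over the OTUs that are present. The two communities below are hypothetical: they have the same richness, but the evenly distributed one scores higher, mirroring the qualitative versus quantitative contrast above.

```python
import math
from typing import Sequence


def shannon_index(counts: Sequence[int]) -> float:
    """Shannon diversity H = -sum(p_i * ln(p_i)) over OTUs with non-zero counts."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one read")
    proportions = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in proportions)


# Hypothetical communities with equal richness (4 OTUs each).
community_a = [97, 1, 1, 1]      # dominated by one OTU
community_b = [25, 25, 25, 25]   # perfectly even

h_a = shannon_index(community_a)
h_b = shannon_index(community_b)
```

For a perfectly even community of N OTUs the index equals ln(N), its maximum for that richness.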
Beta diversity
• "Between sample" diversity – how similar/dissimilar are two samples
• UniFrac distance = fraction of the total branch length that is unique to one community
[Figure: example trees with D = 1 and D = 0.5]
Lozupone and Knight 2005
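The unweighted UniFrac definition above can be sketched by representing the tree as a list of branches, each with a length and the set of tips (OTUs) beneath it. A branch counts as unique if its tips occur in only one of the two communities. The tree encoding is an assumption made to keep the example short; real implementations work on a parsed phylogeny.

```python
from typing import List, Set, Tuple

Branch = Tuple[float, Set[str]]  # (branch length, tips descending from the branch)


def unweighted_unifrac(branches: List[Branch], sample_a: Set[str],
                       sample_b: Set[str]) -> float:
    """Fraction of observed branch length that leads to only one community."""
    unique = shared = 0.0
    for length, tips in branches:
        in_a = bool(tips & sample_a)
        in_b = bool(tips & sample_b)
        if in_a and in_b:
            shared += length
        elif in_a or in_b:
            unique += length
        # Branches leading to neither community are ignored.
    total = unique + shared
    if total == 0:
        raise ValueError("neither sample contains any tip of the tree")
    return unique / total


# Hypothetical tree: tips a and b share an internal branch; c is on its own branch.
tree = [(1.0, {"a"}), (1.0, {"b"}), (1.0, {"a", "b"}), (2.0, {"c"})]
```

With this tree, communities {a} and {b} share only the internal branch, so 2 of the 3 observed branch-length units are unique.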
Beta diversity
• Principal coordinates analysis of UniFrac distance matrix
• Which factors correlate with differences in microbiota composition?
Costello et al. (2009), Science 326:1694
Concepts recap
• What kind of animals are there? (Taxonomy)
• How many kinds of animals can I find where I hunt? (Alpha diversity)
• How different is one place from another? (Beta diversity)
Metabolism
Map of human metabolic pathways (it's complicated)
Metabolites
3-methoxytyramine
Metabolomics
University of Sydney
Metabolomics data
• Right skewed
• All the usual technical effects
Metabolomics normalisation
• No internal standard(s)
  • Divide by the total sum of metabolite abundances in each sample
  • Divide by the median
• For each metabolite, subtract (log) abundance of internal standard
• Multiple internal standards
  • Selection of internal standard (across all metabolites or per metabolite)
  • Model variation in internal standards and remove it from each metabolite
metabolite_j = mean_j + standards × coeffs + error
Subtract 'unwanted' variation from original metabolite levels
De Livera et al, Anal Chem 2012
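The simpler normalisations listed above reduce to a few lines each. The sketch below works on one sample at a time, stored as a dictionary of raw abundances; the metabolite names and the label "IS" for the internal standard are hypothetical.

```python
import math
from statistics import median
from typing import Dict


def total_sum_normalise(sample: Dict[str, float]) -> Dict[str, float]:
    """Divide each abundance by the sample's total abundance."""
    total = sum(sample.values())
    return {m: v / total for m, v in sample.items()}


def median_normalise(sample: Dict[str, float]) -> Dict[str, float]:
    """Divide each abundance by the sample's median abundance."""
    med = median(sample.values())
    return {m: v / med for m, v in sample.items()}


def internal_standard_normalise(sample: Dict[str, float],
                                standard: str) -> Dict[str, float]:
    """Subtract the log abundance of one internal standard from each metabolite."""
    log_standard = math.log(sample[standard])
    return {m: math.log(v) - log_standard
            for m, v in sample.items() if m != standard}


# Hypothetical sample with one internal standard labelled "IS".
sample = {"glucose": 200.0, "lactate": 100.0, "IS": 50.0}
```

Subtracting on the log scale is equivalent to dividing raw abundances by the internal standard, which also helps with the right skew noted earlier.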
Metabolomics normalisation
• Usual approaches result in one normalised value for each metabolite, for each sample ('global' normalisation)
• These values can then be used in downstream analysis
  • PCA/clustering
  • Association analysis (the standard approaches apply)
  • Classification
• What if we combined normalisation and association analysis?
  • Thus minimising the chance that (biological) variation of interest is removed
De Livera et al, Anal Chem 2012
Modeling and removing 'unwanted' variation
• The 2-step Remove Unwanted Variation approach (RUV2)
• Use of non-changing metabolites as internal standards
  • These metabolites should be present in the sample, exposed to unwanted variation, and not associated with the (biological) factors of interest
(non-changing) metabolite_j = (unwanted) variance component × coeffs + error
De Livera et al, Anal Chem 2012
Number of non-changing metabolites. Which to choose?
• Selecting the number of unwanted factors (e.g. non-changing metabolites)
  • Metabolites known a priori not to change
  • Metabolites derived from QC samples (e.g. replicates)
  • Spike-in metabolites
  • PCA
De Livera et al, Anal Chem 2012
Modeling and removing 'unwanted' variation
(unchanging) metabolite_j = (unwanted) variance component × coeffs + error
(changing) metabolite_j = factors of interest × coeffs + (unwanted) variance component × coeffs + error
• Factors of interest!
• Can also adjust for additional covariates
De Livera et al, Anal Chem 2012
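In matrix form, the two-step idea can be written roughly as below. The notation (Y, X, W, the subscript c for non-changing metabolites, and k for the number of unwanted factors) is introduced here for illustration and is not taken from the slides; estimating the unwanted factors by a truncated SVD is one common choice, stated as an assumption.

```latex
% Assumed notation: Y is the n x m matrix of log abundances (n samples,
% m metabolites), X the n x p factors of interest, W the n x k unobserved
% unwanted factors, and Y_c the columns of Y for non-changing metabolites.
\begin{align}
  Y   &= X\beta + W\alpha + \varepsilon
      && \text{(all metabolites)} \\
  Y_c &= W\alpha_c + \varepsilon_c
      && \text{(non-changing metabolites, } \beta_c = 0\text{)} \\
  \hat{W} &= \text{first } k \text{ left singular vectors of } Y_c \text{, suitably scaled}
      && \text{(step 1: estimate unwanted factors)} \\
  Y   &= X\beta + \hat{W}\alpha + \varepsilon
      && \text{(step 2: regress on } X \text{ and } \hat{W}\text{)}
\end{align}
```

Additional covariates would enter step 2 as extra columns alongside X.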
Metabolomics pathway analysis
• Similar to gene set-based enrichment approaches
• MetPA (now integrated into MetaboAnalyst)
  • KEGG database
  • Fisher exact test
  • Hypergeometric
  • GSEA
  • Network centrality approaches
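The hypergeometric test (equivalent to a one-sided Fisher exact test) asks how surprising it is to see at least a given number of pathway members among the selected metabolites. A minimal sketch using only the standard library; the parameter names are illustrative, and this is not MetaboAnalyst's code.

```python
from math import comb


def enrichment_p_value(total: int, in_pathway: int,
                       selected: int, overlap: int) -> float:
    """P(X >= overlap) for a hypergeometric draw.

    total      : number of metabolites measured (the background)
    in_pathway : background metabolites annotated to the pathway
    selected   : metabolites flagged as significant
    overlap    : significant metabolites that are in the pathway
    """
    upper = min(in_pathway, selected)
    tail = sum(comb(in_pathway, i) * comb(total - in_pathway, selected - i)
               for i in range(overlap, upper + 1))
    return tail / comb(total, selected)
```

For example, with 10 measured metabolites, 3 in the pathway, 4 selected and 2 overlapping, the tail probability is 70/210 = 1/3.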
Metabolic 'potential' of microbial communities
• Infer what metabolites (and levels thereof) are associated with sequences from any microbial community
• HUMAnN (Abubucker et al, PLOS Comp Bio 2012)
  • Input: Metagenomic sequences
  • Output: Estimates of gene and pathway abundances
HUMAnN
• Input QC'ed (non-human) metagenomic sequences
• BLAST against a protein sequence database (e.g. KEGG)
• Estimate gene family abundances, normalise by gene family sequence length
• Assign genes to pathways (e.g. using MinPath)
• Use inferred microbial taxa to normalise for gene copy number and remove unlikely pathways
• 'Fill in' abundant pathways which may be missing a few genes
• Assign each pathway scores for presence/absence and for abundance
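The length-normalisation step can be illustrated in isolation: longer gene families attract more hits simply because they are longer, so hit counts are divided by sequence length before converting to relative abundances. This is a simplified sketch, not HUMAnN's implementation (which also weights hits by alignment quality); the gene family names are placeholders.

```python
from typing import Dict


def length_normalised_abundance(hit_counts: Dict[str, int],
                                lengths_bp: Dict[str, int]) -> Dict[str, float]:
    """Relative gene family abundance after dividing hits by length (in kb)."""
    per_kb = {fam: hit_counts[fam] / (lengths_bp[fam] / 1000.0)
              for fam in hit_counts}
    total = sum(per_kb.values())
    if total == 0:
        raise ValueError("no hits to normalise")
    return {fam: value / total for fam, value in per_kb.items()}


# Hypothetical gene families with equal hit counts but different lengths.
hits = {"famA": 100, "famB": 100}
lengths = {"famA": 1000, "famB": 2000}
abundance = length_normalised_abundance(hits, lengths)
```

Here famA ends up with twice the relative abundance of famB, because the same number of hits was spread over half the sequence length.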