Making Order from Chaos: Using Metagenome Data as Traits in Individuals and as Markers in Entire Ecosystems

Andrew K. Benson

W.W. Marshall Distinguished Professor of Biotechnology

Director, Core for Applied Genomics and Ecology

Professor, Dept. of Food Science, University of Nebraska

http://www.cancer.gov/

Phylloplane

Rhizosphere

Surface Water

Ground Water

Food

We Live in a World That is Numerically Dominated by Microorganisms

Oceanic

Soil

Rumen

Gastrointestinal

Oral

Organisms in these microbiomes contribute significantly to characteristics of these ecosystems

Phylloplane

Rhizosphere

N2 Fixation

Disease Resistance

Obesity

Inflammatory Bowel Disease

Diabetes

Gastric and Colon cancers

Significant variation in complexity of microbiomes from different ecosystems

Most of our understanding of these microbial ecosystems has relied on culture-based approaches to cultivate, differentiate, and enumerate different species of microorganisms

Community composition: 16S Microbiome

Community genetic potential: Metagenomics and Metaproteomics

Community physiology: Metabolomics (nanoscale??)

Community dynamics: Microbiomics + FISH

Community interactions: Microbiomics + FISH

High-throughput DNA sequencing technologies combined with other “omics” now allow systematic analysis of complex microbial communities

Total genomic DNA

PCR amplify tag gene (16S rRNA) -> 454 Pyrosequencing -> Microbiome

Shotgun library -> 454 Pyrosequencing -> Metagenome

The 16S rRNA: the structural component of the Small Subunit and the most widely used molecular clock for bacteria

[Figure: 16S rRNA with variable regions V1-V2, V3, V4, V5, V6, V7, and V8 labeled]

Noller et al. 2001 Science

~54 recognized Phyla

High throughput fecal DNA extraction

1. Lyse bacteria by homogenization with glass beads
2. Centrifuge to remove debris
3. Attach gDNA to magnetic particles
4. Robotic extraction

gDNA from a sample

PCR amplify 16S rRNA gene

Amplicon layout: adapter A, 16S primer 8F, primer R357, adapter B

Sample-specific barcodes (e.g. TCTGCATG, GGAACTAA, TCCTTAGG)

Pool from 96 samples and sequence

Quality Filtering

Length >200 bases

Barcode present

5’ 16S primer present

Average Q = 20

Trimming

Remove barcode

Remove 5’ primer

Remove 3’ primer

Remove 3’ adapter

Sample    Barcode
1         TCTGCATG
2         GGAACTAA
3         TCCTTAGG
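The filtering and trimming rules above can be sketched in a few lines of Python. This is only an illustration, not the pipeline used in the study: the forward primer string is a hypothetical placeholder, "Average Q = 20" is read as a minimum mean quality of 20, and the 3' primer and adapter trimming is left out.

```python
from typing import Dict, List, Optional, Tuple

# Barcode-to-sample table taken from the slide.
BARCODES: Dict[str, str] = {
    "TCTGCATG": "Sample 1",
    "GGAACTAA": "Sample 2",
    "TCCTTAGG": "Sample 3",
}
BARCODE_LEN = 8
# Hypothetical placeholder; substitute the real 5' 16S primer sequence.
FORWARD_PRIMER = "AGAGTTTGATC"
MIN_LENGTH = 200   # reads must be longer than this
MIN_MEAN_Q = 20    # assumed interpretation of "Average Q = 20"


def filter_and_trim(seq: str, quals: List[int]) -> Optional[Tuple[str, str]]:
    """Return (sample, trimmed sequence), or None if the read fails a filter."""
    if len(seq) <= MIN_LENGTH:
        return None                          # too short
    if sum(quals) / len(quals) < MIN_MEAN_Q:
        return None                          # mean quality too low
    sample = BARCODES.get(seq[:BARCODE_LEN])
    if sample is None:
        return None                          # no known barcode at the 5' end
    rest = seq[BARCODE_LEN:]                 # remove barcode
    if not rest.startswith(FORWARD_PRIMER):
        return None                          # 5' primer missing
    return sample, rest[len(FORWARD_PRIMER):]  # remove 5' primer
```

Reads that return None are discarded; the rest are binned by sample name before classification or clustering.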

Reads

Strategies for data analysis

1. Define species composition and abundance in each sample

2. Define phylogenetic content (genetic diversity) in each sample

3. Quantitative analysis of the distribution of species abundances and genetic diversity between two environments or through a “gradient” of environments or in multiple environments

Sequences

Kmer-based approaches (amenable for high-throughput)

Kmer distribution; Kmer-based distances

CD-Hit; RDP Classifier

Multiple Sequence Alignment; BLAST

Phylogenetic tree; nearest neighbor (bit score)

Last common ancestor with control sequences

Search representative sequence against database

The frequency of every 8-base word is calculated from a training set of known taxa, and the probability of these words occurring in a query sequence is calculated

A subset of words is used for the probability calculation
Confidence of assignment is estimated from 100 replicate subsets (bootstrapping)
Ranking at higher order is achieved by summing results from all taxa at the lower level

Prob of Kmers from training set

           AAAATTTT   AAATTTTT   AATTTTTT
Taxon 1    0.1        0.01       0
Taxon 2    0.15       0.03       0.05
Taxon 3    0.08       0.006      0
Taxon 4    0.012      0.1        0.003
Taxon 5    0.09       0.083      0.003
Taxon 6    0.048      0.03       0
Taxon 7    0.1        0.07       0.002
Taxon 8    0.004      0.02       0.01
Taxon 9    0.065      0.027      0.1

Prob of Kmers from query

           AAAATTTT   AAATTTTT   AATTTTTT
Query 1    0.1        0.01       0
Query 2    0.048      0.03       0
Query 3    0.065      0.027      0.1
Query 4    0.012      0.1        0.003
Query 5    0.09       0.083      0.003
Query 6    0.1        0.01       0

Taxonomy-dependent analysis: RDP CLASSIFIER

http://rdp.cme.msu.edu/index.jsp
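A rough sketch of the word-based idea, assuming a toy training set. It is not the RDP CLASSIFIER itself: the smoothing here is a plain add-0.5 rule rather than the published word-specific priors, the bootstrap subset size (one eighth of the query words) is an assumption, and the roll-up to higher ranks is not shown. All function names are illustrative.

```python
import math
import random
from collections import Counter
from typing import Dict, List, Set, Tuple


def kmers(seq: str, k: int = 8) -> Set[str]:
    """All distinct k-base words in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def train(training: Dict[str, List[str]], k: int = 8):
    """Per taxon, count how many training sequences contain each word."""
    counts: Dict[str, Counter] = {}
    sizes: Dict[str, int] = {}
    for taxon, seqs in training.items():
        c: Counter = Counter()
        for s in seqs:
            c.update(kmers(s, k))
        counts[taxon] = c
        sizes[taxon] = len(seqs)
    return counts, sizes


def log_score(words: List[str], taxon: str, counts, sizes) -> float:
    """Naive Bayes log-probability of the words given the taxon (simplified smoothing)."""
    m = sizes[taxon]
    return sum(math.log((counts[taxon][w] + 0.5) / (m + 1)) for w in words)


def classify(query: str, counts, sizes, k: int = 8,
             reps: int = 100, seed: int = 0) -> Tuple[str, float]:
    """Return (best taxon, bootstrap confidence between 0 and 1)."""
    rng = random.Random(seed)
    words = sorted(kmers(query, k))
    best = max(sizes, key=lambda t: log_score(words, t, counts, sizes))
    n_sub = max(1, len(words) // 8)   # assumed subset size per replicate
    hits = 0
    for _ in range(reps):
        subset = [rng.choice(words) for _ in range(n_sub)]
        winner = max(sizes, key=lambda t: log_score(subset, t, counts, sizes))
        hits += winner == best
    return best, hits / reps
```

The confidence value is the fraction of the 100 replicate subsets that agree with the full-word assignment, which is the bootstrapping step described on the slide.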

1. Sorts sequences by length and pulls the longest sequence
2. Distance between this sequence and all remaining sequences is estimated from short word scores
3. Those sequences within a defined threshold word score limit are added to the cluster
4. Reiterate with remaining sequences
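The four steps above can be sketched as a greedy loop. This is a simplified illustration, not CD-HIT's implementation: the similarity measure (fraction of shared 5-base words), the word length, and the threshold are all assumptions chosen for readability.

```python
from typing import Dict, List, Set


def words(seq: str, k: int) -> Set[str]:
    """Distinct short words of length k in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def word_similarity(a: str, b: str, k: int = 5) -> float:
    """Shared words divided by the word count of the sequence with fewer words."""
    wa, wb = words(a, k), words(b, k)
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / min(len(wa), len(wb))


def greedy_cluster(seqs: List[str], threshold: float = 0.9,
                   k: int = 5) -> Dict[str, List[str]]:
    """Map each representative (longest remaining sequence) to its cluster members."""
    remaining = sorted(seqs, key=len, reverse=True)   # step 1: longest first
    clusters: Dict[str, List[str]] = {}
    while remaining:
        rep = remaining[0]
        members, rest = [rep], []
        for s in remaining[1:]:
            # steps 2 and 3: word-based score against the representative
            if word_similarity(rep, s, k) >= threshold:
                members.append(s)
            else:
                rest.append(s)
        clusters[rep] = members
        remaining = rest                              # step 4: reiterate
    return clusters
```

Because no full alignment is computed, the cost per comparison stays low, which is what makes this style of clustering practical for thousands of reads per sample.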

Godzik Laboratory

Taxonomy-independent analysis: CD-HIT

http://www.burnham.org/default.asp

BioServX Cluster

Etsuko Moriyama Computer lab

Core for Applied Genomics and Ecology

454 GutMicro Server

Instrument cluster; Titanium cluster

Primary data collection

Image analysis

Base calling

Quality filtering

Database upload

Data analysis

CLASSIFIER

OTU-PICKER

Search Functions Composite files and Send for analysis Pipelines

Simplified Database Searches

Taxonomy-dependent and Taxonomy independent pipelines For data analysis

Composite Experiment files from Database available for analysis

Set parameters And submit

Final check on Samples in the experiment

CD-HIT output CLASSIFIER output

Total genomic DNA

V1-V2 region, 16S rRNA gene

PCR amplification

Getting better at taxonomy-independent analysis

Taxonomy-Dependent: blind to taxa not in model

Taxonomy-Independent: too much data for true alignment

Sample 1 (~10,000 reads)



Sample 1,000 (~10,000 reads)

Dereplicate sequences to 97%: ~500 representative sequences per sample

Kmer-based group distance matrix

             Rep seq 1   Rep seq 2   Rep seq 3
Rep seq 1    1           0.986       0.786
Rep seq 2                1           0.693
Rep seq 3                            1

Complete linkage clustering
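A minimal sketch of complete linkage clustering on the three representative sequences shown in the matrix. Two assumptions are made here: the matrix values are treated as similarities (so distance is taken as 1 minus the value), and a 0.03 distance cutoff is used to match the 97% level. The function name is illustrative.

```python
from typing import Dict, List, Tuple

# Values from the slide, treated as similarities (assumption).
SIMILARITY: Dict[Tuple[str, str], float] = {
    ("Rep seq 1", "Rep seq 2"): 0.986,
    ("Rep seq 1", "Rep seq 3"): 0.786,
    ("Rep seq 2", "Rep seq 3"): 0.693,
}
DISTANCE = {pair: 1.0 - s for pair, s in SIMILARITY.items()}
CUTOFF = 0.03  # assumed: 97% similarity


def complete_linkage(dist: Dict[Tuple[str, str], float],
                     items: List[str], cutoff: float) -> List[List[str]]:
    """Merge clusters until the closest pair is farther apart than the cutoff."""
    def d(a: str, b: str) -> float:
        return dist[(a, b)] if (a, b) in dist else dist[(b, a)]

    clusters = [[x] for x in items]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: cluster distance is the largest pairwise distance.
                link = max(d(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or link < best[0]:
                    best = (link, i, j)
        if best[0] > cutoff:
            break
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


otus = complete_linkage(DISTANCE, ["Rep seq 1", "Rep seq 2", "Rep seq 3"], CUTOFF)
```

Under these assumptions Rep seq 1 and Rep seq 2 fall into one OTU and Rep seq 3 stays on its own, since every member of an OTU must be within the cutoff of every other member.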

Data reduction and creation of “sloppy bins”

[Figure: representative sequences grouped into bins OTU1 through OTU6]

Tightening up the OTUs with the secondary structure-aware Infernal aligner

cmalign OTU rep seqs:
>Rep seq 1_OTU1
>Rep seq 2_OTU1
>Rep seq 3_OTU3
>Rep seq 50,000_OTU4

Update rep seq OTU file:
>Rep seq 1_OTU1
>Rep seq 2_OTU2
>Rep seq 3_OTU3
>Rep seq 50,000_OTU4

E. P. Nawrocki, D. L. Kolbe, and S. R. Eddy, Infernal 1.0: Inference of RNA alignments, Bioinformatics (2009)

http://selab.janelia.org/publications.html#Nawrocki09

Sequences -> Alignment -> Complete linkage clustering

Quantitative analysis of Taxa or OTUs

Diversity estimates

Rarefaction Chao, Shannon

ANOVA and T-tests Confidence intervals

Quantifying abundance of ecological characteristics
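The diversity estimates named above can be computed directly from a vector of per-OTU read counts. The sketch below uses the standard Shannon index, the classic Chao1 estimator (with the bias-corrected form when there are no doubletons), and a single random subsample for rarefaction; the function names are illustrative, and real rarefaction curves average many subsamples at each depth.

```python
import math
import random
from typing import List


def shannon(counts: List[int]) -> float:
    """Shannon index H = -sum(p * ln p) over OTUs with at least one read."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def chao1(counts: List[int]) -> float:
    """Chao1 richness: observed OTUs plus a correction from singletons and doubletons."""
    s_obs = sum(1 for c in counts if c > 0)
    f1 = sum(1 for c in counts if c == 1)   # singletons
    f2 = sum(1 for c in counts if c == 2)   # doubletons
    if f2 == 0:
        return s_obs + f1 * (f1 - 1) / 2.0  # bias-corrected form
    return s_obs + f1 * f1 / (2.0 * f2)


def rarefy(counts: List[int], depth: int, seed: int = 0) -> int:
    """Number of OTUs seen in one random subsample of `depth` reads."""
    rng = random.Random(seed)
    pool = [otu for otu, c in enumerate(counts) for _ in range(c)]
    return len(set(rng.sample(pool, depth)))
```

Calling rarefy at increasing depths and plotting the result against depth gives one rarefaction curve per sample or per pooled line.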

From guts to greens Applications

The Gastrointestinal tract ecosystem: the next frontier in biology

Specialized cells and tissues for:
Nutrient breakdown and absorption
Flow (peristalsis)
Immune surveillance
Neural connectivity

Within this same complex of host tissues, a huge mass of microbes thrives. This mass is referred to as the microbiome

How complex is the microbiome?

Population density: 10^6 cells/ml in the ileum; 10^13 cells/gram in the colon

Species richness: 5 major phyla, 1,800 genera, 2,000-10,000 species of bacteria

Genetic coding content: 20-30 billion bases (10 times the human content)

Highly variable between individuals: extensive variation at the species/strain level

The microbiome essentially acts as a metabolic organ, encoding pathways for:

Nutrient breakdown, absorption, utilization
Signaling within the microbiota and to the host
Immune stimulation/suppression
…just to name a few

Fundamental questions about composition of the gut microbiome

What factors influence composition: how much “G” and how much “E”?
Are there Keystone species? Mutualists? Engineers?
How do aberrations arise in composition?
What is more important, species composition or function?

1. Sterile at birth -> rapidly colonized from maternal environment
2. Successive waves of colonization -> stabilizes to climax community after weaning
3. Some resistance to perturbation -> memory?

Anatomy of a polygenic complex disease

[Figure: health-to-disease diagram linking individual genes (Gene 1 through Gene 24796; Genes A-C) to Pathways 1-3; diseases shown: heart disease, diabetes, cancer, IBD, obesity]

Where does the gut microbiota fit in?

[Figure: contributors to IBD: genetic predisposition (Gene 1 through Gene 24796; Genes A-C; Pathways 2-3), environmental factors (diet, exercise), and gut microbiota]

If the gut microbiota is associated (causally) with certain lifestyle diseases

And

If the gut microbiota is influenced by host genotype

…Then genetic susceptibility to certain complex lifestyle diseases may be manifest, in part, as predisposition to colonization by certain gut bacteria

Changing how we think about disease susceptibility

Metabolic effects

Gut microbiota

Disease

Systematic approaches to measure the degree of genotypic influence at the individual level

1. Selective breeding models
2. Genetic mapping models

Artificial selection models

If host genotype has significant influence…

Then we should be able to observe significant effects of host genotype on microbiome composition in selective breeding experiments

Composition of gut microbiota in selective breeding lines

Founder populations: A (NIH) X B (ICR); B (ICR) X A (NIH); C (CF1) X D CFW(sw); D CFW(sw) X C (CF1)

F1: AB X CD; BA X DC; CD X AB; DC X BA

15 generations: selection and breeding (heat loss)

~30 generations of closed breeding (no selection)

10 generations of renewed selection and breeding

Lines: MH, MC, ML

http://www.animalscience.unl.edu/facultystaff/faculty/merlynnielsen.html



http://foodsci.unl.edu/Faculty/walter.cfm


Artificial selection models

If host genotype has significant influence…

Then we should observe significant effects of artificial selection on microbiome composition

Multiple generations of selective mating

Host genetic diversity: high -> decreased genetic diversity

16 animals per line (one line rep)

Pyrosequencing at 5,000-10,000 reads per animal

Did composition of the GI microbiome respond to selection?

UNIFRAC analysis of 16S rRNA phylotypes from MH, ML, and MC

[Figure: two panels, "CD-Hit and cluster analysis" and "weighted UNIFRAC analysis", each showing an MC group and an MH + ML group]

Rarefaction curves (97% cutoff) of microbiota from data pooled by line

[Figure: phylotypes versus number of sequences for lines MC, MH, and ML]

Selective breeding -> compositional changes in gut microbiome (abundance of taxa)

Compositional changes contributed to phenotype

Statistics and Bioinformatics: Steve Kachman (STATS), Etsuko Moriyama (BioSci)

Mouse Genomics: Daniel Pomp (Univ. of North Carolina)

What about direct evidence?

If there is a significant effect of host genotype, then it should behave as a polygenic phenotype: microbiome composition should co-segregate with multiple genomic markers in breeding populations

X

F1

Genotyping: SNPs

Phenotyping: 454 sequencing of 16S rRNA from poops

QTL mapping to identify genetic architecture controlling Composition of the gut microbiome

F4

What is a trait with respect to gut microbiome?

1. Relative abundance of individual taxonomic ranks

2. Groups of taxa with positive or negative correlation

[Figure: relative abundance (0 to 0.8) per animal (1 to 161) of bacterial classes: Epsilonproteobacteria, Deltaproteobacteria, Alphaproteobacteria, Gammaproteobacteria, Betaproteobacteria, Actinobacteria, Thermodesulfobacteria, Aquificae, Flavobacteria, Sphingobacteria, Bacteroidetes, Mollicutes, Bacilli, Erysipelotrichi, Clostridia]

Outbred ICR Base population

High voluntary wheel running

30 Generations of Selective breeding

HR mice:
Higher VO2 max
Reduced fatness
Higher muscle glycogen
Higher glycolytic and mitochondrial enzyme activities

F4 Mapping population

>800 animals
Weaned at 3 weeks and caged by gender

7-8 weeks: exercise cages
Fecal samples collected at day 1 and day 6 in exercise cages

Genotyping: 768 fully informative markers between ICR and B6 (present study stage at 550 QC’d markers)

Phenotyping: 10,000 454 reads from each animal using V1-V2 region, taxonomy assignment (RDP CLASSIFIER), normalized as proportion of total reads
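The normalization step ("proportion of total reads") is what turns classifier output into a per-animal trait value that can be mapped. A small sketch, assuming each read has already been assigned a taxon label; the function name and the example labels are illustrative only.

```python
from collections import Counter
from typing import Dict, List


def relative_abundance(assignments: List[str]) -> Dict[str, float]:
    """Convert per-read taxon labels for one animal into proportions of total reads."""
    counts = Counter(assignments)
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


# Toy example: ten classified reads from one animal.
reads = ["Clostridia"] * 6 + ["Bacilli"] * 3 + ["Bacteroidetes"]
traits = relative_abundance(reads)
```

Each entry in the resulting dictionary is one candidate phenotype for that animal, so animals sequenced to different depths remain comparable.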

QTLs mapped from 200 animals of the F4 cross

10 QTLs mapping to 7 chromosomes 4 different “compositional phenotypes”

Sometimes, you get lucky… QTLs on chromosome 15 control colonization by Helicobacter

Roadmap for the next two years

Experiment               N      Sex    SNPs         Diet               Parent of Origin   Genetic Diversity
1a) C57 x HR F4          800    Both   768          Regular            Y                  Low
1b) C57 x HR F10         400    Both   50 per QTL   High Fat vs. Reg   Y                  Low
2) Phenome Lines         400    Both   600,000+     Regular            N                  Moderate
3) Collaborative Cross   1600   Both   600,000+     High Fat vs. Reg   Y                  High

Experiment            N      Sex    SNPs       Phenotypes   Parent of Origin
Collaborative Cross   1000   Both   600,000+   Cancer       Y

http://www.nih.gov/


Microbiome analysis (Class level) of 700 animals from the F4 mapping population

Are strong effects of host genetics conserved in plants?

Plants also susceptible to infectious disease
Microbiome of phylloplane (epiphytes and endophytes) may play protective role
Much more prone to environmental variation?

Maize genetic resource populations: Nested Association Mapping (NAM) RILs from crosses of B73 X 25 other inbred lines

Preliminary evaluation: 27 Inbred lines = parental inbreds of the NAM collection

Sample unit = 3 plants per pot, 3 pots per line

Leaves harvested at 14 days post planting and phylloplane bacteria removed by soaking

Classes of Proteobacteria showing statistically significant effects of breeding line from Maize Nested Association Mapping lines

[Figure: relative abundance (0 to 0.7) of Gammaproteobacteria and Betaproteobacteria by inbred line (replicates of lines 12-14 and 16-26)]

Bacterial Families showing statistically significant effects of breeding line from Maize Nested Association Mapping lines

[Figure: relative abundance (0 to 0.6) of Comamonadaceae, Xanthomonadaceae, Sphingomonadaceae, Bradyrhizobiaceae, and Streptococcaceae by inbred line (replicates of lines 12-14 and 16-26)]

Metagenome-based mapping

Taxon-based mapping

Function-based mapping: secreted effector, carbohydrate transport, amino acid metabolism, spore formation

Shotgun library -> 454 Pyrosequencing -> best hit against The SEED or Pfams

Functional assignment (orthologous gene families): glucose transport, DNA replication, capsule polysaccharide, amylase, amino acid transport

Taxonomic assignment: Species A, Species B, Species C, Species D

The MG-RAST Pipeline: functional for low-throughput metagenomics

Computational bottleneck

[Figure: relative abundance of functional categories (DNA replication, protein transport, protein secretion, complex CHO transport, disaccharide transport, cell division, motility) in health and disease]

Taxonomy inferred from best-hit of Metagenome data using the SEED database

Taxonomy inferred from rRNA reads from Metagenome data using RDP ribosomal database

Host Metabolic effects

Gut microbiota

Disease

Host Metabolic effects

(microbial functions) Amino acid metabolism, Carbohydrate metabolism

Disease

Environmental effects

Microbiota

Ecosystem traits

Environmental effects

(microbial functions) Amino acid metabolism, Carbohydrate metabolism

Ecosystem traits

Environment 1

Environment 2

What factors influence composition: how much “G” and how much “E”?
Are there Keystone species? Mutualists? Engineers? Indicators?
How do aberrations arise in composition?
What is more important, species composition or function?

Role for Computational, Mathematical, and Statistical Modeling

1. Develop models that can predict how microbial communities will respond to perturbation

Therapeutics (e.g. antibiotics)
Prebiotics and Probiotics
Interventions (chemotherapy)
Biologicals (e.g. anti-TNF-alpha)
Dietary variables (can diet overcome genetic predisposition and vice versa)

2. Develop models that use microbial communities as predictors of:

Ecosystem health and performance
Climate change

Drilling down through complex microbial communities

Community composition: 16S Microbiome

Community genetic potential: Metagenomics and Metaproteomics

Community physiology: Metabolomics (nanoscale??)

Community dynamics: Microbiomics + FISH

Community interactions: Microbiomics + FISH

Energy metabolism: Merlyn Nielsen (An Sci), Larry Harshman (BioSci)

GI Microbiology: Andy Benson (Food Sci), Jens Walter (Food Sci), Robert Hutkins (Food Sci), Rod Moxley (VBS)

Physiology and Nutrition: Tim Carr (NUTR), Tom Burkey (An Sci), Ji-Young Lee (NUTR)

Statistics and Bioinformatics: Steve Kachman (STATS), Etsuko Moriyama (BioSci)

Mucosal Immunology: Dan Peterson (Food Sci)

Mouse Genomics: Daniel Ciobanu (UNL), Daniel Pomp (UNC), David Threadgill (NCSU)



http://www.animalscience.unl.edu/facultystaff/faculty/thomasburkey.html

Steve Kachman (UNL Statistics), Etsuko Moriyama (UNL School of Biological Sci)

Fangrui Ma (16S rRNA analysis pipelines), The Nguyen (IT/database programming)

Srinivas Aluru (ISU Computer Science), Pat Schnable (ISU Agronomy)

Xiao Yang (ISU Computer Science)

Ryan Legge (454 sequencing, data analysis)

Daniel Pomp (Mouse Genomics, UNC Chapel Hill)

