discovery and annotation of novel proteins from rumen gut metagenomic sequencing data
TRANSCRIPT
Discovery and Annotation of Novel Proteins from Rumen Gut Metagenomic Sequencing Data
Mick WatsonThe Roslin Institute
Edinburgh GenomicsUniversity of Edinburgh
Edinburgh Genomics• Genomics facility based at the University of Edinburgh• Available for collaborations on an academic, non-profit basis• Formed from merger of
– ARK-Genomics– The GenePool
• Funded by three major bio UK research councils
• A range of technologies and expertise available
http://genomics.ed.ac.uk
What am I going to talk about?• “The peril-ome” – perils of studying the
microbiome• Three projects– Enzyme discovery– Methane emissions– Rumen compartments
“THE PERIL-OME” – PERILS OF STUDYING THE MICROBIOME
What is the microbiome?“the ecological community of commensal, symbiotic, and pathogenic microorganisms that literally share our body space”
- Joshua Lederberg
Note: includes funghi, protists, archaea, bacteria, algae, viruses etc etc etc
(whisper it: most “microbiome” studies only look at bacteria/archaea)
How do we study the microbiome?• Marker gene vs shotgun metagenomics• Marker gene– 16S / 18S / ITS– Amplify this and compare
• Metagenomics– Extract all DNA– Fragment, sequence, interpret
16S studies are not metagenomics
http://phylogenomics.blogspot.co.uk/2012/08/referring-to-16s-surveys-as.html, http://biomickwatson.wordpress.com/2014/01/12/youre-probably-not-doing-metagenomics/
• Ashelford KE, Chuzhanova NA, Fry JC, Jones AJ, Weightman AJ. At least 1 in 20 16S rRNA sequence records currently held in public repositories is estimated to contain substantial anomalies. Appl Environ Microbiol. 2005 71(12):7724-36.
16S reference databases are not accurate
Your 16S reads are not accurate• Amongst other
things, analysed a mock community with different sequencing and bioinformatics strategies
• Kozich JJ, Westcott SL, Baxter NT, Highlander SK, Schloss PD. Development of a dual-index sequencing strategy and curation pipeline for analyzing amplicon sequence data on the MiSeq Illumina sequencing platform. Appl Environ Microbiol. 2013 S79(17):5112-20.
• Three 16S regions sequenced using 2x250bp– V4 (~250 bp), V34 (430bp), and V45 regions (~375 bp)– In the Mock community, there should be 20 OTUs
16S sequencing strategy?• The only strategy that got close to the correct result is
complete overlap of 2x250bp MiSeq reads
Your sample/DNA extraction protocol has an influence
“we found that each DNA extraction method resulted in unique community patterns”
“We observed significant differences in distribution of bacterial taxa depending on the method.”
Freezing your sample risks losing Bacteroidetes
“Samples frozen with and without glycerol as cryoprotectant indicated a major loss of Bacteroidetes in unprotected samples”
Your reagents are contaminated
• Sequenced a pure culture of Salmonella bongori
• Extracted DNA using different kits• Did serial dilutions of the pure
culture to assess impact of contaminating species
• Studying the microbiome is hard
• Please proceed carefully
WHY DO WE STUDY THE RUMEN?
Why do we stufy the rumen?• Energy from food"Our results indicate that the obese microbiome has an increased capacity to harvest energy from the diet.
Furthermore, this trait is transmissible: colonization of germ-free mice with an 'obese microbiota' results in a significantly greater increase in total body fat than colonization with a 'lean microbiota'"
Turnbaugh et al (2006) An obesity-associated gut microbiome with increased capacity for energy harvest. Nature 444(7122):1027-31
• Novel enzyme discovery"An initial assembly of the metagenomic sequence resulted in 179,092 scaffolds... Only 47 (0.03%) of the
assembled scaffolds showed high levels of similarity to previously sequenced genomes available in GenBank. These results suggest that the vast majority of the assembled scaffolds represent segments of hitherto uncharacterized microbial genomes." Hess M et al (2011) Metagenomic discovery of biomass-degrading genes and genomes from cow rumen. Science. 331(6016):463-7.
• Methane EmissionsGlobally, ruminant livestock produce about 80 million metric tons of methane annually, accounting for
about 28% of global methane emissions from human-related activities.
PROJECT 1: NOVEL ENZYME DISCOVERY
What did we sequence?Sample Desc
#Reads (millions) Read type Gbp
Ag2 Sheep, highland pasture 61.84 100x2 12.37Bg2 Sheep, highland pasture 87.12 100x2 17.42
1099_C1 Cattle, maize sileage 56.60 100x2 11.321043_C2 Cattle, maize sileage 55.89 100x2 11.181033_C1 Cattle, maize sileage 63.60 100x2 12.72
983 Cattle, maize sileage 217.79 100x2 43.56D1a Red Deer, rough grazing 149.51 150x2 29.90D2a Red Deer, rough grazing 125.77 150x2 25.15D3b Red Deer, rough grazing 171.13 150x2 34.23D4b Red Deer, rough grazing 160.55 150x2 32.11R1b Reindeer, Summer Pasture 149.40 150x2 29.88R2b Reindeer, Summer Pasture 209.29 150x2 41.86
301.70
Assembly protocol• Trim reads to Q30 (sickle)• Assemble using Velvet• Manual inspection of coverage peaks• Re-assemble using MetaVelvet• At this stage, no optimisation for K (used K:51)
Taxon assignment• Tried to assign scaffolds based on similarity to existing
genomes• What cut-off did we use? Using megablast, require
– HSP of at least 100bp– % identity of 80%
Sample N50 Total Number Max Hits %557_1 Ag2 2502 171080118 73968 250047 5867 7.93
557_2 Bg2 2620 359972055 153624 152301 12770 8.31
557_3 1099_C1 1518 107617445 68547 53793 4842 7.06
557_4 1043_C2 1623 50054937 29157 54895 2963 10.16
557_5 1033_C1 1604 129661930 77631 89904 6445 8.30
557_6 983 1432 54430150 35961 37263 1954 5.43
Log coverage vs %GC
Gene predictions and domains
sheep_1 sheep_2 cow_1 cow_2 cow_3 cow_4 Reindeer_1
deer_1 deer_2 deer_3 deer_40
100000
200000
300000
400000
500000
600000
ProteinsWith DomainTotal domainsUnique domains
Sample Proteins With Domain Total domains Unique domainssheep_1 262578 109432 294242 12972sheep_2 534761 217719 566000 13517cow_1 302624 105701 248925 13072cow_2 213662 83562 200267 13031cow_3 355222 127535 302140 13298cow_4 218302 76966 182265 12723
Reindeer_1 411158 165709 420309 13566deer_1 492563 194275 465101 13572deer_2 199967 77375 185724 13017deer_3 340798 139906 342889 13477deer_4 414010 165756 413926 13540
3745645 1463936 3621788 145785
Gene prediction protocol• Extracted long ORFs (> 200bp) – also use Glimmer-MG• Translate• Compare to Pfam
– Uses pfam_scan.pl -> hmmpfam (HMMER)• Typical output: 801aa protein
• Involved in Fe transport• 54% identical, 72% positive to previously sequenced protein
– ferrous iron transporter B [Odoribacter laneus]
Clustering of Pfam families:
Clustering of known taxonomy
Cellulose degradation• Cellulose -> Glucose• Turning plants into fuel – what ruminants are good at!• Focus of e.g. biofuels
Enzyme class EC# Pfam Number found
Cellulase Various Cellulase (PF00150) 981
Endoglucanases EC 3.2.1.4 Glyco_hydro_45 (PF02015) 70
Exoglucanases EC 3.2.1.91 Glyco_hydro_48 (PF02011) 18
β-glucosidases EC 3.2.1 21 Glyco_hydro_1 (PF00232) 273
PROJECT 2: METHANE EMISSIONS
Methane production• Methane is a natural product of anaerobic microbial fermentation
– Rumen is anaerobic• Methane is a greenhouse gas (GHG) with a global warming
potential 25-fold that of carbon dioxide (IPCC 2006). • Ruminants are the major producers of methane emissions from
anthropogenic activities, – accounting for 37% of total GHG from agriculture in the UK
• Methane emissions from cattle are entirely microbial in origin
Our data set• Steers chosen from
longitudinal study• Chose high and low
methane emitters matched for breed and diet
• Submitted for metagenomic sequencing
• Approx. 11Gb per sample
Relationship to archaeal abundance• Mapped metagenomic reads
to Greengenes database• Recorded all hits in database
that are as good as best hit• Calculated lowest common
taxon (in this case, Kingdom)• Matched for breed and diet,
high methane correlates with high archaeal abundance
• qPCR confirms this
Relationship to enzyme abundance• Mapped
metagenomic reads to KEGG
• Matched for breed and diet, the abundance of several enzymes is associated with methane production
Relationship to enzyme abundance• Mapped
metagenomic reads to KEGG
• Matched for breed and diet, the abundance of several enzymes is associated with methane production
Methane pathway• Fig on left is from
Shi et al Genome Research 24(9):1517-25
• Fig on right is same enzymes in our data set, matched for breed and diet
What’s in there?• Assembled all 8 metagenomes with MetaVelvet
– Predicted genes with Prokka– Annotated using Pfam domains
• 1.5 million gene/protein predictions• Less than half have any known domain• From 44 KEGG orthologues
– 7021 in our data– 5942 unique protein sequences
• Only 29 have exact match in NR• Only 60 are 100% conserved• At 90% identity, 807 / 5942 have hit
PROJECT 3: RUMEN COMPARTMENTS
Aims• New project, BBSRC CASE studentship• Sequenced 4 rumen compartments from 4 cows• Qu: do samples cluster by rumen compartment, by
cow, or neither?
qPCR of archaea/bacteria ratio
Kraken on metagenomics data• Classified < 2% of our data!
Novoalign reads against GREENGENES
PCA of genera abundances (novoalign)
Prokka gene prediction results
557N0200 557N0201 557N0202 557N0203 557N0204 557N0205 557N0206 557N0207 557N0208 557N0209 557N0210 557N0211 Totals
CDS 245580 343655 339668 523740 411384 367806 295141 330008 236079 248166 285507 163064 3789798
CDS ≥100aa 174599 246490 249712 391531 303543 260613 211776 243615 172023 180944 207805 102805 2745456
Ongoing work• How many proteins are novel?
– Compare to nr• How many proteins in common?
– Within datasets– Between datasets
• Annotation of protein domains• Load data into Meta4 database (http://dx.doi.org/10.3389/fgene.2013.00168)
• Identify putative enzymes of interest• Sequence and analysis of additional rumen samples• Assessment of additional software tools
– Xander – focused extraction of genes from metagenomics data– ShortBRED – functional characterisation of metagenomics data
Follow me:
Twitter: @BioMickWatsonBlog: biomickwatson.wordpress.com
AcknowledgementsFunders: BBSRC, Roslin Foundation, TSB
People: Despoina Roumpeka, Rainer Roehe, John Wallace, Edinburgh GenomicsEdinburgh Genomics: http://genomics.ed.ac.ukThe Roslin Institute: http://www.roslin.ed.ac.uk