The Metagenomics RAST server: Annotation, Analysis, and Comparisons
Perfect for Pyrosequencing
Rob Edwards
Department of Computer Science, San Diego State University
Mathematics and Computer Sciences Division, Argonne National Laboratory
Roche Life Sciences Workshop, Sept 2008
www.nmpdr.org www.theseed.org
Outline
• Metagenomics
• Tools for analyzing sequences
• Computational Challenges
• Does it work?
How much has been sequenced?
[Figure: number of known sequences by year; labeled points: first bacterial genome, 100 bacterial genomes, 1,000 bacterial genomes, environmental sequencing]
How much will be sequenced?
[Figure labels: 100 people; everybody in San Diego; everybody in USA; all cultured bacteria; one genome from every species; most major microbial environments]
Metagenomics (Just sequence it)
200 liters water; 5-500 g fresh fecal matter; 50 g soil
Sequence
Epifluorescent Microscopy
Concentrate and purify bacteria, viruses, etc
Extract nucleic acids
Publish papers
Modern Metagenomics
• Marine: near-shore water (~100 samples); off-shore water (~50 samples); near- and off-shore sediments
• Metazoan-associated: corals; fish; human blood; human stool
• Terrestrial/Soil: Terragenomics; Amazon rainforest; Konza prairie; Joshua Tree desert; air
• Freshwater: aquifer; glacial lake
• Extreme: hot springs (84 °C; 78 °C); soda lake (pH 13); solar saltern (>35% salt)
The Problem
How do you generate consistent and accurate annotations for metagenomes?
The SEED Family
Annotations using subsystems
FIG developed the notion of a subsystem: a generalization of “pathway” as a collection of functional roles jointly involved in a biological process or complex.
Subsystems were extended into FIGfams: protein families whose members perform the same function.
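To make the definition concrete, a subsystem can be pictured as a named set of functional roles, with a check for which of those roles show up among a set of annotated proteins. This is only an illustrative sketch: the `Subsystem` class, the `roles_present` helper, and every role and protein name below are hypothetical and are not taken from the SEED code or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subsystem:
    """A named collection of functional roles (hypothetical model)."""
    name: str
    roles: frozenset


def roles_present(subsystem, annotations):
    """Return the subsystem's roles that appear among the annotations.

    `annotations` maps a protein identifier to its assigned functional role.
    """
    return subsystem.roles & set(annotations.values())


# Invented example data, for illustration only.
example = Subsystem("Example process", frozenset({"role A", "role B", "role C"}))
annotations = {
    "protein_1": "role A",
    "protein_2": "role C",
    "protein_3": "unrelated role",
}

found = roles_present(example, annotations)
print(sorted(found))  # ['role A', 'role C']
```

In this picture, a FIGfam would correspond to the group of proteins that share one role, which is why a protein assigned to a family can be placed directly into the matching subsystem.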
Annotation of Complete Genomes
• Automated user originated processing
• Takes 1-7 hours depending on size and complexity of the genome
• ~2,000 external submissions, including hundreds of genomes not yet publicly released.
• Reannotation of >500 genomes complete
• 1,000 users, 200 organizations, 25 countries.
http://rast.nmpdr.org/
The metagenomics RAST server
Automated Processing
Summary View
Metagenomics Tools: Annotation & Subsystems
Metagenomics Tools: Annotation & KEGG maps
Metagenomics Tools: Recruitment Plots
Metagenomics Tools: Phylogenetic Reconstruction
Metagenomics Tools: Comparative Tools
Computational Requirements
~19 hours of compute per input megabyte
[Figure: hours of compute time vs. input size (MB)]
How much so far
986 metagenomes
79,417,238 sequences
17,306,834,870 bp (17 Gbp)
Average: ~15-20 Mbp per metagenome
Compute time (on a single CPU):
328,814 hours = 13,700 days = 38 years
~300 GS20, ~300 FLX, ~300 Sanger
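The compute-time figure can be checked against the rate of ~19 hours per input megabyte quoted on the Computational Requirements slide. The sketch below assumes one base pair occupies roughly one byte, that a megabyte is 10^6 bytes, and that the input is truncated to whole megabytes; these are assumptions chosen because they reproduce the slide's number, not a description of how the authors measured it. The function name `compute_estimate` is likewise illustrative.

```python
# Figures quoted on the slides.
N_METAGENOMES = 986
N_SEQUENCES = 79_417_238
TOTAL_BP = 17_306_834_870
HOURS_PER_MB = 19  # approximate rate from the slides


def compute_estimate(total_bp, hours_per_mb=HOURS_PER_MB):
    """Return (hours, days, years) of single-CPU time.

    Assumes 1 bp ~ 1 byte, 1 MB = 10**6 bytes, whole megabytes only.
    """
    megabytes = total_bp // 1_000_000
    hours = megabytes * hours_per_mb
    days = hours / 24
    years = days / 365
    return hours, days, years


hours, days, years = compute_estimate(TOTAL_BP)
print(f"{hours:,} hours, about {int(days):,} days, about {round(years)} years")
print(f"mean metagenome size: {TOTAL_BP / N_METAGENOMES / 1e6:.1f} Mbp")
print(f"mean sequence length: {TOTAL_BP / N_SEQUENCES:.0f} bp")
```

Under these assumptions the estimate matches the 328,814 hours on the slide, and the mean size per metagenome falls inside the quoted 15-20 Mbp range.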
Lots of sequences, all pyrosequencing
Metagenomics Tools: Functional Heat Maps
From Sequences To Environments
[Figure: CDA plot; axes: CDA 60.2%, CDA 21.7%; functional categories: Sulfur, Respiration, Capsule, Motility, Membrane transport, Stress, Signaling, Phosphorus, RNA; environments: Mine, Saltern, Marine, Microbialites, Coral, Fish, Animals, Freshwater]
Dinsdale et al., Nature 2008
Workshops
Free workshops on NMPDR, RAST, mg-RAST, SEED
Contact Leslie McNeil [email protected]
or visit http://www.nmpdr.org/
Acknowledgements
• Environmental Genomics: Forest Rohwer; all the labs that provided sequence
• Metagenomics Annotation Server: Rick Stevens, Folker Meyer, Bob Olson, Daniel Paarman, Mark D'Souza, Jared Wilkening, Andreas Wilke
• Statistics & Web services: Liz Dinsdale, Robert Schmieder, Dana Hall, Beltran Rodriguez-Brito, Bahador Nosrat
• FIG: Ross Overbeek, Veronika Vonstein, annotators
• Artist: Paula Morris
• Argonne Sequencing: Marc Domanus, Areej Ammar
Artist's impression: not all machines are known to explode
Terragenomics
Differences between soil samples