What's going on in the environment? Getting a grip on microbial physiology with genomics and metagenomics
Rob Edwards
http://phage.sdsu.edu/~rob
Fellowship for Interpretation of Genomes, San Diego State University, Burnham Institute for Medical Research, IMEC, LLC
SIO, San Diego, May 2006
Outline
• Sequencing statistics scare skeptics
• The SEED database
• Some simply stunning Subsystems
• Mysterious missing methionine metabolism
• Marine metabolism mined from metagenomics
• Fabulous four-five-four for facile functional findings
• Marine phage most puzzling
The Players
• FIG: Fellowship for Interpretation of Genomes
• NMPDR: Natl. Microbial Pathogen Data Resource
• BRC: NIH Bioinformatics Resource Centers
• SEED: The SEED database.
How Many Genomes Have Been Sequenced?
            Complete   Draft   Total
Archaea           26      12      38
Bacteria         342     238     580
Eukarya           29     533     562
When will the 1,000th microbial genome be sequenced?
[Figure: complete genomes (axis 1,000 to 5,000) by year (1996 to 2008)]
http://theseed.uchicago.edu/FIG/index.cgi
The SEED database developed by FIG
Current version:
• 580 Bacteria (342 complete)
• 38 Archaea (26 complete)
• 562 Eukarya (29 complete)
• 1335 Viruses
• 2 Environmental Genomes
The problem:
How do you generate consistent
annotations for 1,000 genomes?
Basic biology
lacI lacZ lacY lacA
Different types of clustering
[Figure: clustered genes compared at < 80% identity]
Occurrence of clustering in different genomes
[Figure: fraction of genes in clusters (0 to 1) and number of genomes (0 to 120) for Actinobacteria, Aquificae, Bacteroidetes, Chlamydiae, Chloroflexi, Cyanobacteria, Deinococcus-Thermus, Firmicutes, Spirochaetes, Thermotogae, Proteobacteria, and the average. Series: clusters of genes with maximum 80% identity; genes in subsystems in clusters; total number of genomes in group.]
The Subsystems Approach to Annotation
• Subsystem is a generalization of “pathway”
  – collection of functional roles jointly involved in a biological process or complex
• Functional Role is the abstract biological function of a gene product
  – atomic, or user-defined; examples:
    • 6-phosphofructokinase (EC 2.7.1.11)
    • LSU ribosomal protein L31p
    • Streptococcal virulence factors
  – does not contain “putative”, “thermostable”, etc.
• Populated subsystem is the complete spreadsheet of functions and roles
Subsystems developed based on
• Wet lab
• Chromosomal context
• Metabolic context
• Phylogenetic context
• Microarray data
• Proteomics data
• …
Example Subsystem: Histidine Degradation
1 HutH Histidine ammonia-lyase (EC 4.3.1.3)
2 HutU Urocanate hydratase (EC 4.2.1.49)
3 HutI Imidazolonepropionase (EC 3.5.2.7)
4 GluF Glutamate formiminotransferase (EC 2.1.2.5)
5 HutG Formiminoglutamase (EC 3.5.3.8)
6 NfoD N-formylglutamate deformylase (EC 3.5.1.68)
7 ForI Formiminoglutamic iminohydrolase (EC 3.5.3.13)
Subsystem: Histidine Degradation
• Conversion of histidine to glutamate
• Functional roles defined in table
• Inclusion in subsystem is only by functional role
• Controlled vocabulary …
Subsystem Spreadsheet
• Column headers taken from table of functional roles
• Rows are selected genomes or organisms
• Cells are populated with specific, annotated genes
• Functional variants defined by the annotated roles
• Variant code -1 indicates subsystem is not functional
• Clustering shown by color
Organism Variant HutH HutU HutI GluF HutG NfoD ForI
Bacteroides thetaiotaomicron 1 Q8A4B3 Q8A4A9 Q8A4B1 Q8A4B0
Desulfotalea psychrophila 1 gi51246205 gi51246204 gi51246203 gi51246202
Halobacterium sp. 2 Q9HQD5 Q9HQD8 Q9HQD6 Q9HQD7
Deinococcus radiodurans 2 Q9RZ06 Q9RZ02 Q9RZ05 Q9RZ04
Bacillus subtilis 2 P10944 P25503 P42084 P42068
Caulobacter crescentus 3 P58082 Q9A9M1 P58079 Q9A9M0 Q9A9L9
Pseudomonas putida 3 Q88CZ7 Q88CZ6 Q88CZ9 Q88D00 Q88CZ3
Xanthomonas campestris 3 Q8PAA7 P58988 Q8PAA6 Q8PAA8 Q8PAA5
Listeria monocytogenes -1
Subsystem Spreadsheet
“The Populated Subsystem”
Subsystem Diagram
• Three functional variants
• Universal subset has three roles, followed by three alternative paths from IV to VI
• No ForI known experimentally
www.nmpdr.org
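[Diagram: I to II (HutH, releasing NH3), II to III (HutU, adding H2O), III to IV (HutI, adding H2O), then IV to VI by one of three routes: HutG (H2O in, formamide out), GluF (tetrahydrofolate to formiminotetrahydrofolate), or ForI (H2O in, NH3 out) to V followed by NfoD]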
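The variant logic described on these slides (a universal subset of three roles plus one of three alternative routes from IV to VI) can be sketched in Python. This is only an illustration under stated assumptions: the numbering of variants 1 to 3 against the GluF, HutG, and ForI/NfoD routes is chosen for the example, and `variant_code` is a hypothetical helper, not a SEED function.

```python
# Roles shared by every functional variant (the "universal subset" on the slide).
CORE_ROLES = {"HutH", "HutU", "HutI"}

# Alternative routes from intermediate IV to VI.
# ASSUMPTION: the numbering 1-3 is chosen for illustration only.
ROUTES = {
    1: {"GluF"},
    2: {"HutG"},
    3: {"ForI", "NfoD"},
}


def variant_code(roles_present):
    """Return the variant code for a genome, or -1 if the subsystem is not functional."""
    roles = set(roles_present)
    if not CORE_ROLES <= roles:
        return -1  # a universal role is missing
    for code, route in ROUTES.items():
        if route <= roles:
            return code  # first complete route wins
    return -1  # core present, but no complete route from IV to VI


print(variant_code(["HutH", "HutU", "HutI", "HutG"]))  # → 2
print(variant_code([]))                                 # → -1
```

A genome with no annotated roles, such as the Listeria monocytogenes row, gets -1 under this sketch.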
Subsystem Spreadsheet
• Prediction from subsystems confirmed experimentally
How do bacteria make methionine?
• acquire homoserine
• convert cysteine to cystathionine
• convert cystathionine to homocysteine
• acquire met or convert homocysteine to methionine
• sulfur and acetylhomoserine sulfhydrylase
Column groups: Homoserine activation, Sulfhydrylation, Transsulfuration, Methylation
Columns: HSDH, HK, HSST, HSAT, AHSH/SHSH, CTGS, CTBL, MetH, MetE, BhmT, MTHFR
Organism | Variant code | Gene numbers (in column order; empty cells are not shown)
Nostoc sp. PCC 7120 | 0 | 4427 657 619 1093
Synechocystis sp. PCC 6803 | 0 | 2356 1112 2469 1144
Thermosynechococcus elongatus BP-1 | 0 | 277 1764 1027 1090 1770
Trichodesmium erythraeum IMS101 | 0 | 415, 4266 6167106, 1229 2279 4433
Gloeobacter violaceus PCC 7421 | 0 | 4295 1127 2500 477 789
Anabaena variabilis ATCC 29413 | 33 | 2331 5519 3872 3873 4254, 6365 6434
Nostoc punctiforme | 33 | 2895 6648 5301 5302 4055 1885
Prochlorococcus marinus MED4 | 66 | 1204 1764 1714 1715 2 1 1421 295
Prochlorococcus marinus str. MIT 9313 | 66 | 1141 426 875 874 225 226 728 2005
Prochlorococcus marinus subsp. marinus str. CCMP1375 | 66 | 1148 1064 799 798 404 405 957 176
Prochlorococcus marinus subsp. pastoris str. CCMP1986 | 66 | 1047 592 640 639 405 406 874 153
Synechococcus sp. WH 8102 | 66 | 706 1476 845 846 669 670 1233 2258
Synechococcus elongatus PCC 7942 | 0 | 1397 769 2172 1030 2173 702 639
Missing genes (marked “?” on the spreadsheet)
Cyanoseed: http://cyanoseed.theFIG.info
Marineseed: http://theseed.uchicago.edu/FIG/organisms.cgi?show=marine
Subsystems are not just for gene clusters
• predicted or measured co-regulation
• genome context (virulence islands, prophages, conserved gene clusters)
• virulence mechanism
• cellular localization
• enzymatic activity
• common phenotype
• combinations of criteria
How much progress has been made?
• 541 subsystems encoded
• 80 – 85% of the genes in core machinery are contained in subsystems
• 30 – 35% of genes in NMPDR organism genomes are contained in subsystems
• 20 – 30% of genes in other genomes are contained in subsystems
Metagenomics
• Starting material: 200 liters of water or 5-500 g of fresh fecal matter
• Concentrate and purify viruses
• Epifluorescent microscopy
• Extract nucleic acids
• DNA/RNA LASL
• Sequence
Breitbart et al., multiple papers
Control datasets for metagenome comparisons
Number of proteins in different datasets:
Bacteria                                       952,758
Archaea                                         49,694
Eukarya                                        259,653
Acid mine                                        7,588
Sargasso (without Shewanella, Burkholderia)    960,561
Sorcerer II                                ~13,000,000
[Figure: Subsystems per million CDS]
Determination of Statistical Differences Between Metagenomes
• Take 10,000 proteins from sample 1
• Count frequency of each subsystem
• Repeat 20,000 times
• Repeat for sample 2
• Combine both samples
• Sample 10,000 proteins 20,000 times
• Build 95% CI
• Compare medians from samples 1 and 2 with 95% CI
Rodriguez-Brito (2006). BMC Bioinformatics
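The resampling steps listed on this slide can be sketched in Python. This is a minimal illustration, not the published implementation: it assumes sampling with replacement, a simple percentile interval for the 95% CI, and input given as one subsystem label per protein. The function names are hypothetical.

```python
import random
from collections import Counter
from statistics import median


def resample_counts(labels, sample_size, repeats, rng):
    """Draw `repeats` random samples and record each subsystem's count per draw."""
    names = sorted(set(labels))
    counts = {name: [] for name in names}
    for _ in range(repeats):
        # ASSUMPTION: proteins are drawn with replacement.
        tally = Counter(rng.choices(labels, k=sample_size))
        for name in names:
            counts[name].append(tally[name])
    return counts


def compare_metagenomes(sample1, sample2, sample_size=10_000, repeats=20_000, seed=0):
    """Compare per-sample medians against a 95% interval built from the pooled samples."""
    rng = random.Random(seed)
    med1 = {k: median(v) for k, v in resample_counts(sample1, sample_size, repeats, rng).items()}
    med2 = {k: median(v) for k, v in resample_counts(sample2, sample_size, repeats, rng).items()}
    pooled = resample_counts(sample1 + sample2, sample_size, repeats, rng)

    report = {}
    for name, values in pooled.items():
        values.sort()
        # ASSUMPTION: percentile interval taken directly from the sorted resamples.
        low = values[int(0.025 * repeats)]
        high = values[min(repeats - 1, int(0.975 * repeats))]
        m1, m2 = med1.get(name, 0), med2.get(name, 0)
        report[name] = {
            "ci95": (low, high),
            "median1": m1,
            "median2": m2,
            "differs": not (low <= m1 <= high and low <= m2 <= high),
        }
    return report


# Toy data (made up for illustration): one subsystem label per protein.
site1 = ["iron acquisition"] * 90 + ["nitrification"] * 10
site2 = ["iron acquisition"] * 10 + ["nitrification"] * 90
for name, row in compare_metagenomes(site1, site2, sample_size=100, repeats=500).items():
    print(name, row)
```

The defaults mirror the slide's 10,000 proteins and 20,000 repeats; the toy call uses much smaller values so it runs quickly.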
Sampling Sargasso and “SEED” metagenomes
Comparison of all Subsystems
[Figure: subsystems ranging from more in Sargasso to more in SEED]
Is serine being used as an osmolyte?
• Few trehalose, proline, sucrose synthetic genes
• Serine is the most abundant amino acid in the ocean (Suttle, Keil)
• Serine is a more effective osmoprotectant than glycine betaine (Yancey)
Metagenomics
[The same workflow slide as above, with “454” and “So 2004” added as annotations]
454 Sequence Data (only from the Rohwer Lab, in one year)
• 42 libraries
  – 22 microbial, 20 phage
• 1,028,563,420 bp total
  – 33% of the human genome
  – 95% of all complete and partial bacterial genomes
  – 10% of community sequencing of JGI per year
• 9,933,184 sequences
  – Average 236,511 per library
• Average read length 103.5 bp
  – Average read length has not increased in 12 months
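The averages on this slide follow from the totals, as the short Python check below shows. The human genome size is an assumption (about 3.1 billion bp) introduced only for this illustration. Dividing the reads evenly over 42 libraries gives roughly 236,504, slightly different from the 236,511 on the slide.

```python
total_bp = 1_028_563_420   # from the slide
total_reads = 9_933_184    # from the slide
libraries = 42             # from the slide
HUMAN_GENOME_BP = 3.1e9    # ASSUMPTION: approximate human genome size

avg_read_length = total_bp / total_reads
reads_per_library = total_reads / libraries
human_fraction = total_bp / HUMAN_GENOME_BP

print(f"{avg_read_length:.1f} bp per read")            # → 103.5 bp per read
print(f"{reads_per_library:,.0f} reads per library")   # → 236,504 reads per library
print(f"{human_fraction:.0%} of the human genome")     # → 33% of the human genome
```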
The Soudan Mine, Minnesota
• Red stuff: oxidized
• Black stuff: reduced
Red and Black Samples Are Different
Cloned and 454-sequenced 16S are indistinguishable
[Figure: 16S profiles for black stuff, red, and cloned red]
There are different amounts of metabolism in each environment
There are different amounts of substrates in each environment
[Figures: black stuff compared with red stuff]
But are the differences significant?
• Sample 10,000 proteins from site 1
• Count frequency of each “subsystem”
• Repeat 20,000 times
• Repeat for sample 2
• Combine both samples
• Sample 10,000 proteins 20,000 times
• Build 95% CI
• Compare medians from sites 1 and 2 with 95% CI
Rodriguez-Brito (2006). BMC Bioinformatics
Subsystem differences & metabolism
Iron acquisition (Black Stuff)
• Siderophore enterobactin biosynthesis
• Ferric enterobactin transport
• ABC transporter ferrichrome
• ABC transporter heme
Black stuff: ferrous iron (Fe2+, ferroan [(Mg,Fe)6(Si,Al)4O10(OH)8])
Red stuff: ferric iron (goethite [FeO(OH)])
Nitrification differentiates the samples
Edwards (2006). BMC Genomics
The challenge is explaining the differences between samples
Red Sample
• Arg, Trp, His
• Ubiquinone
• FA oxidation
• Chemotaxis, flagella
• Methylglyoxal metabolism
Black Sample
• Ile, Leu, Val
• Siderophores
• Glycerolipids
• NiFe hydrogenase
• Phenylpropionate degradation
We can cheaply compare the important biochemistry happening in different environments
We don’t care which organisms are doing the metabolism, but we know what organisms are there
Phages In The World’s Oceans
Region | Samples | Sites | Years
GOM | 41 | 13 | 5
SAR | 1 | 1 | 1
BBC | 85 | 38 | 8
ARC | 56 | 16 | 1
LI | (not given) | 4 | 1
Phages, Reefs, and Human Disturbance
The Northern Line Islands Expedition, 2005
[Map labels: Kingman, Palmyra, Washington, Fanning, Christmas]
16S rDNA at each island
16S rDNA of the Proteobacteria
Phages at each island
Christmas to Kingman Bias in No. Phage Hosts
Negative numbers mean relatively more phage hosts at Kingman
Most Marine Phage Sequences are Novel
Thanks: Mya Breitbart
Phages are specific to environments
Phage Proteomic Tree v. 5 (Edwards, Rohwer)
[Figure labels: ssDNA-like, T7-like, T4-like]
Marine Single-Stranded DNA Viruses
• 6% of SAR sequences are ssDNA phage (Chlamydia-like Microviridae)
• 40% of viral particles in SAR are ssDNA phage
• Several full-genome sequences were recovered via de novo assembly of these fragments
• Confirmed by PCR and sequencing
SAR Aligned Against the Chlamydia phi 4 Genome
12,297 sequence fragments hit using TBLASTX over a ~4.5 kb genome
[Figure: individual sequence reads and concatenated hits aligned along the Chlamydia phi 4 genome, with positions 3890 bp and 4490 bp marked; coverage axis 0 to 1033]
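A coverage track like the one on this slide can be computed from hit coordinates. The Python sketch below is one simple way to do it, assuming each hit is a 1-based inclusive (start, end) interval on the reference genome. The function name and the example coordinates are hypothetical and are not taken from the talk.

```python
def coverage(genome_length, hits):
    """Return per-base depth for 1-based inclusive (start, end) hits."""
    # Difference array: +1 where a hit starts, -1 just past where it ends.
    delta = [0] * (genome_length + 1)
    for start, end in hits:
        delta[start - 1] += 1
        delta[end] -= 1

    depth = []
    running = 0
    for change in delta[:genome_length]:
        running += change
        depth.append(running)
    return depth


# Hypothetical hits on a 10 bp reference.
print(coverage(10, [(1, 5), (3, 7)]))  # → [1, 1, 2, 2, 2, 1, 1, 0, 0, 0]
```

The difference-array approach touches each hit once and each base once, which keeps it fast even for many thousands of hits over a short genome.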
Summary
You only need to remember:
• Subsystems are the best way to annotate genomes
• 454 generates lots of data
• We can use subsystems to find out what is going on in the environment
SDSU: Forest Rohwer, Beltran Rodriguez-Brito, Linda Wegley
USF: Mya Breitbart
University of Bielefeld: Folker Meyer, Lutz Krause
FIG: Veronika Vonstein, Ross Overbeek, Gordon Pusch
ANL: Rick Stevens, Bob Olsen, Terry Disz
Annotators: Gary Olsen, Andrei Osterman, Olga Zagnitko, Olga Vassieva, Svetlana Gerdes, Ramy Aziz
UBC: Curtis Suttle, Amy Chan