Exploiting Gene Clusters to Curate
Annotations
October, 2003
Ross Overbeek, Fellowship for Interpretation of
Genomes (FIG)
Outline of the Talk
• The Emerging Opportunity
• The Use of Clusters to Find “Missing Genes”
• Experiences with a Single Pathway
• “The Project”
• Tools Needed to Support the Project
Three “Laws”
The amount of available DNA sequence data will double every 18 months
The number of available genomes will double every 18 months
The cost of sequence will drop by a factor of 2 every 18 months.
Basic Facts
We have about 230-250 publicly available more-or-less complete genomes
We will have about 1000 complete genomes within 3 years
This will lead to better annotations, not worse
The majority of annotations will need to be automated, and the process must accurately follow the steps that a human expert would take
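The estimate of about 1000 genomes within 3 years can be checked against the 18-month doubling "law" with a little arithmetic. The sketch below assumes pure exponential growth from the 230-250 genomes quoted above; the function name `projected_genomes` is illustrative only.

```python
def projected_genomes(current: int, months: float, doubling_months: float = 18.0) -> float:
    """Project a count forward, assuming it doubles every `doubling_months` months."""
    return current * 2 ** (months / doubling_months)


# Three years is two doubling periods, so the count grows by a factor of 4.
for start in (230, 250):
    print(start, "->", round(projected_genomes(start, 36)))
```

Starting from 230-250 genomes, two doublings give roughly 920-1000, which is consistent with the figure quoted on the slide.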
The Use of Clusters to Find Missing Genes
• 3,000 - 4,000 functional roles (300 – 3,000 per organism)
• Largely conserved across the three kingdoms (sequences; functions; pathways)
• “Missing genes” are still there
Central Machinery of Life:
Horizons of gene discovery
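[Figure: schematic pathway with metabolites A, B, C, D, E and F and enzymes E1, E2 and E3. The gene row reads gene A, gene B?, gene C, with gene B marked as the missing gene.]
Missing genes in metabolic pathways: making a case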
Functional context/neighborhood
Pathway A --> F                           genome 1  genome 2  genome 3  genome 4  genome 5
Enzyme E1         protein family A        A1        A2        A3        A4        A5
Enzyme E2         protein family B = ?    ?         ?         ?         ?         ?
Enzyme E3         protein family C        C1        C2        C3        C4        C5
Globally Missing Gene (never identified in any species)
Functional context/neighborhood
Pathway A --> F                           genome 1  genome 2  genome 3  genome 4  genome 5
Enzyme E1         protein family A        A1        A2        A3        A4        A5
Enzyme E2         protein family B = ?    ?         ?         ?         B4        B5
Enzyme E3         protein family C        C1        C2        C3        C4        C5
Locally Missing Gene (non-orthologous gene displacement)
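The distinction between globally and locally missing genes can be sketched as a check over a role-by-genome table. This is a minimal illustration, not the tool used for the talk: the table layout, the function name `classify_missing`, and the status strings are all assumptions, and `None` stands for the "?" cells on the slides.

```python
from typing import Dict, List, Optional

# Hypothetical table: protein family -> one entry per genome (None = no gene assigned).
Table = Dict[str, List[Optional[str]]]


def classify_missing(table: Table) -> Dict[str, str]:
    """Label each family as complete, globally missing, or locally missing."""
    status = {}
    for family, genes in table.items():
        found = [g for g in genes if g is not None]
        if len(found) == len(genes):
            status[family] = "complete"
        elif not found:
            status[family] = "globally missing"  # never identified in any genome
        else:
            status[family] = "locally missing"   # known in some genomes only
    return status


global_case: Table = {
    "A": ["A1", "A2", "A3", "A4", "A5"],
    "B": [None, None, None, None, None],
    "C": ["C1", "C2", "C3", "C4", "C5"],
}
local_case: Table = dict(global_case, B=[None, None, None, "B4", "B5"])

print(classify_missing(global_case)["B"])
print(classify_missing(local_case)["B"])
```

A "locally missing" label is only meaningful when the rest of the pathway is present in that genome; a real check would also confirm that the neighboring roles are assigned there.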
GENOME 1: gene A1, gene C1, gene R1, gene T1, gene G1, gene X1
GENOME 2: gene A2, gene M2, gene X2; gene N2, gene C2, gene Y2
GENOME 3: gene A3, gene S3, gene U3, gene X3, gene Y3; gene Q3
GENE CLUSTERING ON THE CHROMOSOME (OPERONS)
Techniques of genome context analysis (I)
checking neighbors
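Checking neighbors amounts to asking which gene families sit near a known pathway gene in many genomes. The sketch below is one simple way to do that, under stated assumptions: genomes are given as ordered lists of family labels, the window size is arbitrary, and the toy layout is only loosely modeled on the three-genome figure (the true gene order on the slide is not recoverable from the text).

```python
from collections import Counter
from typing import Dict, List

# A genome is a list of chromosomal regions; each region is an ordered list of family labels.
Genome = List[List[str]]


def neighbor_counts(genomes: Dict[str, Genome], anchor: str, window: int = 5) -> Counter:
    """Count, per family, the number of genomes where it lies within `window` genes of `anchor`."""
    counts: Counter = Counter()
    for regions in genomes.values():
        seen = set()
        for region in regions:
            for i, family in enumerate(region):
                if family != anchor:
                    continue
                lo, hi = max(0, i - window), i + window + 1
                seen.update(f for f in region[lo:hi] if f != anchor)
        counts.update(seen)  # each genome contributes at most 1 per family
    return counts


genomes = {
    "genome1": [["A", "C", "R", "T", "G", "X"]],
    "genome2": [["A", "M", "X"], ["N", "C", "Y"]],
    "genome3": [["A", "S", "U", "X", "Y"], ["Q"]],
}
counts = neighbor_counts(genomes, "A")
conserved = [family for family, n in counts.items() if n == len(genomes)]
print(conserved)
```

In this toy data only family X stays next to A in all three genomes, which is the kind of conserved neighbor that would be examined as a candidate for a missing role.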
GENOME 1: gene A1, gene C1
GENOME 3: gene C3 / Z3, gene A3
GENOME 4: gene A4 / X4, gene C4
GENOME 5: gene C5 / A5
PROTEIN FUSION EVENTS
Techniques of genome context analysis (II)
checking connections
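Fusion evidence can be collected by listing every pair of families that appears inside a single gene somewhere. The sketch below assumes each gene has already been mapped to the families it contains (for example by similarity search, which is not shown); `fusion_links` and the input layout are hypothetical, and the gene labels follow the figure.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set, Tuple


def fusion_links(genes: Dict[str, List[str]]) -> Dict[Tuple[str, str], Set[str]]:
    """Map each pair of families to the genes in which the two are fused."""
    links: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for gene_id, families in genes.items():
        for pair in combinations(sorted(set(families)), 2):
            links[pair].add(gene_id)
    return dict(links)


# Gene -> families it contains; "X / Y" names mark fused genes.
genes = {
    "A1": ["A"], "C1": ["C"],
    "C3 / Z3": ["C", "Z"], "A3": ["A"],
    "A4 / X4": ["A", "X"], "C4": ["C"],
    "C5 / A5": ["C", "A"],
}
links = fusion_links(genes)
for pair, evidence in sorted(links.items()):
    print(pair, sorted(evidence))
```

Each reported pair is a hypothesis that the two families are functionally coupled, supported by the genomes in which they occur as one protein.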
GENOME 1: gene A1, gene C1, gene R1, gene T1, gene X1
GENOME 5: gene C5 / A5, gene R5, gene X5
GENOME 2: gene A2, gene W2, gene C2
SHARED REGULATORY SITES (REGULONS )
Techniques of genome context analysis (III)
co-regulation
Gene          A    C    I    X    H    G    W    Y    Z
IN-GROUP
GENOME 1      A1   C1   I1   X1   H1   G1   W1   Y1   Z1
GENOME 2      A2   C2   I2   X2   H2   G2   W2   Y2   -
GENOME 3      A3   C3   I3   X3   H3   G3   W3   Y3   Z3
GENOME 4      A4   C4   I4   X4   H4   -    W4   -    -
GENOME 5      A5   C5   I5   X5   H5   G5   W5   -    -
OUT-GROUP
GENOME 6      -    -    I6   -    H6   -    W6   Y6   Z6
GENOME 7      -    -    I7   -    H7   G7   W7   -    Z7
GENOME 8      -    -    I8   -    H8   -    W8   Y8   -
GENOME 9      -    -    I9   -    H9   -    W9   Y9   Z9
GENOME 10     -    -    I10  -    -    -    W10  Y10  Z10
Score:        10   10   5    10   6    8    4    4    3
Techniques of genome context analysis (IV)
co-evolution
OCCURRENCE PROFILES
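An occurrence profile can be scored by counting the genomes in which a gene's presence or absence agrees with the expected pattern: present in the in-group, absent in the out-group. The slide does not state its scoring rule, so the rule below is an assumption and its numbers should not be expected to reproduce the Score row exactly. The presence sets are transcribed from the ten-genome figure.

```python
from typing import Dict, Set

IN_GROUP: Set[int] = {1, 2, 3, 4, 5}
OUT_GROUP: Set[int] = {6, 7, 8, 9, 10}

# Genomes in which each gene family is present.
PRESENT: Dict[str, Set[int]] = {
    "A": {1, 2, 3, 4, 5},
    "C": {1, 2, 3, 4, 5},
    "X": {1, 2, 3, 4, 5},
    "I": set(range(1, 11)),
    "H": {1, 2, 3, 4, 5, 6, 7, 8, 9},
    "G": {1, 2, 3, 5, 7},
    "W": set(range(1, 11)),
    "Y": {1, 2, 3, 6, 8, 9, 10},
    "Z": {1, 3, 6, 7, 9, 10},
}


def profile_score(present: Set[int]) -> int:
    """Assumed rule: +1 per in-group genome with the gene, +1 per out-group genome without it."""
    return len(present & IN_GROUP) + len(OUT_GROUP - present)


# Rank families by how closely they follow the in-group/out-group pattern.
ranked = sorted(PRESENT, key=lambda family: -profile_score(PRESENT[family]))
for family in ranked:
    print(family, profile_score(PRESENT[family]))
```

Families that score as high as the known pathway genes A and C (here, X) are the ones whose distribution across genomes tracks the pathway most closely.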
Missing gene case
primary suspects
D-Erythrose 4-P + Phosphoenolpyruvate
  1  aroH, aroF, aroG
7P-2-Dehydro-3-deoxy-D-arabino-heptulosonate
  2  aroB
3-Dehydro-Quinate
  3  aroD
3-Dehydro-Shikimate
  4  aroE, ydiB (EC 1.1.1.25)
Shikimate
  5  aroK, aroL (Shikimate Kinase)
Shikimate-5-P
  6  aroA
O5-(1-Carboxyvinyl)-3-P-Shikimate
  7  aroC
Chorismate
  Downstream: chorismate catabolism; isochorismate anabolism; Trp, Phe, Tyr syntheses
Example I: Chorismate Pathway
Missing gene in all archaea
??
??
Fusion Protein
Fusion Protein
Chromosomal Clustering: Prediction
Functional coupling in chorismate pathway
Step  EC           Enzyme                                        Genes
1     EC 4.1.2.15  2-Dehydro-3-Deoxyphosphoheptonate Aldolase    aroH, aroG, aroF
2     EC 4.6.1.3   3-Dehydroquinate Synthase                     aroB
3     EC 4.2.1.10  3-Dehydroquinate Dehydratase                  aroD
4     EC 1.1.1.25  Shikimate 5-Dehydrogenase                     aroE, ydiB
5     EC 2.7.1.71  Bacterial Shikimate Kinase                    aroK, aroL
6     EC 2.5.1.19  Phosphoshikimate 1-Carboxyvinyl Transferase   aroA
7     EC 4.6.1.4   Chorismate Synthase                           aroC
5'    ?            Archaeal Shikimate Kinase                     hypothetical

Organism                        1                             2            3                 4                   5                  6         7         5'
Escherichia coli                REC01661, REC05569, REC00721  REC05984     REC01650          REC05912, REC01649  REC05985, REC0372  REC00874  REC05421  -
Helicobacter pylori             RHP01088                      RHP01230     RHP00443          RHP00644            RHP01111           RHP01338  RHP00097  -
Thermotoga maritima             RTM00236                      RTM00229     RTM00228          RTM00231            RTM00229           RTM00232  RTM00230  -
Bacillus subtilis               RBS02969                      RBS02266     RBS2304, RBS2442  RBS02559            RBS00316           RBS02256  RBS02267  -
Clostridium acetobutylicum      RCA01750                      RCA01752     RCA01757          RCA01755            RCA01756           RCA01753  RCA01754  -
Streptococcus pneumoniae        RPN00965-66                   RPN00386     RPN00384          RPN00385            RPN00391           RPN00390  RPN00387  -
Saccharomyces cerevisiae        RSC01644, RSC08655            RSC06906     RSC06906          RSC06906            RSC06906           RSC06906  RSC05895  -
Methanococcus jannaschii        -                             -            RMJ05308          RMJ00483            -                  RMJ00806  RMJ07769  RMJ7785
Archaeoglobus fulgidus          -                             -            RAG18799          RAG27692            -                  RAG27692  RAG50410  RAG45918
Methanobacter. thermoautotrop.  -                             -            RTH01640          RTH01082            -                  RTH02023  RTH00020  RTH01890
Aeropyrum pernix                RAP00399                      RAP00398     RAP00397          RAP00396            -                  RAP00394  RAP00393  RAP00395
Pyrococcus furiosus             RPF01413                      RPF01411-12  RPF01410          RPF01409            -                  RPF01402  RPF01401  RPF01407-08
Pyrococcus abyssi               RPO01190                      RPO01191     RPO01192          RPO01193            -                  RPO01200  RPO01201  RPO01194

RSC06906: Pentafunctional Enzyme (steps 2 to 6)
Evidence legend: Clustering, Fusion, Occurrence
Example II: “Missing Drug Target” in S.pneumoniae
acpP
fabD
accA accD accB
accC
fabH fabF
fabG
fabZ
fabI
Gene fabI of Enoyl-ACP reductase (EC 1.3.1.9) is missing in a number of
Streptococci
Clustering of FAB Genes: Prediction
[Figure: chromosomal neighborhoods of fab genes in Genome X, Genome Y, Clostridium acetobutylicum, Streptococcus pyogenes and Escherichia coli. Gene labels: fabH, acpP, fabD, fabG, fabF, fabZ, fabI, accA, accB, accC, accD. Flanking labels: TR?, hyp, ?, FRNS, PLSX, L32P, g30k, MAF, EC 4…, 6.3.4.15, 3.5.1.?, 2.1.1.79, 5.99.1.2, 2.7.4.9, 2.7.7.7.]
A conserved hypothetical FMN-binding protein “?” is the best candidate for the
missing gene fabI in Gram-positive cocci
13 July 2000
Nature 406, 145 - 146 (2000) © Macmillan Publishers Ltd.
Microbiology:
A triclosan-resistant bacterial enzyme
RICHARD J. HEATH AND CHARLES O. ROCK
Triclosan is an antimicrobial agent that is widely used in a variety of consumer products and acts by inhibiting one of the highly conserved enzymes (enoyl-ACP reductase, or FabI) of bacterial fatty-acid biosynthesis. But several key pathogenic bacteria do not possess FabI, and here we describe a unique triclosan-resistant flavoprotein, FabK, that can also catalyse this reaction in Streptococcus pneumoniae. Our finding has implications for the development of FabI-specific inhibitors as antibacterial agents.
Independent Experimental Verification
Missing genes, examples in cofactor pathways
prediction and experimental verification
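Columns: Functional Role; E.C.#; Pathways; Key evidence; Experimental Verification

KYNURENINE FORMAMIDASE*
  E.C.#: 3.5.1.9
  Pathways: NAD/NADP (gram+, gram-)
  Key evidence: Operon
  Experimental verification: Kurnasov et al., in press
RIBOSYLNICOTINAMIDE KINASE*
  E.C.#: 2.7.1.22
  Pathways: NAD/NADP (gram-)
  Key evidence: Operon/Fusion
  Experimental verification: Kurnasov et al. 2002; Singh et al. 2002 (3D)
NaMN ADENYLYLTRANSFERASE*
  E.C.#: 2.7.7.18
  Pathways: NAD/NADP (gram+, gram-)
  Key evidence: Operon
  Experimental verification: Zhang et al. 2002 (3D)
DUAL SPECIFICITY NMN/NaMN ADENYLYLTRANSFERASE
  E.C.#: 2.7.7.1 / 2.7.7.18
  Pathways: NAD/NADP (human, fungi)
  Key evidence: Projection from bacteria
  Experimental verification: Zhou et al. 2002 (3D); Zhang et al. 2003 (3D)
NAD SYNTHASE, GLN-DEAMIDASE SUBUNIT*
  E.C.#: 3.5.1.-
  Pathways: NAD/NADP B A (deep branched)
  Key evidence: Operon/Fusion
  Experimental verification: Shatalin et al., in progress (3D)
BIFUNCTIONAL PANTETHEINE-PHOSPHATE ADENYLYLTRANSFERASE/dpCoA KINASE
  E.C.#: 2.7.7.3 / 2.7.1.24
  Pathways: COENZYME A (human)
  Key evidence: Fusion
  Experimental verification: Daugherty et al. 2002
PANTETHEINE-PHOSPHATE ADENYLYLTRANSFERASE
  E.C.#: 2.7.7.3
  Pathways: COENZYME A E A (fungi, plants)
  Key evidence: Projection from human
  Experimental verification: Daugherty et al. 2002
MONOFUNCTIONAL PHOSPHO-PANTOTHENOYLCYSTEINE LIGASE
  E.C.#: 6.3.2.5
  Pathways: COENZYME A (human)
  Key evidence: Projection from bacteria
  Experimental verification: Daugherty et al. 2002
PANTOTHENATE KINASE
  E.C.#: 2.7.1.33
  Pathways: COENZYME A
  Key evidence: Operon
MONOFUNCTIONAL PYRIMIDINE DEAMINASE
  E.C.#: 3.5.4.26
  Pathways: FMN/FAD
  Key evidence: Operon
EXTENDED FAD SYNTHASE
  E.C.#: 2.7.7.2
  Pathways: FMN/FAD (human)
  Key evidence: Projection from yeast
  Experimental verification: Mseeh et al., unpublished
RIBOFLAVIN TRANSPORTER
  Pathways: FMN/FAD (gram+)
  Key evidence: Regulon

Missing/Found in (row alignment lost): A, E, B, E, E, E, A, B, B, B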
The Leucine Degradation Cluster:
Origin of a New Perspective on Uses of Clusters
Leu
  deamination, oxidation
Isovaleryl-CoA
  Isovaleryl-CoA dehydrogenase (EC 1.3.99.10)
Methylcrotonoyl-CoA
  Methylcrotonoyl-CoA carboxylase (EC 6.4.1.4): carboxylase subunit, biotin-containing subunit
Methylglutaconyl-CoA
  Methylglutaconyl-CoA hydratase (EC 4.2.1.18)
HMG-CoA
  4.1.3.4
Acetyl-CoA + Acetoacetate
  6.2.1.16 (acting on acetoacetate)
Context-based enrichment of initial functional assignments
example from Brucella melitensis genome analysis
E.C. No     Functional role                    Gene ID   No. in cluster
1.3.99.10   ISOVALERYL-COA DEHYDROGENASE       BR0020    11
6.4.1.4     METHYLCROTONYL-COA CARBOXYLASE
              - Biotin-containing subunit      BR0018    3
              - Carboxylase subunit            BR0019    4
4.2.1.18    METHYLGLUTACONYL-COA HYDRATASE     BR0016    2
------------------------------------------------------------------------
4.1.3.4     HYDROXYMETHYLGLUTARYL-COA LYASE    BR0017*   6
6.2.1.16    ACETOACETATE-COA LIGASE            BR0021    5

Gene map labels: BR0017*, BR0021, BR0016, BR0018, BR0019, BR0020
TIGR annotations (row alignment lost): specific, non-specific*, specific, non-specific*, non-specific*, frameshift
* Biotin carboxylase; Carboxyl transferase family subunit; Enoyl-CoA hydratase/isomerase family
No gene assigned in any organism in KEGG, NCBI, TIGR
Gene assigned in B. melitensis 2003 (IG)
Gene assignment propagated over 26 organisms using gene clustering
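Propagating an assignment by gene clustering can be sketched as follows: a candidate homolog in another genome is accepted only if it lies close to a gene already assigned to the same pathway (the anchor). This is a minimal illustration under stated assumptions, not the procedure actually used; the `Gene` record, the `propagate` function, the `max_gap` threshold and all identifiers in the example are hypothetical.

```python
from typing import Dict, List, NamedTuple


class Gene(NamedTuple):
    gene_id: str
    contig: str
    index: int  # ordinal position of the gene along its contig


def propagate(candidates: Dict[str, List[Gene]],
              anchors: Dict[str, List[Gene]],
              max_gap: int = 5) -> Dict[str, str]:
    """Return genome -> candidate gene id, keeping only candidates clustered with an anchor."""
    accepted: Dict[str, str] = {}
    for genome, genome_candidates in candidates.items():
        for cand in genome_candidates:
            clustered = any(
                anchor.contig == cand.contig and abs(anchor.index - cand.index) <= max_gap
                for anchor in anchors.get(genome, [])
            )
            if clustered:
                accepted[genome] = cand.gene_id
                break  # first clustered candidate wins in this sketch
    return accepted


# Hypothetical data: org1's candidate sits 3 genes from the anchor, org2's is far away.
candidates = {
    "org1": [Gene("org1_0017", "c1", 17)],
    "org2": [Gene("org2_0400", "c1", 400)],
}
anchors = {
    "org1": [Gene("org1_0020", "c1", 20)],
    "org2": [Gene("org2_0020", "c1", 20)],
}
print(propagate(candidates, anchors))
```

Similarity alone would have proposed both candidates; requiring chromosomal proximity to the anchor keeps only the one that is supported by clustering.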
Leucine degradation in Bacilli
158 new assignments
Organism Gene anchor Clustered genes
Gene cluster in B. subtilis
NCBI                                            PIR
similar to butyryl-CoA dehydrogenase            butyryl-CoA dehydrogenase homolog yngJ
similar to long-chain acyl-CoA synthetase       probable acid-CoA ligase (EC 6.2.1.-) yngI
similar to biotin carboxylase                   biotin carboxylase homolog yngH
gene not called                                 gene not called
similar to hydroxymethylglutaryl-CoA lyase      hydroxymethylglutaryl-CoA lyase homolog yngG
similar to 3-hydroxybutyryl-CoA dehydratase     probable enoyl-CoA hydratase (EC 4.2.1.17) yngF
similar to propionyl-CoA carboxylase            propionyl-CoA carboxylase homolog yngE
Leucine degradation in Bacilli
E.C. No     Functional role                    No. in cluster
1.3.99.10   ISOVALERYL-COA DEHYDROGENASE       2
6.4.1.4     METHYLCROTONYL-COA CARBOXYLASE
              - BIOTIN CONTAINING SUBUNIT      3
              - CARBOXYLASE SUBUNIT            1
              BIOTIN CARBOXYL CARRIER          7
4.2.1.18    METHYLGLUTACONYL-COA HYDRATASE     4
--------------------------------------------------------------
4.1.3.4     HYDROXYMETHYLGLUTARYL-COA LYASE    5
6.2.1.16    ACETOACETATE-COA LIGASE            6
6.2.1.16*   ACETOACETATE-COA LIGASE*           14
?
EC 1.3.99.10 6.3.4.14 4.2.1.18 4.1.3.4 6.2.1.16 6.2.1.3
# in cluster 3 7 4 4 1 2 6 5 14
Functional role
Isovaleryl-CoA dehydrogenase
Biotin carboxyl carrier
Biotin carboxylase
Methylglutaconyl-CoA hydratase
Hydroxymethylglutaryl-CoA lyase
Acetoacetate-CoA ligase
Long-chain-fatty-acid-CoA ligase
subunit
biotin-containing subunit
carboxylase subunit
Bru. meli 1909 1911 1910 1913 1914 1912 1907 1908 1620
Bru. abor. 1089 1087 1088 1085 1086 1090 1471
Bac. anth. 2343 2345 2344 2348 2347 2346 2349 4397
Bac. cere. 2373 2375 2374 2378 2377 2376 2379 4413
Bac. halo. 1171 1174 1173 1177 1176 1175 1178 3179 3178 1172
Bac. subt. 1826 not called 1824 1821 1822 1823 1825 2856
Oce. ihey. 1695 1697 1696 1699 1698 3230 4343 504 2122
Cau. cres. 2243 2234 2241 2240 476 3724 979
6.4.1.4
Methylcrotonyl-CoA carboxylase
Listeria
Clostridia
Ralstonia
Shew.
Xylella
1 Cell division protein mraZ
3 S-adenosyl-methyltransferase mraW (EC 2.1.1.-)
4 Cell division protein ftsI
2 UDP-N-acetylmuramoylalanine--D-glutamate ligase (EC 6.3.2.9)
2 UDP-N-acetylmuramoylalanyl-D-glutamate--2,6-diaminopimelate ligase (EC 6.3.2.13)
5 Phospho-N-acetylmuramoyl-pentapeptide-transferase (EC 2.7.8.13)
Brevibacter
Enterococcus
Brucella
Geobacter
1 Phospho-N-acetylmuramoyl-pentapeptide-transferase (EC 2.7.8.13)
2 UDP-N-acetylmuramoylalanine--D-glutamate ligase (EC 6.3.2.9)
6 Cell division protein ftsW
5 UDP-N-acetylglucosamine--N-acetylmuramyl-(pentapeptide) pyrophosphoryl-undecaprenol N-acetylglucosamine transferase (EC 2.4.1.227)
2 UDP-N-acetylmuramate--alanine ligase (EC 6.3.2.8)
9 Cell division protein ftsZ
11 UDP-N-acetylenolpyruvoylglucosamine reductase (EC 1.1.1.158)
2 D-alanine--D-alanine ligase (EC 6.3.2.4)
Bacteroides thetaiotaomicron
Bacillus cereus
Geobacter metallireducens
Buchnera
5 Cell division protein ftsW
1 UDP-N-acetylglucosamine--N-acetylmuramyl-(pentapeptide) pyrophosphoryl-undecaprenol N-acetylglucosamine transferase (EC 2.4.1.227)
2 UDP-N-acetylmuramoylalanine--D-glutamate ligase (EC 6.3.2.9)
8 UDP-N-acetylenolpyruvoylglucosamine reductase (EC 1.1.1.158)
9 Cell division protein ftsQ
2 UDP-N-acetylmuramoylalanyl-D-glutamate--2,6-diaminopimelate ligase (EC 6.3.2.13)
3 Cell division protein ftsA
6 Cell division protein ftsZ
Oceanobacillus iheyensis
Enterococcus faecium DO
Escherichia coli K12
Wigglesworthia brevipalpis
2 Cell division protein ftsA
1 Cell division protein ftsZ
8 Hypothetical protein
10 Hypothetical protein
12 RNA binding protein
7 UDP-3-O-[3-hydroxymyristoyl] N-acetylglucosamine deacetylase (EC 3.5.1.-)
13 Protein translocase subunit secA
The Project: Annotate 1000 Genomes in
Three Years
• By making the task concrete, we force engineering decisions
• It will be easier to annotate 1000 genomes well than to annotate 50 well (comparative analysis is the key)
• Analysis by subsystem (rather than by genome) is clearly the key
• The use of clusters is the key to precise annotation of subsystems
Annotation by Subsystem
• Requires knowledge of known variants
• Evolution of clusters plays a major role
• There are three components of the task:
  – Building tools to support analysis
  – Actually doing the analysis on 30-50 subsystems
  – Coordinating with groups doing a limited set of wet lab confirmations
FIG: Building the Initial Annotation Tools
• Releasing the browser/curation tool with approximately 220-230 genomes within a few months
• Peer-to-peer updates/synchronization
• Open source and free (initially for Macs and Linux systems)