Download - Exploiting Gene Clusters to Curate Annotations October, 2003 Ross Overbeek, Fellowship for Interpretation of Genomes (FIG)

Exploiting Gene Clusters to Curate Annotations

October, 2003

Ross Overbeek, Fellowship for Interpretation of Genomes (FIG)

Outline of the Talk

• The Emerging Opportunity

• The Use of Clusters to Find “Missing Genes”

• Experiences with a Single Pathway

• “The Project”

• Tools Needed to Support the Project

Three “Laws”

The amount of available DNA sequence data will double every 18 months

The number of available genomes will double every 18 months

The cost of sequence will drop by a factor of 2 every 18 months.

Basic Facts

We have about 230-250 publicly available more-or-less complete genomes

We will have about 1000 complete genomes within 3 years

This will lead to better annotations, not worse

The majority of annotations will need to be automated, and the process must accurately follow the steps that a human expert would take
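The step from roughly 240 genomes to about 1000 within three years follows from the 18-month doubling "law". A minimal sketch of that arithmetic, assuming a starting count of 240 (the midpoint of the 230-250 range above); the function name `projected_count` is illustrative only:

```python
def projected_count(current: float, months: float, doubling_months: float = 18.0) -> float:
    """Project a quantity that doubles every `doubling_months` months."""
    return current * 2 ** (months / doubling_months)


# Three years is 36 months, i.e. two doubling periods of 18 months.
genomes_in_three_years = projected_count(240, 36)
print(genomes_in_three_years)  # 960.0
```

Two doublings give a factor of four, which is where the figure of about 1000 genomes comes from.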

The Use of Clusters to Find Missing Genes

Central Machinery of Life: Horizons of gene discovery

• 3,000 - 4,000 functional roles (300 - 3,000 per organism)

• Largely conserved across the three kingdoms (sequences; functions; pathways)

• “Missing genes” are still there

Missing genes in metabolic pathways: making a case

[Figure: a pathway from A to F catalyzed by enzymes E1, E2, and E3, encoded by gene A, gene B (?), and gene C; gene B is the missing gene.]

Globally Missing Gene (never identified in any species)

Functional context/neighborhood, Pathway A --> F:

Enzyme | Protein family | genome 1 | genome 2 | genome 3 | genome 4 | genome 5
E1     | A              | A1       | A2       | A3       | A4       | A5
E2     | B = ?          | ?        | ?        | ?        | ?        | ?
E3     | C              | C1       | C2       | C3       | C4       | C5

Locally Missing Gene (non-orthologous gene displacement)

Functional context/neighborhood, Pathway A --> F:

Enzyme | Protein family | genome 1 | genome 2 | genome 3 | genome 4 | genome 5
E1     | A              | A1       | A2       | A3       | A4       | A5
E2     | B = ?          | ?        | ?        | ?        | B4       | B5
E3     | C              | C1       | C2       | C3       | C4       | C5
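The distinction between a globally and a locally missing gene can be stated as a small rule over one row of the table. The sketch below is illustrative only: it assumes the pathway itself is present in every genome listed, and the function name `classify_role` and the use of `None` for "?" are choices made here, not part of the talk.

```python
from typing import List, Optional


def classify_role(genes_per_genome: List[Optional[str]]) -> str:
    """Classify a functional role from its row in the occurrence table.

    Each entry is the gene assigned to the role in one genome, or None
    where the table shows "?".
    """
    found = [gene for gene in genes_per_genome if gene is not None]
    if not found:
        return "globally missing"  # never identified in any genome
    if len(found) < len(genes_per_genome):
        return "locally missing"   # known in some genomes, absent in others
    return "complete"


family_a = ["A1", "A2", "A3", "A4", "A5"]
family_b_global = [None, None, None, None, None]
family_b_local = [None, None, None, "B4", "B5"]

print(classify_role(family_a))         # complete
print(classify_role(family_b_global))  # globally missing
print(classify_role(family_b_local))   # locally missing
```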

Techniques of genome context analysis (I): checking neighbors

GENE CLUSTERING ON THE CHROMOSOME (OPERONS)

[Figure: genes drawn along the chromosome in three genomes]

GENOME 1: gene A1, gene C1, gene R1, gene T1, gene G1, gene X1

GENOME 2: gene A2, gene M2, gene X2, gene N2, gene C2, gene Y2

GENOME 3: gene A3, gene S3, gene U3, gene X3, gene Y3, gene Q3
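Checking neighbors amounts to asking which pairs of gene families sit close together on the chromosome in more than one genome. The sketch below is a minimal illustration under stated assumptions: each genome is reduced to an ordered list of family labels, the gene orders are hypothetical (they are not the orders drawn on the slide), and `conserved_neighbor_pairs`, `window`, and `min_genomes` are names introduced here.

```python
from collections import Counter
from typing import Dict, List, Tuple


def conserved_neighbor_pairs(
    genomes: Dict[str, List[str]], window: int = 1, min_genomes: int = 2
) -> Dict[Tuple[str, str], int]:
    """Count family pairs lying within `window` genes of each other,
    keeping pairs seen in at least `min_genomes` genomes."""
    counts: Counter = Counter()
    for order in genomes.values():
        pairs_here = set()
        for i, family in enumerate(order):
            for other in order[i + 1 : i + 1 + window]:
                if other != family:
                    pairs_here.add(tuple(sorted((family, other))))
        counts.update(pairs_here)  # each genome contributes a pair at most once
    return {pair: n for pair, n in counts.items() if n >= min_genomes}


# Hypothetical gene orders, one list of family labels per genome.
toy_genomes = {
    "g1": ["A", "C", "R", "X"],
    "g2": ["M", "A", "C", "X"],
    "g3": ["A", "S", "X", "C"],
}
neighbor_pairs = conserved_neighbor_pairs(toy_genomes)
print(neighbor_pairs)
```

With adjacent genes only (`window=1`), the pairs A/C and C/X each recur in two of the three toy genomes, which is the kind of signal that suggests functional coupling.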

Techniques of genome context analysis (II): checking connections

PROTEIN FUSION EVENTS

GENOME 1: gene A1, gene C1

GENOME 3: gene C3 / Z3, gene A3

GENOME 4: gene A4 / X4, gene C4

GENOME 5: gene C5 / A5
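A fusion event links two families whenever they appear as parts of a single protein in some genome. The sketch below encodes the four genomes above as lists of genes, each gene being a tuple of the families it contains; this encoding and the name `fusion_links` are assumptions made for illustration.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def fusion_links(genomes: Dict[str, List[Tuple[str, ...]]]) -> Dict[Tuple[str, str], List[str]]:
    """Map each pair of families found fused in one gene to the genomes showing the fusion."""
    links: Dict[Tuple[str, str], List[str]] = {}
    for name, genes in genomes.items():
        for gene in genes:
            for pair in combinations(sorted(set(gene)), 2):
                links.setdefault(pair, []).append(name)
    return links


# Genes from the slide, written as tuples of family labels.
slide_genomes = {
    "GENOME 1": [("A",), ("C",)],
    "GENOME 3": [("C", "Z"), ("A",)],
    "GENOME 4": [("A", "X"), ("C",)],
    "GENOME 5": [("C", "A")],
}
links = fusion_links(slide_genomes)
print(links)
```

Here the C5 / A5 fusion in GENOME 5 connects families A and C, which are separate genes in GENOME 1.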

Techniques of genome context analysis (III): co-regulation

SHARED REGULATORY SITES (REGULONS)

GENOME 1: gene A1, gene C1, gene R1, gene T1, gene X1

GENOME 5: gene C5 / A5, gene R5, gene X5

GENOME 2: gene A2, gene W2, gene C2

Techniques of genome context analysis (IV): co-evolution

OCCURRENCE PROFILES

Genome    | Group     | A  | C  | I   | X  | H  | G  | W   | Y   | Z
GENOME 1  | IN-GROUP  | A1 | C1 | I1  | X1 | H1 | G1 | W1  | Y1  | Z1
GENOME 2  | IN-GROUP  | A2 | C2 | I2  | X2 | H2 | G2 | W2  | Y2  | -
GENOME 3  | IN-GROUP  | A3 | C3 | I3  | X3 | H3 | G3 | W3  | Y3  | Z3
GENOME 4  | IN-GROUP  | A4 | C4 | I4  | X4 | H4 | -  | W4  | -   | -
GENOME 5  | IN-GROUP  | A5 | C5 | I5  | X5 | H5 | G5 | W5  | -   | -
GENOME 6  | OUT-GROUP | -  | -  | I6  | -  | H6 | -  | W6  | Y6  | Z6
GENOME 7  | OUT-GROUP | -  | -  | I7  | -  | H7 | G7 | W7  | -   | Z7
GENOME 8  | OUT-GROUP | -  | -  | I8  | -  | H8 | -  | W8  | Y8  | -
GENOME 9  | OUT-GROUP | -  | -  | I9  | -  | H9 | -  | W9  | Y9  | Z9
GENOME 10 | OUT-GROUP | -  | -  | I10 | -  | -  | -  | W10 | Y10 | Z10
Score     |           | 10 | 10 | 5   | 10 | 6  | 8  | 4   | 4   | 3
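One plausible way to score an occurrence profile is to count the genomes in which a candidate gene's presence or absence agrees with that of a reference gene. This is only a sketch of the idea: the slide does not state its scoring rule, and the profiles below are hypothetical six-genome examples rather than the table above.

```python
from typing import List


def profile_agreement(reference: List[bool], candidate: List[bool]) -> int:
    """Number of genomes where presence/absence of the candidate matches the reference."""
    if len(reference) != len(candidate):
        raise ValueError("profiles must cover the same genomes")
    return sum(r == c for r, c in zip(reference, candidate))


# Hypothetical profiles: True = gene present in that genome.
ref_profile = [True, True, True, False, False, False]
same_profile = [True, True, True, False, False, False]
ubiquitous = [True, True, True, True, True, True]
partial = [True, True, False, False, False, True]

print(profile_agreement(ref_profile, same_profile))  # 6
print(profile_agreement(ref_profile, ubiquitous))    # 3
print(profile_agreement(ref_profile, partial))       # 4
```

A gene found in every genome scores no better than half, since it agrees with the reference only where the reference is present; genes that track the in-group/out-group split score highest.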

Example I: Chorismate Pathway

Missing gene case: primary suspects

[Figure: chorismate biosynthesis pathway, with the structures of shikimate and shikimate-5-P]

D-Erythrose 4-P + Phosphoenolpyruvate
  --> step 1 (aroH, aroF, aroG): 7P-2-Dehydro-3-deoxy-D-arabino-heptulosonate
  --> step 2 (aroB): 3-Dehydro-Quinate
  --> step 3 (aroD): 3-Dehydro-Shikimate
  --> step 4 (aroE, ydiB; EC 1.1.1.25): Shikimate
  --> step 5 (aroK, aroL; Shikimate Kinase): Shikimate-5-P
  --> step 6 (aroA): O5-(1-Carboxyvinyl)-3-P-Shikimate
  --> step 7 (aroC): Chorismate

Chorismate feeds Trp, Phe, and Tyr syntheses, chorismate catabolism, and isochorismate anabolism.

Missing gene in all archaea: ?? (step 5, Shikimate Kinase)

Two fusion proteins are marked in the figure.

Chromosomal Clustering: Prediction

Functional coupling in chorismate pathway

Step | EC       | Enzyme                                      | Genes
1    | 4.1.2.15 | 2-Dehydro-3-Deoxyphosphoheptonate Aldolase  | aroH, aroG, aroF
2    | 4.6.1.3  | 3-Dehydroquinate Synthase                   | aroB
3    | 4.2.1.10 | 3-Dehydroquinate Dehydratase                | aroD
4    | 1.1.1.25 | Shikimate 5-Dehydrogenase                   | aroE, ydiB
5    | 2.7.1.71 | Bacterial Shikimate Kinase                  | aroK, aroL
6    | 2.5.1.19 | Phosphoshikimate 1-Carboxyvinyl Transferase | aroA
7    | 4.6.1.4  | Chorismate Synthase                         | aroC
5'   | ?        | Archaeal Shikimate Kinase                   | hypothetical

Organism | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 5'
Escherichia coli | REC01661, REC05569, REC00721 | REC05984 | REC01650 | REC05912, REC01649 | REC05985, REC0372 | REC00874 | REC05421 | -
Helicobacter pylori | RHP01088 | RHP01230 | RHP00443 | RHP00644 | RHP01111 | RHP01338 | RHP00097 | -
Thermotoga maritima | RTM00236 | RTM00229 | RTM00228 | RTM00231 | RTM00229 | RTM00232 | RTM00230 | -
Bacillus subtilis | RBS02969 | RBS02266 | RBS2304, RBS2442 | RBS02559 | RBS00316 | RBS02256 | RBS02267 | -
Clostridium acetobutylicum | RCA01750 | RCA01752 | RCA01757 | RCA01755 | RCA01756 | RCA01753 | RCA01754 | -
Streptococcus pneumoniae | RPN00965-66 | RPN00386 | RPN00384 | RPN00385 | RPN00391 | RPN00390 | RPN00387 | -
Saccharomyces cerevisiae | RSC01644, RSC08655 | RSC06906 (Pentafunctional Enzyme, steps 2-6) | | | | | RSC05895 | -
Methanococcus jannaschii | - | - | RMJ05308 | RMJ00483 | - | RMJ00806 | RMJ07769 | RMJ7785
Archaeoglobus fulgidus | - | - | RAG18799 | RAG27692 | - | RAG27692 | RAG50410 | RAG45918
Methanobacterium thermoautotrophicum | - | - | RTH01640 | RTH01082 | - | RTH02023 | RTH00020 | RTH01890
Aeropyrum pernix | RAP00399 | RAP00398 | RAP00397 | RAP00396 | - | RAP00394 | RAP00393 | RAP00395
Pyrococcus furiosus | RPF01413 | RPF01411-12 | RPF01410 | RPF01409 | - | RPF01402 | RPF01401 | RPF01407-08
Pyrococcus abyssi | RPO01190 | RPO01191 | RPO01192 | RPO01193 | - | RPO01200 | RPO01201 | RPO01194

Legend (cell shading in the original slide): Clustering, Fusion, Occurrence
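The clustering signal in the table can be read off the gene identifiers themselves when consecutive numbers correspond to neighboring genes. The sketch below groups a row's identifiers into runs; it assumes, without confirmation from the slides, that the numeric suffix reflects position on the chromosome, and `clusters_by_proximity` and `max_gap` are names chosen here.

```python
import re
from typing import List


def numeric_index(gene_id: str) -> int:
    """Extract the trailing number of an identifier such as 'RCA01750'."""
    match = re.search(r"(\d+)$", gene_id)
    if match is None:
        raise ValueError(f"no numeric suffix in {gene_id!r}")
    return int(match.group(1))


def clusters_by_proximity(gene_ids: List[str], max_gap: int = 5) -> List[List[str]]:
    """Group identifiers whose numeric suffixes differ by at most `max_gap`.

    Assumption: the suffix reflects chromosomal order. Singletons are dropped.
    """
    ordered = sorted(set(gene_ids), key=numeric_index)
    clusters: List[List[str]] = []
    for gene_id in ordered:
        if clusters and numeric_index(gene_id) - numeric_index(clusters[-1][-1]) <= max_gap:
            clusters[-1].append(gene_id)
        else:
            clusters.append([gene_id])
    return [cluster for cluster in clusters if len(cluster) > 1]


c_acetobutylicum = ["RCA01750", "RCA01752", "RCA01757", "RCA01755",
                    "RCA01756", "RCA01753", "RCA01754"]
h_pylori = ["RHP01088", "RHP01230", "RHP00443", "RHP00644",
            "RHP01111", "RHP01338", "RHP00097"]

print(clusters_by_proximity(c_acetobutylicum))  # one cluster of seven genes
print(clusters_by_proximity(h_pylori))          # []
```

Under that assumption all seven Clostridium acetobutylicum genes fall into a single run, while the Helicobacter pylori genes are scattered.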

Example II: “Missing Drug Target” in S. pneumoniae

[Figure: fatty acid biosynthesis genes acpP, fabD, accA, accD, accB, accC, fabH, fabF, fabG, fabZ, fabI]

The gene fabI, encoding enoyl-ACP reductase (EC 1.3.1.9), is missing in a number of Streptococci.

Clustering of FAB Genes: Prediction

[Figure: FAB gene clusters in Genome X, Genome Y, Clostridium acetobutylicum, Streptococcus pyogenes, and Escherichia coli]

Gene orders shown in the figure:

fabH acpP ? fabG fabF accA accD accC accB fabZ fabD

fabH fabD fabG acpP fabF

fabG fabF accB fabD accA accD accC fabZ

fabG fabF accA accD accC accB fabZ fabD fabH acpP

fabH acpP fabG fabF accA accD accC accB fabZ fabD

Neighboring genes labeled in the figure: TR?, ?, hyp, fabI, 6.3.4.15, 3.5.1.?, 2.1.1.79, FRNS, 5.99.1.2, EC 4…, PLSX, L32P, g30k, MAF, 2.7.4.9, 2.7.7.7

A conserved hypothetical FMN-binding protein “?” is the best candidate for the missing gene fabI in Gram-positive cocci.
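The argument for the candidate can be sketched as a ranking: among genomes that lack the known gene for a role, count how often each unassigned family turns up next to the pathway cluster. The genome names, family labels, and the function `rank_candidates` below are all hypothetical placeholders, not data from the slides.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def rank_candidates(
    cluster_neighbors: Dict[str, Set[str]], genomes_lacking_role: Set[str]
) -> List[Tuple[str, int]]:
    """Rank unassigned families by how many role-lacking genomes show them in the cluster."""
    counts: Counter = Counter()
    for genome, families in cluster_neighbors.items():
        if genome in genomes_lacking_role:
            counts.update(families)
    # Sort by descending count, then by name so ties are ordered deterministically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical input: unassigned families seen beside the pathway cluster.
toy_neighbors = {
    "genome_1": {"hyp_FMN_binding", "regulator"},
    "genome_2": {"hyp_FMN_binding"},
    "genome_3": {"regulator"},  # has the known gene, so it is not counted
}
ranked = rank_candidates(toy_neighbors, {"genome_1", "genome_2"})
print(ranked)  # [('hyp_FMN_binding', 2), ('regulator', 1)]
```

A top-ranked family is a hypothesis to test in the lab, not an annotation by itself.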

Independent Experimental Verification

Nature 406, 145 - 146 (13 July 2000) © Macmillan Publishers Ltd.

Microbiology: A triclosan-resistant bacterial enzyme

RICHARD J. HEATH AND CHARLES O. ROCK

Triclosan is an antimicrobial agent that is widely used in a variety of consumer products and acts by inhibiting one of the highly conserved enzymes (enoyl-ACP reductase, or FabI) of bacterial fatty-acid biosynthesis. But several key pathogenic bacteria do not possess FabI, and here we describe a unique triclosan-resistant flavoprotein, FabK, that can also catalyse this reaction in Streptococcus pneumoniae. Our finding has implications for the development of FabI-specific inhibitors as antibacterial agents.

http://www.nature.com/UNKNOWN/

Missing genes, examples in cofactor pathways: prediction and experimental verification

Functional Role | E.C.# | Pathways | Key evidence | Experimental Verification
KYNURENINE FORMAMIDASE* | 3.5.1.9 | NAD/NADP (gram+, gram-) | Operon | Kurnasov et al., in press
RIBOSYLNICOTINAMIDE KINASE* | 2.7.1.22 | NAD/NADP (gram-) | Operon/Fusion | Kurnasov et al. 2002; Singh et al. 2002 (3D)
NaMN ADENYLYLTRANSFERASE* | 2.7.7.18 | NAD/NADP (gram+, gram-) | Operon | Zhang et al. 2002 (3D)
DUAL SPECIFICITY NMN/NaMN ADENYLYLTRANSFERASE | 2.7.7.1 / 2.7.7.18 | NAD/NADP (human, fungi) | Projection from bacteria | Zhou et al. 2002 (3D); Zhang et al. 2003 (3D)
NAD SYNTHASE, GLN-DEAMIDASE SUBUNIT* | 3.5.1.- | NAD/NADP B A (deep branched) | Operon/Fusion | Shatalin et al., in progress (3D)
BIFUNCTIONAL PANTETHEINE-PHOSPHATE ADENYLYLTRANSFERASE/dpCoA KINASE | 2.7.7.3 / 2.7.1.24 | COENZYME A (human) | Fusion | Daugherty et al., 2002
PANTETHEINE-PHOSPHATE ADENYLYLTRANSFERASE | 2.7.7.3 | COENZYME A E A (fungi, plants) | Projection from human | Daugherty et al., 2002
MONOFUNCTIONAL PHOSPHOPANTOTHENOYLCYSTEINE LIGASE | 6.3.2.5 | COENZYME A (human) | Projection from bacteria | Daugherty et al., 2002
PANTOTHENATE KINASE | 2.7.1.33 | COENZYME A | Operon |
MONOFUNCTIONAL PYRIMIDINE DEAMINASE | 3.5.4.26 | FMN/FAD | Operon |
EXTENDED FAD SYNTHASE | 2.7.7.2 | FMN/FAD (human) | Projection from yeast | Mseeh et al., unpublished
RIBOFLAVIN TRANSPORTER | | FMN/FAD (gram+) | Regulon |

Missing/Found in: rows are marked B, A, or E in the original slide.

The Leucine Degradation Cluster: Origin of a New Perspective on Uses of Clusters

[Figure: leucine degradation pathway]

Leu
  --> (deamination, oxidation) Isovaleryl-CoA
  --> Isovaleryl-CoA dehydrogenase (EC 1.3.99.10): Methylcrotonoyl-CoA
  --> Methylcrotonoyl-CoA carboxylase (EC 6.4.1.4; carboxylase subunit and biotin-containing subunit): Methylglutaconyl-CoA
  --> Methylglutaconyl-CoA hydratase (EC 4.2.1.18): HMG-CoA
  --> 4.1.3.4: Acetyl-CoA + Acetoacetate

6.2.1.16 acts on Acetoacetate.

Context-based enrichment of initial functional assignments: example from Brucella melitensis genome analysis

E.C. No   | Functional role                  | Gene ID | No. in cluster
1.3.99.10 | ISOVALERYL-COA DEHYDROGENASE     | BR0020  | 11
6.4.1.4   | METHYLCROTONYL-COA CARBOXYLASE   |         |
          | - Biotin-containing subunit      | BR0018  | 3
          | - Carboxylase subunit            | BR0019  | 4
4.2.1.18  | METHYLGLUTACONYL-COA HYDRATASE   | BR0016  | 2
--------------------------------------------------------------
4.1.3.4   | HYDROXYMETHYLGLUTARYL-COA LYASE  | BR0017* | 6
6.2.1.16  | ACETOACETATE-COA LIGASE          | BR0021  | 5

[Figure: gene cluster containing BR0016, BR0017*, BR0018, BR0019, BR0020, and BR0021, with the TIGR annotation of each gene marked as specific (two genes), non-specific* (three genes), or frameshift (one gene)]

* Biotin carboxylase; Carboxyl transferase family subunit; Enoyl-CoA hydratase/isomerase family

No gene assigned in any organism in KEGG, NCBI, TIGR

Gene assigned in B. melitensis 2003 (IG)

Gene assignment propagated over 26 organisms using gene clustering

Leucine degradation in Bacilli

158 new assignments

Organism | Gene anchor | Clustered genes

Gene cluster in B. subtilis

NCBI | PIR
similar to butyryl-CoA dehydrogenase | butyryl-CoA dehydrogenase homolog yngJ
similar to long-chain acyl-CoA synthetase | probable acid-CoA ligase (EC 6.2.1.-) yngI
similar to biotin carboxylase | biotin carboxylase homolog yngH
gene not called | gene not called
similar to hydroxymethylglutaryl-CoA lyase | hydroxymethylglutaryl-CoA lyase homolog yngG
similar to 3-hydroxybutyryl-CoA dehydratase | probable enoyl-CoA hydratase (EC 4.2.1.17) yngF
similar to propionyl-CoA carboxylase | propionyl-CoA carboxylase homolog yngE

Leucine degradation in Bacilli

E.C. No   | Functional role                  | No. in cluster
1.3.99.10 | ISOVALERYL-COA DEHYDROGENASE     | 2
6.4.1.4   | METHYLCROTONYL-COA CARBOXYLASE   |
          | - BIOTIN CONTAINING SUBUNIT      | 3
          | - CARBOXYLASE SUBUNIT            | 1
          | BIOTIN CARBOXYL CARRIER          | 7
4.2.1.18  | METHYLGLUTACONYL-COA HYDRATASE   | 4
--------------------------------------------------------------
4.1.3.4   | HYDROXYMETHYLGLUTARYL-COA LYASE  | 5
6.2.1.16  | ACETOACETATE-COA LIGASE          | 6
6.2.1.16* | ACETOACETATE-COA LIGASE*         | 14

Gene clusters by organism

EC: 1.3.99.10; 6.4.1.4 (Methylcrotonyl-CoA carboxylase: biotin-containing subunit, carboxylase subunit); 6.3.4.14; 4.2.1.18; 4.1.3.4; 6.2.1.16; 6.2.1.3

# in cluster: 3 7 4 4 1 2 6 5 14

Functional roles: Isovaleryl-CoA dehydrogenase; Biotin carboxyl carrier; Biotin carboxylase; Methylglutaconyl-CoA hydratase; Hydroxymethylglutaryl-CoA lyase; Acetoacetate-CoA ligase; Long-chain-fatty-acid-CoA ligase

Bru. meli.: 1909 1911 1910 1913 1914 1912 1907 1908 1620

Bru. abor.: 1089 1087 1088 1085 1086 1090 1471

Bac. anth.: 2343 2345 2344 2348 2347 2346 2349 4397

Bac. cere.: 2373 2375 2374 2378 2377 2376 2379 4413

Bac. halo.: 1171 1174 1173 1177 1176 1175 1178 3179 3178 1172

Bac. subt.: 1826 not called 1824 1821 1822 1823 1825 2856

Oce. ihey.: 1695 1697 1696 1699 1698 3230 4343 504 2122

Cau. cres.: 2243 2234 2241 2240 476 3724 979

Listeria

Clostridia

Ralstonia

Shew.

Xylella

1 Cell division protein mraZ

3 S-adenosyl-methyltransferase mraW (EC 2.1.1.-)

4 Cell division protein ftsI

2 UDP-N-acetylmuramoylalanine--D-glutamate ligase (EC 6.3.2.9)

2 UDP-N-acetylmuramoylalanyl-D-glutamate--2,6-diaminopimelate ligase (EC 6.3.2.13)

5 Phospho-N-acetylmuramoyl-pentapeptide-transferase (EC 2.7.8.13)

Brevibacter

Enterococcus

Brucella

Geobacter

1 Phospho-N-acetylmuramoyl-pentapeptide-transferase (EC 2.7.8.13)


6 Cell division protein ftsW

5 UDP-N-acetylglucosamine--N-acetylmuramyl-(pentapeptide) pyrophosphoryl-undecaprenol N-acetylglucosamine transferase (EC 2.4.1.227)

2 UDP-N-acetylmuramate--alanine ligase (EC 6.3.2.8)

9 Cell division protein ftsZ

11 UDP-N-acetylenolpyruvoylglucosamine reductase (EC 1.1.1.158)

2 D-alanine--D-alanine ligase (EC 6.3.2.4)

Bacteroides thetaiotaomicron

Bacillus cereus

Geobacter metallireducens

Buchnera

5 Cell division protein ftsW

1 UDP-N-acetylglucosamine--N-acetylmuramyl-(pentapeptide) pyrophosphoryl-undecaprenol N-acetylglucosamine transferase (EC 2.4.1.227)


8 UDP-N-acetylenolpyruvoylglucosamine reductase (EC 1.1.1.158)

9 Cell division protein ftsQ

2 UDP-N-acetylmuramoylalanyl-D-glutamate--2,6-diaminopimelate ligase (EC 6.3.2.13)

3 Cell division protein ftsA


Oceanobacillus iheyensis

Enterococcus faecium DO

Escherichia coli K12

Wigglesworthia brevipalpis

2 Cell division protein ftsA


8 Hypothetical protein

10 Hypothetical protein

12 RNA binding protein

7 UDP-3-O-[3-hydroxymyristoyl] N-acetylglucosamine deacetylase (EC 3.5.1.-)

13 Protein translocase subunit secA

The Project: Annotate 1000 Genomes in Three Years

• By making the task concrete, we force engineering decisions

• It will be easier to annotate 1000 genomes well than to annotate 50 well (comparative analysis is the key)

• Analysis by subsystem (rather than by genome) is clearly the key

• The use of clusters is the key to precise annotation of subsystems

Annotation by Subsystem

• Requires knowledge of known variants

• Evolution of clusters plays a major role

• There are three components of the task:

– Building tools to support analysis

– Actually doing the analysis on 30-50 subsystems

– Coordinating with groups doing a limited set of wet lab confirmations

FIG: Building the Initial Annotation Tools

• Releasing the browser/curation tool with approximately 220-230 genomes within a few months

• Peer-to-peer updates/synchronization

• Open source and free (initially for Macs and Linux systems)
