
Upload: judith-harrell

Post on 25-Dec-2015


Principles of comparative genomics

• Scale of the ‘unknown’ gene problem

• Shared plant-prokaryote genes

• Comparative genomics
  - When Blast tells you nothing…
  - The ‘guilt by association’ principle
  - ‘Two-dimensional’ gene annotation
  - SEED subsystems

• Plant-prokaryote examples
  - Filling ‘pathway holes’ – FolQ
  - Linking new functions to known systems – COG0354

Whole genome sequencing progress

● Functional annotation of genes has nowhere near kept pace

● Functional annotations are often absent, vague, or wrong

[Figure: number of genomes (0 to 10000), ongoing and complete, from Dec 1997 to Mar 2011. Source: www.genomesonline.org]

Orphan genes

• 20-60% of genes in any given genome have no known function or only a vague one (‘esterase’ etc)

Orphan enzymes

• 1437/3736 enzymes (38%) with EC numbers have no associated genes

The unknown protein problem in various groups

[Figure: percentage of unknown vs. known proteins (0 to 100% of proteins) encoded by diverse genomes, grouped as Bacteria, Archaea and Eukarya. Genomes shown: Escherichia coli, Lactobacillus casei, Staphylococcus aureus, Chlamydia trachomatis, Acidobacterium, Solibacter usitatus, Synechocystis, Pyrococcus abyssi, Haloarcula marismortui, Human, Arabidopsis. Data from The SEED http://theseed.uchicago.edu/]

Plants & prokaryotes share many (unknown) genes

Source of genes    Number of genes    % of genome
Cyanobacteria      5470               21.0
Proteobacteria     1170               4.6
Gram+ bacteria     2280               9.1
Other bacteria     1160               4.6
Archaea            1090               4.4
Total              11170              43.4

● Estimates for Arabidopsis vary – but all are many thousands

● Functions of most shared genes are metabolic

From de Crecy-Lagard & Hanson Trends Microbiol 15: 563 (2007)

● Shared genes identifiably come from various prokaryotic groups

● Plants are conglomerates of microbial metabolic genes

● Many opportunities for comparative genomics

The power of comparative genomics

● Suppose you have an unknown plant protein:

● BlastP search gives various prokaryote hits

● None of them have clear functions. Dead end?

● No! This is the beginning of comparative genomics

● Predicts functions via ‘guilt by association’ principle

● Genes of related function are associated in various ways

● e.g. Enzymes in a pathway, proteins in a complex

● Whatever a gene’s associates do, it probably does too

Association evidence

Genomic evidence:
  - Gene clustering
  - Gene fusion
  - Shared regulatory sites
  - Phylogenetic occurrence

Post-genomic evidence:
  - Protein-protein interactions
  - Organelle proteomes
  - Co-expression
  - Structures
  - Essentiality & other phenome data

Both feed into: Predictions → Testing (genetics, biochemistry)

[Figure: schematic icons for each evidence type, e.g. genes W, X, Y, Z; Orf X and Orf Y fused as Orf XY; +/– phylogenetic occurrence patterns]
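One of the genomic evidence types, phylogenetic occurrence, can be illustrated with a small sketch: genes whose homologs occur in the same set of genomes are candidates for a shared function. The gene names, genome labels and the choice of Jaccard similarity below are assumptions made for illustration, not data or methods from the slides.

```python
def jaccard(profile_a, profile_b):
    """Jaccard similarity between two sets of genomes in which a gene occurs."""
    a, b = set(profile_a), set(profile_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical presence profiles: gene -> genomes in which a homolog is found.
profiles = {
    "unknown_X": {"g1", "g2", "g4", "g5"},
    "known_A": {"g1", "g2", "g4", "g5"},
    "known_B": {"g1", "g3"},
    "known_C": {"g2", "g4", "g5", "g6"},
}


def rank_associates(query, profiles):
    """Rank all other genes by profile similarity to the query gene."""
    scores = [
        (jaccard(profiles[query], profile), gene)
        for gene, profile in profiles.items()
        if gene != query
    ]
    # Highest similarity first; ties broken alphabetically by gene name.
    ordered = sorted(scores, key=lambda t: (-t[0], t[1]))
    return [(gene, round(score, 2)) for score, gene in ordered]


ranking = rank_associates("unknown_X", profiles)
print(ranking)
```

A high score is only a lead for the prediction step; as the slide notes, it still has to be tested by genetics or biochemistry.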

Two-dimensional gene annotation

• ‘Dimensions’ are:
  - Molecular function (e.g., an enzyme activity with EC no.)
  - Functional context (e.g., other enzymes of a pathway)

• ‘2-Dimensions good, 1-dimension bad’
  - Even an EC no. function may be wrong if pathway not there
  - Pathway context may be wrong if certain enzymes missing

• GenBank etc annotations are 1-dimensional (mol. function)

SEED subsystems

• Subsystems (SSs) capture both annotation dimensions

• Sets of molecular functions (e.g. enzymes) that together implement a specific biological process (e.g. a pathway)

• SSs cover many genomes, have form of spreadsheet:
  - Columns are molecular functions
  - Rows are genomes
  - Each cell identifies the genes for proteins with the specific molecular functional role in the designated genome

[Figure: Folate biosynthesis subsystem spreadsheet, with a pathway hole marked]
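The spreadsheet layout described above can be sketched as a nested mapping, which also makes a ‘pathway hole’ easy to find. This is a toy model, not the SEED data format or API: the genome names, gene IDs and the `min_fraction` threshold are hypothetical, and only the role names are taken from the folate pathway in this deck.

```python
# Columns of the spreadsheet: molecular functions (functional roles).
ROLES = ["FolE", "FolQ", "FolB", "FolK", "FolP"]

# Rows are genomes; each cell lists the genes assigned to that role.
# An empty list means no gene has been found for the role (all IDs are invented).
subsystem = {
    "genome_1": {
        "FolE": ["g1_0101"],
        "FolQ": [],
        "FolB": ["g1_0103"],
        "FolK": ["g1_0104"],
        "FolP": ["g1_0105"],
    },
    "genome_2": {role: [] for role in ROLES},
    "genome_3": {role: [f"g3_{i:04d}"] for i, role in enumerate(ROLES)},
}


def pathway_holes(subsystem, roles, min_fraction=0.5):
    """Return {genome: [missing roles]} for genomes that have most of the
    pathway but lack some roles. Genomes with too few roles present are
    treated as lacking the pathway rather than as having holes."""
    holes = {}
    for genome, row in subsystem.items():
        present = [r for r in roles if row.get(r)]
        missing = [r for r in roles if not row.get(r)]
        if missing and len(present) / len(roles) >= min_fraction:
            holes[genome] = missing
    return holes


holes = pathway_holes(subsystem, ROLES)
print(holes)
```

The threshold encodes the second annotation dimension: a missing role only counts as a hole when the rest of the functional context is present.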

• Prokaryote association evidence is mainly genomic

• Plant association evidence is mainly post-genomic

• Post-genomic evidence is noisier but very useful

• Superb plant post-genomic resources:
  - Microarrays, RNAseq (organ- and environment-specific)
  - Organellar targeting prediction, proteomics (location can rule out function)
  - Phenome databases (chlorosis, lethality can support function)
  - Vast plant metabolism bibliome

Plant – prokaryote examples

FolQ – Filling a pathway hole

• Missing step known to be a pyrophosphohydrolase, ~17 kDa

• Search genomes for small hydrolase clustered with fol genes

• YlgG candidate in Firmicutes, Nudix hydrolase family, 19 kDa
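The search strategy in these bullets (a small hydrolase encoded next to fol genes) can be sketched as a filter over an ordered gene list. Everything in the example is illustrative: the `Gene` record, the toy genome, the 25 kDa cutoff and the 3-gene window are assumptions, not values from the study.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    name: str
    position: int    # index of the gene along the chromosome
    mass_kda: float  # predicted protein mass
    family: str      # coarse family label, e.g. "hydrolase"


FOL_GENES = {"folE", "folB", "folK", "folP", "folC", "folA"}


def candidates(genes, max_mass_kda=25.0, window=3):
    """Names of small hydrolases encoded within `window` genes of any fol gene."""
    fol_positions = [g.position for g in genes if g.name in FOL_GENES]
    hits = []
    for g in genes:
        if g.name in FOL_GENES or g.family != "hydrolase":
            continue
        if g.mass_kda > max_mass_kda:
            continue
        if any(abs(g.position - p) <= window for p in fol_positions):
            hits.append(g.name)
    return hits


# Invented genome fragment: orf1 is small and clustered, orf2 is small but
# distant, orf3 is clustered but too large.
toy_genome = [
    Gene("folE", 10, 40.0, "cyclohydrolase"),
    Gene("folP", 11, 30.0, "synthase"),
    Gene("orf1", 12, 19.0, "hydrolase"),
    Gene("folC", 13, 47.0, "ligase"),
    Gene("orf3", 14, 60.0, "hydrolase"),
    Gene("orf2", 50, 18.0, "hydrolase"),
]

hits = candidates(toy_genome)
print(hits)
```

In practice the same filter would be run across many genomes, and a family that recurs next to fol genes becomes the candidate to test.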

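Folate synthesis pathway

GTP → DHN-P3 → DHN-P → DHN → HMDHP → HMDHP-P2 → DHP → DHF → THF

Enzymes, in order: FolE, FolQ, [P-ase], FolB, FolK, FolP, FolC, FolA

Side inputs: Chorismate → ADC → pABA (PabAB, PabC), with pABA entering at FolP; Glu entering at FolC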

• YlgG has a plant homolog – At1g68760

• FolQ universally missing (prokaryotes, plants, fungi, protists)

[Figure: Lactococcus lactis folate gene cluster; genes shown: folC, folEK, folP, ylgG]

FolQ – Experimental tests

[Figure: folate synthesis pathway, repeated from the previous slide]

• ylgG KO accumulates DHN-P3

• YlgG & At1g68760 act on DHN-P3

• Recombinant proteins release DHN-P + PPi

[Figure: fluorescence (0 to 240) vs. minutes for WT and KO, with the DHN-P3 peak labeled]

[Figure: product formation (nmol/assay) of DHNP, Pi and PPi by recombinant YlgG and At1g68760]

COG0354 – A folate protein for Fe/S cluster repair in oxidative stress

COG0354 – Linking a new function to a known system

• In all kingdoms of life
  - Bacteria
  - Archaea
  - Fungi
  - Animals
  - Plants

• 2 plant proteins
  - 1 related to rickettsias (mitochondria)
  - 1 related to cyanobacteria (plastids)

• Homolog of GcvT protein
  - But clearly a distinct clade

[Figure: phylogenetic tree, with the label ‘Folate-dependent’. COG0354 sequences: Mouse, Fly, Yeast, Leishmania, At4g12130, At1g60990, Haloarcula, Natronomonas, Rickettsia, Ehrlichia, Anaplasma, Bradyrhizobium, Burkholderia, Neisseria, Xanthomonas, Psychrobacter, E. coli, Shewanella, Thermus, Deinococcus, Synechocystis, Synechococcus, Nostoc, Corynebacterium, Streptomyces, Solibacter, Blastopirellula, Pirellula. GcvT sequences: Yeast GcvT, Mouse GcvT, Arabidopsis GcvT, Rice GcvT]

COG0354 – Comparative genomics & post-genomic data

• Co-expression in Arabidopsis
  - Mitochondrial COG0354 expression correlates with frataxin (Fe/S assembly)
  - And with ferritin 2 (Fe storage)

[Figure: expression across a developmental series, Arabidopsis Transcriptome DB (Max Planck Institute, Golm). Panels: mitochondrial COG0354 vs. mitochondrial frataxin; mitochondrial COG0354 vs. ferritin 2]
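Co-expression of this kind is usually scored with a correlation coefficient across samples. The sketch below uses Pearson correlation on invented expression values; the numbers and the ‘unrelated’ gene are made up for illustration and do not come from the Arabidopsis data set, and the slides do not say which statistic was used.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sd_x * sd_y)


# Invented expression levels over five developmental stages.
expression = {
    "COG0354_mito": [1.0, 2.0, 3.0, 4.0, 5.0],
    "frataxin": [2.1, 3.9, 6.2, 8.0, 9.8],
    "unrelated": [5.0, 1.0, 4.0, 2.0, 3.0],
}

r_frataxin = pearson(expression["COG0354_mito"], expression["frataxin"])
r_unrelated = pearson(expression["COG0354_mito"], expression["unrelated"])
print(round(r_frataxin, 3), round(r_unrelated, 3))
```

As the deck notes, post-genomic evidence like this is noisier than genomic evidence, so a strong correlation supports a prediction rather than proving it.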

• Clusters with Fe/S proteins
  ● Nif cluster in Methylococcus capsulatus: nifQ, fd, nifX, nifN, nifE, fd, nifH, nifD, nifK
  ● Suf cluster in Rubrobacter xylanophilus: sufC, sufB, sufD, sufS, thiC
  ● Sdh operon in Stenotrophomonas maltophilia: sdhC, sdhD, sdhA, sdhB
  ● NAD synthesis cluster in Pelagibacter ubique: nadA, nadC
  ● MiaB (Radical SAM) in Buchnera aphidicola

[Figure: gene-cluster diagrams, each including COG0354 (‘0354’). Legend: COG0354, Fe/S protein, Fe/S partner]

• Only occurs if IscA is present
  - IscA proteins are scaffolds in Fe/S cluster assembly

[Figure: presence/absence chart (gene present, gene absent) for COG0354 and IscA across taxa.
Bacteria: Firmicutes, Fusobacteria, Actinobacteria, Cyanobacteria, Acidobacteria, δ/ε-Proteobacteria, α-Proteobacteria, β-Proteobacteria, γ-Proteobacteria, Magnetococcus, Planctomycetes, Chlamydiales, Chlorobi, Bacteroidetes, Deinococcus/Thermus, Chloroflexi, Thermotogae, Spirochaetes.
Bacterial subgroups broken out: Clostridiales, Mollicutes, Lactobacillales, Staphylococcaceae, Listeriaceae, Bacillaceae, Bifidobacterium, Campylobacterales, Bdellovibrionales, Desulfovibrionales, Desulfuromonadales, Myxococcales, Syntrophobacterales, Desulfobacterales, Bacteroidales, Flavobacteria, Sphingobacteria.
Archaea: Nanoarchaeota, Euryarchaeota, Crenarchaeota.
Archaeal subgroups broken out: Methanococci, Methanomicrobia, Archaeoglobi, Halobacteria, Methanobacteria, Methanopyri, Thermococci, Thermoplasmata]
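The statement that COG0354 only occurs where IscA is present is a one-way implication over presence/absence calls, and it can be checked mechanically. The taxa and calls below are placeholders invented for the example; they are not read from the chart.

```python
# Hypothetical presence/absence calls per taxon: (has_COG0354, has_IscA).
occurrence = {
    "taxon_A": (True, True),
    "taxon_B": (False, True),   # IscA without COG0354 does not break the rule
    "taxon_C": (False, False),
    "taxon_D": (True, True),
}


def violations(occurrence):
    """Taxa in which COG0354 is present although IscA is absent."""
    return sorted(
        taxon
        for taxon, (has_cog0354, has_isca) in occurrence.items()
        if has_cog0354 and not has_isca
    )


def implication_holds(occurrence):
    """True if 'COG0354 present' implies 'IscA present' in every taxon."""
    return not violations(occurrence)


print(implication_holds(occurrence))

# Adding a counterexample makes the check fail and names the offending taxon.
with_exception = dict(occurrence, taxon_E=(True, False))
print(violations(with_exception))
```

The asymmetry matters: the rule is about dependence of COG0354 on IscA, so taxa with IscA alone are compatible with it.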

• Associated with aerobic lifestyle

• H2O2-induced in E. coli

• High-throughput screens
  - Essentiality & phenomics
    ● Essential gene in: Mycobacterium tuberculosis, Haemophilus influenzae, Pseudomonas aeruginosa
    ● Important gene in: E. coli (slow growth), Yeast (petite)
  - Proteomics
    ● Plant proteins both expressed
    ● Cyano-like protein in plastids

• E. coli protein has folate site

COG0354 – Predictions & Experimental Validation

COG0354 PREDICTIONS

● Is a folate-dependent enzyme

● Combats oxidative stress

● Helps make/repair Fe/S clusters

● Function is ancient & ubiquitous (like Fe/S proteins themselves)

EXPERIMENTAL VALIDATION

● Folate mutations abolish activity

● Mutant is oxidative stress-sensitive

● Mutant has many Fe/S enzyme defects

● Complementation by genes from all kingdoms

[Figure: complementation test on LB + plumbagin (oxidative stress). Controls: Vector, E. coli. Plant & mammal: Plant M, Plant C, Mammal. Fungi, protist, Archaea: Yeast, Protist, Archaea]

The power of comparative genomics

William Whewell (1794-1866)
English scientist, philosopher, Anglican priest
An early influence on Charles Darwin
Coined the term “scientist”

“The facts are known but they are insulated and unconnected…. The pearls are there but they will not hang together until some one provides the string”

Hypothesis that connects and unifies observations