selecting targets which probe family and function space

MCSG Site Visit, Argonne, January 30, 2003

Selecting Targets which Probe Family and Function Space

How many protein families can we identify in the genomes with/without structures?

Which families should we target to maximise the structural coverage of the genomes?

Can we optimise function coverage?

James Bray, David Lee, Russell Marsden, Annabel Todd, Janet Thornton, Andrzej Joachimiak

NIH Funded Midwest Consortium

CATH, Gene3D

protein families

Identify protein families in the genomes

domain families

Identify domain families and consider domain compositions of the protein families

domain family of known structure

Identify structurally characterised domain families

650,000 protein sequences from 120 completed genomes

14 eukaryotic genomes including human, mouse, fly, worm

92 bacterial genomes

14 archaeal genomes

Protein Families in Complete Genomes with Structural/Functional Annotations

Gene3D: Buchan, Thornton, Orengo, Genome Research (2002), NAR (2002)

Currently being updated with 30 more complete genomes

BLAST all the sequences from 120 completed genomes against each other and cluster into protein families

For each protein family identify domain composition (by mapping CATH and Pfam domains)

Clustering Sequences into Protein Families of Known Domain Composition

PFscape - Protein Family Landscape

SAM-T99 - sequence mapping of CATH & Pfam (Karplus et al., NAR, 2000)

TRIBE-MCL - Markov Clustering (Enright & Ouzounis, Genome Research, 2002)
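The step from all-against-all hits to protein families can be sketched as below. This is not TRIBE-MCL: it swaps in plain single-linkage clustering (union-find) because that fits in a few lines, and it tends to merge families more aggressively than Markov clustering would. The hit tuple format, the function name `cluster_hits` and the E-value cutoff are assumptions made for illustration only.

```python
from collections import defaultdict


def cluster_hits(sequence_ids, hits, max_evalue=1e-5):
    """Group sequences into families by single linkage over significant hits.

    hits: iterable of (query_id, subject_id, evalue) tuples (hypothetical format).
    max_evalue: illustrative cutoff, not a value taken from the slides.
    """
    parent = {sid: sid for sid in sequence_ids}

    def find(x):
        # Walk to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, evalue in hits:
        if evalue <= max_evalue:
            root_q, root_s = find(query), find(subject)
            if root_q != root_s:
                parent[root_s] = root_q  # merge the two families

    families = defaultdict(list)
    for sid in sequence_ids:
        families[find(sid)].append(sid)
    # Largest families first; singletons end up at the tail.
    return sorted(families.values(), key=len, reverse=True)


hits = [("a", "b", 1e-30), ("b", "c", 1e-12), ("d", "e", 1e-8), ("a", "f", 0.5)]
families = cluster_hits(["a", "b", "c", "d", "e", "f"], hits)
singletons = [f for f in families if len(f) == 1]
```

In this toy input the weak a-f hit is ignored, so `f` stays a singleton while the other sequences fall into two families.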

Consistency of TribeMCL Clusters for Genes of Known Structure in CATH Database

[Figure: y-axis: Percentage of Genes with common family annotation]

Granularity of Clustering

clustering ~650,000 genes from 120 complete genomes

PFscape

Protein Family 4

Protein Family 3

Protein Family 2

Protein Family 1

~50,000 protein families of 2 or more sequences, ~60,000 singletons

on average 10-15% of sequences in a genome are singletons

Library of profiles (HMMs) built for representative sequences from each CATH and Pfam domain superfamily

E-value thresholds validated by structure comparison

Mapping CATH and Pfam Domains onto Genome Sequences

protein sequences from genomes

-> Scan against CATH & Pfam SAM-T99 HMM library (1467 CATH, 6190 Pfam)

-> assign domains to CATH and Pfam families
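A minimal sketch of the assignment step, assuming the HMM scan has already produced per-protein hits. The slides do not say how competing hits were resolved or which E-value thresholds were used, so the greedy best-first, non-overlapping rule, the function name `assign_domains`, the `max_evalue` value and the example hits are all illustrative assumptions.

```python
def assign_domains(hits, max_evalue=1e-3):
    """Pick non-overlapping domain hits for one protein, best E-value first.

    hits: list of (family, start, end, evalue); coordinates are 1-based inclusive.
    max_evalue: illustrative threshold only.
    """
    accepted = []
    for family, start, end, evalue in sorted(hits, key=lambda h: h[3]):
        if evalue > max_evalue:
            break  # hits are sorted, so all remaining ones are weaker
        overlaps = any(start <= a_end and a_start <= end
                       for _, a_start, a_end, _ in accepted)
        if not overlaps:
            accepted.append((family, start, end, evalue))
    # Report the domain composition in N- to C-terminal order.
    return [family for family, *_ in sorted(accepted, key=lambda h: h[1])]


# Made-up hits for a single protein sequence.
hits = [
    ("CATH:3.40.50.300", 10, 180, 1e-40),
    ("Pfam:PF00005", 15, 170, 1e-25),    # overlaps the stronger CATH hit
    ("Pfam:PF00664", 200, 450, 1e-12),
    ("CATH:1.10.10.10", 460, 520, 0.5),  # fails the E-value threshold
]
composition = assign_domains(hits)
```

The resulting list is the protein's domain composition; collecting these per family gives the domain compositions stored for each protein family.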

Performance of Sequence Mapping Method

1D-HMM (SAM-T99)

[Figure: Coverage vs Error rate (OHPS); x-axis: Error rate (%), 0 to 0.1; y-axis: Coverage, (%) of homologues found, 0.0% to 100.0%; series: Sreps.v2.5_Sreps.v2.5, Sreps.v2.4_Sreps.v2.5]

Percentage of remote, structurally validated CATH homologues (<35% sequence identity) identified by SAM-T99

Library of 1D-HMM models detects >80% of remote homologues
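A coverage figure of this kind can be computed from a benchmark of scored hits labelled as true or false homologues. The sketch below is only one way to do it: the slides do not define the error rate, so it is assumed here to be false positives divided by accepted hits, and the function name and toy data are hypothetical.

```python
def coverage_at_error_rate(hits, total_true, max_error_rate):
    """Return the best coverage reachable while the error rate stays within bounds.

    hits: list of (evalue, is_true_homologue) pairs.
    total_true: number of structurally validated homologue pairs in the benchmark.
    Error rate here is false positives / accepted hits (an assumed definition).
    """
    best = 0.0
    true_pos = false_pos = 0
    # Accept hits one by one from the strongest E-value to the weakest.
    for evalue, is_true in sorted(hits, key=lambda h: h[0]):
        if is_true:
            true_pos += 1
        else:
            false_pos += 1
        if false_pos / (true_pos + false_pos) <= max_error_rate:
            best = max(best, true_pos / total_true)
    return best


# Toy benchmark: five true homologues and two false hits.
hits = [(1e-50, True), (1e-30, True), (1e-20, True), (1e-10, False),
        (1e-8, True), (1e-3, False), (0.1, True)]
coverage = coverage_at_error_rate(hits, total_true=5, max_error_rate=0.2)
```

Sweeping `max_error_rate` over a range of values would trace out a coverage versus error curve for one HMM library.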

50,000 protein families in Gene3D

Use HMMs to identify CATH and Pfam domains in the genome sequences

domain compositions for protein families in Gene3D

CATH, Pfam, NewFam

[Figure: Percentage of genes annotated (y-axis, 0 to 100) per Organism (x-axis), grouped as Archaea, Bacteria, Eukaryotes; series: Percentage of Genes with a Pfam Assignment, Percentage of Genes with a CATH assignment; organism codes shown: Ape, Afu, Hsp, Mja, Mka, Mac, Mma, Mth, Pae, Pab, Pfu, Pho, Sso, Sto, Tac, Tvo, Aae, Bsu, Ctr, Cpn, Cte, Dra, Eco, Fun, Hin, Hpy26695, HpyJ99, MtuCDC, MtuH37, Mge, Mpn, PaePAO1, Rpr, Sau, Sco, Syn, Tel, Tma, Tpa]

CATH and Pfam domain families cover nearly 60-90% of genome sequences

[Figure: Percentage of sequences annotated (y-axis, 20 to 100) per organism (x-axis); series: Pfam, CATH]

Gene3D database

Iterative Profile Search Methodology

120 genomes clustered into ~50,000 protein families

structural domain assignments from CATH

functional domain assignments from Pfam

domain compositions for each protein family

Also: SWISS-PROT, EC, COGs, GO, KEGG annotations

Gene3D Database: Protein Families in 120 Completed Genomes

Gene3D

http://www.biochem.ucl.ac.uk/bsm/Gene3D

Buchan, Thornton, Orengo, 2002, Genome Research

Recent update submitted to Proteins (2004)

Maximise structural coverage of the genomes by targeting the largest domain families

[Figure: Percentage of Families vs Number of Non-identical Relatives; series: CATH, Pfam, NewFam]

•NewFam families are very small

•Target large structurally uncharacterised Pfam families to increase structural coverage of genomes
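As a rough illustration of this coarse grained selection, the sketch below ranks families that lack a structural representative by their size. The family names, sizes and the function `rank_targets` are made up for the example; the slides do not describe how the ranking was implemented.

```python
def rank_targets(family_sizes, families_with_structure, top_n=3):
    """Return the largest families lacking a structural representative.

    family_sizes: dict of family id -> number of domain sequences in the genomes.
    families_with_structure: set of family ids that already have a structure.
    """
    candidates = [(size, fam) for fam, size in family_sizes.items()
                  if fam not in families_with_structure]
    # Largest first; ties broken alphabetically so the order is reproducible.
    candidates.sort(key=lambda item: (-item[0], item[1]))
    return [fam for _, fam in candidates[:top_n]]


# Hypothetical family sizes (not real Gene3D numbers).
family_sizes = {"PfamA": 5200, "PfamB": 3100, "PfamC": 2900,
                "PfamD": 40, "NewFam1": 3}
targets = rank_targets(family_sizes, families_with_structure={"PfamB"}, top_n=2)
```

Small NewFam-like families naturally fall to the bottom of such a ranking, which matches the point that they contribute little to coverage.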

[Figure: CATH, Pfam, Unassigned Hlevels vs s100; x-axis: #Hlevel targets (0 to 35000); y-axis: % Total s100 (0 to 100)]

~70% of genomes are contained in the ~2000 largest CATH and/or Pfam domain families (1345 Pfam families with no structural representative)

-> Target large structurally uncharacterised Pfam families to increase coarse grained structural coverage of the genomes

Genome Coverage by Domain Families

[Figure: Domain Families Ordered by Size (x-axis, 0 to 30,000) vs Percentage of Non-singleton Domain Sequences in 120 Completed Genomes (y-axis, 0 to 100); series: Structural Family (CATH), Close Sequence Family (30% ID), Profile Family (HMM based/Pfam)]

2000 of the largest domain families cover 70% of genome sequences (~650 CATH + ~1350 Pfam families)
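The coverage statement above amounts to a cumulative sum over families ordered by size. A minimal sketch, assuming a flat list of per-family domain-sequence counts (the numbers and the function name `families_needed` are invented for illustration):

```python
def families_needed(family_sizes, target_fraction):
    """Count how many of the largest families are needed to reach a coverage target.

    family_sizes: list of domain-sequence counts, one per family.
    Returns (number_of_families, achieved_fraction).
    """
    total = sum(family_sizes)
    covered = 0
    for rank, size in enumerate(sorted(family_sizes, reverse=True), start=1):
        covered += size
        if covered / total >= target_fraction:
            return rank, covered / total
    # Only reached for an empty list or a target above 100%.
    return len(family_sizes), 1.0


# Toy size distribution: a few large families and a long tail of small ones.
sizes = [500, 200, 100, 50, 50, 40, 30, 20, 5, 5]
n_families, fraction = families_needed(sizes, target_fraction=0.70)
```

With a skewed size distribution like this one, a small number of large families reaches the target, which is the same effect the ~2000 families figure describes at genome scale.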

How many fine grained targets should be selected to provide good homology models for all the relatives in these families?

Fine Grained Target Selection

45,000 targets are needed to give good homology models for 70% of eukaryotic and prokaryotic domains?

[Figure: Number of Targets for Close Sequence Families (x-axis) vs Percentage of Non-singleton domain sequences (y-axis); series: prokaryotes, eukaryotes, eukaryotes plus prokaryotes; marked values: 25,000, 45,000, 30,000]


Target Selection Strategy

~2000 of the largest CATH and/or Pfam families cover >70% of domain sequences in the genomes

it is not feasible to target all the close sequence families in these families to build good homology models for all relatives (45,000 targets)

accurate homology models are not needed for all families

-> target sequence families of biological or medical interest (these could be small families or singletons)

-> target additional representatives in very large families, especially functionally diverse families

Domain Recurrences in the Genomes

[Figure: number of families (y-axis, 0 to 100) vs occurrences (x-axis, 1 to 219) for E.coli, M.jannaschii and S.cerevisiae; marked values: 730, 570]

large, extensively duplicated families

structural family (CATH)

close sequence family (30%)

profile family (Pfam)

in these very large families we will need finer grained selection of targets to understand the evolution of new functions/biological roles in different organisms

In >87% of families -> changes in substrate specificity modulated by changes in domain partners

In >92% of these families -> conservation or semi-conservation of reaction chemistry

Changes in Domain Partnerships can Modulate Function

domain duplication

domain fusion, change in domain partner

67% of enzyme families in CATH show variation in functional properties of relatives

Change in Domain Partner Modulates Function

Methionine Aminopeptidase Type 1 (1mat): monomer/protein substrates

Creatinase (1chmA): dimer/small molecule substrates

representative structures for large families may also help to identify functional families

profile family (Pfam)

close sequence family (30%)


ProFunc: Predicting Functional Sites

Most likely binding site

Surface clefts

Residue conservation

Conserved surface patches

Laskowski and Thornton

Representative Structures for Superfamilies will help identify Functional Subfamilies

functional subclusters identified by:

- domain partnerships from Gene3D

- sequence conservation

- functional annotations stored in Gene3D

- results from ProFunc analysis

Superfamily -> functional clusters: family_1, family_2, family_3, family_4, family_5
