MCSG Site Visit, Argonne, January 30, 2003
Selecting Targets which Probe Family and Function Space
How many protein families can we identify in the genomes with/without structures?
Which families should we target to maximise the structural coverage of the genomes?
Can we optimise function coverage?
James Bray, David Lee, Russell Marsden, Annabel Todd, Janet Thornton, Andrzej Joachimiak
NIH Funded Midwest Consortium
CATH, Gene3D
protein families
Identify protein families in the genomes
domain families
Identify domain families and consider domain compositions of the protein families
domain family of known structure
Identify structurally characterised domain families
650,000 protein sequences from 120 completed genomes
14 eukaryotic genomes including human, mouse, fly, worm
92 bacterial genomes
14 archaeal genomes
Protein Families in Complete Genomes with Structural/Functional Annotations
Gene3D
Buchan, Thornton, Orengo, Genome Research (2002), NAR (2002)
Currently being updated with 30 more complete genomes
BLAST all the sequences from 120 completed genomes against each other and cluster into protein families
For each protein family identify domain composition (by mapping CATH and Pfam domains)
Clustering Sequences into Protein Families of Known Domain Composition
PFscape - Protein Family Landscape
SAM-T99 - sequence mapping of CATH & Pfam (Karplus et al., NAR, 2000)
TRIBE-MCL - Markov Clustering (Enright & Ouzounis, Genome Research, 2002)
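As a rough illustration of the clustering step, the sketch below groups sequences into families from all-against-all hits. It uses single linkage (connected components via union-find) as a simplified stand-in for TRIBE-MCL Markov clustering, so it is not the method used for Gene3D. The function name, the E-value cut-off of 1e-5 and the toy hit list are all assumptions made for this example.

```python
from collections import defaultdict

def cluster_families(sequence_ids, hits, max_evalue=1e-5):
    """Group sequences into families by single linkage over significant hits.

    hits is an iterable of (query, subject, evalue) tuples.
    max_evalue is an assumed cut-off, not a value from the slides.
    """
    parent = {s: s for s in sequence_ids}

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, evalue in hits:
        if evalue <= max_evalue:
            root_q, root_s = find(query), find(subject)
            if root_q != root_s:
                parent[root_q] = root_s

    families = defaultdict(list)
    for s in sequence_ids:
        families[find(s)].append(s)
    # Largest families first.
    return sorted(families.values(), key=len, reverse=True)

# Toy data: identifiers and E-values are made up.
sequences = ["a1", "a2", "a3", "b1", "b2", "c1"]
hits = [
    ("a1", "a2", 1e-40),
    ("a2", "a3", 1e-12),
    ("b1", "b2", 1e-30),
    ("a3", "b1", 0.5),  # too weak to merge the two families
]

families = cluster_families(sequences, hits)
multi = [f for f in families if len(f) > 1]
singletons = [f for f in families if len(f) == 1]
print(len(multi), "families of 2 or more sequences;", len(singletons), "singletons")
```

Single linkage tends to chain unrelated multi-domain proteins together through shared domains, which is one reason a flow-based method such as Markov clustering is preferred at genome scale.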
Consistency of TribeMCL Clusters for Genes of Known Structure in CATH Database
[Figure: percentage of genes with common family annotation]
Granularity of Clustering
clustering ~650,000 genes from 120 complete genomes
PFscape
Protein Family 4
Protein Family 3
Protein Family 2
Protein Family 1
~50,000 protein families of 2 or more sequences, ~60,000 singletons
on average 10-15% of sequences in a genome are singletons
Library of profiles (HMMs) built for representative sequences from each CATH and Pfam domain superfamily
E-value thresholds validated by structure comparison
Mapping CATH and Pfam Domains onto Genome Sequences
protein sequences from genomes
-> Scan against CATH & Pfam SAM-T99 HMM library (1467 CATH, 6190 Pfam)
-> assign domains to CATH and Pfam families
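The assignment step can be sketched as follows, assuming the HMM scan has already produced a list of hits per sequence. This is a minimal illustration, not the Gene3D procedure: the `Hit` record, the E-value cut-off of 0.001, the overlap tolerance of 10 residues and the family names are all hypothetical.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    family: str    # CATH or Pfam family identifier (placeholder names below)
    start: int     # first matched residue, 1-based inclusive
    end: int       # last matched residue, inclusive
    evalue: float

def assign_domains(hits, max_evalue=0.001, max_overlap=10):
    """Keep the most significant hits that do not overlap each other.

    Both thresholds are assumptions; the slides only state that the
    E-value thresholds were validated by structure comparison.
    """
    accepted = []
    for hit in sorted(hits, key=lambda h: h.evalue):
        if hit.evalue > max_evalue:
            break  # remaining hits are all less significant
        clashes = any(
            min(hit.end, a.end) - max(hit.start, a.start) + 1 > max_overlap
            for a in accepted
        )
        if not clashes:
            accepted.append(hit)
    # Report domains in sequence order.
    return sorted(accepted, key=lambda h: h.start)

# Toy hits for one genome sequence; all values are made up.
hits = [
    Hit("cath_A", 5, 180, 1e-30),
    Hit("pfam_B", 20, 170, 1e-12),   # overlaps cath_A, so it is dropped
    Hit("pfam_C", 200, 420, 1e-8),
    Hit("cath_D", 430, 500, 0.5),    # above the E-value cut-off
]

assigned = assign_domains(hits)
for h in assigned:
    print(h.family, h.start, h.end)
```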
Performance of Sequence Mapping Method
1D-HMM (SAM-T99)
[Figure: Coverage vs Error rate (OHPS). x-axis: error rate (%), 0 to 0.1; y-axis: coverage, 0% to 100%; series: Sreps.v2.5_Sreps.v2.5, Sreps.v2.4_Sreps.v2.5]
Percentage of remote, structurally validated CATH homologues (<35% sequence identity) identified by SAM-T99
[Figure: percentage of homologues found vs error rate]
Library of 1D-HMM models detects >80% of remote homologues
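A coverage versus error rate curve of this kind can be computed from a list of scored sequence pairs whose true relationship is known from structure. The sketch below is only illustrative: the slides do not define the error rate, so here it is assumed to be the fraction of accepted pairs that are false, and the function names and toy data are hypothetical.

```python
def coverage_error_curve(scored_pairs, total_true):
    """Return (evalue, coverage, error_rate) after accepting each pair in turn.

    scored_pairs: list of (evalue, is_true_homologue) tuples.
    total_true:   number of structurally validated homologous pairs.
    Error rate is assumed to be false accepted / all accepted.
    """
    curve = []
    true_found = 0
    false_found = 0
    for evalue, is_true in sorted(scored_pairs, key=lambda p: p[0]):
        if is_true:
            true_found += 1
        else:
            false_found += 1
        coverage = true_found / total_true
        error_rate = false_found / (true_found + false_found)
        curve.append((evalue, coverage, error_rate))
    return curve

def coverage_at_error(curve, max_error):
    """Best coverage reached at any threshold whose error rate is acceptable."""
    acceptable = [cov for _, cov, err in curve if err <= max_error]
    return max(acceptable, default=0.0)

# Toy benchmark: five true homologous pairs and two false hits (made up).
pairs = [
    (1e-50, True), (1e-30, True), (1e-20, True), (1e-10, True),
    (1e-5, False), (1e-3, True), (0.1, False),
]
curve = coverage_error_curve(pairs, total_true=5)
print(coverage_at_error(curve, max_error=0.0))
```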
50,000 protein families in Gene3D
Use HMMs to identify CATH and Pfam domains in the genome sequences
domain compositions for protein families in Gene3D
CATH
Pfam
NewFam
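One simple way to summarise the domain composition of a protein family is to take the most common ordered list of domains among its members. The sketch below assumes that per-sequence domain assignments are already available. The function name, the input layout and the placeholder identifiers are assumptions for illustration, not the Gene3D schema.

```python
from collections import Counter

def family_domain_composition(member_domains):
    """Return the most common domain composition and the fraction of members sharing it.

    member_domains: dict mapping sequence id -> ordered list of domain
    family ids (CATH, Pfam or NewFam). Must contain at least one member.
    """
    architectures = Counter(tuple(domains) for domains in member_domains.values())
    composition, count = architectures.most_common(1)[0]
    return composition, count / len(member_domains)

# Toy protein family of four sequences; identifiers are placeholders.
family = {
    "seq1": ["cath_A", "pfam_B"],
    "seq2": ["cath_A", "pfam_B"],
    "seq3": ["cath_A"],
    "seq4": ["cath_A", "pfam_B"],
}

composition, support = family_domain_composition(family)
print(composition, support)
```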
[Figure: percentage of genes annotated per organism, grouped as Archaea, Bacteria and Eukaryotes; series: percentage of genes with a Pfam assignment, percentage of genes with a CATH assignment]
CATH and Pfam domain families cover roughly 60-90% of genome sequences
[Figure: percentage of sequences annotated per organism; series: Pfam, CATH]
Gene3D database
Iterative Profile Search Methodology
120 genomes clustered into ~50,000 protein families
structural domain assignments from CATH
functional domain assignments from Pfam
domain compositions for each protein family
Also: SWISS-PROT, EC, COGs, GO, KEGG annotations
Gene3D Database: Protein Families in 120 Completed Genomes
Gene3D
http://www.biochem.ucl.ac.uk/bsm/Gene3D
Buchan, Thornton, Orengo, 2002, Genome Research
Recent update submitted to Proteins (2004)
Maximise structural coverage of the genomes by targeting the largest domain families
[Figure: three panels (CATH, Pfam, NewFam). x-axis: number of non-identical relatives; y-axis: percentage of families]
• NewFam families are very small
• Target large structurally uncharacterised Pfam families to increase structural coverage of genomes
[Figure: CATH, Pfam, Unassigned Hlevels vs s100. x-axis: number of Hlevel targets, 0 to 35,000; y-axis: % of total s100, 0 to 100]
~70% of genomes are contained in the ~2000 largest CATH and/or Pfam domain families (1345 Pfam families with no structural representative)
-> Target large structurally uncharacterised Pfam families to increase coarse grained structural coverage of the genomes
Genome Coverage by Domain Families
[Figure: domain families ordered by size (x-axis, 0 to 30,000) vs percentage of non-singleton domain sequences in 120 completed genomes (y-axis, 0 to 100); series: Structural Family (CATH), Close Sequence Family (30% ID), Profile Family (HMM based/Pfam)]
2000 of the largest domain families cover 70% of genome sequences (~650 CATH + ~1350 Pfam families)
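The coverage figure above amounts to sorting domain families by size and accumulating until a target fraction of sequences is reached. A minimal sketch, assuming each family is given as a hypothetical (name, size, has_structure) tuple; the toy sizes are made up and do not reproduce the Gene3D numbers.

```python
def largest_families_for_coverage(families, target_fraction):
    """Pick the largest families until target_fraction of all sequences is covered.

    families: list of (name, number_of_sequences, has_structure) tuples.
    """
    total = sum(size for _, size, _ in families)
    chosen = []
    covered = 0
    for family in sorted(families, key=lambda f: f[1], reverse=True):
        if covered >= target_fraction * total:
            break
        chosen.append(family)
        covered += family[1]
    return chosen

# Toy domain families; names and sizes are placeholders.
families = [
    ("fam_A", 500, True),
    ("fam_B", 300, False),
    ("fam_C", 100, False),
    ("fam_D", 60, True),
    ("fam_E", 40, False),
]

chosen = largest_families_for_coverage(families, target_fraction=0.7)
# Families in the chosen set that still lack a structural representative
# are the coarse grained targets.
targets = [name for name, _, has_structure in chosen if not has_structure]
print([name for name, _, _ in chosen], targets)
```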
How many fine grained targets should be selected to provide good homology models for all the relatives in these families?
Fine Grained Target Selection
45,000 targets are needed to give good homology models for 70% of eukaryotic and prokaryotic domains?
[Figure: Number of Targets for Close Sequence Families. y-axis: percentage of non-singleton domain sequences; series: prokaryotes, eukaryotes, eukaryotes plus prokaryotes; marked values: 25,000, 45,000, 30,000]
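Counting fine grained targets means choosing representatives so that every relative has at least 30% sequence identity to some chosen target. The sketch below uses a greedy incremental pass for this. That algorithm is an assumption made for illustration, not the procedure behind the numbers on the slide, and the pairwise identities are made up.

```python
def pick_targets(sequence_ids, identity, min_identity=30.0):
    """Greedily choose targets so each sequence is within min_identity of one.

    identity: dict mapping frozenset({id1, id2}) -> percent sequence identity.
              Missing pairs are treated as unrelated (0%).
    Returns (targets, modelled_by), where modelled_by maps each sequence
    to the target that would serve as its modelling template.
    """
    targets = []
    modelled_by = {}
    for seq in sequence_ids:
        for target in targets:
            if identity.get(frozenset((seq, target)), 0.0) >= min_identity:
                modelled_by[seq] = target
                break
        else:
            # No existing target is close enough, so this sequence becomes one.
            targets.append(seq)
            modelled_by[seq] = seq
    return targets, modelled_by

# Toy domain family with two close sequence families.
ids = ["s1", "s2", "s3", "s4", "s5"]
identity = {
    frozenset(("s1", "s2")): 62.0,
    frozenset(("s1", "s3")): 35.0,
    frozenset(("s2", "s3")): 41.0,
    frozenset(("s1", "s4")): 18.0,
    frozenset(("s3", "s4")): 22.0,
    frozenset(("s4", "s5")): 55.0,
}

targets, modelled_by = pick_targets(ids, identity)
print(len(targets), "targets:", targets)
```

The result of a greedy pass depends on input order, so processing sequences from the largest close sequence family first would usually be a sensible refinement.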
Target Selection Strategy
~2000 of the largest CATH and/or Pfam families cover >70% of domain sequences in the genomes
It is not feasible to target all the close sequence families in these families to build good homology models for all relatives (45,000 targets)
Accurate homology models are not needed for all families
-> Target sequence families of biological or medical interest (these could be small families or singletons)
-> Target additional representatives in very large families, especially functionally diverse families
Domain Recurrences in the Genomes
[Figure: number of families (y-axis, 0 to 100) vs occurrences (x-axis, 1 to 219) for E.coli, M.jannaschii and S.cerevisiae; values 730 and 570 labelled on the plot]
large, extensively duplicated families
structural family (CATH)
close sequence family (30%)
profile family (Pfam)
In these very large families we will need finer grained selection of targets to understand the evolution of new functions/biological roles in different organisms
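The domain recurrence plot is a histogram of how many families occur once, twice, and so on within a genome. A minimal sketch, assuming a flat list of domain family assignments for one genome; the function name and toy data are hypothetical.

```python
from collections import Counter

def recurrence_histogram(domain_assignments):
    """Map number of occurrences -> number of families with that many occurrences.

    domain_assignments: iterable of domain family ids, one entry per
    domain instance found in a genome.
    """
    occurrences = Counter(domain_assignments)
    return Counter(occurrences.values())

# Toy genome: family A occurs three times, C twice, B and D once each.
genome_domains = ["A", "A", "A", "B", "C", "C", "D"]
histogram = recurrence_histogram(genome_domains)
for n_occurrences in sorted(histogram):
    print(n_occurrences, histogram[n_occurrences])
```

Families at the far right of such a histogram are the large, extensively duplicated ones for which finer grained target selection is proposed.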
In >87% of families -> changes in substrate specificity modulated by changes in domain partners
In >92% of these families -> conservation or semi-conservation of reaction chemistry
Changes in Domain Partnerships can Modulate Function
domain duplication
domain fusion, change in domain partner
67% of enzyme families in CATH show variation in functional properties of relatives
Change in Domain Partner Modulates Function
Methionine Aminopeptidase Type 1 (1mat): monomer/protein substrates
Creatinase (1chmA): dimer/small molecule substrates
representative structures for large families may also help to identify functional families
profile family (Pfam)
close sequence family (30%)
ProFunc: Predicting Functional Sites
Most likely binding site
Surface clefts
Residue conservation
Conserved surface patches
Laskowski and Thornton
Representative Structures for Superfamilies will help identify Functional Subfamilies
functional subclusters identified by:
- domain partnerships from Gene3D
- sequence conservation
- functional annotations stored in Gene3D
- results from ProFunc analysis
[Figure: a superfamily divided into functional clusters family_1, family_2, family_3, family_4 and family_5]