Identification of specificity-determining positions in protein alignments
Mikhail Gelfand
Research and Training Center “Bioinformatics”
Institute for Information Transmission Problems, RAS
ECCB2005, Madrid
Motivation
• Large protein families with general function assigned by homology, but little functional information
• Much less structural data; few structures with substrates, cofactors, etc.
• Some specificity assignments from comparative genomics
=> Search for specificity-determining positions in alignments
– identification of functional sites
– prediction of specificity
– understanding and, eventually, re-design of function
Specificity (of transporters) from comparative genomics – three examples
1. New specificities in a little-studied family
[Figure: phylogenetic tree of the transporter family. Regulatory elements are marked on the tree: S-box (rectangle frame), MetJ (circle frame), LYS-element (circles), Tyr-T-box (rectangles). Branch labels: LysT, MetT, TyrT, MleN, malate/lactate. Taxonomic labels: Archaea, clostridia, Pasteurellaceae.]
2. Misleading homology: The PnuC family of transporters
The RFN elements
The THI elements
3. A nightmare. The NiCoT family of nickel-cobalt transporters
SDP (Specificity-Determining Position)
Alignment position that is conserved within groups of proteins having the same specificity (specificity groups) but differs between them
SDP is not equivalent to a functionally important position
Measure of specificity: mutual information
$$I_p = \sum_{i \,\in\, \text{all specificity groups}} \; \sum_{\alpha \,\in\, \text{all amino acids}} f_p(\alpha, i) \, \log \frac{f_p(\alpha, i)}{f_p(\alpha)\, f(i)}$$
where
• $f_p(\alpha, i)$ = count of amino acid α in group i at position p divided by the total number of sequences
• $f_p(\alpha)$ = frequency of amino acid α in position p
• $f(i)$ = fraction of proteins in group i
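A minimal sketch of the mutual information of one alignment column, following the definitions on this slide. The function name, the use of the natural logarithm, and the toy columns are assumptions made for illustration; gaps and pseudocounts are not handled.

```python
import math
from collections import Counter

def mutual_information(column, groups):
    """Mutual information between residues of one column and group labels.

    column: residues at position p, one per sequence
    groups: specificity-group label of each sequence, in the same order
    """
    n = len(column)
    joint = Counter(zip(column, groups))  # counts of (amino acid, group) pairs
    aa = Counter(column)                  # counts of each amino acid
    grp = Counter(groups)                 # number of sequences in each group
    total = 0.0
    for (a, g), c in joint.items():
        f_joint = c / n                   # f_p(alpha, i)
        total += f_joint * math.log(f_joint / ((aa[a] / n) * (grp[g] / n)))
    return total

groups = ["A", "A", "A", "B", "B", "B"]
print(mutual_information("LLLKKK", groups))  # log 2, about 0.693: ideal SDP
print(mutual_information("GGGGGG", groups))  # 0.0: invariant column
```

The invariant column scores zero even though it may be functionally important, which illustrates why an SDP is not equivalent to a functionally important position.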
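Taking into account the structure of the phylogenetic tree:
random shuffling and linear regression
Z-score
$$Z_p = \frac{I_p - I_p^{\mathrm{exp}}}{\sigma\!\left(I_p^{\mathrm{exp}}\right)}$$
where $I_p^{\mathrm{exp}}$ is the mutual information expected at position p, estimated by random shuffling and linear regression
[Figure: linear regression]
=> positions that are more specific than expected given the tree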
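The random-shuffling step can be sketched as follows. This is only the plain shuffling Z-score of a single column: the linear-regression correction for the tree structure mentioned on the slide is not reproduced here, and the function names, the number of shuffles and the seed are assumptions.

```python
import math
import random
from collections import Counter

def mutual_information(column, groups):
    """Mutual information between residues of one column and group labels."""
    n = len(column)
    joint, aa, grp = Counter(zip(column, groups)), Counter(column), Counter(groups)
    return sum((c / n) * math.log((c / n) / ((aa[a] / n) * (grp[g] / n)))
               for (a, g), c in joint.items())

def shuffle_z_score(column, groups, n_shuffles=1000, seed=0):
    """Z-score of the observed mutual information against shuffled columns."""
    rng = random.Random(seed)
    observed = mutual_information(column, groups)
    col = list(column)
    samples = []
    for _ in range(n_shuffles):
        rng.shuffle(col)  # permute residues among sequences; groups stay fixed
        samples.append(mutual_information(col, groups))
    mean = sum(samples) / n_shuffles
    sd = math.sqrt(sum((s - mean) ** 2 for s in samples) / n_shuffles)
    # An invariant column gives identical shuffles; report 0 instead of dividing by 0
    return (observed - mean) / sd if sd > 0 else 0.0

groups = list("AAAAABBBBB")
print(shuffle_z_score("LLLLLKKKKK", groups))  # positive: group-specific column
print(shuffle_z_score("LKLKLKLKLK", groups))  # negative: residues mixed across groups
```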
Smoothing: pseudocounts and similarity between amino acid residues
• m(ab) = amino acid substitution matrix
• n(a,i) = count of amino acid a at position i
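One common way to combine pseudocounts with residue similarity is sketched below. The smoothing formula itself did not survive in this transcript, so the weighting scheme, the parameter `kappa`, the function name and the toy three-letter matrix are all assumptions and not necessarily what SDPpred uses.

```python
def smoothed_frequencies(counts, m, kappa=1.0):
    """Smooth residue frequencies at one position with similarity-based pseudocounts.

    counts: dict residue -> n(a, i), observed counts at position i
    m:      dict b -> dict a -> weight of substituting b by a (rows sum to 1)
    kappa:  total weight given to the pseudocounts (assumed parameter)
    """
    total = sum(counts.values())
    alphabet = list(m)
    freqs = {}
    for a in alphabet:
        # Pseudocount for a: average similarity of a to the residues actually observed
        pseudo = sum(counts.get(b, 0) * m[b][a] for b in alphabet) / total
        freqs[a] = (counts.get(a, 0) + kappa * pseudo) / (total + kappa)
    return freqs

# Toy alphabet: I and L are similar, D is dissimilar (illustrative weights only)
m = {
    "I": {"I": 0.70, "L": 0.25, "D": 0.05},
    "L": {"I": 0.25, "L": 0.70, "D": 0.05},
    "D": {"I": 0.05, "L": 0.05, "D": 0.90},
}
freqs = smoothed_frequencies({"I": 4}, m)
print(freqs)  # L, similar to the observed I, gets more weight than D
```

With such smoothing, an unobserved residue that is similar to the observed ones receives a non-zero frequency, so small groups do not produce artificially sharp distributions.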
Automated threshold setting: the Bernoulli estimator
Are 5 SDP with Z-score > 12 better than 10 SDP with Z-score > 9?
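Sorted Z-scores: $Z_1 \ge Z_2 \ge \dots \ge Z_k \ge \dots$
$$k^* = \arg\min_k P(\text{there are at least } k \text{ observed } Z\text{-scores} \ge Z_k)$$
$$k^* = \arg\min_k \sum_{i=k}^{n} C_n^i \, p^i q^{\,n-i}$$
where n is the total number of Z-scores and
$$p = P(Z \ge Z_k) = \frac{1}{\sqrt{2\pi}} \int_{Z_k}^{\infty} \exp\!\left(-\frac{Z^2}{2}\right) dZ, \qquad q = 1 - p$$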
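The cutoff selection can be sketched in a few lines. The sketch assumes that Z-scores of non-specific positions follow a standard normal distribution and that positions are independent; the function names and the example Z-scores are hypothetical.

```python
import math

def normal_tail(z):
    """P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def prob_at_least(k, n, p):
    """P(at least k successes in n Bernoulli trials with success probability p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def bernoulli_cutoff(z_scores):
    """Return (k*, probabilities), where k* minimises P(at least k scores >= Z_k)."""
    z = sorted(z_scores, reverse=True)
    n = len(z)
    probs = [prob_at_least(k, n, normal_tail(z[k - 1])) for k in range(1, n + 1)]
    k_star = min(range(1, n + 1), key=lambda k: probs[k - 1])
    return k_star, probs

# Hypothetical Z-scores: three clear outliers followed by background values
z_scores = [12.0, 11.5, 11.0, 1.0, 0.5, 0.0, -0.3, -1.0, -1.2, -2.0]
k_star, probs = bernoulli_cutoff(z_scores)
print(k_star)  # number of positions reported as SDPs
```

This is how the question on the slide is answered: the set of top-scoring positions that is least likely to arise by chance is selected, whatever its size. The tail sum is evaluated directly, rather than as one minus the complementary sum, to avoid cancellation when the probabilities are tiny.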
Other similar techniques
• Evolutionary trace (Lichtarge et al. 1996, 1997) – need structure; gradual construction of group-specific consensus
• Evolutionary rate shifts (DIVERGE, Gu et al. 2002) – positions with group-specific evolutionary rate
• Surface patches of slowly evolving residues (Rate4Site, Pupko et al. 2002) – need structure
• PCA in the sequence space (Casari et al., 1995)
• Correlated mutations (Pazos and Valencia, 2002)
• Prediction of functional sub-types (Hannenhalli and Russell, 2000) – relative entropy of HMM profiles for groups
SDPpred: Web interface
Input: multiple alignment of proteins divided into specificity groups
=== AQP ===
%sp|Q9L772|AQPZ_BRUME
-------------------------------------mlnklsaeffgtfwlvfggcgsailaa--afp-------elgigflgvalafgltvltmayavggisg--ghfnpavslgltviiilgsts------------------------------slap------------------qlwlfwvaplvgavigaiiwkgllgrd---------------------------------------
%sp|P48838|AQPZ_ECOLI
-------------------------------------mfrklaaecfgtfwlvfggcgsavlaa--gfp-------elgigfagvalafgltvltmafavghisg--ghfnpavtiglwalvihgatd------------------------------kfap------------------qlwffwvvpivggiiggliyrtllekrd--------------------------------------
%tr|Q92ZW9
-------------------------------------mfkklcaeflgtcwlvlggcgsavlas--afp-------qvgigllgvsfafgltvltmaytvggisg--ghfnpavslglaviiilgsth------------------------------rrvp------------------qlwlfwiaplfgaaiagivwksvgeefrpvd-----------------------------------
=== GLP ===
%sp|P11244|GLPF_ECOLI
----------------------------msqt---stlkgqciaeflgtglliffgvgcvaalkvag---------a-sfgqweisviwglgvamaiyltagvsg--ahlnpavtialwlglilaltd------------------------------dgn--------------g-vpr-flvplfgpivgaivgafayrkligrhlpcdicvveek--etttpseqkasl--------------
%sp|P44826|GLPF_HAEIN
----------------------------mdks-----lkancigeflgtalliffgvgcv
…
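A minimal reader for an input of this kind, assuming the layout suggested by the example: a `=== NAME ===` line opens a specificity group, a line starting with `%` gives a sequence identifier, and the following lines hold the aligned sequence. The function name is hypothetical, and the real SDPpred format may have rules that this sketch ignores.

```python
def parse_grouped_alignment(text):
    """Parse a grouped alignment into {group name: {sequence id: aligned sequence}}."""
    groups = {}
    group = seq_id = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("===") and line.endswith("==="):
            group = line.strip("=").strip()   # "=== AQP ===" -> "AQP"
            groups[group] = {}
            seq_id = None
        elif line.startswith("%"):
            if group is None:
                raise ValueError("sequence header before any group header")
            seq_id = line[1:]
            groups[group][seq_id] = ""
        else:
            if seq_id is None:
                raise ValueError("sequence data before any sequence header")
            groups[group][seq_id] += line     # sequences may span several lines
    return groups

# Shortened, hypothetical input in the same layout
example = """\
=== AQP ===
%seq1
--mlnk
%seq2
--mfrk
=== GLP ===
%seq3
msqt--
"""
parsed = parse_grouped_alignment(example)
print({name: list(members) for name, members in parsed.items()})
```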
SDPpred: Output
Alignment of the family with the SDPs highlighted (Alignment view)
Detailed description of each SDP (List of SDPs)
Plot of probabilities used by the Bernoulli estimator to set the cutoff (Probability plot view)
Transcription factors from the LacI family
• Training set: 459 sequences, average length: 338 amino acids, 85 specificity groups
– 44 SDPs
10 residues contact NPF (analog of the effector)
6 residues in the intersubunit contacts
7 residues contact the operator sequence
7 residues in the effector contact zone (5 Å < dmin < 10 Å)
5 residues in the intersubunit contact zone (5 Å < dmin < 10 Å)
6 residues in the operator contact zone (5 Å < dmin < 10 Å)
LacI (lactose repressor) from E. coli (1jwl)
SDP clusters at the subunit contact region
[Figure labels: Effector, DNA operator, Cluster I, Cluster II]
Overall statistics (LacI of E. coli)
• Total 348 amino acids
• 44 SDPs
Non-contacting residues (distance to the DNA, effector, or the other subunit > 10 Å)
Contact zone (may be functional)
Contacting residues (distance to the DNA, effector, or the other subunit < 5 Å)
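The three distance classes can be computed from atomic coordinates as sketched below. The function names and the coordinates are hypothetical; the 5 Å and 10 Å thresholds are those of the slides, and the treatment of distances exactly equal to a threshold is an assumption.

```python
import math

def min_distance(residue_atoms, partner_atoms):
    """Smallest distance (in Å) between any atom of a residue and any atom of a partner."""
    return min(math.dist(a, b) for a in residue_atoms for b in partner_atoms)

def classify(d_min, contact=5.0, zone=10.0):
    """Assign a residue to one of the three classes by its minimal distance."""
    if d_min < contact:
        return "contacting"
    if d_min <= zone:            # boundary handling is an assumption
        return "contact zone"
    return "non-contacting"

# Hypothetical coordinates: one ligand atom and three residues
ligand = [(4.0, 0.0, 0.0)]
residues = {
    "res1": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    "res2": [(4.0, 7.0, 0.0)],
    "res3": [(4.0, 0.0, 12.0)],
}
for name, atoms in residues.items():
    print(name, classify(min_distance(atoms, ligand)))
```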
Membrane channels of the MIP family
• Training set: 17 sequences, average length 280 amino acids, 2 specificity groups: aquaporins and aquaglyceroporins
– 21 SDPs
8 residues contact glycerol (substrate) (dmin < 5 Å)
8 residues oriented to the channel
5 residues in the contacts with other subunits
GlpF from E. coli
GlpF (glycerol facilitator) from E. coli (1fx8)
[Figure labels: Cluster I, Cluster II, Subunit I, Substrate (glycerol)]
Two SDP clusters at the contact of subunits forming the tetramer
20Leu, 24Ile, 108Tyr of one subunit, 193Ser of another subunit
Glu43
Overall statistics (GlpF from E. coli)
• Total 281 amino acids
• 21 SDPs
Contacting residues (distance to the substrate or another subunit < 5 Å)
Non-contacting residues (distance to the substrate or another subunit > 10 Å)
Contact zone (may be functional)
Isocitrate/isopropylmalate dehydrogenases: combinations of specificities towards substrate and cofactor
• IDH: catalyzes the oxidation of isocitrate to α-ketoglutarate and CO2 (TCA cycle), using either NAD or NADP as a cofactor, in organisms from prokaryotes to higher eukaryotes
• IMDH: catalyzes the oxidative decarboxylation of 3-isopropylmalate into 2-oxo-4-methylvalerate (leucine biosynthesis) in prokaryotes and fungi; the cofactor is NAD
[Figure labels: Mitochondria, Archaea, Bacteria, Eukaryota]
Selecting specificity groups
1. By substrate: all IDHs vs. all IMDHs
2. By cofactor: all NAD-dependent vs. all NADP-dependent
3. Four groups: IDH (NAD), IDH (NADP) type I, IDH (NADP) type II, IMDH (NAD)
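The three groupings can be derived from a single annotation of each sequence, as in the sketch below. The sequence identifiers and the function name are hypothetical; the enzyme classes and their substrates and cofactors are those named on the slides.

```python
# Substrate and cofactor of each enzyme class
CLASSES = {
    "IDH (NAD)": ("isocitrate", "NAD"),
    "IDH (NADP) type I": ("isocitrate", "NADP"),
    "IDH (NADP) type II": ("isocitrate", "NADP"),
    "IMDH (NAD)": ("isopropylmalate", "NAD"),
}

def regroup(seq_to_class, scheme):
    """Split sequences into specificity groups by 'substrate', 'cofactor' or 'four groups'."""
    key = {
        "substrate": lambda c: CLASSES[c][0],
        "cofactor": lambda c: CLASSES[c][1],
        "four groups": lambda c: c,
    }[scheme]
    groups = {}
    for seq_id, cls in seq_to_class.items():
        groups.setdefault(key(cls), []).append(seq_id)
    return groups

# Hypothetical sequence identifiers
seqs = {
    "seqA": "IDH (NAD)",
    "seqB": "IDH (NADP) type I",
    "seqC": "IDH (NADP) type II",
    "seqD": "IMDH (NAD)",
    "seqE": "IMDH (NAD)",
}
for scheme in ("substrate", "cofactor", "four groups"):
    print(scheme, regroup(seqs, scheme))
```

Running the same SDP search on each grouping is what separates substrate-specific positions from cofactor-specific ones.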
Predicted SDPs
most SDPs near the substrate
SDPs near the substrate and the cofactor
SDPs near the substrate, the cofactor and the other subunit
SDPs, the cofactor and the substrate
NADP-dependent IDH from E. coli (1ai2)
[Figure labels: Substrate (isocitrate); Cofactor (NADP): nicotinamide nucleotide, adenine nucleotide]
344Lys, 345Tyr, 351Val: cofactor-specific SDPs, known determinants of specificity to cofactor
100Lys, 104Thr, 105Thr, 107Val, 337Ala, 341Thr: substrate-specific and four-group SDPs, functionally not characterized
SDPs predicted for different groupings
[Figure: SDPs predicted under the three groupings (cofactor-specific SDPs, substrate-specific SDPs, four groups). Residues shown: 154Glu, 158Asp, 208Arg, 229His, 231Gly, 233Ile, 287Gln, 300Ala, 305Asn, 308Tyr, 327Asn, 344Lys, 345Tyr, 351Val, 38Gly, 40Asp, 100Lys, 103Leu, 105Thr, 115Asn, 155Asn, 164Glu, 241Phe, 337Ala, 341Thr, 97Val, 98Ala, 104Thr, 107Val, 152Phe, 161Ala, 162Gly, 232Asn, 245Gly, 31Tyr, 323Ala, 36Gly, 45Met]
Color code:
• Contacts cofactor
• Contacts substrate AND cofactor
• Contacts substrate
• Contacts substrate AND the other subunit
• Contacts the other subunit
Overview
• Transcription factors: contacts with the cofactor and the DNA
• Transporters: contacts with the substrate
• Enzymes: contacts with the substrate and the cofactor
And all:
• contacts between subunits
Protein-DNA interactions
[Figure panels: CRP, PurR, IHF, TrpR]
Entropy at aligned sites (blue plots) and the number of contacts (red: heavy atoms in a base pair at a distance <cutoff from a protein atom)
The observed correlation does not depend on the distance cutoff
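A sketch of the underlying calculation: Shannon entropy of each column of aligned binding sites, correlated with the number of protein contacts per base pair. The aligned sites and contact counts below are invented for illustration and are not data from the talk; the function names are assumptions.

```python
import math
from collections import Counter

def column_entropy(column):
    """Shannon entropy (bits) of one column of aligned sites."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical aligned sites and hypothetical contact counts per position
sites = ["TGTGA", "TGTGA", "TGAGC", "TGCGT"]
contacts = [6, 5, 1, 6, 0]

entropy = [column_entropy(col) for col in zip(*sites)]
print(entropy)                     # low values mark conserved positions
print(pearson(entropy, contacts))  # negative here: conserved positions have more contacts
```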
CRP/FNR family of regulators
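[Figure: tree of the CRP/FNR family with the binding-site consensus of each group]
CooA (Desulfovibrio): TGTCGGCnnGCCGACA
HcpR (Desulfovibrio): TTGTgAnnnnnnTcACAA
CRP (Gamma-proteobacteria): TTGTGAnnnnnnTCACAA
FNR (Gamma-proteobacteria): TTGATnnnnATCAA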
Correlation between contacting nucleotides and amino acid residues
• CooA in Desulfovibrio spp.
• CRP in Gamma-proteobacteria
• HcpR in Desulfovibrio spp.
• FNR in Gamma-proteobacteria
DD COOA ALTTEQLSLHMGATRQTVSTLLNNLVR
DV COOA ELTMEQLAGLVGTTRQTASTLLNDMIR
EC CRP KITRQEIGQIVGCSRETVGRILKMLED
YP CRP KXTRQEIGQIVGCSRETVGRILKMLED
VC CRP KITRQEIGQIVGCSRETVGRILKMLEE
DD HCPR DVSKSLLAGVLGTARETLSRALAKLVE
DV HCPR DVTKGLLAGLLGTARETLSRCLSRMVE
EC FNR TMTRGDIGNYLGLTVETISRLLGRFQK
YP FNR TMTRGDIGNYLGLTVETISRLLGRFQK
VC FNR TMTRGDIGNYLGLTVETISRLLGRFQK
Site consensus:
CooA: TGTCGGCnnGCCGACA
HcpR: TTGTgAnnnnnnTcACAA
CRP: TTGTGAnnnnnnTCACAA
FNR: TTGATnnnnATCAA
Contacting residues: REnnnR
TG: 1st arginine
GA: glutamate and 2nd arginine
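The contacting residues can be read off the aligned fragments programmatically. In this sketch the three columns are located by finding the REnnnR pattern in E. coli CRP and are then applied to the other, equally long fragments; the function names and this way of locating the columns are assumptions made for illustration.

```python
import re

# Aligned helix fragments from the slide (species code, factor) -> sequence
FRAGMENTS = {
    ("DD", "COOA"): "ALTTEQLSLHMGATRQTVSTLLNNLVR",
    ("DV", "COOA"): "ELTMEQLAGLVGTTRQTASTLLNDMIR",
    ("EC", "CRP"): "KITRQEIGQIVGCSRETVGRILKMLED",
    ("YP", "CRP"): "KXTRQEIGQIVGCSRETVGRILKMLED",
    ("VC", "CRP"): "KITRQEIGQIVGCSRETVGRILKMLEE",
    ("DD", "HCPR"): "DVSKSLLAGVLGTARETLSRALAKLVE",
    ("DV", "HCPR"): "DVTKGLLAGLLGTARETLSRCLSRMVE",
    ("EC", "FNR"): "TMTRGDIGNYLGLTVETISRLLGRFQK",
    ("YP", "FNR"): "TMTRGDIGNYLGLTVETISRLLGRFQK",
    ("VC", "FNR"): "TMTRGDIGNYLGLTVETISRLLGRFQK",
}

def contact_columns(reference, pattern=r"RE...R"):
    """0-based columns of the 1st arginine, the glutamate and the 2nd arginine."""
    match = re.search(pattern, reference)
    if match is None:
        raise ValueError("pattern not found in the reference sequence")
    start = match.start()
    return (start, start + 1, start + 5)

def specific_residues(fragments, reference_key=("EC", "CRP")):
    """Residues of every fragment at the columns defined by the reference."""
    cols = contact_columns(fragments[reference_key])
    return {key: "".join(seq[c] for c in cols) for key, seq in fragments.items()}

for (species, factor), residues in specific_residues(FRAGMENTS).items():
    print(species, factor, residues)
```

The extracted triplets (RER for CRP and HcpR, RQT for CooA, VER for FNR) agree with the "Specific aa" column of the table that follows.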
The correlation holds for other factors in the family
Factor | Organisms | Consensus | Specific aa | Metabolic system | Inducer
CRP | Enterobacteria & Vibrio & Pasteurellaceae | TTGTGAnnnnnnTCACAA | R E R | catabolic repression | cAMP
VFR | Pseudomonas sp. | TTGTGAnnnnnnTCACAA | R E R | virulence | cAMP
CLP | Xanthomonas & Xylella sp. | nTGTGAnnnnnnTCACAn | R E R | phytopathogenicity | ? (not cAMP)
FNR & ANR | Gamma-proteobacteria | nnTTGATnnnnATCAAnn | V E R | response to anaerobiosis | O2, NO
FNR | Beta-proteobacteria | nnTTGATnnnnATCAAnn | L E R | response to anaerobiosis | O2
FNR & FixK | Alpha-proteobacteria | nnTTGATnnnnATCAAnn | I/L E R | nitrogen fixation | O2
DNR & Nnr | Pseudomonas & Paracoccus | nnTTGATnnnnATCAAnn | P E R | denitrification | NO, NO2
FNR | Bacillus sp. | nTGTGAnnTAnnTCACAn | R E R | response to anaerobiosis | O2-low conditions
PrfA | Listeria | nnTTAACAnnTGTTAAnn | S S R | virulence | ?
NtcA | Cyanobacteria | ntGTAnCnnnnGnTACan | R V R | nitrogen metabolism | 2-oxoglutarate
CysR | Cyanobacteria | ? | R V R | sulfate utilization | sulfate?
CooA | Desulfovibrio sp. and R. rubrum | nTGTCGGCnnGCCGACAn | R Q T | CO utilization | CO
HcpR* | Desulfovibrio sp. | TTGTgAnnnnnnTcACAA | R E R | prismane & sulfate reduction | ?
HcpR* | Desulfuromonas acetoxidans, Desulfotalea psychrophila | atTTGAccnnggTCAAat | S/P E R | prismane | ?
HcpR* | Clostridia, Bacteroides, Thermotogales, Fusobacteria, Treponema | ctGTAACawwtCTTACag | R P R | prismane | ?
HcpR* | ~P. gingivalis | nTGTCGCnnnnGCGACAn | R A R | prismane | ?
HcpR* | ~C. difficile | nnGGATnnnnnnATCCnn | R S R | prismane | ?
HcpR* | ~T. tengcongensis, D. hafniense | nTGTGAnnnnnnTCACAn | R E R | prismane | ?
HcpR* | ~Acidithiobacillus ferrooxidans | nCTTGATTnnAATCAAGn | P E R | prismane | ?
ArcR | Bacillus, Enterococcus sp. | nTGTGAnATATnTCACAn | R E A/S | arginine catabolism | O2
CprK | Desulfitobacterium dehalogenans | nnTTAnTGnnCAnTAAnn | H V R/K | halorespiration | aromatics
FlpA&B | Lactococcus lactis | nnTTGATnnnnATCAAnn | P E R | ? | Eh, O2
Plans and perspectives. Protein-DNA interactions
LacI family of transcriptional regulators (each branch represents a subfamily) … and their signals
[Figure: phylogenetic tree of the LacI family with branches numbered 1 to 43, and sequence logos of the corresponding binding signals. Feeder pathways shown by branch colour: D-galactose & galactosides, maltose & trehalose, sucrose, D-fructose, D-ribose, D-xylose.]
Each orthologous group is reduced to a single representative. The branch colour denotes the feeder pathway regulated. The experimental data available for at least one regulator of an orthologous group are shown by the type-face of species designations (experimentally confirmed sites, experimentally confirmed regulation) and by the branch outline thickness (the thicker line indicates an experimentally confirmed pathway). The logos' numbering corresponds to the branch numbers of the tree.
Asterisks (one to four) next to the branch numbers distinguish: regulon predicted de novo; signal proposed de novo; new regulon members proposed; new sites predicted.
1605 regulators from 189 genomes, forming 302 groups of orthologs and binding 2518 sites
Plans and perspectives. Experimental verification
• A new family of Ni/Co transporters
• No structural data
• Specificity predicted by comparative genomics
• Predicted SDPs form several clusters in the alignment, are located on the same sides of alpha-helices
• Mutational analysis
Terminators of translation in prokaryotes / decoding of stop codons. Specificity of RF1 (UAG, UAA) and RF2 (UGA, UAA)
Fragment of the alignment (117 pairs). SDPs are shown by black boxes above the alignment.
“Interesting” positions: invariant, SDPs, variable rate.
SDPs and invariant positions: two decoding sites?
Plans and perspectives
• Use of 3D structures, when available. Identification of functional sites as spatial clusters of SDPs and conserved positions
• Automated identification of specificity groups based on the analysis of the phylogenetic tree
• Protein-DNA interactions
• Identification of protein-protein contact surfaces
Publications
• N.J.Oparina, O.V.Kalinina, M.S.Gelfand, L.L.Kisselev (2005) Common and specific amino acid residues in the prokaryotic polypeptide release factors RF1 and RF2: possible functional implications. Nucleic Acids Research 33 (in press).
• O.V.Kalinina, A.A.Mironov, M.S.Gelfand, A.B.Rakhmaninova (2004) Automated selection of positions determining functional specificity of proteins by comparative analysis of orthologous groups in protein families. Protein Science 13: 443-456.
• O.V.Kalinina, P.S.Novichkov, A.A.Mironov, M.S.Gelfand, A.B.Rakhmaninova (2004) SDPpred: a tool for prediction of amino acid residues that determine differences in functional specificity of homologous proteins. Nucleic Acids Research 32: W424-W428.
• O.V.Kalinina, M.S.Gelfand, A.A.Mironov, A.B.Rakhmaninova (2003) Amino acid residues forming specific contacts between subunits in tetramers of the membrane channel GlpF. Biophysics (Moscow) 48: S141-S145.
• L.A.Mirny, M.S.Gelfand (2002) Using orthologous and paralogous proteins to identify specificity determining residues in bacterial transcription factors. Journal of Molecular Biology 321: 7-20.
• L.Mirny, M.S.Gelfand (2002) Structural analysis of conserved base-pairs in protein-DNA complexes. Nucleic Acids Research 30: 1704-1711.
• http://math.belozersky.msu.ru/~psn/
Acknowledgements
• Leonid Mirny (Harvard, MIT)
• Olga Kalinina
• Andrei A. Mironov
• Alexandra B. Rakhmaninova
• Dmitry Rodionov
• Olga Laikova
• Howard Hughes Medical Institute
• Ludwig Institute of Cancer Research
• Russian Fund of Basic Research
• Russian Academy of Sciences, programs “Molecular and Cellular Biology” and “Origin and Evolution of the Biosphere”