Protein Sci. 1994 3: 866-875
N. N. Alexandrov and N. Go
Biological meaning, statistical significance, and classification of local spatial similarities in nonhomologous proteins

Supplementary data: "Data Supplement" http://www.proteinscience.org/cgi/content/full/3/6/866/DC1
References: Article cited in: http://www.proteinscience.org/cgi/content/abstract/3/6/866#otherarticles
To subscribe to Protein Science go to: http://www.proteinscience.org/subscriptions/

© 1994 Cold Spring Harbor Laboratory Press
Downloaded from www.proteinscience.org on January 30, 2008 - Published by Cold Spring Harbor Laboratory Press



Protein Science (1994), 3:866-875. Cambridge University Press. Printed in the USA. Copyright © 1994 The Protein Society

Biological meaning, statistical significance, and classification of local spatial similarities in nonhomologous proteins

NICKOLAI N. ALEXANDROV¹ AND NOBUHIRO GO²
¹ Protein Engineering Research Institute, 5-2-3 Furuedai, Suita, Osaka 565, Japan
² Chemistry Department, Faculty of Science, Kyoto University, Kyoto 606, Japan

(RECEIVED November 30, 1993; ACCEPTED April 5, 1994)

Abstract

We have completed an exhaustive search for the common spatial arrangements of backbone fragments (SARFs) in nonhomologous proteins. This type of local structural similarity, incorporating short fragments of backbone atoms, arranged not necessarily in the same order along the polypeptide chain, appears to be important for protein function and stability. To estimate the statistical significance of the similarities, we have introduced a similarity score. We present several locally similar structures, with a large similarity score, which have not yet been reported. On the basis of the results of pairwise comparison, we have performed a hierarchical cluster analysis of protein structures. Our analysis is not limited to the comparison of single chains but also includes complex molecules consisting of several subunits. SARFs with backbone fragments from different polypeptide chains provide a stable interaction between subunits in protein molecules. In many cases the active site of enzymes is located at the same position relative to the common SARFs, implying that certain SARFs function as a universal interface for protein-substrate interaction.

Keywords: active site; protein families; similarity in protein structures; structure-function relationship

Comparison of protein 3-dimensional structures allows us to reveal various important aspects of proteins beyond what we can learn from comparison of amino acid sequences. Three-dimensional structural similarities can sometimes be found among nonhomologous proteins and lead us to discover important relationships among them. There are several well-known cases where the whole 3-dimensional structures of nonhomologous proteins or domains are very similar to each other and thus are likely to have a common ancestor. Probably the most striking example of such a case of evolutionary relationship is the α/β barrel enzymes (Farber & Petsko, 1990). Local structural similarity may have a functional meaning, for example, binding to DNA (Steitz, 1990). Varieties of biologically meaningful types of structural similarity have been detected by means of different algorithms suitable for each type (Abagyan & Maiorov, 1988; Mitchell et al., 1989; Sali & Blundell, 1989; Taylor & Orengo, 1989; Vriend & Sander, 1991; Alexandrov et al., 1992; Fischer et al., 1992; Holm & Sander, 1993a).

Based on the observed similarities among 3-dimensional structures, proteins have been classified into structural families (Levitt & Chothia, 1976; Richardson, 1981; Chothia & Finkelstein, 1990; Overington et al., 1990; Rackovsky, 1990; Murzin & Chothia, 1992; Pascarella & Argos, 1992; Holm & Sander, 1993b; Orengo et al., 1993; Yee & Dill, 1993). From the results of comparison of 3-dimensional structures and of amino acid sequences, Chothia (1992) estimated the total number of possible protein families with different basic folding patterns, which appeared to be surprisingly small: not more than several thousand. Earlier, Finkelstein and Ptitsyn (1987) explained the observed regularity of folding patterns of secondary structures in proteins in terms of their thermodynamic stability.

Reprint requests to: Nickolai Alexandrov, Laboratory of the Mathematical Biology, Building 469, Room 151, FCRF-NCI, Frederick, Maryland 21702; e-mail: [email protected].

Following our previous paper (Alexandrov et al., 1992), we are interested in this paper in the most general case of local main-chain structural similarities, which are formed by backbone fragments with possibly different sequence connectivities among them or even with some disconnectivities when they belong to different polypeptide chains. In the previous paper we called such structural elements SARFs (spatial arrangements of backbone fragments). In the previous paper we also developed an efficient algorithm to detect common SARFs between a pair of proteins and reported some interesting cases of common SARFs. In this paper we report results of an exhaustive search through all entries in the Protein Data Bank (PDB) (Bernstein et al., 1977; Abola et al., 1987).

Several new examples of this kind of spatial similarity characterize common SARFs as stable and/or functionally important regions in protein structure, which may occur in unrelated proteins. We give the result of a hierarchical cluster analysis of similar SARFs, based on an exhaustive search through the PDB. According to this analysis, all the proteins can be divided into several clusters.

Results

Significance of similar SARFs

We have introduced a similarity score to evaluate the statistical significance of common SARFs, derived from a statistical analysis of the RMS distance (RMSD) distribution among all possible SARFs of different lengths in a pair of unrelated proteins (Alexandrov et al., 1992) (Fig. 1). We suggest the following function as a measure of local similarity between 2 proteins:

$$S = 1.37 + (1.16L - 15.1)^{1/2} - R,$$

where L is the number of residues in the compared SARFs and R is the RMSD between the SARFs after superposition. This similarity score makes it possible to compare the significance of common SARFs of different sizes. Recently, Maiorov and Crippen (1994) have developed a nonstatistical criterion for structural similarity, which leads basically to the same function of similarity score. The similarity score can be expressed in units of standard deviation from the expected value of RMSD: $S_\sigma = 2.1 + S/1.8$. Thus, if S = 4, then $S_\sigma = 4.3$ and the corresponding RMSD is smaller than the mean by 4.3σ.
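As a minimal sketch, the two formulas above can be evaluated directly. The function names below are illustrative and not taken from the SARF program; the guard on short SARFs is an added assumption that simply keeps the square root real.

```python
import math


def similarity_score(n_residues: int, rmsd: float) -> float:
    """S = 1.37 + (1.16 L - 15.1)^(1/2) - R, as given in the text."""
    radicand = 1.16 * n_residues - 15.1
    if radicand < 0:
        # Assumption: the formula is only meaningful when the root is real (L >= 14).
        raise ValueError("SARF too short for the square-root term")
    return 1.37 + math.sqrt(radicand) - rmsd


def score_in_sigma(score: float) -> float:
    """S_sigma = 2.1 + S / 1.8: the score in units of standard deviation."""
    return 2.1 + score / 1.8


# 3bcl / 2ltn entry of Table 1: 73 residues superimposed with RMSD 2.47 A.
print(round(similarity_score(73, 2.47), 2))
# The worked example from the text: S = 4 corresponds to about 4.3 sigma.
print(round(score_in_sigma(4.0), 1))
```

Because the square-root term grows with L while R is subtracted, a longer SARF can tolerate a larger RMSD and still reach the same score, which is what allows SARFs of different sizes to be ranked on one scale.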

Common SARFs

We report a list of common SARFs with a similarity score S ≥ 5.0 in Table S1 on the Diskette Appendix; those that are discussed in the text appear in Table 1. As one can expect, at the top of the list are the TIM-barrel structures from muconate lactonizing enzyme (PDB identifier 1mle) and enolase (7enl). The second highest score was detected for retinol binding protein (1rbp) and bilin binding protein (1bbp). Third place is occupied by the common SARF from poliovirus (2plv) and tomato bushy stunt virus (2tbv). The similarity in these top 3 cases has already been discussed in the literature (Lebioda & Stec, 1988; Huber et al., 1987; Hogle et al., 1985).

The similarity between bacteriochlorophyll-A (3bcl) and pea lectin (2ltn) is just a large β-sheet, formed by 8 strands. However, this is an interesting example because it consists of alternating strands from different chains of pea lectin. We have observed several other cases of this kind of subunit-subunit interface, for example, a β-sheet in prealbumin (2pab), composed of fragments from the A and B subunits.

Fig. 1. A: Distribution of the RMSD values (axis: RMSD, Å) between all possible pairs of continuous 45-residue fragments from 2 unrelated proteins. As a measure of the lower bound of this distribution, the value R_L is determined from the condition that only 1% of pairs of fragments have a smaller value of RMSD. The RMSD corresponding to the similarity score S = 4.0 is shown as R. Assuming a normal distribution, this value of the similarity score corresponds to a deviation of 4.3σ from the mean. B: Dependence of the value of R_L on the lengths of the fragments (axis: size of fragments in residues). An approximation, used in the definition of the similarity score, $R_L = 1.37 + (1.16L - 15.1)^{1/2}$, is shown by a solid line.

Table 1. Common SARFs with a similarity score larger than 5.0 (a)

| Protein 1 | Protein 2 | Size of SARF (residues) | RMSD (Å) | Similarity score, S | Corresponding residues, protein 1 | Corresponding residues, protein 2 |
| 3bcl | 2ltn | 73 | 2.47 | 7.24 | 3-12, 22-30, 40-47, 58-68, 72-81, 94-100, 234-252, 257-264 | D38-D47, C1-C9, A61-A68, A159-A169, A171-A180, B3-B9, B38-B47, A1-A8 |
| 1wsy | 3grs | 64 | 3.06 | 6.45 | B113-B136, B143-B159, B207-B220, B350-B365 | 196-219, 232-248, 105-118, 338-353 |
| 3lzm | 256b | 73 | 3.45 | 6.26 | 59-80, 92-107, 114-124, 125-134, 143-156 | A84-A105, B91-B106, B3-B13, B29-B38, B59-B72 |
| 1cts | 2cpp | 64 | 3.13 | 5.93 | 401-412, 166-193, 260-270, 11-23 | 157-168, 240-268, 361-371, 201-213 |
| 1rnh | 2sc2 | 57 | 2.64 | 5.87 | 3-9, 30-37, 51-59, 61-79, 106-111, 115-122 | A110-A176, A45-A52, A148-A156, B323-B351, B313-B318, B386-B393 |
| 1rnh | 8api | 58 | 2.83 | 5.76 | 3-14, 19-29, 30-40, 52-59, 117-123, 127-136 | A345-A356, A181-A191, A112-A221, A158-A165, A293-A299, A54-A63 |
| 1rnh | 3icd | 59 | 3.12 | 5.56 | 3-8, 22-28, 29-43, 51-59, 72-77, 113-119, 133-141 | 332-337, 322-328, 124-138, 355-363, 39-44, 28-34, 310-318 |

(a) Only those SARFs that are discussed in the text are shown here. The complete table is provided on the Diskette Appendix (Table S1).

It has been pointed out that the active site in α/β proteins is always located at the same position relative to the secondary-structure pattern (Branden, 1980). It is also known that TIM-barrel structures always have their active site at the same spatial location (Farber & Petsko, 1990). The question arises as to how common this tendency is for other SARFs. We have analyzed in detail the first 89 common SARFs from Table S1. In all these cases, the similarity score was greater than 6.0. Thirty-nine of these similarities are globally similar structures (i.e., more than 60% of all residues form the common SARF) that obviously have their active site at the same location. In 24 cases, the active site was not indicated in at least 1 of the 2 proteins with a common SARF. In 15 cases, the SARF is a typical α/β wound motif, which always contains an active site in the same location. In 8 other cases, the active site is located at a similar position relative to a common SARF. Only 3 cases are exceptions, in which the active sites are located in different places. We also analyzed 22 less significant similarities from the bottom of Table S1. There are no globally similar structures among them; 2 cases have no active site, in 13 cases the active sites occupy similar positions, and in 7 cases the active sites are in different places. Thus, even if we exclude globally similar structures and the well-known α/β motif, 21 of 31 SARFs have overlapping active sites. Active sites were defined as overlapping if at least 20% of the atoms of one active site are close to one of the atoms from the second active site. Atoms are considered to be close if the distance between them is less than 4 Å. Of course, such overlap sometimes occurs by chance. The probability of chance overlap of the active sites was estimated by superimposing randomly created SARFs. The probability varies from protein to protein but on average is about 0.3 and was less than 0.5 for all the pairs of proteins we analyzed. We have found many cases where the location of active sites relative to a common SARF is highly conserved, even when the proteins have different topologies, functions, and/or global structures. We describe below just a few such examples.
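The overlap criterion for active sites (at least 20% of the atoms of one site within 4 Å of an atom of the other) can be sketched as follows. This is an illustration only: the function names are hypothetical, coordinates are assumed to be given after superposition of the common SARF, and testing the criterion in both directions is an assumption, since the text does not say which site serves as the reference.

```python
import math


def fraction_close(site_a, site_b, cutoff=4.0):
    """Fraction of atoms in site_a closer than `cutoff` angstroms to any atom of site_b."""
    if not site_a:
        return 0.0
    close = sum(
        1 for a in site_a
        if any(math.dist(a, b) < cutoff for b in site_b)
    )
    return close / len(site_a)


def sites_overlap(site_a, site_b, cutoff=4.0, min_fraction=0.2):
    """True if either site has at least `min_fraction` of its atoms near the other site.

    Assumption: the criterion is applied in both directions.
    """
    return (fraction_close(site_a, site_b, cutoff) >= min_fraction
            or fraction_close(site_b, site_a, cutoff) >= min_fraction)


# Toy coordinates (angstroms), not taken from any PDB entry.
site_1 = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (20.0, 0.0, 0.0),
          (30.0, 0.0, 0.0), (40.0, 0.0, 0.0)]
site_2 = [(0.0, 0.0, 3.0)]
print(sites_overlap(site_1, site_2))
```

The chance-overlap probability quoted in the text could be estimated with the same test by applying it to sites superimposed through randomly created SARFs and counting how often it succeeds.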

The similarity between selenomethionyl ribonuclease H (1rnh) and serine carboxypeptidase II (2sc2) seems to be important for their function because the locations of the active-site residues fit nicely to each other after superposition of the SARFs (Fig. 2). A similar location of the active-site residues relative to the common SARF was also observed in the isocitrate dehydrogenase (3icd) and ribonuclease H (1rnh) structures (Fig. 3). However, it is difficult to ascribe any functional role to another common SARF involving ribonuclease H, i.e., the SARF from ribonuclease H (1rnh) and modified α1-antitrypsin (8api), although it has a large similarity score (Fig. 4).

A common SARF from tryptophan synthase (1wsy) and glutathione reductase (3grs) also appears to have a functional meaning. After superposition of the similar portions of these 2 proteins, their coenzymes occupy remarkably similar positions (Fig. 5).

In many cases there is no obvious relation between similar SARFs and active sites of proteins, although the similarity is statistically significant. The reason for such common SARFs could be their structural stability. Examples of such stable similar arrangements of α-helices are the common SARFs in the structures of T4 lysozyme (3lzm) and cytochrome b562 (256b) (Fig. 6) and in the structures of citrate synthase (1cts) and cytochrome P450CAM (2cpp), both consisting of α-helical fragments (Fig. 7).

While searching for common SARFs, we considered the whole protein structure. However, in all cases, the similarities found are formed mostly by α-helical or β-strand fragments. It is not clear whether this is an intrinsic property of common SARFs or a result of the implemented procedure.

Hierarchical cluster analysis

Frequent occurrence of multiple SARFs implies a hierarchical organization, where a part of a SARF common to one pair of proteins matches a part of a different SARF common to a different pair of proteins, forming a common core. To investigate such relationships, we have performed a hierarchical cluster analysis. In this analysis the most similar structures are united first, then less similar clusters are joined to them, and so on. In this way the hierarchical tree leads us to a natural classification of proteins into families. In Figure 8 we show a possible classification of the proteins from our compilation.

Fig. 2. Common SARF consisting of 57 residues from ribonuclease H (1rnh, red) and serine carboxypeptidase II (2sc2, blue). Active-site residues of 1rnh (Asp 10, Glu 48, Asp 70) and 2sc2 (Asp B338, His B397, Ser A146) are shown in green and purple, respectively.

Fig. 3. Common SARF consisting of 59 residues from ribonuclease H (1rnh, red) and isocitrate dehydrogenase (3icd, blue). Active-site residues of 1rnh (Asp 10, Glu 48, Asp 70) are shown in green. The position of NADP binding (purple) to 3icd fits well with the ribonuclease active-site position.

Fig. 4. Common SARF consisting of 58 residues from ribonuclease H (1rnh, red) and α1-antitrypsin (8api, blue).

Fig. 5. Common SARF consisting of 64 residues from tryptophan synthase (1wsy, red) and glutathione reductase (3grs, blue). Coenzymes of these proteins (pyridoxal phosphate for 1wsy [green] and the prosthetic group FAD for 3grs [purple]) have the same position relative to the common SARF.

In the procedure of construction of the hierarchical tree, a new protein is joined to a structural family only if this protein is similar to most of the other structures from the family. Two similar structures can belong to different families if each of these structures contains a common family SARF. One can see, for example, that glutathione reductase (3grs) and tryptophan synthase (1wsy) have been classified into different clusters, despite the fact that they have a very high similarity score, S = 6.45; 64 Cα atoms from these proteins can be superimposed with RMSD = 2.6 Å. This happens because this SARF does not occur in other proteins. At the same time, 1wsy shares the α/β barrel structure with several proteins of this family, and 3grs is related to the mutually similar proteins of α/β doubly wound structure. Thus, this presentation of the results hides some significant pairwise similarities. There are a few more cases of such hidden similarities: for example, a common SARF in hemerythrin (1hmo) and the Klenow fragment of DNA polymerase I (1dpi) (similarity score S = 6.4) and the SARFs, described in Table 1, involving ribonuclease H (1rnh).

Discussion

We have compared all nonhomologous 3-dimensional protein structures and report here the most significant common SARFs. Essential features of this kind of similarity are: (1) the common regions of structures may include only a small portion of the molecule; (2) the connectivity between the fragments forming the common SARF can be different; and (3) the fragments may belong to different subunits of a protein molecule. We estimated the statistical significance of structural similarities from the RMSD of the superimposed SARFs and the number of amino acids in them. Finkelstein and Ptitsyn (1987) explained the occurrence of these nonrandom similarities in terms of protein stability. Our results show that there is another possible reason: function. It may well be that these similarities serve as a standard interface for protein-substrate interaction, providing a rigid framework for proper orientation of the substrate and the amino acids of the active site.

Fig. 6. Common SARF consisting of 73 residues from T4 lysozyme (3lzm, red) and cytochrome b562 (256b, blue). Note that 1 helix from 256b belongs to the second chain of the dimer.

It is convenient to present the results of pairwise comparison by dividing structures into several clusters. We think that significantly similar SARFs should be an essential part of the protein structure, involved in its functional activity or providing stability of the molecule. In this sense, classification of protein structures based on the similarity of common SARFs is more meaningful than one based on comparison of whole structures. This classification of protein structures is presented in Figure 8. The first cluster combines all the TIM-barrel structures from the compilation we used. There is also a structure of cellobiohydrolase II (3cbh) in this cluster. This structure has a different topology from a standard TIM-barrel, but the spatial arrangement of 7 β-strands and surrounding helices closely resembles the typical TIM-barrel. Two other proteins, aconitase (5acn) and thymidylate synthase (2tsc), also have similar arrangements of several helices and strands and thus are related to this cluster. The second cluster unites proteins mainly with α/β doubly wound topology. It contains, however, cases with different connectivity along the polypeptide chain, for example, the spinach ferredoxin (2fnr) and serine carboxypeptidase II (2sc2) structures. Methods sensitive to the order of the aligned fragments relate these proteins to different clusters. The next cluster consists of cytochrome molecules. Two parallel α-helices are the common SARF for the fourth cluster. Because we consider only local similarities, it is possible to unite into 1 cluster globally quite different proteins, such as thermolysin (1tln), a large molecule of 316 residues, and a small calcium binding protein (3icb) of 76 residues. This unification is sensible because, after superposition of the common SARF, the Zn2+ binding site in 1tln and the Ca2+ binding site in 3icb overlap.

The distribution of the number of structures among clusters is not uniform (Fig. 9). This nonuniformity can be a reason to revise the estimate of the number of structural families made by Chothia (1992). His calculation was based on the fact that about 1/12 of newly sequenced genes belong to 1 of the 120 already known structural families, and on the assumption that the distribution of proteins among the families is uniform, i.e., all the families contain about the same number of different proteins. In this case, the total number of protein families, F, is

F = 120 * 12 = 1,440.

Fig. 7. Common SARF consisting of 64 residues from citrate synthase (1cts, red) and cytochrome P450CAM (2cpp, blue).

However, some families, such as TIM barrels or α/β wound structures, are much more populated than others (Fig. 8). It is not possible to derive the formula of this distribution from the plot in Figure 9 because the PDB is too biased and cannot be considered as a random set of structures. Yet, lacking further information, it would be natural to suppose that the distribution is normal. The number of proteins in a family x would then be equal to:

$$n(x) = \frac{2N}{\sigma\sqrt{2\pi}} \exp\left(-\frac{x^{2}}{2\sigma^{2}}\right),$$

where N is the total number of different proteins in nature, and σ is the standard deviation of the distribution. We can roughly estimate the total number of different proteins as $N = 10^{10}$. The result of our calculations depends on N as $(\log N)^{1/2}$, so the precise value of N is not important. The fact that 1/12 of the proteins belong to 120 protein families means that

$$\sum_{x=0}^{120} n(x) = \frac{1}{12} N.$$

From this equation we find that σ = 1,200.

The number of different families, F, is equal to the number of the last family containing at least 1 protein, and can be calculated from the equation

$$n(F) = 1.$$

Solving this equation, we find that the number of protein families is F = 6,727. Compared with Chothia's (1992) estimate, the number of protein families has increased but still remains on the order of several thousand.

Fig. 8. Cluster memberships for non-single-case clusters. The original figure has the columns Cluster, PDB identifiers, and Schematic diagram of the common SARF. PDB identifiers listed in the figure:

1mle, 7enl, 5rub, 1fcb, 1wsy, 3cbh, 5tim, 2taa
2trx, 1gp1, 2ci2, 3dfr, 1ak3, 2aat, 1tgl, 2sc2, 3pgm,
3icd, 2fnr, 3cpa, 1pfk, 1tpt, 1rle,
2atc, 3gbp, 2lbp, 1lap, 1ldb, 8adh,
2prk, 5p21, 4fxn, 1rhd
1cc5, 351c, 1ccr
2tma, 3gap, 1gcn
2hmg, 1mli, 1ctf
256b, 2ccy, 3flx, 1hmo, 1brd, 2tmv, 1prc, 4bp2
2phh, 3grs
1hsc, 1hkg, 2cpp, 1cts
1dpi, 2mlt
1lrp, 3cro
4cpv, 3cln, 1scp, 3icb, 3lzm, 3tln
2ts1, 2utg, 2cyp, 1eca, 1wrp
1gd1, 1cla, 1fkf
1rbp, 1bbp
1pte, 2blm
1rnh, 1mon
1psg, 2hvp, 2rsp
1pcy, 1paz, 2aza
1cdt, 3ebx
2alp, 1ton
2fxb, 4fd1
1phs, 1gcr
3dpa, 1mad, 2sod, 1acx
2bus, 2ovo
2plv, 2tbv, 1bmv, 2stv, 1tnf, 3bcl, 2ltn, 8api, 2cd4, 2fbj, 3hla, 2pab

Fig. 9. Distribution of the number of protein structures among different clusters (axis: cluster, x).
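The two family-count estimates from the Discussion can be checked numerically. The sketch below rests on two assumptions that are not confirmed by the surviving text: that the family-size distribution is a half-normal density over x ≥ 0 scaled to N proteins, and that N is taken as 10^10. The value σ = 1,200 is the one quoted in the text. Function names are illustrative.

```python
import math


def families_uniform(known_families=120, one_in=12):
    """Chothia-style estimate: 1 gene in `one_in` falls into the known families."""
    return known_families * one_in


def n_of_x(x, total, sigma):
    """Assumed half-normal family size: n(x) = 2N / (sigma sqrt(2 pi)) exp(-x^2 / (2 sigma^2))."""
    prefactor = total * math.sqrt(2.0 / math.pi) / sigma
    return prefactor * math.exp(-x * x / (2.0 * sigma * sigma))


def last_populated_family(total, sigma):
    """Closed-form solution of n(F) = 1 for the assumed half-normal form."""
    prefactor = total * math.sqrt(2.0 / math.pi) / sigma
    return sigma * math.sqrt(2.0 * math.log(prefactor))


N_TOTAL = 1e10   # assumed total number of different proteins
SIGMA = 1200.0   # value quoted in the text

print(families_uniform())
print(last_populated_family(N_TOTAL, SIGMA))
```

Under these assumptions the second estimate comes out within a few families of the F = 6,727 quoted in the text, and because F grows only as the square root of log N, changing N by an order of magnitude shifts it by a few percent.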

Conclusion

Different definitions of structural similarity lead to different classifications of protein structures. In this paper we have shown the importance of a special kind of similarity, SARFs, in which the connectivity among corresponding backbone fragments may be different. We have suggested a method to estimate the statistical significance of similarities and have shown that this measure correlates well with the biological meaning of these similarities. We have performed a hierarchical cluster analysis, based on pairwise structure comparison, and have found several important structural similarities that have not been reported before.

Materials and methods

Programs and computers

We used the program SARF (Alexandrov et al., 1992) on a Titan computer and a Facom VP2600 supercomputer to compare protein structures, the SPSS program for Macintosh to perform the hierarchical cluster analysis, the program NEEDL (Alexandrov, 1992) to exclude homologous proteins from our compilation, and the program MATLAB to obtain the significance score function. Figures 2, 3, 4, 5, 6, 7, and 8 were prepared with the Biosym product Insight. Figure 9 was prepared with the program MOLSCRIPT (Kraulis, 1991).

Compilation of protein structures

The compilation of proteins used in this research includes the following structures (PDB identifiers are given): 1acx, 1aap, 1abp, 1ak3, 1bbp, 1bmv, 1brd, Ida, 1ca2, 1cc5, 1ccr, 1cdt, 1cla, 1ctf, 1cts, 1dpi, 1eca, 1efm, 1fcb, 1fkf, 1fxi, 1gcr, 1gp1, 1hip, 1hkg, 1hmq, 1hoe, 1hsc, 1lap, 1ldb, 1lrp, 1mad, 1mle, 1mli, 1mon, 1paz, 1pfk, 1phs, 1phy, 1prc, 1psg, 1pte, 1pyp, 1rle, 1rbb, 1rbp, 1rhd, 1rnh, 1rnt, 1scp, 1tgl, 1thi, 1tnf, 1ton, 1tpt, 1ubq, 1wrp, 1wsy, 1xy2, 1znf, 256b, 2aat, 2abx, 2alp, 2atc, 2aza, 2blm, 2ccy, 2cd4, 2cdv, 2chy, 2ci2, 2cpp, 2cyp, 2enl, 2fbj, 2fnr, 2fxb, 2gd1, 2gls, 2gn5, 2hmg, 2hvp, 2lbp, 2ltn, 2p21, 2pab, 2phh, 2plv, 2prk, 2rsp, 2sc2, 2sns, 2sod, 2ssi, 2stv, 2taa, 2tbv, 2tma, 2tmv, 2trx, 2ts1, 2tsc, 2utg, 2znf, 351c, 3bcl, 3cbh, 3cln, 3cpa, 3cro, 3dfr, 3dpa, 3ebx, 3flx, 3gap, 3gbp, 3grs, 3hla, 3icb, 3icd, 3lzm, 3pgm, 3tim, 3tln, 4bp2, 4cpv, 4fd1, 4fxn, 4i1b, 4lym, 4tgf, 5acn, 5rub, 7pcy, 8adh, 8api, 8cat, 9pap, 9wga. This compilation was obtained by clustering all the sequences of the sixth release of the PDB database with a multiple alignment program and contains only 1 representative of each group. Some proteins with a lower sequence homology are not included in this list because their entire structures are known to be similar to one of the structures in the list. Thus, the hemoglobin family is represented only by erythrocruorin (1eca).

Cluster analysis

We analyzed the results of mutual comparison of all the protein structures by hierarchical cluster analysis. As mentioned in the Results section, the frequent occurrence of multiple SARFs implies a hierarchical organization of them. To investigate such relationships among SARFs and the proteins containing them, we performed a hierarchical cluster analysis in which all the proteins appearing in the list of common SARFs with a similarity score S > 0 are clustered by the method of average linkage between groups. If a pair of proteins in the above set does not appear in the list of common SARFs, then S = 0 is assigned. At first we look for the pair of proteins with the largest S value and unite them into a cluster. This cluster now replaces the 2 proteins in the set. The S value between this cluster and a third protein in the list is set equal to the average of the S values between the third protein and each of the 2 proteins. Then, by regarding this cluster as 1 element in the set, we repeat the same procedure of forming a new cluster until all proteins in the set become united into 1 cluster. As a result of this procedure, proteins are classified hierarchically into a dendrogram. Figure 8 presents a "slice" of this dendrogram in which all the clusters with a similarity score S > 4.0 are united.
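The procedure above can be sketched in a few lines of plain Python. This is not the SPSS implementation used in the paper: it assumes the standard group-average rule (the mean of S over all member pairs of two clusters, which reduces to the average described above when a 2-protein cluster meets a third protein), and it stops merging at the threshold instead of building the full dendrogram and slicing it. The toy input contains only the pairs from Table 1 and treats every other pair as S = 0, so the output is illustrative and does not reproduce Figure 8.

```python
from itertools import combinations


def average_linkage(scores, threshold=4.0):
    """Merge clusters while the best group-average similarity exceeds `threshold`.

    scores: dict mapping frozenset({protein_a, protein_b}) -> similarity score S.
    Pairs absent from `scores` are assigned S = 0, as in the text.
    """
    proteins = sorted({p for pair in scores for p in pair})
    clusters = [frozenset([p]) for p in proteins]

    def link(c1, c2):
        # Group average: mean S over all pairs with one member in each cluster.
        total = sum(scores.get(frozenset((a, b)), 0.0) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = max(combinations(clusters, 2), key=lambda pair: link(*pair))
        if link(*best) <= threshold:
            break
        clusters = [c for c in clusters if c not in best] + [best[0] | best[1]]
    return clusters


# Pairs and scores from Table 1 (all other pairs default to S = 0).
table_1 = {
    frozenset(("3lzm", "256b")): 6.26,
    frozenset(("1cts", "2cpp")): 5.93,
    frozenset(("1rnh", "2sc2")): 5.87,
    frozenset(("1rnh", "8api")): 5.76,
    frozenset(("1rnh", "3icd")): 5.56,
}
for cluster in average_linkage(table_1):
    print(sorted(cluster))
```

In this toy run 1rnh is merged with 2sc2 first, after which its averaged similarity to 8api and to 3icd falls below the threshold. This is the same effect the Results describe as significant pairwise similarities being hidden by the cluster presentation.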

Limitations and parameters of the algorithm

Finding a local similarity with the highest similarity score is a complex combinatorial problem. The approximate method we used is fast enough to complete an exhaustive search through all the PDB structures (a comparison of 2 average structures of 200 residues takes about 5 min on a Titan Unix machine), but it cannot be guaranteed to find the similarity with the true highest similarity score. The idea of the method is to combine short similar fragments of Cα traces into large similar spatial arrangements of these fragments (SARFs). The main limitation of the method is that it operates not with single residues but with short, similar, continuous fragments. In most cases, we combine common SARFs from 6-residue fragments that can be superimposed with an RMSD of less than 0.8 Å. However, in large molecules with an internal symmetry, the combinatorial complexity of the problem increases dramatically and we have to tighten the selection of the initial short fragments by increasing the number of residues in them until the number of initial similar fragments becomes sufficiently low (less than 5,000 in the current version of the program). Thus, we could miss some SARFs that are made of shorter or less similar fragments.
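The tightening rule described above (lengthen the initial fragments until fewer than 5,000 similar fragment pairs remain) can be sketched as a simple loop. The callable `count_similar_pairs` and the `max_length` guard are hypothetical stand-ins: the first represents whatever routine counts fragment pairs superimposable below the RMSD cutoff for a given length, and the second is added only so the loop terminates.

```python
def choose_fragment_length(count_similar_pairs, start_length=6,
                           max_pairs=5000, max_length=30):
    """Return the first fragment length whose similar-pair count is below `max_pairs`.

    count_similar_pairs: hypothetical callable, length -> number of similar
    fragment pairs found for that fragment length.
    """
    for length in range(start_length, max_length + 1):
        n_pairs = count_similar_pairs(length)
        if n_pairs < max_pairs:
            return length, n_pairs
    raise ValueError("no fragment length up to max_length gives few enough pairs")


# Toy counts for a large symmetric molecule (invented numbers, for illustration).
toy_counts = {6: 40000, 7: 12000, 8: 4800}
print(choose_fragment_length(lambda length: toy_counts.get(length, 0)))
```

The trade-off is the one stated in the text: each increase in fragment length shrinks the combinatorial search, at the cost of missing SARFs built from shorter or less similar fragments.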


Acknowledgment

We thank Mr. John Owens for his assistance.

References

Abagyan RA, Maiorov VN. 1988. A simple qualitative representation of poly- peptide chain folds: Comparison of protein tertiary structures. J Bio- mol Struct & Dyn 5:1267-1279.

Abola EE, Bernstein FC, Bryant SH, Koetzle TF, Weng JC. 1987. Protein Data Bank. In: Allen FH, Bergerhoff G, Sievers R, eds. Crystallographic databases: Information, content, software systems, scientific applica- tions. Bonn/Chester/Cambridge: International Union of Crystallogra-

Alexandrov NN. 1992. Local multiple alignment by consensus matrix. CABIOS 8:339-345.

Alexandrov NN, Takahashi K, Go N. 1992. Common spatial arrangements of backbone fragments in homologous and non-homologous proteins. JMol Biol225:5-9.

Bernstein FC, Koetzle TF, Williams GJB, Meyer EF Jr, Brice MD, Rodgers JR, Kennard 0, Shimanouchi T, Tasumi M. 1977. The Protein Data

Mol Biol112:535-542. Bank: A computer-based archival file for macromolecular structures. J

Branden C1. 1980. Relation between structure and function of a/O proteins. Q Rev Biophys 13:317-338.

Chothia C. 1992. One thousand families for the molecular biologist. Nature 357:543-544.

Chothia C, Finkelstein AV. 1990. The classification and origins of protein fold patterns. Annu Rev Biochem 59:1007-1039.

Farber GK, Petsko GA. 1990. The evolution of a/O barrel enzymes. Trends Biochem Sci I5:228-234.

Finkelstein AV, Ptitsyn OB. 1987. Why do globular proteins fit the limited set of folding patterns? Progr Biophys Mol Bioi 50:171-190.

Fischer D, Bachar 0, Nussinov R, Wolfson H. 1992. An efficient automated computer vision based technique for detection of three dimensional struc- tural motifs in proteins. J Biomol Stnrct & Dyn 9:769-789.

Hogle JM, Chow M,. Filman DJ. 1985. Three-dimensional structure of poliovirus at 2.9 A resolution. Science 229:1358-1365.

Holm L, Sander C. 1993a. Structural alignment of globins, phycocyanins and colicin A. FEBS Lett 315:301-306.

phy. pp 107-132.

Holm L, Sander C. 1993b. Protein structure comparison by alignment of distance matrices. JMol Biol233:123-138.

Huber R, Schneider M, Mayr I , Muller R, Deutzmann R, Suter F, Zuber H, Falk H, Kayser H. 1987. Molecular structure of the bilin-binding pro- tein (BBP) from Pieris brassicae after refinement at 2.0 A resolution. JMol Bioi 198:499-513.

Kraulis PJ. 1991. MOLSCRIPT: A program to produce both detailed and schematic plots of protein structures. J Appl Crystallogr 24:946-950.

Lebioda L, Stec B. 1988. Crystal structure of enolase indicates that enolase and pyruvate kinase evolved from a common ancestor. Nature333:683- 686.

Levitt M, Chothia C. 1976. Structural patterns in globular proteins. Nature 261552-558.

Maiorov VN, Crippen GM. 1994. Significance of RMSD in comparing three- dimensional structures of globular proteins. J Mol Biol235:625-634.

Mitchell EM, Artymiuk PJ, Rice DW, Willet P. 1989. Use of techniques de- rived from graph theory to compare secondary structure motifs in pro- teins. JMol Biol212:151-166.

Murzin AG, Chothia C. 1992. Protein architecture: New superfamilies. Curr Opin Struct Biol2:895-903.

Orengo CA, Flores TP, Taylor WR, Thornton JM. 1993. Identification and classification of protein fold families. Protein Eng 6:485-500.

Overington JP, Johnson MS, Sali A, Blundell TL. 1990. Tertiary structural constraints on protein evolutionary diversity: Templates, key residues and structural prediction. Proc R Soc Lond B 241:146-152.

Pascarella S, Argos P. 1992. A data bank merging related protein structures and sequences. Protein Eng 5:121-137.

Rackovsky S. 1990. Quantitative organization of the known protein X-ray structures. I. Methods and short-length-scale results. ProteinsStruct Funct Genet 7:378-402.

Richardson JS. 1981. The anatomy and taxonomy of protein structures. Adv Protein Chem 34:167-339.

Sali A, Blundell TJ. 1989. Definition of general topological equivalence in protein structures. J Mol Biol212:403-428.

Steitz TA. 1990. Structural studies of protein-nucleic acid interaction: The sources of sequence-specific binding. Q Rev Biophys 23:205-280.

Taylor WR, Orengo CA. 1989. Protein structure alignment. JMoI Biol208: 1-22.

Vriend G, Sander C. 1991. Detection of common three-dimensional substruc- tures in proteins. Proteins Struct Fund Genet 11:52-58.

Yee DP, Dill KA. 1993. Families and the structural relatedness among glob- ular proteins. Protein Sci 22384-899.
