

J. Mol. Biol. (1996) 264, 1164–1179

How to Derive a Protein Folding Potential? A New Approach to an Old Problem

Leonid A. Mirny and Eugene I. Shakhnovich*

Harvard University, Department of Chemistry, 12 Oxford Street, Cambridge, MA 02138, USA

In this paper we introduce a novel method of deriving a pairwise potential for protein folding. The potential is obtained by an optimization procedure that simultaneously maximizes thermodynamic stability for all proteins in the database. When applied to the representative dataset of proteins and with the energy function taken in pairwise contact approximation, our potential scored somewhat better than existing ones. However, the discrimination of the native structure from decoys is still not strong enough to make the potential useful for ab initio folding. Our results suggest that the problem lies with pairwise amino acid contact approximation and/or simplified presentation of proteins rather than with the derivation of potential. We argue that more detail of protein structure and energetics should be taken into account to achieve energy gaps. The suggested method is general enough to allow us to systematically derive parameters for more sophisticated energy functions. The internal control of validity for the potential derived by our method is convergence to a unique solution upon addition of new proteins to the database. The method is tested on simple model systems where sequences are designed, using the preset “true” potential, to have low energy in a dataset of structures. Our procedure is able to recover the potential with correlation r ≈ 91% with the true one and we were able to fold all model structures using the recovered potential. Other statistical knowledge-based approaches were tested using this model and the results indicate that they also can recover the true potential with high degree of accuracy.

© 1996 Academic Press Limited

*Corresponding author

Keywords: protein folding; protein folding potential; fold recognition

Introduction

The problem of how to determine the correct energetics is paramount to the complete solution of the protein folding problem.

Two avenues to determine energy functions (force-fields) for proteins have been pursued. The first uses more or less rigorous or semi-empirical classical or quantum mechanical calculations to determine, from first principles and/or by fitting to spectroscopic experimental data, the forces acting between amino acids in a vacuum or in solution (Vasques et al., 1994). This approach is rigorous but it encounters formidable computational difficulties. Most importantly, it can be realized only within the framework of a detailed, atomistic description of amino acids. However, detailed atom resolution models of proteins are not feasible for folding simulations due to obvious computational difficulties.

An alternative, more practical, approach is to introduce simplified, coarse-grained models of proteins where amino acids are represented in a simplified way, as one or a few interacting centers which may have some internal degrees of freedom as well but which are generally much simpler than real amino acids (Levitt, 1976; Ueda et al., 1978; Miyazawa & Jernigan, 1985; Wilson & Doniach, 1989; Skolnick & Kolinsky, 1990; Shakhnovich et al., 1991). Such models are more tractable computationally in both threading approaches (Finkelstein & Reva, 1991; Jones et al., 1992) and ab initio simulations (Kolinsky & Skolnick, 1993, 1994). However, the serious problem with simplified representations of proteins is that it is difficult to describe protein energetics at the coarse-grained level of structure description. What “force-fields” should act between these simplified interacting centers, which remain identified with natural amino acids, such that native structures, for these model proteins, still correspond to pronounced energy minima for their respective sequences? To address this problem Tanaka & Scheraga (1976) proposed an approach which was later developed by Miyazawa & Jernigan (1985) in their seminal contribution (the “MJ” method). The MJ method is based on a statistical analysis of protein structures and frequencies of contacts, defined in the realm of simplified protein representation. Frequencies of individual amino acid contacts were derived and compared with frequencies expected in the random mixture of amino acids and the solvent. Next, a quasichemical approximation was employed relating these properly normalized frequencies with “potentials” via the relation:

$$u_{ij} = -T \ln f_{ij}$$

where $i$ and $j$ denote amino acid types and $f_{ij}$ are the normalized frequencies of contacts between them, extracted from the database of existing structures. The definition of the energy scale denoted as “temperature” ($T$) in the quasichemical approximation of MJ is a delicate problem. It was addressed in recent work by Finkelstein et al. (1993, 1995), who also showed that the quasichemical approximation may be a reasonable one under the assumption that protein sequences are random. In a recent study Mirny & Domany (1996) showed that the quasichemical approximation is also valid if contacts are independent and uniformly distributed. The subsequent development of the knowledge-based approach based on the quasichemical approximation included efforts to incorporate distance-dependent forces (Sippl, 1990), better representation of amino acid geometry and approximation of multiple-body interactions (Kolinsky & Skolnick, 1993, 1994), dihedral angles (Kolaskar & Prashanth, 1979; Nishikawa & Matsuo, 1993; Rooman et al., 1992; DeWitte & Shakhnovich, 1994), and better treatment of solvent–protein interactions (Park & Levitt, 1996). A detailed analysis of knowledge-based potentials and examples of their successful and unsuccessful application is given by Kocher et al. (1994). Approaches to derive potentials from the quasichemical approximation, especially the most difficult issue of the reference state, are discussed by Godzik et al. (1995).

The real potential is believed to distinguish the native structure by making its energy much lower than the energy of all other conformations, i.e. it provides stability of the native structure. Protein sequences should also fold fast to their respective native conformations. It was shown, for simple models of proteins, that these two conditions, thermodynamic stability and kinetic accessibility, are met when the native state is a pronounced energy minimum for the native sequence, compared to the set of misfolded conformations (Sali et al., 1994; Shakhnovich, 1994; Gutin et al., 1995). Therefore it is reasonable to suggest that the essential property of the correct, “true”, folding potential is that the energy of a native sequence folded into its respective native conformation should be much lower than the energy of this sequence in every alternative conformation.
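As an illustration of the quasichemical relation between contact frequencies and potentials quoted above, the following minimal sketch converts a symmetric matrix of contact counts into a potential. It is not the MJ procedure itself: the reference state used here (random pairing of contact partners according to how often each type takes part in contacts, with no solvent term), the pseudocount, and the function name `quasichemical_potential` are assumptions made only for this example.

```python
import numpy as np

def quasichemical_potential(counts, temperature=1.0, pseudocount=1.0):
    """Return u_ij = -T ln f_ij from a symmetric matrix of contact counts.

    f_ij is taken as observed / expected counts. The expected counts assume
    that contact partners pair at random (an assumption of this sketch; the
    solvent term of the MJ reference state is ignored).
    """
    counts = np.asarray(counts, dtype=float) + pseudocount
    total = counts.sum()
    # Fraction of all contact ends that belong to each amino acid type.
    ends = counts.sum(axis=1) / total
    # Counts expected if partners were paired independently of type.
    expected = total * np.outer(ends, ends)
    f = counts / expected
    return -temperature * np.log(f)
```

With this normalization, pairs seen more often than expected receive negative (attractive) values, and pairs seen less often than expected receive positive (repulsive) values.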

An approach to derivation of protein folding potentials that takes this requirement explicitly into account was proposed by Goldstein et al. (1992; GSW) and Maiorov & Crippen (1992). Goldstein et al. maximized the quantity $T_f/T_g$, which is equivalent to the maximization of the energy gap between the native state and the bulk of decoys. They showed that, for each individual protein, the problem of potential optimization has a simple analytical solution; however, their approach encountered a serious problem in averaging the potential over different structures in the database. Indeed, a potential optimized for one protein is not necessarily (and in fact never!) optimal for another protein, while the goal is to find a potential which is good, or optimal, simultaneously for many proteins. Goldstein et al. (1992) found an ad hoc procedure of averaging over a protein database which offered good results in their tests.

Here, we suggest a systematic method for deriving a potential which delivers a pronounced energy minimum to all proteins in their native conformations and hence should provide fast folding and stability of model proteins. The method is general and is not limited to any form of potential or any model of a protein. Another important feature of this approach is that it has internal criteria of self-consistency: when the derived potential does not change significantly upon addition of new proteins to the database, it corresponds to meaningful, nontrivial energetics.

The proposed new method of potential derivation should be rigorously tested and compared with existing approaches. However, there is a serious problem with testing parameters derived by any approach: the lack of objective rigorous criteria of success because true potentials are not known (and they are not likely to exist since real proteins differ from their simplified representations). A reasonable criterion is whether the derived potentials are useful for fold recognition and ab initio folding. The results of numerous tests by many groups (see the comprehensive analysis in the special issue of Proteins (Lattman, 1995)) show that, while existing knowledge-based potentials often do a decent job in fold recognition, they are not sufficiently accurate for ab initio folding. Strong evidence that the “bottleneck” in ab initio folding is in the energy function rather than in the search strategy is that ab initio procedures fail because decoys with energy lower than the energy of the native conformations are found in test cases (Covell, 1994; Elofson et al., 1995). Conversely, in inverse folding tests, the native structure, in most cases (though not always), has lower energy than decoys (Wodak & Rooman, 1993; Lemer et al., 1995; Miyazawa & Jernigan, 1996).

The best way to assess different procedures of derivation of potentials is to use, as a test case, a model system where the correct form of the Hamiltonian (say, pairwise contact potential) is given, and the true potential is known. Different procedures to extract potentials can be applied, and then the true and derived potentials can be compared. Furthermore, the derived potentials can be used in a model system for threading or ab initio folding to compare their performance with that of the true potentials and thus to close the circle.

Such a comprehensive analysis is possible in the realm of lattice models. Thomas & Dill (1996) took the first step in this direction when they considered two-dimensional short lattice chains composed of monomers of two types.

Here, we test our procedure for derivation of potentials as well as other approaches using both (1) a representative set of the native proteins obtained from the Brookhaven protein databank (PDB) and (2) three-dimensional lattice model proteins composed of 20 types of amino acids.

A sequence design algorithm has been recently developed which generates sequences with specified relative energy (Z-score, Bowie et al., 1991) in a given conformation (Shakhnovich & Gutin, 1993a,b; Abkevich et al., 1995a). We use this method to carry out the following rigorous procedure for testing our and alternative approaches to derive potential. (1) Select a dataset of the native protein conformations. We use a representative subset of the native structures obtained from PDB for real proteins and a set of random compact lattice conformations for the lattice model. (2) Using some potential to serve as the true potential for the model, design sequences to have selected conformations as native ones, thus creating a model protein databank. (3) Using different procedures, extract “knowledge-based potentials”. (4) Compare derived potentials with the true potential. Test the performance of the derived potentials in ab initio folding simulations (for the lattice model) and threading (for real proteins), using model proteins from the built dataset that have not been used to derive potentials.

This approach allows one to test systematically, for different parameter derivation procedures, how large the database must be for parameters to be derived successfully. Furthermore, we can create databanks of model proteins with various levels of stability that make it possible to determine how well optimized protein sequences must be to allow successful parameter derivation.

We apply the procedure described above (except ab initio folding tests) not only to lattice model proteins but also to real proteins. With a set of true parameters, we design sequences for a representative set of protein conformations, derive the optimized potential, and evaluate the maximal value of Z-score for a given model Hamiltonian. This value is compared with scores for native sequences with the same optimized potential. This comparison sheds light not only on the advantages and pitfalls of the parameter derivation procedure but, more importantly, it determines which models can serve better for prediction of protein conformations.

In subsequent sections of this paper we will carry out this program.

Results

Lattice model

We consider a conformation of a protein chain as a self-avoiding walk on a cubic lattice. Two amino acids that are not nearest neighbors in sequence and are located at neighboring vertices of the lattice are considered to be in contact. Energy of a conformation is given by equation (6).
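The contact definition above can be sketched as follows. The energy function is assumed here to be a plain sum of pairwise contact terms, which is the pairwise contact approximation named in the text; equation (6) itself is not reproduced in this section. The function names and the dictionary layout of `potential` are hypothetical.

```python
from itertools import combinations

def lattice_contacts(coords):
    """Return index pairs (i, j) in contact for a walk on a cubic lattice.

    coords is a list of integer (x, y, z) tuples, one per monomer.
    """
    if len(set(coords)) != len(coords):
        raise ValueError("conformation is not self-avoiding")
    contacts = []
    for i, j in combinations(range(len(coords)), 2):
        if j - i < 2:
            continue  # nearest neighbors in sequence are not contacts
        dist = sum(abs(a - b) for a, b in zip(coords[i], coords[j]))
        if dist == 1:  # neighboring lattice vertices
            contacts.append((i, j))
    return contacts

def contact_energy(sequence, coords, potential):
    """Sum pairwise contact energies; keys of potential are sorted type pairs."""
    return sum(potential[tuple(sorted((sequence[i], sequence[j])))]
               for i, j in lattice_contacts(coords))
```

For a four-monomer chain bent into a square, only the first and last monomers are in contact, so the energy is the single entry of the potential for that pair of types.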

Dataset of stable and folding proteins

The dataset of lattice proteins consists of 200 randomly chosen compact conformations of 27-mers on a 3 × 3 × 3 cube (Shakhnovich & Gutin, 1991; Shakhnovich et al., 1991; Sali et al., 1994; Socci & Onuchic, 1994). We derive the potential using the first 100 of the lattice proteins and then test the derived potential for the remaining 100 lattice proteins from the dataset. We use the potential obtained by Miyazawa & Jernigan (1985) as the true one. Using the true potential for each native conformation in the dataset, we design a sequence which minimizes Z-score for this conformation (see above). The stability and folding of each designed sequence are tested by Monte Carlo folding simulations. Each simulation starts from a random coil (Shakhnovich et al., 1991; Sali et al., 1994) and continues until it reaches its respective native conformation.

Derivation of potential

To obtain the potential which minimizes Z-scores for model proteins, we use a Monte Carlo procedure in the space of potentials (see Methods). Starting from different random potentials, the Monte Carlo search converges fast, and the resulting potential does not depend on the starting random potential. The procedure converges to a unique potential even at zero optimization temperature $T_{\mathrm{opt}} = 0$. This shows that there is only one minimum in the space of potentials in our model. This guarantees that the derived potential provides the global minimum to the target function $\langle Z \rangle_{\mathrm{harm}}$.
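The target function, a harmonic mean of Z-scores over the proteins in the database, can be illustrated with a short sketch. The Z-score is taken here in its usual form, the native energy measured from the mean energy of alternative conformations in units of their standard deviation; how the alternative conformations are generated is described in Methods and is not reproduced here, so `decoy_energies` is simply an input in this example. Function names are hypothetical.

```python
import numpy as np

def z_score(native_energy, decoy_energies):
    """Z = (E_native - mean(E_decoys)) / std(E_decoys)."""
    decoys = np.asarray(decoy_energies, dtype=float)
    return (native_energy - decoys.mean()) / decoys.std()

def harmonic_mean(z_values):
    """Harmonic mean N / sum(1 / Z_i).

    Assumes all Z_i have the same sign (all negative for stable proteins);
    with mixed signs the reciprocals can cancel and the mean is ill-behaved.
    """
    z = np.asarray(z_values, dtype=float)
    return len(z) / np.sum(1.0 / z)
```

For example, Z-scores of -4, -4 and -1 have an arithmetic mean of -3 but a harmonic mean of -2: the harmonic mean is pulled toward the poorest score, so a potential cannot score well by stabilizing most proteins while leaving a few with almost no gap.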

Derived versus true potential

The potentials obtained from this method are compared with the true one in several ways. Figure 1 plots the 210 interaction values of the derived potential against the same values for the true potential. The correlation r = 0.84 shows that our method is able to reconstruct the true potential.
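The number 210 is the count of distinct unordered pairs of 20 amino acid types, including pairs of identical types (20 × 21 / 2). A minimal sketch of the comparison, assuming both potentials are stored as symmetric 20 × 20 matrices (a storage choice made for this example), is:

```python
import numpy as np

def unique_pair_values(matrix):
    """Return the upper triangle (diagonal included) of a symmetric matrix."""
    m = np.asarray(matrix, dtype=float)
    rows, cols = np.triu_indices(m.shape[0])
    return m[rows, cols]

def potential_correlation(derived, true):
    """Pearson correlation between the unique entries of two potentials."""
    return np.corrcoef(unique_pair_values(derived),
                       unique_pair_values(true))[0, 1]
```

Because the Pearson coefficient is invariant under shifting and positive rescaling, this comparison does not depend on the overall energy scale or zero level of the derived potential.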

The values of energy for attractive interactions (U(j, h) < 0) are predicted much better than the energy values of repulsive interactions (U(j, h) > 0). Attractive interactions stabilize the native conformations and appear much more frequently among native contacts. Repulsive interactions, in contrast, are very rare among native contacts and therefore the statistics are much poorer for them. Some repulsive contacts cannot be found in the dataset of model proteins. In contrast, contacts between all amino acids are present among native contacts in real proteins (see below). The absence of contacts between some types of amino acids in the model dataset is the result of a very strong sequence design. The design finds a sequence that provides very high stability of the native conformation in the given model, and by doing so it eliminates repulsive contacts which destabilize the native conformation. This observation is the first indicator that native sequences do not appear to be well designed for stability in terms of our model (contact pairwise potential, 4.5 Å cutoff of residue-residue interactions, etc.).

Figure 1. Derived potential versus true potential for the lattice model.

Figure 2. Z-score of model proteins with the derived potential versus Z-score with the true potential.

Ab initio folding with derived potentials

The ability of the derived potential to fold model proteins is tested by their ab initio folding. Folding simulations are carried out using the standard Monte Carlo method for polymers on a cubic lattice. A discussion of the lattice Monte Carlo simulation technique, and its advantages and caveats, is detailed in many publications (see e.g. Sali et al., 1994; Socci & Onuchic, 1994). Each simulation starts from a random coil conformation, proceeds at constant temperature and lasts about five times longer than the mean folding time.

All of the tests are performed for the proteins that were not used in the derivation procedure. First we compare the $Z(U_{\mathrm{der}})$ values provided by the derived potential with the $Z(U_{\mathrm{true}})$ values for the true potential. Figure 2 presents $Z(U_{\mathrm{der}})$ as a function of $Z(U_{\mathrm{true}})$. The derived potential provides almost the same or even lower values of Z-score for all proteins.

We define folding time for each protein as a mean first passage time, i.e. time when the native conformation is first reached. Time is measured in MC steps. Forty runs are performed for each protein for both true and derived potentials. Simulations are run at temperature T = 0.7. Figure 3 presents the scatter plot of folding time obtained for the derived potential vs folding time for the true potential. All proteins that fold with the true potential fold also with the derived potential and exhibit approximately the same folding time.
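The two quantities used in this test can be sketched in a few lines. The acceptance rule below is the Metropolis criterion, which we assume is what "standard Monte Carlo method" refers to; the move set for the lattice chain is not shown, and the function names are hypothetical.

```python
import math
import random

def metropolis_accept(delta_e, temperature, rng):
    """Accept a trial move with probability min(1, exp(-delta_e / T))."""
    if delta_e <= 0:
        return True  # downhill or neutral moves are always accepted
    return rng.random() < math.exp(-delta_e / temperature)

def mean_first_passage_time(first_passage_steps):
    """Average, over independent runs, of the MC step at which the
    native conformation was first reached."""
    return sum(first_passage_steps) / len(first_passage_steps)
```

In a folding run, `metropolis_accept` would be called with the energy change of each trial move at T = 0.7, and the step count at the first visit to the native conformation would be recorded once per run.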

The folding test demonstrates that the derived potential is able to provide fast folding for all proteins with well-designed sequences.

Figure 3. Folding time with derived potential versus folding time with the true potential for the lattice model. The 100 lattice model proteins not used for derivation of potentials were taken for Monte Carlo folding simulations. Folding “time” is measured in Monte Carlo steps required to reach the native conformation.

Effect of the database size

How sensitive is the derived potential to the number of proteins used in the derivation? How many proteins are required to obtain a potential similar to the true potential? To address these questions we perform a derivation of potential for databases with various numbers of proteins.

For the database containing $N_{\mathrm{prot}}$ proteins we derive a potential using the technique described above and compute the average score (energy gap) $|F| = |\langle Z \rangle_{\mathrm{harm}}|$ provided by the derived potential (Figure 4A). We also compute the correlation of the true potential and the one obtained for $N_{\mathrm{prot}}$ proteins (see Figure 4B).

For a few (1 to 5) proteins one can obtain a potential that provides a very large energy gap for these proteins. This potential, however, fails to provide a reasonable gap for other proteins and is not similar (r = 0.2 to 0.5) to the true potential. As the number of proteins in the database increases, the average energy gap decreases, approaching a rather high constant value ($|F| = 1.6$). The correlation between the derived and true potentials approaches a constant value, r = 0.85. To ensure that the derived potential converges to a meaningful value as the number of proteins in the database increases, we compare the potential obtained for $N_{\mathrm{prot}}$ proteins with the one obtained for all 100 proteins. The correlation between these potentials as a function of $N_{\mathrm{prot}}$ is shown in Figure 5. Clearly, as the number of proteins in the database increases, the correlation between the derived potentials approaches 1 and, hence, the potential converges to a unique solution.

These results clearly demonstrate the stability of our procedure. It is also important that a potential which is highly correlated with the true potential (r = 0.8) can be obtained with only 40 to 50 proteins, which is of the order of the size of the database of non-homologous stable disulfide-free proteins available from the PDB. Convergence of the derivation procedure also guarantees that the obtained potential does not depend on the number or specific properties of the proteins used for the derivation if the database is large enough.

Figure 4. Effect of the database size, used for derivation of the potential, on average energy gap (A) and convergence test (B), for lattice model proteins.

Figure 5. Convergence of potential for lattice model proteins. Correlation between potentials derived from all 100 model proteins and potential derived from $N_{\mathrm{prot}} < 100$ proteins is shown as a function of $N_{\mathrm{prot}}$.

Are there enough parameters? Are there too many parameters?

An important issue is whether the number of parameters adjusted in the potential is sufficient to provide a large enough gap for all proteins with designed sequences. For example, two-letter (HP) models are too non-specific to make the native structure unique: for any sequence, many conformations of three-dimensional HP heteropolymers have the same energy as the native conformation has. The native state in such models is not unique in most cases; correspondingly no sequence, random or designed, can have any energy gap in the HP models (Yue et al., 1995).

On the other hand, the number of parameters should not be too large. If the number of parameters is too large it is always possible to find a “potential” for which all members of the database used in the derivation have low energies, but the resulting potential is unrelated to the true potential and does not provide low energy to proteins which are not members of the derivation database.

More specifically, the question is whether the problem of finding parameters is under-determined or over-determined, i.e. how the number of independent functions to minimize (Z-scores of individual proteins) is related to the number of independent parameters. For an over-determined problem the number of functions/constraints is greater than the number of parameters and, hence, there is no solution that minimizes all the functions well. This is not the case for our designed sequences, since there is a true potential which provides a large enough gap for all proteins. Below we address this question for native sequences of real proteins. If the problem is under-determined, then the number of functions/constraints is less than the number of parameters and one can find an infinite number of solutions minimizing all functions. This is the case when the number of proteins in the database is small. As we have shown above, the potential derived for a few proteins provides an average energy gap greater than that provided by the true potential, but it shares no similarity with the true potential.

However, as the number of proteins in the database increases, the average gap approaches that for the true potential and the derived potential becomes very similar to the true one. To ensure that we do not have too many parameters we have devised a control procedure with randomly shuffled sequences.

Randomly shuffled sequences: an essential control

As a control we carried out the derivation procedure for our database of model proteins using shuffled sequences instead of the designed ones for each protein. In this case one should not expect that there exists any potential which makes all the native structures low in energy for randomly shuffled sequences, i.e. in this case our procedure should not lead to any meaningful solution. What happens in this case?

Again, for a few (1 to 5) proteins one can find a potential which provides a large enough energy gap ($|F| = 0.8$ to $1.2$) for randomly shuffled sequences (see Figure 4A). However, in contrast to the designed sequences, the average energy gap drops substantially to a marginal level of $|F| = 0.2$ as the number of proteins in the database increases. Clearly, there is no correlation between the true potential and the potential derived for a database with shuffled sequences (see Figure 5). There is also no correlation (r = 0.0) between potentials obtained for 100 proteins with shuffled sequences and potentials obtained for $N_{\mathrm{prot}} < 100$ of these proteins. Hence, the procedure does not converge to any potential for proteins with randomly shuffled sequences.

Consequently, no pairwise potential can provide stability simultaneously to all of the native conformations with the shuffled sequences.
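The shuffled-sequence control used above can be sketched as follows. Shuffling keeps the amino acid composition of each sequence and destroys only the order, so any loss of energy gap is attributable to the loss of sequence-structure fit rather than to a change in composition. The function names are hypothetical.

```python
import random
from collections import Counter

def shuffled_control(sequence, rng):
    """Return a random permutation of the sequence (composition preserved)."""
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)

def same_composition(seq_a, seq_b):
    """True if both sequences contain each residue type equally often."""
    return Counter(seq_a) == Counter(seq_b)
```

Passing an explicitly seeded `random.Random` instance makes the control database reproducible between derivation runs.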

Comparison of the results for designed sequences with the control case of shuffled sequences suggests that the problem of finding a pairwise potential is not under-determined, i.e. 210 parameters of the potential are sufficient to provide a large gap for designed sequences and are not sufficient to provide a large gap for any pair of sequence and conformation.

Note that the procedure is able to distinguish between designed and randomly shuffled sequences without prior information about the potential used for the design. Designed sequences show the convergence of the average energy gap $|F|$ to a level of F = 1.4 as the number of proteins increases. The potential converges, as seen from the high correlation between potentials obtained for different numbers of proteins in the dataset. In contrast, no convergence to a single potential is observed for proteins with randomly shuffled sequences (see Figure 5). The target function F approaches small values of 0.2 as the number of proteins in the database increases. Where are the native proteins on this scale? Do they behave more like designed or like randomly shuffled sequences?

Native proteins

The model

We build a database of proteins with less than25% of sequence homology, longer than 50 andshorter than 200 amino acids. The database contains104 proteins listed below (in pdb-code names): 1hcr1cad 1enh 1aap 1ovo 1fxd 1cse 1r69 1plf 2sn3 1bov1mjc 1hst 1hyp 1ubq 4icb 1pk4 1poh 1aba 1lmb1cyo 1brs 1fna 1mol 1stf 1gmp 1frd 1hsb 1ida 1plc1aya 1onc 1sha 1fus 1psp 1fdd 256b 1acx 1bet 1fkb1pal 2sic 1brn 2trx 1ccr 2msb 1dyn 1c2r 1etb 1gmf2rsl 1paz 1rpg 1acf 2ccy 3chy 135l 1aiz 1rcb 1adl1bbh 1slc 1eco 2end 4fxn 1ith 1cdl 1flp 2asr 1ilr 1lpe1hbi 1bab 1lba 1mba 8atc 1ash 2fx2 2hbg 2mta 1f3g1ndc 1aak 1cob 4i1b 1mbd 2rn2 1esl 1hfc 1hlb 1pnt1hjr 4dfr 119l 3dfr 2cpl 5p2l 1rcf 9wga 2alp 1fha1bbp 2gcr 1hbq. We use this database to derive thepotential that maximizes the average energy gap=�Z�harm=. We define a contact between two aminoacids as when the distance between their nearestheavy atoms is less than cutoff value 4.5 A. Incontrast to the lattice model, real proteins have adifferent length and a different number of contacts.These factors affect the value of Z-score. To accountfor the increase of Z with protein length weintroduce the following normalization:

Znorm =Z

znnat

where nnat is the number of native contacts.Normalized values of Znorm are used to computeFnorm = �Znorm�harm harmonic mean. Our criterionoveremphasizes poor scores and therefore is very


Figure 6. Z-score of native proteins with different potentials. Proteins are arranged in order of increasing length; this explains the systematic trend of decrease of Z-score as protein ID no. increases.

Figure 7. Native proteins. A, Effect of the database size on the energy gap; B, convergence test: correlation between the potential derived using a smaller database and the potential derived using all 104 proteins in the database.

sensitive to proteins in the database that are more random-like; hence their presence in the dataset can distort the resulting potential. To avoid this difficulty we selected proteins for the dataset which we believe are stabilized by similar physical forces (hydrophobic, electrostatic, H-bonds, etc.) and avoided proteins stabilized by other factors such as disulfides, coordinated metals, heme groups, etc.
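The contact definition and length normalization used for the native-protein database can be sketched as follows. This is a minimal illustration, not the authors' code: the function names (`in_contact`, `count_contacts`, `normalized_z`), the exclusion of nearest neighbors along the chain, and the toy coordinates are assumptions, and the normalization is taken to be Z divided by the square root of the number of native contacts.

```python
import math

CUTOFF = 4.5  # contact cutoff in angstroms, as quoted in the text


def in_contact(res_a, res_b, cutoff=CUTOFF):
    """True if the nearest heavy atoms of two residues are closer than cutoff.

    Each residue is a list of (x, y, z) heavy-atom coordinates.
    """
    return any(math.dist(a, b) < cutoff for a in res_a for b in res_b)


def count_contacts(residues, min_separation=2):
    """Count contacting pairs (i, j) with j - i >= min_separation.

    min_separation=2 skips nearest neighbors along the chain (an assumption
    carried over from the pairwise contact energy definition).
    """
    n = len(residues)
    return sum(
        in_contact(residues[i], residues[j])
        for i in range(n)
        for j in range(i + min_separation, n)
    )


def normalized_z(z, n_native_contacts):
    """Z_norm = Z / sqrt(n_nat) (assumed form of the normalization)."""
    return z / math.sqrt(n_native_contacts)


# Toy example: three "residues" with two heavy atoms each (hypothetical data).
residues = [
    [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
    [(3.8, 0.0, 0.0), (5.3, 0.0, 0.0)],
    [(4.0, 4.0, 0.0), (2.0, 3.0, 0.0)],
]
n_nat = count_contacts(residues)
```

With real structures the residue lists would be filled from the heavy-atom records of each PDB entry; hydrogens are left out because the definition refers to heavy atoms only.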

Figure 6 presents values of Z-score obtained for the derived potential. Although our method finds the potential that maximizes the average energy gap (|Z|) simultaneously for all proteins, the values of the gap obtained for real proteins are rather small. This indicates that in the framework of the model we use (contact pairwise potential, 4.5 Å cutoff of residue–residue interactions, etc.) no pairwise potential can provide high stability simultaneously to all native proteins.

How good is the model for native proteins?

The potential derived from native proteins converges to a certain value of Z-scores. Is this value large or small? To answer this question we should compare this value with two limiting cases: (1) when the functional form of the energy function is ‘‘exact’’ and sequences are well designed for this energy function; and (2) with randomly shuffled sequences.

For each protein in the dataset of native proteins, we design a sequence using the MJ potential as the true potential, preserving the amino acid composition of the native sequence. Then we derive a potential for a subset containing N_prot proteins from the database. The derivation is performed for the proteins built of: (1) the native structures with their native sequences; (2) the native structures with sequences designed for them; and (3) the native structures with randomly shuffled sequences.

Figure 7 presents average normalized energy gap values |F_norm| obtained for all three sets of sequences as a function of the number of proteins in the database. Similar to the lattice model (see Figure 4A), designed sequences reach high values of the gap |F| = 1.2, whereas for random sequences |F| = 0.2. Derivation of potential for native sequences yields |F| = 0.56, which is considerably less than the gap provided for the same structures with designed sequences.

Another important property of designed sequences is that the derived potential converges to a single potential as the number of proteins in the database increases. Randomly shuffled sequences, in contrast, lack this convergence. The criterion of this convergence is the correlation between potentials derived using 100 proteins and potentials derived using N_prot < 100 proteins. The correlation between potentials obtained for the database of native proteins as a function of the number of proteins in the database is shown in Figure 7B.

In contrast to randomly shuffled sequences, native sequences as well as designed sequences provide convergence to a single potential as the number of proteins used for the derivation increases. Hence, we are able to find a potential which, for the given model, maximizes the energy gap for all native proteins simultaneously. This result clearly demonstrates that the model energy function used in this study of proteins is meaningful and reflects some essential interactions, but not all, since there is a pronounced difference
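The convergence criterion can be illustrated with a short sketch that correlates the 210 independent entries of two symmetric 20 × 20 contact potentials. The helper names (`upper_triangle`, `pearson`, `potential_correlation`) and the synthetic matrices are assumptions made for illustration; in practice the two inputs would be a potential derived from a subset of proteins and one derived from the larger database.

```python
import math
import random


def upper_triangle(matrix):
    """Flatten the independent entries (i <= j) of a symmetric matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i, n)]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def potential_correlation(u_small, u_full):
    """Correlation between two potentials over their 210 parameters."""
    return pearson(upper_triangle(u_small), upper_triangle(u_full))


# Synthetic stand-ins: a "full database" potential and a noisy copy of it.
rng = random.Random(0)
n_types = 20
u_full = [[0.0] * n_types for _ in range(n_types)]
for i in range(n_types):
    for j in range(i, n_types):
        u_full[i][j] = u_full[j][i] = rng.gauss(0.0, 1.0)
u_noisy = [[u + rng.gauss(0.0, 0.3) for u in row] for row in u_full]
r = potential_correlation(u_noisy, u_full)
```

Plotting this correlation against the number of proteins in the subset gives a curve of the kind described for Figure 7B: it approaches 1 when a single potential exists and stays low when it does not.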


Figure 8. Derived potential versus true potential for the native proteins.

does not affect the quality of the reconstruction of the true potential.

Can poorly designed proteins fold?

As we have demonstrated above, native proteins are rather poorly designed in terms of the pairwise potential. The best possible pairwise potential provides a rather small energy gap to the native proteins, which is characterized by the typical value of Z_norm = 0.5 to 0.7. Well designed proteins have, in contrast, Z_norm = 1.2 to 1.4 and may be able to fold to their native conformation, as lattice model simulations suggest. The question is whether poorly designed sequences can fold as well.

To address this question we turn to the ‘‘ideal’’ lattice model and build a dataset of 200 poorly designed proteins. Proteins in this dataset are designed to have Z_norm = 0.5 to 0.7. Then we derived a potential for this dataset as described above and performed folding simulations for all sequences using the true and the derived potential. The result is that no protein was able to fold to its native conformation with either the derived or the true potential.

In all cases there was a conformation with an energy below the energy of the native conformation. Hence: (1) the native conformation is not the global energy minimum for a poorly designed protein; (2) poorly designed proteins are unable to fold to their native conformations in ab initio folding simulations.

Fold recognition of poorly designed proteins

Sampling techniques that are more constrained to protein-like conformations (Finkelstein & Reva, 1991; Jones et al., 1992; Wodak & Rooman, 1993) can, however, recognize the native and native-like folds among a small enough pool of alternative conformations. The success of different pairwise potentials in fold recognition shows that this sampling technique works quite well even for poorly designed proteins. Using our set of poorly designed sequences we perform fold recognition tests for all lattice model proteins by threading the sequence of each protein through 200 alternative conformations. Only three out of 200 sequences recognize a non-native conformation as the one of lowest energy. This result is in contrast to the previous observation that for every protein in this set the native conformation is not the global energy minimum. Hence, the only reason why fold recognition works for 197 proteins is that the set of decoys is not large and representative enough to contain a conformation with energy below that of the native one.

Not surprisingly, the comparison of the results of ab initio folding simulations and fold recognition indicates that folding is a much more complicated problem than fold recognition, since a much larger energy gap is required for successful folding than for fold recognition.
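A fold recognition test of this kind reduces to evaluating the contact energy of one sequence on each candidate conformation and picking the lowest. The sketch below assumes a toy two-letter alphabet, a hand-written potential, and conformations given as lists of contacting index pairs; none of these values come from the paper.

```python
def pair_energy(potential, a, b):
    """Look up a symmetric contact energy stored under one key order."""
    return potential[(a, b)] if (a, b) in potential else potential[(b, a)]


def contact_energy(sequence, contacts, potential):
    """Sum of contact energies over all contacting pairs (i, j)."""
    return sum(pair_energy(potential, sequence[i], sequence[j])
               for i, j in contacts)


def recognize_fold(sequence, conformations, potential):
    """Return (index of lowest-energy conformation, list of all energies)."""
    energies = [contact_energy(sequence, c, potential) for c in conformations]
    best = min(range(len(energies)), key=energies.__getitem__)
    return best, energies


# Hypothetical toy data: conformation 0 plays the role of the native fold.
potential = {("A", "A"): -1.0, ("A", "B"): 0.5, ("B", "B"): -0.2}
sequence = "ABBAAB"
native = [(0, 3), (0, 4), (1, 5)]
decoys = [[(0, 2), (1, 4), (3, 5)], [(0, 5), (1, 3), (2, 4)]]
best, energies = recognize_fold(sequence, [native] + decoys, potential)
```

All three toy conformations have the same number of contacts, in line with the assumption that decoys are as compact as the native state. Recognition succeeds here only because the decoy pool is tiny, which is the point made in the text.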

The question whether poorly designed proteins

between F-values for designed and random sequences.

Since our method of derivation maximizes |Z|-scores for all proteins, no potential can provide a greater |Z|-score for the studied Hamiltonian (pairwise interaction potential) than this method. Our results demonstrate that only very moderate |Z|-scores can be obtained using a pairwise potential for native proteins, and no potential can increase the values of |Z| for them. However, other models utilizing different protein structure representations or different forms of potential can be more efficient in providing a large enough energy gap for native proteins. Using our procedure one can compare different models quantitatively and select the one which provides the largest energy gaps for native proteins.

Can we derive a potential from the dataset of poorly designed sequences?

Using the lattice model and native proteins, we demonstrated that our procedure is able to reconstruct the true potential with sufficient accuracy if sequences in the database are well designed. Is this requirement too restrictive? How well can we reconstruct the potential for proteins with poorly designed sequences?

To mimic the poor design of native proteins observed in our model, for each protein structure we design sequences that provide the same value of energy gap as the native sequence. Design is performed using the MJ potential as the true one. Next we derive the potential for these poorly designed sequences and compare it with the true potential.

Figure 8 presents the scatter plot of interaction energies for the obtained potential plotted against the same values of the true potential. The result shows that our procedure reconstructs the potential for poorly designed sequences very well, providing a correlation of r_poor = 0.91 with the true potential. The poor design of native sequences in our model


Table 1. Comparison of different procedures for derivation of potential

| Potential                  | ⟨Z⟩  | Correlation with true potential | Fraction of proteins able to fold (%) |
|----------------------------|------|---------------------------------|---------------------------------------|
| True potential             | 7.68 | 1.00                            | 100                                   |
| This work                  | 8.61 | 0.83 (0.82)                     | 96                                    |
| Goldstein et al. (1992)    | 8.45 | 0.78 (0.71)                     | 94                                    |
| Hinds & Levitt (1994)      | 7.18 | 0.86 (0.84)                     | 99                                    |
| Miyazawa & Jernigan (1985) | 7.09 | 0.75 (0.68)                     | 95                                    |

Correlations are computed for potentials obtained using all 200 proteins. Correlations shown in brackets are for potentials obtained using 100 proteins.

can be used for recognition of the native fold in threading experiments has yet to be studied systematically.

Comparison with other potentials and techniques for extraction of potential

Several knowledge-based techniques for derivation of potentials from native protein structures have been suggested. It is important to compare our potential with other pairwise potentials, and our method of derivation with other methods. Figure 6 presents Z-scores computed for proteins of our database using our potential and two other potentials taken from the literature. Clearly our potential provides significantly lower values of Z-score for all proteins in the database. The two other potentials also perform well, providing rather low Z-scores. Although the potentials are obtained using different techniques, the overall profiles of Z-score for this dataset of proteins are very similar for all three potentials, i.e. when a protein has a low Z-score with one potential it usually has a low Z-score with another potential. High correlations between Z-score values provided by these three potentials for the same set of proteins (r_MJ,GKS = 0.83, r_GKS,MS = 0.86 and r_MJ,MS = 0.83) indicate that a high or low value of Z-score is a property of a protein itself, irrespective of the potential used. Different proteins are known to have different stability, i.e. different quality of design, which is displayed by high or low values of Z-score.

Comparison with other techniques for extraction of potential

It is important to compare not only the potentials themselves but also the techniques for derivation of potential. Our ideal lattice model is very useful for this purpose. We apply different techniques to the same set of lattice proteins and test the obtained potentials in the same way as we did for our technique.

Here we compare four techniques for derivation of potential. The first two are widely used statistical knowledge-based methods to derive the energy of residue–residue and residue–solvent interactions. The knowledge-based techniques are reproduced following Miyazawa & Jernigan (1985, 1996; MJ) and Hinds & Levitt (1994; HL). The third tested technique is the procedure suggested by Goldstein et al. (1992; GSW). This procedure is somewhat similar to our method, since the potential is obtained to maximize the ratio T_c/T_f, which is similar to the Z-score we use. Goldstein et al. found an analytic expression for the potential which maximizes T_c/T_f for one protein. To find the potential for a set of proteins they used averaging, which is not justified but yielded good results. We followed the procedure described by Goldstein et al. (1992) to test their technique. Note that both the GSW procedure and ours are optimization techniques, whereas HL and MJ are statistical knowledge-based ones.

The results for different techniques are summarized in Table 1. Our potential is aimed to minimize the harmonic mean Z-score and, as expected, provides a lower value of ⟨Z⟩_harm than other potentials. The GSW procedure gives only slightly higher values of mean Z, which shows that both optimization techniques are powerful enough to provide a large energy gap for proteins of a dataset. Knowledge-based techniques provide a large energy gap as well. The drastic difference between knowledge-based techniques and optimization techniques becomes apparent when we compare Z-scores obtained for different derived potentials with Z-scores provided by the true potential (see Figure 9). Both optimization techniques provide Z-scores which are lower than Z for the true potential. Knowledge-based techniques, in contrast, provide Z-scores higher than those for the true potential. Hence, knowledge-based potentials provide a smaller energy gap than the true potential does, whereas potentials obtained by optimization deliver an energy gap which is greater than that for the true potential. The decrease of the energy gap by knowledge-based potentials can be crucial for ab initio folding, especially for weakly designed proteins which have a rather small gap even with the true potential.

All tested techniques are also quite efficient in reconstruction of the true potential, exhibiting, however, different patterns of distortion of the original potential. Both optimization techniques tend to underestimate repulsive interactions (see Figure 1). Knowledge-based techniques, in contrast, provide good estimates of energies of repulsive interactions, but suffer from underestimation of attractive interactions (see Figure 10). Attractive interactions are responsible for stabilization of the native conformation, and underestimation of attractive interactions leads to the observed (Figure 9) increase in Z-score for knowledge-based potentials.

Figure 9. Z-score for 100 test lattice proteins with potentials derived by different techniques from 100 database lattice proteins. HL, Hinds & Levitt; MJ, Miyazawa & Jernigan; GSW, Goldstein et al.; MS, this work.

Another deformation of the true potential by the MJ technique is that it yields strong non-specific attraction between residues, which is seen as a low negative average interaction between residues (⟨U_MJ⟩ = −1.07 when σ(U) is set to 1). This non-specific attraction favors more compact conformations irrespective of amino acid sequence. This effect can mislead ab initio folding and fold recognition. The origin of this non-specific attraction is in the residue–solvent interactions taken into account by the MJ procedure. The estimate of the number of solvent–solvent interactions is responsible for the non-specific attraction.

Although all derivation procedures reconstruct the true potential with systematic deviations, all potentials are able to provide a large enough energy gap for well designed model sequences.

Discussion

In this work we proposed and tested a novel systematic approach to the long-standing problem of how to find the correct potential for protein folding.

In contrast to the widely used knowledge-based statistical technique, which relies on the hardly justifiable assumption of Boltzmann statistics, we use optimization in the space of parameters to search for a potential which maximizes the stability of all native proteins in the dataset.

The procedure was tested using the ideal model where sequences were designed with some known, true potential, and the recovered potential turned out to be quite close to the true one. The key feature of ideal models (both lattice and off-lattice) is that the form of the energy function (two-body contact Hamiltonian) is ‘‘exact’’, and the goal of the parameter search is to determine 210 numbers, the parameters of this Hamiltonian. We showed that our procedure recovers the parameters reliably and uniquely. It is important to note that it is not crucial for our method that sequences in the database are well designed: in fact, derivation of potentials using the database of weakly designed sequences (i.e. having relatively high Z-score) yielded potentials which were similarly quite close to the true potential. This is in contrast to the control case of assigning randomly shuffled sequences to structures: for them our procedure did not converge to any meaningful potential. In this case addition of any new ‘‘protein’’ (in fact a structure with a random sequence assigned to it) changed the potential dramatically, consistent with the notion that there is no potential which delivers low energy to all structure-random sequence pairs. In contrast, even for weakly designed sequences there is a potential for which these sequences have low (but perhaps not the lowest) energy in their corresponding native conformations, and such a potential is readily recovered by our optimization technique.

Figure 10. Potential obtained by the statistical knowledge-based technique versus true potential for the lattice model.


The method has internal controls of self-consistency. The first is that the optimization procedure in parameter space converges rapidly and at all algorithmic temperatures to a unique solution (no multiple-minima problem in the space of parameters). This suggests that the obtained solution delivers a global minimum of Z-scores for the studied proteins: no other potential can provide, on average, lower Z-scores for the same structures and sequences in the same model.

Another important test of self-consistency of the proposed method is convergence of potentials when the database size grows. This clearly shows that the problem is not under-determined, and it also indicates that there indeed exists a potential with which all structures have low energy. This criterion is especially important and useful when we consider energy functions more complicated than pairwise contact (see below).

The ideal models provide an ideal opportunity to compare our new method with other approaches, in particular with the methods based on the quasichemical approximation. Comparison of the true potential with the ones derived by the Miyazawa & Jernigan (1996) and Hinds & Levitt (1994) methods (including the most demanding ab initio folding tests) shows that a procedure based on the quasichemical approximation can extract the potential with impressive accuracy. This conclusion is in contrast with the assertion of Thomas & Dill (1996), who also tested the MJ procedure using a lattice model and argued that the extracted potential is not an accurate approximation of the true potential. We believe that the most important criterion of success of an extracted potential is how it performs in ab initio folding or threading tests. Thomas and Dill’s test is similar in spirit to threading because they addressed the issue of how often the global energy minimum structure remains as such with extracted potentials, judged by exhaustive enumeration of conformations. They considered all sequences (having a unique native state) for 14-mers and 16-mers with random sequences. However, the native conformations of random sequences (as well as other sequences having no or minimal energy gap) are extremely unstable with respect to any uncertainties in potentials (Bryngelson, 1993). This is in contrast with folding sequences (with energy gaps), which were shown to be much more robust with respect to uncertainties in potentials (Pande et al., 1995). Unfortunately, there are no sequences in the HP model which have energy gaps (Thomas & Dill, 1996), and one cannot even design such sequences in the HP model.

Therefore we believe that the major reason for the conclusion reached by Thomas & Dill (1996) is that they used a model where the native conformation is unstable with respect to even minor uncertainties in potentials. It is important to note also that models where sequences do not have energy gaps are equally unstable with respect to point mutations (Shakhnovich & Gutin, 1991). The remarkable stability of proteins with respect to many point

mutations is further strong evidence that real proteins should have a pronounced energy gap, a property absent in HP models.

Building a set of alternative conformations to compute the Z-score is an important part of this work. All results discussed here have been obtained under the following assumptions regarding the presentation of alternative conformations: (1) all alternative conformations have the same compactness as the native conformation; (2) all contacts are equally probable; and (3) they are statistically independent in the set of alternative conformations. These assumptions allowed us to compute the Z-score for a protein without building the set of alternative conformations explicitly. In fact, in order to compute the Z-score for a pairwise potential one needs to calculate the average frequency of a contact and the covariance of two contacts in the set of alternative conformations. The first assumption states that the number of contacts in alternative conformations is the same as in the native one. Assuming compactness of alternative conformations, we eliminate the effect of non-specific attraction/repulsion in the recognition of the native conformation. Non-specific attraction introduced into a potential favors the most compact conformations irrespective of amino acid sequence, which can give rise to false positives: very compact low energy conformations for any protein in the fold recognition test. One should be careful about a non-specific term in a potential, as it can substantially affect the results of ab initio folding or fold recognition. On the other hand, these non-specific terms can be readily eliminated by shifting the parameters by a given value (Shakhnovich, 1994; Gutin et al., 1995).

By assuming equal probability for all contacts we neglect the slight prevalence of contacts between amino acids close to each other along the polypeptide chain, which exists even in a random coil. However, since alternative conformations have the same compactness as the native one, local contacts are not expected to dominate in these conformations (Abkevich et al., 1995). Different probabilities of local and non-local contacts can be taken into account by assigning higher probabilities to local contacts in the set of unfolded conformations used in the calculation of the Z-score.

Our assumption of independent contacts is strictly valid only for point-size non-connected objects. Chain connectivity enforces positive correlation between contacts (i, j) and (i, j + l) for small l = 1, 2, ... On the other hand, the excluded volume of amino acids leads to anti-correlation between contacts (i, j) and (i, k), since amino acid i can have only a limited number of contacts due to excluded volume interactions. Several other factors can contribute to correlation between contacts in opposite ways, and the final outcome of these effects has yet to be understood.

The set of alternative conformations built in this way turned out to be adequate for estimating the energy gap in our model, since lattice proteins


which have low enough Z-scores are able to fold fast to their native conformations.

In general, while deriving a potential for a particular task and sampling procedure (fold recognition, design of an inhibitor, ab initio folding, etc.), one has to construct a set of alternative conformations which will be used as decoys during the sampling.

Alternative conformations used in this work correspond more closely (though not exactly) to sampling by folding under the condition of average attractive interaction between amino acids, while fold recognition is likely to have a different set of decoys. This set of decoys should be used in our procedure of derivation of potentials for fold recognition. It can be implemented by explicit generation of alternative conformations for a given protein by threading its sequence through other protein structures of the database. Frequencies of contacts and contact correlations computed for alternative conformations built in this way are to be used for computing Z-scores and derivation of the potential. Once derived, the potential will provide the best possible Z-score for fold recognition. This work is in progress.

Now we turn to the discussion of the results obtained for real proteins. First of all we see that the pairwise contact approximation is not meaningless for real proteins, i.e. certain aspects of their energetics are captured by that model. The clear evidence for that is that our procedure converges to a unique potential, and the Z-scores of proteins with that potential are considerably lower than for randomly shuffled sequences. This suggests that such a simplified Hamiltonian still carries some ‘‘signal’’. In this sense the derived two-body potentials are useful, since they are able to discriminate between the native conformation and decoys when the number of decoys is not too large.

However, the Z-scores obtained for proteins within the pairwise contact Hamiltonian approximation are not sufficiently low to provide high stability (or a large energy gap) for all proteins simultaneously. Hence all knowledge-based potentials can have only limited success in folding or recognition of the native fold among alternative conformations. This result can help in understanding the origin of problems arising with various structure prediction techniques. Our results suggest that limited success in folding simulations in the simple model with pairwise potentials may be due not to incorrect potentials (i.e. 210 numbers) but rather to the deficiency of the model itself, and no other potentials within the same model of pairwise contact interactions can provide better results uniformly for numerous tested proteins (of course there can be successes with potentials which are optimized to fold just one protein; Hao & Scheraga, 1996a,b); however, as our analysis shows, such a potential (speaking in our terms, derived by optimization from the database of one protein) will fail when used to fold another protein.

Several models have been suggested for protein folding which vary in accuracy of structure representation and in complexity of the energy function. What is the optimal number of parameters of the energy function? How does the number of parameters affect the results of the procedure to extract potentials? To address these questions we developed a convergence test which allows us to estimate the stability of the obtained potential with respect to the dataset of proteins used. This test indicates whether the number of parameters used to maximize the energy gap (210 in the case of the contact pairwise potential) is large enough to provide the gap for all proteins simultaneously and is small enough not to over-fit the data and adapt any random sequence to a protein structure in the database. Our results indicate that the 210 parameters of the contact pairwise potential are not too many (the potential converges as the size of the database increases), but the model itself is not sufficiently realistic to provide a large gap for real proteins. A more accurate presentation of the energy function (possibly including local conformational preferences, distance-dependent interactions, multibody interactions, etc.) is likely to be necessary to achieve better discrimination between the native structure and decoys. The presented method allows us to assess systematically the validity of different models and therefore can serve as a powerful tool in the search for the most adequate model for protein folding.

Methods

Derivation of potential

The energy function assigns a value of the energy to a given conformation for a given amino acid sequence:

E = E(Sequence, Conformation, U) (1)

where U is the set of parameters of the potential to be derived from known native protein structures.

We use the Z-score as a measure of how pronounced the energy minimum corresponding to the native conformation is (with respect to a set of alternative conformations; Bowie et al., 1991):

$$Z = \frac{E_N - \langle E \rangle_{conf}}{\sigma_{conf}(E)} \qquad (2)$$

This is the deviation of the energy of the native conformation from the average energy of alternative conformations, measured in units of standard deviation. The average energy ⟨E⟩ and standard deviation σ(E) are computed for a set of alternative conformations (see below). The absolute value of the Z-score is the natural measure of the energy gap.
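As a concrete reading of equation (2), the following sketch computes a Z-score from a native energy and an explicit list of decoy energies. The function name `z_score`, the use of the population standard deviation, and the numbers are assumptions for illustration only.

```python
import math


def z_score(native_energy, decoy_energies):
    """Z = (E_N - <E>_conf) / sigma_conf(E) over a set of decoy energies.

    The population standard deviation (division by K, not K - 1) is an
    assumption; the text does not specify which estimator is used.
    """
    k = len(decoy_energies)
    mean = sum(decoy_energies) / k
    variance = sum((e - mean) ** 2 for e in decoy_energies) / k
    return (native_energy - mean) / math.sqrt(variance)


# Hypothetical energies: the native state lies well below the decoys.
decoys = [-2.0, -1.0, 0.0, 1.0, 2.0]
z = z_score(-5.0, decoys)
```

A more negative Z corresponds to a native state that is further below the decoy distribution, i.e. a larger energy gap.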

Our goal is to find a potential U that minimizes Z-scores (maximizes the energy gap) simultaneously for all proteins in the dataset. This is achieved by building a target function which is an appropriate combination of individual Z-scores and then optimizing this function with respect to U. One should carefully choose a combination of Z-scores to optimize. If the target function to be optimized is naively chosen, for example a sum of Z-scores, then low values of the target function can be obtained if Z is small enough for some proteins and large


for all others. To avoid this kind of problem, one has to minimize max_m (Z^(m)), which is, however, very difficult to deal with because of its discontinuity. We chose the harmonic mean of the Z^(m) scores as the function to be minimized:

$$\langle Z \rangle_{harm} = \frac{M}{\sum_{m=1}^{M} 1/Z^{(m)}} \qquad (3)$$

The harmonic mean is a smooth approximation of max_m (Z^(m)), since terms with the smallest absolute value of the Z^(m) scores contribute most to the harmonic mean.
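The difference between the harmonic mean of equation (3) and a plain average can be seen on two small sets of Z-scores with the same arithmetic mean. The function names and the numbers are illustrative assumptions; the sketch also assumes that all Z-scores are negative, since the harmonic mean is not well behaved for mixed signs.

```python
def harmonic_mean_z(z_scores):
    """<Z>_harm = M / sum_m (1 / Z^(m)); all Z-scores must share one sign."""
    return len(z_scores) / sum(1.0 / z for z in z_scores)


def arithmetic_mean_z(z_scores):
    """Plain average, shown only for comparison."""
    return sum(z_scores) / len(z_scores)


# Two hypothetical datasets with the same arithmetic mean of -5.
balanced = [-5.0, -5.0, -5.0, -5.0]   # every protein has a decent gap
uneven = [-9.5, -9.5, -0.5, -0.5]     # two proteins have almost no gap

h_balanced = harmonic_mean_z(balanced)
h_uneven = harmonic_mean_z(uneven)
```

For the balanced set the two means coincide, whereas for the uneven set the harmonic mean is pulled close to the weakest Z-scores. Minimizing it therefore penalizes potentials that stabilize some proteins at the expense of others.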

To maximize the energy gap for all proteins in a dataset, we search for a potential U which maximizes the value of the function F(U) = −⟨Z⟩_harm. The value of F is directly related to the energy gap. Hence below we refer to F as an energy gap, with the understanding that it is not exactly identical to it, but that these two quantities have a monotonic interdependence.

We also apply some constraints to the potential U:

$$\langle U \rangle = 0 \qquad (4)$$

$$\sigma^2(U) = \langle (\langle U \rangle - U)^2 \rangle = 1 \qquad (5)$$

The first constraint sets the average interaction between amino acids to zero, i.e. eliminates non-specific attraction/repulsion between amino acids. Non-specific attraction/repulsion favors more/less compact conformations irrespective of the amino acid sequence. We use the first constraint to avoid this kind of bias.

The second constraint sets the dispersion of interaction energies to one. If energy is a linear function of the parameters, multiplication of U by an arbitrary constant does not change the values of the Z-score. By setting σ(U) = 1, we choose units of energy.
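The effect of constraints (4) and (5) can be checked numerically. The sketch below is an illustration under stated assumptions: the potential is a flat list of three hypothetical contact-type energies, each conformation is summarized by how many contacts of each type it contains, and all conformations have the same total number of contacts (which is what makes the Z-score insensitive to a uniform shift of U in this toy setting).

```python
import math


def normalize_potential(values):
    """Shift to zero mean and rescale to unit standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]


def energy(potential, counts):
    """Energy that is linear in the potential parameters."""
    return sum(u * c for u, c in zip(potential, counts))


def z_score(potential, native_counts, decoy_counts):
    """Z-score of the native state against explicit decoys."""
    energies = [energy(potential, c) for c in decoy_counts]
    mean = sum(energies) / len(energies)
    std = math.sqrt(sum((e - mean) ** 2 for e in energies) / len(energies))
    return (energy(potential, native_counts) - mean) / std


# Hypothetical data: three contact types, five contacts per conformation.
u = [-1.2, 0.3, 0.8]
native_counts = [4, 1, 0]
decoy_counts = [[1, 2, 2], [2, 2, 1], [0, 3, 2], [2, 1, 2]]

u_norm = normalize_potential(u)
z_raw = z_score(u, native_counts, decoy_counts)
z_scaled = z_score([3.0 * v for v in u], native_counts, decoy_counts)
z_norm = z_score(u_norm, native_counts, decoy_counts)
```

Here `z_raw`, `z_scaled`, and `z_norm` agree up to rounding, so the constraints fix the gauge of U without changing the quantity being optimized.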

The potential U is obtained by maximizing F(U) using a procedure for non-linear optimization. The potential obtained in this way is, by construction, the one which provides the largest energy gaps simultaneously to all proteins in the dataset, insofar as ⟨Z⟩_harm is an accurate approximation of max_m (Z^(m)).

One of the most important parts of the method is the set of alternative conformations used to compute Z-scores. In general, one has to use the same set of conformations for sampling and for computing Z-scores. For example, to optimize the potential for threading, one has to compute Z-scores using a set of alternative conformations obtained by threading a sequence through a representative set of protein structures. To find a potential, one needs to generate a set of alternative conformations and then use this set to compute individual Z-scores. This procedure is computationally inexpensive, since the set of conformations obtained by threading does not depend on the potentials used. However, when dynamic sampling techniques (Monte Carlo, molecular dynamics, growth procedures, etc.) are used for ab initio folding, the set of alternative conformations is not known in advance and, more importantly, depends on the potential applied. In this case one has to make some assumptions about the ensemble of alternative conformations that will enable one to compute the average energy ⟨E⟩_conf and the standard deviation of energy σ_conf(E) over the ensemble of alternative conformations. In this study we show how to optimize a pairwise potential for ab initio folding of a simple model and for threading.

Derivation of parameters for pairwise potential

Pairwise potential

We consider a pairwise contact potential, i.e. the energy of a conformation is a sum of the energies of pairwise contacts between monomers that are not nearest neighbors in sequence:

$$E(\xi, \Delta) = \sum_{1 \le i < j \le N} U(\xi_i, \xi_j)\,\Delta_{ij} \tag{6}$$

where $\Delta_{ij} = 1$ if monomers i and j are in contact and $\Delta_{ij} = 0$ otherwise. Various definitions of contacts can be used (Kocher et al., 1994). $\xi_i$ defines the type of amino acid residue in position i. The potential is given by the matrix U, where $U(\xi, \eta)$ is the energy of a contact between amino acids of types $\xi$ and $\eta$.
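As an illustration of equation (6), the sketch below evaluates the contact energy for a chain on a cubic lattice. The contact definition used here (non-bonded monomers at unit lattice distance) is only one possible choice and is an assumption of this example, as are the function names `lattice_contacts` and `contact_energy`.

```python
def lattice_contacts(coords):
    """Return the set of contacts (i, j), i < j, for a chain on a cubic lattice.

    Assumed contact definition: monomers at unit lattice distance that are
    not nearest neighbours in sequence.
    """
    contacts = set()
    for i in range(len(coords)):
        for j in range(i + 2, len(coords)):  # j >= i + 2 skips bonded pairs
            if sum(abs(a - b) for a, b in zip(coords[i], coords[j])) == 1:
                contacts.add((i, j))
    return contacts


def contact_energy(seq, contacts, U):
    """Equation (6): sum of U[type_i][type_j] over all contacting pairs."""
    return sum(U[seq[i]][seq[j]] for i, j in contacts)
```

Here `seq` is a list of integer residue types and `U` a symmetric matrix indexed by those types; storing the contact map as a set of index pairs keeps the sum restricted to the pairs for which the contact indicator equals one.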

Optimization of potential

The optimization of $F(U) = -\langle Z \rangle_{harm}$ is performed by the Metropolis Monte Carlo procedure in the space of potentials, i.e. at each step a cell $U(\xi, \eta)$ of the matrix U is chosen randomly and a small random number $r \in [-0.1, 0.1]$ is added to $U(\xi, \eta)$. This change is accepted if it increases F(U) and rejected with probability:

$$1 - \exp\left(-\frac{\delta F}{T_{opt}}\right)$$

if it decreases F(U). Here $\delta F$ is to be read as the magnitude of the decrease, so that the expression is a valid probability, and $T_{opt}$ is the temperature of optimization. Optimization of potentials starts from a completely random potential and stops when the target function has changed by less than $\epsilon = 0.01$ over the last 20,000 steps.
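The optimization loop described above can be sketched as follows. This is a minimal illustration under stated assumptions: the target function `F` is passed in as an arbitrary callable, the matrix is kept symmetric by updating both U[a][b] and U[b][a], the constraints on the mean and dispersion of U are not re-imposed inside the loop, and the `max_steps` cap is added only to guarantee termination. None of these details are specified in the text.

```python
import math
import random
from collections import deque


def optimize_potential(F, U, T_opt=0.01, step=0.1, eps=0.01,
                       window=20000, max_steps=10**6, seed=0):
    """Metropolis search in the space of potentials that maximizes F(U).

    F is a hypothetical callable taking a matrix and returning a float.
    Returns the final matrix and the final value of F.
    """
    rng = random.Random(seed)
    n = len(U)
    U = [row[:] for row in U]          # work on a copy of the start matrix
    f_cur = F(U)
    history = deque([f_cur], maxlen=window)
    for _ in range(max_steps):
        a, b = rng.randrange(n), rng.randrange(n)
        old = U[a][b]
        U[a][b] = U[b][a] = old + rng.uniform(-step, step)
        f_new = F(U)
        dF = f_new - f_cur
        # Accept every improvement; accept a decrease with prob. exp(dF / T_opt).
        if dF >= 0 or rng.random() < math.exp(dF / T_opt):
            f_cur = f_new
        else:
            U[a][b] = U[b][a] = old    # reject: restore the previous value
        # Stop once F has changed by less than eps over the last `window` steps.
        if len(history) == window and abs(f_cur - history[0]) < eps:
            break
        history.append(f_cur)
    return U, f_cur
```

In practice `F` would evaluate the harmonic mean of the Z-scores over the dataset; any cheap stand-in with the same call signature can be used to check the loop itself.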

Computation of Z-score

For a given sequence $\xi$, potential U and generated set of alternative conformations $\Delta^{(k)}$, $k = 1, \ldots, K$, one can compute the Z-score of the native conformation $\Delta^N$:

$$Z(\xi, \Delta^N) = \frac{E(\xi, \Delta^N) - \langle E(\xi, \Delta^{(k)}) \rangle_k}{\sigma(E(\xi, \Delta^{(k)}))_k} \tag{7}$$

where index k denotes averaging over alternative conformations.

Instead of computing the energy of a sequence in each alternative conformation whenever we need Z, we compute the average quantities for the set of conformations and use these averages to compute Z-scores for any sequence and any potential. These average quantities are computed only once, which saves a significant amount of computer time.

The energy of an individual conformation for a pairwise Hamiltonian is given by equation (6). Hence, the average energy for the set of conformations is:

$$\langle E(\xi, \Delta^{(k)}) \rangle_k = \sum_{1 \le i < j \le N} U(\xi_i, \xi_j)\,\langle \Delta_{ij} \rangle_k \tag{8}$$

where $\langle \Delta_{ij} \rangle$ is the average density of contacts between residues number i and j in the set of alternative conformations:

$$\langle \Delta_{ij} \rangle_k = \frac{1}{K} \sum_{k=1}^{K} \Delta^{(k)}_{ij} \tag{9}$$


Note that one can compute the matrix of the average contact density $\langle \Delta_{ij} \rangle$ only once for a set of conformations and later use this matrix to compute the average energy for a sequence $\xi$ and any potential U (see equation (8)).

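Similarly for $\sigma(E)$,

$$\sigma^2(E(\xi, \Delta^{(k)}))_k = \langle E^2 \rangle_k - \langle E \rangle_k^2 = \sum_{1 \le i < j \le N}\ \sum_{1 \le l < m \le N} U(\xi_i, \xi_j)\,U(\xi_l, \xi_m)\,T_{ij,lm} \tag{10}$$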

where $T_{ij,lm}$ is a contact correlator,

$$T_{ij,lm} = \langle \Delta^{(k)}_{ij} \Delta^{(k)}_{lm} \rangle_k - \langle \Delta^{(k)}_{ij} \rangle_k \langle \Delta^{(k)}_{lm} \rangle_k \tag{11}$$

which depends only on the set of conformations and can be computed in advance for a given set. Once $\langle \Delta_{ij} \rangle$ and $T_{ij,lm}$ are computed, one can easily compute the value of the Z-score for a given sequence $\xi$, conformation $\Delta^N$ and potential U:

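$$Z(\xi, \Delta^N, U) = \frac{\displaystyle\sum_{1 \le i < j \le N} U(\xi_i, \xi_j)\left(\Delta^N_{ij} - \langle \Delta_{ij} \rangle_k\right)}{\sqrt{\displaystyle\sum_{1 \le i < j \le N}\ \sum_{1 \le l < m \le N} U(\xi_i, \xi_j)\,U(\xi_l, \xi_m)\,T_{ij,lm}}} \tag{12}$$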
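The precomputation scheme of equations (8) to (12) can be checked numerically with a small sketch. It assumes conformations are stored as sets of contact pairs (i, j) and uses a randomly generated toy ensemble; the function names, the toy sequence and the toy potential are illustrative assumptions. The block computes the Z-score once from the stored averages and once directly from equation (7), so that the two values can be compared.

```python
import math
import random
import statistics
from itertools import combinations


def contact_statistics(conformations, n_res):
    """Return the pair list, mean contact density (eq. 9) and correlator T (eq. 11)."""
    # Candidate pairs i < j, excluding nearest neighbours in sequence.
    pairs = [(i, j) for i, j in combinations(range(n_res), 2) if j - i > 1]
    K, P = len(conformations), len(pairs)
    maps = [[1.0 if p in conf else 0.0 for p in pairs] for conf in conformations]
    mean = [sum(m[p] for m in maps) / K for p in range(P)]
    T = [[sum(m[p] * m[q] for m in maps) / K - mean[p] * mean[q]
          for q in range(P)] for p in range(P)]
    return pairs, mean, T


def z_score(seq, native, U, pairs, mean, T):
    """Equation (12): Z-score from the precomputed averages only."""
    P = len(pairs)
    u = [U[seq[i]][seq[j]] for i, j in pairs]
    gap = sum(u[p] * ((1.0 if pairs[p] in native else 0.0) - mean[p])
              for p in range(P))
    var = sum(u[p] * u[q] * T[p][q] for p in range(P) for q in range(P))
    return gap / math.sqrt(var)


# Toy data (all values below are made up for the check).
rng = random.Random(0)
n_res, n_types = 6, 3
all_pairs = [(i, j) for i, j in combinations(range(n_res), 2) if j - i > 1]
confs = [set(rng.sample(all_pairs, 4)) for _ in range(50)]
native = set(rng.sample(all_pairs, 4))
seq = [0, 1, 2, 0, 1, 2]
U = [[0.0] * n_types for _ in range(n_types)]
for a in range(n_types):
    for b in range(a, n_types):
        U[a][b] = U[b][a] = rng.uniform(-1.0, 1.0)

pairs, mean, T = contact_statistics(confs, n_res)
z_fast = z_score(seq, native, U, pairs, mean, T)


def energy(conf):
    """Equation (6) for the toy sequence and potential."""
    return sum(U[seq[i]][seq[j]] for i, j in conf)


# Direct evaluation of equation (7) for comparison.
energies = [energy(c) for c in confs]
z_direct = (energy(native) - statistics.fmean(energies)) / statistics.pstdev(energies)
```

The correlator has one entry per pair of contact pairs, so its size grows as the fourth power of the chain length; the benefit is that it is computed once and reused for every sequence and every trial potential.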

Alternative conformations

For each protein in the dataset, we build an ensemble of alternative conformations which contains conformations with the same compactness as the native one, i.e. the same number of residue-residue contacts. In fact, instead of generating a huge number of conformations, we assume that: (1) the contacts in the alternative conformations are distributed independently and uniformly and that (2) the number of contacts is the same as in the native conformation. These assumptions allow one to compute the average density of contacts $\langle \Delta_{ij} \rangle$ and correlator $T_{ij,lm}$ as:

$$\langle \Delta_{ij} \rangle = \frac{n}{n_{total}} \tag{13}$$

and

$$T_{ij,lm} = \begin{cases} \dfrac{1}{n_{total}^2} & \text{if } ij \ne lm \\[2ex] \dfrac{1}{n_{total}} - \dfrac{1}{n_{total}^2} & \text{if } ij = lm \end{cases} \tag{14}$$

where n is the number of contacts in the native conformation, and $n_{total}$ is the total number of topologically possible contacts.

Following this, the value of the Z-score can be computed for each protein in the dataset and a given potential U using equation (12). Lattice model simulations show that sequences with low values of Z are able to fold fast to their native conformations (Abkevich et al., 1994; Gutin et al., 1995).

To test our derivation method and compare it with other techniques, we first turn to a simple lattice model which allows us to test a potential by performing ab initio folding of a protein starting from a random conformation. Next, we apply our method to derive the parameters from the dataset of well-resolved protein structures.

Sequence design

The aim of our sequence design is to find a sequence (for a given potential) that delivers a low Z-score to a given conformation. The procedure starts from a random sequence with a given amino acid composition. Although different sequences have different amino acid compositions, the composition in the dataset corresponds to that of native proteins (Creighton, 1993).

At each step we choose two residues at random and attempt to permute them. A change of Z-score ($\delta Z$) associated with this permutation is computed. If this permutation decreases the value of the Z-score ($\delta Z < 0$), then this permutation is accepted, otherwise ($\delta Z > 0$) the permutation is rejected with probability:

$$1 - \exp\left(-\frac{\delta Z}{T_{sel}}\right)$$

The procedure stops when either no changes in sequence have occurred in the last 1000 $N_t$ steps or a preset value of the Z-score ($Z_{target}$) is reached. Using this procedure and setting different values of $Z_{target}$, we are able to generate sequences that provide the required value of the Z-score for a given conformation.
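The design procedure above can be sketched as a Metropolis search over residue permutations. In this minimal illustration the scoring function `z_func` is a hypothetical callable supplied by the caller, the `patience` argument stands in for the "no changes in the last steps" criterion, and `max_steps` is an extra safety cap that is not part of the description in the text.

```python
import math
import random


def design_sequence(seq, z_func, T_sel=0.05, z_target=-math.inf,
                    patience=1000, max_steps=100000, seed=0):
    """Swap residues to lower z_func(seq) while keeping the composition fixed.

    z_func is a hypothetical callable returning the Z-score of a sequence
    in the target conformation. Returns the designed sequence and its score.
    """
    rng = random.Random(seed)
    seq = list(seq)
    z = z_func(seq)
    idle = 0                                  # steps since the last change
    for _ in range(max_steps):
        if z <= z_target or idle >= patience:
            break
        i, j = rng.sample(range(len(seq)), 2)
        if seq[i] == seq[j]:                  # swap would not change anything
            idle += 1
            continue
        seq[i], seq[j] = seq[j], seq[i]
        z_new = z_func(seq)
        dz = z_new - z
        # Accept if Z decreases; otherwise accept with prob. exp(-dz / T_sel).
        if dz < 0 or rng.random() < math.exp(-dz / T_sel):
            z = z_new
            idle = 0
        else:
            seq[i], seq[j] = seq[j], seq[i]   # reject: undo the swap
            idle += 1
    return seq, z
```

Because only permutations are attempted, the amino acid composition of the starting sequence is preserved exactly; different values of `z_target` then yield sequences of different stability for the same conformation.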

Acknowledgements

We are grateful to Alexander Gutin and Victor Abkevich for many fruitful discussions. L.M. is grateful to Michael Schwarz for interesting discussions. We thank Richard Goldstein for explaining to us their averaging procedure. This work was funded by the Packard Foundation.

The parameter set is available from our anonymous ftp-site paradox.harvard.edu; file Euv.dat in the directory /pub/leonid.

References

Abkevich, V. I., Gutin, A. M. & Shakhnovich, E. I. (1994). Free energy landscape for protein folding kinetics, intermediates, traps and multiple pathways in theory and lattice model simulations. J. Chem. Phys. 101, 6052–6062.

Abkevich, V. I., Gutin, A. M. & Shakhnovich, E. I. (1995). Impact of local and non-local interactions on thermodynamics and kinetics of protein folding. J. Mol. Biol. 252, 460–471.

Bowie, J. U., Luthy, R. & Eisenberg, D. (1991). A method to identify protein sequences that fold into a known three-dimensional structure. Science, 253, 164–170.

Bryant, S. H. & Lawrence, C. E. (1993). An empirical energy function for threading protein sequence through the folding motif. Proteins: Struct. Funct. Genet. 16, 92–112.

Bryngelson, J. D. (1994). When is a potential accurate enough for structure prediction? J. Chem. Phys. 100, 6038–6045.

Covell, D. G. (1994). Low resolution models of polypeptide chain collapse. J. Mol. Biol. 25, 1032–1043.

Creighton, T. E. (1993). Proteins: Structures and Molecular Properties, W. H. Freeman and Co, New York.

DeWitte, R. & Shakhnovich, E. (1994). Pseudodihedrals: simplified protein backbone representation with knowledge-based energy. Protein Sci. 3, 1570–1581.


Elofson, A., Le Grand, S. & Eisenberg, D. (1995). Local moves: an efficient algorithm for simulation of protein folding. Proteins: Struct. Funct. Genet. 23, 73–82.

Finkelstein, A. V. & Reva, B. A. (1991). Search for the most stable folds of protein chains. Nature, 351, 497–499.

Finkelstein, A. V., Gutin, A. M. & Badretdinov, A. Ya. (1993). Why are the same protein folds used to perform different functions? FEBS Letters, 325, 23–28.

Finkelstein, A. V., Gutin, A. M. & Badretdinov, A. Ya. (1995). Why do protein architectures have Boltzmann-like statistics? Proteins: Struct. Funct. Genet. 23, 142–149.

Godzik, A., Kolinski, A. & Skolnick, J. (1995). Are proteins ideal mixtures of amino acids? Analysis of energy parameter sets. Protein Sci. 4, 2101–2117.

Goldstein, R., Luthey-Schulten, Z. A. & Wolynes, P. (1992). Optimal protein-folding codes from spin-glass theory. Proc. Natl Acad. Sci. USA, 89, 4918–4922.

Gutin, A. M., Abkevich, V. I. & Shakhnovich, E. I. (1995). Evolution-like selection of fast-folding model proteins. Proc. Natl Acad. Sci. USA, 92, 1282–1286.

Hao, M. & Scheraga, H. (1996a). How optimization of potential function affects protein folding. Proc. Natl Acad. Sci. USA, 93, 4984–4989.

Hao, M.-H. & Scheraga, H. A. (1996b). Optimizing potential function for protein folding. J. Phys. Chem. 100, 14540–14548.

Hinds, D. A. & Levitt, M. (1994). Exploring conformational space with a simple lattice model for protein structure. J. Mol. Biol. 243, 668–682.

Jones, D. T., Taylor, W. R. & Thornton, J. M. (1992). A new approach to protein fold recognition. Nature, 358, 86–89.

Kocher, J. P., Rooman, M. J. & Wodak, S. J. (1994). Factors influencing the ability of knowledge-based potentials to identify native sequence-structure matches. J. Mol. Biol. 235, 1598–1613.

Kolaskar, A. S. & Prashanth, D. (1979). Empirical torsional potential functions from protein structure data. Int. J. Pept. Protein Res. 14, 88–98.

Kolinski, A. & Skolnick, J. (1993). A general method for the prediction of three dimensional structure and folding pathway of globular proteins: application of designed helical proteins. J. Chem. Phys. 98, 7420–7433.

Kolinski, A. & Skolnick, J. (1994). Monte Carlo simulations of protein folding. I. Lattice model and interaction scheme. Proteins: Struct. Funct. Genet. 18, 338–352.

Lattman, E. E. (1995). Protein structure prediction. Proteins: Struct. Funct. Genet. 23, 295–462.

Levitt, M. (1976). A simplified representation of protein conformation for rapid simulation of protein folding. J. Mol. Biol. 104, 59–107.

Lemer, C., Rooman, M. & Wodak, S. (1995). Protein structure prediction by threading methods: evaluation of current techniques. Proteins: Struct. Funct. Genet. 23, 321–332.

Maiorov, V. N. & Crippen, G. M. (1992). Contact potential that recognizes the correct folding of globular proteins. J. Mol. Biol. 227, 876–888.

Mirny, L. & Domany, E. (1996). Protein fold recognition and dynamics in the space of contact maps. Proteins: Struct. Funct. Genet. In the press.

Miyazawa, S. & Jernigan, R. L. (1985). Estimation of effective interresidue contact energies from protein crystal structures: quasi-chemical approximation. Macromolecules, 18, 534–552.

Miyazawa, S. & Jernigan, R. (1996). Residue–residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading. J. Mol. Biol. 256, 623–644.

Nishikawa, K. & Matsuo, Y. (1993). Development of pseudoenergy potentials for assessing protein 3-D-1-D compatibility and detecting weak homologies. Protein Eng. 6, 811–820.

Pande, V., Grosberg, A. & Tanaka, T. (1995). How accurate must potentials be for successful modeling of protein folding? J. Chem. Phys. 103, 9482–9491.

Park, B. & Levitt, M. (1996). Energy functions that discriminate X-ray and near native folds from well-constructed decoys. J. Mol. Biol. 258, 367–392.

Rooman, M., Kocher, J. A. & Wodak, S. (1992). Extracting information on folding from amino acid sequence: accurate predictions for protein regions with preferred conformation in the absence of tertiary interactions. Biochemistry, 32, 10226–10238.

Sali, A., Shakhnovich, E. I. & Karplus, M. (1994a). Kinetics of protein folding. A lattice model study of the requirements for folding to the native state. J. Mol. Biol. 235, 1614–1636.

Shakhnovich, E. I. & Gutin, A. M. (1991). Influence of point mutations on protein structure. Probability of a neutral mutation. J. Theoret. Biol. 149, 537–546.

Shakhnovich, E. I., Farztdinov, G. M., Gutin, A. M. & Karplus, M. (1991). Protein folding bottlenecks: a lattice Monte Carlo simulation. Phys. Rev. Letters, 67, 1665–1667.

Shakhnovich, E. I. & Gutin, A. M. (1993a). Engineering of stable and fast-folding sequences of model proteins. Proc. Natl Acad. Sci. USA, 90, 7195–7198.

Shakhnovich, E. I. & Gutin, A. M. (1993b). A novel approach to design of stable proteins. Protein Eng. 6, 793–800.

Shakhnovich, E. I. (1994). Proteins with selected sequences fold to their unique native conformation. Phys. Rev. Letters, 72, 3907–3910.

Sippl, M. J. (1990). Calculation of conformational ensembles from potentials of mean force. An approach to the knowledge-based prediction of local structures in globular proteins. J. Mol. Biol. 213, 859–883.

Skolnick, J. & Kolinski, A. (1990). Simulations of the folding of a globular protein. Science, 250, 1121–1125.

Socci, N. D. & Onuchic, J. N. (1994). Folding kinetics of protein-like heteropolymers. J. Chem. Phys. 101, 1519–1528.

Tanaka, S. & Scheraga, H. A. (1976). Medium- and long-range interaction parameters between amino acids for predicting three-dimensional structures of proteins. Macromolecules, 9, 945–950.

Thomas, P. & Dill, K. (1996). Statistical potentials extracted from protein structures: How accurate are they? J. Mol. Biol. 257, 457–469.

Ueda, Y., Taketomi, H. & Go, N. (1978). Studies of protein folding, unfolding and fluctuations by computer simulations. II. A three-dimensional lattice model of lysozyme. Biopolymers, 17, 1531–1548.

Vasques, M., Nemethy, G. & Scheraga, H. (1994). Conformational energy calculations on polypeptides and proteins. Chem. Rev. 94, 2183–2239.

Wilson, C. & Doniach, S. (1989). Computer model to dynamically simulate protein folding: studies with crambin. Proteins: Struct. Funct. Genet. 6, 193–209.

Wodak, S. & Rooman, M. (1993). Generating and testing protein folds. Curr. Opin. Struct. Biol. 3, 247–259.

Yue, K., Fiedig, K., Thomas, P., Chan, H. S., Shakhnovich, E. I. & Dill, K. A. (1995). A test of lattice protein folding algorithms. Proc. Natl Acad. Sci. USA, 92, 325–329.

Edited by F. E. Cohen

(Received 15 July 1996; accepted 18 September 1996)