simple physical models connect theory and …morozov/pdfs/alm_morozov_jmb...simple physical models...

14
Simple Physical Models Connect Theory and Experiment in Protein Folding Kinetics Eric Alm 1 , Alexandre V. Morozov 2 , Tanja Kortemme 3 and David Baker 3 * 1 Lawrence Berkeley National Lab, Physical Biosciences Division, Berkeley, CA 94720 USA 2 Department of Physics University of Washington Box 351560, Seattle, WA 98195 USA 3 Department of Biochemistry Howard Hughes Medical Institute, University of Washington, J-567 Health Sciences, Box 357350, Seattle WA 98195-7350, USA Our understanding of the principles underlying the protein-folding problem can be tested by developing and characterizing simple models that make predictions which can be compared to experimental data. Here we extend our earlier model of folding free energy landscapes, in which each residue is considered to be either folded as in the native state or completely disordered, by investigating the role of additional factors representing hydrogen bonding and backbone torsion strain, and by using a hybrid between the master equation approach and the simple transition state theory to evaluate kinetics near the free energy barrier in greater detail. Model calculations of folding f-values are compared to experimental data for 19 proteins, and for more than half of these, experimental data are reproduced with correlation coefficients between r ¼ 0.41 and 0.88; calculations of transition state free energy barriers correlate with rates measured for 37 single domain proteins (r ¼ 0.69). The model provides insight into the contribution of alternative-folding pathways, the validity of quasi-equilibrium treatments of the folding landscape, and the magnitude of the Arrhenius prefactor for protein folding. Finally, we discuss the limitations of simple native-state-based models, and as a more general test of such models, provide predictions of folding rates and mechanisms for a comprehensive set of over 400 small protein domains of known structure. q 2002 Elsevier Science Ltd. All rights reserved Keywords: protein folding; transition state; kinetics; f-values; master equation *Corresponding author Introduction Recent theoretical models by Alm & Baker, 1 Galzitskaya & Finkelstein, 2 and Munoz & Eaton 3 have focused on the importance of topology in determining protein-folding mechanisms, using simple free energy functions to make predictions about folding rates and transition state (TS) structures. All three groups used a simplified approach in which each residue is considered to be either ordered as in the native state or com- pletely disordered, with ordered residues occur- ring in one or more contiguous segments of the protein chain (the multiple sequence approxi- mation). The models balance the entropic cost of ordering residues against the free energy decrease associated with making native interactions. Munoz & Eaton scaled the strength of native interactions on the basis of protein stability, a first- order approximation that allowed for the calcu- lation of relative folding rates. Galzitskaya & Finkelstein, and Alm & Baker considered inter- actions between ordered segments, and were successful at predicting the distribution of struc- ture in the folding TS for a limited number of proteins. In contrast to these models, work by Portman et al. introduces the use of a local order parameter to bypass the requirement that indi- vidual residues be either completely ordered or disordered. 4–6 Studies by Clementi et al., and more recently by Koga & Takada continue the theme of native topology-based models, but relax the multiple sequence approximation by performing off-lattice simulations using a simplified represen- tation of the protein chain. 7,8 The correlation between folding rates and the simple topological measure, average contact order, suggests that such models may be sufficient to explain folding rates. 9 However, recent experi- mental data indicate that proteins with similar 0022-2836/02/$ - see front matter q 2002 Elsevier Science Ltd. All rights reserved E.A. & A.V.M. contributed equally to this work. E-mail address of the corresponding author: [email protected] Abbreviations used: TS, transition state; TST, transition state theory. doi:10.1016/S0022-2836(02)00706-4 available online at http://www.idealibrary.com on B w J. Mol. Biol. (2002) 322, 463–476

Upload: others

Post on 29-May-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Simple Physical Models Connect Theory and …morozov/pdfs/Alm_Morozov_JMB...Simple Physical Models Connect Theory and Experiment in Protein Folding Kinetics Eric Alm1, Alexandre V

Simple Physical Models Connect Theory andExperiment in Protein Folding Kinetics

Eric Alm1, Alexandre V. Morozov2, Tanja Kortemme3 and David Baker3*

1Lawrence Berkeley NationalLab, Physical BiosciencesDivision, Berkeley, CA 94720USA

2Department of PhysicsUniversity of WashingtonBox 351560, Seattle, WA 98195USA

3Department of BiochemistryHoward Hughes MedicalInstitute, University ofWashington, J-567 HealthSciences, Box 357350, SeattleWA 98195-7350, USA

Our understanding of the principles underlying the protein-foldingproblem can be tested by developing and characterizing simple modelsthat make predictions which can be compared to experimental data.Here we extend our earlier model of folding free energy landscapes, inwhich each residue is considered to be either folded as in the native stateor completely disordered, by investigating the role of additional factorsrepresenting hydrogen bonding and backbone torsion strain, and byusing a hybrid between the master equation approach and the simpletransition state theory to evaluate kinetics near the free energy barrier ingreater detail. Model calculations of folding f-values are comparedto experimental data for 19 proteins, and for more than half of these,experimental data are reproduced with correlation coefficients betweenr ¼ 0.41 and 0.88; calculations of transition state free energy barrierscorrelate with rates measured for 37 single domain proteins (r ¼ 0.69).The model provides insight into the contribution of alternative-foldingpathways, the validity of quasi-equilibrium treatments of the foldinglandscape, and the magnitude of the Arrhenius prefactor for proteinfolding. Finally, we discuss the limitations of simple native-state-basedmodels, and as a more general test of such models, provide predictionsof folding rates and mechanisms for a comprehensive set of over 400small protein domains of known structure.

q 2002 Elsevier Science Ltd. All rights reserved

Keywords: protein folding; transition state; kinetics; f-values; masterequation*Corresponding author

Introduction

Recent theoretical models by Alm & Baker,1

Galzitskaya & Finkelstein,2 and Munoz & Eaton3

have focused on the importance of topology indetermining protein-folding mechanisms, usingsimple free energy functions to make predictionsabout folding rates and transition state (TS)structures. All three groups used a simplifiedapproach in which each residue is considered tobe either ordered as in the native state or com-pletely disordered, with ordered residues occur-ring in one or more contiguous segments of theprotein chain (the multiple sequence approxi-mation). The models balance the entropic cost ofordering residues against the free energy decreaseassociated with making native interactions.

Munoz & Eaton scaled the strength of nativeinteractions on the basis of protein stability, a first-order approximation that allowed for the calcu-lation of relative folding rates. Galzitskaya &Finkelstein, and Alm & Baker considered inter-actions between ordered segments, and weresuccessful at predicting the distribution of struc-ture in the folding TS for a limited number ofproteins. In contrast to these models, work byPortman et al. introduces the use of a local orderparameter to bypass the requirement that indi-vidual residues be either completely ordered ordisordered.4 – 6 Studies by Clementi et al., and morerecently by Koga & Takada continue the theme ofnative topology-based models, but relax themultiple sequence approximation by performingoff-lattice simulations using a simplified represen-tation of the protein chain.7,8

The correlation between folding rates and thesimple topological measure, average contact order,suggests that such models may be sufficient toexplain folding rates.9 However, recent experi-mental data indicate that proteins with similar

0022-2836/02/$ - see front matter q 2002 Elsevier Science Ltd. All rights reserved

E.A. & A.V.M. contributed equally to this work.

E-mail address of the corresponding author:[email protected]

Abbreviations used: TS, transition state; TST, transitionstate theory.

doi:10.1016/S0022-2836(02)00706-4 available online at http://www.idealibrary.com onBw

J. Mol. Biol. (2002) 322, 463–476

Page 2: Simple Physical Models Connect Theory and …morozov/pdfs/Alm_Morozov_JMB...Simple Physical Models Connect Theory and Experiment in Protein Folding Kinetics Eric Alm1, Alexandre V

topologies, such as proteins G and L, or the src andspectrin SH3 domains and Sso7d, can fold viadifferent mechanisms, suggesting that sometopologies allow multiple, nearly isoenergeticfolding pathways, particularly when there is asymmetry in the native-state structure.10 – 12 Proteindesign studies that successfully switch the native-folding pathway to an alternative-folding pathwayin proteins L and G variants by strengthening orweakening specific intramolecular interactions,while not substantially changing the folding rate,further support this hypothesis.13,14 These resultspoint to the need for a more accurate free energyfunction to characterize folding in some proteins,and motivate the addition of terms that reflecthydrogen bonding and backbone torsion strain toour simple model. Guerois & Serrano recentlyused such an energy function to predict foldingmechanisms successfully within the SH3 family.15

Simple theoretical models can be used to test ourunderstanding of the folding process. Before anyconclusions can be drawn, however, the validityof the model must be checked by comparison withexperimental data. In principle, the results of anyexperimental measurement of folding kinetics orthermodynamics can be derived from a completemodel of the folding landscape. In practice, thefolding of many proteins has been characterizedin terms of the folding rate of the naturally occur-ring protein, and by using site-directed muta-

genesis (f-value analysis) to probe the distri-bution of structure in the folding TS.16 If theresults from model calculations and experimentalmeasurements are in agreement, then our basicunderstanding of the processes that underlie pro-tein folding is validated. Additionally, there areseveral ways that these simple theoretical modelscan be used to provide information inaccessible bystandard experimental techniques. In particular,unstable excited states such as the TS can bedirectly characterized, the energy landscape canbe perturbed, and alternative folding pathwayshigher in free energy than the experimentallyobserved pathway can be identified.

We extend our previous model of protein-folding mechanisms by (1) investigating theaddition of hydrogen bonding and backbonetorsional strain terms to the free energy function,(2) computing folding rates and comparing theseto experimental data for 37 proteins, (3) computingf-values and comparing these to experimentaldata for 19 single domain proteins, (4) makingavailable predictions of folding rates and TS struc-tures for a comprehensive set of small proteindomains of known structure, and (5) comparingthe transition state theory (TST) approximation forfolding rates with a more rigorous approach basedon the kinetic master equations, and estimatingthe Arrhenius prefactor for the folding of smallproteins.

Basic assumptions

As in previous models,1 –3 the folding freeenergy landscape is simulated by enumerating allconfigurations available to a protein chain. A con-figuration is uniquely defined by the state of allresidues in the protein, where each residue istaken to be in one of two states: ordered as inthe folded structure, or completely disordered.Ordered residues are required to occur in one ortwo contiguous groups in the linear proteinsequence. Protein configurations are consideredto be linked kinetically if they differ with respectto the state of exactly one residue.

In the basic free energy function, attractive inter-actions are taken to be proportional to the surfacearea buried by the ordered regions, and wherenoted, the energy function includes terms thatreflect hydrogen bond strength and backbone tor-sion strain as described in Methods. Energeticallyfavorable attractive interactions are offset by theentropic cost of ordering residues and of closingloops between ordered regions.

Results and Discussions

Comparing predicted and observedfolding mechanisms

We compare model calculations of protein-folding pathways with experimentally observed

Figure 1. Statistical test of predicted f-values. Toassess the accuracy of f-value predictions, predictedvalues were compared to experimentally measuredvalues and to 1000 random permutations of themeasured values as described in the text. Bars show thepercent of randomly permuted values that do not corre-late as well with predictions as the measured values(lower correlation coefficient). Predictions were madeusing the basic free energy function, except for starredproteins, for which the full free energy function inclu-ding hydrogen bonding and torsion strain terms wasused. In the following Figures, protein names are abbre-viated as follows: acylP, muscle acylphosphatase; proC,procarboxypeptidase; S6, ribosomal protein S6; CI2,chymotrypsin inhibitor 2; FKBP, FK501-binding protein;lmb, lambda repressor; suc1, the protein product of thecell cycle gene p13suc1; villin, the N-terminal domain ofthe villin headpiece; CheY, bacterial chemotactic proteinCheY; bar, barnase; ten, tenascin; fib, the tenth type IIIdomain repeat of fibronectin; U1A, U1A spliceosomalprotein; G, protein G; L, protein L; src, the src SH3domain; spec, the a-spectrin SH3 domain; sso, the Sso7dSH3 domain.

464 Simple Models in Protein Folding Kinetics

Page 3: Simple Physical Models Connect Theory and …morozov/pdfs/Alm_Morozov_JMB...Simple Physical Models Connect Theory and Experiment in Protein Folding Kinetics Eric Alm1, Alexandre V

f-values by generating a simulated TS ensemblethat consists of the 100 lowest free energy TSconfigurations on the model landscape (each con-figuration in the ensemble is weighted accordingto its free energy). A computed f-value for agiven residue is taken to be the frequency withwhich that residue is ordered in the TS ensemble.An implicit assumption is that each residue isordered because of its favorable interactions withother ordered residues.

To test the statistical significance of the modelf-value predictions, the correlation of predictionsto experimental data was compared to thecorrelation of predictions to randomized data inthe following manner: 1000 decoy data sets weregenerated for each protein by randomly permutingthe measured f-values. The sequence position foreach measurement was kept constant while thevalues were permuted to insure that results werenot biased by the choice of experimentally probedsites. Next, the linear correlation coefficient, r,between the predicted values and the 1001 totaldata sets was assessed, and the rank of the corre-lation to the actual data set was computed as apercentile as shown in Figure 1. For over half ofthe proteins examined, the correlation to the actualdata was above the 99th percentile.

Experimentally characterized TSs can be dividedinto three categories based on the distribution off-values observed: (1) polarized TSs in smallerproteins, (2) compact subdomains within largerproteins, and (3) diffuse TSs in which the observedstructure is not as well-formed as in the nativestate and is distributed across much of the protein.

As shown in Figure 2, calculations for proteins inthe first two categories seem to be more reliablethan those for proteins in the third.

Testing the free energy function

To test the value of adding additional termsto our model, calculations were performed usingseveral variants of our free energy function: (1) thebasic function including only the entropic and sur-face area burial based terms; (2) the basic functionplus a term reflecting backbone torsion strain(non-glycine residues with positive f torsionangles penalized þ0.5 kcal/mol upon ordering;1 cal ¼ 4.184 J); (3) the basic function plus a termreflecting hydrogen bonding (backbone–backboneand side-chain–backbone hydrogen bondsbetween ordered residues assigned a free energybetween 0 and 20.5 kcal/mol each); (4) the basicfunction plus both the backbone torsion strain andhydrogen bonding terms. As shown in Figure 2,the additional terms did not greatly affect theaccuracy of f-value predictions for most proteins.For some proteins, however, the additional termsdid make a difference as described in detail infollowing sections. In particular, for proteins Land G, differences in hydrogen bond strength andbackbone torsion strain play a crucial role in deter-mining the folding pathway in model simulations.Additional terms were tested, such as amino acidand backbone torsion angle dependent freeenergies for ordering residues, but did not improveresults enough to justify the added number of freeparameters. Model calculations that allowed for

Figure 2. Comparison of four freeenergy functions. Free energy func-tions were compared based on theaccuracy of f-value predictionstaken from calculations in whicheach was used. Bars indicate linearcorrelation coefficient, r, betweenmodel predictions and measuredvalues. Four sets of predictions areshown: basic model using surfacearea and entropy terms only,basic þ backbone torsion strain,basic þ hydrogen bonding, basic þbackbone torsion strain and hydro-gen bonding terms.

Simple Models in Protein Folding Kinetics 465

Page 4: Simple Physical Models Connect Theory and …morozov/pdfs/Alm_Morozov_JMB...Simple Physical Models Connect Theory and Experiment in Protein Folding Kinetics Eric Alm1, Alexandre V

Figure 3. Predicted versus observed f-values. f-Value predictions were computed for 19 proteins as described in thetext. f-Value predictions are shown (left column) next to experimentally measured f-values (right column). f-Valuesfor each residue are displayed on the protein using a gradient from blue (f ¼ 0) to red (f ¼ 1), white regions indicateresidues without experimental data. (a) Proteins L and G, and redesigned proteins L and G (labeled nuL and nuG), notethat since comprehensive f-value analysis has not been carried out on the two redesigned proteins, only predictionsare shown. (b) SH3 fold: src, spectrin, and Sso7d. (c) Large proteins with compact subdomains: CheY and barnase.(d) IgG fold: titin, tenascin, and fibronectin domain 10. (e) Ferredoxin fold: acylphosphatase, procarboxypeptidase,

A

G

AA LAA

nuG

A

nuL

D

titin

DD tenascinDD

fibronectin

D

B

src

BB spectrinBB

sso

B E

acylP

EEproC

EEU1A

EE

S6

E

C

barnase

CCCheY

C F

lambda

FFFKBP

FFCI-2

FF

suc1

F

G

U1A

GGGGG

HHHvillin

466 Simple Models in Protein Folding Kinetics

Page 5: Simple Physical Models Connect Theory and …morozov/pdfs/Alm_Morozov_JMB...Simple Physical Models Connect Theory and Experiment in Protein Folding Kinetics Eric Alm1, Alexandre V

three contiguous groups of ordered residuesrequired that 2–5 residues be linked togetherkinetically in order to reduce the computationaldifficulty, and as a result, did not perform as wellat reproducing f-values and folding rates (resultsnot shown).

In the discussion and Figures that follow,f-values were computed using the basic freeenergy function except where indicated.

Small proteins with polarized transition states

Proteins L and G

Proteins L and G provide a good example ofhow model calculations can be used to interpretexperimental results. These two proteins have avery similar topology as shown in Figure 3(a).Moreover, the symmetry of their native-statetopology suggests that for every folding pathway,there is a complementary pathway, which beginsordering residues at the opposite end of the chain,and indeed the two proteins have complementarydistributions of structure in their folding transitionstate ensembles. As might be expected, modelcalculations based on surface area burial andentropic considerations alone (the basic free energyfunction) do not accurately reproduce experi-mentally observed TS structure. Adding thehydrogen bond and torsion strain terms to the freeenergy function seems to capture the experi-mentally observed TS structure, prompting us toinvestigate in greater detail the factors that lead tothis apparent symmetry breaking on the modelfolding landscape.

f-Value analysis of protein L indicates that theN-terminal hairpin is formed, and the C-terminalhairpin remains disordered in the folding TS.11 Asshown in Figure 3(a), model f-value calculationsreproduce this asymmetry. In model calculations,the symmetry breaking is due to favorable side-chain–main-chain hydrogen bonding in theN-terminal hairpin and two non-glycine residueswith unfavorable positive f-angles in the C-termi-nal b-turn. In the case of protein G, the C-terminalhairpin rather than the N-terminal hairpin isordered in the TS,12 and has been shown to bestable in isolation.17 In model calculations, thedifferences in hairpin stability and the asymmetricf-value distribution result from extensive hydro-gen bonding in the C-terminal hairpin and agreater total amount of surface area burial. Thus,for these two proteins, differences in hydrogenbond strength and backbone torsion strain, whichaccount for a very small part of the total free

energy function, are sufficient to change the distri-bution of structure at the rate-limiting step.

As an experimental test of the hypothesis thatchanges in the relative stabilities of local structuralelements can change the folding pathway, compu-tational protein design methods have been used tostabilize the first b-hairpin in protein G and thesecond hairpin in protein L. The folding TSs ofthe two redesigned proteins were found to bereversed: in contrast to their wild-type counter-parts, the first hairpin is ordered in the TS of theredesigned protein G and the second hairpin islargely ordered in the TS of the redesigned proteinL.13,18 As shown in Figure 3(a), model calculationson crystal structures of these mutants using thefull free energy function seem to capture theenergetic differences between the mutant andwild-type proteins, and predict that the comple-mentary pathways should be observed for thesemutants. In the redesigned protein L, the non-glycine residues with positive f torsion angleswere replaced by a canonical type I0 turn, loweringthe free energy of the second turn. For theredesigned protein G, hydrophobic packing isoptimized in the N-terminal hairpin and a keyhydrogen bond is deleted in the C-terminalhairpin.

SH3 fold

The SH3 fold is another case for which the modelreproduces the observed TS structure. The foldconsists of two opposing three-stranded b-sheets.In the src and spectrin SH3 domains, the localthree stranded sheet comprised of the well-packeddistal loop and the n-src loop is found to beordered at the folding TS, while the sheet com-prised of the RT loop and C-terminal strandis mostly disordered.10,19 In model calculations(Figure 3(b)), this asymmetric distribution ofstructure in the TS can be attributed in part tomore contacts in the compact distal loop hairpincompared to the relatively disordered RT loop, butis likely related to topological considerations aswell. The three stranded sheet that includes thedistal loop consists of three local b-strand pairs,while the opposing sheet contains a b-strand pairseparated in sequence by nearly the length of theprotein. When the model is used to simulatekinetics for a circularly permuted variant of thespectrin SH3 domain in which the N and C terminiare joined, and the chain is cleaved between thetwo strands of the distal loop, this region is nolonger predicted to be ordered at the rate limitingstep, although the strength of the interactions inthis region are unchanged.10 Consistent with this

and U1A. (f) l-repressor, FKBP, CI-2, and suc1. (g) U1A TS placement at different stabilities: left column showspredicted f-values at 0 kcal/mol (top), þ3 kcal/mol unstable (middle), þ5 kcal/mol (bottom); right column showsexperimental f-values at mf/meq values of: 0.5 (top), 0.7 (middle), and 0.85 (bottom). (h) f-Value predictions for thevillin headpiece at two stabilities: 0 kcal/mol (top), þ5 kcal/mol (bottom); right column shows experimental values.

Simple Models in Protein Folding Kinetics 467

Page 6: Simple Physical Models Connect Theory and …morozov/pdfs/Alm_Morozov_JMB...Simple Physical Models Connect Theory and Experiment in Protein Folding Kinetics Eric Alm1, Alexandre V

result, experimental f-value analysis of thiscircular permutant indicates a different distri-bution of structure at the TS.20

The Sso7d domain differs notably from the otherSH3 domains considered in that it contains onlyone three-stranded b-sheet, as the C-terminalregion is helical. Although the distal loop b-strandsare similar in structure to those of the otherSH3 domains, the turn region is presumably lessfavorable due to the presence of five glycineresidues. Both experimentally and in the modelcalculations, the C-terminal helix and nearbyregions of the n-src loop are found to be orderedin the TS. Similar results on these SH3 folds wereobtained by Guerois & Serrano who found hydro-gen bonding and torsion angle dependent termsnecessary to reproduce the experimental results,15

and by Koga & Takada.8

Large proteins with compact subdomains

CheY and barnase

For CheY and barnase, results from the foldingmodel point to a mechanism that is derived fromfirst-order “topological” features. In fact, modelcalculations including only the topologically basedterms, the surface area burial and entropy, performslightly better than those using additional termsat reproducing the experimentally observed TSstructure. The folding model identifies the compactsubdomain at the CheY N-terminus and thecompact core made by packing the C-terminalb-sheet against the N-terminal helix in barnase asoptimal solutions that maximize surface area burialwhile ordering the fewest number of residues(Figure 3(c)). Consistent with these results, theN-terminal domain of CheY has been shown to bestructured in the folding TS,21 and for barnase, theC-terminal b-sheet and the N-terminal a-helixaccount for most of the structure observed in thefolding TS.22

Immunoglobulin fold

The three immunoglobulin proteins studiedhave two minicore regions that correspond to thestrands on the left and right half of the proteinsshown in Figure 3(d), either of which could beused to nucleate the folding process. From a purelytopological perspective, local interactions make thestrands on the left more favorable as a foldingnucleus, while the strands on the right form aminicore that is stabilized by interactions thatspan most of the length of the linear proteinsequence. Model predictions for all three proteinssuggest that the local core should comprise thefolding nucleus; experimental data confirm thisprediction for two of the three proteins (titin andtenascin), but also suggest that, for fibronectin,interactions in the non-local core are sufficientlystrong to shift the folding nucleus to this alterna-tive site.23 – 25

Proteins with diffuse transition states

Of the proteins studied by f-value analysis todate, about half display a diffuse pattern of largelyintermediate f-values distributed across most ofthe protein. Model calculations reproduce thesedistributions in some cases, but are generally lessaccurate for proteins in this class compared tothose described above. This may reflect theinherent limitation of the model that residuesare either completely ordered or disordered, notpartially ordered.

Ferredoxin-like fold

Two interwoven bab motifs form the scaffoldfor the ferredoxin-like fold, and as a result, theprotein core includes mostly non-local interactionsbetween both helices and all four strands. Com-pared to the immunoglobulin fold, this fold doesnot appear to have a clear topological bias towarda specific nucleus. However, for one protein inthis set, U1A, Ternstrom et al. report a polarizedTS centered around a locally stabilized minicore.26

In particular, the asymmetry in the helix placementin U1A results in a very local minicore includingthe first helix and the adjacent beta-hairpin.Experimentally, these differences are realized inthe folding TSs for the four proteins: acylphospha-tase, procarboxypeptidase and ribosomal proteinS6 have diffuse TSs with intermediate f-valuesspread across most of the length of the proteins;27–29

for the asymmetric U1A, the TS is polarized, andhigh f-values are clustered in the local minicoresurrounding the first helix. For this reason, wehave included U1A in the class of large proteinswith compact subdomains in Figure 2.26 Modelcalculations on this set succeed at picking up onlow resolution topological biases in the case ofU1A, but do not perform as well on the othermore symmetric proteins which may include morenon-local interactions as well as multiple foldingpathways (Figure 3(e)). Interestingly, Koga &Takada recently reported similar results on thisgroup of proteins using an off-lattice model,8

suggesting that these results reflect an inherentlimitation of Go-type models rather than an artifactof our particular approach.

CI-2, FKBP12, l-repressor, Suc1

The remaining proteins have TSs that can bedescribed as delocalized (Figure 3(f)). CI-2 is anexample of a protein with a low average f-value,and the residues ordered in the TS structure arespread across part of the a-helix and some residueson the nearby b-strands.30 This may indicatethat multiple pathways are accessible for this pro-tein, or a partial ordering of many residues in anucleation–condensation type reaction.31 Modelcalculations for CI-2 identify the a-helix andnearby b-strands as a compact local minicore, butoverestimate the total amount of structure in the

468 Simple Models in Protein Folding Kinetics

Page 7: Simple Physical Models Connect Theory and …morozov/pdfs/Alm_Morozov_JMB...Simple Physical Models Connect Theory and Experiment in Protein Folding Kinetics Eric Alm1, Alexandre V

TS. The FKBP12 TS has been compared to that ofCI-2, and is mostly delocalized with low f-valuesspread throughout the large b-sheet.32 Modelcalculations, however, predict high f-values in asmall somewhat compact region of the protein,failing to reproduce the pattern of low and deloca-lized f-values, suggesting that there may bemultiple folding pathways available to this proteinwhile only one is identified by model calculations.For Suc1, much of the observed structure in the TSis found in the b sheet, particularly the centerstrands.33 Strands two and four make extensivecontacts including a network of electrostaticinteractions, but strand two is not predicted tobe structured in model calculations. The lack ofstructure in helix 1, despite the strong helicalpropensity of its amino acids, observed bothexperimentally and in model calculations may bedue in part to the relatively small surface areaburied in this region of the native structure asevaluated by our energy function. In the case ofl-repressor, the model correctly identifies thehelices determined by experiment to be importantfor stability at the rate-limiting step, and thesehelices form a well-packed interface that buries asubstantial amount of surface area.34

Limitations of the model

There are three key limitations of the simplemodel that taken together may account for the dis-agreement between calculated and experimentallymeasured f-values observed for some proteins:(1) the free energy function used is of limitedaccuracy; (2) configurations are taken to consist ofresidues that are either fully ordered or disordered,although actual protein conformations may includepartially ordered structure; (3) only native inter-actions are considered. Figure 2 shows how thecalculated TS for protein G and protein L can besensitive to relatively small changes in the freeenergy function. Since changes as small as 1 kcal/mol in conformations near the rate-limiting stepcan dramatically change the amount of fluxthrough different pathways, a good free energyfunction should be able to predict protein stabilityto within 1 kcal/mol, well beyond the reach ofcurrently available potential functions. Proteinswith diffuse TSs illustrate the second limitation ofthe model: experimentally much of the proteinchain is observed to be partially ordered at therate limiting step of folding for this group, but themodel allows only for residues to be fully orderedor fully disordered. The ACBP folding TS (notshown), which includes many non-nativeinteractions,35 highlights the third limitation: non-native interactions cannot be accounted for by amodel which considers only native structure.The assumption that only native interactions con-tribute can also be complicated by relatively smallshifts that could occur late in folding, for example,an a-helix could shift only slightly relative to an

adjacent b-sheet, but dramatically change thedistribution of contacting residues.

Transition state placement

The Hammond postulate from organic chemistrysuggests that lowering the free energy differencebetween the TS and folded state on the foldingpathway should increase their structuralsimilarity.31 Applying this postulate to modelcalculations, the TS should have a higher value ofNf, the number of residues ordered, as theDDGfolding is lowered. Experimentally, the TS shouldhave a higher mf/meq value, which measures thefraction of surface area buried at the rate-limitingstep. The m-values are denaturant dependenciesfor the folding rate constant (mf) and the equi-librium constant (meq). The folding pathway forU1A has been characterized extensively at differentvalues of mf/meq.26 For comparison, the DDGfolding ofmodel calculations was adjusted by scaling thesurface area burial term in our free energyfunction. Figure 3(g) shows a series of f-valuepredictions using the full free energy function(including hydrogen bonding and torsion strainterms) in which the DDGfolding is scaled to: 0, þ3,and þ5 kcal/mol; experimental measurementsthat correspond to mf/meq values of: 0.5, 0.7, and0.85 are shown for comparison.26 Interestingly,calculations using our basic free energy function(data not shown) resulted in a sharp transitionbetween a TS that is mostly disordered at lowDDGfolding to one that is mostly ordered at highDDGfolding, while calculations with the full freeenergy function (shown in Figure 3(g)) producedpartially ordered TS configurations at intermediatestabilities. One explanation for these results is thatthe full free energy function maps out a morecomplicated free energy landscape with manylocal minima that have the potential to becomesaddle points as the overall stability is changed.The calculations shown reproduce the order inwhich structure is formed as the rate-limiting stepis pushed toward the fully ordered state. Modelcalculations on other proteins suggest that similarchanges in TS structure accompanying changes inprotein stability are quite common, and for theSso7d SH3 domain, the predicted TS can changequalitatively as the stability is changed; zerostability TSs (such as the one depicted in Figure3(b)) for the Sso7d SH3 domain consist of ordermainly in the n-src loop and the C-terminal helix,while earlier TSs (for stabilized landscapes thatfavor folded configurations) are structured mainlyin the N-terminal RT loop (not shown). Thus, thequality of f-value predictions may be dependenton accurate placement of the TS along the reactioncoordinate. In some cases, the model may be ableto determine the relative free energy differencebetween configurations at the same position of thereaction coordinate, Nf, but not between configu-rations at different values of Nf using the defaultscaling factor for attractive interactions, g: for

Simple Models in Protein Folding Kinetics 469

Page 8: Simple Physical Models Connect Theory and …morozov/pdfs/Alm_Morozov_JMB...Simple Physical Models Connect Theory and Experiment in Protein Folding Kinetics Eric Alm1, Alexandre V

example, experimental results for the villin head-piece correlate significantly better with predictionsfor a þ5 kcal/mol destabilized protein (r ¼ 0.59)than with predictions for a zero stability landscape(r ¼ 0.24) as shown in Figure 3(h).36

Characterizing the transition state ensemble

An approximation commonly employed in thestudy of protein-folding landscapes is that ofthermodynamic equilibrium across the landscape.The master equation approach37 – 39 recentlyemployed by Cieplak et al. and Finkelstein et al. isa more rigorous treatment which includes fluxover suboptimal-folding pathways, as well asdeviations from equilibrium concentrations nearthe TS barrier.2,40 We pursue this approach furtherand directly compare the concentration and fluxthrough excited states with those predicted from asimpler model that assumes equilibrium acrossthe folding landscape. In addition, we test thebasic assumptions of the transition state theory(TST) by comparing folding rates calculated usingthe two methods. To reduce the number of statesconsidered by the computationally more expensivemaster equation approach, we introduce a hybridapproach in which only states in the vicinity ofthe folding barrier are treated explicitly. Configu-rations distant from the barrier are treated as athermally weighted source population (on theunfolded side) or a sink (on the folded side).Further, multiple residues are combined into links

that are ordered or disordered simultaneously.2

In all master equation calculations we employthe basic free energy function without hydrogenbonding or backbone torsion strain terms.

Figure 4 shows the relative population of allconfigurations on the folding landscape of thea-spectrin SH3 domain as a function of the numberof residues ordered at equilibrium and for thesteady-state case. There is markedly less orderingof residues in the N-terminal part of the proteinjust after the TS barrier (Nf . 30) in the steady-state (lower plot) compared to the equilibrium dis-tribution (upper plot) because there is little barriercrossing in this region (the lowest energy transitionstates are ordered in the C-terminal part of theprotein). Conformations with the N-terminal partof the protein ordered are lower in free energy atNf . 35, as evident in the equilibrium distribution,but are not kinetically accessible.

The master equation approach was used todetermine protein-folding rates, and Figure 5compares these rates to those calculated from asimpler TST rate expression (the Arrhenius ratelaw):

kf ¼ k0 exp 2DGTS

RT

� �ð1Þ

where k0 is the intrinsic kinetic rate (Arrheniusprefactor), DG TS is the free energy of the TS, R isthe gas constant, and T is the temperature. To facili-tate comparison, configurations higher in freeenergy than the lowest TS were excluded in both

Figure 4. Deviation of steady-state distribution from equilibriumdue to depletion near the foldingfree energy barrier. The frequencywith which a particular residue isordered in the ensemble of struc-tures with a given value of Nf (thenumber of residues ordered) isshown. Upper plot: equilibriumdistribution for the a-spectrin SH3domain, where each configurationis weighted by its Boltzmannfactor. Lower plot: steady-statedistribution, where each configu-ration is weighted according to itssteady-state concentration obtainedvia the master equation. The popu-lations on both graphs are shownwith the following color scheme(on a scale from 0 to 1): 0.01–0.35,hues of blue; 0.35–0.63, hues ofgreen; 0.63–0.9, hues of yellow;0.9–1.0, hues of light brown.

10 20 30 40 50

1030

50R

esid

ue p

ositi

on

Nf

SH3 domain

10 20 30 40 50

1030

50R

esid

ue p

ositi

on

Nf

SH3 domain

470 Simple Models in Protein Folding Kinetics

Page 9: Simple Physical Models Connect Theory and …morozov/pdfs/Alm_Morozov_JMB...Simple Physical Models Connect Theory and Experiment in Protein Folding Kinetics Eric Alm1, Alexandre V

calculations. We expect the TST to overestimatethe folding rates, since it neglects populationdepletion close to folding barriers. Nonetheless,Figure 5 shows an excellent agreement betweenmaster equation and TST rates (with correlationcoefficient r ¼ 0.98). Therefore, the simpler TSTrate expression is used for comparison across thelarger set of folding rates in the following section.

To probe the contribution of suboptimal pathsto folding dynamics, configurations above athreshold free energy cutoff over the lowest freeenergy TS were removed from the landscape(Figure 6). Exclusion of paths with free energymaxima more than 2 kcal/mol greater than thelowest free energy TS did not appreciably changethe rate of flux across the landscape, suggestingthat in the model, the dominant contribution tofolding dynamics comes from pathways within2 kcal/mol of the lowest free energy path.

Folding rates

Model calculations of TS free energy were com-pared with experimentally measured folding ratesfor a set of 39 proteins. The experimental rateswere taken at the midpoint of the denaturationcurves where proteins are 50% folded, to beconsistent with the model calculations in whichthe protein stability was set to zero. TS freeenergies were computed from the 100 lowest free

10 12 14 16 18

1012

1416

18E

ffect

ive

barr

ier

(kca

l/mol

)

TST barrier (kcal/mol)

Figure 5. Comparison of folding rates obtained frommaster equation approach to TST. For each protein inthe test set, the effective free energy barrier for thesteady-state approach (calculated as 2RT ln(kf/k0)) isplotted versus the free energy of the single lowest freeenergy TS. The free energy cutoff for the master equationis set to þ0.03 kcal/mol above the lowest free energy TS,such that only contributions from the lowest freeenergy path are included for comparison with the TSTprediction. The dashed line shown is y ¼ x. The linearcorrelation coefficient between the two methods isr ¼ 0.98. The TST barrier can be higher than the effectivemaster equation barrier because the master equationapproach allows flux from a single TS into multiplekinetically connected configurations, thus increasing thetotal flux.

Figure 6. Contribution of highenergy states to protein-foldingkinetics. The calculated effectivefree energy barriers (calculated as2RT ln(kf/k0)) for protein G andthe src SH3 domain model land-scapes are shown as a function ofthe threshold above which proteinconfigurations were excluded fromthe calculation.

Simple Models in Protein Folding Kinetics 471

Page 10: Simple Physical Models Connect Theory and …morozov/pdfs/Alm_Morozov_JMB...Simple Physical Models Connect Theory and Experiment in Protein Folding Kinetics Eric Alm1, Alexandre V

energy TSs, using:

DGTS ¼ 2RT logX

i[TS ensemble

exp 2DGi

RT

� �( )ð2Þ

where DGi is the energy of the ith TS in theensemble.

Using the simple free energy function withoutcontributions from hydrogen bonding or torsionalstrain the correlation between experimental andcomputed rates is 0.67 (Figure 7). To be consistentwith equation (1), the free energies DGi in equation(2) were scaled by a factor of 0.54 to make the slopeof the best fit line in Figure 7 equal to 21/RT. Thescale factor may be viewed as a correction for theoverestimation of the TS barriers, which occurs inour model since some residues are only partiallyordered and some interactions are only partiallyformed at the TS. The incorporation of partialstructure into such models has been addressedby Portman et al., and is an important area forfuture study.4 – 6 The correlation of the model TSfree energies with observed folding rates does notchange significantly using a free energy functionthat includes hydrogen bonding and torsion straincorrections (data not shown). f-value distributionsare likely to be more sensitive to such high resolu-tion details than are folding rates, because given aset of pathways with roughly equal free energybarriers, changes of 1–2 kcal/mol can have largeeffects on the level of flux through the differentpathways while having relatively little effect onfolding rates.

The prefactor in the Arrhenius rate law can beestimated by extrapolation to a zero free energybarrier. The result for k0 is approximately 105 s21,in agreement with earlier semi-empirical and

theoretical estimates.3,41 Since the Arrhenius pre-factor provides an upper bound for protein-foldingrates in the absence of a free energy barrier, it isinteresting to note that naturally occurring proteinsspan an almost entire range of in vitro folding rates,from the fastest physically allowed to the slowestbiologically relevant.

Conclusions

Previous studies by our group and others haveshown the applicability of simple models basedon native structure to modeling protein-foldingkinetics. We have further tested and extended thisclass of models by making predictions of foldingrates and f-values for a comprehensive set of pro-teins that have been experimentally characterized.

We find that experimentally probed TS struc-tures generally fall into three categories: (1) smallproteins with polarized TSs, (2) large proteinswith compact subdomains, and (3) proteins withdiffuse TSs. Our results suggest that an approachbased on native state topology performs better atreproducing f-values for proteins in the first twocategories than for the third, and that addingadditional terms such as hydrogen bonding andbackbone torsion strain does not greatly affectmodel accuracy for most proteins. This may resultfrom a fundamental limit of such models: twostructurally distinct pathways that differ in freeenergy by as little as 1–2 kcal/mol can producerates almost indistinguishable from thoseobtained from the lower energy pathway alone.This suggests that proteins with a single obser-vable folding mechanism might have alternative-folding pathways close in free energy that arenever observed experimentally. An ideal modelmust in principle be able to detect such smalldifferences, otherwise a nearly isoenergetic alterna-tive pathway can be mistaken for the lowest infree energy. The scatter in Figure 7, however, indi-cates that a simple free energy function based onthe fraction of native contacts formed is notsufficiently accurate to discriminate betweenalternative-folding pathways separated by only afew kcal/mol. Additional limitations include theabsence of non-native and partially formed inter-actions. In particular, partially formed interactionsmay be partly responsible for the poor per-formance of the model on proteins with diffuse TSstructure.

Despite these limitations, we find that such amodel can reproduce f-value distributions formany proteins, and that a simple model with noadjustable parameters predicts TS energies thatcorrelate reasonably well with experimentallymeasured folding rates. On the basis of theseresults, we use our model to make some generalobservations about protein-folding landscapes,which are not straightforward on the basis of theavailable experimental data alone. First, we notethat simply changing the strength of stabilizing

Figure 7. Comparison of calculated TS energy tomeasured folding rates. The x-axis shows the free energyof the TS ensemble computed from model calculationsusing the basic free energy function including onlysurface area burial and entropy terms. The free energyfunction is scaled by a factor of 0.54 such that the slopeof the best fit line is 21/RT. TS free energies are calcu-lated by taking the logarithm of the partition functionthat includes the 100 lowest free energy TSs. The linearcorrelation coefficient between predicted free energiesand log folding rates is r ¼ 0.67, with a p-value of 7 £ 106.

472 Simple Models in Protein Folding Kinetics

Page 11: Simple Physical Models Connect Theory and …morozov/pdfs/Alm_Morozov_JMB...Simple Physical Models Connect Theory and Experiment in Protein Folding Kinetics Eric Alm1, Alexandre V

interactions is enough to change the TS structure,particularly in proteins with symmetric native-state topologies such as proteins G and L. Quanti-tatively, we observe that most of the configurationsthat contribute to flux through the TS ensemble arewithin 2 kcal/mol of the lowest free energy TS con-figuration. This result is not trivial, since there are avery large number of higher energy configurationsthat could contribute because of their higher collec-tive entropy. Although we observe some depletionof states near the folding free energy barrierrelative to the populations expected at equilibrium,TST provides a very good approximation tokinetics on our model landscape, implying thatthe folding free energy barrier is large comparedto the ruggedness of the landscape, and is rela-tively symmetric. Finally, since our model allowsfor the direct calculation of the TS free energy, wecan estimate the magnitude of the Arrhenius pre-factor term to be about 105 s21, in agreement withearlier theoretical and semi-empirical estimates.3,41

Since our model was developed to account forthe data available to the authors, the key test of itsvalidity will be its ability to predict the outcomeof future protein-folding experiments. To facilitatesuch testing, we provide predictions† of foldingrates and mechanisms for a comprehensive set ofprotein domains of known structure under 100residues in length. The simple physical basis ofour model allows for predictions to be used tointerpret new experimental results when the twoare in agreement, and for other cases should helppoint out limitations of Go-type models, and areasfor further improvement.

Methods

Identifying transition-state configurations

A TS is defined as the highest energy state on thelowest energy path from the unfolded to the foldedstate. Additional transition states are defined as thehighest free energy states on the lowest free energypaths which do not include previously identified tran-sition states. Transition states are identified using analgorithm described previously.1

Free energy function

The full free energy function including both hydrogenbonding and backbone torsion strain terms is given by:

F ¼ 2gDSA 2 HðconfigÞ þX

r[residues

OrdðrÞFlocalðrÞ

þ 1:8RTXloops

lnðL

L0Þ ð3Þ

where

DSA ðsurface areaÞ ¼ SAfolded 2 SAunfolded

is the difference between folded and unfolded solvent-

accessible surface areas,

FlocalðrÞ ¼2:3 kcal=mol if f . 0 and not glycine;

0:8 kcal=mol otherwise

(

OrdðrÞ ¼1 if residue r is ordered

0 if residue r is disordered

(

Here, f is a backbone torsion angle. The first twoterms represent the interactions that contribute to proteinstability: surface area burial and hydrogen bonding. Theconstant, g, which controls the free energy associatedwith surface area burial, is fixed for each protein suchthat the stability of the folded state ensemble (definedas all configurations with at least half of their residuesordered) is equal to that of the unfolded state ensemble(all configurations with less than half of their residuesordered), which is approximately true for most proteins(the models described by Galzitskaya & Finkelstein, andMunoz & Eaton also scale native interactions to adjustprotein stability2,3). To compute buried surface area,ordered residues are modeled using their native stateatomic coordinates and the unfolded state is modeled asan extended chain. Disordered residues are not con-sidered in surface area calculations. Calculations are car-ried out using the method of LeGrand & Merz.42

Energies of backbone–backbone, side-chain–back-bone and side-chain–side-chain hydrogen bonds(H (config)) were determined using an empirical poten-tial function described elsewhere (T.K., A.M. & D.B.,unpublished results). Briefly, the potential requiresexplicit placement of polar hydrogen atoms, and isdependent on (a) the distance between the hydrogen(H) and the acceptor (A) atoms, (b) the angle at thehydrogen atom (D–H· · ·A) (D, donor atom), and (c) theangle at the acceptor atom (H· · ·A–AB) (AB, heavyatom bound to the acceptor atom). The distance depen-dence was described by a 10–12 potential with a mini-mum at a distance of 1.9 A between acceptor andhydrogen atoms. The angle-dependent terms of thehydrogen bonding potential were derived from hydro-gen bond geometries observed in high-resolution (2.0 Aor better) protein crystal structures. Only hydrogenbonds with proton positions given by the chemistry ofthe donor group were considered in the derivation ofthe potential. The observed angle-dependent proba-bilities for side-chain–side-chain hydrogen bonds in thedatabase of protein structures showed maxima at 1808(for the angle at the hydrogen atom) and 1208–1408 (forthe angle at a donor with sp2 hybridization, with aslightly sharper distribution around a maximum of 1208for a donor with sp3 hybridization), and were used toderive interaction energies for side-chain–side-chainand side-chain–backbone hydrogen bonds (backbone–backbone hydrogen bonds displayed a slightly dif-ferent geometry, presumably due to steric constraints).To apply the derived hydrogen bonding function tohydrogen bond scoring in the experimentally deter-mined structures used here, explicit hydrogen atomswere placed either according to the known geometryaround the donor group (for N, Q, R, and W), orby optimization of the hydrogen bonding network invol-ving all polar groups, using a scoring function whichincluded van der Waals interactions and solvation termsas well as the hydrogen bonding function describedabove. Finally, hydrogen bonds were scaled between 0and 20.5 kcal/mol. The hydrogen bonding term isabsent from all calculations using the basic free energyfunction.† http://www.bakerlab.org

Simple Models in Protein Folding Kinetics 473

Page 12: Simple Physical Models Connect Theory and …morozov/pdfs/Alm_Morozov_JMB...Simple Physical Models Connect Theory and Experiment in Protein Folding Kinetics Eric Alm1, Alexandre V

The third term represents the cost of ordering a singleresidue and is taken to be 1.8 kcal/mol as in ourpreviously described model.1 For calculations includingthe torsion strain penalty, 0.5 kcal/mol was added tothe cost of ordering non-glycine residues with positivef-angles.

The last term represents the entropic cost of closing aloop between two ordered segments, where L is thelength of the loop (L0 ¼ 0.15 A). This entropy was esti-mated from simulations of loop closure frequencies inpolypeptide chains,43 and has the same functional formas the Jacobsen–Stockmayer expression used in polymerchemistry.44

Free parameters

The free energy associated with surface area burialwas scaled for each protein such that the unfolded andfolded state were of equal stability, and the entropy costof ordering a single residue was taken to be 1.8 kcal/mol.1 The entropy loss associated with loop closure wastaken directly from off-lattice simulations.43 The strengthof hydrogen bonds and the penalty for non-glycineresidues with positive f-angles were chosen such thatthey improved the ability of the model to accuratelypredict f-values, and did not significantly affect TSenergies. For the calculation of folding rates, the basicfree energy function without hydrogen bonding or back-bone torsion strain terms was used, and the free energieswere multiplied by a factor of 0.54 such that modelpredictions could be directly compared with experi-mental measurements (observed folding rates indicatethat free energy barriers to folding span about 9 kcal/mol).

Master equation approach

In order to investigate deviations from a simple TSTand their effect on protein folding, we solve a system of

master equations given by:

dni

dt¼

Xj

ðkjinj 2 kijniÞ ð4Þ

where i denotes a single protein conformation, andMetropolis rules are used to describe an elementarykinetic step from i to j:

kij ¼ k0 £

0 if ij impossible

1 if Fi $ Fj

eðFi2FjÞ=RT if Fi , Fj

8>><>>:

Here, k0 is the intrinsic kinetic rate (crossing attemptfrequency).

In the case of steady-state protein diffusion across thefree energy landscape, dni/dt ¼ 0, and the system ofequations becomes algebraic. We introduce a hybridapproach in which we consider protein folding inmicroscopic detail close to the barrier while assumingBoltzmann distribution of states well before the barrier.This allows the treatment of the important features atthe barrier to be without approximation while retainingcomputational tractability. The approach is implementedthrough two boundary conditions: a source at thermalequilibrium, and an irreversible sink (Figure 8):

ni ¼ n0e2Fi=RT i [ equilibrium wall

ni ¼ 0 i [ absorbing wall

(

where n0 sets the total number of particles (proteins) inthe ensemble. The equilibrium wall is defined as the setof configurations directly connected to the source popu-lation at equilibrium, and the absorbing wall is definedas the set of configurations directly connected to theirreversible sink.

The folding rate is given by:

rate ¼flux into absorbing wall

n0

The nonequilibrium steady-state solution correspondsto creating a source of particles in the denatured welland a sink in the native well. By moving the equilibriumwall further away from the location of the TSs, we canrelax the assumption of equilibrium in the vicinity ofthe folding barriers. On the other hand, assuming equi-librium among low free energy conformations distantfrom the barriers helps reduce the computational com-plexity of the problem. Note also that by moving theabsorbing wall further away, we allow a particle torecross the barrier multiple times.

In order to make the system of algebraic equationswell-defined, we identified clusters of kineticallyconnected nodes on the free energy landscape. Onlyclusters stretching from the source to the sink can carryflux across; isolated groups of nodes can only equilibratewithin themselves. The number of nodes included in theflux-carrying cluster depends on the free energy cutoffimposed on every node; setting the cutoff lower thanthe lowest free energy TS makes all clusters dis-connected, and setting the cutoff higher includes effectsfrom multiple TSs.

Finally, we used a reduced landscape in which severalresidues are added or removed in a single kinetic stepwhenever the number of conformations was too greatfor the master equation approach to be feasible.2

F

Nf

Feff

SOURCE SINK

Equilibrium Wall Absorbing Wall

Figure 8. Schematic demonstration of the masterequation approach using a one-dimensional free energylandscape. Master equations are solved in the regionbetween the source (equilibrium wall) and the sink(absorbing wall). At the source, thermodynamic equi-librium is imposed; a particle (protein) is committed tothe final state once it reaches the sink.

474 Simple Models in Protein Folding Kinetics

Page 13: Simple Physical Models Connect Theory and …morozov/pdfs/Alm_Morozov_JMB...Simple Physical Models Connect Theory and Experiment in Protein Folding Kinetics Eric Alm1, Alexandre V

Folding rates

To investigate the relationship between predicted TSfree energies and experimentally measured foldingrates, a set of 37 small proteins was used. The data forthis set were taken from Grantcharova et al.,45 Jackson46

and from Schymkowitz et al.,33 and Guerois & Serrano,15

and measured at the midpoint transition where theprotein stability is zero. A complete list of the proteinsused and their corresponding PDB identifiers is includedin the following section.

PDB files

To build model landscapes for f-value and foldingrate predictions the following proteins and PDB fileswere used: cytochrome b562, 256b; barstar, 1a19; ACBP,2abd; tendamistat, 2ait; acylphosphatase, 1aps; pro-carboxypeptidase, 1aye; Sso7d SH3 domain, 1bf4;a-spectrin SH3 domain, 1bk2; CheY, 2chf; chymotrypsininhibitor 2, 2ci2; cspB cold-shock protein, 1csp; ribosomalprotein L9, 1div; FK506-binding protein, 1fkb; srcSH3 domain, 1fmk; fibronectin type III repeat 9, 1fnf;fibronectin type III repeat 10, 1fnf; HPR, 1hdn; CD2lymphocyte adhesion protein, 1hng; horse heart cyto-chrome c, 1hrc; colicin E9 immunity protein Im9, 1imq;l-repressor, 1lmb; cspA cold-shock protein, 1mjc;dihydrolipoamide acetyltransferase from pyruvatedehydrogenase, 2pde; protein G, 1pgx; protein GC-terminal beta-hairpin, 1pgx; PI3-kinase SH3 domain,1pks; ribosomal protein S6, 1ris; barnase, 1rnb; fyn SH3domain, 1shf; tenascin, 1ten; titin, 1tit; ubiquitin, 1ubq;U1A spliceosomal protein, 1urn; villin headpiece, 2vik;twitchin, 1wit. The coordinates for the suc1 monomerwere obtained from Jane Endicott. The coordinates forprotein L were obtained from Jason O’Neill prior torelease in the PDB as 1hz5.

Acknowledgments

This work was supported by the Howard HughesMedical Institute. T.K. was supported by post-doctoralfellowships of the European Molecular BiologyOrganization and the Human Frontier Science ProgramOrganization. E.A. was also supported by a MolecularBiophysics Training Grant from the NIH.

References

1. Alm, E. & Baker, D. (1999). Prediction of protein-folding mechanisms from free-energy landscapesderived from native structures. Proc. Natl Acad. Sci.USA, 96, 11305–11310.

2. Galzitskaya, O. & Finkelstein, A. (1999). A theoreticalsearch for folding/unfolding nuclei in three-dimen-sional protein structures. Proc. Natl Acad. Sci. USA,96, 11299–11304.

3. Munoz, V. & Eaton, W. (1999). A simple model forcalculating the kinetics of protein folding fromthree-dimensional structures. Proc. Natl Acad. Sci.USA, 96, 11311–11316.

4. Portman, J., Takada, S. & Wolynes, P. (1998).Variational theory for site resolved protein foldingfree energy surfaces. Phys. Rev. Lett. 81, 5237–5240.

5. Portman, J., Takada, S. & Wolynes, P. (2001). Micro-scopic theory of protein folding rates. I. Finestructure of the free energy profile and folding routesfrom a variational approach. J. Chem. Phys. 114,5069–5081.

6. Portman, J., Takada, S. & Wolynes, P. (2001). Micro-scopic theory of protein folding rates. II. Localreaction coordinates and chaindynamics. J. Chem.Phys. 114, 5082–5096.

7. Clementi, C., Nymeyer, H. & Onuchic, J. (2000).Topological and energetic factors: what determinesthe structural details of the transition state ensembleand en-route intermediates for protein folding? Aninvestigation for small globular proteins. J. Mol. Biol.298, 937–953.

8. Koga, N. & Takada, S. (2001). Roles of native top-ology and chain-length scaling in protein folding: asimulation study with a Go-like model. J. Mol. Biol.313, 171–180.

9. Plaxco, K., Simons, K. & Baker, D. (1998). Contactorder, transition state placement and the refoldingrates of single domain proteins. J. Mol. Biol. 277,985–994.

10. Grantcharova, V., Riddle, D., Santiago, J., Alm, E.,Tsai, J. & Baker, D. (1999). Experiment and theoryhighlight role of native state topology in SH3 folding.Nature Struct. Biol. 6, 1016–1024.

11. Kim, D., Fisher, C. & Baker, D. (2000). A breakdownof symmetry in the folding transition state of proteinL. J. Mol. Biol. 298, 971–984.

12. McCallister, E., Alm, E. & Baker, D. (2000). Criticalrole of beta-hairpin formation in protein g folding.Nature Struct. Biol. 7, 669–673.

13. Nauli, S., Kuhlman, B. & Baker, D. (2001). Computer-based redesign of a protein folding pathway. NatureStruct. Biol. 8, 602–605.

14. Kuhlman, B., O’Neill, J., Kim, D., Zhang, K. & Baker,D. (2001). Conversion of monomeric protein L to anobligate dimer by computational protein design.Proc. Natl Acad. Sci. USA, 98, 10687–10691.

15. Guerois, R. & Serrano, L. (2000). The SH3-foldfamily: experimental evidence and prediction ofvariations in the folding pathways. J. Mol. Biol. 304,967–982.

16. Matouschek, A., Kellis, J. T. J., Serrano, L. & Fersht,A. (1989). Mapping the transition-state and pathwayof protein folding by protein engineering. Nature,340, 122–126.

17. Blanco, F., Rivas, G. & Serrano, L. (1994). A shortlinear peptide that folds into a native stable beta-hairpin in aqueous solution. Nature Struct. Biol. 1,584–590.

18. Kuhlman, B., O’Neill, J., Kim, D., Zhang, K. & Baker,D. (2002). Accurate computer-based design of a newbackbone conformation in the second turn of proteinL. J. Mol. Biol. 315, 471–477.

19. Martinez, J., Pisabarro, M. & Serrano, L. (1998).Obligatory steps in protein folding and the confor-mational diversity of the transition state. NatureStruct. Biol. 5, 721–729.

20. Viguera, A., Blanco, F. & Serrano, L. (1995). The orderof secondary structure elements does not determinethe structure of a protein but does affect its foldingkinetics. J. Mol. Biol. 247, 670–681.

21. Lopez-Hernandez, E. & Serrano, L. (1996). Structureof the transition state for folding of the 129 aa proteinCheY resembles that of a smaller protein, CI-2. Fold.Des. 1, 43–55.

Simple Models in Protein Folding Kinetics 475

Page 14: Simple Physical Models Connect Theory and …morozov/pdfs/Alm_Morozov_JMB...Simple Physical Models Connect Theory and Experiment in Protein Folding Kinetics Eric Alm1, Alexandre V

22. Serrano, L., Matouschek, A. & Fersht, A. (1992). Thefolding of an enzyme. III. Structure of the transition-state for unfolding of barnase analysed by a proteinengineering procedure. J. Mol. Biol. 224, 805–818.

23. Fowler, S. & Clarke, J. (2001). Mapping the foldingpathway of an immunoglobulin domain: structuraldetail from phi value analysis and movement of thetransition state. Structure, 9, 355–366.

24. Cota, E., Steward, A., Fowler, S. & Clarke, J. (2001).The folding nucleus of a fibronectin type III domainis composed of core residues of the immunoglobu-lin-like fold. J. Mol. Biol. 305, 1185–1194.

25. Hamill, S., Steward, A. & Clarke, J. (2000). The fold-ing of an immunoglobulin-like greek key protein isdefined by a common-core nucleus and regionsconstrained by topology. J. Mol. Biol. 297, 165–178.

26. Ternstrom, T., Mayor, U., Akke, M. & Oliveberg, M.(1999). From snapshot to movie: phi analysis of pro-tein folding transition states taken one step further.Proc. Natl Acad. Sci. USA, 96, 14854–14859.

27. Chiti, F., Taddei, N., White, P. M., Bucciantini, M.,Magherini, F., Stefani, M. & Dobson, C. M. (1999).Mutational analysis of acylphosphatase suggests theimportance of topology and contact order in proteinfolding. Nature Struct. Biol. 6, 1005–1009.

28. Villegas, V., Martinez, J., Aviles, F. & Serrano, L.(1998). Structure of the transition state in the foldingprocess of human procarboxypeptidase A2 acti-vation domain. J. Mol. Biol. 283, 1027–1036.

29. Otzen, D. & Oliveberg, M. (1999). Salt-induceddetour through compact regions of the proteinfolding landscape. Proc. Natl Acad. Sci. USA, 96,11746–11751.

30. Itzhaki, L., Otzen, D. & Fersht, A. (1995). Thestructure of the transition-state for folding of chymo-trypsin inhibitor-2 analysed by protein engineeringmethods: evidence for a nucleation-condensationmechanism for protein folding. J. Mol. Biol. 254,260–288.

31. Matthews, J. & Fersht, A. (1995). Exploring theenergy surface of protein folding by structure-reactivity relationships and engineered proteins:observation of hammond behavior for the grossstructure of the transition-state and anti-hammondbehavior for structural elements for unfolding/folding of barnase. Biochemistry, 34, 6805–6814.

32. Fulton, K., Main, E., Daggett, V. & Jackson, S. E.(1999). Mapping the interactions present in the tran-

sition state for unfolding/folding of FKBP12. J. Mol.Biol. 291, 445–461.

33. Schymkowitz, J., Rousseau, F., Irvine, L. & Itzhaki, L.(2000). The folding pathway of the cell-cycle regu-latory protein p13suc1: clues for the mechanism ofdomain swapping. Struct. Fold. Des. 8, 89–100.

34. Burton, R., Huang, G., Daugherty, M., Calderone, T.& Oas, T. (1997). The energy landscape of a fast-folding protein mapped by Ala-Gly substitutions.Nature Struct. Biol. 4, 305–310.

35. Kragelund, B., Osmark, P., Neergaard, T., Schiodt, J.,Kristiansen, K., Knudsen, J. & Poulsen, F. M. (1999).The formation of a native-like structure containingeight conserved hydrophobic residues is rate limit-ing in two-state protein folding of ACBP. NatureStruct. Biol. 6, 594–601.

36. Choe, S., Matsudaira, P., Wagner, G. & Shakhnovich,E. (2000). Differential stabilization of two hydro-phobic cores in the transition state of the villin 14tfolding reaction. J. Mol. Biol. 304, 99–115.

37. Delbruck, M. (1940). Statistical fluctuations in auto-catalytic reactions. J. Chem. Phys. 8, 120–124.

38. van Kampen, N. G. (1981). Stochastic Processes inPhysics and Chemistry, North-Holland, New York.

39. Hanggi, P., Talkner, P. & Borkovec, M. (1990).Reaction-rate theory: fifty years after Kramers. Rev.Mod. Phys. 62, 251–341.

40. Cieplak, M., Henkel, M., Karbowski, J. & Banavar, J.(1998). Master equation approach to protein foldingand kinetic traps. Phys. Rev. Lett. 80, 3654–3657.

41. Eaton, W. (1999). Searching for downhill scenarios inprotein folding. Proc. Natl Acad. Sci. USA, 96,5897–5899.

42. Le Grand, S. & Merz, K. (1993). Rapid approximationto molecular surface area via the use of boolean logicand look-up tables. J. Comput. Chem. 14, 349–352.

43. Yi, Q., Scalley-Kim, M., Alm, E. & Baker, D. (2000).NMR characterization of residual structure in thedenatured state of protein L. J. Mol. Biol. 299,1341–1351.

44. Jacobson, H. & Stockmayer, W. (1950). Intramolecularreaction in polycondensations. I. The theory of linearsystems. J. Chem. Phys. 18, 1600–1606.

45. Grantcharova, V., Alm, E., Baker, D. & Horwich, A.(2001). Mechanisms of protein folding. Curr. Opin.Struct. Biol. 11, 70–82.

46. Jackson, S. (1998). How do small single-domainproteins fold? Fold. Des. 3, R81–R91.

Edited by C. R. Matthews

(Received 6 March 2002; received in revised form 2 July 2002; accepted 9 July 2002)

476 Simple Models in Protein Folding Kinetics