:m , pages 357 - 365 a simple method for predicting the · in these three sander, 1983b), which...

9
r~~ C I.ll 8 " 05 Vol.4. no.3. 1988 :M , Pages 357 - 365 A simple method for predicting the secondary structure of globular proteins: ... implications and accuracy O.Gascue/land J.L,Go/mard .. Abstract methods, the homology has an imprecise meaning. which is A hod . dfi d.. h nda t tu between 'close from an evolutionary point of view' and 'having met lSpresente or pre lCtlng t e seco ry s roc re " " " , if lob la 'h- h . . .d It ' b d similar ammo acid composition. The best reported result, of 0 g u r protelns), om t elr amzno aCl sequence. lS ase "" . , , . I l ' , if h 'I kno b. lo ' I these three methods, IS that of Levm et at. With a prediction on a ngorous statlStlca exp oltatlon 0 t e wet - wn lO glca . fi ha h ' .d ' , if h nda accuracy of - 62 % over three states for the 62 proteInS of the act t t t e ammo act COmpOSltlO1L5 0 eac seco ry Kb h dS d d b nk dIn: II' I I ' a sc an an er ata a . structure are lJJerent, I'l'e a so propose an evatuatlon process U ti 1 th' ,. , , , ,n ortunate y IS Important Improvementover prevIous that allows us to estlmate the capaClt). of a method to predict ethods. rnainl 'd th ' th Kab h and S d ' . m IS y ueto epresencem e sc an er the secondary structureof a newprotein WhiCh does not have da bank fh 1 d fu . naIl , 1 ed ' h , . ,ta ,0 omo ogous an nctlo y re at proteInS, suc any homologous protel1L5 whose structure LS already known. This h . (2GCH) d . (2PTN) h' h , ,. as c ymotrypsm an trypSIn w IC present a evaluatlon process shows that our met/lad has a predlCtlOn h 1 f42% Th 62% ' th ' th . omo ogy 0 0 . us 0 overestimates e accuracy WI accuracy of 58,7% over three states for the 62protel1L5 of the h' h th thods ed ', th d f V -'- h and Sand (1983 ) -1-. bank 11 . I . b tt than w IC ese me pr ICt e secon ary structure 0 a new ftuu$C er a UUta ,us reSUtt lS e er " . ha b ' d b h 'd I d '--.1- Li (19 '7 '4) proteIn which does not have any homologous proteInS whose t t 0 tame y t e most \\'1 e y use metrwus- m "" ,. , Chou and Fasman (1978) and Gamier et at. (1978)-and also struc~re IS alread~ kno:,n. However, this problem IS, partICU- han hat b ' db thod based loc I ho lo ' larly Important SInce m the case where the proteIn to be t t 0 laine, ya recent me on a mo gles , . ' Le ' I 1986 \ 0 d .. hod ' . [. and predicted has an homologue whose structure IS already known, " vm et a ., ~. ur pre lctlon met lS very Simp e , d ( d ' ) be d . dd ' tl b ' I d . t and Its secon ary an tertiary structure may enve Irec y may e Imp emente on any mlcroconlpu er even on d 1b al' . , , an accurate y y Igmng, programmable pocketcalculators. A simplePascallmplemen- In thi dary red , , , . . ,., s paper we propose a new secon structure p Ictlon tatlon of the method predlCtlOn algonthm lS given, The thod h.h all GGBSM (G 1 dG1 dB ' , ' if I . if . fi [d ' and me , w IC we c ascue an 0 mar aslc mterpretatlon 0 our resu ts m tenns 0 protein 0 mg S ' , cat M thod) GGBSM. 1 ,. al th"" hi h , , ' tatlStl e. IS a pure y statistic me UCl w c dlrectlons for further work are discussed. , 1 1 , th 11 kn b' 1 . cat ;:: th the ' ngorous y exp Olts e we - own 10 ogl lact at aminO lot od f acid compositions of each type of secondary structure are r uc Ion different. The term 'basic' both expresses the fact that this The enormous increase in our knowledgeof DNA sequences method is very simple, and also that it may be considered as has led to the increased use of protein secondary structure a startingpoint for further and more completeapproaches to prediction methods from the amino acid sequence, These the protein sequences, methods havebeendealt with in numerous researches during We also proposea new way to evaluate the prediction the last 15years.In 1983 some importantstandardization work methods which allowsus to estimate their ability to predict the wasdone by Kabsch andSander (1983a,b): the standardization secondarystructure of a protein which does not have a of the data by proposing objective secondary structure homologue whose structure is alreadyknown, This evaluation assignments from crystallographic data; choiceof a databank process shows that GGBSM hasan accuracy over three states containing 62 proteins < 50% homologous; the definition of of 58.7%. This resultis somewhat better than that obtained in a benchmark to compare methods;and the comparison of the the mostwidely used methods (- 50% for Chou and Fasman; three most widely used methods: Lim (1974),Chou andFasman - 56% for Lim and Garnier et al.) and it is also better than (1978) and Gamier et at. (1978), the result obtained by Levin et at. (57.5%), In 1986three new methodswere published(Sweet, 1986; In addition,GGBSM is well balanced: for every stateS, the Nishikawa and Ooi, 1986; Levin et ai, 1986), Thesethree number of residues predicted in thestate S is equal to the number methods are based on the same hypothesis: short homologous of residues observed in the state S. Our method doesnot have sequences of amino acidshave a similar secondary structure, the undesirable feature shownby other methods (Kabschand even if theycome from non-homologous proteins. In these three Sander,1983b),which over- or under-predictcertain states. Unite 194 de1'1NSERM, 91Bd.deI'Hopital, 75013 Paris, France Systemand methods I Present address: Centre de Redlerdle en Informatique de Montpellier, 860 roe . deSaint Priest, 34100 Montpellier, France All programs descrIbed below were developed on a VAX 785. 357 .. "'"r. ;)if" J/ii...

Upload: others

Post on 14-Mar-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: :M , Pages 357 - 365 A simple method for predicting the · In these three Sander, 1983b), which over- or under-predict certain states. Unite 194 de 1'1NSERM, 91 Bd. de I'Hopital,

r~~C I.ll 8 " 05 Vol.4. no.3. 1988

:M , Pages 357 - 365

A simple method for predicting thesecondary structure of globular proteins:... implications and accuracy

O.Gascue/l and J.L,Go/mard ..

Abstract methods, the homology has an imprecise meaning. which isA hod . dfi d.. h nda t tu between 'close from an evolutionary point of view' and 'havingmet lS presente or pre lCtlng t e seco ry s roc re " " " ,

if lob la 'h- h . . .d It ' b d similar ammo acid composition. The best reported result, of0 g u r protelns), om t elr amzno aCl sequence. lS ase "" .

, , . I l ' , if h ' I kno b . lo ' I these three methods, IS that of Levm et at. With a predictionon a ngorous statlStlca exp oltatlon 0 t e wet - wn lO glca .

fi ha h ' .d ' , if h nda accuracy of - 62 % over three states for the 62 proteInS of the

act t t t e ammo act COmpOSltlO1L5 0 eac seco ryK b h d S d d b nkdIn: II' I I ' a sc an an er ata a .

structure are lJJerent, I'l'e a so propose an evatuatlon process U ti 1 th' ,. ,, , ,n ortunate y IS Important Improvement over prevIous

that allows us to estlmate the capaClt). of a method to predict ethods. rnainl' d th '

th Kab h and S d' . m IS y ueto epresence m e sc an erthe secondary structure of a new protein WhiCh does not have da bank f h 1 d fu . naIl, 1 ed ' h, . ,ta ,0 omo ogous an nctlo y re at proteInS, suc

any homologous protel1L5 whose structure LS already known. This h . (2GCH) d . (2PTN) h ' h, ,. as c ymotrypsm an trypSIn w IC present a

evaluatlon process shows that our met/lad has a predlCtlOn h 1 f42% Th 62% ' th '

th. omo ogy 0 0 . us 0 overestimates e accuracy WIaccuracy of 58, 7% over three states for the 62 protel1L5 of the h' h th thods ed',th d fV -'- h and Sand (1983 ) -1-. bank 11 . I . b tt than w IC ese me pr ICt e secon ary structure 0 a new

ftuu$C er a UUta ,us reSUtt lS e er " .ha b ' d b h 'd I d '--.1- Li (19 '7 '4) proteIn which does not have any homologous proteInS whose

t t 0 tame y t e most \\'1 e y use metrwus- m "" ,. ,

Chou and Fasman (1978) and Gamier et at. (1978)-and also struc~re IS alread~ kno:,n. However, this problem IS, partICU-

han hat b ' d b thod based loc I ho lo ' larly Important SInce m the case where the proteIn to bet t 0 laine, ya recent me on a mo gles , .'Le ' I 1986\ 0 d .. hod ' . [. and predicted has an homologue whose structure IS already known," vm et a ., ~. ur pre lctlon met lS very Simp e , d ( d ' ) be d . d d ' tlb '

I d . t and Its secon ary an tertiary structure may enve Irec ymay e Imp emente on any mlcroconlpu er even on d 1 b al' .

, , an accurate y y Igmng,programmable pocket calculators. A simple Pascallmplemen- In thi dary red, ,

, . . ,., s paper we propose a new secon structure p Ictlontatlon of the method predlCtlOn algonthm lS given, The thod h. h all GGBSM(G 1 d G 1 d B ', ' if I . if . fi [d ' and me , w IC we c ascue an 0 mar aslcmterpretatlon 0 our resu ts m tenns 0 protein 0 mg S ' ,catM thod) GGBSM. 1 ,. al th""

hi h, , ' tatlStl e. IS a pure y statistic me UCl w cdlrectlons for further work are discussed. , 1 1 , th 11 kn b ' 1 .cat ;:: th the 'ngorous y exp Olts e we - own 10 ogl lact at aminO

lot od f acid compositions of each type of secondary structure arer uc Ion different. The term 'basic' both expresses the fact that this

The enormous increase in our knowledge of DNA sequences method is very simple, and also that it may be considered ashas led to the increased use of protein secondary structure a starting point for further and more complete approaches toprediction methods from the amino acid sequence, These the protein sequences,methods have been dealt with in numerous researches during We also propose a new way to evaluate the predictionthe last 15 years. In 1983 some important standardization work methods which allows us to estimate their ability to predict thewas done by Kabsch and Sander (1983a,b): the standardization secondary structure of a protein which does not have aof the data by proposing objective secondary structure homologue whose structure is already known, This evaluationassignments from crystallographic data; choice of a data bank process shows that GGBSM has an accuracy over three statescontaining 62 proteins < 50% homologous; the definition of of 58.7%. This result is somewhat better than that obtained ina benchmark to compare methods; and the comparison of the the most widely used methods (- 50% for Chou and Fasman;three most widely used methods: Lim (1974), Chou and Fasman - 56% for Lim and Garnier et al.) and it is also better than(1978) and Gamier et at. (1978), the result obtained by Levin et at. (57.5%),

In 1986 three new methods were published (Sweet, 1986; In addition, GGBSM is well balanced: for every state S, theNishikawa and Ooi, 1986; Levin et ai, 1986), These three number of residues predicted in the state S is equal to the numbermethods are based on the same hypothesis: short homologous of residues observed in the state S. Our method does not havesequences of amino acids have a similar secondary structure, the undesirable feature shown by other methods (Kabsch andeven if they come from non-homologous proteins. In these three Sander, 1983b), which over- or under-predict certain states.

Unite 194 de 1'1NSERM, 91 Bd. de I'Hopital, 75013 Paris, France System and methodsI Present address: Centre de Redlerdle en Informatique de Montpellier, 860 roe .de Saint Priest, 34100 Montpellier, France All programs descrIbed below were developed on a V AX 785.

357

.."'"r.

;)if"

J/ii...

Page 2: :M , Pages 357 - 365 A simple method for predicting the · In these three Sander, 1983b), which over- or under-predict certain states. Unite 194 de 1'1NSERM, 91 Bd. de I'Hopital,

..

O.Gascuel and J.L.Golmard

GGBSM itself was written in FORTRAN 77. The aligning Pi,S ~ : state Helix.program used to extract homologies from the Kabsch and 1,0 .. tat Ex! d dSander data bank and the Levin et al. (1976) prediction method . seen e .

were written in V AX LISP (2.1). A Pascal-Microsoft 0.8 C-term1nus

3.3-version of the GGBSM prediction algorithm (but not theprograms for calculating the parameters values) was im- 0.6

plemented on an IBM AT and was integrated in the biostationenvironment (Nanard and Nanard, 1985). A simplified version 0.4

of this latter program is given later. ReSidue to be0.2 predicted

Algorithms Relative position i.0

Mathematical model - 6 - 5 - 4 - 3 - 2 - 1 0 1 2 3 4 5 6 7 8 9 10 11

GGBSM is a 'local' and 'residue-by-residue' method. It Fig. 1. Importance of the relative positions: 0 state Helix, . state Extended.

determines which state is the most probable for each residueof the protein, on the basis of local considerations on the Preference . Helix fonners: E. A. M, Land Ksequence. In this study, as in most recent studies, only three 2 EI Hellxbreakers: G,P,T,S, andN.

states were considered: Helix (helix 3 or 4), Extended and Coil !nus(remaining residues or random coil).

The biological concept which underlies GGBSM is that theamino acid composition of each type of secondary structure is 1different. The problem is reversing this knowledge, in orderto express the secondary structure as a function of the localamino acid composition. Numerous mathematical transcriptionsare possible, and these form the basis of different methods(e.g. Chou and Fasman, 1978; or Garnier et al., 1978). 0 010 20 to 40 40 to 60 60 to 80 80 to 100

GGBSM proposes a new solution to this problem. It is based Percentage of the length.

on three sets of parameters whose values, calculated over the Fig. Z. Histogram of the preferences along helices of the five stronger helix62 proteins of the Kabsch and Sander data bank, are given in formers and of the five stronger helix breakers, calculated over the 62 proteinsthe Implementation section. of the Kabsch and Sander data bank.

The parameters Pi,S. Given R as the residue to be predicted, when moving from beginning to end and reaches its maximumthe parameters Pi,s indicate, for the state S, the importance of near the helix end, towards the C terminus. Briefly, propen-the position i relative to R. The Pi,s are independent of residue sity to helix is higher at the helix end than at the beginning.type. The intuitive idea is that the state of R is primarily When considering amino acids one at a time, distributiondetermined by R's own type, slightly less by the types of its histograms are not so simple as those in Figure 2: negativelytwo neighbors, again less by the types of the two following charged residues, prolines and glycines are mainly located atresidues on both sides, and so on. The Pi,s measure this effect. the start of the helix; positively charged and hydrophobicGiven a state S, the larger i is (in absolute value), the smaller residues are located at the end (Chou and Fasman, 1974).is Pi,s, The Po,s which indicate the importance of the central Finally, all these facts suggest that helix folding preferably startsposition, taken by R, are arbitrarily fixed to 1. Figure 1 near the helix end and mostly propagates towards therepresents the values of these parameters for the state Helix and N terminus.Extended, plotted against the relative position i. On the other hand, the curve of the parameters attached to

The curve of the parameters attached to the state Helix is the state Extended is symmetrical. The state Extended (versusclearly asymmetric. The state Helix (versus not Helix) of a not Extended) of a residue R is determined with the sameresidue is mostly determined by the type of amino acids located strength by residues located before and after R. This is the resultafter it, towards the C terminus. This may be explained by the of the amino acid distribution in extended regions which is quiteirregular distribution, along helices, of helix formers and regular and symmetrical. Another difference between Helix andbreakers (see Figure 2). At the start of the helix, towards the Extended curves is that the Extended curve decreases moreN terminus, the preferences (i.e. the ratio between the frequency quickly when moving away from R. This derives from thein the given region and the average frequency) of helix breakers average length of extended regions (-4.5 residues) which isand formers are close to 1. The gap between these two smaller than the average length of helices (- 9 residues).preferences, which is a measure of propensity to helix, increases The curve of the parameters attached to the state Coil appears

358

Page 3: :M , Pages 357 - 365 A simple method for predicting the · In these three Sander, 1983b), which over- or under-predict certain states. Unite 194 de 1'1NSERM, 91 Bd. de I'Hopital,

Protein secondary' structure pcMiction

between the Helix curve and the Extended curve (to simplify of residues of T-SET observed in the state S.matters, this curve has not been represented on Figure 1, but (ii) In practice, when trying to determine the secondaryvalues of the parameters are given later). The state Coil (versus structure of a new protein, one often has access to complemen-not Coil) of a residue is determined more by the type of residues tary information about the relative abundances of helix, extendedlocated after it, towards the C terminus, than by the type of and random coil, in this protein. This information may comeresidues located before it, towards the N terminus. However, from physical measurements such as circular dichroism orthis effect is weaker than that observed in the case of helices. RaD1an spectroscopy. It can be more hypothetical and derivedIt is due to the amino acid distribution in the random coil which from the use of methods for classifying proteins into structuralis complex and non-regular. The breadth of the Coil curve is classes, such as Klein and Delisi (1986) or Nakashima et at.in-between the breadth of the Helix curve and of the Extended (1986). Then, the values of the parameters Ns may be adjustedcurve. This is because the average length of random coil regions so that when applied to the protein under study, the GGBSM(=6.5 residues) is intermediate between the other two (see method exactly predicts the desired percentage of residues inabove). each of the three states. Practical outcomes of this interesting

feature of the method will be studied in further work.The parameters It,s. These measure the preference for the stateS of the amino acid of type t. We define the type of an amino Let us briefly explain the way in which these parametersacid either as being a given amino acid, e.g. a Phe or a Trp, are used (more details are given below). Given the sequenceor belonging to a given category of amino acids, e.g. a . .. Ala-Asp-Gly-Asp ... where the residue to predict R isnegatively charged amino acid, that will be expressed by being the first Asp (the Ala is in position -1, the first Asp is in pos-an Asp-Glu. Intuitively, Ipro.H measuring the preference of ition 0, the Gly is in position 1, the second Asp is in positionprolines for helices is low. On the other hand, IA/a,H measur- 2), the prediction is based on the comparison of three numbers,ing the preference of alanines for helices is high. attached to each of the three states: CBLF(R, Coil), CBLF(R,

In the GGBSM version presented here, 12 types of amino Helix), CBLF(R, Extended). Using these three numbers, theacids have been empirically retained: Ala, Cys, Asp-Glu, probability of each state is calculated. The chosen state is the

Phe-Trp-Tyr, Gly, His, lie-Val, Lys-Arg, Leu-Met, Asn-Gln, one with the highest CBLFand consequently the highest prob-Pro, Ser-Thr. Amino acids grouped in the same category have ability. The CBLF are calculated in the following way:common physico-chemical properties and/or have similarpreferences for each of the three states. CBLF(R, Coil) = NCoil' (. . . IAla,C.P -I.C + IAsp-Glu.c.po,c+ IG.:-c.PI,c

From a qualitative point of view, these parameters have a + IAsp-Glu,C.P2,C . . .)meaning close to those of the conformational parameters of CBLF(R, Helix) =NHelix' (. . . IAla,H.P -I,H + IAsp-Glu.H.Po,H + IG.).H.Pj,HChou and Fasman (1974). Nevertheless certain differences exist: + IAsP-Glu,H.P2,H. , .Je.g. IHis H which expresses (for GGBSM) the histidine'spreferen~ for the state Helix, is relatively high, while the and similarly for CBLF (R, Extended).corresponding conformational parameter is neutral (1.079 when GGBSM takes into account, with the use of the parameterscalculated on the 62 proteins of the Kabsch and Sander data Pi,s, that when predicting R, the R's type is most important andbank). These slight differences arise from the fact that these that when moving away from R, the type of residues becomestwo types of parameters are very different from a mathematical less and less important. The importance of the relative positionpoint of view. This aspect will clearly appear below. is considered as independent of the type of the residue which

takes this position. This is intermediate between the two methodsThe parameters Ns. Using the parameters NCoi/' NHe/ix and commonly used to extract information from sequences.N &tended, the method may be adjusted so that when applied to (i) The first of these methods relies on the calculation of ana given set of proteins, it exactly predicts a given percentage averaged amino acid composition on a window which slidesof residues in each of the three states. These parameters play along the sequence. The method of Hopp and Woods (1981)a similar role to the decision constants in the Garnier et at. for locating protein antigenic determinant uses this principle.(1978) model (GaR model). Only the relative values of these The method of Chou and Fasman (1978) is also partly basedparameters are important. Thus, we have fixed NCoi/ = 1. on this principle. Given the sequence above (. . . Ala-Asp-These parameters may be used in two different ways. Gly-Asp . . '), it is considered that in the window there is

(i) They may be used in order to make the method well one Ala, one Gly and two Asp; the calculation is based on sumsbalanced. Given T-SET, the set of proteins of known structure such as: . . . + W Ala,S + W GIy,S + 2.W Asp.S . . . where W;,staken into account by the method (in technical terms, T-SET expresses the amino acid t's preference for the state S. Relativeis called the training set), the values of the parameters Ns may positions of amino acids included in the window are not takenbe adjusted so that for each state S, the number of residues of into account. This model is less complete than the one usedT-SET predicted in the state S is exactly equal to the number by GGBSM and will give most often less good results.

359

Page 4: :M , Pages 357 - 365 A simple method for predicting the · In these three Sander, 1983b), which over- or under-predict certain states. Unite 194 de 1'1NSERM, 91 Bd. de I'Hopital,

O.Gascuel and J.L.Golmard

(ii) In the second of these two methods, the importance of (vii) The calculation of the probabilities P (R,S) is done usingthe position, for each type of amino acid, is taken into account. the following formulas:The calculation is based on sums such as: . . . + W Ala -15 + BLF(RS) = E p. * 1 * C (R i t), , \ , 1,5 t,S R.S' .W Asp,O,S + WG/y,1,S + W Asp,2,S + . . ., where Wt,i,S expresses (for i inside F and t in SO7)the importance for the state S of an amino acid type t in the R,Srelative position i. The GOR method is based on this principle. CBLF (R,S) = Ns * BLF (R,S)

The method of Nishikawa and Ooi (1980) for predicting the where BLF'means the bilinear form; where CBLF means thesurface-interior diagram of globular proteins is also based on corrected bilinear form and Pi,s, It,S and Ns are as definedthis principle. This implies a more complete model than that above; andused by GGBSM and should theoretically give better results, eCBl.F(R, S}under certain conditions which we will discuss below. P(R, S) = E CBl.F(R, X) for every X in SOS

First, there exists an important difference between GGBSM .. . e .. . .and both previous methods. They are based on coefficients v:hich IS a very classical way to turn functions mto a probability-(called W;,i,S above), which are assumed to be independent and like form. . . ... ... .independently evaluated. On the other hand, in GGBSM no such .At both te~m, It IS ~I~cult to d~stingulsh a loc~ aminoassumption is made and an appropriate statistical method (see acid composltl~n. and to Imtiate the wI~dows. In addition, thebelow) is used in order to learn globally the optimal values of sequence terInlm ar:e most often flexible and unstructured.the whole set of parameters. Thus, roughly speaking, GGBSM's Th~refore, when R IS one of the three firs! or of .the three la.stparameters are better estimated than GOR's or Nishikawa and residues of the monomer under study, R IS predicted as Coil.Ooi's coefficients. In all other cases the state with the largest probability is chosen.

We could question whether it .would be possible to learn Calculating the parameters Pi,s and It,s. The parameters Pi,sgl?b~ly the values .of the coefficients Wt,i,S. As observed by and It,s are calculated independently for each state S. TheNlshl~wa and 001 (1980), the answer w~uld probably be principle is as follows.negative because there are too many coefficients: 1020 (three (i) Given a state S the criterion to be minimized. .states x 20 amino acids x a window of length 17) for a ' IS.

relatively smaIl amount of data « 11 000 residues). E [BLF (R,S) - TRUE-STATE (R,S)] 2 for every residue R in

Another difference, at a practical level, is that our method the training set.requires few Parameters: 70 (32p. + 361 + 2N details are 1RUE-STATE (R,S) = 1 when the state of R is Sand

I,S t,S S, th .0given below), while a method such as GOR requires many more 0 erwlse .

coefficients: 1022 (1020W;,i,S + two decision constants). (ii) The functions called BLF are bilinear in relation to the

Cal la . h l two sets of parameters Pi S and It s.cu tmg t. e parameter va ues ...' ,

(111) If the parameter values of one of these two sets are fixed,Let us give some notations, definitions and precisions e.g. the Pi,s values, one may calculate by the least-squares

(i) The secondary structure is predicted independently for method (Kendall and Stuart, 1976) the optimal parameter valueseach monomer of the protein. of the other set, e.g. the It,s values.

(ii) SOS is the set of states. Here SOS = {Coil. Helix. (iv) The calculation of the parameters relies on an iterative

Extended}. use of the above property. The algorithm is as follows: fixing(iii) SOT is the set of types (see above). arbitrarily the Pi,s values and calculating the It,s; fixing the It,s(iv) FR,s is the window in which R is in relative position 0, to the values calculated previously and calculating the Pi,s

and which is taken into account when determining whether R fixing the Pi,s, to the values calculated previously andis or is not in state S. Here, for the state Coil the window is calculating the It,s, etc. until convergence (or similarly: fixing(-3, + 6) in size, for Helix (-6, + 11) and for Extended arbitrarily the It,s values and calculating the Pi,s; fixing the Pi,S(- 3, + 3). These sizes have been empirically determined so to the values calculated previously, etc.).that for each relative position i inside the window attached to (v) This algorithm converges rapidly: < 10 iterations arestate S, Pi,s has sufficient value: - ~ 0.1 (see above Figure necessary. Without additional constraints, it does not give a

1 and lrnplementation). These sizes have also been chosen so unique solution. This follows from the fact that BLF(R,S)that they optimize the method's accuracy. remains equal when multiplying all Pi,s by given constant

(v) C is a Boolean function receiving three parameters: (:#: 0) and dividing all I"s by the same constant. But, whena window FR,s; a relative position i; and an amino acid type fIXing Po,s = 1 (this constraint is coherent with the model, see

t (see above). C(FR,s. i,t) returns 1 when the amino acid in above), the algorithm converges towards a unique minimum.the relative position i in the window FR,s has type t and We have not formally demonstrated this, but in practice we haveotherwise O. verified that, after shifting the starting point, the same minimum

(vi) P (R,S) is the R's probability of being in state S. is reached.

360

Page 5: :M , Pages 357 - 365 A simple method for predicting the · In these three Sander, 1983b), which over- or under-predict certain states. Unite 194 de 1'1NSERM, 91 Bd. de I'Hopital,

Protein secondary structure prediction

Calculating the parameters N Slate. Once the Pi,s and II,s values there is clearly a training set. Thus, it applies to statistical

have been determinded, the N Helix and N Extended values (N Coil is methods and to pattem-recognition-like methods such as Levin

fixed to 1, see above) are calculated by minimizing the criterion: et al. (1986). However, it does not apply to expert-system-like

. 2 methods, such as Lim (1974), Chou and Pasman (1978) orE [Number ofpredlcted (S)~Nu~er of observed (S)] Cohen et al. (1983). These methods are based on sets of

on the three states: Coz/, HellX and Extended. empirical rules, given by the authors of the methods and

The minimization algorithm is the non-linear simplex one theoreticatty universal (i.e. not only valid for proteins whose

(NeIder and Mead, 1965). This algorithm is robust and appeared structure is known but also for all other proteins). This

perfectly suited to our problem. universality is clearly theoretical because the authors of the

methods were inevitably inspired by what they knew. This is

The evaluation process well demonstrated for Lim's method, which obtains much better

The evaluation process we propose estimates the capacity of results .on proteins whose struc~re was known before its

a method to predict the secondary structure of a new protein conc:ePtIon (- 64%), than on proteins whose structure has been

which does not have any homologous proteins whose structure elucidat~ afte~ards (- 56%) (see ~bsch an~ Sander 1983b).

is already known. The principle is as follows: for all proteins What is the difference between this evaluatIon process and

P of the Kabsch and Sander data bank, the true secondary ~e more cl~sical ?ne called 'jack ~~e' (Efron, 1:82), ~hich

structure of P is compared to the secondary structure predicted ~s often used. In this test, ~h.en predictIng the p'rotem P, P itself

by the method when removing P itself from the data bank, as is removed f~om the traIrung set, but .n~t its homologues.

usual, but removing the proteins which are homologous to P. Because protems ~om?logous to P are s~milar t~ P at seve~al

Various other solutions based on the same idea are con- levels, an overestImatIon of the method s capacity to predict

ceiva'.Jle. In order to fa~ilitate comparisons, the s~lution we P usually follows. ~rom a statistical p?int of ,~iew ~d ~iven

chose is as close as possible to the benchmark of Kabsch and a meth~, results will ~ better when usmg the Jack kni~e than

Sander (1983b). In particular, we have used the same data bank v:hen usmg our evaluatIon process. However, the gap will very

and we have not added to it the few new proteins which have likel.y be larger for methods bas~ ~n local homology, such as

been elucidated since 1983 and which are weakly homologous Levm et al. (1986), than for statIstIcal methods such as GOR

to the proteins already in the data bank. The only small or GGBSM.

difference, at the level of data, is that for all proteins, the most .recent crystallographic data-and seconda!y structure implementation

assignment-were used. A simple version of the GGBSM prediction algorithm is given

In the present study, we considered that two proteins are in Figure 3. It is written in Pascal Microsoft (version 3.3) on

homologous when they have an homology> 30%, or when an ffiM AT. Let us give some additional comments.

they contain two segments of length> 60 amino acids, with MAX-LENGTH is the maximum length of the sequence to

an homology> 30%. Homologies were computed by an be predicted. Here, it is fixed to 5000 amino acids. The program

algorithm similar to Lipman and Pearson (1985). In order to deals with sequences of such length without any problem.

avoid artefacts, possible with such methods, only functionally Longer sequences may be dealt with by dividing them into

related proteins-in a broad sense, very close to that given in subsequences of length ~ 5000. The program's speed is

Kabsch and Sander (1983a)-were aligned. Nineteen pairs of - 1.5 s CPU for a sequence of 100 amino acids.

homologous proteins were found. These are (using the CBLP-P is a record which contains all parameters necessary

Brookhaven identifiers): to compute what is called CBLF(R,S) above. FIRST is the begin-

ning of the window and LAST the end. Parameter values of(155C, 3C2C), (155C, 3CYT) , (351C, 3C2C) , (3C2C, 3CYT), P ' d N . 1 ed . PIS ITS d NSis 11 s an s are respectIve y stor m , an .

(IAPR,2AAP), (ICTX, INXB), (IECD, ILHB), (ILHB, I MBN) , CB' L' P' P COIL . . ed th C .1 CBLPP HELIX., - - is associat to estate Ol, - -(ILHB,2MHB), (lMBN,2MHB), (IREI, 3FAB) , <_ACT,8PAP), th u I . to e state £Ie lX, etc.(lEST, 2ALP) , (lEST, 2OCH),. (lEST,2PTN), (2ALP, 2OCH) , h ed INITIALIZATION . h . T e proc ure assigns t e previous

(2ALP,2SGA), (2OCH, 2PTN) , (2SGA,2PTN). . th . al P . tl . th d PP hvanables to eir vue. Irs y, usmg e proce ure , eac

Homologies between monomers of a protein, e.g. between amino acid is associated to a numerical type, e.g. alanine (A

2MHB a-chain and 2MHB (3-chain, are not indicated above in single-letter code) is associated to type 1, isoleucine and valine

because they are not useful for the evaluation process we (I and V in single-letter code) which have the same type (see

propose. Indeed, when predicting a given monomor of a protein, above) are associated to 4. Letters which do not correspond

the whole protein (all monomers) is removed from the training to any amino (X, 0, a, etc.) have type O. The correspondence

set. between the ASCII code of a character and its type is stored

Such an evaluation process applies to all methods in which in AA-TYPES. Secondly, the CBLP-P's are initialized, e.g.

361

Page 6: :M , Pages 357 - 365 A simple method for predicting the · In these three Sander, 1983b), which over- or under-predict certain states. Unite 194 de 1'1NSERM, 91 Bd. de I'Hopital,

f

O.Gascuel and J.L.Golrnard

PROGRAM GGBSM(INPUT,OUTPUT); PIS[-2]:- 0.610814;PIS[-1]:- 0.711503;CONST MAX_LENGTH - 5000; FISC 0]:- 1.0 ;PIS[ 1]:- 0.996892;

FISC 2]:- 0.845693;PIS[ 3]:- 0.74947 ;TYPE CBLF_P - RECORD FISC 4]:- 0.589796;PIS[ 5]:- 0.538197;

FIRST,LAST: INTEGER; FISC 6]:- 0.458475;PIS[ 7]:- 0.374935;PIS: ARRAY[-6..+11] OF REAL; FISC 8]:- 0.332311;PIS[ 9]:- 0.291095;ITS: ARRAY[0..12] OF REAL; PIS[10]:- 0.221007;PIS[11]:- 0.159564;NS : REAL; END;

END; WITH CBLF_P_EXTENDED DOBEGIN

VAR AA_TYPES : ARRAY[33..255] OF 0..12; FIRST:--3;LAST :-+3;NS:-1.5451;ITS[0]:= 0.0;

CBLF_P_COIL, ITS[ 1] :--0.005853;ITS[ 2]:- 0.096215;CBLF_P_EXTENDED, ITS[ 3]:- 0.016752;ITS[ 4]:- 0.214617;CBLF_P_HELIX : CBLF_P; ITS[ 5]:- 0.098072;ITS[ 6]:- 0.01415 ;SEQUENCE: ARRAY[l..MAX_LENGTH] OF 0..255; ITS[ 7] :--0.078651;ITS[ 8]:- 0.124579;LENGTH: INTEGER; ITS[ 9]:- 0.079342;ITS[10] :--0.062711;FILE_O : TEXT; ITS[11]:--0.001842;ITS[12] :--0.014743;

PIS[-3]:- 0.16148 ;PIS[-2]:- 0.498551;

PROCEDURE INITIALIZATION; PIS[-l]:- 0.864631;PIS[ 0]:- 1.0 ;VAR CHARACTER: INTEGER; FISt 1]:- 0.850783;PIS[ 2]:- 0.49398 ;PROCEDURE FF(C:CHAR;I:INTEGER); FISC 3]:- 0.169746;BEGIN AA_TYPES[ORD (C) ] :-I END; END;

BEGIN END;FOR CHARACTER:=33 TO 255 DO

AA_TYPES[CHARACTER] :-0; PROCEDURE READ_SEQUENCE;FF('A',l); VAR FILE_I: TEXT;FF('C',2); NAME : STRING(80);AA : CHAR;FF('H',3); BEGINFF('I',4);FF('V',4); LENGTH:-O;FF ('L', 5);FF ('M', 5); WRITE ('Name of the sequence') ;READLN(NAME);FF('G',6); ASSIGN(FILE I,NAME);RESET(FILE I);FF('P' ,7); REPEAT - -FF('F',8);FF('W',8);FF('Y',8); READ (FILE_I,AA);FF('S',9);FF('T',9); IF AA>' , THEN

FF('D',10);FF('E',10); BEGINFF('N',ll);FF('Q',ll); LENGTH:-LENGTH+1;SEQUENCE[LENGTH]:-QRD(AA)FF('K',12);FF('R',12); END;WITH CBLF_P_COIL DO UNTIL EOF (FILE_I) OR (LENGTH=MAX LENGTH);

BEGIN CLOSE(FILE_I);-FIRST:--3;LAST:-+6;NS:-1.0;ITS[0] :-0.0; END;ITS[ 1]:- 0.018074;ITS[ 2]:- 0.073735;ITS[ 3]:- 0.071903;ITS[ 4] :--0.03926 ; PROCEDURE ASK_OUTPUT;ITS[ 5] :=-0.018393;ITS[ 6]:- 0.294359; VAR NAME : STRING(80);ITS[ 7]:- 0.445666;ITS[ 8]:- 0.012034; BEGINITS[ 9]:- 0.16152 ;ITS[10]:- 0.17015 ; WRITELN;ITS[ll]:- 0.187471;ITS[12]:- 0.109817; WRITE('Name of the output file');PIS[-3]:- 0.154212;PIS[-2]:- 0.434159; READLN(NAME);PIS[-l]:- 0.706102;PIS[ 0]:- 1.0 ; ASSIGN(FILE_O,NAME);REWRITE(FILE 0);FISC 1]:- 0.915659;PIS[ 2]:- 0.559442; END; -FISC 3]:- 0.336078;PIS[ 4]:- 0.109662;FISC 5]:- 0.098532;PIS[ 6]:= 0.098014; FUNCTION C_SUM(POS:INTEGER;CBLF:CBLF P) :REAL;END; VAR TOTAL: REAL; -

WITH CBLF P HELIX DO INDICE,FIRST P,LAST P : INTEGER'BEGIN - - BEGIN - - ,

FIRST:--6;LAST:-+11;NS:-1.2311;ITS[0] :=0.0; WITH CBLF DOITS[ 1]:- 0.147418;ITS[ 2]:- 0.024205; BEGINITS[ 3]:- 0.114178;ITS[ 4]:- 0.037925; TOTAL:-O.O;ITS[ 5]:- 0.111157;ITS[ 6]:--0.081108; FIRST_P:-FIRST+POS;LAST_P:-LAST+POS;ITS[ 7]:--0.144076;ITS[ 8]:- 0.063689' IF FIRST P<l THEN FIRST P:-1', - ,ITS [ 9]:--0.031711;ITS[10]:- 0.050114; IF LAST_P>LENGTH THEN LAST P:-LENGTH;ITS[ll]:- 0.01434 ;ITS[12]:- 0.093082' FOR INDICE:-FIRST P TO LAST P DO, -PIS[-6]:- 0.169079;PIS[-5]:- 0.195383; TOTAL:-TOTAL+PIS[INDICE-POS] *PIS[-4]:- 0.326344;PIS[-3]:- 0.430292; ITS[AA_TYPES[SEQUENCE[INDICE]]];

362

Page 7: :M , Pages 357 - 365 A simple method for predicting the · In these three Sander, 1983b), which over- or under-predict certain states. Unite 194 de 1'1NSERM, 91 Bd. de I'Hopital,

r

Protein secondary structure prediction

C SUM:=TOTAL*NS; Predictions are computed by the procedure SECONDARY-END; STRUCTURE-PREDICTION exactly as defined above. They

END; are stored in a fIle whose name is given by the user during the

execution of the procedure ASK-OUTPUT. Each line of thisPROCEDURE SECONDARY STRUCTURE PREDICTION; " .VAR ECS COIL, ECS HELIX, - file corresponds to one residue and contaIns the residue number,

ECS-EXTENDED-: TECS : REAL; the amino acid, the state which is predicted and probabilitiesP :-INTEGER; of the three states Coil, Helix and Extended respectively.

BEGINFOR P: =1 TO LENGTH DO Discussion

BEGIN

ECS_COIL:=EXP (C_SUM(P,CBLF_P_COIL»; Table I contains comparative results of five methods: (1) LimECS_HELIX:=EXP (C_SUM(P, CBLF_P_HELIX» ; (1974), (2) Gamier et al (1978), (3) Chou and Fasman (1978),ECS_EXTENDED:=EXP(C_SUM(P,CBLF_P_EXTENDED»; (4) Le . l

(1986) d (5) GGBSMTE':S:=ECS COIL+ECS HELIX+ECS EXTENDED; vm et a an .WRITE (FILE 0,P:5,CHR(SEQUENCE[PI) :2); Results of Lim's method are those published by Kabsch andIF «ECS_COIL>CECS_HELIX) AND Sander (1983b). As indicated above, the result averaged over

(ECS_COIL>=ECS_EXTENDED» all proteins, 59.3%, is a very optimistic estimation of theOR (P<-3) OR (P>LENGTH-3) method's accuracy. A better estimation is the result it obtains

THEN WRITE(FILE 0,' C ') .ELSE IF ECS HELIX>cECS EXTENDED on the post-1974 proteIns: 56.3%.

THEN WRITE(FILE 0-;-' H ') GOR's and Chou's results are also those published by KabschELSE WRITE(FILE=O,' E '); and Sander (1983b). Although their evaluation process is

WRITELN (FILE_O, different to ours, one may consider that in the case of theseECS_COIL/TECS: 3: 4,' , , two methods the difference is low (see above) and that theECS HELIX/TECS:3:4,' " K b h d S d 1 . ood .. f th ECS-EXTENDED/TECS:3:4); a sc an an er resuts give a g estimation 0 eEND; - accuracy.

END; Results of the two last methods were obtained using the

evaluation process described above. Method 4 has beenBEGIN rewritten by ourselves as described in Levin et al. (1986).INITIALIZATION; Pr . b .

ul fth 1 thods ail blREAD SEQUENCE; otem- y-protem res ts 0 ese ast two me are av a eASK OUTPUT; on request.SECONDARY_STRUCTURE_PREDICTION; GGBSM is the only method to be well balanced. The numberCLOSE (FILE_O) ; of residues predicted in each of the states is very close to theEND. number of residues observed in this state. These numbers are

Fig. 3. A simple Pascal version of the GGBSM secondary structure prediction not exactly equal because results given here are obtained through

algorithm. a true evaluation process in which the training set is modifiedfor each predicted protein. The interest of a well balanced

(i) the preference of alanine for the state Coil, called IA/a. c method appears at two levels.above, corresponds to ITS[l] (alanine has the numerical type (i) At the practical level. Interpreting predictions of a weil-l) inside CBLF-P-COIL and has a value of 0.018074; (ii) the balanced method is easier. When using a method that is notimportance of position 5 for the state Helix, called PS.H above, well balanced, one has to correct mentally (and rougWy) thecorresponds to PIS[5] inside CBLF-P-HELIX and has a value defects of the method. When the method predicts too many Coilof 0.538197; (iii) the decision constant of the state Extended, and not enough Extended, one has to correct this and one hascalled NExtended above, corresponds to NS inside CBLF-P- to suppose that some of the predicted Coil are actually Extended.EXTENDED and has a value of 1.5451. (ii) At the performance level. For the same percentage of

The sequence (in single-letter code) to be predicted is well-predicted residues, a well-balanced method is usually betterread by the procedure READ-SEQUENCE and stored in than one which is not. Indeed, the size of the classes are verySEQUENCE which is an array of characters, given in ASCII unequal: Coil == 50%, Helix == 30% and Extended == 20%.code. Special characters (ASCII code <33) such as carriage When a method, e.g. Lim's predicts too many Coil (59.4%return or space are not read. LENGTH is the length of the for Lim's) and not enough Extended (15.7% for Lim's), it takessequence to be predicted. fewer risks and its accuracy is overestimated.

The function C-SUM has two parameters: a residue number, The results of every method fluctuate widely among proteins,POS, and a CBLF-P, CBLF. It returns what is called e.g. for GGBSM, 27% of well predicted residues in the caseCBLF(R,S) above, R being the residue of number POS and S of melittin (lML T) and 81 % in the case of avian pancreaticthe state corresponding to CBLF. polypeptide (lPPT). Thus, the result averaged over all

363

Page 8: :M , Pages 357 - 365 A simple method for predicting the · In these three Sander, 1983b), which over- or under-predict certain states. Unite 194 de 1'1NSERM, 91 Bd. de I'Hopital,

;

O.Gascuel and J.L.Golrnard

Table I. Predictive success of five methods: details of Coil (C). Helix (H) and Extended (E) predictions averaged over all proteins of the Kabsch andSander data bank

(I) Lim (2) GOR (3) Chou (4) Levin (5) GGBSM

Predicted/observed matrix (percentage of the total number of residues)

Observed Observed Observed Observed Obser.'ed

C H E C H E C H E C H E C H E

Predicted C 37.4 11.3 10.7 28.4 7.0 6.4 26.9 8.2 6.3 33.3 8,8 7.8 34.9 8.8 7.5Predicted H 7,6 14.3 3.0 11.2 15.9 3.4 10,8 11.9 3.9 11.6 16.8 4.9 9.6 15,3 4.1Predicted E 5.3 2.7 7,6 10.8 5.3 11.6 12,6 8.3 11.1 5.9 3.5 7.4 6.3 5,0 8.5

Success 59,3% 55.9% 49.9% 57.5% 58,7%

Predicted! observed total (%)

C H E C H E C H E C H E C H E

Predicted 59.4 25.0 15.7 41.8 30,5 27.6 41.4 26,5 32.0 49.9 33.2 16,8 51,1 29.0 19.9Observed 50.4 28.3 21.3 50.4 28.3 21.3 50.4 28.3 21,3 50.9 29,0 20.1 50,9 29.0 20.1

proteins-58. 7 % of well-predicted residues for GGBSM-does ways such as antigenic site prediction or active site prediction.not have any real meaning. It is only of interest in the framework Specially interesting are statistical methods like GGBSMof a comparison with other methods, tested on the same because they allow spectra to be plotted where the abscissaproteins, through the same evaluation process. forms the sequence and the ordinate supports the probabilities

Our method, like other comparable methods, will fair badly of each of the states (see in particular Ptitsyn and Finkelstein,when applied to membrane-associated proteins, because there 1983). Such spectra contain much more information than theis no membrane-associated protein in the Kabsch and Sander simple result of the decision process which consists of choosingdata bank. Thus, the method's parameters are calculated under the most probable state. They not only contain the decision itselfthe assumption that proteins fold in water and they are not but also the confidence one may have in this decision. Theyapplicable outside this assumption. In the same way, our method may also be used to compare sequences when the aim isis not reliable when applied to proteins which depart from the detecting structural or functional similarities.secondary structure paradigm, e.g. the disulphide-bridge-rich GGBSM's qualities derive from the simplicity of theproteins. mathematical model we used. Because this model is based on

There has long been a general consensus which considers it few parameters (70 free parameters), we might employ animpossible to obtain a 100% secondary structure prediction. efficient learning method allowing the optimal values of theA global folding theory considering tertiary interactions would parameters to be globally estimated. This reasonablebe needed. However, the different attempts to solve the final parametrization also causes the method to be robust, i.e. resultsproblem of conformation prediction for globular proteins from are very close when directly predicting the training set (59.9%sequence alone have as yet been disappointing. A line of on the Kabsch and Sander data bank) and when testing theapproach seems to try to arrange secondary structures into the method using the process we propose (58.7% on the same data).form of super-secondary structures and domains (Cohen et al. , This highlights the almost complete independence between the1983; Taylor and Thornton, 1983). But such approaches are method and the data bank used as the training set: predictionsbased on secondary structure prediction methods. Thus, we go on new proteins will not be much modified by modificationsback to the necessity of developing and of improving these of the data bank. Finally, this small parameterization and themethods. More generally, in our opinion, the problem of simplicity of the calculations present a practical advantage:conformation prediction has to be tackled hierarchically, step GGBSM may be implemented on any microcomputer and evenby step and by successive approximations. Such an approach on programmable pocket calculators.has been successfully applied (Lathrop et al., 1987) on the We do not think that GGBSM itself may be greatly improved.related problem of matching proteins which are close at func- Very surprizingly (at least for us), when shifting amino acidtionallevel but distant from an evolutionary point of view. This regroupings (or types), very small changes in the results areway, accuracy of the initial steps-secondary structure predic- observed. For example, when shifting from the 12 amino acidtion would obviously be one of them-is of prime importance types mentioned above, to 13 types where glutamic acid andsince final prediction would strongly depend on them. aspartic acid have been separated, almost no change is observed.

Outside these prospective considerations, secondary structure However, glutamic acid's and aspartic acid's preferences forprediction methods bring to light a large amount of information each state are quite different (Chou and Fasman, 1974). In theconcerning protein sequences which may be used in various same way, a 20-type solution (one type for each amino acid)

364

Page 9: :M , Pages 357 - 365 A simple method for predicting the · In these three Sander, 1983b), which over- or under-predict certain states. Unite 194 de 1'1NSERM, 91 Bd. de I'Hopital,

r

Protein secondary structure predictionJ

,

does not lead to better results. Our interpretation is that local Referencesamino acid compositions are highly dependent on one another Chou,P. Y. and Fasman,G.D. (1974) Conformational parnmete~ for amino acids[this somewhat contradicts the Gamier et al. 's (1978) assump- in helical, beta-sheet, and random coil regions calculated from proteins.tionofindependenceofthecoefficients see above]. Infonnation Biochemistry, 13, 211-222.

.ed b th ..art ed ' dant Th Chou,P. Y. and Fasman,G.D. (1978) Empirical prediction of protein confor-

cam y esequenceISmp r un . econsequences mation. Annu. Rev. Biochem., 47, 251-276.

are: (i) it is possible, as we do, to group certain amino acids Cohen,F.E., Abarbanel,R.M., Kuntz,I.D. and FIetterick,R.J. (1983) Secon-two by two or three by three; (ii) an imperfect choice of amino ~ stru~re assignment for alpha/beta proteins by a combinatorial approach.

.d . d t . BIochemistry, 22, 4894-4904.aCI regroupmgs oes not cause 00 serIOUS consequences. Efron,B. (1982) The Jack Knife, the Bootstrap and other Resampling Plans.Similarly, performance does not seem to be very sensitive to Society for Industrial and Applied Mathematics, Philadelphia.the choice of the sizes of the windows attached to each state Eisenberg,D., Weiss,R.M. and Terwilliger,T.C. (1984) The hydrophobic( bo ) F. 11 t . t th t th 1 201 f moment detects periodicity in protein hydrophobicity. Proc. Nail. Acad. Sci.see a ve. ma y, we es Ima e a no more an - 100 USA 81 140-144.

accuracy improvement may be obtained by optimizing the Garnie;,J.,'Osguthorpe,D.J. and Robson,B. (1978) Analysis of the accuracyempirical choices presented in this paper. and implications ?f simple meth.ods for predicting the secondaj} structure

Th f th th . al od I ' . 1.. . of globular proteIns. J. Mol. BIOi., 120, 97 -120.e .count~rp~ 0 . e ma ematic. m. e s Simp ICIty is ~ Hopp, T.P. and Woods,K.R. (1981) Prediction of protein antigenic derernlinants

exceSSive sImplIficatIon of the bIOlogICal model. Certam from amino acid sequences. Proc. Natl. Acad. Sci. USA, 78, 3824-3828.determinants of the secondary structure are not taken into Kabsch, W. and Sander ,C. (1983a) Dictionary of protein secondaj} structure:

. pattern recognition of hydrogen-bonded and geometrical features. Biopolymers,account. We thInk m partIcular of: (1) the hydrophobIC moment 22 2577 - 2637.

of the whole protein sequence, which has a predictive power Kabs~h, W. and Sander ,C. (1983b) How good are predictions of proteinon the relative' abundances of helix and extended chain secondarystructure?FEBSLett., 155, 179-182.(E . be I 1984) (..) th art . ul d tn . Kenda1I,M. and Stuart,A. (1976) The Advanced Theory of Statistics. Vol. 2.

Isen rg et a ., ; 11 every p IC ar an asymme c Charles Griffin, London.

amino acid distribution in helices (Chou and Fasman, 1974), Klein,P. and Delisi,C. (1986) Classifying proteins into structural groups.which currently is only globally taken into account by the Biopolymers, 25, 1659-1672.

.. fin~ ..d h 1 bal LathropR.H.,WebsterT.A.andSmithT.(1987)ARIADNE:Pattern-<lirectedmethod. In additIon, the al predICtIon oes not ave ago inference and hierarchical abstractions in protein structure recognition. ACM

consistency, since it may comprise unobserved state chainings Commun., 30, 909-921.such as: ... Extended- Helix- Extended- Helix. .. or Levin,J., Robson,B. and Garnier,J. (1986) An algorithm for secondaI} structure

h . 1 determination in proteins based on sequence similarity. FEBS Lelt., 205,anyt mg e se. 303-308.

The good results obtained by GGBSM seem to indicate that Lim, V.I. (1974) Algorithms for prediction of alpha-helical and beta-structuralthe above-mentioned determinants are somewhat less important regions in globular proteins. J. Mol. Bio!, 88. 873-:-.894. . . . .th th h . h th o thod ' b ed N rth 1 taki Lipman,D.J. and Pearson,W.R. (1985) Rapid and sensItive protem slffillanty

an ose on w IC IS me IS as . eve e ess, ng h S . 227 1435- 1441searc es. Clence" .these detenninants into account is, in our opinion, the best way Nakashima,H., Nishikawa,K. and Ooi,T. (1986) The folding type of a proteinto improve our method. Two complementary approaches may is relevant to the amino acid composition. J. Biochem., 99, 153-162.be .d d Nanard,M. and Nanard,J. (1985) A user-friendly biological ".orkstation.

conSi ere. B. h. . 67-5 429- 432; . 10C Imle" .(1) Usmg GGBSM as a startmg pomt and buildmg on It. As Nelder,J.A. and Mead,R. (1965) A simplex method for function minimiza-

suggested above, coupling GGBSM with a method for class- ti~n. Compurer J.,:, 308-313. . . . . .. . .. al 1 h Kl . d Nishlkawa,K. and OoI,T. (1980) Prediction of the surface-mtenor diagram ofIfymg protems mto structur c asses suc as em an globular proteins by an empirical method. Int. J. Peptide Protein Res., 16,

Delisi (1986), using among other things the hydrophobic 19-32.moment would allow the decision constants Ns to be particu- Nishikawa,K. and Ooi,T. (1986) Amino acid sequence homology applied to1 . ed ' d th " th thod ' t be . ed the prediction of protein secondary structures, and joint prediction with

anz an erelore.e me . s accura~y 0 Impro~.. existing methods. Biochim. Biophys. Acta, 871, 45 - 54.A method for smoothmg and mterprettmg the probability Ptitsyn,O.B. and Finkelstein,A. V. (1983) Theory of protein secondaj} structurespectra conceivably based on certain ideas contained in the and algorithm of its prediction. Biopolymers, 22, 15-25.

k 'f Ch d F (1978) C h t I (1983) d Sweet,R.M. (1986) Evolutionary similarity among peptide segments is a basiswor so ouan asman, 0 ene a.. . an forpredictionofproteinfolding.Biopolymers,25,1565-1577.Taylor and Thornton (1983), would result m consIstent Taylor, W .R. and Thornton,J .M. (1983) Prediction of super secondaI} structure.predictions and better results. Nature, 301, 540-542.

(ii) Searching for new statistical methods which have Received on December 15, 1987; accepted on April 12, 1988GGBSM's qualities (i.e. small parametrization and globalestimation of the parameters) but taking a more complete Circle No.4 on Reader Enquiry Cardstructural model into account.

Acknowledgements

With sincere thanks to Nicky, Bruno Bachimont, Olivier Bacquet, Francois

Colonna Cesari, Antoine Danchin. Philippe Marliere, Chris Sander, William

Saurin and Pierre Zweigenbaum for help in starting this work, technical

assistance, advice, encouragement and plenty of meaningful discussions.

365