PROTEINS: Structure, Function, and Bioinformatics
QUALITY: PREDICTIONS
Assessment of global and local model quality in CASP8 using Pcons and ProQ

Per Larsson,1 Marcin J. Skwark,1 Bjorn Wallner,1* and Arne Elofsson1,2*
1Department of Biochemistry and Biophysics, Center for Biomembrane Research, Stockholm
Bioinformatics Center, Stockholm University, SE-10691 Stockholm, Sweden
2Unidad de Biofisica (CSIC-UPV/EHU), Universidad del Pais Vasco Aptdo 644, 48080 Bilbao, Spain
INTRODUCTION
In CASP7,1 we showed that consensus-based Model Quality
Assessment Programs (MQAPs)2,3 are superior to other types of
MQAPs. In a consensus method, the quality of a model is predicted
by measuring its overall similarity to other models. Consensus
MQAPs can also be used to predict the local model quality, that is,
the accuracy of individual model fragments.4 In CASP7, it was clear
that Pcons performed superior both for the prediction of the global
quality of a model and for the local quality.1 In CASP5 and CASP6,
we also found that Pcons was capable of selecting models better
than the models produced by the single best server.5,6 However, in
CASP7, this was no longer true, as the Zhang-server7 performed
much better than all other servers and the models selected by Pcons
were actually slightly worse, on average.
Pcons is a consensus method that utilizes a set of alternative
protein models as input. A structural superposition algorithm,8 is
used to search for recurring structural patterns in the whole set
of models. Pcons predicts the quality of all models, by assigning a
score to each model reflecting the average similarity to the entire
ensemble of models. The idea is that recurring patterns are
more likely to be correct than patterns that only occur
Per Larsson and Marcin J. Skwark contributed equally to this work.
The authors state no conflict of interest.
Grant sponsors: Swedish Research Councils; Foundation for Strategic Research (SSF);
EMBRACE Project; Grant number: LSHG-CT-2004-512092; Foundation for Internationalisa-
tion of Higher Education and Research (STINT) and Bizkaia - Xede; Marie-Curie Fellowship;
Grant number: MOIF-CT-2005-040496; EU 7th Framework Marie Curie Initial Training Network Transys;
Grant number: FP7-PEOPLE-2007-1-1-ITN
*Correspondence to: Arne Elofsson and Bjorn Wallner, DBB, Stockholm University, 106 91
Stockholm, Sweden. E-mail: [email protected] or [email protected]
Received 13 March 2009; Revised 22 April 2009; Accepted 29 April 2009
Published online 11 May 2009 in Wiley InterScience (www.interscience.wiley.com).
DOI: 10.1002/prot.22476
Abbreviations: QA, quality assessment; MQAP, Model Quality Assessment Program.
ABSTRACT
Model Quality Assessment Programs (MQAPs) are
programs developed to rank protein models. These
methods can be trained to predict the overall global
quality of a model or to identify which local regions of
a model are likely to be incorrect. In CASP8, we
participated with two predictors that predict both global
and local quality using either consensus informa-
tion, Pcons, or purely structural information, ProQ.
Consistent with results in previous CASPs, the
best performance in CASP8 was obtained using the
Pcons method. Furthermore, the results show that
the modification introduced into Pcons for CASP8
improved the predictions against GDT_TS and now
a correlation coefficient above 0.9 is achieved,
whereas the correlation for ProQ is about 0.7. The
correlation is better for the easier than for the
harder targets, but it is not below 0.5 for a single
target and below 0.7 only for three targets. The cor-
relation coefficient for the best local quality MQAP
is 0.68 showing that there is still clear room for
improvement within this area. We also detect that
Pcons still is not always able to identify the best
model. However, we show that using a linear combi-
nation of Pcons and ProQ it is possible to select
models that are better than the models from the
best single server. In particular, the average quality
over the hard targets increases by about 6% com-
pared with using Pcons alone.
Proteins 2009; 77(Suppl 9):167-172. © 2009 Wiley-Liss, Inc.
Key words: quality assessment; MQAP; consensus.
in one or a few models. In earlier versions of Pcons, a
complicated relationship between the average similarity, the
reported scores, and other features was used. However, since
CASP7, Pcons has been based only on the average similarity
between one model and all other models, as in 3D-Jury.3
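The averaging at the heart of this consensus scheme can be sketched in a few lines. This is a simplified illustration rather than the actual Pcons code: `similarity` stands in for whatever symmetric structural-similarity measure (e.g., S-score-based superposition) is used, and the data layout is hypothetical.

```python
import itertools

def consensus_scores(models, similarity):
    """Assign each model the average similarity to every other model,
    in the spirit of 3D-Jury-style consensus scoring (a sketch, not Pcons).

    `models` is a list of opaque model objects; `similarity(a, b)` is any
    symmetric structural-similarity function returning a float in [0, 1].
    """
    n = len(models)
    scores = [0.0] * n
    # Each unordered pair is compared once; both members receive the score.
    for i, j in itertools.combinations(range(n), 2):
        s = similarity(models[i], models[j])
        scores[i] += s
        scores[j] += s
    # Average over the n - 1 comparisons each model takes part in.
    return [s / (n - 1) for s in scores]
```

Models that resemble many others in the ensemble thus receive high scores, while outliers score low, which is exactly the recurring-pattern intuition described above.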
The consensus approach outlined earlier results in
both a global quality score, reflecting the overall correct-
ness, and local quality score, reflecting the local correct-
ness. In CASP7, the global score was based on the aver-
age global structure similarity using LGscore8 to measure
the similarity between the models, whereas the local score
was based on the average local structure similarity S-
score.8 In CASP8, we used the average S-score also for
the global quality estimates. In CASP7, Pcons was the
only consensus-based method participating in the MQAP
section, whereas in CASP8 several others participated as
well. One consequence of the introduction of the MQAP
category in CASP7 has been the development of meta-MQAPs,
that is, MQAP methods that combine the results from
several MQAPs to improve the predictions.9,10 We actually did this already
in CASP55 using a simple linear combination of ProQ11
and Pcons.
In addition ProQ11 was tested in CASP8 for both local
and global quality predictions. ProQ11 and ProQres4 uti-
lize a combination of structural features to predict the
global and local quality, respectively. They both use
similar types of structural features such as atom–atom
contacts, residue–residue contacts, surface area exposure,
and secondary structure agreement, as inputs to a neural
network trained to predict the quality. The difference is
that ProQ derives these features for the complete protein
model and predicts the global quality, whereas ProQres
derives these features for a local sequence window to
get a localized quality prediction. In CASP8, ProQres
participated as the localized version of ProQ. No changes
have been made to ProQ or ProQres since CASP7.
Following the CASP tradition, this article is divided
into two sections, what went right (Correlations
improved since CASP7) and what went wrong (Pcons
does not always select the best model).
METHODS
Pcons uses a structural superposition algorithm as the
basis for the consensus analysis. Any structural superpo-
sition algorithm could in principle be used, and pairwise
comparisons are then made between all input models.
However, for a given similarity measure the best per-
formance is reached if the same algorithm is used both
in the consensus analysis and as the target value for
Pcons, that is, to best predict GDT_TS, GDT_TS should
also be used in the consensus analysis and vice versa for
any quality measure. In this work, both S-score and
LG-score are used as measures of structural similarity.
LG-score detects segments in common between the
model and the correct target structure. Based on these
segments, a structural comparison score, Sstr, is defined by:

$$ S_{\mathrm{str}} = \sum_i \frac{1}{1 + \left( d_i/d_0 \right)^2} \qquad (1) $$
is calculated, where d_i is the distance between the i-th residue
in the native structure and in the model, and d_0 is a
distance threshold. This score ranges from 1 for a perfect
prediction (d_i = 0) to 0 as d_i goes to infinity. The
distance threshold defines the distance at which the
score is 0.5; here, it was set
to sqrt(5). This is done both for LG-score and S-score.
LG-score is then the logarithm of a P-value that depends
both on the S-score and the length of the match, that is
the log of the probability of finding a match with the same
length and equal or better S-score.
As in CASP7, the goal was to predict the local CA-CA
deviation, which is exactly di. Thus, all methods return
this measure, which can be turned into Sstr by rearranging
Eq. (1). The global measure is then simply the sum
of the local quality scores, divided by the number of
models compared for each target.
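Under the definitions above (with d_0 = sqrt(5)), the forward and rearranged forms of Eq. (1) can be sketched as follows. The function names are ours, for illustration only:

```python
import math

D0 = math.sqrt(5)  # distance threshold: a residue at this deviation scores 0.5

def deviation_to_score(d, d0=D0):
    """Per-residue term of Eq. (1): 1 / (1 + (d / d0)^2)."""
    return 1.0 / (1.0 + (d / d0) ** 2)

def score_to_deviation(s, d0=D0):
    """Rearranging Eq. (1): recover the CA-CA deviation from a score in (0, 1]."""
    return d0 * math.sqrt(1.0 / s - 1.0)

def sstr(deviations, d0=D0):
    """Sum of the per-residue scores over a model, i.e. S_str of Eq. (1)."""
    return sum(deviation_to_score(d, d0) for d in deviations)
```

A predicted CA-CA deviation thus maps to a score and back without loss, which is why the methods can report deviations while being evaluated on Sstr.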
Further, in this work, we use the term ‘‘per-target’’ cor-
relation when referring to correlations between Pcons/
ProQ scores and the actual quality, calculated for each
target individually, and ‘‘overall’’ correlation when refer-
ring to all models and all targets simultaneously.
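The distinction between the two correlation modes can be illustrated with a small sketch; the dictionary layout of predicted and actual qualities is hypothetical, and the Pearson coefficient is computed directly:

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def overall_and_per_target(predicted, actual):
    """'Overall': pool every model of every target into one correlation.
    'Per-target': average the correlations computed within each target.
    Both arguments map a target id to a list of per-model values."""
    pooled_p, pooled_a, per_target = [], [], []
    for target in predicted:
        pooled_p += predicted[target]
        pooled_a += actual[target]
        per_target.append(pearson(predicted[target], actual[target]))
    return pearson(pooled_p, pooled_a), sum(per_target) / len(per_target)
```

Note that a method can rank models perfectly within every target (per-target correlation of 1) yet have a much lower overall correlation if its scores are not comparable across targets of different length, which is the length-dependence issue discussed below.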
Correlations improved since CASP7
In Table I, the global and local correlations for the Pcons
and ProQ MQAPs against the model quality as measured
by GDT_TS12 (global) and RMSD (local) are shown. In
addition to the results for Pcons and ProQ, we also report
the results for the Pcons version used in CASP7, here termed
Pcons-LG. The only difference between Pcons-LG and the
new Pcons version is that the new version uses the average
S-score instead of the average LGscore for structural evalua-
tion. For the local qualities, the measured structural similar-
ity is based on the S-score in both versions, but because a
global structural alignment is performed using the different
target functions, there might be slight differences.

Table I: Pearson Correlation Coefficients of Predicted and Measured Model Qualities for the MQAPs

MQAP        RGDT overall   RGDT per target   Rlocal overall   Rlocal per target
Pcons          0.914           0.919             0.683            0.687
Pcons-LG       0.683           0.893             0.677            0.681
ProQ           0.704           0.664             0.422            0.418
Pcomb          0.914           0.922             0.689            0.688

The global quality is measured using GDT_TS, while the local quality is measured using RMSD. In addition to Pcons and ProQ, which participated in CASP8, the Pcons version used in CASP7 (Pcons-LG) and Pcomb, a simple linear combination of Pcons (90%) and ProQ (10%), are included.

the average correlation coefficients ‘‘per target’’ have previ-
ously been used for evaluation. This is because it has been
difficult to agree on a single measure that is independent of
the target length. A method that performs well for both
‘‘per-target’’ and ‘‘overall’’ criteria is clearly preferable, as it
can provide the user with an overall estimate of the quality.
In our CASP7 article,1 we argued that a measure such as the
S-score8 could be used to measure the quality independently
of target length. However, in CASP GDT_TS is still used, and
we therefore modified Pcons to fit this scoring function
better, by taking the predicted S-score and dividing it by
the length of the target protein.
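The length normalization described above amounts to one line; the per-residue scores and target length here are illustrative inputs, not values from the actual pipeline:

```python
def length_normalized_quality(residue_scores, target_length):
    """Divide the summed predicted per-residue S-scores by the target length.

    Residues that are missing from the model simply contribute nothing,
    so the result is a length-independent global quality in [0, 1] that
    tracks GDT_TS-like measures better than the unnormalized sum."""
    return sum(residue_scores) / target_length
```

For example, a model covering only half of a target scores at most 0.5, regardless of how long the target is, which is what makes scores comparable across targets.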
It can be seen that when using GDT_TS the new Pcons
version performs clearly better than Pcons-LG, Table I. In
particular the ‘‘overall’’ correlation is dramatically improved
from 0.68 to 0.91, but also a smaller improvement for the
‘‘per target’’ correlation can be seen. The correlations are
similar if Maxsub13 or TM-score14 are used, whereas, as
expected, the correlation versus the S-score is worse (data not
shown). As discussed earlier, the main reason for the improvement
is that the S-score normalized by length correlates
better than LG-score with the measure of correctness used in
CASP, GDT_TS (0.89 vs. 0.72). Thus, a successful prediction
of such a measure will also have a higher correlation. This
is also supported by the fact that the performance improvement
when comparing targets of equal length, that is, the
‘‘per target’’ performance, is marginal compared with the
increase when comparing targets of different lengths.
In Figure 1, the ‘‘per-target’’ correlation sorted by the
median GDT_TS score per target is shown. As in CASP7
the correlation for Pcons is worse for the more difficult
targets. In comparison with CASP7, there are now fewer
targets with unsatisfactory correlation. In CASP7, there
existed about 10 targets with a correlation coefficient
below 0.7, but in CASP8 we only found three such tar-
gets. This is due to the different scoring functions used
in CASP8, as the old Pcons version showed a similar per-
formance as in CASP7 (data not shown). As expected
ProQ did not show any improvement since CASP7 as the
program was identical.

Figure 1: Average correlation coefficients per target for Pcons and ProQ. The two top panels show the correlations for the global MQAPs, and the lower panels show the local ones.

The ‘‘overall’’ correlation against
GDT_TS actually dropped slightly from 0.79 to 0.70 for
some unknown reason, Table I. Also for ProQ the per-
formance is worse for the harder targets, Figure 1.
As in CASP7, Pcons and ProQ perform worse for local
than for global predictions, with correlation coefficients
of 0.68 and 0.42, respectively, Table I. The correlation
coefficients are basically identical to what was observed
in CASP7. The same is true for the alternative methods
to evaluate local quality (data not shown). No very
strong dependence on model quality can be found for
Pcons or ProQ, Figure 1. However, it is clear that ProQ
never shows a good correlation for the most difficult tar-
gets, while Pcons sometimes does. Furthermore, the local
quality correlation for Pcons rarely drops below 0.5 indi-
cating that it is quite often useful.
Pcons does not always select the best model
In CASP5 and CASP6, Pcons was able to select better
first ranked models than the best single server did. How-
ever, in CASP7 this was no longer true. Table II shows
that in CASP8, too, Pcons did not select better
models than the models produced by the best single
server, the Zhang-server. As described earlier, a
trend since CASP7 has been to develop combined
MQAPs, and we were therefore interested to see if this
could improve the situation. We had already, in
CASP5,11 used a linear combination of ProQ and
Pcons scores to select better models, and in CASP6 we also
used a combined score.15 However, in neither CASP7
nor CASP8 did we examine this approach, although others
have continued developing such methods.9,10 We therefore
thought it could be interesting to see whether it was still
possible to improve the Pcons results by combining
them with ProQ. Linear combinations of the ProQ and
Pcons scores were tested and it was found that a good
performance could be obtained by using a combination
of 90% Pcons and 10% ProQ. This will be referred to as
the Pcomb method below. Although the correlation
against GDT_TS (or any other quality measure) did not
change significantly, it is clear that this combination
selects better models, Table II. The average GDT_TS
increased from 62.8 to 63.5. In particular, for the ‘‘hard’’
models the average GDT_TS increased by 6% (35.3 to
37.5), and the selected models are even slightly better
than the models from the Zhang-server.
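A minimal sketch of the Pcomb combination and model selection follows; the 90/10 weights are those given above, while the function names are ours, for illustration only:

```python
def pcomb(pcons_score, proq_score, w_pcons=0.9):
    """Pcomb: a linear combination of 90% Pcons and 10% ProQ scores."""
    return w_pcons * pcons_score + (1.0 - w_pcons) * proq_score

def select_model(scores):
    """Rank models by combined score and return the top-ranked one.
    `scores` maps model name -> (pcons_score, proq_score)."""
    return max(scores, key=lambda name: pcomb(*scores[name]))
```

With the T0472 scores reported in the text (COMA_TS1: Pcons 0.269, ProQ 0.233; FAMSD_TS1: Pcons 0.268, ProQ 0.416), the combination prefers FAMSD_TS1 even though its Pcons score is marginally lower, since the 10% ProQ contribution breaks the near-tie.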
To gain some further understanding of how well
Pcomb, ProQ, and other methods select the top-ranked
model, the GDT_TS score for the highest-ranked model
from each method is plotted against the GDT_TS score
for the selected Pcons model, Figure 2. First, it can be
seen that for most models Pcomb, Zhang-server, and
Pcons select models of similar quality, whereas ProQ
often selects slightly worse models and in quite a few
cases significantly worse models. Furthermore, only in a
handful of cases models exist that are more than 20
GDT_TS units better than the ones selected by Pcons.
It can also be seen that a large part of the improvement
in Pcomb is due to the fact that Pcomb manages to select
the best model for two targets, T0472 and T0462,
whereas Pcons selects clearly worse models. In T0472
Pcons selects the COMA_TS1 model (GDT_TS = 43),
whereas Pcomb (and ProQ) selects the FAMSD_TS1
model (GDT_TS = 68). The Pcons scores for these two
models are virtually identical (0.269 vs. 0.268), but the
ProQ score for FAMSD_TS1 is much better (0.416 vs.
0.233). Visual inspection shows that the FAMSD_TS1
model is quite good, whereas the COMA_TS1 model
does not represent the correct fold, nor does it contain
much secondary structure. One reason why the
COMA_TS1 model obtains such a high Pcons score is
because it is exactly identical to five other models from
the two different COMA servers.

Table II: The Average GDT_TS for the Model Ranked Highest by the Different MQAPs

MQAP           ALL    EASY   HARD
Pcons          62.8   75.3   35.3
ProQ           56.6   68.7   30.2
Pcomb          63.5   75.4   37.5
Zhang-server   63.1   72.2   36.7
Best           68.3   79.0   44.9

Zhang-server, the best automatic server, and the results of the best possible selection are included for reference. Hard models are defined as the models where the median GDT_TS is below 0.45.

Figure 2: The GDT_TS score of the best model for each target selected by Pcons (X-axis) and by Pcomb (black dots), ProQ (green squares), and Zhang-server (red triangles) on the Y-axis. In addition, the GDT_TS score of the best model is shown as blue stars. Three targets (T0462, T0472, and T0498) discussed in the text are marked. The two dotted lines represent models with a GDT_TS score 20 better/worse than the Pcons models.

In T0462, a similar situation appears, as Pcons selects
MULTICOM-REFINE_TS4 (GDT_TS = 35), which is very similar to several
other models from the six different MULTICOM servers.
Also in this case, ProQ and Pcomb manage to select the
overall best model, Zhang-server_TS3 (GDT_TS = 60),
with a ProQ score of 0.545 versus 0.259 for
MULTICOM-REFINE_TS4.
A third, interesting, target is T0498, an artificial protein,
L1c, one of two proteins designed to have high sequence
identity but different fold and function.16 Here, neither
Zhang-server, Pcomb, nor Pcons manages to select a model
that is even close in quality to the best model. This is
also the only case where the predicted quality from
Pcons is more than 15 GDT_TS units higher than the
actual quality (data not shown). There exist three virtually
identical best models (HHpred4_TS1, HHpred5_TS1, and
FEIG_TS1, all with identical quality scores, GDT_TS = 75),
but Pcons and Pcomb select the Phyre_de_novo_TS1
model (GDT_TS = 29). The GDT_TS of the first-ranked
Zhang-server model is also 29. The Pcons score of the
incorrect models is very high (0.80) while the correct
models have a Pcons score of 0.20. This would indicate
that this really is a case where the consensus approach
(and the Zhang-server) completely fails as Pcons not
only is unable to identify the correct prediction from a
few servers but also provides a very confident prediction
of an erroneous model. The ProQ scores are comparable
between the two sets of models, whereas ProsaII actually
manages to identify the difference in quality with a very
good score. Interestingly, the correct model is a three-helical
immunoglobulin-binding bundle, whereas all the
incorrect models are of a ubiquitin-like fold (a mixed
beta-alpha fold). Secondary structure predictions also
suggest that it is a mixed alpha/beta protein, indicating
that the current generation of structure prediction tools
cannot deal with proteins that have not evolved within
the standard biological framework.
The last two cases in which a model with a difference
in GDT_TS score of 20 or higher could have been found
are T0467 and T0476. Both these targets are difficult
targets, where the Pcons score is quite low, but there
exist a few models that are relatively good. Clearly, in
such cases a consensus method cannot be expected to
identify the best model, indicating that consensus
MQAPs should perhaps not be trusted too much when
the consensus is low.
CONCLUSION
The CASP8 Pcons method performs significantly better
than the version used in CASP7 for global quality esti-
mates, when GDT_TS is used as a measure of model
quality. This is particularly clear when studying the over-
all correlation over all models. The correlation coefficient
for Pcons in CASP8 is 0.90 or higher independent of
what quality measure is used. The correlation coefficient
is close to the correlation between the different quality
measures (0.95–0.99), indicating that it might be difficult
to improve Pcons much further. On the other hand the
local correlations did not improve since CASP7 and there
should be more room for improvements in this category
as the correlation with the observed quality is lower
(0.68).
One area where Pcons failed is in always selecting the
best model for each target. However, we show that a
simple linear combination of Pcons and ProQ can pro-
vide the user with models that are better than the
models from the best single server in particular for
harder targets. Furthermore, analysis of a few targets
where Pcons selected suboptimal models indicates that
when several closely related servers are included it is
possible that consensus methods perform badly. There-
fore, in future CASPs it would probably be possible to
improve the performance by only using a reduced set of
servers as the input to Pcons. In addition, it might be
better to turn to other MQAP methods when the
consensus is low. Finally, it might be useful to try to
identify artificial proteins and treat them differently.
However, so far we have not come up with a better
method to modify Pcons than the simple linear
combination of Pcons and ProQ used in Pcomb. For the
hard targets this improves the quality of the selected
models by 6%. However, as the Pcomb method has only
been tested nonblindly on the CASP8 targets, the values
for the weights need to be optimized on a larger set.
Furthermore, the improvement needs to be verified in a
blind test.
ACKNOWLEDGMENTS
This work was supported by grants to AE from the
Swedish Research Councils (NT and M), the foundation
for strategic research (SSF) and the EU 6th Framework
Programme through the EMBRACE project, contract LSHG-CT-
2004-512092. The visit of AE to EHU was supported by
the foundation for internationalisation of higher educa-
tion and research (STINT) and Bizkaia - Xede. BW is
supported by Marie-Curie Fellowship, MOIF-CT-2005-
040496. MS is funded by the EU 7th Framework Marie Curie Initial
Training Network Transys, contract FP7-PEOPLE-2007-1-
1-ITN.
REFERENCES
1. Wallner B, Elofsson A. Prediction of global and local model quality
in CASP7 using Pcons and ProQ. Proteins 2007;69(Suppl 8):
184–193.
2. Lundstrom J, Rychlewski L, Bujnicki J, Elofsson A. Pcons: a neural-
network-based consensus predictor that improves fold recognition.
Protein Sci 2001;10:2354–2362.
3. Ginalski K, Elofsson A, Fischer D, Rychlewski L. 3D-Jury: a simple
approach to improve protein structure predictions. Bioinformatics
2003;19:1015–1018.
4. Wallner B, Elofsson A. Identification of correct regions in protein
models using structural, alignment, and consensus information.
Protein Sci 2006;15:900–913.
5. Wallner B, Fang H, Elofsson A. Automatic consensus-based fold
recognition using Pcons, ProQ, and Pmodeller. Proteins 2003;53
(Suppl 6):534–541.
6. Wallner B, Elofsson A. Pcons5: combining consensus, structural
evaluation and fold recognition scores. Bioinformatics 2005;21:
4248–4254.
7. Zhang Y. Template-based modeling and free modeling by I-TASSER
in CASP7. Proteins 2007;69(Suppl 8):108–117.
8. Cristobal S, Zemla A, Fischer D, Rychlewski L, Elofsson A. A study
of quality measures for protein threading models. BMC Bioinfor-
matics 2001;2.
9. McGuffin L. The ModFOLD server for the quality assessment of
protein structural models. Bioinformatics 2008;24:586–587.
10. Pawlowski M, Gajda M, Matlak R, Bujnicki J. MetaMQAP: a meta-
server for the quality assessment of protein models. BMC Bioinfor-
matics 2008;9:403.
11. Wallner B, Elofsson A. Can correct protein models be identified?
Protein Sci 2003;12:1073–1086.
12. Zemla A, Venclovas C, Moult J, Fidelis K. Processing and analysis of
CASP3 protein structure predictions. Proteins Suppl 1999;3:22–29.
13. Siew N, Elofsson A, Rychlewski L, Fischer D. MaxSub: an auto-
mated measure to assess the quality of protein structure predic-
tions. Bioinformatics 2000;16:776-785.
14. Zhang Y, Skolnick J. Scoring function for automated assess-
ment of protein structure template quality. Proteins 2004;57:
702–710.
15. Wallner B, Elofsson A. All are not equal: a benchmark of dif-
ferent homology modeling programs. Protein Sci 2005;14:1315–
1327.
16. He Y, Chen Y, Alexander P, Bryan P, Orban J. NMR structures of
two designed proteins with high sequence identity but different
fold and function. Proc Natl Acad Sci USA 2008;105:14412–
14417.