PROTEINS: Structure, Function, and Bioinformatics
QUALITY: PREDICTIONS
Assessment of global and local model quality in CASP8 using Pcons and ProQ

Per Larsson,1 Marcin J. Skwark,1 Bjorn Wallner,1* and Arne Elofsson1,2*
1Department of Biochemistry and Biophysics, Center for Biomembrane Research, Stockholm
Bioinformatics Center, Stockholm University, SE-10691 Stockholm, Sweden
2Unidad de Biofisica (CSIC-UPV/EHU), Universidad del Pais Vasco Aptdo 644, 48080 Bilbao, Spain
INTRODUCTION
In CASP7,1 we showed that consensus-based Model Quality
Assessment Programs (MQAPs)2,3 are superior to other types of
MQAPs. In a consensus method, the quality of a model is predicted
by measuring its overall similarity to other models. Consensus
MQAPs can also be used to predict the local model quality, that is,
the accuracy of individual model fragments.4 In CASP7, it was clear
that Pcons performed superior both for the prediction of the global
quality of a model and for the local quality.1 In CASP5 and CASP6,
we also found that Pcons was capable of selecting models better
than the models produced by the single best server.5,6 However, in
CASP7, this was no longer true, as the Zhang-server7 performed
much better than all other servers and the models selected by Pcons
were actually slightly worse, on average.
Pcons is a consensus method that utilizes a set of alternative
protein models as input. A structural superposition algorithm,8 is
used to search for recurring structural patterns in the whole set
of models. Pcons predicts the quality of all models, by assigning a
score to each model reflecting the average similarity to the entire
ensemble of models. The idea is that recurring patterns are
more likely to be correct than patterns that only occur
Per Larsson and Marcin J. Skwark contributed equally to this work.
The authors state no conflict of interest.
Grant sponsors: Swedish Research Councils; Foundation for Strategic Research (SSF);
EMBRACE Project; Grant number: LSHG-CT-2004-512092; Foundation for Internationalisa-
tion of Higher Education and Research (STINT) and Bizkaia - Xede; Marie-Curie Fellowship;
Grant number: MOIF-CT-2005-040496; EU 7th Framework Marie Curie Initial Training Network Transys;
Grant number: FP7-PEOPLE-2007-1-1-ITN
*Correspondence to: Arne Elofsson and Bjorn Wallner, DBB, Stockholm University, 106 91
Stockholm, Sweden. E-mail: [email protected] or [email protected]
Received 13 March 2009; Revised 22 April 2009; Accepted 29 April 2009
Published online 11 May 2009 in Wiley InterScience (www.interscience.wiley.com).
DOI: 10.1002/prot.22476
Abbreviations: QA, quality assessment; MQAP, Model Quality Assessment Program.
ABSTRACT
Model Quality Assessment Programs (MQAPs) are
programs developed to rank protein models. These
methods can be trained to predict the overall global
quality of a model or to identify which local regions of
a model are likely to be incorrect. In CASP8, we
participated with two predictors that predict both global
and local quality using either consensus informa-
tion, Pcons, or purely structural information, ProQ.
Consistent with results in previous CASPs, the
best performance in CASP8 was obtained using the
Pcons method. Furthermore, the results show that
the modification introduced into Pcons for CASP8
improved the predictions against GDT_TS and now
a correlation coefficient above 0.9 is achieved,
whereas the correlation for ProQ is about 0.7. The
correlation is better for the easier than for the
harder targets, but it is not below 0.5 for a single
target and below 0.7 only for three targets. The cor-
relation coefficient for the best local quality MQAP
is 0.68 showing that there is still clear room for
improvement within this area. We also detect that
Pcons still is not always able to identify the best
model. However, we show that using a linear combi-
nation of Pcons and ProQ it is possible to select
models that are better than the models from the
best single server. In particular, the average quality
over the hard targets increases by about 6% com-
pared with using Pcons alone.
Proteins 2009; 77(Suppl 9):167-172. © 2009 Wiley-Liss, Inc.
Key words: quality assessment; MQAP; consensus.
in one or a few models. In earlier versions of Pcons, a
complicated relationship between the average similarity, the
reported scores, and other features was used. However, since
CASP7, Pcons has been based only on the average similarity
between one model and all other models, as in 3D-Jury.3
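The averaging at the heart of this consensus scheme can be sketched in a few lines. This is a simplified illustration rather than the actual Pcons code: `similarity` stands in for whatever symmetric structural-similarity measure (e.g., S-score-based superposition) is used, and the data layout is hypothetical.

```python
import itertools

def consensus_scores(models, similarity):
    """Assign each model the average similarity to every other model,
    in the spirit of 3D-Jury-style consensus scoring (a sketch, not Pcons).

    `models` is a list of opaque model objects; `similarity(a, b)` is any
    symmetric structural-similarity function returning a float in [0, 1].
    """
    n = len(models)
    scores = [0.0] * n
    # Each unordered pair is compared once; both members receive the score.
    for i, j in itertools.combinations(range(n), 2):
        s = similarity(models[i], models[j])
        scores[i] += s
        scores[j] += s
    # Average over the n - 1 comparisons each model takes part in.
    return [s / (n - 1) for s in scores]
```

Models that resemble many others in the ensemble thus receive high scores, while outliers score low, which is exactly the recurring-pattern intuition described above.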
The consensus approach outlined earlier results in
both a global quality score, reflecting the overall correct-
ness, and local quality score, reflecting the local correct-
ness. In CASP7, the global score was based on the aver-
age global structure similarity using LGscore8 to measure
the similarity between the models, whereas the local score
was based on the average local structure similarity S-
score.8 In CASP8, we used the average S-score also for
the global quality estimates. In CASP7, Pcons was the
only consensus-based method participating in the MQAP
section, whereas in CASP8 several others participated as
well. One consequence of the introduction of the MQAP
category in CASP7 has been the development of meta-MQAPs,
that is, MQAP methods that combine the results from
several MQAPs to improve the predictions.9,10 We actually did this already
in CASP55 using a simple linear combination of ProQ11
and Pcons.
In addition ProQ11 was tested in CASP8 for both local
and global quality predictions. ProQ11 and ProQres4 uti-
lize a combination of structural features to predict the
global and local quality, respectively. They both use
similar types of structural features such as atom–atom
contacts, residue–residue contacts, surface area exposure,
and secondary structure agreement, as inputs to a neural
network trained to predict the quality. The difference is
that ProQ derives these features for the complete protein
model and predicts the global quality, whereas ProQres
derives these features for a local sequence window to
get a localized quality prediction. In CASP8, ProQres
participated as the localized version of ProQ. No changes
have been made to ProQ or ProQres since CASP7.
Following the CASP tradition, this article is divided
into two sections, what went right (Correlations
improved since CASP7) and what went wrong (Pcons
does not always select the best model).
METHODS
Pcons uses a structural superposition algorithm as the
basis for the consensus analysis. Any structural superpo-
sition algorithm could in principle be used, and pairwise
comparisons are then made between all input models.
However, for a given similarity measure the best per-
formance is reached if the same algorithm is used both
in the consensus analysis and as the target value for
Pcons, that is, to best predict GDT_TS, GDT_TS should
also be used in the consensus analysis and vice versa for
any quality measure. In this work, both S-score and
LG-score are used as measures of structural similarity.
LG-score detects segments in common between the
model and the correct target structure. Based on these
segments, a structural comparison score, Sstr, is defined by:

$$ S_{\mathrm{str}} = \sum_i \frac{1}{1 + \left( d_i/d_0 \right)^2} \qquad (1) $$
is calculated, where d_i is the distance between the i-th residue
in the native structure and in the model, and d_0 is a
distance threshold. This score ranges from 1 for a perfect
prediction (d_i = 0) to 0 as d_i goes to infinity. The
distance threshold defines the distance at which the
score is 0.5; here, it was set
to sqrt(5). This is done both for LG-score and S-score.
LG-score is then the logarithm of a P-value that depends
both on the S-score and the length of the match, that is
the log of the probability of finding a match with the same
length and equal or better S-score.
As in CASP7, the goal was to predict the local CA-CA
deviation, which is exactly di. Thus, all methods return
this measure, which can be turned into Sstr by rearranging
Eq. (1). The global measure is then simply the sum
of the local quality scores, divided by the number of
models compared for each target.
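Under the definitions above (with d_0 = sqrt(5)), the forward and rearranged forms of Eq. (1) can be sketched as follows. The function names are ours, for illustration only:

```python
import math

D0 = math.sqrt(5)  # distance threshold: a residue at this deviation scores 0.5

def deviation_to_score(d, d0=D0):
    """Per-residue term of Eq. (1): 1 / (1 + (d / d0)^2)."""
    return 1.0 / (1.0 + (d / d0) ** 2)

def score_to_deviation(s, d0=D0):
    """Rearranging Eq. (1): recover the CA-CA deviation from a score in (0, 1]."""
    return d0 * math.sqrt(1.0 / s - 1.0)

def sstr(deviations, d0=D0):
    """Sum of the per-residue scores over a model, i.e. S_str of Eq. (1)."""
    return sum(deviation_to_score(d, d0) for d in deviations)
```

A predicted CA-CA deviation thus maps to a score and back without loss, which is why the methods can report deviations while being evaluated on Sstr.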
Further, in this work, we use the term ‘‘per-target’’ cor-
relation when referring to correlations between Pcons/
ProQ scores and the actual quality, calculated for each
target individually, and ‘‘overall’’ correlation when refer-
ring to all models and all targets simultaneously.
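The distinction between the two correlation modes can be illustrated with a small sketch; the dictionary layout of predicted and actual qualities is hypothetical, and the Pearson coefficient is computed directly:

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def overall_and_per_target(predicted, actual):
    """'Overall': pool every model of every target into one correlation.
    'Per-target': average the correlations computed within each target.
    Both arguments map a target id to a list of per-model values."""
    pooled_p, pooled_a, per_target = [], [], []
    for target in predicted:
        pooled_p += predicted[target]
        pooled_a += actual[target]
        per_target.append(pearson(predicted[target], actual[target]))
    return pearson(pooled_p, pooled_a), sum(per_target) / len(per_target)
```

Note that a method can rank models perfectly within every target (per-target correlation of 1) yet have a much lower overall correlation if its scores are not comparable across targets of different length, which is the length-dependence issue discussed below.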
Correlations improved since CASP7
In Table I, the global and local correlations for the Pcons
and ProQ MQAPs against the model quality as measured
by GDT_TS12 (global) and RMSD (local) are shown. In
addition to the results for Pcons and ProQ, we also report
the results for the Pcons version used in CASP7, here termed
Pcons-LG. The only difference between Pcons-LG and the
new Pcons version is that the new version uses the average
S-score instead of the average LGscore for structural evalua-
tion. For the local qualities, the measured structural similar-
ity is based on the S-score in both versions, but because a
global structural alignment is performed using the different
target functions, there might be slight differences.

Table I: Pearson Correlation Coefficients of Predicted and Measured Model Qualities for the MQAPs

MQAP        RGDT overall   RGDT per target   Rlocal overall   Rlocal per target
Pcons          0.914           0.919             0.683            0.687
Pcons-LG       0.683           0.893             0.677            0.681
ProQ           0.704           0.664             0.422            0.418
Pcomb          0.914           0.922             0.689            0.688

The global quality is measured using GDT_TS, while the local quality is measured using RMSD. In addition to Pcons and ProQ, which participated in CASP8, the Pcons version used in CASP7 (Pcons-LG) and Pcomb, a simple linear combination of Pcons (90%) and ProQ (10%), are included.

the average correlation coefficients ‘‘per target’’ have previ-
ously been used for evaluation. This is because it has been
difficult to agree on a single measure that is independent of
the target length. A method that performs well for both
‘‘per-target’’ and ‘‘overall’’ criteria is clearly preferable, as it
can provide the user with an overall estimate of the quality.
In our CASP7 article,1 we argued that a measure such as the
S-score8 could be used to measure the quality independently
of target length. However, in CASP GDT_TS is still used, and
we therefore modified Pcons to fit this scoring function
better, by taking the predicted S-score and dividing it by
the length of the target protein.
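The length normalization described above amounts to one line; the per-residue scores and target length here are illustrative inputs, not values from the actual pipeline:

```python
def length_normalized_quality(residue_scores, target_length):
    """Divide the summed predicted per-residue S-scores by the target length.

    Residues that are missing from the model simply contribute nothing,
    so the result is a length-independent global quality in [0, 1] that
    tracks GDT_TS-like measures better than the unnormalized sum."""
    return sum(residue_scores) / target_length
```

For example, a model covering only half of a target scores at most 0.5, regardless of how long the target is, which is what makes scores comparable across targets.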
It can be seen that when using GDT_TS the new Pcons
version performs clearly better than Pcons-LG, Table I. In
particular the ‘‘overall’’ correlation is dramatically improved
from 0.68 to 0.91, but also a smaller improvement for the
‘‘per target’’ correlation can be seen. The correlations are
similar if Maxsub13 or TM-score14 are used, whereas, as
expected, the correlation versus the S-score is worse (data not
shown). As discussed earlier, the main reason for the improvement
is that the S-score normalized by length correlates
better than LG-score with the measure of correctness used in
CASP, GDT_TS (0.89 vs. 0.72). Thus, a successful prediction
of such a measure will also have a higher correlation. This
is also supported by the fact that the performance improvement
when comparing targets of equal length, that is, the
‘‘per target’’ performance, is marginal compared with the
increase when comparing targets of different lengths.
In Figure 1, the ‘‘per-target’’ correlation sorted by the
median GDT_TS score per target is shown. As in CASP7
the correlation for Pcons is worse for the more difficult
targets. In comparison with CASP7, there are now fewer
targets with unsatisfactory correlation. In CASP7, there
existed about 10 targets with a correlation coefficient
below 0.7, but in CASP8 we only found three such tar-
gets. This is due to the different scoring functions used
in CASP8, as the old Pcons version showed a similar per-
formance as in CASP7 (data not shown). As expected
ProQ did not show any improvement since CASP7 as the
program was identical.

Figure 1: Average correlation coefficients per target for Pcons and ProQ. The two top panels show the correlations for the global MQAPs, and the lower panels show the local ones.

The ‘‘overall’’ correlation against
GDT_TS actually dropped slightly from 0.79 to 0.70 for
some unknown reason, Table I. Also for ProQ the per-
formance is worse for the harder targets, Figure 1.
As in CASP7, Pcons and ProQ perform worse for local
than for global predictions, with correlation coefficients
of 0.68 and 0.42, respectively, Table I. The correlation
coefficients are basically identical to what was observed
in CASP7. The same is true for the alternative methods
to evaluate local quality (data not shown). No very
strong dependence on model quality can be found for
Pcons or ProQ, Figure 1. However, it is clear that ProQ
never shows a good correlation for the most difficult tar-
gets, while Pcons sometimes does. Furthermore, the local
quality correlation for Pcons rarely drops below 0.5 indi-
cating that it is quite often useful.
Pcons does not always select the best model
In CASP5 and CASP6, Pcons was able to select better
first ranked models than the best single server did. How-
ever, in CASP7 this was no longer true. Table II shows
that in CASP8, too, Pcons did not select better
models than the models produced by the best single
server, the Zhang-server. As described earlier, a
trend since CASP7 has been to develop combined
MQAPs, and we were therefore interested to see if this
could improve the situation. We had already, in
CASP5,11 used a linear combination of ProQ and
Pcons scores to select better models, and in CASP6 we also
used a combined score.15 However, in neither CASP7
nor CASP8 did we examine this approach, although others
have continued developing such methods.9,10 We therefore
thought it could be interesting to see whether it was still
possible to improve the Pcons results by combining
them with ProQ. Linear combinations of the ProQ and
Pcons scores were tested and it was found that a good
performance could be obtained by using a combination
of 90% Pcons and 10% ProQ. This will be referred to as
the Pcomb method below. Although the correlation
against GDT_TS (or any other quality measure) did not
change significantly, it is clear that this combination
selects better models, Table II. The average GDT_TS
increased from 62.8 to 63.5. In particular, for the ‘‘hard’’
models the average GDT_TS increased by 6% (35.3 to
37.5), and the selected models are even slightly better
than the models from the Zhang-server.
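A minimal sketch of the Pcomb combination and model selection follows; the 90/10 weights are those given above, while the function names are ours, for illustration only:

```python
def pcomb(pcons_score, proq_score, w_pcons=0.9):
    """Pcomb: a linear combination of 90% Pcons and 10% ProQ scores."""
    return w_pcons * pcons_score + (1.0 - w_pcons) * proq_score

def select_model(scores):
    """Rank models by combined score and return the top-ranked one.
    `scores` maps model name -> (pcons_score, proq_score)."""
    return max(scores, key=lambda name: pcomb(*scores[name]))
```

With the T0472 scores reported in the text (COMA_TS1: Pcons 0.269, ProQ 0.233; FAMSD_TS1: Pcons 0.268, ProQ 0.416), the combination prefers FAMSD_TS1 even though its Pcons score is marginally lower, since the 10% ProQ contribution breaks the near-tie.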
To gain some further understanding of how well
Pcomb, ProQ, and other methods select the top-ranked
model, the GDT_TS score for the highest-ranked model
from each method is plotted against the GDT_TS score
for the selected Pcons model, Figure 2. First, it can be
seen that for most models Pcomb, Zhang-server, and
Pcons select models of similar quality, whereas ProQ
often selects slightly worse models and in quite a few
cases significantly worse models. Furthermore, only in a
handful of cases models exist that are more than 20
GDT_TS units better than the ones selected by Pcons.
It can also be seen that a large part of the improvement
in Pcomb is due to the fact that Pcomb manages to select
the best model for two targets, T0472 and T0462,
whereas Pcons selects clearly worse models. In T0472
Pcons selects the COMA_TS1 model (GDT_TS = 43),
whereas Pcomb (and ProQ) selects the FAMSD_TS1
model (GDT_TS = 68). The Pcons scores for these two
models are virtually identical (0.269 vs. 0.268), but the
ProQ score for FAMSD_TS1 is much better (0.416 vs.
0.233). Visual inspection shows that the FAMSD_TS1
model is quite good, whereas the COMA_TS1 model
does not represent the correct fold, nor does it contain
much secondary structure. One reason why the
COMA_TS1 model obtains such a high Pcons score is
because it is exactly identical to five other models from
the two different COMA servers.

Table II: The Average GDT_TS for the Model Ranked Highest by the Different MQAPs

MQAP           ALL    EASY   HARD
Pcons          62.8   75.3   35.3
ProQ           56.6   68.7   30.2
Pcomb          63.5   75.4   37.5
Zhang-server   63.1   72.2   36.7
Best           68.3   79.0   44.9

Zhang-server, the best automatic server, and the results of the best possible selection are included for reference. Hard models are defined as the models where the median GDT_TS is below 0.45.

Figure 2: The GDT_TS score of the best model for each target selected by Pcons (X-axis) and by Pcomb (black dots), ProQ (green squares), and Zhang-server (red triangles) on the Y-axis. In addition, the GDT_TS score of the best model is shown as blue stars. Three targets (T0462, T0472, and T0498) discussed in the text are marked. The two dotted lines represent models with a GDT_TS score 20 better/worse than the Pcons models.

In T0462, a similar situation appears, as Pcons selects
MULTICOM-REFINE_TS4 (GDT_TS = 35), which is very similar to several
other models from the six different MULTICOM servers.
Also in this case, ProQ and Pcomb manage to select the
overall best model, Zhang-server_TS3 (GDT_TS = 60),
with a ProQ score of 0.545 versus 0.259 for
MULTICOM-REFINE_TS4.
A third, interesting, target is T0498, an artificial protein,
L1c, one of two proteins designed to have high sequence
identity but different fold and function.16 Here, neither
Zhang-server, Pcomb, nor Pcons manages to select a model
that is even close in quality to the best model. This is
also the only case where the predicted quality from
Pcons is more than 15 GDT_TS units higher than the
actual quality (data not shown). There exist three virtually
identical best models (HHpred4_TS1, HHpred5_TS1, and
FEIG_TS1, all with identical quality scores, GDT_TS = 75),
but Pcons and Pcomb select the Phyre_de_novo_TS1
model (GDT_TS = 29). The GDT_TS of the first-ranked
Zhang-server model is also 29. The Pcons score of the
incorrect models is very high (0.80) while the correct
models have a Pcons score of 0.20. This would indicate
that this really is a case where the consensus approach
(and the Zhang-server) completely fails as Pcons not
only is unable to identify the correct prediction from a
few servers but also provides a very confident prediction
of an erroneous model. The ProQ scores are comparable
between the two sets of models, whereas ProsaII actually
manages to identify the difference in quality with a very
good score. Interestingly, the correct model is a three-helical
immunoglobulin-binding bundle, whereas all the
incorrect models are of a ubiquitin-like fold (a mixed
beta-alpha fold). Secondary structure predictions also
suggest that it is a mixed alpha/beta protein, indicating
that the current generation of structure prediction tools
cannot deal with proteins that have not evolved within
the standard biological framework.
The last two cases in which a model with a difference
in GDT_TS score of 20 or higher could have been found
are T0467 and T0476. Both these targets are difficult
targets, where the Pcons score is quite low, but there
exist a few models that are relatively good. Clearly, in
such cases a consensus method cannot be expected to
identify the best model, indicating that consensus
MQAPs should perhaps not be trusted too much when
the consensus is low.
CONCLUSION
The CASP8 Pcons method performs significantly better
than the version used in CASP7 for global quality esti-
mates, when GDT_TS is used as a measure of model
quality. This is particularly clear when studying the over-
all correlation over all models. The correlation coefficient
for Pcons in CASP8 is 0.90 or higher independent of
what quality measure is used. The correlation coefficient
is close to the correlation between the different quality
measures (0.95–0.99), indicating that it might be difficult
to improve Pcons much further. On the other hand the
local correlations did not improve since CASP7 and there
should be more room for improvements in this category
as the correlation with the observed quality is lower
(0.68).
One area where Pcons failed is in always selecting the
best model for each target. However, we show that a
simple linear combination of Pcons and ProQ can pro-
vide the user with models that are better than the
models from the best single server in particular for
harder targets. Furthermore, analysis of a few targets
where Pcons selected suboptimal models indicates that
when several closely related servers are included it is
possible that consensus methods perform badly. There-
fore, in future CASPs it would probably be possible to
improve the performance by only using a reduced set of
servers as the input to Pcons. In addition, it might be
better to turn to other MQAP methods when the
consensus is low. Finally, it might be useful to try to
identify artificial proteins and treat them differently.
However, so far we have not come up with a better
method to modify Pcons than the simple linear
combination of Pcons and ProQ used in Pcomb. For the
hard targets this improves the quality of the selected
models by 6%. However, as the Pcomb method has only
been tested nonblindly on the CASP8 targets, the values
for the weights need to be optimized on a larger set.
Furthermore, the improvement needs to be verified in a
blind test.
ACKNOWLEDGMENTS
This work was supported by grants to AE from the
Swedish Research Councils (NT and M), the foundation
for strategic research (SSF) and the EU 6th Framework
Programme through the EMBRACE project, contract LSHG-CT-
2004-512092. The visit of AE to EHU was supported by
the foundation for internationalisation of higher educa-
tion and research (STINT) and Bizkaia - Xede. BW is
supported by Marie-Curie Fellowship, MOIF-CT-2005-
040496. MS is funded by the EU 7th Framework Marie Curie Initial
Training Network Transys, contract FP7-PEOPLE-2007-1-
1-ITN.
REFERENCES
1. Wallner B, Elofsson A. Prediction of global and local model quality
in CASP7 using Pcons and ProQ. Proteins 2007;69(Suppl 8):
184–193.
2. Lundstrom J, Rychlewski L, Bujnicki J, Elofsson A. Pcons: a neural-
network-based consensus predictor that improves fold recognition.
Protein Sci 2001;10:2354–2362.
3. Ginalski K, Elofsson A, Fischer D, Rychlewski L. 3D-Jury: a simple
approach to improve protein structure predictions. Bioinformatics
2003;19:1015–1018.
4. Wallner B, Elofsson A. Identification of correct regions in protein
models using structural, alignment, and consensus information.
Protein Sci 2006;15:900–913.
5. Wallner B, Fang H, Elofsson A. Automatic consensus-based fold
recognition using Pcons, ProQ, and Pmodeller. Proteins 2003;53
(Suppl 6):534–541.
6. Wallner B, Elofsson A. Pcons5: combining consensus, structural
evaluation and fold recognition scores. Bioinformatics 2005;21:
4248–4254.
7. Zhang Y. Template-based modeling and free modeling by I-TASSER
in CASP7. Proteins 2007;69(Suppl 8):108–117.
8. Cristobal S, Zemla A, Fischer D, Rychlewski L, Elofsson A. A study
of quality measures for protein threading models. BMC Bioinfor-
matics 2001;2.
9. McGuffin L. The ModFOLD server for the quality assessment of
protein structural models. Bioinformatics 2008;24:586–587.
10. Pawlowski M, Gajda M, Matlak R, Bujnicki J. MetaMQAP: a meta-
server for the quality assessment of protein models. BMC Bioinfor-
matics 2008;9:403.
11. Wallner B, Elofsson A. Can correct protein models be identified?
Protein Sci 2003;12:1073–1086.
12. Zemla A, Venclovas C, Moult J, Fidelis K. Processing and analysis of
CASP3 protein structure predictions. Proteins Suppl 1999;3:22–29.
13. Siew N, Elofsson A, Rychlewski L, Fischer D. MaxSub: an auto-
mated measure to assess the quality of protein structure predic-
tions. Bioinformatics 2000;16:776-785.
14. Zhang Y, Skolnick J. Scoring function for automated assess-
ment of protein structure template quality. Proteins 2004;57:
702–710.
15. Wallner B, Elofsson A. All are not equal: a benchmark of dif-
ferent homology modeling programs. Protein Sci 2005;14:1315–
1327.
16. He Y, Chen Y, Alexander P, Bryan P, Orban J. NMR structures of
two designed proteins with high sequence identity but different
fold and function. Proc Natl Acad Sci USA 2008;105:14412–
14417.