
TASK QUARTERLY vol. 20, No 4, 2016, pp. 341–351

AN EXAMPLE OF TEMPLATE BASED PROTEIN STRUCTURE MODELING BY GLOBAL OPTIMIZATION

KEEHYOUNG JOO1,2, INSUK JOUNG1,3 AND JOOYOUNG LEE1,2,3

1Center for In Silico Protein Science, Korea Institute for Advanced Study, Seoul 130-722, Korea

2Center for Advanced Computation, Korea Institute for Advanced Study, Seoul 130-722, Korea

3School of Computational Sciences, Korea Institute for Advanced Study, Seoul 130-722, Korea

(received: 26 August 2016; revised: 23 September 2016; accepted: 30 September 2016; published online: 17 October 2016)

Abstract: CASP (Critical Assessment of protein Structure Prediction) is a community-wide experiment for protein structure prediction that has taken place every two years since 1994. In CASP11, held in 2014, our method named `nns' was ranked as the second best server method according to the official CASP11 assessment, based on the first-ranked models of 81 targets. In `nns', we applied the powerful global optimization method of conformational space annealing to three stages of optimization: multiple sequence-structure alignment, three-dimensional (3D) chain building, and side-chain remodeling. For fold recognition, a new alignment method called CRFalign was used. The good performance of the nns server method is attributed to the successful fold recognition carried out by combined methods including CRFalign, and to the current modeling formulation, which incorporates accurate structural aspects collected from multiple templates. In this article, we provide a successful example of `nns' predictions for T0776, for which all details of intermediate modeling data are provided.

Keywords: template based modeling; protein structure modeling; global optimization; casp; homology modeling; sequence alignment; fold recognition

DOI: https://doi.org/10.17466/tq2016/20.4/b

1. Introduction

For the protein structure modeling of CASP11 targets using our `nns' server method, we updated the protein modeling protocols that we used for CASP7 to CASP10 [1–3]. The approach follows the usual template-based modeling (TBM) procedure, which includes template selection/alignment, 3D (three-dimensional) chain building, and refinement of side-chains and/or backbones [1]. We applied the global optimization method of conformational space annealing (CSA) [4–9] to the three stages of protein modeling: multiple sequence-structure alignment (MSA) [10], 3D chain building [11], and side-chain remodeling [1–3]. One of the most important steps in protein structure modeling is to find appropriate template structures for a given target. This typically involves pairwise alignments between the target sequence and template structures in the PDB database. For CASP11, we used a new sequence-structure alignment method called CRFalign, developed by improving a base alignment model constructed from a simplified HHpred [12] scoring scheme. For the selection of candidate templates, three fold-recognition methods were employed: (i) CRFpred, which is based on the new alignment method of CRFalign; (ii) FOLDFINDER, which was used in our previous TBM protocols for CASP7-10 [1–3]; and (iii) HHpred [12]. CRFalign alignments between a target sequence and selected templates were utilized to build MSA libraries for the consistency scores used in MSACSA [10]. Details of our updated procedure can be found elsewhere [13].

2. Server Prediction by nns

Compared to our CASP10 server prediction protocol, PMS [3], nns is updated in two aspects. The first is a new sequence-template alignment method called CRFalign, a fold-recognition method based on machine learning, utilizing sequence and structure features, and a quality assessment score by QA1 [3] (the same in-house model quality assessment method used during CASP10). The second new feature is an updated side-chain remodeling method which uses the Scwrl4 rotamer library. For more details, readers are referred to our CASP11 publication [13]. In this article, we discuss the details of the 3D modeling procedure of T0776 to provide a more concrete example of the nns procedure.

2.1. Fold recognition of T0776

Table 1 shows the result of the fold recognition of T0776. For this target, a total of 38 templates were collected from the three fold-recognition methods mentioned above. For each of these templates, we generated five 3D models following a simple procedure similar to MODELLER [14], and the quality of each model was estimated using our in-house quality assessment protocol (QA1) [3]. Using the first model of each template in Table 1, we generated a network of templates in which the edge weight between two templates was set to the TM-score between them. The templates were grouped by a network-based clustering method called ModCSA [15–17]. In Table 1, we observe that two communities of templates were identified. Subsets of templates belonging to the same community (cluster) were collected to generate template lists, as shown in Table 2. The top list of each community contains the core templates that scored high in Table 1 (templates with a QA score of at least 93% of the top QA value, 0.7052). From the other templates (those with a QA score of at least 80% of the top QA value), up to 4 additional templates were used to generate combinations.


Table 1. The results of fold recognition for T0776 are shown; QA is the quality assessment score, ComID is the community ID, and Degree is the sum of the edge weights connected to the template node; in this example of T0776, the network consisting of 38 template nodes is divided into two communities

Rank  Template  QA      ComID  Degree
1     4iyjA     0.7052  2       5.3599
2     3w7vA     0.6929  1      17.7297
3     3rjtA     0.6862  1      17.7835
4     3milA     0.6774  1      17.8300
5     2q0qA     0.6728  1      18.0248
6     2hsjA     0.6660  2       5.1370
7     4ppyA     0.6504  2       5.4307
8     4jggA     0.6201  1      17.8676
9     1vjgA     0.6162  1      17.9166
10    4h08A     0.6085  2       5.0145
11    1yzfA     0.6008  1      17.2450
12    3p94A     0.5993  2       5.3482
13    3hp4A     0.5983  1      17.5574
14    4hf7A     0.5923  2       5.3818
15    2vptA     0.5874  1      17.4166
16    3dciA     0.5817  1      17.8505
17    2o14A     0.5771  1      17.1287
18    2wabA     0.5513  1      17.9634
19    1ivnA     0.5479  1      17.7859
20    3bzwA     0.5319  1      17.5336
21    1bwpA     0.5303  2       5.0390
22    4hyqA     0.5097  1      17.5529
23    1escA     0.4913  1      17.7154
24    4i8iA     0.4667  1      16.4387
25    2w9xA     0.4578  1      16.6691
26    1fxwF     0.4565  2       5.0405
27    4lhsA     0.4408  1      16.7863
28    2waaA     0.4361  1      16.8955
29    1deoA     0.4169  1      16.6639
30    3slrA     0.4022  1      14.4093
31    1zmbA     0.4019  1      15.0217
32    3u37A     0.4006  1      16.2943
33    3kvnA     0.3982  1      16.7989
34    3dc7A     0.3972  1      17.3815
35    4nrdA     0.3470  1      17.1403
36    4m8kA     0.3130  1      16.4113
37    3skvA     0.3067  1      17.0040
38    4c1bA     0.2314  2       3.1600
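The Degree column is described as the sum of the edge weights attached to a template node, where each edge weight is the TM-score between two templates. A minimal sketch of that bookkeeping is given here. The function name `node_degrees` and the 4x4 matrix are hypothetical, the matrix values are toy numbers that are not taken from the T0776 run, and the matrix is assumed to be symmetric.

```python
def node_degrees(weights):
    """Weighted degree of each node: sum of edge weights to all other nodes."""
    n = len(weights)
    # Skip the diagonal: a template has no edge to itself.
    return [sum(weights[i][j] for j in range(n) if j != i) for i in range(n)]


# Hypothetical pairwise TM-score matrix for four templates (toy values).
toy_tm = [
    [1.00, 0.80, 0.75, 0.30],
    [0.80, 1.00, 0.70, 0.25],
    [0.75, 0.70, 1.00, 0.35],
    [0.30, 0.25, 0.35, 1.00],
]
degrees = node_degrees(toy_tm)
```

In this toy network the first three nodes are strongly connected to each other and the fourth is not, which is the kind of contrast in Degree that separates the two communities in Table 1.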


Table 2. A list of template combinations is shown; for each community, we considered up to 9 combinations of templates; the top list contains the core templates scored high in Table 1 (templates with a QA score of at least 93% of the top QA value (0.7052)) for each community; considering the other templates (at least 80% of the top QA value), up to 4 additional templates are used to generate combinations; Ntemp is the number of templates for each list

ComID  List    Templates                                                Ntemp
1      list1   4iyjA, 2hsjA                                             2
1      list2   4iyjA, 2hsjA, 4ppyA                                      3
1      list3   4iyjA, 2hsjA, 4h08A                                      3
1      list4   4iyjA, 2hsjA, 3p94A                                      3
1      list5   4iyjA, 2hsjA, 4ppyA, 4h08A                               4
1      list6   4iyjA, 2hsjA, 4ppyA, 3p94A                               4
1      list7   4iyjA, 2hsjA, 4h08A, 3p94A                               4
1      list8   4iyjA, 2hsjA, 4ppyA, 4h08A, 3p94A                        5
1      list9   4iyjA, 2hsjA, 4ppyA, 4h08A, 3p94A, 4hf7A                 6
2      list10  3w7vA, 3rjtA, 3milA, 2q0qA                               4
2      list11  3w7vA, 3rjtA, 3milA, 2q0qA, 4jggA                        5
2      list12  3w7vA, 3rjtA, 3milA, 2q0qA, 1vjgA                        5
2      list13  3w7vA, 3rjtA, 3milA, 2q0qA, 1yzfA                        5
2      list14  3w7vA, 3rjtA, 3milA, 2q0qA, 4jggA, 1vjgA                 6
2      list15  3w7vA, 3rjtA, 3milA, 2q0qA, 4jggA, 1yzfA                 6
2      list16  3w7vA, 3rjtA, 3milA, 2q0qA, 1vjgA, 1yzfA                 6
2      list17  3w7vA, 3rjtA, 3milA, 2q0qA, 4jggA, 1vjgA, 1yzfA          7
2      list18  3w7vA, 3rjtA, 3milA, 2q0qA, 4jggA, 1vjgA, 1yzfA, 3hp4A   8
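The selection rule behind Table 2 can be sketched in a few lines. The thresholds (93% and 80% of the top QA value, at most 4 additional templates) come from the text. The enumeration rule is an inference from the pattern visible in Table 2, not a documented part of the nns protocol: all subsets of the first three additional templates are added to the core, followed by one list containing all four. The function name `template_lists` is hypothetical. The code uses the ComID labels of Table 1, which appear interchanged relative to those printed in Table 2 (4iyjA is in community 2 in Table 1 but under ComID 1 in Table 2).

```python
from itertools import combinations

# Ranks 1-20 of Table 1 as (template, QA, ComID).
# Lower-ranked templates fall below the 80% cutoff and are omitted here.
TABLE1 = [
    ("4iyjA", 0.7052, 2), ("3w7vA", 0.6929, 1), ("3rjtA", 0.6862, 1),
    ("3milA", 0.6774, 1), ("2q0qA", 0.6728, 1), ("2hsjA", 0.6660, 2),
    ("4ppyA", 0.6504, 2), ("4jggA", 0.6201, 1), ("1vjgA", 0.6162, 1),
    ("4h08A", 0.6085, 2), ("1yzfA", 0.6008, 1), ("3p94A", 0.5993, 2),
    ("3hp4A", 0.5983, 1), ("4hf7A", 0.5923, 2), ("2vptA", 0.5874, 1),
    ("3dciA", 0.5817, 1), ("2o14A", 0.5771, 1), ("2wabA", 0.5513, 1),
    ("1ivnA", 0.5479, 1), ("3bzwA", 0.5319, 1),
]


def template_lists(rows, com_id, core_frac=0.93, extra_frac=0.80, max_extra=4):
    """Build template combinations for one community (rule inferred from Table 2)."""
    top_qa = max(qa for _, qa, _ in rows)  # global top QA value (0.7052)
    members = sorted((r for r in rows if r[2] == com_id), key=lambda r: -r[1])
    core = [t for t, qa, _ in members if qa >= core_frac * top_qa]
    extra = [t for t, qa, _ in members
             if extra_frac * top_qa <= qa < core_frac * top_qa][:max_extra]
    head, last = extra[:3], extra[3:]
    # Core alone, then core plus every subset of the first three extras.
    lists = [core + list(c)
             for k in range(len(head) + 1)
             for c in combinations(head, k)]
    if last:
        # Final list: core plus all additional templates.
        lists.append(core + head + last)
    return lists
```

With the labels of Table 1, `template_lists(TABLE1, 1)[6]` gives the six-template combination that Table 2 calls list16, and each community yields nine lists.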

After generating the template lists, sequence-based alignments between the target and template sequences [13] and structure-structure alignments by TM-align [18] between templates were used to build a restraint library for matched residue pairs. Then, multiple sequence alignment was carried out using MSACSA [3]. Figure 1 shows the highest-score alignment generated for list16. The alignment was then passed to the 3D-chain-building procedure.

2.2. 3D Chain Building

For 3D-chain-building of T0776, we used the following energy function:

$$E = E_{\text{stereo-chemistry}} + E^{\text{vdW}}_{\text{repul}} + E_{\text{phys}} + E_{\text{restraint}} \tag{1}$$

where $E_{\text{stereo-chemistry}}$ is taken from the Modeller energy function [14] and corresponds to the stereo-chemical term, related to bond lengths, bond angles, torsion angles and improper torsion angles according to CHARMM22 [19]. $E^{\text{vdW}}_{\text{repul}}$ corresponds to the repulsive part of the van der Waals potential [20], and the attractive part is set to a constant. The repulsive potential is defined as


Figure 1. The multiple sequence alignment generated by MSACSA using the templates of list16 is shown

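$$E^{\text{vdW}}_{\text{repul}} = \sum_{i} \sum_{j>i} \begin{cases} \epsilon_{ij}\left[\left(\dfrac{\sigma_{ij}}{r_{ij}}\right)^{12} - 2\left(\dfrac{\sigma_{ij}}{r_{ij}}\right)^{6}\right] + \epsilon_{ij}, & r_{ij} < \sigma_{ij} \\ 0, & r_{ij} \ge \sigma_{ij} \end{cases} \tag{2}$$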

where $\epsilon_{ij}$ and $\sigma_{ij}$ are taken from the CHARMM22 parameters, $r_{ij}$ is the distance between atoms $i$ and $j$, and the sum runs over all non-bonded atom pairs.
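As an illustration of the repulsive term in equation (2), the following sketch evaluates it for a list of atom pairs. The function names are hypothetical, and $\epsilon$ is assumed here to be a positive well depth. In this form the unshifted 12-6 term has its minimum, $-\epsilon$, at $r = \sigma$, so adding $\epsilon$ makes the energy reach zero continuously at the cutoff.

```python
def vdw_repulsive_pair(r, eps, sigma):
    """Equation (2) for one atom pair; r must be positive.

    Shifted 12-6 term for r < sigma, zero for r >= sigma.
    eps is assumed to be a positive well depth.
    """
    if r >= sigma:
        return 0.0
    x = (sigma / r) ** 6
    return eps * (x * x - 2.0 * x) + eps


def vdw_repulsive(pairs):
    """Sum over non-bonded pairs given as (r, eps, sigma) tuples."""
    return sum(vdw_repulsive_pair(r, eps, sigma) for r, eps, sigma in pairs)
```

Because the term is zero beyond $\sigma$ and grows steeply below it, it only penalizes atomic clashes and leaves well-separated atoms to the other energy terms.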

$E_{\text{phys}}$ includes the modeling energy terms for free modeling (or ab initio modeling) targets, namely the dynamic fragment assembly (DFA) terms [21], the DFIRE statistical potential term [22], the hydrogen bonding term [23], and the GOAP term [24], as follows:

$$E_{\text{phys}} = E_{\text{dfa}} + E_{\text{dfire}} + E_{\text{GOAP}} + E_{\text{hbond}} \tag{3}$$

Finally, $E_{\text{restraint}}$ contains the pairwise distance restraint term, expressed as a Lorentzian function [3], with a predicted variability value used as the uncertainty of the restraint distance [25]. The weights for the energy terms were re-optimized using 27 small one-domain CASP9 targets.

Initial models for the structure optimization are generated by Modeller using the templates and the multiple sequence alignment; subsequently, the models are modified by perturbing Cartesian coordinates or torsional angles.


In Table 3 we show examples of Lorentzian restraints with predicted sigma values. $d$ is the value of the atom pair distance extracted from a template, and $\sigma$ is its corresponding sigma value. Note that multiple values of $d$ and $\sigma$ are shown for a given atom pair, reflecting the presence of multiple templates.

Table 3. Some examples of the Lorentzian restraints with predicted sigma values are shown; $d$ is the value of the atom pair distance extracted from templates, and $\sigma$ is its corresponding sigma value; it should be noted that multiple values of $d$ and $\sigma$ are shown for a given atom pair, reflecting the presence of multiple templates

Types  Atom pairs  σ1      d1       σ2      d2       σ3      d3       σ4      d4       …
CA-CA  383–870     0.5524  13.5496
CA-CA  383–904     0.8370  14.0315  1.6115  14.5468
CA-CA  383–926     1.5347  11.4204  1.5422  12.3390  3.6552  15.8246  4.2628  14.2020  …
…
N-O    385–786     0.5262  10.3278
N-O    382–952     1.0525  10.7042  4.9056  11.5793  5.2089  13.1140
…
BB-SD  581–592     1.1616   5.3657  0.6318   7.4930  1.3852   6.5124
…
SD-SD  817–1044    1.3896   3.5229  1.4333   4.9000  2.1545   4.2864
…
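To make the role of the $(\sigma, d)$ pairs in Table 3 concrete, the sketch below sums one Lorentzian well per template for a single atom pair. The functional form is not given in this article, so the expression used here, $-\sigma^2 / ((r - d)^2 + \sigma^2)$, is an assumed generic Lorentzian and may differ from the one defined in [3]. The function name is hypothetical, and the numerical values are the CA-CA 383–904 entries of Table 3.

```python
def lorentzian_restraint(r, restraints):
    """Sum of Lorentzian wells for one atom pair at distance r.

    restraints: list of (sigma, d) pairs, one per template.
    Assumed generic form, not necessarily the one used by nns.
    """
    return -sum(s * s / ((r - d) ** 2 + s * s) for s, d in restraints)


# (sigma, d) pairs for the CA-CA pair 383-904 in Table 3 (two templates).
ca_383_904 = [(0.8370, 14.0315), (1.6115, 14.5468)]

energy_near = lorentzian_restraint(14.2, ca_383_904)  # close to both template distances
energy_far = lorentzian_restraint(25.0, ca_383_904)   # far from both template distances
```

Each well in this form is bounded between -1 and 0, so a restraint that cannot be satisfied contributes almost nothing at large deviations. This is one plausible reason a Lorentzian form tolerates mutually inconsistent distances from different templates, and a smaller $\sigma$ gives a narrower, more demanding well.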

For all 18 lists shown in Table 2, we performed the 3D-chain-building procedure, producing 100 3D models for each list. By performing another quality assessment of the generated 3D models, we estimated the average quality of the 100 3D models for each list, which is shown in Table 4. At the time of 3D modeling, list16 was chosen as the best one, and the lowest energy structure was selected from the csa16 run. The side-chains of this structure were remodeled while its backbone structure was kept fixed. Table 4 also shows the quality of the backbone structures in terms of the TM-score measured after the native structure of T0776 was released. TMave is the average TM-score of the 100 3D models and TMgmin is the TM-score of the lowest energy structure among these 100 3D models. For T0776, the selection of csa16 is well justified by its high TM-score values. Table 5 shows the quality of the side-chains before and after the side-chain remodeling.

In Figure 2 we show the energy landscape of the final 100 CSA bank structures for the csa16 run of T0776. Although the lowest energy structure does not correspond to the highest TM-score structure, better quality 3D models were generated by optimizing the energy function. When the model quality of nns is compared to that of all the other CASP11 server models for T0776, we observe a significant advantage of our `nns' model (Figure 3). In addition, when we examine the quality of the template structures used in generating our `nns' model (see Table 6), we observe that the TM-score of the `nns' model 1, 0.918


Table 4. The quality assessment and model selection calculated during the modeling stage are shown; QA represents the assessment score of the 100 3D models generated; the actual model quality, measured by the TMscore program after the native structure of T0776 was released, is also shown; TMave is the average TM-score of the 100 3D models and TMgmin is the TM-score of the lowest energy structure among these 100 3D models; for T0776, the selection of csa16 was well justified by high TM-score values

csa    ComID  QA      TMave   TMgmin
csa16  2      1.5550  0.9278  0.9278
csa11  2      1.5503  0.9119  0.9131
csa13  2      1.5417  0.9198  0.9186
csa10  2      1.5447  0.9178  0.9182
csa12  2      1.5368  0.9098  0.9095
csa14  2      1.5153  0.9089  0.9162
csa17  2      1.5016  0.9217  0.9220
csa8   1      1.5055  0.7606  0.7661
csa15  2      1.4914  0.9151  0.9187
csa18  2      1.4864  0.9227  0.9260
csa2   1      1.5044  0.7676  0.7608
csa9   1      1.4946  0.7608  0.7628
csa3   1      1.4874  0.7819  0.7800
csa5   1      1.4768  0.7527  0.7540
csa1   1      1.4417  0.7602  0.7629
csa4   1      1.4434  0.7435  0.7475
csa6   1      1.3330  0.7419  0.7444
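The selection step summarized in Table 4 amounts to picking the CSA run with the highest QA score. The short sketch below replays that choice on the tabulated values and also averages TMave per community. The variable and function names are illustrative only.

```python
# Rows of Table 4 as (run, ComID, QA, TMave, TMgmin).
TABLE4 = [
    ("csa16", 2, 1.5550, 0.9278, 0.9278), ("csa11", 2, 1.5503, 0.9119, 0.9131),
    ("csa13", 2, 1.5417, 0.9198, 0.9186), ("csa10", 2, 1.5447, 0.9178, 0.9182),
    ("csa12", 2, 1.5368, 0.9098, 0.9095), ("csa14", 2, 1.5153, 0.9089, 0.9162),
    ("csa17", 2, 1.5016, 0.9217, 0.9220), ("csa8", 1, 1.5055, 0.7606, 0.7661),
    ("csa15", 2, 1.4914, 0.9151, 0.9187), ("csa18", 2, 1.4864, 0.9227, 0.9260),
    ("csa2", 1, 1.5044, 0.7676, 0.7608), ("csa9", 1, 1.4946, 0.7608, 0.7628),
    ("csa3", 1, 1.4874, 0.7819, 0.7800), ("csa5", 1, 1.4768, 0.7527, 0.7540),
    ("csa1", 1, 1.4417, 0.7602, 0.7629), ("csa4", 1, 1.4434, 0.7435, 0.7475),
    ("csa6", 1, 1.3330, 0.7419, 0.7444),
]

# Selection available at prediction time: the run with the highest QA score.
best_by_qa = max(TABLE4, key=lambda row: row[2])[0]

# Hindsight check with the native structure: the run with the highest TMgmin.
best_by_tm = max(TABLE4, key=lambda row: row[4])[0]


def mean_tmave(com_id):
    """Average TMave over all runs of one community."""
    values = [row[3] for row in TABLE4 if row[1] == com_id]
    return sum(values) / len(values)
```

For the values in Table 4 both criteria pick csa16, and the community-2 runs average roughly 0.16 higher in TMave than the community-1 runs, which shows how much the choice of template community matters for this target.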

Figure 2. The energy landscape of the final 100 CSA bank structures is shown in terms of energy vs. TM-score; although the lowest energy structure does not correspond to the highest TM-score structure, better quality 3D models are generated by optimizing the energy function


Table 5. Improvement of the side-chain accuracy is demonstrated; the csa16 row gives the side-chain accuracy of the lowest energy structure among the 100 3D models in the csa16 CSA calculation; the side-chain model row gives the side-chain accuracy after our side-chain remodeling procedure; the values are the percentages of correct χ1 torsions and of cases in which both χ1 and χ2 are correct (χ1+2); a torsion angle within 30 degrees of the correct value is counted as a correct torsion

model             χ1     χ1+2
csa16             65.59  45.95
side-chain model  70.43  51.35
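The accuracy measure of Table 5 can be sketched as follows. The 30 degree tolerance comes from the caption. The periodic angle difference, the choice of denominators (all residues with a χ1 for the first value, residues with both χ1 and χ2 for the second), and the function names are assumptions of this sketch. Symmetric side-chains, for which a 180 degree flip of χ2 is equivalent, are not treated specially.

```python
def angle_diff(a, b):
    """Smallest absolute difference between two angles in degrees (0 to 180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def chi_accuracy(model, native, tol=30.0):
    """Percent of correct chi1, and percent with both chi1 and chi2 correct.

    model, native: per-residue lists of (chi1, chi2); chi2 may be None
    for residues without a second side-chain torsion.
    """
    n1 = ok1 = n12 = ok12 = 0
    for (m1, m2), (t1, t2) in zip(model, native):
        good1 = angle_diff(m1, t1) <= tol
        n1 += 1
        ok1 += int(good1)
        if m2 is not None and t2 is not None:
            n12 += 1
            ok12 += int(good1 and angle_diff(m2, t2) <= tol)
    pct1 = 100.0 * ok1 / n1 if n1 else 0.0
    pct12 = 100.0 * ok12 / n12 if n12 else 0.0
    return pct1, pct12
```

The periodic difference matters in practice: a model value of 175 degrees and a native value of -175 degrees differ by 10 degrees, not 350, and should be counted as correct.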

Figure 3. The quality of all CASP11 server models (model1) is shown in terms of TM-score


(Figure 2), is much improved over the TM-scores of the templates used in modeling. The improvement is all the more notable because the TM-score of the `nns' model 1 is measured by sequence-specific structure comparison, while the TM-scores of the template structures in Table 6 are measured in a sequence-independent fashion. The model 1 structure of nns, which is the lowest-energy structure from the final 100 bank structures, is shown superimposed on the native structure of T0776 (Figure 4).
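For reference, the TM-score is commonly defined as the maximum over superpositions of $(1/L)\sum_i 1/(1 + (d_i/d_0)^2)$ with $d_0 = 1.24\,(L-15)^{1/3} - 1.8$, where $L$ is the target length and $d_i$ is the distance between the $i$-th pair of matched residues. The sketch below evaluates this sum for one fixed superposition only; the search over superpositions and the construction of the residue pairing are omitted, and the function names are hypothetical. In a sequence-specific comparison the pairs are residues with the same index in model and native, whereas a structure alignment program such as TM-align chooses the pairing itself.

```python
def tm_d0(length):
    """Length-dependent distance scale d0 of the TM-score (in Angstrom)."""
    if length <= 21:
        # Very short chains need special handling that is not covered here.
        raise ValueError("this sketch assumes a target length above 21 residues")
    return 1.24 * (length - 15) ** (1.0 / 3.0) - 1.8


def tm_score_fixed(distances, target_length):
    """TM-score sum for one fixed superposition.

    distances: distances (Angstrom) between matched residue pairs;
    unmatched target residues simply contribute nothing to the sum.
    """
    d0 = tm_d0(target_length)
    total = sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances)
    return total / target_length
```

Because the sum is normalized by the target length, a template that covers only part of the target is penalized even if the covered part superimposes perfectly.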

Table 6. A summary of the template quality in list16 is shown; the template quality is measured by the structure alignment method of TM-align, after the native structure of T0776 is released

Templates  TM-score  SeqID
3w7vA      0.8762    0.329
3rjtA      0.8842    0.290
3milA      0.7449    0.158
2q0qA      0.7058    0.155
1vjgA      0.7486    0.151
1yzfA      0.7516    0.185

Figure 4. The lowest-energy structure from the final 100 bank structures of csa16, the model 1 structure of nns, is shown superimposed on the native structure of T0776; the native structure is shown in grey, and the nns model 1 structure is shown in color


3. Conclusions

Our CASP11 server method, `nns', produced the best CASP11 server model 1 for T0776. This result is based on the utilization of an efficient global optimization method called CSA (conformational space annealing) at three modeling stages: multiple sequence alignment, 3D chain building, and side-chain remodeling. We demonstrated the modeling steps for T0776 by providing detailed intermediate-stage data. This analysis shows that good quality templates were identified and grouped together in the fold recognition step. By generating low-energy 3D models satisfying a collection of contradicting restraints, 3D models of even higher quality were generated. The quality assessment procedure successfully identified the best CSA run. The side-chain remodeling procedure was also quite successful in generating improved side-chains. All of these procedures, each consisting of multiple model generation followed by screening, were ideally combined to generate the best 3D server model for T0776. For further improvement of the nns method, much effort is needed to develop more reliable 3D model quality assessment methods, along with the development of the energy function for 3D model generation.

Acknowledgements

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (No. 2008-0061987). We thank the KIAS Center for Advanced Computation for providing computing resources. This work was supported by the National Institute of Supercomputing and Networking / Korea Institute of Science and Technology Information with supercomputing resources including technical support (KSC-2014-C3-01).

References

[1] Joo K, Lee J, Lee S, Seo J H, Lee S J and Lee J 2007 Prot. Struct. Funct. Bioinf. 69 (S8) 83
[2] Krieger E et al. 2009 Prot. Struct. Funct. Bioinf. 77 (S9) 114
[3] Joo K, Lee J, Sim S, Lee S Young, Lee K, Heo S, Lee I-H, Lee S Jong and Lee J 2014 Prot. Struct. Funct. Bioinf. 82 188
[4] Lee J, Scheraga H A and Rackovsky S 1997 J. Comput. Chem. 18 (9) 1222
[5] Lee J, Scheraga H A and Rackovsky S 1998 Biopolymers 46 (2) 103
[6] Lee J, Liwo A, Ripoll D R, Pillardy J and Scheraga H A 1999 Prot. Struct. Funct. Genet. S3 204
[7] Lee J, Liwo A, Ripoll D R, Pillardy J, Gibson K D, Saunders J A and Scheraga H A 2000 J. Comput. Chem. 77 (1) 90
[8] Lee J and Scheraga H A 1999 International Journal of Quantum Chemistry 75 (3) 255
[9] Lee J, Lee I-H and Lee J 2003 Physical Review Letters 91 (8) 80201
[10] Joo K, Lee J, Kim I, Lee S Jong and Lee J 2008 Biophysical Journal 95 (10) 4813
[11] Joo K, Lee J, Seo J-H, Lee K, Kim B-Gee and Lee J 2009 Prot. Struct. Funct. Bioinf. 75 (4) 1010
[12] Soding J 2005 Bioinformatics 21 (7) 951
[13] Joo K et al. 2016 Prot. Struct. Funct. Bioinf. 84 (S1) 221
[14] Sali A and Blundell T L 1993 Journal of Molecular Biology 234 (3) 779
[15] Lee J, Gross S P and Lee J 2012 Physical Review E 85 (5) 56702
[16] Lee J, Gross S P and Lee J 2013 Sci. Rep. 3 2197
[17] Lee J and Lee J 2013 PloS One 8 (4) e60372
[18] Zhang Y and Skolnick J 2005 Nucleic Acids Research 33 (7) 2302
[19] MacKerell A D et al. 1998 J. Phys. Chem. B 102 (18) 3586 doi: 10.1021/jp973084f
[20] Joo K, Joung ISuk, Lee J, Lee J, Lee W, Brooks B, Lee S Jong and Lee J 2015 Prot. Struct. Funct. Bioinf. 83 2251
[21] Lee J, Lee J, Sasaki T N, Sasai M, Seok C and Lee J 2011 Prot. Struct. Funct. Bioinf. 79 (8) 2403
[22] Zhou H and Zhou Y 2002 Protein Science 11 (11) 2714
[23] Kortemme T, Morozov A V and Baker D 2003 Journal of Molecular Biology 326 (4) 1239
[24] Zhou H and Skolnick J 2011 Biophysical Journal 101 (8) 2043
[25] Lee J, Lee K, Joung I, Joo K, Brooks B R and Lee J 2015 BMC Bioinformatics 16 (1) 94
