
3D modeling

Eran Eyal, May 2011

The motivation for 3D modeling of proteins

• The structure may hint at function, especially if it is similar to other well-annotated proteins

• Using the structure we can speculate on important functional regions

• Lab experiments can be more carefully designed

• Using structural models we can apply docking methods to understand interactions with both other proteins and small molecules

Basic principles for modeling

• Use every piece of information from the current databases

• Choose the modeling method according to the current knowledge on your problem. Algorithms which are based on existing structures are generally more accurate.

• Building order: secondary structures, then loops, then side chains

• Evaluation of modeling procedures is done by modeling proteins with known structure and by examining features of the models

• Quantitative criteria for modeling accuracy exist and include indices such as the RMSD
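As a rough illustration of the RMSD index mentioned above, the sketch below computes it for two coordinate lists. It assumes the two structures are already superimposed and list the same atoms in the same order; the function name and the toy coordinates are hypothetical.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two lists of (x, y, z) points.

    Assumption: the structures are already superimposed and atoms are
    paired by position in the lists; no optimal fitting is done here.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equally long")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

# Toy coordinates (hypothetical), in Angstrom
model = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
native = [(0.0, 0.0, 0.0), (1.5, 1.0, 0.0), (3.0, 0.0, 2.0)]
print(round(rmsd(model, native), 3))
```

In practice the two structures are first superimposed optimally, so the value reported by modeling tools is usually the minimum RMSD over rigid-body fits.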

A. Homology modelling

[Figure: the query sequence is aligned to proteins of a known structure, and the alignment is used to build the structural model.]

B. Fold recognition

[Figure: the query sequence (SLVAYGAAM) is combined with a library of known protein folds to give the structural model.]

C. Ab initio

[Figure: the structural model is built from the query sequence (SLVAYGAAM) alone.]

• Current projects for solving protein structures are oriented toward solving structures of proteins without detected homology to any existing structure.

• Within a few years we expect to have almost all basic folds of proteins, such that for almost every protein there will be a homolog with a solved structure.

Modeling using homology

• There are millions of protein sequences but only several thousand folds.

• Statistically, for every second protein in the sequence databases we can detect a homolog (identity > 30%) in the structural databases. Such a homolog possesses an identical fold and a very similar structure
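The 30% identity threshold can be checked directly on a pairwise alignment. The sketch below is one possible way to do it, assuming identity is counted over columns where neither sequence has a gap (other definitions of the denominator exist); the function name and the template sequence are hypothetical.

```python
def percent_identity(aligned_query, aligned_template, gap="-"):
    """Percent identity over alignment columns without gaps.

    Assumption: both strings come from the same pairwise alignment and
    therefore have the same length.
    """
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned sequences must have the same length")
    aligned = 0
    identical = 0
    for q, t in zip(aligned_query, aligned_template):
        if q == gap or t == gap:
            continue  # skip columns containing a gap
        aligned += 1
        if q == t:
            identical += 1
    return 100.0 * identical / aligned if aligned else 0.0

# Query fragment from the slides against a made-up template fragment
identity = percent_identity("SLVAYGAAM", "SLIA-GSAM")
print(identity, identity > 30.0)
```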

Workflow

• Search for sequence similarity between the query sequence and a database of protein structures (PDB) using sequence alignment algorithms such as BLAST and PSI-BLAST

• In case of failure we can still try to use an MSA or another representation of conserved motifs to increase sensitivity

• Construct an accurate alignment between the query sequence and the similar sequences. Use additional evolutionary data to increase alignment quality

• A correct alignment in this step is crucial. Algorithms for homology modeling depend absolutely on the alignment

• Algorithms for multiple alignment include ClustalW, T-Coffee and MUSCLE. Manual intervention is possible and often desirable when we have additional information not considered by the alignment algorithm

• Many sequences in the alignment increase the chance of a correct alignment and an accurate final model.

• For this reason it is sometimes recommended to also screen sequence databases such as Swiss-Prot or PIR, and not only structural databases

[Flowchart: identify proteins with solved structures similar to the query, construct the alignment, construct the model, check the model, end.]

Model construction

• Define conserved blocks from the alignment. Usually these correspond to well-defined secondary structures

• Define a reference frame for the construction from the mean coordinates of all templates

• For each conserved region determine the backbone structure from the average coordinates of the templates in that region. If a single protein is significantly more similar in the region, use the coordinates of that protein.

The main strategies to construct the backbone in homology modeling are:

Fragment based homology:

• Construct conserved regions in the alignment first (well defined secondary elements).

• Construct the loops which connect the conserved regions

Construction by distance constraints:

• Construct simultaneously the entire structure using distance constraints derived from the homologous structures.

Connect the secondary structures with loops:

• Using a database of loops from known structures. These databases are accessed using the sequence of the loop, its length and the structure and orientation of the loop.

• Using an ab initio approach, without any prior knowledge, by searching the space of loop conformations with a scoring function.

http://sbi.imim.es/cgi-bin/archdb/loops.pl

Web sites for model building

MODELLER – http://salilab.org/modeller/

WHAT IF – http://swift.cmbi.ru.nl/whatif/ and http://swift.cmbi.ru.nl/servers/html/index.html

SWISS-MODEL – http://swissmodel.expasy.org/

Wloop – http://psb00.snv.jussieu.fr/wloop/

Modeller

Modeller is based on distance constraints derived from the distances between the amino acids in the template structures

http://modbase.compbio.ucsf.edu/modbase-cgi/index.cgi

Fold recognition (Threading)

• In the absence of detected homology to other proteins with solved structure we can still use another strategy which makes use of available structural data

• Using this approach the query sequence is threaded into each fold in a library of all available folds

• Using accurate scoring functions (usually knowledge-based) we get a compatibility score for the sequence to fit each given fold

• A statistically significant score suggests that the query protein adopts a structure similar to that fold

• This approach is less accurate than homology modeling but is more widely applicable

• When the true fold is not represented in the library we will, by definition, not be able to construct an accurate model by this approach

• Future availability of all folds will extend the utility of this approach

• The method depends heavily on the scoring function. The main difference between fold recognition tools is the scoring function

[Figure: the input is a sequence and a library of known protein folds. Positions in each fold are annotated by environment (hydrogen bond donor, hydrogen bond acceptor, glycine, hydrophobic). Each fold receives a raw score S and a Z score, for example S=20, Z=5; S=5, Z=1.5; S=-2, Z=-1.]

Scoring functions for 1D-3D compatibility

• Two main strategies exist to evaluate whether a given fold is compatible with a given sequence: structural profile and contact potentials

• Methods which are based on structural profiles use profiles similar to those used in sequence searches. The profile is built for each fold based on the positions and structural characteristics of the amino acids which compose it

• The profile is a well defined mathematical structure well suited for algorithms such as dynamic programming. It can be therefore determined which profile gives the best score versus a given sequence.

• A profile is a table which holds a row for each position and a column for each type of amino acid, including two columns for gap open and extension.

• The score for each amino acid reflects the likelihood to find this amino acid in this type of structural environment

• “Structural environment” could be for example combination of secondary structure (usually 3 states), solvent accessibility (2-3 states) and hydrophobic nature of the region

• Scores obtained by algorithms which use structural profiles are often transformed to a normalized Z score, which allows estimation of significance.
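The Z score transformation can be sketched as follows, assuming the raw score of each fold is normalized against the mean and standard deviation of the scores over the whole library. The fold names and the last two raw scores are hypothetical, so the resulting Z values differ from those shown on the slide.

```python
import statistics

def z_scores(raw_scores):
    """Convert raw fold scores to Z scores: (score - mean) / standard deviation."""
    values = list(raw_scores.values())
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    if sd == 0:
        raise ValueError("all scores are identical; Z score is undefined")
    return {fold: (score - mean) / sd for fold, score in raw_scores.items()}

# Hypothetical library of five folds; higher raw score = better fit here
raw = {"fold_a": 20.0, "fold_b": 5.0, "fold_c": -2.0, "fold_d": 1.0, "fold_e": 1.0}
z = z_scores(raw)
best = max(z, key=z.get)
print(best, round(z[best], 2))
```

A fold whose Z score stands far above the rest of the library is a candidate for a significant match; the cutoff used in practice depends on the tool.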

[Table: example structural profile. Rows are sequence positions 1…N; columns are the amino acids (A, C, D, …, Y) followed by the gap open (Gop) and gap extension (Gext) penalties.]

Contact potential (knowledge based potential)

• This method is based on tables which give a pseudo-energetic score for the interaction between any pair of amino acids, possibly with more detailed information about the conformations and orientations of the amino acids

• This method makes use of distance matrices as a tool for representation of the folds. For every matrix the sequence is put in both axes of the distance matrix and the interactions are evaluated using the contact potentials

[Figure: contact matrix of a fold. Both axes are the amino acid index (1…N) and dots mark pairs of positions which are in contact.]

For each pair of spatially adjacent amino acids the contact potential is added to the overall score. The total score of all interactions is the compatibility score of the fold for the sequence.
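A minimal sketch of this summation is given below. The two-class (hydrophobic/polar) potential, its values, the residue classification and the contact lists are all hypothetical and only stand in for a real 20×20 knowledge-based table; lower scores are treated as better, as for an energy.

```python
# Hypothetical reduced alphabet: H = hydrophobic, P = polar (toy classification)
HYDROPHOBIC = set("AVLIMFWYC")
POTENTIAL = {("H", "H"): -1.0, ("H", "P"): 0.0, ("P", "P"): -0.2}  # toy values

def residue_class(aa):
    return "H" if aa in HYDROPHOBIC else "P"

def fold_compatibility(sequence, contacts):
    """Sum the contact potential over all position pairs in contact in the fold."""
    score = 0.0
    for i, j in contacts:
        a, b = residue_class(sequence[i]), residue_class(sequence[j])
        # The table is symmetric, so look the pair up in either order
        key = (a, b) if (a, b) in POTENTIAL else (b, a)
        score += POTENTIAL[key]
    return score

# Contact lists (0-based position pairs) of two made-up folds
folds = {
    "fold_1": [(1, 4), (2, 7), (0, 5), (3, 8)],
    "fold_2": [(0, 3), (0, 5), (1, 5), (5, 8)],
}
sequence = "SLVAYGAAM"
scores = {name: fold_compatibility(sequence, c) for name, c in folds.items()}
best_fold = min(scores, key=scores.get)
print(scores, best_fold)
```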

Fold recognition sites

Profiles:

Contact potentials

123D http://www-Immb.ncifcrf.gov/~nicka/123D.html

3D-PSSM http://www.sbg.bio.ic.ac.uk/~3dpssm/index2.html

PHYRE http://www.sbg.bio.ic.ac.uk/~phyre/

Structural modeling using Ab initio methods

• A field of great theoretical interest and little practical use is the prediction of protein structures without any prior knowledge, without using sequence similarity or compatibility to known folds

• Scoring functions similar to those used by other types of modeling approaches are applied

• Theoretically, if we could check all possible folds with an ideal scoring function, we would be able to detect the global minimum

Ab initio algorithms include:

1. A searching procedure which scans the conformation space and generates models.

2. A scoring function which takes each model, evaluates it and eventually ranks it.

Due to the complexity of the search, heuristic algorithms are applied which include random choice components

The parameters which change in the search are backbone dihedral angles which define the folding


Scoring functions applied for ab initio predictions

• Force fields

• Knowledge based potentials/contact potentials

• Terms based on surface areas and common volumes which represent structural location and interactions

• The protein can be represented in the amino acid level or the atom level

ROSETTA (David Baker lab)

• Small fragments (9 residues) consistent with the local sequence are connected using a Monte Carlo procedure. Such an approach puts less demand on the energy function, since local interactions are accounted for in the fragments.

• The energy function favors compact structures with paired β strands and buried hydrophobic residues.

J Mol Biol. 2001 Mar 9;306(5):1191-9

Side chain construction

• The final step in the model construction is side chain placement

• Side chain conformation can be taken from the homologous structures, however it is also a practice to model side chains ab-initio

• Despite the complexity of the problem, side chain algorithms are quite successful.

[Figure: example side chains (Phe, Asn).]

Conformation - a given set of dihedral angles which defines a structure.

Rotamer - an energetically favourable conformation.

Model evaluation

• Following the model building we can try to assess its validity using a variety of tools.

• If problems are detected in the model we can repeat some of the building steps and reconstruct the model, or improve certain parts of it.

• We can try to assess the validity of the model using specific information for the case in hand or using general information derived from the databases and from our understanding of protein structural properties in general.

• The evaluation procedure may detect false models due to wrong templates or misalignment

• The evaluation programs often detect regions which are likely to be wrong

• Loops and side chains are frequently built according to geometric considerations. Evaluation which is based on chemical compatibility of the amino acids involved can be important in such cases

• A critical and objective evaluation of a modeling method is obtained by constructing models for proteins with available structures.

• However, there is always the risk that information from the known structure will serve in some direct or indirect way in the modeling process

• A more reliable and objective test is to predict the structure before it is deposited in the public databases

• CASP is a competition in structural modeling. The competition takes place every two years and includes different sub categories for modeling tasks.

http://predictioncenter.org/

CASP: Critical Assessment of protein Structure Prediction

Modeling side chain conformations

• Modeling side chain conformations is a necessary final step in the different approaches aimed at constructing a full 3D model of proteins: Homology modelling, fold recognition and ab-initio methods.

• The amino acid side chains determine the global fold of the protein. Identity and structure of individual side chains determine the protein stability and its interactions with other molecules.

Why model side chain conformations?

It is important to model side chains in order to:

• Predict thermal-stability of mutants.

• Understand and predict ligand binding.

• Design proteins (e.g., sequences compatible with a given fold).

• Complete structural information not resolved by experimental procedures.

Definition of the problem

• Given backbone coordinates and the amino acid sequence, model the structure of all the protein atoms.

>1STP: STREPTAVIDIN – BIOTIN
DPSKDSKAQVSAAEAGITGTWYNQLGSTFIVTAGADGALTGTYESAVGNAESRYVLTGRYDSAPATDGSGTALGWTVAWKNNYRNAHSATTWSGQYVGGAEARINTQWLLTSGTTEANAWKSTLVGHDTFTKVKPSAASIDAAKKAGVNNGNPLDAVQQ

Why is side chain modeling a difficult problem ?

• Consider a short protein with 50 amino acids. Each side chain has on average 2 dihedral angles (χ angles). Assuming that we will sample every 20º in the dihedral angle space,

$N = (360/20)^{50 \times 2} = 18^{100} \approx 10^{125}$

• This number is too large to be sampled

• Algorithms that find good solutions by screening only parts of the search space are needed

Components of the problem

• The side chain modeling problem can be roughly divided into a searching procedure and a scoring function.

• The searching procedure should sample the search space (in our case usually the torsion angle space) and create conformations.

• The scoring function evaluates each conformation created by the searching procedure. The evaluation scores are used to rank the conformations and pick the best one to be the final model.

In practice, there are interrelations between the two, for example:

• Given a very time consuming scoring function we should use searching procedures which sample a smaller fraction of the search space.

• Searching procedures usually use the scores obtained by the scoring function to direct the search.

• A rough energy (scoring) landscape will require denser sampling.

Rotamer libraries

• Already in the 70s, Janin et al. showed that different side chain conformations are not found in equal distribution over the dihedral angle space but tend to cluster at specific regions of the space, much as in the Ramachandran plot.

• In the 80’s, this observation was used to improve modeling of side chain conformations.

• Today essentially all programs that model side chain conformations use rotamer libraries.

[Figure: Newman projections of the 3 staggered conformations of serine, viewed along the Cα–Cβ bond.]

[Figure: χ1 distributions of valine and serine.]

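Complete Rotamer Library (Ponder & Richards, 1987)

Backbone independent library (Dunbrack & Cohen, 1997), http://dunbrack.fccc.edu

[Table: columns include counts (No.), the χ1 rotamer, the probability p with its σ, the conditional probability p|χ1 with its σ, and the mean χ angles with their σ.]

Backbone dependent library (Dunbrack & Cohen, 1997)

[Table: columns are φ, ψ, p, χ1, χ2, χ3, χ4.]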

What do rotamer libraries provide?

• Rotamer libraries reduce significantly the number of conformations that need to be evaluated during the search.

• This is done with almost no risk of missing the real conformations.

• Even small libraries of about 100-150 rotamers cover about 96-97% of the conformations actually found in protein structures.

Accuracy vs library size

• Consider again our protein of 50 amino acids. Each side chain has on average 9 rotamers. Assuming that we search now in the space of rotamers:

$N = 9^{50} \approx 10^{47}$

The searching space is restricted and oriented but the number of combinations is still too large for a naive search
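The two size estimates (20º grid sampling of the χ angles versus 9 rotamers per side chain, both for a 50-residue protein) can be reproduced with a few lines. This is only a back-of-the-envelope check; the helper name is hypothetical.

```python
import math

def log10_combinations(states_per_residue, n_residues):
    """log10 of the number of combinations, states_per_residue ** n_residues."""
    return n_residues * math.log10(states_per_residue)

# Grid sampling: 360/20 = 18 values per angle, 2 chi angles per side chain
grid = log10_combinations((360 // 20) ** 2, 50)
# Rotamer sampling: 9 rotamers per side chain on average
rotamers = log10_combinations(9, 50)

print(f"grid sampling:    about 10^{grid:.1f} combinations")
print(f"rotamer sampling: about 10^{rotamers:.1f} combinations")
```

The rotamer library removes close to 80 orders of magnitude, yet the remaining space is still far beyond exhaustive enumeration.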

Now, Rotamer libraries also provide information about the intra-residue energy

• The probabilities of each rotamer in the library can be applied to estimate the potential energy due to interactions within the side chain and with the local backbone atoms, using the Boltzmann distribution.

$E_i \propto -\ln(P_i)$

• This term takes the role of a torsion term and intra residue van der Waals forces in the scoring function.
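As a sketch of how library probabilities could be turned into such an energy term, the code below applies the Boltzmann relation, with the energy taken as minus kT times the logarithm of the probability. The rotamer names, the probabilities and the choice of kT = 1 (arbitrary units) are assumptions for illustration, not values from a real library.

```python
import math

def rotamer_energies(probabilities, kT=1.0):
    """Pseudo-energy of each rotamer, E = -kT * ln(P), relative to the most probable one."""
    energies = {name: -kT * math.log(p) for name, p in probabilities.items() if p > 0.0}
    e_min = min(energies.values())
    # Shift so that the most probable rotamer has energy 0
    return {name: e - e_min for name, e in energies.items()}

# Hypothetical chi1 rotamer probabilities for one residue type
probs = {"g+": 0.5, "t": 0.25, "g-": 0.25}
energies = rotamer_energies(probs)
for name, e in energies.items():
    print(name, round(e, 3))
```

Rare rotamers receive a higher (worse) energy, so the term penalizes strained side chain conformations without an explicit torsion calculation.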

Rotamer libraries – where to find them ?

Dunbrack rotamer libraries: http://dunbrack.fccc.edu/bbdep/bbdepdownload.php

Backbone dependent and independent libraries

Lovell rotamer library: http://kinemage.biochem.duke.edu/databases/rotamer.html

The best backbone independent library

Searching procedures used in side chain modeling programs

• Grid search

• DEE (Dead End Elimination)

• Self consistent algorithms

• Monte Carlo algorithms

Grid search

• This naive approach systematically scans the search space at a specified resolution.

• Theoretically it should cover the correct solutions.

• In practice it can be used only for small scale problems.

• Does not benefit from calculations already performed (no memory)

DEE (Dead End Elimination)

• Sophisticated algorithmic approach for side chain modeling. Aims to safely reduce the search space without losing the GMEC (Global Minimum Energy Conformation).

• The algorithm eliminates rotamers which cannot be part of the GMEC.

• In successive iterations more and more rotamers can be eliminated.

• The algorithm stops when no more rotamers can be eliminated.

• Usually at this point, only one rotamer is left for each of several side chains (i.e., these are part of the GMEC). For several others, only a (hopefully) few are left.

• An additional algorithm is then applied to obtain the final model.

The basic condition of the DEE algorithm

• A rotamer can be safely eliminated when the minimum energy it can obtain by interaction with other rotamers is still higher (worse) than the maximum possible energy that another rotamer of the same residue can have

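$$E(i_r) + \sum_{j \neq i} \min_{s} E(i_r, j_s) > E(i_t) + \sum_{j \neq i} \max_{s} E(i_t, j_s)$$

Here $i_r$ is rotamer $r$ of residue $i$, $i_t$ is another rotamer of the same residue, $E(i_r)$ is the self energy of the rotamer and $E(i_r, j_s)$ is its pair energy with rotamer $s$ of residue $j$. When the inequality holds, $i_r$ is eliminated.

[Figure: score as a function of rotamer space for the two rotamers.]

Desmet et al., 1992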

The Goldstein criterion is an improvement

• A rotamer can be safely eliminated when some other rotamer exists with lower (better) energy for any given environment.

• This criterion is much less restrictive and therefore more powerful. It requires, though, more computational time.

$$E(i_r) - E(i_t) + \sum_{j \neq i} \min_{s} \left[ E(i_r, j_s) - E(i_t, j_s) \right] > 0$$

When the inequality holds, rotamer $r$ of residue $i$ is eliminated in favor of rotamer $t$.

[Figure: score as a function of rotamer space for the two rotamers.]
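The Goldstein criterion can be applied iteratively as in the sketch below. The data layout (a list of self energies per residue and a dictionary of pair energy tables keyed by residue pair) and the toy energies are hypothetical; the code only illustrates the elimination loop and is not taken from any specific program.

```python
def pair_energy(pair, i, r, j, s):
    """E(i_r, j_s); each residue pair is stored once, under the key (low, high)."""
    return pair[(i, j)][r][s] if i < j else pair[(j, i)][s][r]

def goldstein_dee(self_e, pair):
    """Repeatedly eliminate rotamers which satisfy the Goldstein criterion.

    self_e[i][r] is E(i_r). Returns the set of surviving rotamers per residue.
    """
    alive = [set(range(len(e))) for e in self_e]
    changed = True
    while changed:  # stop when a full pass eliminates nothing
        changed = False
        for i in range(len(self_e)):
            for r in list(alive[i]):
                for t in alive[i] - {r}:
                    gap = self_e[i][r] - self_e[i][t]
                    for j in range(len(self_e)):
                        if j != i:
                            gap += min(
                                pair_energy(pair, i, r, j, s) - pair_energy(pair, i, t, j, s)
                                for s in alive[j]
                            )
                    if gap > 0:  # t is better than r in every environment
                        alive[i].discard(r)
                        changed = True
                        break
    return alive

# Toy system: residue 0 has 3 rotamers, residue 1 has 2 (arbitrary energy units)
self_e = [[0.0, 1.0, 5.0], [0.0, 0.5]]
pair = {(0, 1): [[0.0, 2.0], [1.0, -1.0], [3.0, 3.0]]}
survivors = goldstein_dee(self_e, pair)
print(survivors)
```

In this toy case only the third rotamer of residue 0 is eliminated, so more than one rotamer survives for each residue and a further search over the survivors is still required to find the GMEC.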

Even more efficient criteria can be obtained, at the price of more computations

[Figure: score as a function of rotamer space for rotamers r, t and t'.]

Lasters et al., 1995

Disadvantages of DEE

• Not suitable for all scoring functions. Assumes that the scoring function can be expressed as a sum of pairwise interactions

• Relatively slow procedure

• Frequently an additional step is needed to converge to the GMEC

Self consistent algorithms

• In this type of algorithm, the side chains are built in sequential order, one after the other.

• The algorithm explores only a small subset of the space and usually does not find the GMEC.

• The running time is very dependent on the specific implementation. The complexity is basically linear with the number of residues.

• The algorithm iterates several times until convergence of the total score with the side chain conformations is reached, or until some preset condition is reached.

Monte Carlo type algorithms

• A side chain to be modeled, and a rotamer for it, are selected at random

• The selected rotamer is modeled and a new conformation is obtained with score $E_{new}$. The new conformation is accepted according to the Metropolis criterion:

• $E_{new} < E_{old}$: accept

• $E_{new} \geq E_{old}$: accept with probability $e^{-(E_{new}-E_{old})/T}$

where T is a temperature parameter

General types of scoring functions

• Force fields.

• Functions based on geometrical properties.

• Knowledge-based potentials.

Force fields

• Force fields are frequently used for side chain modeling

• The components usually included are the VDW interactions and the torsion angle terms. Electrostatic interactions and hydrogen bond terms usually do not contribute much to the accuracy.

• Standard force fields usually do not give impressive results for side chain modeling.

• However, they are relatively fast and compatible with any search procedure

Geometric based scoring

• Scoring functions based on geometric features such as surfaces and volumes are becoming popular.

• Contact surfaces between atoms can replace the attractive VDW terms in standard force fields and give some indication about packing.

• Solvent accessible surface correlates with solvation effects.

• Overlapping volumes between atoms (represented as spheres) generate a softer potential for atom-atom clashing.
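The overlapping volume of two atoms represented as spheres has a closed form (the volume of the lens shared by two intersecting spheres). The sketch below evaluates it; the function name and the radii in the example are illustrative assumptions, and how the volume is weighted inside a scoring function is not specified here.

```python
import math

def sphere_overlap_volume(r1, r2, d):
    """Volume shared by two spheres of radii r1 and r2 whose centers are d apart."""
    if d >= r1 + r2:
        return 0.0  # spheres do not touch
    if d <= abs(r1 - r2):
        # the smaller sphere lies entirely inside the larger one
        return 4.0 / 3.0 * math.pi * min(r1, r2) ** 3
    return (
        math.pi
        * (r1 + r2 - d) ** 2
        * (d * d + 2.0 * d * (r1 + r2) - 3.0 * (r1 - r2) ** 2)
        / (12.0 * d)
    )

# Two atoms with an illustrative radius of 1.7 A at decreasing distances
for distance in (3.4, 3.0, 2.5):
    print(distance, round(sphere_overlap_volume(1.7, 1.7, distance), 3))
```

The volume grows smoothly from zero as the atoms approach, which is why such a term gives a softer clash penalty than the steep repulsive part of a standard VDW potential.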

Evaluation of side chain modeling programs

• The performance of side chain modeling programs is evaluated by modeling existing structures.

• The parameters usually checked are the percentage of correct χ1 angles, correct χ1 and χ2, and the RMSD between the side chain heavy atoms.

• Usually the side chain population is divided into buried side chains and exposed side chains and every group is evaluated separately.
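The χ angle criteria can be computed as in the sketch below. The 40º tolerance for calling an angle correct is an assumption (a commonly used value, not stated in the slides), and the function names and angle lists are hypothetical. Residues without χ angles are skipped.

```python
def angle_diff(a, b):
    """Smallest absolute difference between two angles, in degrees (0 to 180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def chi_accuracy(model, native, tolerance=40.0):
    """Return (fraction of correct chi1, fraction of correct chi1 and chi2).

    model and native hold one tuple of chi angles per residue, in the same order.
    """
    chi1_total = chi1_ok = chi12_total = chi12_ok = 0
    for m, n in zip(model, native):
        if len(n) >= 1:
            chi1_total += 1
            ok1 = angle_diff(m[0], n[0]) <= tolerance
            chi1_ok += ok1
            if len(n) >= 2:
                chi12_total += 1
                chi12_ok += ok1 and angle_diff(m[1], n[1]) <= tolerance
    frac1 = chi1_ok / chi1_total if chi1_total else 0.0
    frac12 = chi12_ok / chi12_total if chi12_total else 0.0
    return frac1, frac12

# Hypothetical chi angles; the third residue (e.g. Gly or Ala) has none
native_chis = [(-60.0, 180.0), (60.0,), (), (175.0, 65.0)]
model_chis = [(-70.0, 170.0), (180.0,), (), (-170.0, -60.0)]
chi1_fraction, chi12_fraction = chi_accuracy(model_chis, native_chis)
print(round(chi1_fraction, 3), chi12_fraction)
```

Running the same function separately on buried and on exposed residues gives the split evaluation described above.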

The practical problem of side chain modeling

• The way we deal today with the problem of protein structure prediction is very different from the way nature deals with it.

• Due to technical issues such as computation time we are usually forced to accept a fixed backbone and only then put the side chains on it.

• The quality of the side chain modeling is therefore heavily dependent on the position of the backbone. If the initial backbone conformation is wrong, the side chain modeling quality will be accordingly bad.

• What is really needed is a “combined” algorithm that optimizes backbone conformation simultaneously with side chain modeling.

Backbone vs side chain accuracy

Tuffery et al., 1997

• Side chain modeling techniques are accurate only when the backbone structure is very accurate (as with mutations, simple homology models, or filling in missing side chain atoms of crystal structures).

The SCcomp program for side chain modeling

• Our program is based on the Complementarity Function. This function uses contact surface areas between atoms in combination with chemical properties of the atoms.

• The complementarity function was previously used for ligand docking and analysis of atomic contacts in protein structures.

• We suspected that with several additions and modifications this function should be appropriate also for side chain modeling


• We added several necessary terms for side-chain modeling to the complementarity function.

• A term for internal potential energy of the side chains was added, based on rotamer library probabilities.

• A term to account for steric clashes between atoms was added based on overlapping volumes of the spheres representing the atoms.

• A term to account for solvation effects was based on the solvent accessible surface of the proteins.

SCcomp scoring function

$$E_{res} = E_{comp} + K_{vol} \cdot E_{vol} + K_{prob} \cdot E_{prob} + K_{sol} \cdot E_{sol}$$

Evaluating scoring function performance

• The coefficients (K’s) of the scoring function were optimized using a genetic algorithm (software developed by Rafi Najmanovich). The function optimized was the RMSD between model and the experimental structure.

[Figure: experimentally solved structures and the general form of the scoring function are the input to the GA, which outputs the explicit form of the scoring function with values for the coefficients.]

Scoring function performance

• According to the optimized parameters, side chain atoms always “prefer” to have contact surface area with other atoms than to have solvent accessible surface.

• Surprisingly, this observation included also polar atoms, which create hydrogen bonds with solvent molecules.

• The physical meaning of this phenomenon is not entirely clear, but its direct consequence is maximization of protein packing, in agreement with the known dense packing of globular proteins.

Iterative searching procedure

1. Start from the backbone coordinates.

2. Model all side chains one after the other while the environment is fixed.

3. Repeat until all conformations converge, or until a maximum defined number of iterations.

4. Output the complete model.
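The iterative procedure can be sketched as below. This is a generic illustration under stated assumptions, not the SCcomp implementation: the energy tables are toy values, the scoring of a rotamer is its self energy plus pair energies with the current rotamers of the other residues, and lower scores are better.

```python
# Toy energies (arbitrary units): residue 0 has 3 rotamers, residue 1 has 2
SELF_E = [[0.0, 1.0, 5.0], [0.0, 0.5]]
PAIR_E = {(0, 1): [[0.0, 2.0], [1.0, -1.0], [3.0, 3.0]]}

def residue_score(i, r, assignment):
    """Score of rotamer r at residue i while all other side chains are fixed."""
    total = SELF_E[i][r]
    for j, s in enumerate(assignment):
        if j < i:
            total += PAIR_E[(j, i)][s][r]
        elif j > i:
            total += PAIR_E[(i, j)][r][s]
    return total

def iterative_search(start, n_rotamers, score, max_iterations=20):
    """Re-model each side chain in turn until no rotamer changes."""
    assignment = list(start)
    for _ in range(max_iterations):
        changed = False
        for i, n in enumerate(n_rotamers):
            best = min(range(n), key=lambda r: score(i, r, assignment))
            if best != assignment[i]:
                assignment[i] = best
                changed = True
        if not changed:  # converged
            break
    return assignment

from_good_start = iterative_search([0, 0], [3, 2], residue_score)
from_poor_start = iterative_search([2, 1], [3, 2], residue_score)
print(from_good_start, from_poor_start)
```

With these toy energies the two starting points converge to different answers: [0, 0] has total energy 0, while [1, 1] has total energy 0.5 and is only a local minimum. The result of such a search can therefore depend on the initial conformation.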

Stochastic searching procedure

1. Start from the backbone coordinates with random side chain conformations.

2. Randomly select a side chain and assign a rotamer r with probability Pr according to the Boltzmann distribution.

3. Decrease the temperature.

4. Repeat until some low temperature value is reached.

5. Output the complete model.
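A minimal sketch of such a stochastic search with a decreasing temperature follows. It is not the SCcomp code: the toy energy tables, the geometric cooling schedule, the number of steps per temperature and the random seed are all assumptions chosen for illustration.

```python
import math
import random

# Toy energies (arbitrary units): residue 0 has 3 rotamers, residue 1 has 2
SELF_E = [[0.0, 1.0, 5.0], [0.0, 0.5]]
PAIR_E = {(0, 1): [[0.0, 2.0], [1.0, -1.0], [3.0, 3.0]]}

def residue_score(i, r, assignment):
    """Score of rotamer r at residue i given the current rotamers elsewhere."""
    total = SELF_E[i][r]
    for j, s in enumerate(assignment):
        if j < i:
            total += PAIR_E[(j, i)][s][r]
        elif j > i:
            total += PAIR_E[(i, j)][r][s]
    return total

def boltzmann_pick(energies, temperature, rng):
    """Pick an index with probability proportional to exp(-E / T)."""
    e_min = min(energies)  # shift for numerical stability
    weights = [math.exp(-(e - e_min) / temperature) for e in energies]
    return rng.choices(range(len(energies)), weights=weights, k=1)[0]

def stochastic_search(n_rotamers, score, t_start=5.0, t_end=0.05,
                      cooling=0.9, steps_per_t=50, seed=0):
    rng = random.Random(seed)
    # Start from random side chain conformations
    assignment = [rng.randrange(n) for n in n_rotamers]
    t = t_start
    while t > t_end:
        for _ in range(steps_per_t):
            i = rng.randrange(len(n_rotamers))  # random side chain
            energies = [score(i, r, assignment) for r in range(n_rotamers[i])]
            assignment[i] = boltzmann_pick(energies, t, rng)
        t *= cooling  # decrease temperature
    return assignment

result = stochastic_search([3, 2], residue_score)
print(result)
```

At high temperature almost any rotamer can be chosen, which lets the search leave local minima; as the temperature drops the choice concentrates on the lowest-scoring rotamer and the conformation freezes.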

[Figure: stochastic search compared with iterative search.]

Eyal et al., 2004

Performance of modeling programs

http://www.weizmann.ac.il/sgedg/sccomp.html

Eyal et al., 2004

http://www1.jcsg.org/scripts/prod/scwrl/serve.cgi
