protein structure modelling
DESCRIPTION
Overview of today's lecture: levels of protein structure; protein structure prediction; secondary structure prediction; Chou-Fasman method; GOR method; NN-based methods; tertiary structure prediction; ab initio methods; challenges; limitations; overview of the Rosetta method; overview of CASP and CAMEO.
TRANSCRIPT
Protein Structure Modelling
Structural Bioinformatics II: Protein Structure Modelling. R.S.K. Vijayan
Levels of Protein Structure
There are four levels of protein structure:
- Primary structure (1°)
- Secondary structure (2°), with supersecondary structure, folds and domains as intermediate organisation
- Tertiary structure (3°)
- Quaternary structure (4°)
The primary structure of a protein is the amino acid sequence of the polypeptide chain.
Secondary Structure in Proteins
Secondary structure is the general three-dimensional form of local segments of proteins. The Dictionary of Protein Secondary Structure (DSSP) is commonly used to describe protein secondary structure with single-letter codes. There are eight types of secondary structure:
- G = 3-turn helix (3₁₀ helix). Min length 3 residues.
- H = 4-turn helix (α-helix). Min length 4 residues.
- I = 5-turn helix (π-helix). Min length 5 residues (extremely rare).
- T = hydrogen-bonded turn (3, 4 or 5 turn).
- E = extended strand (parallel and/or anti-parallel). Min length 2 residues.
- B = residue in isolated β-bridge (single-pair β-sheet hydrogen bond formation).
- S = bend (the only non-hydrogen-bond-based assignment).
- C = coil (residues which are not in any of the above conformations).
In the helix notation 3₁₀, the principal number denotes the number of residues per turn and the subscript gives the number of atoms in the ring formed by closing the hydrogen bond.
Protein Tertiary Structure
Tertiary structure refers to the three-dimensional structure of the entire polypeptide chain. It is defined by its atomic coordinates and is determined using techniques such as X-ray crystallography, NMR spectroscopy and cryo-EM. The function of a protein depends on its tertiary structure.
Sequence → Structure → Function
Quaternary Structure
Many proteins are made up of a single, continuous polypeptide chain (monomeric). Some proteins contain two or more polypeptide chains, called subunits or chains (multimeric). Quaternary structure describes the arrangement of two or more subunits into one integral structure in a multi-subunit protein. The arrangement of the subunits gives rise to a stable structure. It ranges from simple dimers to large homo-oligomers and complexes. Subunits may be identical (homo) or different (hetero). Examples:
- GABA-A ion channel: hetero-pentamer
- HIV protease: homo-dimer
Deciphering the Protein Folding Code
The protein folding problem is the "holy grail" of modern biological research:
- Given an amino acid sequence, predict its 3D structure (the forward folding problem).
- How do proteins fold so quickly? (Levinthal's paradox)
- What happens when this process goes awry (when proteins misfold)?
It has been studied for more than four decades and is still very much an open problem.
"Inverse Folding" Problem
Given a particular 3D structure (fold), identify amino acid sequences that can adopt this fold. A number of sequences will be compatible with a particular target, because homologous proteins are known to adopt the same fold.
Protein design: the rational design of new protein molecules, with the ultimate goal of designing novel function and/or behaviour, for bioengineering and biomedical applications.
Protein Secondary Structure Prediction
Predicting protein secondary structure from amino acid sequence has been attempted since the late 1950s. Secondary structure prediction methods aim to predict the local secondary structures of proteins based only on knowledge of their primary sequence, assigning regions of the amino acid sequence as likely alpha helices, beta strands or turns. The principle behind most secondary structure predictions is to look for patterns of residue conservation that are indicative of particular secondary structures. The early methods suffered from a lack of data. To date, over 20 different secondary structure prediction methods have been developed. Current methods can achieve up to 80% overall accuracy for globular proteins. The accuracy of current protein secondary structure prediction methods is assessed in weekly benchmarks such as LiveBench and EVA.
Amino Acid Propensity Values
The main criterion for alpha helix preference is that the amino acid side chain should cover and protect the backbone H-bonds in the core of the helix.
Helix formers: Ala, Leu, Met, Phe, Glu, Gln, His, Lys, Arg
Helix breakers:
- Gly: the side chain (H) is too small to protect the backbone H-bond.
- Pro: rigid structure (phi = -60°); the side chain is linked to the alpha N.
- Asp, Asn, Ser: H-bonding side chains compete directly with backbone H-bonds.
Large aromatic residues (Tyr, Phe and Trp) and β-branched amino acids (Thr, Val, Ile) are favoured in strands in the middle of β-sheets, because every other side chain in a sheet points in the opposite direction, leaving room for β-branched side chains to pack.
Guzzo AV: The influence of amino acid sequence on protein structure. Biophys J 1965, 5:809-822.
Chou and Fasman, Ann. Rev. Biochem. 47, 258 (1978).
PSSP Applications
Prediction of protein secondary structure provides information that is useful:
a) for ab initio structure prediction;
b) as an additional constraint for fold-recognition algorithms;
c) to help design site-directed or deletion mutants that will preserve the native protein structure (where and how to subclone protein fragments for expression);
d) for refinement of sequence alignments;
e) as a step toward the goal of understanding protein folding (a hierarchical approach to solving the protein folding problem);
f) for identifying protein function.
Secondary structure elements start to form at specific nucleation points during folding.
The quality of a secondary structure prediction is measured by the Q3 score. For each state i (helix, sheet, loop), Qi is the percentage of residues experimentally observed in state i that are correctly predicted to be in state i. The Q3 score is the average of the Qi weighted by the number of observed residues in each state, i.e. the overall percentage of correctly predicted residues.
PSSP Algorithms
There are three generations of PSSP algorithms.
First generation: based on statistical information about single amino acids; these methods were limited by the small number of proteins with solved structures.
- Chou-Fasman, 1974 (first approach): uses a combination of statistical and heuristic rules.
- GOR, 1978: information-theoretic framework.
Second generation: larger databases and statistics based on windows (segments) of amino acids. Typically a window contains a fixed number of consecutive amino acids (17 in the GOR method). The second-level approximation, involving pairs of residues (local dependencies), provides a better model (the GOR3 algorithm).
Third generation: based on the use of evolutionary information. Multiple sequence alignments are incorporated to obtain additional information from the observed patterns of sequence variability and the location of insertions and deletions.
Chou and Fasman Algorithm
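Start by computing the propensity of each amino acid to belong to a given type of secondary structure.
[Table: propensities per amino acid. Columns: amino acid, α-helix, β-sheet, turn. Rows in the order Ala, Cys, Leu, Met, Glu, Gln, His, Lys, Val, Ile, Phe, Tyr, Trp, Thr, Gly, Ser, Asp, Asn, Pro, Arg. Numeric values are not preserved in the transcript.]
A propensity > 1 means the residue favours that conformation; the table marks the residues that favour α-helix, β-strand and turn.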
Predicting helices:
- Find a nucleation site: 4 out of 6 contiguous residues with P(α) > 1.
- Extension: extend the helix in both directions until a set of 4 contiguous residues has an average P(α) < 1 (breaker).
- If the average P(α) over the whole region is > 1, it is predicted to be helical.
Predicting strands:
- Find a nucleation site: 3 out of 5 contiguous residues with P(β) > 1.
- Extension: extend the strand in both directions until a set of 4 contiguous residues has an average P(β) < 1 (breaker).
- If the average P(β) over the whole region is > 1, it is predicted to be a strand.
Overlaps: any region containing overlapping α-helical and β-sheet assignments is taken to be helical if the average P(α-helix) > P(β-sheet) for that region. It is a β-sheet if the average P(β-sheet) > P(α-helix) for that region.
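The nucleation and extension rules above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the published implementation: the function name `predict_segments`, the way the 4-residue breaker window is anchored at the growing end, and the `TOY_P_HELIX` values are all hypothetical (the toy numbers are placeholders, not the Chou-Fasman table).

```python
from statistics import mean

# Illustrative placeholder propensities, NOT the published Chou-Fasman values.
TOY_P_HELIX = {"A": 1.4, "E": 1.5, "L": 1.2, "K": 1.1,
               "G": 0.6, "P": 0.6, "S": 0.8}


def predict_segments(seq, propensity, window=6, min_hits=4, probe=4):
    """Return half-open (start, end) segments predicted for one state.

    Defaults follow the helix rule (4 of 6 residues with P > 1); for strands
    pass window=5, min_hits=3 together with a strand propensity table.
    """
    p = [propensity[aa] for aa in seq]
    n = len(p)
    segments = []
    i = 0
    while i + window <= n:
        # Nucleation: enough residues in the window with propensity > 1.
        if sum(v > 1.0 for v in p[i:i + window]) < min_hits:
            i += 1
            continue
        start, end = i, i + window
        # Extend right while the 4 residues ending at the candidate average >= 1.
        while end < n and mean(p[end - probe + 1:end + 1]) >= 1.0:
            end += 1
        # Extend left while the 4 residues starting at the candidate average >= 1.
        while start > 0 and mean(p[start - 1:start - 1 + probe]) >= 1.0:
            start -= 1
        # Accept the region only if its overall average propensity is > 1.
        if mean(p[start:end]) > 1.0:
            segments.append((start, end))
            i = end
        else:
            i += 1
    return segments


helices = predict_segments("GPSGAELKAELKGPSG", TOY_P_HELIX)
```

Running the same function with a strand table and `window=5, min_hits=3` gives the strand candidates; overlapping helix and strand segments would then be resolved by comparing the average propensities over the overlap.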
Predicting turns:
- For each tetrapeptide starting at residue i, compute:
  - PTurn, the average turn propensity over all 4 residues;
  - P(t) = f(i)*f(i+1)*f(i+2)*f(i+3), the product of the position-specific turn frequencies.
- If the averages for the tetrapeptide obey the inequalities PTurn > P(α), PTurn > P(β) and PTurn > 1, and P(t) is greater than the cutoff value, then the tetrapeptide is considered a turn.
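The turn rule can be sketched as follows. Everything named here is an assumption made for illustration: `predict_turns`, the toy tables and the cutoff argument are hypothetical. The cutoff value does not survive in the slide text; 7.5e-5 is a commonly quoted figure for this method and is used below only as an assumed example, and the toy numbers are placeholders rather than published parameters.

```python
from statistics import mean

# Illustrative placeholder tables, NOT published Chou-Fasman parameters.
TOY_P_TURN = {"A": 0.7, "D": 1.4, "P": 1.5, "N": 1.5, "G": 1.5}
TOY_P_HELIX = {"A": 1.4, "D": 1.0, "P": 0.6, "N": 0.7, "G": 0.6}
TOY_P_SHEET = {"A": 0.8, "D": 0.5, "P": 0.5, "N": 0.9, "G": 0.8}
# One frequency table per tetrapeptide position (positions 1 to 4).
TOY_F = [
    {"A": 0.06, "D": 0.15, "P": 0.10, "N": 0.16, "G": 0.10},
    {"A": 0.08, "D": 0.11, "P": 0.30, "N": 0.08, "G": 0.09},
    {"A": 0.04, "D": 0.18, "P": 0.03, "N": 0.19, "G": 0.19},
    {"A": 0.04, "D": 0.08, "P": 0.07, "N": 0.09, "G": 0.15},
]


def predict_turns(seq, p_turn, p_helix, p_sheet, f_pos, cutoff):
    """Return the start indices of tetrapeptides predicted to be turns."""
    turns = []
    for i in range(len(seq) - 3):
        tetra = seq[i:i + 4]
        avg_turn = mean(p_turn[aa] for aa in tetra)
        avg_helix = mean(p_helix[aa] for aa in tetra)
        avg_sheet = mean(p_sheet[aa] for aa in tetra)
        # P(t): product of the position-specific frequencies f(i)...f(i+3).
        p_t = 1.0
        for position, aa in enumerate(tetra):
            p_t *= f_pos[position][aa]
        if (avg_turn > avg_helix and avg_turn > avg_sheet
                and avg_turn > 1.0 and p_t > cutoff):
            turns.append(i)
    return turns


# Assumed cutoff (the slide's value is missing); treat it as a tunable parameter.
turn_starts = predict_turns("AADPNGAA", TOY_P_TURN, TOY_P_HELIX,
                            TOY_P_SHEET, TOY_F, cutoff=7.5e-5)
```

Keeping the cutoff and all four tables as arguments makes it easy to swap in the real parameters once they are available.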
Position-specific parameters for turns: each position of the tetrapeptide has distinct amino acid preferences. Example: at position 2, Pro is highly preferred and Trp is disfavoured.
Beware of Q3 Values
It is important to be aware that the Q3 score can give an over-optimistic estimate of accuracy. Because there are only 3 states, even random guessing would yield a 3-state accuracy (Q3) of about 33%, assuming that all structures are equally likely. In practice, the numbers of residues in helices, strands and loops in the database are frequently not evenly distributed, with loops usually comprising the greatest proportion.
Amino acid sequence:        ALHEASGPSVILFGSDVTVPPASNAEQAK
Actual secondary structure: hhhhhooooeeeeoooeeeooooohhhhh
Prediction 1:               ohhhooooeeeeoooooeeeooohhhhhh   Q3 = 22/29 = 76%
Prediction 2:               hhhhhoooohhhhooohhhooooohhhhh   Q3 = 22/29 = 76%
Both predictions have the same Q3, although prediction 2 contains no strands at all.
Secondary structure assignment in real proteins is uncertain to about 10% (disagreement between DSSP and STRIDE); therefore, a "perfect" prediction would have Q3 = 90%.
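The 22/29 figures can be checked with a short script. This is a small sketch, assuming that Q3 is computed as the overall fraction of residues whose state matches; the function names `q3` and `per_state_accuracy` are hypothetical, and the three strings are the ones from the example above.

```python
def q3(observed, predicted):
    """Fraction of residues whose predicted state equals the observed state."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return correct / len(observed)


def per_state_accuracy(observed, predicted, states="heo"):
    """Qi for each state: correctly predicted residues / observed residues."""
    result = {}
    for state in states:
        positions = [k for k, o in enumerate(observed) if o == state]
        hits = sum(predicted[k] == state for k in positions)
        result[state] = hits / len(positions) if positions else float("nan")
    return result


observed = "hhhhhooooeeeeoooeeeooooohhhhh"
prediction_1 = "ohhhooooeeeeoooooeeeooohhhhhh"
prediction_2 = "hhhhhoooohhhhooohhhooooohhhhh"

for name, pred in (("prediction 1", prediction_1), ("prediction 2", prediction_2)):
    print(name, round(100 * q3(observed, pred)), per_state_accuracy(observed, pred))
```

Both predictions give 22 correct residues out of 29, but the per-state values differ sharply: prediction 2 has a strand accuracy of zero, which the single Q3 number hides.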
Advantages of Chou-Fasman: the propensity for a specific conformation is evaluated in the context of the flanking residues, using simple rules.
Disadvantages of Chou-Fasman:
- Correlations between different positions in the sequence are based completely on empirical rules.
- Ambiguity in the assignment of overlapping regions.
- Accuracy below 60% (remember that 33.3% is the lower limit).
GOR Method
The GOR (Garnier-Osguthorpe-Robson) method is an information theory-based method. It is also based on probability parameters derived from empirical studies of known experimental structures. The GOR method takes into account not only the propensities of individual amino acids to form particular secondary structures, but also the conditional probability of the amino acid to form a secondary structure given that its immediate neighbours have already formed that structure. Each residue is evaluated together with its 8 adjacent N-terminal and 8 adjacent carboxyl-terminal residues, i.e. a sliding window of 17 residues.
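A sliding-window scorer of this kind might look like the sketch below. It assumes one table of position-dependent scores per state (17 window positions by residue type) and predicts the state with the highest summed score. The names `gor_predict` and `toy_tables` are hypothetical, and the toy scores are invented placeholders; the real method derives its values from proteins of known structure.

```python
HALF_WINDOW = 8                      # 8 residues on each side: window of 17
STATES = ("H", "E", "C")             # helix, strand, coil


def toy_tables(preferences):
    """Build placeholder score tables: state -> list of 17 {residue: score}.

    A residue listed for a state scores 1/(1+|offset|) at each window offset;
    everything else scores 0. These numbers are purely illustrative.
    """
    tables = {}
    for state in STATES:
        tables[state] = [
            {aa: 1.0 / (1 + abs(offset)) for aa in preferences[state]}
            for offset in range(-HALF_WINDOW, HALF_WINDOW + 1)
        ]
    return tables


def gor_predict(seq, tables):
    """Predict one state per residue by summing window scores for each state."""
    prediction = []
    for j in range(len(seq)):
        scores = {}
        for state in STATES:
            total = 0.0
            for offset in range(-HALF_WINDOW, HALF_WINDOW + 1):
                k = j + offset
                if 0 <= k < len(seq):          # skip positions off the ends
                    total += tables[state][offset + HALF_WINDOW].get(seq[k], 0.0)
            scores[state] = total
        prediction.append(max(STATES, key=scores.get))
    return "".join(prediction)


tables = toy_tables({"H": "AEL", "E": "VIY", "C": "GPS"})
predicted = gor_predict("AELAELGPSGVIYVIY", tables)
```

Window positions that fall off either end of the sequence are simply skipped here, which is one of several possible ways to treat the termini.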
The GOR method underpredicts β-strand regions; its accuracy is Q3 ≈ 64%.
Position-dependent propensities for helix, sheet or turn have been calculated for all residue types. For each position j in the sequence, eight residues on both sides of that position are considered. The statistical information derived from proteins of known structure is stored in three 17 x 20 matrices, one each for α, β and coil. The helix propensity table, for example, contains the propensities of each residue type at each of the 17 window positions when the conformation of residue j is helical. The predicted state of residue aa(j) is calculated from the sum of the position-dependent propensities of all residues around aa(j).
Suppose a(j) is the amino acid that we are trying to categorise. GOR looks at the residues a(j-8), ..., a(j), ..., a(j+7), a(j+8). Intuitively, it assigns position-dependent probabilities based on what it has calculated from protein databases.
Third Generation Methods
Third-generation methods use evolutionary information from multiple sequence alignments together with expert methods (neural networks) for prediction. The most important algorithms today are:
- PHD
- NNPREDICT
- PSIPRED
Owing to the growth of protein information in databases, i.e. better evolutionary information, today's predictive accuracy is ~80%. It is believed that the maximum reachable accuracy is 88%.
An artificial neural network is composed of many artificial neurons that are linked together according to a specific network architecture. The goal of the neural network is to transform the inputs into meaningful outputs.
Tertiary Structure Prediction
Major techniques:
- Template-based modelling
  - Homology modelling
  - Threading
- Template-free modelling: prediction from sequence using first principles
  - ab initio methods (physics-based, knowledge-based)
  - Synonyms: de novo modelling, physics-based modelling
Overview of ab initio method
Typically, ab initio modelling conducts a conformational search under the guidance of a designed energy function. This procedure usually generates a number of possible conformations (structure decoys), and final models are selected from them. Therefore, successful ab initio modelling depends on three factors:
(1) an accurate energy function, under which the native structure of a protein corresponds to the most thermodynamically stable state compared to all possible decoy structures;
(2) an efficient search method that can quickly identify the low-energy states through conformational search;
(3) selection of native-like models from a pool of decoy structures.
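The three ingredients (energy function, search, selection) can be illustrated with a toy Monte Carlo search over torsion angles. This is a sketch under strong simplifying assumptions: `toy_energy` is an invented function with a known minimum, not a physical force field, and the function names, step size, temperature and number of decoys are arbitrary choices made for the example.

```python
import math
import random


def toy_energy(angles, target):
    """Hypothetical energy: zero when every torsion angle equals its target."""
    return sum(1.0 - math.cos(a - t) for a, t in zip(angles, target))


def monte_carlo_search(target, steps=2000, temperature=0.1, seed=0):
    """Metropolis Monte Carlo over torsion angles; returns (angles, energy)."""
    rng = random.Random(seed)
    current = [rng.uniform(-math.pi, math.pi) for _ in target]
    e_current = toy_energy(current, target)
    best, e_best = list(current), e_current
    for _ in range(steps):
        trial = list(current)
        k = rng.randrange(len(trial))
        trial[k] += rng.gauss(0.0, 0.5)          # perturb one torsion angle
        e_trial = toy_energy(trial, target)
        # Metropolis criterion: always accept downhill moves, sometimes uphill.
        accept = (e_trial <= e_current or
                  rng.random() < math.exp(-(e_trial - e_current) / temperature))
        if accept:
            current, e_current = trial, e_trial
            if e_current < e_best:
                best, e_best = list(current), e_current
    return best, e_best


def generate_decoys(target, n_decoys=5):
    """Run independent searches from different seeds to build a decoy pool."""
    return [monte_carlo_search(target, seed=s) for s in range(n_decoys)]


target = [0.5] * 10                               # invented "native" angles
decoys = generate_decoys(target)
best_angles, best_energy = min(decoys, key=lambda d: d[1])   # model selection
```

In a real setting the native angles are unknown, so the energy function has to rank decoys without reference to a target; here the target only exists so that the toy energy has a well-defined minimum.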
Disadvantages:
- Ab initio prediction is not practical for large sequences (> 100 aa).
- Computationally very expensive.
- Currently, the accuracy of ab initio modelling is low and success is limited to small proteins.
Advantages:
- Can give insights into the folding mechanism.
- Helps in understanding protein misfolding.
- Doesn't require homologs.
- The only way to model new folds.
- Useful for de novo protein design.
Challenges in Protein folding
Energetics:
- We don't know all the forces involved in detail.
- Too computationally expensive BY FAR! (Folding takes place on the order of microseconds to milliseconds.)
Conformational search is impossibly large:
- For a 100 a.a. protein with 2 moving dihedrals per residue and 2 possible positions for each dihedral: 2^200 conformations!
- Levinthal's paradox: proteins fold in a couple of seconds??
- Multiple-minima problem.
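The size of that search space is easy to work out. The sketch below reproduces the count for the example above and estimates how long an exhaustive enumeration would take; the sampling rate of 1e13 conformations per second is an assumed illustrative figure, and both function names are hypothetical.

```python
def conformation_count(n_residues, dihedrals_per_residue=2, states_per_dihedral=2):
    """Number of conformations if every dihedral independently takes each state."""
    return states_per_dihedral ** (n_residues * dihedrals_per_residue)


def exhaustive_search_years(n_conformations, rate_per_second=1e13):
    """Years needed to visit every conformation at an assumed sampling rate."""
    seconds = n_conformations / rate_per_second
    return seconds / (365.25 * 24 * 3600)


count = conformation_count(100)            # 2**200, roughly 1.6e60
years = exhaustive_search_years(count)     # far longer than the age of the universe
print(f"{count:.2e} conformations, about {years:.1e} years to enumerate")
```

Even with only two states per dihedral the enumeration is hopeless, which is why practical methods rely on guided searches rather than exhaustive ones.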
Understanding protein folding via molecular simulation
Advances in computer hardware, software and algorithms have now made it possible to simulate protein folding. Atomistic models have been used for decades to address the protein folding problem (M. Levitt, A. Warshel 1975). The first long-timescale study of protein folding using MD simulation followed in 1998 (Peter Kollman).
[Figure: time scale for protein folding]
Challenges:
- Accurate force fields
- Adequate sampling
- Robust data analysis
Rosetta Approach
The Rosetta approach (David Baker lab, Univ. of Washington) performs a Monte Carlo search through the space of conformations to find the minimal-energy conformation. Rosetta searches structure space by replacing the torsion angles of a fragment in the current model with torsion angles from known structure fragments.
The Rosetta Approach
Given: protein sequence P
  for each window of length 9 in P
    assemble a set of structure fragments (using PSI-BLAST)
  M = initial structure model of P (fully extended conformation)
  S = score(M)
  while stopping criteria not met
    randomly select a fixed-width window of amino acids from P
    randomly select a fragment from the list for this window
    M' = M with torsion angles in window replaced by angles from fragment
    S' = score(M')
    if Metropolis criterion(S, S') satisfied
      M = M'
      S = S'
Return: predicted structure M
The Rosetta Scoring Approach
The Rosetta scoring function takes into account:
- residue environment (solvation)
- residue pair interactions (electrostatics, disulfides)
- strand pairing (hydrogen bonding)
- strand arrangement into sheets
- helix-strand packing
- steric repulsion
Scoring function and search:
- the search progressively adds terms to the scoring function: initially only the steric overlap term is used, then all but the compactness terms are used
- the search is initiated from different random seeds
- for some applications, an atomic-level scoring function is used
Critical Assessment of protein Structure Prediction (CASP)
A community-wide, worldwide experiment for protein structure prediction that has been held every two years since 1994. Evaluation of the results is carried out in the following prediction categories:
- tertiary structure prediction (all CASPs), divided into template-based and template-free methods
- secondary structure prediction (dropped after CASP5)
- prediction of structure complexes (CASP2 only; now a separate experiment, CAPRI)
- residue-residue contact prediction (starting CASP4)
- disordered regions prediction (starting CASP5)
- domain boundary prediction (CASP6-CASP8)
- function prediction (starting CASP6)
- model quality assessment (starting CASP7)
- model refinement (starting CASP7)
- high-accuracy template-based prediction (starting CASP7)