protein structure modelling

Protein Structure Modelling. R.S.K. Vijayan. Structural Bioinformatics II

Upload: lucas-chambers

Post on 18-Jan-2018


DESCRIPTION

Overview of today's lecture: levels of protein structure; protein structure prediction; secondary structure prediction (Chou-Fasman method, GOR method, NN-based methods); tertiary structure prediction (ab initio based methods, challenges, limitations, overview of the Rosetta method); overview of CASP and CAMEO.

TRANSCRIPT

Protein Structure Modelling
Overview of today's lecture
- Levels of protein structure
- Protein structure prediction
- Secondary structure prediction
  - Chou-Fasman method
  - GOR method
  - NN-based methods
- Tertiary structure prediction
  - ab initio based methods
  - Challenges
  - Limitations
  - Overview of the Rosetta method
- Overview of CASP and CAMEO

Levels of Protein Structure
There are four levels of protein structure:
- Primary structure (1°)
- Secondary structure (2°)
- Super-secondary structure, folds and domains
- Tertiary structure (3°)
- Quaternary structure (4°)
The primary structure of a protein refers to the amino acid sequence of the polypeptide chain.

Secondary Structure in Proteins
Secondary structure is the general three-dimensional form of local segments of proteins.
The Dictionary of Protein Secondary Structure (DSSP) is commonly used to describe protein secondary structure with single-letter codes. There are eight different types of secondary structure:
- G = 3-turn helix (3₁₀ helix). Min length 3 residues.
- H = 4-turn helix (α-helix). Min length 4 residues.
- I = 5-turn helix (π-helix). Min length 5 residues (extremely rare).
- T = hydrogen-bonded turn (3, 4 or 5 turn).
- E = extended strand (parallel and/or anti-parallel). Min length 2 residues.
- B = residue in isolated β-bridge (single-pair β-sheet hydrogen bond formation).
- S = bend (the only non-hydrogen-bond-based assignment).
- C = coil (residues which are not in any of the above conformations).
In the 3₁₀ helix notation, the principal number denotes the number of residues per turn and the subscript gives the number of atoms in the ring formed by closing the hydrogen bond.

Protein Tertiary Structure
- Tertiary structure refers to the three-dimensional structure of the entire polypeptide chain.
- The tertiary structure is defined by its atomic coordinates and is determined using techniques such as X-ray crystallography, NMR spectroscopy, and cryo-EM.
- The function of a protein depends on its tertiary structure.
Sequence → Structure → Function

Quaternary Structure
- Many proteins are made up of a single, continuous polypeptide chain (monomeric).
- Some proteins contain two or more polypeptide chains, called subunits or chains (multimeric).
- Quaternary structure describes the arrangement of two or more subunits/chains to form one integral structure in a multi-unit protein.
- The arrangement of the subunits gives rise to a stable structure.
- It includes organizations from simple dimers to large homo-oligomers and complexes.
- Subunits may be identical (homo) or different (hetero).
Examples: GABA-A ion channel (hetero-pentamer); HIV protease (homo-dimer).

Levels of Protein Structure

Deciphering the Protein Folding Code
The protein folding problem: the "holy grail" of modern biological research.
- Given an amino acid sequence, predict its 3D structure (the forward folding problem).
- How do proteins fold so quickly? (Levinthal's paradox)
- What happens when this process goes awry (when proteins misfold)?
- Has been studied for more than four decades and is still very much an open problem.
The "inverse folding" problem:
- Given a particular 3D structure (fold), identify amino acid sequences that can adopt this fold.
- A number of sequences will be compatible with a particular target, because homologous proteins are known to adopt the same fold.
- Protein design: the rational design of new protein molecules, with the ultimate goal of designing novel function and/or behavior.
- Bioengineering and biomedical applications.

Protein Secondary Structure Prediction
- Predicting protein secondary structure from amino acid sequence has been attempted since the late 1950s.
- Secondary structure prediction methods aim to predict the local secondary structures of proteins based only on knowledge of their primary sequence.
- They assign regions of the amino acid sequence as likely alpha helices, beta strands, or turns.
- The principle behind most secondary structure predictions is to look for patterns of residue conservation that are indicative of secondary structures like those shown above.
- The early methods suffered from a lack of data.
- To date, over 20 different secondary structure prediction methods have been developed.
- Current methods can achieve up to 80% overall accuracy for globular proteins.
- The accuracy of current protein secondary structure prediction methods is assessed in weekly benchmarks such as LiveBench and EVA.

Amino-acid Propensity Values
The main criterion for alpha-helix preference is that the amino acid side chain should cover and protect the backbone H-bonds in the core of the helix.
Helix formers: Ala, Leu, Met, Phe, Glu, Gln, His, Lys, Arg
Helix breakers:
- Gly: side chain (H) too small to protect the H-bond.
- Pro: rigid structure (phi = -60°), side chain linked to the alpha N.
- Asp, Asn, Ser: H-bonding side chains compete directly with backbone H-bonds.
Large aromatic residues (Tyr, Phe and Trp) and β-branched amino acids (Thr, Val, Ile) are favored in β-strands in the middle of β-sheets, because every other side chain in a β-sheet points in the opposite direction, leaving room for β-branched side chains to pack.
Guzzo AV: The influence of amino acid sequence on protein structure. Biophys J 1965, 5:809-822.
Chou and Fasman, Ann. Rev. Biochem. 47, 258 (1978).

PSSP Applications
Prediction of protein secondary structure provides information that is useful for:
a) ab initio structure prediction
b) an additional constraint for fold-recognition algorithms
c) helping the design of site-directed or deletion mutants that will preserve the native protein structure (where and how to subclone protein fragments for expression)
d) refinement of sequence alignments
e) a step toward the goal of understanding protein folding (a hierarchical approach to solving the protein folding problem)
f) identifying protein function
Secondary structure elements start to form at specific nucleation points during folding.
The quality of secondary structure prediction is measured by the Q3 score: the percentage of residues whose three-state assignment (helix, sheet, loop) is predicted correctly. The per-state score Qi (i = helix, sheet, loop) is the percentage of correctly predicted residues in state i relative to the total number of experimentally observed residues in state i.

PSSP Algorithms
There are three generations of PSSP algorithms.
First generation: based on statistical information about single amino acids; limited by the small number of proteins with solved structures.
- Chou-Fasman, 1974 (first approach): uses a combination of statistical and heuristic rules.
- GOR, 1978: information-theoretic framework.
Second generation: larger databases and statistics based on windows (segments) of amino acids.
- Typically a window spans a fixed number of consecutive amino acids.
- The second-level approximation, involving pairs of residues, provides a better model of local dependencies (the GOR3 algorithm).
Third generation: based on the use of evolutionary information.
- Incorporates multiple sequence alignment to obtain additional information from the observed patterns of sequence variability and the location of insertions and deletions.

Chou and Fasman Algorithm
Start by computing each amino acid's propensity to belong to a given type of secondary structure.
Table (values not preserved): α-helix, β-sheet and turn propensities for Ala, Cys, Leu, Met, Glu, Gln, His, Lys, Val, Ile, Phe, Tyr, Trp, Thr, Gly, Ser, Asp, Asn, Pro and Arg.
A propensity > 1 means the residue favors that structure (legend: favors α-helix, favors β-strand, favors turn).

Chou and Fasman Algorithm (cont...)
Predicting helices:
- find nucleation site: 4 out of 6 contiguous residues with P(α) > 1.
- extension: extend the helix in both directions until a set of 4 contiguous residues has an average P(α) < 1 (breaker).
- if the average P(α) over the whole region is > 1, it is predicted to be helical.
Predicting strands:
- find nucleation site: 3 out of 5 contiguous residues with P(β) > 1.
- extension: extend the strand in both directions until a set of 4 contiguous residues has an average P(β) < 1 (breaker).
- if the average P(β) over the whole region is > 1, it is predicted to be a strand.
Overlaps: any region with overlapping α-helix and β-sheet assignments is taken to be helical if the average P(α-helix) > P(β-sheet) for that region, and a β-sheet if the average P(β-sheet) > P(α-helix) for that region.

Chou and Fasman Algorithm (cont...)
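As a rough illustration of the helix rule (nucleation, extension, final average check), here is a minimal Python sketch. It is not the authors' implementation: the function names are made up for this example, the two propensity values in the usage note are illustrative placeholders rather than values from the slides, and "a set of 4 contiguous residues" is read here as the four residues at the growing edge of the region, which is only one possible interpretation.

```python
from typing import Dict, List, Tuple

Propensity = Dict[str, float]  # amino acid (one-letter code) -> propensity


def mean_propensity(seq: str, p: Propensity, start: int, end: int) -> float:
    """Average propensity over seq[start:end]."""
    return sum(p[aa] for aa in seq[start:end]) / (end - start)


def find_nuclei(seq: str, p: Propensity, window: int = 6, needed: int = 4) -> List[int]:
    """Start indices of windows in which at least `needed` residues have P > 1."""
    starts = []
    for i in range(len(seq) - window + 1):
        formers = sum(1 for aa in seq[i:i + window] if p[aa] > 1.0)
        if formers >= needed:
            starts.append(i)
    return starts


def extend(seq: str, p: Propensity, start: int, window: int = 6) -> Tuple[int, int]:
    """Grow [start, start + window) until the 4 residues at an edge average < 1."""
    left, right = start, start + window
    # Assumption: the "breaker" test looks at the 4 residues ending at the new edge.
    while right < len(seq) and mean_propensity(seq, p, right - 3, right + 1) >= 1.0:
        right += 1
    while left > 0 and mean_propensity(seq, p, left - 1, left + 3) >= 1.0:
        left -= 1
    return left, right


def predict_helices(seq: str, p_alpha: Propensity) -> List[Tuple[int, int]]:
    """Half-open (start, end) regions predicted helical, overlapping regions merged."""
    regions = []
    for start in find_nuclei(seq, p_alpha):
        left, right = extend(seq, p_alpha, start)
        if mean_propensity(seq, p_alpha, left, right) > 1.0:
            regions.append((left, right))
    merged: List[Tuple[int, int]] = []
    for left, right in sorted(regions):
        if merged and left <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], right))
        else:
            merged.append((left, right))
    return merged
```

For example, with placeholder propensities `{"A": 1.42, "G": 0.57}`, `predict_helices("GGGGAAAAAAGGGG", ...)` marks one region around the alanine stretch. The same helpers could be reused for strands by passing P(β) with `window=5, needed=3` to `find_nuclei` and `window=5` to `extend`; resolving helix/strand overlaps is left out of this sketch.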
Predicting turns:
- for each tetrapeptide starting at residue i, compute:
  - P(turn): the average turn propensity over all 4 residues
  - p(t) = f(i) * f(i+1) * f(i+2) * f(i+3), the product of the position-specific turn frequencies
- the tetrapeptide is considered a turn if its averages obey P(turn) > P(α), P(turn) > P(β) and P(turn) > 1, and p(t) exceeds a fixed cutoff.
Position-specific parameters for turns: each position has distinct amino acid preferences. Example: at position 2, Pro is highly preferred; Trp is disfavored.

Beware of Q3 Values
It is important to be aware that the Q3 score can give an over-optimistic estimate of accuracy.
- Because there are only 3 states, even random guessing would yield a 3-state accuracy (Q3) of about 33%, assuming that all structures are equally likely.
- The numbers of residues in helices, strands, and loops in the database are frequently not evenly distributed, with loops usually comprising the greatest proportion.
Amino acid sequence:        ALHEASGPSVILFGSDVTVPPASNAEQAK
Actual secondary structure: hhhhhooooeeeeoooeeeooooohhhhh
Prediction 1:               ohhhooooeeeeoooooeeeooohhhhhh   Q3 = 22/29 = 76%
Prediction 2:               hhhhhoooohhhhooohhhooooohhhhh   Q3 = 22/29 = 76%
Both predictions score the same Q3, although the second one misses every strand.
Secondary structure assignment in real proteins is uncertain to about 10% (disagreement between DSSP and STRIDE); therefore, a perfect prediction would have Q3 = 90%.

Chou and Fasman Algorithm (cont...)
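The Q3 comparison in the "Beware of Q3 values" example can be reproduced with a few lines of Python. This is a minimal sketch: the function names are invented for illustration, Q3 is computed as the overall fraction of matching residues (which is what the 22/29 figure implies), and the strings are copied from the example with h = helix, e = strand, o = other.

```python
from typing import Dict

OBSERVED = "hhhhhooooeeeeoooeeeooooohhhhh"
PRED_1 = "ohhhooooeeeeoooooeeeooohhhhhh"
PRED_2 = "hhhhhoooohhhhooohhhooooohhhhh"


def q3(observed: str, predicted: str) -> float:
    """Fraction of residues whose state is predicted correctly."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    correct = sum(1 for o, p in zip(observed, predicted) if o == p)
    return correct / len(observed)


def per_state_accuracy(observed: str, predicted: str, states: str = "heo") -> Dict[str, float]:
    """Q_i for each state: correct predictions in state i / observed residues in state i."""
    result = {}
    for state in states:
        n_observed = observed.count(state)
        n_correct = sum(1 for o, p in zip(observed, predicted) if o == state and p == state)
        result[state] = n_correct / n_observed if n_observed else float("nan")
    return result


if __name__ == "__main__":
    for name, pred in (("prediction 1", PRED_1), ("prediction 2", PRED_2)):
        # Both predictions get 22 of 29 residues right, but their per-state
        # scores differ: prediction 2 never predicts a strand.
        print(name, round(q3(OBSERVED, pred), 3), per_state_accuracy(OBSERVED, pred))
```

Reporting the per-state values next to Q3 makes the weakness visible: the second prediction reaches the same overall score with a strand accuracy of zero.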
Advantages of Chou-Fasman:
- Propensity for a specific conformation is evaluated in the context of the flanking residues using simple rules.
Disadvantages of Chou-Fasman:
- Correlations between different positions in the sequence are based completely on empirical rules.
- Ambiguity in the assignment of overlapping regions.
- Accuracy below 60% (remember 33.3% is the lower limit).

GOR Method
- The GOR method (Garnier-Osguthorpe-Robson) is an information-theory-based method.
- It is also based on probability parameters derived from empirical studies of known experimental structures.
- It takes into account not only the propensities of individual amino acids to form particular secondary structures, but also the conditional probability of the amino acid forming a secondary structure given that its immediate neighbors have already formed that structure.
- Each residue is evaluated together with the adjacent 8 N-terminal and 8 C-terminal residues: a sliding window of 17 residues.
- Underpredicts β-strand regions.
- GOR method accuracy: Q3 = ~64%.

GOR Method
- Position-dependent propensities for helix, sheet or turn have been calculated for all residue types.
- For each position j in the sequence, eight residues on both sides of the actual position are considered.
- Statistical information derived from proteins of known structure is stored in three 17 x 20 matrices, one each for α-helix, β-strand and coil.
- The helix propensity table contains the propensity of each residue type at each of the 17 positions when the conformation of residue j is helical.
- The predicted state of aa(j) is calculated from the sum of the position-dependent propensities of all residues around aa(j).
- Suppose a(j) is the amino acid that we are trying to categorize. GOR looks at the residues a(j-8), ..., a(j), ..., a(j+8). Intuitively, it assigns position-dependent probabilities based on what it has calculated from protein databases.

GOR Method

Third Generation Methods
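Before the third-generation methods, the GOR window sum described above can be sketched in Python. This is a hedged illustration, not the published GOR code: the `tables` structure, the function name, the state labels, and the choice to skip window positions that fall outside the sequence are all assumptions made for this example, and real tables would be the 17 x 20 matrices derived from known structures.

```python
from typing import Dict, List

# tables[state][offset + half_width][amino_acid] -> position-dependent score.
# With half_width = 8 each state has 17 rows, one per window position.
Tables = Dict[str, List[Dict[str, float]]]


def gor_predict(seq: str, tables: Tables, half_width: int = 8) -> str:
    """Assign each residue the state with the highest summed window score."""
    states = sorted(tables)  # fixed order so ties are broken deterministically
    prediction = []
    for j in range(len(seq)):
        best_state, best_score = states[0], float("-inf")
        for state in states:
            score = 0.0
            for k in range(-half_width, half_width + 1):
                pos = j + k
                if 0 <= pos < len(seq):  # assumption: ignore positions past the termini
                    score += tables[state][k + half_width].get(seq[pos], 0.0)
            if score > best_score:
                best_state, best_score = state, score
        prediction.append(best_state)
    return "".join(prediction)


if __name__ == "__main__":
    # Toy tables with a 3-residue window (half_width=1); values are made up.
    toy: Tables = {
        "H": [{"A": 1.0, "G": -1.0}] * 3,
        "C": [{"A": -1.0, "G": 1.0}] * 3,
        "E": [{}] * 3,
    }
    print(gor_predict("AAAGGG", toy, half_width=1))
```

Keeping the tables as plain data makes the method's main limitation easy to see: each neighbor contributes independently of the others, which is what the pair-based second-level approximation (GOR3) tries to improve.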
- Use evolutionary information based on multiple sequence alignment, and expert methods (neural networks) for prediction.
- The most important algorithms of today: PHD, NNPREDICT, PSIPRED.
- Due to the improvement of protein information in databases, i.e. better evolutionary information, today's predictive accuracy is ~80%.
- It is believed that the maximum reachable accuracy is 88%.
- An artificial neural network is composed of many artificial neurons that are linked together according to a specific network architecture. The goal of the neural network is to transform the inputs into meaningful outputs.

Tertiary Structure Prediction
Major techniques:
- Template-Based Modeling
  - Homology Modeling
  - Threading
- Template-Free Modeling: prediction from sequence using first principles
  - ab initio methods
    - Physics-based
    - Knowledge-based
  - Synonyms: de novo modelling, physics-based.

Overview of ab initio method
Typically, ab initio modelling conducts a conformational search under the guidance of a designed energy function. This procedure usually generates a number of possible conformations (structure decoys), and final models are selected from them. Successful ab initio modelling therefore depends on three factors:
(1) an accurate energy function under which the native structure of a protein corresponds to the most thermodynamically stable state, compared to all possible decoy structures;
(2) an efficient search method which can quickly identify the low-energy states through conformational search;
(3) selection of native-like models from a pool of decoy structures.

Overview of ab initio method
Disadvantages:
- Ab initio prediction is not practical for large sequences (> 100 aa).
- Computationally very expensive.
- Currently, the accuracy of ab initio modelling is low and success is limited to small proteins.
Advantages:
- Can give insights into the folding mechanism.
- Helps in understanding protein misfolding.
- Doesn't require homologs.
- Only way to model new folds.
- Useful for de novo protein design.

Challenges in Protein Folding
Energetics:
- We don't know all the forces involved in detail.
- Too computationally expensive BY FAR! (Folding takes place on the order of microseconds to milliseconds.)
Conformational search is impossibly large:
- 100 a.a. protein, 2 moving dihedrals per residue, 2 possible positions for each dihedral: 2^200 conformations!
- Levinthal's paradox: proteins fold in a couple of seconds??
Multiple-minima problem.

Understanding protein folding via molecular simulation
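The conformational-search estimate from the challenges list (100 residues, 2 dihedrals each, 2 positions per dihedral) is easy to check numerically, and it shows why simulation has to rely on sampling rather than enumeration. The sketch below is only a back-of-the-envelope calculation; the function names and the sampling rate of 10^13 conformations per second are assumptions chosen for illustration, not figures from the lecture.

```python
def conformation_count(residues: int, dihedrals_per_residue: int = 2,
                       states_per_dihedral: int = 2) -> int:
    """Number of discrete conformations: states ** (dihedrals * residues)."""
    return states_per_dihedral ** (dihedrals_per_residue * residues)


def exhaustive_search_years(n_conformations: int, samples_per_second: float) -> float:
    """Years needed to visit every conformation once at the given rate."""
    seconds = n_conformations / samples_per_second
    return seconds / (365.25 * 24 * 3600)


if __name__ == "__main__":
    n = conformation_count(100)  # 2 ** 200, about 1.6e60
    print(f"{n:.3e} conformations")
    # Assumed rate: 1e13 conformations per second (hypothetical).
    print(f"{exhaustive_search_years(n, 1e13):.3e} years to enumerate")
```

Even with this deliberately coarse two-state model, exhaustive enumeration takes vastly longer than the seconds (or less) in which real proteins fold, which is the core of Levinthal's paradox.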
- Advances in computer hardware, software and algorithms have now made it possible to simulate protein folding.
- Atomistic models have been used for decades to address the protein folding problem (M. Levitt, A. Warshel, 1975).
- The first ever long-timescale study of protein folding using MD simulation (Peter Kollman, 1998).
Time scale for protein folding
Challenges:
- Accurate force fields
- Adequate sampling
- Robust data analysis

Rosetta Approach
The Rosetta approach (David Baker lab, Univ. of Washington):
- Performs a Monte Carlo search through the space of conformations to find the minimal-energy conformation.
- Rosetta searches structure space by replacing the torsion angles of a fragment in the current model with torsion angles from known structure fragments.

The Rosetta Approach
Given: protein sequence P
for each window of length 9 in P
    assemble a set of structure fragments (using PSI-BLAST)
M = initial structure model of P (fully extended conformation)
S = score(M)
while stopping criteria not met
    randomly select a fixed-width window of amino acids from P
    randomly select a fragment from the list for this window
    M' = M with torsion angles in window replaced by angles from fragment
    S' = score(M')
    if Metropolis criterion(S, S') satisfied
        M = M'
        S = S'
Return: predicted structure M

The Rosetta Scoring Approach
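The fragment-replacement loop of the Rosetta pseudocode can be sketched in Python as follows. This is a toy illustration under stated assumptions, not Rosetta itself: the fragment library is a plain dictionary supplied by the caller (no PSI-BLAST step), the scoring function is whatever callable is passed in, the stopping criterion is a fixed step count, and the Metropolis criterion is taken in its usual form (always accept a lower score, accept a higher one with probability exp(-ΔS/T)).

```python
import math
import random
from typing import Callable, Dict, List, Sequence, Tuple

Torsion = Tuple[float, float]  # (phi, psi) of one residue, in degrees
Model = List[Torsion]


def metropolis_accept(old: float, new: float, temperature: float,
                      rng: random.Random) -> bool:
    """Accept downhill moves always, uphill moves with Boltzmann probability."""
    if new <= old:
        return True
    return rng.random() < math.exp(-(new - old) / temperature)


def fragment_monte_carlo(n_residues: int,
                         fragments: Dict[int, Sequence[Sequence[Torsion]]],
                         score: Callable[[Model], float],
                         steps: int = 1000,
                         temperature: float = 1.0,
                         seed: int = 0) -> Tuple[Model, float]:
    """fragments maps a window start index to candidate torsion fragments."""
    rng = random.Random(seed)
    starts = sorted(fragments)
    model: Model = [(180.0, 180.0)] * n_residues  # fully extended start
    current = score(model)
    for _ in range(steps):  # assumption: stop after a fixed number of steps
        start = rng.choice(starts)
        fragment = rng.choice(fragments[start])
        # Replace the torsion angles in the window with those of the fragment.
        trial = model[:start] + list(fragment) + model[start + len(fragment):]
        trial_score = score(trial)
        if metropolis_accept(current, trial_score, temperature, rng):
            model, current = trial, trial_score
    return model, current
```

A usage example with a made-up score that simply counts non-helical residues: build `fragments = {i: [[(-60.0, -45.0)] * 3, [(180.0, 180.0)] * 3] for i in range(7)}` and call `fragment_monte_carlo(9, fragments, score, temperature=1e-6)`; at such a low temperature the search is effectively greedy and settles on the all-helical model. Like the pseudocode, the function returns the current model rather than the best one seen.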
The Rosetta scoring function takes into account:
- residue environment (solvation)
- residue pair interactions (electrostatics, disulfides)
- strand pairing (hydrogen bonding)
- strand arrangement into sheets
- helix-strand packing
- steric repulsion
The search progressively adds scoring-function terms:
- initially only the steric overlap term is used
- then all but the compactness terms are used
The search is initiated from different random seeds. For some applications, an atomic-level scoring function is used.

Critical Assessment of protein Structure Prediction (CASP)
A community-wide, worldwide experiment for protein structure prediction that has been held every two years since 1994. Evaluation of the results is carried out in the following prediction categories:
- tertiary structure prediction (all CASPs), divided into template-based and template-free methods
- secondary structure prediction (dropped after CASP5)
- prediction of structure complexes (CASP2 only; now a separate experiment, CAPRI)
- residue-residue contact prediction (starting CASP4)
- disordered regions prediction (starting CASP5)
- domain boundary prediction (CASP6-CASP8)
- function prediction (starting CASP6)
- model quality assessment (starting CASP7)
- model refinement (starting CASP7)
- high-accuracy template-based prediction (starting CASP7)