Lecture 12 CS566 1
Structural Bioinformatics
• Motivation
• Concepts
• Structure Prediction
• Summary
Lecture 12 CS566 2
Motivation
• Holy Grail: Mapping between sequence and structure. Structure = F(Sequence). What is F?
• Why
  – Structure dictates chemistry, thermodynamics and therefore function
  – Not all structures can be (need be?) determined experimentally
    • Cost
    • Experimental limitations
Lecture 12 CS566 3
Concepts – Prediction spectrum
Decreasing reliance on known structures (left to right):
Homology Modeling → Threading → ab initio → Quantum Mechanics
Lecture 12 CS566 4
Concepts - Common Principles
• Constraints to reduce search space
• Consideration of many alternate conformations
  – Protein backbone dihedral angles (‘Twists along axis of protein’)
  – Amino-acid geometry (‘Amino-acids can have more than one shape’)
• Method for local optimization
• Scoring function to compare conformations
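
To make the backbone dihedral angles above concrete, here is a minimal sketch that computes the torsion angle defined by four points in space (for example, four consecutive backbone atoms). The function name `dihedral_degrees` and the helper names are assumptions made for illustration, not part of the lecture. The sign follows the usual convention in which a cis arrangement gives 0° and a trans arrangement gives ±180°.

```python
import math

def _sub(a, b):
    """Component-wise difference of two 3D points."""
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral_degrees(p0, p1, p2, p3):
    """Torsion angle (degrees) about the p1-p2 axis for four 3D points."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)          # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)          # normal of the plane through p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))
```

A structure predictor that searches over dihedral angles, rather than over raw atom coordinates, works in a much smaller space, which is one way the constraints on this slide reduce the search.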
Lecture 12 CS566 5
Evaluation of quality of prediction
• RMSD comparison with experimentally known structure
• Comparison with crystal structure quality criteria
  – Ramachandran Plot
    • Residue specific dihedral angle distribution
• CASP (Critical Assessment of Structure Prediction) and CAFASP (Critical Assessment of Fully Automated Structure Prediction) competitions
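
As a rough illustration of the RMSD comparison listed above, the sketch below computes the root-mean-square deviation between two equally sized lists of 3D coordinates. It assumes the predicted and experimental structures have already been superimposed and that atoms are listed in matching order; the optimal superposition step is deliberately left out. The function name `rmsd` is an assumption for illustration.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between two pre-superimposed coordinate lists of (x, y, z) tuples."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))
```

When the coordinates are in ångströms, the result is in ångströms as well, which is the unit in which prediction quality is usually reported.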
Lecture 12 CS566 6
Methods
• Knowledge-based constraints of search space
  – Homology Modeling
  – Threading
  – ab initio (Based on knowledge primitives: not true ab initio)
• Approaches to refinement
  – Quantum mechanics (ab initio)
    • Based on quantum mechanical model of elementary particles
    • Unscalable
  – Molecular mechanics
    • Uses parametric Force Fields (Newton’s laws, Hooke’s law, …)
    • Typically used for local or constrained global optimization
    • Molecular Dynamics or Monte Carlo-based
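
To illustrate what a parametric force-field term based on Hooke's law might look like, here is a minimal sketch of a harmonic bond-stretch energy together with a naive gradient-descent local optimization of a single bond length. The function names, the step size, and any parameter values passed in are assumptions for illustration; real force fields combine many such terms (angles, torsions, non-bonded interactions) with carefully fitted parameters.

```python
def bond_energy(r, k, r0):
    """Hooke's-law bond energy: zero at the equilibrium length r0."""
    return 0.5 * k * (r - r0) ** 2

def minimize_bond_length(r, k, r0, step=0.001, n_steps=500):
    """Plain gradient descent on bond_energy.

    Converges only when step * k < 2, so the step size has to be
    chosen with the force constant in mind.
    """
    for _ in range(n_steps):
        gradient = k * (r - r0)   # dE/dr
        r -= step * gradient
    return r
```

This is local optimization in the sense used on the slide: the search only slides downhill from its starting point and cannot escape to a different energy basin.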
Lecture 12 CS566 7
Homology modeling
• Homology
  – Based on sequence-sequence similarity (> ~25%; the higher, the better)
  – Steps
    • Pair-wise local sequence similarity to identify related structures (possible templates)
    • Refine alignment by global pair-wise sequence similarity and MSA (multiple sequence alignment)
    • Overlay sequence backbone (N-C-C) on template
    • Model loops based on
      – Statistical knowledge from databases of known structures
      – Molecular mechanics
    • Model side-chains (approach similar to that of loops)
    • Molecular mechanical unconstrained local optimization
    • Pray for a good solution!
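
The template-selection criterion above can be sketched as a simple check on an existing pairwise alignment. This example uses percent identity over aligned, non-gap columns as a stand-in for the slide's "similarity", which is a simplifying assumption; the function names and the way the ~25% cutoff is applied are likewise illustrative only.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of non-gap aligned columns where both sequences agree."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue              # skip columns that contain a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0

def is_candidate_template(aligned_query, aligned_template, threshold=25.0):
    """Apply the rough ~25% cutoff from the slide (assumed here to mean identity)."""
    return percent_identity(aligned_query, aligned_template) > threshold
```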
Lecture 12 CS566 8
Threading
• Based on sequence-structure similarity
• Concept
  – Residues in core adopt fewer conformations than surface
• Approach
  – Thread sequence through all known structures
  – Score match with core of each structure based on
    • Environmental scoring matrices and/or
    • Amino acid neighborhood matrices (a la Dot matrix)
  – Refine structure using molecular mechanics based on best template(s)
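
A toy version of the threading approach above might look like the following. Each template is reduced to a string of per-position environments ("B" for buried core, "E" for exposed surface), and the scoring rule simply rewards hydrophobic residues in buried positions and other residues in exposed positions. The residue grouping, the +1/-1 scores, and the gapless placement are invented for illustration; real environmental scoring matrices are derived from known structures, and real threading allows gaps.

```python
# Toy grouping of residues treated as hydrophobic (an assumption for this sketch).
HYDROPHOBIC = set("AVLIMFWC")

def env_score(residue, environment):
    """+1 if the residue suits the environment ('B' buried, 'E' exposed), else -1."""
    fits = (residue in HYDROPHOBIC) == (environment == "B")
    return 1 if fits else -1

def thread_score(sequence, template_env):
    """Gapless score of a sequence laid onto a template's environment string."""
    return sum(env_score(r, e) for r, e in zip(sequence, template_env))

def best_template(sequence, templates):
    """Name of the template (dict: name -> environment string) with the top score."""
    return max(templates, key=lambda name: thread_score(sequence, templates[name]))
```

For example, `best_template("LKLK", {"t1": "BEBE", "t2": "EBEB"})` picks `"t1"`, because the leucines land in buried positions and the lysines in exposed ones.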
Lecture 12 CS566 9
Rosetta (“ab initio”) Approach
• Pioneered by David Baker’s group in the late 1990s
• Remarkable success in CASP and CAFASP experiments
• Recently made publicly available on an automated server by Christopher Bystroff’s group
• Potpourri of many different approaches
• Key components
  – ‘Divide and conquer’ strategy with respect to length of sequence to be modeled
  – Use of knowledge based energy function
Lecture 12 CS566 10
‘Divide and conquer’
• Mimics natural process of protein folding
• Compromise between extremes of
  – Looking for homologous sequences with known structure
  – Modeling a priori (one amino acid at a time)
• Use library of 3D structures of fragments of length 3 and 9 derived from the crystal structure database (a priori estimates = 8K and ~10^12)
• Break up query sequence into a set of 3mers and 9mers, to find matches with above library, using a sequence profile approach
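
The a priori estimates quoted above follow from the 20 standard amino acids: 20^3 = 8,000 possible 3mers, and 20^9 ≈ 5 × 10^11 possible 9mers, which is on the order of 10^12. The sketch below checks that arithmetic and shows one simple way to break a query sequence into overlapping 3mers and 9mers. The function name `kmers` is an assumption, and the profile-based matching against the fragment library is not shown.

```python
N_AMINO_ACIDS = 20

# A priori number of distinct fragments of each length.
possible_3mers = N_AMINO_ACIDS ** 3   # 8,000
possible_9mers = N_AMINO_ACIDS ** 9   # 512,000,000,000, on the order of 10^12

def kmers(sequence, k):
    """All overlapping windows of length k, in order of start position."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
```

A query of length L yields L - 2 overlapping 3mers and L - 8 overlapping 9mers, so the number of windows grows only linearly with sequence length.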
Lecture 12 CS566 11
‘Divide and conquer’
• Once matches found, reduces to combinatorial problem of selecting best set of fragments with most energetically favorable structure
• In practice, Monte Carlo based search of possible combinations is carried out.
Lecture 12 CS566 12
Knowledge based energy function
• Fundamentally, ∆G = ∆H − T∆S
• The change in free energy is the change in enthalpy less an entropic term that is proportional to temperature
• Entropy is proportional to the natural log of the number of conformations/possible states
S = k ln W (k is Boltzmann’s constant, W the number of states)
Lecture 12 CS566 13
Knowledge based energy function
• Hence makes sense to use existing distribution of structures to derive energy function
• Energy function is based on taking statistical distribution of 3D shapes in database of known structures as the underlying probability distribution
• For a given structure, deviations from probability distribution are subject to proportional energetic penalties
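
One common way to turn an observed distribution into an energy is the inverse Boltzmann relation, E = -kT ln p, so that rarely observed states receive a larger penalty. The slides do not state the exact functional form used, so the following is a sketch under that assumption, with energies expressed in units of kT and with function names chosen for illustration.

```python
import math
from collections import Counter

def knowledge_based_energies(observations):
    """Map each observed state to -ln(frequency), i.e. an energy in units of kT."""
    counts = Counter(observations)
    total = sum(counts.values())
    return {state: -math.log(n / total) for state, n in counts.items()}

def structure_energy(states, energies):
    """Sum of per-position energies; a state never seen in the database raises KeyError."""
    return sum(energies[s] for s in states)
```

With a database in which a state occurs three times out of four, that state gets an energy of -ln(0.75) ≈ 0.29 kT, while the state seen once gets -ln(0.25) ≈ 1.39 kT.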
Lecture 12 CS566 14
Rosetta – Steps used in CASP4
1. If possible, use PSI-BLAST to find similar sequences
A. If found, use the multiple sequence alignment to break down sequence into domains to be modeled independently
B. For domains with similarity to known structures, use Homology based approach
C. For remaining domains, carry out Rosetta
Lecture 12 CS566 15
Rosetta - Steps
2. For domains with similarity to other sequences, apply following steps to the homologs as well (consensus modeling)
3. Generate fragment library for each query
A. Collect 3mer and 9mer sub-structures from the PDB with similarity to 3mer and 9mer subsequences
4. Use Monte Carlo approach for backbone fragment substitution into query
A. Pick a fragment at random from library (~40,000 fragment substitutions for each structure)
B. Repeat A several times
C. Between 10K and 100K conformations (‘decoys’) generated for each target
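
The Monte Carlo fragment substitution in step 4 can be sketched as follows. This is a toy version under several assumptions: a conformation is just a list with one number per residue (standing in for backbone angles), the fragment library maps a start position to a list of candidate fragments, the energy function is supplied by the caller, and moves are accepted with the Metropolis criterion. None of these details are taken from the Rosetta implementation.

```python
import math
import random

def fragment_monte_carlo(n_residues, library, energy,
                         n_moves=1000, temperature=1.0, seed=0):
    """Return (best_conformation, best_energy) after random fragment substitutions.

    library: dict mapping start index -> list of fragments (sequences of floats).
    Every fragment must fit inside the chain (start + len(fragment) <= n_residues).
    """
    rng = random.Random(seed)
    conformation = [0.0] * n_residues
    current_e = energy(conformation)
    best, best_e = list(conformation), current_e
    positions = sorted(library)
    for _ in range(n_moves):
        start = rng.choice(positions)              # pick an insertion window
        fragment = rng.choice(library[start])      # pick a fragment at random
        trial = list(conformation)
        trial[start:start + len(fragment)] = fragment
        trial_e = energy(trial)
        delta = trial_e - current_e
        # Metropolis criterion: always accept downhill, sometimes accept uphill.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            conformation, current_e = trial, trial_e
            if current_e < best_e:
                best, best_e = list(conformation), current_e
    return best, best_e
```

Running the function repeatedly with different `seed` values gives a set of distinct low-energy conformations, which corresponds to the decoy set described in step 4C.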
Lecture 12 CS566 16
Rosetta - Steps
5. Filter set of conformations to remove unlikely structures
A. Remove structures with minimal long range interactions (low contact order)
B. Remove structures with unrealistic strands
6. Add side chains as statistically predicted by the backbone conformation
7. Cluster set of conformations (including, when available, the generated structures of homologues)
8. Representative structures from the top 5 most-populous clusters are candidate structures
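
The contact-order filter in step 5A can be illustrated with the relative contact order: the average sequence separation between residues in contact, divided by the chain length. In this sketch each residue is represented by a single 3D point, and the distance cutoff and minimum sequence separation are assumed values for illustration rather than the settings used in CASP4. It relies on `math.dist`, available in Python 3.8 and later.

```python
import math

def relative_contact_order(coords, cutoff=8.0, min_separation=1):
    """Mean |i - j| over contacting residue pairs, divided by chain length.

    coords: one (x, y, z) point per residue. Two residues are in contact
    when their points lie within `cutoff` of each other.
    """
    n = len(coords)
    total_separation = 0
    n_contacts = 0
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                total_separation += j - i
                n_contacts += 1
    if n_contacts == 0:
        return 0.0
    return total_separation / (n * n_contacts)
```

A decoy whose contacts are almost all between sequence neighbors scores low on this measure, and the filter would discard it as having too few long-range interactions.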
Lecture 12 CS566 17
Summary
• Methods like Rosetta represent a breakthrough in the ab initio prediction of protein 3D structure and are very useful in cases where homology cannot be observed
• For CASP4, at least one subsequence longer than 50 residues could be predicted ‘correctly’ (< 6.5 Å RMSD) in 17 of 21 cases
• Combination of various approaches works best
Lecture 12 CS566 18
Summary
• However, both completeness and accuracy of prediction leave ample room for improvement
  – RMS error frequently too high to be useful
  – Even in homology modeling, template per se is often better match!
  – Often, only subsequences are accurately modeled, and not the whole structure
  – The Nobel Prize is still up for grabs!