lecture 12 cs5661 structural bioinformatics motivation concepts structure prediction summary

Lecture 12 CS566 1

Structural Bioinformatics

• Motivation

• Concepts

• Structure Prediction

• Summary

Lecture 12 CS566 2

Motivation• Holy Grail: Mapping between sequence and

structure. Structure = F(Sequence). What is F?

• Why– Structure dictates chemistry, thermodynamics and

therefore function– Not all structures can be (need be?) determined

experimentally• Cost

• Experimental limitations

Lecture 12 CS566 3

Concepts – Prediction spectrum

Decreasing reliance on known structures

HomologyModeling

Threading ab initio Quantum Mechanics

Lecture 12 CS566 4

Concepts - Common Principles• Constraints to reduce search space • Consideration of many alternate conformations

– Protein backbone dihedral angles (‘Twists along axis of protein’)

– Amino-acid geometry (‘Amino-acids can have more than one shape’)

• Method for local optimization• Scoring function to compare conformations

Lecture 12 CS566 5

Evaluation of quality of prediction

• RMSD comparison with experimentally known structure

• Comparison with crystal structure quality criteria– Ramachandran Plot

• Residue specific dihedral angle distribution

• CASP (Critical assessment of structure prediction) and CAFASP (..Fully Automated..) competitions

Lecture 12 CS566 6

Methods• Knowledge-based constraints of search space

– Homology Modeling– Threading – ab initio (Based on knowledge primitives: not true ab initio)

• Approaches to refinement– Quantum mechanics (ab initio)

• Based on quantum mechanical model of elementary particles• Unscalable

– Molecular mechanics• Uses parametric Force Fields (Newton’s laws, Hooke’s law, …)• Typically used for local or constrained global optimization• Molecular Dynamics or Monte Carlo-based

Lecture 12 CS566 7

Homology modeling• Homology

– Based on sequence-sequence similarity ( > ~25%, the higher, the better)

– Steps• Pair-wise local sequence similarity to identify related structures (possible

templates)• Refine alignment by global pair-wise sequence similarity and msa• Overlay sequence backbone (N-C-C) on template • Model loops based on

– Statistical knowledge from databases of known structures– Molecular mechanics

• Model side-chains (approach similar to that of loops)• Molecular mechanical unconstrained local optimization• Pray for a good solution!

Lecture 12 CS566 8

Threading • Based on sequence-structure similarity• Concept

– Residues in core adopt fewer conformations than surface

• Approach– Thread sequence through all known structures– Score match with core of each structure based on

• Environmental scoring matrices and/or• Amino acid neighborhood matrices (a la Dot matrix)

– Refine structure using molecular mechanics based on best template(s)

Lecture 12 CS566 9

Rosetta (“ab initio”) Approach

• Pioneered by David Baker’s group in the late 1990s• Remarkable success in CASP and CAFASP experiments• Recently made publicly available on an automated server

by Christopher Bystroff’s group• Pot pourri of many different approaches• Key components

– ‘Divide and conquer’ strategy with respect to length of sequence to be modeled

– Use of knowledge based energy function

Lecture 12 CS566 10

‘Divide and conquer’• Mimics natural process of protein folding• Compromise between extremes of

– Looking for homologous sequences with known structure

– Modeling a priori (one amino acid at a time)• Use library of 3D structures of fragments of length

3 and 9 derived from the crystal structure database (a priori estimates = 8K and ~ 1012).

• Break up query sequence into a set of 3mers and 9mers, to find matches with above library – using a sequence profile approach

Lecture 12 CS566 11

‘Divide and conquer’

• Once matches found, reduces to combinatorial problem of selecting best set of fragments with most energetically favorable structure

• In practice, Monte Carlo based search of possible combinations is carried out.

Lecture 12 CS566 12

Knowledge based energy function

• Fundamentally,∆G = ∆H - T ∆S

• Free energy is the enthalpy less an entropic term that is proportional to temperature

• Entropy is proportional to the natural log of the number of conformations/possible states

S = K ln W

Lecture 12 CS566 13

Knowledge based energy function

• Hence makes sense to use existing distribution of structures to derive energy function

• Energy function is based on taking statistical distribution of 3D shapes in database of known structures as the underlying probability distribution

• For a given structure, deviations from probability distribution are subject to proportional energetic penalties

Lecture 12 CS566 14

Rosetta – Steps used in CASP4

1. If possible, use PSI-BLAST to find similar sequences

A. If found, use the multiple sequence alignment to break down sequence into domains to be modeled independently

B. For domains with similarity to known structures, use Homology based approach

C. For remaining domains, carry out Rosetta

Lecture 12 CS566 15

Rosetta - Steps

2. For domains with similarity to other sequences, apply following steps to the homologs as well (consensus modeling)

3. Generate fragment library for each queryA. Collect 3mer and 9mer sub-structures from the PDB with

similarity to 3mer and 9mer subsequences

4. Use Monte Carlo approach for backbone fragment substitution into query

A. Pick a fragment at random from library (~40,000 fragment substitutions for each structure)

B. Repeat A several timesC. Between 10K and 100K conformations (‘decoys’) generated for

each target

Lecture 12 CS566 16

Rosetta - Steps

5. Filter set of conformations to remove unlikely structuresA. Remove structures with minimal long range interactions (low

contact order)B. Remove structures with unrealistic strands

6. Add side chains as statistically predicted by the backbone conformation

7. Cluster set of conformations (including, when available, the generated structures of homologues)

8. Representative structures from the top 5 most-populous clusters are candidate structures

Lecture 12 CS566 17

Summary• Methods like Rosetta represents a breakthrough in

the ab initio prediction of protein 3D structure and are very useful in cases where homology cannot be observed

• For CASP4, at least one subsequence longer than 50 residues could be predicted ‘correctly’ (< 6.5 rmsd) in 17 of 21 cases

• Combination of various approaches works best

Lecture 12 CS566 18

Summary• However, both completeness and accuracy

of prediction leave ample room for improvement– RMS error frequently too high to be useful– Even in homology modeling, template per se is

often better match!– Often, only subsequences are accurately

modeled, and not the whole structure– The Nobel Prize is still up for grabs!

lecture 12 cs5661 structural bioinformatics motivation concepts structure prediction summary

Documents