Monte-Carlo Simulated Annealing and its Application to Build Protein 3D Models Using Residue-residue Contacts

Badri Adhikari

(Advisor: Dr. Jianlin Cheng)

Department of Computer Science

University of Missouri

Columbia MO 65211

2/26/2015



Motivation

“We are now entering a phase in which the evolutionary information in the genetic sequences of the living system is being rapidly read using advanced sequencing technology.”

“Using the resulting massive sequence data sets, successful decoding of the molecular record of evolutionary constraints could now reveal structural and functional information about proteins at an unprecedented rate.”

What does a PDB file look like?

[Figure: three views of a protein structure: cartoon representation, carbon-alpha trace, all atoms]

What are residue-residue contacts in proteins?

Definition of a contact

A pair of residues i and j are in contact if the distance between their Cβ atoms is less than or equal to 8 Å.

Other definitions are also used:
● use of Cα instead of Cβ
● use of lower or higher distance thresholds like 7 Å or 12 Å
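The definition above can be sketched in a few lines of Python. This is a minimal illustration, assuming the Cβ coordinates are already available as (x, y, z) tuples in Å; the function name `contact_map` and the toy coordinates are hypothetical and not taken from the talk.

```python
import numpy as np

def contact_map(cb_coords, threshold=8.0):
    """Return a boolean L x L matrix that is True where the
    Cβ-Cβ distance is less than or equal to `threshold` (in Å)."""
    coords = np.asarray(cb_coords, dtype=float)
    # Pairwise difference vectors, shape (L, L, 3)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    contacts = dist <= threshold
    # A residue is not counted as being in contact with itself
    np.fill_diagonal(contacts, False)
    return contacts

# Hypothetical Cβ coordinates for four residues (made-up values, in Å)
toy_cb = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
cmap_8 = contact_map(toy_cb)            # default 8 Å threshold
cmap_12 = contact_map(toy_cb, 12.0)     # looser 12 Å threshold
```

Raising the threshold can only add contacts, which is why the 12 Å and 20 Å maps are denser than the 8 Å map.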

Contact threshold

[Figure: contacts at 8 Å threshold, at 12 Å threshold, and at 20 Å threshold]

How well can we predict contacts?

Recent protein contact predictors
- MetaPSICOV: Prof. David Jones, University College London (UCL)
- GREMLIN: Prof. David Baker, University of Washington
- CCMpred: Johannes Söding, University of Munich (LMU Munich)
- FreeContact: Dr. Burkhard Rost, Technische Universität München (TUM)
- DNcon: Dr. Jianlin Cheng, University of Missouri - Columbia

Method: Model
- DCA: Inverse Potts Model, maximum entropy
- EVcouplings/mfDCA: Inverse Potts Model, maximum entropy
- plmDCA: Inverse Potts Model, maximum likelihood
- GREMLIN: Inverse Potts Model, maximum likelihood
- PSICOV: Sparse inverse covariance estimation
- RLS: Regularized Least Squares inverse covariance
- mdMI: Multi-dimensional Mutual Information
- gaussDCA: Continuous maximum entropy inverse Potts model

http://www.predictioncenter.org/casp11/doc/presentations/CASP11_RR_AK.pdf

MetaPSICOV, DNcon

Can we RE-construct proteins using residue contacts?

Adding secondary structure as well

Protein structure prediction using residue contacts

“Although we never reached RMSD <5 Å (the lowest average RMSD equals 7.4 Å), we note that, even in presence of extremely noisy and erroneous contact maps, many reconstructed structures have average RMSD lower than 10 Å.”

EVFOLD vs FRAGFOLD

- building models from scratch vs building models using fragments

- “Our de novo folding protocol for a medium-size protein using evolutionarily derived constraints does not require high-performance computing and can be done in well under an hour on a standard laptop computer.”

Examples of State Space Search Problems

Travelling Salesman Problem

http://en.wikipedia.org/wiki/File:Bruteforce.gif

Given a list of cities and the distances between each pair of cities, what is the shortest possible route that visits each city exactly once and returns to the origin city?

Solution to a symmetric TSP with 7 cities using brute force search. Note: Number of permutations: (7-1)!/2 = 360
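The tour count quoted above, and the brute-force search itself, can be sketched as follows. This is an illustrative example only; the function names and the toy city coordinates are hypothetical. Fixing the start city leaves (n-1)! orderings, and in the symmetric case each tour appears twice (once per direction), which gives (n-1)!/2 distinct tours.

```python
from itertools import permutations
from math import dist, factorial

def count_distinct_tours(n_cities):
    """Number of distinct tours of a symmetric TSP: (n-1)!/2."""
    return factorial(n_cities - 1) // 2

def brute_force_tsp(cities):
    """Try every ordering of the remaining cities from a fixed start city.
    Each tour is visited in both directions, which is redundant but simple."""
    best_len, best_tour = float("inf"), None
    for perm in permutations(range(1, len(cities))):
        tour = (0, *perm, 0)  # start and end at the origin city
        length = sum(dist(cities[a], cities[b]) for a, b in zip(tour, tour[1:]))
        if length < best_len:
            best_len, best_tour = length, tour
    return best_len, best_tour

# Hypothetical example: four cities on the corners of a unit square
square = [(0, 0), (0, 1), (1, 1), (1, 0)]
square_len, square_tour = brute_force_tsp(square)
```

The factorial growth of this count is what makes exhaustive search impractical for larger problems and motivates stochastic search.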

Configuration of Atoms

Protein Fold Space Search Problem

backbone Carbon-alpha atoms

Choosing a move!

- Computation time for a single function evaluation can be large!

- Can we compute energy of all configurations?

- How big is the design space?

Global vs Local Minima

Hill Climbing and Random Walk

Hill Climbing

Random Walk

Monte Carlo Simulated Annealing

a search algorithm

Annealing
- we want to produce materials with good properties, like strength
- involves creating a liquid version and then solidifying it (example: casting)
- it is desirable to arrange the atoms in a systematic fashion, which in other words corresponds to low energy
- we want minimum energy

Annealing
- the physical process of controlled cooling

- make a move with some probability
- such that, if the move is a good one, the probability is high
- we want the probability to be influenced by ΔE
- allow good moves (high ΔE) with high probability, and allow bad moves with low probability

Introduce probability

control how ΔE influences probability
- Function
- range should be between 0 and 1 (probability)
- domain should be infinite
- Sigmoid function

[Figure: sigmoid function plotted against ΔE, with ΔE=0 marked]

control how ΔE influences probability

Stochastic Hill Climbing

Effect of ΔE and T

[Figure: effect of ΔE and T; labels: HC, RW, worse … better]

Effect of Temperature

[Figure: sigmoid function around ΔE=0, shown at low T and at high T]
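The effect of temperature on a sigmoid acceptance probability can be checked numerically. The sketch below assumes the common form P = 1 / (1 + exp(-ΔE / T)) with ΔE > 0 meaning a better move (maximization); the slides name the sigmoid but do not spell out the formula, so both the formula and the function name are assumptions.

```python
import math

def sigmoid_accept_probability(delta_e, temperature):
    """Assumed form: P = 1 / (1 + exp(-ΔE / T)), where ΔE > 0 is a better move."""
    x = -delta_e / temperature
    if x > 700.0:
        # exp(x) would overflow a float; the probability is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(x))

# Low T: nearly deterministic, behaves like Hill Climbing (HC)
p_good_low_t = sigmoid_accept_probability(+5.0, 0.1)
p_bad_low_t = sigmoid_accept_probability(-5.0, 0.1)

# High T: close to 0.5 regardless of ΔE, behaves like Random Walk (RW)
p_good_high_t = sigmoid_accept_probability(+5.0, 1000.0)
p_bad_high_t = sigmoid_accept_probability(-5.0, 1000.0)
```

Under this assumed form, a move with ΔE = 0 is accepted with probability 0.5 at any temperature, and T only controls how sharply the curve switches between rejecting and accepting.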

How to choose value for T ?

- we follow the physical world approach
- cool the system gradually and hope that the system will settle to an optimal state
- we initialize T to some high value
- gradually decrease T
- using some monotonically decreasing function (cooling rate)
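One common choice of monotonically decreasing function is geometric cooling, where T is multiplied by a constant factor at each outer-loop step. The sketch below is an assumption for illustration; the slides do not say which cooling function is used, and the name `geometric_cooling` and the parameter `alpha` are hypothetical.

```python
def geometric_cooling(t_start, t_end, alpha):
    """Yield t_start, alpha * t_start, alpha^2 * t_start, ... while T > t_end.
    `alpha` is the (assumed) cooling rate, strictly between 0 and 1."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")
    if t_start <= 0.0 or t_end <= 0.0:
        raise ValueError("temperatures must be positive for geometric cooling")
    t = t_start
    while t > t_end:
        yield t
        t *= alpha

# Example: start high, halve the temperature each step, stop below 10
example_schedule = list(geometric_cooling(100.0, 10.0, 0.5))
```

A value of `alpha` closer to 1 cools more slowly and spends more iterations at each temperature range, at the cost of a longer run.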

Inner loop / epoch

Outer loop

Simulated Annealing

SA Block Diagram

Intuition behind SA

water bubble rings maze

SA works well where the energy surface is jagged and Hill Climbing would get stuck in a local optimum.

Intuition behind SA

[Figure: 1D surface with labeled points A, B, and C and energy axis E; we want to maximize]

Original Paper introducing the idea

The Analogy

Statistical Mechanics: the behavior of systems with many degrees of freedom in thermal equilibrium at a finite temperature.

Combinatorial Optimization: finding the minimum of a given function depending on many variables.

Analogy

● If a liquid material cools and anneals too quickly, then the material will solidify into a sub-optimal configuration.

● If the liquid material cools slowly, the crystals within the material will solidify optimally into a state of minimum energy (i.e. ground state).

● This ground state corresponds to the minimum of the cost function in an optimization problem.

Key Ingredients for SA

1. A concise description of a configuration (architecture, design, topology) of the system (Design Vector).

2. A random generator of rearrangements of the elements in a configuration (Neighborhoods). This generator encapsulates rules so as to generate only valid configurations. Perturbation function.

3. A quantitative objective function containing the trade-offs that have to be made (Simulation Model and Output Metric(s)). Surrogate for system energy.

4. An annealing schedule of the temperatures and/or the length of times for which the system is to be evolved.
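The four ingredients above map directly onto a small program. The following is a minimal sketch on a toy 1D problem, not the implementation used in the talk: the configuration is a single float, `small_step` is the perturbation function, `jagged_energy` is a made-up objective, and geometric cooling with rate `alpha` is an assumed schedule. It minimizes energy, so a move with ΔE < 0 is a good move.

```python
import math
import random

def simulated_annealing(energy, perturb, x0, t_start, t_end, alpha, steps_per_temp, rng):
    """Generic SA loop: outer loop cools T, inner loop (epoch) samples moves at fixed T."""
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    t = t_start
    while t > t_end:                          # outer loop: annealing schedule
        for _ in range(steps_per_temp):       # inner loop / epoch
            cand = perturb(x, rng)            # ingredient 2: random rearrangement
            cand_e = energy(cand)             # ingredient 3: objective function
            delta = cand_e - e
            # Good moves are always accepted; bad moves with probability exp(-ΔE/T)
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                x, e = cand, cand_e
                if e < best_e:
                    best_x, best_e = x, e
        t *= alpha                            # ingredient 4: cool down
    return best_x, best_e

def jagged_energy(x):
    """Hypothetical jagged 1D energy surface with many local minima."""
    return 0.1 * x * x + math.sin(5.0 * x)

def small_step(x, rng):
    """Hypothetical perturbation: a small random step, kept inside [-10, 10]."""
    return min(10.0, max(-10.0, x + rng.uniform(-0.5, 0.5)))

start_x = 8.0                                 # ingredient 1: the configuration
best_x, best_e = simulated_annealing(
    jagged_energy, small_step, start_x,
    t_start=10.0, t_end=0.01, alpha=0.95, steps_per_temp=50,
    rng=random.Random(0),
)
```

Tracking the best configuration seen so far is a design choice of this sketch; it guarantees the returned energy is never worse than the starting energy, even if the walk later drifts away.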

How to select initial and final temperature?

empirically

Fragment assembly based protein structure prediction using residue contacts

Rosetta

FRAGFOLD

Reconstruction of the protein 1GUU
- Initial Temperature = 50 (length of the protein)
- Final Temperature = 0.01
- Number of SA iterations = 3200
- Number of true Cβ contacts = 45
- Contact energy = Root Mean Square Deviations

A movie clip demonstrating an application of Monte Carlo simulated annealing to reconstruct a small alpha-helical protein, 1GUU. https://www.youtube.com/watch?v=2p5x7XROxlo

Simulated Annealing Implemented in CNS suite v1.3

CNS suite v1.3
- Crystallography & NMR System (CNS)
- http://cns-online.org/v1.3/
- is the result of an international collaborative effort among several research groups
- provides the most commonly used algorithms in macromolecular structure determination
- Examples: crystallographic refinement and NMR structure calculation using NOEs
- Distance Geometry Simulated Annealing algorithm (dg_sa.inp script)

DGSA algorithm
- Distance Geometry Simulated Annealing (DGSA) implemented in CNS is based on the Havel and Crippen distance geometry algorithm.
- At first, all known bond lengths, bond angles, dihedral angles, planarity restraints and van der Waals radii, together with NOE distance ranges, are translated into upper and lower bounds on distances between the atoms involved.
- Then, two distance matrices are generated: a matrix of lower bounds and a matrix of upper bounds.
- To obtain a trial structure, a distance matrix that gives rise to a single trial structure is generated by selecting a random distance that lies between the upper and lower restraints for each residue pair.
- This is followed by simulated annealing to regularize and refine the structure.
- A final step of energy minimization is performed using 10 cycles of 200 steps of Powell minimization.
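The random-selection step in the list above can be illustrated with NumPy. This is only a sketch of that one step, under the assumption that each distance is drawn uniformly between its bounds; it is not CNS code and it omits bound smoothing, embedding into 3D coordinates, and the annealing and minimization stages. The function name and the toy bounds are hypothetical.

```python
import numpy as np

def sample_trial_distances(lower, upper, rng):
    """Draw a symmetric trial distance matrix with lower <= d <= upper for each pair."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    if lower.shape != upper.shape or lower.ndim != 2 or lower.shape[0] != lower.shape[1]:
        raise ValueError("bounds must be square matrices of the same shape")
    if np.any(lower > upper):
        raise ValueError("every lower bound must not exceed its upper bound")
    n = lower.shape[0]
    upper_tri = np.triu_indices(n, k=1)       # each pair (i, j) with i < j once
    trial = np.zeros((n, n))
    trial[upper_tri] = rng.uniform(lower[upper_tri], upper[upper_tri])
    return trial + trial.T                    # mirror to make it symmetric

# Hypothetical bounds (in Å) for three points; values are made up
toy_lower = [[0.0, 1.3, 2.0], [1.3, 0.0, 1.3], [2.0, 1.3, 0.0]]
toy_upper = [[0.0, 1.6, 8.0], [1.6, 0.0, 1.6], [8.0, 1.6, 0.0]]
trial_matrix = sample_trial_distances(toy_lower, toy_upper, np.random.default_rng(0))
```

Independently sampled distances are generally not consistent with any single 3D structure, which is one reason the trial structure still needs the simulated annealing and minimization steps that follow.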

SA parameters

High Temperature Dynamics
- Temperature = 2000
- Number of steps = 1000

Cooling Stage
- Starting Temperature = 2000
- Decrease the temperature in steps of 25
- Final temperature = 0
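The cooling stage parameters above describe a linear schedule. As a rough illustration (not the CNS script itself, and assuming both the starting and the final temperature are included), the temperature levels can be enumerated like this; the function name `linear_cooling` is hypothetical.

```python
def linear_cooling(t_start, t_final, t_step):
    """Return temperatures from t_start down to t_final in decrements of t_step."""
    if t_step <= 0:
        raise ValueError("t_step must be positive")
    if t_start < t_final:
        raise ValueError("t_start must not be below t_final")
    temps = []
    t = t_start
    while t >= t_final:
        temps.append(t)
        t -= t_step
    return temps

# Parameters from the slide: start at 2000, decrease by 25, end at 0
cooling_temps = linear_cooling(2000, 0, 25)
n_levels = len(cooling_temps)
```

Under these assumptions the schedule visits 81 temperature levels; how many dynamics steps are run at each level is a separate parameter not covered by this sketch.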

Energy Function

CONFOLD: Residue-residue Contact guided ab initio protein folding

CONFOLD http://protein.rnet.missouri.edu/confold/

Contact filtering from stage 1 to stage 2 for the protein 1NRV. (A) Superimposition of the best model in stage 1, reconstructed with top-0.6L contacts by CONFOLD (orange), with the native structure (green). The model has a TM-score of 0.50. Among the top-0.6L (60) contacts, 5 of the 8 erroneous contacts that were removed in stage 2 are visualized in the native structure along with the distance between their Cβ-Cβ atoms. The filtered, predicted contacts (20-59, 53-73, 30-36, 49-56, and 88-93) have Cβ-Cβ distances of 23, 23, 20, 12, and 9 Å, respectively, in the native structure. Each pair of residues predicted to be in contact is denoted by the same color. (B) Superimposition of the best model in stage 2, reconstructed with reduced/filtered top-0.6L contacts by CONFOLD (orange), with the native structure (green). The TM-score of the model is 0.61.

Acknowledgements

Additional Slides


Abstract

Currently, a considerable amount of effort is being spent to improve the prediction accuracy of protein residue-residue contacts in order to use them to build three-dimensional models of proteins. In this talk, we will introduce protein residue-residue contacts and the problem of folding proteins using residue-residue contacts, and then discuss what the Monte Carlo simulated annealing algorithm is and how it can be applied to fold proteins with and without the use of structural fragments. We will discuss a simple implementation of simulated annealing and its parameters to fold proteins using structural fragments. To explore how simulated annealing can be used to build proteins from scratch, we will discuss the implementation of the simulated annealing algorithm in the protein modeling tool CNS suite v1.3. In addition, we will present a method to extend the current implementation to incorporate protein secondary structure information, as implemented in the method CONFOLD.

Why sigmoid in theory and exponential in implementation?

● This is because in theory we are discussing the probability of both kinds of moves: good and bad.

● However, in the SA implementation, when we have a good move we accept the move with probability 1. So, it makes no sense to compute a probability when ΔE is good (<0 when minimizing and >0 when maximizing).

● We are only interested in computing the probability when the next move is a bad move.

● This we can do directly using the exponential term. The exponential term can range from 0 to 1 (because 1 divided by exp of any positive value lies between 0 and 1).

● This means that if ΔE is close to 0, exp(-ΔE/T) is close to 1 at high temperature, and if ΔE is very high the term is almost 0.
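The acceptance rule described in the bullets above can be written as a small function. This is a minimal sketch, with the function name and the `minimizing` flag introduced here for illustration: good moves return probability 1, and only bad moves use the exponential term exp(-ΔE/T).

```python
import math

def metropolis_accept_probability(delta_e, temperature, minimizing=True):
    """Probability of accepting a move with energy change delta_e at temperature T."""
    # Express the change as "how much worse" the move is, for either convention
    worsening = delta_e if minimizing else -delta_e
    if worsening <= 0:
        return 1.0                       # good (or neutral) move: always accept
    return math.exp(-worsening / temperature)

p_good = metropolis_accept_probability(-2.0, 1.0)        # good move when minimizing
p_bad = metropolis_accept_probability(2.0, 1.0)          # bad move, equals exp(-2)
p_tiny_bad_hot = metropolis_accept_probability(1e-6, 100.0)
p_huge_bad = metropolis_accept_probability(1000.0, 1.0)
```

Because the exponent is never positive on the bad-move branch, the result always lies between 0 and 1 and can be compared directly against a uniform random number.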

Why is it also called Monte-Carlo Simulated Annealing?

Metropolis criterion

http://link.springer.com/referenceworkentry/10.1007%2F978-0-387-74759-0_403

This is the algorithm.