Algorithms Exploiting the Chain Structure of Proteins

Itay Lotan, Computer Science

Proteins 101

Involved in all functions of our body: metabolism, motion, defense, etc.

Michael Levitt

Protein representation

Torsion angle model:

Cα model:

[Figure labels: Res(i), Res(i+1), Res(i+2), Res(i+3)]

Structure determination

Bernhard Rupp

X-ray crystallography

Outline

1. Fast energy computation during Monte Carlo simulation

2. Model completion for protein X-ray crystallography

3. Large scale computation of similarity

Exploit specific properties of proteins to perform the computation efficiently

Outline

Lotan, Schwarzer, Halperin* and Latombe. J. Comput. Bio. 2004 (to appear)

*CS Department, Tel-Aviv University

Monte Carlo simulation (MCS)

Popular method for sampling the conformation space of proteins:

Estimate thermodynamic quantities

Search for low-energy conformations and the folded structure

MCS: How it works

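1. Propose random change in conformation

2. Compute energy E of new conformation

3. Accept with probability:

$P(\text{accept}) = \min\left\{1,\; e^{-\Delta E / k_b T}\right\}$

where $\Delta E$ is the energy change from the current to the new conformation

Requires $\gg 10^6$ steps to sample adequately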
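The acceptance rule above can be sketched in a few lines. This is a minimal illustration, not the MCS software mentioned later in the talk: the names `accept_probability` and `mc_step`, the callbacks `propose` and `compute_energy`, and the choice of kcal/(mol K) units for the Boltzmann constant are all assumptions made for the example.

```python
import math
import random

# Boltzmann constant in kcal/(mol*K); the unit system is an assumption.
K_B = 0.0019872041


def accept_probability(delta_e, temperature):
    """Metropolis criterion: min(1, exp(-delta_e / (k_b * T)))."""
    if delta_e <= 0.0:
        return 1.0  # downhill or neutral moves are always accepted
    return math.exp(-delta_e / (K_B * temperature))


def mc_step(conformation, energy, propose, compute_energy, temperature, rng=random):
    """One MCS step: propose a change, compute its energy, accept or reject.

    `propose(conformation, rng)` and `compute_energy(conformation)` are
    hypothetical callbacks supplied by the caller.
    """
    candidate = propose(conformation, rng)
    candidate_energy = compute_energy(candidate)
    if rng.random() < accept_probability(candidate_energy - energy, temperature):
        return candidate, candidate_energy
    return conformation, energy
```

Since step 2 runs once per proposed move, the cost of `compute_energy` dominates a run of more than a million steps, which is what the rest of this part of the talk addresses.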

Energy function

Bonded terms:
  Bond lengths
  Bond angles
  Dihedral angles

Non-bonded terms:
  Van der Waals
  Electrostatic
  Heuristic: Go models, HP models, etc.

Pair-wise interactions

Cutoff distance (6 - 12Å)

Linear number of interactions contribute to energy (Halperin & Overmars ’98)

Challenge: Find all interacting pairs without enumerating all pairs

Related work

Computer Science

Bounding volume hierarchies for collision detection: Gottschalk et al. ’96; Larsen et al. ’00; Guibas et al. ’02

Space partition methods for collision detection: Faverjon ’84; Halperin & Overmars ’98

Collision detection for chains: Halperin et al. ’97; Guibas et al. ’02

Biology

Neighbor lists: Verlet ’67; Brooks et al. ’83

Grid: Quentrec & Brot ’73; Hockney et al. ’74; Van Gunsteren et al. ’84

Neighbor lists + grid: Yip & Elber ’89; Petrella ’02

Grid method

d: Cutoff distance

Linear complexity

Optimal in worst case
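For reference, the grid method can be sketched as follows: atoms are binned into cubic cells whose side equals the cutoff distance d, so every interacting pair lies in the same cell or in adjacent cells. The function name `interacting_pairs` and the representation of atoms as coordinate tuples are assumptions for this example.

```python
import math
from collections import defaultdict
from itertools import product


def interacting_pairs(points, cutoff):
    """Return sorted index pairs (i, j), i < j, at distance <= cutoff."""
    # Bin every point into a cubic cell of side `cutoff`.
    grid = defaultdict(list)
    for idx, p in enumerate(points):
        cell = tuple(int(math.floor(c / cutoff)) for c in p)
        grid[cell].append(idx)

    pairs = []
    for (cx, cy, cz), members in grid.items():
        # Only the 27 cells around (and including) this one can hold partners.
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            neighbors = grid.get((cx + dx, cy + dy, cz + dz), ())
            for i in members:
                for j in neighbors:
                    if i < j and math.dist(points[i], points[j]) <= cutoff:
                        pairs.append((i, j))
    return sorted(pairs)
```

Note that the whole grid is rebuilt and rescanned at every step, even when only a small part of the chain has moved.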

Contributions

Efficient maintenance and self-collision detection for kinematic chains

Efficient computation of pair-wise interactions in MCS of proteins

Scheme for caching and reusing partial energy sums during MCS

MCS software*

Much faster than existing algorithm (grid method)

*Download at: http://robotics.stanford.edu/~itayl/mcs

Properties of kinematic chains

Small changes, large effects

Local changes, global effects

Few DoF changes, long rigid sub-chains

ChainTree: A tale of two hierarchies

Transform hierarchy: approximates kinematics of protein backbone at successive resolutions

Bounding volume hierarchy: approximates geometry of protein at successive resolutions

Hierarchy of transforms

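[Figure: chain of links A–H. The joint transforms $T_{AB}, T_{BC}, T_{CD}, T_{DE}, T_{EF}, T_{FG}, T_{GH}, T_{HI}$ are combined pairwise into $T_{AC}, T_{CE}, T_{EG}, T_{GI}$, and those into $T_{AE}, T_{EI}$]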
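A hierarchy of transforms of this kind can be sketched as a balanced binary tree in which each internal node stores the composition of its two children, so changing one joint only requires recomputing the nodes on the path to the root. The sketch below is an illustration under simplifying assumptions, not the ChainTree implementation: transforms are planar (angle, tx, ty) triples rather than 3-D rigid transforms, the number of joints is assumed to be a power of two, and the names `compose` and `TransformTree` are invented for the example.

```python
import math


def compose(t1, t2):
    """Compose planar rigid transforms (angle, tx, ty): t2 expressed in t1's frame."""
    a1, x1, y1 = t1
    a2, x2, y2 = t2
    c, s = math.cos(a1), math.sin(a1)
    return (a1 + a2, x1 + c * x2 - s * y2, y1 + s * x2 + c * y2)


class TransformTree:
    """Array-based binary tree; node i has children 2i and 2i + 1."""

    def __init__(self, joint_transforms):
        self.n = len(joint_transforms)
        assert self.n > 0 and self.n & (self.n - 1) == 0, "power of two assumed"
        self.nodes = [None] * self.n + list(joint_transforms)
        for i in range(self.n - 1, 0, -1):
            self.nodes[i] = compose(self.nodes[2 * i], self.nodes[2 * i + 1])

    def update(self, joint, transform):
        """Change one joint; return how many internal nodes were recomputed."""
        i = joint + self.n
        self.nodes[i] = transform
        touched = 0
        i //= 2
        while i >= 1:
            self.nodes[i] = compose(self.nodes[2 * i], self.nodes[2 * i + 1])
            touched += 1
            i //= 2
        return touched

    def root(self):
        """Transform from the first link frame to the last."""
        return self.nodes[1]
```

With 8 joints, a single-joint update touches 3 internal nodes, one per level, instead of recomputing every transform along the chain.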

Hierarchy of bounding volumes

[Figure: bounding volumes of links A–H, merged pairwise into AB, CD, EF, GH]

The ChainTree

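[Figure: the ChainTree over links A–H. Each node pairs a transform with a bounding volume:
level 1: ($T_{AC}$, AB), ($T_{CE}$, CD), ($T_{EG}$, EF), ($T_{GI}$, GH)
level 2: ($T_{AE}$, AD), ($T_{EI}$, EH)
root: ($T_{AI}$, AH)]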

Updating the ChainTree

[Figure: the same ChainTree over links A–H, with nodes ($T_{AC}$, AB) through ($T_{AI}$, AH)]

Computing the energy

[Figure: ChainTree with leaves A–H and internal nodes J–M]

Recursively search ChainTree for interactions

Pruning rules:

1. Prune search when distance between bounding volumes is more than cutoff distance

2. Do not search inside rigid sub-chains
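The recursive search with both pruning rules can be sketched as below. This is a simplified illustration rather than the ChainTree code: it uses bounding spheres as the bounding volumes, a per-node `changed` flag to mark sub-chains that contain a changed DoF, and applies the rigidity rule only when a node is tested against itself. The names `Node`, `build` and `find_pairs` are assumptions.

```python
import math


class Node:
    def __init__(self, lo, hi, center, radius, left=None, right=None):
        self.lo, self.hi = lo, hi            # covers atoms lo .. hi - 1
        self.center, self.radius = center, radius
        self.left, self.right = left, right
        self.changed = left is not None      # leaves have no internal DoF


def build(points, lo=0, hi=None):
    """Bounding-sphere hierarchy over consecutive atoms of the chain."""
    hi = len(points) if hi is None else hi
    if hi - lo == 1:
        return Node(lo, hi, points[lo], 0.0)
    mid = (lo + hi) // 2
    left, right = build(points, lo, mid), build(points, mid, hi)
    center = tuple((a + b) / 2.0 for a, b in zip(left.center, right.center))
    radius = max(math.dist(center, left.center) + left.radius,
                 math.dist(center, right.center) + right.radius)
    return Node(lo, hi, center, radius, left, right)


def find_pairs(a, b, cutoff, points, out):
    """Append to `out` all atom pairs (i, j), i < j, within `cutoff`."""
    # Rule 1: the bounding volumes are further apart than the cutoff.
    if math.dist(a.center, b.center) - a.radius - b.radius > cutoff:
        return
    if a is b:
        # Rule 2: nothing changed inside a rigid sub-chain.
        if not a.changed:
            return
        find_pairs(a.left, a.left, cutoff, points, out)
        find_pairs(a.left, a.right, cutoff, points, out)
        find_pairs(a.right, a.right, cutoff, points, out)
        return
    if a.left is None and b.left is None:
        if math.dist(points[a.lo], points[b.lo]) <= cutoff:
            out.append((a.lo, b.lo))
        return
    # Descend into the node with the larger bounding volume.
    if a.left is not None and (b.left is None or a.radius >= b.radius):
        find_pairs(a.left, b, cutoff, points, out)
        find_pairs(a.right, b, cutoff, points, out)
    else:
        find_pairs(a, b.left, cutoff, points, out)
        find_pairs(a, b.right, cutoff, points, out)
```

Calling `find_pairs(root, root, cutoff, points, out)` after clearing `changed` on the unchanged sub-chains reports only pairs that may have been affected by the move; interactions inside rigid sub-chains are skipped.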

[Figure: animation of the recursive search on the ChainTree (leaves A–H, internal nodes J–M, N, O), listing the node pairs tested at each step]

Only changed interactions are found

Reuse unaffected partial sums

Better performance for

Longer proteins

Fewer simultaneous changes

Computational complexity (chain of length n, k simultaneous DoF changes)

Updating: $O\left(k \log \frac{n}{k}\right)$

Searching: $\Theta\left(n^{4/3}\right)$ worst case bound

Much faster in practice

[Figure: ChainTree results on four proteins, 1CTF [68 res.], 1LE2 [144 res.], 1HTB [374 res.] and 1JB0 [755 res.]; panels: 1-DoF change, 5-DoF change]

Simulation of α-Synuclein

140 res. protein implicated in Parkinson’s disease

Multi-canonical Replica-exchange MC regime

Over 1000 CPU days of simulation

Study conformations at room temp.

Joint work with Vijay Pande

Outline

Lotan, van den Bedem*, Deacon* and Latombe, WAFR 2004

van den Bedem*, Lotan, Latombe and Deacon*, submitted to Acta Cryst. D

* Joint Center for Structural Genomics (JCSG) at SSRL

Protein Structure Initiative

152K sequenced genes (30K/year)

25K determined structures (3.6K/year)

Reduce cost and time to determine protein structure

Develop software to automatically interpret the electron density map (EDM)

3-D “image” of atomic structure

High value (electron density) at atom centers

Density falls off exponentially away from center

Automated model building

~90% built at high resolution (2Å)

~66% built at medium to low resolution (2.5 – 2.8Å)

Gaps left at noisy areas in EDM (blurred density)

Gaps need to be resolved manually

The Fragment completion problem

Input:
  EDM
  Partially resolved structure
  2 anchor residues
  Length of missing fragment

Output:
  A small number of candidate structures for missing fragment

A robotics inverse kinematics (IK) problem

Related work

Computer Science

Exact IK solvers: Manocha & Canny ’94; Manocha et al. ’95

Optimization IK solvers: Wang & Chen ’91

Redundant manipulators: Khatib ’87; Burdick ’89

Motion planning for closed loops: Han & Amato ’00; Yakey et al. ’01; Cortes et al. ’02, ’04

Biology/Crystallography

Exact IK solvers: Wedemeyer & Scheraga ’99; Coutsias et al. ’04

Optimization IK solvers: Fine et al. ’86; Canutescu & Dunbrack Jr. ’03

Ab-initio loop closure: Fiser et al. ’00; Kolodny et al. ’03

Database search loop closure: Jones & Thirup ’86; van Vlijmen & Karplus ’97

Semi-automatic tools: Jones & Kjeldgaard ’97; Oldfield ’01

Contributions

Sampling of gap-closing fragments biased by the EDM

Refinement of fit to density without breaking closure

Fully automatic fragment completion software for X-ray crystallography

Novel application of a combination of inverse kinematics techniques

Two-stage IK method

1. Candidate generation: Optimize density fit while closing the gap

2. Refinement: Optimize closed fragments without breaking closure

Stage 1: candidate generation

Generate random conformation

Close using Cyclic Coordinate Descent (CCD) (Wang & Chen ’91, Canutescu & Dunbrack Jr. ’03)

CCD moves biased toward high-density
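To illustrate the closing step, here is a minimal CCD sketch on a planar chain: each joint in turn is rotated so that the chain end points at the target, which never increases the distance to the target. It is a toy version under stated assumptions: the real problem is a 3-D backbone fragment closed onto an anchor residue, and the density bias described above is not modeled. The names `forward` and `ccd` are invented for the example.

```python
import math


def forward(angles, lengths):
    """Joint positions of a planar chain given relative joint angles."""
    pts, x, y, theta = [(0.0, 0.0)], 0.0, 0.0, 0.0
    for a, length in zip(angles, lengths):
        theta += a
        x += length * math.cos(theta)
        y += length * math.sin(theta)
        pts.append((x, y))
    return pts


def ccd(angles, lengths, target, tol=1e-6, max_sweeps=500):
    """Cyclic Coordinate Descent: adjust one joint at a time, last to first."""
    angles = list(angles)
    for _ in range(max_sweeps):
        for j in reversed(range(len(angles))):
            pts = forward(angles, lengths)
            px, py = pts[j]
            ex, ey = pts[-1]
            # Rotate joint j so the end of the chain lines up with the target.
            to_end = math.atan2(ey - py, ex - px)
            to_target = math.atan2(target[1] - py, target[0] - px)
            angles[j] += to_target - to_end
        if math.dist(forward(angles, lengths)[-1], target) < tol:
            break
    return angles
```

Because each move has a closed-form solution for a single joint, extra terms such as a density score can in principle be folded into the per-joint choice, which is the spirit of the biased moves mentioned above.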

Stage 2: refinement

Target function T (goodness of fit to EDM)

Minimize T while retaining closure

Closed conformations lie on self-motion manifold of lower dimension

[Figure: 1-D manifold]

Stage 2: null-space minimization

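Jacobian: linear relation between joint velocities $\dot{q}$ and end-effector linear and angular velocity $\dot{x}$:

$\dot{x} = J(q)\,\dot{q}$, where $J$ is a $6 \times n$ matrix

Compute minimizing move using:

$\dot{q} = J^{\dagger}\dot{x} + N N^{T}\,\nabla_q T(q)$

$\operatorname{null}(J) = \{\dot{q} \mid J\dot{q} = 0\}$

N – orthonormal basis of null space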

Stage 2: minimization with closure

1. Choose sub-fragment with n > 6 DOFs

2. Compute $\operatorname{null}(J)$ using SVD

3. Project $\nabla_q T(q)$ onto $\operatorname{null}(J)$

4. Move until minimum is reached or closure is broken
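Steps 2 and 3 can be sketched with NumPy as follows. The Jacobian and gradient are passed in as plain arrays; how they are computed for a protein fragment is outside this example. The function names, the rank tolerance `rcond`, the fixed step size, and the choice to step against the gradient (to decrease T) are assumptions.

```python
import numpy as np


def null_space_basis(J, rcond=1e-10):
    """Orthonormal basis N of null(J), taken from the SVD of J."""
    _, s, vt = np.linalg.svd(J, full_matrices=True)
    rank = int(np.sum(s > rcond * s.max())) if s.size else 0
    # Rows of vt beyond the rank span the null space of J.
    return vt[rank:].T


def null_space_step(J, grad_T, step=1e-2):
    """Small joint move that lowers T while keeping J @ dq = 0 to first order."""
    N = null_space_basis(J)
    # Project the gradient onto null(J), then move against it.
    return -step * (N @ (N.T @ grad_T))
```

For a 6 x n Jacobian of full rank the null space has n - 6 dimensions, which is why the sub-fragment in step 1 needs more than 6 DOFs. The move preserves closure only to first order, so closure still has to be checked after each step, as in step 4.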

Escape from local minima using Monte Carlo with simulated annealing

Test: artificial gaps

Completed structure (gold standard)

Good density (1.6Å res.)

Remove fragment and rebuild

Length   High (2.0Å)    Medium (2.5Å)   Low (2.8Å)
4        100% (0.14Å)   100% (0.19Å)    100% (0.32Å)
8        100% (0.18Å)   100% (0.23Å)    100% (0.36Å)
12       91% (0.51Å)    96% (0.41Å)     91% (0.52Å)
15       91% (0.53Å)    88% (0.63Å)     83% (0.76Å)

Produced by H. van den Bedem

Test: true gaps

Completed structure (gold standard)

O.K. density (2.4Å res.)

6 gaps left by model builder (RESOLVE)

Length   Top scorer   Lowest error
4        0.44Å        0.40Å
4        0.22Å        0.22Å
5        0.78Å        0.78Å
5        0.36Å        0.36Å
7        0.72Å        0.66Å
10       0.43Å        0.43Å

Produced by H. van den Bedem

Example: TM0423

PDB: 1KQ3, 376 res.

2.0Å resolution

12 residue gap

Best: 0.3Å aaRMSD

Example: TM0813

GLU-83

GLY-96

PDB: 1J5X, 342 res.

2.8Å resolution

12 residue gap

Best: 0.6Å aaRMSD


Outline

Lotan and Schwarzer, J. Comput. Biol. 11(2–3): 299–317, 2004

Large scale similarity

Analysis of simulation trajectories: molecular dynamics simulation, Monte Carlo simulation

Clustering of decoy sets (e.g., Shortle et al. ’98)

Stochastic Roadmap Simulation (Apaydin et al. ’03)

Fast similarity measures are needed for analyzing large sets of conformations

Contributions

Uniform simplification of protein structure for similarity computation

Speed up existing similarity measures

Method offers trade-off between speed and precision

Efficient computation of nearest neighbors

m-Averaged approximation

Cut chain into pieces of length m

Replace each sequence of m Cα atoms by its centroid

3n coordinates become 3n/m coordinates
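A minimal NumPy sketch of the m-averaged approximation is given below. The function name `m_average` is an assumption, as is the handling of a chain whose length is not a multiple of m (here the last piece is simply shorter).

```python
import numpy as np


def m_average(coords, m):
    """Replace each run of m consecutive C-alpha atoms by its centroid.

    coords: array of shape (n, 3). Returns an array of shape (ceil(n / m), 3).
    """
    coords = np.asarray(coords, dtype=float)
    pieces = [coords[start:start + m] for start in range(0, len(coords), m)]
    return np.array([piece.mean(axis=0) for piece in pieces])
```

The reduced chain can be passed to any existing similarity measure unchanged, which is where the speed-up comes from.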

Chains and distances

Proximity along the chain entails spatial proximity

Far away links along the chain are spatially distant (on average)

Similarity measures

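$\mathrm{cRMS}(P, Q) = \min_{T} \sqrt{\frac{1}{n} \sum_{i=1}^{n} \lVert p_i - T q_i \rVert^2}$

$\mathrm{dRMS}(P, Q) = \sqrt{\frac{2}{n(n-1)} \sum_{i=2}^{n} \sum_{j=1}^{i-1} \left(d^{P}_{ij} - d^{Q}_{ij}\right)^2}$

where $T$ ranges over rigid-body transforms and $d^{P}_{ij}$ is the distance between atoms $i$ and $j$ in conformation $P$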
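Both measures can be sketched with NumPy. For cRMS the minimization over rigid-body transforms is done here with the standard Kabsch superposition (centroid alignment followed by an SVD-based optimal rotation); the talk does not say which superposition method was used, so that choice and the function names are assumptions.

```python
import numpy as np


def crms(P, Q):
    """Coordinate RMS deviation after optimal rigid superposition of Q onto P."""
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    Pc, Qc = P - P.mean(axis=0), Q - Q.mean(axis=0)
    # Kabsch: the optimal rotation comes from the SVD of the covariance matrix.
    U, _, Vt = np.linalg.svd(Qc.T @ Pc)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0  # exclude reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    diff = Pc - Qc @ R.T
    return float(np.sqrt((diff ** 2).sum() / len(P)))


def drms(P, Q):
    """RMS difference between the intra-molecular distance matrices of P and Q."""
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    iu = np.triu_indices(len(P), k=1)  # each unordered pair (i, j) once
    dP = np.linalg.norm(P[:, None] - P[None, :], axis=-1)[iu]
    dQ = np.linalg.norm(Q[:, None] - Q[None, :], axis=-1)[iu]
    return float(np.sqrt(((dP - dQ) ** 2).mean()))
```

cRMS costs time linear in n per pair of conformations, while dRMS is quadratic in n, so dRMS benefits more from shortening the chain.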

Evaluation: test sets

8 structurally diverse proteins (54-76 residues)

1. Decoy sets: conformations from the Park-Levitt set (Park et al. ’97), N = 10,000

2. Random sets: conformations generated by the program FOLDTRAJ (Feldman & Hogue ’00), N = 5,000

Evaluation results: decoy sets

Correlation of the m-averaged measure with the exact measure:

m    cRMS        dRMS
3    0.99        0.96-0.98
4    0.98-0.99   0.94-0.97
6    0.92-0.99   0.78-0.93
9    0.81-0.98   0.65-0.96
12   0.54-0.92   0.52-0.69

Speed-up: 9x for cRMS (m = 9), 36x for dRMS (m = 6)

Higher correlation for random sets!

k Nearest-neighbors problem

Given a set S of conformations of a protein and a query conformation c, find the k conformations in S most similar to c

Brute force complexity:

$O(N(\log k + L))$ for a single query

$O(N^2(\log k + L))$ for all $c \in S$

N – size of S, L – time to compute similarity
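The brute-force baseline can be sketched with the standard library: every conformation is compared to the query once (cost L each) while a heap of size k keeps the best candidates (cost log k each). The function name `k_nearest` and the convention that a smaller `similarity` value means more similar are assumptions.

```python
import heapq


def k_nearest(query, conformations, k, similarity):
    """Indices of the k conformations most similar to `query`, best first.

    `similarity(a, b)` is a caller-supplied dissimilarity such as cRMS or dRMS.
    """
    return heapq.nsmallest(
        k,
        range(len(conformations)),
        key=lambda i: similarity(query, conformations[i]),
    )
```

For example, `k_nearest(2.0, [5.0, 1.0, 3.0, 9.0, 2.5], 2, lambda a, b: abs(a - b))` returns `[4, 1]`. Repeating the query for every member of S gives the quadratic cost quoted above.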

Efficient nearest neighbor search

kd-tree: $O(kd \log N)$ time per query

Limitations:

1. Requires Minkowski metric: $\mathrm{dist}(P, Q) = \left(\sum_i \lvert p_i - q_i \rvert^{r}\right)^{1/r}$

2. Less efficient when d > 20

cRMS is not a Minkowski metric

dRMS has dimensionality of $n(n-1)/2$

Reduce dRMS dimensionality using SVD

Reduction using SVD

1. Stack m-averaged distance matrices as vectors

2. Compute the SVD of entire set

3. Project onto principal components

dRMS is reduced to 20 dimensions

Complexity of SVD ~ $n^4$
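The three steps can be sketched with NumPy as follows. Centering the vectors before the SVD is an assumption (it is the usual PCA convention; the slides do not say whether it was done), and the function names are invented for the example.

```python
import numpy as np


def distance_vector(coords):
    """Upper triangle of the intra-molecular distance matrix, as a flat vector."""
    coords = np.asarray(coords, dtype=float)
    iu = np.triu_indices(len(coords), k=1)
    return np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)[iu]


def reduce_with_svd(vectors, n_components):
    """Project the stacked vectors onto their leading principal components.

    Returns (reduced, mean, basis); rows of `basis` are orthonormal directions.
    """
    X = np.asarray(vectors, dtype=float)
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    basis = vt[:n_components]
    return (X - mean) @ basis.T, mean, basis
```

Euclidean distance between two reduced vectors never exceeds the distance between the original vectors, and it is a Minkowski metric in few dimensions, so the reduced points are suitable input for a kd-tree.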

Testing the method

Use decoy sets (N = 10,000) and random sets (N = 5,000)

m-averaging with (m = 4)

Project onto 16 PCs for decoys, 12 PCs for random sets

Find k = 10, 25, 100 NNs for 250 conformations in each set

Results

Decoy sets:
  ~77% correct
  Furthest NN off by 10% - 15% (0.7Å – 1.5Å)
  ~4k approximate NNs contain all true k NNs

Random sets: slightly better results

Use reduction as fast filter

Running Time

N = 100,000, m=4, PC = 16

Find k = 100 for each conformation

Brute-force: ~84 hours

Brute-force + m-averaging: ~4.8 hours

Brute-force + m-averaging + SVD: 41 minutes

kd-tree + m-averaging + SVD: 19 minutes

kd-tree has more impact for larger sets

Contributions

Energy computation in MCS
  Efficient maintenance and self-collision detection for kinematic chains
  Efficient computation of pair-wise interactions in MCS of proteins
  Caching scheme for partial energy sums during MCS
  MCS software

Model completion in X-ray crystallography
  Sampling of gap-closing fragments biased towards the EDM
  Refinement of fit to density without breaking closure
  Fully automatic fragment completion software

Similarity computation for large conformation sets
  Uniform simplification of protein structure for similarity computation
  Speed up existing similarity measures
  Method offers trade-off between speed and precision
  Efficient computation of nearest neighbors

Take-home message

Taking into account physical properties of proteins can lead to efficient algorithms for a wide variety of applications in structural biology

Outlook

Models that simplify the physics and chemistry of proteins

Algorithms that exploit properties of protein models

[Figure labels: computer scientist, biophysicist/biochemist]

Develop simplified protein models that lend themselves to efficient computations

Acknowledgements

Jean-Claude Latombe

Vijay Pande

Michael Levitt

Leo Guibas

Axel Brunger, Balaji Prabhakar, Serafim Batzoglou

Fabian Schwarzer, Henry van den Bedem, Dan Halperin

Carlo Tomasi

Daniel Russakoff, Rachel Kolodny

Latombe group: Serkan Apaydin, Tim Bretl, Joel Brown, Phil Fong, Mitul Saha, Pekka Isto, Kris Hauser

Pande group: Bojan Zagrovic, Stefan Larson, Lillian Chong, Young Min Rhee, Sidney Elmer, Chris Snow, Guha Jayachandran, Eric Sorin, Sung-Joo Lee, Jim Cladwell, Michael Shirts, Nina Singhal, Relly Brandman, Vishal Vaidyanathan, Nick Kelley, Mark Engelhardt

Levitt group: Patrice Koehl, Tanya Raschke, Erik Lindahl

Thank you!
