Algorithms Exploiting the Chain Structure of Proteins
Itay Lotan
Computer Science
Proteins 101
Involved in all functions of our body: metabolism, motion, defense, etc.
Michael Levitt
Protein representation
Torsion angle model:
Cα model:
[Figure: protein backbone drawn across four consecutive residues (Res i, Res i+1, Res i+2, Res i+3), with atoms labeled N, C, C’ and O]
Structure determination
Bernhard Rupp
X-ray crystallography
Outline
1. Fast energy computation during Monte Carlo simulation
2. Model completion for protein X-ray crystallography
3. Large scale computation of similarity
Exploit specific properties of proteins to perform the computation efficiently
1. Fast energy computation during Monte Carlo simulation
Lotan, Schwarzer, Halperin* and Latombe. J. Comput. Bio. 2004 (to appear)
*CS Department, Tel-Aviv University
Monte Carlo simulation (MCS)
Popular method for sampling the conformation space of proteins:
Estimate thermodynamic quantities
Search for low-energy conformations and the folded structure
MCS: How it works
1. Propose random change in conformation
2. Compute energy E of new conformation
3. Accept with probability: $P(\mathrm{accept}) = \min\left(1, e^{-\Delta E / k_b T}\right)$
Requires $\gg 10^6$ steps to sample adequately
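The three steps above can be sketched as a single Monte Carlo step. This is a minimal illustration, not the talk's MCS software: the names `accept_probability`, `mc_step`, `propose` and `energy_fn` are hypothetical, and the value and units chosen for the Boltzmann constant are an assumption.

```python
import math
import random

# Boltzmann constant in kcal/(mol*K); the unit system is an assumption.
K_B = 0.0019872041


def accept_probability(delta_e, temperature):
    """Metropolis criterion: min(1, exp(-delta_e / (k_b * T)))."""
    if delta_e <= 0.0:
        return 1.0  # downhill or neutral moves are always accepted
    return math.exp(-delta_e / (K_B * temperature))


def mc_step(conf, energy, propose, energy_fn, temperature, rng=random):
    """One MCS step: propose a change, compute its energy, accept or reject.

    `propose(conf, rng)` and `energy_fn(conf)` are caller-supplied
    (hypothetical) callbacks. Returns (conformation, energy, accepted).
    """
    candidate = propose(conf, rng)
    e_new = energy_fn(candidate)
    if rng.random() < accept_probability(e_new - energy, temperature):
        return candidate, e_new, True
    return conf, energy, False
```

In this sketch the cost of a step is dominated by `energy_fn`, which is the computation the rest of this part of the talk aims to speed up.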
Energy function
Bonded terms:
  Bond lengths
  Bond angles
  Dihedral angles
Non-bonded terms:
  Van der Waals
  Electrostatic
  Heuristic: Go models, HP models, etc.
Pair-wise interactions
Cutoff distance (6 - 12Å)
Only a linear number of interactions contribute to the energy (Halperin & Overmars ’98)
Challenge: Find all interacting pairs without enumerating all pairs
Related work
Computer Science
  Bounding volume hierarchies for collision detection: Gottschalk et al. ’96, Larsen et al. ’00, Guibas et al. ’02
  Space partition methods for collision detection: Faverjon ’84, Halperin & Overmars ’98
  Collision detection for chains: Halperin et al. ’97, Guibas et al. ’02
Biology
  Neighbor lists: Verlet ’67, Brooks et al. ’83
  Grid: Quentrec & Brot ’73, Hockney et al. ’74, Van Gunsteren et al. ’84
  Neighbor lists + grid: Yip & Elber ’89, Petrella ’02
Grid method
d: Cutoff distance
[Figure: grid with cells of side d]
Linear complexity
Optimal in worst case
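The grid method can be sketched as follows: atoms are hashed into cubic cells of side d, and each atom is compared only with atoms in its own and the 26 neighboring cells. This is an illustrative sketch of the general technique, not the implementation benchmarked in the talk; the function name `interacting_pairs` is hypothetical.

```python
import math
from collections import defaultdict
from itertools import product


def interacting_pairs(points, cutoff):
    """Return all index pairs (i, j), i < j, with distance <= cutoff."""
    # Hash every point into a cubic cell whose side equals the cutoff.
    grid = defaultdict(list)
    for idx, (x, y, z) in enumerate(points):
        cell = (math.floor(x / cutoff),
                math.floor(y / cutoff),
                math.floor(z / cutoff))
        grid[cell].append(idx)

    pairs = []
    for (cx, cy, cz), members in grid.items():
        # Two points within the cutoff always lie in the same or adjacent cells.
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            neighbors = grid.get((cx + dx, cy + dy, cz + dz), ())
            for i in members:
                for j in neighbors:
                    if i < j and math.dist(points[i], points[j]) <= cutoff:
                        pairs.append((i, j))
    return sorted(pairs)
```

When the number of atoms per cell is bounded, as it is for packed atoms of roughly equal size, each atom is tested against a constant number of candidates, which gives the linear behavior stated on the slide.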
Contributions
Efficient maintenance and self-collision detection for kinematic chains
Efficient computation of pair-wise interactions in MCS of proteins
Scheme for caching and reusing partial energy sums during MCS
MCS software*
Much faster than existing algorithm (grid method)
*Download at: http://robotics.stanford.edu/~itayl/mcs
Properties of kinematic chains
Small changes → large effects
Local changes → global effects
Few DoF changes → long rigid sub-chains
ChainTree: A tale of two hierarchies
Transform hierarchy: approximates kinematics of protein backbone at successive resolutions
Bounding volume hierarchy: approximates geometry of protein at successive resolutions
Hierarchy of transforms
[Figure: chain of links A to I with a binary hierarchy of transforms. Level 0: TAB, TBC, TCD, TDE, TEF, TFG, TGH, THI. Level 1: TAC, TCE, TEG, TGI. Level 2: TAE, TEI. Root: TAI]
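Hierarchy of bounding volumes
[Figure: binary hierarchy of bounding volumes. Leaves: A, B, C, D, E, F, G, H. Level 1: AB, CD, EF, GH. Level 2: AD, EH. Root: AH]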
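The ChainTree
[Figure: chain of links A to I and its ChainTree. Each node pairs a transform with a bounding volume. Leaves: (TAB, A), (TBC, B), (TCD, C), (TDE, D), (TEF, E), (TFG, F), (TGH, G), (THI, H). Level 1: (TAC, AB), (TCE, CD), (TEG, EF), (TGI, GH). Level 2: (TAE, AD), (TEI, EH). Root: (TAI, AH)]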
Updating the ChainTree
[Figure: the same ChainTree over links A to I, with leaves (TAB, A) to (THI, H) and root (TAI, AH), illustrating an update]
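The transform hierarchy and its update can be sketched as a binary tree in which every internal node stores the product of the transforms of its two children, so that changing one leaf only requires recomputing the nodes on the path to the root. This is a simplified illustration rather than the talk's ChainTree code: it omits the bounding volumes, assumes the number of leaves is a power of two, and the names `TransformTree`, `matmul` and `rot_z` are hypothetical.

```python
import math


def matmul(a, b):
    """Product of two 4x4 homogeneous transforms stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def rot_z(theta, tx):
    """Hypothetical link transform: rotate about z, then offset by tx along x."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0, tx], [s, c, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]


class TransformTree:
    """Array-based binary tree; node i has children 2i and 2i+1."""

    def __init__(self, leaf_transforms):
        self.n = len(leaf_transforms)  # assumed to be a power of two
        self.nodes = [None] * self.n + list(leaf_transforms)
        for i in range(self.n - 1, 0, -1):
            # Left child covers the earlier part of the chain.
            self.nodes[i] = matmul(self.nodes[2 * i], self.nodes[2 * i + 1])

    def update(self, leaf, transform):
        """Replace one leaf transform; return the number of nodes recomputed."""
        i = leaf + self.n
        self.nodes[i] = transform
        recomputed = 0
        i //= 2
        while i >= 1:
            self.nodes[i] = matmul(self.nodes[2 * i], self.nodes[2 * i + 1])
            recomputed += 1
            i //= 2
        return recomputed

    def root(self):
        """Transform from the first link of the chain to the last."""
        return self.nodes[1]
```

For the eight-leaf example in the figure, a single changed leaf recomputes three internal nodes, one per level.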
Computing the energy
Recursively search ChainTree for interactions
Pruning rules:
1. Prune search when distance between bounding volumes is more than cutoff distance
2. Do not search inside rigid sub-chains
[Figure: ChainTree with leaves A, B, C, D, E, F, G, H; internal nodes J, K, L, M and N, O; root P]
Computing the energy
[Figure: recursive search over the ChainTree (leaves A to H; internal nodes J, K, L, M and N, O; root P), built up step by step. The search starts at [P] and expands into [N], [O] and [N-O]; then into [J], [K], [L], [M] and the pairs [J-K], [J-L], [J-M], [K-L], [K-M], [L-M]; and finally into the leaves and leaf pairs [A-B] through [G-H]. One step is labeled E(O)]
Computing the energy
Only changed interactions are found
Reuse unaffected partial sums
Better performance for
Longer proteins
Fewer simultaneous changes
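The recursive search with the two pruning rules can be sketched on a simple bounding-sphere hierarchy. This is a conservative simplification and not the talk's algorithm: it uses spheres over point atoms, it skips only self-searches inside unchanged (rigid) sub-chains, and it may therefore also report some unchanged pairs. The names `Node`, `build`, `mark_changed` and `changed_pairs` are hypothetical.

```python
import math


class Node:
    """Bounding sphere over the links with indices in [lo, hi)."""

    def __init__(self, lo, hi, points, left=None, right=None):
        self.lo, self.hi, self.left, self.right = lo, hi, left, right
        pts = points[lo:hi]
        self.center = tuple(sum(c) / len(pts) for c in zip(*pts))
        self.radius = max(math.dist(self.center, p) for p in pts)
        self.changed = False  # True if a DoF inside this sub-chain changed


def build(points, lo=0, hi=None):
    hi = len(points) if hi is None else hi
    if hi - lo == 1:
        return Node(lo, hi, points)
    mid = (lo + hi) // 2
    return Node(lo, hi, points, build(points, lo, mid), build(points, mid, hi))


def mark_changed(node, joint):
    """Flag every node containing the joint between links joint-1 and joint."""
    if node.lo < joint < node.hi:
        node.changed = True
        mark_changed(node.left, joint)
        mark_changed(node.right, joint)


def _search(a, b, points, cutoff, out):
    if a is b:
        if not a.changed:  # rule 2: nothing moved inside a rigid sub-chain
            return
        _search(a.left, a.left, points, cutoff, out)
        _search(a.right, a.right, points, cutoff, out)
        _search(a.left, a.right, points, cutoff, out)
        return
    gap = math.dist(a.center, b.center) - a.radius - b.radius
    if gap > cutoff:  # rule 1: bounding volumes are too far apart
        return
    if a.left is None and b.left is None:
        if math.dist(points[a.lo], points[b.lo]) <= cutoff:
            out.append((a.lo, b.lo))
    elif b.left is None or (a.left is not None and a.radius >= b.radius):
        _search(a.left, b, points, cutoff, out)
        _search(a.right, b, points, cutoff, out)
    else:
        _search(a, b.left, points, cutoff, out)
        _search(a, b.right, points, cutoff, out)


def changed_pairs(root, points, cutoff):
    """Pairs (i, j), i < j, within cutoff that may have changed."""
    out = []
    _search(root, root, points, cutoff, out)
    return out
```

With no joint marked, the search stops at the root, which mirrors the point above that only changed interactions need to be found.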
Computational complexity
Updating: $O\left(k \log \frac{n}{k}\right)$
Searching: $\Theta(n^{4/3})$ worst case bound
Much faster in practice
Test
[Figure: two bar charts of running time (in mSec.) for ChainTree vs. Grid on four proteins: 1CTF, 1LE2, 1HTB and 1JB0 (68, 144, 374 and 755 res.). Panels: 1-DoF change and 5-DoF change]
Simulation of α-Synuclein
140 res. protein implicated in Parkinson’s disease
Multi-canonical Replica-exchange MC regime
Over 1000 CPU days of simulation
Study conformations at room temp.
Joint work with Vijay Pande
2. Model completion for protein X-ray crystallography
Lotan, van den Bedem*, Deacon* and Latombe, WAFR 2004
van den Bedem*, Lotan, Latombe and Deacon*, submitted to Acta Cryst. D
* Joint Center for Structural Genomics (JCSG) at SSRL
Protein Structure Initiative
152K sequenced genes (30K/year)
25K determined structures (3.6K/year)
Reduce cost and time to determine protein structure
Develop software to automatically interpret the electron density map (EDM)
EDM
3-D “image” of atomic structure
High value (electron density) at atom centers
Density falls off exponentially away from center
Automated model building
~90% built at high resolution (2Å)
~66% built at medium to low resolution (2.5 – 2.8Å)
Gaps left at noisy areas in EDM (blurred density)
Gaps need to be resolved manually
The fragment completion problem
Input:
  EDM
  Partially resolved structure
  2 anchor residues
  Length of missing fragment
Output:
  A small number of candidate structures for missing fragment
A robotics inverse kinematics (IK) problem
Related work
Computer Science
  Exact IK solvers: Manocha & Canny ’94, Manocha et al. ’95
  Optimization IK solvers: Wang & Chen ’91
  Redundant manipulators: Khatib ’87, Burdick ’89
  Motion planning for closed loops: Han & Amato ’00, Yakey et al. ’01, Cortes et al. ’02, ’04
Biology/Crystallography
  Exact IK solvers: Wedemeyer & Scheraga ’99, Coutsias et al. ’04
  Optimization IK solvers: Fine et al. ’86, Canutescu & Dunbrack Jr. ’03
  Ab-initio loop closure: Fiser et al. ’00, Kolodny et al. ’03
  Database search loop closure: Jones & Thirup ’86, van Vlijmen & Karplus ’97
  Semi-automatic tools: Jones & Kjeldgaard ’97, Oldfield ’01
Contributions
Sampling of gap-closing fragments biased by the EDM
Refinement of fit to density without breaking closure
Fully automatic fragment completion software for X-ray crystallography
Novel application of a combination of inverse kinematics techniques
Two-stage IK method
1. Candidate generation: Optimize density fit while closing the gap
2. Refinement: Optimize closed fragments without breaking closure
Stage 1: candidate generation
Generate random conformation
Close using Cyclic Coordinate Descent (CCD) (Wang & Chen ’91, Canutescu & Dunbrack Jr. ’03)
CCD moves biased toward high-density regions of the EDM
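The closing step can be illustrated with Cyclic Coordinate Descent on a planar chain: each joint in turn is rotated so that the chain end points toward the target, and sweeps repeat until the gap is closed. This is a deliberately simplified sketch, assuming a 2-D chain with unit-length links and a position-only target; it does not include the density bias described above, and the names `forward` and `ccd_close` are hypothetical.

```python
import math


def forward(angles, link=1.0):
    """Joint positions of a planar chain; each angle is relative to the previous link."""
    x = y = heading = 0.0
    pts = [(x, y)]
    for a in angles:
        heading += a
        x += link * math.cos(heading)
        y += link * math.sin(heading)
        pts.append((x, y))
    return pts


def ccd_close(angles, target, tol=1e-3, max_sweeps=1000):
    """Adjust one joint at a time until the chain end reaches `target`.

    Returns (angles, closed). Each move rotates the distal part of the
    chain about one joint so that the end lies on the ray from that joint
    to the target, which never increases the end-to-target distance.
    """
    angles = list(angles)
    for _ in range(max_sweeps):
        for i in range(len(angles) - 1, -1, -1):
            pts = forward(angles)
            end = pts[-1]
            if math.dist(end, target) < tol:
                return angles, True
            px, py = pts[i]  # pivot of joint i
            to_end = math.atan2(end[1] - py, end[0] - px)
            to_target = math.atan2(target[1] - py, target[0] - px)
            angles[i] += to_target - to_end
    return angles, math.dist(forward(angles)[-1], target) < tol
```

In the setting of the talk the degrees of freedom would be backbone torsion angles in 3-D and the target would be the anchor residue; the planar chain is only meant to show the coordinate-by-coordinate structure of the method.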
Stage 2: refinement
Target function T (goodness of fit to EDM)
Minimize T while retaining closure
Closed conformations lie on self-motion manifold of lower dimension
[Figure: 1-D manifold]
Stage 2: null-space minimization
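Jacobian: linear relation between joint velocities $\dot{q}$ and end-effector linear and angular velocity $\dot{x}$.
$\dot{x} = J(q)\,\dot{q}$ ($J$ is a $6 \times n$ matrix)
Compute minimizing move using:
$\dot{q} = J^{\dagger}(q)\,\dot{x} + N N^{T} \frac{\partial T(q)}{\partial q}$
$\mathrm{null}(J) = \{\dot{q} \mid J(q)\,\dot{q} = 0\}$
N – orthonormal basis of null space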
Stage 2: minimization with closure
1. Choose sub-fragment with n > 6 DOFs
2. Compute $\mathrm{null}(J)$ using SVD
3. Project $\partial T(q) / \partial q$ onto $\mathrm{null}(J)$
4. Move until minimum is reached or closure is broken
Escape from local minima using Monte Carlo with simulated annealing
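Steps 2 and 3 of the minimization can be sketched with NumPy: the right singular vectors of J beyond its rank span null(J), and projecting the gradient of T onto that basis gives a joint move that, to first order, does not move the end of the fragment. This is a sketch under assumptions: the Jacobian and gradient are taken as given arrays, the step is taken along the negative projected gradient, and `null_space_basis`, `projected_step` and the step size are hypothetical.

```python
import numpy as np


def null_space_basis(J, rcond=1e-10):
    """Orthonormal basis N (n x (n - rank)) of null(J), computed from the SVD of J."""
    _, s, vt = np.linalg.svd(J, full_matrices=True)
    rank = int(np.sum(s > rcond * s[0]))
    # Rows of vt beyond the rank are orthonormal vectors with J @ v = 0.
    return vt[rank:].T


def projected_step(J, grad_T, step=0.1):
    """Joint move along the projection of -grad_T onto null(J)."""
    N = null_space_basis(J)
    return -step * (N @ (N.T @ grad_T))
```

Because the Jacobian is only a linear approximation, a finite step drifts off the closed set, which is consistent with step 4 moving only until closure is broken.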
Test: artificial gaps
Completed structure (gold standard)
Good density (1.6Å res.)
Remove fragment and rebuild

Length   High (2.0Å)    Medium (2.5Å)   Low (2.8Å)
4        100% (0.14Å)   100% (0.19Å)    100% (0.32Å)
8        100% (0.18Å)   100% (0.23Å)    100% (0.36Å)
12       91% (0.51Å)    96% (0.41Å)     91% (0.52Å)
15       91% (0.53Å)    88% (0.63Å)     83% (0.76Å)
Produced by H. van den Bedem
Test: true gaps
Completed structure (gold standard)
O.K. density (2.4Å res.)
6 gaps left by model builder (RESOLVE)

Length   Top scorer   Lowest error
4        0.44Å        0.40Å
4        0.22Å        0.22Å
5        0.78Å        0.78Å
5        0.36Å        0.36Å
7        0.72Å        0.66Å
10       0.43Å        0.43Å
Produced by H. van den Bedem
Example: TM0423
PDB: 1KQ3, 376 res.
2.0Å resolution
12 residue gap
Best: 0.3Å aaRMSD
Example: TM0813
PDB: 1J5X, 342 res.
2.8Å resolution
12 residue gap
Best: 0.6Å aaRMSD
[Figure labels: GLU-83, GLY-96]
3. Large scale computation of similarity
Lotan and Schwarzer, J. Comput. Biol. 11(2–3): 299–317, 2004
Large scale similarity
Analysis of simulation trajectories
  Molecular dynamics simulation
  Monte Carlo simulation
Clustering of decoy sets (e.g., Shortle et al. ’98)
Stochastic Roadmap Simulation (Apaydin et al. ’03)
Fast similarity measures are needed for analyzing large sets of conformations
Contributions
Uniform simplification of protein structure for similarity computation
Speed up existing similarity measures
Method offers trade-off between speed and precision
Efficient computation of nearest neighbors
m-Averaged approximation
Cut chain into pieces of length m
Replace each sequence of m Cα atoms by its centroid
3n coordinates → 3n/m coordinates
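The m-averaged approximation is short enough to write out directly. This is a minimal sketch; the function name `m_average` is hypothetical, and letting the last piece be shorter than m when n is not a multiple of m is an assumption.

```python
def m_average(coords, m):
    """Replace each run of m consecutive C-alpha positions by its centroid.

    `coords` is a sequence of (x, y, z) tuples in chain order. If the chain
    length is not a multiple of m, the last piece is simply shorter.
    """
    averaged = []
    for start in range(0, len(coords), m):
        piece = coords[start:start + m]
        averaged.append(tuple(sum(axis) / len(piece) for axis in zip(*piece)))
    return averaged
```

For example, a 60-residue chain with m = 4 is reduced from 180 coordinates to 45.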
Chains and distances
Proximity along the chain entails spatial proximity
3d l
Far away links along the chain are spatially distant (on average)
ci cj
Similarity measures
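$cRMS(P, Q) = \min_{T} \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left\| p_i - T q_i \right\|^2}$
$dRMS(P, Q) = \sqrt{\frac{2}{n(n-1)} \sum_{i=2}^{n} \sum_{j=1}^{i-1} \left( d^{P}_{ij} - d^{Q}_{ij} \right)^2}$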
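Both measures can be sketched with NumPy. For cRMS, the minimization over rigid-body transforms T is done here with the standard SVD-based superposition (Kabsch) method, which is an assumption about how the minimum is computed rather than something stated in the talk; the function names `crms` and `drms` are hypothetical.

```python
import numpy as np


def crms(P, Q):
    """Coordinate RMS deviation after optimal rigid superposition of Q onto P."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    Pc = P - P.mean(axis=0)  # the optimal translation aligns the centroids
    Qc = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(Qc.T @ Pc)
    # Force a proper rotation (determinant +1), never a reflection.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0.0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    diff = Pc - Qc @ R.T
    return float(np.sqrt((diff ** 2).sum() / len(P)))


def drms(P, Q):
    """RMS difference between the intra-molecular distance matrices of P and Q."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    n = len(P)
    dP = np.linalg.norm(P[:, None] - P[None, :], axis=-1)
    dQ = np.linalg.norm(Q[:, None] - Q[None, :], axis=-1)
    upper = np.triu_indices(n, k=1)  # each unordered pair once
    return float(np.sqrt(2.0 / (n * (n - 1)) * ((dP[upper] - dQ[upper]) ** 2).sum()))
```

One consequence visible in this sketch is that dRMS cannot tell a structure from its mirror image, since reflection preserves all internal distances, while cRMS can.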
Evaluation: test sets
8 structurally diverse proteins (54 - 76 residues)
1. Decoy sets: conformations from the Park-Levitt set (Park et al. ’97), N = 10,000
2. Random sets: conformations generated by the program FOLDTRAJ (Feldman & Hogue ’00), N = 5,000
Evaluation results: decoy sets
m    cRMS        dRMS
3    0.99        0.96-0.98
4    0.98-0.99   0.94-0.97
6    0.92-0.99   0.78-0.93
9    0.81-0.98   0.65-0.96
12   0.54-0.92   0.52-0.69

Speed-up: 9x for cRMS (m = 9), 36x for dRMS (m = 6)
Higher correlation for random sets!
k Nearest-neighbors problem
Given a set S of conformations of a protein and a query conformation c, find the k conformations in S most similar to c
Brute force complexity:
• $O(N(\log k + L))$
• $O(N^2(\log k + L))$ for all $c \in S$
N – size of S
L – time to compute similarity
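The brute-force bound can be read off a direct implementation: each of the N conformations costs one similarity computation (L) plus one update of a heap holding the best k so far (log k). This is an illustrative sketch; the function name `k_nearest` and the `dist` callback are hypothetical.

```python
import heapq


def k_nearest(S, c, k, dist):
    """Return the k entries of S closest to c as sorted (distance, index) pairs.

    `dist(a, b)` is any similarity measure where smaller means more similar.
    """
    heap = []  # max-heap on distance (stored negated), never larger than k
    for idx, s in enumerate(S):
        d = dist(s, c)                      # cost L
        if len(heap) < k:
            heapq.heappush(heap, (-d, idx))  # cost O(log k)
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, idx))
    return sorted((-neg_d, idx) for neg_d, idx in heap)
```

Running this once per conformation in S gives the quadratic bound for the all-pairs version of the problem.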
Efficient nearest neighbor search
kd-tree: $O(kd \log N)$ time per query
Limitations:
1. Requires Minkowski metric: $dist(P, Q) = \left( \sum_{i=1}^{d} \left| p_i - q_i \right|^r \right)^{1/r}$
2. Less efficient when d > 20
cRMS is not a Minkowski metric
dRMS has dimensionality of $\frac{n(n-1)}{2}$
Reduce dRMS dimensionality using SVD
Reduction using SVD
1. Stack m-averaged distance matrices as vectors
2. Compute the SVD of entire set
3. Project onto principal components
dRMS is reduced to 20 dimensions
Complexity of SVD ~ $n^4$
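The three reduction steps can be sketched with NumPy. This is a sketch under assumptions: the distance vectors are mean-centered before the SVD (a common PCA convention that the slide does not state), and the function names `distance_vector` and `reduce_with_svd` are hypothetical.

```python
import numpy as np


def distance_vector(coords):
    """Flatten the upper triangle of the distance matrix of one conformation."""
    c = np.asarray(coords, dtype=float)
    d = np.linalg.norm(c[:, None] - c[None, :], axis=-1)
    return d[np.triu_indices(len(c), k=1)]


def reduce_with_svd(X, n_components):
    """Project the rows of X onto their first principal components.

    X has one distance vector per row. Returns (reduced rows, mean, basis),
    where the mean and basis are needed to project new query vectors.
    """
    mean = X.mean(axis=0)  # centering is an assumption of this sketch
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    basis = vt[:n_components]
    return (X - mean) @ basis.T, mean, basis
```

Euclidean distance between the reduced rows then serves as an approximation of dRMS, up to the constant normalization factor, in a space small enough for a kd-tree.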
Testing the method
Use decoy sets (N = 10,000) and random sets (N = 5,000)
m-averaging with m = 4
Project onto 16 PCs for decoys, 12 PCs for random sets
Find k = 10, 25, 100 NNs for 250 conformations in each set
Results
Decoy sets:
  ~77% correct
  Furthest NN off by 10% - 15% (0.7Å – 1.5Å)
  ~4k approximate NNs contain all true k NNs
Random sets: slightly better results
Use reduction as fast filter
Running Time
N = 100,000, m = 4, PC = 16
Find k = 100 for each conformation
Brute-force: ~84 hours
Brute-force + m-averaging: ~4.8 hours
Brute-force + m-averaging + SVD: 41 minutes
kd-tree + m-averaging + SVD: 19 minutes
kd-tree has more impact for larger sets
Contributions
Energy computation in MCS
  Efficient maintenance and self-collision detection for kinematic chains
  Efficient computation of pair-wise interactions in MCS of proteins
  Caching scheme for partial energy sums during MCS
  MCS software
Model completion in X-ray crystallography
  Sampling of gap-closing fragments biased towards the EDM
  Refinement of fit to density without breaking closure
  Fully automatic fragment completion software
Similarity computation for large conformation sets
  Uniform simplification of protein structure for similarity computation
  Speed up existing similarity measures
  Method offers trade-off between speed and precision
  Efficient computation of nearest neighbors
Take-home message
Taking into account physical properties of proteins can lead to efficient algorithms for a wide variety of applications in structural biology
Outlook
Models that simplify the physics and chemistry of proteins
Algorithms that exploit properties of protein models
[Diagram labels: computer scientist; biophysicist/biochemist]
Develop simplified protein models that lend themselves to efficient computations
Acknowledgements
Jean-Claude Latombe
Vijay Pande
Michael Levitt
Leo Guibas
Axel Brunger, Balaji Prabhakar, Serafim Batzoglou
Fabian Schwarzer, Henry van den Bedem, Dan Halperin
Carlo Tomasi
Daniel Russakoff, Rachel Kolodny
Latombe group: Serkan Apaydin, Tim Bretl, Joel Brown, Phil Fong, Mitul Saha, Pekka Isto, Kris Hauser
Pande group: Bojan Zagrovic, Stefan Larson, Lillian Chong, Young Min Rhee, Sidney Elmer, Chris Snow, Guha Jayachandran, Eric Sorin, Sung-Joo Lee, Jim Cladwell, Michael Shirts, Nina Singhal, Relly Brandman, Vishal Vaidyanathan, Nick Kelley, Mark Engelhardt
Levitt group: Patrice Koehl, Tanya Raschke, Erik Lindahl
Thank you!