![Page 1: Folding@Home and Genome@home: Protein folding and design with distributed computing Stefan Larson Pande Group Dept. of Chemistry and Biophysics Program](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697bfaf1a28abf838c9d346/html5/thumbnails/1.jpg)
Folding@Home and Genome@home: Protein folding and design with distributed computing
Stefan Larson
Pande Group
Dept. of Chemistry and Biophysics Program
Stanford University
![Page 2: Folding@Home and Genome@home: Protein folding and design with distributed computing Stefan Larson Pande Group Dept. of Chemistry and Biophysics Program](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697bfaf1a28abf838c9d346/html5/thumbnails/2.jpg)
Credits
Pande Group
Dr. Vijay Pande
– Folding@home
  • Siraj Khaliq
  • Young Min Rhee
  • Michael Shirts
  • Chris Snow
  • Eric Sorin
  • Bojan Zagrovic
  • Sidney Elmer
– Genome@home
  • Stefan Larson
  • Vishal Vaidyanathan
  • Amit Garg
  • Guha Jayachandran
Collaborators
• Adam Beberg (Mithral)
• Dr. Jed Pitera (IBM)
• Dr. Bill Swope (IBM)
• Dr. Jay Ponder (Wash U)
• Folding@home users
• Dr. John Desjarlais (Xencor)
• Jeremy England (Harvard)
• Genome@home users
![Page 3: Folding@Home and Genome@home: Protein folding and design with distributed computing Stefan Larson Pande Group Dept. of Chemistry and Biophysics Program](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697bfaf1a28abf838c9d346/html5/thumbnails/3.jpg)
Molecular simulations in computational biology
![Page 4: Folding@Home and Genome@home: Protein folding and design with distributed computing Stefan Larson Pande Group Dept. of Chemistry and Biophysics Program](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697bfaf1a28abf838c9d346/html5/thumbnails/4.jpg)
Common challenges of Computational Biology
• Problems related to folding
  – Structure prediction
  – Binding
  – Protein-protein interaction
• Issues:
  – Models
    • Force fields (e.g. Charmm, Amber)
    • Lots of parameters, constrained by experiment: good enough?
  – Sampling
    • Can simulate 1 ns = 10^-9 sec in a day
    • Need to sample 10^4 to 10^6 ns!
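To make the sampling gap concrete, here is a minimal arithmetic sketch. It uses only the slide's figures (1 ns of simulated time per day, 10^4 to 10^6 ns needed); the function name and the single-processor assumption are illustrative, not part of the talk.

```python
# Rough arithmetic only: the 1 ns/day rate and the 10^4 to 10^6 ns targets
# come from the slide; everything else here is an illustrative assumption.
NS_PER_DAY = 1.0  # simulated nanoseconds per day on one processor


def cpu_days_needed(target_ns: float, ns_per_day: float = NS_PER_DAY) -> float:
    """Days of single-processor time needed to simulate target_ns nanoseconds."""
    return target_ns / ns_per_day


for target_ns in (1e4, 1e6):
    days = cpu_days_needed(target_ns)
    years = days / 365.25
    print(f"{target_ns:.0e} ns -> {days:.0f} CPU-days (about {years:.0f} CPU-years)")
```

At this rate a single processor would need decades to millennia, which is the motivation for spreading the work over many machines.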
![Page 5: Folding@Home and Genome@home: Protein folding and design with distributed computing Stefan Larson Pande Group Dept. of Chemistry and Biophysics Program](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697bfaf1a28abf838c9d346/html5/thumbnails/5.jpg)
Why simulate?
• Physics → chemistry → biology
  – Start from the laws of physics and chemistry, explain the properties of biomolecules
• Experiments: less detailed
  – Spectroscopies, FRET, NMR, etc.
  – Crystals are static
• Simulations: very detailed
  – Femtosecond time resolution
  – Angstrom spatial resolution
  – Much like having thousands of completely detailed single-molecule experiments
![Page 6: Folding@Home and Genome@home: Protein folding and design with distributed computing Stefan Larson Pande Group Dept. of Chemistry and Biophysics Program](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697bfaf1a28abf838c9d346/html5/thumbnails/6.jpg)
Goals
• Can we characterize folding computationally?
  – Accurate rates
  – Detailed mechanisms
• Can we design proteins?
  – Specific stable structure
  – Retention of function
![Page 7: Folding@Home and Genome@home: Protein folding and design with distributed computing Stefan Larson Pande Group Dept. of Chemistry and Biophysics Program](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697bfaf1a28abf838c9d346/html5/thumbnails/7.jpg)
Challenges of simulation
Models (force fields)
Sampling (tractability)
Analysis (insight)
![Page 8: Folding@Home and Genome@home: Protein folding and design with distributed computing Stefan Larson Pande Group Dept. of Chemistry and Biophysics Program](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697bfaf1a28abf838c9d346/html5/thumbnails/8.jpg)
Simulating protein folding
![Page 9: Folding@Home and Genome@home: Protein folding and design with distributed computing Stefan Larson Pande Group Dept. of Chemistry and Biophysics Program](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697bfaf1a28abf838c9d346/html5/thumbnails/9.jpg)
The Challenges of Protein Folding Simulation
1. How can we overcome the long timescales?
  • Fastest proteins fold in 10's to 100's of μs
  • Simulations are orders of magnitude shorter
2. Are force fields good enough?
  • Would we reach the native state (w/o NS info)?
  • Would we quantitatively predict folding rates, ΔG, etc. under experimental conditions (30 °C)?
3. Can we use simulation to learn about folding?
  • By what mechanism do they fold?
  • Do we agree with any folding theories?
![Page 10: Folding@Home and Genome@home: Protein folding and design with distributed computing Stefan Larson Pande Group Dept. of Chemistry and Biophysics Program](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697bfaf1a28abf838c9d346/html5/thumbnails/10.jpg)
Relevant timescales
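[Figure: logarithmic time axis with ticks at 10^-15 (femto), 10^-12 (pico), 10^-9 (nano), 10^-6 (micro), 10^-3 (milli) and 10^0 seconds. Events marked in order of increasing time: bond vibration, isomerisation, water dynamics, helix forms, fastest folders, typical folders, slow folders. Annotations: "MD step", "long MD run", "where we need to be", "where we'd love to be".]
• 16 orders of magnitude range
  – Femtosecond timesteps
  – Need to simulate micro to milliseconds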
![Page 11: Folding@Home and Genome@home: Protein folding and design with distributed computing Stefan Larson Pande Group Dept. of Chemistry and Biophysics Program](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697bfaf1a28abf838c9d346/html5/thumbnails/11.jpg)
Traditional parallel MD: Few, long trajectories
• Divide the force calculations between processors
  – Spatial decomposition for work division
  – Requires fast communication
[Images: T3E supercomputer; IBM Blue Gene]
Duan and Kollman, Science (1998)
Problem: we need WAY more time than is available at current supercomputer centers
![Page 12: Folding@Home and Genome@home: Protein folding and design with distributed computing Stefan Larson Pande Group Dept. of Chemistry and Biophysics Program](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697bfaf1a28abf838c9d346/html5/thumbnails/12.jpg)
Our method: Many, short trajectories
• Advantages of exponential kinetics:
  – Number that fold in time t:
    M f(t) = M[1 – exp(–kt)] ≈ Mkt for small kt
    M ~ 10,000 procs, k ~ 1/(10,000 ns), t ~ 20 ns/proc
    expect Mkt ~ 20 simulations to fold
• Computationally economical
  – Doesn't waste resources on communication
  – Natural for large, heterogeneous clusters
• Important for folding
  – Heterogeneity of paths, statistics
  – Ergodicity
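The expected-count estimate above can be checked numerically. This is a small sketch using the slide's values (M ~ 10,000 processors, k ~ 1/(10,000 ns), t ~ 20 ns per processor); the function names are illustrative.

```python
import math


def expected_folded(num_procs: float, rate_per_ns: float, time_ns: float) -> float:
    """Expected number of trajectories that fold by time t: M * (1 - exp(-k t))."""
    return num_procs * (1.0 - math.exp(-rate_per_ns * time_ns))


def expected_folded_linear(num_procs: float, rate_per_ns: float, time_ns: float) -> float:
    """Small-kt approximation: M * k * t."""
    return num_procs * rate_per_ns * time_ns


M = 10_000        # processors, one short trajectory each
k = 1 / 10_000    # folding rate in 1/ns
t = 20            # simulated nanoseconds per processor

print(f"exact:  {expected_folded(M, k, t):.2f}")
print(f"linear: {expected_folded_linear(M, k, t):.2f}")
```

Because kt = 0.002 is small, the exact and linearised values agree closely, so many short independent runs are expected to yield about 20 folding events.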
![Page 13: Folding@Home and Genome@home: Protein folding and design with distributed computing Stefan Larson Pande Group Dept. of Chemistry and Biophysics Program](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697bfaf1a28abf838c9d346/html5/thumbnails/13.jpg)
http://folding.stanford.edu
![Page 14: Folding@Home and Genome@home: Protein folding and design with distributed computing Stefan Larson Pande Group Dept. of Chemistry and Biophysics Program](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697bfaf1a28abf838c9d346/html5/thumbnails/14.jpg)
Distributed computing
home… lab/office… anywhere
The client uses the spare CPU cycles on a user’s computer to run the simulation algorithm on the assigned structure. Results are automatically returned and exchanged for a new work unit on a daily basis.
The server sends and receives the work units (essentially just protein structures and sequences). It verifies, collates and stores the returned data, completes initial analyses, and computes user statistics for the website.
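The client/server exchange described above can be sketched as a toy in-memory loop. This is not the Folding@home or Genome@home protocol: the class names, the `exchange` method, the trivial verification step and the string payloads are all hypothetical, chosen only to illustrate "return a result, receive a new work unit".

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class WorkUnit:
    unit_id: int
    structure: str  # placeholder for a protein structure/sequence payload


@dataclass
class Result:
    unit_id: int
    user: str
    payload: str


class Server:
    """Hypothetical work-unit server: hands out units, stores results, counts per user."""

    def __init__(self, units):
        self.pending = deque(units)
        self.issued = {}       # unit_id -> WorkUnit currently out with a client
        self.results = []      # collated results
        self.user_stats = {}   # user -> number of completed units

    def exchange(self, user, result=None):
        """Accept a finished result (if any) and return a new work unit, or None."""
        if result is not None and result.unit_id in self.issued:  # minimal verification
            del self.issued[result.unit_id]
            self.results.append(result)
            self.user_stats[user] = self.user_stats.get(user, 0) + 1
        if not self.pending:
            return None
        unit = self.pending.popleft()
        self.issued[unit.unit_id] = unit
        return unit


def run_client(server, user, simulate):
    """Client loop: compute on the assigned unit, then trade the result for a new unit."""
    unit = server.exchange(user)
    while unit is not None:
        result = Result(unit.unit_id, user, simulate(unit))
        unit = server.exchange(user, result)


server = Server([WorkUnit(i, f"structure-{i}") for i in range(5)])
run_client(server, "alice", lambda unit: unit.structure.upper())  # stand-in "simulation"
print(server.user_stats)
```

A real deployment would add networking, result validation, and handling for units that are never returned; those are left out to keep the exchange pattern visible.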
![Page 15: Folding@Home and Genome@home: Protein folding and design with distributed computing Stefan Larson Pande Group Dept. of Chemistry and Biophysics Program](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697bfaf1a28abf838c9d346/html5/thumbnails/15.jpg)
Worldwide distributed computing
![Page 16: Folding@Home and Genome@home: Protein folding and design with distributed computing Stefan Larson Pande Group Dept. of Chemistry and Biophysics Program](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697bfaf1a28abf838c9d346/html5/thumbnails/16.jpg)
Protein folding results
![Page 17: Folding@Home and Genome@home: Protein folding and design with distributed computing Stefan Larson Pande Group Dept. of Chemistry and Biophysics Program](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697bfaf1a28abf838c9d346/html5/thumbnails/17.jpg)
What to fold?…fastest folders
[Figure: logarithmic axis from 1 to 10^5, labelled "Nanoseconds, CPU-days", with CPU-year marks; proteins shown: PPA, alpha helix, beta hairpin, BBA5, villin.]
![Page 18: Folding@Home and Genome@home: Protein folding and design with distributed computing Stefan Larson Pande Group Dept. of Chemistry and Biophysics Program](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697bfaf1a28abf838c9d346/html5/thumbnails/18.jpg)
Rates: predicted vs experiment
[Figure: log-log plot of predicted folding time (nanoseconds) vs. experimental measurement (nanoseconds), both axes from 1 to 100,000; points: PPA, alpha helix, beta hairpin, villin, BBAW.]
Experiments:
• villin: Raleigh, et al, SUNY, Stony Brook
• BBAW: Gruebele, et al, UIUC
• beta hairpin: Eaton, et al, NIH
• alpha helix: Eaton, et al, NIH
• PPA: Gruebele, et al, UIUC
![Page 19: Folding@Home and Genome@home: Protein folding and design with distributed computing Stefan Larson Pande Group Dept. of Chemistry and Biophysics Program](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697bfaf1a28abf838c9d346/html5/thumbnails/19.jpg)
Mechanism: How did these proteins fold?
• Form secondary structure first
  – Form helices & hairpins
  – Hierarchical, decrease in entropy
• Collapse first
  – Hydrophobically driven
  – Need to remove water to form hydrogen bonds
• Form rough native shape first
  – Need to find the right "topology" first
  – Then pack side chains
![Page 20: Folding@Home and Genome@home: Protein folding and design with distributed computing Stefan Larson Pande Group Dept. of Chemistry and Biophysics Program](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697bfaf1a28abf838c9d346/html5/thumbnails/20.jpg)
What have we learned?
• Can tackle sampling today
• Forcefields sufficient?
  – Folding to the native state
  – Folding rate prediction
• Role of water
  – Explicit solvent not crucial to rate determination?
  – Compare to explicit solvent simulation
• Universal mechanism of folding?
  – Maybe no universal mechanism: all proteins could be different?
[Figure: log-log plot of predicted folding time (nanoseconds) vs. experimental measurement (nanoseconds), both axes from 1 to 100,000; points: PPA, alpha helix, beta hairpin, villin, BBAW.]
![Page 21: Folding@Home and Genome@home: Protein folding and design with distributed computing Stefan Larson Pande Group Dept. of Chemistry and Biophysics Program](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697bfaf1a28abf838c9d346/html5/thumbnails/21.jpg)
Protein design
![Page 22: Folding@Home and Genome@home: Protein folding and design with distributed computing Stefan Larson Pande Group Dept. of Chemistry and Biophysics Program](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697bfaf1a28abf838c9d346/html5/thumbnails/22.jpg)
Exploring sequence space: large scale protein design
Stanford University
• Stefan Larson
• Amit Garg
• Guha Jayachandran
• Dr. Vijay Pande
Harvard University
• Jeremy England
Xencor, Inc.
• Dr. John Desjarlais
gah.stanford.edu
![Page 23: Folding@Home and Genome@home: Protein folding and design with distributed computing Stefan Larson Pande Group Dept. of Chemistry and Biophysics Program](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697bfaf1a28abf838c9d346/html5/thumbnails/23.jpg)
Utility of large sequence libraries
Directed evolution
• constrain and guide mutagenesis steps
• enrich starting material in “structured” sequences.
Homology modeling
• broader sequence database for finding homologues
• generate sequence profiles for alignments, etc.
Drug design
• In silico screening of peptide and peptide-mimetic ligands to reduce lead libraries for drug design.
![Page 24: Folding@Home and Genome@home: Protein folding and design with distributed computing Stefan Larson Pande Group Dept. of Chemistry and Biophysics Program](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697bfaf1a28abf838c9d346/html5/thumbnails/24.jpg)
Computational exploration of sequence space
Approach
• Detailed all-atom protein representations
• Standard molecular mechanics force-fields
• Generate large sequence libraries
• Apply results to relevant biomedical questions
Challenges
• modeling backbone flexibility
• generating sequence diversity
• large scale iteration of design process
![Page 25: Folding@Home and Genome@home: Protein folding and design with distributed computing Stefan Larson Pande Group Dept. of Chemistry and Biophysics Program](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697bfaf1a28abf838c9d346/html5/thumbnails/25.jpg)
Sequence prediction algorithm
Wollacott AM, Desjarlais JR. “Virtual interaction profiles of proteins.” J Mol Biol. 2001, 313(2):317-42.
Raha K, Wollacott AM, Italia MJ, Desjarlais JR. “Prediction of amino acid sequence from structure.” Protein Sci. 2000, 9(6):1106-19.
Johnson EC, Lazar GA, Desjarlais JR, Handel TM. “Solution structure and dynamics of a designed hydrophobic core variant of ubiquitin.” Structure Fold Des. 1999, 7(8):967-76.
Desjarlais JR, Handel TM. “Side-chain and backbone flexibility in protein core design.” J Mol Biol. 1999, 290(1):305-18.
Lazar GA, Desjarlais JR, Handel TM. “De novo design of the hydrophobic core of ubiquitin.” Protein Sci. 1997, 6(6):1167-78.
Desjarlais JR, Handel TM. “De novo design of the hydrophobic cores of proteins.” Protein Sci. 1995, 4(10):2006-18.
Energy function
• Amber/OPLS parameters
• implicit solvation
Sampling
• genetic algorithm
• structure-dependent rotamer space
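As a rough illustration of genetic-algorithm sampling over sequences, here is a toy sketch. It is not the algorithm from the cited papers: the energy function is a hypothetical hydrophobic/polar burial score standing in for the Amber/OPLS energy with implicit solvation, rotamers are not modelled at all, and every name and parameter below is an assumption made for the example.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = set("AFILMVWY")


def toy_energy(seq, buried):
    """Hypothetical stand-in for a force-field energy (lower is better):
    one penalty per polar residue at a buried site or hydrophobic residue at an exposed site."""
    return sum((aa in HYDROPHOBIC) != is_buried for aa, is_buried in zip(seq, buried))


def mutate(seq, rng, rate=0.1):
    """Replace each residue with a random amino acid with probability `rate`."""
    return "".join(rng.choice(AMINO_ACIDS) if rng.random() < rate else aa for aa in seq)


def crossover(a, b, rng):
    """Single-point crossover of two equal-length sequences."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def design(buried, pop_size=50, generations=100, seed=0):
    """Evolve sequences toward low toy_energy for a given burial pattern."""
    rng = random.Random(seed)
    n = len(buried)
    pop = ["".join(rng.choice(AMINO_ACIDS) for _ in range(n)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda s: toy_energy(s, buried))
        parents = pop[: pop_size // 2]  # keep the better half unchanged (elitism)
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        pop = parents + children
    return min(pop, key=lambda s: toy_energy(s, buried))


buried = [True, False] * 6  # hypothetical 12-residue burial pattern
best = design(buried)
print(best, toy_energy(best, buried))
```

Keeping the better half of each generation means the best score never gets worse between generations; a realistic design run would replace `toy_energy` with a physical energy evaluated on a backbone structure and its allowed rotamers.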
![Page 26: Folding@Home and Genome@home: Protein folding and design with distributed computing Stefan Larson Pande Group Dept. of Chemistry and Biophysics Program](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697bfaf1a28abf838c9d346/html5/thumbnails/26.jpg)
Structural ensembles
[Figure, panel "Increased sequence diversity": sequence entropy <exp(S(i))> (0 to 8) vs. number of structural variants (0 to 50).]
[Figure, panel "Decreased identity to native sequence": frequency (0 to 0.6) vs. identity (%) (0 to 90); series: structural ensemble, full sequence; single structure, full sequence.]
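The diversity measure on the plot axis, <exp(S(i))>, can be sketched as follows. The slide does not define S(i); this example assumes it is the Shannon entropy (natural log) of the amino acid frequencies at position i across a set of equal-length designed sequences, so that exp(S(i)) is the effective number of residue types at that position. The function names and sample sequences are illustrative.

```python
import math
from collections import Counter


def position_entropy(column):
    """Shannon entropy (natural log) of the residue frequencies in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def mean_exp_entropy(sequences):
    """Average over positions of exp(S(i)); assumes all sequences have equal length."""
    values = [math.exp(position_entropy(col)) for col in zip(*sequences)]
    return sum(values) / len(values)


designed = ["ACDA", "ACEA", "AGDA", "ACDL"]  # made-up sequences for illustration
print(f"<exp(S(i))> = {mean_exp_entropy(designed):.3f}")
```

Under this assumption a value of 1 means every position is fully conserved, and larger values mean more residue types are tolerated per position.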
![Page 27: Folding@Home and Genome@home: Protein folding and design with distributed computing Stefan Larson Pande Group Dept. of Chemistry and Biophysics Program](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697bfaf1a28abf838c9d346/html5/thumbnails/27.jpg)
Large scale sequence generation
Diversity study:
• Total structures: 253
• Total backbone variants: 25,300
• Total time of data collection: 62 days
• Processors available: 3,000
• Total sequences generated: 188,725
![Page 28: Folding@Home and Genome@home: Protein folding and design with distributed computing Stefan Larson Pande Group Dept. of Chemistry and Biophysics Program](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697bfaf1a28abf838c9d346/html5/thumbnails/28.jpg)
Sequence quality
[Figure: E-value of best PDB hit (log scale, 1E-17 to 10) and average identity to native sequence (%) (0 to 30) vs. designed sequence profile, ranked by E-value (0 to 225).]
![Page 29: Folding@Home and Genome@home: Protein folding and design with distributed computing Stefan Larson Pande Group Dept. of Chemistry and Biophysics Program](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697bfaf1a28abf838c9d346/html5/thumbnails/29.jpg)
Designability
[Figure: frequency (0 to 0.8) vs. sequence entropy exp(S) (3 to 7); series: antifreeze, toxin, copper-bind, rubredoxin, Kunitz_BPTI, Phage_DNA_bind, all 253 structures.]
![Page 30: Folding@Home and Genome@home: Protein folding and design with distributed computing Stefan Larson Pande Group Dept. of Chemistry and Biophysics Program](https://reader036.vdocuments.us/reader036/viewer/2022062805/5697bfaf1a28abf838c9d346/html5/thumbnails/30.jpg)
New directions
Ongoing work
• Characterization of sequence space
• Natural sequence diversity (SH3)
• Homology modeling database
• SH3 peptide ligand design
• Experimental validation of designed sequences
• Hybrid approaches to protein design
• Design of peptide-mimetic ligands
• Design of functional proteins
• New design algorithms and parameter sets