TRANSCRIPT
IBM T. J. Watson Research Center
IBM Computational Biology Center June 22, 2005 © 2005 IBM Corporation
Analysis of High Bandwidth Molecular Dynamics Results from the Blue Gene Project
Frank Suits
Overview
Current biomolecular simulations on Blue Gene
– Simulations range from small proteins in water to large proteins in lipid membranes
Types of output and storage needs
– Small numbers of large systems; large numbers of small systems
Ways to handle I/O
– Data reduction is key
Role of visualization in data reduction
– Several examples, including an experimental view of protein sequence motifs
December 1999:
IBM Announces $100 Million Research Initiative to build World's Fastest Supercomputer
"Blue Gene" to Tackle Protein Folding Grand Challenge
YORKTOWN HEIGHTS, NY, December 6, 1999 -- IBM today announced a new $100 million exploratory research initiative to build a supercomputer 500 times more powerful than the world’s fastest computers today. The new computer -- nicknamed "Blue Gene" by IBM researchers -- will be capable of more than one quadrillion operations per second (one petaflop). This level of performance will make Blue Gene 1,000 times more powerful than the Deep Blue machine that beat world chess champion Garry Kasparov in 1997, and about 2 million times more powerful than today's top desktop PCs.
Blue Gene's massive computing power will initially be used to model the folding of human proteins, making this fundamental study of biology the company's first computing "grand challenge" since the Deep Blue experiment. Learning more about how proteins fold is expected to give medical researchers better understanding of diseases, as well as potential cures.
Actual Blue Gene Timeline
December 1999: Blue Gene project announcement
October 2000: Blue Matter software development begins
June 2003: First chips completed
November 2003: BG/L Half rack prototype (512 nodes) ranked #73 (1.435 TFlop/s).
May 2004: First production Blue Matter runs on membrane systems
November 2004: 16-rack Livermore system #1 in Top500 at 70 TFlop/s (1/4 of completed system)
May 2005: 32-rack Livermore system achieves 135 TFlop/s
May 2005: Watson 20-rack system, BG/W, completed – 91 TFlop/s, unofficially #2 in world.
Later in 2005: Livermore completes 64-rack system
Molecular Dynamics Time Scales
Time axis (seconds, log scale): 10^-15, 10^-12, 10^-9, 10^-6, 10^-3, 1, 10^3, 10^6, 10^9
Processes placed along the axis: Bond Vibration, DNA Twisting, Helix-Coil Transition, Protein Folding, Electron Transfer, Hinge Motion, Ligand-Protein Binding, Lipid exchange via diffusion, Torsional correlation in lipid headgroups
Ranges marked on the axis: Simulation, Experiment
Adapted from “The Protein Folding Problem”, Chan and Dill, Physics Today, Feb. 1993
Hairpin – first Blue Matter system – 5000 atoms
From Packets to Publications
What data are we analyzing?
– Molecular dynamics data are output from each node as individual binary packets of information
– These packets must be framed for each timestep and checked for completeness
– Then they are aggregated and stored in more usable form as energy traces or atom coordinates over time (trajectory)
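The framing and completeness check described above might look like the following minimal sketch. The packet layout (a timestep, a node id, and a payload) and the names `Packet` and `frame_packets` are assumptions made for illustration; the slides do not describe the actual Blue Matter packet format.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    """Hypothetical packet: source timestep and node, plus raw data."""
    timestep: int
    node: int
    payload: bytes


def frame_packets(packets, n_nodes):
    """Group packets by timestep and separate complete frames from incomplete ones.

    A frame is treated as complete when a packet from every one of the
    n_nodes nodes is present for that timestep.
    """
    frames = defaultdict(dict)
    for p in packets:
        frames[p.timestep][p.node] = p.payload  # a duplicate overwrites the earlier copy

    expected = set(range(n_nodes))
    complete, incomplete = {}, {}
    for step in sorted(frames):
        by_node = frames[step]
        if set(by_node) == expected:
            # Concatenate payloads in node order to form the aggregated record.
            complete[step] = b"".join(by_node[n] for n in range(n_nodes))
        else:
            # Record which nodes are still missing for this timestep.
            incomplete[step] = sorted(expected - set(by_node))
    return complete, incomplete
```

Packets may arrive in any order; only timesteps with a packet from every node are aggregated, and the rest are reported with the list of missing nodes.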
Systems studied with Blue Matter
Increasing size and complexity with increasing compute power
Hairpin in water – 5000 atoms
– 237 serial runs on SP
– 2 publications
Lipid and Lipid/Cholesterol bilayers – 15000 atoms
– 32-way MPI runs on SP
• Some on BG/L
– Several papers and talks pending
Rhodopsin in Lipid/Cholesterol bilayer – 40000 atoms
– Milestone system, running microsecond scale simulation this year on BG/L
– Significant interest in scientific results expected
How much data?
50K atoms with pos, vel ⇒ 2.5MB “state”
1 rack yields 5 hours per nanosecond with 2 fs timestep ⇒ 100K steps/hour
1 rack produces 250 GB / hour
BGW (20 racks) produces 5 TB / hour
How to capture, analyze, and archive??
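The arithmetic behind these figures can be reproduced with a short sketch. It assumes 8-byte values for each position and velocity component (the slides do not state the precision) and that every rack produces data at the single-rack rate; the function names are illustrative.

```python
def state_bytes(n_atoms, bytes_per_value=8):
    """Bytes for one full state: 3 position + 3 velocity values per atom.

    bytes_per_value=8 (double precision) is an assumption.
    """
    return n_atoms * 6 * bytes_per_value


def steps_per_hour(timestep_fs, hours_per_ns):
    """Timesteps completed per wall-clock hour."""
    steps_per_ns = 1_000_000 // timestep_fs  # 1 ns = 1,000,000 fs
    return steps_per_ns / hours_per_ns


def bytes_per_hour(n_atoms, timestep_fs, hours_per_ns, racks=1):
    """Raw output rate if the full state were written at every timestep."""
    return state_bytes(n_atoms) * steps_per_hour(timestep_fs, hours_per_ns) * racks
```

With these assumptions, 50,000 atoms give a 2.4 MB state, 240 GB per hour for one rack, and 4.8 TB per hour for 20 racks, which the slide rounds to 2.5 MB, 250 GB, and 5 TB.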
Answer – don’t need all that data
Positions and velocities needed only rarely
Typically store positions every 500 timesteps at low resolution (16-bit) for analysis
Positions and velocities stored at full resolutions every 5000 timesteps or so, for “restart”
Immediate reduction and selective archiving of data makes it much more manageable
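As a rough illustration of this reduction, the sketch below stores coordinates as 16-bit integers and estimates the average output per timestep. The fixed-point scaling over a box range, the 8-byte restart values, and all function names are assumptions; the slides say only that positions are kept at 16-bit resolution.

```python
import struct


def quantize(x, lo, hi):
    """Map a coordinate in [lo, hi] to an unsigned 16-bit integer."""
    if not lo <= x <= hi:
        raise ValueError("coordinate outside the [lo, hi] range")
    return round((x - lo) / (hi - lo) * 65535)


def dequantize(q, lo, hi):
    """Recover an approximate coordinate from its 16-bit code."""
    return lo + q / 65535 * (hi - lo)


def pack_positions(coords, lo, hi):
    """Pack a flat list of coordinates as little-endian uint16 values."""
    values = [quantize(x, lo, hi) for x in coords]
    return struct.pack(f"<{len(values)}H", *values)


def unpack_positions(blob, lo, hi):
    """Inverse of pack_positions, up to quantization error."""
    n = len(blob) // 2
    return [dequantize(q, lo, hi) for q in struct.unpack(f"<{n}H", blob)]


def average_bytes_per_step(n_atoms, traj_every=500, restart_every=5000):
    """Average output per timestep under the storage schedule on the slide."""
    trajectory = n_atoms * 3 * 2 / traj_every    # 16-bit positions only
    restart = n_atoms * 6 * 8 / restart_every    # assumed 8-byte pos + vel
    return trajectory + restart
```

For 50,000 atoms this schedule averages about 1 KB per timestep, compared with roughly 2.4 MB if the full state were written every step, and the worst-case position error is half of one quantization step.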
Mystery Plot: Many stories in a simple line plot (Familiar data to all)
Line plot: horizontal axis from 355 to 360, vertical axis from 10 to 80
Some analysis modes
Validation
– Energy and momentum should be conserved
– No temperature drift
Visual inspection of configuration and behavior
– Sanity check – system is behaving as it should
Reduction to quantitative results that can match experiment
– Diffusion constants, NMR-related correlation times, lifetimes
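One simple way to run the validation check listed above is to fit a straight line to the total energy trace and compare the drift over the run with the mean energy. This is a sketch under assumptions: the tolerance `rel_tol` and both function names are hypothetical, and the slides do not say which test was actually used.

```python
def linear_drift(times, energies):
    """Least-squares slope of energy versus time."""
    n = len(times)
    if n < 2 or n != len(energies):
        raise ValueError("need two or more matching samples")
    t_mean = sum(times) / n
    e_mean = sum(energies) / n
    num = sum((t - t_mean) * (e - e_mean) for t, e in zip(times, energies))
    den = sum((t - t_mean) ** 2 for t in times)
    return num / den


def is_conserved(times, energies, rel_tol=1e-6):
    """True if the fitted drift over the whole run is small relative to the mean energy.

    rel_tol is an illustrative threshold, not a value from the slides.
    """
    drift = abs(linear_drift(times, energies) * (times[-1] - times[0]))
    mean_abs = abs(sum(energies) / len(energies))
    return drift <= rel_tol * mean_abs
```

The same slope fit applied to a temperature trace gives a basic check for temperature drift.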
The end result of validation: Excellent energy conservation
Examples of analysis for Blue Gene Mol. Dynamics
Hairpin free energy surface – Thermodynamic view (static)
Free energy surface with trajectories: Kinetics
Each color is a separate trajectory
Some overlap, others are distinct
Can they be chained together?
System moves among 30 bins. Stripchart view
Huge reduction: From complex arrangement of atoms to 5 bits
30 bins
237 trajectories in full set of runs. Build transition matrix
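Building a transition matrix from binned trajectories can be sketched as follows. Each trajectory is taken to be a list of bin indices, one per saved frame; the lag of one frame and the row normalization are assumptions, and the cited papers describe the authors' actual procedure.

```python
def transition_counts(trajectories, n_bins, lag=1):
    """Count bin-to-bin transitions at the given lag.

    Transitions are counted within each trajectory only, never across
    the boundary between two separate runs.
    """
    counts = [[0] * n_bins for _ in range(n_bins)]
    for traj in trajectories:
        for a, b in zip(traj, traj[lag:]):
            counts[a][b] += 1
    return counts


def row_normalize(counts):
    """Turn counts into transition probabilities; unvisited bins give zero rows."""
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total for c in row] if total else [0.0] * len(row))
    return matrix
```

With 30 bins, each frame's bin index fits in 5 bits (2^5 = 32), which is the reduction quoted above.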
Assign bins to free energy surface
Figure: free energy surface divided into numbered bins (0, 1, 2, 3, 4, 5, ...); axes are Radius of gyration and Number of native hydrogen bonds
Markov view of protein folding kinetics
Captures timescale and pathways of proteins in landscape
Describing Protein Folding Kinetics by Molecular Dynamics Simulations. 1. Theory. Swope, Pitera, Suits. J. Phys. Chem. B; 2004; 108(21) 6571-81
Describing Protein Folding Kinetics by MD Sim. 2. Applications to Alanine Dipeptide and B Hairpin. Swope, Pitera, Suits, Pitman, Eleftheriou, Fitch, Germain, Rayshubskiy, Ward, Zhestkov, Zhou. J. Phys. Chem. B; 2004; 108(21) 6582-94
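In a Markov description of this kind, the populations of the states are advanced in time by repeated multiplication with a matrix of transition probabilities. The sketch below shows only that generic step; the function name and the row-stochastic convention (row i holds the probabilities of leaving state i) are assumptions, not the formulation in the papers cited above.

```python
def propagate(population, matrix, steps=1):
    """Advance a population vector p by p <- p T for the given number of steps.

    matrix[i][j] is the probability of moving from state i to state j
    during one lag time, so each row should sum to 1.
    """
    p = list(population)
    n = len(p)
    for _ in range(steps):
        p = [sum(p[i] * matrix[i][j] for i in range(n)) for j in range(n)]
    return p
```

Starting from all population in one state and propagating for many steps shows how quickly, and by which intermediate states, the system relaxes.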
Lipid/Cholesterol/Water membrane simulation
Lipid Membrane – 13,000 atoms
How to quantify what’s going on?
Cholesterols - how are they interacting?
Cholesterol and Lipid Diffusion as r² vs. time lag
Plot: r² (Å²), from 0 to 50, vs. Time (ns), from 0 to 14
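A curve of r² against time lag can be computed as a mean squared displacement averaged over time origins and molecules. The sketch below assumes lateral (x, y) coordinates that have already been unwrapped across periodic boundaries and are sampled at a fixed interval; the function names are illustrative.

```python
def mean_squared_displacement(xy, lag):
    """Average squared lateral displacement of one molecule at a given lag.

    xy is a list of (x, y) positions at evenly spaced times; every frame
    with a partner `lag` frames later is used as a time origin.
    """
    n = len(xy) - lag
    if lag < 1 or n < 1:
        raise ValueError("lag must be between 1 and len(xy) - 1")
    total = 0.0
    for (x0, y0), (x1, y1) in zip(xy, xy[lag:]):
        total += (x1 - x0) ** 2 + (y1 - y0) ** 2
    return total / n


def msd_curve(tracks, max_lag):
    """MSD averaged over molecules for lags 1..max_lag (in frames)."""
    return [
        sum(mean_squared_displacement(t, lag) for t in tracks) / len(tracks)
        for lag in range(1, max_lag + 1)
    ]
```

Multiplying each lag index by the frame spacing gives the time axis in ns.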
Diffusion “constants” as function of time lag
Plot: D (slope of r²/4) (1E-8 cm²/s), from 0 to 10, vs. Time (ns), from 0 to 12; curves: Cholesterol, Lipid
Diffusion Calculations in Simulated Lipid-Cholesterol Bilayers. Suits, Pitman, Feller. Gordon Conference on Computational Chemistry, July 2004
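The quantity on the vertical axis, D as the slope of r² divided by 4, corresponds to two-dimensional diffusion, where r² = 4Dt. A minimal sketch of the conversion follows, using a finite-difference slope between neighboring lag points; that choice and the function name are assumptions, since the slides do not say how the slope was estimated.

```python
# 1 Å²/ns = 1e-16 cm² / 1e-9 s = 1e-7 cm²/s = 10 x (1e-8 cm²/s)
A2_PER_NS_IN_1E8_CM2_PER_S = 10.0


def diffusion_vs_lag(lags_ns, msd_A2):
    """Return (midpoint lag in ns, D in units of 1e-8 cm²/s) pairs.

    D is one quarter of the local slope of r² versus time lag.
    """
    points = list(zip(lags_ns, msd_A2))
    out = []
    for (t0, r0), (t1, r1) in zip(points, points[1:]):
        slope = (r1 - r0) / (t1 - t0)  # Å² per ns
        out.append(((t0 + t1) / 2, slope / 4 * A2_PER_NS_IN_1E8_CM2_PER_S))
    return out
```

If diffusion were ideal, D would be the same at every lag; plotting it against lag shows how far the simulated motion departs from that, which is presumably why the slide title puts "constants" in quotes.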
Lipid neighborhood around a cholesterol
Each lipid has two different “chains,” shown red and blue
2D contours give some idea of neighborhood, but only in slice. 3D possibilities?
3D isosurfaces of density show lipid distributed symmetrically, while cholesterols show strong orientation preference…
Red: Lipid Blue: Other cholesterols
Also see water pulled in from above and cholesterols preferentially oriented to each other . . .
Red: water layer
Blue: other cholesterols
… while the two individual lipid chains have preference
Molecular Dynamics Investigation of Structure and Dynamics of Cholesterol in a Polyunsaturated Lipid Bilayer. Pitman, Suits, MacKerell, Feller. Emerging Challenges in Membrane Biophysics, June 2004, Sun Valley, Idaho
Molecular Dynamics Investigation of the Structural Properties of Phosphatidylethanolamine Lipid Bilayers. Pitman, Suits, Feller. J. Phys. Chem. B, 2005
Red and Blue: The two different chains on each lipid
Current and future work
Milestone system currently in production on BG/L
– Rhodopsin: GPCR protein in cholesterol/lipid bilayer
– Light receptor, and represents large class of drug targets
Combines all aspects of previous simulations:
– Protein behavior
– Lipid membrane environment
– Effect of cholesterol on membrane and protein
Rich with analysis opportunities
Ensemble now running on many racks
Rhodopsin and the Eye
http://www2.mrc-lmb.cam.ac.uk/groups/GS/eye.html
Figure labels: Retina; Light sensitive Protein; Outer segment of Rod
GPCR-based drugs among the 200 best-selling prescriptions
GPCR target | Drug | Disease | Company | 2000 sales (US $m)
Histamine receptors | Zantac | Ulcers | AstraZeneca | 870
Histamine receptors | Pepcid | Ulcers | Merck | 850
Histamine receptors | Claritin | Allergies | Schering-Plough | 2,200
Histamine receptors | Allegra | Allergies | Aventis | 1,100
5-HT receptors | Risperdal | Psychosis | Johnson & Johnson | 1,600
5-HT receptors | Imitrex | Migraine | GlaxoSmithKline | 1,100
5-HT receptors | BuSpar | Anxiety | Bristol-Myers Squibb | 714
5-HT receptors | Zyprexa | Schizophrenia | Eli Lilly | 2,400
Angiotensin receptors | Cozaar | Hypertension | Merck | 1,700
Adrenoceptors | Toprol-XL | Hypertension | AstraZeneca | 580
Adrenoceptors | Coreg | Congestive heart failure | GlaxoSmithKline | 250
Adrenoceptors | Serevent | Asthma | GlaxoSmithKline | 940
Muscarinic acetylcholine receptors | Atrovent | COPD | Boehringer Ingelheim | 600
GnRH receptors | Zoladex | Cancer | AstraZeneca | 740
Dopamine receptors | Requip | Parkinson’s disease | AstraZeneca | 90
Prostaglandin (PGE1) receptors | Cytotec | Ulcers | Pharmacia | 100
ADP receptors | Plavix | Stroke | Bristol-Myers Squibb | 900
http://www.predixpharm.com/market_table.htm
Rhodopsin in lipid/cholesterol bilayer – 43,000 atoms
First scientific publication with BG/L hardware (JACS 2005)
Bioinformatics visualization experiment
Find novel 2D visualization that captures protein motifs from simple patterns
Simple reduction of data with minimal transformation/heuristics
Let the eye find the patterns, if any
Motifs from E-Coli Protein Sequences
Count  Pattern
:      :
8      DEADR
4      DEAEA
6      DEAEL
4      DEAER
6      DEAIA
5      DEAKA
6      DEAKR
:      :
(50,000 lines)
With a large number of related patterns, how to see relative pattern frequency?
Motif visualization
Goals:
– Provide an understandable reduction of the full data
– Show relative population of 3-character patterns
– Allow “drill-down” on selected patterns of interest
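One possible reduction from a motif table to 3-character populations is sketched below. The slides do not say how the longer motifs were reduced to 3-character patterns, so the rule used here (tally every 3-character window of each motif, weighted by that motif's count) is an assumption, as is the function name.

```python
from collections import Counter


def three_char_counts(motif_counts):
    """Tally 3-character windows, weighting each by its motif's count.

    motif_counts is an iterable of (pattern, count) pairs, such as
    ("DEADR", 8).
    """
    tally = Counter()
    for pattern, count in motif_counts:
        for i in range(len(pattern) - 2):
            tally[pattern[i:i + 3]] += count
    return tally
```

Keeping the source motifs alongside each 3-character key would support the drill-down goal: selecting a pattern could then list the full motifs that contributed to it.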
3D Scatter plot view of three-letter distribution
Origin is AAA, axes end at YAA, AYA, AAY.
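The mapping from a three-letter pattern to a point in this scatter plot can be sketched as one alphabet index per axis. The 25-character alphabet A through Y matches the axis endpoints given above; the function names and the choice of the first letter for the first axis are assumptions.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXY"  # 25 characters, A through Y


def pattern_to_xyz(pattern):
    """Map a 3-letter pattern to integer coordinates, one letter per axis."""
    if len(pattern) != 3:
        raise ValueError("pattern must have exactly 3 characters")
    return tuple(ALPHABET.index(ch) for ch in pattern)


def scatter_points(counts):
    """Turn {pattern: count} into (x, y, z, count) tuples for plotting."""
    return [pattern_to_xyz(p) + (n,) for p, n in sorted(counts.items())]
```

The count can then drive marker size or color in whatever plotting tool is used.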
2D representation of 4-char alphabet
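Single characters occupy a 2 × 2 grid:
A B
C D
Each cell is then subdivided the same way; the first character selects the quadrant and the second the position within it:
AA AB BA BB
AC AD BC BD
CA CB DA DB
CC CD DC DD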
Extend to 25-char alphabet:
2D representation with 25 character alphabet
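Figure: 5 × 5 grid of cells A to Y, each subdivided into its own 5 × 5 grid (AA to AY within A, BA to BY within B, through YA to YY within Y)
RESULT ⇒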
Result for E-coli
Single view shows distribution for all 3-char patterns
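A view like this can be built by giving every 3-character pattern one cell in a 125 × 125 grid through a nested 5 × 5 layout: the first character picks a large cell, the second a cell inside it, and the third a cell inside that. The sketch below assumes the alphabet A through Y laid out row by row in each 5 × 5 block, which is consistent with the 4-character example in the slides; the function names are illustrative.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXY"  # 25 characters, A through Y
SIDE = 5                                # 5 x 5 cells per nesting level


def pattern_to_cell(pattern, alphabet=ALPHABET, side=SIDE):
    """Return (row, col) of a pattern in the nested grid.

    Each character refines the position: earlier characters select
    coarser blocks, later ones select cells within them.
    """
    row = col = 0
    for ch in pattern:
        idx = alphabet.index(ch)
        row = row * side + idx // side
        col = col * side + idx % side
    return row, col


def count_grid(counts, length=3, alphabet=ALPHABET, side=SIDE):
    """Fill a square grid with pattern counts for patterns of one length."""
    size = side ** length
    grid = [[0] * size for _ in range(size)]
    for pattern, n in counts.items():
        r, c = pattern_to_cell(pattern, alphabet, side)
        grid[r][c] += n
    return grid
```

Rendering the grid as an image with count mapped to brightness gives the single view; subtracting two such grids is one way to produce a difference view between data sets.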
Diff views
Results of experiment
Novel view
Possibly interesting
Need to try with other data sets
Combine with hierarchical reordering of axes
Easy to try, and captures original data in form the eye can process without imposing bias
Simple form of data reduction
Conclusions and future directions in Blue Gene Analysis
Analysis involves a staged reduction of info
Use visualization when appropriate
Useful for results and validation – and insight
BG/L machine continues to grow
Many molecular dynamics studies are ongoing and will accelerate as machine grows
Small number of large molecular systems, and ensembles of smaller systems
Stay tuned for more results and publications
Many Acknowledgements
Alex Balaeff
Bruce Berne
Maria Eleftheriou
Scott Feller
Blake Fitch
Robert Germain
Alan Grossfield
Laxmi Parida
Jed Pitera
Mike Pitman
Alex Rayshubskiy
Bill Swope
Chris Ward
Yuri Zhestkov
Ruhong Zhou
And… Blue Gene Hardware & System Software teams
Backup
Scaling Directions
Figure axes: timescale, complexity, statistical certainty
BG/L communication network
Ocean view with Torus
The science plan – a spectrum of projects
systematically cover a range of system sizes, topological complexity
– discovering the "rules" of folding
– applying those rules to have impact on disease
address a broad range of scientific questions and impact areas:
– thermodynamics
– folding kinetics
– folding-related disease (CF, Alzheimer's, GPCR's)
improve our understanding not just of protein folding but protein function
Systems shown: 1LE1, 1L2Y, 1EOM, 1ENH, 1BBL, 1LMB, 1FME, GPCR in membrane