biopilot: data-intensive computing for complex …...2 samatova_biopilot_sc07 biological research is...
TRANSCRIPT
Presented by
BioPilot: Data-Intensive Computingfor Complex Biological Systems
Nagiza Samatova
Computer Science Research Group
Computer Science and Mathematics Division
Sponsored by
the Department of Energy’s Office of Science
Advanced Scientific Computing Research
Scientific Discovery through Advanced Computing
2 Samatova_BioPilot_SC07
Biological research is becoming a high-throughput, data-intensive science, requiringdevelopment and evaluation of new methods and algorithms for large-scale
computational analysis, modeling, and simulation
-80 -60 -40 -20 0 20 40 60 800
0.5
1
1.5
2
2.5
3
3.5
4
4.5
5
Figure 1
Re
lative
Fre
qu
en
cy o
f A
pp
ea
ran
ce
Fragment Ion Mass Offset
Computational algorithms for proteomics• Improve the efficiency and reliability of protein identification and
quantification
• Peptide identification using MS/MS using pre-computed spectral databases
• Combinatorial peptide search with mutations, post- translationalmodifications and cross-linked constructs
Inference, modeling, and simulation of biological networks• Reconstruction of cellular network topologies using high-throughput
biological data
• Stochastic and differential equation-based simulations of the dynamics ofcellular networks
Integration of bioinformatics and biomolecular modelingand simulation
• Structure, function, and dynamics of complex biomolecular system inappropriate environments
• Comparative analysis of large-scale molecular simulation trajectories
• Event recognition in biomolecular simulations
• Multi-level modeling of protein models and their conformational space
Goals for enabling data-intensivecomputing in biology
3 Samatova_BioPilot_SC07
Comparative molecular trajectoryanalysis with BioSimGrid
• Routine protein simulations generateTB trajectories.
• Routine analysis tools (ED) requiremultiple passes.
• Correlation analyses have randomaccess patterns.
• Comparative analyses requiremultiple/distributed trajectory access.
Comparative Trajectory Analysis
Tools
ENERGY
GRADIENT
OPTIMIZE
DYNAMICS
THERMODYNAMICS
QMD
QM/MM
ET
QHOP
INPUT
PROPERTY
PREPARE
ANALYZE
ESP
VIB
Classical Force Field
DFT
SCF: RHF UHF ROHF
MP2: RHF UHF
MP3: RHF UHF
MP4: RHF UHF
RI-MP2
CCSD(T): RHF
CASSCF/GVB
MCSCF
MR-CI-PT
CI: Columbus Full Selected
Integral API
Geometry
Basis Sets
PEigS
pFFT
LAPACK
BLAS
MA
Global Arrays
ecce
ChemIO
NWCHM
The analysis of molecular dynamicstrajectories presents a large,data-intensive analysis problem:
The scientific challenge:
Integrated molecular simulationand comparative moleculartrajectory analysis:
• Enzymatic reaction mechanisms
• Molecular machines
• Molecular basis of signal and materialtransport
T. P. Straatsma, Computational Biology and Bioinformatics
4 Samatova_BioPilot_SC07
d
Building accurate protein structures• Homology models can be built from related protein structures, but even with
proper alignment, usually have about 4A RMS error–too large to permitmeaningful computational simulation/molecular dynamics.Goal is to reduce the errors in initial models.
• We multi-align the group of neighbors usingsequence and structure information to findthe most stable parts of the domain.Extracted stable core structure is 20 30%closer in terms of RMSD to the target thanto any of the original templates.
• More flexible parts of target are modeledlocally by choosing most sequentiallysimilar loops from the library of localsegments found in all the homologues.
• A genetic algorithm conformation searchstrategy using Cray XT4 uses backboneand sidechain angles as parameters.Each GA step is evaluated by minimization.
A computed template compared
with the target, improved from 3.4A
initially to 2.3A
E. Uberbacher, Biological and Environmental Sciences Division
5 Samatova_BioPilot_SC07
• Ultra-scale docking. The Bayesian potentials allowexploration of 100s and 1000s of protein complexes insteadof one or two.
• Docking from independently crystallized subunits. Wedemonstrated excellent results on complexes reconstructedfrom ~500 independently crystallized subunits.
• Whole genome predictions. Developed technology hasbeen used to predict 1000s of protein complexes on thegenome-wide level in several organisms.
Best structure. The set with the largest common
kernel always includes nearly the best native
structure present in the ensemble of 10,000
folded structures.
Quality estimator. The size of the largest common
kernel (obtained from the connectivity graph)
provides an excellent estimate of how close the
selected structure is to the native ones.
A. Gorin, Computer Science and Mathematics Division
Analysis of ultra-large structuralensembles
6 Samatova_BioPilot_SC07
Predictive proteome analysis usingadvanced computation and physical modelsScientific challenge:
Data sizes: 30,000 to 200,000 spectra from a single experiment to becompared with as many as millions of theoretical spectra
• Computational toolsidentify only 10 20%of spectra fromproteomics analysis.
• Fragmentation patternsare highly sequence-specific.
• Statisticalcharacterization ofnoise essential.
• Identification rateincreased by 23–40%.
NH3-HC -CO- NH-HC -CO- NH-HC -CO- NH-HC -COOH
R R R R
20%@5% False Pos. Rate
Increase in identified peptides
40%@1% False Pos. Rate
W. R. Cannon, Computational Biology and Bioinformatics
7 Samatova_BioPilot_SC07
In-depth analysis of MS proteomicsdata flows
Mass-spectrometry-based proteomics is one of the richest sources of thebiological information, but the data flows are enormous (100,000s ofsamples each consisting of 100,000s of spectra).
Computationally, bothadvanced graph algorithmsand memory-based indexes(> 1 TB are required forin-depth analysis of thespectra).
Two principally new capabilities are demonstrated: • Quantity of identified peptides is 50% increased. • Among highly reliable identifications, increase is many-fold.
A. Gorin, Computer Science and Mathematics Division
8 Samatova_BioPilot_SC07
ScalaBLAST
• Standalone BLAST applicationwould take >1 year for IMG(>1.6 million proteins) vs IMGand >3 years for IMG vs nr.
• “PERL script” approach breakseven modest clusters becauseof poor memory management.
Scientific challenge:
Good scaling on both shared anddistributed memory architectures
C. S. Oehmen, Computational Biology and Bioinformatics
120
0 150
Sca
ling
fa
cto
r
Number of processors
100500
40
80
MPP2
ALTIX
HTCBLAST
0 1000
Qu
eri
es p
er
(pro
c*m
inu
te)
Number of processors
100500
3
4
6
10000
5
2
1work factor
ideal
IMG 1.6
IMG 2.2
Run-t
ime (
hours
) 4035302520151050
0 1 2 3 4 5 6 7 8
Trillions of pairwise comparisons
9 Samatova_BioPilot_SC07
After: ~1GB
Before: ~2TB
pGraph: Parallel Graph Library foranalysis of biological networks
Major features:
• Memory reduction by 1000 times
• Scalability to 100s of processors
• FPT-based search space reduction theory
Find cliques Merge cliques
2,109
vertices
16,169
edges
4123 modulesModule
851
“common-target”
associations
Association
pGraph
Nagiza Samatova, Computer Science and Mathematics Division
10 Samatova_BioPilot_SC07
Discovering cell networks withstructure and genome context
2007, Sep 7
27(5): 793
The conserved ASD domain links ~30%
of ECF sigma and anti-sigma pairs in the
signal transduction cascade, e.g.,
response to iron and 1O2 with FecIR,
RpoE-ChrR proteins.
Many combinations of different sigma ( )
and sensor (S) domains exist. Two levels
of positional clustering, (1) domain and
(2) gene neighbor, generate vast
permutations between protein pairs.
The structure of the ChrR anti-sigma factor from Rhodobactersphaeroides reveals domains which help explain the
combinatoric complexity of gene regulation in response to
environmental conditions. An advanced toolkit for graph,
genome context, and visual analytics was used to study ECF
sigma factor regulation.
FecI FecR
RpoE ChrR
3D structure of ChrRanti-sigma regulator
ECF regulation
1 ASD
2
3
4
ASD
S4
S3
anti-
ASD
ASD
S2
S1
Sofia Heidi, Computational Biology and Bioinformatics
11 Samatova_BioPilot_SC07
• To model microbial communities, the smallest realistic model wouldinclude >109 cells x 100 reactions ~1011 reactions.
• Currently can handle ~104–106 reactions on a single-processor machine.
Scientific challenge:
Stochastic simulations of biologicalnetworks
• J. Phys. Chem. B 105, 11026,2001
• Speedup over exact GillespieSSA ~25
Multinomial Tau-leaping method:
• Speedups ~40–200
Probability-weighted dynamicMonte Carlo method…
H. Resat, Computational Biology and Bioinformatics
1 4
Number of phosphorylation binding sites
32 5 876 9 1211100
250
500
750
1000
1250
1500
Ave
rag
e r
un
nin
g t
ime
(se
co
nd
s)
12 Samatova_BioPilot_SC07
Contacts
Tjerk Straatsma
Principal Investigator
Pacific Northwest National Laboratory
Nagiza Samatova
Principle Investigator
Oak Ridge National Laboratory
Christine Chalk
Program Manager
U.S. Department of Energy
Office of Science
Advanced Scientific Computing Research, SciDAC
12 Samatova_BioPilot_SC07