Algorithms for In Situ Data Analytics of Next Generation Molecular Dynamics Workflows

Michela Taufer

Department of Electrical Engineering and Computer Science

The University of Tennessee Knoxville

Acknowledgements

T. Johnston, B. Zhang, T. Estrada, A. Liwo, S. Crivelli, M. Cuendet, E. Deelman, R. da Silva, H. Weinstein, A. Razavi, T. Do, S. Thomas, A. Plante, B. Mulligan

Sponsors: [logos]

Trends in Next-Generation Systems

Source: Lucy Nowell (DOE)

[Figure: growth factors 5x, 12x, 64x, and 13x]

Rising Importance of Ensembles

Source: https://wci.llnl.gov/simulation/computer-codes/uncertainty-quantification

[Figure: number of jobs (few to many) versus system scale in FLOPS (terascale, petascale, exascale); systems shown: Purple (2005), Sequoia (2013), Sierra (2018)]

Widening I/O Gap


Classical Molecular Dynamics Simulations

By Vincent Voelz Gorup - CC BY-SA 3.0, http://www.voelzlab.org/

• An MD simulation comprises hundreds of thousands of MD jobs
• Each job performs hundreds of thousands of MD steps

Classical Molecular Dynamics Simulations

Forces on single atoms → Acceleration → Velocity → Position

• An MD step computes forces on single atoms (e.g., bond, angle, dihedral, nonbonded)
• Forces are added to compute acceleration
• Acceleration is used to update velocities
• Velocities are used to update the atom positions
• Every n steps, all atom positions are stored → 3D snapshot or frame
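The chain from forces to positions can be sketched in a few lines of Python. This is a minimal illustration, not the integrator of any particular MD code: the harmonic force field, the function names, and the simple Euler-style update are assumptions (production codes typically use schemes such as velocity Verlet or leapfrog).

```python
def harmonic_forces(positions, k=1.0):
    """Toy force field (assumption): each atom is pulled toward the origin."""
    return [[-k * c for c in atom] for atom in positions]


def run_md(positions, velocities, masses, n_steps, dt, stride,
           force_fn=harmonic_forces):
    """Advance the system and store all positions every `stride` steps."""
    pos = [list(p) for p in positions]
    vel = [list(v) for v in velocities]
    frames = []
    for step in range(1, n_steps + 1):
        forces = force_fn(pos)                      # total force on each atom
        for i, m in enumerate(masses):
            for d in range(3):
                acc = forces[i][d] / m              # a = F / m
                vel[i][d] += acc * dt               # acceleration updates velocity
                pos[i][d] += vel[i][d] * dt         # velocity updates position
        if step % stride == 0:
            frames.append([tuple(p) for p in pos])  # 3D snapshot (frame)
    return frames


start = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
frames = run_md(start, [(0.0, 0.0, 0.0)] * 2, [1.0, 1.0],
                n_steps=100, dt=0.01, stride=10)
print(len(frames), "frames stored")
```

With `n_steps=100` and `stride=10`, the loop stores ten frames; the frames are the only data that leave the simulation for analysis.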

Analyzing MD Frames: Present and Future

[Figure: two panels, Present and Future. Each shows compute nodes (CN) running the MD code in memory, an analysis component, and a parallel file system; one panel also includes local storage.]

In Situ and In Transit Analysis

[Figure: in situ analysis, where simulation and analysis run on cores of the same node and share memory; in transit analysis, where simulation and analysis run on separate nodes (Node 1, Node 2, Node 3) connected by the network interconnect]

Examples of tools:
• DataSpaces (Rutgers U.)
• DataStager (Georgia Tech)

In situ and in transit analysis requires rethinking data algorithms

Building a Closed-loop Workflow

[Figure: closed-loop workflow with three stages (Data Generation, Data Analytics, Data Storage). Labeled components: Plumed; MD code (e.g., GROMACS); Ingestor; in-memory staging area (DataSpaces); Retriever; A4MD analytics (algorithms, ML-inferred algorithms); burst buffer; parallel file system (e.g., Lustre). Labeled arrows: dataflow, controlflow, data feedback, "Run n-stride simulation steps", collective variables.]
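The loop formed by the workflow components named above (data generation, staging, analytics, feedback) can be mimicked with a toy Python sketch. Everything here is a stand-in chosen for illustration: the `StagingArea` class is a plain in-memory queue rather than the DataSpaces API, the "simulation" is a 1D random walk, and the stopping rule is hypothetical.

```python
from collections import deque
import random


class StagingArea:
    """In-memory stand-in (assumption) for a staging service such as DataSpaces."""

    def __init__(self):
        self._frames = deque()

    def put(self, frame):   # called on the ingestor side
        self._frames.append(frame)

    def get(self):          # called on the retriever side
        return self._frames.popleft()


def simulate(state, n_stride, rng):
    """Toy data generation: a 1D random walk advanced by n_stride steps."""
    for _ in range(n_stride):
        state += rng.uniform(-1.0, 1.0)
    return state


def collective_variable(frame):
    """Toy analytics: distance of the walker from its starting point."""
    return abs(frame)


def closed_loop(n_stride=10, threshold=5.0, max_rounds=1000, seed=0):
    rng = random.Random(seed)
    staging, state, history = StagingArea(), 0.0, []
    for _ in range(max_rounds):
        state = simulate(state, n_stride, rng)   # run n-stride simulation steps
        staging.put(state)                       # ingestor: dataflow into staging
        cv = collective_variable(staging.get())  # retriever + analytics
        history.append(cv)
        if cv >= threshold:                      # data feedback: stop this job
            break
    return history


history = closed_loop()
print(len(history), "analysis rounds")
```

The point of the sketch is the control structure: analysis consumes each frame from memory as it is produced, and its result decides whether the simulation continues.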


Analytics for Molecular Dynamics

• Drug design and protein-ligand docking
• Protein folding and rare events
• Protein variants expressed from yeast or bacteria and protein engineering

A4MD: Protein-Ligand Docking


T. Estrada, B. Zhang, P. Cicotti, R. S. Armen, M. Taufer: A scalable and accurate method for classifying protein-ligand binding geometries using a MapReduce approach. Comp. in Bio. and Med. 42(7): 758-771 (2012)

From 3D Atomic Structures to 3D Points

[Figure: docked ligands in the protein pocket, with views labeled x-y axis, x-z axis, and y-z axis]

Metadata: each 3D point represents the position of one docked ligand in the protein pocket
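The slide does not spell out how a ligand becomes a single 3D point. One encoding consistent with the three projection labels is to fit a least-squares line to each 2D projection of the ligand's atoms and use the three slopes as coordinates. The sketch below assumes that encoding; the function names are hypothetical.

```python
def slope(us, vs):
    """Least-squares slope of v against u."""
    n = len(us)
    mu, mv = sum(us) / n, sum(vs) / n
    var = sum((u - mu) ** 2 for u in us)
    cov = sum((u - mu) * (v - mv) for u, v in zip(us, vs))
    return cov / var  # assumes the atoms do not all share one u value


def encode_ligand(atoms):
    """Map a list of (x, y, z) atom coordinates to one 3D point."""
    xs, ys, zs = zip(*atoms)
    return (slope(xs, ys),   # x-y projection
            slope(xs, zs),   # x-z projection
            slope(ys, zs))   # y-z projection


# Synthetic ligand whose atoms lie on a straight line (not data from the talk)
ligand = [(0.0, 0.0, 0.0), (1.0, 2.0, 3.0), (2.0, 4.0, 6.0)]
point = encode_ligand(ligand)
print(point)
```

Whatever the exact encoding, the payoff is the same: thousands of docked conformations become a cloud of 3D points that can be clustered cheaply.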

Search for Dense Spaces: Octree Clustering

[Figure: octree nodes; metadata: ligand conformations]

[Figure labels: near-native ligand structures (RMSD <= 2 Å); deepest, densest octant found by octree clustering]

Search: linear in complexity using Mimir, a MapReduce over MPI framework
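The search for the deepest, densest octant can be sketched sequentially as follows. This is an illustration under stated assumptions, not the Mimir-based MapReduce implementation: the `min_points` and `max_depth` parameters, the cube centered at the origin, and the function names are hypothetical, and the input is assumed to be non-empty.

```python
def octant_id(point, center):
    """3-bit octant index: one bit per axis, set when the coordinate >= center."""
    return sum((1 << axis) for axis in range(3) if point[axis] >= center[axis])


def densest_octant(points, center=(0.0, 0.0, 0.0), half=1.0,
                   min_points=3, max_depth=8):
    """Descend into the most populated octant while it holds >= min_points."""
    path = []
    for _ in range(max_depth):
        buckets = {}
        for p in points:                         # one linear pass per level
            buckets.setdefault(octant_id(p, center), []).append(p)
        best = max(buckets, key=lambda k: len(buckets[k]))
        if len(buckets[best]) < min_points:
            break
        points = buckets[best]
        half /= 2.0                              # child cube has half the side
        center = tuple(c + (half if (best >> axis) & 1 else -half)
                       for axis, c in enumerate(center))
        path.append(best)
    return path, center, half, points


# Synthetic 3D points: a tight cluster plus three outliers
pts = [(0.60, 0.60, 0.60), (0.62, 0.61, 0.59), (0.61, 0.62, 0.60),
       (0.59, 0.60, 0.61), (0.63, 0.61, 0.62),
       (-0.5, -0.5, 0.5), (-0.8, 0.3, -0.2), (0.2, -0.7, -0.9)]
path, center, half, dense = densest_octant(pts)
print(path, len(dense))
```

Each level scans only the points that survived the previous level, and the depth is bounded, so the total work grows linearly with the number of points.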

Case Study: Sampled Conformations - Ligand 1k1l

[Figure: sampled conformations of ligand 1k1l as 3D points, with all three axes ranging from −1 to 1. Labels: near-native ligand structures (RMSD <= 2 Å); deepest, densest octant found by octree clustering]

A4MD: Rare Events in MD Simulations

[Figure: two kinds of events, labeled Transformations and Movements]

• We want to capture what is going on in each frame without:
  • Disrupting the simulation (e.g., stealing CPU and memory on the node)
  • Moving all the frames to a central file system and analyzing them once the simulation is over
  • Comparing each frame with past frames of the same job
  • Comparing each frame with frames of other jobs

Frames (or snapshots) of an MD trajectory:

[Figure: Frame 55, Frame 60, Frame 65, Frame 70, Frame 75, Frame 80]

From 3D Atomic Structure to a Single Eigenvalue

Drop all atoms except the backbone Cα atoms

Measure the distance d between Cαj and Cβi

Build a bipartite distance matrix, indexed by i and j, by comparing two substructures

Compute the largest eigenvalue λmax

[Figure: backbone atoms labeled Cβi and Cαj, with the distance d between them]

T. Johnston, B. Zhang, A. Liwo, S. Crivelli, and M. Taufer. In-Situ Data Analytics and Indexing of Protein Trajectories. Journal of Computational Chemistry (JCC), 38(16):1419-1430, 2017.
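The reduction from two substructures to one number can be sketched as below. The sketch assumes the bipartite distance matrix has the block form B = [[0, D], [Dᵀ, 0]], where D holds the pairwise distances between the atoms of the two substructures; that form, the function names, and the use of power iteration are assumptions made for illustration. Because the eigenvalues of such a B are plus and minus the singular values of D, the largest one equals the square root of the largest eigenvalue of DᵀD.

```python
import math


def distance_matrix(sub_a, sub_b):
    """D[i][j] = Euclidean distance between atom i of sub_a and atom j of sub_b."""
    return [[math.dist(a, b) for b in sub_b] for a in sub_a]


def largest_bipartite_eigenvalue(sub_a, sub_b, iterations=200):
    """Largest eigenvalue of B = [[0, D], [D^T, 0]] via power iteration on D^T D.

    Assumes at least one nonzero distance, so the normalization never
    divides by zero.
    """
    d = distance_matrix(sub_a, sub_b)
    m, n = len(sub_a), len(sub_b)
    v = [1.0 / math.sqrt(n)] * n                 # unit-length start vector
    lam = 0.0
    for _ in range(iterations):
        dv = [sum(d[i][j] * v[j] for j in range(n)) for i in range(m)]   # D v
        w = [sum(d[i][j] * dv[i] for i in range(m)) for j in range(n)]   # D^T D v
        lam = math.sqrt(sum(x * x for x in w))   # estimate of lambda_max(D^T D)
        v = [x / lam for x in w]
    return math.sqrt(lam)


# Synthetic coordinates (not data from the talk)
sub_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
sub_b = [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
lam_max = largest_bipartite_eigenvalue(sub_a, sub_b)
print(lam_max)
```

A single scalar per frame and per pair of substructures is small enough to keep in memory next to the simulation, which is what makes it usable as in situ metadata.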

Case Study: Capturing Movement of α-helices

Capture movement of structures (α-helices) with respect to each other

[Figure: snapshots labeled 1330, 1360, and 1390]

• Monitor largest eigenvalue of entire protein: something is changing
• Monitor largest eigenvalue of single α-helices: individual α-helices (Helix 1, Helix 2, and Helix 3) appear stable
• Monitor largest eigenvalue of bipartite distance matrix: first and second α-helices appear stable; third helix moves

Analysis: linear in complexity using local metadata (eigenvalues) with DataSpaces
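Monitoring amounts to watching a per-frame series of eigenvalues for drift. The sketch below uses a simple relative-deviation check against the first few frames; the rule, its parameters, and the numbers are illustrative assumptions, not the detection criterion of the cited paper or data from the talk.

```python
def flag_changes(eigenvalues, window=5, tolerance=0.05):
    """Return frame indices whose value deviates by more than `tolerance`
    (relative) from the mean of the first `window` frames."""
    baseline = sum(eigenvalues[:window]) / window
    return [i for i, lam in enumerate(eigenvalues)
            if abs(lam - baseline) > tolerance * abs(baseline)]


# Synthetic per-frame largest eigenvalues
single_helix = [10.0, 10.1, 9.9, 10.0, 10.1, 10.0, 9.9, 10.1]
helix_pair = [25.0, 25.1, 24.9, 25.0, 25.1, 27.0, 29.5, 31.0]

stable = flag_changes(single_helix)   # within-helix series: no frame flagged
moving = flag_changes(helix_pair)     # between-helix series: late frames flagged
print(stable, moving)
```

Each check touches one number per frame, so the cost of monitoring stays linear in the number of frames.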

A4MD: Proteins with Similar Functions

Key principle: proteins with similar sequences have similar functions

• Measure millions of protein variants expressed from yeast or bacteria
• Structure proteins to produce desired properties (protein engineering)

Protein Representations

• 3D Cartesian representation
• Surface representation
• Multi-fold representation

From Multi-fold Representation to Image Encoding

T. Estrada, J. Benson, H. Carrillo-Cabada, A. Razavi, M. Cuendet, H. Weinstein, E. Deelman, and M. Taufer. Graphic Encoding of Proteins for Efficient High-Throughput Analysis. ICPP 2018.

Case Study: High-Throughput Protein Analysis

• 62,991 proteins from the Protein Data Bank
• Eight biological processes from the biological process taxonomy in RCSB-PDB

[Figure labels: "Proteins as 3D tens"; convolutional neural network; Google’s Inception-v3, Gem-Net]

Challenges and Opportunity

A workflow that integrates both simulations and analytics must have these key properties:
• Efficiency: Optimize workflows’ performance and power usage associated with data movement and analytics
• Generality: Build workflows that support different types of analytics across different MD applications
• Non-invasive: Capture data from MD simulations without rewriting legacy codes or simulation scripts
• Portability: Execute combined simulations and analytics across different platforms and with heterogeneous resources
• Scalability: (Re)design ML algorithms for knowledge discovery at scale