
Enhanced MPSM3 for applications to quantum biological simulations

Cristiano Malossi, IBM Research - Zurich

A. Pozdneev, V. Weber, T. Laino, C. Bekas, A. Curioni, IBM Research - Zurich


Motivations

The application of quantum Hamiltonians to biological systems is limited by the cost of performing long calculations on large systems (more than 30,000 atoms).

Classical force fields and QM/MM are well suited to conformational changes and localized reactions, respectively. Hence the need to develop scalable algorithms that allow the application of quantum Hamiltonians to whole biological systems, for example:

NADH:ubiquinone oxidoreductase

succinate dehydrogenase

large-scale ion motion

large-scale electron transfer


Outlook and Goal

Goal: Design an efficient parallel sparse matrix-matrix multiply.

Introduction: Born-Oppenheimer molecular dynamics.

Parallelization: midpoint-based parallel sparse matrix-matrix multiplication for matrices with decay.

Benchmark: weak and strong scaling, and communication volume on Blue Gene/Q¹.

Summary

¹ IBM and Blue Gene are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies.


Introduction: Born-Oppenheimer molecular dynamics

Each MD step requires the energy U to be calculated at the relaxed ground-state electronic density.

Each SCF iteration requires the construction of the density matrix.

The core operation of the SCF iterations is the sparse matrix-matrix multiplication.
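The deck does not spell out how the density matrix is built, so the following is a minimal sketch of one common purification scheme (SP2, second-order spectral projection) whose inner kernel is exactly the sparse matrix-matrix multiply discussed in the rest of the deck. The choice of SP2, the spectral bounds eps_min/eps_max, and all names here are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch: density-matrix build by SP2 purification. The deck
# only states that the DM build is dominated by sparse multiplies.
import scipy.sparse as sp

def density_matrix_sp2(H, n_occ, eps_min, eps_max, n_iter=17):
    """Project the Hamiltonian H onto an approximate density matrix with
    trace n_occ; each iteration costs one sparse multiply X @ X."""
    n = H.shape[0]
    I = sp.identity(n, format="csr")
    # Map the spectrum of H into [0, 1], reversed so that the occupied
    # (low-energy) states end up near eigenvalue 1.
    X = (eps_max * I - H) / (eps_max - eps_min)
    for _ in range(n_iter):
        X2 = X @ X                       # the dominant kernel: sparse GEMM
        if X.diagonal().sum() > n_occ:
            X = X2                       # trace too large: push it down
        else:
            X = 2 * X - X2               # trace too small: push it up
    return X
```

The default of 17 iterations mirrors the 17 multiplies per DM build quoted on the strong-scaling slide.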


Parallel sparse matrix-matrix multiplication

Atoms in the simulation cell.

The simulation cell is divided into boxes; each box and its atoms are owned by a process.
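A toy sketch of this ownership rule for a cubic periodic cell; the names (cell_len, n_boxes) and the one-box-per-rank linearization are illustrative assumptions.

```python
# Hypothetical sketch: map atoms to boxes and boxes to MPI ranks.
import numpy as np

def owner_box(pos, cell_len, n_boxes):
    """Index (bx, by, bz) of the box owning an atom at `pos` in a cubic
    periodic cell of side `cell_len`, split into n_boxes per direction."""
    frac = np.mod(pos, cell_len) / cell_len          # wrap into the cell
    return tuple(np.minimum((frac * n_boxes).astype(int), n_boxes - 1))

def box_to_rank(box, n_boxes):
    """Linearize the 3-D box index into a rank (one box per process)."""
    bx, by, bz = box
    return (bx * n_boxes + by) * n_boxes + bz
```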

Two atoms, i and k, owned by different processes.

The matrix block A_ik is owned by the process where the midpoint between atoms i and k resides.
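A sketch of this midpoint rule, reusing owner_box and box_to_rank from the previous sketch. Taking the midpoint along the minimum-image vector is an assumption needed for a periodic cell.

```python
# Hypothetical sketch: midpoint ownership of the block A_ik.
import numpy as np

def midpoint(pos_i, pos_k, cell_len):
    """Midpoint of atoms i and k along their minimum-image vector,
    wrapped back into the periodic cell."""
    d = pos_k - pos_i
    d = d - cell_len * np.round(d / cell_len)        # minimum-image vector
    return np.mod(pos_i + 0.5 * d, cell_len)

def block_owner(pos_i, pos_k, cell_len, n_boxes):
    """Rank owning the block A_ik: the process whose box contains the
    midpoint between atoms i and k."""
    mid = midpoint(pos_i, pos_k, cell_len)
    return box_to_rank(owner_box(mid, cell_len, n_boxes), n_boxes)
```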

Another matrix block, B_kj.

The result block C_ij of the product A_ik B_kj is owned by the process where the midpoint between atoms i and j resides.

Blocks A_ik and B_kj are sent to the process that owns C_ij, where the multiplication C_ij = C_ij + A_ik B_kj takes place. Blocks are sent along the x, y and z directions.
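On each process, the multiply phase then reduces to a block-wise accumulation over the received blocks; the dictionary-of-blocks layout below is an illustrative stand-in for the real data structure.

```python
# Hypothetical sketch: accumulate C_ij += A_ik @ B_kj from received blocks.
def local_multiply(recv_A, recv_B, C):
    """recv_A: {(i, k): block} and recv_B: {(k, j): block} received by
    this process; C: {(i, j): block} accumulator for the result."""
    for (i, k), A_ik in recv_A.items():
        for (k2, j), B_kj in recv_B.items():
            if k2 != k:
                continue                 # products must share the index k
            if (i, j) in C:
                C[(i, j)] += A_ik @ B_kj
            else:
                C[(i, j)] = A_ik @ B_kj
    return C
```

In the real algorithm only the (i, j) blocks allowed by the decay radius are kept; here everything is accumulated for simplicity.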


Improved MPSM3

The process that owns the midcell performs the multiplication.

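One plausible reading of the midcell rule, sketched below: ownership is decided per box rather than per exact atomic midpoint, using the box halfway between the two owning boxes. This definition is my assumption from the figures, not something the deck states.

```python
# Hypothetical sketch of a "midcell" rule: the box halfway between the
# boxes owning atoms i and j (component-wise, on the periodic box grid).
def midcell(box_a, box_b, n_boxes):
    return tuple(((a + b) // 2) % n_boxes for a, b in zip(box_a, box_b))
```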

All blocks A and B are sent to the process that owns the midcell, where the multiplication C = C + A B takes place. Blocks are sent along the x, y and z directions.

The process that performs the multiplication then redistributes the result blocks C_ij to the neighboring processes that own them. Blocks are sent along the x, y and z directions.

The three phases: exchange of local matrices, local products, and redistribution of the computed matrix.
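A single-process toy simulation of these three phases, degenerate in that one "midcell" rank receives everything; it reuses local_multiply from above, and owner() stands in for the midpoint rule. The structure, not the communication pattern, is the point.

```python
# Hypothetical sketch: the three improved-MPSM3 phases, simulated on one
# process. Ranks are list slots; "communication" is just regrouping.
def simulate_improved(A, B, owner, n_ranks):
    # Phase 1: exchange of local matrices -- in this degenerate toy,
    # a single midcell rank receives every A and B block.
    recv_A, recv_B = dict(A), dict(B)
    # Phase 2: local products on the midcell rank.
    products = local_multiply(recv_A, recv_B, {})
    # Phase 3: redistribution of the computed matrix -- each C_ij block
    # goes to the rank that owns the i-j midpoint.
    C = [dict() for _ in range(n_ranks)]
    for (i, j), blk in products.items():
        C[owner(i, j)][(i, j)] = blk
    return C
```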


Benchmark: weak scaling

Time per density matrix (DM) build vs. number of MPI tasks, PM6 Hamiltonian.

About 19 water molecules per task.

Parallel efficiency: 92% at 110,592 MPI tasks (2.1M water molecules).

Number of non-zero elements: 1.6k/water (O1) and 1.0k/water (O2).


Constant wall time with proportionally scaled resources.
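For reference, the quoted 92% uses the standard weak-scaling definition (my assumption), with the work per task held fixed at about 19 waters:

```latex
% Weak-scaling parallel efficiency: T(p) is the time per DM build on p
% MPI tasks at fixed work per task; p_0 is the smallest baseline run.
E_{\mathrm{weak}}(p) = \frac{T(p_0)}{T(p)} \approx 0.92
\quad\text{at } p = 110\,592 .
```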


Benchmark: weak scaling (improved MPSM3)

Time per DM build vs. number of MPI tasks, PM6 Hamiltonian.

About 19 water molecules per task.

Number of non-zero elements: 1.6k/water (≈10× annotation on the plot).


Improved MPSM3 already competes with libdbcsr even at small system-size-to-MPI-task ratios.

https://dbcsr.cp2k.org/


Benchmark: strong scaling

Time per DM build vs. number of MPI tasks, PM6 Hamiltonian.

Systems: 110k (S1), 373k (S2) and 1124k (S3) water molecules.


Largest system (S3):

Matrix dimensions: 6,749,184 × 6,749,184

Fraction of non-zero elements: 3.9 × 10⁻³ %

Number of multiplies: 17

Sparsity boost vs. dense: 42,760×
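The quoted sparsity is consistent with the per-water counts from the weak-scaling slides (assuming the O1 figure of 1.6k non-zeros per water):

```latex
% Consistency check using the deck's own numbers:
\mathrm{nnz} \approx 1.6\times10^{3} \times 1.124\times10^{6}
            \approx 1.8\times10^{9},
\qquad
\frac{\mathrm{nnz}}{N^{2}} = \frac{1.8\times10^{9}}{(6\,749\,184)^{2}}
 \approx 3.9\times10^{-5} = 3.9\times10^{-3}\,\% .
```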


Benchmark: strong scaling (improved MPSM3)

Time per DM build vs. number of MPI tasks, PM6 Hamiltonian.

Systems: 32k (S0) and 110k (S1) water molecules.


Benchmark: communication volume

Total communication volume (Isend/Irecv) per DM build vs. number of MPI tasks, on Blue Gene/Q.

Systems: 110k (S1), 373k (S2) and 1124k (S3) water molecules.


Summary

MPSM3 and its improved version (with one push per direction) show:

close to perfect weak scaling

very good strong scaling

a communication volume that decreases as the number of tasks increases

fewer logistic operations (improved version)

Given resources proportional to the system size, an MD step can be performed in a few dozen seconds regardless of system size.


Parallel sparse matrix-matrix multiplication

Interaction of an atom with its neighbors within the cutoff radius.
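A toy version of this cutoff picture: the block A_ik is non-zero only when atoms i and k lie within the decay radius. The minimum-image test matches the earlier sketches; the radius value itself is a system-dependent assumption.

```python
# Hypothetical sketch: do atoms i and k interact within the cutoff radius?
import numpy as np

def interacts(pos_i, pos_k, cell_len, radius):
    d = pos_k - pos_i
    d = d - cell_len * np.round(d / cell_len)        # minimum-image vector
    return float(np.dot(d, d)) < radius * radius
```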


References

SEMD I: Midpoint-based parallel sparse matrix-matrix multiplication algorithm for matrices with decay. Valéry Weber, Teodoro Laino, Alexander Pozdneev, Irina Fedulova, and Alessandro Curioni. Journal of Chemical Theory and Computation 2015, 11 (7), 3145-3152. https://doi.org/10.1021/acs.jctc.5b00382

Enhanced MPSM3 for Applications to Quantum Biological Simulations. Alexander Pozdneev, Valéry Weber, Teodoro Laino, Costas Bekas, and Alessandro Curioni. In Proceedings of SC16: The International Conference for High Performance Computing, Networking, Storage and Analysis, Salt Lake City, Utah, November 13-18, 2016, Article no. 9. https://dl.acm.org/citation.cfm?id=3014916

Enhanced MPSM3 for Applications to Quantum Biological SimulationsAlexander Pozdneev, Valéry Weber, Teodoro Laino, Costas Bekas, and Alessandro CurioniIn Proceedings of SC16: The International Conference for High Performance Computing, Networking, Storage and Analysis, Salt Lake City, Utah, November 13-18, 2016, Article no. 9https://dl.acm.org/citation.cfm?id=3014916