
Enhanced MPSM3 for applications to quantum biological simulations

Cristiano Malossi, IBM Research - Zurich

A. Pozdneev, V. Weber, T. Laino, C. Bekas, A. Curioni, IBM Research - Zurich


Motivations

The application of quantum Hamiltonians to biological systems is limited by the cost of performing long calculations on large systems (more than 30,000 atoms).

Classical force fields and QM/MM are well suited to conformational changes and localized reactions, respectively. Hence the need to develop scalable algorithms that allow the application of quantum Hamiltonians to whole biological systems, for example:

NADH:ubiquinone oxidoreductase

succinate dehydrogenase

large-scale ion motion

large-scale electron transfer


Outlook and Goal

Goal: Design an efficient parallel sparse matrix-matrix multiply.

Introduction: Born-Oppenheimer molecular dynamics.

Parallelization: midpoint-based parallel sparse matrix-matrix multiplication for matrices with decay.

Benchmark: weak and strong scaling, and communication volume on Blue Gene/Q¹.

Summary

¹ IBM and Blue Gene are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies.


Introduction: Born-Oppenheimer molecular dynamics

Each MD step requires the energy U to be calculated at the relaxed ground-state electronic density.

Each SCF iteration requires the construction of the density matrix.

The core operation of the SCF iterations is the sparse matrix-matrix multiplication.
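The deck does not spell out how the density matrix is built, so the following is a minimal sketch of one common purification scheme (SP2, second-order spectral projection) whose inner kernel is exactly the sparse matrix-matrix multiply discussed in the rest of the deck. The choice of SP2, the spectral bounds eps_min/eps_max, and all names here are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch: density-matrix build by SP2 purification. The deck
# only states that the DM build is dominated by sparse multiplies.
import scipy.sparse as sp

def density_matrix_sp2(H, n_occ, eps_min, eps_max, n_iter=17):
    """Project the Hamiltonian H onto an approximate density matrix with
    trace n_occ; each iteration costs one sparse multiply X @ X."""
    n = H.shape[0]
    I = sp.identity(n, format="csr")
    # Map the spectrum of H into [0, 1], reversed so that the occupied
    # (low-energy) states end up near eigenvalue 1.
    X = (eps_max * I - H) / (eps_max - eps_min)
    for _ in range(n_iter):
        X2 = X @ X                       # the dominant kernel: sparse GEMM
        if X.diagonal().sum() > n_occ:
            X = X2                       # trace too large: push it down
        else:
            X = 2 * X - X2               # trace too small: push it up
    return X
```

The default of 17 iterations mirrors the 17 multiplies per DM build quoted on the strong-scaling slide.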


Parallel sparse matrix-matrix multiplication

Atoms in the simulation cell.

The simulation cell is divided into boxes; each box and its atoms are owned by a process.
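A toy sketch of this ownership rule for a cubic periodic cell; the names (cell_len, n_boxes) and the one-box-per-rank linearization are illustrative assumptions.

```python
# Hypothetical sketch: map atoms to boxes and boxes to MPI ranks.
import numpy as np

def owner_box(pos, cell_len, n_boxes):
    """Index (bx, by, bz) of the box owning an atom at `pos` in a cubic
    periodic cell of side `cell_len`, split into n_boxes per direction."""
    frac = np.mod(pos, cell_len) / cell_len          # wrap into the cell
    return tuple(np.minimum((frac * n_boxes).astype(int), n_boxes - 1))

def box_to_rank(box, n_boxes):
    """Linearize the 3-D box index into a rank (one box per process)."""
    bx, by, bz = box
    return (bx * n_boxes + by) * n_boxes + bz
```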

Two atoms, i and k, owned by different processes.

The matrix block A_ik is owned by the process where the midpoint between atoms i and k resides.
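A sketch of this midpoint rule, reusing owner_box and box_to_rank from the previous sketch. Taking the midpoint along the minimum-image vector is an assumption needed for a periodic cell.

```python
# Hypothetical sketch: midpoint ownership of the block A_ik.
import numpy as np

def midpoint(pos_i, pos_k, cell_len):
    """Midpoint of atoms i and k along their minimum-image vector,
    wrapped back into the periodic cell."""
    d = pos_k - pos_i
    d = d - cell_len * np.round(d / cell_len)        # minimum-image vector
    return np.mod(pos_i + 0.5 * d, cell_len)

def block_owner(pos_i, pos_k, cell_len, n_boxes):
    """Rank owning the block A_ik: the process whose box contains the
    midpoint between atoms i and k."""
    mid = midpoint(pos_i, pos_k, cell_len)
    return box_to_rank(owner_box(mid, cell_len, n_boxes), n_boxes)
```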

Another matrix block, B_kj.

The result block C_ij of the product A_ik B_kj is owned by the process where the midpoint between atoms i and j resides.

Blocks A_ik and B_kj are sent to the process that owns C_ij, where the multiplication C_ij = C_ij + A_ik B_kj takes place. Blocks are sent along the x, y and z directions.
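On each process, the multiply phase then reduces to a block-wise accumulation over the received blocks; the dictionary-of-blocks layout below is an illustrative stand-in for the real data structure.

```python
# Hypothetical sketch: accumulate C_ij += A_ik @ B_kj from received blocks.
def local_multiply(recv_A, recv_B, C):
    """recv_A: {(i, k): block} and recv_B: {(k, j): block} received by
    this process; C: {(i, j): block} accumulator for the result."""
    for (i, k), A_ik in recv_A.items():
        for (k2, j), B_kj in recv_B.items():
            if k2 != k:
                continue                 # products must share the index k
            if (i, j) in C:
                C[(i, j)] += A_ik @ B_kj
            else:
                C[(i, j)] = A_ik @ B_kj
    return C
```

In the real algorithm only the (i, j) blocks allowed by the decay radius are kept; here everything is accumulated for simplicity.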


Improved MPSM3

The process that owns the midcell performs the multiplication.

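One plausible reading of the midcell rule, sketched below: ownership is decided per box rather than per exact atomic midpoint, using the box halfway between the two owning boxes. This definition is my assumption from the figures, not something the deck states.

```python
# Hypothetical sketch of a "midcell" rule: the box halfway between the
# boxes owning atoms i and j (component-wise, on the periodic box grid).
def midcell(box_a, box_b, n_boxes):
    return tuple(((a + b) // 2) % n_boxes for a, b in zip(box_a, box_b))
```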

All blocks A and B are sent to the process that owns the midcell, where the multiplication C = C + A B takes place. Blocks are sent along the x, y and z directions.

The process that performs the multiplication then redistributes the result blocks C_ij to the neighboring processes that own them. Blocks are sent along the x, y and z directions.

The three phases: exchange of local matrices, local products, and redistribution of the computed matrix.
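A single-process toy simulation of these three phases, degenerate in that one "midcell" rank receives everything; it reuses local_multiply from above, and owner() stands in for the midpoint rule. The structure, not the communication pattern, is the point.

```python
# Hypothetical sketch: the three improved-MPSM3 phases, simulated on one
# process. Ranks are list slots; "communication" is just regrouping.
def simulate_improved(A, B, owner, n_ranks):
    # Phase 1: exchange of local matrices -- in this degenerate toy,
    # a single midcell rank receives every A and B block.
    recv_A, recv_B = dict(A), dict(B)
    # Phase 2: local products on the midcell rank.
    products = local_multiply(recv_A, recv_B, {})
    # Phase 3: redistribution of the computed matrix -- each C_ij block
    # goes to the rank that owns the i-j midpoint.
    C = [dict() for _ in range(n_ranks)]
    for (i, j), blk in products.items():
        C[owner(i, j)][(i, j)] = blk
    return C
```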


Benchmark: weak scaling

Time per density matrix (DM) build vs. number of MPI tasks, PM6 Hamiltonian.

About 19 water molecules per task.

Parallel efficiency: 92% at 110,592 MPI tasks (2.1M water molecules).

Number of non-zero elements: 1.6k/water (O1) and 1.0k/water (O2).


Constant wall time with proportionally scaled resources.
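For reference, the quoted 92% uses the standard weak-scaling definition (my assumption), with the work per task held fixed at about 19 waters:

```latex
% Weak-scaling parallel efficiency: T(p) is the time per DM build on p
% MPI tasks at fixed work per task; p_0 is the smallest baseline run.
E_{\mathrm{weak}}(p) = \frac{T(p_0)}{T(p)} \approx 0.92
\quad\text{at } p = 110\,592 .
```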


Benchmark: weak scaling (improved MPSM3)

Time per DM build vs. number of MPI tasks, PM6 Hamiltonian.

About 19 water molecules per task.

Number of non-zero elements: 1.6k/water (≈10× annotation on the plot).


Improved MPSM3 already competes with libdbcsr even at small system-size-to-MPI-task ratios.

https://dbcsr.cp2k.org/


Benchmark: strong scaling

Time per DM build vs. number of MPI tasks, PM6 Hamiltonian.

Systems: 110k (S1), 373k (S2) and 1124k (S3) water molecules.


Largest system (S3):

Matrix dimensions: 6,749,184 × 6,749,184

Fraction of non-zero elements: 3.9 × 10⁻³ %

Number of multiplies: 17

Sparsity boost vs. dense: 42,760×
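The quoted sparsity is consistent with the per-water counts from the weak-scaling slides (assuming the O1 figure of 1.6k non-zeros per water):

```latex
% Consistency check using the deck's own numbers:
\mathrm{nnz} \approx 1.6\times10^{3} \times 1.124\times10^{6}
            \approx 1.8\times10^{9},
\qquad
\frac{\mathrm{nnz}}{N^{2}} = \frac{1.8\times10^{9}}{(6\,749\,184)^{2}}
 \approx 3.9\times10^{-5} = 3.9\times10^{-3}\,\% .
```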


Benchmark: strong scaling (improved MPSM3)

Time per DM build vs. number of MPI tasks, PM6 Hamiltonian.

Systems: 32k (S0) and 110k (S1) water molecules.


Benchmark: communication volume

Total communication volume (Isend/Irecv) per DM build vs. number of MPI tasks, on Blue Gene/Q.

Systems: 110k (S1), 373k (S2) and 1124k (S3) water molecules.


Summary

MPSM3 and its improved version (with one push per direction) show:

close to perfect weak scaling

very good strong scaling

a communication volume that decreases as the number of tasks increases

fewer logistic operations (improved version)

Given resources proportional to the system size, an MD step can be performed in a few dozen seconds regardless of system size.


Parallel sparse matrix-matrix multiplication

Interaction of an atom with its neighbors within the cutoff radius.
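A toy version of this cutoff picture: the block A_ik is non-zero only when atoms i and k lie within the decay radius. The minimum-image test matches the earlier sketches; the radius value itself is a system-dependent assumption.

```python
# Hypothetical sketch: do atoms i and k interact within the cutoff radius?
import numpy as np

def interacts(pos_i, pos_k, cell_len, radius):
    d = pos_k - pos_i
    d = d - cell_len * np.round(d / cell_len)        # minimum-image vector
    return float(np.dot(d, d)) < radius * radius
```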


References

SEMD I: Midpoint-based parallel sparse matrix-matrix multiplication algorithm for matrices with decay. Valéry Weber, Teodoro Laino, Alexander Pozdneev, Irina Fedulova, and Alessandro Curioni. Journal of Chemical Theory and Computation 2015, 11 (7), 3145-3152. https://doi.org/10.1021/acs.jctc.5b00382

Enhanced MPSM3 for Applications to Quantum Biological Simulations. Alexander Pozdneev, Valéry Weber, Teodoro Laino, Costas Bekas, and Alessandro Curioni. In Proceedings of SC16: The International Conference for High Performance Computing, Networking, Storage and Analysis, Salt Lake City, Utah, November 13-18, 2016, Article no. 9. https://dl.acm.org/citation.cfm?id=3014916

Enhanced MPSM3 for Applications to Quantum Biological SimulationsAlexander Pozdneev, Valéry Weber, Teodoro Laino, Costas Bekas, and Alessandro CurioniIn Proceedings of SC16: The International Conference for High Performance Computing, Networking, Storage and Analysis, Salt Lake City, Utah, November 13-18, 2016, Article no. 9https://dl.acm.org/citation.cfm?id=3014916