Enhanced MPSM3 for Applications to Quantum Biological Simulations
TRANSCRIPT
© 2016 IBM Corporation
Cristiano Malossi, IBM Research - Zurich
A. Pozdneev, V. Weber, T. Laino, C. Bekas, A. Curioni, IBM Research - Zurich
Motivations
Applying quantum Hamiltonians to biological systems is limited by the cost of performing long calculations on large systems (30K+ atoms).
Classical force fields and QM/MM handle conformational changes and localized reactions well, respectively. Hence the need for scalable algorithms that make quantum Hamiltonians applicable to biological systems, in order to study:
large-scale ion motion
large-scale electron transfer
Example systems: NADH:ubiquinone oxidoreductase, succinate dehydrogenase.
Outline and Goal
Goal: Design an efficient parallel sparse matrix-matrix multiply.
Introduction: Born-Oppenheimer molecular dynamics.
Parallelization: midpoint-based parallel sparse matrix-matrix multiplication for matrices with decay.
Benchmark: weak and strong scaling, and communication volume on Blue Gene/Q¹.
Summary
1IBM and Blue Gene are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies.
Introduction: Born-Oppenheimer molecular dynamics
Each MD step requires the potential energy U to be calculated at the relaxed ground-state electronic density.
Each SCF iteration requires the construction of the density matrix.
The core operation of the SCF iterations is the sparse matrix-matrix multiplication.
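The deck does not say how the density matrix is built from the Hamiltonian; one common choice that reduces the build to a short sequence of sparse matrix-matrix multiplies (compare the 17 multiplies per build quoted in the benchmarks) is McWeeny purification. A minimal sketch under that assumption, with a given chemical potential mu; all names are illustrative:

```python
# A minimal sketch, assuming McWeeny purification -- the deck does not name
# its scheme, so this is illustrative. Each iteration costs two SpGEMMs,
# which is why the sparse matrix-matrix multiply dominates the DM build.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def mcweeny_density(H, mu, max_iter=100, tol=1e-10):
    """Purify D <- 3 D^2 - 2 D^3 towards the ground-state density matrix."""
    n = H.shape[0]
    I = sp.identity(n, format="csr")
    # Cheap Gershgorin bounds on the spectrum of H (no diagonalization).
    absH = abs(H)
    radii = np.asarray(absH.sum(axis=1)).ravel() - absH.diagonal()
    e_min = (H.diagonal() - radii).min()
    e_max = (H.diagonal() + radii).max()
    # Linear initial guess whose eigenvalues all lie in [0, 1].
    lam = 0.5 / max(e_max - mu, mu - e_min)
    D = (lam * (mu * I - H) + 0.5 * I).tocsr()
    for _ in range(max_iter):
        D2 = D @ D                         # SpGEMM -- the core operation
        D_new = 3.0 * D2 - 2.0 * (D2 @ D)  # second SpGEMM of the iteration
        converged = spla.norm(D_new - D) < tol
        D = D_new                          # production codes re-threshold here
        if converged:
            break
    return D
```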
Parallel sparse matrix-matrix multiplication
[Figure: atoms in the simulation cell.]
Parallel sparse matrix-matrix multiplication
[Figure: simulation cell divided into boxes.]
Each box and its atoms are owned by a process.
Parallel sparse matrix-matrix multiplication
[Figure: atoms i and k in different boxes.]
These two atoms are owned by different processes.
Parallel sparse matrix-matrix multiplication
[Figure: block A_ik at the midpoint (+) between atoms i and k.]
The matrix block A_ik is owned by the process where the midpoint of the distance between atoms i and k resides.
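As a concrete illustration of this ownership rule, here is a minimal sketch, assuming an orthorhombic periodic cell and a regular grid of one box per process (details the slides leave to the figure); owner_of_block and its arguments are illustrative names:

```python
# Sketch of the midpoint rule: block A_ik lives on the process whose box
# contains the midpoint of atoms i and k. Minimum-image midpoints in a
# periodic orthorhombic cell are an assumption of this sketch.
import numpy as np

def owner_of_block(r_i, r_k, cell, nboxes):
    """Return the 3D index of the box (process) owning block A_ik."""
    r_i = np.asarray(r_i, dtype=float)
    cell = np.asarray(cell, dtype=float)
    d = np.asarray(r_k, dtype=float) - r_i
    d -= cell * np.round(d / cell)        # minimum-image separation
    mid = (r_i + 0.5 * d) % cell          # wrap the midpoint into the cell
    return tuple((mid / cell * np.asarray(nboxes)).astype(int))

# e.g. atoms near opposite faces of a 10x10x10 cell on a 4x4x4 process grid:
# owner_of_block([1., 1., 1.], [9., 1., 1.], [10.] * 3, [4, 4, 4]) -> (0, 0, 0)
```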
Parallel sparse matrix-matrix multiplication
[Figure: block B_kj at the midpoint (+) between atoms k and j.]
Another matrix block, B_kj.
Parallel sparse matrix-matrix multiplication
[Figure: result block C_ij at the midpoint (+) between atoms i and j.]
The result C_ij of the product A_ik B_kj is owned by the process where the midpoint between i and j resides.
Parallel sparse matrix-matrix multiplication
Blocks A_ik and B_kj are sent to the process that owns C_ij, where the multiplication takes place: C_ij = C_ij + A_ik B_kj. Blocks are sent along x, y and z.
[Figure: blocks A_ik and B_kj converging on the owner of C_ij.]
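"Sent along x, y and z" (and the summary's "1 push per direction") suggests nearest-neighbor pushes on the process grid rather than direct point-to-point messages. A toy sketch under that reading; push_xyz is an illustrative name, and the single-hop-per-axis limit assumes the cutoff keeps midpoints within the boxes neighboring the sender:

```python
# A toy reading of "blocks are sent along x, y and z": at most one neighbor
# hop per Cartesian direction, which assumes the destination box lies in the
# 3x3x3 neighborhood of the source box (the cutoff assumption).
def push_xyz(src, dst, nboxes):
    """Yield the boxes visited when pushing a block from src to dst."""
    pos = list(src)
    for axis in range(3):
        n = nboxes[axis]
        # signed displacement on the periodic grid, expected in {-1, 0, +1}
        delta = (dst[axis] - pos[axis] + n // 2) % n - n // 2
        assert abs(delta) <= 1, "cutoff assumption violated"
        if delta:
            pos[axis] = (pos[axis] + delta) % n
            yield tuple(pos)

# e.g. list(push_xyz((0, 0, 0), (1, 3, 0), (4, 4, 4)))
# -> [(1, 0, 0), (1, 3, 0)]   # one +x push, then one -y push (wrapped)
```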
Improved MPSM3
The process that owns the midcell performs the multiplication.
[Figure: the midcell (+) where the products (×) are performed.]
Improved MPSM3
All blocks A and B are sent to the process that owns the midcell, and the multiplication takes place there: C = C + A B. Blocks are sent along x, y and z.
[Figure: A and B blocks gathered at the midcell (+).]
Improved MPSM3
The process that does the multiplication then redistributes the results to the neighboring processes. Blocks are sent along x, y and z.
[Figure: result blocks C_ij and C_i'j' pushed from the midcell to their midpoint owners.]
Improved MPSM3
[Figure: the three phases: exchange of local matrices, local products, and redistribution of the computed matrix.]
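Putting the three phases together, here is a serial toy emulation, not the MPI implementation: blocks are plain floats, and midcell/midpoint are caller-supplied ownership maps standing in for the box geometry (all names illustrative):

```python
# Serial toy emulation of the three phases; "processes" are dict keys.
# The real code moves everything with pushes along x, y and z.
from collections import defaultdict

def improved_mpsm3(A, B, midcell, midpoint):
    """A, B: {(i, k): block}; returns {process: {(i, j): block}}."""
    # Phase 1 -- exchange of local matrices: every product A_ik * B_kj is
    # staged on the process owning the midcell of atoms i and j.
    work = defaultdict(list)
    for (i, k), a in A.items():
        for (k2, j), b in B.items():
            if k == k2:
                work[midcell(i, j)].append((i, j, a, b))
    # Phase 2 -- local products: C = C + A B on each midcell owner.
    partial = defaultdict(float)
    for proc, pairs in work.items():
        for i, j, a, b in pairs:
            partial[(proc, i, j)] += a * b
    # Phase 3 -- redistribution of the computed matrix: each C_ij block is
    # pushed to the finer-grained owner of the midpoint between i and j.
    C = defaultdict(dict)
    for (proc, i, j), val in partial.items():
        C[midpoint(i, j)][(i, j)] = C[midpoint(i, j)].get((i, j), 0.0) + val
    return dict(C)

# e.g. improved_mpsm3({(0, 1): 2.0}, {(1, 2): 3.0},
#                     midcell=lambda i, j: 0, midpoint=lambda i, j: (i + j) // 2)
# -> {1: {(0, 2): 6.0}}
```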
Benchmark: weak scaling
Time per density-matrix (DM) build vs. number of MPI tasks, PM6
About 19 waters per task
Parallel efficiency: 92% at 110,592 MPI tasks (2.1M waters)
Number of non-zero elements: 1.6k/water (O1) and 1.0k/water (O2)
Constant walltime with proportional resources
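For reference, weak scaling fixes the work per task, so ideal scaling means constant time per DM build; taking the smallest run as the reference point (an assumption, the slide does not state it), the quoted figure reads:

```latex
% Weak-scaling parallel efficiency, with P_ref the smallest run (assumed):
\[
  E_{\mathrm{weak}}(P) = \frac{T(P_{\mathrm{ref}})}{T(P)} \approx 0.92
  \quad \text{at } P = 110\,592 \text{ MPI tasks (2.1M waters).}
\]
```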
Benchmark: weak scaling (improved MPSM3)
Time per DM build vs. number of MPI tasks, PM6
About 19 waters per task
Number of non-zero elements: 1.6k/water
The improved MPSM3 already competes with libdbcsr at small system-size to MPI-task ratios
https://dbcsr.cp2k.org/
Benchmark: strong scaling
Time per DM build vs. number of MPI tasks, PM6
110k (S1), 373k (S2) and 1124k (S3) waters
Largest system:
Matrix dimensions: 6,749,184 × 6,749,184
Fraction of non-zero elements: 3.9 × 10⁻³ %
Number of multiplies: 17
Sparsity boost vs. dense: 42,760×
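A quick sanity check on these figures (straight arithmetic, not from the slides):

```latex
% Implied number of non-zero elements for the largest system:
\[
  N_{\mathrm{nz}} = 3.9\times10^{-5} \times (6\,749\,184)^2 \approx 1.8\times10^{9},
\]
% i.e. about 260 non-zeros per row -- consistent with the ~1.6k non-zeros
% per water of the weak-scaling slides (1.6k x 1124k waters ~ 1.8e9).
```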
Benchmark: strong scaling (improved MPSM3)
Time per DM build vs. number of MPI tasks, PM6
32k (S0) and 110k (S1) waters
Benchmark: communication volume
Total communication volume (Isend/Irecv) per DM build vs. number of MPI tasks
110k (S1), 373k (S2) and 1124k (S3) waters
Blue Gene/Q
Summary
MPSM3 and its improved version show (with one push per direction):
close to perfect weak scaling
very good strong scaling
communication volume that decreases as the number of tasks increases
fewer logistic operations (improved version)
With proportional resources, an MD step takes a few dozen seconds regardless of system size.
Parallel sparse matrix-matrix multiplication
[Figure: the interaction radius around an atom.]
Interaction of an atom with its neighbors.
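This cutoff is what makes the matrices decay: a sketch of how the interaction radius induces the block sparsity the midpoint method exploits, assuming minimum-image distances in a periodic orthorhombic cell (block_pattern is an illustrative name):

```python
# Block (i, k) exists only when atoms i and k lie within the cutoff radius;
# the resulting set of pairs is the block-sparsity pattern of H, A, B.
import numpy as np

def block_pattern(positions, cell, cutoff):
    """Return the set of (i, k) pairs within the interaction radius."""
    positions = np.asarray(positions, dtype=float)
    cell = np.asarray(cell, dtype=float)
    pattern = set()
    for i in range(len(positions)):
        d = positions - positions[i]            # all pair vectors at once
        d -= cell * np.round(d / cell)          # minimum image
        near = np.flatnonzero((d * d).sum(axis=1) <= cutoff * cutoff)
        pattern.update((i, int(k)) for k in near)
    return pattern
```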
References
SEMD I: Midpoint-based parallel sparse matrix-matrix multiplication algorithm for matrices with decay. Valéry Weber, Teodoro Laino, Alexander Pozdneev, Irina Fedulova, and Alessandro Curioni. Journal of Chemical Theory and Computation 2015, 11 (7), 3145-3152. https://doi.org/10.1021/acs.jctc.5b00382
Enhanced MPSM3 for Applications to Quantum Biological Simulations. Alexander Pozdneev, Valéry Weber, Teodoro Laino, Costas Bekas, and Alessandro Curioni. In Proceedings of SC16: The International Conference for High Performance Computing, Networking, Storage and Analysis, Salt Lake City, Utah, November 13-18, 2016, Article No. 9. https://dl.acm.org/citation.cfm?id=3014916