large-scale electronic structure simulations with mvapich2...
Post on 18-Apr-2020
2 Views
Preview:
TRANSCRIPT
Large-scale Electronic Structure Simulations with MVAPICH2on Intel Knights Landing Manycore Processors
Hoon Ryu, Ph.D.
(E: elec1020@kisti.re.kr)
Principal Researcher / Korea Institute of Science and Technology Information (KISTI)
Principal Investigator / KISTI Intel® Parallel Computing Center (IPCC)
Hoon Ryu / Large-scale Electronic Structure Simulations with MVAPICH2 on Intel KNL Processors 2
Electronic Structure CalculationsAn application that has customers in a a HUGE range
• The state of motion of electrons in an electrostatic field created by the stationary nuclei.
• Quantum Physics; Quantum Chemistry; Nanoscale Materials and Devices
à Physics, Chemistry, Materials Science, Electrical Engineering and Mechanical Engineering ETC.
• Huge customers in the society of computational science
KISTI-4 (Tachyon-II) utilization (2013-2014)
Quantum Simulations
Bio/Chem
Fluid/Combustion/Structure
Meteorology
ParticlePhysics
AstroPhysics ETC
Hoon Ryu / Large-scale Electronic Structure Simulations with MVAPICH2 on Intel KNL Processors 3
Electronic Structure CalculationsIn a perspective of “numerical analysis”
• Two PDE-coupled Loop: Schrödinger Equation and Poisson Equation
• Both equation involve system matrices (Hamiltonian and Poisson)
à DOFs of those matrices are proportional to the # of grids in the simulation domains
• Schrödinger Equations
à Normal Eigenvalue Problem
• Poisson Equations
à Linear System Problem à Ax = b
Hoon Ryu / Large-scale Electronic Structure Simulations with MVAPICH2 on Intel KNL Processors 4
Electronic Structure CalculationsIn a perspective of “numerical analysis”
• Two PDE-coupled Loop: Schrödinger Equation and Poisson Equation
• Both equation involve system matrices (Hamiltonian and Poisson)
à DOFs of those matrices are proportional to the # of grids in the simulation domains
• Schrödinger Equations
à Normal Eigenvalue Problem
• Poisson Equations
à Linear System Problem à Ax = b
How large are these system matrices?Why do we need to handle those?
Hoon Ryu / Large-scale Electronic Structure Simulations with MVAPICH2 on Intel KNL Processors 5
Needs for “Large” Electronic StructuresNeeds for High Performance Computing
1. Quantum Simulations of “Realizable” Nanoscale Materials and Devices à Needs to handle large-scale atomic systems (~ A few tens of nms)
[Logic Transistors (FinFET)]
[Quantum Dot Devices]
30nm3 Silicon Box? à About a million atoms
2. DOF of Matrices of Governing Equationsà Linearly proportional to # of atoms (w/ some weight)
3. Parallel Computing
Ax = b
Hoon Ryu / Large-scale Electronic Structure Simulations with MVAPICH2 on Intel KNL Processors 6
Development Strategy: DD, Matrix HandlingLarge-scale Schrödinger, Poisson Eqns.
Schrödinger Equation• Normal Eigenvalue Problem (Electronic Structure)
• Hamiltonian is always symmetric
Poisson Equation• Linear System Problem (Electrostatics: Q-V)
• Poisson matrix is always symmetric
Domain Decomposition• MPI + OpenMP
• Effectively multi-dimensional decomposition
Z
Thread 1 Thread 2
Thread 3 Thread 4
Adiag(0)
Adiag(1)
W(0,1)
W(1,0)
Adiag(2)
W(1,2)
W(2,1)
Adiag(3)
W(2,3)
W(3,2)
Rank 3
Rank 2
Rank 1
Rank 0
Mapping to TB Hamiltonian
Ran
k 0
Ran
k 1
Ran
k 2
Thread 0
Thread 1
Thread 2
Thread 3R
ank
3Y
X
Slab 0
Slab 1
Slab 2
Slab 3
Matrix Handling• Tight-binding Hamiltonian (Schrödinger Eq.)
• Finite Difference Method (Poisson Eq.)
• Nearest Neighbor Coupling: Highly Sparse à CSR
[100] Zincblende Unitcell0
1
23
4
5
67
Cation
Anion Onsite
(0)
Onsite (1)
Onsite (2)
Onsite (3)
Onsite (4)
Onsite (5)
Onsite (6)
Onsite (7)
Coupling(0,4)
Coupling(4,0)
Coupling(1,4)
Coupling(1,5)
Coupling(4,1)
Coupling(5,1)
Coupling(2,4)
Coupling(2,6)
Coupling(4,2)
Coupling(6,2)
Coupling(3,4)
Coupling(4,3)
Coupling(3,7)
Coupling(7,3)
Mapping to TB Hamiltonian
Hoon Ryu / Large-scale Electronic Structure Simulations with MVAPICH2 on Intel KNL Processors 7
Development Strategy: Numerical AlgorithmsSchrödinger Equations
Schrödinger Eqs. w/ LANCZOS Algorithmà C. Lanczos, J. Res. Natl. Bur. Stand. 45, 255
• Normal Eigenvalue Problem (Electronic Structure)
• Hamiltonian is always symmetric
• Original Matrix à T matrix-reduction
• Steps for Iteration: Purely Scalable Algebraic Ops.
Self-consistent Loop for Device Simulations
Hoon Ryu / Large-scale Electronic Structure Simulations with MVAPICH2 on Intel KNL Processors 8
Development Strategy: Numerical AlgorithmsPoisson Equations
Poisson Eqs. w/ CG Algorithmà A Problem of Solving Linear Systems
• Conv. Guaranteed: Symmetric & Positive Definite
• Poisson is always S & PD.
• Steps for Iteration: Purely Scalable
Algebraic Ops.
Self-consistent Loop for Device Simulations
Hoon Ryu / Large-scale Electronic Structure Simulations with MVAPICH2 on Intel KNL Processors 9
Performance Bottleneck?Matrix-vector Multiplier: Sparse Matrices
W
W
W
W
Ran
k 0
Ran
k 1
Ran
k 2
Ran
k 3
X(0)
X(1)
X(2)
X(3)
Y(0)
Y(1)
Y(2)
Y(3)
Thread 0
Thread 1
Thread 2
Thread 3
Thread 0
Thread 1
Thread 2Thread 3
Thread 0
Thread 1
Thread 2
Thread 3
Thread 0
Thread 1
Thread 2
Thread 3
Ele
men
t-w
ise
mul
tiplic
atio
n &
sum
of
X(i)
and
Y(i)
in r
ank
i to
get Z
i
Ran
k 0
Ran
k 1
Ran
k 2
Ran
k 3
Z0
Z1
Z2
Z3
Ran
k 0
Ran
k 1
Ran
k 2
Ran
k 3
Tota
l sum
with
MP
I col
lect
ive
com
mun
icat
ion
(W=Z
0+Z1
+Z2+
Z3)(i) (ii) (iii)
MP
I poi
nt-to
-poi
ntco
mm
unic
atio
n
Adiag(0)
Adiag(1)
W(0,1)
W(1,0)
Adiag(2)
W(1,2)
W(2,1)
Adiag(3)
W(2,3)
W(3,2)
Thread 0
Thread 1
Thread 2
Thread 3
Ran
k 0
Ran
k 1
Ran
k 2
Ran
k 3
X(0)
X(1)
X(2)
X(3)
(ii)(i)
(iii) (iv)
Adiag(2) X(2)
W(2,1) X(1)
X(3)W(2,3)
X
X
X
+
+
Res
ult v
ecto
r in
Ran
k 2
Adiag(0)
Adiag(1)
W(0,1)
W(1,0)
Adiag(2)
W(1,2)
W(2,1)
Adiag(3)
W(2,3)
W(3,2)
Thread 0
Thread 1
Thread 2
Thread 3
Ran
k 0
Ran
k 1
Ran
k 2
Ran
k 3
X(0)
X(1)
X(2)
X(3)
Adiag(0)
Adiag(1)
W(0,1)
W(1,0)
Adiag(2)
W(1,2)
W(2,1)
Adiag(3)
W(2,3)
W(3,2)
Thread 0
Thread 1
Thread 2
Thread 3
Ran
k 0
Ran
k 1
Ran
k 2
Ran
k 3
X(0)
X(1)
X(2)
X(3)
MP
I poi
nt-to
-poi
ntco
mm
unic
atio
n
Vector Dot-Product (VVDot) (Sparse) Matrix-vector Multiplier (MVMul)
Main Concerns for Performance• Collective Communication
à May not be the main bottleneck as we only need to
collect a single value from a single MPI process
• Matrix-vector Multiplier
à Communication happens, but would not be a critical
problem as it only happens between adjacent ranks
à Cache-miss affects vectorization efficiency
Hoon Ryu / Large-scale Electronic Structure Simulations with MVAPICH2 on Intel KNL Processors 10
Intel® Knights Landing (KNL) ProcessorsA Simple Overview of Architecture
Summary of Spec & Issues• Up to 72 CPU cores
à 36 Tiles in a 2D mesh (2 AVX-512 VPU per core)
à DDR4, 6 Channels
à Up to 16GB MCDRAM (HBM) Total
• Self-hosted, No issue in copying time
• Extracting performance means
à Better memory access patternsà More threads & Vectorization
A. Sodani, Intel® Xeon Phi™ Processor “Knights Landing”
Architectural Overview, ISC (2015)
HBM Modes• Cache Mode
à No source changes needed to use
• Flat mode
à MCDRAM mapper to physical address space
à Accessed through (memkind) library or numactl
• Hybrid
à Combination of the above two
Hoon Ryu / Large-scale Electronic Structure Simulations with MVAPICH2 on Intel KNL Processors 11
Single-node Performance: The power of MCDRAMPerformance w/ Intel® KNL Processors
Intel MPI MVAPICH
Intel MPI(HBM)
MVAPICH(HBM)
Description of BMT Target and Test Mode• 36x12x12(nm
3) [100] Si:P quantum dot
à Material candidates for Si Quantum Info. Processors
(Nature Nanotech. 9, 430)
à 2.1Mx2.1M Hamiltonian Matrix
• KNL (Xeon Phi 7210); Up to 256 (64x4) cores
• MCDRAM control w/ numactrl; Quad Mode; MPI only• Intel Parallel Studio 2017 (Intel MPI)
• MVAPICH2-2.2 with Intel Composer XE 2017 (MVAPICH)
Results• With no MCDRAM
à No clear speed-up beyond 32 cores
• With MCDRAMà Full scalability up to 64 hardware cores
Points of Questions• How is the performance of MVAPICH2 w.r.t. Intel?
à Spot of comparison: Communications
Hoon Ryu / Large-scale Electronic Structure Simulations with MVAPICH2 on Intel KNL Processors 12
Single-node Performance: MVAPICH vs. Intel MPIPerformance w/ Intel® KNL Processors
Description of BMT Target and Test Mode• 36x12x12(nm
3) [100] Si:P quantum dot
à Material candidates for Si Quantum Info. Processors
(Nature Nanotech. 9, 430)
à 2.01Mx2.1M Hamiltonian Matrix
• KNL (Xeon Phi 7210); Up to 256 (64x4) cores
• MCDRAM control w/ numactrl; Quad Mode; MPI only• Intel Parallel Studio 2017 (Intel MPI)
• MVAPICH2-2.2 with Intel Composer XE 2017 (MVAPICH)
Points of Questions• How is the performance of MVAPICH2 w.r.t. Intel?
à Where is the spot for comparison in this simple
• A focus on communication timeà MVAPICH2 clearly shows longer communication times
à MPI_Send/Recv + Allreduce function calls
• Why?à The reason may be related to the resource allocation
Communication Time
Hoon Ryu / Large-scale Electronic Structure Simulations with MVAPICH2 on Intel KNL Processors 13
MVAPICH vs. Intel MPI: CPU allocation to Rank ID’sPerformance w/ Intel® KNL Processors
A. Sodani, Intel® Xeon Phi™ Processor “Knights Landing”
Architectural Overview, ISC (2015)
How are physical cores allocated to MPI processes? • Cluster set to quadrant mode
• Sequential vs. Random?
2 31 4
Quadrant
Physical
Core
Ranks (Starting w/ 1)
Hoon Ryu / Large-scale Electronic Structure Simulations with MVAPICH2 on Intel KNL Processors 14
MVAPICH vs. Intel MPI: CPU allocation to Rank ID’sPerformance w/ Intel® KNL Processors
A. Sodani, Intel® Xeon Phi™ Processor “Knights Landing”
Architectural Overview, ISC (2015)
How are physical cores allocated to MPI processes? • Cluster set to quadrant mode
• Sequential vs. Random?
2 31 4
Ranks (Starting w/ 1)
Hoon Ryu / Large-scale Electronic Structure Simulations with MVAPICH2 on Intel KNL Processors 15
MVAPICH vs. Intel MPI: CPU allocation to Rank ID’sPerformance w/ Intel® KNL Processors
MVAPICH2 àß Intel MPIWhat about in terms of computing time?• 1 physical core has 4 hardware threads
• If 1 physical core is mapped by multiple MPI processes, computing time would get less benefits
Best affirnity for this workload: “clock-wise” and “inter-core” allocations of MPI processes
16 MPI 64 MPI 16 MPI 64 MPI
2 31 4
Hoon Ryu / Large-scale Electronic Structure Simulations with MVAPICH2 on Intel KNL Processors 16
Summary and Acknowledgements Electronic Structure Simulations w/ Intel MPI and MVAPICH2
• Introduction to Code Functionality
• Main Numerical Problems and Performance Bottleneck
• Performance (speed and energy consumption) in a single KNL node
• Intel MPI vs MVAPICH 2: affinity of MPI processes
à w/ initial analysis of its potential effects on performance
Members of KISTI Intel® Parallel Computing Center
Seungmin LeeYosang Jeong Geunchul Park Dukyun Nam Hoon Ryu
Hoon Ryu / Large-scale Electronic Structure Simulations with MVAPICH2 on Intel KNL Processors 17
Mapping MPI processes to CPUs in KNLElectronic Structure Simulations w/ Intel MPI and MVAPICH2
2 31 4
J. Cazes et al., Programming Intel’s 2nd Generation Xeon Phi, IXPUG 2016 KNL Tutorial, ISC (2016)
Strategy• We know the id’s of cpus in each quadrant
à (e.g) CPU 0~15 is in the 2nd
quadrant
• Need approximations to describe the cpu
map in a single quadrant
à it is not clear which cpu has which id.
• The following approximation would not hurt
the generality of the issue.
0 1 2 34 5 6 78 9 10 1112 13 14 15
48 49 50 5152 53 54 5556 57 58 5960 61 62 63
Quad 1 Quad 4
Hoon Ryu / Large-scale Electronic Structure Simulations with MVAPICH2 on Intel KNL Processors 18
Multi-node Performance: ScalabilityPerformance w/ Intel® KNL Processors
Description of BMT Target and Test Mode• 5 CB states in 27x33x33(nm
3) [100] Si:P quantum dot
à Material candidates for Si Quantum Info. Processors
(Nature Nanotech. 9, 430)
à 14.4Mx14.4M Hamiltonian Matrix
• KNL (Xeon Phi 7250) nodes; Up to 272 (68x4) cores/node
à (4 MPI processes + 68 threads) per node
à Quad / Flat mode, No OPA (10G network)
à Strong scalability measured up to 3 nodes
Results• Speed-enhancement with HBM becomes larger as a single
node takes larger workload
• Inter-node strong scalability is quite nice with no HBM
à 2.36x speed-up with 3x computing expense (no HBM)
à ~78% scalability (2.36/3) is what we usually get from
multi-core base HPCs (Tachyon-II HPC in KISTI)
à 1.58x speed-up with HBM (prob. size is not large enough)
H. Ryu et al., to be presented in SC17 (Intel® HPCDC)
top related