1
First-Principles Molecular Dynamics for Petascale Computers
François Gygi, Dept of Applied Science, UC Davis
[email protected]
http://eslab.ucdavis.edu
Zhaojun Bai, Dept of Computer Science, UC Davis
Giulia Galli, Dept of Chemistry, UC Davis
Kwan-Liu Ma, Dept of Computer Science, UC Davis
Supported by NSF-ITR-HECURA 0749217
2
The Qbox project
• Qbox is a C++/MPI implementation of First-Principles Molecular Dynamics (FPMD)
• Qbox includes a quantum mechanical description of electronic structure within Density Functional Theory
• Applications to Materials Science, Chemistry, Nanoscience
• Software development focuses on large-scale parallelism
3
Qbox code architecture
Qbox
ScaLAPACK/PBLAS
BLACS
MPI
BLAS/ATLAS
XercesC (XML parser)
FFTW lib
DGEMM lib
http://eslab.ucdavis.edu/software/qbox
4
Qbox performance results
8 k-points: 207.3 TFlop/s (56% of peak)
4 k-points: 187.7 TFlop/s (51% of peak)
1 k-point: 108.8 TFlop/s (30% of peak)
2006 ACM/IEEE Gordon Bell Award for peak performance
• Electronic structure of a 1000-atom Molybdenum sample
• 12,000 electrons
• LLNL BlueGene/L
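The "percent of peak" figures relate each sustained rate to the machine's theoretical peak. As a quick consistency check, the sketch below inverts that relation for the three reported runs. The function name is illustrative, and the peak values it computes are only implied by the rounded percentages on this slide; they are not quoted from it.

```python
# Sustained rates (TFlop/s) and rounded percent-of-peak values copied from the slide.
RUNS = {
    "8 k-points": (207.3, 56),
    "4 k-points": (187.7, 51),
    "1 k-point": (108.8, 30),
}


def implied_peak(sustained_tflops, percent_of_peak):
    """Peak rate (TFlop/s) implied by a sustained rate and its percent of peak."""
    return sustained_tflops / (percent_of_peak / 100.0)


if __name__ == "__main__":
    for label, (rate, pct) in RUNS.items():
        print(f"{label}: implied peak ~ {implied_peak(rate, pct):.0f} TFlop/s")
```

The three implied values agree to within a few percent, which is the spread one would expect from percentages rounded to whole numbers.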
5
Current Qbox availability on TeraGrid platforms
• Mercury, NCSA
• Cobalt, NCSA
• Tungsten, NCSA
• BlueGene/L, SDSC
• IBM p655, SDSC
Other platforms
• ANL BG/L
• ANL BG/P
• NERSC Franklin, Cray XT4
• NCSA Abe
6
New scalable algorithms for electronic structure calculations
• One-sided Jacobi simultaneous diagonalization algorithm used in electronic structure calculations
– 64-node dual-dual-core AMD Opteron/InfiniPath cluster
– 1 rack ANL BlueGene/L
[Figure: speedup t(N_CPU,min)/t(N_CPU) versus N_CPU (0 to 1200) for m=8192. Curves: BG/L speedup, AMD/Opt speedup, BG/L ideal speedup, AMD/Opt ideal speedup.]
7
Qbox scalability for nanoscience applications
• Electronic structure of a 2260-atom silicon nanowire
• Cray-XT4, up to 8k CPUs
• Superlinear scaling due to cache effects and size-dependent MPI protocols
• 86% parallel efficiency between 2k and 8k CPUs
[Figure: speedup t(N_CPU,min)/t(N_CPU) versus N_CPU (0 to 8192). Curves: Qbox / Cray-XT4, ideal speedup.]
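The speedup plotted on this slide is t(N_CPU,min)/t(N_CPU), and parallel efficiency between two CPU counts is that speedup divided by the ideal one (the ratio of CPU counts). A minimal sketch of the arithmetic follows; the function names are illustrative and the timings are hypothetical values chosen to reproduce an efficiency near 86%, not measurements from the talk.

```python
def speedup(t_ref, t_n):
    """Speedup of a run taking t_n relative to a reference run taking t_ref."""
    return t_ref / t_n


def parallel_efficiency(n_ref, t_ref, n, t_n):
    """Measured speedup from n_ref to n CPUs divided by the ideal speedup n / n_ref."""
    return speedup(t_ref, t_n) / (n / n_ref)


if __name__ == "__main__":
    # Hypothetical wall-clock times in seconds; NOT data from the slides.
    t_2k, t_8k = 100.0, 29.07
    eff = parallel_efficiency(2048, t_2k, 8192, t_8k)
    print(f"efficiency from 2k to 8k CPUs: {eff:.2f}")
```

With this definition, superlinear scaling shows up as an efficiency above 1.0, and 86% efficiency over a fourfold increase in CPUs corresponds to a speedup of about 3.44.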
8
Qbox parallel I/O strategy
• Advanced functions in MPI-IO are not supported by all file systems (MPI_File_write_shared, etc.)
• Qbox uses a strategy based on shared file pointer objects
• Achieves >700 MB/s write rate for file sizes of 50–250 GB

Platform      Tasks   Write speed
Cray-XT4      2048    778 MB/s
Cray-XT4      4096    715 MB/s
Cray-XT4      8192    687 MB/s
BG/P (ANL)    2048    814 MB/s
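The slides do not describe how Qbox's shared file pointer objects are implemented. As a rough illustration of the general idea only, the sketch below uses a hypothetical `SharedFilePointer` class that hands out non-overlapping byte ranges, so that each task can write its chunk at an explicit offset without relying on `MPI_File_write_shared`. It runs in a single process with ordinary file I/O; in an MPI code the offsets would have to be agreed on across tasks.

```python
import os
import tempfile


class SharedFilePointer:
    """Hands out non-overlapping byte ranges of one shared file.

    Hypothetical helper for illustration; not Qbox's actual class.
    """

    def __init__(self):
        self.offset = 0

    def reserve(self, nbytes):
        """Return the start offset of a region of nbytes and advance the pointer."""
        start = self.offset
        self.offset += nbytes
        return start


def write_chunks(path, chunks):
    """Write each chunk at the explicit offset reserved for it."""
    pointer = SharedFilePointer()
    # Reserve all regions first, as if each task asked the pointer in turn.
    offsets = [pointer.reserve(len(c)) for c in chunks]
    with open(path, "wb") as f:
        # Write in reverse order to show that order does not matter
        # once the regions are fixed and do not overlap.
        for off, chunk in sorted(zip(offsets, chunks), reverse=True):
            f.seek(off)
            f.write(chunk)
    return offsets


if __name__ == "__main__":
    chunks = [b"<atom/>", b"<wavefunction/>", b"<cell/>"]
    path = os.path.join(tempfile.mkdtemp(), "sample.xml")
    print(write_chunks(path, chunks))
```

Because every region is fixed before any data is written, the writes can proceed independently and in any order, which is the property that makes explicit-offset writes usable on file systems without shared-pointer support.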
9
Analysis of MPI message traffic patterns in Qbox
• Multiple traffic patterns are involved during a Qbox simulation
– physics kernels
– 3D Fourier transforms
– ScaLAPACK linear algebra
• Logical-to-physical mapping of tasks has a large impact on performance on large platforms (> 4k CPUs)
• We are developing instrumentation and visualization tools to analyze message traffic patterns on various interconnect architectures
Mapping of 65536 MPI tasks on the 32x32x64 torus of the LLNL BG/L
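To make the logical-to-physical mapping concrete, the sketch below places 65536 task ranks on a 32x32x64 torus and counts the network hops between two nodes. The x-fastest ordering is an assumption made for illustration; it is not necessarily the default BG/L ordering or the mapping used by Qbox, and both function names are hypothetical.

```python
TORUS_DIMS = (32, 32, 64)  # torus shape from the slide: 32 * 32 * 64 = 65536 nodes


def rank_to_coords(rank, dims=TORUS_DIMS):
    """Map a task rank to (x, y, z), with x varying fastest (assumed ordering)."""
    nx, ny, nz = dims
    if not 0 <= rank < nx * ny * nz:
        raise ValueError("rank outside torus")
    x = rank % nx
    y = (rank // nx) % ny
    z = rank // (nx * ny)
    return x, y, z


def torus_hops(a, b, dims=TORUS_DIMS):
    """Minimum number of links between nodes a and b, using wraparound per dimension."""
    return sum(min(abs(p - q), n - abs(p - q)) for p, q, n in zip(a, b, dims))


if __name__ == "__main__":
    first, last = rank_to_coords(0), rank_to_coords(65535)
    print(first, last, torus_hops(first, last))
```

Tasks that exchange many messages (for example, members of the same ScaLAPACK process row) cost less when the mapping keeps their hop count small, which is why a different ordering can change performance at large task counts.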
10
Analysis of MPI message traffic patterns in Qbox
• Screenshot of the message traffic visualization tool showing MPI calls in a ScaLAPACK matrix multiplication (C. Muelder, K-L. Ma, UC Davis)
11
Qbox current developments
• Deployment on TeraGrid track-2 platforms
• Applications to Nanoscience simulations
– G. Galli, Chemistry, UC Davis
• Specialized linear algebra algorithms
– Z. Bai, Computer Science, UC Davis
• Visualization
– K-L. Ma, Computer Science, UC Davis
• Application-specific data compression algorithms
• Large dataset management (10^10 to 10^12 bytes)
• XML standards for electronic structure data (http://www.quantum-simulation.org)
Supported by NSF-ITR-HECURA 0749217
http://eslab.ucdavis.edu