
First-Principles Molecular Dynamics for Petascale Computers

François Gygi, Dept of Applied Science, UC Davis
http://eslab.ucdavis.edu

Zhaojun Bai, Dept of Computer Science, UC Davis

Giulia Galli, Dept of Chemistry, UC Davis

Kwan-Liu Ma, Dept of Computer Science, UC Davis

Supported by NSF-ITR-HECURA 0749217


The Qbox project

• Qbox is a C++/MPI implementation of First-Principles Molecular Dynamics (FPMD)

• Qbox includes a quantum mechanical description of electronic structure within Density Functional Theory

• Applications to Materials Science, Chemistry, Nanoscience

• Software development focuses on large-scale parallelism
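In FPMD, the atoms are moved with forces computed from the electronic structure at each time step. The sketch below shows one common integrator for this, velocity Verlet. The slide does not say which integrator Qbox uses, and `toy_forces` is a hypothetical harmonic stand-in for the DFT forces Qbox would compute; all names here are illustrative.

```python
def toy_forces(positions, k=1.0):
    """Hypothetical stand-in for DFT forces: independent harmonic wells, F = -k x."""
    return [-k * x for x in positions]


def velocity_verlet(positions, velocities, mass, forces, dt, nsteps):
    """Advance 1D coordinates with the velocity Verlet integrator."""
    x, v = list(positions), list(velocities)
    f = forces(x)  # in FPMD, each force call requires an electronic structure solve
    for _ in range(nsteps):
        v = [vi + 0.5 * dt * fi / mass for vi, fi in zip(v, f)]  # half kick
        x = [xi + dt * vi for xi, vi in zip(x, v)]                # drift
        f = forces(x)                                             # forces at new positions
        v = [vi + 0.5 * dt * fi / mass for vi, fi in zip(v, f)]  # half kick
    return x, v


# One period (about 2*pi) of a unit-mass, unit-stiffness oscillator
x, v = velocity_verlet([1.0], [0.0], mass=1.0, forces=toy_forces, dt=0.01, nsteps=628)
energy = 0.5 * v[0] ** 2 + 0.5 * x[0] ** 2
```

Each step needs one new force evaluation. This is where the cost of the electronic structure calculation enters an FPMD run.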


Qbox code architecture

Qbox (top of the software stack) relies on the following libraries:

• ScaLAPACK/PBLAS

• BLACS

• MPI

• BLAS/ATLAS

• XercesC (XML parser)

• FFTW lib

• DGEMM lib

http://eslab.ucdavis.edu/software/qbox
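The "DGEMM lib" layer supplies the BLAS Level 3 routine DGEMM, which computes C := alpha·op(A)·op(B) + beta·C in double precision. The pure-Python reference below covers only the no-transpose case. It uses row-major lists for simplicity, whereas the Fortran BLAS interface is column-major. It is meant to show the operation, not to stand in for an optimized library, and the name `dgemm_ref` is hypothetical.

```python
def dgemm_ref(alpha, a, b, beta, c):
    """Reference C := alpha*A*B + beta*C (no transposes), returning a new matrix.

    A is m x k, B is k x n, C is m x n, all stored as lists of rows.
    """
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a) or len(c) != m or any(len(row) != n for row in c):
        raise ValueError("incompatible matrix shapes")
    return [
        [alpha * sum(a[i][p] * b[p][j] for p in range(k)) + beta * c[i][j] for j in range(n)]
        for i in range(m)
    ]
```

For example, `dgemm_ref(1.0, [[1, 2], [3, 4]], [[5, 6], [7, 8]], 2.0, [[1, 1], [1, 1]])` adds twice C to the product A·B.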


Qbox performance results

• 8 k-points: 207.3 TFlop/s (56% of peak)

• 4 k-points: 187.7 TFlop/s (51% of peak)

• 1 k-point: 108.8 TFlop/s (30% of peak)

2006 ACM/IEEE Gordon Bell Award for peak performance

• Electronic structure of a 1000-atom molybdenum sample

• 12,000 electrons

• LLNL BlueGene/L
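Each percentage above is the sustained rate divided by the machine's peak rate. The slide does not state the peak, so the sketch backs it out from each quoted pair. The three results (about 370, 368 and 363 TFlop/s) agree to within the rounding of the percentages. This is consistent with the roughly 367 TFlop/s nominal peak of BlueGene/L at the time. The helper names are illustrative.

```python
# Sustained rate (TFlop/s) and quoted fraction of peak, keyed by number of k-points
runs = {8: (207.3, 0.56), 4: (187.7, 0.51), 1: (108.8, 0.30)}


def implied_peak(sustained_tflops, fraction_of_peak):
    """Back out the machine peak from a sustained rate and its quoted fraction."""
    return sustained_tflops / fraction_of_peak


def percent_of_peak(sustained_tflops, peak_tflops):
    """Sustained rate as a percentage of the peak rate."""
    return 100.0 * sustained_tflops / peak_tflops


for nk, (rate, frac) in runs.items():
    print(f"{nk} k-point(s): implied peak ~ {implied_peak(rate, frac):.0f} TFlop/s")
```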


Current Qbox availability on TeraGrid platforms

• Mercury, NCSA

• Cobalt, NCSA

• Tungsten, NCSA

• BlueGene/L, SDSC

• IBM p655, SDSC

Other platforms

• ANL BG/L

• ANL BG/P

• NERSC Franklin, Cray XT4

• NCSA Abe


New scalable algorithms for electronic structure calculations

• One-sided Jacobi simultaneous diagonalization algorithm used in electronic structure calculations, benchmarked on:
– 64-node dual-dual-core AMD Opteron/InfiniPath cluster
– 1 rack ANL BlueGene/L

[Figure: speedup t(N_CPU,min)/t(N_CPU) versus N_CPU (0 to 1200) for matrix size m=8192 on BG/L and on the AMD Opteron cluster, with the ideal speedup for each platform]
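Jacobi methods apply plane rotations chosen to drive off-diagonal elements toward zero. For simultaneous diagonalization of several symmetric matrices, each rotation angle minimizes the sum of the squared (p, q) off-diagonal elements over all the matrices. The sketch below is a serial, two-sided cyclic Jacobi formulation of joint diagonalization. It is not the one-sided parallel algorithm benchmarked on this slide, and the function names are hypothetical.

```python
import math


def off_norm2(mats):
    """Sum of squared off-diagonal entries over all matrices in the set."""
    n = len(mats[0])
    return sum(a[i][j] ** 2 for a in mats for i in range(n) for j in range(n) if i != j)


def joint_diagonalize(mats, tol=1e-12, max_sweeps=100):
    """Jointly diagonalize real symmetric matrices with cyclic Jacobi rotations.

    Returns (rotated, V) where rotated[k] = V^T mats[k] V is (nearly) diagonal
    and V is the accumulated orthogonal matrix.
    """
    mats = [[list(row) for row in a] for a in mats]  # work on copies
    n = len(mats[0])
    v = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(max_sweeps):
        if off_norm2(mats) < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                # G = sum_k g_k g_k^T with g_k = [2 a_pq, a_pp - a_qq]
                g00 = g01 = g11 = 0.0
                for a in mats:
                    x, y = 2.0 * a[p][q], a[p][p] - a[q][q]
                    g00 += x * x
                    g01 += x * y
                    g11 += y * y
                if g00 == 0.0:  # every a_pq is already zero
                    continue
                # Angle minimizing sum_k (a'_pq)^2 for this (p, q) pair
                theta = 0.25 * math.atan2(-g01, -0.5 * (g00 - g11))
                c, s = math.cos(theta), math.sin(theta)
                for a in mats:
                    for i in range(n):  # A <- A J: mix columns p and q
                        aip, aiq = a[i][p], a[i][q]
                        a[i][p], a[i][q] = c * aip - s * aiq, s * aip + c * aiq
                    for j in range(n):  # A <- J^T A: mix rows p and q
                        apj, aqj = a[p][j], a[q][j]
                        a[p][j], a[q][j] = c * apj - s * aqj, s * apj + c * aqj
                for i in range(n):  # V <- V J accumulates the rotations
                    vip, viq = v[i][p], v[i][q]
                    v[i][p], v[i][q] = c * vip - s * viq, s * vip + c * viq
    return mats, v
```

With a single matrix, this reduces to the classical cyclic Jacobi eigenvalue method. With several commuting matrices, the columns of V converge to their common eigenvectors.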


Qbox scalability for nanoscience applications

• Electronic structure of a 2260-atom silicon nanowire

• Cray-XT4, up to 8k CPUs

• Superlinear scaling due to cache effects and size-dependent MPI protocols

• 86% parallel efficiency between 2k and 8k CPUs

[Figure: speedup t(N_CPU,min)/t(N_CPU) versus N_CPU (0 to 8192) for Qbox on the Cray-XT4, compared with ideal speedup]
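The speedup plotted here is t(N_CPU,min)/t(N_CPU). Parallel efficiency divides that speedup by the ideal ratio N_CPU/N_CPU,min. The sketch below computes both. The timings in it are hypothetical round numbers chosen to reproduce an 86% efficiency between 2048 and 8192 CPUs, not measured Qbox data.

```python
def speedup(t_ref, t):
    """Speedup relative to the reference run: t(N_CPU,min) / t(N_CPU)."""
    return t_ref / t


def parallel_efficiency(n_ref, t_ref, n, t):
    """Fraction of the ideal speedup achieved when going from n_ref to n CPUs."""
    return speedup(t_ref, t) / (n / n_ref)


# Hypothetical timings (not measured data): 100 s on 2048 CPUs, 29.07 s on 8192 CPUs
eff = parallel_efficiency(2048, 100.0, 8192, 29.07)
```

An efficiency above 1 over some range of N_CPU is what superlinear scaling means.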


Qbox parallel I/O strategy

• Advanced functions in MPI-IO are not supported by all file systems (MPI_File_write_shared, etc.)

• Qbox uses a strategy based on shared file pointer objects

• Achieves >700 MB/s write rate for file sizes of 50–250 GB

platform      MPI tasks   write speed
Cray-XT4      2048        778 MB/s
BG/P (ANL)    2048        814 MB/s
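One portable way to write a single shared file without relying on MPI_File_write_shared is to give each task an explicit byte offset. That offset is the total size of the data held by all lower-ranked tasks, an exclusive prefix sum such as MPI_Exscan computes. Each task then writes at its offset, for example with MPI_File_write_at_all. The slide does not detail Qbox's own scheme. The serial sketch below only simulates the offset arithmetic, and its function names are hypothetical.

```python
from itertools import accumulate


def exclusive_offsets(local_sizes):
    """Byte offset of each task: sum of the sizes of all lower-ranked tasks."""
    if not local_sizes:
        return []
    return [0] + list(accumulate(local_sizes))[:-1]


def simulate_shared_write(chunks):
    """Serially emulate each 'task' writing its chunk at its own offset."""
    offsets = exclusive_offsets([len(c) for c in chunks])
    buf = bytearray(sum(len(c) for c in chunks))
    for off, chunk in zip(offsets, chunks):
        buf[off:off + len(chunk)] = chunk  # disjoint regions, so no coordination needed
    return offsets, bytes(buf)
```

Because the regions are disjoint, the tasks need no shared file pointer. The only collective step is the prefix sum.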


Analysis of MPI message traffic patterns in Qbox

• Multiple traffic patterns are involved during a Qbox simulation:
– physics kernels
– 3D Fourier transforms
– ScaLAPACK linear algebra

• Logical-to-physical mapping of tasks has a large impact on performance on large platforms (> 4k CPUs)

• We are developing instrumentation and visualization tools to analyze message traffic patterns on various interconnect architectures

[Figure: mapping of 65536 MPI tasks on the 32x32x64 torus of the LLNL BG/L]
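One way to quantify the effect of task placement is to count the torus links a message crosses. The sketch below assumes one MPI task per node and a simple placement in which rank r fills the 32x32x64 torus with x varying fastest. This ordering is an illustrative assumption, not the mapping shown in the figure or used by Qbox, and the function names are hypothetical.

```python
def rank_to_coords(rank, dims):
    """Map an MPI rank to (x, y, z) torus coordinates, x varying fastest (assumed order)."""
    nx, ny, nz = dims
    if not 0 <= rank < nx * ny * nz:
        raise ValueError("rank outside torus")
    return rank % nx, (rank // nx) % ny, rank // (nx * ny)


def torus_hops(a, b, dims):
    """Minimal hop count between two nodes on a 3D torus with wraparound links."""
    return sum(min(abs(p - q), n - abs(p - q)) for p, q, n in zip(a, b, dims))


def rank_hops(r1, r2, dims):
    """Hop count between the nodes hosting two MPI ranks."""
    return torus_hops(rank_to_coords(r1, dims), rank_to_coords(r2, dims), dims)


DIMS = (32, 32, 64)  # 65536 nodes
```

Under this placement, consecutive ranks 31 and 32 are already two hops apart. A communication pattern that is local in rank space is not necessarily local on the torus.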


Analysis of MPI message traffic patterns in Qbox

• Screenshot of the message traffic visualization tool showing MPI calls in a ScaLAPACK matrix multiplication (C. Muelder, K-L Ma, UC Davis)
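ScaLAPACK stores distributed matrices in a 2D block-cyclic layout over a process grid. Which messages a matrix multiplication exchanges therefore follows from which process owns which blocks. The helpers below compute ownership and local indices with 0-based indices, assuming the first block sits on process (0, 0) (ScaLAPACK's RSRC = CSRC = 0). They mirror the arithmetic of ScaLAPACK's INDXG2P and INDXG2L tool routines (which use 1-based indices) but are a hypothetical Python sketch, not their API.

```python
def block_cyclic_owner(i, j, mb, nb, nprow, npcol):
    """Process-grid coordinates (prow, pcol) owning global entry (i, j), 0-based.

    mb x nb is the block size and nprow x npcol the process grid.
    """
    return (i // mb) % nprow, (j // nb) % npcol


def global_to_local(i, mb, nprow):
    """Local row index of global row i on its owning process (same rule for columns)."""
    return (i // (mb * nprow)) * mb + i % mb
```

For example, with 2x2 blocks on a 2x3 grid, global entry (2, 5) lives on process (1, 2).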


Qbox current developments

• Deployment on TeraGrid track-2 platforms

• Applications to nanoscience simulations
– G. Galli, Chemistry, UC Davis

• Specialized linear algebra algorithms
– Z. Bai, Computer Science, UC Davis

• Visualization
– K-L. Ma, Computer Science, UC Davis

• Application-specific data compression algorithms

• Large dataset management (10^10 to 10^12 bytes)

• XML standards for electronic structure data (http://www.quantum-simulation.org)
