
Page 1

Francois Courteille, Senior Solution Architect, Accelerated Computing, NVIDIA

Accelerating Materials-Simulation Computations on GPUs
(Accélération de calculs de simulations des matériaux sur GPU)

Page 2

AGENDA

Tesla, the NVIDIA compute platform

Quantum Chemistry Applications Overview

Molecular Dynamics Applications Overview

Material Science Applications at the start of the Pascal Era

Page 3

THE WORLD LEADER IN VISUAL COMPUTING

GAMING | ENTERPRISE TECHNOLOGY | HPC & CLOUD | AUTO

Page 4

OUR TEN YEARS IN HPC (2006-2016)

Milestones:
- CUDA launched
- World's first GPU Top500 system
- Fermi: world's first HPC GPU
- Discovered how H1N1 mutates to resist drugs
- Oak Ridge deploys world's fastest supercomputer with GPUs
- AlexNet beats expert code by a huge margin using GPUs
- Stanford builds AI machine using GPUs
- World's first atomic model of the HIV capsid
- World's first 3-D mapping of the human genome
- Google outperforms humans on ImageNet
- GPU-trained AI machine beats world champion at Go

Page 5

CREDENTIALS BUILT OVER TIME

300K CUDA developers, 4x growth in 4 years

The majority of HPC applications are GPU-accelerated: 410 and growing (www.nvidia.com/appscatalog)

Number of GPU-accelerated applications by year:
2011: 113 | 2012: 206 | 2013: 242 | 2014: 287 | 2015: 370 | 2016: 410

Domains: academia, games, finance, manufacturing, internet, oil & gas, national labs, automotive, defense, M&E

100% of deep learning frameworks are accelerated: Torch, Theano, Caffe, MatConvNet, Purine, Mocha.jl, Minerva, MXNet*, Big Sur, TensorFlow, Watson, CNTK

Page 6

END-TO-END TESLA PRODUCT FAMILY

HYPERSCALE HPC: Tesla M4, M40. Hyperscale deployment for deep learning training, inference, and video & image processing.

MIXED-APPS HPC: Tesla K80. HPC data centers running a mix of CPU and GPU workloads.

STRONG-SCALING HPC: Tesla P100. Hyperscale & HPC data centers running apps that scale to multiple GPUs.

FULLY INTEGRATED DL SUPERCOMPUTER: DGX-1. For customers who need to get going now with a fully integrated solution.

Page 7

TESLA FOR SIMULATION

TESLA ACCELERATED COMPUTING: LIBRARIES, DIRECTIVES, LANGUAGES

ACCELERATED COMPUTING TOOLKIT

Page 8

OVERVIEW OF LIFE & MATERIAL SCIENCE ACCELERATED APPS

MD: all key codes are GPU-accelerated
- Great multi-GPU performance
- Focus on dense (up to 16) GPU nodes and/or a large number of GPU nodes
- ACEMD*, AMBER (PMEMD)*, BAND, CHARMM, DESMOND, ESPResSo, Folding@Home, GPUgrid.net, GROMACS, HALMD, HOOMD-Blue*, LAMMPS, Lattice Microbes*, mdcore, MELD, miniMD, NAMD, OpenMM, PolyFTS, SOP-GPU* & more

QC: all key codes are ported or being optimized
- Focus on GPU-accelerated math libraries and OpenACC directives
- GPU-accelerated and available today: ABINIT, ACES III, ADF, BigDFT, CP2K, GAMESS, GAMESS-UK, GPAW, LATTE, LSDalton, LSMS, MOLCAS, MOPAC2012, NWChem, OCTOPUS*, PEtot, QUICK, Q-Chem, QMCPack, Quantum Espresso/PWscf, TeraChem*
- Active GPU acceleration projects: CASTEP, GAMESS, Gaussian, ONETEP, Quantum Supercharger Library*, VASP & more

green* = application where >90% of the workload is on GPU

Page 9

MD VS. QC ON GPUS

"Classical" Molecular Dynamics vs. Quantum Chemistry (MO, PW, DFT, semi-empirical):

- Purpose: MD simulates positions of atoms over time (chemical-biological or chemical-material behaviors); QC calculates electronic properties (ground state, excited states, spectral properties, making/breaking bonds, physical properties)
- Forces: MD forces are calculated from simple empirical formulas (bond rearrangement generally forbidden); QC forces are derived from the electron wave function (bond rearrangement OK, e.g., bond energies)
- System size: MD handles up to millions of atoms; QC up to a few thousand atoms
- Solvent: MD includes solvent without difficulty; QC generally runs in a vacuum, but if needed solvent is treated classically (QM/MM) or with implicit methods
- Precision: MD is single-precision dominated; double precision is important for QC
- Libraries: MD uses cuBLAS, cuFFT, CUDA; QC uses cuBLAS, cuFFT, OpenACC
- Hardware: GeForce (academics) and Tesla (servers) for MD; Tesla recommended for QC
- ECC: off for MD; on for QC

Page 10

GPU-ACCELERATED QUANTUM CHEMISTRY APPS

ABINIT, ACES III, ADF, BigDFT, CP2K, GAMESS-US, Gaussian, GPAW, LATTE, LSDalton, MOLCAS, MOPAC2012, NWChem, Octopus, ONETEP, PEtot, Q-Chem, QMCPACK, Quantum Espresso, Quantum SuperCharger Library, TeraChem, VASP, WL-LSMS

Green lettering indicates performance slides included.
GPU performance compared against a dual multi-core x86 CPU socket.

Page 11

ABINIT

Page 12

FROM RESEARCH TO INDUSTRY

USING ABINIT ON GRAPHICS PROCESSING UNITS (GPU)

6th International ABINIT Developer Workshop
April 15-18, 2013 - Dinard, France

Marc Torrent, CEA, DAM, DIF, F-91297 Arpajon, France
Florent Dahm, C-S, 22 av. Galilée, 92350 Le Plessis-Robinson
With a contribution from Yann Pouillon for the build system

www.cea.fr

Page 13

ABINIT ON GPU

Page 14

OUTLINE

Introduction
- Introduction to GPU computing: architecture, programming model
- Density-functional theory with plane waves: how to port it to GPU

ABINIT on GPU
- Implementation
- Performance of ABINIT v7.0

How To…
- Compile
- Use

Page 15

LOCAL OPERATOR: FAST FOURIER TRANSFORMS

The local part of the Hamiltonian is applied through the effective potential in reciprocal space,

$$\langle \mathbf{g} \,|\, v_{\mathrm{eff}} \,|\, \mathbf{g}' \rangle = \int d\mathbf{r}\; e^{-i(\mathbf{g}-\mathbf{g}')\cdot\mathbf{r}}\, v_{\mathrm{eff}}(\mathbf{r}),$$

i.e., each wave function $\tilde\psi_i(\mathbf{g})$ is FFT-transformed to real space, multiplied by $v_{\mathrm{eff}}(\mathbf{r})$, and transformed back.

- The cuFFT library, included in the CUDA package, is interfaced with C and Fortran
- A system-dependent efficiency is expected
- Small FFTs underuse the GPU and are dominated by the time required to transfer the data to/from the GPU
- Within PAW, small plane-wave bases are used
- Our choice (for a first version): replace the MPI plane-wave level by an FFT computation on GPU (see the sketch below)
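As an illustration of this porting choice, the sketch below applies the local operator to one band entirely on the GPU with cuFFT: inverse FFT to real space, pointwise multiplication by $v_{\mathrm{eff}}(\mathbf{r})$, forward FFT back. This is a minimal generic sketch, not ABINIT's actual code; the grid size, names, and normalization convention are assumptions.

```cuda
// Minimal sketch of applying the local operator v_eff with cuFFT.
// Not ABINIT's implementation; names and sizes are illustrative.
#include <cuda_runtime.h>
#include <cufft.h>

// Pointwise multiply psi(r) by the real-space effective potential,
// folding in the 1/N normalization of the inverse+forward FFT pair
// (cuFFT transforms are unnormalized).
__global__ void apply_veff(cufftDoubleComplex *psi_r, const double *veff,
                           int n, double inv_n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        psi_r[i].x *= veff[i] * inv_n;
        psi_r[i].y *= veff[i] * inv_n;
    }
}

// psi_g: one band on the reciprocal-space FFT grid (device memory)
// veff : effective potential on the real-space grid (device memory)
void local_operator(cufftDoubleComplex *psi_g, const double *veff,
                    int nx, int ny, int nz) {
    int n = nx * ny * nz;
    cufftHandle plan;
    cufftPlan3d(&plan, nx, ny, nz, CUFFT_Z2Z);

    cufftExecZ2Z(plan, psi_g, psi_g, CUFFT_INVERSE);    // g -> r
    apply_veff<<<(n + 255) / 256, 256>>>(psi_g, veff, n, 1.0 / n);
    cufftExecZ2Z(plan, psi_g, psi_g, CUFFT_FORWARD);    // r -> g

    cufftDestroy(plan);
}
```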

Page 16

LOCAL OPERATOR: FAST FOURIER TRANSFORMS

fourwf (option 2): performance is highly dependent on the FFT size

[Chart: speedup vs. FFT size]

TEST CASE: 107-gold-atom cell on TGCC-Curie (Intel Westmere + NVIDIA M2090)

Implementation:
- CUDA/Fortran interface
- New routines that encapsulate cuFFT calls
- Development of 3 small CUDA kernels
- Optimization of memory copies to overlap computation and data transfer

Page 17

LOCAL OPERATOR: FAST FOURIER TRANSFORMS

Improvement: multiple-band optimization

TEST CASE: TGCC-Curie (Intel Westmere + NVIDIA M2090)

107 gold atoms:
bandpp          1      2      4      10     20     100
fourwf on GPU   36.02  20.79  14.14  9.16   9.50   7.94

31 copper atoms:
bandpp          1       2      4      8      18     72
fourwf on GPU   112.94  84.55  70.55  61.91  54.79  50.55

- The CUDA kernels are used more intensively and require fewer transfers
- See the bandpp input variable, used when the Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) algorithm is selected; a sketch of the batched-FFT idea follows below
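The bandpp grouping amounts to batching several bands through a single FFT launch, so kernels stay busy and transfers are amortized. Below is a hedged sketch of that idea using cuFFT's batched interface; the layout and names are illustrative assumptions, not ABINIT's data structures.

```cuda
// Sketch: batch `bandpp` bands into one cuFFT plan so that a single
// launch (and a single host<->device transfer of the block) covers
// many bands. Illustrative only; the real data layout will differ.
#include <cufft.h>

cufftHandle make_band_batched_plan(int nx, int ny, int nz, int bandpp) {
    cufftHandle plan;
    int n[3] = {nx, ny, nz};
    // NULL embed pointers => tightly packed, contiguous batches.
    cufftPlanMany(&plan, 3, n,
                  NULL, 1, nx * ny * nz,   // input layout
                  NULL, 1, nx * ny * nz,   // output layout
                  CUFFT_Z2Z, bandpp);      // one transform per band
    return plan;
}
// A single cufftExecZ2Z on this plan then transforms all bandpp bands.
```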

Page 18

NON-LOCAL OPERATOR

The PAW non-local operator is applied through the projectors $\tilde p_i$:

$$\langle \mathbf{g} \,|\, V_{NL} \,|\, \tilde\psi \rangle = \sum_{i,j} \langle \mathbf{g} | \tilde p_i \rangle \, D_{ij} \, \langle \tilde p_j | \tilde\psi \rangle$$

Implementation:
- Development of 2 CUDA kernels: one for the projections $\langle \tilde p_j | \tilde\psi \rangle$ and one for the sum $\sum_{i,j} \langle \mathbf{g} | \tilde p_i \rangle D_{ij} \langle \tilde p_j | \tilde\psi \rangle$
- Collaborative memory zones are used as much as possible (threads linked to the same plane wave lie in the same collaborative zone)
- Used for: the non-local operator, the overlap operator, forces and the stress tensor, the PAW occupation matrix
- In addition to all MPI parallelization levels!
- Needs a large number of plane waves and a large number of non-local projectors

Performance (NL operator only), TGCC-Curie (Intel Westmere + NVIDIA M2090):

                  nonlop CPU (s)   nonlop GPU (s)   Speedup GPU vs CPU
107 gold atoms    3142             1122             2.8
31 copper atoms   85               42               2.0
3 UO2 atoms       63               36               1.8
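Both contractions in the formula above are dense matrix products: first the projections $\langle \tilde p_j | \tilde\psi \rangle$, then the reconstruction with $D_{ij}$. A natural GPU mapping, shown as a hedged sketch below, is a chain of cuBLAS GEMMs; the slide's actual kernels are custom CUDA code using collaborative (shared) memory zones, so this illustrates the algebra, not the implementation.

```cuda
// Sketch: the non-local operator as three cuBLAS GEMMs (column-major).
// P    : npw x nproj matrix of projectors <g|p_i>      (device)
// D    : nproj x nproj matrix D_ij                     (device)
// psi  : npw x nband block of wave functions           (device)
// out  : npw x nband result V_NL |psi>                 (device)
// proj, tmp: nproj x nband workspaces                  (device)
#include <cublas_v2.h>

void nonlocal_apply(cublasHandle_t h,
                    const cuDoubleComplex *P, const cuDoubleComplex *D,
                    const cuDoubleComplex *psi, cuDoubleComplex *out,
                    cuDoubleComplex *proj, cuDoubleComplex *tmp,
                    int npw, int nproj, int nband) {
    const cuDoubleComplex one = {1.0, 0.0}, zero = {0.0, 0.0};

    // proj = P^H * psi   (nproj x nband): projections <p_j|psi>
    cublasZgemm(h, CUBLAS_OP_C, CUBLAS_OP_N, nproj, nband, npw,
                &one, P, npw, psi, npw, &zero, proj, nproj);
    // tmp = D * proj     (nproj x nband): D_ij <p_j|psi>
    cublasZgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, nproj, nband, nproj,
                &one, D, nproj, proj, nproj, &zero, tmp, nproj);
    // out = P * tmp      (npw x nband): sum_i |p_i> D_ij <p_j|psi>
    cublasZgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, npw, nband, nproj,
                &one, P, npw, tmp, nproj, &zero, out, npw);
}
```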

Page 19

MATRIX-MATRIX AND MATRIX-VECTOR PRODUCTS (ITERATIVE EIGENSOLVER)

- These products are not parallelized; the size of the vectors/matrices increases with the number of band/plane-wave CPU cores
- Our choice: use of cuBLAS

Algorithm 1: LOBPCG (Locally Optimal Block Preconditioned Conjugate Gradient)
Require: $X^{(0)} = \{x_1^{(0)}, \ldots, x_m^{(0)}\}$ close to the minimum and $K$ a preconditioner; $P^{(0)} = \{P_1^{(0)}, \ldots, P_m^{(0)}\}$ is initialized to 0.
1: for $i = 0, 1, \ldots$ do
2:   $\Lambda^{(i)} = \Lambda(X^{(i)})$
3:   $R^{(i)} = H X^{(i)} - \Lambda^{(i)} O X^{(i)}$
4:   $W^{(i)} = K R^{(i)}$
5:   the Rayleigh-Ritz method is applied within the subspace $\Xi = \{P_1^{(i)}, \ldots, P_m^{(i)}, x_1^{(i)}, \ldots, x_m^{(i)}, W_1^{(i)}, \ldots, W_m^{(i)}\}$
6:   $X^{(i+1)} = \alpha^{(i)} X^{(i)} + \tau^{(i)} W^{(i)} + \gamma^{(i)} P^{(i)}$
7:   $P^{(i+1)} = \tau^{(i)} W^{(i)} + \gamma^{(i)} P^{(i)}$
8: end for

- The size of the matrices also increases when band parallelization is used
- Diagonalization/orthogonalization within blocks needs a free LAPACK package on GPU
- Our choice: use of MAGMA

bandpp   Total time   Time in linear algebra
1        674          119
2        612          116
10       1077         428
50       5816         5167
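To make step 5 and the MAGMA choice concrete, the hedged sketch below forms the projected matrix for the Rayleigh-Ritz step with cuBLAS and hands the small dense Hermitian eigenproblem to MAGMA. It assumes MAGMA's LAPACK-style magma_zheevd interface and illustrative names; it is not ABINIT's routine.

```cuda
// Sketch of the LOBPCG Rayleigh-Ritz step with cuBLAS + MAGMA.
// S  : npw x 3m subspace basis [X W P]        (device)
// HS : npw x 3m, H applied to S               (device)
// Assumes MAGMA's LAPACK-style zheevd (header name may vary by version).
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <magma_v2.h>

void rayleigh_ritz(cublasHandle_t h, const cuDoubleComplex *S,
                   const cuDoubleComplex *HS, cuDoubleComplex *Hs_dev,
                   magmaDoubleComplex *Hs_host, double *eigvals,
                   int npw, int m3 /* = 3m */) {
    const cuDoubleComplex one = {1.0, 0.0}, zero = {0.0, 0.0};

    // Projected Hamiltonian: Hs = S^H * (H S)   (m3 x m3)
    cublasZgemm(h, CUBLAS_OP_C, CUBLAS_OP_N, m3, m3, npw,
                &one, S, npw, HS, npw, &zero, Hs_dev, m3);
    cudaMemcpy(Hs_host, Hs_dev, sizeof(magmaDoubleComplex) * m3 * m3,
               cudaMemcpyDeviceToHost);

    // Workspace query (lwork = -1), LAPACK style, on the hybrid solver.
    magma_int_t info, lwork = -1, lrwork = -1, liwork = -1;
    magmaDoubleComplex wkopt; double rwkopt; magma_int_t iwkopt;
    magma_zheevd(MagmaVec, MagmaLower, m3, Hs_host, m3, eigvals,
                 &wkopt, lwork, &rwkopt, lrwork, &iwkopt, liwork, &info);
    // ... allocate the workspaces reported above, call magma_zheevd
    // again, then combine S with the lowest eigenvectors for X^(i+1).
}
```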

Page 20

MAGMA: MATRIX ALGEBRA ON GPU AND MULTICORE ARCHITECTURES

- A project of the Innovative Computing Laboratory, University of Tennessee
- A "free of use" project
- First stable release 25-08-2011; current version 1.3
- The computation is cut up in order to offload the most consuming parts to the GPU
- Goal: address hybrid environments. The MAGMA project aims to develop a dense linear algebra library similar to LAPACK but for heterogeneous/hybrid architectures, starting with current multicore+GPU systems, so that applications can fully exploit the power that each of the hybrid components offers.

Page 21

ITERATIVE EIGENSOLVER

Performance, TGCC-Curie (Intel Westmere + NVIDIA M2090):

- Highly dependent on the physical system
- Depends only on the size of blocks used in the LOBPCG algorithm

[Chart: speedup of the cuBLAS/MAGMA code on a single MPI process (0 to 4x) vs. bandpp = 2, 4, 10 (a factor related to the block/band size), for 215 silicon atoms, 107 gold atoms, 31 copper atoms, and 3 UO2 atoms]

Page 22

OUTLINE

Introduction
- Introduction to GPU computing: architecture, programming model
- Density-functional theory with plane waves: how to port it to GPU

ABINIT on GPU
- Implementation
- Performance of ABINIT v7.0

How To…
- Compile
- Use

Page 23

PERFORMANCE ON A SINGLE GPU

Comparing architectures:
- TGCC-Titane: Intel Nehalem + NVIDIA S1070 (Tesla)
- TGCC-Curie: Intel Westmere + NVIDIA M2090 (Fermi)

- Strongly dependent on the architecture (Titane: fast CPU + old GPU)
- The BaTiO3 case is too small to benefit from the GPU

Time on proc 0 (s):
                    Titane                        Curie
                  CPU    CPU+GPU  Speedup       CPU    CPU+GPU  Speedup
107 gold atoms    4230   1857     2.3           5154   621      8.3
31 copper atoms   153    102      1.5           235    64       3.7
5 BaTiO3 atoms    85     98       0.9           86     73       1.2

Page 24

PERFORMANCE ON MULTIPLE GPUs

TGCC-Curie: Intel Westmere + NVIDIA M2090 (Fermi)
Configurations: 8 CPUs (MPI only) + 8 GPUs, and 200 CPUs (MPI only) + 200 GPUs

- Less efficient when the number of MPI processes increases (the load on each GPU decreases)

Time on proc 0 (s), Curie:
215 silicon atoms (45 iter.):   CPU 25856   CPU+GPU 6309   Speedup 4.1

Time on proc 0 (s), Curie:
215 silicon atoms (1 iter.):    CPU 16160   CPU+GPU 3680   Speedup 4.4
107 gold atoms:                 CPU 866     CPU+GPU 621    Speedup 4.7
31 copper atoms:                CPU 49      CPU+GPU 64     Speedup 1.5

Page 25

PERFORMANCE VS. OTHER CODES

Other DFT codes have been ported to GPU; the speedups of plane-wave codes are similar:
- Quantum Espresso (plane waves): x3/x4 (sequential), x2/x3 (parallel) [1,2]
- VASP (plane waves): x3/x4 [3,4]
- BigDFT (wavelets): x5/x7 [5]
- GPAW (real space): x10/x15 [1,6]

[1] Harju et al., Applied Parallel and Scientific Computing, Lecture Notes in Computer Science 7782, p. 3 (2013)
[2] Spiga et al., 20th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), p. 368 (2012)
[3] Maintz et al., Computer Physics Communications 182, p. 1421 (2011)
[4] Hacene et al., Journal of Computational Chemistry 33, p. 2581 (2012)
[5] Genovese et al., Journal of Chemical Physics 131, 034103 (2009)
[6] Hakala et al., PARA 2012, LNCS vol. 7782, p. 63, Springer, Heidelberg (2013)

Page 26

BigDFT

Page 27

Gaussian

Page 28

April 4-7, 2016 | Silicon Valley

Roberto Gomperts (NVIDIA), Michael Frisch (Gaussian, Inc.), Giovanni Scalmani (Gaussian, Inc.), Brent Leback (NVIDIA/PGI)

ENABLING THE ELECTRONIC STRUCTURE PROGRAM GAUSSIAN ON GPGPUS USING OPENACC

Page 30

CURRENT STATUS: SINGLE NODE

Implemented:
- Energies for closed- and open-shell HF and DFT (fewer than a handful of XC functionals missing)
- First derivatives for the same as above
- Second derivatives for the same as above

Using only:
- OpenACC
- CUDA library calls (BLAS)
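For readers unfamiliar with the directive-based approach: OpenACC annotates existing loops and lets the compiler generate and launch the GPU kernels, which is what makes it attractive for a large legacy code base. The sketch below is generic (Gaussian's source is proprietary); the loop and names are invented for illustration.

```cuda
// Generic OpenACC sketch of offloading a loop; not Gaussian code.
// Compile with an OpenACC-capable compiler (e.g., PGI/NVHPC).
void scaled_accumulate(int n, const double *f, const double *w,
                       double *acc) {
    // copyin moves the inputs to the GPU; copy returns acc afterwards.
    #pragma acc data copyin(f[0:n], w[0:n]) copy(acc[0:n])
    {
        // The compiler builds and launches the GPU kernel for this loop.
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            acc[i] += w[i] * f[i];
    }
}
```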

Page 31

GAUSSIAN PARALLELISM MODEL

- CPU cluster
- CPU node: OpenMP
- GPU: OpenACC

Page 32

CLOSING REMARKS

- Significant progress has been made in enabling Gaussian on GPUs with OpenACC
- OpenACC is becoming increasingly versatile
- Significant work lies ahead to improve performance
- Expand the feature set: PBC, solvation, MP2, ONIOM, triples corrections

Page 33

EARLY PERFORMANCE RESULTS (CCSD(T))

[Chart: Ibuprofen CCSD(T) calculation speedups relative to a CPU-only full node, for Total, l913 "triples", and iterative CCSD portions; series: 20 threads + 0 GPUs vs. 6 threads + 6 GPUs; bar labels give times in hours: 24.8, 24.8, 22.7, 2.1, 3.7, 3.6, 1.3, 2.3]

System: 2 sockets E5-2690 v2 (2x 10 cores @ 3.0 GHz); 128 GB RAM (DDR3-1600); used 108 GB
GPUs: 6 Tesla K40m (15 SMs @ 875 MHz); 12 GB global memory

Method: CCSD(T)
No. of atoms: 33
Basis set: 6-31G(d,p)
No. of basis functions: 315
No. of occupied orbitals: 41
No. of virtual orbitals: 259
No. of cycles: 15
No. of CCSD iterations: 16

Page 34

VASP

Page 35

GPU VASP COLLABORATION

2013-2014 project scope: minimization algorithms to calculate the electronic ground state
- Blocked Davidson (ALGO = Normal & Fast)
- RMM-DIIS (ALGO = VeryFast & Fast)
- k-points
- Optimization of the critical step in exact-exchange calculations

Collaborators: U of Chicago

Earlier work:
- Speeding up plane-wave electronic-structure calculations using graphics-processing units, Maintz, Eck, Dronskowski
- VASP on a GPU: application to exact-exchange calculations of the stability of elemental boron, Hutchinson, Widom
- Accelerating VASP Electronic Structure Calculations Using Graphic Processing Units, Hacene, Anciaux-Sedrakian, Rozanska, Klahr, Guignon, Fleurat-Lessard

Page 36

TARGET WORKLOADS

Silica ("medium"): 7 Å thick slab of amorphous silica, 240 atoms (Si68O148H24); RMM-DIIS (ALGO = VeryFast)

NiAl-MD ("large"): liquid-metal molecular dynamics sample of a nickel-based superalloy; 500 atoms, 9 chemical species total; blocked Davidson (ALGO = Normal)

INTERFACE ("large"): interface of platinum metal with water; 108 Pt atoms and 120 water molecules (468 atoms); blocked Davidson & RMM-DIIS (ALGO = Fast); a sketch of the corresponding input settings follows below
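For orientation, the ALGO keyword quoted above is set in VASP's INCAR input file. A minimal hedged sketch for a Silica-type RMM-DIIS run follows; only ALGO comes from this slide, and the other tags and values are illustrative, not the benchmark's actual inputs.

```
# Minimal INCAR sketch for an RMM-DIIS run of the Silica type.
# Only ALGO is taken from the slide; the other tags are illustrative.
SYSTEM = amorphous silica slab
ALGO   = VeryFast        ! RMM-DIIS minimizer, as on this slide
NSIM   = 4               ! bands optimized simultaneously (illustrative)
ENCUT  = 400             ! plane-wave cutoff in eV (illustrative)
```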

Page 37

RUNTIME DISTRIBUTION FOR SILICA

[Chart: time in seconds (0-3500) for 1 K40 GPU + 1 IvyBridge core, comparing CPU, the original GPU port, and the optimized GPU port; time broken down into Memcopy, Gemm, FFT, and Other]

Page 38

March 2016
VASP 5.4.1 w/ Patch #1 (Haswell & K80)

Page 39

VASP Interface Benchmark

Blocked Davidson + RMM-DIIS (ALGO = Fast). Elapsed time in seconds (lower is better):

1 Node (CPU only):             828.27
1 Node + 1x K80:               580.63   (1.4X)
1 Node + 2x K80 [Old Data]:    506.25   (1.6X)
1 Node + 2x K80:               387.26   (2.1X)
1 Node + 4x K80:               284.06   (3.0X)

Running VASP version 5.4.1. The CPU-only node contains dual Intel Xeon E5-2698 v3 @ 2.3 GHz (Haswell) CPUs; the GPU nodes add Tesla K80 (autoboost) GPUs. "[Old Data]" = pre-bugfix; patch #1 for vasp.5.4.1.05Feb16 yields up to 2.7x faster calculations (patch #1 available at VASP.at).

Page 40

VASP Silica IFPEN Benchmark

RMM-DIIS (ALGO = VeryFast). Elapsed time in seconds (lower is better):

1 Node (CPU only):             542.87
1 Node + 1x K80:               432.48   (1.3X)
1 Node + 2x K80 [Old Data]:    372.74   (1.5X)
1 Node + 2x K80:               258.14   (2.1X)
1 Node + 4x K80:               210.47   (2.6X)

Running VASP version 5.4.1. The CPU-only node contains dual Intel Xeon E5-2698 v3 @ 2.3 GHz (Haswell) CPUs; the GPU nodes add Tesla K80 (autoboost) GPUs. "[Old Data]" = pre-bugfix; patch #1 for vasp.5.4.1.05Feb16 yields up to 2.7x faster calculations (patch #1 available at VASP.at).

Page 41

VASP Si-Huge Benchmark

Blocked Davidson + RMM-DIIS (ALGO = Fast). Elapsed time in seconds (lower is better):

1 Node (CPU only):             6464.42
1 Node + 1x K80:               4296.15   (1.5X)
1 Node + 2x K80 [Old Data]:    2871.09   (2.3X)
1 Node + 2x K80:               2865.92   (2.3X)
1 Node + 4x K80:               1987.94   (3.3X)

Running VASP version 5.4.1. The CPU-only node contains dual Intel Xeon E5-2698 v3 @ 2.3 GHz (Haswell) CPUs; the GPU nodes add Tesla K80 (autoboost) GPUs. "[Old Data]" = pre-bugfix; patch #1 for vasp.5.4.1.05Feb16 yields up to 2.7x faster calculations (patch #1 available at VASP.at).

Page 42

VASP SupportedSystems Benchmark

Blocked Davidson + RMM-DIIS (ALGO = Fast). Elapsed time in seconds (lower is better):

1 Node (CPU only):             310.63
1 Node + 1x K80:               252.43   (1.2X)
1 Node + 2x K80 [Old Data]:    228.90   (1.4X)
1 Node + 2x K80:               164.21   (1.9X)
1 Node + 4x K80:               129.99   (2.4X)

Running VASP version 5.4.1. The CPU-only node contains dual Intel Xeon E5-2698 v3 @ 2.3 GHz (Haswell) CPUs; the GPU nodes add Tesla K80 (autoboost) GPUs. "[Old Data]" = pre-bugfix; patch #1 for vasp.5.4.1.05Feb16 yields up to 2.7x faster calculations (patch #1 available at VASP.at).

Page 43

VASP NiAl-MD Benchmark

Blocked Davidson + RMM-DIIS (ALGO = Fast). Elapsed time in seconds (lower is better):

1 Node (CPU only):             722.82
1 Node + 1x K80:               237.02   (3.0X)
1 Node + 2x K80 [Old Data]:    197.19   (3.7X)
1 Node + 2x K80:               184.93   (3.9X)
1 Node + 4x K80:               142.98   (5.1X)

Running VASP version 5.4.1. The CPU-only node contains dual Intel Xeon E5-2698 v3 @ 2.3 GHz (Haswell) CPUs; the GPU nodes add Tesla K80 (autoboost) GPUs. "[Old Data]" = pre-bugfix; patch #1 for vasp.5.4.1.05Feb16 yields up to 2.7x faster calculations (patch #1 available at VASP.at).

Page 44

VASP B.hR105 Benchmark

Hybrid functional with blocked Davidson (ALGO = Normal). Elapsed time in seconds (lower is better):

1 Node (CPU only):             1196.86
1 Node + 1x K80:               604.05   (2.0X)
1 Node + 2x K80 [Old Data]:    334.44   (3.6X)
1 Node + 2x K80:               343.69   (3.5X)
1 Node + 4x K80:               236.13   (5.1X)

Running VASP version 5.4.1. The CPU-only node contains dual Intel Xeon E5-2698 v3 @ 2.3 GHz (Haswell) CPUs; the GPU nodes add Tesla K80 (autoboost) GPUs. "[Old Data]" = pre-bugfix; patch #1 for vasp.5.4.1.05Feb16 yields up to 2.7x faster calculations (patch #1 available at VASP.at).

Page 45

VASP B.aP107 Benchmark

Hybrid functional calculation (exact exchange) with blocked Davidson (ALGO = Normal); no k-point parallelization. Elapsed time in seconds (lower is better):

1 Node (CPU only):             27993.62
1 Node + 1x K80:               7494.36   (3.7X)
1 Node + 2x K80 [Old Data]:    7632.31   (3.7X)
1 Node + 2x K80:               4432.31   (6.3X)
1 Node + 4x K80:               4513.40   (6.2X)

Running VASP version 5.4.1. The CPU-only node contains dual Intel Xeon E5-2698 v3 @ 2.3 GHz (Haswell) CPUs; the GPU nodes add Tesla K80 (autoboost) GPUs. "[Old Data]" = pre-bugfix; patch #1 for vasp.5.4.1.05Feb16 yields up to 2.7x faster calculations (patch #1 available at VASP.at).

Page 46

GPU-ACCELERATED MOLECULAR DYNAMICS APPS

ACEMD, AMBER, CHARMM, DESMOND, ESPResSo, Folding@Home, GPUGrid.net, GROMACS, HALMD, HOOMD-Blue, LAMMPS, mdcore, MELD, NAMD, OpenMM, PolyFTS

Green lettering indicates performance slides included.
GPU performance compared against a dual multi-core x86 CPU socket.

Page 47

NAMD 2.11

Page 48

NEW GPU FEATURES IN NAMD 2.11

- GPU-accelerated simulations up to twice as fast as NAMD 2.10
- Pressure calculation with fixed atoms on the GPU works as on the CPU
- Improved scaling for the GPU-accelerated particle-mesh Ewald calculation
- CPU-side operations overlap better and are parallelized across cores
- Improved scaling for GPU-accelerated simulations: nonbonded force calculation results are streamed from the GPU for better overlap
- NVIDIA CUDA GPU-acceleration binaries for Mac OS X

Selected text from the NAMD website.

Page 49

NAMD 2.11 IS UP TO 2X FASTER

[Chart: simulated time (ns/day, 0-25) for APoA1 (92,224 atoms) on 1, 2, and 4 nodes; NAMD 2.11 vs. NAMD 2.10 speedups: 1.2X (1 node), 1.6X (2 nodes), 2.0X (4 nodes)]

Both the NAMD 2.10 and NAMD 2.11 runs used nodes with dual Intel E5-2697 v2 @ 2.7 GHz (IvyBridge) CPUs + 2x Tesla K80 (autoboost) GPUs.

Page 50

NAMD 2.11 APOA1 ON 1 AND 2 NODES

Simulated time (ns/day), APoA1 (92,224 atoms):

1 Node:              2.77
1 Node + 1x K80:    11.67   (4.2X)
1 Node + 2x K80:    16.99   (6.1X)
2 Nodes:             5.22
2 Nodes + 1x K80:   19.73   (3.8X)
2 Nodes + 2x K80:   24.31   (4.7X)

Running NAMD version 2.11. The CPU-only nodes contain dual Intel E5-2698 v3 @ 2.3 GHz (Haswell) CPUs; the GPU nodes add Tesla K80 (autoboost) GPUs.

Page 51

NAMD 2.11 APOA1 ON 4 AND 8 NODES

Simulated time (ns/day), APoA1 (92,224 atoms):

4 Nodes:            10.27
4 Nodes + 1x K80:   20.64   (2.0X)
4 Nodes + 2x K80:   23.52   (2.3X)
8 Nodes:            16.85
8 Nodes + 1x K80:   27.83   (1.7X)
8 Nodes + 2x K80:   27.74   (1.6X)

Running NAMD version 2.11. The CPU-only nodes contain dual Intel E5-2698 v3 @ 2.3 GHz (Haswell) CPUs; the GPU nodes add Tesla K80 (autoboost) GPUs.

Page 52

NAMD 2.11 STMV ON 1 AND 2 NODES

Simulated time (ns/day), STMV (1,066,628 atoms):

1 Node:             0.23
1 Node + 1x K80:    1.03   (4.5X)
1 Node + 2x K80:    1.75   (7.6X)
2 Nodes:            0.46
2 Nodes + 1x K80:   1.98   (4.3X)
2 Nodes + 2x K80:   3.27   (7.1X)

Running NAMD version 2.11. The CPU-only nodes contain dual Intel E5-2698 v3 @ 2.3 GHz (Haswell) CPUs; the GPU nodes add Tesla K80 (autoboost) GPUs.

Page 53

NAMD 2.11 STMV ON 4 AND 8 NODES

Simulated time (ns/day), STMV (1,066,628 atoms):

4 Nodes:            0.90
4 Nodes + 1x K80:   3.61   (4.0X)
4 Nodes + 2x K80:   4.54   (5.0X)
8 Nodes:            1.74
8 Nodes + 1x K80:   5.86   (3.4X)
8 Nodes + 2x K80:   6.24   (3.6X)

Running NAMD version 2.11. The CPU-only nodes contain dual Intel E5-2698 v3 @ 2.3 GHz (Haswell) CPUs; the GPU nodes add Tesla K80 (autoboost) GPUs.

Page 54

AMBER 14

Page 55

JAC on K40s, K80s and M6000s

Simulated time (ns/day), PME-JAC_NVE:

1 Haswell node:            37.38
1 CPU node + 1x K40:      134.82   (3.6X)
1 CPU node + 0.5x K80:    121.30   (3.2X)
1 CPU node + 1x K80:      174.34   (4.7X)
1 CPU node + 1x M6000:    161.53   (4.3X)
1 CPU node + 2x K40:      200.34   (5.4X)
1 CPU node + 2x K80:      225.34   (6.0X)
1 CPU node + 2x M6000:    219.83   (5.9X)

Running AMBER version 14. The CPU node contains dual Intel E5-2698 v3 @ 2.3 GHz (3.6 GHz turbo) CPUs; the GPU nodes add NVIDIA Tesla K40 @ 875 MHz, Tesla K80 @ 562 MHz (autoboost), or Quadro M6000 @ 987 MHz GPUs.

Page 56

Cellulose on K40s, K80s and M6000s

Simulated time (ns/day), PME-Cellulose_NVE:

1 Haswell node:            1.93
1 CPU node + 1x K40:       8.96   (4.6X)
1 CPU node + 0.5x K80:     7.87   (4.1X)
1 CPU node + 1x K80:      11.76   (6.1X)
1 CPU node + 1x M6000:    10.49   (5.4X)
1 CPU node + 2x K40:      13.67   (7.1X)
1 CPU node + 2x K80:      15.38   (8.0X)
1 CPU node + 2x M6000:    14.90   (7.7X)

Running AMBER version 14. The CPU node contains dual Intel E5-2698 v3 @ 2.3 GHz (3.6 GHz turbo) CPUs; the GPU nodes add NVIDIA Tesla K40 @ 875 MHz, Tesla K80 @ 562 MHz (autoboost), or Quadro M6000 @ 987 MHz GPUs.

Page 57

Factor IX on K40s, K80s and M6000s

Simulated time (ns/day), PME-FactorIX_NVE:

1 Haswell node:            9.68
1 CPU node + 1x K40:      40.48   (4.2X)
1 CPU node + 0.5x K80:    33.59   (3.5X)
1 CPU node + 1x K80:      50.70   (5.2X)
1 CPU node + 1x M6000:    47.80   (5.0X)
1 CPU node + 2x K40:      61.18   (6.4X)
1 CPU node + 2x K80:      60.93   (6.3X)
1 CPU node + 2x M6000:    66.89   (7.0X)

Running AMBER version 14. The CPU node contains dual Intel E5-2698 v3 @ 2.3 GHz (3.6 GHz turbo) CPUs; the GPU nodes add NVIDIA Tesla K40 @ 875 MHz, Tesla K80 @ 562 MHz (autoboost), or Quadro M6000 @ 987 MHz GPUs.

Page 58

JAC on M40s

Simulated time (ns/day), PME-JAC_NVE:

1 Node:                     21.11
1 Node + 1x M40 per node:  157.68   (7.5X)
1 Node + 2x M40 per node:  230.18   (10.9X)
1 Node + 4x M40 per node:  246.15   (11.7X)

Running AMBER version 14. The CPU-only node contains a single Intel Xeon E5-2698 v3 @ 2.3 GHz (Haswell) CPU; the GPU nodes contain a single Intel Xeon E5-2697 v2 @ 2.7 GHz (IvyBridge) CPU + Tesla M40 (autoboost) GPUs.

Page 59

Cellulose on M40s

Simulated time (ns/day), PME-Cellulose_NPT:

1 Node:                     1.07
1 Node + 1x M40 per node:  10.12   (9.5X)
1 Node + 2x M40 per node:  14.40   (13.5X)
1 Node + 4x M40 per node:  15.90   (14.9X)

Running AMBER version 14. The CPU-only node contains a single Intel Xeon E5-2698 v3 @ 2.3 GHz (Haswell) CPU; the GPU nodes contain a single Intel Xeon E5-2697 v2 @ 2.7 GHz (IvyBridge) CPU + Tesla M40 (autoboost) GPUs.

Page 60

Myoglobin on M40s

Simulated time (ns/day), GB-Myoglobin:

1 Node:                      9.83
1 Node + 1x M40 per node:  232.20   (23.6X)
1 Node + 2x M40 per node:  300.86   (30.6X)
1 Node + 4x M40 per node:  322.09   (32.8X)

Running AMBER version 14. The CPU-only node contains a single Intel Xeon E5-2698 v3 @ 2.3 GHz (Haswell) CPU; the GPU nodes contain a single Intel Xeon E5-2697 v2 @ 2.7 GHz (IvyBridge) CPU + Tesla M40 (autoboost) GPUs.

Page 61

Nucleosome on M40s

Simulated time (ns/day), GB-Nucleosome:

1 Node:                     0.13
1 Node + 1x M40 per node:   4.67   (35.9X)
1 Node + 2x M40 per node:   9.05   (69.6X)
1 Node + 4x M40 per node:  16.11   (123.9X)

Running AMBER version 14. The CPU-only node contains a single Intel Xeon E5-2698 v3 @ 2.3 GHz (Haswell) CPU; the GPU nodes contain a single Intel Xeon E5-2697 v2 @ 2.7 GHz (IvyBridge) CPU + Tesla M40 (autoboost) GPUs.

Page 62

October 2015

GROMACS 5.1

Page 63

3M Waters on K40s and K80s

Simulated time (ns/day), Water [PME] 3M:

1 Haswell node:         0.81
1 CPU node + 1x K40:    1.32   (1.6X)
1 CPU node + 1x K80:    1.88   (2.3X)
1 CPU node + 2x K40:    1.85   (2.3X)
1 CPU node + 2x K80:    2.72   (3.4X)
1 CPU node + 4x K40:    2.76   (3.4X)
1 CPU node + 4x K80:    3.23   (4.0X)

Running GROMACS version 5.1. The CPU node contains dual Intel E5-2698 v3 @ 2.3 GHz CPUs; the GPU nodes add NVIDIA Tesla K40 @ 875 MHz or Tesla K80 @ 562 MHz (autoboost) GPUs.

Page 64

3M Waters on Titan X

Simulated time (ns/day), Water [PME] 3M:

1 Haswell node:           0.81
1 CPU node + 1x TitanX:   1.53   (1.9X)
1 CPU node + 2x TitanX:   2.36   (2.9X)
1 CPU node + 4x TitanX:   2.99   (3.7X)

Running GROMACS version 5.1. The CPU node contains dual Intel E5-2698 v3 @ 2.3 GHz CPUs; the GPU nodes add GeForce GTX Titan X @ 1000 MHz GPUs.

Page 65

TESLA P100 ACCELERATORS

Tesla P100 for NVLink-enabled servers:
- 5.3 TF DP, 10.6 TF SP, 21 TF HP
- 720 GB/s memory bandwidth, 16 GB

Tesla P100 for PCIe-based servers:
- 4.7 TF DP, 9.3 TF SP, 18.7 TF HP
- Config 1: 16 GB, 720 GB/s
- Config 2: 12 GB, 540 GB/s

Unified Memory (CPU + Tesla P100) with the Page Migration Engine: simpler parallel programming, virtually unlimited data size, performance with data locality.
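A minimal CUDA sketch of what Unified Memory buys in practice: one allocation usable from both CPU and GPU, with the Page Migration Engine moving pages on demand. Generic illustration, not tied to any particular application above.

```cuda
// Minimal Unified Memory sketch: one pointer valid on CPU and GPU.
// With Pascal's Page Migration Engine, managed allocations can even
// exceed GPU memory and are paged in on demand.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(double *x, int n, double a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    double *x = nullptr;
    cudaMallocManaged(&x, n * sizeof(double));  // visible to CPU and GPU

    for (int i = 0; i < n; ++i) x[i] = 1.0;     // touched on the CPU
    scale<<<(n + 255) / 256, 256>>>(x, n, 2.0); // touched on the GPU
    cudaDeviceSynchronize();                    // pages migrate as needed

    printf("x[0] = %f\n", x[0]);                // back on the CPU
    cudaFree(x);
    return 0;
}
```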

Page 66

EXTRAORDINARY STRONG SCALING: ONE STRONG NODE FASTER THAN LOTS OF WEAK NODES

VASP PERFORMANCE
[Chart: speedup vs. 1 CPU server node (0x-12x) as a function of the number of CPU server nodes (0-32), with reference lines for a single PCIe node carrying 2x, 4x, or 8x P100, compared against up to 32 CPU nodes]

AMBER SIMULATION OF CRISPR, NATURE'S TOOL FOR GENOME EDITING
[Chart: ns/day (0-50) vs. the number of processors (CPUs and GPUs, 0-100); 1 node with 4 P100 GPUs is compared against 48 CPU nodes of the Comet supercomputer]

CPU: dual-socket Intel E5-2680 v3, 12 cores, 128 GB DDR4 per node, FDR IB. VASP 5.4.1_05Feb16, Si-Huge dataset; 16 and 32 nodes are estimated based on the same scaling from 4 to 8 nodes. AMBER 16 pre-release, CRISPR based on PDB ID 5f9r, 336,898 atoms.

Page 67

MASSIVE LEAP IN PERFORMANCE

[Chart: speedup vs. a dual-socket Broadwell node (0x-30x) for NAMD, VASP, HOOMD-Blue, and AMBER, with 2x K80, 2x P100 (PCIe), and 4x P100 (PCIe)]

CPU: Xeon E5-2697 v4, 2.3 GHz, 3.6 GHz turbo. Caffe AlexNet: batch size of 256 images; VASP, NAMD, HOOMD-Blue, and AMBER: average speedup across a basket of tests.

Page 68

TCO OVERVIEW

EQUAL PERFORMANCE IN FEWER RACKS
[Chart: racks and power required for equal performance on Caffe AlexNet and VASP; CPU-only deployments of 960 CPUs / 260 kW (about 7 racks) and 1280 CPUs / 350 kW (about 9 racks), vs. 20 CPUs + 80 P100s / 32 kW in 1 rack; rack-count axis from 2 (80 kW) to 10 (400 kW)]

BUDGET: SMALLER, MORE EFFICIENT
[Chart: budget split between compute servers and non-compute (rack, cabling, infrastructure, networking): 85% compute / 15% non-compute in one configuration vs. 39% compute / 61% non-compute in the other]

Sources: Microsoft Research on datacenter costs

Page 69

NVIDIA DGX-1: WORLD'S FIRST DEEP LEARNING SUPERCOMPUTER

- 42.5 TFLOPS FP64 / 85 TFLOPS FP32 / 170 TFLOPS FP16
- 8x Tesla P100 16 GB
- NVLink hybrid cube mesh
- Accelerates major AI frameworks
- Dual Xeon
- 7 TB SSD deep learning cache
- Dual 10GbE, quad IB 100 Gb
- 3RU, 3200 W

Page 70

THANK YOU! Q&A: WHAT CAN I HELP YOU BUILD?