Accelerating Materials-Simulation Computations on GPUs
François Courteille, Senior Solution Architect, Accelerated Computing, NVIDIA
AGENDA
Tesla, the NVIDIA compute platform
Quantum Chemistry Applications Overview
Molecular Dynamics Applications Overview
Materials Science Applications at the Start of the Pascal Era
THE WORLD LEADER IN VISUAL COMPUTING
Gaming | Enterprise | Technology | HPC & Cloud | Auto
OUR TEN YEARS IN HPC (2006-2016)
- CUDA launched
- World's first GPU Top500 system
- Fermi: world's first HPC GPU
- Discovered how H1N1 mutates to resist drugs
- Oak Ridge deploys the world's fastest supercomputer with GPUs
- AlexNet beats expert code by a huge margin using GPUs
- Stanford builds AI machine using GPUs
- World's first atomic model of the HIV capsid
- World's first 3-D mapping of the human genome
- Google outperforms humans on ImageNet
- GPU-trained AI machine beats the world champion in Go
CREDENTIALS BUILT OVER TIME
- 300K CUDA developers, 4x growth in 4 years
- The majority of HPC applications are GPU-accelerated: 410 and growing
- 100% of deep learning frameworks are accelerated

GPU-accelerated applications by year: 2011: 113 | 2012: 206 | 2013: 242 | 2014: 287 | 2015: 370 | 2016: 410
Segments: academia, games, finance, manufacturing, internet, oil & gas, national labs, automotive, defense, media & entertainment
Accelerated deep learning frameworks: Torch, Theano, Caffe, MatConvNet, Purine, Mocha.jl, Minerva, MXNet*, Big Sur, TensorFlow, Watson, CNTK

www.nvidia.com/appscatalog
END-TO-END TESLA PRODUCT FAMILY
- HYPERSCALE HPC (Tesla M4, M40): hyperscale deployments for DL training, inference, and video & image processing
- MIXED-APPS HPC (Tesla K80): HPC data centers running a mix of CPU and GPU workloads
- STRONG-SCALING HPC (Tesla P100): hyperscale & HPC data centers running apps that scale to multiple GPUs
- FULLY INTEGRATED DL SUPERCOMPUTER (DGX-1): for customers who need to get going now with a fully integrated solution
TESLA FOR SIMULATION
The Tesla Accelerated Computing platform: libraries, languages, and directives, together with the Accelerated Computing Toolkit.
OVERVIEW OF LIFE & MATERIALS SCIENCE ACCELERATED APPS

MD: all key codes are GPU-accelerated
- Great multi-GPU performance
- Focus on dense (up to 16) GPU nodes and/or large numbers of GPU nodes
- ACEMD*, AMBER (PMEMD)*, BAND, CHARMM, DESMOND, ESPResSo, Folding@Home, GPUgrid.net, GROMACS, HALMD, HOOMD-Blue*, LAMMPS, Lattice Microbes*, mdcore, MELD, miniMD, NAMD, OpenMM, PolyFTS, SOP-GPU* and more

QC: all key codes are ported or being optimized
- Focus on GPU-accelerated math libraries and OpenACC directives
- GPU-accelerated and available today: ABINIT, ACES III, ADF, BigDFT, CP2K, GAMESS, GAMESS-UK, GPAW, LATTE, LSDalton, LSMS, MOLCAS, MOPAC2012, NWChem, OCTOPUS*, PEtot, QUICK, Q-Chem, QMCPack, Quantum Espresso/PWscf, TeraChem*
- Active GPU-acceleration projects: CASTEP, GAMESS, Gaussian, ONETEP, Quantum SuperCharger Library*, VASP and more

* = application where >90% of the workload runs on the GPU
MD VS. QC ON GPUS

"Classical" Molecular Dynamics | Quantum Chemistry (MO, PW, DFT, semi-empirical)
- Simulates positions of atoms over time; chemical-biological or chemical-material behaviors | Calculates electronic properties: ground state, excited states, spectral properties, making/breaking bonds, physical properties
- Forces calculated from simple empirical formulas (bond rearrangement generally forbidden) | Forces derived from the electron wave function (bond rearrangement OK, e.g., bond energies)
- Up to millions of atoms | Up to a few thousand atoms
- Solvent included without difficulty | Generally in vacuum; if needed, solvent treated classically (QM/MM) or with implicit methods
- Single precision dominated | Double precision is important
- Uses cuBLAS, cuFFT, CUDA | Uses cuBLAS, cuFFT, OpenACC
- GeForce (academia), Tesla (servers) | Tesla recommended
- ECC off | ECC on
GPU-ACCELERATED QUANTUM CHEMISTRY APPS

ABINIT, ACES III, ADF, BigDFT, CP2K, GAMESS-US, Gaussian, GPAW, LATTE, LSDalton, MOLCAS, MOPAC2012, NWChem, Octopus, ONETEP, PEtot, Q-Chem, QMCPACK, Quantum Espresso, Quantum SuperCharger Library, TeraChem, VASP, WL-LSMS

Green lettering (in the original slide) indicates that performance slides are included. GPU performance is compared against a dual multi-core x86 CPU socket.
ABINIT: FROM RESEARCH TO INDUSTRY

USING ABINIT ON GRAPHICS PROCESSING UNITS (GPU)
6th International ABINIT Developer Workshop, April 15-18, 2013, Dinard, France
Marc Torrent, CEA, DAM, DIF, F-91297 Arpajon, France
Florent Dahm, C-S, 22 av. Galilée, 92350 Le Plessis-Robinson
With a contribution from Yann Pouillon for the build system
www.cea.fr
ABINIT ON GPU: OUTLINE
- Introduction
- Introduction to GPU computing: architecture, programming model
- Density functional theory with plane waves: how to port it to the GPU
- ABINIT on GPU: implementation; performance of ABINIT v7.0
- How to compile and use it
LOCAL OPERATOR: FAST FOURIER TRANSFORMS

Applying the local potential in a plane-wave basis, $\sum_{g'} \langle g | v_{\mathrm{eff}} | g' \rangle \, c_{g'}$ with $\langle g | v_{\mathrm{eff}} | g' \rangle = \int \mathrm{d}r \, e^{-i(g-g')\cdot r} \, v_{\mathrm{eff}}(r)$, is a convolution: the wavefunction is FFT'd to real space, multiplied by $v_{\mathrm{eff}}(r)$, and FFT'd back.

- The cuFFT library is included in the CUDA package
- Interfaced with C and Fortran
- A system-dependent efficiency is expected: small FFTs underuse the GPU and are dominated by the time required to transfer data to/from the GPU
- Within PAW, small plane-wave bases are used

Our choice (for a first version): replace the MPI plane-wave level by an FFT computation on the GPU (a minimal cuFFT sketch follows below).
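To make that porting step concrete, here is a minimal, self-contained sketch of the pattern described above (illustrative grid size and names, not ABINIT's actual routines): transform one band to real space with cuFFT, apply v_eff(r), and transform back.

```cpp
// Sketch (not ABINIT code): apply a local potential v_eff(r) via cuFFT.
#include <cuda_runtime.h>
#include <cufft.h>

// Pointwise multiply psi(r) by v_eff(r) in real space.
__global__ void apply_veff(cufftDoubleComplex* psi, const double* veff,
                           int n, double scale)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        psi[i].x *= veff[i] * scale;  // scale folds in the 1/N FFT normalization
        psi[i].y *= veff[i] * scale;
    }
}

int main()
{
    const int nx = 64, ny = 64, nz = 64, n = nx * ny * nz;  // illustrative grid

    cufftDoubleComplex* d_psi;
    double* d_veff;
    cudaMalloc((void**)&d_psi, n * sizeof(cufftDoubleComplex));
    cudaMalloc((void**)&d_veff, n * sizeof(double));
    // ... fill d_psi (plane-wave coefficients) and d_veff from the host ...

    cufftHandle plan;
    cufftPlan3d(&plan, nx, ny, nz, CUFFT_Z2Z);

    // g-space -> r-space, multiply by v_eff(r), r-space -> g-space.
    cufftExecZ2Z(plan, d_psi, d_psi, CUFFT_INVERSE);
    apply_veff<<<(n + 255) / 256, 256>>>(d_psi, d_veff, n, 1.0 / n);
    cufftExecZ2Z(plan, d_psi, d_psi, CUFFT_FORWARD);

    cudaDeviceSynchronize();
    cufftDestroy(plan);
    cudaFree(d_psi);
    cudaFree(d_veff);
    return 0;
}
```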
LOCAL OPERATOR: FAST FOURIER TRANSFORMS (fourwf, option 2)

Performance is highly dependent on the FFT size.

[Chart: speedup vs. FFT size. Test case: 107-gold-atom cell on TGCC-Curie (Intel Westmere + NVIDIA M2090)]

Implementation:
- CUDA/Fortran interface
- New routines that encapsulate cuFFT calls
- Development of 3 small CUDA kernels
- Optimization of memory copies to overlap computation and data transfer (see the sketch below)
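The last bullet is the classic CUDA streams pattern. A hedged sketch (the chunking and the kernel are placeholders, not ABINIT's code) of overlapping transfers with computation:

```cpp
// Sketch: overlap H2D copies, kernel execution, and D2H copies with streams.
#include <cuda_runtime.h>

__global__ void work(double* x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0;  // stand-in for real FFT pre/post-processing work
}

int main()
{
    const int nChunks = 4, chunk = 1 << 20;
    double *h_buf, *d_buf;
    cudaMallocHost((void**)&h_buf, nChunks * chunk * sizeof(double)); // pinned memory is required for truly async copies
    cudaMalloc((void**)&d_buf, nChunks * chunk * sizeof(double));

    cudaStream_t s[nChunks];
    for (int i = 0; i < nChunks; ++i) cudaStreamCreate(&s[i]);

    // Each chunk's copy-in, kernel, and copy-out run in its own stream,
    // so chunk k's transfers overlap chunk k-1's computation.
    for (int i = 0; i < nChunks; ++i) {
        double* h = h_buf + (size_t)i * chunk;
        double* d = d_buf + (size_t)i * chunk;
        cudaMemcpyAsync(d, h, chunk * sizeof(double), cudaMemcpyHostToDevice, s[i]);
        work<<<(chunk + 255) / 256, 256, 0, s[i]>>>(d, chunk);
        cudaMemcpyAsync(h, d, chunk * sizeof(double), cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < nChunks; ++i) cudaStreamDestroy(s[i]);
    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}
```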
LOCAL OPERATOR: FAST FOURIER TRANSFORMS

Improvement: multiple-band optimization
- The CUDA kernels are used more intensively
- Requires fewer transfers
- See the bandpp variable, used when the Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) algorithm is selected

Test cases on TGCC-Curie (Intel Westmere + NVIDIA M2090), fourwf time on GPU (s) vs. bandpp:
107 gold atoms:  bandpp = 1: 112.94 | 2: 84.55 | 4: 70.55 | 8: 61.91 | 18: 54.79 | 72: 50.55
31 copper atoms: bandpp = 1: 36.02 | 2: 20.79 | 4: 14.14 | 10: 9.16 | 20: 9.50 | 100: 7.94
NON-LOCAL OPERATOR

In PAW, the non-local operator is applied through projectors:
$\langle g | V_{NL} | \psi \rangle = \sum_{i,j} \langle g | \tilde p_i \rangle \, D_{ij} \, \langle \tilde p_j | \psi \rangle$

Implementation:
- Development of 2 CUDA kernels, for the projections $\langle \tilde p_j | \psi \rangle$ and for assembling $\langle g | V_{NL} | \psi \rangle$
- Collaborative (shared) memory zones are used as much as possible: threads linked to the same plane wave lie in the same collaborative zone
- Used for: the non-local operator, the overlap operator, forces and the stress tensor, and the PAW occupation matrix
- In addition to all MPI parallelization levels!
- Needs a large number of plane waves and a large number of non-local projectors to be efficient

Performance (non-local operator only), TGCC-Curie (Intel Westmere + NVIDIA M2090):
System          | nonlop CPU (s) | nonlop GPU (s) | Speedup GPU vs. CPU
107 gold atoms  | 3142 | 1122 | 2.8
31 copper atoms | 85   | 42   | 2.0
3 UO2 atoms     | 63   | 36   | 1.8
ITERATIVE EIGENSOLVER: LOBPCG (Locally Optimal Block Preconditioned Conjugate Gradient)

Algorithm 1 (LOBPCG). Require: $\Psi^{(0)} = \{\psi_1^{(0)}, \dots, \psi_m^{(0)}\}$ close to the minimum and a preconditioner $K$; $P^{(0)} = \{P_1^{(0)}, \dots, P_m^{(0)}\}$ is initialized to 0.
1: for $i = 0, 1, \dots$ do
2:   $\Lambda^{(i)} = \Lambda(\Psi^{(i)})$
3:   $R^{(i)} = H \Psi^{(i)} - \Lambda^{(i)} O \Psi^{(i)}$
4:   $W^{(i)} = K R^{(i)}$
5:   the Rayleigh-Ritz method is applied within the subspace $\{P_1^{(i)}, \dots, P_m^{(i)}, \psi_1^{(i)}, \dots, \psi_m^{(i)}, W_1^{(i)}, \dots, W_m^{(i)}\}$
6:   $\Psi^{(i+1)} = \alpha^{(i)} \Psi^{(i)} + \tau^{(i)} W^{(i)} + \gamma^{(i)} P^{(i)}$
7:   $P^{(i+1)} = \tau^{(i)} W^{(i)} + \gamma^{(i)} P^{(i)}$
8: end for

Matrix-matrix and matrix-vector products:
- These products are not MPI-parallelized
- The size of the vectors/matrices increases with the number of band/plane-wave CPU cores
- Our choice: use of cuBLAS (see the sketch below)

Diagonalization/orthogonalization within blocks:
- The size of the matrices increases when band parallelization is used
- Need a free LAPACK-like package that runs on the GPU
- Our choice: use of MAGMA

Time on GPU (s) vs. bandpp:
bandpp | Total time | Time in linear algebra
1      | 674  | 119
2      | 612  | 116
10     | 1077 | 428
50     | 5816 | 5167
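As an illustration of the cuBLAS choice, here is a minimal sketch (illustrative sizes and names, not ABINIT's code) of forming a subspace matrix with a double-complex GEMM on the GPU:

```cpp
// Sketch: subspace matrix C = A^H * B with cuBLAS (double complex).
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main()
{
    const int npw = 4096;   // plane waves (rows), illustrative
    const int nband = 128;  // block of bands (columns), illustrative

    cuDoubleComplex *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, sizeof(cuDoubleComplex) * npw * nband);
    cudaMalloc((void**)&d_B, sizeof(cuDoubleComplex) * npw * nband);
    cudaMalloc((void**)&d_C, sizeof(cuDoubleComplex) * nband * nband);
    // ... fill d_A (e.g., H|psi>) and d_B (|psi>) on the device ...

    cublasHandle_t h;
    cublasCreate(&h);

    const cuDoubleComplex one  = make_cuDoubleComplex(1.0, 0.0);
    const cuDoubleComplex zero = make_cuDoubleComplex(0.0, 0.0);
    // C (nband x nband) = A^H (nband x npw) * B (npw x nband)
    cublasZgemm(h, CUBLAS_OP_C, CUBLAS_OP_N,
                nband, nband, npw,
                &one, d_A, npw, d_B, npw,
                &zero, d_C, nband);

    cudaDeviceSynchronize();
    cublasDestroy(h);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    return 0;
}
```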
MAGMA: Matrix Algebra on GPU and Multicore Architectures
- A project by the Innovative Computing Laboratory, University of Tennessee
- Free to use; first stable release 25-08-2011, current version 1.3
- The computation is split so that the most time-consuming parts are offloaded to the GPU
- Goal: address hybrid CPU+GPU environments (a hedged usage sketch follows below)
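A hedged sketch of the kind of MAGMA call used for the block diagonalization (the two-step workspace-query pattern mirrors LAPACK's dsyevd; header names and API details vary across MAGMA versions, so treat this as an assumption-laden illustration, not ABINIT's code):

```cpp
// Sketch: dense symmetric eigensolver on the GPU via MAGMA.
#include <magma_v2.h>   // older releases (e.g., 1.x) ship magma.h instead
#include <vector>

int main()
{
    magma_init();

    const magma_int_t n = 512;          // subspace dimension (illustrative)
    std::vector<double> A(n * n), w(n); // column-major matrix and eigenvalues
    // ... fill A with the symmetric Rayleigh-Ritz subspace matrix ...

    // Workspace query (lwork = -1), then the actual eigen-decomposition.
    // MAGMA offloads the heavy parts to the GPU internally.
    magma_int_t info, lwork = -1, liwork = -1, iwork_query;
    double work_query;
    magma_dsyevd(MagmaVec, MagmaLower, n, A.data(), n, w.data(),
                 &work_query, lwork, &iwork_query, liwork, &info);
    lwork  = (magma_int_t)work_query;
    liwork = iwork_query;
    std::vector<double> work(lwork);
    std::vector<magma_int_t> iwork(liwork);
    magma_dsyevd(MagmaVec, MagmaLower, n, A.data(), n, w.data(),
                 work.data(), lwork, iwork.data(), liwork, &info);

    magma_finalize();
    return (int)info;   // 0 on success
}
```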
ITERATIVE EIGENSOLVER: PERFORMANCE

TGCC-Curie (Intel Westmere + NVIDIA M2090)
- Highly dependent on the physical system
- Depends only on the size of the blocks used in the LOBPCG algorithm

[Chart: speedup of the cuBLAS/MAGMA code on a single MPI process vs. bandpp (2, 4, 10; a factor related to the block/band size), y-axis 0.0 to 4.0; test cases: 215 silicon atoms, 107 gold atoms, 31 copper atoms, 3 UO2 atoms]
OUTLINE (recap)
- Introduction: GPU computing; density functional theory with plane waves and how to port it to the GPU
- ABINIT on GPU: implementation; performance of ABINIT v7.0
- How to compile and use it
PERFORMANCE ON A SINGLE GPU

Comparing architectures:
- TGCC-Titane: Intel Nehalem + NVIDIA S1070 (Tesla generation)
- TGCC-Curie: Intel Westmere + NVIDIA M2090 (Fermi generation)

The results are strongly architecture-dependent:
- Titane: fast CPU + old GPU
- The BaTiO3 case is too small to benefit from the GPU

Time on MPI process 0 (s):
System          | Titane CPU | Titane CPU+GPU | Speedup | Curie CPU | Curie CPU+GPU | Speedup
107 gold atoms  | 4230 | 1857 | 2.3 | 5154 | 621 | 8.3
31 copper atoms | 153  | 102  | 1.5 | 235  | 64  | 3.7
5 BaTiO3 atoms  | 85   | 98   | 0.9 | 86   | 73  | 1.2
PERFORMANCE ON MULTIPLE GPUs

TGCC-Curie: Intel Westmere + NVIDIA M2090 (Fermi). Efficiency drops as the number of MPI processes increases (the load per GPU decreases).

8 CPUs (MPI only) + 8 GPUs, time on MPI process 0 (s):
System                      | CPU   | CPU+GPU | Speedup
215 silicon atoms (1 iter.) | 16160 | 3680    | 4.4
107 gold atoms              | 866   | 621     | 4.7
31 copper atoms             | 49    | 64      | 1.5

200 CPUs (MPI only) + 200 GPUs, time on MPI process 0 (s):
System                       | CPU   | CPU+GPU | Speedup
215 silicon atoms (45 iter.) | 25856 | 6309    | 4.1
PERFORMANCE VS. OTHER CODES

Other DFT codes have been ported to GPUs; the speedups of plane-wave codes are similar:
- Quantum Espresso (plane waves): 3-4x (sequential), 2-3x (parallel) [1,2]
- VASP (plane waves): 3-4x [3,4]
- BigDFT (wavelets): 5-7x [5]
- GPAW (real space): 10-15x [1,6]

References:
[1] Harju et al., Applied Parallel and Scientific Computing, Lecture Notes in Computer Science 7782, p. 3 (2013)
[2] Spiga et al., Proc. 20th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), p. 368 (2012)
[3] Maintz et al., Computer Physics Communications 182, p. 1421 (2011)
[4] Hacene et al., Journal of Computational Chemistry 33, p. 2581 (2012)
[5] Genovese et al., Journal of Chemical Physics 131, 034103 (2009)
[6] Hakala et al., PARA 2012, LNCS vol. 7782, p. 63, Springer, Heidelberg (2013)
Gaussian
GTC, April 4-7, 2016 | Silicon Valley
ENABLING THE ELECTRONIC STRUCTURE PROGRAM GAUSSIAN ON GPGPUS USING OPENACC
Roberto Gomperts (NVIDIA), Michael Frisch (Gaussian, Inc.), Giovanni Scalmani (Gaussian, Inc.), Brent Leback (NVIDIA/PGI)
PREVIOUSLY: EARLIER PRESENTATIONS
- GRC poster, 2012
- ACS Spring 2014
- GTC Spring 2014 (recording at http://on-demand.gputechconf.com/gtc/2014/video/S4613-enabling-gaussian-09-gpgpus.mp4)
- WATOC Fall 2014
- GTC Spring 2016 (full recording at http://mygtc.gputechconf.com/quicklink/4r13O5r; requires registration)
CURRENT STATUS: SINGLE NODE

Implemented:
- Energies for closed- and open-shell HF and DFT (fewer than a handful of XC functionals missing)
- First derivatives for the same
- Second derivatives for the same

Using only:
- OpenACC (see the sketch below)
- CUDA library calls (BLAS)
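To illustrate the OpenACC style this kind of port relies on, here is a minimal sketch (illustrative loop and data, not Gaussian source code) of offloading a Fock-like contraction with directives alone:

```cpp
// Sketch: directive-based GPU offload with OpenACC (compile with, e.g., -acc).
#include <cstdio>
#include <vector>

int main()
{
    const int n = 1024;
    std::vector<double> d(n * n, 0.1), g(n * n, 0.05), f(n * n, 0.0);
    double* D = d.data(); double* G = g.data(); double* F = f.data();

    // copyin moves inputs to the GPU; copyout returns the result.
    #pragma acc parallel loop collapse(2) copyin(D[0:n*n], G[0:n*n]) copyout(F[0:n*n])
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            #pragma acc loop seq
            for (int k = 0; k < n; ++k)
                sum += D[i * n + k] * G[k * n + j];  // stand-in for the real two-electron work
            F[i * n + j] = sum;
        }

    printf("F[0] = %f\n", F[0]);
    return 0;
}
```

The appeal, as the slides note, is that the same source still compiles and runs on the CPU when the directives are ignored.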
GAUSSIAN PARALLELISM MODEL
CPU cluster -> CPU node (OpenMP) -> GPU (OpenACC)
CLOSING REMARKS
- Significant progress has been made in enabling Gaussian on GPUs with OpenACC
- OpenACC is becoming increasingly versatile
- Significant work lies ahead to improve performance
- The feature set will expand: PBC, solvation, MP2, ONIOM, triples corrections
EARLY PERFORMANCE RESULTS: CCSD(T)

[Chart: ibuprofen CCSD(T) calculation speedups relative to a CPU-only full node, comparing 20 threads + 0 GPUs against 6 threads + 6 GPUs, broken out as Total, l913 ("triples"), and iterative CCSD; bar labels give times in hours: 24.8, 24.8, 22.7, 2.1, 3.7, 3.6, 1.3, 2.3]

System: 2 sockets E5-2690 v2 (2x 10 cores @ 3.0 GHz); 128 GB RAM (DDR3-1600), 108 GB used
GPUs: 6x Tesla K40m (15 SMs @ 875 MHz), 12 GB global memory each

Method: CCSD(T) | Atoms: 33 | Basis set: 6-31G(d,p) | Basis functions: 315 | Occupied orbitals: 41 | Virtual orbitals: 259 | Cycles: 15 | CCSD iterations: 16
VASP
GPU VASP COLLABORATION

2013-2014 project scope: minimization algorithms to calculate the electronic ground state
- Blocked Davidson (ALGO = Normal & Fast)
- RMM-DIIS (ALGO = VeryFast & Fast)
- k-points
- Optimization of the critical step in exact-exchange calculations

Collaborators include U of Chicago.

Earlier work:
- "Speeding up plane-wave electronic-structure calculations using graphics-processing units", Maintz, Eck, Dronskowski
- "VASP on a GPU: application to exact-exchange calculations of the stability of elemental boron", Hutchinson, Widom
- "Accelerating VASP Electronic Structure Calculations Using Graphic Processing Units", Hacene, Anciaux-Sedrakian, Rozanska, Klahr, Guignon, Fleurat-Lessard

Target workloads:
- Silica ("medium"): 7 Å-thick slab of amorphous silica, 240 atoms (Si68O148H24); RMM-DIIS (ALGO = VeryFast)
- NiAl-MD ("large"): liquid-metal molecular dynamics sample of a nickel-based superalloy; 500 atoms, 9 chemical species in total; blocked Davidson (ALGO = Normal)
- INTERFACE ("large"): interface of platinum metal with water; 108 Pt atoms and 120 water molecules (468 atoms); blocked Davidson & RMM-DIIS (ALGO = Fast). A minimal INCAR sketch for this setting follows below.
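For orientation, this is what selecting those algorithms looks like in a VASP input file. Only the ALGO values come from the slide; the remaining tags are common placeholders, not the actual benchmark inputs:

```
! Illustrative INCAR fragment (hypothetical values except ALGO)
ALGO  = Fast    ! blocked Davidson + RMM-DIIS, as used for INTERFACE
NSIM  = 4       ! bands treated simultaneously in RMM-DIIS (placeholder)
ENCUT = 400     ! plane-wave cutoff in eV (placeholder)
EDIFF = 1E-5    ! electronic convergence criterion (placeholder)
```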
RUNTIME DISTRIBUTION FOR SILICA

[Chart: time in seconds for 1 K40 GPU + 1 IvyBridge core, 0 to 3500 s, comparing the CPU run, the original GPU port, and the optimized GPU port; each bar split into memcopy, GEMM, FFT, and other]
March 2016: VASP 5.4.1 with patch #1 (Haswell & K80)
VASP INTERFACE BENCHMARK

Running VASP version 5.4.1, blocked Davidson + RMM-DIIS (ALGO = Fast). The CPU-only node contains dual Intel Xeon E5-2698 v3 @ 2.3 GHz (Haswell) CPUs; the GPU nodes add Tesla K80 (autoboost) GPUs. "[Old Data]" = pre-bugfix results; patch #1 for vasp.5.4.1.05Feb15 (available at VASP.at) yields up to 2.7x faster calculations.

Elapsed time (s); lower is better:
1 Node (CPU only)          | 828.27
1 Node + 1x K80            | 580.63 (1.4x)
1 Node + 2x K80 [Old Data] | 506.25 (1.6x)
1 Node + 2x K80            | 387.26 (2.1x)
1 Node + 4x K80            | 284.06 (3.0x)
VASP SILICA IFPEN BENCHMARK

Running VASP version 5.4.1, RMM-DIIS (ALGO = VeryFast). Same node configurations and "[Old Data]" definition as above.

Elapsed time (s); lower is better:
1 Node (CPU only)          | 542.87
1 Node + 1x K80            | 432.48 (1.3x)
1 Node + 2x K80 [Old Data] | 372.74 (1.5x)
1 Node + 2x K80            | 258.14 (2.1x)
1 Node + 4x K80            | 210.47 (2.6x)
VASP SI-HUGE BENCHMARK

Running VASP version 5.4.1, blocked Davidson + RMM-DIIS (ALGO = Fast). Same node configurations and "[Old Data]" definition as above.

Elapsed time (s); lower is better:
1 Node (CPU only)          | 6464.42
1 Node + 1x K80            | 4296.15 (1.5x)
1 Node + 2x K80 [Old Data] | 2871.09 (2.3x)
1 Node + 2x K80            | 2865.92 (2.3x)
1 Node + 4x K80            | 1987.94 (3.3x)
VASP SUPPORTEDSYSTEMS BENCHMARK

Running VASP version 5.4.1, blocked Davidson + RMM-DIIS (ALGO = Fast). Same node configurations and "[Old Data]" definition as above.

Elapsed time (s); lower is better:
1 Node (CPU only)          | 310.63
1 Node + 1x K80            | 252.43 (1.2x)
1 Node + 2x K80 [Old Data] | 228.90 (1.4x)
1 Node + 2x K80            | 164.21 (1.9x)
1 Node + 4x K80            | 129.99 (2.4x)
VASP NIAL-MD BENCHMARK

Running VASP version 5.4.1, blocked Davidson + RMM-DIIS (ALGO = Fast). Same node configurations and "[Old Data]" definition as above.

Elapsed time (s); lower is better:
1 Node (CPU only)          | 722.82
1 Node + 1x K80            | 237.02 (3.0x)
1 Node + 2x K80 [Old Data] | 197.19 (3.7x)
1 Node + 2x K80            | 184.93 (3.9x)
1 Node + 4x K80            | 142.98 (5.1x)
VASP B.hR105 BENCHMARK

Running VASP version 5.4.1, hybrid functional with blocked Davidson (ALGO = Normal). Same node configurations and "[Old Data]" definition as above.

Elapsed time (s); lower is better:
1 Node (CPU only)          | 1196.86
1 Node + 1x K80            | 604.05 (2.0x)
1 Node + 2x K80 [Old Data] | 334.44 (3.6x)
1 Node + 2x K80            | 343.69 (3.5x)
1 Node + 4x K80            | 236.13 (5.1x)
VASP B.aP107 BENCHMARK

Running VASP version 5.4.1, hybrid-functional (exact exchange) calculation with blocked Davidson (ALGO = Normal); no k-point parallelization. Same node configurations and "[Old Data]" definition as above.

Elapsed time (s); lower is better:
1 Node (CPU only)          | 27993.62
1 Node + 1x K80            | 7494.36 (3.7x)
1 Node + 2x K80 [Old Data] | 7632.31 (3.7x)
1 Node + 2x K80            | 4432.31 (6.3x)
1 Node + 4x K80            | 4513.40 (6.2x)
GPU-ACCELERATED MOLECULAR DYNAMICS APPS

ACEMD, AMBER, CHARMM, DESMOND, ESPResSo, Folding@Home, GPUGrid.net, GROMACS, HALMD, HOOMD-Blue, LAMMPS, mdcore, MELD, NAMD, OpenMM, PolyFTS

Green lettering (in the original slide) indicates that performance slides are included. GPU performance is compared against a dual multi-core x86 CPU socket.
NAMD 2.11
NEW GPU FEATURES IN NAMD 2.11
- GPU-accelerated simulations up to twice as fast as NAMD 2.10
- Pressure calculation with fixed atoms on the GPU works as on the CPU
- Improved scaling for GPU-accelerated particle-mesh Ewald: CPU-side operations overlap better and are parallelized across cores
- Improved scaling for GPU-accelerated simulations: nonbonded force calculation results are streamed from the GPU for better overlap
- NVIDIA CUDA GPU-acceleration binaries for Mac OS X
(Selected text from the NAMD website)
NAMD 2.11 IS UP TO 2X FASTER

[Chart: simulated time (ns/day) for ApoA1 (92,224 atoms) on 1, 2, and 4 nodes; NAMD 2.11 is 1.2x, 1.6x, and 2.0x faster than NAMD 2.10 respectively. Nodes: dual Intel E5-2697 v2 @ 2.7 GHz (IvyBridge) CPUs + 2x Tesla K80 (autoboost) GPUs]
NAMD 2.11 APOA1 ON 1 AND 2 NODES

Running NAMD version 2.11. CPU-only nodes: dual Intel E5-2698 v3 @ 2.3 GHz (Haswell); GPU nodes add Tesla K80 (autoboost) GPUs.

ApoA1 (92,224 atoms), simulated time (ns/day); higher is better:
1 Node (CPU only)  | 2.77
1 Node + 1x K80    | 11.67 (4.2x)
1 Node + 2x K80    | 16.99 (6.1x)
2 Nodes (CPU only) | 5.22
2 Nodes + 1x K80   | 19.73 (3.8x)
2 Nodes + 2x K80   | 24.31 (4.7x)
NAMD 2.11 APOA1 ON 4 AND 8 NODES

Running NAMD version 2.11. Same node configurations as above.

ApoA1 (92,224 atoms), simulated time (ns/day); higher is better:
4 Nodes (CPU only) | 10.27
4 Nodes + 1x K80   | 20.64 (2.0x)
4 Nodes + 2x K80   | 23.52 (2.3x)
8 Nodes (CPU only) | 16.85
8 Nodes + 1x K80   | 27.83 (1.7x)
8 Nodes + 2x K80   | 27.74 (1.6x)
NAMD 2.11 STMV ON 1 AND 2 NODES

Running NAMD version 2.11. Same node configurations as above.

STMV (1,066,628 atoms), simulated time (ns/day); higher is better:
1 Node (CPU only)  | 0.23
1 Node + 1x K80    | 1.03 (4.5x)
1 Node + 2x K80    | 1.75 (7.6x)
2 Nodes (CPU only) | 0.46
2 Nodes + 1x K80   | 1.98 (4.3x)
2 Nodes + 2x K80   | 3.27 (7.1x)
NAMD 2.11 STMV ON 4 AND 8 NODES

Running NAMD version 2.11. Same node configurations as above.

STMV (1,066,628 atoms), simulated time (ns/day); higher is better:
4 Nodes (CPU only) | 0.90
4 Nodes + 1x K80   | 3.61 (4.0x)
4 Nodes + 2x K80   | 4.54 (5.0x)
8 Nodes (CPU only) | 1.74
8 Nodes + 1x K80   | 5.86 (3.4x)
8 Nodes + 2x K80   | 6.24 (3.6x)
AMBER 14
JAC ON K40s, K80s AND M6000s

Running AMBER version 14. The CPU-only node contains dual Intel E5-2698 v3 @ 2.3 GHz (3.6 GHz Turbo) CPUs; the GPU nodes add NVIDIA Tesla K40 @ 875 MHz, Tesla K80 @ 562 MHz (autoboost), or Quadro M6000 @ 987 MHz GPUs.

PME-JAC_NVE, simulated time (ns/day); higher is better:
1 Haswell node (CPU only) | 37.38
1 CPU node + 1x K40       | 134.82 (3.6x)
1 CPU node + 0.5x K80     | 121.30 (3.2x)
1 CPU node + 1x K80       | 174.34 (4.7x)
1 CPU node + 1x M6000     | 161.53 (4.3x)
1 CPU node + 2x K40       | 200.34 (5.4x)
1 CPU node + 2x K80       | 225.34 (6.0x)
1 CPU node + 2x M6000     | 219.83 (5.9x)
CELLULOSE ON K40s, K80s AND M6000s

Running AMBER version 14. Same node configurations as above.

PME-Cellulose_NVE, simulated time (ns/day); higher is better:
1 Haswell node (CPU only) | 1.93
1 CPU node + 1x K40       | 8.96 (4.6x)
1 CPU node + 0.5x K80     | 7.87 (4.1x)
1 CPU node + 1x K80       | 11.76 (6.1x)
1 CPU node + 1x M6000     | 10.49 (5.4x)
1 CPU node + 2x K40       | 13.67 (7.1x)
1 CPU node + 2x K80       | 15.38 (8.0x)
1 CPU node + 2x M6000     | 14.90 (7.7x)
FACTOR IX ON K40s, K80s AND M6000s

Running AMBER version 14. Same node configurations as above.

PME-FactorIX_NVE, simulated time (ns/day); higher is better:
1 Haswell node (CPU only) | 9.68
1 CPU node + 1x K40       | 40.48 (4.2x)
1 CPU node + 0.5x K80     | 33.59 (3.5x)
1 CPU node + 1x K80       | 50.70 (5.2x)
1 CPU node + 1x M6000     | 47.80 (5.0x)
1 CPU node + 2x K40       | 61.18 (6.4x)
1 CPU node + 2x K80       | 60.93 (6.3x)
1 CPU node + 2x M6000     | 66.89 (7.0x)
JAC ON M40s

Running AMBER version 14. The CPU-only node contains a single Intel Xeon E5-2698 v3 @ 2.3 GHz (Haswell) CPU; the GPU nodes contain a single Intel Xeon E5-2697 v2 @ 2.7 GHz (IvyBridge) CPU + Tesla M40 (autoboost) GPUs.

PME-JAC_NVE, simulated time (ns/day); higher is better:
1 Node (CPU only) | 21.11
1 Node + 1x M40   | 157.68 (7.5x)
1 Node + 2x M40   | 230.18 (10.9x)
1 Node + 4x M40   | 246.15 (11.7x)
CELLULOSE ON M40s

Running AMBER version 14. Same node configurations as above.

PME-Cellulose_NPT, simulated time (ns/day); higher is better:
1 Node (CPU only) | 1.07
1 Node + 1x M40   | 10.12 (9.5x)
1 Node + 2x M40   | 14.40 (13.5x)
1 Node + 4x M40   | 15.90 (14.9x)
MYOGLOBIN ON M40s

Running AMBER version 14. Same node configurations as above.

GB-Myoglobin, simulated time (ns/day); higher is better:
1 Node (CPU only) | 9.83
1 Node + 1x M40   | 232.20 (23.6x)
1 Node + 2x M40   | 300.86 (30.6x)
1 Node + 4x M40   | 322.09 (32.8x)
NUCLEOSOME ON M40s

Running AMBER version 14. Same node configurations as above.

GB-Nucleosome, simulated time (ns/day); higher is better:
1 Node (CPU only) | 0.13
1 Node + 1x M40   | 4.67 (35.9x)
1 Node + 2x M40   | 9.05 (69.6x)
1 Node + 4x M40   | 16.11 (123.9x)
GROMACS 5.1 (October 2015)
3M WATERS ON K40s AND K80s

Running GROMACS version 5.1. The CPU-only node contains dual Intel E5-2698 v3 @ 2.3 GHz CPUs; the GPU nodes add NVIDIA Tesla K40 @ 875 MHz or Tesla K80 @ 562 MHz (autoboost) GPUs.

Water [PME] 3M, simulated time (ns/day); higher is better:
1 Haswell node (CPU only) | 0.81
1 CPU node + 1x K40       | 1.32 (1.6x)
1 CPU node + 1x K80       | 1.88 (2.3x)
1 CPU node + 2x K40       | 1.85 (2.3x)
1 CPU node + 2x K80       | 2.72 (3.4x)
1 CPU node + 4x K40       | 2.76 (3.4x)
1 CPU node + 4x K80       | 3.23 (4.0x)
3M WATERS ON TITAN X

Running GROMACS version 5.1. The CPU-only node contains dual Intel E5-2698 v3 @ 2.3 GHz CPUs; the GPU nodes add GeForce GTX Titan X @ 1000 MHz GPUs.

Water [PME] 3M, simulated time (ns/day); higher is better:
1 Haswell node (CPU only) | 0.81
1 CPU node + 1x Titan X   | 1.53 (1.9x)
1 CPU node + 2x Titan X   | 2.36 (2.9x)
1 CPU node + 4x Titan X   | 2.99 (3.7x)
TESLA P100 ACCELERATORS

Tesla P100 for NVLink-enabled servers: 5.3 TF DP, 10.6 TF SP, 21 TF HP; 720 GB/s memory bandwidth, 16 GB.
Tesla P100 for PCIe-based servers: 4.7 TF DP, 9.3 TF SP, 18.7 TF HP; config 1: 16 GB at 720 GB/s, config 2: 12 GB at 540 GB/s.

PAGE MIGRATION ENGINE: unified memory between the CPU and Tesla P100
- Simpler parallel programming
- Virtually unlimited data size
- Performance with data locality (see the sketch below)
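A minimal sketch of what that means in practice: with CUDA Unified Memory, a single allocation is valid on both CPU and GPU, and the P100 Page Migration Engine moves pages on demand, even when the working set is larger than GPU memory. The size and kernel below are illustrative.

```cpp
// Sketch: CUDA Unified Memory with on-demand page migration.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(double* x, size_t n, double a)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main()
{
    const size_t n = 1 << 26;  // illustrative; could exceed GPU memory on P100
    double* x;
    cudaMallocManaged(&x, n * sizeof(double));  // one pointer, valid on CPU and GPU

    for (size_t i = 0; i < n; ++i) x[i] = 1.0;  // touch on the CPU first

    scale<<<(unsigned)((n + 255) / 256), 256>>>(x, n, 2.0);  // pages migrate to the GPU on demand
    cudaDeviceSynchronize();

    printf("x[0] = %f\n", x[0]);  // pages migrate back on CPU access
    cudaFree(x);
    return 0;
}
```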
EXTRAORDINARY STRONG SCALING: ONE STRONG NODE FASTER THAN LOTS OF WEAK NODES

[Chart 1, VASP performance: speedup vs. one CPU server node, plotted against the number of CPU server nodes (up to 32); the 2x, 4x, and 8x P100 single-PCIe-node results are marked against the CPU scaling curve, with 8x P100 landing near the 32-CPU-node level]
[Chart 2, AMBER simulation of CRISPR, nature's tool for genome editing: ns/day vs. number of processors (CPUs and GPUs); 1 node with 4x P100 GPUs matches about 48 CPU nodes of the Comet supercomputer]

CPU: dual-socket Intel E5-2680 v3, 12 cores, 128 GB DDR4 per node, FDR IB. VASP 5.4.1_05Feb16, Si-Huge dataset; the 16- and 32-node points are estimated from the observed 4-to-8-node scaling. AMBER 16 pre-release; CRISPR system based on PDB ID 5f9r, 336,898 atoms.
MASSIVE LEAP IN PERFORMANCE

[Chart: speedup vs. a dual-socket Broadwell node for NAMD, VASP, HOOMD-Blue, and AMBER with 2x K80, 2x P100 (PCIe), and 4x P100 (PCIe); y-axis 0x to 30x]

CPU: Xeon E5-2697 v4, 2.3 GHz (3.6 GHz Turbo). Caffe AlexNet: batch size of 256 images; VASP, NAMD, HOOMD-Blue, and AMBER: average speedup across a basket of tests.
TCO OVERVIEW: EQUAL PERFORMANCE IN FEWER RACKS; A SMALLER, MORE EFFICIENT BUDGET

[Chart: number of racks (power required) for equal performance, scale from 1 rack up to 10 racks (400 kW). Caffe AlexNet: 1280 CPUs / 350 kW; VASP: 960 CPUs / 260 kW (7 to 9 racks of CPU servers in the original figure) vs. a single rack of 20 CPUs + 80 P100s / 32 kW]

Budget breakdown (source: Microsoft Research on datacenter costs): in the dense accelerated deployment, compute servers account for 85% of spend and non-compute for 15%; in the CPU-only deployment, compute servers account for only 39%, with 61% going to non-compute items (racks, cabling, infrastructure, networking).
NVIDIA DGX-1: THE WORLD'S FIRST DEEP LEARNING SUPERCOMPUTER
- 42.5 TFLOPS FP64 / 85 TFLOPS FP32 / 170 TFLOPS FP16
- 8x Tesla P100 16 GB, NVLink hybrid cube-mesh interconnect
- Accelerates major AI frameworks
- Dual Xeon CPUs
- 7 TB SSD deep learning cache
- Dual 10 GbE, quad 100 Gb InfiniBand
- 3RU, 3200 W
THANK YOU! Q&A: WHAT CAN I HELP YOU BUILD?