Accelerating Materials-Simulation Computations on GPUs
François Courteille, Senior Solution Architect, Accelerated Computing, NVIDIA
AGENDA
Tesla, the NVIDIA compute platform
Quantum Chemistry Applications Overview
Molecular Dynamics Applications Overview
Materials Science Applications at the Start of the Pascal Era
THE WORLD LEADER IN VISUAL COMPUTING
Gaming | Enterprise | Technology | HPC & Cloud | Auto
OUR TEN YEARS IN HPC (2006-2016)
- CUDA launched
- World's first GPU Top500 system
- Fermi: world's first HPC GPU
- Discovered how H1N1 mutates to resist drugs
- Oak Ridge deploys the world's fastest supercomputer with GPUs
- AlexNet beats expert code by a huge margin using GPUs
- Stanford builds AI machine using GPUs
- World's first atomic model of the HIV capsid
- World's first 3-D mapping of the human genome
- Google outperforms humans on ImageNet
- GPU-trained AI machine beats the world champion in Go
CREDENTIALS BUILT OVER TIME
- 300K CUDA developers, 4x growth in 4 years
- The majority of HPC applications are GPU-accelerated: 410 and growing
- 100% of deep learning frameworks are accelerated

GPU-accelerated applications by year: 2011: 113 | 2012: 206 | 2013: 242 | 2014: 287 | 2015: 370 | 2016: 410
Segments: academia, games, finance, manufacturing, internet, oil & gas, national labs, automotive, defense, media & entertainment
Accelerated deep learning frameworks: Torch, Theano, Caffe, MatConvNet, Purine, Mocha.jl, Minerva, MXNet*, Big Sur, TensorFlow, Watson, CNTK

www.nvidia.com/appscatalog
END-TO-END TESLA PRODUCT FAMILY
- HYPERSCALE HPC (Tesla M4, M40): hyperscale deployments for DL training, inference, and video & image processing
- MIXED-APPS HPC (Tesla K80): HPC data centers running a mix of CPU and GPU workloads
- STRONG-SCALING HPC (Tesla P100): hyperscale & HPC data centers running apps that scale to multiple GPUs
- FULLY INTEGRATED DL SUPERCOMPUTER (DGX-1): for customers who need to get going now with a fully integrated solution
TESLA FOR SIMULATION
The Tesla Accelerated Computing platform: libraries, languages, and directives, together with the Accelerated Computing Toolkit.
OVERVIEW OF LIFE & MATERIALS SCIENCE ACCELERATED APPS

MD: all key codes are GPU-accelerated
- Great multi-GPU performance
- Focus on dense (up to 16) GPU nodes and/or large numbers of GPU nodes
- ACEMD*, AMBER (PMEMD)*, BAND, CHARMM, DESMOND, ESPResSo, Folding@Home, GPUgrid.net, GROMACS, HALMD, HOOMD-Blue*, LAMMPS, Lattice Microbes*, mdcore, MELD, miniMD, NAMD, OpenMM, PolyFTS, SOP-GPU* and more

QC: all key codes are ported or being optimized
- Focus on GPU-accelerated math libraries and OpenACC directives
- GPU-accelerated and available today: ABINIT, ACES III, ADF, BigDFT, CP2K, GAMESS, GAMESS-UK, GPAW, LATTE, LSDalton, LSMS, MOLCAS, MOPAC2012, NWChem, OCTOPUS*, PEtot, QUICK, Q-Chem, QMCPack, Quantum Espresso/PWscf, TeraChem*
- Active GPU-acceleration projects: CASTEP, GAMESS, Gaussian, ONETEP, Quantum SuperCharger Library*, VASP and more

* = application where >90% of the workload runs on the GPU
MD VS. QC ON GPUS

"Classical" Molecular Dynamics | Quantum Chemistry (MO, PW, DFT, semi-empirical)
- Simulates positions of atoms over time; chemical-biological or chemical-material behaviors | Calculates electronic properties: ground state, excited states, spectral properties, making/breaking bonds, physical properties
- Forces calculated from simple empirical formulas (bond rearrangement generally forbidden) | Forces derived from the electron wave function (bond rearrangement OK, e.g., bond energies)
- Up to millions of atoms | Up to a few thousand atoms
- Solvent included without difficulty | Generally in vacuum; if needed, solvent treated classically (QM/MM) or with implicit methods
- Single precision dominated | Double precision is important
- Uses cuBLAS, cuFFT, CUDA | Uses cuBLAS, cuFFT, OpenACC
- GeForce (academia), Tesla (servers) | Tesla recommended
- ECC off | ECC on
GPU-ACCELERATED QUANTUM CHEMISTRY APPS

ABINIT, ACES III, ADF, BigDFT, CP2K, GAMESS-US, Gaussian, GPAW, LATTE, LSDalton, MOLCAS, MOPAC2012, NWChem, Octopus, ONETEP, PEtot, Q-Chem, QMCPACK, Quantum Espresso, Quantum SuperCharger Library, TeraChem, VASP, WL-LSMS

Green lettering (in the original slide) indicates that performance slides are included. GPU performance is compared against a dual multi-core x86 CPU socket.
ABINIT: FROM RESEARCH TO INDUSTRY

USING ABINIT ON GRAPHICS PROCESSING UNITS (GPU)
6th International ABINIT Developer Workshop, April 15-18, 2013, Dinard, France
Marc Torrent, CEA, DAM, DIF, F-91297 Arpajon, France
Florent Dahm, C-S, 22 av. Galilée, 92350 Le Plessis-Robinson
With a contribution from Yann Pouillon for the build system
www.cea.fr
ABINIT ON GPU: OUTLINE
- Introduction
- Introduction to GPU computing: architecture, programming model
- Density functional theory with plane waves: how to port it to the GPU
- ABINIT on GPU: implementation; performance of ABINIT v7.0
- How to compile and use it
LOCAL OPERATOR: FAST FOURIER TRANSFORMS

Applying the local potential in a plane-wave basis, $\sum_{g'} \langle g | v_{\mathrm{eff}} | g' \rangle \, c_{g'}$ with $\langle g | v_{\mathrm{eff}} | g' \rangle = \int \mathrm{d}r \, e^{-i(g-g')\cdot r} \, v_{\mathrm{eff}}(r)$, is a convolution: the wavefunction is FFT'd to real space, multiplied by $v_{\mathrm{eff}}(r)$, and FFT'd back.

- The cuFFT library is included in the CUDA package
- Interfaced with C and Fortran
- A system-dependent efficiency is expected: small FFTs underuse the GPU and are dominated by the time required to transfer data to/from the GPU
- Within PAW, small plane-wave bases are used

Our choice (for a first version): replace the MPI plane-wave level by an FFT computation on the GPU (a minimal cuFFT sketch follows below).
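To make that porting step concrete, here is a minimal, self-contained sketch of the pattern described above (illustrative grid size and names, not ABINIT's actual routines): transform one band to real space with cuFFT, apply v_eff(r), and transform back.

```cpp
// Sketch (not ABINIT code): apply a local potential v_eff(r) via cuFFT.
#include <cuda_runtime.h>
#include <cufft.h>

// Pointwise multiply psi(r) by v_eff(r) in real space.
__global__ void apply_veff(cufftDoubleComplex* psi, const double* veff,
                           int n, double scale)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        psi[i].x *= veff[i] * scale;  // scale folds in the 1/N FFT normalization
        psi[i].y *= veff[i] * scale;
    }
}

int main()
{
    const int nx = 64, ny = 64, nz = 64, n = nx * ny * nz;  // illustrative grid

    cufftDoubleComplex* d_psi;
    double* d_veff;
    cudaMalloc((void**)&d_psi, n * sizeof(cufftDoubleComplex));
    cudaMalloc((void**)&d_veff, n * sizeof(double));
    // ... fill d_psi (plane-wave coefficients) and d_veff from the host ...

    cufftHandle plan;
    cufftPlan3d(&plan, nx, ny, nz, CUFFT_Z2Z);

    // g-space -> r-space, multiply by v_eff(r), r-space -> g-space.
    cufftExecZ2Z(plan, d_psi, d_psi, CUFFT_INVERSE);
    apply_veff<<<(n + 255) / 256, 256>>>(d_psi, d_veff, n, 1.0 / n);
    cufftExecZ2Z(plan, d_psi, d_psi, CUFFT_FORWARD);

    cudaDeviceSynchronize();
    cufftDestroy(plan);
    cudaFree(d_psi);
    cudaFree(d_veff);
    return 0;
}
```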
LOCAL OPERATOR: FAST FOURIER TRANSFORMS (fourwf, option 2)

Performance is highly dependent on the FFT size.

[Chart: speedup vs. FFT size. Test case: 107-gold-atom cell on TGCC-Curie (Intel Westmere + NVIDIA M2090)]

Implementation:
- CUDA/Fortran interface
- New routines that encapsulate cuFFT calls
- Development of 3 small CUDA kernels
- Optimization of memory copies to overlap computation and data transfer (see the sketch below)
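The last bullet is the classic CUDA streams pattern. A hedged sketch (the chunking and the kernel are placeholders, not ABINIT's code) of overlapping transfers with computation:

```cpp
// Sketch: overlap H2D copies, kernel execution, and D2H copies with streams.
#include <cuda_runtime.h>

__global__ void work(double* x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0;  // stand-in for real FFT pre/post-processing work
}

int main()
{
    const int nChunks = 4, chunk = 1 << 20;
    double *h_buf, *d_buf;
    cudaMallocHost((void**)&h_buf, nChunks * chunk * sizeof(double)); // pinned memory is required for truly async copies
    cudaMalloc((void**)&d_buf, nChunks * chunk * sizeof(double));

    cudaStream_t s[nChunks];
    for (int i = 0; i < nChunks; ++i) cudaStreamCreate(&s[i]);

    // Each chunk's copy-in, kernel, and copy-out run in its own stream,
    // so chunk k's transfers overlap chunk k-1's computation.
    for (int i = 0; i < nChunks; ++i) {
        double* h = h_buf + (size_t)i * chunk;
        double* d = d_buf + (size_t)i * chunk;
        cudaMemcpyAsync(d, h, chunk * sizeof(double), cudaMemcpyHostToDevice, s[i]);
        work<<<(chunk + 255) / 256, 256, 0, s[i]>>>(d, chunk);
        cudaMemcpyAsync(h, d, chunk * sizeof(double), cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();

    for (int i = 0; i < nChunks; ++i) cudaStreamDestroy(s[i]);
    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}
```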
LOCAL OPERATOR: FAST FOURIER TRANSFORMS

Improvement: multiple-band optimization
- The CUDA kernels are used more intensively
- Requires fewer transfers
- See the bandpp variable, used when the Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) algorithm is selected

Test cases on TGCC-Curie (Intel Westmere + NVIDIA M2090), fourwf time on GPU (s) vs. bandpp:
107 gold atoms:  bandpp = 1: 112.94 | 2: 84.55 | 4: 70.55 | 8: 61.91 | 18: 54.79 | 72: 50.55
31 copper atoms: bandpp = 1: 36.02 | 2: 20.79 | 4: 14.14 | 10: 9.16 | 20: 9.50 | 100: 7.94
NON-LOCAL OPERATOR

In PAW, the non-local operator is applied through projectors:
$\langle g | V_{NL} | \psi \rangle = \sum_{i,j} \langle g | \tilde p_i \rangle \, D_{ij} \, \langle \tilde p_j | \psi \rangle$

Implementation:
- Development of 2 CUDA kernels, for the projections $\langle \tilde p_j | \psi \rangle$ and for assembling $\langle g | V_{NL} | \psi \rangle$
- Collaborative (shared) memory zones are used as much as possible: threads linked to the same plane wave lie in the same collaborative zone
- Used for: the non-local operator, the overlap operator, forces and the stress tensor, and the PAW occupation matrix
- In addition to all MPI parallelization levels!
- Needs a large number of plane waves and a large number of non-local projectors to be efficient

Performance (non-local operator only), TGCC-Curie (Intel Westmere + NVIDIA M2090):
System          | nonlop CPU (s) | nonlop GPU (s) | Speedup GPU vs. CPU
107 gold atoms  | 3142 | 1122 | 2.8
31 copper atoms | 85   | 42   | 2.0
3 UO2 atoms     | 63   | 36   | 1.8
ITERATIVE EIGENSOLVER: LOBPCG (Locally Optimal Block Preconditioned Conjugate Gradient)

Algorithm 1 (LOBPCG). Require: $\Psi^{(0)} = \{\psi_1^{(0)}, \dots, \psi_m^{(0)}\}$ close to the minimum and a preconditioner $K$; $P^{(0)} = \{P_1^{(0)}, \dots, P_m^{(0)}\}$ is initialized to 0.
1: for $i = 0, 1, \dots$ do
2:   $\Lambda^{(i)} = \Lambda(\Psi^{(i)})$
3:   $R^{(i)} = H \Psi^{(i)} - \Lambda^{(i)} O \Psi^{(i)}$
4:   $W^{(i)} = K R^{(i)}$
5:   the Rayleigh-Ritz method is applied within the subspace $\{P_1^{(i)}, \dots, P_m^{(i)}, \psi_1^{(i)}, \dots, \psi_m^{(i)}, W_1^{(i)}, \dots, W_m^{(i)}\}$
6:   $\Psi^{(i+1)} = \alpha^{(i)} \Psi^{(i)} + \tau^{(i)} W^{(i)} + \gamma^{(i)} P^{(i)}$
7:   $P^{(i+1)} = \tau^{(i)} W^{(i)} + \gamma^{(i)} P^{(i)}$
8: end for

Matrix-matrix and matrix-vector products:
- These products are not MPI-parallelized
- The size of the vectors/matrices increases with the number of band/plane-wave CPU cores
- Our choice: use of cuBLAS (see the sketch below)

Diagonalization/orthogonalization within blocks:
- The size of the matrices increases when band parallelization is used
- Need a free LAPACK-like package that runs on the GPU
- Our choice: use of MAGMA

Time on GPU (s) vs. bandpp:
bandpp | Total time | Time in linear algebra
1      | 674  | 119
2      | 612  | 116
10     | 1077 | 428
50     | 5816 | 5167
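As an illustration of the cuBLAS choice, here is a minimal sketch (illustrative sizes and names, not ABINIT's code) of forming a subspace matrix with a double-complex GEMM on the GPU:

```cpp
// Sketch: subspace matrix C = A^H * B with cuBLAS (double complex).
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main()
{
    const int npw = 4096;   // plane waves (rows), illustrative
    const int nband = 128;  // block of bands (columns), illustrative

    cuDoubleComplex *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, sizeof(cuDoubleComplex) * npw * nband);
    cudaMalloc((void**)&d_B, sizeof(cuDoubleComplex) * npw * nband);
    cudaMalloc((void**)&d_C, sizeof(cuDoubleComplex) * nband * nband);
    // ... fill d_A (e.g., H|psi>) and d_B (|psi>) on the device ...

    cublasHandle_t h;
    cublasCreate(&h);

    const cuDoubleComplex one  = make_cuDoubleComplex(1.0, 0.0);
    const cuDoubleComplex zero = make_cuDoubleComplex(0.0, 0.0);
    // C (nband x nband) = A^H (nband x npw) * B (npw x nband)
    cublasZgemm(h, CUBLAS_OP_C, CUBLAS_OP_N,
                nband, nband, npw,
                &one, d_A, npw, d_B, npw,
                &zero, d_C, nband);

    cudaDeviceSynchronize();
    cublasDestroy(h);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    return 0;
}
```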
MAGMA: Matrix Algebra on GPU and Multicore Architectures
- A project by the Innovative Computing Laboratory, University of Tennessee
- Free to use; first stable release 25-08-2011, current version 1.3
- The computation is split so that the most time-consuming parts are offloaded to the GPU
- Goal: address hybrid CPU+GPU environments (a hedged usage sketch follows below)
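A hedged sketch of the kind of MAGMA call used for the block diagonalization (the two-step workspace-query pattern mirrors LAPACK's dsyevd; header names and API details vary across MAGMA versions, so treat this as an assumption-laden illustration, not ABINIT's code):

```cpp
// Sketch: dense symmetric eigensolver on the GPU via MAGMA.
#include <magma_v2.h>   // older releases (e.g., 1.x) ship magma.h instead
#include <vector>

int main()
{
    magma_init();

    const magma_int_t n = 512;          // subspace dimension (illustrative)
    std::vector<double> A(n * n), w(n); // column-major matrix and eigenvalues
    // ... fill A with the symmetric Rayleigh-Ritz subspace matrix ...

    // Workspace query (lwork = -1), then the actual eigen-decomposition.
    // MAGMA offloads the heavy parts to the GPU internally.
    magma_int_t info, lwork = -1, liwork = -1, iwork_query;
    double work_query;
    magma_dsyevd(MagmaVec, MagmaLower, n, A.data(), n, w.data(),
                 &work_query, lwork, &iwork_query, liwork, &info);
    lwork  = (magma_int_t)work_query;
    liwork = iwork_query;
    std::vector<double> work(lwork);
    std::vector<magma_int_t> iwork(liwork);
    magma_dsyevd(MagmaVec, MagmaLower, n, A.data(), n, w.data(),
                 work.data(), lwork, iwork.data(), liwork, &info);

    magma_finalize();
    return (int)info;   // 0 on success
}
```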
ITERATIVE EIGENSOLVER: PERFORMANCE

TGCC-Curie (Intel Westmere + NVIDIA M2090)
- Highly dependent on the physical system
- Depends only on the size of the blocks used in the LOBPCG algorithm

[Chart: speedup of the cuBLAS/MAGMA code on a single MPI process vs. bandpp (2, 4, 10; a factor related to the block/band size), y-axis 0.0 to 4.0; test cases: 215 silicon atoms, 107 gold atoms, 31 copper atoms, 3 UO2 atoms]
OUTLINE (recap)
- Introduction: GPU computing; density functional theory with plane waves and how to port it to the GPU
- ABINIT on GPU: implementation; performance of ABINIT v7.0
- How to compile and use it
PERFORMANCE ON A SINGLE GPU

Comparing architectures:
- TGCC-Titane: Intel Nehalem + NVIDIA S1070 (Tesla generation)
- TGCC-Curie: Intel Westmere + NVIDIA M2090 (Fermi generation)

The results are strongly architecture-dependent:
- Titane: fast CPU + old GPU
- The BaTiO3 case is too small to benefit from the GPU

Time on MPI process 0 (s):
System          | Titane CPU | Titane CPU+GPU | Speedup | Curie CPU | Curie CPU+GPU | Speedup
107 gold atoms  | 4230 | 1857 | 2.3 | 5154 | 621 | 8.3
31 copper atoms | 153  | 102  | 1.5 | 235  | 64  | 3.7
5 BaTiO3 atoms  | 85   | 98   | 0.9 | 86   | 73  | 1.2
PERFORMANCE ON MULTIPLE GPUs

TGCC-Curie: Intel Westmere + NVIDIA M2090 (Fermi). Efficiency drops as the number of MPI processes increases (the load per GPU decreases).

8 CPUs (MPI only) + 8 GPUs, time on MPI process 0 (s):
System                      | CPU   | CPU+GPU | Speedup
215 silicon atoms (1 iter.) | 16160 | 3680    | 4.4
107 gold atoms              | 866   | 621     | 4.7
31 copper atoms             | 49    | 64      | 1.5

200 CPUs (MPI only) + 200 GPUs, time on MPI process 0 (s):
System                       | CPU   | CPU+GPU | Speedup
215 silicon atoms (45 iter.) | 25856 | 6309    | 4.1
PERFORMANCE VS. OTHER CODES

Other DFT codes have been ported to GPUs; the speedups of plane-wave codes are similar:
- Quantum Espresso (plane waves): 3-4x (sequential), 2-3x (parallel) [1,2]
- VASP (plane waves): 3-4x [3,4]
- BigDFT (wavelets): 5-7x [5]
- GPAW (real space): 10-15x [1,6]

References:
[1] Harju et al., Applied Parallel and Scientific Computing, Lecture Notes in Computer Science 7782, p. 3 (2013)
[2] Spiga et al., Proc. 20th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), p. 368 (2012)
[3] Maintz et al., Computer Physics Communications 182, p. 1421 (2011)
[4] Hacene et al., Journal of Computational Chemistry 33, p. 2581 (2012)
[5] Genovese et al., Journal of Chemical Physics 131, 034103 (2009)
[6] Hakala et al., PARA 2012, LNCS vol. 7782, p. 63, Springer, Heidelberg (2013)
Gaussian
GTC, April 4-7, 2016 | Silicon Valley
ENABLING THE ELECTRONIC STRUCTURE PROGRAM GAUSSIAN ON GPGPUS USING OPENACC
Roberto Gomperts (NVIDIA), Michael Frisch (Gaussian, Inc.), Giovanni Scalmani (Gaussian, Inc.), Brent Leback (NVIDIA/PGI)
PREVIOUSLY: EARLIER PRESENTATIONS
- GRC poster, 2012
- ACS Spring 2014
- GTC Spring 2014 (recording at http://on-demand.gputechconf.com/gtc/2014/video/S4613-enabling-gaussian-09-gpgpus.mp4)
- WATOC Fall 2014
- GTC Spring 2016 (full recording at http://mygtc.gputechconf.com/quicklink/4r13O5r; requires registration)
CURRENT STATUS: SINGLE NODE

Implemented:
- Energies for closed- and open-shell HF and DFT (fewer than a handful of XC functionals missing)
- First derivatives for the same
- Second derivatives for the same

Using only:
- OpenACC (see the sketch below)
- CUDA library calls (BLAS)
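To illustrate the OpenACC style this kind of port relies on, here is a minimal sketch (illustrative loop and data, not Gaussian source code) of offloading a Fock-like contraction with directives alone:

```cpp
// Sketch: directive-based GPU offload with OpenACC (compile with, e.g., -acc).
#include <cstdio>
#include <vector>

int main()
{
    const int n = 1024;
    std::vector<double> d(n * n, 0.1), g(n * n, 0.05), f(n * n, 0.0);
    double* D = d.data(); double* G = g.data(); double* F = f.data();

    // copyin moves inputs to the GPU; copyout returns the result.
    #pragma acc parallel loop collapse(2) copyin(D[0:n*n], G[0:n*n]) copyout(F[0:n*n])
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            #pragma acc loop seq
            for (int k = 0; k < n; ++k)
                sum += D[i * n + k] * G[k * n + j];  // stand-in for the real two-electron work
            F[i * n + j] = sum;
        }

    printf("F[0] = %f\n", F[0]);
    return 0;
}
```

The appeal, as the slides note, is that the same source still compiles and runs on the CPU when the directives are ignored.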
GAUSSIAN PARALLELISM MODEL
CPU cluster -> CPU node (OpenMP) -> GPU (OpenACC)
CLOSING REMARKS
- Significant progress has been made in enabling Gaussian on GPUs with OpenACC
- OpenACC is becoming increasingly versatile
- Significant work lies ahead to improve performance
- The feature set will expand: PBC, solvation, MP2, ONIOM, triples corrections
EARLY PERFORMANCE RESULTS: CCSD(T)

[Chart: ibuprofen CCSD(T) calculation speedups relative to a CPU-only full node, comparing 20 threads + 0 GPUs against 6 threads + 6 GPUs, broken out as Total, l913 ("triples"), and iterative CCSD; bar labels give times in hours: 24.8, 24.8, 22.7, 2.1, 3.7, 3.6, 1.3, 2.3]

System: 2 sockets E5-2690 v2 (2x 10 cores @ 3.0 GHz); 128 GB RAM (DDR3-1600), 108 GB used
GPUs: 6x Tesla K40m (15 SMs @ 875 MHz), 12 GB global memory each

Method: CCSD(T) | Atoms: 33 | Basis set: 6-31G(d,p) | Basis functions: 315 | Occupied orbitals: 41 | Virtual orbitals: 259 | Cycles: 15 | CCSD iterations: 16
VASP
GPU VASP COLLABORATION

2013-2014 project scope: minimization algorithms to calculate the electronic ground state
- Blocked Davidson (ALGO = Normal & Fast)
- RMM-DIIS (ALGO = VeryFast & Fast)
- k-points
- Optimization of the critical step in exact-exchange calculations

Collaborators include U of Chicago.

Earlier work:
- "Speeding up plane-wave electronic-structure calculations using graphics-processing units", Maintz, Eck, Dronskowski
- "VASP on a GPU: application to exact-exchange calculations of the stability of elemental boron", Hutchinson, Widom
- "Accelerating VASP Electronic Structure Calculations Using Graphic Processing Units", Hacene, Anciaux-Sedrakian, Rozanska, Klahr, Guignon, Fleurat-Lessard

Target workloads:
- Silica ("medium"): 7 Å-thick slab of amorphous silica, 240 atoms (Si68O148H24); RMM-DIIS (ALGO = VeryFast)
- NiAl-MD ("large"): liquid-metal molecular dynamics sample of a nickel-based superalloy; 500 atoms, 9 chemical species in total; blocked Davidson (ALGO = Normal)
- INTERFACE ("large"): interface of platinum metal with water; 108 Pt atoms and 120 water molecules (468 atoms); blocked Davidson & RMM-DIIS (ALGO = Fast). A minimal INCAR sketch for this setting follows below.
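For orientation, this is what selecting those algorithms looks like in a VASP input file. Only the ALGO values come from the slide; the remaining tags are common placeholders, not the actual benchmark inputs:

```
! Illustrative INCAR fragment (hypothetical values except ALGO)
ALGO  = Fast    ! blocked Davidson + RMM-DIIS, as used for INTERFACE
NSIM  = 4       ! bands treated simultaneously in RMM-DIIS (placeholder)
ENCUT = 400     ! plane-wave cutoff in eV (placeholder)
EDIFF = 1E-5    ! electronic convergence criterion (placeholder)
```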
RUNTIME DISTRIBUTION FOR SILICA

[Chart: time in seconds for 1 K40 GPU + 1 IvyBridge core, 0 to 3500 s, comparing the CPU run, the original GPU port, and the optimized GPU port; each bar split into memcopy, GEMM, FFT, and other]
March 2016: VASP 5.4.1 with patch #1 (Haswell & K80)
VASP INTERFACE BENCHMARK

Running VASP version 5.4.1, blocked Davidson + RMM-DIIS (ALGO = Fast). The CPU-only node contains dual Intel Xeon E5-2698 v3 @ 2.3 GHz (Haswell) CPUs; the GPU nodes add Tesla K80 (autoboost) GPUs. "[Old Data]" = pre-bugfix results; patch #1 for vasp.5.4.1.05Feb15 (available at VASP.at) yields up to 2.7x faster calculations.

Elapsed time (s); lower is better:
1 Node (CPU only)          | 828.27
1 Node + 1x K80            | 580.63 (1.4x)
1 Node + 2x K80 [Old Data] | 506.25 (1.6x)
1 Node + 2x K80            | 387.26 (2.1x)
1 Node + 4x K80            | 284.06 (3.0x)
VASP SILICA IFPEN BENCHMARK

Running VASP version 5.4.1, RMM-DIIS (ALGO = VeryFast). Same node configurations and "[Old Data]" definition as above.

Elapsed time (s); lower is better:
1 Node (CPU only)          | 542.87
1 Node + 1x K80            | 432.48 (1.3x)
1 Node + 2x K80 [Old Data] | 372.74 (1.5x)
1 Node + 2x K80            | 258.14 (2.1x)
1 Node + 4x K80            | 210.47 (2.6x)
VASP SI-HUGE BENCHMARK

Running VASP version 5.4.1, blocked Davidson + RMM-DIIS (ALGO = Fast). Same node configurations and "[Old Data]" definition as above.

Elapsed time (s); lower is better:
1 Node (CPU only)          | 6464.42
1 Node + 1x K80            | 4296.15 (1.5x)
1 Node + 2x K80 [Old Data] | 2871.09 (2.3x)
1 Node + 2x K80            | 2865.92 (2.3x)
1 Node + 4x K80            | 1987.94 (3.3x)
VASP SUPPORTEDSYSTEMS BENCHMARK

Running VASP version 5.4.1, blocked Davidson + RMM-DIIS (ALGO = Fast). Same node configurations and "[Old Data]" definition as above.

Elapsed time (s); lower is better:
1 Node (CPU only)          | 310.63
1 Node + 1x K80            | 252.43 (1.2x)
1 Node + 2x K80 [Old Data] | 228.90 (1.4x)
1 Node + 2x K80            | 164.21 (1.9x)
1 Node + 4x K80            | 129.99 (2.4x)
VASP NIAL-MD BENCHMARK

Running VASP version 5.4.1, blocked Davidson + RMM-DIIS (ALGO = Fast). Same node configurations and "[Old Data]" definition as above.

Elapsed time (s); lower is better:
1 Node (CPU only)          | 722.82
1 Node + 1x K80            | 237.02 (3.0x)
1 Node + 2x K80 [Old Data] | 197.19 (3.7x)
1 Node + 2x K80            | 184.93 (3.9x)
1 Node + 4x K80            | 142.98 (5.1x)
VASP B.hR105 BENCHMARK

Running VASP version 5.4.1, hybrid functional with blocked Davidson (ALGO = Normal). Same node configurations and "[Old Data]" definition as above.

Elapsed time (s); lower is better:
1 Node (CPU only)          | 1196.86
1 Node + 1x K80            | 604.05 (2.0x)
1 Node + 2x K80 [Old Data] | 334.44 (3.6x)
1 Node + 2x K80            | 343.69 (3.5x)
1 Node + 4x K80            | 236.13 (5.1x)
VASP B.aP107 BENCHMARK

Running VASP version 5.4.1, hybrid-functional (exact exchange) calculation with blocked Davidson (ALGO = Normal); no k-point parallelization. Same node configurations and "[Old Data]" definition as above.

Elapsed time (s); lower is better:
1 Node (CPU only)          | 27993.62
1 Node + 1x K80            | 7494.36 (3.7x)
1 Node + 2x K80 [Old Data] | 7632.31 (3.7x)
1 Node + 2x K80            | 4432.31 (6.3x)
1 Node + 4x K80            | 4513.40 (6.2x)
GPU-ACCELERATED MOLECULAR DYNAMICS APPS

ACEMD, AMBER, CHARMM, DESMOND, ESPResSo, Folding@Home, GPUGrid.net, GROMACS, HALMD, HOOMD-Blue, LAMMPS, mdcore, MELD, NAMD, OpenMM, PolyFTS

Green lettering (in the original slide) indicates that performance slides are included. GPU performance is compared against a dual multi-core x86 CPU socket.
NAMD 2.11
NEW GPU FEATURES IN NAMD 2.11
- GPU-accelerated simulations up to twice as fast as NAMD 2.10
- Pressure calculation with fixed atoms on the GPU works as on the CPU
- Improved scaling for GPU-accelerated particle-mesh Ewald: CPU-side operations overlap better and are parallelized across cores
- Improved scaling for GPU-accelerated simulations: nonbonded force calculation results are streamed from the GPU for better overlap
- NVIDIA CUDA GPU-acceleration binaries for Mac OS X
(Selected text from the NAMD website)
NAMD 2.11 IS UP TO 2X FASTER

[Chart: simulated time (ns/day) for ApoA1 (92,224 atoms) on 1, 2, and 4 nodes; NAMD 2.11 is 1.2x, 1.6x, and 2.0x faster than NAMD 2.10 respectively. Nodes: dual Intel E5-2697 v2 @ 2.7 GHz (IvyBridge) CPUs + 2x Tesla K80 (autoboost) GPUs]
NAMD 2.11 APOA1 ON 1 AND 2 NODES

Running NAMD version 2.11. CPU-only nodes: dual Intel E5-2698 v3 @ 2.3 GHz (Haswell); GPU nodes add Tesla K80 (autoboost) GPUs.

ApoA1 (92,224 atoms), simulated time (ns/day); higher is better:
1 Node (CPU only)  | 2.77
1 Node + 1x K80    | 11.67 (4.2x)
1 Node + 2x K80    | 16.99 (6.1x)
2 Nodes (CPU only) | 5.22
2 Nodes + 1x K80   | 19.73 (3.8x)
2 Nodes + 2x K80   | 24.31 (4.7x)
NAMD 2.11 APOA1 ON 4 AND 8 NODES

Running NAMD version 2.11. Same node configurations as above.

ApoA1 (92,224 atoms), simulated time (ns/day); higher is better:
4 Nodes (CPU only) | 10.27
4 Nodes + 1x K80   | 20.64 (2.0x)
4 Nodes + 2x K80   | 23.52 (2.3x)
8 Nodes (CPU only) | 16.85
8 Nodes + 1x K80   | 27.83 (1.7x)
8 Nodes + 2x K80   | 27.74 (1.6x)
NAMD 2.11 STMV ON 1 AND 2 NODES

Running NAMD version 2.11. Same node configurations as above.

STMV (1,066,628 atoms), simulated time (ns/day); higher is better:
1 Node (CPU only)  | 0.23
1 Node + 1x K80    | 1.03 (4.5x)
1 Node + 2x K80    | 1.75 (7.6x)
2 Nodes (CPU only) | 0.46
2 Nodes + 1x K80   | 1.98 (4.3x)
2 Nodes + 2x K80   | 3.27 (7.1x)
NAMD 2.11 STMV ON 4 AND 8 NODES

Running NAMD version 2.11. Same node configurations as above.

STMV (1,066,628 atoms), simulated time (ns/day); higher is better:
4 Nodes (CPU only) | 0.90
4 Nodes + 1x K80   | 3.61 (4.0x)
4 Nodes + 2x K80   | 4.54 (5.0x)
8 Nodes (CPU only) | 1.74
8 Nodes + 1x K80   | 5.86 (3.4x)
8 Nodes + 2x K80   | 6.24 (3.6x)
AMBER 14
JAC ON K40s, K80s AND M6000s

Running AMBER version 14. The CPU-only node contains dual Intel E5-2698 v3 @ 2.3 GHz (3.6 GHz Turbo) CPUs; the GPU nodes add NVIDIA Tesla K40 @ 875 MHz, Tesla K80 @ 562 MHz (autoboost), or Quadro M6000 @ 987 MHz GPUs.

PME-JAC_NVE, simulated time (ns/day); higher is better:
1 Haswell node (CPU only) | 37.38
1 CPU node + 1x K40       | 134.82 (3.6x)
1 CPU node + 0.5x K80     | 121.30 (3.2x)
1 CPU node + 1x K80       | 174.34 (4.7x)
1 CPU node + 1x M6000     | 161.53 (4.3x)
1 CPU node + 2x K40       | 200.34 (5.4x)
1 CPU node + 2x K80       | 225.34 (6.0x)
1 CPU node + 2x M6000     | 219.83 (5.9x)
CELLULOSE ON K40s, K80s AND M6000s

Running AMBER version 14. Same node configurations as above.

PME-Cellulose_NVE, simulated time (ns/day); higher is better:
1 Haswell node (CPU only) | 1.93
1 CPU node + 1x K40       | 8.96 (4.6x)
1 CPU node + 0.5x K80     | 7.87 (4.1x)
1 CPU node + 1x K80       | 11.76 (6.1x)
1 CPU node + 1x M6000     | 10.49 (5.4x)
1 CPU node + 2x K40       | 13.67 (7.1x)
1 CPU node + 2x K80       | 15.38 (8.0x)
1 CPU node + 2x M6000     | 14.90 (7.7x)
FACTOR IX ON K40s, K80s AND M6000s

Running AMBER version 14. Same node configurations as above.

PME-FactorIX_NVE, simulated time (ns/day); higher is better:
1 Haswell node (CPU only) | 9.68
1 CPU node + 1x K40       | 40.48 (4.2x)
1 CPU node + 0.5x K80     | 33.59 (3.5x)
1 CPU node + 1x K80       | 50.70 (5.2x)
1 CPU node + 1x M6000     | 47.80 (5.0x)
1 CPU node + 2x K40       | 61.18 (6.4x)
1 CPU node + 2x K80       | 60.93 (6.3x)
1 CPU node + 2x M6000     | 66.89 (7.0x)
JAC ON M40s

Running AMBER version 14. The CPU-only node contains a single Intel Xeon E5-2698 v3 @ 2.3 GHz (Haswell) CPU; the GPU nodes contain a single Intel Xeon E5-2697 v2 @ 2.7 GHz (IvyBridge) CPU + Tesla M40 (autoboost) GPUs.

PME-JAC_NVE, simulated time (ns/day); higher is better:
1 Node (CPU only) | 21.11
1 Node + 1x M40   | 157.68 (7.5x)
1 Node + 2x M40   | 230.18 (10.9x)
1 Node + 4x M40   | 246.15 (11.7x)
CELLULOSE ON M40s

Running AMBER version 14. Same node configurations as above.

PME-Cellulose_NPT, simulated time (ns/day); higher is better:
1 Node (CPU only) | 1.07
1 Node + 1x M40   | 10.12 (9.5x)
1 Node + 2x M40   | 14.40 (13.5x)
1 Node + 4x M40   | 15.90 (14.9x)
MYOGLOBIN ON M40s

Running AMBER version 14. Same node configurations as above.

GB-Myoglobin, simulated time (ns/day); higher is better:
1 Node (CPU only) | 9.83
1 Node + 1x M40   | 232.20 (23.6x)
1 Node + 2x M40   | 300.86 (30.6x)
1 Node + 4x M40   | 322.09 (32.8x)
NUCLEOSOME ON M40s

Running AMBER version 14. Same node configurations as above.

GB-Nucleosome, simulated time (ns/day); higher is better:
1 Node (CPU only) | 0.13
1 Node + 1x M40   | 4.67 (35.9x)
1 Node + 2x M40   | 9.05 (69.6x)
1 Node + 4x M40   | 16.11 (123.9x)
GROMACS 5.1 (October 2015)
3M WATERS ON K40s AND K80s

Running GROMACS version 5.1. The CPU-only node contains dual Intel E5-2698 v3 @ 2.3 GHz CPUs; the GPU nodes add NVIDIA Tesla K40 @ 875 MHz or Tesla K80 @ 562 MHz (autoboost) GPUs.

Water [PME] 3M, simulated time (ns/day); higher is better:
1 Haswell node (CPU only) | 0.81
1 CPU node + 1x K40       | 1.32 (1.6x)
1 CPU node + 1x K80       | 1.88 (2.3x)
1 CPU node + 2x K40       | 1.85 (2.3x)
1 CPU node + 2x K80       | 2.72 (3.4x)
1 CPU node + 4x K40       | 2.76 (3.4x)
1 CPU node + 4x K80       | 3.23 (4.0x)
3M WATERS ON TITAN X

Running GROMACS version 5.1. The CPU-only node contains dual Intel E5-2698 v3 @ 2.3 GHz CPUs; the GPU nodes add GeForce GTX Titan X @ 1000 MHz GPUs.

Water [PME] 3M, simulated time (ns/day); higher is better:
1 Haswell node (CPU only) | 0.81
1 CPU node + 1x Titan X   | 1.53 (1.9x)
1 CPU node + 2x Titan X   | 2.36 (2.9x)
1 CPU node + 4x Titan X   | 2.99 (3.7x)
TESLA P100 ACCELERATORS

Tesla P100 for NVLink-enabled servers: 5.3 TF DP, 10.6 TF SP, 21 TF HP; 720 GB/s memory bandwidth, 16 GB.
Tesla P100 for PCIe-based servers: 4.7 TF DP, 9.3 TF SP, 18.7 TF HP; config 1: 16 GB at 720 GB/s, config 2: 12 GB at 540 GB/s.

PAGE MIGRATION ENGINE: unified memory between the CPU and Tesla P100
- Simpler parallel programming
- Virtually unlimited data size
- Performance with data locality (see the sketch below)
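A minimal sketch of what that means in practice: with CUDA Unified Memory, a single allocation is valid on both CPU and GPU, and the P100 Page Migration Engine moves pages on demand, even when the working set is larger than GPU memory. The size and kernel below are illustrative.

```cpp
// Sketch: CUDA Unified Memory with on-demand page migration.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(double* x, size_t n, double a)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main()
{
    const size_t n = 1 << 26;  // illustrative; could exceed GPU memory on P100
    double* x;
    cudaMallocManaged(&x, n * sizeof(double));  // one pointer, valid on CPU and GPU

    for (size_t i = 0; i < n; ++i) x[i] = 1.0;  // touch on the CPU first

    scale<<<(unsigned)((n + 255) / 256), 256>>>(x, n, 2.0);  // pages migrate to the GPU on demand
    cudaDeviceSynchronize();

    printf("x[0] = %f\n", x[0]);  // pages migrate back on CPU access
    cudaFree(x);
    return 0;
}
```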
EXTRAORDINARY STRONG SCALING: ONE STRONG NODE FASTER THAN LOTS OF WEAK NODES

[Chart 1, VASP performance: speedup vs. one CPU server node, plotted against the number of CPU server nodes (up to 32); the 2x, 4x, and 8x P100 single-PCIe-node results are marked against the CPU scaling curve, with 8x P100 landing near the 32-CPU-node level]
[Chart 2, AMBER simulation of CRISPR, nature's tool for genome editing: ns/day vs. number of processors (CPUs and GPUs); 1 node with 4x P100 GPUs matches about 48 CPU nodes of the Comet supercomputer]

CPU: dual-socket Intel E5-2680 v3, 12 cores, 128 GB DDR4 per node, FDR IB. VASP 5.4.1_05Feb16, Si-Huge dataset; the 16- and 32-node points are estimated from the observed 4-to-8-node scaling. AMBER 16 pre-release; CRISPR system based on PDB ID 5f9r, 336,898 atoms.
MASSIVE LEAP IN PERFORMANCE

[Chart: speedup vs. a dual-socket Broadwell node for NAMD, VASP, HOOMD-Blue, and AMBER with 2x K80, 2x P100 (PCIe), and 4x P100 (PCIe); y-axis 0x to 30x]

CPU: Xeon E5-2697 v4, 2.3 GHz (3.6 GHz Turbo). Caffe AlexNet: batch size of 256 images; VASP, NAMD, HOOMD-Blue, and AMBER: average speedup across a basket of tests.
TCO OVERVIEW: EQUAL PERFORMANCE IN FEWER RACKS; A SMALLER, MORE EFFICIENT BUDGET

[Chart: number of racks (power required) for equal performance, scale from 1 rack up to 10 racks (400 kW). Caffe AlexNet: 1280 CPUs / 350 kW; VASP: 960 CPUs / 260 kW (7 to 9 racks of CPU servers in the original figure) vs. a single rack of 20 CPUs + 80 P100s / 32 kW]

Budget breakdown (source: Microsoft Research on datacenter costs): in the dense accelerated deployment, compute servers account for 85% of spend and non-compute for 15%; in the CPU-only deployment, compute servers account for only 39%, with 61% going to non-compute items (racks, cabling, infrastructure, networking).
NVIDIA DGX-1: THE WORLD'S FIRST DEEP LEARNING SUPERCOMPUTER
- 42.5 TFLOPS FP64 / 85 TFLOPS FP32 / 170 TFLOPS FP16
- 8x Tesla P100 16 GB, NVLink hybrid cube-mesh interconnect
- Accelerates major AI frameworks
- Dual Xeon CPUs
- 7 TB SSD deep learning cache
- Dual 10 GbE, quad 100 Gb InfiniBand
- 3RU, 3200 W
THANK YOU! Q&A: WHAT CAN I HELP YOU BUILD?