April 4-7, 2016 | Silicon Valley
Roberto Gomperts (NVIDIA), Michael Frisch (Gaussian, Inc.), Giovanni Scalmani (Gaussian, Inc.), Brent Leback (NVIDIA/PGI)
ENABLING THE ELECTRONIC STRUCTURE PROGRAM GAUSSIAN ON GPGPUS USING OPENACC
PREVIOUSLY: Earlier Presentations
GRC Poster 2012
ACS Spring 2014
GTC Spring 2014 (recording at http://on-demand.gputechconf.com/gtc/2014/video/S4613-enabling-gaussian-09-gpgpus.mp4)
WATOC Fall 2014
TOPICS
Gaussian: Design Guidelines, Parallelism and Memory Model
Implementation: Top-Down/Bottom-Up
OpenACC: Extensions, Hints & Tricks
Early Performance
Closing Remarks
GAUSSIAN
A Computational Chemistry Package that provides state-of-the-art capabilities for electronic structure modeling
Gaussian 09 is licensed for a wide variety of computer systems
All versions of Gaussian 09 contain virtually every scientific/modeling feature, and none imposes artificial limitations on calculations beyond available computational resources and time constraints
Researchers use Gaussian to, among other things, study molecules and reactions; predict and interpret spectra; explore thermochemistry, photochemistry and other excited states; and model solvent effects
4/1/2016
DESIGN GUIDELINES
General
Establish a Framework for the GPU-enabling of Gaussian
Code Maintainability (Code Unification)
Leverage Existing code/algorithms, including Parallelism and Memory Model
Simplifies Resolving Problems
Simplifies Improvement on existing code
Simplifies Adding New Code
DESIGN GUIDELINES
Accelerate Gaussian for Relevant and Appropriate Theories and Methods
Relevant: used by many Gaussian users
Appropriate: time-consuming and maps well to GPUs
Resource Utilization
Ensure efficient use of all available Computational Resources
CPU cores and memory
Available GPUs and memory
CURRENT STATUS Single Node
Implemented
Energies for Closed and Open Shell HF and DFT (less than a handful of XC-functionals missing)
First derivatives for the same as above
Second derivatives for the same as above
Using only
OpenACC
CUDA library calls (BLAS)
IMPLEMENTATION MODEL
[Diagram: Application Code = Compute-Intensive Functions + Rest of Sequential CPU Code; the compute-intensive functions are a small fraction of the code but a large fraction of the execution time and run on the GPU, while the rest of the sequential code stays on the CPU]
GAUSSIAN PARALLELISM MODEL
[Diagram: OpenMP parallelism across the CPU cluster and within each CPU node; OpenACC parallelism on the attached GPUs]

GAUSSIAN: MEMORY MODEL
[Diagram: the same CPU cluster / CPU node (OpenMP) / GPU (OpenACC) hierarchy, showing the memory attached at each level]
TREES IN THE FOREST OpenMP Parallel Region
[Figure: static call tree of the OpenMP parallel region, from the driver PRISMC through integrals generation (BRADRV, KETDRV, the DGST*/DG02* shell-quartet routines, and many others) and integrals “digestion” (DIGJE, DIGRAF, PRMDIG, ...); some 200 routines in all]
APPROACH Data and Compute Regions Management
Data Management Top Down:
Create and initialize as appropriate large data region on device
Reuse the device memory as Compute Regions are enabled
Compute Bottom Up:
Create as many Accelerator Routines as possible
Incrementally add Compute Regions driving the Accelerator Routines
Incrementally add Routines with own Compute Regions
OPENACC OpenMP Parallel Region
OpenACC directives at the “leaves”
Device Memory Management at highest possible level
Move Directives up the calling tree
[Figure: the same static call tree, with OpenACC directives applied at the leaf routines and device memory management moved up the calling tree]
OPENACC PGI Extensions
Data Directives
NoCreate (Also in Compute Directives)
Compute Directives
Gang levels: (Dim:1,2,3)
Collapse (Force:N)
OPENACC EXTENSIONS NoCreate
Subroutine DoSomething(NTT,FA,FB,…)
Real*8 FA(*), FB(*)
C$ACC Parallel If(OnGPU)
C$ACC+ Present(FA) NoCreate(FB)
C$ACC Loop Gang Vector
      Do I = 1,NTT
        FA(I) = expression
        If(OpenShell) then
          FB(I) = expression
        EndIf
      EndDo
C$ACC End Parallel
FA is always allocated On Device
FB only allocated if OpenShell=.T.
Note: The If(OnGPU) clause requires compilation with –ta=host,nvidia
OPENACC EXTENSIONS Collapse Force
Subroutine DoSomething(N,NV,NU,IVIND,IUIND,W,…)
Integer IVIND(*),IUIND(*)
Real*8 W(N,*,*)
C$ACC Routine(DoWork) Vector
C$ACC Parallel If(OnGPU) Present(IVIND,IUIND,W)
C$ACC Loop Gang Collapse(Force:2)
      Do IV = 1,NV
        IVD = IVIND(IV)
        Do IU = 1,NU
          IUD = IUIND(IU)
          Call DoWork(N,W(1,IUD,IVD))
        EndDo
      EndDo
C$ACC End Parallel
The IVD assignment between the loops prevents a regular Collapse
OPENACC EXTENSIONS Multiple Dimension Gangs
Subroutine DoSomething(N,NV,NU,W,…)
Real*8 W(N,*,*)
C$ACC Routine(DoWork) Gang
C$ACC Parallel If(OnGPU) Present(W)
C$ACC+ Num_Gangs(nGng1,nGng2,nGng3) Vector_Length(lvGPU)
C$ACC Loop Gang(Dim:3)
      Do IV = 1,NV
C$ACC Loop Gang(Dim:2)
        Do IU = 1,NU
          Call DoWork(N,W(1,IU,IV))
        EndDo
      EndDo
C$ACC End Parallel
DoWork contains a Gang Vector Loop
HINTS & TRICKS Out of OpenACC Scope & Bind
Bind clause
A routine may be called in different contexts:
With its own OpenACC compute region
Outside any OpenACC scope
Within an OpenACC compute region, in different modes: Gang-Vector, Vector, Seq
HINTS & TRICKS Bottom-Up approach
Subroutine DoSomething(N,A,…)
Real*8 A(*)
C$ACC Routine(aClear) Gang
…
      Call aClear(N,A)
…
C$ACC Parallel If(OnGPU) Present(A)
C$ACC+ Num_Gangs(nGng1) Vector_Length(lvGPU)
      Call aClear(N,A)
C$ACC End Parallel
…
aClear clears Host copy
aClear clears Device copy If OnGPU =.T.
HINTS & TRICKS Out of Scope or in Gang Compute Region
Subroutine aClear(N,A)
Real*8 A(*)
Parameter(Zero=0.0d0)
C$ACC Routine Gang
C$ACC Loop Gang Vector
      Do I = 1, N
        A(I) = Zero
      EndDo
aClear can be called outside an OpenACC region
HINTS & TRICKS Bottom-Up approach
Subroutine DoSomething(N,A,…)
Real*8 A(*)
C$ACC Routine(aClear) Gang
…
      Call aClrGP(N,A)
      Call aClear(N,A)
…
C$ACC Parallel If(OnGPU) Present(A)
C$ACC+ Num_Gangs(nGng1) Vector_Length(lvGPU)
      Call aClear(N,A)
C$ACC End Parallel
…
aClrGP clears Device copy If OnGPU =.T.
aClear clears Host copy
aClear clears Device copy If OnGPU =.T.
HINTS & TRICKS Inside or outside OpenACC scope
Subroutine aClrGP(N,A)
Real*8 A(*)
Parameter(Zero=0.0d0)
C$ACC Kernels If(OnGPU) Present(A)
C$ACC Loop Gang Vector
      Do I = 1, N
        A(I) = Zero
      EndDo
Clears Device copy of A if OnGPU = .T.
Otherwise Host copy
HINTS & TRICKS Multiple ARs in Compute Region
Subroutine DoSomething(N,A,B,…)
Real*8 A(3),B(*)
C$ACC Routine(aClear) Gang
C$ACC Routine(Use_A) Gang
…
C$ACC Parallel If(OnGPU) Present(A,B)
C$ACC+ Num_Gangs(nGng1) Vector_Length(lvGPU)
      Call aClear(3,A)
      Call Use_A(N,A,B,…)
C$ACC End Parallel
…
Data Dependency Hazard
A has to be initialized before being used
N>3
HINTS & TRICKS Bind
Subroutine DoSomething(N,A,B,…)
Real*8 A(3),B(*)
C$ACC Routine(aClear) Seq Bind(aClear_s)
C$ACC Routine(Use_A) Gang
…
C$ACC Parallel If(OnGPU) Present(A,B)
C$ACC+ Num_Gangs(nGng1) Vector_Length(lvGPU)
      Call aClear(3,A)
      Call Use_A(N,A,B,…)
C$ACC End Parallel
…
Allows the same name, aClear, to be used in both contexts
HINTS & TRICKS Bound Seq
Subroutine aClear_s(N,A)
Real*8 A(*)
Parameter(Zero=0.0d0)
C$ACC Routine Seq
C$ACC Loop Seq
      Do I = 1, N
        A(I) = Zero
      EndDo
Seq instead of Gang
HINTS & TRICKS Conditional on GPU or Host
Beyond the “if” clause in Compute Regions
Marking a section of code in a Compute Region that should only be executed by the host or the device
ACC_On_Device(ACC_Device_HOST)
ACC_On_Device(ACC_Device_Not_HOST)
PGI’s Implementation
Not exactly like an “#ifdef” macro but close
ACC_ON_DEVICE(…)
Subroutine DoSomething(…)
Implicit Real*8(A-H,O-Z)
C$ACC Routine Seq
C$ACC Routine(DeviceSub) Seq
…
      If(ACC_On_Device(ACC_Device_HOST)) then
        Call HostSub(…)
      EndIf
      If(ACC_On_Device(ACC_Device_Not_HOST)) then
        Call DeviceSub(…)
      EndIf
…
HostSub runs only on the Host; no need for a Routine directive
DeviceSub runs only on Device
ADVANCED OPTIMIZATIONS
It is imperative to use a Dynamic Load Distribution Mechanism
Already in place for CPU parallelism
Move work from the GPU to CPU cores on the fly
Improve performance in certain places by controlled replication of matrices
PERFORMANCE ASSESSMENT Hardware
Processor
Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz (32 cores/2 sockets)
Memory: 256GB
GPU
Tesla K80 (8 GPUs/2 Boards), Boost Clocks: MEM 2505, SM 875
Topology
       GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  CPU Affinity
GPU0    X    PIX   SOC   SOC   SOC   SOC   SOC   SOC   0-15
GPU1   PIX    X    SOC   SOC   SOC   SOC   SOC   SOC   0-15
GPU2   SOC   SOC    X    PIX   PHB   PHB   PHB   PHB   16-31
GPU3   SOC   SOC   PIX    X    PHB   PHB   PHB   PHB   16-31
GPU4   SOC   SOC   PHB   PHB    X    PIX   PXB   PXB   16-31
GPU5   SOC   SOC   PHB   PHB   PIX    X    PXB   PXB   16-31
GPU6   SOC   SOC   PHB   PHB   PXB   PXB    X    PIX   16-31
GPU7   SOC   SOC   PHB   PHB   PXB   PXB   PIX    X    16-31
PERFORMANCE ASSESSMENT
What do we compare?
Runs with 32 cores and 0 GPUs with 32 cores and 8 GPUs
Energies, 1st and 2nd derivatives, Closed and Open Shell
Various basis sets and XC-functionals
All calculations with
High Integral Accuracy (10^-12)
Ultra Fine Grid
Molecular Systems
PERFORMANCE ASSESSMENT Molecular Systems
Valinomycin Open Shell
168 atoms
Force Calculation
Basis set: 6-311+G(2d,p)
Basis Functions: 2646
XC-functional: HSEH1PBE
Charge: +1
Multiplicity: 2
Convergence: 29 cycles
Alanine 25
259 atoms
Energy Calculation
Basis set: cc-pVTZ
Basis Functions: 5690
XC-functional: wB97X-d
Frequency calculation
Basis set: 6-31G*
Basis Functions: 2195
XC-functional: APFD
RESULTS
[Chart: total execution time in minutes for the Ala25 Energy, Valinomycin UltraFine Force, and Ala25 Frequency runs, comparing 32 cores/8 GPUs against 32 cores/0 GPUs, broken down into ERI, XC and Rest; the annotated speedups with GPUs range from 1.15x to 1.70x]
“DILUTION” WITH MANY CORES
[Chart: overall speedup of c cores + g GPUs over c cores alone, plotted against the cores-per-GPU ratio (c/g = 1, 2, 3, 4, 6, 8) for assumed per-GPU speedups of 2.0, 3.0, 5.0 and 10.0; the GPU benefit is progressively diluted as more cores share each GPU, with annotated values falling from 3.25x down toward 1.13x]
CLOSING REMARKS
Significant Progress has been made in enabling Gaussian on GPUs with OpenACC
OpenACC is increasingly becoming more versatile
Significant work lies ahead to improve performance
Expand feature set:
PBC, Solvation, MP2, ONIOM, triples-Corrections
ACKNOWLEDGEMENTS
Development is taking place with:
Hewlett-Packard (HP) Series SL2500 Servers (Intel® Xeon® E5-2680 v2, 2.8GHz/10-core/25MB/8.0GT/s QPI/115W, DDR3-1866)
NVIDIA® Tesla® GPUs (K40 and later)
PGI Accelerator Compilers (16.x) with OpenACC (2.5 standard)
THANK YOU
JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join