
April 4-7, 2016 | Silicon Valley

Roberto Gomperts (NVIDIA), Michael Frisch (Gaussian, Inc.), Giovanni Scalmani (Gaussian, Inc.), Brent Leback (NVIDIA/PGI)

ENABLING THE ELECTRONIC STRUCTURE PROGRAM GAUSSIAN ON GPGPUS USING OPENACC

3

TOPICS

Gaussian: Design Guidelines, Parallelism and Memory Model

Implementation: Top-Down/Bottom-Up

OpenACC: Extensions, Hints & Tricks

Early Performance

Closing Remarks

4

GAUSSIAN

A Computational Chemistry Package that provides state-of-the-art capabilities for electronic structure modeling

Gaussian 09 is licensed for a wide variety of computer systems

All versions of Gaussian 09 contain virtually every scientific/modeling feature, and none imposes artificial limitations on calculations beyond available computational resources and time

Researchers use Gaussian to, among other things, study molecules and reactions; predict and interpret spectra; explore thermochemistry, photochemistry and other excited states; and include solvent effects

4/1/2016

5

DESIGN GUIDELINES

General

Establish a Framework for the GPU-enabling of Gaussian

Code Maintainability (Code Unification)

Leverage Existing code/algorithms, including Parallelism and Memory Model

Simplifies Resolving Problems

Simplifies Improvement on existing code

Simplifies Adding New Code


6

DESIGN GUIDELINES

Accelerate Gaussian for Relevant and Appropriate Theories and Methods

Relevant: many users of Gaussian

Appropriate: time consuming and good mapping to GPUs

Resource Utilization

Ensure efficient use of all available Computational Resources

CPU cores and memory

Available GPUs and memory


7

CURRENT STATUS Single Node

Implemented

Energies for Closed and Open Shell HF and DFT (less than a handful of XC-functionals missing)

First derivatives for the same as above

Second derivatives for the same as above

Using only

OpenACC

CUDA library calls (BLAS)


8

IMPLEMENTATION MODEL Application Code

[Figure: the standard accelerator model — the compute-intensive functions (a small fraction of the code, but a large fraction of the execution time) run on the GPU, while the rest of the sequential code runs on the CPU.]
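The division of labor in the figure can be sketched in C (the production code is Fortran; this stand-alone analogue and its names are invented for illustration). Only the compute-intensive function carries an OpenACC directive; with a non-OpenACC compiler the pragma is ignored and everything runs sequentially on the CPU.

```c
/* Compute-intensive function: a small fraction of the source, but a
 * large fraction of the run time, so it is the part that is offloaded. */
void axpy_hotspot(int n, double a, const double *x, double *y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}

/* Rest of the sequential CPU code: setup, control flow, bookkeeping. */
double driver(int n)
{
    double x[256], y[256], sum = 0.0;
    for (int i = 0; i < n && i < 256; ++i) { x[i] = 1.0; y[i] = i; }
    axpy_hotspot(n, 2.0, x, y);           /* offloaded hotspot */
    for (int i = 0; i < n && i < 256; ++i)
        sum += y[i];                      /* sequential post-processing */
    return sum;
}
```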

9

GAUSSIAN PARALLELISM MODEL

[Figure: parallelism hierarchy — across the CPU cluster, OpenMP within a CPU node, and OpenACC from the node to its GPUs.]

10

GAUSSIAN: MEMORY MODEL

[Figure: memory hierarchy mirroring the parallelism model — distributed memory across the CPU cluster, OpenMP shared memory within a CPU node, and OpenACC-managed device memory on the GPU.]

11

TREES IN THE FOREST OpenMP Parallel Region

Static Call Tree

[Figure: static call tree of the OpenMP parallel region, rooted at PRISMC and covering both Integrals Generation and Integrals “Digestion” — dozens of routines such as ACLEAR, BRADRV, C2PGEN, DIGRAF, PRMRAF, KETDRV, and STEPTT.]

12

APPROACH Data and Compute Regions Management

Data Management Top Down:

Create and initialize as appropriate large data region on device

Use that device memory as Compute Regions are enabled

Compute Bottom Up:

Create as many Accelerator Routines as possible

Incrementally add Compute Regions driving the Accelerator Routines

Incrementally add Routines with own Compute Regions
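The approach above can be sketched in C (the real code is Fortran; the names here are invented). A high-level driver owns an unstructured data region created top-down, while the bottom-up accelerator routines carry their own compute regions and find their arrays already present on the device. Without an OpenACC compiler the pragmas are ignored and the sketch runs on the host.

```c
/* Bottom-up: a leaf accelerator routine with its own compute region. */
static void scale_block(int n, double *a, double s)
{
    #pragma acc parallel loop present(a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] *= s;
}

/* Top-down: the driver creates the large device data region once;
 * every enabled compute region below it then reuses the device copy. */
void drive(int n, double *a)
{
    #pragma acc enter data copyin(a[0:n])
    scale_block(n, a, 2.0);   /* reuses device copy via present */
    scale_block(n, a, 0.5);
    #pragma acc exit data copyout(a[0:n])
}
```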


13

OPENACC OpenMP Parallel Region

OpenACC directives at the “leaves”

Device Memory Management at highest possible level

Move Directives up the calling tree

[Figure: the static call tree from the previous slide, annotated to show OpenACC directives introduced at the leaf routines and then moved up the calling tree.]

14

OPENACC PGI Extensions

Data Directives

NoCreate (Also in Compute Directives)

Compute Directives

Gang levels: (Dim:1,2,3)

Collapse (Force:N)


15

OPENACC EXTENSIONS NoCreate

      Subroutine DoSomething(NTT,FA,FB,…)
      Real*8 FA(*), FB(*)
C$ACC Parallel If(OnGPU)
C$ACC+ Present(FA) NoCreate(FB)
C$ACC Loop Gang Vector
      Do I = 1,NTT
        FA(I) = expression
        If(OpenShell) then
          FB(I) = expression
        EndIf
      EndDo
C$ACC End Parallel

FA is always allocated On Device

FB only allocated if OpenShell=.T.

Note: The If(OnGPU) clause requires compilation with –ta=host,nvidia
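A C analogue of the NoCreate example may make the semantics clearer (names invented for this sketch; PGI's NoCreate was later standardized as no_create in OpenACC 2.6). FA must already be present on the device; FB gets no device allocation unless one already exists, which is safe because it is only touched on the open-shell branch. Without an OpenACC compiler the pragma is ignored and the loop runs on the host.

```c
/* Sketch of the NoCreate pattern, assuming an OpenACC 2.6+ compiler
 * for the no_create clause; plain C compilers ignore the pragma. */
void update_arrays(int ntt, double *fa, double *fb, int open_shell)
{
    #pragma acc parallel loop present(fa[0:ntt]) no_create(fb[0:ntt])
    for (int i = 0; i < ntt; ++i) {
        fa[i] = 2.0 * fa[i] + 1.0;     /* stand-in for "expression" */
        if (open_shell)
            fb[i] = 0.5 * fa[i];       /* only touched when open shell */
    }
}
```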

16

OPENACC EXTENSIONS Collapse Force

      Subroutine DoSomething(N,NV,NU,IVIND,IUIND,W,…)
      Integer IVIND(*),IUIND(*)
      Real*8 W(N,*,*)
C$ACC Routine(DoWork) Vector
C$ACC Parallel If(OnGPU) Present(IVIND,IUIND,W)
C$ACC Loop Gang Collapse(Force:2)
      Do IV = 1,NV
        IVD = IVIND(IV)
        Do IU = 1,NU
          IUD = IUIND(IU)
          Call DoWork(N,W(1,IUD,IVD))
        EndDo
      EndDo
C$ACC End Parallel

The assignment to IVD between the two loop headers prevents a regular Collapse

17

OPENACC EXTENSIONS Multiple Dimension Gangs

      Subroutine DoSomething(N,NV,NU,W,…)
      Real*8 W(N,*,*)
C$ACC Routine(DoWork) Gang
C$ACC Parallel If(OnGPU) Present(W)
C$ACC+ Num_Gangs(nGng1,nGng2,nGng3) Vector_Length(lvGPU)
C$ACC Loop Gang(Dim:3)
      Do IV = 1,NV
C$ACC Loop Gang(Dim:2)
        Do IU = 1,NU
          Call DoWork(N,W(1,IU,IV))
        EndDo
      EndDo
C$ACC End Parallel

DoWork contains a Gang Vector Loop

18

HINTS & TRICKS Out of OpenACC Scope & Bind

Bind clause

A routine that can be called in different contexts:

Called with its own OpenACC compute region

Called outside an OpenACC scope, or within an OpenACC region in different modes: Gang-Vector, Vector, Seq


19

HINTS & TRICKS Bottom-Up approach

      Subroutine DoSomething(N,A,…)
      Real*8 A(*)
C$ACC Routine(aClear) Gang
      …
      Call aClear(N,A)
      …
C$ACC Parallel If(OnGPU) Present(A)
C$ACC+ Num_Gangs(nGng1) Vector_Length(lvGPU)
      Call aClear(N,A)
C$ACC End Parallel
      …

The first call to aClear clears the Host copy

The aClear call inside the Parallel region clears the Device copy if OnGPU = .T.

20

HINTS & TRICKS Out of Scope or in Gang Compute Region

      Subroutine aClear(N,A)
      Real*8 A(*)
      Parameter(Zero=0.0d0)
C$ACC Routine Gang
C$ACC Loop Gang Vector
      Do I = 1, N
        A(I) = Zero
      EndDo

aClear can be called outside an OpenACC region

21

HINTS & TRICKS Bottom-Up approach

      Subroutine DoSomething(N,A,…)
      Real*8 A(*)
C$ACC Routine(aClear) Gang
      …
      Call aClrGP(N,A)
      Call aClear(N,A)
      …
C$ACC Parallel If(OnGPU) Present(A)
C$ACC+ Num_Gangs(nGng1) Vector_Length(lvGPU)
      Call aClear(N,A)
C$ACC End Parallel
      …

aClrGP clears the Device copy if OnGPU = .T.

The first aClear call clears the Host copy

The aClear call inside the Parallel region clears the Device copy if OnGPU = .T.

22

HINTS & TRICKS Inside or outside OpenACC scope

      Subroutine aClrGP(N,A)
      Real*8 A(*)
      Parameter(Zero=0.0d0)
C$ACC Kernels Loop Gang Vector If(OnGPU) Present(A)
      Do I = 1, N
        A(I) = Zero
      EndDo

Clears Device copy of A if OnGPU = .T.

Otherwise Host copy

23

HINTS & TRICKS Multiple ARs in Compute Region

      Subroutine DoSomething(N,A,B,…)
      Real*8 A(3),B(*)
C$ACC Routine(aClear) Gang
C$ACC Routine(Use_A) Gang
      …
C$ACC Parallel If(OnGPU) Present(A,B)
C$ACC+ Num_Gangs(nGng1) Vector_Length(lvGPU)
      Call aClear(3,A)
      Call Use_A(N,A,B,…)
C$ACC End Parallel
      …

Data Dependency Hazard

A has to be initialized by aClear before Use_A reads it (N > 3), but gangs do not synchronize between the two calls inside a single Parallel region, so this ordering is not guaranteed

24

HINTS & TRICKS Bind

      Subroutine DoSomething(N,A,B,…)
      Real*8 A(3),B(*)
C$ACC Routine(aClear) Seq Bind(aClear_s)
C$ACC Routine(Use_A) Gang
      …
C$ACC Parallel If(OnGPU) Present(A,B)
C$ACC+ Num_Gangs(nGng1) Vector_Length(lvGPU)
      Call aClear(3,A)
      Call Use_A(N,A,B,…)
C$ACC End Parallel
      …

Allows the same name, aClear, to be used; on the device the call is bound to aClear_s

25

HINTS & TRICKS Bound Seq

      Subroutine aClear_s(N,A)
      Real*8 A(*)
      Parameter(Zero=0.0d0)
C$ACC Routine Seq
C$ACC Loop Seq
      Do I = 1, N
        A(I) = Zero
      EndDo

Seq instead of Gang

26

HINTS & TRICKS Conditional on GPU or Host

Beyond the “if” clause in Compute Regions

Marking a section of code in a Compute Region that should only be executed by the host or the device

ACC_On_Device(ACC_Device_HOST)

ACC_On_Device(ACC_Device_Not_HOST)

PGI’s Implementation

Not exactly like an “#ifdef” macro but close
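A C analogue of the pattern on the next slide uses the OpenACC runtime query acc_on_device(). The host-only fallback below exists solely so this sketch also builds and runs without an OpenACC compiler; real device code would include openacc.h and compile with OpenACC enabled.

```c
#if defined(_OPENACC)
#include <openacc.h>
#else
/* Fallback stand-ins (values are illustrative, not the real header's). */
typedef enum { acc_device_host = 2, acc_device_not_host = 3 } acc_device_t;
static int acc_on_device(acc_device_t d) { return d == acc_device_host; }
#endif

/* Returns 1 when the host path was taken, 0 for the device path. */
int dispatch_path(void)
{
    if (acc_on_device(acc_device_host)) {
        /* Call HostSub(...) here: executed only when running on the host. */
        return 1;
    }
    if (acc_on_device(acc_device_not_host)) {
        /* Call DeviceSub(...) here: executed only on the device. */
        return 0;
    }
    return -1;
}
```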


27

ACC_ON_DEVICE(…)

      Subroutine DoSomething(…)
      Implicit Real*8(A-H,O-Z)
C$ACC Routine Seq
C$ACC Routine(DeviceSub) Seq
      …
      If(ACC_On_Device(ACC_Device_HOST)) then
        Call HostSub(…)
      EndIf
      If(ACC_On_Device(ACC_Device_Not_HOST)) then
        Call DeviceSub(…)
      EndIf
      …

HostSub runs only on the Host, so it needs no Routine directive

DeviceSub runs only on Device

28

ADVANCED OPTIMIZATIONS

• It is imperative to use a Dynamic Load Distribution Mechanism

• Already in place for CPU parallelism

• Move work on the fly from the GPU to a CPU core

• Improve performance in certain places by controlled replication of matrices
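The dynamic load-distribution idea can be illustrated with a toy simulation (entirely invented for this write-up, not Gaussian's mechanism): a fast "GPU" worker and a slow "CPU core" worker claim batches from a shared counter as they become free, so the faster worker naturally ends up with more work than a static split would give it.

```c
/* One worker's cost per batch and when it next becomes free. */
typedef struct { double t_per_batch, busy_until; int claimed; } worker_t;

/* Dynamically distribute nbatch batches between a fast and a slow
 * worker; returns how many batches the fast worker claimed. */
int distribute(int nbatch, double t_fast, double t_slow)
{
    worker_t w[2] = { { t_fast, 0.0, 0 }, { t_slow, 0.0, 0 } };
    for (int b = 0; b < nbatch; ++b) {
        /* The worker that becomes free first claims the next batch. */
        worker_t *k = (w[0].busy_until <= w[1].busy_until) ? &w[0] : &w[1];
        k->busy_until += k->t_per_batch;
        k->claimed++;
    }
    return w[0].claimed;
}
```

With a 4x-faster worker and 10 batches, the fast worker claims 8 of them instead of the 5 a static half-and-half split would assign.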


29

PERFORMANCE ASSESSMENT Hardware

Processor

Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz (32 cores/2 sockets)

Memory: 256GB

GPU

Tesla K80 (8 GPUs/2 Boards), Boost Clocks: MEM 2505, SM 875

Topology


GPU topology (nvidia-smi style; X = self, PIX = same PCIe switch, PXB = multiple PCIe switches, PHB = PCIe host bridge, SOC = socket-level link):

        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  CPU Affinity
GPU0    X     PIX   SOC   SOC   SOC   SOC   SOC   SOC   0-15
GPU1    PIX   X     SOC   SOC   SOC   SOC   SOC   SOC   0-15
GPU2    SOC   SOC   X     PIX   PHB   PHB   PHB   PHB   16-31
GPU3    SOC   SOC   PIX   X     PHB   PHB   PHB   PHB   16-31
GPU4    SOC   SOC   PHB   PHB   X     PIX   PXB   PXB   16-31
GPU5    SOC   SOC   PHB   PHB   PIX   X     PXB   PXB   16-31
GPU6    SOC   SOC   PHB   PHB   PXB   PXB   X     PIX   16-31
GPU7    SOC   SOC   PHB   PHB   PXB   PXB   PIX   X     16-31

30

PERFORMANCE ASSESSMENT

What do we compare?

Runs with 32 cores and 0 GPUs versus runs with 32 cores and 8 GPUs

Energies, 1st and 2nd derivatives, Closed and Open Shell

Various basis sets and XC-functionals

All calculations with

High Integral Accuracy (10^-12)

Ultra Fine Grid

Molecular Systems


31

PERFORMANCE ASSESSMENT Molecular Systems


Valinomycin (Open Shell), 168 atoms

Force calculation: basis set 6-311+G(2d,p), 2646 basis functions, XC-functional HSEH1PBE, charge +1, multiplicity 2, convergence in 29 cycles

Alanine 25, 259 atoms

Energy calculation: basis set cc-pVTZ, 5690 basis functions, XC-functional wB97X-d

Frequency calculation: basis set 6-31G*, 2195 basis functions, XC-functional APFD

32

RESULTS

[Figure: stacked bar chart of total execution time in minutes (0–800) for six runs — Ala25 E[32/8], Ala25 E[32/0], Val UF[32/8], Val UF[32/0], Ala25 V[32/8], Ala25 V[32/0] — broken down into ERI, XC, and Rest. Speedup labels legible in the extraction: 1.39, 1.16, 1.59, 1.48, 1.19, 1.70, 1.22, 1.15, 1.36.]

33

“DILUTION” WITH MANY CORES

[Figure: whole-node speedup (y-axis, 0.0–11.0) versus the ratio of cores to GPUs (x-axis: c/g = 1, 2, 3, 4, 6, 8), with one curve per per-GPU speedup (g/c speed up 2.0, 3.0, 5.0, 10.0). Labeled points include 1.25, 1.13, 1.50, 1.25, 2.00, 1.50, 3.25, 2.13 — the more cores share each GPU, the more the GPU's contribution is diluted.]
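A simple model consistent with the labeled points on the chart (my reconstruction, not stated on the slides): on a node where each GPU-driving core runs its share s times faster and the remaining cores run at normal speed, the whole-node speedup for a cores-to-GPUs ratio r is ((r - 1) + s) / r.

```c
/* Dilution model (an assumption reconstructed from the chart's labels):
 * per group of r cores sharing one GPU, one core's work is sped up by s
 * and the other r-1 cores are unchanged, so the group-level speedup is
 * ((r - 1) + s) / r. As r grows, the GPU's gain is "diluted". */
double node_speedup(double cores_per_gpu, double gpu_speedup)
{
    return ((cores_per_gpu - 1.0) + gpu_speedup) / cores_per_gpu;
}
```

This reproduces the extracted labels: e.g. c/g = 4 with a 2.0x GPU speedup gives 1.25, and c/g = 8 with 10.0x gives 2.13.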

34

CLOSING REMARKS

Significant Progress has been made in enabling Gaussian on GPUs with OpenACC

OpenACC is becoming increasingly versatile

Significant work lies ahead to improve performance

Expand feature set:

PBC, Solvation, MP2, ONIOM, triples-Corrections

35

ACKNOWLEDGEMENTS

Development is taking place with:

Hewlett-Packard (HP) Series SL2500 Servers (Intel® Xeon® E5-2680 v2, 2.8GHz/10-core/25MB/8.0GT-s QPI/115W, DDR3-1866)

NVIDIA® Tesla® GPUs (K40 and later)

PGI Accelerator Compilers (16.x) with OpenACC (2.5 standard)



THANK YOU

JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join
