
Page 1

Extending a legacy fortran code to GPUs using OpenACC

G. Staffelbach

J. Legaux

[email protected]

Page 2

CERFACS

Research center in Toulouse (around 150 people)

Specialises in training in scientific computing and technology transfer from research to industry:

✦ Three main topics:

✓ Algorithmics

✓ Global Change

✓ Computational fluid dynamics


Page 3

Acknowledgements

This work would not have been possible without these people:

NVIDIA:

F. Pariente, F. Courteille, S. Chaveau

IBM:

P. Vezolle, L. Lucido

Cellule de veille technologique GENCI:

G. Hautreux

CERFACS:

I. d’Ast, N. Monnier

Page 4

Applications: prediction of combustion in highly complex cases


Application domains: energy & heavy-duty manufacturing, comfort, environment & security, transport & aerospace.

Page 5

[Chart: problem size (DoF) and CPU capacity (Flop/s) versus year, 2000-2020.]

Page 6

AVBP: an Open Science code


Research entities:

EM2C CNRS

IMFT CNRS

CORIA CNRS

LMFA

ONERA

U. Leuven

TU Munich

U. Twente

Von Karman institute

U. Sherbrooke

CIEMAT Madrid

ETH Zurich

Gent University

Pittsburgh University

Industry:

SAFRAN HELICOPTER ENGINES

SAFRAN AIRCRAFT ENGINES

AIRBUS-SAFRAN LAUNCHERS

AIRBUS

CNES

TOTAL

RENAULT

PEUGEOT S.A

ALSTOM

ANSALDO

HONEYWELL

SIEMENS

AIR LIQUIDE

GDF

CERFACS

HPC centers

(Hub diagram: code, models and needs are exchanged between CERFACS, the research entities, industry and the HPC centers.)

Page 7

Designing an aeronautical engine?

How many burners to ensure fast and reliable ignition?


Esclapez, Barre et al.

Page 8

Safety applications: predicting the overpressure generated by a confined explosion


Exp. by Gexcon

[Plot: overpressure (mbar) versus time (ms), comparing 20M-tet and 1B-tet meshes.]

Page 9

HPC and AVBP

Excellent strong scaling response on CPU systems with full MPI

Time to solution improved by a factor of 2 using MPI + OpenMP

[Strong-scaling plots: time versus number of MPI tasks.]

How can we take advantage of rapidly expanding and performant CPU+GPU systems? For example: Piz Daint (PRACE / CSCS).

Page 10

The AVBP code: started in 1997

300 users

20 new users/dev per year

2 ‘constant’ maintainers


➡ Built when "vector processors" were on the way out and parallel scalar processors were coming in

➡ Almost all new developers come from academia with a CFD background, not CSE!

Page 11

Extending AVBP with OpenACC

Code needs to remain "simple":

✦ Very large code, active development and in "production"

✦ Fast new feature cycle (4 months)

✦ Limited HPC developers and needs to be compatible with “CFD” students

✦ Code needs to remain portable

✓ CUDA is not a viable option

✓ Two solutions : OpenMP or OpenACC

Why OpenACC over OpenMP?

✦ Slightly simpler syntax for GPUs

✦ Active (and enthusiastic) support from Nvidia and PGI

✦ Limited support at the time for OpenMP 4 offload features on most systems, especially on the IBM Power systems where we could access P100 GPUs at the beginning.


Page 12

How does AVBP work? Fortran + MPI (+ OpenMP) + parallel HDF5 I/O


[Workflow: input/read → decomposition and distribution of data → main loop (numerical scheme: convective scheme, diffusion scheme, boundary treatment, physical models) → postprocessing and I/O → end?]

Page 13

Extending AVBP for GPUs

Fortran + MPI (+ OpenMP) + parallel HDF5 I/O

[Same workflow, annotated: the high-I/O sections (input/read, decomposition and distribution of data, postprocessing and I/O) are incompatible with the GPU; the compute-intensive kernels of the main loop (convective scheme, diffusion scheme, boundary treatment, physical models) are compatible with GPUs.]

Page 14

Extending AVBP for GPUs


Simplified source workflow profile (deeper is higher): the scheme accounts for 64% of the runtime (usually around 80%).

Courtesy of Lucido et al.

Power 8 CPU only

Page 15

Extending AVBP for GPUs: typical structure and most intensive kernels


MAIN LOOP:

  DO n = 1, ngroup
    CALL scheme(global_R_data, global_RW_data)
  END DO

SCHEME:

  USE module, ONLY: scheme_data
  CALL function1(global_R_data, global_RW_data, scheme_data)
  …
  CALL function…(global_R_data, global_RW_data, scheme_data)

KERNEL:

  USE module, ONLY: internal_data
  DO i = 1, ncells
    X(i) = B*X(i) + A*Y(i)
  END DO

Page 16

Extending AVBP for GPUs: typical structure and most intensive kernels


MAIN LOOP:

  DO n = 1, ngroup
    CALL scheme(global_R_data, global_RW_data)
  END DO

COARSE GRAIN (the whole scheme call):

  USE module, ONLY: scheme_data
  CALL function1(global_R_data, global_RW_data, scheme_data)
  …
  CALL function…(global_R_data, global_RW_data, scheme_data)

FINE GRAIN (the innermost loops):

  USE module, ONLY: internal_data
  DO i = 1, ncells
    X(i) = B*X(i) + A*Y(i)
  END DO
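To make the contrast concrete, here is a small self-contained sketch (not AVBP source; the names, sizes and coefficients are invented) showing where the OpenACC directives go in each strategy:

  ! Toy sketch only: coarse-grain vs fine-grain OpenACC placement.
  MODULE grain_demo_mod
  CONTAINS
    SUBROUTINE scheme_group(ncells, n, a, b, x, y)
      !$ACC ROUTINE SEQ
      ! Coarse grain: this whole routine body executes on the device, so it
      ! (and anything it calls) needs an ACC ROUTINE directive.
      INTEGER, INTENT(IN)    :: ncells, n
      REAL,    INTENT(IN)    :: a, b, y(ncells, *)
      REAL,    INTENT(INOUT) :: x(ncells, *)
      INTEGER :: i
      DO i = 1, ncells
        x(i, n) = b*x(i, n) + a*y(i, n)
      END DO
    END SUBROUTINE scheme_group
  END MODULE grain_demo_mod

  PROGRAM grain_demo
    USE grain_demo_mod
    IMPLICIT NONE
    INTEGER, PARAMETER :: ngroup = 64, ncells = 100000
    REAL, ALLOCATABLE :: x(:, :), y(:, :)
    INTEGER :: n, i
    ALLOCATE(x(ncells, ngroup), y(ncells, ngroup))
    x = 1.0
    y = 2.0

    ! Coarse grain: one directive around the group loop; the routine runs on the device.
    !$ACC PARALLEL LOOP COPY(x) COPYIN(y)
    DO n = 1, ngroup
      CALL scheme_group(ncells, n, 0.5, 2.0, x, y)
    END DO

    ! Fine grain: directives only around the innermost compute loops.
    !$ACC PARALLEL LOOP COLLAPSE(2) COPY(x) COPYIN(y)
    DO n = 1, ngroup
      DO i = 1, ncells
        x(i, n) = 2.0*x(i, n) + 0.5*y(i, n)
      END DO
    END DO

    PRINT *, 'x(1,1) =', x(1, 1)
  END PROGRAM grain_demo

With the coarse-grain placement the offloaded region is one big black box (which is what made debugging difficult, as described later); the fine-grain placement only touches the innermost loops.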

Page 17

Coarse grain approach: derived from the current most effective OpenMP implementation

!$OMP PARALLEL DO (..)
DO n = 1, ngroup
  CALL scheme(group(n))
END DO

! thread 1:  CALL scheme(group(1)), CALL scheme(group(2)), …
! thread …:  CALL scheme(group(…)), …

[Bar chart: convective scheme time (s) versus number of x86 threads (1 to 64); annotated speedups of x13 and x32.]

Page 18

Coarse grain performance

Accelerated scheme = 36% of the runtime (x2 acceleration)

Slowdown of some other functions (e.g. the time-step computation)

Courtesy of Lucido et al.

Power 8 + 1 P100

Page 19

Test hardware

Neptune (CERFACS)

✦ 2× E5-2680 v3 Sandy Bridge + 1 NVIDIA M5000

✓ No double precision!

Ouessant prototype (IDRIS, GENCI)

✦ IBM Firestone POWER8 + 4× NVIDIA P100

✓ NVLink enabled!

KRAKEN (CERFACS)

✦ 2× Skylake Xeon Gold 6140 + 1× NVIDIA V100

Page 20

Extending AVBP with OpenACC

First coarse grain implementation was encouraging but unsuccessful:

✦ Very few directives, but…

➡ Modifications of high-level data structures

‣ Independent copies of the scheme's local arrays for each call

‣ Double memory usage (even without ACC usage)

➡ Use of global arrays inside routines

‣ implicitly available on the CPU with OpenMP

‣ implicitly NOT available on the GPU with OpenACC

➡ Extremely difficult debugging:

‣ high-level routine on GPU => big "black box"

‣ Some minor limitations of PGI ACC observed

‣ OpenACC not able to handle character arrays (as of 18.3)

‣ OpenACC cannot handle ALLOCATABLE arrays inside derived types

Page 21

Extending AVBP with fine grain: switch to small kernels

✓ Only target computation-heavy loops in the code

✓ Identify arrays that are used for those computations

✓ Explicitly manage memory exchanges of those arrays between CPU and GPU memories

✓ Explicitly offload the concerned loops to the GPU (a minimal sketch follows this list)

A tedious, step-by-step work, but easy to check

✓ Each loop can easily be isolated and ported one at a time for debugging, optimisation or precision evaluation (remember the M5000)

✓ Partial porting, execution and validation is possible
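As referenced above, a minimal sketch of that fine-grain pattern on a toy loop (the routine and array names are illustrative, not AVBP's): the arrays involved are moved explicitly, only the loop itself is offloaded, and the rest of the code is untouched.

SUBROUTINE update_cells(ncells, a, b, x, y)
  ! Sketch only: fine-grain port of a single compute-heavy loop.
  INTEGER, INTENT(IN)    :: ncells
  REAL,    INTENT(IN)    :: a, b, y(ncells)
  REAL,    INTENT(INOUT) :: x(ncells)
  INTEGER :: i

  ! Explicitly manage the CPU<->GPU transfers of just these two arrays.
  !$ACC DATA COPYIN(y) COPY(x)
  ! Explicitly offload just this loop.
  !$ACC PARALLEL LOOP
  DO i = 1, ncells
    x(i) = b*x(i) + a*y(i)
  END DO
  !$ACC END DATA
END SUBROUTINE update_cells

Because each such loop is independent, it can be enabled, checked and reverted in isolation, which is what makes partial porting and validation possible.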


Page 22

But first… a more vector-friendly structure

Code was started when long vector processors were no longer ‘the future’.


DO n = 1, nnode
  DO nv = 1, nvert
    DO e = 1, neq
      array(e, nv, n) = …
    END DO
  END DO
END DO

➢ Loops are unstructured and built for short array lengths

➢ Typical values for the loop:

➢ nnode: several millions to billions (mesh dependent)

➢ nvert: 3 to 8 (mesh element type dependent)

➢ neq: 1 to 15 (physics)

Page 23

But first… a more vector-friendly structure

DO e = 1, neq
  DO nv = 1, nvert
    DO n = 1, nnode
      array(n, nv, e) = …
    END DO
  END DO
END DO

➢ Typical values for the loop:

➢ nnode: several millions to billions (mesh dependent)

➢ nvert: 3 to 8 (mesh element type dependent)

➢ neq: 1 to 15 (physics)

➢ Very large inner loop, small outer loops

➢ The inner loop can use many warps of 32 cores

➢ The outer loops can be distributed over the SMs

➢ Far better potential usage of GPUs WITH LARGE DATASETS (see the sketch below)
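A hedged sketch of how the reordered loop nest can be mapped explicitly with OpenACC (the routine wrapper and the trivial loop body are invented for illustration): the short outer loops are spread over gangs, the long node loop is vectorised.

SUBROUTINE reordered_update(nnode, nvert, neq, array)
  ! Sketch only: explicit gang/vector mapping of the reordered loops.
  INTEGER, INTENT(IN)    :: nnode, nvert, neq
  REAL,    INTENT(INOUT) :: array(nnode, nvert, neq)
  INTEGER :: n, nv, e

  !$ACC PARALLEL LOOP GANG COLLAPSE(2) COPY(array)
  DO e = 1, neq
    DO nv = 1, nvert
      ! The long, contiguous node loop maps onto the vector lanes (warps).
      !$ACC LOOP VECTOR
      DO n = 1, nnode
        array(n, nv, e) = 0.0   ! placeholder for the real update
      END DO
    END DO
  END DO
END SUBROUTINE reordered_update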

Page 24

But first… a more vector-friendly structure

[Chart: performance of the gradients computation for the SIMPLE case, per cell group size.]

Page 25

Handling memory

NVLink + unified memory are not available on all the systems.

ACC managed memory is not able to 'detect' allocations wrapped in custom routines:

✦ CALL myalloc(array)

This means that memory needs to be controlled "by hand".

In our case, data in the cells can be pre-allocated:

✦ Good even for CPU

✦ For GPU:

✓ !$ACC ENTER DATA CREATE(array)

✦ Avoid implicit creation:

✓ !$ACC DATA PRESENT(array)

✓ !$ACC KERNELS DEFAULT(PRESENT)

✦ Check using -Minfo=accel
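A minimal sketch putting those directives together (the module, array and routine names are illustrative, not AVBP's):

MODULE cell_data
  IMPLICIT NONE
  REAL, ALLOCATABLE :: array(:)
CONTAINS
  SUBROUTINE myalloc(n)
    INTEGER, INTENT(IN) :: n
    ALLOCATE(array(n))
    ! Create the device copy once, right after the host allocation.
    !$ACC ENTER DATA CREATE(array)
  END SUBROUTINE myalloc

  SUBROUTINE compute(n)
    INTEGER, INTENT(IN) :: n
    INTEGER :: i
    ! The array is already on the device: no implicit transfers allowed.
    !$ACC KERNELS DEFAULT(PRESENT)
    DO i = 1, n
      array(i) = 2.0*array(i)
    END DO
    !$ACC END KERNELS
    ! (An !$ACC UPDATE SELF(array) would be needed before reading it on the host.)
  END SUBROUTINE compute
END MODULE cell_data

Compiling with -Minfo=accel then confirms that no implicit copies were generated for the kernel.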


Page 26

Validating results

Different architecture and different parallelism: the order of operations cannot be guaranteed.

How to compute data twice?

✦ !$ACC KERNELS DEFAULT(PRESENT) IF(using_acc)
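One way to read this pattern, as a self-contained sketch (only the using_acc flag comes from the slide; the other names are invented, and explicit copy clauses replace DEFAULT(PRESENT) so the example stands alone): the same loop runs once with the directive active and once on the host, and the results are compared.

SUBROUTINE validate_kernel(n, x, y, max_diff)
  ! Sketch only: run the same loop on the GPU and on the CPU, then compare.
  INTEGER, INTENT(IN)  :: n
  REAL,    INTENT(IN)  :: x(n)
  REAL,    INTENT(OUT) :: y(n), max_diff
  REAL    :: y_cpu(n)
  LOGICAL :: using_acc
  INTEGER :: i

  using_acc = .TRUE.

  ! GPU pass: only offloaded when using_acc is true.
  !$ACC KERNELS COPYIN(x) COPYOUT(y) IF(using_acc)
  DO i = 1, n
    y(i) = 2.0*x(i) + 1.0
  END DO
  !$ACC END KERNELS

  ! Reference CPU pass.
  DO i = 1, n
    y_cpu(i) = 2.0*x(i) + 1.0
  END DO

  max_diff = MAXVAL(ABS(y - y_cpu))
END SUBROUTINE validate_kernel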


Page 27

Validating results

Calls are controlled via the valid_acc and using_acc flags: tedious, but easily deployable across the code.

The same function can run on CPU or GPU!

In most cases (80%) the results are strictly identical; the rest differ by between 10^-11 and 10^-23 (results from a V100).

Most errors seem to be cumulative or due to different operation approximations. The overall behavior is currently acceptable for large-scale simulations.


Page 28

Handling MPI

MPI time in the code is negligible, but the structure requires synchronisations between partitions: copies…


Page 29

Handling MPI

Direct MPI calls on GPU buffers using a CUDA-aware implementation: available in most recent MPI libraries (OpenMPI, MVAPICH, IBM Spectrum).

However, we still need to handle the construction/manipulation of the MPI buffers.

!$ACC HOST_DATA USE_DEVICE(tmp_buf_recv)
CALL MPI_Irecv(tmp_buf_recv(ofs), cnt, mpi_real_type, rank, tag, &
               comm, mpi_reqs(i), ierr)
!$ACC END HOST_DATA
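For context, a self-contained sketch of the same pattern (the routine and buffer names are invented): the HOST_DATA region makes MPI see the device addresses of buffers that already live on the GPU.

SUBROUTINE exchange_halo(cnt, dest, src, comm, buf_send, buf_recv)
  ! Sketch only: CUDA-aware point-to-point exchange of device-resident buffers.
  USE mpi
  IMPLICIT NONE
  INTEGER, INTENT(IN)    :: cnt, dest, src, comm
  REAL(8), INTENT(IN)    :: buf_send(cnt)
  REAL(8), INTENT(INOUT) :: buf_recv(cnt)
  INTEGER :: reqs(2), ierr

  ! The buffers are assumed present on the device (e.g. via ENTER DATA CREATE).
  !$ACC HOST_DATA USE_DEVICE(buf_send, buf_recv)
  CALL MPI_Irecv(buf_recv, cnt, MPI_DOUBLE_PRECISION, src,  0, comm, reqs(1), ierr)
  CALL MPI_Isend(buf_send, cnt, MPI_DOUBLE_PRECISION, dest, 0, comm, reqs(2), ierr)
  !$ACC END HOST_DATA

  CALL MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE, ierr)
END SUBROUTINE exchange_halo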

Page 30

Handling MPI: transfers use unstructured arrays

➡ Routines that build/extract data to and from the buffers used in MPI communications

➡ Fundamentally bad for GPU parallelism:

• Variable execution path

• Counter variables that get unpredictable increments

• Data movements depend on the correctness and order of the counter variable increments

• !$ACC leads to sequential execution; forcing parallelisation leads to a wrong order of operations

lp = 1
dp = 1
DO i = 1, runlist_cnt
  cnt = runlist(i)
  SELECT CASE(cnt)
  CASE(1)
    DO j = 1, list_length
      lid = indices(lp)
      ofs(1) = dep_data(dp)…
      DO k = 1, neq
        field_ptr(k, lid) = recv_buf(ofs(1) + k)
      END DO
      dp = dp + 2
      lp = lp + 1
    END DO
  CASE(2)
    …
    dp = dp + 4
  …

Page 31

Handling MPI: complete rewrite of the transfer module

➡ Precompute boundary counters

➡ Express the counters independently for each iteration

➡ The counters can then be privatised for each iteration, allowing full parallelisation

DO i = 1, runlist_cnt
  lp(i) = …
  dp(i) = …
END DO

DO i = 1, runlist_cnt
  cnt = runlist_depcnt(i)
  !$ACC LOOP PRIVATE(dp, lp, s, ofs) VECTOR(128)
  DO j = 1, runlist_length(i)
    dp = runlist_dp(i) + (j-1) * cnt * 2
    lp = …
    DO l = 0, cnt
      ofs(cnt) = dep_data(dp…)
    END DO
    DO k = 1, neq
      DO l = 1, cnt
        s(k) = s(k) + recv_buf(ofs(l) + k)
      END DO
    END DO
    DO k = 1, neq
      !$ACC ATOMIC UPDATE
      field_ptr(k, lid(lp)) = s(k)
    END DO
  END DO
END DO
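Since the fragment above is heavily truncated, here is a simplified, self-contained sketch of the idea (all names invented): each iteration derives its own offset from precomputed counters, so no loop-carried counter remains, and concurrent contributions to the same node are protected by an atomic update.

SUBROUTINE unpack_buffer(nitems, neq, nnode, nbuf, stride, indices, recv_buf, field)
  ! Sketch only: privatised offsets + atomic accumulation.
  INTEGER, INTENT(IN)    :: nitems, neq, nnode, nbuf, stride
  INTEGER, INTENT(IN)    :: indices(nitems)
  REAL,    INTENT(IN)    :: recv_buf(nbuf)
  REAL,    INTENT(INOUT) :: field(neq, nnode)
  INTEGER :: j, k, ofs

  !$ACC PARALLEL LOOP PRIVATE(ofs) COPYIN(indices, recv_buf) COPY(field)
  DO j = 1, nitems
    ! The offset is computed from the iteration index alone: no loop-carried counter.
    ofs = (j - 1) * stride
    DO k = 1, neq
      ! Several items may target the same node, hence the atomic update.
      !$ACC ATOMIC UPDATE
      field(k, indices(j)) = field(k, indices(j)) + recv_buf(ofs + k)
    END DO
  END DO
END SUBROUTINE unpack_buffer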

[Bar chart: time (s) for the SIMPLE and EXPLO cases on POWER8 + 4 GPUs.]

This is the only code that has been rewritten explicitly for the GPU. The rest of the code remains identical for the CPU.

Page 32

Handling MPI

1 MPI rank per GPU is not efficient: occupancy below 50%.

Multi-Process Service (MPS) allows multiple concurrent MPI ranks on the GPU:

✓ Share the resources

✓ Split the workload

✓ Computation / communication overlap

AVBP is already suited to running many MPI ranks (a minimal device-selection sketch follows).
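A hedged sketch of how several ranks on a node might each select their GPU (with MPS running, ranks that map to the same device then share it); this is an illustrative pattern, not the AVBP setup:

PROGRAM bind_ranks_to_gpus
  ! Sketch only: map the MPI ranks of a node onto the available GPUs.
  USE mpi
  USE openacc
  IMPLICIT NONE
  INTEGER :: ierr, node_comm, local_rank, ngpus

  CALL MPI_Init(ierr)
  ! Communicator containing only the ranks that share this node.
  CALL MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
                           MPI_INFO_NULL, node_comm, ierr)
  CALL MPI_Comm_rank(node_comm, local_rank, ierr)

  ngpus = acc_get_num_devices(acc_device_nvidia)
  ! Several local ranks can end up on the same device; MPS lets them overlap.
  IF (ngpus > 0) CALL acc_set_device_num(MOD(local_rank, ngpus), acc_device_nvidia)

  ! ... offloaded work goes here ...

  CALL MPI_Finalize(ierr)
END PROGRAM bind_ranks_to_gpus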


Page 33

Test cases

The "Simple" test (3M/9M cells): gas turbine simulation. Image courtesy of V. Moureau (CORIA/CNRS).

The "Explo 20MAO" test (20M cells): explosion in a confined space. Quillatre et al.

Page 34

Acceleration: 3M cells case

[Bar chart: acceleration of the gradient and scheme kernels for CPU 1 MPI and GPU with 1, 2, 4, 8 and 16 MPI ranks; 1 Skylake + V100.]

Page 35

Acceleration: 9M cells case

[Bar chart: acceleration of the gradient and scheme kernels for CPU 1 MPI and GPU with 1, 2, 4, 8 and 16 MPI ranks; 1 Skylake + V100.]

Page 36

Acceleration: 20M cells case

[Bar chart: acceleration of the gradient and scheme kernels for CPU 1 MPI and GPU with 1, 2, 4 and 8 MPI ranks; 1 Skylake + V100.]

Page 37

What's next

Current limitation: the same number of MPI ranks must be used on CPU and GPU.

Next objective: test on a larger system.

Extend OpenACC to the physical kernels:

✦ Coverage will extend from around 70% to 90% of the code.

Improve local performance:

✦ Currently the work is divided between gang/worker/vector uniformly.

✦ Some parts might benefit from specific kernel distributions (see the sketch after this list).

Asynchronous execution of kernels.
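A hedged sketch of those last two ideas (the routine, arrays and clause values are illustrative examples, not AVBP's tuning): an explicit gang/vector distribution per kernel, and asynchronous queues with a final wait.

SUBROUTINE tuned_kernels(n, a, b)
  ! Sketch only: explicit vector lengths and asynchronous execution.
  INTEGER, INTENT(IN)    :: n
  REAL,    INTENT(INOUT) :: a(n), b(n)
  INTEGER :: i

  !$ACC DATA COPY(a, b)

  ! Explicit distribution: force a specific vector length for this kernel.
  !$ACC PARALLEL LOOP GANG VECTOR VECTOR_LENGTH(128) ASYNC(1) DEFAULT(PRESENT)
  DO i = 1, n
    a(i) = 2.0*a(i)
  END DO

  ! Independent kernel on another queue, overlapping with the first one.
  !$ACC PARALLEL LOOP GANG VECTOR VECTOR_LENGTH(256) ASYNC(2) DEFAULT(PRESENT)
  DO i = 1, n
    b(i) = b(i) + 1.0
  END DO

  ! Block until both queues are done before the data region copies back.
  !$ACC WAIT

  !$ACC END DATA
END SUBROUTINE tuned_kernels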


Page 38

What to take away

OpenACC allows a simple but efficient porting of legacy Fortran codes to GPU with almost no code duplication.

We can use GPUs, and CPU+GPU nodes are way faster than simple CPU nodes.

We still need to optimize to increase efficiency.

If changes are required, they are often beneficial for modern CPUs too.

PGI/OpenACC are constantly evolving: keep up to date and share your issues/experiences. PGI/NVIDIA are highly reactive, and most issues disappear by the next release.


Page 39

THANK YOU