Porting Scientific Research Codes to GPUs with CUDA Fortran: Incompressible Fluid Dynamics Using the Immersed Boundary Method
Josh Romero, Massimiliano Fatica - NVIDIA
Vamsi Spandan, Roberto Verzicco - Physics of Fluids, University of Twente
HPC Advisory Council Workshop, Stanford, CA, February 2018
Outline
● Introduction and Motivation
● Solver Details
● GPU implementation in CUDA Fortran
● Benchmarking and Results
● Conclusions
Introduction and Motivation
● Increased availability of GPU compute resources:
  ○ Explosion of interest in Machine Learning
  ○ Focus on energy efficiency for exascale
● Lots of choices to make:
  ○ OpenACC vs. CUDA
  ○ CUDA C vs. CUDA Fortran
● Getting existing Fortran codes up and running on GPUs can be easy if you use the right tools
● This talk focuses on getting up and running with “low effort.”
Solver Details
● Incompressible CFD solver for DNS computations in structured domains
● IB + structural solver using method described in [1]
○ Immersed interface contributes forcing term to fluid
○ Interface structural dynamics treated as triangulated network of springs
[1] Spandan et al., Journal of Computational Physics, 2017
Solver Details
Solver flow chart: Initialize Solver → (Compute RK step → Compute IB forcing term → Structural update) inside the RK loop, nested within the timestep loop.
GPU Implementation in CUDA Fortran
CUDA Fortran
● The baseline CPU code is written in Fortran, so the natural choice for the GPU port is CUDA Fortran.
● Benefits:
  ○ More control than OpenACC:
    ■ Explicit GPU kernels written natively in Fortran are supported (see the sketch below)
    ■ Full control of host/device data movement
  ○ Directive-based programming available via CUF kernels
  ○ Easier to maintain than mixed CUDA C and Fortran approaches
● Requires PGI compiler (community edition available now for free)
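For illustration (not from the original slides), a minimal sketch of an explicit GPU kernel written natively in CUDA Fortran; the module, kernel, and array names are made up for the example:

```fortran
module scale_kernels
contains
  ! Each GPU thread scales one element of the array.
  attributes(global) subroutine scale(a, alpha, n)
    implicit none
    real(8), device :: a(*)
    real(8), value  :: alpha
    integer, value  :: n
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) a(i) = alpha * a(i)
  end subroutine scale
end module scale_kernels
```

The kernel is launched from the host with the chevron syntax, e.g. `call scale<<<(n+255)/256, 256>>>(a_d, 2.d0, n)`, where `a_d` is a device-resident array.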
Profiling with NVPROF + NVVP + NVTX
● NVPROF:
  ○ Can be used to gather detailed kernel properties and timing information
● NVIDIA Visual Profiler (NVVP):
  ○ Graphical interface to visualize and analyze NVPROF-generated profiles
  ○ Does not show CPU activity out of the box
● NVIDIA Tools Extension (NVTX) markers (see the sketch below):
  ○ Enable annotation with labeled ranges within the program
  ○ Useful for categorizing parts of the profile to put activity into context
  ○ Can be used to visualize normally hidden CPU activity (e.g. MPI communication)
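The slides do not show the Fortran side of the NVTX annotation; below is a minimal sketch of one common approach, binding directly to the NVTX C entry points nvtxRangePushA/nvtxRangePop via iso_c_binding (module and wrapper names are illustrative; link with -lnvToolsExt):

```fortran
module nvtx_ranges
  use iso_c_binding
  implicit none
  interface
    ! C API: int nvtxRangePushA(const char* message)
    function nvtxRangePushA(name) bind(C, name='nvtxRangePushA') result(lvl)
      use iso_c_binding
      character(kind=c_char) :: name(*)
      integer(c_int) :: lvl
    end function nvtxRangePushA
    ! C API: int nvtxRangePop(void)
    function nvtxRangePop() bind(C, name='nvtxRangePop') result(lvl)
      use iso_c_binding
      integer(c_int) :: lvl
    end function nvtxRangePop
  end interface
contains
  ! Push a labeled range onto the NVTX stack
  subroutine nvtx_push(label)
    character(len=*), intent(in) :: label
    character(kind=c_char) :: cname(len_trim(label)+1)
    integer :: i
    integer(c_int) :: lvl
    do i = 1, len_trim(label)
      cname(i) = label(i:i)
    end do
    cname(len_trim(label)+1) = c_null_char
    lvl = nvtxRangePushA(cname)
  end subroutine nvtx_push
  ! Pop the innermost range
  subroutine nvtx_pop()
    integer(c_int) :: lvl
    lvl = nvtxRangePop()
  end subroutine nvtx_pop
end module nvtx_ranges
```

Regions of interest are then bracketed with `call nvtx_push('RK step')` ... `call nvtx_pop()` and appear as labeled ranges in the NVVP timeline.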
NVIDIA Visual Profiler with NVTX Markers
GPU Porting of Key Computational Routines
● In many CFD (and similar) codes, common code patterns appear:
○ Tightly-nested loop computations (computation of derivatives using stencils)
○ Common mathematical computations (Fourier transforms, matrix algebra)
● But there are also unique patterns specific to a given application:
○ Computation of IB forcing on flow field
○ Computation of interface structural forces
Case 1: Tightly-nested loops
Consider the original CPU subroutine to compute the divergence.
subroutine divg
  use param
  use local_arrays, only: q1, q2, q3, dph, jpv, ipv, udx3m
  ...
  do kc = kstart,kend
    do jc = 1,n2m
      do ic = 1,n1m
        kp = kc+1; jp = jpv(jc); ip = ipv(ic)
        dqcap = (q1(ip,jc,kc) - q1(ic,jc,kc)) * dx1   &
               +(q2(ic,jp,kc) - q2(ic,jc,kc)) * dx2   &
               +(q3(ic,jc,kp) - q3(ic,jc,kc)) * udx3m(kc)
        dph(ic,jc,kc) = dqcap*usdtal
      enddo
    enddo
  enddo
end subroutine divg
Case 1: Tightly-nested loops
Now, consider the version for GPU using CUF kernel directives.
subroutine divg
  use param
  use local_arrays, only: q1=>q1_d, q2=>q2_d, q3=>q3_d, &
                          dph=>dph_d, jpv=>jpv_d, ipv=>ipv_d, udx3m=>udx3m_d
  ...
  !$cuf kernel do(3)
  do kc = kstart,kend
    do jc = 1,n2m
      do ic = 1,n1m
        kp = kc+1; jp = jpv(jc); ip = ipv(ic)
        dqcap = (q1(ip,jc,kc) - q1(ic,jc,kc)) * dx1   &
               +(q2(ic,jp,kc) - q2(ic,jc,kc)) * dx2   &
               +(q3(ic,jc,kp) - q3(ic,jc,kc)) * udx3m(kc)
        dph(ic,jc,kc) = dqcap*usdtal
      enddo
    enddo
  enddo
end subroutine divg
Case 1: Tightly-nested loops
● CUF kernel directive automatically generates GPU kernels for tightly nested loops.
● Scalar data passed by value to device.
● Array data must already be resident on device.
Case 1: Tightly-nested loops
● For getting data onto the device, CUDA Fortran allows for straightforward declaration/allocation of device data.
module local_arrays
  real(8), allocatable :: q1(:,:,:)
  real(8), device, allocatable :: q1_d(:,:,:)
  ...
end module local_arrays

allocate(q1(nx,ny,nz));   q1 = 0.d0
allocate(q1_d(nx,ny,nz)); q1_d = q1

Alternative using sourced allocation:
allocate(q1_d, source = q1)
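Not shown on the slide, but worth noting: in CUDA Fortran, assignment between host and device arrays performs the transfer, so copying results back is just as direct (a minimal sketch using the same q1/q1_d arrays):

```fortran
q1_d = q1     ! copy host array to the device before GPU kernels run
! ... kernels / CUF kernel loops that update q1_d ...
q1   = q1_d   ! copy updated results back to the host (e.g. for I/O)
```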
Additional CUF kernel features
● CUF kernels can be used to perform reductions of scalar device data.
● Final reduced result can be on the host or device.
subroutine calculate_volume_gpu(Volume, nv, nf, xyz, vert_of_face)
  integer, dimension(3,nf), device, intent(in) :: vert_of_face
  real(8), dimension(nv,3), device, intent(in) :: xyz
  real(8), intent(out) :: Volume
  ...
  Volume = 0.d0

  !$cuf kernel do (1)
  do i = 1,nf
    v1 = vert_of_face(1,i)
    v2 = vert_of_face(2,i)
    v3 = vert_of_face(3,i)

    x1 = xyz(v1,1); x2 = xyz(v2,1); x3 = xyz(v3,1)
    y1 = xyz(v1,2); y2 = xyz(v2,2); y3 = xyz(v3,2)
    z1 = xyz(v1,3); z2 = xyz(v2,3); z3 = xyz(v3,3)

    Volume = Volume + (x1 * (y2*z3 - z2*y3) + &
                       x2 * (y3*z1 - z3*y1) + &
                       x3 * (y1*z2 - z1*y2))/6.d0
  enddo
end subroutine calculate_volume_gpu
Case 2: Common Mathematical Computations
● Beyond loop-based computations, many codes use common math computations for which there are GPU libraries readily available:
  ○ FFT: CUFFT
  ○ BLAS: CUBLAS
  ○ Linear algebra: CUSOLVER
● Use wisely: favor batched implementations when available and avoid many repeated calls to small operations
Case 2: Common Mathematical Computations
Consider the original CPU code performing a real-to-complex FFT using the FFTW library.
coefnorm = 1.d0/(dble(n1m) * dble(n2m))
do k = kstart,kend
  do j = 1,n2m
    do i = 1,n1m
      xr(j,i) = dph(i,j,k)
    enddo
  enddo

  call dfftw_execute_dft_r2c(fwd_plan, xr, xa)

  do j = 1,n2m/2 + 1
    do i = 1,n1m
      dpho(i,j,k)      = dreal(xa(j,i)) * coefnorm
      dpho(i,j+n2mh,k) = dimag(xa(j,i)) * coefnorm
    enddo
  enddo
end do
Case 2: Common Mathematical Computations
Now consider the GPU version using the CUFFT library.
● Modified to use batched 2D FFTs
● Final loop merged with later packing loop ← kernel fusion
coefnorm = 1.d0/(dble(n1m) * dble(n2m))
!$cuf kernel do (3)
do k = kstart,kend
  do j = 1,n2m
    do i = 1,n1m
      xr_d(j,i,k) = dph_d(i,j,k)
    enddo
  enddo
enddo

istat = cufftExecD2Z(cufft_fwd_plan, xr_d, xa_d)

! Scaling/rearrangement combined with a subsequent loop (kernel fusion)
Plan creation for the batched 2D transforms uses cufftPlanMany:

integer :: cufft_fwd_plan
integer :: rank(2), inembed(2), onembed(2)

rank(1) = n1m;    rank(2) = n2m
inembed(1) = n1m; inembed(2) = n2m
onembed(1) = n1m; onembed(2) = n2m/2 + 1

istat = cufftPlanMany(cufft_fwd_plan, 2, rank, inembed, 1, n1m*n2m,        &
                      onembed, 1, n1m*(n2m/2 + 1), CUFFT_D2Z, kend-kstart+1)
Interfaces for BLAS routines
● PGI provides overloaded interfaces for BLAS routines.
● Calls with device-resident arrays are automatically passed to the CUBLAS library.
use cudafor
use cublas

integer :: m, n, k
real(8) :: alpha, beta
real(8) :: a(m,k), b(k,n), c(m,n)
real(8), device :: a_d(m,k), b_d(k,n), c_d(m,n)

...

! DGEMM using linked CPU library
call DGEMM('N', 'N', m, n, k, alpha, a, m, b, k, beta, c, m)

! DGEMM using CUBLAS
call DGEMM('N', 'N', m, n, k, alpha, a_d, m, b_d, k, beta, c_d, m)
Case 3: Unique computations
● The need for custom kernels arises in most programs:
○ Unique computations not amenable to a CUF kernel
○ Common mathematical operation, but no good GPU library implementation:
■ Tridiagonal LU factorization/solves with multiple RHS
○ Pattern of library usage that would be poor performing on GPU:
■ Data interpolation from flow grid to structural grid involves many small matrix and vector computations.
Example 1: Batched Tridiagonal Solver
● Flow solver requires tridiagonal LU factorization/solves with multiple RHS
● Wrote batched tridiagonal solver using Thomas algorithm
● One GPU thread assigned per RHS
● To ensure coalesced access of RHS values by threads, data transposition required:
rhs_d(1:N1*N2, 1:NRHS) → rhs_t_d(1:NRHS, 1:N1*N2)
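A minimal sketch of this pattern (not the authors' kernel; the array names, the assumption of a single shared set of tridiagonal coefficients, and the n <= 512 scratch size are all illustrative):

```fortran
! Sketch: batched Thomas algorithm, one GPU thread per right-hand side.
! rhs_t is stored transposed, rhs_t(1:nrhs, 1:n), so that consecutive
! threads read/write consecutive memory locations (coalesced access).
attributes(global) subroutine thomas_batched(a, b, c, rhs_t, n, nrhs)
  implicit none
  integer, value :: n, nrhs
  real(8), device :: a(n), b(n), c(n)   ! sub-, main-, super-diagonals
  real(8), device :: rhs_t(nrhs, n)     ! transposed right-hand sides
  real(8) :: cp(512)                    ! per-thread scratch, assumes n <= 512
  real(8) :: m
  integer :: i, tid

  tid = (blockIdx%x - 1) * blockDim%x + threadIdx%x
  if (tid > nrhs) return

  ! Forward elimination
  cp(1) = c(1) / b(1)
  rhs_t(tid,1) = rhs_t(tid,1) / b(1)
  do i = 2, n
    m = 1.d0 / (b(i) - a(i) * cp(i-1))
    cp(i) = c(i) * m
    rhs_t(tid,i) = (rhs_t(tid,i) - a(i) * rhs_t(tid,i-1)) * m
  enddo

  ! Back substitution; the solution overwrites rhs_t
  do i = n-1, 1, -1
    rhs_t(tid,i) = rhs_t(tid,i) - cp(i) * rhs_t(tid,i+1)
  enddo
end subroutine thomas_batched
```

With one thread per system, the kernel would be launched with roughly `(nrhs+127)/128` blocks of 128 threads.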
Example 2: Data Interpolation Between Grids
This is the most time-consuming operation in the IBM portion of the solver.
The goal is to compute an interpolated value on the structural grid from the flow grid.
Example 2: Data Interpolation Between Grids
For a given triangle i:
● Form a 27-point support domain around the triangle centroid.
● Compute a transfer function using support point and centroid data.
● The final centroid result is scattered back to the support points or to the triangle vertices.
Example 2: Data Interpolation Between Grids
Computation of the transfer function for each triangle requires:
● A 4 x 4 matrix inversion
● Several small matrix-vector multiplies:
  ○ [1 x 4][4 x 4] and [1 x 4][4 x 27]

The final computation of the interpolated value is an inner product of 27 values.
Example 2: Data Interpolation Between Grids
GPU strategy:
● Process each triangle using a warp (32-thread unit), mapping threads to support points (see the sketch below).
● Data is warp-local → most matrix algebra can be completed efficiently using warp shuffle intrinsics.
● Scattering of the final result is completed using atomic adds.
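As an illustration of the warp-level pattern described above (a sketch, not the authors' kernel; the array names and launch configuration are assumptions), the 27-value inner product can be reduced with warp shuffles and the result scattered with an atomic add:

```fortran
! Sketch: one warp (32 threads) per triangle. Each lane holds one
! support-point contribution; a shuffle-down tree reduction forms the
! 27-value inner product, and lane 1 scatters the sum with an atomic add.
attributes(global) subroutine warp_interp_scatter(contrib, accum, ntri)
  use cudadevice
  implicit none
  integer, value :: ntri
  real(8), device :: contrib(32, ntri)   ! lanes 28..32 padded with zeros
  real(8), device :: accum(ntri)         ! accumulated output
  real(8) :: val, old
  integer :: lane, tri, offset

  lane = threadIdx%x                     ! 1..32 (launched with 32 threads per block)
  tri  = blockIdx%x
  if (tri > ntri) return

  val = contrib(lane, tri)

  ! Warp-level tree reduction with shuffle-down
  offset = 16
  do while (offset >= 1)
    val = val + __shfl_down(val, offset)
    offset = offset / 2
  enddo

  ! Lane 1 now holds the warp sum; scatter it with an atomic add
  if (lane == 1) old = atomicadd(accum(tri), val)
end subroutine warp_interp_scatter
```

Because the partial values never leave the warp, no shared memory or block-level synchronization is needed; the atomic add handles the case where several triangles contribute to the same output location.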
Benchmarking and Results
Verification Case
Benchmarking Case
● Unit cube, quiescent flow
● N = 128, 256, 384
● # of Particles = 1, 8, 27, 64
● Particle Resolution = 1280, 5120, 20480 triangles
● Run on:
  ○ 1x 16-core Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30GHz
  ○ 1x NVIDIA Tesla V100 PCIe
Grid Resolution
Fixed # of Particles = 8
Particle Resolution = 5120 triangles

Fluid:
● 10 to 14x speedup vs. CPU

IB + Structural:
● 40 to 100x speedup vs. CPU
● Percentage of time:
  ○ CPU: 72% to 14%
  ○ GPU: 20% to 6%
Particle Resolution
Fixed N = 256
Fixed # of Particles = 8

IB + structural solver time increases at a reduced rate on the GPU:
● CPU: 15% to 55%
● GPU: 6% to 13%
Number of Particles
Fixed N = 256
Particle Resolution = 5120 triangles

IB + Structural solver time increases at similar rates:
● CPU: 14% to 59%
● GPU: 5% to 22%
Conclusions
● Porting research codes to GPUs is worth the investment
  ○ Faster runtimes enable larger cases, more rapid experimentation
● Large performance gains can be achieved with low effort using CUDA Fortran
  ○ CUF kernel directives
  ○ CUDA-enabled libraries
  ○ Custom kernels when all else fails
● Working with developers to apply current code to challenging research cases
● Some previous work with these developers can be found on GitHub: https://github.com/PhysicsofFluids/AFiD_GPU_opensource