Progress Toward Accelerating CAM-SE.
Jeff Larkin <[email protected]>
Along with: Rick Archibald, Ilene Carpenter, Kate Evans, Paulius Micikevicius, Jim Rosinski, Jim Schwarzmeier, Mark Taylor
Background
• In 2009 ORNL asked many of their top users: What sort of science would you do on a 20-petaflops machine in 2012?
  – Answer to come on the next slide
• The Center for Accelerated Application Readiness (CAAR) was established to determine:
  – Can a set of codes from various disciplines be made to effectively use GPU accelerators with the combined efforts of domain scientists and vendors?
  – Each team has a science lead, a code lead, and members from ORNL, Cray, NVIDIA, and elsewhere
CAM-SE Target Problem
• 1/8-degree CAM, using the CAM-SE dynamical core and MOZART tropospheric chemistry.
• Why is acceleration needed to “do” the problem?
  – When including all the tracers associated with MOZART atmospheric chemistry, the simulation is too expensive to run at high resolution on today’s systems.
• What unrealized parallelism needs to be exposed?
  – In many parts of the dynamics, parallelism needs to include levels (k) and chemical constituents (q).
Profile of Runtime
[Figure: bar chart of the percentage of runtime consumed by the dominant routines (edgeunpack, edgepack, verremap2, quadratic_spline, quadratic_spline_monotone, properties_quadratic_spline, mono_filter4, mono_filter2, euler_step, divergence_sphere, limiter2d_zero, divergence_sphere_wk, gradient_sphere, imp_sol, and chemistry outside of imp_sol), grouped into Buffer Packing for Boundary Exchange, Euler Step, Laplace Sphere Weak, and Vertical Remap; y-axis spans 0–20% of runtime.]
Next Steps
• Once the dominant routines were identified, standalone kernels were created for each.
• Early efforts tested PGI & HMPP directives, plus CUDA C, CUDA Fortran, and OpenCL
• Directives-based compilers were too immature at the time
  – Poor support for Fortran modules and derived types
  – Did not allow implementation at a high enough level
• CUDA Fortran provided good performance while allowing us to remain in Fortran (a minimal sketch of the style follows)
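For readers unfamiliar with CUDA Fortran, here is a minimal sketch of the style, assuming hypothetical names rather than the actual HOMME kernels: the kernel, the launch, and the host/device copies all stay in Fortran syntax.

    module scale_mod
      use cudafor
      implicit none
    contains
      ! Kernel: each thread scales one array element.
      attributes(global) subroutine scale_kernel(x, a, n)
        real(8) :: x(*)
        real(8), value :: a
        integer, value :: n
        integer :: i
        i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
        if (i <= n) x(i) = a * x(i)
      end subroutine scale_kernel
    end module scale_mod

    program demo
      use cudafor
      use scale_mod
      implicit none
      integer, parameter :: n = 1024
      real(8) :: x(n)
      real(8), device :: x_d(n)
      x = 1.0d0
      x_d = x                                    ! host-to-device copy by assignment
      call scale_kernel<<<(n + 255) / 256, 256>>>(x_d, 2.0d0, n)
      x = x_d                                    ! device-to-host copy
      print *, 'result: ', x(1)
    end program demo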
Identifying Parallelism
• HOMME parallelizes with both MPI and OpenMP over elements
• Most of the tracer advection can also parallelize over tracers (q) and levels (k)
  – Vertical remap is the exception, due to the vertical dependence across levels
• Parallelizing over tracers, and sometimes levels, while threading over quadrature points (nv) provides ample parallelism within each element to utilize the GPU effectively (sketched below)
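A sketch of that decomposition, with illustrative sizes and a placeholder update rather than the real HOMME kernels: one thread block per (element, tracer) pair, the block's nv x nv threads covering the quadrature points, and each thread looping over levels.

    module tracer_mod
      use cudafor
      implicit none
      integer, parameter :: nv = 4, nlev = 26, qsize = 106   ! illustrative sizes
    contains
      ! One thread block per (element, tracer) pair; the block's nv x nv
      ! threads cover the quadrature points, and each thread loops over levels.
      attributes(global) subroutine tracer_kernel(q, nelem)
        real(8) :: q(nv, nv, nlev, qsize, *)
        integer, value :: nelem
        integer :: i, j, k, ie, iq
        i  = threadIdx%x        ! quadrature point index (1..nv)
        j  = threadIdx%y        ! quadrature point index (1..nv)
        ie = blockIdx%x         ! element
        iq = blockIdx%y         ! tracer
        if (ie > nelem) return
        do k = 1, nlev
          q(i, j, k, iq, ie) = 0.9d0 * q(i, j, k, iq, ie)   ! stand-in update
        end do
      end subroutine tracer_kernel
    end module tracer_mod

    ! Launch with one block per (element, tracer) and nv*nv threads per block:
    !   call tracer_kernel<<< dim3(nelem, qsize, 1), dim3(nv, nv, 1) >>>(q_d, nelem)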
Status
• euler_step & laplace_sphere_wk were straightforward to rewrite in CUDA Fortran
• Vertical remap was rewritten to be more amenable to the GPU (made it vectorize)
  – The resulting code is 2X faster on the CPU than the original and has been given back to the community
• Edge packing/unpacking for the boundary exchange needs to be rewritten (Ilene talked about this already)
  – Designed for 1 element per MPI rank, but we plan to run with more
  – Once this is node-aware, it can also be device-aware and greatly reduce PCIe transfers
• Someone said yesterday: “As with many kernels, the ratio of FLOPS per byte transferred determines successful acceleration.”
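To make that quote concrete, here is a back-of-envelope sketch with assumed, Fermi-era numbers (roughly 500 GF/s double precision on the GPU and 6 GB/s effective over PCIe; both figures are assumptions, not measurements from this work):

    program flops_per_byte
      implicit none
      real(8), parameter :: gpu_gflops = 500.0d0   ! assumed kernel rate, GF/s
      real(8), parameter :: pcie_gbs   = 6.0d0     ! assumed effective PCIe rate, GB/s
      ! A kernel must do at least this many flops per byte moved over PCIe
      ! for the computation to outweigh the transfer:
      print *, 'break-even flops per byte ~', gpu_gflops / pcie_gbs
    end program flops_per_byte

At those assumed rates a kernel needs on the order of 80 flops per byte transferred, which is why the later slides focus on keeping data resident on the device.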
Status (cont.)
• Kernels were put back into HOMME, and validation tests were run and passed
  – This version did nothing to reduce data movement; it only tested kernel accuracy
  – In the process of porting forward to the current trunk and doing more intelligent data movement
• Currently reevaluating directives now that compilers have matured (see the sketch below)
  – The directives-based vertical remap now slightly outperforms the hand-tuned CUDA
  – Still working around derived-type issues
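A minimal sketch of what a directives version of such a loop nest can look like. OpenACC-style spelling is an assumption here; the Cray and PGI directives of the period were spelled differently. The outer loop is parallel over points, while the vertical dependence keeps levels sequential:

    subroutine remap_sketch(q, n, nlev)
      implicit none
      integer, intent(in) :: n, nlev
      real(8), intent(inout) :: q(n, nlev)
      integer :: i, k
      !$acc parallel loop copy(q)
      do i = 1, n                  ! parallel over columns / quadrature points
        !$acc loop seq
        do k = 2, nlev             ! vertical dependence keeps levels sequential
          q(i, k) = q(i, k) + 0.5d0 * q(i, k - 1)   ! stand-in recurrence
        end do
      end do
    end subroutine remap_sketch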
Challenges
• Data structures (object-oriented Fortran)
  – Every node has an array of element derived types, each of which contains more arrays
  – We only care about some of these arrays, so data movement isn’t very natural
  – We must essentially change many non-contiguous CPU arrays into a contiguous GPU array (sketched after this list)
• Parallelism occurs at various levels of the call tree, not just in leaf routines, so the compiler must be able to inline leaves in order to use directives
  – The Cray compiler handles this via whole-program analysis; the PGI compiler may support this via an inline library
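A sketch of the packing step, assuming a heavily simplified element type (HOMME's real element derived type holds many more arrays): gather the per-element arrays into one contiguous host buffer, then copy that buffer to the device in a single transfer.

    module pack_mod
      use cudafor
      implicit none
      integer, parameter :: np = 4, nlev = 26    ! illustrative sizes
      ! Heavily simplified stand-in for HOMME's element derived type.
      type element_t
        real(8), allocatable :: qdp(:,:,:)       ! assumed allocated as (np,np,nlev)
      end type element_t
    contains
      subroutine pack_to_device(nelem, elem, q_d)
        integer, intent(in) :: nelem
        type(element_t), intent(in) :: elem(nelem)
        real(8), device, intent(out) :: q_d(np, np, nlev, nelem)
        real(8), allocatable :: buf(:,:,:,:)
        integer :: ie
        allocate(buf(np, np, nlev, nelem))
        do ie = 1, nelem                 ! gather the non-contiguous host arrays...
          buf(:, :, :, ie) = elem(ie)%qdp
        end do
        q_d = buf                        ! ...into one contiguous device copy
        deallocate(buf)
      end subroutine pack_to_device
    end module pack_mod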
Challenges (cont.)
• CUDA Fortran requires everything to live in the same module
  – Must duplicate some routines and data structures from several modules in our “cuda_mod”
  – Insert ifdefs that hijack CPU routine calls and forward the request to the matching cuda_mod routines (sketched below)
  – Simple for the user, but the developer must maintain duplicate routines
– Hey Dave, when will this get changed? ;)
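The forwarding scheme looks roughly like this; the routine names are illustrative, and _CUDA is shown as the guard macro on the assumption that the build defines it for CUDA Fortran compilation:

    subroutine euler_step(q, n)
    #ifdef _CUDA
      use cuda_mod, only: euler_step_cuda   ! hypothetical mirror module
    #endif
      implicit none
      integer, intent(in) :: n
      real(8), intent(inout) :: q(n)
    #ifdef _CUDA
      call euler_step_cuda(q, n)    ! hijack the call; forward to the GPU version
      return
    #endif
      q = 2.0d0 * q                 ! placeholder for the original CPU implementation
    end subroutine euler_step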
Until the Boundary Exchange is rewritten, euler_step performance is hampered by data movement. Streaming over elements helps, but may not be realistic for the full code.
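A sketch of the streaming idea, with placeholder chunk size, array shapes, and kernel body: chunks of elements are copied and computed in alternating CUDA streams so one chunk's PCIe transfer overlaps another chunk's kernel.

    module stream_mod
      use cudafor
      implicit none
      integer, parameter :: nv = 4, nlev = 26, qsize = 106   ! illustrative sizes
    contains
      attributes(global) subroutine chunk_kernel(q, first, last)
        real(8) :: q(nv, nv, nlev, qsize, *)
        integer, value :: first, last
        integer :: i, j, k, iq, ie
        i  = threadIdx%x
        j  = threadIdx%y
        iq = blockIdx%y
        ie = first + blockIdx%x - 1            ! element handled by this block
        if (ie > last) return
        do k = 1, nlev
          q(i, j, k, iq, ie) = 0.9d0 * q(i, j, k, iq, ie)   ! stand-in update
        end do
      end subroutine chunk_kernel

      subroutine euler_step_streamed(nelem, q, q_d)
        integer, intent(in) :: nelem
        real(8) :: q(nv, nv, nlev, qsize, nelem)     ! must be pinned for true overlap
        real(8), device :: q_d(nv, nv, nlev, qsize, nelem)
        integer(kind=cuda_stream_kind) :: stream(2)
        integer :: istat, s, chunk, first, last, n
        chunk = max(1, nelem / 4)                    ! illustrative chunk size
        istat = cudaStreamCreate(stream(1))
        istat = cudaStreamCreate(stream(2))
        s = 1
        do first = 1, nelem, chunk
          last = min(first + chunk - 1, nelem)
          n = nv * nv * nlev * qsize * (last - first + 1)
          ! Copy this chunk while the other stream's kernel is still running:
          istat = cudaMemcpyAsync(q_d(1,1,1,1,first), q(1,1,1,1,first), n, stream(s))
          call chunk_kernel<<<dim3(last-first+1, qsize, 1), dim3(nv, nv, 1), 0, stream(s)>>> &
               (q_d, first, last)
          s = 3 - s                                  ! alternate between the two streams
        end do
        istat = cudaDeviceSynchronize()
        istat = cudaStreamDestroy(stream(1))
        istat = cudaStreamDestroy(stream(2))
      end subroutine euler_step_streamed
    end module stream_mod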
With data transfer, laplace_sphere_wk is a wash, but since all necessary data is already resident from euler_step, kernel-only time is realistic.
The vertical remap rewrite is 2X faster on the CPU and still faster on the GPU. All data is already resident on the device from euler_step, so kernel-only time is realistic.
Future Work
• Use CUDA 4.0 dynamic pinning of memory to allow overlapping & better PCIe performance (see the sketch after this list)
• Move forward to CAM5/CESM1
  – No chance of our work being used otherwise
• Some additional, small kernels are needed to allow data to remain resident
  – Cheaper to run these on the GPU than to copy the data
• Reprofile the accelerated application to identify the next most important routines
  – The chemistry implicit solver is expected to be next
  – Physics is expected to require a mature directives-based compiler
  – Rinse, repeat
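A sketch of the dynamic-pinning idea; the cudaHostRegister binding shown is an assumption and may differ across CUDA Fortran versions. Registering an already-allocated buffer lets later PCIe transfers from it run asynchronously without changing how the host code allocates it.

    subroutine pin_existing_buffer(buf, n)
      use cudafor
      use iso_c_binding, only: c_loc
      implicit none
      integer, intent(in) :: n
      real(8), target, intent(inout) :: buf(n)
      integer :: istat
      ! Register existing host memory so the driver can DMA from it
      ! (assumed binding of the CUDA 4.0 cudaHostRegister call):
      istat = cudaHostRegister(c_loc(buf), int(n, 8) * 8_8, 0)   ! 0 = default flags
      ! ... asynchronous transfers using buf go here ...
      istat = cudaHostUnregister(c_loc(buf))
    end subroutine pin_existing_buffer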
Conclusions
• Much has been done; much remains
• For a fairly new, cleanly written code, CUDA Fortran was tractable
  – HOMME has very similar loop nests throughout, which was key to making this possible
  – It still results in multiple code paths to maintain, so we’d prefer to move to directives in the long run
• We believe GPU accelerators will be beneficial for the selected problem
  – We hope the work will also benefit a wider audience (CAM5 should help this)