
Page 1

1

An Optimized Solver for Unsteady Transonic Aerodynamics and Aeroacoustics Around Wing Profiles

Jean-Marie Le Gouez, Onera CFD Department

Jean-Matthieu Etancelin, ROMEO HPC Center, Université de Reims Champagne-Ardennes

Thanks to Nikolay Markovskiy, dev-tech at the NVIDIA research center, GB, and to Carlos Carrascal, research master intern

GTC 2016, April 7th, San José California

Page 2

2

Unsteady CFD for aerodynamic profiles

• Context

• State of the art for unsteady fluid-dynamics simulations of aerodynamic profiles

• Prototypes for new-generation flow solvers

• NextFlow GPU prototype: development stages, data models, programming languages, co-processing tools

• Capacity of Tesla networks for LES simulations

•Performance measurements, tracks for further optimizations

•Outlook

Page 3

3

The expectations of the external users:

• Extended simulation domains: effects of the wake on downstream components, blade-vortex interaction on helicopters, thermal loading of composite structures by the reactor jets,

• Models of full systems and not only of the individual components: multi-stage turbomachinery internal flows, couplings between the combustion chamber and the turbine aerodynamics, …

• More multi-scale effects: representation of technological effects to improve the overall flow-system efficiency: grooves in the walls, local injectors for flow / acoustics control,

• Advanced usage of CFD: adjoint modes for automatic shape optimization and grid refinement, uncertainty management, input parameters defined as PDFs.

The Cassiopee system for application productivity, modularity and coupling, associated with the elsA solver, partly open source.

General context: CFD at Onera

Page 4

4

Expectations from the internal users:

- to develop and validate state-of-the-art physical models: transition to turbulence, wall models, sub-grid closure models, flame stability,

- to propose disruptive designs for aeronautics in terms of aerodynamics, propulsion integration, noise mitigation, …

- to tackle the CFD grand challenges:

→ new classes of numerical methods, less dependent on the grids, more robust and versatile,

→ computational efficiency close to the hardware design performance, high parallel scalability.

Decision to launch research projects:

→ on the deployment of the DG method for complex cases: AGHORA code
→ on a modular multi-solver architecture within the Cassiopee set of tools

CFD at Onera — elsA

Page 5

5

Improvement of predictive capabilities over the last 5 years: RANS / zonal LES of the flow around a high-lift wing

2D steady RANS and 3D LES, 7.5 Mpts — 2009

Mach 0.18, Re = 1,400,000 per chord

LEISA project, Onera FUNK software — 2014

• Optimized on a CPU architecture: MPI / OpenMP / vectorization

• CPU resources for 70 ms of simulation: JADE computer (CINES), CPU time allotted by GENCI
• Nxyz ~ 2,600 Mpts, 4096 cores / 10,688 domains, T_CPU ~ 6,200,000 h, residence time: 63 days

Page 6

NextFlow: Spatially High-Order Finite Volume method for RANS / LES. Demonstration of the feasibility of porting these algorithms to heterogeneous architectures.

Page 7

NextFlow: Spatially High-Order Finite Volume method for RANS / LES. Demonstration of the feasibility of porting these algorithms to heterogeneous architectures.

Page 8

8

Multi-GPU implementation of a High-Order Finite Volume solver

Main choices: CUDA, Thrust, MVAPICH. Reasons: resource-aware programming, productivity libraries.

The hierarchy of memories corresponds to the algorithm phases:
1/ main memory for the field and metric variables (40 million cells on a K40, 12 GB) and for the communication buffers (halos of cells for the other partitions)
2/ shared memory at the streaming-multiprocessor level for stencil operations
3/ careful use of registers for the node, cell and face algorithms
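A minimal CUDA sketch of how these three levels can be used together (illustrative only, not the NextFlow code; CELLS_PER_BLOCK, STENCIL and the centered-average flux are assumptions): the field values of one mesh block are staged from main memory into shared memory by thread collaboration, and each thread then accumulates its face flux in registers.

#include <cuda_runtime.h>

#define CELLS_PER_BLOCK 256   // cells owned by one thread block (illustrative)
#define STENCIL 4             // cells read around one face (illustrative)

// One thread block per mesh block, one thread per face of that block
// (facesPerBlock <= blockDim.x is assumed).
__global__ void faceFluxKernel(const double* __restrict__ rho,     // level 1: cell field in main memory
                               const int*    __restrict__ stencil, // [face * STENCIL + k] -> local cell id
                               double*       __restrict__ flux,    // one flux per face
                               int facesPerBlock)
{
    // Level 2: stage the cell values of this block into shared memory (thread collaboration).
    __shared__ double rhoSh[CELLS_PER_BLOCK];
    const int cell0 = blockIdx.x * CELLS_PER_BLOCK;
    for (int c = threadIdx.x; c < CELLS_PER_BLOCK; c += blockDim.x)
        rhoSh[c] = rho[cell0 + c];
    __syncthreads();

    // Level 3: each thread keeps its running flux in registers.
    const int f = blockIdx.x * facesPerBlock + threadIdx.x;
    if (threadIdx.x < facesPerBlock) {
        double acc = 0.0;
        for (int k = 0; k < STENCIL; ++k)
            acc += rhoSh[stencil[f * STENCIL + k]];
        flux[f] = acc / STENCIL;   // placeholder centered average, not a Riemann solver
    }
}

// launched as: faceFluxKernel<<<nMeshBlocks, CELLS_PER_BLOCK>>>(d_rho, d_stencil, d_flux, facesPerBlock);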

Stages of the project

• Initial porting with the same data-model organization as on the CPU

• Generic refinement of coarse triangular elements with curved faces: hierarchy of grids

•Multi-GPU implementation of a highly space-parallel model : extruded in the span direction and periodic

•On-going work on a 3D generalization of the preceding phases : embedded grids inside a regular distribution (Octree-type)

Page 9

9

1st Approach: Block Structuration of a Regular Linear Grid

Partition the mesh into small blocks

[Diagram: mesh blocks mapped one-to-one onto the GPU streaming multiprocessors (SM: Stream Multiprocessor)]

Map the GPU scalable structure

Page 10

10

Relative advantage of the small block partition

●Bigger blocks provide

• Better occupancy

• Lower kernel-launch latency

• Fewer transfers between blocks

●Smaller blocks provide

• Much more data caching

[Charts: L1 hit rate, fluxes time (normalized) and overall time (normalized) for block sizes of 256, 1024, 4096 and 24097 cells]

● Final speedup w.r.t. 2 hyperthreaded Westmere CPUs: ~2

Page 11

11

2nd approach: embedded grids, hierarchical data model NXO-GPU

Hierarchical model for the grid: high-order (quartic polynomial) triangles generated by gmsh, refined on the GPU; the whole fine grid as such can remain unknown to the host CPU.

Imposing a sub-structuring of the grid and data model (inspired by the ‘tessellation’ mechanism in surface rendering).

Unique grid connectivity for the inner algorithm: optimal for organizing the data for coalesced memory access during the algorithm and communication phases. Each coarse element in a block is allocated to an inner thread (threadIdx.x).
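A minimal sketch of this embedded-grid refinement, assuming a linear mapping and a fixed number of generic sub-cells per coarse element (the real NXO-GPU model uses quartic gmsh triangles and its own connectivity lists): each thread handles one coarse element and fills it with its generic sub-cells directly on the GPU, so the fine grid never has to exist on the host.

#include <cuda_runtime.h>

#define SUB 16   // generic sub-cells per coarse triangle (illustrative)

// Barycentric centroids of the generic sub-cells in the reference triangle,
// identical for every coarse element; filled once from the host with cudaMemcpyToSymbol.
__constant__ double refBary[SUB][3];

// One thread per coarse element (threadIdx.x), a loop over its generic sub-cells.
// The mapping here is linear for brevity; the actual elements are quartic gmsh triangles.
__global__ void refineKernel(const double* __restrict__ xv,  // coarse vertex x, [elem*3 + i]
                             const double* __restrict__ yv,  // coarse vertex y, [elem*3 + i]
                             double* __restrict__ xc,        // fine-cell centers, [elem*SUB + s]
                             double* __restrict__ yc,
                             int nCoarse)
{
    const int e = blockIdx.x * blockDim.x + threadIdx.x;   // coarse element id
    if (e >= nCoarse) return;

    const double x0 = xv[3*e], x1 = xv[3*e+1], x2 = xv[3*e+2];
    const double y0 = yv[3*e], y1 = yv[3*e+1], y2 = yv[3*e+2];

    for (int s = 0; s < SUB; ++s) {                          // fill the coarse element
        const double l0 = refBary[s][0], l1 = refBary[s][1], l2 = refBary[s][2];
        xc[e*SUB + s] = l0*x0 + l1*x1 + l2*x2;               // fine grid built on the GPU:
        yc[e*SUB + s] = l0*y0 + l1*y1 + l2*y2;               // the host never sees it
    }
}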

Page 12

12

Code structure

- Preprocessing (Fortran): mesh generation, block and generic-refinement generation
- Postprocessing: visualization and data analysis
- Solver (Fortran): allocation and initialization of the data structure from the modified mesh file; computational routine; time stepping
- Binders (C): data-fetching binder, computational binders, GPU allocation and initialization binders
- CUDA kernels (CUDA)
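As an illustration of this layered structure, here is a minimal sketch of what the C binders between a Fortran driver and the CUDA kernels can look like; the names (nxo_gpu_init, nxo_compute_rhs, nxo_fetch_rhs) and the placeholder kernel are hypothetical, not the actual NextFlow interfaces. On the Fortran side, each binder would be declared with ISO_C_BINDING.

#include <cuda_runtime.h>

// Hypothetical CUDA kernel standing in for one NextFlow computational routine.
__global__ void rhsKernel(const double* q, double* rhs, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) rhs[i] = -q[i];   // placeholder right-hand side
}

static double *d_q = nullptr, *d_rhs = nullptr;   // device arrays owned by the binders

// GPU allocation and initialization binder: called once from the Fortran solver
// through ISO_C_BINDING (bind(C, name="nxo_gpu_init")).
extern "C" void nxo_gpu_init(const double* q_host, int n)
{
    cudaMalloc(&d_q,   n * sizeof(double));
    cudaMalloc(&d_rhs, n * sizeof(double));
    cudaMemcpy(d_q, q_host, n * sizeof(double), cudaMemcpyHostToDevice);
}

// Computational binder: called by the Fortran time-stepping loop at every stage.
extern "C" void nxo_compute_rhs(int n)
{
    rhsKernel<<<(n + 255) / 256, 256>>>(d_q, d_rhs, n);
}

// Data-fetching binder: brings the result back when the host needs it.
extern "C" void nxo_fetch_rhs(double* rhs_host, int n)
{
    cudaMemcpy(rhs_host, d_rhs, n * sizeof(double), cudaMemcpyDeviceToHost);
}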

Page 13

13

Version 2: measured efficiency on a Tesla K20c (with respect to 2 Xeon 5650 CPUs, OpenMP loop-based)

Initial results on a K20c: max. acceleration = 38 w.r.t. 2 Westmere sockets

Improvement of the Westmere CPU efficiency: OpenMP task-based rather than inner-loop

With the same block data model on the CPU as well, the K20c GPU / CPU acceleration drops to 13 (1 K20c ≈ 150 Westmere cores)

In fact this method is memory-bound, and GPU bandwidth is critical.

More CPU optimization is needed (cache blocking, vectorization?)

Flop count: around 80 Gflops DP per K20c

These are valuable flops: not Ax=b, but highly non-linear Riemann-solver flops with high-order (4th, 5th) extrapolated values, characteristic splitting to avoid interference between waves, …: a wide-stencil method that requires very high memory traffic to feed these flops.

Thanks to the NVIDIA GB dev-tech group for their support, “my flop is rich”

Page 14

14

Version 3: 2.5D periodic spanwise (circular-shift vectors), MULTI-GPU / MPI

Objective: one billion cells on a cluster with only 64 Tesla K20 or 16 K80 (40,000 cells × 512 spanwise stations per partition: 20 million cells addressed to each Tesla K20)

The CPU (MPI / Fortran, OpenMP inner-loop-based) and GPU (GPUDirect / C / CUDA) versions are in the same executable, for efficiency and accuracy comparisons

High CPU vectorization (all variables are vectors of length 256 to 512) in the 3rd, homogeneous direction

Fully data-parallel CUDA kernels with coalesced memory access (see the sketch below)

• Coarse partitioning: number of partitions equal to the number of sockets / accelerators
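A minimal sketch of this data layout, assuming a fixed NSPAN of 512 spanwise stations and a placeholder central-difference operator: the spanwise index is the fastest-varying one, so consecutive threads read consecutive addresses (coalesced), and the periodic neighbour is obtained with a circular shift, the GPU counterpart of the Fortran cshift.

#include <cuda_runtime.h>

#define NSPAN 512   // spanwise stations per partition (256 to 512 in the talk)

// Layout: q[c * NSPAN + k], with the spanwise index k fastest-varying.
// One thread per spanwise station -> fully coalesced loads and stores.
__global__ void spanwiseDiffKernel(const double* __restrict__ q,
                                   double* __restrict__ dq,
                                   int nCells)
{
    const int c = blockIdx.x;          // one block per 2D cell of the profile grid
    const int k = threadIdx.x;         // spanwise station
    if (c >= nCells || k >= NSPAN) return;

    const int kp = (k + 1) % NSPAN;            // periodic neighbour (cshift by +1)
    const int km = (k + NSPAN - 1) % NSPAN;    // periodic neighbour (cshift by -1)

    // Central difference in the homogeneous spanwise direction (placeholder operator).
    dq[c * NSPAN + k] = 0.5 * (q[c * NSPAN + kp] - q[c * NSPAN + km]);
}

// launch: spanwiseDiffKernel<<<nCells, NSPAN>>>(d_q, d_dq, nCells);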

Page 15

15

Version 3: 2.5D periodic spanwise (cshift vectors), MULTI-GPU / MPI — initial performance measurements

Page 16

16

Initial kernel optimization and analysis performed by NVIDIA DevTech

After this first optimization: performance ratio of 14 between a K40 and an 8-core Ivy Bridge socket

Strategy for further performance optimization:

Increase occupancy, reduce register use, reduce the number of global-memory operations, and use the texture cache for wide read-only arrays in a kernel

Put the stencil coefficients in shared memory, use constant memory, __launch_bounds__(128, 4) — see the sketch below
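A minimal sketch of the last two ingredients, with illustrative array names and a placeholder flux: the stencil coefficients are broadcast from constant memory, the wide read-only field goes through the read-only / texture cache, and __launch_bounds__(128, 4) asks the compiler to keep register use low enough for 4 resident blocks of 128 threads per SM.

#include <cuda_runtime.h>

#define STENCIL 8      // illustrative stencil width

// Stencil coefficients, identical for all faces of a given class:
// good candidates for constant memory (broadcast to all threads of a warp).
__constant__ double c_coef[STENCIL];

__global__ void __launch_bounds__(128, 4)
fluxKernel(const double* __restrict__ q,        // read-only: eligible for the texture/read-only cache
           const int*    __restrict__ stencil,  // [face * STENCIL + k] -> cell id
           double*       __restrict__ flux,
           int nFaces)
{
    const int f = blockIdx.x * blockDim.x + threadIdx.x;
    if (f >= nFaces) return;

    double acc = 0.0;                            // accumulated in registers
    for (int k = 0; k < STENCIL; ++k)
        acc += c_coef[k] * __ldg(&q[stencil[f * STENCIL + k]]);  // read-only cache load
    flux[f] = acc;                               // placeholder linear flux
}

// host side, once: cudaMemcpyToSymbol(c_coef, h_coef, STENCIL * sizeof(double));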

Page 17

17

Next stage of optimizations

Work done by Jean-Matthieu

• Use thread collaboration to transfer the stencil data from main memory to shared memory

• Refactor the kernel where the face-stencil operations are done: split it into two phases to reduce the stress on registers

• Use the Thrust library to classify the face and cell indices into lists, so as to template the kernels according to the list number and avoid internal conditional switches (a sketch follows)
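A minimal Thrust sketch of this index classification, with hypothetical face classes: the face indices are sorted by class so that each class occupies a contiguous range, and one kernel instantiation per class can then be launched over its range with no branch inside the kernel.

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/sequence.h>
#include <thrust/binary_search.h>

// Hypothetical face classes that would otherwise be tested inside the kernel.
enum FaceClass { INTERIOR = 0, BOUNDARY = 1, PERIPHERY = 2, NUM_CLASSES = 3 };

// Sort the face indices by class so that each class occupies a contiguous range.
void classifyFaces(const thrust::device_vector<int>& faceClass,  // class of each face
                   thrust::device_vector<int>&       faceIds,    // output: reordered face indices
                   int                               firstFace[NUM_CLASSES + 1])
{
    const int nFaces = static_cast<int>(faceClass.size());
    thrust::device_vector<int> key = faceClass;                   // keep the original untouched

    faceIds.resize(nFaces);
    thrust::sequence(faceIds.begin(), faceIds.end());             // 0, 1, 2, ... nFaces-1
    thrust::sort_by_key(key.begin(), key.end(), faceIds.begin());

    // Range [firstFace[c], firstFace[c+1]) now holds the faces of class c.
    for (int c = 0; c <= NUM_CLASSES; ++c)
        firstFace[c] = static_cast<int>(thrust::lower_bound(key.begin(), key.end(), c) - key.begin());
}

// Launch pattern: one templated kernel instantiation per class c, over
// faceIds[firstFace[c] .. firstFace[c+1]) with no conditional switch inside.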

Enable an overlap between:

- the computations in the center of the partition,

- the transfer of the halo cells at the periphery (MVAPICH2),

by using multiple streams and a further classification of the cell and face indices into center / periphery lists (Thrust). A sketch of this overlap is given below.
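A minimal sketch of the overlap, assuming pre-classified interior and periphery cell lists, a placeholder cell update, and a CUDA-aware MPI (e.g. MVAPICH2-GDR) so that device buffers can be passed directly to MPI_Sendrecv; every name outside the MPI and CUDA runtime calls is hypothetical.

#include <mpi.h>
#include <cuda_runtime.h>

// Placeholder kernels: a generic cell update over a pre-classified index list,
// and a halo-packing kernel filling a contiguous send buffer.
__global__ void computeCells(double* q, const int* cells, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) q[cells[i]] += 1.0;                 // stand-in for the real flux balance
}

__global__ void packHalo(const double* q, const int* sendCells, double* sendBuf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) sendBuf[i] = q[sendCells[i]];
}

// One RHS evaluation with overlap: interior cells are computed while the halo travels.
void rhsWithOverlap(double* d_q,
                    const int* d_interior,  int nInterior,    // center of the partition
                    const int* d_periphery, int nPeriphery,   // cells needing the halo
                    const int* d_sendCells, double* d_sendBuf, double* d_recvBuf, int nHalo,
                    int neighbor, cudaStream_t sCompute, cudaStream_t sComm)
{
    // 1/ pack the halo on the communication stream
    packHalo<<<(nHalo + 127) / 128, 128, 0, sComm>>>(d_q, d_sendCells, d_sendBuf, nHalo);

    // 2/ meanwhile, compute the interior cells on the compute stream
    computeCells<<<(nInterior + 127) / 128, 128, 0, sCompute>>>(d_q, d_interior, nInterior);

    // 3/ exchange the halo: wait only for the packing, not for the interior kernel
    //    (device buffers passed straight to MPI, assuming CUDA-aware MPI / GPUDirect)
    cudaStreamSynchronize(sComm);
    MPI_Sendrecv(d_sendBuf, nHalo, MPI_DOUBLE, neighbor, 0,
                 d_recvBuf, nHalo, MPI_DOUBLE, neighbor, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    // 4/ finish with the periphery cells once their halo has arrived
    //    (a real code would first unpack d_recvBuf into the ghost cells)
    computeCells<<<(nPeriphery + 127) / 128, 128, 0, sCompute>>>(d_q, d_periphery, nPeriphery);
    cudaStreamSynchronize(sCompute);
}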

Page 18

18

Version 3: kernel granularity revised to optimize register use; overlapping of the communications with the computations at the centers of the partitions; local-memory usage and inner-block thread collaboration

TAYLOR-GREEN Vortex

Scalability analysis with up to one billion cells and 4th-degree polynomial reconstruction (5 dof per cell, stencil size 68 cells), with 1 to 128 GPUs (K20Xm)

High performance: 12 ns to compute one set of 5 fluxes on an interface from a wide stencil: 180 GB/s, 170 Gflops DP

Scalability drops only for extreme degraded usage: a small grid of 128³ cells on more than 32 GPUs, with over 30% of the cells to exchange

[Plots: strong scalability and weak scalability]

Page 19

High-Order CFD Workshop, Case 3.5: Taylor-Green Vortex

Page 20

20

GPU implementation of the NextFlow solver

Performance on each K20Xm GPU:

at k3: 1.8e-8 s per cell per RHS evaluation, 0.36 s for 20,000,000 cells
at k4: 2.5e-8 s per cell per RHS evaluation, 0.50 s for 20,000,000 cells

→ Taylor-Green vortex 256³ — wall-clock = 12 hours on 16 Ivy Bridge processors (128 cores in total): ~1,600 Intel-core CPU hours; 25 minutes on 16 Tesla K20m GPUs

By comparison, at the 1st HO CFD workshop this case required between 1,100 and 33,000 Intel-core CPU hours, depending on the numerical method

Taylor-Green vortex 512³ — wall-clock: 4 hours on 16 Tesla K20m GPUs

Taylor-Green vortex, Re = 1600 — computations on wedges

Page 21

→ Grids (structured?): Octree → Tet-tree

All tets are identical, only oriented differently in space.

From a grid of very coarse “structured tets” (bottom right): perform a refinement based on a simple criterion (distance to an object): 8, 8², 8³ … coarse tets in each (figure on top right).

→ Tet-tree ‘coarse’ grid, managed and partitioned on the cluster by CPU thread 0 of each node.

Each coarse tet of any size is filled dynamically with small tets: the finite volumes for the solver.

The size of the inner grid is adapted dynamically to the solution by refinement fronts crossing the coarse edges.

The coarse tets are clustered by identical refinement level: these sets are allotted to the multiprocessors of the accelerators available on the nodes.

On-going work: hierarchical grids based on the generic refinement of a coarse grid of Octree / Tet-tree type
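A minimal host-side sketch of the distance-to-object criterion and of the clustering of coarse tets by refinement level; the thresholds, the 3-level limit and the use of the tet center as the distance point are illustrative assumptions, not the actual criterion.

#include <cmath>
#include <vector>

struct Vec3 { double x, y, z; };

// Refinement level of one coarse tet from its distance to an object.
int refinementLevel(const Vec3& tetCenter, const Vec3& object)
{
    const double dx = tetCenter.x - object.x;
    const double dy = tetCenter.y - object.y;
    const double dz = tetCenter.z - object.z;
    const double d  = std::sqrt(dx*dx + dy*dy + dz*dz);

    if (d < 0.1) return 3;   // 8^3 fine tets filling this coarse tet
    if (d < 0.5) return 2;   // 8^2
    if (d < 2.0) return 1;   // 8
    return 0;                // left coarse
}

// Cluster the coarse tets by identical refinement level:
// each set can then be handed to the accelerators as one homogeneous batch.
std::vector<std::vector<int>> clusterByLevel(const std::vector<Vec3>& centers, const Vec3& object)
{
    std::vector<std::vector<int>> cluster(4);
    for (std::size_t e = 0; e < centers.size(); ++e)
        cluster[refinementLevel(centers[e], object)].push_back(static_cast<int>(e));
    return cluster;
}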

Page 22

On-going work: hierarchical grids based on the generic refinement of a coarse grid of Octree / Tet-tree type

→ A generic set of filling grids is generated on this type of simple gmsh models (the “tet-farm”): inner connectivity list, coefficients of the spatial scheme, halos of ghost cells and their correspondence with the inner numbering of the neighbors, high-order projection coefficients when the filling-grid density varies in time.

→ This common inner data model is only stored on the GPUs and accessed in a coalesced way by the threads.

→ Wall boundary conditions are immersed-boundary conditions or CAD-cut cells with curved geometry.

Page 23

23

Conclusion

A number of preparatory projects made it possible to acquire solid expertise in the porting of CFD solvers, their compute-intensive kernels and interfaces, and in the best organization of the data models for multi-GPU performance.

A high compute intensity was reached by approaching the peak main-memory bandwidth and by almost fully overlapping computations and communications for big models: up to 80 million cells on a K80.

The initial choice of CUDA, Thrust and MVAPICH proved correct: good stability of the language, the SDK and the associated programming-productivity tools.

A project of full software deployment of a variety of CFD options for complex 3D geometries and adaptive grid refinement, without the need for a preliminary meshing tool, has been started.