
INTERNATIONAL JOURNAL OF COMPUTATIONAL FLUID DYNAMICS, 2020, VOL. 34, NOS. 7–8, 508–528
https://doi.org/10.1080/10618562.2020.1778168

A Generic Performance Analysis Technique Applied to Different CFD Methods for HPC

Marta Garcia-Gasulla (a), Fabio Banchelli (a), Kilian Peiro (a), Guillem Ramirez-Gargallo (a), Guillaume Houzeaux (a), Ismaïl Ben Hassan Saïdi (b), Christian Tenaud (b), Ivan Spisso (c) and Filippo Mantovani (a)

(a) Barcelona Supercomputing Center, Barcelona, Spain; (b) Université Paris-Saclay, CNRS, LIMSI, Orsay, France; (c) CINECA, Bologna, Italy

ABSTRACT
For complex engineering and scientific applications, Computational Fluid Dynamics (CFD) simulations require a huge amount of computational power. As such, it is of paramount importance to carefully assess the performance of CFD codes and to study them in depth to enable optimisation and portability. In this paper, we study three complex CFD codes, OpenFOAM, Alya and CHORUS, representing two numerical methods, namely the finite volume and finite element methods, on both structured and unstructured meshes. To all codes, we apply a generic performance analysis method based on a set of metrics that helps the code developer spot the critical points that can potentially limit the scalability of a parallel application. We show the root cause of the performance bottlenecks by studying the three applications on the MareNostrum4 supercomputer. We conclude by providing hints for improving the performance and the scalability of each application.

ARTICLE HISTORY
Received 2 March 2020
Accepted 19 May 2020

KEYWORDS
Performance analysis; MPI; efficiency; parallelisation; CFD; finite element; finite volume

1. Introduction

Fluid dynamics increasingly leverages numerical techniques that allow computing solutions to the non-linear equations describing regimes and geometries relevant in engineering and industrial contexts. Several different numerical approaches have been theoretically developed and implemented on massively parallel computers. In this paper, we consider three different complex Computational Fluid Dynamics (CFD) codes, covering the following classes of numerical methods: finite element (Alya) and finite volume (OpenFOAM and CHORUS). Moreover, the three codes address different flow problems, incompressible flows (Alya and OpenFOAM) and compressible flows (CHORUS), which lead to different numerical time integrations, from completely explicit schemes to fully implicit integrations. Also, both unstructured meshes (Alya and OpenFOAM) and structured meshes (CHORUS) are represented.

These numerical methods heavily rely on High Performance Computing (HPC) resources. Therefore, on the one hand, their efficient implementation is of paramount importance for extracting the computational power offered by new HPC systems.

CONTACT Marta Garcia-Gasulla [email protected]

On the other hand, the portability of performance optimisations is also critical, since developers and maintainers of CFD codes cannot re-implement their methods for each new HPC technology appearing on the market (Banchelli et al. 2020b; Mantovani et al. in press; Banchelli et al. 2020a).

For this reason, we focus this paper on the performance analysis of the aforementioned CFD applications. We apply a set of general performance metrics defined in the POP performance model (Wagner et al. 2017); the POP metrics are a set of efficiency and scalability indicators that can be obtained for MPI applications. These indicators identify the main factors limiting the scalability of the code; based on these indicators, a more detailed analysis is conducted to understand the reason for the degradation of performance. From this analysis, we extract insights for improving the performance and the scalability of the selected codes. We are indeed interested in complex geometries and how they perform at mid-to-large scale. The point of view that we would like to provide is the one of a 'regular domain scientist' who needs to run a real scientific case efficiently on an HPC cluster.

© 2020 The Author(s). Published by Informa UK Limited, trading as Taylor & Francis Group. This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives License (http://creativecommons.org/licenses/by-nc-nd/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited, and is not altered, transformed, or built upon in any way.


The final goal of the paper is to highlight computational bottlenecks coming from the way the applications are programmed or, sometimes, from the way HPC systems are configured. The added value of our approach is that none of the proposed optimisations is system-specific, so performance portability is preserved.

The main contributions of this paper are: (i) we apply the same performance analysis method to three CFD applications selected as representatives of the main CFD methods running on today's supercomputers; (ii) we study the outcome of the performance analysis data gathered on a state-of-the-art HPC cluster and we provide insights for improving the performance and the scalability.

The remaining part of the document is structured as follows. In Section 2, we introduce each of the three computational methods studied in this paper. In Section 3, we describe the performance analysis method used to study the HPC applications. In Section 4, we discuss the performance analysis performed on each application. We conclude with Section 5, where we summarise performance and scalability insights.

2. Environment

We selected three CFD codes employing different spatial numerical methods as well as time integration schemes. Table 1 summarises the different algorithms.

We carried out the performance study running on MareNostrum4, the Spanish national supercomputer hosted at the Barcelona Supercomputing Center. The rest of this section is dedicated to introducing the details of the hardware platform and the applications employed in our study.

2.1. Hardware Environment

The MareNostrum4 supercomputer is a Tier-0 supercomputer ranked 30th in the Top500 (November 2019), with a total of 3456 x86 compute nodes. MareNostrum4 is based on Intel Xeon Platinum processors from the Skylake generation. The nodes of the system are interconnected using the Intel Omni-Path high-performance network and run SuSE Linux Enterprise Server as operating system. Each node is equipped with:

CPU – 2 sockets Intel Xeon Platinum 8160 CPU with 24 cores each, running at 2.10 GHz, for a total of 48 cores per node.
Cache – 32 KB of L1 data cache; 32 KB of L1 instruction cache; 1024 KB of L2 data cache; 33792 KB of L3 data cache. L1 and L2 are local to each core, while L3 is shared among all 24 cores of a socket.
Memory – 96 GB of main memory, 1.880 GB/core, 12 x 8 GB 2667 MHz DIMMs (216 high-memory nodes, 10368 cores with 7.928 GB/core).
HPC Network – 100 Gbit/s Intel Omni-Path HFI Silicon 100 Series PCI-E adapter.
Service Network – 10 Gbit Ethernet.
Local Storage – 200 GB local SSD available as temporary storage.
Network Storage – IBM GPFS filesystem.

Figure 1 shows a schematic view of a node of MareNostrum4.

2.2. OpenFOAM: Unstructured Mesh Finite Volume

2.2.1. Context
OpenFOAM (for 'Open-source Field Operation And Manipulation') is a free, open-source CFD software package developed primarily by OpenCFD Ltd since 2004 (Chen et al. 2014). Currently, there are three main variants of OpenFOAM that are released as free and open-source software under the GNU General Public License Version 3: CFD Direct,1 the open source CFD toolbox2 and foam-extend.3 The development model follows a cathedral style (Morgan 2000).

OpenFOAM constitutes a C++ CFD toolbox for customised numerical solvers (over 60 of them) that can perform simulations of basic CFD, combustion, turbulence modelling, electromagnetics, heat transfer, multiphase flow, stress analysis and even financial mathematics. It has a large user base across most areas of engineering and science, from both commercial and academic organisations. The number of OpenFOAM users has been steadily increasing.

Table 1. CFD codes under study.

Code       Regime           Numerical method   Mesh           Solution        Algorithm
OpenFOAM   Incompressible   Finite volume      Unstructured   Implicit        SIMPLE
Alya       Incompressible   Finite element     Unstructured   Semi-implicit   Fractional step
CHORUS     Compressible     Finite volume      Structured     Explicit        Operator splitting


Figure 1. Schematic view of a node of MareNostrum4.

It is now estimated to be in the order of thousands, with the majority of them being engineers in Europe. It is one of the most widely used open-source CFD codes in HPC environments.

2.2.2. Physical Problem
The problem considered in this work is the simulation of automotive aerodynamics using the DrivAer model, made publicly available by the Technical University of Munich.4 The model consists of 61M cells. The solver used is simpleFoam, based on the SIMPLE (Semi-Implicit Method for Pressure-Linked Equations) algorithm implemented in OpenFOAM.

2.2.3. Numerical Method
The main features of the numerical method OpenFOAM relies upon are: segregated, iterative solution of linear algebra solvers, an unstructured finite volume method (FVM), co-located variables, and equation coupling (e.g. SIMPLE, PIMPLE, etc.). It uses C++ and object-oriented programming to develop a syntactical model of equation mimicking and scalar-vector-tensor operations.5 Algorithm 1 shows a code snippet for the Navier–Stokes equations. Equation mimicking is quite obvious: fvm::laplacian means an implicit finite volume discretisation of the Laplacian operator, and similarly fvm::div for the divergence operator. On the other hand, fvc::grad means an explicit finite volume discretisation of the gradient operator. The parentheses (,) mean the product of the enclosed quantities, including tensor products. The discretisation is based on low-order (typically second-order) gradient schemes. It is possible to select different iterative solvers for the solution of the linear algebra based on SpMV (Sparse Matrix-Vector) multiplication, ranging from Conjugate Gradient up to Multi-Grid methods. The temporal schemes can be selected among first- and second-order implicit Crank–Nicolson and Euler schemes.

Algorithm 1 OpenFOAM pseudo-code for the Navier–Stokes equations

Solve
(
    fvm::ddt(rho, U)
  + fvm::div(U, U)
  - fvm::laplacian(mu, U)
 ==
  - fvc::grad(p)
  + f
);

2.2.4. Parallelisation
The method of parallel computing used by OpenFOAM is based on domain decomposition, in which the geometry and associated fields are broken into subdomains and allocated to separate processors for solution. The process of parallel computation involves: decomposition of mesh and fields; running the application in parallel; and post-processing the decomposed case.6 The method of parallel computing used by OpenFOAM is based on standard MPI. A convenient interface, Pstream, a stream-based parallel communication library, is used to plug any MPI library into OpenFOAM. It is a light wrapper around the selected MPI interface (Jasak 2012). Traditionally, FVM parallelisation uses the halo layer approach: data for cells next to a processor boundary is duplicated. The halo layer covers all processor boundaries and is explicitly updated through parallel communication. Instead, OpenFOAM operates with a zero halo layer approach, which gives flexibility in the communication pattern. FVM operations 'look parallel' without data dependency (Jasak 2011).
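To make the communication pattern concrete, the sketch below shows a generic nonblocking exchange of processor-boundary data in plain MPI. It is a minimal illustration, not OpenFOAM's Pstream API: the neighbour list and the buffer layout are hypothetical placeholders.

#include <mpi.h>
#include <vector>

// Generic sketch: exchange boundary data with every neighbouring subdomain
// using nonblocking point-to-point calls, then wait for all transfers.
// 'neighbours', 'sendBuf' and 'recvBuf' are hypothetical placeholders.
void exchangeBoundaries(const std::vector<int>& neighbours,
                        std::vector<std::vector<double>>& sendBuf,
                        std::vector<std::vector<double>>& recvBuf)
{
    std::vector<MPI_Request> requests;
    requests.reserve(2 * neighbours.size());

    for (std::size_t i = 0; i < neighbours.size(); ++i)
    {
        MPI_Request r;
        MPI_Irecv(recvBuf[i].data(), static_cast<int>(recvBuf[i].size()),
                  MPI_DOUBLE, neighbours[i], 0, MPI_COMM_WORLD, &r);
        requests.push_back(r);
        MPI_Isend(sendBuf[i].data(), static_cast<int>(sendBuf[i].size()),
                  MPI_DOUBLE, neighbours[i], 0, MPI_COMM_WORLD, &r);
        requests.push_back(r);
    }

    // All transfers must complete before the next operation uses the data.
    MPI_Waitall(static_cast<int>(requests.size()), requests.data(),
                MPI_STATUSES_IGNORE);
}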

2.2.5. Actual Bottlenecks
From previous works (Culpo 2011; Bnà et al. 2019), it is known that the performance of the linear solvers limits the scalability of OpenFOAM to the order of a few thousand cores. Meanwhile, the technological trends observed in the most powerful supercomputers (e.g. the Top500 list7) reveal that upcoming exascale systems will require the use of millions of cores. As of today, the well-known bottlenecks limiting the full enablement of OpenFOAM on massively parallel clusters are (i) the limits of the parallelism paradigm (Pstream library); (ii) the I/O data storage system; (iii) the sub-optimal sparse matrix storage format (LDU), which does not enable any cache-blocking mechanism (SIMD, vectorisation).

The recently formed HPC technical committee8 aims at working together with the community to overcome the aforementioned HPC bottlenecks. The priorities of the HPC technical committee are:

• To create an open and shared repository of HPC benchmarks, to provide the community with a homogeneous term of reference to compare different architectures, configurations and software environments;
• To enable GP-GPUs and other accelerators;
• To enable parallel I/O: the ADIOS 2 I/O library as a function object is now a git submodule in the OpenFOAM develop branch.

2.3. Alya: Unstructured Mesh Finite Element

2.3.1. Context
Alya is a high-performance computational mechanics code to solve complex coupled multi-physics/multi-scale/multi-domain problems, mostly coming from the engineering realm. Among the different physics solved by Alya we can mention: incompressible/compressible flows, non-linear solid mechanics, chemistry, particle transport, heat transfer, turbulence modelling, electrical propagation, etc. Alya (Vázquez et al. 2016) was specially designed for massively parallel supercomputers and is part of the Unified European Application Benchmark Suite (UEABS), a set of 13 highly scalable, relevant and publicly available codes.

2.3.2. Physical Problem
The problem considered in this work is the simulation of an airplane, extensively described in Borrell et al. (2020). The mesh is hybrid and comprises 31.5M elements: 15.4M tetrahedra, 7K pyramids and 16.1M prisms. The algorithm solves incompressible flows, using a low-dissipative Galerkin method (Lehmkuhl et al. 2019), together with the VREMAN LES model for the subgrid scale stress tensor (Vreman 2004). A wall model is used at the airplane surface, implemented as a mixed Dirichlet and Neumann condition.

2.3.3. Numerical Method
A Runge–Kutta scheme of third order is considered as the time discretisation scheme, involving three consecutive assemblies of the residual of the momentum equation, obtained through a loop over the elements and the boundaries of the mesh. Pressure is stabilised through a projection step, which consists of solving a Poisson equation for the pressure (Codina 2001); the velocity is then corrected to satisfy mass conservation. Here, a Deflated Conjugate Gradient (Löhner et al. 2011) with a linelet preconditioner (Soto, Löhner, and Camelli 2003) is used. The solution strategy is illustrated in Algorithm 2.

2.3.4. Parallelisation
Alya implements different levels of parallelisation, including MPI for distributed memory systems, OpenMP or OmpSs for shared memory (Garcia-Gasulla et al. 2019) and GPU support, among others. As far as the parallelisation is concerned, in this work we rely exclusively on MPI, as this is the version used in production runs.


Algorithm 2 Alya: solution strategy for solving the Navier–Stokes equations.

for time steps do
    Compute time step
    for Runge–Kutta steps do
        Assemble residual of momentum equations
        Solve momentum explicitly
    end for
    Solve pressure equation with DCG and linelet preconditioner
    Correct velocity
end for

The MPI parallelisation of Alya relies on Metis to partition the mesh into subdomains. It should be noted that, contrary to classical implementations of the Finite Volume and Finite Difference methods, halo cells or halo nodes are not required to assemble the matrices or residuals in the Finite Element method (Houzeaux et al. 2018). Avoiding duplicated work on these cells provides a certain advantage in the assembly phase for the Finite Element method, where the scalability only depends upon the control of the load balance. Although we have treated the load imbalance due to the different types of elements in previous papers (Garcia-Gasulla et al. 2019; Borrell et al. 2020), here no specific strategy has been employed.

2.4. CHORUS: Structured Mesh Finite Volume

2.4.1. Context
CHORUS is an in-house code developed at the LIMSI-CNRS laboratory. It is a parallel (MPI) DNS/LES solver specially designed to solve unsteady high-Reynolds-number compressible flows in the transonic and supersonic regimes. In these regimes, dealing with flows involving shock waves, one must use a numerical scheme that can both represent small-scale structures with the minimum of numerical dissipation and capture discontinuities with the robustness that is common to Godunov-type methods. To achieve this dual objective, high-order accurate shock-capturing schemes are privileged. A shock-capturing procedure has then been used to avoid the spurious oscillations produced by high-order schemes in the vicinity of discontinuities such as shock waves (Pirozzoli 2011), for instance.

2.4.2. Physical Problem
The problem considered in this work is the simulation of the well-documented 3D Taylor–Green vortex at a Reynolds number of 1600 in a periodic cube (Wang et al. 2013; Brachet et al. 2006). A 3D periodic domain of 2π non-dimensional side length is considered. The initial flow field, solution of the Navier–Stokes equations, consists of an analytic solution of eight planar vortices (Brachet et al. 2006). A Cartesian uniform grid is used with the same spacing in each direction. The number of grid points used in this test case is 25M for the multi-node study and 3M for the intra-node one.

2.4.3. Numerical Method
The Navier–Stokes equations are solved using a high-order finite volume approach on a Cartesian mesh. An operator splitting procedure is employed that splits the resolution into the Euler part and the viscous problem. The Euler part is discretised by means of a high-order one-step Monotonicity-Preserving scheme, namely the OSMP7 scheme (Daru and Tenaud 2004), based on a Lax–Wendroff approach, which ensures seventh-order accuracy in both time and space in the regular regions. A Monotonicity-Preserving (MP) criterion is added to deal with discontinuity capturing by locally relaxing the classic TVD constraints of the scheme. Besides, the discretisation of the diffusive fluxes is obtained by means of a classical second-order centred scheme coupled with a second-order Runge–Kutta time integration. The numerical approach is extensively described by Daru and Tenaud (2004).

2.4.4. Parallelisation
The MPI parallelisation of CHORUS relies on a default Cartesian mesh decomposition. At the subdomain interfaces, four ghost cells are required in the direction normal to the interface.
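As an illustration of such a decomposition, the sketch below uses the standard MPI topology routines; the three-dimensional layout and full periodicity are assumptions matching the periodic Taylor–Green cube, not necessarily CHORUS's actual set-up.

#include <mpi.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int nprocs = 0, rank = 0;
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Let MPI factorise the processes into a 3D grid of subdomains.
    int dims[3] = {0, 0, 0};
    MPI_Dims_create(nprocs, 3, dims);

    // Periodic in every direction, as in the periodic cube test case.
    int periods[3] = {1, 1, 1};
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &cart);

    // Neighbour ranks in each direction; the four ghost cells at every
    // subdomain interface would be exchanged with these ranks.
    int src[3], dst[3];
    for (int d = 0; d < 3; ++d)
        MPI_Cart_shift(cart, d, 1, &src[d], &dst[d]);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}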


3. Methodology

The European Center of Excellence Performance Optimisation and Productivity9 (POP) defined a set of metrics (Wagner et al. 2017) for modelling the performance of parallel applications. We apply this performance analysis method to the CFD applications introduced in Section 2. In the rest of the document, we refer to the efficiency metrics used for our performance analysis as POP metrics or efficiency metrics. The POP metrics express the efficiency of the MPI parallelisation and can be computed for any MPI application. They consist of indicators that serve as a guide for the subsequent detailed analysis to spot the exact factors limiting the scalability.

For the performance analysis of each application, we apply the following steps:

Tracing – Run the application with a relevant input and core count and obtain a trace from this run.

Selection of the focus of analysis (FoA) – Based on the trace, select the region on which to focus the performance analysis. To select the FoA, we first identify the iterative pattern within the trace and then select a couple of representative iterations, disregarding the initialisation and finalisation phases.

Single-node analysis – Compute the efficiency metrics for different numbers of MPI processes in a single node. Based on these metrics, we study the limiting factors and performance issues when scaling within a compute node. When the input set does not allow running on a single node, we do the study using a constant number of MPI processes and increase the number of MPI processes spawned per node; this means that we use more nodes, which are not fully occupied.

Multi-node analysis – Compute the efficiency metrics when using more than one compute node. On MareNostrum4, we study from 1 to 16 nodes (corresponding to 48–768 MPI processes). Note that for CHORUS we extend our study from 8 to 32 compute nodes. As for the single-node analysis, we leverage the efficiency metrics to spot the issues that limit the scalability of the code on multiple nodes.

We focus on analysing the parallel efficiency of the MPI parallelisation, so no hybrid distributed+shared memory techniques are analysed here. Also, the study is based on traces obtained from real runs. Traces contain information about all the processes involved in an execution, and they can include, among others, information on MPI or I/O activity as well as hardware counters.

All the studies performed in this paper use strong scalability, but the methodology can be applied as well to weak-scaling codes.

3.1. Metrics for Performance Analysis

For the definition of the POP metrics needed in the rest of the paper, we use the simplified model of execution depicted in Figure 2.

We call $P = \{p_1, \ldots, p_n\}$ the set of MPI processes. We also assume that each process can only be in two states over the execution time: the state in which it is performing computation, called useful (blue), and the state in which it is not performing computation, e.g. communicating with other processes, called not-useful (red). For each MPI process $p$, we can therefore define the set $U_p = \{u_{p_1}, u_{p_2}, \ldots, u_{p_{|U|}}\}$ of the time intervals where the application is performing useful computation (the set of the blue intervals). We define the sum of the durations of all useful time intervals in a process $p$ as:

$$D_{U_p} = \sum U_p = \sum_{j=1}^{|U_p|} u_{p_j}. \qquad (1)$$

Similarly, we can define $\bar{U}_p$ and $D_{\bar{U}_p}$ for the not-useful intervals.

The Load Balance measures the efficiency loss due to different load (useful computation) on each process, globally for the whole FoA.

Figure 2. Example of parallel execution of three MPI processes.


Thus, it represents a meaningful metric to characterise the performance of a parallel code. We define $L_n$, the Load Balance among $n$ MPI processes, as:

$$L_n = \frac{\operatorname{avg}_{|P|}\left(\sum U_p\right)}{\max_{|P|}\left(\sum U_p\right)} = \frac{\sum_{i=1}^{n} D_{U_i}}{n \cdot \max_{i=1}^{n} D_{U_i}}$$
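As a minimal sketch (assuming the per-process useful durations $D_{U_p}$ have already been extracted from the trace), the Load Balance reduces to the ratio of the average to the maximum useful time:

#include <algorithm>
#include <numeric>
#include <vector>

// Load Balance L_n = avg(DU_p) / max(DU_p), following the formula above.
// 'usefulTime' holds DU_p for each of the n MPI processes.
double loadBalance(const std::vector<double>& usefulTime)
{
    const double sum = std::accumulate(usefulTime.begin(), usefulTime.end(), 0.0);
    const double avg = sum / static_cast<double>(usefulTime.size());
    const double mx  = *std::max_element(usefulTime.begin(), usefulTime.end());
    return avg / mx;  // multiply by 100 to report it as a percentage
}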

The Transfer efficiency measures inefficiencies due to transferring data with MPI communications. In order to compute the Transfer efficiency, we compare a trace of a real execution with the same trace processed by the Dimemas simulator (see Section 3.2 for more details), assuming all communication has been performed on an ideal network (i.e. with zero latency and infinite bandwidth). We define $T_n$, the Transfer efficiency among $n$ MPI processes, as:

$$T_n = \frac{\max_{|P|} t'_p}{\max_{|P|} t_p}$$

where $t_p$ is the real runtime of the process $p$, while $t'_p$ is the runtime of the process $p$ when running on an ideal network.

The Serialisation Efficiency measures inefficiency due to idle time within communications (i.e. time where no data is transferred) and is expressed as:

$$S_n = \frac{\max_{|P|} D_{U_p}}{\max_{|P|} t'_p}$$

where $D_{U_p}$ is computed as in Equation (1) and $t'_p$ is the runtime of the process $p$ when running on an ideal network.

The Communication Efficiency is the product of the Transfer efficiency and the Serialisation efficiency: $C_n = T_n \cdot S_n$. Combining the Load Balance and the Communication efficiency, we obtain the Parallel efficiency for a run with $n$ MPI processes: $P_n = L_n \cdot C_n$. Its value reveals the inefficiency in splitting computation over processes and then communicating data between processes. A good value of $P_n$ (i) ensures an even distribution of computational work across processes and (ii) minimises the time spent in communicating data among processes. Once the Parallel efficiency is defined, the remaining possible sources of inefficiency can only come from the blue part of Figure 2, i.e. from the useful computation performed within the parallel application when changing the number of MPI processes. We call this Computation Scalability and we define it, in the case of strong scalability, as:

$$U_n = \frac{\sum_{P_0} D_{U_p}}{\sum_{P} D_{U_p}}$$

where $\sum_{P_0} D_{U_p}$ is the sum of all useful time intervals of all processes when running with $P_0$ MPI processes and $\sum_{P} D_{U_p}$ is the sum of all useful time intervals of all processes when running with $P$ MPI processes, with $P_0 < P$.

We know that the execution time $t$ of a single program can be computed as:

$$t = \frac{i}{f \rho}$$

where $i$ is the number of instructions executed by the program, $f$ is the frequency of execution of the instructions of the program and $\rho$ is the IPC. This way, we can divide the study of the Computation Scalability into three independent factors, all related to the useful computation (blue part of Figure 2): the total number of instructions executed ($i$), the operational frequency of the program ($f$), and the IPC ($\rho$). The reader should remember that on a multi-process machine, the frequency of execution of the program can differ from the frequency of the CPU.

We compute the instruction scalability $\sigma_i$ as the ratio of $i$ when running with $P_0$ MPI processes (taken as baseline) to the value of $i$ when running with $P_n$ MPI processes, varying $P_n$ in $\{4, 8, 16, 24, 48, 96, 192, 384, 768\}$:

$$\text{Instruction scalability:} \quad \sigma_i = \frac{i_{P_0}}{i_{P_n}}.$$

Similarly, we compute the scalability of the IPC, $\sigma_\rho$, and of the frequency, $\sigma_f$, as the ratio of $\rho$ and $f$ when running with $P_n$ MPI processes to the value of $\rho$ and $f$ when running with $P_0$ MPI processes (taken as baseline), varying $P_n$ in $\{4, 8, 16, 24, 48, 96, 192, 384, 768\}$:

$$\text{IPC scalability:} \quad \sigma_\rho = \frac{\rho_{P_n}}{\rho_{P_0}} \quad \text{with} \quad \rho_{P_0} = \frac{\sum_{P_0} i_0}{\sum_{P_0} c_0}, \qquad \rho_{P_n} = \frac{\sum_{P_n} i_n}{\sum_{P_n} c_n}$$

$$\text{Frequency scalability:} \quad \sigma_f = \frac{f_{P_n}}{f_{P_0}}$$

For a homogeneous representation, we report the scalability values $\sigma$ as percentages, so they can be compared with the Parallel efficiency values.


Finally, we can combine the efficiency metrics introduced so far into the Global Efficiency, defined as $G_n = P_n \cdot U_n$.
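A minimal sketch of how these metrics compose (assuming the individual efficiencies have already been computed from the trace as fractions in [0, 1]; the struct and field names are illustrative only):

// Illustrative container for the POP metrics of one run, as fractions in [0, 1].
struct PopMetrics
{
    double loadBalance;     // L_n
    double transfer;        // T_n
    double serialisation;   // S_n
    double computationScal; // U_n
};

// Derived metrics following the definitions above:
// C_n = T_n * S_n, P_n = L_n * C_n, G_n = P_n * U_n.
double globalEfficiency(const PopMetrics& m)
{
    const double communication = m.transfer * m.serialisation;  // C_n
    const double parallel      = m.loadBalance * communication; // P_n
    return parallel * m.computationScal;                        // G_n
}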

3.2. Performance Tools

For the analysis performed in this paper, we adopt a set of tools developed within the Barcelona Supercomputing Center.

Extrae is a tracing tool developed at BSC (Servat et al. 2013). It collects information such as PAPI counters and MPI and OpenMP calls during the execution of an application. The extracted data are stored in a trace that can be visualised with Paraver. Since the overhead introduced by the instrumentation can affect the overall performance, we measure the overhead by comparing the execution time when running with and without instrumentation. For each run, we report the value of the overhead as a percentage of the execution time without instrumentation.

Paraver takes the traces generated with Extrae and provides a visual interface to analyse them (Pillet et al. 1995). The traces can be displayed as timelines of the execution, but the tool also allows us to perform more complex statistical analyses.

Dimemas accepts as input an Extrae trace gathered from an execution on a real HPC cluster. The output of Dimemas is a new trace reconstructing the time behaviour of the parallel application as if it were run on a machine modelled by a set of performance parameters (also provided as input). For this paper, we employed Dimemas to simulate the behaviour of the CFD codes under study when running on a parallel system with an ideal network, i.e. zero latency and infinite bandwidth.

Clustering is a tool that, based on an Extrae trace, can identify regions of the trace with the same computational behaviour. This classification is done based on hardware counters defined by the user. Clustering is useful for detecting iterative patterns and regions of code that appear several times during the execution.

Tracking helps to visualise the evolution of the data clusters across multiple runs when using different amounts of resources and exploiting different levels of parallelism.

Combining the metrics defined in Section 3.1 and the tools just introduced, we are able to summarise the health of a parallel application running on different numbers of MPI processes in a heat-map table. Rows indicate the different POP metrics, while columns indicate the number of MPI processes of a given run. Each cell is colour-coded on a scale from green, for values close to 100% efficiency; through yellow, for values between 80% and 100% efficiency; down to red, for values below 80% efficiency.
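For illustration, the colour coding amounts to a simple threshold rule; a sketch is given below (the exact boundary for 'close to 100%' is an assumption, here taken as 95%):

#include <string>

// Classify an efficiency value (in %) for the heat-map tables:
// green for values close to 100%, yellow between 80% and 100%, red below 80%.
std::string efficiencyColour(double efficiencyPercent)
{
    if (efficiencyPercent >= 95.0) return "green";   // assumed threshold
    if (efficiencyPercent >= 80.0) return "yellow";
    return "red";
}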

4. Performance Analysis of CFD Methods

In this section, we apply the methods introduced in Section 3 to each of the CFD codes. For each application, we first briefly introduce the structure and then identify the FoA, visualising it on an execution timeline. Since all codes are iterative, the FoA region is a set of three to five iterations.

Within the FoA region, we then collect and present the POP metrics in two scenarios:

Single-node study, i.e. when changing the number of processes assigned to each node, up to filling a node with a number of MPI processes equal to the number of cores.

Multi-node study, i.e. when scaling the application up to 16 (768 cores) or 32 (1536 cores) full compute nodes of MareNostrum4.

We then study the spots where the POP efficiency metrics drop, with the goal of predicting which fundamental factors will harm the performance of each application when scaling up the number of MPI processes. For each application, we identify critical points that can be related to hardware or software configurations as well as to parallelisation practices.

4.1. OpenFOAM: Unstructured Mesh Finite Volume

Figure 3 shows a timeline, visualised with Paraver, of an execution of OpenFOAM on one node of MareNostrum4 (using 48 MPI processes). The x-axis represents time, while each value on the y-axis corresponds to one MPI process. The colour in this case represents the MPI call being executed at a given instant of time by a process. The white colour means that the process is performing useful computation.

In the top timeline, we can clearly identify the initialisation phase and the iterative pattern.


Figure 3. Timeline of OpenFOAM.

The Focus of Analysis (FoA) is the three time steps shown in the bottom timeline. The analysis presented in the following sections is performed for this region.

4.1.1. Single-node Analysis
In Table 2, we can see the POP efficiency metrics obtained for OpenFOAM when using from 2 to 48 MPI processes within a computational node of MareNostrum4. We observe that the main factor limiting the scalability within a node is the IPC scalability, reaching values around 60% with 24 and 48 MPI processes.

Since we perform a strong scalability study within a node, we can reasonably expect that increasing the number of processes increases the pressure on the memory subsystem.

In Figure 4, we see a matrix of histograms obtained with Paraver from the OpenFOAM traces. Each of these histograms depicts one horizontal line per MPI process. On the x-axis, the corresponding metric is represented (i.e. number of misses in the L2 cache per 1000 instructions, number of misses in the L3 cache per 1000 instructions, and instructions per cycle). The colour represents the percentage of useful time for which a given metric assumed each value in each process. The colour code is a gradient going from low values (painted as light green) to high values (painted as dark blue).

This matrix of histograms shows that the number of L2 misses is constant among the different processes of the same run and is not affected by the increase in the number of MPI processes per node. We can deduce this from the blue lines in the first column of histograms (L2 MPKI): they are vertical within the same histogram, meaning that all processes spend a similar percentage of time at the same value of L2 MPKI.

Table 2. Single-node efficiency metrics of OpenFOAM.


Figure 4. OpenFOAM: Histograms for L2 MPKI, L3 MPKI and IPC when increasing number of MPI processes per node.

Also, all other histograms of the same column show a similar distribution of blue lines, meaning that increasing the number of MPI processes does not affect the percentage of time spent at each value of L2 MPKI.

In a similar way, we can observe that the number of L3 misses does not present a relevant variability, as the vertical blue lines follow the same pattern for all histograms in the second column (L3 MPKI).

Finally, as highlighted by the efficiency metrics of Table 2, the IPC presents a sudden drop when going from 12 to 24 MPI processes per node. We can recognise this in Figure 4 because the vertical lines in the third column of histograms (IPC) move to the left (lower values of IPC) when going from 12 to 24 MPI processes per node.

With this, we demonstrate that the drop in IPC is not due to a higher number of misses in the L2 or L3 caches. At this point, the saturation of memory bandwidth seems to be the most plausible reason for the IPC drop. In order to verify this, we then studied the density of misses over time, i.e. the total number of L3 misses per microsecond issued by all processes.
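The density metric itself is a simple aggregation; a sketch of its computation is shown below (assuming the per-process L3 miss counts and the elapsed useful time in microseconds have been extracted from the trace, with 24 processes per socket as on MareNostrum4):

#include <vector>

// L3 misses per microsecond per socket: sum the misses of the processes that
// share a socket and divide by the elapsed time of the analysed region.
// 'missesPerProcess' and 'elapsedMicroseconds' are hypothetical inputs.
std::vector<double> l3MissDensityPerSocket(
    const std::vector<long long>& missesPerProcess,
    double elapsedMicroseconds,
    std::size_t processesPerSocket = 24)
{
    std::vector<double> density;
    for (std::size_t i = 0; i < missesPerProcess.size(); i += processesPerSocket)
    {
        long long socketMisses = 0;
        for (std::size_t j = i;
             j < i + processesPerSocket && j < missesPerProcess.size(); ++j)
            socketMisses += missesPerProcess[j];
        density.push_back(static_cast<double>(socketMisses) / elapsedMicroseconds);
    }
    return density;
}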

In Figure 5, we show the number of misses per microsecond for the L3 cache (x-axis) when increasing the number of MPI processes (y-axis). Since the L3 is shared among all cores of a socket, this corresponds to the number of requests to main memory issued per microsecond by each socket.


Figure 5. OpenFOAM: L3 Misses per microsecond.

We can see how the pressure on memory increases as the number of MPI processes increases.

We conclude that the main factor limiting the scalability of OpenFOAM within a computational node is the amount of L3 cache misses. This limitation could be mitigated with a data layout that ensures better data locality for the different cache levels.

4.1.2. Multi-node Analysis
In Table 3, we can see the efficiencies obtained by OpenFOAM when using multiple nodes, from 1 to 16 (48 to 768 MPI processes). We observe that the main factor limiting the scalability is the transfer efficiency, followed closely by the serialisation.

An important observation for the multi-node efficiency is related to the instrumentation overhead, i.e. the interference that the instrumentation tool (Extrae) adds to the application while gathering the figures to be stored in a trace. If we look at the instrumentation overhead, we notice that for 384 MPI processes the runtime with instrumentation is ∼17% longer than without instrumentation, and with 768 MPI processes this overhead grows to 36%. These overhead values tell us that absolute measurements in these two cases should be discarded. Nevertheless, relative figures can be considered reliable and, most importantly, the trend of the efficiency metrics maintains its validity. We still consider that the main limiting factor for the performance of OpenFOAM will be the transfer, because even if we ignore the columns for 384 and 768, the efficiency values with 48, 96 and 192 MPI processes suggest that, when increasing the number of nodes, the limiting factor will be the transfer efficiency.

As we explained in Section 3.1, the Transfer efficiency accounts for time spent inside MPI communication that is not due to synchronisation between processes. We split the transfer time per class of MPI call to identify what kind of calls are the ones limiting the scalability. The classes of calls are: P2P sync, including MPI_Send, MPI_Recv and MPI_Sendrecv; P2P async, including MPI_Isend, MPI_Irecv and MPI_Waitall; and Allreduce, including MPI_Allreduce.

Table 3. Multi-node efficiency metrics of OpenFOAM.


Figure 6. OpenFOAM: Transfer time per MPI primitive.

Figure 7. OpenFOAM: Bytes exchanged per MPI primitive.

Figure 8. OpenFOAM: Bytes exchanged by each process per MPI primitive.

In Figure 6, we can see the total time spent in transfer per class of MPI call when increasing the number of MPI processes. We observe that the asynchronous point-to-point MPI calls are the ones that account for the most transfer time.

Figures 7 and 8 show the number of bytes exchanged, in total and per process. It is clear that the total amount of data being exchanged decreases. Therefore, the problem does not seem to be related to network bandwidth.

In Figure 9, we can see the number of MPI calls of each type executed per process. We observe that the number of asynchronous point-to-point MPI calls performed per process increases with the number of MPI processes.

Figure 9. Number of calls by each process per MPI primitive.

This observation, together with the previous charts, demonstrates that the transfer efficiency lost by OpenFOAM is due to the overhead introduced by calling the asynchronous point-to-point MPI calls a high number of times to exchange a low number of bytes. The suggestion for the developers to address this issue would be to try grouping several point-to-point MPI calls into a single one to reduce the number of calls. If the algorithm does not allow this, the focus must be put on the partition of the problem and the number of neighbours of each process. Reducing the number of neighbours would in fact reduce the number of point-to-point communications.
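A sketch of what such grouping could look like is given below: instead of one send per small field fragment, the fragments destined for the same neighbour are packed into a single buffer and sent with one nonblocking call, amortising the per-call overhead. The buffer and fragment names are hypothetical.

#include <mpi.h>
#include <vector>

// Pack all the small fragments destined for the same neighbour into one
// buffer and issue a single nonblocking send instead of many small ones.
// 'packed' must stay alive until the returned request completes.
void sendPacked(const std::vector<std::vector<double>>& fragments,
                int neighbour, MPI_Comm comm,
                std::vector<double>& packed, MPI_Request* request)
{
    packed.clear();
    for (const std::vector<double>& f : fragments)
        packed.insert(packed.end(), f.begin(), f.end());

    MPI_Isend(packed.data(), static_cast<int>(packed.size()), MPI_DOUBLE,
              neighbour, 0, comm, request);
}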

4.2. Alya: Unstructured Mesh Finite Element

In Figure 10, we report a timeline of Alya running on one node of MareNostrum4 using 48 MPI processes. We identify the initialisation and finalisation phases, as well as the iterative phase in between, which represents the most time-consuming and compute-intensive part of large Alya simulations. In the following sections, we study the FoA shown in the timeline at the bottom of Figure 10, including four time steps of the iterative phase of Alya.

4.2.1. Single-node Analysis
In Table 4, we report the POP efficiency metrics when running Alya on a single node of MareNostrum4. The header of each column expresses the number of processes per node. The reader should remember that we run Alya with a fixed number of processes (48), spawning an increasing number of processes per node in each test. So, the column with the header '2' reports the efficiencies of Alya running with 48 MPI processes, spawning 2 processes per node on 24 compute nodes of MareNostrum4.


Figure 10. Timeline of Alya.

Table 4. Single-node efficiency metrics of Alya.

We see that the load balance presents a low value that remains constant across the different runs. This behaviour is expected because all runs refer to the same partition of the problem into 48 MPI ranks. The load balance problem re-appears in the multi-node study, and we analyse it in more detail in the next section, where we can see how it changes when increasing the number of MPI processes (partitions of the problem).

As often happens, Table 4 shows that, after the load balance, the main limiting factor when increasing the number of processes per node is the drop in IPC.

To study this phenomenon, we use the Clustering tool introduced in Section 3.2. We clusterise each execution trace of Alya using the IPC and the number of instructions. We identify three main computational clusters corresponding to three well-known parts of the code: Residual Assembly (RAss), Timestep Computation (TsComp) and Algebraic Solver (Solver).

In Figure 11, we highlight with different colours the clusters identified by the Clustering tool over a timeline. The blue region corresponds to the different Runge–Kutta iterations introduced in Algorithm 2.

Figure 11. Alya: timeline of three iterations highlighting the clusters.


Figure 12. Alya: cluster duration with 48 MPI ranks with different processes-per-node distributions.

We characterise the clusters using their duration and plot them in Figure 12. We notice that the Solver is responsible for the performance degradation when increasing the number of processes per node.

The IPC drop could be generated by the saturation of resources (e.g. memory bandwidth). For this reason, we study the density of L3 cache misses per microsecond per socket, as we did for OpenFOAM in Figure 5, this time performing the study on each cluster.

Figure 13. Alya: study of the density of L3 data cache misses per microsecond per socket.

This confirms that the Solver phase is the root cause of the IPC drop. In Figure 13, we report the number of L3 cache misses per microsecond per socket (y-axis) when changing the number of processes per node (x-axis). We notice that, while the Residual Assembly phase and the Timestep Computation phase have a steady low L3 miss density, the Solver shows an increasing density of cache misses, corresponding to lower values of IPC when increasing the number of processes per node.

We conclude that the IPC drop within the node is due to memory resource saturation when increasing the number of MPI processes per node. We suggest finding a better data layout that would allow better cache data reuse.

4.2.2. Multi-node Analysis
In Table 5, we can find the efficiencies obtained by Alya when using from 48 to 768 MPI processes (1–16 nodes). We can see that the main factor limiting scalability is the load balance.

In Table 6, we can see the load balance in useful time, as computed by the POP efficiency metrics, and the load balance in number of instructions. We can conclude that the load imbalance of Alya comes from the partition of the problem, because some processes execute more instructions than others. The example considered for the present analysis is a full airplane simulation. The mesh is hybrid, composed of prisms in the boundary layer region, tetrahedra in the core flow and pyramids in the transition region. For this study, we have partitioned the mesh disregarding the type of elements, whose cost for assembling the residuals differs. This relative cost is responsible for the load imbalance.

Table 5. Multi-node efficiency metrics of Alya.


Table 6. Alya: Load and Instruction Balance when increasing the number of processes.

Number of processes      48       96       192      384      768
Load Balance             84.51%   83.63%   83.09%   80.19%   70.95%
Instruction Balance      83.10%   82.63%   82.82%   79.83%   71.03%

Figure 14. Alya: Transfer time per MPI primitive.

This issue can be addressed by using a better heuristic to partition the problem or by using a dynamic load balancing mechanism, as has been proven useful in other works (Garcia-Gasulla et al. 2019; Garcia, Labarta, and Corbalan 2014).

In the case of Alya, we study a second factor limiting the scalability. Looking at Table 5, we can see that, after the Load Balance, the main problems are serialisation and transfer. As serialisation problems usually come from load imbalance in different phases with different patterns, and we have already addressed the load balance problem, we study the transfer efficiency in Alya.

As we did for OpenFOAM, we analyse the transfer time spent by the different classes of MPI calls. The classes of calls are: P2P sync, including MPI_Send, MPI_Recv and MPI_Sendrecv; P2P async, including MPI_Isend, MPI_Irecv and MPI_Waitall; and Allreduce, including MPI_Allreduce. In Figure 14, we report the time spent communicating data for the different types of MPI calls. We observe that the transfer time increases steadily for the asynchronous point-to-point MPI calls (P2P async). It also increases for the synchronous calls, but stops increasing after 384 MPI processes; the amount of time spent in the Allreduce call also increases drastically at 768 MPI processes.

In Figures 15 and 16, we can see the number of bytes exchanged per MPI call and per MPI process, respectively. We can see that the number of bytes exchanged per synchronous and asynchronous MPI call decreases drastically with the number of MPI processes.

Figure 15. Bytes exchanged per MPI primitive.

Figure 16. Bytes exchanged by each process per MPI primitive.

If we look at the number of bytes exchanged per process, we observe that it also decreases for both kinds of calls as we increase the number of MPI processes.

In Figure 17, we show the number of calls of each type performed by each process. We observe that the number of asynchronous point-to-point calls is very high and increases with the number of MPI processes. Therefore, the transfer time spent in the asynchronous point-to-point MPI calls is due to the high number of calls and the overhead associated with them. The suggestion to address this issue is to refactor, if possible, the communications to group them and use a single call to send several data items.

Figure 17. Number of calls by each process per MPI primitive.


Figure 18. Timeline of CHORUS.

It can happen that this is not possible algorithmically; in that case, the suggestion is to look at the partition of the problem, because the high number of point-to-point communications is a symptom of having too many neighbours. In fact, a Hilbert space-filling curve has been used to partition the mesh, and such a strategy does not allow controlling the number of neighbours or the size of the interfaces.

4.3. CHORUS: Structured Mesh Finite Volume

In Figure 18, we can see a timeline of a run of CHORUS on one node of MareNostrum4 using 48 MPI processes. We can identify the initialisation, iterative and finalisation phases. In the following sections, we use the FoA shown in the timeline at the bottom of the image, including four time steps.

4.3.1. Single-node Analysis
In Table 7, we can see the efficiencies obtained by CHORUS when increasing the number of MPI processes within a computational node.

As for Alya, the study of CHORUS cannot be done on a single node because of the memory requirements of the input set. For this reason, we fix the number of MPI processes to 48 and increase the number of processes spawned per node. In Table 7, the label of the columns does not refer to the total number of MPI processes but to the number of MPI processes per node. For example, the first column, with a '2' label, means that we have executed 48 MPI processes using 24 nodes and spawning 2 MPI processes per node.

The input set has been reduced with respect to the one used in the multi-node study, to be able to fit in fewer nodes.

We can observe in Table 7 that the main factor limiting the scalability within a node is the IPC scalability.

In Figure 19, we can find a matrix of histograms representing different metrics (L1 MPKI, L2 MPKI, L3 MPKI and IPC) at different numbers of MPI processes.

Table 7. Single-node efficiency metrics of CHORUS.


Figure 19. CHORUS: Histograms for L1 MPKI, L2 MPKI, L3 MPKI and IPC when increasing number of MPI processes per node.

The colour code represents the percentage of time in which the given metric assumed the value given on the x-axis. Looking at the first column (L1 MPKI), we can observe that the number of L1 misses per thousand instructions remains constant when increasing the number of MPI processes. This can be seen from the vertical blue lines, which are equal for the different numbers of MPI processes (rows). The second column (L2 MPKI) leads us to the same conclusion about the number of L2 misses.

When we look at the third column (L3 MPKI), we can see that the number of L3 misses increases as we increase the number of MPI processes per node. The blue vertical line in fact moves to the right (corresponding to higher values of MPKI). This is the effect of fixing the number of MPI processes while decreasing the number of nodes to study the scalability within a node. As the L3 is a resource shared per socket, when we run 2 MPI processes per node we are using 24 nodes (48 sockets), but when we run 4 MPI processes per node we are using 12 nodes (24 sockets). Therefore, the total L3 capacity is reduced. The increase in L3 misses is due to the replacement of data done by other processes in the same node.

Although the number of L3 misses increases with the number of MPI processes per node, the IPC (rightmost column of Figure 19) is not affected significantly: we observe in fact similar values of IPC from 2 to 12 MPI processes. Nevertheless, the IPC decreases from 12 to 24 MPI processes and from 24 to 48 MPI processes. As with OpenFOAM, this seems to be a problem of memory bandwidth, and we studied this phenomenon by measuring the L3 miss density.

In Figure 20, we report on the y-axis the number of L3 misses per socket while changing the number of processes per node. As for OpenFOAM in Figure 5, we can notice that the increasing density of L3 misses per socket corresponds to lower and lower values of IPC.

Figure 20. CHORUS: L3 Misses per microsecond.


This allows us to conclude that more processes on the same node competing for the same memory resources make the overall IPC efficiency decrease.

4.3.2. Multi-node Analysis
In Table 8, we find the efficiencies obtained by CHORUS when running on 8–32 nodes (384–1536 MPI processes). We can see that the main factor limiting scalability is the instruction scalability. This metric points out that there is code that is replicated or not parallelised. In order to identify the region of the code affected by this metric, we perform a clustering of the trace based on the number of instructions completed and the IPC. The clustering process is done as explained for Alya.

In Figure 21, we can see a timeline of three time steps of CHORUS with 384 MPI processes, visualising the clusters detected by the clustering tool. We can see that the clustering tool has been able to detect the different phases of the application. The most relevant cluster in duration is Cluster 2, followed by Cluster 1. We observe that these two clusters alternate their position within the time step.

In Figure 22, we plot the total number of instructions executed in each cluster when increasing the number of MPI processes.

Figure 22. CHORUS: Instructions executed per cluster.

We can see that all the clusters increase the number of instructions executed when increasing the number of MPI processes; Clusters 1 and 2 seem to be the ones that increase the most.

In Figure 23, we see the same numbers plotted as efficiencies. We can observe that the trend is similar for all the plots, with Cluster 4 being the least affected and Cluster 1 the most affected.

Table 8. Multi-node efficiency metrics of CHORUS.

Figure 21. CHORUS: Timeline showing clusters of CHORUS.


Figure 23. CHORUS: Instruction efficiency per cluster.

We can conclude that the whole code is affected by the instruction scalability in a similar way; we cannot identify a single cluster responsible for the low value of instruction scalability. Therefore, the recommendation to address this issue is to review the parallelisation approach to detect replication of code across the different MPI processes.

5. Conclusions and Discussion

5.1. About the Method

In this paper, we utilised a general performance analysis method allowing us to highlight inefficiencies that can result in performance degradation when increasing the parallelism of HPC applications. We applied the method to three complex CFD codes representing two numerical methods, finite volume and finite element, on both structured and unstructured meshes. Despite the diversity of the codes, we show that with a single method we are able to spot inefficiencies that can be common to different applications. An example is the IPC drop when scaling within a compute node of MareNostrum4: for all the applications that we analysed, we verified that the problem is caused by the saturation of memory resources. Also, the paper shows that our method can identify limitations that are specific to the applications, thus producing insightful feedback for CFD code developers.

Besides the effectiveness of the method, we also reveal that it heavily depends on a set of tools that can have limitations. The Clustering tool, for example, while it has been extremely useful for the study of Alya, was inapplicable to the FoA of OpenFOAM. In this case, we had to find alternative paths to gather insights.

Also, we measured the instrumentation overhead of our tracing tool and we discovered that it sometimes heavily interferes with the measurements. While this is in the nature of the method, and the user needs to be aware of it and able to interpret the measurements provided by the tools, we show that the results can nonetheless lead to insightful conclusions. An example of this is the study of the multi-node efficiencies of OpenFOAM.

5.2. About CFD Applications

The analysis of the scalability within a computational node of the three applications leads to the same conclusion: when using a high number of MPI processes per node (i.e. above 12 or 24), the IPC drops because of the amount of data requests that are issued to memory. Our suggestion to application developers is to improve the pattern of data access and the data layout to maximise the reuse of data in the different cache levels.

When analysing the multi-node scalability of OpenFOAM and Alya, we concluded for both codes that their main limitation when scaling to a high number of cores will be the number of asynchronous point-to-point MPI calls. These calls seem innocuous to the programmer because they are asynchronous and allow overlapping communication and computation, but in reality the overhead of calling MPI cannot be overlapped and is the bottleneck that prevents these applications from scaling. To address this issue, we suggest reducing the number of point-to-point calls, either by grouping them if the algorithm allows it or by reducing the number of neighbours in the partition.

Alya presents a load balance problem (due to the hybrid mesh used in this analysis) that we suggest addressing with a better partition or a dynamic load balancing mechanism. In this case, the suggestion of preparing a better partition of the problem clashes with the previous one, where we suggest improving the partition to reduce the number of neighbours. These two approaches are usually conflicting, because optimising the number of neighbours can lead to an imbalanced workload. For this reason, we would emphasise the use of dynamic mechanisms to attack the load balance issue.

In the case of CHORUS, we detect that the main issue affecting the efficiency is the instruction scalability. This is a clear example of Amdahl's law: although the parallelisation is very good (90% efficiency when doubling the number of cores), the part of the code that is not parallelised ends up being the bottleneck.
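For completeness, the bound behind this observation can be written explicitly; the 95% parallel fraction used in the example below is only an illustrative number, not a value measured for CHORUS.

```latex
% Amdahl's law: with a parallelisable fraction p of the execution run on N
% cores, the attainable speedup is
\[
  S(N) = \frac{1}{(1 - p) + p/N},
  \qquad
  \lim_{N \to \infty} S(N) = \frac{1}{1 - p}.
\]
% Illustrative example (not a measured value): for p = 0.95 on a 48-core
% MareNostrum4 node, S(48) \approx 14.3, and no core count can push the
% speedup beyond 20; the non-parallelised 5% becomes the bottleneck.
```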

Notes

1. https://cfd.direct/
2. https://www.openfoam.com/
3. https://sourceforge.net/projects/foam-extend
4. http://www.aer.mw.tum.de/en/research-groups/automotive/drivaer/
5. https://openfoam.com/documentation/user-guide/userch1.php#x3-20001
6. https://www.openfoam.com/documentation/user-guide/running-applications-parallel.php
7. https://www.top500.org/lists
8. https://www.openfoam.com/governance/technical-committees.php
9. https://pop-coe.eu

Acknowledgments

This work is partially supported by the Spanish Government through Programa Severo Ochoa (SEV-2015-0493), by the Spanish Ministry of Science and Technology (TIN2015-65316-P), by the Generalitat de Catalunya (2017-SGR-1414), and by the European POP CoE (GA n. 824080).

Disclosure Statement

No potential conflict of interest was reported by the author(s).


ORCID

Marta Garcia-Gasulla http://orcid.org/0000-0003-3682-9905
Fabio Banchelli http://orcid.org/0000-0001-9809-0857
Kilian Peiro http://orcid.org/0000-0002-9791-2166
Guillaume Houzeaux http://orcid.org/0000-0002-2592-1426
Christian Tenaud http://orcid.org/0000-0002-2024-485X
Filippo Mantovani http://orcid.org/0000-0003-3559-4825
