TRANSCRIPT
www.esi-group.com
Copyright © ESI Group, 2019. All rights reserved.
HPC Benchmark Project , Ivan Spisso , CINECA
HPC Benchmark Project (part I)
Ivan Spisso, [email protected]
S. Bnà, [email protected]
Giorgio Amati, [email protected]
G. Rossi, [email protected]
Fabrizio Magugliani, [email protected]
Outline of the presentation
First Part: OpenFOAM HPC Benchmark Project
• HPC performance of OpenFOAM (3 slides)
• OpenFOAM HPC Technical Committee (TC)
• Code repository for the HPC TC, https://develop.openfoam.com/committees/hpc
• Initial Benchmark test-case: 3-D Lid Driven Cavity
  • Set-up
  • Metrics/KPIs (Key Performance Indicators) to measure performance
  • Results
• Further Benchmark test-cases (with other TCs): ask the Community
OpenFOAM: algorithmic overview and Pstream
● OpenFOAM is first and foremost a C++ library used to solve systems of Partial Differential Equations (PDEs) in discretized form. OpenFOAM stands for Open Field Operation And Manipulation.
● The engine of OpenFOAM is the numerical method, with the following features:
  ○ segregated, iterative solution (PCG/GAMG), unstructured finite-volume method, co-located variables, equation coupling.
● The parallel-computing method used by OpenFOAM is based on standard MPI, using a domain-decomposition strategy (zero-layer domain decomposition).
● A convenient interface, Pstream, is used to plug any MPI library into OpenFOAM. It is a light wrapper around the selected MPI interface.
OpenFOAM: HPC performances
● OpenFOAM scales reasonably well up to an upper limit on the order of thousands of cores [1]. We are aiming at radical scalability for cases of hundreds of millions to billions of cells on 10K-100K cores.
● A custom version by Shimizu Corp., Fujitsu Limited and RIKEN on the K computer (SPARC64 VIIIfx 2.0 GHz, Tofu interconnect) was able to achieve high performance with 100 thousand MPI tasks on a large-scale transient CFD simulation of up to a 100-billion-cell mesh [2]
OpenFOAM HPC bottlenecks towards the exascale: open issues
Up to now, the well-known bottlenecks for enabling OpenFOAM to perform on massively parallel clusters are:
● Scalability of the linear solvers and their limits in the parallelism paradigm.
  ○ In most cases the memory bandwidth is a limiting factor for the solver.
  ○ Additionally, global reductions are frequently required in the solvers.
  The linear-algebra core libraries are the main communication bottleneck for scalability.
● Sparse-matrix storage format: the LDU sparse-matrix storage format used internally does not enable any cache-blocking mechanism (SIMD, vectorization).
● The I/O data storage system: when running in parallel, the data for decomposed fields and mesh(es) has historically been stored in multiple files within separate directories for each processor, which is a bottleneck for big simulations.
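The LDU format mentioned above stores the diagonal plus one off-diagonal coefficient per internal face, addressed through owner/neighbour index lists. A minimal Python sketch of an Amul-style matrix-vector product over this layout (array names are illustrative, not OpenFOAM's actual API) shows the indirect addressing that defeats SIMD vectorization:

```python
import numpy as np

def ldu_amul(diag, lower, upper, l_addr, u_addr, x):
    """y = A @ x for an LDU-stored sparse matrix.

    diag:          one coefficient per cell
    lower/upper:   one coefficient per internal face
    l_addr/u_addr: owner/neighbour cell index of each face
    """
    y = diag * x
    for f in range(len(l_addr)):
        # scattered += through the face-to-cell index lists: faces that
        # share a cell carry a data dependency, blocking vectorization
        y[l_addr[f]] += upper[f] * x[u_addr[f]]
        y[u_addr[f]] += lower[f] * x[l_addr[f]]
    return y
```

The scatter through `l_addr`/`u_addr` is the cache-unfriendly access pattern the slide refers to: unlike a row-wise CSR product, two faces may update the same cell, so the loop cannot be vectorized without a face colouring or a storage-format change.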
OpenFOAM HPC bottlenecks towards the exascale: open issues
Profiling with APS (Application Performance Snapshot) from Intel
● https://software.intel.com/sites/products/snapshots/application-snapshot/
HPC Roofline Model and computational intensity
Technological trend: towards the exascale
● The roofline model: performance bound (y-axis) ordered according to computational intensity.
● Computational intensity: ratio of total floating-point operations to total data movement (bytes), i.e. flop/byte.
● What is the OpenFOAM (CFD/FEM) arithmetic intensity? About 0.1, maybe less…
● TOP500 List - June 2019 (figure): the CFD range is bandwidth-limited, about two orders of magnitude below peak flops; Linpack sits in the dense linear-algebra range, while OpenFOAM is sparse linear algebra (cf. the HPCG benchmark).
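The bound the roofline model assigns to a kernel can be sketched in a few lines; the 141 GB/s figure matches the Galileo node bandwidth quoted later in this deck, while the 1 TFLOP/s peak is only a placeholder, not a measured value:

```python
def attainable_gflops(intensity, peak_gflops, bandwidth_gbs):
    """Roofline bound: the lesser of the compute peak and the
    bandwidth ceiling (arithmetic intensity * memory bandwidth)."""
    return min(peak_gflops, intensity * bandwidth_gbs)

# At OpenFOAM-like intensity (~0.1 flop/byte) the bound is bandwidth-set:
low = attainable_gflops(0.1, 1000.0, 141.0)    # ~14 GFLOP/s
# Linpack-like dense kernels hit the compute peak instead:
high = attainable_gflops(10.0, 1000.0, 141.0)  # 1000 GFLOP/s
```

With these illustrative numbers the sparse kernel sits roughly two orders of magnitude below peak, which is the gap the slide's figure shows between the CFD range and the Linpack range.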
OpenFOAM HPC Technical Committee (TC)
• The Technical Committees cover all the key focus areas for OpenFOAM development; they assess the state of the art, and the need and status for validation, documentation and further development.
• Remits of the HPC TC
  • OpenFOAM recommendations to the Steering Committee in respect of the HPC technical area
  • Work together with the Community to overcome the current HPC bottlenecks of OpenFOAM
    • Scalability of linear solvers
    • Adapt/modify data structures of the sparse linear system to enable vectorization / hybridization
    • Improve memory access on new architectures
    • Improve memory bandwidth
    • Parallel pre- and post-processing, parallel I/O
  • Strong co-design approach
    • Identify algorithm improvements to enhance HPC scalability
    • Interaction with the other Technical Committees (Numerics, Documentation)
• Priorities of the HPC TC:
  • HPC Benchmark
  • GPU enabling of OpenFOAM (see the later presentation by Nvidia's Stan Posey)
  • Parallel I/O: the ADIOS2 I/O library, as a function object, is now a git submodule in the OpenFOAM develop branch
https://www.openfoam.com/governance/technical-committees.php#tm-hpc
HPC Technical Committee: Membership Nominations
Members of the Technical Committee
• Chair: Dr. Ivan Spisso, CINECA
• S. Bnà, HPC Developer, CINECA
• Deputy-Chair: Dr. Mark Olesen, Principal Eng., ESI-OpenCFD (Release and Maintenance)
• Dr. Henrik Rusche, Wikki Ltd. (Release Authority)
• Dr. M. Klemm, Principal Eng., and Dr. G. Rossi, App. Software Eng., Intel (Hardware OEM)
• Dr. Olly Perks / Dr. F. Spiga, Research Eng., ARM (Hardware OEM)
• F. Magugliani, Strategic Planning, E4 (HPC System Integrator)
• Dr. William F. Godoy, Scientific Data Group, Oak Ridge National Lab (HPC Computer Sc.)
• Dr. Pham Van Phuc, Institute of Technology, Shimizu Corporation (HPC/OEM)
• Dr. Stefano Zampini, PETSc Team, KAUST (HPC center / PETSc dev.)
• Mr. Axel Koehler, Sr. Solutions Architect, Nvidia (Hardware OEM)
HPC Technical Committee: First Indications of Tasks
• Commitment
  • Work together with the Community to overcome the current HPC bottlenecks of OpenFOAM.
  • Demonstrate improvements in performance and scalability, to move forward from the current near-petascale to pre- and exascale class performance.
• Remits
  • OpenFOAM recommendations to the Steering Committee in respect of the HPC technical area
  • Work together with the Community to overcome the current HPC bottlenecks of OpenFOAM
    • Scalability of linear solvers
    • Adapt/modify data structures of the sparse linear system to enable vectorization / hybridization
    • Improve memory access on new architectures
    • Improve memory bandwidth
    • Parallel pre- and post-processing, parallel I/O
  • Strong co-design approach
    • Identify algorithm improvements to enhance HPC scalability
    • Interaction with the other Technical Committees (Numerics, Documentation)
• Tasks
  • Test and benchmark third-party linear-algebra solver packages (ESI, Intel, PETSc Team)
  • Test parallel I/O (ADIOS2) (ESI, ORNL)
  • Review HPC benchmarks and establish common HPC benchmarks among different architectures (Intel, ARM, EPCC, E4)
  • Review documentation in the HPC & OpenFOAM technical area
  • …
Code repository for HPC Technical Committee https://develop.openfoam.com/committees/hpc
● Create an open and shared repository with relevant data-sets and information
● Provide a User Guide and initial scripts to set up and run the different data-sets
● Provide the community with a homogeneous term of reference to compare different HW architectures, configurations and SW environments
● Define a common set of Metrics/KPIs (Key Performance Indicators) to measure performance
● Data-sets are currently in the patch-1 branch
Initial Benchmark test case: 3-D Lid Driven Cavity

Data-set: 3-D Lid Driven Cavity flow
Problem size(s): S (100^3), M (200^3), XL (400^3)
Physical features: incompressible laminar flow, regular and uniform grid
Relation with HW/SW infrastructure: CPU intensive: yes; Memory intensive: yes; GPU intensive: no; I/O intensive: no
Tasks per node: Full (36), Half (18), …
KPIs: Time to solution, Energy to solution (?), Memory Bandwidth Bound
Bottleneck(s): linear solver algebra, data structure

● Derived from the 2-D Lid driven cavity flow of the OpenFOAM tutorial, https://www.openfoam.com/documentation/tutorial-guide/tutorialse2.php
● Simple geometry. B.C.: moving lid, no-slip walls
● Stress test for the linear solver algebra, mainly the pressure equation
● KPIs to be monitored: wall time, Memory Bandwidth Bound
● Memory Bandwidth Bound = the metric represents the percentage of elapsed time spent heavily utilizing system bandwidth (available only on Intel architectures)
Initial Benchmark test-case: profiling of the S test-case
Profiling with VTune:
• Vectorization: 3.9 %
• Top loops/functions by CPU time:
  • lduMatrix::Amul, only FP scalar ops, no vectorization
• Time spent in the pressure solver: 67.9 %
• Time spent in the velocity solver: 27.3 %
Initial Benchmark test-case: 3-D Lid Driven Cavity
Geometrical and physical properties

Test-case               S           M           XL
delta X (m)             0.001       0.0005      0.00025
N. of cells             1,000,000   8,000,000   64,000,000
n. of cells (linear)    100         200         400
nu (m^2/s)              0.01        0.01        0.01
d (m)                   0.1         0.1         0.1
Co                      1           0.5         0.25
T final (s)             0.5         0.5         0.5 (*) (0.07275)
delta T (s)             0.001       0.00025     0.0000625
Reynolds                10          10          10
U (m/s)                 1           1           1
num. of iterations      500         2000        8000

● Final physical time T = 0.5, to reach steady state in laminar flow (* if feasible)
● Delta X is halved when moving to the next bigger case.
● Courant number kept under the stability limit, reduced for the bigger cases
● Delta T is reduced by a factor of 4 when moving to the next bigger case
Co ≅ (U delta T) / delta X
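The table's time-step sizes and iteration counts follow directly from the Courant relation above. A short sketch that reproduces them (the rounding of the step count is our assumption):

```python
def case_params(n_lin, co, U=1.0, d=0.1, t_final=0.5):
    """delta X, delta T and number of time steps for an n_lin^3 cavity
    run at Courant number co, from Co ~= U * dT / dX."""
    dx = d / n_lin          # uniform grid spacing
    dt = co * dx / U        # time step from the Courant constraint
    return dx, dt, round(t_final / dt)

# S:  case_params(100, 1.0)  -> steps = 500
# M:  case_params(200, 0.5)  -> steps = 2000
# XL: case_params(400, 0.25) -> steps = 8000
```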
HPC Benchmark Project , Ivan Spisso , CINECA
16www.esi-group.com
Copyright © ESI Group, 2019. All rights reserved.
Solver set-up
Initial Benchmark test-case: 3-D Lid Driven Cavity
● The set-up has been chosen to give a constant computational load at each time-step
● maxIter is reached by the number of iterations at each time step, for both p and U
● An "acceptable" level of convergence of the residuals is obtained
● The set-up has NOT been chosen to maximize the execution time or convergence rate
● Num. iter. tot = num. of iterations × maxIter
system/fvSolution
Test-case            S          M           XL
pressure (p)
  solver             PCG        PCG         PCG
  tolerance          0          0           0
  relTol             0          0           0
  maxIter            250        650         1300
  num. iter tot      125,000    1,300,000   10,400,000
pFinal
  relTol             0          0           0
velocity (U)
  solver             PBiCGStab  PBiCGStab   PBiCGStab
  preconditioner     DILU       DILU        DILU
  tolerance          0          0           0
  relTol             0          0           0
  maxIter            5          5           5
  num. iter tot      2,500      10,000      40,000
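Because both tolerances are zero, every time step runs exactly maxIter linear-solver iterations, so the totals in the table are simple products. A quick check (the dictionary layout here is ours, not OpenFOAM's):

```python
# Time steps per case come from the properties table, maxIter from fvSolution.
TIME_STEPS = {"S": 500, "M": 2000, "XL": 8000}
MAX_ITER = {"p": {"S": 250, "M": 650, "XL": 1300},
            "U": {"S": 5, "M": 5, "XL": 5}}

def total_iterations(field, case):
    """Total linear-solver iterations = time steps x maxIter per step."""
    return TIME_STEPS[case] * MAX_ITER[field][case]
```

This fixed-iteration design is what makes the benchmark reproducible across machines: the work per run is identical regardless of how fast residuals would otherwise converge.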
Residuals and number-of-iterations comparison
Initial Benchmark test-case: 3-D Lid Driven Cavity
● Comparison of pressure (left) and Ux (right) residuals
● Run with solver tolerance vs run with fixed maxIter
● Left vertical axis: residual trend with and without a fixed number of iterations
● Right vertical axis: number of iterations
Solver set-up: S case example
Initial Benchmark test-case: 3-D Lid Driven Cavity

system/fvSolution
Test-case            S          M           XL
pressure (p)
  solver             PCG        PCG         PCG
  tolerance          0          0           0
  relTol             0          0           0
  maxIter            250        650         1300
  num. iter tot      125,000    1,300,000   10,400,000
pFinal
  relTol             0          0           0
velocity (U)
  solver             PBiCGStab  PBiCGStab   PBiCGStab
  preconditioner     DILU       DILU        DILU
  tolerance          0          0           0
  relTol             0          0           0
  maxIter            5          5           5
  num. iter tot      2,500      10,000      40,000
Computational set-up and domain decomposition
Initial Benchmark test-case: 3-D Lid Driven Cavity
● Full computational density: using all the cores/tasks available per node
● Half computational density: using 1/2 of the cores/tasks available per node
● Quarter computational density: using 1/4 of the cores/tasks available per node
● For the Broadwell processor, respectively:
  ○ --tasks-per-node=36
  ○ --tasks-per-node=18
  ○ --tasks-per-node=9
● KPI to measure: memory bandwidth
● Domain decomposition: scotch
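Computational density directly sets the number of cells each MPI rank owns, which is the quantity the scalability results later hinge on. A small sketch for a 36-core Broadwell node (the node counts in the comments are illustrative):

```python
CELLS = {"S": 100**3, "M": 200**3, "XL": 400**3}
TASKS_PER_NODE = {"full": 36, "half": 18, "quarter": 9}  # Broadwell node

def cells_per_rank(case, density, nodes):
    """Average number of cells owned by each MPI rank."""
    return CELLS[case] // (TASKS_PER_NODE[density] * nodes)

# e.g. the S case on one full node leaves ~28k cells per rank,
# already close to the strong-scaling limit discussed later
```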
HPC resources specs
Example of a table with the specifications of the platforms used
Results: S test-case, 1 M cells
● Galileo cluster: Intel Broadwell based cluster, --tasks-per-node=36
● Left figure: inter-node scalability
● Too few cells per core to get good scalability
HPC resources specs: comparison of ARMIDIA, GALILEO, etc.

Cluster           Proc. type                     Cores/node  Tot. nodes  Accelerators  Memory (GB/node)  Network                    Memory bandwidth (GB/s)
Galileo (CINECA)  Intel Broadwell E5-2697        36          1022        none          128               Intel OmniPath (100 Gb/s)  141
ARMIDIA (E4)      ARM Marvell TX2 @2.2/2.5 GHz   64          8           none          256               Mellanox IB (100 Gb/s)     -
…
Results: M and XL test-cases
HPC comparison: Armidia cluster. Execution time, speed-up, number of cells per process
● Armidia cluster: 8 compute nodes, 64 cores per node, Marvell TX2 @2.2/2.5 GHz, 256 GB/node, Mellanox IB (100 Gb/s)
● --tasks-per-node=64
Results: XL test-case
HPC comparison: Galileo cluster. Execution time, speed-up and number of cells per core
● Galileo cluster: Intel Broadwell based cluster, --tasks-per-node=36
● Superlinear scalability due to memory bandwidth, see next presentation
Results: XL test-case
HPC comparison: Armidia and Galileo clusters

Number of cores             128       256      288       512      576      1152
Intel Broadwell (Galileo)   -         -        103.505   -        43.664   15.867
ARM ThunderX2 (Armidia)     189.930   90.914   -         47.276   -        -

● Comparison by using the same total number of cores
● Runs using the full number of tasks per node: 36 for Broadwell, 64 for ARM
● Continuous line: trend line
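From the execution times above, parallel efficiency relative to the smallest run can be computed directly; values above 1.0 are the superlinear behaviour attributed to memory bandwidth. The core counts assigned to the Galileo runs here (multiples of 36) are our reading of the table, not stated explicitly on the slide:

```python
# XL-case wall times from the slide, keyed by total core count
GALILEO_XL = {288: 103.505, 576: 43.664, 1152: 15.867}

def parallel_efficiency(times):
    """Speed-up over the smallest core count, divided by the ideal ratio."""
    base = min(times)
    return {c: (times[base] / t) / (c / base) for c, t in times.items()}

eff = parallel_efficiency(GALILEO_XL)
# eff[576] and eff[1152] both exceed 1.0: superlinear scaling, consistent
# with the working set progressively fitting into aggregate cache/bandwidth
```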
Conclusions / Further work / Acknowledgements
● Conclusions
  ○ Presented the OpenFOAM HPC Benchmark project
  ○ Created an open and shared repository with relevant data-sets and information
  ○ Presented preliminary results on HPC architectures
  ○ Started to provide the community with a homogeneous term of reference to compare different HW architectures, configurations and SW environments
● Further work
  ○ Run the M test-case
  ○ Populate the repo web site with instructions on how to run and post-process the benchmark (with the Documentation TC)
  ○ Add half number of cores per node
  ○ Add several architectures
  ○ Add weak-scalability tests at the optimal number of cells per core
  ○ Add energy to solution?
  ○ Add different (optimal?) decomposition methods
  ○ Add further test-cases
● Acknowledgements
  ○ C. Latini and A. Memmolo (CINECA) for post-processing, scripting and figure templates
  ○ Andy Heather, Joseph Nagzy for the initial set-up of the wiki repo and web page (work in progress)
  ○ Useful discussions with S. Zampini (PETSc Team)
  ○ People / companies / OEMs which gave (or will give) availability of different architectures (E4, Intel, ARM, AMD, etc.)