TRANSCRIPT
www.esi-group.com
Copyright © ESI Group, 2019. All rights reserved.
HPC Benchmark Project , Ivan Spisso , CINECA
HPC Benchmark Project (part I)
Ivan Spisso, [email protected]
S. Bnà, [email protected]
Giorgio Amati, [email protected]
G. Rossi, [email protected]
Fabrizio Magugliani, [email protected]
Outline of the presentation
First Part: OpenFOAM HPC Benchmark Project
• HPC performance of OpenFOAM (3 slides)
• OpenFOAM HPC Technical Committee (TC)
• Code repository for the HPC TC, https://develop.openfoam.com/committees/hpc
• Initial Benchmark test-case: 3-D Lid Driven Cavity
  • Set-up
  • Metrics/KPIs (Key Performance Indicators) to measure performance
  • Results
• Further Benchmark test-cases (with other TCs): ask the Community
OpenFOAM: algorithmic overview and Pstream
● OpenFOAM is first and foremost a C++ library used to solve systems of Partial Differential Equations (PDEs) in discretized form. OpenFOAM stands for Open Field Operation And Manipulation.
● The engine of OpenFOAM is the numerical method, with the following features:
  ○ segregated, iterative solution (PCG/GAMG), unstructured finite-volume method, co-located variables, equation coupling.
● The parallel-computing method used by OpenFOAM is based on standard MPI, using a domain-decomposition strategy (zero-layer domain decomposition).
● A convenient interface, Pstream, is used to plug any MPI library into OpenFOAM. It is a light wrapper around the selected MPI interface.
OpenFOAM: HPC performances
● OpenFOAM scales reasonably well up to an upper limit on the order of thousands of cores [1]. We are aiming at radical scalability for cases of hundreds of millions to billions of cells on 10K-100K cores.
● A custom version by Shimizu Corp., Fujitsu Limited and RIKEN on the K computer (SPARC64 VIIIfx 2.0 GHz, Tofu interconnect) was able to achieve high performance with 100 thousand MPI tasks on a large-scale transient CFD simulation of up to a 100-billion-cell mesh [2]
OpenFOAM HPC bottlenecks towards the exascale: open issues
Up to now, the well-known bottlenecks for enabling OpenFOAM to perform on massively parallel clusters are:
● Scalability of the linear solvers and their limits in the parallelism paradigm.
  ○ In most cases the memory bandwidth is a limiting factor for the solver.
  ○ Additionally, global reductions are frequently required in the solvers.
  The linear-algebra core libraries are the main communication bottleneck for scalability.
● Sparse-matrix storage format: the LDU sparse-matrix storage format used internally does not enable any cache-blocking mechanism (SIMD, vectorization).
● The I/O data storage system: when running in parallel, the data for decomposed fields and mesh(es) has historically been stored in multiple files within separate directories for each processor, which is a bottleneck for big simulations.
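The LDU format mentioned above stores the diagonal plus one off-diagonal coefficient per internal face, addressed through owner/neighbour index lists. A minimal Python sketch of an Amul-style matrix-vector product over this layout (array names are illustrative, not OpenFOAM's actual API) shows the indirect addressing that defeats SIMD vectorization:

```python
import numpy as np

def ldu_amul(diag, lower, upper, l_addr, u_addr, x):
    """y = A @ x for an LDU-stored sparse matrix.

    diag:          one coefficient per cell
    lower/upper:   one coefficient per internal face
    l_addr/u_addr: owner/neighbour cell index of each face
    """
    y = diag * x
    for f in range(len(l_addr)):
        # scattered += through the face-to-cell index lists: faces that
        # share a cell carry a data dependency, blocking vectorization
        y[l_addr[f]] += upper[f] * x[u_addr[f]]
        y[u_addr[f]] += lower[f] * x[l_addr[f]]
    return y
```

The scatter through `l_addr`/`u_addr` is the cache-unfriendly access pattern the slide refers to: unlike a row-wise CSR product, two faces may update the same cell, so the loop cannot be vectorized without a face colouring or a storage-format change.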
OpenFOAM HPC bottlenecks towards the exascale: open issues
Profiling with APS (Application Performance Snapshot) from Intel
● https://software.intel.com/sites/products/snapshots/application-snapshot/
HPC Roofline Model and computational intensity
Technological trend: towards the exascale
● The roofline model: performance bound (y-axis) ordered according to computational intensity.
● Computational intensity: ratio of total floating-point operations to total data movement (bytes), i.e. flop/byte.
● What is the OpenFOAM (CFD/FEM) arithmetic intensity? About 0.1, maybe less…
● TOP500 List - June 2019 (figure): the CFD range is bandwidth-limited, about two orders of magnitude below peak flops; Linpack sits in the dense linear-algebra range, while OpenFOAM is sparse linear algebra (cf. the HPCG benchmark).
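The bound the roofline model assigns to a kernel can be sketched in a few lines; the 141 GB/s figure matches the Galileo node bandwidth quoted later in this deck, while the 1 TFLOP/s peak is only a placeholder, not a measured value:

```python
def attainable_gflops(intensity, peak_gflops, bandwidth_gbs):
    """Roofline bound: the lesser of the compute peak and the
    bandwidth ceiling (arithmetic intensity * memory bandwidth)."""
    return min(peak_gflops, intensity * bandwidth_gbs)

# At OpenFOAM-like intensity (~0.1 flop/byte) the bound is bandwidth-set:
low = attainable_gflops(0.1, 1000.0, 141.0)    # ~14 GFLOP/s
# Linpack-like dense kernels hit the compute peak instead:
high = attainable_gflops(10.0, 1000.0, 141.0)  # 1000 GFLOP/s
```

With these illustrative numbers the sparse kernel sits roughly two orders of magnitude below peak, which is the gap the slide's figure shows between the CFD range and the Linpack range.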
OpenFOAM HPC Technical Committee (TC)
• The Technical Committees cover all the key focus areas for OpenFOAM development; they assess the state of the art, and the need and status for validation, documentation and further development.
• Remits of the HPC TC
  • OpenFOAM recommendations to the Steering Committee in respect of the HPC technical area
  • Work together with the Community to overcome the current HPC bottlenecks of OpenFOAM
    • Scalability of linear solvers
    • Adapt/modify data structures of the sparse linear system to enable vectorization / hybridization
    • Improve memory access on new architectures
    • Improve memory bandwidth
    • Parallel pre- and post-processing, parallel I/O
  • Strong co-design approach
    • Identify algorithm improvements to enhance HPC scalability
    • Interaction with the other Technical Committees (Numerics, Documentation)
• Priorities of the HPC TC:
  • HPC Benchmark
  • GPU enabling of OpenFOAM (see the later presentation by Nvidia's Stan Posey)
  • Parallel I/O: the ADIOS2 I/O library, as a function object, is now a git submodule in the OpenFOAM develop branch
https://www.openfoam.com/governance/technical-committees.php#tm-hpc
HPC Technical Committee: Membership Nominations
Members of the Technical Committee
• Chair: Dr. Ivan Spisso, CINECA
• S. Bnà, HPC Developer, CINECA
• Deputy-Chair: Dr. Mark Olesen, Principal Eng., ESI-OpenCFD (Release and Maintenance)
• Dr. Henrik Rusche, Wikki Ltd. (Release Authority)
• Dr. M. Klemm, Principal Eng., and Dr. G. Rossi, App. Software Eng., Intel (Hardware OEM)
• Dr. Olly Perks / Dr. F. Spiga, Research Eng., ARM (Hardware OEM)
• F. Magugliani, Strategic Planning, E4 (HPC System Integrator)
• Dr. William F. Godoy, Scientific Data Group, Oak Ridge National Lab (HPC Computer Sc.)
• Dr. Pham Van Phuc, Institute of Technology, Shimizu Corporation (HPC/OEM)
• Dr. Stefano Zampini, PETSc Team, KAUST (HPC center / PETSc dev.)
• Mr. Axel Koehler, Sr. Solutions Architect, Nvidia (Hardware OEM)
HPC Technical Committee: First Indications of Tasks
• Commitment
  • Work together with the Community to overcome the current HPC bottlenecks of OpenFOAM.
  • Demonstrate improvements in performance and scalability, to move forward from the current near-petascale to pre- and exascale class performance.
• Remits
  • OpenFOAM recommendations to the Steering Committee in respect of the HPC technical area
  • Work together with the Community to overcome the current HPC bottlenecks of OpenFOAM
    • Scalability of linear solvers
    • Adapt/modify data structures of the sparse linear system to enable vectorization / hybridization
    • Improve memory access on new architectures
    • Improve memory bandwidth
    • Parallel pre- and post-processing, parallel I/O
  • Strong co-design approach
    • Identify algorithm improvements to enhance HPC scalability
    • Interaction with the other Technical Committees (Numerics, Documentation)
• Tasks
  • Test and benchmark third-party linear-algebra solver packages (ESI, Intel, PETSc Team)
  • Test parallel I/O (ADIOS2) (ESI, ORNL)
  • Review HPC benchmarks and establish common HPC benchmarks among different architectures (Intel, ARM, EPCC, E4)
  • Review documentation in the HPC & OpenFOAM technical area
  • …
Code repository for HPC Technical Committee https://develop.openfoam.com/committees/hpc
● Create an open and shared repository with relevant data-sets and information
● Provide a User Guide and initial scripts to set up and run the different data-sets
● Provide the community with a homogeneous term of reference to compare different HW architectures, configurations and SW environments
● Define a common set of Metrics/KPIs (Key Performance Indicators) to measure performance
● Data-sets are currently in the patch-1 branch
Initial Benchmark test case: 3-D Lid Driven Cavity

Data-set: 3-D Lid Driven Cavity flow
Problem size(s): S (100^3), M (200^3), XL (400^3)
Physical features: incompressible laminar flow, regular and uniform grid
Relation with HW/SW infrastructure: CPU intensive: yes; Memory intensive: yes; GPU intensive: no; I/O intensive: no
Tasks per node: Full (36), Half (18), …
KPIs: Time to solution, Energy to solution (?), Memory Bandwidth Bound
Bottleneck(s): linear solver algebra, data structure

● Derived from the 2-D Lid driven cavity flow of the OpenFOAM tutorial, https://www.openfoam.com/documentation/tutorial-guide/tutorialse2.php
● Simple geometry. B.C.: moving lid, no-slip walls
● Stress test for the linear solver algebra, mainly the pressure equation
● KPIs to be monitored: wall time, Memory Bandwidth Bound
● Memory Bandwidth Bound = the metric represents the percentage of elapsed time spent heavily utilizing system bandwidth (available only on Intel architectures)
Initial Benchmark test-case: profiling of the S test-case
Profiling with VTune:
• Vectorization: 3.9 %
• Top loops/functions by CPU time:
  • lduMatrix::Amul, only FP scalar ops, no vectorization
• Time spent in the pressure solver: 67.9 %
• Time spent in the velocity solver: 27.3 %
Initial Benchmark test-case: 3-D Lid Driven Cavity
Geometrical and physical properties

Test-case               S           M           XL
delta X (m)             0.001       0.0005      0.00025
N. of cells             1,000,000   8,000,000   64,000,000
n. of cells (linear)    100         200         400
nu (m^2/s)              0.01        0.01        0.01
d (m)                   0.1         0.1         0.1
Co                      1           0.5         0.25
T final (s)             0.5         0.5         0.5 (*) (0.07275)
delta T (s)             0.001       0.00025     0.0000625
Reynolds                10          10          10
U (m/s)                 1           1           1
num. of iterations      500         2000        8000

● Final physical time T = 0.5, to reach steady state in laminar flow (* if feasible)
● Delta X is halved when moving to the next bigger case.
● Courant number kept under the stability limit, reduced for the bigger cases
● Delta T is reduced by a factor of 4 when moving to the next bigger case
Co ≅ (U delta T) / delta X
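The table's time-step sizes and iteration counts follow directly from the Courant relation above. A short sketch that reproduces them (the rounding of the step count is our assumption):

```python
def case_params(n_lin, co, U=1.0, d=0.1, t_final=0.5):
    """delta X, delta T and number of time steps for an n_lin^3 cavity
    run at Courant number co, from Co ~= U * dT / dX."""
    dx = d / n_lin          # uniform grid spacing
    dt = co * dx / U        # time step from the Courant constraint
    return dx, dt, round(t_final / dt)

# S:  case_params(100, 1.0)  -> steps = 500
# M:  case_params(200, 0.5)  -> steps = 2000
# XL: case_params(400, 0.25) -> steps = 8000
```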
HPC Benchmark Project , Ivan Spisso , CINECA
16www.esi-group.com
Copyright © ESI Group, 2019. All rights reserved.
Solver set-up
Initial Benchmark test-case: 3-D Lid Driven Cavity
● The set-up has been chosen to give a constant computational load at each time-step
● maxIter is reached by the number of iterations at each time step, for both p and U
● An "acceptable" level of convergence of the residuals is obtained
● The set-up has NOT been chosen to maximize the execution time or convergence rate
● Num. iter. tot = num. of iterations × maxIter
system/fvSolution
Test-case            S          M           XL
pressure (p)
  solver             PCG        PCG         PCG
  tolerance          0          0           0
  relTol             0          0           0
  maxIter            250        650         1300
  num. iter tot      125,000    1,300,000   10,400,000
pFinal
  relTol             0          0           0
velocity (U)
  solver             PBiCGStab  PBiCGStab   PBiCGStab
  preconditioner     DILU       DILU        DILU
  tolerance          0          0           0
  relTol             0          0           0
  maxIter            5          5           5
  num. iter tot      2,500      10,000      40,000
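Because both tolerances are zero, every time step runs exactly maxIter linear-solver iterations, so the totals in the table are simple products. A quick check (the dictionary layout here is ours, not OpenFOAM's):

```python
# Time steps per case come from the properties table, maxIter from fvSolution.
TIME_STEPS = {"S": 500, "M": 2000, "XL": 8000}
MAX_ITER = {"p": {"S": 250, "M": 650, "XL": 1300},
            "U": {"S": 5, "M": 5, "XL": 5}}

def total_iterations(field, case):
    """Total linear-solver iterations = time steps x maxIter per step."""
    return TIME_STEPS[case] * MAX_ITER[field][case]
```

This fixed-iteration design is what makes the benchmark reproducible across machines: the work per run is identical regardless of how fast residuals would otherwise converge.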
Residuals and number-of-iterations comparison
Initial Benchmark test-case: 3-D Lid Driven Cavity
● Comparison of pressure (left) and Ux (right) residuals
● Run with solver tolerance vs run with fixed maxIter
● Left vertical axis: residual trend with and without a fixed number of iterations
● Right vertical axis: number of iterations
Solver set-up: S case example
Initial Benchmark test-case: 3-D Lid Driven Cavity

system/fvSolution
Test-case            S          M           XL
pressure (p)
  solver             PCG        PCG         PCG
  tolerance          0          0           0
  relTol             0          0           0
  maxIter            250        650         1300
  num. iter tot      125,000    1,300,000   10,400,000
pFinal
  relTol             0          0           0
velocity (U)
  solver             PBiCGStab  PBiCGStab   PBiCGStab
  preconditioner     DILU       DILU        DILU
  tolerance          0          0           0
  relTol             0          0           0
  maxIter            5          5           5
  num. iter tot      2,500      10,000      40,000
Computational set-up and domain decomposition
Initial Benchmark test-case: 3-D Lid Driven Cavity
● Full computational density: using all the cores/tasks available per node
● Half computational density: using 1/2 of the cores/tasks available per node
● Quarter computational density: using 1/4 of the cores/tasks available per node
● For the Broadwell processor, respectively:
  ○ --tasks-per-node=36
  ○ --tasks-per-node=18
  ○ --tasks-per-node=9
● KPI to measure: memory bandwidth
● Domain decomposition: scotch
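Computational density directly sets the number of cells each MPI rank owns, which is the quantity the scalability results later hinge on. A small sketch for a 36-core Broadwell node (the node counts in the comments are illustrative):

```python
CELLS = {"S": 100**3, "M": 200**3, "XL": 400**3}
TASKS_PER_NODE = {"full": 36, "half": 18, "quarter": 9}  # Broadwell node

def cells_per_rank(case, density, nodes):
    """Average number of cells owned by each MPI rank."""
    return CELLS[case] // (TASKS_PER_NODE[density] * nodes)

# e.g. the S case on one full node leaves ~28k cells per rank,
# already close to the strong-scaling limit discussed later
```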
HPC resources specs
Example of a table with the specifications of the platforms used
Results: S test-case, 1 M cells
● Galileo cluster: Intel Broadwell based cluster, --tasks-per-node=36
● Left figure: inter-node scalability
● Too few cells per core to get good scalability
HPC resources specs: comparison of ARMIDIA, GALILEO, etc.

Cluster           Proc. type                     Cores/node  Tot. nodes  Accelerators  Memory (GB/node)  Network                    Memory bandwidth (GB/s)
Galileo (CINECA)  Intel Broadwell E5-2697        36          1022        none          128               Intel OmniPath (100 Gb/s)  141
ARMIDIA (E4)      ARM Marvell TX2 @2.2/2.5 GHz   64          8           none          256               Mellanox IB (100 Gb/s)     -
…
Results: M and XL test-cases
HPC comparison: Armidia cluster. Execution time, speed-up, number of cells per process
● Armidia cluster: 8 compute nodes, 64 cores per node, Marvell TX2 @2.2/2.5 GHz, 256 GB/node, Mellanox IB (100 Gb/s)
● --tasks-per-node=64
Results: XL test-case
HPC comparison: Galileo cluster. Execution time, speed-up and number of cells per core
● Galileo cluster: Intel Broadwell based cluster, --tasks-per-node=36
● Superlinear scalability due to memory bandwidth, see next presentation
Results: XL test-case
HPC comparison: Armidia and Galileo clusters

Number of cores             128       256      288       512      576      1152
Intel Broadwell (Galileo)   -         -        103.505   -        43.664   15.867
ARM ThunderX2 (Armidia)     189.930   90.914   -         47.276   -        -

● Comparison by using the same total number of cores
● Runs using the full number of tasks per node: 36 for Broadwell, 64 for ARM
● Continuous line: trend line
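From the execution times above, parallel efficiency relative to the smallest run can be computed directly; values above 1.0 are the superlinear behaviour attributed to memory bandwidth. The core counts assigned to the Galileo runs here (multiples of 36) are our reading of the table, not stated explicitly on the slide:

```python
# XL-case wall times from the slide, keyed by total core count
GALILEO_XL = {288: 103.505, 576: 43.664, 1152: 15.867}

def parallel_efficiency(times):
    """Speed-up over the smallest core count, divided by the ideal ratio."""
    base = min(times)
    return {c: (times[base] / t) / (c / base) for c, t in times.items()}

eff = parallel_efficiency(GALILEO_XL)
# eff[576] and eff[1152] both exceed 1.0: superlinear scaling, consistent
# with the working set progressively fitting into aggregate cache/bandwidth
```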
Conclusions / Further work / Acknowledgements
● Conclusions
  ○ Presented the OpenFOAM HPC Benchmark project
  ○ Created an open and shared repository with relevant data-sets and information
  ○ Presented preliminary results on HPC architectures
  ○ Started to provide the community with a homogeneous term of reference to compare different HW architectures, configurations and SW environments
● Further work
  ○ Run the M test-case
  ○ Populate the repo web site with instructions on how to run and post-process the benchmark (with the Documentation TC)
  ○ Add half number of cores per node
  ○ Add several architectures
  ○ Add weak-scalability tests at the optimal number of cells per core
  ○ Add energy to solution?
  ○ Add different (optimal?) decomposition methods
  ○ Add further test-cases
● Acknowledgements
  ○ C. Latini and A. Memmolo (CINECA) for post-processing, scripting and figure templates
  ○ Andy Heather, Joseph Nagzy for the initial set-up of the wiki repo and web page (work in progress)
  ○ Useful discussions with S. Zampini (PETSc Team)
  ○ People / companies / OEMs which gave (or will give) availability of different architectures (E4, Intel, ARM, AMD, etc.)