Performance modeling of an industrial hydrocode on recent multicore processors

Thibault Gasc 1,2,3, F. De Vuyst 1, R. Motte 3, M. Peybernes 4, R. Poncet 5

1 CMLA, ENS Cachan, CNRS, Université Paris-Saclay, 94235, Cachan, France
2 Maison de la Simulation, USR 3441, CEA – CNRS – INRIA – Univ. Paris-Sud – Univ. de Versailles, F-91191, Gif-sur-Yvette, France
3 CEA, DAM, DIF, F-91297 Arpajon, France
4 CEA, DEN, DM2S, STMF, F-91191, Gif-sur-Yvette, France
5 CGG, 27 Avenue Carnot, 91300 Massy, France

SimRace 2015, 8th December 2015

Outline

1 Goals and target application

2 Performance model: the Roofline

3 Application of the Roofline model to the current solver

4 The ECM model

5 Conclusions

The goal

Build a solver for compressible fluid dynamics (Euler equations) which will run efficiently on current and future computers

T. Gasc 3/21

How do we get there?

Understand HPC properties of the reference numerical method

• Analyze and understand the current solver performance

Improve the current solver to its limits

• Understand current processor micro-architectures

• Optimize the solver implementation for current processor micro-architectures

• Estimate the maximal achievable performance for this solver

Going further

• Predict performance on other and future architectures

• Design a solver based on a more efficient algorithm

⇒ Build a performance model

Specificity of our solver (Lagrange-Remap methods)

1 - Lagrangian step

Compute the evolution of hydrodynamic quantities on a mesh moving at material velocity

2 - Remap step

Remap / interpolate from the distorted mesh back to the fixed initial one
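The two steps can be illustrated with a deliberately simplified sketch. It is one-dimensional, remaps only mass, and uses a first-order overlap projection, whereas the solver described in this talk is multidimensional and uses MUSCL fluxes. All names and numbers below are hypothetical.

```python
def lagrange_step(nodes, node_velocities, dt):
    """Lagrangian phase: move every mesh node at its material velocity."""
    return [x + u * dt for x, u in zip(nodes, node_velocities)]


def remap_mass(moved_nodes, cell_masses, fixed_nodes):
    """Remap phase: conservative first-order projection onto the fixed mesh.

    Each moved cell is assumed to carry a uniform density; its mass is
    distributed to the fixed cells in proportion to the overlap length.
    """
    remapped = [0.0] * (len(fixed_nodes) - 1)
    for i, mass in enumerate(cell_masses):
        left, right = moved_nodes[i], moved_nodes[i + 1]
        density = mass / (right - left)
        for j in range(len(remapped)):
            overlap = min(right, fixed_nodes[j + 1]) - max(left, fixed_nodes[j])
            if overlap > 0.0:
                remapped[j] += density * overlap
    return remapped


# Hypothetical data: 4 cells on [0, 4]; the two boundary nodes are held fixed.
fixed_nodes = [0.0, 1.0, 2.0, 3.0, 4.0]
node_velocities = [0.0, 0.5, 0.5, 0.5, 0.0]
cell_masses = [1.0, 2.0, 1.0, 1.0]

moved_nodes = lagrange_step(fixed_nodes, node_velocities, dt=0.2)
remapped_masses = remap_mass(moved_nodes, cell_masses, fixed_nodes)
print(remapped_masses)
# Total mass before and after the remap agrees up to rounding.
print(sum(cell_masses), sum(remapped_masses))
```

The point of the sketch is the split itself: the Lagrangian phase changes the mesh but not the cell masses, and the remap phase restores the fixed mesh while conserving the total.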

Figure: Sketch of Lagrange-Remap methods

What is performance modeling?

Definition

Build and use a simple abstraction to estimate the execution time of a chosen algorithm on a given computing architecture

⇒ Understand the performance behavior of a code

Benefits

• Quantify how efficiently a computer is used (current usage vs machine peak)

• Predict / extrapolate performance behavior on other architectures

• Provide optimization hints

Performance modeling: principle

Figure: Input / output of a performance model. Inputs: architecture, algorithm. Output: predicted performance [Flops].

The Roofline model [Williams 2009]: metrics

Architecture

• Bandwidth: data transfer rate from memory to the computing units [GBytes/s]

• Peak: absolute maximum performance of a computer [GFlops]

Algorithm

• Arithmetic Intensity (AI): $\mathrm{AI} := \dfrac{\text{number of operations}}{\text{quantity of data transferred}}$ [Flops/Byte]

• Unbalance factor (instruction mix): performance reduction factor due to an imperfect match between algorithm and architecture [%peak]
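As a rough illustration of the two algorithm metrics, the sketch below computes the arithmetic intensity of a simple vector kernel and one possible instruction-mix factor. The kernel is not taken from the solver, and the add/multiply balance rule is an assumption about a hypothetical core that can issue one addition and one multiplication per cycle; the unbalance factor used in the talk may account for other effects as well.

```python
def arithmetic_intensity(flops, bytes_transferred):
    """AI = number of operations / quantity of data transferred [Flops/Byte]."""
    return flops / bytes_transferred


def add_mul_balance(n_add, n_mul):
    """Fraction of peak reachable on a hypothetical core that can issue one
    addition and one multiplication per cycle (peak needs a 1:1 mix)."""
    return (n_add + n_mul) / (2 * max(n_add, n_mul))


# Illustrative kernel, not one of the solver's: a[i] = b[i] + s * c[i] in
# double precision, i.e. 2 flops and 3 * 8 bytes of traffic per iteration
# (the extra read caused by write-allocate is ignored here).
triad_ai = arithmetic_intensity(flops=2, bytes_transferred=3 * 8)
print(triad_ai)               # 1/12 Flops/Byte
print(add_mul_balance(1, 1))  # balanced mix: full peak reachable
print(add_mul_balance(3, 1))  # add-heavy mix: 2/3 of peak at best
```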

The Roofline model: definitions

(ideal) Roofline model

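$\mathrm{Perf}_{\text{ideal}} = f(\text{Peak}, \text{bandwidth}, \text{AI}) := \min(\text{Peak},\ \text{bandwidth} \times \text{AI})$

(effective) Roofline model

$\mathrm{Perf}_{\text{effective}} = f(\text{Peak}, \text{bandwidth}, \text{AI}, \text{unbalance factor}) := \min(\text{Peak} \times \text{unbalance factor},\ \text{bandwidth} \times \text{AI})$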
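The two definitions translate directly into code. The following is a minimal sketch; the machine figures (100 GFlops peak, 40 GBytes/s bandwidth) and the unbalance factor of 0.6 are made-up values chosen only to show the memory-bound and compute-bound regimes.

```python
def roofline_ideal(peak, bandwidth, ai):
    """Ideal Roofline: min(Peak, bandwidth * AI), in the unit of `peak`."""
    return min(peak, bandwidth * ai)


def roofline_effective(peak, bandwidth, ai, unbalance_factor):
    """Effective Roofline: the compute ceiling is lowered by the unbalance
    factor (a fraction of peak between 0 and 1)."""
    return min(peak * unbalance_factor, bandwidth * ai)


# Hypothetical machine: 100 GFlops peak, 40 GBytes/s memory bandwidth.
PEAK, BANDWIDTH = 100.0, 40.0

# Low arithmetic intensity: the bandwidth term is the limit (memory-bound).
print(roofline_ideal(PEAK, BANDWIDTH, ai=0.5))           # 20.0 GFlops

# High arithmetic intensity: the peak is the limit (compute-bound), and the
# unbalance factor lowers the attainable ceiling.
print(roofline_ideal(PEAK, BANDWIDTH, ai=10.0))          # 100.0 GFlops
print(roofline_effective(PEAK, BANDWIDTH, 10.0, 0.6))    # 60.0 GFlops
```

Note that the unbalance factor only matters for kernels on the compute-bound side; for memory-bound kernels both variants return the same value.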

Roofline: principle

Roofline model

Inputs (architecture): Bandwidth [GB/s], Peak [Flops]

Inputs (algorithm): Arithmetic Intensity [F/B], Unbalance factor [%Peak]

Output: Predicted performance [Flops]

Roofline: graphical representation

Dataflow diagram of the remap step

Figure: Dataflow diagram of the remap step. Kernels: Lagrangian quantity update, cell-centered gradients, MUSCL fluxes, cell-centered remap, node-centered gradient, velocity remap. Data: Lagrangian energy, mass, Lagrangian velocity, predicted velocity, Lagrangian density, volume flux, energy gradient, density gradient, energy flux, mass flux, velocity gradient. Outputs: out mass, out velocity, out energy.

Roofline model vs measured performance

Figure: Roofline prediction vs measured performance (modeling errors: 11% and 12%)

Roofline model vs measured performance

Figure: Roofline prediction vs measured performance (modeling errors: 30% and 70%)

Execution Cache Memory model [Hager & Treibig 2010]

ECM (Execution Cache Memory model): refinement of the Roofline model that accounts for data transfers through the caches

Metrics: time [cycles / cache line update]

• Time for pure computation (assuming data are in L1 cache)

• Times for data transfers between successive levels of the memory hierarchy: L1-L2, L2-L3, L3-RAM

Main hypotheses

• Computations and data transfers may overlap

• Data transfers do not overlap
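Under these two hypotheses, a single-core prediction can be sketched as the maximum of the computation time and the sum of the transfer times. This is a simplified reading of the slide; the full ECM model additionally separates core time into overlapping and non-overlapping parts. The 64-byte cache line, the link bandwidths, and the kernel figures below are hypothetical inputs.

```python
CACHE_LINE_BITS = 64 * 8  # assumption: 64-byte cache lines


def transfer_cycles(n_cache_lines, bits_per_cycle):
    """Cycles needed to move n cache lines over a link of given width."""
    return n_cache_lines * CACHE_LINE_BITS / bits_per_cycle


def ecm_single_core_cycles(t_comp, t_transfers):
    """Simplified ECM prediction in cycles per unit of work.

    Computation may overlap with data transfers, but the transfers between
    successive memory levels are serialized, so their times add up.
    """
    return max(t_comp, sum(t_transfers))


# Hypothetical kernel: 3 cache lines cross each link per unit of work.
t_l1_l2 = transfer_cycles(3, bits_per_cycle=256)   # 6 cycles
t_l2_l3 = transfer_cycles(3, bits_per_cycle=128)   # 12 cycles
t_l3_mem = transfer_cycles(3, bits_per_cycle=64)   # 24 cycles
transfers = [t_l1_l2, t_l2_l3, t_l3_mem]

print(ecm_single_core_cycles(20.0, transfers))  # 42.0: data-transfer-bound
print(ecm_single_core_cycles(50.0, transfers))  # 50.0: computation-bound
```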

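Figure: Memory hierarchy used by the ECM model (execution units, registers, L1 cache, L2 cache, L3 cache, memory), split into computation and data transfer, with link bandwidths labeled 256 bit/cy, 256 bit/cy, 256 bit/cy, 128 bit/cy and *53 bit/cy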

ECM model vs measured performance

Haswell, AVX kernels; times in cycles / cache line update

Kernel                      ECM    Measurement    Error
Lagrange kernels
  Pressure prediction       168    173            3%
  Pressure correction       80     78             3%
  Velocity update           89     65             8%
Remap kernels
  Lag. q. update            57     58             2%
  Cell centered gradient    168    170            1%
  MUSCL Flux                44     42             5%
  Cell centered remap       76     65             17%
  Node centered gradient    168    170            1%
  Velocity remap            30     25             17%

ECM model vs Roofline model vs measured performance

                          time [cycles / cache line update]    modeling error
Kernel name               ECM prediction    Measurement        Roofline    ECM
Pressure correction       80                78                 30%         3%
Lagrangian q. update      57                58                 70%         2%

ECM: from one core to full node

Figure: From one core to a full node. Each core has private resources (registers, L1 cache, L2 cache); the cores are connected to shared resources (L3 cache, memory). Link bandwidths are labeled 256 bit/cy, 128 bit/cy and *53 bit/cy.

ECM: multicore scalability

                            Speedup, 8-core SNB    Speedup, 4-core HSW
Kernel name                 Pred.    Meas.         Pred.    Meas.
Lagrange kernels
  Pressure prediction       8        7.45          4        3.5
  Pressure correction       2.6      3.0           1.4      1.6
  Velocity update           2.8      3.0           1.5      1.5
Remap kernels
  Lagrangian q. update      3.2      3.4           1.6      1.6
  Cell centered gradient    8        7.9           4        3.9
  MUSCL fluxes              2.3      2.6           1.3      1.4
  Cell centered remap       2.5      2.7           1.2      1.6
  Node centered gradient    8        7.9           4        3.9
  Velocity remap            2.3      2.7           1.5      1.5
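The slides do not spell out how the predicted speedups are obtained. A common ECM scaling assumption, sketched below, is that cores work in parallel on their private resources until the shared memory transfers saturate. The function name and the input timings are hypothetical and are not the solver's actual figures.

```python
def ecm_speedup(n_cores, t_single_core, t_memory):
    """Speedup over one core when only the memory link is shared.

    Run time per unit of work on n cores is modeled as
    max(t_single_core / n_cores, t_memory): the cores share the work until
    the serialized memory transfers become the bottleneck.
    """
    t_n_cores = max(t_single_core / n_cores, t_memory)
    return t_single_core / t_n_cores


# Hypothetical kernel with 60 cycles on one core, 20 of them on the memory link.
print(ecm_speedup(2, 60.0, 20.0))   # 2.0: still scaling linearly
print(ecm_speedup(8, 60.0, 20.0))   # 3.0: saturated at t_single_core / t_memory

# Hypothetical kernel with no memory traffic: scales with the core count.
print(ecm_speedup(8, 168.0, 0.0))   # 8.0
```

This reproduces the qualitative pattern of the table: kernels with little memory traffic are predicted to scale to the full core count, while the others level off early.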

Conclusions and ongoing work

Performance modeling

• Very useful analysis tools for understanding code performance behavior

• Provide qualitative or quantitative information, depending on the model used

Highlighting some bottlenecks in the current solver

• Variable location (node- and cell-centered)

• Multidimensional remap in multiple steps

• Complex geometrical features

Ongoing work

• Building a new efficient numerical scheme (see F. De Vuyst's talk, Thursday 10th)

• Performance modeling of this new scheme

Thanks for your attention!