Scalasca, FFTW & Alltoall. Pencil code: Performance
Joachim Hein, LUNARC, Lund University
In collaboration with Anders Johansen
Outline
• PRACE
• Scalasca
• Scalasca analysis of the Pencil code
• Communications for Fourier transformations
• FFTW
PRACE: An Overview
• 25 PRACE members
• April 23rd, 2010: creation of the legal entity (AISBL) PRACE, with seat in Brussels, Belgium
• 67+ million € from EC FP7 for the preparatory and implementation phases (grants INFSO-RI-211528, 261557, 283493, and 312763), complemented by ~50 million € from the PRACE members
• Interest by: Latvia, Belgium
PRACE is building the top of the pyramid...
First production system available: 1 Petaflop/s IBM BlueGene/P (JUGENE) at GCS (Gauss Centre for Supercomputing) partner FZJ (Forschungszentrum Jülich)
Second production system available: Bull Bullx CURIE at GENCI partner CEA. Full capacity of 1.8 Petaflop/s reached by late 2011.
Third production system available by the end of 2011: 1 Petaflop/s Cray (HERMIT) at GCS partner HLRS (High Performance Computing Center Stuttgart).
Fourth production system available by mid 2012: 3 Petaflop/s IBM (SuperMUC) at GCS partner LRZ (Leibniz-Rechenzentrum).
(Pyramid diagram: Tier-0 at the top, above Tier-1 and Tier-2)
Fifth production system available by August 2012: 1 Petaflop/s IBM BG/Q (FERMI) at CINECA.
Sixth production system available by January 2013: 1 Petaflop/s IBM (MareNostrum) at BSC.
Upgrade: 5.87 Petaflop/s IBM Blue Gene/Q (JUQUEEN)
Selected PRACE activities
• Tier-0 programme
  – Largest machines in Europe
  – Projects can receive several tens of millions of CPU hours
  – Call expected in the September/October time frame
  – Scalability requirements: should be fine for Pencil
  – Also: preparatory access programme, cut-off 2nd September
• Can ask for help from a PRACE HPC expert
• Tier-1 programme
  – DECI programme for access to Tier-1 architectures
  – Selected on a national level
  – In Sweden one can ask for ≈ 6 M CPU hours (Lindgren), typically cut a bit
• Training events and schools
• Website: www.prace-project.eu
  – Info on the above
  – White papers
Performance Analysis Tool: SCALASCA
Scalasca: Overview
• Parallel profiling tool
  – OpenMP
  – MPI
  – Can also be used for serial code
• Aims to help with questions like:
  – Where is my application spending time?
  – Why is it spending time on this?
• Fast and convenient to use
Overview (cont.)
• Developed by:
  – Forschungszentrum Jülich (Germany)
  – German Research School for Simulation Sciences
• Free but copyrighted
• Widely available; architectures include:
  – Cray XT/XE
  – IBM Blue Gene
  – IBM Power
  – “various” Linux x86/x64 clusters
What does it attempt to do?
• Event based, e.g.:
  – an MPI call starts
  – an OpenMP parallel region finishes
  – subroutine “init” starts
• Recording:
  – runtime summarisation (time spent in a routine)
  – event tracing (huge files)
• GUI report browser to view the results
  – Some other tools can also digest the data
Three-step approach
1. Instrument your program (source access required)
   – Automatically during compilation
   – Manually using an API (source modification; see the sketch after this list)
   – PDT instrumentation (still need to try)
2. Execute the program and perform the analysis
   – Events are recorded while executing
   – Check: the runtime is still reasonable
3. Look at the results
   – Use the GUI
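For the manual option in step 1, a minimal sketch of marking a user region, assuming the EPIK user API that ships with Scalasca 1.x (macro names quoted from memory; the routine and the region name "rhs_loop" are made up for illustration, so check the Scalasca manual for the exact interface):

#include "epik_user.inc"
subroutine compute_rhs(f, df)
  real, intent(in)  :: f(:)
  real, intent(out) :: df(:)
  ! register a user-defined region and time everything between START and END
  EPIK_USER_REG(r_rhs, "rhs_loop")
  EPIK_USER_START(r_rhs)
  df = -f              ! the work to be measured
  EPIK_USER_END(r_rhs)
end subroutine compute_rhs

The file needs to pass through the C preprocessor (e.g. a .F90 suffix) so that the macros expand during an instrumented build.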
STEP 1: Instrumentation
Automatic instrumentation (general)
• Prefix your compiler call with either of:
    scalasca -instrument
    skin
• Example:
    mpif90 -o myprog.x myprog.f90
  becomes:
    scalasca -instrument mpif90 -o myprog.x myprog.f90
    skin mpif90 -o myprog.x myprog.f90
Feedback for successful instrumentation
[jhein@alarik]$ skin mpif90 -O3 -march=bdver1 \
    -fdefault-real-8 \
    -fdefault-double-8 nochemistry.o nochiral.o \
    nocosmicray.o nocosmicrayflux.o density.o \
    nodustdensity.o nodustvelocity.o entropy.o \
    ... \
    pencil_check.o run.o -o run.x

INFO: Instrumented executable for MPI measurement
[jhein@alarik]$
STEP 2: Analyse the Application
Use the analysis tool to run the executable
• MPI example:
    scalasca -analyze mpiexec -np 4 ./myprog.x
    scan mpiexec -np 4 ./myprog.x
• OpenMP code (or serial):
    scalasca -analyze ./myprog.x
    scan ./myprog.x
Scalasca set-up for the Pencil code
• Modify your configuration file:
    FC = skin mpif90
    mpiexec=scan mpiexec -bind-to-core
• Use the modified configuration file for building and running
Remark: The above also engages task binding; use the same binding for all runs
Important comments on runtime
• The above creates a summary report
• Have a look at the runtime:
  – small routines can make the runtime explode
  – if it is bad: use a filter file
  – if it is really bad: compile the relevant file without instrumentation
• The analysis step can give you hints on the culprits
  – look for high invocation counts
Example: Filter file for gfortran 4.6.2
[jhein@alarik Benchmark]$ more not_to_measure.filt
__particles_sub_MOD_get_rhopswarm_point
__general_MOD_keep_compiler_quiet_i
__general_MOD_keep_compiler_quiet_r4d
__general_MOD_keep_compiler_quiet_i2d
__general_MOD_random_number_wrapper_0
__general_MOD_random_number_wrapper_1
__general_MOD_ran0
__particles_map_MOD_interpolate_quadratic_spline
__particles_MOD_get_frictiontime
Filter file
• Name the routines not to measure in the filter file
• Option of the analyser:
    scan -f filterfile mpiexec -np 128 myprog.x
• Or set an environment variable:
    export EPK_FILTER=filterfile
Remark: Replace filterfile with the actual file name
Remark: Time spent in excluded routines is attributed to the caller
Routines excluded from instrumentation
mpif90 -O3 -march=bdver1 -fdefault-real-8 \
    -fdefault-double-8 -o general.o \
    -c general.f90

mpif90 -O3 -march=bdver1 -fdefault-real-8 \
    -fdefault-double-8 -o particles_sub.o \
    -c particles_sub.f90
• Use objects as normal
Result directory (aka epik directory)
• After the scan run you will find a result directory in your work directory
• Example: epik_run_64_sum
  – executable: run
  – 64 MPI tasks
  – summary run
• By default scan aborts when the result directory already exists
STEP 3: Examine with the GUI
Starting the GUI
• The command:
    scalasca -examine epik_directory
    square epik_directory
  – post-processes the data (pattern search)
  – starts the GUI
• Use the -s option for post-processing only
  – useful when opening an archive from the GUI
• The GUI can also be installed on e.g. a laptop
GUI window: Time for MPI_Barrier
(Screenshot: GUI panes for Metric, Call tree, and Process)
GUI window: Load imbalance in a Send
Profiling Results: Timings and Bytes
Configuration details
• Provided by Anders Johansen
• Planetary formation simulation:
  – gas evolved on a fixed grid
  – freely moving dust particles
  – interaction between gas & particles via drag forces
• Grid: 128 × 128 × 128
• Processors:
  – ncpus=32, nprocx=1, nprocy=16, nprocz=2
  – ncpus=64, nprocx=1, nprocy=32, nprocz=2
Timings inside a time-step (code status autumn 2012), 32 and 64 MPI tasks:

Routine                  Exectime 32  Exectime 64  MPItime 32  MPItime 64
particles_pde_pencil     27700        26800        0           0
particles_bondconds      14000        14400        1023        3716
calc_selfpotential       6230         6344         6037        9885
particles_pde            8702         8808         0           0
fold_df                  387          771          1872        10100
finalize_isendrcv_bdry   685          1328         2598        4757
Bytes transferred (code status autumn 2012), 32 and 64 MPI tasks:

Routine                  MPItime 32  MPItime 64  MPIbyte 32  MPIbyte 64
particles_pde_pencil     0           0           0           0
particles_bondconds      1023        3716        3.32e10     6.04e10
calc_selfpotential       6037        9885        9.56e11     9.81e11
particles_pde            0           0           0           0
fold_df                  1872        10100       0.96e11     1.77e11
finalize_isendrcv_bdry   2598        4757        1.16e12     2.19e12
Selfpotential Transpose: All-to-all Communication
(Chart: percentage of execution time)
Observations
• Substantial time in MPI: 8% of the runtime
  – severe imbalance
• 1.5% of the runtime in FFTPACK
• Over 3% in other Fortran code
Alltoall version of the transpose

do px = 0, nprocy-1
  ! pack the block destined for rank px contiguously
  send_buf(:, :, :, px) = a(px*ny+1 : (px+1)*ny, :, :)
end do

call MPI_Alltoall(send_buf, sendcount, MPI_REAL, &
                  recv_buf, sendcount, MPI_REAL, &
                  MPI_COMM_XYPLANE, mpierr)

do px = 0, nprocy-1
  do iz = 1, nz
    ! unpack and transpose the block received from rank px
    a(px*ny+1 : (px+1)*ny, :, iz) = transpose(recv_buf(:, :, iz, px))
  end do
end do
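For reference, a minimal set of assumed declarations that would make the sketch above self-contained (the array shapes and the communicator name are assumptions; the actual Pencil code declarations may differ):

real    :: a(nprocy*ny, ny, nz)              ! local slab, first dimension global
real    :: send_buf(ny, ny, nz, 0:nprocy-1)  ! contiguous block per destination rank
real    :: recv_buf(ny, ny, nz, 0:nprocy-1)
integer :: sendcount, mpierr, px, iz
integer :: MPI_COMM_XYPLANE                  ! sub-communicator spanning the xy-plane

sendcount = ny*ny*nz                         ! elements sent to each rank

Packing into contiguous per-rank blocks is what lets a single MPI_Alltoall replace the point-to-point exchange, at the cost of the extra copies.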
Improving the transpose: time in fourier_transform_shear
Version                          CPU    MPI    total
real transp, p2p, fftpack        5.83   10.0   15.9
real transp, a2a, fftpack        5.73   5.59   11.3
Cmplx transp, a2a, fftpack       5.51   5.82   11.3
Cmplx & real tr., a2a, fftpack   5.39   5.69   11.1
• Current code:
  – transposes the real and imaginary parts separately
  – define a complex transpose instead (see the sketch below)
• Don't transpose the trivial imaginary part at the beginning
Times measured in 1000 s, 64 MPI tasks
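A minimal sketch of the idea, with assumed array names and shapes (nx, ny) rather than the actual Pencil routines:

real    :: a_re(ny, nx)            ! purely real input field (assumed shape)
complex :: b(ny, nx), b_t(nx, ny)  ! complex work arrays

! start of the transform chain: the imaginary part is identically zero,
! so transpose only the real part before promoting to complex
b_t = cmplx(transpose(a_re), 0.0)

! later stages: a single complex transpose replaces two real transposes
b_t = transpose(b)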
Selfpotential Efficiency: FFTW
FFTW
• Portable high-performance FFT library
  – compile it yourself
  – MKL offers an FFTW interface
  – calling C from Fortran
• Efficient use of vector instructions (e.g. AVX using 256-bit registers = 4 double words)
  – alignment constraints
  – alignment needs to be handled at plan creation
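Because the plan records the alignment of the arrays, aligned allocation and planning belong together. A minimal sketch using FFTW's Fortran 2003 interface (available in the FFTW 3.3.x mentioned later); the transform size 128 is an arbitrary example:

program fftw_aligned_demo
  use, intrinsic :: iso_c_binding
  implicit none
  include 'fftw3.f03'

  integer(C_INT), parameter :: n = 128
  type(C_PTR) :: p, plan
  complex(C_DOUBLE_COMPLEX), pointer :: data(:)

  ! fftw_alloc_complex returns memory aligned for SIMD (e.g. AVX)
  p = fftw_alloc_complex(int(n, C_SIZE_T))
  call c_f_pointer(p, data, [n])

  ! the plan records the alignment of "data"; executing it on arrays
  ! with a different alignment is unsafe
  plan = fftw_plan_dft_1d(n, data, data, FFTW_FORWARD, FFTW_MEASURE)

  data = (1.0_C_DOUBLE, 0.0_C_DOUBLE)   ! fill after planning: FFTW_MEASURE overwrites
  call fftw_execute_dft(plan, data, data)

  call fftw_destroy_plan(plan)
  call fftw_free(p)
end program fftw_aligned_demo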
Optimising shear
• Shear needs a y-dependent (z-independent) phase factor
  – many complex exponentials
• The original code calculates the phase for all points
• Modification: calculate it once for a given y-point and reuse it for all local z-points (see the sketch below)
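A minimal sketch of the modification; the names (wavenumbers kx, shear offsets deltay, field a) are illustrative, not the actual Pencil variables:

complex :: a(nx, ny, nz)        ! field being transformed (assumed shape)
complex :: cshift(nx)
real    :: kx(nx), deltay(ny)   ! wavenumbers and shear offsets (assumed)
integer :: iy, iz

do iy = 1, ny
  ! the phase factor depends on y only, so compute it once per y-point ...
  cshift = exp(cmplx(0.0, -kx*deltay(iy)))
  do iz = 1, nz
    ! ... and reuse it for every local z-point
    a(:, iy, iz) = a(:, iy, iz)*cshift
  end do
end do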
Timing impact on: fourier_transform_shear
• Times measured in 1000 s, 64 MPI tasks
• Speed-up of almost 2× for the transform shear
• FFTW has a negative impact on MPI performance!

Version                                          CPU    MPI    total
real transp, p2p, fftpack                        5.83   10.0   15.9
Cmplx & real tr., a2a, fftpack                   5.39   5.69   11.1
Cmplx & real tr., a2a, FFTW                      3.95   6.28   10.2
Cmplx & real tr., a2a, FFTW, reused exponential  1.73   6.38   8.1
Overall Performance Impact: GCC and Intel Compilers
Overall times, instrumented: runtime for 2000 iterations
• OpenMPI 1.6.4, Intel 13.1, GCC 4.6.2, FFTW 3.3.2
• 64 cores
• Does Alltoall damage the overall performance?
  – open question at present

Version                                          Gfortran  Intel
real transp, p2p, fftpack                        0.491     0.375
Cmplx & real tr., p2p, FFTW                      0.478     0.361
Cmplx & real tr., p2p, FFTW, reused exponential  0.464     0.360
Cmplx & real tr., a2a, FFTW, reused exponential  0.506     0.391
Summary
• Scalasca: a tool for profiling
• Profiling results for the Pencil code
• Discussed optimisation attempts for the fourier_transform_shear routine:
  – MPI_Alltoall
  – FFTW
• MPI_Alltoall:
  – helps the fourier_transform_shear routine
  – in its current implementation it damages other parts of Pencil such that the overall impact is negative