Scalasca, FFTW & Alltoall. Pencil code: Performance
Joachim Hein, LUNARC, Lund University
In collaboration with Anders Johansen
Outline
• PRACE
• Scalasca
• Scalasca analysis of the Pencil code
• Communications for Fourier transformations
• FFTW
PRACE: An Overview
• 25 PRACE members
• April 23rd, 2010: creation of the legal entity (AISBL) PRACE, with seat in Brussels, Belgium
• 67+ million € from EC FP7 for the preparatory and implementation phases (grants INFSO-RI-211528, 261557, 283493, and 312763), complemented by ~50 million € from the PRACE members
• Interest by: Latvia, Belgium
PRACE is building the top of the pyramid...
First production system available: 1 Petaflop/s IBM BlueGene/P (JUGENE) at GCS (Gauss Centre for Supercomputing) partner FZJ (Forschungszentrum Jülich)
Second production system available: Bull Bullx CURIE at GENCI partner CEA. Full capacity of 1.8 Petaflop/s reached by late 2011.
Third production system available by the end of 2011: 1 Petaflop/s Cray (HERMIT) at GCS partner HLRS (High Performance Computing Center Stuttgart).
Fourth production system available by mid 2012: 3 Petaflop/s IBM (SuperMUC) at GCS partner LRZ (Leibniz-Rechenzentrum).
(Pyramid diagram: Tier-0 at the top, above Tier-1 and Tier-2)
Fifth production system available by August 2012: 1 Petaflop/s IBM BG/Q (FERMI) at CINECA.
Sixth production system available by January 2013: 1 Petaflop/s IBM (MareNostrum) at BSC.
Upgrade: 5.87 Petaflop/s IBM Blue Gene/Q (JUQUEEN)
Selected PRACE activities
• Tier-0 programme
  – Largest machines in Europe
  – Projects can receive several tens of millions of CPU hours
  – Call expected in the September/October time frame
  – Scalability requirements: should be fine for Pencil
  – Also: preparatory access programme, cut-off 2nd September
• Can ask for help from a PRACE HPC expert
• Tier-1 programme
  – DECI programme for access to Tier-1 architectures
  – Selected on a national level
  – In Sweden one can ask for ≈ 6 M CPU hours (Lindgren), typically cut a bit
• Training events and schools
• Website: www.prace-project.eu
  – Info on the above
  – White papers
Performance Analysis Tool: SCALASCA
Scalasca: Overview
• Parallel profiling tool
  – OpenMP
  – MPI
  – Can also be used for serial code
• Aims to help with questions like:
  – Where is my application spending time?
  – Why is it spending time on this?
• Fast and convenient to use
Overview (cont.)
• Developed by:
  – Forschungszentrum Jülich (Germany)
  – German Research School for Simulation Sciences
• Free but copyrighted
• Widely available; architectures include:
  – Cray XT/XE
  – IBM Blue Gene
  – IBM Power
  – “various” Linux x86/x64 clusters
What does it attempt to do?
• Event based, e.g.:
  – an MPI call starts
  – an OpenMP parallel region finishes
  – subroutine “init” starts
• Recording:
  – runtime summarisation (time spent in a routine)
  – event tracing (huge files)
• GUI report browser to view the results
  – Some other tools can also digest the data
Three-step approach
1. Instrument your program (source access required)
   – Automatically during compilation
   – Manually using an API (source modification; see the sketch after this list)
   – PDT instrumentation (still need to try)
2. Execute the program and perform the analysis
   – Events are recorded while executing
   – Check: the runtime is still reasonable
3. Look at the results
   – Use the GUI
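For the manual option in step 1, a minimal sketch of marking a user region, assuming the EPIK user API that ships with Scalasca 1.x (macro names quoted from memory; the routine and the region name "rhs_loop" are made up for illustration, so check the Scalasca manual for the exact interface):

#include "epik_user.inc"
subroutine compute_rhs(f, df)
  real, intent(in)  :: f(:)
  real, intent(out) :: df(:)
  ! register a user-defined region and time everything between START and END
  EPIK_USER_REG(r_rhs, "rhs_loop")
  EPIK_USER_START(r_rhs)
  df = -f              ! the work to be measured
  EPIK_USER_END(r_rhs)
end subroutine compute_rhs

The file needs to pass through the C preprocessor (e.g. a .F90 suffix) so that the macros expand during an instrumented build.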
STEP 1: Instrumentation
Automatic instrumentation (general)
• Prefix your compiler call with either of:
    scalasca -instrument
    skin
• Example:
    mpif90 -o myprog.x myprog.f90
  becomes:
    scalasca -instrument mpif90 -o myprog.x myprog.f90
    skin mpif90 -o myprog.x myprog.f90
Feedback for successful instrumentation
[jhein@alarik]$ skin mpif90 -O3 -march=bdver1 \
    -fdefault-real-8 \
    -fdefault-double-8 nochemistry.o nochiral.o \
    nocosmicray.o nocosmicrayflux.o density.o \
    nodustdensity.o nodustvelocity.o entropy.o \
    ... \
    pencil_check.o run.o -o run.x

INFO: Instrumented executable for MPI measurement
[jhein@alarik]$
STEP 2: Analyse the Application
Use the analysis tool to run the executable
• MPI example:
    scalasca -analyze mpiexec -np 4 ./myprog.x
    scan mpiexec -np 4 ./myprog.x
• OpenMP code (or serial):
    scalasca -analyze ./myprog.x
    scan ./myprog.x
Scalasca set-up for the Pencil code
• Modify your configuration file:
    FC = skin mpif90
    mpiexec=scan mpiexec -bind-to-core
• Use the modified configuration file for building and running
Remark: The above also engages task binding; use the same binding for all runs
Important comments on runtime
• The above creates a summary report
• Have a look at the runtime:
  – small routines can make the runtime explode
  – if it is bad: use a filter file
  – if it is really bad: compile the relevant file without instrumentation
• The analysis step can give you hints on the culprits
  – look for high invocation counts
Example: Filter file for gfortran 4.6.2
[jhein@alarik Benchmark]$ more not_to_measure.filt
__particles_sub_MOD_get_rhopswarm_point
__general_MOD_keep_compiler_quiet_i
__general_MOD_keep_compiler_quiet_r4d
__general_MOD_keep_compiler_quiet_i2d
__general_MOD_random_number_wrapper_0
__general_MOD_random_number_wrapper_1
__general_MOD_ran0
__particles_map_MOD_interpolate_quadratic_spline
__particles_MOD_get_frictiontime
Filter file
• Name the routines not to measure in the filter file
• Option of the analyser:
    scan -f filterfile mpiexec -np 128 myprog.x
• Or set an environment variable:
    export EPK_FILTER=filterfile
Remark: Replace filterfile with the actual file name
Remark: Time spent in excluded routines is attributed to the caller
Routines excluded from instrumentation
mpif90 -O3 -march=bdver1 -fdefault-real-8 \
    -fdefault-double-8 -o general.o \
    -c general.f90

mpif90 -O3 -march=bdver1 -fdefault-real-8 \
    -fdefault-double-8 -o particles_sub.o \
    -c particles_sub.f90
• Use objects as normal
Result directory (aka epik directory)
• After the scan run you will find a result directory in your work directory
• Example: epik_run_64_sum
  – executable: run
  – 64 MPI tasks
  – summary run
• By default scan aborts when the result directory already exists
STEP 3: Examine with the GUI
Starting the GUI
• The command:
    scalasca -examine epik_directory
    square epik_directory
  – post-processes the data (pattern search)
  – starts the GUI
• Use the -s option for post-processing only
  – useful when opening an archive from the GUI
• The GUI can also be installed on e.g. a laptop
GUI window: Time for MPI_Barrier
(Screenshot: GUI panes for Metric, Call tree, and Process)
GUI window: Load imbalance in a Send
Profiling Results: Timings and Bytes
Configuration details
• Provided by Anders Johansen
• Planetary formation simulation:
  – gas evolved on a fixed grid
  – freely moving dust particles
  – interaction between gas & particles via drag forces
• Grid: 128 × 128 × 128
• Processors:
  – ncpus=32, nprocx=1, nprocy=16, nprocz=2
  – ncpus=64, nprocx=1, nprocy=32, nprocz=2
Timings inside a time-step (code status autumn 2012), 32 and 64 MPI tasks:

Routine                  Exectime 32  Exectime 64  MPItime 32  MPItime 64
particles_pde_pencil     27700        26800        0           0
particles_bondconds      14000        14400        1023        3716
calc_selfpotential       6230         6344         6037        9885
particles_pde            8702         8808         0           0
fold_df                  387          771          1872        10100
finalize_isendrcv_bdry   685          1328         2598        4757
Bytes transferred (code status autumn 2012), 32 and 64 MPI tasks:

Routine                  MPItime 32  MPItime 64  MPIbyte 32  MPIbyte 64
particles_pde_pencil     0           0           0           0
particles_bondconds      1023        3716        3.32e10     6.04e10
calc_selfpotential       6037        9885        9.56e11     9.81e11
particles_pde            0           0           0           0
fold_df                  1872        10100       0.96e11     1.77e11
finalize_isendrcv_bdry   2598        4757        1.16e12     2.19e12
Selfpotential Transpose: All-to-all Communication
(Chart: percentage of execution time)
Observations
• Substantial time in MPI: 8% of the runtime
  – severe imbalance
• 1.5% of the runtime in FFTPACK
• Over 3% in other Fortran code
Alltoall version of the transpose

do px = 0, nprocy-1
  ! pack the block destined for rank px contiguously
  send_buf(:, :, :, px) = a(px*ny+1 : (px+1)*ny, :, :)
end do

call MPI_Alltoall(send_buf, sendcount, MPI_REAL, &
                  recv_buf, sendcount, MPI_REAL, &
                  MPI_COMM_XYPLANE, mpierr)

do px = 0, nprocy-1
  do iz = 1, nz
    ! unpack and transpose the block received from rank px
    a(px*ny+1 : (px+1)*ny, :, iz) = transpose(recv_buf(:, :, iz, px))
  end do
end do
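For reference, a minimal set of assumed declarations that would make the sketch above self-contained (the array shapes and the communicator name are assumptions; the actual Pencil code declarations may differ):

real    :: a(nprocy*ny, ny, nz)              ! local slab, first dimension global
real    :: send_buf(ny, ny, nz, 0:nprocy-1)  ! contiguous block per destination rank
real    :: recv_buf(ny, ny, nz, 0:nprocy-1)
integer :: sendcount, mpierr, px, iz
integer :: MPI_COMM_XYPLANE                  ! sub-communicator spanning the xy-plane

sendcount = ny*ny*nz                         ! elements sent to each rank

Packing into contiguous per-rank blocks is what lets a single MPI_Alltoall replace the point-to-point exchange, at the cost of the extra copies.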
Improving the transpose: time in fourier_transform_shear
Version                          CPU    MPI    total
real transp, p2p, fftpack        5.83   10.0   15.9
real transp, a2a, fftpack        5.73   5.59   11.3
Cmplx transp, a2a, fftpack       5.51   5.82   11.3
Cmplx & real tr., a2a, fftpack   5.39   5.69   11.1
• Current code:
  – transposes the real and imaginary parts separately
  – define a complex transpose instead (see the sketch below)
• Don't transpose the trivial imaginary part at the beginning
Times measured in 1000 s, 64 MPI tasks
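A minimal sketch of the idea, with assumed array names and shapes (nx, ny) rather than the actual Pencil routines:

real    :: a_re(ny, nx)            ! purely real input field (assumed shape)
complex :: b(ny, nx), b_t(nx, ny)  ! complex work arrays

! start of the transform chain: the imaginary part is identically zero,
! so transpose only the real part before promoting to complex
b_t = cmplx(transpose(a_re), 0.0)

! later stages: a single complex transpose replaces two real transposes
b_t = transpose(b)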
Selfpotential Efficiency: FFTW
FFTW
• Portable high-performance FFT library
  – compile it yourself
  – MKL offers an FFTW interface
  – calling C from Fortran
• Efficient use of vector instructions (e.g. AVX using 256-bit registers = 4 double words)
  – alignment constraints
  – alignment needs to be handled at plan creation
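Because the plan records the alignment of the arrays, aligned allocation and planning belong together. A minimal sketch using FFTW's Fortran 2003 interface (available in the FFTW 3.3.x mentioned later); the transform size 128 is an arbitrary example:

program fftw_aligned_demo
  use, intrinsic :: iso_c_binding
  implicit none
  include 'fftw3.f03'

  integer(C_INT), parameter :: n = 128
  type(C_PTR) :: p, plan
  complex(C_DOUBLE_COMPLEX), pointer :: data(:)

  ! fftw_alloc_complex returns memory aligned for SIMD (e.g. AVX)
  p = fftw_alloc_complex(int(n, C_SIZE_T))
  call c_f_pointer(p, data, [n])

  ! the plan records the alignment of "data"; executing it on arrays
  ! with a different alignment is unsafe
  plan = fftw_plan_dft_1d(n, data, data, FFTW_FORWARD, FFTW_MEASURE)

  data = (1.0_C_DOUBLE, 0.0_C_DOUBLE)   ! fill after planning: FFTW_MEASURE overwrites
  call fftw_execute_dft(plan, data, data)

  call fftw_destroy_plan(plan)
  call fftw_free(p)
end program fftw_aligned_demo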
Optimising shear
• Shear needs a y-dependent (z-independent) phase factor
  – many complex exponentials
• The original code calculates the phase for all points
• Modification: calculate it once for a given y-point and reuse it for all local z-points (see the sketch below)
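A minimal sketch of the modification; the names (wavenumbers kx, shear offsets deltay, field a) are illustrative, not the actual Pencil variables:

complex :: a(nx, ny, nz)        ! field being transformed (assumed shape)
complex :: cshift(nx)
real    :: kx(nx), deltay(ny)   ! wavenumbers and shear offsets (assumed)
integer :: iy, iz

do iy = 1, ny
  ! the phase factor depends on y only, so compute it once per y-point ...
  cshift = exp(cmplx(0.0, -kx*deltay(iy)))
  do iz = 1, nz
    ! ... and reuse it for every local z-point
    a(:, iy, iz) = a(:, iy, iz)*cshift
  end do
end do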
Timing impact on: fourier_transform_shear
• Times measured in 1000 s, 64 MPI tasks
• Speed-up of almost 2× for the transform shear
• FFTW has a negative impact on MPI performance!

Version                                          CPU    MPI    total
real transp, p2p, fftpack                        5.83   10.0   15.9
Cmplx & real tr., a2a, fftpack                   5.39   5.69   11.1
Cmplx & real tr., a2a, FFTW                      3.95   6.28   10.2
Cmplx & real tr., a2a, FFTW, reused exponential  1.73   6.38   8.1
Overall Performance Impact: GCC and Intel Compilers
Overall times, instrumented: runtime for 2000 iterations
• OpenMPI 1.6.4, Intel 13.1, GCC 4.6.2, FFTW 3.3.2
• 64 cores
• Does Alltoall damage the overall performance?
  – open question at present

Version                                          Gfortran  Intel
real transp, p2p, fftpack                        0.491     0.375
Cmplx & real tr., p2p, FFTW                      0.478     0.361
Cmplx & real tr., p2p, FFTW, reused exponential  0.464     0.360
Cmplx & real tr., a2a, FFTW, reused exponential  0.506     0.391
Summary
• Scalasca: a tool for profiling
• Profiling results for the Pencil code
• Discussed optimisation attempts for the fourier_transform_shear routine:
  – MPI_Alltoall
  – FFTW
• MPI_Alltoall:
  – helps the fourier_transform_shear routine
  – in its current implementation it damages other parts of Pencil such that the overall impact is negative