Porting highly computationally intensive applications to hybrid systems
TRANSCRIPT
Porting highly computationally intensive applications to hybrid systems:
case studies from the Irish Centre for High-End Computing
Ivan Girotto - ICHEC GPU Team Leader [email protected] http://gpgpu.ichec.ie
… in collaboration with Filippo Spiga
Outline • The Irish Centre for High-End Computing (ICHEC)
• GPGPU Computing @ ICHEC
• Main topics:
• Use of accelerated libraries for porting scientific applications
• Quantum ESPRESSO
• GPU computing for business: an ICHEC case study
17 – 02 - 2012 2 Ivan Girotto - ICHEC - PRACE Workshop
ICHEC
• Founded in mid-2005 by SFI and HEA
• ~25 staff members
• Two main systems:
SGI Altix ICE 8200EX (~4k cores) and Bull Novascale R422-E2 (~500 cores)
• 50 NVIDIA GPUs in national production service
• Objectives:
Provide computational resources
Provide education and training to third-level institutions
Technology transfer and consultancy services to develop the Irish smart economy
17 – 02 - 2012 3 Ivan Girotto - ICHEC - PRACE Workshop
ICHEC’s GPU Computing Facilities /1
• Stoney: GPU National Production Service
Bull Novascale R422-E2: 64 x 2 quad-core Intel Xeon X5560 @ 2.8GHz, peak performance = 5.73 TFlops (CPUs only)
+ 48 x NVIDIA M2090 in 3 x Dell PowerEdge C410x, peak performance = 37.6 TFlops (CPUs + GPUs)
17 – 02 - 2012 4 Ivan Girotto - ICHEC - PRACE Workshop
ICHEC’s GPU Computing Facilities /2
• Gemini: software development platform
16 x NVIDIA M2090 + 2 x Dell PowerEdge (C410x + C6145)
• Development, Education & Training:
~10 x NVIDIA GTX 480
17 – 02 - 2012 5 Ivan Girotto - ICHEC - PRACE Workshop
AMBER benchmark: Nucleosome, 25,095 atoms (implicit solvent, GB)
[Plot: performance in ns/day as a function of the number of nodes/GPGPUs, CPU-only vs CPU+GPGPU]
http://ambermd.org/gpus/benchmarks.htm 17 – 02 - 2012 6 Ivan Girotto - ICHEC - PRACE Workshop
courtesy of Dr. Martin Peters
17 – 02 - 2012 7 Ivan Girotto - ICHEC - PRACE Workshop
LAMMPS-CUDA
[Plot: elapsed time (sec.) vs number of Stoney nodes]
courtesy of Dr. Nicola Varini
… see http://www.nvidia.com/object/vertical_solutions.html
GPGPU Computing @ ICHEC • PRACE Project 1IP & 2IP
Sub-task leadership T7.5e “Accelerators” – 1IP
Sub-task leadership T7.2d and T7.3d – 2IP
17 – 02 - 2012 8 Ivan Girotto - ICHEC - PRACE Workshop
• AutoTune
Analyse and tune serial, parallel & GPGPU codes
Aims to close the loop from profiling to tuning
GPGPU support built in
ICHEC brings extensive application experience
For more info contact Prof. Michael Gerndt (PM)
PRACE T7.5 • Programming Techniques for High Performance Applications
• The task will work with users to implement new programming techniques which have the potential to facilitate significant improvements in their applications’ performance
• This task will investigate the following areas and produce best-practice guides covering:
Analysis of scalable algorithms and libraries to enhance scientific applications
Optimisation of applications on multi-core/many-core systems
Exploitation of accelerators for real applications
17 – 02 - 2012 Ivan Girotto - ICHEC - PRACE Workshop 9
Speed-up Scientific Codes
17 – 02 - 2012 10 Ivan Girotto - ICHEC - PRACE Workshop
[Diagram: the compute-intensive functions of the application code run on the GPU, while the rest of the sequential code stays on the CPU]
3 Ways to Accelerate on GPU:
• Libraries
• Directives (OpenACC) (see the sketch below)
• CUDA (use CUDA to parallelise the hot spots by hand)
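To make the directives route concrete, here is a minimal OpenACC sketch (a hypothetical saxpy routine, not taken from the talk): the pragma asks the compiler to generate the GPU kernel and the data movement, with no hand-written CUDA.

/* Hypothetical example: accelerating a simple saxpy loop with OpenACC directives. */
void saxpy(int n, float a, const float *x, float *y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}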
CUDA library ecosystem
APPLICATIONS
CUDA C
3rd Party Libraries
CUDA Libraries
• MAGMA • CULA • libJacket • CUSP • IMSL Library • NAG • OpenVidia • OpenCurrent • phiGEMM
• CUFFT • CUBLAS • CUSPARSE • CURAND • THRUST • NPP • Libm (math.h) • System Call
http://developer.nvidia.com/gpu-accelerated-libraries
17 – 02 - 2012 11 Ivan Girotto - ICHEC - PRACE Workshop
GPU Kernel execution workflow
1. Copy data from host memory to device memory
2. Launch kernel
3. Execute GPU kernel
4. Copy result back to host memory
[Diagram: CPU with host memory (~30/40 GB/s) and GPU with device memory (~110/120 GB/s)]
17 – 02 - 2012 12 Ivan Girotto - ICHEC - PRACE Workshop
CPU vs GPU synchronization
[Diagram: CPU with host memory (~30/40 GB/s) and GPU with device memory (~110/120 GB/s), linked by PCIe at ~8 GB/s]
17 – 02 - 2012 13 Ivan Girotto - ICHEC - PRACE Workshop
kernel<<< Nblocks, Nthreads >>>( … )
cudaMemcpy(dst, src, count, kind)
cudaMemcpyAsync(dst, src, count, kind, stream)
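As an illustration of these calls, a minimal sketch of the copy / launch / copy-back workflow described above (hypothetical kernel and sizes; error checking omitted):

#include <cuda_runtime.h>

/* Hypothetical kernel: multiply every element by a constant. */
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

void scale_on_gpu(float *host, int n)
{
    float *dev;
    size_t bytes = n * sizeof(float);

    cudaMalloc((void **)&dev, bytes);                       /* device memory      */
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);   /* 1. copy data       */
    scale<<<(n + 255) / 256, 256>>>(dev, 2.0f, n);          /* 2.-3. launch & run */
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);   /* 4. copy result     */
    cudaFree(dev);
}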
CUBLAS - Thunking vs Non-Thunking
Thunking:
• Allows existing applications to be interfaced transparently through wrappers
• For each call to the library, the wrappers allocate GPU memory, copy source data from CPU memory space to GPU memory space, call CUBLAS, and finally copy the results back to CPU memory space and deallocate the GPGPU memory
• Intended for light testing, because of the per-call overhead
Non-Thunking (default):
• Existing applications need to be slightly modified to manage data transfer to and from GPGPU memory space (CUBLAS_ALLOC, CUBLAS_FREE, CUBLAS_SET_VECTOR, CUBLAS_GET_VECTOR, CUBLAS_SET_MATRIX, and CUBLAS_GET_MATRIX)
• Intended for production code, high flexibility
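For illustration, a minimal non-thunking sketch using the legacy CUBLAS API: the application stages the matrices on the GPU itself and calls the GPU GEMM explicitly (the wrapper function and the use of square matrices are illustrative; error checking omitted):

#include <cublas.h>   /* legacy CUBLAS API */

/* Non-thunking sketch: C = A * B for n x n double-precision matrices. */
void gemm_on_gpu(int n, const double *A, const double *B, double *C)
{
    double *dA, *dB, *dC;

    cublasInit();
    cublasAlloc(n * n, sizeof(double), (void **)&dA);
    cublasAlloc(n * n, sizeof(double), (void **)&dB);
    cublasAlloc(n * n, sizeof(double), (void **)&dC);

    cublasSetMatrix(n, n, sizeof(double), A, n, dA, n);   /* host -> device */
    cublasSetMatrix(n, n, sizeof(double), B, n, dB, n);

    cublasDgemm('N', 'N', n, n, n, 1.0, dA, n, dB, n, 0.0, dC, n);

    cublasGetMatrix(n, n, sizeof(double), dC, n, C, n);   /* device -> host */

    cublasFree(dA);  cublasFree(dB);  cublasFree(dC);
    cublasShutdown();
}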
17 – 02 - 2012 14 Ivan Girotto - ICHEC - PRACE Workshop
Existing work: DGEMM for Linpack (HPL), Phillips & Fatica: http://www.nvidia.com/content/GTC-2010/pdfs/2057_GTC2010.pdf
17 – 02 - 2012 15 Ivan Girotto - ICHEC - PRACE Workshop
The PHIGEMM library from ICHEC • A library that you use like CUBLAS
Transparently manages the thunking operations
Supports Sgemm, Dgemm, Cgemm and Zgemm
Asynchronous data transfer (via PINNED Memory)
Multi-GPU management through a single process (CUDA 4.x)
• Evolving: possible improvements via multi-stream out-of-order execution (see Phillips & Fatica, GTC 2010)
• Freely downloadable from http://qe-forge.org/projects/phigemm/ 17 – 02 - 2012 16 Ivan Girotto - ICHEC - PRACE Workshop
phiGEMM matches single-GPU performance
[Plot: GFLOPS vs M = N = K (DP size, 1024–10240) on 2 x Intel Xeon X5680 3.33GHz + NVIDIA Tesla C2050; series: MKL, thunking CUBLAS, CUBLAS (peak), phiGEMM, MKL + CUBLAS theoretical peak; H2D = ~5.5 GB/s, D2H = ~6.0 GB/s]
http://qe-forge.org/projects/phigemm/
17 – 02 - 2012 17 Ivan Girotto - ICHEC - PRACE Workshop
phiGEMM dual GPU / single bus
[Plot: GFLOPS vs M = N = K (DP size, 1024–10240) on 2 x Intel Xeon X5680 3.33GHz + 2 NVIDIA Tesla C2050; series: MKL, thunking CUBLAS (1 GPU), CUBLAS (peak), phiGEMM, MKL + CUBLAS peak; H2D = ~2.8 GB/s, D2H = ~3.2 GB/s]
http://qe-forge.org/projects/phigemm/
17 – 02 - 2012 18 Ivan Girotto - ICHEC - PRACE Workshop
phiGEMM dual GPU / dual bus
[Plot: GFLOPS vs M = N = K (DP size, 1024–10240) on 2 x Intel Xeon X5680 3.33GHz + 2 NVIDIA Tesla C2050; series: MKL, thunking CUBLAS, CUBLAS (peak), phiGEMM, MKL + CUBLAS peak; GPU0 H2D = ~4.8 GB/s, D2H = ~5.0 GB/s; GPU1 H2D = ~4.3 GB/s, D2H = ~4.8 GB/s; phiGEMM is faster here]
http://qe-forge.org/projects/phigemm/
17 – 02 - 2012 19 Ivan Girotto - ICHEC - PRACE Workshop
Performance is dependent on problem size
• phiGEMM can run GEMM on matrices that do not fit on a single GPU
• Recursive calls to phiGEMM with smaller sub-matrices (see the sketch after this figure)
[Diagram: C = A x B split into sub-blocks A1, B1, C1 and processed in STEP 1 to STEP 4, distributing the work across GPU, CPU, and CPU & GPU together]
http://qe-forge.org/projects/phigemm/
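The following is not the phiGEMM source, only a hedged sketch of the underlying idea for one level of the split: the first columns of B and C are computed by CUBLAS on the GPU while the host BLAS works on the remaining columns. The function name, the fixed split point, and the assumption that cublasInit() has already been called are all illustrative; phiGEMM tunes the split and recurses when the sub-problem still does not fit.

#include <cublas.h>
#include <cblas.h>

/* Hedged sketch of CPU/GPU GEMM splitting (column-major, C = A * B, m x n result):
 * the first n_gpu columns of B and C go to the GPU, the rest stay on the CPU. */
void split_dgemm(int m, int n, int k,
                 const double *A, const double *B, double *C, int n_gpu)
{
    double *dA, *dB, *dC;
    int n_cpu = n - n_gpu;

    cublasAlloc(m * k,     sizeof(double), (void **)&dA);
    cublasAlloc(k * n_gpu, sizeof(double), (void **)&dB);
    cublasAlloc(m * n_gpu, sizeof(double), (void **)&dC);

    cublasSetMatrix(m, k,     sizeof(double), A, m, dA, m);
    cublasSetMatrix(k, n_gpu, sizeof(double), B, k, dB, k);

    /* GPU part: the GEMM kernel is launched asynchronously with respect to the host... */
    cublasDgemm('N', 'N', m, n_gpu, k, 1.0, dA, m, dB, k, 0.0, dC, m);

    /* ...so the CPU part can proceed concurrently on the remaining columns. */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, n_cpu, k, 1.0, A, m, B + (size_t)n_gpu * k, k,
                0.0, C + (size_t)n_gpu * m, m);

    /* Fetching the GPU result synchronizes with the GPU GEMM. */
    cublasGetMatrix(m, n_gpu, sizeof(double), dC, m, C, m);

    cublasFree(dA);  cublasFree(dB);  cublasFree(dC);
}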
17 – 02 - 2012 20 Ivan Girotto - ICHEC - PRACE Workshop
BIG phiGEMM, multi GPU / dual bus
[Bar chart: GFLOPS for M = K = N = 25000 (DP) = 15 GBytes on 2 x Intel Xeon X5670 2.93GHz + 4 NVIDIA Tesla C2050; 1 GPU = 277 GFLOPS, 2 GPUs = 551 GFLOPS (x2.0), 4 GPUs = 942 GFLOPS (x3.4, with 2 GPUs per PCI bus!); each bar split into CUBLAS (GPU) and MKL (CPU) contributions]
http://qe-forge.org/projects/phigemm/
17 – 02 - 2012 21 Ivan Girotto - ICHEC - PRACE Workshop
phiGEMM final comments
• CPU/GPU workload distribution
• Full *GEMM library for multi-GPU hybrid systems
• HPC benchmark for conventional and co-processor-accelerated computers: performance peak, effective memory/PCI bus bandwidth
• http://qe-forge.org/projects/phigemm/
17 – 02 - 2012 Ivan Girotto - ICHEC - PRACE Workshop 22
QUANTUM ESPRESSO in a nutshell
• QUANTUM ESPRESSO is an integrated suite of computer codes for electronic-structure calculations and materials modeling at the nanoscale: Density-Functional Theory, plane waves, pseudo-potentials, …
• Free, open-source and widely used in academia and industry
• The current version (4.3.2) is more than half a million lines of code: Fortran90/95, Fortran77, C, Tcl scripts, input examples, …
• Ported to several HPC architectures Linux clusters, CRAY supercomputers, IBM Power & BlueGene, NEC, …
• It supports MPI and OpenMP for a highly scalable parallel implementation
• It systematically uses standardized mathematical libraries BLAS, LAPACK, FFTW, …
• Two main packages: PWSCF and CP
Ivan Girotto - ICHEC - GPU Computing 18 – 10 - 2011 23
Scenario
• Selected as a key application code for the EU Materials Science community within WP7 of PRACE-1IP
• Interest in this software package expressed by Irish research institutions
• Perfect synergy between partners within this project
18 – 10 - 2011 Ivan Girotto - ICHEC - GPU Computing 24
Performance Analysis (2IP – WP8)
17 – 02 - 2012 Ivan Girotto - ICHEC - PRACE Workshop 25
GPU enablement, a deep look inside: where the libraries are
[Flowchart of the PWscf structure-optimization / self-consistency cycle: structure optimization; setup of the initial physical quantities; calculation of wavefunctions / ortho-normalization; diagonalization of the Hamiltonian matrix (for each k-point); calculation of the charge density; calculation of the new potentials (non-local potentials); self-consistency achieved?; calculation of the new forces and stresses; forces and stresses equal to zero?; calculation of new atomic positions (and lattice parameters)]
Library hotspots highlighted in the figure: FFT + GEMM + eigensolvers (LAPACK); FFT; FFT + GEMM
18 – 10 - 2011 26 Ivan Girotto - ICHEC - GPU Computing
fft3Dtest.cu - Profiling
[Plots: 3D-FFT on multiple GPUs; time [s] for 1, 2 and 4 GPUs on batches of 4096 x 64^3, 512 x 128^3 and 64 x 256^3 transforms; a second, log-scale plot compares the same runs against single-core FFTW]
17 – 02 - 2012 27 Ivan Girotto - ICHEC - PRACE Workshop
CUDA VLOC_PSI_K – single node
[Data flow: PSI -> PSIC -> CUFFT (G->R) -> products -> CUFFT (R->G) -> PSIC -> HPSI]
Ivan Girotto - ICHEC - GPU Computing 18 – 10 - 2011 28
Serial diag. vs SCALAPACK vs MAGMA
18 – 10 - 2011 29 Ivan Girotto - ICHEC - GPU Computing
4.9x
Performance analysis on single node
• It extensively uses libraries
– BLAS -> phiGEMM (CUDA vs single-core, ~5x)
– LAPACK (CUDA vs single-core, ~5x)
– FFTW (mainly in VLOC_PSI)
• It already supports OpenMP well
• Wall-time is mainly spent in a few routines
– ADDUSDENS (CUDA vs single-core, ~20x)
– NEWD (CUDA vs single-core, ~7x)
– VLOC_PSI (CUDA vs single-core, ~9x)
18 – 10 - 2011 30 Ivan Girotto - ICHEC - GPU Computing
Benchmarking
18 – 10 - 2011 31 Ivan Girotto - ICHEC - GPU Computing
[Benchmark chart: CPU vs GPU-accelerated runs on an ICHEC workstation, Intel six-core X5650 + 1 NVIDIA C2050]
PINNED (non-pageable) memory
• allows asynchronous data transfers (no CPU needed, the DMA engine works alone; see the sketch below);
• allows overlapping computation and communication;
• since it is page-locked, the OS cannot move it;
• if you do not deallocate it, it never disappears;
• don’t be greedy!! The overall performance of the system is affected.
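A minimal sketch of pinned allocation plus an asynchronous copy on a stream (hypothetical buffer size; error checking omitted):

#include <cuda_runtime.h>

void pinned_transfer_example(size_t n)
{
    float *h_buf, *d_buf;
    cudaStream_t stream;

    cudaStreamCreate(&stream);
    cudaMallocHost((void **)&h_buf, n * sizeof(float));   /* pinned host memory */
    cudaMalloc((void **)&d_buf, n * sizeof(float));

    /* Asynchronous copy: returns immediately, the DMA engine moves the data
       while the CPU (or another stream) keeps working. Only possible because
       h_buf is page-locked. */
    cudaMemcpyAsync(d_buf, h_buf, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);

    /* ... enqueue kernels on `stream` here ... */

    cudaStreamSynchronize(stream);

    cudaFree(d_buf);
    cudaFreeHost(h_buf);       /* always release pinned memory when done */
    cudaStreamDestroy(stream);
}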
17 – 02 - 2012 32 Ivan Girotto - ICHEC - PRACE Workshop
Use case: phiGEMM
[Timelines comparing execution without and with PINNED memory: CPU activity, GPU activity, and H2D/D2H transfers; with pinned memory the transfers run asynchronously and overlap with computation]
17 – 02 - 2012 33 Ivan Girotto - ICHEC - PRACE Workshop
Considerations on PINNED memory
[Timing comparison: 4m23.05s vs 4m35.05s (no MPI here!)]
17 – 02 - 2012 34 Ivan Girotto - ICHEC - PRACE Workshop
Integrating the CUDA components
At the end…
• Original data distribution is preserved
• CUDA libraries are mixed with explicit CUDA kernels
• One single, large memory allocation is done on the GPU at start-up
• Kernels are fully “abstracted”, overlapping computation and data transfer wherever possible (but not yet everywhere!)
• All the MPI communications occur outside the ported routines
Moreover, PWSCF…
• detects GPUs and binds each GPU to an MPI process (see the sketch below)
• can share one GPU between multiple MPI processes
• can assign more than one GPU to each MPI process
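As a rough illustration of the binding step only (not the actual PWscf logic, which also handles sub-ranks and GPU-memory bookkeeping), an MPI process could pick its device like this, assuming MPI is already initialised and ranks are packed per node:

#include <mpi.h>
#include <cuda_runtime.h>

/* Illustrative MPI-rank-to-GPU binding: processes sharing a node take turns
   on the available devices (round-robin), so several MPI processes may share
   one GPU when there are more ranks than devices. */
void bind_gpu_to_rank(void)
{
    int rank, ndev;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaGetDeviceCount(&ndev);
    cudaSetDevice(rank % ndev);
}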
Ivan Girotto - ICHEC - GPU Computing 18 – 10 - 2011 35
MPI/Process-GPU scenarios
[Matrix of scenarios: serial and parallel runs, each with a single GPU or multiple GPUs]
Ivan Girotto - ICHEC - GPU Computing 18 – 10 - 2011 36
MPI-GPU binding & GPU memory management
NOTE: PWscf keeps track of the amount of GPU memory available
[Diagram: each MPI process (sub-rank 0, sub-rank 1, …) holds a pointer to its own region of CUDA memory; the GPU memory is NEVER allocated 100%!]
Ivan Girotto - ICHEC - GPU Computing 18 – 10 - 2011 37
MPI:GPU Ratios (MGST-hex, parallel)
18 – 10 - 2011 38 Ivan Girotto - ICHEC - GPU Computing
Parallelization levels in PWSCF
• Images: only for Nudged Elastic Band (NEB) calculations
• K-points: distribution and scalability over k-points (if more than one); no memory scaling
• Plane-waves: distribution of wave-function coefficients and of real-grid points; good memory scaling, good overall scalability, load balancing
• Linear algebra & task groups: iterative diagonalization (fully parallel or serial); smart grouping of 3D FFTs to reduce compulsory MPI communications
• Multi-threaded kernels: OpenMP handled explicitly or implicitly; extends the scaling on multi-core machines with “limited” memory
Ivan Girotto - ICHEC - GPU Computing 18 – 10 - 2011 39
PWSCF GPU, results & benchmarks (parallel – MGST-hex)
[Chart: speedups of 3.38x, 2.96x and 2.43x]
72 MPI x 2 OpenMP + 24 GPUs (GPU:MPI = 3:1). Acceleration, not scalability!
Ivan Girotto - ICHEC - GPU Computing 18 – 10 - 2011 40
PRACE Preparatory Access: PWSCF, input of 489 atoms, 3 SCF steps, 24 MPI x 6 OMP (144 cores), PLX
[Chart: observed speedups of 2.8x, 3.1x, 10x, 4x and ~2.3x]
Ivan Girotto - ICHEC - GPU Computing 18 – 10 - 2011 41
CUDA VLOC_PSI_K – parallel
[Data flow, original CPU version: PSI -> PSIC -> FFT (G->R) -> products -> FFT (R->G) -> PSIC -> HPSI, with PSIC DISTRIBUTED across MPI processes]
[Data flow, GPU version: PSI -> PSIC -> CUFFT (G->R) -> products -> CUFFT (R->G) -> PSIC -> HPSI]
Ivan Girotto - ICHEC - GPU Computing 18 – 10 - 2011 42
CUDA VLOC_PSI_K – parallel
[Data flow: PSI -> PSIC -> CUFFT (G->R) -> products -> CUFFT (R->G) -> PSIC -> HPSI, with “MPI_Allgatherv” / “MPI_Allscatterv” steps and multiple LOCAL grids to compute]
Same communication as the original P3DFFT, but more overlapping!!
Ivan Girotto - ICHEC - GPU Computing 18 – 10 - 2011 43
Serial diag. vs SCALAPACK vs MAGMA
18 – 10 - 2011 44 Ivan Girotto - ICHEC - GPU Computing
4.9x
Achievements
• Scientific Computing, NVIDIA GTC webinar “The Practical Reality of Heterogeneous Super Computing”
• At the last SC2011: Dell booth and a number of talks given by Rob Farber
• CUDA text book “CUDA Application Design and Development”
• Invited speaker for GPUComputing.net: live class for the University of Illinois at Urbana-Champaign
• Invited lecture at the CINECA Advanced School of Parallel Computing and the last PRACE winter school
• Filippo Spiga & Ivan Girotto, phiGEMM: a CPU-GPU library for porting Quantum ESPRESSO on hybrid systems, IEEE PDP2012
• GTC2012
• QE-GPU and phiGEMM published repository & qe-gpu forum 17 – 02 - 2012 Ivan Girotto - ICHEC - PRACE Workshop 45
Future work
18 – 10 - 2011 46 Ivan Girotto - ICHEC - GPU Computing
• PRACE WP8 – 2IP => T. Schulthess, C. Gheller
• Enable QE for future generations of EU Tier-0 systems
• Scalable approach to accelerated 3DFFT
• Scalable approach to accelerated eigensolvers
• Multithreading optimization (OpenMP)
• Linear algebra (phiGEMM model)
• GPU-enabled version of CP code
CURAND
17 – 02 - 2012 Ivan Girotto - ICHEC - PRACE Workshop 47
• Library for generating random numbers
• Features:
XORWOW pseudo-random generator
Sobol’ quasi-random number generators
Host API for generating random numbers in bulk (see the sketch below)
Inline implementation allows use inside GPU functions/kernels
Single- and double-precision uniform, normal and log-normal distributions
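A minimal sketch of the host API (generator choice, buffer and seed are illustrative; error checking omitted):

#include <curand.h>
#include <cuda_runtime.h>

/* Fill a device array with uniform random numbers using the CURAND host API. */
void fill_uniform(float *d_out, size_t n, unsigned long long seed)
{
    curandGenerator_t gen;

    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_XORWOW);  /* XORWOW generator        */
    curandSetPseudoRandomGeneratorSeed(gen, seed);
    curandGenerateUniform(gen, d_out, n);                   /* numbers land on the GPU */
    curandDestroyGenerator(gen);
}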
[Performance chart: speedup bars on a log scale from 1x to 256x, for 1 thread, 2 threads and 12 threads]
17 – 02 - 2012 Ivan Girotto - ICHEC - PRACE Workshop 48
ICHEC’s case study: next GTC2012
ICHEC case study in risk management: performance optimisation of Monte Carlo for two applications using GPGPUs
[Chart: results for App-1 and App-2 (values 12190, 242180, 950, 700) across x1, x2 and x8 GPGPU configurations]
Larger and more complex Monte Carlo models!!
17 – 02 - 2012 49 Ivan Girotto - ICHEC - PRACE Workshop
Acknowledgements
17 – 02 - 2012 Ivan Girotto - ICHEC - PRACE Workshop 50
Filippo Spiga – ICHEC
Gilles Civario – ICHEC
M. Fatica - NVIDIA Corporation
Rob Farber, ICHEC Research Consultant
Carlo Cavazzoni - CINECA
P. Giannozzi, L. Martin-Samos - Quantum ESPRESSO
Philip Yang – former scholarship student at ICHEC
Supported by Science Foundation Ireland under grant 08/HEC/I1450
and by HEA’s PRTLI-C4.