Porting highly computationally intensive applications to hybrid systems
TRANSCRIPT
Porting highly computationally intensive applications to hybrid systems:
case studies from the Irish Centre for High-End Computing
Ivan Girotto - ICHEC GPU Team Leader [email protected] http://gpgpu.ichec.ie
… in collaboration with Filippo Spiga
Outline • The Irish Centre for High-End Computing (ICHEC)
• GPGPU Computing @ ICHEC
• Main topics:
• Use of accelerated libraries for porting scientific applications
• Quantum ESPRESSO
• GPU computing for business: an ICHEC case study
17 – 02 - 2012 2 Ivan Girotto - ICHEC - PRACE Workshop
ICHEC
• Founded in mid-2005 by SFI and HEA
• ~25 staff members
• Two main systems:
SGI Altix ICE 8200EX (~4k cores) and Bull Novascale R422-E2 (~500 cores)
• 50 NVIDIA GPUs in national production service
• Objectives:
Provide computational resources
Provide education and training to third-level institutions
Technology transfer and consultancy services to develop the Irish smart economy
17 – 02 - 2012 3 Ivan Girotto - ICHEC - PRACE Workshop
ICHEC’s GPU Computing Facilities /1
• Stoney: GPU National Production Service
Bull Novascale R422-E2: 64 x 2 quad-core Intel Xeon X5560 @ 2.8GHz, peak performance = 5.73 TFlops (CPUs only)
+ 48 x NVIDIA M2090 in 3 x Dell PowerEdge C410x, peak performance = 37.6 TFlops (CPUs + GPUs)
17 – 02 - 2012 4 Ivan Girotto - ICHEC - PRACE Workshop
ICHEC’s GPU Computing Facilities /2
• Gemini: software development platform
16 x NVIDIA M2090 + 2 x Dell PowerEdge (C410x + C6145)
• Development, Education & Training:
~10 x NVIDIA GTX 480
17 – 02 - 2012 5 Ivan Girotto - ICHEC - PRACE Workshop
AMBER benchmark: Nucleosome, 25,095 atoms (implicit solvent, GB)
[Plot: performance in ns/day as a function of the number of nodes/GPGPUs, CPU-only vs CPU+GPGPU]
http://ambermd.org/gpus/benchmarks.htm 17 – 02 - 2012 6 Ivan Girotto - ICHEC - PRACE Workshop
courtesy of Dr. Martin Peters
17 – 02 - 2012 7 Ivan Girotto - ICHEC - PRACE Workshop
LAMMPS-CUDA
[Plot: elapsed time (sec.) vs number of Stoney nodes]
courtesy of Dr. Nicola Varini
… see http://www.nvidia.com/object/vertical_solutions.html
GPGPU Computing @ ICHEC • PRACE Project 1IP & 2IP
Sub-task leadership T7.5e “Accelerators” – 1IP
Sub-task leadership T7.2d and T7.3d – 2IP
17 – 02 - 2012 8 Ivan Girotto - ICHEC - PRACE Workshop
• AutoTune
Analyse and tune serial, parallel & GPGPU codes
Aims to close the loop from profiling to tuning
GPGPU support built in
ICHEC brings extensive application experience
For more info contact Prof. Michael Gerndt (PM)
PRACE T7.5 • Programming Techniques for High Performance Applications
• The task will work with users to implement new programming techniques which have the potential to facilitate significant improvements in their applications’ performance
• This task will investigate the following areas and produce best-practice guides covering:
Analysis of scalable algorithms and libraries to enhance scientific applications
Optimisation of applications on multi-core/many-core systems
Exploitation of accelerators for real applications
17 – 02 - 2012 Ivan Girotto - ICHEC - PRACE Workshop 9
Speed-up Scientific Codes
17 – 02 - 2012 10 Ivan Girotto - ICHEC - PRACE Workshop
[Diagram: the compute-intensive functions of the application code run on the GPU, while the rest of the sequential code stays on the CPU]
3 Ways to Accelerate on GPU:
• Libraries
• Directives (OpenACC) (see the sketch below)
• CUDA (use CUDA to parallelise the hot spots by hand)
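To make the directives route concrete, here is a minimal OpenACC sketch (a hypothetical saxpy routine, not taken from the talk): the pragma asks the compiler to generate the GPU kernel and the data movement, with no hand-written CUDA.

/* Hypothetical example: accelerating a simple saxpy loop with OpenACC directives. */
void saxpy(int n, float a, const float *x, float *y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}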
CUDA library ecosystem
APPLICATIONS
CUDA C
3rd Party Libraries
CUDA Libraries
• MAGMA • CULA • libJacket • CUSP • IMSL Library • NAG • OpenVidia • OpenCurrent • phiGEMM
• CUFFT • CUBLAS • CUSPARSE • CURAND • THRUST • NPP • Libm (math.h) • System Call
http://developer.nvidia.com/gpu-accelerated-libraries
17 – 02 - 2012 11 Ivan Girotto - ICHEC - PRACE Workshop
GPU Kernel execution workflow
1. Copy data from host memory to device memory
2. Launch kernel
3. Execute GPU kernel
4. Copy result back to host memory
[Diagram: CPU with host memory (~30/40 GB/s) and GPU with device memory (~110/120 GB/s)]
17 – 02 - 2012 12 Ivan Girotto - ICHEC - PRACE Workshop
CPU vs GPU synchronization
[Diagram: CPU with host memory (~30/40 GB/s) and GPU with device memory (~110/120 GB/s), linked by PCIe at ~8 GB/s]
17 – 02 - 2012 13 Ivan Girotto - ICHEC - PRACE Workshop
kernel<<< Nblocks, Nthreads >>>( … )
cudaMemcpy(dst, src, count, kind)
cudaMemcpyAsync(dst, src, count, kind, stream)
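As an illustration of these calls, a minimal sketch of the copy / launch / copy-back workflow described above (hypothetical kernel and sizes; error checking omitted):

#include <cuda_runtime.h>

/* Hypothetical kernel: multiply every element by a constant. */
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

void scale_on_gpu(float *host, int n)
{
    float *dev;
    size_t bytes = n * sizeof(float);

    cudaMalloc((void **)&dev, bytes);                       /* device memory      */
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);   /* 1. copy data       */
    scale<<<(n + 255) / 256, 256>>>(dev, 2.0f, n);          /* 2.-3. launch & run */
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);   /* 4. copy result     */
    cudaFree(dev);
}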
CUBLAS - Thunking vs Non-Thunking
Thunking:
• Allows existing applications to be interfaced transparently through wrappers
• For each call to the library, the wrappers allocate GPU memory, copy source data from CPU memory space to GPU memory space, call CUBLAS, and finally copy the results back to CPU memory space and deallocate the GPGPU memory
• Intended for light testing, because of the per-call overhead
Non-Thunking (default):
• Existing applications need to be slightly modified to manage data transfer to and from GPGPU memory space (CUBLAS_ALLOC, CUBLAS_FREE, CUBLAS_SET_VECTOR, CUBLAS_GET_VECTOR, CUBLAS_SET_MATRIX, and CUBLAS_GET_MATRIX)
• Intended for production code, high flexibility
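For illustration, a minimal non-thunking sketch using the legacy CUBLAS API: the application stages the matrices on the GPU itself and calls the GPU GEMM explicitly (the wrapper function and the use of square matrices are illustrative; error checking omitted):

#include <cublas.h>   /* legacy CUBLAS API */

/* Non-thunking sketch: C = A * B for n x n double-precision matrices. */
void gemm_on_gpu(int n, const double *A, const double *B, double *C)
{
    double *dA, *dB, *dC;

    cublasInit();
    cublasAlloc(n * n, sizeof(double), (void **)&dA);
    cublasAlloc(n * n, sizeof(double), (void **)&dB);
    cublasAlloc(n * n, sizeof(double), (void **)&dC);

    cublasSetMatrix(n, n, sizeof(double), A, n, dA, n);   /* host -> device */
    cublasSetMatrix(n, n, sizeof(double), B, n, dB, n);

    cublasDgemm('N', 'N', n, n, n, 1.0, dA, n, dB, n, 0.0, dC, n);

    cublasGetMatrix(n, n, sizeof(double), dC, n, C, n);   /* device -> host */

    cublasFree(dA);  cublasFree(dB);  cublasFree(dC);
    cublasShutdown();
}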
17 – 02 - 2012 14 Ivan Girotto - ICHEC - PRACE Workshop
Existing work: DGEMM for Linpack (HPL), Phillips & Fatica: http://www.nvidia.com/content/GTC-2010/pdfs/2057_GTC2010.pdf
17 – 02 - 2012 15 Ivan Girotto - ICHEC - PRACE Workshop
The PHIGEMM library from ICHEC • A library that you use like CUBLAS
Transparently manages the thunking operations
Supports Sgemm, Dgemm, Cgemm and Zgemm
Asynchronous data transfer (via PINNED Memory)
Multi-GPU management through a single process (CUDA 4.x)
• Evolving: possible improvements via multi-stream out-of-order execution (see Phillips & Fatica, GTC 2010)
• Freely downloadable from http://qe-forge.org/projects/phigemm/ 17 – 02 - 2012 16 Ivan Girotto - ICHEC - PRACE Workshop
phiGEMM matches single-GPU performance
[Plot: GFLOPS vs M = N = K (DP size, 1024–10240) on 2 x Intel Xeon X5680 3.33GHz + NVIDIA Tesla C2050; series: MKL, thunking CUBLAS, CUBLAS (peak), phiGEMM, MKL + CUBLAS theoretical peak; H2D = ~5.5 GB/s, D2H = ~6.0 GB/s]
http://qe-forge.org/projects/phigemm/
17 – 02 - 2012 17 Ivan Girotto - ICHEC - PRACE Workshop
phiGEMM dual GPU / single bus
[Plot: GFLOPS vs M = N = K (DP size, 1024–10240) on 2 x Intel Xeon X5680 3.33GHz + 2 NVIDIA Tesla C2050; series: MKL, thunking CUBLAS (1 GPU), CUBLAS (peak), phiGEMM, MKL + CUBLAS peak; H2D = ~2.8 GB/s, D2H = ~3.2 GB/s]
http://qe-forge.org/projects/phigemm/
17 – 02 - 2012 18 Ivan Girotto - ICHEC - PRACE Workshop
phiGEMM dual GPU / dual bus
[Plot: GFLOPS vs M = N = K (DP size, 1024–10240) on 2 x Intel Xeon X5680 3.33GHz + 2 NVIDIA Tesla C2050; series: MKL, thunking CUBLAS, CUBLAS (peak), phiGEMM, MKL + CUBLAS peak; GPU0 H2D = ~4.8 GB/s, D2H = ~5.0 GB/s; GPU1 H2D = ~4.3 GB/s, D2H = ~4.8 GB/s; phiGEMM is faster here]
http://qe-forge.org/projects/phigemm/
17 – 02 - 2012 19 Ivan Girotto - ICHEC - PRACE Workshop
Performance is dependent on problem size
• phiGEMM can run GEMM on matrices that do not fit on a single GPU
• Recursive calls to phiGEMM with smaller sub-matrices (see the sketch after this figure)
[Diagram: C = A x B split into sub-blocks A1, B1, C1 and processed in STEP 1 to STEP 4, distributing the work across GPU, CPU, and CPU & GPU together]
http://qe-forge.org/projects/phigemm/
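The following is not the phiGEMM source, only a hedged sketch of the underlying idea for one level of the split: the first columns of B and C are computed by CUBLAS on the GPU while the host BLAS works on the remaining columns. The function name, the fixed split point, and the assumption that cublasInit() has already been called are all illustrative; phiGEMM tunes the split and recurses when the sub-problem still does not fit.

#include <cublas.h>
#include <cblas.h>

/* Hedged sketch of CPU/GPU GEMM splitting (column-major, C = A * B, m x n result):
 * the first n_gpu columns of B and C go to the GPU, the rest stay on the CPU. */
void split_dgemm(int m, int n, int k,
                 const double *A, const double *B, double *C, int n_gpu)
{
    double *dA, *dB, *dC;
    int n_cpu = n - n_gpu;

    cublasAlloc(m * k,     sizeof(double), (void **)&dA);
    cublasAlloc(k * n_gpu, sizeof(double), (void **)&dB);
    cublasAlloc(m * n_gpu, sizeof(double), (void **)&dC);

    cublasSetMatrix(m, k,     sizeof(double), A, m, dA, m);
    cublasSetMatrix(k, n_gpu, sizeof(double), B, k, dB, k);

    /* GPU part: the GEMM kernel is launched asynchronously with respect to the host... */
    cublasDgemm('N', 'N', m, n_gpu, k, 1.0, dA, m, dB, k, 0.0, dC, m);

    /* ...so the CPU part can proceed concurrently on the remaining columns. */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, n_cpu, k, 1.0, A, m, B + (size_t)n_gpu * k, k,
                0.0, C + (size_t)n_gpu * m, m);

    /* Fetching the GPU result synchronizes with the GPU GEMM. */
    cublasGetMatrix(m, n_gpu, sizeof(double), dC, m, C, m);

    cublasFree(dA);  cublasFree(dB);  cublasFree(dC);
}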
17 – 02 - 2012 20 Ivan Girotto - ICHEC - PRACE Workshop
BIG phiGEMM, multi GPU / dual bus
[Bar chart: GFLOPS for M = K = N = 25000 (DP) = 15 GBytes on 2 x Intel Xeon X5670 2.93GHz + 4 NVIDIA Tesla C2050; 1 GPU = 277 GFLOPS, 2 GPUs = 551 GFLOPS (x2.0), 4 GPUs = 942 GFLOPS (x3.4, with 2 GPUs per PCI bus!); each bar split into CUBLAS (GPU) and MKL (CPU) contributions]
http://qe-forge.org/projects/phigemm/
17 – 02 - 2012 21 Ivan Girotto - ICHEC - PRACE Workshop
phiGEMM final comments
• CPU/GPU workload distribution
• Full *GEMM library for multi-GPU hybrid systems
• HPC benchmark for conventional and co-processor-accelerated computers: performance peak, effective memory/PCI bus bandwidth
• http://qe-forge.org/projects/phigemm/
17 – 02 - 2012 Ivan Girotto - ICHEC - PRACE Workshop 22
QUANTUM ESPRESSO in a nutshell
• QUANTUM ESPRESSO is an integrated suite of computer codes for electronic-structure calculations and materials modeling at the nanoscale: Density-Functional Theory, plane waves, pseudo-potentials, …
• Free, open-source and widely used in academia and industry
• The current version (4.3.2) is more than half a million lines of code: Fortran90/95, Fortran77, C, Tcl scripts, input examples, …
• Ported to several HPC architectures Linux clusters, CRAY supercomputers, IBM Power & BlueGene, NEC, …
• It supports MPI and OpenMP for a highly scalable parallel implementation
• It systematically uses standardized mathematical libraries BLAS, LAPACK, FFTW, …
• Two main packages: PWSCF and CP
Ivan Girotto - ICHEC - GPU Computing 18 – 10 - 2011 23
Scenario
• Selected as a key application code for the EU Materials Science community within WP7 of PRACE-1IP
• Interest in this software package expressed by Irish research institutions
• Perfect synergy between partners within this project
18 – 10 - 2011 Ivan Girotto - ICHEC - GPU Computing 24
Performance Analysis (2IP – WP8)
17 – 02 - 2012 Ivan Girotto - ICHEC - PRACE Workshop 25
GPU enablement, a deep look inside: where the libraries are
[Flowchart of the PWscf structure-optimization / self-consistency cycle: structure optimization; setup of the initial physical quantities; calculation of wavefunctions / ortho-normalization; diagonalization of the Hamiltonian matrix (for each k-point); calculation of the charge density; calculation of the new potentials (non-local potentials); self-consistency achieved?; calculation of the new forces and stresses; forces and stresses equal to zero?; calculation of new atomic positions (and lattice parameters)]
Library hotspots highlighted in the figure: FFT + GEMM + eigensolvers (LAPACK); FFT; FFT + GEMM
18 – 10 - 2011 26 Ivan Girotto - ICHEC - GPU Computing
fft3Dtest.cu - Profiling
[Plots: 3D-FFT on multiple GPUs; time [s] for 1, 2 and 4 GPUs on batches of 4096 x 64^3, 512 x 128^3 and 64 x 256^3 transforms; a second, log-scale plot compares the same runs against single-core FFTW]
17 – 02 - 2012 27 Ivan Girotto - ICHEC - PRACE Workshop
CUDA VLOC_PSI_K – single node
[Data flow: PSI -> PSIC -> CUFFT (G->R) -> products -> CUFFT (R->G) -> PSIC -> HPSI]
Ivan Girotto - ICHEC - GPU Computing 18 – 10 - 2011 28
Serial diag. vs SCALAPACK vs MAGMA
18 – 10 - 2011 29 Ivan Girotto - ICHEC - GPU Computing
4.9x
Performance analysis on single node
• It extensively uses libraries
– BLAS -> phiGEMM (CUDA vs single-core, ~5x)
– LAPACK (CUDA vs single-core, ~5x)
– FFTW (mainly in VLOC_PSI)
• It already supports OpenMP well
• Wall-time is mainly spent in a few routines
– ADDUSDENS (CUDA vs single-core, ~20x)
– NEWD (CUDA vs single-core, ~7x)
– VLOC_PSI (CUDA vs single-core, ~9x)
18 – 10 - 2011 30 Ivan Girotto - ICHEC - GPU Computing
Benchmarking
18 – 10 - 2011 31 Ivan Girotto - ICHEC - GPU Computing
[Benchmark chart: CPU vs GPU-accelerated runs on an ICHEC workstation, Intel six-core X5650 + 1 NVIDIA C2050]
PINNED (non-pageable) memory
• allows asynchronous data transfers (no CPU needed, the DMA engine works alone; see the sketch below);
• allows overlapping computation and communication;
• since it is page-locked, the OS cannot move it;
• if you do not deallocate it, it never disappears;
• don’t be greedy!! The overall performance of the system is affected.
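A minimal sketch of pinned allocation plus an asynchronous copy on a stream (hypothetical buffer size; error checking omitted):

#include <cuda_runtime.h>

void pinned_transfer_example(size_t n)
{
    float *h_buf, *d_buf;
    cudaStream_t stream;

    cudaStreamCreate(&stream);
    cudaMallocHost((void **)&h_buf, n * sizeof(float));   /* pinned host memory */
    cudaMalloc((void **)&d_buf, n * sizeof(float));

    /* Asynchronous copy: returns immediately, the DMA engine moves the data
       while the CPU (or another stream) keeps working. Only possible because
       h_buf is page-locked. */
    cudaMemcpyAsync(d_buf, h_buf, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);

    /* ... enqueue kernels on `stream` here ... */

    cudaStreamSynchronize(stream);

    cudaFree(d_buf);
    cudaFreeHost(h_buf);       /* always release pinned memory when done */
    cudaStreamDestroy(stream);
}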
17 – 02 - 2012 32 Ivan Girotto - ICHEC - PRACE Workshop
Use case: phiGEMM
[Timelines comparing execution without and with PINNED memory: CPU activity, GPU activity, and H2D/D2H transfers; with pinned memory the transfers run asynchronously and overlap with computation]
17 – 02 - 2012 33 Ivan Girotto - ICHEC - PRACE Workshop
Considerations on PINNED memory
[Timing comparison: 4m23.05s vs 4m35.05s (no MPI here!)]
17 – 02 - 2012 34 Ivan Girotto - ICHEC - PRACE Workshop
Integrating the CUDA components
At the end…
• Original data distribution is preserved
• CUDA libraries are mixed with explicit CUDA kernels
• One single, large memory allocation is done on the GPU at start-up
• Kernels are fully “abstracted”, overlapping computation and data transfer wherever possible (but not yet everywhere!)
• All the MPI communications occur outside the ported routines
Moreover, PWSCF…
• detects GPUs and binds each GPU to an MPI process (see the sketch below)
• can share one GPU between multiple MPI processes
• can assign more than one GPU to each MPI process
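As a rough illustration of the binding step only (not the actual PWscf logic, which also handles sub-ranks and GPU-memory bookkeeping), an MPI process could pick its device like this, assuming MPI is already initialised and ranks are packed per node:

#include <mpi.h>
#include <cuda_runtime.h>

/* Illustrative MPI-rank-to-GPU binding: processes sharing a node take turns
   on the available devices (round-robin), so several MPI processes may share
   one GPU when there are more ranks than devices. */
void bind_gpu_to_rank(void)
{
    int rank, ndev;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaGetDeviceCount(&ndev);
    cudaSetDevice(rank % ndev);
}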
Ivan Girotto - ICHEC - GPU Computing 18 – 10 - 2011 35
MPI/Process-GPU scenarios
[Matrix of scenarios: serial and parallel runs, each with a single GPU or multiple GPUs]
Ivan Girotto - ICHEC - GPU Computing 18 – 10 - 2011 36
MPI-GPU binding & GPU memory management
NOTE: PWscf keeps track of the amount of GPU memory available
[Diagram: each MPI process (sub-rank 0, sub-rank 1, …) holds a pointer to its own region of CUDA memory; the GPU memory is NEVER allocated 100%!]
Ivan Girotto - ICHEC - GPU Computing 18 – 10 - 2011 37
MPI:GPU Ratios (MGST-hex, parallel)
18 – 10 - 2011 38 Ivan Girotto - ICHEC - GPU Computing
Parallelization levels in PWSCF
• Images: only for Nudged Elastic Band (NEB) calculations
• K-points: distribution and scalability over k-points (if more than one); no memory scaling
• Plane-waves: distribution of wave-function coefficients and of real-grid points; good memory scaling, good overall scalability, load balancing
• Linear algebra & task groups: iterative diagonalization (fully parallel or serial); smart grouping of 3D FFTs to reduce compulsory MPI communications
• Multi-threaded kernels: OpenMP handled explicitly or implicitly; extends the scaling on multi-core machines with “limited” memory
Ivan Girotto - ICHEC - GPU Computing 18 – 10 - 2011 39
PWSCF GPU, results & benchmarks (parallel – MGST-hex)
[Chart: speedups of 3.38x, 2.96x and 2.43x]
72 MPI x 2 OpenMP + 24 GPUs (GPU:MPI = 3:1). Acceleration, not scalability!
Ivan Girotto - ICHEC - GPU Computing 18 – 10 - 2011 40
PRACE Preparatory Access: PWSCF, input of 489 atoms, 3 SCF steps, 24 MPI x 6 OMP (144 cores), PLX
[Chart: observed speedups of 2.8x, 3.1x, 10x, 4x and ~2.3x]
Ivan Girotto - ICHEC - GPU Computing 18 – 10 - 2011 41
CUDA VLOC_PSI_K – parallel
[Data flow, original CPU version: PSI -> PSIC -> FFT (G->R) -> products -> FFT (R->G) -> PSIC -> HPSI, with PSIC DISTRIBUTED across MPI processes]
[Data flow, GPU version: PSI -> PSIC -> CUFFT (G->R) -> products -> CUFFT (R->G) -> PSIC -> HPSI]
Ivan Girotto - ICHEC - GPU Computing 18 – 10 - 2011 42
CUDA VLOC_PSI_K – parallel
[Data flow: PSI -> PSIC -> CUFFT (G->R) -> products -> CUFFT (R->G) -> PSIC -> HPSI, with “MPI_Allgatherv” / “MPI_Allscatterv” steps and multiple LOCAL grids to compute]
Same communication as the original P3DFFT, but more overlapping!!
Ivan Girotto - ICHEC - GPU Computing 18 – 10 - 2011 43
Serial diag. vs SCALAPACK vs MAGMA
18 – 10 - 2011 44 Ivan Girotto - ICHEC - GPU Computing
4.9x
Achievements
• Scientific Computing, NVIDIA GTC webinar “The Practical Reality of Heterogeneous Super Computing”
• At the last SC2011: Dell booth and a number of talks given by Rob Farber
• CUDA text book “CUDA Application Design and Development”
• Invited speaker for GPUComputing.net: live class for the University of Illinois at Urbana-Champaign
• Invited lecture at the CINECA Advanced School of Parallel Computing and the last PRACE winter school
• Filippo Spiga & Ivan Girotto, phiGEMM: a CPU-GPU library for porting Quantum ESPRESSO on hybrid systems, IEEE PDP2012
• GTC2012
• QE-GPU and phiGEMM published repository & qe-gpu forum 17 – 02 - 2012 Ivan Girotto - ICHEC - PRACE Workshop 45
Future work
18 – 10 - 2011 46 Ivan Girotto - ICHEC - GPU Computing
• PRACE WP8 – 2IP => T. Schulthess, C. Gheller
• Enable QE for future generations of EU Tier-0 systems
• Scalable approach to accelerated 3DFFT
• Scalable approach to accelerated eigensolvers
• Multithreading optimization (OpenMP)
• Linear algebra (phiGEMM model)
• GPU-enabled version of CP code
CURAND
17 – 02 - 2012 Ivan Girotto - ICHEC - PRACE Workshop 47
• Library for generating random numbers
• Features:
XORWOW pseudo-random generator
Sobol’ quasi-random number generators
Host API for generating random numbers in bulk (see the sketch below)
Inline implementation allows use inside GPU functions/kernels
Single- and double-precision uniform, normal and log-normal distributions
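A minimal sketch of the host API (generator choice, buffer and seed are illustrative; error checking omitted):

#include <curand.h>
#include <cuda_runtime.h>

/* Fill a device array with uniform random numbers using the CURAND host API. */
void fill_uniform(float *d_out, size_t n, unsigned long long seed)
{
    curandGenerator_t gen;

    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_XORWOW);  /* XORWOW generator        */
    curandSetPseudoRandomGeneratorSeed(gen, seed);
    curandGenerateUniform(gen, d_out, n);                   /* numbers land on the GPU */
    curandDestroyGenerator(gen);
}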
[Performance chart: speedup bars on a log scale from 1x to 256x, for 1 thread, 2 threads and 12 threads]
17 – 02 - 2012 Ivan Girotto - ICHEC - PRACE Workshop 48
ICHEC’s case study: next GTC2012
ICHEC case study in risk management: performance optimisation of Monte Carlo for two applications using GPGPUs
[Chart: results for App-1 and App-2 (values 12190, 242180, 950, 700) across x1, x2 and x8 GPGPU configurations]
Larger and more complex Monte Carlo models!!
17 – 02 - 2012 49 Ivan Girotto - ICHEC - PRACE Workshop
Acknowledgements
17 – 02 - 2012 Ivan Girotto - ICHEC - PRACE Workshop 50
Filippo Spiga – ICHEC
Gilles Civario – ICHEC
M. Fatica - NVIDIA Corporation
Rob Farber, ICHEC Research Consultant
Carlo Cavazzoni - CINECA
P. Giannozzi, L. Martin-Samos - Quantum ESPRESSO
Philip Yang – former scholarship student at ICHEC
Supported by Science Foundation Ireland under grant 08/HEC/I1450
and by HEA’s PRTLI-C4.