
Page 1: Porting highly computational intensive applications to hybrid systems

Porting highly computational intensive applications to hybrid systems:

case studies from the Irish Centre for High-End Computing

Ivan Girotto - ICHEC GPU Team Leader [email protected] http://gpgpu.ichec.ie

… in collaboration with Filippo Spiga

Page 2: Porting highly computational intensive applications to hybrid systems

Outline

•  The Irish Centre for High-End Computing (ICHEC)
•  GPGPU Computing @ ICHEC
•  Main topics:
   –  Use of accelerated libraries for porting scientific applications
   –  Quantum ESPRESSO
   –  GPU computing for business: an ICHEC case study

Page 3: Porting highly computational intensive applications to hybrid systems

ICHEC

•  Founded in mid-2005 by SFI and HEA
•  ~ 25 staff members
•  Two main systems:
   –  SGI Altix ICE 8200EX (~ 4k cores)
   –  Bull Novascale R422-E2 (~ 500 cores)
•  50 NVIDIA GPUs in national production service
•  Objectives:
   –  Provide computational resources
   –  Provide education and training to third-level institutions
   –  Tech transfer and consultancy services to develop the Irish smart economy

Page 4: Porting highly computational intensive applications to hybrid systems

ICHEC's GPU Computing Facilities /1

•  Stoney: GPU national production service
   –  Bull Novascale R422-E2, 64 nodes x 2 quad-core Intel Xeon X5560 @ 2.8 GHz: peak performance = 5.73 TFlops
   –  48 x NVIDIA M2090 in 3 x Dell PowerEdge C410x: combined peak performance = 37.6 TFlops

Page 5: Porting highly computational intensive applications to hybrid systems

ICHEC's GPU Computing Facilities /2

•  Gemini: software development platform
   –  16 x NVIDIA M2090 + 2 x Dell PowerEdge (C410x + C6145)
•  Development, Education & Training:
   –  ~ 10 x NVIDIA GTX 480

Page 6: Porting highly computational intensive applications to hybrid systems

Nucleosome, 25,095 atoms (implicit solvent, GB)

[Plot: AMBER throughput in ns/day versus number of nodes / GPGPUs, comparing CPU-only and CPU+GPGPU runs]

http://ambermd.org/gpus/benchmarks.htm

courtesy of Dr. Martin Peters

Page 7: Porting highly computational intensive applications to hybrid systems

LAMMPS-CUDA

[Plot: elapsed time (sec.) versus number of Stoney nodes]

courtesy of Dr. Nicola Varini

… see http://www.nvidia.com/object/vertical_solutions.html

Page 8: Porting highly computational intensive applications to hybrid systems

GPGPU Computing @ ICHEC

•  PRACE Project 1IP & 2IP
   –  Sub-task leadership T7.5e "Accelerators" – 1IP
   –  Sub-task leadership T7.2d and T7.3d – 2IP
•  AutoTune
   –  Analyse and tune serial, parallel & GPGPU codes
   –  Aims to close the loop from profiling to tuning
   –  GPGPU support built in
   –  ICHEC brings extensive application experience
   –  For more info contact Prof. Michael Gerndt (PM)

Page 9: Porting highly computational intensive applications to hybrid systems

PRACE T7.5: Programming Techniques for High Performance Applications

•  The task works with users to implement new programming techniques with the potential to deliver significant improvements in their applications' performance
•  It will investigate the following areas and produce best-practice guides covering:
   –  Analysis of scalable algorithms and libraries to enhance scientific applications
   –  Optimisation of applications on multi-core/many-core systems
   –  Exploitation of accelerators for real applications

Page 10: Porting highly computational intensive applications to hybrid systems

Speed-up Scientific Codes

[Diagram: the compute-intensive functions of the application code ("Use CUDA to parallelise") are offloaded to the GPU, while the rest of the sequential code stays on the CPU]

3 Ways to Accelerate on GPU:
   –  Directives (OpenACC) (see the sketch below)
   –  Libraries
   –  CUDA
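Not from the original slides: a minimal sketch of the directives approach, assuming an OpenACC-capable compiler (e.g. compiled with `-acc`); the `saxpy` routine and the problem size are illustrative placeholders.

```c
#include <stdio.h>

/* SAXPY offloaded with OpenACC directives: the compiler generates the
   GPU kernel and the host<->device data movement from the pragma. */
void saxpy(int n, float a, const float *x, float *y)
{
#pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    enum { N = 1 << 20 };
    static float x[N], y[N];
    for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy(N, 3.0f, x, y);        /* runs on the GPU when built with an OpenACC compiler */

    printf("y[0] = %f\n", y[0]); /* expected: 5.0 */
    return 0;
}
```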

Page 11: Porting highly computational intensive applications to hybrid systems

CUDA library ecosystem

[Diagram: applications built on top of 3rd-party libraries, the CUDA libraries and CUDA C]

•  3rd-party libraries: MAGMA, CULA, libJacket, CUSP, IMSL Library, NAG, OpenVidia, OpenCurrent, phiGEMM
•  CUDA libraries: CUFFT, CUBLAS, CUSPARSE, CURAND, Thrust, NPP, Libm (math.h), system calls

http://developer.nvidia.com/gpu-accelerated-libraries

Page 12: Porting highly computational intensive applications to hybrid systems

GPU kernel execution workflow

1. Copy data from CPU host memory to GPU device memory
2. Launch kernel
3. Execute GPU kernel
4. Copy results back from device memory to host memory

Host memory bandwidth ~ 30/40 GB/s; GPU device memory bandwidth ~ 110/120 GB/s

Page 13: Porting highly computational intensive applications to hybrid systems

CPU vs GPU synchronization

[Diagram: host memory (~ 30/40 GB/s) and GPU device memory (~ 110/120 GB/s) connected over PCIe (~ 8 GB/s)]

kernel <<< Nblocks, Nthreads >>> ( … )
cudaMemcpy(dst, src, count, kind)
cudaMemcpyAsync(dst, src, count, kind, stream)
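Not in the original slides: a minimal, self-contained sketch of the copy/launch/copy workflow above, using a hypothetical `scale` kernel. The blocking cudaMemcpy calls act as the synchronization points; the kernel launch itself is asynchronous with respect to the host.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel: scales a vector in place on the GPU.
__global__ void scale(float *v, float alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= alpha;
}

int main()
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, bytes);

    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   // 1. copy data in (blocking)
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);        // 2./3. launch and execute kernel (asynchronous)
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);    // 4. copy result back (waits for the kernel)

    printf("h[0] = %f\n", h[0]);                        // expected: 2.0
    cudaFree(d);
    free(h);
    return 0;
}
```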

Page 14: Porting highly computational intensive applications to hybrid systems

CUBLAS – Thunking vs Non-Thunking

Thunking:
•  Allows existing applications to be interfaced transparently through wrappers
•  For each call to the library, the wrappers allocate GPU memory, copy the source data from CPU memory space to GPU memory space, call CUBLAS, and finally copy the results back to CPU memory space and deallocate the GPGPU memory
•  Intended for light testing; every call pays this overhead

Non-Thunking (default):
•  Existing applications need to be slightly modified to manage data transfer to and from GPGPU memory space (CUBLAS_ALLOC, CUBLAS_FREE, CUBLAS_SET_VECTOR, CUBLAS_GET_VECTOR, CUBLAS_SET_MATRIX and CUBLAS_GET_MATRIX)
•  Intended for production code; high flexibility
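Not part of the slides: a minimal sketch of the non-thunking workflow. The slide lists the Fortran wrapper names; this sketch uses the equivalent C calls of the legacy (v1) CUBLAS API from cublas.h. Matrix size and values are illustrative.

```c
#include <stdio.h>
#include <stdlib.h>
#include <cublas.h>   /* legacy (v1) CUBLAS API */

int main(void)
{
    const int n = 512;                 /* illustrative: C = A * B, all n x n, column-major */
    double *A = malloc(n * n * sizeof(double));
    double *B = malloc(n * n * sizeof(double));
    double *C = malloc(n * n * sizeof(double));
    for (int i = 0; i < n * n; ++i) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

    cublasInit();

    /* Non-thunking: the application manages the device storage explicitly. */
    double *dA, *dB, *dC;
    cublasAlloc(n * n, sizeof(double), (void **)&dA);
    cublasAlloc(n * n, sizeof(double), (void **)&dB);
    cublasAlloc(n * n, sizeof(double), (void **)&dC);

    cublasSetMatrix(n, n, sizeof(double), A, n, dA, n);
    cublasSetMatrix(n, n, sizeof(double), B, n, dB, n);

    cublasDgemm('N', 'N', n, n, n, 1.0, dA, n, dB, n, 0.0, dC, n);

    cublasGetMatrix(n, n, sizeof(double), dC, n, C, n);
    printf("C[0] = %f\n", C[0]);       /* expected: 2.0 * n */

    cublasFree(dA); cublasFree(dB); cublasFree(dC);
    cublasShutdown();
    free(A); free(B); free(C);
    return 0;
}
```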

Page 15: Porting highly computational intensive applications to hybrid systems

Existing work: DGEMM for the Linpack benchmark (HPL), Phillips & Fatica: http://www.nvidia.com/content/GTC-2010/pdfs/2057_GTC2010.pdf

Page 16: Porting highly computational intensive applications to hybrid systems

The phiGEMM library from ICHEC

•  A library that you use like CUBLAS
   –  Transparently manages the thunking operations
   –  Supports SGEMM, DGEMM, CGEMM and ZGEMM
   –  Asynchronous data transfer (via PINNED memory)
   –  Multi-GPU management through a single process (CUDA 4.x)
•  Evolving: possible improvements via multi-stream, out-of-order execution (see Phillips & Fatica, GTC 2010)
•  Freely downloadable from http://qe-forge.org/projects/phigemm/

Page 17: Porting highly computational intensive applications to hybrid systems

phiGEMM matches 1-GPU performance

[Plot: GFLOPS versus matrix size M = N = K (double precision, 1024 to 10240) for MKL, thunking CUBLAS, CUBLAS peak, phiGEMM and the MKL + CUBLAS theoretical peak; 2 x Intel Xeon X5680 3.33GHz + NVIDIA Tesla C2050; H2D ~ 5.5 GB/s, D2H ~ 6.0 GB/s]

http://qe-forge.org/projects/phigemm/

Page 18: Porting highly computational intensive applications to hybrid systems

phiGEMM, dual GPU / single bus

[Plot: GFLOPS versus matrix size M = N = K (double precision, 1024 to 10240) for MKL, thunking CUBLAS (1 GPU), phiGEMM and the MKL + CUBLAS peak; 2 x Intel Xeon X5680 3.33GHz + 2 NVIDIA Tesla C2050; H2D ~ 2.8 GB/s, D2H ~ 3.2 GB/s]

http://qe-forge.org/projects/phigemm/

Page 19: Porting highly computational intensive applications to hybrid systems

phiGEMM, dual GPU / dual bus

[Plot: GFLOPS versus matrix size M = N = K (double precision, 1024 to 10240) for MKL, thunking CUBLAS, phiGEMM and the MKL + CUBLAS peak; 2 x Intel Xeon X5680 3.33GHz + 2 NVIDIA Tesla C2050; GPU0 H2D ~ 4.8 GB/s, D2H ~ 5.0 GB/s; GPU1 H2D ~ 4.3 GB/s, D2H ~ 4.8 GB/s; annotated "Faster here"]

http://qe-forge.org/projects/phigemm/

Page 20: Porting highly computational intensive applications to hybrid systems

Performance is dependent on problem size

•  phiGEMM can run GEMM on matrices that do not fit on a single GPU
•  Recursive call to phiGEMM with a smaller sub-matrix (a sketch of the basic CPU/GPU split follows)

[Diagram: C = A x B decomposed into sub-blocks (A1, B1, C1) processed in four steps, with the work distributed between CPU and GPU]

http://qe-forge.org/projects/phigemm/
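This is not the phiGEMM source, only a minimal sketch of the underlying idea: split C = A x B by columns, run one part on the GPU through CUBLAS while the host computes the rest with a CPU BLAS, then collect the GPU result. The 50/50 split ratio and the `hybrid_dgemm` name are illustrative.

```c
#include <stddef.h>
#include <cblas.h>          /* host BLAS (e.g. MKL or OpenBLAS) */
#include <cublas_v2.h>
#include <cuda_runtime.h>

/* C (m x n) = A (m x k) * B (k x n), column-major.
   The first n_gpu columns of C are computed on the GPU, the rest on the CPU. */
void hybrid_dgemm(int m, int n, int k, const double *A, const double *B, double *C)
{
    int n_gpu = n / 2;                      /* illustrative 50/50 split */
    int n_cpu = n - n_gpu;
    const double one = 1.0, zero = 0.0;

    cublasHandle_t h;
    cublasCreate(&h);

    double *dA, *dB, *dC;
    cudaMalloc(&dA, (size_t)m * k * sizeof(double));
    cudaMalloc(&dB, (size_t)k * n_gpu * sizeof(double));
    cudaMalloc(&dC, (size_t)m * n_gpu * sizeof(double));

    cublasSetMatrix(m, k, sizeof(double), A, m, dA, m);
    cublasSetMatrix(k, n_gpu, sizeof(double), B, k, dB, k);

    /* Asynchronous with respect to the host: the GPU GEMM runs ... */
    cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, m, n_gpu, k, &one, dA, m, dB, k, &zero, dC, m);

    /* ... while the CPU computes the remaining n_cpu columns. */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m, n_cpu, k,
                1.0, A, m, B + (size_t)k * n_gpu, k, 0.0, C + (size_t)m * n_gpu, m);

    /* Blocking copy: waits for the GPU GEMM to finish, then gathers its columns of C. */
    cublasGetMatrix(m, n_gpu, sizeof(double), dC, m, C, m);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    cublasDestroy(h);
}
```

According to the slides, phiGEMM adds pinned-memory transfers, multi-GPU support and the recursive blocking for matrices larger than the GPU memory on top of this basic split.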

Page 21: Porting highly computational intensive applications to hybrid systems

Big phiGEMM, multi GPU / dual bus

[Bar chart: GFLOPS for 1, 2 and 4 GPUs (277, 551 and 942 GFLOPS, i.e. x 2.0 and x 3.4 scaling; 2 GPUs per PCI bus!!), against CUBLAS and MKL; 2 x Intel Xeon X5670 2.93GHz + 4 NVIDIA Tesla C2050; M = K = N = 25000 (DP) = 15 GBytes]

http://qe-forge.org/projects/phigemm/

Page 22: Porting highly computational intensive applications to hybrid systems

phiGEMM final comments

•  CPU/GPU workload distribution
•  Full *GEMM library for multi-GPU hybrid systems
•  HPC benchmark for conventional and co-processor-accelerated computers
   –  peak performance
   –  effective memory / PCI bus bandwidth
•  http://qe-forge.org/projects/phigemm/

Page 23: Porting highly computational intensive applications to hybrid systems

QUANTUM ESPRESSO in a nutshell

•  QUANTUM ESPRESSO is an integrated suite of computer codes for electronic-structure calculations and materials modeling at the nanoscale
   –  Density-Functional Theory, plane waves, pseudo-potentials, …
•  Free, open-source and widely used in academia and industry
•  The current version (4.3.2) is more than half a million lines of code
   –  Fortran90/95, Fortran77, C, Tcl scripts, input examples, …
•  Ported to several HPC architectures
   –  Linux clusters, CRAY supercomputers, IBM Power & BlueGene, NEC, …
•  It supports MPI and OpenMP for a highly scalable parallel implementation
•  It systematically uses standardized mathematical libraries
   –  BLAS, LAPACK, FFTW, …
•  Two main packages: PWSCF and CP

Page 24: Porting highly computational intensive applications to hybrid systems

Scenario

•  Selected as a key application code for the EU materials-science community within WP7 of PRACE-1IP
•  Interest in this software package expressed by Irish research institutions
•  Perfect synergy between the partners within this project

Page 25: Porting highly computational intensive applications to hybrid systems

Performance Analysis (2IP – WP8)

Page 26: Porting highly computational intensive applications to hybrid systems

GPU-enablement: a deep look inside at where the libraries are

[Flowchart of the self-consistency loop (partially recovered from the garbled figure text): initial physical quantities; setup of the initial wavefunction; structure optimization; diagonalization of the Hamiltonian matrix (for each k point); calculation of the charge density; calculation of the new potentials (non-local potentials); if self-consistency is achieved, calculation of the new forces (and stresses); if forces and stresses equal zero, stop, else calculation of new atomic positions (and lattice parameters); otherwise calculation of wavefunctions and ortho-normalization. Annotated library hotspots: FFT + GEMM + eigensolvers (LAPACK); FFT; FFT + GEMM]

Page 27: Porting highly computational intensive applications to hybrid systems

fft3Dtest.cu – Profiling

[Plots: time in seconds for batches of 3D FFTs (4096 x 64^3, 512 x 128^3, 64 x 256^3) on 1, 2 and 4 GPUs; a second, log-scale plot compares the multi-GPU 3D FFT against single-core FFTW]

Page 28: Porting highly computational intensive applications to hybrid systems

CUDA VLOC_PSI_K – single node

[Diagram: PSI → PSIC → CUFFT (G → R) → products → CUFFT (R → G) → PSIC → HPSI, entirely on the GPU]
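Not from the deck: a minimal sketch of the CUFFT forward/inverse 3D transform pattern used in a VLOC_PSI-style kernel. The `apply_vloc` kernel, the grid sizes and the in-place layout are illustrative assumptions, not the QE implementation.

```cuda
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical real-space operation applied between the two transforms
// (in VLOC_PSI this step is the "products" box of the diagram above).
__global__ void apply_vloc(cufftDoubleComplex *psic, const double *vloc, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        psic[i].x *= vloc[i];
        psic[i].y *= vloc[i];
    }
}

void vloc_psi_sketch(cufftDoubleComplex *d_psic, const double *d_vloc, int nx, int ny, int nz)
{
    int n = nx * ny * nz;

    cufftHandle plan;
    cufftPlan3d(&plan, nx, ny, nz, CUFFT_Z2Z);          // one plan, reused for both directions

    cufftExecZ2Z(plan, d_psic, d_psic, CUFFT_INVERSE);  // G -> R space (in place)
    apply_vloc<<<(n + 255) / 256, 256>>>(d_psic, d_vloc, n);
    cufftExecZ2Z(plan, d_psic, d_psic, CUFFT_FORWARD);  // R -> G space (in place)

    // Note: CUFFT transforms are unnormalized, so a 1/n scaling is still owed somewhere.
    cufftDestroy(plan);
}
```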

Page 29: Porting highly computational intensive applications to hybrid systems

Serial diag. vs SCALAPACK vs MAGMA

[Chart: comparison of serial diagonalization, ScaLAPACK and MAGMA; best case 4.9x speed-up]

Page 30: Porting highly computational intensive applications to hybrid systems

Performance analysis on a single node

•  It extensively uses libraries
   –  BLAS via phiGEMM (CUDA vs single-core, ~5x)
   –  LAPACK (CUDA vs single-core, ~5x)
   –  FFTW (mainly in VLOC_PSI)
•  It already supports OpenMP well
•  Wall-time is mainly distributed across a few routines
   –  ADDUSDENS (CUDA vs single-core, ~20x)
   –  NEWD (CUDA vs single-core, ~7x)
   –  VLOC_PSI (CUDA vs single-core, ~9x)

Page 31: Porting highly computational intensive applications to hybrid systems

Benchmarking

[Benchmark chart, CPU reference: ICHEC workstation, Intel six-core X5650 + 1 NVIDIA C2050]

Page 32: Porting highly computational intensive applications to hybrid systems

PINNED memory (or non-pageable)

•  allows asynchronous data transfers (no CPU needed, the DMA engine works alone), as sketched below;
•  allows overlapping computation and communication (on different streams);
•  since it is page-locked, the OS cannot move it;
•  it never disappears unless you explicitly de-allocate it;
•  don't be greedy!! The overall performance of the system is affected.
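Not in the original deck: a minimal sketch of pinned host memory with an asynchronous copy overlapped against an independent kernel on another stream; the `busy` kernel and the buffer sizes are illustrative placeholders.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel that keeps stream s1 busy while stream s2 copies.
__global__ void busy(float *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] = v[i] * 2.0f + 1.0f;
}

int main()
{
    const int n = 1 << 22;
    size_t bytes = n * sizeof(float);

    float *h_pinned;                       // page-locked host buffer: required for truly
    cudaMallocHost(&h_pinned, bytes);      // asynchronous cudaMemcpyAsync transfers
    for (int i = 0; i < n; ++i) h_pinned[i] = 1.0f;

    float *d_a, *d_b;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMemset(d_a, 0, bytes);

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Kernel on s1 and H2D copy on s2 can overlap because the host buffer is pinned.
    busy<<<(n + 255) / 256, 256, 0, s1>>>(d_a, n);
    cudaMemcpyAsync(d_b, h_pinned, bytes, cudaMemcpyHostToDevice, s2);

    cudaDeviceSynchronize();               // wait for both streams

    printf("done\n");
    cudaStreamDestroy(s1); cudaStreamDestroy(s2);
    cudaFree(d_a); cudaFree(d_b);
    cudaFreeHost(h_pinned);                // pinned memory stays allocated until explicitly freed
    return 0;
}
```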

Page 33: Porting highly computational intensive applications to hybrid systems

Use-case: phiGEMM

[Timeline diagrams: CPU activity and GPU H2D/D2H transfers without PINNED memory versus with PINNED memory, with the transfers overlapping in the pinned case]

Page 34: Porting highly computational intensive applications to hybrid systems

Considerations on PINNED memory

[Timing comparison: 4m23.05s vs 4m35.05s (no MPI here!)]

Page 35: Porting highly computational intensive applications to hybrid systems

Integrating the CUDA components

At the end…
•  The original data distribution is preserved
•  CUDA libraries are mixed with explicit CUDA kernels
•  One single large initial memory allocation is done on the GPU
•  Kernels are fully "abstracted", overlapping computation and data transfer wherever possible (but not yet everywhere!)
•  All the MPI communications occur outside the ported routines

Moreover, PWSCF…
•  detects GPUs and binds each GPU to an MPI process
•  can share one GPU between multiple MPI processes
•  can assign more than one GPU to each MPI process

Page 36: Porting highly computational intensive applications to hybrid systems

MPI/Process-GPU scenarios

[Diagram: serial and parallel configurations with a single GPU versus multiple GPUs]

Page 37: Porting highly computational intensive applications to hybrid systems

MPI-GPU binding & GPU memory management

[Diagram: each MPI process sub-rank holds a pointer to its share of the CUDA memory on the bound GPU; the GPU memory is NEVER allocated 100%!]

NOTE: PWscf keeps track of the amount of GPU memory available. A binding sketch follows.
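Not the PWscf code, just a minimal sketch of the binding pattern described above: each MPI rank picks a device round-robin, queries the free memory with cudaMemGetInfo and reserves only part of it in one large allocation. The 80% fraction is an illustrative placeholder.

```c
#include <stdio.h>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int ndev = 0;
    cudaGetDeviceCount(&ndev);

    /* Round-robin binding: several MPI ranks may share one GPU. */
    int dev = rank % ndev;
    cudaSetDevice(dev);

    size_t free_b = 0, total_b = 0;
    cudaMemGetInfo(&free_b, &total_b);

    /* Reserve only part of the free memory in one large allocation
       (never 100%: the driver and the other ranks need room too). */
    size_t reserve = (size_t)(0.8 * (double)free_b);
    void *pool = NULL;
    cudaMalloc(&pool, reserve);

    printf("rank %d -> GPU %d: reserved %zu of %zu free bytes\n",
           rank, dev, reserve, free_b);

    cudaFree(pool);
    MPI_Finalize();
    return 0;
}
```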

Page 38: Porting highly computational intensive applications to hybrid systems

MPI:GPU Ratios (MGST-hex, parallel)

Page 39: Porting highly computational intensive applications to hybrid systems

Parallelization levels in PWSCF

•  Images
   –  Only for Nudged Elastic Band (NEB) calculations
•  K-points
   –  Distribution over k-points (if more than one)
   –  Scalability over k-points (if more than one)
   –  No memory scaling
•  Plane-waves
   –  Distribution of wave-function coefficients
   –  Distribution of real-grid points
   –  Good memory scaling, good overall scalability, load balancing
•  Linear algebra & task groups
   –  Iterative diagonalization (fully parallel or serial)
   –  Smart grouping of 3D FFTs to reduce compulsory MPI communications
•  Multi-threaded kernels
   –  OpenMP handled explicitly or implicitly
   –  Extends the scaling on multi-core machines with "limited" memory

Page 40: Porting highly computational intensive applications to hybrid systems

PWSCF GPU, results & benchmarks (parallel – MGST-hex)

[Chart: speed-ups of 3.38x, 2.96x and 2.43x; 72 MPI x 2 OpenMP + 24 GPUs (MPI:GPU = 3:1). Acceleration, not scalability!]

Page 41: Porting highly computational intensive applications to hybrid systems

PRACE Preparatory Access

PWSCF, input of 489 atoms, 3 SCF steps, 24 MPI x 6 OMP (144 cores), PLX

[Chart: speed-ups of 2.8x, 3.1x, 10x, 4x and ~2.3x]

Page 42: Porting highly computational intensive applications to hybrid systems

CUDA VLOC_PSI_K – parallel

[Diagrams: the original DISTRIBUTED path, PSI → PSIC → FFT (G → R) → products → FFT (R → G) → PSIC → HPSI, compared with the CUDA path, PSI → PSIC → CUFFT (G → R) → products → CUFFT (R → G) → PSIC → HPSI]

Page 43: Porting highly computational intensive applications to hybrid systems

CUDA VLOC_PSI_K – parallel

[Diagram: as on the previous slide, but the distributed data are collected with "MPI_Allgatherv" before the CUFFT steps and redistributed with "MPI_Allscatterv" afterwards; multiple LOCAL grids to compute]

Same communication as the original P3DFFT, but more overlapping!! (A sketch of the gather step follows.)
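Not from Quantum ESPRESSO: a minimal sketch of the MPI_Allgatherv pattern named above, collecting a block-distributed array onto every rank before a local step (e.g. an FFT). The counts, displacements and data are illustrative.

```c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    /* Each rank owns a (possibly uneven) slice of a global array of size ntot. */
    const int ntot = 1000;
    int base = ntot / nproc, rem = ntot % nproc;
    int *counts = malloc(nproc * sizeof(int));
    int *displs = malloc(nproc * sizeof(int));
    for (int p = 0, off = 0; p < nproc; ++p) {
        counts[p] = base + (p < rem ? 1 : 0);
        displs[p] = off;
        off += counts[p];
    }

    double *local  = malloc(counts[rank] * sizeof(double));
    double *global = malloc(ntot * sizeof(double));
    for (int i = 0; i < counts[rank]; ++i)
        local[i] = displs[rank] + i;            /* fake local data */

    /* Every rank ends up with the full array and can run its local FFT/products on it. */
    MPI_Allgatherv(local, counts[rank], MPI_DOUBLE,
                   global, counts, displs, MPI_DOUBLE, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global[ntot-1] = %g\n", global[ntot - 1]);   /* expected: 999 */

    free(local); free(global); free(counts); free(displs);
    MPI_Finalize();
    return 0;
}
```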

Page 44: Porting highly computational intensive applications to hybrid systems

Serial diag. vs SCALAPACK vs MAGMA

[Chart: comparison of serial diagonalization, ScaLAPACK and MAGMA; best case 4.9x speed-up]

Page 45: Porting highly computational intensive applications to hybrid systems

Achievements

•  Scientific Computing, NVIDIA GTC webinar "The Practical Reality of Heterogeneous Super Computing"
•  At SC2011: Dell booth and a number of talks given by Rob Farber
•  CUDA text book "CUDA Application Design and Development"
•  Invited speaker for GPUComputing.net: live class for the University of Illinois at Urbana-Champaign
•  Invited lectures at the CINECA Advanced School of Parallel Computing and the last PRACE winter school
•  Filippo Spiga & Ivan Girotto, "phiGEMM: a CPU-GPU library for porting Quantum ESPRESSO on hybrid systems", IEEE PDP2012
•  GTC2012
•  QE-GPU and phiGEMM published repositories & qe-gpu forum

Page 46: Porting highly computational intensive applications to hybrid systems

Future work

•  PRACE WP8 – 2IP => T. Schulthess, C. Gheller

•  Enable QE on further generations of EU Tier-0 systems

•  Scalable approach to accelerated 3DFFT

•  Scalable approach to accelerated eigensolvers

•  Multithreading optimization (OpenMP)

•  Linear algebra (phiGEMM model)

•  GPU-enabled version of CP code

Page 47: Porting highly computational intensive applications to hybrid systems

CURAND

•  Library for generating random numbers
•  Features:
   –  XORWOW pseudo-random generator
   –  Sobol' quasi-random number generators
   –  Host API for generating random numbers in bulk (see the sketch below)
   –  Inline implementation allows use inside GPU functions/kernels
   –  Single- and double-precision; uniform, normal and log-normal distributions
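Not from the slides: a minimal sketch of the CURAND host API generating a block of uniform doubles directly in GPU memory with the XORWOW generator; the sample count and seed are arbitrary.

```c
#include <stdio.h>
#include <cuda_runtime.h>
#include <curand.h>

int main(void)
{
    const size_t n = 1 << 20;          /* arbitrary number of samples */

    double *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(double));

    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_XORWOW);
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);

    /* Bulk generation on the device: uniform doubles in (0, 1]. */
    curandGenerateUniformDouble(gen, d_data, n);

    /* Copy a few values back just to look at them. */
    double h[4];
    cudaMemcpy(h, d_data, sizeof(h), cudaMemcpyDeviceToHost);
    printf("%f %f %f %f\n", h[0], h[1], h[2], h[3]);

    curandDestroyGenerator(gen);
    cudaFree(d_data);
    return 0;
}
```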

Page 48: Porting highly computational intensive applications to hybrid systems

ICHEC's case study: next GTC2012

[Bar chart (text garbled in the transcript, log scale up to ~256x): relative speed-ups of the Java, C++ and CUDA implementations run with 1, 2 and 12 threads]

Page 49: Porting highly computational intensive applications to hybrid systems

ICHEC Case Study in Risk Management: Performance Optimisation of Monte Carlo Using GPGPU for Two Applications

[Chart: App-1 and App-2 at x1, x2 and x8 with GPGPU (reported values: 12190, 242180, 950, 700); larger and more complex Monte Carlo model!!]

Page 50: Porting highly computational intensive applications to hybrid systems

Acknowledgements

Filippo Spiga – ICHEC

Gilles Civario – ICHEC

M. Fatica - NVIDIA Corporation

Rob Farber, ICHEC Research Consultant

Carlo Cavazzoni - CINECA

P. Giannozzi, L. Marsamos - Quantum ESPRESSO

Philip Yang – former scholarship student at ICHEC

Page 51: Porting highly computational intensive applications to hybrid systems

Acknowledgements

Supported by Science Foundation Ireland under grant 08/HEC/I1450 and by the HEA's PRTLI-C4.