TRANSCRIPT
Confidential, copyright CAPS entreprise
Overview of the Talk
1. HMPP in a nutshell
   • Directives for Hardware Accelerators (HWA)
2. HMPP Code Generation Capabilities
   • Efficient code generation for CUDA
3. Library Adapter
   • HPL / DGEMM experiment
4. Codelet Finder
Bordeaux, March 12, 2009
HMPP Directives
C and Fortran directives to program hardware accelerators
• Ensure portability and a default (host) compilation and execution path
• Declare hardware implementations of native functions
• Indicate resource allocation and communication
• Place synchronization barriers
A standard and portable way of programming
A programming glue between general-purpose and hardware-specific languages
• Insulation of hardware-specific kernels from the C and Fortran code
HMPP Workbench 4
Directives Principles
Declare hardware-specific implementations of functions (codelets)
• Can be specialized to the execution context (data size, …)
Codelet calls
• Synchronous or asynchronous properties
Data transfers
• Data preloading
Synchronization barriers
• Host CPU waits until the remote computation has completed
[Diagram: application data in main memory on the general-purpose processor cores is uploaded to a copy in HWA memory; the codelet executes on the HWA cores as a remote procedure call, and the result is downloaded back.]
Simple Example
#pragma hmpp sgemm codelet, target=CUDA, args[vout].io=inout
extern void sgemm( int m, int n, int k, float alpha,
                   const float vin1[n][n], const float vin2[n][n],
                   float beta, float vout[n][n] );

int main(int argc, char **argv) {
  …
  for( j = 0 ; j < 2 ; j++ ) {
#pragma hmpp sgemm callsite
    sgemm( size, size, size, alpha, vin1, vin2, beta, vout );
  }
  …
}
HMPP Codelet Definition
A pure function to be executed on a remote device or specialized core
• No global variables
• No side effects
Several possible variants
• For different targets
• For different use contexts (vector size, …)
Managed by the HMPP runtime
• The HMPP API provides the necessary support functions
Directives Overview
A unique label identifies a group of directives that belong to the same codelet
Directive types:
• codelet: codelet declaration
• callsite: codelet call, can be asynchronous
• advancedload: preloading of data
• delegatedstore: wait for data result upload
• synchronize: wait for the completion of a codelet
• release: free a compute unit for another codelet
#pragma hmpp <label> <directive type> [, <directive parameter>]* [&]
!$hmpp <label> <directive type> [, <directive parameter>]* [&]
Advanced Programming
int main(int argc, char **argv) {
  …
  /* Allocate and initialize the device outside the loop */
#pragma hmpp sgemm allocate, args[vin1;vin2;vout].size={size,size}
  /* Preload data */
#pragma hmpp sgemm advancedload, args[vin1;vin2;vout] &
#pragma hmpp sgemm advancedload, args[m;n;k;alpha;beta]

  for( j = 0 ; j < 2 ; j++ ) {
    /* Execute asynchronously */
#pragma hmpp sgemm callsite, asynchronous, &
#pragma hmpp sgemm args[vin1;vin2;vout].advancedload=true &
#pragma hmpp sgemm args[m;n;k;alpha;beta].advancedload=true
    sgemm( size, size, size, alpha, vin1, vin2, beta, vout );
#pragma hmpp sgemm synchronize
  }

  /* Download the result when needed, then free the device */
#pragma hmpp sgemm delegatedstore, args[vout]
#pragma hmpp sgemm release
  …
}
Codelet Directive (1)
Declare a hardware-specific implementation of a function
• Several possible variants (target, execution context)
• Default is the native codelet
#pragma hmpp label codelet, target=CUDA:BROOK, args[v1].io=out
#pragma hmpp label2 codelet, target=SSE, args[v1].io=out, cond="n<800"
void MyCodelet(int n, float v1[n], float v2[n], float v3[n]) {
  int i;
  for (i = 0 ; i < n ; i++) {
    v1[i] = v2[i] + v3[i];
  }
}
Advancedload Directive (1)
Data transfers strongly impact performance
• Try to preload data before the codelet call site
#pragma hmpp simple advancedload, args[v2], asynchronous, \
        args[v2].addr="t2"
for (k = 0 ; k < iter ; k++) {
#pragma hmpp simple callsite, args[v2].advancedload=true
  simplefunc1(n, &(t1[k*n]), &(t2[k*n]), &(t3[k*n]));

#pragma hmpp simple advancedload, args[v2], asynchronous, \
        args[v2].addr="&(t2[(k+1)*n])", args[v2].size="(n)"
  /* … do something else … */
}
#pragma hmpp simple release
Advancedload Directive (2)
Avoid reloading constant data
t2 is not reloaded at each loop iteration:

int main(int argc, char **argv) {
  …
#pragma hmpp simple advancedload, args[v2], const
  for (j=0; j<n; j++){
#pragma hmpp simple callsite, args[v2].advancedload=true
    simplefunc1(n, t1[j], t2, t3[j], alpha);
  }
#pragma hmpp simple release
  …
}
Objectives
Allow transparent use of HWAs
• From C or Fortran to CUDA, Brook, …
Allow for code tuning at the source code level
• Directives-based approach
Codelet Generation
C, Java or Fortran source code input
• HWA-oriented subset of the languages
Set of directives to
• Optimize target codelet generation
• Express parallelism
Make code tuning easier
• Generated code can also be tuned
Loop Parallelization
Force or prevent the parallelization of loops
Help define the kernels in a codelet
#pragma hmppcg parallel
for (i=0; i < n; i++) {
#pragma hmppcg noParallel
  for (j=0; j < n; j++) {
    D[i][j] = A[i][j] * E[3][j];
  }
}
Input C Code Example 1
typedef struct { float r, i; } Complex;

#pragma hmpp convolution2d codelet, args[data; opx].io=in, args[convr].io=out, target=CUDA
void convolution2d( Complex *data, int nx, int ny, Complex *opx,
                    int oplx, int oply, Complex *convr) {
  int hoplx = (oplx+1)/2;
  int hoply = (oply+1)/2;
  int iy, ix;
#pragma hmppcg parallel
  for (iy = 0; iy < ny; iy++) {
#pragma hmppcg parallel
    for (ix = 0; ix < nx; ix++) {
      float dumr = 0.0, dumi = 0.0;
      int ky;
      for (ky = -(oply - hoply - 1); ky <= hoply; ky++) {
        int kx;
        for (kx = -(oplx - hoplx - 1); kx <= hoplx; kx++) {
          int dx = min( max(ix+kx, 0), (nx - 1) );
          int dy = min( max(iy+ky, 0), (ny - 1) );
          dumr += data[dy * nx + dx].r * opx[(hoply - ky) * oplx + (hoplx - kx)].r;
          dumr -= data[dy * nx + dx].i * opx[(hoply - ky) * oplx + (hoplx - kx)].i;
          dumi += data[dy * nx + dx].r * opx[(hoply - ky) * oplx + (hoplx - kx)].i;
          dumi += data[dy * nx + dx].i * opx[(hoply - ky) * oplx + (hoplx - kx)].r;
        }
      }
      convr[iy*nx+ix].r = dumr;
      convr[iy*nx+ix].i = dumi;
    }
  }
}
Input Fortran Code Example 2
!$HMPP sgemm3 codelet, target=CUDA, args[vout].io=inout
SUBROUTINE sgemm(m,n,k2,alpha,vin1,vin2,beta,vout)
  INTEGER, INTENT(IN)    :: m,n,k2
  REAL,    INTENT(IN)    :: alpha,beta
  REAL,    INTENT(IN)    :: vin1(n,n), vin2(n,n)
  REAL,    INTENT(INOUT) :: vout(n,n)
  REAL    :: prod
  INTEGER :: i,j,k
!$HMPPCG unroll(8), jam(2), noremainder
!$HMPPCG parallel
  DO j=1,n
!$HMPPCG unroll(8), splitted, noremainder
!$HMPPCG parallel
    DO i=1,n
      prod = 0.0
      DO k=1,n
        prod = prod + vin1(i,k) * vin2(k,j)
      ENDDO
      vout(i,j) = alpha * prod + beta * vout(i,j)
    END DO
  END DO
END SUBROUTINE sgemm
Tuning Issue Example
#pragma hmpp astex_codelet__1 codelet &
#pragma hmpp astex_codelet__1 , args[c].io=in &
#pragma hmpp astex_codelet__1 , args[v].io=inout &
#pragma hmpp astex_codelet__1 , args[u].io=inout &
#pragma hmpp astex_codelet__1 , target=CUDA &
#pragma hmpp astex_codelet__1 , version=1.4.0
void astex_codelet__1(float u[256][256][256], float v[256][256][256],
                      float c[256][256][256], const int K, const float x2){
astex_thread_begin:{
  for (int it = 0 ; it < K ; ++it){
    for (int i2 = 1 ; i2 < 256 - 1 ; ++i2){
      for (int i3 = 1 ; i3 < 256 - 1 ; ++i3){
        for (int i1 = 1 ; i1 < 256 - 1 ; ++i1){
          float coeff = c[i3][i2][i1] * c[i3][i2][i1] * x2;
          float sum = u[i3][i2][i1 + 1] + u[i3][i2][i1 - 1];
          sum += u[i3][i2 + 1][i1] + u[i3][i2 - 1][i1];
          sum += u[i3 + 1][i2][i1] + u[i3 - 1][i2][i1];
          v[i3][i2][i1] = (2. - 6. * coeff) * u[i3][i2][i1] + coeff * sum - v[i3][i2][i1];
        }
      }
    }
    for (int i2 = 1 ; i2 < 256 - 1 ; ++i2){
      for (int i3 = 1 ; i3 < 256 - 1 ; ++i3){
        for (int i1 = 1 ; i1 < 256 - 1 ; ++i1){
          …
        }
      }
    }
  }
}astex_thread_end:;
}

Needs a loop interchange: the contiguous index i1 should be the innermost parallel dimension.
Motivations
Various implementations of libraries are available for a given target
• CUBLAS, MKL, ATLAS, …
No strict performance order
• Each library has a different performance profile
• Best choice depends on platform and runtime parameters
The user is left with a complex issue
• Measuring routine performance
• Programming the decision
• Adapting to hardware versions
Development partially funded by the STREP Milepost project
• http://www.milepost.eu/
• Machine Learning for Embedded Programs Optimisation
Difficult Decision Making with Alternative Codes (Multiversioning)
Various implementations of routines are available or can be generated for a given target
• CUBLAS, MKL, ATLAS, …
• SIMD instructions, GPcore, HWA, Hybrid
No strict performance order
• Each implementation has a different performance profile
• Best choice depends on platform and runtime parameters
Decision is a complex issue
• How to produce the decision?
Illustrating Example: Dealing with Multiple BLAS Implementations
Runtime selection of DGEMM in High Performance Linpack
• Intel(R) Xeon(R) E5420 @ 2.50GHz
• CUBLAS on Tesla C1060, Intel MKL
Three binaries of the application
• Static linking with CUBLAS
• Static linking with MKL
• Library mix with selection of the routine at runtime, automatically generated using CAPS tooling
Three hardware resource configurations
• GPU + 1, 2, and 4 cores used for MKL
Performance Using One Core
Performance in GFLOPS for four problem sizes:

Problem size | CUBLAS | MKL | Dyn. Sel.
64           | 0.07   | 1.3 | 1.4
500          | 1.2    | 7.3 | 6.5
1200         | 4.4    | 8   | 8.1
8000         | 23     | 9   | 23.3
Performance Using Two Cores
Performance in GFLOPS:

Problem size | CUBLAS | MKL | Dyn. Sel.
64           | 0.07   | 0.6 | 1.4
500          | 1.2    | 4.3 | 6.5
1200         | 4.4    | 7.6 | 12
8000         | 23     | 15  | 29
Performance Using Four Cores
Performance in GFLOPS:

Problem size | CUBLAS | MKL | Dyn. Sel.
64           | 0.07   | 0.9 | 1.2
500          | 1.2    | 5   | 7.2
1200         | 4.4    | 9.7 | 13
8000         | 23     | 26  | 32
Codelet Finder Overview
Partitioning of C code to highlight codelets
• Data value specialization
• Aliasing speculation
Useful for
• HWA exploitation (and maybe vectorization and parallelization)
[Diagram: the partitioned code alternates static and dynamic sections.]
Extracted Codelets Are Not Just Hotspots
HWA data mapping in local memory adds constraints
{
  for (x = 0 ; x < i_size ; x++) {
    diff[x + y * i_size] = pix1[x] - pix2[x];
  }
  pix1 += i_pix1;
  pix2 += i_pix2;
}
[Diagram: pix1 points at address 0xA…10 in main memory but at 0x0…05 in HWA local memory; the pointer must be rebased when the data are mapped onto the device.]
Example of Partitioning to Use HWA (1)
(Same astex_codelet__1 listing as shown in the Tuning Issue Example slide.)
Example of Partitioning to Use HWA (2)
Extract the codelet to be executed on the HWA
• Data specialization
• Aliasing speculation
Convolution code
• icc -O3 vs. HWA
• Speedup is 3.4 with the HWA
Codelet tuning
• A loop interchange was needed
Codelet Finder in ProHMPT
Can be used to provide a codelet testbed for the various techniques
CAPS in ProHMPT
• Generation of adaptive codes
• Definition of directives
• Dynamic resource allocation
Tasks
Theme 2: parallelism extraction
• Task 3: Compilation and static code analysis
  • Based on DPIL
• Task 4: Language: OpenMP extension for heterogeneous computing
Theme 3: Software support
• Task 7: Scheduling
  • Resource allocation