TRANSCRIPT
Confidential, copyright CAPS entreprise
Overview of the Talk
1. HMPP in a nutshell
   • Directives for Hardware Accelerators (HWA)
2. HMPP Code Generation Capabilities
   • Efficient code generation for CUDA
3. Library Adapter
   • HPL / DGEMM experiment
4. Codelet Finder
Bordeaux, March 12, 2009
HMPP Directives
C and Fortran directives to program hardware accelerators
• Ensure portability and a default (host) compilation and execution path
• Declare hardware implementations of native functions
• Indicate resource allocation and communication
• Place synchronization barriers
A standard and portable way of programming
A programming glue between general-purpose and hardware-specific languages
• Insulation of hardware-specific kernels from the C and Fortran code
HMPP Workbench 4
Directives Principles
Declare hardware-specific implementations of functions (codelets)
• Can be specialized to the execution context (data size, …)
Codelet calls
• Synchronous or asynchronous properties
Data transfers
• Data preloading
Synchronization barriers
• Host CPU waits until the remote computation has completed
[Diagram: application data in main memory on the general-purpose processor cores is uploaded to a copy in HWA memory; the codelet executes on the HWA cores as a remote procedure call, and the result is downloaded back.]
Simple Example
#pragma hmpp sgemm codelet, target=CUDA, args[vout].io=inout
extern void sgemm( int m, int n, int k, float alpha,
                   const float vin1[n][n], const float vin2[n][n],
                   float beta, float vout[n][n] );

int main(int argc, char **argv) {
  …
  for( j = 0 ; j < 2 ; j++ ) {
#pragma hmpp sgemm callsite
    sgemm( size, size, size, alpha, vin1, vin2, beta, vout );
  }
  …
}
HMPP Codelet Definition
A pure function to be executed on a remote device or specialized core
• No global variables
• No side effects
Several possible variants
• For different targets
• For different use contexts (vector size, …)
Managed by the HMPP runtime
• The HMPP API provides the necessary support functions
Directives Overview
A unique label identifies a group of directives that belong to the same codelet
Directive types:
• codelet: codelet declaration
• callsite: codelet call, can be asynchronous
• advancedload: preloading of data
• delegatedstore: wait for data result upload
• synchronize: wait for the completion of a codelet
• release: free a compute unit for another codelet
#pragma hmpp <label> <directive type> [, <directive parameter>]* [&]
!$hmpp <label> <directive type> [, <directive parameter>]* [&]
Advanced Programming
int main(int argc, char **argv) {
  …
  /* Allocate and initialize the device outside the loop */
#pragma hmpp sgemm allocate, args[vin1;vin2;vout].size={size,size}
  /* Preload data */
#pragma hmpp sgemm advancedload, args[vin1;vin2;vout] &
#pragma hmpp sgemm advancedload, args[m;n;k;alpha;beta]

  for( j = 0 ; j < 2 ; j++ ) {
    /* Execute asynchronously */
#pragma hmpp sgemm callsite, asynchronous, &
#pragma hmpp sgemm args[vin1;vin2;vout].advancedload=true &
#pragma hmpp sgemm args[m;n;k;alpha;beta].advancedload=true
    sgemm( size, size, size, alpha, vin1, vin2, beta, vout );
#pragma hmpp sgemm synchronize
  }

  /* Download the result when needed, then free the device */
#pragma hmpp sgemm delegatedstore, args[vout]
#pragma hmpp sgemm release
  …
}
Codelet Directive (1)
Declare a hardware-specific implementation of a function
• Several possible variants (target, execution context)
• Default is the native codelet
#pragma hmpp label codelet, target=CUDA:BROOK, args[v1].io=out
#pragma hmpp label2 codelet, target=SSE, args[v1].io=out, cond="n<800"
void MyCodelet(int n, float v1[n], float v2[n], float v3[n]) {
  int i;
  for (i = 0 ; i < n ; i++) {
    v1[i] = v2[i] + v3[i];
  }
}
Advancedload Directive (1)
Data transfers strongly impact performance
• Try to preload data before the codelet call site
#pragma hmpp simple advancedload, args[v2], asynchronous, \
        args[v2].addr="t2"
for (k = 0 ; k < iter ; k++) {
#pragma hmpp simple callsite, args[v2].advancedload=true
  simplefunc1(n, &(t1[k*n]), &(t2[k*n]), &(t3[k*n]));

#pragma hmpp simple advancedload, args[v2], asynchronous, \
        args[v2].addr="&(t2[(k+1)*n])", args[v2].size="(n)"
  /* … do something else … */
}
#pragma hmpp simple release
Advancedload Directive (2)
Avoid reloading constant data
t2 is not reloaded at each loop iteration:

int main(int argc, char **argv) {
  …
#pragma hmpp simple advancedload, args[v2], const
  for (j=0; j<n; j++){
#pragma hmpp simple callsite, args[v2].advancedload=true
    simplefunc1(n, t1[j], t2, t3[j], alpha);
  }
#pragma hmpp simple release
  …
}
Objectives
Allow transparent use of HWAs
• From C or Fortran to CUDA, Brook, …
Allow for code tuning at the source code level
• Directives-based approach
Codelet Generation
C, Java or Fortran source code input
• HWA-oriented subset of the languages
Set of directives to
• Optimize target codelet generation
• Express parallelism
Make code tuning easier
• Generated code can also be tuned
Loop Parallelization
Force or prevent the parallelization of loops
Help define the kernels in a codelet
#pragma hmppcg parallel
for (i=0; i < n; i++) {
#pragma hmppcg noParallel
  for (j=0; j < n; j++) {
    D[i][j] = A[i][j] * E[3][j];
  }
}
Input C Code Example 1
typedef struct { float r, i; } Complex;

#pragma hmpp convolution2d codelet, args[data; opx].io=in, args[convr].io=out, target=CUDA
void convolution2d( Complex *data, int nx, int ny, Complex *opx,
                    int oplx, int oply, Complex *convr) {
  int hoplx = (oplx+1)/2;
  int hoply = (oply+1)/2;
  int iy, ix;
#pragma hmppcg parallel
  for (iy = 0; iy < ny; iy++) {
#pragma hmppcg parallel
    for (ix = 0; ix < nx; ix++) {
      float dumr = 0.0, dumi = 0.0;
      int ky;
      for (ky = -(oply - hoply - 1); ky <= hoply; ky++) {
        int kx;
        for (kx = -(oplx - hoplx - 1); kx <= hoplx; kx++) {
          int dx = min( max(ix+kx, 0), (nx - 1) );
          int dy = min( max(iy+ky, 0), (ny - 1) );
          dumr += data[dy * nx + dx].r * opx[(hoply - ky) * oplx + (hoplx - kx)].r;
          dumr -= data[dy * nx + dx].i * opx[(hoply - ky) * oplx + (hoplx - kx)].i;
          dumi += data[dy * nx + dx].r * opx[(hoply - ky) * oplx + (hoplx - kx)].i;
          dumi += data[dy * nx + dx].i * opx[(hoply - ky) * oplx + (hoplx - kx)].r;
        }
      }
      convr[iy*nx+ix].r = dumr;
      convr[iy*nx+ix].i = dumi;
    }
  }
}
Input Fortran Code Example 2
!$HMPP sgemm3 codelet, target=CUDA, args[vout].io=inout
SUBROUTINE sgemm(m,n,k2,alpha,vin1,vin2,beta,vout)
  INTEGER, INTENT(IN)    :: m,n,k2
  REAL,    INTENT(IN)    :: alpha,beta
  REAL,    INTENT(IN)    :: vin1(n,n), vin2(n,n)
  REAL,    INTENT(INOUT) :: vout(n,n)
  REAL    :: prod
  INTEGER :: i,j,k
!$HMPPCG unroll(8), jam(2), noremainder
!$HMPPCG parallel
  DO j=1,n
!$HMPPCG unroll(8), splitted, noremainder
!$HMPPCG parallel
    DO i=1,n
      prod = 0.0
      DO k=1,n
        prod = prod + vin1(i,k) * vin2(k,j)
      ENDDO
      vout(i,j) = alpha * prod + beta * vout(i,j)
    END DO
  END DO
END SUBROUTINE sgemm
Tuning Issue Example
#pragma hmpp astex_codelet__1 codelet &
#pragma hmpp astex_codelet__1 , args[c].io=in &
#pragma hmpp astex_codelet__1 , args[v].io=inout &
#pragma hmpp astex_codelet__1 , args[u].io=inout &
#pragma hmpp astex_codelet__1 , target=CUDA &
#pragma hmpp astex_codelet__1 , version=1.4.0
void astex_codelet__1(float u[256][256][256], float v[256][256][256],
                      float c[256][256][256], const int K, const float x2){
astex_thread_begin:{
  for (int it = 0 ; it < K ; ++it){
    for (int i2 = 1 ; i2 < 256 - 1 ; ++i2){
      for (int i3 = 1 ; i3 < 256 - 1 ; ++i3){
        for (int i1 = 1 ; i1 < 256 - 1 ; ++i1){
          float coeff = c[i3][i2][i1] * c[i3][i2][i1] * x2;
          float sum = u[i3][i2][i1 + 1] + u[i3][i2][i1 - 1];
          sum += u[i3][i2 + 1][i1] + u[i3][i2 - 1][i1];
          sum += u[i3 + 1][i2][i1] + u[i3 - 1][i2][i1];
          v[i3][i2][i1] = (2. - 6. * coeff) * u[i3][i2][i1] + coeff * sum - v[i3][i2][i1];
        }
      }
    }
    for (int i2 = 1 ; i2 < 256 - 1 ; ++i2){
      for (int i3 = 1 ; i3 < 256 - 1 ; ++i3){
        for (int i1 = 1 ; i1 < 256 - 1 ; ++i1){
          …
        }
      }
    }
  }
}astex_thread_end:;
}

Needs a loop interchange: the contiguous index i1 should be the innermost parallel dimension.
Motivations
Various implementations of libraries are available for a given target
• CUBLAS, MKL, ATLAS, …
No strict performance order
• Each library has a different performance profile
• Best choice depends on platform and runtime parameters
The user is left with a complex issue
• Measuring routine performance
• Programming the decision
• Adapting to hardware versions
Development partially funded by the STREP Milepost project
• http://www.milepost.eu/
• Machine Learning for Embedded Programs Optimisation
Difficult Decision Making with Alternative Codes (Multiversioning)
Various implementations of routines are available or can be generated for a given target
• CUBLAS, MKL, ATLAS, …
• SIMD instructions, GPcore, HWA, Hybrid
No strict performance order
• Each implementation has a different performance profile
• Best choice depends on platform and runtime parameters
Decision is a complex issue
• How to produce the decision?
Illustrating Example: Dealing with Multiple BLAS Implementations
Runtime selection of DGEMM in High Performance Linpack
• Intel(R) Xeon(R) E5420 @ 2.50GHz
• CUBLAS on Tesla C1060, Intel MKL
Three binaries of the application
• Static linking with CUBLAS
• Static linking with MKL
• Library mix with selection of the routine at runtime, automatically generated using CAPS tooling
Three hardware resource configurations
• GPU + 1, 2, and 4 cores used for MKL
Performance Using One Core
Performance in GFLOPS for four problem sizes:

Problem size | CUBLAS | MKL | Dyn. Sel.
64           | 0.07   | 1.3 | 1.4
500          | 1.2    | 7.3 | 6.5
1200         | 4.4    | 8   | 8.1
8000         | 23     | 9   | 23.3
Performance Using Two Cores
Performance in GFLOPS:

Problem size | CUBLAS | MKL | Dyn. Sel.
64           | 0.07   | 0.6 | 1.4
500          | 1.2    | 4.3 | 6.5
1200         | 4.4    | 7.6 | 12
8000         | 23     | 15  | 29
Performance Using Four Cores
Performance in GFLOPS:

Problem size | CUBLAS | MKL | Dyn. Sel.
64           | 0.07   | 0.9 | 1.2
500          | 1.2    | 5   | 7.2
1200         | 4.4    | 9.7 | 13
8000         | 23     | 26  | 32
Codelet Finder Overview
Partitioning of C code to highlight codelets
• Data value specialization
• Aliasing speculation
Useful for
• HWA exploitation (and maybe vectorization and parallelization)
[Diagram: the partitioned code alternates static and dynamic sections.]
Extracted Codelets Are Not Just Hotspots
HWA data mapping in local memory adds constraints
{
  for (x = 0 ; x < i_size ; x++) {
    diff[x + y * i_size] = pix1[x] - pix2[x];
  }
  pix1 += i_pix1;
  pix2 += i_pix2;
}
[Diagram: pix1 points at address 0xA…10 in main memory but at 0x0…05 in HWA local memory; the pointer must be rebased when the data are mapped onto the device.]
Example of Partitioning to Use HWA (1)
(Same astex_codelet__1 listing as shown in the Tuning Issue Example slide.)
Example of Partitioning to Use HWA (2)
Extract the codelet to be executed on the HWA
• Data specialization
• Aliasing speculation
Convolution code
• icc -O3 vs. HWA
• Speedup is 3.4 with the HWA
Codelet tuning
• A loop interchange was needed
Codelet Finder in ProHMPT
Can be used to provide a codelet testbed for the various techniques
CAPS in ProHMPT
• Generation of adaptive codes
• Definition of directives
• Dynamic resource allocation
Tasks
Theme 2: parallelism extraction
• Task 3: Compilation and static code analysis
  • Based on DPIL
• Task 4: Language: OpenMP extension for heterogeneous computing
Theme 3: Software support
• Task 7: Scheduling
  • Resource allocation