
CAPS Technology

ProHMPT, March 12th, 2009

Confidential, copyright CAPS entreprise

Overview of the Talk

1. HMPP in a nutshell
   • Directives for Hardware Accelerators (HWA)

2. HMPP Code Generation Capabilities
   • Efficient code generation for CUDA

3. Library adapter
   • HPL / DGEMM experiment

4. Codelet Finder

Bordeaux, March 12, 2009


HMPP Directives

  C and Fortran directives to program hardware accelerators
  • Ensure portability and default compilation and execution
  • Declare hardware implementations of native functions
  • Indicate resource allocation and communication
  • Place synchronization barriers

  A standard and portable way of programming
  A programming glue between general-purpose and hardware-specific languages
  • Insulation of hardware-specific kernels in C and Fortran code

HMPP Workbench 4


Directives Principles

  Declare hardware-specific implementations of functions (codelets)
  • Can be specialized to the execution context (data size, …)
  Codelet calls
  • Synchronous, asynchronous properties
  Data transfers
  • Data preloading
  Synchronization barriers
  • Host CPU waits until remote computation has completed


[Figure: the host side (main memory holding the application data, general-purpose processor cores) and the HWA side (a copy of the application data on the HWA, HWA cores), connected by Upload and Download transfers and a remote procedure call]


Simple Example


#pragma hmpp sgemm codelet, target=CUDA, args[vout].io=inout
extern void sgemm( int m, int n, int k, float alpha,
                   const float vin1[n][n], const float vin2[n][n],
                   float beta, float vout[n][n] );

int main(int argc, char **argv) {
  …
  for( j = 0 ; j < 2 ; j++ ) {
#pragma hmpp sgemm callsite
    sgemm( size, size, size, alpha, vin1, vin2, beta, vout );
  }
  …
}


HMPP Codelet Definition

  A pure function to be executed on a remote device or specialized core
  • No global variables
  • No side effects
  Several possible variants
  • For different targets
  • For different use contexts (vector size, ...)
  Managed by the HMPP runtime
  • The HMPP API provides the necessary support functions


Directives Overview

  A unique label identifies a group of directives that belong to the same codelet

  Directive types:
  • codelet: codelet declaration
  • callsite: codelet call, can be asynchronous
  • advancedload: preloading of data
  • delegatedstore: wait for data result upload
  • synchronize: wait for the completion of a codelet
  • release: free a compute unit for another codelet


#pragma hmpp <label> <directive type> [, <directive parameter>]* [&]

!$hmpp <label> <directive type> [, <directive parameter>]* [&]


Advanced Programming


int main(int argc, char **argv) {
  …
  /* Allocate and initialize the device outside the loop */
#pragma hmpp sgemm allocate, args[vin1;vin2;vout].size={size,size}
  /* Preload data */
#pragma hmpp sgemm advancedload, args[vin1;vin2;vout] &
#pragma hmpp sgemm advancedload, args[m;n;k;alpha;beta]

  for( j = 0 ; j < 2 ; j++ ) {
    /* Execute asynchronously */
#pragma hmpp sgemm callsite, asynchronous, &
#pragma hmpp sgemm args[vin1;vin2;vout].advancedload=true &
#pragma hmpp sgemm args[m;n;k;alpha;beta].advancedload=true
    sgemm( size, size, size, alpha, vin1, vin2, beta, vout );
#pragma hmpp sgemm synchronize
  }

  /* Download the result when needed */
#pragma hmpp sgemm delegatedstore, args[vout]
#pragma hmpp sgemm release
  …
}


Codelet Directive (1)

  Declare a hardware-specific implementation of a function
  • Several possible variants (target, execution context)
  • Default is the native codelet


#pragma hmpp label codelet, target=CUDA:BROOK, args[v1].io=out
#pragma hmpp label2 codelet, target=SSE, args[v1].io=out, cond="n<800"
void MyCodelet(int n, float v1[n], float v2[n], float v3[n]) {
  int i;
  for (i = 0 ; i < n ; i++) {
    v1[i] = v2[i] + v3[i];
  }
}


Advancedload Directive (1)

  Data transfers strongly impact performance
  • Try to preload data before the codelet call site


#pragma hmpp simple advancedload, args[v2], asynchronous, \
             args[v2].addr="t2"
for (k = 0 ; k < iter ; k++) {
#pragma hmpp simple callsite, args[v2].advancedload=true
  simplefunc1(n, &(t1[k*n]), &(t2[k*n]), &(t3[k*n]));

#pragma hmpp simple advancedload, args[v2], asynchronous, \
             args[v2].addr="&(t2[(k+1)*n])", args[v2].size="(n)"
  /* … do something else… */
}
#pragma hmpp simple release


Advancedload Directive (2)

  Avoid reloading constant data


/* t2 is not reloaded at each loop iteration */
int main(int argc, char **argv) {
  …
#pragma hmpp simple advancedload, args[v2], const
  for (j=0; j<n; j++){
#pragma hmpp simple callsite, args[v2].advancedload=true
    simplefunc1(n, t1[j], t2, t3[j], alpha);
  }
#pragma hmpp simple release
  …
}


Codelet Generation


Objectives

  Allow transparent use of HWAs
  • From C or Fortran to CUDA, Brook, …
  Allow code tuning at the source level
  • Directive-based approach


Code Generation Flow


Codelet Generation

  C, Java or Fortran source code input
  • HWA-oriented subset of the languages
  Set of directives to
  • Optimize target codelet generation
  • Express parallelism
  Make code tuning easier
  Generated code can also be tuned


Loop Parallelization

  Force or prevent the parallelization of loops
  Help define kernels in a codelet


#pragma hmppcg parallel
for (i=0; i < n; i++) {
#pragma hmppcg noParallel
  for (j=0; j < n; j++) {
    D[i][j] = A[i][j] * E[3][j];
  }
}


Input C Code Example 1


typedef struct { float r, i; } Complex;

#pragma hmpp convolution2d codelet, args[data; opx].io=in, args[convr].io=out, target=CUDA
void convolution2d( Complex *data, int nx, int ny, Complex *opx,
                    int oplx, int oply, Complex *convr ) {
  int hoplx = (oplx+1)/2;
  int hoply = (oply+1)/2;
  int iy, ix;
#pragma hmppcg parallel
  for (iy = 0; iy < ny; iy++) {
#pragma hmppcg parallel
    for (ix = 0; ix < nx; ix++) {
      float dumr = 0.0, dumi = 0.0;
      int ky;
      for (ky = -(oply - hoply - 1); ky <= hoply; ky++) {
        int kx;
        for (kx = -(oplx - hoplx - 1); kx <= hoplx; kx++) {
          int dx = min( max(ix+kx, 0), (nx - 1) );
          int dy = min( max(iy+ky, 0), (ny - 1) );
          dumr += data[dy * nx + dx].r * opx[(hoply - ky) * oplx + (hoplx - kx)].r;
          dumr -= data[dy * nx + dx].i * opx[(hoply - ky) * oplx + (hoplx - kx)].i;
          dumi += data[dy * nx + dx].r * opx[(hoply - ky) * oplx + (hoplx - kx)].i;
          dumi += data[dy * nx + dx].i * opx[(hoply - ky) * oplx + (hoplx - kx)].r;
        }
      }
      convr[iy*nx+ix].r = dumr;
      convr[iy*nx+ix].i = dumi;
    }
  }
}


Input Fortran Code Example 2


!$HMPP sgemm3 codelet, target=CUDA, args[vout].io=inout
SUBROUTINE sgemm(m,n,k2,alpha,vin1,vin2,beta,vout)
  INTEGER, INTENT(IN)    :: m,n,k2
  REAL,    INTENT(IN)    :: alpha,beta
  REAL,    INTENT(IN)    :: vin1(n,n), vin2(n,n)
  REAL,    INTENT(INOUT) :: vout(n,n)
  REAL    :: prod
  INTEGER :: i,j,k
!$HMPPCG unroll(8), jam(2), noremainder
!$HMPPCG parallel
  DO j=1,n
!$HMPPCG unroll(8), splitted, noremainder
!$HMPPCG parallel
    DO i=1,n
      prod = 0.0
      DO k=1,n
        prod = prod + vin1(i,k) * vin2(k,j)
      ENDDO
      vout(i,j) = alpha * prod + beta * vout(i,j)
    END DO
  END DO
END SUBROUTINE sgemm


MxM Performance


Performance Examples


Tuning Issue Example


#pragma hmpp astex_codelet__1 codelet &
#pragma hmpp astex_codelet__1 , args[c].io=in &
#pragma hmpp astex_codelet__1 , args[v].io=inout &
#pragma hmpp astex_codelet__1 , args[u].io=inout &
#pragma hmpp astex_codelet__1 , target=CUDA &
#pragma hmpp astex_codelet__1 , version=1.4.0
void astex_codelet__1(float u[256][256][256], float v[256][256][256],
                      float c[256][256][256], const int K, const float x2){
astex_thread_begin: {
  for (int it = 0 ; it < K ; ++it){
    for (int i2 = 1 ; i2 < 256 - 1 ; ++i2){
      for (int i3 = 1 ; i3 < 256 - 1 ; ++i3){
        for (int i1 = 1 ; i1 < 256 - 1 ; ++i1){
          float coeff = c[i3][i2][i1] * c[i3][i2][i1] * x2;
          float sum = u[i3][i2][i1 + 1] + u[i3][i2][i1 - 1];
          sum += u[i3][i2 + 1][i1] + u[i3][i2 - 1][i1];
          sum += u[i3 + 1][i2][i1] + u[i3 - 1][i2][i1];
          v[i3][i2][i1] = (2. - 6. * coeff) * u[i3][i2][i1] + coeff * sum - v[i3][i2][i1];
        }
      }
    }
    for (int i2 = 1 ; i2 < 256 - 1 ; ++i2){
      for (int i3 = 1 ; i3 < 256 - 1 ; ++i3){
        for (int i1 = 1 ; i1 < 256 - 1 ; ++i1){
          . . . . .
        }
      }
    }
  }
} astex_thread_end: ;
}

Need interchange


Library Issues


Motivations

  Various implementations of libraries are available for a given target
  • CUBLAS, MKL, ATLAS, …
  No strict performance order
  • Each library has a different performance profile
  • Best choice depends on platform and runtime parameters
  The user is left with a complex issue
  • Measuring routine performance
  • Programming the decision
  • Adapting to hardware versions
  Development partially funded by STREP Milepost
  • http://www.milepost.eu/
  • Machine Learning for Embedded Programs Optimisation


Difficult Decision Making with Alternative Codes (Multiversioning)

  Various implementations of routines are available, or can be generated, for a given target
  • CUBLAS, MKL, ATLAS, …
  • SIMD instructions, GPcore, HWA, Hybrid
  No strict performance order
  • Each implementation has a different performance profile
  • Best choice depends on platform and runtime parameters
  Decision is a complex issue
  • How to produce the decision?


Library Adapter Overview


Illustrating Example: Dealing with Multiple BLAS Implementations

  Runtime selection of DGEMM in High Performance Linpack
  • Intel(R) Xeon(R) E5420 @ 2.50GHz
  • CUBLAS - Tesla C1060, Intel MKL
  Three binaries of the application
  • Static linking with CUBLAS
  • Static linking with MKL
  • Library mix with selection of the routine at runtime
  Automatically generated using CAPS tooling
  Three hardware resource configurations
  • GPU + 1, 2, and 4 cores used for MKL


Performance Using One Core

  Performance in GFLOPS
  4 problem sizes: 64, 500, 1200, 8000


[Bar chart, values read from the original figure; performance in GFLOPS]

Problem size   Cublas   MKL   Dyn. Sel.
64             0.07     1.3   1.4
500            1.2      7.3   6.5
1200           4.4      8     8.1
8000           23       9     23.3


Performance Using Two Cores


[Bar chart, values read from the original figure; performance in GFLOPS]

Problem size   Cublas   MKL   Dyn. Sel.
64             0.07     0.6   1.4
500            1.2      4.3   6.5
1200           4.4      7.6   12
8000           23       15    29


Performance Using Four Cores


[Bar chart, values read from the original figure; performance in GFLOPS]

Problem size   Cublas   MKL   Dyn. Sel.
64             0.07     0.9   1.2
500            1.2      5     7.2
1200           4.4      9.7   13
8000           23       26    32


Codelet Finder Alpha version


Codelet Finder Overview

  Partitioning of C code to highlight codelets
  • Data value specialization
  • Aliasing speculation
  Useful for
  • HWA exploitation (and maybe vectorization and parallelization)

[Figure: application code partitioned into alternating static and dynamic regions]


Extracted Codelets Are Not Just Hotspots

  HWA data mapping in local memory adds constraints

{
  for (x = 0 ; x < i_size ; x++) {
    diff[x + y * i_size] = pix1[x] - pix2[x];
  }
  pix1 += i_pix1;
  pix2 += i_pix2;
}

[Figure: the same pix1 buffer lives at address 0xA…10 in main memory but at 0x0…05 in HWA local memory, so host pointer values cannot be used directly on the HWA]


Example of Partitioning to Use HWA (1)


#pragma hmpp astex_codelet__1 codelet &
#pragma hmpp astex_codelet__1 , args[c].io=in &
#pragma hmpp astex_codelet__1 , args[v].io=inout &
#pragma hmpp astex_codelet__1 , args[u].io=inout &
#pragma hmpp astex_codelet__1 , target=CUDA &
#pragma hmpp astex_codelet__1 , version=1.4.0
void astex_codelet__1(float u[256][256][256], float v[256][256][256],
                      float c[256][256][256], const int K, const float x2){
astex_thread_begin: {
  for (int it = 0 ; it < K ; ++it){
    for (int i2 = 1 ; i2 < 256 - 1 ; ++i2){
      for (int i3 = 1 ; i3 < 256 - 1 ; ++i3){
        for (int i1 = 1 ; i1 < 256 - 1 ; ++i1){
          float coeff = c[i3][i2][i1] * c[i3][i2][i1] * x2;
          float sum = u[i3][i2][i1 + 1] + u[i3][i2][i1 - 1];
          sum += u[i3][i2 + 1][i1] + u[i3][i2 - 1][i1];
          sum += u[i3 + 1][i2][i1] + u[i3 - 1][i2][i1];
          v[i3][i2][i1] = (2. - 6. * coeff) * u[i3][i2][i1] + coeff * sum - v[i3][i2][i1];
        }
      }
    }
    for (int i2 = 1 ; i2 < 256 - 1 ; ++i2){
      for (int i3 = 1 ; i3 < 256 - 1 ; ++i3){
        for (int i1 = 1 ; i1 < 256 - 1 ; ++i1){
          . . . . .
        }
      }
    }
  }
} astex_thread_end: ;
}


Example of Partitioning to Use HWA (2)

  Extract a codelet to be executed on the HWA
  • Data specialization
  • Aliasing speculation
  Convolution code
  • icc -O3 vs HWA
  • Speedup is 3.4 with the HWA
  Codelet tuning
  • Loop interchange was needed


Codelet Finder in ProHMPT

  Can be used to provide a codelet testbed for the various techniques


Conclusion


CAPS in ProHMPT

  Adaptive code generation

  Directive definition

  Dynamic resource allocation


Tasks

  Theme 2: parallelism extraction
  • Task 3: Compilation and static code analysis
    Based on DPIL
  • Task 4: Language
    OpenMP extension for heterogeneous computing
  Theme 3: Software support
  • Task 7: Scheduling
    Resource allocation
