TRANSCRIPT

MIGRATION OF LEGACY APPLICATIONS TO HETEROGENEOUS ARCHITECTURES
Francois Bodin, CTO, CAPS Entreprise
June 2011
2 | Migration Legacy | Month 6, 2011
FREE LUNCH IS OVER, CODES HAVE TO MIGRATE!
Many existing legacy codes need to migrate to new architectures
– Mostly written in C/C++ and Fortran
Computing power now comes from parallelism
– A shift from hardware (frequency increases) to software (parallel codes)
– Driven by energy consumption
Heterogeneity is a source of efficiency
– A few large, fast out-of-order cores combined with many smaller cores (e.g. APU)
Very wide configuration space due to heterogeneity
– Not specific to GPUs; it applies to fat-node systems in general
– Requires hybrid programming (e.g. MPI + OpenMP / OpenCL) and looking for trade-offs
(e.g. the number of nodes vs. the compute power of each node)
– Migration would be simpler if it did not require target-specific code changes
LEGACY CODES MIGRATION CHALLENGES
Mastering migration cost
– Ensuring an adequate return on investment
– Minimizing risks as well as manpower
Producing code that will last many architecture generations
– It is safe to assume that the node architecture will change with each renewal of the computer
Keeping the code application-developer friendly
– Application developers may not be multicore / accelerator / parallelism savvy
– Once ported, the application still needs to evolve
Keeping a single, preferably mono-language, version of the code
– Reduces maintenance cost
Library use
– There is no one-to-one replacement (e.g. FFT libraries)
– Libraries must interact with non-library accelerated kernels
HETEROGENEOUS HARDWARE, MULTIPLE PARALLELISM FORMS
[Diagram: parallelism forms ordered by granularity]
– Large grain (application programmers' level): domain decomposition, message passing; dynamic, load-balancing oriented
– Coarse grain (HMPP; targets accelerators and manycores): task parallelism, data / stream parallelism; data-locality oriented
– Fine grain (compilers' target): instruction-level parallelism, SIMD instructions (SSE, …)
HYBRID PROGRAMMING FOR FUTURE MANYCORES
Agnostic programming is paramount
– Highlight the parallelism, not its implementation
Use the right parallelism level for each part of the application
– Software engineering is important
– Separate application concerns from performance concerns
(specialized components, libraries, …)
Do not expect a common programming API for all levels
– APIs always make some underlying architecture assumptions
– There is no low-level programming API common to all devices (i.e. from coarse grain to fine grain)
– An API addresses a specific hardware component; as a consequence, we need many of them
Plan for debugging and tuning
– Parallel bugs are nasty
– Tuning is target specific
WHAT IS HMPP? (HETEROGENEOUS MANYCORE PARALLEL PROGRAMMING)
A directive-based, multi-language (C and Fortran) programming environment
– Helps keep software independent from hardware targets
– Provides an incremental API to exploit GPUs in legacy applications
– Avoids exit cost; can be a future-proof solution
HMPP provides
– Code generators from C and Fortran to OpenCL / CUDA
– A compiler driver that handles all low-level details of GPU compilers
(code generation and data transfer directives, GPU code optimization directives)
– A runtime to allocate and manage GPU resources
Source-to-source compiler
– The CPU code does not require a compiler change (especially important in Fortran)
– Complements existing parallel APIs (OpenMP or MPI)
– Exploits native programming environments (e.g. OpenCL)
HMPP BASIC DIRECTIVES
The codelet directive indicates the function to compile for the GPU; the callsite directive performs a Remote Procedure Call (RPC) onto the GPU.

#pragma hmpp call1 codelet, target=OpenCL
void myFunc(int n, int A[n], int B[n]){
  int i;
  for (i=0 ; i<n ; ++i)
    A[i] = A[i]+B[i]+1;
}

void main(void)
{
  int i;
  int X[100][10000], Y[10000];
  ...
  for (i=0;i<100;i++){
#pragma hmpp call1 callsite, …
    myFunc(10000, X[i], Y);
  }
  ...
}
HMPP CODELETS AND REGIONS
A codelet is a pure function that can be remotely executed on a GPU. Regions are a shortcut for writing codelets.

#pragma hmpp myfunc codelet, …
void saxpy(int n, float alpha, float x[n], float y[n]){
#pragma hmppcg parallel
  for(int i = 0; i<n; ++i)
    y[i] = alpha*x[i] + y[i];
}

#pragma hmpp myreg region, …
{
  for(int i = 0; i<n; ++i)
    y[i] = alpha*x[i] + y[i];
}
HMPP COMPILATION FLOW
[Diagram: the HMPP preprocessor splits the HMPP annotated source code into application source code and target source code. The application source code goes through a standard compiler to produce the host application, which links against the HMPP runtime. The target source code goes through the back-end generators (e.g. OpenCL) and a target compiler to produce the accelerated codelet library, loaded via the target driver. The host application runs on the CPU; the codelets run on the GPU.]
EXAMPLE OF DATA TRANSFER DIRECTIVES

int main(int argc, char **argv) {
#pragma hmpp sgemm allocate, args[vin1;vin2;vout].size={size,size}
  . . .
  /* Preload data */
#pragma hmpp sgemm advancedload, args[vin1;m;n;k;alpha;beta]
  for( j = 0 ; j < 2 ; j++ ) {
    /* Avoid reloading data */
#pragma hmpp sgemm callsite &
#pragma hmpp sgemm args[m;n;k;alpha;beta;vin1].advancedload=true
    sgemm( size, size, size, alpha, vin1, vin2, beta, vout );
    . . .
  }
  . . .
#pragma hmpp sgemm release
}
EXAMPLE OF HMPP OPTIMIZATION DIRECTIVES

#pragma hmppcg unroll(4), jam(2), noremainder
for( j = 0 ; j < p ; j++ ) {
#pragma hmppcg unroll(4), split, noremainder
  for( i = 0 ; i < m ; i++ ) {
    double prod = 0.0;
    double v1a, v2a;
    k = 0;
    v1a = vin1[k][i];
    v2a = vin2[j][k];
    for( k = 1 ; k < n ; k++ ) {
      prod += v1a * v2a;
      v1a = vin1[k][i];
      v2a = vin2[j][k];
    }
    prod += v1a * v2a;
    vout[j][i] = alpha * prod + beta * vout[j][i];
  }
}
LIBRARIES IN HMPP (VERSION 2.5)
The !$hmppalt directive replaces a library call with a proxy that handles GPUs, allowing user GPU code to be mixed with library code. The original CPU library calls are kept.

C
      CALL INIT(A,N)
      CALL ZFFT1D(A,N,0,B) ! This call is needed to initialize FFTE
      CALL DUMP(A,N)
C
C     TELL HMPP TO USE THE PROXY FUNCTION FROM THE FFTE PROXY GROUP
C     CHECK FOR EXECUTION ERROR ON THE GPU
!$hmppalt ffte call , name="zfft1d", error="proxy_err"
      CALL ZFFT1D(A,N,-1,B)
      CALL DUMP(A,N)
C
C     SAME HERE
!$hmppalt ffte call , name="zfft1d" , error="proxy_err"
      CALL ZFFT1D(A,N,1,B)
      CALL DUMP(A,N)
C
GETTING PREPARED FOR APU
Code generation for CPU and GPU
– From the same parallel source code, generate either CPU or GPU code for C, C++ and Fortran
Code tuning may differ according to the cores used
– Load balancing between CPU and GPU changes
– Sensitive to problem size
– The number of cores may vary
OpenCL is a portable syntax, not portable performance
– Automatic generation of OpenCL helps in tuning applications
CODE GENERATION FOR THE GPU CORES
The loop-nest gridification process converts parallel loop nests into a grid of GPU threads
– The parallel loop nest's iteration space is used to produce the threads
[Diagram: each iteration i=0 … i=9 of the loop below maps to one GPU thread t0 … t9]

#pragma hmppcg parallel
for(int i = 0; i<10; ++i){
  y[i] = alpha*x[i] + y[i];
}

becomes the GPU thread body

{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if( i<10 )
    y[i] = alpha*x[i] + y[i];
}
CODE GENERATION APPROACH FOR CPU CORES (ALPHA) - 1
Two main techniques are mixed
– Allocation of blocks of iterations to cores
(locality / affinity oriented)
– Loop transformations that optimize the loop body with respect to
memory accesses, workload / code generation, and vectorization
Favor affinity between loop nests
– Exploit the relationship between the iteration space and the data accesses
(the same block allocation leads to affinity across loop nests, which is natural in many codes)
– May degrade load balancing
CODE GENERATION APPROACH FOR CPU CORES (ALPHA) - 2
Loop transformations tune memory accesses, vectorization and the elemental workload, e.g.:
#pragma hmppcg unroll 4, jam
CODE GENERATION APPROACH FOR CPU CORES (ALPHA) - 3
Iteration mapping on cores is based on blocking
– Resources are allocated with #pragma hmpp allocate
– The block size is set with, e.g., #pragma hmppcg grid blocksize 1024x2048
[Diagram: the 2D iteration space is split into blocks distributed over Core 0 … Core 3]
LEGACY CODES MIGRATION
THE CRITICAL STEP
GO / NO GO
Go
• Dense hotspot
• Fast kernels
• Low CPU-GPU data transfers
• Prepares for manycore parallelism
No go
• Flat profile
• Slow GPU kernels (i.e. no speedup to be expected)
• Bit-exact CPU-GPU results required (execution cannot be validated otherwise)
• More memory needed than the device provides
PHASE 1 (DETAILS)
PHASE 2 (DETAILS)
HMPP FOR FUTURE MANYCORES
Current HMPP (2.x)
– Agnostic, directive-based style
– Targets stream parallelism on accelerators
– High-level expression of stream-oriented parallelism
– Mostly deals with one GPU per thread, no GPU sharing
– Oriented toward performance and device memory saving
Next HMPP (3.x)
– Agnostic, directive-based, with more APIs for expert programmers
(adaptive programming)
– High-level expression of stream-oriented parallelism
– Targets accelerators and CPU cores
– Handles multiple GPUs and data distribution
– Easier handling of data management between CPU and GPU
CONCLUSION
Heterogeneous architectures are becoming ubiquitous
– In HPC centers, but not only there
– Tremendous opportunities, but not always easy to seize
– CPU and GPU have to be used simultaneously
Legacy codes still need to be ported
– Software migration requires understanding the options; you do not want to backtrack
– A methodology and supporting tools are needed, and must provide a set of consistent views
– The legacy style is not helping
– Parallelism highlighted for the GPU is useful for future manycores
HMPP-based programming
– Helps implement incremental strategies
– Is being complemented by a set of tools
PERSPECTIVES
Need for a new programming standard
– OpenHMPP initiative launched by CAPS and Pathscale
– http://www.openhmpp.org/
Energy consumption control at the software level
– Are the energy savings worth the software tuning cost?
Cloud technology
– All the manycore issues, and more …
HTTP://WWW.CAPS-ENTREPRISE.COM
HTTP://WWW.OPENHMPP.ORG
DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited
to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product
differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no
obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to
make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.
NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO
RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS
INFORMATION.
ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY
DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL
OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF
EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in
this presentation are for informational purposes only and may be trademarks of their respective owners.
The contents of this presentation were provided by individual(s) and/or company listed on the title page. The information and
opinions presented in this presentation may not represent AMD’s positions, strategies or opinions. Unless explicitly stated, AMD is
not responsible for the content herein and no endorsements are implied.