TRANSCRIPT

MIGRATION OF LEGACY APPLICATIONS TO HETEROGENEOUS ARCHITECTURES
Francois Bodin, CTO, CAPS Entreprise
June 2011
2 | Migration Legacy | Month 6, 2011
FREE LUNCH IS OVER, CODES HAVE TO MIGRATE!
Many existing legacy codes need to migrate to new architectures
– Mostly written in C/C++ and Fortran
Computing power now comes from parallelism
– A shift from hardware (frequency increases) to software (parallel codes)
– Driven by energy consumption
Heterogeneity is a source of efficiency
– A few large, fast out-of-order cores combined with many smaller cores (e.g. APU)
Very wide configuration space due to heterogeneity
– Not specific to GPUs; it applies to fat-node systems in general
– Requires hybrid programming (e.g. MPI + OpenMP / OpenCL) and looking for trade-offs
(e.g. the number of nodes vs. the compute power of each node)
– Migration would be simpler if it did not require target-specific code changes
LEGACY CODES MIGRATION CHALLENGES
Mastering migration cost
– Ensuring an adequate return on investment
– Minimizing risks as well as manpower
Producing code that will last many architecture generations
– It is safe to assume that the node architecture will change with each renewal of the computer
Keeping the code application-developer friendly
– Application developers may not be multicore / accelerator / parallelism savvy
– Once ported, the application still needs to evolve
Keeping a single, preferably mono-language, version of the code
– Reduces maintenance cost
Library use
– There is no one-to-one replacement (e.g. FFT libraries)
– Libraries must interact with non-library accelerated kernels
HETEROGENEOUS HARDWARE, MULTIPLE PARALLELISM FORMS
[Diagram: parallelism forms ordered by granularity]
– Large grain (application programmers' level): domain decomposition, message passing; dynamic, load-balancing oriented
– Coarse grain (HMPP; targets accelerators and manycores): task parallelism, data / stream parallelism; data-locality oriented
– Fine grain (compilers' target): instruction-level parallelism, SIMD instructions (SSE, …)
HYBRID PROGRAMMING FOR FUTURE MANYCORES
Agnostic programming is paramount
– Highlight the parallelism, not its implementation
Use the right parallelism level for each part of the application
– Software engineering is important
– Separate application concerns from performance concerns
(specialized components, libraries, …)
Do not expect a common programming API for all levels
– APIs always make some underlying architecture assumptions
– There is no low-level programming API common to all devices (i.e. from coarse grain to fine grain)
– An API addresses a specific hardware component; as a consequence, we need many of them
Plan for debugging and tuning
– Parallel bugs are nasty
– Tuning is target specific
WHAT IS HMPP? (HETEROGENEOUS MANYCORE PARALLEL PROGRAMMING)
A directive-based, multi-language (C and Fortran) programming environment
– Helps keep software independent from hardware targets
– Provides an incremental API to exploit GPUs in legacy applications
– Avoids exit cost; can be a future-proof solution
HMPP provides
– Code generators from C and Fortran to OpenCL / CUDA
– A compiler driver that handles all low-level details of GPU compilers
(code generation and data transfer directives, GPU code optimization directives)
– A runtime to allocate and manage GPU resources
Source-to-source compiler
– The CPU code does not require a compiler change (especially important in Fortran)
– Complements existing parallel APIs (OpenMP or MPI)
– Exploits native programming environments (e.g. OpenCL)
HMPP BASIC DIRECTIVES
The codelet directive indicates the function to compile for the GPU; the callsite directive performs a Remote Procedure Call (RPC) onto the GPU.

#pragma hmpp call1 codelet, target=OpenCL
void myFunc(int n, int A[n], int B[n]){
  int i;
  for (i=0 ; i<n ; ++i)
    A[i] = A[i]+B[i]+1;
}

void main(void)
{
  int i;
  int X[100][10000], Y[10000];
  ...
  for (i=0;i<100;i++){
#pragma hmpp call1 callsite, …
    myFunc(10000, X[i], Y);
  }
  ...
}
HMPP CODELETS AND REGIONS
A codelet is a pure function that can be remotely executed on a GPU. Regions are a shortcut for writing codelets.

#pragma hmpp myfunc codelet, …
void saxpy(int n, float alpha, float x[n], float y[n]){
#pragma hmppcg parallel
  for(int i = 0; i<n; ++i)
    y[i] = alpha*x[i] + y[i];
}

#pragma hmpp myreg region, …
{
  for(int i = 0; i<n; ++i)
    y[i] = alpha*x[i] + y[i];
}
HMPP COMPILATION FLOW
[Diagram: the HMPP preprocessor splits the HMPP annotated source code into application source code and target source code. The application source code goes through a standard compiler to produce the host application, which links against the HMPP runtime. The target source code goes through the back-end generators (e.g. OpenCL) and a target compiler to produce the accelerated codelet library, loaded via the target driver. The host application runs on the CPU; the codelets run on the GPU.]
EXAMPLE OF DATA TRANSFER DIRECTIVES

int main(int argc, char **argv) {
#pragma hmpp sgemm allocate, args[vin1;vin2;vout].size={size,size}
  . . .
  /* Preload data */
#pragma hmpp sgemm advancedload, args[vin1;m;n;k;alpha;beta]
  for( j = 0 ; j < 2 ; j++ ) {
    /* Avoid reloading data */
#pragma hmpp sgemm callsite &
#pragma hmpp sgemm args[m;n;k;alpha;beta;vin1].advancedload=true
    sgemm( size, size, size, alpha, vin1, vin2, beta, vout );
    . . .
  }
  . . .
#pragma hmpp sgemm release
}
EXAMPLE OF HMPP OPTIMIZATION DIRECTIVES

#pragma hmppcg unroll(4), jam(2), noremainder
for( j = 0 ; j < p ; j++ ) {
#pragma hmppcg unroll(4), split, noremainder
  for( i = 0 ; i < m ; i++ ) {
    double prod = 0.0;
    double v1a, v2a;
    k = 0;
    v1a = vin1[k][i];
    v2a = vin2[j][k];
    for( k = 1 ; k < n ; k++ ) {
      prod += v1a * v2a;
      v1a = vin1[k][i];
      v2a = vin2[j][k];
    }
    prod += v1a * v2a;
    vout[j][i] = alpha * prod + beta * vout[j][i];
  }
}
LIBRARIES IN HMPP (VERSION 2.5)
The !$hmppalt directive replaces a library call with a proxy that handles GPUs, allowing user GPU code to be mixed with library code. The original CPU library calls are kept.

C
      CALL INIT(A,N)
      CALL ZFFT1D(A,N,0,B) ! This call is needed to initialize FFTE
      CALL DUMP(A,N)
C
C     TELL HMPP TO USE THE PROXY FUNCTION FROM THE FFTE PROXY GROUP
C     CHECK FOR EXECUTION ERROR ON THE GPU
!$hmppalt ffte call , name="zfft1d", error="proxy_err"
      CALL ZFFT1D(A,N,-1,B)
      CALL DUMP(A,N)
C
C     SAME HERE
!$hmppalt ffte call , name="zfft1d" , error="proxy_err"
      CALL ZFFT1D(A,N,1,B)
      CALL DUMP(A,N)
C
GETTING PREPARED FOR APU
Code generation for CPU and GPU
– From the same parallel source code, generate either CPU or GPU code for C, C++ and Fortran
Code tuning may differ according to the cores used
– Load balancing between CPU and GPU changes
– Sensitive to problem size
– The number of cores may vary
OpenCL is a portable syntax, not portable performance
– Automatic generation of OpenCL helps in tuning applications
CODE GENERATION FOR THE GPU CORES
The loop-nest gridification process converts parallel loop nests into a grid of GPU threads
– The parallel loop nest's iteration space is used to produce the threads
[Diagram: each iteration i=0 … i=9 of the loop below maps to one GPU thread t0 … t9]

#pragma hmppcg parallel
for(int i = 0; i<10; ++i){
  y[i] = alpha*x[i] + y[i];
}

becomes the GPU thread body

{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if( i<10 )
    y[i] = alpha*x[i] + y[i];
}
CODE GENERATION APPROACH FOR CPU CORES (ALPHA) - 1
Two main techniques are mixed
– Allocation of blocks of iterations to cores
(locality / affinity oriented)
– Loop transformations that optimize the loop body with respect to
memory accesses, workload / code generation, and vectorization
Favor affinity between loop nests
– Exploit the relationship between the iteration space and the data accesses
(the same block allocation leads to affinity across loop nests, which is natural in many codes)
– May degrade load balancing
CODE GENERATION APPROACH FOR CPU CORES (ALPHA) - 2
Loop transformations tune memory accesses, vectorization and the elemental workload, e.g.:
#pragma hmppcg unroll 4, jam
CODE GENERATION APPROACH FOR CPU CORES (ALPHA) - 3
Iteration mapping on cores is based on blocking
– Resources are allocated with #pragma hmpp allocate
– The block size is set with, e.g., #pragma hmppcg grid blocksize 1024x2048
[Diagram: the 2D iteration space is split into blocks distributed over Core 0 … Core 3]
LEGACY CODES MIGRATION
THE CRITICAL STEP
GO / NO GO
Go
• Dense hotspot
• Fast kernels
• Low CPU-GPU data transfers
• Prepares for manycore parallelism
No go
• Flat profile
• Slow GPU kernels (i.e. no speedup to be expected)
• Bit-exact CPU-GPU results required (execution cannot be validated otherwise)
• More memory needed than the device provides
PHASE 1 (DETAILS)
PHASE 2 (DETAILS)
HMPP FOR FUTURE MANYCORES
Current HMPP (2.x)
– Agnostic, directive-based style
– Targets stream parallelism on accelerators
– High-level expression of stream-oriented parallelism
– Mostly deals with one GPU per thread, no GPU sharing
– Oriented toward performance and device memory saving
Next HMPP (3.x)
– Agnostic, directive-based, with more APIs for expert programmers
(adaptive programming)
– High-level expression of stream-oriented parallelism
– Targets accelerators and CPU cores
– Handles multiple GPUs and data distribution
– Easier handling of data management between CPU and GPU
CONCLUSION
Heterogeneous architectures are becoming ubiquitous
– In HPC centers, but not only there
– Tremendous opportunities, but not always easy to seize
– CPU and GPU have to be used simultaneously
Legacy codes still need to be ported
– Software migration requires understanding the options; you do not want to backtrack
– A methodology and supporting tools are needed, and must provide a set of consistent views
– The legacy style is not helping
– Parallelism highlighted for the GPU is useful for future manycores
HMPP-based programming
– Helps implement incremental strategies
– Is being complemented by a set of tools
PERSPECTIVES
Need for a new programming standard
– OpenHMPP initiative launched by CAPS and Pathscale
– http://www.openhmpp.org/
Energy consumption control at the software level
– Are the energy savings worth the software tuning cost?
Cloud technology
– All the manycore issues, and more …
HTTP://WWW.CAPS-ENTREPRISE.COM
HTTP://WWW.OPENHMPP.ORG
DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited
to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product
differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no
obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to
make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.
NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO
RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS
INFORMATION.
ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY
DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL
OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF
EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in
this presentation are for informational purposes only and may be trademarks of their respective owners.
The contents of this presentation were provided by individual(s) and/or company listed on the title page. The information and
opinions presented in this presentation may not represent AMD’s positions, strategies or opinions. Unless explicitly stated, AMD is
not responsible for the content herein and no endorsements are implied.