2011.02.18 Marco Parenzan - Programming models for GPUs
TRANSCRIPT
15 April 2023 - slide 1 - MOSE – University of Trieste
Programming languages and compilers for GPUs
Marco Parenzan
GPU@UniTS
15 April 2023 - slide 2 - MOSE – University of Trieste
WARNING!
Programming languages and compilers for GPUs
15 April 2023 - slide 3 - MOSE – University of Trieste
WARNING! There is a choice to make now:
Nvidia/CUDA… …or not
GPU computing is not Nvidia/CUDA computing
Nvidia is the most advanced as a product
Nvidia is the only one with a specific scientific product on sale
Even so, Fermi was late (one year: scheduled for 2009, released in 2010)
So if CUDA is the main choice today… what about tomorrow? Intel is too late now, but Intel is Intel…
GPUs can be personal, not only in clusters
Not only for simulation
Broadly available video cards are now powered by GPUs
We have GPUs thanks to games!
15 April 2023 - slide 4 - MOSE – University of Trieste
Why GPU Computing…
Over the past few years, the GPU has evolved from a fixed-function special-purpose processor into a full-fledged parallel programmable processor with additional fixed-function special-purpose functionality
Fixed Function Pipeline: lack of generality
More fully featured instruction set
Unified Shader Model
Increased Programmability
Programmable engine surrounded by supporting fixed-function units
15 April 2023 - slide 5 - MOSE – University of Trieste
Products
Products you can choose, so you can try them
There is a BIG range of products
There is a LITTLE range of scientific products
Main products: Nvidia Fermi, ATI Radeon, Intel Larrabee
(Larrabee: again, not yet released… so late!)
15 April 2023 - slide 6 - MOSE – University of Trieste
Nvidia Cards
Compute capability | GPUs | Cards
1.0 | G80 | GeForce 8800GTX/Ultra/GTS, Tesla C/D/S870, FX4/5600, 360M
1.1 | G86, G84, G98, G96, G96b, G94, G94b, G92, G92b | GeForce 8400GS/GT, 8600GT/GTS, 8800GT/GTS, 9600GT/GSO, 9800GT/GTX/GX2, GTS 250, GT 120/30, FX 4/570, 3/580, 17/18/3700, 4700x2, 1xxM, 32/370M, 3/5/770M, 16/17/27/28/36/37/3800M, NVS420/50
1.2 | GT218, GT216, GT215 | GeForce 210, GT 220/40, FX380 LP, 1800M, 370/380M, NVS 2/3100M
1.3 | GT200, GT200b | GTX 260/75/80/85, 295, Tesla C/M1060, S1070, CX, FX 3/4/5800
2.0 | GF100, GF110 | GTX 465, 470/80, Tesla C2050/70, S/M2050/70, Quadro 600, 4/5/6000, Plex 7000, GTX 570, GTX 580
2.1 | GF108, GF106, GF104, GF114 | GT 420/30/40, GTS 450, GTX 460, 500M
(The slide color-codes the cards into Professional, Computing and Consumer series and marks the current series.)
15 April 2023 - slide 7 - MOSE – University of Trieste
ATI Cards
Retail/card series name | Chip series
R8500, R9000-R9250 | R200
R9500-R9800, X300-X600, X1050 | R300
X700-X850 | R420
X1300-X1950 | R520
HD2000-HD3000 | R600
HD4000 | R700
HD5000 | R800/Evergreen
HD6000 | R900/Northern Islands
HD7000 | R1000/Southern Islands
(Consumer series; the slide marks the current series.)
15 April 2023 - slide 8 - MOSE – University of Trieste
REFERENCE ARCHITECTURE
Programming languages and compilers for GPUs
15 April 2023 - slide 9 - MOSE – University of Trieste
An Asymmetric Multi-Processor System
A GPU-enabled system is an asymmetric multi-processor system
Different abilities
The GPU is designed for a particular class of applications
GPUs have instruction sets optimized for parallel data computing
Different performances
GPUs have optimized memory access and wide bandwidth for parallel data access (see the sketch below)
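As a side note (not on the original slide), that wide bandwidth is reached only when neighboring threads access neighboring addresses. A minimal sketch in C for CUDA, with hypothetical kernel names:

// Coalesced access: thread i reads element i, so adjacent threads touch
// adjacent addresses and the hardware serves few wide memory transactions.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided access: adjacent threads touch addresses far apart, so the same
// data needs many separate transactions and effective bandwidth collapses.
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i * stride] = in[i * stride];
}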
15 April 2023 - slide 10 - MOSE – University of Trieste
An Asymmetric Multi-Processor System
Computational requirements are large
Parallelism is substantial
Throughput is more important than latency
(Diagram: CPU ~50 GFlops with 4-6 GB of RAM at ~10 GB/s; GPU ~1 TFlop with 1 GB of RAM at ~100 GB/s; the two sides linked at ~1 GB/s.)
15 April 2023 - slide 11 - MOSE – University of Trieste
CPU vs GPU
CPU: low-latency memory, random accesses, 20 GB/s bandwidth, 0.1 TFlop compute, 1 GFlops/watt, well-known programming model
GPU: high-bandwidth memory, sequential accesses, 100 GB/s bandwidth, 1 TFlop compute, 10 GFlops/watt, niche programming model
15 April 2023 - slide 12 - MOSE – University of Trieste
How to use a GPU
The GPU can execute only parts of our algorithm
Parts that handle independent blocks of data that can be managed in parallel…
…and only for some tasks, compatible with the GPU instruction set
Loops, for example, are not GPU friendly (see the sketch below)
Think more functional
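A minimal sketch (not from the original slide) of what "GPU friendly" means, in C for CUDA: the sequential loop disappears and each thread handles one independent element.

// Sequential CPU version: one loop over all the elements
void add_cpu(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

// Data-parallel GPU version: no loop, each thread computes one element
__global__ void add_gpu(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}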
15 April 2023 - slide 13 - MOSE – University of Trieste
PROGRAMMING MODELS
Programming languages and compilers for GPUs
15 April 2023 - slide 14 - MOSE – University of Trieste
Programming Models
CUDA
Specific to Nvidia hardware; multiplatform (Windows, Linux, Mac)
OpenCL
Generic for multi-vendor hardware, also non-GPU hardware; multiplatform (Windows, Linux, Mac)
DirectCompute
Generic for multi-vendor GPU hardware; single platform (Windows)
15 April 2023 - slide 15 - MOSE – University of Trieste
Pros & Cons
CUDA – Pros: most advanced, mature, multiplatform. Cons: proprietary.
OpenCL – Pros: multiplatform, a standard like OpenGL. Cons: not mature.
DirectCompute – Pros: success of the DirectX family, acceptance/support by ATI. Cons: proprietary.
15 April 2023 - slide 16 - MOSE – University of Trieste
How to program: download toolkits
Nvidia http://developer.nvidia.com/object/cuda_3_2_downloads.html
CUDA + OpenCL support
ATI http://developer.amd.com/gpu/amdappsdk/pages/default.aspx
Accelerated Parallel Processing (aka Stream) API + OpenCL support
DirectCompute DirectX SDK (June 2010)
http://www.microsoft.com/downloads/en/details.aspx?displaylang=en&FamilyID=3021d52b-514e-41d3-ad02-438a3ba730ba
In any case, download updated drivers for your GPU/graphics card
Watch out for mobile drivers… always in beta!
15 April 2023 - slide 17 - MOSE – University of Trieste
How to program
Every platform has its own proprietary language
As GPUs are just specialized CPUs, the primary programming tool is a C-derived dialect language:
CUDA: C for CUDA (PathScale Open64 C compiler)
OpenCL: C99 dialect
DirectCompute: HLSL (High Level Shading Language – shading of pixels, from a consumer background)
15 April 2023 - slide 18 - MOSE – University of Trieste
Kernel for Sub-Block Matrix Multiplication in C for CUDA
////////////////////////////////////////////////////////////////////////////////
//! Matrix multiplication on the device: C = A * B
//! wA is A's width and wB is B's width (BLOCK_SIZE is a compile-time constant, e.g. 16)
////////////////////////////////////////////////////////////////////////////////
__global__ void matrixMul(float* C, float* A, float* B, int wA, int wB)
{
    // Block index
    int bx = blockIdx.x; int by = blockIdx.y;
    // Thread index
    int tx = threadIdx.x; int ty = threadIdx.y;
    // Index of the first and last sub-matrix of A processed by the block
    int aBegin = wA * BLOCK_SIZE * by;
    int aEnd = aBegin + wA - 1;
    // Step size used to iterate through the sub-matrices of A
    int aStep = BLOCK_SIZE;
    // Index of the first sub-matrix of B processed by the block
    int bBegin = BLOCK_SIZE * bx;
    // Step size used to iterate through the sub-matrices of B
    int bStep = BLOCK_SIZE * wB;
    // Csub stores the element of the block sub-matrix computed by this thread
    float Csub = 0;

    // Loop over all the sub-matrices of A and B required to compute the block sub-matrix
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {
        // Shared memory for the sub-matrices of A and B
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
        // Each thread loads one element of each sub-matrix
        As[ty][tx] = A[a + wA * ty + tx];
        Bs[ty][tx] = B[b + wB * ty + tx];
        __syncthreads(); // wait until the sub-matrices are loaded

        // Multiply the two sub-matrices together
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[ty][k] * Bs[k][tx];
        __syncthreads(); // wait before overwriting the shared sub-matrices
    }

    // Write the block sub-matrix to device memory; each thread writes one element
    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx] = Csub;
}
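For completeness, a hypothetical host-side launch for the kernel above (not part of the original slide), assuming square N x N matrices with N a multiple of BLOCK_SIZE and device buffers d_A, d_B, d_C already allocated with cudaMalloc and filled with cudaMemcpy:

// One thread per element of a BLOCK_SIZE x BLOCK_SIZE tile of C,
// one block per tile of the result matrix
dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid(N / BLOCK_SIZE, N / BLOCK_SIZE);
matrixMul<<<grid, threads>>>(d_C, d_A, d_B, N, N);
cudaThreadSynchronize();  // wait for the kernel to finish (CUDA 3.x API)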
15 April 2023 - slide 20 - MOSE – University of Trieste
Kernel for Sub-Block Matrix Multiplication in C for OpenCL
__kernel __attribute__((reqd_work_group_size(BLOCK_SIZE, BLOCK_SIZE, 1)))
void floatMatrixMultLocals(__global float * MResp,
                           __global float * M1,
                           __global float * M2,
                           __global int * q)
{
    // Identification of this workgroup
    int i = get_group_id(0);
    int j = get_group_id(1);
    // Identification of the work-item
    int idX = get_local_id(0);
    int idY = get_local_id(1);
    // Matrix dimensions
    int p = get_global_size(0);
    int r = get_global_size(1);
    int qq = q[0];
    // Number of sub-matrices to be processed by each worker (Q dimension)
    int numSubMat = qq / BLOCK_SIZE;
    float4 resp = (float4)(0, 0, 0, 0);
    __local float A[BLOCK_SIZE][BLOCK_SIZE];
    __local float B[BLOCK_SIZE][BLOCK_SIZE];

    for (int k = 0; k < numSubMat; k++) {
        // Copy sub-matrices to local memory; each worker copies one element.
        // Notice that A[i,k] accesses elements starting from M[BLOCK_SIZE*i, BLOCK_SIZE*j]
        A[idX][idY] = M1[BLOCK_SIZE*i + idX + p*(BLOCK_SIZE*k + idY)];
        B[idX][idY] = M2[BLOCK_SIZE*k + idX + qq*(BLOCK_SIZE*j + idY)];
        barrier(CLK_LOCAL_MEM_FENCE);

        // Multiply the two sub-matrices, four elements at a time
        for (int k2 = 0; k2 < BLOCK_SIZE; k2 += 4) {
            float4 temp1 = (float4)(A[idX][k2], A[idX][k2+1], A[idX][k2+2], A[idX][k2+3]);
            float4 temp2 = (float4)(B[k2][idY], B[k2+1][idY], B[k2+2][idY], B[k2+3][idY]);
            resp += temp1 * temp2;
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // Each work-item writes one element of the result
    MResp[BLOCK_SIZE*i + idX + p*(BLOCK_SIZE*j + idY)] = resp.x + resp.y + resp.z + resp.w;
}
15 April 2023 - slide 21 - MOSE – University of Trieste
Kernel for Matrix Multiplication in C for OpenCL
__kernel void floatMatrixMult(__global float * MResp,
                              __global float * M1,
                              __global float * M2,
                              __global int * q)
{
    // Vector element index
    int i = get_global_id(0);
    int j = get_global_id(1);
    int p = get_global_size(0);
    int r = get_global_size(1);

    MResp[i + p * j] = 0;
    int QQ = q[0];
    for (int k = 0; k < QQ; k++) {
        MResp[i + p * j] += M1[i + p * k] * M2[k + QQ * j];
    }
}
15 April 2023 - slide 22 - MOSE – University of Trieste
Kernel for Matrix Multiplication in HLSL for DirectCompute
[numthreads(GROUP_SIZE_X, GROUP_SIZE_Y, 1)]
void matrixMul( uint3 DTid : SV_DispatchThreadID )
{
    if (DTid.x < WidthB && DTid.y < HeightA)
    {
        float sum = 0;
        for (uint i = 0; i < WidthA; i++)
        {
            uint addrA = DTid.y * WidthA + i;
            uint addrB = DTid.x + i * WidthB;
            sum += MatrixA[addrA].val * MatrixB[addrB];
        }
        Output[DTid.y * WidthOut + DTid.x] = sum;
    }
}
15 April 2023 - slide 23 - MOSE – University of Trieste
Hosting GPUs
The asymmetric model needs a host language
Executed on the CPU
Coordinates tasks on the GPU
C/C++ embedding technology: the GPU compiler strips the GPU-specific code from the source; the CPU compiler compiles the remaining code, which launches the GPU-compiled code (see the sketch below)
Non-C/C++ language bindings: GPU code is compiled via a sort of «compiler as a service»
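A minimal sketch of the C/C++ embedding idea (illustrative, not from the original slide): a single .cu source file holds both kinds of code, and nvcc splits it, handing __global__ functions to the GPU tool chain (Open64-based in CUDA 3.x) and the rest to the host C/C++ compiler.

// sketch.cu – hypothetical file mixing device and host code
#include <cuda_runtime.h>
#include <stdio.h>

// Device code: stripped out and compiled for the GPU
__global__ void fill(float *x, float v) { x[threadIdx.x] = v; }

// Host code: compiled by the CPU compiler; it coordinates the GPU task
int main(void)
{
    float *d_x, h_x[32];
    cudaMalloc(&d_x, sizeof(h_x));                               // allocate device memory
    fill<<<1, 32>>>(d_x, 1.0f);                                  // launch the kernel
    cudaMemcpy(h_x, d_x, sizeof(h_x), cudaMemcpyDeviceToHost);  // read the result back
    printf("%f\n", h_x[0]);
    cudaFree(d_x);
    return 0;
}

// Build (illustrative): nvcc -o sketch sketch.cu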
15 April 2023 - slide 24 - MOSE – University of Trieste
Host Side API Code (example in DirectCompute)
Create device and context
Matrix A – structured buffer and shader resource view
Matrix B – float buffer and shader resource view
Output matrix – float buffer and unordered access view
Create constant buffer
Compile and create shader
Execute
Read back
15 April 2023 - slide 25 - MOSE – University of Trieste
Host Side API Code (example in DirectCompute)
15 April 2023 - slide 26 - MOSE – University of Trieste
Bindings (example for CUDA)
Fortran - FORTRAN CUDA, PGI CUDA Fortran Compiler
Lua - KappaCUDA
IDL - GPULib
Mathematica - CUDALink
MATLAB - Jacket
.NET - CUDA.NET
Perl - KappaCUDA
Python - PyCUDA, KappaCUDA
Ruby - KappaCUDA
Java - jCUDA, JCuda, JCublas, JCufft
15 April 2023 - slide 27 - MOSE – University of Trieste
Example of PyCUDA code
import pycuda.autoinit
import pycuda.driver as drv
import numpy

from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
    const int i = threadIdx.x;
    dest[i] = a[i] * b[i];
}
""")

multiply_them = mod.get_function("multiply_them")

a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)

dest = numpy.zeros_like(a)
multiply_them(
    drv.Out(dest), drv.In(a), drv.In(b),
    block=(400, 1, 1), grid=(1, 1))

print dest - a * b
15 April 2023 - slide 28 - MOSE – University of Trieste
Custom Libraries
CUDA, OpenCL and DirectCompute are the building blocks
The smallest «bricks» available to build our software
Are there any «bigger» bricks?
Libraries that achieve specific results targeting a GPU platform
Example: Microsoft Research Accelerator – http://research.microsoft.com/en-us/projects/Accelerator/
// ...
using Microsoft.ParallelArrays;
// ...

namespace AccelleratorDemo
{
    class Program
    {
        static void Main(string[] args)
        {
            // ....

            // Build a computation that calculates the average of neighbors
            var input = new FloatParallelArray(nums);
            var sum = ParallelArrays.Shift(input, 1) + input + ParallelArrays.Shift(input, -1);
            var output = sum / 3.0f;

            // Run the computation
            var target = new DX9Target();
            var res = target.ToArray1D(output);

            // Output the original data and the calculated 'blurred' result
            Action<float[]> WriteArray = (vals) =>
                Console.WriteLine(vals.Aggregate("", (str, f) => str + Math.Round(f) + ", "));

            // ...
        }
    }
}
15 April 2023 - slide 29 - MOSE – University of Trieste
Metaprogramming
What is metaprogramming? Metaprogramming is the writing of computer
programs that write or manipulate other programs (or themselves) as their data, or that do part of the work at compile time that would otherwise be done at runtime. (http://en.wikipedia.org/wiki/Metaprogramming)
15 April 2023 - slide 30 - MOSE – University of Trieste
GPU Metaprogramming
The code executed on the GPU is not in the host code
The REFERENCE code is in the host code
Metaprogramming techniques allow generating the real executable code from the source code, which is used as a template, as a declaration of what is needed
In many cases, this allows programmers to get more done in the same amount of time as they would take to write all the code manually, or it gives programs greater flexibility to efficiently handle new situations without recompilation.
An example in .NET: GPU.NET (http://www.tidepowerd.com/)
All metaprogrammable languages are potentially tools for this approach
The GPU.NET code works in C#, Visual Basic, F#....
15 April 2023 - slide 31 - MOSE – University of Trieste
Sample in GPU.NET
[Kernel(CustomFallbackMethod = "AddCpu")]
private static void AddGpu(float[] a, float[] b, float[] c)
{
    // Get the thread id and total number of threads
    int ThreadId = BlockDimension.X * BlockIndex.X + ThreadIndex.X;
    int TotalThreads = BlockDimension.X * GridDimension.X;

    // Loop over the vectors 'a' and 'b', adding them pairwise and storing the sums in 'c'
    for (int ElementIndex = ThreadId; ElementIndex < a.Length; ElementIndex += TotalThreads)
    {
        c[ElementIndex] = a[ElementIndex] + b[ElementIndex];
    }
}
GPU code is generated at runtime and loaded onto the GPU from [Kernel]-decorated native .NET methods
15 April 2023 - slide 32 - MOSE – University of Trieste
CONCLUSIONS
Programming languages and compilers for GPUs
15 April 2023 - slide 33 - MOSE – University of Trieste
Conclusions
GPUs give us big opportunities
Market and research have understood it; the developer community too
Now we will witness the growth of the market
And so of the tools
And so of the languages
2011 will be the year of APUs: CPU + GPU on a single die
15 April 2023 - slide 34 - MOSE – University of Trieste
Which programming model?
No one knows
Again, CUDA is the most efficient and advanced
...but it runs only on Nvidia cards, and no one knows if ATI will adopt it
OpenCL is not as advanced
But it's the same relation as between OpenGL and the others... and Nvidia knows it
DirectCompute? On Windows hosts, with DirectX experience, it's another big opportunity
I think...
Specific programming models (CUDA) for ISVs that can write drivers
OpenCL or DirectCompute for custom/lab/«home» development
15 April 2023 - slide 35 - MOSE – University of Trieste
Attention!
The GPU is the cheapest computing power, but...
You can execute only streamable code (data processing)
What about more generic code?
15 April 2023 - slide 36 - MOSE – University of Trieste
TRENDS
Programming languages and compilers for GPUs
15 April 2023 - slide 37 - MOSE – University of Trieste
Parallel thinks about data
Distinct blocks of data handled in the same way
Distinct blocks of data handled at the same time
Parallel is all about distributing the same computation task across different computation resources at the same time
The result is divided into blocks
Result blocks must be merged into one general result (see the sketch below)
Software can be parallel
Parallel
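A minimal sketch of this split-compute-merge pattern in C for CUDA (illustrative, not from the original slide; the kernel name and the fixed block size of 256 threads are assumptions): each block reduces its own chunk of the input to one partial sum, and the host merges the per-block results.

// Each block produces one partial sum of its chunk of 'in'; the host then
// copies 'blockSums' back and adds its entries to get the final result.
__global__ void partialSums(const float *in, float *blockSums, int n)
{
    __shared__ float sdata[256];          // assumes blockDim.x == 256
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    sdata[tid] = (i < n) ? in[i] : 0.0f;  // load one element (0 past the end)
    __syncthreads();

    // Tree reduction inside the block
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0) blockSums[blockIdx.x] = sdata[0];   // one result per block
}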
15 April 2023 - slide 38 - MOSE – University of Trieste
Async is not parallel
Parallel thinks about data (and probably just one task)
Async thinks about tasks
Not all tasks are sequential: some tasks are sequential, some tasks are parallel
Non-sequential tasks have to be coordinated (see the sketch below)
Software has to be more async
Programmers don't think async
Programmers are not trained for async
Asynchronous
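Staying with the GPU theme, a minimal sketch of asynchronous task coordination with CUDA streams (illustrative, not from the original slide): the element-wise kernel scale, the buffers h_data/d_data of n floats and the two-stream split are all assumptions, and real copy/compute overlap also requires page-locked host memory allocated with cudaHostAlloc.

// Split the work into chunks; each chunk gets its own stream, so the copy
// of one chunk can overlap with the computation on another.
cudaStream_t streams[2];
for (int s = 0; s < 2; ++s) cudaStreamCreate(&streams[s]);

int chunk = n / 2;
for (int s = 0; s < 2; ++s) {
    int off = s * chunk;
    cudaMemcpyAsync(d_data + off, h_data + off, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, streams[s]);
    scale<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_data + off, chunk, 2.0f);
    cudaMemcpyAsync(h_data + off, d_data + off, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, streams[s]);
}
cudaThreadSynchronize();  // wait for both streams to finish
for (int s = 0; s < 2; ++s) cudaStreamDestroy(streams[s]);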
15 April 2023 - slide 39 - MOSE – University of Trieste
Imperative languages have no primitives to handle async or parallel
Task coordination has to be adaptive
There is not only one way to do it
Coordination has to be "simplified"
Async or parallel tasks are better coded in a declarative language
Let the compiler do the infrastructure and optimization work
Functional is a special kind of declarative language
The metaphor is the function as a first-class citizen of the language
A function is treated as a value
Functional
15 April 2023 - slide 40 - MOSE – University of Trieste
Pros: on-demand scalability, total cost of ownership (?), different levels of service (SaaS, PaaS, xAAS)
Cons: broadband (speed), privacy (?)
Cloud
15 April 2023 - slide 41 - MOSE – University of Trieste
Links
Marco Parenzan
blog: http://blog.codeisvalue.com/
email:
web: http://www.codeisvalue.com/
twitter: @marco_parenzan
slideshare: http://www.slideshare.com/marco.parenzan
facebook: http://www.facebook.com/parenzan.marco
linked-in: http://it.linkedin.com/in/marcoparenzan