2011.02.18 Marco Parenzan - Programming models for GPUs
TRANSCRIPT
15 April 2023 - slide 1 - MOSE – University of Trieste
Programming languages and compilers for GPUs
Marco Parenzan
GPU@UniTS
15 April 2023 - slide 2 - MOSE – University of Trieste
WARNING!
Programming languages and compilers for GPUs
15 April 2023 - slide 3 - MOSE – University of Trieste
WARNING! There is a choice to make now:
Nvidia/CUDA… …or not
GPU computing is not Nvidia/CUDA computing
Nvidia is the most advanced as a product
Nvidia is the only one with a specific scientific product on sale
Even so, Fermi was late (one year: scheduled for 2009, released in 2010)
So if CUDA is the main choice today… what about tomorrow? Intel is too late now, but Intel is Intel…
GPUs can be personal, not only in clusters
Not only for simulation
Broadly available video cards are now powered by GPUs
We have GPUs thanks to games!
15 April 2023 - slide 4 - MOSE – University of Trieste
Why GPU Computing…
Over the past few years, the GPU has evolved from a fixed-function special-purpose processor into a full-fledged parallel programmable processor with additional fixed-function special-purpose functionality
Fixed Function Pipeline: lack of generality
More fully featured instruction set
Unified Shader Model
Increased Programmability
Programmable engine surrounded by supporting fixed-function units
15 April 2023 - slide 5 - MOSE – University of Trieste
Products
Products you can choose, so you can try them
There is a BIG range of products
There is a LITTLE range of scientific products
Main products: Nvidia Fermi, ATI Radeon, Intel Larrabee
(Larrabee: again, not yet released… so late!)
15 April 2023 - slide 6 - MOSE – University of Trieste
Nvidia Cards
Compute capability | GPUs | Cards
1.0 | G80 | GeForce 8800GTX/Ultra/GTS, Tesla C/D/S870, FX4/5600, 360M
1.1 | G86, G84, G98, G96, G96b, G94, G94b, G92, G92b | GeForce 8400GS/GT, 8600GT/GTS, 8800GT/GTS, 9600GT/GSO, 9800GT/GTX/GX2, GTS 250, GT 120/30, FX 4/570, 3/580, 17/18/3700, 4700x2, 1xxM, 32/370M, 3/5/770M, 16/17/27/28/36/37/3800M, NVS420/50
1.2 | GT218, GT216, GT215 | GeForce 210, GT 220/40, FX380 LP, 1800M, 370/380M, NVS 2/3100M
1.3 | GT200, GT200b | GTX 260/75/80/85, 295, Tesla C/M1060, S1070, CX, FX 3/4/5800
2.0 | GF100, GF110 | GTX 465, 470/80, Tesla C2050/70, S/M2050/70, Quadro 600, 4/5/6000, Plex 7000, GTX 570, GTX 580
2.1 | GF108, GF106, GF104, GF114 | GT 420/30/40, GTS 450, GTX 460, 500M
(The slide color-codes the cards into Professional, Computing and Consumer series and marks the current series.)
15 April 2023 - slide 7 - MOSE – University of Trieste
ATI Cards
Retail/card series name | Chip series
R8500, R9000-R9250 | R200
R9500-R9800, X300-X600, X1050 | R300
X700-X850 | R420
X1300-X1950 | R520
HD2000-HD3000 | R600
HD4000 | R700
HD5000 | R800/Evergreen
HD6000 | R900/Northern Islands
HD7000 | R1000/Southern Islands
(Consumer series; the slide marks the current series.)
15 April 2023 - slide 8 - MOSE – University of Trieste
REFERENCE ARCHITECTURE
Programming languages and compilers for GPUs
15 April 2023 - slide 9 - MOSE – University of Trieste
An Asymmetric Multi-Processor System
A GPU-enabled system is an asymmetric multi-processor system
Different abilities
The GPU is designed for a particular class of applications
GPUs have instruction sets optimized for parallel data computing
Different performances
GPUs have optimized memory access and wide bandwidth for parallel data access (see the sketch below)
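As a side note (not on the original slide), that wide bandwidth is reached only when neighboring threads access neighboring addresses. A minimal sketch in C for CUDA, with hypothetical kernel names:

// Coalesced access: thread i reads element i, so adjacent threads touch
// adjacent addresses and the hardware serves few wide memory transactions.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided access: adjacent threads touch addresses far apart, so the same
// data needs many separate transactions and effective bandwidth collapses.
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i * stride] = in[i * stride];
}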
15 April 2023 - slide 10 - MOSE – University of Trieste
An Asymmetric Multi-Processor System
Computational requirements are large
Parallelism is substantial
Throughput is more important than latency
(Diagram: CPU ~50 GFlops with 4-6 GB of RAM at ~10 GB/s; GPU ~1 TFlop with 1 GB of RAM at ~100 GB/s; the two sides linked at ~1 GB/s.)
15 April 2023 - slide 11 - MOSE – University of Trieste
CPU vs GPU
CPU: low-latency memory, random accesses, 20 GB/s bandwidth, 0.1 TFlop compute, 1 GFlops/watt, well-known programming model
GPU: high-bandwidth memory, sequential accesses, 100 GB/s bandwidth, 1 TFlop compute, 10 GFlops/watt, niche programming model
15 April 2023 - slide 12 - MOSE – University of Trieste
How to use a GPU
The GPU can execute only parts of our algorithm
Parts that handle independent blocks of data that can be managed in parallel…
…and only for some tasks, compatible with the GPU instruction set
Loops, for example, are not GPU friendly (see the sketch below)
Think more functional
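A minimal sketch (not from the original slide) of what "GPU friendly" means, in C for CUDA: the sequential loop disappears and each thread handles one independent element.

// Sequential CPU version: one loop over all the elements
void add_cpu(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}

// Data-parallel GPU version: no loop, each thread computes one element
__global__ void add_gpu(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}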
15 April 2023 - slide 13 - MOSE – University of Trieste
PROGRAMMING MODELS
Programming languages and compilers for GPUs
15 April 2023 - slide 14 - MOSE – University of Trieste
Programming Models
CUDA
Specific to Nvidia hardware; multiplatform (Windows, Linux, Mac)
OpenCL
Generic for multi-vendor hardware, also non-GPU hardware; multiplatform (Windows, Linux, Mac)
DirectCompute
Generic for multi-vendor GPU hardware; single platform (Windows)
15 April 2023 - slide 15 - MOSE – University of Trieste
Pros & Cons
CUDA – Pros: most advanced, mature, multiplatform. Cons: proprietary.
OpenCL – Pros: multiplatform, a standard like OpenGL. Cons: not mature.
DirectCompute – Pros: success of the DirectX family, acceptance/support by ATI. Cons: proprietary.
15 April 2023 - slide 16 - MOSE – University of Trieste
How to program: download toolkits
Nvidia http://developer.nvidia.com/object/cuda_3_2_downloads.html
CUDA + OpenCL support
ATI http://developer.amd.com/gpu/amdappsdk/pages/default.aspx
Accelerated Parallel Processing (aka Stream) API + OpenCL support
DirectCompute DirectX SDK (June 2010)
http://www.microsoft.com/downloads/en/details.aspx?displaylang=en&FamilyID=3021d52b-514e-41d3-ad02-438a3ba730ba
In any case, download updated drivers for your GPU/graphics card
Watch out for mobile drivers… always in beta!
15 April 2023 - slide 17 - MOSE – University of Trieste
How to program
Every platform has its own proprietary language
As GPUs are just specialized CPUs, the primary programming tool is a C-derived dialect language:
CUDA: C for CUDA (PathScale Open64 C compiler)
OpenCL: C99 dialect
DirectCompute: HLSL (High Level Shading Language – shading of pixels, from a consumer background)
15 April 2023 - slide 18 - MOSE – University of Trieste
Kernel for Sub-Block Matrix Multiplication in C for CUDA
////////////////////////////////////////////////////////////////////////////////
//! Matrix multiplication on the device: C = A * B
//! wA is A's width and wB is B's width (BLOCK_SIZE is a compile-time constant, e.g. 16)
////////////////////////////////////////////////////////////////////////////////
__global__ void matrixMul(float* C, float* A, float* B, int wA, int wB)
{
    // Block index
    int bx = blockIdx.x; int by = blockIdx.y;
    // Thread index
    int tx = threadIdx.x; int ty = threadIdx.y;
    // Index of the first and last sub-matrix of A processed by the block
    int aBegin = wA * BLOCK_SIZE * by;
    int aEnd = aBegin + wA - 1;
    // Step size used to iterate through the sub-matrices of A
    int aStep = BLOCK_SIZE;
    // Index of the first sub-matrix of B processed by the block
    int bBegin = BLOCK_SIZE * bx;
    // Step size used to iterate through the sub-matrices of B
    int bStep = BLOCK_SIZE * wB;
    // Csub stores the element of the block sub-matrix computed by this thread
    float Csub = 0;

    // Loop over all the sub-matrices of A and B required to compute the block sub-matrix
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {
        // Shared memory for the sub-matrices of A and B
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
        // Each thread loads one element of each sub-matrix
        As[ty][tx] = A[a + wA * ty + tx];
        Bs[ty][tx] = B[b + wB * ty + tx];
        __syncthreads(); // wait until the sub-matrices are loaded

        // Multiply the two sub-matrices together
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[ty][k] * Bs[k][tx];
        __syncthreads(); // wait before overwriting the shared sub-matrices
    }

    // Write the block sub-matrix to device memory; each thread writes one element
    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx] = Csub;
}
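For completeness, a hypothetical host-side launch for the kernel above (not part of the original slide), assuming square N x N matrices with N a multiple of BLOCK_SIZE and device buffers d_A, d_B, d_C already allocated with cudaMalloc and filled with cudaMemcpy:

// One thread per element of a BLOCK_SIZE x BLOCK_SIZE tile of C,
// one block per tile of the result matrix
dim3 threads(BLOCK_SIZE, BLOCK_SIZE);
dim3 grid(N / BLOCK_SIZE, N / BLOCK_SIZE);
matrixMul<<<grid, threads>>>(d_C, d_A, d_B, N, N);
cudaThreadSynchronize();  // wait for the kernel to finish (CUDA 3.x API)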
15 April 2023 - slide 20 - MOSE – University of Trieste
Kernel for Sub-Block Matrix Multiplication in C for OpenCL
__kernel __attribute__((reqd_work_group_size(BLOCK_SIZE, BLOCK_SIZE, 1)))
void floatMatrixMultLocals(__global float * MResp,
                           __global float * M1,
                           __global float * M2,
                           __global int * q)
{
    // Identification of this workgroup
    int i = get_group_id(0);
    int j = get_group_id(1);
    // Identification of the work-item
    int idX = get_local_id(0);
    int idY = get_local_id(1);
    // Matrix dimensions
    int p = get_global_size(0);
    int r = get_global_size(1);
    int qq = q[0];
    // Number of sub-matrices to be processed by each worker (Q dimension)
    int numSubMat = qq / BLOCK_SIZE;
    float4 resp = (float4)(0, 0, 0, 0);
    __local float A[BLOCK_SIZE][BLOCK_SIZE];
    __local float B[BLOCK_SIZE][BLOCK_SIZE];

    for (int k = 0; k < numSubMat; k++) {
        // Copy sub-matrices to local memory; each worker copies one element.
        // Notice that A[i,k] accesses elements starting from M[BLOCK_SIZE*i, BLOCK_SIZE*j]
        A[idX][idY] = M1[BLOCK_SIZE*i + idX + p*(BLOCK_SIZE*k + idY)];
        B[idX][idY] = M2[BLOCK_SIZE*k + idX + qq*(BLOCK_SIZE*j + idY)];
        barrier(CLK_LOCAL_MEM_FENCE);

        // Multiply the two sub-matrices, four elements at a time
        for (int k2 = 0; k2 < BLOCK_SIZE; k2 += 4) {
            float4 temp1 = (float4)(A[idX][k2], A[idX][k2+1], A[idX][k2+2], A[idX][k2+3]);
            float4 temp2 = (float4)(B[k2][idY], B[k2+1][idY], B[k2+2][idY], B[k2+3][idY]);
            resp += temp1 * temp2;
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // Each work-item writes one element of the result
    MResp[BLOCK_SIZE*i + idX + p*(BLOCK_SIZE*j + idY)] = resp.x + resp.y + resp.z + resp.w;
}
15 April 2023 - slide 21 - MOSE – University of Trieste
Kernel for Matrix Multiplication in C for OpenCL
__kernel void floatMatrixMult(__global float * MResp,
                              __global float * M1,
                              __global float * M2,
                              __global int * q)
{
    // Vector element index
    int i = get_global_id(0);
    int j = get_global_id(1);
    int p = get_global_size(0);
    int r = get_global_size(1);

    MResp[i + p * j] = 0;
    int QQ = q[0];
    for (int k = 0; k < QQ; k++) {
        MResp[i + p * j] += M1[i + p * k] * M2[k + QQ * j];
    }
}
15 April 2023 - slide 22 - MOSE – University of Trieste
Kernel for Matrix Multiplication in HLSL for DirectCompute
[numthreads(GROUP_SIZE_X, GROUP_SIZE_Y, 1)]
void matrixMul( uint3 DTid : SV_DispatchThreadID )
{
    if (DTid.x < WidthB && DTid.y < HeightA)
    {
        float sum = 0;
        for (uint i = 0; i < WidthA; i++)
        {
            uint addrA = DTid.y * WidthA + i;
            uint addrB = DTid.x + i * WidthB;
            sum += MatrixA[addrA].val * MatrixB[addrB];
        }
        Output[DTid.y * WidthOut + DTid.x] = sum;
    }
}
15 April 2023 - slide 23 - MOSE – University of Trieste
Hosting GPUs
The asymmetric model needs a host language
Executed on the CPU
Coordinates tasks on the GPU
C/C++ embedding technology: the GPU compiler strips the GPU-specific code from the source; the CPU compiler compiles the remaining code, which launches the GPU-compiled code (see the sketch below)
Non-C/C++ language bindings: GPU code is compiled via a sort of «compiler as a service»
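A minimal sketch of the C/C++ embedding idea (illustrative, not from the original slide): a single .cu source file holds both kinds of code, and nvcc splits it, handing __global__ functions to the GPU tool chain (Open64-based in CUDA 3.x) and the rest to the host C/C++ compiler.

// sketch.cu – hypothetical file mixing device and host code
#include <cuda_runtime.h>
#include <stdio.h>

// Device code: stripped out and compiled for the GPU
__global__ void fill(float *x, float v) { x[threadIdx.x] = v; }

// Host code: compiled by the CPU compiler; it coordinates the GPU task
int main(void)
{
    float *d_x, h_x[32];
    cudaMalloc(&d_x, sizeof(h_x));                               // allocate device memory
    fill<<<1, 32>>>(d_x, 1.0f);                                  // launch the kernel
    cudaMemcpy(h_x, d_x, sizeof(h_x), cudaMemcpyDeviceToHost);  // read the result back
    printf("%f\n", h_x[0]);
    cudaFree(d_x);
    return 0;
}

// Build (illustrative): nvcc -o sketch sketch.cu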
15 April 2023 - slide 24 - MOSE – University of Trieste
Host Side API Code (example in DirectCompute)
Create device and context
Matrix A – structured buffer and shader resource view
Matrix B – float buffer and shader resource view
Output matrix – float buffer and unordered access view
Create constant buffer
Compile and create shader
Execute
Read back
15 April 2023 - slide 25 - MOSE – University of Trieste
Host Side API Code (example in DirectCompute)
15 April 2023 - slide 26 - MOSE – University of Trieste
Bindings (example for CUDA)
Fortran - FORTRAN CUDA, PGI CUDA Fortran Compiler
Lua - KappaCUDA
IDL - GPULib
Mathematica - CUDALink
MATLAB - Jacket
.NET - CUDA.NET
Perl - KappaCUDA
Python - PyCUDA, KappaCUDA
Ruby - KappaCUDA
Java - jCUDA, JCuda, JCublas, JCufft
15 April 2023 - slide 27 - MOSE – University of Trieste
Example of PyCUDA code
import pycuda.autoinit
import pycuda.driver as drv
import numpy

from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
    const int i = threadIdx.x;
    dest[i] = a[i] * b[i];
}
""")

multiply_them = mod.get_function("multiply_them")

a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)

dest = numpy.zeros_like(a)
multiply_them(
    drv.Out(dest), drv.In(a), drv.In(b),
    block=(400, 1, 1), grid=(1, 1))

print dest - a * b
15 April 2023 - slide 28 - MOSE – University of Trieste
Custom Libraries
CUDA, OpenCL and DirectCompute are the building blocks
The smallest «bricks» available to build our software
Are there any «bigger» bricks?
Libraries that achieve specific results targeting a GPU platform
Example: Microsoft Research Accelerator – http://research.microsoft.com/en-us/projects/Accelerator/
// ...
using Microsoft.ParallelArrays;
// ...

namespace AccelleratorDemo
{
    class Program
    {
        static void Main(string[] args)
        {
            // ....

            // Build a computation that calculates the average of neighbors
            var input = new FloatParallelArray(nums);
            var sum = ParallelArrays.Shift(input, 1) + input + ParallelArrays.Shift(input, -1);
            var output = sum / 3.0f;

            // Run the computation
            var target = new DX9Target();
            var res = target.ToArray1D(output);

            // Output the original data and the calculated 'blurred' result
            Action<float[]> WriteArray = (vals) =>
                Console.WriteLine(vals.Aggregate("", (str, f) => str + Math.Round(f) + ", "));

            // ...
        }
    }
}
15 April 2023 - slide 29 - MOSE – University of Trieste
Metaprogramming
What is metaprogramming? Metaprogramming is the writing of computer
programs that write or manipulate other programs (or themselves) as their data, or that do part of the work at compile time that would otherwise be done at runtime. (http://en.wikipedia.org/wiki/Metaprogramming)
15 April 2023 - slide 30 - MOSE – University of Trieste
GPU Metaprogramming
The code executed on the GPU is not in the host code
The REFERENCE code is in the host code
Metaprogramming techniques allow generating the real executable code from the source code, which is used as a template, as a declaration of what is needed
In many cases, this allows programmers to get more done in the same amount of time as they would take to write all the code manually, or it gives programs greater flexibility to efficiently handle new situations without recompilation.
An example in .NET: GPU.NET (http://www.tidepowerd.com/)
All metaprogrammable languages are potentially tools for this approach
The GPU.NET code works in C#, Visual Basic, F#....
15 April 2023 - slide 31 - MOSE – University of Trieste
Sample in GPU.NET
[Kernel(CustomFallbackMethod = "AddCpu")]
private static void AddGpu(float[] a, float[] b, float[] c)
{
    // Get the thread id and total number of threads
    int ThreadId = BlockDimension.X * BlockIndex.X + ThreadIndex.X;
    int TotalThreads = BlockDimension.X * GridDimension.X;

    // Loop over the vectors 'a' and 'b', adding them pairwise and storing the sums in 'c'
    for (int ElementIndex = ThreadId; ElementIndex < a.Length; ElementIndex += TotalThreads)
    {
        c[ElementIndex] = a[ElementIndex] + b[ElementIndex];
    }
}
GPU code is generated at runtime and loaded onto the GPU from [Kernel]-decorated native .NET methods
15 April 2023 - slide 32 - MOSE – University of Trieste
CONCLUSIONS
Programming languages and compilers for GPUs
15 April 2023 - slide 33 - MOSE – University of Trieste
Conclusions
GPUs give us big opportunities
Market and research have understood it; the developer community too
Now we will witness the growth of the market
And so of the tools
And so of the languages
2011 will be the year of APUs: CPU + GPU on a single die
15 April 2023 - slide 34 - MOSE – University of Trieste
Which programming model?
No one knows
Again, CUDA is the most efficient and advanced
...but it runs only on Nvidia cards, and no one knows if ATI will adopt it
OpenCL is not as advanced
But it's the same relation as between OpenGL and the others... and Nvidia knows it
DirectCompute? On Windows hosts, with DirectX experience, it's another big opportunity
I think...
Specific programming models (CUDA) for ISVs that can write drivers
OpenCL or DirectCompute for custom/lab/«home» development
15 April 2023 - slide 35 - MOSE – University of Trieste
Attention!
The GPU is the cheapest computing power, but...
You can execute only streamable code (data processing)
What about more generic code?
15 April 2023 - slide 36 - MOSE – University of Trieste
TRENDS
Programming languages and compilers for GPUs
15 April 2023 - slide 37 - MOSE – University of Trieste
Parallel thinks about data
Distinct blocks of data handled in the same way
Distinct blocks of data handled at the same time
Parallel is all about distributing the same computation task across different computation resources at the same time
The result is divided into blocks
Result blocks must be merged into one general result (see the sketch below)
Software can be parallel
Parallel
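A minimal sketch of this split-compute-merge pattern in C for CUDA (illustrative, not from the original slide; the kernel name and the fixed block size of 256 threads are assumptions): each block reduces its own chunk of the input to one partial sum, and the host merges the per-block results.

// Each block produces one partial sum of its chunk of 'in'; the host then
// copies 'blockSums' back and adds its entries to get the final result.
__global__ void partialSums(const float *in, float *blockSums, int n)
{
    __shared__ float sdata[256];          // assumes blockDim.x == 256
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    sdata[tid] = (i < n) ? in[i] : 0.0f;  // load one element (0 past the end)
    __syncthreads();

    // Tree reduction inside the block
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0) blockSums[blockIdx.x] = sdata[0];   // one result per block
}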
15 April 2023 - slide 38 - MOSE – University of Trieste
Async is not parallel
Parallel thinks about data (and probably just one task)
Async thinks about tasks
Not all tasks are sequential: some tasks are sequential, some tasks are parallel
Non-sequential tasks have to be coordinated (see the sketch below)
Software has to be more async
Programmers don't think async
Programmers are not trained for async
Asynchronous
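Staying with the GPU theme, a minimal sketch of asynchronous task coordination with CUDA streams (illustrative, not from the original slide): the element-wise kernel scale, the buffers h_data/d_data of n floats and the two-stream split are all assumptions, and real copy/compute overlap also requires page-locked host memory allocated with cudaHostAlloc.

// Split the work into chunks; each chunk gets its own stream, so the copy
// of one chunk can overlap with the computation on another.
cudaStream_t streams[2];
for (int s = 0; s < 2; ++s) cudaStreamCreate(&streams[s]);

int chunk = n / 2;
for (int s = 0; s < 2; ++s) {
    int off = s * chunk;
    cudaMemcpyAsync(d_data + off, h_data + off, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, streams[s]);
    scale<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_data + off, chunk, 2.0f);
    cudaMemcpyAsync(h_data + off, d_data + off, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, streams[s]);
}
cudaThreadSynchronize();  // wait for both streams to finish
for (int s = 0; s < 2; ++s) cudaStreamDestroy(streams[s]);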
15 April 2023 - slide 39 - MOSE – University of Trieste
Imperative languages have no primitives to handle async or parallel
Task coordination has to be adaptive
There is not only one way to do it
Coordination has to be "simplified"
Async or parallel tasks are better coded in a declarative language
Let the compiler do the infrastructure and optimization work
Functional is a special kind of declarative language
The metaphor is the function as a first-class citizen of the language
A function is treated as a value
Functional
15 April 2023 - slide 40 - MOSE – University of Trieste
Pros: on-demand scalability, total cost of ownership (?), different levels of service (SaaS, PaaS, xAAS)
Cons: broadband (speed), privacy (?)
Cloud
15 April 2023 - slide 41 - MOSE – University of Trieste
Links
Marco Parenzan
blog: http://blog.codeisvalue.com/
email:
web: http://www.codeisvalue.com/
twitter: @marco_parenzan
slideshare: http://www.slideshare.com/marco.parenzan
facebook: http://www.facebook.com/parenzan.marco
linked-in: http://it.linkedin.com/in/marcoparenzan