OpenCL for programming GPUs
SHARCNET Summer School 2010
Pawel Pomorski
Overview

What is OpenCL?

Basic OpenCL
- Detecting your environment
- Basic building blocks - “Hello World!” program
- Hands on exercise - convert a “serial” CPU code to OpenCL GPU code
- OpenCL C

“Advanced” OpenCL
- Memory optimization
GPU computing timeline
before 2003 - Calculations on GPU, using graphics API
2003 - Brook “C with streams”
2005 - Steady increase in CPU clock speed comes to a halt; switch to multicore chips to compensate. At the same time, computational power of GPUs increases
November 2006 - CUDA released by NVIDIA
November 2006 - CTM (Close to Metal) from ATI
December 2007 - Succeeded by AMD Stream SDK
December 2008 - Technical specification for OpenCL 1.0 released
April 2009 - First OpenCL 1.0 GPU drivers released by NVIDIA
August 2009 - Mac OS X 10.6 Snow Leopard released, with OpenCL 1.0 included
September 2009 - Public release of OpenCL by NVIDIA
December 2009 - AMD release of ATI Stream SDK 2.0 with OpenCL support
March 2010 - CUDA 3.0 released, incorporating OpenCL
Near future - Intel may release a GPGPU (Larrabee)
Future - CPUs will have so many cores they will start to be treated as GPUs?
Why are GPUs fast?
Different paradigm, data parallelism (vs. task parallelism on CPU)
Stream processing
Hardware capable of launching MANY threads
However, GPUs also have significant limitations and not all code can run fast on them
If a significant portion of your code cannot be accelerated, your speedup will be limited by Amdahl’s Law
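For reference, Amdahl’s Law in its usual form: if a fraction P of the runtime can be accelerated by a factor S, the overall speedup is

speedup = 1 / ((1 - P) + P/S)

For example, with P = 0.9 and S = 100, the overall speedup is only about 9.2x.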
OpenCL (Open Computing Language) is an open standard for parallel programming of heterogeneous systems, managed by the Khronos Group.
Aim is for it to form the foundation layer of a parallel computing ecosystem of platform-independent tools
OpenCL includes a language for writing kernels, plus APIs used to define and control platforms
Targeted at GPUs, but also multicore CPUs and other hardware (Cell etc.), though to run efficiently each family of devices needs different code
OpenCL is largely derived from CUDA (no need to reinvent the wheel)
Same basic model: a kernel is sent to the accelerator, a “compute device” composed of “compute units” whose “processing elements” work on “work-items”.
Some names changed between CUDA and OpenCL
Thread -> Work-item
Block -> Work-group
Both CUDA and OpenCL allow for detailed low level optimization of the GPU
OpenCL device independent - really?
Actually, there is no way to write code that will run equally efficiently on both GPU and multicore CPU architectures
However, OpenCL can detect the features of the architecture and run code appropriate for each
OpenCL code should run efficiently on many GPUs with a bit of fine tuning
On the other hand, an OpenCL code highly optimized for NVIDIA hardware will not run that efficiently on AMD hardware
As High Performance Computing code needs to be highly optimized, OpenCL may not offer a practical ability to be device independent
CUDA vs. OpenCL

NVIDIA is fully supporting OpenCL even though it does not run exclusively on their hardware
The strategy is to increase the number of GPU users by making software more portable and accessible, which OpenCL is meant to do. If there are more users, NVIDIA will sell more hardware.
As CUDA is fully controlled by NVIDIA, it is possible that in the future it will contain more “bleeding edge” features than OpenCL, which is overseen by a consortium hence slower to incorporate new features
If you want your GPU code to run on both NVIDIA and AMD/ATI devices (two main players at present), OpenCL is the only way to accomplish that
Available OpenCL environments

NVIDIA OpenCL - distributed with CUDA since version 3.0
- no CPU support at the moment
- for NVIDIA GPU cards only

Apple OpenCL - included as standard feature in Mac OS X 10.6 Snow Leopard (requires XCode)
- supports both graphics cards and CPUs (Apple hardware only)

AMD (ATI) OpenCL
- supports only AMD GPU cards
- no CPU support at the moment
IBM OpenCL, supports IBM Blade Center hardware
Foxc (Fixstars OpenCL Cross Compiler), in Beta as of June, 2010, CPU only
OpenCL References
The OpenCL Programming Book - http://www.fixstars.com/en/company/books/opencl/
Khronos consortium: http://www.khronos.org/
CUDA OpenCL: http://www.nvidia.com/object/cuda_opencl_new.html
NVIDIA GPU Computing SDK code samples
On “angel”: /opt/sharcnet/cuda/3.0/gpucomputingsdk/OpenCL
Apple OpenCL : http://developer.apple.com
AMD (ATI) OpenCL : http://ati.amd.com/technology/streamcomputing/opencl.html
OpenCL on SHARCNET
Cluster “angel” is SHARCNET’s main GPU computing cluster.
Hardware: 22 nodes, each with 8 CPU cores; each node “sees” 2 Tesla T10 GPUs
CUDA 3.0 drivers needed for OpenCL are currently available on node ang4 via a special arrangement for Summer School
Students can test OpenCL programs by connecting to ang4 via “ssh ang4” (from angel login node)
Run interactively, as ang4 is not accessible through queues right now
Compile with:
gcc -o test.x your_code.c -lOpenCL
Know your hardware

OpenCL programs should aim to be hardware agnostic
The program should find the relevant system information at runtime
OpenCL provides methods for this
At minimum, the programmer must determine what OpenCL devices are available and choose which are to be used by the program
#include <stdio.h>
#include <CL/cl.h>

int main(int argc, char** argv)
{
    char dname[500];
    cl_device_id devices[10];
    cl_uint num_devices, entries;
    cl_ulong long_entries;
    int d;
    cl_int err;
    cl_platform_id platform_id = NULL;
    size_t p_size;

    /* obtain list of platforms available */
    err = clGetPlatformIDs(1, &platform_id, NULL);
    if (err != CL_SUCCESS)
    {
        printf("Error: Failure in clGetPlatformIDs, error code=%d\n", err);
        return 0;
    }

    /* obtain information about platform */
    clGetPlatformInfo(platform_id, CL_PLATFORM_NAME, 500, dname, NULL);
    printf("CL_PLATFORM_NAME = %s\n", dname);

    clGetPlatformInfo(platform_id, CL_PLATFORM_VERSION, 500, dname, NULL);
    printf("CL_PLATFORM_VERSION = %s\n", dname);
Program to get information - What is the platform?
On angel cluster: /work/ppomorsk/SUMMER_SCHOOL/get_info/get_opencl_information.c
    /* obtain list of devices available on platform */
    clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_ALL, 10, devices, &num_devices);
    printf("%u devices found\n", num_devices);

    /* query devices for information */
    for (d = 0; d < num_devices; ++d)
    {
        clGetDeviceInfo(devices[d], CL_DEVICE_NAME, 500, dname, NULL);
        printf("Device #%d name = %s\n", d, dname);
        clGetDeviceInfo(devices[d], CL_DRIVER_VERSION, 500, dname, NULL);
        printf("\tDriver version = %s\n", dname);
        clGetDeviceInfo(devices[d], CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(cl_ulong), &long_entries, NULL);
        printf("\tGlobal Memory (MB):\t%llu\n", long_entries/1024/1024);
        clGetDeviceInfo(devices[d], CL_DEVICE_GLOBAL_MEM_CACHE_SIZE, sizeof(cl_ulong), &long_entries, NULL);
        printf("\tGlobal Memory Cache (MB):\t%llu\n", long_entries/1024/1024);
        clGetDeviceInfo(devices[d], CL_DEVICE_LOCAL_MEM_SIZE, sizeof(cl_ulong), &long_entries, NULL);
        printf("\tLocal Memory (KB):\t%llu\n", long_entries/1024);
        /* CL_DEVICE_MAX_CLOCK_FREQUENCY returns a cl_uint */
        clGetDeviceInfo(devices[d], CL_DEVICE_MAX_CLOCK_FREQUENCY, sizeof(cl_uint), &entries, NULL);
        printf("\tMax clock (MHz):\t%u\n", entries);
        clGetDeviceInfo(devices[d], CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(size_t), &p_size, NULL);
        printf("\tMax Work Group Size:\t%zu\n", p_size);
        clGetDeviceInfo(devices[d], CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(cl_uint), &entries, NULL);
        printf("\tNumber of parallel compute cores:\t%u\n", entries);
    }
    return 0;
}
Program to get information cont. - What are the devices?
[ppomorsk@ang4:~] cd /work/ppomorsk/SUMMER_SCHOOL/get_info
[ppomorsk@ang4:/work/ppomorsk/SUMMER_SCHOOL/get_info] gcc -o test.x get_opencl_information.c -lOpenCL
[ppomorsk@ang4:/work/ppomorsk/SUMMER_SCHOOL/get_info] ./test.x
CL_PLATFORM_NAME = NVIDIA CUDA
CL_PLATFORM_VERSION = OpenCL 1.0 CUDA 3.0.1
2 devices found
Device #0 name = Tesla T10 Processor
	Driver version = 195.36.15
	Global Memory (MB): 4095
	Global Memory Cache (MB): 0
	Local Memory (KB): 16
	Max clock (MHz): 1440
	Max Work Group Size: 512
	Number of parallel compute cores: 30
Device #1 name = Tesla T10 Processor
	Driver version = 195.36.15
	Global Memory (MB): 4095
	Global Memory Cache (MB): 0
	Local Memory (KB): 16
	Max clock (MHz): 1440
	Max Work Group Size: 512
	Number of parallel compute cores: 30
[ppomorsk@ang4:/work/ppomorsk/SUMMER_SCHOOL/get_info]
Output on angel - 2 GPUs detected, no CPUs detected
Angel architecture

GPU server: Tesla S1070
http://archive.electronicdesign.com/files/29/19280/fig_01.gif
4 GPUs on server
Each GPU has 30 compute units, each consisting of 8 stream processing cores
240 cores in total on one GPU
Each node “sees” 2 GPUs
Output on Apple MacBook laptop - both GPU and CPU seen

ppomorsk-mac:tryopencl pawelpomorski$ gcc -o test.x get_opencl_information.c -w -m32 -lm -lstdc++ -framework OpenCL
ppomorsk-mac:tryopencl pawelpomorski$ ./test.x
CL_PLATFORM_NAME = Apple
CL_PLATFORM_VERSION = OpenCL 1.0 (Feb 10 2010 23:46:58)
2 devices found
Device #0 name = GeForce 9400M
	Driver version = CLH 1.0
	Global Memory (MB): 256
	Global Memory Cache (MB): 0
	Local Memory (KB): 16
	Max clock (MHz): 1100
	Max Work Group Size: 512
	Number of parallel compute cores: 2
Device #1 name = Intel(R) Core(TM)2 Duo CPU P7350 @ 2.00GHz
	Driver version = 1.0
	Global Memory (MB): 1536
	Global Memory Cache (MB): 3
	Local Memory (KB): 16
	Max clock (MHz): 2000
	Max Work Group Size: 1
	Number of parallel compute cores: 2
ppomorsk-mac:tryopencl pawelpomorski$
Getting a code to run on a GPU
Take existing serial program, separate out the parts that will continue to run on host, and the parts which will be sent to the GPU
GPU parts need to be rewritten in the form of kernel functions
Add code to host that manages GPU overhead: creates kernels, moves data between host and GPU etc.
OpenCL “Hello, World!” example
Not practical to do proper “Hello, World!” as OpenCL devices cannot access standard output directly
Our example program will pass an array of numbers from host to the GPU, square each number in the array on the GPU, then return modified array to host
This program is for learning purposes only, and will not run efficiently on a GPU
The program will demonstrate the basic structure which is common to all OpenCL programs
Will not show error checks on the slides for clarity, but having them is essential when writing OpenCL code
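As a minimal sketch of what such a check looks like (the exact wording in hello.c may differ):

err = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_GPU, 1, &device_id, NULL);
if (err != CL_SUCCESS)
{
    printf("Error: Failed to create a device group! (error code %d)\n", err);
    return EXIT_FAILURE;
}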
Source code
On angel cluster:
/work/ppomorsk/SUMMER_SCHOOL/hello/no_errorchecks_hello.c
/work/ppomorsk/SUMMER_SCHOOL/hello/hello.c
Program flow - OpenCL function calls

clGetPlatformIDs, clGetDeviceIDs, clCreateContext, clCreateCommandQueue
  -> Organize resources, create command queue

clCreateProgramWithSource, clBuildProgram, clCreateKernel
  -> Compile kernel

clCreateBuffer, clEnqueueWriteBuffer
  -> Transfer data from host to GPU memory

clSetKernelArg, clGetKernelWorkGroupInfo, clEnqueueNDRangeKernel, clFinish
  -> Launch threads running kernels on GPU, perform main computation

clEnqueueReadBuffer
  -> Transfer data from GPU to host memory

clRelease...
  -> Free all allocated memory
// Simple compute kernel which computes the square of an input array
//
const char *KernelSource = "\n" \
"__kernel void square(                    \n" \
"   __global float* input,                \n" \
"   __global float* output,               \n" \
"   const unsigned int count)             \n" \
"{                                        \n" \
"   int i = get_global_id(0);             \n" \
"   if(i < count)                         \n" \
"       output[i] = input[i] * input[i];  \n" \
"}                                        \n" \
"\n";
Kernel code - string with OpenCL C code
int main(int argc, char** argv)
{
    int err;                    // error code returned from api calls
    float data[DATA_SIZE];      // original data set given to device
    float results[DATA_SIZE];   // results returned from device
    unsigned int correct;       // number of correct results returned

    size_t global;              // global domain size for our calculation
    size_t local;               // local domain size for our calculation

    cl_device_id device_id;     // compute device id
    cl_context context;         // compute context
    cl_command_queue commands;  // compute command queue
    cl_program program;         // compute program
    cl_kernel kernel;           // compute kernel
    cl_mem input;               // device memory used for the input array
    cl_mem output;              // device memory used for the output array
    cl_platform_id platform_id = NULL;

    // Fill our data set with random float values
    //
    int i = 0;
    unsigned int count = DATA_SIZE;
    for(i = 0; i < count; i++)
        data[i] = rand() / (float)RAND_MAX;
Define variables, set input data
    // determine OpenCL platform
    err = clGetPlatformIDs(1, &platform_id, NULL);

    // Connect to a compute device
    //
    int gpu = 1;
    err = clGetDeviceIDs(platform_id, gpu ? CL_DEVICE_TYPE_GPU : CL_DEVICE_TYPE_CPU, 1, &device_id, NULL);

    // Create a compute context
    //
    context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);

    // Create a command queue
    //
    commands = clCreateCommandQueue(context, device_id, 0, &err);
Organise resources, create command queue
    // Create the compute program from the source buffer
    //
    program = clCreateProgramWithSource(context, 1, (const char **) &KernelSource, NULL, &err);

    // Build the program executable
    //
    err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
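If clBuildProgram fails (e.g. a syntax error in the kernel string), the kernel compiler’s log can be retrieved with clGetProgramBuildInfo. A minimal sketch of such a check (the exact wording in hello.c may differ):

if (err != CL_SUCCESS)
{
    size_t len;
    char buffer[2048];
    // retrieve the kernel compiler's error messages
    clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG, sizeof(buffer), buffer, &len);
    printf("Error: Failed to build program executable!\n%s\n", buffer);
    exit(1);
}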
    // Create the compute kernel in the program we wish to run
    //
    kernel = clCreateKernel(program, "square", &err);
Compile kernel
    // Create the input and output arrays in device memory for our calculation
    //
    input = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(float) * count, NULL, NULL);
    output = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(float) * count, NULL, NULL);

    // Write our data set into the input array in device memory
    //
    err = clEnqueueWriteBuffer(commands, input, CL_TRUE, 0, sizeof(float) * count, data, 0, NULL, NULL);
Transfer data from host to GPU memory
    // Set the arguments to our compute kernel
    //
    err = 0;
    err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), &input);
    err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &output);
    err |= clSetKernelArg(kernel, 2, sizeof(unsigned int), &count);

    // Execute the kernel over the entire range of our 1d input data set
    // using one work item per work group (allows for arbitrary length of data array)
    //
    global = count;
    local = 1;
    err = clEnqueueNDRangeKernel(commands, kernel, 1, NULL, &global, &local, 0, NULL, NULL);

    // Wait for the commands to get serviced before reading back results
    //
    clFinish(commands);
Launch threads running kernels on GPU, perform main computation
    // Read back the results from the device to verify the output
    //
    err = clEnqueueReadBuffer(commands, output, CL_TRUE, 0, sizeof(float) * count, results, 0, NULL, NULL);

    // Validate our results
    //
    correct = 0;
    for(i = 0; i < count; i++)
    {
        if(results[i] == data[i] * data[i])
            correct++;
    }

    // Print a brief summary detailing the results
    //
    printf("Computed '%d/%d' correct values!\n", correct, count);
Transfer data from GPU to host memory
    // Shutdown and cleanup
    //
    clReleaseMemObject(input);
    clReleaseMemObject(output);
    clReleaseProgram(program);
    clReleaseKernel(kernel);
    clReleaseCommandQueue(commands);
    clReleaseContext(context);

    return 0;
}
Free all allocated memory
If you are trying to use an OpenCL profiler and it crashes at the end of the run, not freeing all memory could be the cause
Compiling kernel - Online approach

In our example we have compiled code at runtime (online compile)
Kernel code can be specified as a string in the main program file
Kernel code can also be stored in a file (.cl) and loaded at runtime
const char fileName[] = "./kernel.cl";
/* Load kernel source code */
fp = fopen(fileName, "r");
if (!fp) {
    fprintf(stderr, "Failed to load kernel.\n");
    exit(1);
}
source_str = (char *)malloc(MAX_SOURCE_SIZE);
source_size = fread(source_str, 1, MAX_SOURCE_SIZE, fp);
fclose(fp);
// ...

/* Create Kernel program from the read-in source */
program = clCreateProgramWithSource(context, 1, (const char **)&source_str,
                                    (const size_t *)&source_size, &ret);
Compiling kernel - Offline approach

It is possible to compile OpenCL source code (kernels) offline
This is the approach used in CUDA
Upside: no time spent compiling at runtime

Downside: executable not portable; a separate binary must be compiled for each device the code is running on
There is no freestanding compiler for OpenCL (like nvcc for CUDA)
Kernels have to be compiled at runtime and the resulting binary saved. That binary can then be loaded in future runs, avoiding the compiling step.
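A sketch of this save-and-reload approach, using clGetProgramInfo and clCreateProgramWithBinary (assuming a single device; the file name "kernel.bin" is arbitrary):

/* after clBuildProgram: retrieve the compiled binary and save it */
size_t bin_size;
unsigned char *binary;
clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(size_t), &bin_size, NULL);
binary = (unsigned char *)malloc(bin_size);
clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(unsigned char *), &binary, NULL);
FILE *fb = fopen("kernel.bin", "wb");
fwrite(binary, 1, bin_size, fb);
fclose(fb);

/* in a later run: load the binary instead of compiling from source */
/* (clBuildProgram must still be called on the loaded program)      */
program = clCreateProgramWithBinary(context, 1, &device_id, &bin_size,
                                    (const unsigned char **)&binary, NULL, &err);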
Task queue

Queue is used to launch kernels, in precisely controlled order if required.
Event objects contain information about whether various operations have finished.
For example, clEnqueueTask launches a single kernel, after checking the provided list of event objects to see whether they have completed, and returns its own event object.
// create queue enabled for out of order (parallel) execution
commands = clCreateCommandQueue(context, device_id, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
...
// no synchronization
clEnqueueTask(commands, kernel_E, 0, NULL, NULL);
// synchronize so that kernel E starts only after kernels A,B,C,D finish
cl_event events[4]; // define event object array
clEnqueueTask(commands, kernel_A, 0, NULL, &events[0]);
clEnqueueTask(commands, kernel_B, 0, NULL, &events[1]);
clEnqueueTask(commands, kernel_C, 0, NULL, &events[2]);
clEnqueueTask(commands, kernel_D, 0, NULL, &events[3]);
clEnqueueTask(commands, kernel_E, 4, events, NULL);
Queueing Data Parallel tasks

clEnqueueNDRangeKernel - used to launch data parallel tasks, i.e. copies of a kernel which are identical except for operating on different data
For example, to launch #(global) work-items (copies of the kernel, i.e. threads) grouped into work-groups of size #(local), one would use the call shown below:
#(global) must be divisible by #(local), maximum size of #(local) dependent on device
Each work item can retrieve:
get_global_id(0) - its index among all threads
get_local_id(0) - its index among threads in its work-group
get_group_id(0) - index of its work group
This information can be used to determine which data to operate on
err = clEnqueueNDRangeKernel(commands, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
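As an illustration of how these indices relate (a hypothetical kernel, not from the course code):

__kernel void show_ids(__global int* out)
{
    // global index = work-group index * work-group size + local index
    int g = get_group_id(0) * get_local_size(0) + get_local_id(0);
    out[get_global_id(0)] = g; // g always equals get_global_id(0)
}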
Organising work-items in more than 1D

Convenient and efficient for many tasks, 2D and 3D possible
Maximum possible values in global and local arrays are device dependent
For 2D, can retrieve global index coordinates (x,y) = (get_global_id(0), get_global_id(1))

For 3D, can similarly retrieve global (x,y,z) = (get_global_id(0), get_global_id(1), get_global_id(2))
// 2D example
// launch 256 work-items, organised in a 16x16 grid,
// grouped in work-groups of 16, organised as 4x4
global[0] = 16; global[1] = 16;
local[0] = 4; local[1] = 4;

// global and local are arrays of size_t[2] here, so they are passed directly
err = clEnqueueNDRangeKernel(commands, kernel, 2, NULL, global, local, 0, NULL, NULL);
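A sketch of a kernel indexed in 2D (hypothetical example; width is the row length of the arrays):

__kernel void copy2d(__global float* in, __global float* out, int width)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    // flatten the 2D coordinates into a linear array index
    out[y * width + x] = in[y * width + x];
}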
Measuring execution time
OpenCL is an abstraction layer that allows the same code to be executed on different platforms
The code is guaranteed to run but its speed of execution will be dependent on the device and type of parallelism used
In order to get maximum performance, device and parallelism dependent tuning must be performed.
To do this, execution time must be measured. OpenCL provides convenient ways to do this
Event objects can also contain information about how long it took for the task associated with the event to execute
clGetEventProfilingInfo can obtain the start and end time of an event. Taking the difference gives the event duration in nanoseconds.
commands = clCreateCommandQueue(context, device_id, CL_QUEUE_PROFILING_ENABLE, &err);
cl_event event;
err = clEnqueueNDRangeKernel(commands, kernel, 1, NULL, &global, &local, 0, NULL, &event);

clWaitForEvents(1, &event);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL);
printf("Time for event (ms): %10.5f \n", (end - start) / 1000000.0);
// ...
clReleaseEvent(event); // important to free memory
Timing “Hello World!”
Increase DATA_SIZE to 1048576
Check how computation time varies with size of work-groups
Using largest possible work group is usually the best choice
Time for write/read to buffer is 2.51/5.14 ms.
Work-group size (local) | Time (ms)
1             | 6.191
8             | 0.803
16            | 0.414
32            | 0.209
64            | 0.127
512 (maximum) | 0.130
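To pick the largest usable work-group size at runtime rather than hard-coding it, the maximum for a given kernel and device can be queried; this is what the clGetKernelWorkGroupInfo call in the program flow is for:

// query the maximum work-group size usable for this kernel on this device
err = clGetKernelWorkGroupInfo(kernel, device_id, CL_KERNEL_WORK_GROUP_SIZE,
                               sizeof(local), &local, NULL);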
Timing “Hello World!” with openclprof

/opt/sharcnet/cuda/3.0/cuda/openclprof/bin/openclprof
Must minimize memory transfers between host and GPU as these are quite slow.
Even if a kernel shows no performance improvement when run on the GPU, it may still be worth running it on the GPU if doing so eliminates a device-to-host memory transfer
Intermediate data structures should be created in GPU memory, operated on by GPU, and destroyed without ever being mapped or copied to host memory
Because of overhead, batching small transfers into one works better than making each transfer separately.
Higher bandwidth can be achieved when using page-locked (or pinned) memory.
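A sketch of one common way to obtain pinned host memory in OpenCL (assuming a buffer of sizeof(float)*count bytes; the exact strategy varies by vendor):

// allocate a buffer backed by pinned (page-locked) host memory
cl_mem pinned = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                               sizeof(float) * count, NULL, &err);
// map it to obtain a host pointer that can be filled with data
float *host_ptr = (float *)clEnqueueMapBuffer(commands, pinned, CL_TRUE, CL_MAP_WRITE,
                                              0, sizeof(float) * count, 0, NULL, NULL, &err);
// ... fill host_ptr, then use it as the source of a fast clEnqueueWriteBuffer ...
clEnqueueUnmapMemObject(commands, pinned, host_ptr, 0, NULL, NULL);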
Running openclprof on SHARCNET
1. ssh -X [email protected]
2. ssh -X ang4
3. /opt/sharcnet/cuda/3.0/cuda/openclprof/bin/openclprof
Once you get the GUI:
4. File->New
5. Name your project, select location somewhere in your work space
6. Select OpenCL binary executable compiled previously (no special compile options needed)
7. Select your working directory
8. Click “Start” to perform profiling run
Examine output
Try various options in session and view
Exercise: Port a serial code to use OpenCL

cp -rp /work/ppomorsk/SUMMER_SCHOOL/ .

./exercise/serial - serial code you are to convert to OpenCL
./exercise/opencl_template - contains a generic “template” which you can modify, using the serial code, to develop the OpenCL version of the code
./hello - consult if needed
./exercise/opencl_solution - the solution (peek if you must)
Compile with: gcc -o test.x main.c -lOpenCL
To run: ./test.x
Template is for the case of launching one kernel as a single thread
This port will be fairly generic, and could easily run on both GPU and CPU
void moving_average(int *values, float *average, int length, int width)
{
    int i;
    int add_value;

    /* Compute sum for the first "width" elements */
    add_value = 0;
    for (i = 0; i < width; i++) {
        add_value += values[i];
    }
    average[width-1] = (float)add_value;

    /* Compute sum for the (width) to (length-1) elements */
    for (i = width; i < length; i++) {
        add_value = add_value - values[i-width] + values[i];
        average[i] = (float)(add_value);
    }

    /* Insert zeros to 0th to (width-2)th element */
    for (i = 0; i < width-1; i++) {
        average[i] = 0.0f;
    }

    /* Compute average from the sum */
    for (i = width-1; i < length; i++) {
        average[i] /= (float)width;
    }
}
Computes Simple Moving Average (SMA)
Hint: Converting this to an OpenCL kernel that runs as single thread requires very little modification
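A sketch of what that single-thread kernel might look like (the course solution file may differ in details):

__kernel void moving_average(__global int *values, __global float *average,
                             int length, int width)
{
    /* same algorithm as the serial version; the pointers now live in __global memory */
    int i;
    int add_value = 0;

    for (i = 0; i < width; i++)
        add_value += values[i];
    average[width-1] = (float)add_value;

    for (i = width; i < length; i++) {
        add_value = add_value - values[i-width] + values[i];
        average[i] = (float)add_value;
    }

    for (i = 0; i < width-1; i++)
        average[i] = 0.0f;

    for (i = width-1; i < length; i++)
        average[i] /= (float)width;
}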
#include <stdio.h>
#include <stdlib.h>

/* Read Stock data */
int stock_array1[] = {
    #include "stock_array1.txt"
};

/* Define width for the moving average */
#define WINDOW_SIZE (13)
int main(int argc, char *argv[])
{
    float *result;

    int data_num = sizeof(stock_array1) / sizeof(stock_array1[0]);
    int window_num = (int)WINDOW_SIZE;
    int i;

    /* Allocate space for the result */
    result = (float *)malloc(data_num * sizeof(float));

    /* Call the moving average function */
    moving_average(stock_array1, result, data_num, window_num);

    /* Print result */
    for (i = 0; i < data_num; i++) {
        printf("result[%d] = %f\n", i, result[i]);
    }

    /* Deallocate memory */
    free(result);

    return 0;
}
Serial host code
OpenCL C language
OpenCL C
OpenCL C is basically standard C (C99) with some extensions and restrictions. This language is used to program the kernel.
There is a list of restrictions, for example:
- recursion is not supported
- the pointer passed as an argument to a kernel function must be of type __global, __constant or __local
- support for double precision floating-point is currently an optional extension, and it may not be implemented
Some OpenCL functions are built in without requiring a library, including math functions, geometric functions, and work-item functions
OpenCL includes some new features not available in C
Vector Data
In OpenCL one can define and operate on vector data-types, each consisting of several components of the same data-type (CUDA has a much more limited capacity to do this)
float4 consists of (float,float,float,float) declared via float4 f4;
Vector sizes are 2,4,8,16, can consist of char,int,float, double etc.
// vector literals used to assign values to variables
float4 g4 = (float4)(1.0f, 2.0f, 3.0f, 4.0f);

// built in math functions can process vector types
float4 f4 = cos(g4); // f4 = (cos(g4.x), cos(g4.y), cos(g4.z), cos(g4.w))
Accessing vector data
// .xyzw will access components 0 to 3, same as in CUDA

int4 x = (int4)(1,2,3,4);
int4 a = x.wzyx; // a = (4,3,2,1)
int2 b = x.xx;   // b = (1,1)

// can access through number index using .s followed by the number
// (in hex, which uses A-F for indices above 9)

int8 c = x.s01233210; // c = (1,2,3,4,4,3,2,1)

// extract odd and even indices (.odd, .even)

int2 d = x.odd; // d = (2,4)

// extract upper half and lower half of vector (.hi, .lo)

int2 e = x.hi; // e = (3,4)
Other vector operations
Addition, subtraction
A comparison operation returns a vector containing TRUE or FALSE values

The type of the returned vector is determined by the operand types: char for char operands, and an integer of the same length as the compared scalar values (short/int/long)
The actual values for FALSE/TRUE are 0/-1 and not 0/1, as that is more convenient for SIMD units
int2 tf = (float2)(1.0f,1.0f) == (float2)(0.0f,1.0f);
// tf = (int2)(0,-1)
Effective use of SIMD
Single Instruction obviously implies that it is essential for each thread to run exactly the same code as much as possible
Thus, if I have an if statement that will send some threads down one branch of the code and some other threads down another branch, that is not good.
Example of selection without using if

int a = in0[i], b = in1[i]; /* want to put the smaller one in out[i] */
/* usual way with if */
if (a < b) out[i] = a;
else out[i] = b;
// or possibly
// out[i] = a<b ? a : b;

-- better way --

int a = in0[i], b = in1[i];
int cmp = -(a < b); // if TRUE, cmp has all bits 1; if FALSE, all bits 0
                    // (a scalar < returns 0/1, hence the negation;
                    //  vector comparisons return 0/-1 directly)
// & bitwise AND
// | bitwise OR
// ~ flips all bits
out[i] = (a & cmp) | (b & ~cmp); // a when TRUE and b when FALSE

// OpenCL has a shortcut for this
out[i] = bitselect(b, a, cmp); // bits from a where cmp is 1, from b where 0

Works because: a & (all 1s) = a, a | (all 0s) = a
Good illustration of why -1 for TRUE works better than +1
Vector Benefits
Using vectors will only have benefits on hardware with SIMD features
If those features are not there, vectors will be handled sequentially with no speedup
The selection of vector size is important. For example, double16 may be too big to fit into some registers, so speedup will not occur
Ideally all this should be handled automatically by the compiler, but we are not yet there
OpenCL Pragmas
#pragma OPENCL FP_CONTRACT on-off-switch
Controls how floating point is done (whether FMA is used). In contracted operations, multiple operations are performed as one.
#pragma OPENCL EXTENSION <extension_name> : <behavior>
A number of extensions are available
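For example, enabling the (optional) double precision extension mentioned earlier — a sketch, assuming the device actually supports cl_khr_fp64:

#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel void add_doubles(__global double* a, __global double* b, __global double* c)
{
    int i = get_global_id(0);
    c[i] = a[i] + b[i];
}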
OpenCL Memory Types
__global - memory which can be read from all work items (threads), though with relatively slow access unless the program is carefully optimised. It is the device’s main memory.
__local - can be read from work items within a work group. Physically it is the shared memory on each compute unit.
__private - can only be used within each work item. Physically it is the registers used by each processing element.
__constant - memory that can be used to store constant data for read-only access by all compute units. The host is responsible for initializing and allocating this memory
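A sketch showing all four qualifiers in one (hypothetical) kernel:

__kernel void qualifiers_demo(__global float* g,   // device main memory
                              __constant float* c, // read-only, set up by host
                              __local float* l)    // shared within a work-group
{
    __private float p = g[get_global_id(0)]; // registers (default for kernel locals)
    l[get_local_id(0)] = p + c[0];
    barrier(CLK_LOCAL_MEM_FENCE); // wait until all work-items have written l
    g[get_global_id(0)] = l[get_local_id(0)];
}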
OpenCL vs. CUDA - Vector addition example
CUDA:

__global__ void
vectorAdd(const float * a, const float * b, float * c)
{
    // Vector element index
    int nIndex = blockIdx.x * blockDim.x + threadIdx.x;
    c[nIndex] = a[nIndex] + b[nIndex];
}

OpenCL:

__kernel void
vectorAdd(__global const float * a,
          __global const float * b,
          __global float * c)
{
    // Vector element index
    int nIndex = get_global_id(0);
    c[nIndex] = a[nIndex] + b[nIndex];
}
Optimizing OpenCL
- Efficient access to global memory
- Use of local memory to increase bandwidth: Matrix Multiplication example
Code, data and figures from:
NVIDIA OpenCL Best Practices Guide
http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/NVIDIA_OpenCL_BestPracticesGuide.pdf
GPU memory vs. CPU memory
CPUs have a cache which for many problems can provide fast memory access, with transfers between cache and main memory handled in hardware.
The cache paradigm scales poorly as the number of cores is increased, as increased hardware overhead is needed to maintain coherency between caches attached to different cores.
In a GPU, with its hundreds of cores, caches are not practical and thus are not present.
Fast but small local (shared) memory is somewhat analogous to L1 cache on a CPU. However, there is no hardware to automatically move data between global and local memory as needed. This must be taken care of in software, i.e. by the programmer.

In general on a CPU the requirements for memory access are not that stringent. If you are accessing memory sequentially, you are almost certainly OK.

On a GPU, efficient memory access requires more work by the programmer.
Coalesced Access to Global Memory
Global memory loads and stores by threads of half warp (16 threads) can be combined into as few as one transaction.
You can think of global memory as rows of 16 floats linked together into a 64 byte group, with 2 of these forming a 128 byte group
Good access pattern results in a single 64-byte transaction

In this example, if accesses were permuted, still one transaction would be performed on newer cards (CC 1.2), but 16 serialized transactions on older cards (CC 1.1)
[Excerpt from the NVIDIA OpenCL Best Practices Guide, sections 3.2.1.1 “A Simple Access Pattern” and 3.2.1.2 “A Sequential but Misaligned Access Pattern” (body text garbled in extraction; only the figure caption and slide labels are recoverable).
Figure 3.4: Coalesced access in which all threads but one access the corresponding word in a segment
Slide labels: “Half warp of threads”, “Global memory”]
[Further excerpt from the NVIDIA OpenCL Best Practices Guide, section 3.2.1.3 “Effects of Misaligned Accesses” (body text garbled in extraction).
Figure 3.5: Unaligned sequential addresses that fit within a single 128-byte segment
Figure 3.6: Misaligned sequential addresses that fall within two 128-byte segments
Listing 3.5: A copy kernel that illustrates misaligned accesses (the offsetCopy kernel, reproduced on the next slide)]
Somewhat misaligned pattern: CC 1.1 devices will perform the transactions in series; CC 1.2 can still do it in one
Memory to be accessed now belongs to two different 128-byte groups, so CC 1.2 devices need 2 transactions

Choosing work-group sizes to be multiples of 16 facilitates efficient access patterns
Cost of misaligned access
__kernel void offsetCopy(__global float *odata,
                         __global float *idata,
                         int offset)
{
    int xid = get_global_id(0) + offset;
    odata[xid] = idata[xid];
}
[Excerpt from the NVIDIA OpenCL Best Practices Guide (body text garbled in extraction).
Figure 3.7: Performance of offsetCopy kernel - effective bandwidth for various offsets on an NVIDIA GeForce GTX 280 (compute capability 1.3) and an NVIDIA GeForce GTX 8800 (compute capability 1.0)]
Even on newer cards consecutive but misaligned access can decrease bandwidth by a factor of 2
Strided accesses

__kernel void strideCopy(__global float* odata,
                         __global float* idata,
                         int stride)
{
    int xid = get_global_id(0) * stride;
    odata[xid] = idata[xid];
}
[Excerpt from the NVIDIA OpenCL Best Practices Guide, section 3.2.1.4 “Strided Accesses” (body text garbled in extraction).
Listing 3.6: A kernel to illustrate non-unit stride data copy (the strideCopy kernel above)
Figure 3.8: A half warp accessing memory with a stride of 2]
Stride=2 case: CC 1.3 devices can do this in a single 128-byte transaction, but bandwidth is wasted as half of the elements are not used
[Excerpt from the NVIDIA OpenCL Best Practices Guide, sections 3.2.2 “Shared Memory” and 3.2.2.1 “Shared Memory and Memory Banks” (body text garbled in extraction). Key recoverable point: because it is on-chip, shared memory (i.e. OpenCL __local memory) is much faster than local and global memory; it is divided into equally sized banks that can be accessed simultaneously, and accesses that map to the same bank are serialized.
Figure 3.9: Performance of strideCopy kernel]
Effective bandwidth deteriorates as stride is increased to 16 and beyond
Note that performance deteriorates significantly as stride increases to just 16
On the CPU, the cache with its fast memory is relatively large and thus can provide reasonable performance even for larger strides
GPUs have no cache, but they have __local memory (in OpenCL, same as shared memory in CUDA), which can be used to improve stride access efficiency
Optimizing matrix multiplication with use of local memory
Consider the special case of AB = C where:
A is (M x 16)
B is (16 x N)
C is (M x N)
Use work-group of 16x16
[Excerpt from the NVIDIA OpenCL Best Practices Guide (body text garbled in extraction).
Figure 3.10: A block-column matrix (A) multiplied by a block-row matrix (B) and the resulting product matrix (C)
Listing 3.7: Unoptimized matrix multiplication (the simpleMultiply kernel, reproduced on the next slide)]
__kernel void simpleMultiply(__global float* a,
                             __global float* b,
                             __global float* c,
                             int N)
{
    int row = get_global_id(1);
    int col = get_global_id(0);
    float sum = 0.0f;
    for (int i = 0; i < TILE_DIM; i++) {
        sum += a[row*TILE_DIM+i] * b[i*N+col];
    }
    c[row*N+col] = sum;
}
Unoptimized
With this code, the number of times global memory must be accessed to provide the needed entries of a and b is N * M * (16 + 16)
[Excerpt from the NVIDIA OpenCL Best Practices Guide (body text garbled in extraction).
Figure 3.11: Computing a row (half warp) of a tile in C using one row of A and an entire tile of B
The guide's coalescedMultiply kernel, which reads a tile of A into shared memory, is reproduced on the next slide.]
__kernel void coalescedMultiply(__global float* a,
                                __global float* b,
                                __global float* c,
                                int N,
                                __local float aTile[TILE_DIM][TILE_DIM])
{
    int row = get_global_id(1);
    int col = get_global_id(0);
    float sum = 0.0f;
    int x = get_local_id(0);
    int y = get_local_id(1);
    aTile[y][x] = a[row*TILE_DIM+x];
    for (int i = 0; i < TILE_DIM; i++) {
        sum += aTile[y][i] * b[i*N+col];
    }
    c[row*N+col] = sum;
}
Partially optimized
Store A in local memory
Global memory is accessed N*M*(1+16) times
__kernel void sharedABMultiply(__global float* a,
                               __global float* b,
                               __global float* c,
                               int N,
                               __local float aTile[TILE_DIM][TILE_DIM],
                               __local float bTile[TILE_DIM][TILE_DIM])
{
    int row = get_global_id(1);
    int col = get_global_id(0);
    float sum = 0.0f;
    int x = get_local_id(0);
    int y = get_local_id(1);
    aTile[y][x] = a[row*TILE_DIM+x];
    bTile[y][x] = b[y*N+col];
    barrier(CLK_LOCAL_MEM_FENCE); // wait until both tiles are fully loaded
    for (int i = 0; i < TILE_DIM; i++) {
        sum += aTile[y][i] * bTile[i][x];
    }
    c[row*N+col] = sum;
}
Fully optimized
Store both A and B in local memory
Global memory is accessed N*M*(1+1) times
Performance
Optimization | NVIDIA GeForce GTX 280 (released June 2008) | NVIDIA GeForce GTX 8800 (released November 2006)
None | 8.8 GBps | 0.7 GBps
Coalesced using shared memory to store a tile of A | 14.3 GBps | 8.2 GBps
Additionally using shared memory to store a tile of B | 29.7 GBps | 15.7 GBps
The trend is for newer cards to cope better with poor optimization
Conclusion
Explore OpenCL to find out more of its features, as this introduction was only an overview with many aspects left out
Experiment with OpenCL on angel (full install very soon)
Contact SHARCNET HPC analysts for advice about porting your code to GPU
Keep up with new developments in the rapidly developing field of GPU computing