OpenCL for programming GPUs
SHARCNET Summer School 2010
Pawel Pomorski
Overview

What is OpenCL?

Basic OpenCL
- Detecting your environment
- Basic building blocks - “Hello World!” program
- Hands on exercise - convert a “serial” CPU code to OpenCL GPU code
- OpenCL C

“Advanced” OpenCL
- Memory optimization
GPU computing timeline
before 2003 - Calculations on GPU, using graphics API
2003 - Brook “C with streams”
2005 - Steady increase in CPU clock speed comes to a halt; switch to multicore chips to compensate. At the same time, computational power of GPUs increases
November 2006 - CUDA released by NVIDIA
November 2006 - CTM (Close to Metal) from ATI
December 2007 - Succeeded by AMD Stream SDK
December 2008 - Technical specification for OpenCL 1.0 released
April 2009 - First OpenCL 1.0 GPU drivers released by NVIDIA
August 2009 - Mac OS X 10.6 Snow Leopard released, with OpenCL 1.0 included
September 2009 - Public release of OpenCL by NVIDIA
December 2009 - AMD release of ATI Stream SDK 2.0 with OpenCL support
March 2010 - CUDA 3.0 released, incorporating OpenCL
Near future - Intel may release a GPGPU (Larrabee)
Future - CPUs will have so many cores they will start to be treated as GPUs?
Why are GPUs fast?
Different paradigm, data parallelism (vs. task parallelism on CPU)
Stream processing
Hardware capable of launching MANY threads
However, GPUs also have significant limitations and not all code can run fast on them
If a significant portion of your code cannot be accelerated, your speedup will be limited by Amdahl’s Law
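For reference, Amdahl’s Law in its usual form: if a fraction P of the runtime can be accelerated by a factor S, the overall speedup is

speedup = 1 / ((1 - P) + P/S)

For example, with P = 0.9 and S = 100, the overall speedup is only about 9.2x.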
OpenCL (Open Computing Language) is an open standard for parallel programming of heterogeneous systems, managed by the Khronos Group.
Aim is for it to form the foundation layer of a parallel computing ecosystem of platform-independent tools
OpenCL includes a language for writing kernels, plus APIs used to define and control platforms
Targeted at GPUs, but also multicore CPUs and other hardware (Cell etc.), though to run efficiently each family of devices needs different code
OpenCL is largely derived from CUDA (no need to reinvent the wheel)
Same basic model: a kernel is sent to the accelerator, a “compute device” composed of “compute units” whose “processing elements” work on “work-items”.
Some names changed between CUDA and OpenCL
Thread -> Work-item
Block -> Work-group
Both CUDA and OpenCL allow for detailed low level optimization of the GPU
OpenCL device independent - really?
Actually, there is no way to write code that will run equally efficiently on both GPU and multicore CPU architectures
However, OpenCL can detect the features of the architecture and run code appropriate for each
OpenCL code should run efficiently on many GPUs with a bit of fine tuning
On the other hand, an OpenCL code highly optimized for NVIDIA hardware will not run that efficiently on AMD hardware
As High Performance Computing code needs to be highly optimized, OpenCL may not offer a practical ability to be device independent
CUDA vs. OpenCL

NVIDIA is fully supporting OpenCL even though it does not run exclusively on their hardware
The strategy is to increase the number of GPU users by making software more portable and accessible, which OpenCL is meant to do. If there are more users, NVIDIA will sell more hardware.
As CUDA is fully controlled by NVIDIA, it is possible that in the future it will contain more “bleeding edge” features than OpenCL, which is overseen by a consortium hence slower to incorporate new features
If you want your GPU code to run on both NVIDIA and AMD/ATI devices (two main players at present), OpenCL is the only way to accomplish that
Available OpenCL environments

NVIDIA OpenCL - distributed with CUDA since version 3.0
- no CPU support at the moment
- for NVIDIA GPU cards only

Apple OpenCL - included as standard feature in Mac OS X 10.6 Snow Leopard (requires XCode)
- supports both graphics cards and CPUs (Apple hardware only)

AMD (ATI) OpenCL
- supports only AMD GPU cards
- no CPU support at the moment
IBM OpenCL, supports IBM Blade Center hardware
Foxc (Fixstars OpenCL Cross Compiler), in Beta as of June, 2010, CPU only
OpenCL References
The OpenCL Programming Book - http://www.fixstars.com/en/company/books/opencl/
Khronos consortium: http://www.khronos.org/
CUDA OpenCL: http://www.nvidia.com/object/cuda_opencl_new.html
NVIDIA GPU Computing SDK code samples
On “angel”: /opt/sharcnet/cuda/3.0/gpucomputingsdk/OpenCL
Apple OpenCL : http://developer.apple.com
AMD (ATI) OpenCL : http://ati.amd.com/technology/streamcomputing/opencl.html
OpenCL on SHARCNET
Cluster “angel” is SHARCNET’s main GPU computing cluster.
Hardware: 22 nodes, each with 8 CPU cores; each node “sees” 2 Tesla T10 GPUs
CUDA 3.0 drivers needed for OpenCL are currently available on node ang4 via a special arrangement for Summer School
Students can test OpenCL programs by connecting to ang4 via “ssh ang4” (from angel login node)
Run interactively, as ang4 is not accessible through queues right now
Compile with:
gcc -o test.x your_code.c -lOpenCL
Know your hardware

OpenCL programs should aim to be hardware agnostic
The program should find the relevant system information at runtime
OpenCL provides methods for this
At minimum, the programmer must determine what OpenCL devices are available and choose which are to be used by the program
#include <stdio.h>
#include <CL/cl.h>

int main(int argc, char** argv)
{
    char dname[500];
    cl_device_id devices[10];
    cl_uint num_devices, entries;
    cl_ulong long_entries;
    int d;
    cl_int err;
    cl_platform_id platform_id = NULL;
    size_t p_size;

    /* obtain list of platforms available */
    err = clGetPlatformIDs(1, &platform_id, NULL);
    if (err != CL_SUCCESS)
    {
        printf("Error: Failure in clGetPlatformIDs, error code=%d\n", err);
        return 0;
    }

    /* obtain information about platform */
    clGetPlatformInfo(platform_id, CL_PLATFORM_NAME, 500, dname, NULL);
    printf("CL_PLATFORM_NAME = %s\n", dname);

    clGetPlatformInfo(platform_id, CL_PLATFORM_VERSION, 500, dname, NULL);
    printf("CL_PLATFORM_VERSION = %s\n", dname);
Program to get information - What is the platform?
On angel cluster: /work/ppomorsk/SUMMER_SCHOOL/get_info/get_opencl_information.c
    /* obtain list of devices available on platform */
    clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_ALL, 10, devices, &num_devices);
    printf("%u devices found\n", num_devices);

    /* query devices for information */
    for (d = 0; d < num_devices; ++d)
    {
        clGetDeviceInfo(devices[d], CL_DEVICE_NAME, 500, dname, NULL);
        printf("Device #%d name = %s\n", d, dname);
        clGetDeviceInfo(devices[d], CL_DRIVER_VERSION, 500, dname, NULL);
        printf("\tDriver version = %s\n", dname);
        clGetDeviceInfo(devices[d], CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(cl_ulong), &long_entries, NULL);
        printf("\tGlobal Memory (MB):\t%llu\n", long_entries/1024/1024);
        clGetDeviceInfo(devices[d], CL_DEVICE_GLOBAL_MEM_CACHE_SIZE, sizeof(cl_ulong), &long_entries, NULL);
        printf("\tGlobal Memory Cache (MB):\t%llu\n", long_entries/1024/1024);
        clGetDeviceInfo(devices[d], CL_DEVICE_LOCAL_MEM_SIZE, sizeof(cl_ulong), &long_entries, NULL);
        printf("\tLocal Memory (KB):\t%llu\n", long_entries/1024);
        /* CL_DEVICE_MAX_CLOCK_FREQUENCY returns a cl_uint */
        clGetDeviceInfo(devices[d], CL_DEVICE_MAX_CLOCK_FREQUENCY, sizeof(cl_uint), &entries, NULL);
        printf("\tMax clock (MHz):\t%u\n", entries);
        clGetDeviceInfo(devices[d], CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(size_t), &p_size, NULL);
        printf("\tMax Work Group Size:\t%zu\n", p_size);
        clGetDeviceInfo(devices[d], CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(cl_uint), &entries, NULL);
        printf("\tNumber of parallel compute cores:\t%u\n", entries);
    }
    return 0;
}
Program to get information cont. - What are the devices?
[ppomorsk@ang4:~] cd /work/ppomorsk/SUMMER_SCHOOL/get_info
[ppomorsk@ang4:/work/ppomorsk/SUMMER_SCHOOL/get_info] gcc -o test.x get_opencl_information.c -lOpenCL
[ppomorsk@ang4:/work/ppomorsk/SUMMER_SCHOOL/get_info] ./test.x
CL_PLATFORM_NAME = NVIDIA CUDA
CL_PLATFORM_VERSION = OpenCL 1.0 CUDA 3.0.1
2 devices found
Device #0 name = Tesla T10 Processor
	Driver version = 195.36.15
	Global Memory (MB): 4095
	Global Memory Cache (MB): 0
	Local Memory (KB): 16
	Max clock (MHz): 1440
	Max Work Group Size: 512
	Number of parallel compute cores: 30
Device #1 name = Tesla T10 Processor
	Driver version = 195.36.15
	Global Memory (MB): 4095
	Global Memory Cache (MB): 0
	Local Memory (KB): 16
	Max clock (MHz): 1440
	Max Work Group Size: 512
	Number of parallel compute cores: 30
[ppomorsk@ang4:/work/ppomorsk/SUMMER_SCHOOL/get_info]
Output on angel - 2 GPUs detected, no CPUs detected
Angel architecture

GPU server: Tesla S1070
http://archive.electronicdesign.com/files/29/19280/fig_01.gif
4 GPUs on server
Each GPU has 30 compute units, each consisting of 8 stream processing cores
240 cores in total on one GPU
Each node “sees” 2 GPUs
Output on Apple MacBook laptop - both GPU and CPU seen

ppomorsk-mac:tryopencl pawelpomorski$ gcc -o test.x get_opencl_information.c -w -m32 -lm -lstdc++ -framework OpenCL
ppomorsk-mac:tryopencl pawelpomorski$ ./test.x
CL_PLATFORM_NAME = Apple
CL_PLATFORM_VERSION = OpenCL 1.0 (Feb 10 2010 23:46:58)
2 devices found
Device #0 name = GeForce 9400M
	Driver version = CLH 1.0
	Global Memory (MB): 256
	Global Memory Cache (MB): 0
	Local Memory (KB): 16
	Max clock (MHz): 1100
	Max Work Group Size: 512
	Number of parallel compute cores: 2
Device #1 name = Intel(R) Core(TM)2 Duo CPU P7350 @ 2.00GHz
	Driver version = 1.0
	Global Memory (MB): 1536
	Global Memory Cache (MB): 3
	Local Memory (KB): 16
	Max clock (MHz): 2000
	Max Work Group Size: 1
	Number of parallel compute cores: 2
ppomorsk-mac:tryopencl pawelpomorski$
Getting a code to run on a GPU
Take existing serial program, separate out the parts that will continue to run on host, and the parts which will be sent to the GPU
GPU parts need to be rewritten in the form of kernel functions
Add code to host that manages GPU overhead: creates kernels, moves data between host and GPU etc.
OpenCL “Hello, World!” example
Not practical to do proper “Hello, World!” as OpenCL devices cannot access standard output directly
Our example program will pass an array of numbers from host to the GPU, square each number in the array on the GPU, then return modified array to host
This program is for learning purposes only, and will not run efficiently on a GPU
The program will demonstrate the basic structure which is common to all OpenCL programs
Will not show error checks on the slides for clarity, but having them is essential when writing OpenCL code
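As a minimal sketch of what such a check looks like (the exact wording in hello.c may differ):

err = clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_GPU, 1, &device_id, NULL);
if (err != CL_SUCCESS)
{
    printf("Error: Failed to create a device group! (error code %d)\n", err);
    return EXIT_FAILURE;
}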
Source code
On angel cluster:
/work/ppomorsk/SUMMER_SCHOOL/hello/no_errorchecks_hello.c
/work/ppomorsk/SUMMER_SCHOOL/hello/hello.c
Program flow - OpenCL function calls

clGetPlatformIDs, clGetDeviceIDs, clCreateContext, clCreateCommandQueue
  -> Organize resources, create command queue

clCreateProgramWithSource, clBuildProgram, clCreateKernel
  -> Compile kernel

clCreateBuffer, clEnqueueWriteBuffer
  -> Transfer data from host to GPU memory

clSetKernelArg, clGetKernelWorkGroupInfo, clEnqueueNDRangeKernel, clFinish
  -> Launch threads running kernels on GPU, perform main computation

clEnqueueReadBuffer
  -> Transfer data from GPU to host memory

clRelease...
  -> Free all allocated memory
// Simple compute kernel which computes the square of an input array
//
const char *KernelSource = "\n" \
"__kernel void square(                    \n" \
"   __global float* input,                \n" \
"   __global float* output,               \n" \
"   const unsigned int count)             \n" \
"{                                        \n" \
"   int i = get_global_id(0);             \n" \
"   if(i < count)                         \n" \
"       output[i] = input[i] * input[i];  \n" \
"}                                        \n" \
"\n";
Kernel code - string with OpenCL C code
int main(int argc, char** argv)
{
    int err;                    // error code returned from api calls
    float data[DATA_SIZE];      // original data set given to device
    float results[DATA_SIZE];   // results returned from device
    unsigned int correct;       // number of correct results returned

    size_t global;              // global domain size for our calculation
    size_t local;               // local domain size for our calculation

    cl_device_id device_id;     // compute device id
    cl_context context;         // compute context
    cl_command_queue commands;  // compute command queue
    cl_program program;         // compute program
    cl_kernel kernel;           // compute kernel
    cl_mem input;               // device memory used for the input array
    cl_mem output;              // device memory used for the output array
    cl_platform_id platform_id = NULL;

    // Fill our data set with random float values
    //
    int i = 0;
    unsigned int count = DATA_SIZE;
    for(i = 0; i < count; i++)
        data[i] = rand() / (float)RAND_MAX;
Define variables, set input data
    // determine OpenCL platform
    err = clGetPlatformIDs(1, &platform_id, NULL);

    // Connect to a compute device
    //
    int gpu = 1;
    err = clGetDeviceIDs(platform_id, gpu ? CL_DEVICE_TYPE_GPU : CL_DEVICE_TYPE_CPU, 1, &device_id, NULL);

    // Create a compute context
    //
    context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);

    // Create a command queue
    //
    commands = clCreateCommandQueue(context, device_id, 0, &err);
Organise resources, create command queue
    // Create the compute program from the source buffer
    //
    program = clCreateProgramWithSource(context, 1, (const char **) &KernelSource, NULL, &err);

    // Build the program executable
    //
    err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
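If clBuildProgram fails (e.g. a syntax error in the kernel string), the kernel compiler’s log can be retrieved with clGetProgramBuildInfo. A minimal sketch of such a check (the exact wording in hello.c may differ):

if (err != CL_SUCCESS)
{
    size_t len;
    char buffer[2048];
    // retrieve the kernel compiler's error messages
    clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG, sizeof(buffer), buffer, &len);
    printf("Error: Failed to build program executable!\n%s\n", buffer);
    exit(1);
}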
    // Create the compute kernel in the program we wish to run
    //
    kernel = clCreateKernel(program, "square", &err);
Compile kernel
    // Create the input and output arrays in device memory for our calculation
    //
    input = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(float) * count, NULL, NULL);
    output = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(float) * count, NULL, NULL);

    // Write our data set into the input array in device memory
    //
    err = clEnqueueWriteBuffer(commands, input, CL_TRUE, 0, sizeof(float) * count, data, 0, NULL, NULL);
Transfer data from host to GPU memory
    // Set the arguments to our compute kernel
    //
    err = 0;
    err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), &input);
    err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &output);
    err |= clSetKernelArg(kernel, 2, sizeof(unsigned int), &count);

    // Execute the kernel over the entire range of our 1d input data set
    // using one work item per work group (allows for arbitrary length of data array)
    //
    global = count;
    local = 1;
    err = clEnqueueNDRangeKernel(commands, kernel, 1, NULL, &global, &local, 0, NULL, NULL);

    // Wait for the commands to get serviced before reading back results
    //
    clFinish(commands);
Launch threads running kernels on GPU, perform main computation
    // Read back the results from the device to verify the output
    //
    err = clEnqueueReadBuffer(commands, output, CL_TRUE, 0, sizeof(float) * count, results, 0, NULL, NULL);

    // Validate our results
    //
    correct = 0;
    for(i = 0; i < count; i++)
    {
        if(results[i] == data[i] * data[i])
            correct++;
    }

    // Print a brief summary detailing the results
    //
    printf("Computed '%d/%d' correct values!\n", correct, count);
Transfer data from GPU to host memory
    // Shutdown and cleanup
    //
    clReleaseMemObject(input);
    clReleaseMemObject(output);
    clReleaseProgram(program);
    clReleaseKernel(kernel);
    clReleaseCommandQueue(commands);
    clReleaseContext(context);

    return 0;
}
Free all allocated memory
If you are trying to use an OpenCL profiler and it crashes at the end of the run, not freeing all memory could be the cause
Compiling kernel - Online approach

In our example we have compiled code at runtime (online compile)
Kernel code can be specified as a string in the main program file
Kernel code can also be stored in a file (.cl) and loaded at runtime
const char fileName[] = "./kernel.cl";
/* Load kernel source code */
fp = fopen(fileName, "r");
if (!fp) {
    fprintf(stderr, "Failed to load kernel.\n");
    exit(1);
}
source_str = (char *)malloc(MAX_SOURCE_SIZE);
source_size = fread(source_str, 1, MAX_SOURCE_SIZE, fp);
fclose(fp);
// ...

/* Create Kernel program from the read-in source */
program = clCreateProgramWithSource(context, 1, (const char **)&source_str,
                                    (const size_t *)&source_size, &ret);
Compiling kernel - Offline approach

It is possible to compile OpenCL source code (kernels) offline
This is the approach used in CUDA
Upside: no time spent compiling at runtime

Downside: executable not portable; a separate binary must be compiled for each device the code is running on
There is no freestanding compiler for OpenCL (like nvcc for CUDA)
Kernels have to be compiled at runtime and the resulting binary saved. That binary can then be loaded in future runs, avoiding the compiling step.
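A sketch of this save-and-reload approach, using clGetProgramInfo and clCreateProgramWithBinary (assuming a single device; the file name "kernel.bin" is arbitrary):

/* after clBuildProgram: retrieve the compiled binary and save it */
size_t bin_size;
unsigned char *binary;
clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(size_t), &bin_size, NULL);
binary = (unsigned char *)malloc(bin_size);
clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(unsigned char *), &binary, NULL);
FILE *fb = fopen("kernel.bin", "wb");
fwrite(binary, 1, bin_size, fb);
fclose(fb);

/* in a later run: load the binary instead of compiling from source */
/* (clBuildProgram must still be called on the loaded program)      */
program = clCreateProgramWithBinary(context, 1, &device_id, &bin_size,
                                    (const unsigned char **)&binary, NULL, &err);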
Task queue

Queue is used to launch kernels, in precisely controlled order if required.
Event objects contain information about whether various operations have finished.
For example, clEnqueueTask launches a single kernel, after checking the provided list of event objects to see whether they have completed, and returns its own event object.
// create queue enabled for out of order (parallel) execution
commands = clCreateCommandQueue(context, device_id, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
...
// no synchronization
clEnqueueTask(commands, kernel_E, 0, NULL, NULL);
// synchronize so that kernel E starts only after kernels A,B,C,D finish
cl_event events[4]; // define event object array
clEnqueueTask(commands, kernel_A, 0, NULL, &events[0]);
clEnqueueTask(commands, kernel_B, 0, NULL, &events[1]);
clEnqueueTask(commands, kernel_C, 0, NULL, &events[2]);
clEnqueueTask(commands, kernel_D, 0, NULL, &events[3]);
clEnqueueTask(commands, kernel_E, 4, events, NULL);
Queueing Data Parallel tasks

clEnqueueNDRangeKernel - used to launch data parallel tasks, i.e. copies of a kernel which are identical except for operating on different data
For example, to launch #(global) work-items (copies of the kernel, i.e. threads) grouped into work-groups of size #(local), one would use the call shown below:
#(global) must be divisible by #(local), maximum size of #(local) dependent on device
Each work item can retrieve:
get_global_id(0) - its index among all threads
get_local_id(0) - its index among threads in its work-group
get_group_id(0) - index of its work group
This information can be used to determine which data to operate on
err = clEnqueueNDRangeKernel(commands, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
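As an illustration of how these indices relate (a hypothetical kernel, not from the course code):

__kernel void show_ids(__global int* out)
{
    // global index = work-group index * work-group size + local index
    int g = get_group_id(0) * get_local_size(0) + get_local_id(0);
    out[get_global_id(0)] = g; // g always equals get_global_id(0)
}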
Organising work-items in more than 1D

Convenient and efficient for many tasks, 2D and 3D possible
Maximum possible values in global and local arrays are device dependent
For 2D, can retrieve global index coordinates (x,y) = (get_global_id(0), get_global_id(1))

For 3D, can similarly retrieve global (x,y,z) = (get_global_id(0), get_global_id(1), get_global_id(2))
// 2D example
// launch 256 work-items, organised in a 16x16 grid,
// grouped in work-groups of 16, organised as 4x4
global[0] = 16; global[1] = 16;
local[0] = 4; local[1] = 4;

// global and local are arrays of size_t[2] here, so they are passed directly
err = clEnqueueNDRangeKernel(commands, kernel, 2, NULL, global, local, 0, NULL, NULL);
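A sketch of a kernel indexed in 2D (hypothetical example; width is the row length of the arrays):

__kernel void copy2d(__global float* in, __global float* out, int width)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    // flatten the 2D coordinates into a linear array index
    out[y * width + x] = in[y * width + x];
}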
Measuring execution time
OpenCL is an abstraction layer that allows the same code to be executed on different platforms
The code is guaranteed to run but its speed of execution will be dependent on the device and type of parallelism used
In order to get maximum performance, device and parallelism dependent tuning must be performed.
To do this, execution time must be measured. OpenCL provides convenient ways to do this
Event objects can also contain information about how long it took for the task associated with the event to execute
clGetEventProfilingInfo can obtain the start and end time of an event. Taking the difference gives the event duration in nanoseconds.
commands = clCreateCommandQueue(context, device_id, CL_QUEUE_PROFILING_ENABLE, &err);
cl_event event;
err = clEnqueueNDRangeKernel(commands, kernel, 1, NULL, &global, &local, 0, NULL, &event);

clWaitForEvents(1, &event);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL);
printf("Time for event (ms): %10.5f \n", (end - start) / 1000000.0);
// ...
clReleaseEvent(event); // important to free memory
Timing “Hello World!”
Increase DATA_SIZE to 1048576
Check how computation time varies with size of work-groups
Using largest possible work group is usually the best choice
Time for write/read to buffer is 2.51/5.14 ms.
Work-group size (local) | Time (ms)
1             | 6.191
8             | 0.803
16            | 0.414
32            | 0.209
64            | 0.127
512 (maximum) | 0.130
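To pick the largest usable work-group size at runtime rather than hard-coding it, the maximum for a given kernel and device can be queried; this is what the clGetKernelWorkGroupInfo call in the program flow is for:

// query the maximum work-group size usable for this kernel on this device
err = clGetKernelWorkGroupInfo(kernel, device_id, CL_KERNEL_WORK_GROUP_SIZE,
                               sizeof(local), &local, NULL);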
Timing “Hello World!” with openclprof

/opt/sharcnet/cuda/3.0/cuda/openclprof/bin/openclprof
Must minimize memory transfers between host and GPU as these are quite slow.
Even if a kernel shows no performance improvement when run on the GPU, it may still be worth running it on the GPU if doing so eliminates a device-to-host memory transfer
Intermediate data structures should be created in GPU memory, operated on by GPU, and destroyed without ever being mapped or copied to host memory
Because of overhead, batching small transfers into one works better than making each transfer separately.
Higher bandwidth can be achieved when using page-locked (or pinned) memory.
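A sketch of one common way to obtain pinned host memory in OpenCL (assuming a buffer of sizeof(float)*count bytes; the exact strategy varies by vendor):

// allocate a buffer backed by pinned (page-locked) host memory
cl_mem pinned = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                               sizeof(float) * count, NULL, &err);
// map it to obtain a host pointer that can be filled with data
float *host_ptr = (float *)clEnqueueMapBuffer(commands, pinned, CL_TRUE, CL_MAP_WRITE,
                                              0, sizeof(float) * count, 0, NULL, NULL, &err);
// ... fill host_ptr, then use it as the source of a fast clEnqueueWriteBuffer ...
clEnqueueUnmapMemObject(commands, pinned, host_ptr, 0, NULL, NULL);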
Running openclprof on SHARCNET
1. ssh -X [email protected]
2. ssh -X ang4
3. /opt/sharcnet/cuda/3.0/cuda/openclprof/bin/openclprof
Once you get the GUI:
4. File->New
5. Name your project, select location somewhere in your work space
6. Select OpenCL binary executable compiled previously (no special compile options needed)
7. Select your working directory
8. Click “Start” to perform profiling run
Examine output
Try various options in session and view
Exercise: Port a serial code to use OpenCL

cp -rp /work/ppomorsk/SUMMER_SCHOOL/ .

./exercise/serial - serial code you are to convert to OpenCL
./exercise/opencl_template - contains a generic “template” which you can modify, using the serial code, to develop the OpenCL version of the code
./hello - consult if needed
./exercise/opencl_solution - the solution (peek if you must)
Compile with: gcc -o test.x main.c -lOpenCL
To run: ./test.x
Template is for the case of launching one kernel as a single thread
This port will be fairly generic, and could easily run on both GPU and CPU
void moving_average(int *values, float *average, int length, int width)
{
    int i;
    int add_value;

    /* Compute sum for the first "width" elements */
    add_value = 0;
    for (i = 0; i < width; i++) {
        add_value += values[i];
    }
    average[width-1] = (float)add_value;

    /* Compute sum for the (width) to (length-1) elements */
    for (i = width; i < length; i++) {
        add_value = add_value - values[i-width] + values[i];
        average[i] = (float)(add_value);
    }

    /* Insert zeros to 0th to (width-2)th element */
    for (i = 0; i < width-1; i++) {
        average[i] = 0.0f;
    }

    /* Compute average from the sum */
    for (i = width-1; i < length; i++) {
        average[i] /= (float)width;
    }
}
Computes Simple Moving Average (SMA)
Hint: Converting this to an OpenCL kernel that runs as single thread requires very little modification
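A sketch of what that single-thread kernel might look like (the course solution file may differ in details):

__kernel void moving_average(__global int *values, __global float *average,
                             int length, int width)
{
    /* same algorithm as the serial version; the pointers now live in __global memory */
    int i;
    int add_value = 0;

    for (i = 0; i < width; i++)
        add_value += values[i];
    average[width-1] = (float)add_value;

    for (i = width; i < length; i++) {
        add_value = add_value - values[i-width] + values[i];
        average[i] = (float)add_value;
    }

    for (i = 0; i < width-1; i++)
        average[i] = 0.0f;

    for (i = width-1; i < length; i++)
        average[i] /= (float)width;
}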
#include <stdio.h>
#include <stdlib.h>

/* Read Stock data */
int stock_array1[] = {
    #include "stock_array1.txt"
};

/* Define width for the moving average */
#define WINDOW_SIZE (13)
int main(int argc, char *argv[])
{
    float *result;

    int data_num = sizeof(stock_array1) / sizeof(stock_array1[0]);
    int window_num = (int)WINDOW_SIZE;
    int i;

    /* Allocate space for the result */
    result = (float *)malloc(data_num * sizeof(float));

    /* Call the moving average function */
    moving_average(stock_array1, result, data_num, window_num);

    /* Print result */
    for (i = 0; i < data_num; i++) {
        printf("result[%d] = %f\n", i, result[i]);
    }

    /* Deallocate memory */
    free(result);

    return 0;
}
Serial host code
OpenCL C language
OpenCL C
OpenCL C is basically standard C (C99) with some extensions and restrictions. This language is used to program the kernel.
There is a list of restrictions, for example:
- recursion is not supported
- the pointer passed as an argument to a kernel function must be of type __global, __constant or __local
- support for double precision floating-point is currently an optional extension, and it may not be implemented
Some OpenCL functions are built in without requiring a library, including math functions, geometric functions, and work-item functions
OpenCL includes some new features not available in C
Vector Data
In OpenCL one can define and operate on vector data-types, each consisting of several components of the same data-type (CUDA has a much more limited capacity to do this)
float4 consists of (float,float,float,float) declared via float4 f4;
Vector sizes are 2,4,8,16, can consist of char,int,float, double etc.
// vector literals used to assign values to variables
float4 g4 = (float4)(1.0f, 2.0f, 3.0f, 4.0f);

// built in math functions can process vector types
float4 f4 = cos(g4); // f4 = (cos(g4.x), cos(g4.y), cos(g4.z), cos(g4.w))
Accessing vector data
// .xyzw will access components 0 to 3, same as in CUDA

int4 x = (int4)(1,2,3,4);
int4 a = x.wzyx; // a = (4,3,2,1)
int2 b = x.xx;   // b = (1,1)

// can access through number index using .s followed by the number
// (in hex, which uses A-F for indices above 9)

int8 c = x.s01233210; // c = (1,2,3,4,4,3,2,1)

// extract odd and even indices (.odd, .even)

int2 d = x.odd; // d = (2,4)

// extract upper half and lower half of vector (.hi, .lo)

int2 e = x.hi; // e = (3,4)
Other vector operations
Addition, subtraction
A comparison operation returns a vector containing TRUE or FALSE values

The type of the returned vector is determined by the operand types: char for char operands, and an integer of the same length as the compared scalar values (short/int/long)
The actual values for FALSE/TRUE are 0/-1 and not 0/1, as that is more convenient for SIMD units
int2 tf = (float2)(1.0f,1.0f) == (float2)(0.0f,1.0f);
// tf = (int2)(0,-1)
Effective use of SIMD
Single Instruction obviously implies that it is essential for each thread to run exactly the same code as much as possible
Thus, if I have an if statement that will send some threads down one branch of the code and some other threads down another branch, that is not good.
Example of selection without using if

int a = in0[i], b = in1[i]; /* want to put the smaller one in out[i] */
/* usual way with if */
if (a < b) out[i] = a;
else out[i] = b;
// or possibly
// out[i] = a<b ? a : b;

-- better way --

int a = in0[i], b = in1[i];
int cmp = -(a < b); // if TRUE, cmp has all bits 1; if FALSE, all bits 0
                    // (a scalar < returns 0/1, hence the negation;
                    //  vector comparisons return 0/-1 directly)
// & bitwise AND
// | bitwise OR
// ~ flips all bits
out[i] = (a & cmp) | (b & ~cmp); // a when TRUE and b when FALSE

// OpenCL has a shortcut for this
out[i] = bitselect(b, a, cmp); // bits from a where cmp is 1, from b where 0

Works because: a & (all 1s) = a, a | (all 0s) = a
Good illustration of why -1 for TRUE works better than +1
Vector Benefits
Using vectors will only have benefits on hardware with SIMD features
If those features are not there, vectors will be handled sequentially with no speedup
The selection of vector size is important. For example, double16 may be too big to fit into some registers, so speedup will not occur
Ideally all this should be handled automatically by the compiler, but we are not yet there
OpenCL Pragmas
#pragma OPENCL FP_CONTRACT on-off-switch
Controls how floating point is done (whether FMA is used). In contracted operations, multiple operations are performed as one.
#pragma OPENCL EXTENSION <extension_name> : <behavior>
A number of extensions are available
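For example, enabling the (optional) double precision extension mentioned earlier — a sketch, assuming the device actually supports cl_khr_fp64:

#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel void add_doubles(__global double* a, __global double* b, __global double* c)
{
    int i = get_global_id(0);
    c[i] = a[i] + b[i];
}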
OpenCL Memory Types
__global - memory which can be read from all work items (threads), though with relatively slow access unless the program is carefully optimised. It is the device’s main memory.
__local - can be read from work items within a work group. Physically it is the shared memory on each compute unit.
__private - can only be used within each work item. Physically it is the registers used by each processing element.
__constant - memory that can be used to store constant data for read-only access by all compute units. The host is responsible for initializing and allocating this memory
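A sketch showing all four qualifiers in one (hypothetical) kernel:

__kernel void qualifiers_demo(__global float* g,   // device main memory
                              __constant float* c, // read-only, set up by host
                              __local float* l)    // shared within a work-group
{
    __private float p = g[get_global_id(0)]; // registers (default for kernel locals)
    l[get_local_id(0)] = p + c[0];
    barrier(CLK_LOCAL_MEM_FENCE); // wait until all work-items have written l
    g[get_global_id(0)] = l[get_local_id(0)];
}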
OpenCL vs. CUDA - Vector addition example
CUDA:

__global__ void
vectorAdd(const float * a, const float * b, float * c)
{
    // Vector element index
    int nIndex = blockIdx.x * blockDim.x + threadIdx.x;
    c[nIndex] = a[nIndex] + b[nIndex];
}

OpenCL:

__kernel void
vectorAdd(__global const float * a,
          __global const float * b,
          __global float * c)
{
    // Vector element index
    int nIndex = get_global_id(0);
    c[nIndex] = a[nIndex] + b[nIndex];
}
Optimizing OpenCL
- Efficient access to global memory
- Use of local memory to increase bandwidth: Matrix Multiplication example
Code, data and figures from:
NVIDIA OpenCL Best Practices Guide
http://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/NVIDIA_OpenCL_BestPracticesGuide.pdf
GPU memory vs. CPU memory
CPUs have a cache which for many problems can provide fast memory access, with transfers between cache and main memory handled in hardware.
The cache paradigm scales poorly as the number of cores is increased, as increased hardware overhead is needed to maintain coherency between caches attached to different cores.
In a GPU, with its hundreds of cores, caches are not practical and thus are not present.
Fast but small local (shared) memory is somewhat analogous to L1 cache on a CPU. However, there is no hardware to automatically move data between global and local memory as needed. This must be taken care of in software, i.e. by the programmer.

In general on a CPU the requirements for memory access are not that stringent. If you are accessing memory sequentially, you are almost certainly OK.

On a GPU, efficient memory access requires more work by the programmer.
Coalesced Access to Global Memory
Global memory loads and stores by threads of half warp (16 threads) can be combined into as few as one transaction.
You can think of global memory as rows of 16 floats linked together into a 64 byte group, with 2 of these forming a 128 byte group
Good access pattern results in a single 64-byte transaction

In this example, if accesses were permuted, still one transaction would be performed on newer cards (CC 1.2), but 16 serialized transactions on older cards (CC 1.1)
[Excerpt from the NVIDIA OpenCL Best Practices Guide, sections 3.2.1.1 “A Simple Access Pattern” and 3.2.1.2 “A Sequential but Misaligned Access Pattern” (body text garbled in extraction; only the figure caption and slide labels are recoverable).
Figure 3.4: Coalesced access in which all threads but one access the corresponding word in a segment
Slide labels: “Half warp of threads”, “Global memory”]
[Further excerpt from the NVIDIA OpenCL Best Practices Guide, section 3.2.1.3 “Effects of Misaligned Accesses” (body text garbled in extraction).
Figure 3.5: Unaligned sequential addresses that fit within a single 128-byte segment
Figure 3.6: Misaligned sequential addresses that fall within two 128-byte segments
Listing 3.5: A copy kernel that illustrates misaligned accesses (the offsetCopy kernel, reproduced on the next slide)]
Somewhat misaligned pattern: CC 1.1 devices will perform the transactions in series; CC 1.2 can still do it in one
Memory to be accessed now belongs to two different 128-byte groups, so CC 1.2 devices need 2 transactions

Choosing work-group sizes to be multiples of 16 facilitates efficient access patterns
Cost of misaligned access
__kernel void offsetCopy(__global float *odata,
                         __global float *idata,
                         int offset)
{
    int xid = get_global_id(0) + offset;
    odata[xid] = idata[xid];
}
[Excerpt from the NVIDIA OpenCL Best Practices Guide (body text garbled in extraction).
Figure 3.7: Performance of offsetCopy kernel - effective bandwidth for various offsets on an NVIDIA GeForce GTX 280 (compute capability 1.3) and an NVIDIA GeForce GTX 8800 (compute capability 1.0)]
Even on newer cards consecutive but misaligned access can decrease bandwidth by a factor of 2
Strided accesses

__kernel void strideCopy(__global float* odata,
                         __global float* idata,
                         int stride)
{
    int xid = get_global_id(0) * stride;
    odata[xid] = idata[xid];
}
[Excerpt from the NVIDIA OpenCL Best Practices Guide, section 3.2.1.4 “Strided Accesses” (body text garbled in extraction).
Listing 3.6: A kernel to illustrate non-unit stride data copy (the strideCopy kernel above)
Figure 3.8: A half warp accessing memory with a stride of 2]
Stride=2 case: CC 1.3 devices can do this in a single 128-byte transaction, but bandwidth is wasted as half of the elements are not used
[Excerpt from the NVIDIA OpenCL Best Practices Guide, sections 3.2.2 “Shared Memory” and 3.2.2.1 “Shared Memory and Memory Banks” (body text garbled in extraction). Key recoverable point: because it is on-chip, shared memory (i.e. OpenCL __local memory) is much faster than local and global memory; it is divided into equally sized banks that can be accessed simultaneously, and accesses that map to the same bank are serialized.
Figure 3.9: Performance of strideCopy kernel]
Effective bandwidth deteriorates as stride is increased to 16 and beyond
Note that performance deteriorates significantly as stride increases to just 16
On the CPU, the cache with its fast memory is relatively large and thus can provide reasonable performance even for larger strides
GPUs have no cache, but they have __local memory (in OpenCL, same as shared memory in CUDA), which can be used to improve stride access efficiency
Optimizing matrix multiplication with use of local memory
Consider the special case of AB = C where:
A is (M x 16)
B is (16 x N)
C is (M x N)
Use work-group of 16x16
[Excerpt from the NVIDIA OpenCL Best Practices Guide (body text garbled in extraction).
Figure 3.10: A block-column matrix (A) multiplied by a block-row matrix (B) and the resulting product matrix (C)
Listing 3.7: Unoptimized matrix multiplication (the simpleMultiply kernel, reproduced on the next slide)]
__kernel void simpleMultiply(__global float* a,
                             __global float* b,
                             __global float* c,
                             int N)
{
    int row = get_global_id(1);
    int col = get_global_id(0);
    float sum = 0.0f;
    for (int i = 0; i < TILE_DIM; i++) {
        sum += a[row*TILE_DIM+i] * b[i*N+col];
    }
    c[row*N+col] = sum;
}
Unoptimized
With this code, the number of times global memory must be accessed to provide the needed entries of a and b is N * M * (16 + 16)
[Excerpt from the NVIDIA OpenCL Best Practices Guide (body text garbled in extraction).
Figure 3.11: Computing a row (half warp) of a tile in C using one row of A and an entire tile of B
The guide's coalescedMultiply kernel, which reads a tile of A into shared memory, is reproduced on the next slide.]
__kernel void coalescedMultiply(__global float* a,
                                __global float* b,
                                __global float* c,
                                int N,
                                __local float aTile[TILE_DIM][TILE_DIM])
{
    int row = get_global_id(1);
    int col = get_global_id(0);
    float sum = 0.0f;
    int x = get_local_id(0);
    int y = get_local_id(1);
    aTile[y][x] = a[row*TILE_DIM+x];
    for (int i = 0; i < TILE_DIM; i++) {
        sum += aTile[y][i] * b[i*N+col];
    }
    c[row*N+col] = sum;
}
Partially optimized
Store A in local memory
Global memory is accessed N*M*(1+16) times
__kernel void sharedABMultiply(__global float* a,
                               __global float* b,
                               __global float* c,
                               int N,
                               __local float aTile[TILE_DIM][TILE_DIM],
                               __local float bTile[TILE_DIM][TILE_DIM])
{
    int row = get_global_id(1);
    int col = get_global_id(0);
    float sum = 0.0f;
    int x = get_local_id(0);
    int y = get_local_id(1);
    aTile[y][x] = a[row*TILE_DIM+x];
    bTile[y][x] = b[y*N+col];
    barrier(CLK_LOCAL_MEM_FENCE); // wait until both tiles are fully loaded
    for (int i = 0; i < TILE_DIM; i++) {
        sum += aTile[y][i] * bTile[i][x];
    }
    c[row*N+col] = sum;
}
Fully optimized
Store both A and B in local memory
Global memory is accessed N*M*(1+1) times
Performance
Optimization | NVIDIA GeForce GTX 280 (released June 2008) | NVIDIA GeForce GTX 8800 (released November 2006)
None | 8.8 GBps | 0.7 GBps
Coalesced using shared memory to store a tile of A | 14.3 GBps | 8.2 GBps
Additionally using shared memory to store a tile of B | 29.7 GBps | 15.7 GBps
The trend is for newer cards to cope better with poor optimization
Conclusion
Explore OpenCL to find out more of its features, as this introduction was only an overview with many aspects left out
Experiment with OpenCL on angel (full install very soon)
Contact SHARCNET HPC analysts for advice about porting your code to GPU
Keep up with new developments in the rapidly developing field of GPU computing