
Page 1

OpenCL for programming GPUs

SHARCNET Summer School 2010

Pawel Pomorski

Page 2

Overview

What is OpenCL?

Basic OpenCL
- Detecting your environment
- Basic building blocks - “Hello World!” program
- Hands on exercise - convert a “serial” CPU code to OpenCL GPU code
- OpenCL C

“Advanced” OpenCL
- Memory optimization

Page 3

GPU computing timeline

before 2003 - Calculations on GPU, using graphics API
2003 - Brook “C with streams”
2005 - Steady increase in CPU clock speed comes to a halt; switch to multicore chips to compensate. At the same time, computational power of GPUs increases
November 2006 - CUDA released by NVIDIA
November 2006 - CTM (Close to Metal) from ATI
December 2007 - Succeeded by AMD Stream SDK
December 2008 - Technical specification for OpenCL 1.0 released
April 2009 - First OpenCL 1.0 GPU drivers released by NVIDIA
August 2009 - Mac OS X 10.6 Snow Leopard released, with OpenCL 1.0 included
September 2009 - Public release of OpenCL by NVIDIA
December 2009 - AMD release of ATI Stream SDK 2.0 with OpenCL support
March 2010 - CUDA 3.0 released, incorporating OpenCL
Near future - Intel may release a GPGPU (Larrabee)
Future - CPUs will have so many cores they will start to be treated as GPUs?

Page 4

Why are GPUs fast?

Different paradigm, data parallelism (vs. task parallelism on CPU)

Stream processing

Hardware capable of launching MANY threads

However, GPUs also have significant limitations and not all code can run fast on them

If a significant portion of your code cannot be accelerated, your speedup will be limited by Amdahl’s Law
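As a rough illustration (standard Amdahl's Law arithmetic, not a measurement from these slides): if a fraction p = 0.9 of the run time can be moved to the GPU and that part is accelerated 20x, the overall speedup is 1 / ((1 - p) + p/20) = 1 / (0.1 + 0.045) ≈ 6.9x, and no matter how fast the GPU portion becomes the speedup can never exceed 1 / (1 - p) = 10x.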

Page 5

OpenCL (Open Computing Language) is an open standard for parallel programming of heterogeneous systems, managed by the Khronos Group.

Aim is for it to form the foundation layer of a parallel computing ecosystem of platform-independent tools

OpenCL includes a language for writing kernels, plus APIs used to define and control platforms

Targeted at GPUs, but also multicore CPUs and other hardware (Cell etc.), though to run efficiently each family of devices needs different code

Page 6

OpenCL is largely derived from CUDA (no need to reinvent the wheel)

Same basic model: a kernel is sent to the accelerator (“compute device”), which is composed of “compute units” whose “processing elements” execute “work-items”.

Some names changed between CUDA and OpenCL

Thread -> Work-item
Block -> Work-group

Both CUDA and OpenCL allow for detailed low level optimization of the GPU

Page 7

OpenCL device independent - really?

Actually, there is no way to write code that will run equally efficiently on both GPU and multicore CPU architectures

However, OpenCL can detect the features of the architecture and run code appropriate for each

OpenCL code should run efficiently on many GPUs with a bit of fine tuning

On the other hand, an OpenCL code highly optimized for NVIDIA hardware will not run that efficiently on AMD hardware

Since High Performance Computing code needs to be highly optimized, in practice OpenCL may not deliver on its promise of device independence

Page 8

CUDA vs. OpenCL

NVIDIA is fully supporting OpenCL even though it does not run exclusively on their hardware

The strategy is to increase the number of GPU users by making software more portable and accessible, which OpenCL is meant to do. If there are more users, NVIDIA will sell more hardware.

As CUDA is fully controlled by NVIDIA, it is possible that in the future it will contain more “bleeding edge” features than OpenCL, which is overseen by a consortium hence slower to incorporate new features

If you want your GPU code to run on both NVIDIA and AMD/ATI devices (two main players at present), OpenCL is the only way to accomplish that

Page 9

Available OpenCL environments

NVIDIA OpenCL
- distributed with CUDA since version 3.0
- no CPU support at the moment
- for NVIDIA GPU cards only

Apple OpenCL
- included as standard feature in Mac OS X 10.6 Snow Leopard (requires XCode)
- supports both graphics cards and CPUs (Apple hardware only)

AMD (ATI) OpenCL
- supports only AMD GPU cards
- no CPU support at the moment

IBM OpenCL, supports IBM Blade Center hardware

Foxc (Fixstars OpenCL Cross Compiler), in Beta as of June, 2010, CPU only

Page 10

OpenCL References

The OpenCL Programming Book - http://www.fixstars.com/en/company/books/opencl/

Khronos consortium: http://www.khronos.org/

CUDA OpenCL: http://www.nvidia.com/object/cuda_opencl_new.html

NVIDIA GPU Computing SDK code samples

On “angel”: /opt/sharcnet/cuda/3.0/gpucomputingsdk/OpenCL

Apple OpenCL : http://developer.apple.com

AMD (ATI) OpenCL : http://ati.amd.com/technology/streamcomputing/opencl.html

Page 11

OpenCL on SHARCNET

Cluster “angel” is SHARCNET’s main GPU computing cluster.

Hardware: 22 nodes, each with 8 CPU cores; each node “sees” 2 Tesla T10 GPUs. The CUDA 3.0 drivers needed for OpenCL are currently available on node ang4 via a special arrangement for Summer School

Students can test OpenCL programs by connecting to ang4 via “ssh ang4” (from angel login node)

Run interactively, as ang4 is not accessible through queues right now

Compile with: gcc -o test.x your_code.c -lOpenCL

Page 12

Know your hardware

OpenCL programs should aim to be hardware agnostic

The program should find the relevant system information at runtime

OpenCL provides methods for this

At minimum, the programmer must determine what OpenCL devices are available and choose which are to be used by the program

Page 13

#include <stdio.h>
#include <CL/cl.h>

int main(int argc, char** argv)
{
    char dname[500];
    cl_device_id devices[10];
    cl_uint num_devices, entries;
    cl_ulong long_entries;
    int d;
    cl_int err;
    cl_platform_id platform_id = NULL;
    size_t p_size;

    /* obtain list of platforms available */
    err = clGetPlatformIDs(1, &platform_id, NULL);
    if (err != CL_SUCCESS)
    {
        printf("Error: Failure in clGetPlatformIDs,error code=%d \n", err);
        return 0;
    }

    /* obtain information about platform */
    clGetPlatformInfo(platform_id, CL_PLATFORM_NAME, 500, dname, NULL);
    printf("CL_PLATFORM_NAME = %s\n", dname);

    clGetPlatformInfo(platform_id, CL_PLATFORM_VERSION, 500, dname, NULL);
    printf("CL_PLATFORM_VERSION = %s\n", dname);

Program to get information - What is the platform?

On angel cluster: /work/ppomorsk/SUMMER_SCHOOL/get_info/get_opencl_information.c

Page 14

    /* obtain list of devices available on platform */
    clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_ALL, 10, devices, &num_devices);
    printf("%d devices found\n", num_devices);

    /* query devices for information */
    for (d = 0; d < num_devices; ++d)
    {
        clGetDeviceInfo(devices[d], CL_DEVICE_NAME, 500, dname, NULL);
        printf("Device #%d name = %s\n", d, dname);
        clGetDeviceInfo(devices[d], CL_DRIVER_VERSION, 500, dname, NULL);
        printf("\tDriver version = %s\n", dname);
        clGetDeviceInfo(devices[d], CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(cl_ulong), &long_entries, NULL);
        printf("\tGlobal Memory (MB):\t%llu\n", long_entries/1024/1024);
        clGetDeviceInfo(devices[d], CL_DEVICE_GLOBAL_MEM_CACHE_SIZE, sizeof(cl_ulong), &long_entries, NULL);
        printf("\tGlobal Memory Cache (MB):\t%llu\n", long_entries/1024/1024);
        clGetDeviceInfo(devices[d], CL_DEVICE_LOCAL_MEM_SIZE, sizeof(cl_ulong), &long_entries, NULL);
        printf("\tLocal Memory (KB):\t%llu\n", long_entries/1024);
        clGetDeviceInfo(devices[d], CL_DEVICE_MAX_CLOCK_FREQUENCY, sizeof(cl_ulong), &long_entries, NULL);
        printf("\tMax clock (MHz) :\t%llu\n", long_entries);
        clGetDeviceInfo(devices[d], CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(size_t), &p_size, NULL);
        printf("\tMax Work Group Size:\t%d\n", p_size);
        clGetDeviceInfo(devices[d], CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(cl_uint), &entries, NULL);
        printf("\tNumber of parallel compute cores:\t%d\n", entries);
    }
    return 0;
}

Program to get information cont. - What are the devices?

Page 15

[ppomorsk@ang4:~] cd /work/ppomorsk/SUMMER_SCHOOL/get_info
[ppomorsk@ang4:/work/ppomorsk/SUMMER_SCHOOL/get_info] gcc -o test.x get_opencl_information.c -lOpenCL
[ppomorsk@ang4:/work/ppomorsk/SUMMER_SCHOOL/get_info] ./test.x
CL_PLATFORM_NAME = NVIDIA CUDA
CL_PLATFORM_VERSION = OpenCL 1.0 CUDA 3.0.1
2 devices found
Device #0 name = Tesla T10 Processor
 Driver version = 195.36.15
 Global Memory (MB): 4095
 Global Memory Cache (MB): 0
 Local Memory (KB): 16
 Max clock (MHz) : 1440
 Max Work Group Size: 512
 Number of parallel compute cores: 30
Device #1 name = Tesla T10 Processor
 Driver version = 195.36.15
 Global Memory (MB): 4095
 Global Memory Cache (MB): 0
 Local Memory (KB): 16
 Max clock (MHz) : 1440
 Max Work Group Size: 512
 Number of parallel compute cores: 30
[ppomorsk@ang4:/work/ppomorsk/SUMMER_SCHOOL/get_info]

Output on angel - 2 GPUs detected, no CPUs detected

Page 16

Angel architecture

GPU server: Tesla S1070
http://archive.electronicdesign.com/files/29/19280/fig_01.gif

4 GPUs on server

Each GPU has 30 compute units, each consisting of 8 stream processing cores

240 cores in total on one GPU

Each node “sees” 2 GPUs

Page 17

Output on Apple MacBook laptop - both GPU and CPU seen

ppomorsk-mac:tryopencl pawelpomorski$ gcc -o test.x get_opencl_information.c -w -m32 -lm -lstdc++ -framework OpenCL
ppomorsk-mac:tryopencl pawelpomorski$ ./test.x
CL_PLATFORM_NAME = Apple
CL_PLATFORM_VERSION = OpenCL 1.0 (Feb 10 2010 23:46:58)
2 devices found
Device #0 name = GeForce 9400M
 Driver version = CLH 1.0
 Global Memory (MB): 256
 Global Memory Cache (MB): 0
 Local Memory (KB): 16
 Max clock (MHz) : 1100
 Max Work Group Size: 512
 Number of parallel compute cores: 2
Device #1 name = Intel(R) Core(TM)2 Duo CPU P7350 @ 2.00GHz
 Driver version = 1.0
 Global Memory (MB): 1536
 Global Memory Cache (MB): 3
 Local Memory (KB): 16
 Max clock (MHz) : 2000
 Max Work Group Size: 1
 Number of parallel compute cores: 2
ppomorsk-mac:tryopencl pawelpomorski$

Page 18

Getting a code to run on a GPU

Take existing serial program, separate out the parts that will continue to run on host, and the parts which will be sent to the GPU

GPU parts need to be rewritten in the form of kernel functions

Add code to host that manages GPU overhead: creates kernels, moves data between host and GPU etc.

Page 19

OpenCL “Hello, World!” example

Not practical to do proper “Hello, World!” as OpenCL devices cannot access standard output directly

Our example program will pass an array of numbers from host to the GPU, square each number in the array on the GPU, then return modified array to host

This program is for learning purposes only, and will not run efficiently on a GPU

The program will demonstrate the basic structure which is common to all OpenCL programs

Will not show error checks on the slides for clarity, but having them is essential when writing OpenCL code

Source code on angel cluster:
/work/ppomorsk/SUMMER_SCHOOL/hello/no_errorchecks_hello.c
/work/ppomorsk/SUMMER_SCHOOL/hello/hello.c

Page 20

Program flow - OpenCL function calls

Organize resources, create command queue:
clGetPlatformIDs
clGetDeviceIDs
clCreateContext
clCreateCommandQueue

Compile kernel:
clCreateProgramWithSource
clBuildProgram
clCreateKernel

Transfer data from host to GPU memory:
clCreateBuffer
clEnqueueWriteBuffer

Launch threads running kernels on GPU, perform main computation:
clSetKernelArg
clGetKernelWorkGroupInfo
clEnqueueNDRangeKernel
clFinish

Transfer data from GPU to host memory:
clEnqueueReadBuffer

Free all allocated memory:
clRelease...

Page 21

// Simple compute kernel which computes the square of an input array
//
const char *KernelSource = "\n" \
"__kernel void square(                    \n" \
"   __global float* input,                \n" \
"   __global float* output,               \n" \
"   const unsigned int count)             \n" \
"{                                        \n" \
"   int i = get_global_id(0);             \n" \
"   if(i < count)                         \n" \
"       output[i] = input[i] * input[i];  \n" \
"}                                        \n" \
"\n";

Kernel code - string with OpenCL C code

Page 22

int main(int argc, char** argv)
{
    int err;                           // error code returned from api calls
    float data[DATA_SIZE];             // original data set given to device
    float results[DATA_SIZE];          // results returned from device
    unsigned int correct;              // number of correct results returned

    size_t global;                     // global domain size for our calculation
    size_t local;                      // local domain size for our calculation

    cl_device_id device_id;            // compute device id
    cl_context context;                // compute context
    cl_command_queue commands;         // compute command queue
    cl_program program;                // compute program
    cl_kernel kernel;                  // compute kernel

    cl_mem input;                      // device memory used for the input array
    cl_mem output;                     // device memory used for the output array
    cl_platform_id platform_id = NULL;

    // Fill our data set with random float values
    //
    int i = 0;
    unsigned int count = DATA_SIZE;
    for(i = 0; i < count; i++)
        data[i] = rand() / (float)RAND_MAX;

Define variables, set input data

Page 23

    // determine OpenCL platform
    err = clGetPlatformIDs(1, &platform_id, NULL);

    // Connect to a compute device
    //
    int gpu = 1;
    err = clGetDeviceIDs(platform_id, gpu ? CL_DEVICE_TYPE_GPU : CL_DEVICE_TYPE_CPU, 1, &device_id, NULL);

    // Create a compute context
    //
    context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);

    // Create a command queue
    //
    commands = clCreateCommandQueue(context, device_id, 0, &err);

Organise resources, create command queue

Page 24

    // Create the compute program from the source buffer
    //
    program = clCreateProgramWithSource(context, 1, (const char **) &KernelSource, NULL, &err);

    // Build the program executable
    //
    err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

    // Create the compute kernel in the program we wish to run
    //
    kernel = clCreateKernel(program, "square", &err);

Compile kernel
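If clBuildProgram fails (for example because of a syntax error in the kernel source), the compiler messages can be retrieved with clGetProgramBuildInfo. A minimal sketch of such a check (not part of the original slide code; it would go right after the clBuildProgram call above):

    if (err != CL_SUCCESS)
    {
        char build_log[16384];
        // query the build log for our device to see the compiler messages
        clGetProgramBuildInfo(program, device_id, CL_PROGRAM_BUILD_LOG,
                              sizeof(build_log), build_log, NULL);
        printf("clBuildProgram failed (error %d):\n%s\n", err, build_log);
        return 1;
    }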

Page 25

    // Create the input and output arrays in device memory for our calculation
    //
    input  = clCreateBuffer(context, CL_MEM_READ_ONLY,  sizeof(float) * count, NULL, NULL);
    output = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(float) * count, NULL, NULL);

    // Write our data set into the input array in device memory
    //
    err = clEnqueueWriteBuffer(commands, input, CL_TRUE, 0, sizeof(float) * count, data, 0, NULL, NULL);

Transfer data from host to GPU memory

Page 26

    // Set the arguments to our compute kernel
    //
    err  = 0;
    err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), &input);
    err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &output);
    err |= clSetKernelArg(kernel, 2, sizeof(unsigned int), &count);

    // Execute the kernel over the entire range of our 1d input data set
    // using one work item per work group (allows for arbitrary length of data array)
    //
    global = count;
    local = 1;
    err = clEnqueueNDRangeKernel(commands, kernel, 1, NULL, &global, &local, 0, NULL, NULL);

    // Wait for the commands to get serviced before reading back results
    //
    clFinish(commands);

Launch threads running kernels on GPU, perform main computation

Page 27

    // Read back the results from the device to verify the output
    //
    err = clEnqueueReadBuffer(commands, output, CL_TRUE, 0, sizeof(float) * count, results, 0, NULL, NULL);

    // Validate our results
    //
    correct = 0;
    for(i = 0; i < count; i++)
    {
        if(results[i] == data[i] * data[i])
            correct++;
    }

    // Print a brief summary detailing the results
    //
    printf("Computed '%d/%d' correct values!\n", correct, count);

Transfer data from GPU to host memory

Page 28

    // Shutdown and cleanup
    //
    clReleaseMemObject(input);
    clReleaseMemObject(output);
    clReleaseProgram(program);
    clReleaseKernel(kernel);
    clReleaseCommandQueue(commands);
    clReleaseContext(context);

    return 0;
}

Free all allocated memory

If you are trying to use an OpenCL profiler and it crashes at the end of the run, not freeing all memory could be the cause

Page 29

Compiling kernel - Online approach

In our example we have compiled code at runtime (online compile)

Kernel code can be specified as a string in the main program file

Kernel code can also be stored in a file (.cl) and loaded at runtime

const char fileName[] = "./kernel.cl";

/* Load kernel source code */
fp = fopen(fileName, "r");
if (!fp) {
    fprintf(stderr, "Failed to load kernel.\n");
    exit(1);
}
source_str = (char *)malloc(MAX_SOURCE_SIZE);
source_size = fread(source_str, 1, MAX_SOURCE_SIZE, fp);
fclose(fp);
// ...

/* Create Kernel program from the read in source */
program = clCreateProgramWithSource(context, 1, (const char **)&source_str, (const size_t *)&source_size, &ret);

Page 30

Compiling kernel - Offline approach

It is possible to compile OpenCL source code (kernels) offline

This is the approach used in CUDA

Upside: no need to spend time compiling at runtime

Downside: the executable is not portable; a separate binary must be compiled for each device the code is running on

There is no freestanding compiler for OpenCL (like nvcc for CUDA)

Kernels have to be compiled at runtime and the resulting binary saved.

That binary can then be loaded in the future to avoid compiling step
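A sketch of that save-and-reload flow for a single device (an illustration, not code from the course; error checks and file I/O details omitted). The compiled binary is obtained with clGetProgramInfo and later handed back to clCreateProgramWithBinary:

/* after clBuildProgram: retrieve the compiled binary for our (single) device */
size_t bin_size;
clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(size_t), &bin_size, NULL);

unsigned char *binary = (unsigned char *)malloc(bin_size);
clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(unsigned char *), &binary, NULL);
/* ... write bin_size bytes from 'binary' to a file ... */

/* in a later run: create the program from the saved binary instead of the source */
cl_int binary_status;
program = clCreateProgramWithBinary(context, 1, &device_id, &bin_size,
                                    (const unsigned char **)&binary, &binary_status, &err);
err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL); /* still required, but much faster */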

Page 31

Task queue

A queue is used to launch kernels, in precisely controlled order if required.

Event objects contain information about whether various operations have finished.

For example, clEnqueueTask launches a single kernel after checking that the event objects in the provided list have completed, and returns its own event object.

// create queue enabled for out-of-order (parallel) execution
commands = clCreateCommandQueue(context, device_id, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
...
// no synchronization
clEnqueueTask(command_queue, kernel_E, 0, NULL, NULL);

// synchronize so that kernel E starts only after kernels A,B,C,D finish
cl_event events[4]; // define event object array
clEnqueueTask(commands, kernel_A, 0, NULL, &events[0]);
clEnqueueTask(commands, kernel_B, 0, NULL, &events[1]);
clEnqueueTask(commands, kernel_C, 0, NULL, &events[2]);
clEnqueueTask(commands, kernel_D, 0, NULL, &events[3]);
clEnqueueTask(commands, kernel_E, 4, events, NULL);

Page 32

Queueing Data Parallel tasks

clEnqueueNDRangeKernel - used to launch data parallel tasks, i.e. copies of a kernel which are identical except for operating on different data

For example, to launch #(global) work-items (copies of the kernel, i.e. threads) grouped into work-groups of size #(local), one would use:

#(global) must be divisible by #(local), maximum size of #(local) dependent on device

Each work-item can retrieve:
get_global_id(0) - its index among all threads
get_local_id(0) - its index among threads in its work-group
get_group_id(0) - index of its work-group

This information can be used to determine which data to operate on

err = clEnqueueNDRangeKernel(commands, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
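Since #(global) must be divisible by #(local), a common host-side pattern (a sketch, not from the slides) is to round the global size up to the next multiple of the work-group size and rely on the kernel's if(i < count) guard to skip the padding work-items:

size_t local  = 64;                                      // chosen work-group size
size_t global = ((count + local - 1) / local) * local;   // round count up to a multiple of local
err = clEnqueueNDRangeKernel(commands, kernel, 1, NULL, &global, &local, 0, NULL, NULL);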

Page 33

Organising work-items in more than 1D

Convenient and efficient for many tasks, 2D and 3D possible

Maximum possible values in global and local arrays are device dependent

For 2D, can retrieve global index coordinates(x,y) = (get_global_id(0),get_global_id(1))

For 3D, can similarly retrieve global (x,y,z) = (get_global_id(0),get_global_id(1),get_global_id(2))

// 2D example
// launch 256 work-items, organised in a 16x16 grid,
// grouped into work-groups of 16, organised as 4x4
size_t global[2], local[2];
global[0] = 16; global[1] = 16;
local[0]  = 4;  local[1]  = 4;

err = clEnqueueNDRangeKernel(commands, kernel, 2, NULL, global, local, 0, NULL, NULL);
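Inside a kernel launched this way, the two global IDs are typically combined into a row-major array index. A minimal sketch (a made-up example kernel, not from the slides), where width is the number of columns of the array:

__kernel void scale2d(__global float* data, const int width, const float factor)
{
    int x = get_global_id(0);      // column index
    int y = get_global_id(1);      // row index
    data[y*width + x] *= factor;   // row-major 2D indexing
}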

Page 34

Measuring execution time

OpenCL is an abstraction layer that allows the same code to be executed on different platforms

The code is guaranteed to run but its speed of execution will be dependent on the device and type of parallelism used

In order to get maximum performance, device and parallelism dependent tuning must be performed.

To do this, execution time must be measured. OpenCL provides convenient ways to do this

Event objects can also contain information about how long the task associated with the event took to execute

Page 35

clGetEventProfilingInfo can obtain the start and end time of an event. Taking the difference gives the event duration in nanoseconds.

commands = clCreateCommandQueue(context, device_id, CL_QUEUE_PROFILING_ENABLE, &err);

cl_event event;
cl_ulong start, end;
err = clEnqueueNDRangeKernel(commands, kernel, 1, NULL, &global, &local, 0, NULL, &event);

clWaitForEvents(1, &event);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL);
printf("Time for event (ms): %10.5f \n", (end-start)/1000000.0);
// ...
clReleaseEvent(event); // important to free memory

Page 36

Timing “Hello World!”

Increase DATA_SIZE to 1048576

Check how computation time varies with size of work-groups

Using largest possible work group is usually the best choice

Time for write/read to buffer is 2.51/5.14 ms.

Work-group size (local)   Time (ms)
1                         6.191
8                         0.803
16                        0.414
32                        0.209
64                        0.127
512 (maximum)             0.130

Page 37

Timing "Hello World!" with openclprof

/opt/sharcnet/cuda/3.0/cuda/openclprof/bin/openclprof

Page 38

Must minimize memory transfers between host and GPU as these are quite slow.

Even if a kernel shows no performance improvement when run on the GPU, run it on the GPU anyway if doing so eliminates a device-to-host memory transfer

Intermediate data structures should be created in GPU memory, operated on by GPU, and destroyed without ever being mapped or copied to host memory

Because of overhead, batching small transfers into one works better than making each transfer separately.

Higher bandwidth can be achieved when using page-locked (or pinned) memory.
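One way to obtain pinned host memory in OpenCL (a sketch, assuming the implementation backs CL_MEM_ALLOC_HOST_PTR with page-locked memory, as NVIDIA's does) is to allocate a host-side staging buffer with that flag, map it, and transfer from the mapped pointer:

/* pinned staging buffer on the host */
cl_mem pinned = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                               sizeof(float) * count, NULL, &err);
float *host_ptr = (float *)clEnqueueMapBuffer(commands, pinned, CL_TRUE, CL_MAP_WRITE,
                                              0, sizeof(float) * count, 0, NULL, NULL, &err);
/* fill host_ptr[0..count-1], then copy to the device buffer in one large transfer */
err = clEnqueueWriteBuffer(commands, input, CL_TRUE, 0, sizeof(float) * count,
                           host_ptr, 0, NULL, NULL);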

Page 39

Running openclprof on SHARCNET

1. ssh -X [email protected]
2. ssh -X ang4
3. /opt/sharcnet/cuda/3.0/cuda/openclprof/bin/openclprof
Once you get the GUI:
4. File->New
5. Name your project, select location somewhere in your work space
6. Select OpenCL binary executable compiled previously (no special compile options needed)
7. Select your working directory
8. Click “Start” to perform profiling run
Examine output
Try various options in session and view

Page 40

Exercise: Port a serial code to use OpenCL

cp -rp /work/ppomorsk/SUMMER_SCHOOL/ .
./exercise/serial - serial code you are to convert to OpenCL
./exercise/opencl_template - contains a generic “template” which you can modify, using the serial code, to develop the OpenCL version of the code
./hello - consult if needed
./exercise/opencl_solution - the solution (peek if you must)

Compile with: gcc -o test.x main.c -lOpenCL
To run: ./test.x

Template is for the case of launching one kernel as a single thread

This port will be fairly generic, and could easily run on both GPU and CPU

Page 41

void moving_average(int *values, float *average, int length, int width)
{
    int i;
    int add_value;

    /* Compute sum for the first "width" elements */
    add_value = 0;
    for( i = 0; i < width; i++ ) {
        add_value += values[i];
    }
    average[width-1] = (float)add_value;

    /* Compute sum for the (width) to (length-1) elements */
    for( i = width; i < length; i++ ) {
        add_value = add_value - values[i-width] + values[i];
        average[i] = (float)(add_value);
    }

    /* Insert zeros to 0th to (width-2)th element */
    for( i = 0; i < width-1; i++ ) {
        average[i] = 0.0f;
    }

    /* Compute average from the sum */
    for( i = width-1; i < length; i++ ) {
        average[i] /= (float)width;
    }
}

Computes Simple Moving Average (SMA)

Hint: Converting this to an OpenCL kernel that runs as single thread requires very little modification
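For illustration only (a sketch, not the course's solution file), the single-work-item kernel could look like this: the loop body is unchanged, and only the __kernel and __global qualifiers are added to the signature.

__kernel void moving_average(__global int *values,
                             __global float *average,
                             int length,
                             int width)
{
    int i;
    int add_value;

    /* Compute sum for the first "width" elements */
    add_value = 0;
    for( i = 0; i < width; i++ ) {
        add_value += values[i];
    }
    average[width-1] = (float)add_value;

    /* Compute sum for the (width) to (length-1) elements */
    for( i = width; i < length; i++ ) {
        add_value = add_value - values[i-width] + values[i];
        average[i] = (float)(add_value);
    }

    /* Insert zeros to 0th to (width-2)th element */
    for( i = 0; i < width-1; i++ ) {
        average[i] = 0.0f;
    }

    /* Compute average from the sum */
    for( i = width-1; i < length; i++ ) {
        average[i] /= (float)width;
    }
}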

Page 42

#include <stdio.h>
#include <stdlib.h>

/* Read Stock data */
int stock_array1[] = {
    #include "stock_array1.txt"
};

/* Define width for the moving average */
#define WINDOW_SIZE (13)

int main(int argc, char *argv[])
{
    float *result;

    int data_num = sizeof(stock_array1) / sizeof(stock_array1[0]);
    int window_num = (int)WINDOW_SIZE;
    int i;

    /* Allocate space for the result */
    result = (float *)malloc(data_num*sizeof(float));

    /* Call the moving average function */
    moving_average(stock_array1, result, data_num, window_num);

    /* Print result */
    for(i=0; i<data_num; i++) {
        printf( "result[%d] = %f\n", i, result[i] );
    }

    /* Deallocate memory */
    free(result);

    return 0;
}

Serial host code

Page 43

OpenCL C language

Page 44

OpenCL C

OpenCL C is basically standard C (C99) with some extensions and restrictions. This language is used to program the kernel.

There is a list of restrictions, for example:
- recursion is not supported
- a pointer passed as an argument to a kernel function must be of type __global, __constant or __local
- support for double precision floating-point is currently an optional extension, and it may not be implemented

Some OpenCL functions are built in without requiring a library, including math functions, geometric functions, and work-item functions

OpenCL includes some new features not available in C

Page 45

Vector Data

In OpenCL you can define and operate on vector data types, each a struct-like collection of components of the same scalar type (CUDA has a much more limited capacity to do this)

float4 consists of (float,float,float,float) declared via float4 f4;

Vector lengths are 2, 4, 8 and 16; components can be char, int, float, double etc.

// vector literals used to assign values to variablesfloat4 g4 = (float4)(1.0f,2.0f,3.0f,4.0f);

// built in math functions can process vector typesfloat4 f4 = cos(g4); // f4 = cos g(0),cos g(1), cos g(2), cos g(3)

Page 46

Accessing vector data

// .xyzw will access 0 to 3, same in CUDA

int4 x = (int4)(1,2,3,4);
int4 a = x.wzyx;   // a = (4,3,2,1)
int2 b = x.xx;     // b = (1,1)

// can access through number index using .s followed by number
// (in hex, which uses A-F for indices above 9)

int8 c = x.s01233210;   // c = (1,2,3,4,4,3,2,1)

// extract odd and even indices (.odd , .even)

int2 d = x.odd;    // d = (2,4)

// extract upper half and lower half of vector (.hi, .lo)

int2 e = x.hi;     // e = (3,4)

Page 47

Other vector operations

Addition, subtraction

A comparison operation returns a vector containing TRUE or FALSE values

The type of the result vector is determined by the operands: usually char for char comparisons, and an integer type of the same length as the compared scalar values (short/int/long)

The actual values for FALSE/TRUE are 0/-1 and not 0/1, as that is more convenient for SIMD units

int2 tf = (float2)(1.0f,1.0f) == (float2)(0.0f,1.0f);
// tf = (int2)(0,-1)

Page 48

Effective use of SIMD

Single Instruction obviously implies that it is essential for each thread to run exactly the same code as much as possible

Thus, if an if statement sends some threads down one branch of the code and other threads down another branch, the branches end up executing one after the other and performance suffers.

Page 49

Example of selection without using if

int a = in0[i], b = in1[i];  /* want to put the smaller value in out[i] */

/* usual way, with if */
if (a < b)
    out[i] = a;
else
    out[i] = b;
// or possibly
// out[i] = a < b ? a : b;

-- better way --

int a = in0[i], b = in1[i];
int cmp = a < b;  // if TRUE, cmp has all bits 1; if FALSE, all bits 0
// &  bitwise AND
// |  bitwise OR
// ~  flips all bits
out[i] = (a & cmp) | (b & ~cmp);  // a when TRUE and b when FALSE

// OpenCL has a shortcut for this
out[i] = bitselect(b, a, cmp);

Works because: a & (all 1s) = a, a | (all 0s) = a
Good illustration of why -1 for TRUE works better

Page 50

Vector Benefits

Using vectors will only have benefits on hardware with SIMD features

If those features are not there, vectors will be handled sequentially with no speedup

The selection of vector size is important. For example, double16 may be too big to fit into some registers, so speedup will not occur

Ideally all this should be handled automatically by the compiler, but we are not yet there
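As an illustration (a sketch, not part of the original example), the earlier square kernel could be written with float4 so that each work-item processes four elements; the host would then launch count/4 work-items (assuming count is a multiple of 4):

__kernel void square4(__global float4* input,
                      __global float4* output,
                      const unsigned int count4)    // count4 = count / 4
{
    int i = get_global_id(0);
    if (i < count4)
        output[i] = input[i] * input[i];            // component-wise multiply
}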

Page 51

OpenCL Pragmas

#pragma OPENCL FP_CONTRACT on-off-switch

Controls how floating-point arithmetic is done (whether FMA is used). In contracted operations, multiple operations are performed as one.

#pragma OPENCL EXTENSION <extension_name> : <behavior>

A number of extensions are available
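For example, enabling double precision support (the optional cl_khr_fp64 extension mentioned earlier; a sketch assuming the device actually provides it) looks like this at the top of the kernel source:

#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel void scale(__global double* data, const double factor)
{
    int i = get_global_id(0);
    data[i] *= factor;
}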

Page 52

OpenCL Memory Types

__global - memory which can be read from all work items (threads), though with relatively slow access unless the program is carefully optimised. It is the device’s main memory.

__local - can be read from work items within a work group. Physically it is the shared memory on each compute unit.

__private - can only be used within each work item. Physically it is the registers used by each processing element.

__constant - memory that can be used to store constant data for read-only access by all compute units. The host is responsible for initializing and allocating this memory

Page 53

OpenCL vs. CUDA - Vector addition example

CUDA:

__global__ void
vectorAdd(const float * a, const float * b, float * c)
{
    // Vector element index
    int nIndex = blockIdx.x * blockDim.x + threadIdx.x;
    c[nIndex] = a[nIndex] + b[nIndex];
}

OpenCL:

__kernel void
vectorAdd(__global const float * a,
          __global const float * b,
          __global float * c)
{
    // Vector element index
    int nIndex = get_global_id(0);
    c[nIndex] = a[nIndex] + b[nIndex];
}

Page 54

Optimizing OpenCL

- Efficient access to global memory

- Use of local memory to increase bandwidth: Matrix Multiplication example

Code, data and figures from:Nvidia OpenCL - Best Practices Guidehttp://developer.download.nvidia.com/compute/cuda/3_0/toolkit/docs/NVIDIA_OpenCL_BestPracticesGuide.pdf

Page 55

GPU memory vs. CPU memory

CPUs have a cache which for many problems can provide fast memory access, with transfers between cache and main memory handled in hardware.

The cache paradigm scales poorly as the number of cores is increased, as increased hardware overhead is needed to maintain coherency between caches attached to different cores.

In a GPU, with its hundreds of cores, caches are not practical and thus are not present.

Fast but small local (shared) memory is somewhat analogous to L1 cache on a CPU. However, there is no hardware to automatically shift data from local to global memory as needed. This must be taken care of in software, i.e. by the programmer.

In general on a CPU the requirements for memory access are not that stringent. If you are accessing memory sequentially, you are almost certainly OK.

On a GPU, efficient memory access requires more work by the programmer.

Page 56

Coalesced Access to Global Memory

Global memory loads and stores by threads of half warp (16 threads) can be combined into as few as one transaction.

You can think of global memory as rows of 16 floats linked together into a 64 byte group, with 2 of these forming a 128 byte group

A good access pattern results in a single 64-byte transaction

In this example, if accesses were permuted, still one transaction would be performed on newer cards (CC 1.2), but 16 serialized transactions on older cards (CC 1.1)

NVIDIA OpenCL Best Practices Guide

[Excerpt from the guide, sections 3.2.1.1 "A Simple Access Pattern" and 3.2.1.2 "A Sequential but Misaligned Access Pattern"; Figure 3.4: Coalesced access in which all threads but one access the corresponding word in a segment.]

Half warp of threads

Global memory

Page 57

[Excerpt from the NVIDIA OpenCL Best Practices Guide, "Memory Optimizations": Figure 3.5, Unaligned sequential addresses that fit within a single 128-byte segment; Figure 3.6, Misaligned sequential addresses that fall within two 128-byte segments; section 3.2.1.3 "Effects of Misaligned Accesses"; Listing 3.5, a copy kernel (offsetCopy) that illustrates misaligned accesses, reproduced on the next page.]

Somewhat misaligned pattern: CC 1.1 devices will perform the transactions in series; CC 1.2 can still do it in one


Memory to be accessed now belongs to two different 128-byte groups, so CC 1.2 devices need 2 transactions

Choosing work-group sizes to be multiples of 16 facilitates efficient access patterns

Page 58

Cost of misaligned access

__kernel void offsetCopy(__global float* odata,
                         __global float* idata,
                         int offset)
{
    int xid = get_global_id(0) + offset;
    odata[xid] = idata[xid];
}

NVIDIA OpenCL Best Practices Guide

[Excerpt from the guide: Figure 3.7, Performance of the offsetCopy kernel on an NVIDIA GeForce GTX 280 (compute capability 1.3) and a GeForce GTX 8800 (compute capability 1.0).]

Even on newer cards consecutive but misaligned access can decrease bandwidth by a factor of 2

Page 59

Strided accesses

__kernel void strideCopy(__global float* odata,
                         __global float* idata,
                         int stride)
{
    int xid = get_global_id(0) * stride;
    odata[xid] = idata[xid];
}

[Excerpt from the NVIDIA OpenCL Best Practices Guide, section 3.2.1.4 "Strided Accesses": Listing 3.6, a kernel to illustrate non-unit stride data copy; Figure 3.8, A half warp accessing memory with a stride of 2.]

Stride=2 case: CC ≥ 1.3 devices can do this in a single 128-byte transaction, but bandwidth is wasted as half of the elements are not used

NVIDIA OpenCL Best Practices Guide

[Excerpt from the guide: Figure 3.9, Performance of the strideCopy kernel; sections 3.2.2 "Shared Memory" and 3.2.2.1 "Shared Memory and Memory Banks", noting that on-chip shared memory (OpenCL __local memory) has roughly 100x lower latency than global memory provided there are no bank conflicts.]

Effective bandwidth deteriorates as stride is increased to 16 and beyond

Page 60

Note that performance deteriorates significantly as stride increases to just 16

On CPU the cache with its fast memory is relatively large and thus can provide reasonable performance for larger stride

GPUs have no cache, but they have __local memory (in OpenCL, same as shared memory in CUDA), which can be used to improve stride access efficiency

Page 61

Optimizing matrix multiplication with use of local memory

Consider the special case of AB = C where
A is (M x 16)
B is (16 x N)
C is (M x N)

Use work-groups of 16x16

NVIDIA OpenCL Best Practices Guide

[Excerpt from the NVIDIA OpenCL Best Practices Guide: Figure 3.10, A block-column matrix (A) multiplied by a block-row matrix (B) and the resulting product matrix (C); Listing 3.7, Unoptimized matrix multiplication (the simpleMultiply kernel, reproduced on the next page).]

Page 62

__kernel void simpleMultiply(__global float* a,
                             __global float* b,
                             __global float* c,
                             int N)
{
    int row = get_global_id(1);
    int col = get_global_id(0);
    float sum = 0.0f;
    for (int i = 0; i < TILE_DIM; i++) {
        sum += a[row*TILE_DIM+i] * b[i*N+col];
    }
    c[row*N+col] = sum;
}

Unoptimized

With this code, global memory must be accessed N * M * (16 + 16) times to provide the needed entries of a and b

[Excerpt from the guide: Figure 3.11, Computing a row (half warp) of a tile in C using one row of A and an entire tile of B; the improved coalescedMultiply kernel, which reads a tile of A into shared (__local) memory, is reproduced on the next page.]

Page 63

__kernel void coalescedMultiply(__global float* a,
                                __global float* b,
                                __global float* c,
                                int N,
                                __local float aTile[TILE_DIM][TILE_DIM])
{
    int row = get_global_id(1);
    int col = get_global_id(0);
    float sum = 0.0f;
    int x = get_local_id(0);
    int y = get_local_id(1);
    aTile[y][x] = a[row*TILE_DIM+x];
    for (int i = 0; i < TILE_DIM; i++) {
        sum += aTile[y][i] * b[i*N+col];
    }
    c[row*N+col] = sum;
}

Partially optimized: store A in local memory
Global memory is accessed N*M*(1+16) times

__kernel void sharedABMultiply(__global float* a,
                               __global float* b,
                               __global float* c,
                               int N,
                               __local float aTile[TILE_DIM][TILE_DIM],
                               __local float bTile[TILE_DIM][TILE_DIM])
{
    int row = get_global_id(1);
    int col = get_global_id(0);
    float sum = 0.0f;
    int x = get_local_id(0);
    int y = get_local_id(1);
    aTile[y][x] = a[row*TILE_DIM+x];
    bTile[y][x] = b[y*N+col];
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int i = 0; i < TILE_DIM; i++) {
        sum += aTile[y][i] * bTile[i][x];
    }
    c[row*N+col] = sum;
}

Fully optimized: store both A and B in local memory
Global memory is accessed N*M*(1+1) times
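On the host side, the __local arguments aTile and bTile are not backed by cl_mem objects; clSetKernelArg is given only their size, with a NULL pointer, and the runtime allocates them per work-group. A sketch (hypothetical host snippet; d_a, d_b, d_c and a host-side TILE_DIM are assumed names):

err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_a);
err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_b);
err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_c);
err |= clSetKernelArg(kernel, 3, sizeof(int), &N);
/* __local tiles: pass only the size, no data pointer */
err |= clSetKernelArg(kernel, 4, sizeof(float) * TILE_DIM * TILE_DIM, NULL);
err |= clSetKernelArg(kernel, 5, sizeof(float) * TILE_DIM * TILE_DIM, NULL);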

Page 64

Performance

Optimization                                             NVIDIA GeForce GTX 280       NVIDIA GeForce GTX 8800
                                                         (released June 2008)         (released November 2006)
None                                                     8.8 GBps                     0.7 GBps
Coalesced using shared memory to store a tile of A       14.3 GBps                    8.2 GBps
Additionally using shared memory to store a tile of B    29.7 GBps                    15.7 GBps

The trend is for newer cards to cope better with poor optimization

Page 65

Conclusion

Explore OpenCL to find out more of its features, as this introduction was only an overview with many aspects left out

Experiment with OpenCL on angel (full install very soon)

Contact SHARCNET HPC analysts for advice about porting your code to GPU

Keep up with new developments in the rapidly developing field of GPU computing