opencl (pdf presentation) - beyond programmable shadings08.idav.ucdavis.edu/munshi-opencl.pdf ·...

OpenCL: Parallel Computing on the GPU and CPU

Aaftab Munshi

Beyond Programmable Shading: Fundamentals

• Today’s processors are increasingly parallel
• CPUs
  ■ Multiple cores are driving performance increases
• GPUs
  ■ Transforming into general purpose data-parallel computational coprocessors
  ■ Improving numerical precision (single and double)

Opportunity: Processor


• Writing parallel programs is different for the CPU and the GPU
  ■ Differing domain-specific techniques
  ■ Vendor-specific technologies
• Graphics APIs are not an ideal abstraction for general purpose compute

Challenge: Processor Parallelism


• OpenCL: Open Computing Language
• Approachable language for accessing heterogeneous computational resources
• Supports parallel execution on single or multiple processors
  ■ GPU, CPU, GPU + CPU or multiple GPUs
• Desktop and Handheld Profiles
• Designed to work with graphics APIs such as OpenGL

Introducing OpenCL


OpenCL = Open Standard

• Specification under review
  ■ Royalty free, cross-platform, vendor neutral
  ■ Khronos OpenCL working group (www.khronos.org)
• Based on a proposal by Apple
  ■ Developed in collaboration with industry leaders
  ■ Performance-enhancing technology in Mac OS X Snow Leopard

http://www.khronos.org



OpenCL Working Group Members: Broad Industry Support

© Copyright Khronos Group, 2008

http://www.apple.com/


http://www.codeplay.com/





http://www.amd.com/


http://www.umu.se/umu/index_eng.html


OpenCL — A Sneak Preview


• Use all computational resources in system
  ■ GPUs and CPUs as peers
  ■ Data- and task-parallel compute model
• Efficient parallel programming model
  ■ Based on C
  ■ Abstract the specifics of underlying hardware
• Specify accuracy of floating-point computations
  ■ IEEE 754 compliant rounding behavior
  ■ Define maximum allowable error of math functions
• Drive future hardware requirements

Design Goals of OpenCL
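The "maximum allowable error" goal is easier to picture with a concrete error measure. The OpenCL specification states such bounds in units in the last place (ULP). The following is a minimal sketch in plain C, not OpenCL code: the helper names `ordered_bits` and `ulp_distance` are hypothetical, and the sketch assumes finite IEEE 754 single-precision inputs (no NaN or infinity).

```c
#include <stdint.h>
#include <string.h>

/* Map a finite float onto an integer scale where adjacent representable
   floats differ by exactly 1. Both +0 and -0 map to 0.
   Hypothetical helper for this sketch only. */
int64_t ordered_bits(float x)
{
    uint32_t u;
    memcpy(&u, &x, sizeof u);                  /* reinterpret the bit pattern */
    int64_t magnitude = (int64_t)(u & 0x7FFFFFFFu);
    return (u & 0x80000000u) ? -magnitude : magnitude;
}

/* Number of representable floats between a and b, i.e. their distance in ULPs.
   Assumes both inputs are finite. */
uint64_t ulp_distance(float a, float b)
{
    int64_t d = ordered_bits(a) - ordered_bits(b);
    return (uint64_t)(d < 0 ? -d : d);
}
```

A test harness could compare a device result against a reference value and check that `ulp_distance(result, reference)` stays within the bound the specification gives for that function.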


• Platform Layer
  ■ Query and select compute devices in the system
  ■ Initialize compute device(s)
  ■ Create compute contexts and work-queues
• Runtime
  ■ Resource management
  ■ Execute compute kernels
• Compiler
  ■ A subset of ISO C99 with appropriate language additions
  ■ Compile and build compute program executables
  ■ Online or offline

OpenCL Software Stack


• Compute Kernel
  ■ Basic unit of executable code, similar to a C function
  ■ Data-parallel or task-parallel
• Compute Program
  ■ Collection of compute kernels and internal functions
  ■ Analogous to a dynamic library
• Applications queue compute kernel execution instances
  ■ Queued in-order
  ■ Executed in-order or out-of-order
  ■ Events are used to implement appropriate

OpenCL Execution Model
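The difference between "queued in-order" and "executed out-of-order" can be sketched with a toy model in plain C. This is not the OpenCL runtime: the `command_t` type, the `NO_EVENT` marker and `run_out_of_order` are hypothetical names for illustration, and each command is assumed to wait on at most one event.

```c
#include <stddef.h>

#define NO_EVENT (-1)   /* hypothetical marker: command waits on nothing */

typedef struct {
    int wait_for;  /* index of the command whose completion event must fire first */
    int done;      /* set to 1 once this command has executed */
} command_t;

/* Executes queued commands in any order that respects their events.
   The queue is scanned from the back on purpose, so commands without
   dependencies overtake earlier ones. Writes the execution order into
   order[] and returns how many commands ran (fewer than n if some
   command waits on an event that never completes). */
size_t run_out_of_order(command_t *cmds, size_t n, size_t *order)
{
    size_t executed = 0;
    int progressed = 1;
    while (executed < n && progressed) {
        progressed = 0;
        for (size_t k = n; k-- > 0; ) {
            command_t *c = &cmds[k];
            if (c->done)
                continue;
            if (c->wait_for == NO_EVENT || cmds[c->wait_for].done) {
                c->done = 1;
                order[executed++] = k;
                progressed = 1;
            }
        }
    }
    return executed;
}
```

With three commands queued as 0, 1, 2, where command 1 waits on command 0, this model runs them as 2, 0, 1: the queue order is fixed, but only the event constrains the execution order.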


• Define N-Dimensional computation domain
  ■ Each independent element of execution in the N-D domain is called a work-item
  ■ The N-D domain defines the total number of work-items that execute in parallel: the global work size
• Work-items can be grouped together into a work-group
  ■ Work-items in a group can communicate with each other
  ■ Can synchronize execution among work-items in a group to coordinate memory access
• Execute multiple work-groups in parallel
• Mapping of global work size to work-groups

OpenCL Data-Parallel Execution
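The mapping of a one-dimensional global work size onto work-groups comes down to integer division. The sketch below is plain C rather than OpenCL code. The function names are hypothetical, and it assumes the global work size must be an exact multiple of the work-group size.

```c
#include <stddef.h>

/* Number of work-groups covering global_size work-items.
   Returns 0 if local_size is 0 or does not divide global_size evenly
   (even division is an assumption of this sketch). */
size_t num_work_groups(size_t global_size, size_t local_size)
{
    if (local_size == 0 || global_size % local_size != 0)
        return 0;
    return global_size / local_size;
}

/* Work-group that a given global work-item index falls into. */
size_t group_id_of(size_t global_id, size_t local_size)
{
    return global_id / local_size;
}

/* Position of that work-item inside its work-group. */
size_t local_id_of(size_t global_id, size_t local_size)
{
    return global_id % local_size;
}
```

For a global work size of 1024 and a work-group size of 64, this gives 16 work-groups, and work-item 130 is item 2 of group 2.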


• Data-parallel execution model must be implemented by all OpenCL compute devices
• Some compute devices such as CPUs can also execute task-parallel compute kernels
  ■ Executes as a single work-item
  ■ A compute kernel written in OpenCL
  ■ A native C / C++ function

OpenCL Task-Parallel Execution


OpenCL Memory Model

• Implements a relaxed consistency, shared memory model
• Multiple distinct address spaces
  ■ Address spaces can be collapsed
  ■ Address Qualifiers
    ■ __private
    ■ __local
    ■ __constant and __global
  ■ Example:
    ■ __global float4 *p;

[Figure: a compute device contains compute units 1 to N. Each compute unit runs work-items 1 to M, each with its own private memory, and has one local memory. The compute units share a global / constant memory data cache, which sits in front of global memory in compute device memory.]
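How the address spaces divide up the work can be imitated in plain C. This is only a sequential analogy under stated assumptions, not OpenCL code: `group_sum_of_squares` and `LOCAL_SIZE` are hypothetical, loops stand in for parallel work-items, a per-group array plays the role of `__local` memory, and a loop-scoped variable plays the role of `__private` memory.

```c
#include <stddef.h>

#define LOCAL_SIZE 4  /* work-items per work-group; hypothetical value */

/* For each work-group, sums the squares of its LOCAL_SIZE inputs.
   global_in holds num_groups * LOCAL_SIZE floats; global_out holds num_groups. */
void group_sum_of_squares(const float *global_in, float *global_out,
                          size_t num_groups)
{
    for (size_t group = 0; group < num_groups; ++group) {
        /* Stands in for __local memory: one scratch area per work-group. */
        float local_scratch[LOCAL_SIZE];

        for (size_t lid = 0; lid < LOCAL_SIZE; ++lid) {
            /* Stands in for __private memory: visible to one work-item only. */
            float private_val = global_in[group * LOCAL_SIZE + lid];
            local_scratch[lid] = private_val * private_val;
        }

        /* On a real device the work-items would synchronize here before
           any of them reads what the others wrote to local memory. */
        float sum = 0.0f;
        for (size_t lid = 0; lid < LOCAL_SIZE; ++lid)
            sum += local_scratch[lid];

        /* Stands in for __global memory: visible to every work-group. */
        global_out[group] = sum;
    }
}
```

The design point the analogy carries over is that intermediate values stay in the small per-item and per-group spaces, and only the final result per group is written to global memory.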


• Derived from ISO C99
• A few restrictions
  ■ Recursion, function pointers, functions in C99 standard headers ...
• Preprocessing directives defined by C99 are supported
• Built-in Data Types
  ■ Scalar and vector data types
  ■ Structs, Pointers
  ■ Data-type conversion functions
    ■ convert_type<_sat><_roundingmode>
  ■ Image types

Language for writing compute
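The `convert_type<_sat><_roundingmode>` pattern combines an optional saturation suffix with an optional rounding-mode suffix. The plain C sketch below emulates two instances for a float-to-`uchar` conversion. The function names are hypothetical stand-ins (OpenCL C spells them `convert_uchar_sat` and `convert_uchar_sat_rte`), and mapping NaN to 0 is an assumption of this sketch.

```c
/* Emulates a saturating float -> unsigned char conversion that rounds
   toward zero: out-of-range inputs clamp to 0 or 255 instead of wrapping. */
unsigned char convert_uchar_sat_sketch(float x)
{
    if (x != x)      return 0;    /* NaN: mapped to 0 (assumption) */
    if (x <= 0.0f)   return 0;    /* clamp at the low end */
    if (x >= 255.0f) return 255;  /* clamp at the high end */
    return (unsigned char)x;      /* a C cast truncates toward zero */
}

/* Same conversion with round-to-nearest-even instead of truncation. */
unsigned char convert_uchar_sat_rte_sketch(float x)
{
    if (x != x)      return 0;
    if (x <= 0.0f)   return 0;
    if (x >= 255.0f) return 255;
    unsigned int whole = (unsigned int)x;   /* floor, since x > 0 here */
    float frac = x - (float)whole;
    /* Round up past the midpoint, or at the midpoint when whole is odd. */
    if (frac > 0.5f || (frac == 0.5f && (whole & 1u)))
        whole += 1;
    return (unsigned char)whole;
}
```

The saturating forms matter for pixel data: 300.0f becomes 255 rather than wrapping around, and 2.5f becomes 2 under truncation and also 2 under round-to-nearest-even, while 3.5f becomes 3 and 4 respectively.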


• Built-in Functions: Required
  ■ work-item functions
  ■ math.h
  ■ read and write image
  ■ relational
  ■ geometric functions
  ■ synchronization functions
• Built-in Functions: Optional
  ■ double precision
  ■ atomics to global and local memory
  ■ selection of rounding mode



OpenCL FFT Example - Host API



// create a compute context with GPU device
context = clCreateContextFromType(CL_DEVICE_TYPE_GPU);

// create a work-queue
queue = clCreateWorkQueue(context, NULL, NULL, 0);

// allocate the buffer memory objects
memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(float)*2*num_entries, srcA);
memobjs[1] = clCreateBuffer(context, CL_MEM_READ_WRITE,
                            sizeof(float)*2*num_entries, NULL);

// create the compute program
program = clCreateProgramFromSource(context, 1, &fft1D_1024_kernel_src, NULL);

// build the compute program executable
clBuildProgramExecutable(program, false, NULL, NULL);

// create the compute kernel
kernel = clCreateKernel(program, "fft1D_1024");


OpenCL FFT Example - Host API

// create N-D range object with work-item dimensions
global_work_size[0] = n;
local_work_size[0] = 64;
range = clCreateNDRangeContainer(context, 0, 1, global_work_size, local_work_size);

// set the args values
clSetKernelArg(kernel, 0, (void *)&memobjs[0], sizeof(cl_mem), NULL);
clSetKernelArg(kernel, 1, (void *)&memobjs[1], sizeof(cl_mem), NULL);
clSetKernelArg(kernel, 2, NULL, sizeof(float)*(local_work_size[0]+1)*16, NULL);
clSetKernelArg(kernel, 3, NULL, sizeof(float)*(local_work_size[0]+1)*16, NULL);

// execute kernel
clExecuteKernel(queue, kernel, NULL, range, NULL, 0, NULL);


OpenCL FFT Example - Compute

// This kernel computes FFT of length 1024. The 1024 length FFT is decomposed into
// calls to a radix 16 function, another radix 16 function and then a radix 4 function
// Based on "Fitting FFT onto G80 Architecture". Vasily Volkov & Brian Kazian,
// UC Berkeley CS258 project report, May 2008
__kernel void fft1D_1024 (__global float2 *in, __global float2 *out,
                          __local float *sMemx, __local float *sMemy)
{
    int tid = get_local_id(0);
    int blockIdx = get_group_id(0) * 1024 + tid;
    float2 data[16];

    // starting index of data to/from global memory
    in = in + blockIdx;
    out = out + blockIdx;

    globalLoads(data, in, 64);             // coalesced global reads
    fftRadix16Pass(data);                  // in-place radix-16 pass
    twiddleFactorMul(data, tid, 1024, 0);

    // local shuffle using local memory
    localShuffle(data, sMemx, sMemy, tid, (((tid & 15) * 65) + (tid >> 4)));
    fftRadix16Pass(data);                  // in-place radix-16 pass
    twiddleFactorMul(data, tid, 64, 4);    // twiddle factor multiplication

    localShuffle(data, sMemx, sMemy, tid, (((tid >> 4) * 64) + (tid & 15)));

    // four radix-4 function calls
    fftRadix4Pass(data);
    fftRadix4Pass(data + 4);
    fftRadix4Pass(data + 8);
    fftRadix4Pass(data + 12);

    // coalesced global writes
    globalStores(data, out, 64);
}

http://www.cs.berkeley.edu/~kubitron/courses/cs258-S08/projects/reports/project6_report.pdf



• Sharing OpenGL Resources
  ■ OpenCL is designed to efficiently share with OpenGL
  ■ Textures, Buffer Objects and Renderbuffers
  ■ Data is shared, not copied
• Efficient queuing of OpenCL and OpenGL commands
• Apps can select compute device(s) that will run OpenGL and OpenCL

OpenCL and OpenGL


• A new compute language that works across GPUs and CPUs
  ■ C99 with extensions
  ■ Familiar to developers
  ■ Includes a rich set of built-in functions
  ■ Makes it easy to develop data- and task-parallel compute programs
• Defines hardware and numerical precision requirements
• Open standard for heterogeneous parallel computing

Summary
