China MCP
OpenCL
Agenda
• OpenCL Overview
• Usage
• Memory Model
• Synchronization
• Operational Flow
• Availability
© Copyright Texas Instruments Inc., 2013
[Figure: TI markets and applications - media processing, industrial electronics, high-performance and cloud computing; video and audio infrastructure; networking; DVR / NVR & smart cameras; wireless testers; industrial control; home AVR and automotive audio; portable mobile radio; medical imaging; mission-critical systems; radar & communications computing; industrial imaging; analytics]
OpenCL Overview: Motivation
Many current TI DSP users:
• Comfortable working with TI platforms
• Large software teams, low level programming models for algorithmic control
• Understand DSP programming
Many customers in new markets like High-Performance-Compute:
• Often not DSP programmers
• Not familiar with TI proprietary software, especially in early stages
• Comfortable with workstation parallel programming models
It is important that customers in these new markets be comfortable leveraging TI's heterogeneous multicore offerings.
OpenCL Overview: Motivation
• Framework for expressing programs in which parallel computation is dispatched to any attached heterogeneous device
• Open, standard, and royalty-free
• Consists of two components
  1. API for the host program to create and submit kernels for execution (host-based generic header and vendor-supplied library file)
  2. Cross-platform language for expressing kernels (based on C99 C with some additions/restrictions and built-in functions)
• Promotes portability of applications from device to device and across generations of a single device roadmap
OpenCL Overview: What it is
Node 0
MPI Communication APIs
Node 1 Node N
• MPI allows expression of parallelism across nodes in a distributed system
• MPI’s first specification was in 1992
OpenCL Overview: Where it fits in
CPU CPU CPU CPU
OpenMP Threads
Node 0
MPI Communication APIs
CPU CPU CPU CPU
OpenMP Threads
Node 1
CPU CPU CPU CPU
OpenMP Threads
Node N
• OpenMP allows expression of parallelism across homogeneous, shared-memory cores
• OpenMP’s first specification was in 1997
OpenCL Overview: Where it fits in
CPU CPU CPU CPU
OpenMP Threads
GPU
CUDA/OpenCL
Node 0
MPI Communication APIs
CPU CPU CPU CPU
OpenMP Threads
GPU
CUDA/OpenCL
Node 1
CPU CPU CPU CPU
OpenMP Threads
GPU
CUDA/OpenCL
Node N
• CUDA / OpenCL can leverage parallelism across heterogeneous computing devices in a system, even with distinct memory spaces
• CUDA’s first specification was in 2007
• OpenCL’s first specification was in 2008
OpenCL Overview: Where it fits in
CPU CPU CPU CPU
OpenMP Threads
DSP
OpenCL
Node 0
MPI Communication APIs
CPU CPU CPU CPU
OpenMP Threads
DSP
OpenCL
Node 1
CPU CPU CPU CPU
OpenMP Threads
DSP
OpenCL
Node N
• Focus on OpenCL as an open alternative to CUDA
• Focus on OpenCL devices other than GPUs, like DSPs
OpenCL Overview: Where it fits in
CPU CPU CPU CPU
OpenCL
Node 0
MPI Communication APIs
Node 1 Node N
CPU CPU CPU CPU
OpenCL
CPU CPU CPU CPU
OpenCL
• OpenCL is expressive enough to allow efficient control over all compute engines in a node.
OpenCL Overview: Where it fits in
• Host connected to one or more OpenCL devices
  – Commands are submitted from the host to OpenCL devices
  – Host can also be an OpenCL device
• OpenCL device is a collection of one or more compute units (cores)
  – OpenCL device is viewed by the programmer as a single virtual processor
  – Programmer does not need to know how many cores are in the device
  – OpenCL runtime efficiently divides the total processing effort across the cores
66AK2H12 KeyStone II Multicore DSP + ARM
[Diagram: four ARM A15 cores plus eight C66x DSP cores sharing multicore shared memory]
• Example on 66AK2H12
  – A15 running the OpenCL process acts as the host
  – 8 C66x DSPs available as a single device (Accelerator type, 8 compute units)
  – 4 A15s available as a single device (CPU type, 4 compute units)
OpenCL Overview: Model
Agenda
• OpenCL Overview
• OpenCL Usage
• Memory Model
• Synchronization
• Operational Flow
• Availability
OpenCL Usage: Platform Layer
Context context(CL_DEVICE_TYPE_ACCELERATOR);
vector<Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();

• Platform Layer APIs allow an OpenCL application to:
  – Query the platform for OpenCL devices
  – Query OpenCL devices for their configuration and capabilities
  – Create OpenCL contexts using one or more devices
• Context:
  – Environment within which work-items execute
  – Includes devices and their memories and command queues
• Kernels dispatched within this context will run on accelerators (DSPs)
• To change the program to run kernels on a CPU device instead, change CL_DEVICE_TYPE_ACCELERATOR to CL_DEVICE_TYPE_CPU
C:
int err = clGetDeviceIDs(NULL, CL_DEVICE_TYPE_CPU, 1, &device_id, NULL);
if (err != CL_SUCCESS) { … }
context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);
if (!context) { … }
commands = clCreateCommandQueue(context, device_id, 0, &err);
if (!commands) { … }

C++:
Context context(CL_DEVICE_TYPE_CPU);
std::vector<Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();
CommandQueue Q(context, devices[0]);
Usage: Contexts & Command Queues
Typical flow:
• Query the platform for all available accelerator devices
• Create an OpenCL context containing all those devices
• Query the context to enumerate the devices and place them in a vector
16 © Copyright Khronos Group, 2009
• OpenCL C Kernel
  – Basic unit of executable code on a device, similar to a C function
  – Can be data-parallel or task-parallel
• OpenCL C Program
  – Collection of kernels and other functions
• OpenCL applications queue kernel execution instances
  – Application defines command queues
    • Command queue is tied to a specific device
    • Any/all devices may have command queues
  – Application enqueues kernels to these queues
  – Kernels will then run asynchronously to the main application thread
  – Queues can be defined to execute in-order or allow out-of-order execution
Usage: Execution Model
Usage: Data Kernel Execution

Kernel enqueuing is a combination of:
1. OpenCL C kernel definition (expressing an algorithm for a work-item)
2. Description of the total number of work-items required for the kernel

CommandQueue Q(context, devices[0]);
Kernel kernel(program, "mpy2");
Q.enqueueNDRangeKernel(kernel, NDRange(1024));

kernel void mpy2(global int *p)
{
    int i = get_global_id(0);
    p[i] *= 2;
}
Work-items for a kernel execution are grouped into workgroups
– Workgroup is executed by a compute unit (core)
– Size of a workgroup can be specified, or left to the runtime to define
– Different workgroups can execute asynchronously across multiple cores
Q.enqueueNDRangeKernel(kernel, NDRange(1024), NDRange(128));
• The code line above enqueues a kernel with 1024 work-items grouped in workgroups of 128 work-items each
• 1024/128 => 8 workgroups that could execute simultaneously on 8 cores
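The workgroup arithmetic above can be sketched in plain C. The helper names below are illustrative, not part of the OpenCL API; the comments note the OpenCL C built-ins they mirror.

```c
#include <assert.h>

/* Sketch: how an NDRange is partitioned into workgroups, and how a
 * work-item's IDs relate to one another (assumes the global size is an
 * exact multiple of the workgroup size). */
typedef struct { int group_id, local_id; } wi_ids;

int num_workgroups(int global_size, int wg_size)
{
    return global_size / wg_size;
}

wi_ids ids_of(int global_id, int wg_size)
{
    wi_ids w;
    w.group_id = global_id / wg_size;   /* get_group_id(0)  */
    w.local_id = global_id % wg_size;   /* get_local_id(0)  */
    return w;
}
```

For example, `num_workgroups(1024, 128)` gives the 8 workgroups described above.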
• Execution order of work-items in a workgroup is not defined by the spec.
• Portable OpenCL code must assume they could all execute concurrently.
  – GPU implementations typically execute work-items within a workgroup concurrently.
  – CPU / DSP implementations typically serialize work-items within a workgroup.
  – OpenCL C barrier instructions can be used to ensure that all work-items in a workgroup reach the barrier before any work-item in the workgroup proceeds past the barrier.
• Execution order of workgroups associated with one kernel execution is not defined by the spec.
• Portable OpenCL code must assume any order is valid.
• No mechanism exists in OpenCL to synchronize or order workgroups.
Usage: Execution Order - Work-Items & Workgroups
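A serializing implementation can honor a barrier by splitting the kernel at the barrier and looping over all work-items for each phase. A minimal sketch (the kernel body here is a made-up example, not from the slides, and the function name is hypothetical):

```c
#include <assert.h>

/* Hypothetical kernel: each work-item writes lid*lid to local scratch,
 * hits a barrier, then reads its neighbor's value. A CPU/DSP runtime can
 * serialize this by running every work-item up to the barrier, then every
 * work-item after it. */
void run_workgroup_serialized(int wg_size, int *scratch, int *out)
{
    for (int lid = 0; lid < wg_size; ++lid)      /* phase 1: before barrier */
        scratch[lid] = lid * lid;

    /* barrier(CLK_LOCAL_MEM_FENCE) conceptually falls here */

    for (int lid = 0; lid < wg_size; ++lid)      /* phase 2: after barrier */
        out[lid] = scratch[(lid + 1) % wg_size];
}
```

Because phase 1 completes for every work-item before phase 2 begins, each neighbor read sees the fully written scratch array, exactly as the barrier guarantees.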
OpenCL Host Code
Context context(CL_DEVICE_TYPE_ACCELERATOR);
vector<Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();

Program program(context, devices, source);
program.build(devices);

Buffer buf(context, CL_MEM_READ_WRITE, sizeof(input));
Kernel kernel(program, "mpy2");
kernel.setArg(0, buf);

CommandQueue Q(context, devices[0]);

Q.enqueueWriteBuffer(buf, CL_TRUE, 0, sizeof(input), input);
Q.enqueueNDRangeKernel(kernel, NDRange(globSz), NDRange(wgSz));
Q.enqueueReadBuffer(buf, CL_TRUE, 0, sizeof(input), input);

• Host code uses the optional OpenCL C++ bindings
  – Creates a buffer and a kernel, sets the arguments, writes the buffer, invokes the kernel, and reads the buffer
• Kernel is purely algorithmic
  – No dealing with DMAs, cache flushing, communication protocols, etc.

OpenCL Kernel

kernel void mpy2(global int *p)
{
    int i = get_global_id(0);
    p[i] *= 2;
}

Usage: Example
• When compiling, tell gcc where the headers are: gcc –I$TI_OCL_INSTALL/include …
• Link with the TI OpenCL library as: gcc <obj files> -L$TI_OCL_INSTALL/lib –lTIOpenCL …
Usage: Compiling & Linking
Agenda
• OpenCL Overview
• OpenCL Usage
• Memory Model
• Synchronization
• Operational Flow
• Availability
• Private Memory
  – Per work-item
  – Typically registers
• Local Memory
  – Shared within a workgroup
  – Local to a compute unit (core)
• Global/Constant Memory
  – Shared across all compute units (cores) in a device
• Host Memory
  – Attached to the host CPU
  – Can be distinct from global memory (read/write buffer model)
  – Can be the same as global memory (map/unmap buffer model)

OpenCL Memory Model: Overview
[Diagram: OpenCL memory model - each work-item has private memory; each workgroup on a compute device shares local memory; all workgroups share the device's global/constant memory; the host has its own host memory]
• Buffers
  – Simple chunks of memory
  – Kernels can access them however they like (arrays, pointers, structs)
  – Kernels can read and write buffers
• Images
  – Opaque 2D or 3D formatted data structures
  – Kernels access them only via read_image() and write_image()
  – Each image can be read or written in a kernel, but not both
  – Only required for GPU devices!
OpenCL Memory: Resources
OpenCL Memory: Distinct Host and Global Device Memory

1. char *ary = malloc(globsz);
2. for (int i = 0; i < globsz; i++) ary[i] = i;
3. Buffer buf(context, CL_MEM_READ_WRITE, globsz);
4. Q.enqueueWriteBuffer(buf, CL_TRUE, 0, globsz, ary);
5. Q.enqueueNDRangeKernel(kernel, NDRange(globSz), NDRange(wgSz));
6. Q.enqueueReadBuffer(buf, CL_TRUE, 0, globsz, ary);
7. for (int i = 0; i < globsz; i++) … = ary[i];

[Diagram: host memory holds 0,1,2,3,…; the write copies it to device global memory; the kernel produces 0,2,4,6,…; the read copies the results back to host memory]
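The read/write buffer model above can be emulated with two plain arrays standing in for host memory and device global memory. The `emu_*` function names are illustrative stand-ins, not the OpenCL API:

```c
#include <assert.h>
#include <string.h>

/* Distinct memories: 'host' and 'dev' arrays. enqueueWriteBuffer and
 * enqueueReadBuffer become copies; the mpy2 kernel doubles each element
 * in device global memory. */
void emu_write_buffer(int *dev, const int *host, int n)
{
    memcpy(dev, host, n * sizeof(int));   /* host -> device copy */
}

void emu_run_mpy2(int *dev, int n)
{
    for (int i = 0; i < n; ++i)           /* the mpy2 kernel, serialized */
        dev[i] *= 2;
}

void emu_read_buffer(int *host, const int *dev, int n)
{
    memcpy(host, dev, n * sizeof(int));   /* device -> host copy */
}
```

After write, kernel, and read, a host array of 0,1,2,3 becomes 0,2,4,6, matching the diagram.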
OpenCL Memory: Shared Host and Global Device Memory

1. Buffer buf(context, CL_MEM_READ_WRITE, globsz);
2. char *ary = Q.enqueueMapBuffer(buf, CL_TRUE, CL_MAP_WRITE, 0, globsz);
3. for (int i = 0; i < globsz; i++) ary[i] = i;
4. Q.enqueueUnmapMemObject(buf, ary);
5. Q.enqueueNDRangeKernel(kernel, NDRange(globSz), NDRange(wgSz));
6. ary = Q.enqueueMapBuffer(buf, CL_TRUE, CL_MAP_READ, 0, globsz);
7. for (int i = 0; i < globsz; i++) … = ary[i];
8. Q.enqueueUnmapMemObject(buf, ary);

[Diagram: one shared host + device global memory region; each map transfers ownership to the host, each unmap transfers ownership back to the device]
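The map/unmap model trades the copies for ownership transfer over a single allocation. A sketch with a hypothetical ownership flag (a real runtime would perform cache operations at these points rather than set a flag):

```c
#include <assert.h>

enum owner { OWNER_DEVICE, OWNER_HOST };

struct shared_buf {
    int data[8];           /* one allocation visible to host and device */
    enum owner owner;
};

/* Map hands the host a pointer into the shared allocation; unmap returns
 * ownership to the device. No data is copied in either direction. */
int *emu_map(struct shared_buf *b)
{
    b->owner = OWNER_HOST;
    return b->data;
}

void emu_unmap(struct shared_buf *b)
{
    b->owner = OWNER_DEVICE;
}
```

Writes made through the mapped pointer are visible to the device after unmap, because both sides address the same memory.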
Agenda• OpenCL Overview• OpenCL Usage• Memory Model• Synchronization• Operational Flow• Availability
OpenCL Synchronization
• Kernel execution is defined to be the execution and completion of all work-items associated with an enqueued kernel command
• Kernel executions can synchronize at their boundaries through OpenCL events at the host API level
• Within a workgroup, work-items can synchronize through barriers and fences, expressed as OpenCL C built-in functions
• Workgroups cannot synchronize with other workgroups
• Work-items in different workgroups cannot synchronize
Agenda• OpenCL Overview• OpenCL Usage• Memory Model• Synchronization• Operational Flow• Availability
OpenCL Operational Flow

Host code:
Context context;
CommandQueue Q;
Buffer buffer;
Program program; program.build();
Kernel kernel(program, "kname");
Q.enqueueNDRangeKernel(kernel);

[Sequence diagram lanes: HOST, DSP 0 Core 0, DSP 0 Cores 0-7, DSP0 DDR]
• Context creation: download the monitor program, start the DSPs, establish the mailbox
• CommandQueue creation: start a host thread to monitor this queue and mailbox
• Buffer creation: allocate space in DSP DDR
• Program build: see if the program has already been compiled and is cached and, if so, reuse it; else cross-compile the program on the host for execution on the DSP, then load the program
• Kernel creation: establish kname as an entry point in the program
• Kernel enqueue: create a dispatch packet for the DSP, break the kernel into workgroups, reset the DSPs, send the dispatch packet, and send workgroups to all cores; cores perform cache operations and signal completion

Note: Items are shown occurring at their earliest point, but are often lazily executed at first need.
Agenda
• OpenCL Overview
• OpenCL Usage
• Memory Model
• Synchronization
• Operational Flow
• Availability
TI OpenCL 1.1 Products
• Advantech DSPC8681 with four 8-core DSPs
• Advantech DSPC8682 with eight 8-core DSPs
• Each 8-core DSP is an OpenCL device
• Ubuntu Linux PC as OpenCL host
• OpenCL in limited distribution Alpha
• GA approx. End of Q1 2014.
[Diagram: four TMS320C6678 devices, each with 8 C66x DSP cores and 1GB DDR3]
66AK2H12 KeyStone II Multicore DSP + ARM
[Diagram: four ARM A15 cores plus eight C66x DSP cores sharing multicore shared memory]
* Product is based on a published Khronos Specification, and is expected to pass the Khronos Conformance Testing Process. Current conformance status can be found at www.khronos.org/conformance.
• OpenCL on a chip
• 4 ARM A15s running Linux as the OpenCL host
• 8-core DSP as an OpenCL device
• 6MB on-chip shared memory
• Up to 10GB attached DDR3
• GA approx. end of Q1 2014
BACKUP
KeyStone OpenCL
int acc = 0;
for (int i = 0; i < N; ++i) acc += buffer[i];
return acc;
• Sequential in nature
• Not parallel
Usage: Vector Sum Reduction Example
Usage: Example // Vector Sum Reduction

kernel void sum_reduce(global float* buffer, global float* result)
{
    int gid = get_global_id(0); // which work-item am I of all work-items
    int lid = get_local_id(0);  // which work-item am I within my workgroup
for (int offset = get_local_size(0) >> 1; offset > 0; offset >>= 1)
{
if (lid < offset) buffer[gid] += buffer[gid + offset];
barrier(CLK_GLOBAL_MEM_FENCE);
}
if (lid == 0) result[get_group_id(0)] = buffer[gid];
}
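The tree reduction above can be checked with a serial emulation of one workgroup, where each inner pass over lid stands in for one barrier-separated phase:

```c
#include <assert.h>

/* Serial emulation of the sum_reduce kernel for a single workgroup whose
 * slice of 'buffer' starts at index 0 (so gid == lid). Each inner loop
 * over lid is one barrier-separated phase of the tree; wg_size is assumed
 * to be a power of two. */
float emu_sum_reduce(float *buffer, int wg_size)
{
    for (int offset = wg_size >> 1; offset > 0; offset >>= 1)
        for (int lid = 0; lid < wg_size; ++lid)
            if (lid < offset)
                buffer[lid] += buffer[lid + offset];

    return buffer[0];   /* what work-item 0 stores to result[] */
}
```

Each phase halves the number of active work-items, so a workgroup of size N finishes in log2(N) barrier-separated steps.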
kernel void sum_reduce(global float* buffer, local float *acc, global float* result)
{
int gid = get_global_id(0); //which work-item am I out of all work-items
int lid = get_local_id (0); // which work-item am I within my workgroup
bool first_wi = (lid == 0);
bool last_wi = (lid == get_local_size(0) - 1);
int wg_index = get_group_id (0); // which workgroup am I
if (first_wi) acc[wg_index] = 0;
acc[wg_index] += buffer[gid];
if (last_wi) result[wg_index] = acc[wg_index];
}
• Not valid on a GPU
• Could be valid on a device that serializes work-items in a workgroup, e.g. a DSP
Usage: Example // Vector Sum Reduction (Iterative DSP)
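Serialized, the kernel above reduces to a plain accumulation loop, which is why it can be valid on a DSP but not on a GPU. A sketch under that serialization assumption (function name is illustrative):

```c
#include <assert.h>

/* What the iterative kernel computes when the runtime executes the
 * workgroup's work-items strictly one after another: the first work-item
 * zeroes the accumulator, every work-item adds its own element, and the
 * last work-item would store the total to result[]. Concurrent work-items
 * would race on the accumulator, so this ordering is essential. */
float emu_iterative_reduce(const float *buffer, int wg_size)
{
    float acc = 0.0f;                        /* first_wi: acc[wg_index] = 0 */
    for (int lid = 0; lid < wg_size; ++lid)
        acc += buffer[lid];                  /* each work-item's += */
    return acc;                              /* last_wi stores the result */
}
```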
kernel void sum_reduce(global float* buffer, local float* scratch, global float* result)
{
int lid = get_local_id (0); // which work-item am I within my workgroup
scratch[lid] = buffer[get_global_id(0)];
barrier(CLK_LOCAL_MEM_FENCE);
for (int offset = get_local_size(0) >> 1; offset > 0; offset >>= 1)
{
    if (lid < offset) scratch[lid] += scratch[lid + offset];
    barrier(CLK_LOCAL_MEM_FENCE);
}
if (lid == 0) result[get_group_id(0)] = scratch[lid];
}
Usage: Example // Vector Sum Reduction (Local Memory)