An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar

An Introduction to OpenCL Using AMD GPUs
Chris Mason, Product Manager, Acceleware
September 17, 2014


DESCRIPTION

This deck presents highlights from the Introduction to OpenCL™ Programming Webinar presented by Acceleware & AMD on Sept. 17, 2014. Watch a replay of this popular webinar on the AMD Dev Central YouTube channel here: https://www.youtube.com/user/AMDDevCentral or here for the direct link: http://bit.ly/1r3DgfF

TRANSCRIPT

Page 1: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar

An Introduction to OpenCL™

Using AMD GPUs

Chris Mason

Product Manager, Acceleware

September 17, 2014

Page 2: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


About Acceleware

Programmer Training

– OpenCL, CUDA, OpenMP

– Over 100 courses taught

– http://acceleware.com/training

Consulting Services

– Completed projects for: Oil & Gas, Medical, Finance, Security & Defence, Computer Aided Engineering, Media & Entertainment

– http://acceleware.com/services

GPU Accelerated Software

– Seismic imaging & modeling

– Electromagnetics


Page 3: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


Seismic Imaging & Modeling

AxWave – Seismic forward modeling

– 2D, 3D, constant and variable density models

– High fidelity finite-difference modeling

AxRTM – High performance Reverse Time Migration application

– Isotropic, VTI and TTI media

HPC Implementation – Optimized for GPUs

– Efficient multi-GPU scaling


Page 4: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


Electromagnetics

AxFDTD™

– Finite-Difference Time-Domain Electromagnetic Solver

– Optimized for GPUs

– Sub-gridding and large feature coverage

– Multi-GPU, GPU clusters, GPU targeting

– Available from:


Page 5: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


Consulting Services (Industry / Application / Work Completed / Results)

Finance – Option Pricing: Debugged & optimized existing code; implemented the Leisen-Reimer version of the binomial model for stock option pricing. Result: 30-50x performance improvement compared to single-threaded CPU code.

Security & Defense – Detection System: Replaced legacy Cell-based infrastructure with GPUs; implemented GPU accelerated X-ray iterative image reconstruction and explosive detection algorithms. Result: Surpassed the performance targets; reduced hardware cost by a factor of 10.

CAE – SIMULIA Abaqus: Developed a GPU accelerated version; conducted a finite-element analysis and developed a library to offload the LDLT factorization portion of the multi-frontal solver to GPUs. Result: Delivered an accelerated (2-3x) solution that supports NVIDIA and AMD GPUs.

Medical – CT Reconstruction Software: Developed a GPU accelerated application for image reconstruction on CT scanners and implemented advanced features including a job batch manager, filtering and bad pixel corrections. Result: Accelerated back projection by 31x.

Oil & Gas – Seismic Application: Converted MATLAB research code into a standalone application & improved performance via algorithmic optimizations. Result: 20-30x speedup.


Page 6: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


Programmer Training

OpenCL, CUDA, OpenMP

Teachers with real world experience

Hands-on lab exercises

Progressive lectures

Small class sizes to maximize learning

90 days post training support

“The level of detail is fantastic. The course did not focus on syntax but rather on how to expertly program for the GPU. I loved the course and I hope that we can get more of our team to take it.”

Jason Gauci, Software Engineer

Lockheed Martin


Page 7: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


Outline

Introduction to the OpenCL Architecture – Contexts, Devices, Queues

Memory and Error Management

Data-Parallel Computing – Kernel Launches

GPU Kernels


Page 8: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


Introduction To The OpenCL Architecture

Page 9: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


OpenCL Architecture Introduction and Terminology

Four high level models describe the key OpenCL concepts:

– Platform Model – high level host/device interaction

– Execution Model – OpenCL programs execute on host/device

– Memory Model – different memory resources on device

– Programming Model – types of parallel workloads


Page 10: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


OpenCL Platform Model

A host connected to one or more devices

– Example: GPUs, DSPs, FPGAs

A program can work with devices from multiple vendors

A platform is a host and a collection of devices that share resources and execute programs

[Diagram: a host connected to Device 1 (GPU), Device 2 (CPU), ..., Device N (GPU)]

Page 11: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


OpenCL Execution Model

The host defines a context to control the device

– The context manages the following resources:

– Devices – hardware to run on

– Kernels – functions to run on the hardware

– Program Objects – device executables

– Memory Objects – memory visible to host and device

A command queue schedules commands for execution on the device


Page 12: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


OpenCL API - Platform and Runtime Layer

The OpenCL API is divided into two layers: Platform and Runtime

The platform layer allows the host program to discover devices and their capabilities

The runtime layer allows the host program to work with contexts once created


Page 13: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


Program Set Up

To set up an OpenCL program, the typical steps are as follows:

1. Query and select the platforms (e.g., AMD)

2. Query the devices

3. Create a context

4. Create a command queue

5. Read/Write to the device

6. Launch the kernel

(Steps 1-3 use the platform layer; steps 4-6 use the runtime layer.)

Page 14: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


Sample Platform Layer C Code


// Get the platform ID
cl_platform_id platform;
clGetPlatformIDs(1, &platform, NULL);

// Get the first GPU device associated with the platform
cl_device_id device;
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

// Create an OpenCL context for the GPU device
cl_context context;
context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);

Page 15: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


OpenCL Runtime Layer

A command queue operates on contexts, memory, and program objects

Each device can have one or more command queues

Operations in the command queue execute in order unless out-of-order mode is enabled

[Diagram: a command queue holding Copy Data, Copy Data, Launch Kernel, Copy Data commands]
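As a hedged sketch that is not taken from the slides, this is how the out-of-order mode mentioned above can be requested with the OpenCL 1.x API used throughout this deck; the context and device variables are assumed to come from the platform-layer code on the previous slide, and the queue name is illustrative.

// Hedged sketch: an out-of-order command queue (OpenCL 1.x API).
// context and device are assumed from the platform-layer example above.
cl_int err;
cl_command_queue ooQueue = clCreateCommandQueue(
    context, device,
    CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE,   // commands may complete in any order
    &err);

// With out-of-order execution enabled, ordering between commands must be
// expressed explicitly through the event wait-list arguments of clEnqueue* calls.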

Page 16: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


Memory and Error Management

Page 17: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


OpenCL Buffers

A buffer stores a one-dimensional collection of elements

Buffer objects use the cl_mem type

– cl_mem is an abstract memory container (i.e., a handle)

– The buffer object cannot be dereferenced on the host

• cl_mem a; a[0] = 5; // Not allowed

OpenCL commands interact with buffers


Page 18: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


OpenCL Syntax – C Memory Management

Example:


// Create an OpenCL command queue
cl_int err;
cl_command_queue queue;
queue = clCreateCommandQueue(context, device, 0, &err);

// Allocate memory on the device
const int N = 5;
int nBytes = N * sizeof(int);
cl_mem a = clCreateBuffer(context, CL_MEM_READ_WRITE, nBytes, NULL, &err);

int hostarr[N] = {3, 1, 4, 1, 5};

// Transfer memory to the device
err = clEnqueueWriteBuffer(queue, a, CL_TRUE, 0, nBytes, hostarr,
                           0, NULL, NULL);
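The slide shows the host-to-device transfer only. A hedged companion sketch, not from the deck, for copying results back with clEnqueueReadBuffer, reusing the queue, buffer a, N and nBytes defined above:

// Hedged sketch: read the buffer back to the host after the device has
// finished with it. CL_TRUE makes the call block until the data is ready.
int result[N];
err = clEnqueueReadBuffer(queue, a, CL_TRUE, 0, nBytes, result,
                          0, NULL, NULL);
// On success (err == CL_SUCCESS), result[] now holds the device copy of the data.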

Page 19: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


OpenCL Syntax – Error Management

Host code manages errors:

– Most host-side OpenCL function calls return a cl_int error code

– "Create" calls return the created object instead; for those, the error code is passed by reference as the last argument

– Error codes are negative values defined in cl.h; CL_SUCCESS == 0
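A hedged sketch of the checking pattern this slide describes; the CHECK macro name is illustrative and not part of OpenCL, and the variables reuse the memory-management example above.

#include <stdio.h>
#include <CL/cl.h>

// Illustrative helper (not an OpenCL API): report any non-zero error code.
#define CHECK(err) \
    if ((err) != CL_SUCCESS) { \
        fprintf(stderr, "OpenCL error %d at line %d\n", (int)(err), __LINE__); \
    }

// Non-"Create" calls return the error code directly...
cl_int err = clEnqueueWriteBuffer(queue, a, CL_TRUE, 0, nBytes, hostarr,
                                  0, NULL, NULL);
CHECK(err);

// ...while "Create" calls return the new object and pass the error code
// back through the last argument.
cl_mem b = clCreateBuffer(context, CL_MEM_READ_WRITE, nBytes, NULL, &err);
CHECK(err);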


Page 20: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


OpenCL Syntax – Clean Up

All objects that are created can

be released with the following

functions: – clReleaseContext

– clReleaseCommandQueue

– clReleaseMemObject
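A hedged sketch of the clean-up step for the objects created in the earlier examples (buffer a, queue, and context), released in roughly the reverse order of creation:

// Hedged sketch: release the objects from the earlier examples.
clReleaseMemObject(a);          // the cl_mem buffer
clReleaseCommandQueue(queue);
clReleaseContext(context);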


Page 21: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


Data-Parallel Computing

Page 22: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


Data-Parallel Computing

Data-parallelism

1. Performs operations on a data set organized into a common structure (e.g. an array)

2. Tasks work collectively on the same structure with each task operating on its own portion of the structure

3. Tasks perform identical operations on their portions of the structure. Operations on each portion must not be data dependent!


Page 23: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


Data Dependence

Data dependence occurs when a program statement refers to the data of a preceding statement.

Data dependence limits parallelism


a = 2 * x;  b = 2 * y;      c = 3 * x;   // These 3 statements are independent
a = 2 * x;  b = 2 * a * a;  c = b * 9;   // b depends on a; c depends on b (and a)

Page 24: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


Data-Parallel Computing Example

Data set consisting of arrays A, B, and C

Same operations performed on each element: Cx = Ax + Bx

Two tasks operating on a subset of the arrays. Tasks 0 and 1 are independent. Could have more tasks.

[Diagram: arrays A0-A7, B0-B7 and C0-C7 with the operation Cx = Ax + Bx split between Task 0 and Task 1]

Page 25: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


The OpenCL Programming Model

OpenCL is a heterogeneous model, including provisions for both host and device

[Diagram: a host (CPU, chipset, DRAM) connected over PCIe to a device (DSP, GPU, or FPGA) with its own DRAM]

Page 26: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


The OpenCL Programming Model

Data-parallel portions of an algorithm are executed on the device as kernels

– Kernels are C functions with some restrictions, and a few language extensions

Only one kernel is executed at a time

A kernel is executed by many work-items

– Each work-item executes the same kernel


Page 27: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


OpenCL Work-Items

OpenCL work-items are conceptually similar to data-parallel tasks or threads

– Each work-item performs the same operations on a subset of a data structure

– Work-items execute independently

OpenCL work-items are not CPU threads

– OpenCL work-items are extremely lightweight

• Little creation overhead

• Instant context-switching

– Work-items must execute the same kernel


Page 28: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


OpenCL Work-Item Hierarchy

OpenCL is designed to execute millions of work-items

Work-items are grouped together into work-groups

– Maximum # of work-items per work-group (HW limit)

– Query CL_DEVICE_MAX_WORK_GROUP_SIZE with clGetDeviceInfo (see the sketch below)

• Typically 256-1024

The entire collection of work-items is called the N-Dimensional Range (NDRange)
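A hedged sketch, not from the slides, of the query mentioned above, reusing the device handle obtained from clGetDeviceIDs earlier:

// Hedged sketch: query the maximum work-group size for this device.
size_t maxWorkGroupSize;
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                sizeof(maxWorkGroupSize), &maxWorkGroupSize, NULL);
// maxWorkGroupSize is the hardware limit on work-items per work-group
// (typically 256-1024, as noted above).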


Page 29: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


OpenCL Work-Item Hierarchy

Work-groups and NDRange can be 1D, 2D, or 3D

Dimensions are set at launch time

[Diagram: an NDRange made up of a 3x2 grid of work-groups (0,0)-(2,1); work-group (1,1) is expanded into a 4x3 grid of work-items (0,0)-(3,2)]

Page 30: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


The OpenCL Programming Model

The host launches kernels

The host executes serial code between device kernel launches

– Memory management

– Data exchange to/from device (usually)

– Error handling

[Diagram: alternating host (serial code) and device (NDRange of work-groups) execution phases]

Page 31: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


Data-Parallel Computing on GPUs

Data-parallel computing maps well to GPUs:

– Identical operations executed on many data elements in parallel

• Simplified flow control allows an increased ratio of compute logic (ALUs) to control logic

[Diagram: CPU die dominated by control logic and caches (L1/L2/L3) with a few ALUs, versus GPU die dominated by ALUs; each has its own DRAM]

Page 32: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


OpenCL API – Launching a Kernel (C)

How to launch a kernel:


// 3D work-groups; let the OpenCL runtime determine the local work size
size_t const globalWorkSize[3] = {512, 512, 512};
clEnqueueNDRangeKernel(queue, kernel, 3, NULL,
                       globalWorkSize, NULL,
                       0, NULL, NULL);

// 2D work-groups; specify the local work size explicitly
size_t const globalWorkSize[2] = {512, 512};
size_t const localWorkSize[2] = {16, 16};
clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                       globalWorkSize, localWorkSize,
                       0, NULL, NULL);
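The deck does not show how the kernel argument passed to clEnqueueNDRangeKernel is obtained. A hedged sketch of the usual path, building a program from source and creating the kernel object; the source string and the name "inc" refer to the kernel example later in this deck, and aBuffer stands in for a previously created cl_mem buffer of floats.

// Hedged sketch: build a program and create the kernel object used above.
const char* source =
    "__kernel void inc(__global float* a, float b)"
    "{ int i = get_global_id(0); a[i] = a[i] + b; }";
cl_int err;
cl_program program = clCreateProgramWithSource(context, 1, &source, NULL, &err);
err = clBuildProgram(program, 1, &device, NULL, NULL, NULL);
cl_kernel kernel = clCreateKernel(program, "inc", &err);

// Kernel arguments are set by index before enqueueing the kernel.
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &aBuffer);  // aBuffer: assumed cl_mem buffer
float b = 1.0f;
err = clSetKernelArg(kernel, 1, sizeof(float), &b);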

Page 33: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


GPU Kernels

Page 34: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


Writing OpenCL Kernels

Denoted by __kernel function qualifier

– e.g. __kernel void myKernel(__global float* a)

Queued from host, executed on device

A few noteworthy restrictions:

– No access to host memory (in general!)

– Must return void

– No function pointers

– No static variables

– No recursion (no stack)


Page 35: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


OpenCL Syntax - Kernels

Kernels have built-in functions:

– The argument dim ranges from 0 to 2, depending on the dimensionality of the kernel launch

– get_work_dim(): number of dimensions in use

– get_global_id(dim): unique index of a work-item

– get_global_size(dim): number of global work-items


Page 36: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


OpenCL Syntax – Kernels (Continued)

Built-in function listing (continued):

– get_local_id(dim): unique index of the work-item within the work-group

– get_local_size(dim): number of work-items within the work-group

– get_group_id(dim): index of the work-group

– get_num_groups(dim): number of work-groups

– Cannot vary the size of work-groups or work-items during a kernel call


Page 37: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


OpenCL Syntax - Kernels

Built-in functions are typically used to determine unique work-item identifiers:

[Diagram: a one-dimensional NDRange of 15 work-items (get_work_dim() == 1) split into 3 work-groups of 5 work-items each; get_group_id(0) runs 0-2, get_local_id(0) runs 0-4 within each group, get_global_id(0) runs 0-14, and get_local_size(0) = 5]

get_global_id(0) == get_group_id(0) * get_local_size(0) + get_local_id(0)

Page 38: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


OpenCL Syntax – Thread Identifiers

Result for each kernel launched with the following execution configuration: Dimension = 1, Global work size = 12, Local work size = 4

__kernel void MyKernel(__global int* a) { int idx = get_global_id(0); a[idx] = 7; }
a: 7 7 7 7 7 7 7 7 7 7 7 7

__kernel void MyKernel(__global int* a) { int idx = get_global_id(0); a[idx] = get_group_id(0); }
a: 0 0 0 0 1 1 1 1 2 2 2 2

__kernel void MyKernel(__global int* a) { int idx = get_global_id(0); a[idx] = get_local_id(0); }
a: 0 1 2 3 0 1 2 3 0 1 2 3

Page 39: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


Code Example - Kernel

Kernel is executed by N work-items

– Each work-item has a unique ID between 0 and N-1

Standard C version:

void inc(float* a, float b, int N)
{
    for (int i = 0; i < N; i++)
        a[i] = a[i] + b;
}

void main()
{
    ...
    inc(a, b, N);
}

OpenCL version:

__kernel void inc(__global float* a, float b)
{
    int i = get_global_id(0);
    a[i] = a[i] + b;
}

void main()
{
    ...
    clEnqueueNDRangeKernel(..., ...);
}

Page 40: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


OpenCL Syntax - Kernels

All C operators are supported

– e.g. +, *, /, ^, >, >>

Many functions from the standard math library

– e.g. sin(), cos(), ceil(), fabs()

Can write/call your own non-kernel functions

– float myDeviceFunction(__global float *a)

– Non-kernel functions cannot be called by the host

Control flow statements too!

– e.g. if(), while(), for()
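A hedged illustrative kernel, not taken from the slides, combining the points above: a non-kernel helper function, a math-library call, and control flow (the function names are made up for the example).

// Non-kernel device function: callable from kernels, not from the host.
float scale(float x)
{
    return fabs(x) * 0.5f;
}

__kernel void smooth(__global float* a, int n)
{
    int i = get_global_id(0);
    if (i < n)                       // control flow: guard against extra work-items
    {
        a[i] = scale(sin(a[i]));     // standard math library call
    }
}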


Page 41: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


OpenCL Syntax - Synchronization

Kernel launches are asynchronous

– Control returns to the CPU immediately

– Subsequent commands added to the command queue will wait until the kernel has completed

If you want to synchronize on the host:

– Implicit synchronization via blocking commands, e.g. clEnqueueReadBuffer() with the blocking argument set to CL_TRUE

– Explicit synchronization via clFinish(queue), which blocks on the host until all outstanding OpenCL commands in the given queue are complete
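A hedged sketch of the two host-side synchronization options described above; queue, kernel, a, nBytes and hostarr reuse names from the earlier examples, and the launch size is illustrative.

// Hedged sketch of both synchronization options.
size_t const globalWorkSize[1] = {12};   // illustrative 1D launch size

// Option 1: implicit synchronization via a blocking read; the call returns
// only after the kernel and the copy back to hostarr have completed.
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, globalWorkSize, NULL, 0, NULL, NULL);
clEnqueueReadBuffer(queue, a, CL_TRUE, 0, nBytes, hostarr, 0, NULL, NULL);

// Option 2: explicit synchronization; clFinish blocks the host until every
// outstanding command in the queue has completed.
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, globalWorkSize, NULL, 0, NULL, NULL);
clFinish(queue);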


Page 42: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


Questions?

OpenCL training courses and consulting services

Acceleware Ltd.

Twitter: @Acceleware

Web: http://acceleware.com/opencl-training

Email: [email protected]

-------------------

Stay in the know about developer news, tools, SDKs, technical presentations, events and future webinars. Connect with AMD Developer Central here:

AMD Developer Central

Twitter: @AMDDevCentral

Web: http://developer.amd.com/

YouTube: https://www.youtube.com/user/AMDDevCentral

Developer Forums: http://devgurus.amd.com/welcome


Page 43: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


An Overview of GPU Hardware

Page 44: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


What is the GPU?

The GPU is a graphics processing unit

Historically used to offload graphics computations from the CPU

Can either be a dedicated video card, integrated on the motherboard, or on the same die as the CPU

– Highest performance will require a dedicated video card


Page 45: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


Why use GPUs? Performance!


Intel Xeon E5-2697 v2 (Ivy Bridge): 12 processing cores, 2.7-3.4* GHz, 59.7 GB/s memory bandwidth per socket, 576 peak single-precision Gflops** @ 3.0 GHz, 288 peak double-precision Gflops** @ 3.0 GHz, 4.4 Gflops/Watt (single), >>16 GB total memory

AMD Opteron 6386SE (Bulldozer): 16 processing cores, 2.8-3.5* GHz, 59.7 GB/s memory bandwidth per socket, 410 peak single-precision Gflops** @ 3.2 GHz, 205 peak double-precision Gflops** @ 3.2 GHz, 2.9 Gflops/Watt (single), >>16 GB total memory

AMD FirePro W9100 (Volcanic Islands): 2816 processing cores, 930 MHz, 320 GB/s memory bandwidth, 5240 peak single-precision Gflops**, 2620 peak double-precision Gflops**, 19 Gflops/Watt (single), 16 GB total memory

AMD FirePro S10000 (Southern Islands): 3584 processing cores, 825 MHz, 480 GB/s memory bandwidth, 5910 peak single-precision Gflops**, 1480 peak double-precision Gflops**, 15.76 Gflops/Watt (single), 6 GB total memory

* Indicates range of clock frequencies supported via Intel Turbo Boost and AMD Turbo CORE Technology

** At maximum frequency when all cores are executing

Page 46: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


GPU Potential Advantages

AMD FirePro W9100 vs. Intel Xeon E5-2697 v2:

9x more single-precision floating-point throughput

9x more double-precision floating-point throughput

5x higher memory bandwidth

Page 47: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


GPU Disadvantages

Architecture not as flexible as a CPU

Must rewrite algorithms and maintain software in GPU languages

Attached to the CPU via relatively slow PCIe

– 16 GB/s bi-directional for PCIe 3.0 x16

Limited memory (though 6-16 GB is reasonable for many applications)


Page 48: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


Software Approaches for Acceleration

Programming Languages (e.g. OpenCL) – Maximum flexibility

OpenACC Directives – Simple programming for heterogeneous systems

– Simple compiler hints/pragmas

– Compiler parallelizes code

– Target a variety of platforms

Libraries – "Drop-in" acceleration

– In-depth GPU knowledge not required

– Highly optimized by GPU experts

– Provides functions used in a broad range of applications (e.g. FFT, BLAS)

[Diagram: the slide arranges these three approaches along a programmer-effort axis]

Page 49: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


An Introduction to OpenCL

Page 50: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


OpenCL Overview

Parallel computing architecture standardized by the Khronos Group

OpenCL:

– Is a royalty-free standard

– Provides an API to coordinate parallel computation across heterogeneous processors

– Defines a cross-platform programming language


Page 51: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


OpenCL Versions

To date there are four different versions of OpenCL

– OpenCL 1.0

– OpenCL 1.1

– OpenCL 1.2

– OpenCL 2.0 (finalized November 2013)

Different versions support different functionality


Supported OpenCL version by hardware vendor:

– AMD: OpenCL 1.2

– Intel: OpenCL 1.2

– NVIDIA: OpenCL 1.1

Page 52: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


OpenCL Extensions

Optional functionality is exposed through extensions

– Vendors are not required to support extensions to achieve conformance

– However, extensions are expected to be widely available

Some OpenCL extensions are approved by the OpenCL working group

– These extensions are likely to be promoted to core functionality in future versions of the standard

Multi-vendor and vendor-specific extensions do not need approval by the working group
