An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar

An Introduction to OpenCL Using AMD GPUs
Chris Mason, Product Manager, Acceleware
September 17, 2014


DESCRIPTION

This deck presents highlights from the Introduction to OpenCL™ Programming Webinar presented by Acceleware & AMD on Sept. 17, 2014. Watch a replay of this popular webinar on the AMD Dev Central YouTube channel here: https://www.youtube.com/user/AMDDevCentral or here for the direct link: http://bit.ly/1r3DgfF

TRANSCRIPT

Page 1: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar

An Introduction to OpenCL™

Using AMD GPUs

Chris Mason

Product Manager, Acceleware

September 17, 2014

Page 2: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


About Acceleware

Programmer Training

– OpenCL, CUDA, OpenMP

– Over 100 courses taught

– http://acceleware.com/training

Consulting Services

– Completed projects for: Oil & Gas, Medical, Finance, Security & Defence, Computer Aided Engineering, Media & Entertainment

– http://acceleware.com/services

GPU Accelerated Software

– Seismic imaging & modeling

– Electromagnetics


Page 3: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


Seismic Imaging & Modeling

AxWave – Seismic forward modeling

– 2D, 3D, constant and variable density models

– High fidelity finite-difference modeling

AxRTM – High performance Reverse Time Migration application

– Isotropic, VTI and TTI media

HPC Implementation – Optimized for GPUs

– Efficient multi-GPU scaling


Page 4: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


Electromagnetics

AxFDTD™

– Finite-Difference Time-Domain Electromagnetic Solver

– Optimized for GPUs

– Sub-gridding and large feature coverage

– Multi-GPU, GPU clusters, GPU targeting

– Available from:


Page 5: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


Consulting Services (Industry / Application / Work Completed / Results)

Finance – Option Pricing: Debugged & optimized existing code; implemented the Leisen-Reimer version of the binomial model for stock option pricing. Result: 30-50x performance improvement compared to single-threaded CPU code.

Security & Defense – Detection System: Replaced legacy Cell-based infrastructure with GPUs; implemented GPU accelerated X-ray iterative image reconstruction and explosive detection algorithms. Result: Surpassed the performance targets; reduced hardware cost by a factor of 10.

CAE – SIMULIA Abaqus: Developed a GPU accelerated version; conducted a finite-element analysis and developed a library to offload the LDLT factorization portion of the multi-frontal solver to GPUs. Result: Delivered an accelerated (2-3x) solution that supports NVIDIA and AMD GPUs.

Medical – CT Reconstruction Software: Developed a GPU accelerated application for image reconstruction on CT scanners and implemented advanced features including a job batch manager, filtering and bad pixel corrections. Result: Accelerated back projection by 31x.

Oil & Gas – Seismic Application: Converted MATLAB research code into a standalone application & improved performance via algorithmic optimizations. Result: 20-30x speedup.


Page 6: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


Programmer Training

OpenCL, CUDA, OpenMP

Teachers with real world experience

Hands-on lab exercises

Progressive lectures

Small class sizes to maximize learning

90 days post training support

“The level of detail is fantastic. The course did not focus on syntax but rather on how to expertly program for the GPU. I loved the course and I hope that we can get more of our team to take it.”

Jason Gauci, Software Engineer

Lockheed Martin


Page 7: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


Outline

Introduction to the OpenCL Architecture – Contexts, Devices, Queues

Memory and Error Management

Data-Parallel Computing – Kernel Launches

GPU Kernels


Page 8: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


Introduction To The OpenCL Architecture

Page 9: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


OpenCL Architecture Introduction and Terminology

Four high level models describe the key OpenCL concepts:

– Platform Model – high level host/device interaction

– Execution Model – OpenCL programs execute on host/device

– Memory Model – different memory resources on device

– Programming Model – types of parallel workloads


Page 10: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


OpenCL Platform Model

A host connected to one or more devices

– Example: GPUs, DSPs, FPGAs

A program can work with devices from multiple vendors

A platform is a host and a collection of devices that share resources and execute programs

[Diagram: a host connected to Device 1 (GPU), Device 2 (CPU), ..., Device N (GPU)]

Page 11: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


OpenCL Execution Model

The host defines a context to control the device

– The context manages the following resources:

– Devices – hardware to run on

– Kernels – functions to run on the hardware

– Program Objects – device executables

– Memory Objects – memory visible to host and device

A command queue schedules commands for execution on the device


Page 12: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


OpenCL API - Platform and Runtime Layer

The OpenCL API is divided into two layers: Platform and Runtime

The platform layer allows the host program to discover devices and their capabilities

The runtime layer allows the host program to work with contexts once created


Page 13: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


Program Set Up

To set up an OpenCL program, the typical steps are as follows:

1. Query and select the platforms (e.g., AMD)

2. Query the devices

3. Create a context

4. Create a command queue

5. Read/Write to the device

6. Launch the kernel

(Steps 1-3 use the platform layer; steps 4-6 use the runtime layer.)

Page 14: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


Sample Platform Layer C Code


// Get the platform ID
cl_platform_id platform;
clGetPlatformIDs(1, &platform, NULL);

// Get the first GPU device associated with the platform
cl_device_id device;
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

// Create an OpenCL context for the GPU device
cl_context context;
context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);

Page 15: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


OpenCL Runtime Layer

A command queue operates on contexts, memory, and program objects

Each device can have one or more command queues

Operations in the command queue execute in order unless out-of-order mode is enabled

[Diagram: a command queue holding Copy Data, Copy Data, Launch Kernel, Copy Data commands]
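As a hedged sketch that is not taken from the slides, this is how the out-of-order mode mentioned above can be requested with the OpenCL 1.x API used throughout this deck; the context and device variables are assumed to come from the platform-layer code on the previous slide, and the queue name is illustrative.

// Hedged sketch: an out-of-order command queue (OpenCL 1.x API).
// context and device are assumed from the platform-layer example above.
cl_int err;
cl_command_queue ooQueue = clCreateCommandQueue(
    context, device,
    CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE,   // commands may complete in any order
    &err);

// With out-of-order execution enabled, ordering between commands must be
// expressed explicitly through the event wait-list arguments of clEnqueue* calls.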

Page 16: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


Memory and Error Management

Page 17: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


OpenCL Buffers

A buffer stores a one-dimensional collection of elements

Buffer objects use the cl_mem type

– cl_mem is an abstract memory container (i.e., a handle)

– The buffer object cannot be dereferenced on the host

• cl_mem a; a[0] = 5; // Not allowed

OpenCL commands interact with buffers


Page 18: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


OpenCL Syntax – C Memory Management

Example:


// Create an OpenCL command queue
cl_int err;
cl_command_queue queue;
queue = clCreateCommandQueue(context, device, 0, &err);

// Allocate memory on the device
const int N = 5;
int nBytes = N * sizeof(int);
cl_mem a = clCreateBuffer(context, CL_MEM_READ_WRITE, nBytes, NULL, &err);

int hostarr[N] = {3, 1, 4, 1, 5};

// Transfer memory to the device
err = clEnqueueWriteBuffer(queue, a, CL_TRUE, 0, nBytes, hostarr,
                           0, NULL, NULL);
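The slide shows the host-to-device transfer only. A hedged companion sketch, not from the deck, for copying results back with clEnqueueReadBuffer, reusing the queue, buffer a, N and nBytes defined above:

// Hedged sketch: read the buffer back to the host after the device has
// finished with it. CL_TRUE makes the call block until the data is ready.
int result[N];
err = clEnqueueReadBuffer(queue, a, CL_TRUE, 0, nBytes, result,
                          0, NULL, NULL);
// On success (err == CL_SUCCESS), result[] now holds the device copy of the data.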

Page 19: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


OpenCL Syntax – Error Management

Host code manages errors:

– Most host-side OpenCL function calls return a cl_int error code

– "Create" calls return the created object instead; for those, the error code is passed by reference as the last argument

– Error codes are negative values defined in cl.h; CL_SUCCESS == 0
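A hedged sketch of the checking pattern this slide describes; the CHECK macro name is illustrative and not part of OpenCL, and the variables reuse the memory-management example above.

#include <stdio.h>
#include <CL/cl.h>

// Illustrative helper (not an OpenCL API): report any non-zero error code.
#define CHECK(err) \
    if ((err) != CL_SUCCESS) { \
        fprintf(stderr, "OpenCL error %d at line %d\n", (int)(err), __LINE__); \
    }

// Non-"Create" calls return the error code directly...
cl_int err = clEnqueueWriteBuffer(queue, a, CL_TRUE, 0, nBytes, hostarr,
                                  0, NULL, NULL);
CHECK(err);

// ...while "Create" calls return the new object and pass the error code
// back through the last argument.
cl_mem b = clCreateBuffer(context, CL_MEM_READ_WRITE, nBytes, NULL, &err);
CHECK(err);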


Page 20: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


OpenCL Syntax – Clean Up

All objects that are created can

be released with the following

functions: – clReleaseContext

– clReleaseCommandQueue

– clReleaseMemObject
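A hedged sketch of the clean-up step for the objects created in the earlier examples (buffer a, queue, and context), released in roughly the reverse order of creation:

// Hedged sketch: release the objects from the earlier examples.
clReleaseMemObject(a);          // the cl_mem buffer
clReleaseCommandQueue(queue);
clReleaseContext(context);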


Page 21: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


Data-Parallel Computing

Page 22: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


Data-Parallel Computing

Data-parallelism

1. Performs operations on a data set organized into a common structure (e.g. an array)

2. Tasks work collectively on the same structure with each task operating on its own portion of the structure

3. Tasks perform identical operations on their portions of the structure. Operations on each portion must not be data dependent!


Page 23: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


Data Dependence

Data dependence occurs when a program statement refers to the data of a preceding statement.

Data dependence limits parallelism


a = 2 * x;  b = 2 * y;      c = 3 * x;   // These 3 statements are independent
a = 2 * x;  b = 2 * a * a;  c = b * 9;   // b depends on a; c depends on b (and a)

Page 24: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


Data-Parallel Computing Example

Data set consisting of arrays A, B, and C

Same operations performed on each element: Cx = Ax + Bx

Two tasks operating on a subset of the arrays. Tasks 0 and 1 are independent. Could have more tasks.

[Diagram: arrays A0-A7, B0-B7 and C0-C7 with the operation Cx = Ax + Bx split between Task 0 and Task 1]

Page 25: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


The OpenCL Programming Model

OpenCL is a heterogeneous model, including provisions for both host and device

[Diagram: a host (CPU, chipset, DRAM) connected over PCIe to a device (DSP, GPU, or FPGA) with its own DRAM]

Page 26: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


The OpenCL Programming Model

Data-parallel portions of an algorithm are executed on the device as kernels

– Kernels are C functions with some restrictions, and a few language extensions

Only one kernel is executed at a time

A kernel is executed by many work-items

– Each work-item executes the same kernel


Page 27: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


OpenCL Work-Items

OpenCL work-items are conceptually similar to data-parallel tasks or threads

– Each work-item performs the same operations on a subset of a data structure

– Work-items execute independently

OpenCL work-items are not CPU threads

– OpenCL work-items are extremely lightweight

• Little creation overhead

• Instant context-switching

– Work-items must execute the same kernel


Page 28: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


OpenCL Work-Item Hierarchy

OpenCL is designed to execute millions of work-items

Work-items are grouped together into work-groups

– Maximum # of work-items per work-group (HW limit)

– Query CL_DEVICE_MAX_WORK_GROUP_SIZE with clGetDeviceInfo (see the sketch below)

• Typically 256-1024

The entire collection of work-items is called the N-Dimensional Range (NDRange)
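A hedged sketch, not from the slides, of the query mentioned above, reusing the device handle obtained from clGetDeviceIDs earlier:

// Hedged sketch: query the maximum work-group size for this device.
size_t maxWorkGroupSize;
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                sizeof(maxWorkGroupSize), &maxWorkGroupSize, NULL);
// maxWorkGroupSize is the hardware limit on work-items per work-group
// (typically 256-1024, as noted above).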


Page 29: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


OpenCL Work-Item Hierarchy

Work-groups and NDRange can be 1D, 2D, or 3D

Dimensions are set at launch time

[Diagram: an NDRange made up of a 3x2 grid of work-groups (0,0)-(2,1); work-group (1,1) is expanded into a 4x3 grid of work-items (0,0)-(3,2)]

Page 30: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


The OpenCL Programming Model

The host launches kernels

The host executes serial code between device kernel launches

– Memory management

– Data exchange to/from device (usually)

– Error handling

[Diagram: alternating host (serial code) and device (NDRange of work-groups) execution phases]

Page 31: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


Data-Parallel Computing on GPUs

Data-parallel computing maps well to GPUs:

– Identical operations executed on many data elements in parallel

• Simplified flow control allows an increased ratio of compute logic (ALUs) to control logic

[Diagram: CPU die dominated by control logic and caches (L1/L2/L3) with a few ALUs, versus GPU die dominated by ALUs; each has its own DRAM]

Page 32: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


OpenCL API – Launching a Kernel (C)

How to launch a kernel:


// 3D work-groups; let the OpenCL runtime determine the local work size
size_t const globalWorkSize[3] = {512, 512, 512};
clEnqueueNDRangeKernel(queue, kernel, 3, NULL,
                       globalWorkSize, NULL,
                       0, NULL, NULL);

// 2D work-groups; specify the local work size explicitly
size_t const globalWorkSize[2] = {512, 512};
size_t const localWorkSize[2] = {16, 16};
clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                       globalWorkSize, localWorkSize,
                       0, NULL, NULL);
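The deck does not show how the kernel argument passed to clEnqueueNDRangeKernel is obtained. A hedged sketch of the usual path, building a program from source and creating the kernel object; the source string and the name "inc" refer to the kernel example later in this deck, and aBuffer stands in for a previously created cl_mem buffer of floats.

// Hedged sketch: build a program and create the kernel object used above.
const char* source =
    "__kernel void inc(__global float* a, float b)"
    "{ int i = get_global_id(0); a[i] = a[i] + b; }";
cl_int err;
cl_program program = clCreateProgramWithSource(context, 1, &source, NULL, &err);
err = clBuildProgram(program, 1, &device, NULL, NULL, NULL);
cl_kernel kernel = clCreateKernel(program, "inc", &err);

// Kernel arguments are set by index before enqueueing the kernel.
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &aBuffer);  // aBuffer: assumed cl_mem buffer
float b = 1.0f;
err = clSetKernelArg(kernel, 1, sizeof(float), &b);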

Page 33: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


GPU Kernels

Page 34: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


Writing OpenCL Kernels

Denoted by __kernel function qualifier

– e.g. __kernel void myKernel(__global float* a)

Queued from host, executed on device

A few noteworthy restrictions:

– No access to host memory (in general!)

– Must return void

– No function pointers

– No static variables

– No recursion (no stack)


Page 35: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


OpenCL Syntax - Kernels

Kernels have built-in functions:

– The argument dim ranges from 0 to 2, depending on the dimensionality of the kernel launch

– get_work_dim(): number of dimensions in use

– get_global_id(dim): unique index of a work-item

– get_global_size(dim): number of global work-items


Page 36: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


OpenCL Syntax – Kernels (Continued)

Built-in function listing (continued):

– get_local_id(dim): unique index of the work-item within the work-group

– get_local_size(dim): number of work-items within the work-group

– get_group_id(dim): index of the work-group

– get_num_groups(dim): number of work-groups

– Cannot vary the size of work-groups or work-items during a kernel call


Page 37: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


OpenCL Syntax - Kernels

Built-in functions are typically used to determine unique work-item identifiers:

[Diagram: a one-dimensional NDRange of 15 work-items (get_work_dim() == 1) split into 3 work-groups of 5 work-items each; get_group_id(0) runs 0-2, get_local_id(0) runs 0-4 within each group, get_global_id(0) runs 0-14, and get_local_size(0) = 5]

get_global_id(0) == get_group_id(0) * get_local_size(0) + get_local_id(0)

Page 38: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


OpenCL Syntax – Thread Identifiers

Result for each kernel launched with the following execution configuration: Dimension = 1, Global work size = 12, Local work size = 4

__kernel void MyKernel(__global int* a) { int idx = get_global_id(0); a[idx] = 7; }
a: 7 7 7 7 7 7 7 7 7 7 7 7

__kernel void MyKernel(__global int* a) { int idx = get_global_id(0); a[idx] = get_group_id(0); }
a: 0 0 0 0 1 1 1 1 2 2 2 2

__kernel void MyKernel(__global int* a) { int idx = get_global_id(0); a[idx] = get_local_id(0); }
a: 0 1 2 3 0 1 2 3 0 1 2 3

Page 39: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


Code Example - Kernel

Kernel is executed by N work-items

– Each work-item has a unique ID between 0 and N-1

Standard C version:

void inc(float* a, float b, int N)
{
    for (int i = 0; i < N; i++)
        a[i] = a[i] + b;
}

void main()
{
    ...
    inc(a, b, N);
}

OpenCL version:

__kernel void inc(__global float* a, float b)
{
    int i = get_global_id(0);
    a[i] = a[i] + b;
}

void main()
{
    ...
    clEnqueueNDRangeKernel(..., ...);
}

Page 40: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


OpenCL Syntax - Kernels

All C operators are supported

– e.g. +, *, /, ^, >, >>

Many functions from the standard math library

– e.g. sin(), cos(), ceil(), fabs()

Can write/call your own non-kernel functions

– float myDeviceFunction(__global float *a)

– Non-kernel functions cannot be called by the host

Control flow statements too!

– e.g. if(), while(), for()
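A hedged illustrative kernel, not taken from the slides, combining the points above: a non-kernel helper function, a math-library call, and control flow (the function names are made up for the example).

// Non-kernel device function: callable from kernels, not from the host.
float scale(float x)
{
    return fabs(x) * 0.5f;
}

__kernel void smooth(__global float* a, int n)
{
    int i = get_global_id(0);
    if (i < n)                       // control flow: guard against extra work-items
    {
        a[i] = scale(sin(a[i]));     // standard math library call
    }
}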


Page 41: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


OpenCL Syntax - Synchronization

Kernel launches are asynchronous

– Control returns to the CPU immediately

– Subsequent commands added to the command queue will wait until the kernel has completed

If you want to synchronize on the host:

– Implicit synchronization via blocking commands, e.g. clEnqueueReadBuffer() with the blocking argument set to CL_TRUE

– Explicit synchronization via clFinish(queue), which blocks on the host until all outstanding OpenCL commands in the given queue are complete
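A hedged sketch of the two host-side synchronization options described above; queue, kernel, a, nBytes and hostarr reuse names from the earlier examples, and the launch size is illustrative.

// Hedged sketch of both synchronization options.
size_t const globalWorkSize[1] = {12};   // illustrative 1D launch size

// Option 1: implicit synchronization via a blocking read; the call returns
// only after the kernel and the copy back to hostarr have completed.
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, globalWorkSize, NULL, 0, NULL, NULL);
clEnqueueReadBuffer(queue, a, CL_TRUE, 0, nBytes, hostarr, 0, NULL, NULL);

// Option 2: explicit synchronization; clFinish blocks the host until every
// outstanding command in the queue has completed.
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, globalWorkSize, NULL, 0, NULL, NULL);
clFinish(queue);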


Page 42: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


Questions?

OpenCL training courses and consulting services

Acceleware Ltd.

Twitter: @Acceleware

Web: http://acceleware.com/opencl-training

Email: [email protected]

-------------------

Stay in the know about developer news, tools, SDKs, technical presentations, events and future webinars. Connect with AMD Developer Central here:

AMD Developer Central

Twitter: @AMDDevCentral

Web: http://developer.amd.com/

YouTube: https://www.youtube.com/user/AMDDevCentral

Developer Forums: http://devgurus.amd.com/welcome


Page 43: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


An Overview of GPU Hardware

Page 44: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


What is the GPU?

The GPU is a graphics processing unit

Historically used to offload graphics computations from the CPU

Can either be a dedicated video card, integrated on the motherboard, or on the same die as the CPU

– Highest performance will require a dedicated video card


Page 45: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


Why use GPUs? Performance!


Intel Xeon E5-2697 v2 (Ivy Bridge): 12 processing cores, 2.7-3.4* GHz, 59.7 GB/s memory bandwidth per socket, 576 peak single-precision Gflops** @ 3.0 GHz, 288 peak double-precision Gflops** @ 3.0 GHz, 4.4 Gflops/Watt (single), >>16 GB total memory

AMD Opteron 6386SE (Bulldozer): 16 processing cores, 2.8-3.5* GHz, 59.7 GB/s memory bandwidth per socket, 410 peak single-precision Gflops** @ 3.2 GHz, 205 peak double-precision Gflops** @ 3.2 GHz, 2.9 Gflops/Watt (single), >>16 GB total memory

AMD FirePro W9100 (Volcanic Islands): 2816 processing cores, 930 MHz, 320 GB/s memory bandwidth, 5240 peak single-precision Gflops**, 2620 peak double-precision Gflops**, 19 Gflops/Watt (single), 16 GB total memory

AMD FirePro S10000 (Southern Islands): 3584 processing cores, 825 MHz, 480 GB/s memory bandwidth, 5910 peak single-precision Gflops**, 1480 peak double-precision Gflops**, 15.76 Gflops/Watt (single), 6 GB total memory

* Indicates range of clock frequencies supported via Intel Turbo Boost and AMD Turbo CORE Technology

** At maximum frequency when all cores are executing

Page 46: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


GPU Potential Advantages

AMD FirePro W9100 vs. Intel Xeon E5-2697 v2:

9x more single-precision floating-point throughput

9x more double-precision floating-point throughput

5x higher memory bandwidth

Page 47: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


GPU Disadvantages

Architecture not as flexible as a CPU

Must rewrite algorithms and maintain software in GPU languages

Attached to the CPU via relatively slow PCIe

– 16 GB/s bi-directional for PCIe 3.0 x16

Limited memory (though 6-16 GB is reasonable for many applications)


Page 48: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


Software Approaches for Acceleration

Programming Languages (e.g. OpenCL) – Maximum flexibility

OpenACC Directives – Simple programming for heterogeneous systems

– Simple compiler hints/pragmas

– Compiler parallelizes code

– Target a variety of platforms

Libraries – "Drop-in" acceleration

– In-depth GPU knowledge not required

– Highly optimized by GPU experts

– Provides functions used in a broad range of applications (e.g. FFT, BLAS)

[Diagram: the slide arranges these three approaches along a programmer-effort axis]

Page 49: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


An Introduction to OpenCL

Page 50: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


OpenCL Overview

Parallel computing architecture standardized by the Khronos Group

OpenCL:

– Is a royalty-free standard

– Provides an API to coordinate parallel computation across heterogeneous processors

– Defines a cross-platform programming language


Page 51: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


OpenCL Versions

To date there are four different versions of OpenCL

– OpenCL 1.0

– OpenCL 1.1

– OpenCL 1.2

– OpenCL 2.0 (finalized November 2013)

Different versions support different functionality


Supported OpenCL version by hardware vendor:

– AMD: OpenCL 1.2

– Intel: OpenCL 1.2

– NVIDIA: OpenCL 1.1

Page 52: An Introduction to OpenCL™ Programming with AMD GPUs - AMD & Acceleware Webinar


OpenCL Extensions

Optional functionality is exposed through extensions

– Vendors are not required to support extensions to achieve conformance

– However, extensions are expected to be widely available

Some OpenCL extensions are approved by the OpenCL working group

– These extensions are likely to be promoted to core functionality in future versions of the standard

Multi-vendor and vendor-specific extensions do not need approval by the working group
