An Introduction to OpenCL™ Programming with AMD GPUs – AMD & Acceleware Webinar
Posted on 07-Nov-2014
An Introduction to OpenCL™
Using AMD GPUs
Chris Mason
Product Manager, Acceleware
September 17, 2014
© 2014 Acceleware Ltd. Reproduction or distribution strictly prohibited.
About Acceleware
Programmer Training
– OpenCL, CUDA, OpenMP
– Over 100 courses taught
– http://acceleware.com/training
Consulting Services
– Completed projects for: Oil & Gas, Medical,
Finance, Security & Defence, Computer Aided
Engineering, Media & Entertainment
– http://acceleware.com/services
GPU Accelerated Software
– Seismic imaging & modeling
– Electromagnetics
Seismic Imaging & Modeling
AxWave – Seismic forward modeling
– 2D, 3D, constant and variable density models
– High fidelity finite-difference modeling
AxRTM – High performance Reverse Time Migration
Application
– Isotropic, VTI and TTI media
HPC Implementation – Optimized for GPUs
– Efficient multi-GPU scaling
Electromagnetics
AxFDTD™
– Finite-Difference Time-Domain
Electromagnetic Solver
– Optimized for GPUs
– Sub-gridding and large feature
coverage
– Multi-GPU, GPU clusters, GPU
targeting
– Available from:
Consulting Services
Industry: Finance – Application: Option Pricing
– Work completed: Debugged & optimized existing code; implemented the Leisen-Reimer version of the binomial model for stock option pricing
– Results: 30-50x performance improvement compared to single-threaded CPU code
Industry: Security & Defense – Application: Detection System
– Work completed: Replaced legacy Cell-based infrastructure with GPUs; implemented GPU accelerated X-ray iterative image reconstruction and explosive detection algorithms
– Results: Surpassed the performance targets; reduced hardware cost by a factor of 10
Industry: CAE – Application: SIMULIA Abaqus
– Work completed: Developed a GPU accelerated version; conducted a finite-element analysis and developed a library to offload the LDLT factorization portion of the multi-frontal solver to GPUs
– Results: Delivered an accelerated (2-3x) solution that supports NVIDIA and AMD GPUs
Industry: Medical – Application: CT Reconstruction Software
– Work completed: Developed a GPU accelerated application for image reconstruction on CT scanners; implemented advanced features including a job batch manager, filtering and bad pixel corrections
– Results: Accelerated back projection by 31x
Industry: Oil & Gas – Application: Seismic Application
– Work completed: Converted MATLAB research code into a standalone application & improved performance via algorithmic optimizations
– Results: 20-30x speedup
Programmer Training
OpenCL, CUDA, OpenMP
Teachers with real world experience
Hands-on lab exercises
Progressive lectures
Small class sizes to maximize learning
90 days post training support
“The level of detail is fantastic. The course did not focus on syntax but rather on how to expertly program for the GPU. I loved the course and I hope that we can get more of our team to take it.”
Jason Gauci, Software Engineer
Lockheed Martin
Outline
Introduction to the OpenCL Architecture – Contexts, Devices, Queues
Memory and Error Management
Data-Parallel Computing – Kernel Launches
GPU Kernels
Introduction To The OpenCL Architecture
OpenCL Architecture Introduction and Terminology
Four high level models describe the key OpenCL concepts:
– Platform Model – high level host/device interaction
– Execution Model – OpenCL programs execute on host/device
– Memory Model – different memory resources on device
– Programming Model – types of parallel workloads
OpenCL Platform Model
A host connected to one or more devices
– Example: GPUs, DSPs, FPGAs
A program can work with devices from multiple vendors
A platform is a host and a collection of devices that share
resources and execute programs
[Diagram: a Host connected to Device 1 (GPU), Device 2 (CPU), …, Device N (GPU)]
OpenCL Execution Model
The host defines a context to control the device
– The context manages the following resources:
– Devices – hardware to run on
– Kernels – functions to run on the hardware
– Program Objects – device executables
– Memory Objects – memory visible to host and device
A command queue schedules commands for execution on
the device
OpenCL API - Platform and Runtime Layer
The OpenCL API is divided into two layers: Platform and
Runtime
The platform layer allows the host program to discover
devices and capabilities
The runtime layer allows the host program to work with
contexts once created
Program Set Up
To set up an OpenCL program, the typical steps are as follows:
1. Query and select the platforms (e.g., AMD)
2. Query the devices
3. Create a context
4. Create a command queue
5. Read/Write to the device
6. Launch the kernel
(The steps that discover platforms and devices use the platform layer; once the context is created, the remaining steps use the runtime layer.)
Sample Platform Layer C Code
// Get the platform ID
cl_platform_id platform;
clGetPlatformIDs(1, &platform, NULL);

// Get the first GPU device associated with the platform
cl_device_id device;
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

// Create an OpenCL context for the GPU device
cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
OpenCL Runtime Layer
A command queue operates on contexts, memory, and program
objects
Each device can have one or more command queues
Operations in the command queue will execute in order unless
the out of order mode is enabled
[Diagram: a command queue holding Copy Data → Copy Data → Launch Kernel → Copy Data]
Memory and Error Management
OpenCL Buffers
A buffer stores a one dimensional collection of elements
Buffer objects use the cl_mem type
– cl_mem is an abstract memory container (i.e., a handle)
– The buffer object cannot be dereferenced on the host
• cl_mem a; a[0] = 5; // Not allowed
OpenCL commands interact with buffers
OpenCL Syntax – C Memory Management
Example:
// Create an OpenCL command queue
cl_int err;
cl_command_queue queue = clCreateCommandQueue(context, device, 0, &err);

// Allocate memory on the device
const int N = 5;
int nBytes = N * sizeof(int);
cl_mem a = clCreateBuffer(context, CL_MEM_READ_WRITE, nBytes, NULL, &err);

// Transfer memory from host to device
int hostarr[N] = {3, 1, 4, 1, 5};
err = clEnqueueWriteBuffer(queue, a, CL_TRUE, 0, nBytes, hostarr, 0, NULL, NULL);
OpenCL Syntax – Error Management
Host code manages errors:
– Most host side OpenCL function calls return cl_int
• “Create” calls return the object that is created
– Error code is passed by reference as last argument
• Error codes are negative values defined in cl.h
• CL_SUCCESS == 0
OpenCL Syntax – Clean Up
All objects that are created can be released with the following functions:
– clReleaseContext
– clReleaseCommandQueue
– clReleaseMemObject
Data-Parallel Computing
Data-Parallel Computing
Data-parallelism
1. Performs operations on a data set organized into a common structure (e.g. an array)
2. Tasks work collectively on the same structure with each task operating on its own portion of the structure
3. Tasks perform identical operations on their portions of the structure. Operations on each portion must not be data dependent!
Data Dependence
Data dependence occurs when a program statement refers to the data of a preceding statement.
Data dependence limits parallelism.

a = 2 * x;
b = 2 * y;
c = 3 * x;
These 3 statements are independent!

a = 2 * x;
b = 2 * a * a;
c = b * 9;
b depends on a; c depends on b and a!
Data-Parallel Computing Example
Data set consisting of arrays A, B, and C
Same operation performed on each element: Cx = Ax + Bx
Two tasks operate on subsets of the arrays; Tasks 0 and 1 are independent. Could have more tasks.
[Diagram: eight-element arrays A and B combined element-wise (Cx = Ax + Bx) into C, with the elements split between Task 0 and Task 1]
The OpenCL Programming Model
OpenCL is a heterogeneous model, including provisions for
both host and device
[Diagram: Host (CPU, chipset, DRAM) connected via PCIe to a Device (DSP, GPU, or FPGA) with its own DRAM]
The OpenCL Programming Model
Data-parallel portions of an algorithm are executed on the
device as kernels
– Kernels are C functions with some restrictions, and a few language
extensions
Only one kernel is executed at a time
A kernel is executed by many work-items
– Each work-item executes the same kernel
OpenCL Work-Items
OpenCL work-items are conceptually similar to data-parallel tasks or threads
– Each work-item performs the same operations on a subset of a data structure
– Work-items execute independently
OpenCL work-items are not CPU threads
– OpenCL work-items are extremely lightweight
• Little creation overhead
• Instant context switching
– Work-items must execute the same kernel
27
© 2014 Acceleware Ltd. Reproduction or distribution strictly prohibited.
An
In
tro
ductio
n to
Op
en
CL
U
sin
g A
MD
GP
Us
OpenCL Work-Item Hierarchy
OpenCL is designed to execute millions of work-items
Work-items are grouped together into work-groups
– Maximum # of work-items per work-group (hardware limit)
– Query CL_DEVICE_MAX_WORK_GROUP_SIZE via clGetDeviceInfo
• Typically 256-1024
The entire collection of work-items is called the N-Dimensional Range (NDRange)
OpenCL Work-Item Hierarchy
Work-groups and NDRange can be 1D, 2D, or 3D
Dimensions are set at launch time
[Diagram: a 2D NDRange of 3×2 work-groups, with work-group (1,1) expanded to show its 4×3 grid of work-items]
The OpenCL Programming Model
The host launches kernels
The host executes serial code between device kernel launches
– Memory management
– Data exchange to/from device (usually)
– Error handling
[Diagram: execution alternating between the Host (serial code) and the Device (an NDRange of work-groups)]
Data-Parallel Computing on GPUs
Data-parallel computing maps well to GPUs:
– Identical operations executed on many data elements in parallel
• Simplified flow control allows an increased ratio of compute logic (ALUs) to control logic
[Diagram: a CPU die dominated by control logic and caches (L1/L2/L3) with few ALUs, versus a GPU die dominated by many ALUs]
OpenCL API – Launching a Kernel (C)
How to launch a kernel:
// 3D NDRange; let the OpenCL runtime determine the local work size
size_t const globalWorkSize[3] = {512, 512, 512};
clEnqueueNDRangeKernel(queue, kernel, 3, NULL, globalWorkSize, NULL, 0, NULL, NULL);

// 2D NDRange; specify the local work size
size_t const globalWorkSize[2] = {512, 512};
size_t const localWorkSize[2] = {16, 16};
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, globalWorkSize, localWorkSize, 0, NULL, NULL);
GPU Kernels
Writing OpenCL Kernels
Denoted by __kernel function qualifier
– e.g., __kernel void myKernel(__global float* a)
Queued from host, executed on device
A few noteworthy restrictions:
– No access to host memory (in general!)
– Must return void
– No function pointers
– No static variables
– No recursion (no stack)
OpenCL Syntax - Kernels
Kernels have built-in functions:
– get_work_dim(): number of dimensions in use
– get_global_id(dim): unique index of a work-item
– get_global_size(dim): number of global work-items
– The dim argument ranges from 0 to 2, depending on the dimensionality of the kernel launch
35
© 2014 Acceleware Ltd. Reproduction or distribution strictly prohibited.
An
In
tro
ductio
n to
Op
en
CL
U
sin
g A
MD
GP
Us
OpenCL Syntax – Kernels (Continued)
Built-in function listing (continued):
– get_local_id (dim): unique index of the work-item within the work-group
– get_local_size (dim): number of work-items within the work-group
– get_group_id (dim): index of the work-group
– get_num_groups (dim): number of work-groups
– Cannot vary the size of work-groups or work-items during a kernel call
OpenCL Syntax - Kernels
Built-in functions are typically used to determine unique work-item identifiers.
One-dimensional example (get_work_dim() == 1): an NDRange of 15 work-items split into 3 work-groups of 5
– get_group_id(0) ranges over 0, 1, 2
– get_local_id(0) ranges over 0-4 within each work-group (get_local_size(0) == 5)
– get_global_id(0) ranges over 0-14 across the NDRange
get_global_id(0) == get_group_id(0) * get_local_size(0) + get_local_id(0)
OpenCL Syntax – Thread Identifiers
Result for each kernel launched with the following execution configuration: dimension = 1, global work size = 12, local work size = 4

__kernel void MyKernel(__global int* a)
{
    int idx = get_global_id(0);
    a[idx] = 7;
}
a: 7 7 7 7 7 7 7 7 7 7 7 7

__kernel void MyKernel(__global int* a)
{
    int idx = get_global_id(0);
    a[idx] = get_group_id(0);
}
a: 0 0 0 0 1 1 1 1 2 2 2 2

__kernel void MyKernel(__global int* a)
{
    int idx = get_global_id(0);
    a[idx] = get_local_id(0);
}
a: 0 1 2 3 0 1 2 3 0 1 2 3
Code Example - Kernel
Kernel is executed by N work-items
– Each work-item has a unique ID between 0 and N-1

// CPU serial version
void inc(float* a, float b, int N)
{
    for (int i = 0; i < N; i++)
        a[i] = a[i] + b;
}

void main()
{
    …
    inc(a, b, N);
}

// OpenCL data-parallel version
__kernel void inc(__global float* a, float b)
{
    int i = get_global_id(0);
    a[i] = a[i] + b;
}

void main()
{
    …
    clEnqueueNDRangeKernel(…, …);
}
OpenCL Syntax - Kernels
All C operators are supported – e.g., +, *, /, ^, >, >>
Many functions from the standard math library – e.g., sin(), cos(), ceil(), fabs()
Can write/call your own non-kernel functions – e.g., float myDeviceFunction(__global float *a)
– Non-kernel functions cannot be called by the host
Control flow statements too! – e.g., if, while, for
OpenCL Syntax - Synchronization
Kernel launches are asynchronous – Control returns to CPU immediately
– Subsequent commands added to the command queue will wait until the kernel has completed
– If you want to synchronize on the host:
• Implicit synchronization via blocking commands – e.g., clEnqueueReadBuffer() with the blocking argument set to CL_TRUE
– Explicitly call clFinish(queue)
• Blocks on the host until all outstanding OpenCL commands in the given queue are complete
Questions?
OpenCL training courses and consulting services
Acceleware Ltd.
Twitter: @Acceleware
Web: http://acceleware.com/opencl-training
Email: services@acceleware.com
-------------------
Stay in the know about developer news, tools, SDKs, technical presentations, events and future webinars. Connect with AMD Developer Central here:
AMD Developer Central
Twitter: @AMDDevCentral
Web: http://developer.amd.com/
YouTube: https://www.youtube.com/user/AMDDevCentral
Developer Forums: http://devgurus.amd.com/welcome
An Overview of GPU Hardware
What is the GPU?
The GPU is a graphics processing unit
Historically used to offload graphics
computations from the CPU
Can either be a dedicated video card,
integrated on the motherboard or on the
same die as the CPU
– Highest performance will require a dedicated
video card
Why use GPUs? Performance!

                         Intel Xeon        AMD Opteron       AMD FirePro         AMD FirePro
                         E5-2697 v2        6386SE            W9100               S10000
                         (Ivy Bridge)      (Bulldozer)       (Volcanic Islands)  (Southern Islands)
Processing cores         12                16                2816                3584
Clock frequency          2.7-3.4 GHz*      2.8-3.5 GHz*      930 MHz             825 MHz
Memory bandwidth         59.7 GB/s/socket  59.7 GB/s/socket  320 GB/s            480 GB/s
Peak Gflops** (single)   576 @ 3.0 GHz     410 @ 3.2 GHz     5240                5910
Peak Gflops** (double)   288 @ 3.0 GHz     205 @ 3.2 GHz     2620                1480
Gflops/Watt (single)     4.4               2.9               19                  15.76
Total memory             >>16 GB           >>16 GB           16 GB               6 GB

* Indicates the range of clock frequencies supported via Intel Turbo Boost and AMD Turbo CORE technology
** At maximum frequency when all cores are executing
GPU Potential Advantages (AMD FirePro W9100 vs. Intel Xeon E5-2697 v2)
9x more single-precision floating-point throughput
9x more double-precision floating-point throughput
5x higher memory bandwidth
GPU Disadvantages
Architecture not as flexible as CPU
Must rewrite algorithms and maintain software in GPU languages
Attached to CPU via relatively slow PCIe – 16GB/s bi-directional for PCIe 3.0 16x
Limited memory (though 6-16GB is reasonable for many applications)
Software Approaches for Acceleration (in order of increasing effort)
Libraries – "drop-in" acceleration
– In-depth GPU knowledge not required
– Highly optimized by GPU experts
– Provide functions used in a broad range of applications (e.g., FFT, BLAS)
OpenACC Directives – simple programming for heterogeneous systems
– Simple compiler hints/pragmas
– Compiler parallelizes code
– Target a variety of platforms
Programming Languages (e.g., OpenCL) – maximum flexibility
An Introduction to OpenCL
OpenCL Overview
Parallel computing architecture
standardized by the Khronos Group
OpenCL:
– Is a royalty free standard
– Provides an API to coordinate parallel
computation across heterogeneous
processors
– Defines a cross-platform programming
language
OpenCL Versions
To date there are four different versions of OpenCL
– OpenCL 1.0
– OpenCL 1.1
– OpenCL 1.2
– OpenCL 2.0 (finalized November 2013)
Different versions support different functionality
Hardware Vendor    Supported OpenCL Version
AMD                OpenCL 1.2
Intel              OpenCL 1.2
NVIDIA             OpenCL 1.1
OpenCL Extensions
Optional functionality is exposed through extensions
– Vendors are not required to support extensions to achieve conformance
– However, extensions are expected to be widely available
Some OpenCL extensions are approved by the OpenCL working group
– These extensions are likely to be promoted to core functionality in future
versions of the standard
Multi-vendor and vendor specific extensions do not need approval by
the working group