An Introduction to OpenCL™ Programming with AMD GPUs – AMD & Acceleware Webinar
Posted on 07-Nov-2014
An Introduction to OpenCL™
Using AMD GPUs
Chris Mason
Product Manager, Acceleware
September 17, 2014
© 2014 Acceleware Ltd. Reproduction or distribution strictly prohibited.
About Acceleware
Programmer Training
– OpenCL, CUDA, OpenMP
– Over 100 courses taught
– http://acceleware.com/training
Consulting Services
– Completed projects for: Oil & Gas, Medical,
Finance, Security & Defence, Computer Aided
Engineering, Media & Entertainment
– http://acceleware.com/services
GPU Accelerated Software
– Seismic imaging & modeling
– Electromagnetics
Seismic Imaging & Modeling
AxWave – Seismic forward modeling
– 2D, 3D, constant and variable density models
– High fidelity finite-difference modeling
AxRTM – High performance Reverse Time Migration
Application
– Isotropic, VTI and TTI media
HPC Implementation – Optimized for GPUs
– Efficient multi-GPU scaling
Electromagnetics
AxFDTD™
– Finite-Difference Time-Domain
Electromagnetic Solver
– Optimized for GPUs
– Sub-gridding and large feature
coverage
– Multi-GPU, GPU clusters, GPU
targeting
– Available from:
Consulting Services
Industry: Finance – Application: Option Pricing
– Work completed: Debugged & optimized existing code; implemented the Leisen-Reimer version of the binomial model for stock option pricing
– Results: 30-50x performance improvement compared to single-threaded CPU code
Industry: Security & Defense – Application: Detection System
– Work completed: Replaced legacy Cell-based infrastructure with GPUs; implemented GPU accelerated X-ray iterative image reconstruction and explosive detection algorithms
– Results: Surpassed the performance targets; reduced hardware cost by a factor of 10
Industry: CAE – Application: SIMULIA Abaqus
– Work completed: Developed a GPU accelerated version; conducted a finite-element analysis and developed a library to offload the LDLT factorization portion of the multi-frontal solver to GPUs
– Results: Delivered an accelerated (2-3x) solution that supports NVIDIA and AMD GPUs
Industry: Medical – Application: CT Reconstruction Software
– Work completed: Developed a GPU accelerated application for image reconstruction on CT scanners; implemented advanced features including a job batch manager, filtering and bad pixel corrections
– Results: Accelerated back projection by 31x
Industry: Oil & Gas – Application: Seismic Application
– Work completed: Converted MATLAB research code into a standalone application & improved performance via algorithmic optimizations
– Results: 20-30x speedup
Programmer Training
OpenCL, CUDA, OpenMP
Teachers with real world experience
Hands-on lab exercises
Progressive lectures
Small class sizes to maximize learning
90 days post training support
“The level of detail is fantastic. The course did not focus on syntax but rather on how to expertly program for the GPU. I loved the course and I hope that we can get more of our team to take it.”
Jason Gauci, Software Engineer
Lockheed Martin
Outline
Introduction to the OpenCL Architecture – Contexts, Devices, Queues
Memory and Error Management
Data-Parallel Computing – Kernel Launches
GPU Kernels
Introduction To The OpenCL Architecture
OpenCL Architecture Introduction and Terminology
Four high level models describe the key OpenCL concepts:
– Platform Model – high level host/device interaction
– Execution Model – OpenCL programs execute on host/device
– Memory Model – different memory resources on device
– Programming Model – types of parallel workloads
OpenCL Platform Model
A host connected to one or more devices
– Example: GPUs, DSPs, FPGAs
A program can work with devices from multiple vendors
A platform is a host and a collection of devices that share
resources and execute programs
[Diagram: a Host connected to Device 1 (GPU), Device 2 (CPU), …, Device N (GPU)]
OpenCL Execution Model
The host defines a context to control the device
– The context manages the following resources:
– Devices – hardware to run on
– Kernels – functions to run on the hardware
– Program Objects – device executables
– Memory Objects – memory visible to host and device
A command queue schedules commands for execution on
the device
OpenCL API - Platform and Runtime Layer
The OpenCL API is divided into two layers: Platform and
Runtime
The platform layer allows the host program to discover
devices and capabilities
The runtime layer allows the host program to work with
contexts once created
Program Set Up
To set up an OpenCL program, the typical steps are as follows:
1. Query and select the platforms (e.g., AMD)
2. Query the devices
3. Create a context
4. Create a command queue
5. Read/Write to the device
6. Launch the kernel
(The steps that discover platforms and devices use the platform layer; once the context is created, the remaining steps use the runtime layer.)
Sample Platform Layer C Code
// Get the platform ID
cl_platform_id platform;
clGetPlatformIDs(1, &platform, NULL);

// Get the first GPU device associated with the platform
cl_device_id device;
clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

// Create an OpenCL context for the GPU device
cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
OpenCL Runtime Layer
A command queue operates on contexts, memory, and program
objects
Each device can have one or more command queues
Operations in the command queue will execute in order unless
the out of order mode is enabled
[Diagram: a command queue holding Copy Data → Copy Data → Launch Kernel → Copy Data]
Memory and Error Management
OpenCL Buffers
A buffer stores a one dimensional collection of elements
Buffer objects use the cl_mem type
– cl_mem is an abstract memory container (i.e., a handle)
– The buffer object cannot be dereferenced on the host
• cl_mem a; a[0] = 5; // Not allowed
OpenCL commands interact with buffers
OpenCL Syntax – C Memory Management
Example:
// Create an OpenCL command queue
cl_int err;
cl_command_queue queue = clCreateCommandQueue(context, device, 0, &err);

// Allocate memory on the device
const int N = 5;
int nBytes = N * sizeof(int);
cl_mem a = clCreateBuffer(context, CL_MEM_READ_WRITE, nBytes, NULL, &err);

// Transfer memory from host to device
int hostarr[N] = {3, 1, 4, 1, 5};
err = clEnqueueWriteBuffer(queue, a, CL_TRUE, 0, nBytes, hostarr, 0, NULL, NULL);
OpenCL Syntax – Error Management
Host code manages errors:
– Most host side OpenCL function calls return cl_int
• “Create” calls return the object that is created
– Error code is passed by reference as last argument
• Error codes are negative values defined in cl.h
• CL_SUCCESS == 0
OpenCL Syntax – Clean Up
All objects that are created can be released with the following functions:
– clReleaseContext
– clReleaseCommandQueue
– clReleaseMemObject
Data-Parallel Computing
Data-Parallel Computing
Data-parallelism
1. Performs operations on a data set organized into a common structure (e.g. an array)
2. Tasks work collectively on the same structure with each task operating on its own portion of the structure
3. Tasks perform identical operations on their portions of the structure. Operations on each portion must not be data dependent!
Data Dependence
Data dependence occurs when a program statement refers to the data of a preceding statement.
Data dependence limits parallelism.

a = 2 * x;
b = 2 * y;
c = 3 * x;
These 3 statements are independent!

a = 2 * x;
b = 2 * a * a;
c = b * 9;
b depends on a; c depends on b and a!
Data-Parallel Computing Example
Data set consisting of arrays A, B, and C
Same operation performed on each element: Cx = Ax + Bx
Two tasks operate on subsets of the arrays; Tasks 0 and 1 are independent. Could have more tasks.
[Diagram: eight-element arrays A and B combined element-wise (Cx = Ax + Bx) into C, with the elements split between Task 0 and Task 1]
The OpenCL Programming Model
OpenCL is a heterogeneous model, including provisions for
both host and device
[Diagram: Host (CPU, chipset, DRAM) connected via PCIe to a Device (DSP, GPU, or FPGA) with its own DRAM]
The OpenCL Programming Model
Data-parallel portions of an algorithm are executed on the
device as kernels
– Kernels are C functions with some restrictions, and a few language
extensions
Only one kernel is executed at a time
A kernel is executed by many work-items
– Each work-item executes the same kernel
OpenCL Work-Items
OpenCL work-items are conceptually similar to data-parallel tasks or threads
– Each work-item performs the same operations on a subset of a data structure
– Work-items execute independently
OpenCL work-items are not CPU threads
– OpenCL work-items are extremely lightweight
• Little creation overhead
• Instant context switching
– Work-items must execute the same kernel
27
© 2014 Acceleware Ltd. Reproduction or distribution strictly prohibited.
An
In
tro
ductio
n to
Op
en
CL
U
sin
g A
MD
GP
Us
OpenCL Work-Item Hierarchy
OpenCL is designed to execute millions of work-items
Work-items are grouped together into work-groups
– Maximum # of work-items per work-group (hardware limit)
– Query CL_DEVICE_MAX_WORK_GROUP_SIZE via clGetDeviceInfo
• Typically 256-1024
The entire collection of work-items is called the N-Dimensional Range (NDRange)
OpenCL Work-Item Hierarchy
Work-groups and NDRange can be 1D, 2D, or 3D
Dimensions are set at launch time
[Diagram: a 2D NDRange of 3×2 work-groups, with work-group (1,1) expanded to show its 4×3 grid of work-items]
The OpenCL Programming Model
The host launches kernels
The host executes serial code between device kernel launches
– Memory management
– Data exchange to/from device (usually)
– Error handling
[Diagram: execution alternating between the Host (serial code) and the Device (an NDRange of work-groups)]
Data-Parallel Computing on GPUs
Data-parallel computing maps well to GPUs:
– Identical operations executed on many data elements in parallel
• Simplified flow control allows an increased ratio of compute logic (ALUs) to control logic
[Diagram: a CPU die dominated by control logic and caches (L1/L2/L3) with few ALUs, versus a GPU die dominated by many ALUs]
OpenCL API – Launching a Kernel (C)
How to launch a kernel:
// 3D NDRange; let the OpenCL runtime determine the local work size
size_t const globalWorkSize[3] = {512, 512, 512};
clEnqueueNDRangeKernel(queue, kernel, 3, NULL, globalWorkSize, NULL, 0, NULL, NULL);

// 2D NDRange; specify the local work size
size_t const globalWorkSize[2] = {512, 512};
size_t const localWorkSize[2] = {16, 16};
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, globalWorkSize, localWorkSize, 0, NULL, NULL);
GPU Kernels
Writing OpenCL Kernels
Denoted by __kernel function qualifier
– e.g., __kernel void myKernel(__global float* a)
Queued from host, executed on device
A few noteworthy restrictions:
– No access to host memory (in general!)
– Must return void
– No function pointers
– No static variables
– No recursion (no stack)
OpenCL Syntax - Kernels
Kernels have built-in functions:
– get_work_dim(): number of dimensions in use
– get_global_id(dim): unique index of a work-item
– get_global_size(dim): number of global work-items
– The dim argument ranges from 0 to 2, depending on the dimensionality of the kernel launch
35
© 2014 Acceleware Ltd. Reproduction or distribution strictly prohibited.
An
In
tro
ductio
n to
Op
en
CL
U
sin
g A
MD
GP
Us
OpenCL Syntax – Kernels (Continued)
Built-in function listing (continued):
– get_local_id (dim): unique index of the work-item within the work-group
– get_local_size (dim): number of work-items within the work-group
– get_group_id (dim): index of the work-group
– get_num_groups (dim): number of work-groups
– Cannot vary the size of work-groups or work-items during a kernel call
OpenCL Syntax - Kernels
Built-in functions are typically used to determine unique work-item identifiers.
One-dimensional example (get_work_dim() == 1): an NDRange of 15 work-items split into 3 work-groups of 5
– get_group_id(0) ranges over 0, 1, 2
– get_local_id(0) ranges over 0-4 within each work-group (get_local_size(0) == 5)
– get_global_id(0) ranges over 0-14 across the NDRange
get_global_id(0) == get_group_id(0) * get_local_size(0) + get_local_id(0)
OpenCL Syntax – Thread Identifiers
Result for each kernel launched with the following execution configuration: dimension = 1, global work size = 12, local work size = 4

__kernel void MyKernel(__global int* a)
{
    int idx = get_global_id(0);
    a[idx] = 7;
}
a: 7 7 7 7 7 7 7 7 7 7 7 7

__kernel void MyKernel(__global int* a)
{
    int idx = get_global_id(0);
    a[idx] = get_group_id(0);
}
a: 0 0 0 0 1 1 1 1 2 2 2 2

__kernel void MyKernel(__global int* a)
{
    int idx = get_global_id(0);
    a[idx] = get_local_id(0);
}
a: 0 1 2 3 0 1 2 3 0 1 2 3
Code Example - Kernel
Kernel is executed by N work-items
– Each work-item has a unique ID between 0 and N-1

// CPU serial version
void inc(float* a, float b, int N)
{
    for (int i = 0; i < N; i++)
        a[i] = a[i] + b;
}

void main()
{
    …
    inc(a, b, N);
}

// OpenCL data-parallel version
__kernel void inc(__global float* a, float b)
{
    int i = get_global_id(0);
    a[i] = a[i] + b;
}

void main()
{
    …
    clEnqueueNDRangeKernel(…, …);
}
OpenCL Syntax - Kernels
All C operators are supported – e.g., +, *, /, ^, >, >>
Many functions from the standard math library – e.g., sin(), cos(), ceil(), fabs()
Can write/call your own non-kernel functions – e.g., float myDeviceFunction(__global float *a)
– Non-kernel functions cannot be called by the host
Control flow statements too! – e.g., if, while, for
OpenCL Syntax - Synchronization
Kernel launches are asynchronous – Control returns to CPU immediately
– Subsequent commands added to the command queue will wait until the kernel has completed
– If you want to synchronize on the host:
• Implicit synchronization via blocking commands – e.g., clEnqueueReadBuffer() with the blocking argument set to CL_TRUE
– Explicitly call clFinish(queue)
• Blocks on the host until all outstanding OpenCL commands in the given queue are complete
Questions?
OpenCL training courses and consulting services
Acceleware Ltd.
Twitter: @Acceleware
Web: http://acceleware.com/opencl-training
Email: services@acceleware.com
-------------------
Stay in the know about developer news, tools, SDKs, technical presentations, events and future webinars. Connect with AMD Developer Central here:
AMD Developer Central
Twitter: @AMDDevCentral
Web: http://developer.amd.com/
YouTube: https://www.youtube.com/user/AMDDevCentral
Developer Forums: http://devgurus.amd.com/welcome
An Overview of GPU Hardware
What is the GPU?
The GPU is a graphics processing unit
Historically used to offload graphics
computations from the CPU
Can either be a dedicated video card,
integrated on the motherboard or on the
same die as the CPU
– Highest performance will require a dedicated
video card
Why use GPUs? Performance!

                         Intel Xeon        AMD Opteron       AMD FirePro         AMD FirePro
                         E5-2697 v2        6386SE            W9100               S10000
                         (Ivy Bridge)      (Bulldozer)       (Volcanic Islands)  (Southern Islands)
Processing cores         12                16                2816                3584
Clock frequency          2.7-3.4 GHz*      2.8-3.5 GHz*      930 MHz             825 MHz
Memory bandwidth         59.7 GB/s/socket  59.7 GB/s/socket  320 GB/s            480 GB/s
Peak Gflops** (single)   576 @ 3.0 GHz     410 @ 3.2 GHz     5240                5910
Peak Gflops** (double)   288 @ 3.0 GHz     205 @ 3.2 GHz     2620                1480
Gflops/Watt (single)     4.4               2.9               19                  15.76
Total memory             >>16 GB           >>16 GB           16 GB               6 GB

* Indicates the range of clock frequencies supported via Intel Turbo Boost and AMD Turbo CORE technology
** At maximum frequency when all cores are executing
GPU Potential Advantages (AMD FirePro W9100 vs. Intel Xeon E5-2697 v2)
9x more single-precision floating-point throughput
9x more double-precision floating-point throughput
5x higher memory bandwidth
GPU Disadvantages
Architecture not as flexible as CPU
Must rewrite algorithms and maintain software in GPU languages
Attached to CPU via relatively slow PCIe – 16GB/s bi-directional for PCIe 3.0 16x
Limited memory (though 6-16GB is reasonable for many applications)
Software Approaches for Acceleration (in order of increasing effort)
Libraries – "drop-in" acceleration
– In-depth GPU knowledge not required
– Highly optimized by GPU experts
– Provide functions used in a broad range of applications (e.g., FFT, BLAS)
OpenACC Directives – simple programming for heterogeneous systems
– Simple compiler hints/pragmas
– Compiler parallelizes code
– Target a variety of platforms
Programming Languages (e.g., OpenCL) – maximum flexibility
An Introduction to OpenCL
OpenCL Overview
Parallel computing architecture
standardized by the Khronos Group
OpenCL:
– Is a royalty free standard
– Provides an API to coordinate parallel
computation across heterogeneous
processors
– Defines a cross-platform programming
language
OpenCL Versions
To date there are four different versions of OpenCL
– OpenCL 1.0
– OpenCL 1.1
– OpenCL 1.2
– OpenCL 2.0 (finalized November 2013)
Different versions support different functionality
Hardware Vendor    Supported OpenCL Version
AMD                OpenCL 1.2
Intel              OpenCL 1.2
NVIDIA             OpenCL 1.1
OpenCL Extensions
Optional functionality is exposed through extensions
– Vendors are not required to support extensions to achieve conformance
– However, extensions are expected to be widely available
Some OpenCL extensions are approved by the OpenCL working group
– These extensions are likely to be promoted to core functionality in future
versions of the standard
Multi-vendor and vendor specific extensions do not need approval by
the working group