PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Software Libraries for CUDA & OpenCL

Upload: amd-developer-central

Post on 07-Dec-2014

DESCRIPTION

Presentation PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos at the AMD Developer Summit (APU13) Nov. 11-13, 2013.

TRANSCRIPT

Page 1: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Software Libraries for CUDA & OpenCL

Page 2: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Heterogeneous Computing is Hard

Two Examples:

1. Median Filtering

2. Local Windowing

Page 3: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Median Filtering

Increasingly Difficult
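
For reference, the CPU baseline for median filtering is short; the slide's point is presumably that tuned accelerator versions are much harder to write. The sketch below is a naive 3x3 median filter in plain C++, assuming a row-major grayscale image and clamped borders. The name median3x3 and these conventions are illustrative assumptions, not taken from the talk.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

// Naive 3x3 median filter over a row-major grayscale image.
// Border pixels reuse the nearest valid neighbour (clamped indexing).
std::vector<float> median3x3(const std::vector<float>& img, int width, int height) {
    std::vector<float> out(img.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            std::array<float, 9> window{};
            std::size_t k = 0;
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    const int yy = std::min(std::max(y + dy, 0), height - 1);
                    const int xx = std::min(std::max(x + dx, 0), width - 1);
                    window[k++] = img[static_cast<std::size_t>(yy * width + xx)];
                }
            }
            // The median of 9 values is the 5th smallest (index 4).
            std::nth_element(window.begin(), window.begin() + 4, window.end());
            out[static_cast<std::size_t>(y * width + x)] = window[4];
        }
    }
    return out;
}
```

A single bright pixel in an otherwise flat image is removed by this filter, which is the usual reason for choosing a median over a mean.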

Page 4: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Local Windowing

Best algorithm to use changes given which device is in the system.

              Device 1   Device 2   Device 3   Device 4
Algorithm 1   395 ms     599 ms     244 ms     102 ms
Algorithm 2   270 ms     703 ms     241 ms     103 ms
Algorithm 3   699 ms     407 ms     138 ms     116 ms
Algorithm 4   380 ms     522 ms     202 ms      98 ms
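
The timing table above can be turned into a per-device selection rule. The sketch below assumes every figure in the table is in milliseconds (the slide prints the unit only once) and uses a hypothetical helper, fastest_algorithm, that is not part of any library.

```cpp
#include <cstddef>

// Timings in ms copied from the slide: kTimings[algorithm][device].
constexpr int kTimings[4][4] = {
    {395, 599, 244, 102},  // Algorithm 1
    {270, 703, 241, 103},  // Algorithm 2
    {699, 407, 138, 116},  // Algorithm 3
    {380, 522, 202, 98},   // Algorithm 4
};

// Returns the 1-based number of the fastest algorithm for a 1-based device.
int fastest_algorithm(int device) {
    const std::size_t d = static_cast<std::size_t>(device - 1);
    std::size_t best = 0;
    for (std::size_t a = 1; a < 4; ++a) {
        if (kTimings[a][d] < kTimings[best][d]) best = a;
    }
    return static_cast<int>(best) + 1;
}
```

With these numbers the winner is Algorithm 2 on Device 1, Algorithm 3 on Devices 2 and 3, and Algorithm 4 on Device 4, so no single algorithm is best everywhere.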

Page 5: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Why Software Libraries Are Great

Reduce many lines of code to one line

Obsessively tuned by experts; faster than DIY

Well-tested and maintained

Continuously improving

Page 6: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Five Influencers (besides price)

Portability, Scalability, Community, Programmability, Performance

Page 7: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Performance & Programmability

[Chart: axes run from Slower to Faster and from Time-consuming to Easy-to-use. Plotted: SSE or AVX]

Page 8: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Performance & Programmability

[Chart: axes run from Slower to Faster and from Time-consuming to Easy-to-use. Plotted: Writing Kernels; SSE or AVX]

Page 9: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Performance & Programmability

[Chart: axes run from Slower to Faster and from Time-consuming to Easy-to-use. Plotted: Writing Kernels; Compiler Directives; SSE or AVX]

Page 10: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Performance & Programmability

[Chart: axes run from Slower to Faster and from Time-consuming to Easy-to-use. Plotted: Writing Kernels; Using Libraries; Compiler Directives; SSE or AVX]

Page 11: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Performance

Page 12: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Performance

Page 13: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Portability

Flavors of portability

HW vendor options

Accelerator options (GPU, coprocessor, FPGA)

CPU fallback

High-performance mobile computing

Libraries can provide portability

Page 14: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Scalability

Always start with one device

Potential headaches of adding devices

Performance hit

Development complexity

Libraries can make scaling easy

Page 15: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Community

What do you do when bugs arise?

Continuous refinement

Someone to answer questions

Libraries can have great community support

Page 16: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Benefits of Using a Library

Development

Documentation

Test and QA

Maintenance

Porting

[Figure labels: TIME, COST; Pain, Pleasure]

Libraries eliminate hidden costs of software development

Page 17: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

ArrayFire: Technical Computing

Page 18: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Performance & Programmability

Super easy to program

Highly optimized

Page 19: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Portability

Page 20: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Scalability

Multi-GPU is 1-line of code

array *y = new array[n];
for (int i = 0; i < n; ++i) {
    deviceset(i);            // change GPUs
    array x = randu(5,5);    // add work to GPU's queue
    y[i] = fft(x);           // more work in queue
}
// all GPUs are now computing simultaneously

Page 21: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Community

Over 8,000 posts at http://forums.accelereyes.com

Nightly library update releases

Stable releases a few times a year

v2.0 coming at the end of summer

Page 22: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Example Case Studies 1

45X

Radar Imaging

System Planning

17X

Neuro-imaging

Georgia Tech

20X

Video Processing

Google

12X

Medical Devices

Spencer Tech

20X

Viral Analyses

CDC

Page 23: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Example Case Studies 2

70X

Drug Delivery

Georgia Tech

5X

Weather Models

NCAR

17X

Surveillance

BAE Systems

35X

Bioinformatics

Leibnitz

35X

Power Eng

IIT India

Page 24: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Hundreds of Functions

reductions

• sum, min, max, count, prod

• vectors, columns, rows, etc

convolutions

• 2D, 3D, ND

dense linear algebra

• LU, QR, Cholesky, SVD, Eigenvalues, Inversion, Solvers, Determinant, Matrix Power

FFTs

• 2D, 3D, ND

image processing

• filter, rotate, erode, dilate, morph, resize, rgb2gray, histograms

interpolate & scale

• vectors, matrices

• rescaling

sorting

• along any dimension

• sort detection

and many more…

Page 25: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Intuitive Functions (estimate π)

#include <stdio.h>
#include <arrayfire.h>
using namespace af;

int main() {
    // 20 million random samples
    int n = 20e6;
    array x = randu(n,1), y = randu(n,1);
    // how many fell inside unit circle?
    float pi = 4 * sum<float>(x*x + y*y < 1) / n;
    printf("pi = %g\n", pi);
    return 0;
}
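
For comparison, the same Monte Carlo estimate can be written as a scalar CPU loop with only the C++ standard library. This is a sketch of the technique, not the ArrayFire implementation; the function name estimate_pi and the choice of std::mt19937 are assumptions made for illustration.

```cpp
#include <cstdint>
#include <random>

// Monte Carlo estimate of pi: draw n points uniformly in the unit square and
// count how many land inside the quarter unit circle (x*x + y*y < 1).
double estimate_pi(std::uint64_t n, std::uint32_t seed) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    std::uint64_t inside = 0;
    for (std::uint64_t i = 0; i < n; ++i) {
        const double x = unit(gen);
        const double y = unit(gen);
        if (x * x + y * y < 1.0) ++inside;
    }
    // The quarter circle covers pi/4 of the square, hence the factor of 4.
    return 4.0 * static_cast<double>(inside) / static_cast<double>(n);
}
```

The library version on the slide expresses the loop body as whole-array operations (x*x + y*y < 1 followed by a sum), which is what lets it run as a few device kernels instead of n scalar iterations.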

Page 26: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Data Types

array   container object
f32     real, single precision
f64     real, double precision
c32     complex, single precision
c64     complex, double precision
b8      boolean byte
s32     signed integer
u32     unsigned integer

array x = randu(n, f32);

array y = randu(n, f64);

array z = randu(n, u32);

Page 27: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

ND Support

vectors, matrices, volumes, … ND

Page 28: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Subscripting

ArrayFire Keywords: end, span

[Figure labels: A(1,1), A(1,span), A(end,1), A(end,span), A(span,span,2)]

Page 29: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Generate Arrays

constant(0,3) // 3-by-1 column of zeros, single-precision

constant(1,3,2,f64) // 3-by-2 matrix, double-precision

randu(1,8) // row vector (1x8) of random values (uniform)

randn(2,2) // square matrix (2x2) random values (normal)

identity(3,3) // 3-by-3 identity

randu(5,7,c32) // complex random values

Page 30: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Create Arrays from CPU Data

float hA[] = {0,1,2,3,4,5};

array A(2,3,hA); // 2x3 matrix, single-precision

print(A);

// A = [ 0 2 4 ] Note: Fortran storage order

// [ 1 3 5 ]
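
The Fortran (column-major) storage order noted above means element (row, col) of a matrix with a given number of rows sits at offset row + col * rows in the flat host buffer. A minimal sketch, with at_column_major as a hypothetical helper name:

```cpp
#include <cstddef>

// Column-major ("Fortran order") lookup: element (row, col) of a matrix with
// `rows` rows lives at offset row + col * rows in the flat buffer.
float at_column_major(const float* data, std::size_t rows,
                      std::size_t row, std::size_t col) {
    return data[row + col * rows];
}
```

Applied to hA = {0,1,2,3,4,5} with 2 rows, the first row reads 0, 2, 4 and the second row reads 1, 3, 5, matching the printed matrix.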

Page 31: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Arithmetic

array R = randu(3,3);

array C = constant(1,3,3) + complex(sin(R)); // C is c32

// rescale complex values to unit circle

array a = randn(5,c32);

print(a / abs(a));

Page 32: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

L-2 Norm Example

// calculate L-2 norm of every column

sqrt(sum(pow(X, 2))) // norm of every column vector

sqrt(sum(pow(X, 2), 0)) // ..same

sqrt(sum(pow(X, 2), 1)) // norm of every row vector
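
Spelled out as explicit loops, the reductions above sum the squares along one dimension and then take a square root per column or per row. The sketch below assumes a real-valued, column-major matrix; column_norms and row_norms are illustrative names, not library functions.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// X is column-major with `rows` rows and `cols` columns.
// Reducing along dimension 0 gives one L-2 norm per column.
std::vector<double> column_norms(const std::vector<double>& X,
                                 std::size_t rows, std::size_t cols) {
    std::vector<double> norms(cols, 0.0);
    for (std::size_t c = 0; c < cols; ++c) {
        for (std::size_t r = 0; r < rows; ++r) {
            const double v = X[r + c * rows];
            norms[c] += v * v;
        }
        norms[c] = std::sqrt(norms[c]);
    }
    return norms;
}

// Reducing along dimension 1 gives one L-2 norm per row.
std::vector<double> row_norms(const std::vector<double>& X,
                              std::size_t rows, std::size_t cols) {
    std::vector<double> norms(rows, 0.0);
    for (std::size_t c = 0; c < cols; ++c) {
        for (std::size_t r = 0; r < rows; ++r) {
            const double v = X[r + c * rows];
            norms[r] += v * v;
        }
    }
    for (double& n : norms) n = std::sqrt(n);
    return norms;
}
```

For the 2x2 matrix with columns (3, 4) and (0, 0), the column norms are 5 and 0 and the row norms are 3 and 4.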

Page 33: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Subscripting Examples

array A = randu(3,3);

array a1 = A(0); // first element

array a2 = A(0,1); // first row, second column

A(1,span); // second row

A.row(end); // last row

A.cols(1,end); // all but first column

Page 34: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Subscripting Examples

float b_ptr[] = {0,1,2,3,4,5,6,7,8,9};

array b(1,10,b_ptr);

b(seq(3)); // {0,1,2}

b(seq(1,7)); // {1,2,3,4,5,6,7}

b(seq(1,2,7)); // {1,3,5,7}

b(seq(0,2,end)); // {0,2,4,6,8}
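
The index sets in the comments above can be reproduced with a small helper. This sketch assumes the argument order the slide's comments imply, seq(first, step, last) with an inclusive last index, and a positive step; seq_indices is a hypothetical name, and other library versions may order the arguments differently.

```cpp
#include <vector>

// Returns first, first + step, ... up to and including last (step > 0).
std::vector<int> seq_indices(int first, int step, int last) {
    std::vector<int> idx;
    for (int i = first; i <= last; i += step) idx.push_back(i);
    return idx;
}
```

For the 10-element vector b, end corresponds to index 9, so seq_indices(0, 2, 9) yields 0, 2, 4, 6, 8 and seq_indices(1, 2, 7) yields 1, 3, 5, 7.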

Page 35: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Data Manipulation

// setting entries to a constant

A(span) = 4; // fill entire array

A.row(0) = -1; // first row

A(seq(3)) = 3.1415; // first three elements

Page 36: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Data Manipulation

// copy in another matrix

array B = constant(1,4,4,f64);

B.row(0) = randu(1,4,f32); // set row (upcast)

Page 37: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Data Manipulation

// index with another array

float h_inds[] = {0, 4, 2, 1}; // zero-based

array inds(1,4,h_inds);

B(inds) = randu(4,1); // set to random

Page 38: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Linear Algebra

// matrix factorization

array L, U;

lu(L, U, randu(n,n));

// linear systems: A x = b

array A = randu(n,n), b = randu(n,1);

array x = solve(A,b);
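
To show what a dense solve of A x = b involves, here is a textbook CPU version: Gaussian elimination with partial pivoting followed by back substitution. It is a sketch of the general technique only; the slides do not say which algorithm the library's solve uses, and solve_dense and the Matrix alias are names introduced here.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // row-major, n-by-n

// Solves A x = b by Gaussian elimination with partial pivoting.
// A and b are taken by value because elimination overwrites them.
std::vector<double> solve_dense(Matrix A, std::vector<double> b) {
    const std::size_t n = b.size();
    for (std::size_t k = 0; k < n; ++k) {
        // Pick the row with the largest magnitude entry in column k.
        std::size_t pivot = k;
        for (std::size_t r = k + 1; r < n; ++r) {
            if (std::fabs(A[r][k]) > std::fabs(A[pivot][k])) pivot = r;
        }
        if (A[pivot][k] == 0.0) throw std::runtime_error("singular matrix");
        std::swap(A[k], A[pivot]);
        std::swap(b[k], b[pivot]);
        // Eliminate column k from the rows below the pivot.
        for (std::size_t r = k + 1; r < n; ++r) {
            const double f = A[r][k] / A[k][k];
            for (std::size_t c = k; c < n; ++c) A[r][c] -= f * A[k][c];
            b[r] -= f * b[k];
        }
    }
    // Back substitution on the resulting upper-triangular system.
    std::vector<double> x(n);
    for (std::size_t i = n; i-- > 0;) {
        double s = b[i];
        for (std::size_t c = i + 1; c < n; ++c) s -= A[i][c] * x[c];
        x[i] = s / A[i][i];
    }
    return x;
}
```

For A = [2 1; 1 3] and b = (3, 5) this returns x = (0.8, 1.4).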

Page 39: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Graphics Functions

asynchronous

non-blocking

throttled at 35 Hz

Page 40: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Graphics Functions

non-blocking primitives

surface - surface plotting (2d data)

image - intensity image visualization

arrows - vector fields

plot2 - line plotting (x,y)

plot3 - scatter plot (x,y,z)

volume - volume rendering for 3d data

Page 41: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Graphics Functions

utility commands

keep_on keep_off

subfigure

palette

clearfig

draw (blocking)

figure

title

close

Page 42: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Graphics Example

#include <arrayfire.h>
using namespace af;

int main() {
    // random 3d surface
    const int n = 256;
    while (1) {
        array x = randu(n,n);
        // 3d surface plot
        surface(x);
    }
    return 0;
}

Page 43: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

GFOR Parallel Loops

gfor (array i, 3)

C(span,span,i) = A(span,span,i) * B;

Parallel matrix multiplications (1 kernel launch)

C(,,1) = A(,,1) * B

C(,,2) = A(,,2) * B

C(,,3) = A(,,3) * B

Page 44: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

GFOR Parallel Loops

C(,,1:3) = A(,,1:3) * B

gfor (array i, 3)

C(span,span,i) = A(span,span,i) * B;

Parallel matrix multiplications (1 kernel launch)

Page 45: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

GFOR Parallel Loops

gfor (array i, 3)

C(span,span,i) = A(span,span,i) * B;

Parallel matrix multiplications (1 kernel launch)

C = A * B
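
Written as ordinary serial loops, the computation the gfor slides describe is one matrix product per slice, all sharing the same B. The sketch below assumes column-major storage and square n-by-n slices; batched_matmul is an illustrative name. The outer loop over i is the part gfor is said to collapse into a single kernel launch.

```cpp
#include <cstddef>
#include <vector>

// A has shape (n, n, batch) and B has shape (n, n), both column-major.
// Computes C(:,:,i) = A(:,:,i) * B for every slice i.
std::vector<double> batched_matmul(const std::vector<double>& A,
                                   const std::vector<double>& B,
                                   std::size_t n, std::size_t batch) {
    std::vector<double> C(n * n * batch, 0.0);
    for (std::size_t i = 0; i < batch; ++i) {  // the loop gfor replaces
        const std::size_t slice = i * n * n;
        for (std::size_t col = 0; col < n; ++col) {
            for (std::size_t row = 0; row < n; ++row) {
                double acc = 0.0;
                for (std::size_t k = 0; k < n; ++k) {
                    acc += A[slice + row + k * n] * B[k + col * n];
                }
                C[slice + row + col * n] = acc;
            }
        }
    }
    return C;
}
```

Because the slices are independent of each other, they can be computed in any order or all at once, which is the property a batched launch relies on.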

Page 46: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Four Quick Stories in Conclusion

Advertising, Healthcare, Finance, Oil & Gas

Page 47: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Virtual Glasses Try-On

Page 48: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Acceleration Demands

The CPU code

45 seconds for one session to complete

Highly optimized OpenMP code leveraging all cores

1,000 sessions/minute required 750 CPU nodes

Convert Mac-only research code to C#

Focus on efficiently developed robust performance

Page 49: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

ArrayFire Solution

Linear algebra

Matrix multiply, Transpose

Linear solvers

Image processing

Convolutions

Fast Fourier Transform

Correlation Filter

Sobel Filter

Gaussian Blur

OpenCV functions

Custom edge detection

Graphics

Rendering points

Reductions

Min, Max, Sum

JIT

Increased productivity

Page 50: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Results

3X acceleration

Dropped from 750 nodes to 250 nodes

Benefit from ongoing library support
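
The node counts follow from the figures on the earlier slide: 1,000 sessions per minute at 45 seconds each is 45,000 seconds of work per minute, or 750 nodes, and a 3X speedup cuts that to 250. A small sketch of the arithmetic, assuming each node handles one session at a time with perfect load balancing (nodes_required is a hypothetical helper):

```cpp
#include <cmath>

// Nodes needed to sustain a session rate when each node runs one session at
// a time: sessions per minute * seconds per session / 60 seconds per minute.
int nodes_required(int sessions_per_minute, double seconds_per_session) {
    return static_cast<int>(
        std::ceil(sessions_per_minute * seconds_per_session / 60.0));
}
```

nodes_required(1000, 45.0) gives 750, and with sessions three times faster, nodes_required(1000, 15.0) gives 250.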

Page 51: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Culture-Free Microbiology

Filling

Filled

Computer-controlled pipettes

Page 52: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Microscope

A computer-controlled microscope scans a cassette of pipettes, changes imaging modes, and acquires digital images according to program

Page 53: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Acceleration Demands

This platform provides a rapid alternative to traditional cell culturing for susceptibility testing

The faster the analysis pipeline, the sooner a patient can be diagnosed and treated with an antibiotic

Culture-based methods can take 2-3 days, which is problematic for many critically ill patients

Page 54: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

ArrayFire Solution

Image Processing

Heavily filter based

Convolve, Filter, Resize

Image Statistics

Mean, StdDev, Variance

Page 55: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Results

Realtime throughput

Kernel                                                    Speedup
Image Registration (heavy use of statistics functions)   73.17x
Custom Filter (Prep Center Image)                         26.48x
Gaussian Blur                                              2.19x

Page 56: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Hedge Protection System

Page 57: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Acceleration Demands

CPU-only version was taking 115 hours

Needs to run entire database of portfolios each night before trading begins next day

Page 58: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

ArrayFire Solution

Statistics Functions

Random number generation

Variance

Exponentials

Arithmetic

Sqrt

Element-wise math

Reductions

Sum

Page 59: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Results

GPU version drops runtime to 7 hours and meets the requirement to run overnight

Time left over to try more permutations

Page 60: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Oil Well Monitoring

Ordinary telecom fiber used as an efficient, high fidelity acoustic sensor

Threaded along the length of oil well

Page 61: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Acceleration Demands

Require realtime signal processing from 24 channels per unit with an onsite server

CPU-only solution was 5x slower than realtime

Page 62: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

ArrayFire Solution

Heavy usage of signal filtering functions

FIR

IIR
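
As a reminder of what the two filter families compute, here is a minimal CPU sketch: a direct-form FIR filter and a one-pole IIR low-pass. The function names, the zero initial conditions, and the choice of a one-pole IIR are assumptions for illustration; the slides do not describe the filters used in the project.

```cpp
#include <cstddef>
#include <vector>

// Direct-form FIR: y[n] = sum over k of taps[k] * x[n - k].
// Samples before the start of x are treated as zero.
std::vector<double> fir_filter(const std::vector<double>& taps,
                               const std::vector<double>& x) {
    std::vector<double> y(x.size(), 0.0);
    for (std::size_t n = 0; n < x.size(); ++n) {
        for (std::size_t k = 0; k < taps.size() && k <= n; ++k) {
            y[n] += taps[k] * x[n - k];
        }
    }
    return y;
}

// One-pole IIR low-pass: y[n] = alpha * x[n] + (1 - alpha) * y[n - 1],
// starting from y[-1] = 0. The output feeds back into later samples.
std::vector<double> iir_one_pole(double alpha, const std::vector<double>& x) {
    std::vector<double> y(x.size(), 0.0);
    double prev = 0.0;
    for (std::size_t n = 0; n < x.size(); ++n) {
        prev = alpha * x[n] + (1.0 - alpha) * prev;
        y[n] = prev;
    }
    return y;
}
```

An FIR output depends only on a finite window of inputs, so each sample can be computed independently; the IIR recurrence depends on the previous output, which makes it harder to parallelize along the time axis.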

Page 63: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Results

6x performance improvements in signal processing

20x overall performance improvement through more efficiently vectorized code

Page 64: PT-4054, "OpenCL™ Accelerated Compute Libraries" by John Melonakos

Software Shop for CUDA & OpenCL

Two ways to work with us:

Use

Hire our CUDA & OpenCL developers

Code development; CUDA & OpenCL training