PT-4054: "OpenCL™ Accelerated Compute Libraries" by John Melonakos
Presented at the AMD Developer Summit (APU13), Nov. 11-13, 2013.
Software Libraries for CUDA & OpenCL
Heterogeneous Computing is Hard
Two Examples:
1. Median Filtering
2. Local Windowing
Median Filtering
[Figure label: Increasingly Difficult]
Local Windowing
The best algorithm to use depends on which device is in the system.
Runtime (ms)   Device 1   Device 2   Device 3   Device 4
Algorithm 1    395        599        244        102
Algorithm 2    270        703        241        103
Algorithm 3    699        407        138        116
Algorithm 4    380        522        202        98
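The timing table above can be turned into a small lookup that picks the fastest algorithm for whichever device is present. This is a minimal plain C++ sketch, not part of the presentation: the name `fastest_algorithm` is hypothetical, and it assumes every figure in the table is in milliseconds (the slide labels only the first value).

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Timings from the slide, assumed to be milliseconds throughout.
// Rows are Algorithms 1-4, columns are Devices 1-4.
constexpr std::array<std::array<int, 4>, 4> kTimingsMs = {{
    {{395, 599, 244, 102}},
    {{270, 703, 241, 103}},
    {{699, 407, 138, 116}},
    {{380, 522, 202, 98}},
}};

// Hypothetical helper: zero-based index of the fastest algorithm
// for a zero-based device index.
std::size_t fastest_algorithm(std::size_t device) {
    assert(device < 4);
    std::size_t best = 0;
    for (std::size_t alg = 1; alg < kTimingsMs.size(); ++alg) {
        if (kTimingsMs[alg][device] < kTimingsMs[best][device]) {
            best = alg;
        }
    }
    return best;
}
```

With these numbers the winner is Algorithm 2 on Device 1, Algorithm 3 on Devices 2 and 3, and Algorithm 4 on Device 4, so no single algorithm is best everywhere.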
Why Software Libraries Are Great
Reduce many lines of code to one line
Obsessively tuned by experts; faster than DIY
Well-tested and maintained
Continuously improving
Five Influencers (besides price)
Performance, Programmability, Portability, Scalability, Community
Performance & Programmability
[Chart, built up over four slides: approaches placed by performance (slower to faster) and programmability (time-consuming to easy-to-use). Approaches shown: SSE or AVX, Writing Kernels, Compiler Directives, Using Libraries]
Performance
Portability
Flavors of portability
HW vendor options
Accelerator options (GPU, coprocessor, FPGA)
CPU fallback
High-performance mobile computing
Libraries can provide portability
Scalability
Always start with one device
Potential headaches of adding devices
Performance hit
Development complexity
Libraries can make scaling easy
Community
What do you do when bugs arise?
Continuous refinement
Someone to answer questions
Libraries can have great community support
Benefits of Using a Library
Development
Documentation
Test and QA
Maintenance
Porting
[Figure labels: TIME, COST]
Libraries eliminate hidden costs of software development
Pain Pleasure
ArrayFire: Technical Computing
Performance & Programmability
Super easy to program
Highly optimized
Portability
Scalability
Multi-GPU is 1 line of code
array *y = new array[n];
for (int i = 0; i < n; ++i) {
    deviceset(i);            // change GPUs
    array x = randu(5,5);    // add work to GPU's queue
    y[i] = fft(x);           // more work in queue
}
// all GPUs are now computing simultaneously
Community
Over 8,000 posts at
http://forums.accelereyes.com
Nightly library update releases
Stable releases a few times a year
v2.0 coming at the end of summer
Example Case Studies 1
45X: Radar Imaging (System Planning)
17X: Neuro-imaging (Georgia Tech)
20X: Video Processing
12X: Medical Devices (Spencer Tech)
20X: Viral Analyses (CDC)
Example Case Studies 2
70X: Drug Delivery (Georgia Tech)
5X: Weather Models (NCAR)
17X: Surveillance (BAE Systems)
35X: Bioinformatics (Leibnitz)
35X: Power Eng (IIT India)
Hundreds of Functions
reductions
• sum, min, max, count, prod
• vectors, columns, rows, etc
convolutions
• 2D, 3D, ND
dense linear algebra
• LU, QR, Cholesky, SVD, Eigenvalues, Inversion, Solvers, Determinant, Matrix Power
FFTs
• 2D, 3D, ND
image processing
• filter, rotate, erode, dilate, morph, resize, rgb2gray, histograms
interpolate & scale
• vectors, matrices
• rescaling
sorting
• along any dimension
• sort detection
and many more…
Intuitive Functions (estimate π)
#include <stdio.h>
#include <arrayfire.h>
using namespace af;
int main() {
    // 20 million random samples
    int n = 20e6;
    array x = randu(n,1), y = randu(n,1);
    // how many fell inside unit circle?
    float pi = 4 * sum<float>(x*x + y*y < 1) / n;
    printf("pi = %g\n", pi);
    return 0;
}
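For comparison, the same Monte Carlo estimate can be written as an explicit loop in plain C++. This is only a CPU reference sketch using the standard library, not ArrayFire code; the function name `estimate_pi` and the default seed are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <random>

// Hypothetical CPU reference (not ArrayFire): draw n points uniformly in the
// unit square and count how many land inside the quarter unit circle.
double estimate_pi(std::uint64_t n, std::uint32_t seed = 42) {
    assert(n > 0);
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> unit(0.0, 1.0);
    std::uint64_t inside = 0;
    for (std::uint64_t i = 0; i < n; ++i) {
        const double x = unit(gen);
        const double y = unit(gen);
        if (x * x + y * y < 1.0) {
            ++inside;
        }
    }
    // The quarter circle covers pi/4 of the unit square.
    return 4.0 * static_cast<double>(inside) / static_cast<double>(n);
}
```

The library version above expresses the whole loop body as one array expression (`x*x + y*y < 1`) followed by a reduction, which is what lets it run as data-parallel work on the device.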
Data Types
c32     complex, single precision
f64     real, double precision
f32     real, single precision
c64     complex, double precision
b8      boolean byte
array   container object
s32     signed integer
u32     unsigned integer
array x = randu(n, f32);
array y = randu(n, f64);
array z = randu(n, u32);
ND Support
vectors, matrices, volumes… ND
Subscripting
ArrayFire Keywords: end, span
A(1,1)
A(1,span)
A(end,1)
A(end,span)
A(span,span,2)
Generate Arrays
constant(0,3) // 3-by-1 column of zeros, single-precision
constant(1,3,2,f64) // 3-by-2 matrix, double-precision
randu(1,8) // row vector (1x8) of random values (uniform)
randn(2,2) // square matrix (2x2) random values (normal)
identity(3,3) // 3-by-3 identity
randu(5,7,c32) // complex random values
Create Arrays from CPU Data
float hA[] = {0,1,2,3,4,5};
array A(2,3,hA); // 2x3 matrix, single-precision
print(A);
// A = [ 0 2 4 ] Note: Fortran storage order
// [ 1 3 5 ]
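The Fortran (column-major) storage order noted above means element (row, col) of a matrix with `rows` rows sits at offset `col * rows + row` in the host buffer. A minimal sketch in plain C++ follows; the helper name `at_column_major` is hypothetical and is not an ArrayFire function.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: read element (row, col) from a column-major buffer
// that holds a matrix with `rows` rows.
float at_column_major(const std::vector<float>& data, std::size_t rows,
                      std::size_t row, std::size_t col) {
    assert(rows > 0 && row < rows);
    const std::size_t index = col * rows + row;  // walk down columns first
    assert(index < data.size());
    return data[index];
}
```

For the host data {0,1,2,3,4,5} viewed as a 2x3 matrix, element (0,1) is 2 and element (1,2) is 5, matching the printed matrix on the slide.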
Arithmetic
array R = randu(3,3);
array C = constant(1,3,3) + complex(sin(R)); // C is c32
// rescale complex values to unit circle
array a = randn(5,c32);
print(a / abs(a));
L-2 Norm Example
// calculate L-2 norm of every column
sqrt(sum(pow(X, 2))) // norm of every column vector
sqrt(sum(pow(X, 2), 0)) // ..same
sqrt(sum(pow(X, 2), 1)) // norm of every row vector
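What the per-column expression computes can be spelled out as explicit loops. The sketch below is plain C++ rather than ArrayFire, assumes column-major storage, and uses a hypothetical function name `column_l2_norms`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference: L-2 norm of every column of a rows-by-cols
// matrix stored in column-major order.
std::vector<double> column_l2_norms(const std::vector<double>& x,
                                    std::size_t rows, std::size_t cols) {
    assert(x.size() == rows * cols);
    std::vector<double> norms(cols, 0.0);
    for (std::size_t c = 0; c < cols; ++c) {
        double sum_sq = 0.0;
        for (std::size_t r = 0; r < rows; ++r) {
            const double v = x[c * rows + r];
            sum_sq += v * v;  // pow(X, 2), then sum down the column
        }
        norms[c] = std::sqrt(sum_sq);
    }
    return norms;
}
```

A 2x3 matrix with columns (3,4), (0,0) and (1,0) gives norms 5, 0 and 1.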
Subscripting Examples
array A = randu(3,3);
array a1 = A(0); // first element
array a2 = A(0,1); // first row, second column
A(1,span); // second row
A.row(end); // last row
A.cols(1,end); // all but first column
Subscripting Examples
float b_ptr[] = {0,1,2,3,4,5,6,7,8,9};
array b(1,10,b_ptr);
b(seq(3)); // {0,1,2}
b(seq(1,7)); // {1,2,3,4,5,6,7}
b(seq(1,2,7)); // {1,3,5,7}
b(seq(0,2,end)); // {0,2,4,6,8}
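The index sets in the comments above follow an inclusive (first, step, last) pattern. The sketch below reproduces that pattern in plain C++; `make_seq` is a hypothetical helper, and the argument order is inferred from the slide's comments, so it may not match the `seq` signature in every library version.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: indices first, first+step, ... up to and including last.
// Argument order (first, step, last) is inferred from the slide's examples.
std::vector<int> make_seq(int first, int step, int last) {
    assert(step > 0);
    std::vector<int> out;
    for (int i = first; i <= last; i += step) {
        out.push_back(i);
    }
    return out;
}
```

For the 10-element vector on the slide, `end` corresponds to index 9, so `make_seq(0, 2, 9)` yields {0,2,4,6,8} and `make_seq(1, 2, 7)` yields {1,3,5,7}.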
Data Manipulation
// setting entries to a constant
A(span) = 4; // fill entire array
A.row(0) = -1; // first row
A(seq(3)) = 3.1415; // first three elements
Data Manipulation
// copy in another matrix
array B = constant(1,4,4,f64);
B.row(0) = randu(1,4,f32); // set row (upcast)
Data Manipulation
// index with another array
float h_inds[] = {0, 4, 2, 1}; // zero-based
array inds(1,4,h_inds);
B(inds) = randu(4,1); // set to random
Linear Algebra
// matrix factorization
array L, U;
lu(L, U, randu(n,n));
// linear systems: A x = b
array A = randu(n,n), b = randu(n,1);
array x = solve(A,b);
Graphics Functions
asynchronous
non-blocking
throttled at 35 Hz
Graphics Functions
non-blocking primitives
surface - surface plotting (2d data)
image - intensity image visualization
arrows - vector fields
plot2 - line plotting (x,y)
plot3 - scatter plot (x,y,z)
volume - volume rendering for 3d data
Graphics Functions
utility commands
keep_on keep_off
subfigure
palette
clearfig
draw (blocking)
figure
title
close
Graphics Example
#include <arrayfire.h>
using namespace af;
int main() {
    // random 3d surface
    const int n = 256;
    while (1) {
        array x = randu(n,n);
        // 3d surface plot
        surface(x);
    }
    return 0;
}
GFOR Parallel Loops
gfor (array i, 3)
C(span,span,i) = A(span,span,i) * B;
Parallel matrix multiplications (1 kernel launch)
C(,,1) = A(,,1) * B
C(,,2) = A(,,2) * B
C(,,3) = A(,,3) * B
[Figure, built up over three slides: the three products are batched into C(,,1:3) = A(,,1:3) * B]
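The semantics of the GFOR loop can be sketched as an ordinary serial loop: each slice of A is multiplied by the same matrix B. The plain C++ below is a CPU reference under stated assumptions, not how the library implements it. It assumes `*` denotes matrix multiplication (as the slide caption says) and column-major storage, and `batched_matmul` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for C(:,:,s) = A(:,:,s) * B.
// a holds `slices` column-major m-by-k matrices back to back,
// b is one column-major k-by-n matrix.
std::vector<double> batched_matmul(const std::vector<double>& a,
                                   const std::vector<double>& b,
                                   std::size_t m, std::size_t k,
                                   std::size_t n, std::size_t slices) {
    assert(a.size() == m * k * slices);
    assert(b.size() == k * n);
    std::vector<double> c(m * n * slices, 0.0);
    for (std::size_t s = 0; s < slices; ++s) {        // independent iterations
        for (std::size_t col = 0; col < n; ++col) {
            for (std::size_t row = 0; row < m; ++row) {
                double sum = 0.0;
                for (std::size_t p = 0; p < k; ++p) {
                    sum += a[s * m * k + p * m + row] * b[col * k + p];
                }
                c[s * m * n + col * m + row] = sum;
            }
        }
    }
    return c;
}
```

Because the iterations over `s` do not depend on one another, they can all be issued together, which is the property the slides describe as a single kernel launch.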
Four Quick Stories in Conclusion
Advertising Healthcare Finance Oil & Gas
Virtual Glasses Try-On
Acceleration Demands
The CPU code
45 seconds for one session to complete
Highly optimized OpenMP code leveraging all cores
1,000 sessions/minute required 750 CPU nodes
Convert Mac-only research code to C#
Focus on efficiently developed robust performance
ArrayFire Solution
Linear algebra
Matrix multiply, Transpose
Linear solvers
Image processing
Convolutions
Fast Fourier Transform
Correlation Filter
Sobel Filter
Gaussian Blur
OpenCV functions
Custom edge detection
Graphics
Rendering points
Reductions
Min, Max, Sum
JIT
Increased productivity
Results
3X acceleration
Dropped from 750 nodes to 250 nodes
Benefit from ongoing library support
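The node counts in this story are consistent with simple throughput arithmetic. The sketch below assumes each node processes one session at a time, which the slides do not state, and `nodes_required` is a hypothetical helper written for illustration.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: nodes needed to sustain a session rate, assuming
// each node handles exactly one session at a time.
long nodes_required(double sessions_per_minute, double seconds_per_session) {
    assert(sessions_per_minute > 0.0 && seconds_per_session > 0.0);
    const double sessions_in_flight =
        sessions_per_minute * seconds_per_session / 60.0;
    return static_cast<long>(std::ceil(sessions_in_flight));
}
```

At 1,000 sessions per minute and 45 seconds per session this gives 750 nodes. A 3X speedup would bring a session to 15 seconds, which gives 250 nodes.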
Culture-Free Microbiology
[Figure labels: Filling, Filled, Computer-controlled pipettes, Microscope]
A computer-controlled microscope scans a cassette of pipettes, changes imaging modes, and acquires digital images according to program
Acceleration Demands
This platform provides a rapid alternative to traditional cell culturing for susceptibility testing
The faster the analysis pipeline, the sooner a patient can be diagnosed and treated with an antibiotic
Culture-based methods can take 2-3 days, which is problematic for many critically ill patients
ArrayFire Solution
Image Processing
Heavily filter based
Convolve, Filter, Resize
Image Statistics
Mean, StdDev, Variance
Results
Realtime throughput
Kernel                                                    Speedup
Image Registration (heavy use of statistics functions)    73.17x
Custom Filter (Prep Center Image)                         26.48x
Gaussian Blur                                             2.19x
Hedge Protection System
Acceleration Demands
CPU-only version was taking 115 hours
Needs to run entire database of portfolios each night before trading begins next day
ArrayFire Solution
Statistics Functions
Random number generation
Variance
Exponentials
Arithmetic
Sqrt
Element-wise math
Reductions
Sum
Results
GPU version drops runtime to 7 hours and meets the requirement to run overnight
Time left over to try more permutations
Oil Well Monitoring
Ordinary telecom fiber used as an efficient, high fidelity acoustic sensor
Threaded along the length of oil well
Acceleration Demands
Require realtime signal processing from 24 channels per unit with an onsite server
CPU-only solution was 5x slower than realtime
ArrayFire Solution
Heavy usage of signal filtering functions
FIR
IIR
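As a reminder of what an FIR filter computes, here is a textbook direct-form sketch in plain C++. It is a generic illustration, not the ArrayFire implementation or the code used in this project, and `fir_filter` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical direct-form FIR filter: out[n] = sum over k of taps[k] * signal[n - k],
// treating samples before the start of the signal as zero.
std::vector<double> fir_filter(const std::vector<double>& signal,
                               const std::vector<double>& taps) {
    assert(!taps.empty());
    std::vector<double> out(signal.size(), 0.0);
    for (std::size_t n = 0; n < signal.size(); ++n) {
        for (std::size_t k = 0; k < taps.size() && k <= n; ++k) {
            out[n] += taps[k] * signal[n - k];
        }
    }
    return out;
}
```

With taps {0.5, 0.5} (a two-point moving average) the input {1, 2, 3, 4} becomes {0.5, 1.5, 2.5, 3.5}. Every output sample depends only on input samples, so all outputs can be computed in parallel, unlike an IIR filter, whose outputs feed back into later outputs.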
Results
6x performance improvements in signal processing
20x overall performance improvement through more efficiently vectorized code
Software Shop for CUDA & OpenCL
Two ways to work with us:
Use
Hire our CUDA & OpenCL developers
Code development; CUDA & OpenCL training