Leverage the Speed of OpenCL™ with AMD Math Libraries

HETEROGENEOUS MATH LIBRARIES KENT KNOX 12/16/2014

Upload: amd-developer-central

Post on 12-Jul-2015


TRANSCRIPT

Page 1: Leverage the Speed of OpenCL™ with AMD Math Libraries

HETEROGENEOUS MATH LIBRARIES KENT KNOX 12/16/2014

Page 2: Leverage the Speed of OpenCL™ with AMD Math Libraries

2 | HETEROGENEOUS MATH LIBRARIES | DECEMBER 16, 2014

AGENDA

LIBRARIES COVERED

A survey of available libraries

clMATH

‒ clBLAS

‒ clFFT

ACML

clMAGMA

Bolt

Page 3: Leverage the Speed of OpenCL™ with AMD Math Libraries


CLMATHLIBRARIES

clMathLibraries is a GitHub organization for OpenCL™ math-related subprojects

https://github.com/clMathLibraries

Currently hosting two subprojects: clBLAS & clFFT

Page 4: Leverage the Speed of OpenCL™ with AMD Math Libraries

Open Source clBLAS

Page 5: Leverage the Speed of OpenCL™ with AMD Math Libraries


CLBLAS - HTTPS://GITHUB.COM/CLMATHLIBRARIES/CLBLAS

API

clBLAS implements the NetLib BLAS functionality with OpenCL

‒ Level 3 – Matrix x Matrix operations, O( N^3 ), compute bound

‒ Level 2 – Matrix x Vector operations, O( N^2 ), mostly memory bound

‒ Level 1 – Vector x Vector operations, O( N ), memory bound

The API is in the same style as NetLib, but appends OpenCL structures

‒ clblasStatus clblasSgemm( clblasOrder order, clblasTranspose transA, clblasTranspose transB,
      size_t M, size_t N, size_t K,
      cl_float alpha, const cl_mem A, size_t offA, size_t lda,
      const cl_mem B, size_t offB, size_t ldb,
      cl_float beta, cl_mem C, size_t offC, size_t ldc,
      cl_uint numCommandQueues, cl_command_queue* commandQueues,
      cl_uint numEventsInWaitList, const cl_event* eventWaitList, cl_event* events )

clBLAS assumes that the user is comfortable with OpenCL programming

‒ The host code is responsible for detecting/choosing devices, transferring memory and synchronizing operations
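To make the sgemm parameters concrete, here is a minimal CPU-side sketch of the computation the call describes, C = alpha * A * B + beta * C. It is not clBLAS code: the function name referenceSgemmColMajor is hypothetical, host vectors stand in for the cl_mem buffers, storage is assumed column-major with no transposes, and offsets are assumed to be counted in elements.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for C = alpha * A * B + beta * C (column-major, no transposes).
// Hypothetical helper for illustration only; it is not part of clBLAS.
// A is M x K, B is K x N, C is M x N. Each matrix starts at its offset
// (counted in elements) and uses its leading dimension as the column stride.
void referenceSgemmColMajor(std::size_t M, std::size_t N, std::size_t K,
                            float alpha,
                            const std::vector<float>& A, std::size_t offA, std::size_t lda,
                            const std::vector<float>& B, std::size_t offB, std::size_t ldb,
                            float beta,
                            std::vector<float>& C, std::size_t offC, std::size_t ldc)
{
    // A leading dimension must span at least one full column.
    assert(lda >= M && ldb >= K && ldc >= M);

    for (std::size_t j = 0; j < N; ++j) {
        for (std::size_t i = 0; i < M; ++i) {
            float acc = 0.0f;
            for (std::size_t k = 0; k < K; ++k) {
                acc += A[offA + i + k * lda] * B[offB + k + j * ldb];
            }
            float& c = C[offC + i + j * ldc];
            c = alpha * acc + beta * c;
        }
    }
}
```

The triple loop is where the O( N^3 ) cost of a Level 3 routine comes from. A result from the library can be compared against a reference like this on small matrices before moving to larger problem sizes.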

Page 6: Leverage the Speed of OpenCL™ with AMD Math Libraries


CLBLAS - HTTPS://GITHUB.COM/CLMATHLIBRARIES/CLBLAS

API

A proof-of-concept Python wrapper for clBLAS has been started, but only sgemm is wrapped so far

‒ https://github.com/clMathLibraries/clBLAS/tree/master/src/wrappers/python

‒ Based on Cython

‒ Works with PyOpenCL to manage OpenCL state

‒ Would love help from the community to finish this

The community wrote a Julia wrapper for clBLAS

‒ https://github.com/JuliaGPU/CLBLAS.jl

Page 7: Leverage the Speed of OpenCL™ with AMD Math Libraries


CLBLAS - HTTPS://GITHUB.COM/CLMATHLIBRARIES/CLBLAS

clBLAS contains a Tune tool for finding better OpenCL kernels

• The user is responsible for running the tool on their machine as a preprocessing step

• The tool creates a kernel database file (.kdb) that contains the best performing kernel for a given BLAS routine

• The .kdb file is specific to an OpenCL device and is named after that device, e.g. tahiti.kdb

• Example

• export CLBLAS_STORAGE_PATH=/usr/local/lib

• ./tune --gemm --double
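A host program may want to check whether a tuned database already exists before running the tool. The sketch below only composes the expected file name from the storage directory and the device name. The helper kernelDatabasePath is hypothetical, and lower-casing the device name is an assumption inferred from the tahiti.kdb example; the library's actual naming rule may differ.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper: builds "<storageDir>/<device>.kdb".
// Assumption: the file name is the lower-cased device name, as suggested by
// the "tahiti.kdb" example. clBLAS itself may apply a different rule.
std::string kernelDatabasePath(const std::string& storageDir, std::string deviceName)
{
    std::transform(deviceName.begin(), deviceName.end(), deviceName.begin(),
                   [](unsigned char ch) { return static_cast<char>(std::tolower(ch)); });

    std::string path = storageDir;
    if (!path.empty() && path.back() != '/') {
        path += '/';
    }
    return path + deviceName + ".kdb";
}
```

In practice the directory would come from the CLBLAS_STORAGE_PATH environment variable (for example via std::getenv) and the device name from the OpenCL device query.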

Page 8: Leverage the Speed of OpenCL™ with AMD Math Libraries

Open Source clFFT

Page 9: Leverage the Speed of OpenCL™ with AMD Math Libraries


CLFFT - HTTPS://GITHUB.COM/CLMATHLIBRARIES/CLFFT

API

clFFT implements an FFTW-inspired interface with OpenCL

‒ Provides a fast and accurate platform for calculating discrete FFTs

‒ Supports 1D, 2D, and 3D transforms with a batch size that can be greater than 1

‒ Supports dimension lengths that can be any mix of powers of 2, 3, and 5

‒ Supports single and double precision floating point formats

clFFT assumes that the user is comfortable with OpenCL programming

‒ The host code is responsible for detecting/choosing devices, transferring memory and synchronizing operations

The community wrote a Python wrapper for clFFT

‒ https://github.com/geggo/gpyfft

The community wrote a Julia wrapper for clFFT

‒ https://github.com/JuliaGPU/CLFFT.jl
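The length restriction (any mix of powers of 2, 3, and 5) is easy to check on the host before creating a plan. The following is a small sketch with hypothetical helper names; it is not part of clFFT. The second function finds the next usable length, which is one common way to pick a zero-padded size.

```cpp
#include <cstddef>

// True when n factors into powers of 2, 3 and 5 only (n = 2^a * 3^b * 5^c).
// Hypothetical helper for illustration; not a clFFT function.
bool isRadix235Length(std::size_t n)
{
    if (n == 0) {
        return false;
    }
    const std::size_t radices[] = {2, 3, 5};
    for (std::size_t r : radices) {
        while (n % r == 0) {
            n /= r;
        }
    }
    return n == 1;
}

// Smallest length >= n that satisfies the restriction above.
std::size_t nextRadix235Length(std::size_t n)
{
    if (n == 0) {
        n = 1;
    }
    while (!isRadix235Length(n)) {
        ++n;
    }
    return n;
}
```

For example, 1000 = 2^3 * 5^3 qualifies, while 1001 = 7 * 11 * 13 does not and would need padding.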

Page 10: Leverage the Speed of OpenCL™ with AMD Math Libraries


CLFFT - HTTPS://GITHUB.COM/CLMATHLIBRARIES/CLFFT

clFFT contains the concept of ‘plans’, which allows the library to tune OpenCL kernels at runtime

• Users set all FFT state in an FFT plan object when initializing

• Call ‘BakePlan’ on the plan object to tell the library to generate and compile (JIT) the kernel outside of performance-sensitive loops

• Reuse those plans as much as possible!
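The advice to bake once and reuse can be sketched as a small cache keyed on the transform configuration. Everything here is a stand-in: FakePlan and PlanCache are hypothetical types, and bake() only counts calls where a real program would create and bake a clFFT plan. The point is the pattern, in which the expensive step runs once per distinct configuration however many transforms are enqueued.

```cpp
#include <cstddef>
#include <map>
#include <utility>

// Hypothetical stand-in for an FFT plan; real clFFT plans are opaque handles.
struct FakePlan {
    std::size_t length;
    bool doublePrecision;
    bool baked;
};

// Caches one baked plan per (length, precision) configuration.
class PlanCache {
public:
    // Returns a baked plan, creating and baking it only on the first request.
    const FakePlan& get(std::size_t length, bool doublePrecision)
    {
        const auto key = std::make_pair(length, doublePrecision);
        auto it = plans_.find(key);
        if (it == plans_.end()) {
            FakePlan plan{length, doublePrecision, false};
            bake(plan);  // expensive step, kept out of the hot loop
            it = plans_.emplace(key, plan).first;
        }
        return it->second;
    }

    std::size_t bakeCount() const { return bakeCount_; }

private:
    // Placeholder for kernel generation and compilation.
    void bake(FakePlan& plan)
    {
        plan.baked = true;
        ++bakeCount_;
    }

    std::map<std::pair<std::size_t, bool>, FakePlan> plans_;
    std::size_t bakeCount_ = 0;
};
```

Calling get() inside a processing loop then costs only a map lookup after the first iteration.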

Page 11: Leverage the Speed of OpenCL™ with AMD Math Libraries


CLFFT - HTTPS://GITHUB.COM/CLMATHLIBRARIES/CLFFT

PERFORMANCE

clFFT v2.3.1 is included in ACML v6.1

‒ This version contains optimizations not yet pushed to the public GitHub repo

‒ You can use the clFFT.h header file from GitHub to compile your application, then use the binary from ACML

Benchmark system

‒ 64-bit Linux

‒ FirePro W9100

‒ Catalyst Pro 14.301.1010

‒ AMD A10-7850K

Page 12: Leverage the Speed of OpenCL™ with AMD Math Libraries

ACML 6

Page 13: Leverage the Speed of OpenCL™ with AMD Math Libraries


ACML 6 INTRODUCES HETEROGENEOUS COMPUTE

LEVERAGING CLMATH LIBRARIES TO ACCELERATE WITH OPENCL

OpenCL can be a difficult language to learn

‒ There exist legacy applications that won’t be ported to OpenCL

‒ Their users might be willing to sacrifice peak performance for program portability

ACML 6 includes clBLAS & clFFT as new backends

‒ ACML hides all OpenCL programming from end users

‒ Client programs do not need to change at all; they only relink against ACML 6

When ACML determines that a particular BLAS or FFT call will benefit from offloading computation, it offloads the call transparently; the client program does not need to know

ACML 6 keeps the same API!

Page 14: Leverage the Speed of OpenCL™ with AMD Math Libraries


NEW FFTW WRAPPER

ACML 6 now ships with fftw.h

FFTW programs can link with ACML 6 to offload computation onto OpenCL devices

No changes in host code required!

Page 15: Leverage the Speed of OpenCL™ with AMD Math Libraries


ACMLSCRIPT

ACML includes a new scripting language that expresses the logic ACML uses to offload computation

• The scripting language uses Lua, with custom ACML callback functions

• http://www.lua.org/

• Refer to chapter 7 of the ACML documentation for more information on how to modify or create your own scripts

Page 16: Leverage the Speed of OpenCL™ with AMD Math Libraries


ACMLSCRIPT: 3-PART VIDEO TUTORIALS

ACMLScript: Part 1

ACMLScript: Part 2

ACMLScript: Part 3

HTTPS://WWW.YOUTUBE.COM/USER/AMDDEVCENTRAL

Page 17: Leverage the Speed of OpenCL™ with AMD Math Libraries


ACML - HTTPS://GITHUB.COM/CLMATHLIBRARIES/CLFFT

PERFORMANCE

ACML v6.0 sgemm

‒ These results are slightly dated

‒ Notice that the green line is equivalent to Max( blue, red )

‒ ACML runs the computation on the host processor if the problem is too small to benefit from GPU acceleration

Benchmark system

‒ AMD A10-7850K (CPU & GPU)

‒ 64-bit Linux

‒ Catalyst 14.301.1001
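The Max( blue, red ) observation follows from a size-based dispatch: small problems stay on the host and large ones are offloaded, so the delivered performance tracks whichever device is faster at that size. The sketch below illustrates the idea only. The names and the single crossover parameter are hypothetical; ACML's real decision logic lives in its ACMLScript files and is not reproduced here.

```cpp
#include <algorithm>
#include <cstddef>

enum class Target { Cpu, Gpu };

// Hypothetical rule: offload only when the matrix dimension reaches a
// crossover size. The crossover value is an assumption supplied by the caller.
Target chooseTarget(std::size_t n, std::size_t crossover)
{
    return n >= crossover ? Target::Gpu : Target::Cpu;
}

// With a well-placed crossover, the delivered throughput at each problem size
// is the better of the CPU-only and GPU-only curves.
double dispatchedGflops(double cpuGflops, double gpuGflops)
{
    return std::max(cpuGflops, gpuGflops);
}
```

A crossover like this would normally be measured on the target machine, since it depends on the CPU, the GPU and the cost of moving data between them.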

Page 18: Leverage the Speed of OpenCL™ with AMD Math Libraries

Open Source clMAGMA

Page 19: Leverage the Speed of OpenCL™ with AMD Math Libraries


CLMAGMA

clMAGMA implements LAPACK functionality with OpenCL acceleration

https://bitbucket.org/icl/clmagma

Maintained by the University of Tennessee, Knoxville

Page 20: Leverage the Speed of OpenCL™ with AMD Math Libraries


CLMAGMA

LEVERAGING CLMATH LIBRARIES TO ACCELERATE WITH OPENCL

The newest release, v1.3, supports

‒ LU, QR and Cholesky factorizations

‒ Linear and least squares solvers

‒ Reductions to Hessenberg, bidiagonal and tridiagonal forms

‒ Eigenvalue and singular value problem solvers

‒ Orthogonal transformation routines

clMAGMA uses clBLAS as the GPU compute backend

‒ It currently provides static load balancing between CPU & GPU cores

‒ Multi-GPU support

v1.3 adds support for Windows and Mac OS X

Page 21: Leverage the Speed of OpenCL™ with AMD Math Libraries

Open Source Bolt

Page 22: Leverage the Speed of OpenCL™ with AMD Math Libraries


BOLT

Bolt implements parallel C++ STL functionality with AMP & OpenCL acceleration

Bolt on GitHub

Maintained by AMD

Page 23: Leverage the Speed of OpenCL™ with AMD Math Libraries


BOLT

PARALLEL STL

Bolt provides containers and algorithms that enable clients to accelerate C++ code with minimal GPU knowledge

‒ Sorts

‒ Reductions

‒ Transforms

‒ Scans

Through control structures, clients control where data is allocated and computed (minimal knowledge of AMP or OpenCL is helpful here)

Bolt provides support for both OpenCL & C++ AMP paths

Bolt provides containers such as bolt::cl::device_vector<>

Page 24: Leverage the Speed of OpenCL™ with AMD Math Libraries


BOLT

EXAMPLE CODE

#include <bolt/cl/device_vector.h>
#include <bolt/cl/scan.h>
#include <vector>
#include <numeric>

int main()
{
    size_t length = 1024;

    // Create device_vector and initialize it to 1
    bolt::cl::device_vector< int > boltInput( length, 1 );

    // Calculate the inclusive_scan of the device_vector
    bolt::cl::inclusive_scan( boltInput.begin( ), boltInput.end( ), boltInput.begin( ) );

    // Create an std vector and initialize it to 1
    std::vector< int > stdInput( length, 1 );

    // Calculate the inclusive_scan of the std vector
    bolt::cl::inclusive_scan( stdInput.begin( ), stdInput.end( ), stdInput.begin( ) );

    return 0;
}
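For readers without a Bolt installation, the result of the scan in this example can be reproduced with the standard library. This is a CPU-only sketch that uses std::partial_sum to express the same inclusive prefix sum; the wrapper name inclusiveScan is hypothetical and nothing here runs on a GPU.

```cpp
#include <numeric>
#include <vector>

// Inclusive prefix sum: output[i] = input[0] + input[1] + ... + input[i].
// CPU-only reference built on std::partial_sum; it does not use Bolt.
std::vector<int> inclusiveScan(const std::vector<int>& input)
{
    std::vector<int> output(input.size());
    std::partial_sum(input.begin(), input.end(), output.begin());
    return output;
}
```

With an input of 1024 ones, as in the slide, element i of the result is i + 1, so the last element equals the vector length.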

Page 25: Leverage the Speed of OpenCL™ with AMD Math Libraries


Q&A & CONTACT INFO

For More Info:

Follow us on Twitter: @AMDDevCentral

Visit our forums: http://devgurus.amd.com/welcome

Visit our website: www.developer.amd.com

Watch the replay: www.youtube.com/user/AMDDevCentral

Download the presentation: www.slideshare.net/DevCentralAMD

Page 26: Leverage the Speed of OpenCL™ with AMD Math Libraries


DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION

© 2014 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.