
SpeedIT 2.3 Reference Manual

Vratis Ltd.

www.vratis.com
speed-it.vratis.com

Nov 2012


Version 2.3

• New ILU(0) preconditioner added.

• GPU Direct support added.

• Support for CUDA 5.0.

• Support for the new NVIDIA Kepler architecture.

Version 2.1

• Descriptions added for the parallel functions:

– si_pgdcsrcg

– si_pcdcsrbicgstab

• Changes in naming convention: added the p prefix for parallel functions.

Version 2.0

• New preconditioners from cusp-library added

– Algebraic Multigrid with Smoothed Aggregation (AMG)

– Approximate Inverse (AINV)

• Preconditioners are managed by handlers

• Apache license added


Contents

1 INTRODUCTION
  1.1 Versions
  1.2 Licensing
  1.3 Hardware requirements
  1.4 Supported Operating Systems
  1.5 Font conventions

2 GETTING STARTED
  2.1 Obtaining SpeedIT libraries
  2.2 System requirements
  2.3 Installation
  2.4 Using SpeedIT libraries in user programs
  2.5 Uninstallation

3 USING THE LIBRARY
  3.1 Code Examples
    3.1.1 Library initialization
    3.1.2 Matrix operation with implicit memory management
    3.1.3 Explicit memory management
    3.1.4 Solver function call
  3.2 Thread Safety and Asynchronous Execution
  3.3 Data Formats
    3.3.1 Sparse matrix storage format
  3.4 Matrix handlers
  3.5 Preconditioners
    3.5.1 Preconditioner handlers
  3.6 Memory Management
  3.7 Error Handling

4 Application Programming Interface
  4.1 Function naming conventions
  4.2 Sparse BLAS Level 3 routines
    4.2.1 Sparse matrix dense vector multiplication
  4.3 Iterative solvers
    4.3.1 Preconditioned Conjugate Gradient (CG)
    4.3.2 Preconditioned Conjugate Gradient in parallel mode
    4.3.3 Preconditioned Stabilized Bi-Conjugate Gradient in parallel mode
    4.3.4 Preconditioned Stabilized Bi-Conjugate Gradient (BiCGStab)
  4.4 Error handling
  4.5 Memory management
    4.5.1 Matrix handler creation
    4.5.2 Matrix handler removal
    4.5.3 Preconditioner handler creation
    4.5.4 Preconditioner handler removal
    4.5.5 GPU memory allocation
    4.5.6 GPU memory release functions
    4.5.7 GPU to CPU memory copy
    4.5.8 CPU to GPU memory copy
    4.5.9 GPU memory alloc and copy from CPU memory
  4.6 Library initialization
    4.6.1 Opening library
    4.6.2 Closing library
  4.7 Loading matrix in Matrix Market format

5 OpenFOAM interoperability
  5.1 ProcType.h
  5.2 vector.h
    5.2.1 Vector initialization on CPU
    5.2.2 Vector initialization on GPU
    5.2.3 Functions specific for Vector allocated on CPU
    5.2.4 Common functions for Vector allocated on CPU and GPU
  5.3 interfaces.h
    5.3.1 processor_interface initialization and destruction
    5.3.2 processor_interface functions
    5.3.3 processor_interfaces initialization
    5.3.4 processor_interfaces functions


1 INTRODUCTION

SpeedIT is a library that provides a set of accelerated solvers for sparse systems of linear equations. Acceleration is done by exploiting the computational capabilities of modern NVIDIA Graphics Processing Units (GPUs) with CUDA technology enabled. All computations are performed with single or double floating point precision. Two sparse linear system solvers with a diagonal (Jacobi) preconditioner, a small subset of sparse BLAS Level 2 routines and memory management functions are supplied. The library interface is written in C and is designed to be called from C/C++, Fortran and other high-level languages.

The SpeedIT library is designed with two main goals in mind. First, it should be easy to use by a person without knowledge of NVIDIA CUDA technology; second, the library should integrate easily with existing CUDA code. This approach avoids a steep learning curve for beginners while allowing deeper control and optimization for advanced CUDA users.

The SpeedIT application programming interface (API) is compatible with the C language. All functions use only standard data types. Array indices are zero-based. In this approach the user has to take care of explicit memory management, but it is easy to couple the computational routines provided in SpeedIT with external, usage-specific data types.

SpeedIT is a low level library, which allows it to maximize performance, portability and compatibility with existing software. However, it requires that many tasks related to memory management be done in the external code. The user is responsible for preparing proper input data, i.e. buffers with appropriate content and size. On the other hand, such an approach allows memory transfers to be tuned for a wide variety of problems.

1.1 Versions

SpeedIT is a full featured and highly optimized library. For testing purposes an open source version released under the GPL license, SpeedIT Classic, is also available. However, SpeedIT Classic provides only a limited subset of the SpeedIT features (e.g. calculations in single precision only, without the performance optimizations) and is not covered in this document.

1.2 Licensing

SpeedIT is available for academic, government and commercial institutions. For more information see our web page http://speed-it.vratis.com. SpeedIT Classic is licensed under the GNU GPL. SpeedIT eXtreme and SpeedIT Multi-GPU are not covered in this document.

SpeedIT utilises CUSP 0.3.0 for the AINV and AMG preconditioners. CUSP 0.3.0 is distributed under the Apache license:

Copyright 2008-2009 NVIDIA Corporation. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

1.3 Hardware requirements

To fully utilize the SpeedIT library processing power, an NVIDIA GPU with CUDA Compute Capability 2.0 or higher is required. For the full list of CUDA capable devices please visit: http://www.nvidia.com/object/cuda_learn_products.html

1.4 Supported Operating Systems

SpeedIT libraries require a hardware platform with a CUDA-enabled GPU and an x86 or x86_64 compatible CPU. The SpeedIT library was built and tested on the following systems:

• Ubuntu 10.04 Desktop x86_64

• Ubuntu 11.04 Desktop x86_64

• Ubuntu 11.10 Desktop x86_64

1.5 Font conventions

In this document the following notations are used:

• Courier New - fragment of source code, variable names etc.

• A - matrices

• ~X - vectors

2 GETTING STARTED

2.1 Obtaining SpeedIT libraries

SpeedIT libraries can be obtained from our sales team after registration at http://speed-it.vratis.com. Before downloading, please inform us about your system configuration so that we can choose a proper library version.

If your operating system is not listed (see Sec. 1.4) you can still try SpeedIT. However, the SpeedIT library was tested only on the systems cited in Section 1.4.

2.2 System requirements

The SpeedIT library requires a hardware platform with a CUDA-enabled GPU device supporting CUDA Compute Capability 2.0, an x86 compatible CPU, and CUDA 4.0 installed. For the full list of CUDA capable devices please see http://www.nvidia.com/object/cuda_learn_products.html.

After installing CUDA on Linux please remember to update your PATH and LD_LIBRARY_PATH environment variables as mentioned in the CUDA documentation. GPUDirect support is activated by setting the environment variable SI_GPUDIRECT=YES.


2.3 Installation

The SpeedIT library is distributed as a dynamically linked library and a header file. To install the binary you only need to copy the library file into a directory that is accessible to the operating system. On Linux systems this is usually one of the directories /lib, /usr/lib or /usr/local/lib, or a directory listed in the LD_LIBRARY_PATH environment variable. You may also manually add the path to libspeedit.so to the LD_LIBRARY_PATH environment variable.

2.4 Using SpeedIT libraries in user programs

To use the SpeedIT libraries in a user program on Linux systems, two steps are required. First, the path to the library header file should be passed to the compiler. This can be done by passing the option

-I/path_to_SpeedIT_header_file

to the gcc compiler. The second step requires linking the binary library file with your program. This is done by passing the options

-L/path_to_SpeedIT_library -lSpeedit_Library

to the linker (usually gcc). Remember that the SpeedIT library also requires the cudart and cublas libraries. An example of program code using SpeedIT and a Linux compilation command is available in the Makefile files in the tutorial folder.
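For example, assuming the header was copied to /usr/local/include and libspeedit.so to /usr/local/lib (adjust the paths and library name to your installation), a complete compilation command might look like

gcc -I/usr/local/include my_program.c -L/usr/local/lib -lspeedit -lcudart -lcublas -o my_program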

2.5 Uninstallation

To uninstall the SpeedIT libraries it is enough to remove the binary library file and the header file from your system.


3 USING THE LIBRARY

3.1 Code Examples

3.1.1 Library initialization

#include "speedit.h"
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int err_code = si_init();
    printf("si_init(): %s\n", si_errstr(err_code));
    if (err_code != ERR_OK) {
        exit(-1);
    }

    /* Place your code here */

    err_code = si_shutdown();
    printf("si_shutdown(): %s\n", si_errstr(err_code));
    return 0;
}

You may also write your own initialization routine, for example if you wish to directly control which GPU device is used for computations. In case of hand-made initialization, remember to initialize the CUBLAS library with the cublasInit() function.
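A minimal sketch of such a hand-made initialization is shown below; it assumes the CUDA runtime and the legacy CUBLAS headers are available, and the helper name my_init is arbitrary:

#include <cuda_runtime.h>
#include <cublas.h>          /* legacy CUBLAS API providing cublasInit() */

/* Hand-made initialization sketch: explicitly select GPU device 0
   and initialize CUBLAS, which SpeedIT relies on. */
int my_init(void)
{
    if (cudaSetDevice(0) != cudaSuccess)          /* choose the GPU device */
        return -1;
    if (cublasInit() != CUBLAS_STATUS_SUCCESS)    /* required before using SpeedIT */
        return -1;
    return 0;
}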

3.1.2 Matrix operation with implicit memory management

// Create sparse matrix in CSR format
//
// | 0 1 0 2 |
// | 0 0 3 0 |
// | 4 5 6 0 |
// | 0 0 7 8 |
double vals[]  = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0};
int    c_idx[] = {1, 3, 2, 0, 1, 2, 2, 3};
int    r_idx[] = {0, 2, 3, 6, 8};
int    n_rows  = 4;
double x[] = {1.0, 1.0, 1.0, 1.0};
double y[] = {1.0, 1.0, 1.0, 1.0};

err_code = si_cdcsrmv(n_rows, vals, c_idx, r_idx, x, y);
printf("si_cdcsrmv(): %s\n", si_errstr(err_code));

3.1.3 Explicit memory management

double *gpu_vals = NULL;
int    *gpu_cidx = NULL;
int    *gpu_ridx = NULL;
double *gpu_x    = NULL;
double *gpu_y    = NULL;


//
// GPU memory allocation and copy data from CPU memory
//
err_code = si_c2gdmcopy(8, vals, &gpu_vals);
printf("si_c2gdmcopy(): %s\n", si_errstr(err_code));
err_code = si_c2gdmcopy(n_rows, x, &gpu_x);
printf("si_c2gdmcopy(): %s\n", si_errstr(err_code));
err_code = si_c2gdmcopy(n_rows, y, &gpu_y);
printf("si_c2gdmcopy(): %s\n", si_errstr(err_code));
err_code = si_c2gimcopy(8, c_idx, &gpu_cidx);
printf("si_c2gimcopy(): %s\n", si_errstr(err_code));
err_code = si_gimalloc(n_rows + 1, &gpu_ridx);
printf("si_gimalloc(): %s\n", si_errstr(err_code));
err_code = si_c2gicopy(n_rows + 1, r_idx, gpu_ridx);
printf("si_c2gicopy(): %s\n", si_errstr(err_code));

//
// SpMV multiply on GPU
//
err_code = si_gdcsrmv(n_rows, gpu_vals, gpu_cidx,
                      gpu_ridx, gpu_x, gpu_y);
printf("si_gdcsrmv(): %s\n", si_errstr(err_code));

//
// Copy result to CPU memory
//
err_code = si_g2cdcopy(n_rows, gpu_y, y);
printf("si_g2cdcopy(): %s\n", si_errstr(err_code));

/* Result is placed in array y allocated in CPU memory */

//
// GPU memory free
//
err_code = si_gdfree(&gpu_vals);
printf("si_gdfree(): %s\n", si_errstr(err_code));
err_code = si_gdfree(&gpu_x);
printf("si_gdfree(): %s\n", si_errstr(err_code));
err_code = si_gdfree(&gpu_y);
printf("si_gdfree(): %s\n", si_errstr(err_code));
err_code = si_gifree(&gpu_cidx);
printf("si_gifree(): %s\n", si_errstr(err_code));
err_code = si_gifree(&gpu_ridx);
printf("si_gifree(): %s\n", si_errstr(err_code));

You may also directly call CUDA routines for memory management, i.e. cudaMalloc() and cudaFree(). All memory management functions in the SpeedIT library use the CUDA memory management routines, thus you may use CUDA and SpeedIT library functions interchangeably, e.g. GPU memory allocated with the si_gmalloc() and si_gmcopy() functions can be released with cudaFree().
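As an illustration, the following sketch (the function name interop_example is ours) allocates a GPU buffer with a SpeedIT copy function and releases it with cudaFree():

#include <cuda_runtime.h>
#include "speedit.h"

/* Sketch: a buffer allocated by a SpeedIT memory helper is an ordinary
   CUDA allocation and may be released with cudaFree() instead of si_gdfree(). */
void interop_example(void)
{
    double host_data[4] = {1.0, 2.0, 3.0, 4.0};
    double *gpu_data = NULL;

    si_c2gdmcopy(4, host_data, &gpu_data);  /* allocate on GPU and copy from CPU */

    /* ... use gpu_data with SpeedIT functions or your own CUDA kernels ... */

    cudaFree(gpu_data);                     /* valid alternative to si_gdfree() */
}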


3.1.4 Solver function call

int    n_iter = 1000;
double eps    = 1e-30;

err_code = si_cdcsrbicgstab(n_rows, vals, c_idx,
                            r_idx, y, x, P_NONE, &n_iter, &eps);

printf("si_cdcsrbicgstab(): %s\n", si_errstr(err_code));
printf("n_iter = %d, tolerance = %g\n", n_iter, eps);

/* Result is placed in array x allocated in CPU memory */


3.2 Thread Safety and Asynchronous Execution

The library is not guaranteed to be thread-safe.

The SpeedIT functions are executed on the GPU device asynchronously with respect to the CPU and may return control to the host application before their execution is completed. You can call the cudaDeviceSynchronize function from the CUDA toolkit to synchronize the execution of a particular SpeedIT function with the host application.
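The sketch below (the wrapper name spmv_and_wait is ours) follows the si_gdcsrmv() call shown in Section 3.1.3 and blocks the host until the GPU has finished, which is useful e.g. for timing:

#include <stdio.h>
#include <cuda_runtime.h>
#include "speedit.h"

/* Sketch: run an SpMV on buffers already resident in GPU memory
   (prepared as in Sec. 3.1.3) and wait for the kernel to complete. */
void spmv_and_wait(int n_rows, double *gpu_vals, int *gpu_cidx,
                   int *gpu_ridx, double *gpu_x, double *gpu_y)
{
    int err_code = si_gdcsrmv(n_rows, gpu_vals, gpu_cidx,
                              gpu_ridx, gpu_x, gpu_y);   /* may return early */
    printf("si_gdcsrmv(): %s\n", si_errstr(err_code));

    cudaDeviceSynchronize();   /* host waits until the GPU work is done */
}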


3.3 Data Formats

SpeedIT uses two important data types: dense vectors and sparse matrices. Dense vectors are standard C arrays, i.e. their elements are placed in memory sequentially and their indices start from 0. Sparse matrices are expressed in either the compressed sparse row (CSR) format or our new Compressed Multiple-Row Storage (CMRS) format (see Sec. 3.3.1). Dense vectors and sparse matrices may contain data of one of the following types:

• signed integer numbers – int

• floating point single precision numbers – float

• floating point double precision numbers – double

• single precision complex numbers – si_scomplex

• double precision complex numbers – si_dcomplex

Complex types si_scomplex and si_dcomplex are built on top of the CUDA vector types float2 and double2, respectively, and hence can be safely cast to or from these types.
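For example, a si_dcomplex value can be manipulated through a double2 pointer (a small sketch; only the layout compatibility stated above is assumed, and the helper name set_complex is ours):

#include <cuda_runtime.h>   /* defines the float2 and double2 vector types */
#include "speedit.h"        /* defines si_scomplex and si_dcomplex */

/* Sketch: si_dcomplex shares its memory layout with CUDA's double2,
   so a pointer to one can be reinterpreted as a pointer to the other. */
void set_complex(si_dcomplex *z)
{
    double2 *d = (double2 *)z;   /* safe cast, same layout */
    d->x = 1.0;                  /* real part */
    d->y = -2.0;                 /* imaginary part */
}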


3.3.1 Sparse matrix storage format

Internally a sparse matrix is represented in one of two available formats. The first one is the well-known compressed sparse row (CSR) format. Matrices in CSR can be supplied to SpeedIT functions either via a handler (a pointer, see Sec. 3.4) or as three C arrays and two integers:

• vals (pointer) — an array holding all nonzero matrix values in row-major order. The number of elements in this array is equal to the number of all nonzero matrix elements.

• c_idx (pointer) — an integer array of column indices of the corresponding elements in the array vals.

• r_idx (pointer) — an integer array of indices into vals corresponding to the first nonzero elements in consecutive matrix rows (the j-th element of the array stores the index of the first nonzero element in the j-th matrix row for j = 0, ..., n_rows). The number of elements in r_idx is equal to the number of rows + 1. The value of the last element in the array is equal to the number of nonzero elements in the matrix. Alternatively, r_idx can be defined as an array such that the difference r_idx[j+1] − r_idx[j] gives the number of nonzero elements in the j-th row.

• n_rows (int) — the number of matrix rows;

• n_cols (int) — the number of matrix columns.

All indices are zero based (the index of the first matrix row or column is 0).

Example

Let sparse matrix A be defined as

A =

| 0 1 0 2 |
| 0 0 3 0 |
| 4 5 6 0 |
| 0 0 7 8 |

Then its CSR representation is:

vals  = [1 2 3 4 5 6 7 8]

c_idx = [1 3 2 0 1 2 2 3]

r_idx = [0 2 3 6 8]

n_rows = 4

n_cols = 4

The second format used by SpeedIT is called CMRS (Compressed Multi-Row Storage). It is our proprietary format which boosts the library performance by optimizing the way the matrix is stored in GPU memory. The specification of this format will be published elsewhere. Until then it is available only through a handler. It can be used via special routines that convert it to/from the CSR format.


3.4 Matrix handlers

SpeedIT introduces a new way of passing matrix data into the computational functions. Matrix data structures are now provided through handlers (pointers to data structures). There are 7 different matrix handlers available:

• SI_CSR_DOUBLE_HANDLE – handler for a matrix in CSR format with values in double precision.

• SI_CSR_FLOAT_HANDLE – handler for a matrix in CSR format with values in single precision.

• SI_CSR_INT_HANDLE – handler for a matrix in CSR format with integer values.

• SI_CSR_SCOMPLEX_HANDLE – handler for a matrix in CSR format with complex values in single precision.

• SI_CSR_DCOMPLEX_HANDLE – handler for a matrix in CSR format with complex values in double precision.

• SI_CMR_DOUBLE_HANDLE – handler for a matrix in CMRS format with values in double precision.

• SI_CMR_FLOAT_HANDLE – handler for a matrix in CMRS format with values in single precision.

Currently only handlers that point to double and float data types are supported; handlers to complex and integer types are not yet available. A handler always points to a data structure which is allocated in the GPU memory space. Each matrix handler has two associated functions: the first one is used to create the matrix (see Section 4.5.1), the second one to release the allocated data (see Section 4.5.2). To create any matrix handler you must provide pointers to the memory buffers which define a CSR matrix, the number of its non-zero elements and an origin parameter which defines where the data are currently allocated. The following data structures should be defined:

• values — array of non-zero values.

• c_idx — array containing column indices of the matrix nonzero values.

• r_idx — array containing n_rows+1 elements. The first n_rows elements contain pointers to the first nonzero elements in each row of the matrix. The last element stores the number of nonzero values in the matrix.

• n_rows — number of rows.

• n_cols — number of columns.

• nnz — number of non-zero values.

• origin — indicates the memory space where the provided data are located. Proper values are HOST (CPU memory space) or DEVICE (GPU memory space). When the origin of the given data is HOST, they are copied to the GPU. When the origin is DEVICE, the given memory buffers are not copied but assigned to the matrix structure to save GPU memory.

To properly release the allocated memory space you should use a release function which takes a reference to the matrix handler as an argument.

Warning! When a handler is created from GPU data (origin = DEVICE), the created matrix structure does not own those data. In such a case, when cleaning up the memory it is your responsibility to call the free functions on the pointers to the allocated memory buffers. See Listings 1 and 2.


Listing 1: Handler management with DEVICE origin

/*
  Create empty pointers to matrix structure.
  Memory will be automatically allocated inside the load function.
*/
int   *rows   = NULL;
int   *cols   = NULL;
float *values = NULL;
int nnz   = 0;
int ncols = 0;
int nrows = 0;

// create GPU pointers for the matrix
int   *g_rows   = NULL;
int   *g_cols   = NULL;
float *g_values = NULL;

// load matrix
int si_error = si_csload(mtx_file, nnz, nrows, ncols, &values, &rows, &cols);

if (si_error != ERR_OK)
{
    std::cout << "Error: " << si_errstr(si_error) << std::endl;
}

// allocate and copy data on GPU
si_c2gimcopy(nnz, cols, &g_cols);
si_c2gimcopy(nrows + 1, rows, &g_rows);
si_c2gsmcopy(nnz, values, &g_values);

// create matrix handler from data allocated in GPU memory space
SI_CSR_FLOAT_HANDLE
    matrix_handler = si_shcsr(g_values, g_cols, g_rows, nrows, ncols, nnz, DEVICE);

...

// In case of DEVICE origin the user is responsible for calling the release
// function and for cleaning up each of the allocated pointers.
si_shreleasecsr(&matrix_handler);
si_gsfree(&g_values);
si_gifree(&g_cols);
si_gifree(&g_rows);


Listing 2: Handler management with HOST origin

/*
  Create empty pointers to matrix structure.
  Memory will be automatically allocated inside the load function.
  We will use the "c" function which stores the matrix data on the CPU.
*/
int   *rows   = NULL;
int   *cols   = NULL;
float *values = NULL;
int nnz   = 0;
int ncols = 0;
int nrows = 0;

// load matrix from MTX file,
// buffers will be automatically allocated
int si_error = si_csload(mtx_file, nnz, nrows, ncols,
                         &values, &rows, &cols);

if (si_error != ERR_OK)
{
    std::cout << "Error: " << si_errstr(si_error) << std::endl;
}

// create matrix handler
SI_CMR_FLOAT_HANDLE
    matrix_handler = si_shcmr(values, cols, rows, nrows, ncols, nnz, HOST);

...

// release memory
si_shreleasecmr(&matrix_handler);


3.5 Preconditioners

SpeedIT takes advantage of the CUSP library (http://code.google.com/p/cusp-library/), which offers a set of preconditioners that improve the rate of convergence of iterative solvers. The following preconditioners are available for the CSR format:

• Algebraic Multigrid based on Smoothed Aggregation

• Approximate Inverse

• Diagonal

For CMRS format:

• Diagonal

3.5.1 Preconditioner handlers

Similarly to matrices, preconditioners are managed by handlers to structures. For each type of preconditioner SpeedIT offers a set of functions to create or release the preconditioner structure from memory, e.g.:

Listing 3: Preconditioner management

...
// Create handler to CSR matrix
SI_CSR_DOUBLE_HANDLE matrix_handler =
    si_dhcsr(p_vals, p_cidx, p_ridx, n_rows, n_rows, nnz, HOST);
// Create handler to AMG preconditioner
SI_CSR_DOUBLE_AMG_HANDLE preconditioner_handler = si_gdcsramg(matrix_handler);
// Call the solver
solver_result = si_gdhcsrbicgstabamg
    (matrix_handler, pgpu_X, pgpu_B, preconditioner_handler, &n_iter, &eps);
// release handler to preconditioner
si_dhcsrfreeamg(&preconditioner_handler);

...

3.6 Memory Management

All computational functions in SpeedIT require pointers to allocated memory buffers and handlers to matrix structures (see Section 3.4) as arguments. You must explicitly allocate buffers in CPU or GPU memory. Memory allocation in the CPU address space can be done with any technique, e.g. the malloc/calloc functions in C code or the new operator in C++ code. Memory allocation in the GPU address space should be done with CUDA routines to ensure detailed low level control. However, for users without deep CUDA programming knowledge there are two additional ways of GPU memory management.

The first way is to use computational functions accepting arguments placed in CPU memory (see the naming convention in Sec. 4.1 and the functions where <mem>='c'). In functions accepting arrays allocated in CPU memory, all necessary GPU buffer allocations and data copying are done internally without user interaction. However, this approach may result in low performance, especially for algorithms where the amount of data transfer is much higher than the amount of computation. A typical example is sparse matrix by dense vector multiplication (SpMV). In current hardware architectures the performance of SpMV depends almost entirely on memory bandwidth. Since the transfer from CPU to GPU memory is usually slower than pure CPU memory bandwidth¹, it is faster to compute the sparse matrix-vector multiplication on the CPU than to copy the data to GPU memory.

The second way of memory management without explicit calls to CUDA routines is to use the functions provided in SpeedIT. Library functions allow allocating/freeing fixed size buffers in GPU memory and copying memory chunks between CPU and GPU memory. With these functions you can explicitly schedule data transfers between CPU and GPU memory without learning CUDA.

¹ At the time of writing this document, modern GPUs are usually connected via a PCI Express 2.0 x16 bus with a bandwidth of approx. 8 GB/s, whereas the CPU memory bandwidth for triple channel double-data-rate three (DDR3) memories is about 25 GB/s.


3.7 Error Handling

Most functions from SpeedIT return a number of type int as an error code. When a function completes without errors, 0 is returned. Other numbers denote different kinds of errors. Possible error values belong to three classes: SpeedIT internal errors, CUDA errors and CUBLAS errors. CUDA error numbers are returned unchanged. CUBLAS error numbers are encoded to avoid overlapping with CUDA error numbers: if a CUBLAS error is detected, the error number is increased by the CUBLAS_ERR_BASE constant.

All possible SpeedIT internal error codes are defined in the header file speedit.h. Their short descriptions are presented below:

ERR_OK — No errors, function succeeded.

ERR_GPU_MEM_ALLOC — Cannot allocate the required buffer in GPU memory.

ERR_BAD_GPU_POINTER — Wrong pointer value. Probably the pointer is uninitialized or points to a buffer allocated on a wrong device.

ERR_WRONG_COPY_DIRECTION — Wrong direction of data transfer between GPU and CPU. Sometimes it means that pointers to data buffers are passed in a wrong order.

ERR_DOUBLE_UNSUPPORTED — The available GPU has no hardware support for double precision calculations.

ERR_BICGSTAB_FAILED — BiCGStab failed to converge.

ERR_OMEGA_VANISHED — BiCGStab failed to converge (omega has vanished).

ERR_TOO_LITTLE_ITER — The solver has not reached the required error level in the given number of iterations.

ERR_ZERO_ON_DIAGONAL — There are zero values on the matrix diagonal. This is not allowed in the current context.

ERR_CUDA_INVALID_VALUE — Can't perform the operation. Please check if a host pointer is not confused with a device pointer.

ERR_INVALID_PRECOND — Invalid preconditioner.

ERR_INVALID_ORIGIN_TYPE — Wrong transfer type given; correct values are HOST or DEVICE.

ERR_MM_FILE_OPENING — Can't open the MTX file.

ERR_MM_WRONG_BANNER — Unsupported MM banner.

ERR_MM_UNSUPPORTED_MATRIX_FORMAT — Unsupported matrix format.

ERR_MM_UNSUPPORTED_DATA_TYPE — Unsupported matrix data type.

ERR_MM_READING_FAILED — Problem with reading the MTX file.

ERR_UNKNOWN — Unexpected error, maybe a hardware failure?

Error codes can be converted to text messages with the function si_errstr().
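A small error-checking helper built on this convention might look as follows (a sketch; the helper name check is ours):

#include <stdio.h>
#include <stdlib.h>
#include "speedit.h"

/* Sketch: abort with a readable message when a SpeedIT call fails.
   Relies only on the documented convention that ERR_OK (0) means success
   and that si_errstr() translates any error code to text. */
static void check(int err_code, const char *where)
{
    if (err_code != ERR_OK) {
        fprintf(stderr, "%s failed: %s\n", where, si_errstr(err_code));
        exit(EXIT_FAILURE);
    }
}

/* Usage:  check(si_init(), "si_init"); */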


4 Application Programming Interface

4.1 Function naming conventions

All function names are built according to the following template:

<prefix><p><mem><data_type><h><matrix format><operation>

• prefix : library prefix, possible values:

– si - SpeedIT library

• p : function for multiGPU calculations

• mem : where arguments and results are stored, possible values:

– c – CPU memory

– g – GPU memory

– g2c – Input arguments in GPU memory, results in CPU memory

– c2g – Input arguments in CPU memory, results in GPU memory

• data type : Data representation, has impact on computation precision, possible values:

– s – Single precision floating point numbers

– d – Double precision floating point numbers

– i – Signed integer numbers

– v – Void type, usually used in memory management functions

– sz – Single precision complex numbers

– dz – Double precision complex numbers

• h : the function accepts matrix data via a handler

• matrix format : Format, in which sparse matrix is stored, possible values:

– csr – Compressed Sparse Row format

– cmr – Compressed Multiple-Row Storage format

• operation : Performed operation, possible values:

– malloc – Allocate buffer in memory

– free – Free previously allocated buffer in memory

– copy – Copy data from source buffer to destination buffer, all buffers have to be allocated in memory

– mcopy – Allocate destination buffer and copy data from source buffer to destination buffer

– mv – Matrix vector multiplication

– bicgstab – Solve system of linear equations with BiCGSTAB solver

– cg – Solve system of linear equations with CG solver

– init – Initialize library

– shutdown – Close library


Examples

si_gscsrmv – All data is stored in GPU memory. Computations are performed in single precision, the matrix is stored in CSR format, and the function calculates matrix by vector multiplication.

si_cdcsrmv – All data is stored in CPU memory. Computations are performed in double precision, the matrix is stored in CSR format, and the function calculates matrix by vector multiplication.

si_gdhcmrbicgstab – All arrays are stored in GPU memory. The matrix in CMRS format is provided by a matrix handler. Computations are performed in double precision, and the function runs the BiCGStab algorithm.

si_gdfree – Frees a buffer in GPU memory. The buffer contains double precision numbers.


4.2 Sparse BLAS Level 3 routines

4.2.1 Sparse matrix dense vector multiplication

si_cscsrmv(n_rows, n_cols, vals, c_idx, r_idx, x, y)

si_gscsrmv(n_rows, n_cols, vals, c_idx, r_idx, x, y)

si_cdcsrmv(n_rows, n_cols, vals, c_idx, r_idx, x, y)

si_gdcsrmv(n_rows, n_cols, vals, c_idx, r_idx, x, y)

si_cdzcsrmv(n_rows, n_cols, vals, c_idx, r_idx, x, y)

si_gdzcsrmv(n_rows, n_cols, vals, c_idx, r_idx, x, y)

si_cszcsrmv(n_rows, n_cols, vals, c_idx, r_idx, x, y)

si_gszcsrmv(n_rows, n_cols, vals, c_idx, r_idx, x, y)

si_cdhcsrmv(matrix_handler, x, y)

si_gdhcsrmv(matrix_handler, x, y)

si_cshcsrmv(matrix_handler, x, y)

si_gshcsrmv(matrix_handler, x, y)

si_cdhcmrmv(matrix_handler, x, y)

si_gdhcmrmv(matrix_handler, x, y)

si_cshcmrmv(matrix_handler, x, y)

si_gshcmrmv(matrix_handler, x, y)

Description
The functions compute matrix vector multiplication defined as

~y = A ∗ ~x

where ~x and ~y are dense vectors and A is a matrix represented in CMRS or CSR format with zero-based indexing. The matrix can be passed in two different ways: by providing the set of parameters which defines a CSR matrix (see Section 3.3.1) or by a matrix handler (see Section 3.4).

Input parameters: Matrix

• n_rows: int. Number of rows in the matrix.

• n_cols: int. Number of columns in the matrix.

• vals: Array containing the nonzero values of the matrix.

– float* for si_cscsrmv, si_gscsrmv, si_<c/g>shcsrmv and si_<c/g>shcmrmv

– double* for si_cdcsrmv, si_gdcsrmv, si_<c/g>dhcsrmv and si_<c/g>dhcmrmv

– si_scomplex* for si_cszcsrmv and si_gszcsrmv,

– si_dcomplex* for si_cdzcsrmv and si_gdzcsrmv,

• c_idx: int*. Array containing column indices of the matrix nonzero values.

• r_idx: int*. Array containing n_rows+1 elements. The first n_rows elements contain pointers to the first nonzero elements in each row of the matrix. The last element stores the number of nonzero values in the matrix.

• matrix_handler: Pointer to the data structure which defines a matrix (see Sections 3.4 and 4.5.1 for more details)

– SI_CSR_DOUBLE_HANDLE for si_cdhcsrmv and si_gdhcsrmv

– SI_CSR_FLOAT_HANDLE for si_cshcsrmv and si_gshcsrmv


– SI_CMR_DOUBLE_HANDLE for si_cdhcmrmv and si_gdhcmrmv

– SI_CMR_FLOAT_HANDLE for si_cshcmrmv and si_gshcmrmv

Input parameters: Vector

• x: Array containing all values of vector ~x. The array contains n_rows elements.

– float* for si_cscsrmv, si_gscsrmv, si_<c/g>shcsrmv and si_<c/g>shcmrmv

– double* for si_cdcsrmv, si_gdcsrmv, si_<c/g>dhcsrmv and si_<c/g>dhcmrmv

– si_scomplex* for si_cszcsrmv and si_gszcsrmv,

– si_dcomplex* for si_cdzcsrmv and si_gdzcsrmv,

Output parameters

• y: Result array. Must be allocated and contain n_rows elements. On output the array is filled with the multiplication results.

– float* for si_cscsrmv and si_gscsrmv,

– double* for si_cdcsrmv and si_gdcsrmv,

– si_scomplex* for si_cszcsrmv and si_gszcsrmv,

– si_dcomplex* for si_cdzcsrmv and si_gdzcsrmv.

Return value
On success the function returns ERR_OK. In case of errors the function may return ERR_UNKNOWN or encoded CUDA or CUBLAS errors.

Warning! All arrays in functions with memory prefix <c>, like si_cscsrmv or si_cdhcmrmv, have to be allocated in CPU memory. All arrays in functions with memory prefix <g>, like si_gscsrmv or si_gdhcsrmv, have to be allocated in GPU memory. Passing pointers to data allocated in the wrong place may result in unrecoverable errors.
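For illustration, the sketch below multiplies the example matrix from Section 3.3.1 by a vector through the handler-based CPU-memory variant si_cdhcsrmv(); the wrapper name spmv_handler_example is ours, and the handler is assumed to behave as a pointer (the creation functions return NULL on failure, see Section 4.5.1):

#include <stdio.h>
#include "speedit.h"

/* Sketch: handler-based SpMV with all user arrays kept in CPU memory (HOST). */
int spmv_handler_example(void)
{
    double vals[]  = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0};
    int    c_idx[] = {1, 3, 2, 0, 1, 2, 2, 3};
    int    r_idx[] = {0, 2, 3, 6, 8};
    int    n_rows = 4, n_cols = 4, nnz = 8;
    double x[] = {1.0, 1.0, 1.0, 1.0};
    double y[4];

    /* Create a CSR double precision handler from host data (copied to GPU). */
    SI_CSR_DOUBLE_HANDLE A = si_dhcsr(vals, c_idx, r_idx,
                                      n_rows, n_cols, nnz, HOST);
    if (A == NULL)
        return -1;

    int err_code = si_cdhcsrmv(A, x, y);          /* y = A * x */
    printf("si_cdhcsrmv(): %s\n", si_errstr(err_code));

    si_dhreleasecsr(&A);                          /* release the GPU structure */
    return err_code;
}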


4.3 Iterative solvers

4.3.1 Preconditioned Conjugate Gradient (CG)

si_cdhcmrcgdiagonal(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_cdhcmrcgvoid(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_cshcmrcgdiagonal(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_cshcmrcgvoid(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_gdhcmrcgdiagonal(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_gdhcmrcgvoid(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_gshcmrcgdiagonal(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_gshcmrcgvoid(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_cdhcsrcgamg(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_cdhcsrcgainv(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_cdhcsrcgainvscaled(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_cdhcsrcgainvnsym(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_cdhcsrcgdiagonal(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_cdhcsrcgvoid(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_cshcsrcgamg(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_cshcsrcgainv(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_cshcsrcgainvscaled(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_cshcsrcgainvnsym(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_cshcsrcgdiagonal(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_cshcsrcgvoid(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_gdhcsrcgamg(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_gdhcsrcgainv(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_gdhcsrcgainvscaled(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_gdhcsrcgainvnsym(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_gdhcsrcgdiagonal(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_gdhcsrcgvoid(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_gshcsrcgamg(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_gshcsrcgainv(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_gshcsrcgainvscaled(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_gshcsrcgainvnsym(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_gshcsrcgdiagonal(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_gshcsrcgvoid(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_gdhcsrcgilu(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

Description
The functions solve a linear system of equations defined as

A ∗ ~x = ~b

where ~x and ~b are dense vectors and A is a sparse matrix represented in CMRS or CSR format with zero-based indexing. The matrix can be passed in two different ways: by providing the set of parameters which defines a CSR matrix (see Section 3.3.1) or by a matrix handler (see Section 3.4).

Input parameters

• matrix_handler: Pointer to the data structure which defines a matrix (see Sections 3.4 and 4.5.1 for more details)

– SI_CSR_DOUBLE_HANDLE for si_cdhcsrcg and si_gdhcsrcg

– SI_CSR_FLOAT_HANDLE for si_cshcsrcg and si_gshcsrcg


– SI_CMR_DOUBLE_HANDLE for si_cdhcmrcg and si_gdhcmrcg

– SI_CMR_FLOAT_HANDLE for si_cshcmrcg and si_gshcmrcg

• preconditioner_handler: Pointer to the data structure which defines a preconditioner (see Section 3.5 for more details). Each solver function has a suffix corresponding to the name of the preconditioner which is expected, e.g.:

– si_cshcsrcgamg: expects a handler pointing to an AMG preconditioner whose values are in single precision.

– si_gdhcsrcgainvscaled: expects a handler pointing to an AINV SCALED preconditioner whose values are in double precision.

• b: Array containing the right hand side values of the equation. The array contains n_rows elements.

– float* for si_<c/g>s<matrix>cg<preconditioner>

– double* for si_<c/g>d<matrix>cg<preconditioner>

– si_scomplex* for si_<c/g>sz<matrix>cg<preconditioner>

– si_dcomplex* for si_<c/g>dz<matrix>cg<preconditioner>

– matrix — can be either csr or cmr.

Input/Output parameters

• x: Array which stores the solution vector. On input the array must be allocated and contain the initial approximate solution. On output the array is filled with the calculated solution.

– float* for si_<c/g>s<matrix>cg<preconditioner>

– double* for si_<c/g>d<matrix>cg<preconditioner>

– si_scomplex* for si_<c/g>sz<matrix>cg<preconditioner>

– si_dcomplex* for si_<c/g>dz<matrix>cg<preconditioner>

– matrix — can be either csr or cmr.

• n_iter - int*. Reference to a variable which on input must contain the maximum number of solver iterations. On output it is filled with the number of actually performed iterations. For all solver functions you have to pass this value by reference.

• eps - double*. Reference to a variable which on input must contain the acceptable residual. On output the actual error is assigned to that variable. For all solver functions you have to pass this value by reference.

Return value

On success the function returns ERR_OK. In case of errors the function may return

• ERR_TOO_LITTLE_ITER

• ERR_ZERO_ON_DIAGONAL

• ERR_UNKNOWN

and encoded CUDA or CUBLAS errors.

Warning! All arrays for functions with memory prefix <c>, like si_cscsrcg or si_cdhcmrcg, have to be allocated in CPU memory. All arrays for functions with memory prefix <g>, like si_gscsrcg or si_gdhcsrcg, have to be allocated in GPU memory. Passing pointers to data allocated in the wrong place may result in unrecoverable errors.
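As an illustration, the sketch below solves a system with the AMG-preconditioned CG solver using the CPU-memory handler variant, following the pattern of Listing 3; the wrapper name solve_cg_amg is ours, and the matrix is assumed to be symmetric positive definite, as CG requires:

#include <stdio.h>
#include "speedit.h"

/* Sketch: AMG-preconditioned CG with right-hand side b and solution x in CPU memory. */
int solve_cg_amg(double *vals, int *c_idx, int *r_idx,
                 int n_rows, int nnz, double *x, double *b)
{
    SI_CSR_DOUBLE_HANDLE A = si_dhcsr(vals, c_idx, r_idx,
                                      n_rows, n_rows, nnz, HOST);
    SI_CSR_DOUBLE_AMG_HANDLE P = si_gdcsramg(A);    /* AMG preconditioner */

    int    n_iter = 1000;    /* max iterations in, iterations used out */
    double eps    = 1e-10;   /* target residual in, achieved residual out */

    int err_code = si_cdhcsrcgamg(A, x, b, P, &n_iter, &eps);
    printf("si_cdhcsrcgamg(): %s, iterations = %d, residual = %g\n",
           si_errstr(err_code), n_iter, eps);

    si_dhcsrfreeamg(&P);     /* release the preconditioner ... */
    si_dhreleasecsr(&A);     /* ... and the matrix structure */
    return err_code;
}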


4.3.2 Preconditioned Conjugate Gradient in parallel mode

Warning! Functions with the prefix si_p are dedicated for use only in the OpenFOAM environment.

DLLAPI
int si_pcdcsrcg(n_rows, n_cols, nnz, vals, c_idx, r_idx, X, B, precond, n_iter,
                eps, eps0, pifaces);

Description
The function solves a linear system of equations defined as

A ∗ ~x = ~b

where ~x and ~b are dense vectors and A is a sparse matrix represented in CSR format with zero-based indexing.

Input parameters

• n_rows - int. Number of rows in matrix A

• n_cols - int. Number of columns in matrix A

• nnz - int. Number of nonzero values in matrix A

• vals - double*. Array containing the nonzero values of matrix A

• c_idx - int*. Array containing the column indices of the nonzero values of matrix A

• r_idx - int*. Array containing n_rows+1 elements; the first n_rows elements contain pointers to the first nonzero elements in each row of A, the last element stores the number of nonzero values

• X - double*. On input the array must be allocated and contain the approximate solution. On output the array is filled with the calculated solution.

• B - double*. Array containing the right hand side values of the equation. The array contains n_rows elements.

• precond - type of preconditioner. Available values are:

– P_DIAG - diagonal preconditioner

– P_AINV_B - approximate inverse preconditioner

– P_AINV_SB - approximate inverse scaled preconditioner

– P_AINV_NB - approximate inverse non-symmetric preconditioner

– P_AMG - algebraic multigrid preconditioner

Input/Output parameters

• x: double*. Array which stores the solution vector. On input the array must be allocated and contain the approximate solution. On output the array is filled with the calculated solution.

• n_iter - int*. Reference to a variable which on input must contain the maximum number of solver iterations. On output it is filled with the number of actually performed iterations. For all solver functions you have to pass this value by reference.

• eps - double*. Reference to a variable which on input must contain the acceptable residual. On output the actual error is assigned to that variable. For all solver functions you have to pass this value by reference.


• eps0 - double*. Reference to a variable which will store the initial residual.

Return value

On success the function returns ERR_OK. In case of errors the function may return

• ERR_TOO_LITTLE_ITER

• ERR_ZERO_ON_DIAGONAL

• ERR_UNKNOWN

and encoded CUDA or CUBLAS errors.

4.3.3 Preconditioned Stabilized Bi-Conjugate Gradient in parallel mode

Warning! Functions with the prefix si_p are dedicated for use only in the OpenFOAM environment.

DLLAPI
int si_pcdcsrbicgstab(n_rows, n_cols, nnz, vals, c_idx, r_idx, X, B, precond,
                      n_iter, eps, eps0, pifaces);

Description
The function solves a linear system of equations defined as

A ∗ ~x = ~b

where ~x and ~b are dense vectors and A is a sparse matrix represented in CSR format with zero-based indexing.

Input parameters

• n_rows - int. Number of rows in matrix A

• n_cols - int. Number of columns in matrix A

• nnz - int. Number of nonzero values in matrix A

• vals - double*. Array containing the nonzero values of matrix A

• c_idx - int*. Array containing the column indices of the nonzero values of matrix A

• r_idx - int*. Array containing n_rows+1 elements; the first n_rows elements contain pointers to the first nonzero elements in each row of A, the last element stores the number of nonzero values

• X - double*. On input the array must be allocated and contain the approximate solution. On output the array is filled with the calculated solution.

• B - double*. Array containing the right hand side values of the equation. The array contains n_rows elements.

• precond - type of preconditioner. Available values are:

– P_DIAG - diagonal preconditioner

– P_AINV_B - approximate inverse preconditioner

– P_AINV_SB - approximate inverse scaled preconditioner

– P_AINV_NB - approximate inverse non-symmetric preconditioner


– P_AMG - algebraic multigrid preconditioner

Input/Output parameters

• x: double*. Array which stores the solution vector. On input the array must be allocated and contain the approximate solution. On output the array is filled with the calculated solution.

• n_iter - int*. Reference to a variable which on input must contain the maximum number of solver iterations. On output it is filled with the number of actually performed iterations. For all solver functions you have to pass this value by reference.

• eps - double*. Reference to a variable which on input must contain the acceptable residual. On output the actual error is assigned to that variable. For all solver functions you have to pass this value by reference.

• eps0 - double*. Reference to a variable which will store the initial residual.

Return value

On success the function returns ERR_OK. In case of errors the function may return

• ERR_TOO_LITTLE_ITER

• ERR_ZERO_ON_DIAGONAL

• ERR_OMEGA_VANISHED

• ERR_BICGSTAB_FAILED

• ERR_UNKNOWN

and encoded CUDA or CUBLAS errors.


4.3.4 Preconditioned Stabilized Bi-Conjugate Gradient (BiCGStab)

si_cdhcmrbicgstabdiagonal(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_cdhcmrbicgstabvoid(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_cshcmrbicgstabdiagonal(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_cshcmrbicgstabvoid(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_gdhcmrbicgstabdiagonal(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_gdhcmrbicgstabvoid(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_gshcmrbicgstabdiagonal(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_gshcmrbicgstabvoid(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_cdhcsrbicgstabamg(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_cdhcsrbicgstabainv(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_cdhcsrbicgstabainvscaled(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_cdhcsrbicgstabainvnsym(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_cdhcsrbicgstabdiagonal(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_cdhcsrbicgstabvoid(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_cshcsrbicgstabamg(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_cshcsrbicgstabainv(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_cshcsrbicgstabainvscaled(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_cshcsrbicgstabainvnsym(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_cshcsrbicgstabdiagonal(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_cshcsrbicgstabvoid(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_gdhcsrbicgstabamg(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_gdhcsrbicgstabainv(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_gdhcsrbicgstabainvscaled(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_gdhcsrbicgstabainvnsym(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_gdhcsrbicgstabdiagonal(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_gdhcsrbicgstabvoid(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_gshcsrbicgstabamg(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_gshcsrbicgstabainv(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_gshcsrbicgstabainvscaled(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_gshcsrbicgstabainvnsym(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_gshcsrbicgstabdiagonal(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

si_gshcsrbicgstabvoid(matrix_handle,x, b, preconditioner_handle, n_iter, eps);

Description
The functions solve a linear system of equations defined as

A ∗ ~x = ~b

where ~x and ~b are dense vectors and A is a sparse matrix represented in CMRS or CSR format with zero-based indexing. The matrix can be passed in two different ways: by providing the set of parameters which defines a CSR matrix (see Section 3.3.1) or by a matrix handler (see Section 3.4).

Input parameters

• matrix_handler: Pointer to the data structure which defines a matrix (see Sections 3.4 and 4.5.1 for more details)

– SI_CSR_DOUBLE_HANDLE for si_<c/g>dhcsrbicgstab<preconditioner>

– SI_CSR_FLOAT_HANDLE for si_<c/g>shcsrbicgstab<preconditioner>

– SI_CMR_DOUBLE_HANDLE for si_<c/g>dhcmrbicgstab<preconditioner>

– SI_CMR_FLOAT_HANDLE for si_<c/g>shcmrbicgstab<preconditioner>


• preconditioner_handler: Pointer to a data structure which defines the preconditioner (see Section 3.5 for more details). Each solver function has a suffix corresponding to the name of the preconditioner which is expected, e.g.:

– si_cshcsrbicgstabamg: expects a handler pointing to an AMG preconditioner whose values are in single precision.

– si_gdhcsrbicgstabainvscaled: expects a handler pointing to an AINV SCALED preconditioner whose values are in double precision.

• b: Array containing the right hand side values of the equation. The array contains n_rows elements.

– float* for si_<c/g>s<matrix>bicgstab<preconditioner>

– double* for si_<c/g>d<matrix>bicgstab<preconditioner>

– si_scomplex* for si_<c/g>sz<matrix>bicgstab<preconditioner>

– si_dcomplex* for si_<c/g>dz<matrix>bicgstab<preconditioner>

– matrix — can be either csr or cmr.

Input/Output parameters

• x: Array which stores the solution vector. On input the array must be allocated and contain the initial approximate solution. On output the array is filled with the calculated solution.

– float* for si_<c/g>s<matrix>bicgstab<preconditioner>

– double* for si_<c/g>d<matrix>bicgstab<preconditioner>

– si_scomplex* for si_<c/g>sz<matrix>bicgstab<preconditioner>

– si_dcomplex* for si_<c/g>dz<matrix>bicgstab<preconditioner>

– matrix — can be either csr or cmr.

• n_iter - int*. Reference to a variable which on input must contain the maximum number of solver iterations. On output it is filled with the number of actually performed iterations. For all solver functions you have to pass this value by reference.

• eps - double*. Reference to a variable which on input must contain the required solution error. On output the actual error is assigned to that variable. For all solver functions you have to pass this value by reference.

Return value

On success the function returns ERR_OK. In case of errors the function may return

• ERR_TOO_LITTLE_ITER

• ERR_ZERO_ON_DIAGONAL

• ERR_OMEGA_VANISHED

• ERR_BICGSTAB_FAILED

• ERR_UNKNOWN

and encoded CUDA or CUBLAS errors.

Warning! All arrays for functions with memory prefix <c>, like si_cscsrbicgstab or si_cdhcmrbicgstab, have to be allocated in CPU memory. All arrays for functions with memory prefix <g>, like si_gscsrbicgstab or si_gdhcsrbicgstab, have to be allocated in GPU memory. Passing pointers to data allocated in the wrong place may result in unrecoverable errors.


4.4 Error handling

si_errstr(err_code)

Description
The function converts the passed error code to a text message. The error code can be any number returned by SpeedIT functions, i.e. an internal library error or an encoded CUDA or CUBLAS error. For unknown error codes the message "Unknown error, maybe hardware failure?" is returned.

Input parameters
err_code – int. Error code returned by library functions.

Return value
const char* – pointer to a text message for the given error code. The returned message MUST NOT be freed after use.

NOTE: The function si_errstr() cannot fail, thus it does not return any error code.


4.5 Memory management

4.5.1 Matrix handler creation

SI_CSR_DOUBLE_HANDLE si_dhcsr(values, col_idx, row_idx, rows, cols, nnz, origin)

SI_CSR_FLOAT_HANDLE si_shcsr(values, col_idx, row_idx, rows, cols, nnz, origin)

SI_CSR_INT_HANDLE si_ihcsr(values, col_idx, row_idx, rows, cols, nnz, origin)

SI_CSR_SCOMPLEX_HANDLE si_schcsr(values, col_idx, row_idx, rows, cols, nnz, origin)

SI_CSR_DCOMPLEX_HANDLE si_dchcsr(values, col_idx, row_idx, rows, cols, nnz, origin)

SI_CMR_DOUBLE_HANDLE si_dhcmr(values, col_idx, row_idx, rows, cols, nnz, origin)

SI_CMR_FLOAT_HANDLE si_shcmr(values, col_idx, row_idx, rows, cols, nnz, origin)

Description
These functions create a handler to the data structure which represents the desired matrix format. Handlers can be used in all computational functions of SpeedIT. Handlers always point to data stored in GPU memory space. As input, for either matrix format (CMRS or CSR), they take a set of arrays and variables which represent a CSR matrix.

Input parameters

• rows: int. Number of rows in matrix.

• cols: int. Number of columns in matrix.

• values: Array containing nonzero values of matrix.

– float* for si_shcsr and si_shcmr,

– double* for si_dhcsr and si_dhcmr,

– int* for si_ihcsr,

– si_scomplex* for si_schcsr,

– si_dcomplex* for si_dchcsr,

• col_idx: int*. Array containing column indices of the matrix nonzero values.

• row_idx: int*. Array containing rows+1 elements. The first rows elements contain pointers to the first nonzero elements in each row of the matrix. The last element stores the number of nonzero values in the matrix.

• origin: SI_ORIGIN. Parameter which tells where the provided memory buffers are allocated. There are two correct values of this parameter:

– HOST - for memory buffers allocated in the CPU memory space. In that situation the data are copied to the GPU.

– DEVICE - for memory buffers already allocated in the GPU memory space. When the matrix handler is created with DEVICE origin, the corresponding release function (see Section 4.5.2) does not remove the data from the GPU memory space, because the pointers were only assigned to the matrix structure and not copied. This is the case when the created structure does not own the provided data. You have to release the GPU buffers by calling the cudaFree function or the proper free function from Section 4.5.6.

Return value
When the matrix is created successfully, the function returns a pointer to a data structure allocated in the GPU memory space. If something goes wrong it returns NULL.


4.5.2 Matrix handler removal

si_dhreleasecsr(matrix_handle);

si_shreleasecsr(matrix_handle);

si_ihreleasecsr(matrix_handle);

si_dchreleasecsr(matrix_handle);

si_schreleasecsr(matrix_handle);

si_dhreleasecmr(matrix_handle);

si_shreleasecmr(matrix_handle);

Description
The functions remove the data pointed to by the given handler and set the handler to NULL.

Input parameters

• matrix handler - reference to matrix handler

– SI_CSR_DOUBLE_HANDLE for si_dhreleasecsr

– SI_CSR_FLOAT_HANDLE for si_shreleasecsr

– SI_CSR_INT_HANDLE for si_ihreleasecsr

– SI_CSR_DCOMPLEX_HANDLE for si_dchreleasecsr

– SI_CSR_SCOMPLEX_HANDLE for si_schreleasecsr

– SI_CMR_DOUBLE_HANDLE for si_dhreleasecmr

– SI_CMR_FLOAT_HANDLE for si_shreleasecmr

Return value
On success the function returns ERR_OK. In case of errors the function returns ERR_BAD_GPU_POINTER.

4.5.3 Preconditioner handler creation

The way to create a preconditioner handler is similar for all available types. There is a set of functions which take a matrix handler as an argument and return a pointer to the preconditioner. The functions are distinguished by their return type; in general the difference is in the matrix format (CSR or CMRS) and in the type of the nonzero values (FLOAT or DOUBLE). Preconditioners implemented from the CUSP library can have a set of parameters with default values. You can change them by passing the desired values.

Algebraic multigrid (AMG)

SI_CSR_DOUBLE_AMG_HANDLE si_gdcsramg( matrix_handle, theta = 0);

SI_CSR_FLOAT_AMG_HANDLE si_gscsramg( matrix_handle, theta = 0);

Description
Functions create an Algebraic Multigrid preconditioner based on a CSR matrix structure.

Input parameters

• matrix handler - reference to CSR matrix handler

– SI_CSR_DOUBLE_HANDLE for si_gdcsramg

– SI_CSR_FLOAT_HANDLE for si_gscsramg
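Example. A minimal sketch, assuming a double precision CSR handler A created as in 4.5.1; the matching release function is described in 4.5.4.

// A is an SI_CSR_DOUBLE_HANDLE created earlier with si_dhcsr(...)
SI_CSR_DOUBLE_AMG_HANDLE amg = si_gdcsramg(A, 0);   // theta = 0 (the default)

// ... pass amg to a preconditioned solver ...

si_dhcsrfreeamg(amg);                               // see 4.5.4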


Approximate Inverse (AINV)

SI_CSR_DOUBLE_AINV_HANDLE si_gdcsrainv( matrix_handle, drop_tolerance = 0.1,

nonzero_per_row = -1,

lin_dropping = false,

lin_param = 1);

SI_CSR_FLOAT_AINV_HANDLE si_gscsrainv( matrix_handle,

float drop_tolerance = 0.1,

int nonzero_per_row = -1,

bool lin_dropping = false,

int lin_param = 1);

SI_CSR_DOUBLE_AINV_SCALED_HANDLE si_gdcsrainvscaled(matrix_handle,

drop_tolerance = 0.1,

nonzero_per_row = -1,

lin_dropping = false,

lin_param = 1);

SI_CSR_FLOAT_AINV_SCALED_HANDLE si_gscsrainvscaled(matrix_handle,

drop_tolerance = 0.1,

nonzero_per_row = -1,

lin_dropping = false,

lin_param = 1);

SI_CSR_DOUBLE_AINV_NSYM_HANDLE si_gdcsrainvnsym( matrix_handle,

drop_tolerance = 0.1,

nonzero_per_row = -1,

lin_dropping = false,

lin_param = 1);

SI_CSR_FLOAT_AINV_NSYM_HANDLE si_gscsrainvnsym( matrix_handle,

drop_tolerance = 0.1,

nonzero_per_row = -1,

lin_dropping = false,

lin_param = 1);

Description
Functions create an Approximate Inverse preconditioner based on a CSR matrix structure. There are three variations of this preconditioner: basic AINV, AINV Scaled and AINV Non-Symmetric. This preconditioner works only for symmetric positive definite matrices.

Input parameters

• matrix handler - reference to CSR matrix handler

– SI_CSR_DOUBLE_HANDLE for the double precision variants (si_gdcsrainv, si_gdcsrainvscaled, si_gdcsrainvnsym)

– SI_CSR_FLOAT_HANDLE for the single precision variants (si_gscsrainv, si_gscsrainvscaled, si_gscsrainvnsym)

• drop_tolerance - Tolerance for dropping during factorization.

• nonzero_per_row - Number of non-zeros allowed per row of the factored matrix. If negative or lin_dropping==true, this will be ignored.


• lin_dropping - When true, this will use the dropping strategy from Lin & More, where the per-row count is based on the matrix structure.

• lin_param - When lin_dropping is set to true, this indicates how many additional non-zeros per row should be included.
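Example. A minimal sketch of creating AINV preconditioners with non-default parameters, assuming a double precision CSR handler A created as in 4.5.1; the matching release functions are listed in 4.5.4.

// Basic AINV: tighter drop tolerance, at most 10 nonzeros kept per row
SI_CSR_DOUBLE_AINV_HANDLE ainv = si_gdcsrainv(A, 0.05, 10, false, 1);

// Scaled AINV using the Lin & More dropping strategy (nonzero_per_row is ignored)
SI_CSR_DOUBLE_AINV_SCALED_HANDLE ainv_s = si_gdcsrainvscaled(A, 0.1, -1, true, 2);

// ... use either handle in a preconditioned solver ...

si_dhcsrfreeainv(ainv);                             // see 4.5.4
si_dhcsrfreeainvscaled(ainv_s);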

Diagonal

SI_CSR_DOUBLE_DIAGONAL_HANDLE si_gdcsrdiagonal( matrix_handle )

SI_CSR_FLOAT_DIAGONAL_HANDLE si_gscsrdiagonal( matrix_handle );

SI_CMR_DOUBLE_DIAGONAL_HANDLE si_gdcmrdiagonal( matrix_handle );

SI_CMR_FLOAT_DIAGONAL_HANDLE si_gscmrdiagonal( matrix_handle );

Description
Functions create a diagonal preconditioner based on a matrix in CSR or CMRS format.

Input parameters

• matrix handler - reference to CSR or CMRS matrix handler

– SI CSR DOUBLE HANDLE for si gdcsrdiagonal

– SI CSR FLOAT HANDLE for si gscsrdiagonal

– SI CMR DOUBLE HANDLE for si gdcmrdiagonal

– SI CMR FLOAT HANDLE for si gscmrdiagonal

ILU(0)

SI_CSR_DOUBLE_VOID_HANDLE si_gdcsrilu( matrix_handle );

SI_CSR_FLOAT_VOID_HANDLE si_gscsrilu( matrix_handle );

Description
Functions create an ILU(0) preconditioner based on a matrix in CSR format.

Input parameters

• matrix handler - reference to CSR matrix handler

– SI CSR DOUBLE HANDLE for si gdcsrilu

– SI CSR FLOAT HANDLE for si gscsrilu

Void

SI_CSR_DOUBLE_VOID_HANDLE si_gdcsrvoid( matrix_handle );

SI_CSR_FLOAT_VOID_HANDLE si_gscsrvoid( matrix_handle );

SI_CMR_DOUBLE_VOID_HANDLE si_gdcmrvoid( matrix_handle );

SI_CMR_FLOAT_VOID_HANDLE si_gscmrvoid( matrix_handle );

Description
Functions create a dummy, empty preconditioner based on a matrix in CSR or CMRS format. It should be used if you do not wish to use any preconditioner.

Input parameters

• matrix handler - reference to CSR or CMRS matrix handler


– SI CSR DOUBLE HANDLE for si gdcsrvoid

– SI CSR FLOAT HANDLE for si gscsrvoid

– SI CMR DOUBLE HANDLE for si gdcmrvoid

– SI CMR FLOAT HANDLE for si gscmrvoid

4.5.4 Preconditioner handler removal

The following set of functions is used to release a preconditioner structure from memory. All functions are very similar. Each function name states what kind of handler it takes and in which format, e.g. si_dhcsrfreeamg expects an AMG handler in CSR format with values of double type.

Algebraic Multigrid (AMG)

si_dhcsrfreeamg(precond_handle);

si_shcsrfreeamg(precond_handle);

Description
Functions release the AMG preconditioner structure from memory.

Input parameters

• precond handle — reference to AMG preconditioner

– SI CSR DOUBLE AMG HANDLE — for si dhcsrfreeamg

– SI CSR FLOAT AMG HANDLE — for si shcsrfreeamg

Approximate Inverse (AINV)

si_dhcsrfreeainv( precond_handle);

si_dhcsrfreeainvscaled(precond_handle);

si_dhcsrfreeainvnsym(precond_handle);

si_shcsrfreeainv(precond_handle);

si_shcsrfreeainvscaled(precond_handle);

si_shcsrfreeainvnsym(precond_handle);

Description
Functions release the AINV preconditioner structure from memory.

Input parameters

• precond_handle — reference to AINV preconditioner

– SI CSR DOUBLE AINV HANDLE — for si dhcsrfreeainv

– SI CSR DOUBLE AINV SCALED HANDLE — for si dhcsrfreeainvscaled

– SI CSR DOUBLE AINV NSYM HANDLE — for si dhcsrfreeainvnsym

– SI CSR FLOAT AINV HANDLE — for si shcsrfreeainv

– SI CSR FLOAT AINV SCALED HANDLE — for si shcsrfreeainvscaled

– SI CSR FLOAT AINV NSYM HANDLE — for si shcsrfreeainvnsym


Diagonal

si_dhcsrfreediagonal(precond_handle);

si_shcsrfreediagonal(precond_handle);

si_dhcmrfreediagonal(precond_handle);

si_shcmrfreediagonal(precond_handle);

Description
Functions release the diagonal preconditioner structure from memory.

Input parameters

• precond handle — reference to diagonal preconditioner

– SI CSR DOUBLE DIAGONAL HANDLE — for si dhcsrfreediagonal

– SI CSR FLOAT DIAGONAL HANDLE — for si shcsrfreediagonal

– SI CMR DOUBLE DIAGONAL HANDLE — for si dhcmrfreediagonal

– SI CMR FLOAT DIAGONAL HANDLE — for si shcmrfreediagonal

ILU(0)

si_dhcsrfreeilu(precond_handle);

si_shcsrfreeilu(precond_handle);

Description
Functions release the ILU(0) preconditioner structure from memory.

Input parameters

• precond handle — reference to ILU(0) preconditioner

– SI CSR DOUBLE ILU HANDLE — for si dhcsrfreeilu

– SI CSR FLOAT ILU HANDLE — for si shcsrfreeilu

Void

si_dhcsrfreevoid( precond_handle );

si_shcsrfreevoid( precond_handle );

si_dhcmrfreevoid( precond_handle );

si_shcmrfreevoid( precond_handle );

Description
Functions release the void preconditioner structure from memory.

Input parameters

• precond_handle — reference to void preconditioner

– SI_CSR_DOUBLE_VOID_HANDLE — for si_dhcsrfreevoid

– SI_CSR_FLOAT_VOID_HANDLE — for si_shcsrfreevoid

– SI_CMR_DOUBLE_VOID_HANDLE — for si_dhcmrfreevoid

– SI_CMR_FLOAT_VOID_HANDLE — for si_shcmrfreevoid


4.5.5 GPU Memory allocation

si_gsmalloc(size, out_ptr)

si_gdmalloc(size, out_ptr)

si_gimalloc(size, out_ptr)

si_gvmalloc(size, out_ptr)

si_gszmalloc(size, out_ptr)

si_gdzmalloc(size, out_ptr)

Description
The functions allocate memory in the GPU address space. The buffer is properly aligned for all data types.

Input parameters
size – int. Size of the allocated buffer in units that depend on the buffer data type. The allocated buffer can hold at least size elements of a given type:

• si_gsmalloc – single precision floating point numbers

• si_gdmalloc – double precision floating point numbers

• si_gimalloc – signed integer numbers

• si_gvmalloc – bytes

• si_gszmalloc – single precision complex numbers (2 x float)

• si_gdzmalloc – double precision complex numbers (2 x double)

Output parameters
out_ptr: Double pointer to the allocated buffer. On input make sure to provide a valid address of the appropriate pointer. On output the target pointer is filled with the address of the buffer in GPU memory. If memory allocation fails, the target pointer is NULL.

• float** for si_gsmalloc,

• double** for si_gdmalloc,

• int** for si_gimalloc,

• void** for si_gvmalloc,

• si_scomplex** for si_gszmalloc,

• si_dcomplex** for si_gdzmalloc.

Warning! You should not pass the address of a pointer to an existing buffer as the out_ptr argument. The functions do not check whether the out_ptr target pointer was already allocated and do not free any memory. Passing a previously allocated buffer to these functions may result in memory leaks.

Return value
On success the function returns ERR_OK. In case of errors the function may return

• ERR_GPU_MEM_ALLOC

• ERR_UNKNOWN

and encoded CUDA or CUBLAS errors.


4.5.6 GPU memory release functions

si_gsfree(out_ptr)

si_gdfree(out_ptr)

si_gifree(out_ptr)

si_gvfree(out_ptr)

si_gszfree(out_ptr)

si_gdzfree(out_ptr)

Description
The functions release memory in the GPU address space.

Input / Output parameters
out_ptr: Double pointer to the allocated buffer. On input it must be a valid address of the appropriate pointer. On output the target pointer is equal to NULL. If the memory cannot be released, the target pointer is unchanged.

• float** for si_gsfree,

• double** for si_gdfree,

• int** for si_gifree,

• void** for si_gvfree,

• si_scomplex** for si_gszfree,

• si_dcomplex** for si_gdzfree.

Return value
On success the function returns ERR_OK. In case of errors the function may return

• ERR_BAD_GPU_POINTER

• ERR_UNKNOWN

and encoded CUDA or CUBLAS errors.
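Example. A minimal sketch of allocating and releasing a GPU buffer for 1000 double precision numbers; error codes can be decoded with si_errstr() (see 4.4). The snippet assumes the library has been initialized and the SpeedIT header included.

double* d_buf = NULL;                       // will receive a GPU address
int err = si_gdmalloc(1000, &d_buf);        // room for at least 1000 doubles
if (err != ERR_OK || d_buf == NULL) {
    // allocation failed
}

// ... fill the buffer, e.g. with si_c2gdcopy (see 4.5.8) ...

err = si_gdfree(&d_buf);                    // on success d_buf is set to NULL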


4.5.7 GPU to CPU memory copy

si_g2cscopy(size, in_ptr, out_ptr)

si_g2cdcopy(size, in_ptr, out_ptr)

si_g2cicopy(size, in_ptr, out_ptr)

si_g2cvcopy(size, in_ptr, out_ptr)

si_g2cszcopy(size, in_ptr, out_ptr)

si_g2cdzcopy(size, in_ptr, out_ptr)

Description
The functions copy data from GPU memory to CPU memory.

Input parameters

• size: int. Size of the transferred data in units dependent on the buffer data type. The size of the transfer will be size elements of the appropriate type:

– si_g2cscopy – single precision floating point numbers

– si_g2cdcopy – double precision floating point numbers

– si_g2cicopy – signed integer numbers

– si_g2cvcopy – bytes

– si_g2cszcopy – single precision complex numbers (2 x float)

– si_g2cdzcopy – double precision complex numbers (2 x double)

• in_ptr: Pointer to the input buffer allocated in GPU memory. The buffer must contain enough data to read from, i.e. at least size elements of the appropriate type.

– float* for si_g2cscopy,

– double* for si_g2cdcopy,

– int* for si_g2cicopy,

– void* for si_g2cvcopy,

– si_scomplex* for si_g2cszcopy,

– si_dcomplex* for si_g2cdzcopy.

Output parameters
out_ptr: Pointer to the output buffer allocated in CPU memory. On output the buffer is filled with data from the buffer in GPU memory. The buffer must be large enough to hold all transferred data, i.e. at least size elements of the appropriate type.

• float* for si_g2cscopy,

• double* for si_g2cdcopy,

• int* for si_g2cicopy,

• void* for si_g2cvcopy,

• si_scomplex* for si_g2cszcopy,

• si_dcomplex* for si_g2cdzcopy.

Return value
On success the function returns ERR_OK. In case of errors the function may return

• ERR_CUDA_INVALID_VAL

• ERR_BAD_GPU_POINTER

• ERR_WRONG_COPY_DIRECTION

• ERR_UNKNOWN

and encoded CUDA or CUBLAS errors.


4.5.8 CPU to GPU memory copy

si_c2gscopy(size, in_ptr, out_ptr)

si_c2gdcopy(size, in_ptr, out_ptr)

si_c2gicopy(size, in_ptr, out_ptr)

si_c2gvcopy(size, in_ptr, out_ptr)

si_c2gszcopy(size, in_ptr, out_ptr)

si_c2gdzcopy(size, in_ptr, out_ptr)

Description
The functions copy data from CPU memory to GPU memory.

Input parameters

• size: int. Size of the transferred data in units dependent on the buffer data type. The transfer size is size elements of a type dependent on the function:

– si_c2gscopy – single precision floating point numbers

– si_c2gdcopy – double precision floating point numbers

– si_c2gicopy – signed integer numbers

– si_c2gvcopy – bytes

– si_c2gszcopy – single precision complex numbers (2 x float)

– si_c2gdzcopy – double precision complex numbers (2 x double)

• in_ptr: Pointer to the input buffer allocated in CPU memory. The buffer must contain enough data to read from, i.e. at least size elements of the appropriate type.

– float* for si_c2gscopy,

– double* for si_c2gdcopy,

– int* for si_c2gicopy,

– void* for si_c2gvcopy,

– si_scomplex* for si_c2gszcopy,

– si_dcomplex* for si_c2gdzcopy.

Output parameters
out_ptr: Pointer to the output buffer allocated in GPU memory. On output the buffer is filled with data from the buffer in CPU memory. The buffer must be large enough to hold all transferred data, i.e. at least size elements of the appropriate type.

• float* for si_c2gscopy,

• double* for si_c2gdcopy,

• int* for si_c2gicopy,

• void* for si_c2gvcopy,

• si_scomplex* for si_c2gszcopy,

• si_dcomplex* for si_c2gdzcopy.

Return value
On success the function returns ERR_OK. In case of errors the function may return

• ERR_CUDA_INVALID_VAL

• ERR_BAD_GPU_POINTER

• ERR_WRONG_COPY_DIRECTION

• ERR_UNKNOWN

and encoded CUDA or CUBLAS errors.
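Example. A minimal sketch of a CPU to GPU to CPU round trip for 256 double precision numbers, combining 4.5.5, 4.5.7 and 4.5.8; it assumes the library has been initialized and the SpeedIT header included.

double h_src[256], h_dst[256];
// ... fill h_src ...

double* d_buf = NULL;
si_gdmalloc(256, &d_buf);                   // GPU buffer for 256 doubles (4.5.5)

si_c2gdcopy(256, h_src, d_buf);             // CPU -> GPU (4.5.8)
// ... run GPU computations on d_buf ...
si_g2cdcopy(256, d_buf, h_dst);             // GPU -> CPU (4.5.7)

si_gdfree(&d_buf);                          // release the GPU buffer (4.5.6)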


4.5.9 GPU memory alloc and copy from CPU memory

si_c2gvmcopy(size, in_ptr, out_ptr)

si_c2gsmcopy(size, in_ptr, out_ptr)

si_c2gdmcopy(size, in_ptr, out_ptr)

si_c2gimcopy(size, in_ptr, out_ptr)

si_c2gszmcopy(size, in_ptr, out_ptr)

si_c2gdzmcopy(size, in_ptr, out_ptr)

Description
The functions allocate a destination buffer in GPU memory and copy data from the source buffer in CPU memory to the destination buffer.

Input parameters

• size: int. Size of the transferred data in units that depend on the data type in the buffer. The transfer size is size elements of a type that depends on the given function:

– si_c2gsmcopy – single precision floating point numbers

– si_c2gdmcopy – double precision floating point numbers

– si_c2gimcopy – signed integer numbers

– si_c2gvmcopy – bytes

– si_c2gszmcopy – single precision complex numbers (2 x float)

– si_c2gdzmcopy – double precision complex numbers (2 x double)

• in_ptr: Pointer to the input buffer allocated in CPU memory. The buffer must contain enough data to read from, i.e. at least size elements of the appropriate type.

– float* for si_c2gsmcopy,

– double* for si_c2gdmcopy,

– int* for si_c2gimcopy,

– void* for si_c2gvmcopy,

– si_scomplex* for si_c2gszmcopy,

– si_dcomplex* for si_c2gdzmcopy.

Output parameters
out_ptr: Double pointer to the output buffer, which will be allocated in GPU memory. On input it must be a valid address of an empty pointer. On output the pointer is filled with the address of the allocated buffer, and the buffer is filled with data from CPU memory. The allocated buffer size will be size elements of the appropriate type. The buffer will be properly aligned for all data types.

• float** for si_c2gsmcopy,

• double** for si_c2gdmcopy,

• int** for si_c2gimcopy,

• void** for si_c2gvmcopy,

• si_scomplex** for si_c2gszmcopy,

• si_dcomplex** for si_c2gdzmcopy.


Return value
On success the function returns ERR_OK. In case of errors the function may return

• ERR_CUDA_INVALID_VAL

• ERR_BAD_GPU_POINTER

• ERR_WRONG_COPY_DIRECTION

• ERR_GPU_MEM_ALLOC

• ERR_UNKNOWN

and encoded CUDA or CUBLAS errors.
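Example. A minimal sketch using the combined allocate-and-copy variant; it replaces the separate si_gdmalloc / si_c2gdcopy pair from the previous sections and assumes the library has been initialized.

double h_src[256];
// ... fill h_src ...

double* d_buf = NULL;                       // allocated by the call below
int err = si_c2gdmcopy(256, h_src, &d_buf); // allocates a GPU buffer and copies h_src into it

// ... use d_buf on the GPU ...

si_gdfree(&d_buf);                          // release with the matching free (4.5.6)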


4.6 Library initialization

4.6.1 Opening library

si_init(deviceID)

Description
The function performs SpeedIT library initialization. It should be called before any other library function. The default value of the deviceID argument is -1; when it is not changed, SpeedIT will search for compatible GPU devices and choose the best one, taking into account the number of multiprocessors and the compute capability. You can specify a desired deviceID on which calculations will be performed; in that case SpeedIT will not check whether the chosen device meets the compute capability requirements.

Input parameters
deviceID: int – zero or positive value which corresponds to a deviceID in the system. If deviceID is not specified, the function will automatically search for the best device.

Return values
On success the function returns ERR_OK. In case of errors the function may return encoded CUDA or CUBLAS errors, or abort the program if no GPU device is found.

4.6.2 Closing library

si_shutdown()

Description
The function closes the SpeedIT library. It should be called after the last use of any SpeedIT function.

Return values
On success the function returns ERR_OK. In case of errors the function may return encoded CUDA or CUBLAS errors.
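Example. A minimal program skeleton showing initialization and shutdown; the header name "speedit.h" is an assumption and may differ in your installation.

#include <cstdio>
#include "speedit.h"   // assumed header name

int main()
{
    int err = si_init(-1);          // -1: let SpeedIT pick the best compatible GPU
    if (err != ERR_OK) {
        std::fprintf(stderr, "si_init failed: %s\n", si_errstr(err));
        return 1;
    }

    // ... create handlers, run solvers ...

    si_shutdown();                  // after the last SpeedIT call
    return 0;
}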

4.7 Loading matrix in Matrix Market format

si_cdload(filename, nnz, n_rows, n_cols, values, r_idx, c_idx)

si_csload(filename, nnz, n_rows, n_cols, values, r_idx, c_idx)

Description
The functions help to fill the variables necessary to define a matrix in CSR format (see 3.3.1) using data stored in the Matrix Market exchange file format. For more information please visit http://math.nist.gov/MatrixMarket

Input parameters

• filename - Path to the mtx file which stores the matrix data in COO format.

Return values

• nnz - Number of non-zero values of input matrix

• n rows - Number of rows of input matrix

• n cols - Number of columns of input matrix

• values - Array of non-zero values of the matrix


• r_idx - Array containing n_rows+1 elements. The first n_rows elements contain indices of the first nonzero element in each row of the input matrix; the last element stores the number of nonzero values in the matrix.

• c_idx - Array containing column indices of the matrix nonzero values.
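Example. A minimal sketch of loading a Matrix Market file and wrapping the result in a CSR handler. The manual does not spell out the exact parameter types of si_cdload, so passing the outputs by address below is an assumption; adjust to the prototypes in your header.

int nnz = 0, n_rows = 0, n_cols = 0;
double* values = NULL;
int *r_idx = NULL, *c_idx = NULL;

// Assumed calling convention: output variables are filled through their addresses.
si_cdload("matrix.mtx", &nnz, &n_rows, &n_cols, &values, &r_idx, &c_idx);

// The loaded CSR arrays live in CPU memory, so use the HOST origin (see 4.5.1).
SI_CSR_DOUBLE_HANDLE A = si_dhcsr(values, c_idx, r_idx, n_rows, n_cols, nnz, HOST);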


5 OpenFOAM interoperability

In order to provide OpenFOAM interoperability, SpeedIT 2.3 ships additional headers. The library can still be used in the plain way described in the previous chapters.

5.1 ProcType.h

This header provides an enumeration type called ProcType. It is used to indicate the processor type, CPU or GPU, and the memory type bound to it. The ProcType enumeration contains two values:

• CPU - CPU processor or CPU memory

• GPU - GPU processor or GPU memory

5.2 vector.h

This header provides a template implementation of the SpeedIT Vector type. It is a container class for elements of a type REAL.

5.2.1 Vector initialization on CPU

Vector<typename REAL, CPU>::Vector(int n = 0);

Vector<typename REAL, CPU>::Vector(int n, const REAL* p);

Vector<typename REAL, CPU>::Vector(int n, REAL* p);

Vector<typename REAL, CPU>::Vector(Vector<REAL, GPU> const& v);

Description
These constructors create a new Vector of type REAL in host memory.
Input parameters

• n - number of elements in Vector.

• p - pointer to data of type REAL. You have to allocate the memory before passing the pointer p.

• v - constant reference to the Vector allocated on a GPU.

Return values
Each constructor returns a Vector of type REAL stored in host memory.

5.2.2 Vector initialization on GPU

Vector<typename REAL, GPU>::Vector(int n = 0);

Vector<typename REAL, GPU>::Vector(int n, const REAL* p);

Vector<typename REAL, GPU>::Vector(int n, REAL* p);

Vector<typename REAL, GPU>::Vector(Vector<REAL, CPU> const& v);

Description
These constructors create a new Vector of type REAL in GPU memory.
Input parameters

• n - number of elements in the Vector.

• p - pointer to data of type REAL. You have to allocate memory before passing a pointer p.

• v - constant reference to a Vector allocated on the CPU.

Return values
Each constructor returns a Vector of type REAL stored in GPU memory.
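Example. A minimal sketch using the documented constructors; it assumes the library has already been initialized (see 4.6.1) so that GPU memory can be allocated, and that "ProcType.h" and "vector.h" are on the include path.

#include "ProcType.h"
#include "vector.h"

int main()
{
    Vector<double, CPU> h(4);            // 4 elements in host memory
    for (int i = 0; i < h.size(); ++i)
        h[i] = 1.0 + i;                  // element access is available on CPU Vectors (5.2.3)

    Vector<double, GPU> d(h);            // copy-construct a GPU-resident Vector from the host one
    Vector<double, CPU> back(d);         // and copy it back to host memory

    return 0;                            // destructors release the CPU and GPU storage (5.2.4)
}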


5.2.3 Functions specific for Vector allocated on CPU

REAL & operator[](int n);

Description
Random access operator to the Vector elements.
Input parameters

• n - access n-th element of the Vector.

Return values
Returns a reference to the Vector element of type REAL.

REAL operator[](int n) const ;

Description
Random access operator to the Vector elements.
Input parameters

• n - access n-th element of the Vector.

Return values
Returns a constant reference to the Vector element of type REAL.

void init(REAL* p, size_t n);

Description
Function initializes the Vector with new data. When p differs from the pointer stored internally, the old data is deallocated.
Input parameters

• p - pointer to data of type REAL.

• n - number of the elements.

void swap(Vector<REAL, CPU> & w);

Description
Swaps the contents of two Vector containers.
Input parameters

• w - reference to a Vector allocated on the CPU.

void swap(REAL* & p, size_t n);

Description
Swaps the Vector data.
Input parameters

• p - reference to the data pointer of type REAL.

• n - number of elements pointed to by p.

void reduce_size(int n);

Description
Function reduces the size of the Vector without deallocating memory.
Input parameters

• n - highest accessible element of the Vector.


5.2.4 Common functions for Vector allocated on CPU and GPU

void Vector<typename REAL, ProcType PT>::nullify(void);

Description
Nullifies the data pointer of the Vector. Introduced for sanity reasons.

void Vector<typename REAL, ProcType PT>::set_fields(int size, REAL* p);

Description
Low-level control over the Vector private fields.
Input parameters

• size - the number of elements

• p - pointer to size elements of type REAL

Vector<typename REAL, ProcType PT>::~Vector();

Description
The destructor deallocates memory on the CPU or GPU.

ProcType Vector<typename REAL, ProcType PT>::getProcType();

Description
Function returns the ProcType of a Vector class instance.
Return values
GPU for an object allocated in graphics card memory, and CPU for an object allocated in host memory.

int Vector<typename REAL, ProcType PT>::size() const;

Description
Function returns the size of the vector.
Return values
Size of the vector, of type int.

REAL* Vector<typename REAL, ProcType PT>::p() const;

Description
Function returns a pointer to the Vector data.
Return values
Pointer to data of type REAL.

const REAL* Vector<typename REAL, ProcType PT>::constp() const;

Description
Function returns a pointer to the Vector data.
Return values
Constant pointer to the data of type REAL.

bool Vector<typename REAL, ProcType PT>::empty() const;


Description
Checks whether the Vector instance is empty.
Return values
Returns true if the Vector instance is empty, otherwise returns false.

void Vector<typename REAL, ProcType PT>::operator=(Vector<REAL, GPU> const& v);

Description
Function assigns data from a Vector instance allocated on a GPU.
Input parameters

• v - constant reference to the Vector instance allocated on a GPU.

void Vector<typename REAL, ProcType PT>::operator=(Vector<REAL, CPU> const& v);

Description
Function assigns data from a Vector instance allocated on a CPU.
Input parameters

• v - constant reference to the Vector instance allocated on a CPU.

void Vector<typename REAL, ProcType PT>::resize (int n);

Description
Function resizes the Vector. For both GPU and CPU allocated Vectors the old data is lost.
Input parameters

• n - a new container size.

5.3 interfaces.h

This header provides implementations of template classes used only for OpenFOAM interoperability. You should be familiar with the OpenMPI and OpenFOAM APIs.

• processor_interface - SpeedIT implementation of OpenFOAM interface object.

• processor_interfaces - Implementation of a container for processor_interface objects. Derived from std::vector<typename T>.

5.3.1 processor interface initialization and destruction

processor_interface<typename COMPLEX, ProcType PT>::processor_interface(

int myRank,

int hisRank,

int patchSize,

const int * boundary_index_ptr,

const COMPLEX * boundary_value_ptr

);

Description
This constructor creates a new processor_interface of type COMPLEX stored either in CPU or GPU memory.
Input parameters


• myRank - process number of a current processor_interface

• hisRank - process number of a neighbour processor_interface

• patchSize - patch size, usually Foam::fvPatch::size()

• boundary_index_ptr - pointer to the first face cell, usually Foam::fvPatch::faceCell()::begin().

• boundary_value_ptr - interface coefficients, usually an object of type Foam::FieldField<Foam::Field, Foam::scalar>.

Return values
A valid processor_interface object of type COMPLEX stored either on a CPU or a GPU.

processor_interface<typename COMPLEX, ProcType PT>::~processor_interface();

Description
Destructor of the processor_interface object.

5.3.2 processor interface functions

int processor_interface<typename COMPLEX, ProcType PT>::my_rank() const;

Description
Function returns the process number of the current processor_interface object.
Return values
Process number.

int processor_interface<typename COMPLEX, ProcType PT>::his_rank() const;

Description
Function returns the process number of the neighbour processor_interface object.
Return values
Process number.

int processor_interface<typename COMPLEX, ProcType PT>::patch_size() const;

Description
Function returns the patch size of the current processor_interface object.
Return values
Patch size.

const COMPLEX* processor_interface<typename COMPLEX, ProcType PT>::get_buffer() const;

Description
Function returns the initial address of the send buffer of type COMPLEX. See the MPI_Isend documentation.
Return values
Valid pointer of type COMPLEX representing the initial address of the send buffer.

void processor_interface<typename COMPLEX, ProcType PT>::set_neighbour(int n);

Description
Function sets the neighbour for the current processor_interface object.
Input parameters


• n - neighbour number.

void processor_interface<typename COMPLEX, ProcType PT>::print_received_data();

Description
Utility function which prints out the received data of the current processor_interface object. Introduced for debugging purposes.

5.3.3 processor interfaces initialization

The class is derived from std::vector<typename T>. See the STL documentation for details.

5.3.4 processor interfaces functions

void processor_interfaces<typename COMPLEX, ProcType PT>::init_transfer(

COMPLEX const* old_x,

int x_size,

bool debug = DEFAULT_DEBUG_MODE);

Description
Function initializes an OpenMPI transfer. Internally calls MPI_Isend and MPI_Irecv. See the MPI documentation.
Input parameters

• old_x - initial address of the send buffer.

• x_size - size of the send buffer.

• debug - debug flag. In release mode DEFAULT_DEBUG_MODE equals false. If DEFAULT_DEBUG_MODE equals true, debug information will be printed out.

void processor_interfaces<typename COMPLEX, ProcType PT>::finalize_transfer(

bool debug = DEFAULT_DEBUG_MODE

);

Description
Function ensures finalization of the data transfer. Internally calls MPI_Waitall. See the MPI documentation.
Input parameters

• debug - debug flag. In release mode DEFAULT_DEBUG_MODE equals false. If DEFAULT_DEBUG_MODE equals true, debug information will be printed out.
