rCUDA v20.07alpha User’s Guide

July, 2020

The rCUDA Team

www.rcuda.net [email protected]

Department of Computer Engineering
Universitat Politècnica de València

Valencia, Spain


Disclaimer:

This is an alpha release and therefore unexpected errors may occur. Additionally, the performance of this release has not been assessed yet. Thus, please consider that performance might not be optimum when using this release of rCUDA. As a consequence, this release is not intended to be used in performance measurements.

Important information:

We will be very happy to receive your feedback and comments. Notice, however, that we are a small team and thus we are not able to provide immediate support for the rCUDA technology. Actually, your feedback and comments will likely be incorporated into the next rCUDA release. Thank you for your patience and understanding.


Changelog:

The most important changes in this version of rCUDA with respect to version v18.10beta are:

• A completely new and disruptive internal architecture has been designed and implemented for the core of rCUDA. The new internal architecture is aimed at providing much better support to more CUDA applications and also better performance. Notice, however, that we have not yet checked the number of applications supported nor their performance when using rCUDA.

• A completely new communications layer has been implemented, which has been architected to provide much easier maintenance of the code and also much better performance than previous versions of rCUDA.

• Multi-tenancy is supported. That is, a real GPU can be virtualized into multiple GPUs, which can be concurrently provided to several applications.

• The rCUDA server can simultaneously provide service across TCP and InfiniBand networks. That is, the rCUDA server can serve some applications using TCP/IP while other applications are served using the InfiniBand network.

• Support for functions in the Driver API has been noticeably improved.

• The use of P2P data copies has been noticeably simplified, making it fully transparent to the user.

• GPU memory can be safely partitioned among different applications. In next releases of rCUDA we will disclose the public API to do so.

• Next releases of rCUDA will include the rCUDA-smi tool, which is similar to the nvidia-smi tool except that remote GPUs are monitored.

• Next releases of rCUDA will include the rCUDA GPU scheduler, intended to provide efficient integration of rCUDA with Slurm and other job schedulers.

• Next releases of rCUDA will include the sbatch and srun commands required to integrate rCUDA with Slurm. Other job schedulers, such as PBSpro, could also be supported.

The rCUDA Team hopes that you enjoy this new version of the rCUDA technology!


The most important changes in recent versions of rCUDA (v18.03beta and v18.10beta) were:

• The NCCL library was supported.

• A license server was introduced into the rCUDA suite. In order to use rCUDA, you first need to set up the license server in your cluster. Please refer to the installation chapter later in this guide for more information.

• The architecture of the rCUDA server was renewed from scratch in order to provide full support to multi-threaded applications.

• The rCUDA daemon was improved in order to make a better usage of the GPU memory, allowing more memory for applications.

• Support for functions in the Driver API was improved.

• The Management module was supported.

• Support for Deep Learning frameworks such as TensorFlow or Caffe2 was introduced.

• Support for the cuSOLVER, cuBLAS-XT and NVGRAPH libraries was introduced.

• The NPP library was partially supported.

• Version 7.0 of the cuDNN library was supported.

• The rCUDA server used a single TCP port to provide service to all clients. The need for shutting down the firewall at the server node was therefore removed.

• The RCUDAPROTO environment variable was replaced by the RCUDA_NETWORK environment variable. Please update the scripts you used with previous versions of rCUDA.


Notice:

rCUDA v20.07alpha provides support for the following versions of CUDA:

• CUDA 9.0

Support for other CUDA versions will be provided in future releases.


Please cite the following papers in any published work if you use the rCUDA software:

• F. Silla, S. Iserte, C. Reaño, J. Prades, “On the benefits of the remote GPU virtualization mechanism: The rCUDA case”, in Concurrency and Computation: Practice and Experience 29 (13), 2017.

• S. Iserte, J. Prades, C. Reaño, F. Silla, “Increasing the performance of data centers by combining remote GPU virtualization with Slurm”, in 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), 2016.


rCUDA TERMS OF USE

IMPORTANT - READ BEFORE COPYING, INSTALLING OR USING. Do not copy, install, or use the rCUDA software provided under this license agreement, until you have carefully read the following terms and conditions.

1. LICENSE

1.1 Subject to your compliance with the terms included in this document, you are granted a limited, non-exclusive, non-sublicensable, non-assignable, free of charge license to install the copy of the rCUDA software you received with this license (hereafter "rCUDA" or "Software") on a personal computer or other device; and personally use the Software. The rCUDA Team reserves all rights not explicitly granted to you under these terms.

1.2 Restrictions. You may not and you agree not to:

(a) sub-license, sell, assign, rent, lease, export, import, distribute or transfer or otherwise grant rights to any third party in the Software;

(b) undertake, cause, permit or authorize the modification, creation of derivative works or improvements, translation, reverse engineering, decompiling, disassembling, decryption, emulation, hacking, discovery or attempted discovery of the source code or protocols of the Software or any part or features thereof (except to the extent permitted by law);

(c) publish or post performance results without explicit authorization from The rCUDA Team;

(d) remove, obscure or alter any copyright notices or other proprietary notices included in the Software;

(e) obtain benefits from or charge additional costs for using the Software or causing the Software (or any part of it) to be used within or to provide commercial products or services to third parties. For such a use, please contact The rCUDA Team.

1.3 Third Party Technology. If you are using Software pre-loaded on, embedded in, combined, distributed or used with or downloaded onto third party products, hardware, software applications, programs or devices ("Third Party Technology"), you agree and acknowledge that: (a) you may be required to enter into a separate license agreement with the relevant third party owner or licensor for the use of such Third Party Technology; (b) some rCUDA functionalities may not be accessible through the Third Party Technology and (c) The rCUDA Team cannot guarantee that the Software shall always be available on or in connection with such Third Party Technology.

2. PROPRIETARY RIGHTS

2.1 The Software and rCUDA Website contain proprietary and confidential information that is protected by intellectual property laws and treaties.

2.2 The content and compilation of contents included in the rCUDA Website, such as texts, graphics, logos, icons, images and software, are intellectual property of The rCUDA Team. Additionally, Universitat Politecnica de Valencia retains industrial property and exploitation rights of the rCUDA software, and are protected by copyright laws. Such copyright protected content cannot be reproduced without explicit permission from The rCUDA Team. You will not take any action to jeopardize, limit or interfere with intellectual property rights in the Software and/or rCUDA Website.

2.3 In case you, or any of your employees or students, publish any article or other material resulting from the use of the rCUDA software, that publication must cite the following references:

F. Silla, S. Iserte, C. Reaño and J. Prades, "On the benefits of the remote GPU virtualization mechanism: The rCUDA case", in Concurrency and Computation: Practice and Experience 29 (13), 2017.

S. Iserte, J. Prades, C. Reaño and F. Silla, "Increasing the performance of data centers by combining remote GPU virtualization with Slurm", in 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), 2016.


3. EXCLUSION OF WARRANTIES, LIMITATION OF LIABILITY AND INDEMNITY

3.1 No Warranties: TO THE MAXIMUM EXTENT PERMITTED BY LAW: THE SOFTWARE AND rCUDA WEBSITE ARE PROVIDED "AS IS" AND USED AT YOUR SOLE RISK WITH NO WARRANTIES WHATSOEVER; THE rCUDA TEAM DOES NOT MAKE ANY WARRANTIES, CLAIMS OR REPRESENTATIONS AND EXPRESSLY DISCLAIMS ALL SUCH WARRANTIES OF ANY KIND, WHETHER EXPLICIT, IMPLIED OR STATUTORY, WITH RESPECT TO THE SOFTWARE AND/OR rCUDA WEBSITE INCLUDING, WITHOUT LIMITATION, WARRANTIES OR CONDITIONS OF QUALITY, PERFORMANCE, NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR USE FOR A PARTICULAR PURPOSE. rCUDA FURTHER DOES NOT REPRESENT OR WARRANT THAT THE SOFTWARE AND/OR rCUDA WEBSITE WILL ALWAYS BE AVAILABLE, ACCESSIBLE, UNINTERRUPTED, TIMELY, SECURE, ACCURATE, COMPLETE AND ERROR-FREE OR WILL OPERATE WITHOUT PACKET LOSS, NOR DOES rCUDA WARRANT ANY CONNECTION TO OR TRANSMISSION FROM THE INTERNET.

3.2 No Liability: YOU ACKNOWLEDGE AND AGREE THAT rCUDA WILL HAVE NO LIABILITY WHATSOEVER, WHETHER IN CONTRACT, TORT (INCLUDING NEGLIGENCE) OR ANY OTHER THEORY OF LIABILITY, AND WHETHER OR NOT THE POSSIBILITY OF SUCH DAMAGES OR LOSSES HAS BEEN NOTIFIED TO rCUDA, IN CONNECTION WITH OR ARISING FROM YOUR USE OF rCUDA WEBSITE OR SOFTWARE. YOUR ONLY RIGHT OR REMEDY WITH RESPECT TO ANY PROBLEMS OR DISSATISFACTION WITH SUCH SOFTWARE AND/OR rCUDA WEBSITE IS TO IMMEDIATELY DEINSTALL SUCH SOFTWARE AND CEASE USE OF SUCH SOFTWARE AND/OR rCUDA WEBSITE. Furthermore, rCUDA shall not be liable to you, whether in contract, tort (including negligence) or any other theory of liability, and whether or not the possibility of such damages or losses has been notified to rCUDA, for:

(a) any indirect, special, incidental or consequential damages; or

(b) any loss of income, business, actual or anticipated profits, opportunity, goodwill or reputation (whether direct or indirect); or

(c) any damage to or corruption of data (whether direct or indirect);

(d) any claim, damage or loss (whether direct or indirect) arising from or relating to:

any product or service provided by a third party under their own terms of service, including without limitation; any Third Party Technology; any third party website.

3.3 If any third party brings a claim against rCUDA in connection with, or arising out of (i) your breach of these Terms; (ii) your breach of any applicable law or regulation; (iii) your infringement or violation of the rights of any third parties (including intellectual property rights), you will indemnify and hold rCUDA harmless from and against all damages, liability, loss, costs and expenses (including reasonable legal fees and costs) related to such claim.

4. TERM OF THE LICENSE AND REGULATION LAWS

4.1. The License shall be valid for a period of time equal to the term of the rights to the Software held by Universitat Politecnica de Valencia.

4.2. For matters not expressly provided herein, this agreement shall be subject to the provisions of the Spanish regulations.

4.3. In the event of any conflict, the parties agree to refer to the Courts of the city of Valencia (Spain), waiving their own jurisdiction.


Contents

1 Introduction

2 Fixing Most Common Errors with rCUDA
2.1 Error 1: ”mlock error: Cannot allocate memory”
2.2 Error 2: ”rCUDA error: function cuGetExportTable not supported”

3 Installation
3.1 Setting up the rCUDA License Server

4 Components and usage
4.1 Client Side
4.2 Server Side
4.2.1 cuDNN Users
4.3 Support for P2P Memory Copies between Remote GPUs

5 Current limitations

6 Further Information

7 Credits
7.1 Coordination & Supervision
7.2 Development & Testing
7.3 rCUDA Team Address


Chapter 1

Introduction

The rCUDA framework enables the usage of remote CUDA-compatible devices. To enable remote GPU-based acceleration, this framework creates virtual CUDA-compatible devices on those machines without a local GPU. These virtual devices represent physical GPUs located in a remote host offering GPGPU services. By leveraging the remote GPU virtualization technique, rCUDA allows CUDA accelerators to be decoupled from the nodes where they are installed, so that GPUs across the cluster can be seamlessly accessed from any node. Furthermore, nodes in the cluster can concurrently access remote GPUs. Figure 1.1 graphically depicts the additional flexibility provided by rCUDA.

rCUDA can be useful in the following three different environments:

HPC Clusters and datacenters. In this context, rCUDA increases the flexibility of using the GPUs present in the cluster. Sharing a given GPU among several applications is made possible. In this way, when rCUDA is used along with the SLURM job scheduler, the time required to complete the execution of a job batch is noticeably reduced. As a result, job waiting times are shorter. Furthermore, GPU utilization is increased at the same time that the overall energy required to execute the job batch is reduced.

Virtual Machines. In this scenario, rCUDA enables shared access from inside the virtual machine to the CUDA accelerator(s) installed in the physical machine. In addition to allowing access to the accelerators installed in the real machine that hosts the virtual ones, it is also possible to access remote GPUs from the virtual domain.

Academia. When using rCUDA in a teaching lab with a commodity network, our middleware offers concurrent access to a few high performance GPUs from the computers in the lab used by students as well as their laptops, or even virtual machines in the teaching lab. This reduces the acquisition costs of the lab infrastructure.


[Figure 1.1 is a schematic illustration of cluster nodes, each with a CPU, main memory and a network adapter, plus, depending on the configuration, GPUs with their own GPU memory attached through PCIe, all linked by an interconnection network.]

(a) When rCUDA is not deployed into the cluster, one or more GPUs must be installed in those nodes of the cluster intended for GPU computing. This usually leads to cluster configurations with one or more GPUs attached to all the nodes of the cluster. Nevertheless, GPU utilization may be lower than 100%, thus wasting hardware resources and delaying amortizing initial expenses.

(b) When rCUDA is leveraged in the cluster, only those GPUs actually needed to address overall workload must be installed in the cluster, thus reducing initial acquisition costs and overall energy consumption. rCUDA allows sharing the (reduced amount of) GPUs present in the cluster among all the nodes.

(c) From a logical point of view, GPUs in the cluster can be seen as a pool of GPUs detached from the nodes and accessible through the cluster interconnect, in the same way as networked storage (NAS) is shared among all the cluster nodes and concurrently accessed by them.

Figure 1.1: Different cluster configurations: (a) the traditional CUDA-based cluster deployment; (b) physical view of the cluster when leveraging rCUDA; (c) logical view of the cluster with rCUDA.


The current version of rCUDA (v20.07alpha) implements all functions in the CUDA Runtime API and CUDA Driver API version 9.0, excluding those related to graphics interoperability and Unified Memory Management. It also implements most of the functions in the following libraries of CUDA Toolkit 9.0: cuRAND, cuBLAS, cuSPARSE, cuFFT, cuSOLVER, cuBLAS-XT, NCCL and NVGRAPH. The NPP library is partially supported. Additionally, version 7.0 of the cuDNN library is supported. Other libraries provided by NVIDIA will be supported in future rCUDA releases. rCUDA 20.07alpha targets the Linux operating system for 64-bit x86-based configurations. Future releases will provide support for ARM and POWER systems. rCUDA supports the same Linux distributions as CUDA does.


Chapter 2

Fixing Most Common Errors with rCUDA

2.1 Error 1: ”mlock error: Cannot allocate memory”

While using rCUDA you may get the error message ”mlock error: Cannot allocate memory”.

In order to fix this error, you can add the following lines at the end of the file ”/etc/security/limits.conf” on both the client and server nodes:

* hard memlock unlimited
* soft memlock unlimited

After rebooting the system, you can run the following command to verify that the limits have been changed:

$ ulimit -a | grep "max locked memory"

You should get an output similar to ”max locked memory (kbytes, -l) unlimited”.


2.2 Error 2: ”rCUDA error: function cuGetExportTable not supported”

This error is caused by the application having been compiled to use static libraries (in particular, the static CUDA library). Notice that rCUDA needs applications to be compiled to use CUDA as a dynamic library. That is, a compilation using CUDA dynamic libraries is needed to allow the use of the rCUDA software. This step can be done by the user in two different ways, as illustrated by the example commands after this list:

• If the nvcc compiler is used, the flag -cudart=shared is needed.

• If the gcc/g++ compiler is used, the -lcudart flag is needed.
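For illustration purposes only, the following command lines sketch both options. The source and binary names (vecadd.cu, vecadd.cpp, vecadd) are hypothetical and the CUDA installation path may differ on your system:

$ nvcc -cudart=shared vecadd.cu -o vecadd

$ g++ vecadd.cpp -o vecadd -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcudart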


Chapter 3

Installation

Notice that prior to using the rCUDA middleware, you MUST have a license key and appropriately configure the rCUDA license server in your cluster.

Typically, if you are reading this document, you should have already received a license key from the rCUDA Team. If not, please ask the rCUDA Team for a license key at [email protected].

The installation of the rCUDA software is very simple. The binaries of the rCUDA software are distributed within a tarball which has to be decompressed manually by the user. The steps to install the rCUDA binaries are:

1. Decompress the rCUDA package.

2. Install and configure the rCUDA license server as explained in Section 3.1 (you should already have a license key).

3. Copy the rCUDA/lib folder to the client node(s) (without GPU) as explained in Section 4.1.

4. Copy the rCUDA folder to the server node (with GPU) as explained in Section 4.2.

3.1 Setting up the rCUDA License Server

The rCUDA license server is a daemon that must be running in one of the nodes of the cluster before the rCUDA server is executed. It is important to know that a license key is required by the rCUDA license server.


In order to get a license key, the user was previously provided by the rCUDA Team with a program that gathers some information about the node that will run the license server. This information comprises, among other parameters, the MAC address of the Ethernet adapter of the node. Other pieces of information are also gathered. Because of this, the license key received from the rCUDA Team can only be used in the same node where the information was initially gathered.

After receiving the license key from the rCUDA Team, the user must store it in the config/rCUDA.conf file. That is, the user must include a new line in that file, similar to:

RCUDA_LICENSE_KEY=3232145765453730306042551...

You should also include in the config/rCUDA.conf file a line with the IP address of the node where the license server is being run, such as:

RCUDA_LICENSE_SERVER=192.168.0.34

Now the license server can be executed. To that end, please launch the rCUDAls program located in the bin folder. It is helpful to execute the license server in interactive and verbose mode by using the -iv options. That is, you should type:

./rCUDAls -iv

You should get an output similar to:

rCUDAls v1.0
Copyright 2009-2018 UNIVERSITAT POLITECNICA DE VALENCIA. All rights reserved.
rCUDAls[16812]: License Features:

- Expiration date: 31 Dec 2021
- Max rCUDA server(s): 2
- Max remote GPU(s): 4
- Max virtual GPU(s): 1 per remote GPU
- Allow Migration: no

rCUDAls[16812]: License server daemon successfully started.

Once you have checked that the rCUDA license server is properly configured and running, you might decide to run it in non-interactive and non-verbose mode if desired. To that end, just do not use the -iv options.

Finally, notice that the rCUDA license server listens on TCP port number 8309. Therefore, if a firewall is running on the node executing the rCUDA license server, then it must be configured to allow incoming connections to TCP port 8309.
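As a minimal sketch, assuming a Linux distribution that manages its firewall with firewalld, the port could be opened with:

$ sudo firewall-cmd --permanent --add-port=8309/tcp
$ sudo firewall-cmd --reload

Use the equivalent commands of whatever firewall tool is actually deployed on that node.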


Chapter 4

Components and usage

rCUDA is organized following a client-server distributed architecture, as shown in Figure 4.1. The client middleware is contacted by the application demanding GPGPU services, both running in the same cluster node. The rCUDA client presents to the application the very same interface as the regular NVIDIA CUDA Runtime and Driver APIs. Upon reception of a request from the application, the client middleware processes it and forwards the corresponding requests to the rCUDA server middleware. In turn, the server interprets the requests and performs the required processing by accessing the real GPU to execute the corresponding request. Once the GPU has completed the execution of the requested command, results are gathered by the rCUDA server, which sends them back to the client middleware. There, the results are finally forwarded to the demanding application.

In order to optimize client/server data exchange, rCUDA employs a customized application-level communication protocol. Furthermore, rCUDA provides efficient support for several underlying network technologies. To that end, rCUDA currently targets the InfiniBand network (using InfiniBand Verbs), the RoCE network (using RDMA) and the general TCP/IP protocol stack (see Figure 4.1). Additional network technologies may be supported in the future.

4.1 Client Side

The client side of the rCUDA middleware is a library of wrappers that replaces the CUDA Toolkit dynamic libraries mentioned at the end of Chapter 1. In this way, CUDA applications that use rCUDA are not aware that they are accessing an external device. Also, they do not need any source code modification.


[Figure 4.1 shows the application on the client side calling the CUDA Runtime API, which is handled by the rCUDA client engine and forwarded, through a common communication API with runtime-loadable TCP/IP, InfiniBand and other network-specific modules, across the network to the rCUDA server engine. The server engine, in turn, uses the CUDA Runtime and Driver libraries to access the GPU on the server side.]

Figure 4.1: rCUDA architecture, showing also the runtime-loadable specific communication modules.

The rCUDA client is distributed as a set of shared libraries: libcuda.so.m.n, libcudart.so.x.y, libcublas.so.x.y, libcufft.so.x.y, libcusparse.so.x.y, libcurand.so.x.y and libcudnn.so.x.y, where m.n is based on the exact version of the CUDA driver and x.y refers to the exact version of CUDA supported by the provided rCUDA package.

These shared libraries should be placed in those machines accessing remote GPGPU services. Set the LD_LIBRARY_PATH environment variable according to the final location of these files (typically “/opt/rCUDA/lib”, “/usr/local/rCUDA/lib”, or “$HOME/rCUDA/lib”, for instance). Notice that you do not need to be root in order to use the rCUDA middleware.

In order to properly execute the applications using the rCUDA library, set the following environment variables (a combined example is shown after this list):

• RCUDA_DEVICE_COUNT: indicates the number of GPUs which are accessible from the current node.
Usage: “RCUDA_DEVICE_COUNT=<number_of_GPUs>”
For example, if the current node will access two remote GPUs:
“RCUDA_DEVICE_COUNT=2”

• RCUDA_DEVICE_X: indicates where GPU X, for the client being configured, is located.
Usage: “RCUDA_DEVICE_X=<server[@<port>]>[:GPUnumber]”
For example, if GPUs 0 and 1 assigned to the current client are located at server “192.168.0.1” using the default rCUDA port (8308):
“RCUDA_DEVICE_0=192.168.0.1”
“RCUDA_DEVICE_1=192.168.0.1:1”

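As a combined example, a minimal sketch of configuring a client node follows. The library location, the server address 192.168.0.1 and the application name my_cuda_app are only placeholders and must be replaced by the actual values of your deployment:

$ export LD_LIBRARY_PATH=$HOME/rCUDA/lib:$LD_LIBRARY_PATH
$ export RCUDA_DEVICE_COUNT=2
$ export RCUDA_DEVICE_0=192.168.0.1
$ export RCUDA_DEVICE_1=192.168.0.1:1
$ ./my_cuda_app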


Furthermore, as the nvcc compiler links with CUDA static libraries by default, a compilation using CUDA dynamic libraries is needed to allow the use of the rCUDA software. This step can be done by the user in two different ways:

• If the nvcc compiler is used, the flag -cudart=shared is needed.

• If the gcc/g++ compiler is used, the -lcudart flag is needed.

In case the user of rCUDA is compiling the NVIDIA CUDA Samples, notice that they must be compiled after the EXTRA_NVCCFLAGS environment variable has been set to --cudart=shared.

If an InfiniBand network is available and the rCUDA user prefers to use the high performance InfiniBand Verbs API instead of the lower performance TCP/IP socket API, then the following environment variables should be considered (an example follows the list):

• RCUDA_NETWORK: This environment variable must be set to “IB” in order to make use of the InfiniBand Verbs API. If this variable is not set, or if it is set to “TCP”, then the TCP/IP sockets API will be used even if an InfiniBand network is available. For example:
“RCUDA_NETWORK=IB” will use the InfiniBand Verbs API
“RCUDA_NETWORK=TCP” will use the TCP/IP sockets API even if an InfiniBand network is available

• RCUDA_NETWORK_DEV_NAME: In case the computer where rCUDA is being used has two or more InfiniBand network adapters, then the user may instruct rCUDA which adapter to use by setting this environment variable to the name of the selected InfiniBand adapter. For example:
“RCUDA_NETWORK_DEV_NAME=mlx4_0” will use the InfiniBand network adapter named mlx4_0
“RCUDA_NETWORK_DEV_NAME=mlx5_0” will use the InfiniBand network adapter named mlx5_0
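For instance, a minimal sketch for a client that wants to run over InfiniBand Verbs using a particular Mellanox adapter could be (the adapter name mlx5_0 is just an example; use the name reported by the ibv_devices command on your node):

$ export RCUDA_NETWORK=IB
$ export RCUDA_NETWORK_DEV_NAME=mlx5_0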

It is important to remark that the RCUDA_NETWORK variable must only be set on the client side. This variable is not used on the server side because the rCUDA server automatically uses TCP/IP or InfiniBand depending on the configuration of the incoming client.

4.2 Server Side

The rCUDA server is configured as a daemon (rCUDAd) that runs in those nodes offering GPGPU acceleration services. Notice that you do not need to be root in order to use the rCUDA middleware. The rCUDA server contacts the rCUDA license server in order to check that its execution is allowed. To that end, the rCUDA server needs to know the IP address of the node executing the rCUDA license server. A line with the license key should also be included. The way to achieve this is to include two new lines in the config/rCUDA.conf file such as

RCUDA_LICENSE_KEY=3232145765453730306042551...
RCUDA_LICENSE_SERVER=192.168.0.34

The IP address 192.168.0.34 is just an example. The actual IP address of the node running the rCUDA license server in your cluster should be used instead. Notice that every rCUDA server in the cluster must be configured to know the IP address of the rCUDA license server. Moreover, notice that the rCUDA License Server and the rCUDA Server can share the same config/rCUDA.conf file. In this way, setting these two lines in that file once will satisfy both servers. Also, in case different rCUDA servers running in different cluster nodes share a common file system, a single config/rCUDA.conf will satisfy all the servers.

Set the LD_LIBRARY_PATH environment variable according to the location of the CUDA libraries (typically “/usr/local/cuda/lib64”). Notice that this is the path to the original CUDA libraries, not the rCUDA ones. Also add to the LD_LIBRARY_PATH environment variable the path to the rCUDA cuDNN library (typically “$HOME/rCUDA/lib/cudnn”). See Section 4.2.1 for further information on the use of the cuDNN library.

Notice that the “RCUDA_NETWORK” environment variable is not required on the server side because the rCUDA server automatically uses TCP/IP or InfiniBand depending on the configuration of the incoming client. In any case, if the node hosting the rCUDA server has more than one InfiniBand adapter, the environment variable RCUDA_NETWORK_DEV_NAME should be set in order to select which adapter to use in case an incoming client is using the InfiniBand network.

This daemon offers the following command-line options:

-i : Do not daemonize. Instead, run in interactive mode.

-l : Local mode using AF_UNIX sockets (TCP only).

-n <number> : Number of concurrent servers allowed. 0 stands for unlimited (default).

-p <port> : Specify the port to listen to (default: 8308).

-v : Verbose mode.

-h : Print usage information.

We suggest using the -iv options the first time you execute the rCUDA server in order to check that it properly works. Once you have verified that the rCUDA server can connect to the license server, you might stop using the interactive and verbose modes if desired.
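As a minimal sketch, assuming the rCUDA package was copied to $HOME/rCUDA, that the rCUDAd binary lives in its bin folder (as the license server does), and that the original CUDA libraries are in /usr/local/cuda/lib64, the daemon could be launched with:

$ export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$HOME/rCUDA/lib/cudnn:$LD_LIBRARY_PATH
$ cd $HOME/rCUDA/bin
$ ./rCUDAd -iv

Adjust the paths to your actual installation.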


4.2.1 cuDNN Users

If you are not going to use the NVIDIA CUDA Deep Neural Network library(cuDNN), you can ignore this section.

If, on the contrary, you plan to use this library, please notice that in order to use the cuDNN library, the LD_LIBRARY_PATH environment variable in the server node must contain the location of the NVIDIA cuDNN libraries (typically “/usr/local/cudnn/lib”), instead of the rCUDA ones (typically “$HOME/rCUDA/lib/cudnn”).

In addition, note that the cuDNN library is distributed separately from the CUDA package and, therefore, must be explicitly downloaded and installed.

4.3 Support for P2P Memory Copies between Remote GPUs

Figure 4.2 presents the possible scenarios when making peer-to-peer (P2P) memory copies with CUDA and rCUDA. As we can see, with CUDA there is only one possible scenario, depicted in Figure 4.2a, where the GPUs are located in the same cluster node and are interconnected by the PCIe link or NVLink. On the contrary, when using rCUDA there are two possible scenarios for making copies between remote GPUs: (i) the remote GPUs are located in the same remote node and are interconnected by the PCIe link or NVLink, as shown in Figure 4.2b, and (ii) the remote GPUs are located in different remote nodes in the cluster and therefore they are interconnected by the network fabric, as depicted in Figure 4.2c.

By default, rCUDA automatically supports both scenarios exposed in Figures 4.2b and 4.2c. In this manner, it is possible to perform memory copies between remote GPUs located in the same server node as well as between GPUs located in different server nodes. The use of P2P data copies is completely transparent to the rCUDA user.


[Figure 4.2 depicts three cases: (a) CUDA scenario: both GPUs are located in the same node and the copy travels over that node’s PCIe links; (b) rCUDA scenario 1: the rCUDA client in NODE 0 uses two GPUs located in the same remote node (NODE 1), so the copy travels over the PCIe links of that remote node; (c) rCUDA scenario 2: the rCUDA client in NODE 0 uses GPUs located in two different remote nodes (NODE 1 and NODE 2), so the copy travels across the network fabric.]

Figure 4.2: Possible scenarios for P2P memory copies with CUDA and rCUDA.


Chapter 5

Current limitations

The current implementation of rCUDA features the following limitations:

• Graphics interoperability is not implemented. Missing modules: OpenGL, Direct3D 9, Direct3D 10, Direct3D 11, VDPAU, Graphics.

• The Profiler Control module is not supported.

• Unified Memory Management is not supported.

• Thrust library is not supported.

• Timing with the event management functions might be inaccurate, since these timings will discard network delays. Using standard POSIX timing procedures such as “clock_gettime” is recommended.

• Please notice that this is an alpha release and therefore unexpected errors may occur. Additionally, the performance of this release has not been assessed yet. Thus, this release is not intended to be used in performance measurements.


Chapter 6

Further Information

For further information, please refer to [11, 8, 10, 12, 9, 17, 15, 18, 14, 13, 7, 3, 4, 5, 2, 1, 6, 44, 43, 41, 42, 40, 39, 38, 35, 37, 36, 34, 28, 27, 32, 26, 33, 30, 31, 29, 21, 19, 23, 22, 24, 20, 16, 25]. Also, do not hesitate to contact [email protected] for any questions or bug reports.


Chapter 7

Credits

7.1 Coordination & Supervision

Federico Silla
Email: [email protected]

7.2 Development & Testing

Cristian Peñaranda, Javier Prades and Jaime Sierra
Emails: {cripeace, japraga, jsierra}@gap.upv.es

7.3 rCUDA Team Address

Federico Silla
Department of Computer Engineering
Universitat Politècnica de València
DISCA, Edificio 1G, Camino de Vera, s/n
46022 – Valencia, Spain

Email: [email protected]


Acknowledgements

This work was supported by the PROMETEO program from Generalitat Valenciana (GVA) under Grant PROMETEO/2008/060, by the PROMETEO program phase II from Generalitat Valenciana (GVA) under Grant PROMETEOII/2013/009 and also by the PROMETEO program from Generalitat Valenciana (GVA) under Grant PROMETEO/2017/77. The authors are also grateful for the generous support provided by Mellanox Technologies.


Bibliography

[1] F. Silla, J. Prades, E. Baydal, and C. Reaño. Improving the performance of physics applications in atom-based clusters with rCUDA. Journal of Parallel and Distributed Computing, 137:160–178, 2020.

[2] J. Prades, B. Imbernón, C. Reaño, J. Peña-García, J. P. Cerón-Carrasco, F. Silla, and H. Pérez-Sánchez. Maximizing resource usage in multifold molecular dynamics with rCUDA. The International Journal of High Performance Computing Applications, 34(1):5–19, 2020.

[3] C. Reaño and F. Silla. On the support of inter-node P2P GPU memory copies in rCUDA. Journal of Parallel and Distributed Computing, 127:28–43, 2019.

[4] C. Reaño, J. Prades, and F. Silla. Analyzing the performance/power trade-off of the rCUDA middleware for future exascale systems. Journal of Parallel and Distributed Computing, 132:344–362, 2019.

[5] J. Prades and F. Silla. GPU-Job migration: The rCUDA case. IEEE Transactions on Parallel and Distributed Systems, 30(12):2718–2729, 2019.

[6] J. Prades, C. Reaño, and F. Silla. On the effect of using rCUDA to provide CUDA acceleration to Xen virtual machines. Cluster Computing, 22:185–204, 2019.

[7] B. Varghese, C. Reaño, and F. Silla. Accelerator virtualization in fog computing: Moving from the cloud to the edge. IEEE Cloud Computing, 5(6):28–37, 2018.

[8] F. Silla, J. Prades, and C. Reaño. Leveraging rCUDA for enhancing low-power deployments in the physics domain. In Proceedings of the 47th International Conference on Parallel Processing Companion, ICPP’18, 2018.

[9] C. Reaño, J. Prades, and F. Silla. Improving the efficiency of future exascale systems with rCUDA. In 2018 IEEE 4th International Workshop on High-Performance Interconnection Networks in the Exascale and Big-Data Era (HiPINEB), pages 40–47, 2018.

[10] C. Reaño, J. Prades, and F. Silla. Exploring the use of remote GPU virtualization in low-power systems for bioinformatics applications. In Proceedings of the 47th International Conference on Parallel Processing Companion, ICPP’18, 2018.


[11] J. Prades and F. Silla. Made-to-measure GPUs on virtual machines with rCUDA. In Proceedings of the 47th International Conference on Parallel Processing Companion, ICPP’18, 2018.

[12] J. Prades, C. Reaño, F. Silla, B. Imbernón, H. Pérez-Sánchez, and J. M. Cecilia. Increasing molecular dynamics simulations throughput by virtualizing remote GPUs with rCUDA. In Proceedings of the 47th International Conference on Parallel Processing Companion, ICPP’18, 2018.

[13] B. Imbernón, J. Prades, D. Giménez, J. M. Cecilia, and F. Silla. Enhancing large-scale docking simulation on heterogeneous systems: An MPI vs rCUDA study. Future Generation Computer Systems, 79:26–37, 2018.

[14] F. Silla, S. Iserte, C. Reaño, and J. Prades. On the benefits of the remote GPU virtualization mechanism: The rCUDA case. Concurrency and Computation: Practice and Experience, 29(13), 2017.

[15] C. Reaño and F. Silla. A comparative performance analysis of remote GPU virtualization over three generations of GPUs. In 2017 46th International Conference on Parallel Processing Workshops, pages 121–128, 2017.

[16] J. Prades, B. Varghese, C. Reaño, and F. Silla. Multi-tenant virtual GPUs for optimising performance of a financial risk application. J. Parallel Distrib. Comput., 108:28–44, 2017.

[17] J. Prades and F. Silla. Turning GPUs into floating devices over the cluster: The beauty of GPU migration. In 2017 46th International Conference on Parallel Processing Workshops, pages 129–136, 2017.

[18] F. Silla, J. Prades, S. Iserte, and C. Reaño. Remote GPU virtualization: Is it useful? In 2016 2nd IEEE International Workshop on High-Performance Interconnection Networks in the Exascale and Big-Data Era (HiPINEB), pages 41–48, 2016.

[19] S. Iserte, J. Prades, C. Reaño, and F. Silla. Increasing the Performance of Data Centers by Combining Remote GPU Virtualization with Slurm. In Proceedings of the 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID 2016), Cartagena, Colombia, May 2016.

[20] C. Reaño and F. Silla. Tuning remote GPU virtualization for InfiniBand networks. The Journal of Supercomputing, 72(12):4520–4545, 2016.

[21] J. Prades, C. Reaño, and F. Silla. CUDA acceleration for Xen virtual machines in InfiniBand clusters with rCUDA. In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2016, Barcelona, Spain, March 12-16, 2016, pages 35:1–35:2, 2016.

[22] F. Perez, C. Reaño, and F. Silla. Providing CUDA acceleration to KVM virtual machines in InfiniBand clusters with rCUDA. In Distributed Applications and Interoperable Systems - 16th IFIP WG 6.1 International Conference, DAIS 2016, Held as Part of the 11th International Federated Conference on Distributed Computing Techniques, DisCoTec 2016, Heraklion, Crete, Greece, June 6-9, 2016, Proceedings, pages 82–95, 2016.


[23] J. Prades, F. Campos, C. Reaño, and F. Silla. GPGPU as a service: Providing GPU-acceleration services to federated cloud systems. In G. Kecskemeti, A. Kertesz, and Z. Nemeth, editors, Developing Interoperable and Federated Cloud Architecture, chapter 10, pages 281–313. IGI Global, 2016.

[24] C. Reaño and F. Silla. Reducing the Performance Gap of Remote GPU Virtualization with InfiniBand Connect-IB. In Proceedings of the 21st IEEE Symposium on Computers and Communications (ISCC 2016), Messina, Italy, June 2016.

[25] C. Reaño and F. Silla. Extending rCUDA with support for P2P memory copies between remote GPUs. In Proceedings of the 18th IEEE International Conference on High Performance Computing and Communications (HPCC 2016), Sydney, Australia, Dec. 2016.

[26] B. Varghese, J. Prades, C. Reaño, and F. Silla. Acceleration-as-a-service: Exploiting virtualised GPUs for a financial application. In 11th IEEE International Conference on e-Science, e-Science 2015, Munich, Germany, August 31 - September 4, 2015, pages 47–56, 2015.

[27] C. Reaño, F. Silla, G. Shainer, and S. Schultz. Local and remote GPUs perform similar with EDR 100G InfiniBand. In Proceedings of the Industrial Track of the 16th International Middleware Conference, Middleware Industry 2015, Vancouver, BC, Canada, December 7-11, 2015, pages 4:1–4:7, 2015.

[28] C. Reaño, F. Silla, A. C. Gimeno, A. J. Peña, R. Mayo, E. S. Quintana-Ortí, and J. Duato. Improving the user experience of the rCUDA remote GPU virtualization framework. Concurrency and Computation: Practice and Experience, 27(14):3746–3770, 2015.

[29] C. Reaño and F. Silla. Reducing the costs of teaching CUDA in laboratories while maintaining the learning experience quality. In INTED2015 Proceedings, 9th International Technology, Education and Development Conference, pages 3651–3660. IATED, 2-4 March 2015.

[30] C. Reaño and F. Silla. A performance comparison of CUDA remote GPU virtualization frameworks. In 2015 IEEE International Conference on Cluster Computing, CLUSTER 2015, Chicago, IL, USA, September 8-11, 2015, pages 488–489, 2015.

[31] C. Reaño and F. Silla. On the deployment and characterization of CUDA teaching laboratories. In EDULEARN15 Proceedings, 7th International Conference on Education and New Learning Technologies, pages 3509–3516. IATED, 6-8 July 2015.

[32] C. Reaño and F. Silla. A live demo on remote GPU accelerated deep learning using the rCUDA middleware. In Proceedings of the Posters and Demos Session of the 16th International Middleware Conference, Middleware Posters and Demos 2015, Vancouver, BC, Canada, December 7-11, 2015, pages 3:1–3:2, 2015.


[33] C. Reaño and F. Silla. InfiniBand verbs optimizations for remote GPU virtualization. In 2015 IEEE International Conference on Cluster Computing, CLUSTER 2015, Chicago, IL, USA, September 8-11, 2015, pages 825–832, 2015.

[34] C. Reaño and F. Silla. On the design of a demo for exhibiting rCUDA. In Proceedings of the 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), Shenzhen, Guangdong, China, May 2015.

[35] C. Reaño, F. Silla, A. J. Peña, G. Shainer, S. Schultz, A. Castelló, E. S. Quintana-Ortí, and J. Duato. Boosting the Performance of Remote GPU Virtualization Using InfiniBand Connect-IB and PCIe 3.0. In Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER 2014), Madrid, Spain, Sept. 2014.

[36] A. J. Peña, C. Reaño, F. Silla, R. Mayo, E. S. Quintana-Ortí, and J. Duato. A complete and efficient CUDA-sharing solution for HPC clusters. Parallel Computing, 40(10):574–588, 2014.

[37] S. Iserte, A. Castelló, R. Mayo, E. S. Quintana-Ortí, C. Reaño, J. Prades, F. Silla, and J. Duato. SLURM Support for Remote GPU Virtualization: Implementation and Performance Study. In Proceedings of the International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2014), Paris, France, Oct. 2014.

[38] A. Castelló, J. Duato, R. Mayo, A. J. Peña, E. S. Quintana-Ortí, V. Roca, and F. Silla. On the Use of Remote GPUs and Low-Power Processors for the Acceleration of Scientific Applications. In The Fourth International Conference on Smart Grids, Green Communications and IT Energy-aware Technologies, Chamonix, France, April 2014.

[39] C. Reaño, A. J. Peña, F. Silla, R. Mayo, E. S. Quintana-Ortí, and J. Duato. Influence of InfiniBand FDR on the performance of remote GPU virtualization. In Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER 2013), Indianapolis, USA, Oct. 2013.

[40] C. Reaño, A. J. Peña, F. Silla, R. Mayo, E. S. Quintana-Ortí, and J. Duato. CU2rCU: towards the complete rCUDA remote GPU virtualization and sharing solution. In Proceedings of the 2012 International Conference on High Performance Computing (HiPC 2012), Pune, India, June 2012.

[41] J. Duato, A. J. Peña, F. Silla, R. Mayo, and E. S. Quintana-Ortí. Performance of CUDA virtualized remote GPUs in high performance clusters. In Proceedings of the 2011 International Conference on Parallel Processing (ICPP 2011), Taipei, Taiwan, Sept. 2011.

[42] J. Duato, A. J. Peña, F. Silla, J. C. Fernández, R. Mayo, and E. S. Quintana-Ortí. Enabling CUDA acceleration within virtual machines using rCUDA. In Proceedings of the 2011 International Conference on High Performance Computing (HiPC 2011), Bangalore, India, Dec. 2011.


[43] J. Duato, A. J. Peña, F. Silla, R. Mayo, and E. S. Quintana-Ortí. rCUDA: reducing the number of GPU-based accelerators in high performance clusters. In Proceedings of the 2010 International Conference on High Performance Computing and Simulation (HPCS 2010), pages 224–231, Caen, France, June 2010.

[44] J. Duato, F. D. Igual, R. Mayo, A. J. Peña, E. S. Quintana-Ortí, and F. Silla. An efficient implementation of GPU virtualization in high performance clusters. In Euro-Par 2009, Parallel Processing – Workshops, volume 6043 of Lecture Notes in Computer Science, pages 385–394. Springer-Verlag, 2010.
