Issued: February 11, 2013; Revised August 2013

IBM® Platform HPC

IBM Platform HPC V3.2: GPU Management with NVIDIA CUDA 5

Gábor Samu
Technical Product Manager
IBM Systems and Technology Group

Mehdi Bozzo-Rey
HPC Solutions Architect
IBM Systems and Technology Group

Contents

Executive Summary
Introduction
Environment Preparation
Provision nodes equipped with NVIDIA Tesla
Monitor nodes equipped with NVIDIA Tesla
Best practices
Conclusion
Further reading
Notices
  Trademarks
  Contacting IBM

Executive Summary

IBM Platform HPC Version 3.2 ("Platform HPC") is easy-to-use, yet comprehensive technical computing management software. It includes, as standard, GPU scheduling, management, and monitoring capabilities for systems equipped with NVIDIA Tesla GPUs.

Platform HPC 3.2 supports NVIDIA CUDA 4.1 and includes a CUDA 4.1 software kit that simplifies deployment of the software in a clustered environment. Newer NVIDIA Tesla GPUs based on the NVIDIA Kepler architecture, however, require CUDA 5 to operate. This document provides the steps to install and configure a Platform HPC 3.2 cluster with NVIDIA Tesla Kepler hardware.

Introduction

This document serves as a guide to enabling the Platform HPC GPU management capabilities with NVIDIA CUDA 5. The steps below assume familiarity with Platform HPC commands and concepts. The procedure relies on the following capabilities of Platform HPC to deploy NVIDIA CUDA 5:

- Cluster File Manager (CFM): used to automate patching of the system boot files to perform the installation of NVIDIA CUDA 5.

- Post-install script: used to trigger the execution of the system startup file on first boot after provisioning.

Note that the procedure provided in this document is generic and may be used to deploy other software in a cluster managed by Platform HPC.

Environment Preparation

The following steps assume that the Platform HPC V3.2 head node has been installed and that compute nodes equipped with NVIDIA Tesla GPUs are available to be added (provisioned) to the cluster.

The specifications of the example environment follow:

- IBM Platform HPC V3.2 (Red Hat Enterprise Linux 6.2 x64)
- NVIDIA® Tesla® K20c
- NVIDIA CUDA 5 (cuda_5.0.35_linux_64_rhel6.x-1.run)
- Two-node cluster
  o installer000 (cluster head node)
  o compute000 (compute node, equipped with NVIDIA Tesla K20c)

The following steps enable provisioning of compute nodes equipped with NVIDIA Tesla:

1. The cluster administrator must download NVIDIA CUDA 5 and copy it to the /shared directory on the Platform HPC head node. This directory must be NFS-mounted by all compute nodes managed by Platform HPC. Note that the execute bit must be set on the CUDA package file.

# cp ./cuda_5.0.35_linux_64_rhel6.x-1.run /shared
# chmod 755 /shared/cuda_5.0.35_linux_64_rhel6.x-1.run
# ls -la /shared/cuda*
-rwxr-xr-x 1 root root 702136770 Apr 4 20:59 /shared/cuda_5.0.35_linux_64_rhel6.x-1.run
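
Optionally, you can confirm on the head node that the directory backing /shared is exported over NFS and that the package copied intact. A minimal check follows; it assumes the standard Platform HPC export of /depot/shared, which is referenced later in this document:

# showmount -e localhost | grep depot
# md5sum /shared/cuda_5.0.35_linux_64_rhel6.x-1.run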

2. On the Platform HPC head node, create a new node group for nodes equipped with NVIDIA Tesla hardware. Give the new node group template the name compute-rhel-6.2-x86_64_Tesla. It is a copy of the built-in node group template compute-rhel-6.2-x86_64.

# kusu-ngedit -c compute-rhel-6.2-x86_64 -n compute-rhel-6.2-x86_64_Tesla
Running plugin: /opt/kusu/lib/plugins/cfmsync/getent-data.sh
....
....
New file found: /etc/cfm/compute-rhel-6.2-x86_64_Tesla/root/.ssh/authorized_keys
New file found: /etc/cfm/compute-rhel-6.2-x86_64_Tesla/root/.ssh/id_rsa
New file found: /etc/cfm/compute-rhel-6.2-x86_64_Tesla/opt/kusu/etc/logserver.addr
New file found: /etc/cfm/compute-rhel-6.2-x86_64_Tesla/opt/lsf/conf/hosts
New file found: /etc/cfm/compute-rhel-6.2-x86_64_Tesla/opt/lsf/conf/profile.lsf
New file found: /etc/cfm/compute-rhel-6.2-x86_64_Tesla/etc/group.merge
New file found: /etc/cfm/compute-rhel-6.2-x86_64_Tesla/etc/hosts.equiv
New file found: /etc/cfm/compute-rhel-6.2-x86_64_Tesla/etc/hosts
New file found: /etc/cfm/compute-rhel-6.2-x86_64_Tesla/etc/shadow.merge
New file found: /etc/cfm/compute-rhel-6.2-x86_64_Tesla/etc/.updatenics
New file found: /etc/cfm/compute-rhel-6.2-x86_64_Tesla/etc/passwd.merge
New file found: /etc/cfm/compute-rhel-6.2-x86_64_Tesla/etc/fstab.kusuappend
New file found: /etc/cfm/compute-rhel-6.2-x86_64_Tesla/etc/ssh/ssh_config
....
....
Distributing 76 KBytes to all nodes.
Sending to 192.0.2.255
Sending to 192.0.2.255
Sending to 192.0.2.255
Sending to 192.0.2.255
Sending to 192.0.2.255
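
Before customizing the new node group, it can be worth confirming that its CFM file tree was created. A plain directory listing is sufficient; no Platform HPC-specific tooling is assumed:

# ls /etc/cfm/compute-rhel-6.2-x86_64_Tesla/etc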

3. Configure the CFM framework to patch the /etc/rc.local file on a set of compute nodes.

The following example script checks for the existence of the NVIDIA CUDA tool nvidia-smi in /usr/bin on a node. If nvidia-smi is not found in /usr/bin, the script mounts the NFS share /depot/shared at /shared and runs the NVIDIA CUDA installer with the option for silent (non-interactive) installation.

Note: You must modify this script according to your Platform HPC environment and the NVIDIA CUDA 5 package filename (cuda_5.0.35_linux_64_rhel6.x-1.run).

Save the following script as rc.local.append in the /etc/cfm/compute-rhel-6.2-x86_64_Tesla/etc directory on the Platform HPC head node.

# Copyright International Business Machines Corporation, 2013
# This information contains sample application programs in source language, which
# illustrate programming techniques on various operating platforms. You may copy,
# modify, and distribute these sample programs in any form without payment to IBM,
# for the purposes of developing, using, marketing or distributing application
# programs conforming to the application programming interface for the operating
# platform for which the sample programs are written. These examples have not been
# thoroughly tested under all conditions. IBM, therefore, cannot guarantee or
# imply reliability, serviceability, or function of these programs. The sample
# programs are provided "AS IS", without warranty of any kind. IBM shall not be
# liable for any damages arising out of your use of the sample programs.
# Each copy or any portion of these sample programs or any derivative work must
# include a copyright notice as follows:
# (C) Copyright IBM Corp. 2013.
#
# The following example script portion is used to install NVIDIA CUDA 5
# on the specified compute nodes. This is done by patching the
# /etc/rc.local file and running the NVIDIA CUDA 5 installer in silent mode.
# The installation will not be performed if NVIDIA CUDA is found installed
# on the system already (test if /usr/bin/nvidia-smi exists).
# Pre-requisites:
# 1. NVIDIA CUDA 5 package exists in /shared and has permissions root/755.
# 2. /depot/shared on the IBM Platform HPC head node is mounted at /shared
#    (you will need to adjust the IP address here according to your environment).
if [ ! -f /usr/bin/nvidia-smi ]
then
    mkdir -p /shared
    mount -t nfs 192.0.2.150:/depot/shared /shared
    /shared/cuda_5.0.35_linux_64_rhel6.x-1.run -driver -toolkit -silent
fi
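
If you would like to validate the silent installation before wiring it into the boot sequence, you can run the same commands by hand on a single GPU-equipped node that is already reachable. This sketch simply repeats the script body; the IP address is the example head node address and must match your environment:

# mkdir -p /shared
# mount -t nfs 192.0.2.150:/depot/shared /shared
# /shared/cuda_5.0.35_linux_64_rhel6.x-1.run -driver -toolkit -silent
# /usr/bin/nvidia-smi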

4. Create a post-installation script which will be configured to execute on a set of compute nodes. The post-installation script forces the execution of the updated /etc/rc.local script during the initial boot of a node after provisioning.

Save the following script as /root/run_rc_local.sh on the Platform HPC head node. Note that this script will be specified as a post-installation script in subsequent steps.

#!/bin/sh -x
# Copyright International Business Machines Corporation, 2013
# This information contains sample application programs in source language, which
# illustrate programming techniques on various operating platforms. You may copy,
# modify, and distribute these sample programs in any form without payment to IBM,
# for the purposes of developing, using, marketing or distributing application
# programs conforming to the application programming interface for the operating
# platform for which the sample programs are written. These examples have not been
# thoroughly tested under all conditions. IBM, therefore, cannot guarantee or
# imply reliability, serviceability, or function of these programs. The sample
# programs are provided "AS IS", without warranty of any kind. IBM shall not be
# liable for any damages arising out of your use of the sample programs.
# Each copy or any portion of these sample programs or any derivative work must
# include a copyright notice as follows:
# (C) Copyright IBM Corp. 2013.
#
# This script forces the execution of /etc/rc.local, after it has been
# updated via CFM, on the initial boot of a node after provisioning.
/etc/rc.local > /tmp/runrc.log 2>&1
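
As a precaution, set the execute bit on the script before registering it as a custom script (an assumption made here for safety; Platform HPC may set this itself when it copies custom scripts to the nodes):

# chmod 755 /root/run_rc_local.sh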

5. On the Platform HPC head node, start kusu-ngedit and edit the node group named installer-rhel-6.2-x86_64.

The following updates are required to enable monitoring of GPU devices in the Platform HPC Web Console:

- On the Components screen, enable component-platform-lsf-gpu under platform-lsf-gpu.
- Select Yes to synchronize changes.

6. On the Platform HPC head node, start kusu-ngedit and edit the node group named compute-rhel-6.2-x86_64_Tesla.

The following updates are required to enable the GPU monitoring agents on the nodes, in addition to the required OS software packages and kernel parameters for NVIDIA GPUs. A post-provisioning check for the kernel parameters is sketched after this list.

- On the Boot Time Parameters screen, add the following kernel parameters at the end of the line: rdblacklist=nouveau nouveau.modeset=0
- On the Components screen, enable component-platform-lsf-gpu under platform-lsf-gpu.
- On the Optional Packages screen, enable the following packages: kernel-devel, gcc, gcc-c++
- On the Custom Scripts screen, add the script /root/run_rc_local.sh
- Select Yes to synchronize changes.
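
The check referenced above: once a node from this group has been provisioned, confirm that the kernel parameters took effect and that the nouveau driver stayed out of the way. These are standard Linux commands, nothing Platform HPC-specific:

# ssh compute000 cat /proc/cmdline
# ssh compute000 'lsmod | grep nouveau || echo nouveau not loaded'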

7. Update the configuration of the Platform HPC workload manager. This is required for the NVIDIA CUDA-specific metrics to be taken into account.

# kusu-addhost -u
Running plugin: /opt/kusu/lib/plugins/cfmsync/getent-data.sh
Updating installer(s)
Sending to 192.0.2.255
Sending to 192.0.2.255
Sending to 192.0.2.255
Sending to 192.0.2.255
Sending to 192.0.2.255
Setting up dhcpd service...
Setting up dhcpd service successfully...
Setting up NFS export service...
Running plugin: /opt/kusu/lib/plugins/cfmsync/getent-data.sh
Distributing 60 KBytes to all nodes.
Updating installer(s)
Sending to 192.0.2.255
Sending to 192.0.2.255
Sending to 192.0.2.255
Sending to 192.0.2.255
Sending to 192.0.2.255

Provision nodes equipped with NVIDIA Tesla

After completing the environment prerequisites, complete the following steps to provision the compute nodes equipped with NVIDIA Tesla.

You can provision nodes using the Platform HPC Web Console, or with the kusu-addhost command. The following steps provision the node using the kusu-addhost command with the newly created node group template compute-rhel-6.2-x86_64_Tesla.

Note: Once nodes are discovered by kusu-addhost, the administrator must exit from the listening mode by pressing Control-C. This completes the node discovery process.

# kusu-addhost -i eth0 -n compute-rhel-6.2-x86_64_Tesla -b
Scanning syslog for PXE requests...
Discovered Node: compute000
Mac Address: 00:1e:67:31:45:58
^C
Command aborted by user...
Setting up dhcpd service...
Setting up dhcpd service successfully...
Setting up NFS export service...
Running plugin: /opt/kusu/lib/plugins/cfmsync/getent-data.sh
Distributing 84 KBytes to all nodes.
Updating installer(s)
Sending to 192.0.2.255
Sending to 192.0.2.255
Sending to 192.0.2.255
Sending to 192.0.2.255
Sending to 192.0.2.255
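
Once compute000 has finished provisioning and rebooted, a few spot checks confirm that the boot-time CUDA installation ran. The log path comes from the post-installation script defined earlier; lsload is the standard LSF load-index query, and the exact GPU metric names it reports depend on the elim configuration shipped with component-platform-lsf-gpu:

# ssh compute000 tail /tmp/runrc.log
# ssh compute000 /usr/bin/nvidia-smi
# lsload -l compute000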

Monitor nodes equipped with NVIDIA Tesla

After provisioning all of your GPU-equipped nodes, you can monitor GPU-related metrics through the Platform HPC Web Console.

Navigate to the following URL with any supported Web browser, and log in as a user with administrative privileges:

http://<IBM_Platform_HPC_head_node>

The Platform HPC Web Console provides the following views of GPU metrics:

- Dashboard view
- Host List view (GPU tab)

Dashboard view

In the Dashboard view, hover the mouse pointer over a node equipped with NVIDIA Tesla. The popup displays the GPU temperature and any ECC errors.

Host List view (GPU tab)

In the Host List view, select a node equipped with NVIDIA Tesla and select the GPUs tab displayed in the bottom portion of the interface. This displays the temperature (in degrees Celsius) and any ECC errors for each GPU detected.
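
The same temperature and ECC data shown in the Web Console can also be read at the command line on any GPU node, using the tool installed with CUDA 5 (output layout varies by driver version):

# ssh compute000 nvidia-smi -q -d TEMPERATURE,ECC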

Best practices

- IBM Platform HPC V3.2 provides a generic framework for synchronizing files and packages, and for executing custom scripts after provisioning. These capabilities may be used to install software that is not packaged as an IBM Platform HPC Kit; in this example, the installation of NVIDIA CUDA 5 is automated.

- IBM Platform HPC V3.2 GPU monitoring functions as expected with NVIDIA CUDA 5.

Conclusion

IBM Platform HPC V3.2 is easy-to-use, yet comprehensive technical computing management software. It includes, as standard, GPU scheduling, management, and monitoring capabilities for systems equipped with NVIDIA Tesla GPUs.

This document has described the steps to install and configure an IBM Platform HPC V3.2 cluster with NVIDIA Tesla Kepler hardware. It has guided you through enabling the IBM Platform HPC V3.2 GPU management capabilities with NVIDIA CUDA 5. You have learned how to prepare your Platform HPC environment, and how to provision and monitor nodes equipped with NVIDIA Tesla. You have also learned some best practices for automating the installation, patching, and synchronization of files and packages using Platform HPC.

The procedures provided in this document are generic and may be used to deploy other software in clusters managed by IBM Platform HPC.

Further reading

IBM Platform HPC Version 3.2
http://www-03.ibm.com/systems/technicalcomputing/platformcomputing/index.html

NVIDIA Developer Zone
https://developer.nvidia.com/category/zone/cuda-zone

Notices

This information was developed for products and services offered in the U.S.A.

IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.

IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not grant you any license to these patents. You can send license inquiries, in writing, to:

IBM Director of Licensing
IBM Corporation
North Castle Drive
Armonk, NY 10504-1785
U.S.A.

The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions; therefore, this statement may not apply to you.

Without limiting the above disclaimers, IBM provides no representations or warranties regarding the accuracy, reliability or serviceability of any information or recommendations provided in this publication, or with respect to any results that may be obtained by the use of the information or observance of any recommendations provided herein. The information contained in this document has not been submitted to any formal IBM test and is distributed AS IS. The use of this information or the implementation of any recommendations or techniques herein is a customer responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment. While each item may have been reviewed by IBM for accuracy in a specific situation, there is no guarantee that the same or similar results will be obtained elsewhere. Anyone attempting to adapt these techniques to their own environment does so at their own risk.

This document and the information contained herein may be used solely in connection with the IBM products discussed in this document.

This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.

Any references in this information to non-IBM websites are provided for convenience only and do not in any manner serve as an endorsement of those websites. The materials at those websites are not part of the materials for this IBM product and use of those websites is at your own risk.

IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.

Any performance data contained herein was determined in a controlled environment. Therefore, the results obtained in other operating environments may vary significantly. Some measurements may have been made on development-level systems and there is no guarantee that these measurements will be the same on generally available systems. Furthermore, some measurements may have been estimated through extrapolation. Actual results may vary. Users of this document should verify the applicable data for their specific environment.

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

All statements regarding IBM's future direction or intent are subject to change or withdrawal without notice, and represent goals and objectives only.

This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.

COPYRIGHT LICENSE: © Copyright IBM Corporation 2013. All Rights Reserved.

This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs.

Trademarks

IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml

Windows is a trademark of Microsoft Corporation in the United States, other countries, or both.

UNIX is a registered trademark of The Open Group in the United States and other countries.

Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.

Other company, product, or service names may be trademarks or service marks of others.

Contacting IBM

To contact IBM in your country or region, check the IBM Directory of Worldwide Contacts at http://www.ibm.com/planetwide

To learn more about IBM Platform HPC, go to http://www-03.ibm.com/systems/technicalcomputing/platformcomputing/products/hpc/