

Spectrum Scale-FPO on SoftLayer

Reference Architecture

Sam Bigger

December 1, 2016
Version 1.7


Contents

GENERAL ASSUMPTIONS
PERFORM NECESSARY MODIFICATIONS TO OS (RHEL/CENTOS/SL)
    ADD GPFS PATH TO ROOT’S BASH PROFILE
    CHECK FOR UNSUPPORTED OS BOOT OPTIONS
    MAKE SURE DISKS USED BY GPFS ARE NOT USED FOR OTHER PURPOSES
    CONFIGURE PASSWORDLESS SSH
    DISABLE SELINUX
STORAGE, OS AND NETWORK TUNING
    STORAGE CONFIGURATION
    VM SETTINGS
INSTALL GPFS SOFTWARE
    INSTALL GPFS RPM PACKAGES
    BUILD GPFS KERNEL PORTABILITY LAYER FOR THE CURRENT OS, KERNEL AND GPFS PATCH LEVEL
    CONFIGURE GPFS CLUSTER
CREATE NSD LUNS
CREATE GPFS-FPO FILE SYSTEM
CREATE A FILE PLACEMENT POLICY FILE
Tuning GPFS Configuration for FPO
Important GPFS-FPO related parameters
COMMON GPFS OPERATIONS
COMMON SERVICE OPERATIONS
EXAMPLES OF COMMON COMMAND OUTPUTS
INFINIBAND
    Scope
    Prerequisites
    High level steps
    1. Setup xCAT
    2. Order GPFS-FPO servers with IB card on SoftLayer
    3. Add IB servers into xCAT
    4. Configure and Publish LSF Cluster Definitions
    5. Provision and verify LSF clusters with IB
    6. Manage existing LSF clusters with IB
    7. Release IB Servers


GENERAL ASSUMPTIONS

We are confined to using the basic SoftLayer GPFS-FPO server; however, we can make special requests to modify a base server to get what we need, at an additional price. The kinds of modifications we may want include substituting SSD drives for some of the SAS drives, adding InfiniBand, and possibly, at some point, using SATA drives.

This base SoftLayer GPFS-FPO server has the following characteristics:

Server Type:  2U
CPU:          Xeon 5670, dual socket / 6 cores, 2.93 GHz (12 total cores)
RAM:          48 GB
Local Disk:   12x 600 GB SAS 15K RPM HDD
RAID:         Yes
Ethernet:     1 GigE
OS:           RHEL 6.X
Software:     Elastic Storage FPO

The default basic system configuration is targeted at replacing HDFS for a Map Reduce workload. This system consists of twelve physical Linux servers in 3 racks, 4 per rack, with internal DAS storage configured into a 12 node GPFS-FPO cluster. Three of the twelve nodes are metadata nodes. Nine of the twelve systems have twelve 15K 600GB SAS drives each. However, the three metadata nodes have two of their 15K 600 GB SAS drives replaced with 800GB SSD drives reserved just for metadata. These two SSD drives are the only difference between the three metadata nodes and the other nine data only nodes. The ideal design is to have one metadata node per rack, but how the systems get distributed across racks inside SoftLayer is currently out of our control.

Because this is a specific GPFS-FPO configuration, all twelve nodes are both client nodes and NSD server nodes. Three of the twelve nodes are manager/quorum/NSD server nodes, and two of the other nodes are designated as additional quorum nodes. The twelve 15K 600GB SAS drives in each data-only node, and ten in each metadata node, are used as data-only disk space for the GPFS-FPO filesystem.

A single GPFS filesystem spans all twelve physical nodes. There is no RAID in effect for this configuration, which is the main reason the smaller, more expensive SAS drives are used: they offer better reliability than SATA drives. For those who wonder why we don’t use NL-SAS drives, an NL-SAS drive is just a SATA drive with a SAS interface; while NL-SAS certainly has its advantages, it is not much more reliable than a normal SATA drive. Since we are not using RAID, we need the more reliable drives to reduce failures and to speed recovery from failure scenarios.


Internal storage dedicated to GPFS comprises one NSD LUN per physical volume per physical host, and GPFS should use a 1MB filesystem block size. Each data-only NSD client-server therefore has twelve 600GB NSD LUNs, or roughly 7.2TB of raw disk space per NSD server. The metadata nodes each have ten 600GB LUNs for data only and two 800GB LUNs on the SSD drives.

The supported operating system on the GPFS-FPO servers will be RHEL 6.X, the latest version supported by SoftLayer.

InfiniBand is supported for additional bandwidth, but it severely limits the data centers in which this option is currently available. Please see the InfiniBand section at the end of this document for details on setting up InfiniBand.

PERFORM NECESSARY MODIFICATIONS TO OS (RHEL/CENTOS/SL)

ADD GPFS PATH TO ROOT’S BASH PROFILE

The path to the GPFS binaries should be added to the PATH of any user that will operate the cluster, such as root:

# vim /root/.bash_profile

Add /usr/lpp/mmfs/bin so that PATH looks similar to this:

PATH=$PATH:$HOME/bin:/usr/lpp/mmfs/bin

Root users need to re-login to have this take effect. Until then full paths can be used.

CHECK FOR UNSUPPORTED OS BOOT OPTIONS

Check for any unsupported OS boot options.

# cat /boot/grub/menu.lst
kernel /vmlinuz-2.6.32-431.11.2.el6.x86_64 ro root=UUID=6a6db1f8-c1c8-4b94-aee2-4723cc35bb08 nomodeset rd_NO_LUKS KEYBOARDTYPE=pc KEYTABLE=us LANG=en_US.UTF-8 rd_NO_MD SYSFONT=latarcyrheb-sun16 crashkernel=auto rd_NO_LVM rd_NO_DM rhgb pcie_aspm=off biosdevname=0

The following are not supported, but are also not used here, so no action needs to be taken: RHEL hugemem, RHEL largesmp, RHEL uniprocessor (UP). (Source: the GPFS FAQs referenced in the INSTALL GPFS RPM PACKAGES section.)

MAKE SURE DISKS USED BY GPFS ARE NOT USED FOR OTHER PURPOSES

If the disks to be used by GPFS are mounted or formatted with some other file system, free them up. Double check to make sure that you’re destroying the correct disks!

# df
Filesystem      1K-blocks      Used    Available Use% Mounted on
/dev/sda3       286661072   1779964    270319572   1% /
tmpfs            16421272         0     16421272   0% /dev/shm
/dev/sda1          253871     58783       181981  25% /boot
/dev/sdb1    109364387840     36064 109364351776   1% /disk1


For example, in this case disk sdb was mounted and was also in /etc/fstab, so it had to be unmounted and removed. Comment it out or remove it from /etc/fstab:

# LABEL=/disk1 /disk1 xfs defaults 1 2

Also unmount the filesystem and, if desired, delete the unneeded mount point.

# umount /disk1
# rm -rf /disk1

CONFIGURE PASSWORDLESS SSH

We want passwordless SSH to work from any GPFS server node to any other so that cluster-wide commands (such as mount or service start commands) work across the cluster. Note that if ordinary users have access to the root account on GPFS clients or servers, this represents a security risk.

# ssh-keygen -t dsa
# cat /root/.ssh/id_dsa.pub >> /root/.ssh/authorized_keys2
# ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
RSA key fingerprint is 15:b1:8a:29:f5:97:fd:68:2d:8c:0b:74:97:68:77:01.
Are you sure you want to continue connecting (yes/no)? yes

Concatenate id_dsa.pub to authorized_keys2 of other servers and test passwordless SSH in an any-to-any fashion using all variations of hostnames (short name and fully-qualified domain name), e.g.:

# ssh n1_0_1.hpccloud.local
# ssh n1_0_1
# ssh n1_0_2.hpccloud.local
# ssh n1_0_2

Make sure you are not prompted for a password from any of the server nodes.
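One way to push the key out to every node is a simple loop like the sketch below (an illustration only: it assumes the short hostnames used in this document resolve from the node you run it on, and that root password logins are still permitted at this stage; run it from each node in turn, or gather all public keys into a single authorized_keys2 file and distribute that).

# distribute this node's public key and verify passwordless login
for node in n1_0_1 n1_0_2 n1_1_1 n1_1_2 n2_0_1 n2_0_2 n2_1_1 n2_1_2 n3_0_1 n3_0_2 n3_1_1 n3_1_2; do
    # append the local public key to the remote authorized_keys2 file
    cat /root/.ssh/id_dsa.pub | ssh root@$node 'cat >> /root/.ssh/authorized_keys2'
    # should print the remote hostname without a password prompt once the key is in place
    ssh root@$node hostname
done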

DISABLE SELINUX

We want SELinux disabled. It was disabled by default, so no change was required. Verify the setting:

# grep SELINUX /etc/selinux/config

Look for this line – it should contain “disabled”:

SELINUX=disabled

Do this on all the GPFS nodes.


STORAGE, OS AND NETWORK TUNING

STORAGE CONFIGURATION

Based on business requirements and the technical characteristics of the physical servers, we want to configure the per-server data volumes (twelve on the data-only nodes, ten plus two SSDs on the metadata nodes) without any RAID. We also want to use the optimal block size for Hadoop MapReduce operations, which is a 1MB filesystem block size.

VM SETTINGS

The OS should always keep about 5% of memory free. On these 48 GB servers that is roughly 2.4 GB (2457600 KB).

Attention: this value is in kilobytes, not bytes, and it should be applied only on physical servers.

Add the following to /etc/sysctl.conf:

vm.min_free_kbytes=2457600

Have it take effect without an OS restart:

# sysctl -p

INSTALL GPFS SOFTWARE

INSTALL GPFS RPM PACKAGES

Verify that the OS and kernel are supported against the GPFS FAQs: http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=%2Fcom.ibm.cluster.gpfs.doc%2Fgpfs_faqs%2Fgpfsclustersfaq.html

Install some mandatory and non-mandatory (but useful) RPM packages:

# yum -y install kernel-headers.x86_64 kernel-devel gcc gcc-c++ make ksh dstat

Verify and install the four "base" GPFS packages (base 4.1.0 release):

# ll *4.1.0-0*
-rw-r--r-- 1 root root 11182009 May 21 2012 gpfs.base-4.1.0-0.x86_64.rpm
-rw-r--r-- 1 root root   221562 May 21 2012 gpfs.docs-4.1.0-0.noarch.rpm
-rw-r--r-- 1 root root   500281 May 21 2012 gpfs.gpl-4.1.0-0.noarch.rpm
-rw-r--r-- 1 root root    96641 May 21 2012 gpfs.msg.en_US-4.1.0-0.noarch.rpm
# rpm -i *4.1.0-0*
# rpm -qa | grep gpfs
gpfs.docs-4.1.0-0.noarch
gpfs.base-4.1.0-0.x86_64
gpfs.msg.en_US-4.1.0-0.noarch
gpfs.gpl-4.1.0-0.noarch

Upgrade to GPFS 4.1.0-6 (the latest patch level at the moment):

# ll *4.1.0-6*


-rw-r--r-- 1 root root 13148242 Feb 27 19:24 gpfs.base-4.1.0-6.x86_64.update.rpm
-rw-r--r-- 1 root root   254985 Feb 27 19:24 gpfs.docs-4.1.0-6.noarch.rpm
-rw-r--r-- 1 root root   560882 Feb 27 19:24 gpfs.gpl-4.1.0-6.noarch.rpm
-rw-r--r-- 1 root root   107072 Feb 27 19:26 gpfs.msg.en_US-4.1.0-6.noarch.rpm
# rpm -Uvh *4.1.0-6*
Preparing...        ########################################### [100%]
   1:gpfs.base      ########################################### [ 25%]
   2:gpfs.gpl       ########################################### [ 50%]
   3:gpfs.msg.en_US ########################################### [ 75%]
   4:gpfs.docs      ########################################### [100%]

Do this on all server nodes. Note that clients use the same software, so exactly the same procedure applies to GPFS clients (except that passwordless SSH may need to be restricted).

BUILD GPFS KERNEL PORTABILITY LAYER FOR THE CURRENT OS, KERNEL AND GPFS PATCH LEVEL

As the title says, this has to be done every time the distribution, Linux kernel or GPFS is updated, so it is recommended to pay special attention to this in operations. For example, the yum configuration can be modified to exclude the kernel package from automatic updates.
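A minimal sketch of pinning the kernel via yum (whether to pin kernel updates is an operational decision for your environment):

# /etc/yum.conf (excerpt) - prevents yum from updating the kernel automatically
[main]
exclude=kernel*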

# cd /usr/lpp/mmfs/src
# make LINUX_DISTRIBUTION=REDHAT_AS_LINUX Autoconfig
# make World
# make InstallImages

Should you want to avoid installing development tools and building on every other node with the same distro, GPFS patch level, kernel level and CPU architecture, build a kernel portability layer RPM and install that RPM on the other GPFS nodes:

# yum -y install rpm-build; make rpm

Watch output for the exact location and file name (/root/rpmbuild/RPMS/x86_64):

/root/rpmbuild/RPMS/x86_64/gpfs.gplbin-2.6.32-431.11.2.el6.x86_64-4.1.0-17.x86_64.rpm

If needed, some or all development tools can now be uninstalled as we won't need them until the next kernel or GPFS update.

# yum -y remove gcc make kernel-devel

As long as the kernel version or GPFS patch level does not change, GPFS portability layer can be installed by simply installing the gpfs.gplbin RPM built above (e.g. yum -y install gpfs.gplbin-2.6.32-431.11.2.el6.x86_64-4.1.0-17.x86_64.rpm).

This file can be copied to other hosts that share the same architecture and kernel, e.g. shuynh-gpfs-dmss2, and even to GPFS clients if they run the same distribution, kernel level and GPFS patch level, thereby eliminating the need to install and uninstall development tools on those nodes. Note that this RPM cannot be installed on a VM unless it is running the same distribution, kernel and, of course, GPFS version.
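A sketch of pushing the portability layer RPM to the remaining nodes and installing it there (node names follow the convention used in this document; adjust the RPM file name to whatever "make rpm" actually produced):

for node in n1_0_2 n1_1_1 n1_1_2 n2_0_1 n2_0_2 n2_1_1 n2_1_2 n3_0_1 n3_0_2 n3_1_1 n3_1_2; do
    # copy the RPM built on the first node and install it remotely
    scp /root/rpmbuild/RPMS/x86_64/gpfs.gplbin-2.6.32-431.11.2.el6.x86_64-4.1.0-17.x86_64.rpm root@$node:/tmp/
    ssh root@$node 'rpm -ivh /tmp/gpfs.gplbin-2.6.32-431.11.2.el6.x86_64-4.1.0-17.x86_64.rpm'
done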

CONFIGURE GPFS CLUSTER

We can configure one server node first and add more later on, or do it all at once.

Assuming all twelve hosts are ready and prepared, build a stanza file containing their hostnames and roles.

# cd /root/packages/

You will need to create the file gpfs-fpo-nodefile; be careful about cutting and pasting from a Word document to a Linux file. The GPFS-FPO node naming convention is used here. It consists of three numbers in the name: the first number is the rack, the second number indicates whether the node is in the top or bottom half of the rack, and the third number is the node's position within that half of the rack. So, for example, the name n1_0_1 means the node is in rack 1, in the bottom half of the rack, and is node #1 in that half.

We realize that we cannot control how many racks are used or where in a rack the nodes are placed within SoftLayer, so for now the names are somewhat idealized; however, this layout becomes important for larger configurations later.

# cat gpfs-fpo-nodefile

n1_0_1::
n1_0_2:quorum-manager:
n1_1_1::
n1_1_2:quorum:
n2_0_1::
n2_0_2::
n2_1_1::
n2_1_2:quorum-manager:
n3_0_1::
n3_0_2:quorum:
n3_1_1:quorum-manager:
n3_1_2::


Approach #1: All twelve client-servers are ready and all twelve will be configured during the creation of the GPFS cluster.

Create the GPFS cluster with n1_1_2 as the primary configuration server and n2_0_2 as the secondary, and give the cluster the name gpfs-fpo-cluster:

# /usr/lpp/mmfs/bin/mmcrcluster -N gpfs-fpo-nodefile -p n1_1_2 -s n2_0_2 -C gpfs-fpo-cluster -A -r /usr/bin/ssh -R /usr/bin/scp

Install GPFS license using mmchlicense.

Accept the GPFS EULA. Since the twelve servers act as FPO NSD servers as well as clients, we install the GPFS FPO license on the other nine servers. The quorum-manager nodes must have server licenses.

# /usr/lpp/mmfs/bin/mmchlicense server --accept -N n1_0_2,n2_1_2,n3_1_1
# /usr/lpp/mmfs/bin/mmchlicense fpo --accept -N n1_0_1,n1_1_1,n1_1_2,n2_0_1,n2_0_2,n2_1_1,n3_0_1,n3_0_2,n3_1_2

Customize the GPFS parameters

Set the maximum filesystem block size and configure GPFS service to start automatically:

# mmchconfig maxblocksize=1024K
# mmchconfig autoload=yes

Start GPFS service.

# /usr/lpp/mmfs/bin/mmstartup -a

Verify the GPFS state is “active” on all nodes using “mmgetstate”:

# /usr/lpp/mmfs/bin/mmgetstate -aLs

Approach #2: Create the GPFS cluster with a single node and add the other nodes later.

(Not covered in this document.)

Now we can start GPFS service.

# /usr/lpp/mmfs/bin/mmstartup -a
Wed Apr 16 03:40:41 CDT 2014: mmstartup: Starting GPFS ...

Verify GPFS state of this node:

# /usr/lpp/mmfs/bin/mmgetstate

 Node number  Node name  GPFS state
------------------------------------------
      1       n1_0_1     active

# /usr/lpp/mmfs/bin/mmlscluster


For now, shut down the service.

# /usr/lpp/mmfs/bin/mmshutdown -a

CREATE NSD LUNS

You probably noticed that the servers aren’t using any of their data storage yet. For storage we have to perform the following actions:

a) create "network storage" disks (place them under GPFS control) using "mmcrnsd" command

b) create Filesystem(s) using "mmcrfs" command

c) mount the filesystem on servers (mmmount <filesystemName>)

Check for available disks (except for those assigned to the OS):

[root@n1_0_1 packages]# fdisk -l
Disk /dev/sdb: 111991.3 GB, 111991272243200 bytes
255 heads, 63 sectors/track, 13615496 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1      267350  2147483647+  ee  GPT

In this case /dev/sdb1 is what we’re supposed to use. Ideally we would have multiple disks at our disposal as that would provide better performance, but in any case having this information we can create a disk "description file" (also called "stanza") which we'll pass to the “create NSD” command.

For example, for this cluster the stanza file looks like this:

In this configuration, nodes n1_0_2, n2_1_2 and n3_0_2 are the three metadata nodes with the two extra SSD drives. The config file assumes that on the metadata nodes the ten SAS data drives are /dev/sdb through /dev/sdk and the SSD drives are /dev/sdm and /dev/sdn, while on the data-only nodes the twelve SAS data drives are /dev/sdb through /dev/sdm.

# cat /root/packages/gpfs-fpo-poolfile    (the file continues over the following pages)
%pool:
 pool=system
 layoutMap=cluster
 blocksize=256K            #using smaller blocksize for metadata

%pool:
 pool=fpodata
 layoutMap=cluster
 blocksize=1024K
 allowWriteAffinity=yes    #this option enables the FPO feature
 writeAffinityDepth=1      #place 1st copy on disks local to the node writing data
 blockGroupFactor=128      #yields a chunk size of 128MB (128 x 1MB blocks)


#Disks in system pool are defined for metadata use only
%nsd: nsd=n1_0_2_ssd_1 device=/dev/sdm servers=n1_0_2 usage=metadataOnly failureGroup=102 pool=system
%nsd: nsd=n1_0_2_ssd_2 device=/dev/sdn servers=n1_0_2 usage=metadataOnly failureGroup=102 pool=system

%nsd: nsd=n2_1_2_ssd_1 device=/dev/sdm servers=n2_1_2 usage=metadataOnly failureGroup=212 pool=system
%nsd: nsd=n2_1_2_ssd_2 device=/dev/sdn servers=n2_1_2 usage=metadataOnly failureGroup=212 pool=system

%nsd: nsd=n3_0_2_ssd_1 device=/dev/sdm servers=n3_0_2 usage=metadataOnly failureGroup=302 pool=system
%nsd: nsd=n3_0_2_ssd_2 device=/dev/sdn servers=n3_0_2 usage=metadataOnly failureGroup=302 pool=system

#Disks in rack1 fpodata pool (rack1=n1_x_x)
%nsd: nsd=n1_0_1_disk1 device=/dev/sdb servers=n1_0_1 usage=dataOnly failureGroup=1,0,1 pool=fpodata
%nsd: nsd=n1_0_1_disk2 device=/dev/sdc servers=n1_0_1 usage=dataOnly failureGroup=1,0,1 pool=fpodata
%nsd: nsd=n1_0_1_disk3 device=/dev/sdd servers=n1_0_1 usage=dataOnly failureGroup=1,0,1 pool=fpodata
%nsd: nsd=n1_0_1_disk4 device=/dev/sde servers=n1_0_1 usage=dataOnly failureGroup=1,0,1 pool=fpodata
%nsd: nsd=n1_0_1_disk5 device=/dev/sdf servers=n1_0_1 usage=dataOnly failureGroup=1,0,1 pool=fpodata
%nsd: nsd=n1_0_1_disk6 device=/dev/sdg servers=n1_0_1 usage=dataOnly failureGroup=1,0,1 pool=fpodata
%nsd: nsd=n1_0_1_disk7 device=/dev/sdh servers=n1_0_1 usage=dataOnly failureGroup=1,0,1 pool=fpodata
%nsd: nsd=n1_0_1_disk8 device=/dev/sdi servers=n1_0_1 usage=dataOnly failureGroup=1,0,1 pool=fpodata
%nsd: nsd=n1_0_1_disk9 device=/dev/sdj servers=n1_0_1 usage=dataOnly failureGroup=1,0,1 pool=fpodata
%nsd: nsd=n1_0_1_disk10 device=/dev/sdk servers=n1_0_1 usage=dataOnly failureGroup=1,0,1 pool=fpodata
%nsd: nsd=n1_0_1_disk11 device=/dev/sdl servers=n1_0_1 usage=dataOnly failureGroup=1,0,1 pool=fpodata
%nsd: nsd=n1_0_1_disk12 device=/dev/sdm servers=n1_0_1 usage=dataOnly failureGroup=1,0,1 pool=fpodata

# Remember this is one of the meta nodes, so it only has 10 dataOnly disks
%nsd: nsd=n1_0_2_disk1 device=/dev/sdb servers=n1_0_2 usage=dataOnly failureGroup=1,0,2 pool=fpodata
%nsd: nsd=n1_0_2_disk2 device=/dev/sdc servers=n1_0_2 usage=dataOnly failureGroup=1,0,2 pool=fpodata
%nsd: nsd=n1_0_2_disk3 device=/dev/sdd servers=n1_0_2 usage=dataOnly failureGroup=1,0,2 pool=fpodata
%nsd: nsd=n1_0_2_disk4 device=/dev/sde servers=n1_0_2 usage=dataOnly failureGroup=1,0,2 pool=fpodata
%nsd: nsd=n1_0_2_disk5 device=/dev/sdf servers=n1_0_2 usage=dataOnly failureGroup=1,0,2 pool=fpodata
%nsd: nsd=n1_0_2_disk6 device=/dev/sdg servers=n1_0_2 usage=dataOnly failureGroup=1,0,2 pool=fpodata
%nsd: nsd=n1_0_2_disk7 device=/dev/sdh servers=n1_0_2 usage=dataOnly failureGroup=1,0,2 pool=fpodata
%nsd: nsd=n1_0_2_disk8 device=/dev/sdi servers=n1_0_2 usage=dataOnly failureGroup=1,0,2 pool=fpodata
%nsd: nsd=n1_0_2_disk9 device=/dev/sdj servers=n1_0_2 usage=dataOnly failureGroup=1,0,2 pool=fpodata
%nsd: nsd=n1_0_2_disk10 device=/dev/sdk servers=n1_0_2 usage=dataOnly failureGroup=1,0,2 pool=fpodata

%nsd: nsd=n1_1_1_disk1 device=/dev/sdb servers=n1_1_1 usage=dataOnly failureGroup=1,1,1 pool=fpodata
%nsd: nsd=n1_1_1_disk2 device=/dev/sdc servers=n1_1_1 usage=dataOnly failureGroup=1,1,1 pool=fpodata
%nsd: nsd=n1_1_1_disk3 device=/dev/sdd servers=n1_1_1 usage=dataOnly failureGroup=1,1,1 pool=fpodata
%nsd: nsd=n1_1_1_disk4 device=/dev/sde servers=n1_1_1 usage=dataOnly failureGroup=1,1,1 pool=fpodata
%nsd: nsd=n1_1_1_disk5 device=/dev/sdf servers=n1_1_1 usage=dataOnly failureGroup=1,1,1 pool=fpodata
%nsd: nsd=n1_1_1_disk6 device=/dev/sdg servers=n1_1_1 usage=dataOnly failureGroup=1,1,1 pool=fpodata
%nsd: nsd=n1_1_1_disk7 device=/dev/sdh servers=n1_1_1 usage=dataOnly failureGroup=1,1,1 pool=fpodata
%nsd: nsd=n1_1_1_disk8 device=/dev/sdi servers=n1_1_1 usage=dataOnly failureGroup=1,1,1 pool=fpodata
%nsd: nsd=n1_1_1_disk9 device=/dev/sdj servers=n1_1_1 usage=dataOnly failureGroup=1,1,1 pool=fpodata
%nsd: nsd=n1_1_1_disk10 device=/dev/sdk servers=n1_1_1 usage=dataOnly failureGroup=1,1,1 pool=fpodata
%nsd: nsd=n1_1_1_disk11 device=/dev/sdl servers=n1_1_1 usage=dataOnly failureGroup=1,1,1 pool=fpodata
%nsd: nsd=n1_1_1_disk12 device=/dev/sdm servers=n1_1_1 usage=dataOnly failureGroup=1,1,1 pool=fpodata

%nsd: nsd=n1_1_2_disk1 device=/dev/sdb servers=n1_1_2 usage=dataOnly failureGroup=1,1,2 pool=fpodata
%nsd: nsd=n1_1_2_disk2 device=/dev/sdc servers=n1_1_2 usage=dataOnly failureGroup=1,1,2 pool=fpodata
%nsd: nsd=n1_1_2_disk3 device=/dev/sdd servers=n1_1_2 usage=dataOnly failureGroup=1,1,2 pool=fpodata
%nsd: nsd=n1_1_2_disk4 device=/dev/sde servers=n1_1_2 usage=dataOnly failureGroup=1,1,2 pool=fpodata
%nsd: nsd=n1_1_2_disk5 device=/dev/sdf servers=n1_1_2 usage=dataOnly failureGroup=1,1,2 pool=fpodata
%nsd: nsd=n1_1_2_disk6 device=/dev/sdg servers=n1_1_2 usage=dataOnly failureGroup=1,1,2 pool=fpodata
%nsd: nsd=n1_1_2_disk7 device=/dev/sdh servers=n1_1_2 usage=dataOnly failureGroup=1,1,2 pool=fpodata
%nsd: nsd=n1_1_2_disk8 device=/dev/sdi servers=n1_1_2 usage=dataOnly failureGroup=1,1,2 pool=fpodata
%nsd: nsd=n1_1_2_disk9 device=/dev/sdj servers=n1_1_2 usage=dataOnly failureGroup=1,1,2 pool=fpodata
%nsd: nsd=n1_1_2_disk10 device=/dev/sdk servers=n1_1_2 usage=dataOnly failureGroup=1,1,2 pool=fpodata
%nsd: nsd=n1_1_2_disk11 device=/dev/sdl servers=n1_1_2 usage=dataOnly failureGroup=1,1,2 pool=fpodata
%nsd: nsd=n1_1_2_disk12 device=/dev/sdm servers=n1_1_2 usage=dataOnly failureGroup=1,1,2 pool=fpodata


#Disks in rack2 fpodata pool (rack2=n2_x_x)
%nsd: nsd=n2_0_1_disk1 device=/dev/sdb servers=n2_0_1 usage=dataOnly failureGroup=2,0,1 pool=fpodata
%nsd: nsd=n2_0_1_disk2 device=/dev/sdc servers=n2_0_1 usage=dataOnly failureGroup=2,0,1 pool=fpodata
%nsd: nsd=n2_0_1_disk3 device=/dev/sdd servers=n2_0_1 usage=dataOnly failureGroup=2,0,1 pool=fpodata
%nsd: nsd=n2_0_1_disk4 device=/dev/sde servers=n2_0_1 usage=dataOnly failureGroup=2,0,1 pool=fpodata
%nsd: nsd=n2_0_1_disk5 device=/dev/sdf servers=n2_0_1 usage=dataOnly failureGroup=2,0,1 pool=fpodata
%nsd: nsd=n2_0_1_disk6 device=/dev/sdg servers=n2_0_1 usage=dataOnly failureGroup=2,0,1 pool=fpodata
%nsd: nsd=n2_0_1_disk7 device=/dev/sdh servers=n2_0_1 usage=dataOnly failureGroup=2,0,1 pool=fpodata
%nsd: nsd=n2_0_1_disk8 device=/dev/sdi servers=n2_0_1 usage=dataOnly failureGroup=2,0,1 pool=fpodata
%nsd: nsd=n2_0_1_disk9 device=/dev/sdj servers=n2_0_1 usage=dataOnly failureGroup=2,0,1 pool=fpodata
%nsd: nsd=n2_0_1_disk10 device=/dev/sdk servers=n2_0_1 usage=dataOnly failureGroup=2,0,1 pool=fpodata
%nsd: nsd=n2_0_1_disk11 device=/dev/sdl servers=n2_0_1 usage=dataOnly failureGroup=2,0,1 pool=fpodata
%nsd: nsd=n2_0_1_disk12 device=/dev/sdm servers=n2_0_1 usage=dataOnly failureGroup=2,0,1 pool=fpodata

%nsd: nsd=n2_0_2_disk1 device=/dev/sdb servers=n2_0_2 usage=dataOnly failureGroup=2,0,2 pool=fpodata
%nsd: nsd=n2_0_2_disk2 device=/dev/sdc servers=n2_0_2 usage=dataOnly failureGroup=2,0,2 pool=fpodata
%nsd: nsd=n2_0_2_disk3 device=/dev/sdd servers=n2_0_2 usage=dataOnly failureGroup=2,0,2 pool=fpodata
%nsd: nsd=n2_0_2_disk4 device=/dev/sde servers=n2_0_2 usage=dataOnly failureGroup=2,0,2 pool=fpodata
%nsd: nsd=n2_0_2_disk5 device=/dev/sdf servers=n2_0_2 usage=dataOnly failureGroup=2,0,2 pool=fpodata
%nsd: nsd=n2_0_2_disk6 device=/dev/sdg servers=n2_0_2 usage=dataOnly failureGroup=2,0,2 pool=fpodata
%nsd: nsd=n2_0_2_disk7 device=/dev/sdh servers=n2_0_2 usage=dataOnly failureGroup=2,0,2 pool=fpodata
%nsd: nsd=n2_0_2_disk8 device=/dev/sdi servers=n2_0_2 usage=dataOnly failureGroup=2,0,2 pool=fpodata
%nsd: nsd=n2_0_2_disk9 device=/dev/sdj servers=n2_0_2 usage=dataOnly failureGroup=2,0,2 pool=fpodata
%nsd: nsd=n2_0_2_disk10 device=/dev/sdk servers=n2_0_2 usage=dataOnly failureGroup=2,0,2 pool=fpodata
%nsd: nsd=n2_0_2_disk11 device=/dev/sdl servers=n2_0_2 usage=dataOnly failureGroup=2,0,2 pool=fpodata
%nsd: nsd=n2_0_2_disk12 device=/dev/sdm servers=n2_0_2 usage=dataOnly failureGroup=2,0,2 pool=fpodata

%nsd: nsd=n2_1_1_disk1 device=/dev/sdb servers=n2_1_1 usage=dataOnly failureGroup=2,1,1 pool=fpodata
%nsd: nsd=n2_1_1_disk2 device=/dev/sdc servers=n2_1_1 usage=dataOnly failureGroup=2,1,1 pool=fpodata
%nsd: nsd=n2_1_1_disk3 device=/dev/sdd servers=n2_1_1 usage=dataOnly failureGroup=2,1,1 pool=fpodata
%nsd: nsd=n2_1_1_disk4 device=/dev/sde servers=n2_1_1 usage=dataOnly failureGroup=2,1,1 pool=fpodata
%nsd: nsd=n2_1_1_disk5 device=/dev/sdf servers=n2_1_1 usage=dataOnly failureGroup=2,1,1 pool=fpodata
%nsd: nsd=n2_1_1_disk6 device=/dev/sdg servers=n2_1_1 usage=dataOnly failureGroup=2,1,1 pool=fpodata
%nsd: nsd=n2_1_1_disk7 device=/dev/sdh servers=n2_1_1 usage=dataOnly failureGroup=2,1,1 pool=fpodata
%nsd: nsd=n2_1_1_disk8 device=/dev/sdi servers=n2_1_1 usage=dataOnly failureGroup=2,1,1 pool=fpodata
%nsd: nsd=n2_1_1_disk9 device=/dev/sdj servers=n2_1_1 usage=dataOnly failureGroup=2,1,1 pool=fpodata
%nsd: nsd=n2_1_1_disk10 device=/dev/sdk servers=n2_1_1 usage=dataOnly failureGroup=2,1,1 pool=fpodata
%nsd: nsd=n2_1_1_disk11 device=/dev/sdl servers=n2_1_1 usage=dataOnly failureGroup=2,1,1 pool=fpodata
%nsd: nsd=n2_1_1_disk12 device=/dev/sdm servers=n2_1_1 usage=dataOnly failureGroup=2,1,1 pool=fpodata

# Remember this is one of the meta nodes, so it only has 10 dataOnly disks
%nsd: nsd=n2_1_2_disk1 device=/dev/sdb servers=n2_1_2 usage=dataOnly failureGroup=2,1,2 pool=fpodata
%nsd: nsd=n2_1_2_disk2 device=/dev/sdc servers=n2_1_2 usage=dataOnly failureGroup=2,1,2 pool=fpodata
%nsd: nsd=n2_1_2_disk3 device=/dev/sdd servers=n2_1_2 usage=dataOnly failureGroup=2,1,2 pool=fpodata
%nsd: nsd=n2_1_2_disk4 device=/dev/sde servers=n2_1_2 usage=dataOnly failureGroup=2,1,2 pool=fpodata
%nsd: nsd=n2_1_2_disk5 device=/dev/sdf servers=n2_1_2 usage=dataOnly failureGroup=2,1,2 pool=fpodata
%nsd: nsd=n2_1_2_disk6 device=/dev/sdg servers=n2_1_2 usage=dataOnly failureGroup=2,1,2 pool=fpodata
%nsd: nsd=n2_1_2_disk7 device=/dev/sdh servers=n2_1_2 usage=dataOnly failureGroup=2,1,2 pool=fpodata
%nsd: nsd=n2_1_2_disk8 device=/dev/sdi servers=n2_1_2 usage=dataOnly failureGroup=2,1,2 pool=fpodata
%nsd: nsd=n2_1_2_disk9 device=/dev/sdj servers=n2_1_2 usage=dataOnly failureGroup=2,1,2 pool=fpodata
%nsd: nsd=n2_1_2_disk10 device=/dev/sdk servers=n2_1_2 usage=dataOnly failureGroup=2,1,2 pool=fpodata


#Disks in rack3 fpodata pool (rack3=n3_x_x)
%nsd: nsd=n3_0_1_disk1 device=/dev/sdb servers=n3_0_1 usage=dataOnly failureGroup=3,0,1 pool=fpodata
%nsd: nsd=n3_0_1_disk2 device=/dev/sdc servers=n3_0_1 usage=dataOnly failureGroup=3,0,1 pool=fpodata
%nsd: nsd=n3_0_1_disk3 device=/dev/sdd servers=n3_0_1 usage=dataOnly failureGroup=3,0,1 pool=fpodata
%nsd: nsd=n3_0_1_disk4 device=/dev/sde servers=n3_0_1 usage=dataOnly failureGroup=3,0,1 pool=fpodata
%nsd: nsd=n3_0_1_disk5 device=/dev/sdf servers=n3_0_1 usage=dataOnly failureGroup=3,0,1 pool=fpodata
%nsd: nsd=n3_0_1_disk6 device=/dev/sdg servers=n3_0_1 usage=dataOnly failureGroup=3,0,1 pool=fpodata
%nsd: nsd=n3_0_1_disk7 device=/dev/sdh servers=n3_0_1 usage=dataOnly failureGroup=3,0,1 pool=fpodata
%nsd: nsd=n3_0_1_disk8 device=/dev/sdi servers=n3_0_1 usage=dataOnly failureGroup=3,0,1 pool=fpodata
%nsd: nsd=n3_0_1_disk9 device=/dev/sdj servers=n3_0_1 usage=dataOnly failureGroup=3,0,1 pool=fpodata
%nsd: nsd=n3_0_1_disk10 device=/dev/sdk servers=n3_0_1 usage=dataOnly failureGroup=3,0,1 pool=fpodata
%nsd: nsd=n3_0_1_disk11 device=/dev/sdl servers=n3_0_1 usage=dataOnly failureGroup=3,0,1 pool=fpodata
%nsd: nsd=n3_0_1_disk12 device=/dev/sdm servers=n3_0_1 usage=dataOnly failureGroup=3,0,1 pool=fpodata

# Remember this is one of the meta nodes, so it only has 10 dataOnly disks
%nsd: nsd=n3_0_2_disk1 device=/dev/sdb servers=n3_0_2 usage=dataOnly failureGroup=3,0,2 pool=fpodata
%nsd: nsd=n3_0_2_disk2 device=/dev/sdc servers=n3_0_2 usage=dataOnly failureGroup=3,0,2 pool=fpodata
%nsd: nsd=n3_0_2_disk3 device=/dev/sdd servers=n3_0_2 usage=dataOnly failureGroup=3,0,2 pool=fpodata
%nsd: nsd=n3_0_2_disk4 device=/dev/sde servers=n3_0_2 usage=dataOnly failureGroup=3,0,2 pool=fpodata
%nsd: nsd=n3_0_2_disk5 device=/dev/sdf servers=n3_0_2 usage=dataOnly failureGroup=3,0,2 pool=fpodata
%nsd: nsd=n3_0_2_disk6 device=/dev/sdg servers=n3_0_2 usage=dataOnly failureGroup=3,0,2 pool=fpodata
%nsd: nsd=n3_0_2_disk7 device=/dev/sdh servers=n3_0_2 usage=dataOnly failureGroup=3,0,2 pool=fpodata
%nsd: nsd=n3_0_2_disk8 device=/dev/sdi servers=n3_0_2 usage=dataOnly failureGroup=3,0,2 pool=fpodata
%nsd: nsd=n3_0_2_disk9 device=/dev/sdj servers=n3_0_2 usage=dataOnly failureGroup=3,0,2 pool=fpodata
%nsd: nsd=n3_0_2_disk10 device=/dev/sdk servers=n3_0_2 usage=dataOnly failureGroup=3,0,2 pool=fpodata

%nsd: nsd=n3_1_1_disk1 device=/dev/sdb servers=n3_1_1 usage=dataOnly failureGroup=3,1,1 pool=fpodata
%nsd: nsd=n3_1_1_disk2 device=/dev/sdc servers=n3_1_1 usage=dataOnly failureGroup=3,1,1 pool=fpodata
%nsd: nsd=n3_1_1_disk3 device=/dev/sdd servers=n3_1_1 usage=dataOnly failureGroup=3,1,1 pool=fpodata
%nsd: nsd=n3_1_1_disk4 device=/dev/sde servers=n3_1_1 usage=dataOnly failureGroup=3,1,1 pool=fpodata
%nsd: nsd=n3_1_1_disk5 device=/dev/sdf servers=n3_1_1 usage=dataOnly failureGroup=3,1,1 pool=fpodata
%nsd: nsd=n3_1_1_disk6 device=/dev/sdg servers=n3_1_1 usage=dataOnly failureGroup=3,1,1 pool=fpodata
%nsd: nsd=n3_1_1_disk7 device=/dev/sdh servers=n3_1_1 usage=dataOnly failureGroup=3,1,1 pool=fpodata
%nsd: nsd=n3_1_1_disk8 device=/dev/sdi servers=n3_1_1 usage=dataOnly failureGroup=3,1,1 pool=fpodata
%nsd: nsd=n3_1_1_disk9 device=/dev/sdj servers=n3_1_1 usage=dataOnly failureGroup=3,1,1 pool=fpodata
%nsd: nsd=n3_1_1_disk10 device=/dev/sdk servers=n3_1_1 usage=dataOnly failureGroup=3,1,1 pool=fpodata
%nsd: nsd=n3_1_1_disk11 device=/dev/sdl servers=n3_1_1 usage=dataOnly failureGroup=3,1,1 pool=fpodata
%nsd: nsd=n3_1_1_disk12 device=/dev/sdm servers=n3_1_1 usage=dataOnly failureGroup=3,1,1 pool=fpodata

%nsd: nsd=n3_1_2_disk1 device=/dev/sdb servers=n3_1_2 usage=dataOnly failureGroup=3,1,2 pool=fpodata
%nsd: nsd=n3_1_2_disk2 device=/dev/sdc servers=n3_1_2 usage=dataOnly failureGroup=3,1,2 pool=fpodata
%nsd: nsd=n3_1_2_disk3 device=/dev/sdd servers=n3_1_2 usage=dataOnly failureGroup=3,1,2 pool=fpodata
%nsd: nsd=n3_1_2_disk4 device=/dev/sde servers=n3_1_2 usage=dataOnly failureGroup=3,1,2 pool=fpodata
%nsd: nsd=n3_1_2_disk5 device=/dev/sdf servers=n3_1_2 usage=dataOnly failureGroup=3,1,2 pool=fpodata
%nsd: nsd=n3_1_2_disk6 device=/dev/sdg servers=n3_1_2 usage=dataOnly failureGroup=3,1,2 pool=fpodata
%nsd: nsd=n3_1_2_disk7 device=/dev/sdh servers=n3_1_2 usage=dataOnly failureGroup=3,1,2 pool=fpodata
%nsd: nsd=n3_1_2_disk8 device=/dev/sdi servers=n3_1_2 usage=dataOnly failureGroup=3,1,2 pool=fpodata
%nsd: nsd=n3_1_2_disk9 device=/dev/sdj servers=n3_1_2 usage=dataOnly failureGroup=3,1,2 pool=fpodata
%nsd: nsd=n3_1_2_disk10 device=/dev/sdk servers=n3_1_2 usage=dataOnly failureGroup=3,1,2 pool=fpodata
%nsd: nsd=n3_1_2_disk11 device=/dev/sdl servers=n3_1_2 usage=dataOnly failureGroup=3,1,2 pool=fpodata
%nsd: nsd=n3_1_2_disk12 device=/dev/sdm servers=n3_1_2 usage=dataOnly failureGroup=3,1,2 pool=fpodata


# /usr/lpp/mmfs/bin/mmcrnsd -F /root/packages/gpfs-fpo-poolfile

Verify that NSDs were successfully created:

# /usr/lpp/mmfs/bin/mmlsnsd
 File system   Disk name      NSD servers
---------------------------------------------------------------------------
 (free disk)   n1_0_1_disk1   n1_0_1
(you should see all of the NSDs defined in the stanza file listed here)

We can see this disk is free (not used in any file system).

CREATE GPFS-FPO FILE SYSTEM

Bring up GPFS on all the nodes:

mmstartup -a

Next we'll use the create filesystem command with the same stanza file to create the filesystem "bigfs":

# /usr/lpp/mmfs/bin/mmcrfs bigfs -F /root/packages/gpfs-fpo-poolfile -A yes -i 4096 -m 3 -M 3 -n 32 -r 3 -R 3 -S relatime -E no
The following disks of bigfs will be formatted on node n1_0_1:

sdb: size 109366474752 KB
sdc: size 109366474752 KB
sdd: size 109366474752 KB
sde: size 109366474752 KB
sdf: size 109366474752 KB
sdg: size 109366474752 KB
sdh: size 109366474752 KB
sdi: size 109366474752 KB
sdj: size 109366474752 KB
sdk: size 109366474752 KB

Formatting file system ...
Disks up to size 896 TB can be added to storage pool system.
Creating Inode File
Creating Allocation Maps
Creating Log Files
Clearing Inode Allocation Map
Clearing Block Allocation Map
Formatting Allocation Map for storage pool system
 14 % complete on Wed Apr 16 04:00:46 2014


...


Note that there are many file system creation options that can be used for optimal performance; the exact parameters depend on the hardware and the workload that will run on this system. With the default options, filesystems are auto-mounted when the cluster service starts.
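To confirm that the filesystem was created with the intended block size and replication settings, the attributes can be listed back; for example (output will vary):

# /usr/lpp/mmfs/bin/mmlsfs bigfs -B -m -M -r -R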

Mount the filesystem:

# /gpfs/bigfs must exist on all 12 servers
# /usr/lpp/mmfs/bin/mmmount bigfs /gpfs/bigfs -N all
Wed Apr 16 04:11:00 CDT 2014: mmmount: Mounting file systems ...
# df
Filesystem      1K-blocks         Used    Available Use% Mounted on
/dev/sda3       286661072      1779976    270319560   1% /
tmpfs            16421272            0     16421272   0% /dev/shm
/dev/sda1          253871        58783       181981  25% /boot
/dev/bigfs   109366474752 103764254976   5602219776  95% /gpfs/bigfs

Use the mmdf command to display a more detailed filesystem summary for bigfs:

# mmdf bigfs
disk                disk size  failure holds    holds           free KB             free KB
name                    in KB    group metadata data     in full blocks        in fragments
--------------- ------------- -------- -------- ----- -------------------- -------------------
Disks in storage pool: system (Maximum disk size allowed is 268 TB)
s1sdd             34179687424        1 Yes      Yes     34157510656 (100%)        10368 ( 0%)
s1sdc             34179687424        1 Yes      Yes     34157490176 (100%)        10496 ( 0%)
s1sdb             34179686400        1 Yes      Yes     34157486080 (100%)         5504 ( 0%)
s2sdd             34179687424        2 Yes      Yes     34157494272 (100%)        10496 ( 0%)
s2sdc             34179687424        2 Yes      Yes     34157510656 (100%)         6528 ( 0%)
s2sdb             34179686400        2 Yes      Yes     34157481984 (100%)         9344 ( 0%)
                -------------                         -------------------- -------------------
(pool total)     205078122496                          204944973824 (100%)        52736 ( 0%)

                =============                         ==================== ===================
(total)          205078122496                          204944973824 (100%)        52736 ( 0%)

Inode Information
-----------------
Number of used inodes:            4039
Number of free inodes:          536633
Number of allocated inodes:     540672
Maximum number of inodes:    134217728


CREATE A FILE PLACEMENT POLICY FILE

Create a policy file containing the placement rules in a file called bigfs.pol.

Here’s an example:

# cat bigfs.pol
RULE 'bgf' SET POOL 'fpodata' REPLICATE(1) WHERE NAME LIKE '%.tmp' AND setBGF(4) AND setWAD(1)
RULE 'default' SET POOL 'fpodata'

The first rule places all files with the extension ".tmp" in the fpodata pool with a chunk factor of 4 blocks, a write affinity depth (WAD) of 1, and a data replication count of one for these temporary files. The second rule is the default rule that places all other files in the fpodata pool using the attributes specified at the storage pool level; it applies to all files that do not match any prior rule and must be specified as the last rule. The first rule is optional and is included here only for demonstration purposes.

Install the policy using the command mmchpolicy. The policies that are currently effective can be listed using the command mmlspolicy.

# mmchpolicy bigfs bigfs.pol
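To verify that the policy was installed, list it back (the -L option prints the rule text):

# mmlspolicy bigfs -L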

You can use the GPFS command ‘mmlsattr -L <path to a file in GPFS>’ to check these settings for individual files. Note that if an attribute is not listed in the output of mmlsattr, the settings provided at the pool or file system level are being used.


Tuning GPFS Configuration for FPO

GPFS-FPO clusters have a different architecture than traditional GPFS deployments. Therefore, default values of many of the configuration parameters are not suitable for FPO deployments and should be changed. Use the following table as a guide to update your cluster configuration settings.

Use the mmchconfig and mmlsconfig commands to change and list the current value of any given parameter. For example, to change the readReplicaPolicy setting to local, do the following. You can specify multiple configuration parameters in a single command, as shown below.

# show the current setting of the readReplicaPolicy parameter
# mmlsconfig | grep -i readReplicaPolicy

# change the value to local; -i makes the change immediate and persistent across node reboots
# mmchconfig readReplicaPolicy=local -i
# mmchconfig maxStatCache=100000,maxFilesToCache=100000

Replace the parameter names and values as necessary. You only need to run these commands on one node in the cluster; GPFS propagates the changes to all other nodes.

ATTENTION: Some parameters like maxStatCache and maxFilesToCache do not take effect until GPFS is restarted. GPFS can be restarted using the commands below.

/usr/lpp/mmfs/bin/mmumount all -a

/usr/lpp/mmfs/bin/mmshutdown -a

/usr/lpp/mmfs/bin/mmstartup -a

# Ensure that the GPFS daemon has started on all nodes in the GPFS cluster.
mmgetstate -a


Important GPFS-FPO related parameters

Parameters marked with * are either not set by BigInsights 2.1.0.1 or need review. Each entry lists the default value, the recommended new value, and a comment.

readReplicaPolicy (default: random; new value: local). Enables the policy to read replicas from local disks.

enableRepWriteStream* (undocumented) (default: 1 if FPO is enabled, 0 for a non-FPO cluster; new value: 0). Use non-FPO replication mode for the FPO configuration as well.

restripeOnDiskFailure (default: no; new value: yes). Specifies whether GPFS attempts to automatically recover from certain common disk failures.

metadataDiskWaitTimeForRecovery* (default: 300 seconds; new value: see comment). Delay period before recovery of failed metadata disks is started; used when restripeOnDiskFailure is set to yes. Ensure it is large enough to cover the reboot time of the nodes hosting metadata disk servers.

dataDiskWaitTimeForRecovery* (default: 600 seconds; new value: see comment). Delay period before recovery of failed data disks is started; used when restripeOnDiskFailure is set to yes. Data disks are controlled via a separate tunable because they are distributed across a variety of node types. Ensure it is large enough to cover the reboot of the slowest node in the cluster.

syncBuffsPerIteration (undocumented) (default: 100; new value: 1). Expedites buffer flushes and the rename operations done by MapReduce jobs.

minMissedPingTimeout* (undocumented) (default: 3 seconds; new value: 10-60 seconds). The lower bound on a missed ping timeout. For an FPO cluster a longer grace time is desirable before marking a node as dead, since that impacts all of its associated disks; in addition, when running MapReduce workloads a busy CPU can cause delayed ping responses. However, a longer timeout also means delayed recovery. A value between 10 and 60 seconds is recommended as a good balance between the time to detect real failures and the rate of false failure detection triggered by delayed ping responses under CPU or network overload.

leaseRecoveryWait (undocumented) (default: 35; new value: 65). Allow a larger grace window before starting recovery.

pagepool* (default: varies; new value: 25% of system memory on the node, using -N to list the nodes to apply). Amount of physical memory reserved for cache.

prefetchPct* (default: 20% of pagepool; new value: see comment). GPFS uses this parameter as a guideline to limit the pagepool space used for prefetch or write-behind buffers. For MapReduce workloads, which are generally sequential reads and writes, increase this parameter up to its 60% maximum.

prefetchThreads* (default: 72; new value: see comment). Controls the maximum number of threads dedicated to prefetching data for sequential file reads or handling sequential write-behind. prefetchThreads should be around twice the number of disks available to the node; the default works well in most configurations.

prefetchAggressivenessRead* (undocumented) (default: -1; new value: 2). Controls the aggressiveness in prefetching file data. Changing it to 2 causes prefetch to be more aggressive, possibly increasing the performance of MapReduce jobs.

prefetchAggressivenessWrite (undocumented) (default: -1; new value: 0). Controls the aggressiveness in flushing data buffers to disks. Changing it to 0 results in recently written data being held in the cache for as long as possible.

maxFilesToCache* (default: 4000; new value: 100000). Specifies the number of inodes to cache. Storing a file's inode in cache permits faster re-access to the file when retrieving location information for data blocks. Increasing this number can improve throughput for workloads with high file reuse, as is the case with Hadoop MapReduce tasks; however, increasing it excessively may cause paging at the file system manager node. The value should be large enough to handle the number of concurrently open files plus allow caching of recently used files.

maxStatCache* (default: 1000; new value: 100000). Specifies the number of inodes to keep in the stat cache. The stat cache maintains only enough inode information to perform a query on the file system.

worker1Threads* (default: 48; new value: see comment). Controls the maximum number of concurrent file operations at any one instant, primarily for random reads and writes that cannot be prefetched. For big data applications, if there are more than 10 MapReduce tasks, increase this to 72.

nsdMinWorkerThreads (default: 16; new value: 48). Increases NSD server performance by providing a larger number of dedicated threads for NSD service.

nsdInlineWriteMax (undocumented) (default: 1K; new value: 1M). Defines the amount of data sent in-line with write requests to the NSD server; it helps reduce the overhead caused by inter-node communication.

nsdSmallThreadRatio (undocumented) (default: 0; new value: 2). Changing this parameter to 2 assigns more resources to large IO for Hadoop workloads.

nsdThreadsPerQueue (default: 3; new value: 10). The number of threads servicing an internal IO queue for disk operations. Increasing it raises parallelism for disk IO and can increase throughput.

forceLogWriteOnFdatasync (default: yes; new value: no). Controls forcing log writes to disk. When set to no, the GPFS log record is flushed only when a new block is allocated to the file.

disableInodeUpdateOnFdatasync (default: no; new value: yes). When set to yes, the inode object is not updated on disk for mtime and atime updates on fdatasync() calls. File size updates are always synced to disk.

unmountOnDiskFail* (default: no; new value: meta). Controls how the GPFS daemon responds when a disk failure is detected. When set to "meta", the file system is unmounted when metadata is no longer accessible. With the replication factor set to 3, three failure groups must each have at least one disk in the failed state before a file system is unmounted.

cnfsReboot* (default: yes; new value: no). If using CNFS, review the Clustered NFS recommendations before changing this parameter. It should be changed to "no" only if the recommended setup is not possible.

dataDiskCacheProtectionMethod (default: 0; new value: 2). Defines the kind of disks used for the GPFS file system. The default 0 indicates that disks are "powerProtected" and no recovery is needed beyond standard GPFS log recovery. A value of 2 ensures that when a node goes down, the data lost in the disk's cache is rebuilt by the GPFS recovery process.
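As an illustration only (not an authoritative list; verify each parameter against the GPFS level in use before applying, and adjust values such as pagepool to your hardware), the table's recommendations could be applied with mmchconfig along these lines:

# apply FPO-related tuning; -i makes the change immediate and persistent where supported
mmchconfig readReplicaPolicy=local -i
mmchconfig restripeOnDiskFailure=yes
mmchconfig maxFilesToCache=100000,maxStatCache=100000
mmchconfig nsdMinWorkerThreads=48,nsdThreadsPerQueue=10,nsdSmallThreadRatio=2
mmchconfig forceLogWriteOnFdatasync=no,disableInodeUpdateOnFdatasync=yes
mmchconfig unmountOnDiskFail=meta,dataDiskCacheProtectionMethod=2
mmchconfig pagepool=12G -N all        # assumption: 25% of the 48 GB on these nodes
# restart GPFS so that parameters read at daemon start take effect
mmshutdown -a && mmstartup -a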


COMMON GPFS OPERATIONS

Service start-up and shutdown: to start the GPFS service on all cluster nodes, use "-a"; to start it on a specific node or group of nodes, list them with the "-N" switch.

# mmstartup -a

# mmshutdown -N <node1, node2>

Filesystem mount and unmount:

# mmlsfs all

# mmmount <filesystemName>

# mmumount <filesystemName>

Disk and filesystem data and metadata status: use mmdf to display detailed GPFS system information including metadata, inode and “by NSD” space utilization.

# mmdf <filesystem>


COMMON SERVICE OPERATIONS

What follows is a list of commands the operations staff may need to execute. For additional information consult GPFS documentation and on a management node run mm[TAB] to see a list of main commands available.

Check cluster state – mmlscluster

Start and shutdown GPFS service – mmstartup, mmshutdown

Check disk state – mmlsnsd

Check which nodes have a filesystem mounted – mmlsmount <fs> -L

Establish, edit and list quota information – mmdefedquota, mmedquota, mmlsquota

Add and remove a client or server node – mmaddnode, mmdelnode, mmchlicense (see the example after this list; note that nodes must be prepared as explained above – functional passwordless SSH logon from a management node, disabled SELinux, etc.)

Add a disk to, or evacuate a disk from, an online filesystem – mmadddisk, mmdeldisk (WARNING: review these commands in the GPFS Problem Determination Guide before use, as a wrong operation can cause data loss)
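For example, a minimal sketch of adding one more prepared client-server node to the cluster (the hostname n4_0_1 is hypothetical):

# /usr/lpp/mmfs/bin/mmaddnode -N n4_0_1
# /usr/lpp/mmfs/bin/mmchlicense fpo --accept -N n4_0_1
# /usr/lpp/mmfs/bin/mmstartup -N n4_0_1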

When a server fails, first determine whether it is a minor problem where a reboot will fix the failure and the system will continue to operate, or a complete failure such as a power supply or motherboard failure. In the case of a catastrophic failure, we assume that at least all the disks in the system are still OK. This leaves two options: replace the failed part, or swap out the entire server. In the case of a swap-out, the disks need to be moved from the original unit into the new unit. They should not need to be placed in the exact same slots, since they already have GPFS labels stamped in their headers; however, the two OS disks used for booting the system should be moved to the correct slots, or whatever is needed for the system to recognize them as the bootable OS disks. Be careful not to reprovision these disks, since most of the GPFS configuration and packages live on the OS disks at this point.

When a disk fails, the disk needs to be replaced. After the disk is replaced, a restripe of the file system needs to be done so that the new, empty disk is rebuilt with all of the file replicas that were on the failed disk. This can take some time: since we do not use RAID in this configuration, the entire disk rebuild generates network traffic for every piece of every file being rebuilt onto the new disk. For a failed 600GB drive that was nearly full, that is up to 600GB of traffic over, most likely, a 1GigE network, so it could take upwards of two hours to rebuild the new drive, and longer if the system is busy doing other things. This is one of the arguments for possibly using RAID, so that we do not bog down the network every time a drive fails. As a note, the GPFS for SAP-HANA solution uses GPFS-FPO with RAID5 for this reason, along with replication; the only reason they do not use RAID6 is that they did not have the resources or time to test RAID6 and offer it as another solution choice. We may want to consider this for GPFS-FPO on the cloud. For now, I am going with the GPFS best practices recommendations for a target use case of replacing HDFS in a MapReduce environment; SAP-HANA may simply have a more flexible environment than a MapReduce setup.
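A minimal sketch of the disk-replacement flow described above (the NSD name n1_0_1_disk5 and the stanza file path are examples only; verify the options against the GPFS documentation for the release in use, since these operations move large amounts of data):

Remove the failed NSD from the filesystem; the surviving replicas keep the data available.

# /usr/lpp/mmfs/bin/mmdeldisk bigfs n1_0_1_disk5

After the physical drive is replaced, recreate the NSD from a one-line stanza file, add it back to the filesystem, and restripe to restore full replication:

# /usr/lpp/mmfs/bin/mmcrnsd -F /root/packages/new-disk-stanza
# /usr/lpp/mmfs/bin/mmadddisk bigfs -F /root/packages/new-disk-stanza
# /usr/lpp/mmfs/bin/mmrestripefs bigfs -r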


EXAMPLES OF COMMON COMMAND OUTPUTS

[root@shuynh-gpfs-dmss1 fs1]# /usr/lpp/mmfs/bin/mmlscluster

GPFS cluster information
========================
  GPFS cluster name:         n1_0_1.hpccloud.local
  GPFS cluster id:           8407352956664299536
  GPFS UID domain:           n1_0_1.hpccloud.local
  Remote shell command:      /usr/bin/ssh
  Remote file copy command:  /usr/bin/scp

GPFS cluster configuration servers:
-----------------------------------
  Primary server:    n1_0_1.hpccloud.local
  Secondary server:  (none)

 Node  Daemon node name       IP address    Admin node name        Designation
-----------------------------------------------------------------------------
   1   n1_0_1.hpccloud.local  119.81.87.58  n1_0_1.hpccloud.local  quorum-manager

Note that in the output above there is no secondary configuration server, whereas in reality we’d want to have one. Also, there is only one node in this cluster, which together tells you this is a “singleton” (single node) cluster.


INFINIBAND

HPC clients often require an HPC cluster with InfiniBand (IB) for high-performance data transport. SoftLayer, however, does not support ordering servers with IB directly from the management portal or the API. Instead, SoftLayer sales can help set up the IB servers manually, and this IB support makes it possible to reprovision those manually set-up IB servers. This document shows how to set up an HPC cloud with IB support and provision an LSF cluster with CentOS 6.4 x86_64. Support for Red Hat is not technically available yet, but should be soon; this document would not change much for a Red Hat version except for the OS being chosen. In particular, the document describes the following:

How to order servers with IB on SoftLayer.
How to provision an LSF cluster with IB.
How to verify IB on an LSF cluster.

Scope

For IB support, CentOS 6.4 x86_64 is supported. For supporting RHEL 6.4 x86_64, RHEL 6.5 x86_64 and CentOS 6.5 x86_64, more testing effort is required.

For IB support on SUSE 10.04 x86_64, SoftLayer needs some development and testing effort. SoftLayer has verified IB support with LSF; for Symphony, more development and testing effort is needed.

Prerequisites

Contact SoftLayer to find a datacenter that supports setting up IB servers.
The PCM-AE management node has already been set up.
Get the OFED 2.1-1.0.6 installation package for CentOS 6.4 from Mellanox and put it into the HTTP server path (such as /var/www/html) where the LSF packages are located; by default the HTTP server is the PCM-AE management node. Note that this package is OS and OS-version specific, and is different for different OSes and OS versions.

OFED package for CentOS 6.4 x86_64 download URL: http://www.mellanox.com/downloads/ofed/MLNX_OFED-2.1-1.0.6/MLNX_OFED_LINUX-2.1-1.0.6-rhel6.4-x86_64.iso

Very Important: Make sure the file permission is 644 after putting the package into the HTTP server path (/var/www/html/). For example, run the command "chmod 644 /var/www/html/*".
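As a concrete example, the download and permission steps on the HTTP server (by default the PCM-AE management node) might look like the following; the target directory /var/www/html matches the default mentioned above, so adjust it if your HTTP server path differs.

Download the CentOS 6.4 OFED ISO from Mellanox into the HTTP server path:

# cd /var/www/html
# wget http://www.mellanox.com/downloads/ofed/MLNX_OFED-2.1-1.0.6/MLNX_OFED_LINUX-2.1-1.0.6-rhel6.4-x86_64.iso

Make sure the package is world-readable so provisioned nodes can fetch it:

# chmod 644 /var/www/html/MLNX_OFED_LINUX-2.1-1.0.6-rhel6.4-x86_64.iso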


High level steps

1. Setup xCAT.
2. Order servers with IB card on SoftLayer.
3. Add IB servers into xCAT.
4. Configure and Publish LSF Cluster Definitions.
5. Provision and verify LSF clusters with IB.
6. Manage existing LSF clusters with IB.

    1. Add IB servers to a LSF cluster (Flex Up)
    2. Remove IB servers from a LSF cluster (Flex Down)
    3. Cancel a LSF cluster

7. Release IB servers.

1. Setup xCAT

First, use the datacenter identified in the Prerequisites section and follow the instructions in HPC_Cloud_xCAT_Setup.txt to install xCAT in that datacenter. You will need one xCAT installation per SoftLayer account/client.

After adding the xCAT adapter to PCM-AE successfully, you should see the CentOS stateful osimage in the page Inventory > xCAT > Adapter Instance Name > Templates


2. Order GPFS-FPO servers with IB card on SoftLayer

Ordering servers with IB requires coordination with SoftLayer sales. You also need to let SoftLayer sales know some basic server information; see below:

Data Center: The datacenter found in the prerequisite section

CPU: 12 core

Memory: 48 GB

Operating System: Always order CentOS 6.x (64 bit) even if xCAT will reprovision the servers with RHEL, SLES, etc ...

Disk: 12 15K 600GB SAS drives

Network: 1 Gbps Private Only Network Uplink, non-dual port

IB card: Mellanox MCX354A-FCCT ConnectX-3Pro FDR/40GbE

VLAN: Same as the xCAT MN

Hostname: Specify a unique host name in the current account scope

Domain: Same as the xCAT MN

After the servers with the IB card are ordered successfully, you can check them on the SoftLayer portal: navigate to BARE METAL and click into each server to verify the configuration. See below:


Verify that the IB card has been added to the server; see below.

Verify that the VLAN is correct: make sure the IB server's private-network VLAN is the same as the xCAT MN's. See below.


3. Add IB servers into xCAT

In order for the xCAT MN to provision the IB servers, you need to add them to xCAT first. Follow the instructions below:

1. Get the hostnames of the IB servers from the SoftLayer portal (see the screenshot in step 2). For example, suppose you have ordered 3 IB servers and their hostnames are newibcomp001.oncloud.com, newibcomp002.oncloud.com, newibcomp003.oncloud.com.

2. Get the API key of your account from the API Key page on the SoftLayer portal.

3. Log on to the PCM-AE management node and "su" to the root user.

# cd /opt/platform/virtualization/conf/resourceAdapter/softLayerAdapter/scripts

4. Run the command below to add the IB servers into xCAT.

# ./addNodesToxCAT.sh   {SL_ACCOUNT_NAME}   {SL_API_KEY}   {"SL_SERVERNAME[,SL_SERVERNAME...]"}

Very Important: This command will prompt you to input the PCM-AE Management Node's Admin password, which you changed by following the PCM-AE secure doc HPC_Cloud_CaaS_Management_Secure.txt (in section "Securing management node" -> "1. (PCM-AE only) Change The Default PCM-AE Management Node's Admin Password").

For example:

# ./addNodesToxCAT.sh   czhang   2a4808b66ecd85f74ab3fca48b77a3133e869ee8210b6f3459621a455298***   "newibcomp001.oncloud.com,newibcomp002.oncloud.com,newibcomp003.oncloud.com"

Input the PCM-AE Admin password:
Login successfully.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  4614  100  4614    0     0  11715      0 --:--:-- --:--:-- --:--:-- 14647

Results:

Machines below are added to xCAT successfully!
newibcomp001.oncloud.com
newibcomp002.oncloud.com
newibcomp003.oncloud.com


5. Verify the nodes are added successfully.

Log on to the PCM-AE portal, navigate to Resources > Inventory > xCAT > xCAT instance > Machines, and check that the machines are listed.
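If you also have shell access to the xCAT management node, a quick command-line check is possible as well. This is an optional, hedged alternative to the portal view; the hostnames match the example servers ordered in step 1.

List the node definitions known to xCAT and confirm the new IB servers appear:

# lsdef -t node | grep newibcomp

Or simply list the node names:

# nodels | grep newibcomp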

4. Configure and Publish LSF Cluster Definitions

In this step, follow the instructions in Step 2.2, Create and Publish Cluster Definitions, to configure the LSF cluster definitions. For IB support, however, there are two differences from Step 2.2:

Modify these cluster definitions to ensure that the xCAT osimage for the required OS, centos6.4-x86_64-install-PMTools, is selected for each machine definition.

Drag the post scripts Update nodes password and Install OFED into each tier in the cluster definition.

As an example, we will show how to configure for the LSF cluster definition, but the steps are similar for other cluster definitions.

Go to the Clusters > Definitions tab, select the LSF monthly cluster definition "LSF Monthly", and click the Modify button. This opens the Cluster Designer window. Click on the LSFMaster box, which corresponds to the LSFMaster tier machine definition. Click on the OS tab and select the xCAT osimage centos6.4-x86_64-install-PMTools.

Drag the post scripts Update nodes password and Install OFED into the LSFMaster tier, and reorder the script layers according to the rule below:

       ** work load manager (LSF/SYM)
       ** scripts having to do with storage
       ** authentication scripts
       ** scripts having to do with networks
       ** Machine Layer


Refer to the screenshot below as an example; make sure the post script "Install OFED" is immediately above the DNS post script.

Note: The post script layer Install OFED contains two user variables, "OFED_PKG_NAME" and "SOURCE_URL". They should normally not be changed when used by an LSF or Symphony cluster definition. See details below:

SOURCE_URL - The HTTP server URL from which to download the OFED install package. This may be inherited by the parent LSF or Symphony cluster definition.

OFED_PKG_NAME - The default value "MLNX_OFED_LINUX-2.1-1.0.6-rhel6.4-x86_64.iso" is the same as the OFED package downloaded in the Prerequisites section.

Repeat the same process for the other two tiers, ComputeMonthly and ComputeHourly, then make sure the other necessary variables and configurations for the LSF cluster definition are correct. Click the small disk icon in the top right corner of the definition to save the modification. Back in the Clusters > Definitions tab, select the unpublished definition "LSF Monthly" and click Manage > Publish to publish this cluster definition.


5. Provision and verify LSF clusters with IB

Log on to the PCM-AE portal, navigate to Clusters > Cockpit, and create a new LSF cluster instance using the LSF cluster definition for IB.

After the LSF cluster instance is provisioned successfully, log on to the LSF master to verify that the OFED package has been installed and configured successfully, that the IB network is active, and that all the IB servers have joined it. Check the OFED install log to make sure there are no errors; two important statements should appear at the end of the log file.

# vim /tmp/installOFED.log
......
OFED package is installed successfully
......
IB service is started successfully
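As a quicker, non-interactive check, the same two statements can be pulled out with grep. This is only a convenience and assumes the log path /tmp/installOFED.log shown above:

# grep -E "installed successfully|started successfully" /tmp/installOFED.log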

Check the IB nodes with the command below to make sure all the IB devices are up.

# ibnodes

Ca      : 0xf45214030031ad70 ports 2 "newibcomp003 HCA-1"
Ca      : 0xf45214030031adb0 ports 2 "newibcomp002 HCA-1"
Ca      : 0xf45214030031adc0 ports 2 "newibcomp001 HCA-1"
Switch  : 0xf4521403007febd0 ports 36 "MF0;IB-SW01:SX6036/U1" enhanced port 0 lid 1 lmc 0

# ibnetdiscover
#
# Topology file: generated on Thu May 15 21:35:34 2014
#
# Initiated from node f45214030031adc0 port f45214030031adc1

vendid=0x2c9
devid=0xc738
sysimgguid=0xf4521403007febd0
switchguid=0xf4521403007febd0(f4521403007febd0)
Switch  36 "S-f4521403007febd0"         # "MF0;IB-SW01:SX6036/U1" enhanced port 0 lid 1 lmc 0
[1]     "H-f45214030031adb0"[1](f45214030031adb1)               # "newibcomp002 HCA-1" lid 3 4xFDR
[2]     "H-f45214030031ad70"[1](f45214030031ad71)               # "newibcomp003 HCA-1" lid 2 4xFDR
[4]     "H-f45214030031adc0"[1](f45214030031adc1)               # "newibcomp001 HCA-1" lid 4 4xFDR

vendid=0x2c9
devid=0x1007
sysimgguid=0xf45214030031ad73
caguid=0xf45214030031ad70
Ca      2 "H-f45214030031ad70"          # "newibcomp003 HCA-1"
[1](f45214030031ad71)   "S-f4521403007febd0"[2]         # lid 2 lmc 0 "MF0;IB-SW01:SX6036/U1" lid 1 4xFDR

vendid=0x2c9
devid=0x1007
sysimgguid=0xf45214030031adb3
caguid=0xf45214030031adb0
Ca      2 "H-f45214030031adb0"          # "newibcomp002 HCA-1"
[1](f45214030031adb1)   "S-f4521403007febd0"[1]         # lid 3 lmc 0 "MF0;IB-SW01:SX6036/U1" lid 1 4xFDR

vendid=0x2c9
devid=0x1007
sysimgguid=0xf45214030031adc3
caguid=0xf45214030031adc0
Ca      2 "H-f45214030031adc0"          # "newibcomp001 HCA-1"
[1](f45214030031adc1)   "S-f4521403007febd0"[4]         # lid 4 lmc 0 "MF0;IB-SW01:SX6036/U1" lid 1 4xFDR
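On an individual IB server, the port state and link rate can also be checked directly with ibstat, one of the standard InfiniBand diagnostic tools installed with OFED. This is a hedged, supplementary check; expect a state of Active and an FDR rate (56 Gb/s) for the hardware ordered above:

# ibstat | grep -E "State|Rate"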

6. Manage existing LSF clusters with IB

Add IB servers to a LSF cluster (Flex Up)

Adding IB servers to a LSF cluster is a three-step process. See the three steps below.

Very Important: These three steps must be done sequentially (atomically). No other cluster operations, such as create or flex up/down, should be performed until all three steps are completed.

1. Follow instructions in section "Order servers with IB card on SoftLayer" to order IB servers on SoftLayer.

2. Follow instructions in section "Add IB servers into xCAT" to add the ordered IB servers into xCAT.

3. Flex up machines on the existing LSF cluster. The number of On-Demand Machines for the ComputeMonthly tier must equal the old number plus the number of new nodes. For example, assuming you have ordered four IB servers from SoftLayer and the ComputeMonthly tier of the running cluster already has one node, you must set the On-Demand Machines number to five (4+1) for the ComputeMonthly tier. After the LSF cluster becomes active (flex up is done), follow the IB verification instructions in section "Provision and verify LSF clusters with IB" to verify that the newly added IB servers have joined the IB network successfully.


Remove IB servers from a LSF cluster (Flex Down)

Flex down machines from the LSF cluster, then follow the instructions in section "Release IB servers" to remove/cancel the unused IB servers from xCAT/SoftLayer.

Cancel a LSF cluster

Cancel the LSF cluster, then follow the instructions in section "Release IB servers" to remove/cancel the unused IB servers from xCAT/SoftLayer.

7. Release IB Servers

When the IB servers are no longer used by the HPC cluster, they need to be removed from xCAT and canceled from SoftLayer.

Log on to the PCM-AE portal and cancel or flex down the HPC clusters associated with the IB servers. Cancelling or flexing down an HPC cluster does not remove the IB servers from the xCAT MN or SoftLayer, so you still need to perform the two steps below:

1. Delete the IB servers from the xCAT adapter page in the PCM-AE GUI so that they are removed from the xCAT MN. Navigate to Resources > Inventory > xCAT > xCAT_instance > Machines, select the IB servers to be removed, and then click Manage > Delete.

2. Contact SoftLayer sales/support to cancel the IB servers.