
CLUSTER with PBS TORQUE (version 1.1 – 25-08-2008)

Compiled and organized by

Rafael C. Barreto

IFUSP

The main goal of this article is to introduce the end user to the structure and facilities of a distributed computing system.

CONTENTS

INTRODUCTION
    Peak floating-point performance
    Real cluster performance
    Portable Batch System
SHARE IS THE SECRET OF SUCCESS
HOW OUR CLUSTER WORKS
    Making the script
        EXAMPLE 1 – GAUSSIAN
        EXAMPLE 2 – DICE
    Babysitting the job
    Frequent situations
        Killing a job
        Rescuing an interrupted job
        The job does not start!
SETTING UP THE CLUSTER (ADMINISTRATION)


INTRODUCTION

[From Wikipedia, the free encyclopedia]

Distributed computing is a method of computer processing in which different parts of a program run simultaneously on two or more computers communicating with each other over a network. Distributed computing is a type of segmented or parallel computing, but the latter term is most commonly used to refer to processing in which different parts of a program run simultaneously on two or more processors that are part of the same computer.

What is a computer cluster? A computer cluster is a group of loosely coupled computers that work together so closely that in many respects they can be viewed as a single computer. The components of a cluster are commonly, but not always, connected to each other through fast local area networks.

[Figure: an example of a computer cluster]

Load-balancing clusters operate by having all workload come through one or more load-balancing front ends, which then distribute it to a collection of back-end servers. Job schedulers used for this purpose include Platform LSF HPC, Sun Grid Engine, Moab Cluster Suite and Maui Cluster Scheduler.

The term high performance computing (HPC) refers to the use of (parallel) supercomputers and computer clusters, that is, computing systems composed of multiple (usually mass-produced) processors linked together in a single system with commercially available interconnects. This is in contrast to mainframe computers, which are generally monolithic in nature. While a high level of technical skill is needed to assemble and use such systems, they can be created from off-the-shelf components. Because of their flexibility, power, and relatively low cost, HPC systems increasingly dominate the world of supercomputing. Usually, computer systems in or above the teraflops region are counted as HPC computers. The term is most commonly associated with computing used for scientific research.

Grid computing or grid clusters are a technology closely related to cluster computing. The key differences (by definitions that distinguish the two at all) between grids and traditional clusters are that grids connect collections of computers which do not fully trust each other, or which are geographically dispersed. An example of a grid is the Folding@home project (F@h). F@h is a distributed computing project designed to perform computationally intensive simulations of protein folding and other molecular dynamics (MD). It was launched on October 1, 2000, and is currently managed by the


Pande Group, within Stanford University's chemistry department, under the supervision of Professor Vijay Pande. F@h is the most powerful distributed computing cluster in the world, according to Guinness.

The TOP500 organization's semiannual list of the 500 fastest computers usually includes many clusters. As of November 2007, the top supercomputer is the Department of Energy's IBM BlueGene/L system, with a performance of 478.2 TFlops.

In computing, FLOPS (or flops or flop/s) is an acronym meaning FLoating point Operations Per Second. FLOPS is a measure of a computer's performance, especially in fields of scientific calculation that make heavy use of floating point operations, similar to instructions per second.

Peak floating-point performance can be derived as some constant times the cycle time (e.g., MHz) of a chip. It also varies depending on the instruction set in use. All the IA32 Intel architectures can do at most 1 FLOP per cycle (e.g., a 500 MHz PIII can theoretically reach at most 500 MFlops). For all AMD machines before the Athlon, this number is actually less than one. For the Athlon and later, however, it is 2 (e.g., a 500 MHz Athlon has a 1 GFlops theoretical peak). For the Pentium 4 and Athlon XP, the constant is 4. [http://www.cs.utk.edu/~rwhaley/ATLAS/x86.html]

Cost of computing from 1997 to 2007

[Graph: cost of computing in $ per GFLOPS, log scale from 0.1 to 100000, for the years 1996–2008.]
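The peak-performance rule quoted above is just the clock rate times the FLOPs-per-cycle constant, which is easy to check with shell arithmetic (the 2 GHz Pentium 4 clock below is an arbitrary illustrative value, not taken from the text):

```shell
# Theoretical peak in MFLOPS = clock (MHz) x FLOPs per cycle.
echo $((500 * 1))   # 500 MHz PIII, 1 FLOP/cycle: 500 MFLOPS
echo $((500 * 2))   # 500 MHz Athlon, 2 FLOPs/cycle: 1000 MFLOPS = 1 GFLOPS
echo $((2000 * 4))  # 2 GHz Pentium 4 (clock assumed), 4 FLOPs/cycle: 8000 MFLOPS
```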

Current Folding@home Status

OS Type             TFLOPS   Active CPUs   Total CPUs
Windows                174        183089      1914990
Mac OS X/PowerPC         7          9289       111260
Mac OS X/Intel          23          7296        36760
Linux                   41         23848       269458
GPU                     39           663         5119
PLAYSTATION®3          707         28503       420699
Total                  991        252688      2758286

Last updated at Fri, 01 Feb 2008 00:02:54

[http://fah-web.stanford.edu/cgi-bin/main.py?qtype=osstats]


Real cluster performance does not depend only on the CPU frequency and architecture of the compute nodes. Memory frequency and its communication with the CPU, the network card type, and hard disk capacity and speed can all limit cluster performance. Because of this, it is necessary to seek a balance among all components.

First, the compute nodes must have a motherboard with a Front Side Bus (FSB) compatible with the CPU frequency. The FSB (or system bus) is the physical bi-directional bus that carries all electronic signal information between the CPU and the northbridge. The northbridge, also known as the memory controller hub (MCH) in Intel systems, is traditionally one of the two chips in the core logic chipset on a PC motherboard, the other being the southbridge. [Wikipedia]

Second, the memory modules installed on the motherboard must be identical and run at a frequency compatible with the FSB. Third, the hard disks must be as fast as possible, prearranged to support RAID schemes to maximize access speed. RAID (Redundant Arrays of Inexpensive Disks) is a technology that supports the integrated use of two or more hard drives in various configurations, for the purposes of achieving greater performance, reliability through redundancy, and larger disk volumes through aggregation [Wikipedia]. Note that the operating system (OS) must support both the chipsets and RAID. Fourth, the cluster must have a network speed compatible with the needs of inter-node communication. Some parallelized programs communicate a lot, so the faster the nodes, the faster the network must be. Communication between the nodes must pass through a network switch, connected with appropriate cables. Fifth, the cluster must have a minimal infrastructure to work (UPS units, air conditioning…). Finally, it must have a well-defined usage philosophy. Some clusters have no job queues, while others do not allow users to log in to the nodes. The maintainers must weigh these choices carefully to maximize performance and minimize administration work.

The most obvious way to share computational resources among many users is batch processing. Batch processing is the execution of a series of programs ("jobs") on a computer without human interaction: batch jobs are set up so that all input data is pre-selected through scripts or command-line parameters. This is in contrast to interactive programs that prompt the user for such input. The main benefits of batch processing are that it allows computer resources to be shared among many users and keeps the overall rate of utilization high, which better amortizes the cost of a computer, especially an expensive one. Batch processing has been associated with mainframe computers since the earliest days of electronic computing in the 1950s. The program that controls batch processing is called a job scheduler. [Wikipedia]

Portable Batch System (or simply PBS) is the name of computer software that performs job scheduling. Its primary task is to allocate computational tasks, i.e., batch jobs, among the available computing resources. It is often used in conjunction with UNIX cluster environments. Several spin-offs of this software have resulted in various names; however, the overall architecture and command-line interface remain essentially the same. PBS was originally developed by MRJ for NASA in the early to mid-1990s. MRJ was taken over by Veridian, which was later taken over by Altair Engineering, which currently distributes PBS Pro commercially. [Wikipedia]

The following versions of Portable Batch System are currently available [Wikipedia]:

* OpenPBS — unsupported original open source version.
* TORQUE — a fork of OpenPBS; paid support available through Cluster Resources.
* PBS Professional (PBS Pro) — a version maintained and sold commercially by Altair Engineering.


SHARE IS THE SECRET OF SUCCESS

The great advantages of clustering are reduced cost and increased efficiency. A simple analysis shows that total processing capacity is maximized, and idle time vanishes, when computers are clustered. The graph below illustrates this with a hypothetical situation. The cyan color represents the use of a single CPU (or node) per user. There is always idle time between jobs, spent on data analysis, project planning, discussion, and report writing. The blue and green colors represent the shared use of two CPUs (or nodes) by two users. In this configuration there is no computer idle time, and the gain is maximized. Counterintuitively, the user time available for data analysis and project planning actually increases. The two black lines in the graph give the total performance: the continuous line represents shared use, and the dashed line represents single use. In this hypothetical situation, shared use is 25% more efficient than single use; equivalently, 80 days of single use accomplish the same as 63 days of shared use.

Real clusters and scientific groups are not as regular as that hypothetical situation; real situations should look like the graph on the right. Preserving the idle user time for data analysis and planning, any random workload will keep shared-use performance above single use. Moreover, the more CPUs (or nodes) there are, the greater the efficiency of the cluster.

There are ways to increase cluster performance even more: the better the job scheduler, the more efficient the cluster. The most common policies are based on wall time, CPU time, number of CPUs, nodes, and amount of memory requested. All of these are controlled through job and queue priority, compute resource availability, execution time allocated to the user, the number of simultaneous jobs allowed per user, and estimated and elapsed execution time. More information about the efficiency gains from the job scheduler can easily be found on the internet.
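As an illustration only, such policies are typically encoded on a TORQUE server with qmgr; the limits below are hypothetical examples, not this cluster's actual settings:

```shell
# Hypothetical queue limits, set as root on the PBS server.
qmgr -c "set queue fast resources_max.walltime = 01:00:00"  # wall-time cap for quick tests
qmgr -c "set queue fast max_user_run = 1"                   # one simultaneous job per user
qmgr -c "set queue heavy resources_max.mem = 4gb"           # illustrative per-job memory cap
qmgr -c "set queue heavy max_user_run = 2"                  # two simultaneous jobs per user
```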

[Two graphs: processing capacity and total performance (relative to single-CPU use) over 80 days, comparing "1 single CPU per user" against "2 shared CPUs per 2 users"; the first shows the hypothetical regular workload, the second a random workload.]


HOW OUR CLUSTER WORKS

The cluster is composed of one server connected to the nodes through a private local area network (LAN). The server has two network cards: one to communicate with the nodes (knight.cluster) and the other to communicate with the outside world (cavalier.if.usp.br). Each node is an independent machine, with its own local hard disk and OS. The users cannot log in to the nodes because they have no passwords there. So that input files can be copied to the nodes' work directory (/rhome), the server's home directory (/home) is exported via the Network File System (NFS) protocol. To execute these input files, the user must write a script and submit it to the pbs_server program. The pbs_sched program will search the system for a free node with the requested characteristics (number of CPUs, amount of memory) and send the script to the pbs_mom at that node. If no free node with the requested characteristics is available, the job stays queued at the server until such a node becomes free.

The pbs_mom uses the user identification (UID) to execute the script; therefore, the nodes must have the same UIDs as the server. Usually this is done through the Network Information Service (NIS), a.k.a. Yellow Pages (YP). On this cluster we preferred not to use that protocol, and instead created all users on each node. This choice reflects the fact that users do not log in to the nodes. The OS used on this cluster is Slackware Linux; on 64-bit machines, a variant called Slamd64 was used. Other Linux systems could be used without problems; however, it is better to standardize the machines to reduce administration work. The choice of Slackware is justified by its low load, stability and simplicity.

The server has only light and graphical programs installed. This is done to discourage users from running jobs at the command line, and to force them to use PBS. To use a graphical program on the server, like MOLDEN or VMD, the user must first execute the following command on their local machine:

xhost cavalier.if.usp.br

Obviously, the user's local machine must have an X Window System running. The server is configured to export the display to any client that connects from there.
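A typical remote-display session would look like the following sketch; the workstation name mypc.if.usp.br and the lowercase command name molden are hypothetical stand-ins:

```shell
# On the user's local machine (hypothetical name mypc.if.usp.br), running an X server:
xhost cavalier.if.usp.br           # allow the server to draw on this display
ssh user@cavalier.if.usp.br
# Now on cavalier: point X clients at the local display and start the program.
export DISPLAY=mypc.if.usp.br:0.0  # display name is an assumption for this example
molden                             # command name assumed; could equally be vmd
```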

[Diagram: the user reaches the server (cavalier/knight, running pbs_server and pbs_sched) via SSH; the server's /home is exported via NFS to each node (running pbs_mom), and each node's local /rhome can be mounted back on the server under /rh/node.]


Making the script

The batch scripts must contain the following instructions:

source /etc/profile                # Provides the environment commands and paths.
cd /home/user/inputdirectory       # Enter the input files directory, at the server.
mkdir ~/tmp-date-workname          # Create a temporary work directory, at the node.
                                   # All jobs run faster on local disks, and the network
                                   # is spared a very large flow of data! The node
                                   # understands ~/ as the same as /rhome/user.
cp inputfiles ~/tmp-date-workname  # Copy the input files to the temporary work directory.
cd ~/tmp-date-workname             # Enter the temporary work directory.
execution commands
cp outputfiles /home/user/inputdirectory  # Copy the output files back to the input directory.

The command smaker can make this script automatically. The smaker syntax is:

smaker jobname type

The currently available types are gau (Gaussian), dice (Dice) and mol (Molcas); any other type makes a generic script. The output is the file jobname.s. The user must edit this file to set the appropriate instructions; however, the type and jobname are sufficient to produce a good starting scratch of the final batch script. Once edited, the job is submitted with the command:

qsub jobname.s

Other commands are useful to supervise the job:

qstat -n        # Show all running and queued jobs on the server.
nodes           # Show the status of all nodes.
mnode nodename  # Mount the node work directory (/rhome/user) into the user's
                # ~/nodes/nodename directory. The nodes are unmounted every hour,
                # to minimize network traffic.
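The copy-run-copy-back pattern above can be tried outside the cluster; this minimal self-contained sketch uses mktemp directories in place of /home/user/inputdirectory and the node work directory, and tr in place of the real execution command:

```shell
#!/bin/sh
# Self-contained demo of the batch-script pattern (all paths are stand-ins).
INPUTDIR=$(mktemp -d)             # stands in for /home/user/inputdirectory on the server
WORKDIR=$(mktemp -d)              # stands in for ~/tmp-date-workname on the node
echo "hello" > "$INPUTDIR/job.in"

cp "$INPUTDIR/job.in" "$WORKDIR"  # copy input to the fast local work directory
cd "$WORKDIR"
tr a-z A-Z < job.in > job.out     # stands in for the real execution command
cp job.out "$INPUTDIR"            # copy results back to the server's home
RESULT=$(cat "$INPUTDIR/job.out")
echo "$RESULT"                    # prints: HELLO
```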


EXAMPLE 1 – GAUSSIAN (in this example, the user is barreto)

smaker water gau    # Makes the water.s file.

water.s

All lines beginning with # are ignored by pbs_mom. The first line gives the date the script was made. The #PBS line requests from pbs_sched one node with two CPUs (ppn, processors per node). The uname and echo lines create a file recording the node name and the dates when the job started and finished. To execute the water.com input file and take back its output file (water.log), the following lines should be uncommented:

cp water.com ~/gau-0203-1724-water
g03 water.com
cp water.log /home/barreto

To execute several .com files at once and get their output files, the following lines should be uncommented instead:

cp *.com ~/gau-0203-1724-water
for i in *.com; do g03 $i; done
cp *.log /home/barreto

This will execute all .com files from the input directory.

# Gerado por smaker em Sun Feb 3 17:24:34 BRST 2008
#PBS -l nodes=1:ppn=2
source /etc/profile
uname -snmo >> /home/barreto/water-0203-1724
echo "Iniciou em $(date)" >> /home/barreto/water-0203-1724
cd /home/barreto
mkdir ~/gau-0203-1724-water
################################################
# cp *.com ~/gau-0203-1724-water
# cp water.com ~/gau-0203-1724-water
cd ~/gau-0203-1724-water
# g03 water.com
# for i in *.com; do g03 $i; done
# cp water.log /home/barreto
# cp *.log /home/barreto
################################################
echo "Terminou em $(date)" >> /home/barreto/water-0203-1724
echo "" >> /home/barreto/water-0203-1724


EXAMPLE 2 – DICE (again, the user is barreto)

smaker water dice    # Makes the water.s file. If it already exists, smaker will overwrite it.

water.s

Now the script requests one node with only one CPU (there is no need for a #PBS line in this situation). Again, to execute water.in (starting from the water.dat configuration), the following lines should be uncommented:

cp water.in water.dat water.txt ~/dice-0203-1830-water
dice < water.in > water.out
cp water.out water.dat /home/barreto

Note that the user is free to write their own script; the smaker command is just a tool to automate the process.

# Gerado por smaker em Sun Feb 3 18:30:05 BRST 2008
#PBS -l nodes=1:ppn=1
source /etc/profile
uname -snmo >> /home/barreto/water-0203-1830
echo "Iniciou em $(date)" >> /home/barreto/water-0203-1830
cd /home/barreto
mkdir ~/dice-0203-1830-water
################################################
# cp *.in *.dat *.txt ~/dice-0203-1830-water
# cp water.in water.dat water.txt ~/dice-0203-1830-water
cd ~/dice-0203-1830-water
# for i in *.in; do dice < $i > $i.out; done
# dice < water.in > water.out
# cp *.out *.dat /home/barreto
# cp water.out water.dat /home/barreto
# cp *.* /home/barreto
################################################
echo "Terminou em $(date)" >> /home/barreto/water-0203-1830
echo "" >> /home/barreto/water-0203-1830
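The commented for-loop handles any number of inputs; a self-contained sketch of that pattern follows, with tr standing in for dice and throwaway temp files for the real inputs:

```shell
#!/bin/sh
# Demo of the one-run-per-input loop used in the scripts above.
DIR=$(mktemp -d); cd "$DIR"
printf 'aaa' > first.in
printf 'bbb' > second.in
for i in *.in; do
  tr a-z A-Z < "$i" > "$i.out"   # stands in for: dice < $i > $i.out
done
RESULT=$(cat first.in.out)
echo "$RESULT"                   # prints: AAA
```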


Babysitting the job

The first thing to do after the qsub command is to execute:

qstat -n

knight.cluster:
                                                                   Req'd  Req'd   Elap
Job ID               Username Queue    Jobname    SessID  NDS  TSK Memory Time  S Time
-------------------- -------- -------- ---------- ------ ----- --- ------ ----- - -----
206.knight.cluster   barreto  heavy    benzene.s    6853     1  --     --    -- R 00:02
   c2q1/3+c2q1/2+c2q1/1+c2q1/0
207.knight.cluster   barreto  heavy    water.s      6748     1  --     --    -- R 00:02
   c2d2/1+c2d2/0
208.knight.cluster   barreto  heavy    phenol.s       --     1  --     --    -- Q   --
   --

In this example, all the jobs belong to the user barreto and to the heavy queue. The cluster has two queues: heavy and fast. Heavy is the default queue, while fast exists only for quick tests. The first and second jobs are running (R status), while the third is queued (Q status). Users can run only two jobs at once on the heavy queue. The first job is using four CPUs (c2q1/3+c2q1/2+c2q1/1+c2q1/0) of node c2q1; the second is using two CPUs (c2d2/1+c2d2/0) of node c2d2.

To see the output file (like water.log) before the job ends, the user can execute:

mnode c2d2

After that, c2d2:/rhome (where the output file is being written) is mounted on knight:/rh/c2d2. This can be verified with the command:

df -h

Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1              19G  4.7G   13G  27% /
/dev/sdb1              19G  1.6G   16G   9% /usr/local
/dev/md/0             419G  2.6G  395G   1% /home
c2d2:/rhome            49G  530M   46G   2% /rh/c2d2

Inside the user home directory, there is a link to the node directory:

ls ~/nodes/c2d2/gau-0203-1851-water
water.com  water.log

If the job dies for any reason, the user must copy the output files back from the node by hand.


Frequent situations

Killing a job

Sometimes a mistaken job needs to be killed. To do this, the user must find its Job ID number (JID) with the qstat command. In the previous example, the JID of the water batch script was 207, so to kill this job the user executes:

qdel 207

Only the owner of the job or the PBS operators can kill a job.

Rescuing an interrupted job

A single-point Gaussian calculation with 300 electrons, using a soft post-Hartree-Fock method and a polarized augmented basis set, can take many hours. The nodes are equipped with a relatively large amount of memory, sufficient to do a direct CIS(D)/aug-cc-pVDZ calculation on a 300-electron system. If the job is interrupted for any reason, there are ways to restart it close to where it stopped. Two cases arise. In the first case, the system is too big and no node supports a direct calculation. Inevitably, Gaussian will use a lot of disk to do the calculation, making it an order of magnitude longer. To restart the job, the input file will need to read its checkpoint file (chk). The Dice program has the same problem; to restart its job, the input file will need to read the last configuration file (dat). The second case arises when each single-point calculation is relatively fast, but the job contains dozens of them. To restart the job, the user needs to change the script file so that it will not execute again the calculations already done. The output and checkpoint files can always be copied to the server with the help of the mnode command.

The job does not start!

If the job status is queued (Q):
1) You have other jobs running, and you reached the limit of jobs per user.
2) All nodes with the characteristics you requested are occupied.

If the job is not queued (Q), but the status is exiting (E) or it runs (R) for just a while:
1) The script has the copying and executing lines commented, missing or wrong.
2) The input files do not exist or are wrong.
3) The executing program does not exist on the node.

If you have checked every possibility and found nothing, contact the administrator.
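For the second rescue case described above, the loop in the batch script can be made restartable by skipping inputs that already have an output. A self-contained sketch of that logic (touch stands in for the real g03 run, and the file names are invented for the demo):

```shell
#!/bin/sh
# Restart-logic demo: inputs with an existing .log are not run again.
DIR=$(mktemp -d); cd "$DIR"
touch a.com b.com b.log     # pretend b.com already finished before the crash
RAN=""
for i in *.com; do
  log="${i%.com}.log"
  [ -f "$log" ] && continue # skip calculations already done
  touch "$log"              # stands in for: g03 $i
  RAN="$RAN $i"
done
echo "ran:$RAN"             # prints: ran: a.com
```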


SETTING UP THE CLUSTER (ADMINISTRATION)

## INSTALLING THE SERVER ##

## take the current slackware or slamd64 from a DVD ##
## log as root ##
fdisk /dev/sda
fdisk /dev/sdb
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sd[ab]1
mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/sd[ab]3
## the partitions should be like this ##
Disk /dev/sda: 250.0 GB, 250059350016 bytes

Device Boot Start End Blocks Id System

/dev/sda1 * 1 1946 15631213+ fd Linux raid autodetect

/dev/sda2 1947 2008 498015 82 Linux swap

/dev/sda3 2009 30401 228066772+ fd Linux raid autodetect

Disk /dev/sdb: 250.0 GB, 250059350016 bytes

Device Boot Start End Blocks Id System

/dev/sdb1 * 1 1946 15631213+ fd Linux raid autodetect

/dev/sdb2 1947 2008 498015 82 Linux swap

/dev/sdb3 2009 30401 228066772+ fd Linux raid autodetect

Disk /dev/md0: 16.0 GB, 16006250496 bytes

Disk /dev/md1: 233.5 GB, 233540288512 bytes

## prepare to install ##
setup
## add swap ##
## format the partitions with ext3 ##
## the file system should be like this ##
Filesystem Size Mounted on

/dev/md/0 15G /

/dev/md/1 215G /home

## setup the network with hostname knight, domain cluster, ip 192.168.1.1 ##
## make sure eth0 is the onboard gigabit Ethernet card ##
## add the new ip number to the /etc/hosts of the server and nodes ##
## recompile the kernel to improve performance ##
## incomplete ##


## ADDING A NODE ##

## take the current slackware or slamd64 from a DVD ##
## log as root ##
fdisk /dev/sda
fdisk /dev/sdb
fdisk /dev/sdc
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sd[bc]1
## the partitions should be like this ##
Disk /dev/sda: 250.0 GB, 250059350016 bytes

Device Boot Start End Blocks Id System

/dev/sda1 * 1 1217 9775521 83 Linux

/dev/sda2 1218 1461 1959930 82 Linux swap

/dev/sda3 1462 2070 4891792+ 83 Linux

/dev/sda4 2071 30401 227568757+ 83 Linux

Disk /dev/sdb: 250.0 GB, 250059350016 bytes

Device Boot Start End Blocks Id System

/dev/sdb1 1 30401 244196001 fd Linux raid autodetect

Disk /dev/sdc: 250.0 GB, 250059350016 bytes

Device Boot Start End Blocks Id System

/dev/sdc1 1 30401 244196001 fd Linux raid autodetect

Disk /dev/md0: 500.1 GB, 500113211392 bytes

## prepare to install ##
setup
## add swap ##
## format the partitions with ext3 ##
## the file system should be like this ##
Filesystem Size Mounted on

/dev/sda1 9.2G /

/dev/sda3 4.6G /usr/local

/dev/sda4 214G /rhome

/dev/md/0 459G /scr

## setup the network with the next cluster ip number ##
## add the new ip number to the /etc/hosts of the server and nodes ##
## recompile the kernel to improve performance ##
## if the keyboard is weird ##
loadkeys us
## or ##
loadkeys br-abnt2
## set off the execution mode of unused daemons from /etc/rc.d ##
chmod -x rc.alsa rc.gpm rc.wireless rc.yp
## copy the server's /etc/hosts to resolve the machine names ##
rsync -av 192.168.1.1:/etc/hosts /etc/hosts


## get the server's .ssh passphrase to freely change between nodes and server ##
rsync -avr knight:~/.ssh ~/
## copy torque, g03, dice, tinker, dalton directories from another node to /usr/local ##
## install torque mom ##
for i in torque-package-*.sh; do sh $i --install; done
echo '$pbsserver knight.cluster
# /home is NFS mounted on all hosts
$usecp *.cluster:/home /home
$usecp *.cluster:/rhome /rhome' > /var/spool/torque/mom_priv/config
## prepare to mount the server's /home at the node ##
echo "knight:/home /home nfs users,hard,intr,rw,retry=1,bg 0 0" >> /etc/fstab
rpc.mountd
rpc.nfsd
## clean the default /home system files and directories, and mount it ##
rm -r /home/*
mount /home
## prepare to export the node's /rhome and /scr to the server ##
echo "/rhome knight(rw)
/scr knight(rw)" >> /etc/exports
exportfs -r
## prepare to create the user accounts ##
## copy the nodes_users and nodes_users_new from another node to /usr/local/sbin ##
nodes_users
## prepare to copy the programs ##
## copy the intel and pgi compilers from another node to /opt ##
## prepare the lib environment ##
echo "/opt/pgi/linux86-64/7.1-5/libso
/opt/pgi/linux86-64/7.1-5/lib
/opt/pgi/linux86-64/7.1-5/libso-gh
/opt/pgi/linux86-64/7.1-5/lib-gh
/opt/intel/cce/10.1.012/lib
/opt/intel/fce/10.1.012/lib
/opt/intel/cmkl/10.0.1.014/lib/em64t
/opt/intel/mkl/10.0.1.014/lib/em64t" >> /etc/ld.so.conf
ldconfig
## prepare the /etc/profile to behave adequately ##
## copy the program's paths from another node's /etc/profile ##


## add the /rhome and /scr in the server's /etc/fstab ##
echo "node:/rhome /rh/node nfs noauto,users,hard,intr,rw,retry=1,bg 0 0
node:/scr /rh/scr_node nfs noauto,users,hard,intr,rw,retry=1,bg 0 0" >> /etc/fstab
## create the mount point in the server ##
mkdir /rh/node
## create the link to the new node in the /home/user/nodes directory ##
server_nodes_new node
## create the node in the PBS server (interactive qmgr session) ##
qmgr
create node node
## set the correct number of cores ##
set node node np = n
q
## copy the rc.torque-mom script from another node, and execute it ##
## add the execution instruction to the rc.M ##
echo "# Inicia o pbs-mom.
if [ -x /etc/rc.d/rc.torque-mom ]; then
  . /etc/rc.d/rc.torque-mom
fi" >> rc.M
## finito! ##