software stack on the worker nodes amd opteron/intel westmere redhat 64bit scientific linux...

31
Software stack on the worker nodes AMD Opteron/Inte l Westmere Redhat 64bit Scientific Linux Portland, GNU,Intel OpenMPI Sun Grid Engine v6 Ganglia

Upload: tamara-reaney

Post on 01-Apr-2015

216 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Software stack on the worker nodes AMD Opteron/Intel Westmere Redhat 64bit Scientific Linux Portland, GNU,Intel OpenMPI Sun Grid Engine v6 Ganglia

Software stack on the worker nodes

AMD Opteron/Intel

Westmere

Redhat 64bit

Scientific Linux

Portland, GNU,Intel

OpenMPI

Sun Grid Engine v6

Ganglia

Page 2: Software stack on the worker nodes AMD Opteron/Intel Westmere Redhat 64bit Scientific Linux Portland, GNU,Intel OpenMPI Sun Grid Engine v6 Ganglia

Keeping up-to-date with application packages From time to time the application packages get updated.

When this happens;– news articles inform us of the changes via iceberg news. To

read the news- type news on iceberg or check the URL: http://www.wrgrid.group.shef.ac.uk/icebergdocs/news.dat

– The previous versions or the new test versions of software are normally accessed via the version number. For example;

abaqus69 , nagexample23 , matlab2011a

Page 3: Software stack on the worker nodes AMD Opteron/Intel Westmere Redhat 64bit Scientific Linux Portland, GNU,Intel OpenMPI Sun Grid Engine v6 Ganglia

Running application packages in batch queues

icberg have locally provided commands for running some of the popular applications in batch mode. These are;

runfluent , runansys , runmatlab , runabaqus

To find out more just type the name of the command on its own while on iceberg.

Page 4: Software stack on the worker nodes AMD Opteron/Intel Westmere Redhat 64bit Scientific Linux Portland, GNU,Intel OpenMPI Sun Grid Engine v6 Ganglia

Setting up your software development environment

• Excepting the scratch areas on worker nodes, the view of the filestore is identical on every worker.

• You can setup your software environment for each job by means of the module commands.

• All the available software environments can be listed by using the module avail command.

• Having discovered what software is available on iceberg, you can then select the ones you wish to use by using-      module add       or     module load commands

• You can load as many non-clashing modules as you need by consecutive module add commands.

• You can use the module list command to check the list of currently loaded modules.

Page 5: Software stack on the worker nodes AMD Opteron/Intel Westmere Redhat 64bit Scientific Linux Portland, GNU,Intel OpenMPI Sun Grid Engine v6 Ganglia

Software development environment

• Compilers– PGI compilers

– Intel Compilers

– GNU compilers

• Libraries– NAG Fortran libraries ( MK22 , MK23 )

– NAG C libraries ( MK 8 )

• Parallel programming related– OpenMP

– MPI ( OpenMPI , mpich2 , mvapich )

Page 6: Software stack on the worker nodes AMD Opteron/Intel Westmere Redhat 64bit Scientific Linux Portland, GNU,Intel OpenMPI Sun Grid Engine v6 Ganglia

Managing Your Jobs Sun Grid Engine Overview

SGE is the resource management system, job scheduling and

batch control system. (Others available such as PBS, Torque/Maui, Platform LSF )

• Starts up interactive jobs on available workers• Schedules all batch orientated ‘i.e. non-interactive’ jobs • Attempts to create a fair-share environment• Optimizes resource utilization

Page 7: Software stack on the worker nodes AMD Opteron/Intel Westmere Redhat 64bit Scientific Linux Portland, GNU,Intel OpenMPI Sun Grid Engine v6 Ganglia

Job scheduling on the cluster

SGE worker node

SGE worker node

SGE worker node

SGE worker node

SGE worker node

SGE MASTER node

Queue-A Queue-B Queue-C

A S

lot 1

A S

lot 2

B S

lot 1

C S

lot 1

C S

lot 2

C S

lot 3

B S

lot 1

B S

lot 2

B S

lot 3

B S

lot 1

C S

lot 1

C S

lot 2

A S

lot 1

B S

lot 1

C S

lot 1

Queues

Policies

Priorities

Share/Tickets

Resources

Users/Projects

JOB Y JOB ZJOB X

JOB U

JOB OJOB N

Page 8: Software stack on the worker nodes AMD Opteron/Intel Westmere Redhat 64bit Scientific Linux Portland, GNU,Intel OpenMPI Sun Grid Engine v6 Ganglia

Submitting your job

There are two SGE commands submitting jobs;– qsh or qrsh : To start an interactive job– qsub : To submit a batch job

There are also a list of home produced commands for submitting some of the popular applications to the batch system. They all make use of the qsub command.

These are;

runfluent , runansys , runmatlab , runabaqus

Page 9: Software stack on the worker nodes AMD Opteron/Intel Westmere Redhat 64bit Scientific Linux Portland, GNU,Intel OpenMPI Sun Grid Engine v6 Ganglia

Managing Jobs monitoring and controlling your jobs• There are a number of commands for querying and

modifying the status of a job running or waiting to run. These are;– qstat or Qstat (query job status)– qdel (delete a job)

– qmon ( a GUI interface for SGE )

Page 10: Software stack on the worker nodes AMD Opteron/Intel Westmere Redhat 64bit Scientific Linux Portland, GNU,Intel OpenMPI Sun Grid Engine v6 Ganglia

Running Jobs Example: Submitting a serial batch job

Use editor to create a job script in a file (e.g. example.sh):

#!/bin/bash# Scalar benchmark echo ‘This code is running on’ `hostname`date

./linpack

Submit the job: qsub example.sh

Page 11: Software stack on the worker nodes AMD Opteron/Intel Westmere Redhat 64bit Scientific Linux Portland, GNU,Intel OpenMPI Sun Grid Engine v6 Ganglia

Running Jobsqsub and qsh options

-l h_rt=hh:mm:ss The wall clock time. This parameter must be specified, failure to include this parameter will result in the error message: “Error: no suitable queues”. Current default is 8 hours .

-l arch=intel*

-l arch=amd*

Force SGE to select either Intel or AMD architecture nodes. No need to use this parameter unless the code has processor dependency.

-l mem=memory sets the virtual-memory limit e.g. –l mem=10G (for parallel jobs this is per processor and not total).

Current default if not specified is 6 GB .

-l rmem=memory Sets the limit of real-memory required

Current default is 2 GB.Note: rmem parameter must always be less than mem.

-help Prints a list of options

-pe ompigige np

-pe openmpi-ib np

-pe openmp np

Specifies the parallel environment to be used. np is the number of processors required for the parallel job.

Page 12: Software stack on the worker nodes AMD Opteron/Intel Westmere Redhat 64bit Scientific Linux Portland, GNU,Intel OpenMPI Sun Grid Engine v6 Ganglia

Running Jobs

qsub and qsh options ( continued)

-N jobname By default a job’s name is constructed from the job-script-file-name and the job-id that is allocated to the job by SGE.

This options defines the jobname. Make sure it is unique because the job output files are constructed from the jobname.

-o output_file Output is directed to a named file. Make sure not to overwrite important files by accident.

-j y Join the standard output and standard error output streams recommended

-m [bea]

-M email-address

Sends emails about the progress of the job to the specified email address. If used, both –m and –M must be specified. Select any or all of the b,e and a to imply emailing when the job begins, ends or aborts.

-P project_name Runs a job using the specified projects allocation of resources.

-S shell Use the specified shell to interpret the script rather than the default bash shell. Use with care.

A better option is to specify the shell in the first line of the job script. E.g. #!/bin/bash

-V Export all environment variables currently in effect to the job.

Page 13: Software stack on the worker nodes AMD Opteron/Intel Westmere Redhat 64bit Scientific Linux Portland, GNU,Intel OpenMPI Sun Grid Engine v6 Ganglia

Running Jobsbatch job example

qsub example:

qsub –l h_rt=10:00:00 –o myoutputfile –j y myjob

OR alternatively … the first few lines of the submit script myjob contains -

$!/bin/bash$# -l h_rt=10:00:00 $# -o myoutputfile$# -j y

and you simply type; qsub myjob

Page 14: Software stack on the worker nodes AMD Opteron/Intel Westmere Redhat 64bit Scientific Linux Portland, GNU,Intel OpenMPI Sun Grid Engine v6 Ganglia

Running JobsInteractive jobs

qsh , qrsh

• These two commands, find a free worker-node and start an interactive session for you on that worker-node.

• This ensures good response as the worker node will be dedicated to your job.• The only difference between qsh and qrsh is that ;

qsh starts a session in a new command window where asqrsh uses the existing window. Therefore, if your terminal connection does not support graphics ( i.e. XWindows) than qrsh will continue to work where as qsh will fail to start.

Page 15: Software stack on the worker nodes AMD Opteron/Intel Westmere Redhat 64bit Scientific Linux Portland, GNU,Intel OpenMPI Sun Grid Engine v6 Ganglia

Running JobsA note on interactive jobs

• Software that requires intensive computing should be run on the worker nodes and not the head node.

• You should run compute intensive interactive jobs on the worker nodes by using the qsh or qrsh command.

• Maximum ( and also default) time limit for interactive jobs is 8 hours.

Page 16: Software stack on the worker nodes AMD Opteron/Intel Westmere Redhat 64bit Scientific Linux Portland, GNU,Intel OpenMPI Sun Grid Engine v6 Ganglia

Managing JobsMonitoring your jobs by qstat or Qstat

Most of the time you will be interested only in the progress of your own jobs thro’ the system.

• Qstat command gives a list of all your jobs ‘interactive & batch’ that are known to the job scheduler. As you are in direct control of your interactive jobs, this command is useful mainly to find out about your batch jobs ( i.e. qsub’ed jobs).

• qstat command ‘note lower-case q’ gives a list of all the executing and waiting jobs by everyone.

• Having obtained the job-id of your job by using Qstat, you can get further information about a particular job by typing ;

qstat –f -j job_id You may find that the information produced by this command is far more than you care for, in that case the following command can be used to find out about memory usage for example;

qstat –f -job_id | grep usage

Page 17: Software stack on the worker nodes AMD Opteron/Intel Westmere Redhat 64bit Scientific Linux Portland, GNU,Intel OpenMPI Sun Grid Engine v6 Ganglia

Managing Jobsqstat example output

State can be: r=running , qw=waiting in the queue, E=error state. t=transfering’just before starting to run’ h=hold waiting for other jobs to finish.

job-ID prior name user state submit/start at queue slots ja-task-ID -------------------------------------------------------------------------------------------------------------------------------------------------

206951 0.51000 INTERACTIV bo1mrl r 07/05/2005 09:30:20 [email protected] 1 206933 0.51000 do_batch4 pc1mdh r 07/04/2005 16:28:20 [email protected]. 1 206700 0.51000 l100-100.m mb1nam r 07/04/2005 13:30:14 [email protected]. 1 206698 0.51000 l50-100.ma mb1nam r 07/04/2005 13:29:44 [email protected]. 1 206697 0.51000 l24-200.ma mb1nam r 07/04/2005 13:29:29 [email protected]. 1 206943 0.51000 do_batch1 pc1mdh r 07/04/2005 17:49:45 [email protected]. 1 206701 0.51000 l100-200.m mb1nam r 07/04/2005 13:30:44 [email protected]. 1 206705 0.51000 l100-100sp mb1nam r 07/04/2005 13:42:07 [email protected]. 1 206699 0.51000 l50-200.ma mb1nam r 07/04/2005 13:29:59 [email protected]. 1 206632 0.56764 job_optim2 mep02wsw r 07/03/2005 22:55:30 [email protected] 18 206600 0.61000 mrbayes.sh bo1nsh qw 07/02/2005 11:22:19 [email protected] 24 206911 0.51918 fluent cpp02cg qw 07/04/2005 14:19:06 [email protected] 4

Page 18: Software stack on the worker nodes AMD Opteron/Intel Westmere Redhat 64bit Scientific Linux Portland, GNU,Intel OpenMPI Sun Grid Engine v6 Ganglia

Managing Jobs Deleting/cancelling jobsqdel command can be used to cancel running jobs or remove

from the queue the waiting jobs.• To cancel an individual Job;

qdel job_id Example: qdel 123456• To cancel a list of jobs

qdel job_id1 , job_id2 , so on …• To cancel all jobs belonging to a given username qdel –u username

Page 19: Software stack on the worker nodes AMD Opteron/Intel Westmere Redhat 64bit Scientific Linux Portland, GNU,Intel OpenMPI Sun Grid Engine v6 Ganglia

Managing Jobs

Job output files

• When a job is queued it is allocated a jobid ( an integer ).

• Once the job starts to run normal output is sent to the output (.o) file and the error o/p is sent to the error (.e) file.

▬ The default output file name is: <script>.o<jobid>▬ The default error o/p filename is: <script>.e<jobid>

• If -N parameter to qsub is specified the respective output files become <name>.o<jobid> and <name>.e<jobid>

• –o or –e parameters can be used to define the output files to use.

• -j y parameter forces the job to send both error and normal output into the same (output) file (RECOMMENDED )

Page 20: Software stack on the worker nodes AMD Opteron/Intel Westmere Redhat 64bit Scientific Linux Portland, GNU,Intel OpenMPI Sun Grid Engine v6 Ganglia

Monitoring the job output files• The following is an example of submitting a SGE job

and checking the output produced qsub myjob.sh

job <123456> submitted

qstat –f –j 123456 (is the job running ?)

When the job starts to run, type ;

tail –f myjob.sh.o123456

Page 21: Software stack on the worker nodes AMD Opteron/Intel Westmere Redhat 64bit Scientific Linux Portland, GNU,Intel OpenMPI Sun Grid Engine v6 Ganglia

Problem Session

• Problem 10– Only up to test 5

Page 22: Software stack on the worker nodes AMD Opteron/Intel Westmere Redhat 64bit Scientific Linux Portland, GNU,Intel OpenMPI Sun Grid Engine v6 Ganglia

Managing Jobs

Reasons for job failures

– SGE cannot find the binary file specified in the job script – You ran out of file storage. It is possible to exceed your

filestore allocation limits during a job that is producing large output files. Use the quota command to check this.

– Required input files are missing from the startup directory– Environment variable is not set correctly

(LM_LICENSE_FILE etc)– Hardware failure (eg. mpi ch_p4 or ch_gm errors)

Page 23: Software stack on the worker nodes AMD Opteron/Intel Westmere Redhat 64bit Scientific Linux Portland, GNU,Intel OpenMPI Sun Grid Engine v6 Ganglia

Finding out the memory requirements of a job• Virtual Memory Limits:

▬ Default virtual memory limits for each job is 6 GBytes▬ Jobs will be killed if virtual memory used by the job exceeds the amount

requested via the –l mem= parameter.• Real Memory Limits:

▬ Default real memory allocation is 2 GBytes ▬ Real memory resource can be requested by using –l rmem= ▬ Jobs exceeding the real memory allocation will not be deleted but will run with

reduced efficiency and the user will be emailed about the memory deficiency.▬ When you get warnings of that kind, increase the real memory allocation for your

job by using the –l rmem= parameter.▬ rmem must always be less than mem

Determining the virtual memory requirements for a job;▬ qstat –f –j jobid | grep mem▬ The reported figures will indicate

- the currently used memory ( vmem )- Maximum memory needed since startup ( maxvmem)- cumulative memory_usage*seconds ( mem )

▬ When you run the job next you need to use the reported value of vmem to specify the memory requirement

Page 24: Software stack on the worker nodes AMD Opteron/Intel Westmere Redhat 64bit Scientific Linux Portland, GNU,Intel OpenMPI Sun Grid Engine v6 Ganglia

Managing JobsRunning arrays of jobs

• Add the –t parameter to the qsub command or script file (with #$ at beginning of the line)– Example: –t 1-10

• This will create 10 tasks from one job• Each task will have its environment variable

$SGE_TASK_ID set to a single unique value ranging from 1 to 10.

• There is no guarantee that task number m will start before task number n , where m<n .

Page 25: Software stack on the worker nodes AMD Opteron/Intel Westmere Redhat 64bit Scientific Linux Portland, GNU,Intel OpenMPI Sun Grid Engine v6 Ganglia

Managing Jobs Running cpu-parallel jobs

• Parallel environment needed for a job can be specified by the: -pe <env> nn parameter of qsub command, where <env> is..

▬ openmp : These are shared memory OpenMP jobs and therefore must run on a single node using its multiple processors.

▬ ompigige : OpenMPI library- Gigabit Ethernet. These are MPI jobs running on multiple hosts using the ethernet fabric ( 1Gbit/sec)

▬ openmpi-ib : OpenMPI library-Infiniband. These are MPI jobs running on multiple hosts using the Infiniband Connection ( 32GBits/sec )

▬ mvapich2-ib : Mvapich-library-Infiniband. As above but using the MVAPICH MPI library.

• Compilers that support MPI. ▬ PGI ▬ Intel▬ GNU

Page 26: Software stack on the worker nodes AMD Opteron/Intel Westmere Redhat 64bit Scientific Linux Portland, GNU,Intel OpenMPI Sun Grid Engine v6 Ganglia

Setting up the parallel environment in the job script.

• Having selected the parallel environment to use via the qsub –pe parameter, the job script can define a corresponding environment/compiler combination to be used for MPI tasks.

• MODULE commands help setup the correct compiler and MPI transport environment combination needed.

List of currently available MPI modules are; mpi/pgi/openmpi mpi/pgi/mvapich2

mpi/intel/openmpi mpi/intel/mvapich2 mpi/gcc/openmpi mpi/gcc/mvapich2

• For GPU programming with CUDA libs/cuda

• Example: module load mpi/pgi/openmpi

Page 27: Software stack on the worker nodes AMD Opteron/Intel Westmere Redhat 64bit Scientific Linux Portland, GNU,Intel OpenMPI Sun Grid Engine v6 Ganglia

Summary of module load parameters for parallel MPI environments

TRANSPORTqsub parameter -------------------------COMPILER to use

-pe openmpi-ib -pe ompigige

-pe mvapich2-ib

PGI mpi/pgi/openmpi mpi/pgi/mvapich2

Intel mpi/intel/openmpi mpi/intel/mvapich2

GNU mpi/gcc/openmpi mpi/gcc/mvapich2

Page 28: Software stack on the worker nodes AMD Opteron/Intel Westmere Redhat 64bit Scientific Linux Portland, GNU,Intel OpenMPI Sun Grid Engine v6 Ganglia

Running GPU parallel jobs

GPU parallel processing is supported on 8  Nvidia Tesla Fermi M2070s GPU units attached to iceberg.

• In order to use the GPU hardware you will need to join the GPU project by emailing [email protected]

• You can then submit jobs that use the GPU facilities by using the following three parameters to the qsub command;-P gpu-l arch=intel*-l gpu=nn

where 1<= nn <= 8 is the number of gpu-modules to be used by the job.

P stands for project that you belong to. See next slide.

Page 29: Software stack on the worker nodes AMD Opteron/Intel Westmere Redhat 64bit Scientific Linux Portland, GNU,Intel OpenMPI Sun Grid Engine v6 Ganglia

Special projects and resource management• Bulk of the iceberg cluster is shared equally amongst the users.

I.e. each user has the same privileges for running jobs as another user.

• However, there are extra nodes connected to the iceberg cluster that are owned by individual research groups.

• It is possible for new research project groups to purchase their own compute nodes/clusters and make it part of the iceberg cluster.

• We define such groups as special project groups in SGE parlance and give priority to their jobs on the machines that they have purchased.

• This scheme allows such research groups to use iceberg as ordinary users “with equal share to other users” or to use their privileges ( via the –P parameter) to run jobs on their own part of the cluster without having to compete with other users.

• This way, everybody benefits as the machines currently unused by a project group can be utilised to run normal users’ short jobs.

Page 30: Software stack on the worker nodes AMD Opteron/Intel Westmere Redhat 64bit Scientific Linux Portland, GNU,Intel OpenMPI Sun Grid Engine v6 Ganglia

Job queues on iceberg

Queue name Time Limit (Hours)

System specification

short.q 8

long.q 168 Long running serial jobs

parallel.q 168 Jobs requiring multiple nodes

openmp.q 168 Shared memory jobs using openmp

gpu.q 168 Jobs using the gpu units

Page 31: Software stack on the worker nodes AMD Opteron/Intel Westmere Redhat 64bit Scientific Linux Portland, GNU,Intel OpenMPI Sun Grid Engine v6 Ganglia

Getting help

• Web site– http://www.shef.ac.uk/wrgrid/

• Documentation– http://www.shef.ac.uk/wrgrid/using

• Training (also uses the learning management system)– http://www.shef.ac.uk/wrgrid/training

• Uspace– http://www.shef.ac.uk/wrgrid/uspace

• Contacts– http://www.shef.ac.uk/wrgrid/contacts.html