cluster workshop
DESCRIPTION
Cluster Workshop. For COMP RPG students, 17 May, 2010, High Performance Cluster Computing Centre (HPCCC), Faculty of Science, Hong Kong Baptist University. Outline: Overview of Cluster Hardware and Software; Basic Login and Running Program in a job queuing system; Introduction to Parallelism.
Cluster Workshop
For COMP RPG students
17 May, 2010
High Performance Cluster Computing Centre (HPCCC)
Faculty of Science
Hong Kong Baptist University
Outline
• Overview of Cluster Hardware and Software
• Basic Login and Running Program in a job queuing system
• Introduction to Parallelism
– Why Parallelism
– Cluster Parallelism
• OpenMP
• Message Passing Interface
• Parallel Program Examples
• Policy for using sciblade.sci.hkbu.edu.hk
http://www.sci.hkbu.edu.hk/hpccc/sciblade
Overview of Cluster Hardware and Software
Cluster Hardware
This 256-node PC cluster (sciblade) consists of:
• Master node x 2
• IO nodes x 3 (storage)
• Compute nodes x 256
• Blade Chassis x 16
• Management network
• Interconnect fabric
• 1U console & KVM switch
• Emerson Liebert Nxa 120kVA UPS
Sciblade Cluster
A 256-node cluster supported by funding from the RGC
Hardware Configuration
• Master Node
– Dell PE1950, 2x Xeon E5450 3.0GHz (Quad Core)
– 16GB RAM, 73GB x 2 SAS drive
• IO nodes (Storage)
– Dell PE2950, 2x Xeon E5450 3.0GHz (Quad Core)
– 16GB RAM, 73GB x 2 SAS drive
– 3TB storage Dell PE MD3000
• Compute nodes x 256, each with
– Dell PE M600 blade server w/ Infiniband network
– 2x Xeon E5430 2.66GHz (Quad Core)
– 16GB RAM, 73GB SAS drive
• Blade Chassis x 16
– Dell PE M1000e
– Each hosts 16 blade servers
• Management Network
– Dell PowerConnect 6248 (Gigabit Ethernet) x 6
• Interconnect fabric
– Qlogic SilverStorm 9120 switch
• Console and KVM switch
– Dell AS-180 KVM
– Dell 17FP Rack console
• Emerson Liebert Nxa 120kVA UPS
Software List
• Operating System
– ROCKS 5.1 Cluster OS
– CentOS 5.3 kernel 2.6.18
• Job Management System
– Portable Batch System
– MAUI scheduler
• Compilers, Languages
– Intel Fortran/C/C++ Compiler for Linux V11
– GNU 4.1.2/4.4.0 Fortran/C/C++ Compiler
• Message Passing Interface (MPI) Libraries
– MVAPICH 1.1
– MVAPICH2 1.2
– OPEN MPI 1.3.2
• Mathematical libraries
– ATLAS 3.8.3
– FFTW 2.1.5/3.2.1
– SPRNG 2.0a (C/Fortran) / 4.0 (C++/Fortran)
• Molecular Dynamics & Quantum Chemistry
– Gromacs 4.0.7
– Gamess 2009R1
– Gaussian 03
– Namd 2.7b1
• Third-party Applications
– FDTD simulation
– MATLAB 2008b
– TAU 2.18.2, VisIt 1.11.2
– Xmgrace 5.1.22
– etc.
• Queuing system
– Torque/PBS
– Maui scheduler
• Editors
– vi
– emacs
Hostnames
• Master node
– External: sciblade.sci.hkbu.edu.hk
– Internal: frontend-0
• IO nodes (storage)
– pvfs2-io-0-0, pvfs2-io-0-1, pvfs-io-0-2
• Compute nodes
– compute-0-0.local, …, compute-0-255.local
Basic Login and Running Program in a Job Queuing System
Basic login
• Remote login to the master node
• Terminal login
– using secure shell
ssh -l username sciblade.sci.hkbu.edu.hk
• Graphical login
– PuTTY & vncviewer, e.g.
[username@sciblade]$ vncserver
New 'sciblade.sci.hkbu.edu.hk:3 (username)' desktop is sciblade.sci.hkbu.edu.hk:3
This means that your session will run on display 3.
Graphical login
• Use PuTTY to set up a secure connection: Host Name = sciblade.sci.hkbu.edu.hk
• Set the SSH protocol version
• Port: 5900 + display number (i.e. 5903 for display 3 in this case)
Graphical login (con't)
• Next, click Open, and log in to sciblade
• Finally, run VNC Viewer on your PC, and enter "localhost:3" (3 is the display number)
• You should terminate your VNC session after you have finished your work. To terminate your VNC session running on sciblade, run the command
[username@sciblade]$ vncserver -kill :3
Linux commands
• Both master and compute nodes are installed with Linux
• Frequently used Linux commands in the PC cluster: http://www.sci.hkbu.edu.hk/hpccc/sciblade/faq_sciblade.php

| Command | Example | Description |
| --- | --- | --- |
| cp | cp f1 f2 dir1 | copy files f1 and f2 into directory dir1 |
| mv | mv f1 dir1 | move file f1 into dir1 (also used to rename) |
| tar | tar xzvf abc.tar.gz | uncompress and untar a tar.gz format file |
| tar | tar czvf abc.tar.gz abc | create an archive file with gzip compression |
| cat | cat f1 f2 | print the contents of files f1 and f2 |
| diff | diff f1 f2 | compare the text of two files |
| grep | grep student * | search all files for the word student |
| history | history 50 | list the last 50 commands stored in the shell |
| kill | kill -9 2036 | terminate the process with pid 2036 |
| man | man tar | display the on-line manual page |
| nohup | nohup runmatlab a | run matlab (a.m) without hanging up after logout |
| ps | ps -ef | list all processes running in the system |
| sort | sort -r -n studno | sort studno in reverse numerical order |
ROCKS specific commands
• ROCKS provides the following commands for users to run programs on all compute nodes, e.g.
– cluster-fork
  • Run a program on all compute nodes
– cluster-fork ps
  • Check user processes on each compute node
– cluster-kill
  • Kill user processes on all nodes at one time
– tentakel
  • Similar to cluster-fork but runs faster
Ganglia
Web-based management and monitoring
• http://sciblade.sci.hkbu.edu.hk/ganglia
Why Parallelism
Why Parallelism – Passively
• Suppose you are using the most efficient algorithm with an optimal implementation, but the program still takes too long or does not even fit onto your machine
• Parallelization is the last chance.
Why Parallelism – Proactively
• Faster
– Finish the work earlier = same work in a shorter time
– Do more work = more work in the same time
• Most importantly, you want to predict the result before the event occurs
Examples
Many scientific and engineering problems require enormous computational power. The following are a few such fields:
– Quantum chemistry, statistical mechanics, and relativistic physics
– Cosmology and astrophysics
– Computational fluid dynamics and turbulence
– Material design and superconductivity
– Biology, pharmacology, genome sequencing, genetic engineering, protein folding, enzyme activity, and cell modeling
– Medicine, and modeling of human organs and bones
– Global weather and environmental modeling
– Machine Vision
Parallelism
• The computing power that can be obtained from a single processor is bounded by the fastest processor available at any given time.
• The upper bound for the computing power available can be dramatically increased by integrating a set of processors together.
• Synchronization and exchange of partial results among processors are therefore unavoidable.
Parallel Computer Architecture
[Figure, two panels. Multiprocessing: shared memory, symmetric multiprocessors (SMP); control units (CU) 1 to n pass instruction streams (IS) to processing units (PU) 1 to n, which exchange data streams (DS) with one shared memory. Clustering: distributed memory, cluster; CPUs 1 to n each have their own local memory (LM) and exchange data streams (DS) through an interconnecting network.]
Clustering: Pros and Cons
• Advantages
– Memory is scalable with the number of processors.
∴ Increasing the number of processors increases the size of memory and the bandwidth as well.
– Each processor can rapidly access its own memory without interference
• Disadvantages
– Difficult to map existing data structures to this memory organization
– User is responsible for sending and receiving data among processors
TOP500 Supercomputer Sites (www.top500.org)
[Figure: Architecture of Top 500. Share percentage (0% to 100%) by year, 1999 to 2009, for four architectures: SMP, Constellations, MPP and Cluster.]
Cluster Parallelism
Parallel Programming Paradigm
• Multithreading (shared memory only)
– OpenMP
• Message Passing (shared memory, distributed memory)
– MPI (Message Passing Interface)
– PVM (Parallel Virtual Machine)
Distributed Memory
• Programmer's view:
– Several CPUs
– Several blocks of memory
– Several threads of action
• Parallelization
– Done by hand
• Example
– MPI
[Figure: timeline comparing serial execution, where one process runs P1, P2 and P3 one after another, with message passing, where Process 0, Process 1 and Process 2 run P1, P2 and P3 at the same time and exchange data via the interconnection.]
Message Passing Model
Message Passing
The method by which data from one processor's memory is copied to the memory of another processor.
Process
A process is a set of executable instructions (a program) which runs on a processor.
Message passing systems generally associate only one process per processor, and the terms "processes" and "processors" are used interchangeably.
[Figure: timeline comparing serial execution of P1, P2 and P3 with message passing, where Process 0, Process 1 and Process 2 run them concurrently and exchange data.]
OpenMP
OpenMP Mission
• The OpenMP Application Program Interface (API) supports multi-platform shared-memory parallel programming in C/C++ and Fortran on all architectures, including Unix platforms and Windows NT platforms.
• Jointly defined by a group of major computer hardware and software vendors.
• OpenMP is a portable, scalable model that gives shared-memory parallel programmers a simple and flexible interface for developing parallel applications for platforms ranging from the desktop to the supercomputer.
OpenMP compiler choice
• gcc 4.4.0 or above
– compile with -fopenmp
• Intel 10.1 or above
– compile with -Qopenmp on Windows
– compile with -openmp on Linux
• PGI compiler
– compile with -mp
• Absoft Pro Fortran
– compile with -openmp
Sample openmp example
#include <omp.h>
#include <stdio.h>
int main() {
    /* Every thread in the parallel region executes the printf once. */
    #pragma omp parallel
    printf("Hello from thread %d, nthreads %d\n", omp_get_thread_num(), omp_get_num_threads());
    return 0;
}
serial-pi.c
#include <stdio.h>
static long num_steps = 10000000;
double step;
int main ()
{ int i; double x, pi, sum = 0.0;
step = 1.0/(double) num_steps;
for (i=0;i< num_steps; i++){
x = (i+0.5)*step;
sum = sum + 4.0/(1.0+x*x);
}
pi = step * sum;
printf("Est Pi= %f\n",pi);
}
OpenMP version of spmd-pi.c
#include <omp.h>
#include <stdio.h>
static long num_steps = 10000000;
double step;
#define NUM_THREADS 8
int main ()
{ int i, nthreads; double pi, sum[NUM_THREADS];
step = 1.0/(double) num_steps;
omp_set_num_threads(NUM_THREADS);
#pragma omp parallel
{
int i, id,nthrds;
double x;
id = omp_get_thread_num();
nthrds = omp_get_num_threads();
if (id == 0) nthreads = nthrds;
for (i=id, sum[id]=0.0;i< num_steps; i=i+nthrds) {
x = (i+0.5)*step;
sum[id] += 4.0/(1.0+x*x);
}
}
for(i=0, pi=0.0;i<nthreads;i++)
pi += sum[i] * step;
printf("Est Pi= %f using %d threads \n",pi,nthreads);
}
Message Passing Interface (MPI)
MPI
• Is a library, not a language, for parallel programming
• An MPI implementation consists of
– a subroutine library with all MPI functions
– include files for the calling application program
– some startup script (usually called mpirun, but not standardized)
• Include the header file mpi.h (or whatever it is called) in the source code
• Libraries available for all major imperative languages (C, C++, Fortran …)
General MPI Program Structure
#include <mpi.h>                          /* MPI include file */
int main(int argc, char *argv[])
{
    int np, rank, ierr;                   /* variable declarations */
    ierr = MPI_Init(&argc, &argv);        /* initialize MPI environment */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    /* Do work and make message passing calls */
    ierr = MPI_Finalize();                /* terminate MPI environment */
    return 0;
}
Sample Program: Hello World!
• In this modified version of the "Hello World" program, each processor prints its rank as well as the total number of processors in the communicator MPI_COMM_WORLD.
• Notes:
– Makes use of the pre-defined communicator MPI_COMM_WORLD.
– Not testing for error status of routines!
#include <stdio.h>
#include "mpi.h" // MPI header file
int main(int argc, char **argv) {
    int nproc, myrank, ierr;
    ierr = MPI_Init(&argc, &argv); // MPI initialization
    // Get number of MPI processes
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);
    // Get the rank of this process
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    printf("Hello World!! I'm process %d of %d\n", myrank, nproc);
    ierr = MPI_Finalize(); // Shut down the MPI environment
    return 0;
}
Performance
• When we write a parallel program, it is important to identify the fraction of the program that can be parallelized and to maximize it.
• The goals are:
– load balance
– memory usage balance
– minimize communication overhead
– reduce sequential bottlenecks
– scalability
Compiling & Running MPI Programs
• Using mvapich 1.1
1. Set the path. At the command prompt, type:
export PATH=/u1/local/mvapich1/bin:$PATH
(uncomment this line in .bashrc)
2. Compile using mpicc, mpiCC, mpif77 or mpif90, e.g.
mpicc -o cpi cpi.c
3. Prepare a hostfile (e.g. machines) listing the compute nodes, one hostname per line:
compute-0-0
compute-0-1
compute-0-2
compute-0-3
4. Run the program on a number of processors:
mpirun -np 4 -machinefile machines ./cpi
• Using mvapich2 1.2
1. Prepare .mpd.conf and .mpd.passwd and save them in your home directory:
MPD_SECRETWORD=gde1234-3
(you may set your own secret word)
2. Set the environment for mvapich2 1.2:
export MPD_BIN=/u1/local/mvapich2
export PATH=$MPD_BIN:$PATH
(uncomment these lines in .bashrc)
3. Compile using mpicc, mpiCC, mpif77 or mpif90, e.g.
mpicc -o cpi cpi.c
4. Prepare a hostfile (e.g. machines), one hostname per line, as in the previous section
5. Start the mpd daemons with mpdboot and the hostfile:
mpdboot -n 4 -f machines
6. Run the program on a number of processors:
mpiexec -np 4 ./cpi
7. Remember to clean up after running jobs with mpdallexit:
mpdallexit
• Using openmpi 1.2
1. Set the environment for openmpi:
export LD_LIBRARY_PATH=/u1/local/openmpi/lib:$LD_LIBRARY_PATH
export PATH=/u1/local/openmpi/bin:$PATH
(uncomment these lines in .bashrc)
2. Compile using mpicc, mpiCC, mpif77 or mpif90, e.g.
mpicc -o cpi cpi.c
3. Prepare a hostfile (e.g. machines), one hostname per line, as in the previous section
4. Run the program on a number of processors:
mpirun -np 4 -machinefile machines ./cpi
Submit parallel jobs into the Torque batch queue
Prepare a job script, say omp.pbs, like the following:
#!/bin/sh
### Job name
#PBS -N OMP-spmd
### Declare job non-rerunable
#PBS -r n
### Mail to user
##PBS -m ae
### Queue name (small, medium, long, verylong)
### Number of nodes
#PBS -l nodes=1:ppn=8
#PBS -l walltime=00:08:00
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=8
./omp-test
./serial-pi
./omp-spmd-pi
Submit it using qsub:
qsub omp.pbs
Another example of PBS scripts
Prepare a job script, say scripts.sh, like the following:
#!/bin/sh
### Job name
#PBS -N Sorting
### Declare job non-rerunable
#PBS -r n
### Number of nodes
#PBS -l nodes=4
#PBS -l walltime=08:00:00
# This job's working directory
echo Working directory is $PBS_O_WORKDIR
cd $PBS_O_WORKDIR
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`
echo This job runs on the following processors:
echo `cat $PBS_NODEFILE`
# Define number of processors
NPROCS=`wc -l < $PBS_NODEFILE`
echo This job has allocated $NPROCS nodes
# Run the parallel MPI executable
/u1/local/mvapich1/bin/mpirun -v -machinefile $PBS_NODEFILE -np $NPROCS ./bubblesort >> bubble.out
Submit it using qsub:
qsub scripts.sh
Parallel Program Examples
Example 1: Estimation of Pi (OpenMP)
#include <omp.h>
#include <stdio.h>
static long num_steps = 10000000;
double step;
#define NUM_THREADS 8
int main ()
{ int i, nthreads; double pi, sum[NUM_THREADS];
step = 1.0/(double) num_steps;
omp_set_num_threads(NUM_THREADS);
#pragma omp parallel
{
int i, id,nthrds;
double x;
id = omp_get_thread_num();
nthrds = omp_get_num_threads();
if (id == 0) nthreads = nthrds;
for (i=id, sum[id]=0.0;i< num_steps; i=i+nthrds) {
x = (i+0.5)*step;
sum[id] += 4.0/(1.0+x*x);
}
}
for(i=0, pi=0.0;i<nthreads;i++)
pi += sum[i] * step;
printf("Est Pi= %f using %d threads \n",pi,nthreads);
}
Example 2a: Sorting – quick sort
• The quick sort is an in-place, divide-and-conquer, massively recursive sort.
• The efficiency of the algorithm depends heavily on which element is chosen as the pivot point.
• The worst-case efficiency of the quick sort, O(n^2), occurs when the list is already sorted and the left-most element is chosen as the pivot.
• If the data to be sorted isn't random, randomly choosing a pivot point is recommended. As long as the pivot point is chosen randomly, the quick sort has an expected complexity of O(n log n).
Pros: Extremely fast.
Cons: Very complex algorithm, massively recursive.
Quick Sort Performance

| Processes | Time (sec) |
| --- | --- |
| 1 | 0.410000 |
| 2 | 0.300000 |
| 4 | 0.180000 |
| 8 | 0.180000 |
| 16 | 0.180000 |
| 32 | 0.220000 |
| 64 | 0.680000 |
| 128 | 1.300000 |

[Figure: quick sort time (sec, 0 to 1.4) against number of processes (1 to 128).]
Example 2b: Sorting – Bubble Sort
• The bubble sort is the oldest and simplest sort in use. Unfortunately, it's also the slowest.
• The bubble sort works by comparing each item in the list with the item next to it, and swapping them if required.
• The algorithm repeats this process until it makes a pass all the way through the list without swapping any items (in other words, all items are in the correct order).
• This causes larger values to "bubble" to the end of the list while smaller values "sink" towards the beginning of the list.
Bubble Sort Performance

| Processes | Time (sec) |
| --- | --- |
| 1 | 3242.327 |
| 2 | 806.346 |
| 4 | 276.4646 |
| 8 | 78.45156 |
| 16 | 21.031 |
| 32 | 4.8478 |
| 64 | 2.03676 |
| 128 | 1.240197 |

[Figure: bubble sort time (sec, 0 to 3500) against number of processes (1 to 128).]
Monte Carlo Integration
• "Hit and miss" integration
• The integration scheme is to take a large number of random points and count the number that fall under the curve f(x) to estimate the area
• Monte Carlo Integration to Estimate Pi
$$\frac{\text{no. of points hitting the shaded area}}{\text{no. of points hitting inside the square}} \approx \frac{\frac{1}{4}\pi r^2}{r^2} = \frac{1}{4}\pi$$
Example 1: omp
omp/test-omp.c
omp/serial-pi.c
omp/spmd-pi.c
Compile the programs with the command: make
Run the program in parallel with
./omp-spmd-pi
Submit the job to PBS with
qsub omp.pbs
Example 2: Prime
prime/prime.c
prime/prime.f90
prime/primeParallel.c
prime/Makefile
prime/machines
Compile with the command: make
Run the serial program with
./primeC or ./primeF
Run the parallel program with
mpirun -np 4 -machinefile machines ./primeMPI
Example 3: Sorting
sorting/qsort.c
sorting/bubblesort.c
sorting/script.sh
sorting/qsort
sorting/bubblesort
Submit the job to the PBS queuing system with
qsub script.sh
Example 4: pmatlab
pmatlab/startup.m
pmatlab/RUN.m
pmatlab/sample-pi.m
Submit the job to PBS with
qsub Qpmatlab.pbs
Policy for using sciblade.sci.hkbu.edu.hk
Policy
1. Every user shall apply for his/her own computer user account to log in to the master node of the PC cluster, sciblade.sci.hkbu.edu.hk.
2. Users must not share their accounts and passwords with other users.
3. Every user must deliver jobs to the PC cluster from the master node via the PBS job queuing system. Automatic dispatching of jobs using scripts or robots is not allowed.
4. Users are not allowed to log in to the compute nodes.
5. Foreground jobs on the PC cluster are restricted to program testing, and their duration should not exceed 1 minute of CPU time per job.
Policy (continued)
6. Any background jobs run on the master node or compute nodes are strictly prohibited and will be killed without prior notice.
7. The current restrictions of the job queuing system are as follows:
– The maximum number of running jobs in the job queue is 8.
– The maximum total number of CPU cores used at one time cannot exceed 512.
8. The restrictions in item 7 will be reviewed periodically in light of the growing number of users and computational needs.