Page 1: MPI: An Introduction

Kadin Tseng
Scientific Computing and Visualization Group
Boston University

Linux Clusters and Tiled Display Walls
July 30 - August 01, 2002

Page 2: SCV Computing Facilities

• SGI Origin2000 – 192 processors
  – Lego (32), Slinky (32), Jacks (64), Playdoh (64)
  – R10000, 195 MHz
• IBM SP – four 16-processor nodes
  – Hal – login node
  – Power3, 375 MHz
• IBM Regatta – three 32-processor nodes
  – Twister – login node (currently in friendly-user phase)
  – Power4, 1.3 GHz
• Linux cluster – 38 2-processor nodes
  – Skate and Cootie – two login nodes
  – 1.3 GHz Intel Pentium 3 (Myrinet interconnect)

Page 3: Useful SCV Info

• SCV home page: http://scv.bu.edu/
• Batch queues: http://scv.bu.edu/SCV/scf-techsumm.html
• Resource applications
  – MARINER: http://acct.bu.edu/FORMS/Mariner-Pappl.html
  – Alliance: http://www.ncsa.uiuc.edu/alliance/applying/
• Help
  – Online FAQs
  – Web-based tutorials (MPI, OpenMP, F90, MATLAB, graphics tools)
  – HPC consultations by appointment
    • Doug Sondak ([email protected])
    • Kadin Tseng ([email protected])
  – [email protected], [email protected]
  – Origin2000 repository: http://scv.bu.edu/SCV/Origin2000/
  – IBM SP and Regatta repository: http://scv.bu.edu/SCV/IBMSP/
  – Alliance web-based tutorials: http://foxtrot.ncsa.uiuc.edu:8900/

Page 4: Multiprocessor Architectures

• Linux cluster
  – Shared-memory: between the 2 processors of each node.
  – Distributed-memory: there are 38 nodes; both processors in each node can be used.

Page 5: Shared Memory Architecture

[Diagram: processors P0, P1, P2, …, Pn all connected to a single shared Memory.]

P0, P1, …, Pn are processors.

Page 6: Distributed Memory Architecture

[Diagram: nodes N0, N1, N2, …, Nn, each with its own local memory M0, M1, M2, …, Mn, connected by an Interconnect.]

M0, M1, …, Mn are memories associated with nodes N0, N1, …, Nn. The interconnect is Myrinet (or Ethernet).

Page 7: Parallel Computing Paradigms

• Message Passing (MPI, PVM, ...)
  – Distributed and shared memory
• Directives (OpenMP, ...)
  – Shared memory
• Multi-Level Parallel programming (MPI + OpenMP)
  – Distributed/shared memory

Page 8: MPI Topics to Cover

• Fundamentals
• Basic MPI functions
• Nonblocking send/receive
• Collective Communications
• Virtual Topologies
• Virtual Topology Example
  – Solution of the Laplace Equation

Page 9: What is MPI?

• MPI stands for Message Passing Interface.
• It is a library of subroutines/functions, not a computer language.
• These subroutines/functions are callable from Fortran or C programs.
• The programmer writes Fortran/C code, inserts the appropriate MPI subroutine/function calls, compiles, and finally links with the MPI message passing library.
• In general, MPI codes run on shared-memory multiprocessors, distributed-memory multicomputers, clusters of workstations, or heterogeneous clusters of the above.
• The current MPI standard is MPI-1; MPI-2 will be available in the near future.

Page 10: Why MPI?

• To enable more analyses in a prescribed amount of time.
• To reduce time required for one analysis.
• To increase fidelity of physical modeling.
• To have access to more memory.
• To provide efficient communication between nodes of networks of workstations, …
• To enhance code portability; works for both shared- and distributed-memory.
• For “embarrassingly parallel” problems, such as some Monte-Carlo applications, parallelizing with MPI can be trivial with near-linear (or even superlinear) scaling.

Page 11: MPI Preliminaries

• MPI’s pre-defined constants, function prototypes, etc., are included in a header file. This file must be included in your code wherever MPI function calls appear (in “main” and in user subroutines/functions):
  – #include “mpi.h” for C codes
  – #include “mpi++.h” for C++ codes
  – include “mpif.h” for Fortran 77 and Fortran 9x codes
• MPI_Init must be the first MPI function called.
• Terminate MPI by calling MPI_Finalize.
• These two functions must be called once and only once in user code. A minimal skeleton follows.
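To make these rules concrete, here is a minimal C skeleton (a sketch for illustration; it is not part of the original example set):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
  int rank, size;
  MPI_Init(&argc, &argv);                /* first MPI call, exactly once */
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id: 0..size-1 */
  MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */
  printf("Hello from process %d of %d\n", rank, size);
  MPI_Finalize();                        /* last MPI call, exactly once */
  return 0;
}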

Page 12: MPI Preliminaries (continued)

• C is a case-sensitive language. MPI function names always begin with “MPI_”, followed by a specific name whose leading character is capitalized, e.g., MPI_Comm_rank. MPI pre-defined constants are expressed in upper-case characters, e.g., MPI_COMM_WORLD.
• Fortran is not case-sensitive; no specific case rules apply.
• MPI Fortran routines return the error status as the last argument of the subroutine call, e.g.,
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
• For C MPI functions, the error status is returned as the int function value, e.g.,
  int ierr = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
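As a small illustrative C fragment (not from the slides), the return code can be tested against the pre-defined constant MPI_SUCCESS; note that by default MPI aborts on error, so explicit checks matter mainly when the error handler has been changed:

int rank;
int ierr = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (ierr != MPI_SUCCESS) {
  fprintf(stderr, "MPI_Comm_rank failed (code %d)\n", ierr);
  MPI_Abort(MPI_COMM_WORLD, ierr);   /* terminate all MPI processes */
}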

Page 13: What is a Message?

• A collection of data (an array) of MPI data types
  – Basic data types such as int/INTEGER, float/REAL
  – Derived data types
• A message “envelope”: source, destination, tag, communicator
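The two pieces are easiest to see in an actual call. In this hedged C fragment (the variable names are made up for illustration), the first three arguments describe the data and the rest form the envelope:

float work[100];   /* the data: an array of an MPI basic type */
int dest = 1;      /* envelope: destination process           */
int tag  = 42;     /* envelope: user-chosen message tag       */
/* the envelope also includes the communicator and, implicitly,
   the sending process as the source */
MPI_Send(work, 100, MPI_FLOAT, dest, tag, MPI_COMM_WORLD);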

Page 14: Modes of Communication

• Point-to-point communication
  – Blocking: returns from the call when the task is complete
    • Several send modes; one receive mode
  – Nonblocking: returns from the call without waiting for the task to complete
    • Several send modes; one receive mode
• Collective communication
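A minimal C sketch contrasting the two point-to-point styles (a two-process toy example, not from the original slides):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
  int rank;
  float x = 3.14f, y = 0.0f;
  MPI_Request req;
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0) {
    /* Blocking send: returns only when the buffer x is safe to reuse. */
    MPI_Send(&x, 1, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
  } else if (rank == 1) {
    /* Nonblocking receive: returns immediately with a request handle. */
    MPI_Irecv(&y, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &req);
    /* ... computation could overlap the transfer here ... */
    MPI_Wait(&req, &status);   /* y is valid only after the wait */
    printf("Received %f\n", y);
  }
  MPI_Finalize();
  return 0;
}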

Page 15: Essentials of Communication

• The sender must specify a valid destination.
• The sender’s and receiver’s data type, tag, and communicator must match.
• The receiver can receive from a non-specific (but valid) source.
• The receiver returns an extra (status) parameter that reports info about the message received.
• The sender specifies the size of sendbuf; the receiver specifies an upper bound for recvbuf.
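A hedged C fragment illustrating the receiver-side rules: a wildcard source and tag, a buffer sized to an upper bound, and the status parameter reporting what actually arrived (the buffer size 1000 is an arbitrary choice for illustration):

float recvbuf[1000];   /* upper bound: the message may be shorter */
MPI_Status status;
int count;

MPI_Recv(recvbuf, 1000, MPI_FLOAT, MPI_ANY_SOURCE, MPI_ANY_TAG,
         MPI_COMM_WORLD, &status);
MPI_Get_count(&status, MPI_FLOAT, &count);   /* items actually received */
printf("Got %d floats from rank %d with tag %d\n",
       count, status.MPI_SOURCE, status.MPI_TAG);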

Page 16: MPI Data Types vs C Data Types

• MPI types -- C types
  – MPI_INT – signed int
  – MPI_UNSIGNED – unsigned int
  – MPI_FLOAT – float
  – MPI_DOUBLE – double
  – MPI_CHAR – char
  – . . .

Page 17: MPI vs Fortran Data Types

• MPI_INTEGER – INTEGER
• MPI_REAL – REAL
• MPI_DOUBLE_PRECISION – DOUBLE PRECISION
• MPI_CHARACTER – CHARACTER(1)
• MPI_COMPLEX – COMPLEX
• MPI_LOGICAL – LOGICAL
• …

Page 18: MPI Data Types

• MPI_PACKED
• MPI_BYTE
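These two types usually appear with the pack/unpack routines. A small sender-side C sketch (the buffer size and values are illustrative only):

char packbuf[256];
int position = 0;
int n = 500;
float h = 0.01f;

/* Pack an int and a float into one contiguous buffer. */
MPI_Pack(&n, 1, MPI_INT,   packbuf, 256, &position, MPI_COMM_WORLD);
MPI_Pack(&h, 1, MPI_FLOAT, packbuf, 256, &position, MPI_COMM_WORLD);

/* Ship the packed bytes as a single MPI_PACKED message; the
   receiver calls MPI_Unpack on them in the same order. */
MPI_Send(packbuf, position, MPI_PACKED, 1, 0, MPI_COMM_WORLD);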

Page 19: Some MPI Implementations

There are a number of implementations:

• MPICH (ANL)
  – Version 1.2.1 is the latest.
  – There is a list of supported MPI-2 functionalities.
• LAM (UND/OSC)
• CHIMP (EPCC)
• Vendor implementations (SGI, IBM, …)
• Don’t worry which implementation you use, as long as the vendor supports the MPI Standard (IBM’s MPL does not).
• Job execution procedures of implementations may differ.

Page 20: Example 1 (Integration)

We will introduce some fundamental MPI function calls through the computation of a simple integral by the Mid-point rule.

Here p is the number of partitions (processes) and n is the number of increments per partition:

$$
\int_a^b \cos(x)\,dx
= \sum_{i=0}^{p-1} \sum_{j=0}^{n-1} \int_{a_{ij}}^{a_{ij}+h} \cos(x)\,dx
\approx \sum_{i=0}^{p-1} \sum_{j=0}^{n-1} h \cos\!\left(a_{ij} + \frac{h}{2}\right),
$$

where $h = (b-a)/(p\,n)$ and $a_{ij} = a + (i\,n + j)\,h$.

Page 21: Integrate cos(x) by Mid-point Rule

[Figure: plot of f(x) = cos(x) on 0 <= x <= pi/2, illustrating the integral of cos(x) from 0 to pi/2 evaluated by the Mid-point rule.]

Page 22: Example 1 - Serial fortran code

      Program Example1
      implicit none
      integer n, p, i, j
      real h, result, a, b, integral, pi

      pi = acos(-1.0)   ! = 3.14159...
      a = 0.0           ! lower limit of integration
      b = pi/2.         ! upper limit of integration
      p = 4             ! number of partitions (processes)
      n = 500           ! # of increments in each part.
      h = (b-a)/p/n     ! length of increment
      result = 0.0      ! Initialize solution to the integral
      do i=0,p-1        ! Integral sum over all partitions
        result = result + integral(a,i,h,n)
      enddo
      print *,'The result =',result
      stop
      end

Page 23: Serial fortran code (cont’d)

      real function integral(a, i, h, n)
      implicit none
      integer n, i, j
      real h, h2, aij, a, fct, x

      fct(x) = cos(x)           ! Integral kernel
      integral = 0.0            ! initialize integral
      h2 = h/2.
      do j=0,n-1                ! sum over all "j" integrals
        aij = a + (i*n + j)*h   ! lower limit of integration
        integral = integral + fct(aij+h2)*h
      enddo
      return
      end

example1.f continues . . .

Page 24: Example 1 - Serial C code

#include <math.h>
#include <stdio.h>

float integral(float a, int i, float h, int n);

int main() {
  int n, p, i;
  float h, result, a, b, pi;

  pi = acos(-1.0);   /* = 3.14159... */
  a = 0.;            /* lower limit of integration */
  b = pi/2.;         /* upper limit of integration */
  p = 4;             /* # of partitions */
  n = 500;           /* increments in each partition */
  h = (b-a)/n/p;     /* length of increment */

  result = 0.0;
  for (i=0; i<p; i++) {   /* integral sum over partitions */
    result += integral(a,i,h,n);
  }
  printf("The result =%f\n",result);
  return 0;
}

Page 25: Serial C code (cont’d)

float integral(float a, int i, float h, int n)
{
  int j;
  float h2, aij, integ;

  integ = 0.0;              /* initialize integral */
  h2 = h/2.;
  for (j=0; j<n; j++) {     /* integral in each partition */
    aij = a + (i*n + j)*h;  /* lower limit of integration */
    integ += cos(aij+h2)*h;
  }
  return integ;
}

example1.c continues . . .

Page 26: Example 1 - Parallel fortran code

      PROGRAM Example1_1
      implicit none
      integer n, p, i, j, ierr, master, Iam
      real h, result, a, b, integral, pi
      include "mpif.h"   ! pre-defined MPI constants, ...
      integer source, dest, tag, status(MPI_STATUS_SIZE)
      real my_result

      data master/0/     ! 0 is the master processor responsible
                         ! for collecting integral sums ...

MPI functions used for this example:

• MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Finalize

• MPI_Send, MPI_Recv

Page 27: Parallel fortran code (cont’d)

! Executable statements before MPI_Init are not advisable;
! the side effects are implementation-dependent.

! Starts MPI processes ...
      call MPI_Init(ierr)
! Get current process id
      call MPI_Comm_rank(MPI_COMM_WORLD, Iam, ierr)
! Get number of processes (as requested on the command line)
      call MPI_Comm_size(MPI_COMM_WORLD, p, ierr)

      pi = acos(-1.0)   ! = 3.14159...
      a = 0.0           ! lower limit of integration
      b = pi/2.         ! upper limit of integration
      n = 500           ! increments in each process
      h = (b - a)/p/n   ! length of increment (p known after MPI_Comm_size)
      dest = master     ! proc to compute final result
      tag = 123         ! set tag for job

Page 28: Parallel fortran code (cont’d)

      my_result = integral(a,Iam,h,n)   ! compute local sum
      write(*,"('Process ',i2,' has the partial result of',
     &          f10.6)") Iam, my_result

      if(Iam .eq. master) then
        result = my_result    ! Init. final result
        do source=1,p-1       ! loop to collect local sums (not parallel!)
          call MPI_Recv(my_result, 1, MPI_REAL, source, tag,
     &                  MPI_COMM_WORLD, status, ierr)   ! not safe
          result = result + my_result
        enddo
        print *,'The result =',result
      else
        call MPI_Send(my_result, 1, MPI_REAL, dest, tag,
     &                MPI_COMM_WORLD, ierr)   ! send my_result to dest.
      endif
      call MPI_Finalize(ierr)   ! let MPI finish up ...
      end

Page 29: Example 1 - Parallel C code

#include <mpi.h>
#include <math.h>
#include <stdio.h>

float integral(float a, int i, float h, int n);   /* prototype */

int main(int argc, char *argv[]) {
  int n, p, i;
  float h, result, a, b, pi, my_result;
  int myid, source, dest, tag;
  MPI_Status status;

  MPI_Init(&argc, &argv);                /* start MPI processes */
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);  /* current proc. id */
  MPI_Comm_size(MPI_COMM_WORLD, &p);     /* # of processes */

Page 30: Parallel C code (continued)

  pi = acos(-1.0);   /* = 3.14159... */
  a = 0.;            /* lower limit of integration */
  b = pi/2.;         /* upper limit of integration */
  n = 500;           /* number of increments within each process */
  dest = 0;          /* define the process that computes the final result */
  tag = 123;         /* set the tag to identify this particular job */
  h = (b-a)/n/p;     /* length of increment */

  i = myid;          /* MPI process number range is [0,p-1] */
  my_result = integral(a,i,h,n);
  printf("Process %d has the partial result of %f\n", myid, my_result);

Page 31: Parallel C code (continued)

  if(myid == 0) {
    result = my_result;
    for (i=1; i<p; i++) {
      source = i;   /* MPI process number range is [0,p-1] */
      MPI_Recv(&my_result, 1, MPI_FLOAT, source, tag,
               MPI_COMM_WORLD, &status);
      result += my_result;
    }
    printf("The result =%f\n", result);
  } else {
    MPI_Send(&my_result, 1, MPI_FLOAT, dest, tag,
             MPI_COMM_WORLD);   /* send my_result to "dest" */
  }
  MPI_Finalize();   /* let MPI finish up ... */
  return 0;
}

Page 32: Example1_2 - Parallel Integration

      PROGRAM Example1_2
      implicit none
      integer n, p, i, j, k, ierr, master
      real h, a, b, integral, pi
      integer req(1)
      include "mpif.h"   ! This brings in pre-defined MPI constants, ...
      integer Iam, source, dest, tag, status(MPI_STATUS_SIZE)
      real my_result, Total_result, result

      data master/0/

MPI functions used for this example:

• MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Finalize

• MPI_Recv, MPI_Isend, MPI_Wait

Page 33: Example1_2 (continued)

c**Starts MPI processes ...
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, Iam, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, p, ierr)

      pi = acos(-1.0)   ! = 3.14159...
      a = 0.0           ! lower limit of integration
      b = pi/2.         ! upper limit of integration
      n = 500           ! number of increments within each process
      dest = master     ! define process that computes the final result
      tag = 123         ! set the tag to identify this particular job
      h = (b-a)/n/p     ! length of increment

      my_result = integral(a,Iam,h,n)   ! Integral of process Iam
      write(*,*)'Iam=',Iam,', my_result=',my_result

Page 34: Example1_2 (continued)

      if(Iam .eq. master) then   ! the following is serial
        result = my_result
        do k=1,p-1
          call MPI_Recv(my_result, 1, MPI_REAL,
     &                  MPI_ANY_SOURCE, tag,            ! "wildcard"
     &                  MPI_COMM_WORLD, status, ierr)   ! status - about source ...
          result = result + my_result   ! Total = sum of integrals
        enddo
      else
        call MPI_Isend(my_result, 1, MPI_REAL, dest, tag,
     &                 MPI_COMM_WORLD, req, ierr)   ! send my_result to "dest"
        call MPI_Wait(req, status, ierr)            ! wait for nonblocking send ...
      endif
c**results from all procs have been collected and summed ...
      if(Iam .eq. 0) write(*,*)'Final Result =',result
      call MPI_Finalize(ierr)   ! let MPI finish up ...
      stop
      end

Page 35: Example1_3 - Parallel Integration

      PROGRAM Example1_3
      implicit none
      integer n, p, i, j, ierr, master
      real h, result, a, b, integral, pi
      include "mpif.h"   ! This brings in pre-defined MPI constants, ...
      integer Iam, source, dest, tag, status(MPI_STATUS_SIZE)
      real my_result

      data master/0/

MPI functions used for this example:

• MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Finalize

• MPI_Bcast, MPI_Reduce, MPI_SUM

Page 36: Example1_3 (continued)

c**Starts MPI processes ...
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, Iam, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, p, ierr)

      pi = acos(-1.0)   ! = 3.14159...
      a = 0.0           ! lower limit of integration
      b = pi/2.         ! upper limit of integration
      dest = 0          ! define the process that computes the final result
      tag = 123         ! set the tag to identify this particular job

      if(Iam .eq. master) then
        print *,'The requested number of processors =',p
        print *,'enter number of increments within each process'
        read(*,*)n
      endif

c**Broadcast "n" to all processes
      call MPI_Bcast(n, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)

Page 37: Example1_3 (continued)

      h = (b-a)/n/p   ! length of increment
      my_result = integral(a,Iam,h,n)
      write(*,"('Process ',i2,' has the partial result of',f10.6)")
     &      Iam, my_result

c**Compute the integral sum; the result lands on process "dest"
      call MPI_Reduce(my_result, result, 1, MPI_REAL, MPI_SUM, dest,
     &                MPI_COMM_WORLD, ierr)

      if(Iam .eq. master) then
        print *,'The result =',result
      endif

      call MPI_Finalize(ierr)   ! let MPI finish up ...
      stop
      end
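A closely related collective is MPI_Allreduce, which leaves the sum on every process rather than only on dest, so no follow-up broadcast is needed. In C, the reduction of Example1_3 would look roughly like this (a sketch, not part of the original example set):

float my_result, result;
/* Every process contributes my_result; every process gets the sum. */
MPI_Allreduce(&my_result, &result, 1, MPI_FLOAT, MPI_SUM,
              MPI_COMM_WORLD);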

Page 38: Compilations With MPICH

• Creating a Makefile is a practical way to compile programs:

CC       = mpicc
CCC      = mpiCC
F77      = mpif77
F90      = mpif90
OPTFLAGS = -O3

f-example: f-example.o
	$(F77) $(OPTFLAGS) -o f-example f-example.o

c-example: c-example.o
	$(CC) $(OPTFLAGS) -o c-example c-example.o

Page 39: Compilations With MPICH (cont’d)

Makefile continues . . .

.c.o:
	$(CC) $(OPTFLAGS) -c $*.c
.f.o:
	$(F77) $(OPTFLAGS) -c $*.f
.f90.o:
	$(F90) $(OPTFLAGS) -c $*.f90
.SUFFIXES: .f90

Page 40: Run Jobs Interactively

On SCV’s Linux cluster, use mpirun.ch_gm to run MPI jobs:

skate% mpirun.ch_gm -np 4 example

(alias mpirun '/usr/local/mpich/1.2.1..7b/gm-1.5.1_Linux-2.4.7-10enterprise/smp/pgi/ssh/bin/mpirun.ch_gm')

Page 41: Run Jobs in Batch

We use PBS, the Portable Batch System, to manage batch jobs:

% qsub my_pbs_script

where qsub is the batch submission command and my_pbs_script is a user-provided script containing user-specific PBS batch job info.

Page 42: My_pbs_script

#!/bin/bash

# Set the default queue

#PBS -q dque

# Request 4 nodes with 1 processor per node

#PBS -l nodes=4:ppn=1,walltime=00:10:00

MYPROG="psor"

GMCONF=~/.gmpi/conf.$PBS_JOBID

/usr/local/xcat/bin/pbsnodefile2gmconf $PBS_NODEFILE >$GMCONF

cd $PBS_O_WORKDIR

NP=$(head -1 $GMCONF)

mpirun.ch_gm --gm-f $GMCONF --gm-recv polling --gm-use-shmem --gm-kill 5 --gm-np $NP PBS_JOBID=$PBS_JOBID $MYPROG

Page 43: Output of Example1_1

skate% make -f make.linux example1_1
skate% mpirun.ch_gm -np 4 example1_1
Process 1 has the partial result of 0.324423
Process 2 has the partial result of 0.216773
Process 0 has the partial result of 0.382683
Process 3 has the partial result of 0.076120
The result = 1.000000

Note that the partial results are printed out of rank order!