![Page 1: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/1.jpg)
Issues in Translation of High Performance Fortran
Bryan Carpenter
NPAC at Syracuse University
Syracuse, NY 13244
[email protected]
![Page 2: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/2.jpg)
Goals of this lecture
Discuss translation of some elementary HPF examples to MPI code.
Illustrate the need for a Distributed Array Descriptor (DAD).
Develop an abstract model of a DAD, and show how it can be used to translate simple codes.
![Page 3: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/3.jpg)
Contents of Lecture
Introduction.
Translation of simple HPF fragment to SPMD.
The problem of procedures.
Requirements for an array descriptor.
Groups.
  Process grids.
  Restricted groups.
Range objects.
A DAD.
![Page 4: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/4.jpg)
A simple HPF program
Here is a simple HPF program:
!HPF$ PROCESSORS P(4)
REAL A(50)
!HPF$ DISTRIBUTE A(BLOCK) ONTO P
FORALL (I = 1:50) A(I) = 1.0 * I
We want to translate this to an MPI program.
![Page 5: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/5.jpg)
Translation of simple program
INTEGER W_RANK, W_SIZE, ERRCODE
INTEGER BLK_SIZE
PARAMETER (BLK_SIZE = (50 + 3)/4)
REAL A(BLK_SIZE)
INTEGER BLK_START, BLK_COUNT
INTEGER L, I
CALL MPI_INIT(ERRCODE)
CALL MPI_COMM_RANK(MPI_COMM_WORLD, W_RANK, ERRCODE)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD, W_SIZE, ERRCODE)
IF (W_RANK < 4) THEN
  BLK_START = W_RANK * BLK_SIZE
  IF (50 - BLK_START >= BLK_SIZE) THEN
    BLK_COUNT = BLK_SIZE
  ELSEIF (50 - BLK_START > 0) THEN
    BLK_COUNT = 50 - BLK_START
  ELSE
    BLK_COUNT = 0
  ENDIF
  DO L = 1, BLK_COUNT
    I = BLK_START + L
    A(L) = 1.0 * I
  ENDDO
ENDIF
CALL MPI_FINALIZE(ERRCODE)
![Page 6: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/6.jpg)
Setting up the environment
Associated code:
INTEGER W_RANK, W_SIZE, ERRCODE
. . .
CALL MPI_INIT(ERRCODE)
CALL MPI_COMM_SIZE(MPI_COMM_WORLD, W_SIZE, ERRCODE)
CALL MPI_COMM_RANK(MPI_COMM_WORLD, W_RANK, ERRCODE)
. . .
CALL MPI_FINALIZE(ERRCODE)
![Page 7: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/7.jpg)
Allocating segment of the distributed array
Associated statements are:
INTEGER BLK_SIZE
PARAMETER (BLK_SIZE = (50 + 3)/4)
REAL A(BLK_SIZE)
Segment size is 50/4 rounded up, which is 13.
![Page 8: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/8.jpg)
Testing that this processor holds a segment
Associated code is:
IF (W_RANK < 4) THEN
  . . .
ENDIF
Assumes the number of MPI processes is at least the size of the largest processor arrangement in the HPF program.
![Page 9: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/9.jpg)
Computing parameters of locally held segment
Associated code:
INTEGER BLK_START, BLK_COUNT
. . .
BLK_START = W_RANK * BLK_SIZE
IF (50 - BLK_START >= BLK_SIZE) THEN
  BLK_COUNT = BLK_SIZE
ELSEIF (50 - BLK_START > 0) THEN
  BLK_COUNT = 50 - BLK_START
ELSE
  BLK_COUNT = 0
ENDIF
BLK_START: position in global index space.
BLK_COUNT: elements in segment.
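As a rough illustration of the segment arithmetic above, here is a minimal C++ sketch that computes the same two quantities for any rank. The `Segment` struct and the `block_segment` function are hypothetical names introduced for this example only; the formulas are the ones in the Fortran code, with the extent and process count passed in rather than fixed at 50 and 4.

```cpp
#include <cassert>

// Hypothetical helper type: fields mirror the Fortran variables on the slide.
struct Segment {
    int start;  // BLK_START: 0-based offset of the segment in the global index space
    int count;  // BLK_COUNT: number of elements held locally
};

// Block distribution of n elements over np processes, as seen by process `rank`.
Segment block_segment(int n, int np, int rank) {
    assert(n >= 0 && np > 0 && rank >= 0 && rank < np);
    const int blk_size = (n + np - 1) / np;  // n / np rounded up
    Segment s;
    s.start = rank * blk_size;
    if (n - s.start >= blk_size) {
        s.count = blk_size;       // a full block
    } else if (n - s.start > 0) {
        s.count = n - s.start;    // the last, partial block
    } else {
        s.count = 0;              // this process holds nothing
    }
    return s;
}
```

With n = 50 and np = 4 the block size is 13, so ranks 0 to 2 hold 13 elements each and rank 3 holds the remaining 11.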
![Page 10: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/10.jpg)
Loop over local elements
Associated code:
INTEGER L, I
. . .
DO L = 1, BLK_COUNT
  I = BLK_START + L
  A(L) = 1.0 * I
ENDDO
![Page 11: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/11.jpg)
An HPF procedure
Superficially similar program:
SUBROUTINE INIT(D)
REAL D(50)
!HPF$ INHERIT D
FORALL (I = 1:50) D(I) = 1.0 * I
END
The INHERIT directive means the mapping of the dummy argument should be the same as that of the actual argument, whatever that is.
![Page 12: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/12.jpg)
Procedure call with block-distributed actual
!HPF$ PROCESSORS P(4)
REAL A(50)
!HPF$ DISTRIBUTE A(BLOCK) ONTO P
CALL INIT(A)
Mapping of D:
![Page 13: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/13.jpg)
Procedure call with cyclically distributed actual
!HPF$ PROCESSORS P(4)
REAL A(50)
!HPF$ DISTRIBUTE A(CYCLIC) ONTO P
CALL INIT(A)
Mapping of D:
![Page 14: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/14.jpg)
Procedure call with strided alignment of actual
!HPF$ PROCESSORS P(4)
REAL A(100)
!HPF$ DISTRIBUTE A(BLOCK) ONTO P
CALL INIT(A(1:100:2))
Mapping of D:
![Page 15: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/15.jpg)
Procedure call with row-aligned actual
!HPF$ PROCESSORS Q(2, 2)
REAL A(6, 50)
!HPF$ DISTRIBUTE A(BLOCK, BLOCK) ONTO Q
CALL INIT(A(2, :))
Mapping of D:
![Page 16: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/16.jpg)
The problem
Somehow INIT must be translated to deal with data having any of these decompositions, or any legal HPF mapping.
Actual mapping not known until run-time.
Not an artificial example. Libraries that operate on distributed arrays (e.g. the communication libraries discussed later) must deal with exactly this situation.
![Page 17: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/17.jpg)
Requirements for an array descriptor
It seems that, to translate procedure calls, we need some non-trivial data structure to describe the layout of the actual argument.
This is the Distributed Array Descriptor (DAD).
Want to understand requirements and best organization of a DAD.
Adopt object-oriented principles to build an abstract design.
![Page 18: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/18.jpg)
Distributed array dimensions
Obvious structural feature of HPF array: multidimensional.
Each dimension mapped independently as:
  Collapsed (serial),
  Simple block distribution,
  Simple cyclic distribution,
  Block cyclic distribution,
  General block distribution (HPF 2.0),
  Linear alignment to any of above.
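To make the difference between the first few formats concrete, the following C++ sketch computes which process owns a given global index under block, cyclic and block-cyclic distribution. The function names are hypothetical, indices are 0-based, and the formulas are the usual textbook ones (block size rounded up, as in the earlier Fortran translation); general block distribution and linear alignment are not covered.

```cpp
#include <cassert>

// Owner of 0-based global index g when n elements are block-distributed
// over np processes. Block size is n / np rounded up.
int block_owner(int g, int n, int np) {
    assert(np > 0 && g >= 0 && g < n);
    const int blk_size = (n + np - 1) / np;
    return g / blk_size;
}

// Owner under simple cyclic distribution: elements are dealt out one at a time.
int cyclic_owner(int g, int np) {
    assert(np > 0 && g >= 0);
    return g % np;
}

// Owner under block-cyclic distribution with blocks of b elements:
// whole blocks are dealt out cyclically.
int block_cyclic_owner(int g, int b, int np) {
    assert(np > 0 && b > 0 && g >= 0);
    return (g / b) % np;
}
```

Each format needs its own formula, which is what motivates treating every dimension of a distributed array as an object with format-specific behaviour.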
![Page 19: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/19.jpg)
Converting block distribution to cyclic distribution
Block distribution:
BLK_SIZE = (N + NP - 1) / NP
. . .
BLK_START = R * BLK_SIZE
. . .
IF (N - BLK_START >= BLK_SIZE) THEN
  BLK_COUNT = BLK_SIZE
ELSEIF (N - BLK_START > 0) THEN
  BLK_COUNT = N - BLK_START
ELSE
  BLK_COUNT = 0
ENDIF
. . .
I = BLK_START + L
Cyclic distribution:
BLK_SIZE = (N + NP - 1) / NP
. . .
BLK_START = R
. . .
BLK_COUNT = (N - R + NP - 1) / NP
. . .
I = BLK_START + NP * (L - 1) + 1
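A quick way to convince oneself that the cyclic formulas are right is to run them for every process and check that each global index is produced exactly once. The C++ sketch below does that; `cyclic_count`, `cyclic_global` and `cyclic_visits` are hypothetical helper names, and the formulas are copied from the cyclic version above (process coordinate R is 0-based, local index L and global index I are 1-based).

```cpp
#include <cassert>
#include <vector>

// Number of elements of a cyclically distributed extent n held by process r.
int cyclic_count(int n, int np, int r) {
    assert(n >= 0 && np > 0 && r >= 0 && r < np);
    return (n - r + np - 1) / np;
}

// Global index I (1-based) of local element l (1-based) on process r.
int cyclic_global(int np, int r, int l) {
    return r + np * (l - 1) + 1;
}

// Counts how often each global index 1..n is produced when every process
// loops over its local elements. Entry 0 of the result is unused.
std::vector<int> cyclic_visits(int n, int np) {
    std::vector<int> visits(n + 1, 0);
    for (int r = 0; r < np; r++) {
        const int count = cyclic_count(n, np, r);
        for (int l = 1; l <= count; l++) {
            visits[cyclic_global(np, r, l)]++;
        }
    }
    return visits;
}
```

For N = 50 and NP = 4, processes 0 and 1 hold 13 elements and processes 2 and 3 hold 12, and every index from 1 to 50 is visited once.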
![Page 20: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/20.jpg)
Distributed ranges
Have different kinds of array dimension (distribution format).
Each kind of dimension has a different set of formulae for segment layout, index computation, etc.
OO interpretation: virtual functions on a class hierarchy.
Implement as the Range hierarchy.
DAD for rank-r array will contain r Range objects, one per dimension.
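The "virtual functions on a class hierarchy" idea can be sketched in a few lines of C++. This is only an illustration under simplifying assumptions, not the Adlib interface: the constructors here take a plain process count instead of a Dimension object, ranges are passed by reference, and the `fill` function is a hypothetical stand-in for a translated FORALL. The `Block` fields follow the structure used in the lecture's translations.

```cpp
#include <cassert>

// Parameters of the locally held block: loop count, then base and step
// for the global index and for the local subscript.
struct Block {
    int count;
    int glb_bas, glb_stp;
    int sub_bas, sub_stp;
};

// Simplified base class (assumption: takes a process count, not a Dimension).
class Range {
public:
    Range(int extent, int nprocs) : extent_(extent), nprocs_(nprocs) {
        assert(extent >= 0 && nprocs > 0);
    }
    virtual ~Range() {}
    int size() const { return extent_; }
    // Each distribution format supplies its own layout formulae.
    virtual void block(Block* blk, int crd) const = 0;
protected:
    int extent_, nprocs_;
};

class BlockRange : public Range {
public:
    BlockRange(int extent, int nprocs) : Range(extent, nprocs) {}
    void block(Block* blk, int crd) const override {
        const int blk_size = (extent_ + nprocs_ - 1) / nprocs_;
        const int start = crd * blk_size;
        int count = extent_ - start;
        if (count > blk_size) count = blk_size;
        if (count < 0) count = 0;
        blk->count = count;
        blk->glb_bas = start; blk->glb_stp = 1;
        blk->sub_bas = 0;     blk->sub_stp = 1;
    }
};

class CyclicRange : public Range {
public:
    CyclicRange(int extent, int nprocs) : Range(extent, nprocs) {}
    void block(Block* blk, int crd) const override {
        blk->count = (extent_ - crd + nprocs_ - 1) / nprocs_;
        blk->glb_bas = crd; blk->glb_stp = nprocs_;
        blk->sub_bas = 0;   blk->sub_stp = 1;
    }
};

// Distribution-independent local loop: A(I) = 1.0 * I for the elements
// held by the process with coordinate crd.
void fill(const Range& x, int crd, float* a) {
    Block b;
    x.block(&b, crd);
    for (int l = 0; l < b.count; l++) {
        const int i = b.glb_bas + b.glb_stp * l + 1;  // 1-based global index
        a[b.sub_bas + b.sub_stp * l] = 1.0f * i;
    }
}
```

The point of the design is that `fill` never mentions the distribution format; swapping `BlockRange` for `CyclicRange` changes which elements a process initializes without touching the loop.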
![Page 21: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/21.jpg)
Dealing with “hidden” dimensions of sections
Array may be mapped to slice of grid:
Rank-1 section only has one range object.
Need some other structure to represent embedding in subgrid.
![Page 22: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/22.jpg)
DAD groups
Need a group concept similar to MPI_Group.
Want lightweight structure for representing arbitrary slices of process grids.
Object representing grid itself needs multidimensional structure (cf. Cartesian Communicator in MPI).
![Page 23: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/23.jpg)
Representing processor arrangements
In an OO runtime descriptor, expect an entity like a processor arrangement to become an object.
Use C++ for definiteness:
!HPF$ PROCESSORS P(4)
becomes
Procs1 p(4);
and
!HPF$ PROCESSORS Q(2, 2)
becomes
Procs2 q(2, 2);
![Page 24: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/24.jpg)
Hierarchy of process grids
![Page 25: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/25.jpg)
Interface of Procs and Dimension
class Procs {
public:
  int member() const;
  Dimension dim(const int d) const;
  . . .
};
class Dimension {
public:
  int size() const;
  int crd() const;
  . . .
};
![Page 26: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/26.jpg)
Using Procs in translation
INTEGER W_RANK, . . .
CALL MPI_COMM_RANK(MPI_COMM_WORLD, W_RANK, ERRCODE)
. . .
IF (W_RANK < 4) THEN
  BLK_START = W_RANK * BLK_SIZE
  . . .
ENDIF
Becomes:
Procs1 p(4);
. . .
if (p.member()) {
  blk_start = p.dim(0).crd() * blk_size;
  . . .
}
![Page 27: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/27.jpg)
Restricted process groups
Slice of process grid to which array section may be mapped.
Portion of grid selected by specifying subset of dimension coordinates.
Lightweight representation. Use bitmask to represent dimension set.
![Page 28: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/28.jpg)
Example restricted groups in 2-dimensional grid
![Page 29: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/29.jpg)
Representation of subgrids
| Example | Dimension set | Lead process | Tuple |
|---|---|---|---|
| a) | {dim(0), dim(1)} | 0 | (p, 11₂, 0) |
| b) | {dim(0)} | 8 | (p, 10₂, 8) |
| c) | {dim(1)} | 1 | (p, 01₂, 1) |
| d) | {} | 6 | (p, 00₂, 6) |
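The (grid, bitmask, lead process) tuple can be sketched as a small C++ structure. Everything below is an assumption made for illustration: the names `Grid2`, `RestrictedGroup`, `restrict_dim` and `is_member` are hypothetical, process ids are taken to be c0 + n0 * c1 (dimension 0 varying fastest), and bit d of the mask stands for dim(d), which may not match the bit order used in the table. The lecture does not state the grid shape; a 4 x 4 grid is one shape that is consistent with the lead processes listed.

```cpp
#include <cassert>

// Hypothetical 2-dimensional process grid with extents n0 and n1.
// Assumed numbering: process id = c0 + n0 * c1.
struct Grid2 {
    int n0, n1;
};

// Restricted group: the grid, a bitmask of dimensions that still span the
// group (bit d set means dim(d) is in the dimension set), and the id of
// the lead process, whose coordinates are 0 in all spanning dimensions.
struct RestrictedGroup {
    Grid2 grid;
    unsigned mask;
    int lead;
};

// Fix dimension d at coordinate coord, removing it from the dimension set.
RestrictedGroup restrict_dim(RestrictedGroup g, int d, int coord) {
    assert(d == 0 || d == 1);
    assert(g.mask & (1u << d));  // dimension must not already be fixed
    const int extent = (d == 0) ? g.grid.n0 : g.grid.n1;
    assert(coord >= 0 && coord < extent);
    const int stride = (d == 0) ? 1 : g.grid.n0;
    g.mask &= ~(1u << d);
    g.lead += coord * stride;
    return g;
}

// A process belongs to the group if it agrees with the lead process
// on every dimension that has been fixed.
bool is_member(const RestrictedGroup& g, int id) {
    assert(id >= 0 && id < g.grid.n0 * g.grid.n1);
    const int c[2]    = { id % g.grid.n0, id / g.grid.n0 };
    const int lead[2] = { g.lead % g.grid.n0, g.lead / g.grid.n0 };
    for (int d = 0; d < 2; d++) {
        if (!(g.mask & (1u << d)) && c[d] != lead[d]) return false;
    }
    return true;
}
```

Under these assumptions, fixing dim(1) at coordinate 2 on a 4 x 4 grid gives a group with lead process 8 containing processes 8 to 11, which matches the shape of example b). The whole representation is a pointer or reference to the grid plus two integers, which is why such groups are cheap to copy.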
![Page 30: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/30.jpg)
The Group class
class Group {
public:
  Group(const Procs& p);
  void restrict(Dimension d, const int coord);
  int member() const;
  . . .
};
Lightweight: implementation in about 3 words. Can freely copy and discard.
DAD contains a Group object.
![Page 31: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/31.jpg)
Ranges
In DAD, range object describes extent and distribution format of one array dimension.
Expect a class hierarchy of ranges.
Each subclass corresponds to a different kind of distribution format for an array dimension.
![Page 32: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/32.jpg)
A hierarchy of ranges
![Page 33: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/33.jpg)
Interface of the Range class
class Range {
public:
  int size() const;
  Dimension dim() const;
  int volume() const;
  Range subrng(const int extent, const int base, const int stride = 1) const;
  void block(Block* blk, const int crd) const;
  void location(Location* loc, const int glb) const;
  . . .
};
![Page 34: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/34.jpg)
Translating simple HPF program to C++
Source:
!HPF$ PROCESSORS P(4)
REAL A(50)
!HPF$ DISTRIBUTE A(BLOCK) ONTO P
FORALL (I = 1:50) A(I) = 1.0 * I
Translation:
Procs1 p(4);
BlockRange x(50, p.dim(0));
float* a = new float [x.volume()];
if (p.member()) {
  Block b;
  x.block(&b, p.dim(0).crd());
  for (int l = 0; l < b.count; l++) {
    const int i = b.glb_bas + b.glb_stp * l + 1;
    a [b.sub_bas + b.sub_stp * l] = 1.0 * i;
  }
}
![Page 35: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/35.jpg)
Features of C++ translation
Arguments of BlockRange constructor are extent of range and process dimension.
Fields of Block define count of local loop, and base and step for local subscript and global index.
If distribution directive is changed to:
!HPF$ DISTRIBUTE A(CYCLIC) ONTO P
only change is x declaration becomes:
CyclicRange x(50, p.dim(0));
This is apparent progress toward writing code that works for any distribution.
![Page 36: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/36.jpg)
The Block and Location structures
struct Block {
  int count;
  int glb_bas;
  int glb_stp;
  int sub_bas;
  int sub_stp;
};
struct Location {
  int sub;
  int crd;
  . . .
};
![Page 37: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/37.jpg)
Memory strides
Fortran 90 program:
REAL B(100, 100)
. . .
CALL FOO(B(1, :))
SUBROUTINE FOO(C)
REAL C(:)
. . .
END
First dimension of B most-rapidly-varying in memory.
Second dimension has memory stride 100, inherited by C.
Fortran compilers normally pass a dope vector containing r extents and r strides for rank-r argument.
Stride not really a property of the distributed range. Store separately in DAD.
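The stride arithmetic for the section B(1, :) can be checked with a small C++ sketch. The `Dope1` struct and both functions are hypothetical names for this example; they model only the extent-and-stride idea, not any particular compiler's dope vector layout.

```cpp
#include <cassert>

// 0-based memory offset of element B(i, j) (1-based subscripts) in a
// Fortran array whose first extent is n0. Fortran stores arrays in
// column-major order, so the first subscript varies fastest.
int offset(int i, int j, int n0) {
    assert(i >= 1 && i <= n0 && j >= 1);
    return (i - 1) + n0 * (j - 1);
}

// Hypothetical rank-1 dope vector: where the section starts in memory,
// how many elements it has, and the distance between consecutive elements.
struct Dope1 {
    int base;
    int extent;
    int stride;
};

// Dope vector for the row section B(row, :) of an n0 x n1 array.
Dope1 row_section(int row, int n0, int n1) {
    assert(row >= 1 && row <= n0 && n1 >= 0);
    Dope1 d = { row - 1, n1, n0 };
    return d;
}
```

For B(100, 100), `row_section(1, 100, 100)` has stride 100: element C(k) lives at offset `base + stride * (k - 1)`, which agrees with `offset(1, k, 100)`.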
![Page 38: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/38.jpg)
A DAD
Abstract DAD for a rank-r array is an object containing:
  a distribution group,
  r range objects, and
  r integer strides.
![Page 39: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/39.jpg)
Interface of the DAD class
struct DAD {
  DAD(const int _rank, const Group& _group, Map _maps []);
  const Group& grp() const;
  Range rng(const int d) const;
  int str(const int d) const;
  . . .
};
![Page 40: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/40.jpg)
Map structure
struct Map {
  Map(Range _range, const int _stride);
  Range range;
  int stride;
};
![Page 41: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/41.jpg)
Translating HPF program with inherited mapping
Source:
SUBROUTINE INIT(D)
REAL D(50)
!HPF$ INHERIT D
FORALL (I = 1:50) D(I) = 1.0 * I
END
Translation:
void init(float* d, DAD* d_dad) {
  Group p = d_dad->grp();
  if (p.member()) {
    Range x = d_dad->rng(0);
    int s = d_dad->str(0);
    Block b;
    x.block(&b, p.dim(0).crd());
    for (int l = 0; l < b.count; l++) {
      const int i = b.glb_bas + b.glb_stp * l + 1;
      d [s * (b.sub_bas + b.sub_stp * l)] = 1.0 * i;
    }
  }
}
![Page 42: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/42.jpg)
Translation of call with block-distributed actual
Source:
!HPF$ PROCESSORS P(4)
REAL A(50)
!HPF$ DISTRIBUTE A(BLOCK) ONTO P
CALL INIT(A)
Translation:
Procs1 p(4);
BlockRange x(50, p.dim(0));
float* a = new float [x.volume()];
Map maps [1];
maps [0] = Map(x, 1);
DAD dad(1, p, maps);
init(a, &dad);
![Page 43: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/43.jpg)
Translation of call with cyclically distributed actual
Source:
!HPF$ PROCESSORS P(4)
REAL A(50)
!HPF$ DISTRIBUTE A(CYCLIC) ONTO P
CALL INIT(A)
Translation:
Procs1 p(4);
CyclicRange x(50, p.dim(0));
float* a = new float [x.volume()];
Map maps [1];
maps [0] = Map(x, 1);
DAD dad(1, p, maps);
init(a, &dad);
![Page 44: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/44.jpg)
Translation of call with strided alignment of actual
Source:
!HPF$ PROCESSORS P(4)
REAL A(100)
!HPF$ DISTRIBUTE A(BLOCK) ONTO P
CALL INIT(A(1:100:2))
Translation:
Procs1 p(4);
BlockRange x(100, p.dim(0));
float* a = new float [x.volume()];
// Create DAD for section a(::2): extent 50, base 0, stride 2
Range x2 = x.subrng(50, 0, 2);
Map maps [1];
maps [0] = Map(x2, 1);
DAD dad(1, p, maps);
init(a, &dad);
![Page 45: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/45.jpg)
Translation of call with row-aligned actual
Source:
!HPF$ PROCESSORS Q(2, 2)
REAL A(6, 50)
!HPF$ DISTRIBUTE A(BLOCK, BLOCK) ONTO Q
CALL INIT(A(2, :))
Translation:
Procs2 q(2, 2);
BlockRange x(6, q.dim(0)), y(50, q.dim(1));
float* a = new float [x.volume() * y.volume()];
// Create DAD for section a(1, :) (0-based; A(2, :) in the Fortran source)
Location i;
x.location(&i, 1);
Group p = q;
p.restrict(q.dim(0), i.crd);
Map maps [1];
maps [0] = Map(y, x.volume());
DAD dad(1, p, maps);
init(a + i.sub, &dad);
![Page 46: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/46.jpg)
Other features of the Adlib DAD
Support for block-cyclic distributions.
  Local loops traversing distributed data need outer loop over set of local blocks.
  LocBlocksIndex iterator class.
  offset() method computes overall memory offset.
Support for ghost extension, other memory layouts.
  Shift in memory for ghost region not in local subscript (universal, memory-layout-independent).
  disp(), offset(), step() methods applied to local subscript.
![Page 47: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/47.jpg)
Other features of the Adlib DAD, II
Support for loops over subranges.
  Additional block() methods take triplet arguments and directly traverse subranges.
  crds() methods define ranges of coordinates where local blocks actually exist.
Other features to support communication library.
  AllBlocksIndex.
Miscellaneous inquiries and predicates.
  Useful in general libraries, and for runtime checking of programs for correctness.
![Page 48: Issues in Translation of High Performance Fortran](https://reader035.vdocuments.us/reader035/viewer/2022062422/568131c8550346895d982fcc/html5/thumbnails/48.jpg)
Next Lecture: Communication in Data Parallel Languages
Patterns of communication needed to implement language constructs.
Libraries that support these communication patterns.