
1

A Multi-platform Co-Array Fortran Compiler

Yuri Dotsenko Cristian Coarfa

John Mellor-Crummey

Department of Computer Science, Rice University

Houston, TX USA

2

Motivation

Parallel Programming Models

• MPI: de facto standard

– difficult to program

• OpenMP: inefficient to map on distributed memory platforms

– lack of locality control

• HPF: hard to obtain high performance

– heroic compilers needed!

Global address space languages: CAF, Titanium, UPC

an appealing middle ground

3

Co-Array Fortran

• Global address space programming model

– one-sided communication (GET/PUT)

• Programmer has control over performance-critical factors

– data distribution

– computation partitioning

– communication placement

• Data movement and synchronization as language primitives

– amenable to compiler-based communication optimization

4

CAF Programming Model Features

• SPMD process images

– fixed number of images during execution

– images operate asynchronously

• Both private and shared data

– real x(20, 20): a private 20x20 array in each image

– real y(20, 20)[*]: a shared 20x20 array in each image

• Simple one-sided shared-memory communication

– x(:,j:j+2) = y(:,p:p+2)[r]: copy columns from image r into local columns

• Synchronization intrinsic functions

– sync_all – a barrier and a memory fence

– sync_mem – a memory fence

– sync_team([team members to notify], [team members to wait for])

• Pointers and (perhaps asymmetric) dynamic allocation

5

One-sided Communication with Co-Arrays

integer a(10,20)[*]

if (this_image() > 1) a(1:10,1:2) = a(1:10,19:20)[this_image()-1]

[Figure: the array a(10,20) on image 1, image 2, …, image N]
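The shift on this slide can be modeled outside CAF. The Python sketch below is only an illustration under stated assumptions, not cafc output: each image is a plain nested list, and the names make_images and shift_from_left are hypothetical. Every image after the first copies columns 19:20 of its left neighbor into its own columns 1:2, without the neighbor taking part.

```python
# Toy model of the slide's one-sided shift; this is not real CAF.
# make_images and shift_from_left are hypothetical names.

ROWS, COLS = 10, 20


def make_images(num_images):
    """Give every image its own a(10,20); each entry records (image, row, col)."""
    return {
        img: [[(img, r, c) for c in range(1, COLS + 1)] for r in range(1, ROWS + 1)]
        for img in range(1, num_images + 1)
    }


def shift_from_left(images):
    """Model: if (this_image() > 1) a(1:10,1:2) = a(1:10,19:20)[this_image()-1]."""
    for me in images:
        if me > 1:
            left = images[me - 1]
            for r in range(ROWS):
                # One-sided GET: only the reader acts; the owner is passive.
                images[me][r][0:2] = left[r][18:20]
```

Because only columns 1:2 are written and only columns 19:20 are read, the result does not depend on the order in which the images run.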

6

Rice Co-Array Fortran Compiler (cafc)

• First CAF multi-platform compiler

– previous compiler only for Cray shared memory systems

• Implements core of the language

– currently lacks support for derived type and dynamic co-arrays

• Core sufficient for non-trivial codes

• Performance comparable to that of hand-tuned MPI codes

• Open source

7

Outline

• CAF programming model

• cafc

– Core language implementation

– Optimizations

• Experimental evaluation

• Conclusions

8

Implementation Strategy

• Source-to-source compilation of CAF codes

– uses Open64/SL Fortran 90 infrastructure

– CAF → Fortran 90 + communication operations

• Communication

– ARMCI library for one-sided communication on clusters

– load/store communication on shared-memory platforms

Goals

– portability

– high performance on a wide range of platforms

9

Co-Array Descriptors

• Initialize and manipulate Fortran 90 dope vectors

real :: a(10,10,10)[*]

type CAFDesc_real_3
  integer(ptrkind) :: handle      ! Opaque handle to CAF runtime representation
  real, pointer :: ptr(:,:,:)     ! Fortran 90 pointer to local co-array data
end type CAFDesc_real_3

type(CAFDesc_real_3):: a

10

Allocating COMMON and SAVE Co-Arrays

• Compiler

– generates static initializer for each common/save variable

• Linker

– collects calls to all initializers

– generates global initializer that calls all others

– compiles global initializer and links into program

• Launch

– invokes global initializer before main program begins

• allocates co-array storage outside Fortran 90 runtime system

• associates co-array descriptors with allocated memory

Similar to handling for C++ static constructors
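The compile, link, and launch steps above follow a registration pattern that can be sketched in a few lines. The Python below is a hypothetical model, not the code cafc generates: all names (static_initializer, global_initializer, launch) and the dict-based descriptor are assumptions made for illustration.

```python
# Hypothetical model of the compile/link/launch scheme; not cafc's generated code.

initializers = []      # what the link step collects
coarray_storage = {}   # storage allocated outside the Fortran 90 runtime system
descriptors = {}       # co-array descriptors, filled in at launch


def static_initializer(name, num_elements):
    """Compiler step: emit one initializer per COMMON/SAVE co-array."""
    def init():
        coarray_storage[name] = [0.0] * num_elements       # allocate storage
        descriptors[name] = {"handle": name,                # associate descriptor
                             "data": coarray_storage[name]}
    initializers.append(init)
    return init


def global_initializer():
    """Linker step: a single routine that calls every collected initializer."""
    for init in initializers:
        init()


def launch(main):
    """Launch step: run the global initializer before the main program begins."""
    global_initializer()
    return main()
```

The main program can then rely on every descriptor already pointing at allocated storage, which is the same guarantee C++ gives for static objects with constructors.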

11

Parameter Passing

• Call-by-value convention (copy-in, copy-out)

– pass remote co-array data to procedures only as values

• Call-by-co-array convention*

– argument declared as a co-array by callee

– enables access to local and remote co-array data

• Call-by-reference convention* (cafc)

– argument declared as an explicit shape array

– enables access to local co-array data only

– enables reuse of existing Fortran code

* requires an explicit interface

call f((a(I)[p]))

subroutine f(a)
  real :: a(10)[*]

real :: x(10)[*]
call f(x)

subroutine f(a)
  real :: a(10)

12

Multiple Co-dimensions

Managing processors as a logical multi-dimensional grid

integer a(10,10)[5,4,*]   ! 3D processor grid 5 x 4 x …

• Support co-space reshaping at procedure calls

– change number of co-dimensions

– co-space bounds as procedure arguments
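How a set of co-subscripts selects one image can be written down as a small index computation. The Python sketch below assumes lower co-bounds of 1 and the same column-major ordering Fortran uses for array elements; the function name image_index and its argument layout are hypothetical and are not taken from cafc.

```python
def image_index(cosubscripts, coextents):
    """Map 1-based co-subscripts to a 1-based image number, column-major.

    coextents lists the extent of every co-dimension except the last,
    which is '*' in the declaration, e.g. (5, 4) for a(10,10)[5,4,*].
    """
    if len(cosubscripts) != len(coextents) + 1:
        raise ValueError("need one co-subscript per co-dimension")
    index = 0
    stride = 1
    for sub, extent in zip(cosubscripts, list(coextents) + [None]):
        if sub < 1 or (extent is not None and sub > extent):
            raise ValueError("co-subscript out of bounds")
        index += (sub - 1) * stride
        if extent is not None:
            stride *= extent
    return index + 1
```

For the 5 x 4 x … grid on this slide, a(i,j)[3,2,2] would then refer to image 2 + 1*5 + 1*20 + 1 = 28 under these assumptions.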

13

Implementing Communication

x(1:n) = a(1:n)[p] + …

• Use a temporary buffer to hold off-processor data

– allocate buffer

– perform GET to fill buffer

– perform computation: x(1:n) = buffer(1:n) + …

– deallocate buffer

• Optimizations

– no temporary storage for co-array to co-array copies

– load/store communication on shared-memory systems
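The buffer-based translation of x(1:n) = a(1:n)[p] + … can be modeled step by step. In the Python sketch below, get is a hypothetical stand-in for a one-sided GET, memory is a dict from image number to that image's copy of a, and local_term stands in for the part of the expression the slide elides; none of these names come from cafc or ARMCI.

```python
def get(image, lo, hi, memory):
    """Copy a(lo:hi)[image] (1-based, inclusive) into a freshly allocated buffer."""
    return list(memory[image][lo - 1:hi])


def compute_x(n, p, memory, local_term):
    """Model x(1:n) = a(1:n)[p] + local_term(1:n) using a temporary buffer."""
    buffer = get(p, 1, n, memory)                        # allocate buffer + GET
    x = [buffer[i] + local_term[i] for i in range(n)]    # compute from the buffer
    del buffer                                           # deallocate buffer
    return x
```

A co-array to co-array copy needs no such buffer, because the GET can deliver directly into the destination, which is the first optimization listed on the slide.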

14

Synchronization

• Original CAF specification: team synchronization only

– sync_all, sync_team

• Limits performance on loosely-coupled architectures

• Point-to-point extensions

– sync_notify(q)

– sync_wait(p)

Point-to-point synchronization semantics

Delivery of a notify to q from p implies that all communication from p to q issued before the notify has been delivered to q
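The delivery rule above can be exercised with a small thread-based model. This Python sketch is an assumption-laden illustration, not the cafc runtime: images are threads, each ordered pair of images gets a counting semaphore, and put is a blocking write, so the ordering guarantee holds trivially here. The helper names put, producer, consumer, and run are hypothetical; only sync_notify and sync_wait come from the slide.

```python
import threading

NUM_IMAGES = 2
coarray = {img: [0] * 4 for img in range(1, NUM_IMAGES + 1)}

# One counting semaphore per ordered pair (sender, receiver).
notifications = {(s, r): threading.Semaphore(0)
                 for s in coarray for r in coarray if s != r}


def put(dest, values):
    """One-sided write into dest's co-array; complete when it returns."""
    coarray[dest][:] = values


def sync_notify(me, q):
    """Image `me` notifies image q."""
    notifications[(me, q)].release()


def sync_wait(me, p):
    """Image `me` blocks until a notify from image p has been delivered."""
    notifications[(p, me)].acquire()


def producer(me, q):
    put(q, [1, 2, 3, 4])   # communication issued before the notify
    sync_notify(me, q)


def consumer(me, p, seen):
    sync_wait(me, p)       # after this returns, p's earlier PUT is visible
    seen.append(list(coarray[me]))


def run():
    seen = []
    threads = [threading.Thread(target=consumer, args=(2, 1, seen)),
               threading.Thread(target=producer, args=(1, 2))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

Only the two images involved synchronize, which is what distinguishes this from sync_all or sync_team.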

15

Outline

• CAF programming model

• cafc

– Core language implementation

– Optimizations

• procedure splitting

• supporting hints for non-blocking communication

• packing strided communications

• Experimental evaluation

• Conclusions

16

An Impediment to Code Efficiency

• Original reference

rhs(1,i,j,k,c) = … + u(1,i-1,j,k,c) - …

• Transformed reference

rhs%ptr(1,i,j,k,c) = … + u%ptr(1,i-1,j,k,c) - …

• Fortran 90 pointer-based co-array representation does not convey

– the lack of co-array aliasing

– co-array contiguity

– co-array bounds

• Lack of knowledge inhibits important code optimizations

17

Procedure Splitting

CAF to CAF preprocessing

Original:

subroutine f(…)
  real, save :: c(100)[*]
  ... = c(50) ...
end subroutine f

After procedure splitting:

subroutine f(…)
  real, save :: c(100)[*]
  interface
    subroutine f_inner(…, c_arg)
      real :: c_arg(100)[*]
    end subroutine f_inner
  end interface
  call f_inner(…, c)
end subroutine f

subroutine f_inner(…, c_arg)
  real :: c_arg(100)[*]
  ... = c_arg(50) ...
end subroutine f_inner

18

Benefits of Procedure Splitting

• Generated code conveys

– lack of co-array aliasing

– co-array contiguity

– co-array bounds

• Enables back-end compiler to generate better code

19

Hiding Communication Latency

Goal: enable communication/computation overlap

• Impediments to generating non-blocking communication

– use of indexed subscripts in co-dimensions

– lack of whole program analysis

• Approach: support hints for non-blocking communication

– overcome conservative compiler analysis

– enable sophisticated programmers to achieve good performance today

20

Hints for Non-blocking PUTs

• Hints for CAF run-time system to issue non-blocking PUTs

region_id = open_nb_put_region()
...
Put_Stmt_1
...
Put_Stmt_N
...
call close_nb_put_region(region_id)

• Complete non-blocking PUTs:

call complete_nb_put_region(region_id)

• Open problem: Exploiting non-blocking GETs?
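The region calls on this slide can be modeled as bookkeeping around deferred PUTs. The Python sketch below is a simplified assumption, not the cafc runtime: the class NonBlockingPutRuntime and its put method are hypothetical, and the model applies every PUT only at completion time, which is the latest point a real non-blocking PUT could land (a real one may land earlier).

```python
class NonBlockingPutRuntime:
    """Toy model of the open/close/complete region hints."""

    def __init__(self, memory):
        self.memory = memory      # image number -> list of values
        self.open_region = None   # id of the region currently open, if any
        self.pending = {}         # region id -> PUTs not yet completed
        self.next_id = 1

    def open_nb_put_region(self):
        region_id = self.next_id
        self.next_id += 1
        self.pending[region_id] = []
        self.open_region = region_id
        return region_id

    def close_nb_put_region(self, region_id):
        if self.open_region != region_id:
            raise ValueError("region is not open")
        self.open_region = None

    def put(self, image, offset, values):
        """PUTs inside an open region are deferred; all others are blocking."""
        if self.open_region is None:
            self._apply(image, offset, values)
        else:
            self.pending[self.open_region].append((image, offset, list(values)))

    def complete_nb_put_region(self, region_id):
        """After this returns, every PUT issued in the region has been delivered."""
        for image, offset, values in self.pending.pop(region_id):
            self._apply(image, offset, values)

    def _apply(self, image, offset, values):
        self.memory[image][offset:offset + len(values)] = values
```

Computation placed between the close and complete calls is what would overlap with the communication.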

21

Strided vs. Contiguous Transfers

• Problem

CAF remote reference might induce many small data transfers

a(i,1:n)[p] = b(j,1:n)

• Solution

pack strided data on source and unpack it on destination
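The cost difference comes from Fortran's column-major layout: a row section such as a(i,1:n) touches one element per column, so its elements are a full column apart in memory. The Python sketch below models arrays as flat column-major lists and counts transfers; the function names and the message count are illustrative assumptions, not ARMCI behavior.

```python
def strided_indices(row, n, leading_dim):
    """0-based flat offsets of a(row, 1:n) in a column-major array."""
    return [(row - 1) + col * leading_dim for col in range(n)]


def put_elementwise(dst, dst_row, src, src_row, n, leading_dim):
    """Naive PUT: one small transfer per element. Returns the message count."""
    messages = 0
    pairs = zip(strided_indices(dst_row, n, leading_dim),
                strided_indices(src_row, n, leading_dim))
    for d, s in pairs:
        dst[d] = src[s]
        messages += 1
    return messages


def put_packed(dst, dst_row, src, src_row, n, leading_dim):
    """Pack at the source, send one contiguous message, unpack at the destination."""
    packed = [src[s] for s in strided_indices(src_row, n, leading_dim)]  # pack
    messages = 1                                                         # one transfer
    for d, value in zip(strided_indices(dst_row, n, leading_dim), packed):
        dst[d] = value                                                   # unpack
    return messages
```

Both routines leave the destination in the same state; only the number of transfers differs (n versus 1 in this model).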

22

Pragmatics of Packing

Who should implement packing?

• The CAF programmer

– difficult to program

• The CAF compiler

– unpacking requires conversion of PUTs into two-sided communication (a difficult whole-program transformation)

• The communication library

– most natural place

– ARMCI currently performs packing on Myrinet

23

CAF Compiler Targets (Sept 2004)

• Processors

– Pentium, Alpha, Itanium2, MIPS

• Interconnects

– Quadrics, Myrinet, Gigabit Ethernet, shared memory

• Operating systems

– Linux, Tru64, IRIX

24

Outline

• CAF programming model

• cafc

– Core language implementation

– Optimizations

• Experimental evaluation

• Conclusions

25

Experimental Evaluation

• Platforms

– Alpha+Quadrics QSNet (Elan3)

– Itanium2+Quadrics QSNet II (Elan4)

– Itanium2+Myrinet 2000

• Codes

– NAS Parallel Benchmarks (NPB 2.3) from NASA Ames

26

NAS BT Efficiency (Class C)

27

NAS SP Efficiency (Class C)

lack of non-blocking notify implementation blocks CAF comm/comp overlap

28

NAS MG Efficiency (Class C)

• ARMCI comm is efficient

• pt-2-pt synch boosts CAF performance by 30%

29

NAS CG Efficiency (Class C)

30

NAS LU Efficiency (Class C)

31

Impact of Optimizations

Assorted Results

• Procedure splitting

– 42-60% improvement for BT on Itanium2+Myrinet cluster

– 15-33% improvement for LU on Alpha+Quadrics

• Non-blocking communication generation

– 5% improvement for BT on Itanium2+Quadrics cluster

– 3% improvement for MG on all platforms

• Packing of strided data

– 31% improvement for BT on Alpha+Quadrics cluster

– 37% improvement for LU on Itanium2+Quadrics cluster

See paper for more details

32

Conclusions

• CAF boosts programming productivity

– simplifies the development of SPMD parallel programs

– shifts details of managing communication to compiler

• cafc delivers performance comparable to hand-tuned MPI

• cafc implements effective optimizations

– procedure splitting

– non-blocking communication

– packing of strided communication (in ARMCI)

• Vectorization needed to achieve true performance portability with machines like Cray X1

http://www.hipersoft.rice.edu/caf
