Experiences with Co-array Fortran on Hardware Shared Memory Platforms

Yuri Dotsenko, Cristian Coarfa, John Mellor-Crummey, Daniel Chavarria-Miranda

Rice University, Houston, TX

Co-array Fortran

Global Address Space (GAS) language

SPMD programming model

Simple extension of Fortran 90

Explicit control over data placement and computation distribution

Private data

Shared data: both local and remote

One-sided communication (PUT and GET)

Team and point-to-point synchronization

Co-array Fortran: Example

integer :: a(10,20)[*]

if (this_image() > 1) &
    a(1:10,1:2) = a(1:10,19:20)[this_image()-1]

[Figure: images 1, 2, ..., N each hold their own a(10,20). Every image except image 1 copies columns 19:20 of its left neighbor into its own columns 1:2.]

Copies from left neighbor
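The fragment above omits synchronization and initialization. A minimal self-contained sketch of the same left-neighbor copy follows. It assumes Fortran 2008 coarray syntax (`sync all` rather than the `sync_all()` call used later in these slides), and the fill values and the self-check are illustrative only, not part of the original example.

```fortran
program left_neighbor_copy
  implicit none
  integer :: a(10,20)[*]   ! one 10x20 block per image
  integer :: me

  me = this_image()
  a = me                   ! tag every element with the owning image index

  sync all                 ! every image must finish initializing before any GET

  if (me > 1) then
     ! One-sided GET: read the last two columns of the left neighbor
     a(1:10,1:2) = a(1:10,19:20)[me-1]
  end if

  sync all                 ! wait until all GETs have completed

  ! Illustrative check: the copied columns now hold the neighbor's index
  if (me > 1) then
     if (any(a(:,1:2) /= me-1)) error stop "halo mismatch"
  end if

  sync all
  if (me == 1) print *, "halo exchange ok on", num_images(), "images"
end program left_neighbor_copy
```

Each image reads columns 19:20 of its neighbor while that neighbor writes only its own columns 1:2, so the GETs do not conflict with the local updates. With a single image the program skips the copy and still completes.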

Compiling CAF

Source-to-source translation

Prototype Rice cafc

  Fortran 90 pointer-based co-array representation

  ARMCI-based data movement

Goal: performance transparency

Challenges:

  Retain CAF source-level information: array contiguity, array bounds, lack of aliasing

  Exploit efficient fine-grain communication on SMPs

Outline

Co-array representation and data access

  Local data

  Remote data

Experimental evaluation

Conclusions

Representation and Access for Local Data

Efficient local access to SAVE/COMMON co-arrays is crucial to achieving best performance on a target architecture

Fortran 90 pointer

Fortran 90 pointer to structure

Cray pointer

Subroutine argument

COMMON block (need support for symmetric shared objects)

Fortran 90 Pointer Representation

CAF declaration: real, save :: a(10,20)[*]

After translation:

    type T1
      integer(PtrSize) :: handle
      real, pointer :: local(:,:)
    end type T1
    type (T1) ca

Local access: ca%local(2,3)

Portable representation

Back-end compiler has no knowledge about:

  Potential aliasing (no-alias flags for some compilers)

  Contiguity

  Bounds

Implemented in cafc
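The translated declaration can be exercised in isolation. The sketch below is a single-image stand-in, not cafc output: it assumes `PtrSize` is an address-sized integer kind (here `c_intptr_t`), uses a dummy value for `handle`, and replaces runtime-allocated co-array memory with an ordinary allocatable target named `storage` (a hypothetical name).

```fortran
program f90_pointer_repr
  use iso_c_binding, only: c_intptr_t
  implicit none
  ! Assumption: PtrSize is an integer kind wide enough to hold an address.
  integer, parameter :: PtrSize = c_intptr_t

  type T1
     integer(PtrSize) :: handle        ! opaque runtime handle (dummy value here)
     real, pointer    :: local(:,:)    ! this image's part of the co-array
  end type T1

  type(T1) :: ca
  ! Stand-in for memory that the CAF runtime would allocate.
  real, allocatable, target :: storage(:,:)

  allocate(storage(10,20))
  storage = 0.0
  ca%handle = 0_PtrSize
  ca%local => storage

  ! Local access a(2,3) after translation goes through the pointer.
  ca%local(2,3) = 42.0

  ! The shape is only discoverable at run time through the pointer descriptor.
  print *, "shape of ca%local:", shape(ca%local)
  if (storage(2,3) /= 42.0) error stop "pointer does not alias storage"
end program f90_pointer_repr
```

Because `local` is a deferred-shape pointer, nothing in the declaration tells the back-end compiler the bounds, the contiguity, or what else the pointer might alias, which is the limitation the slide lists.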

Fortran 90 Pointer to Structure Representation


After translation:

    type T1
      real :: local(10,20)
    end type T1
    type (T1), pointer :: ca

Conveys constant bounds and contiguity

Potential aliasing is still a problem

Cray Pointer Representation


After translation:

    real :: a_local(10,20)
    pointer (a_ptr, a_local)

Conveys constant bounds and contiguity

Potential aliasing is still a problem

Cray pointer is not in Fortran 90 standard

Subroutine Argument Representation

CAF source:

    subroutine foo(…)
      real, save :: a(10,20)[*]
      a(i,j) = … + a(i-1,j) * …
    end subroutine foo

After translation:

    subroutine foo(…)
      ! F90 representation for co-array a
      call foo_body(ca%local(1,1), ca%handle, …)
    end subroutine foo

    subroutine foo_body(a_local, a_handle, …)
      real :: a_local(10,20)
      a_local(i,j) = … + a_local(i-1,j) * …
    end subroutine foo_body

Subroutine Argument Representation (cont.)

Avoid conservative assumptions about co-array aliasing by the back-end compiler

Performance is close to optimal

Extra procedures and procedure calls

Implemented in cafc
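A runnable single-image sketch of this procedure splitting is given below. It is an illustration under stated assumptions, not cafc output: the stencil coefficient `0.5`, the module name `foo_split`, and the `storage` array are hypothetical, and the handle argument is omitted. The slide passes the first element `ca%local(1,1)`; the sketch passes the whole pointer array instead, because with an explicit interface the standard does not allow an element of a pointer array to be associated with an array dummy argument.

```fortran
module foo_split
  implicit none
  type T1
     real, pointer :: local(:,:)       ! Fortran 90 pointer representation
  end type T1
contains
  ! Outer procedure: unwraps the pointer and forwards the data.
  subroutine foo(ca)
    type(T1), intent(inout) :: ca
    call foo_body(ca%local)
  end subroutine foo

  ! Inner procedure: the co-array is an explicit-shape dummy argument, so
  ! the compiler knows its bounds and contiguity, and the dummy-argument
  ! rules let it assume no aliasing with other arguments.
  subroutine foo_body(a_local)
    real, intent(inout) :: a_local(10,20)
    integer :: i, j
    do j = 1, 20
       do i = 2, 10
          a_local(i,j) = a_local(i,j) + 0.5 * a_local(i-1,j)
       end do
    end do
  end subroutine foo_body
end module foo_split

program split_demo
  use foo_split
  implicit none
  real, target :: storage(10,20)       ! stand-in for co-array memory
  type(T1) :: ca

  storage = 1.0
  ca%local => storage
  call foo(ca)

  ! Row 1 is untouched; row 2 becomes 1.0 + 0.5*1.0.
  if (storage(1,1) /= 1.0 .or. storage(2,1) /= 1.5) error stop "unexpected result"
  print *, "a(2,1) =", storage(2,1)
end program split_demo
```

The cost noted on the slide is visible here: every procedure that touches a co-array gains a wrapper and an extra call.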

COMMON Block Representation

CAF declaration:

    real :: a(10,20)[*]
    common /a_cb/ a

After translation:

    real :: ca(10,20)
    common /ca_cb/ ca

Yields best performance for local accesses

OS must support symmetric data objects

Generating CAF Communication

Generic parallel architectures

  Library function calls to move data

Shared memory architectures (load/store)

  Fortran 90 pointers

  Vector of Fortran 90 pointers

  Cray pointers

Communication Generation for Generic Parallel Architectures

CAF code: a(:) = b(:)[p] + …

Translated code:

    allocate b_temp(:)
    call GET( b, p, b_temp, … )
    a(:) = b_temp(:) + …
    deallocate b_temp

Portable: works on clusters and SMPs

Function overhead per fine-grain access

Uses temporary to hold off-processor data

Implemented in cafc
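The temporary-buffer pattern can be sketched without the ARMCI-based runtime. In the example below, `get_remote` is a hypothetical stand-in for the library `GET` call, implemented with a Fortran 2008 coarray read so that the program is self-contained. The array size, the fill values, and the choice of the right neighbor as image `p` are likewise assumptions made for illustration.

```fortran
program generic_get
  implicit none
  integer, parameter :: n = 8
  real :: a(n), b(n)[*]
  real, allocatable :: b_temp(:)
  integer :: p

  b = real(this_image())           ! each image fills b with its own index
  sync all                         ! b must be initialized everywhere first

  ! Source image: right neighbor, wrapping around to image 1.
  p = merge(1, this_image() + 1, this_image() == num_images())

  allocate(b_temp(n))              ! temporary for the off-processor data
  call get_remote(b_temp, p)       ! one library-style call per statement
  a(:) = b_temp(:) + 1.0           ! compute on the local copy
  deallocate(b_temp)

  if (any(a /= real(p) + 1.0)) error stop "wrong data fetched"
  sync all
  if (this_image() == 1) print *, "generic GET sketch ok"
contains
  ! Hypothetical stand-in for the runtime GET; not the ARMCI interface.
  subroutine get_remote(buf, img)
    real, intent(out)   :: buf(:)
    integer, intent(in) :: img
    buf = b(:)[img]
  end subroutine get_remote
end program generic_get
```

The sketch shows both costs listed above: a procedure call on every remote reference and an extra copy through `b_temp` before the data is used.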

Communication Generation Using Fortran 90 Pointers

CAF code:

    do j = 1, N
      C(j) = A(j)[p]
    end do

Translated code:

    do j = 1, N
      ptrA => A(j)
      call CafSetPtr(ptrA,p,A_handle)
      C(j) = ptrA
    end do

Function call overhead for each reference

Implemented in cafc

Pointer Initialization Hoisting

Naïvely translated code:

    do j = 1, N
      ptrA => A(j)
      call CafSetPtr(ptrA,p,A_handle)
      C(j) = ptrA
    end do

Code with hoisted pointer initialization:

    ptrA => A(1:N)
    call CafSetPtr(ptrA,p,A_handle)
    do j = 1, N
      C(j) = ptrA(j)
    end do

Pointer initialization hoisting is not yet implemented in cafc

Communication Generation Using Vector of Fortran 90 Pointers


Translated code:

    … initialization …
    do j = 1, N
      C(j) = ptrVectorA(p)%ptrA(j)
    end do

Does not require pointer initialization hoisting and avoids function calls

Worse performance than that of hoisted pointer initialization

Communication Generation Using Cray Pointers


Translated code:

    integer(PtrSize) :: addrA(:)
    … addrA initialization …
    do j = 1, N
      ptrA = addrA(p)
      C(j) = A_rem(j)
    end do

addrA(p): address of co-array A on image p

Cray pointer initialization hoisting yields only marginal improvement

Experimental Platforms

SGI Altix 3000

  128 Itanium2 1.5 GHz, 6 MB L3 cache processors

  Linux (2.4.21 kernel)

  Intel Fortran Compiler 8.0

SGI Origin 2000

  16 MIPS R12000 350 MHz, 8 MB L2 cache processors

  IRIX64 6.5

  MIPSpro Compiler 7.3.1.3m

Benchmarks

STREAM

Random Access

Spark98

NAS MG and SP

STREAM

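Copy kernel (local access on the left, remote access on the right)

    DO J = 1, N               DO J = 1, N
      C(J) = A(J)               C(J) = A(J)[p]
    END DO                    END DO

Triad kernel (local access on the left, remote access on the right)

    DO J = 1, N               DO J = 1, N
      A(J)=B(J)+s*C(J)          A(J)=B(J)[p]+s*C(J)[p]
    END DO                    END DO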

Goal: investigate how well architecture bandwidth can be delivered up to the language level
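A minimal sketch of how the local Triad kernel could be timed and converted into a bandwidth figure follows. It is not the benchmark used in these experiments: the array length, the initial values, the single timed pass, and the assumption of 8-byte reals with three arrays of traffic per iteration are illustrative choices.

```fortran
program stream_triad
  use iso_fortran_env, only: int64, real64
  implicit none
  integer, parameter :: n = 2000000          ! assumption: illustrative size
  real(real64), allocatable :: a(:), b(:), c(:)
  real(real64) :: s, seconds, mbytes
  integer(int64) :: t0, t1, rate
  integer :: j

  allocate(a(n), b(n), c(n))
  a = 0.0_real64
  b = 1.0_real64
  c = 2.0_real64
  s = 3.0_real64

  call system_clock(t0, rate)
  do j = 1, n
     a(j) = b(j) + s*c(j)                    ! Triad: two loads, one store
  end do
  call system_clock(t1)

  ! 1 + 3*2 = 7 is exactly representable, so an exact comparison is safe.
  if (any(a /= 7.0_real64)) error stop "triad result wrong"

  seconds = real(t1 - t0, real64) / real(rate, real64)
  mbytes  = 3.0_real64 * 8.0_real64 * real(n, real64) / 1.0e6_real64
  if (seconds > 0.0_real64) then
     print *, "triad MB/s:", mbytes / seconds
  else
     print *, "timer resolution too coarse for this array size"
  end if
end program stream_triad
```

A real STREAM run repeats each kernel several times and reports the best time. The remote variants replace `b(j)` and `c(j)` with references on another image, as in the kernels above.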

STREAM: Local Accesses

COMMON block is the best, if platform allows

Subroutine parameter has similar performance to COMMON block representation

Pointer-based representations have performance within 5% of the best on the Altix (with no-aliasing flag), and within 15% on the Origin

Fortran 90 pointer representation yields 30% of performance on the Altix without using the flag to specify lack of pointer aliasing

Array section statements with Fortran 90 pointer representation yield 40-50% performance on the Origin

STREAM: Remote Accesses

COMMON block representation for local access + Cray pointer for remote accesses is the best

Subroutine argument + Cray pointer for remote accesses has similar performance

Remote accesses with function call per access yield very poor performance (24 times slower than the best on the Altix, five times slower on the Origin)

Generic strategy (with intermediate temporaries) delivers only 50-60% of performance on the Altix and 30-40% of performance on the Origin for vectorized code (except for Copy kernel)

Pointer initialization hoisting is crucial for Fortran 90 pointers remote accesses and desirable for Cray pointers

Similarly coded OpenMP version has comparable performance on the Altix (90% for the scale kernel) and 86-90% on the Origin

Spark98

Based on CMU’s earthquake simulation code

Computes sparse matrix-vector product

Irregular application with fine-grain accesses

Matrix distribution and computation partitioning is done offline (sf2 traces)

Spark98 computes partial product locally, then assembles the result across processors

Spark98 (cont.)

Versions

Serial (Fortran kernel, ported from C)

MPI (Fortran kernel, ported from C)

Hybrid (best shared memory threaded version)

CAF versions (based on MPI version):

  CAF Packed PUTs

  CAF Packed GETs

  CAF GETs (computation with remote data accessed “in place”)

Spark98 GETs Result Assembly

    v2(:,:) = v(:,:)
    call sync_all()
    do s = 0, subdomains-1
      if (commindex(s) < commindex(s+1)) then
        pos = commindex(s)
        comm_len = commindex(s+1) - pos
        v(:, comm(pos:pos+comm_len-1)) = &
          v(:, comm(pos:pos+comm_len-1)) + &
          v2(:, comm_gets(pos:pos+comm_len-1))[s]
      end if
    end do
    call sync_all()

Spark98 Performance on Altix

Performance of all CAF versions is comparable to that of MPI, and better on large numbers of CPUs

CAF GETs is simple and more “natural” to code, but up to 13% slower

Without considering locality, applications do not scale on NUMA architectures (Hybrid)

ARMCI library is more efficient than MPI

NAS MG and SP

Versions:

MPI (NPB 2.3)

CAF (based on MPI NPB 2.3)

  Generic code generation with subroutine argument co-array representation (procedure splitting)

  Shared memory code generation (Fortran 90 pointers; vectorized source code) with subroutine argument co-array representation

OpenMP (NPB 3.0)

Class C

NAS SP Performance on Altix

Performance of CAF versions is comparable to that of MPI

CAF-generic has better performance than CAF-shm because it uses memcpy, which hides latency by keeping an optimal number of memory operations in flight

OpenMP scales poorly

NAS MG Performance on Altix

Conclusions

Direct load/store communication improves performance of fine-grain accesses by a factor of 24 on the Altix 3000 and five on the Origin 2000

“In-place” data use in CAF statements incurs acceptable abstraction overhead

Performance comparable to that of MPI codes for fine- and coarse-grain applications

We plan to implement in cafc optimal, architecture dependent, code generation for local and remote co-array accesses

www.hipersoft.rice.edu/caf
