A Multi-platform Co-Array Fortran Compiler
Yuri Dotsenko Cristian Coarfa
John Mellor-Crummey
Department of Computer Science, Rice University
Houston, TX USA
Motivation
Parallel Programming Models
• MPI: de facto standard
– difficult to program
• OpenMP: inefficient to map on distributed memory platforms
– lack of locality control
• HPF: hard to obtain high performance
– heroic compilers needed!
Global address space languages: CAF, Titanium, UPC
an appealing middle ground
Co-Array Fortran
• Global address space programming model
– one-sided communication (GET/PUT)
• Programmer has control over performance-critical factors
– data distribution
– computation partitioning
– communication placement
• Data movement and synchronization as language primitives
– amenable to compiler-based communication optimization
CAF Programming Model Features
• SPMD process images
– fixed number of images during execution
– images operate asynchronously
• Both private and shared data
– real x(20, 20): a private 20x20 array in each image
– real y(20, 20)[*]: a shared 20x20 array in each image
• Simple one-sided shared-memory communication
– x(:,j:j+2) = y(:,p:p+2)[r]: copy columns from image r into local columns
• Synchronization intrinsic functions
– sync_all: a barrier and a memory fence
– sync_mem: a memory fence
– sync_team([team members to notify], [team members to wait for])
• Pointers and (perhaps asymmetric) dynamic allocation
One-sided Communication with Co-Arrays
integer a(10,20)[*]
if (this_image() > 1) a(1:10,1:2) = a(1:10,19:20)[this_image()-1]
[Figure: each of image 1, image 2, …, image N holds its own a(10,20)]
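The copy on this slide can be mimicked in plain Python. This is a simulation sketch only: each "image" is an entry in a list rather than a separate process, and the names `make_image`, `get_columns`, and `halo_copy` are hypothetical helpers, not part of CAF or cafc.

```python
# Simulation only: each "image" is one entry in a Python list, not a real process.
NUM_IMAGES = 3
ROWS, COLS = 10, 20

def make_image(image):
    # Fill a(row, col) with a value that encodes the owning image and the column.
    return [[image * 1000 + col for col in range(1, COLS + 1)] for _ in range(ROWS)]

# a[image - 1][row - 1][col - 1] plays the role of a(row, col)[image].
a = [make_image(img) for img in range(1, NUM_IMAGES + 1)]

def get_columns(image, first_col, last_col):
    """One-sided GET: read columns first_col..last_col (1-based) from `image`."""
    return [row[first_col - 1:last_col] for row in a[image - 1]]

def halo_copy(this_image):
    """a(1:10,1:2) = a(1:10,19:20)[this_image-1], guarded by this_image > 1."""
    if this_image > 1:
        remote = get_columns(this_image - 1, 19, 20)
        for r in range(ROWS):
            a[this_image - 1][r][0:2] = remote[r]

# Every image runs the same statement (SPMD); here they run one after another.
for img in range(1, NUM_IMAGES + 1):
    halo_copy(img)

print(a[0][0][:2], a[1][0][:2], a[2][0][:2])
```

Image 1 has no left neighbour and keeps its own first two columns; every other image overwrites them with the last two columns of the image to its left, without that neighbour taking any action.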
Rice Co-Array Fortran Compiler (cafc)
• First CAF multi-platform compiler
– previous compiler only for Cray shared memory systems
• Implements core of the language
– currently lacks support for derived type and dynamic co-arrays
• Core sufficient for non-trivial codes
• Performance comparable to that of hand-tuned MPI codes
• Open source
Outline
• CAF programming model
• cafc
– Core language implementation
– Optimizations
• Experimental evaluation
• Conclusions
Implementation Strategy
• Source-to-source compilation of CAF codes
– uses Open64/SL Fortran 90 infrastructure
– translates CAF into Fortran 90 + communication operations
• Communication
– ARMCI library for one-sided communication on clusters
– load/store communication on shared-memory platforms
Goals
– portability
– high performance on a wide range of platforms
Co-Array Descriptors
• Initialize and manipulate Fortran 90 dope vectors
real :: a(10,10,10)[*]
type CAFDesc_real_3
  integer(ptrkind) :: handle     ! opaque handle to CAF runtime representation
  real, pointer :: ptr(:,:,:)    ! Fortran 90 pointer to local co-array data
end type CAFDesc_real_3
type(CAFDesc_real_3) :: a
Allocating COMMON and SAVE Co-Arrays
• Compiler
– generates static initializer for each common/save variable
• Linker
– collects calls to all initializers
– generates global initializer that calls all others
– compiles global initializer and links into program
• Launch
– invokes global initializer before main program begins
• allocates co-array storage outside Fortran 90 runtime system
• associates co-array descriptors with allocated memory
Similar to handling for C++ static constructors
Parameter Passing
• Call-by-value convention (copy-in, copy-out)
– pass remote co-array data to procedures only as values
• Call-by-co-array convention*
– argument declared as a co-array by callee
– enables access to local and remote co-array data
• Call-by-reference convention* (cafc)
– argument declared as an explicit shape array
– enables access to local co-array data only
– enables reuse of existing Fortran code
* requires an explicit interface
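call f((a(I)[p]))        ! call-by-value: remote data passed as a value

subroutine f(a)          ! call-by-co-array: callee declares a co-array
real :: a(10)[*]

real :: x(10)[*]         ! call-by-reference: caller passes a co-array
call f(x)
subroutine f(a)          ! callee declares an explicit-shape array
real :: a(10)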
Multiple Co-dimensions
Managing processors as a logical multi-dimensional grid
integer a(10,10)[5,4,*]    ! 3D processor grid 5 x 4 x …
• Support co-space reshaping at procedure calls
– change number of co-dimensions
– co-space bounds as procedure arguments
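To make the logical grid concrete, the sketch below maps a tuple of co-subscripts to a linear image number. It assumes the usual Fortran column-major ordering and lower co-bounds of 1; `image_index` here is a hypothetical Python helper written for illustration, not a cafc routine.

```python
def image_index(cosubscripts, coextents):
    """Map 1-based co-subscripts to a 1-based image number (column-major order).

    coextents lists the extents of every co-dimension except the last one,
    which is '*' in the declaration and is limited only by the image count.
    """
    if len(cosubscripts) != len(coextents) + 1:
        raise ValueError("need exactly one co-subscript per co-dimension")
    if cosubscripts[-1] < 1:
        raise ValueError("co-subscript out of bounds")
    index, stride = 0, 1
    # zip stops after the bounded co-dimensions; the last one is handled below.
    for sub, extent in zip(cosubscripts, coextents):
        if not 1 <= sub <= extent:
            raise ValueError("co-subscript out of bounds")
        index += (sub - 1) * stride
        stride *= extent
    index += (cosubscripts[-1] - 1) * stride
    return index + 1

# Co-shape [5,4,*] as in the declaration above.
print(image_index((1, 1, 1), (5, 4)))
print(image_index((5, 4, 1), (5, 4)))
print(image_index((1, 1, 2), (5, 4)))
```

Under these assumptions the first 20 images fill one 5 x 4 plane of the grid, and the third co-subscript advances by one for every further 20 images.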
Implementing Communication
x(1:n) = a(1:n)[p] + …
• Use a temporary buffer to hold off-processor data
– allocate buffer
– perform GET to fill buffer
– perform computation: x(1:n) = buffer(1:n) + …
– deallocate buffer
• Optimizations
– no temporary storage for co-array to co-array copies
– load/store communication on shared-memory systems
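The four buffer steps listed above can be traced in a small Python simulation. Remote memory is modelled as a list of lists, and `simulated_get` and `remote_add` are hypothetical stand-ins; the sketch does not reproduce the ARMCI interface or the code cafc emits.

```python
# Simulated remote memory: images[p - 1] is the co-array a on image p.
images = [[float(10 * p + i) for i in range(1, 6)] for p in (1, 2, 3)]
get_count = 0

def simulated_get(p, lo, hi):
    """Hypothetical one-sided GET of a(lo:hi)[p] (1-based, inclusive bounds)."""
    global get_count
    get_count += 1
    return images[p - 1][lo - 1:hi]

def remote_add(p, n, local_term):
    """x(1:n) = a(1:n)[p] + local_term(1:n), staged through a temporary buffer."""
    buffer = [0.0] * n                                  # allocate buffer
    buffer[:] = simulated_get(p, 1, n)                  # perform GET to fill buffer
    x = [buffer[i] + local_term[i] for i in range(n)]   # compute from the buffer
    del buffer                                          # deallocate buffer
    return x

x = remote_add(2, 3, [1.0, 1.0, 1.0])
print(x, get_count)
```

The whole section a(1:n)[p] arrives in a single GET, so the arithmetic afterwards touches only local memory.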
Synchronization
• Original CAF specification: team synchronization only
– sync_all, sync_team
• Limits performance on loosely-coupled architectures
• Point-to-point extensions
– sync_notify(q)
– sync_wait(p)
Point-to-point synchronization semantics
– delivery of a notify to q from p implies that all communication from p to q issued before the notify has been delivered to q
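A toy model of the notify/wait pair, using Python threads as images. It assumes one counting semaphore per (source, destination) pair; the `SyncRuntime` class and the `inbox` dictionary are hypothetical and say nothing about how the cafc run-time implements these primitives.

```python
import threading
from collections import defaultdict

class SyncRuntime:
    """Toy model: one counting semaphore per (source, destination) pair."""

    def __init__(self):
        self._lock = threading.Lock()
        self._sems = defaultdict(lambda: threading.Semaphore(0))

    def _sem(self, src, dst):
        with self._lock:
            return self._sems[(src, dst)]

    def sync_notify(self, me, q):
        # Image `me` signals q; writes made by `me` before this call are visible after the wait.
        self._sem(me, q).release()

    def sync_wait(self, me, p):
        # Image `me` blocks until a matching notify from p has arrived.
        self._sem(p, me).acquire()

runtime = SyncRuntime()
inbox = {2: None}      # stands in for co-array storage on image 2
received = []

def image_1():
    inbox[2] = [1, 2, 3]          # PUT to image 2, issued before the notify
    runtime.sync_notify(1, 2)

def image_2():
    runtime.sync_wait(2, 1)       # returns only after image 1's notify
    received.append(list(inbox[2]))

threads = [threading.Thread(target=image_2), threading.Thread(target=image_1)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(received)
```

Only the two images involved synchronize; no other image has to reach a barrier, which is the advantage over sync_all on loosely coupled machines.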
Outline
• CAF programming model
• cafc
– Core language implementation
– Optimizations
• procedure splitting
• supporting hints for non-blocking communication
• packing strided communications
• Experimental evaluation
• Conclusions
An Impediment to Code Efficiency
• Original reference
rhs(1,i,j,k,c) = … + u(1,i-1,j,k,c) - …
• Transformed reference
rhs%ptr(1,i,j,k,c) = … + u%ptr(1,i-1,j,k,c) - …
• Fortran 90 pointer-based co-array representation does not convey
– the lack of co-array aliasing
– co-array contiguity
– co-array bounds
• Lack of knowledge inhibits important code optimizations
Procedure Splitting
Original procedure:
subroutine f(…)
  real, save :: c(100)[*]
  ... = c(50) ...
end subroutine f

After CAF to CAF preprocessing:
subroutine f(…)
  real, save :: c(100)[*]
  interface
    subroutine f_inner(…, c_arg)
      real :: c_arg[*]
    end subroutine f_inner
  end interface
  call f_inner(…,c)
end subroutine f

subroutine f_inner(…, c_arg)
  real :: c_arg(100)[*]
  ... = c_arg(50) ...
end subroutine f_inner
Benefits of Procedure Splitting
• Generated code conveys
– lack of co-array aliasing
– co-array contiguity
– co-array bounds
• Enables back-end compiler to generate better code
Hiding Communication Latency
Goal: enable communication/computation overlap
• Impediments to generating non-blocking communication
– use of indexed subscripts in co-dimensions
– lack of whole program analysis
• Approach: support hints for non-blocking communication
– overcome conservative compiler analysis
– enable sophisticated programmers to achieve good performance today
Hints for Non-blocking PUTs
• Hints for CAF run-time system to issue non-blocking PUTs
region_id = open_nb_put_region()
...
Put_Stmt_1
...
Put_Stmt_N
...
call close_nb_put_region(region_id)
• Complete non-blocking PUTs:
call complete_nb_put_region(region_id)
• Open problem: Exploiting non-blocking GETs?
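The region hints can be modelled with a small Python class. The three region routine names come from the slide; the `PutRuntime` class, its `put` method, and the choice to model an in-flight PUT as a queued update are assumptions made purely for illustration.

```python
class PutRuntime:
    """Toy run-time: PUTs issued inside an open region are queued, not applied."""

    def __init__(self, num_images, size):
        self.memory = [[0] * size for _ in range(num_images)]
        self._next_id = 0
        self._open = None
        self._pending = {}

    def open_nb_put_region(self):
        self._next_id += 1
        self._open = self._next_id
        self._pending[self._open] = []
        return self._open

    def put(self, image, offset, values):
        """Hypothetical PUT of `values` to `image` (1-based) at a 0-based offset."""
        if self._open is None:
            self._apply(image, offset, values)          # ordinary blocking PUT
        else:
            self._pending[self._open].append((image, offset, list(values)))

    def close_nb_put_region(self, region_id):
        if self._open != region_id:
            raise ValueError("region is not open")
        self._open = None

    def complete_nb_put_region(self, region_id):
        # After this call every PUT issued in the region has been applied.
        for image, offset, values in self._pending.pop(region_id):
            self._apply(image, offset, values)

    def _apply(self, image, offset, values):
        self.memory[image - 1][offset:offset + len(values)] = values

rt = PutRuntime(num_images=2, size=4)
rid = rt.open_nb_put_region()
rt.put(2, 0, [7, 8])
rt.close_nb_put_region(rid)
before = list(rt.memory[1])      # computation could overlap with the PUT here
rt.complete_nb_put_region(rid)
after = list(rt.memory[1])
print(before, after)
```

A real run-time would have the transfer in flight between close and complete rather than deferred; the model captures only the guarantee that the data is in place once complete_nb_put_region returns.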
Strided vs. Contiguous Transfers
• Problem
CAF remote reference might induce many small data transfers
a(i,1:n)[p] = b(j,1:n)
• Solution
pack strided data on source and unpack it on destination
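The difference between the two approaches can be counted in a short Python sketch. It assumes column-major storage, so a row section such as a(i,1:n) is strided in memory; `column_major_offsets`, `put_unpacked`, and `put_packed` are hypothetical helpers, and a "transfer" is simply counted rather than sent over a network.

```python
def column_major_offsets(i, n, leading_dim):
    """0-based offsets of a(i,1:n) in a column-major array with `leading_dim` rows."""
    return [(i - 1) + (col - 1) * leading_dim for col in range(1, n + 1)]

def put_unpacked(dst, src, dst_offsets, src_offsets):
    """One small transfer per element; returns the number of transfers."""
    transfers = 0
    for d, s in zip(dst_offsets, src_offsets):
        dst[d] = src[s]
        transfers += 1
    return transfers

def put_packed(dst, src, dst_offsets, src_offsets):
    """Pack on the source, send one contiguous message, unpack on the destination."""
    packed = [src[s] for s in src_offsets]       # pack strided data on source
    message = list(packed)                       # the single contiguous transfer
    for d, value in zip(dst_offsets, message):   # unpack on destination
        dst[d] = value
    return 1

LEADING_DIM, N = 4, 3
b = [float(k) for k in range(LEADING_DIM * N)]   # local array b
a_elementwise = [0.0] * (LEADING_DIM * N)        # remote a, element-wise PUTs
a_packed = [0.0] * (LEADING_DIM * N)             # remote a, packed PUT

src = column_major_offsets(3, N, LEADING_DIM)    # b(3,1:3)
dst = column_major_offsets(2, N, LEADING_DIM)    # a(2,1:3)[p]
t_elementwise = put_unpacked(a_elementwise, b, dst, src)
t_packed = put_packed(a_packed, b, dst, src)
print(t_elementwise, t_packed, a_elementwise == a_packed)
```

Both variants leave the destination in the same state; packing trades n small messages for one larger message plus local copy work on each side.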
Pragmatics of Packing
Who should implement packing?
• The CAF programmer
– difficult to program
• The CAF compiler
– unpacking requires conversion of PUTs into two-sided communication (a difficult whole-program transformation)
• The communication library
– most natural place
– ARMCI currently performs packing on Myrinet
CAF Compiler Targets (Sept 2004)
• Processors
– Pentium, Alpha, Itanium2, MIPS
• Interconnects
– Quadrics, Myrinet, Gigabit Ethernet, shared memory
• Operating systems
– Linux, Tru64, IRIX
Outline
• CAF programming model
• cafc
– Core language implementation
– Optimizations
• Experimental evaluation
• Conclusions
Experimental Evaluation
• Platforms
– Alpha+Quadrics QSNet (Elan3)
– Itanium2+Quadrics QSNet II (Elan4)
– Itanium2+Myrinet 2000
• Codes
– NAS Parallel Benchmarks (NPB 2.3) from NASA Ames
NAS BT Efficiency (Class C)
NAS SP Efficiency (Class C)
lack of non-blocking notify implementation blocks CAF comm/comp overlap
NAS MG Efficiency (Class C)
• ARMCI comm is efficient
• pt-2-pt synch boosts CAF performance 30%
NAS CG Efficiency (Class C)
NAS LU Efficiency (Class C)
Impact of Optimizations
Assorted Results
• Procedure splitting
– 42-60% improvement for BT on Itanium2+Myrinet cluster
– 15-33% improvement for LU on Alpha+Quadrics
• Non-blocking communication generation
– 5% improvement for BT on Itanium2+Quadrics cluster
– 3% improvement for MG on all platforms
• Packing of strided data
– 31% improvement for BT on Alpha+Quadrics cluster
– 37% improvement for LU on Itanium2+Quadrics cluster
See paper for more details
Conclusions
• CAF boosts programming productivity
– simplifies the development of SPMD parallel programs
– shifts details of managing communication to compiler
• cafc delivers performance comparable to hand-tuned MPI
• cafc implements effective optimizations
– procedure splitting
– non-blocking communication
– packing of strided communication (in ARMCI)
• Vectorization needed to achieve true performance portability with machines like Cray X1
http://www.hipersoft.rice.edu/caf