An Evaluation of Data-Parallel Compiler Support for Line-Sweep Applications
Daniel Chavarría-Miranda, John Mellor-Crummey
Dept. of Computer Science, Rice University
HPF Program Compilation
[Figure: sequential Fortran program + data partitioning → HPF compiler (partition computation, insert comm/sync, manage storage) → parallel machine; same answers as the Fortran program]
High-Performance Fortran (HPF)
Industry-standard data-parallel language
Partitioning of data drives partitioning of computation, …
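A minimal illustration of the model (array names and the BLOCK layout here are illustrative, not taken from the benchmarks): the programmer writes ordinary Fortran plus layout directives, and the compiler derives the parallel execution.

      real a(1000, 1000), b(1000, 1000)
!HPF$ PROCESSORS p(4, 4)
!HPF$ DISTRIBUTE a(BLOCK, BLOCK) ONTO p    ! data partitioning directive
!HPF$ ALIGN b(i, j) WITH a(i, j)           ! keep b co-located with a
      do j = 2, 999
         do i = 2, 999
            ! ordinary sequential Fortran; the layout of a and b
            ! determines which processor executes each iteration
            a(i, j) = 0.25 * (b(i-1, j) + b(i+1, j) + b(i, j-1) + b(i, j+1))
         end do
      end do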
Motivation
Obtaining high performance from applications written using high-level parallel languages has been elusive
Tightly-coupled applications are particularly hard
Data dependences serialize computation
– induce tradeoffs between parallelism, communication granularity, and frequency
– traditional HPF partitionings limit scalability and performance
Communication may be needed inside loops
Contributions
A set of compilation techniques that enable us to match hand-coded performance for tightly-coupled applications
An analysis of their performance impact
dHPF Compiler
Based on an abstract equational framework
– manipulates sets of processors, array elements, iterations, and pairwise mappings between these sets
– optimizations and code generation are implemented as operations on these sets and mappings
Sophisticated computation partitioning model
– enables partial replication of computation to reduce communication
Support for the multipartitioning distribution
– MULTI distribution specifier
– suited for line-sweep computations
Innovative optimizations
– reduce communication
– improve locality
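As a sketch of how a multipartitioned layout is requested in the dHPF input (the array and the choice of partitioned dimensions are illustrative; the exact spelling of the directive in dHPF may differ):

      real u(nx, ny, nz)
!HPF$ DISTRIBUTE u(*, MULTI, MULTI)   ! multipartition the y and z dimensions:
                                      ! each processor owns a tile in every slab
                                      ! of each partitioned dimension, so sweeps
                                      ! along y or z are fully parallel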
Overview
Introduction
Line-Sweep Computations
Performance Comparison
Optimization Evaluation
– Partially Replicated Computation
– Interprocedural Communication Elimination
– Communication Coalescing
– Direct Access Buffers
Conclusions
Line-Sweep Computations
1D recurrences on a multidimensional domain
Recurrences order computation along each dimension
Compiler-based parallelization is hard: loop-carried dependences, fine-grained parallelism
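For concreteness, a representative forward sweep (illustrative, not taken from SP/BT): each point depends on its predecessor along the swept dimension, so the innermost loop carries a dependence.

      do k = 1, nz
         do j = 1, ny
            do i = 2, nx
               ! recurrence along x: iteration i needs the value produced at i-1
               u(i, j, k) = u(i, j, k) - c * u(i-1, j, k)
            end do
         end do
      end do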
Partitioning Choices (Transpose)
Local Sweeps along x and z
Transpose
Local Sweep along y
Transpose back
Partitioning Choices (block + CGP)
Partial wavefront-type parallelism
[Figure: 2D block partition across Processors 0–3]
Partitioning Choices (multipartitioning)
Full parallelism for sweeping along any partitioned dimension
[Figure: multipartitioned layout across Processors 0–3]
NAS SP & BT Benchmarks
From NASA Ames
– use ADI to solve the Navier-Stokes equations in 3D
– forward & backward line sweeps along each dimension, for each time step
SP solves scalar penta-diagonal systems
BT solves block-tridiagonal systems
SP has double the communication volume and frequency of BT
Experimental Setup
2 versions from NASA, each written in Fortran 77
– parallel MPI hand-coded version
– sequential version (3500 lines)
dHPF input: sequential version + HPF directives (including MULTI; 2% line count increase)
Inlined several procedures manually
– enables dHPF to overlap local computation with communication without interprocedural tiling
Platform: SGI Origin 2000 (128 × 250 MHz processors), SGI's MPI implementation, SGI's compilers
Performance Comparison
Compare four versions of NAS SP & BT
Multipartitioned MPI hand-coded version from NASA
– different executables for each number of processors
Multipartitioned dHPF-generated version
– single executable for all numbers of processors
Block-partitioned dHPF-generated version (with coarse-grain pipelining, using a 2D partition)
– single executable for all numbers of processors
Block-partitioned pghpf-compiled version from PGI's source code (using a full transpose with a 1D partition)
– single executable for all numbers of processors
Efficiency for NAS SP (102³ 'B' size)
> 2x multipartitioning comm. volume
similar comm. volume, more serialization
Efficiency for NAS BT (102³ 'B' size)
> 2x multipartitioning comm. volume
Overview
Introduction
Line-Sweep Computations
Performance Comparison
Optimization Evaluation
– Partially Replicated Computation
– Interprocedural Communication Elimination
– Communication Coalescing
– Direct Access Buffers
Conclusions
Evaluation Methodology
All versions are dHPF-generated using multipartitioning
Turn off a particular optimization ("n − 1" approach)
– determine overhead without it (% over fully optimized)
Measure its contribution to overall performance
– total execution time
– total communication volume
– L2 data cache misses (where appropriate)
Class A (64³) and class B (102³) problem sizes on two different processor counts (16 & 64 processors)
Partially Replicated Computation
[Figure: a statement with computation partitioning terms ON_HOME a(i-2, j), ON_HOME a(i+2, j), ON_HOME a(i, j-2), ON_HOME a(i-1, j+1), and ON_HOME a(i, j) is executed on its extended home, ON_EXT_HOME a(i, j), using SHADOW a(2, 2) regions]
Partial computation replication is used to reduce communication
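A minimal sketch of the idea (it mirrors the backup slide near the end of this deck): replicating the first statement on the owner of a(i, j+1) as well as a(i, j) means each processor also computes the boundary value a(i, j-1) it needs locally, so no message is required for a.

      do i = 1, n
         do j = 2, n
            ! executed by the owners of both a(i,j) and a(i,j+1):
            ! the extra copy lands in the neighbor's shadow region
            a(i, j) = u(i, j-1) + 1.0        ! ON_HOME a(i,j) ON_HOME a(i,j+1)
            ! a(i,j-1) is now locally fresh; only u still needs communication
            b(i, j) = u(i, j-1) + a(i, j-1)  ! ON_HOME a(i,j)
         enddo
      enddo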
[Chart: Reduction in Comm. Vol., dHPF MP, 16 & 64 Proc.]
Impact of Partial Replication
BT: eliminate comm. for 5D arrays fjac and njac in lhs<xyz>
Both: eliminate comm. for six 3D arrays in compute_rhs
[Chart: Reduction in Exec. Time, dHPF MP, 16 & 64 Proc.]
Impact of Partial Replication (cont.)
Interprocedural Communication Reduction
REFLECT: placement of near-neighbor communication
LOCAL: communication not needed for a scope
Extended ON HOME: partial computation replication
Compiler doesn't need full interprocedural communication and availability analyses to determine whether data in overlap regions & comm. buffers is fresh
Extensions to HPF/JA Directives
Interprocedural Communication Reduction (cont.)
SHADOW a(2, 1)
REFLECT (a(0:0, 1:0), a(1:0, 0:0))   ! update only the listed shadow sections (from the left and top neighbors)
SHADOW a(2, 1)
REFLECT (a)                          ! update the full shadow region
The combination of REFLECT, extended ON HOME and LOCAL reduces communication volume by ~13%, resulting in a ~9% reduction in execution time
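A hedged sketch of how these directives might compose across a call (the directive spellings and the helper routine sweep_x are assumptions for illustration; the slide defines only the intent of each directive):

      program demo
      integer n
      parameter (n = 512)
      real a(n, n)
!HPF$ DISTRIBUTE a(BLOCK, BLOCK)
!HPF$ SHADOW a(2, 1)
      a = 1.0
!HPF$ REFLECT (a)          ! refresh the shadow regions of a once, at this point
      call sweep_x(a, n)    ! the callee may assume overlap data for a is fresh
      end

      subroutine sweep_x(a, n)
      integer n, i, j
      real a(n, n)
!HPF$ LOCAL (a)            ! assertion: references to a in this scope need no
                           ! communication, so no interprocedural freshness
                           ! analysis is required
      do j = 1, n
         do i = 2, n
            a(i, j) = a(i, j) + a(i - 1, j)
         enddo
      enddo
      end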
Normalizing Communication
Same non-local data needed
[Figure: on processors P0 and P1, the reference a(i, j - 2) in the first statement and the reference a(i, j) in the second require the same non-local data]
do i = 1, n
   do j = 2, n - 2
      a(i, j) = a(i, j - 2)      ! ON_HOME a(i, j)
      a(i, j + 2) = a(i, j)      ! ON_HOME a(i, j + 2)
   enddo
enddo
Coalescing Communication
[Figure: the two message sections of A are combined into a single coalesced message]
Impact of Normalized Coalescing
[Chart: Reduction in Comm. Vol., dHPF MP, 16 & 64 Proc.]
[Chart: Reduction in Exec. Time, dHPF MP, 16 & 64 Proc.]
Impact of Normalized Coalescing
Key optimization for scalability
Direct Access Buffers
Choices for receiving complex coalesced messages
Unpack them into the shadow regions
– two simultaneous live copies in cache
– unpacking can be costly
– uniform access to non-local & local data
Reference them directly out of the receive buffer
– introduces two modes of access for data (non-local & interior)
– overhead of having a single loop with these two modes is high
– loops should be split into non-local & interior portions, according to the data they reference (see the sketch below)
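A hedged sketch of the loop splitting (the buffer name recv_buf, the extents, and the stencil are illustrative, not dHPF's generated code):

      ! boundary portion: the non-local operand is read directly out of the
      ! receive buffer, with no unpacking into a shadow region
      do i = 1, ilocal
         a(i, 1) = a(i, 1) + c * recv_buf(i)
      enddo
      ! interior portion: all operands are local
      do j = 2, jlocal
         do i = 1, ilocal
            a(i, j) = a(i, j) + c * a(i, j - 1)
         enddo
      enddo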
Impact of Direct Access Buffers
Use direct access buffers for the main swept arrays
Direct access buffers + loop splitting reduces L2 data cache misses by ~11%, resulting in a reduction of ~11% in execution time
Conclusions
Compiler-generated code can match the performance of sophisticated hand-coded parallelizations
High performance comes from the aggregate benefit of multiple optimizations
Everything affects scalability: good parallel algorithms are only the starting point; excellent resource utilization on the target machine is also needed
Data-parallel compilers must target each potential source of inefficiency in the generated code if they are to deliver the performance scientific users demand
Efficiency for NAS SP (‘A’)
Efficiency for NAS BT (‘A’)
Data Partitioning
[Chart: Reduction in Exec. Time, dHPF MP, 16 & 64 Proc.]
[Chart: Reduction in Comm. Vol., dHPF MP, 16 & 64 Proc.]
Data Partitioning (cont.)
Partially Replicated Computation
do i = 1, n
   do j = 2, n
      a(i, j) = u(i, j-1) + 1.0        ! ON_HOME a(i,j) ON_HOME a(i,j+1)
      b(i, j) = u(i, j-1) + a(i, j-1)  ! ON_HOME a(i,j)
   enddo
enddo
[Figure: local portions of A, U, and B plus shadow regions on processors p and p+1; U's shadow data arrives via communication, while A's boundary values are obtained by replicated computation]
[Chart: Reduction in Exec. Time, dHPF MP, 16 & 64 Proc.]
Using HPF/JA for Comm. Elimination
[Chart: Reduction in Comm. Vol., dHPF MP, 16 & 64 Proc.]
Using HPF/JA for Comm. Elimination
do timestep = 1, T
   do j = 1, n
      do i = 3, n
         a(i, j) = a(i + 1, j) + b(i - 1, j)        ! ON_HOME a(i, j)
      enddo
   enddo
   do j = 1, n
      do i = 1, n - 2
         a(i + 2, j) = a(i + 3, j) + b(i + 1, j)    ! ON_HOME a(i + 2, j)
      enddo
   enddo
   do j = 1, n
      do i = 1, n - 1
         a(i + 1, j) = a(i + 2, j) + b(i + 1, j)    ! ON_HOME b(i + 1, j)
      enddo
   enddo
enddo
Coalesce communication at this point
Normalized Comm. Coalescing (cont.)
[Chart: Reduction in Exec. Time, dHPF MP, 16 & 64 Proc.]
Impact of Direct Access Buffers
[Chart: Reduction in L2 Misses, dHPF MP, 16 & 64 Proc.]
Impact of Direct Access Buffers
Direct Access Buffers
[Figure: with unpacking, Processor 0 packs and sends; Processor 1 receives and unpacks into shadow regions]
Direct Access Buffers
[Figure: with direct access buffers, Processor 0 packs and sends; Processor 1 receives and uses the data directly from the buffer]