An Evaluation of Data-Parallel Compiler Support for Line-Sweep Applications

Daniel Chavarría-Miranda, John Mellor-Crummey
Dept. of Computer Science, Rice University



Page 1: An Evaluation of Data-Parallel Compiler Support for Line-Sweep Applications

An Evaluation of Data-Parallel Compiler Support for Line-Sweep Applications

Daniel Chavarría-Miranda, John Mellor-Crummey
Dept. of Computer Science, Rice University

Page 2:

High-Performance Fortran (HPF)

HPF program = sequential Fortran program + data partitioning

Compilation for a parallel machine:
– Partition computation
– Insert comm / sync
– Manage storage

Same answers as the sequential Fortran program

Industry-standard data-parallel language: partitioning of data drives partitioning of computation, …

Page 3:

Motivation

Obtaining high performance from applications written in high-level parallel languages has been elusive

Tightly-coupled applications are particularly hard: data dependences serialize computation
– induce tradeoffs between parallelism, communication granularity, and communication frequency
– traditional HPF partitionings limit scalability and performance

Communication might be needed inside loops

Page 4:

Contributions

A set of compilation techniques that enable us to match hand-coded performance for tightly-coupled applications

An analysis of their performance impact

Page 5:

dHPF Compiler

Based on an abstract equational framework
– manipulates sets of processors, array elements, and iterations, and pairwise mappings between these sets
– optimizations and code generation are implemented as operations on these sets and mappings

Sophisticated computation-partitioning model
– enables partial replication of computation to reduce communication

Support for the multipartitioning distribution
– MULTI distribution specifier
– suited for line-sweep computations

Innovative optimizations
– reduce communication
– improve locality

Page 6:

Overview

Introduction
Line-Sweep Computations
Performance Comparison
Optimization Evaluation

– Partially Replicated Computation

– Interprocedural Communication Elimination

– Communication Coalescing

– Direct Access Buffers

Conclusions

Page 7:

Line-Sweep Computations

1D recurrences on a multidimensional domain

Recurrences order the computation along each dimension

Compiler-based parallelization is hard: loop-carried dependences, fine-grained parallelism
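The serialization that makes line sweeps hard to parallelize shows up in a minimal Python sketch (the 0.5 recurrence coefficient and array shapes are illustrative, not from the benchmarks): each value along the swept dimension depends on its predecessor, while distinct lines stay independent.

```python
def sweep_x(a):
    """Forward line sweep: a 1D recurrence along x applied to every y-line.
    a[i][j] depends on a[i-1][j], so the i-loop is serialized by a
    loop-carried dependence; distinct j-lines are independent."""
    for j in range(len(a[0])):
        for i in range(1, len(a)):
            a[i][j] = a[i][j] + 0.5 * a[i - 1][j]
    return a
```

Sweeping a 3x2 grid of ones yields 1.0, 1.5, 1.75 down each line: the fine-grained dependence chain is visible in the i-loop.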

Page 8:

Partitioning Choices (Transpose)

Local Sweeps along x and z

Transpose

Local Sweep along y

Transpose back
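This transpose strategy can be sketched in a few lines of Python (pure lists, no MPI; the 0.5 coefficient is illustrative): sweep along the locally stored dimension, transpose so the other dimension becomes local, sweep again, and transpose back.

```python
def transpose(a):
    # redistribute the array so the other dimension becomes local
    return [list(col) for col in zip(*a)]

def sweep_rows(a):
    # local sweep: recurrence along each row, no communication needed
    for row in a:
        for i in range(1, len(row)):
            row[i] += 0.5 * row[i - 1]
    return a

def sweep_x_then_y(a):
    a = sweep_rows(a)        # sweep along x (rows are local)
    a = transpose(a)         # transpose: y becomes the local dimension
    a = sweep_rows(a)        # sweep along y
    return transpose(a)      # transpose back to the original layout
```

Each sweep runs fully in parallel, but the two transposes move the entire array across the machine.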

Page 9:

Partitioning Choices (block + CGP)

Partial wavefront-type parallelism

Processor 0

Processor 1

Processor 2

Processor 3
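The partial parallelism of coarse-grain pipelining can be modeled with a small schedule computation (a sketch; `p` processors in a 1D block partition, with the orthogonal dimension cut into `n_chunks` pipeline stages): a processor may start a chunk only after its predecessor processor has finished that chunk.

```python
def pipeline_schedule(p, n_chunks):
    """Wavefront schedule for a pipelined sweep: processor q handles chunk b
    at step max(step(q-1, b), step(q, b-1)) + 1, so work proceeds along a
    diagonal wavefront and finishes in p + n_chunks - 1 steps."""
    step = {}
    for q in range(p):
        for b in range(n_chunks):
            step[(q, b)] = max(step.get((q - 1, b), -1),
                               step.get((q, b - 1), -1)) + 1
    return step
```

During pipeline fill and drain, some processors are idle: that is the serialization the chart annotations on the following slides refer to.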

Page 10:

Partitioning Choices (multipartitioning)

Full parallelism for sweeping along any partitioned dimension

Processor 0

Processor 1

Processor 2

Processor 3
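A diagonal multipartitioning can be sketched in a few lines of Python (the (j - i) mod p assignment is one standard diagonal layout): with p processors and a p x p grid of tiles, each processor owns exactly one tile in every tile-row and tile-column, so a sweep along either partitioned dimension keeps all processors busy.

```python
def tile_owner(i, j, p):
    """Diagonal multipartitioning: tile (i, j) of a p x p tile grid is
    assigned to processor (j - i) mod p."""
    return (j - i) % p

def owners_in_row(i, p):
    return sorted(tile_owner(i, j, p) for j in range(p))

def owners_in_col(j, p):
    return sorted(tile_owner(i, j, p) for i in range(p))
```

Every tile-row and tile-column contains all p processors exactly once, which is what gives full parallelism for sweeps along any partitioned dimension.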

Page 11:

NAS SP & BT Benchmarks

NAS SP & BT benchmarks from NASA Ames
– use ADI to solve the Navier-Stokes equation in 3D
– forward & backward line sweeps on each dimension, for each time step

SP solves scalar penta-diagonal systems
BT solves block-tridiagonal systems
SP has double the communication volume and frequency

Page 12:

Experimental Setup

2 versions from NASA, each written in Fortran 77
– parallel MPI hand-coded version
– sequential version (3500 lines)

dHPF input: sequential version + HPF directives (including MULTI; 2% line-count increase)

Inlined several procedures manually:
– enables dHPF to overlap local computation with communication without interprocedural tiling

Platform: SGI Origin 2000 (128 250 MHz procs.), SGI's MPI implementation, SGI's compilers

Page 13:

Performance Comparison

Compare four versions of NAS SP & BT:

Multipartitioned MPI hand-coded version from NASA
– different executables for each number of processors

Multipartitioned dHPF-generated version
– single executable for all numbers of processors

Block-partitioned dHPF-generated version (with coarse-grain pipelining, using a 2D partition)
– single executable for all numbers of processors

Block-partitioned pghpf-compiled version from PGI's source code (using a full transpose with a 1D partition)
– single executable for all numbers of processors

Page 14:

Efficiency for NAS SP ('B' size, 102³)

> 2x multipartitioning comm. volume

similar comm. volume, more serialization

Page 15:

Efficiency for NAS BT ('B' size, 102³)

> 2x multipartitioning comm. volume

Page 16:

Introduction
Line-Sweep Computations
Performance Comparison
Optimization Evaluation

– Partially Replicated Computation

– Inteprocedural Communication Elimination

– Communication Coalescing

– Direct Access Buffers

Conclusions

Overview

Page 17:

Evaluation Methodology

All versions are dHPF-generated using multipartitioning

Turn off a particular optimization ("n - 1" approach)
– determine overhead without it (% over fully optimized)

Measure its contribution to overall performance
– total execution time
– total communication volume
– L2 data cache misses (where appropriate)

Class A (64³) and class B (102³) problem sizes on two different processor counts (16 & 64 processors)

Page 18:

Partially Replicated Computation

[Figure: references ON_HOME a(i-2, j), ON_HOME a(i+2, j), ON_HOME a(i, j-2), ON_HOME a(i-1, j+1); computation partitionings ON_HOME a(i, j) and ON_EXT_HOME a(i, j); SHADOW a(2, 2)]

Partial computation replication is used to reduce communication
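A minimal Python sketch of the idea (dict-based, with u standing in for an already-communicated input array; indices and bounds are illustrative): by extending its loop one column into the shadow region, a processor recomputes the boundary value of a from data it already holds instead of receiving a itself.

```python
def compute_b(u, lo, hi):
    """One processor's work over columns [lo, hi).
    a[j] = u[j-1] + 1.0 is also computed for j = lo - 1 (replicated work in
    the shadow region, using the u shadow value already communicated), so
    b[j] = u[j-1] + a[j-1] needs no message carrying a."""
    a = {}
    for j in range(max(lo - 1, 1), hi):   # lo - 1: partially replicated column
        a[j] = u[j - 1] + 1.0
    b = {}
    for j in range(max(lo, 2), hi):
        b[j] = u[j - 1] + a[j - 1]
    return b
```

The replicated column costs a little redundant arithmetic but eliminates an entire communication phase for a.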

Page 19:

[Bar chart: Reduction in Comm. Vol., dHPF vs. MPI; 16 Proc. and 64 Proc.]

Impact of Partial Replication

BT: eliminate comm. for 5D arrays fjac and njac in lhs<xyz> Both: eliminate comm. for six 3D arrays in compute_rhs

Page 20:

[Bar chart: Reduction in Exec. Time, dHPF vs. MPI; 16 Proc. and 64 Proc.]

Impact of Partial Replication (cont.)

Page 21:

Interprocedural Communication Reduction

Extensions to HPF/JA directives:

– REFLECT: placement of near-neighbor communication
– LOCAL: communication not needed for a scope
– extended ON HOME: partial computation replication

With these directives, the compiler doesn't need full interprocedural communication and availability analyses to determine whether data in overlap regions & comm. buffers is fresh

Page 22:

Interprocedural Communication Reduction (cont.)

SHADOW a(2, 1)
REFLECT (a(0:0, 1:0), a(1:0, 0:0))
(receives only the needed sections: from the left neighbor and from the top neighbor)

vs.

SHADOW a(2, 1)
REFLECT (a)

The combination of REFLECT, extended ON HOME and LOCAL reduces communication volume by ~13%, resulting in a ~9% reduction in execution time

Page 23:

Normalizing Communication

Same non-local data needed by both references (across the P0 | P1 boundary):

do i = 1, n
  do j = 2, n - 2
    a(i, j) = a(i, j - 2)        ! ON_HOME a(i, j)
    a(i, j + 2) = a(i, j)        ! ON_HOME a(i, j + 2)
  enddo
enddo
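Normalization can be sketched by rewriting each reference's subscripts relative to its statement's ON_HOME anchor (a toy Python model, not the compiler's set machinery): references that normalize to the same offset need the same non-local data, so their messages can be merged.

```python
def normalized_offset(ref, home):
    """Subscripts of a reference expressed relative to the ON_HOME
    reference of its statement; equal offsets mean the same non-local
    data is needed."""
    return tuple(r - h for r, h in zip(ref, home))

# With i = j = 0 as the symbolic base point:
# a(i, j - 2) under ON_HOME a(i, j)     -> offset (0, -2)
# a(i, j)     under ON_HOME a(i, j + 2) -> offset (0, -2)
```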

Page 24:

Coalescing Communication

[Diagram: two messages for array A combined into one coalesced message]
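Coalescing itself is simple once messages are normalized; a sketch (payloads and processor ids are illustrative):

```python
def coalesce(messages):
    """Merge messages that share the same (src, dst) processor pair into a
    single send; payloads are concatenated in order, so each pair of
    processors exchanges at most one message."""
    combined = {}
    for src, dst, payload in messages:
        combined.setdefault((src, dst), []).extend(payload)
    return combined
```

Fewer, larger messages amortize per-message latency, which is why this is a key optimization for scalability.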

Page 25:

Impact of Normalized Coalescing

[Bar chart: Reduction in Comm. Vol., dHPF vs. MPI; 16 Proc. and 64 Proc.]

Page 26:

[Bar chart: Reduction in Exec. Time, dHPF vs. MPI; 16 Proc. and 64 Proc.]

Impact of Normalized Coalescing

Key optimization for scalability

Page 27:

Direct Access Buffers

Choices for receiving complex coalesced messages

Unpack them into the shadow regions
– two simultaneous live copies in cache
– unpacking can be costly
– uniform access to non-local & local data

Reference them directly out of the receive buffer
– introduces two modes of access for data (non-local & interior)
– overhead of having a single loop with these two modes is high
– loops should be split into non-local & interior portions, according to the data they reference
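The loop-splitting idea can be sketched in Python for a 1D recurrence (array contents and the 0.5 coefficient are illustrative): the boundary iteration reads the halo value straight from the receive buffer, and the interior loop never tests which mode it is in.

```python
def sweep_split(local, recv_buf):
    """Split sweep: the first iteration reads its neighbor's boundary value
    directly from the receive buffer (no unpack into a shadow region); the
    remaining interior iterations touch only local data."""
    out = [0.0] * len(local)
    out[0] = local[0] + 0.5 * recv_buf[0]   # boundary: direct buffer access
    for i in range(1, len(local)):          # interior: local data only
        out[i] = local[i] + 0.5 * out[i - 1]
    return out
```

Splitting keeps each loop body branch-free, avoiding the overhead of a single loop with two access modes.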

Page 28:

Impact of Direct Access Buffers

Use direct access buffers for the main swept arrays

Direct access buffers + loop splitting reduce L2 data cache misses by ~11%, resulting in a ~11% reduction in execution time

Page 29:

Conclusions

Compiler-generated code can match the performance of sophisticated hand-coded parallelizations

High performance comes from the aggregate benefit of multiple optimizations

Everything affects scalability: good parallel algorithms are only the starting point; excellent resource utilization on the target machine is also needed

Data-parallel compilers should target each potential source of inefficiency in the generated code if they want to deliver the performance scientific users demand

Page 30:

Efficiency for NAS SP (‘A’)

Page 31:

Efficiency for NAS BT (‘A’)

Page 32:

Data Partitioning

[Bar chart: Reduction in Exec. Time, dHPF vs. MPI (range -20% to 100%); 16 Proc. and 64 Proc.]

Page 33:

[Bar chart: Reduction in Comm. Vol., dHPF vs. MPI; 16 Proc. and 64 Proc.]

Data Partitioning (cont.)

Page 34:

Partially Replicated Computation

do i = 1, n
  do j = 2, n
    a(i, j) = u(i, j-1) + 1.0         ! ON_HOME a(i, j) ON_HOME a(i, j+1)
    b(i, j) = u(i, j-1) + a(i, j-1)   ! ON_HOME a(i, j)
  enddo
enddo

[Figure: processors p and p+1 hold local portions of A, U, and B with shadow regions; communication fills the u shadow region, and the replicated computation of a covers the boundary column]

Page 35:

[Bar chart: Reduction in Exec. Time, dHPF vs. MPI; 16 Proc. and 64 Proc.]

Using HPF/JA for Comm. Elimination

Page 36:

[Bar chart: Reduction in Comm. Vol., dHPF vs. MPI; 16 Proc. and 64 Proc.]

Using HPF/JA for Comm. Elimination

Page 37:

do timestep = 1, T

  do j = 1, n
    do i = 3, n
      a(i, j) = a(i + 1, j) + b(i - 1, j)        ! ON_HOME a(i, j)
    enddo
  enddo

  do j = 1, n
    do i = 1, n - 2
      a(i + 2, j) = a(i + 3, j) + b(i + 1, j)    ! ON_HOME a(i + 2, j)
    enddo
  enddo

  do j = 1, n
    do i = 1, n - 1
      a(i + 1, j) = a(i + 2, j) + b(i + 1, j)    ! ON_HOME b(i + 1, j)
    enddo
  enddo

enddo

Coalesce communication at this point

Normalized Comm. Coalescing (cont.)

Page 38:

[Bar chart: Reduction in Exec. Time, dHPF vs. MPI; 16 Proc. and 64 Proc.]

Impact of Direct Access Buffers

Page 39:

[Bar chart: Reduction in L2 Misses, dHPF vs. MPI; 16 Proc. and 64 Proc.]

Impact of Direct Access Buffers

Page 40:

Direct Access Buffers

Processor 0 Processor 1

Pack, Send, Receive & Unpack

Page 41:

Direct Access Buffers

Processor 0 Processor 1

Pack, Send & Receive; use directly from the receive buffer