An Evaluation of Data-Parallel Compiler Support for Line-Sweep Applications

Daniel Chavarría-Miranda, John Mellor-Crummey
Dept. of Computer Science, Rice University



Page 1: An Evaluation of Data-Parallel Compiler Support for Line-Sweep Applications

An Evaluation of Data-Parallel Compiler Support for Line-Sweep Applications

Daniel Chavarría-Miranda, John Mellor-Crummey
Dept. of Computer Science, Rice University

Page 2:

High-Performance Fortran (HPF)

HPF program = sequential Fortran program + data partitioning

Compilation for a parallel machine:
– Partition computation
– Insert comm / sync
– Manage storage

Same answers as the sequential Fortran program

Industry-standard data-parallel language: partitioning of data drives partitioning of computation, …

Page 3:

Motivation

Obtaining high performance from applications written in high-level parallel languages has been elusive

Tightly-coupled applications are particularly hard: data dependences serialize computation
– induce tradeoffs between parallelism, communication granularity, and communication frequency
– traditional HPF partitionings limit scalability and performance

Communication might be needed inside loops

Page 4:

Contributions

A set of compilation techniques that enable us to match hand-coded performance for tightly-coupled applications

An analysis of their performance impact

Page 5:

dHPF Compiler

Based on an abstract equational framework
– manipulates sets of processors, array elements, and iterations, and pairwise mappings between these sets
– optimizations and code generation are implemented as operations on these sets and mappings

Sophisticated computation-partitioning model
– enables partial replication of computation to reduce communication

Support for the multipartitioning distribution
– MULTI distribution specifier
– suited for line-sweep computations

Innovative optimizations
– reduce communication
– improve locality

Page 6:

Overview

Introduction
Line-Sweep Computations
Performance Comparison
Optimization Evaluation

– Partially Replicated Computation

– Interprocedural Communication Elimination

– Communication Coalescing

– Direct Access Buffers

Conclusions

Page 7:

Line-Sweep Computations

1D recurrences on a multidimensional domain

Recurrences order the computation along each dimension

Compiler-based parallelization is hard: loop-carried dependences, fine-grained parallelism
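The serialization that makes line sweeps hard to parallelize shows up in a minimal Python sketch (the 0.5 recurrence coefficient and array shapes are illustrative, not from the benchmarks): each value along the swept dimension depends on its predecessor, while distinct lines stay independent.

```python
def sweep_x(a):
    """Forward line sweep: a 1D recurrence along x applied to every y-line.
    a[i][j] depends on a[i-1][j], so the i-loop is serialized by a
    loop-carried dependence; distinct j-lines are independent."""
    for j in range(len(a[0])):
        for i in range(1, len(a)):
            a[i][j] = a[i][j] + 0.5 * a[i - 1][j]
    return a
```

Sweeping a 3x2 grid of ones yields 1.0, 1.5, 1.75 down each line: the fine-grained dependence chain is visible in the i-loop.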

Page 8:

Partitioning Choices (Transpose)

Local Sweeps along x and z

Transpose

Local Sweep along y

Transpose back
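This transpose strategy can be sketched in a few lines of Python (pure lists, no MPI; the 0.5 coefficient is illustrative): sweep along the locally stored dimension, transpose so the other dimension becomes local, sweep again, and transpose back.

```python
def transpose(a):
    # redistribute the array so the other dimension becomes local
    return [list(col) for col in zip(*a)]

def sweep_rows(a):
    # local sweep: recurrence along each row, no communication needed
    for row in a:
        for i in range(1, len(row)):
            row[i] += 0.5 * row[i - 1]
    return a

def sweep_x_then_y(a):
    a = sweep_rows(a)        # sweep along x (rows are local)
    a = transpose(a)         # transpose: y becomes the local dimension
    a = sweep_rows(a)        # sweep along y
    return transpose(a)      # transpose back to the original layout
```

Each sweep runs fully in parallel, but the two transposes move the entire array across the machine.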

Page 9:

Partitioning Choices (block + CGP)

Partial wavefront-type parallelism

Processor 0

Processor 1

Processor 2

Processor 3
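The partial parallelism of coarse-grain pipelining can be modeled with a small schedule computation (a sketch; `p` processors in a 1D block partition, with the orthogonal dimension cut into `n_chunks` pipeline stages): a processor may start a chunk only after its predecessor processor has finished that chunk.

```python
def pipeline_schedule(p, n_chunks):
    """Wavefront schedule for a pipelined sweep: processor q handles chunk b
    at step max(step(q-1, b), step(q, b-1)) + 1, so work proceeds along a
    diagonal wavefront and finishes in p + n_chunks - 1 steps."""
    step = {}
    for q in range(p):
        for b in range(n_chunks):
            step[(q, b)] = max(step.get((q - 1, b), -1),
                               step.get((q, b - 1), -1)) + 1
    return step
```

During pipeline fill and drain, some processors are idle: that is the serialization the chart annotations on the following slides refer to.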

Page 10:

Partitioning Choices (multipartitioning)

Full parallelism for sweeping along any partitioned dimension

Processor 0

Processor 1

Processor 2

Processor 3
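A diagonal multipartitioning can be sketched in a few lines of Python (the (j - i) mod p assignment is one standard diagonal layout): with p processors and a p x p grid of tiles, each processor owns exactly one tile in every tile-row and tile-column, so a sweep along either partitioned dimension keeps all processors busy.

```python
def tile_owner(i, j, p):
    """Diagonal multipartitioning: tile (i, j) of a p x p tile grid is
    assigned to processor (j - i) mod p."""
    return (j - i) % p

def owners_in_row(i, p):
    return sorted(tile_owner(i, j, p) for j in range(p))

def owners_in_col(j, p):
    return sorted(tile_owner(i, j, p) for i in range(p))
```

Every tile-row and tile-column contains all p processors exactly once, which is what gives full parallelism for sweeps along any partitioned dimension.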

Page 11:

NAS SP & BT Benchmarks

NAS SP & BT benchmarks from NASA Ames
– use ADI to solve the Navier-Stokes equation in 3D
– forward & backward line sweeps on each dimension, for each time step

SP solves scalar penta-diagonal systems
BT solves block-tridiagonal systems
SP has double the communication volume and frequency

Page 12:

Experimental Setup

2 versions from NASA, each written in Fortran 77
– parallel MPI hand-coded version
– sequential version (3500 lines)

dHPF input: sequential version + HPF directives (including MULTI; 2% line-count increase)

Inlined several procedures manually:
– enables dHPF to overlap local computation with communication without interprocedural tiling

Platform: SGI Origin 2000 (128 250 MHz procs.), SGI's MPI implementation, SGI's compilers

Page 13:

Performance Comparison

Compare four versions of NAS SP & BT:

Multipartitioned MPI hand-coded version from NASA
– different executables for each number of processors

Multipartitioned dHPF-generated version
– single executable for all numbers of processors

Block-partitioned dHPF-generated version (with coarse-grain pipelining, using a 2D partition)
– single executable for all numbers of processors

Block-partitioned pghpf-compiled version from PGI's source code (using a full transpose with a 1D partition)
– single executable for all numbers of processors

Page 14:

Efficiency for NAS SP ('B' size, 102³)

> 2x multipartitioning comm. volume

similar comm. volume, more serialization

Page 15:

Efficiency for NAS BT ('B' size, 102³)

> 2x multipartitioning comm. volume

Page 16:

Introduction
Line-Sweep Computations
Performance Comparison
Optimization Evaluation

– Partially Replicated Computation

– Inteprocedural Communication Elimination

– Communication Coalescing

– Direct Access Buffers

Conclusions

Overview

Page 17:

Evaluation Methodology

All versions are dHPF-generated using multipartitioning

Turn off a particular optimization ("n - 1" approach)
– determine overhead without it (% over fully optimized)

Measure its contribution to overall performance
– total execution time
– total communication volume
– L2 data cache misses (where appropriate)

Class A (64³) and class B (102³) problem sizes on two different processor counts (16 & 64 processors)

Page 18:

Partially Replicated Computation

[Figure: references ON_HOME a(i-2, j), ON_HOME a(i+2, j), ON_HOME a(i, j-2), ON_HOME a(i-1, j+1); computation partitionings ON_HOME a(i, j) and ON_EXT_HOME a(i, j); SHADOW a(2, 2)]

Partial computation replication is used to reduce communication
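A minimal Python sketch of the idea (dict-based, with u standing in for an already-communicated input array; indices and bounds are illustrative): by extending its loop one column into the shadow region, a processor recomputes the boundary value of a from data it already holds instead of receiving a itself.

```python
def compute_b(u, lo, hi):
    """One processor's work over columns [lo, hi).
    a[j] = u[j-1] + 1.0 is also computed for j = lo - 1 (replicated work in
    the shadow region, using the u shadow value already communicated), so
    b[j] = u[j-1] + a[j-1] needs no message carrying a."""
    a = {}
    for j in range(max(lo - 1, 1), hi):   # lo - 1: partially replicated column
        a[j] = u[j - 1] + 1.0
    b = {}
    for j in range(max(lo, 2), hi):
        b[j] = u[j - 1] + a[j - 1]
    return b
```

The replicated column costs a little redundant arithmetic but eliminates an entire communication phase for a.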

Page 19:

[Bar chart: Reduction in Comm. Vol., dHPF vs. MPI; 16 Proc. and 64 Proc.]

Impact of Partial Replication

BT: eliminate comm. for 5D arrays fjac and njac in lhs<xyz> Both: eliminate comm. for six 3D arrays in compute_rhs

Page 20:

[Bar chart: Reduction in Exec. Time, dHPF vs. MPI; 16 Proc. and 64 Proc.]

Impact of Partial Replication (cont.)

Page 21:

Interprocedural Communication Reduction

Extensions to HPF/JA directives:

– REFLECT: placement of near-neighbor communication
– LOCAL: communication not needed for a scope
– extended ON HOME: partial computation replication

With these directives, the compiler doesn't need full interprocedural communication and availability analyses to determine whether data in overlap regions & comm. buffers is fresh

Page 22:

Interprocedural Communication Reduction (cont.)

SHADOW a(2, 1)
REFLECT (a(0:0, 1:0), a(1:0, 0:0))
(receives only the needed sections: from the left neighbor and from the top neighbor)

vs.

SHADOW a(2, 1)
REFLECT (a)

The combination of REFLECT, extended ON HOME and LOCAL reduces communication volume by ~13%, resulting in a ~9% reduction in execution time

Page 23:

Normalizing Communication

Same non-local data needed by both references (across the P0 | P1 boundary):

do i = 1, n
  do j = 2, n - 2
    a(i, j) = a(i, j - 2)        ! ON_HOME a(i, j)
    a(i, j + 2) = a(i, j)        ! ON_HOME a(i, j + 2)
  enddo
enddo
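Normalization can be sketched by rewriting each reference's subscripts relative to its statement's ON_HOME anchor (a toy Python model, not the compiler's set machinery): references that normalize to the same offset need the same non-local data, so their messages can be merged.

```python
def normalized_offset(ref, home):
    """Subscripts of a reference expressed relative to the ON_HOME
    reference of its statement; equal offsets mean the same non-local
    data is needed."""
    return tuple(r - h for r, h in zip(ref, home))

# With i = j = 0 as the symbolic base point:
# a(i, j - 2) under ON_HOME a(i, j)     -> offset (0, -2)
# a(i, j)     under ON_HOME a(i, j + 2) -> offset (0, -2)
```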

Page 24:

Coalescing Communication

[Diagram: two messages for array A combined into one coalesced message]
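Coalescing itself is simple once messages are normalized; a sketch (payloads and processor ids are illustrative):

```python
def coalesce(messages):
    """Merge messages that share the same (src, dst) processor pair into a
    single send; payloads are concatenated in order, so each pair of
    processors exchanges at most one message."""
    combined = {}
    for src, dst, payload in messages:
        combined.setdefault((src, dst), []).extend(payload)
    return combined
```

Fewer, larger messages amortize per-message latency, which is why this is a key optimization for scalability.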

Page 25:

Impact of Normalized Coalescing

[Bar chart: Reduction in Comm. Vol., dHPF vs. MPI; 16 Proc. and 64 Proc.]

Page 26:

[Bar chart: Reduction in Exec. Time, dHPF vs. MPI; 16 Proc. and 64 Proc.]

Impact of Normalized Coalescing

Key optimization for scalability

Page 27:

Direct Access Buffers

Choices for receiving complex coalesced messages

Unpack them into the shadow regions
– two simultaneous live copies in cache
– unpacking can be costly
– uniform access to non-local & local data

Reference them directly out of the receive buffer
– introduces two modes of access for data (non-local & interior)
– overhead of having a single loop with these two modes is high
– loops should be split into non-local & interior portions, according to the data they reference
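The loop-splitting idea can be sketched in Python for a 1D recurrence (array contents and the 0.5 coefficient are illustrative): the boundary iteration reads the halo value straight from the receive buffer, and the interior loop never tests which mode it is in.

```python
def sweep_split(local, recv_buf):
    """Split sweep: the first iteration reads its neighbor's boundary value
    directly from the receive buffer (no unpack into a shadow region); the
    remaining interior iterations touch only local data."""
    out = [0.0] * len(local)
    out[0] = local[0] + 0.5 * recv_buf[0]   # boundary: direct buffer access
    for i in range(1, len(local)):          # interior: local data only
        out[i] = local[i] + 0.5 * out[i - 1]
    return out
```

Splitting keeps each loop body branch-free, avoiding the overhead of a single loop with two access modes.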

Page 28:

Impact of Direct Access Buffers

Use direct access buffers for the main swept arrays

Direct access buffers + loop splitting reduce L2 data cache misses by ~11%, resulting in a ~11% reduction in execution time

Page 29:

Conclusions

Compiler-generated code can match the performance of sophisticated hand-coded parallelizations

High performance comes from the aggregate benefit of multiple optimizations

Everything affects scalability: good parallel algorithms are only the starting point; excellent resource utilization on the target machine is also needed

Data-parallel compilers should target each potential source of inefficiency in the generated code if they want to deliver the performance scientific users demand

Page 30:

Efficiency for NAS SP (‘A’)

Page 31:

Efficiency for NAS BT (‘A’)

Page 32:

Data Partitioning

[Bar chart: Reduction in Exec. Time, dHPF vs. MPI (range -20% to 100%); 16 Proc. and 64 Proc.]

Page 33:

[Bar chart: Reduction in Comm. Vol., dHPF vs. MPI; 16 Proc. and 64 Proc.]

Data Partitioning (cont.)

Page 34:

Partially Replicated Computation

do i = 1, n
  do j = 2, n
    a(i, j) = u(i, j-1) + 1.0         ! ON_HOME a(i, j) ON_HOME a(i, j+1)
    b(i, j) = u(i, j-1) + a(i, j-1)   ! ON_HOME a(i, j)
  enddo
enddo

[Figure: processors p and p+1 hold local portions of A, U, and B with shadow regions; communication fills the u shadow region, and the replicated computation of a covers the boundary column]

Page 35:

[Bar chart: Reduction in Exec. Time, dHPF vs. MPI; 16 Proc. and 64 Proc.]

Using HPF/JA for Comm. Elimination

Page 36:

[Bar chart: Reduction in Comm. Vol., dHPF vs. MPI; 16 Proc. and 64 Proc.]

Using HPF/JA for Comm. Elimination

Page 37:

do timestep = 1, T

  do j = 1, n
    do i = 3, n
      a(i, j) = a(i + 1, j) + b(i - 1, j)        ! ON_HOME a(i, j)
    enddo
  enddo

  do j = 1, n
    do i = 1, n - 2
      a(i + 2, j) = a(i + 3, j) + b(i + 1, j)    ! ON_HOME a(i + 2, j)
    enddo
  enddo

  do j = 1, n
    do i = 1, n - 1
      a(i + 1, j) = a(i + 2, j) + b(i + 1, j)    ! ON_HOME b(i + 1, j)
    enddo
  enddo

enddo

Coalesce communication at this point

Normalized Comm. Coalescing (cont.)

Page 38:

[Bar chart: Reduction in Exec. Time, dHPF vs. MPI; 16 Proc. and 64 Proc.]

Impact of Direct Access Buffers

Page 39:

[Bar chart: Reduction in L2 Misses, dHPF vs. MPI; 16 Proc. and 64 Proc.]

Impact of Direct Access Buffers

Page 40:

Direct Access Buffers

Processor 0 Processor 1

Pack, Send, Receive & Unpack

Page 41:

Direct Access Buffers

Processor 0 Processor 1

Pack, Send & Receive; use directly from the receive buffer