CSE 262 Lecture 13: Communication overlap (continued); heterogeneous processing (Scott B. Baden, UCSD, Winter 2015)


Page 1: CSE 262 Lecture 13

CSE 262 Lecture 13

Communication overlap – Continued

Heterogeneous processing

Page 2: CSE 262 Lecture 13

Announcements
•  Final presentations
   -  Friday, March 13th, 10:00 AM to 1:00 PM (note the time change)
   -  Room 3217, CSE Building (EBU3B)


Page 3: CSE 262 Lecture 13

An alternative way to hide communication
•  Reformulate MPI code into a data-driven form (a typical SPMD starting point is sketched after the figure below)
   -  Decouple scheduling and communication handling from the application
   -  Automatically overlap communication with computation

[Figure: an SPMD MPI program, in which each iteration issues Irecv, Send, Wait, and Compute, is translated into a task dependency graph executed by a runtime system with worker threads, communication handlers, and dynamic scheduling.]
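For concreteness, here is a minimal sketch of the kind of SPMD MPI iteration shown on the left of the figure: post receives, send boundary data, wait, then compute. This is illustrative only; the buffer names, neighbor ranks, and message sizes are placeholders, not the course's code.

    #include <mpi.h>

    /* Minimal SPMD halo exchange for a 1D decomposition (illustrative only).
       `up` and `down` are neighbor ranks (MPI_PROC_NULL at the ends);
       ghost_lo/ghost_hi and bnd_lo/bnd_hi are placeholder buffers of length n. */
    void timestep(double *ghost_lo, double *ghost_hi,
                  double *bnd_lo, double *bnd_hi, int n,
                  int up, int down, MPI_Comm comm)
    {
        MPI_Request req[4];
        MPI_Irecv(ghost_lo, n, MPI_DOUBLE, down, 0, comm, &req[0]);
        MPI_Irecv(ghost_hi, n, MPI_DOUBLE, up,   1, comm, &req[1]);
        MPI_Isend(bnd_lo,   n, MPI_DOUBLE, down, 1, comm, &req[2]);
        MPI_Isend(bnd_hi,   n, MPI_DOUBLE, up,   0, comm, &req[3]);
        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
        /* compute(...);  interior and boundary update once the halos have arrived */
    }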


Page 4: CSE 262 Lecture 13

Bamboo Programming Model
•  Olap-regions: task switching points
   -  Data availability is checked at entry
   -  Only one olap-region may be active at a time
   -  A task is ready when some olap-region's input conditions have been satisfied
•  Send blocks
   -  Hold send calls only
   -  Enable the olap-region
•  Receive blocks
   -  Hold receive and/or send calls
   -  Receive calls are inputs to the olap-region
   -  Send calls are outputs of the olap-region

•  Activities in send blocks must be independent of those in receive blocks

•  MPI_Wait/MPI_Waitall can reside anywhere within the olap-region

    #pragma bamboo olap
    {
        #pragma bamboo send
        { ... }
        #pragma bamboo receive
        { ... }
    }
    ...
    #pragma bamboo olap
    { ... }
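As an illustration only (not the course's benchmark code; buffer names, neighbor ranks, and sizes are placeholders), a 1D halo exchange written against these pragmas might look like the following. The send and receive blocks are kept independent, and MPI_Waitall stays inside the olap-region, as the rules above allow.

    #include <mpi.h>

    /* Illustrative halo exchange wrapped in a Bamboo olap-region. */
    void bamboo_timestep(double *ghost_lo, double *ghost_hi,
                         double *bnd_lo, double *bnd_hi, int n,
                         int up, int down, int nIters, MPI_Comm comm)
    {
        for (int iter = 0; iter < nIters; iter++) {
            #pragma bamboo olap
            {
                MPI_Request req[4];
                #pragma bamboo send
                {   /* outputs of the olap-region */
                    MPI_Isend(bnd_lo, n, MPI_DOUBLE, down, 1, comm, &req[2]);
                    MPI_Isend(bnd_hi, n, MPI_DOUBLE, up,   0, comm, &req[3]);
                }
                #pragma bamboo receive
                {   /* inputs to the olap-region */
                    MPI_Irecv(ghost_lo, n, MPI_DOUBLE, down, 0, comm, &req[0]);
                    MPI_Irecv(ghost_hi, n, MPI_DOUBLE, up,   1, comm, &req[1]);
                }
                MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
            }
            /* compute: stencil update once the halos are available */
        }
    }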

[Diagram: computation phases φ1, φ2, …, φN interleaved with olap-regions OLAP1 … OLAPN.]


Page 5: CSE 262 Lecture 13

Results
•  Stampede at TACC
   -  102,400 cores; dual-socket Sandy Bridge processors
   -  K20 GPUs
•  Cray XE6 at NERSC (Hopper)
   -  153,216 cores; dual-socket 12-core Magny-Cours
   -  4 NUMA nodes per Hopper node, each with 6 cores
   -  3D toroidal network
•  Cray XC30 at NERSC (Edison)
   -  133,824 cores; dual-socket 12-core Ivy Bridge
   -  Dragonfly network


Page 6: CSE 262 Lecture 13

[Figure: weak-scaling performance on Hopper; x-axis: cores (12288, 24576, 49152, 98304); y-axis: TFLOP/s (0 to 40); series: MPI-basic, MPI-olap, Bamboo-basic, MPI-nocomm.]

Stencil application performance (Hopper)
•  Solve the 3D Laplace equation with Dirichlet boundary conditions (N = 3072³): 7-point stencil, Δu = 0, u = f on ∂Ω (the kernel is sketched below)
•  Added 4 Bamboo pragmas to a 419-line MPI code
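As a reminder of what the kernel computes, here is a minimal single-node sketch of one 7-point Jacobi sweep; the array layout, names, and the IDX macro are illustrative placeholders, not the benchmarked code.

    #include <stddef.h>

    /* 7-point Jacobi sweep on an (n+2)^3 array with a one-cell ghost layer.
       IDX flattens (i,j,k) into a 1D index; u_new and u_old are placeholders. */
    #define IDX(i,j,k,n) ((size_t)(i)*((n)+2)*((n)+2) + (size_t)(j)*((n)+2) + (k))

    void jacobi_sweep(double *u_new, const double *u_old, int n)
    {
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= n; j++)
                for (int k = 1; k <= n; k++)
                    u_new[IDX(i,j,k,n)] =
                        (u_old[IDX(i-1,j,k,n)] + u_old[IDX(i+1,j,k,n)] +
                         u_old[IDX(i,j-1,k,n)] + u_old[IDX(i,j+1,k,n)] +
                         u_old[IDX(i,j,k-1,n)] + u_old[IDX(i,j,k+1,n)]) / 6.0;
    }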



Page 7: CSE 262 Lecture 13

2D Cannon - Weak scaling study

•  Communication cost: 11%–39%
•  Bamboo improves on MPI-basic by 9%–37%
•  Bamboo outperforms MPI-olap at scale
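For reference, here is a minimal sketch of the 2D Cannon algorithm being measured. It is illustrative only, not the benchmarked code; the q x q periodic Cartesian communicator and the local nb x nb blocks are assumed to be set up by the caller.

    #include <mpi.h>

    /* 2D Cannon matrix multiply on a q x q periodic Cartesian grid (illustrative).
       A, B, C are the local nb x nb blocks (row-major); (row, col) are this
       rank's grid coordinates in grid_comm; C accumulates A*B. */
    void cannon(double *A, double *B, double *C, int nb, int q,
                int row, int col, MPI_Comm grid_comm)
    {
        int src, dst;
        const int cnt = nb * nb;

        /* Initial alignment: shift A left by `row` positions, B up by `col`. */
        MPI_Cart_shift(grid_comm, 1, -row, &src, &dst);
        MPI_Sendrecv_replace(A, cnt, MPI_DOUBLE, dst, 0, src, 0, grid_comm, MPI_STATUS_IGNORE);
        MPI_Cart_shift(grid_comm, 0, -col, &src, &dst);
        MPI_Sendrecv_replace(B, cnt, MPI_DOUBLE, dst, 1, src, 1, grid_comm, MPI_STATUS_IGNORE);

        for (int step = 0; step < q; step++) {
            /* Local block multiply: C += A * B */
            for (int i = 0; i < nb; i++)
                for (int k = 0; k < nb; k++)
                    for (int j = 0; j < nb; j++)
                        C[i*nb + j] += A[i*nb + k] * B[k*nb + j];
            /* Shift A one step left and B one step up for the next round. */
            MPI_Cart_shift(grid_comm, 1, -1, &src, &dst);
            MPI_Sendrecv_replace(A, cnt, MPI_DOUBLE, dst, 0, src, 0, grid_comm, MPI_STATUS_IGNORE);
            MPI_Cart_shift(grid_comm, 0, -1, &src, &dst);
            MPI_Sendrecv_replace(B, cnt, MPI_DOUBLE, dst, 1, src, 1, grid_comm, MPI_STATUS_IGNORE);
        }
    }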

[Figure: weak-scaling performance on Edison; x-axis: cores (4096, 16384, 65536) with matrix sizes N0/4^(2/3), N0/4^(1/3), and N = N0 = 196608; y-axis: TFLOP/s (0 to 1400); series: MPI-basic, MPI-olap, Bamboo, MPI-nocomm.]


Page 8: CSE 262 Lecture 13

Communication-Avoiding Matrix Multiplication (Hopper)
•  Pathological matrices arise in planewave-basis methods for ab initio molecular dynamics (Ng³ × Ne); for Si: Ng = 140, Ne = 2000
•  Weak scaling study; used OpenMP; 23 pragmas added to a 337-line code

[Figure: weak-scaling performance on Hopper; x-axis: cores (4096, 8192, 16384, 32768) with matrix sizes N = N0 = 20608, 2^(1/3)N0, 2^(2/3)N0, 2N0; y-axis: TFLOP/s (0 to 110); series: MPI+OMP, MPI+OMP-olap, Bamboo+OMP, MPI+OMP-nocomm; annotation: 210.4 TF.]


Page 9: CSE 262 Lecture 13

Virtualization Improves Performance

[Figure: normalized performance vs. virtualization factor (1, 2, 4, 8) at 12288, 24576, 49152, and 98304 cores. Left: Jacobi (MPI+OMP); right: Cannon 2.5D (MPI). Annotations: c=2, VF=8; c=2, VF=4; c=2, VF=2; c=4, VF=2.]


Page 10: CSE 262 Lecture 13

Virtualization Improves Performance

[Figure: normalized performance vs. virtualization factor. Left: Jacobi (MPI+OMP) at 12288, 24576, 49152, and 98304 cores (virtualization factors 1 to 8); right: "The effect of virtualization in 2D Cannon", Cannon 2D (MPI) at 1024, 4096, and 16384 cores (virtualization factors 1 to 16).]


Page 11: CSE 262 Lecture 13

High Performance Linpack (HPL) on Stampede

[Figure: HPL performance on Stampede; x-axis: matrix size (139264 to 180224); y-axis: TFLOP/s (21.5 to 28); series: Basic, Unprioritized Scheduling, Prioritized Scheduling, Olap.]

•  Solve systems of linear equations using LU factorization

•  Latency-tolerant lookahead code is complicated
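To make "lookahead" concrete: each step of right-looking blocked LU factors a panel and then applies it to the trailing matrix, and the lookahead variant factors the next panel as soon as its columns of the trailing update are ready, overlapping that work with the rest of the update. Below is a minimal sequential sketch without pivoting (HPL itself is distributed, blocked, and pivoted); the comment marks where lookahead restructures the loop.

    /* Right-looking blocked LU without pivoting (illustrative only).
       A is n x n, row-major; b is the panel width, assumed to divide n. */
    void blocked_lu(double *A, int n, int b)
    {
        for (int kb = 0; kb < n; kb += b) {
            int ke = kb + b;
            /* 1) Factor the panel A[kb:n, kb:ke) */
            for (int k = kb; k < ke; k++)
                for (int i = k + 1; i < n; i++) {
                    A[i*n + k] /= A[k*n + k];
                    for (int j = k + 1; j < ke; j++)
                        A[i*n + j] -= A[i*n + k] * A[k*n + j];
                }
            /* 2) Block row: U12 = L11^{-1} * A[kb:ke, ke:n) */
            for (int k = kb; k < ke; k++)
                for (int i = k + 1; i < ke; i++)
                    for (int j = ke; j < n; j++)
                        A[i*n + j] -= A[i*n + k] * A[k*n + j];
            /* 3) Trailing update: A22 -= L21 * U12.
               Lookahead: factor the *next* panel (columns ke:ke+b) as soon as its
               part of this update is finished, overlapping the panel work and its
               communication with the rest of the trailing update. */
            for (int i = ke; i < n; i++)
                for (int k = kb; k < ke; k++)
                    for (int j = ke; j < n; j++)
                        A[i*n + j] -= A[i*n + k] * A[k*n + j];
        }
    }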

(Measured on 2048 cores of Stampede.)

•  Results
   -  Bamboo meets the performance of the highly optimized version of HPL
   -  Uses the far simpler non-lookahead version
   -  Task prioritization is crucial
   -  Bamboo improves the baseline version of HPL by up to 10%

Page 12: CSE 262 Lecture 13

Bamboo on multiple GPUs

•  MPI+CUDA programming model
   -  CPU is the host; the GPU works as a device
   -  Host-host communication with MPI, host-device with CUDA
   -  MPI and CUDA portions are optimized separately (the host-staged path is sketched after the diagram below)
•  Need a GPU-aware programming model
   -  Allow a device to transfer data to another device
   -  Compiler and runtime system handle the data transfer
   -  Hide both host-device and host-host communication automatically

[Diagram: in the MPI+CUDA model, GPU0 and GPU1 move data via cudaMemcpy to their host MPI processes (MPI0, MPI1), which communicate with Send/Recv. In the GPU-aware model, GPU tasks x and y exchange data through the runtime system (RTS), which handles the communication on top of MPI.]
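To illustrate what the runtime hides, here is a minimal sketch of the host-staged path in the plain MPI+CUDA model (buffer names are placeholders; error checking omitted): copy the device boundary to the host, exchange with MPI, copy the received halo back to the device.

    #include <mpi.h>
    #include <cuda_runtime.h>

    /* Illustrative host-staged halo exchange in the MPI+CUDA model.
       d_send/d_recv are device buffers, h_send/h_recv are host staging buffers,
       `peer` is the neighboring MPI rank. */
    void staged_exchange(double *d_send, double *d_recv,
                         double *h_send, double *h_recv,
                         int n, int peer, MPI_Comm comm)
    {
        cudaMemcpy(h_send, d_send, n * sizeof(double), cudaMemcpyDeviceToHost);
        MPI_Sendrecv(h_send, n, MPI_DOUBLE, peer, 0,
                     h_recv, n, MPI_DOUBLE, peer, 0,
                     comm, MPI_STATUS_IGNORE);
        cudaMemcpy(d_recv, h_recv, n * sizeof(double), cudaMemcpyHostToDevice);
    }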


Page 13: CSE 262 Lecture 13

3D Jacobi – Weak Scaling Study

•  Results on Stampede
   -  Bamboo-GPU outperforms MPI-basic
   -  Bamboo-GPU and MPI-olap hide most communication overheads
•  Bamboo-GPU improves performance by
   -  Hiding host-host transfers
   -  Hiding host-device transfers
   -  Having tasks that reside on the same GPU send only the address of the message

[Figure: weak-scaling performance on Stampede; x-axis: GPU count (4, 8, 16, 32); y-axis: GFLOP/s (0 to 1400); series: MPI-basic, MPI-olap, Bamboo, Bamboo-GPU, MPI-nocomm.]


Page 14: CSE 262 Lecture 13

Multigrid – Weak Scaling

•  A geometric multigrid solver for Helmholtz's equation [Williams et al. '12]
   -  V-cycle: restrict, smooth, solve, interpolate, smooth (sketched below)
   -  Smoother: red-black Gauss-Seidel
   -  DRAM-avoiding with the wave-front method
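As context for the results below, here is a structural sketch of the V-cycle recursion; the kernel routines are placeholders standing in for the solver's actual smoother (red-black Gauss-Seidel), restriction, interpolation, and coarse-grid solve.

    /* Structural sketch of a geometric multigrid V-cycle (illustrative only).
       The kernels below are placeholders; n[level] is the number of unknowns on that level. */
    void smooth(double *u, const double *f, int n);               /* e.g. red-black Gauss-Seidel */
    void residual(double *r, const double *u, const double *f, int n);
    void restrict_to_coarse(double *f_c, const double *r, int n);
    void interpolate_and_correct(double *u, const double *e_c, int n);
    void coarse_solve(double *u, const double *f, int n);

    void vcycle(double **u, double **f, double **r, int *n, int level, int nLevels)
    {
        if (level == nLevels - 1) {                 /* coarsest grid: solve directly */
            coarse_solve(u[level], f[level], n[level]);
            return;
        }
        smooth(u[level], f[level], n[level]);                      /* pre-smooth */
        residual(r[level], u[level], f[level], n[level]);
        restrict_to_coarse(f[level + 1], r[level], n[level]);      /* restrict residual */
        for (int i = 0; i < n[level + 1]; i++) u[level + 1][i] = 0.0;  /* zero the correction */
        vcycle(u, f, r, n, level + 1, nLevels);                    /* recurse */
        interpolate_and_correct(u[level], u[level + 1], n[level]); /* prolong and correct */
        smooth(u[level], f[level], n[level]);                      /* post-smooth */
    }

A full solve would simply call vcycle(u, f, r, n, 0, nLevels) once per V-cycle.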

[Figure: weak-scaling time on Edison; x-axis: cores (2048, 4096, 8192, 16384, 32768); y-axis: time in seconds (0 to 3.5); series: MPI, Bamboo, MPI-nocomm.]

Results:
•  Communication cost: 16%–22%
•  Bamboo improves performance by up to 14%
•  Communication overlap is effective on levels L0 and L1


Table 6.1. Left: execution time in seconds of different modules in the multigrid solver. Right: the relative cost of communication (Comm/total time) at each grid level (the smaller the level, the finer the grid).

    Cores   Comm    Compute  pack/unpack  inter-box copy   L0    L1    L2    L3    L4
    2048    0.448   1.725    0.384        0.191            12%   21%   36%   48%   48%
    4096    0.476   1.722    0.353        0.191            12%   24%   37%   56%   50%
    8192    0.570   1.722    0.384        0.191            13%   27%   45%   69%   63%
    16384   0.535   1.726    0.386        0.192            12%   30%   48%   53%   49%
    32768   0.646   1.714    0.376        0.189            17%   28%   44%   63%   58%

…(weak sense) and that Bamboo improves the performance by up to 14%. These results are promising, given that overlapping communication with computation in a multigrid solver is challenging for the following reasons. First, communication overlap is effective only on the finest grids, where the message sizes are still significant; on coarser levels the messages get smaller and smaller, increasing the relative overhead of virtualization. In addition, when moving from a fine grid to a coarser one, computation shrinks by a factor of 8 whereas communication shrinks by only a factor of 4, reducing the efficiency of the overlapping technique. Furthermore, each processor communicates with its 26 neighbors, so the number of messages is significant, which increases the processing overhead of the runtime system.

The results in Fig. 6.7 were obtained with an incremental optimization, as follows. There are only two levels of buffering in the original source code: i) aggregating small messages into a larger one, and ii) a potential message buffering done by the MPI runtime. Since Bamboo uses active messages (a payload along with header information), it issues an extra data copy from the application data buffer to the message payload and vice versa. In this application such data copies are redundant, since we can use the payload as the application data buffer. We manually […]

Page 15: CSE 262 Lecture 13

A GPU-aware programming model

•  MPI+CUDA programming model
   -  CPU is the host; the GPU works as a device
   -  Host-host communication with MPI, host-device with CUDA
   -  MPI and CUDA portions are optimized separately
•  A GPU-aware programming model
   -  Allow a device to transfer data to another device
   -  Compiler and runtime system handle the data transfer
   -  We implemented a GPU-aware runtime system
   -  Hide both host-device and host-host communication automatically

[Diagram: in the MPI+CUDA model, GPU0 and GPU1 move data via cudaMemcpy to their host MPI processes (MPI0, MPI1), which communicate with Send/Recv. In the GPU-aware model, GPU tasks x and y exchange data through the runtime system (RTS), which handles the communication on top of MPI.]

Page 16: CSE 262 Lecture 13

3D Jacobi – Weak Scaling

•  Results
   -  Bamboo-GPU outperforms MPI-basic
   -  Bamboo-GPU and MPI-olap hide most communication overheads
•  Bamboo-GPU improves performance by
   -  Hiding host-host transfers
   -  Hiding host-device transfers
   -  Having tasks that reside on the same GPU send only the address of the message

[Figure: weak-scaling performance on Stampede; x-axis: GPU count (4, 8, 16, 32); y-axis: GFLOP/s (0 to 1400); series: MPI-basic, MPI-olap, Bamboo, Bamboo-GPU, MPI-nocomm.]

Page 17: CSE 262 Lecture 13

Bamboo Design

•  Core message passing
   -  Supports point-to-point routines
   -  Requires programmer annotation
   -  Employs the Tarragon runtime system [Cicotti '06, '11]
•  Subcommunicator layer
   -  Supports MPI_Comm_split
   -  No annotation required
•  Collectives
   -  A framework to translate collectives
   -  Implements common collectives
   -  No annotation required
•  User-defined subprograms
   -  A normal MPI program

[Diagram: Bamboo implementation of collective routines as a layered stack: user-defined subprograms on top of collectives, the subcommunicator layer, and core message passing.]


Page 18: CSE 262 Lecture 13

Bamboo Translator

[Diagram: the Bamboo translator pipeline. Annotated MPI input passes through the EDG front-end to a ROSE AST. The Bamboo middle-end consists of an annotation handler, MPI extractor, analyzer, and a transformer that performs inlining, outlining, translating, and MPI reordering; it targets the Tarragon runtime. The ROSE back-end and optimizer produce the output code.]


Page 19: CSE 262 Lecture 13

Bamboo Transformations
•  Outlining
   -  TaskGraph definition: fill in various Tarragon methods with the input source code blocks
•  MPI translation: capture MPI calls and generate calls to Tarragon
   -  Some MPI calls are removed, e.g. Barrier(), Wait()
   -  Conservative static analysis determines task dependencies
•  Code reordering: reorder certain code to accommodate Tarragon semantics

 


Page 20: CSE 262 Lecture 13

Code outlining

[Flowchart: the runtime system (RTS) listens for incoming messages and checks "All input ready?". While inputs are missing the task waits; once all inputs are ready the task becomes fireable (state = EXEC), runs its compute block, injects its output data, and repeats until iter == nIters.]

    for (int iter = 0; iter < nIters; iter++) {
        #pragma bamboo olap
        {
            #pragma bamboo send
            { ... }
            #pragma bamboo receive
            { ... }
        }
        compute
    }

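The flowchart's control flow can be summarized in code. This is a generic sketch of the data-driven execution model, not Tarragon's actual API: the runtime keeps listening for messages and only runs a task's compute block once all inputs of the current olap-region have arrived; otherwise the task yields.

    /* Generic sketch of the data-driven execution in the flowchart above
       (hypothetical types and helpers; not the Tarragon API). */
    typedef struct {
        int iter, nIters;
        int inputs_ready;        /* set by the runtime's message handler */
    } Task;

    void run_task(Task *t)
    {
        while (t->iter < t->nIters) {
            if (!t->inputs_ready)
                return;          /* yield: the RTS keeps listening for messages */
            /* state = EXEC: run the outlined compute block for this iteration */
            /* compute(t); */
            /* inject output data (sends) for the next olap-region */
            t->inputs_ready = 0;
            t->iter++;
        }
    }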


Page 21: CSE 262 Lecture 13

Firing & Yielding Rule Generation
•  Extract the source information from all Recv and Irecv calls in each olap-region, including the associated if and for statements
•  A task is fireable if and only if it has received messages from all sources

Example receive loop in an olap-region:

    for (source = 0 to n)
        if (source % 2 == 0)
            MPI_Recv from source

Generated test:

    bool test() {
        for (source = 0 to n)
            if (source % 2 == 0)
                if (notArrivedFrom(source)) return false;
        return true;
    }

Input condition: Recv(0) ∧ Recv(2) ∧ … ∧ Recv(n)

Firing rule: while (messageArrival) { return test(); }
Yielding rule: yield = !test();
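A self-contained version of the generated test for this example might look as follows; the arrived[] array and notArrivedFrom() are hypothetical stand-ins for the runtime's arrival tracking.

    #include <stdbool.h>

    /* Hypothetical arrival tracking: the runtime's message handler would set
       arrived[src] = true when a message from rank src is delivered. */
    #define NSOURCES 16
    static bool arrived[NSOURCES];

    static bool notArrivedFrom(int source) { return !arrived[source]; }

    /* Generated input-condition test for:
       for (source = 0 to n) if (source % 2 == 0) MPI_Recv from source */
    bool test(int n)
    {
        for (int source = 0; source <= n; source++)
            if (source % 2 == 0)
                if (notArrivedFrom(source))
                    return false;
        return true;
    }

    /* Firing rule: re-evaluate test(n) on every message arrival; fire when it holds.
       Yielding rule: yield = !test(n); */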

Page 22: CSE 262 Lecture 13

Inter-procedural translation

    void multiGridSolver() {
        #pragma bamboo olap
        for (int level = 0; level < nLevels; level++) {
            send_to_neighbors();
            receive_from_neighbors();
            // update the data grid
        }
    }

    for (cycle = 0 to nVcycles) { multiGridSolver(); }

    void send_to_neighbors() {
        forall neighbors
            if (neighbor) MPI_Isend(neighbor);
    }

    void receive_from_neighbors() {
        forall neighbors
            if (neighbor) MPI_Irecv(neighbor);
    }

Bamboo inlines only those functions that directly or indirectly contain MPI calls (illustrated below).
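Illustratively, after inlining, the MPI calls appear directly inside the olap-region. This is a sketch in the same pseudocode style as above, not the translator's literal output:

    /* multiGridSolver() after Bamboo's inlining step (illustrative sketch) */
    void multiGridSolver() {
        #pragma bamboo olap
        for (int level = 0; level < nLevels; level++) {
            forall neighbors
                if (neighbor) MPI_Isend(neighbor);   /* from send_to_neighbors() */
            forall neighbors
                if (neighbor) MPI_Irecv(neighbor);   /* from receive_from_neighbors() */
            // update the data grid
        }
    }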

[Diagram: call graph rooted at main; the functions that directly or indirectly invoke MPI routines are inlined into their callers.]


Page 23: CSE 262 Lecture 13

Collective implementation

Main file: the collective call is renamed by the translator:

    error = MPI_Barrier(comm);   ->   Bamboo_Barrier(comm);

Collective library implementation:

    int Bamboo_Barrier(MPI_Comm comm) {
        #pragma bamboo olap
        for (int step = 1; step < size; step <<= 1) {
            MPI_Send(1 byte to (rank + step) % size, comm);
            MPI_Recv(1 byte from (rank - step + size) % size, comm);
        }
        return SUCCESS;
    }

The library's AST is merged into main, then the collective call is inlined:

    comm_0 = comm;
    #pragma bamboo olap
    for (int step = 1; step < size; step <<= 1) {
        MPI_Send(1 byte to (rank + step) % size, comm_0);
        MPI_Recv(1 byte from (rank - step + size) % size, comm_0);
    }
    error = SUCCESS;

The inlined code then goes through the regular translation process.