![Page 1: CSE 262 Lecture 13](https://reader034.vdocuments.us/reader034/viewer/2022050302/626ee20740181f4e7a1b55d8/html5/thumbnails/1.jpg)
CSE 262 Lecture 13
Communication overlap – Continued
Heterogeneous processing
![Page 2: CSE 262 Lecture 13](https://reader034.vdocuments.us/reader034/viewer/2022050302/626ee20740181f4e7a1b55d8/html5/thumbnails/2.jpg)
Announcements
• Final presentations
  – Friday, March 13th, 10:00 AM to 1:00 PM (note the time change)
  – Room 3217, CSE Building (EBU3B)
![Page 3: CSE 262 Lecture 13](https://reader034.vdocuments.us/reader034/viewer/2022050302/626ee20740181f4e7a1b55d8/html5/thumbnails/3.jpg)
An alternative way to hide communication
• Reformulate MPI code into a data-driven form
  – Decouple scheduling and communication handling from the application
  – Automatically overlap communication with computation
(Diagram: SPMD MPI code — Irecv, Send, Wait, Compute — is translated into a task dependency graph executed by a runtime system with worker threads, communication handlers, and dynamic scheduling; a minimal sketch of the SPMD pattern follows below.)
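For reference, the SPMD pattern named in the diagram looks roughly like the following; this is a minimal sketch assuming a 1D decomposition, and the buffer names, neighbor ranks, and tags are illustrative rather than code from the lecture.

```c
/* Minimal sketch of the SPMD pattern in the diagram (Irecv, Send, Wait, Compute).
 * The 1D decomposition, buffer names, and tags are illustrative. */
#include <mpi.h>

void exchange_and_compute(double *u, int nlocal, double *halo, int n,
                          int left, int right, MPI_Comm comm)
{
    MPI_Request req[2];

    /* Irecv: post receives for the two incoming halo faces. */
    MPI_Irecv(&halo[0], n, MPI_DOUBLE, left,  0, comm, &req[0]);
    MPI_Irecv(&halo[n], n, MPI_DOUBLE, right, 1, comm, &req[1]);

    /* Send: ship our boundary faces to the neighbors. */
    MPI_Send(&u[0],          n, MPI_DOUBLE, left,  1, comm);
    MPI_Send(&u[nlocal - n], n, MPI_DOUBLE, right, 0, comm);

    /* Wait: block until both halos have arrived. */
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

    /* Comp: update the local points using u and halo (omitted). */
}
```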
![Page 4: CSE 262 Lecture 13](https://reader034.vdocuments.us/reader034/viewer/2022050302/626ee20740181f4e7a1b55d8/html5/thumbnails/4.jpg)
Bamboo Programming Model
• Olap-regions: task switching points (see the annotated sketch below)
  – Data availability is checked at entry
  – Only one olap region may be active at a time
  – When a task is ready, some olap region's input conditions have been satisfied
• Send blocks
  – Hold send calls only
  – Enable the olap-region
• Receive blocks
  – Hold receive and/or send calls
  – Receive calls are input to the olap-region
  – Send calls are output of the olap-region
• Activities in send blocks must be independent of those in receive blocks
• MPI_Wait/MPI_Waitall can reside anywhere within the olap-region
```c
#pragma bamboo olap
{
    #pragma bamboo send
    { /* … */ }
    #pragma bamboo receive
    { /* … */ }
}

/* … computation … */

#pragma bamboo olap
{ /* … */ }
```

(Figure: execution proceeds through phases φ1, φ2, …, φN, each preceded by an olap region OLAP1 … OLAPN.)
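As a concrete, hypothetical illustration, the halo exchange sketched earlier might be annotated as follows; the buffers, neighbor ranks, and tags are illustrative and not taken from the lecture, while the pragma structure follows the skeleton above (sends in the send block, receives in the receive block, waits anywhere inside the olap region).

```c
/* Hypothetical annotated halo exchange; names and tags are illustrative. */
#include <mpi.h>

void annotated_exchange(double *u, int nlocal, double *halo, int n,
                        int left, int right, MPI_Comm comm)
{
    MPI_Request sreq[2], rreq[2];

    #pragma bamboo olap
    {
        #pragma bamboo send
        {
            MPI_Isend(&u[0],          n, MPI_DOUBLE, left,  1, comm, &sreq[0]);
            MPI_Isend(&u[nlocal - n], n, MPI_DOUBLE, right, 0, comm, &sreq[1]);
        }
        #pragma bamboo receive
        {
            MPI_Irecv(&halo[0], n, MPI_DOUBLE, left,  0, comm, &rreq[0]);
            MPI_Irecv(&halo[n], n, MPI_DOUBLE, right, 1, comm, &rreq[1]);
        }
        MPI_Waitall(2, sreq, MPI_STATUSES_IGNORE);
        MPI_Waitall(2, rreq, MPI_STATUSES_IGNORE);
    }
    /* Compute: update local points using u and halo (omitted). */
}
```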
![Page 5: CSE 262 Lecture 13](https://reader034.vdocuments.us/reader034/viewer/2022050302/626ee20740181f4e7a1b55d8/html5/thumbnails/5.jpg)
Results
• Stampede at TACC
  – 102,400 cores; dual-socket Sandy Bridge processors
  – K20 GPUs
• Cray XE-6 at NERSC (Hopper)
  – 153,216 cores; dual-socket 12-core Magny-Cours
  – 4 NUMA nodes per Hopper node, each with 6 cores
  – 3D toroidal network
• Cray XC30 at NERSC (Edison)
  – 133,824 cores; dual-socket 12-core Ivy Bridge
  – Dragonfly network
![Page 6: CSE 262 Lecture 13](https://reader034.vdocuments.us/reader034/viewer/2022050302/626ee20740181f4e7a1b55d8/html5/thumbnails/6.jpg)
Stencil application performance (Hopper)
• Solve the 3D Laplace equation Δu = 0 with Dirichlet BCs (u = f on ∂Ω), 7-point stencil (kernel sketched below), N = 3072³
• Added 4 Bamboo pragmas to a 419-line MPI code

(Plot: TFLOP/s vs. core count — 12288, 24576, 49152, 98304 — for MPI-basic, MPI-olap, Bamboo-basic, and MPI-nocomm.)
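As a reminder of the kernel being measured, one Jacobi sweep of the 7-point stencil for Δu = 0 looks roughly like this; the array layout and loop bounds are illustrative, not the benchmark's actual code.

```c
/* One Jacobi sweep of the 7-point Laplace stencil on an (n+2)^3 array with a
 * one-cell ghost/boundary layer. Illustrative layout only. */
#define IDX(i, j, k, n) ((((i) * ((n) + 2)) + (j)) * ((n) + 2) + (k))

void jacobi_sweep(const double *u, double *unew, int n)
{
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= n; j++)
            for (int k = 1; k <= n; k++)
                unew[IDX(i, j, k, n)] =
                    (u[IDX(i - 1, j, k, n)] + u[IDX(i + 1, j, k, n)] +
                     u[IDX(i, j - 1, k, n)] + u[IDX(i, j + 1, k, n)] +
                     u[IDX(i, j, k - 1, n)] + u[IDX(i, j, k + 1, n)]) / 6.0;
}
```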
![Page 7: CSE 262 Lecture 13](https://reader034.vdocuments.us/reader034/viewer/2022050302/626ee20740181f4e7a1b55d8/html5/thumbnails/7.jpg)
2D Cannon – Weak Scaling Study
• Communication cost: 11%–39%
• Bamboo improves MPI-basic by 9%–37%
• Bamboo outperforms MPI-olap at scale

(Plot, Edison: TFLOP/s vs. core count — 4096, 16384, 65536 — for MPI-basic, MPI-olap, Bamboo, and MPI-nocomm; matrix sizes N0/4^(2/3), N0/4^(1/3), N = N0 = 196608.)
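For reference, a generic sketch of 2D Cannon's algorithm is shown below: skew A by the row coordinate and B by the column coordinate, then repeatedly multiply local blocks and shift A left and B up by one. This is not the course's implementation; it assumes a square process grid, square nb×nb local blocks, and a stand-in local multiply.

```c
/* Generic 2D Cannon sketch on a q x q process grid (assumes P = q*q). */
#include <mpi.h>
#include <math.h>

static void dgemm_local(const double *A, const double *B, double *C, int nb)
{
    /* C += A * B on nb x nb row-major blocks (naive local multiply). */
    for (int i = 0; i < nb; i++)
        for (int k = 0; k < nb; k++)
            for (int j = 0; j < nb; j++)
                C[i * nb + j] += A[i * nb + k] * B[k * nb + j];
}

void cannon(double *A, double *B, double *C, int nb, MPI_Comm comm)
{
    int P, rank, q, src, dst;
    int dims[2], periods[2] = {1, 1}, coords[2];
    MPI_Comm grid;

    MPI_Comm_size(comm, &P);
    q = (int)round(sqrt((double)P));            /* assumes P is a perfect square */
    dims[0] = dims[1] = q;
    MPI_Cart_create(comm, 2, dims, periods, 0, &grid);
    MPI_Comm_rank(grid, &rank);
    MPI_Cart_coords(grid, rank, 2, coords);

    /* Initial skew: shift A left by the row coordinate, B up by the column coordinate. */
    MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
    MPI_Sendrecv_replace(A, nb * nb, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
    MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
    MPI_Sendrecv_replace(B, nb * nb, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);

    for (int step = 0; step < q; step++) {
        dgemm_local(A, B, C, nb);                 /* C += A * B                */
        MPI_Cart_shift(grid, 1, -1, &src, &dst);  /* shift A one step left     */
        MPI_Sendrecv_replace(A, nb * nb, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
        MPI_Cart_shift(grid, 0, -1, &src, &dst);  /* shift B one step up       */
        MPI_Sendrecv_replace(B, nb * nb, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
    }
    MPI_Comm_free(&grid);
}
```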
![Page 8: CSE 262 Lecture 13](https://reader034.vdocuments.us/reader034/viewer/2022050302/626ee20740181f4e7a1b55d8/html5/thumbnails/8.jpg)
Communication Avoiding Matrix Multiplication (Hopper)
• Pathological matrices from plane-wave basis methods for ab-initio molecular dynamics (Ng³ × Ne); for Si: Ng = 140, Ne = 2000
• Weak scaling study; used OpenMP; 23 pragmas in 337 lines

(Plot: TFLOP/s vs. core count — 4096, 8192, 16384, 32768 — for MPI+OMP, MPI+OMP-olap, Bamboo+OMP, and MPI+OMP-nocomm; matrix sizes N = N0 = 20608, 2^(1/3)·N0, 2^(2/3)·N0, 2·N0; best point annotated 210.4 TF.)
![Page 9: CSE 262 Lecture 13](https://reader034.vdocuments.us/reader034/viewer/2022050302/626ee20740181f4e7a1b55d8/html5/thumbnails/9.jpg)
Virtualization Improves Performance
(Plots: performance normalized to virtualization factor 1, vs. virtualization factor (1, 2, 4, 8) — Jacobi (MPI+OMP) and Cannon 2.5D (MPI) with configurations c=2, VF=8; c=2, VF=4; c=2, VF=2; c=4, VF=2; series include Bamboo (VP1) and Bamboo+OMP (VP2).)
![Page 10: CSE 262 Lecture 13](https://reader034.vdocuments.us/reader034/viewer/2022050302/626ee20740181f4e7a1b55d8/html5/thumbnails/10.jpg)
Virtualization Improves Performance
(Plots: normalized performance vs. virtualization factor — Jacobi (MPI+OMP), and the effect of virtualization in 2D Cannon (MPI) at 1024, 4096, and 16384 cores.)
![Page 11: CSE 262 Lecture 13](https://reader034.vdocuments.us/reader034/viewer/2022050302/626ee20740181f4e7a1b55d8/html5/thumbnails/11.jpg)
High Performance Linpack (HPL) on Stampede
(Plot: TFLOP/s vs. matrix size — 139264 to 180224 — for Basic, Unprioritized Scheduling, Prioritized Scheduling, and Olap.)
• Solve systems of linear equations using LU factorization
• Latency-tolerant lookahead code is complicated
• Run on 2048 cores of Stampede

(Figure: blocked LU factorization.)

• Results
  – Bamboo meets the performance of the highly optimized version of HPL
  – Uses the far simpler non-lookahead version
  – Task prioritization is crucial
  – Bamboo improves the baseline version of HPL by up to 10%
![Page 12: CSE 262 Lecture 13](https://reader034.vdocuments.us/reader034/viewer/2022050302/626ee20740181f4e7a1b55d8/html5/thumbnails/12.jpg)
Bamboo on multiple GPUs
• MPI+CUDA programming model (sketched below)
  – CPU is the host and the GPU works as a device
  – Host-host with MPI and host-device with CUDA
  – Optimize MPI and CUDA portions separately
• Need a GPU-aware programming model
  – Allow a device to transfer data to another device
  – Compiler and runtime system handle the data transfer
  – Hide both host-device and host-host communication automatically

(Diagrams: MPI+CUDA — MPI0 and MPI1 exchange with Send/recv while GPU0 and GPU1 move data to and from their hosts via cudaMemcpy; GPU-aware model — the runtime system (RTS) handles communication between GPU tasks x and y.)
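A minimal sketch of the conventional MPI+CUDA pattern in the left diagram: stage the data through the host with cudaMemcpy, exchange it with MPI, and copy the result back to the device. Buffer names, sizes, and the single-neighbor exchange are illustrative assumptions, not code from the lecture.

```c
/* Conventional MPI+CUDA exchange: device -> host, MPI send/recv, host -> device. */
#include <mpi.h>
#include <cuda_runtime.h>

void exchange_with_neighbor(double *d_send, double *d_recv,
                            double *h_send, double *h_recv,
                            int n, int neighbor, MPI_Comm comm)
{
    /* Host-device: copy the boundary values off the GPU. */
    cudaMemcpy(h_send, d_send, n * sizeof(double), cudaMemcpyDeviceToHost);

    /* Host-host: exchange with the neighboring MPI rank. */
    MPI_Sendrecv(h_send, n, MPI_DOUBLE, neighbor, 0,
                 h_recv, n, MPI_DOUBLE, neighbor, 0, comm, MPI_STATUS_IGNORE);

    /* Host-device: push the received halo back onto the GPU. */
    cudaMemcpy(d_recv, h_recv, n * sizeof(double), cudaMemcpyHostToDevice);
}
```

A GPU-aware model, as in the right diagram, lets the runtime perform both stages so they can be overlapped with computation instead of being optimized separately.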
![Page 13: CSE 262 Lecture 13](https://reader034.vdocuments.us/reader034/viewer/2022050302/626ee20740181f4e7a1b55d8/html5/thumbnails/13.jpg)
3D Jacobi – Weak Scaling Study
• Results on Stampede
  – Bamboo-GPU outperforms MPI-basic
  – Bamboo-GPU and MPI-olap hide most communication overheads
• Bamboo-GPU improves performance by
  – Hiding host-host transfers
  – Hiding host-device transfers
  – Tasks residing on the same GPU send the address of the message

(Plot, Stampede: GFLOP/s vs. GPU count — 4, 8, 16, 32 — for MPI-basic, MPI-olap, Bamboo, Bamboo-GPU, and MPI-nocomm.)
![Page 14: CSE 262 Lecture 13](https://reader034.vdocuments.us/reader034/viewer/2022050302/626ee20740181f4e7a1b55d8/html5/thumbnails/14.jpg)
Multigrid – Weak Scaling
• A geometric multigrid solver for Helmholtz's equation [Williams et al. 12]
  – V-cycle: restrict, smooth, solve, interpolate, smooth (see the sketch below)
  – Smooth: Red-Black Gauss-Seidel
  – DRAM avoiding with the wave-front method
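A hedged sketch of a V-cycle built from the operations named above, with the grid operators left as stubs; this is generic textbook multigrid (pre-smooth, restrict, recurse, interpolate, post-smooth), not the solver from [Williams et al. 12].

```c
/* Generic V-cycle skeleton; Grid and the operators are hypothetical stubs. */
typedef struct Grid Grid;

void  smooth(Grid *g, int sweeps);          /* e.g., Red-Black Gauss-Seidel       */
void  restrict_to(Grid *fine, Grid *coarse);/* restrict the residual downward     */
void  interpolate_add(Grid *coarse, Grid *fine); /* add the correction back up    */
void  coarse_solve(Grid *g);                /* direct solve on the coarsest grid  */
Grid *coarser(Grid *g);                     /* next (coarser) level, NULL at bottom */

void vcycle(Grid *g)
{
    if (coarser(g) == NULL) {       /* coarsest level: solve directly */
        coarse_solve(g);
        return;
    }
    smooth(g, 2);                    /* pre-smoothing                       */
    restrict_to(g, coarser(g));      /* restrict residual to coarser grid   */
    vcycle(coarser(g));              /* recurse                             */
    interpolate_add(coarser(g), g);  /* interpolate correction back up      */
    smooth(g, 2);                    /* post-smoothing                      */
}
```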
(Plot, Edison: time in seconds vs. core count — 2048, 4096, 8192, 16384, 32768 — for MPI, Bamboo, and MPI-nocomm.)

Results:
• Communication cost: 16%–22%
• Bamboo improves the performance by up to 14%
• Communication overlap is effective on levels L0 and L1
Table 6.1. Left: execution time in seconds of different modules in the multigrid solver. Right: the relative cost of communication (comm/total time) at each grid level (the smaller the level, the finer the grid).

| Cores | Comm  | Compute | pack/unpack | inter-box copy | L0  | L1  | L2  | L3  | L4  |
|-------|-------|---------|-------------|----------------|-----|-----|-----|-----|-----|
| 2048  | 0.448 | 1.725   | 0.384       | 0.191          | 12% | 21% | 36% | 48% | 48% |
| 4096  | 0.476 | 1.722   | 0.353       | 0.191          | 12% | 24% | 37% | 56% | 50% |
| 8192  | 0.570 | 1.722   | 0.384       | 0.191          | 13% | 27% | 45% | 69% | 63% |
| 16384 | 0.535 | 1.726   | 0.386       | 0.192          | 12% | 30% | 48% | 53% | 49% |
| 32768 | 0.646 | 1.714   | 0.376       | 0.189          | 17% | 28% | 44% | 63% | 58% |
… weak sense) and that Bamboo improves the performance by up to 14%. These results are promising, given that overlapping communication with computation in a multigrid solver is challenging for the following reasons. First, overlap is effective only at the finest grid levels, where messages are still large; on coarser levels the messages get smaller and smaller, increasing the relative overhead of virtualization. In addition, when moving from a fine grid to a coarser one, computation shrinks by a factor of 8 whereas communication shrinks by only a factor of 4, reducing the efficiency of the overlapping technique. Furthermore, the number of messages is significant, since each process exchanges messages with its 26 neighbors; this increases the processing overhead of the runtime system.
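To see where the factors of 8 and 4 come from, assume (for illustration) that each process owns an $n^3$ subgrid and exchanges one-cell-deep faces with its neighbors. Coarsening halves $n$, so

$$\frac{\text{compute on the coarser level}}{\text{compute on the finer level}} = \frac{(n/2)^3}{n^3} = \frac{1}{8}, \qquad \frac{\text{halo volume on the coarser level}}{\text{halo volume on the finer level}} = \frac{6\,(n/2)^2}{6\,n^2} = \frac{1}{4},$$

so the communication-to-computation ratio doubles at every coarser level, leaving less local work behind which each message can be hidden.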
The results in Fig. 6.7 were obtained with an incremental optimization as follows. There are only two levels of buffering in the original source code: i) aggregating small messages into a larger one, and ii) a potential message buffering done by the MPI runtime. Since Bamboo uses active messages², it issues an extra data copy from the application data buffer to the message payload and vice versa. In this application, such data copies are redundant, since we can use the payload as an application data buffer. We manually …

² An active message carries a payload along with header information.
![Page 15: CSE 262 Lecture 13](https://reader034.vdocuments.us/reader034/viewer/2022050302/626ee20740181f4e7a1b55d8/html5/thumbnails/15.jpg)
A GPU-aware programming model
• MPI+CUDA programming model
  – CPU is the host and the GPU works as a device
  – Host-host with MPI and host-device with CUDA
  – Optimize MPI and CUDA portions separately
• A GPU-aware programming model
  – Allow a device to transfer data to another device
  – Compiler and runtime system handle the data transfer
  – We implemented a GPU-aware runtime system
  – Hide both host-device and host-host communication automatically

(Diagrams, as on the earlier slide: MPI+CUDA — Send/recv between MPI0 and MPI1, cudaMemcpy between host and device; GPU-aware — the RTS handles communication between GPU tasks.)
![Page 16: CSE 262 Lecture 13](https://reader034.vdocuments.us/reader034/viewer/2022050302/626ee20740181f4e7a1b55d8/html5/thumbnails/16.jpg)
3D Jacobi – Weak Scaling
• Results
  – Bamboo-GPU outperforms MPI-basic
  – Bamboo-GPU and MPI-olap hide most communication overheads
• Bamboo-GPU improves performance by
  – Hiding host-host transfers
  – Hiding host-device transfers
  – Tasks residing on the same GPU send the address of the message

(Plot, Stampede: GFLOP/s vs. GPU count — 4, 8, 16, 32 — for MPI-basic, MPI-olap, Bamboo, Bamboo-GPU, and MPI-nocomm.)
![Page 17: CSE 262 Lecture 13](https://reader034.vdocuments.us/reader034/viewer/2022050302/626ee20740181f4e7a1b55d8/html5/thumbnails/17.jpg)
Bamboo Design
• Core message passing u Support point-to-point routines u Require programmer annotation u Employ Tarragon runtime system [Cicotti
06, 11] • Subcommunicator layer
u Support MPI_Comm_split u No annotation required
• Collectives u A framework to translate collectives u Implement common collectives u No annotation required
• User-defined subprograms u A normal MPI program
Bamboo implementation of collective routines
Collective
Subcommunicator
Core message passing
User-defined subprograms
Collectives
![Page 18: CSE 262 Lecture 13](https://reader034.vdocuments.us/reader034/viewer/2022050302/626ee20740181f4e7a1b55d8/html5/thumbnails/18.jpg)
Bamboo Translator
(Diagram: translator pipeline — annotated MPI input enters the EDG front-end, which builds a ROSE AST; the Bamboo middle-end (annotation handler, analyzer, MPI extractor, and a transformer performing inlining, outlining, translating, and MPI reordering) rewrites the AST; the optimizer and ROSE back-end then emit Tarragon code.)
![Page 19: CSE 262 Lecture 13](https://reader034.vdocuments.us/reader034/viewer/2022050302/626ee20740181f4e7a1b55d8/html5/thumbnails/19.jpg)
Bamboo Transformations
• Outlining
  – TaskGraph definition: fill various Tarragon methods with the input source-code blocks
• MPI translation: capture MPI calls and generate calls to Tarragon
  – Some MPI calls are removed, e.g. Barrier(), Wait()
  – Conservative static analysis to determine task dependencies
• Code reordering: reorder certain code to accommodate Tarragon semantics
![Page 20: CSE 262 Lecture 13](https://reader034.vdocuments.us/reader034/viewer/2022050302/626ee20740181f4e7a1b55d8/html5/thumbnails/20.jpg)
Code outlining
(Flowchart: the runtime system (RTS) listens for incoming messages and injects data into the task; when all input is ready the task becomes fireable (state = EXEC), computes, and outputs data, repeating until iter = nIters.)

```c
for (int iter = 0; iter < nIters; iter++) {
    #pragma bamboo olap
    {
        #pragma bamboo send
        { /* … */ }
        #pragma bamboo receive
        { /* … */ }
    }
    /* compute */
}
```
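To make "outlining" concrete, the loop body above roughly becomes a task whose firing test and execute step are driven by the RTS. The sketch below is purely schematic, with made-up names, and does not show Tarragon's actual interface.

```c
/* Purely schematic outlined task; NOT Tarragon's actual interface. */
typedef struct {
    int iter, nIters;
    int halos_arrived;   /* set by the RTS when incoming messages are injected */
} JacobiTask;

/* Firing rule: the RTS evaluates this as messages arrive. */
int jacobi_task_ready(const JacobiTask *t) { return t->halos_arrived; }

/* Execute: one outlined iteration (the former loop body). */
void jacobi_task_execute(JacobiTask *t)
{
    /* ... consume received halos, post sends for the next iteration ... */
    /* ... compute ...                                                   */
    t->halos_arrived = 0;
    t->iter++;           /* the task completes when iter == nIters */
}
```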
![Page 21: CSE 262 Lecture 13](https://reader034.vdocuments.us/reader034/viewer/2022050302/626ee20740181f4e7a1b55d8/html5/thumbnails/21.jpg)
Firing & Yielding Rule Generation
• Extract source information from all Recv and Irecv calls of each olap-region, including associated if and for statements
• A task is fireable if and only if it has received messages from all sources

Input code:

```c
for (source = 0; source <= n; source++)
    if (source % 2 == 0)
        MPI_Recv from source;   /* pseudocode */
```

Generated test — the task must have heard from every even source, i.e. Recv(0) ∧ Recv(2) ∧ … ∧ Recv(n):

```c
bool test() {
    for (source = 0; source <= n; source++)
        if (source % 2 == 0)
            if (notArrivedFrom(source))
                return false;
    return true;
}
```

Firing rule: while (messageArrival) { return test(); }
Yielding rule: yield = !test();
![Page 22: CSE 262 Lecture 13](https://reader034.vdocuments.us/reader034/viewer/2022050302/626ee20740181f4e7a1b55d8/html5/thumbnails/22.jpg)
Inter-procedural translation
```c
void multiGridSolver() {
    #pragma bamboo olap
    for (int level = 0; level < nLevels; level++) {
        send_to_neighbors();
        receive_from_neighbors();
        /* update the data grid */
    }
}

/* caller (in main): */
for (cycle = 0; cycle < nVcycles; cycle++) {
    multiGridSolver();
}

void send_to_neighbors() {
    forall neighbors
        if (neighbor) MPI_Isend(neighbor);   /* pseudocode */
}

void receive_from_neighbors() {
    forall neighbors
        if (neighbor) MPI_Irecv(neighbor);   /* pseudocode */
}
```

Bamboo inlines only those functions that directly or indirectly contain MPI calls.

(Diagram: call graph — main invokes multiGridSolver, which invokes send_to_neighbors and receive_from_neighbors; the MPI-holding callees are inlined into their caller.)
![Page 23: CSE 262 Lecture 13](https://reader034.vdocuments.us/reader034/viewer/2022050302/626ee20740181f4e7a1b55d8/html5/thumbnails/23.jpg)
Collective implementation
Main file — the collective call is renamed:

```c
error = MPI_Barrier(comm);       /* original call             */
error = Bamboo_Barrier(comm);    /* renamed by the translator */
```

Collective library (message arguments are pseudocode):

```c
int Bamboo_Barrier(MPI_Comm comm) {
    #pragma bamboo olap
    for (int step = 1; step < size; step <<= 1) {
        MPI_Send(1 byte to (rank + step) % size, comm);
        MPI_Recv(1 byte from (rank - step + size) % size, comm);
    }
    return SUCCESS;
}
```

The library's AST is merged into main and the collective call is inlined:

```c
comm_0 = comm;
#pragma bamboo olap
for (int step = 1; step < size; step <<= 1) {
    MPI_Send(1 byte to (rank + step) % size, comm_0);
    MPI_Recv(1 byte from (rank - step + size) % size, comm_0);
}
error = SUCCESS;
```

The inlined code then proceeds through the normal translation process.