Scalable Heterogeneous Computing: Accelerating MPI & I/O on CPU+GPU Nodes
Sadaf Alam, Swiss National Supercomputing Centre, SOS16 Workshop
Collaborators & contributors
• DK Panda, Devendar Bureddy (Ohio State University)
• Antonio J. Peña (Technical University of Valencia)
• Oliver Fuhrer (MeteoSwiss)
2
Characteristics of Internode GPU Communication
3
[Figure: internode GPU communication path: GPU connected to CPU over PCIe on each node, the on-node interconnect, the off-node interconnect, and a GPU-GPU transfer abstraction spanning them]
1. Copy data from GPU to CPU (device to host)
2. MPI message transfer between two CPUs (host to host MPI)
3. Copy data from CPU to GPU (host to device)
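Without GPU-aware MPI support, an application performs these three steps by hand. A minimal sketch of the staged path, assuming CUDA and a standard MPI library (buffer names and the use of doubles are illustrative):

    /* Staged GPU-to-GPU transfer over MPI: device-to-host copy,
     * host-to-host MPI message, then host-to-device copy on the
     * receiver. Illustrative sketch; error checking omitted. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    void staged_exchange(double *d_buf, double *h_buf, size_t n,
                         int peer, int rank, MPI_Comm comm)
    {
        if (rank == 0) {
            /* Step 1: copy data from GPU to CPU (device to host) */
            cudaMemcpy(h_buf, d_buf, n * sizeof(double), cudaMemcpyDeviceToHost);
            /* Step 2: MPI message transfer between two CPUs (host to host) */
            MPI_Send(h_buf, (int)n, MPI_DOUBLE, peer, 0, comm);
        } else {
            MPI_Recv(h_buf, (int)n, MPI_DOUBLE, peer, 0, comm, MPI_STATUS_IGNORE);
            /* Step 3: copy data from CPU to GPU (host to device) */
            cudaMemcpy(d_buf, h_buf, n * sizeof(double), cudaMemcpyHostToDevice);
        }
    }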
Motivating Examples
• GPU-accelerated weather modeling application (COSMO)
– Halo Exchanges between GPUs (on-node and off-node)
– Different halo sizes
– Goal: a single-source, auto-tuned code base for platforms with diverse CPU/GPU node configurations
• GPU virtualization (rCUDA)
– Solution for small to mid-size heterogeneous clusters
– Scheduling on remote GPU without performance penalties
– Goal: define a cost model for scheduling decisions through detailed characterization of GPU-GPU data transfers
4
Bandwidth and Latency Considerations on Homogeneous Multi-core Systems
5
CPU-CPU Transfer Abstraction with Non-Uniform Memory Access (NUMA) considerations
[Figure: two nodes (Node 0, Node 1), each with four NUMA domains (NUMA0-NUMA3) and NI buffers, connected by the interconnect]
Empirical Results (One Dual-socket Interlagos Node)
6
[Figure: core numbering of the dual-socket Interlagos node by NUMA domain: NUMA 0 = cores 0-7, NUMA 1 = cores 8-15, NUMA 2 = cores 16-23, NUMA 3 = cores 24-31, with measured bandwidth curves]
Bandwidth on the Cray XE6 System with Interlagos Processors
7
[Plots: MPI bandwidth between the two NUMA domains closest to the NI, and between the two NUMA domains farthest from the NI]
Bandwidth on the Cray XE6 System with Interlagos Processors Using the Low-level Communication API (uGNI) and MPI
8
[Plots: annotated "Benefit of tuning for NUMA0" and "No benefit of tuning for NUMA2"]
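The tuning in question is essentially NUMA-aware process placement. A Linux-specific sketch of binding the calling rank to one NUMA domain, assuming the 8-cores-per-domain Interlagos layout shown on slide 6 (the helper name is illustrative):

    /* Pin the calling process to the cores of one NUMA domain.
     * Core numbering follows the Interlagos layout on slide 6:
     * NUMA 0 = cores 0-7, NUMA 1 = 8-15, NUMA 2 = 16-23, NUMA 3 = 24-31. */
    #define _GNU_SOURCE
    #include <sched.h>

    int bind_to_numa_domain(int domain)        /* domain in 0..3 */
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        for (int core = domain * 8; core < (domain + 1) * 8; core++)
            CPU_SET(core, &set);
        return sched_setaffinity(0, sizeof(set), &set);  /* 0 = self */
    }

Calling this early in each rank, with the domain chosen from the rank number, reproduces the tuned placement; on Cray systems the same effect is typically achieved with aprun's CPU-affinity options.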
Latency and Bandwidth: CPU & Accelerators
9
[Figure: two nodes (Node 0, Node 1), each with four NUMA domains (NUMA0-NUMA3), a network interface (NI), and two GPUs (GPU0, GPU1), linked by the interconnect]
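The host-side latencies and bandwidths in this picture can be measured directly with CUDA event timers. A small sketch, to be called once with a pageable (malloc) source buffer and once with a pinned (cudaMallocHost) one to expose the difference the next slide models:

    /* Time a host-to-device copy with CUDA events. Run with both
     * pageable and pinned source buffers to compare transfer costs. */
    #include <cuda_runtime.h>

    float time_h2d_copy(void *h_src, void *d_dst, size_t bytes)
    {
        cudaEvent_t start, stop;
        float ms;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start, 0);
        cudaMemcpy(d_dst, h_src, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&ms, start, stop);   /* milliseconds */
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }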
Implementation Scenarios
10
Naïve implementation (pageable host memory):
$\mathrm{Latency}_{total} = L_{\mathrm{CPU}\to\mathrm{GPU}}^{pageable} + L_{\mathrm{GPU}\to\mathrm{CPU}}^{pageable} + L_{\mathrm{CPU}\to\mathrm{CPU}}^{MPI}$

Typical implementation (pinned host memory):
$\mathrm{Latency}_{total} = L_{\mathrm{CPU}\to\mathrm{GPU}}^{pinned} + L_{\mathrm{GPU}\to\mathrm{CPU}}^{pinned} + L_{\mathrm{CPU}\to\mathrm{CPU}}^{MPI}$

Advanced implementation (each stage individually tuned):
$\mathrm{Latency}_{total} = \min\!\big(L_{\mathrm{CPU}\to\mathrm{GPU}}^{pinned}\big) + \min\!\big(L_{\mathrm{GPU}\to\mathrm{CPU}}^{pinned}\big) + \min\!\big(L_{\mathrm{CPU}\to\mathrm{CPU}}^{MPI}\big)$

Expert implementation (stages overlapped, so the slowest stage dominates):
$\mathrm{Latency}_{total} = \max\!\Big(\min\!\big(L_{\mathrm{CPU}\to\mathrm{GPU}}\big),\ \min\!\big(L_{\mathrm{GPU}\to\mathrm{CPU}}\big),\ \min\!\big(L_{\mathrm{CPU}\to\mathrm{CPU}}^{MPI}\big)\Big)$
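The advanced and expert scenarios get below the naive sum by chunking the message and overlapping the device-to-host copy of one chunk with the MPI transfer of the previous one. A simplified sender-side sketch, assuming pinned host buffers and a CUDA stream (the chunk count and names are illustrative):

    /* Sender side of a pipelined GPU-to-remote transfer: the MPI send
     * of chunk i overlaps with the D2H copy of chunk i+1. Pinned host
     * memory (cudaMallocHost) is assumed for the async copies. */
    #include <mpi.h>
    #include <cuda_runtime.h>

    #define NCHUNKS 4

    void pipelined_send(const char *d_buf, char *h_buf, size_t bytes,
                        int peer, MPI_Comm comm, cudaStream_t stream)
    {
        size_t chunk = bytes / NCHUNKS;        /* assume divisible */
        MPI_Request req[NCHUNKS];

        for (int i = 0; i < NCHUNKS; i++) {
            cudaMemcpyAsync(h_buf + i * chunk, d_buf + i * chunk, chunk,
                            cudaMemcpyDeviceToHost, stream);
            cudaStreamSynchronize(stream);     /* chunk i now on the host */
            MPI_Isend(h_buf + i * chunk, (int)chunk, MPI_CHAR, peer, i,
                      comm, &req[i]);          /* proceeds during next copy */
        }
        MPI_Waitall(NCHUNKS, req, MPI_STATUSES_IGNORE);
    }

MVAPICH2-GPU performs this kind of pipelining inside the library, which is what the later slides exploit.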
Bandwidth Between Two GPU Nodes on the Cray XK6 Platform
11
[Plot: staged path (CUDA device-to-host copy, MPI node-to-node, CUDA host-to-device copy) vs. direct uGNI GPU-GPU transfers]
Bandwidth Between Two GPU Nodes on the Cray XK6 (Tödi) Platform
12
[Plot: tuning using different packet sizes]
Advances in CUDA and InfiniBand Drivers
• Enabling of shared access to device memories
• Uniform virtual addressing
• Peer to peer support
• GPU and CPU bindings
• An optimal implementation exploits all of the above, BUT applications shouldn't need a rewrite every time a new feature is introduced (see the sketch below)
13
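Unified virtual addressing is what allows an MPI library to distinguish host from device pointers without any change to the MPI interface. A sketch of such a check, assuming CUDA 4.0 or later; the helper is illustrative of what a GPU-aware library does internally, not MVAPICH2's actual code:

    /* With unified virtual addressing, a library can inspect any
     * pointer and take a GPU-aware path only for device memory. */
    #include <cuda_runtime.h>

    int is_device_pointer(const void *ptr)
    {
        struct cudaPointerAttributes attr;
        if (cudaPointerGetAttributes(&attr, ptr) != cudaSuccess) {
            cudaGetLastError();   /* clear the error raised for plain host pointers */
            return 0;             /* treat as host memory */
        }
        return attr.memoryType == cudaMemoryTypeDevice;
    }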
Halo Exchange Code Implementation with MVAPICH2-GPU
14
Single code base for all configurations
[Figure: halo exchanges between GPUs on the same I/O hub, on the same node, and across nodes (Node 0, Node 1)]
Implementation using MVAPICH2-GPU
15
[Figure: per-time-step schematic comparing the Naive, Advanced, and D2D implementations: buffer initialization and final copy on the host; copy buffers to and from the host for each time step; computation on the device; non-blocking sends and receives. Advanced: lower memory & fewer lines of code. D2D: least host memory & fewest lines of code]
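With a GPU-aware MPI such as MVAPICH2-GPU, the host staging disappears from application code: the non-blocking sends and receives take device pointers directly and the library chooses the transfer path. A sketch of one halo-exchange step under that assumption (rank and buffer names are illustrative):

    /* One halo-exchange step with a GPU-aware MPI: device buffers are
     * passed straight to MPI_Isend/MPI_Irecv; the library stages or
     * uses peer-to-peer/RDMA internally. */
    #include <mpi.h>

    void halo_step(double *d_send_lo, double *d_send_hi,
                   double *d_recv_lo, double *d_recv_hi, int n,
                   int lo_rank, int hi_rank, MPI_Comm comm)
    {
        MPI_Request req[4];
        MPI_Irecv(d_recv_lo, n, MPI_DOUBLE, lo_rank, 0, comm, &req[0]);
        MPI_Irecv(d_recv_hi, n, MPI_DOUBLE, hi_rank, 1, comm, &req[1]);
        MPI_Isend(d_send_lo, n, MPI_DOUBLE, lo_rank, 1, comm, &req[2]);
        MPI_Isend(d_send_hi, n, MPI_DOUBLE, hi_rank, 0, comm, &req[3]);
        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    }

Because the same calls are valid for host pointers, this is the single code base for all configurations referred to on the previous slide.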
Results (MPI + GPU Halo Exchange)
• On a dual-socket, dual-GPU, IB QDR system
16
[Four plots: time per step (s) vs. MPI tasks / GPU devices (2-16), comparing the Naive, Adv, and D2D implementations at halo sizes of 248 KB, 1016 KB, 4088 KB, and 16376 KB]
Results (MPI + GPU Halo Exchange)
• On an 8-GPU, single-node system with 2 I/O hubs
17
[Four plots: time per step (s) vs. MPI tasks / GPU devices (2-8), comparing the Naive, Adv, and D2D implementations at halo sizes of 248 KB, 1016 KB, 4088 KB, and 16376 KB]
Results (MPI + GPU Halo Exchange)
• Without MVAPICH2-GPU
18
[Four plots: time per step (s) vs. MPI tasks / GPU devices (2-8), comparing the Naive, Adv, and D2D implementations plus a fourth MPI reference series, at halo sizes of 248 KB, 1016 KB, 4088 KB, and 16376 KB]
Shortcoming of Current Solution
• Not a portable solution
– Only a single MPI library implementation
– Dependency on a given runtime & driver
– Network hardware dependency
• Interoperability constraints
– Emerging, directive-based programming
– MPI standards conformance
19
Future Plans & Thoughts
• Move additional computation to the GPU to minimize transfers
– Towards a self-hosted implementation
– Reduction in host memory requirements
• Encourage middleware developers to provide similar solutions
– OpenCL support
– Multiple accelerator support
– Multiple MPI implementation support (discussions in the MPI Forum)
• Explore interoperability with emerging programming models for accelerators
20